6. Combining “big” and “real-time” is hard
Live & historical aggregates...
Trends...
Drill downs and roll ups
Analytics
7. Solution Con
Scalability
$$$
Not realtime
Spartan query semantics => complex, DIY solutions
8. • Motivation / alternatives
• What is it?
• How does it work?
• Approximate Analytics
• What's it good for?
9. Acunu Analytics
Click stream events, sensor data, etc. → counter updates → Acunu Analytics
• Aggregate incrementally, on the fly
• Store live + historical aggregates
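A minimal Python sketch of what "aggregate incrementally" could look like. It assumes COUNT and AVG are kept as a running count and running sum per group key, so each event is a constant-time update and raw events never need rescanning. The class and method names are hypothetical and are not Acunu's API.

```python
from collections import defaultdict


class RunningAggregate:
    """Keeps COUNT and AVG per group key, updated one event at a time."""

    def __init__(self):
        self.count = defaultdict(int)   # group key -> number of events
        self.total = defaultdict(int)   # group key -> sum of values

    def update(self, key, value):
        # Constant-time update: no stored events are revisited.
        self.count[key] += 1
        self.total[key] += value

    def get_count(self, key):
        return self.count.get(key, 0)

    def get_avg(self, key):
        n = self.count.get(key, 0)
        return self.total[key] / n if n else None


agg = RunningAggregate()
agg.update("news", 120)   # e.g. a loadTime of 120 for category "news"
agg.update("news", 80)
agg.update("sport", 200)
print(agg.get_count("news"), agg.get_avg("news"))  # 2 100.0
```

Because the aggregate is always current, the same structure can serve both live and historical reads.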
10. {
time : TIME(HOUR; MIN; SEC),
page : PATH(/),
category : STRING,
loadTime : LONG
}
{
select : ["COUNT", "AVG(loadTime)"],
where : "time, ?path",
group : "time, ?category"
}
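One way to read the TIME(HOUR; MIN; SEC) and PATH(/) types is as hierarchical fields, where each value expands into one key per level so that counters can be kept at every granularity. That reading is an assumption, not something the slides state. The helper names below are hypothetical, and this Python sketch only illustrates the expansion.

```python
from datetime import datetime


def time_levels(ts):
    """Expand a timestamp into hour, minute and second level keys."""
    return [ts.strftime("%H"), ts.strftime("%H:%M"), ts.strftime("%H:%M:%S")]


def path_prefixes(path, sep="/"):
    """Expand a path into all of its prefixes, from the root down."""
    parts = [p for p in path.split(sep) if p]
    prefixes = [sep]
    for i in range(1, len(parts) + 1):
        prefixes.append(sep + sep.join(parts[:i]))
    return prefixes


print(time_levels(datetime(2012, 1, 1, 21, 5, 9)))
# ['21', '21:05', '21:05:09']
print(path_prefixes("/blog/2012/post"))
# ['/', '/blog', '/blog/2012', '/blog/2012/post']
```

Under this reading, an incoming event would increment one counter per level, which is what lets a query ask for hourly or per-minute totals without scanning events.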
22. Counter rows (row key, then counters):
21:00       all→1345  :00→45  :01→62  :02→87  ...
22:00       all→3222  :00→22  :01→19  :02→105  ...
...
UK          all→229  user01→2  user14→12  user99→7  ...
US          all→354  user01→4  user04→8  user56→17  ...
...
UK, 22:00   all→1905  ...
∅           all→87315  UK→239  US→354  ...
Queries, and the row each one reads:
where time 21:00-22:00, count(*)           → row 21:00
where time 22:00-23:00, group by minute    → row 22:00
where geography=UK, group all by user      → row UK
count all                                  → row ∅
group all by geo                           → row ∅
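The row layout on this slide can be sketched in Python as a two-level map from row key to counter name to count. This is an illustration under assumptions: the class, method names and the choice of which rows to update per event are hypothetical and only mirror the rows shown on the slide.

```python
from collections import defaultdict

ALL = "all"
ROOT = "∅"   # the row with no filter applied


class CounterRows:
    def __init__(self):
        # row key -> counter name -> count
        self.rows = defaultdict(lambda: defaultdict(int))

    def record(self, geo, user, hour, minute):
        """Update every pre-defined row touched by one event."""
        for row, col in [
            (hour, ALL), (hour, minute),   # time rows, rolled up by minute
            (geo, ALL), (geo, user),       # geography rows, rolled up by user
            ((geo, hour), ALL),            # combined geography + hour row
            (ROOT, ALL), (ROOT, geo),      # global row, rolled up by geography
        ]:
            self.rows[row][col] += 1

    def count(self, row=ROOT):
        return self.rows[row][ALL]

    def group(self, row=ROOT):
        """Return the per-group breakdown stored in one row."""
        return {k: v for k, v in self.rows[row].items() if k != ALL}


c = CounterRows()
c.record("UK", "user01", "21:00", ":00")
c.record("UK", "user14", "22:00", ":01")
c.record("US", "user01", "22:00", ":01")
print(c.count("22:00"))   # where time 22:00-23:00, count(*)  -> 2
print(c.group("UK"))      # where geography=UK, group by user
print(c.group())          # group all by geo
```

Each query is a single row read, which is why the cost is paid at update time rather than at query time.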
25. Count Distinct
Plan A: keep a list of all the things you’ve seen
count them at query time
Quick to update
... but at scale ...
Takes lots of space
Takes a long time to query
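Plan A can be written down directly. The Python sketch below takes the slide literally (append on update, deduplicate at query time); the class name is hypothetical.

```python
class ExactDistinct:
    """Plan A: remember everything seen, count distinct items on demand."""

    def __init__(self):
        self.seen = []            # grows with every event, duplicates included

    def add(self, item):
        self.seen.append(item)    # quick to update: O(1) append

    def count(self):
        # Slow at scale: every stored item is scanned on each query.
        return len(set(self.seen))


d = ExactDistinct()
for user in ["user01", "user14", "user01", "user99"]:
    d.add(user)
print(d.count())  # 3
```

Both the memory held by the list and the work done in count() grow with the number of events, which is the scaling problem the slide points at.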
26. Approximate Distinct
Track the max # of leading zeroes seen so far:
item   hash             leading zeroes   max so far
x      00101001110...   2                2
y      11010100111...   0                2
z      00011101011...   3                3
...
... to see a max of M takes about 2^M items
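The table can be reproduced with a few lines of Python. The sketch works on the hash prefixes shown on the slide, written as bit strings. The function names are hypothetical, and the 2^M figure is only the rough single-counter estimate the slide describes.

```python
def leading_zeroes(bits):
    """Number of '0' characters before the first '1' in a bit string."""
    return len(bits) - len(bits.lstrip("0"))


def running_max(hashes):
    """Max leading-zero count seen so far, after each hash in turn."""
    best, out = 0, []
    for bits in hashes:
        best = max(best, leading_zeroes(bits))
        out.append(best)
    return out


def rough_estimate(hashes):
    """Seeing a max of M leading zeroes suggests about 2**M distinct items."""
    return 2 ** running_max(hashes)[-1] if hashes else 0


hashes = ["00101001110", "11010100111", "00011101011"]  # x, y, z from the slide
print(running_max(hashes))     # [2, 2, 3]
print(rough_estimate(hashes))  # 8
```

Only one small integer is stored no matter how many items arrive, and repeats of the same item hash to the same value, so they never change the maximum. The price is a very noisy estimate that is always a power of two.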
27. Approximate Distinct
to reduce variance, average over m = 2^k sub-streams
item   hash             index, zeroes   max so far
x      00101001110...   0, 0            0,0,0,0
y      11010100111...   3, 1            0,0,0,1
z      00011101011...   0, 1            1,0,0,1
...
take the harmonic mean
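The sub-stream scheme resembles the HyperLogLog family of estimators. The Python sketch below follows that design under stated assumptions: SHA-1 as the hash, k = 12 index bits, and the standard HyperLogLog bias constant and small-range correction, none of which come from the slides. It also stores the leading-zero count plus one, as HyperLogLog does, rather than the raw count shown in the table. The class name is hypothetical.

```python
import hashlib
import math


class ApproxDistinct:
    def __init__(self, k=12):
        self.k = k                      # index bits; assumes k >= 7
        self.m = 1 << k                 # number of sub-streams (registers)
        self.registers = [0] * self.m   # max rank seen per sub-stream

    def add(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")          # 64-bit hash
        rest_bits = 64 - self.k
        index = h >> rest_bits                         # first k bits pick the sub-stream
        rest = h & ((1 << rest_bits) - 1)              # remaining bits
        rank = rest_bits - rest.bit_length() + 1       # leading zeroes + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def estimate(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)               # bias constant for m >= 128
        # Harmonic mean of 2**register across sub-streams, scaled by alpha * m.
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        empty = self.registers.count(0)
        if raw <= 2.5 * m and empty:
            return m * math.log(m / empty)             # small-range correction
        return raw


sketch = ApproxDistinct()
for i in range(50000):
    sketch.add(f"user{i}")
print(round(sketch.estimate()))  # close to 50000, typically within a few percent
```

With m sub-streams the relative error is roughly 1.04 / sqrt(m), so 4096 registers give about 1.6% error while holding only 4096 small integers.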
30. What’s Coming?
• Ad Hoc: same queries, but without the need
to pre-define them
• Geolocation: support for location-based
events and queries
• Drill down: see the events that make up any
given aggregate
32. • Manufacturing
• Social Media
• Ad Analytics
• Systems Monitoring
• Financial Services
• Oil + Gas
33. “Up and running in about 4 hours”
“We found out a competitor was scraping our data”
“We keep discovering use cases we hadn’t thought of”
35. www.acunu.com @acunu
Apache, Apache Cassandra, Cassandra, Hadoop, and the eye and
elephant logos are trademarks of the Apache Software Foundation.