NoSQL and Big Data Analytics at NOSQL NOW! 2013

This presentation by Tim Moreton at NoSQL NOW! 2013 looks at the history of doing analytics in NoSQL databases. It compares the relative strengths of normalized and denormalized approaches and shows how Twitter and Facebook have built custom denormalized systems over NoSQL to support real-time analytics. It then covers the lambda architecture and shows how Acunu Analytics provides OLAP cubes over NoSQL, combining denormalization with expressive SQL-like queries.

You can see the full talk here:
http://www.slideshare.net/Dataversity/nosql-and-big-data-analytics

Published in: Technology

Transcript of "NoSQL and Big Data Analytics at NOSQL NOW! 2013"

  1. NOSQL and Big Data Analytics. Tim Moreton, Founder and CTO
  2. In the beginning, NOSQL was about storage
  3. Google Personalized Search, 2006. Collect user queries and clickstream (write only, high throughput) into BigTable, keyed by user_id with searches and clicks. Out-of-band batch analysis (MapReduce via GFS) produces user profiles. Serve customised search results using user profiles (read only, low latency).
  4. When NOSQL, when Hadoop? Discovery Analytics (Unstructured Warehouses, Data Mining, Machine Learning): complex, long-running; total lack of structure. Operational Intelligence (Dashboards, Real-time Decisions, Alerting): low latency, fresh data; some structure to exploit.
  5. Normalization and its limits. For each update: a few random writes. For each query: many random reads.
  6. Denormalization. For each query: one sequential read. For each update: many writes, sequential IO.
  7. Building block: distributed counters. Each event adds +1 to several counters, such as total tweets, by date (2013-08-12) and by user (@timmoreton); the diagram shows a count of 752. Increment syntax per store:
      Cassandra: UPDATE table SET col = col + 1 WHERE id = 2;
      HBase: table.incrementColumnValue(row, cf, col, 1);
      Riak: curl -i http://host:8098/buckets/x/counters/count2 -X POST -d "1"
  8. Twitter’s Rainbird (source: Twitter)
  9. Facebook’s Puma, ODS, Claspin (source: Facebook)
  10. "I believe firmly that ... you should "denormalize" only as a last resort. That is, you should back off from a fully normalized design only if all other strategies for improving performance have somehow failed to meet requirements." (C. J. Date, 2005)
  11. Denormalization and agility
  12. ‘Lambda Architecture’ (image: http://www.josemalvarez.es/web/wp-content/uploads/2013/03/toy-lambda-arch.png)
  13. Acunu Analytics. Example cubes over raw events: count by day, count by hour of day, uniques by hashtag.
      (1) Define aggregate cubes: CREATE CUBE APPROX TOP(hashtag) WHERE browser, time GROUP BY time
      (2) New events update cubes
      (3) Rich instant queries over cubes: SELECT TOP(x) FROM t WHERE .. GROUP BY d1, d2, ... JOIN ... HAVING .. ORDER BY ..
      (4) Drilldown to raw events
      (5) Backfill new cubes using historic data
  14. Acunu Analytics architecture. Components: API, event stream, event store, roll-up cubes; ingest and processing; dashboard queries and programmatic interface.
      Cassandra stores raw events and aggregates.
      Acunu Analytics manages cubes and maps inserts and SQL-like queries to Cassandra reads and writes.
      Processing at ingest: JSON, CSV and log ingest via RESTful HTTP API, Flume, Storm, AMQP.
      Acunu Dashboards provides rich, real-time, embeddable visualizations.
      AQL: SELECT AVG(r) FROM metrics GROUP BY host;
      Cubes: millisecond queries.
      API for rich queries and threshold alerting.
  15. Conclusions. NoSQL is a great fit for collecting or serving datasets with some structure at high scale, performance and availability. Real-time Big Data apps can’t use unplanned rich queries. Use atomic counters to pre-materialize quantitative results in real time, but think carefully about flexibility. Do analytics out-of-band if timeliness is unimportant. A lambda architecture combines real-time with richer processing, but adds complexity. Acunu Analytics offers real-time OLAP-style queries.
  16. Thanks! @timmoreton @acunu
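
The denormalized counter pattern from slides 6 and 7 can be sketched roughly as follows. This is a minimal illustration, not code from the talk: an in-memory Counter stands in for the atomic counters of Cassandra, HBase or Riak, and the names CounterStore, record_tweet and the key layout are hypothetical. It shows one event fanning out into several increments at write time, so that each query becomes a single lookup.

```python
from collections import Counter
from datetime import date


class CounterStore:
    """In-memory stand-in for a store with atomic counters (hypothetical)."""

    def __init__(self):
        self._counts = Counter()

    def increment(self, key, amount=1):
        # In a real store this would be one atomic counter increment.
        self._counts[key] += amount

    def get(self, key):
        # A single key lookup; unseen keys read as 0.
        return self._counts[key]


def record_tweet(store, user, day):
    """Denormalized write: one event updates every counter that covers it."""
    store.increment(("total",))
    store.increment(("by_date", day.isoformat()))
    store.increment(("by_user", user))


store = CounterStore()
events = [
    ("@timmoreton", date(2013, 8, 12)),
    ("@timmoreton", date(2013, 8, 12)),
    ("@acunu", date(2013, 8, 12)),
    ("@timmoreton", date(2013, 8, 13)),
]
for user, day in events:
    record_tweet(store, user, day)

print(store.get(("total",)))
print(store.get(("by_date", "2013-08-12")))
print(store.get(("by_user", "@timmoreton")))
```

The trade-off matches the conclusions slide: reads are cheap because the answers are pre-materialized, but a question that was not planned for (say, counts by hashtag) has no counter until one is defined and backfilled from raw events.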