Bigdata analytics-twitter

Describes how to build real-time big data analytics by combining in-memory computing with big data storage. Twitter analytics serves as the entry point and example. The deck then describes how Storm works, how pairing a data grid with Storm improves its performance, and a real-world production system for real-time big data analytics that combines GigaSpaces XAP, Apache Cassandra (DataStax), and Apache Hadoop (Cloudera).

Published in: Technology

  • ActiveInsight
  • New Years Day 2013, Asia region

    1. Real Time Analytics for Big Data: Lessons from Twitter (and beyond). DeWayne Filppi, @dfilppi
    2. Big Data Predictions
        “Over the next few years we’ll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.” (Edd Dumbill, O’Reilly)
        ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
    3. The Two Vs of Big Data: Velocity, Volume
    4. We’re Living in a Real Time World… Social, User Tracking & Engagement, Homeland Security, eCommerce, Financial Services, Real Time Search
    5. The Flavors of Big Data Analytics: Counting, Correlating, Research
    6. Analytics @ Twitter – Counting
        • How many signups, tweets, retweets for a topic?
        • What’s the average latency?
        • Demographics
          – Countries and cities
          – Gender
          – Age groups
          – Device types
          – …
    7. Analytics @ Twitter – Correlating
        • What devices fail at the same time?
        • What features get users hooked?
        • What places on the globe are “happening”?
    8. Analytics @ Twitter – Research
        • Sentiment analysis
          – “Obama is popular”
        • Trends
          – “People like to tweet after watching American Idol”
        • Spam patterns
          – How can you tell when a user spams?
    9. It’s All about Timing
        • “Real time” (< few seconds)
        • Reasonably quick (seconds - minutes)
        • Batch (hours/days)
    10. It’s All about Timing
        • Event driven / stream processing
        • High resolution: every tweet gets counted
        • Ad-hoc querying
        • Medium resolution (aggregations)
        • Long running batch jobs (ETL, map/reduce)
        • Low resolution (trends & patterns)
        Callout: “This is what we’re here to discuss”
    11. TWITTER REAL-TIME ANALYTICS SYSTEM
    12. Twitter in Numbers (Jan 2013): It takes a week for users to send 3 billion Tweets. Source: http://blog.twitter.com/2011/03/numbers.html
    13. Twitter in Numbers (Jan 2013): On average, 500 million tweets get sent every day. Source: http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/l
    14. Twitter in Numbers (Jan 2013): The highest throughput to date is 33,388 tweets/sec. Source: http://www.huffingtonpost.com/2013/01/02/tweets-per-second-record_n_2396915.html
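
To relate the daily, weekly and peak figures quoted on slides 12 to 14, they can be converted to a common unit. The sketch below is only back-of-the-envelope arithmetic on the numbers as quoted; the variable names are illustrative and it assumes traffic is spread evenly over the day, which real traffic is not.

```python
# Back-of-the-envelope conversion of the quoted Twitter figures to tweets/sec.
# All inputs are the numbers quoted on the slides; names are illustrative.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

tweets_per_day = 500_000_000      # slide 13: average tweets per day
tweets_per_week = 3_000_000_000   # slide 12: tweets per week
peak_tweets_per_sec = 33_388      # slide 14: highest recorded throughput

# Average rates, assuming traffic is spread evenly over time.
avg_per_sec_daily = tweets_per_day / SECONDS_PER_DAY
avg_per_sec_weekly = tweets_per_week / (7 * SECONDS_PER_DAY)

# How far the recorded peak sits above the daily average.
peak_to_avg = peak_tweets_per_sec / avg_per_sec_daily

print(f"average from daily figure:  {avg_per_sec_daily:,.0f} tweets/sec")
print(f"average from weekly figure: {avg_per_sec_weekly:,.0f} tweets/sec")
print(f"peak / average:             {peak_to_avg:.1f}x")
```

Both averages come out at a few thousand tweets per second, with the recorded peak several times higher, so a system sized only for the average would fall short at peak.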
    15. Twitter in Numbers (March 2011): 1,000,000 new accounts are created daily. Source: http://www.mediabistro.com/alltwitter/50-twitter-fun-facts_b33589
    16. Twitter in Numbers: 5% of the users generate 75% of the content. Source: http://www.sysomos.com/insidetwitter/
    17. Analyze the Problem
        • (Tens of) thousands of tweets per second to process
          – Assumption: need to process in near real time
        • Aggregate counters for each word
          – A few tens of thousands of words (or hundreds of thousands if we include URLs)
        • System needs to scale linearly
        • System needs to be fault tolerant
    18. Key Elements in Real Time Big Data Analytics
    19. Sharding (Partitioning). Diagram: n parallel pipelines, each Tokenizer i → Filterer i → Counter Updater i (i = 1 … n)
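
The sharding idea on slide 19 can be illustrated with a small sketch. This is not code from the talk: the shard count, the function names and the choice of CRC32 as the partitioning hash are all assumptions made for the example. The point it shows is that routing each word to one shard by a stable hash keeps every counter on a single partition, so shards never need to coordinate.

```python
import zlib
from collections import Counter

NUM_SHARDS = 4  # hypothetical shard count, chosen for illustration


def shard_for(word: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a word to a shard index with a stable hash.

    zlib.crc32 is used instead of the built-in hash() because string
    hashing in Python is randomized per process, which would route the
    same word to different shards on different nodes.
    """
    return zlib.crc32(word.encode("utf-8")) % num_shards


def count_sharded(words, num_shards: int = NUM_SHARDS):
    """Update one independent counter per shard."""
    shards = [Counter() for _ in range(num_shards)]
    for word in words:
        shards[shard_for(word, num_shards)][word] += 1
    return shards


words = "storm hadoop storm twitter storm hadoop".split()
shards = count_sharded(words)

# Merging the per-shard counters gives the same result as a single counter.
total = sum(shards, Counter())
print(total)
```

Because a given word always lands on the same shard, adding shards spreads the counters without changing the totals.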
    20. Use EDA (Event Driven Architecture). Diagram: Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter / Aggregator
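
A minimal sketch of the event flow on slide 20, assuming a tiny in-process event bus. The `EventBus` class, the event type names and the stop-word list are hypothetical and exist only for this example. Each stage reacts to one event type and publishes the next, so no stage calls another directly.

```python
from collections import Counter, defaultdict, deque


class EventBus:
    """Tiny in-process event bus (illustrative, not a real library)."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self._queue = deque()

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        self._queue.append((event_type, payload))

    def run(self):
        # Deliver queued events until no stage has anything left to emit.
        while self._queue:
            event_type, payload = self._queue.popleft()
            for handler in self._handlers[event_type]:
                handler(payload)


STOP_WORDS = {"a", "the", "is", "to"}  # hypothetical filter list
bus = EventBus()
counts = Counter()


def tokenizer(tweet):
    # Raw -> Tokenized: one event per word.
    for token in tweet.lower().split():
        bus.publish("tokenized", token)


def filterer(token):
    # Tokenized -> Filtered: drop uninteresting words.
    if token not in STOP_WORDS:
        bus.publish("filtered", token)


def counter(token):
    # Filtered -> Counter / Aggregator.
    counts[token] += 1


bus.subscribe("raw", tokenizer)
bus.subscribe("tokenized", filterer)
bus.subscribe("filtered", counter)

bus.publish("raw", "Storm is fast")
bus.publish("raw", "the storm is a stream processor")
bus.run()
print(counts)
```

Since stages are coupled only through event types, each one could be scaled or replaced independently.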
    21. Twitter Storm
    22. Twitter Storm With Hadoop
    23. Storm Overview
    24. Storm Concepts
        • Streams: unbounded sequence of tuples
        • Spouts: source of streams (queues)
        • Bolts: functions, filters, joins, aggregations
        • Topologies
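
The concepts on slide 24 can be modelled with plain Python generators. This is a toy model, not Storm's API (Storm topologies are normally written against its Java interfaces), and every name in it is made up for illustration. A spout is an unbounded generator of tuples, a bolt transforms one stream into another, and a topology is the wiring between them.

```python
import itertools
from collections import Counter


def tweet_spout():
    """Spout: an unbounded source of tuples (here, a repeating sample)."""
    sample = [
        "storm processes streams",
        "streams of tuples",
        "storm topologies",
    ]
    for tweet in itertools.cycle(sample):
        yield (tweet,)


def split_bolt(stream):
    """Bolt (function): turn each tweet tuple into one tuple per word."""
    for (tweet,) in stream:
        for word in tweet.split():
            yield (word,)


def count_bolt(stream):
    """Bolt (aggregation): emit the running count for each word seen."""
    counts = Counter()
    for (word,) in stream:
        counts[word] += 1
        yield (word, counts[word])


def topology():
    """Topology: the wiring spout -> split bolt -> count bolt."""
    return count_bolt(split_bolt(tweet_spout()))


# The stream never ends, so the consumer decides how much of it to read.
latest = {}
for word, count in itertools.islice(topology(), 8):
    latest[word] = count
print(latest)
```

In a real cluster each spout and bolt would run as many parallel tasks across machines; the generator chain only mirrors the data flow.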
    25. Challenge – Word Count. Diagram: Tweets → Count → Word:Count
        • Hottest topics
        • URL mentions
        • etc.
    26. Streaming word count with Storm
    27. Supercharging Storm
        • Storm doesn’t supply persistence, but provides for it.
        • Storm optimizes IO to slow persistence (e.g. databases) using batching.
        • Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.
        • Tweets, events, whatever…
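
The batching point on slide 27 can be sketched as a small write buffer. This is a generic illustration rather than Storm's own mechanism; `BatchingWriter`, `slow_store_write` and the batch size are hypothetical. Updates are merged in memory and sent to the slow store in groups, so the number of expensive writes is much smaller than the number of events.

```python
class BatchingWriter:
    """Accumulate counter updates and flush them to a slow store in batches."""

    def __init__(self, flush_fn, batch_size=3):
        self._flush_fn = flush_fn      # called with a dict of key -> delta
        self._batch_size = batch_size  # distinct keys held before a flush
        self._pending = {}

    def update(self, key, delta=1):
        # Repeated updates to the same key are merged in memory.
        self._pending[key] = self._pending.get(key, 0) + delta
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self):
        if self._pending:
            self._flush_fn(dict(self._pending))
            self._pending.clear()


store = {}        # stands in for a database table
write_calls = []  # records each batch sent to the "database"


def slow_store_write(batch):
    write_calls.append(batch)
    for key, delta in batch.items():
        store[key] = store.get(key, 0) + delta


writer = BatchingWriter(slow_store_write, batch_size=3)
for word in "a b a c a b d".split():
    writer.update(word)
writer.flush()  # push anything still buffered

print(store, len(write_calls))
```

The trade-off is durability: anything still buffered is lost if the process dies before a flush, which is why the slide asks the stream provider to support reliability.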
    28. XAP Real Time Analytics
    29. Two Layer Approach
        • Advantage: minimal “impedance mismatch” between layers.
          – Both NoSQL cluster technologies, with similar advantages
        • Grid layer serves as an in-memory cache for interactive requests.
        • Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.
        Diagram labels: Raw Event Stream (×3), Real Time Events (×2), Reporting Engine, In Memory Compute Cluster (SCALE), Raw And Derived Events, NoSQL Cluster (SCALE)
        ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
    30. Simplified Architecture
    31. Key Concepts
        • Flowing event streams through memory for side effects
        • Event driven architecture executing in-memory
        • Raw events flushed, aggregations/derivations retained
        • All layers horizontally scalable
        • All layers highly available
        • Real-time analytics & cached batch analytics on same scalable layer
        • Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
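
The "raw events flushed, aggregations retained" idea from slide 31 can be shown with a small in-memory aggregator. This is a sketch under assumptions, not the XAP implementation: the class name, the one-minute bucket size and the sample events are invented for the example. Each event updates a per-minute counter as it flows through and is then dropped, so memory grows with the number of distinct words per bucket rather than with the number of events.

```python
from collections import Counter, defaultdict


class MinuteAggregator:
    """Keep per-minute word counts; never store the raw events."""

    def __init__(self):
        self.counts_by_minute = defaultdict(Counter)
        self.events_seen = 0

    def on_event(self, timestamp_sec, word):
        minute = timestamp_sec // 60
        self.counts_by_minute[minute][word] += 1
        self.events_seen += 1
        # The raw (timestamp, word) event is not kept after this point.

    def top(self, minute, n=1):
        bucket = self.counts_by_minute.get(minute, Counter())
        return bucket.most_common(n)


agg = MinuteAggregator()
events = [(0, "storm"), (10, "storm"), (59, "xap"), (60, "xap"), (61, "xap")]
for timestamp_sec, word in events:
    agg.on_event(timestamp_sec, word)

print(agg.top(0), agg.top(1))
```

In the two-layer design described in the deck, the raw events would still be written through to the NoSQL store for batch analysis; only the in-memory layer discards them.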
    32. Keep Things In Memory
        • Facebook keeps 80% of its data in memory (Stanford research)
        • RAM is 100-1000x faster than disk (random seek)
          – Disk: 5-10 ms
          – RAM: ~0.001 ms
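
As a quick check on the two latency figures quoted on slide 32, the ratio can be computed directly. The sketch uses only the slide's own numbers, converted to microseconds; the variable names are illustrative. Taken at face value, these figures imply a larger gap than the 100-1000x range also quoted on the slide, so both are best read as rough orders of magnitude.

```python
# Latency figures as quoted on the slide, converted to microseconds.
DISK_SEEK_US = (5_000, 10_000)  # disk random seek: 5-10 ms
RAM_ACCESS_US = 1               # RAM access: ~0.001 ms

ratio_low = DISK_SEEK_US[0] // RAM_ACCESS_US
ratio_high = DISK_SEEK_US[1] // RAM_ACCESS_US

print(f"disk / RAM latency ratio: {ratio_low:,}x to {ratio_high:,}x")
```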
    33. Take Aways
        A data grid can serve different needs for big data analytics:
        • Supercharge a dedicated stream processing cluster like Storm.
          – Provide fast, reliable, transactional tuple streams and state
        • Provide a general purpose analytics platform
          – Roll your own
        • Simplify overall architecture while enhancing scalability
          – Ultra high performance/low latency
          – Dynamically scalable processing and in-memory storage
          – Eliminate messaging tier
          – Eliminate or minimize need for RDBMS
    34. References
        • Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
        • Learn and fork the code on GitHub: https://github.com/Gigaspaces/storm-integration
        • Twitter Storm: http://storm-project.net
        • XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/