Real Time Analytics for Big Data - A twitter inspired case study

  1. Real Time Analytics for Big Data – A Twitter Case Study. Uri Cohen, Head of Product, @uri1803
  2. Big Data Predictions. “Over the next few years we’ll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.” – Edd Dumbill, O’Reilly. ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
  3. We’re Living in a Real Time World… Facebook real-time social analytics, Google real-time web analytics and real-time search, real-time user engagement, Twitter paid-tweet analytics, and new real-time analytics startups.
  4. Three Flavors of Analytics: counting, correlating, and research.
  5. Analytics @ Twitter – Counting. How many signups, tweets, and retweets for a topic? What’s the average latency? Demographics: countries and cities, gender, age groups, device types, …
  6. Analytics @ Twitter – Correlating. What devices fail at the same time? What features get users hooked? What places on the globe are “happening”?
  7. Analytics @ Twitter – Research. Sentiment analysis (“Obama is popular”), trends (“People like to tweet after watching American Idol”), and spam patterns (how can you tell when a user spams?).
  8. It’s All About Timing. “Real time” (< a few seconds), reasonably quick (seconds to minutes), and batch (hours/days).
  9. It’s All About Timing. Real time: event-driven / stream processing, high resolution – every tweet gets counted (this is what we’re here to discuss). Reasonably quick: ad-hoc querying, medium resolution (aggregations). Batch: long-running batch jobs (ETL, map/reduce), low resolution (trends & patterns).
  10. Twitter in Numbers (March 2011). It takes a week for users to send 1 billion tweets. Source: http://blog.twitter.com/2011/03/numbers.html
  11. Twitter in Numbers (March 2011). On average, 140 million tweets are sent every day. Source: http://blog.twitter.com/2011/03/numbers.html
  12. Twitter in Numbers (March 2011). The highest throughput to date is 6,939 tweets/sec. Source: http://blog.twitter.com/2011/03/numbers.html
  13. Twitter in Numbers (March 2011). 460,000 new accounts are created daily. Source: http://blog.twitter.com/2011/03/numbers.html
  14. Twitter in Numbers. 5% of the users generate 75% of the content. Source: http://www.sysomos.com/insidetwitter/
  15. Challenge 1 – Computing Reach
  16. Challenge 1 – Computing Reach. Tweets → followers → distinct followers → count → reach.
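The reach pipeline on the slide can be sketched in a few lines: union the follower sets of everyone who tweeted a topic, then count the distinct users. This is a minimal single-process sketch; the function and data names (`compute_reach`, `followers_of`) are illustrative, not Twitter's actual API.

```python
def compute_reach(tweeters, followers_of):
    """Union the follower sets of all tweeters and count distinct users."""
    distinct = set()
    for user in tweeters:
        distinct.update(followers_of.get(user, ()))
    return len(distinct)

# Overlapping follower sets: a naive sum would give 6, but reach is 4.
followers = {
    "alice": {"u1", "u2", "u3"},
    "bob":   {"u2", "u3", "u4"},
}
print(compute_reach(["alice", "bob"], followers))  # 4
```

The "distinct" step is the whole point: followers shared by several tweeters must be counted once, which is why the slide shows it as its own stage.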
  17. Challenge 2 – Word Count. Tweets → count → word:count. Uses: hottest topics, URL mentions, etc.
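At its core, the word-count challenge is the classic streaming tally. A minimal sketch (whitespace tokenization is a simplifying assumption; real tweet tokenization handles hashtags, mentions, and URLs):

```python
from collections import Counter

def count_words(tweets):
    """Tokenize each tweet on whitespace and tally every word seen."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts

tweets = ["big data is big", "real time big data"]
print(count_words(tweets).most_common(1))  # [('big', 3)]
```

`most_common` gives exactly the "hottest topics" view the slide mentions; the hard part, as the next slides show, is doing this at tens of thousands of tweets per second.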
  18. URL Mentions – Here’s One Use Case
  19. Analyze the Problem. (Tens of) thousands of tweets per second to process (assumption: we need to process in near real time). Aggregate counters for each word – a few tens of thousands of words, or hundreds of thousands if we include URLs. The system needs to scale linearly.
  20. It’s Really Just an Online Map/Reduce
  21. Event-Driven Flow. Raw tweets → tokenize → filter → count → store counters; raw tweets are also stored for research.
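The tokenize → filter → count flow above maps naturally onto composable stages. A sketch using Python generators as stand-ins for the event handlers (the stop-word set is an illustrative filter, not from the deck):

```python
# Stage names mirror the slide's flow: tokenize -> filter -> count.
STOP_WORDS = {"a", "the", "is", "to"}

def tokenize(tweets):
    """Split each raw tweet into a stream of lowercase words."""
    for tweet in tweets:
        for word in tweet.lower().split():
            yield word

def keep(words):
    """Drop noise words; only interesting tokens flow downstream."""
    for word in words:
        if word not in STOP_WORDS:
            yield word

def count(words):
    """Terminal stage: aggregate per-word counters."""
    counters = {}
    for word in words:
        counters[word] = counters.get(word, 0) + 1
    return counters

raw = ["the stream is fast", "the stream scales"]
print(count(keep(tokenize(raw))))  # {'stream': 2, 'fast': 1, 'scales': 1}
```

In the deck's architecture each stage is a separate event handler fed by the grid rather than an in-process generator, but the dataflow shape is the same.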
  22. It’s Not That Simple… (Tens of) thousands of tweets to tokenize every second, and even more words to filter – CPU bottleneck. Tens or hundreds of thousands of counters to update – counter contention. Tens or hundreds of thousands of counters to persist – database bottleneck. (Tens of) thousands of tweets to store every second in the database – database bottleneck.
  23. Implementation Challenges. Scale: Twitter is growing by the day – how can this scale? Fault tolerance: what happens if a server fails? Consistency: counters should be accurate!
  24. Solutions, Solutions
  25. Dealing with the Massive Tweet Stream. We need a system that can scale linearly – event partitioning to the rescue! What do we partition by? A bank of tokenizers (1…n), each feeding a matching filterer (1…n).
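The "what to partition by" question can be answered with a content-based hash: if events are routed by the word itself, every update to a given counter lands on the same partition. A minimal sketch (the partition count and routing function are assumptions for illustration; `zlib.crc32` is used because Python's built-in `hash()` is salted per process and so is unsuitable for stable routing):

```python
import zlib

N_PARTITIONS = 4

def partition_of(word):
    """Route a word to a fixed partition via a stable hash."""
    return zlib.crc32(word.encode("utf-8")) % N_PARTITIONS

# The same word always maps to the same partition, so each counter
# is only ever touched by a single updater - no cross-node contention.
assert partition_of("analytics") == partition_of("analytics")
for w in ["big", "data", "real", "time"]:
    print(w, "->", partition_of(w))
```

Adding capacity then means adding partitions, which is what gives the linear scaling the slide asks for (in practice with consistent hashing or grid-managed routing so that repartitioning stays cheap).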
  26. Counters Persistence & Contention. Why not keep things in memory? Treat the IMDG as the system of record (SoR). Pipeline per partition: tokenizer → filterer → counter updater, all backed by the IMDG.
  27. Counters Persistence & Contention. Why not keep things in memory (a data grid)? Question: how do we partition the counters?
  28. Counters Persistence & Contention. Why not keep things in memory? Facebook keeps 80% of its data in memory (Stanford research). RAM is 100–1000x faster than disk for random seeks: disk 5–10 ms, RAM ~0.001 ms.
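Keeping the counters in memory still requires the updates to be atomic, since several updater threads may hit the same counter. A single-process sketch of an in-memory counter store with atomic increments (a real IMDG partitions and replicates this across nodes; the class name is illustrative):

```python
import threading
from collections import defaultdict

class CounterStore:
    """In-memory counters with lock-protected, atomic updates."""

    def __init__(self):
        self._counts = defaultdict(int)
        self._lock = threading.Lock()

    def increment(self, key, delta=1):
        # The lock makes the read-modify-write atomic across threads.
        with self._lock:
            self._counts[key] += delta

    def get(self, key):
        with self._lock:
            return self._counts[key]

store = CounterStore()
threads = [threading.Thread(target=store.increment, args=("tweets",))
           for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.get("tweets"))  # 100
```

Without the lock, concurrent `+= 1` operations can lose updates, which is exactly the "counters should be accurate" consistency requirement from slide 23.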
  29. Database Bottleneck – Storing Raw Tweets. An RDBMS won’t cut it (unless you have $$$$$). Hadoop / NoSQL to the rescue. Use the data grid for batching and as a reliable transaction log, with persisters 1…n draining it into the big data store.
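The batching idea can be sketched as a write-behind persister: raw tweets accumulate in memory and are flushed to the backing store in bulk, so the database sees a few large writes instead of tens of thousands of single-row inserts. A minimal sketch, where `backing_store` is a plain list standing in for a NoSQL bulk-write API and the batch size is an illustrative assumption:

```python
class BatchPersister:
    """Buffer incoming tweets and flush them to the store in batches."""

    def __init__(self, backing_store, batch_size=1000):
        self.backing_store = backing_store
        self.batch_size = batch_size
        self.buffer = []

    def write(self, tweet):
        self.buffer.append(tweet)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.backing_store.append(list(self.buffer))  # one bulk write
            self.buffer.clear()

store = []  # stands in for a NoSQL / Hadoop bulk-write sink
persister = BatchPersister(store, batch_size=3)
for i in range(7):
    persister.write(f"tweet-{i}")
persister.flush()
print([len(batch) for batch in store])  # [3, 3, 1]
```

In the deck's design the buffer lives in the replicated grid rather than plain process memory, which is what makes it a *reliable* transaction log: a persister crash loses no tweets.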
  30. Counter Consistency & Resiliency. Primary / hot-backup setup with synchronous replication. Transactional, atomic updates. “On demand” backups in the IMDG (primary/backup pairs per partition).
  31. Counter Consistency & Resiliency (continued). When a primary fails, its hot backup takes over as the new primary and a replacement backup is provisioned on demand.
  32. Implementing Event Flows. Use the data grid as the event bus, with stateful objects as transactional events: raw → tokenizer → tokenized → filterer → filtered → counter (primary/backup).
  33. Putting It All Together (architecture diagram: a partitioned IMDG of primary/backup pairs).
  34. Putting It All Together. Use an IMDG. Partition events and counters. Collocate event handlers with data for low latency and simple scaling. Use atomic updates to update counters in memory. Persist asynchronously to a big data store.
  35. Reducing Counter Contention – Optimization. Store processed tweets locally in each partition. Use periodic (e.g. one-second) batch writes to update the counters. Process each batch as a whole.
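The optimization above can be sketched as a per-partition delta buffer: each partition tallies word deltas in a private, uncontended map, then merges them into the shared counters in one batched update per interval. The class name and one-second interval are illustrative assumptions:

```python
from collections import Counter

class LocalBatcher:
    """Accumulate word deltas locally; apply them in one periodic merge."""

    def __init__(self, shared_counters):
        self.shared = shared_counters   # contended, e.g. grid-backed
        self.local = Counter()          # uncontended, private to this partition

    def record(self, word):
        self.local[word] += 1           # cheap local update, no locking needed

    def flush(self):
        # Called periodically (e.g. every second): one batched update
        # replaces N individual contended increments.
        self.shared.update(self.local)
        self.local.clear()

shared = Counter()
batcher = LocalBatcher(shared)
for w in ["big", "data", "big"]:
    batcher.record(w)
batcher.flush()
print(shared["big"], shared["data"])  # 2 1
```

The trade-off is a bounded staleness window: shared counters lag by up to one flush interval, which is acceptable here since the slide's counters are analytics aggregates, not transactional balances.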
  36. Keep Your Eyes Open…
  37. References. Learn and fork the code on GitHub: https://github.com/Gigaspaces/rt-analytics · Detailed blog post: http://bit.ly/gs-bigdata-analytics · Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html · Twitter Storm: http://bit.ly/twitter-storm · Apache S4: http://incubator.apache.org/s4/