Bigdata analytics-twitter

Describes how to build real-time big data analytics by combining in-memory computing technology with big data storage technology. Twitter analytics serves as the entry point and example. The talk then covers how Storm functions, how combining a data grid with Storm improves performance, and a real-world production system for real-time big data analytics that combines GigaSpaces XAP, Apache Cassandra (DataStax), and Apache Hadoop (Cloudera).

Published in: Technology

  • ActiveInsight
  • New Years Day 2013, Asia region
  • Transcript

    • 1. Real Time Analytics for Big Data – Lessons from Twitter (and Beyond). DeWayne Filppi, @dfilppi
    • 2. Big Data Predictions: “Over the next few years we’ll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.” Edd Dumbill, O’Reilly. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
    • 3. The Two Vs of Big Data: Velocity, Volume
    • 4. We’re Living in a Real Time World… Social; User Tracking & Engagement; Homeland Security; eCommerce; Financial Services; Real Time Search
    • 5. The Flavors of Big Data Analytics: Counting, Correlating, Research
    • 6. Analytics @ Twitter – Counting: How many signups, tweets, retweets for a topic? What’s the average latency? Demographics: countries and cities, gender, age groups, device types, …
    • 7. Analytics @ Twitter – Correlating: What devices fail at the same time? What features get users hooked? What places on the globe are “happening”?
    • 8. Analytics @ Twitter – Research: Sentiment analysis (“Obama is popular”). Trends (“People like to tweet after watching American Idol”). Spam patterns (how can you tell when a user spams?)
    • 9. It’s All about Timing: “Real time” (< a few seconds); Reasonably quick (seconds to minutes); Batch (hours/days)
    • 10. It’s All about Timing: Event driven / stream processing, high resolution (every tweet gets counted). Ad-hoc querying, medium resolution (aggregations). Long-running batch jobs (ETL, map/reduce), low resolution (trends & patterns). Callout: “This is what we’re here to discuss.”
    • 11. TWITTER REAL-TIME ANALYTICS SYSTEM
    • 12. Twitter in Numbers (Jan 2013): It takes a week for users to send 3 billion Tweets. Source: http://blog.twitter.com/2011/03/numbers.html
    • 13. Twitter in Numbers (Jan 2013): On average, 500 million tweets get sent every day. Source: http://news.cnet.com/8301-1023_3-57541566-93/report-twitter-hits-half-a-billion-tweets-a-day/l
    • 14. Twitter in Numbers (Jan 2013): The highest throughput to date is 33,388 tweets/sec. Source: http://www.huffingtonpost.com/2013/01/02/tweets-per-second-record_n_2396915.html
    • 15. Twitter in Numbers (March 2011): 1,000,000 new accounts are created daily. Source: http://www.mediabistro.com/alltwitter/50-twitter-fun-facts_b33589
    • 16. Twitter in Numbers: 5% of the users generate 75% of the content. Source: http://www.sysomos.com/insidetwitter/
    • 17. Analyze the Problem: (Tens of) thousands of tweets per second to process (assumption: need to process in near real time). Aggregate counters for each word (a few tens of thousands of words, or hundreds of thousands if we include URLs). System needs to scale linearly. System needs to be fault tolerant.
    • 18. Key Elements in Real Time Big Data Analytics
    • 19. Sharding (Partitioning) [Diagram: n parallel pipelines, each Tokenizer i → Filterer i → Counter Updater i, for i = 1 … n]
    • 20. Use EDA (Event Driven Architecture) [Diagram: Raw → Tokenizer → Tokenized → Filterer → Filtered → Counter / Aggregator]
    • 21. Twitter Storm
    • 22. Twitter Storm With Hadoop
    • 23. Storm Overview
    • 24. Storm Concepts: Streams: unbounded sequence of tuples. Spouts: source of streams (queues). Bolts: functions, filters, joins, aggregations. Topologies. [Diagram labels: Spouts, Bolt, Topologies]
    • 25. Challenge – Word Count [Diagram: Tweets → Count → Word:Count] Hottest topics; URL mentions; etc.
    • 26. Streaming word count with Storm
    • 27. Supercharging Storm: Storm doesn’t supply persistence, but provides for it. Storm optimizes IO to slow persistence (e.g. databases) using batching. Storm processes streams. The stream provider itself needs to support persistence, batching, and reliability. Tweets, events, whatever…
    • 28. XAP Real Time Analytics
    • 29. Two Layer Approach: Advantage: minimal “impedance mismatch” between layers (both are NoSQL cluster technologies, with similar advantages). Grid layer serves as an in-memory cache for interactive requests. Grid layer serves as a real-time computation fabric for CEP, and limited (to allocated memory) real-time distributed query capability. [Diagram labels: Raw Event Stream, Real Time Events, Reporting Engine, In Memory Compute Cluster, Raw And Derived Events, NoSQL Cluster, Scale]
    • 30. Simplified Architecture
    • 31. Key Concepts: Flowing event streams through memory for side effects. Event driven architecture executing in-memory. Raw events flushed, aggregations/derivations retained. All layers horizontally scalable. All layers highly available. Real-time analytics & cached batch analytics on same scalable layer. Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely).
    • 32. Keep Things In Memory: Facebook keeps 80% of its data in memory (Stanford research). RAM is 100-1000x faster than disk (random seek). Disk: 5-10 ms. RAM: ~0.001 ms.
    • 33. Takeaways: A data grid can serve different needs for big data analytics: (1) Supercharge a dedicated stream processing cluster like Storm (provide fast, reliable, transactional tuple streams and state). (2) Provide a general purpose analytics platform (roll your own). (3) Simplify overall architecture while enhancing scalability (ultra high performance/low latency; dynamically scalable processing and in-memory storage; eliminate messaging tier; eliminate or minimize need for RDBMS).
    • 34. References: Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm. Learn and fork the code on GitHub: https://github.com/Gigaspaces/storm-integration. Twitter Storm: http://storm-project.net. XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
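
The "Twitter in Numbers" slides (13 and 14) quote a daily average and a per-second peak. A quick calculation, using only those two figures, puts them on the same scale. This is a sizing sketch, not part of the original deck; the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

tweets_per_day = 500_000_000   # slide 13: average tweets sent per day
peak_tweets_per_sec = 33_388   # slide 14: highest throughput to date

# Convert the daily average to a per-second rate.
average_tweets_per_sec = tweets_per_day / SECONDS_PER_DAY

# How far the recorded peak sits above the average rate.
peak_to_average = peak_tweets_per_sec / average_tweets_per_sec

print(f"average: {average_tweets_per_sec:,.0f} tweets/sec")
print(f"peak is {peak_to_average:.1f}x the average")
```

The average works out to roughly 5,800 tweets per second, so a system sized only for the average would be several times too small for the recorded peak. That is consistent with the "(tens of) thousands of tweets per second" requirement on slide 17.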
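
Slides 19, 20 and 25 describe a word count built from event-driven stages (tokenizer, filterer, counter updater) that are partitioned for scale. The following is a minimal single-process sketch of that idea. It is not the Storm or XAP API. The stop-word list, the class and function names, and the choice of the word itself as the partition key are all assumptions made for illustration.

```python
import re
import zlib
from collections import Counter

# Hypothetical stop-word list; the slides do not say what the filterer drops.
STOP_WORDS = {"a", "an", "and", "is", "the", "to"}


def tokenize(tweet: str) -> list[str]:
    """Tokenizer stage: raw tweet text -> lowercase word tokens."""
    return re.findall(r"[a-z0-9#@']+", tweet.lower())


def keep(word: str) -> bool:
    """Filterer stage: drop stop words."""
    return word not in STOP_WORDS


def shard_for(word: str, num_shards: int) -> int:
    """Pick a shard with a stable hash so a given word always lands
    on the same counter updater (assumed partitioning scheme)."""
    return zlib.crc32(word.encode("utf-8")) % num_shards


class ShardedWordCount:
    """Counter updater stage, split into independent shards."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [Counter() for _ in range(num_shards)]

    def on_tweet(self, tweet: str) -> None:
        """Handle one raw event: tokenize, filter, update the owning shard."""
        for word in filter(keep, tokenize(tweet)):
            self.shards[shard_for(word, len(self.shards))][word] += 1

    def count(self, word: str) -> int:
        """Read one word's count from the single shard that owns it."""
        return self.shards[shard_for(word, len(self.shards))][word]

    def top(self, n: int) -> list[tuple[str, int]]:
        """Merge all shards to answer a 'hottest topics' query."""
        merged = Counter()
        for shard in self.shards:
            merged.update(shard)
        return merged.most_common(n)


counts = ShardedWordCount(num_shards=4)
for tweet in ["Storm is fast", "storm and Hadoop", "the grid feeds Storm"]:
    counts.on_tweet(tweet)

print(counts.count("storm"))  # 3
print(counts.top(1))          # [('storm', 3)]
```

Because each word is owned by exactly one shard, the shards never need to coordinate on updates. Only a query such as `top` has to merge them.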
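
Slide 27 mentions batching as the way to optimize IO against slow persistence such as a database. The sketch below shows the general pattern only: increments accumulate in memory and go to the store one batch at a time. It does not reproduce Storm's own batching mechanism. `BatchingCounterWriter`, `write_batch` and `store_calls` are hypothetical names, and a plain list stands in for the database.

```python
from collections import Counter
from typing import Callable


class BatchingCounterWriter:
    """Accumulates counter increments in memory and writes them in batches."""

    def __init__(
        self,
        write_batch: Callable[[dict[str, int]], None],
        batch_size: int,
    ) -> None:
        self._write_batch = write_batch  # one call per batch to the slow store
        self._batch_size = batch_size    # events to absorb before flushing
        self._pending = Counter()
        self._events = 0

    def increment(self, word: str) -> None:
        """Record one event; flush once the batch is full."""
        self._pending[word] += 1
        self._events += 1
        if self._events >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send pending deltas to the store as a single write, then reset."""
        if self._pending:
            self._write_batch(dict(self._pending))
            self._pending.clear()
        self._events = 0


store_calls: list[dict[str, int]] = []  # stand-in for a database
writer = BatchingCounterWriter(store_calls.append, batch_size=3)
for word in ["storm", "grid", "storm", "hadoop"]:
    writer.increment(word)
writer.flush()  # push the final partial batch

print(store_calls)  # [{'storm': 2, 'grid': 1}, {'hadoop': 1}]
```

Four events reach the store as two writes, and repeated words collapse into a single delta per batch. The trade-off is that pending counts live only in memory until the next flush, which is why the slide asks for a stream provider that supports persistence and reliability.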