Real Time Analytics for Big Data: A Twitter Case Study
 


Learn how to build a Twitter-like analytics system, designed to meet real-time needs, in a simple way, using frameworks such as Spring Social, an active in-memory data grid for Big Data event processing, and a NoSQL database.
Hadoop's batch-oriented processing is sufficient for many use cases, especially where reporting doesn't need to be up-to-the-minute. However, batch processing isn't always adequate, particularly when serving online needs such as mobile and web clients, or markets whose conditions change in real time, such as finance and advertising.
In the same way that Hadoop was born out of large-scale web applications, a new class of scalable frameworks and platforms for streaming or real-time analysis and processing has emerged to handle the needs of large-scale location-aware mobile, social and sensor use. Do we want to limit ourselves to just these use cases?
Facebook, Twitter and Google have been pioneers in that arena and recently launched new analytics services designed to meet real-time needs.
In this session we will review the common patterns and architecture that drive these platforms and learn how to build a Twitter-like analytics system in a simple way using frameworks such as Spring Social, an active in-memory data grid for Big Data event processing, and a NoSQL database such as Cassandra or HBase for managing the historical data.

  • Additional references:
    http://blog.gigaspaces.com/2012/01/27/analytics-for-big-data-%E2%80%93-venturing-with-the-twitter-use-case/

    Video cast:
    http://blog.xebia.fr/2012/02/10/concevez-une-application-datagrid-nosql-hautement-scalable-avec-nati-shalom-fondateur-et-cto-gigaspaces-episode-1/
  • ActiveInsight

Presentation Transcript

  • Real Time Analytics for Big Data: A Twitter Inspired Case Study (@natishalom)
  • Big Data Predictions ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
  • The Two Vs of Big Data: Velocity, Volume
  • We’re Living in a Real Time World…
    • Social
    • User Tracking & Engagement
    • Homeland Security
    • eCommerce
    • Financial Services
    • Real Time Search
  • The Flavors of Big Data Analytics: Counting, Correlating, Research
  • Analytics @ Twitter – Counting
    • How many signups, tweets, retweets for a topic?
    • What’s the average latency?
    • Demographics
    • Countries and cities
    • Gender
    • Age groups
    • Device types
    • …
  • Analytics @ Twitter – Correlating
    • What devices fail at the same time?
    • What features get users hooked?
    • What places on the globe are “happening”?
  • Analytics @ Twitter – Research
    • Sentiment analysis
    • “Obama is popular”
    • Trends
    • “People like to tweet after watching American Idol”
    • Spam patterns
    • How can you tell when a user spams?
  • It’s All about Timing
    • “Real time” (< few seconds)
    • Reasonably quick (seconds - minutes)
    • Batch (hours/days)
  • It’s All about Timing
    • Event driven / stream processing; high resolution – every tweet gets counted
    • Ad-hoc querying; medium resolution (aggregations)
    • Long running batch jobs (ETL, map/reduce); low resolution (trends & patterns)
    • Callout on the slide: “This is what we’re here to discuss”
  • Challenge – Word Count
    • Diagram labels: Tweets, ?, Count, Word:Count
    • Hottest topics
    • URL mentions
    • etc.
  • URL Mentions – Here’s One Use Case
  • Twitter in Numbers (March 2011): It takes a week for users to send 1 billion Tweets. Source: http://blog.twitter.com/2011/03/numbers.html
  • Twitter in Numbers (March 2011): On average, 140 million tweets get sent every day. Source: http://blog.twitter.com/2011/03/numbers.html
  • Twitter in Numbers (March 2011): The highest throughput to date is 6,939 tweets/sec. Source: http://blog.twitter.com/2011/03/numbers.html
  • Twitter in Numbers (March 2011): 460,000 new accounts are created daily. Source: http://blog.twitter.com/2011/03/numbers.html
  • Twitter in Numbers: 5% of the users generate 75% of the content. Source: http://www.sysomos.com/insidetwitter/
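To put the quoted figures on a common scale, the short sketch below converts the daily average into a per-second rate and compares it with the quoted peak. It uses only the numbers from the slides; the variable names are illustrative, and the calculation assumes the daily average is spread evenly over 24 hours.

```python
# Figures quoted on the "Twitter in Numbers" slides (March 2011).
TWEETS_PER_DAY = 140_000_000      # average tweets sent per day
PEAK_TWEETS_PER_SEC = 6_939       # highest throughput to date
TWEETS_PER_WEEK = 1_000_000_000   # "a week to send 1 billion Tweets"

SECONDS_PER_DAY = 24 * 60 * 60

# Average rate, assuming tweets are spread evenly over the day.
avg_tweets_per_sec = TWEETS_PER_DAY / SECONDS_PER_DAY

# How far the quoted peak sits above that average.
peak_to_avg = PEAK_TWEETS_PER_SEC / avg_tweets_per_sec

# Cross-check: one billion per week expressed as a daily figure.
daily_from_weekly = TWEETS_PER_WEEK / 7

print(f"average: {avg_tweets_per_sec:,.0f} tweets/sec")              # average: 1,620 tweets/sec
print(f"peak is {peak_to_avg:.1f}x the average")                     # peak is 4.3x the average
print(f"1 billion/week is {daily_from_weekly / 1e6:.0f} million/day")  # 1 billion/week is 143 million/day
```

The weekly and daily figures agree with each other, and the peak is roughly four times the average, so a system sized only for the average rate would fall behind during bursts.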
  • Analyze the Problem
    • (Tens of) thousands of tweets per second to process
    • Assumption: Need to process in near real time
    • Aggregate counters for each word
    • A few 10s of thousands of words (or hundreds of thousands if we include URLs)
    • System needs to linearly scale
    • System needs to be fault tolerant
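As a point of reference for "aggregate counters for each word", here is a minimal single-process sketch in Python. It is not the presenter's implementation: the stop-word list, the token pattern and the function names are all assumptions made for illustration, and it ignores the scaling and fault-tolerance requirements listed above.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real filter would be much longer.
STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in", "on", "for"}

# A URL is kept as one token; otherwise match words, hashtags and mentions.
TOKEN_RE = re.compile(r"https?://\S+|[#@]?\w+")


def tokenize(tweet):
    """Split a tweet into tokens; words are lowercased, URLs are left as-is."""
    tokens = TOKEN_RE.findall(tweet)
    return [t if t.startswith("http") else t.lower() for t in tokens]


def keep(token):
    """Drop stop words and single-character tokens."""
    return token not in STOP_WORDS and len(token) > 1


def count_words(tweets, counts=None):
    """Update (or create) a word -> count mapping from an iterable of tweets."""
    counts = Counter() if counts is None else counts
    for tweet in tweets:
        counts.update(tok for tok in tokenize(tweet) if keep(tok))
    return counts


sample = [
    "Big data is the new oil http://example.com/a",
    "Real time big data analytics http://example.com/a",
    "The data grid keeps data in memory",
]
counts = count_words(sample)
print(counts.most_common(1))           # [('data', 4)]
print(counts["http://example.com/a"])  # 2
```

Passing an existing `counts` object back into `count_words` lets the same counters be updated incrementally as new tweets arrive, which is the behaviour a streaming system needs.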
  • Key Elements in Real Time Big Data Analytics
  • Sharding (Partitioning)
    • Partition 1: Tokenizer 1 → Filterer 1 → Counter Updater 1
    • Partition 2: Tokenizer 2 → Filterer 2 → Counter Updater 2
    • Partition 3: Tokenizer 3 → Filterer 3 → Counter Updater 3
    • Partition n: Tokenizer n → Filterer n → Counter Updater n
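One way to picture the partitioned counters is the in-process sketch below. It assumes words are routed by a stable hash so that each word's counter lives in exactly one partition; the slides do not state the routing key, and the class and method names here are hypothetical. In a real deployment each shard would be a separate process or grid partition rather than an object in a list.

```python
import zlib
from collections import Counter


class Shard:
    """One partition: owns the counters for the words routed to it."""

    def __init__(self):
        self.counts = Counter()

    def update(self, word):
        self.counts[word] += 1


class ShardedCounter:
    """Routes each word to one of N shards by a stable hash of the word."""

    def __init__(self, num_shards):
        self.shards = [Shard() for _ in range(num_shards)]

    def shard_for(self, word):
        # crc32 gives the same value in every process, unlike the built-in
        # hash() on strings, which is randomized per interpreter run.
        index = zlib.crc32(word.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def add(self, word):
        self.shard_for(word).update(word)

    def count(self, word):
        # A single-word lookup only touches the shard that owns the word.
        return self.shard_for(word).counts[word]

    def top(self, n):
        # A global ranking has to merge the counters of every shard.
        merged = Counter()
        for shard in self.shards:
            merged.update(shard.counts)
        return merged.most_common(n)


counter = ShardedCounter(num_shards=4)
for word in ["data", "grid", "data", "memory", "data"]:
    counter.add(word)

print(counter.count("data"))  # 3
print(counter.top(1))         # [('data', 3)]
```

Because updates for different words land on different shards, adding shards spreads the write load, while queries such as "hottest topics" pay the cost of a merge across partitions.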
  • Keep Things In Memory
    • Facebook keeps 80% of its data in memory (Stanford research)
    • RAM is 100-1000x faster than disk (random seek)
    • Disk: 5-10 ms
    • RAM: ~0.001 ms
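The latency figures can be turned into rough operations-per-second budgets, as in the sketch below. It takes the two latency numbers on the slide at face value and assumes one random access per operation with no caching or batching; the variable names are illustrative. Note that the ratio computed from these two figures comes out higher than the 100-1000x headline on the same slide.

```python
# Latencies from the slide, expressed in microseconds to keep the math exact.
DISK_SEEK_US = [5_000, 10_000]   # random disk seek: 5-10 ms
RAM_ACCESS_US = 1                # RAM access: ~0.001 ms
PEAK_TWEETS_PER_SEC = 6_939      # peak rate from the "Twitter in Numbers" slide

US_PER_SEC = 1_000_000

# Sequential random accesses per second that each medium can serve.
disk_ops_per_sec = [US_PER_SEC // us for us in DISK_SEEK_US]
ram_ops_per_sec = US_PER_SEC // RAM_ACCESS_US

# Ratio between the two latency figures.
speedup = [us // RAM_ACCESS_US for us in DISK_SEEK_US]

print(disk_ops_per_sec)  # [200, 100]
print(ram_ops_per_sec)   # 1000000
print(speedup)           # [5000, 10000]

# One random seek per tweet cannot keep up with the peak rate on one disk.
print(max(disk_ops_per_sec) < PEAK_TWEETS_PER_SEC)  # True
```

Under these assumptions a single disk serves at most a few hundred random accesses per second, well short of the quoted peak tweet rate, whereas in-memory counters leave ample headroom.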
  • Use EDA (Event Driven Architecture)
  • Putting it all together
  • Know Your Toolset
  • References
    • Writing your own twitter analytics: http://ht.ly/d8j4I
    • Detailed blog post: http://bit.ly/gs-bigdata-analytics
    • Twitter in numbers: http://blog.twitter.com/2011/03/numbers.html
    • Twitter Storm: http://bit.ly/twitter-storm
    • Apache S4: http://incubator.apache.org/s4/