Real Time Analytics for Big Data a Twitter Case Study
Hadoop's batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn't need to be up-to-the-minute. However, batch processing isn't always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising.
In the same way that Hadoop was born out of large-scale web applications, a new class of scalable frameworks and platforms for real-time stream processing and real-time analytics has emerged to handle the needs of large-scale location-aware mobile, social, and sensor applications.

Facebook, Twitter and Google have been pioneers in this arena and have recently launched new analytics services designed to meet real-time needs.

In this session we will review the common patterns and architectures that drive these platforms, and learn how to build a Twitter-like analytics system in a simple way using frameworks such as Spring Social, an active in-memory data grid for Big Data event processing, and a NoSQL database such as Cassandra or HBase for managing the historical data.

Participants in this session will also receive a hands-on tutorial for trying out these patterns in their own environment.

A detailed post covering the topic including a reference to a code example illustrating the reference architecture is available below:

http://horovits.wordpress.com/2012/01/27/analytics-for-big-data-venturing-with-the-twitter-use-case/


  • Additional references:
    http://blog.gigaspaces.com/2012/01/27/analytics-for-big-data-%E2%80%93-venturing-with-the-twitter-use-case/

    Video cast:
    http://blog.xebia.fr/2012/02/10/concevez-une-application-datagrid-nosql-hautement-scalable-avec-nati-shalom-fondateur-et-cto-gigaspaces-episode-1/
  • ActiveInsight
  • http://gigaom.com/2011/08/16/twitter-by-the-numbers-infographic/
  • RAM is 100-1000x faster than disk (random seek): disk is 5-10 ms, RAM is ~0.001 ms. Instead of treating memory as a cache, why not treat it as a primary data store? Value: write/read scaling through partitioning, performance through memory speed, reliability through replication and redundancy.
  • Value: a data grid like GigaSpaces comes with a rich set of APIs that provide not only the means to store data fast and reliably, but also to access and query the data just as you would with a database. Specifically, GigaSpaces supports both JPA and Document APIs, and ways to mix and match between those APIs. Unlike Scribe and log systems, we can now look at the data as it comes in, not only once it is stored in the database. This makes it possible to partition data based on time: the first day in memory, and the rest in the database.
  • Value gained: avoid lock-in to a specific NoSQL API; performance (reduced network hops and serialization overhead); simplicity (fewer moving parts); scalability without compromising on consistency (strict consistency at the front, eventual consistency for the long-term data); JPA/standard APIs.
  • Tomcat for the web front end; Data PU (processing unit) for processing the word counts; Cassandra for long-term storage.
  • Content-based routing, workflow.

Presentation Transcript

  • Real Time Analytics for Big Data Lessons from Twitter..
  • The Real Time Boom: Google Real Time Web Analytics, Google Real Time Search, Facebook Real Time Social Analytics, Twitter paid tweet analytics, SaaS Real Time User Tracking, new Real Time Analytics startups. ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
  • Analytics @ Twitter
  • Note the Time dimension
  • The data resolution & processing models
  • Twitter Real-time Analytics System
  • Twitter by the numbers
  • Real-time Analytics for Twitter Reach: Reach is the number of unique people exposed to a URL on Twitter
  • Computing Reach: Tweets → Followers → Distinct Followers → Reach
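The reach pipeline above can be sketched in-process as follows; this is an illustrative stand-in (class and method names are hypothetical, not the GigaSpaces API):

```java
import java.util.*;

// Illustrative sketch of the reach pipeline (hypothetical names, not the
// GigaSpaces API): collect the followers of everyone who tweeted a URL,
// de-duplicate, and count.
class Reach {
    // followersOf maps a user to that user's followers (assumed input shape).
    static int computeReach(List<String> tweeters,
                            Map<String, Set<String>> followersOf) {
        Set<String> distinct = new HashSet<>();   // distinct-followers step
        for (String user : tweeters) {
            distinct.addAll(followersOf.getOrDefault(user, Set.of()));
        }
        return distinct.size();                   // reach = unique people exposed
    }
}
```

The de-duplication is the expensive step at Twitter scale; here a `HashSet` plays the role the distributed distinct-count plays in the real system.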
  • Challenge – Word Count: Tweets → Word:Count
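A minimal sketch of the word-count challenge itself, assuming whitespace-tokenized tweets (names are illustrative, not the actual Twitter/GigaSpaces code):

```java
import java.util.*;

// Minimal sketch of the word-count challenge: split each tweet on
// whitespace and tally each word's occurrences.
class WordCount {
    static Map<String, Integer> count(List<String> tweets) {
        Map<String, Integer> counts = new HashMap<>();
        for (String tweet : tweets) {
            for (String word : tweet.toLowerCase().split("\\s+")) {
                if (!word.isEmpty()) counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```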
  • Real-time Analytics System With GigaSpaces
  • Performance/Latency
    • Instead of treating memory as a cache, why not treat it as a primary data store?
      • Facebook keeps 80% of its data in Memory (Stanford research)
      • RAM is 100-1000x faster than Disk (Random seek)
        • Disk - 5 -10ms
        • RAM – x0.001msec
  • ONE Data any API ’s:
    • Document
      • Storing tweets
    • Key/Value
      • Atomic-counters
    • JPA
      • Complex query
    • Executors
      • Real Time Map/Reduce
      • Aggregated join query
    Any API: the right API for the job
  • Availability
    • Backup node per partition
    • Synchronous replication to ensure consistency and no data loss
    • On-demand backup to minimize over-provisioning overhead (cost, performance, capacity)
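The availability scheme above, a backup node per partition with synchronous replication, can be sketched as a simplified in-process stand-in (assumed semantics per the slide):

```java
import java.util.*;

// Simplified sketch of one partition with a synchronous backup: a write is
// acknowledged only after the backup has applied it, so no acknowledged
// write is lost if the primary fails.
class ReplicatedPartition {
    private final Map<String, String> primary = new HashMap<>();
    private final Map<String, String> backup = new HashMap<>();

    boolean write(String key, String value) {
        primary.put(key, value);
        backup.put(key, value);   // synchronous replication: applied before the ack
        return true;              // ack only once both copies hold the value
    }
    // On primary failure the backup already holds every acknowledged write.
    String readFromBackup(String key) { return backup.get(key); }
}
```

The trade-off named on the slide is visible here: synchronous replication costs a round trip per write, but guarantees consistency and no data loss for acknowledged writes.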
  • Write Scalability/Performance
    • Partition incoming tweets based on tweet id
    • Partition word/count index based on word hash key – all updates to the same index are routed to the same partition.
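The two routing rules above can be sketched as a hash-based router (a hypothetical, simplified stand-in for the data grid's content-based routing):

```java
// Tweets partition by tweet id; word/count index entries partition by the
// word's hash, so all updates to the same word land on the same partition.
class Router {
    private final int partitions;
    Router(int partitions) { this.partitions = partitions; }

    int partitionForTweet(long tweetId) {
        return (int) Math.floorMod(tweetId, partitions);
    }
    int partitionForWord(String word) {
        // Same word -> same hash -> same partition, so its counter
        // is always updated in exactly one place.
        return Math.floorMod(word.hashCode(), partitions);
    }
}
```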
  • Global index update - optimization
    • Use batch write for updating the word/count index
    • Use atomic update to ensure consistency with minimum performance overhead on concurrent updates
    (Diagram: per-word partitions word1/word2/word3; local batch update flushed every 1 sec)
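The batched, atomic global-index update can be sketched with in-process stand-ins for the grid structures (illustrative names, not the GigaSpaces API):

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.*;

// Word deltas accumulate locally and are flushed as one batch; the global
// index applies each delta atomically, so concurrent flushes from
// different nodes cannot lose updates.
class BatchedIndex {
    private final Map<String, Integer> localDeltas = new HashMap<>();
    private final ConcurrentHashMap<String, AtomicLong> globalIndex =
            new ConcurrentHashMap<>();

    void record(String word) {                 // cheap local accumulation
        localDeltas.merge(word, 1, Integer::sum);
    }
    void flush() {                             // e.g. triggered every second
        localDeltas.forEach((word, delta) ->
            globalIndex.computeIfAbsent(word, w -> new AtomicLong())
                       .addAndGet(delta));     // atomic update of the shared count
        localDeltas.clear();
    }
    long countOf(String word) {
        AtomicLong c = globalIndex.get(word);
        return c == null ? 0 : c.get();
    }
}
```

Batching trades a second of staleness for far fewer updates against the shared index; the atomic add keeps concurrent flushes consistent.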
  • Processing the data
    • Use event handlers to process the data as it comes in.
    • Use shared state to control the flow (order)
    • Use FIFO to ensure order
    • Use local TX to recover from failure
    Flow: 1) Parse → 2) Update global index
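The event flow above can be sketched as follows; the FIFO queue preserves arrival order, and each event passes through the parse and update-index handlers. The slide's transactions and failure recovery are omitted, so this is the happy path only:

```java
import java.util.*;

// Sketch of the two-step event flow: FIFO ordering, then
// 1) parse and 2) update the global index per event.
class EventPipeline {
    private final Deque<String> fifo = new ArrayDeque<>();
    private final Map<String, Integer> index = new HashMap<>();

    void onTweet(String raw) { fifo.addLast(raw); }   // events enqueue in order

    void drain() {
        while (!fifo.isEmpty()) {
            String raw = fifo.pollFirst();            // FIFO guarantees ordering
            for (String word : raw.toLowerCase().split("\\s+")) {  // 1) parse
                if (!word.isEmpty())
                    index.merge(word, 1, Integer::sum);            // 2) update index
            }
        }
    }
    int countOf(String word) { return index.getOrDefault(word, 0); }
}
```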
  • Collocate
    • Group event handler and the data into processing-units
      • Minimize latency
      • Simple scaling (less moving parts)
      • Better reliability
    Flow: 1) Parse → 2) Update global index
  • BigData Database for long term data
    • Write-behind
      • Batch updates to the DB to minimize disk performance and latency overhead
      • Logs are backed up to avoid data loss between batches
    • Plug-in to any DB
      • Use a plug-in approach to enable flexibility in choosing the right DB for the job
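The write-behind pattern above can be sketched as follows; `Store` is a hypothetical plug-in point standing in for Cassandra or HBase, not a real driver API:

```java
import java.util.*;

// Sketch of write-behind: writes hit memory immediately and are persisted
// to the long-term store in batches, cutting per-write disk overhead.
class WriteBehindCache {
    interface Store { void writeBatch(Map<String, String> batch); }

    private final Map<String, String> memory = new HashMap<>();
    private final Map<String, String> dirty = new LinkedHashMap<>(); // pending writes
    private final Store store;
    private final int batchSize;

    WriteBehindCache(Store store, int batchSize) {
        this.store = store;
        this.batchSize = batchSize;
    }

    void put(String key, String value) {
        memory.put(key, value);        // fast in-memory write, immediately readable
        dirty.put(key, value);         // queued for the long-term store
        if (dirty.size() >= batchSize) flush();
    }

    void flush() {                     // one batched round trip to the database
        if (!dirty.isEmpty()) {
            store.writeBatch(new LinkedHashMap<>(dirty));
            dirty.clear();
        }
    }
    String get(String key) { return memory.get(key); }
}
```

In the real architecture the pending-write log is also replicated (as the slide notes) so a node failure between batches does not lose data; that is omitted here.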
  • Automation & Cloud enablement: My Recipe: 2 Tomcats, 10 Data PUs, 10 Cassandra nodes
  • Application description through RECIPES
    • Recipe DSL
    • Lifecycle scripts
    • Custom plug-ins (optional)
    • Service binaries (optional)
      application {
        name = "simple app"
        service {
          name = "mysql-service"
        }
        service {
          name = "jboss-service"
          dependsOn = ["mysql-service"]
        }
      }

    Example service recipe with lifecycle scripts:

      service {
        name "jboss-service"
        icon "jboss.jpg"
        type "APP_SERVER"
        numInstances 2
        [recipe body]
      }
      lifecycle {
        init "mysql_install.groovy"
        start "mysql_start.groovy"
        stop "mysql_stop.groovy"
      }
  • Cloudify Application Management
  • Putting it all together Analytic Application Event Sources Write behind
    • - In Memory Data Grid
    • - RT Processing Grid
    • Light Event Processing
    • Map-reduce
    • Event driven
    • Execute code with data
    • Transactional
    • Secured
    • Elastic
    • NoSQL DB
    • Low cost storage
    • Write/Read scalability
    • Dynamic scaling
    • Raw Data and aggregated Data
    Generate Patterns / Real Time Map/Reduce:

      Script script = new StaticScript("groovy", "println hi; return 0");
      Query q = em.createNativeQuery("execute ?");
      q.setParameter(1, script);
      Integer result = q.getSingleResult();
  • Economic Data Scaling
    • Combine memory and disk
      • Memory is 10-100x cheaper than disk for high data-access rates (Stanford research)
      • Disk is cheaper for high capacity at lower access rates
      • Solution:
        • Memory - short-term data,
        • Disk - long term. data
      • Only ~16 GB is required to store the log in memory (500b messages at 10k/h), at a cost of ~$32/month per server.
  • Economic Scaling
    • Automation - reduce operational cost
    • Elastic Scaling – reduce over provisioning cost
    • Cloud portability (JClouds) – choose the right cloud for the job
    • Cloud bursting – scavenge extra capacity when needed
  • Big Data Predictions
  • Summary: Big Data development made simple. Focus on your business logic; use a Big Data platform to deal with scalability, performance, continuous availability, and more. It's open: use any stack and avoid lock-in; any database (RDBMS or NoSQL), any cloud, common APIs and frameworks. All while minimizing cost: use memory and disk for optimum cost/performance; built-in automation and management reduce operational costs; elasticity reduces over-provisioning cost.
  • Thank YOU! @natishalom http://blog.gigaspaces.com