Hadoop's batch-oriented processing is sufficient for many use cases, especially where the frequency of data reporting doesn't need to be up-to-the-minute. However, batch processing isn't always adequate, particularly when serving online needs such as mobile and web clients, or markets with real-time changing conditions such as finance and advertising.
In the same way that Hadoop was born out of large-scale web applications, a new class of scalable frameworks and platforms for real-time stream processing and real-time analysis has emerged to handle the needs of large-scale location-aware mobile, social and sensor use.
Facebook, Twitter and Google have pioneered this area and recently launched new analytics services designed to meet these real-time needs.
In this session we will review the common patterns and architectures that drive these platforms and learn how to build a Twitter-like analytics system in a simple way using frameworks such as Spring Social, an Active In-Memory Data Grid for Big Data event processing, and a NoSQL database such as Cassandra or HBase for managing the historical data.
Participants in this session will also receive a hands-on tutorial for trying out these patterns in their own environment.
A detailed post covering the topic, including a reference to a code example illustrating the reference architecture, is available below:
http://horovits.wordpress.com/2012/01/27/analytics-for-big-data-venturing-with-the-twitter-use-case/
2. The Real Time Boom: Google Real Time Web Analytics; Google Real Time Search; Facebook Real Time Social Analytics; Twitter paid tweet analytics; SaaS Real Time User Tracking; new Real Time Analytics startups. ® Copyright 2011 Gigaspaces Ltd. All Rights Reserved
7. Twitter by the numbers
8. Real-time Analytics for Twitter Reach: Reach is the number of unique people exposed to a URL on Twitter.
9. Computing Reach: Tweets -> Followers -> Distinct Followers -> Count -> Reach
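The reach pipeline on this slide can be sketched in a few lines of Python. This is only a minimal single-process illustration, not the implementation used in the talk: the function name `compute_reach`, the `(user, url)` tweet tuples, and the `followers` dictionary are all assumptions made for the example.

```python
from typing import Dict, Iterable, Set, Tuple


def compute_reach(url: str,
                  tweets: Iterable[Tuple[str, str]],
                  followers: Dict[str, Set[str]]) -> int:
    """Return the number of distinct followers exposed to `url`.

    `tweets` is a sequence of (user, url) pairs and `followers` maps a user
    to the set of accounts following that user. Both shapes are hypothetical.
    """
    # Step 1: find everyone who tweeted the URL.
    tweeters = {user for user, tweeted_url in tweets if tweeted_url == url}

    # Step 2 and 3: gather their followers; the set union removes duplicates.
    distinct_followers: Set[str] = set()
    for user in tweeters:
        distinct_followers |= followers.get(user, set())

    # Step 4: reach is the size of the de-duplicated follower set.
    return len(distinct_followers)


# Hypothetical sample data for illustration only.
TWEETS = [("alice", "http://a"), ("bob", "http://a"), ("carol", "http://b")]
FOLLOWERS = {
    "alice": {"dave", "erin"},
    "bob": {"erin", "frank"},
    "carol": {"dave"},
}
```

With the sample data, `compute_reach("http://a", TWEETS, FOLLOWERS)` counts "erin" only once even though she follows both tweeters. At Twitter scale the follower sets do not fit on one machine, which is why the distinct step is the expensive part and is typically partitioned.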
10. Challenge – Word Count: Tweets -> Count -> Word:Count
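As a rough illustration of the word-count challenge, the sketch below updates word counts incrementally as each tweet arrives, rather than in a later batch job. The class name `WordCounter`, the `on_tweet` callback, and the tokenization rule are assumptions for the example, not part of any framework named in the talk.

```python
import re
from collections import Counter
from typing import List, Tuple

# Assumed tokenization: lowercase runs of letters, digits and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


class WordCounter:
    """Keeps running word counts that are updated per incoming tweet."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def on_tweet(self, text: str) -> None:
        # Called once per tweet event; counts are current immediately after.
        self.counts.update(WORD_RE.findall(text.lower()))

    def top(self, n: int) -> List[Tuple[str, int]]:
        # Return the n most frequent words with their counts.
        return self.counts.most_common(n)
```

Feeding it `"Big data is big"` and then `"Real time data"` leaves `counts["big"]` and `counts["data"]` both at 2. In a distributed setting each word would be routed to one partition so that its counter is updated in a single place.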
27. Summary
Big Data development made simple: focus on your business logic; use a Big Data platform to deal with scalability, performance, continuous availability, ...
It's open: use any stack and avoid lock-in. Any database (RDBMS or NoSQL), any cloud; use common APIs and frameworks.
All while minimizing cost: use memory and disk for optimum cost/performance; built-in automation and management reduce operational costs; elasticity reduces over-provisioning cost.
RAM is 100-1000x faster than disk for random seeks (disk: 5-10 ms; RAM: x0.001 msec). Instead of treating memory as a cache, why not treat it as a primary data store? Value: write/read scaling through partitioning; performance through memory speed; reliability through replication and redundancy.
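To make the partitioning and replication points concrete, here is a minimal in-process sketch of a memory store that routes each key to a partition by hash and keeps one synchronous backup copy per partition. It is not the GigaSpaces API; `PartitionedStore`, `fail_over`, and the use of CRC32 for routing are assumptions chosen for the illustration.

```python
import zlib
from typing import Any, Dict, List, Optional


class PartitionedStore:
    """Hypothetical in-memory store: hash partitioning plus one backup each."""

    def __init__(self, num_partitions: int) -> None:
        self.primaries: List[Dict[str, Any]] = [{} for _ in range(num_partitions)]
        self.backups: List[Dict[str, Any]] = [{} for _ in range(num_partitions)]

    def partition_of(self, key: str) -> int:
        # CRC32 is stable across runs, unlike Python's built-in hash() on str.
        return zlib.crc32(key.encode("utf-8")) % len(self.primaries)

    def write(self, key: str, value: Any) -> None:
        p = self.partition_of(key)
        self.primaries[p][key] = value
        # Synchronous replication: the backup is updated before write returns.
        self.backups[p][key] = value

    def read(self, key: str) -> Optional[Any]:
        return self.primaries[self.partition_of(key)].get(key)

    def fail_over(self, p: int) -> None:
        # Simulate losing primary p: promote the backup, then re-create a backup.
        self.primaries[p] = self.backups[p]
        self.backups[p] = dict(self.primaries[p])
```

Because each key lives in exactly one partition, adding partitions spreads both reads and writes, and the backup copy lets a partition survive the loss of its primary without touching disk.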
Value: A data grid like GigaSpaces comes with a rich set of APIs that provide not only the means to store data fast and reliably but also to access and query it just as you would with a database. GigaSpaces specifically supports both the JPA and Document APIs, and a way to mix and match between them. Unlike Scribe and log-based systems, we can now look at the data as it comes in, not only once it is stored in the database. The latter makes it possible to partition data based on time: the first day in memory and the rest in the database, and so on.
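The time-based partitioning idea (recent data in memory, older data in the database) can be sketched as follows. This is an assumption-laden toy, not the data grid's actual eviction mechanism: `TimeTieredStore` is a hypothetical name, a plain dictionary stands in for Cassandra or HBase, and timestamps are passed in explicitly to keep the example deterministic.

```python
from typing import Any, Dict, Optional, Tuple

ONE_DAY_SECONDS = 24 * 60 * 60


class TimeTieredStore:
    """Hypothetical store: entries newer than the window stay in memory."""

    def __init__(self, memory_window_seconds: int = ONE_DAY_SECONDS) -> None:
        self.window = memory_window_seconds
        self.memory: Dict[str, Tuple[float, Any]] = {}    # key -> (timestamp, value)
        self.database: Dict[str, Tuple[float, Any]] = {}  # stand-in for a NoSQL store

    def write(self, key: str, value: Any, now: float) -> None:
        # New data always lands in memory first, so it is queryable on arrival.
        self.memory[key] = (now, value)

    def evict_old(self, now: float) -> int:
        # Move entries older than the window from memory to the database.
        cutoff = now - self.window
        old_keys = [k for k, (ts, _) in self.memory.items() if ts < cutoff]
        for k in old_keys:
            self.database[k] = self.memory.pop(k)
        return len(old_keys)

    def read(self, key: str) -> Optional[Any]:
        # Check the fast tier first, then fall back to the long-term tier.
        entry = self.memory.get(key)
        if entry is None:
            entry = self.database.get(key)
        return None if entry is None else entry[1]
```

Readers see one logical store through `read`, while the memory tier only has to be sized for the most recent window of data.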
Value gained: avoid lock-in to a specific NoSQL API; performance (reduced network hops and serialization overhead); simplicity (fewer moving parts); scalability without compromising on consistency (strict consistency at the front, eventual consistency for the long-term data); JPA/standard API.
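One common way to get strict consistency at the front with eventual consistency for long-term data is a write-behind queue; the notes do not say this is the exact mechanism used, so treat the following as an illustrative assumption. `WriteBehindStore` and its methods are hypothetical names, and a dictionary stands in for the long-term database.

```python
from collections import deque
from typing import Any, Deque, Dict, Optional, Tuple


class WriteBehindStore:
    """Hypothetical front store with asynchronous persistence to a back store."""

    def __init__(self) -> None:
        self.memory: Dict[str, Any] = {}      # front tier: read-your-writes
        self.long_term: Dict[str, Any] = {}   # stand-in for Cassandra/HBase
        self.pending: Deque[Tuple[str, Any]] = deque()

    def write(self, key: str, value: Any) -> None:
        # The in-memory copy is updated immediately, so reads are consistent.
        self.memory[key] = value
        # Persistence is deferred; the long-term store catches up later.
        self.pending.append((key, value))

    def read(self, key: str) -> Optional[Any]:
        return self.memory.get(key)

    def flush(self, max_items: Optional[int] = None) -> int:
        # Drain queued writes in order; a real system would run this in the
        # background and batch the writes to the database.
        flushed = 0
        while self.pending and (max_items is None or flushed < max_items):
            key, value = self.pending.popleft()
            self.long_term[key] = value
            flushed += 1
        return flushed
```

Between a `write` and the next `flush`, the long-term store lags behind the memory tier, which is the eventual-consistency window the notes refer to.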
Tomcat for the web front end; a data PU (processing unit) for processing the word counts; Cassandra for long-term storage.