Cassandra summit-2013
1. Real Time Big Data With Storm,
Cassandra, and In-Memory Computing
DeWayne Filppi
@dfilppi
2. Big Data Predictions
“Over the next few years we'll see the adoption of scalable
frameworks and platforms for handling
streaming, or near real-time, analysis and processing. In the
same way that Hadoop has been borne out of large-scale web
applications, these platforms will be driven by the needs of large-
scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
3. The Two Vs of Big Data
Velocity Volume
4. We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking &
Engagement
Financial Services
5. The Flavors of Big Data Analytics
Counting Correlating Research
6. Analytics @ Twitter – Counting
How many signups,
tweets, retweets for a
topic?
What’s the average
latency?
Demographics
Countries and cities
Gender
Age groups
Device types
…
7. Analytics @ Twitter – Correlating
What devices fail at the
same time?
What features get users
hooked?
What places on the
globe are “happening”?
8. Analytics @ Twitter – Research
Sentiment analysis
“Obama is popular”
Trends
“People like to tweet
after watching
American Idol”
Spam patterns
How can you tell when
a user spams?
9. It’s All about Timing
“Real time”
(< a few seconds)
Reasonably Quick
(seconds - minutes)
Batch
(hours/days)
10. It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
This is what we’re here to discuss
12. In Memory Data Grid Review
RAM is the new disk
Data partitioned across a cluster
Large “virtual” memory space
Transactional
Highly available
Code collocated with data
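A minimal sketch of two of the ideas above, partitioning data across a cluster and running code next to the data. The class and method names (PartitionedGrid, write, read, execute) are hypothetical and are not the XAP API; each "node" is just a Python dict in one process.

```python
import zlib


class PartitionedGrid:
    """Toy in-memory grid: each partition is a dict owned by one 'node'."""

    def __init__(self, partitions=4):
        self.partitions = [dict() for _ in range(partitions)]

    def _route(self, key):
        # Stable hash so the same key always lands on the same partition.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def write(self, key, value):
        self.partitions[self._route(key)][key] = value

    def read(self, key):
        return self.partitions[self._route(key)].get(key)

    def execute(self, key, fn):
        # "Code collocated with data": fn runs against the owning partition,
        # so the value never has to travel to the caller to be updated.
        partition = self.partitions[self._route(key)]
        partition[key] = fn(partition.get(key))
        return partition[key]


grid = PartitionedGrid(partitions=4)
grid.write("tweets:storm", 1)
grid.execute("tweets:storm", lambda count: (count or 0) + 1)
```

In a real grid each partition would live on a separate machine with a backup copy; the routing step is what keeps the "virtual" memory space addressable as one.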
13. Data Grid + Cassandra: A Complete Solution
• Data flows through the in-memory cluster and is written asynchronously to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
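The flow in the bullets above can be sketched as a write-behind cache. This is an illustration only: WriteBehindGrid, its parameters, and the explicit flush() call are hypothetical, and a plain dict stands in for Cassandra. A real grid would drain the pending queue on a background thread.

```python
from collections import deque


class WriteBehindGrid:
    """Toy write-behind flow: memory first, backing store later."""

    def __init__(self, store, enrich=None, accept=None):
        self.memory = {}                 # results instantly available here
        self.pending = deque()           # writes not yet persisted
        self.store = store               # stand-in for Cassandra (dict-like)
        self.enrich = enrich or (lambda event: event)
        self.accept = accept or (lambda event: True)
        self.listeners = []              # callbacks notified on each write

    def write(self, key, event):
        if not self.accept(event):       # optional filtering
            return False
        event = self.enrich(event)       # optional enrichment
        self.memory[key] = event
        self.pending.append((key, event))
        for listener in self.listeners:
            listener(key, event)
        return True

    def flush(self, batch_size=100):
        # Called explicitly here; asynchronous in a real deployment.
        flushed = 0
        while self.pending and flushed < batch_size:
            key, event = self.pending.popleft()
            self.store[key] = event
            flushed += 1
        return flushed
```

The caller sees the enriched event in memory as soon as write() returns, while the backing store only catches up when flush() runs.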
14. Simplified Event Flow
15. Grid – Cassandra Interface
Hector and CQL based interface
In memory data must be mapped to column families.
Configurable class to column family mapping
Must serialize individual fields
Fixed fields can use defined types
Variable fields (for schemaless in-memory mode) need serializers
Object model flattening
By default, nested fields are flattened.
Can be overridden by custom serializer.
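As a rough illustration of object model flattening, the sketch below turns a nested in-memory object into a flat set of column names. The dotted naming scheme and the flatten() helper are assumptions made for this example; the actual column names produced by the XAP – Cassandra interface may differ.

```python
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into a single level of column -> value pairs."""
    columns = {}
    for name, value in obj.items():
        column = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            # Nested field: recurse and carry the parent path in the name.
            columns.update(flatten(value, column, sep))
        else:
            columns[column] = value
    return columns


tweet = {"id": 7, "user": {"name": "ann", "geo": {"city": "NYC"}}}
row = flatten(tweet)
# row == {"id": 7, "user.name": "ann", "user.geo.city": "NYC"}
```

A custom serializer would replace the recursive branch, for example by storing the whole nested object as a single serialized column.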
16. Virtues and Limitations
Could be faster: high availability has a cost
Complex flows not easy to assemble or understand with simple
event handlers
Complete stack, not just two tools of many
Fast.
Microsecond latencies for in memory operations
Fast enough for almost anybody
Highly available/self healing
Elastic
17. Storm Background
Popular open source, real time, in-memory, streaming computation platform.
Includes distributed runtime and intuitive API for defining distributed processing flows.
Scalable and fault tolerant.
Developed at BackType, and open sourced by Twitter
18. Storm Abstractions
Streams
Unbounded sequence of tuples
Spouts
Source of streams (Queues)
Bolts
Functions, Filters, Joins, Aggregations
Topologies
(Diagram labels: Spout, Bolt, Topologies)
19. Streaming word count with Storm
Storm has a simple builder interface for creating stream processing
topologies
Storm delegates persistence to external providers
Cassandra, because of its write performance, is commonly used
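To make the word count concrete without a Storm cluster, here is a single-process sketch of the spout → bolt → bolt shape. It does not use Storm's API; sentence_spout, split_bolt, CountBolt, and run_topology are hypothetical names, and Python generators stand in for streams of tuples.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: the source of the stream, emitting one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: a function that splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


class CountBolt:
    """Bolt: an aggregation that keeps a running count per word."""

    def __init__(self):
        self.counts = Counter()

    def process(self, stream):
        for (word,) in stream:
            self.counts[word] += 1
            yield (word, self.counts[word])


def run_topology(sentences):
    """Topology: wire the spout and bolts together and drain the stream."""
    counter = CountBolt()
    emitted = list(counter.process(split_bolt(sentence_spout(sentences))))
    return emitted, counter.counts


emitted, counts = run_topology(["the cow", "the moon"])
```

In this sketch the counts live only in the bolt's memory; in a Storm deployment that state would be handed to an external provider such as Cassandra.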
20. Storm: Optimistic Processing
Storm (quite rationally) assumes success is normal
Storm uses batching and pipelining for performance
Therefore the spout must be able to replay tuples on demand
in case of error.
Any queue-like data source can be fashioned
into a spout.
No persistence is ever required, and speed is attained by
minimizing network hops during topology processing.
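The replay requirement can be sketched as follows. The method names next_tuple, ack, and fail echo the callbacks a Storm spout implements, but this ReplayableSpout class is a hypothetical stand-alone toy, not a Storm component: it keeps emitted tuples in an in-flight map until they are acknowledged and re-queues them on failure.

```python
from collections import deque


class ReplayableSpout:
    """Toy spout over a queue-like source that can replay failed tuples."""

    def __init__(self, source):
        # Assign each item a message id so it can be acked or failed later.
        self.queue = deque(enumerate(source))
        self.in_flight = {}

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, value = self.queue.popleft()
        self.in_flight[msg_id] = value   # remember it until it is acked
        return msg_id, value

    def ack(self, msg_id):
        # Success is the normal case: just forget the tuple.
        self.in_flight.pop(msg_id, None)

    def fail(self, msg_id):
        # On error, put the tuple back at the head of the queue for replay.
        if msg_id in self.in_flight:
            self.queue.appendleft((msg_id, self.in_flight.pop(msg_id)))


spout = ReplayableSpout(["tweet-a", "tweet-b"])
first = spout.next_tuple()
spout.fail(first[0])           # processing failed downstream
replayed = spout.next_tuple()  # the same tuple is emitted again
```

The optimistic part is that nothing is written anywhere on the success path; the only cost is holding unacknowledged tuples in memory.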
21. Fast. Want to go faster?
Eliminate non-memory components
Replace the disk-based queue with a reliable in-memory queue
Replace disk-based state persistence with in-memory
persistence
Asynchronously update disk based state (C*)
22. Sample Architecture
23. References
Try the Cloudify recipe
Download Cloudify : http://www.cloudifysource.org/
Download the Recipe (apps/xapstream, services/xapstream):
– https://github.com/CloudifySource/cloudify-recipes
XAP – Cassandra Interface Details:
http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
Check out the source for the XAP Spout, a sample state
implementation backed by XAP, and a Storm-friendly streaming
implementation on GitHub:
https://github.com/Gigaspaces/storm-integration
For more background on the effort, check out my recent blog posts at
http://blog.gigaspaces.com/
http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
Part 3 coming soon.
25. Twitter Storm With Cassandra
26. Storm Overview
27. Storm Concepts
Streams
Unbounded sequence of tuples
Spouts
Source of streams (Queues)
Bolts
Functions, Filters, Joins, Aggregations
Topologies
(Diagram labels: Spouts, Bolt, Topologies)
28. Challenge – Word Count
Word:Count
Tweets
Count
• Hottest topics
• URL mentions
• etc.
29. Streaming word count with Storm
30. Supercharging Storm
Storm doesn’t supply persistence, but provides for it
Storm optimizes IO to slow persistence (e.g. databases) using
batching.
Storm processes streams. The stream provider itself needs to
support persistency, batching, and reliability.
Tweets, events, whatever…
31. XAP Real Time Analytics
32. Two Layer Approach
Advantage: Minimal
“impedance mismatch”
between layers.
– Both NoSQL cluster
technologies, with similar
advantages
Grid layer serves as an in
memory cache for interactive
requests.
Grid layer serves as a real time
computation fabric for CEP, and
limited (to allocated memory)
real time distributed query
capability.
(Diagram labels: In Memory Compute Cluster, NoSQL Cluster, RawEventStream, RealTimeEvents, Raw And Derived Events, ReportingEngine, SCALE)
33. Simplified Architecture
34. Key Concepts
Flowing event streams through memory for side effects
Event driven architecture executing in-memory
Raw events flushed, aggregations/derivations retained
All layers horizontally scalable
All layers highly available
Real-time analytics & cached batch analytics on same scalable layer
Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
35. Keep Things In Memory
Facebook keeps 80% of its
data in Memory
(Stanford research)
RAM is 100-1000x faster
than Disk (Random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
36. Take Aways
A data grid can serve different needs for big data analytics:
Supercharge a dedicated stream processing cluster like Storm.
– Provide fast, reliable, transactional tuple streams and state
Provide a general purpose analytics platform
– Roll your own
Simplify overall architecture while enhancing scalability
– Ultra high performance/low latency
– Dynamically scalable processing and in-memory storage
– Eliminate messaging tier
– Eliminate or minimize need for RDBMS
37. References
Realtime Analytics with Storm and Hadoop
http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
Learn and fork the code on GitHub:
https://github.com/Gigaspaces/storm-integration
Twitter Storm:
http://storm-project.net
XAP + Storm Detailed Blog Post
http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/