Real Time Big Data With Storm,
Cassandra, and In-Memory Computing
DeWayne Filppi
@dfilppi
Big Data Predictions
“Over the next few years we'll see the adoption of scalable
frameworks and platforms for handling
streaming, or near real-time, analysis and processing. In the
same way that Hadoop has been borne out of large-scale web
applications, these platforms will be driven by the needs of large-
scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
Velocity Volume
We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking &
Engagement
Financial Services
The Flavors of Big Data Analytics
Counting Correlating Research
Analytics @ Twitter – Counting
• How many signups, tweets, retweets for a topic?
• What’s the average latency?
• Demographics
  – Countries and cities
  – Gender
  – Age groups
  – Device types
  – …
Analytics @ Twitter – Correlating
• What devices fail at the same time?
• What features get users hooked?
• What places on the globe are “happening”?
Analytics @ Twitter – Research
• Sentiment analysis
  – “Obama is popular”
• Trends
  – “People like to tweet after watching American Idol”
• Spam patterns
  – How can you tell when a user spams?
It’s All about Timing
“Real time”
(< few Seconds)
Reasonably Quick
(seconds - minutes)
Batch
(hours/days)
It’s All about Timing
• Event driven / stream processing
  – High resolution: every tweet gets counted
• Ad-hoc querying
  – Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
  – Low resolution (trends & patterns)
This is what we’re here to discuss
VELOCITY + VAST VOLUME =
IN MEMORY + BIG DATA
In Memory Data Grid Review
• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code collocated with data
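The partitioning and code-collocation points can be sketched in a few lines of Python. This is a toy model, not the XAP API: the class names `Partition` and `DataGrid`, the `crc32` routing, and the `map_reduce` helper are all assumptions made for illustration.

```python
import zlib

class Partition:
    """One grid node: holds a slice of the data in memory."""
    def __init__(self):
        self.data = {}

    def execute(self, task):
        # The task is shipped to the partition and runs next to its data.
        return task(self.data)

class DataGrid:
    """Hypothetical in-memory grid that routes keys to partitions by hash."""
    def __init__(self, partition_count):
        self.partitions = [Partition() for _ in range(partition_count)]

    def _route(self, key):
        # crc32 is stable across runs, unlike the built-in hash() for strings.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def write(self, key, value):
        self._route(key).data[key] = value

    def read(self, key):
        return self._route(key).data.get(key)

    def map_reduce(self, task, reducer):
        # Run the task locally on every partition, then combine partial results.
        return reducer(p.execute(task) for p in self.partitions)

grid = DataGrid(partition_count=4)
for user, tweets in [("alice", 3), ("bob", 5), ("carol", 2)]:
    grid.write(user, tweets)

# Each partition sums its own slice; only the partial sums are combined.
total = grid.map_reduce(lambda data: sum(data.values()), sum)
```

Only the small partial results leave each partition, which is the point of collocating code with data.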
Data Grid + Cassandra: A Complete Solution
• Data flows through the in-memory cluster and is written asynchronously to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
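A rough sketch of this write-behind flow, assuming a plain dict as a stand-in for Cassandra. The `WriteBehindGrid` class and its method names are hypothetical and are not taken from XAP or any Cassandra client.

```python
import queue
import threading

class WriteBehindGrid:
    """Hypothetical grid: writes hit memory first, then a background
    thread persists them to a slower store (a dict stands in for Cassandra)."""

    def __init__(self, backing_store):
        self.memory = {}
        self.backing_store = backing_store
        self.pending = queue.Queue()
        self.listeners = []
        worker = threading.Thread(target=self._flush_loop, daemon=True)
        worker.start()

    def write(self, key, value):
        self.memory[key] = value           # result is available immediately
        for listener in self.listeners:    # notify event listeners
            listener(key, value)
        self.pending.put((key, value))     # persist later, off the write path

    def _flush_loop(self):
        while True:
            key, value = self.pending.get()
            self.backing_store[key] = value
            self.pending.task_done()

    def wait_until_persisted(self):
        # Blocks until every queued write has reached the backing store.
        self.pending.join()

cassandra_stand_in = {}
grid = WriteBehindGrid(cassandra_stand_in)
seen = []
grid.listeners.append(lambda key, value: seen.append(key))

grid.write("tweet:1", "hello")
in_memory_now = grid.memory["tweet:1"]    # readable without waiting for the flush
grid.wait_until_persisted()
persisted = cassandra_stand_in["tweet:1"]
```

Filtering or enrichment would slot into `write` before the value is queued; they are left out here to keep the sketch short.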
Simplified Event Flow
Grid – Cassandra Interface
• Hector and CQL based interface
• In-memory data must be mapped to column families
  – Configurable class to column family mapping
• Must serialize individual fields
  – Fixed fields can use defined types
  – Variable fields (for schemaless in-memory mode) need serializers
• Object model flattening
  – By default, nested fields are flattened
  – Can be overridden by custom serializer
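To make the flattening idea concrete, here is a small Python sketch. The dotted column naming is an assumption chosen for illustration; the slides do not say how the actual interface names flattened columns.

```python
def flatten(obj, prefix=""):
    """Flatten nested dicts into one level of column-name -> value pairs.

    The dotted naming scheme is an assumption, not the documented behavior
    of the grid's Cassandra interface.
    """
    columns = {}
    for field, value in obj.items():
        name = f"{prefix}.{field}" if prefix else field
        if isinstance(value, dict):
            # Nested object: recurse and prefix its fields with the parent name.
            columns.update(flatten(value, name))
        else:
            columns[name] = value
    return columns

tweet = {
    "id": 42,
    "text": "hello",
    "user": {"name": "alice", "location": {"city": "Austin"}},
}
row = flatten(tweet)
```

A custom serializer would replace the recursive branch, for example by storing the whole nested object as a single serialized column.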
Virtues and Limitations
• Could be faster: high availability has a cost
• Complex flows not easy to assemble or understand with simple event handlers
• Complete stack, not just two tools of many
• Fast
  – Microsecond latencies for in-memory operations
  – Fast enough for almost anybody
• Highly available/self healing
• Elastic
Storm Background
• Popular open source, real-time, in-memory, streaming computation platform
• Includes distributed runtime and intuitive API for defining distributed processing flows
• Scalable and fault tolerant
• Developed at BackType, and open sourced by Twitter
Storm Abstractions
• Streams
  – Unbounded sequence of tuples
• Spouts
  – Source of streams (queues)
• Bolts
  – Functions, filters, joins, aggregations
• Topologies
(Diagram labels: Spout, Bolt, Topologies)
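The four abstractions can be mimicked with Python generators. This is only an analogy and not Storm's API: the function names are made up, and a real stream is unbounded, while this one ends when the input list does.

```python
def sentence_spout(sentences):
    """Spout: a source that emits tuples into a stream."""
    for sentence in sentences:
        yield (sentence,)

def split_bolt(stream):
    """Bolt acting as a function: one sentence tuple in, many word tuples out."""
    for (sentence,) in stream:
        for word in sentence.split():
            yield (word.lower(),)

def filter_bolt(stream, stop_words):
    """Bolt acting as a filter: drops tuples that match the stop list."""
    for (word,) in stream:
        if word not in stop_words:
            yield (word,)

def build_topology(sentences):
    """Topology: the wiring of spouts and bolts into a processing graph."""
    return filter_bolt(split_bolt(sentence_spout(sentences)), {"the", "a"})

words = [word for (word,) in build_topology(["The quick fox", "a lazy dog"])]
```

Because generators are lazy, each tuple moves through the whole chain before the next one is read, which loosely resembles tuples flowing through a topology.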
Streaming word count with Storm
• Storm has a simple builder interface for creating stream processing topologies
• Storm delegates persistence to external providers
  – Cassandra, because of its write performance, is commonly used
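A word count in this style might look like the sketch below. `TopologySketch` is a hypothetical builder written for this example, not Storm's builder class, and `InMemoryState` stands in for an external provider such as Cassandra.

```python
from collections import Counter

class InMemoryState:
    """Stand-in for an external state provider such as Cassandra."""
    def __init__(self):
        self.counts = Counter()

    def increment(self, word, amount=1):
        self.counts[word] += amount

    def get(self, word):
        return self.counts[word]

class TopologySketch:
    """Hypothetical chained builder; each bolt maps one value to a list of outputs."""
    def __init__(self):
        self.spout = None
        self.bolts = []

    def set_spout(self, spout):
        self.spout = spout
        return self

    def set_bolt(self, bolt):
        self.bolts.append(bolt)
        return self

    def run(self):
        for item in self.spout():
            batch = [item]
            for bolt in self.bolts:
                # Feed every value from the previous stage into the next bolt.
                batch = [out for value in batch for out in bolt(value)]

state = InMemoryState()

def count_bolt(word):
    state.increment(word)   # persistence is delegated to the state provider
    return []               # terminal bolt: emits nothing downstream

topology = (TopologySketch()
            .set_spout(lambda: iter(["to be or not to be"]))
            .set_bolt(lambda sentence: sentence.split())
            .set_bolt(count_bolt))
topology.run()
```

Swapping `InMemoryState` for a class with the same two methods that talks to a database would leave the topology definition unchanged.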
Storm: Optimistic Processing
• Storm (quite rationally) assumes success is normal
• Storm uses batching and pipelining for performance
• Therefore the spout must be able to replay tuples on demand in case of error
• Any queue-like data source can be fashioned into a spout
• No persistence is ever required, and speed is attained by minimizing network hops during topology processing
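The replay requirement can be illustrated with a toy spout that remembers each emitted tuple until it is acknowledged. The method names echo the `nextTuple`/`ack`/`fail` callbacks of Storm's spout interface, but the class itself and the driver loop are assumptions for illustration.

```python
from collections import deque

class ReplayableSpout:
    """Hypothetical spout: keeps emitted tuples until they are acked."""
    def __init__(self, items):
        self.queue = deque(enumerate(items))   # (message id, payload)
        self.pending = {}

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload         # remember until acked
        return msg_id, payload

    def ack(self, msg_id):
        self.pending.pop(msg_id)               # processed: safe to forget

    def fail(self, msg_id):
        # Put the tuple back on the queue so it is emitted again.
        self.queue.append((msg_id, self.pending.pop(msg_id)))

def run(spout, process):
    """Drive the spout until it is drained, acking successes and failing errors."""
    while (emitted := spout.next_tuple()) is not None:
        msg_id, payload = emitted
        try:
            process(payload)
            spout.ack(msg_id)
        except RuntimeError:
            spout.fail(msg_id)

processed = []
failed_once = set()

def flaky(payload):
    # Fails the first time it sees "b", succeeds on replay.
    if payload == "b" and payload not in failed_once:
        failed_once.add(payload)
        raise RuntimeError("transient failure")
    processed.append(payload)

spout = ReplayableSpout(["a", "b", "c"])
run(spout, flaky)
```

The happy path pays only for a dictionary insert and delete per tuple, which fits the optimistic assumption that failures are rare.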
Fast. Want to go faster?
• Eliminate non-memory components
  – Replace the disk-based queue with a reliable in-memory queue
  – Replace disk-based state persistence with in-memory persistence
  – Asynchronously update disk-based state (C*)
Sample Architecture
References
• Try the Cloudify recipe
  – Download Cloudify: http://www.cloudifysource.org/
  – Download the recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
• XAP – Cassandra interface details:
  – http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
• Check out the source for the XAP spout, a sample state implementation backed by XAP, and a Storm-friendly streaming implementation on GitHub:
  – https://github.com/Gigaspaces/storm-integration
• For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
  – Part 3 coming soon.
Twitter Storm With Cassandra
Storm Overview
Storm Concepts
• Streams
  – Unbounded sequence of tuples
• Spouts
  – Source of streams (queues)
• Bolts
  – Functions, filters, joins, aggregations
• Topologies
(Diagram labels: Spouts, Bolt, Topologies)
Challenge – Word Count
(Diagram labels: Tweets, Count, Word:Count)
• Hottest topics
• URL mentions
• etc.
Streaming word count with Storm
Supercharging Storm
• Storm doesn’t supply persistence, but provides for it
• Storm optimizes I/O to slow persistence (e.g. databases) using batching
• Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.
(Diagram label: Tweets, events, whatever…)
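The batching point can be shown with a counter that aggregates in memory and writes to the slow store once per batch. `SlowStore` and `count_in_batches` are hypothetical names, and the fixed batch size is an assumption made to keep the example small.

```python
from collections import Counter

class SlowStore:
    """Stand-in for a database; tracks how many write calls it receives."""
    def __init__(self):
        self.counts = Counter()
        self.write_calls = 0

    def apply(self, deltas):
        self.write_calls += 1
        self.counts.update(deltas)   # adds the batch's counts to the totals

def count_in_batches(words, store, batch_size):
    """Aggregate counts in memory and write to the store once per batch."""
    batch = Counter()
    seen = 0
    for word in words:
        batch[word] += 1
        seen += 1
        if seen % batch_size == 0:
            store.apply(batch)
            batch = Counter()
    if batch:
        store.apply(batch)           # flush the final partial batch

store = SlowStore()
count_in_batches("to be or not to be that is".split(), store, batch_size=4)
```

Eight words reach the store in two writes instead of eight; with a slow database the saving grows with the batch size, at the cost of replaying a whole batch on failure.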
XAP Real Time Analytics
Two Layer Approach
• Advantage: minimal “impedance mismatch” between layers
  – Both are NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in-memory cache for interactive requests
• Grid layer serves as a real-time computation fabric for CEP, and provides real-time distributed query capability limited to allocated memory
(Diagram labels: RawEventStream, In Memory Compute Cluster, RealTimeEvents, Raw And Derived Events, NoSQL Cluster, ReportingEngine, SCALE)
Simplified Architecture
Key Concepts
• Flowing event streams through memory for side effects
• Event-driven architecture executing in memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics and cached batch analytics on the same scalable layer
• Data grid provides a transactional/consistent façade on the NoSQL store (in this case eliminating the SQL database entirely)
Keep Things In Memory
• Facebook keeps 80% of its data in memory (Stanford research)
• RAM is 100-1000x faster than disk (random seek)
  – Disk: 5-10 ms
  – RAM: ~0.001 ms
Takeaways
• A data grid can serve different needs for big data analytics:
• Supercharge a dedicated stream processing cluster like Storm
  – Provide fast, reliable, transactional tuple streams and state
• Provide a general purpose analytics platform
  – Roll your own
• Simplify overall architecture while enhancing scalability
  – Ultra high performance/low latency
  – Dynamically scalable processing and in-memory storage
  – Eliminate messaging tier
  – Eliminate or minimize need for RDBMS
References
• Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
• Learn and fork the code on GitHub: https://github.com/Gigaspaces/storm-integration
• Twitter Storm: http://storm-project.net
• XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/

Cassandra summit-2013
