Real Time Big Data With Storm,
Cassandra, and In-Memory Computing
DeWayne Filppi
@dfilppi
Big Data Predictions
“Over the next few years we'll see the adoption of scalable
frameworks and platforms for handling
streaming, or near real-time, analysis and processing. In the
same way that Hadoop has been borne out of large-scale web
applications, these platforms will be driven by the needs of large-
scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
Velocity Volume
We’re Living in a Real Time World…
Homeland Security
Real Time Search
Social
eCommerce
User Tracking & Engagement
Financial Services
The Flavors of Big Data Analytics
Counting Correlating Research
Analytics @ Twitter – Counting
• How many signups, tweets, retweets for a topic?
• What’s the average latency?
• Demographics
  – Countries and cities
  – Gender
  – Age groups
  – Device types
  – …
Analytics @ Twitter – Correlating
• What devices fail at the same time?
• What features get users hooked?
• What places on the globe are “happening”?
Analytics @ Twitter – Research
• Sentiment analysis
  – “Obama is popular”
• Trends
  – “People like to tweet after watching American Idol”
• Spam patterns
  – How can you tell when a user spams?
It’s All about Timing
“Real time”
(< few Seconds)
Reasonably Quick
(seconds - minutes)
Batch
(hours/days)
It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
This is what we’re here to discuss
VELOCITY + VAST VOLUME =
IN MEMORY + BIG DATA
In Memory Data Grid Review
• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code collocated with data
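A minimal sketch of two of the data grid properties, partitioning by key and running code next to the data. The `Partition` and `Grid` classes are hypothetical names for illustration, not the XAP API; a real grid spreads partitions across machines and replicates them for availability.

```python
class Partition:
    """One in-memory shard; in a real grid this lives on its own node."""
    def __init__(self):
        self.data = {}

    def execute(self, func):
        # Code is shipped to the partition and runs next to its data.
        return func(self.data)


class Grid:
    """Hypothetical grid that routes each key to one partition by hash."""
    def __init__(self, partition_count):
        self.partitions = [Partition() for _ in range(partition_count)]

    def _route(self, key):
        return self.partitions[hash(key) % len(self.partitions)]

    def write(self, key, value):
        self._route(key).data[key] = value

    def read(self, key):
        return self._route(key).data.get(key)

    def map_reduce(self, mapper, reducer):
        # Run the mapper on every partition, then combine the partial results.
        return reducer(p.execute(mapper) for p in self.partitions)


grid = Grid(partition_count=4)
for user_id in range(100):
    grid.write(user_id, {"tweets": user_id % 5})

# Each partition sums its own data; only the partial sums travel.
total_tweets = grid.map_reduce(
    mapper=lambda data: sum(v["tweets"] for v in data.values()),
    reducer=sum,
)
```

Only the per-partition totals cross partition boundaries here, which is the point of collocating code with data.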
Data Grid + Cassandra: A Complete Solution
• Data flows through the in-memory cluster and is written asynchronously to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
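The flow in these bullets can be sketched as a write-behind cache. This is an illustration under stated assumptions: `WriteBehindCache` is a hypothetical class, a plain dict stands in for Cassandra, and `flush()` is called explicitly where a real grid would drain the queue in the background.

```python
from collections import deque


class WriteBehindCache:
    def __init__(self, backing_store, enrich=None, accept=None):
        self.memory = {}                    # in-memory copy, served to readers
        self.pending = deque()              # writes not yet persisted
        self.backing_store = backing_store  # stand-in for Cassandra
        self.enrich = enrich or (lambda value: value)
        self.accept = accept or (lambda value: True)
        self.listeners = []

    def write(self, key, value):
        if not self.accept(value):          # optional filtering
            return False
        value = self.enrich(value)          # optional enrichment
        self.memory[key] = value            # result is instantly available
        self.pending.append((key, value))
        for listener in self.listeners:     # notify event listeners
            listener(key, value)
        return True

    def flush(self, max_items=None):
        """Drain queued writes to the backing store; returns the count written."""
        written = 0
        while self.pending and (max_items is None or written < max_items):
            key, value = self.pending.popleft()
            self.backing_store[key] = value
            written += 1
        return written


cassandra_stand_in = {}
seen = []
cache = WriteBehindCache(
    cassandra_stand_in,
    enrich=lambda event: {**event, "length": len(event["text"])},
    accept=lambda event: bool(event["text"]),
)
cache.listeners.append(lambda key, value: seen.append(key))
cache.write("t1", {"text": "hello storm"})
cache.write("t2", {"text": ""})   # filtered out before reaching memory
```

The enriched event is readable from `cache.memory` before anything has reached the backing store; `cache.flush()` then persists it.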
Simplified Event Flow
Grid – Cassandra Interface
• Hector and CQL based interface
• In-memory data must be mapped to column families
  – Configurable class to column family mapping
• Must serialize individual fields
  – Fixed fields can use defined types
  – Variable fields (for schemaless in-memory mode) need serializers
• Object model flattening
  – By default, nested fields are flattened
  – Can be overridden by custom serializer
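Object model flattening can be illustrated with a short sketch. The dotted column naming and the `flatten` helper are assumptions made for this example; the actual XAP mapping and column names may differ.

```python
def flatten(obj, prefix="", serializers=None):
    """Flatten nested dicts into a single level of column name -> value."""
    serializers = serializers or {}
    columns = {}
    for field, value in obj.items():
        name = f"{prefix}{field}"
        if name in serializers:
            # A custom serializer overrides the default flattening.
            columns[name] = serializers[name](value)
        elif isinstance(value, dict):
            # Default behaviour: nested fields become dotted column names.
            columns.update(flatten(value, prefix=f"{name}.", serializers=serializers))
        else:
            columns[name] = value
    return columns


tweet = {"id": 42, "user": {"name": "dfilppi", "geo": {"country": "US"}}}
default_row = flatten(tweet)
custom_row = flatten(tweet, serializers={"user.geo": lambda geo: geo["country"].lower()})
```

With the default behaviour every leaf becomes its own column; the custom serializer instead stores the whole `user.geo` object as one value.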
Virtues and Limitations
• Could be faster: high availability has a cost
• Complex flows not easy to assemble or understand with simple event handlers
• Complete stack, not just two tools of many
• Fast
  – Microsecond latencies for in-memory operations
  – Fast enough for almost anybody
• Highly available/self healing
• Elastic
Storm Background
• Popular open source, real time, in-memory, streaming computation platform
• Includes distributed runtime and intuitive API for defining distributed processing flows
• Scalable and fault tolerant
• Developed at BackType, and open sourced by Twitter
Storm Abstractions
• Streams
  – Unbounded sequence of tuples
• Spouts
  – Source of streams (Queues)
• Bolts
  – Functions, Filters, Joins, Aggregations
• Topologies
Streaming word count with Storm
• Storm has a simple builder interface for creating stream processing topologies
• Storm delegates persistence to external providers
• Cassandra, because of its write performance, is commonly used
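The shape of a streaming word count can be sketched without Storm itself, using generators for the spout and bolts. The function names are hypothetical; a real topology would be assembled with Storm's builder interface and run distributed, and the `Counter` here merely stands in for an external store such as Cassandra.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emits a stream of one-field tuples."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt: splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream, store):
    """Bolt: keeps a running count per word in an external store."""
    for (word,) in stream:
        store[word] += 1
        yield (word, store[word])


store = Counter()  # stand-in for external persistence
topology = count_bolt(split_bolt(sentence_spout(["storm is fast", "Storm is fun"])), store)
emitted = list(topology)
```

Each stage consumes tuples as they arrive, so the counts in `store` are current after every word rather than at the end of a batch.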
Storm: Optimistic Processing
• Storm (quite rationally) assumes success is normal
• Storm uses batching and pipelining for performance
• Therefore the spout must be able to replay tuples on demand in case of error
• Any queue-like data source can be fashioned into a spout
• No persistence is ever required, and speed is attained by minimizing network hops during topology processing
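The replay requirement can be sketched as a spout that holds each emitted tuple until it is acknowledged. The method names echo Storm's spout callbacks, but this is plain Python written for illustration, not the Storm API; `ReplayableSpout`, `run`, and `flaky_upper` are hypothetical.

```python
from collections import deque


class ReplayableSpout:
    """Keeps emitted tuples pending until acked; failed tuples are re-queued."""
    def __init__(self, source):
        self.queue = deque(enumerate(source))  # (message id, value)
        self.pending = {}

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, value = self.queue.popleft()
        self.pending[msg_id] = value
        return msg_id, value

    def ack(self, msg_id):
        self.pending.pop(msg_id, None)  # fully processed, forget it

    def fail(self, msg_id):
        # Replay on demand: put the tuple back at the front of the queue.
        self.queue.appendleft((msg_id, self.pending.pop(msg_id)))


def run(spout, process):
    results = []
    while (item := spout.next_tuple()) is not None:
        msg_id, value = item
        try:
            results.append(process(value))
            spout.ack(msg_id)
        except RuntimeError:
            spout.fail(msg_id)
    return results


attempts = {}

def flaky_upper(word):
    """Fails the first time it sees "b", to simulate a transient error."""
    attempts[word] = attempts.get(word, 0) + 1
    if word == "b" and attempts[word] == 1:
        raise RuntimeError("transient failure")
    return word.upper()


spout = ReplayableSpout(["a", "b", "c"])
results = run(spout, flaky_upper)
```

The failed tuple is processed twice and still appears exactly once in the results, which is why the source behind a spout has to be able to hand a tuple out again.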
Fast. Want to go faster?
• Eliminate non-memory components
• Replace the disk-based queue with a reliable in-memory queue
• Replace disk-based state persistence with in-memory persistence
• Asynchronously update disk-based state (C*)
Sample Architecture
References
• Try the Cloudify recipe
  – Download Cloudify: http://www.cloudifysource.org/
  – Download the recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
• XAP – Cassandra interface details:
  – http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
• Check out the source for the XAP Spout, a sample state implementation backed by XAP, and a Storm-friendly streaming implementation on GitHub:
  – https://github.com/Gigaspaces/storm-integration
• For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
  – http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
  – Part 3 coming soon.
Twitter Storm With Cassandra
Storm Overview
Storm Concepts
• Streams
  – Unbounded sequence of tuples
• Spouts
  – Source of streams (Queues)
• Bolts
  – Functions, Filters, Joins, Aggregations
• Topologies
Challenge – Word Count
Word:Count
Tweets
Count
• Hottest topics
• URL mentions
• etc.
Streaming word count with Storm
Supercharging Storm
• Storm doesn’t supply persistence, but provides for it
• Storm optimizes IO to slow persistence (e.g. databases) using batching
• Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.
Tweets, events, whatever…
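Batching writes to slow persistence can be sketched as follows. `BatchingWriter` and `write_batch` are hypothetical names, and a dict stands in for the database; this shows the general technique rather than Storm's own state implementation.

```python
class BatchingWriter:
    """Buffers state updates and writes them to a slow store in batches."""
    def __init__(self, write_batch, batch_size):
        self.write_batch = write_batch
        self.batch_size = batch_size
        self.buffer = {}

    def update(self, key, delta):
        # Coalesce updates to the same key before they hit the store.
        self.buffer[key] = self.buffer.get(key, 0) + delta
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.write_batch(dict(self.buffer))
            self.buffer.clear()


store, round_trips = {}, []

def write_batch(batch):
    """Stand-in for one round trip to the database."""
    round_trips.append(len(batch))
    for key, delta in batch.items():
        store[key] = store.get(key, 0) + delta


writer = BatchingWriter(write_batch, batch_size=2)
for word in ["storm", "storm", "grid", "storm", "cassandra"]:
    writer.update(word, 1)
writer.flush()  # push whatever is still buffered
```

Five updates reach the store in two round trips here; the saving grows with batch size and with how often the same key repeats.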
XAP Real Time Analytics
Two Layer Approach
• Advantage: Minimal “impedance mismatch” between layers
  – Both NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in-memory cache for interactive requests
• Grid layer serves as a real-time computation fabric for CEP, and limited (to allocated memory) real-time distributed query capability
Diagram labels: RawEventStream (x3), In Memory Compute Cluster, RealTimeEvents, Raw And Derived Events, NoSQL Cluster, ReportingEngine, SCALE
Simplified Architecture
Key Concepts
• Flowing event streams through memory for side effects
• Event driven architecture executing in-memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics & cached batch analytics on same scalable layer
• Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
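A small sketch of "raw events flushed, aggregations retained": each event updates in-memory aggregates as a side effect, and the raw events are later moved to the archive and dropped from memory. `EventProcessor` and the event fields are hypothetical, and a list stands in for the NoSQL store.

```python
from collections import defaultdict


class EventProcessor:
    def __init__(self, archive):
        self.archive = archive                 # stand-in for the NoSQL store
        self.raw = []                          # short-lived raw events
        self.counts = defaultdict(int)         # retained aggregation
        self.latency_sum = defaultdict(float)  # retained aggregation

    def on_event(self, event):
        # Side effects are computed as the event flows through memory.
        self.raw.append(event)
        self.counts[event["country"]] += 1
        self.latency_sum[event["country"]] += event["latency_ms"]

    def flush_raw(self):
        # Raw events leave memory; the aggregates stay behind.
        self.archive.extend(self.raw)
        self.raw.clear()

    def average_latency(self, country):
        return self.latency_sum[country] / self.counts[country]


archive = []
processor = EventProcessor(archive)
for country, latency in [("US", 10.0), ("US", 30.0), ("FR", 20.0)]:
    processor.on_event({"country": country, "latency_ms": latency})
processor.flush_raw()
us_average = processor.average_latency("US")
```

After the flush, queries such as the average latency per country are still answered from memory, while the full event history lives only in the archive.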
Keep Things In Memory
Facebook keeps 80% of its data in memory (Stanford research)
RAM is 100-1000x faster than disk (random seek)
• Disk: 5-10 ms
• RAM: ~0.001 ms
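Taking the two latency figures above at face value, the per-access ratio can be worked out directly. This is only arithmetic on the quoted numbers, not a benchmark; the constant names are made up for the example.

```python
# Latencies from the slide, converted to microseconds.
DISK_SEEK_US = (5_000, 10_000)  # 5-10 ms random seek
RAM_ACCESS_US = 1               # ~0.001 ms

# Ratio of disk latency to RAM latency at both ends of the range.
speedup = tuple(disk // RAM_ACCESS_US for disk in DISK_SEEK_US)
```

On these figures a single random access is 5,000 to 10,000 times faster in RAM, so the 100-1000x range quoted above reads as a conservative summary.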
Take Aways
• A data grid can serve different needs for big data analytics:
  – Supercharge a dedicated stream processing cluster like Storm
    – Provide fast, reliable, transactional tuple streams and state
  – Provide a general purpose analytics platform
    – Roll your own
  – Simplify overall architecture while enhancing scalability
    – Ultra high performance/low latency
    – Dynamically scalable processing and in-memory storage
    – Eliminate messaging tier
    – Eliminate or minimize need for RDBMS
References
• Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
• Learn and fork the code on GitHub: https://github.com/Gigaspaces/storm-integration
• Twitter Storm: http://storm-project.net
• XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/

Cassandra summit-2013

  • 1. Real Time Big Data With Storm, Cassandra, and In-Memory Computing DeWayne Filppi @dfilppi
  • 2. Big Data Predictions “Over the next few years we'll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large- scale location-aware mobile, social and sensor use.” Edd Dumbill, O’REILLY 2 ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
  • 3. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved3 The Two Vs of Big Data Velocity Volume
  • 4. We’re Living in a Real Time World… Homeland Security Real Time Search Social eCommerce User Tracking & Engagement Financial Services ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved4
  • 5. The Flavors of Big Data Analytics Counting Correlating Research ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved5
  • 6. Analytics @ Twitter – Counting  How many signups, tweets, retweets for a topic?  What’s the average latency?  Demographics  Countries and cities  Gender  Age groups  Device types  … ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved6
  • 7. Analytics @ Twitter – Correlating  What devices fail at the same time?  What features get user hooked?  What places on the globe are “happening”? ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved7
  • 8. Analytics @ Twitter – Research  Sentiment analysis  “Obama is popular”  Trends  “People like to tweet after watching American Idol”  Spam patterns  How can you tell when a user spams? ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved8
  • 9. It’s All about Timing “Real time” (< few Seconds) Reasonably Quick (seconds - minutes) Batch (hours/days) ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved9
  • 10. It’s All about Timing • Event driven / stream processing • High resolution – every tweet gets counted • Ad-hoc querying • Medium resolution (aggregations) • Long running batch jobs (ETL, map/reduce) • Low resolution (trends & patterns) ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved10 This is what we’re here to discuss 
  • 11. VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA 11
  • 12.  RAM is the new disk  Data partitioned across a cluster  Large “virtual” memory space  Transactional  Highly available  Code collocated with data. In Memory Data Grid Review ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved12
  • 13. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved13 Data Grid + Cassandra: A Complete Solution • Data flows through the in-memory cluster async to Cassandra • Side effects calculated • Filtering an option • Enrichment an option • Results instantly available • Internal and external event listeners notified
  • 14. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved14 Simplified Event Flow
  • 15. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved15 Grid – Cassandra Interface  Hector and CQL based interface  In memory data must be mapped to column families.  Configurable class to column family mapping  Must serialize individual fields  Fixed fields can use defined types  Variable fields ( for schemaless in-memory mode) need serializers  Object model flattening  By default, nested fields are flattened.  Can be overridden by custom serializer.
  • 16. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved16 Virtues and Limitations  Could be faster: high availability has a cost  Complex flows not easy to assemble or understand with simple event handlers  Complete stack, not just two tools of many  Fast.  Microsecond latencies for in memory operations  Fast enough for almost anybody  Highly available/self healing  Elastic
  • 17.  Popular open source, real time, in-memory, streaming computation platform.  Includes distributed runtime and intuitive API for defining distributed processing flows.  Scalable and fault tolerant.  Developed at BackType, and open sourced by Twitter ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved17 Storm Background
  • 18.  Streams  Unbounded sequence of tuples  Spouts  Source of streams (Queues)  Bolts  Functions, Filters, Joins, Aggregations  Topologies ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved18 Storm Abstractions Spout Bolt Topologies
  • 19. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved19 Streaming word count with Storm  Storm has a simple builder interface to creating stream processing topologies  Storm delegates persistence to external providers  Cassandra, because of its write performance, is commonly used
  • 20. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved20 Storm : Optimistic Processing  Storm (quite rationally) assumes success is normal  Storm uses batching and pipelining for performance  Therefore the spout must be able to replay tuples on demand in case of error.  Any kind of quasi-queue like data source can be fashioned into a spout.  No persistence is ever required, and speed attained by minimizing network hops during topology processing.
  • 21. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved21 Fast. Want to go faster?  Eliminate non-memory components  Substitute disk based queue for reliable in-memory queue  Substitute disk based state persistence to in-memory persistence  Asynchronously update disk based state (C*)
  • 22. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved22 Sample Architecture
  • 23. ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved23 References  Try the Cloudify recipe  Download Cloudify : http://www.cloudifysource.org/  Download the Recipe (apps/xapstream, services/xapstream): – https://github.com/CloudifySource/cloudify-recipes  XAP – Cassandra Interface Details;  http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency  Check out the source for the XAP Spout and a sample state implementation backed by XAP, and a Storm friendly streaming implemention on github:  https://github.com/Gigaspaces/storm-integration  For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/  http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/  http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/  Part 3 coming soon.
  • 25. Twitter Storm With Cassandra
  • 26. Storm Overview
  • 27. Storm Concepts
      • Streams: unbounded sequences of tuples
      • Spouts: sources of streams (queues)
      • Bolts: functions, filters, joins, aggregations
      • Topologies
      • Diagram labels: Spouts, Bolt, Topologies
  • 28. Challenge – Word Count
      • Diagram labels: Tweets, Count, Word:Count
      • Hottest topics
      • URL mentions
      • etc.
  • 29. Streaming word count with Storm
  • 30. Supercharging Storm
      • Storm doesn’t supply persistence, but provides for it
      • Storm optimizes I/O to slow persistence (e.g. databases) using batching
      • Storm processes streams. The stream provider itself needs to support persistence, batching, and reliability.
      • Diagram label: Tweets, events, whatever…
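Batching writes to a slow store can be sketched as follows. This is an illustration of the general technique under stated assumptions, not Storm's internal mechanism: `BatchingWriter` and its `write_batch` callback are hypothetical names, and the list `calls` stands in for round trips to a database.

```python
class BatchingWriter:
    """Hypothetical writer that buffers records and sends them to a slow
    store in groups, trading a little latency for fewer round trips."""

    def __init__(self, write_batch, batch_size):
        self._write_batch = write_batch   # callback performing one slow write
        self._batch_size = batch_size
        self._buffer = []

    def add(self, record):
        """Buffer a record; flush automatically when the batch is full."""
        self._buffer.append(record)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self):
        """Write whatever is buffered as a single batch, if anything."""
        if self._buffer:
            self._write_batch(list(self._buffer))
            self._buffer.clear()


calls = []  # assumption: each appended list represents one database round trip
writer = BatchingWriter(calls.append, batch_size=3)
for record in range(7):
    writer.add(record)
writer.flush()  # push the final partial batch
```

Seven records reach the store in three round trips instead of seven. The final explicit `flush` matters, since a partial batch would otherwise stay in memory.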
  • 31. XAP Real Time Analytics
  • 32. Two Layer Approach
      • Advantage: minimal “impedance mismatch” between layers.
          – Both are NoSQL cluster technologies, with similar advantages
      • Grid layer serves as an in-memory cache for interactive requests.
      • Grid layer serves as a real-time computation fabric for CEP, and a limited (to allocated memory) real-time distributed query capability.
      • Diagram labels: In Memory Compute Cluster, NoSQL Cluster, RawEventStream, RealTimeEvents, Raw And Derived Events, ReportingEngine, SCALE
  • 33. Simplified Architecture
  • 34. Key Concepts
      • Flowing event streams through memory for side effects
      • Event-driven architecture executing in-memory
      • Raw events flushed, aggregations/derivations retained
      • All layers horizontally scalable
      • All layers highly available
      • Real-time analytics and cached batch analytics on the same scalable layer
      • Data grid provides a transactional/consistent façade on the NoSQL store (in this case eliminating the SQL database entirely)
  • 35. Keep Things In Memory
      • Facebook keeps 80% of its data in memory (Stanford research)
      • RAM is 100-1000x faster than disk (random seek)
          – Disk: 5-10 ms
          – RAM: ~0.001 ms
  • 36. Take Aways
      • A data grid can serve different needs for big data analytics:
      • Supercharge a dedicated stream processing cluster like Storm.
          – Provide fast, reliable, transactional tuple streams and state
      • Provide a general purpose analytics platform
          – Roll your own
      • Simplify overall architecture while enhancing scalability
          – Ultra high performance/low latency
          – Dynamically scalable processing and in-memory storage
          – Eliminate messaging tier
          – Eliminate or minimize need for RDBMS
  • 37. References
      • Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
      • Learn and fork the code on GitHub: https://github.com/Gigaspaces/storm-integration
      • Twitter Storm: http://storm-project.net
      • XAP + Storm detailed blog post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/

Editor's Notes

  1. ActiveInsight