Real Time Big Data With Storm, Cassandra, and In-Memory Computing
DeWayne Filppi
@dfilppi
Big Data Predictions
“Over the next few years we’ll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
Edd Dumbill, O’REILLY
® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
The Two Vs of Big Data
• Velocity
• Volume
We’re Living in a Real Time World…
• Homeland Security
• Real Time Search
• Social
• eCommerce
• User Tracking & Engagement
• Financial Services
The Flavors of Big Data Analytics
• Counting
• Correlating
• Research
Analytics @ Twitter – Counting
• How many signups, tweets, retweets for a topic?
• What’s the average latency?
• Demographics
– Countries and cities
– Gender
– Age groups
– Device types
– …
Analytics @ Twitter – Correlating
• What devices fail at the same time?
• What features get users hooked?
• What places on the globe are “happening”?
Analytics @ Twitter – Research
• Sentiment analysis
– “Obama is popular”
• Trends
– “People like to tweet after watching American Idol”
• Spam patterns
– How can you tell when a user spams?
It’s All about Timing
• “Real time” (< few seconds)
• Reasonably quick (seconds - minutes)
• Batch (hours/days)
It’s All about Timing
• Event driven / stream processing
• High resolution – every tweet gets counted
• Ad-hoc querying
• Medium resolution (aggregations)
• Long running batch jobs (ETL, map/reduce)
• Low resolution (trends & patterns)
(Callout: This is what we’re here to discuss)
VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA
In Memory Data Grid Review
• RAM is the new disk
• Data partitioned across a cluster
• Large “virtual” memory space
• Transactional
• Highly available
• Code collocated with data
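To make "data partitioned across a cluster" and "code collocated with data" concrete, here is a minimal single-process sketch. The `Partition` and `Grid` classes and the CRC32 routing are hypothetical stand-ins chosen for illustration; they are not the XAP API, and a real grid would place each partition on its own node with backups.

```python
import zlib


class Partition:
    """One in-memory partition; in a real grid this would live on its own node."""

    def __init__(self):
        self.data = {}

    def execute(self, task):
        # Collocation: the task runs against this partition's local entries only.
        return task(self.data)


class Grid:
    """Hypothetical facade that routes each key to a partition by hash."""

    def __init__(self, partition_count):
        self.partitions = [Partition() for _ in range(partition_count)]

    def _route(self, key):
        # crc32 is stable across runs, unlike Python's salted hash() for strings.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def write(self, key, value):
        self._route(key).data[key] = value

    def read(self, key):
        return self._route(key).data.get(key)

    def map_reduce(self, task, reducer):
        # Send the task to every partition, then combine the partial results.
        return reducer(p.execute(task) for p in self.partitions)


grid = Grid(partition_count=4)
for i in range(100):
    grid.write(f"user-{i}", i)

# Each partition sums its own values; only the partial sums are combined.
total = grid.map_reduce(lambda local: sum(local.values()), sum)
```

The point of the design is that only small partial results cross partition boundaries, while the bulk data stays where it was written.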
Data Grid + Cassandra: A Complete Solution
• Data flows through the in-memory cluster async to Cassandra
• Side effects calculated
• Filtering an option
• Enrichment an option
• Results instantly available
• Internal and external event listeners notified
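The asynchronous flow from memory to Cassandra can be pictured as a write-behind queue. The sketch below is an assumption-laden illustration: `WriteBehindCache` is a hypothetical class, and a plain dict stands in for Cassandra, so no Cassandra driver calls are shown.

```python
import queue
import threading


class WriteBehindCache:
    """Hypothetical in-memory store that persists writes asynchronously."""

    def __init__(self, store):
        self.memory = {}
        self.store = store  # stand-in for Cassandra (assumption: any dict-like sink)
        self.pending = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def write(self, key, value):
        # The write is visible in memory immediately ("results instantly available").
        self.memory[key] = value
        # Persistence is deferred to the background worker.
        self.pending.put((key, value))

    def read(self, key):
        return self.memory.get(key)

    def _drain(self):
        while True:
            key, value = self.pending.get()
            self.store[key] = value
            self.pending.task_done()

    def flush(self):
        # Block until every queued write has reached the backing store.
        self.pending.join()


backing_store = {}
cache = WriteBehindCache(backing_store)
cache.write("tweet:1", "hello")
immediate = cache.read("tweet:1")  # served from memory, no wait on the store
cache.flush()
```

Filtering or enrichment, which the slide lists as options, would fit naturally inside `write` before the entry is queued.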
Simplified Event Flow
Grid – Cassandra Interface
• Hector and CQL based interface
• In memory data must be mapped to column families
– Configurable class to column family mapping
• Must serialize individual fields
– Fixed fields can use defined types
– Variable fields (for schemaless in-memory mode) need serializers
• Object model flattening
– By default, nested fields are flattened
– Can be overridden by custom serializer
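Object model flattening can be illustrated with a small recursive function that turns nested fields into flat column names. This is only a sketch of the idea: the `flatten` helper, the dotted naming scheme, and the sample `tweet` record are assumptions for illustration, not the column naming the XAP interface actually uses.

```python
def flatten(obj, prefix="", sep="."):
    """Flatten nested dicts into a single level of column name -> value."""
    columns = {}
    for name, value in obj.items():
        column = f"{prefix}{sep}{name}" if prefix else name
        if isinstance(value, dict):
            # Nested object: recurse and keep the parent path in the column name.
            columns.update(flatten(value, column, sep))
        else:
            columns[column] = value
    return columns


# Hypothetical in-memory object with two levels of nesting.
tweet = {"id": 17, "user": {"name": "dfilppi", "location": {"country": "US"}}}
row = flatten(tweet)
```

A custom serializer, as the slide mentions, would replace this default by storing a nested object as a single serialized column instead of recursing into it.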
Virtues and Limitations
• Virtues
– Complete stack, not just two tools of many
– Fast. Microsecond latencies for in memory operations
– Fast enough for almost anybody
– Highly available/self healing
– Elastic
• Limitations
– Could be faster: high availability has a cost
– Complex flows not easy to assemble or understand with simple event handlers
Storm Background
• Popular open source, real time, in-memory, streaming computation platform
• Includes distributed runtime and intuitive API for defining distributed processing flows
• Scalable and fault tolerant
• Developed at BackType, and open sourced by Twitter
Storm Abstractions
• Streams
– Unbounded sequence of tuples
• Spouts
– Source of streams (Queues)
• Bolts
– Functions, Filters, Joins, Aggregations
• Topologies
Streaming word count with Storm
• Storm has a simple builder interface for creating stream processing topologies
• Storm delegates persistence to external providers
• Cassandra, because of its write performance, is commonly used
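The shape of a streaming word count can be sketched without Storm itself. The code below is a plain-Python analogy, not Storm's builder API: `sentence_spout`, `split_bolt`, `CountBolt`, and `run_topology` are hypothetical names, generators stand in for streams, and a finite list stands in for an unbounded source.

```python
from collections import Counter


def sentence_spout(sentences):
    """Spout: emits one tuple per sentence."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt (function): splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


class CountBolt:
    """Bolt (aggregation): keeps running counts and emits (word, count)."""

    def __init__(self):
        # In the deck's architecture this state would be delegated to an
        # external provider such as Cassandra or the in-memory grid.
        self.counts = Counter()

    def process(self, stream):
        for (word,) in stream:
            self.counts[word] += 1
            yield (word, self.counts[word])


def run_topology(sentences):
    """Topology: wires spout -> split bolt -> count bolt and drains the stream."""
    counter = CountBolt()
    emitted = list(counter.process(split_bolt(sentence_spout(sentences))))
    return counter.counts, emitted


counts, emitted = run_topology(["storm is fast", "cassandra is fast"])
```

Each stage only sees tuples from the stage before it, which is what lets a real topology spread the stages across machines.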
Storm: Optimistic Processing
• Storm (quite rationally) assumes success is normal
• Storm uses batching and pipelining for performance
• Therefore the spout must be able to replay tuples on demand in case of error
• Any kind of quasi-queue like data source can be fashioned into a spout
• No persistence is ever required, and speed is attained by minimizing network hops during topology processing
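The replay requirement can be sketched as a spout that remembers every emitted tuple until it is acknowledged. The method names below echo Storm's spout callbacks (`nextTuple`, `ack`, `fail`), but the class, the `run` loop, and the flaky processor are hypothetical and run in a single process.

```python
from collections import deque


class ReplayableSpout:
    """Hypothetical queue-backed spout that can replay failed tuples."""

    def __init__(self, items):
        self.queue = deque(enumerate(items))  # (message id, payload)
        self.pending = {}                     # emitted but not yet acked

    def next_tuple(self):
        if not self.queue:
            return None
        msg_id, payload = self.queue.popleft()
        self.pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Success is the normal case: forget the tuple.
        self.pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Error: put the tuple back on the queue so it is emitted again.
        self.queue.append((msg_id, self.pending.pop(msg_id)))


def run(spout, process):
    """Drain the spout, acking successes and failing errors."""
    processed = []
    while True:
        emitted = spout.next_tuple()
        if emitted is None:
            break
        msg_id, payload = emitted
        try:
            processed.append(process(payload))
            spout.ack(msg_id)
        except RuntimeError:
            spout.fail(msg_id)
    return processed


failed_once = set()


def flaky_upper(word):
    """Fails the first time it sees "b", then succeeds on replay."""
    if word == "b" and word not in failed_once:
        failed_once.add(word)
        raise RuntimeError("transient failure")
    return word.upper()


spout = ReplayableSpout(["a", "b", "c"])
result = run(spout, flaky_upper)
```

Note that the replayed tuple is processed after later tuples, so this scheme gives at-least-once delivery without preserving order.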
Fast. Want to go faster?
• Eliminate non-memory components
• Replace the disk based queue with a reliable in-memory queue
• Replace disk based state persistence with in-memory persistence
• Asynchronously update disk based state (C*)
Sample Architecture
References
• Try the Cloudify recipe
– Download Cloudify: http://www.cloudifysource.org/
– Download the Recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
• XAP – Cassandra Interface Details: http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
• Check out the source for the XAP Spout and a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github: https://github.com/Gigaspaces/storm-integration
• For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
– http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
– http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
– Part 3 coming soon.
Twitter Storm With Cassandra
Storm Overview
Storm Concepts
• Streams
– Unbounded sequence of tuples
• Spouts
– Source of streams (Queues)
• Bolts
– Functions, Filters, Joins, Aggregations
• Topologies
Challenge – Word Count
(Diagram labels: Tweets, Count, Word:Count)
• Hottest topics
• URL mentions
• etc.
Streaming word count with Storm
Supercharging Storm
• Storm doesn’t supply persistence, but provides for it
• Storm optimizes IO to slow persistence (e.g. databases) using batching
• Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.
(Diagram label: Tweets, events, whatever…)
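Batching IO to a slow persistence layer amounts to buffering writes and sending them in groups. A minimal sketch, assuming a hypothetical `BatchingWriter` whose `sink` callable stands in for one round trip to a database:

```python
class BatchingWriter:
    """Hypothetical writer that groups items so the slow sink is called less often."""

    def __init__(self, sink, batch_size):
        self.sink = sink              # stand-in for one database round trip
        self.batch_size = batch_size
        self.buffer = []

    def write(self, item):
        self.buffer.append(item)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        # Send whatever is buffered as one batch; skip empty flushes.
        if self.buffer:
            self.sink(list(self.buffer))
            self.buffer.clear()


calls = []
writer = BatchingWriter(calls.append, batch_size=3)
for i in range(7):
    writer.write(i)
writer.flush()  # push the final partial batch
```

Seven writes reach the sink in three calls here. The trade-off is that buffered items are lost if the process dies before a flush, which is why the slide asks the stream provider to supply reliability.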
XAP Real Time Analytics
Two Layer Approach
• Advantage: Minimal “impedance mismatch” between layers
– Both NoSQL cluster technologies, with similar advantages
• Grid layer serves as an in memory cache for interactive requests
• Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability
(Diagram labels: Raw Event Stream, In Memory Compute Cluster, Real Time Events, Raw And Derived Events, NoSQL Cluster, Reporting Engine, SCALE)
Simplified Architecture
Key Concepts
• Flowing event streams through memory for side effects
• Event driven architecture executing in-memory
• Raw events flushed, aggregations/derivations retained
• All layers horizontally scalable
• All layers highly available
• Real-time analytics & cached batch analytics on same scalable layer
• Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
Keep Things In Memory
• Facebook keeps 80% of its data in memory (Stanford research)
• RAM is 100-1000x faster than disk (random seek)
– Disk: 5-10 ms
– RAM: ~0.001 ms
Take Aways
A data grid can serve different needs for big data analytics:
• Supercharge a dedicated stream processing cluster like Storm
– Provide fast, reliable, transactional tuple streams and state
• Provide a general purpose analytics platform
– Roll your own
• Simplify overall architecture while enhancing scalability
– Ultra high performance/low latency
– Dynamically scalable processing and in-memory storage
– Eliminate messaging tier
– Eliminate or minimize need for RDBMS
References
• Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
• Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
• Twitter Storm: http://storm-project.net
• XAP + Storm Detailed Blog Post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/