Cassandra summit-2013

Usage Rights

© All Rights Reserved

Cassandra summit-2013 Presentation Transcript

  • Real Time Big Data With Storm, Cassandra, and In-Memory Computing
    DeWayne Filppi, @dfilppi
  • Big Data Predictions
    “Over the next few years we’ll see the adoption of scalable frameworks and platforms for handling streaming, or near real-time, analysis and processing. In the same way that Hadoop has been borne out of large-scale web applications, these platforms will be driven by the needs of large-scale location-aware mobile, social and sensor use.”
    Edd Dumbill, O’REILLY
    ® Copyright 2013 Gigaspaces Ltd. All Rights Reserved
  • The Two Vs of Big Data
    • Velocity
    • Volume
  • We’re Living in a Real Time World…
    • Homeland Security
    • Real Time Search
    • Social
    • eCommerce
    • User Tracking & Engagement
    • Financial Services
  • The Flavors of Big Data Analytics
    • Counting
    • Correlating
    • Research
  • Analytics @ Twitter – Counting
    • How many signups, tweets, retweets for a topic?
    • What’s the average latency?
    • Demographics
      – Countries and cities
      – Gender
      – Age groups
      – Device types
      – …
  • Analytics @ Twitter – Correlating
    • What devices fail at the same time?
    • What features get users hooked?
    • What places on the globe are “happening”?
  • Analytics @ Twitter – Research
    • Sentiment analysis
      – “Obama is popular”
    • Trends
      – “People like to tweet after watching American Idol”
    • Spam patterns
      – How can you tell when a user spams?
  • It’s All about Timing
    • “Real time” (< few seconds)
    • Reasonably quick (seconds - minutes)
    • Batch (hours/days)
  • It’s All about Timing
    • Event driven / stream processing
    • High resolution – every tweet gets counted
    • Ad-hoc querying
    • Medium resolution (aggregations)
    • Long running batch jobs (ETL, map/reduce)
    • Low resolution (trends & patterns)
    Callout: “This is what we’re here to discuss”
  • VELOCITY + VAST VOLUME = IN MEMORY + BIG DATA
  • In Memory Data Grid Review
    • RAM is the new disk
    • Data partitioned across a cluster
    • Large “virtual” memory space
    • Transactional
    • Highly available
    • Code collocated with data
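The two ideas on this slide that are easiest to show in code are partitioning by key and sending code to the data. What follows is a minimal single-process sketch, not the XAP API: the class names `Partition` and `DataGrid`, the CRC32 routing, and the `execute` method are all assumptions made for illustration.

```python
import zlib


class Partition:
    """One in-memory partition; in a real grid this would be a separate node."""

    def __init__(self):
        self.store = {}


class DataGrid:
    """Hypothetical grid that routes each key to exactly one partition."""

    def __init__(self, partitions=4):
        self.partitions = [Partition() for _ in range(partitions)]

    def _route(self, key):
        # A stable hash, so the same key always lands on the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        return self.partitions[index]

    def write(self, key, value):
        self._route(key).store[key] = value

    def read(self, key):
        return self._route(key).store.get(key)

    def execute(self, key, fn):
        # "Code collocated with data": run fn against the partition's
        # local store instead of pulling the data to the caller.
        return fn(self._route(key).store)


grid = DataGrid(partitions=4)
grid.write("user:42", {"tweets": 10})
grid.execute("user:42", lambda store: store["user:42"].update(tweets=11))
```

Because routing depends only on the key, a read, a write, and an `execute` for `"user:42"` all touch the same partition, which is what lets the update run without moving the record.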
  • Data Grid + Cassandra: A Complete Solution
    • Data flows through the in-memory cluster async to Cassandra
    • Side effects calculated
    • Filtering an option
    • Enrichment an option
    • Results instantly available
    • Internal and external event listeners notified
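The flow on this slide resembles a write-behind pattern: a write is visible in memory at once, side effects and listeners run inline, and persistence happens asynchronously. The sketch below is one way to picture it under stated assumptions: `WriteBehindGrid` is a hypothetical name, a plain dict stands in for Cassandra, and the per-type count is just an example of a side effect.

```python
import queue
import threading


class WriteBehindGrid:
    """Hypothetical in-memory store that persists asynchronously."""

    def __init__(self, backing_store):
        self.memory = {}          # results are readable from here immediately
        self.counts = {}          # example side effect: events seen per type
        self.listeners = []       # callbacks notified on every write
        self._backing = backing_store
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def write(self, key, event):
        self.memory[key] = event
        self.counts[event["type"]] = self.counts.get(event["type"], 0) + 1
        for listener in self.listeners:
            listener(key, event)
        # Persistence is queued, so the caller never waits on the slow store.
        self._pending.put((key, event))

    def _drain(self):
        while True:
            key, event = self._pending.get()
            self._backing[key] = event
            self._pending.task_done()

    def flush(self):
        # Block until every queued write has reached the backing store.
        self._pending.join()


cassandra = {}                    # stand-in for the real cluster
grid = WriteBehindGrid(cassandra)
notified = []
grid.listeners.append(lambda key, event: notified.append(key))
grid.write("e1", {"type": "tweet", "text": "hello"})
in_memory_now = grid.memory["e1"]
grid.flush()
```

Filtering or enrichment would slot into `write` before the event is queued; they are left out here to keep the sketch short.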
  • Simplified Event Flow
  • Grid – Cassandra Interface
    • Hector and CQL based interface
    • In memory data must be mapped to column families
      – Configurable class to column family mapping
    • Must serialize individual fields
      – Fixed fields can use defined types
      – Variable fields (for schemaless in-memory mode) need serializers
    • Object model flattening
      – By default, nested fields are flattened
      – Can be overridden by custom serializer
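Object model flattening can be illustrated with a short sketch. It is not the XAP implementation: the function name `flatten`, the dot-separated column names, and the way a custom serializer is registered per field path are all assumptions chosen to show the idea.

```python
import json


def flatten(obj, custom_serializers=None, prefix="", sep="."):
    """Turn a nested dict into a flat {column_name: value} mapping.

    custom_serializers maps a field path (for example "address.geo") to a
    function that stores the whole nested value in a single column.
    """
    custom_serializers = custom_serializers or {}
    columns = {}
    for name, value in obj.items():
        column = f"{prefix}{sep}{name}" if prefix else name
        if column in custom_serializers:
            # Override: the serializer decides how this field is stored.
            columns[column] = custom_serializers[column](value)
        elif isinstance(value, dict):
            # Default: recurse, so nested fields become their own columns.
            columns.update(flatten(value, custom_serializers, column, sep))
        else:
            columns[column] = value
    return columns


user = {
    "id": 7,
    "name": "ann",
    "address": {"city": "Oslo", "geo": {"lat": 59.9, "lon": 10.7}},
}
flat = flatten(user)
custom = flatten(user, {"address.geo": json.dumps})
```

With the default behaviour every leaf becomes a column; with the override, `address.geo` is kept as one JSON-encoded column while the rest is still flattened.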
  • Virtues and Limitations
    • Could be faster: high availability has a cost
    • Complex flows not easy to assemble or understand with simple event handlers
    • Complete stack, not just two tools of many
    • Fast. Microsecond latencies for in memory operations
    • Fast enough for almost anybody
    • Highly available/self healing
    • Elastic
  • Storm Background
    • Popular open source, real time, in-memory, streaming computation platform.
    • Includes distributed runtime and intuitive API for defining distributed processing flows.
    • Scalable and fault tolerant.
    • Developed at BackType, and open sourced by Twitter
  • Storm Abstractions
    • Streams
      – Unbounded sequence of tuples
    • Spouts
      – Source of streams (Queues)
    • Bolts
      – Functions, Filters, Joins, Aggregations
    • Topologies
    Diagram labels: Spout, Bolt, Topologies
  • Streaming word count with Storm
    • Storm has a simple builder interface for creating stream processing topologies
    • Storm delegates persistence to external providers
    • Cassandra, because of its write performance, is commonly used
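The word count flow can be pictured as a spout feeding two bolts, with the last bolt writing to an external store. The sketch below is a plain Python simulation using generators, not Storm's builder API; the function names and the `Counter` standing in for Cassandra are assumptions for illustration.

```python
from collections import Counter


def sentence_spout(sentences):
    """Source of the stream: emits one-field tuples."""
    for sentence in sentences:
        yield (sentence,)


def split_bolt(stream):
    """Bolt that splits each sentence tuple into word tuples."""
    for (sentence,) in stream:
        for word in sentence.lower().split():
            yield (word,)


def count_bolt(stream, store):
    """Bolt that keeps a running count per word in an external store."""
    for (word,) in stream:
        store[word] += 1
        yield (word, store[word])


store = Counter()                 # stand-in for the persistence provider
tweets = ["storm is fast", "cassandra is fast"]
# Wiring the stages together plays the role of the topology.
updates = list(count_bolt(split_bolt(sentence_spout(tweets)), store))
```

In a real cluster the count stage would need all tuples for the same word routed to the same task; a single process sidesteps that, which is the main simplification here.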
  • Storm: Optimistic Processing
    • Storm (quite rationally) assumes success is normal
    • Storm uses batching and pipelining for performance
    • Therefore the spout must be able to replay tuples on demand in case of error.
    • Any kind of quasi-queue-like data source can be fashioned into a spout.
    • No persistence is ever required, and speed is attained by minimizing network hops during topology processing.
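Replay on demand means the source has to remember every tuple until it is acknowledged. A minimal sketch of that bookkeeping follows, assuming a hypothetical `ReplayableSpout` class; the method names loosely follow Storm's spout callbacks (next tuple, ack, fail), but this is a simulation, not Storm code.

```python
from collections import deque


class ReplayableSpout:
    """Hypothetical queue-like source that can re-emit failed tuples."""

    def __init__(self, source):
        self._queue = deque(enumerate(source))   # (msg_id, payload) pairs
        self._pending = {}                       # emitted but not yet acked

    def next_tuple(self):
        if not self._queue:
            return None
        msg_id, payload = self._queue.popleft()
        self._pending[msg_id] = payload
        return msg_id, payload

    def ack(self, msg_id):
        # Success is the normal case: just forget the tuple.
        self._pending.pop(msg_id, None)

    def fail(self, msg_id):
        # Error case: put the tuple back so it is emitted again later.
        payload = self._pending.pop(msg_id)
        self._queue.append((msg_id, payload))


spout = ReplayableSpout(["a", "b"])
processed = []
fail_once = {"a"}                 # simulate one transient failure for "a"
while (item := spout.next_tuple()) is not None:
    msg_id, payload = item
    if payload in fail_once:
        fail_once.discard(payload)
        spout.fail(msg_id)
    else:
        processed.append(payload)
        spout.ack(msg_id)
```

The replayed tuple arrives after "b", so ordering is not preserved; downstream processing has to tolerate that.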
  • Fast. Want to go faster?
    • Eliminate non-memory components
    • Replace the disk based queue with a reliable in-memory queue
    • Replace disk based state persistence with in-memory persistence
    • Asynchronously update disk based state (C*)
  • Sample Architecture
  • References
    • Try the Cloudify recipe
      – Download Cloudify: http://www.cloudifysource.org/
      – Download the Recipe (apps/xapstream, services/xapstream): https://github.com/CloudifySource/cloudify-recipes
    • XAP – Cassandra Interface Details: http://wiki.gigaspaces.com/wiki/display/XAP95/Cassandra+Space+Persistency
    • Check out the source for the XAP Spout and a sample state implementation backed by XAP, and a Storm friendly streaming implementation on github: https://github.com/Gigaspaces/storm-integration
    • For more background on the effort, check out my recent blog posts at http://blog.gigaspaces.com/
      – http://blog.gigaspaces.com/gigaspaces-and-storm-part-1-storm-clouds/
      – http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/
      – Part 3 coming soon.
  • Twitter Storm With Cassandra
  • Storm Overview
  • Storm Concepts
    • Streams
      – Unbounded sequence of tuples
    • Spouts
      – Source of streams (Queues)
    • Bolts
      – Functions, Filters, Joins, Aggregations
    • Topologies
    Diagram labels: Spouts, Bolt, Topologies
  • Challenge – Word Count
    Diagram labels: Tweets, Count, Word:Count
    • Hottest topics
    • URL mentions
    • etc.
  • Streaming word count with Storm
  • Supercharging Storm
    • Storm doesn’t supply persistence, but provides for it
    • Storm optimizes IO to slow persistence (e.g. databases) using batching.
    • Storm processes streams. The stream provider itself needs to support persistency, batching, and reliability.
    Diagram label: Tweets, events, whatever…
  • XAP Real Time Analytics
  • Two Layer Approach
    • Advantage: Minimal “impedance mismatch” between layers.
      – Both NoSQL cluster technologies, with similar advantages
    • Grid layer serves as an in memory cache for interactive requests.
    • Grid layer serves as a real time computation fabric for CEP, and limited (to allocated memory) real time distributed query capability.
    Diagram labels: Raw Event Stream, In Memory Compute Cluster, Real Time Events, Raw And Derived Events, NoSQL Cluster, Reporting Engine, SCALE
  • Simplified Architecture
  • Key Concepts
    • Flowing event streams through memory for side effects
    • Event driven architecture executing in-memory
    • Raw events flushed, aggregations/derivations retained
    • All layers horizontally scalable
    • All layers highly available
    • Real-time analytics & cached batch analytics on same scalable layer
    • Data grid provides a transactional/consistent façade on NoSQL store (in this case eliminating SQL database entirely)
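"Raw events flushed, aggregations/derivations retained" can be sketched as a handler that updates an in-memory aggregate for every event and periodically moves the raw events out to the slower store. The class name `EventAggregator`, the `flush_size` threshold, and the list standing in for the NoSQL store are assumptions for illustration.

```python
class EventAggregator:
    """Hypothetical handler: keeps aggregates in memory, evicts raw events."""

    def __init__(self, archive, flush_size=3):
        self.archive = archive        # stand-in for the NoSQL cluster
        self.flush_size = flush_size
        self.raw = []                 # raw events waiting to be flushed
        self.counts = {}              # derived data that stays in memory

    def on_event(self, event):
        topic = event["topic"]
        self.counts[topic] = self.counts.get(topic, 0) + 1
        self.raw.append(event)
        if len(self.raw) >= self.flush_size:
            self.flush()

    def flush(self):
        # Raw events leave memory; the aggregate is untouched.
        self.archive.extend(self.raw)
        self.raw.clear()


archive = []
aggregator = EventAggregator(archive, flush_size=2)
for topic in ["storm", "storm", "xap"]:
    aggregator.on_event({"topic": topic})
```

Memory use is then bounded by the number of distinct aggregates plus at most `flush_size` raw events, rather than by the length of the stream.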
  • Keep Things In Memory
    • Facebook keeps 80% of its data in memory (Stanford research)
    • RAM is 100-1000x faster than disk (random seek)
      – Disk: 5-10 ms
      – RAM: ~0.001 ms
  • Take Aways
    A data grid can serve different needs for big data analytics:
    • Supercharge a dedicated stream processing cluster like Storm.
      – Provide fast, reliable, transactional tuple streams and state
    • Provide a general purpose analytics platform
      – Roll your own
    • Simplify overall architecture while enhancing scalability
      – Ultra high performance/low latency
      – Dynamically scalable processing and in-memory storage
      – Eliminate messaging tier
      – Eliminate or minimize need for RDBMS
  • References
    • Realtime Analytics with Storm and Hadoop: http://www.slideshare.net/Hadoop_Summit/realtime-analytics-with-storm
    • Learn and fork the code on github: https://github.com/Gigaspaces/storm-integration
    • Twitter Storm: http://storm-project.net
    • XAP + Storm Detailed Blog Post: http://blog.gigaspaces.com/gigaspaces-and-storm-part-2-xap-integration/