
C*ollege Credit: CEP Distributed Processing on Cassandra with Storm



Cassandra provides facilities to integrate with Hadoop. This is sufficient for distributed batch processing, but it doesn’t address distributed complex event processing (CEP). This webinar will demonstrate the use of Cassandra in Storm. Storm provides a data flow and processing layer that can be used to integrate Cassandra with other external persistence mechanisms (e.g. Elasticsearch) or to calculate dimensional counts for reporting and dashboards. We’ll dive into a sample Storm topology that reads from and writes to Cassandra using storm-cassandra bolts.

Published in: Technology


  1. Brian O’Neill, Lead Architect, Health Market Science; Taylor Goetz, Development Lead, Health Market, @ptgoetz
  2. Agenda: Use Case. What is CEP? Why? What for? Storm: Background; Cluster configuration. Examples / Demo. Future: Trident.
  3. Our Products: Master Data Management. Good, bad doctors? Prescriber eligibility and remediation.
  4. Cassandra to the Rescue: Big Data for us == Variety of Data. (Diagram labels: 1000’s of Feeds, C*, Masterfile, Δt)
  5. But… Search unstructured data. Real-time Analytics / Reporting. Transactional Processing: changes reflected immediately; wide-row indexes.
  6. What might that look like? Provide for Polyglot Persistence. (Diagram labels: wide-row index, C*, I’m happy, RDBMS)
  7. What we did wrong… (part 1): Could not react to transactional changes. Needed extra logic to track what changed. Took too long.
  8. What we did wrong… (part 2): Wide Row …. C* Indexing TRIGGERS. Good Intentions: take the onus off the clients. Bad Result: guaranteeing execution; write overhead.
  9. What is Complex Event Processing? Event processing is a method of tracking and analyzing (processing) streams of information (data) about things that happen (events),[1] and deriving a conclusion from them. Complex event processing, or CEP, is event processing that combines data from multiple sources[2] to infer events or patterns that suggest more complicated circumstances.
  10. Imagine… Treating CRUD operations as events in a system. Then suddenly, CEP = (ETL or Analytics).
  11. What Storm is to us… A High Throughput Data Processing Pipeline. (Diagram labels, as extracted: Crud Op ETL Dimensional Enrichment Counts Fuzzy SoR RDBMS Index)
  12. Storm Overview: Open-sourced by Twitter in 2011. Distributed realtime computation system. Fault tolerant. Highly scalable. Guaranteed processing. Operates on one or more streams of data (i.e. CEP).
  13. Anatomy of a Storm Cluster: Nimbus: master node. ZooKeeper: cluster coordination. Supervisors: worker nodes.
  14. Storm Components: Spouts: stream sources. Bolts: unit of computation. Topologies: combination of n Spouts and n Bolts; defines the overall “Computation”.
  15. Storm Spouts: Represent a source (stream) of data: queues (JMS, Kafka, Kestrel, etc.); Twitter Firehose; sensor data. Emit “Tuples” (events) based on the source: the primary Storm data structure; a set of key-value pairs.
  16. Storm Bolts: Receive Tuples from Spouts or other Bolts. Operate on, or react to, data: functions/filters/joins/aggregations; database writes/lookups. Optionally emit additional Tuples.
  17. Storm Topologies: Data flow between spouts and bolts. Routing of Tuples between spouts/bolts: stream “groupings”. Parallelism of components. Long-lived.
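A stream grouping decides which bolt task receives each tuple. The sketch below illustrates the idea behind a fields grouping, where tuples with the same field value are always routed to the same task, so per-key state such as a word count can live in one place. It is plain Java with no Storm dependency; the class name, method names, and the hash scheme are assumptions for illustration and are not Storm's actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of fields-grouping-style routing. Not Storm code:
// the names and the hash scheme are assumptions for illustration only.
final class FieldsGroupingSketch {

    // Pick a task index for a grouping key; equal keys always map to the same task.
    static int taskFor(String key, int numTasks) {
        return Math.floorMod(key.hashCode(), numTasks);
    }

    // Distribute words across numTasks buckets, one bucket per (hypothetical) bolt task.
    static List<List<String>> route(List<String> words, int numTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String word : words) {
            buckets.get(taskFor(word, numTasks)).add(word);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<List<String>> buckets =
                route(List.of("storm", "cassandra", "storm", "trident"), 3);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("task " + i + " <- " + buckets.get(i));
        }
    }
}
```

Both occurrences of "storm" land in the same bucket, which is the property a downstream counting bolt would rely on.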
  18. Storm and Cassandra: Use cases: Write Storm Tuple data to C*: computation results; pre-computed indices. Read data from C* and emit Storm Tuples: dynamic lookups.
  19. Storm Cassandra Bolt Types: CassandraBolt: writes data to Cassandra; available in batching and non-batching variants. CassandraLookupBolt: reads data from Cassandra. (Diagram labels: CassandraBolt, CassandraLookupBolt, C*)
  20. Storm-Cassandra Project: Provides generic Bolts for writing/reading Storm Tuples to/from C*. (Diagram labels, as extracted: Tuple, Tuple Mapper, Rows, Tuples, Columns Mapper, Columns, C*)
  21. Storm-Cassandra Project: TupleMapper interface: tells the CassandraBolt how to write a tuple to an arbitrary data model. Given a Storm Tuple: map to column family; map to row key; map to columns.
  22. Storm-Cassandra Project: ColumnsMapper interface: tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple. Given a C* row key and list of columns: return a list of Storm Tuples.
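The three mapping steps named for the TupleMapper (column family, row key, columns) can be sketched in plain Java. This is only an illustration of the idea: the `Mapper` interface, its method names, the `word_counts` column family, and the use of `Map<String, Object>` as a stand-in for a Storm Tuple are all assumptions, not the storm-cassandra API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the tuple-mapping idea. None of these names come
// from storm-cassandra; a Map<String, Object> plays the role of a Storm Tuple.
final class TupleMapperSketch {

    // Given a tuple, decide where and how it is stored.
    interface Mapper {
        String columnFamilyFor(Map<String, Object> tuple);
        String rowKeyFor(Map<String, Object> tuple);
        Map<String, String> columnsFor(Map<String, Object> tuple);
    }

    // Maps a {word, count} tuple to row key = word with a single "count" column.
    static final class WordCountMapper implements Mapper {
        @Override
        public String columnFamilyFor(Map<String, Object> tuple) {
            return "word_counts"; // assumed column family name
        }

        @Override
        public String rowKeyFor(Map<String, Object> tuple) {
            return String.valueOf(tuple.get("word"));
        }

        @Override
        public Map<String, String> columnsFor(Map<String, Object> tuple) {
            Map<String, String> columns = new LinkedHashMap<>();
            columns.put("count", String.valueOf(tuple.get("count")));
            return columns;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("word", "storm");
        tuple.put("count", 3L);

        Mapper mapper = new WordCountMapper();
        System.out.println(mapper.columnFamilyFor(tuple)
                + "[" + mapper.rowKeyFor(tuple) + "] = " + mapper.columnsFor(tuple));
    }
}
```

Keeping the data model in a mapper rather than in the bolt is what lets one generic bolt write to arbitrary column families.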
  23. Storm-Cassandra Project: Current state: Version 0.4.0-WIP. Uses Astyanax client. Several out-of-the-box *Mapper implementations: basic key-value columns; value-less columns; counter columns; lookup by row key; lookup by range query. Initial pass at Trident support. Initial pass at composite column support.
  24. Storm-Cassandra Project: Future plans: Switch to CQL (???). Full Trident support.
  25. Word Count Demo
  26. DRPC
  27. Reach Demo
  28. 28. Trident Provides a higher-level abstraction for stream processing  Constructs for state management and Batching Adds additional primitives that abstract away common topological patterns Deprecates transactional topologies Distributes with Storm
  29. 29. Sample Trident Operations Partition Local  Functions ( execute(x)  x + y )  Filters ( isKeep(x)  0,x )  PartitionAggregate ○ Combiner ( pairwise combining ) ○ Reducer ( iterative accumulation ) ○ Aggregator ( byoa )
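The difference between "pairwise combining" and "iterative accumulation" can be sketched in plain Java. The `Combiner` and `Reducer` interfaces below are simplified, hypothetical stand-ins written for this example (they take `long` values instead of Trident tuples) and should not be read as Trident's own interfaces.

```java
import java.util.List;

// Simplified, hypothetical stand-ins for the two aggregation styles.
final class AggregatorSketch {

    // Combiner style: each value becomes a partial result; partials merge pairwise.
    interface Combiner<T> {
        T init(long value);
        T combine(T left, T right);
        T zero();
    }

    // Reducer style: a single accumulator is updated once per value, in order.
    interface Reducer<T> {
        T init();
        T reduce(T current, long value);
    }

    static final Combiner<Long> SUM_COMBINER = new Combiner<Long>() {
        @Override public Long init(long value) { return value; }
        @Override public Long combine(Long left, Long right) { return left + right; }
        @Override public Long zero() { return 0L; }
    };

    static final Reducer<Long> SUM_REDUCER = new Reducer<Long>() {
        @Override public Long init() { return 0L; }
        @Override public Long reduce(Long current, long value) { return current + value; }
    };

    static <T> T runCombiner(Combiner<T> combiner, List<Long> values) {
        T acc = combiner.zero();
        for (long v : values) {
            acc = combiner.combine(acc, combiner.init(v));
        }
        return acc;
    }

    static <T> T runReducer(Reducer<T> reducer, List<Long> values) {
        T acc = reducer.init();
        for (long v : values) {
            acc = reducer.reduce(acc, v);
        }
        return acc;
    }

    public static void main(String[] args) {
        // Two partitions are summed separately, then their partials are combined.
        Long left = runCombiner(SUM_COMBINER, List.of(1L, 2L));
        Long right = runCombiner(SUM_COMBINER, List.of(3L, 4L));
        System.out.println("combined = " + SUM_COMBINER.combine(left, right));
        System.out.println("reduced  = " + runReducer(SUM_REDUCER, List.of(1L, 2L, 3L, 4L)));
    }
}
```

Because partial results can be merged in any pairing, the combiner style allows each partition to pre-aggregate locally before results are brought together; the reducer style needs all values to pass through one accumulator.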
  30. A sample topology: TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout).each(new Fields("sentence"), new Split(), new Fields("word")).groupBy(new Fields("word")).persistentAggregate(MemcachedState.opaque(serverLocations), new Count(), new Fields("count")).parallelismHint(6);
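To make the data flow of the sample topology concrete, here is a plain-Java sketch of what it computes: split each sentence into words, group by word, and keep a running count per word. It runs in a single process with an in-memory map, so it shows only the logic, not Trident's distribution, batching, or persistent state; the class and method names are made up for this example.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of the split -> group by word -> count pipeline.
// Names are hypothetical; an in-memory map stands in for the persistent state.
final class WordCountSketch {

    static Map<String, Long> countWords(List<String> sentences) {
        Map<String, Long> counts = new TreeMap<>();
        for (String sentence : sentences) {
            // "Split": one word per whitespace-separated token.
            for (String word : sentence.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue; // a blank sentence yields one empty token
                }
                // "groupBy" + "Count": accumulate per distinct word.
                counts.merge(word, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = countWords(List.of(
                "the cow jumped over the moon",
                "the man went to the store"));
        counts.forEach((word, count) -> System.out.println(word + " = " + count));
    }
}
```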
  31. Trident State: Sequenced writes by batch/transaction id. Spouts: Transactional: batch contents never change. Opaque: batch contents can change. State: Transactional: store tx_id with counts to maintain sequencing of writes. Opaque: store previous value in order to overwrite the current value when contents of a batch change.
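The two state strategies can be sketched as update rules on a single counter. This is a minimal illustration in plain Java, assuming batches are applied in transaction-id order and using -1 as a "nothing applied yet" marker; the class and field names are hypothetical and no Trident classes are involved.

```java
// Sketch of the two update rules for one stored count. Names are hypothetical,
// and -1 as the initial transaction id is an assumption of this example.
final class TridentStateSketch {

    // Transactional: a replayed batch has identical contents, so if its
    // transaction id is already stored the update is skipped.
    static final class TransactionalCount {
        long txId = -1;
        long count = 0;

        void apply(long batchTxId, long batchCount) {
            if (batchTxId == txId) {
                return; // already applied
            }
            count += batchCount;
            txId = batchTxId;
        }
    }

    // Opaque: a replayed batch may have different contents, so the previous
    // value is kept and the current value is recomputed from it on replay.
    static final class OpaqueCount {
        long txId = -1;
        long prev = 0;
        long curr = 0;

        void apply(long batchTxId, long batchCount) {
            if (batchTxId == txId) {
                curr = prev + batchCount; // overwrite the earlier attempt
            } else {
                prev = curr;
                curr = curr + batchCount;
                txId = batchTxId;
            }
        }
    }

    public static void main(String[] args) {
        TransactionalCount t = new TransactionalCount();
        t.apply(1, 5);
        t.apply(1, 5); // replay of the same batch is ignored
        System.out.println("transactional count = " + t.count);

        OpaqueCount o = new OpaqueCount();
        o.apply(1, 5);
        o.apply(1, 2); // replay with changed contents replaces the first attempt
        System.out.println("opaque count = " + o.curr);
    }
}
```

The opaque rule costs one extra stored value per key, which is the price of tolerating batches whose contents change between attempts.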
  32. Shameless Shoutouts: HMS: storm-cassandra; storm-elastic-search; storm-jdbi (coming soon). ptgoetz: storm-jms; storm-signals.
  33. Brian O’Neill, Lead Architect, Health Market Science; Taylor Goetz, Development Lead, Health Market, @ptgoetz