Brian O’Neill                           Taylor GoetzLead Architect, Health Market Science   Development Lead, Health Marke...
Agenda     Use   Case      What is CEP? Why? What for?     Storm      Background      Cluster configuration     Exam...
Our Products   Master Data Management     Good, bad doctors?   Prescriber eligibility and remediation.
Cassandra to the Rescue1000’s of Feeds                  C*                                   Masterfile                  B...
But… Search unstructured data Real-time Analytics / Reporting Transactional Processing     Changes reflected immediate...
What might that look like?                wide-row index    C*                                          I’m happy     RDBM...
What we did wrong… (part 1) Could not react to transactional changes Needed extra logic to track what changed Took too ...
What we did wrong… (part 2)                                    Wide Row                                    ….      C*     ...
What is Complex EventProcessing? Event processing is a method of tracking and  analyzing (processing) streams of informat...
Imagine…   Treating CRUD operations as events in a    system.   Then Suddenly,        CEP = (ETL or Analytics)
What Storm is to us…  Crud   Op     ETL   Dimensional                              Enrichment                  Counts     ...
Storm Overview Open-Sourced by Twitter in 2011 Distributed Realtime Computation System Fault Tolerant Highly Scalable...
Anatomy of a Storm Cluster   Nimbus     Master Node   Zookeeper     Cluster Coordination   Supervisors     Worker No...
Storm Components   Spouts     Stream Sources   Bolts     Unit of Computation   Topologies     Combination of n Spout...
Storm Spouts   Represents a source (stream) of data     Queues (JMS, Kafka, Kestrel, etc.)     Twitter Firehose     Se...
Storm Bolts Receive Tuples from Spouts or other Bolts Operate on, or React to Data     Functions/Filters/Joins/Aggregat...
Storm Topologies Data flow between spouts and bolts Routing of Tuples between spouts/bolts     Stream “Groupings” Para...
Storm and Cassandra   Use Cases:     Write Storm Tuple data to C*      ○ Computation Results      ○ Pre-computed indices...
Storm Cassandra Bolt Types                      CassandraBolt                        Cassandra                        Look...
Storm-Cassandra Project   Provides generic Bolts for writing/reading    Storm Tuples to/from C*                          ...
Storm-Cassandra Project   TupleMapper Interface     Tells the CassandraBolt how to write a tuple to an     arbitrary dat...
Storm-Cassandra Project   ColumnsMapper Interface     Tells the CassandraLookupBolt how to transform a     C* row into a...
Storm-Cassandra Project   Current State:     Version 0.4.0-WIP     Uses Astyanax Client     Several out-of-the-box *Ma...
Storm-Cassandra Project   Future Plans:     Switch to CQL (???)     Full Trident Support                       http://g...
Word Count Demo          http://github.com/hmsonline/storm-cassandra
DRPC
Reach Demo
Trident   Provides a higher-level abstraction for stream    processing     Constructs for state management and Batching...
Sample Trident Operations   Partition Local     Functions      ( execute(x)  x + y )     Filters        ( isKeep(x)  ...
A sample topologyTridentTopology topology = new TridentTopology();TridentState wordCounts =    topology.newStream("spout1"...
Trident StateSequenced writes by batch/transaction id.   Spouts     Transactional      ○ Batch contents never change    ...
Shameless Shoutouts   HMS (https://github.com/hmsonline/)     storm-cassandra     storm-elastic-search     storm-jdbi ...
Brian O’Neill                           Taylor GoetzLead Architect, Health Market Science   Development Lead, Health Marke...
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
Upcoming SlideShare
Loading in...5
×

C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

5,021

Published on

Cassandra provides facilities to integrate with Hadoop. This is sufficient for distributed batch processing, but doesn’t address CEP distributed processing. This webinar will demonstrate use of Cassandra in Storm. Storm provides a data flow and processing layer that can be used to integrate Cassandra with other external persistences mechanisms (e.g. Elastic Search) or calculate dimensional counts for reporting and dashboards. We’ll dive into a sample Storm topology that reads and writes from Cassandra using storm-cassandra bolts.

Published in: Technology
1 Comment
13 Likes
Statistics
Notes
No Downloads
Views
Total Views
5,021
On Slideshare
0
From Embeds
0
Number of Embeds
3
Actions
Shares
0
Downloads
149
Comments
1
Likes
13
Embeds 0
No embeds

No notes for slide

C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

  1. 1. Brian O’Neill Taylor GoetzLead Architect, Health Market Science Development Lead, Health Market Scienceboneill@healthmarketscience.com ptgoetz@healthmarketscience.com@boneill42 @ptgoetz
  2. 2. Agenda  Use Case  What is CEP? Why? What for?  Storm  Background  Cluster configuration  Examples / Demo  Future : Trident
  3. 3. Our Products Master Data Management  Good, bad doctors? Prescriber eligibility and remediation.
  4. 4. Cassandra to the Rescue1000’s of Feeds C* Masterfile Big Data for us == Variety of Data Δt
  5. 5. But… Search unstructured data Real-time Analytics / Reporting Transactional Processing  Changes reflected immediately.  Wide-row Indexes
  6. 6. What might that look like? wide-row index C* I’m happy RDBMS Provide for Polyglot Persistence
  7. 7. What we did wrong… (part 1) Could not react to transactional changes Needed extra logic to track what changed Took too long
  8. 8. What we did wrong… (part 2) Wide Row …. C* Indexing TRIGGERS Good Intentions  Take the onus off the clients Bad Result  Guaranteeing execution  Write Overhead
  9. 9. What is Complex EventProcessing? Event processing is a method of tracking and analyzing (processing) streams of information (data) about things that happen (events),[1] and deriving a conclusion from them. Complex event processing, or CEP, is event processing that combines data from multiple sources[2] to infer events or patterns that suggest more complicated circumstances. http://en.wikipedia.org/wiki/Complex_event_processing
  10. 10. Imagine… Treating CRUD operations as events in a system. Then Suddenly, CEP = (ETL or Analytics)
  11. 11. What Storm is to us… Crud Op ETL Dimensional Enrichment Counts Fuzzy SoR RDBMS Index A High Throughput Data Processing Pipeline
  12. 12. Storm Overview Open-Sourced by Twitter in 2011 Distributed Realtime Computation System Fault Tolerant Highly Scalable Guaranteed Processing Operates on one or more streams of data (i.e. CEP)
  13. 13. Anatomy of a Storm Cluster Nimbus  Master Node Zookeeper  Cluster Coordination Supervisors  Worker Nodes
  14. 14. Storm Components Spouts  Stream Sources Bolts  Unit of Computation Topologies  Combination of n Spouts and n Bolts  Defines the overall “Computation”
  15. 15. Storm Spouts Represents a source (stream) of data  Queues (JMS, Kafka, Kestrel, etc.)  Twitter Firehose  Sensor Data Emits “Tuples” (Events) based on source  Primary Storm data structure  Set of Key-Value pairs
  16. 16. Storm Bolts Receive Tuples from Spouts or other Bolts Operate on, or React to Data  Functions/Filters/Joins/Aggregations  Database writes/lookups Optionally emit additional Tuples
  17. 17. Storm Topologies Data flow between spouts and bolts Routing of Tuples between spouts/bolts  Stream “Groupings” Parallelism of Components Long-Lived
  18. 18. Storm and Cassandra Use Cases:  Write Storm Tuple data to C* ○ Computation Results ○ Pre-computed indices  Read data from C* and emit Storm Tuples ○ Dynamic Lookups http://github.com/hmsonline/storm-cassandra
  19. 19. Storm Cassandra Bolt Types CassandraBolt Cassandra LookupBolt C* CassandraBolt  Writes data to Cassandra  Available in Batching and Non-Batching CassandraLookupBolt  Reads data from Cassandra http://github.com/hmsonline/storm-cassandra
  20. 20. Storm-Cassandra Project Provides generic Bolts for writing/reading Storm Tuples to/from C* Tuple Tuple Mapper Rows Tuples Columns Mapper Columns C* http://github.com/hmsonline/storm-cassandra
  21. 21. Storm-Cassandra Project TupleMapper Interface  Tells the CassandraBolt how to write a tuple to an arbitrary data model Given a Storm Tuple:  Map to Column Family  Map to Row Key  Map to Columns http://github.com/hmsonline/storm-cassandra
  22. 22. Storm-Cassandra Project ColumnsMapper Interface  Tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple Given a C* Row Key and list of Columns:  Return a list of Storm Tuples http://github.com/hmsonline/storm-cassandra
  23. 23. Storm-Cassandra Project Current State:  Version 0.4.0-WIP  Uses Astyanax Client  Several out-of-the-box *Mapper Implementations: ○ Basic Key-Value Columns ○ Value-less Columns ○ Counter Columns ○ Lookup by row key ○ Lookup by range query  Initial pass at Trident support  Initial pass at Composite Column Support http://github.com/hmsonline/storm-cassandra
  24. 24. Storm-Cassandra Project Future Plans:  Switch to CQL (???)  Full Trident Support http://github.com/hmsonline/storm-cassandra
  25. 25. Word Count Demo http://github.com/hmsonline/storm-cassandra
  26. 26. DRPC
  27. 27. Reach Demo
  28. 28. Trident Provides a higher-level abstraction for stream processing  Constructs for state management and Batching Adds additional primitives that abstract away common topological patterns Deprecates transactional topologies Distributes with Storm
  29. 29. Sample Trident Operations Partition Local  Functions ( execute(x)  x + y )  Filters ( isKeep(x)  0,x )  PartitionAggregate ○ Combiner ( pairwise combining ) ○ Reducer ( iterative accumulation ) ○ Aggregator ( byoa )
  30. 30. A sample topologyTridentTopology topology = new TridentTopology();TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate( MemcachedState.opaque(serverLocations), new Count(), new Fields("count")) .parallelismHint(6); https://github.com/nathanmarz/storm/wiki/Trident-state
  31. 31. Trident StateSequenced writes by batch/transaction id. Spouts  Transactional ○ Batch contents never change  Opaque ○ Batch contents can change State  Transactional ○ Store tx_id with counts to maintain sequencing of writes.  Opaque ○ Store previous value in order to overwrite the current value when contents of a batch change.
  32. 32. Shameless Shoutouts HMS (https://github.com/hmsonline/)  storm-cassandra  storm-elastic-search  storm-jdbi (coming soon) ptgoetz (https://github.com/ptgoetz)  storm-jms  storm-signals
  33. 33. Brian O’Neill Taylor GoetzLead Architect, Health Market Science Development Lead, Health Market Scienceboneill@healthmarketscience.com ptgoetz@healthmarketscience.com@boneill42 @ptgoetz
  1. A particular slide catching your eye?

    Clipping is a handy way to collect important slides you want to go back to later.

×