C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
Upcoming SlideShare
Loading in...5
×

Like this? Share it with your network

Share

C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm

  • 5,795 views
Uploaded on

Cassandra provides facilities to integrate with Hadoop. This is sufficient for distributed batch processing, but doesn’t address CEP distributed processing. This webinar will demonstrate use of......

Cassandra provides facilities to integrate with Hadoop. This is sufficient for distributed batch processing, but doesn’t address CEP distributed processing. This webinar will demonstrate use of Cassandra in Storm. Storm provides a data flow and processing layer that can be used to integrate Cassandra with other external persistences mechanisms (e.g. Elastic Search) or calculate dimensional counts for reporting and dashboards. We’ll dive into a sample Storm topology that reads and writes from Cassandra using storm-cassandra bolts.

More in: Technology
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
No Downloads

Views

Total Views
5,795
On Slideshare
5,770
From Embeds
25
Number of Embeds
2

Actions

Shares
Downloads
137
Comments
1
Likes
13

Embeds 25

https://twitter.com 24
http://searchutil01 1

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
    No notes for slide

Transcript

  • 1. Brian O’Neill Taylor GoetzLead Architect, Health Market Science Development Lead, Health Market Scienceboneill@healthmarketscience.com ptgoetz@healthmarketscience.com@boneill42 @ptgoetz
  • 2. Agenda  Use Case  What is CEP? Why? What for?  Storm  Background  Cluster configuration  Examples / Demo  Future : Trident
  • 3. Our Products Master Data Management  Good, bad doctors? Prescriber eligibility and remediation.
  • 4. Cassandra to the Rescue1000’s of Feeds C* Masterfile Big Data for us == Variety of Data Δt
  • 5. But… Search unstructured data Real-time Analytics / Reporting Transactional Processing  Changes reflected immediately.  Wide-row Indexes
  • 6. What might that look like? wide-row index C* I’m happy RDBMS Provide for Polyglot Persistence
  • 7. What we did wrong… (part 1) Could not react to transactional changes Needed extra logic to track what changed Took too long
  • 8. What we did wrong… (part 2) Wide Row …. C* Indexing TRIGGERS Good Intentions  Take the onus off the clients Bad Result  Guaranteeing execution  Write Overhead
  • 9. What is Complex EventProcessing? Event processing is a method of tracking and analyzing (processing) streams of information (data) about things that happen (events),[1] and deriving a conclusion from them. Complex event processing, or CEP, is event processing that combines data from multiple sources[2] to infer events or patterns that suggest more complicated circumstances. http://en.wikipedia.org/wiki/Complex_event_processing
  • 10. Imagine… Treating CRUD operations as events in a system. Then Suddenly, CEP = (ETL or Analytics)
  • 11. What Storm is to us… Crud Op ETL Dimensional Enrichment Counts Fuzzy SoR RDBMS Index A High Throughput Data Processing Pipeline
  • 12. Storm Overview Open-Sourced by Twitter in 2011 Distributed Realtime Computation System Fault Tolerant Highly Scalable Guaranteed Processing Operates on one or more streams of data (i.e. CEP)
  • 13. Anatomy of a Storm Cluster Nimbus  Master Node Zookeeper  Cluster Coordination Supervisors  Worker Nodes
  • 14. Storm Components Spouts  Stream Sources Bolts  Unit of Computation Topologies  Combination of n Spouts and n Bolts  Defines the overall “Computation”
  • 15. Storm Spouts Represents a source (stream) of data  Queues (JMS, Kafka, Kestrel, etc.)  Twitter Firehose  Sensor Data Emits “Tuples” (Events) based on source  Primary Storm data structure  Set of Key-Value pairs
  • 16. Storm Bolts Receive Tuples from Spouts or other Bolts Operate on, or React to Data  Functions/Filters/Joins/Aggregations  Database writes/lookups Optionally emit additional Tuples
  • 17. Storm Topologies Data flow between spouts and bolts Routing of Tuples between spouts/bolts  Stream “Groupings” Parallelism of Components Long-Lived
  • 18. Storm and Cassandra Use Cases:  Write Storm Tuple data to C* ○ Computation Results ○ Pre-computed indices  Read data from C* and emit Storm Tuples ○ Dynamic Lookups http://github.com/hmsonline/storm-cassandra
  • 19. Storm Cassandra Bolt Types CassandraBolt Cassandra LookupBolt C* CassandraBolt  Writes data to Cassandra  Available in Batching and Non-Batching CassandraLookupBolt  Reads data from Cassandra http://github.com/hmsonline/storm-cassandra
  • 20. Storm-Cassandra Project Provides generic Bolts for writing/reading Storm Tuples to/from C* Tuple Tuple Mapper Rows Tuples Columns Mapper Columns C* http://github.com/hmsonline/storm-cassandra
  • 21. Storm-Cassandra Project TupleMapper Interface  Tells the CassandraBolt how to write a tuple to an arbitrary data model Given a Storm Tuple:  Map to Column Family  Map to Row Key  Map to Columns http://github.com/hmsonline/storm-cassandra
  • 22. Storm-Cassandra Project ColumnsMapper Interface  Tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple Given a C* Row Key and list of Columns:  Return a list of Storm Tuples http://github.com/hmsonline/storm-cassandra
  • 23. Storm-Cassandra Project Current State:  Version 0.4.0-WIP  Uses Astyanax Client  Several out-of-the-box *Mapper Implementations: ○ Basic Key-Value Columns ○ Value-less Columns ○ Counter Columns ○ Lookup by row key ○ Lookup by range query  Initial pass at Trident support  Initial pass at Composite Column Support http://github.com/hmsonline/storm-cassandra
  • 24. Storm-Cassandra Project Future Plans:  Switch to CQL (???)  Full Trident Support http://github.com/hmsonline/storm-cassandra
  • 25. Word Count Demo http://github.com/hmsonline/storm-cassandra
  • 26. DRPC
  • 27. Reach Demo
  • 28. Trident Provides a higher-level abstraction for stream processing  Constructs for state management and Batching Adds additional primitives that abstract away common topological patterns Deprecates transactional topologies Distributes with Storm
  • 29. Sample Trident Operations Partition Local  Functions ( execute(x)  x + y )  Filters ( isKeep(x)  0,x )  PartitionAggregate ○ Combiner ( pairwise combining ) ○ Reducer ( iterative accumulation ) ○ Aggregator ( byoa )
  • 30. A sample topologyTridentTopology topology = new TridentTopology();TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate( MemcachedState.opaque(serverLocations), new Count(), new Fields("count")) .parallelismHint(6); https://github.com/nathanmarz/storm/wiki/Trident-state
  • 31. Trident StateSequenced writes by batch/transaction id. Spouts  Transactional ○ Batch contents never change  Opaque ○ Batch contents can change State  Transactional ○ Store tx_id with counts to maintain sequencing of writes.  Opaque ○ Store previous value in order to overwrite the current value when contents of a batch change.
  • 32. Shameless Shoutouts HMS (https://github.com/hmsonline/)  storm-cassandra  storm-elastic-search  storm-jdbi (coming soon) ptgoetz (https://github.com/ptgoetz)  storm-jms  storm-signals
  • 33. Brian O’Neill Taylor GoetzLead Architect, Health Market Science Development Lead, Health Market Scienceboneill@healthmarketscience.com ptgoetz@healthmarketscience.com@boneill42 @ptgoetz