Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Cassandra and Storm at Health Market Sceince

8,640 views

Published on

Published in: Technology
  • Be the first to comment

Cassandra and Storm at Health Market Sceince

  1. 1. P. Taylor GoetzDevelopment Lead, Health Market Sciencetgoetz@healthmarketscience.com@ptgoetz
  2. 2. Agenda  Cassandra @ HMS  Storm Overview  Storm-Cassandra  Examples / Demo  Future : Trident
  3. 3. Our Products Master Data Management  Good, bad doctors? Prescriber eligibility and remediation.
  4. 4. Cassandra to the Rescue1000’s of Feeds C* Masterfile Big Data for us == Variety of Data Δt
  5. 5. But… Search unstructured data Real-time Analytics / Reporting Transactional Processing  Changes reflected immediately.  Wide-row Indexes
  6. 6. What might that look like? wide-row index C* I’m happy RDBMS Provide for Polyglot Persistence
  7. 7. What we did wrong… Could not react to transactional changes Needed extra logic to track what changed Took too long
  8. 8. C* at HMS Load, Standardize, Match, Consolidate  Write results to C* Track Changes over Time  Practitioner Data  Feed Quality
  9. 9. C* at HMS Best Practices  Prefer Write over Read (esp. Read before Write)  Avoid Queries/Scans ○ Fetch by key whenever possible ○ Put Comparators to work ○ Pre-compute whenever possible
  10. 10. How? Treat All Data as Immutable  Updates are inserts with new version/timestamp Data Model  Heavy use of composites  Timestamps/Versions in Keys Treat Feeds as Real-Time Streams
  11. 11. What Storm is to us… Crud Op ETL Dimensional Enrichment Counts Fuzzy SoR RDBMS Index A High Throughput Data Processing Pipeline
  12. 12. Storm Overview Open-Sourced by Twitter in 2011 Distributed Realtime Computation System Fault Tolerant Highly Scalable Guaranteed Processing Operates on one or more streams of data
  13. 13. Anatomy of a Storm Cluster Nimbus  Master Node Zookeeper  Cluster Coordination Supervisors  Worker Nodes
  14. 14. Storm Primatives Streams  Unbounded sequence of tuples Spouts  Stream Sources Bolts  Unit of Computation Topologies  Combination of n Spouts and n Bolts  Defines the overall “Computation”
  15. 15. Storm Spouts Represents a source (stream) of data  Queues (JMS, Kafka, Kestrel, etc.)  Twitter Firehose  Sensor Data Emits “Tuples” (Events) based on source  Primary Storm data structure  Set of Key-Value pairs
  16. 16. Storm Bolts Receive Tuples from Spouts or other Bolts Operate on, or React to Data  Functions/Filters/Joins/Aggregations  Database writes/lookups Optionally emit additional Tuples
  17. 17. Storm Topologies Data flow between spouts and bolts Routing of Tuples between spouts/bolts  Stream “Groupings” Parallelism of Components Long-Lived
  18. 18. Storm Topologies
  19. 19. Storm and Cassandra Use Cases:  Write Storm Tuple data to C* ○ Computation Results ○ Pre-computed indices  Read data from C* and emit Storm Tuples ○ Dynamic Lookups http://github.com/hmsonline/storm-cassandra
  20. 20. Storm Cassandra BoltTypes CassandraBolt Cassandra LookupBolt C* CassandraBolt  Writes data to Cassandra  Available in Batching and Non-Batching CassandraLookupBolt  Reads data from Cassandra http://github.com/hmsonline/storm-cassandra
  21. 21. Storm-Cassandra Project Provides generic Bolts for writing/reading Storm Tuples to/from C* Tuple Tuple Mapper Rows Tuples Columns Mapper Columns C* http://github.com/hmsonline/storm-cassandra
  22. 22. Storm-Cassandra Project TupleMapper Interface  Tells the CassandraBolt how to write a tuple to an arbitrary data model Given a Storm Tuple:  Map to Column Family  Map to Row Key  Map to Columns http://github.com/hmsonline/storm-cassandra
  23. 23. Storm-Cassandra Project ColumnsMapper Interface  Tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple Given a C* Row Key and list of Columns:  Return a list of Storm Tuples http://github.com/hmsonline/storm-cassandra
  24. 24. Storm-Cassandra Project Current State:  Version 0.4.0  Uses Astyanax Client  Several out-of-the-box *Mapper Implementations: ○ Basic Key-Value Columns ○ Value-less Columns ○ Counter Columns ○ Lookup by row key ○ Lookup by range query  Composite Key/Column Support  Trident support http://github.com/hmsonline/storm-cassandra
  25. 25. Storm-Cassandra Project Future Plans:  Switch to CQL  Enhanced Trident Support http://github.com/hmsonline/storm-cassandra
  26. 26. Word Count Demo http://github.com/hmsonline/storm-cassandra
  27. 27. DRPC
  28. 28. Reach Demo
  29. 29. Trident Provides a higher-level abstraction for stream processing  Constructs for state management and Batching Adds additional primitives that abstract away common topological patterns Deprecates transactional topologies Distributes with Storm
  30. 30. Sample Trident Operations Partition Local  Functions ( execute(x)  x + y )  Filters ( isKeep(x)  0,x )  PartitionAggregate ○ Combiner ( pairwise combining ) ○ Reducer ( iterative accumulation ) ○ Aggregator ( byoa)
  31. 31. A sample topologyTridentTopology topology = new TridentTopology();TridentStatewordCounts =topology.newStream("spout1", spout) .each(new Fields("sentence"),new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(MemcachedState.opaque(serverLocations),new Count(), new Fields("count")) .parallelismHint(6); https://github.com/nathanmarz/storm/wiki/Trident-state
  32. 32. Trident StateSequenced writes by batch/transaction id. Spouts  Transactional ○ Batch contents never change  Opaque ○ Batch contents can change State  Transactional ○ Store tx_id with counts to maintain sequencing of writes.  Opaque ○ Store previous value in order to overwrite the current value when contents of a batch change.
  33. 33. Shameless Shoutouts HMS (https://github.com/hmsonline/)  storm-cassandra  storm-elastic-search  storm-jdbi (coming soon) ptgoetz (https://github.com/ptgoetz)  storm-jms  storm-signals
  34. 34. P. Taylor GoetzDevelopment Lead, Health Market Sciencetgoetz@healthmarketscience.com@ptgoetz

×