Cassandra and Storm at Health Market Sceince

7,337 views

Published on

Published in: Technology
0 Comments
6 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
7,337
On SlideShare
0
From Embeds
0
Number of Embeds
3,122
Actions
Shares
0
Downloads
91
Comments
0
Likes
6
Embeds 0
No embeds

No notes for slide

Cassandra and Storm at Health Market Sceince

  1. 1. P. Taylor GoetzDevelopment Lead, Health Market Sciencetgoetz@healthmarketscience.com@ptgoetz
  2. 2. Agenda  Cassandra @ HMS  Storm Overview  Storm-Cassandra  Examples / Demo  Future : Trident
  3. 3. Our Products Master Data Management  Good, bad doctors? Prescriber eligibility and remediation.
  4. 4. Cassandra to the Rescue1000’s of Feeds C* Masterfile Big Data for us == Variety of Data Δt
  5. 5. But… Search unstructured data Real-time Analytics / Reporting Transactional Processing  Changes reflected immediately.  Wide-row Indexes
  6. 6. What might that look like? wide-row index C* I’m happy RDBMS Provide for Polyglot Persistence
  7. 7. What we did wrong… Could not react to transactional changes Needed extra logic to track what changed Took too long
  8. 8. C* at HMS Load, Standardize, Match, Consolidate  Write results to C* Track Changes over Time  Practitioner Data  Feed Quality
  9. 9. C* at HMS Best Practices  Prefer Write over Read (esp. Read before Write)  Avoid Queries/Scans ○ Fetch by key whenever possible ○ Put Comparators to work ○ Pre-compute whenever possible
  10. 10. How? Treat All Data as Immutable  Updates are inserts with new version/timestamp Data Model  Heavy use of composites  Timestamps/Versions in Keys Treat Feeds as Real-Time Streams
  11. 11. What Storm is to us… Crud Op ETL Dimensional Enrichment Counts Fuzzy SoR RDBMS Index A High Throughput Data Processing Pipeline
  12. 12. Storm Overview Open-Sourced by Twitter in 2011 Distributed Realtime Computation System Fault Tolerant Highly Scalable Guaranteed Processing Operates on one or more streams of data
  13. 13. Anatomy of a Storm Cluster Nimbus  Master Node Zookeeper  Cluster Coordination Supervisors  Worker Nodes
  14. 14. Storm Primatives Streams  Unbounded sequence of tuples Spouts  Stream Sources Bolts  Unit of Computation Topologies  Combination of n Spouts and n Bolts  Defines the overall “Computation”
  15. 15. Storm Spouts Represents a source (stream) of data  Queues (JMS, Kafka, Kestrel, etc.)  Twitter Firehose  Sensor Data Emits “Tuples” (Events) based on source  Primary Storm data structure  Set of Key-Value pairs
  16. 16. Storm Bolts Receive Tuples from Spouts or other Bolts Operate on, or React to Data  Functions/Filters/Joins/Aggregations  Database writes/lookups Optionally emit additional Tuples
  17. 17. Storm Topologies Data flow between spouts and bolts Routing of Tuples between spouts/bolts  Stream “Groupings” Parallelism of Components Long-Lived
  18. 18. Storm Topologies
  19. 19. Storm and Cassandra Use Cases:  Write Storm Tuple data to C* ○ Computation Results ○ Pre-computed indices  Read data from C* and emit Storm Tuples ○ Dynamic Lookups http://github.com/hmsonline/storm-cassandra
  20. 20. Storm Cassandra BoltTypes CassandraBolt Cassandra LookupBolt C* CassandraBolt  Writes data to Cassandra  Available in Batching and Non-Batching CassandraLookupBolt  Reads data from Cassandra http://github.com/hmsonline/storm-cassandra
  21. 21. Storm-Cassandra Project Provides generic Bolts for writing/reading Storm Tuples to/from C* Tuple Tuple Mapper Rows Tuples Columns Mapper Columns C* http://github.com/hmsonline/storm-cassandra
  22. 22. Storm-Cassandra Project TupleMapper Interface  Tells the CassandraBolt how to write a tuple to an arbitrary data model Given a Storm Tuple:  Map to Column Family  Map to Row Key  Map to Columns http://github.com/hmsonline/storm-cassandra
  23. 23. Storm-Cassandra Project ColumnsMapper Interface  Tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple Given a C* Row Key and list of Columns:  Return a list of Storm Tuples http://github.com/hmsonline/storm-cassandra
  24. 24. Storm-Cassandra Project Current State:  Version 0.4.0  Uses Astyanax Client  Several out-of-the-box *Mapper Implementations: ○ Basic Key-Value Columns ○ Value-less Columns ○ Counter Columns ○ Lookup by row key ○ Lookup by range query  Composite Key/Column Support  Trident support http://github.com/hmsonline/storm-cassandra
  25. 25. Storm-Cassandra Project Future Plans:  Switch to CQL  Enhanced Trident Support http://github.com/hmsonline/storm-cassandra
  26. 26. Word Count Demo http://github.com/hmsonline/storm-cassandra
  27. 27. DRPC
  28. 28. Reach Demo
  29. 29. Trident Provides a higher-level abstraction for stream processing  Constructs for state management and Batching Adds additional primitives that abstract away common topological patterns Deprecates transactional topologies Distributes with Storm
  30. 30. Sample Trident Operations Partition Local  Functions ( execute(x)  x + y )  Filters ( isKeep(x)  0,x )  PartitionAggregate ○ Combiner ( pairwise combining ) ○ Reducer ( iterative accumulation ) ○ Aggregator ( byoa)
  31. 31. A sample topologyTridentTopology topology = new TridentTopology();TridentStatewordCounts =topology.newStream("spout1", spout) .each(new Fields("sentence"),new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate(MemcachedState.opaque(serverLocations),new Count(), new Fields("count")) .parallelismHint(6); https://github.com/nathanmarz/storm/wiki/Trident-state
  32. 32. Trident StateSequenced writes by batch/transaction id. Spouts  Transactional ○ Batch contents never change  Opaque ○ Batch contents can change State  Transactional ○ Store tx_id with counts to maintain sequencing of writes.  Opaque ○ Store previous value in order to overwrite the current value when contents of a batch change.
  33. 33. Shameless Shoutouts HMS (https://github.com/hmsonline/)  storm-cassandra  storm-elastic-search  storm-jdbi (coming soon) ptgoetz (https://github.com/ptgoetz)  storm-jms  storm-signals
  34. 34. P. Taylor GoetzDevelopment Lead, Health Market Sciencetgoetz@healthmarketscience.com@ptgoetz

×