The Big Data Quadfecta

A successful Big Data platform combines distributed processing and polyglot persistence into a single cohesive infrastructure. Over the past few years, Health Market Science has transitioned from traditional relational databases and enterprise systems to a massively scalable Big Data platform that combines Cassandra and Storm to ingest thousands of data feeds from the health market industry and produce a single high-quality masterfile. Hear how we applied event processing and NoSQL to deliver real-time analytics while accommodating structural change over time and supporting fuzzy and geospatial search.


  1. 1• 800.593.4467 • info@healthmarketscience.com The Big Data Quadfecta. Brian O’Neill, Lead Architect, Health Market Science (@boneill42, bone@alumni.brown.edu); Taylor Goetz, Development Lead, Health Market Science (@ptgoetz, ptgoetz@gmail.com)
  2. Quadfecta? 1. Quadfecta • A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in. • Kafka • Storm • Elastic Search • Cassandra
  3. The 3 V’s: Volume, Variety, Velocity
  4. The Use Case
  5. Our Mission: prescriber eligibility and remediation; eliminate fraud, waste and abuse; insights into the healthcare space
  6. The Business. Master Data (Health Care Provider & Facilities), Variety/Velocity: >l2000 of sources; 6 million unique HCPs; 10+ years history. Data challenges: constant change in real world data; conflicting & partial info; frequent changes to source structure; authoritative sources vs. crowdsource; predicting source quality. Medical Claims Data (Medical Procedures & Diagnosis), Volume/Velocity: ~1B claims annually; +5B records annually; 5+ years history. Data challenges: sources have incomplete capture; overlapping source data; statistical projections & biases; social media type relationships (Influencer Networks). Solutions: CompleteView, Expense Manager, CompleteSpend, Prescriber Eligibility/Remediation, Analytics.
  7. Our Solutions (diagram). Business Needs: Sales & Marketing; Compliance; Business Systems; Finance & Legal. Solutions: Provider Data; Data Assessment, Integration & Enrichment Services; Compliance; Market Intelligence. Advanced Technology: Master Data Management. HMS Authoritative Sources: Medical Claims; Federal; State; Web Derived; PDC.
  8. Datacenter: ¾ petabytes of raw storage; virtualized (VMware); on a SAN. Should we go physical???
  9. Under the Hood (diagram labels): User Interface; Web Services; Interfacing; “I’m happy”; Analytics; Dashboard / Reports; Visualization; Match; Customer; Consolidate; Indexing; Relational; Web; Structured Storage; Standardize; Government; Validate; NoSQL; Graph(s); Data Sources; Distributed Processing; Flexible Storage
  10. Master Data Management (diagram): Harvested · f_address ∈ F @ t0 · Government · f_license ∈ F @ t5 · Private · f_sanction ∈ F @ t1 · f_sanction ∈ F @ t4 · Schema Change!
  11. The Design
  12. System of Record: Flexibility (Variety); Scalability (Velocity + Volume)
  13. (no text)
  14. Design Principles. Patterns: idempotent operations (elegantly handle replay); immutable data (assertions of facts over time). Anti-patterns: transactions / locking.
  15. State / Counting: exactly-once semantics for state; create small batches; order batches. Batches arrive as: Batch 1 = 4, Batch 3 = 13, Batch 2 = 6, Batch 3′ = 13. Running total (Σ) by batch: 1 → 4; 3 → 4 (wait); 2 → 10 (+6); 3 → 23 (+13); 3′ → 23 (+0).
  16. What we did wrong… Could not react to transactional changes; needed extra logic to track what changed; took too long.
  17. What we did wrong… (II): AOP-based triggers. Worked well initially. Business processes captured as side-effects.
  18. What we did right: REST APIs for loose coupling. See Virgil: https://github.com/hmsonline/virgil But really… watch out for Intravert: https://github.com/zznate/intravert-ug
  19. Kafka • Millions of Messages • Replay Enabled • No transactions / Lightning Fast
  20. Elastic Search • Edit Distance / Soundex • Native Scalability • Fuzzy Search • Geospatial • Facets
  21. Storm • Guaranteed once semantics • Well-designed processing abstraction • Beats BYODP • Momentum
  22. The System (diagram). Labels: REST API; Kafka Queue(s), Offset; C*; Elastic Search (ES1, ES2); A, B, C. Callouts: “NP. Rewind!”; “NP. We can route around it.”; “NP. Replication Factor > 1.”
  23. What comes after Quadfecta?
  24. Real-Time Integration: Real-time CRUD via Web Services; DRPC; “Real-time” Queue; Not quite sure?
  25. The Storm/C* Bridge
  26. Anatomy of a Storm Cluster: Nimbus (master node); Zookeeper (cluster coordination); Supervisors (worker nodes)
  27. Storm Primitives: Streams (unbounded sequence of tuples); Spouts (stream sources); Bolts (unit of computation); Topologies (combination of n Spouts and n Bolts; defines the overall “Computation”)
  28. Storm Spouts: represent a source (stream) of data, such as queues (JMS, Kafka, Kestrel, etc.), the Twitter Firehose, or sensor data. Emit “Tuples” (events) based on the source. The Tuple is the primary Storm data structure: a set of key-value pairs.
  29. Storm Bolts: receive Tuples from Spouts or other Bolts; operate on, or react to, data (functions/filters/joins/aggregations, database writes/lookups); optionally emit additional Tuples.
  30. Storm Topologies: data flow between spouts and bolts; routing of Tuples between spouts/bolts (stream “groupings”); parallelism of components; long-lived.
  31. Storm Topologies
  32. Storm and Cassandra. Use cases: write Storm Tuple data to C* (computation results, pre-compute indices); read data from C* and emit Storm Tuples (dynamic lookups). http://github.com/hmsonline/storm-cassandra
  33. Storm Cassandra Bolt Types. CassandraBolt: writes data to Cassandra; available in batching and non-batching. CassandraLookupBolt: reads data from Cassandra. http://github.com/hmsonline/storm-cassandra
  34. Storm-Cassandra Project: provides generic Bolts for writing/reading Storm Tuples to/from C*. Diagram labels: Tuple, TupleMapper, Columns, C*; Rows, ColumnsMapper, Tuples. http://github.com/hmsonline/storm-cassandra
  35. Storm-Cassandra Project, TupleMapper interface: tells the CassandraBolt how to write a tuple to an arbitrary data model. Given a Storm Tuple: map to column family; map to row key; map to columns. http://github.com/hmsonline/storm-cassandra
  36. Storm-Cassandra Project, ColumnsMapper interface: tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple. Given a C* row key and list of columns: return a list of Storm Tuples. http://github.com/hmsonline/storm-cassandra
  37. Storm-Cassandra Project, current state: version 0.4.0; uses Astyanax client; several out-of-the-box *Mapper implementations (basic key-value columns, value-less columns, counter columns, lookup by row key, lookup by range query); composite key/column support; Trident support. http://github.com/hmsonline/storm-cassandra
  38. Storm-Cassandra Project, future plans: switch to CQL; enhanced Trident support. http://github.com/hmsonline/storm-cassandra
  39. Persistent Word Count http://github.com/hmsonline/storm-cassandra
  40. DRPC
  41. “Reach” Computation
  42. MDM Topology* (*Notional)
  43. Load Topology
  44. Shameless Shoutouts. HMS (https://github.com/hmsonline/): storm-cassandra, storm-elastic-search, storm-jdbi (coming soon). ptgoetz (https://github.com/ptgoetz): storm-jms, storm-signals.
  45. Next Level: Trident
  46. Trident: provides a higher-level abstraction for stream processing; constructs for state management and batching; adds additional primitives that abstract away common topological patterns; deprecates transactional topologies; distributed with Storm.
  47. Sample Trident Operations. Partition Local: Functions ( execute(x) → x + y ); Filters ( isKeep(x) → 0,x ). PartitionAggregate: Combiner ( pairwise combining ); Reducer ( iterative accumulation ); Aggregator ( byoa ).
  48. A sample topology:
       TridentTopology topology = new TridentTopology();
       TridentState wordCounts =
           topology.newStream("spout1", spout)
               .each(new Fields("sentence"), new Split(), new Fields("word"))
               .groupBy(new Fields("word"))
               .persistentAggregate(
                   MemcachedState.opaque(serverLocations),
                   new Count(),
                   new Fields("count"))
               .parallelismHint(6);
       https://github.com/nathanmarz/storm/wiki/Trident-state
  49. Trident State: sequenced writes by batch/transaction id. Spouts: Transactional (batch contents never change); Opaque (batch contents can change). State: Transactional (store tx_id with counts to maintain sequencing of writes); Opaque (store previous value in order to overwrite the current value when contents of a batch change).
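
The batch table on slide 15 and the "Transactional" state rule on slide 49 describe the same idea: keep the id of the last applied batch next to the count, apply batches strictly in order, and treat a replayed batch as a no-op. The following is a minimal plain-Java sketch of that idea, not code from Storm, Trident, or storm-cassandra. The class `OrderedBatchCounter` and its methods are hypothetical names chosen for illustration, and a real store would have to write the total and the batch id together atomically.

```java
// Hypothetical illustration only: not part of Storm, Trident, or storm-cassandra.
final class OrderedBatchCounter {
    private long lastAppliedBatch = 0; // id of the last batch folded into the total
    private long total = 0;            // running count across applied batches

    /**
     * Returns true if the batch is accounted for (applied now, or already
     * applied earlier), false if it must wait for a missing earlier batch.
     */
    boolean apply(long batchId, long batchSum) {
        if (batchId <= lastAppliedBatch) {
            return true;  // replay of an applied batch: adding again would double count
        }
        if (batchId != lastAppliedBatch + 1) {
            return false; // arrived out of order: caller must retry later
        }
        total += batchSum;
        lastAppliedBatch = batchId;
        return true;
    }

    long total() {
        return total;
    }

    public static void main(String[] args) {
        // Arrival order and sizes taken from slide 15.
        OrderedBatchCounter counter = new OrderedBatchCounter();
        counter.apply(1, 4);
        System.out.println(counter.total()); // 4
        counter.apply(3, 13);                // batch 2 is missing, so batch 3 waits
        System.out.println(counter.total()); // 4
        counter.apply(2, 6);
        System.out.println(counter.total()); // 10
        counter.apply(3, 13);
        System.out.println(counter.total()); // 23
        counter.apply(3, 13);                // replay (3') adds nothing
        System.out.println(counter.total()); // 23
    }
}
```

Skipping a replayed batch is only safe when the replay carries the same contents as the first attempt, which is the guarantee slide 49 attributes to transactional spouts.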

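Slide 49's "Opaque" state rule covers the case where a replayed batch may carry different contents than the first attempt. A minimal sketch, assuming state updates arrive in batch order: store the previous value alongside the current one, and when the same transaction id shows up again, recompute the current value from the previous one instead of adding on top. `OpaqueCount` and its fields are hypothetical names for illustration, not the Trident API.

```java
// Hypothetical illustration only: not the Trident opaque-state implementation.
final class OpaqueCount {
    private long lastTxId = -1; // id of the most recent batch applied (-1 = none yet)
    private long prev = 0;      // value before that batch was applied
    private long curr = 0;      // value after that batch was applied

    /** Applies a batch's partial count; assumes batches arrive in txid order. */
    void apply(long txId, long partialCount) {
        if (txId == lastTxId) {
            // Same batch replayed, possibly with different contents:
            // discard the earlier attempt by rebuilding from prev.
            curr = prev + partialCount;
        } else {
            // New batch: remember the value we are about to build on.
            prev = curr;
            curr = curr + partialCount;
            lastTxId = txId;
        }
    }

    long value() {
        return curr;
    }

    public static void main(String[] args) {
        OpaqueCount count = new OpaqueCount();
        count.apply(1, 4);
        count.apply(2, 6);
        System.out.println(count.value()); // 10
        count.apply(2, 9);                 // batch 2 replayed with different contents
        System.out.println(count.value()); // 13 (4 + 9, not 10 + 9)
        count.apply(3, 1);
        System.out.println(count.value()); // 14
    }
}
```

The cost is one extra stored value per key; in exchange the count stays correct even when the spout cannot reproduce a failed batch exactly.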