
The Big Data Quadfecta

A successful Big Data platform combines distributed processing and polyglot persistence into a single cohesive infrastructure. Over the past few years, Health Market Science has transitioned from traditional relational databases and enterprise systems to a massively scalable Big Data platform that combines Cassandra and Storm to ingest thousands of data feeds from the health market industry and produce a single high-quality masterfile. Hear how we applied event processing and NoSQL to deliver real-time analytics while accommodating structural change over time and supporting fuzzy/geospatial search.

  • Storm: real-time, distributed computation system. Comparable to complex event processing systems. Originated in the Twitter analytics space.
  • Tuple: set of key-value pairs (values can be serialized objects)
  • Useful for pre-computing queries in real-time to optimize lookups (avoid expensive C* queries).
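
The notes above describe a tuple as a set of key-value pairs and suggest pre-computing query results as tuples arrive, so that reads avoid expensive C* queries. A minimal sketch of that idea in plain Java, with no Storm or Cassandra dependency: the class `PrecomputedIndexSketch`, the field names `id` and `state`, and the in-memory map standing in for an index table are all assumptions made for illustration, not part of the talk.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Keeps a query answer up to date as tuples stream in (hypothetical example). */
final class PrecomputedIndexSketch {

    // In-memory stand-in for an index table: state -> ids seen in that state.
    private final Map<String, Set<String>> idsByState = new HashMap<>();

    /** Bolt-like handler, called once per incoming tuple. */
    void onTuple(Map<String, Object> tuple) {
        String state = (String) tuple.get("state");
        String id = (String) tuple.get("id");
        // Adding to a set is idempotent, so a replayed tuple changes nothing.
        idsByState.computeIfAbsent(state, k -> new TreeSet<>()).add(id);
    }

    /** The query becomes a single key lookup instead of a scan. */
    Set<String> idsIn(String state) {
        return idsByState.getOrDefault(state, Collections.emptySet());
    }

    /** Builds a tuple as a small set of key-value pairs. */
    static Map<String, Object> tuple(String id, String state) {
        Map<String, Object> t = new HashMap<>();
        t.put("id", id);
        t.put("state", state);
        return t;
    }

    public static void main(String[] args) {
        PrecomputedIndexSketch index = new PrecomputedIndexSketch();
        index.onTuple(tuple("p1", "PA"));
        index.onTuple(tuple("p2", "PA"));
        index.onTuple(tuple("p2", "PA")); // replay of the same tuple
        index.onTuple(tuple("p3", "NJ"));
        System.out.println(index.idsIn("PA")); // prints [p1, p2]
    }
}
```

Because the index holds a set per key, replaying a tuple leaves it unchanged, which fits the idempotent-operations principle on slide 14.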

    1. The Big Data Quadfecta. Brian O’Neill, Lead Architect, Health Market Science (@boneill42, bone@alumni.brown.edu); Taylor Goetz, Development Lead, Health Market Science (@ptgoetz, ptgoetz@gmail.com). Slide footer: 1• 800.593.4467 • info@healthmarketscience.com
    2. Quadfecta? 1. Quadfecta: a legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in. The four here: Kafka, Storm, Elastic Search, Cassandra.
    3. The 3 V’s: Volume, Variety, Velocity
    4. The Use Case
    5. Our Mission: prescriber eligibility and remediation; eliminate fraud, waste and abuse; insights into the healthcare space.
    6. The Business. Master Data (Health Care Provider & Facilities; Variety/Velocity): >l2000 of sources; 6 million unique HCPs; 10+ years history. Data challenges: constant change in real-world data; conflicting & partial info; frequent changes to source structure; authoritative sources vs. crowdsource; predicting source quality. Medical Claims Data (Medical Procedures & Diagnosis; Volume/Velocity): ~1B claims annually; +5B records annually; 5+ years history. Data challenges: sources have incomplete capture; overlapping source data; statistical projections & biases; social-media-type relationships (Influencer Networks). Solutions Business: CompleteView, Expense Manager, CompleteSpend, Prescriber Eligibility/Remediation, Analytics.
    7. Our Solutions (diagram labels). Business Needs: Sales & Marketing; Compliance; Business Systems; Finance & Legal. Solutions: Provider Data Assessment; Data Integration & Enrichment Services; Compliance; Market Intelligence. Technology: Advanced Master Data Management; Storm. Sources: HMS; Authoritative Sources; Medical Claims; Federal; State; Web Derived; PDC.
    8. Datacenter: ¾ petabytes of raw storage; virtualized (VMware); on a SAN. Should we go physical???
    9. Under the Hood (diagram labels): User Interface; Web Services; Interfacing; “I’m happy”; Analytics; Dashboard / Reports; Visualization; Match; Customer; Consolidate; Indexing; Relational; Web; Structured Storage; Standardize; Government; Validate; NoSQL; Graph(s); Data Sources; Distributed Processing; Flexible Storage.
    10. Master Data Management. Harvested: $f_{\text{address}} \in F@t_0$. Government: $f_{\text{license}} \in F@t_5$. Private: $f_{\text{sanction}} \in F@t_1$, $f_{\text{sanction}} \in F@t_4$. Schema Change!
    11. The Design
    12. System of Record: Flexibility (Variety); Scalability (Velocity + Volume).
    13.
    14. Design Principles. Patterns: idempotent operations (elegantly handle replay); immutable data (assertions of facts over time). Anti-patterns: transactions / locking.
    15. State / Counting. Exactly-once semantics for state: create small batches; order batches. Incoming batches (count): Batch 1 (4), Batch 3 (13), Batch 2 (6), Batch 3’ (13). Running total Σ by batch received: 1 → 4; 3 → 4 (wait); 2 → 10 (+6); 3 → 23 (+13); 3’ → 23 (+0).
    16. What we did wrong… Could not react to transactional changes; needed extra logic to track what changed; took too long.
    17. What we did wrong… (II) AOP-based triggers: worked well initially. Business processes captured as side-effects.
    18. What we did right. REST APIs for loose coupling. See Virgil: https://github.com/hmsonline/virgil But really… watch out for Intravert: https://github.com/zznate/intravert-ug
    19. Kafka: millions of messages; replay enabled; no transactions / lightning fast.
    20. Elastic Search: edit distance / Soundex; native scalability; fuzzy search; geospatial; facets.
    21. Storm: guaranteed once semantics; well-designed processing abstraction; beats BYODP; momentum.
    22. The System (diagram labels): REST API; Kafka Queue(s); Offset; C*; Elastic Search (ES1, ES2); points A, B, C. Callouts: “NP. Rewind!”; “NP. We can route around it.”; “NP. Replication Factor > 1.”
    23. What comes after Quadfecta?
    24. Real-Time Integration: real-time CRUD via Web Services; DRPC; “real-time” queue. Not quite sure?
    25. The Storm/C* Bridge
    26. Anatomy of a Storm Cluster. Nimbus: master node. Zookeeper: cluster coordination. Supervisors: worker nodes.
    27. Storm Primitives. Streams: unbounded sequence of tuples. Spouts: stream sources. Bolts: unit of computation. Topologies: combination of n spouts and n bolts; defines the overall “computation”.
    28. Storm Spouts. Represent a source (stream) of data: queues (JMS, Kafka, Kestrel, etc.), Twitter firehose, sensor data. Emit “tuples” (events) based on the source. The tuple is the primary Storm data structure: a set of key-value pairs.
    29. Storm Bolts. Receive tuples from spouts or other bolts. Operate on, or react to, data: functions/filters/joins/aggregations; database writes/lookups. Optionally emit additional tuples.
    30. Storm Topologies. Data flow between spouts and bolts; routing of tuples between spouts/bolts (stream “groupings”); parallelism of components; long-lived.
    31. Storm Topologies
    32. Storm and Cassandra. Use cases: write Storm tuple data to C* (computation results, pre-computed indices); read data from C* and emit Storm tuples (dynamic lookups). http://github.com/hmsonline/storm-cassandra
    33. Storm Cassandra Bolt Types. CassandraBolt: writes data to Cassandra; available in batching and non-batching variants. CassandraLookupBolt: reads data from Cassandra.
    34. Storm-Cassandra Project. Provides generic bolts for writing/reading Storm tuples to/from C*. Diagram: tuple → TupleMapper → rows and columns in C*; columns from C* → ColumnsMapper → tuples.
    35. Storm-Cassandra Project: TupleMapper interface. Tells the CassandraBolt how to write a tuple to an arbitrary data model. Given a Storm tuple: map to column family; map to row key; map to columns.
    36. Storm-Cassandra Project: ColumnsMapper interface. Tells the CassandraLookupBolt how to transform a C* row into a Storm tuple. Given a C* row key and list of columns: return a list of Storm tuples.
    37. Storm-Cassandra Project, current state: version 0.4.0; uses Astyanax client; several out-of-the-box *Mapper implementations (basic key-value columns, value-less columns, counter columns, lookup by row key, lookup by range query); composite key/column support; Trident support.
    38. Storm-Cassandra Project, future plans: switch to CQL; enhanced Trident support.
    39. Persistent Word Count: http://github.com/hmsonline/storm-cassandra
    40. DRPC
    41. “Reach” Computation
    42. MDM Topology* (*Notional)
    43. Load Topology
    44. Shameless Shoutouts. HMS (https://github.com/hmsonline/): storm-cassandra, storm-elastic-search, storm-jdbi (coming soon). ptgoetz (https://github.com/ptgoetz): storm-jms, storm-signals.
    45. Next Level: Trident
    46. Trident. Provides a higher-level abstraction for stream processing; constructs for state management and batching; adds primitives that abstract away common topological patterns; deprecates transactional topologies; distributed with Storm.
    47. Sample Trident Operations. Partition-local: functions ( execute(x) → x + y ); filters ( isKeep(x) → 0, x ). PartitionAggregate: combiner ( pairwise combining ); reducer ( iterative accumulation ); aggregator ( byoa ).
    48. A sample topology (https://github.com/nathanmarz/storm/wiki/Trident-state):

            // spout and serverLocations are defined elsewhere.
            TridentTopology topology = new TridentTopology();
            TridentState wordCounts =
                topology.newStream("spout1", spout)
                    // Split each sentence into words.
                    .each(new Fields("sentence"), new Split(), new Fields("word"))
                    .groupBy(new Fields("word"))
                    // Keep a running count per word in Memcached-backed opaque state.
                    .persistentAggregate(
                        MemcachedState.opaque(serverLocations),
                        new Count(),
                        new Fields("count"))
                    .parallelismHint(6);
    49. Trident State. Sequenced writes by batch/transaction id. Spouts: transactional (batch contents never change); opaque (batch contents can change). State: transactional (store tx_id with counts to maintain sequencing of writes); opaque (store previous value in order to overwrite the current value when contents of a batch change).
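
Slides 15 and 49 can be illustrated with a small counter. The sketch below is plain Java rather than the Trident API; the class names `TransactionalCount` and `OpaqueCount` are hypothetical, and the opaque variant assumes batches are applied in transaction-id order. The transactional counter skips any batch that is not the next in sequence, which reproduces the totals on slide 15 (4, wait, 10, 23, 23). The opaque counter keeps the previous total so that a replayed batch with different contents overwrites its earlier attempt.

```java
final class TridentStateSketch {

    /** Transactional state: batch contents never change, so a repeated txId is skipped. */
    static final class TransactionalCount {
        private long lastTxId = 0; // id of the last batch applied
        private long total = 0;

        /** Applies a batch only if it is the next in sequence; returns true when applied. */
        boolean apply(long txId, long batchCount) {
            if (txId != lastTxId + 1) {
                return false; // a replay (already applied) or out of order (must wait)
            }
            total += batchCount;
            lastTxId = txId;
            return true;
        }

        long total() {
            return total;
        }
    }

    /** Opaque state: a replayed batch may differ, so keep the total from before it. */
    static final class OpaqueCount {
        private long lastTxId = 0;
        private long previous = 0; // total before batch lastTxId was applied
        private long current = 0;

        void apply(long txId, long batchCount) {
            if (txId == lastTxId) {
                current = previous + batchCount; // overwrite the earlier attempt
            } else {
                previous = current;
                current += batchCount;
                lastTxId = txId;
            }
        }

        long total() {
            return current;
        }
    }

    public static void main(String[] args) {
        TransactionalCount tx = new TransactionalCount();
        tx.apply(1, 4);
        tx.apply(3, 13); // not applied: batch 2 has not been seen yet
        tx.apply(2, 6);
        tx.apply(3, 13);
        tx.apply(3, 13); // replay of batch 3 adds nothing
        System.out.println(tx.total()); // prints 23

        OpaqueCount opaque = new OpaqueCount();
        opaque.apply(1, 4);
        opaque.apply(2, 6);
        opaque.apply(2, 9); // batch 2 replayed with different contents
        System.out.println(opaque.total()); // prints 13
    }
}
```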

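The two mapper roles from slides 35 and 36 can be sketched as a pair of interfaces. This is not the storm-cassandra API: `SketchTupleMapper`, `SketchColumnsMapper`, and both implementations are hypothetical names, a tuple is modeled as a `Map<String, Object>`, and a row is modeled as a map of column names to string values.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MapperSketch {

    /** Hypothetical write-side contract: where and how a tuple is stored. */
    interface SketchTupleMapper {
        String columnFamily(Map<String, Object> tuple);
        String rowKey(Map<String, Object> tuple);
        Map<String, String> columns(Map<String, Object> tuple);
    }

    /** Hypothetical read-side contract: turn one row into zero or more tuples. */
    interface SketchColumnsMapper {
        List<Map<String, Object>> toTuples(String rowKey, Map<String, String> columns);
    }

    /** Uses one tuple field as the row key and writes every other field as a column. */
    static final class FieldKeyedMapper implements SketchTupleMapper {
        private final String family;
        private final String keyField;

        FieldKeyedMapper(String family, String keyField) {
            this.family = family;
            this.keyField = keyField;
        }

        @Override
        public String columnFamily(Map<String, Object> tuple) {
            return family;
        }

        @Override
        public String rowKey(Map<String, Object> tuple) {
            return String.valueOf(tuple.get(keyField));
        }

        @Override
        public Map<String, String> columns(Map<String, Object> tuple) {
            Map<String, String> cols = new LinkedHashMap<>();
            for (Map.Entry<String, Object> e : tuple.entrySet()) {
                if (!e.getKey().equals(keyField)) {
                    cols.put(e.getKey(), String.valueOf(e.getValue()));
                }
            }
            return cols;
        }
    }

    /** Emits one tuple per column, with fields "key", "column", and "value". */
    static final class ColumnPerTupleMapper implements SketchColumnsMapper {
        @Override
        public List<Map<String, Object>> toTuples(String rowKey, Map<String, String> columns) {
            List<Map<String, Object>> out = new ArrayList<>();
            for (Map.Entry<String, String> e : columns.entrySet()) {
                Map<String, Object> t = new LinkedHashMap<>();
                t.put("key", rowKey);
                t.put("column", e.getKey());
                t.put("value", e.getValue());
                out.add(t);
            }
            return out;
        }
    }

    public static void main(String[] args) {
        Map<String, Object> tuple = new LinkedHashMap<>();
        tuple.put("word", "storm");
        tuple.put("count", 3);

        SketchTupleMapper writer = new FieldKeyedMapper("wordcounts", "word");
        Map<String, String> cols = writer.columns(tuple);
        System.out.println(writer.columnFamily(tuple) + "/" + writer.rowKey(tuple) + " " + cols);

        SketchColumnsMapper reader = new ColumnPerTupleMapper();
        System.out.println(reader.toTuples(writer.rowKey(tuple), cols));
    }
}
```

Keeping the data-model decisions in the mapper is what lets a single generic bolt serve arbitrary column families: the bolt only asks the mapper where a tuple goes and never inspects the tuple's fields itself.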