• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
The Big Data Quadfecta
 

The Big Data Quadfecta

on

  • 3,793 views

A successful Big Data platform combines distributed processing and polyglot persistence into a single cohesive infrastructure. Over the past few years, Health Market Science has transitioned from ...

A successful Big Data platform combines distributed processing and polyglot persistence into a single cohesive infrastructure. Over the past few years, Health Market Science has transitioned from traditional relational databases and enterprise systems to a massively scalable Big Data platform that combines Cassandra and Storm to ingest thousands of feeds of data from the health market industry to produce a single high-quality masterfile. Hear how we applied event processing and NoSQL to deliver real-time analytics, while accommodating structural change over time, and fuzzy/geospatial search.

Statistics

Views

Total Views
3,793
Views on SlideShare
2,650
Embed Views
1,143

Actions

Likes
3
Downloads
68
Comments
0

5 Embeds 1,143

http://ptgoetz.github.io 979
http://localhost 107
http://t.co 49
https://abs.twimg.com 7
http://kred.com 1

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • Storm:realtime, distributed computation systemComparable to complex event processing systemOriginated in the twitter analytics space.
  • Tuple: set of key-value pairs (values can be serialized objects)
  • Useful for pre-computing queries in real-time to optimize lookups (avoid expensive C* queries).

The Big Data Quadfecta The Big Data Quadfecta Presentation Transcript

  • 1• 800.593.4467 • info@healthmarketscience.com The Big Data Quadfecta Brian O’Neill Taylor Goetz Lead Architect, Health Market Science Development Lead, Health Market Scienc @boneill42, bone@alumni.brown.edu @ptgoetz, ptgoetz@gmail.com
  • 1• 800.593.4467 • info@healthmarketscience.com Quadfecta? 1. Quadfecta • A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover- and-sink, and simultaneous 6-cup game- ending double bounce-in. • Kafka • Storm • Elastic Search • Cassandra
  • 1• 800.593.4467 • info@healthmarketscience.comVolume 3 V’sVarietyVelocity
  • 1• 800.593.4467 • info@healthmarketscience.com The Use Case
  • 1• 800.593.4467 • info@healthmarketscience.com Our Mission Prescriber eligibility and remediation Eliminate fraud, waste and abuse Insights into the healthcare space
  • 1• 800.593.4467 • info@healthmarketscience.com The Business Master Data Solutions Business Medical Claims Data Health Care Provider & Facilities Solutions Medical Procedures & Diagnosis Variety/Velocity Volume/Velocity • >l2000 of sources • ~1B claims annually • 6 Million unique HCPs • +5B records annually • 10+ years history • 5+ years history Data Challenges CompleteView, Data Challenges • Constant change in real Expense Manager, • Sources have world data CompleteSpend incomplete capture • Conflicting & partial info • Overlapping source data • Frequent changes to Prescriber • Statistical projections & source structure Eligibility/Remdiati biases on • Authoritative sources vs. • Social media type crowdsource relationships Analtyics • Predicting source quality (Influencer Networks)
  • 1• 800.593.4467 • info@healthmarketscience.com Our Solutions Business Needs Sales & Marketing Compliance Business Systems Finance & Legal 01010011 Solutions Provider Data Data Assessment, Integration & Compliance Market Enrichment Services Intelligence Advanced S orm t Technology Master Data Management HMS Authoritative Sources Medical Claims Federal State Web Derived PDC
  • 1• 800.593.4467 • info@healthmarketscience.com Datacenter ¾ Petabytes of raw storage Virtualized (VMware) On a SAN Should we go physical???
  • 1• 800.593.4467 • info@healthmarketscience.com Under the Hood User Interface Web Services Interfacing I’m happy Analytics Dashboard / Reports Visualization Match Customer Consolidate Indexing Relational Web Structured Storage Standardize Government Validate NoSQL Graph(s) Data Sources Distributed Processing Flexible Storage
  • 1• 800.593.4467 • info@healthmarketscience.com Master Data Management Harvested faddress Î F@t0 Government flicense Î F@t5 Private fsanction Î F@t1 fsanction Î F@t4 Schema Change!
  • 1• 800.593.4467 • info@healthmarketscience.com The Design
  • 1• 800.593.4467 • info@healthmarketscience.com System of Record Flexibility (Variety) Scalability (Velocity + Volume)
  • 1• 800.593.4467 • info@healthmarketscience.com
  • 1• 800.593.4467 • info@healthmarketscience.com Design Principles Patterns Idempotent Operations Elegantly handle replay Immutable data Assertions of facts over time Anti-Patterns Transactions / Locking
  • 1• 800.593.4467 • info@healthmarketscience.com State / Counting Exactly-once semantics for state Create small batches Order batches Batch Total 1 4 Batch 1 4 3 4 (wait) 2 10 (+6) å Batch 3 13 3 23 (+13) 3’ 23 (+0) Batch 2 6 Batch 3’ 13
  • 1• 800.593.4467 • info@healthmarketscience.com What we did wrong… Could not react to transactional changes Needed extra logic to track what changed Took too long
  • 1• 800.593.4467 • info@healthmarketscience.com What we did wrong… (II) AOP-based triggers Worked well initially. Business Processes captured as side- effects.
  • 1• 800.593.4467 • info@healthmarketscience.com What we did right. REST APIs for Loose Coupling See Virgil: https://github.com/hmsonline/virgil But really… watch out for Intravert https://github.com/zznate/intravert-ug
  • 1• 800.593.4467 • info@healthmarketscience.com Kafka • Millions of Messages • Replay Enabled • No transactions / Lightning Fast
  • 1• 800.593.4467 • info@healthmarketscience.com Elastic Search • Edit Distance / Soundex • Native Scalability • Fuzzy Search • Geospatial • Facets
  • 1• 800.593.4467 • info@healthmarketscience.com Storm • Guaranteed once semantics • Well-designed processing abstraction • Beats BYODP • Momentum
  • 1• 800.593.4467 • info@healthmarketscience.com The System NP. Rewind! NP. We can route around it. C* ES 2 Kafka C* ES1 REST API A NP. Replication Factor > 1. C* Elastic Search Kafka Queue(s) C B Offset
  • 1• 800.593.4467 • info@healthmarketscience.com ? What comes after Quadfecta?
  • 1• 800.593.4467 • info@healthmarketscience.com Real-Time Integration Real-time CRUD via Web Services DRPC “Real-time” Queue Not quite sure?
  • 1• 800.593.4467 • info@healthmarketscience.com The Storm/C* Bridge
  • 1• 800.593.4467 • info@healthmarketscience.com Anatomy of a Storm Cluster Nimbus Master Node Zookeeper Cluster Coordination Supervisors Worker Nodes
  • 1• 800.593.4467 • info@healthmarketscience.com Storm Primatives Streams Unbounded sequence of tuples Spouts Stream Sources Bolts Unit of Computation Topologies Combination of n Spouts and n Bolts Defines the overall “Computation”
  • 1• 800.593.4467 • info@healthmarketscience.com Storm Spouts Represents a source (stream) of data Queues (JMS, Kafka, Kestrel, etc.) Twitter Firehose Sensor Data Emits “Tuples” (Events) based on source Primary Storm data structure Set of Key-Value pairs
  • 1• 800.593.4467 • info@healthmarketscience.com Storm Bolts Receive Tuples from Spouts or other Bolts Operate on, or React to Data Functions/Filters/Joins/Aggregations Database writes/lookups Optionally emit additional Tuples
  • 1• 800.593.4467 • info@healthmarketscience.com Storm Topologies Data flow between spouts and bolts Routing of Tuples between spouts/bolts Stream “Groupings” Parallelism of Components Long-Lived
  • 1• 800.593.4467 • info@healthmarketscience.com Storm Topologies
  • 1• 800.593.4467 • info@healthmarketscience.com Storm and Cassandra Use Cases: Write Storm Tuple data to C* Computation Results Pre-compute indices Read data from C* and emit Storm Tuples Dynamic Lookups http://github.com/hmsonline/storm-cassandra
  • 1• 800.593.4467 • info@healthmarketscience.com Storm Cassandra Bolt Types CassandraBolt Cassandra LookupBolt C* CassandraBolt Writes data to Cassandra Available in Batching and Non-Batching CassandraLookupBolt Reads data from Cassandra http://github.com/hmsonline/storm-cassandra
  • 1• 800.593.4467 • info@healthmarketscience.com Storm-Cassandra Project Provides generic Bolts for writing/reading Storm Tuples to/from C* Tuple Tuple Mapper Rows Tuples Columns Mapper Columns C* http://github.com/hmsonline/storm-cassandra
  • 1• 800.593.4467 • info@healthmarketscience.com Storm-Cassandra Project TupleMapper Interface Tells the CassandraBolt how to write a tuple to an arbitrary data model Given a Storm Tuple: Map to Column Family Map to Row Key Map to Columns http://github.com/hmsonline/storm-cassandra
  • 1• 800.593.4467 • info@healthmarketscience.com Storm-Cassandra Project ColumnsMapper Interface Tells the CassandraLookupBolt how to transform a C* row into a Storm Tuple Given a C* Row Key and list of Columns: Return a list of Storm Tuples http://github.com/hmsonline/storm-cassandra
  • 1• 800.593.4467 • info@healthmarketscience.com Storm-Cassandra Project Current State: Version 0.4.0 Uses Astyanax Client Several out-of-the-box *Mapper Implementations: Basic Key-Value Columns Value-less Columns Counter Columns Lookup by row key Lookup by range query Composite Key/Column Support Trident support http://github.com/hmsonline/storm-cassandra
  • 1• 800.593.4467 • info@healthmarketscience.com Storm-Cassandra Project Future Plans: Switch to CQL Enhanced Trident Support http://github.com/hmsonline/storm-cassandra
  • 1• 800.593.4467 • info@healthmarketscience.com Persistent Word Count http://github.com/hmsonline/storm-cassandra
  • 1• 800.593.4467 • info@healthmarketscience.com DRPC
  • 1• 800.593.4467 • info@healthmarketscience.com “Reach” Computation
  • 1• 800.593.4467 • info@healthmarketscience.com*Notional MDM Topology*
  • 1• 800.593.4467 • info@healthmarketscience.com Load Topology
  • 1• 800.593.4467 • info@healthmarketscience.com Shameless Shoutouts HMS (https://github.com/hmsonline/) storm-cassandra storm-elastic-search storm-jdbi (coming soon) ptgoetz (https://github.com/ptgoetz) storm-jms storm-signals
  • 1• 800.593.4467 • info@healthmarketscience.com Next Level : Trident
  • 1• 800.593.4467 • info@healthmarketscience.com Trident Provides a higher-level abstraction for stream processing Constructs for state management and Batching Adds additional primitives that abstract away common topological patterns Deprecates transactional topologies Distributes with Storm
  • 1• 800.593.4467 • info@healthmarketscience.com Sample Trident Operations Partition Local Functions ( execute(x)  x + y ) Filters ( isKeep(x)  0,x ) PartitionAggregate Combiner ( pairwise combining ) Reducer ( iterative accumulation ) Aggregator ( byoa )
  • 1• 800.593.4467 • info@healthmarketscience.com A sample topology TridentTopology topology = new TridentTopology(); TridentState wordCounts = topology.newStream("spout1", spout) .each(new Fields("sentence"), new Split(), new Fields("word")) .groupBy(new Fields("word")) .persistentAggregate( MemcachedState.opaque(serverLocations), new Count(), new Fields("count")) .parallelismHint(6); https://github.com/nathanmarz/storm/wiki/Trident-state
  • 1• 800.593.4467 • info@healthmarketscience.com Trident State Sequenced writes by batch/transaction id. Spouts Transactional Batch contents never change Opaque Batch contents can change State Transactional Store tx_id with counts to maintain sequencing of writes. Opaque Store previous value in order to overwrite the current value when contents of a batch change.