Big Data Overview and Cassandra Deep Dive for the Philly JUG

  The Big Data Quadfecta
Brian O'Neill
Lead Architect, Health Market Science
@boneill42
  Quadfecta?
1. Quadfecta
• A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in.
• Kafka
• Storm
• Elastic Search
• Cassandra
  Hold on Tight
  3 V's
Volume Variety Velocity
  The Use Case
  Our Mission
Prescriber eligibility and remediation
Eliminate fraud, waste and abuse
Insights into the healthcare space
  The Business
Business Solutions
Health Care Provider & Facilities
Variety/Velocity
• >l2000 of sources
• 6 Million unique HCPs
• 10+ years history
Data Challenges
• Constant change in real world data
• Conflicting & partial info
• Frequent changes to source structure
• Authoritative sources vs. crowdsource
• Predicting source quality
Master Data Solutions
Medical Procedures & Diagnosis
Volume/Velocity
• ~1B claims annually
• +5B records annually
• 5+ years history
Data Challenges
• Sources have incomplete capture
• Overlapping source data
• Statistical projections & biases
• Social media type relationships
Medical Claims Data
CompleteView, Expense Manager, CompleteSpend
Prescriber Eligibility/Remediation
Analytics (Influencer Networks)
  Our Solutions
Business Needs
Finance & Legal
Business Systems
Compliance
Sales & Marketing
Solutions
Provider Data Compliance
Data Assessment, Integration & Enrichment Services
Market Intelligence
HMS Authoritative Sources
PDC Federal State
Medical Claims Web Derived
Advanced Technology
Storm
Master Data Management
  Datacenter
Hundreds of Machines
1.5 Petabytes of raw storage
Virtualized (VMware)
On a SAN
Should we go physical???
  Under the Hood
Visualization
Dashboard / Reports
Structured Storage
Relational
Indexing
Flexible Storage
NoSQL Graph(s)
Interfacing
Web Services
Distributed Processing
Standardize
Validate
Match
Consolidate
Analytics
Data Sources
Government
Web
Customer
I'm happy
User Interface
  Master Data Management
Harvested
Government
Private
f_address ∈ F @ t0
f_license ∈ F @ t5
f_sanction ∈ F @ t1
f_sanction ∈ F @ t4
Schema Change!
  The Design
  System of Record
Flexibility (Variety)
Scalability (Velocity + Volume)
  Deep Dive
  Installation
As easy as…
Download
tar -xvzf apache-cassandra-1.2.0-beta3-bin.tar.gz
Run
bin/cassandra -f
(-f puts it in foreground)
  Data Model
Schema (a.k.a. Keyspace)
Table (a.k.a. Column Family)
Row
Has an arbitrary number of columns
Validator for keys (e.g. UTF8Type)
Column
Validator for values and keys
Comparator for keys (e.g. DateType or BYOC)
  Distributed Architecture
Nodes form a token ring.
Nodes partition the ring by initial token
initial_token: (in cassandra.yaml)
Partitioners map row keys to tokens.
Usually randomly, to evenly distribute the data
All columns for a row are stored together on disk in sorted order.
  Visually
(1-33)
Row Hash
Alice 50
Bob 3
Eve 15
Token/Hash Range : 0-99
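A minimal Java sketch of the ring lookup pictured above. The row hashes (Alice 50, Bob 3, Eve 15) and the 0-99 range come from the slide; the three node tokens (33, 66, 99), the class name and the toy partitioner are assumptions for illustration, not Cassandra's actual implementation.

```java
import java.util.Map;
import java.util.TreeMap;

class TokenRing {
    // Maps each node's token (upper bound of the range it owns) to the node name.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(int token, String name) {
        ring.put(token, name);
    }

    // A row belongs to the first node whose token is >= the row's token.
    // Past the highest token we wrap around to the first node on the ring.
    public String nodeFor(int rowToken) {
        Map.Entry<Integer, String> owner = ring.ceilingEntry(rowToken);
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }

    // Toy partitioner (an assumption, not Cassandra's): hash a row key into 0-99.
    public static int toyToken(String rowKey) {
        return Math.floorMod(rowKey.hashCode(), 100);
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing();
        ring.addNode(33, "Node 1"); // assumed token
        ring.addNode(66, "Node 2"); // assumed token
        ring.addNode(99, "Node 3"); // assumed token

        // Hash values taken from the slide.
        System.out.println("Alice (50) -> " + ring.nodeFor(50));
        System.out.println("Bob   (3)  -> " + ring.nodeFor(3));
        System.out.println("Eve   (15) -> " + ring.nodeFor(15));
    }
}
```

With these assumed tokens, Bob and Eve both land on the node that owns the lowest range, which is why an even spread of tokens and hashes matters.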
  Java Interpretation
Each table is a Distributed HashMap
Each row is a SortedMap.
Each column is an entry in the SortedMap.
Cassandra provides a massively scalable version of:
HashMap<rowKey, SortedMap<columnKey, columnValue>>
Implications:
Direct row fetch is fast.
Searching a range of rows can be costly.
Searching a range of columns is cheap.
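The three implications can be seen in a single-JVM toy version of that structure. This is only a sketch: the class and method names are hypothetical, and a real cluster distributes the outer map across nodes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class ToyTable {
    // rowKey -> (columnKey -> columnValue), columns kept sorted by key.
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    public void put(String rowKey, String columnKey, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnKey, value);
    }

    // Direct row fetch: one hash lookup.
    public SortedMap<String, String> row(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    // Range of columns in one row: a cheap view over the sorted map, [from, to).
    public SortedMap<String, String> columnSlice(String rowKey, String from, String to) {
        return row(rowKey).subMap(from, to);
    }

    // Range of rows: the outer map is unordered, so every row key must be inspected.
    public List<String> rowKeysBetween(String from, String to) {
        List<String> result = new ArrayList<>();
        for (String key : rows.keySet()) {
            if (key.compareTo(from) >= 0 && key.compareTo(to) < 0) {
                result.add(key);
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        ToyTable table = new ToyTable();
        table.put("alice", "2012-10-01", "login");
        table.put("alice", "2012-10-02", "search");
        table.put("alice", "2012-10-03", "logout");
        table.put("bob", "2012-10-02", "login");

        System.out.println(table.row("alice"));
        System.out.println(table.columnSlice("alice", "2012-10-02", "2012-10-04"));
        System.out.println(table.rowKeysBetween("a", "b"));
    }
}
```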
  The World-Wide Globally Scalable Naughty List!
How about a Naughty and Nice list for Santa?
1.9 billion children
That will fit in a single row!
Queries to support:
Children can login and check their standing.
Santa can find nice children by country, state or zip.
Toy lists for every child in the world.
  Two Tables
Children Table
Store all the children in the world.
One row per child.
One column per attribute.
NaughtyOrNice Table
Supports the queries we anticipate
Wide-Row Strategy
  Details of the NaughtyOrNiceList
One row per standing:country
Ensures all children in a country are grouped together on disk.
One column per child using a compound key
Ensures the columns are sorted to support our search at varying levels of granularity
e.g. All nice children in the US.
e.g. All naughty children in PA.
  Visually
Node 1, Node 2, Node 3
Nice:USA
• CA:94333:johny.b.good
• CA:94333:richie.rich
Nice:IRL
• D:EI33:collin.oneill
• D:EI33:owen.oneill
Naughty:USA
• CA:94111:bart.simpson
• CA:94222:dennis.menace
• PA:18964:michael.myers
Watch out for:
• Hot spotting
• Unbalanced Clusters
(1) Go to the row.
(2) Get the column slice.
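The two-step lookup (go to the row, then slice the columns) can be sketched in plain Java using the rows and compound column keys from the slide. The class, the `find` method and the prefix-range trick are illustrative assumptions, not a Cassandra client API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class NaughtyOrNiceList {
    // Row key "standing:country" -> sorted compound column keys "state:zip:child".
    private final Map<String, TreeMap<String, String>> rows = new HashMap<>();

    public void add(String standing, String country, String state, String zip, String child) {
        rows.computeIfAbsent(standing + ":" + country, k -> new TreeMap<>())
            .put(state + ":" + zip + ":" + child, "");
    }

    // (1) Go to the row. (2) Get the slice of columns that start with the prefix.
    public List<String> find(String standing, String country, String prefix) {
        TreeMap<String, String> row = rows.get(standing + ":" + country);
        if (row == null) {
            return new ArrayList<>();
        }
        // '\uffff' sorts after every character used in these keys, so the range
        // [prefix, prefix + '\uffff') covers exactly the keys with that prefix.
        return new ArrayList<>(row.subMap(prefix, prefix + Character.MAX_VALUE).keySet());
    }

    // Loads the sample data shown on the slide.
    public static NaughtyOrNiceList fromSlide() {
        NaughtyOrNiceList list = new NaughtyOrNiceList();
        list.add("Nice", "USA", "CA", "94333", "johny.b.good");
        list.add("Nice", "USA", "CA", "94333", "richie.rich");
        list.add("Nice", "IRL", "D", "EI33", "collin.oneill");
        list.add("Nice", "IRL", "D", "EI33", "owen.oneill");
        list.add("Naughty", "USA", "CA", "94111", "bart.simpson");
        list.add("Naughty", "USA", "CA", "94222", "dennis.menace");
        list.add("Naughty", "USA", "PA", "18964", "michael.myers");
        return list;
    }

    public static void main(String[] args) {
        NaughtyOrNiceList list = fromSlide();
        System.out.println(list.find("Nice", "USA", ""));       // all nice children in the US
        System.out.println(list.find("Naughty", "USA", "PA:")); // all naughty children in PA
    }
}
```

Because the state comes first in the column key, a state-level or zip-level search is just a longer prefix on the same sorted row.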
  What about the toys?
No problem.
We're in a NoSQL store. =)
Let's just add a column.
  CQL Collections!
Set
UPDATE users SET emails = emails + {} WHERE user_id = 'frodo';
List
UPDATE users SET top_places = [ 'the shire' ] + top_places WHERE user_id = 'frodo';
Map
UPDATE users SET todo['2012-10-2 12:10'] = 'die' WHERE user_id = 'frodo';
  Let's Crank a Bit...
  Let's code!
What API should we use?
API             Production-Readiness  Potential  Momentum
Thrift          10                    -1         -1
Hector          10                    8          8
Astyanax        8                     9          10
Kundera (JPA)   6                     9          9
Pelops          7                     6          7
Firebrand       8                     9          8
PlayORM         5                     8          7
GORA            6                     9          7
CQL Driver      8                     10         10
Astyanax + CQL FTW!
  Coming up for air...
  But continuing at warp speed...
  DEMO
  What we did wrong…
Could not react to transactional changes
Needed extra logic to track what changed
Took too long
  What we did wrong… (II)
AOP-based triggers
Worked well initially.
Business Processes captured as side-effects.
  Design Principles
Patterns
Idempotent Operations
Elegantly handle replay
Immutable data
Assertions of facts over time
Anti-Patterns
Transactions / Locking
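One way to picture the patterns above: store each fact under a key that includes its timestamp, so that writing the same fact again is a no-op and nothing is ever updated in place. The sketch below is a toy under stated assumptions: the class, the key layout, the entity name and the values are hypothetical, timestamps are non-negative, and names do not contain the `|` separator.

```java
import java.util.Map;
import java.util.TreeMap;

class FactLog {
    // Key "entity|attribute|zero-padded timestamp" -> value. Facts are only added.
    private final TreeMap<String, String> facts = new TreeMap<>();

    // Idempotent: replaying the same assertion writes the same key and value again.
    public void assertFact(String entity, String attribute, long timestamp, String value) {
        facts.put(String.format("%s|%s|%019d", entity, attribute, timestamp), value);
    }

    // Value of an attribute as of a point in time, or null if nothing was asserted yet.
    public String valueAt(String entity, String attribute, long timestamp) {
        String prefix = entity + "|" + attribute + "|";
        Map.Entry<String, String> e =
            facts.floorEntry(prefix + String.format("%019d", timestamp));
        return (e != null && e.getKey().startsWith(prefix)) ? e.getValue() : null;
    }

    public int size() {
        return facts.size();
    }

    public static void main(String[] args) {
        FactLog log = new FactLog();
        log.assertFact("hcp-1", "address", 0, "old address");
        log.assertFact("hcp-1", "sanction", 1, "sanctioned");
        log.assertFact("hcp-1", "sanction", 4, "cleared");
        log.assertFact("hcp-1", "license", 5, "active");

        // Replaying an earlier message changes nothing.
        log.assertFact("hcp-1", "sanction", 1, "sanctioned");

        System.out.println(log.size());
        System.out.println(log.valueAt("hcp-1", "sanction", 3));
    }
}
```

Since no write depends on reading the current state first, no transaction or lock is needed to make a replay safe.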
  What we did right.
REST APIs for Loose Coupling
See Virgil:
PS… really… watch out for Intravert
  Kafka
• Millions of Messages
• Replay Enabled
• No transactions / Lightning Fast
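The idea behind "Replay Enabled" can be shown with an in-memory toy: the log is append-only, reading does not consume anything, and each consumer simply remembers an offset it can move back. This is not the Kafka client API; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ReplayableLog {
    private final List<String> messages = new ArrayList<>();

    // Appends a message and returns its offset (its position in the log).
    public long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Returns everything from the given offset onward. The log itself is untouched.
    public List<String> readFrom(long offset) {
        int start = (int) Math.max(0, Math.min(offset, messages.size()));
        return new ArrayList<>(messages.subList(start, messages.size()));
    }

    public static void main(String[] args) {
        ReplayableLog log = new ReplayableLog();
        log.append("a");
        log.append("b");
        log.append("c");

        long consumerOffset = 2;                          // consumer has processed "a" and "b"
        System.out.println(log.readFrom(consumerOffset)); // only "c" is left to read

        consumerOffset = 0;                               // rewind after a downstream failure
        System.out.println(log.readFrom(consumerOffset)); // everything is delivered again
    }
}
```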
  Elastic Search
• Edit Distance / Soundex
• Native Scalability
• Fuzzy Search
• Geospatial
• Facets
  Storm
• Guaranteed once semantics
• Well-designed processing abstraction
• Beats BYODP
• Momentum
  The System
[Architecture diagram. Labels: REST API, Kafka Queue(s), Offset, stages A B C, C* (Cassandra), Elastic Search nodes ES1 and ES2]
NP. We can route around it.
NP. Replication Factor > 1.
NP. Rewind!
  Next Steps
  Shameless Shoutouts
HMS (
Blog (coming soon)
ptgoetz (
