Big data philly_jug


Big Data Overview and Cassandra Deep Dive for the Philly JUG

Published in: Technology, Business

  1. 1•800.593.4467 • info@healthmarketscience.com
     The Big Data Quadfecta
     Brian O’Neill, Lead Architect, Health Market Science, @boneill42
  2. Quadfecta?
     1. Quadfecta
     • A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in.
     • Kafka
     • Storm
     • Elastic Search
     • Cassandra
  3. Hold on Tight
  4. 3 V’s: Volume, Variety, Velocity
  5. The Use Case
  6. Our Mission
     • Prescriber eligibility and remediation
     • Eliminate fraud, waste and abuse
     • Insights into the healthcare space
  7. The Business
     Master Data Solutions: Health Care Provider & Facilities
     Variety/Velocity
     • >l2000 of sources
     • 6 Million unique HCPs
     • 10+ years history
     Data Challenges
     • Constant change in real-world data
     • Conflicting & partial info
     • Frequent changes to source structure
     • Authoritative sources vs. crowdsource
     • Predicting source quality
     Medical Claims Data: Medical Procedures & Diagnosis
     Volume/Velocity
     • ~1B claims annually
     • +5B records annually
     • 5+ years history
     Data Challenges
     • Sources have incomplete capture
     • Overlapping source data
     • Statistical projections & biases
     • Social media type relationships
     Business Solutions
     • CompleteView, Expense Manager, CompleteSpend
     • Prescriber Eligibility/Remediation
     • Analytics (Influencer Networks)
  8. Our Solutions
     • Business Needs: Finance & Legal, Business Systems, Compliance, Sales & Marketing
     • Solutions: Provider Data Compliance; Data Assessment, Integration & Enrichment Services; Market Intelligence
     • HMS Authoritative Sources: PDC, Federal, State, Medical Claims, Web Derived
     • Advanced Technology: Storm, Master Data Management
  9. Datacenter
     • Hundreds of Machines
     • 1.5 Petabytes of raw storage
     • Virtualized (VMware)
     • On a SAN
     • Should we go physical???
  10. Under the Hood
      • Visualization: Dashboard / Reports
      • Structured Storage: Relational
      • Indexing
      • Flexible Storage: NoSQL, Graph(s)
      • Interfacing: Web Services
      • Distributed Processing: Standardize, Validate, Match, Consolidate
      • Analytics
      • Data Sources: Government, Web, Customer
      • User Interface ("I’m happy")
  11. Master Data Management
      • Harvested, Government, Private
      • f_address ∈ F@t0
      • f_license ∈ F@t5
      • f_sanction ∈ F@t1
      • f_sanction ∈ F@t4
      • Schema Change!
  12. The Design
  13. System of Record
      • Flexibility (Variety)
      • Scalability (Velocity + Volume)
  14. Deep
  15. Installation
      As easy as…
      • Download
      • tar -xvzf apache-cassandra-1.2.0-beta3-bin.tar.gz
      • Run: bin/cassandra -f (-f puts it in foreground)
  16. Data Model
      • Schema (a.k.a. Keyspace)
      • Table (a.k.a. Column Family)
      • Row
        • Have arbitrary #’s of columns
        • Validator for keys (e.g. UTF8Type)
      • Column
        • Validator for values and keys
        • Comparator for keys (e.g. DateType or BYOC)
  17. Distributed Architecture
      • Nodes form a token ring.
      • Nodes partition the ring by initial token: initial_token (in cassandra.yaml).
      • Partitioners map row keys to tokens, usually randomly, to evenly distribute the data.
      • All columns for a row are stored together on disk in sorted order.
  18. Visually
      Token/Hash Range: 0-99
      (1-33)
      Row    Hash
      Alice  50
      Bob    3
      Eve    15
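The ring lookup described on the two slides above can be sketched in a few lines of Java. This is only an illustration, not Cassandra's implementation: the class name, the node names, and the tokens 33, 66 and 99 are assumptions, and the hash values for Alice, Bob and Eve are taken from the slide rather than computed by a real partitioner.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a token ring; not Cassandra's actual code.
final class TokenRingSketch {
    // Each node is keyed by the highest token it owns.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(int token, String name) {
        ring.put(token, name);
    }

    // The owner is the first node whose token is >= the row's token.
    String ownerOf(int token) {
        Map.Entry<Integer, String> owner = ring.ceilingEntry(token);
        if (owner == null) {
            owner = ring.firstEntry(); // past the last token: wrap around the ring
        }
        return owner.getValue();
    }

    // Assumed layout: three nodes splitting the 0-99 range evenly.
    static TokenRingSketch threeNodeRing() {
        TokenRingSketch r = new TokenRingSketch();
        r.addNode(33, "node1");
        r.addNode(66, "node2");
        r.addNode(99, "node3");
        return r;
    }

    public static void main(String[] args) {
        TokenRingSketch r = threeNodeRing();
        System.out.println("Alice (50) -> " + r.ownerOf(50));
        System.out.println("Bob (3)    -> " + r.ownerOf(3));
        System.out.println("Eve (15)   -> " + r.ownerOf(15));
    }
}
```

With this assumed layout Alice lands on node2 while Bob and Eve both land on node1, which hints at why an even spread of tokens matters.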
  19. Java Interpretation
      • Each table is a distributed HashMap.
      • Each row is a SortedMap.
      • Each column is an entry in the SortedMap.
      Cassandra provides a massively scalable version of:
      HashMap<rowKey, SortedMap<columnKey, columnValue>>
      Implications:
      • Direct row fetch is fast.
      • Searching a range of rows can be costly.
      • Searching a range of columns is cheap.
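The analogy can be made concrete with plain java.util collections. The sketch below is a single-JVM illustration under assumed names (TableSketch, the sample row and its columns are hypothetical); it shows a direct row fetch and a column-range read, the two operations the slide calls cheap.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for one table; not a Cassandra client.
final class TableSketch {
    // rowKey -> (columnKey -> columnValue), columns kept in sorted order
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    void put(String rowKey, String columnKey, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnKey, value);
    }

    // Direct row fetch: one hash lookup.
    SortedMap<String, String> row(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    // Column range: a contiguous slice of the sorted columns, [from, to).
    SortedMap<String, String> columnRange(String rowKey, String from, String to) {
        return row(rowKey).subMap(from, to);
    }

    // Made-up sample data for the demo.
    static TableSketch sample() {
        TableSketch t = new TableSketch();
        t.put("alice", "age", "9");
        t.put("alice", "city", "Philadelphia");
        t.put("alice", "state", "PA");
        t.put("alice", "zip", "19103");
        return t;
    }

    public static void main(String[] args) {
        TableSketch t = sample();
        System.out.println(t.row("alice"));
        System.out.println(t.columnRange("alice", "b", "t"));
    }
}
```

A range of rows has no equivalent cheap call here: a HashMap keeps no order across row keys, which mirrors the "can be costly" point on the slide.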
  20. The World-Wide Globally Scalable Naughty List!
      How about a Naughty and Nice list for Santa?
      • 1.9 billion children
      • That will fit in a single row!
      Queries to support:
      • Children can log in and check their standing.
      • Santa can find nice children by country, state or zip.
      • Toy lists for every child in the world.
  21. Two Tables
      Children Table
      • Store all the children in the world.
      • One row per child.
      • One column per attribute.
      NaughtyOrNice Table
      • Supports the queries we anticipate
      • Wide-Row Strategy
  22. Details of the NaughtyOrNiceList
      One row per standing:country
      • Ensures all children in a country are grouped together on disk.
      One column per child using a compound key
      • Ensures the columns are sorted to support our search at varying levels of granularity
      • e.g. All nice children in the US.
      • e.g. All naughty children in PA.
  23. Visually
      Nodes: Node 1, Node 2, Node 3
      Row Nice:USA
      • CA:94333:johny.b.good
      • CA:94333:richie.rich
      Row Nice:IRL
      • D:EI33:collin.oneill
      • D:EI33:owen.oneill
      Row Naughty:USA
      • CA:94111:bart.simpson
      • CA:94222:dennis.menace
      • PA:18964:michael.myers
      Watch out for:
      • Hot spotting
      • Unbalanced Clusters
      (1) Go to the row.
      (2) Get the column slice
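The two-step read on this slide (go to the row, then get the column slice) can be sketched with sorted sets. The class and method names are hypothetical, column values are omitted so only the compound column keys are stored, and the prefix trick (slicing up to the prefix plus the largest char) is one assumed way to express "all columns starting with PA:"; it is not how a Cassandra client is called.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical sketch of the wide-row layout; column values are left out.
final class NaughtyOrNiceSketch {
    // Row key "standing:country" -> sorted compound column keys "state:zip:child"
    private final Map<String, NavigableSet<String>> rows = new HashMap<>();

    void add(String standing, String country, String state, String zip, String child) {
        rows.computeIfAbsent(standing + ":" + country, k -> new TreeSet<>())
            .add(state + ":" + zip + ":" + child);
    }

    // (1) Go to the row. (2) Get the slice of columns sharing the prefix.
    SortedSet<String> slice(String rowKey, String columnPrefix) {
        NavigableSet<String> columns = rows.get(rowKey);
        if (columns == null) {
            return new TreeSet<>();
        }
        return columns.subSet(columnPrefix, columnPrefix + Character.MAX_VALUE);
    }

    // Sample data copied from the slide.
    static NaughtyOrNiceSketch sample() {
        NaughtyOrNiceSketch list = new NaughtyOrNiceSketch();
        list.add("Nice", "USA", "CA", "94333", "johny.b.good");
        list.add("Nice", "USA", "CA", "94333", "richie.rich");
        list.add("Nice", "IRL", "D", "EI33", "collin.oneill");
        list.add("Nice", "IRL", "D", "EI33", "owen.oneill");
        list.add("Naughty", "USA", "CA", "94111", "bart.simpson");
        list.add("Naughty", "USA", "CA", "94222", "dennis.menace");
        list.add("Naughty", "USA", "PA", "18964", "michael.myers");
        return list;
    }

    public static void main(String[] args) {
        NaughtyOrNiceSketch list = sample();
        System.out.println(list.slice("Nice:USA", ""));       // all nice children in the US
        System.out.println(list.slice("Naughty:USA", "PA:")); // all naughty children in PA
    }
}
```

Because every child in a country sits in one row, a populous country makes for a very wide row on a single node, which is the hot-spotting risk the slide warns about.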
  24. What about the toys?
      • No problem.
      • We’re in a NoSQL store. =)
      • Let’s just add a column.
  25. CQL Collections!
      Set
      UPDATE users SET emails = emails + {} WHERE user_id = 'frodo';
      List
      UPDATE users SET top_places = [ 'the shire' ] + top_places WHERE user_id = 'frodo';
      Maps
      UPDATE users SET todo['2012-10-2 12:10'] = 'die' WHERE user_id = 'frodo';
  26. Let’s Crank a Bit...
  27. Let’s code! What API should we use?
      API             Production-Readiness  Potential  Momentum
      Thrift          10                    -1         -1
      Hector          10                    8          8
      Astyanax        8                     9          10
      Kundera (JPA)   6                     9          9
      Pelops          7                     6          7
      Firebrand       8                     9          8
      PlayORM         5                     8          7
      GORA            6                     9          7
      CQL Driver      8                     10         10
      Astyanax + CQL FTW!
  28. Coming up for air...
  29. But continuing at warp speed...
  31. What we did wrong…
      • Could not react to transactional changes
      • Needed extra logic to track what changed
      • Took too long
  32. What we did wrong… (II)
      • AOP-based triggers
      • Worked well initially.
      • Business Processes captured as side-effects.
  33. Design Principles
      Patterns
      • Idempotent Operations: Elegantly handle replay
      • Immutable data: Assertions of facts over time
      Anti-Patterns
      • Transactions / Locking
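One way to read these two patterns together is a store of immutable facts keyed by what was asserted and when, so that replaying the same assertion changes nothing. The Java sketch below illustrates that reading only; the class, the key format, and the sample values are assumptions and do not come from the talk.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: facts are only ever added, never updated in place.
final class FactStoreSketch {
    // Key "entity|attribute|time" -> asserted value
    private final Map<String, String> facts = new TreeMap<>();

    // Idempotent: asserting the same fact again is a no-op.
    // Returns true only the first time a fact is recorded.
    boolean assertFact(String entity, String attribute, long time, String value) {
        String key = entity + "|" + attribute + "|" + time;
        return facts.putIfAbsent(key, value) == null;
    }

    int size() {
        return facts.size();
    }

    public static void main(String[] args) {
        FactStoreSketch store = new FactStoreSketch();
        // Made-up values for illustration.
        System.out.println(store.assertFact("provider-1", "sanction", 1, "suspended"));  // first delivery
        System.out.println(store.assertFact("provider-1", "sanction", 1, "suspended"));  // replayed delivery
        System.out.println(store.assertFact("provider-1", "sanction", 4, "reinstated")); // later fact
        System.out.println(store.size());
    }
}
```

No lock or transaction is needed for the replay case, because a second delivery of the same message cannot leave the store in a different state than the first.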
  34. What we did right.
      • REST APIs for Loose Coupling
      • See Virgil: really… watch out for Intravert
  35. Kafka
      • Millions of Messages
      • Replay Enabled
      • No transactions / Lightning Fast
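"Replay Enabled" comes down to consumers reading an append-only log by offset, so rewinding is just reading again from an earlier offset. The sketch below is an in-memory illustration of that idea under assumed names; it does not use or imitate the Kafka client API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical in-memory log; messages are appended and never modified.
final class ReplayLogSketch {
    private final List<String> messages = new ArrayList<>();

    // Appends a message and returns the offset it was stored at.
    long append(String message) {
        messages.add(message);
        return messages.size() - 1;
    }

    // Returns a read-only copy of every message from the given offset onward.
    List<String> readFrom(long offset) {
        int start = (int) Math.min(Math.max(offset, 0), messages.size());
        return Collections.unmodifiableList(
                new ArrayList<>(messages.subList(start, messages.size())));
    }

    public static void main(String[] args) {
        ReplayLogSketch log = new ReplayLogSketch();
        log.append("a");
        log.append("b");
        log.append("c");
        System.out.println(log.readFrom(2)); // consumer caught up to the last message
        System.out.println(log.readFrom(0)); // rewind: replay everything
    }
}
```

Replay like this is only safe when the downstream operations are idempotent, since the same message may be processed more than once.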
  36. Elastic Search
      • Edit Distance / Soundex
      • Native Scalability
      • Fuzzy Search
      • Geospatial
      • Facets
  37. Storm
      • Guaranteed once semantics
      • Well-designed processing abstraction
      • Beats BYODP
      • Momentum
  38. The System
      • Kafka Queue(s), Offset
      • C*: A, B, C
      • ElasticSearch: ES1, ES2
      • C* REST API
      • NP. We can route around it.
      • NP. Replication Factor > 1.
      • NP. Rewind!
  39. Next Steps
  40. Shameless Shoutouts
      • HMS ( (coming soon)
      • ptgoetz (
  41. The Team
      We’re hiring!