Big data philly_jug




Big Data Overview and Cassandra Deep Dive for the Philly JUG





Usage Rights

© All Rights Reserved

Big data philly_jug Presentation Transcript

  • The Big Data Quadfecta
    Brian O’Neill, Lead Architect, Health Market Science, @boneill42
    1•800.593.4467 • info@healthmarketscience.com
  • Quadfecta?
    1. Quadfecta: A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in.
    Kafka, Storm, Elastic Search, Cassandra
  • Hold on Tight
  • 3 V’s: Volume, Variety, Velocity
  • The Use Case
  • Our Mission
    Prescriber eligibility and remediation
    Eliminate fraud, waste and abuse
    Insights into the healthcare space
  • The Business: Business Solutions
    Master Data Solutions (Health Care Provider & Facilities)
      Variety/Velocity: >l2000 of sources; 6 million unique HCPs; 10+ years history
      Data Challenges: constant change in real-world data; conflicting & partial info; frequent changes to source structure; authoritative sources vs. crowdsource; predicting source quality
    Medical Claims Data (Medical Procedures & Diagnosis)
      Volume/Velocity: ~1B claims annually; +5B records annually; 5+ years history
      Data Challenges: sources have incomplete capture; overlapping source data; statistical projections & biases; social-media-type relationships
    Products: CompleteView, Expense Manager, CompleteSpend; Prescriber Eligibility/Remediation; Analytics (Influencer Networks)
  • Our Solutions
    Business Needs: Finance & Legal; Business Systems; Compliance; Sales & Marketing
    Solutions: Provider Data Compliance; Data Assessment, Integration & Enrichment Services; Market Intelligence
    HMS Authoritative Sources: PDC; Federal; State; Medical Claims; Web Derived
    Advanced Technology: Storm; Master Data Management
  • Datacenter
    Hundreds of Machines
    1.5 Petabytes of raw storage
    Virtualized (VMware)
    On a SAN
    Should we go physical???
  • Under the Hood
    Visualization: Dashboard / Reports
    Structured Storage: Relational; Indexing
    Flexible Storage: NoSQL; Graph(s)
    Interfacing: Web Services
    Distributed Processing: Standardize; Validate; Match; Consolidate
    Analytics
    Data Sources: Government; Web; Customer
    User Interface: “I’m happy”
  • Master Data Management
    Sources: Harvested; Government; Private
    faddress ∈ F@t0; flicense ∈ F@t5; fsanction ∈ F@t1; fsanction ∈ F@t4
    Schema Change!
  • The Design
  • System of Record
    Flexibility (Variety)
    Scalability (Velocity + Volume)
  • Deep
  • Installation
    As easy as…
    Download, then unpack: tar -xvzf apache-cassandra-1.2.0-beta3-bin.tar.gz
    Run: bin/cassandra -f (-f puts it in the foreground)
  • Data Model
    Schema (a.k.a. Keyspace)
    Table (a.k.a. Column Family)
    Row: has an arbitrary number of columns; validator for keys (e.g. UTF8Type)
    Column: validator for values and keys; comparator for keys (e.g. DateType or BYOC)
  • Distributed Architecture
    Nodes form a token ring.
    Nodes partition the ring by initial token: initial_token (in cassandra.yaml).
    Partitioners map row keys to tokens, usually randomly, to evenly distribute the data.
    All columns for a row are stored together on disk in sorted order.
  • Visually
    Token/Hash Range: 0-99 (one node's range shown as 1-33)
    Row    Hash
    Alice  50
    Bob    3
    Eve    15
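The token-ring lookup on the two slides above can be sketched in plain Java. This is only an illustration, not Cassandra's implementation: the class name `TokenRingSketch`, the node names, the node tokens (33, 66, 99), and the toy `tokenFor` partitioner are all assumptions made for the example. The row hashes (Alice 50, Bob 3, Eve 15) come from the slide.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a token ring over the slide's 0-99 hash range.
class TokenRingSketch {
    // Node token -> node name. A node owns the tokens from just after the
    // previous node's token up to and including its own token.
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    void addNode(int initialToken, String nodeName) {
        ring.put(initialToken, nodeName);
    }

    // Toy partitioner (an assumption): maps a row key into 0-99.
    static int tokenFor(String rowKey) {
        return Math.floorMod(rowKey.hashCode(), 100);
    }

    // Finds the first node whose token is >= the row's token,
    // wrapping around to the first node at the end of the ring.
    String nodeForToken(int token) {
        Map.Entry<Integer, String> owner = ring.ceilingEntry(token);
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }

    // Three evenly spaced nodes; the tokens are illustrative.
    static TokenRingSketch threeNodeRing() {
        TokenRingSketch sketch = new TokenRingSketch();
        sketch.addNode(33, "Node 1");
        sketch.addNode(66, "Node 2");
        sketch.addNode(99, "Node 3");
        return sketch;
    }

    public static void main(String[] args) {
        TokenRingSketch sketch = threeNodeRing();
        // Hash values taken from the slide rather than computed.
        System.out.println("Alice -> " + sketch.nodeForToken(50)); // Alice -> Node 2
        System.out.println("Bob   -> " + sketch.nodeForToken(3));  // Bob   -> Node 1
        System.out.println("Eve   -> " + sketch.nodeForToken(15)); // Eve   -> Node 1
    }
}
```

With these assumed tokens, Bob and Eve land on the same node, which is why a partitioner that spreads keys evenly matters.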
  • Java Interpretation
    Each table is a distributed HashMap.
    Each row is a SortedMap.
    Each column is an entry in the SortedMap.
    Cassandra provides a massively scalable version of: HashMap<rowKey, SortedMap<columnKey, columnValue>>
    Implications:
    Direct row fetch is fast.
    Searching a range of rows can be costly.
    Searching a range of columns is cheap.
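A single-JVM version of that analogy, as a minimal sketch: the class and method names (`TableAsMapSketch`, `put`, `getRow`, `sliceColumns`) are hypothetical and chosen for the example, and nothing here is distributed. It only shows why a row fetch is one hash lookup and a column range is a cheap slice of a sorted map.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for one table (column family).
class TableAsMapSketch {
    // rowKey -> (columnKey -> columnValue), columns kept in sorted order.
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    void put(String rowKey, String columnKey, String columnValue) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnKey, columnValue);
    }

    // Direct row fetch: a single hash lookup.
    SortedMap<String, String> getRow(String rowKey) {
        SortedMap<String, String> row = rows.get(rowKey);
        return row != null ? row : new TreeMap<String, String>();
    }

    // Column range within one row: a view over the sorted map,
    // from 'from' (inclusive) to 'to' (exclusive).
    SortedMap<String, String> sliceColumns(String rowKey, String from, String to) {
        return getRow(rowKey).subMap(from, to);
    }

    public static void main(String[] args) {
        TableAsMapSketch table = new TableAsMapSketch();
        table.put("alice", "name", "Alice");
        table.put("alice", "city", "Philadelphia");
        table.put("alice", "age", "30");
        System.out.println(table.sliceColumns("alice", "a", "d").keySet()); // [age, city]
    }
}
```

A range of rows has no equivalent shortcut here: a HashMap has no key order, so it would mean visiting every row, which mirrors the "can be costly" point on the slide.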
  • The World-Wide Globally Scalable Naughty List!
    How about a Naughty and Nice list for Santa?
    1.9 billion children. That will fit in a single row!
    Queries to support:
    Children can login and check their standing.
    Santa can find nice children by country, state or zip.
    Toy lists for every child in the world.
  • Two Tables
    Children Table: store all the children in the world. One row per child. One column per attribute.
    NaughtyOrNice Table: supports the queries we anticipate. Wide-Row Strategy.
  • Details of the NaughtyOrNiceList
    One row per standing:country. Ensures all children in a country are grouped together on disk.
    One column per child using a compound key. Ensures the columns are sorted to support our search at varying levels of granularity.
    e.g. All nice children in the US.
    e.g. All naughty children in PA.
  • Visually
    Node 1, row Nice:USA: CA:94333:johny.b.good, CA:94333:richie.rich
    Node 2, row Nice:IRL: D:EI33:collin.oneill, D:EI33:owen.oneill
    Node 3, row Naughty:USA: CA:94111:bart.simpson, CA:94222:dennis.menace, PA:18964:michael.myers
    (1) Go to the row. (2) Get the column slice.
    Watch out for: hot spotting; unbalanced clusters.
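The two-step lookup ("go to the row, get the column slice") can be sketched with a sorted set per row. This is a plain-Java illustration, not a Cassandra client call: the class `NaughtyOrNiceSketch` and its methods are hypothetical, column values are left out for brevity, and the prefix-slice trick (upper bound = prefix plus the largest `char`) is an assumption of the example. The sample data is the data shown on the slide.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical wide-row model: row key "standing:country",
// column key "state:zip:childId", columns kept sorted.
class NaughtyOrNiceSketch {
    private final Map<String, TreeSet<String>> rows = new HashMap<>();

    void add(String standing, String country, String state, String zip, String childId) {
        String rowKey = standing + ":" + country;
        String columnKey = state + ":" + zip + ":" + childId;
        rows.computeIfAbsent(rowKey, k -> new TreeSet<>()).add(columnKey);
    }

    // (1) Go to the row. (2) Slice the sorted columns by prefix:
    // "" for the whole country, "PA:" for a state, "CA:94111:" for a zip.
    List<String> find(String standing, String country, String columnPrefix) {
        TreeSet<String> columns = rows.get(standing + ":" + country);
        if (columns == null) {
            return new ArrayList<>();
        }
        String upperBound = columnPrefix + Character.MAX_VALUE;
        return new ArrayList<>(columns.subSet(columnPrefix, upperBound));
    }

    // Loads the sample rows shown on the slide.
    static NaughtyOrNiceSketch fromSlide() {
        NaughtyOrNiceSketch list = new NaughtyOrNiceSketch();
        list.add("Nice", "USA", "CA", "94333", "johny.b.good");
        list.add("Nice", "USA", "CA", "94333", "richie.rich");
        list.add("Nice", "IRL", "D", "EI33", "collin.oneill");
        list.add("Nice", "IRL", "D", "EI33", "owen.oneill");
        list.add("Naughty", "USA", "CA", "94111", "bart.simpson");
        list.add("Naughty", "USA", "CA", "94222", "dennis.menace");
        list.add("Naughty", "USA", "PA", "18964", "michael.myers");
        return list;
    }

    public static void main(String[] args) {
        NaughtyOrNiceSketch list = fromSlide();
        // All naughty children in PA.
        System.out.println(list.find("Naughty", "USA", "PA:")); // [PA:18964:michael.myers]
    }
}
```

Because every child of one standing in one country sits in a single row, a large country makes that row (and the node holding it) a hot spot, which is the risk the slide flags.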
  • What about the toys?
    No problem. We’re in a NoSQL store. =)
    Let’s just add a column.
  • CQL Collections!
    Set: UPDATE users SET emails = emails + {} WHERE user_id = 'frodo';
    List: UPDATE users SET top_places = [ 'the shire' ] + top_places WHERE user_id = 'frodo';
    Maps: UPDATE users SET todo['2012-10-2 12:10'] = 'die' WHERE user_id = 'frodo';
  • Let’s Crank a Bit...
  • Let’s code! What API should we use?
    API              Production-Readiness  Potential  Momentum
    Thrift           10                    -1         -1
    Hector           10                    8          8
    Astyanax         8                     9          10
    Kundera (JPA)    6                     9          9
    Pelops           7                     6          7
    Firebrand        8                     9          8
    PlayORM          5                     8          7
    GORA             6                     9          7
    CQL Driver       8                     10         10
    Astyanax + CQL FTW!
  • Coming up for air...
  • But continuing at warp speed...
  • What we did wrong…
    Could not react to transactional changes
    Needed extra logic to track what changed
    Took too long
  • What we did wrong… (II)
    AOP-based triggers
    Worked well initially.
    Business processes captured as side-effects.
  • Design Principles
    Patterns:
    Idempotent operations (elegantly handle replay)
    Immutable data (assertions of facts over time)
    Anti-Patterns:
    Transactions / Locking
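One way to read the two patterns together, as a hedged sketch rather than the HMS implementation: each change is stored as an immutable fact keyed by what was asserted and when, so applying the same message twice (for example after rewinding a queue) leaves the state unchanged, with no transaction or lock involved. The class `IdempotentReplaySketch`, the `Fact` fields, the key layout, and the sample values are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical store of immutable, time-stamped facts.
class IdempotentReplaySketch {
    // An assertion about an entity at a point in time; never modified.
    static final class Fact {
        final String entity;
        final String attribute;
        final String value;
        final long assertedAt;

        Fact(String entity, String attribute, String value, long assertedAt) {
            this.entity = entity;
            this.attribute = attribute;
            this.value = value;
            this.assertedAt = assertedAt;
        }

        // Replaying the same assertion always yields the same key.
        String key() {
            return entity + "|" + attribute + "|" + assertedAt;
        }
    }

    private final Map<String, Fact> facts = new TreeMap<>();

    // Idempotent: a fact that is already present is simply ignored.
    void apply(Fact fact) {
        facts.putIfAbsent(fact.key(), fact);
    }

    void replay(List<Fact> log) {
        for (Fact fact : log) {
            apply(fact);
        }
    }

    int size() {
        return facts.size();
    }

    // Current view: the most recently asserted value, or null if none.
    String latest(String entity, String attribute) {
        Fact newest = null;
        for (Fact fact : facts.values()) {
            boolean matches = fact.entity.equals(entity) && fact.attribute.equals(attribute);
            if (matches && (newest == null || fact.assertedAt > newest.assertedAt)) {
                newest = fact;
            }
        }
        return newest == null ? null : newest.value;
    }

    public static void main(String[] args) {
        List<Fact> log = new ArrayList<>();
        log.add(new Fact("hcp-1", "sanction", "none", 1L));
        log.add(new Fact("hcp-1", "sanction", "suspended", 4L));
        log.add(new Fact("hcp-1", "license", "active", 5L));

        IdempotentReplaySketch store = new IdempotentReplaySketch();
        store.replay(log);
        store.replay(log); // rewind and replay: no duplicates, no locks
        System.out.println(store.size());                      // 3
        System.out.println(store.latest("hcp-1", "sanction")); // suspended
    }
}
```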
  • What we did right.
    REST APIs for loose coupling
    See Virgil: really… watch out for Intravert
  • Kafka
    Millions of messages
    Replay enabled
    No transactions / lightning fast
  • Elastic Search
    Edit Distance / Soundex
    Native scalability
    Fuzzy search
    Geospatial
    Facets
  • Storm
    Guaranteed once semantics
    Well-designed processing abstraction
    Beats BYODP
    Momentum
  • The System
    Components: Kafka Queue(s); Offset; C*; A, B, C; ES1; ES2; ElasticSearch; REST API
    NP. We can route around it.
    NP. Replication Factor > 1.
    NP. Rewind!
  • Next Steps
  • Shameless Shoutouts
    HMS (coming soon); ptgoetz
  • The Team
    We’re hiring!