Big data philly_jug

Big Data Overview and Cassandra Deep Dive for the Philly JUG

Usage Rights

© All Rights Reserved

Presentation Transcript

  • 1•800.593.4467•info@healthmarketscience.com
    The Big Data Quadfecta
    Brian O’Neill, Lead Architect, Health Market Science
    @boneill42, bone@alumni.brown.edu
  • Quadfecta?
    1. Quadfecta: A legendary beirut/beer pong shot that lands on the tops of four cups simultaneously. Considered the rarest shot in the game, topping even the trifecta, 2-cup knockover-and-sink, and simultaneous 6-cup game-ending double bounce-in. (http://www.urbandictionary.com/define.php?term=quadfecta)
    • Kafka
    • Storm
    • Elastic Search
    • Cassandra
    Photo: http://www.flickr.com/photos/yogma/3584984540/
  • Hold on Tight
    Photo: http://www.flickr.com/photos/aspexdesign/7817329758/
  • 3 V’s
    Volume, Variety, Velocity
    Photos: http://www.flickr.com/photos/20989942@N00/373985217/
    http://www.flickr.com/photos/rhruzek/4071408305/
    http://www.flickr.com/photos/adriansalgado/5310969147/
  • The Use Case
  • Our Mission
    • Prescriber eligibility and remediation
    • Eliminate fraud, waste and abuse
    • Insights into the healthcare space
  • The Business
    Business Solutions:
    • CompleteView, Expense Manager, CompleteSpend
    • Prescriber Eligibility/Remediation
    • Analytics (Influencer Networks)
    Master Data Solutions: Health Care Provider & Facilities (Variety/Velocity)
    • >l2000 of sources
    • 6 million unique HCPs
    • 10+ years history
    Data challenges:
    • Constant change in real-world data
    • Conflicting & partial info
    • Frequent changes to source structure
    • Authoritative sources vs. crowdsource
    • Predicting source quality
    Medical Claims Data: Medical Procedures & Diagnosis (Volume/Velocity)
    • ~1B claims annually
    • +5B records annually
    • 5+ years history
    Data challenges:
    • Sources have incomplete capture
    • Overlapping source data
    • Statistical projections & biases
    • Social-media-type relationships
  • Our Solutions
    Business Needs: Finance & Legal, Business Systems, Compliance, Sales & Marketing
    Solutions: Provider Data Compliance; Data Assessment, Integration & Enrichment Services; Market Intelligence
    HMS Authoritative Sources: PDC, Federal, State, Medical Claims, Web Derived
    Advanced Technology: Storm, Master Data Management
  • Datacenter
    • Hundreds of machines
    • 1.5 petabytes of raw storage
    • Virtualized (VMware)
    • On a SAN
    Should we go physical???
  • Under the Hood
    Diagram labels, in slide order:
    • Visualization: Dashboard / Reports
    • Structured Storage: Relational
    • Indexing
    • Flexible Storage: NoSQL, Graph(s)
    • Interfacing: Web Services
    • Distributed Processing: Standardize, Validate, Match, Consolidate, Analytics
    • Data Sources: Government, Web, Customer
    • User Interface (callout: “I’m happy”)
  • Master Data Management
    Sources: Harvested, Government, Private
    f_address ∈ F @ t0
    f_license ∈ F @ t5
    f_sanction ∈ F @ t1
    f_sanction ∈ F @ t4
    Schema Change!
  • The Design
  • System of Record
    • Flexibility (Variety)
    • Scalability (Velocity + Volume)
  • Deep Dive
    Photo: www.history.navy.mil/museums/seabee_museum.htm
  • Installation
    As easy as…
    1. Download: http://cassandra.apache.org/download/
    2. Uncompress:
        tar -xvzf apache-cassandra-1.2.0-beta3-bin.tar.gz
    3. Run (-f keeps it in the foreground):
        bin/cassandra -f
  • Data Model
    • Schema (a.k.a. Keyspace)
    • Table (a.k.a. Column Family)
    • Row
        Has an arbitrary number of columns
        Validator for keys (e.g. UTF8Type)
    • Column
        Validator for values and keys
        Comparator for keys (e.g. DateType or BYOC)
    (http://www.youtube.com/watch?v=bKfND4woylw)
  • Distributed Architecture
    • Nodes form a token ring.
    • Nodes partition the ring by initial token (initial_token: in cassandra.yaml).
    • Partitioners map row keys to tokens, usually randomly, to evenly distribute the data.
    • All columns for a row are stored together on disk in sorted order.
  • Visually
    Token/hash range: 0-99
    Token range label on the diagram: (1-33)
    Row     Hash
    Alice   50
    Bob     3
    Eve     15
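A minimal Java sketch of the ring lookup, using the row hashes from the slide (Alice 50, Bob 3, Eve 15) and the 0-99 token range. The node names, the node tokens (33, 66, 99) and the toy `tokenFor` partitioner are assumptions made for illustration; Cassandra's real partitioners hash into a far larger token space.

```java
import java.util.Map;
import java.util.TreeMap;

class TokenRingSketch {
    // Hypothetical ring: each node owns the tokens up to and including its own token.
    private static final TreeMap<Integer, String> RING = new TreeMap<>();
    static {
        RING.put(33, "node1");
        RING.put(66, "node2");
        RING.put(99, "node3");
    }

    // Toy partitioner (an assumption, not Cassandra's): maps a row key into 0-99.
    static int tokenFor(String rowKey) {
        return Math.floorMod(rowKey.hashCode(), 100);
    }

    // The owner is the first node whose token is >= the row token,
    // wrapping around to the first node when none is found.
    static String ownerOf(int token) {
        Map.Entry<Integer, String> entry = RING.ceilingEntry(token);
        return (entry != null ? entry : RING.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        // Hash values taken from the slide rather than from tokenFor.
        System.out.println("Alice (50) -> " + ownerOf(50));
        System.out.println("Bob (3)    -> " + ownerOf(3));
        System.out.println("Eve (15)   -> " + ownerOf(15));
    }
}
```

With these assumed node tokens, Bob and Eve land on the same node while Alice lands on the next one, which is why an even spread of tokens matters.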
  • Java Interpretation
    • Each table is a distributed HashMap.
    • Each row is a SortedMap.
    • Each column is an entry in the SortedMap.
    Cassandra provides a massively scalable version of:
        HashMap<rowKey, SortedMap<columnKey, columnValue>>
    Implications:
    • Direct row fetch is fast.
    • Searching a range of rows can be costly.
    • Searching a range of columns is cheap.
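The map-of-sorted-maps analogy can be sketched in plain, single-process Java. The class and method names below are hypothetical and nothing here is Cassandra API; the sketch only illustrates why a row fetch and a column range are cheap while a row range is not.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class TableAsMapSketch {
    // Table: row key -> (column key -> column value), columns kept in sorted order.
    private final Map<String, SortedMap<String, String>> rows = new HashMap<>();

    void put(String rowKey, String columnKey, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>()).put(columnKey, value);
    }

    // Direct row fetch: a single hash lookup.
    SortedMap<String, String> row(String rowKey) {
        return rows.getOrDefault(rowKey, new TreeMap<>());
    }

    // Column range within one row: a view over the sorted map, [from, to).
    SortedMap<String, String> columnSlice(String rowKey, String from, String to) {
        return row(rowKey).subMap(from, to);
    }

    // Row range: hash order is unrelated to key order, so every row key is checked.
    int countRowsBetween(String from, String to) {
        int count = 0;
        for (String key : rows.keySet()) {
            if (key.compareTo(from) >= 0 && key.compareTo(to) < 0) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        TableAsMapSketch table = new TableAsMapSketch();
        table.put("alice", "age", "7");
        table.put("alice", "city", "Philadelphia");
        table.put("alice", "zip", "19104");
        System.out.println(table.columnSlice("alice", "b", "d"));
    }
}
```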
  • The World-Wide Globally Scalable Naughty List!
    How about a Naughty and Nice list for Santa?
    1.9 billion children. That will fit in a single row!
    Queries to support:
    • Children can log in and check their standing.
    • Santa can find nice children by country, state or zip.
    • Toy lists for every child in the world.
  • Two Tables
    Children table:
    • Stores all the children in the world.
    • One row per child.
    • One column per attribute.
    NaughtyOrNice table:
    • Supports the queries we anticipate.
    • Wide-row strategy.
  • Details of the NaughtyOrNice List
    • One row per standing:country. This ensures all children in a country are grouped together on disk.
    • One column per child, using a compound key. This ensures the columns are sorted to support our search at varying levels of granularity,
      e.g. all nice children in the US, or all naughty children in PA.
  • Visually
    Rows are spread across Node 1, Node 2 and Node 3:
    Row Nice:USA
        CA:94333:johny.b.good
        CA:94333:richie.rich
    Row Nice:IRL
        D:EI33:collin.oneill
        D:EI33:owen.oneill
    Row Naughty:USA
        CA:94111:bart.simpson
        CA:94222:dennis.menace
        PA:18964:michael.myers
    To query: (1) go to the row, (2) get the column slice.
    Watch out for:
    • Hot spotting
    • Unbalanced clusters
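The two-step lookup (go to the row, then take a column slice) can be sketched in Java with the sample keys from the slide. This is an in-memory illustration under stated assumptions: the class name, the `add` and `slice` methods, and the prefix-range trick are hypothetical, not calls from any Cassandra client.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class NaughtyOrNiceSketch {
    // Row key "standing:country" -> sorted compound column keys "state:zip:child".
    private final Map<String, TreeMap<String, String>> rows = new HashMap<>();

    void add(String standing, String country, String state, String zip, String child) {
        rows.computeIfAbsent(standing + ":" + country, k -> new TreeMap<>())
            .put(state + ":" + zip + ":" + child, "");
    }

    // (1) Go to the row, (2) take the slice of columns that start with the prefix.
    List<String> slice(String standing, String country, String prefix) {
        TreeMap<String, String> row = rows.get(standing + ":" + country);
        if (row == null) {
            return new ArrayList<>();
        }
        // Keys with the prefix sort between prefix and prefix + the largest char.
        return new ArrayList<>(row.subMap(prefix, prefix + Character.MAX_VALUE).keySet());
    }

    public static void main(String[] args) {
        NaughtyOrNiceSketch list = new NaughtyOrNiceSketch();
        list.add("Naughty", "USA", "CA", "94111", "bart.simpson");
        list.add("Naughty", "USA", "CA", "94222", "dennis.menace");
        list.add("Naughty", "USA", "PA", "18964", "michael.myers");
        // All naughty children in PA.
        System.out.println(list.slice("Naughty", "USA", "PA:"));
    }
}
```

Because the state comes first in the compound key, a shorter prefix widens the search (a whole country with an empty prefix) and a longer one narrows it (a single zip with `"CA:94111:"`).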
  • What about the toys?
    No problem. We’re in a NoSQL store. =)
    Let’s just add a column.
  • CQL Collections!
    http://www.datastax.com/dev/blog/cql3_collections
    Set:
        UPDATE users SET emails = emails + {'fb@friendsofmordor.org'} WHERE user_id = 'frodo';
    List:
        UPDATE users SET top_places = ['the shire'] + top_places WHERE user_id = 'frodo';
    Map:
        UPDATE users SET todo['2012-10-2 12:10'] = 'die' WHERE user_id = 'frodo';
  • Let’s Crank a Bit...
  • Let’s code! What API should we use?
    API             Production-Readiness   Potential   Momentum
    Thrift          10                     -1          -1
    Hector          10                     8           8
    Astyanax        8                      9           10
    Kundera (JPA)   6                      9           9
    Pelops          7                      6           7
    Firebrand       8                      9           8
    PlayORM         5                      8           7
    GORA            6                      9           7
    CQL Driver      8                      10          10
    Astyanax + CQL FTW!
  • Coming up for air...
    Photo: http://www.flickr.com/photos/64738468@N00/7184463727/
  • But continuing at warp speed...
    Photo: http://www.flickr.com/photos/19942094@N00/4937185452/lightbox/
  • What we did wrong…
    • Could not react to transactional changes
    • Needed extra logic to track what changed
    • Took too long
  • What we did wrong… (II)
    AOP-based triggers
    • Worked well initially.
    • Business processes captured as side-effects.
  • Design Principles
    Patterns:
    • Idempotent operations: elegantly handle replay
    • Immutable data: assertions of facts over time
    Anti-patterns:
    • Transactions / locking
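One way to picture the two patterns together is an append-only set of timestamped facts, in the spirit of the sanction facts asserted at t1 and t4 earlier in the deck. The sketch below is an assumption-laden illustration, not the system described in the talk: the class name, the `assertFact` method and the string encoding of a fact are all hypothetical.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class FactLogSketch {
    // Facts are only ever added, never updated or deleted (immutable data).
    private final Set<String> facts = new LinkedHashSet<>();

    // Idempotent: asserting the same fact again leaves the log unchanged,
    // so a replayed message is harmless. Returns true only for a new fact.
    boolean assertFact(String entity, String attribute, String value, long time) {
        return facts.add(entity + "|" + attribute + "|" + value + "|" + time);
    }

    int size() {
        return facts.size();
    }

    public static void main(String[] args) {
        FactLogSketch log = new FactLogSketch();
        log.assertFact("provider-1", "sanction", "suspended", 1L);
        log.assertFact("provider-1", "sanction", "suspended", 4L);
        log.assertFact("provider-1", "sanction", "suspended", 1L); // replayed message
        System.out.println(log.size() + " distinct facts");
    }
}
```

No lock or transaction is needed to make the replay safe, because the operation has the same effect whether it is applied once or many times.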
  • What we did right.
    REST APIs for loose coupling
    See Virgil: https://github.com/hmsonline/virgil
    But really… watch out for Intravert: https://github.com/zznate/intravert-ug
  • Kafka
    • Millions of messages
    • Replay enabled
    • No transactions / lightning fast
  • Elastic Search
    • Edit distance / Soundex
    • Native scalability
    • Fuzzy search
    • Geospatial
    • Facets
  • Storm
    • Guaranteed once semantics
    • Well-designed processing abstraction
    • Beats BYODP
    • Momentum
  • The System
    Diagram labels: REST API, Kafka queue(s) with an offset, C* (Cassandra), elements labeled A, B and C, Elastic Search (ES1, ES2).
    Callouts on the diagram:
    • NP. We can route around it.
    • NP. Replication Factor > 1.
    • NP. Rewind!
  • Next Steps
  • Shameless Shoutouts
    HMS (https://github.com/hmsonline/)
    • storm-cassandra
    • storm-elastic-search
    • storm-jdbi (coming soon)
    ptgoetz (https://github.com/ptgoetz)
    • storm-jms
    • storm-signals
  • The Team
    We’re hiring!