Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on the Lambda Architecture


This presentation covers our use of Storm and the connectors we've built. It also proposes a design for integrating Storm with real-time web services by embedding parts of topologies directly into the web services layer.

Published in: Software, Technology
  • Sequence diagram, distributed counting (the lost-update race): Storage holds {"fox": 6}. A computes count("fox", batch) = 3 and reads 6 from Storage. B computes count("fox", batch) = 10 and also reads the pre-update value. A computes add(6, 3) = 9; B computes add(6, 10) = 16. B writes 16, then A writes 9, silently discarding B's update. Either way the stored total is wrong: the correct answer is 6 + 3 + 10 = 19.
  • Sequence diagram, distributed counting with embedded state: the Client POSTs an event to DropWizard. DropWizard aggregates new Tuple(event) into State(1), queues the event on Kafka, and returns 200 (ACK) to the Client. Every 30 seconds, State(1) reads (state, events) for the key from C*, computes state = aggregate(state, in_memory_state) and events = join(events, in_memory_events), and writes (state, events) back to C*. Independently, Storm dequeues the event from Kafka; State(2) reads (persisted_state, events) for the key and, if !contains?(event), writes the state to C*.
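The lost-update race in the first sequence above can be replayed deterministically. A minimal, single-threaded sketch (the class name and the exact interleaving are illustrative, chosen to match slide 30's observation that 19 != 9):

```java
import java.util.HashMap;
import java.util.Map;

// Single-threaded replay of the race: both writers read the stored count
// before either writes back, so one increment is silently lost.
public class LostUpdateDemo {
    public static void main(String[] args) {
        Map<String, Integer> storage = new HashMap<>();
        storage.put("fox", 6);

        int readA = storage.get("fox");   // A reads 6
        int readB = storage.get("fox");   // B also reads 6 (stale once A commits)
        storage.put("fox", readB + 10);   // B writes add(6, 10) = 16
        storage.put("fox", readA + 3);    // A overwrites with add(6, 3) = 9

        // The correct total would be 6 + 3 + 10 = 19.
        System.out.println(storage.get("fox"));
    }
}
```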

    1. Data Pipelines: Improving on the Lambda Architecture. Brian O’Neill, CTO @boneill42
    2. Talk Breakdown: (1) Motivation, (2) Polyglot Persistence, (3) Analytics, (4) Lambda Architecture
    3. Health Market Science, Then: What we were.
    4. Health Market Science, Now: Intersecting Big Data with Healthcare. We’re fixing healthcare!
    5. Data Pipelines: I/O
    6. The Input: from government, state boards, etc.; from the internet, social data, networks / graphs; from third parties, medical claims; from customers, expenses, sales data, beneficiary information, quality scores.
    7. The Output: Expense Manager™, Provider Verification™, MarketView™, Customer Feed(s), Customer Master, Provider MasterFile™, Credentials, and “Agile MDM”, covering entities such as Script, Claims, Expense, Sanction, Address, Contact (phone, fax, etc.), Drug, Representative, Division, Organization, Practitioner, and Referrals. 1 billion claims per year.
    8. Sounds easy. Except... incomplete capture, no foreign keys, differing schemas, changing schemas, conflicting information, ad-hoc analysis (is hard), point-in-time retrieval.
    9. Master Data Management: a Golden Record assembled from harvested government and private sources (f_address ∈ F@t0, f_license ∈ F@t5, f_sanction ∈ F@t1, f_sanction ∈ F@t4). Schema change!
    10. Why?
    11. Our MDM Pipeline: Ingestion (semantic tagging, standardization, data mapping), Incorporation (consolidation, enumeration, association), and Insight (search, reports, analytics), with data stewardship by data scientists and business analysts. Feeds (multiple formats, changing over time) arrive via API / FTP and a web interface, governed by dimensions, logic, and rules.
    12. Our first “Pipeline”
    13. Sweet! Dirt simple, lightning fast, highly available, scalable, multi-datacenter (DR).
    14. Not sweet. How do we query the data? NoSQL indexes? Do such things exist?
    15. Rev 1: Wide rows! AOP triggers! Data model to support your queries (e.g., row key ONC : PA : 19460 with one column per claim amount). D’Oh! What about ad hoc?
    16. Rev 2: Elastic Search! AOP triggers drive the transformation. D’Oh! What if ES fails? What about schema / type information?
    17. Rev 3: Apache Storm!
    18. Polyglot Persistence: “The Right Tool for the Job”. (Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.)
    19. Back to the Pipeline: DropWizard → Kafka → Storm → C*, Elastic Search, Titan, and SQL.
    20. Design Principles. What we got: at-least-once processing, simple data flows. What we needed to account for: replays. Idempotent operations! Immutable data!
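Why at-least-once delivery forces idempotent operations can be shown in a few lines. A hypothetical sketch using plain maps rather than Cassandra: replaying an absolute write is harmless, while replaying a relative write double-counts.

```java
import java.util.HashMap;
import java.util.Map;

// An event is delivered twice (at-least-once semantics). The absolute
// write ("set fox to 9") converges to the same state; the relative write
// ("add 3 to fox") is applied twice and over-counts.
public class IdempotenceDemo {
    public static void main(String[] args) {
        Map<String, Integer> idempotent = new HashMap<>();
        Map<String, Integer> counter = new HashMap<>();
        counter.put("fox", 6);

        for (int delivery = 0; delivery < 2; delivery++) { // replayed event
            idempotent.put("fox", 9);              // absolute: same end state
            counter.merge("fox", 3, Integer::sum); // relative: double-counted
        }

        System.out.println(idempotent.get("fox")); // converged
        System.out.println(counter.get("fox"));    // 12 instead of 9
    }
}
```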
    21. Storm Cassandra: Cassandra State (v0.4.0). {tuple} → <mapper> → (ks, cf, row, k:v[])
    22. Storm Elastic Search: Trident Elastic Search (v0.3.1). {tuple} → <mapper> → (idx, docid, k:v[])
    23. Storm Graph (v0.1.2). Coming soon to... for (tuple : batch) <processor> (graph, tuple)
    24. Storm JDBI (v0.1.14). INTERNAL ONLY (so far). Worth releasing? {tuple} → <mapper> → (JDBC Statement)
    25. All good!
    26. But... what was the average amount for a medical claim associated with procedure X, by zip code, over the last five years?
    27. Hadoop (<2)? Batch? Yuck. ‘Nuff said.
    28. Let’s pre-compute it!
        stream
            .groupBy(new Fields("ICD9"))
            .groupBy(new Fields("zip"))
            .aggregate(new Fields("amount"), new Average())
        D’Oh! GroupBy’s set data in motion!
    29. Lesson learned: if possible, avoid re-partitioning operations! (e.g. LOG.error!)
    30. Why so hard? D’Oh! 19 != 9. What we don’t want: LOCKS! What’s the alternative? CONSENSUS!
    32. Conditional Updates. “The alert reader will notice here that Paxos gives us the ability to agree on exactly one proposal. After one has been accepted, it will be returned to future leaders in the promise, and the new leader will have to re-propose it again.” UPDATE value=9 WHERE word='fox' IF value=6
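The semantics of the conditional update above can be modeled in memory. A hypothetical sketch (the helper mirrors the IF clause: write only when the stored value still matches the expectation, the way a CQL lightweight transaction reports success or failure):

```java
import java.util.HashMap;
import java.util.Map;

// In-memory model of "UPDATE ... SET value=new WHERE word=key IF value=expected".
public class ConditionalUpdateDemo {
    static boolean updateIf(Map<String, Integer> table, String key,
                            int newValue, int expected) {
        Integer current = table.get(key);
        if (current == null || current != expected) {
            return false; // condition failed; nothing written
        }
        table.put(key, newValue);
        return true;
    }

    public static void main(String[] args) {
        Map<String, Integer> table = new HashMap<>();
        table.put("fox", 6);

        System.out.println(updateIf(table, "fox", 9, 6));  // 6 matched: applied
        System.out.println(updateIf(table, "fox", 16, 6)); // stale expectation: rejected
        System.out.println(table.get("fox"));              // the rejected write left 9 intact
    }
}
```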
    33. Love CQL: conditional updates + batch statements + collections = badass data models.
    34. Announcing: Storm Cassandra CQL! {tuple} → <mapper> → (CQL Statement). Trident batching =? CQL batching.
    35. CassandraCqlState
        public void commit(Long txid) {
            BatchStatement batch = new BatchStatement(Type.LOGGED);
            batch.addAll(this.statements);
            clientFactory.getSession().execute(batch);
        }
        public void addStatement(Statement statement) {
            this.statements.add(statement);
        }
        public ResultSet execute(Statement statement) {
            return clientFactory.getSession().execute(statement);
        }
    36. CassandraCqlStateUpdater
        public void updateState(CassandraCqlState state, List<TridentTuple> tuples,
                                TridentCollector collector) {
            for (TridentTuple tuple : tuples) {
                Statement statement = this.mapper.map(tuple); // mapper: {tuple} → CQL Statement
                state.addStatement(statement);
            }
        }
    37. ExampleMapper
        public Statement map(List<String> keys, Number value) {
            Insert statement = QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
            statement.value(KEY_NAME, keys.get(0));
            statement.value(VALUE_NAME, value);
            return statement;
        }
        public Statement retrieve(List<String> keys) {
            Select.Where statement = QueryBuilder.select()
                .column(KEY_NAME).column(VALUE_NAME)
                .from(KEYSPACE_NAME, TABLE_NAME)
                .where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
            return statement;
        }
    38. Incremental State!
        • Collapse aggregation into the state object. This allows the state object to aggregate with the current state in a loop until success.
        • Uses Trident batching to perform in-memory aggregation for the batch.
        for (tuple : batch)
            state.aggregate(tuple);
        while (failed?) {
            persisted_state = read(state)
            aggregate(in_memory_state, persisted_state)
            failed? = conditionally_update(state)
        }
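The retry loop above can be sketched with an AtomicInteger standing in for the conditionally updated C* row; the simulated concurrent writer is an assumption added to force one failed attempt before the loop converges.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Read, aggregate, conditionally update: retry until the compare-and-set
// succeeds. No locks, and no update is lost.
public class IncrementalStateDemo {
    public static void main(String[] args) {
        AtomicInteger persisted = new AtomicInteger(6); // current count for "fox"
        int inMemory = 3;                               // batch aggregated in memory

        boolean applied = false;
        while (!applied) {
            int current = persisted.get();              // read(state)
            int merged = current + inMemory;            // aggregate(in_memory, persisted)
            if (current == 6) {
                // Simulated concurrent writer: another partition lands
                // add(6, 10) = 16 before our first attempt commits.
                persisted.compareAndSet(6, 16);
            }
            applied = persisted.compareAndSet(current, merged); // conditional update
        }

        System.out.println(persisted.get()); // 6 + 10 + 3 = 19: nothing lost
    }
}
```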
    39. In-memory aggregation by key! Partition 1: {fox: 6, brown: 3}. Partition 2: {fox: 3, lazy: 72}. Both write to C*. No more GroupBy!
    40. To protect against replays, use partition + batch identifier(s) in your conditional update! “BatchId + partitionIndex consistently represents the same data as long as: (1) any repartitioning you do is deterministic (so partitionBy is, but shuffle is not); (2) you’re using a spout that replays the exact same batch each time (which is true of transactional spouts but not of opaque transactional spouts).” - Nathan Marz
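The replay guard can be sketched by storing the last-applied batch identifier next to the value; the table layout and helper names here are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Store (value, lastBatchId) per key. A replayed batch carries the same
// identifier and the same data, so applying it again is a no-op.
public class ReplayProtectionDemo {
    static Map<String, int[]> table = new HashMap<>(); // key -> {value, lastBatchId}

    static void applyBatch(String key, int delta, int batchId) {
        int[] row = table.getOrDefault(key, new int[] {0, -1});
        if (row[1] == batchId) {
            return; // replay: this batch was already applied
        }
        table.put(key, new int[] {row[0] + delta, batchId});
    }

    public static void main(String[] args) {
        applyBatch("fox", 3, 1);
        applyBatch("fox", 3, 1);  // replay of batch 1: ignored
        applyBatch("fox", 10, 2);
        System.out.println(table.get("fox")[0]); // 3 + 10 = 13, not 16
    }
}
```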
    41. The Lambda Architecture
    42. Let’s challenge this a bit, because “additional tools and techniques” cost money and time. Questions: Can we solve the problem with a single tool and a single approach? Can we re-use logic across layers? Or, better yet, can we collapse layers?
    43. A traditional interpretation: a data stream feeds both a Speed Layer (Storm) and a Batch Layer (Hadoop), which feed a Serving Layer (HBase, Impala). D’Oh! Two pipelines!
    44. Integrating Web Services. We need a web service that receives an event and provides: an immediate acknowledgement; a high likelihood that the data is integrated very soon; a guarantee that the data will be integrated eventually. We need an architecture that provides for code / logic and approach re-use, and fault tolerance.
    45. Grand Finale
    46. The Idea: Embedding State! Client → DropWizard (aggregate(tuple) into an IncrementalCqlState) → Kafka → “Batch” Layer (Storm) → C*.
    47. The Sequence of Events
    48. The Wins. Reuse aggregations and state code! To re-compute (or backfill) a dimension, simply re-queue! Storm is the “safety net”: if a DropWizard host fails during aggregation, Storm will fill in the gaps for all ACK’d events. Is there an opportunity to reuse more, e.g. BatchingStrategy and PartitionStrategy?
    49. In the end, all good. =)
    50. Plug: The Book. Shout out: Taylor Goetz.
    51. Thanks! Brian O’Neill, CTO @boneill42