0
Re-envisioning the Lambda Architecture:
Web Services & Real-time Analytics w/
Storm and Cassandra
Brian O’Neill, CTO
bonei...
Talk Breakdown
29%
20%
31%
20%
Topics
(1) Motivation
(2) Polyglot Persistence
(3) Analytics
(4) Lambda Architecture
Health Market Science - Then
What we were.
Health Market Science - Now
Intersecting Big Data
w/ Healthcare
We’re fixing healthcare!
Data Pipelines
I/O
The InputFrom government,
state boards, etc.
From the internet,
social data,
networks / graphs
From third-parties,
medical...
The Output
Script
Claims
Expense
Sanction
Address
Contact
(phone, fax, etc.)
Drug
RepresentativeDivision
Expense ManagerTM...
Sounds easy
Except...
Incomplete Capture
No foreign keys
Differing schemas
Changing schemas
Conflicting information
Ad-hoc...
Why?
?’s
Our MDM Pipeline
- Data Stewardship
- Data Scientists
- Business Analysts
Ingestion
- Semantic Tagging
- Standardization
-...
Our first “Pipeline”
+
Sweet!
Dirt Simple
Lightning Fast
Highly Available
Scalable
Multi-Datacenter (DR)
Not Sweet.
How do we query the data?
NoSQL Indexes?
Do such things exist?
Rev. 1 – Wide Rows!
AOP
Triggers!Data model to
support your
queries.
9 7 32 74 99 12 42
$3.50 $7.00 $8.75 $1.00 $4.20 $3.1...
Transformation
Rev 2 – Elastic Search!
AOP
Triggers!
D’Oh!
What if ES fails?
What about schema / type information?
Rev 3 - Apache Storm!
Anatomy of a Storm Cluster
• Nimbus
– Master Node
• Zookeeper
– Cluster Coordination
• Supervisors
– Worker Nodes
Storm Primitives
• Streams
– Unbounded sequence of tuples
• Spouts
– Stream Sources
• Bolts
– Unit of Computation
• Topolo...
Storm Spouts
• Represents a source (stream) of data
– Queues (JMS, Kafka, Kestrel, etc.)
– Twitter Firehose
– Sensor Data
...
Storm Bolts
• Receive Tuples from Spouts or other Bolts
• Operate on, or React to Data
– Functions/Filters/Joins/Aggregati...
Storm Topologies
• Data flow between spouts and bolts
• Routing of Tuples between spouts/bolts
– Stream “Groupings”
• Para...
Storm Topologies
Persistent Word Count
http://github.com/hmsonline/storm-cassandra
NEXT LEVEL : TRIDENT
Trident
• Part of Storm
• Provides a higher-level abstraction for stream
processing
– Constructs for state management and ...
Trident State
Sequences writes by batch
• Spouts
– Transactional
• Batch contents never change
– Opaque
• Batch contents c...
State Management
Last Batch Value
15 1000
(+59)
Last Batch Value
16 1059
Transactional
Last Batch Previous Current
15 980 ...
BACK TO OUR REGULARLY
SCHEDULED TALK
Polyglot Persistence
“The Right Tool for the Job”
Oracle is a registered trademark
of Oracle Corporation and/or its
affili...
Back to the Pipeline
KafkaDW
Storm
C* ES Titan SQL
MDM Topology*
*Notional
Design Principles
• What we got:
– At-least-once processing
– Simple data flows
• What we needed to account for:
– Replays...
Cassandra State (v0.4.0)
git@github.com:hmsonline/storm-cassandra.git
{tuple}  <mapper>  (ks, cf, row, k:v[])
Storm Cass...
Trident Elastic Search (v0.3.1)
git@github.com:hmsonline/trident-elasticsearch.git
{tuple}  <mapper>  (idx, docid, k:v[]...
Storm Graph (v0.1.2)
Coming soon to...
git@github.com:hmsonline/storm-graph.git
for (tuple : batch)
<processor> (graph, tu...
Storm JDBI (v0.1.14)
INTERNAL ONLY (so far)
Worth releasing?
{tuple}  <mapper>  (JDBC Statement)
All good!
But...
What was the average amount paid for a
medical claim associated with procedure
X by zip code over the last five yea...
Hadoop (<2)? Batch?
Yuck. ‘Nuff Said.
http://www.slideshare.net/prash1784/introduction-to-hadoop-and-pig-15036186
Alternatives?
Let’s Pre-Compute It!
stream
.groupBy(new Field(“procedure”))
.groupBy(new Field(“zip”))
.aggregate(new Field(“amount”),
n...
Lesson Learned
https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
If possible, avoid
re-partitioning
operations...
Why so hard?
D’Oh!
19 != 9
What we don’t want:
LOCKS!
What’s the alternative?
CONSENSUS!
Cassandra 2.0!
http://www.slideshare.net/planetcassandra/nyc-jonathan-ellis-keynote-cassandra-12-20
http://www.cs.cornell....
Conditional Updates
“The alert reader will notice here that
Paxos gives us the ability to agree on
exactly one proposal. A...
Love CQL
Conditional Updates
+
Batch Statements
+
Collections
=
BADASS DATA MODELS
Announcing : Storm Cassandra CQL!
git@github.com:hmsonline/storm-cassandra-cql.git
{tuple}  <mapper>  (CQL Statement)
Tr...
Incremental State!
• Collapse aggregation into the state object.
– This allows the state object to aggregate with current ...
Partition 1
In-Memory Aggregation by Key!
Key Value
fox 6
brown 3
Partition 2
Key Value
fox 3
lazy 72C*
No More GroupBy!
To protect against replays
Use partition + batch identifier(s) in
your conditional update!
“BatchId + partitionIndex consi...
Hyper-Cubes!
Our Terminology
• A cube comprises:
– Dimensions (e.g. procedure, zip, time slice)
– A function (e.g. count, sum)
– Functi...
Complex Event Processing
• For each event,
–Find relevant cubes
–Adjust metrics for event
–Group and aggregate metrics
–Us...
The Lambda Architecture
http://architects.dzone.com/articles/nathan-marzs-lamda
Let’s Challenge This a Bit
because “additional tools and techniques” cost
money and time.
• Questions:
– Can we solve the ...
A Traditional Interpretation
Speed Layer
(Storm)
Batch Layer
(Hadoop)
Data
Stream
Serving Layer
HBase
Impala
D’Oh! Two pip...
Integrating Web Services
• We need a web service that receives an event
and provides,
– an immediate acknowledgement
– a h...
Grand Finale
The Idea : Embedding State!
Kafka
DropWizard
C*
IncrementalCqlState
aggregate(tuple)
“Batch” Layer
(Storm)
Client
The Sequence of Events
The Wins
• Reuse Aggregations and State Code!
• To re-compute (or backfill) a dimension,
simply re-queue!
• Storm is the “...
In the end, all good. =)
Plug
The Book
Shout out:
Taylor Goetz
Thanks!
Brought to you by
Brian O’Neill, CTO
boneill@healthmarketscience.com
@boneill42
12 years together
APPENDIX
CassandraCqlState
public void commit(Long txid) {
BatchStatement batch = new BatchStatement(Type.LOGGED);
batch.addAll(thi...
CassandraCqlStateUpdater
public void updateState(CassandraCqlState state,
List<TridentTuple> tuples,
TridentCollector coll...
ExampleMapper
public Statement map(List<String> keys, Number value) {
Insert statement =
QueryBuilder.insertInto(KEYSPACE_...
Upcoming SlideShare
Loading in...5
×

Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics w/ Storm and Cassandra

1,113

Published on

Re-envisioning the Lambda Architecture: Web Services & Real-time Analytics w/ Storm and Cassandra

Published in: Software, Technology
0 Comments
5 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total Views
1,113
On Slideshare
0
From Embeds
0
Number of Embeds
2
Actions
Shares
0
Downloads
35
Comments
0
Likes
5
Embeds 0
No embeds

No notes for slide
  • Tuple: set of key-value pairs (values can be serialized objects)
  • title Distributed Counting

    participant A
    participant B
    participant Storage

    note over Storage
    {"fox" : 6}
    end note
    note over A
    count("fox", batch)=3
    end note
    A->Storage: read("fox")
    note over B
    count("fox", batch)=10
    end note
    Storage->A: 6
    B->Storage: read("fox")
    Storage->B: 8
    note over A
    add(6, 3) = 9
    end note
    note over B
    add(6, 10) = 16
    end note
    B->Storage: write(16)
    A->Storage: write(9)
    note over Storage
    {"fox":16}
    end note

  • title Distributed Counting

    participant Client
    participant DropWizard
    participant Kafka
    participant State(1)
    participant C*
    participant Storm
    participant State(2)

    Client->DropWizard: POST(event)
    DropWizard->State(1): aggregate(new Tuple(event))
    DropWizard->Kafka: queue(event)
    DropWizard->Client: 200(ACK)
    note over State(1)
    duration (30 sec.)
    end note
    State(1)->C*: state, events = read(key)
    note over State(1)
    state = aggregate (state, in_memory_state)
    events = join (events, in_memory_events)
    end note
    State(1)->C*: write(state, events)
    Kafka->Storm: dequeue(event)
    Storm->State(2): persisted_state, events = read(key)
    note over State(2)
    if (!contains?(event))
    ...
    end note
    State(2)->C*: if !contains(ids) write(state)


  • Transcript of "Re-envisioning the Lambda Architecture : Web Services & Real-time Analytics w/ Storm and Cassandra "

    1. 1. Re-envisioning the Lambda Architecture: Web Services & Real-time Analytics w/ Storm and Cassandra Brian O’Neill, CTO boneill@healthmarketscience.com @boneill42
    2. 2. Talk Breakdown 29% 20% 31% 20% Topics (1) Motivation (2) Polyglot Persistence (3) Analytics (4) Lambda Architecture
    3. 3. Health Market Science - Then What we were.
    4. 4. Health Market Science - Now Intersecting Big Data w/ Healthcare We’re fixing healthcare!
    5. 5. Data Pipelines I/O
    6. 6. The InputFrom government, state boards, etc. From the internet, social data, networks / graphs From third-parties, medical claims From customers, expenses, sales data, beneficiary information, quality scores Data Pipeline
    7. 7. The Output Script Claims Expense Sanction Address Contact (phone, fax, etc.) Drug RepresentativeDivision Expense ManagerTM Provider Verification™ MarketViewTM Customer Feed(s) Customer Master Provider MasterFileTM Credentials “Agile MDM” 1 billion claims per year Organization Practitioner Referrals
    8. 8. Sounds easy Except... Incomplete Capture No foreign keys Differing schemas Changing schemas Conflicting information Ad-hoc Analysis (is hard) Point-In-Time Retrieval
    9. 9. Why? ?’s
    10. 10. Our MDM Pipeline - Data Stewardship - Data Scientists - Business Analysts Ingestion - Semantic Tagging - Standardization - Data Mapping Incorporation - Consolidation - Enumeration - Association Insight - Search - Reports - Analytics Feeds (multiple formats, changing over time) API / FTP Web Interface DimensionsLogicRules
    11. 11. Our first “Pipeline” +
    12. 12. Sweet! Dirt Simple Lightning Fast Highly Available Scalable Multi-Datacenter (DR)
    13. 13. Not Sweet. How do we query the data? NoSQL Indexes? Do such things exist?
    14. 14. Rev. 1 – Wide Rows! AOP Triggers!Data model to support your queries. 9 7 32 74 99 12 42 $3.50 $7.00 $8.75 $1.00 $4.20 $3.17 $8.88 ONC : PA : 19460 D’Oh! What about ad hoc?
    15. 15. Transformation Rev 2 – Elastic Search! AOP Triggers! D’Oh! What if ES fails? What about schema / type information?
    16. 16. Rev 3 - Apache Storm!
    17. 17. Anatomy of a Storm Cluster • Nimbus – Master Node • Zookeeper – Cluster Coordination • Supervisors – Worker Nodes
    18. 18. Storm Primitives • Streams – Unbounded sequence of tuples • Spouts – Stream Sources • Bolts – Unit of Computation • Topologies – Combination of n Spouts and m Bolts – Defines the overall “Computation”
    19. 19. Storm Spouts • Represents a source (stream) of data – Queues (JMS, Kafka, Kestrel, etc.) – Twitter Firehose – Sensor Data • Emits “Tuples” (Events) based on source – Primary Storm data structure – Set of Key-Value pairs
    20. 20. Storm Bolts • Receive Tuples from Spouts or other Bolts • Operate on, or React to Data – Functions/Filters/Joins/Aggregations – Database writes/lookups • Optionally emit additional Tuples
    21. 21. Storm Topologies • Data flow between spouts and bolts • Routing of Tuples between spouts/bolts – Stream “Groupings” • Parallelism of Components • Long-Lived
    22. 22. Storm Topologies
    23. 23. Persistent Word Count http://github.com/hmsonline/storm-cassandra
    24. 24. NEXT LEVEL : TRIDENT
    25. 25. Trident • Part of Storm • Provides a higher-level abstraction for stream processing – Constructs for state management and batching • Adds additional primitives that abstract away common topological patterns
    26. 26. Trident State Sequences writes by batch • Spouts – Transactional • Batch contents never change – Opaque • Batch contents can change • State – Transactional • Store batch number with counts to maintain sequencing of writes – Opaque • Store previous value in order to overwrite the current value when contents of a batch change
    27. 27. State Management Last Batch Value 15 1000 (+59) Last Batch Value 16 1059 Transactional Last Batch Previous Current 15 980 1000 (+59) Opaque replay == incorporated already? (because batch composition is the same) Last Batch Previous Current 16 1000 1059 Last Batch Previous Current 15 980 1000 (+72) Last Batch Previous Current 16 1000 1072 replay == re-incorporate Batch composition changes! (not guaranteed)
    28. 28. BACK TO OUR REGULARLY SCHEDULED TALK
    29. 29. Polyglot Persistence “The Right Tool for the Job” Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
    30. 30. Back to the Pipeline KafkaDW Storm C* ES Titan SQL
    31. 31. MDM Topology* *Notional
    32. 32. Design Principles • What we got: – At-least-once processing – Simple data flows • What we needed to account for: – Replays Idempotent Operations! Immutable Data!
    33. 33. Cassandra State (v0.4.0) git@github.com:hmsonline/storm-cassandra.git {tuple}  <mapper>  (ks, cf, row, k:v[]) Storm Cassandra
    34. 34. Trident Elastic Search (v0.3.1) git@github.com:hmsonline/trident-elasticsearch.git {tuple}  <mapper>  (idx, docid, k:v[]) Storm Elastic Search
    35. 35. Storm Graph (v0.1.2) Coming soon to... git@github.com:hmsonline/storm-graph.git for (tuple : batch) <processor> (graph, tuple)
    36. 36. Storm JDBI (v0.1.14) INTERNAL ONLY (so far) Worth releasing? {tuple}  <mapper>  (JDBC Statement)
    37. 37. All good!
    38. 38. But... What was the average amount paid for a medical claim associated with procedure X by zip code over the last five years?
    39. 39. Hadoop (<2)? Batch? Yuck. ‘Nuff Said. http://www.slideshare.net/prash1784/introduction-to-hadoop-and-pig-15036186
    40. 40. Alternatives?
    41. 41. Let’s Pre-Compute It! stream .groupBy(new Field(“procedure”)) .groupBy(new Field(“zip”)) .aggregate(new Field(“amount”), new Average()) D’Oh! GroupBy’s. They set data in motion!
    42. 42. Lesson Learned https://github.com/nathanmarz/storm/wiki/Trident-API-Overview If possible, avoid re-partitioning operations! (e.g. LOG.error!)
    43. 43. Why so hard? D’Oh! 19 != 9 What we don’t want: LOCKS! What’s the alternative? CONSENSUS!
    44. 44. Cassandra 2.0! http://www.slideshare.net/planetcassandra/nyc-jonathan-ellis-keynote-cassandra-12-20 http://www.cs.cornell.edu/courses/CS6452/2012sp/papers/paxos-complex.pdf
    45. 45. Conditional Updates “The alert reader will notice here that Paxos gives us the ability to agree on exactly one proposal. After one has been accepted, it will be returned to future leaders in the promise, and the new leader will have to re-propose it again.” http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0 UPDATE value=9 WHERE word=“fox” IF value=6
    46. 46. Love CQL Conditional Updates + Batch Statements + Collections = BADASS DATA MODELS
    47. 47. Announcing : Storm Cassandra CQL! git@github.com:hmsonline/storm-cassandra-cql.git {tuple}  <mapper>  (CQL Statement) Trident Batching =? CQL Batching
    48. 48. Incremental State! • Collapse aggregation into the state object. – This allows the state object to aggregate with current state in a loop until success. • Uses Trident Batching to perform in-memory aggregation for the batch. for (tuple : batch) state.aggregate(tuple); while (failed?) { persisted_state = read(state) aggregate(in_memory_state, persisted_state) failed? = conditionally_update(state) }
    49. 49. Partition 1 In-Memory Aggregation by Key! Key Value fox 6 brown 3 Partition 2 Key Value fox 3 lazy 72C* No More GroupBy!
    50. 50. To protect against replays Use partition + batch identifier(s) in your conditional update! “BatchId + partitionIndex consistently represents the same data as long as: 1.Any repartitioning you do is deterministic (so partitionBy is, but shuffle is not) 2.You're using a spout that replays the exact same batch each time (which is true of transactional spouts but not of opaque transactional spouts)” - Nathan Marz
    51. 51. Hyper-Cubes!
    52. 52. Our Terminology • A cube comprises: – Dimensions (e.g. procedure, zip, time slice) – A function (e.g. count, sum) – Function fields (e.g. amount paid) – Granularity • A metric comprises: – Coordinates (e.g. vasectomy, 19460, 879123 – hour since epoch) – A value (e.g. 500 procedures) • A perspective comprises: – A range of coordinates (*, 19460, January) – An interval (day)
    53. 53. Complex Event Processing • For each event, –Find relevant cubes –Adjust metrics for event –Group and aggregate metrics –Use conditional updates to incorporate metric into persistent state
    54. 54. The Lambda Architecture http://architects.dzone.com/articles/nathan-marzs-lamda
    55. 55. Let’s Challenge This a Bit because “additional tools and techniques” cost money and time. • Questions: – Can we solve the problem with a single tool and a single approach? – Can we re-use logic across layers? – Or better yet, can we collapse layers?
    56. 56. A Traditional Interpretation Speed Layer (Storm) Batch Layer (Hadoop) Data Stream Serving Layer HBase Impala D’Oh! Two pipelines!
    57. 57. Integrating Web Services • We need a web service that receives an event and provides, – an immediate acknowledgement – a high likelihood that the data is integrated very soon – a guarantee that the data will be integrated eventually • We need an architecture that provides for, – Code / Logic and approach re-use – Fault-Tolerance
    58. 58. Grand Finale
    59. 59. The Idea : Embedding State! Kafka DropWizard C* IncrementalCqlState aggregate(tuple) “Batch” Layer (Storm) Client
    60. 60. The Sequence of Events
    61. 61. The Wins • Reuse Aggregations and State Code! • To re-compute (or backfill) a dimension, simply re-queue! • Storm is the “safety” net – If a DW host fails during aggregation, Storm will fill in the gaps for all ACK’d events. • Is there an opportunity to reuse more? – BatchingStrategy & PartitionStrategy?
    62. 62. In the end, all good. =)
    63. 63. Plug The Book Shout out: Taylor Goetz
    64. 64. Thanks! Brought to you by Brian O’Neill, CTO boneill@healthmarketscience.com @boneill42 12 years together
    65. 65. APPENDIX
    66. 66. CassandraCqlState public void commit(Long txid) { BatchStatement batch = new BatchStatement(Type.LOGGED); batch.addAll(this.statements); clientFactory.getSession().execute(batch); } public void addStatement(Statement statement) { this.statements.add(statement); } public ResultSet execute(Statement statement){ return clientFactory.getSession().execute(statement); }
    67. 67. CassandraCqlStateUpdater public void updateState(CassandraCqlState state, List<TridentTuple> tuples, TridentCollector collector) { for (TridentTuple tuple : tuples) { Statement statement = this.mapper.map(tuple); state.addStatement(statement); } }
    68. 68. ExampleMapper public Statement map(List<String> keys, Number value) { Insert statement = QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME); statement.value(KEY_NAME, keys.get(0)); statement.value(VALUE_NAME, value); return statement; } public Statement retrieve(List<String> keys) { Select statement = QueryBuilder.select() .column(KEY_NAME).column(VALUE_NAME) .from(KEYSPACE_NAME, TABLE_NAME) .where(QueryBuilder.eq(KEY_NAME, keys.get(0))); return statement; }
    1. A particular slide catching your eye?

      Clipping is a handy way to collect important slides you want to go back to later.

    ×