SlideShare a Scribd company logo
1 of 68
Re-envisioning the Lambda Architecture:
Web Services & Real-time Analytics w/
Storm and Cassandra
Brian O’Neill, CTO
boneill@healthmarketscience.com
@boneill42
Talk Breakdown
29%
20%
31%
20%
Topics
(1) Motivation
(2) Polyglot Persistence
(3) Analytics
(4) Lambda Architecture
Health Market Science - Then
What we were.
Health Market Science - Now
Intersecting Big Data
w/ Healthcare
We’re fixing healthcare!
Data Pipelines
I/O
The InputFrom government,
state boards, etc.
From the internet,
social data,
networks / graphs
From third-parties,
medical claims
From customers,
expenses,
sales data,
beneficiary information,
quality scores
Data
Pipeline
The Output
Script
Claims
Expense
Sanction
Address
Contact
(phone, fax, etc.)
Drug
RepresentativeDivision
Expense ManagerTM
Provider Verification™
MarketViewTM
Customer
Feed(s)
Customer
Master
Provider MasterFileTM
Credentials
“Agile MDM”
1 billion claims
per year
Organization
Practitioner
Referrals
Sounds easy
Except...
Incomplete Capture
No foreign keys
Differing schemas
Changing schemas
Conflicting information
Ad-hoc Analysis (is hard)
Point-In-Time Retrieval
Why?
?’s
Our MDM Pipeline
- Data Stewardship
- Data Scientists
- Business Analysts
Ingestion
- Semantic Tagging
- Standardization
- Data Mapping
Incorporation
- Consolidation
- Enumeration
- Association
Insight
- Search
- Reports
- Analytics
Feeds
(multiple formats,
changing over time)
API / FTP Web Interface
DimensionsLogicRules
Our first “Pipeline”
+
Sweet!
Dirt Simple
Lightning Fast
Highly Available
Scalable
Multi-Datacenter (DR)
Not Sweet.
How do we query the data?
NoSQL Indexes?
Do such things exist?
Rev. 1 – Wide Rows!
AOP
Triggers!Data model to
support your
queries.
9 7 32 74 99 12 42
$3.50 $7.00 $8.75 $1.00 $4.20 $3.17 $8.88
ONC : PA : 19460
D’Oh! What about ad hoc?
Transformation
Rev 2 – Elastic Search!
AOP
Triggers!
D’Oh!
What if ES fails?
What about schema / type information?
Rev 3 - Apache Storm!
Anatomy of a Storm Cluster
• Nimbus
– Master Node
• Zookeeper
– Cluster Coordination
• Supervisors
– Worker Nodes
Storm Primitives
• Streams
– Unbounded sequence of tuples
• Spouts
– Stream Sources
• Bolts
– Unit of Computation
• Topologies
– Combination of n Spouts and m Bolts
– Defines the overall “Computation”
Storm Spouts
• Represents a source (stream) of data
– Queues (JMS, Kafka, Kestrel, etc.)
– Twitter Firehose
– Sensor Data
• Emits “Tuples” (Events) based on source
– Primary Storm data structure
– Set of Key-Value pairs
Storm Bolts
• Receive Tuples from Spouts or other Bolts
• Operate on, or React to Data
– Functions/Filters/Joins/Aggregations
– Database writes/lookups
• Optionally emit additional Tuples
Storm Topologies
• Data flow between spouts and bolts
• Routing of Tuples between spouts/bolts
– Stream “Groupings”
• Parallelism of Components
• Long-Lived
Storm Topologies
Persistent Word Count
http://github.com/hmsonline/storm-cassandra
NEXT LEVEL : TRIDENT
Trident
• Part of Storm
• Provides a higher-level abstraction for stream
processing
– Constructs for state management and batching
• Adds additional primitives that abstract away
common topological patterns
Trident State
Sequences writes by batch
• Spouts
– Transactional
• Batch contents never change
– Opaque
• Batch contents can change
• State
– Transactional
• Store batch number with counts to maintain sequencing of
writes
– Opaque
• Store previous value in order to overwrite the current value
when contents of a batch change
State Management
Last Batch Value
15 1000
(+59)
Last Batch Value
16 1059
Transactional
Last Batch Previous Current
15 980 1000
(+59)
Opaque
replay == incorporated already?
(because batch composition is the same)
Last Batch Previous Current
16 1000 1059
Last Batch Previous Current
15 980 1000
(+72)
Last Batch Previous Current
16 1000 1072
replay == re-incorporate
Batch composition changes! (not guaranteed)
BACK TO OUR REGULARLY
SCHEDULED TALK
Polyglot Persistence
“The Right Tool for the Job”
Oracle is a registered trademark
of Oracle Corporation and/or its
affiliates. Other names may be
trademarks of their respective
owners.
Back to the Pipeline
KafkaDW
Storm
C* ES Titan SQL
MDM Topology*
*Notional
Design Principles
• What we got:
– At-least-once processing
– Simple data flows
• What we needed to account for:
– Replays
Idempotent Operations!
Immutable Data!
Cassandra State (v0.4.0)
git@github.com:hmsonline/storm-cassandra.git
{tuple}  <mapper>  (ks, cf, row, k:v[])
Storm Cassandra
Trident Elastic Search (v0.3.1)
git@github.com:hmsonline/trident-elasticsearch.git
{tuple}  <mapper>  (idx, docid, k:v[])
Storm Elastic Search
Storm Graph (v0.1.2)
Coming soon to...
git@github.com:hmsonline/storm-graph.git
for (tuple : batch)
<processor> (graph, tuple)
Storm JDBI (v0.1.14)
INTERNAL ONLY (so far)
Worth releasing?
{tuple}  <mapper>  (JDBC Statement)
All good!
But...
What was the average amount paid for a
medical claim associated with procedure
X by zip code over the last five years?
Hadoop (<2)? Batch?
Yuck. ‘Nuff Said.
http://www.slideshare.net/prash1784/introduction-to-hadoop-and-pig-15036186
Alternatives?
Let’s Pre-Compute It!
stream
.groupBy(new Field(“procedure”))
.groupBy(new Field(“zip”))
.aggregate(new Field(“amount”),
new Average())
D’Oh!
GroupBy’s.
They set data in motion!
Lesson Learned
https://github.com/nathanmarz/storm/wiki/Trident-API-Overview
If possible, avoid
re-partitioning
operations!
(e.g. LOG.error!)
Why so hard?
D’Oh!
19 != 9
What we don’t want:
LOCKS!
What’s the alternative?
CONSENSUS!
Cassandra 2.0!
http://www.slideshare.net/planetcassandra/nyc-jonathan-ellis-keynote-cassandra-12-20
http://www.cs.cornell.edu/courses/CS6452/2012sp/papers/paxos-complex.pdf
Conditional Updates
“The alert reader will notice here that
Paxos gives us the ability to agree on
exactly one proposal. After one has been
accepted, it will be returned to future
leaders in the promise, and the new leader
will have to re-propose it again.”
http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0
UPDATE value=9 WHERE word=“fox” IF value=6
Love CQL
Conditional Updates
+
Batch Statements
+
Collections
=
BADASS DATA MODELS
Announcing : Storm Cassandra CQL!
git@github.com:hmsonline/storm-cassandra-cql.git
{tuple}  <mapper>  (CQL Statement)
Trident Batching =? CQL Batching
Incremental State!
• Collapse aggregation into the state object.
– This allows the state object to aggregate with current state
in a loop until success.
• Uses Trident Batching to perform in-memory
aggregation for the batch.
for (tuple : batch)
state.aggregate(tuple);
while (failed?) {
persisted_state = read(state)
aggregate(in_memory_state, persisted_state)
failed? = conditionally_update(state)
}
Partition 1
In-Memory Aggregation by Key!
Key Value
fox 6
brown 3
Partition 2
Key Value
fox 3
lazy 72C*
No More GroupBy!
To protect against replays
Use partition + batch identifier(s) in
your conditional update!
“BatchId + partitionIndex consistently represents the
same data as long as:
1.Any repartitioning you do is deterministic (so
partitionBy is, but shuffle is not)
2.You're using a spout that replays the exact same
batch each time (which is true of transactional spouts
but not of opaque transactional spouts)”
- Nathan Marz
Hyper-Cubes!
Our Terminology
• A cube comprises:
– Dimensions (e.g. procedure, zip, time slice)
– A function (e.g. count, sum)
– Function fields (e.g. amount paid)
– Granularity
• A metric comprises:
– Coordinates (e.g. vasectomy, 19460, 879123 – hour since epoch)
– A value (e.g. 500 procedures)
• A perspective comprises:
– A range of coordinates (*, 19460, January)
– An interval (day)
Complex Event Processing
• For each event,
–Find relevant cubes
–Adjust metrics for event
–Group and aggregate metrics
–Use conditional updates to incorporate
metric into persistent state
The Lambda Architecture
http://architects.dzone.com/articles/nathan-marzs-lamda
Let’s Challenge This a Bit
because “additional tools and techniques” cost
money and time.
• Questions:
– Can we solve the problem with a single tool and a
single approach?
– Can we re-use logic across layers?
– Or better yet, can we collapse layers?
A Traditional Interpretation
Speed Layer
(Storm)
Batch Layer
(Hadoop)
Data
Stream
Serving Layer
HBase
Impala
D’Oh! Two pipelines!
Integrating Web Services
• We need a web service that receives an event
and provides,
– an immediate acknowledgement
– a high likelihood that the data is integrated very soon
– a guarantee that the data will be integrated eventually
• We need an architecture that provides for,
– Code / Logic and approach re-use
– Fault-Tolerance
Grand Finale
The Idea : Embedding State!
Kafka
DropWizard
C*
IncrementalCqlState
aggregate(tuple)
“Batch” Layer
(Storm)
Client
The Sequence of Events
The Wins
• Reuse Aggregations and State Code!
• To re-compute (or backfill) a dimension,
simply re-queue!
• Storm is the “safety” net
– If a DW host fails during aggregation, Storm will fill
in the gaps for all ACK’d events.
• Is there an opportunity to reuse more?
– BatchingStrategy & PartitionStrategy?
In the end, all good. =)
Plug
The Book
Shout out:
Taylor Goetz
Thanks!
Brought to you by
Brian O’Neill, CTO
boneill@healthmarketscience.com
@boneill42
12 years together
APPENDIX
CassandraCqlState
public void commit(Long txid) {
BatchStatement batch = new BatchStatement(Type.LOGGED);
batch.addAll(this.statements);
clientFactory.getSession().execute(batch);
}
public void addStatement(Statement statement) {
this.statements.add(statement);
}
public ResultSet execute(Statement statement){
return clientFactory.getSession().execute(statement);
}
CassandraCqlStateUpdater
public void updateState(CassandraCqlState state,
List<TridentTuple> tuples,
TridentCollector collector) {
for (TridentTuple tuple : tuples) {
Statement statement = this.mapper.map(tuple);
state.addStatement(statement);
}
}
ExampleMapper
public Statement map(List<String> keys, Number value) {
Insert statement =
QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME);
statement.value(KEY_NAME, keys.get(0));
statement.value(VALUE_NAME, value);
return statement;
}
public Statement retrieve(List<String> keys) {
Select statement = QueryBuilder.select()
.column(KEY_NAME).column(VALUE_NAME)
.from(KEYSPACE_NAME, TABLE_NAME)
.where(QueryBuilder.eq(KEY_NAME, keys.get(0)));
return statement;
}

More Related Content

What's hot

Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014Chen-en Lu
 
Introduction Apache Kafka
Introduction Apache KafkaIntroduction Apache Kafka
Introduction Apache KafkaJoe Stein
 
A la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIAA la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIALa Cuisine du Web
 
Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Knoldus Inc.
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planningconfluent
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Guido Schmutz
 
Kafka connect 101
Kafka connect 101Kafka connect 101
Kafka connect 101Whiteklay
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaShiao-An Yuan
 
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...Natan Silnitsky
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013mumrah
 
Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)StreamNative
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloJoe Stein
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaGuozhang Wang
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Devoxx Morocco 2016 - Microservices with Kafka
Devoxx Morocco 2016 - Microservices with KafkaDevoxx Morocco 2016 - Microservices with Kafka
Devoxx Morocco 2016 - Microservices with KafkaLászló-Róbert Albert
 

What's hot (20)

Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
Apache Kafka: A high-throughput distributed messaging system @ JCConf 2014
 
Introduction Apache Kafka
Introduction Apache KafkaIntroduction Apache Kafka
Introduction Apache Kafka
 
A la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIAA la rencontre de Kafka, le log distribué par Florian GARCIA
A la rencontre de Kafka, le log distribué par Florian GARCIA
 
Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1Introduction to Apache Kafka- Part 1
Introduction to Apache Kafka- Part 1
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity PlanningFrom Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
From Message to Cluster: A Realworld Introduction to Kafka Capacity Planning
 
Apache kafka
Apache kafkaApache kafka
Apache kafka
 
Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !Apache Kafka - Scalable Message-Processing and more !
Apache Kafka - Scalable Message-Processing and more !
 
Kafka connect 101
Kafka connect 101Kafka connect 101
Kafka connect 101
 
kafka for db as postgres
kafka for db as postgreskafka for db as postgres
kafka for db as postgres
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...
8 Lessons Learned from Using Kafka in 1000 Scala microservices - Scale by the...
 
Apache kafka introduction
Apache kafka introductionApache kafka introduction
Apache kafka introduction
 
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
Introduction and Overview of Apache Kafka, TriHUG July 23, 2013
 
Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)Lessons from managing a Pulsar cluster (Nutanix)
Lessons from managing a Pulsar cluster (Nutanix)
 
Kafka basics
Kafka basicsKafka basics
Kafka basics
 
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache AccumuloReal-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo
 
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache KafkaBuilding Stream Infrastructure across Multiple Data Centers with Apache Kafka
Building Stream Infrastructure across Multiple Data Centers with Apache Kafka
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Devoxx Morocco 2016 - Microservices with Kafka
Devoxx Morocco 2016 - Microservices with KafkaDevoxx Morocco 2016 - Microservices with Kafka
Devoxx Morocco 2016 - Microservices with Kafka
 

Similar to Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & Real-time Analytics w/ Storm and Cassandra

Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardPhily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardBrian O'Neill
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Brian O'Neill
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureGabriele Modena
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormDataStax
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeItai Yaffe
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Folio3 Software
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22marpierc
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsMichael Häusler
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Anton Nazaruk
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Jim Dowling
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...confluent
 
Deconstructing Lambda
Deconstructing LambdaDeconstructing Lambda
Deconstructing Lambdadarach
 
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...Flink Forward
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxTrayan Iliev
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataDataWorks Summit/Hadoop Summit
 
Introduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big DataIntroduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big Datawaheed751
 

Similar to Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & Real-time Analytics w/ Storm and Cassandra (20)

Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizardPhily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
Phily JUG : Web Services APIs for Real-time Analytics w/ Storm and DropWizard
 
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
Data Pipelines & Integrating Real-time Web Services w/ Storm : Improving on t...
 
Yahoo compares Storm and Spark
Yahoo compares Storm and SparkYahoo compares Storm and Spark
Yahoo compares Storm and Spark
 
Moving Towards a Streaming Architecture
Moving Towards a Streaming ArchitectureMoving Towards a Streaming Architecture
Moving Towards a Streaming Architecture
 
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with StormC*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
C*ollege Credit: CEP Distribtued Processing on Cassandra with Storm
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
Distributed and Fault Tolerant Realtime Computation with Apache Storm, Apache...
 
Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22Cyberinfrastructure and Applications Overview: Howard University June22
Cyberinfrastructure and Applications Overview: Howard University June22
 
Integration Patterns for Big Data Applications
Integration Patterns for Big Data ApplicationsIntegration Patterns for Big Data Applications
Integration Patterns for Big Data Applications
 
Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?Big Data Streams Architectures. Why? What? How?
Big Data Streams Architectures. Why? What? How?
 
Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks Metadata and Provenance for ML Pipelines with Hopsworks
Metadata and Provenance for ML Pipelines with Hopsworks
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
Event Driven Architecture with a RESTful Microservices Architecture (Kyle Ben...
 
Deconstructing Lambda
Deconstructing LambdaDeconstructing Lambda
Deconstructing Lambda
 
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Making Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFluxMaking Machine Learning Easy with H2O and WebFlux
Making Machine Learning Easy with H2O and WebFlux
 
Apache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing dataApache Beam: A unified model for batch and stream processing data
Apache Beam: A unified model for batch and stream processing data
 
Introduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big DataIntroduction to Cloud Computing and Big Data
Introduction to Cloud Computing and Big Data
 
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
Javantura v3 - Real-time BigData ingestion and querying of aggregated data – ...
 

More from DataStax Academy

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftDataStax Academy
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseDataStax Academy
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraDataStax Academy
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsDataStax Academy
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingDataStax Academy
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackDataStax Academy
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache CassandraDataStax Academy
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready CassandraDataStax Academy
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonDataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1DataStax Academy
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2DataStax Academy
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First ClusterDataStax Academy
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with DseDataStax Academy
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraDataStax Academy
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseDataStax Academy
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraDataStax Academy
 

More from DataStax Academy (20)

Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craftForrester CXNYC 2017 - Delivering great real-time cx is a true craft
Forrester CXNYC 2017 - Delivering great real-time cx is a true craft
 
Introduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph DatabaseIntroduction to DataStax Enterprise Graph Database
Introduction to DataStax Enterprise Graph Database
 
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache CassandraIntroduction to DataStax Enterprise Advanced Replication with Apache Cassandra
Introduction to DataStax Enterprise Advanced Replication with Apache Cassandra
 
Cassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart LabsCassandra on Docker @ Walmart Labs
Cassandra on Docker @ Walmart Labs
 
Cassandra 3.0 Data Modeling
Cassandra 3.0 Data ModelingCassandra 3.0 Data Modeling
Cassandra 3.0 Data Modeling
 
Cassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stackCassandra Adoption on Cisco UCS & Open stack
Cassandra Adoption on Cisco UCS & Open stack
 
Data Modeling for Apache Cassandra
Data Modeling for Apache CassandraData Modeling for Apache Cassandra
Data Modeling for Apache Cassandra
 
Coursera Cassandra Driver
Coursera Cassandra DriverCoursera Cassandra Driver
Coursera Cassandra Driver
 
Production Ready Cassandra
Production Ready CassandraProduction Ready Cassandra
Production Ready Cassandra
 
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & PythonCassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
Cassandra @ Netflix: Monitoring C* at Scale, Gossip and Tickler & Python
 
Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1Cassandra @ Sony: The good, the bad, and the ugly part 1
Cassandra @ Sony: The good, the bad, and the ugly part 1
 
Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2Cassandra @ Sony: The good, the bad, and the ugly part 2
Cassandra @ Sony: The good, the bad, and the ugly part 2
 
Standing Up Your First Cluster
Standing Up Your First ClusterStanding Up Your First Cluster
Standing Up Your First Cluster
 
Real Time Analytics with Dse
Real Time Analytics with DseReal Time Analytics with Dse
Real Time Analytics with Dse
 
Introduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache CassandraIntroduction to Data Modeling with Apache Cassandra
Introduction to Data Modeling with Apache Cassandra
 
Cassandra Core Concepts
Cassandra Core ConceptsCassandra Core Concepts
Cassandra Core Concepts
 
Enabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax EnterpriseEnabling Search in your Cassandra Application with DataStax Enterprise
Enabling Search in your Cassandra Application with DataStax Enterprise
 
Bad Habits Die Hard
Bad Habits Die Hard Bad Habits Die Hard
Bad Habits Die Hard
 
Advanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache CassandraAdvanced Data Modeling with Apache Cassandra
Advanced Data Modeling with Apache Cassandra
 
Advanced Cassandra
Advanced CassandraAdvanced Cassandra
Advanced Cassandra
 

Recently uploaded

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Hyundai Motor Group
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Recently uploaded (20)

Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2Next-generation AAM aircraft unveiled by Supernal, S-A2
Next-generation AAM aircraft unveiled by Supernal, S-A2
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

Cassandra Day 2014: Re-envisioning the Lambda Architecture - Web-Services & Real-time Analytics w/ Storm and Cassandra

  • 1. Re-envisioning the Lambda Architecture: Web Services & Real-time Analytics w/ Storm and Cassandra Brian O’Neill, CTO boneill@healthmarketscience.com @boneill42
  • 2. Talk Breakdown 29% 20% 31% 20% Topics (1) Motivation (2) Polyglot Persistence (3) Analytics (4) Lambda Architecture
  • 3. Health Market Science - Then What we were.
  • 4. Health Market Science - Now Intersecting Big Data w/ Healthcare We’re fixing healthcare!
  • 6. The InputFrom government, state boards, etc. From the internet, social data, networks / graphs From third-parties, medical claims From customers, expenses, sales data, beneficiary information, quality scores Data Pipeline
  • 7. The Output Script Claims Expense Sanction Address Contact (phone, fax, etc.) Drug RepresentativeDivision Expense ManagerTM Provider Verification™ MarketViewTM Customer Feed(s) Customer Master Provider MasterFileTM Credentials “Agile MDM” 1 billion claims per year Organization Practitioner Referrals
  • 8. Sounds easy Except... Incomplete Capture No foreign keys Differing schemas Changing schemas Conflicting information Ad-hoc Analysis (is hard) Point-In-Time Retrieval
  • 10. Our MDM Pipeline - Data Stewardship - Data Scientists - Business Analysts Ingestion - Semantic Tagging - Standardization - Data Mapping Incorporation - Consolidation - Enumeration - Association Insight - Search - Reports - Analytics Feeds (multiple formats, changing over time) API / FTP Web Interface DimensionsLogicRules
  • 12. Sweet! Dirt Simple Lightning Fast Highly Available Scalable Multi-Datacenter (DR)
  • 13. Not Sweet. How do we query the data? NoSQL Indexes? Do such things exist?
  • 14. Rev. 1 – Wide Rows! AOP Triggers!Data model to support your queries. 9 7 32 74 99 12 42 $3.50 $7.00 $8.75 $1.00 $4.20 $3.17 $8.88 ONC : PA : 19460 D’Oh! What about ad hoc?
  • 15. Transformation Rev 2 – Elastic Search! AOP Triggers! D’Oh! What if ES fails? What about schema / type information?
  • 16. Rev 3 - Apache Storm!
  • 17. Anatomy of a Storm Cluster • Nimbus – Master Node • Zookeeper – Cluster Coordination • Supervisors – Worker Nodes
  • 18. Storm Primitives • Streams – Unbounded sequence of tuples • Spouts – Stream Sources • Bolts – Unit of Computation • Topologies – Combination of n Spouts and m Bolts – Defines the overall “Computation”
  • 19. Storm Spouts • Represents a source (stream) of data – Queues (JMS, Kafka, Kestrel, etc.) – Twitter Firehose – Sensor Data • Emits “Tuples” (Events) based on source – Primary Storm data structure – Set of Key-Value pairs
  • 20. Storm Bolts • Receive Tuples from Spouts or other Bolts • Operate on, or React to Data – Functions/Filters/Joins/Aggregations – Database writes/lookups • Optionally emit additional Tuples
  • 21. Storm Topologies • Data flow between spouts and bolts • Routing of Tuples between spouts/bolts – Stream “Groupings” • Parallelism of Components • Long-Lived
  • 24. NEXT LEVEL : TRIDENT
  • 25. Trident • Part of Storm • Provides a higher-level abstraction for stream processing – Constructs for state management and batching • Adds additional primitives that abstract away common topological patterns
  • 26. Trident State Sequences writes by batch • Spouts – Transactional • Batch contents never change – Opaque • Batch contents can change • State – Transactional • Store batch number with counts to maintain sequencing of writes – Opaque • Store previous value in order to overwrite the current value when contents of a batch change
  • 27. State Management Last Batch Value 15 1000 (+59) Last Batch Value 16 1059 Transactional Last Batch Previous Current 15 980 1000 (+59) Opaque replay == incorporated already? (because batch composition is the same) Last Batch Previous Current 16 1000 1059 Last Batch Previous Current 15 980 1000 (+72) Last Batch Previous Current 16 1000 1072 replay == re-incorporate Batch composition changes! (not guaranteed)
  • 28. BACK TO OUR REGULARLY SCHEDULED TALK
  • 29. Polyglot Persistence “The Right Tool for the Job” Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners.
  • 30. Back to the Pipeline KafkaDW Storm C* ES Titan SQL
  • 32. Design Principles • What we got: – At-least-once processing – Simple data flows • What we needed to account for: – Replays Idempotent Operations! Immutable Data!
  • 33. Cassandra State (v0.4.0) git@github.com:hmsonline/storm-cassandra.git {tuple}  <mapper>  (ks, cf, row, k:v[]) Storm Cassandra
  • 34. Trident Elastic Search (v0.3.1) git@github.com:hmsonline/trident-elasticsearch.git {tuple}  <mapper>  (idx, docid, k:v[]) Storm Elastic Search
  • 35. Storm Graph (v0.1.2) Coming soon to... git@github.com:hmsonline/storm-graph.git for (tuple : batch) <processor> (graph, tuple)
  • 36. Storm JDBI (v0.1.14) INTERNAL ONLY (so far) Worth releasing? {tuple}  <mapper>  (JDBC Statement)
  • 38. But... What was the average amount paid for a medical claim associated with procedure X by zip code over the last five years?
  • 39. Hadoop (<2)? Batch? Yuck. ‘Nuff Said. http://www.slideshare.net/prash1784/introduction-to-hadoop-and-pig-15036186
  • 41. Let’s Pre-Compute It! stream .groupBy(new Field(“procedure”)) .groupBy(new Field(“zip”)) .aggregate(new Field(“amount”), new Average()) D’Oh! GroupBy’s. They set data in motion!
  • 43. Why so hard? D’Oh! 19 != 9 What we don’t want: LOCKS! What’s the alternative? CONSENSUS!
  • 45. Conditional Updates “The alert reader will notice here that Paxos gives us the ability to agree on exactly one proposal. After one has been accepted, it will be returned to future leaders in the promise, and the new leader will have to re-propose it again.” http://www.datastax.com/dev/blog/lightweight-transactions-in-cassandra-2-0 UPDATE value=9 WHERE word=“fox” IF value=6
  • 46. Love CQL Conditional Updates + Batch Statements + Collections = BADASS DATA MODELS
  • 47. Announcing : Storm Cassandra CQL! git@github.com:hmsonline/storm-cassandra-cql.git {tuple}  <mapper>  (CQL Statement) Trident Batching =? CQL Batching
  • 48. Incremental State! • Collapse aggregation into the state object. – This allows the state object to aggregate with current state in a loop until success. • Uses Trident Batching to perform in-memory aggregation for the batch. for (tuple : batch) state.aggregate(tuple); while (failed?) { persisted_state = read(state) aggregate(in_memory_state, persisted_state) failed? = conditionally_update(state) }
  • 49. Partition 1 In-Memory Aggregation by Key! Key Value fox 6 brown 3 Partition 2 Key Value fox 3 lazy 72C* No More GroupBy!
  • 50. To protect against replays Use partition + batch identifier(s) in your conditional update! “BatchId + partitionIndex consistently represents the same data as long as: 1.Any repartitioning you do is deterministic (so partitionBy is, but shuffle is not) 2.You're using a spout that replays the exact same batch each time (which is true of transactional spouts but not of opaque transactional spouts)” - Nathan Marz
  • 52. Our Terminology • A cube comprises: – Dimensions (e.g. procedure, zip, time slice) – A function (e.g. count, sum) – Function fields (e.g. amount paid) – Granularity • A metric comprises: – Coordinates (e.g. vasectomy, 19460, 879123 – hour since epoch) – A value (e.g. 500 procedures) • A perspective comprises: – A range of coordinates (*, 19460, January) – An interval (day)
  • 53. Complex Event Processing • For each event, –Find relevant cubes –Adjust metrics for event –Group and aggregate metrics –Use conditional updates to incorporate metric into persistent state
  • 55. Let’s Challenge This a Bit because “additional tools and techniques” cost money and time. • Questions: – Can we solve the problem with a single tool and a single approach? – Can we re-use logic across layers? – Or better yet, can we collapse layers?
  • 56. A Traditional Interpretation Speed Layer (Storm) Batch Layer (Hadoop) Data Stream Serving Layer HBase Impala D’Oh! Two pipelines!
  • 57. Integrating Web Services • We need a web service that receives an event and provides, – an immediate acknowledgement – a high likelihood that the data is integrated very soon – a guarantee that the data will be integrated eventually • We need an architecture that provides for, – Code / Logic and approach re-use – Fault-Tolerance
  • 59. The Idea : Embedding State! Kafka DropWizard C* IncrementalCqlState aggregate(tuple) “Batch” Layer (Storm) Client
  • 60. The Sequence of Events
  • 61. The Wins • Reuse Aggregations and State Code! • To re-compute (or backfill) a dimension, simply re-queue! • Storm is the “safety” net – If a DW host fails during aggregation, Storm will fill in the gaps for all ACK’d events. • Is there an opportunity to reuse more? – BatchingStrategy & PartitionStrategy?
  • 62. In the end, all good. =)
  • 64. Thanks! Brought to you by Brian O’Neill, CTO boneill@healthmarketscience.com @boneill42 12 years together
  • 66. CassandraCqlState public void commit(Long txid) { BatchStatement batch = new BatchStatement(Type.LOGGED); batch.addAll(this.statements); clientFactory.getSession().execute(batch); } public void addStatement(Statement statement) { this.statements.add(statement); } public ResultSet execute(Statement statement){ return clientFactory.getSession().execute(statement); }
  • 67. CassandraCqlStateUpdater public void updateState(CassandraCqlState state, List<TridentTuple> tuples, TridentCollector collector) { for (TridentTuple tuple : tuples) { Statement statement = this.mapper.map(tuple); state.addStatement(statement); } }
  • 68. ExampleMapper public Statement map(List<String> keys, Number value) { Insert statement = QueryBuilder.insertInto(KEYSPACE_NAME, TABLE_NAME); statement.value(KEY_NAME, keys.get(0)); statement.value(VALUE_NAME, value); return statement; } public Statement retrieve(List<String> keys) { Select statement = QueryBuilder.select() .column(KEY_NAME).column(VALUE_NAME) .from(KEYSPACE_NAME, TABLE_NAME) .where(QueryBuilder.eq(KEY_NAME, keys.get(0))); return statement; }

Editor's Notes

  1. Tuple: set of key-value pairs (values can be serialized objects)
  2. title Distributed Counting participant A participant B participant Storage note over Storage {"fox" : 6} end note note over A count("fox", batch)=3 end note A->Storage: read("fox") note over B count("fox", batch)=10 end note Storage->A: 6 B->Storage: read("fox") Storage->B: 8 note over A add(6, 3) = 9 end note note over B add(6, 10) = 16 end note B->Storage: write(16) A->Storage: write(9) note over Storage {"fox":16} end note
  3. title Distributed Counting participant Client participant DropWizard participant Kafka participant State(1) participant C* participant Storm participant State(2) Client->DropWizard: POST(event) DropWizard->State(1): aggregate(new Tuple(event)) DropWizard->Kafka: queue(event) DropWizard->Client: 200(ACK) note over State(1) duration (30 sec.) end note State(1)->C*: state, events = read(key) note over State(1) state = aggregate (state, in_memory_state) events = join (events, in_memory_events) end note State(1)->C*: write(state, events) Kafka->Storm: dequeue(event) Storm->State(2): persisted_state, events = read(key) note over State(2) if (!contains?(event)) ... end note State(2)->C*: if !contains(ids) write(state)