Managing the
Fast Data Pipeline
from Ingest to Export
John Hugg, Ryan Betts Founding Engineer(s)
Why A Fast Frontend?
What is a fast frontend?
[Diagram: Data Sources → Ingestion (Flume, Kafka, Whatever) → BIG DATA (HDFS) → Batch (Consumers, Transformers)]
What do the green things have in common?
Movers. Not shakers.
[Diagram: the same pipeline, now with an added Ingestion Engine box labeled Fast Data]
What’s the value proposition?
Better data in HDFS: Filter, Aggregate, Enrich
Lower latency to understanding what is going on
Serve directly from the ingestion engine
Overview
• What’s in that Orange Box (Ingestion Engine)
• Kafka Patterns
• Filter, De-Dup, Aggregation, Enrich - Value Proposition #1
• Understand Faster - Value Proposition #2
• Serving Layer - Value Proposition #3
• How wrong could my answers be?
[Diagram: inside the ingestion engine. Labels: Input Data, Processing, Fast State, Direct Queries, Response Data?, HDFS]
Super Important!
Let’s Get Real!
What tools do people use?
or or…
What is Kafka?
• Almost a persistent message queue, but not quite.
• A distributed and partitioned log.
• Clients read linearly, but start wherever they want.
• Scalable. Persistent.
[Diagram: a log running from “Start of Time” to “Next Append Here”, with markers “App A read to here” and “App B read to here”]
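A minimal sketch of the log idea above, in plain Python rather than Kafka's client API. The class PartitionedLog and its methods are hypothetical names for illustration only; replication, retention, and consumer groups are not modeled.

```python
class PartitionedLog:
    """Toy in-memory stand-in for a partitioned, append-only log."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Route by key so records for one key stay ordered in one partition.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1  # (partition, offset)

    def read(self, partition, offset, max_records=10):
        # Readers move forward linearly from whatever offset they choose.
        records = self.partitions[partition][offset:offset + max_records]
        return records, offset + len(records)


# One partition keeps the demo deterministic (string hashes vary per process).
log = PartitionedLog(num_partitions=1)
for i in range(5):
    log.append("sensor-1", f"event-{i}")

# App A starts at the start of time; App B starts wherever it wants.
a_records, a_next = log.read(0, 0, max_records=2)
b_records, b_next = log.read(0, 3)
```

Each reader tracks its own position, which is why two applications can sit at different points in the same log.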
Filter / De-Dup
Aggregate
Enrich / De-Normalize
These Features = Money
Filter
• De-duplication at ingest time is an obvious value here.
Example: sensor networks
• Otherwise, people like to dump everything into HDFS and sort it out later.
• Should you do that? I don’t know.
• The easy thing is to filter out bad data that meets rejection criteria stored in the ingestion engine.
Too old. Missing values. Test data. Spam.
• With a good ingestion engine (cough), this stuff is not expensive.
Value Proposition #1
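A rough sketch of ingest-time filtering and de-duplication, assuming sensor events arrive as dictionaries. The field names, the one-hour age limit, and the (sensor_id, seq) duplicate key are all assumptions made for this example, not anything the slides specify.

```python
import time

MAX_AGE_SECONDS = 3600  # hypothetical "too old" rule
REQUIRED_FIELDS = ("sensor_id", "seq", "ts", "value")  # hypothetical schema


def make_filter(now=time.time):
    seen = set()  # (sensor_id, seq) pairs already accepted

    def accept(event):
        if any(event.get(f) is None for f in REQUIRED_FIELDS):
            return False  # missing values
        if event.get("test"):
            return False  # test data
        if now() - event["ts"] > MAX_AGE_SECONDS:
            return False  # too old
        key = (event["sensor_id"], event["seq"])
        if key in seen:
            return False  # duplicate
        seen.add(key)
        return True

    return accept


accept = make_filter(now=lambda: 10_000)  # fixed clock for the demo
events = [
    {"sensor_id": "a", "seq": 1, "ts": 9_990, "value": 20.5},
    {"sensor_id": "a", "seq": 1, "ts": 9_990, "value": 20.5},  # duplicate
    {"sensor_id": "a", "seq": 2, "ts": 100, "value": 20.7},    # too old
    {"sensor_id": "b", "seq": 1, "ts": 9_995, "value": None},  # missing value
    {"sensor_id": "b", "seq": 2, "ts": 9_999, "value": 1.0, "test": True},
    {"sensor_id": "b", "seq": 3, "ts": 9_999, "value": 3.2},
]
kept = [e for e in events if accept(e)]
```

The seen set grows without bound here; a real engine would presumably expire old keys.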
Aggregation (Counting) Example
• Raw event stream at 100k events per second.
• Send 1 aggregated row per second to Hadoop.
• Aggregates can be COUNT, MIN, MAX, AVG, SUM, MEDIAN, etc…
• Can delay sending aggregates to HDFS to allow for late-arriving events for the window.
• If this fits what you’re doing, it has the potential to speed up Impala, Hive, etc… by several orders of magnitude.
Value Proposition #1
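A small sketch of the per-second rollup described above, in plain Python. The function name and the (timestamp, value) event shape are assumptions for illustration, and late-arriving events are not handled. At the rates on the slide, each output row would stand in for roughly 100,000 raw events.

```python
from collections import defaultdict


def aggregate_per_second(events):
    """events: iterable of (timestamp_seconds, value) pairs."""
    windows = defaultdict(lambda: {"count": 0, "sum": 0.0,
                                   "min": float("inf"),
                                   "max": float("-inf")})
    for ts, value in events:
        w = windows[int(ts)]  # one-second tumbling window
        w["count"] += 1
        w["sum"] += value
        w["min"] = min(w["min"], value)
        w["max"] = max(w["max"], value)

    rows = []
    for second in sorted(windows):
        w = windows[second]
        rows.append({"second": second, **w, "avg": w["sum"] / w["count"]})
    return rows


rows = aggregate_per_second([(0.1, 5.0), (0.9, 7.0), (1.2, 1.0)])
```

Only the rows list would be exported; the raw events never need to reach HDFS.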
Enrich: Call Center Example
• Event comes in that a call center agent was connected to a caller.
• How long was the caller on hold?
• User has an SLA for hold times.
• Send to HDFS a record of the hold time, including the start, end, and duration.
Value Proposition #1
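A sketch of the call center enrichment, assuming the engine keeps a table of callers currently on hold. The event type names, field names, and the 120-second SLA are made up for this example.

```python
SLA_SECONDS = 120  # hypothetical SLA threshold
holds = {}         # caller_id -> hold start time (state kept in the engine)


def on_event(event):
    """Return an enriched record to export, or None if there is nothing yet."""
    if event["type"] == "hold_started":
        holds[event["caller_id"]] = event["ts"]
        return None
    if event["type"] == "agent_connected":
        start = holds.pop(event["caller_id"], None)
        if start is None:
            return None  # no matching hold on record
        duration = event["ts"] - start
        return {
            "caller_id": event["caller_id"],
            "hold_start": start,
            "hold_end": event["ts"],
            "duration": duration,
            "sla_met": duration <= SLA_SECONDS,
        }
    return None


first = on_event({"type": "hold_started", "caller_id": "c1", "ts": 100})
record = on_event({"type": "agent_connected", "caller_id": "c1", "ts": 160})
```

The exported record already carries start, end, and duration, so nothing downstream has to join the two raw events.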
Enrich: Finance Example
• Message from the exchange that order 21756 was executed.
• What is order 21756?
• Ingestion engine has a table of outstanding orders.
• Send to HDFS a record that the order for 100 shares of Microsoft by trader Joe using his algorithm “greedisgood” was executed. Include the timestamp the order was placed, the timestamp of execution, and the price.
Value Proposition #1
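A sketch of the same lookup-and-denormalize step for the finance example. The table layout, the message fields, and the timestamp and price values are placeholders invented for illustration.

```python
# Outstanding orders held in the ingestion engine (values are illustrative).
outstanding_orders = {
    21756: {"symbol": "MSFT", "shares": 100, "trader": "Joe",
            "algorithm": "greedisgood", "placed_ts": 1_000},
}


def on_execution(msg):
    """Join an execution message with its order and emit one wide record."""
    order = outstanding_orders.pop(msg["order_id"], None)
    if order is None:
        return None  # unknown or already executed order
    return {"order_id": msg["order_id"], **order,
            "executed_ts": msg["ts"], "price": msg["price"]}


message = {"order_id": 21756, "ts": 1_005, "price": 50.0}
record = on_execution(message)
replay = on_execution(message)  # same message again finds no order
```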
Schema-On-Read
Schema On Read
Schema On Ingest
Should I organize by color, type, size?
It matters, but don’t punt because it’s nontrivial.
All Lego pros organize pieces. HDFS is the same thing.
Don’t get me started on secondary indexes.
Schema On Read: Not Even Once
• Old Story: Dump everything into HDFS. Process it
when you read it into something-schema-y.
• Turns out this is slooooooooowwwwwww.
Might as well run Ruby.
• New Story: Organize data with types and
reasonable schema on ingest.
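One way to picture schema on ingest: parse and type each record once, when it arrives, so readers never re-parse raw text. This is a sketch under assumptions; the Sale type and the comma-separated input format are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Sale:
    sold_at: datetime
    item: str
    color: str
    price_cents: int


def parse_sale(line):
    """Parse 'epoch_seconds,item,color,price' once, at ingest time."""
    ts, item, color, price = line.strip().split(",")
    return Sale(
        sold_at=datetime.fromtimestamp(int(ts), tz=timezone.utc),
        item=item,
        color=color,
        price_cents=round(float(price) * 100),
    )


sale = parse_sale("1420070400,hat,purple,19.99\n")

# Malformed input is rejected at ingest instead of surfacing at read time.
try:
    parse_sale("not-a-timestamp,hat,purple")
    rejected = False
except ValueError:
    rejected = True
```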
Understanding
Value Proposition #2
What forms of Understanding?
• Dashboards with aggregations
• Queries that support filtering or enrichment at
ingest
• Queries that make decisions on ingest
Routing, Responding, Alerting, Transacting
• Queries that look for certain conditions
Like SQL triggers, but way more powerful
• Actual deep analytical queries*
Value Proposition #2
Examples
• Dashboards with aggregations
Show me outstanding positions by trader by algorithm on a webpage that refreshes every second.
• Queries that support filtering or enrichment at ingest
Find me all events related to this event. If it is the last event in a chain, push denormalized, enriched data to HDFS.
• Queries that make decisions on ingest
Click event arrives. Which ad did we show? Was it fraudulent? Was it a robot? Which customer account to debit?
• Queries that look for certain conditions
Is this router under attack? Or… are my SLAs not being met? If so, what’s the contractual cost?
• Actual deep analytical queries*
How many purple hats were sold on Tuesdays in 2015 when it rained?
Value Proposition #2
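A sketch of the dashboard case: an aggregate kept up to date as each event is ingested, so the once-a-second page refresh is a cheap read. This is plain Python, not VoltDB's API; the function names, the trader "Ann", and the algorithm "meanrevert" are invented for the example.

```python
from collections import defaultdict

positions = defaultdict(int)  # (trader, algorithm) -> net shares


def on_fill(trader, algorithm, shares):
    # Updated incrementally as each event is ingested.
    positions[(trader, algorithm)] += shares


def dashboard_rows():
    # What the page would read on every refresh.
    return sorted((t, a, n) for (t, a), n in positions.items() if n != 0)


on_fill("Joe", "greedisgood", 100)
on_fill("Joe", "greedisgood", -40)
on_fill("Ann", "meanrevert", 10)
rows = dashboard_rows()
```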
But How?
• Just use VoltDB (I know. I know.)
• Just write down the logic in Java code.
• Nothing is trivial, but some things are easier.
• Two main reasons VoltDB does this stuff better:
• Strong integration of state and processing (ACID txns).
• Straightforward development. Java + SQL.
• Sadly, the things that might be comparably not-terrible are not
publicly available.
Value Proposition #2
[Diagram labels: Input Data; Stored Procedure (Java / SQL); State (Logical Schema, Queryable, Concurrency, No hard stuff…); SQL access via JDBC/ODBC/REST and CLI/Drivers; HDFS, OLAP, CSV, SMS; Response Data]
What about this meaty bit here?
Stored Procedure
Java / SQL
• Fully integrated processing & state.
• State access is global.
• Standard SQL with extensions like “upsert” and JSON support.
• Fully consistent, ACID: “Program like you’re alone in the world”
• Synchronous intra-cluster HA. Disk durability & memory speed.
• Native aggregations, fully SQL-integrated as live materialized views
• Easy counts, ranks, sorts
• Leverage existing Java libraries (like HyperLogLog, but also JSON, gzip, Avro, protobufs, etc…)
Serving Layer
• You can’t easily get real time responses with high concurrency from
HDFS.
• You can’t query Kafka or Flume or anything like that.
• So this one is a no-brainer. Query the Fast Data front end.
• I work for a vendor who sells a Fast Data front end that uses SQL and
I’m going to tell you that SQL and standards like JDBC are pretty
awesome for this purpose.
• You should be skeptical when a vendor tells you their tech choices are the right ones. Except this one time. SQL.
Value Proposition #3
SQL?
• You should be skeptical when a vendor tells you their tech choices are the right ones.
• Except this one time.
• SQL is winning:
• Google has moved back to SQL.
• Facebook uses tons of SQL.
• Cassandra has CQL.
• Impala, Presto, SparkSQL - Lots of new love.
• Couchbase recently introduced N1QL and is pretending they never said anything
bad about SQL ever and actually SQL is great and you should use it.
Value Proposition #3
How wrong could this data be?
Delivery Guarantees
• At Least Once
• At Most Once
• None of the Above
• Exactly Once
At most once delivery
• Send all events once.
• On failure, you might lose some.
• How many are lost?
At least once delivery
• When the status of an event is unclear, send it
again.
• This is not always easy.
Event sources need to be robust.
Exactly once delivery
• Option 1: No side effects. Strong consistency is
used to track incrementing offsets. Processing is
atomic.
• Option 2: Use at-least-once, but make everything
idempotent.
• Some people will tell you that option 2 isn’t exactly once. This argument is exhausting. The result is the same, so who cares?
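Option 2 can be sketched as follows: deliver at least once, but remember which event ids have already been applied so a redelivery has no second effect. The class and field names are hypothetical, and the sketch assumes the id check and the update happen atomically together.

```python
class IdempotentCounter:
    def __init__(self):
        self.total = 0
        self.applied = set()  # event ids whose effect is already in total

    def apply(self, event_id, amount):
        if event_id in self.applied:
            return False  # redelivery: no second effect
        self.applied.add(event_id)
        self.total += amount
        return True


counter = IdempotentCounter()
# "e1" is delivered twice, as at-least-once delivery allows.
deliveries = [("e1", 5), ("e2", 7), ("e1", 5)]
results = [counter.apply(event_id, amount) for event_id, amount in deliveries]
```

The total ends up the same as if each event had arrived exactly once, which is the point of the bullet above.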
None of the Above
• Extremely common.
• Kafka high level consumer - offsets roughly track work, but
not exactly.
• “We use Storm in at-least once mode, unless it gets busy or
jammed, then we let it drop events to keep up.”
- Heavy Storm User
• Users who think they have at-least-once might actually have this.
Partial Failure
• This is why ACID exists, folks.
• The A in ACID means “Atomic”
• Atomic means it happens or it doesn’t.
• C, I & D don’t hurt either.
• What even are side effects?
• Come back to this.
Atomic Fail: Romeo & Juliet
ACID Transaction:
Operation 1:
Fake your death
Operation 2:
Tell Romeo
Reality is More Mundane
• Call center example: When a hold ends…
Remove outstanding hold record.
Add new completed hold record.
Mark agent as busy.
Update total hold time aggregates.
• Financial example: When order executes:
Remove from outstanding orders.
Update position on Microsoft.
Update aggregates for trader, algorithm, time window
But still costly to screw up
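A toy illustration of "it happens or it doesn't" for the hold-ending steps above. It applies every step to a scratch copy and keeps the copy only if all steps succeed. This is a single-process sketch with hypothetical names, not how a real database implements transactions.

```python
import copy


def run_atomically(state, operations):
    """Apply every operation to a scratch copy; keep it only if all succeed."""
    scratch = copy.deepcopy(state)
    for op in operations:
        op(scratch)  # an exception here abandons the scratch copy
    state.clear()
    state.update(scratch)


def end_hold(caller_id, agent_id, now):
    def op(s):
        start = s["outstanding"].pop(caller_id)         # remove outstanding hold
        s["completed"].append((caller_id, start, now))  # add completed hold
        s["busy_agents"].add(agent_id)                  # mark agent as busy
        s["total_hold_seconds"] += now - start          # update aggregates
    return op


def boom(s):
    raise RuntimeError("simulated crash mid-transaction")


state = {"outstanding": {"caller-1": 100}, "completed": [],
         "busy_agents": set(), "total_hold_seconds": 0}

try:
    run_atomically(state, [end_hold("caller-1", "agent-9", 160), boom])
except RuntimeError:
    pass
after_failure = copy.deepcopy(state)  # nothing should have changed

run_atomically(state, [end_hold("caller-1", "agent-9", 160)])
```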
Side Effects
• Clearest example is writing to Cassandra from within Storm to handle an event.
Storm can’t control Cassandra the way it needs to.
• If the Storm code fails, the Cassandra write is still there?
• If the Storm code is retried, the Cassandra write happens twice?
• If the Cassandra write fails, how does Storm retry?
• If processing a Spark Streaming micro-batch fails, it can be re-processed because of the integration between HDFS and Spark Streaming.
Spark Streaming gives up some latency and restricts state in order to avoid side effects.
Side effects happen anytime doing something in one system involves doing something in another.
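The retry question can be made concrete with a toy model: a processor that writes to a separate store, crashes after the write, and is retried. ExternalStore stands in for any second system; all names are hypothetical and no real Storm or Cassandra API is used.

```python
class ExternalStore:
    """Stand-in for a separate system the processor cannot roll back."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def handle(event, store, fail_after_write):
    store.write(event)  # the side effect lands in the other system
    if fail_after_write:
        raise RuntimeError("processor crashed after the write")


def process_with_retry(event, store, failures):
    """At-least-once: retry until the handler finishes without an error."""
    attempts = 0
    while True:
        attempts += 1
        try:
            handle(event, store, fail_after_write=attempts <= failures)
            return attempts
        except RuntimeError:
            continue


store = ExternalStore()
attempts = process_with_retry({"id": "e1"}, store, failures=1)
```

One event went in, but the store now holds it twice, because the retry could not undo the first write.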
Conclusion - Thanks
• There is a lot of awesomeness to be had with a
Fast Data frontend to Hadoop/HDFS.
• VoltDB is really good at this stuff and other
available systems are much less good.
• I’m a vendor, so you shouldn’t blindly trust me.
• So try it out. Ask questions. Engage. Explore.
http://chat.voltdb.com
rbetts@voltdb.com
@ryanbetts
jhugg@voltdb.com
@johnhugg

More Related Content

More from VoltDB

TripleLift: Preparing for a New Programmatic Ad-Tech World
TripleLift: Preparing for a New Programmatic Ad-Tech WorldTripleLift: Preparing for a New Programmatic Ad-Tech World
TripleLift: Preparing for a New Programmatic Ad-Tech WorldVoltDB
 
Understanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoTUnderstanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoTVoltDB
 
Transforming Your Business with Fast Data – Five Use Case Examples
Transforming Your Business with Fast Data – Five Use Case ExamplesTransforming Your Business with Fast Data – Five Use Case Examples
Transforming Your Business with Fast Data – Five Use Case ExamplesVoltDB
 
VoltDB and HPE Vertica Present: Building an IoT Architecture for Fast + Big Data
VoltDB and HPE Vertica Present: Building an IoT Architecture for Fast + Big DataVoltDB and HPE Vertica Present: Building an IoT Architecture for Fast + Big Data
VoltDB and HPE Vertica Present: Building an IoT Architecture for Fast + Big DataVoltDB
 
Acting on Real-time Behavior: How Peak Games Won Transactions
Acting on Real-time Behavior: How Peak Games Won TransactionsActing on Real-time Behavior: How Peak Games Won Transactions
Acting on Real-time Behavior: How Peak Games Won TransactionsVoltDB
 
Kyle Kingsbury Talks about the Jepsen Test: What VoltDB Learned About Data Ac...
Kyle Kingsbury Talks about the Jepsen Test: What VoltDB Learned About Data Ac...Kyle Kingsbury Talks about the Jepsen Test: What VoltDB Learned About Data Ac...
Kyle Kingsbury Talks about the Jepsen Test: What VoltDB Learned About Data Ac...VoltDB
 
Moving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataMoving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataVoltDB
 
Eat Your Data and Have It Too: Get the Blazing Performance of In-Memory Opera...
Eat Your Data and Have It Too: Get the Blazing Performance of In-Memory Opera...Eat Your Data and Have It Too: Get the Blazing Performance of In-Memory Opera...
Eat Your Data and Have It Too: Get the Blazing Performance of In-Memory Opera...VoltDB
 
Arguments for a Unified IoT Architecture
Arguments for a Unified IoT ArchitectureArguments for a Unified IoT Architecture
Arguments for a Unified IoT ArchitectureVoltDB
 
Why you really want SQL in a Real-Time Enterprise Environment
Why you really want SQL in a Real-Time Enterprise EnvironmentWhy you really want SQL in a Real-Time Enterprise Environment
Why you really want SQL in a Real-Time Enterprise EnvironmentVoltDB
 
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler AnswersLambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers VoltDB
 
Stored Procedure Superpowers: A Developer’s Guide
Stored Procedure Superpowers: A Developer’s GuideStored Procedure Superpowers: A Developer’s Guide
Stored Procedure Superpowers: A Developer’s GuideVoltDB
 
Fast Data Choices: 5 Strategies for Evaluating Alternative Business and Techn...
Fast Data Choices: 5 Strategies for Evaluating Alternative Business and Techn...Fast Data Choices: 5 Strategies for Evaluating Alternative Business and Techn...
Fast Data Choices: 5 Strategies for Evaluating Alternative Business and Techn...VoltDB
 
Mike Stonebraker on Designing An Architecture For Real-time Event Processing
Mike Stonebraker on Designing An Architecture For Real-time Event ProcessingMike Stonebraker on Designing An Architecture For Real-time Event Processing
Mike Stonebraker on Designing An Architecture For Real-time Event ProcessingVoltDB
 
How First to Value Beats First to Market: Case Studies of Fast Data Success
How First to Value Beats First to Market: Case Studies of Fast Data SuccessHow First to Value Beats First to Market: Case Studies of Fast Data Success
How First to Value Beats First to Market: Case Studies of Fast Data SuccessVoltDB
 
Lessons Learned: The Impact of Fast Data for Personalization
Lessons Learned: The Impact of Fast Data for PersonalizationLessons Learned: The Impact of Fast Data for Personalization
Lessons Learned: The Impact of Fast Data for PersonalizationVoltDB
 
Fast Data for Competitive Advantage: 4 Steps to Expand your Window of Opportu...
Fast Data for Competitive Advantage: 4 Steps to Expand your Window of Opportu...Fast Data for Competitive Advantage: 4 Steps to Expand your Window of Opportu...
Fast Data for Competitive Advantage: 4 Steps to Expand your Window of Opportu...VoltDB
 
Understanding the Operational Database Infrastructure for IoT and Fast Data
Understanding the Operational Database Infrastructure for IoT and Fast DataUnderstanding the Operational Database Infrastructure for IoT and Fast Data
Understanding the Operational Database Infrastructure for IoT and Fast DataVoltDB
 
Using a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsUsing a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsVoltDB
 
The Two Generals Problem
The Two Generals ProblemThe Two Generals Problem
The Two Generals ProblemVoltDB
 

More from VoltDB (20)

TripleLift: Preparing for a New Programmatic Ad-Tech World
TripleLift: Preparing for a New Programmatic Ad-Tech WorldTripleLift: Preparing for a New Programmatic Ad-Tech World
TripleLift: Preparing for a New Programmatic Ad-Tech World
 
Understanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoTUnderstanding the Top Four Use Cases for IoT
Understanding the Top Four Use Cases for IoT
 
Transforming Your Business with Fast Data – Five Use Case Examples
Transforming Your Business with Fast Data – Five Use Case ExamplesTransforming Your Business with Fast Data – Five Use Case Examples
Transforming Your Business with Fast Data – Five Use Case Examples
 
VoltDB and HPE Vertica Present: Building an IoT Architecture for Fast + Big Data
VoltDB and HPE Vertica Present: Building an IoT Architecture for Fast + Big DataVoltDB and HPE Vertica Present: Building an IoT Architecture for Fast + Big Data
VoltDB and HPE Vertica Present: Building an IoT Architecture for Fast + Big Data
 
Acting on Real-time Behavior: How Peak Games Won Transactions
Acting on Real-time Behavior: How Peak Games Won TransactionsActing on Real-time Behavior: How Peak Games Won Transactions
Acting on Real-time Behavior: How Peak Games Won Transactions
 
Kyle Kingsbury Talks about the Jepsen Test: What VoltDB Learned About Data Ac...
Kyle Kingsbury Talks about the Jepsen Test: What VoltDB Learned About Data Ac...Kyle Kingsbury Talks about the Jepsen Test: What VoltDB Learned About Data Ac...
Kyle Kingsbury Talks about the Jepsen Test: What VoltDB Learned About Data Ac...
 
Moving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time DataMoving Beyond Batch: Transactional Databases for Real-time Data
Moving Beyond Batch: Transactional Databases for Real-time Data
 
Eat Your Data and Have It Too: Get the Blazing Performance of In-Memory Opera...
Eat Your Data and Have It Too: Get the Blazing Performance of In-Memory Opera...Eat Your Data and Have It Too: Get the Blazing Performance of In-Memory Opera...
Eat Your Data and Have It Too: Get the Blazing Performance of In-Memory Opera...
 
Arguments for a Unified IoT Architecture
Arguments for a Unified IoT ArchitectureArguments for a Unified IoT Architecture
Arguments for a Unified IoT Architecture
 
Why you really want SQL in a Real-Time Enterprise Environment
Why you really want SQL in a Real-Time Enterprise EnvironmentWhy you really want SQL in a Real-Time Enterprise Environment
Why you really want SQL in a Real-Time Enterprise Environment
 
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler AnswersLambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers
Lambda-B-Gone: In-memory Case Study for Faster, Smarter and Simpler Answers
 
Stored Procedure Superpowers: A Developer’s Guide
Stored Procedure Superpowers: A Developer’s GuideStored Procedure Superpowers: A Developer’s Guide
Stored Procedure Superpowers: A Developer’s Guide
 
Fast Data Choices: 5 Strategies for Evaluating Alternative Business and Techn...
Fast Data Choices: 5 Strategies for Evaluating Alternative Business and Techn...Fast Data Choices: 5 Strategies for Evaluating Alternative Business and Techn...
Fast Data Choices: 5 Strategies for Evaluating Alternative Business and Techn...
 
Mike Stonebraker on Designing An Architecture For Real-time Event Processing
Mike Stonebraker on Designing An Architecture For Real-time Event ProcessingMike Stonebraker on Designing An Architecture For Real-time Event Processing
Mike Stonebraker on Designing An Architecture For Real-time Event Processing
 
How First to Value Beats First to Market: Case Studies of Fast Data Success
How First to Value Beats First to Market: Case Studies of Fast Data SuccessHow First to Value Beats First to Market: Case Studies of Fast Data Success
How First to Value Beats First to Market: Case Studies of Fast Data Success
 
Lessons Learned: The Impact of Fast Data for Personalization
Lessons Learned: The Impact of Fast Data for PersonalizationLessons Learned: The Impact of Fast Data for Personalization
Lessons Learned: The Impact of Fast Data for Personalization
 
Fast Data for Competitive Advantage: 4 Steps to Expand your Window of Opportu...
Fast Data for Competitive Advantage: 4 Steps to Expand your Window of Opportu...Fast Data for Competitive Advantage: 4 Steps to Expand your Window of Opportu...
Fast Data for Competitive Advantage: 4 Steps to Expand your Window of Opportu...
 
Understanding the Operational Database Infrastructure for IoT and Fast Data
Understanding the Operational Database Infrastructure for IoT and Fast DataUnderstanding the Operational Database Infrastructure for IoT and Fast Data
Understanding the Operational Database Infrastructure for IoT and Fast Data
 
Using a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming AggregationsUsing a Fast Operational Database to Build Real-time Streaming Aggregations
Using a Fast Operational Database to Build Real-time Streaming Aggregations
 
The Two Generals Problem
The Two Generals ProblemThe Two Generals Problem
The Two Generals Problem
 

Recently uploaded

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxMohammedJunaid861692
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...shivangimorya083
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxolyaivanovalion
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightDelhi Call girls
 

Recently uploaded (20)

Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptxBPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
BPAC WITH UFSBI GENERAL PRESENTATION 18_05_2017-1.pptx
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...Vip Model  Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
Vip Model Call Girls (Delhi) Karol Bagh 9711199171✔️Body to body massage wit...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
BigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptxBigBuy dropshipping via API with DroFx.pptx
BigBuy dropshipping via API with DroFx.pptx
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 

The Fast Frontend to Hadoop - Big Data LA

  • 1. Managing the Fast Data Pipeline from Ingest to Export John Hugg, Ryan Betts Founding Engineer(s)
  • 2. Why A Fast Frontend? What is a fast frontend?
  • 3. HDFS Consumers Transformers Data Source Flume Kafka Whatever Data Source Data Source BIG DATA Ingestion Batch
  • 4. HDFS Consumers Transformers Data Source Flume Kafka Whatever Data Source Data Source BIG DATA Ingestion Batch What do the green things have in common? Movers. Not shakers.
  • 5. HDFS Consumers Transformers Data Source Flume Kafka Whatever Data Source Data Source BIG DATA Ingestion Batch Ingestion Engine Fast Data
  • 6. HDFS Consumers Transformers Data Source Flume Kafka Whatever Data Source Data Source BIG DATA Ingestion Batch Ingestion Engine Fast Data What’s the value proposition? Better data in HDFS Filter, Aggregate, Enrich Lower latency to understanding what is going on Serve directly from the ingestion engine
  • 7. Overview • What’s in that Orange Box (Ingestion Engine) • Kafka Patterns • Filter, De-Dup, Aggregation, Enrich - Value Proposition #1 • Understand Faster - Value Proposition #2 • Serving Layer - Value Proposition #3 • How wrong could my answers be?
  • 8. HDFS Consumers Transformers Data Source Flume Kafka Whatever Data Source Data Source BIG DATA Ingestion Batch Ingestion Engine Fast Data
  • 10. Processing Input Data Direct Queries HDFS Response Data? Fast State Super Important!
  • 11. Processing Input Data Direct Queries HDFS Response Data? Fast State Let’s Get Real! What tools do people use? or or…
  • 12. What is Kafka? • Almost a persistent message queue, but not quite. • A distributed and partitioned log. • Clients read linearly, but start wherever they want. • Scalable. Persistent. Start of Time Next Append HereApp A read to here App B read to here
  • 13. Filter / De-Dup Aggregate Enrich / De-Normalize These Features = Money
  • 14. Filter • De-Duplification at ingest time is an obvious value here. Example: sensor networks • Otherwise, people like to dump everything into HDFS and sort it out later. • Should you do that? I don’t know. • Easy thing is to filter bad data that meets some criteria for rejection stored in the ingestion engine. Too old. Missing values. Test data. Spam. • With a good ingestion engine (cough), this stuff is not expensive. Value Proposition #1
  • 15. Aggregation (Counting) Example • Raw event stream at 100k events per second. • Send 1 aggregated row per second to Hadoop. • Aggregate can pick from COUNT, MIN, MAX, AVG, SUM, MEDIAN, etc… • Can delay sending aggregates to HDFS to allow for late-arriving events for the window. • If this fits what you’re doing, this has the potential to speed up Impala, Hive, etc… by several orders of magnitude. Value Proposition #1
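The one-row-per-second idea, including the delay for late events, might look roughly like the sketch below. It assumes non-negative event timestamps in milliseconds and a fixed lateness allowance; `SecondAggregator`, `Row`, and `flush` are hypothetical names. MEDIAN is left out because it needs the raw values rather than a running summary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

/** Toy one-second tumbling-window aggregator (hypothetical names). */
class SecondAggregator {
    /** One output row per one-second window. */
    static final class Row {
        final long windowStartMillis;
        long count;
        double sum;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;

        Row(long windowStartMillis) {
            this.windowStartMillis = windowStartMillis;
        }

        double avg() {
            return sum / count;
        }
    }

    private final TreeMap<Long, Row> open = new TreeMap<>();
    private final long allowedLatenessMillis;

    SecondAggregator(long allowedLatenessMillis) {
        this.allowedLatenessMillis = allowedLatenessMillis;
    }

    /** Folds one raw event into the window that contains its timestamp. */
    void add(long eventTimeMillis, double value) {
        long windowStart = (eventTimeMillis / 1000) * 1000;
        Row row = open.computeIfAbsent(windowStart, start -> new Row(start));
        row.count++;
        row.sum += value;
        row.min = Math.min(row.min, value);
        row.max = Math.max(row.max, value);
    }

    /** Removes and returns windows whose end plus the lateness allowance has passed. */
    List<Row> flush(long nowMillis) {
        List<Row> ready = new ArrayList<>();
        while (!open.isEmpty()
                && open.firstKey() + 1000 + allowedLatenessMillis <= nowMillis) {
            ready.add(open.pollFirstEntry().getValue());
        }
        return ready;
    }
}
```

At 100k events per second, each flushed `Row` replaces roughly 100k raw rows downstream. An event that arrives after its window was flushed would open a second row for the same second in this sketch; a real pipeline has to decide whether to drop or merge such stragglers.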
  • 16. Enrich: Call Center Example • Event comes in that a call center agent was connected to a caller. • How long was the caller on hold? • User has an SLA for hold times. • Send to HDFS a record of the hold time, including the start, end, and duration. Value Proposition #1
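As a rough illustration of this enrichment, the sketch below keeps outstanding holds in memory and emits a complete record when the agent connects. It assumes a separate "hold started" event exists and that calls carry an id; both are assumptions, as are the names `HoldTracker` and `HoldRecord`.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy hold-time enrichment (hypothetical event names and fields). */
class HoldTracker {
    /** The enriched record that would be sent on to HDFS. */
    static final class HoldRecord {
        final String callId;
        final long startMillis;
        final long endMillis;
        final long durationMillis;

        HoldRecord(String callId, long startMillis, long endMillis) {
            this.callId = callId;
            this.startMillis = startMillis;
            this.endMillis = endMillis;
            this.durationMillis = endMillis - startMillis;
        }
    }

    private final Map<String, Long> holdStartByCall = new HashMap<>();

    /** Remembers when the caller was placed on hold. */
    void onHoldStarted(String callId, long timeMillis) {
        holdStartByCall.put(callId, timeMillis);
    }

    /** Returns the enriched record, or null if no hold start was seen for this call. */
    HoldRecord onAgentConnected(String callId, long timeMillis) {
        Long start = holdStartByCall.remove(callId);
        if (start == null) {
            return null;
        }
        return new HoldRecord(callId, start, timeMillis);
    }
}
```

The record that reaches HDFS already contains start, end, and duration, so an SLA report does not need to join two event streams at query time.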
  • 17. Enrich: Finance Example • Message from the exchange that order 21756 was executed. • What is order 21756? • Ingestion engine has a table of outstanding orders. • Send to HDFS a record that the order for 100 shares of Microsoft by trader Joe using his algorithm “greedisgood” was executed. Include the timestamp the order was placed, the timestamp of execution and the price. Value Proposition #1
  • 20. Schema On Ingest Should I organize by color, type, size? It matters, but don’t punt because it’s nontrivial. All Lego pros organize pieces. HDFS is the same thing. Don’t get me started on secondary indexes.
  • 21. Schema On Read: Not Even Once • Old Story: Dump everything into HDFS. Process it when you read it into something-schema-y. • Turns out this is slooooooooowwwwwww. Might as well run Ruby. • New Story: Organize data with types and reasonable schema on ingest.
  • 23. What forms of Understanding? • Dashboards with aggregations • Queries that support filtering or enrichment at ingest • Queries that make decisions on ingest Routing, Responding, Alerting, Transacting • Queries that look for certain conditions Like SQL triggers, but way more powerful • Actual deep analytical queries* Value Proposition #2
  • 24. Examples Dashboards with aggregations Show me outstanding positions by trader by algorithm on a webpage that refreshes every second Queries that support filtering or enrichment at ingest Find me all events related to this event If last event in a chain, push denormalized, enriched data to HDFS Queries that make decisions on ingest Click event arrives. Which ad did we show? Was it fraudulent? Was it a robot? Which customer account to debit? Queries that look for certain conditions Is this router under attack? or… Are my SLAs not being met? If so, what’s the contractual cost? Actual deep analytical queries* How many purple hats were sold on Tuesdays in 2015 when it rained? Value Proposition #2
  • 25. But How? • Just use VoltDB (I know. I know.) • Just write down the logic in Java code. • Nothing is trivial, but some things are easier. • Two main reasons VoltDB does this stuff better: • Strong integration of state and processing (ACID txns). • Straightforward development. Java + SQL. • Sadly, the things that might be comparably not-terrible are not publicly available. Value Proposition #2
  • 26. • Logical Schema • Queryable • Concurrency • No hard stuff… Stored Procedure Java / SQL Input Data SQL JDBC/ODBC/Rest CLI/Drivers HDFS OLAP CSV SMS Response Data State
  • 27. • Logical Schema • Queryable • Concurrency • No hard stuff… What about this meaty bit here? Stored Procedure Java / SQL • Fully integrated processing & state. • State access is global. • Standard SQL with extensions like “upsert” and JSON support. • Fully consistent, ACID “Program like you’re alone in the world” • Synchronous intra-cluster HA. Disk durability & memory speed.
  • 28. • Logical Schema • Queryable • Concurrency • No hard stuff… What about this meaty bit here? Stored Procedure Java / SQL • Native Aggregations Fully SQL-Integrated as Live Materialized Views • Easy counts, ranks, sorts • Leverage existing Java libraries (like HyperLogLog, but also JSON, gzip, Avro, protobufs, etc…)
  • 29. Serving Layer • You can’t easily get real time responses with high concurrency from HDFS. • You can’t query Kafka or Flume or anything like that. • So this one is a no-brainer. Query the Fast Data front end. • I work for a vendor who sells a Fast Data front end that uses SQL and I’m going to tell you that SQL and standards like JDBC are pretty awesome for this purpose. • You should be skeptical when a vendor tells you their tech choices are the right ones. Except this one time. SQL. Value Proposition #3
  • 30. SQL? • You should be skeptical when a vendor tells you their tech choices are the right ones. • Except this one time. • SQL is winning: • Google has moved back to SQL. • Facebook uses tons of SQL. • Cassandra has CQL. • Impala, Presto, SparkSQL - Lots of new love. • Couchbase recently introduced N1QL and is pretending they never said anything bad about SQL ever and actually SQL is great and you should use it. Value Proposition #3
  • 31. How wrong could this data be?
  • 32. Delivery Guarantees • At Least Once • At Most Once • None of the Above • Exactly Once
  • 33. At most once delivery • Send all events once. • On failure, you might lose some. • How many are lost?
  • 34. At least once delivery • When the status of an event is unclear, send it again. • This is not always easy. Event sources need to be robust.
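A small sketch of the "send it again" rule, assuming the transport reports success or failure through a boolean acknowledgement. `AtLeastOnceSender` and `trySend` are hypothetical names. The retry loop is the whole technique; the interesting consequence is that a lost acknowledgement causes a duplicate delivery.

```java
import java.util.function.Predicate;

/** Toy at-least-once sender (hypothetical names). */
class AtLeastOnceSender {
    /**
     * Calls trySend until it reports an acknowledgement or attempts run out.
     * Returns the number of attempts used, or -1 if the event was never acknowledged.
     */
    static <T> int send(T event, Predicate<T> trySend, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (trySend.test(event)) {
                return attempt;
            }
        }
        return -1;
    }
}
```

If the receiver stores the event but the acknowledgement is lost, `trySend` returns false and the sender delivers the same event again. The receiver therefore sees it at least once, possibly more, and the source must keep the event around until it is acknowledged.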
  • 35. Exactly once delivery • Option 1: No side effects. Strong consistency is used to track incrementing offsets. Processing is atomic. • Option 2: Use at-least-once, but make everything idempotent. • Some people will tell you that option 2 isn’t exactly once. This argument is exhausting. The result is the same, so who cares?
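One way to picture both options together: deliveries may repeat, but the state change and the offset bookkeeping happen as a single step, so a repeat has no effect. The sketch below is a toy under stated assumptions: offsets increase within each partition, and a Java lock stands in for the transaction a real engine would provide. `IdempotentCounter` is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy idempotent consumer that tracks offsets alongside its state (hypothetical names). */
class IdempotentCounter {
    private final Map<Integer, Long> lastAppliedOffset = new HashMap<>(); // per partition
    private long total;

    /** Applies the event unless its offset was already applied; returns true if applied. */
    synchronized boolean apply(int partition, long offset, long amount) {
        long last = lastAppliedOffset.getOrDefault(partition, -1L);
        if (offset <= last) {
            return false;                          // redelivery: already counted
        }
        total += amount;                           // state change...
        lastAppliedOffset.put(partition, offset);  // ...and offset move together
        return true;
    }

    synchronized long total() {
        return total;
    }
}
```

Replaying offset 1 of partition 0 a second time leaves `total()` unchanged, which is the "result is the same" point from the slide.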
  • 36. None of the Above • Extremely common. • Kafka high level consumer - offsets roughly track work, but not exactly. • “We use Storm in at-least once mode, unless it gets busy or jammed, then we let it drop events to keep up.” - Heavy Storm User • Users who think they have at least once, might have this.
  • 37. Partial Failure • This is why ACID exists, folks. • The A in ACID means “Atomic”. • Atomic means it happens or it doesn’t. • C, I & D don’t hurt either. • What even are side effects? • Come back to this.
  • 38. Atomic Fail: Romeo & Juliet ACID Transaction: Operation 1: Fake your death Operation 2: Tell Romeo
  • 39. Reality is More Mundane • Call center example: When a hold ends… Remove outstanding hold record. Add new completed hold record. Mark agent as busy. Update total hold time aggregates. • Financial example: When order executes: Remove from outstanding orders. Update position on Microsoft. Update aggregates for trader, algorithm, time window But still costly to screw up
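The financial example can be sketched as an all-or-nothing update: every step runs against private copies, and the live state is replaced only if all steps succeed. This is a toy illustration of atomicity, not how any particular database implements it; `OrderBook`, `Order`, and the accessor names are hypothetical, and copying whole maps would be far too slow for real tables.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy all-or-nothing order execution (hypothetical names). */
class OrderBook {
    static final class Order {
        final String symbol;
        final String trader;
        final long shares;

        Order(String symbol, String trader, long shares) {
            this.symbol = symbol;
            this.trader = trader;
            this.shares = shares;
        }
    }

    private Map<Long, Order> outstanding = new HashMap<>();
    private Map<String, Long> positionBySymbol = new HashMap<>();
    private Map<String, Long> sharesByTrader = new HashMap<>();

    synchronized void place(long orderId, Order order) {
        outstanding.put(orderId, order);
    }

    /** Applies all three updates or none of them. */
    synchronized void execute(long orderId) {
        // Work on copies so a failure part-way leaves the live state untouched.
        Map<Long, Order> newOutstanding = new HashMap<>(outstanding);
        Map<String, Long> newPositions = new HashMap<>(positionBySymbol);
        Map<String, Long> newByTrader = new HashMap<>(sharesByTrader);

        Order order = newOutstanding.remove(orderId);
        if (order == null) {
            throw new IllegalStateException("unknown order " + orderId);
        }
        // addExact throws ArithmeticException on overflow, which aborts the whole update.
        newPositions.merge(order.symbol, order.shares, Math::addExact);
        newByTrader.merge(order.trader, order.shares, Math::addExact);

        // Commit point: nothing above modified the live maps.
        outstanding = newOutstanding;
        positionBySymbol = newPositions;
        sharesByTrader = newByTrader;
    }

    synchronized int outstandingCount() {
        return outstanding.size();
    }

    synchronized long position(String symbol) {
        return positionBySymbol.getOrDefault(symbol, 0L);
    }

    synchronized long traderShares(String trader) {
        return sharesByTrader.getOrDefault(trader, 0L);
    }
}
```

If the position update fails after the order was removed from the copy, the live table still lists the order as outstanding, so the event can be retried without leaving the aggregates half-updated.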
  • 40. Side Effects • Clearest example is writing to Cassandra from within Storm to handle an event. Storm can’t control Cassandra the way it needs to. • If the Storm code fails, the Cassandra write is still there? • If the Storm code is retried, the Cassandra write happens twice? • If the Cassandra write fails, how does Storm retry? • If processing a Spark Streaming Micro-batch fails, then it can be re-processed, because of the integration between HDFS and SS. Spark Streaming trades latency and makes state restrictions to avoid side effects. Anytime doing something in one system involves doing something in another.
  • 41. Conclusion - Thanks • There is a lot of awesomeness to be had with a Fast Data frontend to Hadoop/HDFS. • VoltDB is really good at this stuff and other available systems are much less good. • I’m a vendor, so you shouldn’t blindly trust me. • So try it out. Ask questions. Engage. Explore. http://chat.voltdb.com rbetts@voltdb.com @ryanbetts jhugg@voltdb.com @johnhugg