Real-Time Data Pipelines with Kafka, Spark, and Operational Databases

Eric Frenkiel, MemSQL CEO and co-founder
August 11, 2015 • San Diego, CA

What’s In Store
• MemSQL and a fresh look at Lambda architectures
• Building real-time data pipelines for immediate impact
• One architecture for many applications

MemSQL at a Glance
• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture
Enterprise Focus

The Real-Time Database for Transactions and Analytics
In-Memory, Distributed, Relational
Data Center, Software, Cloud

A Fresh Look at Lambda Architectures
[Diagram labels: Speed, Serving, Batch, Fast Updates, Fast Appends]
Unified queries, full SQL

Comprehensive Architecture
• Real Time: Speed/Streaming Layer, Fast Updates, Rowstore
• Historical: Batch Layer, Fast Appends, Columnstore
• Transactions and Analytics
• Execution engine that spans the data spectrum

Building Real-Time Data Pipelines for Immediate Impact

By 2020, HP predicts that over a trillion sensors will be online
“The Internet of Things Will Drastically Change Our Future” – Datafloq

Going Real-Time is the Next Phase for Big Data
• More Devices
• More Interconnectivity
• More User Demand
…and companies are at risk of being left behind

Expensive
Not scalable
Batch only
SAN-burdened
1%

Success will be driven by real-time analytic applications.

Designing the Ideal Real-Time Pipeline
Message Queue → Transformation → Speed/Serving Layer
End-to-End Data Pipeline Under One Second

Kafka
• A high-throughput distributed messaging system
• Publish and subscribe to Kafka “topics”
• Centralized data transport for the organization

Spark
• In-memory execution engine
• High-level operators for procedural and programmatic analytics
• Faster than MapReduce

MemSQL
• In-memory, distributed database
• Full transactions and complete durability
• Enable real-time, performant applications

Use Spark and Operational Databases Together
                        Spark                Operational Databases
Interface               Programmatic         Declarative
Execution Environment   Job Scheduler        SQL Engine and Query Optimizer
Persistent Storage      Use another system   Built-in
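The programmatic versus declarative split can be illustrated with a small sketch. It is not Spark or MemSQL code: a plain Python loop stands in for the programmatic interface, the standard-library sqlite3 module stands in for a SQL engine, and the `readings` data is made up for illustration.

```python
import sqlite3

# Made-up sample data: (device_type, watts)
readings = [("kitchen_appliance", 60), ("lighting", 15), ("kitchen_appliance", 40)]

# Programmatic style: the caller spells out how the aggregation is computed.
totals = {}
for device_type, watts in readings:
    totals[device_type] = totals.get(device_type, 0) + watts

# Declarative style: the caller states what is wanted and the SQL engine plans it.
conn = sqlite3.connect(":memory:")  # sqlite3 is only a stand-in for an operational database
conn.execute("CREATE TABLE readings (device_type TEXT, watts INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
sql_totals = dict(
    conn.execute("SELECT device_type, SUM(watts) FROM readings GROUP BY device_type")
)
```

Both paths yield the same totals; the difference is who decides the execution strategy.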

Subscribing to Kafka
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
Publish to Kafka Topic
0111001010101111101111100000001010
111100001110101100000010010010111…
[Diagram: serialized events queued in the topic]
Event added to message queue
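The publish step can be sketched without a real Kafka client. In the sketch below, `InMemoryBroker`, the topic name `memcity_events`, and the JSON encoding are all assumptions made for illustration; the slide does not say which serialization format is used. The field order (time, house_id, device_id, watts) is inferred from the table columns shown later in the deck.

```python
import json
from collections import defaultdict


class InMemoryBroker:
    """Hypothetical stand-in for Kafka: each topic is an append-only list of byte messages."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, payload):
        """Append a message to the topic and return its offset."""
        self.topics[topic].append(payload)
        return len(self.topics[topic]) - 1

    def poll(self, topic, offset):
        """Return every message at or after the given offset."""
        return self.topics[topic][offset:]


def serialize_event(time, house_id, device_id, watts):
    # JSON is an assumed wire format, chosen only to keep the sketch readable.
    return json.dumps([time, house_id, device_id, watts]).encode("utf-8")


broker = InMemoryBroker()
payload = serialize_event("2015-07-06T16:43:40.33Z", 329280, 23, 60)
offset = broker.publish("memcity_events", payload)
```

A subscriber that remembers the last offset it read can call `poll` to pick up only new messages, which mirrors the publish and subscribe model described for Kafka topics.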

Enrich and Transform the Data
Spark polling Kafka for new messages
0111001010101111101111100000001010
111100001110101100000010010010111…
Deserialization: (2015-07-06T16:43:40.33Z, 329280, 23, 60)
Enrichment: (2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
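A minimal sketch of the deserialize and enrich steps, in plain Python rather than Spark. The JSON wire format and the lookup tables `HOUSE_ZIP` and `DEVICE_TYPE` are assumptions made for illustration; the deck does not say where the zip code and device type come from.

```python
import json

# Hypothetical reference data; in practice this might live in a dimension table.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {(329280, 23): "kitchen_appliance"}


def deserialize(payload):
    """Decode raw bytes into (time, house_id, device_id, watts)."""
    time, house_id, device_id, watts = json.loads(payload.decode("utf-8"))
    return time, house_id, device_id, watts


def enrich(event):
    """Add zip and device_type, matching the enriched tuple on the slide."""
    time, house_id, device_id, watts = event
    zip_code = HOUSE_ZIP[house_id]
    device_type = DEVICE_TYPE[(house_id, device_id)]
    return time, house_id, zip_code, device_id, device_type, watts


raw = b'["2015-07-06T16:43:40.33Z", 329280, 23, 60]'
enriched = enrich(deserialize(raw))
```

In a Spark job the same two functions could be applied per record with a map over each polled batch, though that wiring is not shown here.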

Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...

time                      house_id   zip     device_id   device_type           watts
2015-07-06T16:43:40.33Z   329280     94110   23          ‘kitchen_appliance’   60
…                         …          …       …           …                     …
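The persist step amounts to a batched INSERT into `memcity_table`. The sketch below uses Python's standard-library sqlite3 as a stand-in for MemSQL, so it does not show `saveToMemSQL()` itself; the column types and the `save_rows` helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a MemSQL connection
conn.execute(
    """CREATE TABLE memcity_table (
           time TEXT, house_id INTEGER, zip INTEGER,
           device_id INTEGER, device_type TEXT, watts INTEGER)"""
)


def save_rows(conn, rows):
    """Hypothetical helper: insert a batch of enriched events in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO memcity_table "
            "(time, house_id, zip, device_id, device_type, watts) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            rows,
        )


rows = [("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60)]
save_rows(conn, rows)
stored = conn.execute("SELECT * FROM memcity_table").fetchall()
```

Writing each batch inside a single transaction keeps a partially written batch from becoming visible to queries.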

Go to Production
Compress development timelines
SELECT ... FROM memcity_table ...
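The slide leaves the query itself elided. As one hypothetical example of what an application might ask, the sketch below fetches the latest reading for a zip code. It uses sqlite3 as a stand-in for MemSQL, and every row except the first is made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a MemSQL connection
conn.execute(
    """CREATE TABLE memcity_table (
           time TEXT, house_id INTEGER, zip INTEGER,
           device_id INTEGER, device_type TEXT, watts INTEGER)"""
)
sample = [
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
    ("2015-07-06T16:43:41.33Z", 329280, 94110, 23, "kitchen_appliance", 40),  # made up
    ("2015-07-06T16:43:41.50Z", 100001, 94107, 7, "lighting", 15),  # made up
]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", sample)

# Hypothetical application query: the most recent reading in one zip code.
# Timestamps in this fixed ISO 8601 layout sort correctly as text.
latest = conn.execute(
    "SELECT house_id, device_type, watts FROM memcity_table "
    "WHERE zip = ? ORDER BY time DESC LIMIT 1",
    (94110,),
).fetchone()
```

Because freshly inserted rows are immediately queryable with ordinary SQL, the application needs no separate serving store.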

One Architecture for Many Applications

Lambda Applies to Real-Time Data Pipelines
[Diagram labels: Inputs, Message Queue, Batch, Transformation, Database, Application]

Kafka, Spark, and MemSQL Make it Simple
[Diagram labels: Inputs, Batch, Application]

Monitoring real-time Xfinity programming and video health

• Collect streaming data at scale (hundreds of MemSQL machines)
• Proactively diagnose issues
• Query ad hoc and in real time with full SQL
From 30 minutes to less than 1 second
Real-time Analytics

Massive Ingest and Concurrent Analytics
• Instant accuracy to the latest repin
• Build real-time analytic applications
Real-time analytics

Watch the Pinterest Demo Video here:
https://youtu.be/KXelkQFVz4E

Using Real-Time for Personalization
[Diagram labels: Ad Servers, EC2, Real-time analytics, PostgreSQL, Legacy reports, Monitoring, S3 (replay), HDFS, Data Science, Vertica, Operational Data Store (ODS), Star Schema, MicroStrategy]
• Reach overlap and ad optimization
• Over 60,000 queries per second
• Millisecond response times

Thank You!
Visit MemSQL at Booth #518
Real-Time Demos, T-Shirt Giveaway, Games
