Eric Frenkiel, MemSQL CEO and co-founder
August 11, 2015 • San Diego, CA
Real-Time Data Pipelines with Kafka, Spark, and Operational Databases
What’s In Store
MemSQL and a fresh look at Lambda architectures
Building real-time data pipelines for immediate impact
One architecture for many applications
MemSQL at a Glance
• Enable every company to be a real-time enterprise
• Founded 2011, based in San Francisco
• Founders are ex-Facebook, SQL Server engineers
• Deliver a database technology for modern architecture
Enterprise Focus
The Real-Time Database for Transactions and Analytics
In-Memory Distributed Relational
Data Center Software Cloud
Speed
Serving
Batch Fast Updates
Unified queries, full SQL
Fast Appends
A Fresh Look at Lambda Architectures
Comprehensive Architecture
Transactions
Comprehensive Architecture
Real Time
Speed/Streaming Layer
Fast Updates
Rowstore
Transactions
Comprehensive Architecture
Real Time
Speed/Streaming Layer
Fast Updates
Rowstore
Analytics
Transactions
Comprehensive Architecture
Real Time
Speed/Streaming Layer
Fast Updates
Rowstore
Historical
Batch Layer
Fast Appends
Columnstore
Analytics
Transactions
Comprehensive Architecture
Real Time
Speed/Streaming Layer
Fast Updates
Rowstore
Historical
Batch Layer
Fast Appends
Columnstore
Analytics
Transactions
Execution engine that spans the data spectrum
Building Real-Time Data Pipelines for Immediate Impact
By 2020, HP predicts that over a trillion sensors will be online
“The Internet of Things Will Drastically Change Our Future” – Datafloq
Going Real-Time is the Next Phase for Big Data
More Devices
More Interconnectivity
More User Demand
…and companies are at risk of being left behind
Expensive
Not scalable
Batch only
SAN-burdened
1%
Success will be driven by real-time analytic applications.
Designing the Ideal Real-Time Pipeline
Message Queue Transformation Speed/Serving Layer
End-to-End Data Pipeline Under One Second
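The three stages above can be sketched in a few lines of Python. This is a toy, single-process model, not the Kafka, Spark, or MemSQL APIs: `message_queue`, `transform`, `serving_layer`, and the derived `kilowatts` field are all hypothetical names chosen for illustration. The sketch only shows the shape of the flow and how end-to-end latency could be measured against the one-second target.

```python
import queue
import time

# Hypothetical stand-ins: a Queue for the message queue, a function for the
# transformation step, and a list for the speed/serving layer.
message_queue = queue.Queue()
serving_layer = []

def transform(event):
    """Transformation step: copy the event and attach a derived field."""
    enriched = dict(event)
    enriched["kilowatts"] = event["watts"] / 1000.0
    return enriched

def run_pipeline():
    """Drain the queue, transform each event, and record its latency."""
    while not message_queue.empty():
        event = message_queue.get()
        row = transform(event)
        # End-to-end latency: time from enqueue to arrival in the serving layer.
        row["latency_s"] = time.monotonic() - event["enqueued_at"]
        serving_layer.append(row)

message_queue.put({"house_id": 329280, "device_id": 23, "watts": 60,
                   "enqueued_at": time.monotonic()})
run_pipeline()
print(serving_layer[0]["kilowatts"], serving_layer[0]["latency_s"] < 1.0)
```

In a real deployment each stage runs on its own cluster, so the measured latency would also include network hops and batching intervals.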
Kafka
• A high-throughput distributed messaging system
• Publish and subscribe to Kafka “topics”
• Centralized data transport for the organization
Spark
• In-memory execution engine
• High level operators for procedural and programmatic analytics
• Faster than MapReduce
MemSQL
• In-memory, distributed database
• Full transactions and complete durability
• Enable real-time, performant applications
Use Spark and Operational Databases Together
                        Spark                 Operational Databases
Interface               Programmatic          Declarative
Execution Environment   Job Scheduler         SQL Engine and Query Optimizer
Persistent Storage      Use another system    Built-in
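The interface row of the comparison above can be illustrated with one aggregation written both ways. This is a sketch under stated assumptions: plain Python stands in for the programmatic style, and the standard-library `sqlite3` module stands in for a SQL engine. The `readings` data, including the `hvac` device type, is made up for illustration.

```python
import sqlite3
from collections import defaultdict

# Hypothetical sample data: (device_type, watts)
readings = [("kitchen_appliance", 60), ("hvac", 1200), ("kitchen_appliance", 40)]

# Programmatic style: the code spells out how to compute the totals.
totals = defaultdict(int)
for device_type, watts in readings:
    totals[device_type] += watts
programmatic = dict(totals)

# Declarative style: the query states what is wanted and the SQL engine
# decides how to compute it. sqlite3 is only a stand-in engine here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (device_type TEXT, watts INTEGER)")
conn.executemany("INSERT INTO readings VALUES (?, ?)", readings)
declarative = dict(conn.execute(
    "SELECT device_type, SUM(watts) FROM readings GROUP BY device_type"
).fetchall())
conn.close()

print(programmatic == declarative)
```

Both paths produce the same totals; the difference is who owns the execution plan, which is why the two systems complement each other.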
Subscribing to Kafka
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
0111001010101111101111100000001010
111100001110101100000010010010111…
Publish to Kafka Topic
0111001010101111101111100000001010
111100001110101100000010010010111…
1110010101000101010001010100010111
111010100011110101100011010101000…
0101111000011100101010111110001111
011010111100000000101110101100000…
Event added to message queue
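The publish step above can be sketched as serializing the event to bytes and appending it to a topic. This is a toy in-memory model, not the Kafka client API: the `topics` dictionary, the `publish` helper, the topic name `sensor_readings`, and the choice of JSON as the wire format are assumptions for illustration. The field names are taken from the table shown later in the deck.

```python
import json
from collections import defaultdict

# Toy stand-in for a message queue: each topic is an append-only list of bytes.
topics = defaultdict(list)

def publish(topic, event):
    """Serialize the event to bytes, append it to the topic, return its offset."""
    payload = json.dumps(event).encode("utf-8")
    topics[topic].append(payload)
    return len(topics[topic]) - 1

# The sample event from the slide: (time, house_id, device_id, watts)
event = {"time": "2015-07-06T16:43:40.33Z", "house_id": 329280,
         "device_id": 23, "watts": 60}
offset = publish("sensor_readings", event)
print(offset, topics["sensor_readings"][offset])
```

Subscribers would read from the same topic by offset, which is what lets several consumers process the same stream independently.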
Enrich and Transform the Data
Spark polling Kafka for new messages
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
(2015-07-06T16:43:40.33Z, 329280, 94110, 23, ‘kitchen_appliance’, 60)
Deserialization
Enrichment
0111001010101111101111100000001010
111100001110101100000010010010111…
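Deserialization and enrichment, as shown above, might look like the following sketch. It is written in plain Python rather than Spark, and it assumes JSON-encoded messages and two hypothetical lookup tables (`ZIP_BY_HOUSE` and `DEVICE_TYPE_BY_ID`) that supply the zip code and device type added to the sample event.

```python
import json

# Hypothetical reference data used to enrich incoming events.
ZIP_BY_HOUSE = {329280: 94110}
DEVICE_TYPE_BY_ID = {23: "kitchen_appliance"}

def deserialize(payload):
    """Turn the raw bytes from the message queue back into a dict."""
    return json.loads(payload.decode("utf-8"))

def enrich(event):
    """Add zip and device_type, returning a row in table-column order."""
    return (
        event["time"],
        event["house_id"],
        ZIP_BY_HOUSE[event["house_id"]],
        event["device_id"],
        DEVICE_TYPE_BY_ID[event["device_id"]],
        event["watts"],
    )

raw = json.dumps({"time": "2015-07-06T16:43:40.33Z", "house_id": 329280,
                  "device_id": 23, "watts": 60}).encode("utf-8")
row = enrich(deserialize(raw))
print(row)
```

In a Spark job the same two functions would typically be applied as map steps over each batch of messages.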
Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...
time                      house_id  zip    device_id  device_type          watts
2015-07-06T16:43:40.33Z   329280    94110  23         ‘kitchen_appliance’  60
…                         …         …      …          …                    …
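The persist step can be sketched as a parameterized insert of the enriched row. The slide elides the actual statement, so the column list and types below are assumptions based on the table above. The standard-library `sqlite3` module is used only as a stand-in so the example runs anywhere; it is not the MemSQL Spark connector or `saveToMemSQL()`.

```python
import sqlite3

# sqlite3 is a stand-in database; column types are assumed, not from the deck.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memcity_table (
        time TEXT, house_id INTEGER, zip INTEGER,
        device_id INTEGER, device_type TEXT, watts INTEGER
    )
""")

row = ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60)

# The 'with' block commits the transaction on success and rolls back on error.
with conn:
    conn.execute(
        "INSERT INTO memcity_table "
        "(time, house_id, zip, device_id, device_type, watts) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        row,
    )

stored = conn.execute("SELECT * FROM memcity_table").fetchall()
print(stored)
```

Parameter placeholders keep the values out of the SQL string, which matters when rows come from an external stream.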
Go to Production
Compress development timelines
SELECT ... FROM memcity_table ...
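Once rows are persisted, an application can query them with ordinary SQL. The slide elides the query itself, so the aggregation below is only one plausible example. It uses `sqlite3` as a stand-in engine, and every row other than the first is hypothetical sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE memcity_table (
        time TEXT, house_id INTEGER, zip INTEGER,
        device_id INTEGER, device_type TEXT, watts INTEGER
    )
""")
conn.executemany(
    "INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
        # Hypothetical rows added for illustration only.
        ("2015-07-06T16:43:41.10Z", 329280, 94110, 7, "lighting", 15),
        ("2015-07-06T16:43:41.52Z", 411002, 94103, 23, "kitchen_appliance", 90),
    ],
)

# Total power draw per zip code across all stored readings.
watts_by_zip = conn.execute(
    "SELECT zip, SUM(watts) AS total_watts "
    "FROM memcity_table GROUP BY zip ORDER BY zip"
).fetchall()
conn.close()
print(watts_by_zip)
```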
One Architecture for Many Applications
Lambda Applies to Real-Time Data Pipelines
Message Queue
Batch Inputs
Database
Transformation
Application
Kafka, Spark, and MemSQL Make it Simple
Batch Inputs
Application
Monitoring real-time Xfinity programming and video health
• Collect streaming data at scale (hundreds of MemSQL machines)
• Proactively diagnose issues
• Query ad-hoc and in real-time with full SQL
From 30 minutes to less than 1 second
Real-time Analytics
Real-Time Trend Analytics
Massive Ingest and Concurrent Analytics
• Instant accuracy to the latest repin
• Build real-time analytic applications
Real-time analytics
Watch the Pinterest Demo Video here:
https://youtu.be/KXelkQFVz4E
Real-Time Segmentation
Using Real-Time for Personalization
Ad Servers
EC2
Real-time analytics
PostgreSQL
Legacy reports
Monitoring S3 (replay)
HDFS
Data Science
Vertica
Operational Data Store (ODS)
Star Schema MicroStrategy
• Reach overlap and ad optimization
• Over 60,000 queries per second
• Millisecond response times
Thank You!
Visit MemSQL at Booth #518
Real-Time Demos, T-Shirt Giveaway, Games


Editor's Notes

  • #14 Sensors are being integrated into our cars, our phones, and our medical devices; trillions of sensors will affect many facets of our lives. “HP expects that by 2020 a trillion sensors are needed in the world, the equivalent of 150 sensors per human. Sensors will end-up in anything imaginable” (https://datafloq.com/read/internet-of-things-with-trillions-of-sensors-will-/218). Gartner: 4.9 billion connected things in use in 2015, rising to 25 billion in 2020 (http://www.gartner.com/newsroom/id/2905717). HP’s Peter Hartwell: “one trillion nanoscale sensors and actuators will need the equivalent of 1000 internets: the next huge demand for computing!”