Modeling the Smart and Connected
City of the Future with Kafka and Spark
Eric Frenkiel, CEO & Co-Founder, MemSQL
@ericfrenkiel
MAKE DATA WORK
DECEMBER 1-3, 2015  SINGAPORE
2
MemSQL at a Glance
Enterprise Focused
• Our Mission: Make every company a real-time enterprise.
• Real-time database for transactions and analytics
• Founded in 2011, based in San Francisco
• Founders are former Facebook, SQL Server database engineers
• $50 million in funding to date
What does a Smart City Look Like?
4
Our Conception
5
Our Reality
6
3.9b people live in cities today
7
By 2050, we’ll add another 2.5b people
8
We need to create sustainable cities
9
We need to use technology to help us
10
We don’t live in Tomorrowland
11
We live here
The good news:
the Technology of Today can
build smart cities.
12
13
A Smart City Should Have…
• City-wide WiFi
• City App to report issues
• Open-Data Initiatives to share data with the public
• Most importantly, an adaptive IT department
14
Let’s learn how.
A Model Application: MemCity
Capturing data from 1.4 million households
Total AWS hardware costs at $2.35 per hour
MemCity
Reach: 1.4 million households (approximately the size of Chicago)
Capturing data from 8 devices in each home, every minute
#MemCity
186,667 transactions per second from Kafka > Spark > MemSQL
#MemCity
1.4 Million Households
8 Devices per Household
186K Events per Second
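The throughput figure follows directly from the household and device counts on the slides. A quick check of the arithmetic, assuming each device reports exactly once per minute (variable names are illustrative):

```python
# Figures taken from the slides; the names are only for illustration.
households = 1_400_000
devices_per_household = 8
reporting_interval_seconds = 60  # each device reports once per minute

events_per_minute = households * devices_per_household
events_per_second = events_per_minute / reporting_interval_seconds

print(f"{events_per_minute:,} events per minute")     # 11,200,000 events per minute
print(f"{events_per_second:,.0f} events per second")  # 186,667 events per second
```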
The “Real-Time Trinity”
Designing the Ideal Real-Time Pipeline
Message Queue | Transformation | Speed/Serving Layer
End-to-End Data Pipeline Under One Second
21
Kafka
• A high-throughput distributed messaging system
• Publish and subscribe to Kafka “topics”
• Centralized data transport for the organization
22
Spark
• In-memory execution engine
• High level operators for procedural and programmatic analytics
• Faster than MapReduce
23
MemSQL
• In-memory, distributed database
• Full transactions and complete durability
• Enable real-time, performant applications
24
Subscribing to Kafka
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
0111001010101111101111100000001010
111100001110101100000010010010111…
Publish to Kafka Topic
0111001010101111101111100000001010
111100001110101100000010010010111…
1110010101000101010001010100010111
111010100011110101100011010101000…
0101111000011100101010111110001111
011010111100000000101110101100000…
Event added to message queue
25
Enrich and Transform the Data
Spark polling Kafka for new messages
(2015-07-06T16:43:40.33Z, 329280, 23, 60)
(2015-07-06T16:43:40.33Z, 329280, 94110, 23,
‘kitchen_appliance’, 60)
Deserialization
Enrichment
0111001010101111101111100000001010
111100001110101100000010010010111…
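The deserialization and enrichment step can be sketched in plain Python. This is only an illustration of the idea, not the MemCity code: it assumes the Kafka message is a UTF-8, comma-separated record (the slides do not state the wire format), and the lookup tables HOUSE_ZIP and DEVICE_TYPE are hypothetical stand-ins for whatever reference data the real job joins against.

```python
from typing import NamedTuple


class RawEvent(NamedTuple):
    time: str
    house_id: int
    device_id: int
    watts: int


class EnrichedEvent(NamedTuple):
    time: str
    house_id: int
    zip_code: int
    device_id: int
    device_type: str
    watts: int


# Hypothetical reference data; the values mirror the example on the slide.
HOUSE_ZIP = {329280: 94110}
DEVICE_TYPE = {23: "kitchen_appliance"}


def deserialize(message: bytes) -> RawEvent:
    """Decode one message, assuming a comma-separated UTF-8 encoding."""
    time, house_id, device_id, watts = message.decode("utf-8").split(",")
    return RawEvent(time.strip(), int(house_id), int(device_id), int(watts))


def enrich(event: RawEvent) -> EnrichedEvent:
    """Add the zip code and device type looked up from reference data."""
    return EnrichedEvent(
        time=event.time,
        house_id=event.house_id,
        zip_code=HOUSE_ZIP[event.house_id],
        device_id=event.device_id,
        device_type=DEVICE_TYPE[event.device_id],
        watts=event.watts,
    )


raw = deserialize(b"2015-07-06T16:43:40.33Z, 329280, 23, 60")
print(enrich(raw))
```

In a Spark job the same two functions would typically be applied to each record of a batch with a map operation.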
26
Persist and Prepare for Production
RDD.saveToMemSQL()
INSERT INTO memcity_table ...
time | house_id | zip | device_id | device_type | watts
2015-07-06T16:43:40.33Z | 329280 | 94110 | 23 | ‘kitchen_appliance’ | 60
… | … | … | … | … | …
27
Go to Production
Compress development
timelines
SELECT ... FROM memcity_table ...
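The INSERT and SELECT statements on these slides are elided, so the following is only a sketch of what the persist-then-query step might look like. It uses Python's built-in sqlite3 module as a stand-in for MemSQL so that it runs anywhere; the column types, the two extra rows, and the aggregate query are assumptions made for illustration.

```python
import sqlite3

# sqlite3 is a stand-in here; the slides persist to MemSQL instead.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE memcity_table (
           time TEXT, house_id INTEGER, zip INTEGER,
           device_id INTEGER, device_type TEXT, watts INTEGER)"""
)

rows = [
    # First row is the example from the slide; the other two are made up.
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 23, "kitchen_appliance", 60),
    ("2015-07-06T16:43:40.33Z", 329280, 94110, 24, "hvac", 1200),
    ("2015-07-06T16:43:40.33Z", 500001, 94107, 23, "kitchen_appliance", 45),
]
conn.executemany("INSERT INTO memcity_table VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

# One possible application query: total power draw per zip code.
watts_by_zip = conn.execute(
    """SELECT zip, SUM(watts) AS total_watts
       FROM memcity_table
       GROUP BY zip
       ORDER BY total_watts DESC"""
).fetchall()
print(watts_by_zip)  # [(94110, 1260), (94107, 45)]
```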
28
We can use In-Memory
technology to build
interactive applications
for Cities.
31
So We Can Optimize…
• Urban planning
• Efficient power consumption
• Efficient transportation
• Sustainable energy practices
Creating Real-Time Pipelines
should be push button easy.
32
MemSQL Streamliner for IoT Applications
• One click deployment of integrated Apache Spark
• Put Spark in the Fast Lane
  - GUI pipeline setup
  - Multiple data pipelines
  - Real-time transformation
• Eliminates batch ETL
• Open source on GitHub
33
Simple Deployment Process
Application
34
Cluster
1. Deploy MemSQL
In-Memory | Distributed | Relational
Application
35
Cluster
2. Deploy Spark
Application
36
Cluster
Kafka Connects to Each Node
Application
37
Streamliner Architecture
First of many integrated Apache Spark solutions
Other
Real-Time Data
Sources Application
Apache Spark
Future Solution
Future Machine
Learning Solution
STREAMLINER
38
Streamliner ETL Detail
Other
Real-Time Data
Sources Application
Apache Spark
Future Solution
Future Machine
Learning Solution
STREAMLINER
Custom
Future Extractor
JSON
Custom
Future Transformer
STREAMLINER
Extract Transform Load
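The Extract, Transform, Load split with pluggable stages can be illustrated with a small generic sketch. This is not the Streamliner API: the type aliases, json_transformer, and run_pipeline are hypothetical names, and the in-memory list stands in for a database table.

```python
import json
from typing import Callable, Iterable, List

# Hypothetical stage signatures, for illustration only.
Extractor = Callable[[], Iterable[bytes]]
Transformer = Callable[[bytes], dict]
Loader = Callable[[List[dict]], None]


def json_transformer(message: bytes) -> dict:
    """Transform stage: decode one JSON message into a row."""
    return json.loads(message.decode("utf-8"))


def run_pipeline(extract: Extractor, transform: Transformer,
                 load: Loader, batch_size: int = 2) -> int:
    """Pull messages, transform each one, and load rows in small batches."""
    batch: List[dict] = []
    total = 0
    for message in extract():
        batch.append(transform(message))
        if len(batch) >= batch_size:
            load(batch)
            total += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        load(batch)
        total += len(batch)
    return total


# Usage: a list of byte strings plays the message queue, a list plays the table.
messages = [b'{"id": 1}', b'{"id": 2}', b'{"id": 3}']
table: List[dict] = []
loaded = run_pipeline(lambda: messages, json_transformer, table.extend)
print(loaded, table)
```

Keeping each stage behind a plain callable is what makes it possible to swap in a custom extractor or transformer without touching the rest of the pipeline.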
39
Streamliner
40
Extract
41
Transform
42
Load
43
Extending Analytics with Lambda Architecture
Real-Time Analytics Streaming
Analytic Applications
Not Excel Reports
• Financial Services
• Adtech
• eCommerce
• IoT
• Consumer Internet
• Energy
• Federal
Lambda Architecture
New Real-Time Processing
Existing Batch Processing
Msg
Queue
45
In-Memory Databases Rise Up
• Multi-TB on commodity hardware
• Store the “state of the model”
• Easily build applications
• Avoid direct disk at all cost
Comprehensive Architecture
Transactions
46
Comprehensive Architecture
Real Time
Speed/Streaming Layer
Fast Updates
Rowstore
Transactions
47
Comprehensive Architecture
Real Time
Speed/Streaming Layer
Fast Updates
Rowstore
Analytics
Transactions
48
Comprehensive Architecture
Real Time
Speed/Streaming Layer
Fast Updates
Rowstore
Historical
Batch Layer
Fast Appends
Columnstore
Analytics
Transactions
49
Comprehensive Architecture
Real Time
Speed/Streaming Layer
Fast Updates
Rowstore
Historical
Batch Layer
Fast Appends
Columnstore
Analytics
Transactions
Execution engine that spans the data spectrum
50
51
Simplified Lambda Architectures with MemSQL
Layer | Traditional Lambda | MemSQL Lambda
Batch | Hadoop | MemSQL Column Store
Speed | Storm, Spark | Kafka > Spark > MemSQL
Serving | Cassandra, HBase | MemSQL
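In a traditional Lambda setup, the serving layer answers a query by combining a precomputed batch view with a speed view that covers data newer than the last batch run. The sketch below shows that merge for a running total of watts per zip code. All numbers are invented, and it does not describe how MemSQL combines rowstore and columnstore data internally.

```python
from collections import Counter

# Batch view: totals computed by the last batch run (made-up numbers).
batch_view = Counter({94110: 1000, 94107: 400})

# Speed view: increments from events that arrived after the batch run.
speed_view = Counter({94110: 60, 94103: 25})


def serve(batch: Counter, speed: Counter) -> Counter:
    """Serving layer: merge both views so queries see up-to-date totals."""
    return batch + speed


current_totals = serve(batch_view, speed_view)
print(dict(current_totals))
```

Holding both views in a single database removes the need for this merge in application code, which is the simplification the table above points at.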
52
Lambda Applies to Real-Time Data Pipelines
Message
Queue
Batch
Inputs Database Transformation Application
53
Kafka, Spark, and MemSQL Make it Simple
Batch
Inputs Application
54
Massive Ingest and Concurrent Analytics
55
• Instant accuracy to the latest repin
• Build real-time analytic applications
• 1 GB/sec totaling 72 TB/day
Real-time
analytics
Using Real-Time for Personalization
Ad Servers
EC2
Real-time
analytics
PostgreSQL
Legacy reports
Monitoring S3 (replay)
HDFS
Data Science
Vertica
Star Schema MicroStrategy
• Reach overlap and ad optimization
• Over 60,000 queries per second
• Millisecond response times
56
57
300k events/sec
Reduced Latency from 30 minutes to Sub-Second
Real-time
Analytics
Sample Pipeline: Analyzing Twitter Data in Real Time
Application Apache Spark
SPARK
STREAMLINER
Public API
“Garden Hose”
</>
Python
Extract Transform Load
SPARK STREAMLINER
58
Install MemSQL and Apache Spark in < 1min
With MemSQL Ops and Streamliner
59
Run Kafka in Docker Container and Create a New
Topic: TWITTER
60
Fill Out Extract, Transform and Load Details to Set
Up Pipeline
61
Use Python Script to Load Tweets into Kafka Topic
and Get Data Flowing
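The loader script itself is not shown on the slides. A minimal sketch of the idea follows, assuming tweets are serialized as JSON and that the Kafka client is passed in as a plain send(topic, value) callable. The function name load_tweets and the sample tweets are hypothetical.

```python
import json
from typing import Callable, Iterable, List, Tuple

TOPIC = "TWITTER"  # topic name from the previous step


def load_tweets(tweets: Iterable[dict],
                send: Callable[[str, bytes], object],
                topic: str = TOPIC) -> int:
    """Serialize each tweet as JSON bytes and hand it to the producer."""
    count = 0
    for tweet in tweets:
        payload = json.dumps(tweet).encode("utf-8")
        send(topic, payload)
        count += 1
    return count


# Made-up sample data; a real script would read from the Twitter API.
sample = [
    {"id": 1, "text": "smart cities need real-time data"},
    {"id": 2, "text": "hello from #MemCity"},
]

# Stand-in producer that records what would have been published.
sent: List[Tuple[str, bytes]] = []
n = load_tweets(sample, lambda topic, value: sent.append((topic, value)))
print(n, sent[0][0])
```

With the kafka-python client, the send method of a KafkaProducer instance takes a topic and a bytes value, so it could be passed as the send argument, followed by a call to flush() once the loop finishes.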
62
Connect to MemSQL Database and Run SQL Queries
Instantly
63
Run Online Alter Table to Optimize Query Performance
64
Streamliner: Dynamic Resource Management
Without Streamliner With Streamliner
Pipeline 1
Spark Worker
Pipeline 2
Spark Worker
Executor
(P2 only)
Executor
(P2 only)
Executor
(P1 only)
Executor
(P1 only)
Driver
(P1 only)
Driver
(P2 only)
All Pipelines
Streamliner Driver
…
…
Spark Worker Spark Worker
Executor
(P1 or P2)
Executor
(P1 or P2)
Executor
(P1 or P2)
Executor
(P1 or P2)
65
Building Real-Time Data Pipelines
and Predictive Applications
66
Adding Real-Time Scoring to Predictive Applications
Streamliner
Input
User Jar
SAS Generated PMML
Industrial
Equipment
Sensor Data
S1 S2 S3 P1 P2 P3
Scoring Real-Time Data
with Predictive Models
Sensor 1 Predictive Model 1
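The diagram pairs each sensor stream (S1, S2, S3) with its own predictive model (P1, P2, P3). A rough sketch of that routing is shown below. The threshold functions are hypothetical stand-ins: the slides use SAS-generated PMML models packaged in a user jar, which this example does not attempt to load.

```python
from typing import Callable, Dict

Model = Callable[[float], float]


def make_threshold_model(limit: float) -> Model:
    """Hypothetical model: risk score rises linearly and is capped at 1.0."""
    def score(reading: float) -> float:
        return min(reading / limit, 1.0)
    return score


# One model per sensor, mirroring the S1..S3 / P1..P3 pairing in the diagram.
MODELS: Dict[str, Model] = {
    "S1": make_threshold_model(100.0),
    "S2": make_threshold_model(250.0),
    "S3": make_threshold_model(40.0),
}


def score_reading(sensor_id: str, reading: float) -> float:
    """Route a reading to the model registered for its sensor."""
    return MODELS[sensor_id](reading)


print(score_reading("S1", 50.0))   # 0.5
print(score_reading("S2", 500.0))  # 1.0
```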
67
68
GET YOUR FREE COPY:
memsql.com/oreilly
69

Editor's Notes

  • #7 Urbanization is projected to add 2.5 billion people to the world’s urban population by 2050, with nearly 90 per cent of the increase concentrated in Asia and Africa.
  • #46 Market Guide for In-Memory DBMS, 2015 (October, Edjlali, Feinberg, Jain)