Real-Time Analytics with Spark and MemSQL

Neil Dahlke, Engineer
2017 March 17
Streaming and MemSQL

About Me: Neil Dahlke
 Engineer
 Formerly Globus
• high performance data transfer for research scientists
 Past talks
• Real-time, Geospatial, Maps
 Slides: http://www.slideshare.net/MemSQL/realtime-geospatial-
maps-by-neil-dahlke

Architecture: Aggregators and Leaves
Agg 1 Agg 2
Leaf 1 Leaf 2 Leaf 3 Leaf 4

Architecture: Aggregators Aggregate
Agg 1 Agg 2

Architecture: Leaves Hold Partitions
Agg 1 Agg 2

Architecture: It’s SQL All The Way Down
Agg 1 Agg 2
select avg(price) from orders;
leaf1> using memsql_demo_0
select count(1), sum(price)
from orders;
leaf2> using memsql_demo_12
select count(1), sum(price)
from orders;
...

Latency in the Enterprise
SELECT*
FROM
WHERE
SLOW DATA
LOADING
Batched Loading
Hours to load
Sampled Data Views
No real-time ingestion
LENGTHY QUERY
EXECUTION
Slow query responses
Slow reports
Slow applications
No real-time response
LOW CONCURRENCY
Single threaded operations
Challenge with mixed workloads
Overall poor performance

REIMAGINE AN EXISTING BUSINESS PROCESS.
What if you had intra-day information to inform your decision making,
instead of daily or even weekly?

Why MemSQL?
FAST DATA
INGEST
The volume of data
that can be ingested
into the database

Why MemSQL?
LOW LATENCY
QUERIES
The time it takes to
execute queries and
receive results

Why MemSQL?
HIGH
CONCURRENCY
The ability to scale
simultaneous operations

Why MemSQL?
FAST DATA
INGEST
The volume of data
that can be ingested
into the database
LOW LATENCY
QUERIES
The time it takes to
execute queries and
receive results
HIGH
CONCURRENCY
The ability to scale
simultaneous operations

REAL WORLD
EVENT
REAL-TIME
RESPONSE
REDUCED
LATENCY

WHAT WE
ARE SEEING
A WORLD OF CONNECTED
MACHINES AND PEOPLE

Use Cases
Executive dashboards
Time series analytics
Sales analytics
Real-time data visualization
Live business dashboards
Website analytics

Real-time data with massive concurrency – millions of cars, drivers and
riders accessing the database optimizing supply, demand and pricing.
+

TECHNICAL BENEFITS
 Analyze millions of rows / second
 Analyze historical against live data
 Massive concurrency

THE UBER REAL-TIME ARCHITECTURE
REAL-TIME
ANALYTICS
REAL-TIME
INPUTS

A massively scalable database and ingest solution allowed for
massive growth, real-time analytic applications and faster, targeted.
+

 Kafka
 S3
• Persisted all logs to cold storage for eventual analysis
 Hadoop
• Nightly map-reduce jobs
 Redshift
• Took a full day to load data from previous day
• Reaching overlap of times caused data crisis
• Pre-aggregated
• Limited concurrency
Before

 Late data
 Limited access to the data once it’s in
 Long waits for insight
 Expensive
Why was this bad for their business?

Why was this bad for their data operations?
 Not scalable
 No deduplication
• aka not exactly-once
 Unfiltered and incomplete data (silos)
 Pre-aggregated data
FAST DATA
INGEST
LOW
LATENCY
QUERIES
HIGH
CONCURRENCY

INSTANT ACCURACY TO THE LATEST PIN
REAL-TIME
ANALYTICS

Accelerated ingest
from 24 hours to 5 secs
1 GB/sec totaling
72 TB/day
RESULTS

Visualizing the Data
 Demo built using
• Mapbox
• Websockets
• Tornado web server
 When an image is pinned, the circles on the globe
expand, showing higher volume areas
 Reads data from MemSQL directly

Introducing MemSQL Pipelines
 CREATE PIPELINE is a database construct that enables
data ingestion with exactly-once semantics
• MemSQL stores the Kafka offset in a table
• Exactly once delivery facilitated by co-locating data and offsets
 Extract, transform, and load external data natively
 Fully distributed workloads
 User-defined transformations
 Scalable, highly performant, online ALTER TABLE and
ALTER PIPELINE

MemSQL Pipelines Sequence
1. Extract from data sources
2. Transform extracted data
3. Load transformed data into Database tables in parallel
Data
Sources
MemSQL
1. Extract 2. Transform extracted data 3. Load into Database tables
Pipelines

MemSQL Pipelines Architecture: Kafka Example
Kafka
Broker
MemSQL NodePipelines
Kafka
Broker
Kafka
Broker
MemSQL MasterPipelines
1. Extract 2. Transform 3. Load
Data
reshuffle
Metadata query

Understanding
CREATE PIPELINE
and Streamliner

Getting Data to MemSQL
CREATE PIPELINE Streamliner
Parallel loading from multiple sources Parallel loading from multiple sources
Direct to leaf nodes
Data to multiple aggregators, then leaf
nodes
Native feature Built with Apache Spark
Exactly-once semantics

Learn More
 [ODBMS Watch] Powering Big Data at Pinterest.
Interview with Krishna Gade
 [GigaOm] Pinterest is experimenting with MemSQL for
real-time data analytics
 [InfoQ] Real-time Data Analytics at Pinterest using
MemSQL and Spark Streaming
 [MemSQL Blog] How Pinterest Measures Real-Time User
Engagement with Spark
 [Pinterest Engineering Blog] Real-time analytics at
Pinterest

Resources
 https://github.com/memsql/memsql-spark-connector
 http://docs.memsql.com/docs/streamliner-administration
 http://docs.memsql.com/docs/pipelines-overview
 https://github.com/memsql/memsql-docker-quickstart

Real-Time Analytics with Spark and MemSQL

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Viewers also liked

Viewers also liked (19)

Similar to Real-Time Analytics with Spark and MemSQL

Similar to Real-Time Analytics with Spark and MemSQL (20)

More from SingleStore

More from SingleStore (17)

Recently uploaded

Recently uploaded (20)

Real-Time Analytics with Spark and MemSQL