2. About Me: Neil Dahlke
Engineer
MemSQL
• real-time database for transactions / analytics
Formerly Globus
• high performance data transfer for research scientists
Past talks
• Real-time, Geospatial, Maps
Slides: http://www.slideshare.net/MemSQL/realtime-geospatial-maps-by-neil-dahlke
4. WHAT WE ARE SEEING:
Sensors. Applications. Machines. And us.
Generating more data every single day.
By 2020, over 20 billion connected things will be in use across a range of industries.
6. WHAT DO REAL-TIME BUSINESSES NEED?
FAST DATA INGEST
The rate at which data can be ingested into the database
7. WHAT DO REAL-TIME BUSINESSES NEED?
LOW-LATENCY QUERIES
The time it takes to execute queries and receive results
8. WHAT DO REAL-TIME BUSINESSES NEED?
HIGH CONCURRENCY
The ability to scale simultaneous operations
9. WHAT DO REAL-TIME BUSINESSES NEED?
FAST DATA INGEST
The rate at which data can be ingested into the database
LOW-LATENCY QUERIES
The time it takes to execute queries and receive results
HIGH CONCURRENCY
The ability to scale simultaneous operations
11. A massively scalable database and ingest solution enabled massive growth, real-time analytic applications, and faster, more targeted advertising.
12. Before
Kafka
• The component they kept
S3
• Persisted all logs to cold storage for eventual analysis
Hadoop
• Nightly map-reduce jobs
Redshift
• Took a full day to load the previous day's data
• Load windows began overlapping into the next day, causing a data crisis
13. Why was this bad for their business operations?
No real-time access to analytics
No SQL interface for analysts and data scientists
Massive nightly Hadoop batch jobs (late data)
Unfiltered and incomplete data (silos)
Expensive
14. Why was this bad for their data operations?
Too slow
Not scalable
No deduplication
• i.e., no exactly-once semantics
Low concurrency
FAST DATA INGEST · LOW-LATENCY QUERIES · HIGH CONCURRENCY
24. Visualizing the Data
Demo built using
• Mapbox
• Websockets
• Tornado web server
When an image is repinned, the circles on the globe expand, highlighting higher-volume areas
Reads data from MemSQL directly (the server side is sketched below)
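A minimal sketch of what the demo's server side could look like under these assumptions: Tornado serves a websocket, polls MemSQL for recent repins per region, and pushes JSON the Mapbox front end can use to size its circles. The endpoint, poll interval, and the repin_events table (shared with the other sketches in these notes) are illustrative placeholders, not the actual demo code.

```python
import json
import pymysql
import tornado.ioloop
import tornado.web
import tornado.websocket

clients = set()

class RepinSocket(tornado.websocket.WebSocketHandler):
    def open(self):
        clients.add(self)

    def on_close(self):
        clients.discard(self)

def broadcast_counts():
    # One connection per tick keeps the sketch simple; a real server
    # would reuse a connection or a pool.
    conn = pymysql.connect(host="memsql-master", user="root",
                           password="", database="events_db")
    with conn.cursor() as cur:
        # Repins per region over the last five seconds, read straight
        # from the table the Kafka pipeline is filling.
        cur.execute("""
            SELECT region, COUNT(*) AS repins
            FROM repin_events
            WHERE ts > NOW() - INTERVAL 5 SECOND
            GROUP BY region
        """)
        rows = cur.fetchall()
    conn.close()
    payload = json.dumps([{"region": r, "repins": c} for r, c in rows])
    for ws in list(clients):
        ws.write_message(payload)

app = tornado.web.Application([(r"/repins", RepinSocket)])
app.listen(8888)
# Poll every second and fan the aggregates out to connected browsers.
tornado.ioloop.PeriodicCallback(broadcast_counts, 1000).start()
tornado.ioloop.IOLoop.current().start()
```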
- Distributed in-memory database
- Built for real-time analytics and transactions
- Familiar SQL interface
- Spark integration out of the box
- Native Kafka ingestion (pipeline sketch below)
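The deck doesn't show the DDL, but native Kafka ingestion in MemSQL is driven by CREATE PIPELINE. Here is a minimal sketch: MemSQL speaks the MySQL wire protocol, so pymysql stands in for any client, and the host names, topic, table, and schema are illustrative placeholders.

```python
import pymysql  # any MySQL-protocol client works against MemSQL

conn = pymysql.connect(host="memsql-master", port=3306, user="root",
                       password="", database="events_db")
with conn.cursor() as cur:
    # Target table for raw repin events (illustrative schema; the
    # PRIMARY KEY matters for the dedup sketch further down).
    cur.execute("""
        CREATE TABLE IF NOT EXISTS repin_events (
            event_id BIGINT NOT NULL,
            user_id  BIGINT,
            pin_id   BIGINT,
            region   VARCHAR(64),
            ts       DATETIME,
            PRIMARY KEY (event_id)
        )
    """)
    # CREATE PIPELINE points the database directly at a Kafka topic;
    # MemSQL pulls and loads batches itself, with no separate ETL process.
    cur.execute("""
        CREATE PIPELINE repin_pipeline AS
        LOAD DATA KAFKA 'kafka-broker:9092/repins'
        INTO TABLE repin_events
        FIELDS TERMINATED BY ','
    """)
    cur.execute("START PIPELINE repin_pipeline")
```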
What did they want to do?
- Highly scalable infrastructure that collects, stores, and processes user engagement data in real time
- Higher-performance event logging
- Reliable log transport and storage
- The ability to query real-time data
The original pipeline:
- A user clicks Pin or repins
- The event is pushed to Apache Kafka (producer sketched after this list)
- Storm, Spark, and other custom-built log readers process these events in real time
- A log persistence service called Secor reliably writes these events to Amazon S3 (zero data loss, overcoming S3's weak eventual-consistency model)
- A self-serve big data platform loads the data from S3 into many different Hadoop clusters for batch processing
- In-house tools Singer (logger) and Secor (replicator) asynchronously replicate local logs from app servers to a centralized S3 location, using Kafka for transport
- Kafka was great for throughput, but they needed a way to derive value, e.g. run SQL against these datasets in real time
- A few days later this data would reach Redshift and become queryable
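To make the event-push step concrete, here is a minimal sketch using the kafka-python client. The topic name, event fields, and CSV layout are illustrative (chosen to match the pipeline sketch above), not Pinterest's actual Singer/Secor format.

```python
import random
import time
from kafka import KafkaProducer  # kafka-python package

producer = KafkaProducer(bootstrap_servers=["kafka-broker:9092"])

def on_repin(user_id, pin_id, region):
    """Fire-and-forget publish; Storm, Spark, Secor, and the MemSQL
    pipeline all read the same topic independently."""
    event_id = random.getrandbits(63)          # placeholder unique id
    ts = time.strftime("%Y-%m-%d %H:%M:%S")
    # CSV line matching the repin_events columns used in the sketches.
    line = "%d,%d,%d,%s,%s" % (event_id, user_id, pin_id, region, ts)
    producer.send("repins", line.encode("utf-8"))

on_repin(user_id=42, pin_id=1001, region="us-west")
producer.flush()
```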
Pain points:
- It took several days to access analytics and make them available to the data science team (too late for A/B testing and advertising)
- No SQL interface
- 5.5M rows/second for one topic, 1.7M rows/second for another, with the lowest-throughput topic at 132k rows/second
- Data needs to be filtered as well as enriched
- At-least-once semantics, i.e. duplicates are possible (see the dedup sketch below)
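At-least-once delivery means the same event can arrive more than once. One common way to get effectively exactly-once storage (a general technique, not necessarily Pinterest's mechanism) is to make the insert idempotent against a unique key, as in this sketch built on the illustrative repin_events table above:

```python
import pymysql

conn = pymysql.connect(host="memsql-master", user="root", password="",
                       database="events_db")

def store_event(event_id, user_id, pin_id, region, ts):
    with conn.cursor() as cur:
        # A redelivered event collides with the PRIMARY KEY on event_id
        # and is skipped, so replaying a Kafka partition cannot
        # double-count repins.
        cur.execute(
            "INSERT IGNORE INTO repin_events "
            "(event_id, user_id, pin_id, region, ts) "
            "VALUES (%s, %s, %s, %s, %s)",
            (event_id, user_id, pin_id, region, ts),
        )
    conn.commit()
```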
After:
- Goes both ways
- Easily repeatable success
- Days to seconds
- Now has a source of record for sharing relevant user engagement data and metrics with their data analysts and with key brands
- Pinterest and their partners can get a better understanding of user behavior and provide more value to the Pinner community
- Cheaper
- The ability to identify (and react to) developing trends as they happen
- Provides insight into how users are engaging with Pins across the globe in real time
- Helps Pinterest become a better recommendation engine
- SQL interface for engineering and data science teams
- Fast ad-hoc query execution on real-time data, allowing SQL queries on events as they arrive
Demo walkthrough:
- Pull up Ops
- Pull up a terminal and create the database
- Deploy Spark
- Create a Streamliner pipeline
- Create a MemSQL Pipelines pipeline
- Expose the UI
- Ad-hoc queries, Tableau, and custom reporting (example query below)
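As a final illustration, an ad-hoc query of the kind the demo ends on. Because MemSQL exposes a SQL interface over the MySQL protocol, the same statement works from a terminal, from Tableau, or from a script; table and column names follow the earlier illustrative sketches.

```python
import pymysql

conn = pymysql.connect(host="memsql-master", user="root", password="",
                       database="events_db")
with conn.cursor() as cur:
    # Top pins by repins over the last minute, queried while the
    # Kafka pipeline is still ingesting.
    cur.execute("""
        SELECT pin_id, COUNT(*) AS repins
        FROM repin_events
        WHERE ts > NOW() - INTERVAL 1 MINUTE
        GROUP BY pin_id
        ORDER BY repins DESC
        LIMIT 10
    """)
    for pin_id, repins in cur.fetchall():
        print(pin_id, repins)
```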