2015-12-03
Programmatic Bidding
Data Streams & Druid
Charles Allen
We Are Hiring!
We’d love to connect! Our current open positions are:
Engineering Director, UI Engineer and
Distributed Systems Engineer.
We always have positions opening up so feel free to
connect with Sarah Carter (our Head of Recruiting) for
future openings - sarah.carter@metamarkets.com.
What is Real-Time Bidding?
Real-Time Bidding is resolving advertising
supply and demand at the moment of supply.
+ Best suited for systems with internet connectivity.
For the sake of this conversation, Real-Time
Bidding (RTB) is the general method by which
digital media supply and demand are commonly
reconciled, using programmatic methodologies
over very short time frames.
What Happens in Real-Time Bidding?
1. User loads resources which contain ad space
(supply is created by a Publisher)
2. Information / notification is generated and
distributed to interested parties
Avail (a unit of supply of audience attention) is
handled by an Exchange
3. Information on the avail is distributed to
potentially interested parties
We now have an auction
4. Potentially interested parties judge the avail
and either bid on the auction, or they do not.
5. The winner of the auction is determined by
the exchange.
5b. 100 ms have passed.
If a human can perceive that an auction took
place, YOU ARE TOO SLOW.
6. An attempt is made to serve the winning ad
as an impression
7. The impression hopefully turns into a click
or conversion
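The steps above can be sketched as a toy exchange loop. This is only an illustration under stated assumptions: the `Bid` type, the bidder callables, and the highest-price-wins rule are hypothetical, and bidders are polled one after another for simplicity. Only the 100 ms budget comes from the slides.

```python
import time
from dataclasses import dataclass
from typing import Optional

AUCTION_BUDGET_S = 0.100  # the 100 ms budget from step 5b


@dataclass
class Bid:
    bidder: str
    price: float


def run_auction(avail: dict, bidders: dict,
                budget_s: float = AUCTION_BUDGET_S) -> Optional[Bid]:
    """Offer the avail to each bidder and return the highest bid, if any.

    A bidder returns a price, or None when it declines to bid.
    Answers that arrive after the deadline are ignored.
    """
    deadline = time.monotonic() + budget_s
    best = None
    for name, decide in bidders.items():
        if time.monotonic() >= deadline:
            break  # auction is closed; remaining bidders are too slow
        price = decide(avail)
        if time.monotonic() > deadline:
            break  # this answer came back too late
        if price is not None and (best is None or price > best.price):
            best = Bid(name, price)
    return best


# Hypothetical avail and bidders, for illustration only.
avail = {"site": "example.com", "slot": "banner"}
bidders = {
    "dsp_a": lambda a: 1.20,
    "dsp_b": lambda a: None,  # judges the avail and does not bid
    "dsp_c": lambda a: 2.05,
}
winner = run_auction(avail, bidders)
```

Here `winner` is the bid from `dsp_c`; with a budget of zero the auction closes before anyone is asked and the result is `None`.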
Diagram: Avail / Auction -> Bid -> Impression -> Click / Conversion (annotated "??" and "?")
Programmatic data is 100x larger
than Wall Street
CERN - LHC
The LHC produces about 1 GB/s on average
http://home.cern/about/updates/2015/06/lhc-season-2-cern-computing-ready-data-torrent
MMX raw incoming stream data regularly exceeds this*
* 1 hr average
General Architecture
Streaming: Kafka -> Samza/Kafka -> Tranquility -> Druid RT
Batch: Raw (S3) -> Hadoop / Spark -> Deep Storage (S3) -> Druid Historical
Queries: UI / User
Druid for Queries! ... But what is Druid?
Official - Druid is a fast column-oriented
distributed data store
Me - Druid is a highly available Data Store
designed for interactive, ad-hoc, OLAP style
queries on time-series, denormalized data.
Key points for BEST use cases
Highly Available - No downtime for maintenance since 2011
Interactive - FAST
OLAP - Insightful
Ad-hoc - Dynamic
Time-series - Sequential
Denormalized - Flat
* By the way, it works on Streams (aka Real Real-Time)
Lifecycle of a Real-Time Datum
Mr. Charlie Event
Firehose -> Druid RT Peon 0*
Firehose -> Druid RT Peon 1*
* Launched by Overlord by way of a Middle Manager
Firehose -> Parser -> In-Memory Write-Optimized Store (Druid RT 0)
Rollup (inside the In-Memory Write-Optimized Store of Druid RT 0)
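Rollup can be pictured as pre-aggregating events that share the same truncated timestamp and the same dimension values. The sketch below assumes minute granularity and made-up event fields (`publisher`, `country`, `impressions`); it is not Druid's actual in-memory index, just the idea.

```python
from collections import defaultdict


def rollup(events, dimensions, metrics, granularity_s=60):
    """Sum metrics over events sharing (truncated timestamp, dimension values)."""
    table = defaultdict(lambda: defaultdict(int))
    for e in events:
        # Truncate the epoch-second timestamp to the start of its bucket.
        bucket = e["timestamp"] - e["timestamp"] % granularity_s
        key = (bucket,) + tuple(e[d] for d in dimensions)
        table[key]["count"] += 1
        for m in metrics:
            table[key][m] += e[m]
    return {k: dict(v) for k, v in table.items()}


# Hypothetical events: four raw rows collapse into three stored rows.
events = [
    {"timestamp": 1449100805, "publisher": "a.com", "country": "US", "impressions": 1},
    {"timestamp": 1449100830, "publisher": "a.com", "country": "US", "impressions": 1},
    {"timestamp": 1449100861, "publisher": "a.com", "country": "US", "impressions": 1},
    {"timestamp": 1449100815, "publisher": "b.com", "country": "DE", "impressions": 1},
]
rolled = rollup(events, ["publisher", "country"], ["impressions"])
```

The first two events fall into the same minute with the same dimensions, so they are stored as one row with `count` 2.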
Persist (triggered by Time or Size): In-Memory Write-Optimized Store -> Memory Mapped Read-Only Store
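A rough sketch of a persist trigger based on time or size follows. The class name, the thresholds (`max_rows`, `max_age_s`), and the use of a tuple as a stand-in for the read-only store are all assumptions for illustration; a real node would write the data to disk and memory-map it.

```python
import time


class WriteBuffer:
    """Mutable in-memory store that spills into snapshots on time or size."""

    def __init__(self, max_rows=3, max_age_s=600.0, clock=time.monotonic):
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.clock = clock
        self.rows = []
        self.opened_at = clock()
        self.persisted = []  # snapshots that are never appended to again

    def add(self, row):
        self.rows.append(row)
        self.maybe_persist()

    def maybe_persist(self):
        too_big = len(self.rows) >= self.max_rows
        too_old = self.clock() - self.opened_at >= self.max_age_s
        if self.rows and (too_big or too_old):
            self.persisted.append(tuple(self.rows))  # freeze current rows
            self.rows = []
            self.opened_at = self.clock()


buf = WriteBuffer(max_rows=3)
for i in range(7):
    buf.add({"id": i})
```

After seven rows with `max_rows=3`, two snapshots have been persisted and one row is still in the write buffer.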
Merge: several Memory Mapped Read-Only Stores -> one Memory Mapped Read-Only Store*
* Segment
Handoff
Druid RT 0 -> Deep Storage (S3, HDFS, Azure, Cassandra) -> Druid Historical
The Memory Mapped Read-Only Store moves from Druid RT 0 to the Druid Historical*
* Orchestrated by Coordinator
Druid Historical tiers, each holding Memory Mapped Read-Only Stores:
Druid - Hot: Very Little Paging
Druid - Cold: Some Paging
Druid - Icy: Lots of Paging
Lifecycle rules tunable by datasource
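One way to picture per-datasource lifecycle rules is a list of age thresholds per datasource, where the first matching rule decides the tier. Everything in this sketch is hypothetical (the `RULES` table, the thresholds, and `tier_for`); it does not reproduce the Coordinator's actual rule format.

```python
from datetime import timedelta

# Hypothetical rule sets. Each rule is (maximum segment age, tier);
# None as the age means "any age". The first matching rule wins.
RULES = {
    "_default": [
        (timedelta(days=30), "hot"),
        (timedelta(days=365), "cold"),
        (None, "icy"),
    ],
    "mmx_metrics_druid": [
        (timedelta(days=7), "hot"),
        (None, "cold"),
    ],
}


def tier_for(datasource, segment_age):
    """Return the tier a segment of this age belongs to for a datasource."""
    rules = RULES.get(datasource, RULES["_default"])
    for max_age, tier in rules:
        if max_age is None or segment_age <= max_age:
            return tier
    return None  # no rule matched: the segment is not loaded anywhere
```

For example, `tier_for("mmx_metrics_druid", timedelta(days=10))` falls through the 7-day rule and lands on the cold tier, while a datasource without its own entry uses the default rules.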
Canary / Metrics cluster
Coordinator
Console
Lifecycle of a Query
Query Router -> Hot Broker XOR Cold Broker
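The XOR means a query goes to exactly one broker. A minimal sketch of such a decision, assuming the choice depends on how far back the query reaches; the 30-day `HOT_WINDOW` and the broker names are invented for illustration and are not taken from the slides.

```python
from datetime import datetime, timedelta, timezone

HOT_WINDOW = timedelta(days=30)  # assumed cutoff, not from the slides


def pick_broker(query_start, now=None):
    """Route to exactly one broker: hot if the query only touches recent data."""
    now = now or datetime.now(timezone.utc)
    return "hot-broker" if now - query_start <= HOT_WINDOW else "cold-broker"


now = datetime(2015, 12, 3, tzinfo=timezone.utc)
recent = pick_broker(datetime(2015, 12, 1, tzinfo=timezone.utc), now)
old = pick_broker(datetime(2015, 6, 1, tzinfo=timezone.utc), now)
```

A query starting two days ago is routed to the hot broker, one starting six months ago to the cold broker.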
Broker -> Druid RT (Peon), Druid Historical Hot, Druid Historical Cold, Druid Historical Icy
Cache
Define Stream Hooks
Druid Historical XYZ: Cache, Memory Mapped Read-Only Store, Memory Mapped Read-Only Store
Memory Mapped Read-Only Store layout:
Column Dictionary
Dimension Value Bitmap (x3)
Metric Column (x4)
* ByteBuffer slices
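The dictionary and per-value bitmaps can be illustrated with a toy index. This sketch uses plain Python integers as bitmaps and made-up `country` and `device` columns; real segments use compressed bitmap formats over byte buffers, so treat this as the concept only.

```python
def build_dimension_index(values):
    """Dictionary-encode a dimension column and build one bitmap per value."""
    dictionary = sorted(set(values))
    ids = {v: i for i, v in enumerate(dictionary)}
    encoded = [ids[v] for v in values]
    bitmaps = {v: 0 for v in dictionary}
    for row, v in enumerate(values):
        bitmaps[v] |= 1 << row  # bit `row` set means "this row has value v"
    return dictionary, encoded, bitmaps


def rows_of(bitmap):
    """Return the row offsets whose bit is set, in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical columns of a six-row segment.
country = ["US", "DE", "US", "FR", "DE", "US"]
device = ["ios", "ios", "web", "web", "ios", "ios"]
country_dict, country_ids, country_bm = build_dimension_index(country)
_, _, device_bm = build_dimension_index(device)

# Filter "country = US AND device = ios" is a bitwise AND of two bitmaps.
matching = rows_of(country_bm["US"] & device_bm["ios"])
```

The filter resolves to row offsets without touching any metric column; only those offsets are then handed to the scan.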
Dimension Value Bitmaps, Metric Columns -> Iterator -> Aggregator, Aggregator, Aggregator
Ready, set… GO!
Iterator -> Aggregators: “Take 0, take 1, take 7, take 10”
Scan columns ONCE (Metrics, Dimensions)
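The "take 0, take 1, take 7, take 10" idea can be sketched as one pass over the selected row offsets, with every aggregator fed on each step. The aggregator classes and the column contents below are invented for illustration and are not Druid's classes.

```python
class LongSum:
    """Sums one metric column over the rows it is shown."""

    def __init__(self, column):
        self.column = column
        self.total = 0

    def aggregate(self, row):
        self.total += self.column[row]


class Count:
    """Counts the rows it is shown."""

    def __init__(self):
        self.total = 0

    def aggregate(self, row):
        self.total += 1


def scan(row_offsets, aggregators):
    """Walk the selected offsets once, feeding every aggregator per row."""
    for row in row_offsets:
        for agg in aggregators:
            agg.aggregate(row)
    return [agg.total for agg in aggregators]


# Hypothetical metric columns of an eleven-row segment.
impressions = [5, 3, 8, 1, 9, 2, 7, 4, 6, 0, 10]
clicks = [1, 0, 2, 0, 3, 0, 1, 1, 2, 0, 4]
totals = scan([0, 1, 7, 10], [Count(), LongSum(impressions), LongSum(clicks)])
```

All three aggregators advance together, so adding another aggregation does not add another pass over the offsets.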
Metrics, Dimensions: Memory Mapped Byte Buffers (Kernel disk cache)
Iterator, Aggregators: JVM managed memory
Intermediate Results, Intermediate Results (Cache, Cache) -> Merge -> Druid Historical XYZ Result
Druid Historical XYZ Result, Druid RT DEF Result, Druid Historical ABC Result -> Merge (Broker)
Done! Results bubble up: Broker -> Router -> UI*
* Technically bubbles up to Business Logic layer
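The merge step can be sketched for the simple case of additive metrics (counts and sums), where partial results from each node are combined per time bucket. The result shape and the sample values are assumptions; non-additive aggregations would need their own merge logic.

```python
from collections import Counter


def merge_results(partials):
    """Combine per-node timeseries results by summing metrics per time bucket."""
    merged = {}
    for partial in partials:
        for bucket, metrics in partial.items():
            # Counter.update adds values for keys that already exist.
            merged.setdefault(bucket, Counter()).update(metrics)
    return {bucket: dict(merged[bucket]) for bucket in sorted(merged)}


# Hypothetical partial results from three nodes.
xyz = {"2015-12-03T10:00Z": {"count": 4, "value": 22}}
def_ = {
    "2015-12-03T10:00Z": {"count": 1, "value": 3},
    "2015-12-03T10:01Z": {"count": 2, "value": 5},
}
abc = {"2015-12-03T10:01Z": {"count": 1, "value": 1}}
final = merge_results([xyz, def_, abc])
```

The same function could merge per-segment intermediate results inside a single node, since the operation is identical at both levels.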
Demo!
What was in the Demo?
Actual Druid Usage Data
Query load is about ½ million queries per day
Actual Druid Indexing Data
Only 2.8M streaming events/sec yesterday during peak hour.
Was a slow day.
Druid OSS Clients!
Official
+ R https://github.com/druid-io/RDruid
+ Python https://github.com/druid-io/pydruid
Community
+ Spark https://github.com/SparklineData/spark-druid-olap
+ SQL https://github.com/srikalyc/Sql4D
+ Many more! http://druid.io/docs/latest/development/libraries.html
JavaScript, Node.js, Clojure, Ruby, (other) SQL, TypeScript
R Example
library(RDruid)

# Query window: the last 24 hours, aligned to the start of the current minute (UTC)
start_time <- as.POSIXlt(Sys.time(), "UTC", origin = "1970-01-01")
start_time$sec <- 0
end_time <- start_time
start_time$hour <- start_time$hour - 24
intvl <- interval(start_time, end_time)

# druid_query_url and hosts are not shown on the slide; they are assumed to be defined earlier
segment_times <- druid.query.timeseries(
  url = druid_query_url, # bard endpoint
  intervals = intvl,
  dataSource = "mmx_metrics_druid",
  aggregations = list(
    count = longSum(metric("count")),
    value = longSum(metric("value"))
  ),
  filter = dimension("host") %=% hosts & dimension("metric") %=% "query/segment/time",
  granularity = "minute",
  context = list(useCache = TRUE, populateCache = TRUE)
)
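For readers who do not use R, the request above corresponds roughly to a native Druid timeseries query. The sketch below only builds the JSON body as a Python dictionary and does not contact any server; the function name, the single `host` value, and the fixed end time are assumptions for illustration.

```python
import json
from datetime import datetime, timedelta, timezone


def timeseries_query(end, host):
    """Build a native timeseries query body for the 24 hours before `end`."""
    end = end.replace(second=0, microsecond=0)
    start = end - timedelta(hours=24)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return {
        "queryType": "timeseries",
        "dataSource": "mmx_metrics_druid",
        "granularity": "minute",
        "intervals": [f"{start.strftime(fmt)}/{end.strftime(fmt)}"],
        "filter": {
            "type": "and",
            "fields": [
                {"type": "selector", "dimension": "host", "value": host},
                {"type": "selector", "dimension": "metric",
                 "value": "query/segment/time"},
            ],
        },
        "aggregations": [
            {"type": "longSum", "name": "count", "fieldName": "count"},
            {"type": "longSum", "name": "value", "fieldName": "value"},
        ],
        "context": {"useCache": True, "populateCache": True},
    }


# Hypothetical host name and end time.
query = timeseries_query(
    datetime(2015, 12, 3, 12, 30, 45, tzinfo=timezone.utc), "host-1"
)
body = json.dumps(query)
```

The resulting `body` is what a client would send to the query endpoint; client libraries such as pydruid wrap this step.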
UI - Panoramix
https://github.com/mistercrunch/panoramix
UI - Grafana
https://github.com/Quantiply/grafana-plugins/tree/master/features/druid
UI - Pivot
https://github.com/implydata/pivot
Druid Speed
+ https://www.linkedin.com/pulse/combining-druid-spark-interactive-flexible-analytics-scale-butani
+ http://druid.io/blog/2014/03/17/benchmarking-druid.html
We’re always getting faster!
A very common question in PRs is “How does this affect speed?”, followed by PROVE IT
Micro-benchmarks in druid-io master branch
https://github.com/druid-io/druid/tree/master/benchmarks
Macro-benchmarks done at scale
(see your metrics console for answers)