Druid: Elixir for analytics
Grégory Letribot @ar_rabbit
The right tool for the JOB
Performance
unique internet users per month
Ads displayed per day
10 000
1
Servers
Hadoop nodes
Analytics the old way
CPOP
Loading…
Once upon a time, in an SQL Galaxy..
« Guys, database whatever_DB contains 3B rows.
Disks are full; it needs a purge.
The server will be reinstalled and will host only the last 30 days. »
« Well, talk with product »
SQL limit is reached
Product working, but infrastructure falling apart
We want more !
More dimensions
User centric
Interactive
Realtime
NOSQL ?
Precomputing gives fast queries
Not flexible !
Scales exponentially on dimensions
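Why precomputing scales exponentially: a minimal sketch, assuming every subset of dimensions is a GROUP BY someone may ask for. The dimension names are hypothetical, and the count ignores the extra factor coming from each dimension's cardinality.

```python
from itertools import combinations

def groupings(dimensions):
    """Return every subset of dimensions that a query could group by."""
    subsets = []
    for size in range(len(dimensions) + 1):
        subsets.extend(combinations(dimensions, size))
    return subsets

# Hypothetical dimension names, for illustration only.
dims = ["device", "publisher", "category", "country"]
for n in range(1, len(dims) + 1):
    print(n, "dimensions ->", len(groupings(dims[:n])), "aggregates to precompute")
```

Each added dimension doubles the number of aggregates to maintain.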
Summing up
Food for BI
• Interactive, sub-second insights
• Arbitrarily drill into data
• Scalability, availability…
metamar
Built for analytics
Scalable and Available
Real-time ingestion and Queries
Read-oriented
store Column oriented
 In Memory
 Fast Filtering
 Segment based
Distribu
Real
[Architecture diagram: queries reach the broker nodes through the REST API. The data feed goes to the real-time nodes, which hand off to the historical nodes. Also shown: cold storage, coordinator nodes, ZooKeeper, metadata.]
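What a query sent to a broker node might look like: a sketch, assuming Druid's native JSON query format posted to the broker's `/druid/v2/` endpoint. The broker address, data source name, and interval are hypothetical. The block only builds the request; it does not send it.

```python
import json
import urllib.request

# Hypothetical broker address; the data source name below is hypothetical too.
BROKER_URL = "http://localhost:8082/druid/v2/"

query = {
    "queryType": "timeseries",
    "dataSource": "ad_events",
    "granularity": "all",
    "intervals": ["2015-01-01/2015-02-01"],
    "filter": {"type": "selector", "dimension": "device", "value": "Iphone"},
    "aggregations": [{"type": "doubleSum", "name": "cost", "fieldName": "cost"}],
}

request = urllib.request.Request(
    BROKER_URL,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
print(request.get_method(), request.full_url)
```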
Back to real life
Data workflow
Real-time or Batch ingestion
Lambda architecture !
Drill, baby, drill !
Columnar store
Only reads relevant data
Iphone   Google  Computer  0.1€  08:12:37
Android  Yahoo   Cloth     0.2€  08:12:38
SELECT sum(cost) WHERE device = Iphone
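The idea can be sketched as one list per column, so the query above touches only `device` and `cost`. The slide gives no header row, so the other column names are guesses. This is an illustration, not Druid's actual storage layout.

```python
# Each column is stored as its own list; a row is a position shared by all lists.
# Only "device" and "cost" are named on the slide; the other names are guesses.
columns = {
    "device": ["Iphone", "Android"],
    "publisher": ["Google", "Yahoo"],
    "category": ["Computer", "Cloth"],
    "cost": [0.1, 0.2],
    "time": ["08:12:37", "08:12:38"],
}

def sum_cost_for_device(cols, device):
    """SELECT sum(cost) WHERE device = <device>, reading only two columns."""
    return sum(
        cost
        for dev, cost in zip(cols["device"], cols["cost"])
        if dev == device
    )

total = sum_cost_for_device(columns, "Iphone")
print(total)
```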
High compression
Values:   Wacken, Hellfest, Fall of Summer, Wacken, Hellfest
Encoded:  1, 2, 3, 1, 2
Metadata: Wacken => 1, Hellfest => 2, Fall of Summer => 3
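A minimal sketch of this dictionary encoding. IDs are assigned in order of first appearance, which matches the slide's example; a real store may order its dictionary differently.

```python
def dictionary_encode(values):
    """Replace each distinct string with a small integer ID (IDs start at 1)."""
    ids = {}
    encoded = []
    for value in values:
        if value not in ids:
            ids[value] = len(ids) + 1
        encoded.append(ids[value])
    return ids, encoded

festivals = ["Wacken", "Hellfest", "Fall of Summer", "Wacken", "Hellfest"]
metadata, encoded = dictionary_encode(festivals)
print(metadata)
print(encoded)
```

The strings are stored once in the metadata; the column itself holds only small integers, which compress well.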
Inverted index
Rows:            Wacken, Hellfest, Fall of Summer, Wacken, Hellfest
Wacken           1,0,0,1,0
Hellfest         0,1,0,0,1
Fall of Summer   0,0,1,0,0
Fast binary operations
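A sketch of the inverted index using Python integers as bitmaps, with one bit per row in the order the festivals are listed. Real engines use compressed bitmap formats; plain integers are a simplification.

```python
def build_bitmaps(values):
    """Map each distinct value to an int whose bit i is set when row i holds it."""
    bitmaps = {}
    for row, value in enumerate(values):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

def matching_rows(bitmap, row_count):
    """List the row positions whose bit is set."""
    return [row for row in range(row_count) if (bitmap >> row) & 1]

rows = ["Wacken", "Hellfest", "Fall of Summer", "Wacken", "Hellfest"]
index = build_bitmaps(rows)

# "festival = Wacken OR festival = Hellfest" is a single bitwise OR.
either = index["Wacken"] | index["Hellfest"]
print(matching_rows(either, len(rows)))
```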
Don’t !
« … WHERE value >= 10 »
cardinality 1500
1490 binary OR…
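Why the range filter hurts: a sketch, assuming the dimension holds the 1500 distinct values 0 to 1499 (an assumption chosen to match the slide's arithmetic) and a hypothetical row count. Every value that satisfies the filter contributes one bitmap to the OR.

```python
from functools import reduce

CARDINALITY = 1500  # assumed distinct values: 0 .. 1499
ROWS = 3000         # hypothetical row count

# Row i holds value i % CARDINALITY; one bitmap (an int) per distinct value.
bitmaps = {}
for row in range(ROWS):
    value = row % CARDINALITY
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row)

# WHERE value >= 10: every matching value contributes one bitmap to the OR.
selected = [bitmaps[v] for v in bitmaps if v >= 10]
result = reduce(lambda a, b: a | b, selected)

print(len(selected), "bitmaps combined")
print(bin(result).count("1"), "rows matched")
```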
Userflow
Unique user count ain’t only an interview exercise !
Sketching algorithms
HyperLogLog
• Approximate unique count
• Extreme storage reduction
• Constant time computation
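A simplified HyperLogLog sketch to make the three points concrete. It assumes a SHA-1 based 64-bit hash and 4096 registers, and it leaves out the refinements of production implementations. It is not Druid's implementation.

```python
import hashlib
import math

class HyperLogLog:
    """Simplified HyperLogLog: 2**p small registers, whatever the input size."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash taken from the first 8 bytes of SHA-1.
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        index = h >> (64 - self.p)              # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)   # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # position of leftmost 1-bit
        self.registers[index] = max(self.registers[index], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)  # small-range correction
        return raw

hll = HyperLogLog()
for i in range(50_000):
    hll.add(f"user-{i}")
    hll.add(f"user-{i}")  # a repeat visit does not change the estimate

estimate = hll.count()
print(round(estimate), "estimated unique users out of 50000")
```

Each `add` costs one hash and one register update, and the sketch keeps 4096 registers instead of 50 000 user IDs. With this many registers the typical error is on the order of 1 to 2 percent.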
Live Demo
aggregated rows
months of data
rows
nodes
400B
7
2B
12
Performance
• No downtime in 6 months
• Aggregate displays, clicks, sales & revenue generated for our biggest advertiser, grouped by device, over 7 months = 197 ms
Performance
According to Metamarkets
• 33M rows per second per core
• Scaled up to 26B rows per second
• 10k events per second ingestion per node
What’s wrong ?
• Be careful with your data model
• Immutable is... immutable !
• No joins, no full SQL capabilities
• A couple of bugs... but a very active and friendly team !
Happy, so far !
http://www.criteo.com/careers/
Looking for talents

Grégory Letribot - Druid at Criteo - NoSQL matters 2015