Building Data Applications with Apache Druid

October 2020
Gian Merlino, Cofounder & CTO, Imply

Who am I?
Gian Merlino
Committer & PMC chair at Apache Druid
Cofounder at Imply (we’re hiring!)

Druid in the wild
100+ billion rows/day
1+ trillion rows, 1+ year retained
100s of servers
sub-second to few seconds query latency
mix of streaming and batch ingest

Online DB pattern
[Diagram: data sources feed a data lake, either directly ("direct to lake") or through a data mover (Apache Kafka, Apache Airflow, etc.). The data lake is pure storage, read by query engines. An online DB stores an optimized copy of the data and serves the online app.]

Druid as an online DB
● Scale-out, fault-tolerant architecture
● No downtime for software updates
● No downtime for data management
● Heavily optimized storage format
● Integrated storage format and query engine

Data apps
● Interactive query speeds
● Always online
● Fresh data from streams
● Quality of service
● Price/performance

Interactive query speeds
● Secondary indexes
● Compression
● Operate on compressed data
● Late materialization

DICT: DC = 0, LA = 1, SF = 2
DATA: 0, 0, 0, 1, 1, 2, 2, 2
INDEX:
[0,1,2] (11100000)
[3,4] (00011000)
[5,6,7] (00000111)
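The DICT, DATA, and INDEX structures above can be reproduced with a short Python sketch. This is only an illustration of the idea, not Druid's actual storage code: the function names are made up for this example, and the bitmaps are plain strings rather than compressed bitmaps.

```python
def dictionary_encode(values):
    """Assign ids in sorted order and return (dictionary, encoded column)."""
    dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
    return dictionary, [dictionary[v] for v in values]


def build_bitmap_index(encoded, cardinality):
    """One bitmap per dictionary id; character r is '1' when row r holds that id."""
    return [
        "".join("1" if e == i else "0" for e in encoded)
        for i in range(cardinality)
    ]


# Raw column values implied by the DICT and DATA shown on the slide.
city = ["DC", "DC", "DC", "LA", "LA", "SF", "SF", "SF"]

dictionary, data = dictionary_encode(city)
index = build_bitmap_index(data, len(dictionary))

print(dictionary)  # DICT
print(data)        # DATA
print(index)       # INDEX
```

Each bitmap has one bit per row, so a filter such as "city is LA" can be answered from the index alone, without scanning the DATA array.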

Interactive query speeds
Example table, stored column by column:

__time (LONG)
DATA: 1293840000000 (repeated for all 8 rows)

artist (STRING)
DICT: Justin = 0, Ke$ha = 1, Miley = 2
DATA: 0, 0, 0, 1, 1, 2, 2, 2
INDEX:
[0,1,2] (11100000)
[3,4] (00011000)
[5,6,7] (00000111)

city (STRING)
DICT: DC = 0, LA = 1, SF = 2
DATA: 1, 2, 1, 1, 0, 0, 0
INDEX:
[0,2] (10100000)
[1,3,4] (01011000)
[5,6,7] (00000111)

count (LONG), price (LONG)
DATA: 25, 42, 17, 170, 112, 67, 53, 94
DATA: 1800, 2912, 1953, 3194, 5690, 1100, 8423, 9080

String columns: dictionary encoded (sorted), with a bitmap index (stored compressed).
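To show why bitmap indexes and late materialization speed up filtered queries, here is a minimal Python sketch using the artist and city bitmaps from the slide. It is an assumption-laden illustration rather than Druid's query engine: the helper names are hypothetical, the bitmaps are uncompressed integers, and `long_column` simply reuses the first run of LONG values shown on the slide.

```python
ROWS = 8  # number of rows in the example segment


def bitmap(bits):
    """Parse a bit string whose leftmost character is row 0."""
    return int(bits, 2)


artist_dict = {"Justin": 0, "Ke$ha": 1, "Miley": 2}
artist_index = [bitmap("11100000"), bitmap("00011000"), bitmap("00000111")]

city_dict = {"DC": 0, "LA": 1, "SF": 2}
city_index = [bitmap("10100000"), bitmap("01011000"), bitmap("00000111")]

# One of the LONG columns from the slide (which label it carries is assumed).
long_column = [25, 42, 17, 170, 112, 67, 53, 94]


def rows_of(bm):
    """Row numbers whose bit is set in the bitmap."""
    return [r for r in range(ROWS) if (bm >> (ROWS - 1 - r)) & 1]


def sum_where(artist, city):
    """Filter with a bitmap AND, then read only the matching rows."""
    matched = artist_index[artist_dict[artist]] & city_index[city_dict[city]]
    return sum(long_column[r] for r in rows_of(matched))


print(sum_where("Ke$ha", "LA"))
print(sum_where("Justin", "DC"))
```

The filter is resolved entirely on the indexes. The numeric column is touched only for rows that survive the AND, which is the essence of late materialization.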

Always online
[Diagram: Druid cluster architecture]
Master server: Coordinator, Apache ZooKeeper
Query server: Broker
Data server: Historical, Indexer (multiple instances of each)
Deep storage
Inputs: streaming data, batch data

Fresh data from streams
[Diagram: Druid cluster architecture]
Master server: Coordinator, Apache ZooKeeper
Query server: Broker
Data server
Deep storage
Inputs: streaming data, batch data
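As a rough illustration of how streaming data might be wired into the cluster, the Python sketch below builds a Kafka supervisor spec as a dictionary. The layout follows my understanding of Druid's documented Kafka ingestion spec, and accepted fields can differ between Druid versions. The datasource, topic, broker address, and timestamp column name are hypothetical, and the dimensions are borrowed from the earlier example table.

```python
import json


def kafka_supervisor_spec(datasource, topic, bootstrap_servers):
    """Build a minimal Kafka supervisor spec (all values are illustrative)."""
    return {
        "type": "kafka",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Assumes each event carries a millisecond epoch "timestamp" field.
                "timestampSpec": {"column": "timestamp", "format": "millis"},
                "dimensionsSpec": {"dimensions": ["artist", "city"]},
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "HOUR",
                    "queryGranularity": "NONE",
                },
            },
            "ioConfig": {
                "topic": topic,
                "inputFormat": {"type": "json"},
                "consumerProperties": {"bootstrap.servers": bootstrap_servers},
            },
            "tuningConfig": {"type": "kafka"},
        },
    }


# Hypothetical names; replace with real values for an actual cluster.
spec = kafka_supervisor_spec("plays", "plays-topic", "localhost:9092")
payload = json.dumps(spec, indent=2)
print(payload)
```

A payload like this would typically be submitted with an HTTP POST to the supervisor endpoint (`/druid/indexer/v1/supervisor`). The sketch stops at building the JSON so that it runs without a cluster.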

Quality of service
[Diagram: Druid cluster architecture]
Master server: Coordinator, Apache ZooKeeper
Query server: Broker
Data server
Deep storage
Inputs: streaming data, batch data

Price/performance
vs. leading open source SQL engines
Data sourced from: Correia, José & Costa, Carlos & Santos, Maribel. (2019). Challenging SQL-on-Hadoop Performance with Apache Druid.

Price/performance
vs. a leading cloud-based data warehouse
Data sourced from Imply benchmarks.

Stay in touch
Follow the Druid project on Twitter!
@druidio
Join the community
(Mailing lists, Slack, meetups)
https://druid.apache.org/community/

Time for questions
@gianmerlino
Thank you!
Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.
