Analytics over terabytes
April 15, 2020
Swapnesh Gandhi
Sr. Software Engineer at Twitter
@RealSwapneshG
MoPub
MoPub Analytics
Infrastructure
Lookups
Monitoring
Tiering
Experimenting
Coordinator issues
Security and retention
MoPub
Real time ad-exchange for mobile publishers
> 40k Publishers
> 200 Bidders
> 200 TB/day of raw datasets; 1.7 TB/day of aggregated data
> 700 TB of data in Druid
MoPub Analytics
Metamarkets
Druid vs other solutions
Significant to our business
First production cluster at Twitter; helping wider adoption
Infrastructure
On-prem data center, running on Apache Mesos
Component    # of nodes  Mesos      CPUs  RAM  Disk   SSD
Broker       12          Shared     32    130  15 GB
Coordinator  1           Shared     32    64   10 GB
Router       8           Shared     16    20   15 GB
Historicals  700         Dedicated  80    373  15 GB  4 TB (NVMe)
Infrastructure
Historicals
Host 13 months of data in the cluster
Fast tier
Most recent 2 weeks of data.
1:1 SSD to RAM ratio to achieve low latency
2 replicas
Slow tier
All data except the most recent 2 weeks
5:1 SSD to RAM ratio
2 replicas
Node size vs cluster size
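
A minimal sketch of how the two tiers above could be expressed as Druid coordinator load rules. The tier names "fast" and "slow" are assumptions (real tier names come from the tier setting on each Historical), and the deck does not show the actual rules used. Rules are evaluated top to bottom and the first match wins, so the 2-week rule has to come before the 13-month rule.

```python
import json

# Hypothetical tier names, not taken from the deck.
FAST_TIER = "fast"
SLOW_TIER = "slow"


def build_retention_rules(fast_period="P2W", total_period="P13M", replicas=2):
    """Return coordinator rules in evaluation order (first match wins)."""
    return [
        # Most recent 2 weeks: 2 replicas on the fast tier.
        {
            "type": "loadByPeriod",
            "period": fast_period,
            "tieredReplicants": {FAST_TIER: replicas},
        },
        # Everything else inside the 13-month window: 2 replicas on the slow tier.
        {
            "type": "loadByPeriod",
            "period": total_period,
            "tieredReplicants": {SLOW_TIER: replicas},
        },
        # Anything older is no longer served by Historicals.
        {"type": "dropForever"},
    ]


rules = build_retention_rules()
print(json.dumps(rules, indent=2))
```

The resulting list is the kind of payload the coordinator accepts as the rule set for a datasource; periods are ISO 8601 durations.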
Lookups
Query time lookups
Data
Id -> name mapping
Many to one mapping
About 15 total lookups
4 large lookups > 8m rows and GBs of data
GC Issues during lookup reloads
G1 collector
Incremental load
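
To make the idea concrete, here is a small sketch of a many-to-one id -> name lookup that is updated with deltas instead of being rebuilt on every reload. It illustrates the concept only: the class, method names, and sample ids are hypothetical, and the deck does not describe how the incremental load was actually implemented in Druid. The motivation is that rebuilding a map with millions of rows creates a large burst of short-lived objects, while a delta touches only the changed keys.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


class IncrementalLookup:
    """In-memory id -> name map that accepts deltas instead of full reloads."""

    def __init__(self, initial: Optional[Dict[str, str]] = None) -> None:
        self._names: Dict[str, str] = dict(initial or {})

    def apply_delta(
        self,
        upserts: Iterable[Tuple[str, str]] = (),
        deletes: Iterable[str] = (),
    ) -> None:
        # Only changed keys are touched; untouched entries stay in place.
        for key, name in upserts:
            self._names[key] = name
        for key in deletes:
            self._names.pop(key, None)

    def resolve(self, key: str, default: Optional[str] = None) -> Optional[str]:
        return self._names.get(key, default)


def count_by_name(ids: Iterable[str], lookup: IncrementalLookup) -> Counter:
    """Resolve ids at query time and group by name (many ids may share a name)."""
    return Counter(lookup.resolve(i, "unknown") for i in ids)


# Two ids map to the same name: a many-to-one mapping.
lookup = IncrementalLookup({"pub-1": "Acme Games", "pub-2": "Acme Games"})
lookup.apply_delta(upserts=[("pub-3", "Example News")], deletes=["pub-2"])
counts = count_by_name(["pub-1", "pub-1", "pub-2", "pub-3"], lookup)
```

After the delta, "pub-2" no longer resolves and falls into the "unknown" bucket, while rows for "pub-1" still roll up under their shared name.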
Monitoring
Internal monitoring and alerting system
Good for finding issues
Track simple metrics such as CPU, Memory, GC
Latency
Not good for finding exact cause
Keep evolving
Monitoring
Imply Clarity
Druid cluster
Query latency
Broker queries
Historicals
Users
Usage of the platform
Tune configs
Tiering
Think in terms of use cases
Isolation vs shared resources
20% saved on infra costs
Experimenting
Performance tests
Running A/B tests in the cluster
Coordinator
druid.coordinator.loadqueuepeon.type
The default, curator, is single-threaded
http is multi-threaded
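
As a sketch, the switch amounts to one line in the coordinator's runtime properties. The helper below only renders that line and checks the value against the two types named on the slide; the function name is hypothetical, and any other tuning that may accompany the change is not covered by the deck.

```python
# The two load queue peon types named on the slide.
ALLOWED_PEON_TYPES = ("curator", "http")


def coordinator_peon_property(peon_type: str = "http") -> str:
    """Render the runtime.properties line that selects the load queue peon."""
    if peon_type not in ALLOWED_PEON_TYPES:
        raise ValueError(f"unsupported load queue peon type: {peon_type!r}")
    return f"druid.coordinator.loadqueuepeon.type={peon_type}"


line = coordinator_peon_property()
print(line)
```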
Security & retention
mTLS
Stripping dimensions after 30 days
Manage retention through Druid kill tasks
Deep storage backups
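
A minimal sketch of what a retention kill task could look like. The datasource name and the retention window are assumptions for illustration; the deck does not show the actual task specs. A kill task permanently deletes segments in the given interval that have already been marked unused, and the spec is submitted to the Overlord task endpoint (POST /druid/indexer/v1/task).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def build_kill_task(
    data_source: str, retain_days: int, now: Optional[datetime] = None
) -> dict:
    """Build a kill task spec covering everything older than the retention window."""
    now = now or datetime.now(timezone.utc)
    # Truncate the cutoff to midnight UTC so the interval ends on a day boundary.
    cutoff = (now - timedelta(days=retain_days)).strftime("%Y-%m-%dT00:00:00Z")
    return {
        "type": "kill",
        "dataSource": data_source,
        # ISO 8601 interval: start/end.
        "interval": f"1970-01-01T00:00:00Z/{cutoff}",
    }


# "mopub_events" is a hypothetical datasource name.
task = build_kill_task(
    "mopub_events", retain_days=30, now=datetime(2020, 4, 15, tzinfo=timezone.utc)
)
```

Because deletion from deep storage cannot be undone, this is where the deep storage backups listed above come in.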
Summary
Lookups
Monitoring
Tiering
Experimenting
Coordinator issues
Security and retention
Thank you.
Time for questions
Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.

Analytics over Terabytes of Data at Twitter