Analytics over Terabytes of Data at Twitter

Analytics over terabytes
April 15, 2020
Swapnesh Gandhi
Sr. Software Engineer at Twitter
@RealSwapneshG

MoPub
MoPub Analytics
Infrastructure
Lookups
Monitoring
Tiering
Experimenting
Coordinator issues
Security and retention

MoPub
Real-time ad exchange for mobile publishers
> 40k Publishers
> 200 Bidders
> 200 TB/day of raw datasets; 1.7 TB/day of aggregated data
> 700 TB of data in Druid

MoPub Analytics
Metamarkets
Druid vs other solutions
Significant to our business
First production cluster at Twitter; helping wider adoption

Infrastructure
On-prem data center using Apache Mesos
Component     # of nodes   Mesos       CPUs   RAM   Disk    SSD
Broker        12           Shared      32     130   15 GB   -
Coordinator   1            Shared      32     64    10 GB   -
Router        8            Shared      16     20    15 GB   -
Historicals   700          Dedicated   80     373   15 GB   4 TB (NVMe)

Infrastructure
Historicals
Host 13 months of data in the cluster
Fast tier
Most recent 2 weeks of data.
1:1 SSD to RAM ratio to achieve low latency
2 replicas
Slow tier
All data except the most recent 2 weeks
5:1 SSD to RAM ratio
2 replicas
Node size vs cluster size
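
The fast/slow split above could be expressed as Druid retention (load) rules. The sketch below builds such a rule list in Python and prints it as JSON. The tier names "fast" and "slow" are hypothetical, and the rule set is an assumption reconstructed from the bullets (2 weeks on the fast tier, 13 months in total, 2 replicas each), not the configuration actually used at Twitter. Druid evaluates rules in order and applies the first one that matches a segment.

```python
import json

# Hypothetical tier names; the slides only call them "fast" and "slow".
FAST_TIER = "fast"
SLOW_TIER = "slow"

retention_rules = [
    # Most recent 2 weeks: 2 replicas on the fast tier.
    {"type": "loadByPeriod", "period": "P2W", "includeFuture": True,
     "tieredReplicants": {FAST_TIER: 2}},
    # Everything else up to 13 months old: 2 replicas on the slow tier.
    {"type": "loadByPeriod", "period": "P13M", "includeFuture": True,
     "tieredReplicants": {SLOW_TIER: 2}},
    # Anything older than 13 months is dropped from the cluster.
    {"type": "dropForever"},
]

print(json.dumps(retention_rules, indent=2))
```

Because the first matching rule wins, segments from the last 2 weeks never reach the slow-tier rule even though they also fall inside the 13-month period.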

Lookups
Monitoring
Tiering
Experimenting
Coordinator issues

Lookups
Query time lookups
Data
Id -> name mapping
Many to one mapping
About 15 total lookups
4 large lookups > 8m rows and GBs of data
GC Issues during lookup reloads
G1 collector
Incremental load
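
To illustrate the difference between a full reload and an incremental load, here is a minimal Python sketch of a many-to-one id -> name lookup that is refreshed by applying deltas. It is an illustration only, not Druid's lookup implementation: the class name, method names, and sample ids are all hypothetical. The assumed benefit is that a delta touches only the changed entries, whereas a full reload of a multi-GB map allocates a second copy while the old one is still live, which plausibly explains the GC pressure mentioned above.

```python
from typing import Dict, Iterable, Optional, Tuple


class IncrementalLookup:
    """In-memory id -> name lookup refreshed by deltas (illustrative only)."""

    def __init__(self) -> None:
        self._names: Dict[str, str] = {}

    def apply_delta(
        self,
        upserts: Iterable[Tuple[str, str]],
        deletes: Iterable[str] = (),
    ) -> None:
        # Only changed keys are touched; the rest of the map stays in place.
        for key, name in upserts:
            self._names[key] = name
        for key in deletes:
            self._names.pop(key, None)

    def resolve(self, key: str, default: Optional[str] = None) -> Optional[str]:
        # Query-time resolution: many ids may map to the same name.
        return self._names.get(key, default)


lookup = IncrementalLookup()
# Initial load: two ids share one name (many-to-one).
lookup.apply_delta([("pub-1", "Acme Games"), ("pub-2", "Acme Games"), ("pub-3", "Zen Apps")])
# Later refresh: one rename and one removal, nothing else is rewritten.
lookup.apply_delta([("pub-3", "Zen Studio")], deletes=["pub-2"])
```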

Monitoring
Internal monitoring and alerting system
Good for finding issues
Track simple metrics such as CPU, Memory, GC
Latency
Not good for finding exact cause
Keep evolving

Monitoring
Imply Clarity
Druid cluster
Query latency
Broker queries
Historicals
Users
Usage of the platform
Tune configs

Tiering
Think in terms of use cases
Isolation vs shared resources
20% saved on infra costs

Experimenting
Performance tests
Running A/B tests in the cluster

Coordinator
druid.coordinator.loadqueuepeon.type
The default, curator, is single-threaded
http is multi-threaded
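
As a small sketch of the switch described above, the Python snippet below renders the Coordinator override as a properties line. The property name and the value http come from the slide; the helper function is hypothetical, and placing the line in the Coordinator's runtime.properties file is an assumption about a typical Druid deployment.

```python
from typing import Dict


def render_properties(props: Dict[str, str]) -> str:
    """Render key/value pairs as Java-style properties lines (hypothetical helper)."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Move segment load/drop assignment from the curator peon to the http peon.
coordinator_overrides = {"druid.coordinator.loadqueuepeon.type": "http"}

rendered = render_properties(coordinator_overrides)
print(rendered)
```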

Security & retention
mTLS
Stripping dimensions after 30 days
Manage retention through Druid kill tasks
Deep storage backups
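
Retention through kill tasks can be sketched as follows. A Druid kill task permanently deletes segments that are already marked unused within an interval, so it is usually paired with a drop rule. The snippet only builds the task payload; the datasource name and dates are hypothetical, and submitting the payload (typically by POSTing it to the Overlord's /druid/indexer/v1/task endpoint) is left out.

```python
import json
from datetime import date
from typing import Dict


def kill_task_payload(datasource: str, start: date, end: date) -> Dict[str, str]:
    """Build a kill task spec for unused segments in [start, end)."""
    return {
        "type": "kill",
        "dataSource": datasource,
        "interval": f"{start.isoformat()}/{end.isoformat()}",
    }


# Hypothetical datasource and interval, for illustration only.
payload = kill_task_payload("mopub_events", date(2019, 1, 1), date(2019, 2, 1))
print(json.dumps(payload, indent=2))
```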

Summary
Lookups
Monitoring
Tiering
Experimenting
Coordinator issues

Time for questions
Apache Druid is an independent project of The Apache Software Foundation. More information can be found at https://druid.apache.org.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.

