5. Twitter Data Analytics: Scale
● Storage capacity: ~1 exabyte (>1 EB)
● Data read and written daily: >100 PB
● Several Hadoop clusters with >10K nodes
● Analytic jobs running on the Data Platform per day: >50K
7. Storage @DataPlatform
● Apache HDFS for storage
● DAL for metadata management
● Replication service for cross-cluster copies
● Retention service for expiry
[Diagram: Hadoop Distributed File System, Replication Service, Retention Service, DAL (Metadata Service)]
8. HDFS @DataPlatform
Twitter DataCenter:
● Log pipeline and microservices generate >1.5 trillion events every day
● Real Time Cluster (incoming storage): produces >4 PB per day
● Production Cluster: production jobs process hundreds of PB per day
● Ad hoc Cluster: executes tens of thousands of ad hoc query jobs per day
● Cold Storage: cold/backup data, hundreds of PBs
9. Data Access Layer
● A dataset has a logical name and one or more physical locations
● Users/tools such as Scalding, Presto, and Hive query DAL for available hourly partitions
● A dataset has hourly/daily partitions in DAL
● DAL also stores various properties, such as owner, schema, and location, with datasets
* https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
10. Dataset defined in DAL
Find a dataset by logical name:
$dal logical-dataset list --role hadoop --name logs.partly-cloudy
| 4031 | http://dallds/401 | hadoop | Prod | logs.partly-cloudy | Active |
List all physical locations:
$dal physical-dataset list --role hadoop --name logs.partly-cloudy
| 26491 | http://dalpds/26491 | dw | viewfs://hadoop-dw-nn/logs/partly-cloudy/yyyy/mm/dd/hh |
| 41065 | http://dalpds/41065 | cold | viewfs://hadoop-cold-nn/partly-cloudy/yyyy/mm/dd/hh |
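As an illustration of how a client could expand a DAL physical location into a concrete hourly partition path, here is a minimal Python sketch; the yyyy/mm/dd/hh template follows the CLI output above, while the function is a hypothetical stand-in, not a real DAL API:

from datetime import datetime

def partition_path(template: str, hour: datetime) -> str:
    """Expand a DAL-style yyyy/mm/dd/hh path template for one hourly partition."""
    return (template
            .replace("yyyy", f"{hour.year:04d}")
            .replace("mm", f"{hour.month:02d}")
            .replace("dd", f"{hour.day:02d}")
            .replace("hh", f"{hour.hour:02d}"))

# The "dw" physical location of logs.partly-cloudy, for the 2019/09/10/03 partition:
print(partition_path("viewfs://hadoop-dw-nn/logs/partly-cloudy/yyyy/mm/dd/hh",
                     datetime(2019, 9, 10, 3)))
# -> viewfs://hadoop-dw-nn/logs/partly-cloudy/2019/09/10/03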
14. Large Data Management on Cloud
● Storage system (Google Cloud Storage)
● Metadata Management
● Replication Service
● Retention Service
● User and Service management
● Data format and data pipeline
● Compute provisioning
● Networking and VPC
● Security/Key management
15. Storage @Cloud
● Google Cloud Storage for storage
● DAL for metadata management
● Replication for cloud cluster
● Supplementary Retention (SDRS) service for expiry
[Diagram: Google Cloud Storage, Replication Service, SDRS (Retention), DAL (Metadata Service)]
16. GCS
● Object store vs HDFS
○ We widely adopted the GCS connector to provide an HDFS-compatible API, so users could migrate their jobs/applications without code changes.
○ Semantic differences are handled case by case, e.g. rename is not atomic.
● Bucket design
○ Different orgs have different cloud projects.
○ We have one bucket per user and per log category (dataset).
○ We built a service to manage buckets, e.g. creation, ACL settings, etc.
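Because GCS has no atomic rename, the connector (and anything bypassing it) has to emulate one. A minimal sketch of that semantic difference using the google-cloud-storage Python client; the bucket and object names are hypothetical:

from google.cloud import storage

def rename_object(bucket_name: str, src_name: str, dst_name: str) -> None:
    """'Rename' a GCS object as copy + delete; unlike an HDFS rename this is
    not atomic, so a failure in between leaves both objects visible."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    src_blob = bucket.blob(src_name)
    bucket.copy_blob(src_blob, bucket, dst_name)  # copy to the new name
    src_blob.delete()                             # then drop the original

# Hypothetical usage:
# rename_object("logs-partly-cloudy", "2019/09/10/03/part-00000.lzo.tmp",
#               "2019/09/10/03/part-00000.lzo")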
22. Network setup for copy
[Diagram] Twitter DataCenter: Copy Cluster running DistCp and the Replicator: GCS, plus a proxy group; Twitter & Google private peering (PNI) carries the traffic to GCS; destination path, e.g. /gcs/logs/partly-cloudy/2019/09/10/03
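A minimal sketch of the copy step itself, assuming the GCS connector is installed on the copy cluster so DistCp can write to a gs:// destination; the paths and bucket name here are hypothetical:

import subprocess

# Hypothetical source partition and destination; the gs:// scheme is served
# by the GCS connector on the copy cluster.
src = "hdfs://hadoop-dw-nn/logs/partly-cloudy/2019/09/10/03"
dst = "gs://logs-partly-cloudy/logs/partly-cloudy/2019/09/10/03"

# -update copies only files that are missing or changed at the destination,
# so a failed or repeated copy can simply be re-run.
subprocess.run(["hadoop", "distcp", "-update", src, dst], check=True)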
23. Merge same dataset on GCS (Multi Region Bucket)
● Twitter DataCenter X-1: Source Cluster X-1 (/DC1/ClusterX-1/logs/partly-cloudy/2019/09/10/03) → Copy Cluster X-1 → DistCp
● Twitter DataCenter X-2: Source Cluster X-2 (/DC2/ClusterX-2/logs/partly-cloudy/2019/09/10/03) → Copy Cluster X-2 → DistCp
● Both copies land in the same Multi Region Bucket on Cloud Storage: /gcs/logs/partly-cloudy/2019/09/10/03
24. Merging and updating DAL
● Multiple Replicators copy the same dataset partition to the destination
● Each Replicator checks for availability of the data independently
● Each creates an individual _SUCCESS_<SRC> file
● DAL is updated when all _SUCCESS_<SRC> files are found
● Updates are idempotent
[Flowchart, per Replicator: compare src and dest → if a copy is needed, kick off a DistCp job; on success, or if the partition was already copied, check that ALL _SUCCESS_<SRC> files exist → yes: update DAL, done; no: let another instance update DAL. Each Replicator updates the partition independently.]
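A minimal sketch of that final check, assuming the partition lives in a GCS bucket and the set of source clusters is known; the marker naming, bucket name, and update_dal helper are hypothetical stand-ins for the real Replicator and DAL APIs:

from google.cloud import storage

SOURCES = ["ClusterX-1", "ClusterX-2"]  # hypothetical set of source clusters

def update_dal(partition: str) -> None:
    """Hypothetical stand-in for the idempotent DAL partition registration."""
    print(f"registering {partition} in DAL")

def maybe_update_dal(bucket_name: str, partition: str) -> bool:
    """Update DAL only once every Replicator has written its success marker."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # Each Replicator drops _SUCCESS_<SRC> once its copy of the partition is complete.
    markers = [bucket.blob(f"{partition}/_SUCCESS_{src}") for src in SOURCES]
    if not all(marker.exists() for marker in markers):
        return False       # a copy is still in flight; another instance will update DAL
    update_dal(partition)  # idempotent, so concurrent updaters are harmless
    return True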
25. Dataset via EagleEye
● View the different destinations for the same dataset
● GCS is just another destination
● Also shows the delay for each hourly partition
26. SDRS (retention in cloud)
OLM (Object Lifecycle Management) on GCS supports rules based on age, modification time, etc. However, OLM does not support:
● Dataset-based retention.
○ E.g. GDPR requires /logs/partly-cloudy/2019/09/10/21 to be scrubbed 30 days after generation, rather than 30 days after its creation time on GCS.
○ OLM wouldn't notify you on deletion, so it's hard to keep the dataset in sync with DAL.
● Soft delete (a trash feature), which is impossible without versioning.
○ The trash feature has saved us multiple times...
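For contrast, a minimal sketch of what OLM does cover: an age-based delete rule set with the google-cloud-storage Python client (the bucket name is hypothetical). The rule keys off object creation time and fires silently, which is exactly the gap SDRS fills:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("logs-partly-cloudy")  # hypothetical bucket

# OLM rule: delete objects 30 days after their GCS creation time.
# It cannot key off the generation time encoded in the partition path,
# and GCS does not notify us when the deletion actually happens.
bucket.add_lifecycle_delete_rule(age=30)
bucket.patch()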
27. SDRS Architecture
[Architecture diagram: Client, REST API, Pub/Sub, Service Interface, Event Service, Internal Queue, Validation, Execution, Configuration, Config, Retention Scheduler, Event Handling, Notification, Storage]
● Open source and cloud native.
● Supports retention rules:
○ Delete marker for on-demand deletes.
○ Dataset rule.
○ Bucket default rule.
● Soft delete:
○ Moves data from one bucket to another bucket.
○ Pluggable engine; simple transfer and Storage Transfer supported.
● REST API and event notifications:
○ REST API to control SDRS.
○ SDRS generates and sends events to a pub/sub system on deletion and trashing.
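As a rough illustration of the notification path, a minimal sketch of publishing a deletion event to Pub/Sub with the google-cloud-pubsub Python client; the project, topic, and event payload are hypothetical and not SDRS's actual schema:

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-cloud-project", "sdrs-retention-events")

# Hypothetical event body describing a trashed partition.
event = {
    "action": "trashed",
    "path": "gs://logs-partly-cloudy/logs/partly-cloudy/2019/09/10/01",
}
# publish() returns a future; result() blocks until the message is accepted.
publisher.publish(topic_path, json.dumps(event).encode("utf-8")).result()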
28. Twitter GCS Retention Management
● We have one SDRS service stack per org.
● Trash buckets: each bucket has a trash bucket, and deleted data is moved to the trash bucket before it is gone.
○ E.g. /gcs/logs/partly-cloudy will have /gcs/logs/partly-cloudy-trash acting as its trash.
● Partitions over objects: we configure retention for dataset partitions based on the time built into the path.
○ E.g. /gcs/logs/partly-cloudy/2019/09/10/01 with 3 days of retention will be removed on 2019/09/13/01.
○ We drop the partition from DAL when SDRS removes the data, so DAL is always in sync.
[Diagram: Retention Config Manager, DAL, DAL Sync Manager, SDRS Service, Pub/Sub Topics, Cloud SQL, Trash Buckets, GCS Buckets]
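A minimal sketch of partition-based retention with a trash bucket, assuming the partition time is encoded in the object prefix as yyyy/mm/dd/hh; the bucket names, the fixed 3-day retention, and the final DAL step are hypothetical:

from datetime import datetime, timedelta, timezone
from google.cloud import storage

RETENTION = timedelta(days=3)  # hypothetical per-dataset retention

def expire_partition(client: storage.Client, bucket_name: str, prefix: str) -> None:
    """Soft-delete one hourly partition whose generation time is built into its path,
    e.g. prefix = "logs/partly-cloudy/2019/09/10/01"."""
    y, m, d, h = prefix.rstrip("/").split("/")[-4:]
    generated = datetime(int(y), int(m), int(d), int(h), tzinfo=timezone.utc)
    if datetime.now(timezone.utc) < generated + RETENTION:
        return  # partition has not expired yet

    src = client.bucket(bucket_name)
    trash = client.bucket(bucket_name + "-trash")  # the bucket's trash bucket
    for blob in list(client.list_blobs(src, prefix=prefix)):
        src.copy_blob(blob, trash, blob.name)  # move into trash first...
        blob.delete()                          # ...then remove the original
    # The real service would now drop the partition from DAL (and publish an event),
    # keeping DAL in sync with the data.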
29. We described Twitter's data storage architecture and presented our solution for managing large data in the Cloud.