5. Twitter Data Analytics: Scale
● Storage capacity: ~1 exabyte (>1 EB)
● Data read and written daily: >100 PB
● Several Hadoop clusters with >10K nodes
● Analytic jobs running on the Data Platform per day: >50K
7. Storage @DataPlatform
● Apache HDFS for storage
● DAL for metadata management
● Replication service for cross-cluster copies
● Retention service for expiry
[Diagram: Hadoop Distributed File System, Replication Service, Retention Service, DAL (Metadata Service)]
8. HDFS @DataPlatform
Twitter DataCenter:
● Log pipeline and microservices generate >1.5 trillion events every day
● Real Time Cluster (incoming storage): produces >4 PB per day
● Production Cluster: production jobs process hundreds of PB per day
● Ad hoc Cluster: executes tens of thousands of ad hoc query jobs per day
● Cold Storage: cold/backup data, hundreds of PBs
9. Data Access Layer
● A dataset has a logical name and one or more physical locations
● Users/tools such as Scalding, Presto, and Hive query DAL for available hourly partitions
● A dataset has hourly/daily partitions in DAL
● DAL also stores various properties, such as owner, schema, and location, with datasets
* https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
10. Dataset defined in DAL
Find a dataset by logical name:
$dal logical-dataset list --role hadoop --name logs.partly-cloudy
| 4031 | http://dallds/401 | hadoop | Prod | logs.partly-cloudy | Active |
List all physical locations:
$dal physical-dataset list --role hadoop --name logs.partly-cloudy
| 26491 | http://dalpds/26491 | dw | viewfs://hadoop-dw-nn/logs/partly-cloudy/yyyy/mm/dd/hh |
| 41065 | http://dalpds/41065 | cold | viewfs://hadoop-cold-nn/partly-cloudy/yyyy/mm/dd/hh |
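As an illustration of how a client could expand a DAL physical location into a concrete hourly partition path, here is a minimal Python sketch; the yyyy/mm/dd/hh template follows the CLI output above, while the function is a hypothetical stand-in, not a real DAL API:

from datetime import datetime

def partition_path(template: str, hour: datetime) -> str:
    """Expand a DAL-style yyyy/mm/dd/hh path template for one hourly partition."""
    return (template
            .replace("yyyy", f"{hour.year:04d}")
            .replace("mm", f"{hour.month:02d}")
            .replace("dd", f"{hour.day:02d}")
            .replace("hh", f"{hour.hour:02d}"))

# The "dw" physical location of logs.partly-cloudy, for the 2019/09/10/03 partition:
print(partition_path("viewfs://hadoop-dw-nn/logs/partly-cloudy/yyyy/mm/dd/hh",
                     datetime(2019, 9, 10, 3)))
# -> viewfs://hadoop-dw-nn/logs/partly-cloudy/2019/09/10/03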
14. Large Data Management on Cloud
● Storage system (Google Cloud Storage)
● Metadata Management
● Replication Service
● Retention Service
● User and Service management
● Data format and data pipeline
● Compute provisioning
● Networking and VPC
● Security/Key management
15. Storage @Cloud
● Google Cloud Storage for storage
● DAL for metadata management
● Replication for cloud cluster
● Supplementary Retention (SDRS) service for expiry
[Diagram: Google Cloud Storage, Replication Service, SDRS (Retention), DAL (Metadata Service)]
16. GCS
● Object store vs HDFS
○ We widely adopted the GCS connector to provide an HDFS-compatible API, so users could migrate their jobs/applications without code changes.
○ Semantic differences are handled case by case, e.g. rename is not atomic.
● Bucket design
○ Different orgs have different cloud projects.
○ We have one bucket per user and per log category (dataset).
○ We built a service to manage buckets, e.g. creation, ACL settings, etc.
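Because GCS has no atomic rename, the connector (and anything bypassing it) has to emulate one. A minimal sketch of that semantic difference using the google-cloud-storage Python client; the bucket and object names are hypothetical:

from google.cloud import storage

def rename_object(bucket_name: str, src_name: str, dst_name: str) -> None:
    """'Rename' a GCS object as copy + delete; unlike an HDFS rename this is
    not atomic, so a failure in between leaves both objects visible."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    src_blob = bucket.blob(src_name)
    bucket.copy_blob(src_blob, bucket, dst_name)  # copy to the new name
    src_blob.delete()                             # then drop the original

# Hypothetical usage:
# rename_object("logs-partly-cloudy", "2019/09/10/03/part-00000.lzo.tmp",
#               "2019/09/10/03/part-00000.lzo")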
22. Network setup for copy
[Diagram] Twitter DataCenter: Copy Cluster running DistCp and the Replicator: GCS, plus a proxy group; Twitter & Google private peering (PNI) carries the traffic to GCS; destination path, e.g. /gcs/logs/partly-cloudy/2019/09/10/03
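A minimal sketch of the copy step itself, assuming the GCS connector is installed on the copy cluster so DistCp can write to a gs:// destination; the paths and bucket name here are hypothetical:

import subprocess

# Hypothetical source partition and destination; the gs:// scheme is served
# by the GCS connector on the copy cluster.
src = "hdfs://hadoop-dw-nn/logs/partly-cloudy/2019/09/10/03"
dst = "gs://logs-partly-cloudy/logs/partly-cloudy/2019/09/10/03"

# -update copies only files that are missing or changed at the destination,
# so a failed or repeated copy can simply be re-run.
subprocess.run(["hadoop", "distcp", "-update", src, dst], check=True)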
23. Merge same dataset on GCS (Multi Region Bucket)
● Twitter DataCenter X-1: Source Cluster X-1 (/DC1/ClusterX-1/logs/partly-cloudy/2019/09/10/03) → Copy Cluster X-1 → DistCp
● Twitter DataCenter X-2: Source Cluster X-2 (/DC2/ClusterX-2/logs/partly-cloudy/2019/09/10/03) → Copy Cluster X-2 → DistCp
● Both copies land in the same Multi Region Bucket on Cloud Storage: /gcs/logs/partly-cloudy/2019/09/10/03
24. Merging and updating DAL
● Multiple Replicators copy the same dataset partition to the destination
● Each Replicator checks for availability of the data independently
● Each creates an individual _SUCCESS_<SRC> file
● DAL is updated when all _SUCCESS_<SRC> files are found
● Updates are idempotent
[Flowchart, per Replicator: compare src and dest → if a copy is needed, kick off a DistCp job; on success, or if the partition was already copied, check that ALL _SUCCESS_<SRC> files exist → yes: update DAL, done; no: let another instance update DAL. Each Replicator updates the partition independently.]
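A minimal sketch of that final check, assuming the partition lives in a GCS bucket and the set of source clusters is known; the marker naming, bucket name, and update_dal helper are hypothetical stand-ins for the real Replicator and DAL APIs:

from google.cloud import storage

SOURCES = ["ClusterX-1", "ClusterX-2"]  # hypothetical set of source clusters

def update_dal(partition: str) -> None:
    """Hypothetical stand-in for the idempotent DAL partition registration."""
    print(f"registering {partition} in DAL")

def maybe_update_dal(bucket_name: str, partition: str) -> bool:
    """Update DAL only once every Replicator has written its success marker."""
    client = storage.Client()
    bucket = client.bucket(bucket_name)
    # Each Replicator drops _SUCCESS_<SRC> once its copy of the partition is complete.
    markers = [bucket.blob(f"{partition}/_SUCCESS_{src}") for src in SOURCES]
    if not all(marker.exists() for marker in markers):
        return False       # a copy is still in flight; another instance will update DAL
    update_dal(partition)  # idempotent, so concurrent updaters are harmless
    return True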
25. Dataset via EagleEye
● View the different destinations for the same dataset
● GCS is just another destination
● Also shows the delay for each hourly partition
26. SDRS (retention in cloud)
OLM (Object Lifecycle Management) on GCS supports rules based on age, modification time, etc. However, OLM does not support:
● Dataset-based retention.
○ E.g. GDPR requires /logs/partly-cloudy/2019/09/10/21 to be scrubbed 30 days after generation, rather than 30 days after its creation time on GCS.
○ OLM wouldn't notify you on deletion, so it's hard to keep the dataset in sync with DAL.
● Soft delete (a trash feature), which is impossible without versioning.
○ The trash feature has saved us multiple times...
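For contrast, a minimal sketch of what OLM does cover: an age-based delete rule set with the google-cloud-storage Python client (the bucket name is hypothetical). The rule keys off object creation time and fires silently, which is exactly the gap SDRS fills:

from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("logs-partly-cloudy")  # hypothetical bucket

# OLM rule: delete objects 30 days after their GCS creation time.
# It cannot key off the generation time encoded in the partition path,
# and GCS does not notify us when the deletion actually happens.
bucket.add_lifecycle_delete_rule(age=30)
bucket.patch()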
27. SDRS Architecture
[Architecture diagram: Client, REST API, Pub/Sub, Service Interface, Event Service, Internal Queue, Validation, Execution, Configuration, Config, Retention Scheduler, Event Handling, Notification, Storage]
● Open source and cloud native.
● Supports retention rules:
○ Delete marker for on-demand deletes.
○ Dataset rule.
○ Bucket default rule.
● Soft delete:
○ Moves data from one bucket to another bucket.
○ Pluggable engine; simple transfer and Storage Transfer supported.
● REST API and event notifications:
○ REST API to control SDRS.
○ SDRS generates and sends events to a pub/sub system on deletion and trashing.
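As a rough illustration of the notification path, a minimal sketch of publishing a deletion event to Pub/Sub with the google-cloud-pubsub Python client; the project, topic, and event payload are hypothetical and not SDRS's actual schema:

import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic names.
topic_path = publisher.topic_path("my-cloud-project", "sdrs-retention-events")

# Hypothetical event body describing a trashed partition.
event = {
    "action": "trashed",
    "path": "gs://logs-partly-cloudy/logs/partly-cloudy/2019/09/10/01",
}
# publish() returns a future; result() blocks until the message is accepted.
publisher.publish(topic_path, json.dumps(event).encode("utf-8")).result()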
28. Twitter GCS Retention Management
● We have one SDRS service stack per org.
● Trash buckets: each bucket has a trash bucket, and deleted data is moved to the trash bucket before it is gone.
○ E.g. /gcs/logs/partly-cloudy will have /gcs/logs/partly-cloudy-trash acting as its trash.
● Partitions over objects: we configure retention for dataset partitions based on the time built into the path.
○ E.g. /gcs/logs/partly-cloudy/2019/09/10/01 with 3 days of retention will be removed on 2019/09/13/01.
○ We drop the partition from DAL when SDRS removes the data, so DAL is always in sync.
[Diagram: Retention Config Manager, DAL, DAL Sync Manager, SDRS Service, Pub/Sub Topics, Cloud SQL, Trash Buckets, GCS Buckets]
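A minimal sketch of partition-based retention with a trash bucket, assuming the partition time is encoded in the object prefix as yyyy/mm/dd/hh; the bucket names, the fixed 3-day retention, and the final DAL step are hypothetical:

from datetime import datetime, timedelta, timezone
from google.cloud import storage

RETENTION = timedelta(days=3)  # hypothetical per-dataset retention

def expire_partition(client: storage.Client, bucket_name: str, prefix: str) -> None:
    """Soft-delete one hourly partition whose generation time is built into its path,
    e.g. prefix = "logs/partly-cloudy/2019/09/10/01"."""
    y, m, d, h = prefix.rstrip("/").split("/")[-4:]
    generated = datetime(int(y), int(m), int(d), int(h), tzinfo=timezone.utc)
    if datetime.now(timezone.utc) < generated + RETENTION:
        return  # partition has not expired yet

    src = client.bucket(bucket_name)
    trash = client.bucket(bucket_name + "-trash")  # the bucket's trash bucket
    for blob in list(client.list_blobs(src, prefix=prefix)):
        src.copy_blob(blob, trash, blob.name)  # move into trash first...
        blob.delete()                          # ...then remove the original
    # The real service would now drop the partition from DAL (and publish an event),
    # keeping DAL in sync with the data.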
29. We described Twitter's data storage architecture and presented our solution for managing large data in the Cloud.