DA300: How Twitter Replicates Petabytes of Data to Google Cloud Storage
Lohit VijayaRenu, Twitter
@lohitvijayarenu
Agenda
We describe Twitter's Data Replicator architecture, present our solution to extend it to Google Cloud Storage, and show how we maintain a consistent interface for users.
Tweet questions: #GoogleNext19Twitter
Twitter DataCenter
Data Infrastructure for Analytics
Clusters: Real Time Cluster, Production Cluster, Ad hoc Cluster, Cold Storage
● Log Pipeline / Micro Services: generate > 1.5 trillion events every day
● Incoming Storage: produce > 4 PB per day
● Production jobs: process hundreds of PB per day
● Ad hoc queries: execute tens of thousands of jobs per day
● Cold/Backup: hundreds of PBs of data
● Streaming systems
Data Infrastructure for Analytics
Hadoop Cluster: Replication Service, Retention Service
Hadoop Cluster: Replication Service, Retention Service
Data Access Layer
Data Access Layer
● A dataset has a logical name and one or more physical locations
● Users and tools such as Scalding, Presto, and Hive query DAL for available hourly partitions
● Datasets have hourly/daily partitions in DAL
● DAL also stores properties such as owner, schema, and location with each dataset
* https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
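To make the logical-name-to-physical-locations model concrete, here is a minimal, hypothetical sketch of a dataset entry; the class and field names are invented for illustration and are not Twitter's actual DAL API.

// Hypothetical sketch of a DAL dataset entry: a logical name mapped to
// one or more physical locations, each holding hourly partitions.
// Names are invented for illustration; this is not Twitter's DAL API.
import java.util.List;
import java.util.Map;

public class DatasetEntry {
    String logicalName;                      // e.g. "logs.partly-cloudy"
    String owner;                            // e.g. "hadoop-team"
    String schema;                           // reference to the dataset schema
    Map<String, String> physicalLocations;   // location name -> root path,
                                             // e.g. "dw" -> "viewfs://hadoop-dw-nn/logs/partly-cloudy"
    // Hourly partitions known to DAL for a given location, e.g. "2019/04/10/03".
    Map<String, List<String>> partitionsByLocation;
}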
FileSystem abstraction
Path on HDFS cluster : hdfs://cluster-X-nn:8020/logs/partly-cloudy
Path on Federated HDFS cluster : viewfs://cluster-X/logs/partly-cloudy
Path on Twitter's HDFS Clusters* : /DataCenter-1/cluster-X/logs/partly-cloudy
Twitter's View FileSystem maps these path prefixes onto the namespaces of Cluster-X, Cluster-Y and ClusterZ across DataCenter-1 and DataCenter-2, and the Replicator resolves paths through it
* https://blog.twitter.com/engineering/en_us/a/2015/hadoop-filesystem-at-twitter.html
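Because every system addresses data through these global paths, a standard Hadoop FileSystem client can list a partition without knowing which cluster or namespace stores it. A minimal sketch, assuming a core-site.xml whose fs.defaultFS points at a ViewFileSystem configured with a Twitter-style mount table (the path comes from the slide above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListPartition {
    public static void main(String[] args) throws Exception {
        // Assumes the default FileSystem is a viewfs:// mount table in the
        // style shown on the slide; the mount table itself is not shown here.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // The same client-side path works regardless of which cluster or
        // namespace actually stores the data.
        Path partition = new Path("/DataCenter-1/cluster-X/logs/partly-cloudy/2019/04/10/03");
        for (FileStatus status : fs.listStatus(partition)) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
    }
}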
Need for Replication
Hadoop clusters across DataCenter-1 and DataCenter-2: ClusterM, ClusterN, ClusterC, ClusterZ, ClusterL, ClusterX-1, ClusterX-2
● Thousands of datasets configured for replication
● Across tens of different clusters
● Data kept in sync hourly/daily/snapshot
● Fault tolerant
Data Replicator
● Replicator per destination
● 1 : 1 Copy from src to destination
● N : 1 Copy + Merge from multiple src to destination
● Publish to DAL upon completion
Copy: Source Cluster → Replicator → Destination Cluster
Copy + Merge: Source Cluster + Source Cluster → Replicator → Destination Cluster
Replication setup
Dataset : partly-cloudy
Src Cluster : ClusterX
Src path : /logs/partly-cloudy
Dest Cluster : ClusterY
Dest path : /logs/partly-cloudy
Copy Since : 3 days
Owner : hadoop-team
The Replicator publishes to the Data Access Layer:
Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
Data Replicator Copy
Source Cluster : /ClusterX/logs/partly-cloudy/2019/04/10/03
Replicator : ClusterY runs a Distcp job for partition 2019/04/10/03
Destination Cluster : /ClusterY/logs/partly-cloudy/2019/04/10/03
DAL entry for Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
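The deck does not show the Replicator's job setup; as a hedged sketch, an hourly-partition copy like the one above could be driven through the stock Hadoop DistCp Java API (Hadoop 2.x constructor shown; Hadoop 3.x uses DistCpOptions.Builder). Paths are taken from the slide.

import java.util.Collections;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

public class HourlyCopy {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Source and destination partition paths, as shown on the slide.
        Path src = new Path("/ClusterX/logs/partly-cloudy/2019/04/10/03");
        Path dst = new Path("/ClusterY/logs/partly-cloudy/2019/04/10/03");
        // Hadoop 2.x-style DistCpOptions constructor.
        DistCpOptions options = new DistCpOptions(Collections.singletonList(src), dst);
        Job job = new DistCp(conf, options).execute();   // runs the copy as a MapReduce job
        System.exit(job.isSuccessful() ? 0 : 1);
    }
}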
Data Replicator Copy + Merge
Source Cluster : /ClusterX-1/logs/partly-cloudy/2019/04/10/03
Source Cluster : /ClusterX-2/logs/partly-cloudy/2019/04/10/03
Replicator : ClusterY runs a Distcp job per source for partition 2019/04/10/03, then merges the results
Destination Cluster : /ClusterY/logs/partly-cloudy/2019/04/10/03
DAL entry for Dataset : partly-cloudy (Type : Multiple Src)
/ClusterX-1/logs/partly-cloudy
/ClusterX-2/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
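Per the speaker notes, the merge copies each source into a temporary location, combines them into a single directory, and finishes with an atomic HDFS rename. A minimal sketch using the standard Hadoop FileSystem API; the directory names and helper are illustrative, not the Replicator's actual layout.

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MergeAndPublish {
    // Moves the per-source temporary copies into one staging directory and then
    // atomically renames it to the final partition path. Directory names here
    // (tmpCopies, staging) are placeholders, not the Replicator's actual code.
    public static void merge(FileSystem fs, Path[] tmpCopies, Path staging, Path finalPartition)
            throws Exception {
        fs.mkdirs(staging);
        for (Path tmp : tmpCopies) {
            for (FileStatus f : fs.listStatus(tmp)) {
                // Move each copied file under the single staging directory.
                fs.rename(f.getPath(), new Path(staging, f.getPath().getName()));
            }
        }
        // On HDFS, rename is cheap and atomic, so readers never see a partial merge.
        if (!fs.rename(staging, finalPartition)) {
            throw new IllegalStateException("rename failed: " + staging + " -> " + finalPartition);
        }
    }
}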
Extending Replication to GCS
Hadoop clusters across DataCenter-1 and DataCenter-2 (ClusterM, ClusterN, ClusterC, ClusterZ, ClusterL, ClusterX-1, ClusterX-2) replicate datasets to Cloud Storage, where BigQuery and GCE VMs can read them
● Same dataset available on GCS for users
● Unlock Presto on GCP, Hadoop on GCP, BigQuery and other tools
View FileSystem and Google Hadoop Connector
Bucket on GCS : gs://logs.partly-cloudy
Connector Path : /logs/partly-cloudy
Twitter Resolved Path : /gcs/logs/partly-cloudy
Twitter's View FileSystem now covers Cluster-X, Cluster-Y and ClusterZ in DataCenter-1 and DataCenter-2 plus Cloud Storage via the Cloud Storage Connector, so the Replicator addresses GCS with the same kind of paths
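To make the connector wiring concrete: a minimal sketch using the open-source Cloud Storage connector's standard fs.gs.impl settings; the bucket and partition come from the slide, and the exact properties may vary between connector versions.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GcsThroughConnector {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Standard Cloud Storage connector wiring (treat exact property names
        // as an assumption; verify against the connector version in use).
        conf.set("fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem");
        conf.set("fs.AbstractFileSystem.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS");
        // Direct bucket access, as on the slide: gs://logs.partly-cloudy
        FileSystem gcs = FileSystem.get(URI.create("gs://logs.partly-cloudy/"), conf);
        for (FileStatus f : gcs.listStatus(new Path("/2019/04/10/03"))) {
            System.out.println(f.getPath());
        }
        // With the Twitter ViewFileSystem mount in place, the same data is
        // reachable as /gcs/logs/partly-cloudy/2019/04/10/03 through viewfs.
    }
}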
Twitter DataCenter
Architecture behind GCS replication
Replicator : GCS schedules a Distcp job on the Copy Cluster
Source Cluster : /ClusterX/logs/partly-cloudy/2019/04/10/03
GCS : /gcs/logs/partly-cloudy/2019/04/10/03
DAL entry for Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/gcs/logs/partly-cloudy
Twitter DataCenter
Network setup for copy
Twitter & Google private peering (PNI)
Replicator : GCS reaches GCP through a proxy group, while the Distcp job on the Copy Cluster streams /gcs/logs/partly-cloudy/2019/04/10/03 to GCS directly over the peering link
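The deck does not spell out the proxy settings; as a hedged sketch, the open-source Cloud Storage connector exposes a proxy address property that a daemon-side configuration could set roughly like this (host and port are placeholders, not Twitter's actual proxy group):

import org.apache.hadoop.conf.Configuration;

public class ProxySettings {
    public static Configuration withProxy() {
        Configuration conf = new Configuration();
        // fs.gs.proxy.address is the open-source connector's proxy setting;
        // the host below is a placeholder, not Twitter's actual proxy group.
        conf.set("fs.gs.proxy.address", "proxy.example.internal:8080");
        return conf;
    }
}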
Merge same dataset on GCS (Multi Region Bucket)
Twitter DataCenter X-1 : Source ClusterX-1 /ClusterX-1/logs/partly-cloudy/2019/04/10/03 → Distcp on Copy Cluster X-1
Twitter DataCenter X-2 : Source ClusterX-2 /ClusterX-2/logs/partly-cloudy/2019/04/10/03 → Distcp on Copy Cluster X-2
Both copies land under /gcs/logs/partly-cloudy/2019/04/10/03 in the same Multi Region Bucket on Cloud Storage
Merging and updating DAL
● Multiple Replicators copy the same dataset partition to the destination
● Each Replicator checks for availability of data independently
● Creates individual _SUCCESS_<SRC> files
● Updates DAL when all _SUCCESS_<SRC> files are found
● Updates are idempotent
Each Replicator updates the partition independently: compare src and dest; if a copy is needed, kick off the distcp job; on success, check that all _SUCCESS_<SRC> files are present; if yes, update DAL, otherwise let another instance update DAL (sketched below).
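A minimal sketch of the marker-file check described above, using the standard Hadoop FileSystem API; the helper names and exact marker layout are illustrative assumptions, not the Replicator's actual code.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SuccessMarkers {
    // After its own copy succeeds, a replicator drops a marker named for its
    // source, e.g. _SUCCESS_ClusterX-1, inside the destination partition.
    public static void markCopied(FileSystem fs, Path partition, String source) throws Exception {
        fs.create(new Path(partition, "_SUCCESS_" + source)).close();
    }

    // DAL is updated only once every expected source has written its marker.
    // Because the DAL update is idempotent, it is harmless if two replicators
    // both see all markers and publish the same partition.
    public static boolean allSourcesDone(FileSystem fs, Path partition, String[] sources)
            throws Exception {
        for (String source : sources) {
            if (!fs.exists(new Path(partition, "_SUCCESS_" + source))) {
                return false;   // let another replicator instance publish later
            }
        }
        return true;
    }
}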
Uniform Access for Users
Dataset via EagleEye
● View different destinations for the same dataset
● GCS is another destination
● Also shows delay for each hourly partition
Query dataset
Find dataset by logical name:
$dal logical-dataset list --role hadoop --name logs.partly-cloudy
| 4031 | http://dallds/401 | hadoop | Prod | logs.partly-cloudy | Active |
List all physical locations:
$dal physical-dataset list --role hadoop --name logs.partly-cloudy
| 26491 | http://dalpds/26491 | dw | viewfs://hadoop-dw-nn/logs/partly-cloudy/yyyy/mm/dd/hh |
| 41065 | http://dalpds/41065 | gcs | gcs:///logs/partly-cloudy/yyyy/mm/dd/hh |
Query partitions of dataset
All partitions for dataset on GCS:
$dal physical-dataset list --role hadoop --name logs.partly-cloudy --location-name gcs
2019-04-01T11:00:00Z 2019-04-01T12:00:00Z gcs:///logs/partly-cloudy/2019/04/01/11 HadoopLzop
2019-04-01T12:00:00Z 2019-04-01T13:00:00Z gcs:///logs/partly-cloudy/2019/04/01/12 HadoopLzop
2019-04-01T13:00:00Z 2019-04-01T14:00:00Z gcs:///logs/partly-cloudy/2019/04/01/13 HadoopLzop
2019-04-01T14:00:00Z 2019-04-01T15:00:00Z gcs:///logs/partly-cloudy/2019/04/01/14 HadoopLzop
2019-04-01T15:00:00Z 2019-04-01T16:00:00Z gcs:///logs/partly-cloudy/2019/04/01/15 HadoopLzop
2019-04-01T16:00:00Z 2019-04-01T17:00:00Z gcs:///logs/partly-cloudy/2019/04/01/16 HadoopLzop
Monitoring
● Rich set of monitoring metrics for the Replicator and replicator configs
● Uniform monitoring dashboard for onprem and cloud replicators
Dashboards show read/write bytes per destination and latency per destination
Alerting
● Fine-tuned alert configs per metric per replicator
● Pages the on-call for critical issues
● Uniform alert dashboard and config for onprem and cloud replicators
Replicators per project
A shared Copy Cluster in the Twitter DataCenter runs one Replicator per GCP project:
Replicator X : Distcp to /gcs/dataX/2019/04/10/03 in Cloud Storage (GCP Project X)
Replicator Y : Distcp to /gcs/dataY/2019/04/10/03 in Cloud Storage (GCP Project Y)
Replicator Z : Distcp to /gcs/dataZ/2019/04/10/03 in Cloud Storage (GCP Project Z)
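Per the speaker notes, each Replicator carries its own credentials for its project. A hedged sketch of what that per-project configuration could look like, using the open-source connector's 1.9.x-era property names; the key file path is a placeholder and none of this is Twitter's actual setup.

import org.apache.hadoop.conf.Configuration;

public class PerProjectConf {
    // One Configuration per replicator, so each GCP project gets its own
    // credentials. Property names are an assumption based on the open-source
    // connector's 1.9.x-era settings; verify against the version in use.
    public static Configuration forProject(String projectId, String keyFile) {
        Configuration conf = new Configuration();
        conf.set("fs.gs.project.id", projectId);
        conf.setBoolean("google.cloud.auth.service.account.enable", true);
        conf.set("google.cloud.auth.service.account.json.keyfile", keyFile);
        return conf;
    }
}

Replicator X, Y and Z would each be started with their own project id and key file, which is what allows independent updates per project.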
RegEx based path resolution
Twitter ViewFS mounttable.xml:
<property>
  <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/logs/(?!((tst|test)(_|-)))(?&lt;dataset&gt;[^/]+)</name>
  <value>gs://logs.${dataset}</value>
</property>
<property>
  <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/user/(?!((tst|test)(_|-)))(?&lt;userName&gt;[^/]+)</name>
  <value>gs://user.${userName}</value>
</property>
Twitter ViewFS Path → GCS bucket
/gcs/logs/partly-cloudy/2019/04/10 → gs://logs.partly-cloudy/2019/04/10
/gcs/user/lohit/hadoop-stats → gs://user.lohit/hadoop-stats
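The resolution itself happens inside Twitter's modified ViewFileSystem; the standalone sketch below only illustrates how the named capture group in the mount entries above turns a /gcs/logs/... path into a bucket URI. The replaceresolveddstpath rewrite options and the rest of the ViewFileSystem plumbing are intentionally omitted.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GcsPathResolution {
    // Core regex from the first mount entry above: skip tst/test datasets and
    // capture the dataset name as a named group.
    private static final Pattern LOGS =
        Pattern.compile("^/gcs/logs/(?!((tst|test)(_|-)))(?<dataset>[^/]+)(?<rest>/.*)?");

    public static String resolve(String viewFsPath) {
        Matcher m = LOGS.matcher(viewFsPath);
        if (!m.matches()) {
            return null;   // not a /gcs/logs/... path
        }
        // The replaceresolveddstpath options in the mount entry additionally
        // rewrite '-' and '_' in the resolved path; that step is omitted here.
        String rest = m.group("rest") == null ? "" : m.group("rest");
        return "gs://logs." + m.group("dataset") + rest;
    }

    public static void main(String[] args) {
        // Prints gs://logs.partly-cloudy/2019/04/10, matching the table above.
        System.out.println(resolve("/gcs/logs/partly-cloudy/2019/04/10"));
    }
}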
Where are we today
● Tens of instances of GCS Replicators
● Copied tens of petabytes of data
● Hundreds of thousands of copy jobs
● Unlocked multiple use cases on GCP
Made here together
Twitter + Google
Google Storage Hadoop connector
● Checksum mismatch between Hadoop FileSystem and Google Cloud Storage
○ Composite checksum HDFS-13056
○ More details in blog post*
● Proxy configuration as path
● Per user credentials
● Lazy initialization to support View FileSystem
* https://cloud.google.com/blog/products/storage-data-transfer/new-file-checksum-feature-lets-you-validate-data-transfers-between-hdfs-and-cloud-storage
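A hedged sketch of the settings that make the HDFS-to-GCS checksum comparison possible, based on my reading of HDFS-13056 and the connector's checksum feature; verify the exact property names and values against the linked blog post and the versions in use.

import org.apache.hadoop.conf.Configuration;

public class ChecksumSettings {
    // Settings that make HDFS and GCS report comparable checksums (assumption):
    // HDFS-13056 adds a block-size-independent composite CRC mode, and the
    // connector can expose a matching CRC32C checksum type.
    public static Configuration comparableChecksums() {
        Configuration conf = new Configuration();
        conf.set("dfs.checksum.combine.mode", "COMPOSITE_CRC");
        conf.set("fs.gs.checksum.type", "CRC32C");
        return conf;
    }
}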
Performance and Consistency
● Performance optimizations uncovered during evaluation of Presto on GCP
● Cooperative locking in Google Connector for atomic renames
○ https://github.com/GoogleCloudPlatform/bigdata-interop/tree/cooperative_locking
● Same version of connector (onprem and open source)
Summary
We described Twitter's Data Replicator architecture and presented our solution for extending it to Google Cloud Storage while maintaining a consistent interface for users.
Acknowledgement
Ran Wang @RanWang18
Zhenzhao Wang @zhen____w
Joseph Boyd @sluicing
Joep Rottinghuis @joep
Hadoop Team @TwitterHadoop
https://cloud.google.com/twitter
Tweet to @TwitterEng
https://careers.twitter.com
Questions
Your Feedback is Greatly Appreciated!
Complete the session survey in the mobile app: 1-5 star rating system, open field for comments, Rate icon in the status bar.
Thank you

Editor's Notes

  • #2 Twitter’s Data Replicator for GCS at GoogleNext 2019. Lohit VijayaRenu, Twitter
  • #6 Data is identified by a dataset name. HDFS is the primary storage for analytics. Users configure replication rules for different clusters, and each dataset also has retention rules defined per cluster. Datasets are always represented as fixed-interval partitions (hourly/daily) and are defined in a system called the Data Access Layer (DAL)*. Data is made available at different destinations using the Replicator.
  • #8 All systems rely on global filesystem paths such as /cluster1/dataset-1/2019/04/10/03 and /cluster3/user/larry/dataset-5/2019/04/10/03, built on Hadoop ViewFileSystem*. Each path prefix is mapped to a specific cluster's configuration, which makes it easy to discover data from its location; the Replicator uses this to resolve paths across clusters. Different FileSystem implementations can be hidden behind ViewFileSystem, e.g. /gcs/user/dataset-9/2019/04/10/03 can map to gs://user-dataset-1-bucket.twttr.net/2019/04/10/03.
  • #13 Run one Replicator per destination cluster. Always a pull model. Fault tolerant. 1:1 or N:1 setup. Upon copy, publish to DAL.
  • #14 Users set up one replication entry per dataset with properties: source and destination clusters; copy since X days (optionally copy until Y days); owner, team, contact email; copy job configuration. There are different ways to specify configuration: yml, DAL, configdb. Contact email is configured for alerts. Fault-tolerant copy keeps data in sync.
  • #15 The Replicator is a long-running daemon (on Mesos). The daemon checks configuration and schedules a copy per hourly partition. Copy jobs are executed as Hadoop distcp jobs on the destination cluster. After the hourly copy, the partition is published to DAL.
  • #16 Some datasets are collected across multiple DataCenters. The Replicator kicks off multiple DistCp jobs to copy into a tmp location, then merges the dataset into a single directory and does an atomic rename to the final destination. Renames on HDFS are cheap and atomic, which makes this operation easy.
  • #22 We use the same Replicator code to sync data to GCS and the ViewFileSystem abstraction to hide GCS: /gcs/dataset/2019/04/10/03 maps to gs://dataset.bucket/2019/04/10/03. The Google Hadoop Connector is used to interact with GCS through Hadoop APIs. Distcp jobs run on a dedicated Copy cluster, where a ViewFileSystem mount point presents the GCS destination. Distcp tasks stream data from the source HDFS to GCS (no local copy).
  • #23 The Replicator daemon uses a proxy, while actual data flows directly to GCP from Twitter over the PNI set up between Twitter and Google.
  • #24 Data for the same dataset is aggregated at multiple DataCenters (DC x and DC y). Replicators in each DC schedule individual DistCp jobs, and data from multiple DCs ends up under the same path on GCS.
  • #27 UI support via EagleEye to view all replication configurations and the properties associated with each (src, dest, owner, email, etc.). CLI support to manage replication configurations: load new or modify existing configurations, list all configurations, mark configurations active/inactive. API support for clients and replicators, with rich API access for all of the above operations.
  • #28 Command line tools: the dal command line looks up datasets, destinations, and available partitions. API access to DAL: Scalding/Presto query DAL to check partitions for a time range, and jobs also link to a scheduler which can kick off jobs based on new partitions. UI access: EagleEye is the UI to view details about datasets and available partitions, and it can also show delay per hourly partition. Uniform access on prem or in the cloud: the interface to dataset properties is the same on prem or in the cloud.
  • #32 GCP projects are based on organization. We deploy a separate Replicator with its own credentials per project, on a shared copy cluster per DataCenter. This enables independent updates and reduces the risk of errors.
  • #33 Logs vs user path resolution: projects and buckets have a standard naming convention. Logs live at gs://logs.<category name>.twttr.net/ and user data at gs://user.<user name>.twttr.net/. Access to these buckets is via standard paths: logs at /gcs/logs/<category name>/ and user data at /gcs/user/<user name>/. Typically we would need a mapping of path prefix to bucket name in the Hadoop ViewFileSystem mounttable.xml; we modified ViewFileSystem to dynamically create the mount mapping on demand, since bucket names and path names follow the standard convention, so no configuration or update is needed.
  • #36 The Google Cloud Storage connector is used to access GCS. Existing applications using Hadoop FileSystem APIs continue to work, and existing tools continue to work against GCS; for the most part users do not know there is a separate tool/API to access GCS. Users use commands such as: hadoop fs -ls /gcs/user/larry/my_dataset/2019/01/04; hadoop fs -du -s -h /gcs/user/larry/my_dataset; hadoop fs -put /gcs/user/larry/my_dataset/2019/01/04/file.txt ./file.txt. The Hadoop Cloud Storage connector is installed along with the hadoop client on jump hosts and hadoop nodes; applications can also package the connector jar.
  • #42 Google supports data at petabyte scale, securely with our best in class analytics and machine learning capabilities to inform real-time decisions and coordinate response on the roads.