Extending Twitter's Data Platform to Google Cloud

Lohit VijayaRenu, Vrushali Channapattan

Data Platform @Twitter
Oxpecker
Roneobird
Data Access Layer
ETL Pipelines

Why Cloud?
- Provides a convenient way to test Hadoop changes at scale
- Ability to rapidly grow / shrink capacity on a temporary basis
- A broader geographical footprint for locality and business continuity
- Access to other Google offerings such as BigQuery, CloudML, Cloud DataFlow, etc.

Partly Cloudy
A project to extend Data Processing at Twitter from an on-premises only model to a hybrid on-premises and Cloud model

Design considerations
- User Experience: Consistency in user experience for on-premises & in-cloud data processing
- Scalability: Ability to scale out to handle all datasets & all users from day 1
- Onboarding: Seamless onboarding experience
- New Avenues: Data access in new processing tools in cloud

Design principles
- Authentication: Strong authentication for all user and service access to data
- Authorization: Explicit authorization for all user and service access to data; least privileged access
- Audit: Ability to easily determine who performed what actions on the data

Workstreams
● Various focus areas across the tech stack
○ Networking
○ GCP config
○ Replication
○ Data Processing Tools
○ Internal services
● Collaboration across teams within Twitter
● Collaboration with Google

Partly Cloudy Data Replication
Sync Datasets to GCS

Data Infrastructure for Analytics
Diagram: a Data Access Layer and two Hadoop Clusters, each with a Replication Service and a Retention Service

Extending Replication to GCS
Diagram: DataCenter 1 and DataCenter 2 with Hadoop ClusterM, ClusterN, ClusterC, ClusterZ, ClusterX-2, ClusterL and ClusterX-1
● Same dataset available on GCS for users
● Unlock Presto on GCP, Hadoop on GCP, BigQuery and other tools

Data Replicator Copy
Source Cluster: /ClusterX/logs/partly-cloudy/2019/04/10/03
Destination Cluster: /ClusterY/logs/partly-cloudy/2019/04/10/03
Replicator : ClusterY (Distcp of 2019/04/10/03)
DAL
Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
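
The copy step on this slide can be sketched as follows. This is a minimal illustration, assuming the replicator runs the stock `hadoop distcp` command once per hourly partition; the helper names `partition_path` and `distcp_command` and the choice of the `-update` flag are assumptions made for the example, not the replicator's actual implementation.

```python
from datetime import datetime


def partition_path(cluster, dataset, hour):
    """Build /<cluster>/logs/<dataset>/YYYY/MM/DD/HH for one hourly partition."""
    return f"/{cluster}/logs/{dataset}/{hour:%Y/%m/%d/%H}"


def distcp_command(src_cluster, dst_cluster, dataset, hour):
    """Return the argv for copying one hourly partition between clusters.

    Hypothetical helper: only builds the command, it does not run it.
    """
    src = partition_path(src_cluster, dataset, hour)
    dst = partition_path(dst_cluster, dataset, hour)
    # -update copies only files that are missing or differ at the destination
    return ["hadoop", "distcp", "-update", src, dst]


cmd = distcp_command("ClusterX", "ClusterY", "partly-cloudy", datetime(2019, 4, 10, 3))
print(" ".join(cmd))
```

Building the command per partition keeps each copy independent, so a failed hour can be retried without touching the others.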

Data Replicator Copy + Merge
Source Cluster: /ClusterX-1/logs/partly-cloudy/2019/04/10/03
Source Cluster: /ClusterX-2/logs/partly-cloudy/2019/04/10/03
Destination Cluster: /ClusterY/logs/partly-cloudy/2019/04/10/03
Replicator : ClusterY (one Distcp of 2019/04/10/03 per source, then Merge)
DAL
Type : Multiple Src
/ClusterX-1/logs/partly-cloudy
/ClusterX-2/logs/partly-cloudy
/ClusterY/logs/partly-cloudy

Architecture behind GCS replication
Twitter DataCenter
Source Cluster: /ClusterX/logs/partly-cloudy/2019/04/10/03
Copy Cluster: Distcp
GCS: /gcs/logs/partly-cloud/2019/04/10/03
Replicator : GCS
DAL
/ClusterX/logs/partly-cloudy
/gcs/logs/partly-cloudy

Merge same dataset on GCS (Multi Region Bucket)
Twitter DataCenter X-1
Source ClusterX-1: /ClusterX-1/logs/partly-cloudy/2019/04/10/03
Copy Cluster X-1: Distcp
Twitter DataCenter X-2
Source ClusterX-2: /ClusterX-2/logs/partly-cloudy/2019/04/10/03
Copy Cluster X-2: Distcp
Cloud Storage (Multi Region Bucket): /gcs/logs/partly-cloudy/2019/04/10/03

Dataset via EagleEye
● View the different destinations for the same dataset
● GCS is another destination
● Also shows the delay for each hourly partition

Partly Cloudy Resource Hierarchy
Organization and Project structure

TWITTER Org
DATA INFRA Folder
GCP Projects: twitter-product, twitter-revenue, twitter-infraeng

Project contents
Diagram of one Project:
- Buckets: Dataset bucket, User Bucket, Scratch bucket, Scrubbed bucket
- Google Cloud Storage Connector for Hadoop
- ViewFS filesystem layer
- Nest: Name Nodes, Worker Nodes, Resource Manager, Task
- Shadow account based access
- User account based access

Replicators per project
Twitter DataCenter
Copy Cluster
Replicator X: Distcp to /gcs/dataX/2019/04/10/03 (Cloud Storage, GCP Project X)
Replicator Y: Distcp to /gcs/dataY/2019/04/10/03 (Cloud Storage, GCP Project Y)
Replicator Z: Distcp to /gcs/dataZ/2019/04/10/03 (Cloud Storage, GCP Project Z)

Storage in the Cloud

GCS
On-premises path: /dc1/cluster1/user/helen/some/path/part-001.lzo
Logical Cloud path: /gcs/user/helen/some/path/part-001.lzo
GCS bucket path: gs://user.helen.dp.twitter.domain/some/path/part-001.lzo
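
The three path forms on this slide can be related with a small sketch. It assumes that on-premises paths start with `/<datacenter>/<cluster>/` and that bucket names follow the `<kind>.<name>.dp.twitter.domain` pattern visible in the example; the function names are hypothetical and the rule is inferred from this one example only.

```python
BUCKET_SUFFIX = "dp.twitter.domain"  # taken from the slide's example bucket name


def onprem_to_logical(path):
    """/dc1/cluster1/user/helen/... -> /gcs/user/helen/...

    Assumes the first two components are the datacenter and the cluster.
    """
    parts = path.lstrip("/").split("/")
    return "/gcs/" + "/".join(parts[2:])


def logical_to_bucket(path):
    """/gcs/user/helen/some/path -> gs://user.helen.<suffix>/some/path"""
    parts = path.lstrip("/").split("/")
    if len(parts) < 3 or parts[0] != "gcs":
        raise ValueError(f"not a logical cloud path: {path}")
    kind, name, rest = parts[1], parts[2], parts[3:]
    return f"gs://{kind}.{name}.{BUCKET_SUFFIX}/" + "/".join(rest)


logical = onprem_to_logical("/dc1/cluster1/user/helen/some/path/part-001.lzo")
bucket = logical_to_bucket(logical)
print(logical)
print(bucket)
```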

RegEx based path resolution
Twitter ViewFS mounttable.xml
<!-- In the XML file the named-group brackets are escaped as &lt; and &gt; so the property name stays well-formed -->
<property>
  <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/logs/(?!((tst|test)(_|-)))(?&lt;dataset&gt;[^/]+)</name>
  <value>gs://logs.${dataset}</value>
</property>
<property>
  <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/user/(?!((tst|test)(_|-)))(?&lt;userName&gt;[^/]+)</name>
  <value>gs://user.${userName}</value>
</property>
Twitter ViewFS Path -> GCS bucket
/gcs/logs/partly-cloudy/2019/04/10 -> gs://logs.partly-cloudy/2019/04/10
/gcs/user/lohit/hadoop-stats -> gs://user.lohit/hadoop-stats
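
To see how the two regex mount entries on this slide map a ViewFS path to a bucket, here is a minimal Python sketch. It is not the ViewFS implementation: the `MOUNTS` table and `resolve` function are hypothetical, Java's `(?<name>...)` named groups are rewritten as Python's `(?P<name>...)`, and the `replaceresolveddstpath` settings that appear in the property names are not modelled.

```python
import re

# Hypothetical stand-in for the two linkRegex entries: (source regex, target template).
MOUNTS = [
    (re.compile(r"^/gcs/logs/(?!((tst|test)(_|-)))(?P<dataset>[^/]+)"), "gs://logs.{dataset}"),
    (re.compile(r"^/gcs/user/(?!((tst|test)(_|-)))(?P<userName>[^/]+)"), "gs://user.{userName}"),
]


def resolve(path):
    """Return the gs:// location for a /gcs/... path, or None if no entry matches."""
    for pattern, template in MOUNTS:
        match = pattern.match(path)
        if match:
            # Substitute the captured group, then append the unmatched remainder.
            return template.format(**match.groupdict()) + path[match.end():]
    return None


print(resolve("/gcs/logs/partly-cloudy/2019/04/10"))
print(resolve("/gcs/user/lohit/hadoop-stats"))
print(resolve("/gcs/logs/test_data/2019"))  # excluded by the negative lookahead
```

The negative lookahead keeps names starting with `tst_`, `tst-`, `test_` or `test-` out of these mappings, so such paths fall through unresolved in this sketch.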

View FileSystem and Google Hadoop Connector
Bucket on GCS : gs://logs.partly-cloudy
Connector Path : /logs/partly-cloudy
Twitter Resolved Path : /gcs/logs/partly-cloudy
Diagram: Twitter’s View FileSystem over Cluster-X, Cluster-Y and ClusterZ (Namespace 1, Namespace 2) in DataCenter-1 and DataCenter-2, and over Cloud Storage through the Connector; the Replicator copies to Cloud Storage

User Management

Users & Accounts
Each User has:
- UNIX account: Kerberos credentials
- GSuite account: OAuth2 credentials
- Shadow account (GCP Service account): Shadow account JSON key

Key Management
- A new key is generated every N days
- Each key is valid for 2N + N days
- Keys are distributed to compute nodes by Twitter’s key distribution service
- The shadow account key is readable only by that user
- Key management & distribution is transparent to the user
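
The rotation schedule above implies that several keys are valid at once. The sketch below works that out under stated assumptions: keys are created on days 0, N, 2N, and so on, and "2N + N days" is read as a lifetime of 3N days counted as the half-open window from the creation day. The function name is hypothetical and N = 7 is only an example value.

```python
def valid_key_days(today, n):
    """Return the creation days of all keys still valid on `today`.

    Assumption: a key created on day d is valid for days d .. d + 3N - 1.
    """
    if n <= 0 or today < 0:
        raise ValueError("n must be positive and today non-negative")
    lifetime = 2 * n + n
    newest = (today // n) * n  # creation day of the most recent key
    days = [d for d in range(newest, -1, -n) if today < d + lifetime]
    return sorted(days)


print(valid_key_days(21, 7))
print(valid_key_days(3, 7))
```

Under these assumptions up to three keys overlap in steady state, so an older key remains usable for a while after a newer one has been generated.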

DATA INFRA
twitter-[org]
twitter-employee-users

How do the Data Processing Users at Twitter get to use Partly Cloudy?
DemiGod Services

What are DemiGod services?
DemiGod is a group of services responsible for configuring GCP for Twitter’s Data Platform.
They run in GCP.

Salient features of DemiGods
- Run asynchronously, independent of each other
- Run with exactly-scoped, privileged Google service accounts
- Idempotent runs
- Puppet-like functionality. Will override any manual changes
- Modular in design
- Each kept as simple as possible
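
The "idempotent" and "Puppet-like" properties above can be illustrated with a small reconcile loop. This is a sketch of the general technique, not DemiGod code: the `reconcile` function, the resource names, and the `role` values are all hypothetical, and state is modelled as plain dictionaries instead of GCP API calls.

```python
def reconcile(desired, actual):
    """Return (actions, new_state) so that managed resources match `desired`.

    Resources present only in `actual` are left untouched (an assumption).
    """
    actions = []
    for name, config in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != config:
            # A manual change drifted from the declared config: override it.
            actions.append(("override", name))
    new_state = {**actual, **{name: dict(cfg) for name, cfg in desired.items()}}
    return actions, new_state


desired = {"user.helen": {"role": "writer"}, "logs.partly-cloudy": {"role": "reader"}}
actual = {"user.helen": {"role": "owner"}}  # someone edited this by hand

first_actions, state = reconcile(desired, actual)
second_actions, state = reconcile(desired, state)
print(first_actions)
print(second_actions)  # empty: a second run has nothing left to do
```

Because each run compares declared state with observed state, repeating a run is harmless, which is what lets such services run independently and on their own schedules.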

Deployment of DemiGods
Projects: Partly Cloudy Admin Project, Twitter infra eng project, Twitter product project, Twitter user project
Services: bucket-creation-ie org (svc-acc-ie), bucket-creation Product (svc-acc-product), shadow-user-creation, policy-granting-ie, key-rotation/creation
Dependencies: Key/Secrets store, LDAP/Google Groups, GCS Config bucket

What do the Data Processing Users at Twitter get?
❏ Datasets replicated on GCS
❏ A shadow account to access GCS
❏ GCS buckets for their scratch & scrubbed data
❏ Access to a Twitter managed Hadoop cluster in GCP
❏ Access to a Twitter managed Presto cluster in GCP
❏ Exploring other Google offerings (such as BigQuery, DataProc & DataFlow)

Where are we today
● Copied tens of petabytes of data and keep it in sync
● Tens of different projects with hundreds of buckets
● Complex set of VPC rules
● Hundreds of users using GCP
● Unlocked multiple use cases on GCP

Thank you!
Hiring https://careers.twitter.com
Tweet @TwitterHadoop
