SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs and user identity management

SEC302: Twitter's GCP Architecture for Its Petabyte-Scale Data Storage in GCS and User Identity Management
Vrushali Channapattan, Staff Engineer, Twitter
James Duke, Strategic Cloud Engineer, Google Cloud

Outline
● What is Partly Cloudy
● Architecture
● Project & Bucket Design
● User Identity Management
● DemiGod services
● Deployment

What is Partly Cloudy
A project to extend Data Processing at Twitter from an on-premises-only model to a hybrid on-premises and Cloud model

Why Partly Cloudy
- A long-term desire to have some cloud presence
- The right strategy will balance developer agility, capabilities, and cost
- Provides a convenient way to test Hadoop changes at scale
- A broader geographical footprint for locality and business continuity
- Access to other Google offerings such as BigQuery, Cloud ML, and Cloud Dataflow

Design principles
- Authentication: Strong authentication for all user and service access to data
- Authorization: Explicit authorization for all user and service access to data
- Audit: Ability to easily determine who performed what actions on the data

Partly Cloudy Resource Hierarchy
Organization and Project structure
- TWITTER Org
  - DATA INFRA Folder
    - GCP Projects: twitter-product, twitter-revenue, twitter-infraeng

What do these Projects contain?
twitter-project
- Cloud Storage buckets: Dataset bucket, User bucket, Scratch bucket, Scrubbed bucket
- Compute components and their service accounts:
  - Nest: nest-compute@project-name.iam.gserviceaccount.com
  - Name Nodes: nn-per-cluster-compute@project-…ount.com
  - Worker Node(s): wn-per-cluster-compute@project-name.iam.gserviceaccount.com
  - Resource Manager: rm-per-cluster-compute@project-…ount.com
  - Task
- ViewFS filesystem layer
- Google Cloud Storage Connector for Hadoop
- Shadow account based access and user account based access

Storage in the Cloud

GCS
- On-premises path: /dc1/cluster1/user/helen/some/path/part-001.lzo
- Logical Cloud path: /gcs/user/helen/some/path/part-001.lzo
- GCS bucket path: gs://user.helen.dp.twitter.domain/some/path/part-001.lzo
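The three example paths suggest a mechanical mapping from an on-premises user path to a logical cloud path and then to a per-user bucket. The sketch below generalizes from that single example. The regular expressions, the function names, and the assumption that every user bucket ends in `dp.twitter.domain` are illustrative guesses, not the actual ViewFS configuration.

```python
import re

# Assumed patterns, generalized from the one example on the slide.
ONPREM_USER_PATH = re.compile(
    r"^/(?P<dc>[^/]+)/(?P<cluster>[^/]+)/user/(?P<user>[^/]+)/(?P<rest>.+)$"
)
LOGICAL_USER_PATH = re.compile(r"^/gcs/user/(?P<user>[^/]+)/(?P<rest>.+)$")
BUCKET_SUFFIX = "dp.twitter.domain"  # taken from the example bucket name


def onprem_to_logical(path: str) -> str:
    """Drop the datacenter and cluster components and root the path at /gcs."""
    m = ONPREM_USER_PATH.match(path)
    if m is None:
        raise ValueError(f"not an on-premises user path: {path}")
    return f"/gcs/user/{m['user']}/{m['rest']}"


def logical_to_gcs(path: str) -> str:
    """Move the user name out of the path and into the bucket name."""
    m = LOGICAL_USER_PATH.match(path)
    if m is None:
        raise ValueError(f"not a logical user path: {path}")
    return f"gs://user.{m['user']}.{BUCKET_SUFFIX}/{m['rest']}"
```

Chaining the two functions on the on-premises example path reproduces the logical and bucket paths listed above.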

User Management

Key Management
- A new key is generated every N days
- Each key is valid for 2N + N days
- Keys are distributed to compute nodes by Twitter’s key distribution service
- The shadow account key is readable only by that user
- Key management & distribution is transparent to the user
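A small sketch of how overlapping key windows could be computed. It reads the "2N + N days" validity literally, treats expiry as exclusive, and uses hypothetical names (`KeyWindow`, `key_schedule`, `valid_keys`); the real rotation service is not described in the slides.

```python
from datetime import date, timedelta
from typing import List, NamedTuple


class KeyWindow(NamedTuple):
    created: date
    expires: date  # assumed exclusive: the key is invalid on this day


def key_schedule(start: date, n_days: int, count: int) -> List[KeyWindow]:
    """Generate `count` keys, one every N days, each valid for 2N + N days."""
    validity = timedelta(days=2 * n_days + n_days)
    windows = []
    for i in range(count):
        created = start + timedelta(days=i * n_days)
        windows.append(KeyWindow(created, created + validity))
    return windows


def valid_keys(schedule: List[KeyWindow], today: date) -> List[KeyWindow]:
    """Return the keys that have been created and have not yet expired."""
    return [k for k in schedule if k.created <= today < k.expires]
```

Under these assumptions several keys are valid at any moment once rotation is under way, so a node holding a slightly older key keeps working while newer keys are being distributed.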

DATA INFRA
twitter-[org]
twitter-employee-users

What do the Data Processing Users at Twitter get
❏ A shadow account to access GCS
❏ A GCS bucket for their data
❏ Access to a Twitter managed Hadoop cluster in GCP
❏ Access to a Twitter managed Presto cluster in GCP
❏ To work with us to leverage other Google offerings (such as BigQuery, Cloud Dataproc & Cloud Dataflow)

Who configures GCP for the Data Processing Users at Twitter
DemiGod Services

DemiGod Services

What are DemiGod services
DemiGod is a group of services responsible for configuring GCP for Twitter’s Data Platform.
They run in GCP.

Salient features of DemiGods
- Run asynchronously and independently of each other
- Run with exactly-scoped, privileged Google service accounts
- Idempotent runs
- Puppet-like functionality: will override any manual changes
- Modular in design
- Each kept as simple as possible
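The idempotent, Puppet-like behavior can be illustrated with a generic desired-state reconcile loop. This is a minimal sketch over in-memory dictionaries, not the DemiGod implementation: `plan` and `reconcile` are hypothetical names, and whether resources missing from the desired state are deleted is not stated in the slides, so the sketch leaves them alone.

```python
from typing import Dict, List, Tuple

State = Dict[str, str]  # resource name -> configuration (simplified)


def plan(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Compare desired and actual state and list the changes needed."""
    actions = []
    for name, value in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != value:
            # A manual change drifted from the declared config: overwrite it.
            actions.append(("overwrite", name))
    return actions


def reconcile(desired: State, actual: State) -> List[Tuple[str, str]]:
    """Apply the plan in place. A second run finds nothing left to do."""
    actions = plan(desired, actual)
    for _, name in actions:
        actual[name] = desired[name]
    return actions
```

Because each run starts from a fresh comparison, running it twice in a row is safe, and a manual edit made between runs is reverted on the next pass.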

Partly Cloudy
Types of DemiGods

Bucket Creation
❏ Creates buckets
  ❏ Twitter domain
❏ One DemiGod per pillar twitter project
❏ Inputs
  ❏ Configurable prefixes
  ❏ LDAP input
  ❏ YAML input

Shadow Account Management
❏ Creates shadow accounts
  ❏ Google service accounts
❏ One DemiGod
❏ Inputs
  ❏ Configurable pattern
  ❏ LDAP input
  ❏ YAML input

Policy Management
❏ Creates IAM policies
❏ One DemiGod per pillar project
❏ Inputs
  ❏ LDAP input
  ❏ YAML input
  ❏ Google groups
  ❏ Ignore list

Key LifeCycle Management
❏ Creates keys with expiration
❏ Manages key lifecycle every N days
❏ Adds them to Key Store
❏ Inputs
  ❏ Destination for keys
  ❏ LDAP input
  ❏ Shadow account

Partly Cloudy
Deployment of DemiGods

Deployment considerations
- DemiGods will run on GCE, with each VM running as a DemiGod service account
- DemiGod service accounts will be created in the partly-cloudy-admin project, which has limited SSH access
- DemiGod processes will run as a Kerberized headless Twitter user
- The DemiGod Key Creation Service shall NOT write service account keys to disk. It holds them in memory until they are written to the Secret Store.
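The last point can be sketched as follows. `InMemorySecretStore` is a hypothetical stand-in for the Secret Store, and the random bytes stand in for real service account key material. Wiping the buffer is best effort only in Python, since the runtime may hold other copies of the data.

```python
import secrets
from typing import Dict


class InMemorySecretStore:
    """Hypothetical stand-in for the Secret Store; not a real API."""

    def __init__(self) -> None:
        self._secrets: Dict[str, bytes] = {}

    def put(self, name: str, value: bytes) -> None:
        self._secrets[name] = bytes(value)  # the store keeps its own copy

    def get(self, name: str) -> bytes:
        return self._secrets[name]


def create_and_store_key(store: InMemorySecretStore, account: str) -> None:
    """Create key material, hand it to the store, never touch the filesystem."""
    key = bytearray(secrets.token_bytes(32))  # placeholder key material
    try:
        store.put(account, key)
    finally:
        # Best-effort wipe of the local buffer once the store has its copy.
        for i in range(len(key)):
            key[i] = 0
```

The key exists only in process memory between creation and the `put` call, and no file handle is ever opened for it.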

Partly Cloudy
DemiGods execution flow

What happens when a user joins an LDAP pillar group?
DemiGod ⇔ Twitter user interaction
❏ A shadow account is created
  ❏ Added to Google group
❏ A GCS user bucket is created
  ❏ Scratch bucket
❏ Keys are generated
  ❏ Added to Secrets Store
❏ Keys are distributed, thereby enabling access to a Twitter managed Hadoop cluster in GCP & Presto cluster in GCP

What happens when a new dataset is added?
DemiGod ⇔ Twitter dataset interaction
❏ Dataset info is replicated to a YAML file in a GCS config bucket
❏ A GCS dataset bucket is created
  ❏ Scratch, Scrubbed, and Scratch-scrubbed buckets are also created
❏ Access privileges are granted
  ❏ Owner: read on original dataset, r/w on scratch & scrubbed
  ❏ Reader group: read on dataset, scrubbed
Thank you!
We are hiring
https://careers.twitter.com https://careers.google.com/cloud/

Appendix
Google Cloud Twitter
https://cloud.google.com/twitter/
