Twitter collects petabytes of data every day and supports large-scale data processing for its engineers and data scientists with a hybrid on-premises and cloud model. In this talk, we will look at its GCP architecture and resource hierarchy. We will dive deep into the storage design, which uses Google Cloud Storage to organize petabytes of data replicated from on-premises HDFS clusters. We will look at how the user-management tooling has been designed to create and manage access for thousands of accounts (human and service accounts) at Twitter. We will talk about how the design handles security measures for accounts and tooling systems running in GCP, as well as the complexities of dataset permissions. We will share the challenges we faced in designing our system at scale, along with the lessons we learned and the solutions we arrived at.
SEC302: Twitter's GCP Architecture for Its Petabyte-Scale Data Storage in GCS and User Identity Management
1. SEC302: Twitter's GCP Architecture for Its Petabyte-Scale Data Storage in GCS and User Identity Management
Vrushali Channapattan, Staff Engineer, Twitter
James Duke, Strategic Cloud Engineer, Google Cloud
2. Outline
● What is Partly Cloudy
● Architecture
● Project & Bucket Design
● User Identity Management
● DemiGod services
● Deployment
3. What is Partly Cloudy
A project to extend Data Processing at Twitter
from an on-premises only model
to a hybrid on-premises and Cloud model
4. Why Partly Cloudy
- A long-term desire to have some cloud presence
- The right strategy will balance developer agility, capabilities, and cost
- Provides a convenient way to test Hadoop changes at scale
- A broader geographical footprint for locality and business continuity
- Access to other Google offerings such as BigQuery, Cloud ML, and Cloud Dataflow
5. Design principles
- Authentication: strong authentication for all user and service access to data
- Authorization: explicit authorization for all user and service access to data
- Audit: ability to easily determine who performed what actions on the data
18. Key Management
- A new key is generated every N days
- Each key is valid for 2N + N days
- Keys are distributed to compute nodes by Twitter’s key
distribution service
- The shadow account key is readable only by that user
- Key management & distribution is transparent to the user
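The rotation schedule above can be sketched in a few lines. This is only an illustration, not Twitter's implementation: it reads the slide's "2N + N" literally as a validity of 3N days, and the start date, the value of N, and the function names are all hypothetical.

```python
from datetime import date, timedelta


def key_windows(start: date, n_days: int, count: int):
    """Return (created, expires) pairs for `count` consecutive keys.

    A new key is created every N days; each one is valid for 2N + N days
    (taken literally from the slide, i.e. 3N days).
    """
    step = timedelta(days=n_days)
    validity = timedelta(days=2 * n_days + n_days)
    return [(start + i * step, start + i * step + validity) for i in range(count)]


def valid_keys(windows, today: date):
    """Return the windows that contain `today` (expiry treated as exclusive)."""
    return [w for w in windows if w[0] <= today < w[1]]


# Hypothetical values: N = 30 days, first key created on 2019-01-01.
windows = key_windows(date(2019, 1, 1), n_days=30, count=6)
active = valid_keys(windows, date(2019, 4, 15))
print(len(active))  # 3 overlapping keys once the schedule is in steady state
```

Under this reading, a compute node that misses one or two distribution cycles still holds a key that has not expired, which is one plausible reason for making the validity period longer than the rotation period.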
21. What do the Data Processing Users at Twitter get
❏ A shadow account to access GCS
❏ A GCS bucket for their data
❏ Access to a Twitter-managed Hadoop cluster in GCP
❏ Access to a Twitter-managed Presto cluster in GCP
❏ To work with us to leverage other Google offerings (such as BigQuery, Cloud Dataproc & Cloud Dataflow)
25. What are DemiGod services
DemiGods are a group of services responsible for configuring GCP for Twitter’s Data Platform.
They run in GCP.
26. Salient features of DemiGods
- Run asynchronously of one another
- Run with exactly scoped, privileged Google service accounts
- Idempotent runs
- Puppet-like functionality: will override any manual changes
- Modular in design
- Each kept as simple as possible
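The idempotent, Puppet-like behavior listed above can be pictured as a reconcile loop that compares a desired state with the actual state and emits only the changes needed. The sketch below is an assumption about the general shape of such a loop, not the DemiGod code; the dictionary-based state, the function names, and the choice to delete unknown resources are all hypothetical.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Return the actions needed to make `actual` match `desired`."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name, spec))
        elif actual[name] != spec:
            # A manual change drifted from the desired state: override it.
            actions.append(("update", name, spec))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name, None))
    return actions


def apply_actions(actions: list, actual: dict) -> dict:
    """Apply actions to a copy of `actual` and return the new state."""
    state = dict(actual)
    for op, name, spec in actions:
        if op == "delete":
            state.pop(name)
        else:
            state[name] = spec
    return state


desired = {"bucket-a": {"readers": ["team-x"]}, "bucket-b": {"readers": []}}
actual = {"bucket-a": {"readers": ["team-x", "someone-manual"]}}

first_run = reconcile(desired, actual)
state = apply_actions(first_run, actual)
second_run = reconcile(desired, state)  # empty: a repeated run changes nothing
```

Because the second run produces no actions, the service can be rerun on any schedule without side effects, which is what makes asynchronous, independent runs safe.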
29. Shadow Account Management
❏ Creates shadow accounts
❏ Google service accounts
❏ One DemiGod
❏ Inputs
❏ Configurable pattern
❏ LDAP input
❏ YAML input
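As a rough illustration of the "configurable pattern" input, the sketch below derives a shadow service account address from an LDAP username. The pattern `{user}-shadow`, the project ID `example-project`, and the function name are hypothetical; the slides do not state the actual pattern. The length and character check reflects Google's documented constraints on service account IDs (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter).

```python
import re

# Service account ID: 6-30 chars, starts with a letter, ends with a letter or digit.
SA_ID_RE = re.compile(r"^[a-z][-a-z0-9]{4,28}[a-z0-9]$")


def shadow_account_email(ldap_user: str,
                         pattern: str = "{user}-shadow",       # hypothetical pattern
                         project: str = "example-project") -> str:  # hypothetical project
    """Build the service account email for a user's shadow account."""
    account_id = pattern.format(user=ldap_user.lower())
    if not SA_ID_RE.match(account_id):
        raise ValueError(f"invalid service account id: {account_id!r}")
    return f"{account_id}@{project}.iam.gserviceaccount.com"


email = shadow_account_email("alice")
```

Validating the generated ID up front means a bad pattern or an unusually long username fails before any call to GCP is attempted.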
30. Policy Management
❏ Creates IAM policies
❏ One DemiGod per pillar project
❏ Inputs
❏ LDAP input
❏ YAML input
❏ Google groups
❏ Ignore list
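The inputs above could be combined into an IAM policy roughly as follows. This is a sketch under stated assumptions: the role-to-groups mapping stands in for whatever the LDAP and YAML inputs provide, and the ignore list is assumed to name groups the service should skip, which the slides do not spell out. The output follows the standard IAM policy layout of `bindings` with a `role` and a list of `members`.

```python
def build_policy(role_to_groups: dict, ignore=frozenset()) -> dict:
    """Build an IAM-style policy dict from a role -> Google groups mapping.

    Groups named in `ignore` are skipped (an assumed meaning of "ignore list").
    """
    bindings = []
    for role in sorted(role_to_groups):
        members = sorted(
            f"group:{group}"
            for group in role_to_groups[role]
            if group not in ignore
        )
        if members:  # omit roles that end up with no members
            bindings.append({"role": role, "members": members})
    return {"bindings": bindings}


# Hypothetical input; the group addresses are placeholders.
policy = build_policy(
    {"roles/storage.objectViewer": ["readers@example.com", "legacy@example.com"]},
    ignore={"legacy@example.com"},
)
```

Sorting roles and members keeps the output deterministic, so two runs over the same inputs produce identical policies and can be compared directly.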
31. Key LifeCycle Management
❏ Creates keys with expiration
❏ Manages the key lifecycle every N days
❏ Adds them to Key Store
❏ Inputs
❏ destination for keys
❏ LDAP input
❏ Shadow account
34. Deployment considerations
- Demigods will run on GCE, with the VM running as a demigod service account
- Demigod service accounts will be created in the partly-cloudy-admin project, which has limited SSH access
- Demigod processes will run as a kerberized headless Twitter user
- The Demigod Key Creation Service shall NOT write service account keys to disk. It will hold them in memory until they are written to the Secret Store.
36. What happens when a user joins an LDAP pillar group?
Demigod ⇔ Twitter user interaction
37. What happens when a user joins an LDAP pillar group?
Demigod ⇔ Twitter user interaction
❏ A shadow account is created
❏ Added to Google group
❏ A GCS user bucket is created
❏ Scratch bucket
❏ Keys are generated
❏ Added to Secrets Store
❏ Keys are distributed, thereby enabling access to a Twitter-managed Hadoop cluster in GCP & Presto cluster in GCP
38. What happens when a new dataset is added?
Demigod ⇔ Twitter dataset interaction
39. What happens when a new dataset is added?
Demigod ⇔ Twitter dataset interaction
❏ Dataset info is replicated to a YAML file in a GCS config bucket
❏ A GCS dataset bucket is created
❏ Scratch, Scrubbed, and Scratch-scrubbed buckets are also created
❏ Access privileges are granted
❏ Owner: read on original dataset; r/w on scratch & scrubbed
❏ Reader group: read on dataset, scrubbed
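The access privileges listed on this slide can be written out as a small grants table. The sketch below encodes only what the slide states, read literally; the bucket naming scheme (`<dataset>-scratch`, `<dataset>-scrubbed`) and the function name are hypothetical, and no grants are shown for the scratch-scrubbed bucket because the slide does not specify any.

```python
def dataset_grants(dataset: str, owner: str, reader_group: str) -> list:
    """Return (bucket, principal, access) tuples for one dataset."""
    original = dataset                   # hypothetical bucket names
    scratch = f"{dataset}-scratch"
    scrubbed = f"{dataset}-scrubbed"
    return [
        (original, owner, "read"),         # owner: read on original dataset
        (scratch, owner, "read-write"),    # owner: r/w on scratch
        (scrubbed, owner, "read-write"),   # owner: r/w on scrubbed
        (original, reader_group, "read"),  # reader group: read on dataset
        (scrubbed, reader_group, "read"),  # reader group: read on scrubbed
    ]


grants = dataset_grants("clicks", "clicks-owner", "clicks-readers")
for bucket, principal, access in grants:
    print(f"{bucket}: {principal} -> {access}")
```

Note that, as stated on the slide, even the owner has only read access to the original dataset bucket; writes go to the scratch and scrubbed buckets.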
41. Thank you!
We are hiring
https://careers.twitter.com https://careers.google.com/cloud/