
Extending Twitter's Data Platform to Google Cloud



Twitter's Data Platform is built using multiple complex open source and in-house projects to support data analytics on hundreds of petabytes of data. Our platform supports storage, compute, data ingestion, discovery and management, along with various tools and libraries that help users with both batch and realtime analytics. Our Data Platform operates on multiple clusters across different data centers to help thousands of users discover valuable insights. As we were scaling our Data Platform to multiple clusters, we also evaluated various cloud vendors to support use cases outside of our data centers. In this talk we share our architecture and how we extended our data platform to use the cloud as another data center. We walk through our evaluation process and the challenges we faced supporting data analytics at Twitter scale on the cloud, and present our current solution. Extending Twitter's Data Platform to the cloud was a complex task, which we dive into in this presentation.

Published in: Technology


  1. Extending Twitter’s Data Platform to Google Cloud
     Lohit VijayaRenu, Vrushali Channapattan
  2. Data Platform @Twitter
     [Diagram labels: Oxpecker, Roneobird, Data Access Layer, ETL Pipelines]
  3. Why Cloud?
     - Provides a convenient way to test Hadoop changes at scale
     - Temporarily and rapidly grow / shrink
     - A broader geographical footprint for locality and business continuity
     - Access to other Google offerings such as BigQuery, CloudML, Cloud Dataflow, etc.
  4. Partly Cloudy
     A project to extend data processing at Twitter from an on-premises-only model to a hybrid on-premises and cloud model
  5. Before Partly Cloudy
  6. Partly Cloudy
  7. Design considerations
     - User Experience: consistency in user experience for on-premises and in-cloud data processing
     - Scalability: ability to scale out to handle all datasets and all users from day 1
     - Onboarding: seamless onboarding experience
     - New Avenues: data access in new processing tools in the cloud
  8. Design principles
     - Authentication: strong authentication for all user and service access to data
     - Authorization: explicit authorization for all user and service access to data; least-privileged access
     - Audit: ability to easily determine who performed what actions on the data
  9. Workstreams
     ● Various focus areas across the tech stack
       ○ Networking
       ○ GCP config
       ○ Replication
       ○ Data Processing Tools
       ○ Internal services
     ● Collaboration across teams within Twitter
     ● Collaboration with Google
  10. Partly Cloudy Data Replication
      Sync Datasets to GCS
  11. Data Infrastructure for Analytics
      [Diagram: two Hadoop clusters, each with a Replication Service and a Retention Service, and a Data Access Layer]
  12. Extending Replication to GCS
      [Diagram: DataCenter 1 and DataCenter 2 with Hadoop clusters M, N, C, Z, L, X-1 and X-2]
      ● Same dataset available on GCS for users
      ● Unlock Presto on GCP, Hadoop on GCP, BigQuery and other tools
  13. Data Replicator: Copy
      Source Cluster: /ClusterX/logs/partly-cloudy/2019/04/10/03
      Destination Cluster: /ClusterY/logs/partly-cloudy/2019/04/10/03
      Replicator (ClusterY): Distcp of 2019/04/10/03
      DAL Dataset: partly-cloudy
        /ClusterX/logs/partly-cloudy
        /ClusterY/logs/partly-cloudy
  14. Data Replicator: Copy + Merge
      Source Cluster: /ClusterX-1/logs/partly-cloudy/2019/04/10/03
      Source Cluster: /ClusterX-2/logs/partly-cloudy/2019/04/10/03
      Destination Cluster: /ClusterY/logs/partly-cloudy/2019/04/10/03
      Replicator (ClusterY): Distcp of 2019/04/10/03 from each source, then Merge
      DAL Dataset: partly-cloudy (Type: Multiple Src)
        /ClusterX-1/logs/partly-cloudy
        /ClusterX-2/logs/partly-cloudy
        /ClusterY/logs/partly-cloudy
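The copy planning on slides 13 and 14 can be sketched roughly as follows. This is a minimal illustration, not the replicator's actual code: the function names `partition_path` and `plan_copies` are hypothetical, the path layout is inferred from the slide examples, and the merge step itself is not modelled because the slides do not describe how it works.

```python
from datetime import datetime


def partition_path(cluster: str, dataset: str, hour: datetime) -> str:
    """Build the hourly partition path used in the slide examples."""
    return f"/{cluster}/logs/{dataset}/{hour:%Y/%m/%d/%H}"


def plan_copies(dataset, sources, destination, hour):
    """Return one (source, destination) path pair per source cluster.

    With a single source this is the plain copy case; with several
    sources every pair targets the same destination partition, which is
    where a merge would take place.
    """
    dst = partition_path(destination, dataset, hour)
    return [(partition_path(src, dataset, hour), dst) for src in sources]


plan = plan_copies(
    "partly-cloudy",
    ["ClusterX-1", "ClusterX-2"],
    "ClusterY",
    datetime(2019, 4, 10, 3),
)
for src, dst in plan:
    # Printed for illustration only; nothing is executed here.
    print("hadoop distcp", src, dst)
```

Keeping the plan as plain path pairs makes it easy to check which partitions would be copied before any DistCp job is launched.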
  15. Architecture behind GCS replication
      Twitter DataCenter
        Source Cluster: /ClusterX/logs/partly-cloudy/2019/04/10/03
        Copy Cluster, Replicator (GCS): Distcp
      GCS: /gcs/logs/partly-cloudy/2019/04/10/03
      DAL Dataset: partly-cloudy
        /ClusterX/logs/partly-cloudy
        /gcs/logs/partly-cloudy
  16. Merge same dataset on GCS (Multi Region Bucket)
      Twitter DataCenter X-1
        Source ClusterX-1: /ClusterX-1/logs/partly-cloudy/2019/04/10/03
        Copy Cluster X-1: Distcp
      Twitter DataCenter X-2
        Source ClusterX-2: /ClusterX-2/logs/partly-cloudy/2019/04/10/03
        Copy Cluster X-2: Distcp
      Cloud Storage (Multi Region Bucket): /gcs/logs/partly-cloudy/2019/04/10/03
  17. Dataset via EagleEye
      ● View different destinations for the same dataset
      ● GCS is another destination
      ● Also shows the delay for each hourly partition
  18. Partly Cloudy Resource Hierarchy
      Organization and Project structure
  19. Partly Cloudy Resource Hierarchy
      [Diagram: TWITTER Org > DATA INFRA Folder > GCP Projects (twitter-product, twitter-revenue, twitter-infraeng)]
  20. Project contents
      [Diagram labels: Project; Dataset bucket; User bucket; Scratch bucket; Scrubbed bucket; Google Cloud Storage Connector for Hadoop; ViewFS filesystem layer; Nest; Name Nodes; Worker Nodes; Resource Manager; Task; Shadow account based access; User account based access]
  21. Replicators per project
      [Diagram: in the Twitter DataCenter, a Copy Cluster runs Replicator X, Replicator Y and Replicator Z; each uses Distcp to copy to Cloud Storage in GCP Project X, Y and Z respectively]
      /gcs/dataX/2019/04/10/03
      /gcs/dataY/2019/04/10/03
      /gcs/dataZ/2019/04/10/03
  22. Partly Cloudy Resource Hierarchy
      Storage in the Cloud
  23. GCS
      On-premises path: /dc1/cluster1/user/helen/some/path/part-001.lzo
      Logical Cloud path: /gcs/user/helen/some/path/part-001.lzo
      GCS bucket path: gs://user.helen.dp.twitter.domain/some/path/part-001.lzo
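The three path forms on slide 23 can be related with a small sketch. The mapping rules below are inferred only from that single example: the function names are hypothetical, and the assumptions that the first two components are datacenter and cluster and that the bucket suffix is always `dp.twitter.domain` are illustrative.

```python
import re

# Bucket suffix copied from the slide example; assumed constant for this sketch.
BUCKET_SUFFIX = "dp.twitter.domain"


def to_logical_path(on_prem_path: str) -> str:
    """Drop the leading /<datacenter>/<cluster> and prefix the rest with /gcs."""
    match = re.match(r"^/[^/]+/[^/]+(/.*)$", on_prem_path)
    if match is None:
        raise ValueError(f"unexpected on-premises path: {on_prem_path}")
    return "/gcs" + match.group(1)


def to_bucket_path(logical_path: str) -> str:
    """Turn /gcs/<kind>/<name>/<rest> into gs://<kind>.<name>.<suffix>/<rest>."""
    match = re.match(r"^/gcs/(user|logs)/([^/]+)(/.*)?$", logical_path)
    if match is None:
        raise ValueError(f"unexpected logical path: {logical_path}")
    kind, name, rest = match.groups()
    return f"gs://{kind}.{name}.{BUCKET_SUFFIX}{rest or ''}"


logical = to_logical_path("/dc1/cluster1/user/helen/some/path/part-001.lzo")
bucket = to_bucket_path(logical)
print(logical)
print(bucket)
```

The point of the logical `/gcs/...` form is that users keep a filesystem-style path while the per-user or per-dataset bucket name stays an implementation detail.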
  24. RegEx based path resolution
      Twitter ViewFS mounttable.xml:
        <property>
          <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/logs/(?!((tst|test)(_|-)))(?&lt;dataset&gt;[^/]+)</name>
          <value>gs://logs.${dataset}</value>
        </property>
        <property>
          <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/user/(?!((tst|test)(_|-)))(?&lt;userName&gt;[^/]+)</name>
          <value>gs://user.${userName}</value>
        </property>
      Twitter ViewFS Path -> GCS bucket:
        /gcs/logs/partly-cloudy/2019/04/10 -> gs://logs.partly-cloudy/2019/04/10
        /gcs/user/lohit/hadoop-stats -> gs://user.lohit/hadoop-stats
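To see what the two mount-table rules on slide 24 match, here is a rough Python rendering of the same regular expressions. It is a sketch, not ViewFS itself: the named groups are rewritten in Python syntax (`(?P<name>...)` instead of `(?<name>...)`), the `replaceresolveddstpath` interceptors are deliberately left out, and the `resolve` function and `RULES` table are hypothetical names.

```python
import re

# (pattern, destination template) pairs mirroring the two <property> entries.
RULES = [
    (re.compile(r"^/gcs/logs/(?!((tst|test)(_|-)))(?P<dataset>[^/]+)"),
     "gs://logs.{dataset}"),
    (re.compile(r"^/gcs/user/(?!((tst|test)(_|-)))(?P<userName>[^/]+)"),
     "gs://user.{userName}"),
]


def resolve(path: str):
    """Return the GCS location for a ViewFS path, or None if no rule matches."""
    for pattern, template in RULES:
        match = pattern.match(path)
        if match:
            # Substitute the captured name, then keep the rest of the path.
            return template.format(**match.groupdict()) + path[match.end():]
    return None


print(resolve("/gcs/logs/partly-cloudy/2019/04/10"))
print(resolve("/gcs/user/lohit/hadoop-stats"))
print(resolve("/gcs/logs/tst_scratch/2019/04/10"))
```

The negative lookahead means names starting with `tst_`, `tst-`, `test_` or `test-` fall through these rules, while a name such as `testing` still matches.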
  25. View FileSystem and Google Hadoop Connector
      Bucket on GCS: gs://logs.partly-cloudy
      Connector Path: /logs/partly-cloudy
      Twitter Resolved Path: /gcs/logs/partly-cloudy
      [Diagram: Twitter’s View FileSystem spans the Cluster-X, Cluster-Y and Cluster-Z namespaces in DataCenter-1 and DataCenter-2, plus Cloud Storage via the Cloud Storage Connector and the Replicator]
  26. Partly Cloudy Resource Hierarchy
      User Management
  27. Users & Accounts
      [Diagram labels: User; UNIX Kerberos credentials; GSuite OAuth2 credentials; Shadow account (GCP Service account); Shadow account JSON key]
  28. Key Management
      - A new key is generated every N days
      - Each key is valid for 2N + N days
      - Keys are distributed to compute nodes by Twitter’s key distribution service
      - The shadow account key is readable only by that user
      - Key management & distribution is transparent to the user
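The rotation arithmetic on slide 28 can be worked through with a short sketch. It assumes keys are generated on days 0, N, 2N and so on, and reads "2N + N days" as a lifetime of 3N days; the function name and the example value N = 30 are hypothetical, since the slides do not give N.

```python
def valid_key_generations(day: int, n: int) -> list:
    """List the generation days of all keys still valid on `day`.

    Keys are assumed to be generated on days 0, n, 2n, ... and each key
    is valid for 2n + n days starting from its generation day.
    """
    lifetime = 2 * n + n
    newest = (day // n) * n  # generation day of the most recent key
    return [g for g in range(newest, -1, -n) if day < g + lifetime]


# Hypothetical example with N = 30 days.
print(valid_key_generations(65, 30))
print(valid_key_generations(95, 30))
```

Under these assumptions, three keys overlap at any time once rotation has been running for a while, so a node that still holds an older key keeps working while newer keys are being distributed.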
  29. Partly Cloudy Resource Hierarchy
      [Diagram: DATA INFRA > twitter-[org], twitter-employee-users]
  30. How do the Data Processing Users at Twitter get to use Partly Cloudy?
      DemiGod Services
  31. What are DemiGod services
      DemiGod is a group of services responsible for configuring GCP for Twitter’s Data Platform. They run in GCP.
  32. Salient features of DemiGods
      - Run asynchronously of each other
      - Run with exactly-scoped, privileged Google service accounts
      - Idempotent runs
      - Puppet-like functionality: will override any manual changes
      - Modular in design
      - Each kept as simple as possible
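The "idempotent" and "Puppet-like" properties on slide 32 can be illustrated with a tiny reconciliation loop. This is a sketch under stated assumptions, not the DemiGod implementation: plain dictionaries stand in for GCP state, no real GCP API is called, the resource and account names are made up, and leaving unmanaged resources untouched is an assumption the slides do not confirm.

```python
def reconcile(desired: dict, actual: dict) -> list:
    """Make `actual` match `desired` and return the actions that were taken.

    Resources present in `actual` but absent from `desired` are left
    alone (an assumption of this sketch).
    """
    actions = []
    for name, config in desired.items():
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != config:
            actions.append(("override", name))
        else:
            continue  # already in the desired state, nothing to do
        actual[name] = dict(config)  # copy so later edits do not alias
    return actions


# Hypothetical bucket configuration used only for illustration.
desired = {"logs.partly-cloudy": {"reader": "svc-acc-example"}}
actual = {}

first = reconcile(desired, actual)    # creates the missing bucket entry
second = reconcile(desired, actual)   # idempotent: nothing left to do
actual["logs.partly-cloudy"]["reader"] = "manual-change"
third = reconcile(desired, actual)    # manual change is overridden
print(first, second, third)
```

Because each run only compares desired and actual state, independent services of this shape can run asynchronously and be re-run safely after a failure.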
  33. Deployment of DemiGods
      [Diagram labels: Twitter infra eng project; Twitter product project; Twitter user project; Partly Cloudy Admin Project; bucket-creation-ie org (svc-acc-ie); bucket-creation Product (svc-acc-product); shadow-user-creation; policy-granting-ie; key-rotation/creation; Key/Secrets store; LDAP/Google Groups; GCS Config bucket]
  34. What do the Data Processing Users at Twitter get
      ❏ Datasets replicated on GCS
      ❏ A shadow account to access GCS
      ❏ GCS buckets for their scratch & scrubbed data
      ❏ Access to a Twitter managed Hadoop cluster in GCP
      ❏ Access to a Twitter managed Presto cluster in GCP
      ❏ Exploring other Google offerings (such as BigQuery, DataProc & DataFlow)
  35. Where are we today
      ● Copied tens of petabytes of data and keep it in sync
      ● Tens of different projects with hundreds of buckets
      ● Complex set of VPC rules
      ● Hundreds of users using GCP
      ● Unlocked multiple use cases on GCP
  36. Thank you!
      Hiring. Tweet @TwitterHadoop