SlideShare a Scribd company logo
Extending Twitter’s
Data Platform to
Google Cloud
1
Lohit VijayaRenu , Vrushali Channapattan
Data Platform @Twitter
Oxpecker
Roneobird
Data Access
Layer
ETL
Pipelines
Why Cloud?
- Provides a convenient way to test Hadoop changes at scale
- Temporarily rapidly grow / shrink
- A broader geographical footprint for locality and business continuity
- Access to other Google offerings such as BigQuery, CloudML, Cloud
DataFlow etc
Partly Cloudy
A project to extend Data Processing at Twitter
from an on-premises only model
to a hybrid on-premises and Cloud model
Before Partly Cloudy
Partly Cloudy
Design considerations
User Experience
Consistency in
user experience
for on-premises
& in cloud data
processing
Scalability
Ability scale out
to handle all
datasets & all
users from day
1
Onboarding
Seamless
onboarding
experience
New Avenues
Data access in
new processing
tools in cloud
Design principles
Authentication
Strong authentication
for all user and service
access to data
Authorization
Explicit authorization
for all user and service
access to data
Least privileged access
Audit
Ability to easily
determine who
performed what
actions on the data
Workstreams
● Various focus areas across the tech stack
○ Networking
○ GCP config
○ Replication
○ Data Processing Tools
○ Internal services
● Collaboration across teams within Twitter
● Collaboration with Google
Partly Cloudy Data Replication
Sync Datasets to GCS
Data Infrastructure for Analytics
`
Hadoop Cluster
Data
Access
Layer
Replication Service
Retention Service
Hadoop Cluster
Replication Service
Retention Service
Extending Replication to GCS
DataCenter 2DataCenter 1
Hadoop
ClusterM
Hadoop
ClusterN
Hadoop
ClusterC
Hadoop
ClusterZ
Hadoop
ClusterX-2
Hadoop
ClusterL
Hadoop
ClusterX-1
● Same dataset
available on GCS for
users
● Unlock Presto on
GCP, Hadoop on
GCP, BigQuery and
other tools
Destination Cluster
/ClusterY/logs/partly-cloudy/
2019/04/10/03
Data Replicator Copy
Source Cluster
/ClusterX/logs/partly-cloudy/
2019/04/10/03
Replicator : ClusterY
Distcp
2019/04/10/03
DAL
Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
Destination Cluster
/ClusterY/logs/partly-cloudy/
2019/04/10/03
Data Replicator Copy + Merge
Source Cluster
/ClusterX-2/logs/partly-cloudy/
2019/04/10/03
Replicator : ClusterY
Distcp
2019/04/10/03
DAL
Dataset : partly-cloudy
/ClusterX-1/logs/partly-cloudy
/ClusterX-2/logs/partly-cloudy
/ClusterY/logs/partly-cloudy
Type : Multiple Src
Source Cluster
/ClusterX-1/logs/partly-cloudy/
2019/04/10/03
Distcp
2019/04/10/03
Merge
Twitter
DataCenter
Architecture behind GCS replication
Copy Cluster
GCS
/gcs/logs/partly-cloud
/2019/04/10/03
Replicator : GCS
DAL
Source Cluster
/ClusterX/logs/partly-cloudy/
2019/04/10/03
Distcp
Dataset : partly-cloudy
/ClusterX/logs/partly-cloudy
/gcs/logs/partly-cloudy
Merge same dataset on GCS (Multi Region Bucket)
Twitter DataCenter X-2
Copy Cluster X-2
/gcs/logs/partly-
cloudy/2019/04/
10/03
Source ClusterX-2
/ClusterX-2/logs/partly-
cloudy//2019/04/10/03
Twitter DataCenter X-1
Copy Cluster X-1Source ClusterX-1
/ClusterX-1/logs/partly-
cloudy/2019/04/10/03
Distcp
Multi Region
Bucket
Distcp
Cloud Storage
Dataset via EagleEye
● View different
destination for
same dataset
● GCS is another
destination
● Also shows delay
for each hourly
partition
Partly Cloudy Resource Hierarchy
Organization and Project
structure
Partly Cloudy Resource Hierarchy
TWITTER Org
DATA INFRA
Folder
twitter-
product
twitter-revenue
twitter-infraeng GCP
Projects
Project
Dataset
bucket
User Bucket
Google Cloud Storage
Connector for Hadoop
Google Cloud Storage
Connector for Hadoop
Nest
Name
Nodes
Worker Nodes
Resource
Manager
Task
ViewFS filesystem layer
ViewFS filesystem layer
Shadow account based
access
User account based access
User account based access
Scratch
bucket
Scrubbed
bucket
Project contents
GCP Project ZGCP Project YGCP Project X
Replicators per project
Twitter DataCenter
Copy Cluster
/gcs/dataX/2019/
04/10/03
/gcs/dataY/2019/
04/10/03
/gcs/dataZ/2019/
04/10/03
DistcpDistcp
DistcpDistcp DistcpDistcp
Replicator X Replicator Y Replicator Z
Cloud Storage Cloud Storage Cloud Storage
Partly Cloudy Resource Hierarchy
Storage in the Cloud
GCS
On-premises
path
/dc1/cluster1/user/
helen/some/path/par
t-001.lzo
Logical Cloud
path
/gcs/user/helen/
some/path/part-
001.lzo
GCS bucket path
gs://user.helen.dp.
twitter.domain/some
/path/part-001.lzo
RegEx based path resolution
<property>
<name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--
;replaceresolveddstpath:_:-#.^/gcs/logs/(?!((tst|test)(_|-)))(?&lt;dataset&gt;[^/]+)</name>
<value>gs://logs.${dataset}</value>
</property>
<property> <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--
;replaceresolveddstpath:_:-#.^/gcs/user/(?!((tst|test)(_|-)))(?&lt;userName&gt;[^/]+)</name>
<value>gs://user.${userName}</value>
</property>
/gcs/logs/partly-
cloudy/2019/04/10
/gcs/user/lohit/hadoop-stats
gs://logs.partly-
cloudy/2019/04/10
gs://user.lohit/hadoop-stats
Twitter ViewFS Path GCS bucket
Twitter ViewFS mounttable.xml
Bucket on GCS : gs://logs.partly-cloudy
Connector Path : /logs/partly-cloudy
Twitter Resolved Path : /gcs/logs/partly-cloudy
View FileSystem and Google Hadoop Connector
Twitter’s View FileSystem
Cluster-X Cluster-Y ClusterZ
Namespace 1 Namespace 2 Namespace 1 Namespace 1 Namespace 2
DataCenter-1 DataCenter-2 Cloud Storage
Connector
Replicator
Cloud Storage
Partly Cloudy Resource Hierarchy
User Management
User
UNIX
Kerberos
credentials
GSuite
OAuth2
credentials
GSuite
OAuth2
credentials
Shadow account
(GCP Service account)
Users & Accounts
Shadow account
Json key
Key Management
- A new key is generated every N days
- Each key is valid for 2N + N days
- Keys are distributed to compute nodes by Twitter’s key
distribution service
- The shadow account key is readable only by that user
- Key management & distribution is transparent to the user
Partly Cloudy Resource Hierarchy
DATA INFRA
twitter-[org]
twitter-
employee-users
How
do the
Data Processing Users
at Twitter get
to use Partly Cloudy
DemiGod
Services
What are DemiGod services
Demigod is a group of service(s) that are responsible for
configuring GCP for Twitter’s Data Platform.
They run in GCP.
Salient features of DemiGods
- Run asynchronously of each other.
- Run with exactly-scoped, privileged google service accounts
- Idempotent runs
- Puppet-like functionality. Will override any manual changes
- Modular in design
- Each kept as simple as possible
Twitter infra eng project Twitter product project
Partly Cloudy Admin Project
Twitter user project
bucket-creation
-ie org (svc-acc-ie)
bucket-creation
Product (svc-acc-
product)
shadow-user-
creation
policy-granting-ie
Key/
Secrets
store
LDAP/Googl
e Groups
GCS Config
bucket
key-
rotation/creation
Deployment of DemiGods
What
do the
Data Processing
Users
at Twitter get
❏ Datasets replicated on GCS
❏ A shadow account to access GCS
❏ GCS buckets for their scratch &
scrubbed data
❏ Access to a Twitter managed
Hadoop cluster in GCP
❏ Access to a Twitter managed
Presto cluster in GCP
❏ Exploring other Google offerings
(such as BigQuery, DataProc & DataFlow)
● Copied tens of petabytes of data
and keeping them in sync
● Tens of different projects with
hundreds of buckets
● Complex set of VPC rules
● Hundreds of users using GCP
● Unlocked multiple use cases on
GCP
Where are we today
Thank you!
Hiring https://careers.twitter.com
Tweet @TwitterHadoop

More Related Content

What's hot

Snowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesSnowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD Pipelines
Drew Hansen
 
Oracle 21c: New Features and Enhancements of Data Pump & TTS
Oracle 21c: New Features and Enhancements of Data Pump & TTSOracle 21c: New Features and Enhancements of Data Pump & TTS
Oracle 21c: New Features and Enhancements of Data Pump & TTS
Christian Gohmann
 
PostgreSQL High Availability in a Containerized World
PostgreSQL High Availability in a Containerized WorldPostgreSQL High Availability in a Containerized World
PostgreSQL High Availability in a Containerized World
Jignesh Shah
 
PostgreSQL
PostgreSQLPostgreSQL
Em13c New Features- Two of Two
Em13c New Features- Two of TwoEm13c New Features- Two of Two
Em13c New Features- Two of Two
Kellyn Pot'Vin-Gorman
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
Till Rohrmann
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
StreamNative
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
Flink Forward
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
Julien Le Dem
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
Amazon Web Services
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
Rostislav Pashuto
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
Sid Anand
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
Flink Forward
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
DataWorks Summit
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
Dongmin Yu
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
Amazon Web Services
 
Rocks db state store in structured streaming
Rocks db state store in structured streamingRocks db state store in structured streaming
Rocks db state store in structured streaming
Balaji Mohanam
 
MAA Best Practices for Oracle Database 19c
MAA Best Practices for Oracle Database 19cMAA Best Practices for Oracle Database 19c
MAA Best Practices for Oracle Database 19c
Markus Michalewicz
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slides
metsarin
 

What's hot (20)

Snowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD PipelinesSnowflake Automated Deployments / CI/CD Pipelines
Snowflake Automated Deployments / CI/CD Pipelines
 
Oracle 21c: New Features and Enhancements of Data Pump & TTS
Oracle 21c: New Features and Enhancements of Data Pump & TTSOracle 21c: New Features and Enhancements of Data Pump & TTS
Oracle 21c: New Features and Enhancements of Data Pump & TTS
 
PostgreSQL High Availability in a Containerized World
PostgreSQL High Availability in a Containerized WorldPostgreSQL High Availability in a Containerized World
PostgreSQL High Availability in a Containerized World
 
PostgreSQL
PostgreSQLPostgreSQL
PostgreSQL
 
Em13c New Features- Two of Two
Em13c New Features- Two of TwoEm13c New Features- Two of Two
Em13c New Features- Two of Two
 
Introduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processingIntroduction to Apache Flink - Fast and reliable big data processing
Introduction to Apache Flink - Fast and reliable big data processing
 
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
Unlocking the Power of Lakehouse Architectures with Apache Pulsar and Apache ...
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Data and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineageData and AI summit: data pipelines observability with open lineage
Data and AI summit: data pipelines observability with open lineage
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
Building Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache AirflowBuilding Better Data Pipelines using Apache Airflow
Building Better Data Pipelines using Apache Airflow
 
The Current State of Table API in 2022
The Current State of Table API in 2022The Current State of Table API in 2022
The Current State of Table API in 2022
 
Unified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache SamzaUnified Batch & Stream Processing with Apache Samza
Unified Batch & Stream Processing with Apache Samza
 
Presto anatomy
Presto anatomyPresto anatomy
Presto anatomy
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Rocks db state store in structured streaming
Rocks db state store in structured streamingRocks db state store in structured streaming
Rocks db state store in structured streaming
 
MAA Best Practices for Oracle Database 19c
MAA Best Practices for Oracle Database 19cMAA Best Practices for Oracle Database 19c
MAA Best Practices for Oracle Database 19c
 
PostgreSQL Database Slides
PostgreSQL Database SlidesPostgreSQL Database Slides
PostgreSQL Database Slides
 

Similar to Extending Twitter's Data Platform to Google Cloud

Extending twitter's data platform to google cloud
Extending twitter's data platform to google cloud Extending twitter's data platform to google cloud
Extending twitter's data platform to google cloud
Vrushali Channapattan
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
lohitvijayarenu
 
SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs...
SEC302  Twitter's GCP Architecture for its petabyte scale data storage in gcs...SEC302  Twitter's GCP Architecture for its petabyte scale data storage in gcs...
SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs...
Vrushali Channapattan
 
Managing 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in CloudManaging 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in Cloud
lohitvijayarenu
 
Cloud Composer workshop at Airflow Summit 2023.pdf
Cloud Composer workshop at Airflow Summit 2023.pdfCloud Composer workshop at Airflow Summit 2023.pdf
Cloud Composer workshop at Airflow Summit 2023.pdf
Leah Cole
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
HostedbyConfluent
 
Best practices for application migration to public clouds interop presentation
Best practices for application migration to public clouds interop presentationBest practices for application migration to public clouds interop presentation
Best practices for application migration to public clouds interop presentation
esebeus
 
仕事ではじめる機械学習
仕事ではじめる機械学習仕事ではじめる機械学習
仕事ではじめる機械学習
Aki Ariga
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
Alluxio, Inc.
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Denodo
 
GDG Cloud Southlake #8 Steve Cravens: Infrastructure as-Code (IaC) in 2022: ...
GDG Cloud Southlake #8  Steve Cravens: Infrastructure as-Code (IaC) in 2022: ...GDG Cloud Southlake #8  Steve Cravens: Infrastructure as-Code (IaC) in 2022: ...
GDG Cloud Southlake #8 Steve Cravens: Infrastructure as-Code (IaC) in 2022: ...
James Anderson
 
Informatica big data relational topics and presentation
Informatica big data relational topics and presentationInformatica big data relational topics and presentation
Informatica big data relational topics and presentation
Janardhan Reddy
 
ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox
Tsahi Glik
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
Yahoo Developer Network
 
How @twitterhadoop chose google cloud
How @twitterhadoop chose google cloudHow @twitterhadoop chose google cloud
How @twitterhadoop chose google cloud
lohitvijayarenu
 
Journey to Containerized Application / Google Container Engine
Journey to Containerized Application / Google Container EngineJourney to Containerized Application / Google Container Engine
Journey to Containerized Application / Google Container Engine
Google Cloud Platform - Japan
 
GCP-pde.pdf
GCP-pde.pdfGCP-pde.pdf
GCP-pde.pdf
NirajKumar938204
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Alluxio, Inc.
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
DataWorks Summit
 

Similar to Extending Twitter's Data Platform to Google Cloud (20)

Extending twitter's data platform to google cloud
Extending twitter's data platform to google cloud Extending twitter's data platform to google cloud
Extending twitter's data platform to google cloud
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs...
SEC302  Twitter's GCP Architecture for its petabyte scale data storage in gcs...SEC302  Twitter's GCP Architecture for its petabyte scale data storage in gcs...
SEC302 Twitter's GCP Architecture for its petabyte scale data storage in gcs...
 
Managing 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in CloudManaging 100s of PetaBytes of data in Cloud
Managing 100s of PetaBytes of data in Cloud
 
Cloud Composer workshop at Airflow Summit 2023.pdf
Cloud Composer workshop at Airflow Summit 2023.pdfCloud Composer workshop at Airflow Summit 2023.pdf
Cloud Composer workshop at Airflow Summit 2023.pdf
 
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...How a distributed graph analytics platform uses Apache Kafka for data ingesti...
How a distributed graph analytics platform uses Apache Kafka for data ingesti...
 
Best practices for application migration to public clouds interop presentation
Best practices for application migration to public clouds interop presentationBest practices for application migration to public clouds interop presentation
Best practices for application migration to public clouds interop presentation
 
仕事ではじめる機械学習
仕事ではじめる機械学習仕事ではじめる機械学習
仕事ではじめる機械学習
 
Hybrid data lake on google cloud with alluxio and dataproc
Hybrid data lake on google cloud  with alluxio and dataprocHybrid data lake on google cloud  with alluxio and dataproc
Hybrid data lake on google cloud with alluxio and dataproc
 
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical DemonstrationMaximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
Maximizing Data Lake ROI with Data Virtualization: A Technical Demonstration
 
GDG Cloud Southlake #8 Steve Cravens: Infrastructure as-Code (IaC) in 2022: ...
GDG Cloud Southlake #8  Steve Cravens: Infrastructure as-Code (IaC) in 2022: ...GDG Cloud Southlake #8  Steve Cravens: Infrastructure as-Code (IaC) in 2022: ...
GDG Cloud Southlake #8 Steve Cravens: Infrastructure as-Code (IaC) in 2022: ...
 
Informatica big data relational topics and presentation
Informatica big data relational topics and presentationInformatica big data relational topics and presentation
Informatica big data relational topics and presentation
 
ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox ML Infrastracture @ Dropbox
ML Infrastracture @ Dropbox
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
How @twitterhadoop chose google cloud
How @twitterhadoop chose google cloudHow @twitterhadoop chose google cloud
How @twitterhadoop chose google cloud
 
Journey to Containerized Application / Google Container Engine
Journey to Containerized Application / Google Container EngineJourney to Containerized Application / Google Container Engine
Journey to Containerized Application / Google Container Engine
 
GCP-pde.pdf
GCP-pde.pdfGCP-pde.pdf
GCP-pde.pdf
 
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
Integrating Google Cloud Dataproc with Alluxio for faster performance in the ...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for successArchitecting a Scalable Hadoop Platform: Top 10 considerations for success
Architecting a Scalable Hadoop Platform: Top 10 considerations for success
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
Transforming and Scaling Large Scale Data Analytics: Moving to a Cloud-based ...
 

Recently uploaded

Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 

Recently uploaded (20)

Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
 
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 

Extending Twitter's Data Platform to Google Cloud

  • 1. Extending Twitter’s Data Platform to Google Cloud 1 Lohit VijayaRenu , Vrushali Channapattan
  • 3. Why Cloud? - Provides a convenient way to test Hadoop changes at scale - Temporarily rapidly grow / shrink - A broader geographical footprint for locality and business continuity - Access to other Google offerings such as BigQuery, CloudML, Cloud DataFlow etc
  • 4. Partly Cloudy A project to extend Data Processing at Twitter from an on-premises only model to a hybrid on-premises and Cloud model
  • 7. Design considerations User Experience Consistency in user experience for on-premises & in cloud data processing Scalability Ability scale out to handle all datasets & all users from day 1 Onboarding Seamless onboarding experience New Avenues Data access in new processing tools in cloud
  • 8. Design principles Authentication Strong authentication for all user and service access to data Authorization Explicit authorization for all user and service access to data Least privileged access Audit Ability to easily determine who performed what actions on the data
  • 9. Workstreams ● Various focus areas across the tech stack ○ Networking ○ GCP config ○ Replication ○ Data Processing Tools ○ Internal services ● Collaboration across teams within Twitter ● Collaboration with Google
  • 10. Partly Cloudy Data Replication Sync Datasets to GCS
  • 11. Data Infrastructure for Analytics ` Hadoop Cluster Data Access Layer Replication Service Retention Service Hadoop Cluster Replication Service Retention Service
  • 12. Extending Replication to GCS DataCenter 2DataCenter 1 Hadoop ClusterM Hadoop ClusterN Hadoop ClusterC Hadoop ClusterZ Hadoop ClusterX-2 Hadoop ClusterL Hadoop ClusterX-1 ● Same dataset available on GCS for users ● Unlock Presto on GCP, Hadoop on GCP, BigQuery and other tools
  • 13. Destination Cluster /ClusterY/logs/partly-cloudy/ 2019/04/10/03 Data Replicator Copy Source Cluster /ClusterX/logs/partly-cloudy/ 2019/04/10/03 Replicator : ClusterY Distcp 2019/04/10/03 DAL Dataset : partly-cloudy /ClusterX/logs/partly-cloudy /ClusterY/logs/partly-cloudy
  • 14. Destination Cluster /ClusterY/logs/partly-cloudy/ 2019/04/10/03 Data Replicator Copy + Merge Source Cluster /ClusterX-2/logs/partly-cloudy/ 2019/04/10/03 Replicator : ClusterY Distcp 2019/04/10/03 DAL Dataset : partly-cloudy /ClusterX-1/logs/partly-cloudy /ClusterX-2/logs/partly-cloudy /ClusterY/logs/partly-cloudy Type : Multiple Src Source Cluster /ClusterX-1/logs/partly-cloudy/ 2019/04/10/03 Distcp 2019/04/10/03 Merge
  • 15. Twitter DataCenter Architecture behind GCS replication Copy Cluster GCS /gcs/logs/partly-cloud /2019/04/10/03 Replicator : GCS DAL Source Cluster /ClusterX/logs/partly-cloudy/ 2019/04/10/03 Distcp Dataset : partly-cloudy /ClusterX/logs/partly-cloudy /gcs/logs/partly-cloudy
  • 16. Merge same dataset on GCS (Multi Region Bucket) Twitter DataCenter X-2 Copy Cluster X-2 /gcs/logs/partly- cloudy/2019/04/ 10/03 Source ClusterX-2 /ClusterX-2/logs/partly- cloudy//2019/04/10/03 Twitter DataCenter X-1 Copy Cluster X-1Source ClusterX-1 /ClusterX-1/logs/partly- cloudy/2019/04/10/03 Distcp Multi Region Bucket Distcp Cloud Storage
  • 17. Dataset via EagleEye ● View different destination for same dataset ● GCS is another destination ● Also shows delay for each hourly partition
  • 18. Partly Cloudy Resource Hierarchy Organization and Project structure
  • 19. Partly Cloudy Resource Hierarchy TWITTER Org DATA INFRA Folder twitter- product twitter-revenue twitter-infraeng GCP Projects
  • 20. Project Dataset bucket User Bucket Google Cloud Storage Connector for Hadoop Google Cloud Storage Connector for Hadoop Nest Name Nodes Worker Nodes Resource Manager Task ViewFS filesystem layer ViewFS filesystem layer Shadow account based access User account based access User account based access Scratch bucket Scrubbed bucket Project contents
  • 21. GCP Project ZGCP Project YGCP Project X Replicators per project Twitter DataCenter Copy Cluster /gcs/dataX/2019/ 04/10/03 /gcs/dataY/2019/ 04/10/03 /gcs/dataZ/2019/ 04/10/03 DistcpDistcp DistcpDistcp DistcpDistcp Replicator X Replicator Y Replicator Z Cloud Storage Cloud Storage Cloud Storage
  • 22. Partly Cloudy Resource Hierarchy Storage in the Cloud
  • 24. RegEx based path resolution <property> <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:-- ;replaceresolveddstpath:_:-#.^/gcs/logs/(?!((tst|test)(_|-)))(?&lt;dataset&gt;[^/]+)</name> <value>gs://logs.${dataset}</value> </property> <property> <name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:-- ;replaceresolveddstpath:_:-#.^/gcs/user/(?!((tst|test)(_|-)))(?&lt;userName&gt;[^/]+)</name> <value>gs://user.${userName}</value> </property> /gcs/logs/partly- cloudy/2019/04/10 /gcs/user/lohit/hadoop-stats gs://logs.partly- cloudy/2019/04/10 gs://user.lohit/hadoop-stats Twitter ViewFS Path GCS bucket Twitter ViewFS mounttable.xml
  • 25. Bucket on GCS : gs://logs.partly-cloudy Connector Path : /logs/partly-cloudy Twitter Resolved Path : /gcs/logs/partly-cloudy View FileSystem and Google Hadoop Connector Twitter’s View FileSystem Cluster-X Cluster-Y ClusterZ Namespace 1 Namespace 2 Namespace 1 Namespace 1 Namespace 2 DataCenter-1 DataCenter-2 Cloud Storage Connector Replicator Cloud Storage
  • 26. Partly Cloudy Resource Hierarchy User Management
  • 28. Key Management - A new key is generated every N days - Each key is valid for 2N + N days - Keys are distributed to compute nodes by Twitter’s key distribution service - The shadow account key is readable only by that user - Key management & distribution is transparent to the user
  • 29. Partly Cloudy Resource Hierarchy DATA INFRA twitter-[org] twitter- employee-users
  • 30. How do the Data Processing Users at Twitter get to use Partly Cloudy DemiGod Services
  • 31. What are DemiGod services Demigod is a group of service(s) that are responsible for configuring GCP for Twitter’s Data Platform. They run in GCP.
  • 32. Salient features of DemiGods - Run asynchronously of each other. - Run with exactly-scoped, privileged google service accounts - Idempotent runs - Puppet-like functionality. Will override any manual changes - Modular in design - Each kept as simple as possible
  • 33. Twitter infra eng project Twitter product project Partly Cloudy Admin Project Twitter user project bucket-creation -ie org (svc-acc-ie) bucket-creation Product (svc-acc- product) shadow-user- creation policy-granting-ie Key/ Secrets store LDAP/Googl e Groups GCS Config bucket key- rotation/creation Deployment of DemiGods
  • 34. What do the Data Processing Users at Twitter get ❏ Datasets replicated on GCS ❏ A shadow account to access GCS ❏ GCS buckets for their scratch & scrubbed data ❏ Access to a Twitter managed Hadoop cluster in GCP ❏ Access to a Twitter managed Presto cluster in GCP ❏ Exploring other Google offerings (such as BigQuery, DataProc & DataFlow)
  • 35. ● Copied tens of petabytes of data and keeping them in sync ● Tens of different projects with hundreds of buckets ● Complex set of VPC rules ● Hundreds of users using GCP ● Unlocked multiple use cases on GCP Where are we today

Editor's Notes

  1. To transfer data from on-premise to GCS Runs only yarn for GCS transfer, no local data Security Minimal in-DC hosts connect to GCS Networking Dedicated high bandwidth Requires separate dedicated configuration for routing to public end-points Each worker node has two IP addresses. Our DC RFC space that can't be used on the public Internet GCS traffic uses public IP Internal traffic (reading from cluster, observability, puppet, etc) uses internal IP
  2. Data is identified by a dataset name HDFS is the primary storage for Analytics Users configure replication rules for different clusters Dataset also has retention rules defined per cluster Dataset are always represented on fixed interval partitions (hourly/daily) Dataset is defined in system called Data Access Layer (DAL)* Data is made available at different destination using Replicator
  3. Long running daemon (on mesos) Daemon checks configuration and schedules copy on hourly partition Copy jobs are executed as Hadoop distcp jobs Jobs are on destination cluster After hourly copy, publish partition to DAL
  4. Some datasets are collected across multiple DataCenters Replicator would kick off multiple DistCP jobs to copy at tmp location Replicator then merges dataset into single directory and does atomic rename to final destination Renames on HDFS are cheap and atomic, which makes this operation easy
  5. Use same Replicator code to sync data to GCS Utilize ViewFileSystem abstraction to hide GCS /gcs/dataset/2019/04/10/03 maps to gs://dataset.bucket/2019/04/10/03 Use Google Hadoop Connector to interact with GCS using Hadoop APIs Distcp jobs runs on dedicated Copy cluster Create ViewFileSystem mount point on Copy cluster to fake GCS destination Distcp tasks stream data from source HDFS to GCS (no local copy)
  6. Data for same dataset is aggregated at multiple DataCenters (DC x and DC y) Replicators in each DC schedules individual DistCp jobs Data from multiple DC ends up under same path on GCS
  7. UI support via EagleEye to view all replication configurations Properties associated with configuration. Src, dest, owner, email, etc… CLI support to manage replication configurations Load new or modify existing configuration List all configurations Mark active/inactive configurations API support for clients and replicators Rich set of api access for all above operations
  8. GCP Projects are based on organization Deploy separate Replicator with its own credentials per project Shared copy cluster per DataCenter Enables independent updates and reduces risk of errors
  9. Logs vs user path resolution Projects and buckets have standard naming convention Logs at : gs://logs.<category name>.twttr.net/ User data at gs://user.<user name>.twtter.net/ Access to these buckets are via standard path Logs at /gcs/logs/<category name>/ User data at /gcs/user/<user name>/ Typically we need mapping of path prefix to bucket name in Hadoop ViewFileSystem mountable.xml Modified ViewFileSystem to dynamically create mountable mapping on demand since bucket name and path name are standard No configuration or update needed