Managing hundreds of
PBs of data in Cloud
ApacheCon 2019
Lohit VijayaRenu, Zhenzhao Wang
Describe Twitter’s Data Storage Architecture,
present our solution to managing large data in the Cloud.
Data Platform @Twitter
Oxpecker
Roneobird
Data Access Layer
ETL Pipelines
Twitter Data Analytics : Scale
● >10K Nodes Hadoop Cluster: several >10K Hadoop clusters
● >1EB Storage capacity: ~1 Exabyte storage capacity
● >100PB Reads and Writes: amount of data read and written daily
● >50K Analytic Jobs: jobs running on Data Platform per day
Storage @DataPlatform
Storage @DataPlatform
● Apache HDFS for storage
● DAL for metadata management
● Replication for cross cluster
● Retention service for expiry
Components: Hadoop Distributed File System, Replication Service, Retention Service, DAL (Metadata Service)
HDFS @DataPlatform
Twitter DataCenter clusters: Real Time Cluster, Production Cluster, Ad hoc Cluster, Cold Storage
● Log Pipeline (Micro Services Data): generates > 1.5 trillion events every day
● Incoming Storage: produces > 4PB per day
● Production jobs: process hundreds of PB per day
● Ad hoc queries: execute tens of thousands of jobs per day
● Cold/Backup: hundreds of PBs of data
Data Access Layer
● A dataset has a logical name and one or more physical locations
● Users/tools such as Scalding, Presto, and Hive query DAL for available hourly partitions
● A dataset has hourly/daily partitions in DAL
● DAL also stores various properties such as owner, schema, and location with datasets
* https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html
Dataset defined in DAL
Find dataset by logical name
$dal logical-dataset list --role hadoop --name logs.partly-cloudy
| 4031 | http://dallds/401 | hadoop | Prod | logs.partly-cloudy | Active |
List all physical locations
$dal physical-dataset list --role hadoop --name logs.partly-cloudy
| 26491 | http://dalpds/26491 | dw | viewfs://hadoop-dw-nn/logs/partly-cloudy/yyyy/mm/dd/hh |
| 41065 | http://dalpds/41065 | cold | viewfs://hadoop-cold-nn/partly-cloudy/yyyy/mm/dd/hh |
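The split between one logical name and several physical locations can be pictured with a small in-memory model. This is only a sketch: `PhysicalLocation`, `LogicalDataset`, `partition_path`, and `paths_for` are hypothetical names for illustration, not DAL's API. The location names and path templates are copied from the listing above.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class PhysicalLocation:
    # Hypothetical model of one row of the physical-dataset listing.
    location_name: str   # e.g. "dw" or "cold"
    path_template: str   # ends in yyyy/mm/dd/hh

    def partition_path(self, hour: datetime) -> str:
        # Expand the yyyy/mm/dd/hh suffix for one hourly partition.
        return self.path_template.replace(
            "yyyy/mm/dd/hh", hour.strftime("%Y/%m/%d/%H")
        )


@dataclass
class LogicalDataset:
    # One logical name maps to one or more physical locations.
    name: str
    locations: list = field(default_factory=list)

    def paths_for(self, hour: datetime) -> dict:
        # Map each location name to the concrete path of that hour's partition.
        return {loc.location_name: loc.partition_path(hour) for loc in self.locations}


dataset = LogicalDataset(
    name="logs.partly-cloudy",
    locations=[
        PhysicalLocation("dw", "viewfs://hadoop-dw-nn/logs/partly-cloudy/yyyy/mm/dd/hh"),
        PhysicalLocation("cold", "viewfs://hadoop-cold-nn/partly-cloudy/yyyy/mm/dd/hh"),
    ],
)
paths = dataset.paths_for(datetime(2019, 9, 10, 3))
print(paths["dw"])
```

A consumer that only knows the logical name can then pick whichever physical location suits it without hard-coding cluster paths.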
Replication
DAL/Config
Dataset : partly-cloudy
Src : /ClusterX/logs/partly-cloudy
Dst : /ClusterY/logs/partly-cloudy
Replicator : ClusterY
● Fetch replication config per dataset
● Scan src cluster (Source Cluster: /ClusterX/logs/partly-cloudy/2019/09/10/03)
● Scan dst cluster (Destination Cluster: /ClusterY/logs/partly-cloudy/2019/09/10/03)
● Data transfer (Distcp of 2019/09/10/03)
● Update partition information
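The replicator steps (fetch config, scan both sides, transfer, update partition information) can be sketched as a small loop. This is a minimal illustration, not Twitter's replicator: `plan_replication`, `replicate`, and the `fake_*` helpers are hypothetical, and in-memory sets stand in for HDFS listings, DistCp, and DAL.

```python
def plan_replication(src_partitions, dst_partitions):
    """Return partitions present on the source but missing on the destination."""
    return sorted(set(src_partitions) - set(dst_partitions))


def replicate(config, scan, copy, update_partition):
    """One replicator pass for a single dataset.

    scan, copy, and update_partition are injected callables so the sketch
    stays independent of any real file system or metadata service.
    """
    src = scan(config["src"])
    dst = scan(config["dst"])
    copied = []
    for partition in plan_replication(src, dst):
        copy(config["src"], config["dst"], partition)
        update_partition(config["dataset"], partition)
        copied.append(partition)
    return copied


# In-memory stand-ins for the two clusters and for the metadata service.
clusters = {
    "/ClusterX/logs/partly-cloudy": {"2019/09/10/02", "2019/09/10/03"},
    "/ClusterY/logs/partly-cloudy": {"2019/09/10/02"},
}
registered = []

config = {
    "dataset": "partly-cloudy",
    "src": "/ClusterX/logs/partly-cloudy",
    "dst": "/ClusterY/logs/partly-cloudy",
}


def fake_scan(root):
    return set(clusters[root])


def fake_copy(src_root, dst_root, partition):
    clusters[dst_root].add(partition)


def fake_update(dataset, partition):
    registered.append((dataset, partition))


copied = replicate(config, fake_scan, fake_copy, fake_update)
print(copied)
```

Because the plan is derived from a fresh scan of both sides, running the pass again after a successful copy finds nothing left to do.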
Retention
DAL/Config
Dataset : partly-cloudy
Cluster : ClusterX
Retention : 30 days
Retention : ClusterX
● Fetch retention config per dataset
● Scan for any data older than 30 days
● Move to Trash. Let HDFS expire
● Drop partition information
ClusterX keeps /ClusterX/logs/partly-cloudy/2019/08/10/03 (onwards)
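The scan step amounts to comparing the hour encoded in each partition path against a cutoff. The sketch below assumes partition paths end in yyyy/mm/dd/hh as in the examples; `partition_time` and `expired_partitions` are hypothetical helper names, and the sample paths and the "now" value are made up for illustration.

```python
from datetime import datetime, timedelta


def partition_time(partition_path):
    """Parse the trailing yyyy/mm/dd/hh components of a partition path."""
    parts = partition_path.rstrip("/").split("/")[-4:]
    return datetime.strptime("/".join(parts), "%Y/%m/%d/%H")


def expired_partitions(partition_paths, retention_days, now):
    """Return the partitions whose encoded hour is older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [p for p in partition_paths if partition_time(p) < cutoff]


# Illustrative input: three hourly partitions of one dataset on ClusterX.
candidates = [
    "/ClusterX/logs/partly-cloudy/2019/08/10/03",
    "/ClusterX/logs/partly-cloudy/2019/08/20/03",
    "/ClusterX/logs/partly-cloudy/2019/09/10/03",
]
to_trash = expired_partitions(candidates, retention_days=30, now=datetime(2019, 9, 10, 4))
print(to_trash)
```

The returned paths are the ones a retention service would move to Trash before dropping their partition information.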
Storage @Cloud
Large Data Management on Cloud
● Storage system (Google Cloud Storage)
● Metadata Management
● Replication Service
● Retention Service
● User and Service management
● Data format and data pipeline
● Compute provisioning
● Networking and VPC
● Security/Key management
Storage @Cloud
● Google Cloud Storage
● DAL for metadata management
● Replication for cloud cluster
● Supplementary Retention (SDRS) service for expiry
Components: Google Cloud Storage, Replication Service, SDRS (Retention), DAL (Metadata Service)
GCS
● Object store vs HDFS.
○ We widely adopted the GCS connector, which provides an HDFS-compatible API, so users can migrate their jobs/applications without code changes.
○ We handle semantic differences case by case, e.g. rename is not atomic.
● Bucket design.
○ Different orgs have different cloud projects.
○ We have one bucket per user and per log category (dataset).
○ We built a service to manage buckets, e.g. creation, ACL settings, etc.
GCS
On-premises path: /dc1/cluster1/user/ads/some/path/part-001.lzo
Logical Cloud path: /gcs/user/ads/some/path/part-001.lzo
GCS bucket path: gs://user.ads.dp.twitter.domain/some/path/part-001.lzo
● Owner: ads
● readergroup: ads-reader-group
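The three path forms differ only in their prefix, which a short sketch can make concrete. The rules below are inferred from the single example on this slide and are assumptions: the first two components of the on-premises path are taken to be data center and cluster, and the bucket is taken to be named `<category>.<owner>.<domain>`. The function names are hypothetical.

```python
def to_logical_cloud_path(on_prem_path):
    """Drop the data center and cluster components and prefix with /gcs."""
    parts = on_prem_path.strip("/").split("/")
    # parts[0] is assumed to be the data center, parts[1] the cluster.
    return "/gcs/" + "/".join(parts[2:])


def to_bucket_path(logical_path, domain="dp.twitter.domain"):
    """Turn /gcs/<category>/<owner>/rest into gs://<category>.<owner>.<domain>/rest."""
    parts = logical_path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "gcs":
        raise ValueError(f"not a logical cloud path: {logical_path}")
    category, owner, rest = parts[1], parts[2], parts[3:]
    return f"gs://{category}.{owner}.{domain}/" + "/".join(rest)


logical = to_logical_cloud_path("/dc1/cluster1/user/ads/some/path/part-001.lzo")
bucket = to_bucket_path(logical)
print(logical)
print(bucket)
```

Keeping the owner in the bucket name is what lets ownership and reader groups be set once per bucket rather than per object.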
RegEx based path resolution
Twitter ViewFS mounttable.xml
<property>
<name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/logs/(?!((tst|test)(_|-)))(?&lt;dataset&gt;[^/]+)</name>
<value>gs://logs.${dataset}</value>
</property>
<property>
<name>fs.viewfs.mounttable.copycluster.linkRegex.replaceresolveddstpath:-:--;replaceresolveddstpath:_:-#.^/gcs/user/(?!((tst|test)(_|-)))(?&lt;userName&gt;[^/]+)</name>
<value>gs://user.${userName}</value>
</property>
Twitter ViewFS Path
/gcs/logs/partly-cloudy/2019/04/10 → gs://logs.partly-cloudy/2019/04/10
/gcs/user/lohit/hadoop-stats → gs://user.lohit/hadoop-stats
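The effect of the two regex rules can be reproduced outside Hadoop with a few lines of Python. This is a sketch of the capture-and-substitute step only: it does not model the `replaceresolveddstpath` interceptor settings in the property names, and `RULES` and `resolve` are hypothetical names. Python spells named groups `(?P<name>...)`, whereas the mount table uses Java's `(?<name>...)`.

```python
import re

# Same patterns as the mount table, rewritten with Python's named-group syntax.
RULES = [
    (re.compile(r"^/gcs/logs/(?!((tst|test)(_|-)))(?P<dataset>[^/]+)"), "gs://logs.{dataset}"),
    (re.compile(r"^/gcs/user/(?!((tst|test)(_|-)))(?P<userName>[^/]+)"), "gs://user.{userName}"),
]


def resolve(path):
    """Return the gs:// path for a /gcs/... path, or None if no rule matches."""
    for pattern, template in RULES:
        match = pattern.match(path)
        if match:
            # groupdict() holds only the named group; the rest of the path is kept as-is.
            return template.format(**match.groupdict()) + path[match.end():]
    return None


print(resolve("/gcs/logs/partly-cloudy/2019/04/10"))
print(resolve("/gcs/user/lohit/hadoop-stats"))
```

The negative lookahead excludes names starting with `tst_`, `tst-`, `test_`, or `test-`, so such paths fall through to whatever other mount rules exist.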
View FileSystem and Google Hadoop Connector
Bucket on GCS : gs://logs.partly-cloudy
Connector Path : /logs/partly-cloudy
Twitter Resolved Path : /gcs/logs/partly-cloudy
Twitter’s View FileSystem
Cluster-X Cluster-Y ClusterZ
Namespace 1 Namespace 2 Namespace 1 Namespace 1 Namespace 2
DataCenter-1 DataCenter-2 Cloud Storage
Connector
Replicator
Cloud Storage
DAL (metadata) extension for Cloud
List all physical locations
$dal physical-dataset list --role hadoop --name logs.partly-cloudy
| 26491 | http://dalpds/26491 | dw | viewfs://hadoop-dw-nn/logs/partly-cloudy/yyyy/mm/dd/hh |
| 41065 | http://dalpds/41065 | gcs | gcs:////partly-cloudy/yyyy/mm/dd/hh |
All partitions for dataset on GCS
$dal physical-dataset list --role hadoop --name logs.partly-cloudy --location-name gcs
2019-09-10T11:00:00Z 2019-04-01T12:00:00Z gcs:///logs/partly-cloudy/2019/09/10/11 HadoopLzop
2019-09-10T12:00:00Z 2019-04-01T13:00:00Z gcs:///logs/partly-cloudy/2019/09/10/12 HadoopLzop
Architecture behind replication to GCS
Twitter DataCenter
Source Cluster: /DC1/ClusterX/logs/partly-cloudy/2019/09/10/03
Copy Cluster (Replicator : GCS, Distcp)
GCS: /gcs/logs/partly-cloudy/2019/09/10/03
DAL
Dataset : partly-cloudy
Src : /ClusterX/logs/partly-cloudy
Dst : /gcs/logs/partly-cloudy
Network setup for copy
Twitter DataCenter: Copy Cluster (Replicator : GCS, Distcp), Proxy group
Twitter & Google private peering (PNI)
GCS: /gcs/logs/partly-cloudy/2019/09/10/03
Merge same dataset on GCS (Multi Region Bucket)
Twitter DataCenter X-1
Source ClusterX-1: /DC1/ClusterX-1/logs/partly-cloudy/2019/09/10/03
Copy Cluster X-1 (Distcp)
Twitter DataCenter X-2
Source ClusterX-2: /DC2/ClusterX-2/logs/partly-cloudy/2019/09/10/03
Copy Cluster X-2 (Distcp)
Cloud Storage
Multi Region Bucket: /gcs/logs/partly-cloudy/2019/09/10/03
Merging and updating DAL
● Multiple Replicators copy the same dataset partition to the destination
● Each Replicator checks for availability of data independently
● Each creates an individual _SUCCESS_<SRC> file
● DAL is updated when all _SUCCESS_<SRC> files are found
● Updates are idempotent
Flowchart (each Replicator updates the partition independently): Compare src and dest → if "Need to copy", kick off distcp job; if "Copied already", skip the copy → on Success, check success files (ALL) → Yes: update DAL → Done; No: let another instance update DAL. The distcp step has Success and Failure outcomes.
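The marker-file handshake between replicators can be sketched with a set standing in for the bucket listing. This is an illustration under assumptions, not the production logic: `success_marker` and `finish_copy` are hypothetical names, the source labels `DC1` and `DC2` are made up, and a Python set models both the objects in the bucket and the partitions registered in DAL.

```python
def success_marker(partition, source):
    """Name of the per-source marker file written next to the partition data."""
    return f"{partition}/_SUCCESS_{source}"


def finish_copy(objects, dal_partitions, partition, source, all_sources):
    """Record this replicator's marker, then publish once every source is done.

    Returns True if the partition is (now) registered, False if another
    replicator still has to finish and will publish it later.
    """
    objects.add(success_marker(partition, source))
    if all(success_marker(partition, s) in objects for s in all_sources):
        dal_partitions.add(partition)  # adding to a set twice is harmless
        return True
    return False


objects = set()          # stand-in for the destination bucket listing
dal_partitions = set()   # stand-in for partitions registered in DAL
sources = ["DC1", "DC2"]
partition = "/gcs/logs/partly-cloudy/2019/09/10/03"

first = finish_copy(objects, dal_partitions, partition, "DC1", sources)
second = finish_copy(objects, dal_partitions, partition, "DC2", sources)
repeat = finish_copy(objects, dal_partitions, partition, "DC2", sources)
print(first, second, repeat, len(dal_partitions))
```

Whichever replicator finishes last performs the update, and a repeated update leaves the registered set unchanged, which is the idempotence the slide calls out.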
Dataset via EagleEye
● View different destinations for the same dataset
● GCS is another destination
● Also shows delay for each hourly partition
SDRS (retention in cloud)
Object Lifecycle Management (OLM) on GCS supports rules based on age, modification time, etc. However, OLM has gaps:
● Dataset-based retention is not supported.
○ E.g. GDPR requires /logs/partly-cloudy/2019/09/10/21 to be scrubbed 30 days after generation, not 30 days after its creation time on GCS.
○ OLM does not notify you on deletion, so it is hard to keep the dataset in sync with DAL.
● Soft delete (trash feature) is impossible without versioning.
○ The trash feature has saved us multiple times...
SDRS Architecture
Components: Client, REST API, Service Interface, Event Service, Validation, Execution, Configuration, Internal Queue, Config, Retention Scheduler..., Event Handling, Notification, Storage, Pub/Sub
● Open source and cloud native.
● Supports retention rules.
○ Delete marker: on-demand delete.
○ Dataset rule.
○ Bucket default rule.
● Soft delete.
○ Moves data from one bucket to another bucket.
○ Pluggable engine. Simple Transfer Storage supported.
● REST API and event notification.
○ REST API to control.
○ SDRS generates and sends events to the pub/sub system on deletion and trashing.
Twitter GCS Retention Management
Components: DAL, Retention Config Manager, DAL Sync Manager, SDRS Service, Pub/Sub Topics, Cloud SQL, GCS Buckets, Trash Buckets
● We run one SDRS service stack per org.
● Trash buckets. Each bucket has a trash bucket. Deleted data is moved to the trash bucket before it is gone.
○ E.g. /gcs/logs/partly-cloudy has /gcs/logs/partly-cloudy-trash acting as its trash.
● Partitions over objects. We configure retention per dataset partition based on the built-in time in the path.
○ E.g. /gcs/logs/partly-cloudy/2019/09/10/01 with 3 days of retention will be removed on 2019/09/13/01.
○ We drop the partition when SDRS removes data. Thus the data in DAL is always in sync.
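The partition-based rule and the trash convention can be checked with a little date arithmetic. The sketch assumes logical paths shaped like /gcs/logs/<dataset>/yyyy/mm/dd/hh and a trash location formed by appending `-trash` to the dataset component, as in the examples. `expiry_time`, `trash_path`, and `on_deletion_event` are hypothetical names, not part of SDRS or DAL.

```python
from datetime import datetime, timedelta


def expiry_time(partition_path, retention_days):
    """Expiry is the hour encoded in the path plus the retention period."""
    parts = partition_path.rstrip("/").split("/")[-4:]
    created = datetime.strptime("/".join(parts), "%Y/%m/%d/%H")
    return created + timedelta(days=retention_days)


def trash_path(partition_path):
    """Map /gcs/logs/<dataset>/... to /gcs/logs/<dataset>-trash/...."""
    parts = partition_path.split("/")
    # parts looks like ['', 'gcs', 'logs', '<dataset>', 'yyyy', 'mm', 'dd', 'hh'].
    parts[3] += "-trash"
    return "/".join(parts)


def on_deletion_event(dal_partitions, partition_path):
    """Drop the partition from the metadata view when a deletion event arrives."""
    dal_partitions.discard(partition_path)


partition = "/gcs/logs/partly-cloudy/2019/09/10/01"
dal_partitions = {partition}   # stand-in for partitions registered in DAL

expires = expiry_time(partition, 3)
trashed = trash_path(partition)
on_deletion_event(dal_partitions, partition)
print(expires, trashed, dal_partitions)
```

Deriving expiry from the path rather than from object creation time is what keeps retention tied to when the data was generated, regardless of when it was copied to GCS.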
Describe Twitter’s Data Storage Architecture,
present our solution to managing large data in the Cloud.
Questions / Feedback
Tweet @TwitterHadoop
