Story of moving 4 Trillion Events
(Log Pipeline) from Batch to Streaming
ApacheCon 2020
Lohit VijayaRenu, Zhenzhao Wang, Praveen Killamsetti
1.Introduction & Goals
2.Log Pipeline in GCP
3.Streaming between DCs
4.Conclusion
5.Q&A
Scale of Event Log Aggregation
How many and how big?
~3.4 to ~4.1 Trillion Events a Day
Across millions of clients
Still growing
~10PB of Data a Day
Incoming uncompressed
Twitter DataCenter
Events and Event Logs @Twitter
Real Time
Cluster
Production
Cluster
Ad hoc Cluster Cold Storage
Log
Pipeline
Micro
Services
Streaming systems
GCP
Google Cloud
Storage
Services to
manage
data
Data
Processing
frameworks
● Clients log events specifying a Category
name, e.g. ads_click, like_event ...
● Events are stored on HDFS, bucketed every
hour into separate directories
○ /logs/ads_click/2020/09/01/23
○ /logs/like_event/2020/09/01/23
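The hourly directory layout above can be sketched as a small path helper. This is only an illustration: the function name `hourly_bucket_path`, the `/logs` default root, and the choice of UTC for bucketing are assumptions, since the slides show only the resulting directory pattern.

```python
from datetime import datetime, timezone


def hourly_bucket_path(category: str, event_time: datetime, root: str = "/logs") -> str:
    """Return the hourly directory for an event, following /logs/<category>/YYYY/MM/DD/HH."""
    # Assumption: buckets are keyed by UTC hour; the slides do not state the time zone.
    t = event_time.astimezone(timezone.utc)
    return f"{root}/{category}/{t:%Y/%m/%d/%H}"


example = hourly_bucket_path(
    "ads_click", datetime(2020, 9, 1, 23, 15, tzinfo=timezone.utc)
)
print(example)
```

Every event of a category logged within the same hour maps to the same directory, which is what lets batch jobs consume one complete hour at a time.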
Events and Event Logs @Twitter
Log Management Components
Clients
Aggregated by Category
Storage HDFS (Incoming)
Http Clients
Clients
Client Daemon
Client Daemon Client Daemon
Rufous
Storage HDFS (Replicated)
Event Aggregation
Event Log Processing
Event Log Replication
Event Log Management
Clients
● Modularization
○ Each component should be independent and pluggable.
○ Communication between components should follow a simple protocol.
○ Each component should be able to scale independently.
● Tier-based approach
○ Resources should be shared inside a tier to improve utilization and resiliency.
○ Resources should be isolated between tiers to control blast radius.
● Scalability is always a primary concern
○ Traffic grows every year.
○ Scale leads to problems, e.g. the HDFS file number limit.
○ QoS of network traffic
● Users make mistakes
○ E.g. a user might make backward-incompatible schema changes.
○ A user might want to restate the data because of an error.
● Debuggability, long-tail problems, DC failover support, etc.
Lessons Learnt
Goals
01 Hybrid Environments
● Seamless integration of on-prem clusters and cloud
● On-prem parity on cloud, such as data format
02 Streaming/Batching
● Empower streaming use cases, e.g. Dataflow.
● Support batch use cases such as Spark, Dataflow, Presto.
03 Cloud Native and PDP
● Leverage cloud native technologies and unlock more cloud native tools
● PDP (Private data protection) is always a big thing at Twitter
04 Scalability
● The traffic grows every year. The new log pipeline should be able to handle the traffic.
1.Introduction & Goals
2.Log Pipeline in GCP
3.Streaming between DCs
4.Conclusion
5.Q&A
Use Cases Overview
PUB/SUB
…
Producers
GKE
Container 1
Container N
…
VMs
Service 1
Service N
…
Serverless
CloudFunction
App Engine
…
Rest
API
IDL
API
Topic 1
Topic 2
Topic N
…
Batch
DataFlowJob 1
DataFlowJobN
… GCS
Consumers
Restful
API
IDL
API
Kafka
Stream Processing
Stream
Ingestion Jobs
BigQuery
User stream
jobs
… …
Log Pipeline In GCP - Architecture
Application
Google Pub/Sub
DataFlow GCS
Processor
DataFlow BQ
Processor
Log
Processors
Scheduler
State Store
Log Pipeline
Client Lib
● Unified client lib
○ Abstracts backend implementation
● Google PubSub as subscribable storage
○ Rich meta headers. E.g. checksum
○ Exclusive subscription per destination
● Schedule processors and export metrics
● Processors: Dataflow jobs which stream
data to different destinations.
● State store:
○ Schema info
○ Per category meta such as owner
● Various destinations: BQ, GCS, Druid, etc.
● Replication service: glue between destinations
Replication Service
Log Processors
Streaming Processor:
● Per-topic Dataflow job reads from PubSub
and writes to BQ
● E2E latency of a few seconds
● Dead letter table to handle corrupt
data/schema errors.
● E2E Checksum validation
Batch Processor:
● Multiple Format output.
○ Thrift-lzo: row based format.
○ Parquet: column based format
● E2E Checksum validation
● Tackle cold start with dummy events.
○ To handle empty time ranges
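The end-to-end checksum validation and dead-letter handling listed above can be sketched as follows. This is a minimal illustration under stated assumptions: the slides do not say which checksum is used or how it is carried, so CRC32 and the `crc32` header name are hypothetical, and plain Python lists stand in for the BigQuery table and the dead letter table.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Message:
    payload: bytes
    headers: dict = field(default_factory=dict)


def publish(payload: bytes) -> Message:
    """Client side: attach a checksum computed over the original payload."""
    # Assumption: CRC32 carried in a "crc32" header; the real header name is not given.
    return Message(payload, {"crc32": str(zlib.crc32(payload))})


def route(messages):
    """Processor side: recompute the checksum and split good from corrupt messages."""
    good, dead_letter = [], []
    for m in messages:
        expected = m.headers.get("crc32")
        if expected is not None and expected == str(zlib.crc32(m.payload)):
            good.append(m)
        else:
            dead_letter.append(m)  # kept for inspection instead of being dropped
    return good, dead_letter


ok = publish(b"like_event:42")
corrupt = Message(b"like_event:4X", dict(ok.headers))  # payload changed in transit
delivered, dead = route([ok, corrupt])
```

Routing failures to a separate sink keeps one bad message from blocking the stream while still preserving it for debugging.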
Application
Google Pub/Sub
DataFlow GCS
Processor
DataFlow BQ
Processor
Log Pipeline
Client Lib
Replication Service
Event Controller
Processor Scheduler
Config
Watcher
● User friendly configuration.
○ No need to worry about
implementation
○ Rich options including destination,
data format, etc.
● Scalable and extendable
○ Multiple destination sinks support
○ Stream and Batch support.
● Managed execution
○ Provide Metrics, health check
○ Priority and quota control support
(planned)
Status
Watcher
Rest API
Event Execution Pool
Job Abstraction Layer
GCS
Stream
Ingestion
BigQuery
Stream
Ingestion
Druid
Ingestion
... ...
User
Config
Restful
CMD
Other Components
Client Library
● Uniform way to publish log events
● Per log category metrics tracking
● Static schema validation check at event
source
● Privacy Data Protection improvements
○ End to End checksums
○ End to End encryption
○ Optional Base-64 encoding
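The first three client library bullets (a uniform publish call, per-category metrics, and a schema check at the event source) can be sketched together. Everything named here is hypothetical: the `LogClient` class, the `SCHEMAS` table, and the metric names are illustrative only, and a runtime type check stands in for the static schema validation the slide mentions.

```python
from collections import Counter

# Hypothetical schemas: field name -> expected Python type.
SCHEMAS = {
    "ads_click": {"user_id": int, "ad_id": int},
    "like_event": {"user_id": int, "tweet_id": int},
}


class LogClient:
    """Single entry point for publishing events, with per-category counters."""

    def __init__(self, sink):
        self.sink = sink          # callable(category, event); the backend stays hidden
        self.metrics = Counter()  # e.g. "ads_click.published", "ads_click.rejected"

    def log(self, category: str, event: dict) -> bool:
        schema = SCHEMAS.get(category)
        valid = (
            schema is not None
            and event.keys() == schema.keys()
            and all(isinstance(event[k], t) for k, t in schema.items())
        )
        if not valid:
            # Reject at the source so bad events never reach the pipeline.
            self.metrics[f"{category}.rejected"] += 1
            return False
        self.sink(category, event)
        self.metrics[f"{category}.published"] += 1
        return True


published = []
client = LogClient(lambda category, event: published.append((category, event)))
client.log("ads_click", {"user_id": 1, "ad_id": 7})
client.log("ads_click", {"user_id": "oops", "ad_id": 7})  # wrong type, rejected
```

Because callers only see `log(category, event)`, the sink behind it can change without touching application code.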
Schema Management
● CI job to create schema jar and upload to
GCS
● Each Processor re-loads latest schema
bundle periodically
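The periodic schema reload can be sketched as a time-based cache. This is an assumption-laden illustration: the `SchemaCache` class, its `loader` callback (which in practice would fetch the latest bundle from GCS), and the 60-second interval are all hypothetical, as the slides do not describe the reload mechanism.

```python
import time


class SchemaCache:
    """Caches a schema bundle and reloads it once it is older than ttl_seconds."""

    def __init__(self, loader, ttl_seconds, clock=time.monotonic):
        self._loader = loader      # hypothetical: returns the latest schema bundle
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the behaviour can be tested
        self._bundle = None
        self._loaded_at = None

    def get(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._bundle = self._loader()
            self._loaded_at = now
        return self._bundle


# Usage with a fake clock and a loader that returns successive bundle versions.
fake_now = [0.0]
versions = iter(["bundle-v1", "bundle-v2"])
cache = SchemaCache(lambda: next(versions), ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.get()        # initial load
fake_now[0] = 30.0
still_first = cache.get()  # within the interval, no reload
fake_now[0] = 61.0
second = cache.get()       # interval elapsed, bundle reloaded
```

Reloading on a timer lets long-running processors pick up new schemas without a restart, at the cost of a bounded delay before a change takes effect.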
Application
Google Pub/Sub
DataFlow GCS
Processor
DataFlow BQ
Processor
Log Pipeline
Client Lib
Replication Service
Log Replication
● Used for batch workflow
● Logs are collected independently at each data center
● Log Replicator merges the logs across the data centers
○ Copies data from one DC to the rest
○ Uses GCS connector to write to GCS using HDFS APIs
Deployment
● Separate Log Pipeline for each organization (GCP project) for better security and chargeback
● Provisioning Log Category
○ Map log category to GCP project during provisioning
○ Create GCP resources (pubsub topics, buckets, BQ datasets) automatically using
demigod service (Terraform)
○ Configure event routing
○ Access Control:
■ Limit write access to storage(GCS/BQ) to pillar org specific log processor
■ Read access of the GCS bucket/BQ limited to service account only
1.Introduction & Goals
2.Log Pipeline in GCP
3.Streaming between DCs
4.Conclusion
5.Q&A
Streaming Data from Twitter DCs to Cloud Log Pipeline
Application1
(TWTTR-DC1)
Flume Aggregation
GCP Pub-Sub
Streaming Log Processor
(Data Flow)
Scribe Daemon
Client
Library
Application2
(TWTTR-DC1)
Scribe Daemon
Client
Library
Log Delivery - Big Picture
Application2
(TWTTR-DC1)
Flume Aggregation
(One per each Twitter DC)
GCP Pub-Sub
Application1(
GCP)
Streaming Log Processor
(Data Flow)
Client
Library
Kafka
(Twitter DC)
Scribe Daemon
Client
Library
Application3
(TWTTR-DC2)
Scribe Daemon
Client
Library
Tez
Log Processor
Replication
Service
Possible Routings
● Stream Flows:
LPClient -> PubSub -> BQ
LPClient -> Flume -> PubSub -> BQ
● Batch Flows:
LPClient -> Flume -> HDFS
LPClient -> PubSub -> GCS
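The four routings above can be captured in a small lookup table. The hop sequences are copied from the list; indexing them by delivery mode and by where the application runs (`"dc"` for a Twitter data center, `"gcp"` for the cloud) is an assumed reading of the diagrams, and the names `ROUTES` and `describe_route` are hypothetical.

```python
# Hop sequences as listed on the slide; the (mode, origin) keys are an assumption.
ROUTES = {
    ("stream", "gcp"): ["LPClient", "PubSub", "BQ"],
    ("stream", "dc"): ["LPClient", "Flume", "PubSub", "BQ"],
    ("batch", "dc"): ["LPClient", "Flume", "HDFS"],
    ("batch", "gcp"): ["LPClient", "PubSub", "GCS"],
}


def describe_route(mode: str, origin: str) -> str:
    """Return the hop sequence for a delivery mode and application origin."""
    try:
        hops = ROUTES[(mode, origin)]
    except KeyError:
        raise ValueError(f"no route for mode={mode!r}, origin={origin!r}") from None
    return " -> ".join(hops)


print(describe_route("stream", "dc"))
```

Keeping the routes as data rather than branching logic makes it straightforward to add a new sink or hop without changing the lookup code.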
1.Introduction & Goals
2.Log Pipeline in GCP
3.Streaming between DCs
4.Conclusion
5.Q&A
Conclusion
● Embrace a hybrid cloud environment and provide a unified experience for publishing log
events
● Log Pipeline serves as a global-scale log data delivery mechanism inside Twitter
○ Aggregation of data across DCs
○ Streaming and batch mode delivery
○ Support for various sinks
○ Users configure routing with simple knobs
Q&A
Thank you.
Log Pipeline In GCP - Architecture
Application
Google Pub/Sub
DataFlow GCS
Processor
DataFlow GCS
Processor
Log
Processors
Scheduler
State
Store
User
Interface
(UI/CLI)
Log Pipeline
Client Lib
Job Scheduler
● A processor is a Stream/Batch ETL
job which could deliver data to user
specified Destination:
○ BigQuery Stream ingestion
○ GCS Stream Ingestion
Event Controller
Job Schedulers
Config Watcher
● User friendly Configuration.
○ User could config the data
destination easily.
○ Rich options including output format,
● Managed Execution Env
○ Move
○ Pluggable engine. Simple Transfer
Storage Supported.
Status Watcher Rest API
Event Execution Pool
Job Abstraction Layer
GCS Stream
Ingestion
BigQuery
Stream
Ingestion
Druid
Ingestion
... ...
User Config Restful Cmd
Log Pipeline In GCP - Architecture
Application
Google Pub/Sub
DataFlow GCS
Processor
DataFlow BQ
Processor
Log
Processors
Scheduler
State Store
Log Pipeline
Client Lib
● Unified client lib
○ Disguise client difference
● Google PubSub as subscribable storage
○ Rich context headers. E.g. checksum
○ Exclusive subscription per destination
● Schedule processors and export metrics
● Processors: Dataflow jobs which stream
data to different destinations.
● Various destinations: BQ, GCS, Druid, etc.
● State store.
○ Schema info
○ Per category meta such as owner
Replication Service
● Replication service. Glue of destinations
Twitter Data Analytics: Scale
● >10K Nodes Hadoop Cluster: several >10K-node Hadoop clusters
● >1EB Storage capacity: ~1 Exabyte storage capacity
● >100PB Reads and Writes: amount of data read and written daily
● >50K Analytic Jobs: jobs running on Data Platform per day
● Clients log events specifying a Category name,
e.g. ads_click, like_event ...
● Events are grouped together across all clients
into the Category
● Events are stored on HDFS, bucketed every hour
into separate directories
○ /logs/ads_click/2020/09/01/23
○ /logs/like_event/2020/09/01/23
● Event logs are replicated to other clusters or
GCP
○ On-prem HDFS clusters
○ GCS
Clients
Aggregated by Category
Storage HDFS (Incoming)
Http Clients
Clients
Client Daemon
Client Daemon Client Daemon
Http Endpoint
Storage HDFS (Replicated)
Events and Event Logs @Twitter
Life of an event
Events and Event Logs @Twitter
● Clients log events specifying a Category
name, e.g. ad_activated_keywords,
login_event ...
● Events are grouped together across all
clients into the Category
● Events are stored on HDFS, bucketed every
hour into separate directories
○ /logs/ad_activated_keywords/2017/05/01/23
○ /logs/login_event/2017/05/01/23
● Event logs are replicated to other clusters
Life of an Event
Clients
Aggregated by Category
Storage HDFS (Incoming)
Http Clients
Clients
Client Daemon
Client Daemon Client Daemon
Rufous
Storage HDFS (Replicated)
● Terminologies
○ GCS - Google Cloud Storage
○ GCP - Google Cloud Platform
○ Project - Google Cloud project, which is an organization of Google resources including
APIs
● The backend components are split into different pillar cloud projects.
○ A pillar is decided based on the organization, e.g. ads
○ Resources are isolated and planned independently
○ Better chargeback control
Log Pipeline In GCP
Data Ingestion
Data Replication
Data Retention & Management

More Related Content

What's hot

Data Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETLData Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETL
Anant Corporation
 
Symantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in actionSymantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in action
DataStax Academy
 
Stream processing at Hotstar
Stream processing at HotstarStream processing at Hotstar
Stream processing at Hotstar
KafkaZone
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
Gleb Kanterov
 
Routing trillion events per day @twitter
Routing trillion events per day @twitterRouting trillion events per day @twitter
Routing trillion events per day @twitter
lohitvijayarenu
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
HBaseCon
 
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
Flink Forward
 
Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache Flink
Tugdual Grall
 
The Rise of Streaming SQL
The Rise of Streaming SQLThe Rise of Streaming SQL
The Rise of Streaming SQL
Sriskandarajah Suhothayan
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
Joey Bolduc-Gilbert
 
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward
 
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
Anna Ossowski
 
Serverless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipelineServerless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipeline
Shu-Jeng Hsieh
 
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...
Flink Forward
 
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy IndustriesWebinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
MongoDB
 
Presto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix ContainersPresto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix Containers
kbajda
 
How Much Can You Connect? | Bhavesh Raheja, Disney + Hotstar
How Much Can You Connect? | Bhavesh Raheja, Disney + HotstarHow Much Can You Connect? | Bhavesh Raheja, Disney + Hotstar
How Much Can You Connect? | Bhavesh Raheja, Disney + Hotstar
HostedbyConfluent
 
Stream Processing with Ballerina
Stream Processing with BallerinaStream Processing with Ballerina
Stream Processing with Ballerina
Sriskandarajah Suhothayan
 
Siddhi - cloud-native stream processor
Siddhi - cloud-native stream processorSiddhi - cloud-native stream processor
Siddhi - cloud-native stream processor
Sriskandarajah Suhothayan
 
Presto Summit 2018 - 10 - Qubole
Presto Summit 2018  - 10 - QubolePresto Summit 2018  - 10 - Qubole
Presto Summit 2018 - 10 - Qubole
kbajda
 

What's hot (20)

Data Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETLData Engineer’s Lunch #41: PygramETL
Data Engineer’s Lunch #41: PygramETL
 
Symantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in actionSymantec: Cassandra Data Modelling techniques in action
Symantec: Cassandra Data Modelling techniques in action
 
Stream processing at Hotstar
Stream processing at HotstarStream processing at Hotstar
Stream processing at Hotstar
 
Using ClickHouse for Experimentation
Using ClickHouse for ExperimentationUsing ClickHouse for Experimentation
Using ClickHouse for Experimentation
 
Routing trillion events per day @twitter
Routing trillion events per day @twitterRouting trillion events per day @twitter
Routing trillion events per day @twitter
 
Argus Production Monitoring at Salesforce
Argus Production Monitoring at SalesforceArgus Production Monitoring at Salesforce
Argus Production Monitoring at Salesforce
 
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
Flink Forward San Francisco 2019: Building Financial Identity Platform using ...
 
Introduction to Streaming with Apache Flink
Introduction to Streaming with Apache FlinkIntroduction to Streaming with Apache Flink
Introduction to Streaming with Apache Flink
 
The Rise of Streaming SQL
The Rise of Streaming SQLThe Rise of Streaming SQL
The Rise of Streaming SQL
 
Case Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa ArchitectureCase Study: Stream Processing on AWS using Kappa Architecture
Case Study: Stream Processing on AWS using Kappa Architecture
 
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
 
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
[Virtual Meetup] Using Elasticsearch as a Time-Series Database in the Endpoin...
 
Serverless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipelineServerless ETL and Optimization on ML pipeline
Serverless ETL and Optimization on ML pipeline
 
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...
FlinkDTW: Time-series Pattern Search at Scale Using Dynamic Time Warping - Ch...
 
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy IndustriesWebinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
Webinar: MongoDB Use Cases within the Oil, Gas, and Energy Industries
 
Presto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix ContainersPresto Summit 2018 - 04 - Netflix Containers
Presto Summit 2018 - 04 - Netflix Containers
 
How Much Can You Connect? | Bhavesh Raheja, Disney + Hotstar
How Much Can You Connect? | Bhavesh Raheja, Disney + HotstarHow Much Can You Connect? | Bhavesh Raheja, Disney + Hotstar
How Much Can You Connect? | Bhavesh Raheja, Disney + Hotstar
 
Stream Processing with Ballerina
Stream Processing with BallerinaStream Processing with Ballerina
Stream Processing with Ballerina
 
Siddhi - cloud-native stream processor
Siddhi - cloud-native stream processorSiddhi - cloud-native stream processor
Siddhi - cloud-native stream processor
 
Presto Summit 2018 - 10 - Qubole
Presto Summit 2018  - 10 - QubolePresto Summit 2018  - 10 - Qubole
Presto Summit 2018 - 10 - Qubole
 

Similar to Story of migrating event pipeline from batch to streaming

How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
Amazon Web Services
 
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Mariano Gonzalez
 
Sweet Streams (Are made of this)
Sweet Streams (Are made of this)Sweet Streams (Are made of this)
Sweet Streams (Are made of this)
Corneil du Plessis
 
Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetup
Mingmin Chen
 
Red Hat Summit 2017 - LT107508 - Better Managing your Red Hat footprint with ...
Red Hat Summit 2017 - LT107508 - Better Managing your Red Hat footprint with ...Red Hat Summit 2017 - LT107508 - Better Managing your Red Hat footprint with ...
Red Hat Summit 2017 - LT107508 - Better Managing your Red Hat footprint with ...
Miguel Pérez Colino
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Bowen Li
 
Upcoming features in Airflow 2
Upcoming features in Airflow 2Upcoming features in Airflow 2
Upcoming features in Airflow 2
Kaxil Naik
 
Building Pinterest Real-Time Ads Platform Using Kafka Streams
Building Pinterest Real-Time Ads Platform Using Kafka Streams Building Pinterest Real-Time Ads Platform Using Kafka Streams
Building Pinterest Real-Time Ads Platform Using Kafka Streams
confluent
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Peter Bakas
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
confluent
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Allen (Xiaozhong) Wang
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
Steven Wu
 
Evolution of Real-time User Engagement Event Consumption at Pinterest
Evolution of Real-time User Engagement Event Consumption at PinterestEvolution of Real-time User Engagement Event Consumption at Pinterest
Evolution of Real-time User Engagement Event Consumption at Pinterest
HostedbyConfluent
 
Kubernetes + netflix oss
Kubernetes + netflix ossKubernetes + netflix oss
Kubernetes + netflix oss
Cristiano Altmann
 
Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18
CodeOps Technologies LLP
 
Deep dive into Kubernetes Networking
Deep dive into Kubernetes NetworkingDeep dive into Kubernetes Networking
Deep dive into Kubernetes Networking
Sreenivas Makam
 
BDA403 How Netflix Monitors Applications in Real-time with Amazon Kinesis
BDA403 How Netflix Monitors Applications in Real-time with Amazon KinesisBDA403 How Netflix Monitors Applications in Real-time with Amazon Kinesis
BDA403 How Netflix Monitors Applications in Real-time with Amazon Kinesis
Amazon Web Services
 
My past-3 yeas-developer-journey-at-linkedin-by-iantsai
My past-3 yeas-developer-journey-at-linkedin-by-iantsaiMy past-3 yeas-developer-journey-at-linkedin-by-iantsai
My past-3 yeas-developer-journey-at-linkedin-by-iantsai
Kim Kao
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
Ahmed Ossama
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven Microservices
Fabrizio Fortino
 

Similar to Story of migrating event pipeline from batch to streaming (20)

How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
How Netflix Uses Amazon Kinesis Streams to Monitor and Optimize Large-scale N...
 
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
Architecting Analytic Pipelines on GCP - Chicago Cloud Conference 2020
 
Sweet Streams (Are made of this)
Sweet Streams (Are made of this)Sweet Streams (Are made of this)
Sweet Streams (Are made of this)
 
Kafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetupKafka Practices @ Uber - Seattle Apache Kafka meetup
Kafka Practices @ Uber - Seattle Apache Kafka meetup
 
Red Hat Summit 2017 - LT107508 - Better Managing your Red Hat footprint with ...
Red Hat Summit 2017 - LT107508 - Better Managing your Red Hat footprint with ...Red Hat Summit 2017 - LT107508 - Better Managing your Red Hat footprint with ...
Red Hat Summit 2017 - LT107508 - Better Managing your Red Hat footprint with ...
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
 
Upcoming features in Airflow 2
Upcoming features in Airflow 2Upcoming features in Airflow 2
Upcoming features in Airflow 2
 
Building Pinterest Real-Time Ads Platform Using Kafka Streams
Building Pinterest Real-Time Ads Platform Using Kafka Streams Building Pinterest Real-Time Ads Platform Using Kafka Streams
Building Pinterest Real-Time Ads Platform Using Kafka Streams
 
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/SecNetflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
Netflix Keystone - How Netflix Handles Data Streams up to 11M Events/Sec
 
Citi Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and PerformanceCiti Tech Talk: Monitoring and Performance
Citi Tech Talk: Monitoring and Performance
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
Evolution of Real-time User Engagement Event Consumption at Pinterest
Evolution of Real-time User Engagement Event Consumption at PinterestEvolution of Real-time User Engagement Event Consumption at Pinterest
Evolution of Real-time User Engagement Event Consumption at Pinterest
 
Kubernetes + netflix oss
Kubernetes + netflix ossKubernetes + netflix oss
Kubernetes + netflix oss
 
Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18Kubernetes Networking - Sreenivas Makam - Google - CC18
Kubernetes Networking - Sreenivas Makam - Google - CC18
 
Deep dive into Kubernetes Networking
Deep dive into Kubernetes NetworkingDeep dive into Kubernetes Networking
Deep dive into Kubernetes Networking
 
BDA403 How Netflix Monitors Applications in Real-time with Amazon Kinesis
BDA403 How Netflix Monitors Applications in Real-time with Amazon KinesisBDA403 How Netflix Monitors Applications in Real-time with Amazon Kinesis
BDA403 How Netflix Monitors Applications in Real-time with Amazon Kinesis
 
My past-3 yeas-developer-journey-at-linkedin-by-iantsai
My past-3 yeas-developer-journey-at-linkedin-by-iantsaiMy past-3 yeas-developer-journey-at-linkedin-by-iantsai
My past-3 yeas-developer-journey-at-linkedin-by-iantsai
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Event Driven Microservices
Event Driven MicroservicesEvent Driven Microservices
Event Driven Microservices
 

More from lohitvijayarenu

OpenSource and the Cloud ApacheCon.pptx
OpenSource and the Cloud  ApacheCon.pptxOpenSource and the Cloud  ApacheCon.pptx
OpenSource and the Cloud ApacheCon.pptx
lohitvijayarenu
 
The Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at TwitterThe Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at Twitter
lohitvijayarenu
 
Twitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud StorageTwitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud Storage
lohitvijayarenu
 
Large Scale EventLog Management @Twitter
Large Scale EventLog Management @TwitterLarge Scale EventLog Management @Twitter
Large Scale EventLog Management @Twitter
lohitvijayarenu
 
Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at
lohitvijayarenu
 
HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapR
lohitvijayarenu
 

More from lohitvijayarenu (6)

OpenSource and the Cloud ApacheCon.pptx
OpenSource and the Cloud  ApacheCon.pptxOpenSource and the Cloud  ApacheCon.pptx
OpenSource and the Cloud ApacheCon.pptx
 
The Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at TwitterThe Adoption of Apache Beam at Twitter
The Adoption of Apache Beam at Twitter
 
Twitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud StorageTwitter's Data Replicator for Google Cloud Storage
Twitter's Data Replicator for Google Cloud Storage
 
Large Scale EventLog Management @Twitter
Large Scale EventLog Management @TwitterLarge Scale EventLog Management @Twitter
Large Scale EventLog Management @Twitter
 
Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at Hadoop 2 @Twitter, Elephant Scale. Presented at
Hadoop 2 @Twitter, Elephant Scale. Presented at
 
HBase backups and performance on MapR
HBase backups and performance on MapRHBase backups and performance on MapR
HBase backups and performance on MapR
 

Story of migrating event pipeline from batch to streaming

  • 1. Story of moving 4 Trillion Events (Log Pipeline) from Batch to Streaming ApacheCon 2020 Lohit VijayaRenu, Zhenzhao Wang, Praveen Killamsetti
  • 2. 1. Introduction & Goals 2. Log Pipeline in GCP 3. Streaming between DCs 4. Conclusion 5. Q&A
  • 3. Scale of Event Log Aggregation: How many and how big? ● ~3.4 to ~4.1 Trillion Events a Day ● ~10PB of Data a Day (incoming, uncompressed) ● Across millions of clients ● Still growing
  • 4. Events and Event Logs @Twitter ● Clients log events specifying a Category name, e.g. ads_click, like_event ... ● Events are stored on HDFS, bucketed every hour into separate directories ○ /logs/ads_click/2020/09/01/23 ○ /logs/like_event/2020/09/01/23 Diagram labels: Twitter DataCenter, Real Time Cluster, Production Cluster, Ad hoc Cluster, Cold Storage, Log Pipeline, Micro Services, Streaming systems, GCP, Google Cloud Storage, Services to manage data, Data Processing frameworks
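The hourly bucketing on this slide can be sketched as a small path helper. This is only an illustration: the function name `hourly_log_dir`, the `root` parameter, and the choice of UTC for the hour boundary are assumptions, since the slides show the directory layout but not the time zone or any API.

```python
from datetime import datetime, timezone


def hourly_log_dir(category: str, event_time: datetime, root: str = "/logs") -> str:
    """Return the hourly bucket directory for an event of the given category."""
    # Assumption: buckets are keyed by UTC hour; the slides do not say which zone is used.
    t = event_time.astimezone(timezone.utc)
    # Layout taken from the slide: /logs/<category>/YYYY/MM/DD/HH
    return f"{root}/{category}/{t:%Y/%m/%d/%H}"


if __name__ == "__main__":
    ts = datetime(2020, 9, 1, 23, 15, tzinfo=timezone.utc)
    print(hourly_log_dir("ads_click", ts))  # /logs/ads_click/2020/09/01/23
```

All events of one category that fall in the same hour land in the same directory, which is what lets downstream batch jobs pick up one closed hour at a time.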
  • 5. Events and Event Logs @Twitter: Log Management Components. Diagram labels: Clients, Http Clients, Client Daemon, Rufous, Aggregated by Category, Storage HDFS (Incoming), Storage HDFS (Replicated), Event Aggregation, Event Log Processing, Event Log Replication, Event Log Management
  • 6. Lessons Learnt ● Modularization ○ Each component should be independent and pluggable. ○ Communication between components should follow a simple protocol. ○ Each component should be able to scale independently. ● Tier-based approach ○ Resources should be shared inside a tier to improve utilization and resiliency. ○ Resources should be isolated between tiers to control blast radius. ● Scalability is always a primary concern ○ Traffic grows every year. ○ Scale leads to problems, e.g. the HDFS file number limit. ○ QoS of network traffic ● Users make mistakes ○ E.g. a user might make backward-incompatible schema changes. ○ A user might want to restate the data because of an error. ● Debuggability, long-tail problems, DC failover support, etc.
  • 7. Goals 01 Hybrid Environments ● Seamless integration of on-prem clusters and cloud ● On-prem parity on cloud, such as data format 02 Streaming/Batching ● Empower streaming use cases, e.g. Dataflow ● Support batch use cases such as Spark, Dataflow, Presto 03 Cloud Native and PDP ● Leverage cloud-native technologies and unlock more cloud-native tools ● PDP (Private data protection) is always a big thing at Twitter 04 Scalability ● Traffic grows every year; the new log pipeline should be able to handle it
  • 8. 1. Introduction & Goals 2. Log Pipeline in GCP 3. Streaming between DCs 4. Conclusion 5. Q&A
  • 9. Use cases Overview PUB/SUB … Producers GKE Container 1 Container N … VMs Service 1 Service N … Serverless CloudFunction App Engine … Rest API IDL API Topic 1 Topic 2 Topic N … Batch DataFlowJob 1 DataFlowJobN … GCS Consumers Restful API IDL API Kafka Stream Processing Stream Ingestion Jobs BigQuery User stream jobs … …
  • 10. Log Pipeline In GCP - Architecture ● Unified client lib ○ Abstracts backend implementation ● Google PubSub as subscribable storage ○ Rich meta headers, e.g. checksum ○ Exclusive subscription per destination ● Schedule processors and export metrics ● Processors: Dataflow jobs which stream data to different destinations ● State store: ○ Schema info ○ Per-category metadata such as owner ● Various destinations: BQ, GCS, Druid, etc. ● Replication service: glue of destinations Diagram labels: Application, Log Pipeline Client Lib, Google Pub/Sub, Log Processors, DataFlow GCS Processor, DataFlow BQ Processor, Scheduler, State Store, Replication Service
  • 11. Log Processors Streaming Processor: ● A per-topic Dataflow job reads from PubSub and writes to BQ ● E2E latency of a few seconds ● Dead letter table to handle corrupt data/schema errors ● E2E checksum validation Batch Processor: ● Multiple output formats ○ Thrift-lzo: row-based format ○ Parquet: column-based format ● E2E checksum validation ● Tackle cold start with dummy events ○ To handle empty time ranges Diagram labels: Application, Log Pipeline Client Lib, Google Pub/Sub, DataFlow GCS Processor, DataFlow BQ Processor, Replication Service
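The checksum validation and dead-letter handling listed for the streaming processor can be illustrated with a minimal, framework-free sketch. It is not the Dataflow job from the talk: the `Message` class, the `checksum` attribute name, and the use of CRC32 are assumptions made for illustration, since the slides do not name the checksum algorithm or the header key.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class Message:
    """Stand-in for a Pub/Sub message: raw payload plus string attributes."""
    payload: bytes
    attributes: dict = field(default_factory=dict)


def crc32_hex(data: bytes) -> str:
    # Assumption: CRC32 rendered as 8 hex digits; the real pipeline may use another checksum.
    return format(zlib.crc32(data), "08x")


def route(messages):
    """Split messages into (good, dead_letter) by comparing the checksum header to the payload."""
    good, dead_letter = [], []
    for msg in messages:
        expected = msg.attributes.get("checksum")  # hypothetical header name
        actual = crc32_hex(msg.payload)
        if expected == actual:
            good.append(msg)
        else:
            # Keep the original message and a reason, as a dead letter table row would.
            dead_letter.append((msg, f"checksum mismatch: expected {expected}, got {actual}"))
    return good, dead_letter
```

Routing bad records to a side output instead of failing the job keeps the stream flowing while preserving the corrupt events for later inspection.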
  • 12. Event Controller Processor Scheduler Config Watcher ● User-friendly configuration ○ No need to worry about implementation ○ Rich options including destination, data format, etc. ● Scalable and extensible ○ Support for multiple destination sinks ○ Stream and batch support ● Managed execution ○ Provides metrics and health checks ○ Priority and quota control support (planned) Status Watcher Rest API Event Execution Pool Job Abstraction Layer GCS Stream Ingestion BigQuery Stream Ingestion Druid Ingestion ... ... User Config Restful CMD
  • 13. Other Components Client Library ● Uniform way to publish log events ● Per log category metrics tracking ● Static schema validation check at event source ● Privacy Data Protection improvements ○ End to End checksums ○ End to End encryption ○ Optional Base-64 encoding Schema Management ● CI job to create schema jar and upload to GCS ● Each Processor re-loads latest schema bundle periodically Application Google Pub/Sub DataFlow GCS Processor DataFlow BQ Processor Log Pipeline Client Lib Replication Service
  • 14. Log Replication ● Used for batch workflows ● Logs are collected independently at each data center ● Log Replicator merges the logs across the data centers ○ Copies data from one DC to the rest ○ Uses the GCS connector to write to GCS through HDFS APIs
  • 15. Deployment ● Separate Log Pipeline for each organization (GCP project) for better security and chargeback ● Provisioning a Log Category ○ Map the log category to a GCP project during provisioning ○ Create GCP resources (PubSub topics, buckets, BQ datasets) automatically using the demigod service (Terraform) ○ Configure event routing ○ Access control: ■ Write access to storage (GCS/BQ) is limited to the pillar-org-specific log processor ■ Read access to the GCS bucket/BQ is limited to the service account only
  • 16. 1. Introduction & Goals 2. Log Pipeline in GCP 3. Streaming between DCs 4. Conclusion 5. Q&A
  • 17. Streaming Data from Twitter DCs to Cloud Log Pipeline Application1 (TWTTR-DC1) Flume Aggregation GCP Pub-Sub Streaming Log Processor (Data Flow) Scribe Daemon Client Library Application2 (TWTTR-DC1) Scribe Daemon Client Library
  • 18. Log Delivery - Big Picture Possible Routings ● Stream flows: ○ LPClient -> PubSub -> BQ ○ LPClient -> Flume -> PubSub -> BQ ● Batch flows: ○ LPClient -> Flume -> HDFS ○ LPClient -> PubSub -> GCS Diagram labels: Application1 (GCP), Application2 (TWTTR-DC1), Application3 (TWTTR-DC2), Client Library, Scribe Daemon, Flume Aggregation (one per Twitter DC), Kafka (Twitter DC), GCP Pub-Sub, Streaming Log Processor (Data Flow), Tez Log Processor, Replication Service
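The four routings on this slide can be written down as a small lookup table. The hop sequences are copied from the slide; the `(mode, origin)` keys, the names `ROUTES` and `route_for`, and the mapping of each flow to a GCP or on-prem origin are assumptions inferred from the diagram labels, not something the slides state.

```python
# Hop sequences as listed on the slide. The keys are hypothetical labels:
# mode is "stream" or "batch"; origin is where the publishing application runs.
ROUTES = {
    ("stream", "gcp"): ["LPClient", "PubSub", "BQ"],
    ("stream", "dc"): ["LPClient", "Flume", "PubSub", "BQ"],
    ("batch", "dc"): ["LPClient", "Flume", "HDFS"],
    ("batch", "gcp"): ["LPClient", "PubSub", "GCS"],
}


def route_for(mode: str, origin: str) -> str:
    """Return the hop sequence for a delivery mode and origin as a readable string."""
    try:
        hops = ROUTES[(mode, origin)]
    except KeyError:
        raise ValueError(f"no routing defined for mode={mode!r}, origin={origin!r}") from None
    return " -> ".join(hops)


if __name__ == "__main__":
    for key in ROUTES:
        print(key, route_for(*key))
```

Keeping the routes as data rather than code mirrors the idea of configuring routing with simple knobs: adding a sink means adding a table entry.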
  • 19. 1. Introduction & Goals 2. Log Pipeline in GCP 3. Streaming between DCs 4. Conclusion 5. Q&A
  • 20. Conclusion ● Embrace the hybrid cloud environment and provide a unified experience for publishing log events ● Log Pipeline serves as a global-scale log data delivery mechanism inside Twitter ○ Aggregation of data across DCs ○ Streaming and batch mode delivery ○ Support for various sinks ○ Users configure routing with simple knobs
  • 21. Q&A
  • 23. DataFlow Processors Streaming Processor: ● A per-topic Dataflow job reads from PubSub and writes to BQ ● E2E latency of a few seconds ● Dead letter table to handle corrupt data/schema errors ● Checksum validation Batch Processor: ● Multiple output formats ○ Thrift-lzo: row-based format ○ Parquet: column-based format ● E2E checksum support ● Tackle cold start with dummy events ○ To handle empty time ranges
  • 24. Log Pipeline In GCP - Architecture Application Google Pub/Sub DataFlow GCS Processor DataFlow GCS Processor Log Processors Scheduler State Store User Interface (UI/CLI) Log Pipeline Client Lib
  • 25. Log Pipeline In GCP - Architecture ● Unified client lib ○ Hides client differences ● Google PubSub as subscribable storage ○ Rich context headers, e.g. checksum ○ Exclusive subscription per destination ● Schedule processors ● Processors: Dataflow jobs which stream data to different destinations ● State store: ○ Schema info ○ Per-category metadata such as owner Diagram labels: Application, Log Pipeline Client Lib, Google Pub/Sub, Log Processors, DataFlow GCS Processor, DataFlow BQ Processor, Scheduler, State Store
  • 26. Job Scheduler ● A processor is a stream/batch ETL job that delivers data to a user-specified destination: ○ BigQuery stream ingestion ○ GCS stream ingestion
  • 27. Event Controller Job Schedulers Config Watcher ● User-friendly configuration ○ Users can configure the data destination easily. ○ Rich options including output format, ● Managed execution environment ○ Move ○ Pluggable engine. Simple Transfer Storage supported. Status Watcher Rest API Event Execution Pool Job Abstraction Layer GCS Stream Ingestion BigQuery Stream Ingestion Druid Ingestion ... ... User Config Restful Cmd
  • 28. Log Pipeline In GCP - Architecture ● Unified client lib ○ Hides client differences ● Google PubSub as subscribable storage ○ Rich context headers, e.g. checksum ○ Exclusive subscription per destination ● Schedule processors and export metrics ● Processors: Dataflow jobs which stream data to different destinations ● Various destinations: BQ, GCS, Druid, etc. ● State store: ○ Schema info ○ Per-category metadata such as owner ● Replication service: glue of destinations Diagram labels: Application, Log Pipeline Client Lib, Google Pub/Sub, Log Processors, DataFlow GCS Processor, DataFlow BQ Processor, Scheduler, State Store, Replication Service
  • 29. Twitter Data Analytics: Scale ● Several Hadoop clusters, >10K nodes ● >1EB (~1 Exabyte) storage capacity ● >100PB of data read and written daily ● >50K analytic jobs running on Data Platform per day
  • 30. Events and Event Logs @Twitter: Life of an event ● Clients log events specifying a Category name, e.g. ads_click, like_event ... ● Events are grouped together across all clients into the Category ● Events are stored on HDFS, bucketed every hour into separate directories ○ /logs/ads_click/2020/09/01/23 ○ /logs/like_event/2020/09/01/23 ● Event logs are replicated to other clusters or GCP ○ On-prem HDFS clusters ○ GCS Diagram labels: Clients, Http Clients, Client Daemon, Http Endpoint, Aggregated by Category, Storage HDFS (Incoming), Storage HDFS (Replicated)
  • 31. Events and Event Logs @Twitter: Life of an Event ● Clients log events specifying a Category name, e.g. ad_activated_keywords, login_event ... ● Events are grouped together across all clients into the Category ● Events are stored on HDFS, bucketed every hour into separate directories ○ /logs/ad_activated_keywords/2017/05/01/23 ○ /logs/login_event/2017/05/01/23 ● Event logs are replicated to other clusters Diagram labels: Clients, Http Clients, Client Daemon, Rufous, Aggregated by Category, Storage HDFS (Incoming), Storage HDFS (Replicated)
  • 32. Log Pipeline In GCP ● Terminology ○ GCS - Google Cloud Storage ○ GCP - Google Cloud Platform ○ Project - a Google Cloud project, which is an organizing unit for Google resources, including APIs ● The backend components are split into different pillar cloud projects. ○ A pillar is decided based on the organization, e.g. ads ○ Resources are isolated and planned independently ○ Better chargeback control
  • 33. Twitter DataCenter Events and Event Logs @Twitter Real Time Cluster Production Cluster Ad hoc Cluster Cold Storage Log Pipeline Micro Services Streaming systems GCP Google Cloud Storage Services to manage data Data Processing frameworks Data Ingestion Data Replication Data Retention & Management
  • 34. Lessons learned ● Modularization ○ Each component should be independent. ○ Communication between components should follow a simple protocol. ○ Each component should be able to scale independently.