1
ETL as a Platform
Pandora Plays Nicely
Everywhere with Real-Time
Data Pipelines
Lawrence Weikum, Senior Software Engineer, Pandora
Gehrig Kunz, Technical Product Marketing, Confluent
2
Streaming in Action Series
Watch on Confluent.io
You are here!
Watch on Confluent.io
3
A look at today
What if ETL was a platform?
● How would this help us
● What would it require
● Kafka, a distributed streaming platform
How Pandora builds real-time data pipelines
● Why Pandora turned to Kafka
● A look at Pandora’s data pipelines
● Exploring Kafka and the Connect API
4
ETL, a brief history
Operational
database
Data
warehouse
Extract data from databases
Transform into destination warehouse schema
Load into a central data warehouse
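The three steps above can be sketched in a few lines. This is a minimal illustration of classic batch ETL, not anything from Pandora's systems: the row layout, the field names, and the in-memory "warehouse" list are all hypothetical.

```python
# Hypothetical rows from an operational database (all names are illustrative).
operational_rows = [
    {"order_id": 1, "amount_cents": 1999, "created": "2017-06-01T10:15:00"},
    {"order_id": 2, "amount_cents": 500, "created": "2017-06-01T18:40:00"},
]


def extract(source):
    """Extract: read every row from the operational source."""
    return list(source)


def transform(rows):
    """Transform: reshape each row into the (assumed) warehouse schema."""
    return [
        {
            "order_id": row["order_id"],
            "amount_usd": row["amount_cents"] / 100,
            "day": row["created"][:10],  # keep only the date part
        }
        for row in rows
    ]


def load(rows, warehouse):
    """Load: append the transformed rows to the central warehouse table."""
    warehouse.extend(rows)
    return len(rows)


warehouse_table = []
loaded = load(transform(extract(operational_rows)), warehouse_table)
```

In a batch setup this whole chain runs on a schedule, so the warehouse is only as fresh as the last run.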
7
ETL, a brief history
Operational
database
Data
warehouse
● ETL has been focused on operations
● What can make it more valuable today?
8
Today’s ETL for developers – a streaming platform?
9
ETL via events
Streaming
Platform
Data
warehouse
Mobile
App
“A product was purchased”
10
ETL via events
Streaming
Platform
Mobile
App
Web
App
API
“A product was purchased”
Data
warehouse
11
ETL as a Platform via events
Streaming
Platform
Mobile
App
“A product was purchased”
Web
App
API
Monitoring
Recommendation
Payments
Ordering
Data
warehouse
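The fan-out on this slide, where one published event is read independently by several downstream systems, can be sketched with an in-memory append-only log. This is a toy model under stated assumptions, not Kafka's API: the `EventLog` class, its method names, and the event fields are hypothetical.

```python
from collections import defaultdict


class EventLog:
    """Toy append-only log; each consumer tracks its own read position."""

    def __init__(self):
        self._events = []
        self._offsets = defaultdict(int)  # consumer name -> next offset to read

    def publish(self, event):
        """Append an event and return its offset in the log."""
        self._events.append(event)
        return len(self._events) - 1

    def poll(self, consumer):
        """Return every event this consumer has not seen yet."""
        start = self._offsets[consumer]
        batch = self._events[start:]
        self._offsets[consumer] = len(self._events)
        return batch


log = EventLog()
log.publish({"type": "product_purchased", "source": "mobile_app", "sku": "A1"})
log.publish({"type": "product_purchased", "source": "web_app", "sku": "B2"})

# Each consumer reads at its own pace without affecting the others.
recommendation_batch = log.poll("recommendation")
log.publish({"type": "product_purchased", "source": "api", "sku": "C3"})
warehouse_batch = log.poll("data_warehouse")
recommendation_next = log.poll("recommendation")
```

Because reading does not remove events, adding a new consumer such as monitoring or payments requires no change to the producers.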
12
A New Engineering Goal
Orient our infrastructure around real-time stream processing analytics
13
Why move from batch to real-time?
● Batch Processing: Too slow, too hands-on
○ Building reliable and resilient pipelines into HDFS yourself is difficult and repetitive
○ Batch processing is slow and error prone
● Speed is money
○ Start making business decisions without waiting
14
First challenge: real-time ads
Ad trafficking infrastructure scope:
● Determine which ad to serve
● Track billed reporting events:
impressions, clicks, engagements
15
What we needed to support
● 85 million monthly active users
● 99.99% uptime
● About 1 billion events per day and about 1 TB per day (one of these two figures appears as “1+”)
16
Streaming challenges we needed to solve
• Handle an expanding high volume of data
• Deliver real-time data integration
• Want to use the data to make business decisions
• Want to land all data in HDFS for posterity
17
Why Kafka was the right choice
1. Distributed, high availability, low latency
2. Security
3. Integrates well into HDFS and Hive
4. Fairly simple to write consumers and producers
5. Easily connects microservices
18
What we built
19
Step 1: Create and Register Schema
20
Confluent’s Schema Registry
● Manages metadata
● Serialization/deserialization with Avro
● Allows evolution of schemas according to the configured compatibility setting
This helps us:
● Automatically update schemas in Hive
● Eliminate error-prone batch jobs that parse data and fit it into schemas
● Discover data across multiple teams and projects
● Write schemas once, use them everywhere
● Update producers OR consumers first
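Being able to update producers or consumers first depends on schemas staying compatible in both directions. The sketch below shows the idea for the simplest case only, added and removed record fields. It is a simplified illustration, not the Schema Registry's actual checker: it ignores type changes, aliases, and nested records, and the function names and sample schemas are hypothetical.

```python
def field_map(schema):
    """Index a record schema's fields by name."""
    return {field["name"]: field for field in schema["fields"]}


def backward_compatible(old, new):
    """New readers can read old data: every added field needs a default."""
    old_fields, new_fields = field_map(old), field_map(new)
    added = new_fields.keys() - old_fields.keys()
    return all("default" in new_fields[name] for name in added)


def forward_compatible(old, new):
    """Old readers can read new data: every removed field had a default."""
    old_fields, new_fields = field_map(old), field_map(new)
    removed = old_fields.keys() - new_fields.keys()
    return all("default" in old_fields[name] for name in removed)


def fully_compatible(old, new):
    """Both directions hold, so either side can be upgraded first."""
    return backward_compatible(old, new) and forward_compatible(old, new)


name_field = {"name": "name", "type": ["null", "string"], "default": None}
v1 = {"type": "record", "name": "EventName", "fields": [name_field]}

# Adding a field with a default keeps old and new code interoperable.
v2_ok = {"type": "record", "name": "EventName", "fields": [
    name_field,
    {"name": "favorite_number", "type": ["int"], "default": 0},
]}

# Adding a required field breaks new readers on old data.
v2_bad = {"type": "record", "name": "EventName", "fields": [
    name_field,
    {"name": "user_id", "type": "string"},
]}

ok_result = fully_compatible(v1, v2_ok)
bad_result = fully_compatible(v1, v2_bad)
```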
21
Enabling our developers
● Developer creates a pull request for the new schema
● Code change is approved and merged
● Gradle conducts compatibility checks against Schema Registry throughout the process
● Gradle compiles Avro to Java and submits Jar to internal Maven repository
{
"namespace": "com.namespace",
"type": "record",
"name": "EventName",
"fields": [
{"name": "name", "type": ["null", "string"], "default": null},
{"name": "favorite_number", "type": ["int"], "default": 0},
{"name": "favorite_color", "type": ["null", "string"], "default": null}
]
}
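Every field in the schema above carries a default, which is what lets a reader fill in fields that a writer did not supply. The sketch below mimics that effect on plain dictionaries. It is not Avro encoding or decoding, and the `with_defaults` helper and the sample record are hypothetical.

```python
import json

# The schema from the slide, with its closing brace.
SCHEMA_JSON = """
{
  "namespace": "com.namespace",
  "type": "record",
  "name": "EventName",
  "fields": [
    {"name": "name", "type": ["null", "string"], "default": null},
    {"name": "favorite_number", "type": ["int"], "default": 0},
    {"name": "favorite_color", "type": ["null", "string"], "default": null}
  ]
}
"""

schema = json.loads(SCHEMA_JSON)


def with_defaults(record, schema):
    """Return a copy of record with missing fields filled from the schema."""
    result = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            result[name] = record[name]
        elif "default" in field:
            result[name] = field["default"]
        else:
            raise ValueError(f"missing field with no default: {name}")
    return result


event = with_defaults({"name": "Ada"}, schema)
```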
22
Step 2: Produce Kafka Messages
23
Step 3: Consume Kafka Messages using HDFS Connector
24
Intro to Kafka’s Connect API
● A framework for scalably and reliably connecting Kafka with external systems
25
Using the Connect API
{
"connector.class":"io.confluent.connect.hdfs.HdfsSinkConnector",
"tasks.max":,
"topics":"ad_server_event_A,ad_server_event_B,ad_server_event_C…",
"hdfs.url":,
"flush.size":,
"rotate.interval.ms":"1000000",
"logs.dir":,
"topics.dir":,
"hive.database":,
"hive.integration":"true",
"hive.metastore.uris":,
"schema.compatibility":"FULL",
"format.class":"io.confluent.connect.hdfs.parquet.ParquetFormat",
"partitioner.class":"io.confluent.connect.hdfs.partitioner.FieldPartitioner",
"partition.field.name":"day",
}
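Several values in the configuration above are left blank on the slide. A small pre-flight check can catch unset values before the configuration is submitted. In the sketch below, `None` marks each value the slide leaves blank, and nothing is filled in. The helper names and the connector name are hypothetical, the topic list is shortened to two of the slide's topics, and the request body follows the `{"name": ..., "config": ...}` shape accepted by the Kafka Connect REST API's `POST /connectors` endpoint.

```python
import json

# None marks a value that the slide leaves blank; it is deliberately not guessed.
hdfs_sink_config = {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": None,
    "topics": "ad_server_event_A,ad_server_event_B",  # shortened for illustration
    "hdfs.url": None,
    "flush.size": None,
    "rotate.interval.ms": "1000000",
    "logs.dir": None,
    "topics.dir": None,
    "hive.database": None,
    "hive.integration": "true",
    "hive.metastore.uris": None,
    "schema.compatibility": "FULL",
    "format.class": "io.confluent.connect.hdfs.parquet.ParquetFormat",
    "partitioner.class": "io.confluent.connect.hdfs.partitioner.FieldPartitioner",
    "partition.field.name": "day",
}


def missing_settings(config):
    """List the settings that still need a value, in alphabetical order."""
    return sorted(key for key, value in config.items() if value is None)


def connector_request(name, config):
    """Build the JSON body for POST /connectors, refusing unset values."""
    missing = missing_settings(config)
    if missing:
        raise ValueError("unset connector settings: " + ", ".join(missing))
    return json.dumps({"name": name, "config": config}, indent=2)


unset = missing_settings(hdfs_sink_config)
```

Calling `connector_request` with this dictionary raises `ValueError` until every blank is supplied, which keeps a half-finished configuration from reaching the Connect cluster.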
26
Achieving High Availability
27
The Results
● A single Kafka Connect worker can easily write 136,000 messages per second to HDFS
● 3% CPU usage
● 225 MBps network inbound
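A quick back-of-the-envelope calculation puts these figures in context. It assumes that "MBps" means 10^6 bytes per second, that both numbers were measured at the same time, and that the rate could be sustained. None of these assumptions is stated on the slide, and the inbound figure may include protocol overhead, so the message size is only a rough on-the-wire estimate.

```python
MESSAGES_PER_SECOND = 136_000
INBOUND_BYTES_PER_SECOND = 225 * 10**6  # assumes "MBps" = 10^6 bytes per second
SECONDS_PER_DAY = 24 * 60 * 60

# Rough average size of one message on the wire.
avg_message_bytes = INBOUND_BYTES_PER_SECOND / MESSAGES_PER_SECOND

# Messages one worker could move in a day at a sustained peak rate.
messages_per_day = MESSAGES_PER_SECOND * SECONDS_PER_DAY

# Time one worker would need for 1 billion events, the daily order of
# magnitude given earlier in the deck.
hours_for_one_billion = 1_000_000_000 / MESSAGES_PER_SECOND / 3600

print(f"{avg_message_bytes:.0f} bytes per message")
print(f"{messages_per_day:,} messages per day")
print(f"{hours_for_one_billion:.2f} hours for one billion events")
```

Under these assumptions a single worker has several times the capacity needed for a billion events per day, which is consistent with the low CPU figure.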
28
Lessons learned
• Hadoop cluster maintenance is far easier
• Kafka will retain data
• Kafka Connect will move the data when Hadoop is back up
• No coordination needed with clients or stakeholders
• Seeing is Believing
• Our choices ran against existing preferences for schema-less data, or for fitting data into schemas at the end of the pipe
• Once people saw the new pipeline in action, attitudes started changing
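The first lesson above, that Kafka retains data while Hadoop is down and Kafka Connect catches up afterwards, comes down to advancing a committed offset only after a successful write. The sketch below models that with plain lists. It is a toy under stated assumptions, not Kafka Connect's implementation: the `HdfsStandIn` class and `run_connector_once` function are hypothetical names.

```python
class HdfsStandIn:
    """Toy sink that can be switched off to simulate cluster maintenance."""

    def __init__(self):
        self.available = True
        self.rows = []

    def write(self, batch):
        if not self.available:
            raise ConnectionError("sink is down")
        self.rows.extend(batch)


def run_connector_once(log, committed_offset, sink):
    """Copy everything after committed_offset; advance only on success."""
    batch = log[committed_offset:]
    try:
        sink.write(batch)
    except ConnectionError:
        return committed_offset  # nothing is lost; the log still holds the data
    return committed_offset + len(batch)


log = ["e0", "e1"]
sink = HdfsStandIn()
offset = run_connector_once(log, 0, sink)

# Maintenance window: producers keep writing, the sink rejects writes.
sink.available = False
log += ["e2", "e3", "e4"]
offset_during_outage = run_connector_once(log, offset, sink)

# Sink is back: the connector resumes from the last committed offset.
sink.available = True
offset_after = run_connector_once(log, offset_during_outage, sink)
```

Producers never see the outage, which is why no coordination with clients or stakeholders is needed.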
29
This enables us to...
Benefits for Pandora
● Get to make business decisions faster and more often
● Have up-to-date dashboards for stakeholders and business partners
● Can focus on other engineering initiatives
Benefits for our developers
● Only have to focus on writing data once
● Downstream consumers will always be able to read data
● Strict typing for languages that support it
● Hours of constant updating and coordinating of schemas and serializations between connected teams are now replaced by a few minutes of work by one team
30
What’s next for Kafka @ Pandora
● Expand connectors for our other internal systems
● Update connectors to write in ORC format
● Continue converting older pipelines pushing into HDFS to use Kafka
● Streaming computation and analysis of data
31
ETL as a (Streaming) Platform
Move from batch to real-time
Transition to an event-driven,
streaming architecture
Drive developer access
Integrate databases,
stream processing, and
business applications
Distributed scale
Future-proof ETL with the scale and
reliability of a distributed system
32
How this helps
Simple at scale
Break it down to real-time events to remove exponential complexity
Future-ready
Adapt and build with what you need, be it a new database or ML library
Speed up development
Let developers get what they need to support microservices
33
Interested?
Check out our open positions:
https://www.pandora.com/careers/all
Read up on our Engineering Blog:
https://engineering.pandora.com/welcome-to-the-pandora-engineering-blog-8c2fab14ea8a
34
Download Confluent Open Source
Join the Confluent Slack community
Check out Kafka Summit!
August 28th in San Francisco
Thanks!
