ETL can be painful, with dirty data and outdated batch processes slowing you down; there has to be a better way. In this talk we’ll discuss the benefits of introducing a streaming platform to your architecture, including how it can greatly reduce complexity, speed up performance, and help your team deliver the features they need with real-time data integration.
Pandora’s Lawrence Weikum will discuss what they’ve done to bring real-time data integration to the team. We’ll review their Kafka-powered data pipelines and how they make the most of Kafka’s Connect API to make it surprisingly simple to keep systems in sync.
Presented by:
Lawrence Weikum, Senior Software Engineer, Pandora
Gehrig Kunz, Technical Product Marketing Manager, Confluent
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
1. 1
ETL as a Platform
Pandora Plays Nicely
Everywhere with Real-Time
Data Pipelines
Lawrence Weikum, Senior Software Engineer, Pandora
Gehrig Kunz, Technical Product Marketing, Confluent
2. 2
Streaming in Action Series
Watch the full series on Confluent.io (you are here!)
3. 3
A look at today
What if ETL was a platform?
● How would this help us?
● What would it require?
● Kafka, a distributed streaming platform
How Pandora builds real-time data pipelines
● Why Pandora turned to Kafka
● A look at Pandora’s data pipelines
● Exploring Kafka and the Connect API
4. 4
ETL, a brief history
[Diagram: operational database → data warehouse]
Extract data from databases
Transform into destination warehouse schema
Load into a central data warehouse
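The three steps above can be sketched in a few lines of Python. This is a minimal illustration only; the table and column names (`users`, `dim_users`) and the transform are hypothetical, and in-memory SQLite stands in for both the operational database and the warehouse.

```python
import sqlite3

source = sqlite3.connect(":memory:")     # stand-in for the operational database
warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse

source.execute("CREATE TABLE users (id INTEGER, name TEXT, signup TEXT)")
source.executemany("INSERT INTO users VALUES (?, ?, ?)",
                   [(1, "ada", "2017-01-05"), (2, "kai", "2017-02-11")])
warehouse.execute(
    "CREATE TABLE dim_users (user_id INTEGER, name TEXT, signup_year INTEGER)")

# Extract: pull rows out of the operational database
rows = source.execute("SELECT id, name, signup FROM users").fetchall()
# Transform: reshape each row into the warehouse schema
transformed = [(uid, name.upper(), int(signup[:4])) for uid, name, signup in rows]
# Load: write into the central warehouse table
warehouse.executemany("INSERT INTO dim_users VALUES (?, ?, ?)", transformed)

print(warehouse.execute("SELECT * FROM dim_users").fetchall())
# → [(1, 'ADA', 2017), (2, 'KAI', 2017)]
```

The pain the talk describes comes from running this kind of pass as a slow, periodic batch job rather than continuously.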
7. 7
ETL, a brief history
[Diagram: operational database → data warehouse]
● ETL has been focused on operations
● What can make it more valuable today?
11. 11
ETL as a Platform via events
[Diagram: the mobile app, web app, and API publish events such as “A product was purchased” into the streaming platform; monitoring, recommendation, payments, ordering, and the data warehouse each consume them.]
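The model behind that diagram is an append-only event log that every downstream system reads at its own pace. A toy sketch in pure Python (this is illustrative only, not the Kafka API; the consumer names are hypothetical):

```python
# One "topic": an append-only event log.
log = []

def publish(event):
    log.append(event)

# Each consumer tracks its own position in the log independently.
offsets = {"payments": 0, "warehouse": 0}

def poll(consumer):
    """Return events this consumer has not yet seen and advance its offset."""
    start = offsets[consumer]
    events = log[start:]
    offsets[consumer] = len(log)
    return events

publish({"type": "purchase", "sku": "A12", "amount": 9.99})
print(poll("payments"))    # payments sees the purchase...
print(poll("warehouse"))   # ...and so does the warehouse, independently
print(poll("payments"))    # nothing new → []
```

Because consumers keep their own offsets, adding a new downstream system never requires changing the producers.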
12. 12
A New Engineering Goal
Orient our infrastructure around real-time stream processing analytics
13. 13
Why move from batch to real-time?
● Batch Processing: Too slow, too hands-on
○ Building reliable and resilient pipelines into HDFS yourself is difficult and repetitive
○ Batch processing is slow and error prone
● Speed is money
○ Start making business decisions without waiting
14. 14
First challenge: real-time ads
Ad trafficking infrastructure scope:
● Determine which ad to serve
● Track billed reporting events:
impressions, clicks, engagements
15. 15
What we needed to support
● 85 million monthly active users
● 1+ billion events per day
● 1 TB per day
● 99.99% uptime
16. 16
Streaming challenges we needed to solve
• Handle an expanding high volume of data
• Deliver real-time data integration
• Want to use the data to make business decisions
• Want to land all data in HDFS for posterity
17. 17
Why Kafka was the right choice
1. Distributed, highly available, low latency
2. Security
3. Integrates well into HDFS and Hive
4. Fairly simple to write consumers and producers
5. Easily connects microservices
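On point 4, a producer is little more than a keyed write: the key determines the partition, so all events for one key arrive on the same partition in order. A sketch of that mapping (illustrative only: crc32 stands in for the murmur2 hash Kafka’s default partitioner actually uses, and the names are hypothetical):

```python
import json
import zlib

NUM_PARTITIONS = 6  # partitions for our hypothetical topic

def partition_for(key: str) -> int:
    # Hash the key to a partition; same key always lands on the same partition.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def make_record(listener_id: str, event: dict) -> dict:
    # What a producer hands to the broker: a key, a partition, and a payload.
    return {
        "key": listener_id,
        "partition": partition_for(listener_id),
        "value": json.dumps(event),
    }

r1 = make_record("listener-42", {"type": "ad_impression"})
r2 = make_record("listener-42", {"type": "ad_click"})
assert r1["partition"] == r2["partition"]  # same key → same partition, so ordered
```

Per-key ordering on a partition is what makes event streams like impression-then-click safe to process downstream.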
20. 20
Confluent’s Schema Registry
● Manages metadata
● Serialization/deserialization with Avro
● Allows evolution of schemas according to the configured compatibility setting
This helps us:
● Automatically update schemas in Hive
● Eliminate error-prone batch jobs that parse data and fit it into schemas
● Discover data across multiple teams and projects
● Write schemas once, use them everywhere
● Update producers OR consumers first
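The last point works because Avro fields carry defaults: a consumer on a newer schema can still read records written before a field existed. A toy illustration of just that default-filling step (real Avro schema resolution is richer than this; the field names match the example schema on the next slide):

```python
# Fields of the consumer's (newer) reader schema, with their defaults.
new_schema_fields = [
    {"name": "name", "default": None},
    {"name": "favorite_number", "default": 0},
    {"name": "favorite_color", "default": None},  # added after old records were written
]

def resolve(record: dict) -> dict:
    # Fill in any field missing from an old record using the schema default.
    return {f["name"]: record.get(f["name"], f["default"]) for f in new_schema_fields}

old_record = {"name": "ada", "favorite_number": 7}  # written with the old schema
print(resolve(old_record))
# → {'name': 'ada', 'favorite_number': 7, 'favorite_color': None}
```

This is why adding a field with a default is a backward-compatible change, while adding one without a default is not.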
21. 21
Enabling our developers
● Developer creates a pull request for the new schema
● Code change is approved and merged
● Gradle conducts compatibility checks against Schema Registry throughout the process
● Gradle compiles Avro to Java and submits Jar to internal Maven repository
{
  "namespace": "com.namespace",
  "type": "record",
  "name": "EventName",
  "fields": [
    {"name": "name", "type": ["null", "string"], "default": null},
    {"name": "favorite_number", "type": "int", "default": 0},
    {"name": "favorite_color", "type": ["null", "string"], "default": null}
  ]
}
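The Gradle compatibility check in this workflow can use Schema Registry’s REST compatibility endpoint. A sketch that builds such a request (the registry URL and subject name are hypothetical; note the schema is sent as an escaped JSON string inside the request body):

```python
import json

REGISTRY = "http://schema-registry:8081"  # hypothetical registry address

def compat_request(subject: str, schema: dict):
    """Build the URL and body for a compatibility check against the latest version."""
    url = f"{REGISTRY}/compatibility/subjects/{subject}/versions/latest"
    body = json.dumps({"schema": json.dumps(schema)})  # schema escaped as a string
    return url, body

url, body = compat_request(
    "EventName-value",
    {"type": "record", "name": "EventName", "namespace": "com.namespace", "fields": []},
)
print(url)
# POSTing this body (e.g. with urllib) returns a response like {"is_compatible": true}
```

Running this check in CI, before the Avro-to-Java compilation step, is what keeps an incompatible schema from ever being merged.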
27. 27
The Results
● One Kafka Connect worker can easily write 136,000 messages per second to HDFS
● 3% CPU usage
● 225 MBps network inbound
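An HDFS sink of the kind measured above is defined by a small Kafka Connect config. A sketch of one, with illustrative values throughout (the connector name, topic, HDFS URL, and flush size here are hypothetical, not Pandora’s actual settings):

```json
{
  "name": "hdfs-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "1",
    "topics": "ad-events",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "10000"
  }
}
```

Posting this JSON to the Connect REST API starts the connector; no custom pipeline code is written or deployed.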
28. 28
Lessons learned
• Hadoop cluster maintenance is far easier
  • Kafka retains the data during an outage
  • Kafka Connect moves the data once Hadoop is back up
  • No coordination needed with clients or stakeholders
• Seeing is believing
  • Our approach ran counter to schema-less data, and to fitting data into schemas at the end of the pipe
  • Once people saw the new pipeline in action, attitudes started changing
29. 29
This enables us to...
Benefits for Pandora
● Get to make business decisions faster and more often
● Have up-to-date dashboards for stakeholders and business partners
● Can focus on other engineering initiatives
Benefits for our developers
● Only have to focus on writing data once
● Downstream consumers will always be able to read data
● Strict typing for languages that support it
● Hours spent updating and coordinating schemas and serializations across connected teams are now replaced by a few minutes of work by one team
30. 30
What’s next for Kafka @ Pandora
● Expand connectors for our other internal systems
● Update connectors to write in ORC format
● Continue converting older pipelines pushing into HDFS to use Kafka
● Streaming computation and analysis of data
31. 31
ETL as a (Streaming) Platform
Move from batch to real-time
Transition to an event-driven streaming architecture
Drive developer access
Integrate databases, stream processing, and business applications
Distributed scale
Future-proof ETL with the scale and reliability of a distributed system
32. 32
How this helps
Simple at scale
Break workflows down into real-time events to remove the combinatorial complexity of point-to-point integrations
Future-ready
Adapt and build with what you need, be it a new database or ML library
Speed up development
Let developers get the data they need to support microservices
33. 33
Interested?
Check out our open positions:
https://www.pandora.com/careers/all
Read up on our Engineering Blog:
https://engineering.pandora.com/welcome-to-the-pandora-engineering-blog-8c2fab14ea8a
34. 34
Download Confluent Open Source
Join the Confluent Slack community
Check out Kafka Summit!
August 28th in San Francisco
Thanks!