ETL can be painful, with dirty data and outdated batch processes slowing you down; there has to be a better way. In this talk we’ll discuss the benefits of introducing a streaming platform to your architecture, including how it can greatly reduce complexity, speed up performance, and help your team deliver the features they need with real-time data integration.
Pandora’s Lawrence Weikum will discuss what they’ve done to bring real-time data integration to the team. We’ll review their Kafka-powered data pipelines and how they make the most of Kafka’s Connect API to make it surprisingly simple to keep systems in sync.
Presented by:
Lawrence Weikum, Senior Software Engineer, Pandora
Gehrig Kunz, Technical Product Marketing Manager, Confluent
ETL as a Platform: Pandora Plays Nicely Everywhere with Real-Time Data Pipelines
1. 1
ETL as a Platform
Pandora Plays Nicely
Everywhere with Real-Time
Data Pipelines
Lawrence Weikum, Senior Software Engineer, Pandora
Gehrig Kunz, Technical Product Marketing, Confluent
2. 2
Streaming in Action Series
Watch the full series on Confluent.io (you are here!)
3. 3
A look at today
What if ETL was a platform?
● How would this help us?
● What would it require?
● Kafka, a distributed streaming platform
How Pandora builds real-time data pipelines
● Why Pandora turned to Kafka
● A look at Pandora’s data pipelines
● Exploring Kafka and the Connect API
4. 4
ETL, a brief history
[Diagram: operational database → data warehouse]
Extract data from databases
Transform into destination warehouse schema
Load into a central data warehouse
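The three steps above can be sketched in a few lines of Python. This is a minimal illustration only; the table and column names (`users`, `dim_users`) and the transform are hypothetical, and in-memory SQLite stands in for both the operational database and the warehouse.

```python
import sqlite3

source = sqlite3.connect(":memory:")     # stand-in for the operational database
warehouse = sqlite3.connect(":memory:")  # stand-in for the data warehouse

source.execute("CREATE TABLE users (id INTEGER, name TEXT, signup TEXT)")
source.executemany("INSERT INTO users VALUES (?, ?, ?)",
                   [(1, "ada", "2017-01-05"), (2, "kai", "2017-02-11")])
warehouse.execute(
    "CREATE TABLE dim_users (user_id INTEGER, name TEXT, signup_year INTEGER)")

# Extract: pull rows out of the operational database
rows = source.execute("SELECT id, name, signup FROM users").fetchall()
# Transform: reshape each row into the warehouse schema
transformed = [(uid, name.upper(), int(signup[:4])) for uid, name, signup in rows]
# Load: write into the central warehouse table
warehouse.executemany("INSERT INTO dim_users VALUES (?, ?, ?)", transformed)

print(warehouse.execute("SELECT * FROM dim_users").fetchall())
# → [(1, 'ADA', 2017), (2, 'KAI', 2017)]
```

The pain the talk describes comes from running this kind of pass as a slow, periodic batch job rather than continuously.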
7. 7
ETL, a brief history
[Diagram: operational database → data warehouse]
● ETL has been focused on operations
● What can make it more valuable today?
11. 11
ETL as a Platform via events
[Diagram: the mobile app, web app, and API publish events such as “A product was purchased” into the streaming platform; monitoring, recommendation, payments, ordering, and the data warehouse each consume them.]
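The model behind that diagram is an append-only event log that every downstream system reads at its own pace. A toy sketch in pure Python (this is illustrative only, not the Kafka API; the consumer names are hypothetical):

```python
# One "topic": an append-only event log.
log = []

def publish(event):
    log.append(event)

# Each consumer tracks its own position in the log independently.
offsets = {"payments": 0, "warehouse": 0}

def poll(consumer):
    """Return events this consumer has not yet seen and advance its offset."""
    start = offsets[consumer]
    events = log[start:]
    offsets[consumer] = len(log)
    return events

publish({"type": "purchase", "sku": "A12", "amount": 9.99})
print(poll("payments"))    # payments sees the purchase...
print(poll("warehouse"))   # ...and so does the warehouse, independently
print(poll("payments"))    # nothing new → []
```

Because consumers keep their own offsets, adding a new downstream system never requires changing the producers.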
12. 12
A New Engineering Goal
Orient our infrastructure around real-time stream processing analytics
13. 13
Why move from batch to real-time?
● Batch Processing: Too slow, too hands-on
○ Building reliable and resilient pipelines into HDFS yourself is difficult and repetitive
○ Batch processing is slow and error prone
● Speed is money
○ Start making business decisions without waiting
14. 14
First challenge: real-time ads
Ad trafficking infrastructure scope:
● Determine which ad to serve
● Track billed reporting events:
impressions, clicks, engagements
15. 15
What we needed to support
● 85 million monthly active users
● 1+ billion events per day
● 1 TB per day
● 99.99% uptime
16. 16
Streaming challenges we needed to solve
• Handle an expanding high volume of data
• Deliver real-time data integration
• Want to use the data to make business decisions
• Want to land all data in HDFS for posterity
17. 17
Why Kafka was the right choice
1. Distributed, highly available, low latency
2. Security
3. Integrates well into HDFS and Hive
4. Fairly simple to write consumers and producers
5. Easily connects microservices
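On point 4, a producer is little more than a keyed write: the key determines the partition, so all events for one key arrive on the same partition in order. A sketch of that mapping (illustrative only: crc32 stands in for the murmur2 hash Kafka’s default partitioner actually uses, and the names are hypothetical):

```python
import json
import zlib

NUM_PARTITIONS = 6  # partitions for our hypothetical topic

def partition_for(key: str) -> int:
    # Hash the key to a partition; same key always lands on the same partition.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def make_record(listener_id: str, event: dict) -> dict:
    # What a producer hands to the broker: a key, a partition, and a payload.
    return {
        "key": listener_id,
        "partition": partition_for(listener_id),
        "value": json.dumps(event),
    }

r1 = make_record("listener-42", {"type": "ad_impression"})
r2 = make_record("listener-42", {"type": "ad_click"})
assert r1["partition"] == r2["partition"]  # same key → same partition, so ordered
```

Per-key ordering on a partition is what makes event streams like impression-then-click safe to process downstream.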
20. 20
Confluent’s Schema Registry
● Manages metadata
● Serialization/deserialization with Avro
● Allows evolution of schemas according to the configured compatibility setting
This helps us:
● Automatically update schemas in Hive
● Eliminate error-prone batch jobs that parse data and fit it into schemas
● Discover data across multiple teams and projects
● Write schemas once, use them everywhere
● Update producers OR consumers first
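The last point works because Avro fields carry defaults: a consumer on a newer schema can still read records written before a field existed. A toy illustration of just that default-filling step (real Avro schema resolution is richer than this; the field names match the example schema on the next slide):

```python
# Fields of the consumer's (newer) reader schema, with their defaults.
new_schema_fields = [
    {"name": "name", "default": None},
    {"name": "favorite_number", "default": 0},
    {"name": "favorite_color", "default": None},  # added after old records were written
]

def resolve(record: dict) -> dict:
    # Fill in any field missing from an old record using the schema default.
    return {f["name"]: record.get(f["name"], f["default"]) for f in new_schema_fields}

old_record = {"name": "ada", "favorite_number": 7}  # written with the old schema
print(resolve(old_record))
# → {'name': 'ada', 'favorite_number': 7, 'favorite_color': None}
```

This is why adding a field with a default is a backward-compatible change, while adding one without a default is not.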
21. 21
Enabling our developers
● Developer creates a pull request for the new schema
● Code change is approved and merged
● Gradle conducts compatibility checks against Schema Registry throughout the process
● Gradle compiles Avro to Java and submits Jar to internal Maven repository
{
  "namespace": "com.namespace",
  "type": "record",
  "name": "EventName",
  "fields": [
    {"name": "name", "type": ["null", "string"], "default": null},
    {"name": "favorite_number", "type": "int", "default": 0},
    {"name": "favorite_color", "type": ["null", "string"], "default": null}
  ]
}
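The Gradle compatibility check in this workflow can use Schema Registry’s REST compatibility endpoint. A sketch that builds such a request (the registry URL and subject name are hypothetical; note the schema is sent as an escaped JSON string inside the request body):

```python
import json

REGISTRY = "http://schema-registry:8081"  # hypothetical registry address

def compat_request(subject: str, schema: dict):
    """Build the URL and body for a compatibility check against the latest version."""
    url = f"{REGISTRY}/compatibility/subjects/{subject}/versions/latest"
    body = json.dumps({"schema": json.dumps(schema)})  # schema escaped as a string
    return url, body

url, body = compat_request(
    "EventName-value",
    {"type": "record", "name": "EventName", "namespace": "com.namespace", "fields": []},
)
print(url)
# POSTing this body (e.g. with urllib) returns a response like {"is_compatible": true}
```

Running this check in CI, before the Avro-to-Java compilation step, is what keeps an incompatible schema from ever being merged.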
27. 27
The Results
● One Kafka Connect worker can easily write 136,000 messages per second to HDFS
● 3% CPU usage
● 225 MBps network inbound
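An HDFS sink of the kind measured above is defined by a small Kafka Connect config. A sketch of one, with illustrative values throughout (the connector name, topic, HDFS URL, and flush size here are hypothetical, not Pandora’s actual settings):

```json
{
  "name": "hdfs-sink",
  "config": {
    "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
    "tasks.max": "1",
    "topics": "ad-events",
    "hdfs.url": "hdfs://namenode:8020",
    "flush.size": "10000"
  }
}
```

Posting this JSON to the Connect REST API starts the connector; no custom pipeline code is written or deployed.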
28. 28
Lessons learned
• Hadoop cluster maintenance is far easier
  • Kafka retains the data during an outage
  • Kafka Connect moves the data once Hadoop is back up
  • No coordination needed with clients or stakeholders
• Seeing is believing
  • Our approach ran counter to schema-less data, and to fitting data into schemas at the end of the pipe
  • Once people saw the new pipeline in action, attitudes started changing
29. 29
This enables us to...
Benefits for Pandora
● Get to make business decisions faster and more often
● Have up-to-date dashboards for stakeholders and business partners
● Can focus on other engineering initiatives
Benefits for our developers
● Only have to focus on writing data once
● Downstream consumers will always be able to read data
● Strict typing for languages that support it
● Hours spent updating and coordinating schemas and serializations across connected teams are now replaced by a few minutes of work by one team
30. 30
What’s next for Kafka @ Pandora
● Expand connectors for our other internal systems
● Update connectors to write in ORC format
● Continue converting older pipelines pushing into HDFS to use Kafka
● Streaming computation and analysis of data
31. 31
ETL as a (Streaming) Platform
Move from batch to real-time
Transition to an event-driven streaming architecture
Drive developer access
Integrate databases, stream processing, and business applications
Distributed scale
Future-proof ETL with the scale and reliability of a distributed system
32. 32
How this helps
Simple at scale
Break workflows down into real-time events to remove the combinatorial complexity of point-to-point integrations
Future-ready
Adapt and build with what you need, be it a new database or ML library
Speed up development
Let developers get the data they need to support microservices
33. 33
Interested?
Check out our open positions:
https://www.pandora.com/careers/all
Read up on our Engineering Blog:
https://engineering.pandora.com/welcome-to-the-pandora-engineering-blog-8c2fab14ea8a
34. 34
Download Confluent Open Source
Join the Confluent Slack community
Check out Kafka Summit!
August 28th in San Francisco
Thanks!