The document discusses the future of Extract, Transform, Load (ETL) processes. It outlines how ETL has evolved from monthly batch processes 15 years ago to more real-time integration enabled by streaming data platforms. Key trends driving changes include cloud computing, microservices, and software engineers taking on more responsibilities. The document argues the future of ETL is to model all data as a continuous stream and leverage streaming data platforms to enable real-time integration that moves fast without breaking compatibility. This allows for fully integrated applications and databases at scale.
How AI, OpenAI, and ChatGPT impact business and software.
The Future of ETL - Strata Data New York 2018
1. 1
The Future of ETL
(Revised)
Gwen Shapira, Principal Data Architect
@gwenshap
https://www.confluent.io/blog/the-future-of-etl-isnt-what-it-used-to-be/
5. 5
So, to think about future of ETL, we want to look at…
• The future we imagined in the past
• The trends we are seeing now
• The future these trends indicate
• Practical suggestions
• Final thoughts
6. 6
Me
• Moving data around for 20 years
• Works at Confluent.
• Apache Kafka Committer
• Wrote a book or two
• Tweets a lot
7. 7
What was it like 15 years ago?
• Very few sources
• One Data Warehouse to rule them all
• One big “friendly” ETL tool
• Significant effort modeling, cleaning and
converging.
• “Although it is a huge leap to move from a
monthly or weekly process to a nightly one…”
9. 9
And 5 years ago?
• Many diverse sources.
• One big target
• Complete confusion of ETL tools
• “Schema on Read”
• … and still loading data around once a
day.
11. 11
What did we dream about then?
• Real-time.
We don’t want to wait even 5 minutes for insights.
• Governance.
Save us from the mess we created.
• Better tooling.
We imagined really fancy drag-and-drop.
12. But we missed some key ways
in which the world is changing
24. Old New
Read to offset & scan
Only sequential access. Order is guaranteed.
25. Partitions add scalability
Cluster
of
Machines
Producer 1 Producer 2 Producer 3
Messages are sent to
different partitions
Partitions are distributed
to different machines
37. 39
Promotion!
We want to send Platinum members
Who are looking at beach properties
An email about
Discount package in new Florida hotel
Lets look at an Example. Our marketing team has a small request:
44. Kafka Connect
+ CDC Connector
DB
KSQL> CREATE STREAM promotions AS
SELECT customer_email, customer_name
FROM pageview_events
LEFT OUTER JOIN customers
on c_id=pageview_cust_id
WHERE customer_class = ‘Prime’;
49. Resources and Next Steps
https://github.com/confluentinc/ksql
https://github.com/confluentinc/cp-demo
https://www.confluent.io/download/
https://slackpass.io/confluentcommunity
https://www.confluent.io/blog