Introduction to Apache Kafka
And Real-Time ETL
for DBAs and others who are interested in new ways
of working with relational databases
About Myself
• Gwen Shapira – System Architect @Confluent
• Committer @Apache Kafka, Apache Sqoop
• Author of “Hadoop Application Architectures”, “Kafka – The Definitive Guide”
• Previously:
  • Software Engineer @ Cloudera
  • Oracle ACE Director
  • Senior Consultant @ Pythian
  • DBA @ Mercury Interactive
• Find me:
  • gwen@confluent.io
  • @gwenshap
There’s a Book on That!
Apache Kafka is
publish-subscribe messaging
rethought as a distributed commit log…
…turned into a stream data platform
An Optical Illusion
We’ll talk about:
• Write-ahead Logs
• So What is Kafka?
• Awesome use-cases for Kafka
• Data streams and real-time ETL
• Where can you learn more
Write-Ahead Logging (WAL)
a standard method for ensuring data integrity… changes to data files… must be written only after those changes have been logged… in the event of a crash we will be able to recover the database using the log.
Important Point
The write-ahead log is the only reliable source of information about the current state of the database.
WAL is used for
• Recovering a database to a consistent state
• Replicating the database (Streaming Replication, Hot Standby)
If you look far enough into archived logs, you can reconstruct the entire database.
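For a hands-on look at that claim, PostgreSQL ships a WAL inspection tool (a sketch; the tool is named pg_waldump from PostgreSQL 10 on and pg_xlogdump before that, and the segment file name below is a placeholder for whatever sits in your pg_wal directory):

    # Dump the change records in one WAL segment (placeholder segment name)
    pg_waldump -p $PGDATA/pg_wal 000000010000000000000001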
That’s nice, but what is Kafka?
Kafka provides a fast, distributed, highly scalable,
highly available, publish-subscribe messaging system.
Based on the tried and true log structure.
In turn this solves part of a much harder problem:
communication and integration between components of large software systems.
The Basics
• Messages are organized into topics
• Producers push messages
• Consumers pull messages
• Kafka runs in a cluster. Nodes are called brokers
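For example, topics are created with the CLI tools that ship with Kafka (a sketch using the ZooKeeper-based syntax of this deck’s era; newer brokers take --bootstrap-server instead, and the topic name is made up):

    bin/kafka-topics.sh --create --zookeeper localhost:2181 \
        --replication-factor 3 --partitions 3 --topic clicks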
Topics, Partitions and Logs
Each partition is a log.
Each broker has many partitions.
[Diagram: three brokers, each hosting a mix of Partitions 0, 1, and 2 of a topic]
Producers load balance between partitions
[Diagram: a producing client balancing writes across Partitions 0, 1, and 2 spread over three brokers]
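A minimal Java producer sketch (broker address and topic name are made up; the key drives the load balancing: records with the same key land in the same partition, keyless records are spread around):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.Producer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class ClickProducer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092"); // one reachable broker is enough to bootstrap
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            Producer<String, String> producer = new KafkaProducer<>(props);
            // Same key -> same partition; null key -> spread across partitions
            producer.send(new ProducerRecord<>("clicks", "user42", "clicked /checkout"));
            producer.close();
        }
    }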
Consumers
[Diagram: a Kafka cluster with one topic of three partitions (A, B, C, each a file) read by Consumer Groups X and Y; offsets are kept per consumer group, and order is retained within a partition but not across partitions]
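A matching Java consumer sketch (names are made up; all consumers sharing a group.id divide the topic’s partitions among themselves, and offsets are tracked per group):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class ClickConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("group.id", "group-x"); // offsets are kept per consumer group
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("clicks"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(100); // timeout in ms
                for (ConsumerRecord<String, String> record : records)
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
            }
        }
    }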
Kafka “Magic” – Why is it so fast?
• 250M events per sec on one node at 3ms latency
• Scales to any number of consumers
• Stores data for a set amount of time – without tracking who read what data
• Replicates – but no need to sync to disk
• Zero-copy writes from memory / disk to network
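“Stores data for a set amount of time” is just broker configuration; a sketch of the relevant server.properties knobs (the values here are illustrative, not recommendations):

    # Keep messages for 7 days, whether or not anyone has read them
    log.retention.hours=168
    # Roll to a new log segment once the current one reaches 1 GB
    log.segment.bytes=1073741824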
How do people use Kafka?
• As a message bus
• As a buffer for replication systems
• As a reliable feed for event processing
• As a buffer for event processing
• To decouple apps from databases
But really,
how do they use Kafka?
Raise your hand if this sounds familiar
“My next project was to get a working Hadoop setup…
Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy algorithms.”
-- Jay Kreps, Kafka PMC
Data pipelines start like this.
[Diagram: a single client feeding a single source]
Then we reuse them
[Diagram: many clients feeding the same source]
Then we add consumers to the existing sources
[Diagram: many clients feeding a backend, with another backend reading the same data]
Then it starts to look like this
[Diagram: many clients, each wired point-to-point to several backends]
With maybe some of this
[Diagram: the same tangle, with still more backends and cross-connections]
Queues decouple systems:
adding new systems doesn’t require changing existing systems.
This is where we are trying to get
Kafka decouples data pipelines.
[Diagram: source systems → producers → Kafka brokers → consumers → Hadoop, security systems, real-time monitoring, and the data warehouse]
Important notes:
• Producers and consumers don’t need to know about each other
• Performance issues on consumers don’t impact producers
• Consumers are protected from herds of producers
• Lots of flexibility in handling load
• Messages are available for anyone – lots of new use cases: monitoring, audit, troubleshooting
http://www.slideshare.net/gwenshap/queues-pools-caches
My Favorite Use Cases
• Shops consume inventory updates
• Clicking around an online shop? Your clicks go to Kafka and recommendations come back.
• Flagging credit card transactions as fraudulent
• Flagging game interactions as abuse
• Least favorite: surge pricing in Uber
• Huge list of users at kafka.apache.org
Got it!
But what about real-time ETL?
Remember This?
Kafka is smack in the middle of all data pipelines.
[Diagram: source systems → producers → Kafka brokers → consumers → Hadoop, security systems, real-time monitoring, and the data warehouse]
If data flies into Kafka in real time,
why wait 24h before pulling it into a DWH?
Why does Kafka make real-time ETL better?
• It can integrate with any data source
  • RDBMS, NoSQL, applications, web applications, logs
• Consumers can be real-time, but they don’t have to be
• Reading and writing to/from Kafka is cheap
  • So this is a great place to store intermediate state
• You can fix mistakes by rereading some of the data again (see the sketch below)
  • Same data, in the same order
• Adding more pipelines / aggregations has no impact on source systems = low risk
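Rereading the data can be as simple as rewinding a consumer’s offsets; a sketch against the current Java client, reusing the consumer from the earlier example:

    // Rewind to the start of every partition this consumer was assigned
    consumer.poll(0);                                // join the group so partitions get assigned
    consumer.seekToBeginning(consumer.assignment()); // the next poll() replays from the beginning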
It is all valuable data
[Diagram: raw data flowing through cleaning, enrichment, filtering, and aggregation steps, feeding dashboards, reports, data scientists, and alerts]
OK, but how does my data get into Kafka?
• Producers
• Log4J
• REST Proxy
• Bottled Water
• Kafka Connect and its connector ecosystem
• Other ecosystem tools
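As one concrete example, a Kafka Connect JDBC source (from the Confluent connector ecosystem) can stream new rows out of PostgreSQL; a configuration sketch with placeholder connection details:

    name=shop-db-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:postgresql://dbhost:5432/shop
    # Publish each table to a topic named shop-<table>, picking up rows as the id column grows
    mode=incrementing
    incrementing.column.name=id
    topic.prefix=shop-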
But wait, how do we process the data?
• However you want:
  • You just consume data, modify it, and produce it back (see the sketch after this list)
• Built into Kafka:
  • Kprocessor
  • Kstream
• Popular choices:
  • Storm
  • Spark Streaming
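The “consume, modify, produce back” option in its plainest form, reusing the producer and consumer from the earlier sketches (topic names are made up, and the lowercasing stands in for real transformation logic; the Kprocessor/Kstream work eventually shipped as the Kafka Streams API):

    // Read raw events, apply a stand-in transformation, write them back to a clean topic
    consumer.subscribe(Collections.singletonList("clicks"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(100);
        for (ConsumerRecord<String, String> record : records) {
            String cleaned = record.value().trim().toLowerCase(); // stand-in for real enrichment
            producer.send(new ProducerRecord<>("clicks-clean", record.key(), cleaned));
        }
    }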
One more thing...
Schema is a MUST HAVE for
data integration
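The deck doesn’t prescribe a schema technology, but Avro is the usual choice around Kafka (Bottled Water and Confluent’s Schema Registry both speak it); a minimal schema for the click events used in the earlier sketches might look like:

    {
      "type": "record",
      "name": "Click",
      "namespace": "com.example.shop",
      "fields": [
        {"name": "user_id",    "type": "string"},
        {"name": "page",       "type": "string"},
        {"name": "clicked_at", "type": "long"}
      ]
    }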
Need More Kafka?
• https://kafka.apache.org/documentation.html
• My video tutorial: http://shop.oreilly.com/product/0636920038603.do
• http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
• Our website: http://confluent.io
• Oracle guide to real-time ETL: http://www.oracle.com/technetwork/middleware/data-integrator/overview/best-practices-for-realtime-data-wa-132882.pdf