Kafka for DBAs


Explaining to DBAs what Apache Kafka is and why they should care.

Published in: Software

  1. © Cloudera, Inc. All rights reserved. Apache Kafka for Oracle DBAs
     • What is Kafka
     • Why should you care
     • How to learn Kafka
  2. About me
     • Oracle DBA
     • Turned Oracle Consultant
     • Turned Hadoop Solutions Architect
     • Turned Developer
     Committer on Apache Sqoop
     Contributor to Apache Kafka and Apache Flume
  3. An Optical Illusion: Apache Kafka is publish-subscribe messaging rethought as a distributed commit log.
  4. We’ll talk about:
     • Redo log as an abstraction
     • How redo logs are useful
     • Pub-sub message queues
     • How message queues are useful
     • What exactly is Kafka
     • How do people use Kafka
     • Where can you learn more
  5. Redo Log: The most crucial structure for recovery operations … store all changes made to the database as they occur.
  6. Important Point: The redo log is the only reliable source of information about the current state of the database.
  7. The redo log is used to:
     • Recover a consistent state of a database
     • Replicate the database (Data Guard, Streams, GoldenGate…)
     • Update materialized logs (well, it’s a log anyway)
     If you look far enough back into the archive logs, you can reconstruct the entire database.
  8. What if… you built an entire data storage system that is just a transaction log?
  9. Kafka can log:
     • Transactions from any database
     • Clicks from websites
     • Application logs (ERROR, WARN, INFO…)
     • Metrics: CPU, memory, I/O
     • Audit events
     And any system can read those logs: Hadoop, alerts, dashboards, databases.
  10. Only one thing is missing
     Q: How do you query a redo log?
     A: Not very efficiently.
     Sometimes we just need the events, with no need to query. Other times, we need to load the results into a database. While messages are in transit, we can do all kinds of transformations.
  11. (Image slide, no text.)
  12. Publish-Subscribe Message Queue
  13. Raise your hand if this sounds familiar: “My next project was to get a working Hadoop setup… Having little experience in this area, we naturally budgeted a few weeks for getting data in and out, and the rest of our time for implementing fancy algorithms.” -- Jay Kreps, Kafka PMC
  14. Data pipelines start like this. (Diagram: Client, Source.)
  15. Then we reuse them. (Diagram: four Clients, Source.)
  16. Then we add consumers to the existing sources. (Diagram: four Clients, Backend, Another Backend.)
  17. Then it starts to look like this. (Diagram: four Clients, Backend, three more Backends.)
  18. With maybe some of this. (Diagram: four Clients, Backend, three more Backends.)
  19. Queues decouple systems: both statically and in time.
  20. This is where we are trying to get: Kafka decouples data pipelines. (Diagram: four Source Systems acting as Producers; Kafka Brokers in the middle; Consumers: Hadoop, Security Systems, Real-time monitoring, Data Warehouse.)
  21. Important notes:
     • Producers and Consumers don’t need to know about each other
     • Performance issues on Consumers don’t impact Producers
     • Consumers are protected from herds of Producers
     • Lots of flexibility in handling load
     • Messages are available for anyone: lots of new use cases, monitoring, audit, troubleshooting
     http://www.slideshare.net/gwenshap/queues-pools-caches
  22. So… What is Kafka?
  23. Kafka provides a fast, distributed, highly scalable, highly available, publish-subscribe messaging system. In turn this solves part of a much harder problem: communication and integration between components of large software systems.
  24. The Basics
     • Messages are organized into topics
     • Producers push messages
     • Consumers pull messages
     • Kafka runs in a cluster. Nodes are called brokers
  25. Topics, Partitions and Logs
  26. Each partition is a log
  27. Each Broker has many partitions. (Diagram: Partition 0, Partition 1, and Partition 2, each shown three times across the brokers.)
  28. Producers load balance between partitions. (Diagram: a Client writing to Partition 0, Partition 1, and Partition 2 across the brokers.)
  29. Producers load balance between partitions. (Diagram: a Client writing to Partition 0, Partition 1, and Partition 2 across the brokers.)
  30. Consumers
  31. Why is Kafka better than other MQs?
     • Can keep data forever
     • Scales very well: high throughput, low latency, lots of storage
     • Scales to any number of consumers
  32. How do people use Kafka?
     • As a message bus
     • As a buffer for replication systems (like Advanced Queuing in Streams)
     • As a reliable feed for event processing
     • As a buffer for event processing
     • To decouple apps from the database (both OLTP and DWH)
  33. Need More Kafka?
     • https://kafka.apache.org/documentation.html
     • My video tutorial: http://shop.oreilly.com/product/0636920038603.do
     • http://www.michael-noll.com/blog/2014/08/18/apache-kafka-training-deck-and-tutorial/
     • Try with Cloudera Manager: http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/topics/kafka_install.html
  34. One more thing...
  35. Schema is a MUST HAVE for data integration
  36. Kafka only stores bytes. So where’s the schema?
     • People go around asking each other: “So, what does the 5th field of the messages in topic Blah contain?”
     • There’s utility code for reading/writing messages that everyone reuses
     • Schema embedded in the message
     • A centralized repository for schemas
     • Each message has a Schema ID
     • Each topic has a Schema ID
  37. I ♥ Avro
     • Define Schema
     • Generate code for objects
     • Serialize / Deserialize into Bytes or JSON
     • Embed schema in files / records… or not
     • Support for our favorite languages… Except Go.
     • Schema Evolution
     • Add and remove fields without breaking anything
  38. Replicating from Oracle to Kafka? Don’t lose the schema!
  39. Schemas are Agile
     • Leave out MySQL and your favorite DBA for a second
     • Schemas allow adding readers and writers easily
     • Schemas allow modifying readers and writers independently
     • Schemas can evolve as the system grows
     • Allows validating data soon after it’s written
     • No need to throw away data that doesn’t fit!
  40. (No text.)
  41. Thank you
     @gwenshap
     gshapira@cloudera.com
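
The ideas on slides 24 to 30 (each partition is a log, producers spread messages across partitions, consumers pull) can be sketched in a few lines of plain Python. This is a toy in-memory model, not Kafka's client API: the `Topic` class, its `send` and `poll` methods, and the CRC32-based key hashing are all assumptions made for illustration, and the real producer may choose partitions differently.

```python
import zlib
from typing import List, Optional, Tuple


class Topic:
    """Toy topic (hypothetical): a fixed number of append-only partition logs."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[bytes]] = [[] for _ in range(num_partitions)]
        self._next = 0  # round-robin cursor for messages without a key

    def send(self, value: bytes, key: Optional[bytes] = None) -> Tuple[int, int]:
        """Append a message and return (partition, offset)."""
        if key is None:
            # No key: spread messages evenly across partitions.
            partition = self._next % len(self.partitions)
            self._next += 1
        else:
            # Same key always lands in the same partition, preserving per-key order.
            partition = zlib.crc32(key) % len(self.partitions)
        log = self.partitions[partition]
        log.append(value)
        return partition, len(log) - 1

    def poll(self, partition: int, offset: int, max_messages: int = 10) -> List[bytes]:
        """Return up to max_messages starting at offset; reading deletes nothing."""
        return self.partitions[partition][offset:offset + max_messages]


topic = Topic(num_partitions=3)
for i in range(6):
    topic.send(f"click-{i}".encode())  # unkeyed, so distributed round-robin

p1, first = topic.send(b"login", key=b"user-42")
p2, second = topic.send(b"logout", key=b"user-42")

# Two independent consumers each keep their own offset into partition 0.
dashboard_offset, hadoop_offset = 0, 0
batch = topic.poll(0, dashboard_offset)
dashboard_offset += len(batch)         # the dashboard consumer moves forward
replay = topic.poll(0, hadoop_offset)  # the Hadoop consumer still sees everything
```

Because a read only advances the reader's own offset, a slow or late consumer does not affect producers or other consumers, which is the decoupling described on slide 21.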

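
Slide 36 lists "a centralized repository for schemas" with "each message has a Schema ID" as one answer to "where's the schema?". A minimal sketch of that pattern follows, assuming a hypothetical in-memory `SchemaRegistry`, a 4-byte big-endian ID prefix, and JSON as a stand-in for Avro encoding. None of these details come from the slides.

```python
import json
import struct
from typing import Dict, List


class SchemaRegistry:
    """Hypothetical in-memory stand-in for a centralized schema repository."""

    def __init__(self) -> None:
        self._schemas: Dict[int, List[str]] = {}

    def register(self, fields: List[str]) -> int:
        """Store an ordered list of field names and return its new ID."""
        schema_id = len(self._schemas) + 1
        self._schemas[schema_id] = fields
        return schema_id

    def lookup(self, schema_id: int) -> List[str]:
        return self._schemas[schema_id]


def encode(schema_id: int, record: dict, registry: SchemaRegistry) -> bytes:
    """Prefix the payload with the schema ID; the broker only ever sees bytes."""
    values = [record[name] for name in registry.lookup(schema_id)]
    return struct.pack(">I", schema_id) + json.dumps(values).encode("utf-8")


def decode(message: bytes, registry: SchemaRegistry) -> dict:
    """Read the schema ID, fetch the field names, and rebuild the record."""
    (schema_id,) = struct.unpack(">I", message[:4])
    values = json.loads(message[4:].decode("utf-8"))
    return dict(zip(registry.lookup(schema_id), values))


registry = SchemaRegistry()
orders_v1 = registry.register(["order_id", "customer", "amount"])
message = encode(orders_v1, {"order_id": 7, "customer": "acme", "amount": 19.5}, registry)
record = decode(message, registry)
```

With the ID carried in every message, nobody has to ask what the 5th field of a topic contains: any reader can resolve the field names from the repository.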