Kafka: Seattle Data Science and Data Engineering Meetup

Apache Kafka is a publish-subscribe messaging service designed as a distributed, partitioned, replicated commit log. This meetup gives a gentle introduction to Kafka and also discusses some of its internals and usage patterns.

Published in: Data & Analytics

  1. Seattle Data Science and Data Engineering Meetup. Abhishek Goswami, 12/14/2016. abgoswam@gmail.com, https://www.linkedin.com/in/abgoswam
  2. Table of Contents
     ● Introduction
       ○ Motivation
       ○ What is Kafka?
       ○ Characteristics
       ○ APIs
       ○ Demos
     ● Internals
       ○ Logs
       ○ Logs in Distributed Systems
       ○ Design Fundamentals
       ○ ZooKeeper Dependency
       ○ Replication
       ○ Source Code
     ● Summary, Q&A
  3. ● Introduction ○ Motivation ○ What is Kafka? ○ Characteristics ○ APIs ○ Demos ● Internals ● Summary, Q&A
  4. Introduction: Motivation. Data integration.
  5. Introduction: What is Kafka?
     A distributed, partitioned, replicated commit-log service. It provides the functionality of a messaging system, but with a unique design.
     Competitive landscape:
     ● AWS Kinesis, Azure Event Hubs
     Use cases:
     ● Messaging
     ● Website Activity Tracking
     ● Logging
     ● Stream Processing
  6. Introduction: Characteristics
     ● Scalability of a filesystem
       ○ High throughput
       ○ Many TB per server
     ● Guarantees of a database
       ○ Messages strictly ordered (within a partition)
       ○ All data persistent
     ● Distributed by default
       ○ Replication
       ○ Partitioning
  7. Introduction: APIs
     Four core APIs:
     ● Producer API: allows applications to send streams of data to topics in the Kafka cluster.
     ● Consumer API: allows applications to read streams of data from topics in the Kafka cluster.
     ● Connect API: allows implementing connectors that continually pull from some source system or application into Kafka, or push from Kafka into some sink system or application.
     ● Streams API: stream processing as a generalization of batch processing to a real-time setting with low-latency requirements.
  8. Introduction: Demos
  9. ● Introduction ● Internals ○ Log ○ Logs in Distributed Systems ○ Design Fundamentals ○ ZooKeeper Dependency ○ Replication ○ Source Code ● Summary, Q&A
  10. Internals: Log
  11. Internals: Logs in Distributed Systems
  12. Internals: Logs in Distributed Systems
  13. Internals: Design Fundamentals
  14. Internals: ZooKeeper Dependency
      Kafka requires ZooKeeper. Kafka uses ZooKeeper for things like:
      ● Cluster membership
      ● Electing a controller
      ● Topic configuration (which topics exist, who the leader is, etc.)
  15. Internals: Replication
  16. Internals: Source Code. GitHub repo: https://github.com/apache/kafka
  17. ● Introduction ● Internals ● Summary, Q&A
  18. Summary. Kafka solves data integration needs. It is a distributed, partitioned, replicated commit-log service.
  19. Q&A
      References:
      1. Simplifying data pipelines with Apache Kafka
      2. Learning Apache Kafka, 2nd Edition
      3. https://www.tutorialspoint.com/apache_kafka/index.htm
      4. https://www.infoq.com/articles/apache-kafka
      5. https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying
      abgoswam@gmail.com, https://www.linkedin.com/in/abgoswam
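
The commit-log model described in the slides (a topic split into partitions, each an append-only ordered log that producers write to and consumers read from) can be illustrated with a small sketch. This is a toy in-memory model, not Kafka's client API: the names `Topic`, `Consumer`, `partition_for`, `append`, `read`, and `poll` are hypothetical, and the CRC32 hash is only a stand-in for a real partitioner. It assumes a single process and omits replication, persistence, and networking entirely.

```python
from collections import defaultdict
from zlib import crc32


class Topic:
    """Toy in-memory topic: a fixed number of append-only partition logs."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Deterministic hash, so the same key always maps to the same partition.
        # (Assumption for illustration; a real partitioner may hash differently.)
        return crc32(key.encode("utf-8")) % len(self.partitions)

    def append(self, key, value):
        """Producer side: append a record, return (partition, offset)."""
        p = self.partition_for(key)
        log = self.partitions[p]
        log.append((key, value))
        return p, len(log) - 1

    def read(self, partition, offset, max_records=10):
        """Consumer side: read from an offset; the log itself is never modified."""
        return self.partitions[partition][offset:offset + max_records]


class Consumer:
    """Toy consumer that tracks its own next offset for each partition."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)

    def poll(self):
        records = []
        for p in range(len(self.topic.partitions)):
            batch = self.topic.read(p, self.offsets[p])
            self.offsets[p] += len(batch)  # advance only this consumer's position
            records.extend(batch)
        return records


topic = Topic(num_partitions=3)
for i in range(4):
    topic.append("user-a", f"click-{i}")
topic.append("user-b", "login")

first = Consumer(topic)
second = Consumer(topic)
seen = first.poll()
user_a_events = [value for key, value in seen if key == "user-a"]
print(user_a_events)  # ['click-0', 'click-1', 'click-2', 'click-3']
```

Two points from the slides show up here. Records with the same key land in the same partition, so their relative order is preserved, while no ordering is implied across partitions. And because reading does not remove anything from the log, `second` can still read all five records after `first` has consumed them; each consumer simply keeps its own offsets.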
