Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Developing Real-Time Data Pipelines with Apache Kafka

5,012 views

Published on

Recorded at SpringOne2GX 2015
Presenter: Joe Stein
Big Data Track

Developing Real-Time Data Pipelines with Apache Kafka http://kafka.apache.org/ is an introduction for developers about why and how to use Apache Kafka. Apache Kafka is a publish-subscribe messaging system rethought of as a distributed commit log. Kafka is designed to allow a single cluster to serve as the central data backbone. A single Kafka broker can handle hundreds of megabytes of reads and writes per second from thousands of clients. It can be elastically and transparently expanded without downtime. Data streams are partitioned and spread over a cluster of machines to allow data streams larger than the capability of any single machine and to allow clusters of coordinated consumers. Messages are persisted on disk and replicated within the cluster to prevent data loss. Each broker can handle terabytes of messages. For the Spring user, Spring Integration Kafka and Spring XD provide integration with Apache Kafka.

Published in: Technology
  • The Bulimia Recovery Program, We Recovered, You CAN TOO! ■■■ http://scamcb.com/bulimiarec/pdf
       Reply 
    Are you sure you want to  Yes  No
    Your message goes here

Developing Real-Time Data Pipelines with Apache Kafka

  1. 1. SPRINGONE2GX WASHINGTON, DC Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Developing Real-Time Data Pipelines with Apache Kafka Joe Stein @allthingshadoop
  2. 2. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ CEO of Elodina http://www.elodina.net/ a big data as a service platform built on top open source software. The Elodina platform enables customers to analyze data streams and programmatically react to the results in real-time. We solve today’s data analytics needs by providing the tools and support necessary to utilize open source technologies. As users, contributors and committers, Elodina also provides support for frameworks that run on Mesos including Apache Kafka, Exhibitor (Zookeeper), Apache Storm, Apache Cassandra and a whole lot more! Apache Kafka Committer & PMC Member LinkedIn: http://linkedin.com/in/charmalloc Twitter : @allthingshadoop whoami 2
  3. 3. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Contents •  Introduction To Kafka •  Overview •  Topics, Partitions & Segments •  Data Durability •  Replication •  Producers •  Consumers •  Performance •  Integration •  Quick Start •  Operations 3 •  Designs •  Distributed RPC o  Request o  Process o  Response •  Storage & Analytics o  Stream o  Transform o  Analyze o  Store o  Search
  4. 4. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Apache Kafka 4
  5. 5. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Apache Kafka Apache Kafka was first open sourced by LinkedIn in 2011 Papers ●  Building a Replicated Logging System with Apache Kafka http://www.vldb.org/pvldb/vol8/p1654-wang.pdf ●  Kafka: A Distributed Messaging System for Log Processing http://research.microsoft.com/en-us/um/people/srikanth/netdb11/ netdb11papers/netdb11-final12.pdf ●  Building LinkedIn’s Real-time Activity Data Pipeline http://sites.computer.org/debull/A12june/pipeline.pdf ●  The Log: What Every Software Engineer Should Know About Real-time Data's Unifying Abstraction http:// engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying http://kafka.apache.org/ 5
  6. 6. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ How Big Data Usually Starts 6
  7. 7. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ More Big Data! 7
  8. 8. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Ah! 8
  9. 9. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ eesh 9
  10. 10. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Kafka de-couples data pipelines 10
  11. 11. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Distributed Replicated Log Read and write In real time As much as you want As fast as your network can go 11
  12. 12. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Topics and Partitions 12
  13. 13. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Log Segments 13
  14. 14. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Distributed Replicated Log 14
  15. 15. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Data Durability 15
  16. 16. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Replication 16
  17. 17. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Producers 17
  18. 18. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Consumers 18
  19. 19. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Consumer Failover 19
  20. 20. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Producer Performance 20 https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines
  21. 21. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Consumer Performance http://kafka.apache.org/documentation.html#maximizingefficiency 21
  22. 22. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Client Libraries Community Clients https://cwiki.apache.org/confluence/display/KAFKA/Clients ●  Go (aka golang) Pure Go implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported. ●  Python - Pure Python implementation with full protocol support. Consumer and Producer implementations included, GZIP and Snappy compression supported. ●  C - High performance C library with full protocol support ●  Ruby - Pure Ruby, Consumer and Producer implementations included, GZIP and Snappy compression supported. Ruby 1.9.3 and up (CI runs MRI 2. ●  Clojure - Clojure DSL for the Kafka API ●  JavaScript (NodeJS) - NodeJS client in a pure JavaScript implementation Wire Protocol Developer's Guide https://cwiki.apache.org/confluence/display/KAFKA/A+Guide+To+The+Kafka+Protocol 22
  23. 23. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Spring Integration Good blog about it https://spring.io/blog/2015/04/15/using-apache-kafka-for-integration-and-data- processing-pipelines-with-spring Kafka Integration Source https://github.com/spring-projects/spring-integration-kafka Spring XD samples https://github.com/spring-projects/spring-xd-samples/tree/master/kafka-source 23
  24. 24. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Quick Start https://kafka.apache.org/documentation.html#quickstart Download the 0.8.2.2 release and un-tar it. > tar -xzf kafka_2.10-0.8.2.2.tgz > cd kafka_2.10-0.8.2.2 (use at least four terminal windows) > bin/zookeeper-server-start.sh config/zookeeper.properties > bin/kafka-server-start.sh config/server.properties > bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test This is a message This is another message > bin/kafka-console-consumer.sh --zookeeper localhost:2181 --topic test --from-beginning This is a message This is another message 24
  25. 25. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Operationalizing Kafka https://kafka.apache.org/documentation.html#basic_ops Basic Kafka Operations ●  Adding and removing topics ●  Modifying topics ●  Graceful shutdown ●  Balancing leadership ●  Checking consumer position ●  Mirroring data between clusters ●  Expanding your cluster ●  Decommissioning brokers ●  Increasing replication factor 25
  26. 26. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Running on Mesos 26
  27. 27. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Static Partitioning 27
  28. 28. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Scaling is manual (even if orchestrated) 28
  29. 29. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Static failures require manual intervention 29
  30. 30. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Application Elasticity 30
  31. 31. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ An operating system for your data center 31
  32. 32. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Everything goes on Mesos 32
  33. 33. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Kafka on Mesos https://github.com/mesos/kafka ●  smart broker.id assignment. ●  preservation of broker placement (through constraints and/or new features). ●  ability to-do configuration changes. ●  rolling restarts (for things like configuration changes). ●  scaling the cluster up and down with automatic, programmatic and manual options. ●  smart partition assignment via constraints visa vi roles, resources and attributes. 33
  34. 34. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Kafka on Mesos Scheduler ●  Provides the operational automation for a Kafka Cluster. ●  Manages the changes to the broker's configuration. ●  Exposes a REST API for the CLI to use or any other client. ●  Runs on Marathon for high availability. Executor ●  The executor interacts with the kafka broker as an intermediary to the scheduler 34
  35. 35. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ REST API & CLI ●  scheduler - starts the scheduler. ●  add - adds one more more brokers to the cluster. ●  update - changes resources, constraints or broker properties one or more brokers. ●  remove - take a broker out of the cluster. ●  start - starts a broker up. ●  stop - this can either a graceful shutdown or will force kill it (./kafka-mesos.sh help stop) ●  rebalance - allows you to rebalance a cluster either by selecting the brokers or topics to rebalance. Manual assignment is still possible using the Apache Kafka project tools. Rebalance can also change the replication factor on a topic. ●  help - ./kafka-mesos.sh help || ./kafka-mesos.sh help {command} 35
  36. 36. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Launch 20 brokers in seconds 36
  37. 37. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Kafka 0.9 KIP (Kafka Improvement Process) •  https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals New Consumer •  https://cwiki.apache.org/confluence/display/KAFKA/Kafka+0.9+Consumer+Rewrite+Design Security •  https://cwiki.apache.org/confluence/display/KAFKA/Security •  https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=51809888 JIRA •  https://issues.apache.org/jira/browse/KAFKA/fixforversion/12328745/? selectedTab=com.atlassian.jira.jira-projects-plugin:version-issues-panel 37
  38. 38. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Distributed RPC 38
  39. 39. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Reference Architecture 39
  40. 40. Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/ Questions? http://www.elodina.net 40

×