
Streaming process with Kafka Connect and Kafka Streams


http://2017.datacon.tw/agenda/
Apache Kafka is a distributed streaming storage system with very high throughput and built-in fault tolerance, and it is seeing increasingly broad adoption in the BigData space. This talk introduces two further tools that ship with Kafka, Kafka Connect and Kafka Streams: their development model and how to apply them to ETL.



  1. Streaming process with Kafka Connect & Kafka Streams 鄭紹志@亦思科技 vito@is-land.com.tw 2017/09/30
  2. About me ● 鄭紹志 Vito ● 亦思科技, R&D Director ● R&D work on BigData-related topics ● Enjoys Java / Scala development
  3. High Level architecture (diagram): Data Source (database, filesystem, ...) → Kafka Connect (producer side) → Kafka (broker) → Kafka Streams → Kafka (broker) → Kafka Connect (consumer side) → Data Sink (database, filesystem, ...)
  4. Kafka Connect
  5. Kafka Connect use case: ETL ● Move data from X (source) into Kafka ○ storage systems, e.g. FileSystem, RDB, Cassandra, S3, ... ○ external applications, e.g. Twitter, Github ● Move data from Kafka into Y (sink) ○ storage systems, e.g. FileSystem, RDB, Cassandra, S3, ... ○ search, e.g. Elastic, Solr
  6. Kafka Connect overview ● Apache Kafka 0.9+ ● A common framework for Kafka connectors ● Standalone and distributed mode ● REST interface (distributed mode) ● Automatic offset management ● Distributed and scalable by default ● Lightweight transformations https://kafka.apache.org/documentation/#connect_overview
  7. Source & Sink (diagram): source connectors in Kafka Connect pull from a database, a file, etc. into Kafka (brokers); sink connectors push from Kafka out to a database, Elastic, and other targets
  8. Running Kafka Connect ● Standalone: $ bin/connect-standalone.sh config/connect-standalone.properties connector1.properties [connector2.properties]... ● Distributed: $ bin/connect-distributed.sh config/connect-distributed.properties
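A minimal sketch of the two property files named in the standalone command above, using Kafka's bundled FileStreamSource connector; the broker address, file path, and topic name are placeholders:

      # config/connect-standalone.properties: the worker itself
      bootstrap.servers=localhost:9092
      key.converter=org.apache.kafka.connect.storage.StringConverter
      value.converter=org.apache.kafka.connect.storage.StringConverter
      offset.storage.file.filename=/tmp/connect.offsets

      # connector1.properties: one connector instance
      name=local-file-source
      connector.class=FileStreamSource
      tasks.max=1
      file=/tmp/test.txt
      topic=connect-test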
  9. Connector ● The connector framework supports implementing custom integrations ● Apache Kafka ships with ○ FileStreamSourceConnector / FileStreamSinkConnector ● More connectors: https://www.confluent.io/product/connectors/
  10. Worker ● Worker: one unit of execution of Kafka Connect (a JVM process) ● Responsible for running connectors and their tasks ● Two types: standalone / distributed ● Automatic load balancing & failover
  11. Inside the worker (diagram): a worker is one JVM process; each connector (Conn-1, Conn-2, ...) runs as threads that fan out into tasks (Conn-1 Task 1, Task 2, ...), and each task consumes its assigned topic partitions. Max task config (per connector): tasks.max
  12. Distributed mode: worker cluster (diagram): with a single worker, Worker 1 runs Conn-1 and all of its tasks (Task 1-3); once Worker 2 joins, Task 2 and Task 3 are rebalanced onto it
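In distributed mode connectors are not passed on the command line; they are submitted to any worker through the REST interface (default port 8083). A sketch reusing the hypothetical file-source config from above:

      $ curl -X POST -H "Content-Type: application/json" http://localhost:8083/connectors \
          -d '{"name": "local-file-source",
               "config": {"connector.class": "FileStreamSource",
                          "tasks.max": "1",
                          "file": "/tmp/test.txt",
                          "topic": "connect-test"}}'

      $ curl http://localhost:8083/connectors                           # list connectors
      $ curl http://localhost:8083/connectors/local-file-source/status  # connector & task state
      $ curl -X DELETE http://localhost:8083/connectors/local-file-source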
  13. Kafka Streams
  14. Overview & Concept
  15. Streaming data ● Overloaded term ○ streaming data / data stream / event stream ... ○ event / message / log ● Common characteristics ○ Unbounded data (unlimited size): no fixed extent ○ Immutable: never modified once produced ○ Time ordered ○ Replayable: can be played back again https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
  16. Kafka Streams overview ● Included with Apache Kafka v0.10+, May 2016 ○ Not compatible with older Kafka brokers ● Just a Java library, no dedicated cluster required ● Realtime ● Highly scalable, fault-tolerant ● Stateful / stateless transformations
  17. Time ● Event time ● Ingestion time (log-append time) ● Processing time ● Message timestamps in Kafka ○ 0.10+: timestamps added to Kafka messages (KIP-32) ○ Depends on configuration ■ Event time → producer time → CreateTime ■ Ingestion time → broker time → LogAppendTime
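That choice is made by the broker setting log.message.timestamp.type, which can be overridden per topic with message.timestamp.type; a sketch (topic name from the demo later in this deck, ZooKeeper address a placeholder):

      # server.properties: broker-wide default, keep the producer's event time
      log.message.timestamp.type=CreateTime

      # per-topic override: stamp messages with the broker clock at append time
      $ bin/kafka-configs.sh --zookeeper localhost:2181 --alter \
          --entity-type topics --entity-name airport \
          --add-config message.timestamp.type=LogAppendTime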
  18. State & state stores ● Stateful transformations need to keep some state around ● StateStore: ○ as a cache: memory (HashMap) ○ for persistence: RocksDB https://stackoverflow.com/a/40114039/3155650
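A sketch against the 0.11-era Stores builder (the store name is hypothetical); the same chain yields either flavour mentioned above:

      import org.apache.kafka.streams.processor.StateStoreSupplier;
      import org.apache.kafka.streams.state.Stores;

      // persistent() backs the store with RocksDB on local disk;
      // inMemory() instead keeps a hash-map style cache
      StateStoreSupplier countStore = Stores.create("airport-count-store")
          .withStringKeys()
          .withLongValues()
          .persistent()
          .build();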
  19. Stream Processing Topology http://kafka.apache.org/0110/documentation/streams/core-concepts#streams_topology Building a topology: ● High level: DSL ● Low level: Processor API
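For contrast with the DSL used in the rest of this deck, a low-level sketch with the 0.11-era TopologyBuilder; StateCountProcessor is a hypothetical Processor implementation, and the topic names come from the demo:

      TopologyBuilder builder = new TopologyBuilder();
      builder.addSource("Source", "airport")                            // consume a topic
             .addProcessor("Count", StateCountProcessor::new, "Source") // custom processing node
             .addSink("Sink", "airport-counts", "Count");               // produce downstream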
  20. A Kafka Streams application (diagram): the running instances of one application form a cluster of their own, each instance keeping its local state store. https://kafka.apache.org/0110/documentation/streams/developer-guide#streams_developer-guide_interactive-queries_your_app
  21. Quick Sample (DSL)
  22. Question: count the number of airports in each state. Airport data for the US states (csv, from http://stat-computing.org/dataexpo/2009/) is pushed into the 'airport' topic: "iata","airport","city","state","country" / "L70","Agua Dulce Airpark","Agua Dulce","CA","USA" / "TPA","Tampa International","Tampa","FL","USA"
  23. Pipeline: input messages from the 'airport' topic → get the 'state' value (parse the csv message) → groupBy 'state' → count records → output messages to the 'airport-counts' topic
  24. // DSL implementation of the pipeline on the previous slide
      KStreamBuilder builder = new KStreamBuilder();
      KStream<String, String> textLines = builder.stream("airport");
      KTable<String, Long> airportCounts = textLines
          .mapValues(textLine -> {                    // extract the 'state' column
              String state;
              try {
                  state = csvParser.parseLine(textLine)[3];
              } catch (Exception e) {
                  state = null;
              }
              return state;
          })
          .groupBy((key, state) -> state)             // re-key by state
          .count("counts");                           // rolling count per state
      airportCounts.to(Serdes.String(), Serdes.Long(), "airport-counts");
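The slide shows only the topology; running it still needs a config and a KafkaStreams instance. A minimal driver sketch against the same 0.11-era API (application id and broker address are placeholders):

      import java.util.Properties;
      import org.apache.kafka.common.serialization.Serdes;
      import org.apache.kafka.streams.KafkaStreams;
      import org.apache.kafka.streams.StreamsConfig;

      Properties props = new Properties();
      props.put(StreamsConfig.APPLICATION_ID_CONFIG, "airport-count-app");
      props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
      props.put(StreamsConfig.KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());
      props.put(StreamsConfig.VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass().getName());

      KafkaStreams streams = new KafkaStreams(builder, props);  // 'builder' from the code above
      streams.start();
      Runtime.getRuntime().addShutdownHook(new Thread(streams::close));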
  25. Demo
  26. The same pipeline with its DSL types: input messages from the 'airport' topic → KStream<String, String> → get the 'state' value (parse the csv message) → KStream<String, String> → groupBy 'state' → KGroupedStream<String, String> → count records → KTable<String, Long> → output messages to the 'airport-counts' topic
  27. Viewed abstractly: 'airport' topic → create source stream → transform → transform → transform → write stream to Kafka ('airport-counts' topic)
  28. Result (Key / Value):
      AS 3
      CT 15
      VT 13
      IN 65
      MT 71
      ...
      $ bin/kafka-console-consumer.sh --topic airport-counts --from-beginning --property print.key=true --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
  29. Kafka Streams Application Reset ● Re-running a streaming computation requires resetting its state ● Local reset ○ call KafkaStreams#cleanUp() ● Global reset ○ $ bin/kafka-streams-application-reset.sh ○ resets the offsets of the input topics to zero ○ deletes all internal (auto-created) topics of the application ■ {application.id}-xxxx-repartition ■ {application.id}-xxxx-changelog
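A sketch of a global reset for the demo application (application id and topic are the placeholder names used in this deck; run the tool with --help, since the exact flag set varies by Kafka version):

      $ bin/kafka-streams-application-reset.sh \
          --application-id airport-count-app \
          --bootstrap-servers localhost:9092 \
          --input-topics airport
      # then call KafkaStreams#cleanUp() on each instance before restarting,
      # so the local state stores match the reset offsets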
  30. Kafka Streams DSL
  31. Kafka Streams DSL overview ● KStream, KTable, GlobalKTable ● Stateless transformations ● Stateful transformations ○ State ○ Aggregation ○ Join ○ Window
  32. KStream vs KTable: stream data (Person, City): |jack|Taipei| |vito|Hsinchu| |jack|Hsinchu|
  33. KStream vs KTable: stream data (Person, City): |jack|Taipei| |vito|Hsinchu| |jack|Hsinchu| ● KStream: at time1 "jack went to Taipei"; at time2 "jack went to Taipei, Hsinchu" (every event is kept) ● KTable: at time1 "jack lives in Taipei"; at time2 "jack lives in Hsinchu" (only the latest value per key)
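In code the difference is simply how the topic is read; a sketch with hypothetical topic and store names:

      // record stream: every (person, city) event is kept, "jack went to ..."
      KStream<String, String> moves = builder.stream("person-city");

      // changelog view: the latest value per key wins, "jack lives in ..."
      KTable<String, String> lives = builder.table("person-city", "person-city-store");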
  34. Converting between KStream and KTable ● KStream → KStream ● KTable → KTable ● KStream → KTable ● KTable → KStream http://kafka.apache.org/0110/documentation/streams/developer-guide#streams_duality
  35. Stateless transformations ● filter(), filterNot() ● map(), mapValues() ● flatMap(), flatMapValues() ● foreach(), peek() ● Changing the key triggers a re-partition!! (see the sketch below)
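The warning in the last bullet, sketched: mapValues() keeps the key and stays partition-local, while map() may rewrite the key, so any stateful operation downstream must shuffle records through an internal repartition topic:

      // no repartition: the key is untouched
      KStream<String, Integer> lengths = stream.mapValues(String::length);

      // possible repartition: the key changes, so a later groupByKey()/join reshuffles
      KStream<String, String> byState =
          stream.map((key, csvLine) -> KeyValue.pair(csvLine.split(",")[3], csvLine));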
  36. Stateful transformations ● Join ● Aggregation ● Window
  37. Join operations ● Key-based ● Require co-partitioning of the input data https://docs.confluent.io/3.3.0/streams/developer-guide.html#joining
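A minimal stream-table join sketch (topic and store names are hypothetical); co-partitioning means both inputs are keyed the same way and have the same partition count:

      KStream<String, String> orders = builder.stream("orders");      // keyed by userId
      KTable<String, String> users = builder.table("users", "users-store");

      // for each order, look up the current user record with the same key
      KStream<String, String> enriched =
          orders.join(users, (order, user) -> order + " by " + user);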
  38. Aggregation operations ● Key-based ● count() ● reduce() ● aggregate() ● Two types ○ latest (rolling) aggregation ○ windowed aggregation
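Both types in one sketch (store names are hypothetical): without a window the result is a rolling, continuously updated KTable; wrapping the same count in a window (next slides) yields one result per time bucket:

      // latest (rolling) aggregation over the whole stream so far
      KTable<String, Long> totals = stream.groupByKey().count("total-store");

      // reduce: combine values per key, e.g. a running concatenation
      KTable<String, String> merged =
          stream.groupByKey().reduce((agg, v) -> agg + "," + v, "merge-store");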
  39. Window ● Processing scoped to a time interval ● Tumbling window ● Hopping window ● Sliding window ● Session window
  40. Tumbling Window ● Window size: 3 mins ● Window move: 3 mins (advance interval) ● (diagram: back-to-back windows over a 0-12 min timeline: 0-3, 3-6, 6-9, 9-12)
      stream.map( /* do something */ )
            .groupByKey()
            .count(TimeWindows.of(3 * 60 * 1000L), "store");  // 3-minute windows
  41. Hopping Window ● Window size: 3 mins ● Window move: 2 mins (advance interval) ● (diagram: overlapping windows over a 0-12 min timeline: 0-3, 2-5, 4-7, ...)
      stream.map( /* do something */ )
            .groupByKey()
            .count(TimeWindows.of(3 * 60 * 1000L)     // 3-minute windows
                       .advanceBy(2 * 60 * 1000L),    // hop forward 2 minutes
                   "store");
  42. Sliding window ● moves on every record ● used only for join operations (see the sketch below)
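Since sliding windows surface only through joins in this API generation, a sketch (the two streams and the five-minute bound are hypothetical): each record is matched against records of the other stream whose timestamps fall within the window:

      // join records from the two streams that occurred within +/- 5 minutes
      KStream<String, String> matched = left.join(
          right,
          (l, r) -> l + "/" + r,
          JoinWindows.of(5 * 60 * 1000L));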
  43. Session window ● (diagram: records on a 0-12 min timeline grouped into sessions split by inactivity gaps)
      final long INACTIVITY_GAP = TimeUnit.MINUTES.toMillis(6);  // inactivity gap in milliseconds
      stream.map( /* do something */ )
            .groupByKey()
            .count(SessionWindows.with(INACTIVITY_GAP), "store");
  44. Parallelism Model https://kafka.apache.org/documentation/streams/architecture ● Partition: topic partitions / stream partitions ● One thread executes multiple StreamTasks ● The number of partitions determines the number of StreamTasks ● A partition is assigned to exactly one StreamTask ● Each StreamTask executes one topology ● StreamsConfig: num.stream.threads
  45. Parallelism Model (architecture diagram) https://kafka.apache.org/documentation/streams/architecture
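The thread count from slide 44 is a single config key; a sketch continuing the driver properties from the slide 24 sketch (four threads is an arbitrary example), noting that threads beyond the number of StreamTasks, i.e. input partitions, simply sit idle:

      // run four processing threads in this instance; StreamTasks are spread
      // across all threads of all instances sharing the same application.id
      props.put(StreamsConfig.NUM_STREAM_THREADS_CONFIG, 4);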
  46. Thank you !
