Data Pipelines Made Simple with Apache Kafka

Presentation by Ewen Cheslack-Postava, Engineer, Apache Kafka Committer, Confluent
In streaming workloads, data produced at the source is often not useful further down the pipeline, or it requires some transformation to get it into usable shape. Similarly, where sensitive data is concerned, filtering of topics helps ensure that the wrong data doesn't get to the wrong place.

The newest release of Apache Kafka now offers the ability to do transformations on individual messages, making it possible to implement finer-grained transformations customized to your unique needs. In this session we’ll talk about the new single message transform capabilities, how to use them to implement things like data masking and advanced partitioning, and when you’ll need to use more complex tools like the Kafka Streams API instead.

Published in: Software

  1. Data Pipelines Made Simple With Apache Kafka
     Ewen Cheslack-Postava, Engineer, Apache Kafka Committer
  2. Attend the whole series!
     https://www.confluent.io/online-talk/online-talk-series-five-steps-to-production-with-apache-kafka/
     • What’s New in Apache Kafka 0.10.2 and Confluent 3.2
       Date: Thursday, March 9, 2017
       Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
       Speaker: Clarke Patterson, Senior Director, Product Marketing
     • Monitoring and Alerting Apache Kafka with Confluent Control Center
       Date: Thursday, March 16, 2017
       Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
       Speaker: Nick Dearden, Director, Engineering and Product
     • Data Pipelines Made Simple with Apache Kafka
       Date: Thursday, March 23, 2017
       Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
       Speaker: Ewen Cheslack-Postava, Engineer, Confluent
     • Using Apache Kafka to Analyze Session Windows
       Date: Thursday, March 30, 2017
       Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
       Speaker: Michael Noll, Product Manager, Confluent
     • Simplify Governance for Streaming Data in Apache Kafka
       Date: Thursday, April 6, 2017
       Time: 9:30 am - 10:00 am PT | 12:30 pm - 1:00 pm ET
       Speaker: Gwen Shapira, Product Manager, Confluent
  3. The Challenge: Streaming Data Pipelines
  4. Simplifying Streaming Data Pipelines with Apache Kafka
  5. Kafka Connect
  6. Streaming ETL
  7. Single Message Transforms for Kafka Connect
     Modify events before storing in Kafka:
     • Mask sensitive information
     • Add identifiers
     • Tag events
     • Store lineage
     • Remove unnecessary columns
     Modify events going out of Kafka:
     • Route high priority events to faster data stores
     • Direct events to different Elasticsearch indexes
     • Cast data types to match destination
     • Remove unnecessary columns
  8. Where Single Message Transforms Fit In
  9. Built-in Transformations
     • InsertField – Add a field using either static data or record metadata
     • ReplaceField – Filter or rename fields
     • MaskField – Replace a field with a valid null value for its type (0, empty string, etc.)
     • ValueToKey – Set the key to one of the value’s fields
     • HoistField – Wrap the entire event as a single field inside a Struct or a Map
     • ExtractField – Extract a specific field from a Struct or Map and include only this field in the results
     • SetSchemaMetadata – Modify the schema name or version
     • TimestampRouter – Modify the topic of a record based on the original topic and timestamp. Useful when using a sink that needs to write to different tables or indexes based on timestamps
     • RegexRouter – Modify the topic of a record based on the original topic, a replacement string, and a regular expression
  10. Configuring Single Message Transforms
        name=local-file-source
        connector.class=FileStreamSource
        tasks.max=1
        file=test.txt
        topic=connect-test
        transforms=MakeMap,InsertSource
        transforms.MakeMap.type=org.apache.kafka.connect.transforms.HoistField$Value
        transforms.MakeMap.field=line
        transforms.InsertSource.type=org.apache.kafka.connect.transforms.InsertField$Value
        transforms.InsertSource.static.field=data_source
        transforms.InsertSource.static.value=test-file-source
  11. Why only single messages?
      • Delivery guarantees!
      • Always provide at-least-once semantics
      • For supported connectors, provide exactly-once semantics
      • No additional complication: transformations happen inline with import/export
  12. When should I use each tool?
      Kafka Connect & Single Message Transforms
      • Simple, message at a time
      • Transformation can be performed inline
      • Transformation does not interact with external systems
      Kafka Streams
      • Complex transformations, including:
        • Aggregations
        • Windowing
        • Joins
      • Transformed data stored back in Kafka, enabling reuse
      • Write, deploy, and monitor a Java application
  13. Conclusion
      Single Message Transforms in Kafka Connect
      • Lightweight transformation of individual messages
      • Configuration-only data pipelines
      • Pluggable, with lots of built-in transformations
  15. Get Started with Apache Kafka Today!
      https://www.confluent.io/downloads/
      THE place to start with Apache Kafka!
      • Thoroughly tested and quality assured
      • More extensible developer experience
      • Easy upgrade path to Confluent Enterprise
  16. Discount code: kafcom17
      • Use the Apache Kafka community discount code to get $50 off
      • www.kafka-summit.org
      Kafka Summit New York: May 8
      Kafka Summit San Francisco: August 28
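
The connector configuration on slide 10 chains two transforms: HoistField$Value wraps each line read from test.txt in a single field named line, and InsertField$Value then adds a static data_source field. The Python sketch below only simulates that chain on plain dicts to show the record shape that should result. It is not Kafka Connect code, and the names hoist_field, insert_static_field, and apply_chain are hypothetical helpers introduced for illustration.

```python
from functools import reduce
from typing import Any, Callable, Dict, List

Transform = Callable[[Any], Any]


def hoist_field(field: str) -> Transform:
    """Simulate HoistField$Value: wrap the whole value in a one-field map."""
    return lambda value: {field: value}


def insert_static_field(field: str, static_value: str) -> Transform:
    """Simulate InsertField$Value configured with static.field / static.value."""
    def transform(value: Dict[str, Any]) -> Dict[str, Any]:
        updated = dict(value)  # copy so the input record is not mutated
        updated[field] = static_value
        return updated
    return transform


def apply_chain(value: Any, transforms: List[Transform]) -> Any:
    """Apply transforms in the order listed, like the 'transforms' property."""
    return reduce(lambda current, t: t(current), transforms, value)


# Mirrors transforms=MakeMap,InsertSource from the slide.
chain = [
    hoist_field("line"),
    insert_static_field("data_source", "test-file-source"),
]

if __name__ == "__main__":
    print(apply_chain("hello world", chain))
```

Order matters in this sketch: the static field can only be inserted once the raw line has been hoisted into a map, which is why MakeMap is listed before InsertSource.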

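
Two of the use cases from slides 7 and 9, masking sensitive fields and routing records to per-day indexes, can also be illustrated outside Kafka. The following Python sketch imitates what MaskField and TimestampRouter are described as doing. It is a rough model under stated assumptions, not the Connect implementation: the helper names mask_fields and timestamp_route are hypothetical, and the topic-day naming pattern and the use of UTC are assumptions chosen for the example.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Iterable


def mask_fields(value: Dict[str, Any], fields: Iterable[str]) -> Dict[str, Any]:
    """Imitate MaskField: replace each named field with an empty value of its type."""
    masked = dict(value)  # copy so the input record is not mutated
    for name in fields:
        original = masked[name]
        # type(x)() gives 0 for int, 0.0 for float, "" for str, False for bool.
        masked[name] = type(original)()
    return masked


def timestamp_route(topic: str, timestamp_ms: int) -> str:
    """Imitate TimestampRouter with an assumed '<topic>-<yyyyMMdd>' pattern in UTC."""
    day = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{topic}-{day.strftime('%Y%m%d')}"


if __name__ == "__main__":
    record = {"user": "alice", "ssn": "123-45-6789", "age": 42}
    print(mask_fields(record, ["ssn", "age"]))
    # 1490270400000 ms is 2017-03-23 12:00:00 UTC.
    print(timestamp_route("orders", 1490270400000))
```

Both functions look at one record at a time and keep no state between records, which is the property that lets such transforms run inline with import and export.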