
A Data Streaming Architecture with Apache Flink (Berlin Buzzwords 2016)

Here are the slides of my Berlin Buzzwords talk.


  1. A Data Streaming Architecture with Apache Flink. Robert Metzger, @rmetzger_, rmetzger@apache.org. Berlin Buzzwords, June 7, 2016.
  2. Talk overview
     • My take on the stream processing space, and how it changes the way we think about data
     • Transforming an existing data analysis pattern into the streaming world ("Streaming ETL")
     • Demo
  3. Apache Flink
     • Apache Flink is an open source stream processing framework: low latency, high throughput, stateful, distributed
     • Developed at the Apache Software Foundation; 1.0.0 released in March 2016; used in production
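
To make the stream processing framework on this slide concrete, here is a minimal sketch of a Flink streaming job, assuming the Flink 1.0-era Java DataStream API; the class name, sample data, and job name are illustrative, not from the talk.

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MinimalStreamingJob {
    public static void main(String[] args) throws Exception {
        // Entry point of every Flink streaming program
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Toy source; real jobs read from Kafka, files, sockets, ...
        DataStream<String> lines = env.fromElements("flink", "kafka", "streams");

        // A simple stateless transformation
        lines.map(new MapFunction<String, String>() {
            @Override
            public String map(String value) {
                return value.toUpperCase();
            }
        }).print();

        // The dataflow is built lazily and only executed here
        env.execute("Minimal streaming job");
    }
}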
  4. Entering the streaming era
  5. Streaming is the biggest change in data infrastructure since Hadoop
  6. (1) Radically simplified infrastructure. (2) Do more with your data, faster. (3) Can completely subsume batch.
  7. Real-world data is produced in a continuous fashion. New systems like Flink and Kafka embrace the streaming nature of data. [Diagram: web server → Kafka topic → stream processor]
  8. Apache Flink stack. [Diagram: libraries and APIs (Gelly, Table/SQL, ML, CEP, SAMOA, Apache Beam, Cascading, Storm API, Zeppelin) on top of the DataSet (Java/Scala) and DataStream (Java/Scala) APIs, all running on the streaming dataflow runtime; deployable locally, on a cluster, or on YARN; Hadoop M/R compatibility]
  9. What makes Flink flink?
     • Low latency, high throughput, well-behaved flow control (back pressure)
     • Make more sense of data: works on real-time and historic data, true streaming, event time
     • APIs and libraries: windows and user-defined state, flexible windows (time, count, session, roll-your-own), Complex Event Processing
     • Stateful streaming: globally consistent savepoints, exactly-once semantics for fault tolerance
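
As a rough illustration of the flexible windows mentioned on this slide, here is a sketch using time- and count-based windows on a keyed stream, assuming the Flink 1.0-era Java API; the (key, count) sample records and job name are made up.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class WindowSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical (key, count) records standing in for real events
        DataStream<Tuple2<String, Integer>> events = env.fromElements(
                Tuple2.of("clicks", 1), Tuple2.of("clicks", 1), Tuple2.of("views", 1));

        // Tumbling one-minute time window per key, summing the counts
        events.keyBy(0).timeWindow(Time.minutes(1)).sum(1).print();

        // Count window: emit a sum for every 100 events per key
        events.keyBy(0).countWindow(100).sum(1).print();

        env.execute("Window sketch");
    }
}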
  10. Moving existing (batch) data analysis into streaming
  11. Extract, Transform, Load (ETL)
      • ETL: move data from A to B and transform it on the way
      • Old approach: [diagram: data sources such as server logs, mobile, and IoT]
  12. Extract, Transform, Load (ETL)
      • Old approach, continued: [diagram: the sources feed Tier 0 raw data into an HDFS/S3 "data lake"]
  13. Extract, Transform, Load (ETL)
      • Old approach, continued: [diagram: periodic jobs turn Tier 0 raw data into Tier 1 normalized, cleansed data (Parquet/ORC in HDFS) consumed by users]
  14. Extract, Transform, Load (ETL)
      • Old approach, continued: [diagram: further periodic jobs aggregate Tier 1 into Tier 2 aggregated data, the "data warehouse", serving users]
  15. Extract, Transform, Load (Streaming ETL)
      • ETL: move data from A to B and transform it on the way
      • Streaming approach: [diagram: the same sources (server logs, mobile, IoT) feed Tier 0 raw data into the HDFS/S3 "data lake"]
  16. Extract, Transform, Load (Streaming ETL)
      • Streaming approach, continued: [diagram: a Kafka connector feeds the raw events into a stream processor running cleansing, transformation, time-windows, and time-window-based alerts]
  17. Extract, Transform, Load (Streaming ETL)
      • Streaming approach, continued: [diagram: the stream processor emits Tier 1 normalized, cleansed data through a rolling file sink (Parquet/ORC in HDFS, also feeding batch processing) and an Elasticsearch connector serving users]
  18. Extract, Transform, Load (Streaming ETL)
      • Streaming approach, continued: [diagram: additional JDBC and Cassandra sinks produce Tier 2 aggregated data for users]
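
A sketch of the Streaming ETL pipeline from slides 15-18: Kafka in, cleanse and transform, rolling file sink out to the Tier 1 store. Connector class names match the Flink 1.0-era connectors and may differ in later versions; the topic name, broker address, and output path are assumptions.

import java.util.Properties;

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.RollingSink;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class StreamingEtlJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // assumption
        props.setProperty("group.id", "streaming-etl");           // assumption

        // Tier 0: raw events from Kafka (the "logs" topic is hypothetical)
        DataStream<String> raw = env.addSource(
                new FlinkKafkaConsumer09<String>("logs", new SimpleStringSchema(), props));

        // Cleansing: drop malformed records; transformation: normalize
        DataStream<String> cleansed = raw
                .filter(new FilterFunction<String>() {
                    @Override
                    public boolean filter(String line) {
                        return line != null && !line.isEmpty();
                    }
                })
                .map(new MapFunction<String, String>() {
                    @Override
                    public String map(String line) {
                        return line.trim().toLowerCase();
                    }
                });

        // Tier 1: continuously roll normalized data into HDFS (or S3)
        cleansed.addSink(new RollingSink<String>("hdfs:///data/tier1"));

        env.execute("Streaming ETL");
    }
}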
  19. Streaming ETL: Low Latency
      • Events are processed immediately
      • No need to wait until the next "load" batch job is running
      • Latency by approach*: periodic batch job: hours; batch processor with micro-batches: minutes to seconds; stream processor: milliseconds
      • *Your mileage may vary; these are rule-of-thumb estimates.
  20. Streaming ETL: Event-time aware
      • Events derived from the same real-world activity might arrive out of order in the system (out-of-sync clocks, network delays, machine failures)
      • Flink is event-time aware
      [Diagram: events timestamped 11:28 and 11:29 from the same real-world activity arriving in different orders]
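
One way to make a job event-time aware in the Flink 1.0-era API is to extract timestamps from the records themselves and emit watermarks that bound the expected out-of-orderness. A sketch; the (word, epoch-millis) record layout and the 5-second bound are assumptions.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

public class LogTimestampAssigner
        implements AssignerWithPeriodicWatermarks<Tuple2<String, Long>> {

    private static final long MAX_OUT_OF_ORDERNESS_MS = 5000; // assumption
    private long maxSeenTimestamp = Long.MIN_VALUE;

    @Override
    public long extractTimestamp(Tuple2<String, Long> element, long previousTimestamp) {
        // Event time comes from the record itself, not the wall clock
        maxSeenTimestamp = Math.max(maxSeenTimestamp, element.f1);
        return element.f1;
    }

    @Override
    public Watermark getCurrentWatermark() {
        // Promise: no event older than this timestamp should still arrive
        return new Watermark(maxSeenTimestamp - MAX_OUT_OF_ORDERNESS_MS);
    }
}

A job would then enable event time with env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime) and call stream.assignTimestampsAndWatermarks(new LogTimestampAssigner()) before windowing, so that windows close based on watermarks rather than arrival order.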
  21. Demo
  22. Job overview. [Diagram: a Flink Twitter source feeding both a data ingestion job and a "Streaming ETL" job]
  23. Job overview, continued. [Diagram: filter operations, a (rolling) file sink, an aggregation to Elasticsearch, a streaming WordCount, and a TopN operator]
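
The "Streaming WordCount" part of the demo can be approximated with a windowed word count like the following sketch; the in-line sample data stands in for the Twitter source, and the 10-second window is an assumption (Flink 1.0-era Java API).

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class StreamingWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the tweet stream used in the demo
        DataStream<String> text = env.fromElements("flink streams", "flink windows");

        text.flatMap(new Tokenizer())
            .keyBy(0)                       // group by word
            .timeWindow(Time.seconds(10))   // count words per 10-second window
            .sum(1)
            .print();

        env.execute("Streaming WordCount");
    }

    // Splits lines into (word, 1) pairs
    public static class Tokenizer
            implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
            for (String word : line.toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    out.collect(new Tuple2<>(word, 1));
                }
            }
        }
    }
}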
  24. Demo code @ GitHub: https://github.com/rmetzger/flink-streaming-etl
  25. Closing
  26. Apache Flink Hackathon by Berlin Buzzwords: https://www.eventbrite.com/e/apache-flink-hackathon-by-berlin-buzzwords-tickets-25580481910
  27. Flink Forward 2016, Berlin. Submission deadline: June 30, 2016. Early-bird deadline: July 15, 2016. www.flink-forward.org
  28. We are hiring! data-artisans.com/careers
  29. Questions?
      • Ask now!
      • Email: rmetzger@apache.org
      • Twitter: @rmetzger_
      • Follow: @ApacheFlink
      • Read: flink.apache.org/blog, data-artisans.com/blog/
      • Mailing lists: (news | user | dev)@flink.apache.org
  30. Appendix
  31. Sources: "Large scale ETL with Hadoop", http://www.slideshare.net/OReillyStrata/large-scale-etl-with-hadoop
