Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing with Apache Flink's Relational APIs


Published on

Apache Flink's DataStream API is very expressive and gives users precise control over time and state. However, many applications do not require this level of expressiveness and can be implemented more concisely and easily with a domain-specific API. SQL is undoubtedly the most widely used language for data processing but usually applied in the domain of batch processing. Apache Flink features two relational APIs for unified stream and batch processing, the Table API, a language-integrated relational query API for Scala and Java, and SQL. A Table API or SQL query computes the same result regardless whether it is evaluated on a static file or on a Kafka topic. While Flink evaluates queries on batch input like a conventional query engine, queries on streaming input are continuously processed and their results constantly updated and refined. In this talk we present Flink’s unified relational APIs, show how streaming SQL queries are processed, and discuss exciting new use-cases.

Published in: Data & Analytics
  • Be the first to comment

  • Be the first to like this

Flink Forward Berlin 2017: Fabian Hueske - Using Stream and Batch Processing with Apache Flink's Relational APIs

  1. 1. 1 Fabian Hueske @fhueske Flink Forward Berlin September, 13th 2017 Stream Analytics with SQL on Apache Flink®
  2. 2. 2 Original creators of Apache Flink® Providers of dA Platform 2, including open source Apache Flink + dA Application Manager
  3. 3. The DataStream API  Flink’s DataStream API is very expressive • Application logic implemented as user-defined functions • Windows, triggers, evictors, state, timers, async calls, …  Many applications follow similar patterns • Do not require the expressiveness of the DataStream API • Can be specified more concisely and easily with a DSL Q: What’s the most popular DSL for data processing? A: SQL! 3
  4. 4. Apache Flink’s relational APIs  Standard SQL & LINQ-style Table API  Unified APIs for batch & streaming data A query specifies exactly the same result regardless whether its input is static batch data or streaming data.  Common translation layers • Optimization based on Apache Calcite • Type system & code-generation • Table sources & sinks 4
  5. 5. Show me some code! tableEnvironment .scan("clicks") .filter('"") .groupBy('user) .select('user, 'url.count as 'cnt) SELECT user, COUNT(url) AS cnt FROM clicks WHERE url LIKE '' GROUP BY user 5 “clicks” can be a - file - database table, - stream, …
  6. 6. What if “clicks” is a file? 6 user cTime url Mary 12:00:00 https://… Bob 12:00:00 https://… Mary 12:00:02 https://… Liz 12:00:03 https://… user cnt Mary 2 Bob 1 Liz 1 Q: What if we get more click data? A: We run the query again. SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user
  7. 7. What if “clicks” is a stream? 7  We want the same results as for batch input!  Can we query a stream with SQL as well?
  8. 8. SQL was not designed for streams  Relations are bounded (multi-)sets.  DBMS can access all data.  SQL queries return a result and complete. 8 Streams are infinite sequences. Streaming data arrives over time. Streaming queries continuously emit results and never complete. ↔ ↔ ↔
  9. 9. DBMSs run queries on streams  Materialized views (MV) are similar to regular views, but persisted to disk or memory • Used to speed-up analytical queries • MVs need to be updated when the base tables change  MV maintenance is very similar to SQL on streams • Base table updates are a stream of DML statements • MV definition query is evaluated on that stream • MV is query result and continuously updated 9
  10. 10. Continuous Queries in Flink  Core concept is a “Dynamic Table” • Dynamic tables are changing over time  Queries on dynamic tables • produce new dynamic tables (which are updated based on input) • do not terminate  Stream ↔ Dynamic table conversions 10
  11. 11. Stream → Dynamic Table  Append mode • Stream records are appended to table • Table grows as more data arrives 11 user cTime url Mary 12:00:00 ./home Bob 12:00:00 ./cart Mary 12:00:05 ./prod?id=1 Liz 12:01:00 ./home Bob 12:01:30 ./prod?id=3 Mary 12:01:45 ./prod?id=7 … … Mary, 12:00:00, ./home Bob, 12:00:00, ./cart Mary, 12:00:05, ./prod?id=1 Liz, 12:01:00, ./home Bob, 12:01:30, ./prod?id=3 Mary, 12:01:45, ./prod?id=7
  12. 12. Stream → Dynamic Table  Upsert mode • Stream records have (composite) key attributes • Records are inserted or update existing records with same key 12 user lastLogin Mary 2017-07-01 Bob 2017-06-01 Liz 2017-05-01 … Mary, 2017-03-01 Bob, 2017-03-15 Mary, 2017-04-01 Liz, 2017-05-01 Bob, 2017-06-01 Mary, 2017-07-01
  13. 13. Querying a Dynamic Table clicks user cnt u1 1 result u2 1 u3 1 u1 2 u3 2 u1 3SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user Rows of result table are updated. 13 Mary 12:01:45 ./prod?id=7 Liz 12:01:30 ./prod?id=3 Liz 12:01:00 ./home Mary 12:00:05 ./prod?id=1 Bob 12:00:00 ./cart Mary 12:00:00 ./home user cTime url
  14. 14. What about windows? tableEnvironment .scan("clicks") .window(Tumble over 1.hour on 'cTime as 'w) .groupBy('w, 'user) .select('user, 'w.end AS endT, 'url.count as 'cnt) SELECT user, TUMBLE_END(cTime, INTERVAL '1' HOURS) AS endT, COUNT(url) AS cnt FROM clicks GROUP BY TUMBLE(cTime, INTERVAL '1' HOURS), user 14
  15. 15. clicks Computing window aggregates user endT cnt u1 13:00:00 3 u2 13:00:00 1 result u2 14:00:00 1 u3 14:00:00 2 u1 15:00:00 1 u2 15:00:00 2 u3 15:00:00 1 Mary 12:00:00 ./home Bob 12:00:00 ./cart Mary 12:02:00 ./prod?id=2 Mary 12:55:00 ./home Mary 14:00:00 ./prod?id=1 Liz 14:02:00 ./prod?id=8 Bob 14:30:00 ./prod?id=7 Bob 14:40:00 ./home Bob 13:01:00 ./prod?id=4 Liz 13:30:00 ./cart Liz 13:59:00 ./home SELECT user, TUMBLE_END( cTime, INTERVAL '1' HOURS) AS endT, COUNT(url) AS cnt FROM clicks GROUP BY user, TUMBLE( cTime, INTERVAL '1' HOURS) Rows are appended to result table. 15 user cTime url
  16. 16. Why are results always appended?  cTime attribute is event-time attribute • Guarded by watermarks • Internally represented as special type • User-facing as TIMESTAMP  Special plans for queries that operate on event-time attributes 16 SELECT user, TUMBLE_END(cTime, INTERVAL '1' HOURS) AS endT, COUNT(url) AS cnt FROM clicks GROUP BY TUMBLE(cTime, INTERVAL '1' HOURS), user
  17. 17. Dynamic Table → Stream  Converting a dynamic table into a stream • Dynamic tables might update or delete existing rows • Updates must be encoded in outgoing stream  Conversion of tables to streams inspired by DBMS logs • DBMS use logs to restore databases (and tables) • REDO logs store new records to redo changes • UNDO logs store old records to undo changes 17
  18. 18. Dynamic Table → Stream: REDO/UNDO + Bob,1+ Mary,2+ Liz,1+ Bob,2 + Mary,1- Mary,1- Bob,1 SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user + INSERT / - DELETE 18 user url clicks Mary ./home Bob ./cart Mary ./prod?id=1 Liz ./home Bob ./prod?id=3
  19. 19. Dynamic Table → Stream: REDO * Bob,1* Mary,2* Liz,1* Liz,2* Mary,3 * Mary,1 * UPSERT by KEY / - DELETE by KEY SELECT user, COUNT(url) as cnt FROM clicks GROUP BY user 19 user url clicks Mary ./home Bob ./cart Mary ./prod?id=1 Liz ./home Liz ./prod?id=3 Mary ./prod?id=7
  20. 20. Can we run any query on a dynamic table?  No, there are space and computation constraints   State size may not grow infinitely as more data arrives SELECT sessionId, COUNT(url) FROM clicks GROUP BY sessionId;  A change of an input table may only trigger a partial re-computation of the result table SELECT user, RANK() OVER (ORDER BY lastLogin) FROM users; 20
  21. 21. Bounding the size of query state  Adapt the semantics of the query • Aggregate data of last 24 hours. Discard older data.  Trade the accuracy of the result for size of state • Remove state for keys that became inactive. 21 SELECT sessionId, COUNT(url) AS cnt FROM clicks WHERE last(cTime, INTERVAL '1' DAY) GROUP BY sessionId
  22. 22. Current state of SQL & Table API  Flink’s relational APIs are rapidly evolving • Lots of interest by community and many contributors • Used in production at large scale by Alibaba and others  Features released in Flink 1.3 • GroupBy & Over windowed aggregates • Non-windowed aggregates (with update changes) • User-defined aggregation functions  Features coming with Flink 1.4 • Windowed Joins • Reworked connectors APIs 22
  23. 23. What can be built with this?  Continuous ETL • Continuously ingest data • Process with transformations & window aggregates • Write to files (Parquet, ORC), Kafka, PostgreSQL, HBase, … 23
  24. 24. What can be built with this? 24  Dashboards, reporting & event-driven architectures • Flink updates query results with low latency • Result can be written to KV store, DBMS, compacted Kafka topic • Maintain result table as queryable state
  25. 25. Wrap-up!  Table API & SQL support many streaming use cases • High-level / declarative specification • Automatic optimization and translation • Efficient execution • Scalar, table, aggregation UDFs for flexibility  Updating results enable many exciting applications  Check it out! 25
  26. 26. Thank you! @fhueske @ApacheFlink @dataArtisans Available on O’Reilly Early Release!
  27. 27. We are hiring!
  28. 28. Tables are materialized streams  A table is the materialization of a stream of modifications • SQL DML statements: INSERT, UPDATE, and DELETE • DBMSs process statements by modifying tables 29 user name lastLogin u2 Peter 2017-05-01 u1 Mary 2017-03-01u1 Mary 2017-06-01 INSERT (u1, Mary, "2017-03-01") INSERT (u2, Peter, "2017-05-01") DELETE WHERE (user = u2) UPDATE (lastLogin = "2017-06-01") WHERE (user = u1)