
Flink SQL & TableAPI in Large Scale Production at Alibaba


Search and recommendation systems for Alibaba’s e-commerce platform use batch and stream processing heavily. Flink SQL and the Table API (a SQL-like DSL) provide a simple, flexible, and powerful language for expressing data processing logic. More importantly, they open the door to unifying the semantics of batch and streaming jobs.

Blink is a project at Alibaba that improves Apache Flink to make it ready for large-scale production use. To support our products, we made many improvements to Flink SQL & Table API in Alibaba's Blink project, adding support for user-defined table functions (UDTF), user-defined aggregates (UDAGG), window aggregates, retraction, and more. We are actively working with the Flink community to contribute these improvements back. In this talk, we present the rationale, semantics, design, and implementation of these improvements. We also share our experience of running large-scale Flink SQL and Table API jobs at Alibaba.



  1. 1. Flink SQL & Table API in Large Scale Production at Alibaba Xiaowei Jiang Shaoxuan Wang June, 2017
  2. 2. About Us Xiaowei Jiang • 2014-now Alibaba • 2010-2014 Facebook • 2002-2010 Microsoft • 2000-2002 Stratify Shaoxuan Wang • 2015-now Alibaba • 2014-2015 Facebook • 2010-2014 Broadcom
  3. 3. Outline 1 Background 2 Why SQL & Table API 3 Blink SQL & Table API 4 Blink SQL & Table API in Large Scale Production
  4. 4. Background Section 1
  5. 5. About Alibaba Alibaba Group • Operates the world’s largest e-commerce platform • Recorded GMV of $485 Billion in year 2016, $17.8 billion worth of GMV in a single day on Nov 11, 2016 Realtime Data Infrastructure • Supports internal products such as search, recommendation, BI • Also supports external customers through its cloud service
  6. 6. Blink – Alibaba’s version of Flink Looked into Flink two years ago • the best choice for a unified computing engine • a few issues in Flink that can be problems for large-scale applications Started the “Blink” project • aimed to make Flink work reliably and efficiently at the very large scale of Alibaba Made various improvements to the Flink runtime Enhanced Flink SQL & Table API to production readiness Working with the Flink community to contribute back since last August • several key improvements • hundreds of patches
  7. 7. Blink Ecosystem in Alibaba Cluster Resource Management (Yarn/Fuxi) Search Storage (HDFS/Pangu) SQL & Table API Blink Products Recommendation BI Security DataStream API Runtime Engine Ads DataSet API Machine Learning Platform StreamCompute Platform
  8. 8. Why SQL & Table API Section 2
  9. 9. Why SQL & Table API Unified batch and streaming • Flink currently offers DataSet API for batch and DataStream API for streaming • We want a single API that can run in both batch and streaming mode Improved development efficiency • Users only describe the semantics of data processing • Leave hard optimization problems to the system • SQL is proven to be good at describing data processing • Table API offers seamless integration with Scala and Java • Table API makes it easy to extend standard SQL when necessary
  10. 10. Stream-Table Duality word count Hello 3 World 1 Bark 1 word count Hello 1 World 1 Hello 2 Bark 1 Hello 3 Stream Dynamic Table Apply Changelog
  11. 11. Dynamic Tables Apply Changelog Stream to Dynamic Table • Append Mode: each stream record is an insert to the dynamic table. Hence, all records of a stream are appended to the dynamic table • Update Mode: a stream record can represent an insert, update, or delete modification on the dynamic table (append mode is in fact a special case of update mode)
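The append and update modes above can be sketched in plain Python (not the Flink API; the function name and record shape are illustrative assumptions). In update mode each stream record carries an operation that modifies the keyed table; append mode is the special case where every operation is an insert.

```python
# Sketch: applying a changelog stream to a dynamic table (plain Python,
# not the Flink API). Each record is (operation, key, value); in update
# mode the operation may be insert, update, or delete.
def apply_changelog(changelog):
    table = {}
    for op, key, value in changelog:
        if op in ("insert", "update"):
            table[key] = value          # upsert by key
        elif op == "delete":
            table.pop(key, None)        # remove the row if present
    return table

# Word-count changelog as in the slide: each new occurrence of a word
# updates its count in the dynamic table.
changelog = [
    ("insert", "Hello", 1),
    ("insert", "World", 1),
    ("update", "Hello", 2),
    ("insert", "Bark", 1),
    ("update", "Hello", 3),
]
```

Replaying this changelog yields the final word-count table from the Stream-Table Duality slide: Hello 3, World 1, Bark 1.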
  12. 12. Derive Changelog Stream from Dynamic Table • REDO Mode: where the stream records the new value of a modified element to redo lost changes of completed transactions • REDO+UNDO Mode: where the stream records the old and the new value of a changed element to undo incomplete transactions and redo lost changes of completed transactions Dynamic Tables
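The opposite direction, deriving a changelog stream from table updates, can be sketched the same way (plain Python; the function name and record shape are illustrative assumptions). REDO mode emits only the new value of each modified row; REDO+UNDO mode also emits the old value so the change can be undone.

```python
# Sketch: deriving a changelog stream from updates to a dynamic table
# (plain Python, not the Flink API). updates is a sequence of (key, value)
# upserts applied to the table.
def derive_changelog(updates, mode="redo"):
    table, stream = {}, []
    for key, value in updates:
        if mode == "redo+undo" and key in table:
            stream.append(("undo", key, table[key]))  # old value, to undo
        stream.append(("redo", key, value))           # new value, to redo
        table[key] = value
    return stream
```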
  13. 13. There is no such thing as Stream SQL Stream SQL? Dynamic Tables generalize the concept of Static Tables SQL serves as the unified way to describe data processing in both batch and streaming
  14. 14. Blink SQL & Table API Section 3
  15. 15. Blink SQL & Table API Overview Simple Query: Select and Where Stream-Stream Inner Join User Defined Function (UDF) User Defined Table Function (UDTF) User Defined Aggregate Function (UDAGG) Retraction (stream only) Aggregate
  16. 16. A Simple Query: Select and Where id name price sales stock 1 Latte 6 1 1000 8 Mocha 8 1 800 4 Breve 5 1 200 7 Tea 4 1 2000 1 Latte 6 2 998 id name price sales stock 1 Latte 6 1 1000 1 Latte 6 2 998
  17. 17. Stream-Stream Inner Join id1 name stock 1 Latte 1000 8 Mocha 800 4 Breve 200 3 Water 5000 7 Tea 2000 id2 price sales 1 6 1 8 8 1 9 3 1 4 5 1 7 4 1 id name price sales stock 1 Latte 6 1 1000 8 Mocha 8 1 800 4 Breve 5 1 200 7 Tea 4 1 2000 This is proposed and discussed in FLINK-5878
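The join above can be sketched as a symmetric hash join in plain Python (not the actual operator, which is proposed in FLINK-5878; the function name and event encoding are illustrative assumptions). Each side buffers its records in keyed state and probes the other side's state when a record arrives.

```python
from collections import defaultdict

# Sketch of a symmetric hash join over two interleaved streams.
# events: ("L"|"R", key, payload) records, where payload is a tuple of
# the side's remaining columns.
def stream_stream_join(events):
    bufs = {"L": defaultdict(list), "R": defaultdict(list)}
    out = []
    for side, key, payload in events:
        bufs[side][key].append(payload)          # keep this record as state
        other = "R" if side == "L" else "L"
        for match in bufs[other][key]:           # probe the other side
            left, right = (payload, match) if side == "L" else (match, payload)
            out.append((key,) + left + right)
    return out

# Rows from the slide: left = (name, stock), right = (price, sales).
events = [
    ("L", 1, ("Latte", 1000)),
    ("L", 8, ("Mocha", 800)),
    ("R", 1, (6, 1)),
    ("R", 8, (8, 1)),
    ("R", 9, (3, 1)),   # id 9 has no left match, so it produces no output
]
```

Because both sides are buffered, a match is emitted no matter which side's record arrives first.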
  18. 18. User Defined Function (UDF) A UDF converts a scalar input to a scalar output. Creating and using a UDF is simple: We have enhanced UDF/UDTF to support variable types and variable arguments lSum iSum 35L 1106 Scalar → Scalar long1 long2 int1 int2 int3 10L 25L 6 100 1000
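The slide's lSum/iSum example, a scalar function taking a variable number of arguments, can be sketched in plain Python (a stand-in for Flink's ScalarFunction eval method; the function name is an illustrative assumption):

```python
# Sketch of a scalar UDF with variable arguments: each call maps one or
# more scalar inputs to a single scalar output, as in the slide's
# lSum(10L, 25L) = 35L and iSum(6, 100, 1000) = 1106 examples.
def var_sum(*args):
    return sum(args)
```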
  19. 19. User Defined Table Function (UDTF) name age Tom 23 Jack 17 David 50 line Tom#23 Jark#17 David#50 Scalar → Table (multiple rows and columns) A UDTF converts a scalar input to a table output: We shipped UDTF in Flink release 1.2 (FLINK-4469).
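The slide's split example can be sketched as a generator in plain Python (a stand-in for Flink's TableFunction; the function name is an illustrative assumption): one scalar input line expands into multiple rows, each with multiple columns.

```python
# Sketch of a table function: one scalar input ("Tom#23 Jark#17 ...")
# yields a table of (name, age) rows.
def split_udtf(line):
    for part in line.split():
        name, age = part.split("#")
        yield (name, int(age))
```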
  20. 20. “SELECT SUM(stock) as total” Table → Scalar User Defined Aggregate Function (UDAGG) - Motivation total 2000 A UDAGG converts a table input to a scalar output: Flink has built-in aggregates (count, sum, avg, min, max) for SQL and the Table API. What if a user wants an aggregate that is not covered by the built-in ones, say a weighted average? We need an aggregate interface to support user-defined aggregate functions. id name price sales stock 1 Latte 6 1 1000 8 Mocha 8 1 800 4 Breve 5 1 200
  21. 21. UDAGG – Accumulator (ACC) id name price sales stock 1 Latte 6 1 1000 8 Mocha 8 1 800 4 Breve 5 1 200 7 Tea 4 1 2000 1 Latte 6 2 998 A UDAGG represents its state using an accumulator
  22. 22. UDAGG – Interface UDAGG example: a weighted average SQL Query UDAGG Interface
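The weighted-average UDAGG from the slide can be sketched in plain Python (a stand-in for the actual interface; the class and method names loosely mirror the pattern but are illustrative assumptions): all state lives in an accumulator that the framework creates, feeds records into, and finally reads.

```python
# Sketch of a weighted-average UDAGG: the accumulator holds the weighted
# sum and the total weight; get_value derives the final result from it.
class WeightedAvg:
    def create_accumulator(self):
        return {"sum": 0, "count": 0}

    def accumulate(self, acc, value, weight):
        acc["sum"] += value * weight
        acc["count"] += weight

    def get_value(self, acc):
        return acc["sum"] / acc["count"] if acc["count"] else None
```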
  23. 23. UDAGG – Merge Motivated by local & global aggregation (and session window merging, etc.), we need a merge method that merges partially aggregated accumulators into a single accumulator How to count the total visits to Taobao web pages in real time?
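A minimal sketch of that merge step in plain Python (the function name, accumulator shape, and the per-subtask counts are illustrative assumptions): each parallel subtask pre-aggregates locally, and the global aggregate merges the partial accumulators.

```python
# Sketch: merging partially aggregated count accumulators, e.g. page-visit
# counts produced by parallel local aggregations.
def merge(acc, *others):
    for other in others:
        acc["count"] += other["count"]
    return acc

# Hypothetical partial counts from three parallel subtasks:
partials = [{"count": 120}, {"count": 75}, {"count": 305}]
```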
  24. 24. UDAGG – Retraction – Motivations Incorrect! The freq of cnt=1 should be 2
  25. 25. UDAGG – Retraction – Motivations We need a retract method in UDAGG, which retracts UNDO messages from the accumulator
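The motivation can be sketched with the two-level count from the previous slide (plain Python; the function name is an illustrative assumption): a first aggregate counts occurrences per word, and a second counts how many words have each count. When a word's count moves from 1 to 2, the downstream aggregate must retract the stale contribution to cnt=1, otherwise the frequency of cnt=1 stays wrong.

```python
from collections import Counter

# Sketch: count-of-counts with retraction. When a word's count changes,
# we first retract its old count (the UNDO message) from the downstream
# frequency table, then accumulate the new count.
def freq_of_counts(words):
    word_cnt, freq = Counter(), Counter()
    for w in words:
        old = word_cnt[w]
        if old:
            freq[old] -= 1          # retract the stale contribution
        word_cnt[w] += 1
        freq[old + 1] += 1          # accumulate the new count
    return {k: v for k, v in freq.items() if v}
```

For the input Hello, Hello, World, Bark this yields two words with cnt=1 (World, Bark) and one with cnt=2 (Hello), matching the corrected result on the slide.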
  26. 26. Retraction – Solution The design doc and the progress of the retraction implementation are tracked in FLINK-6047. We delivered this in Flink release 1.3 Retraction is introduced to handle updates We use the query optimizer to decide where retraction is needed. DataStreamAgg (Redo+Undo) Update Table (consume Undo log) DataStreamAgg (Redo) Update Table Append Table TableScan without PK (Redo) NeedRetraction NeedRetraction Sink Table (Does not needRetraction) UpsertSink
  27. 27. UDAGG – Summary Master JIRA for UDAGG is FLINK-5564. We have shipped this in Flink release 1.3.
  28. 28. Aggregate – Over Aggregate time itemID avgPrice 1000 101 1 3000 201 1.5 4000 301 2 5000 101 2.2 5000 401 2.2 7000 301 2.6 8000 401 3 10000 101 2.8 time itemID price 1000 101 1 3000 201 2 4000 301 3 5000 101 1 5000 401 4 7000 301 3 8000 501 5 10000 101 1 A time-based group aggregate cannot differentiate two records with the same row time, but an over aggregate can. Calculate the moving average (over the past 5 seconds) and emit a result for each record
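The slide's moving average can be reproduced with a small sketch (plain Python, not the Flink operator; the function name is an illustrative assumption). For each record, average the prices of all records whose event time falls within the preceding 5 seconds inclusive, peer rows with the same timestamp included, which is what an OVER ... RANGE window does.

```python
# Sketch of the slide's OVER aggregate: per-record moving average over a
# 5-second event-time range window (inclusive of peer rows).
def moving_avg(rows, window_ms=5000):
    out = []
    for t, item, price in rows:
        in_window = [p for (t2, _, p) in rows if t - window_ms <= t2 <= t]
        out.append((t, item, sum(in_window) / len(in_window)))
    return out

# Input table from the slide: (time, itemID, price).
rows = [(1000, 101, 1), (3000, 201, 2), (4000, 301, 3), (5000, 101, 1),
        (5000, 401, 4), (7000, 301, 3), (8000, 501, 5), (10000, 101, 1)]
```

Running this over the slide's input reproduces the avgPrice column: 1, 1.5, 2, 2.2, 2.2, 2.6, 3, 2.8.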
  29. 29. Aggregate – Summary The design of aggregate is mainly tracked in FLIP11 (FLINK-4557). We have delivered the above aggregates in Flink release 1.3 Grouping methods: Groupby / Over Time types: Event time; Process time (only for stream) Unbounded Aggregate: early-firing under a certain emit configuration (by default it emits the result on every input record) Windows: • Time/Count + TUMBLE/SESSION/SLIDE window • OVER Rows/Time Range window
  30. 30. Contributions to Flink SQL & Table API Flink blog: “Continuous Queries on Dynamic Tables” UDF (several improvements released in 1.3) UDTF (FLINK-4469, released in 1.2) UDAGG (FLINK-5564, released in 1.3) Retraction (FLINK-6047, released in 1.3) Group/Over Window Aggregate (FLINK-4557, released in 1.3) Unbounded Stream Group Aggregate (FLINK-6216, released in 1.3) Stream-Stream Inner Join (FLINK-5878, targeted for release 1.4) More coming…
  31. 31. SQL & Table API in Large Scale Production Section 4
  32. 32. SQL & Table API in Alibaba Production - Example SQL & Table API has proven to be a successful and sufficient declarative language for data processing. It significantly reduces the development effort needed to rewrite existing jobs or implement new ones
  33. 33. SQL & Table API in Alibaba Production - Summary Blink@Alibaba In production at Alibaba for more than a year • Hundreds of jobs • The biggest cluster has more than 1500 nodes • The biggest job has thousands of tasks and state over tens of TB Blink SQL@Alibaba In production before the 2016 China Singles’ Day (the biggest shopping festival, similar to Black Friday in the US) • Blink jobs written in SQL & Table API are used for real-time analysis in the recommendation system, which helps improve targeting efficiency and thereby increases the traffic-to-sales conversion • The biggest SQL job has thousands of tasks and state over a TB The latest release of Blink SQL will be used to support Alibaba’s entire internal business and serve external customers via the Alibaba Cloud StreamCompute service
  34. 34. Thanks