© 2019 Ververica
Aljoscha Krettek – Software Engineer, Flink PMC, Beam PMC
Timo Walther – Software Engineer, Flink PMC
What's new for Flink's Table & SQL APIs?
Planners, Python, DDL, and more!
• Very expressive stream processing API
– Transform, aggregate, and join events
– Java and Scala
• Control how events are processed with respect to time
– Timestamps, Watermarks, Windows, Timers, Triggers, Allowed Lateness, …
• Maintain and update application state
– Keyed state, operator state, state backends, checkpointing, …
The DataStream API is great…
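As a concrete illustration of these bullets, here is a minimal sketch of a keyed, windowed aggregation in the Scala DataStream API. The Click case class and the inline example data are made up for illustration:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class Click(user: String, url: String)

object ClickCounts {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    env
      .fromElements(Click("alice", "/a"), Click("bob", "/b"), Click("alice", "/c"))
      .map(click => (click.user, 1))
      .keyBy(_._1)                  // keyed state is scoped per user
      .timeWindow(Time.minutes(1))  // control how events are grouped in time
      .sum(1)                       // count clicks per user and window
      .print()

    env.execute("DataStream click counts")
  }
}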
• Writing distributed programs is not always easy
– Stream processing technology spreads rapidly
– New concepts (time, state, ...)
• Requires knowledge & skill
– Continuous applications have special requirements
– Programming experience (Java / Scala)
• Users want to focus on their business logic
… but it's not made for everyone.
• Relational APIs are declarative
– User says what is needed, system decides how to compute it
• Queries can be effectively optimized
– Less imperative black-box code
– Well-researched field
• Queries are efficiently executed
– Let Flink deal with state and time
• "Everybody" knows and uses SQL
Why not SQL (or another relational API)?
Apache Flink’s Relational APIs
Unified APIs for batch & streaming data
A query specifies exactly the same result
regardless whether its input is
static batch data or streaming data.
LINQ-style Table API:

tableEnvironment
  .scan("clicks")
  .groupBy('user)
  .select('user, 'url.count as 'cnt)

ANSI SQL:

SELECT user, COUNT(url) AS cnt
FROM clicks
GROUP BY user
This is joint work with members of the
Apache Flink community.
Some of this presents work that is in
progress in the Flink community. Other
things are planned and/or have design
documents. Some were discussed at
one point or another on the mailing lists
or in person.
This represents our understanding of the current state; it is not a fixed roadmap, as Flink is an open-source Apache project.
Evolution in Progress…
FLIP-29, FLIP-30, FLIP-32, FLIP-37, FLIP-38, FLIP-51, FLIP-55, FLIP-57, FLIP-64, FLIP-65, FLIP-66, FLIP-68, FLIP-69
New Planner in a Unified Architecture
FLIP-32
Architecture before Flink 1.9 (diagram): the runtime and internal stream API underneath two separate APIs, DataSet (ExecutionEnvironment) and DataStream (StreamExecutionEnvironment), with Table / SQL split into BatchTableEnvironment and StreamTableEnvironment on top.
Does this look unified?
Architecture in Flink 1.9+ (diagram): the runtime and internal stream API underneath a single DataStream API (StreamExecutionEnvironment), with Table / SQL on a unified (Stream)TableEnvironment on top.
Alibaba’s Contribution of Blink
• A truly unified runtime operator stack
• Many more runtime operators for better SQL coverage
• Proper cost model for planning
• Improved data structures (sorters, hash tables) and serializers for operating
on binary data
• Support all TPC-H and TPC-DS Queries
• And much more...
How can we merge Blink gradually?
• Separate the API from the query processor
• Make the query processor pluggable
• Reduce technical debt in the API on the way
– Make the API Scala-free (private members were public in Java)
– Remove API design flaws (nobody needs 7 TableEnvironments or TypeInformation in SQL)
– Allow pure table programs (regular table users don't need the DataStream API)
(Diagram: the Table / SQL API sits on top of a pluggable query processor, so the old planner and the Blink planner can coexist.)
An API is growing up
FLIP-29 / FLIP-30 / FLIP-55 / FLIP-64
Separation of Concerns
flink-table
flink-table-common → SPI interfaces for connectors, catalogs, UDFs
flink-table-api-java → pure Java table programs
flink-table-api-scala → pure Scala table programs
flink-table-api-java-bridge → programs that interact with the Java DataStream API
flink-table-api-scala-bridge → programs that interact with the Scala DataStream API
flink-table-planner → the pre-1.9 planner
flink-table-planner-blink → the Blink-based planner
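As an illustration, a pure Scala table program would only pull in the API module plus one planner. The sbt snippet below is a hypothetical sketch based on the module names above; treat the exact artifact names, version, and scopes as assumptions.

// Hypothetical sbt dependencies for a pure Scala table program on Flink 1.9.
// The artifact names mirror the modules listed above; adjust version and scope as needed.
val flinkVersion = "1.9.0"

libraryDependencies ++= Seq(
  // Table API for pure Scala table programs (no DataStream interaction)
  "org.apache.flink" %% "flink-table-api-scala" % flinkVersion,
  // One planner implementation, here the new Blink-based planner
  "org.apache.flink" %% "flink-table-planner-blink" % flinkVersion
)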
Pure Table Programs
val settings = EnvironmentSettings.newInstance()
.useBlinkPlanner()
.inBatchMode()
.build()
val env = TableEnvironment.create(settings)
env.registerCatalog("enterprise_hive", new HiveCatalog(...))
env.registerCatalog("enterprise_kafka", new SchemaRegistry(...))
env.sqlUpdate("""
INSERT INTO enterprise_hive.sensitive.customers
SELECT * FROM enterprise_kafka.topics.customer
""")
env.execute("ETL pipeline")
Deep catalog integration.
No (Stream)ExecutionEnvironment.
Goal: no batch/streaming mode + only 1 planner
Pure Table Programs
env.createTemporaryFunction("parse", classOf[JsonParser])
val data = env.sqlQuery("""
SELECT parse(json) AS customer
FROM enterprise_hive.sensitive.customers
""")
val enrichedData = data
.select('customer.flatten())
.dropColumns(12 to 34)
.addColumns('firstname + " " + 'lastname as 'fullname)
env.createView("enrichedData", enrichedData)
Unified functions for Java/Scala.
Functionality beyond SQL.
But still integrated into catalogs.
A good Type System as a Basis
FLIP-37 / FLIP-65
What is the Input and Output Type?
case class Customer(name: String, balance: BigDecimal)
class TableFunction[Row] {
def eval(customer: Customer) {
val outputRow = // ... some transformation
this.collect(outputRow)
}
}
class TableFunction[Customer] {
def eval(row: Row) {
val outputCustomer = // ... some transformation
this.collect(outputCustomer)
}
}
New Data Type Abstraction
• Decoupled from Flink’s TypeInformation and TypeSerializers
• 24 types defined with parser syntax, semantics, boundaries
• DataType = logical type + runtime class hint for edges of the API
import org.apache.flink.table.api.DataTypes._
ROW(
  FIELD("name", VARCHAR(200)),
  FIELD("balance", DECIMAL(5, 3)))

TIMESTAMP(3).bridgedTo(classOf[java.time.LocalDateTime])
New Type Inference
case class Customer(name: String, @DataTypeHint("DECIMAL(4, 2)") balance: BigDecimal)
@FunctionHint(output = @DataTypeHint("ROW<name STRING, balance DECIMAL(4, 2)>"))
class TableFunction[Row] {
def eval(customer: Customer) {
val outputRow = // ... some transformation
this.collect(outputRow)
}
}
class TableFunction[Customer] {
def eval(@DataTypeHint("ROW<name STRING, balance DECIMAL(4, 2)>") row: Row) {
val outputCustomer = // ... some transformation
this.collect(outputCustomer)
}
}
More needed?
Override getTypeInference() and
implement functions like a pro.
No gap between UDFs and system
functions anymore!
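For reference, here is a sketch of what such a manual type inference can look like. It is modeled on the API as it later stabilized out of FLIP-65, so treat the exact class and method names below as assumptions rather than the Flink 1.9 surface:

import org.apache.flink.table.api.DataTypes
import org.apache.flink.table.catalog.DataTypeFactory
import org.apache.flink.table.functions.ScalarFunction
import org.apache.flink.table.types.inference.{TypeInference, TypeStrategies}

// A scalar function that parses a string into DECIMAL(4, 2) and declares its
// argument and result types explicitly instead of relying on reflection.
class ParseBalance extends ScalarFunction {

  def eval(s: String): java.math.BigDecimal = new java.math.BigDecimal(s)

  override def getTypeInference(typeFactory: DataTypeFactory): TypeInference =
    TypeInference.newBuilder()
      .typedArguments(DataTypes.STRING())
      .outputTypeStrategy(TypeStrategies.explicit(DataTypes.DECIMAL(4, 2)))
      .build()
}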
SQL end-to-end
FLIP-66 / FLIP-68 / FLIP-69 / FLIP-59
LOAD MODULE string_utils;
LOAD MODULE ml_utils;
SET exec.auto-watermark-interval = '400 ms';
SET exec.max-parallelism = '128';
SET table.optimizer.join-reorder-enabled = 'true';
CREATE TABLE kafka_source (
user_id STRING,
log_ts TIMESTAMP(3),
WATERMARK FOR log_ts AS log_ts - INTERVAL '5' SECOND
) WITH (
'connector.type' = 'kafka',
'connector.version' = '0.10',
'connector.topic' = 'topic_name',
'format.type' = 'json'
);
CREATE TABLE kafka_sink (user_id STRING, ...) WITH (...);
INSERT INTO kafka_sink SELECT ...
What is Missing in Flink SQL?
The Flink Python Table API*
*in Flink 1.9.0
Introducing the new Python Table API
• The new Python API was introduced in Flink 1.9.0 (FLIP-38)
• The older DataSet Python API and DataStream Python API were removed in
Flink 1.9.0
Goals/Features in Flink 1.9.0
• Support relational, LINQ-style queries
written in Python
• Support SQL queries, including DDL
• Support working with existing
Table/SQL connector ecosystem
Non-Goals in Flink 1.9.0
• User-defined functions written in
Python
Python Table API Example
from pyflink.dataset import ExecutionEnvironment
from pyflink.table import TableConfig, BatchTableEnvironment

exec_env = ExecutionEnvironment.get_execution_environment()
exec_env.set_parallelism(1)
t_config = TableConfig()
t_env = BatchTableEnvironment.create(exec_env, t_config)
# connector definitions omitted
t_env.scan('mySource') \
    .group_by('word') \
    .select('word, count(1)') \
    .insert_into('mySink')
t_env.execute("WordCount in Python")
Some Assembly Required (for now)
$ mvn clean install -DskipTests -Dfast
$ cd flink-python
$ python3 setup.py sdist
$ pip install dist/*.tar.gz
• Right now, PyFlink needs to be built from source; we're working on getting it into PyPI
• Download from https://flink.apache.org/downloads.html
Defining Sources and Sinks (using builders)
t_env.connect(FileSystem().path('/tmp/input')) \
    .with_format(OldCsv()
        .line_delimiter('\n')
        .field('word', DataTypes.STRING())) \
    .with_schema(Schema()
        .field('word', DataTypes.STRING())) \
    .register_table_source('mySource')
Defining Sources and Sinks (using builders)
t_env.connect(FileSystem().path('/tmp/output')) \
    .with_format(OldCsv()
        .field_delimiter(',')
        .field('word', DataTypes.STRING())
        .field('count', DataTypes.BIGINT())) \
    .with_schema(Schema()
        .field('word', DataTypes.STRING())
        .field('mycount', DataTypes.BIGINT())) \
    .register_table_sink('mySink')
Defining Sources and Sinks (using DDL)
source_ddl = '''
    create table mySource(
        word varchar
    ) with (
        'connector.type' = 'filesystem',
        'connector.path' = '/tmp/input',
        'format.type' = 'csv',
        'format.fields.0.name' = 'word',
        'format.fields.0.type' = 'VARCHAR'
    )
'''
t_env.sql_update(source_ddl)
Defining Sources and Sinks (using DDL)
sink_ddl = '''
    create table mySink(
        word VARCHAR,
        cnt BIGINT
    ) with (
        'connector.type' = 'filesystem',
        'connector.path' = '/tmp/output',
        'format.type' = 'csv',
        'format.fields.0.name' = 'word',
        'format.fields.0.type' = 'VARCHAR',
        'format.fields.1.name' = 'mycount',
        'format.fields.1.type' = 'BIGINT'
    )
'''
t_env.sql_update(sink_ddl)
Running SQL Queries
t_env.scan('mySource') \
    .group_by('word') \
    .select('word, count(1)') \
    .insert_into('mySink')
t_env.sql_update('''
INSERT INTO mySink SELECT word, COUNT(1)
FROM mySource GROUP BY word
''')
User-defined Python Functions
A Preview of FLIP-58: User-defined Python functions
• Problem:
– Flink runs on the JVM
– Proper Python does not run on the JVM
• Solution (architecture diagram on the slide): run the user-defined Python functions in separate Python processes that communicate with the JVM; see FLIP-58 for details:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-58%3A+Flink+Python+User-Defined+Stateless+Function+for+Table
Resources
• All FLIPs: https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
• Table documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/
• Building the new Python Table API: https://ci.apache.org/projects/flink/flink-docs-release-1.9/flinkDev/building.html#build-pyflink
• Python Table API tutorial: https://ci.apache.org/projects/flink/flink-docs-release-1.9/tutorials/python_table_api.html
• Python Table API documentation: https://ci.apache.org/projects/flink/flink-docs-release-1.9/api/python/index.html
Thank you!
Questions?
www.ververica.com | @VervericaData
aljoscha@ververica.com
timo@ververica.com