Flink and Hive integration - unifying enterprise data processing systems

Flink and Hive Integration
- Unifying Enterprise Data Processing Systems
Bowen Li
Committer@Flink, Senior Engineer@Alibaba
Flink Forward Europe, Oct 2019

Agenda
● Background
● Motivations and Impacts
● Flink 1.9 - State of Union
● Flink 1.10 - What’s upcoming next?
● Q&A

Background
Flink aims at unifying data processing for streaming and batch use cases
● Batch is a special case of Streaming
○ bounded v.s. unbounded data streams
● Unified and simpler tech stack, deployment and operation
● Smaller learning cost for end users
○ developers, data scientists, analysts, etc

Why integrate with Hive?
● Hive is de facto standard for batch processing (ETL, analytics, etc) in enterprises
● Hive is widely adopted with huge user base
● Hive metastore is the center of big data ecosystem
● Hive users want lower latency and near real time data warehouse
● Streaming users usually have Hive deployment and need to access Hive data/metadata

Motivations and Impacts
● Strengthen Flink’s lead in stream processing by enhancing its metadata stack
● Advance Flink’s batch capabilities
● Provide unified solution for stream and batch processings using SQL
● Enrich and extend Flink’s ecosystem
● Promote Flink’s adoption

Platform Level Integration
● Not another “Hive on Xxx”
● The integration is fully in Flink repo
● Released as part of Flink

Flink 1.9 - State of the Union
check out more official documentations
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/hive/

Integrate with Hive Metadata
We developed brand-new Catalog APIs to
● integrate Flink with Hive Metadata/Metastore
● completely reshape Flink’s metadata stack for both streaming and batch
FLIP-30 - Uniﬁed Catalog APIs
(https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs)

Catalog APIs
Catalog
● Meta-Objects
○ Database, Table, View, Partition, Functions, TableStats, and PartitionStats
● Operations
○ Get/List/Exist/Create/Alter/Rename/Drop
Namespace of Catalog Table/View/Function
● fully qualified namespace as <catalog_name>.<db_name>.<object_name>

Catalogs are pluggable and opens opportunities for
● Catalog for Hive
● Catalog for Streams and MQ
○ Pulsar Catalog is in review
○ Kafka(Confluent Schema Registry), RabbitMQ, RocketMQ, etc
● Catalog for structured data
○ RDMS like MySQL, etc
● Catalogs for semi-structured data
○ ElasticSearch, HBase, Cassandra, etc
Roadmaps of Catalog APIs

CatalogManager
● manage all registered catalogs and resolve objects
● default to current catalog and current database in parsing queries
select * from currentCatalog.currentDb.myTable
>>>=== can be simplified as ===>>>
select * from myTable

Arichtecture
Flink Runtime
Query processing & optimization
Table API and SQL Catalog APIs
SQL Client/Zeppelin

Flink 1.9 provides two catalog implementations out of shelf.
● GenericInMemoryCatalog
○ in-memory non-persistent, per session, used by default
● HiveCatalog
○ compatible with multiple Hive versions
○ supports most Hive data types
○ can read/write Hive meta-objects
○ can persist Flink non-hive streaming and batch meta-objects to Hive Metastore
■ e.g. kafka/pulsar tables
Catalogs

HiveCatalog
Flink-Hive interoperability: Flink can read/write Hive
metadata thru HiveCatalog
Flink can persist non-hive metadata using Hive
Metastore as storage via HiveCatalog

● Officially support Hive 2.3.4 and 1.2.1
● Rely on Hive’s own compatibility for other 2.x and 1.x
Supported Hive Versions

● Supports all Hive UDF interfaces via HiveCatalog
○ UDF
○ GenericUDF
○ UDTF
○ UDAF
○ GenericUDAFResolver
Support Hive UDF

● Can read/write non-partitioned Hive tables
● Can read partitioned Hive tables
● Supports partition-pruning
● Supports text, SequenceFile, ORC, Parquet
Hive Source and Sink

Example - Table API
TableEnvironment tEnv = ...
tEnv.registerCatalog(new HiveCatalog("myHive", "/opt/hive-conf/"));
tEnv.useCatalog("myHive");
tEnv.useDatabase("myDb");
// Read Hive meta-objects
Catalog myHive1 = tEnv.getCatalog("myHive1").get();
myHive1.listDatabases();
myHive1.listTables("myDb");
ObjectPath myTablePath = new ObjectPath("myDb", "myHiveTable");
myHive1.getTable(myTablePath);
myHive1.listPartitions(myTablePath);
// Query Hive data
tEnv.sqlQuery("select * from myHiveTable").print()

SQL Client Example
// Register catalogs in sql-cli-defaults.yml

SQL CLI Example (cont’)
Flink SQL> SHOW CATALOGS;
myhive1
default_catalog
Flink SQL> SHOW DATABASES;
myDb
Flink SQL> USE myhive1.myDb;
Flink SQL> SHOW TABLES;
myTable
Flink SQL> DRESCRIBE myHiveTable;
...
Flink SQL> SELECT * FROM myHiveTable;
...

● Integration with Hive was released in Flink 1.9 in Beta
● It lays the foundation for Flink’s integration with Hive
● This initiative led us to vast development and enhancement of Flink’s SQL stack and
metadata management capabilities
Summary

Flink 1.10 - What’s upcoming next?

Supports all Hive 1.2, 2.0, 2.1, 2.2, 2.3, 3.1 versions
● 1.2.0, 1.2.1
● 2.0.0, 2.0.1
● 2.1.0, 2.1.1
● 2.2.0
● 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.3.4, 2.3.5, 2.3.6
● 3.1.0, 3.1.1, 3.1.2
Support More Hive Versions

● HiveTableSource supports
○ projection pushdown
○ reading Hive views (in-progress)
● HiveTableSink supports
○ “INSERT OVERWRITE”
○ inserting into partitions, both dynamic and static
Hive Source and Sink Improvements

FLIP-57 Rework FunctionCatalog
● Problems to solve
○ Clarify and complete function categories
■ currently definition of temp functions is ambiguous
○ Enabling referencing functions with qualified names across catalog and database
○ Redefine function resolution order
https://cwiki.apache.org/confluence/display/FLINK/FLIP-57%3A+Rework+FunctionCatalog

Similarity:
● volatile, and lifespan within a session
Differences:
● Temp System Functions
○ has no namespace, can be referenced anywhere with function name
○ can override system/built-in functions
○ “CREATE TEMPORARY SYSTEM FUNCTION …”
● Temp Catalog Functions
○ has catalog/db namespaces
○ can override catalog functions
○ “CREATE TEMPORARY FUNCTION …”
Introducing Temp System Function v.s. Temp Catalog Function

Introducing Ambiguous v.s. Precise Function References
Ambiguous function reference Precise function reference (NEW!)
with only function name
SELECT <func>(col) FROM T
with fully or partially qualified name
SELECT <cat>.<db>.<func>(col) FROM T
SELECT <db>.<func>(col) FROM T
enables cross-catalog/db function reference
new resolution order:
1. temp system function
2. system(built-in) function
3. temp catalog function in current cat/db
4. catalog function in current cat/db
resolution order:
1. temp catalog function
2. catalog function

FLIP-68 Extend Core Table System with Pluggable Modules
● Motivations:
○ Enable users to integrate Flink with cores and built-in objects of other systems
■ Supports Hive built-in functions via pluggable modules
■ To name a few more upcoming - Geo and Machine Learning modules
○ Empower users to write code and do customized developement for Flink table core
https://cwiki.apache.org/confluence/display/FLINK/FLIP-68%3A+Extend+Core+Table+Sys
tem+with+Modular+Plugins

Full set of DDLs and other commands supported via **Unified** SQL parser
● CREATE/DROP/ALTER/RENAME
○ CATALOG/DATABASE/FUNCTION
○ TABLE
■ CREATE TABLE AS SELECT ...
User Facing Points
● SQL Client - already supported a few but not thru SQL parser
● Table APIs - TableEnvironment#sqlUpdate(“...”) and sqlQuery(“...”)
https://cwiki.apache.org/confluence/display/FLINK/FLIP+69+-+Flink+SQL+DDL+Enhancement
FLIP-69 DDL Enhancement in SQL and Table APIs

Beyond Flink 1.10
● More Hive SQL compatibilities to minimize migration efforts
● More user custom objects, like serdes, storage handlers
● Performance optimization
● Feature parity with Hive (bucketing, etc)
● Enterprise readiness - security, governess
● Regular maintenance and releases

Xuefu Zhang, Rui Li, Terry Wang, Timo Walter,
Dawid Wysakowicz, Kurt Young, Jingsong Lee, Jark Wu, etc
Thanks to Other Contributors

Flink’s integrating with Hive
● helps Flink to realize its potential in batch processing
● is a critical step for Flink towards unified data processing
● reshapes Flink’s metadata management capabilities
● enhances Flink’s SQL stack
● brings mass user base and lays the foundation for enterprise adoption
Conclusions

Flink and Hive integration - unifying enterprise data processing systems

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Flink and Hive integration - unifying enterprise data processing systems

Similar to Flink and Hive integration - unifying enterprise data processing systems (20)

More from Bowen Li

More from Bowen Li (11)

Recently uploaded

Recently uploaded (20)

Flink and Hive integration - unifying enterprise data processing systems