Integrating Apache Flink with Apache Hive
Xuefu Zhang, Senior Staff Engineer, Hive PMC
Bowen Li, Senior Engineer
Seattle Apache Flink Meetup - 02/21/2019
Agenda
● Background
● Motivations
● Goals
● Architecture and Design
● Roadmap
● Current Progress
● Demo
● Q&A
Background
● Stream analytics users usually also have offline, batch analytics
● Many batch users want to reduce latency by moving some of their analytics to
real-time
● AI is a major driving force behind both real-time and batch analytics (training
a model offline and applying it in real time)
● ETL is still an important use case for big data
● SQL is the main tool for processing big data, streaming or batch
Background (cont’d)
● Flink has shown prevailing advantages over other solutions for
high-volume stream processing
(Production-scale figures from the slide: EB of data in total, PB processed every day, 1.7B events/sec, 1T events/day)
Background (cont’d)
● In Blink, we systematically explored Flink’s capabilities in batch processing
● Flink shows great potential
TPC-DS: Blink vs. Spark (the lower, the better)
Observation: the larger the data size, the more performance advantage Flink has
Performance of Blink versus Spark 2.3.1 in the TPC-DS benchmark, aggregate time for all queries together.
Presentation by Xiaowei Jiang at Flink Forward Beijing, Dec 2018.
Flink is the fastest due to its pipelined execution
Tez and Spark do not overlap 1st and 2nd stages
MapReduce is slow despite overlapping stages
A Comparative Performance Evaluation of Flink, Dongwon Kim, POSTECH, Flink Forward 2015
Background (cont’d)
● Flink needs a persistent storage for its metadata
Background (cont’d)
● Hive is the de facto standard for big data/batch processing on Hadoop
● Hive is the center of the big data ecosystem with its metadata store
● Streaming users usually have a Hive deployment and need to access data/metadata managed by
Hive
● For Hive users, new requirements may come for stream processing
Integrating Flink with Hive
Motivations
● Strengthen Flink’s lead in stream processing
● Unify a solution for both stream and batch processing
● Provide a unified SQL interface for the solution
● Enrich Flink ecosystem
● Promote Flink’s adoption
Goals
● Access all Hive metadata stored in Hive metastore
● Access all Hive data managed by Hive
● Store Flink metadata (both streaming or batch) in Hive metastore
● Compatible with Hive grammar while abiding by the SQL standard
● Support custom objects defined in Hive so they continue to work in Flink SQL
○ UDFs
○ serdes
○ storage handlers
Goals (cont’d)
● Feature parity with Hive (partitioning, bucketing, etc)
● Data types compatibility
● Make Flink an alternative, more powerful engine in Hive (longer term)
Integrating Flink with Hive - Goals
Flink DML or Hive DML
Hive Metadata is 100% compatible
(Data types, Tables, Views, UDFs)
Hive Data is 100% accessible
Architecture
Flink Deployment
Flink Runtime
Query processing & optimization
Table API and SQL Catalog APIs
SQL Client/Zeppelin
Design
(Catalog class diagram, flattened.) The Catalog APIs define ReadableCatalog and, inheriting from it, ReadableWritableCatalog. GenericInMemoryCatalog, GenericHiveMetastoreCatalog, and HiveCatalog implement these APIs; the latter two share a HiveCatalogBase that talks to the Hive Metastore through HiveMetastoreClient. A CatalogManager manages the registered catalogs and is referenced by TableEnvironment and the SQL Client.
Design (cont’d)
(Data-access class diagram, flattened.) Read path: HiveTableFactory (a BatchTableFactory) creates a HiveTableSource (a BatchTableSource), which reads Hive data through HiveTableInputFormat (an InputFormat). Write path: the factory likewise creates a HiveTableSink (a BatchTableSink), which writes Hive data through HiveTableOutputFormat (an OutputFormat).
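The read/write wiring above can be sketched with toy interfaces. The interfaces below are illustrative stand-ins, not the real Flink API (the real versions carry schemas, input splits, and configuration):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Toy stand-ins for the interfaces named on this slide.
interface InputFormat { Iterator<String> open(); }
interface OutputFormat { void writeRecord(String record); }
interface BatchTableSource { InputFormat getInputFormat(); }
interface BatchTableSink { OutputFormat getOutputFormat(); }

// A Hive-flavored source/sink pair backed by an in-memory "table",
// standing in for files managed by Hive.
class HiveTableSketch implements BatchTableSource, BatchTableSink {
    private final List<String> rows = new ArrayList<>();

    // Read path: plays the role of HiveTableInputFormat.
    public InputFormat getInputFormat() {
        return () -> rows.iterator();
    }

    // Write path: plays the role of HiveTableOutputFormat.
    public OutputFormat getOutputFormat() {
        return rows::add;
    }
}
```

The point of the factory split on the slide is that the planner only sees the generic BatchTableSource/BatchTableSink interfaces, while the Hive-specific formats do the actual I/O.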
Roadmap
● Basic integration
○ Read access to Hive
○ Table metadata only
○ Simple data types
○ Demo version
Roadmap (cont’d)
● Deep integration
○ Read/Write Hive metadata, data
○ Most data types
○ Basic DDL/DML
○ All meta objects (tables, functions, views, etc)
○ MVP version
Roadmap (cont’d)
● Complete integration
○ Complete DDL/DML
○ All data types
○ Temporary meta objects (functions, tables)
○ Support all user custom objects defined in Hive (serdes, storage handlers)
○ Usability, stability
○ First release
Roadmap (cont’d)
● Longer term
○ Optimizations
○ Feature parity
○ Regular maintenance and releases
● Hive on Flink
Current Progress and Development Plan
Bowen Li
Integrating Flink with Hive - Phases
This is a major change, work needs to be broken into steps
Phase 1: Unified Catalog APIs (FLIP-30, FLINK-11275)
Phase 2: Integrate Flink with Hive (FLINK-10556)
● for metadata thru Hive Metastore (FLINK-10744)
● for data (FLINK-10729)
Phase 3: Support SQL DDL in Flink (FLINK-10232)
Phase 1: Unified Catalogs APIs
Flink current status:
Barely any catalog support; Cannot connect to external catalogs
What we have done:
Introduced new catalog APIs, ReadableCatalog and ReadableWritableCatalog, and a framework to
connect to Calcite, supporting:
● Meta-Objects
○ Database, Table, View, Partition, Functions, TableStats, and PartitionStats
● Operations
○ Get/List/Exist/Create/Alter/Rename/Drop
Status: Done internally, ready to submit PR to community
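The Get/List/Exist/Create/Drop surface can be sketched as follows. The signatures are hypothetical; the real FLIP-30 API also covers views, partitions, functions, and statistics:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical sketch of the two catalog interfaces from this slide.
interface ReadableCatalog {
    List<String> listDatabases();
    List<String> listTables(String db);
    boolean tableExists(String db, String table);
}

interface ReadableWritableCatalog extends ReadableCatalog {
    void createTable(String db, String table);
    void dropTable(String db, String table);
}

// Minimal in-memory implementation in the spirit of GenericInMemoryCatalog.
class InMemoryCatalogSketch implements ReadableWritableCatalog {
    private final Map<String, TreeSet<String>> dbs = new HashMap<>();

    public List<String> listDatabases() { return new ArrayList<>(dbs.keySet()); }
    public List<String> listTables(String db) {
        return new ArrayList<>(dbs.getOrDefault(db, new TreeSet<>()));
    }
    public boolean tableExists(String db, String table) {
        return dbs.getOrDefault(db, new TreeSet<>()).contains(table);
    }
    public void createTable(String db, String table) {
        dbs.computeIfAbsent(db, k -> new TreeSet<>()).add(table);
    }
    public void dropTable(String db, String table) {
        dbs.getOrDefault(db, new TreeSet<>()).remove(table);
    }
}
```

The read-only/read-write split lets a connector expose browsing (list/get) without having to support mutation.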
Phase 1: Unified Catalogs APIs
Flink current status:
No view support; views are currently treated as tables.
The function catalog doesn’t have a well-managed hierarchy, and cannot persist UDFs.
What we have done:
● Added true view support; users can create and query views on top of tables in Flink
● Introduced a new UDF management system with proper namespace and hierarchy for Flink,
based on ReadableCatalog/ReadableWritableCatalog
Status: Mostly done internally, ready to submit PR to community
Phase 1: Unified Catalogs APIs
Flink current status:
No well-structured hierarchy to manage metadata
Endlessly nested Calcite schemas (think of it as a db under a db under a db under …)
What we have done:
● Introduced two-level management and reference structure: <catalog>.<db>.<meta-object>
● Added CatalogManager:
○ manages all registered catalogs in TableEnvironment
○ has a concept of default catalog and default database to simplify queries
select * from mycatalog.mydb.myTable
can be simplified as
select * from myTable
Status: Done internally, ready to submit PR to community
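The default-catalog/default-database resolution above can be sketched like this (an illustrative resolver, not the real CatalogManager API):

```java
// Illustrative sketch of <catalog>.<db>.<meta-object> resolution with a
// default catalog and default database, as described on this slide.
class CatalogManagerSketch {
    private final String defaultCatalog = "mycatalog";
    private final String defaultDatabase = "mydb";

    // Expand a table reference to its fully qualified form.
    String resolve(String reference) {
        String[] parts = reference.split("\\.");
        switch (parts.length) {
            case 1:  // "myTable" -> default catalog + default db
                return defaultCatalog + "." + defaultDatabase + "." + parts[0];
            case 2:  // "someDb.myTable" -> default catalog
                return defaultCatalog + "." + parts[0] + "." + parts[1];
            case 3:  // already fully qualified
                return reference;
            default:
                throw new IllegalArgumentException("bad reference: " + reference);
        }
    }
}
```

So `select * from myTable` and `select * from mycatalog.mydb.myTable` resolve to the same object once the defaults are set.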
Phase 1: Unified Catalogs APIs
Flink current status:
No production-ready catalogs
What we have done:
Developed three production-ready catalogs
● GenericInMemoryCatalog
○ in-memory, non-persistent, per-session, the default
● HiveCatalog
○ compatible with Hive, able to read/write Hive meta-objects
● GenericHiveMetastoreCatalog
○ persists Flink streaming and batch meta-objects
Status: Done internally, ready to submit PR to community
Phase 1: Unified Catalogs APIs
Catalogs are pluggable, which opens opportunities for
● Catalogs for MQs
○ Kafka (Confluent Schema Registry), RabbitMQ, Pulsar, RocketMQ, etc
● Catalogs for structured data
○ RDBMS like MySQL, etc
● Catalogs for semi-structured data
○ Elasticsearch, HBase, Cassandra, etc
● Catalogs for your other favorite data management system
○ …
Phase 2: Flink-Hive Integration - Metadata - HiveCatalog
Flink can read and write Hive metadata thru HiveCatalog
Flink current status:
Flink’s batch is great, but cannot run against the most widely used data warehouse - Hive
What we have done:
Developed HiveCatalog
● Flink can read Hive meta-objects, like tables, views, functions, and table/partition stats, thru HiveCatalog
● Flink can create Hive meta-objects and write them back to Hive via HiveCatalog, so that Hive can consume them
Phase 2: Flink-Hive Integration - Metadata - GenericHiveMetastoreCatalog
Flink current status:
Flink’s metadata cannot be persisted anywhere
Users have to recreate metadata like tables/functions for every new session, which is very inconvenient
What we have done:
● Persist Flink’s metadata (both streaming and batch) by using Hive Metastore purely as storage
HiveCatalog v.s. GenericHiveMetastoreCatalog
● HiveCatalog
○ for Hive batch metadata
○ Hive can understand it
● GenericHiveMetastoreCatalog
○ for any streaming and batch metadata
○ Hive may not understand it
Phase 2: Flink-Hive Integration - Data
Flink current status:
Has a flink-hcatalog module, but it was last modified 2 years ago - not really usable.
HCatalog also cannot access all Hive data
What we have done:
Connector:
○ Developed HiveTableSource, which supports reading both partitioned and non-partitioned tables and
views, with partition pruning
○ Working on HiveTableSink
Data Types:
○ Added support for all Hive simple data types.
○ Working on supporting Hive complex data types (array, map, struct, etc)
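The simple-type support can be pictured as a name mapping like the one below. The pairs are assumptions for illustration; the authoritative mapping is defined by the Table API, not by this slide:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative mapping of Hive simple types to Flink SQL types.
// The pairs below are assumptions for this sketch, not the official table.
class HiveTypeMappingSketch {
    static final Map<String, String> HIVE_TO_FLINK = new LinkedHashMap<>();
    static {
        HIVE_TO_FLINK.put("TINYINT", "TINYINT");
        HIVE_TO_FLINK.put("SMALLINT", "SMALLINT");
        HIVE_TO_FLINK.put("INT", "INT");
        HIVE_TO_FLINK.put("BIGINT", "BIGINT");
        HIVE_TO_FLINK.put("FLOAT", "FLOAT");
        HIVE_TO_FLINK.put("DOUBLE", "DOUBLE");
        HIVE_TO_FLINK.put("BOOLEAN", "BOOLEAN");
        HIVE_TO_FLINK.put("STRING", "VARCHAR");
        HIVE_TO_FLINK.put("TIMESTAMP", "TIMESTAMP");
        HIVE_TO_FLINK.put("DATE", "DATE");
    }

    // Complex types (array, map, struct) are still in progress per the slide,
    // so this sketch rejects anything outside the simple-type table.
    static String toFlinkType(String hiveType) {
        String t = HIVE_TO_FLINK.get(hiveType.toUpperCase());
        if (t == null) {
            throw new UnsupportedOperationException(
                "complex or unsupported Hive type: " + hiveType);
        }
        return t;
    }
}
```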
● Hive version
○ Officially support Hive 2.3.4 for now
○ We plan to support more Hive versions in the near future
● Above features are partially available in the Blink branch of Flink released in Jan 2019
○ https://github.com/apache/flink/tree/blink
Phase 2: Flink-Hive Integration - Hive Compatibility
Phase 3: Support SQL DDL + DML in Flink
● In progress
● Some will be shown in demo
Example and Demo Time!
Query your Hive data from Flink!
Table API Example
BatchTableEnvironment tEnv = ...
tEnv.registerCatalog(new HiveCatalog("myHive1", "thrift:xxx"));
tEnv.registerCatalog(new HiveCatalog("myHive2", hiveConf));
tEnv.setDefaultDatabase("myHive1", "myDb");
// Read Hive meta-objects
ReadableCatalog myHive1 = tEnv.getCatalog("myHive1");
myHive1.listDatabases();
myHive1.listTables("myDb");
ObjectPath myTablePath = new ObjectPath("myDb", "myTable");
myHive1.getTable(myTablePath);
myHive1.listPartitions(myTablePath);
// Query Hive data
tEnv.sqlQuery("select * from myTable").print();
SQL Client Example
// Register catalogs in sql-cli-defaults.yml
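A hypothetical sketch of what that catalog section could look like. The key names below are assumptions modeled on later Flink SQL Client releases, not taken from the slides; check the docs for your Flink version:

```yaml
catalogs:
  - name: myhive1
    type: hive
    hive-conf-dir: /opt/hive/conf   # directory containing hive-site.xml
  - name: mygeneric
    type: generic_in_memory
```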
SQL Client Example (cont’)
Flink SQL> SHOW CATALOGS;
myhive1
mygeneric
Flink SQL> USE myhive1.myDb;
Flink SQL> SHOW DATABASES;
myDb
Flink SQL> SHOW TABLES;
myTable
Flink SQL> DESCRIBE myHiveTable;
...
Flink SQL> SELECT * FROM myHiveView;
...
Happy Live Demo on SQL CLI!
This tremendous amount of work could not happen without help and support.
Shout-out to everyone in the community and on our team
who has been helping us with designs, code, feedback, etc!
Conclusions
● Flink is good at stream processing, but batch processing is equally important
● Flink has shown its potential in batch processing
● Flink/Hive integration benefits both communities
● This is a big effort
● We are taking a phased approach
● Your contribution is greatly welcome and appreciated!
THANKS
https://www.meetup.com/seattle-apache-flink/

More Related Content

What's hot

Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & KafkaMohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Flink Forward
 
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache FlinkSuneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Flink Forward
 

What's hot (20)

Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
 
Stream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas WeiseStream Processing use cases and applications with Apache Apex by Thomas Weise
Stream Processing use cases and applications with Apache Apex by Thomas Weise
 
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & KafkaMohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
Mohamed Amine Abdessemed – Real-time Data Integration with Apache Flink & Kafka
 
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache FlinkSuneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
Suneel Marthi – BigPetStore Flink: A Comprehensive Blueprint for Apache Flink
 
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - HackacIntro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
Intro to Apache Apex - Next Gen Native Hadoop Platform - Hackac
 
Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...
Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...
Flink Forward SF 2017: Shaoxuan Wang_Xiaowei Jiang - Blinks Improvements to F...
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 
Ingesting Data from Kafka to JDBC with Transformation and Enrichment
Ingesting Data from Kafka to JDBC with Transformation and EnrichmentIngesting Data from Kafka to JDBC with Transformation and Enrichment
Ingesting Data from Kafka to JDBC with Transformation and Enrichment
 
SICS: Apache Flink Streaming
SICS: Apache Flink StreamingSICS: Apache Flink Streaming
SICS: Apache Flink Streaming
 
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache ApexApache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
Apache Big Data EU 2016: Next Gen Big Data Analytics with Apache Apex
 
From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017From Batch to Streaming with Apache Apex Dataworks Summit 2017
From Batch to Streaming with Apache Apex Dataworks Summit 2017
 
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...
Virtual Flink Forward 2020: How Streaming Helps Your Staging Environment and ...
 
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, UberKafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
Kafka Tiered Storage | Satish Duggana and Sriharsha Chintalapani, Uber
 
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
Flink Forward Berlin 2017 Keynote: Ferd Scheepers - Taking away customer fric...
 
Flink Apachecon Presentation
Flink Apachecon PresentationFlink Apachecon Presentation
Flink Apachecon Presentation
 
Marton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream ProcessingMarton Balassi – Stateful Stream Processing
Marton Balassi – Stateful Stream Processing
 
Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...
Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...
Virtual Flink Forward 2020: Production-Ready Flink and Hive Integration - wha...
 
Introduction to Apache Apex
Introduction to Apache ApexIntroduction to Apache Apex
Introduction to Apache Apex
 
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
Large-Scale Stream Processing in the Hadoop Ecosystem - Hadoop Summit 2016
 
Maximilian Michels - Flink and Beam
Maximilian Michels - Flink and BeamMaximilian Michels - Flink and Beam
Maximilian Michels - Flink and Beam
 

Similar to Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019

HBaseCon2015-final
HBaseCon2015-finalHBaseCon2015-final
HBaseCon2015-final
Maryann Xue
 

Similar to Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019 (20)

Integrating Flink with Hive - Flink Forward SF 2019
Integrating Flink with Hive - Flink Forward SF 2019Integrating Flink with Hive - Flink Forward SF 2019
Integrating Flink with Hive - Flink Forward SF 2019
 
Flink and Hive integration - unifying enterprise data processing systems
Flink and Hive integration - unifying enterprise data processing systemsFlink and Hive integration - unifying enterprise data processing systems
Flink and Hive integration - unifying enterprise data processing systems
 
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022
Why Wait? Realtime Ingestion With Chen Qin and Heng Zhang | Current 2022
 
Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...Building a fully managed stream processing platform on Flink at scale for Lin...
Building a fully managed stream processing platform on Flink at scale for Lin...
 
Stinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of HortonworksStinger.Next by Alan Gates of Hortonworks
Stinger.Next by Alan Gates of Hortonworks
 
Flink in Zalando's world of Microservices
Flink in Zalando's world of Microservices   Flink in Zalando's world of Microservices
Flink in Zalando's world of Microservices
 
Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices  Flink in Zalando's World of Microservices
Flink in Zalando's World of Microservices
 
Apache Flink Online Training
Apache Flink Online TrainingApache Flink Online Training
Apache Flink Online Training
 
Apache flink
Apache flinkApache flink
Apache flink
 
Apache flink
Apache flinkApache flink
Apache flink
 
Apache flink
Apache flinkApache flink
Apache flink
 
OpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit LondonOpenLineage for Stream Processing | Kafka Summit London
OpenLineage for Stream Processing | Kafka Summit London
 
Unified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache FlinkUnified Batch and Real-Time Stream Processing Using Apache Flink
Unified Batch and Real-Time Stream Processing Using Apache Flink
 
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System OverviewApache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
Apache Flink Training Workshop @ HadoopCon2016 - #1 System Overview
 
DevOps Fest 2020. Даніель Яворович. Data pipelines: building an efficient ins...
DevOps Fest 2020. Даніель Яворович. Data pipelines: building an efficient ins...DevOps Fest 2020. Даніель Яворович. Data pipelines: building an efficient ins...
DevOps Fest 2020. Даніель Яворович. Data pipelines: building an efficient ins...
 
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics FrameworkOverview of Apache Flink: Next-Gen Big Data Analytics Framework
Overview of Apache Flink: Next-Gen Big Data Analytics Framework
 
HBaseCon2015-final
HBaseCon2015-finalHBaseCon2015-final
HBaseCon2015-final
 
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
HBaseCon 2015: Apache Phoenix - The Evolution of a Relational Database Layer ...
 
Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015Bay Area Apache Flink Meetup Community Update August 2015
Bay Area Apache Flink Meetup Community Update August 2015
 
LLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in HiveLLAP: Sub-Second Analytical Queries in Hive
LLAP: Sub-Second Analytical Queries in Hive
 

More from Bowen Li

More from Bowen Li (12)

Apache Flink 101 - the rise of stream processing and beyond
Apache Flink 101 - the rise of stream processing and beyondApache Flink 101 - the rise of stream processing and beyond
Apache Flink 101 - the rise of stream processing and beyond
 
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen LiTowards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li
Towards Apache Flink 2.0 - Unified Data Processing and Beyond, Bowen Li
 
How to contribute to Apache Flink @ Seattle Flink meetup
How to contribute to Apache Flink @ Seattle Flink meetupHow to contribute to Apache Flink @ Seattle Flink meetup
How to contribute to Apache Flink @ Seattle Flink meetup
 
Community update on flink 1.9 and How to Contribute to Flink
Community update on flink 1.9 and How to Contribute to FlinkCommunity update on flink 1.9 and How to Contribute to Flink
Community update on flink 1.9 and How to Contribute to Flink
 
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
Tensorflow data preparation on Apache Beam using Portable Flink Runner, Ankur...
 
Community and Meetup Update, Seattle Flink Meetup, Feb 2019
Community and Meetup Update, Seattle Flink Meetup, Feb 2019Community and Meetup Update, Seattle Flink Meetup, Feb 2019
Community and Meetup Update, Seattle Flink Meetup, Feb 2019
 
Status Update of Seattle Flink Meetup, Jun 2018
Status Update of Seattle Flink Meetup, Jun 2018Status Update of Seattle Flink Meetup, Jun 2018
Status Update of Seattle Flink Meetup, Jun 2018
 
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
Streaming at Lyft, Gregory Fee, Seattle Flink Meetup, Jun 2018
 
Approximate queries and graph streams on Flink, theodore vasiloudis, seattle...
Approximate queries and graph streams on Flink, theodore vasiloudis,  seattle...Approximate queries and graph streams on Flink, theodore vasiloudis,  seattle...
Approximate queries and graph streams on Flink, theodore vasiloudis, seattle...
 
Stream processing with Apache Flink @ OfferUp
Stream processing with Apache Flink @ OfferUpStream processing with Apache Flink @ OfferUp
Stream processing with Apache Flink @ OfferUp
 
Apache Flink @ Alibaba - Seattle Apache Flink Meetup
Apache Flink @ Alibaba - Seattle Apache Flink MeetupApache Flink @ Alibaba - Seattle Apache Flink Meetup
Apache Flink @ Alibaba - Seattle Apache Flink Meetup
 
Opening - Seattle Apache Flink Meetup
Opening - Seattle Apache Flink MeetupOpening - Seattle Apache Flink Meetup
Opening - Seattle Apache Flink Meetup
 

Recently uploaded

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Safe Software
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
WSO2
 

Recently uploaded (20)

Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot ModelNavi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Navi Mumbai Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 

Integrating Flink with Hive, Seattle Flink Meetup, Feb 2019

  • 1. Integrating Apache Flink with Apache Hive Xuefu Zhang, Senior Staff Engineer, Hive PMC Bowen Li, Senior Engineer Seattle Apache Flink Meetup - 02/21/2019
  • 2. Agenda ● Background ● Motivations ● Goals ● Architecture and Design ● Roadmap ● Current Progress ● Demo ● Q&A
  • 3. Background ● Stream analytic users usually have also offline, batch analytics ● Many batch users want to reduce latency by moving some of analytics to real-time ● AI is a major driving force behind both real-time and batch analytics (training a model offline and apply a model in real time) ● ETL is still an important use case for big data ● SQL is the main tool processing big data, streaming or batch
  • 4. Background (cont’d) ● Flink has showed prevailing advantages over other solutions for heavy-volume stream processing
  • 5. 1.7B Events/secEB Total PB Everyday 1T Event/Day
  • 6. Background (cont’d) ● In Blink, we systematically explored Flink’s capabilities in batch processing ● Flink shows great potential
  • 7. TPC-DS: Blink v.s. Spark (the Lower, the Better) Observation: the larger the data size, the more performance advantage Flink has Performance of Blink versus Spark 2.3.1 in the TPC-DS benchmark, aggregate time for all queries together. Presentation by Xiaowei Jiang at Flink Forward Beijing, Dec 2018.
  • 8. Flink is the fastest due to its pipelined execution Tez and Spark do not overlap 1st and 2nd stages MapReduce is slow despite overlapping stages A Comparative Performance Evaluation of Flink, Dongwon Kim, POSTECH, Flink Forward 2015
  • 9. Background (cont’d) ● Flink needs a persistent storage for its metadata
  • 10. Background (cont’d) ● Hive is the de facto standard for big data/batch processing on Hadoop ● Hive is the center of big data ecosystem with its metadata store ● Streaming users usually have Hive deployment and need to access data/metadata managed by Hive ● For Hive users, new requirements may come for stream processing
  • 12. Motivations ● Strengthen Flink’s lead in stream processing ● Unify a solution for both stream and batch processings ● Provide a unified SQL interface for the solution ● Enrich Flink ecosystem ● Promote Flink’s adoption
  • 13. Goals ● Access all Hive metadata stored in Hive metastore ● Access all Hive data managed by Hive ● Store Flink metadata (both streaming or batch) in Hive metastore ● Compatible with Hive grammar while abiding to SQL standard ● Support user custom objects in Hive continue to work in Flink SQL ○ UDFs ○ serdes ○ storage handlers
  • 14. Goals (cont’d) ● Feature parity with Hive (paritioning, bucketing, etc) ● Data types compatibility ● Make Flink an alternative, more powerful engine in Hive (longer term)
  • 15. Integrating Flink with Hive - Goals Flink DML or Hive DML Hive Metadata is 100% compatible (Data types, Tables, Views, UDFs) Hive Data is 100% accessible
  • 16. Arichtecture Flink Deployment Flink Runtime Query processing & optimization Table API and SQL Catalog APIs SQL Client/Zeppelin
  • 19. Roadmap ● Basic integration ○ Read access to Hive ○ Table metadata only ○ Simple data types ○ Demo version
  • 20. Roadmap (cont’d) ● Deep integration ○ Read/Write Hive metadata, data ○ Most data types ○ Basic DDL/DML ○ All meta objects (tables, functions, views, etc) ○ MVP version
  • 21. Roadmap (cont’d) ● Complete integration ○ Complete DDL/DML ○ All data types ○ Temporary meta objects (functions, tables) ○ Support all user custom objects defined in Hive (serdes, storage handlers) ○ Usability, stability ○ First release
  • 22. Roadmap (cont’d) ● Longer term ○ Optimizations ○ Feature parity ○ Regular maintenance and releases ● Hive on Flink
  • 23. Current Progress and Development Plan Bowen Li
  • 24. Integrating Flink with Hive - Phases This is a major change, so the work needs to be broken into steps. Phase 1: Unified Catalog APIs (FLIP-30, FLINK-11275) Phase 2: Integrate Flink with Hive (FLINK-10556) ● for metadata through the Hive Metastore (FLINK-10744) ● for data (FLINK-10729) Phase 3: Support SQL DDL in Flink (FLINK-10232)
  • 25. Phase 1: Unified Catalog APIs Flink current status: barely any catalog support; cannot connect to external catalogs What we have done: introduced new catalog APIs ReadableCatalog and ReadableWritableCatalog, plus a framework that connects them to Calcite, supporting ● Meta-objects ○ Database, Table, View, Partition, Functions, TableStats, and PartitionStats ● Operations ○ Get/List/Exist/Create/Alter/Rename/Drop Status: Done internally, ready to submit PR to community
  • 26. Phase 1: Unified Catalog APIs Flink current status: No true view support; views are currently treated as tables. The function catalog doesn’t have a well-managed hierarchy, and cannot persist UDFs. What we have done: ● Added true view support, so users can create and query views on top of tables in Flink ● Introduced a new UDF management system with proper namespace and hierarchy for Flink, based on ReadableCatalog/ReadableWritableCatalog Status: Mostly done internally, ready to submit PR to community
  • 27. Phase 1: Unified Catalog APIs Flink current status: No well-structured hierarchy to manage metadata; endlessly nested Calcite schemas (think of it as a db under a db under a db under …) What we have done: ● Introduced a two-level management and reference structure: <catalog>.<db>.<meta-object> ● Added CatalogManager: ○ manages all registered catalogs in TableEnvironment ○ has a concept of a default catalog and default database to simplify queries, e.g. "select * from mycatalog.mydb.myTable" can be simplified to "select * from myTable" Status: Done internally, ready to submit PR to community
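The name-resolution idea above can be sketched in a few lines of plain Java. This is a minimal, self-contained illustration of how a default catalog and default database let users abbreviate object paths; it is not Flink's actual CatalogManager code, and the class name is hypothetical.

```java
// Hypothetical sketch of default-catalog/default-database path resolution.
// A 1-part name gets both defaults, a 2-part name gets the default catalog,
// and a 3-part name is already fully qualified.
class PathResolver {
    private final String defaultCatalog;
    private final String defaultDatabase;

    PathResolver(String defaultCatalog, String defaultDatabase) {
        this.defaultCatalog = defaultCatalog;
        this.defaultDatabase = defaultDatabase;
    }

    /** Expands "table", "db.table", or "catalog.db.table" to a full 3-part path. */
    String resolve(String path) {
        String[] parts = path.split("\\.");
        switch (parts.length) {
            case 1: return defaultCatalog + "." + defaultDatabase + "." + parts[0];
            case 2: return defaultCatalog + "." + parts[0] + "." + parts[1];
            case 3: return path;
            default: throw new IllegalArgumentException("Invalid path: " + path);
        }
    }
}
```

With the default catalog set to `mycatalog` and the default database to `mydb`, `myTable` resolves to `mycatalog.mydb.myTable`, which is exactly the simplification shown on the slide.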
  • 28. Phase 1: Unified Catalog APIs Flink current status: No production-ready catalogs What we have done: Developed three production-ready catalogs ● GenericInMemoryCatalog ○ in-memory, non-persistent, per session, the default ● HiveCatalog ○ compatible with Hive, able to read/write Hive meta-objects ● GenericHiveMetastoreCatalog ○ persists Flink streaming and batch meta-objects Status: Done internally, ready to submit PR to community
  • 29. Phase 1: Unified Catalog APIs Catalogs are pluggable, which opens opportunities for ● catalogs for MQs ○ Kafka (Confluent Schema Registry), RabbitMQ, Pulsar, RocketMQ, etc. ● catalogs for structured data ○ RDBMSs like MySQL, etc. ● catalogs for semi-structured data ○ Elasticsearch, HBase, Cassandra, etc. ● catalogs for your other favorite data management systems
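To make the pluggability point concrete, here is a deliberately simplified sketch in plain Java: any backing store (Hive metastore, a schema registry, an RDBMS, ...) can be exposed to the planner by implementing one small interface. The interface and class names here are hypothetical stand-ins, much smaller than Flink's actual ReadableCatalog/ReadableWritableCatalog.

```java
import java.util.*;

// Hypothetical, stripped-down catalog interface: each external system
// provides its own implementation of these lookups.
interface SimpleCatalog {
    List<String> listDatabases();
    List<String> listTables(String database);
    boolean tableExists(String database, String table);
}

// A trivial in-memory implementation, analogous in spirit to
// GenericInMemoryCatalog: non-persistent, per session.
class InMemoryCatalog implements SimpleCatalog {
    private final Map<String, Set<String>> tablesByDb = new HashMap<>();

    void createTable(String database, String table) {
        tablesByDb.computeIfAbsent(database, d -> new TreeSet<>()).add(table);
    }

    @Override public List<String> listDatabases() {
        return new ArrayList<>(new TreeSet<>(tablesByDb.keySet()));
    }

    @Override public List<String> listTables(String database) {
        return new ArrayList<>(tablesByDb.getOrDefault(database, Collections.emptySet()));
    }

    @Override public boolean tableExists(String database, String table) {
        return tablesByDb.getOrDefault(database, Collections.emptySet()).contains(table);
    }
}
```

A Kafka or MySQL catalog would implement the same interface but answer the lookups by querying the external system instead of a map.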
  • 30. Phase 2: Flink-Hive Integration - Metadata - HiveCatalog Flink current status: Flink’s batch is great, but it cannot run against the most widely used data warehouse - Hive What we have done: Developed HiveCatalog, so Flink can read and write Hive metadata ● Flink can read Hive meta-objects, like tables, views, functions, and table/partition stats, through HiveCatalog ● Flink can create Hive meta-objects and write them back to Hive via HiveCatalog such that Hive can consume them
  • 31. Phase 2: Flink-Hive Integration - Metadata - GenericHiveMetastoreCatalog Flink current status: Flink’s metadata cannot be persisted anywhere; users have to recreate metadata like tables/functions for every new session, which is very inconvenient What we have done: ● Persist Flink’s metadata (both streaming and batch) by using the Hive Metastore purely as storage
  • 32. HiveCatalog vs. GenericHiveMetastoreCatalog ● HiveCatalog ○ for Hive batch metadata ○ Hive can understand it ● GenericHiveMetastoreCatalog ○ for any streaming and batch metadata ○ Hive may not understand it
  • 33. Phase 2: Flink-Hive Integration - Data Flink current status: Has a flink-hcatalog module, but it was last modified 2 years ago and is not really usable; HCatalog also cannot access all Hive data What we have done: Connector: ○ Developed HiveTableSource, which supports reading both partitioned and non-partitioned tables and views, with partition pruning ○ Working on HiveTableSink Data types: ○ Added support for all Hive simple data types ○ Working on support for Hive complex data types (array, map, struct, etc.)
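Partition pruning, mentioned above, means checking each partition's key/value spec against the query's filter before reading any files, so non-matching partitions are skipped entirely. A minimal plain-Java sketch of the idea (not HiveTableSource's actual code; the class name is hypothetical, and only exact-match filters are handled):

```java
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical sketch of partition pruning: each partition is described by a
// spec like {year=2019, month=01}; a partition survives only if its spec
// matches every key=value pair in the filter.
class PartitionPruner {
    static List<Map<String, String>> prune(
            List<Map<String, String>> partitions, Map<String, String> filter) {
        return partitions.stream()
                .filter(spec -> filter.entrySet().stream()
                        .allMatch(e -> e.getValue().equals(spec.get(e.getKey()))))
                .collect(Collectors.toList());
    }
}
```

For a query like `SELECT * FROM t WHERE year = '2019'`, only the partitions whose spec contains `year=2019` would be scanned; an empty filter keeps all partitions.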
  • 34. Phase 2: Flink-Hive Integration - Hive Compatibility ● Hive version ○ Officially support Hive 2.3.4 for now ○ We plan to support more Hive versions in the near future ● The above features are partially available in the Blink branch of Flink released in Jan 2019 ○ https://github.com/apache/flink/tree/blink
  • 35. Phase 3: Support SQL DDL + DML in Flink ● In progress ● Some of it will be shown in the demo
  • 36. Example and Demo Time! Query your Hive data from Flink!
  • 37. Table API Example
BatchTableEnvironment tEnv = ...

tEnv.registerCatalog(new HiveCatalog("myHive1", "thrift:xxx"));
tEnv.registerCatalog(new HiveCatalog("myHive2", hiveConf));
tEnv.setDefaultDatabase("myHive1", "myDb");

// Read Hive meta-objects
ReadableCatalog myHive1 = tEnv.getCatalog("myHive1");
myHive1.listDatabases();
myHive1.listTables("myDb");

ObjectPath myTablePath = new ObjectPath("myDb", "myTable");
myHive1.getTable(myTablePath);
myHive1.listPartitions(myTablePath);

// Query Hive data
tEnv.sqlQuery("select * from myTable").print();
  • 38. SQL Client Example // Register catalogs in sql-cli-defaults.yml
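The slide itself shows the YAML file only as a screenshot. As a purely illustrative sketch, a catalog section registering the two catalogs used on the next slide (`myhive1`, `mygeneric`) might look roughly like the following; the exact key names in the Blink branch may differ, and the `hive-conf-dir` path is hypothetical:

```yaml
# Hypothetical sketch of a catalogs section in the SQL client config
catalogs:
  - name: myhive1
    type: hive
    hive-conf-dir: /opt/hive/conf   # directory containing hive-site.xml
  - name: mygeneric
    type: generic_in_memory
```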
  • 39. SQL Client Example (cont’d)
Flink SQL> SHOW CATALOGS;
myhive1
mygeneric
Flink SQL> USE myhive1.myDb;
Flink SQL> SHOW DATABASES;
myDb
Flink SQL> SHOW TABLES;
myTable
Flink SQL> DESCRIBE myHiveTable;
...
Flink SQL> SELECT * FROM myHiveView;
...
  • 40. Happy Live Demo on SQL CLI!
  • 41. This tremendous amount of work could not happen without help and support. Shout out to everyone in the community and on our team who has been helping us with designs, code, feedback, etc.!
  • 42. Conclusions ● Flink is good at stream processing, but batch processing is equally important ● Flink has shown its potential in batch processing ● Flink/Hive integration benefits both communities ● This is a big effort ● We are taking a phased approach ● Your contribution is greatly welcome and appreciated!