SlideShare a Scribd company logo
Connector Tips & Tricks
Eron Wright, Dell EMC
@eronwright
@ 2019 Dell EMC
Who am I?
● Tech Staff at Dell EMC
● Contributor to Pravega stream storage system
○ Dynamically-sharded streams
○ Event-time tracking
○ Transaction support
● Maintainer of Flink connector for Pravega
Overview
Topics:
● Connector Basics
● Table Connectors
● Event Time
● State & Fault Tolerance
Connector Basics
Developing a Connector
● Applications take an explicit dependency on a connector
○ Not generally built-in to the Flink environment
○ Treated as a normal application dependency
○ Consider shading and relocating your connector’s dependencies
● Possible connector repositories:
○ Apache Flink repository
○ Apache Bahir (for Flink) repository
○ Your own repository
Types of Flink Connectors
● Streaming Connectors
○ Provide sources and/or sinks
○ Sources may be bounded or unbounded
● Batch Connectors
○ Not discussed here
● Table Connectors
○ Provide tables which act as sources, sinks, or both
○ Unifies the batch and streaming programming model
○ Typically relies on a streaming and/or batch connector under the hood
○ A table’s update mode determines how a table is converted to/from a stream
■ Append Mode, Retract Mode, Upsert Mode
Key Challenges
● How to parallelize your data source/sink
○ Subdivide the source data amongst operator subtasks, e.g. by partition
○ Support parallelism changes
● How to provide fault tolerance
○ Provide exactly-once semantics
○ Support coarse- and fine-grained recovery for failed tasks
○ Support Flink checkpoints and savepoints
● How to support historical and real-time processing
○ Facilitate correct program output
○ Support event time semantics
● Security considerations
○ Safeguarding secrets
Connector Lifecycle
● Construction
○ Instantiated in the driver program (i.e. main method); must be serializable
○ Use the builder pattern to provide a DSL for your connector
○ Avoid making connections if possible
● State Initialization
○ Separate configuration from state
● Run
○ Supports both unbounded and bounded sources
● Cancel / Stop
○ Supports graceful termination (w/ savepoint)
○ May advance the event time clock to the end-of-time (MAX_WATERMARK)
Connector Lifecycle (con’t)
● Advanced: Initialize/Finalize on Job Master
○ Exclusively for OutputFormat (e.g.. file-based sinks)
○ Implement InitializeOnMaster, FinalizeOnMaster, and CleanupWhenUnsuccessful
○ Support for Steaming API added in Flink 1.9; see FLINK-1722
User-Defined Data Types
● Connectors are typically agnostic to the record data type
○ Expects application to supply type information w/ serializer
● For sources:
○ Accept a DeserializationSchema<T>
○ Implement ResultTypeQueryable<T>
● For sinks:
○ Accept a SerializationSchema<T>
● First-class support for Avro, Parquet, JSON
○ Geared towards Flink Table API
Connector Metrics
● Flink exposes a metric system for gathering and reporting metrics
○ Reporters: Flink UI, JMX, InfluxDB, Prometheus, ...
● Use the metric API in your connector to expose relevant metric data
○ Types: counters, gauges, histograms, meters
● Metrics are tracked on a per-subtask basis
● More information:
○ Flink Documentation / Debugging & Monitoring / Metrics
Connector Security
● Credentials are typically passed as ordinary program parameters
○ Beware lack of isolation between jobs in a given cluster
● Flink does have first-class support for Kerberos credentials
○ Based on keytabs (in support of long-running jobs)
○ Expects connector to use a named JAAS context
○ See: Kerberos Authentication Setup and Configuration
Table API
Summary
● The Table API is evolving rapidly
○ For new connectors, focus on supporting the Blink planner
● Table sources and sinks are generally built upon the DataStream API
● Two configuration styles - typed DSL and string-based properties
● Table formats are connector-independent
○ E.g. CSV, JSON, Avro
● A catalog encapsulates a collection of tables, views, and functions
○ Provides convenience and interactivity
● More information:
○ Docs: User-Defined Sources & Sinks
Event Time Support
Key Considerations
● Connectors play an critical role in program correctness
○ Connector internals influence the order-of-observation (in event time) and hence the practicality of
watermark generation
○ Connectors exhibit different behavior in historical vs real-time processing
● Event time skew leads to excess buffering and hence inefficiency
● There’s an inherent trade-off between latency and complexity
Global Watermark Tracking
● Flink 1.9 has a facility for tracking a global aggregate value across sub-tasks
○ Ideal for establishing a global minimum watermark
○ See StreamingRuntimeContext#getGlobalAggregateManager
● Most useful in highly dynamic sources
○ Compensates for impact of resharding, rebalancing on event time
○ Increases latency
● See Kinesis connector’s JobManagerWatermarkTracker
Source Idleness
● Downstream tasks depend on arrival of watermarks from all sub-tasks
○ Beware stalling the pipeline
● A sub-task may remove itself from consideration by idling
○ i.e. “release the hold on the event time clock”
● A source should be idled mainly for semantic reasons
○
Sink Watermark Propagation
● Consider the possibility of watermark propagation across jobs
○ Propagate upstream watermarks along with output records
○ Job 1 → (external system) → Job 2
● Sink function does have access to current watermark
○ But only when processing an input record 😞
● Solution: event-time timers
○ Chain a ProcessFunction and corresponding SinkFunction, or develop a custom operator
Practical Suggestions
● Provide an API to assign timestamps and to generate watermarks
○ Strive to isolate system internals, e.g. apply the watermark generator on a per-partition basis
○ Aggregate the watermarks into a per-subtask or global watermark
● Strive to minimize event time ‘skew’ across subtasks
○ Strategy: prioritize oldest data and pause ingestion of partitions that are too far ahead
○ See FLINK-10886 for improvements to Kinesis, Kafka connectors
● Remember: the goal is not a total ordering of elements (in event time)
State & Fault Tolerance
Working with State
● Sources are typically stateful, e.g.
○ partition assignment to sub-tasks
○ position tracking
● Use managed operator state to track redistributable units of work
○ List state - a list of redistributable elements (e.g. partitions w/ current position index)
○ Union list state - a variation where each sub-task gets the complete list of elements
● Various interfaces:
○ CheckpointedFunction - most powerful
○ ListCheckpointed - limited but convenient
○ CheckpointListener - to observe checkpoint completion (e.g. for 2PC)
Exactly-Once Semantics
● Definition: evolution of state is based on a single observation of a given element
● Writes to external systems are ideally idempotent
● For sinks, Flink provides a few building blocks:
○ TwoPhaseCommitSinkFunction - base class providing a transaction-like API (but not storage)
○ GenericWriteAheadSink - implements a WAL using the state backend (see: CassandraSink)
○ CheckpointCommitter - stores information about completed checkpoints
● Savepoints present various complications
○ User may opt to resume from any prior checkpoint, not just the most recent checkpoint
○ The connector may be reconfigured w/ new inputs and/or outputs
Advanced: Externally-Induced Sources
● Flink is still in control of initiating the overall checkpoint, with a twist!
● It allows a source to control the checkpoint barrier insertion point
○ E.g. based on incoming data or external coordination
● Hooks into the checkpoint coordinator on the master
○ Flink → Hook → External System → Sub-task
● See:
○ ExternallyInducedSource
○ WithMasterCheckpointHook
Thank You!
● Feedback welcome (e.g. via the FF app)
● See me at the Speaker’s Lounge

More Related Content

What's hot

Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Flink Taiwan User Group
 
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Flink Forward
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamAljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Ververica
 
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward
 
Open Source Serverless: a practical view. - Gabriele Provinciali Luca Postacc...
Open Source Serverless: a practical view. - Gabriele Provinciali Luca Postacc...Open Source Serverless: a practical view. - Gabriele Provinciali Luca Postacc...
Open Source Serverless: a practical view. - Gabriele Provinciali Luca Postacc...
Codemotion
 
Stream Loops on Flink - Reinventing the wheel for the streaming era
Stream Loops on Flink - Reinventing the wheel for the streaming eraStream Loops on Flink - Reinventing the wheel for the streaming era
Stream Loops on Flink - Reinventing the wheel for the streaming era
Paris Carbone
 

What's hot (6)

Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
Apache Software Foundation: How To Contribute, with Apache Flink as Example (...
 
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
Francesco Versaci - Flink in genomics - efficient and scalable processing of ...
 
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache BeamAljoscha Krettek - Portable stateful big data processing in Apache Beam
Aljoscha Krettek - Portable stateful big data processing in Apache Beam
 
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
Flink Forward SF 2017: Stefan Richter - Improvements for large state and reco...
 
Open Source Serverless: a practical view. - Gabriele Provinciali Luca Postacc...
Open Source Serverless: a practical view. - Gabriele Provinciali Luca Postacc...Open Source Serverless: a practical view. - Gabriele Provinciali Luca Postacc...
Open Source Serverless: a practical view. - Gabriele Provinciali Luca Postacc...
 
Stream Loops on Flink - Reinventing the wheel for the streaming era
Stream Loops on Flink - Reinventing the wheel for the streaming eraStream Loops on Flink - Reinventing the wheel for the streaming era
Stream Loops on Flink - Reinventing the wheel for the streaming era
 

Similar to Flink Connector Development Tips & Tricks

Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Timo Walther
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)
KafkaZone
 
Flux architecture and Redux - theory, context and practice
Flux architecture and Redux - theory, context and practiceFlux architecture and Redux - theory, context and practice
Flux architecture and Redux - theory, context and practice
Jakub Kocikowski
 
Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos...
Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos...Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos...
Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos...
HostedbyConfluent
 
Timing is Everything: Understanding Event-Time Processing in Flink SQL
Timing is Everything: Understanding Event-Time Processing in Flink SQLTiming is Everything: Understanding Event-Time Processing in Flink SQL
Timing is Everything: Understanding Event-Time Processing in Flink SQL
HostedbyConfluent
 
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
NETWAYS
 
Apache flink
Apache flinkApache flink
Apache flink
pranay kumar
 
Running Flink in Production: The good, The bad and The in Between - Lakshmi ...
Running Flink in Production:  The good, The bad and The in Between - Lakshmi ...Running Flink in Production:  The good, The bad and The in Between - Lakshmi ...
Running Flink in Production: The good, The bad and The in Between - Lakshmi ...
Flink Forward
 
Airflow Intro-1.pdf
Airflow Intro-1.pdfAirflow Intro-1.pdf
Airflow Intro-1.pdf
BagustTriCahyo1
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
datamantra
 
Network Statistics for OpenFlow
Network Statistics for OpenFlowNetwork Statistics for OpenFlow
Network Statistics for OpenFlow
Miro Cupak
 
Airflow 101
Airflow 101Airflow 101
Airflow 101
SaarBergerbest
 
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and KibanaBuilding a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Mushfekur Rahman
 
Serverless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsServerless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar Functions
StreamNative
 
Log Event Stream Processing In Flink Way
Log Event Stream Processing In Flink WayLog Event Stream Processing In Flink Way
Log Event Stream Processing In Flink Way
George T. C. Lai
 
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward
 
Coordination in distributed systems
Coordination in distributed systemsCoordination in distributed systems
Coordination in distributed systems
Andrea Monacchi
 
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward
 
When Streaming Needs Batch With Konstantin Knauf | Current 2022
When Streaming Needs Batch With Konstantin Knauf | Current 2022When Streaming Needs Batch With Konstantin Knauf | Current 2022
When Streaming Needs Batch With Konstantin Knauf | Current 2022
HostedbyConfluent
 
Dynamic log processing with fluentd and konfigurator
Dynamic log processing with fluentd and konfiguratorDynamic log processing with fluentd and konfigurator
Dynamic log processing with fluentd and konfigurator
Ahmad Iqbal Ali
 

Similar to Flink Connector Development Tips & Tricks (20)

Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
Introduction to Stream Processing with Apache Flink (2019-11-02 Bengaluru Mee...
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)
 
Flux architecture and Redux - theory, context and practice
Flux architecture and Redux - theory, context and practiceFlux architecture and Redux - theory, context and practice
Flux architecture and Redux - theory, context and practice
 
Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos...
Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos...Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos...
Getting Data In and Out of Flink - Understanding Flink and Its Connector Ecos...
 
Timing is Everything: Understanding Event-Time Processing in Flink SQL
Timing is Everything: Understanding Event-Time Processing in Flink SQLTiming is Everything: Understanding Event-Time Processing in Flink SQL
Timing is Everything: Understanding Event-Time Processing in Flink SQL
 
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
OSMC 2018 | Stream connector: Easily sending events and/or metrics from the C...
 
Apache flink
Apache flinkApache flink
Apache flink
 
Running Flink in Production: The good, The bad and The in Between - Lakshmi ...
Running Flink in Production:  The good, The bad and The in Between - Lakshmi ...Running Flink in Production:  The good, The bad and The in Between - Lakshmi ...
Running Flink in Production: The good, The bad and The in Between - Lakshmi ...
 
Airflow Intro-1.pdf
Airflow Intro-1.pdfAirflow Intro-1.pdf
Airflow Intro-1.pdf
 
Introduction to Flink Streaming
Introduction to Flink StreamingIntroduction to Flink Streaming
Introduction to Flink Streaming
 
Network Statistics for OpenFlow
Network Statistics for OpenFlowNetwork Statistics for OpenFlow
Network Statistics for OpenFlow
 
Airflow 101
Airflow 101Airflow 101
Airflow 101
 
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and KibanaBuilding a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
Building a Unified Logging Layer with Fluentd, Elasticsearch and Kibana
 
Serverless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar FunctionsServerless Event Streaming with Pulsar Functions
Serverless Event Streaming with Pulsar Functions
 
Log Event Stream Processing In Flink Way
Log Event Stream Processing In Flink WayLog Event Stream Processing In Flink Way
Log Event Stream Processing In Flink Way
 
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
Flink Forward San Francisco 2019: Massive Scale Data Processing at Netflix us...
 
Coordination in distributed systems
Coordination in distributed systemsCoordination in distributed systems
Coordination in distributed systems
 
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck -  Pravega: Storage Rei...
Flink Forward SF 2017: Srikanth Satya & Tom Kaitchuck - Pravega: Storage Rei...
 
When Streaming Needs Batch With Konstantin Knauf | Current 2022
When Streaming Needs Batch With Konstantin Knauf | Current 2022When Streaming Needs Batch With Konstantin Knauf | Current 2022
When Streaming Needs Batch With Konstantin Knauf | Current 2022
 
Dynamic log processing with fluentd and konfigurator
Dynamic log processing with fluentd and konfiguratorDynamic log processing with fluentd and konfigurator
Dynamic log processing with fluentd and konfigurator
 

Recently uploaded

Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
haila53
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
ArpitMalhotra16
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
James Polillo
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
Oppotus
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
correoyaya
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
ewymefz
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
ocavb
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
axoqas
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
ewymefz
 

Recently uploaded (20)

Ch03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdfCh03-Managing the Object-Oriented Information Systems Project a.pdf
Ch03-Managing the Object-Oriented Information Systems Project a.pdf
 
standardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghhstandardisation of garbhpala offhgfffghh
standardisation of garbhpala offhgfffghh
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
Jpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization SampleJpolillo Amazon PPC - Bid Optimization Sample
Jpolillo Amazon PPC - Bid Optimization Sample
 
Q1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year ReboundQ1’2024 Update: MYCI’s Leap Year Rebound
Q1’2024 Update: MYCI’s Leap Year Rebound
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
Criminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdfCriminal IP - Threat Hunting Webinar.pdf
Criminal IP - Threat Hunting Webinar.pdf
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
Innovative Methods in Media and Communication Research by Sebastian Kubitschk...
 
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
一比一原版(IIT毕业证)伊利诺伊理工大学毕业证成绩单
 
SOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape ReportSOCRadar Germany 2024 Threat Landscape Report
SOCRadar Germany 2024 Threat Landscape Report
 
一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单一比一原版(TWU毕业证)西三一大学毕业证成绩单
一比一原版(TWU毕业证)西三一大学毕业证成绩单
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
哪里卖(usq毕业证书)南昆士兰大学毕业证研究生文凭证书托福证书原版一模一样
 
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
一比一原版(UMich毕业证)密歇根大学|安娜堡分校毕业证成绩单
 

Flink Connector Development Tips & Tricks

  • 1. Connector Tips & Tricks Eron Wright, Dell EMC @eronwright @ 2019 Dell EMC
  • 2. Who am I? ● Tech Staff at Dell EMC ● Contributor to Pravega stream storage system ○ Dynamically-sharded streams ○ Event-time tracking ○ Transaction support ● Maintainer of Flink connector for Pravega
  • 3. Overview Topics: ● Connector Basics ● Table Connectors ● Event Time ● State & Fault Tolerance
  • 5. Developing a Connector ● Applications take an explicit dependency on a connector ○ Not generally built-in to the Flink environment ○ Treated as a normal application dependency ○ Consider shading and relocating your connector’s dependencies ● Possible connector repositories: ○ Apache Flink repository ○ Apache Bahir (for Flink) repository ○ Your own repository
  • 6. Types of Flink Connectors ● Streaming Connectors ○ Provide sources and/or sinks ○ Sources may be bounded or unbounded ● Batch Connectors ○ Not discussed here ● Table Connectors ○ Provide tables which act as sources, sinks, or both ○ Unifies the batch and streaming programming model ○ Typically relies on a streaming and/or batch connector under the hood ○ A table’s update mode determines how a table is converted to/from a stream ■ Append Mode, Retract Mode, Upsert Mode
  • 7. Key Challenges ● How to parallelize your data source/sink ○ Subdivide the source data amongst operator subtasks, e.g. by partition ○ Support parallelism changes ● How to provide fault tolerance ○ Provide exactly-once semantics ○ Support coarse- and fine-grained recovery for failed tasks ○ Support Flink checkpoints and savepoints ● How to support historical and real-time processing ○ Facilitate correct program output ○ Support event time semantics ● Security considerations ○ Safeguarding secrets
  • 8. Connector Lifecycle ● Construction ○ Instantiated in the driver program (i.e. main method); must be serializable ○ Use the builder pattern to provide a DSL for your connector ○ Avoid making connections if possible ● State Initialization ○ Separate configuration from state ● Run ○ Supports both unbounded and bounded sources ● Cancel / Stop ○ Supports graceful termination (w/ savepoint) ○ May advance the event time clock to the end-of-time (MAX_WATERMARK)
  • 9. Connector Lifecycle (con’t) ● Advanced: Initialize/Finalize on Job Master ○ Exclusively for OutputFormat (e.g.. file-based sinks) ○ Implement InitializeOnMaster, FinalizeOnMaster, and CleanupWhenUnsuccessful ○ Support for Steaming API added in Flink 1.9; see FLINK-1722
  • 10. User-Defined Data Types ● Connectors are typically agnostic to the record data type ○ Expects application to supply type information w/ serializer ● For sources: ○ Accept a DeserializationSchema<T> ○ Implement ResultTypeQueryable<T> ● For sinks: ○ Accept a SerializationSchema<T> ● First-class support for Avro, Parquet, JSON ○ Geared towards Flink Table API
  • 11. Connector Metrics ● Flink exposes a metric system for gathering and reporting metrics ○ Reporters: Flink UI, JMX, InfluxDB, Prometheus, ... ● Use the metric API in your connector to expose relevant metric data ○ Types: counters, gauges, histograms, meters ● Metrics are tracked on a per-subtask basis ● More information: ○ Flink Documentation / Debugging & Monitoring / Metrics
  • 12. Connector Security ● Credentials are typically passed as ordinary program parameters ○ Beware lack of isolation between jobs in a given cluster ● Flink does have first-class support for Kerberos credentials ○ Based on keytabs (in support of long-running jobs) ○ Expects connector to use a named JAAS context ○ See: Kerberos Authentication Setup and Configuration
  • 14. Summary ● The Table API is evolving rapidly ○ For new connectors, focus on supporting the Blink planner ● Table sources and sinks are generally built upon the DataStream API ● Two configuration styles - typed DSL and string-based properties ● Table formats are connector-independent ○ E.g. CSV, JSON, Avro ● A catalog encapsulates a collection of tables, views, and functions ○ Provides convenience and interactivity ● More information: ○ Docs: User-Defined Sources & Sinks
  • 16. Key Considerations ● Connectors play an critical role in program correctness ○ Connector internals influence the order-of-observation (in event time) and hence the practicality of watermark generation ○ Connectors exhibit different behavior in historical vs real-time processing ● Event time skew leads to excess buffering and hence inefficiency ● There’s an inherent trade-off between latency and complexity
  • 17.
  • 18.
  • 19.
  • 20. Global Watermark Tracking ● Flink 1.9 has a facility for tracking a global aggregate value across sub-tasks ○ Ideal for establishing a global minimum watermark ○ See StreamingRuntimeContext#getGlobalAggregateManager ● Most useful in highly dynamic sources ○ Compensates for impact of resharding, rebalancing on event time ○ Increases latency ● See Kinesis connector’s JobManagerWatermarkTracker
  • 21. Source Idleness ● Downstream tasks depend on arrival of watermarks from all sub-tasks ○ Beware stalling the pipeline ● A sub-task may remove itself from consideration by idling ○ i.e. “release the hold on the event time clock” ● A source should be idled mainly for semantic reasons ○
  • 22. Sink Watermark Propagation ● Consider the possibility of watermark propagation across jobs ○ Propagate upstream watermarks along with output records ○ Job 1 → (external system) → Job 2 ● Sink function does have access to current watermark ○ But only when processing an input record 😞 ● Solution: event-time timers ○ Chain a ProcessFunction and corresponding SinkFunction, or develop a custom operator
  • 23. Practical Suggestions ● Provide an API to assign timestamps and to generate watermarks ○ Strive to isolate system internals, e.g. apply the watermark generator on a per-partition basis ○ Aggregate the watermarks into a per-subtask or global watermark ● Strive to minimize event time ‘skew’ across subtasks ○ Strategy: prioritize oldest data and pause ingestion of partitions that are too far ahead ○ See FLINK-10886 for improvements to Kinesis, Kafka connectors ● Remember: the goal is not a total ordering of elements (in event time)
  • 24. State & Fault Tolerance
  • 25. Working with State ● Sources are typically stateful, e.g. ○ partition assignment to sub-tasks ○ position tracking ● Use managed operator state to track redistributable units of work ○ List state - a list of redistributable elements (e.g. partitions w/ current position index) ○ Union list state - a variation where each sub-task gets the complete list of elements ● Various interfaces: ○ CheckpointedFunction - most powerful ○ ListCheckpointed - limited but convenient ○ CheckpointListener - to observe checkpoint completion (e.g. for 2PC)
  • 26. Exactly-Once Semantics ● Definition: evolution of state is based on a single observation of a given element ● Writes to external systems are ideally idempotent ● For sinks, Flink provides a few building blocks: ○ TwoPhaseCommitSinkFunction - base class providing a transaction-like API (but not storage) ○ GenericWriteAheadSink - implements a WAL using the state backend (see: CassandraSink) ○ CheckpointCommitter - stores information about completed checkpoints ● Savepoints present various complications ○ User may opt to resume from any prior checkpoint, not just the most recent checkpoint ○ The connector may be reconfigured w/ new inputs and/or outputs
  • 27. Advanced: Externally-Induced Sources ● Flink is still in control of initiating the overall checkpoint, with a twist! ● It allows a source to control the checkpoint barrier insertion point ○ E.g. based on incoming data or external coordination ● Hooks into the checkpoint coordinator on the master ○ Flink → Hook → External System → Sub-task ● See: ○ ExternallyInducedSource ○ WithMasterCheckpointHook
  • 28.
  • 29.
  • 30. Thank You! ● Feedback welcome (e.g. via the FF app) ● See me at the Speaker’s Lounge