
Select Star: Unified Batch & Streaming with Flink SQL & Pulsar

If you want to leverage the strengths of multiple streaming tools without the usual overhead of learning each tool and how they fit together, then this talk is for you.

This is a beginner-friendly talk intended for anyone who is familiar with just one of these tools, or none of them (although some familiarity with SQL will help).

You will walk away with the foundation you need to go home and build a unified batch and streaming pipeline without having to know Flink (or Java, or Scala, etc.); that's where Flink SQL comes in.

We'll then add in Pulsar, so you can make the most of its pub-sub messaging system alongside your pipeline. This talk walks through an easy-to-follow five-step process (with a demo for each step) for building a unified pipeline with Flink SQL and Pulsar together, and closes with resources for next steps.

Select Star: Unified Batch & Streaming with Flink SQL & Pulsar

  1. 1. © 2020 Ververica 1
  2. 2. Introduction ● Caito Scherr
  3. 3. Introduction ● Caito Scherr ● Developer Advocate
  4. 4. Introduction ● Caito Scherr ● Developer Advocate ● Ververica, GmbH
  5. 5. Introduction ● Caito Scherr ● Developer Advocate ● Ververica, GmbH ● Portland, OR, USA
  6. 6. Introduction
  7. 7. Introduction ● Demo credit: Marta Paes
  8. 8. Agenda ● Pulsar + Flink ● Where SQL comes in ● Demo: Pulsar + Flink SQL
  9. 9. Agenda ● Pulsar + Flink ● Where SQL comes in ● Demo: Pulsar + Flink SQL
  10. 10. Agenda ● Pulsar + Flink ● Where SQL comes in ● Demo: Pulsar + Flink SQL
  11. 11. Pulsar + Flink >> What is Flink? ● Stateful ● Stream processing engine ● Unified batch & streaming
  12. 12. Pulsar + Flink >> What is Flink?
  13. 13. Pulsar + Flink >> What is Flink?
  14. 14. Pulsar + Flink >> Why Pulsar + Flink? ● “Batch as a special case of streaming” ● “Stream as a unified view on data”
  15. 15. Pulsar + Flink >> Pulsar: Unified Storage ● Pub/Sub messaging layer (Streaming) ● Durable storage layer (Batch)
  16. 16. Pulsar + Flink >> Pulsar: Unified Storage ● Pub/Sub messaging layer (Stream) ● Durable storage layer (Batch) >> Flink: Unified Processing ● Reuse code and logic ● Consistent semantics ● Simplify operations ● Mix historic and real-time [Figure: timeline labeled "start of the stream", "past", "now", and "future", with bounded and unbounded query spans]
  17. 17. Pulsar + Flink >> A Unified Data Stack ● Unified Processing Engine (Batch / Streaming) ● Unified Storage (Segments / Pub/Sub)
  18. 18. Pulsar + Flink >> Pulsar + Flink History ● 2018, Flink 1.6+: Streaming Source/Sink Connectors; Table Sink Connector
  19. 19. Pulsar + Flink >> Pulsar + Flink History ● 2018, Flink 1.6+: Streaming Source/Sink Connectors; Table Sink Connector ● Flink 1.9+: Pulsar Schema + Flink Catalog; Table API/SQL as 1st class citizens; Exactly-once Source; At-least-once Sink
  20. 20. Pulsar + Flink >> Pulsar + Flink History ● 2018, Flink 1.6+: Streaming Source/Sink Connectors; Table Sink Connector ● Flink 1.9+: Pulsar Schema + Flink Catalog; Table API/SQL as 1st class citizens; Exactly-once Source; At-least-once Sink ● Flink 1.12: Upserts; DDL (Computed Columns, Watermarks, Metadata); End-to-end Exactly-once; Key-shared Subscription Model
  21. 21. Pulsar + Flink >> Why Flink SQL? [Diagram: Flink Runtime (Stateful Computations over Data Streams) underneath three layers: Stateful Stream Processing (Streams, State, Time); Event-Driven Applications (Stateful Functions); Streaming Analytics & ML (SQL, PyFlink, Tables)]
  22. 22. Pulsar + Flink >> Why Flink SQL? ● Focus on business logic, not implementation ● Mixed workloads (batch + streaming) ● Maximize developer speed and autonomy [Use cases: ML Feature Generation; Unified Online/Offline Model Training; E2E Streaming Analytics Pipelines]
  23. 23. Where SQL Fits In >> A Regular SQL Engine
      Query:
        SELECT user_id, COUNT(url) AS cnt FROM clicks GROUP BY user_id;
      Input (clicks):
        user | cTime    | url
        Mary | 12:00:00 | https://…
        Bob  | 12:00:00 | https://…
        Mary | 12:00:02 | https://…
        Liz  | 12:00:03 | https://…
      Result:
        user | cnt
        Mary | 2
        Bob  | 1
      ● Take a snapshot when the query starts ● A final result is produced ● A row that was added after the query was started is not considered ● The query terminates
  24. 24. Where SQL Fits In >> A Streaming SQL Engine
      Query:
        SELECT user_id, COUNT(url) AS cnt FROM clicks GROUP BY user_id;
      Input (clicks):
        user | cTime    | url
        Mary | 12:00:00 | https://…
        Bob  | 12:00:00 | https://…
        Mary | 12:00:02 | https://…
        Liz  | 12:00:03 | https://…
      Result (continuously updated):
        user | cnt
        Mary | 1, then 2
        Bob  | 1
        Liz  | 1
      ● Ingest all changes as they happen ● Continuously update the result ● The result is identical to the one-time query (at this point)
  25. 25. Where SQL Fits In >> Flink SQL In A Nutshell ● Standard SQL syntax and semantics (i.e. not a “SQL-flavor”) ● Unified APIs for batch and streaming ● Support for advanced time handling and operations (e.g. CDC, pattern matching) [Diagram labels: UDF Support (Python, Java, Scala); Execution (TPC-DS Coverage, Batch, Streaming); Formats; Native Connectors (Apache Kafka, Elasticsearch, FileSystems, JDBC, HBase, Kinesis); Data Catalogs (Metastore, Postgres (JDBC)); Debezium]
  26. 26. Demo >> 1a. Twitter Firehose
  27. 27. Demo >> 1b. Data?
  28. 28. Demo >> 2. SQL Client + Pulsar
      Catalog DDL:
        CREATE CATALOG pulsar WITH (
          'type' = 'pulsar',
          'service-url' = 'pulsar://pulsar:6650',
          'admin-url' = 'http://pulsar:8080',
          'format' = 'json'
        );
  29. 29. Demo ● Not cool. 👹
  30. 30. Demo >> 3. Get relevant timestamp
        CREATE TABLE pulsar_tweets (
          publishTime TIMESTAMP(3) METADATA,
          WATERMARK FOR publishTime AS publishTime - INTERVAL '5' SECOND
        ) WITH (
          'connector' = 'pulsar',
          'topic' = 'persistent://public/default/tweets',
          'value.format' = 'json',
          'service-url' = 'pulsar://pulsar:6650',
          'admin-url' = 'http://pulsar:8080',
          'scan.startup.mode' = 'earliest-offset'
        ) LIKE tweets;
      ● Derive schema from the original topic ● Define the source connector (Pulsar) ● Read and use Pulsar message metadata
  31. 31. Demo >> 4. Windowed aggregation
      Sink Table DDL:
        CREATE TABLE pulsar_tweets_agg (
          tmstmp TIMESTAMP(3),
          tweet_cnt BIGINT
        ) WITH (
          'connector' = 'pulsar',
          'topic' = 'persistent://public/default/tweets_agg',
          'value.format' = 'json',
          'service-url' = 'pulsar://pulsar:6650',
          'admin-url' = 'http://pulsar:8080'
        );
      Continuous SQL Query:
        INSERT INTO pulsar_tweets_agg
        SELECT TUMBLE_START(publishTime, INTERVAL '10' SECOND) AS wStart,
               COUNT(id) AS tweet_cnt
        FROM pulsar_tweets
        GROUP BY TUMBLE(publishTime, INTERVAL '10' SECOND);
  32. 32. Demo >> 5. Tweet count in windows
  33. 33. What Next? >> Flink SQL Cookbook
  34. 34. Resources (@Caito_200_OK)
      ● Flink Ahead: What Comes After Batch & Streaming: https://youtu.be/h5OYmy9Yx7Y
      ● Apache Pulsar as one Storage System for Real Time & Historical Data Analysis: https://medium.com/streamnative/apache-pulsar-as-one-storage-455222c59017
      ● Flink Table API & SQL: https://ci.apache.org/projects/flink/flink-docs-master/dev/table/sql/queries.html#operations
      ● Flink SQL Cookbook: https://github.com/ververica/flink-sql-cookbook
      ● When Flink & Pulsar Come Together: https://flink.apache.org/2019/05/03/pulsar-flink.html
      ● How to Query Pulsar Streams in Flink: https://flink.apache.org/news/2019/11/25/query-pulsar-streams-using-apache-flink.html
      ● What’s New in the Flink/Pulsar Connector: https://flink.apache.org/2021/01/07/pulsar-flink-connector-270.html
      ● Marta’s Demo: https://github.com/morsapaes/flink-sql-pulsar
  35. 35. Thank You! ● Pulsar Conference staff!! ● Marta Paes | @Caito_200_OK | Scan here for links & resources
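
Slides 23 and 24 contrast a regular SQL engine with a streaming one for the same `GROUP BY` count. The difference can be sketched in plain Python, which is only an illustration of the semantics and not how Flink executes the query. The function names `batch_count` and `streaming_count` are hypothetical, and the click rows are copied from the slides with their elided URLs left as they appear.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, List, Tuple

Click = Tuple[str, str, str]  # (user, cTime, url), as in the slides' clicks table

# Rows copied from slides 23 and 24; the URLs are elided in the source.
CLICKS: List[Click] = [
    ("Mary", "12:00:00", "https://…"),
    ("Bob", "12:00:00", "https://…"),
    ("Mary", "12:00:02", "https://…"),
    ("Liz", "12:00:03", "https://…"),
]


def batch_count(snapshot: Iterable[Click]) -> Dict[str, int]:
    """One-time query: read a fixed snapshot, produce a final result, terminate."""
    return dict(Counter(user for user, _, _ in snapshot))


def streaming_count(clicks: Iterable[Click]) -> Iterator[Tuple[str, int]]:
    """Continuous query: emit an updated (user, cnt) row for every incoming click."""
    counts: Dict[str, int] = {}
    for user, _, _ in clicks:
        counts[user] = counts.get(user, 0) + 1
        yield user, counts[user]


# Regular engine: the snapshot holds the first three rows; Liz's later click
# arrives after the query started and is not considered.
batch_result = batch_count(CLICKS[:3])

# Streaming engine: every change is ingested and the result keeps updating.
updates = list(streaming_count(CLICKS))

# Replaying only the updates for the snapshot rows gives the same answer as the
# one-time query, matching "identical to the one-time query (at this point)".
replayed = dict(updates[:3])

print(batch_result)
print(updates)
```

The last row for each user in `updates` is the current count, so a consumer that keeps only the latest row per key sees the same table the batch query would return at that moment.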
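
Slides 30 and 31 declare a watermark that trails `publishTime` by 5 seconds and then count tweets in 10-second tumbling windows. A simplified plain-Python sketch of that idea follows. It assumes integer timestamps in seconds and a watermark that advances on every event, and it only approximates Flink's actual windowing and lateness rules. `tumbling_counts` and the sample timestamps are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

WINDOW_SIZE = 10  # seconds, as in TUMBLE(publishTime, INTERVAL '10' SECOND)
MAX_DELAY = 5     # seconds, as in publishTime - INTERVAL '5' SECOND


def tumbling_counts(event_times: Iterable[int]) -> Iterator[Tuple[int, int]]:
    """Yield (window_start, count) once the watermark reaches the window end."""
    open_windows: Dict[int, int] = defaultdict(int)
    watermark = float("-inf")
    for ts in event_times:
        # The watermark trails the largest timestamp seen so far by MAX_DELAY.
        watermark = max(watermark, ts - MAX_DELAY)
        window_start = ts - ts % WINDOW_SIZE
        if window_start + WINDOW_SIZE <= watermark:
            continue  # the window was already closed; this late event is dropped
        open_windows[window_start] += 1
        # Close every window whose end is at or before the watermark.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                yield start, open_windows.pop(start)


# Hypothetical event times in seconds. The 3 arrives out of order but before the
# watermark closes window [0, 10), so it is counted; the final 8 arrives after
# that window closed, so it is dropped.
results = list(tumbling_counts([1, 4, 9, 12, 3, 17, 26, 8]))
print(results)
```

The window starting at 20 never appears in `results` because the watermark has not yet reached 30. On an unbounded stream a window is emitted only after later events push the watermark past its end.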
