Pulsar Virtual Summit Europe 2021
Pulsar in the Lakehouse
Ryan Zhu
Staff Software Engineer, Databricks
Addison Higham
Chief Architect, StreamNative
Ryan Zhu, Staff Software Engineer at Databricks
● Tech Lead of Delta Ecosystem team
● Apache Spark PMC member and committer
● Experience:
○ One of the core developers of Delta Lake and Spark Structured Streaming, working on both projects since their beginning.
○ Recently working on Delta Sharing, a new open protocol for sharing data.
Addison Higham, Chief Architect at StreamNative
● Apache Pulsar Committer
● Experience:
○ 10+ years as a software engineer, with 7 years working on streaming systems.
○ 3+ years of Pulsar experience, including leading the successful adoption of Pulsar at Instructure.
Delta Lake/Lakehouse Overview
● Data is fragmented across many systems
● Cost and complexity are a drag on the organization
● Silos get in the way of data team collaboration
Data infrastructure is too complicated
[Diagram labels: structured, semi-structured, and unstructured data; Data Lake; Data Warehouse (three instances); Machine Learning; Data Science; BI]
Data Warehouse
● Pros
○ Great for Business Intelligence (BI) applications
● Cons
○ Limited support for Machine Learning (ML) workloads
○ Proprietary systems with only a SQL interface
Data Lake
● Pros
○ Supports ML
○ Completely open ecosystem of tools and formats
● Cons
○ Poor support for BI
○ Complex to manage and govern → data swamp
Lakehouse
One platform to unify all your data, analytics, and AI workloads
● BI & SQL
● Open Data Lake
● Data Management & Governance
● Real-time Data Applications
● Data Science & ML
Building the foundation of a Lakehouse - Delta Lake
● Inputs: CSV, JSON, TXT…, Kinesis
● BRONZE: Raw Ingestion and History
● SILVER: Filtered, Cleaned, Augmented
● GOLD: Business-level Aggregates
● Outputs: BI & Reporting, Streaming Analytics, Data Science & ML
[Diagram: data QUALITY increases from BRONZE through SILVER to GOLD]
● 350+ PB processed / day
● 75% Data Scanned
● 3K+ Customers in Production
OSS Delta Lake Key Features
● ACID Transactions: Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level. Learn more at Diving into Delta Lake: Unpacking the Transaction Log.
● Scalable Metadata Handling: Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
● Time Travel (data versioning): Delta Lake provides data snapshots to access and revert to earlier versions of data for audits, rollbacks, or to reproduce experiments.
● Open Format: All data in Delta Lake is stored in Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
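To make the transaction log and time travel features above more concrete, here is a minimal sketch in plain Python. It is a toy model under stated assumptions, not Delta Lake's implementation: each commit is an atomic list of `add` and `remove` actions (the action names follow the Delta transaction log protocol; the file names and the `snapshot` helper are hypothetical), and a table version is rebuilt by replaying commits in order.

```python
from typing import Dict, List, Optional, Set

# Toy transaction log: the list index is the table version.
# Each commit is an atomic list of actions; file names are made up.
LOG: List[List[Dict[str, str]]] = [
    [{"add": "part-000.parquet"}],                                  # version 0
    [{"add": "part-001.parquet"}],                                  # version 1
    [{"remove": "part-000.parquet"}, {"add": "part-002.parquet"}],  # version 2
]


def snapshot(log: List[List[Dict[str, str]]], version: Optional[int] = None) -> Set[str]:
    """Return the set of live data files at `version` (latest if None)."""
    if version is None:
        version = len(log) - 1
    if not 0 <= version < len(log):
        raise ValueError(f"version {version} does not exist")
    files: Set[str] = set()
    # Replay every commit up to and including the requested version.
    for commit in log[: version + 1]:
        for action in commit:
            if "add" in action:
                files.add(action["add"])
            elif "remove" in action:
                files.discard(action["remove"])
    return files


latest = snapshot(LOG)
as_of_v1 = snapshot(LOG, version=1)
print(sorted(latest))    # ['part-001.parquet', 'part-002.parquet']
print(sorted(as_of_v1))  # ['part-000.parquet', 'part-001.parquet']
```

Because older commits are never rewritten, reading an earlier version is just a shorter replay, which is the idea behind time travel and the audit trail.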
OSS Delta Lake Key Features (Continued)
● Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
● Schema Enforcement and Evolution: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption. For more information, refer to Diving Into Delta Lake: Schema Enforcement & Evolution.
● Audit History: The Delta Lake transaction log records details about every change made to data, providing a full audit trail of the changes.
● DML Operations: Delta Lake supports SQL, Scala / Java, and Python APIs to merge, update, and delete datasets, allowing you to easily comply with GDPR and CCPA and simplifying use cases like change data capture. For more information, refer to Diving Into Delta Lake: DML Internals.
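The schema enforcement idea above can be illustrated with a small plain-Python sketch. This is only an analogy under stated assumptions, not Delta Lake's API: the `SCHEMA` mapping, `SchemaMismatch`, `validate_batch`, and `append` are hypothetical names, and the table is an in-memory list. The point it shows is that a write is checked against the declared schema before any row lands, so one bad row rejects the whole batch.

```python
from typing import Any, Dict, List

# Hypothetical table schema: column name -> required Python type.
SCHEMA: Dict[str, type] = {"user_id": int, "event": str}


class SchemaMismatch(Exception):
    """Raised when a batch does not match the table schema."""


def validate_batch(rows: List[Dict[str, Any]], schema: Dict[str, type]) -> None:
    """Check that every row has exactly the schema's columns with the right types."""
    for i, row in enumerate(rows):
        missing = schema.keys() - row.keys()
        extra = row.keys() - schema.keys()
        if missing or extra:
            raise SchemaMismatch(
                f"row {i}: missing={sorted(missing)}, unexpected={sorted(extra)}"
            )
        for col, expected in schema.items():
            if not isinstance(row[col], expected):
                raise SchemaMismatch(
                    f"row {i}: column {col!r} expected {expected.__name__}, "
                    f"got {type(row[col]).__name__}"
                )


def append(table: List[Dict[str, Any]], rows: List[Dict[str, Any]],
           schema: Dict[str, type]) -> None:
    """Validate the whole batch first, then append, so writes are all-or-nothing."""
    validate_batch(rows, schema)
    table.extend(rows)


table: List[Dict[str, Any]] = []
append(table, [{"user_id": 1, "event": "click"}], SCHEMA)

try:
    # The second row has a string where an int is required.
    append(table, [{"user_id": 2, "event": "view"},
                   {"user_id": "3", "event": "view"}], SCHEMA)
except SchemaMismatch as err:
    print("rejected:", err)

print(len(table))  # 1: the bad batch left the table untouched
```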
Upcoming features
● Column dropping and renaming: Allow users to drop and rename columns.
● Atomic data replacement: Allow users to delete a portion of data from the table and replace it with new data atomically.
● Schema evolution improvement for MERGE: StructType in ArrayType will support schema evolution in the MERGE command.
● MERGE support for generated columns: Generated columns, a feature added in Delta 1.0, generate column values based on SQL expressions. MERGE will support these columns.
● New release cadence: One release every 3 months.
Delta Lake ecosystem
Ecosystem Project: Status
● Delta Standalone Reader: Available
● Delta Standalone Writer: Q4 '21
● Flink/Delta Source: Q1 '22
● Flink/Delta Sink: Q4 '21
● Pulsar/Delta Source: Q4 '21
● Pulsar/Delta Sink: Q1 '22
● PrestoDB/Trino integration: Q4 '21
● Rust Integration (kafka-delta-ingest): Available
● Nessie Integration: Q4 '21
● LakeFS Integration: Q4 '21
● Hive3 Connector: Available
● Spark 3.2 Support: Q4 '21
Pulsar + Lakehouse
Pulsar is the unified messaging and streaming platform for real-time teams
Why Pulsar?
● Streams and messages to support more workloads
● Multi-tenancy to break down data silos and ease data ingestion
● Geo-replication to support multi-cloud and global business
Pulsar + Delta Lake enable data unification
● Delta Lake and Lakehouse support a unified system for data, analytics, and ML
● Pulsar unifies real-time data across diverse use cases like streaming, messaging, and microservices
● Pulsar + Delta Lake = simpler data infrastructure across your entire organization
The Pulsar and Spark/Delta Lake communities are committed to building solid integrations
Pulsar, Delta Lake, and Spark Connectors
● Spark Pulsar Connector: Connectors for Spark for reading data from and writing data to Pulsar, for use with DataFrame and DataStream APIs. https://github.com/streamnative/pulsar-spark. Discussions are in progress for upstream contribution.
● Pulsar IO Delta Lake Source: A Pulsar “Source” for reading data directly from Delta Lake within the Pulsar IO framework. It’s built on top of the Delta Standalone project. In progress; expect a first release this year.
[Diagram labels: Database; Pulsar IO; Pulsar Source; Pulsar with Pulsar, KoP, AoP, WebSocket, and HTTP interfaces]
Pulsar offers many options for integration, including the Pulsar protocol, KoP, AoP, and connectors, to connect with many systems in real time.
Delta Lake connectors allow data to be exchanged between Delta Lake and Pulsar.
Spark’s Pulsar connector allows developers to write Spark jobs that read data from Pulsar topics, transform the data, and write back to Pulsar topics.
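A read, transform, and write-back job of the kind described above might look roughly like the following PySpark sketch. Treat it as an illustration under assumptions rather than the connector's documented usage: the `pulsar` format name and the `service.url`, `admin.url`, and `topic` option keys are taken from the pulsar-spark project's README as understood at the time and may differ between connector versions; the URLs, topic names, and helper functions are hypothetical. The job function expects an existing SparkSession with the connector on its classpath and is not invoked here.

```python
from typing import Dict


def pulsar_options(service_url: str, admin_url: str, topic: str) -> Dict[str, str]:
    """Build connector options (key names assumed from the pulsar-spark README)."""
    return {"service.url": service_url, "admin.url": admin_url, "topic": topic}


def start_uppercase_job(spark, source: Dict[str, str], sink: Dict[str, str],
                        checkpoint_dir: str):
    """Read one topic, upper-case each payload, and write to another topic.

    `spark` must be a SparkSession that has the Pulsar connector available.
    Returns the started streaming query.
    """
    events = spark.readStream.format("pulsar").options(**source).load()
    # The transformation step: treat the payload as text and upper-case it.
    transformed = events.selectExpr("upper(CAST(value AS STRING)) AS value")
    return (
        transformed.writeStream.format("pulsar")
        .options(**sink)
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )


# Hypothetical local endpoints and topic names.
source = pulsar_options("pulsar://localhost:6650", "http://localhost:8080", "events-in")
sink = pulsar_options("pulsar://localhost:6650", "http://localhost:8080", "events-out")
print(source["topic"], "->", sink["topic"])
```

With a configured session, the job would be started with something like `start_uppercase_job(spark, source, sink, "/tmp/checkpoints/uppercase")`, where the checkpoint path is again only a placeholder.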
Application events stored in Delta Lake for use in ML
ML results made available to applications
CDC events transformed and stored in Delta Lake
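As a rough illustration of what "transformed and stored" means for change data capture, the sketch below applies a stream of CDC events to a keyed table in plain Python, mirroring the upsert-and-delete behavior that a MERGE provides. Everything here is hypothetical: the event shape (`op`, `id`, `data`), the `apply_cdc` helper, and the sample rows are assumptions for the example, not the format produced by any particular connector.

```python
from typing import Any, Dict, List


def apply_cdc(table: Dict[int, Dict[str, Any]],
              events: List[Dict[str, Any]]) -> Dict[int, Dict[str, Any]]:
    """Apply change events in order and return the new table state."""
    result = {key: dict(row) for key, row in table.items()}
    for event in events:
        key = event["id"]
        if event["op"] == "delete":
            # Matched rows are removed; deleting a missing key is a no-op.
            result.pop(key, None)
        elif event["op"] in ("insert", "update"):
            # Upsert: merge new column values over any existing row.
            result[key] = {**result.get(key, {}), **event["data"]}
        else:
            raise ValueError(f"unknown operation: {event['op']!r}")
    return result


customers = {1: {"name": "Ada", "city": "London"}}
changes = [
    {"op": "update", "id": 1, "data": {"city": "Paris"}},
    {"op": "insert", "id": 2, "data": {"name": "Lin", "city": "Oslo"}},
    {"op": "insert", "id": 3, "data": {"name": "Sam", "city": "Rome"}},
    {"op": "delete", "id": 3},
]

print(apply_cdc(customers, changes))
# {1: {'name': 'Ada', 'city': 'Paris'}, 2: {'name': 'Lin', 'city': 'Oslo'}}
```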
Other systems’ data made available in Delta Lake for data science
Pulsar IO Delta Lake Source
With the Pulsar IO Delta Lake source, users will be able to ingest Delta Lake changes into Pulsar without running a separate component.
[Diagram labels: Delta Lake Source; Metadata Change, New File, Removed File; Parquet File; Update Schema; Write Records; Topic w/ Schema]
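The change types named in the diagram (metadata change, new file, removed file) correspond to actions in a Delta commit, which is stored as one JSON object per line. The sketch below classifies such a commit into change events. It is a hedged illustration, not the connector's implementation: the `metaData`, `add`, and `remove` action names follow the Delta transaction log protocol, while the sample commit, the event dictionaries, and the `classify` helper are hypothetical. Reading the Parquet files and writing records to a topic are left out.

```python
import json
from typing import Dict, Iterator

# A made-up commit: one JSON action per line.
COMMIT = "\n".join([
    json.dumps({"metaData": {"schemaString": json.dumps({"type": "struct", "fields": []})}}),
    json.dumps({"add": {"path": "part-000.parquet", "dataChange": True}}),
    json.dumps({"remove": {"path": "part-old.parquet", "dataChange": True}}),
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
])


def classify(commit_text: str) -> Iterator[Dict[str, str]]:
    """Yield one change event per relevant action in a commit."""
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        if "metaData" in action:
            # A source would update the topic schema at this point.
            yield {"kind": "metadata_change", "schema": action["metaData"]["schemaString"]}
        elif "add" in action:
            # A source would read this Parquet file and write its records.
            yield {"kind": "new_file", "path": action["add"]["path"]}
        elif "remove" in action:
            yield {"kind": "removed_file", "path": action["remove"]["path"]}
        # Other actions, such as commitInfo, are ignored in this sketch.


events = list(classify(COMMIT))
for event in events:
    print(event["kind"])
```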
Future of Pulsar + Delta Lake
One of Pulsar’s unique features is tiered storage, which allows streams to be offloaded out of Apache BookKeeper into S3, GCS, etc.
Work is in progress to offload data as Delta Lake compatible files, with the required metadata, so that Pulsar can make streams available to Delta Lake without copying data out of Pulsar, while the data can still be read as streams.
Stay connected to learn more in early 2022!
Pulsar and Delta Lake are technologies designed to simplify your data infrastructure
Connect with us on #connector-pulsar in Delta Lake Slack to learn more!
Thank you!

More Related Content

What's hot

Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developersconfluent
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Jean-Paul Azar
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using KafkaKnoldus Inc.
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and DeltaDatabricks
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsAlluxio, Inc.
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotFlink Forward
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignMichael Noll
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar ZecevicDataScienceConferenc1
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeFlink Forward
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks FundamentalsDalibor Wijas
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Databricks
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache KafkaJeff Holoman
 
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
 Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra... Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...HostedbyConfluent
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Cloudera, Inc.
 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka StreamsGuozhang Wang
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connectconfluent
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 

What's hot (20)

Apache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and DevelopersApache Kafka Fundamentals for Architects, Admins and Developers
Apache Kafka Fundamentals for Architects, Admins and Developers
 
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)Kafka Tutorial - Introduction to Apache Kafka (Part 1)
Kafka Tutorial - Introduction to Apache Kafka (Part 1)
 
Stream processing using Kafka
Stream processing using KafkaStream processing using Kafka
Stream processing using Kafka
 
3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta3D: DBT using Databricks and Delta
3D: DBT using Databricks and Delta
 
Iceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data AnalyticsIceberg + Alluxio for Fast Data Analytics
Iceberg + Alluxio for Fast Data Analytics
 
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and PinotExactly-Once Financial Data Processing at Scale with Flink and Pinot
Exactly-Once Financial Data Processing at Scale with Flink and Pinot
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Apache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - VerisignApache Kafka 0.8 basic training - Verisign
Apache Kafka 0.8 basic training - Verisign
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Databricks Fundamentals
Databricks FundamentalsDatabricks Fundamentals
Databricks Fundamentals
 
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
Designing ETL Pipelines with Structured Streaming and Delta Lake—How to Archi...
 
Introduction to Apache Kafka
Introduction to Apache KafkaIntroduction to Apache Kafka
Introduction to Apache Kafka
 
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
 Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra... Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
Disaster Recovery Options Running Apache Kafka in Kubernetes with Rema Subra...
 
Databricks Delta Lake and Its Benefits
Databricks Delta Lake and Its BenefitsDatabricks Delta Lake and Its Benefits
Databricks Delta Lake and Its Benefits
 
Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive

Apache Kudu: Technical Deep Dive


Apache Kudu: Technical Deep Dive


 
Introduction to Kafka Streams
Introduction to Kafka StreamsIntroduction to Kafka Streams
Introduction to Kafka Streams
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 
From Zero to Hero with Kafka Connect
From Zero to Hero with Kafka ConnectFrom Zero to Hero with Kafka Connect
From Zero to Hero with Kafka Connect
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 

Similar to Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - Pulsar Summit Europe 2021 Keynote

Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Helena Edelson
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamDataWorks Summit
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...DataWorks Summit
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta LakeDatabricks
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...StreamNative
 
Leverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platformLeverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platformconfluent
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaData Con LA
 
Overview SQL Server 2019
Overview SQL Server 2019Overview SQL Server 2019
Overview SQL Server 2019Juan Fabian
 
Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...StreamNative
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptxAlex Ivy
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...DataWorks Summit
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamDataWorks Summit
 
Select Star: Flink SQL for Pulsar Folks - Pulsar Summit NA 2021
Select Star: Flink SQL for Pulsar Folks - Pulsar Summit NA 2021Select Star: Flink SQL for Pulsar Folks - Pulsar Summit NA 2021
Select Star: Flink SQL for Pulsar Folks - Pulsar Summit NA 2021StreamNative
 
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Lightbend
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...DataStax Academy
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Provectus
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the unionDatabricks
 

Similar to Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - Pulsar Summit Europe 2021 Keynote (20)

Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
Fast and Simplified Streaming, Ad-Hoc and Batch Analytics with FiloDB and Spa...
 
Realizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache BeamRealizing the Promise of Portable Data Processing with Apache Beam
Realizing the Promise of Portable Data Processing with Apache Beam
 
Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...Present and future of unified, portable and efficient data processing with Ap...
Present and future of unified, portable and efficient data processing with Ap...
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
What Is Delta Lake ???
What Is Delta Lake ???What Is Delta Lake ???
What Is Delta Lake ???
 
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
Pulsar in the Lakehouse: Overview of Apache Pulsar and Delta Lake Connector -...
 
Leverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platformLeverage Kafka to build a stream processing platform
Leverage Kafka to build a stream processing platform
 
Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7Jack Gudenkauf sparkug_20151207_7
Jack Gudenkauf sparkug_20151207_7
 
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­ticaA noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
A noETL Parallel Streaming Transformation Loader using Spark, Kafka­ & Ver­tica
 
Overview SQL Server 2019
Overview SQL Server 2019Overview SQL Server 2019
Overview SQL Server 2019
 
Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...Distributed Database Design Decisions to Support High Performance Event Strea...
Distributed Database Design Decisions to Support High Performance Event Strea...
 
Databricks Platform.pptx
Databricks Platform.pptxDatabricks Platform.pptx
Databricks Platform.pptx
 
Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...Present and future of unified, portable, and efficient data processing with A...
Present and future of unified, portable, and efficient data processing with A...
 
Realizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache BeamRealizing the promise of portable data processing with Apache Beam
Realizing the promise of portable data processing with Apache Beam
 
Select Star: Flink SQL for Pulsar Folks - Pulsar Summit NA 2021
Select Star: Flink SQL for Pulsar Folks - Pulsar Summit NA 2021Select Star: Flink SQL for Pulsar Folks - Pulsar Summit NA 2021
Select Star: Flink SQL for Pulsar Folks - Pulsar Summit NA 2021
 
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
Build Real-Time Streaming ETL Pipelines With Akka Streams, Alpakka And Apache...
 
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data...
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Spark streaming state of the union
Spark streaming state of the unionSpark streaming state of the union
Spark streaming state of the union
 

More from StreamNative

Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022
Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022
Is Using KoP (Kafka-on-Pulsar) a Good Idea? - Pulsar Summit SF 2022StreamNative
 
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...
Building an Asynchronous Application Framework with Python and Pulsar - Pulsa...StreamNative
 
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...
Blue-green deploys with Pulsar & Envoy in an event-driven microservice ecosys...StreamNative
 
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022
Simplify Pulsar Functions Development with SQL - Pulsar Summit SF 2022StreamNative
 
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022
Towards a ZooKeeper-less Pulsar, etcd, etcd, etcd. - Pulsar Summit SF 2022StreamNative
 
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...
Validating Apache Pulsar’s Behavior under Failure Conditions - Pulsar Summit ...StreamNative
 
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...
Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...Cross the Streams! Creating Streaming Data Pipelines with Apache Flink + Apac...

Pulsar in the Lakehouse: Apache Pulsar™ with Apache Spark™ and Delta Lake - Pulsar Summit Europe 2021 Keynote

  • 1. Pulsar Virtual Summit Europe 2021 Pulsar in the Lakehouse Ryan Zhu Staff Software Engineer, Databricks Addison Higham Chief Architect, StreamNative
  • 2. Ryan Zhu, Staff Software Engineer at Databricks
      ● Tech Lead of Delta Ecosystem team
      ● Apache Spark PMC member and committer
      ● Experience:
        ○ One of the core developers of Delta Lake and Spark Structured Streaming. Working on these two projects since the beginning.
        ○ Recently working on Delta Sharing, a new open protocol to share data.
  • 3. Addison Higham, Chief Architect at StreamNative
      ● Apache Pulsar Committer
      ● Experience:
        ○ 10+ years as Software Engineer, with 7 years working on streaming systems.
        ○ 3+ years Pulsar experience, including leading the successful adoption of Pulsar at Instructure.
  • 4. Delta Lake/Lakehouse Overview
  • 5. Data is fragmented across many systems. Cost and complexity is a drag on the organization. Silos get in the way of data team collaboration.
  • 6. Data infrastructure is too complicated [Diagram labels: Data Lake, Semi-structured, Unstructured, Structured, Data Warehouse and BI (each shown three times), Machine Learning, Data Science]
  • 7. Data Warehouse
      ● Pros: Great for Business Intelligence (BI) applications
      ● Cons: Limited support for Machine Learning (ML) workloads; proprietary systems with only a SQL interface
    Data Lake
      ● Pros: Supports ML; completely open ecosystem of tools and formats
      ● Cons: Poor support for BI; complex to manage and govern → data swamp
  • 8. Lakehouse: One platform to unify all your data, analytics, and AI workloads [Diagram labels: BI & SQL, Real-time Data Applications, Data Science & ML, Data Management & Governance, Open Data Lake]
  • 9. Building the foundation of a Lakehouse - Delta Lake [Diagram: inputs CSV, JSON, TXT… and Kinesis; QUALITY increases from BRONZE (Raw Ingestion and History) to SILVER (Filtered, Cleaned, Augmented) to GOLD (Business-level Aggregates); outputs Streaming Analytics, BI & Reporting, Data Science & ML]
  • 10. 350+ PB processed / day; 75% Data Scanned; 3K+ Customers in Production
  • 11. OSS Delta Lake Key Features
      ● ACID Transactions: Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level. Learn more at Diving into Delta Lake: Unpacking the Transaction Log.
      ● Scalable Metadata Handling: Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
      ● Time Travel (data versioning): Delta Lake provides data snapshots to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
      ● Open Format: All data in Delta Lake is stored in Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
  • 12. OSS Delta Lake Key Features (Continued)
      ● Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
      ● Schema Enforcement and Evolution: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption. For more information, refer to Diving Into Delta Lake: Schema Enforcement & Evolution.
      ● Audit History: The Delta Lake transaction log records details about every change made to data, providing a full audit trail of the changes.
      ● DML Operations: Delta Lake supports SQL, Scala / Java and Python APIs to merge, update and delete datasets, allowing you to easily comply with GDPR and CCPA and simplifying use cases like change data capture. For more information, refer to Diving Into Delta Lake: DML Internals.
  • 13. Upcoming features
      ● Column dropping and renaming: Allow users to drop a column and rename a column.
      ● Atomic data replacement: Allow users to delete a portion of data from the table and replace it with new data atomically.
      ● Schema evolution improvement for MERGE: StructType in ArrayType will support schema evolution in the MERGE command.
      ● MERGE support for generated columns: Generated Columns is a feature added in Delta 1.0 to support generating columns based on SQL expressions. MERGE will support these columns.
      ● New release cadence: One release every 3 months.
  • 14. Delta Lake ecosystem (Project: Status)
      ● Delta Standalone Reader: Available
      ● Delta Standalone Writer: Q4 ’21
      ● Flink/Delta Source: Q1 ’22
      ● Flink/Delta Sink: Q4 ’21
      ● Pulsar/Delta Source: Q4 ’21
      ● Pulsar/Delta Sink: Q1 ’22
      ● PrestoDB/Trino integration: Q4 ’21
      ● Rust Integration (kafka-delta-ingest): Available
      ● Nessie Integration: Q4 ’21
      ● LakeFS Integration: Q4 ’21
      ● Hive3 Connector: Available
      ● Spark 3.2 Support: Q4 ’21
  • 15. Pulsar + Lakehouse
  • 16. Pulsar is the unified messaging and streaming platform for real-time teams
  • 17. Why Pulsar?
      ● Streams and messages to support more workloads
      ● Multi-tenancy to break down data silos and ease data ingestion
      ● Geo-replication to support multi-cloud and global business
  • 18. Pulsar + Delta Lake enable data unification
      ● Pulsar unifies real-time data across diverse use cases like streaming, messaging, and microservices
      ● Delta Lake and Lakehouse support a unified system for data, analytics, and ML
      ● Pulsar + Delta Lake = Simplifies data infrastructures across your entire organization
  • 19. Pulsar, Delta Lake, and Spark Connectors: The Pulsar and Spark/Delta Lake communities are committed to building solid integrations
      ● Spark Pulsar Connector: Connectors for Spark for reading and writing data from Pulsar for use with DataFrame and DataStream APIs. https://github.com/streamnative/pulsar-spark. Discussions in progress for upstream contribution.
      ● Pulsar IO Delta Lake Source: A Pulsar “Source” for reading data directly from Delta Lake within the Pulsar IO framework. It’s built on top of the Delta Standalone project. In progress, expect a first release this year.
  • 20. [Diagram labels: Database, Pulsar IO, Pulsar Source, Pulsar, KoP, AoP, WebSocket, HTTP]
  • 21. Pulsar offers many options for integration, including Pulsar, KoP, AoP, and connectors, to connect with many systems in real time. [Diagram labels as on slide 20]
  • 22. Delta Lake Connectors allow for data to be exchanged between Delta Lake and Pulsar. [Diagram labels as on slide 20]
  • 23. Spark’s Pulsar connector allows developers to write Spark jobs that can read data from Pulsar topics, transform the data, and write back to Pulsar topics. [Diagram labels as on slide 20]
  • 24. Application events stored in Delta Lake for use in ML [Diagram labels as on slide 20]
  • 25. ML Results made available to applications [Diagram labels as on slide 20]
  • 26. CDC Events transformed and stored in Delta Lake [Diagram labels as on slide 20]
  • 27. Other systems’ data made available in Delta Lake for Data Science [Diagram labels as on slide 20]
  • 28. Pulsar IO Delta Lake Source: With the Pulsar IO Delta Lake source, users will be able to ingest Delta Lake changes into Pulsar without running a separate component. [Diagram labels: Delta Lake, Source, Topic W/ Schema, New File, Removed File, Metadata Change, Parquet File, Update Schema, Write Records, Records]
  • 29. Future of Pulsar + Delta Lake: One of Pulsar’s unique features is tiered storage, which allows streams to be offloaded out of Apache BookKeeper into S3, GCS, etc. Work is in progress to offload data in Delta Lake compatible files, with the required metadata, so that Pulsar can make streams available to Delta Lake without any need to copy data out of Pulsar, while the data can still be read as streams. Stay connected to learn more in early 2022!
  • 30. Pulsar and Delta Lake are technologies designed to simplify your data infrastructure. Connect with us on #connector-pulsar in Delta Lake Slack to learn more!
  • 31. Thank you!
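
Slide 28 describes a source that turns Delta Lake changes (new file, removed file, metadata change) into records on a Pulsar topic. The following is a minimal Python sketch of that classification step only, not the connector's implementation. It assumes a commit is available as newline-delimited JSON using the `add`, `remove`, and `metaData` action names from the Delta transaction log; the event names, the `commit_to_events` function, and the sample file paths are hypothetical.

```python
import json

# Action names follow the Delta transaction log; the event names on the
# right-hand side are hypothetical labels chosen for this sketch.
EVENT_BY_ACTION = {
    "add": "NEW_FILE",
    "remove": "REMOVED_FILE",
    "metaData": "METADATA_CHANGE",
}


def commit_to_events(commit_text):
    """Turn one commit (newline-delimited JSON actions) into change events."""
    events = []
    for line in commit_text.splitlines():
        if not line.strip():
            continue
        action = json.loads(line)
        for name, payload in action.items():
            kind = EVENT_BY_ACTION.get(name)
            if kind is None:
                continue  # e.g. commitInfo: not surfaced in this sketch
            # Metadata actions carry no file path, so "path" is None for them.
            events.append({"kind": kind, "path": payload.get("path")})
    return events


# Hypothetical sample commit, built in code so the sketch is self-contained.
sample_commit = "\n".join([
    json.dumps({"commitInfo": {"operation": "WRITE"}}),
    json.dumps({"metaData": {"id": "t1", "schemaString": "{}"}}),
    json.dumps({"add": {"path": "part-0000.parquet", "size": 10, "dataChange": True}}),
    json.dumps({"remove": {"path": "part-old.parquet", "dataChange": True}}),
])

events = commit_to_events(sample_commit)
for event in events:
    print(event["kind"], event["path"])
```

A real source would also read the referenced Parquet files and publish their rows with a schema, as the slide's diagram suggests; that part is omitted here.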

Editor's Notes

  1. Today I’m pretty excited to have the chance to talk about Delta Lake and Lakehouse.
  2. Before we jump into Lakehouse, I would like to talk about the challenges people are facing today. Data infrastructure is too complicated and expensive to manage for advanced use cases. Many systems are not open. They use proprietary formats, and you need to copy data between multiple systems. Teams are not well connected, which makes collaboration hard.
  3. This is one classic architecture example. The vast majority of data is flowing into a data lake. Companies do a lot of data validation so that they can serve data science and machine learning on top of these data lakes. At the same time, a huge amount of data is ETLed to many downstream data warehouses for business intelligence and other use cases. We have to do that because BI workloads are often too slow to run on a data lake directly. Depending on the workload, data also needs to be moved out of the data warehouse and back to the data lake if it has been updated in the data warehouse. Increasingly, machine learning workloads are also reading from and writing to the data warehouses at the same time.
  4. The root of the problem is the inherent difference between data lakes and data warehouses. On one hand, we have data lakes that do a great job supporting machine learning. They have open formats and a big ecosystem on top of them. But they have poor support for business intelligence and suffer from complex data quality problems. [CLICK] On the other hand, we have data warehouses that are great for business intelligence applications. But they have limited support for machine learning workloads, and they are proprietary systems with only a SQL interface.
  5. Unifying these systems can be transformational in how we think about data. This is why we’re such big believers in the lakehouse to provide one platform to unify all of your data, analytics, and AI to allow all members of the data team to collaborate together. By definition, a lakehouse is based on open standards and open source. Because without being open, it’s impossible to create the unification across all data types, all these tools, and workloads. And, the communities for the best open source projects to enable the lakehouse are in the audience today across Apache Spark, Delta Lake, MLflow, and Redash.
  6. The foundation of your lakehouse is Delta Lake. Delta Lake allows you to build a data quality framework for your data lake by ensuring the data is reliable via ACID transactions. In this model, data flows into Delta Lake from a wide variety of data sources. The data quality is improved incrementally, from raw data to intermediate data to clean data. Downstream applications such as machine learning, AI and BI can then build on top of clean, fresh and reliable data that fits the use cases it is intended for. Delta Lake is the foundation of this model: an open source project that enables building a lakehouse architecture on top of existing storage systems such as S3, ADLS, GCS, and HDFS.
  7. Today Delta Lake is used all over the world. More than 350 PB of data is processed on Delta Lake per day. 70% of the data scanned on the Databricks platform uses Delta Lake. And Delta Lake has been deployed by over 3,000 customers in their production lakehouse architectures.
  8. So what key features does Delta Lake provide to help you build your lakehouse? Delta Lake provides ACID transactions, scalable metadata handling to process tables with billions of files, and data versioning, which provides the ability to time travel. Data is stored in an open format so that users can leverage existing tools.
  9. In addition, Delta Lake unifies batch and streaming queries, making it easy to write either batch or streaming applications. Delta Lake automatically handles schema validation to prevent bad records from causing data corruption, and it supports schema evolution. It allows you to audit history through transaction logs. Delta Lake also supports DML operations such as update, delete and merge.
  10. We also continue to improve Delta Lake for more use cases, and there are multiple exciting features coming soon. Delta Lake will allow users to drop a column or rename a column. You will be able to replace a portion of the data in your table with new data atomically. The MERGE command will get better schema evolution support and generated column support. We also plan to make a release every 3 months so that people can get new Delta Lake features quickly.
  11. Moreover, the Delta Lake community is interested and excited to expand Delta Lake and make it available everywhere. We are building the Delta Standalone project to allow users to read and write Delta Lake natively. This enables the community to build connectors for a variety of systems, such as Pulsar, Flink, and Presto. Next, I will hand over to Addison to talk about the work on the Pulsar connector for Delta Lake.
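
The data versioning and time travel mentioned in the notes can be pictured as replaying an ordered log of file additions and removals up to a chosen version. The Python sketch below is a toy model of that idea under stated assumptions, not Delta Lake's implementation: the `snapshot_at` function, the tuple-based commit format, and the file names are all hypothetical.

```python
def snapshot_at(commits, version):
    """Replay commits 0..version and return the set of live data files."""
    if not 0 <= version < len(commits):
        raise ValueError(f"unknown version {version}")
    live = set()
    for actions in commits[: version + 1]:
        for kind, path in actions:
            if kind == "add":
                live.add(path)
            elif kind == "remove":
                live.discard(path)
    return live


# Hypothetical table history: each commit is a list of (action, file) pairs.
commits = [
    [("add", "a.parquet")],                           # version 0
    [("add", "b.parquet")],                           # version 1
    [("remove", "a.parquet"), ("add", "c.parquet")],  # version 2
]

latest = snapshot_at(commits, 2)
earlier = snapshot_at(commits, 1)
print(sorted(latest))
print(sorted(earlier))
```

Because older commits are never rewritten in this model, reading version 1 after version 2 exists still yields the earlier file set, which is the property that audits and rollbacks rely on.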