© 2023 Cloudera, Inc. All rights reserved.
Streaming Data Pipeline Development
Tim Spann
Principal Developer Advocate
25-April-2023
https://attend.cloudera.com/nificommitters0503
FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw
Apache NiFi x Apache Kafka x Apache Flink x Java
FLiP Stack Weekly
This week in Apache NiFi, Apache Flink, Apache
Kafka, Apache Spark, Apache Iceberg, Python,
Java and Open Source friends.
https://bit.ly/32dAJft
Future of Data - Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
FREE LEARNING ENVIRONMENT
CSP Community
Edition
• Kafka, KConnect, SMM, SR,
Flink, and SSB in Docker
• Runs in Docker
• Try new features quickly
• Develop applications locally
● Docker Compose file that runs CSP from the command line without any other
dependencies; it includes Flink, SQL Stream Builder, Kafka, Kafka
Connect, Streams Messaging Manager and Schema Registry
○ $> docker compose up
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications
STREAMING
WHAT IS REAL-TIME?
ENABLING ANALYTICS AND INSIGHTS ANYWHERE
Driving enterprise business value
[Diagram labels]
STREAMING DATA SOURCES: Clickstream, Market data, Machine logs, Social
ENTERPRISE DATA SOURCES: CRM, Customer history, Research, Compliance Data, Risk Data, Lending
STORAGE & PROCESSING: Stream Ingest, Ingest – Data at Rest, REAL-TIME STREAMING ENGINE, CENTRALIZED DATA PLATFORM
ANALYTICS & INSIGHTS: ANALYTICS & DATA WAREHOUSE (BI Solutions, SQL), DATA SCIENCE / MACHINE LEARNING (Model Building, Model Training, Model Scoring, Deploy Models, Predictive Analytics), Actions & Alerts, Real-Time Apps
STREAMING FROM … TO .. WHILE ..
Data distribution as a first class citizen
[Diagram labels]
Sources: IOT Devices, LOG DATA SOURCES, ON-PREM DATA SOURCES, App Logs, Laptops/Servers, Mobile Apps, Security Agents
UNIVERSAL DATA DISTRIBUTION (Ingest, Transform, Deliver): Ingest Processors, Ingest Gateway, Router, Filter & Transform Processors, Destination Processors
Destinations: BIG DATA CLOUD SERVICES, CLOUD BUSINESS PROCESS SERVICES, CLOUD DATA ANALYTICS/SERVICE (Cloudera DW), CLOUD WAREHOUSE
EVENT-DRIVEN ORGANIZATION
Modernize your data and applications
CDF Event Streaming Platform
Integration - Processing - Management - Cloud
[Diagram labels: Stream ETL, Cloud Storage, Application, Data Lake, Data Stores, Make Payment, µServices, Streams, Edge - IoT, Dashboard]
BUILDING REAL-TIME REQUIRES A TEAM
APACHE KAFKA
I Can Haz Data?
Yes, Franz, It’s Kafka
Let’s do a metamorphosis on your data. Don’t fear changing data.
You don’t need to be a brilliant writer to stream
data.
Franz Kafka was a German-speaking
Bohemian novelist and short-story writer,
widely regarded as one of the major figures of
20th-century literature. His work fuses
elements of realism and the fantastic.
Wikipedia
STREAMS MESSAGING WITH KAFKA
• Highly reliable distributed messaging system.
• Decouples applications and enables many-to-many
patterns.
• Publish-Subscribe semantics.
• Horizontal scalability.
• Efficient implementation to operate at speed with
big data volumes.
• Organized by topic to support several use cases.
What is Apache Kafka?
– Distributed: horizontally scalable
– Partitioned: the data is split-up and distributed across the brokers
– Replicated: allows for automatic failover
– Unique: Kafka does not track the consumption of messages (the consumers
do)
– Fast: designed from the ground up with a focus on performance and
throughput
– Kafka was built at LinkedIn and open sourced in 2011
– Developed as an Apache project
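The "partitioned" and "replicated" properties above can be sketched in a few lines of Python. This is a minimal illustration, not Kafka's implementation: `partition_for`, `assign_replicas` and the broker names are hypothetical, `zlib.crc32` stands in for the murmur2 hash that Kafka's default partitioner applies to keyed records, and the replica placement is deliberately simplified.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition with a stable hash.

    Assumption: crc32 is used here for illustration only.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_replicas(partition: int, brokers: list[str], replication_factor: int) -> list[str]:
    """Place replicas on consecutive brokers; the first entry acts as leader (simplified)."""
    return [brokers[(partition + i) % len(brokers)] for i in range(replication_factor)]


brokers = ["broker-0", "broker-1", "broker-2"]  # hypothetical broker names
p = partition_for("sensor-42", 6)
replicas = assign_replicas(p, brokers, 2)
print(f"key sensor-42 -> partition {p}, replicas {replicas}")
```

Because the hash is stable, every record with the same key lands in the same partition, which is what preserves per-key ordering. If the leader broker fails, a surviving replica can take over.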
What Can You Do With Apache Kafka?
• Web site activity: track page views, searches, etc. in real time
• Events & log aggregation: particularly in distributed systems where messages
come from multiple sources
• Monitoring and metrics: aggregate statistics from distributed applications and
build a dashboard application
• Stream processing: process raw data, clean it up, and forward it on to another
topic or messaging system
• Real-time data ingestion: fast processing of a very large volume of messages
KAFKA TERMINOLOGY
• Kafka is a publish/subscribe messaging system comprised of the
following components:
– Topic: a message feed
– Producer: a process that publishes messages to a topic
– Consumer: a process that subscribes to a topic and processes its messages
– Broker: a server in a Kafka cluster
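The terms above can be tied together with a toy in-memory model. It is a sketch under stated assumptions, not the Kafka client API: the `Topic` and `Consumer` classes are hypothetical, a single Python list plays the role of the broker-side log, and each consumer tracks its own offset (the point made earlier about Kafka not tracking consumption).

```python
class Topic:
    """A message feed: an append-only log held by a (hypothetical) broker."""

    def __init__(self, name):
        self.name = name
        self.log = []

    def publish(self, message):
        """Producer side: append a message and return its offset."""
        self.log.append(message)
        return len(self.log) - 1


class Consumer:
    """Subscribes to a topic and remembers how far it has read."""

    def __init__(self, topic):
        self.topic = topic
        self.offset = 0  # tracked by the consumer, not by the topic

    def poll(self, max_records=10):
        """Return the next batch of unread messages and advance the offset."""
        records = self.topic.log[self.offset:self.offset + max_records]
        self.offset += len(records)
        return records


payments = Topic("payments")
for message in ["a", "b", "c"]:
    payments.publish(message)

fraud_detection = Consumer(payments)
monitoring = Consumer(payments)
print(fraud_detection.poll(2))  # reads a batch of two
print(monitoring.poll())        # reads independently from offset 0
```

Since reading does not remove messages, any number of consumers can subscribe to the same topic at their own pace.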
Apache Kafka
[Diagram labels: Source System (x3), Kafka, Fraud Detection, Security Systems, Real-Time Monitoring; Many-To-Many Publish-Subscribe]
[Architecture diagram labels, edge to analytics]
DATA COLLECTION AT THE EDGE (MiNiFi): C++ agent for each of the US-West, US-Central and US-East fleets
INGEST GATEWAY POWERED BY KAFKA (Apache Kafka): topics gateway-west-raw-sensors, gateway-central-raw-sensors, gateway-east-raw-sensors
DATA FLOW APPS POWERED BY NIFI (Apache NiFi)
DATA SYNDICATE SERVICES (Apache Kafka): KAFKA CLUSTER GEO 1 and KAFKA CLUSTER GEO 2, each with topics syndicate-transmission, syndicate-temp, syndicate-speed, syndicate-geo; Replication / Data Deployment
STREAMING ANALYTICS APPS (Apache Flink, Structured Streaming): Micro Batch Analytics Stream Analytics App, Micro Services Stream Analytics App, Complex Low-Latency Stream Analytics App
APACHE FLINK
Flink SQL
https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html
● Streaming Analytics
● Continuous SQL
● Continuous ETL
● Complex Event Processing
● Standard SQL Powered by Apache Calcite
CONTINUOUS SQL
● SSB is a Continuous SQL engine
● It’s SQL with a slightly different mental model, and that has big implications
Traditional Parse/Execute/Fetch model vs. Continuous SQL model
Hint: The query is boundless and never finishes, and time matters
AKA: SELECT * FROM foo WHERE 1=0 -- will run forever
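The continuous model can be sketched with a Python generator. This is only an analogy, under stated assumptions: `sensor_stream` is a hypothetical unbounded source and `continuous_select` is a hypothetical stand-in for a `SELECT ... WHERE` that never finishes. It is not how SSB or Flink execute queries.

```python
import itertools
from typing import Callable, Iterable, Iterator


def sensor_stream() -> Iterator[dict]:
    """Hypothetical unbounded source: yields readings forever."""
    for i in itertools.count():
        yield {"id": i, "temp": 20 + (i % 15)}


def continuous_select(stream: Iterable[dict], predicate: Callable[[dict], bool]) -> Iterator[dict]:
    """Emit each matching row as it arrives; the loop only ends if the stream does."""
    for row in stream:
        if predicate(row):
            yield row


# The "query" is boundless, so the caller decides when to stop reading.
hot = continuous_select(sensor_stream(), lambda row: row["temp"] > 30)
first_three = list(itertools.islice(hot, 3))
print(first_three)
```

A traditional query returns a finished result set; here results trickle out for as long as the source produces rows, so the consumer has to choose when to stop reading.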
Flink SQL
-- specify Kafka partition key on output
SELECT foo AS _eventKey FROM sensors;
-- use event time timestamp from Kafka
-- exactly-once compatible
SELECT eventTimestamp FROM sensors;
-- nested structure access (my_table is a placeholder name)
SELECT foo.`bar` FROM my_table; -- must quote nested column
-- timestamps
SELECT * FROM payments
WHERE eventTimestamp > CURRENT_TIMESTAMP - INTERVAL '10' SECOND;
-- unnest
SELECT b.*, u.*
FROM bgp_avro b,
UNNEST(b.path) AS u(pathitem);
-- aggregations and windows
SELECT card,
MAX(amount) AS theamount,
TUMBLE_END(eventTimestamp, INTERVAL '5' MINUTE) AS ts
FROM payments
WHERE lat IS NOT NULL
AND lon IS NOT NULL
GROUP BY card,
TUMBLE(eventTimestamp, INTERVAL '5' MINUTE)
HAVING COUNT(*) > 4; -- more than 4 payments per window is treated as fraud
-- try to do this in ksql!
SELECT us_west.user_score + ap_south.user_score
FROM kafka_in_zone_us_west us_west
FULL OUTER JOIN kafka_in_zone_ap_south ap_south
ON us_west.user_id = ap_south.user_id;
Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
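To make the tumbling-window aggregation concrete, here is a minimal Python sketch of what the `TUMBLE` / `HAVING COUNT(*) > 4` query computes. It is an assumption-laden illustration: it evaluates a finite list in one pass, whereas Flink emits each window incrementally once event time passes the window end. The record fields (`card`, `amount`, `ts` in epoch seconds, `lat`, `lon`) and the function names are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # INTERVAL '5' MINUTE


def tumble_end(event_ts: int, size: int = WINDOW_SECONDS) -> int:
    """Exclusive end of the fixed, non-overlapping window containing event_ts."""
    return (event_ts // size) * size + size


def flag_cards(payments):
    """Group by (card, window), keep windows with more than 4 payments."""
    windows = defaultdict(list)
    for p in payments:
        # WHERE lat IS NOT NULL AND lon IS NOT NULL
        if p.get("lat") is None or p.get("lon") is None:
            continue
        windows[(p["card"], tumble_end(p["ts"]))].append(p["amount"])
    return [
        {"card": card, "theamount": max(amounts), "ts": end}
        for (card, end), amounts in sorted(windows.items())
        if len(amounts) > 4  # HAVING COUNT(*) > 4
    ]


sample = [
    {"card": "A", "amount": n, "ts": n * 10, "lat": 40.3, "lon": -74.6}
    for n in range(1, 6)
]
print(flag_cards(sample))
```

Each event belongs to exactly one window, so five payments on the same card within one five-minute bucket trip the rule, while the same five spread across two buckets do not.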
CLOUDERA SQL STREAM BUILDER
Making Streaming Analytics accessible to everyone with SQL
Application Developer
● Develop & test SQL queries with a
powerful UI
● Expose streaming data to
applications through materialized
views
● Single button “Push to
production” turns SQL queries into
Flink application
Business Analyst
● Explore Streaming Data using SQL
without learning new skills
● Build new real-time business
reporting applications
SQL STREAM BUILDER (SSB)
SQL STREAM BUILDER allows
developers, analysts, and data
scientists to write streaming
applications with industry
standard SQL.
No Java or Scala code
development required.
Simplifies access to data in Kafka
& Flink. Connectors to batch data in
HDFS, Kudu, Hive, S3, JDBC, CDC
and more
Enrich streaming data with batch
data in a single tool
Democratize access to real-time data with just SQL
SCHEMA
● AVRO - Schema Registry
● JSON - Schema Auto-detect
● Virtual Table design pattern
● Kafka Data Source
auto-created in SSB
{
"fields": [
{
"doc": "Type inferred from '215'",
"name": "userid",
"type": "long"
},
{
"doc": "Type inferred from '94204'",
"name": "amount",
"type": "long"
}
],
"name": "inferredSchema",
"type": "record"
}
Key Takeaway: Integrated with schema registry, also auto-detection for JSON types.
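JSON schema auto-detection can be approximated in a few lines. The sketch below is not SSB's detection logic; it is a hypothetical `infer_schema` helper that assumes a single flat JSON sample and only maps scalar values, producing a record shaped like the one shown above.

```python
import json


def infer_type(value) -> str:
    """Map a sampled JSON scalar to an Avro-style type name (flat records only)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise ValueError(f"unsupported sample value: {value!r}")


def infer_schema(sample_json: str, name: str = "inferredSchema") -> dict:
    """Build a record schema from one sample message."""
    record = json.loads(sample_json)
    fields = [
        {"doc": f"Type inferred from '{json.dumps(value)}'", "name": key, "type": infer_type(value)}
        for key, value in record.items()
    ]
    return {"fields": fields, "name": name, "type": "record"}


schema = infer_schema('{"userid": 215, "amount": 94204}')
print(json.dumps(schema, indent=2))
```

Inference from one sample is only a starting point; a field that is sometimes null or sometimes fractional would need a wider type than a single message reveals.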
SSB MATERIALIZED VIEWS
Key Takeaway: MVs allow data scientists, analysts and developers to consume data from the firehose
Streaming ETL Data Pipeline Made Simple with SQL Stream Builder
Write Streaming
Result to Kudu
Join 2 Streaming
User Event Topics
Enrich Stream from
Warehouse HR Table
Enrich Stream from RT
Mart Timesheet Table
Filter &
Transform
Streaming Data Lineage with SDX
DATA GOVERNANCE FOR THE ENTIRE STREAMING PIPELINE
• Track Consumer, Producer, Topics
and Consumer Group Lineage
• No changes required to
Consumers or Producers
• End-To-End lineage from consumer
to producer
SSB Projects - Container Structure for All Assets of SQL Streaming Job
SDLC for Streaming SQL Applications With First Class Git Integration
Project in SSB
SSB Project provides the container structure
for all the assets for your streaming app.
Project is configured with a git repository
SSB allows you to
push/import projects
to/from Git
Project Represented In Git
The streaming application assets in
git within the project structure
SDLC Life Cycle with SSB Projects
Step 1: Create SSB Project & Configure Git Repo
Step 2: Run Service Discovery to register Kafka, Hive, etc.
Step 3: Create/Develop Streaming Assets & Test
Step 4: Check-in Project Into Git
Step 5: Import Project from Git into SSB Prod, Setup Monitoring & Deploy
Moving Beyond Draining of Streams Into Lakes: Analytics-in-Stream
[Diagram labels]
Data Sources
Data Distribution Service: Cloudera DataFlow
Streaming Storage Substrate: Cloudera Stream Processing
Kafka + NiFi enables real-time ingestion into lakes / analytics services
Data-At-Rest Analytics: Warehouses & Operational DB, Data Lakes & Lake Houses
Data-In-Motion Streaming Analytics: Cloudera Stream Processing, Streaming Analytics, Low Latency Data Products
Kafka + Flink enables streaming analytics
Data Apps Powered by Streaming Insights and used by other Analytics Services
DATAFLOW
APACHE NIFI
Cloudera DataFlow: Universal Data Distribution Service
[Diagram labels]
Ingest: Active (Connectors, Connect & Pull), Passive (Gateway Endpoint, Send); Data born in the cloud, Data born outside the cloud
Process: Route, Filter, Enrich, Transform
Distribute / Deliver: Connectors, Any destination
UNIVERSAL DATA DISTRIBUTION WITH CLOUDERA DATAFLOW (CDF)
Connect to Any Data Source Anywhere then Process and Deliver to Any Destination
CLOUDERA DATAFLOW - POWERED BY APACHE NiFi
Ingest and manage data from edge-to-cloud using a no-code interface
● #1 data ingestion/movement engine
● Strong community
● Product maturity over 11 years
● Deploy on-premises or in the cloud
● 400+ pre-built processors
● Built-in data provenance
● Guaranteed delivery
● Throttling and Backpressure
CLOUDERA FLOW AND EDGE MANAGEMENT
Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data
center) to any downstream system with built in end-to-end security and provenance
Advanced tooling to industrialize
flow development (Flow Development
Life Cycle)
ACQUIRE
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merge & Route
• Throttle & Backpressure
FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
PROCESS
HASH, MERGE, EXTRACT, DUPLICATE, SPLIT, ENCRYPT, TAIL, EVALUATE, EXECUTE, GEOENRICH, SCAN, REPLACE, TRANSLATE, CONVERT, ROUTE TEXT, ROUTE CONTENT, ROUTE CONTEXT, ROUTE RATE, DISTRIBUTE LOAD
DELIVER
• Guaranteed Delivery
• Full data provenance from acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
FTP, SFTP, HL7, UDP, XML, HTTP, EMAIL, HTML, IMAGE, SYSLOG
Processing one million events per second with Apache NiFi
https://blog.cloudera.com/benchmarking-nifi-performance-and-scalability/
PROVENANCE
EXTENSIBILITY
• Built from the ground up with extensions in mind
• Service-loader pattern for…
– Processors
– Controller Services
– Reporting Tasks
– Prioritizers
• Extensions packaged as NiFi Archives (NARs)
– Deploy to the NiFi lib directory and restart
– Same model as standard components
NiFi Load Balancing
• Improve NiFi cluster throughput
• Defined at connection level
• Configurable balancing
strategies
• Critical for scale up paradigm in
Kubernetes
• Alleviates S2S balancing “hack”
customers use
QUEUE CONFIGURATION
• FlowFile Expiration - Data that cannot be processed in a timely
fashion can be automatically removed from the flow.
• Back Pressure Thresholds - Thresholds indicate how much data
should be allowed to exist in the queue before the component
that is the source of the Connection is no longer scheduled to
run. This allows the system to avoid being overrun with data.
• Load Balance Strategy – Strategy to distribute the data in a flow
across the nodes in the cluster. When enabled, compression can
be configured on FlowFile contents and attributes.
• Prioritization – Determines the order in which flow files are
processed.
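The interplay of these queue settings can be sketched with a toy connection. This is not NiFi code: the `Connection` class, its parameters and the numeric `priority` attribute (lower value served first) are hypothetical, and the model only counts objects, ignoring data-size thresholds and load balancing.

```python
import time


class Connection:
    """Toy queue between two components with back pressure, expiration and priority."""

    def __init__(self, object_threshold, expiration_seconds, clock=time.monotonic):
        self.object_threshold = object_threshold
        self.expiration_seconds = expiration_seconds
        self.clock = clock
        self.queue = []  # list of (enqueue_time, flowfile)

    def source_may_run(self):
        """Back pressure: the source is not scheduled once the threshold is reached."""
        return len(self.queue) < self.object_threshold

    def offer(self, flowfile):
        """Enqueue a flowfile if the source is still allowed to run."""
        if not self.source_may_run():
            return False
        self.queue.append((self.clock(), flowfile))
        return True

    def poll(self):
        """Drop expired flowfiles, then hand out the one with the lowest priority value."""
        now = self.clock()
        self.queue = [(t, f) for (t, f) in self.queue if now - t <= self.expiration_seconds]
        if not self.queue:
            return None
        index = min(range(len(self.queue)), key=lambda i: self.queue[i][1]["priority"])
        return self.queue.pop(index)[1]


conn = Connection(object_threshold=2, expiration_seconds=60)
conn.offer({"id": "a", "priority": 5})
conn.offer({"id": "b", "priority": 1})
print(conn.source_may_run())  # the queue is at its threshold
print(conn.poll()["id"])
```

The threshold protects downstream components, expiration bounds how stale queued data can get, and the prioritizer decides what is served first once capacity frees up.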
RECORD-ORIENTED DATA WITH NIFI
• Record Readers - Avro, CSV, Grok, IPFIX, JASN1, JSON, Parquet,
Scripted, Syslog5424, Syslog, WindowsEvent, XML
• Record Writers - Avro, CSV, FreeFormText, JSON, Parquet,
Scripted, XML
• Record Reader and Writer support referencing a schema registry
for retrieving schemas when necessary.
• Enable processors that accept any data format without having to
worry about the parsing and serialization logic.
• Allows us to keep FlowFiles larger, each consisting of multiple
records, which results in far better performance.
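The reader/writer split can be illustrated with the Python standard library. This is a sketch of the idea only, not NiFi's Record API: `read_csv`, `write_json` and `convert` are hypothetical names, and the whole FlowFile content is assumed to fit in memory as a string.

```python
import csv
import io
import json


def read_csv(content: str) -> list[dict]:
    """Reader: parse the content of one FlowFile into a list of records."""
    return [dict(row) for row in csv.DictReader(io.StringIO(content))]


def write_json(records: list[dict]) -> str:
    """Writer: serialize all records back into a single FlowFile body."""
    return json.dumps(records)


def convert(content: str, reader, writer) -> str:
    """Format-agnostic step: it only sees records, never the wire format."""
    return writer(reader(content))


flowfile_content = "id,temp\n1,20\n2,31\n"
print(convert(flowfile_content, read_csv, write_json))
```

Swapping `read_csv` or `write_json` for another reader or writer changes the format without touching `convert`, and one FlowFile carries many records instead of one record per file.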
RUNNING SQL ON FLOWFILES
• Evaluates one or more SQL queries against the contents of a
FlowFile.
• This can be used, for example, for field-specific filtering,
transformation, and row-level filtering.
• Columns can be renamed, simple calculations and aggregations
performed.
• The SQL statement must be valid ANSI SQL and is powered by
Apache Calcite.
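To show the idea of querying FlowFile content with SQL, the sketch below loads records into an in-memory SQLite table and runs a query against it. Assumptions are significant here: SQLite stands in for Apache Calcite, the `query_flowfile` helper is hypothetical, and the table name `FLOWFILE` follows the convention of NiFi's QueryRecord processor. Column names are taken from the first record and are not sanitized.

```python
import sqlite3


def query_flowfile(records: list[dict], sql: str) -> list[dict]:
    """Run one SQL statement against records exposed as a table named FLOWFILE."""
    conn = sqlite3.connect(":memory:")
    try:
        columns = list(records[0].keys())
        conn.execute(f"CREATE TABLE FLOWFILE ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(
            f"INSERT INTO FLOWFILE VALUES ({placeholders})",
            [tuple(record[c] for c in columns) for record in records],
        )
        cursor = conn.execute(sql)
        names = [d[0] for d in cursor.description]
        return [dict(zip(names, row)) for row in cursor.fetchall()]
    finally:
        conn.close()


records = [
    {"card": "A", "amount": 10},
    {"card": "B", "amount": 50},
    {"card": "A", "amount": 30},
]
result = query_flowfile(
    records,
    "SELECT card, SUM(amount) AS total FROM FLOWFILE WHERE amount > 5 GROUP BY card ORDER BY card",
)
print(result)
```

The query covers the cases listed above in one statement: row-level filtering (`WHERE`), column renaming (`AS total`) and a simple aggregation (`SUM`).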
Apache NiFi with Python Custom Processors
Python as a 1st class citizen
READYFLOW
GALLERY
• Cloudera provided flow
definitions
• Cover most common data flow
use cases
• Optimized to work with CDP
sources/destinations
• Can be deployed and adjusted
as needed
FLOW CATALOG
• Central repository for flow
definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments
DEPLOYMENT
WIZARD
• Turns flow definitions into flow
deployments
• Guides users through providing
required configuration
• Choose NiFi runtime version
• Pick from pre-defined NiFi node sizes
• Define KPIs for the deployment
Start Deployment Wizard Provide Parameters
Configure Sizing & Scaling Define KPIs
KEY
PERFORMANCE
INDICATORS
• Visibility into flow deployments
• Track high level flow
performance
• Track in-depth NiFi component
metrics
• Defined in Deployment Wizard
• Monitoring & Alerts in
Deployment Details
KPI Definition in Deployment Wizard KPI Monitoring
DASHBOARD
• Central Monitoring View
• Monitors flow deployments
across CDP environments
• Monitors flow deployment
health & performance
• Drill into flow deployment to
monitor system metrics and
deployment events
DEPLOYMENT
MANAGER
• Manage flow deployment
lifecycle
(Suspend/Start/Terminate)
• Add/Edit KPIs
• Change sizing configuration
• Update parameters
• Change NiFi version of the
deployment
• Gateway to NiFi canvas
NIFI VERSION
UPGRADES
• Pick up NiFi hotfixes easily
• Upgrade (or downgrade) the
hotfix version of existing
deployments
• Rolling upgrade (if the
deployment has more than one NiFi node)
BEST PRACTICES
STREAMING TECH DEBT TIPS
• Version Control All Assets
• Managed Public Cloud like Cloudera
• Use DevOps and APIs
• Latest Java and Python
• Stream Sizing (NiFi, Kafka, Flink)
Streaming
Solutions
When to use what?
Routing vs Analytics
Listeners
Joins
In-Memory
Operational Load
Current Skills
Use NiFi
Doing more than just Syndication
Not just small Kafka sized events
Edge Management is needed
Listener Type use cases that bind to ports
Lightweight ETL, Lineage, Provenance, Message Replay
Use Flink
Joining Streams
Windowing
Late Data Handling
Streaming Analytics
Use KConnect
Kafka Centric
In-Memory Stateless
RESOURCES AND WRAP-UP
Resources
Upcoming Events
April 26
May 10
May 9
THANK YOU

More Related Content

What's hot

The delta architecture
The delta architectureThe delta architecture
The delta architecture
Prakash Chockalingam
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
DataScienceConferenc1
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
Databricks
 
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data ScienceGet Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Neo4j
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
Trivadis
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Kai Wähner
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
Databricks
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
Alluxio, Inc.
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
DataScienceConferenc1
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
Databricks
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
Kai Wähner
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
Jeffrey T. Pollock
 
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with Databricks
Databricks
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
Adam Doyle
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
Databricks
 
Data Mesh 101
Data Mesh 101Data Mesh 101
Data Mesh 101
ChrisFord803185
 
Modern Data Flow
Modern Data FlowModern Data Flow
Modern Data Flow
confluent
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
Databricks
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
Modern Data Stack France
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
James Serra
 

What's hot (20)

The delta architecture
The delta architectureThe delta architecture
The delta architecture
 
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
[DSC Europe 22] Overview of the Databricks Platform - Petar Zecevic
 
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
The Future of Data Science and Machine Learning at Scale: A Look at MLflow, D...
 
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data ScienceGet Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
Get Started with the Most Advanced Edition Yet of Neo4j Graph Data Science
 
TechEvent Databricks on Azure
TechEvent Databricks on AzureTechEvent Databricks on Azure
TechEvent Databricks on Azure
 
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse ArchitectureServerless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
Serverless Kafka and Spark in a Multi-Cloud Lakehouse Architecture
 
Modernizing to a Cloud Data Architecture
Modernizing to a Cloud Data ArchitectureModernizing to a Cloud Data Architecture
Modernizing to a Cloud Data Architecture
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
[DSC Europe 22] Lakehouse architecture with Delta Lake and Databricks - Draga...
 
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
ACID ORC, Iceberg, and Delta Lake—An Overview of Table Formats for Large Scal...
 
Kappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology ComparisonKappa vs Lambda Architectures and Technology Comparison
Kappa vs Lambda Architectures and Technology Comparison
 
Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3Webinar Data Mesh - Part 3
Webinar Data Mesh - Part 3
 
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with Databricks
 
Snowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at ScaleSnowflake Data Science and AI/ML at Scale
Snowflake Data Science and AI/ML at Scale
 
Building Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta LakeBuilding Data Quality pipelines with Apache Spark and Delta Lake
Building Data Quality pipelines with Apache Spark and Delta Lake
 
Data Mesh 101
Data Mesh 101Data Mesh 101
Data Mesh 101
 
Modern Data Flow
Modern Data FlowModern Data Flow
Modern Data Flow
 
Hyperspace for Delta Lake
Hyperspace for Delta LakeHyperspace for Delta Lake
Hyperspace for Delta Lake
 
From Data Warehouse to Lakehouse
From Data Warehouse to LakehouseFrom Data Warehouse to Lakehouse
From Data Warehouse to Lakehouse
 
Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)Data Lakehouse, Data Mesh, and Data Fabric (r1)
Data Lakehouse, Data Mesh, and Data Fabric (r1)
 

Similar to Meetup: Streaming Data Pipeline Development

Meetup Streaming Data Pipeline Development
Meetup Streaming Data Pipeline DevelopmentMeetup Streaming Data Pipeline Development
Meetup Streaming Data Pipeline Development
Timothy Spann
 
Future of Data Milwaukee Meetup Streaming Data Pipeline Development 28 June 2023
Future of Data Milwaukee Meetup Streaming Data Pipeline Development 28 June 2023Future of Data Milwaukee Meetup Streaming Data Pipeline Development 28 June 2023
Future of Data Milwaukee Meetup Streaming Data Pipeline Development 28 June 2023
ssuser73434e
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
Timothy Spann
 
ITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming AppsITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming Apps
Timothy Spann
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
Timothy Spann
 
BigDataFest_ Building Modern Data Streaming Apps
BigDataFest_  Building Modern Data Streaming AppsBigDataFest_  Building Modern Data Streaming Apps
BigDataFest_ Building Modern Data Streaming Apps
ssuser73434e
 
big data fest building modern data streaming apps
big data fest building modern data streaming appsbig data fest building modern data streaming apps
big data fest building modern data streaming apps
Timothy Spann
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
Unconference Round Table Notes
Meetup: Streaming Data Pipeline Development

  • 1. © 2023 Cloudera, Inc. All rights reserved. Streaming Data Pipeline Development Tim Spann Principal Developer Advocate 25-April-2023
  • 3. https://attend.cloudera.com/nificommitters0503
  • 4. FLaNK Stack: Apache NiFi x Apache Kafka x Apache Flink x Java. Tim Spann, Principal Developer Advocate, Princeton Future of Data Meetup. ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC. @PaasDev // Blog: www.datainmotion.dev // https://medium.com/@tspann // https://github.com/tspannhw
  • 5. Tim Spann, Principal Developer Advocate | Cloudera
  • 7. FLiP Stack Weekly: This week in Apache NiFi, Apache Flink, Apache Kafka, Apache Spark, Apache Iceberg, Python, Java and Open Source friends. https://bit.ly/32dAJft
  • 8. Future of Data - Princeton + Virtual. @PaasDev https://www.meetup.com/futureofdata-princeton/ From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
  • 9. FREE LEARNING ENVIRONMENT
  • 10. CSP Community Edition
      • Kafka, KConnect, SMM, SR, Flink, and SSB in Docker
      • Runs in Docker
      • Try new features quickly
      • Develop applications locally
      ● Docker compose file of CSP to run from command line w/o any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry
          ○ $> docker compose up
      ● Licensed under the Cloudera Community License
      ● Unsupported
      ● Community Group Hub for CSP
      ● Find it on docs.cloudera.com under Applications
  • 11. STREAMING
  • 12. WHAT IS REAL-TIME?
  • 13. ENABLING ANALYTICS AND INSIGHTS ANYWHERE: Driving enterprise business value
      [Diagram labels] STREAMING DATA SOURCES: Clickstream, Market data, Machine logs, Social. ENTERPRISE DATA SOURCES: CRM, Customer history, Research, Compliance Data, Risk Data, Lending. Components: REAL-TIME STREAMING ENGINE; ANALYTICS & DATA WAREHOUSE; DATA SCIENCE/MACHINE LEARNING; CENTRALIZED DATA PLATFORM; STORAGE & PROCESSING; ANALYTICS & INSIGHTS. Flows and outputs: Stream Ingest; Ingest – Data at Rest; Deploy Models; BI Solutions; SQL; Predictive Analytics (Model Building, Model Training, Model Scoring); Actions & Alerts; [SQL]; Real-Time Apps.
  • 14. STREAMING FROM … TO .. WHILE ..: Data distribution as a first-class citizen
      [Diagram labels] IOT Devices; LOG DATA SOURCES; ON-PREM DATA SOURCES; BIG DATA CLOUD SERVICES; CLOUD BUSINESS PROCESS SERVICES *; CLOUD DATA* ANALYTICS/SERVICE (Cloudera DW); App Logs; Laptops/Servers; Mobile Apps; Security Agents; CLOUD WAREHOUSE; UNIVERSAL DATA DISTRIBUTION (Ingest, Transform, Deliver); Ingest Processors; Ingest Gateway; Router, Filter & Transform Processors; Destination Processors.
  • 15. EVENT-DRIVEN ORGANIZATION: Modernize your data and applications
      [Diagram labels] CDF Event Streaming Platform (Integration - Processing - Management - Cloud); Stream ETL; Cloud Storage; Application; Data Lake; Data Stores; Make Payment; µServices; Streams; Edge - IoT; Dashboard.
  • 16. BUILDING REAL-TIME REQUIRES A TEAM
  • 17. APACHE KAFKA: I Can Haz Data?
  • 18. Yes, Franz, It’s Kafka. Let’s do a metamorphosis on your data. Don’t fear changing data. You don’t need to be a brilliant writer to stream data. Franz Kafka was a German-speaking Bohemian novelist and short-story writer, widely regarded as one of the major figures of 20th-century literature. His work fuses elements of realism and the fantastic. (Wikipedia)
  • 19. STREAMS MESSAGING WITH KAFKA
      • Highly reliable distributed messaging system.
      • Decouples applications, enables many-to-many patterns.
      • Publish-Subscribe semantics.
      • Horizontal scalability.
      • Efficient implementation to operate at speed with big data volumes.
      • Organized by topic to support several use cases.
  • 20. What is Apache Kafka?
      – Distributed: horizontally scalable
      – Partitioned: the data is split up and distributed across the brokers
      – Replicated: allows for automatic failover
      – Unique: Kafka does not track the consumption of messages (the consumers do)
      – Fast: designed from the ground up with a focus on performance and throughput
      – Kafka was built at LinkedIn in 2011
      – Open sourced as an Apache project
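
The "Partitioned" and "Unique" points on slide 20 can be sketched with a toy in-memory log. This is not a Kafka client: `ToyTopic` and `ToyConsumer` are hypothetical names invented for illustration, and `crc32` stands in for Kafka's default murmur2 key hash. The sketch only assumes that records with the same key land in the same partition and that each consumer keeps its own read positions.

```python
from collections import defaultdict
import zlib


class ToyTopic:
    """In-memory stand-in for a partitioned topic (hypothetical, not Kafka)."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key):
        # Stable hash so the same key always maps to the same partition.
        # crc32 is an illustrative stand-in for Kafka's murmur2 hash.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def publish(self, key, value):
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p


class ToyConsumer:
    """Each consumer remembers its own offsets; the topic stores none."""

    def __init__(self, topic):
        self.topic = topic
        self.offsets = defaultdict(int)  # partition index -> next offset to read

    def poll(self):
        records = []
        for p, log in enumerate(self.topic.partitions):
            records.extend(log[self.offsets[p]:])
            self.offsets[p] = len(log)  # advance this consumer's position only
        return records


topic = ToyTopic(num_partitions=3)
for i in range(6):
    topic.publish(f"sensor-{i % 2}", i)

dashboard = ToyConsumer(topic)
audit = ToyConsumer(topic)

first_batch = dashboard.poll()   # reads all six records
second_batch = dashboard.poll()  # nothing new: dashboard's offsets advanced
audit_batch = audit.poll()       # a separate consumer still sees all six
```

Because the read position lives in the consumer, adding a second reader costs the log nothing, which is one reason the many-to-many pattern on the later slides is cheap.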
  • 21. What Can You Do With Apache Kafka?
      • Web site activity: track page views, searches, etc. in real time
      • Events & log aggregation: particularly in distributed systems where messages come from multiple sources
      • Monitoring and metrics: aggregate statistics from distributed applications and build a dashboard application
      • Stream processing: process raw data, clean it up, and forward it on to another topic or messaging system
      • Real-time data ingestion: fast processing of a very large volume of messages
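
The "stream processing" use case on slide 21 (read raw data, clean it, forward it) might look roughly like the following. Plain Python lists stand in for the raw and cleaned topics, and the record format (`sensor,reading`) and the `clean` function are assumptions made up for this sketch, not something from the deck.

```python
def clean(raw_records):
    """Drop malformed readings and normalize the rest, one record at a time."""
    for line in raw_records:
        parts = line.strip().split(",")
        if len(parts) != 2:
            continue  # malformed: wrong number of fields
        sensor = parts[0].strip().lower()
        try:
            temp = float(parts[1].strip())
        except ValueError:
            continue  # malformed: reading is not a number
        yield {"sensor": sensor, "temp_c": temp}


# Hypothetical raw input; in a real pipeline this would be consumed from a topic.
raw_topic = ["S1, 21.5\n", "garbage", "s2,n/a", " S2 ,19.0 "]

# "Forwarding" here is just collecting into another list.
clean_topic = list(clean(raw_topic))
```

Writing the step as a generator keeps it record-at-a-time, so the same function would work on an endless source as well as on this short list.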
  • 22. KAFKA TERMINOLOGY
      • Kafka is a publish/subscribe messaging system comprised of the following components:
          – Topic: a message feed
          – Producer: a process that publishes messages to a topic
          – Consumer: a process that subscribes to a topic and processes its messages
          – Broker: a server in a Kafka cluster
  • 23. Apache Kafka
      • Highly reliable distributed messaging system
      • Decouples applications, enables many-to-many patterns
      • Publish-Subscribe semantics
      • Horizontal scalability
      • Efficient implementation to operate at speed with big data volumes
      • Organized by topic to support several use cases
      [Diagram labels] Source System (x3) -> Kafka -> Fraud Detection, Security Systems, Real-Time Monitoring. Many-To-Many Publish-Subscribe.
  • 24. [Architecture diagram labels]
      DATA COLLECTION AT THE EDGE (MiNiFi): C++ agent US-West Fleet; C++ agent US-Central Fleet; C++ agent US-East Fleet
      INGEST GATEWAY POWERED BY KAFKA (Apache Kafka): gateway-west-raw-sensors; gateway-central-raw-sensors; gateway-east-raw-sensors
      DATA FLOW APPS POWERED BY NIFI (Apache NiFi)
      DATA SYNDICATE SERVICES (Apache Kafka), KAFKA CLUSTER GEO 1 and KAFKA CLUSTER GEO 2, each with Kafka topics: syndicate-transmission; syndicate-temp; syndicate-speed; syndicate-geo. Replication / Data Deployment between clusters.
      STREAMING ANALYTICS APPS (Apache Flink, Structured Streaming): Micro Batch Analytics; Stream Analytics App; Micro Services Stream Analytics App; Complex Low Latent Stream Analytics App
  • 25. © 2023 Cloudera, Inc. All rights reserved. APACHE FLINK
  • 26. © 2023 Cloudera, Inc. All rights reserved. 26 Flink SQL https://www.datainmotion.dev/2021/04/cloudera-sql-stream-builder-ssb-updated.html ● Streaming Analytics ● Continuous SQL ● Continuous ETL ● Complex Event Processing ● Standard SQL Powered by Apache Calcite
  • 27. CONTINUOUS SQL ● SSB is a Continuous SQL engine ● It's SQL, with a slightly different mental model that has big implications: the traditional Parse/Execute/Fetch model vs. the Continuous SQL model. Hint: the query is boundless and never finishes, and time matters. AKA: SELECT * FROM foo WHERE 1=0 -- will run forever
  • 28. Flink SQL
      -- specify Kafka partition key on output
      SELECT foo AS _eventKey FROM sensors
      -- use event time timestamp from kafka
      -- exactly once compatible
      SELECT eventTimestamp FROM sensors
      -- nested structures access
      SELECT foo.'bar' FROM table; -- must quote nested column
      -- timestamps
      SELECT * FROM payments
      WHERE eventTimestamp > CURRENT_TIMESTAMP-interval '10' second;
      -- unnest
      SELECT b.*, u.* FROM bgp_avro b, UNNEST(b.path) AS u(pathitem)
      -- aggregations and windows
      SELECT card, MAX(amount) as theamount,
             TUMBLE_END(eventTimestamp, interval '5' minute) as ts
      FROM payments
      WHERE lat IS NOT NULL AND lon IS NOT NULL
      GROUP BY card, TUMBLE(eventTimestamp, interval '5' minute)
      HAVING COUNT(*) > 4 -- >4==fraud
      -- try to do this ksql!
      SELECT us_west.user_score+ap_south.user_score
      FROM kafka_in_zone_us_west us_west
      FULL OUTER JOIN kafka_in_zone_ap_south ap_south
      ON us_west.user_id = ap_south.user_id;
    Key Takeaway: Rich SQL grammar with advanced time and aggregation tools
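To show what the tumbling-window aggregation in the slide computes, here is a rough Python sketch of the same logic over a finite list of events. It is not how Flink executes the query: there are no watermarks, state backends, or late-data handling, and the `lat`/`lon` filter is omitted. The function name `tumbling_fraud_check` and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

WINDOW_SECONDS = 5 * 60  # interval '5' minute


def tumbling_fraud_check(payments, min_count=5):
    """payments: iterable of (card, amount, event_ts_seconds) tuples.

    Groups events by card and by fixed, non-overlapping 5-minute windows,
    then keeps only groups with more than 4 events (HAVING COUNT(*) > 4).
    Returns (card, max_amount, window_end) tuples.
    """
    windows = defaultdict(list)
    for card, amount, ts in payments:
        window_start = ts - (ts % WINDOW_SECONDS)  # align to window boundary
        windows[(card, window_start)].append(amount)

    results = []
    for (card, window_start), amounts in sorted(windows.items()):
        if len(amounts) >= min_count:
            # window_start + WINDOW_SECONDS plays the role of TUMBLE_END
            results.append((card, max(amounts), window_start + WINDOW_SECONDS))
    return results


# Five payments for card A inside the window [0, 300), plus unrelated events.
payments = [("A", 10 * i, 60 + i) for i in range(1, 6)]
payments += [("B", 99, 10), ("A", 500, 310)]
flagged = tumbling_fraud_check(payments)
```

Only card A in the first window is flagged; the later payment of 500 falls into the next window and does not count toward the first one.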
  • 29. CLOUDERA SQL STREAM BUILDER Making Streaming Analytics accessible to everyone with SQL Application Developer ● Develop & test SQL queries with a powerful UI ● Expose streaming data to applications through materialized views ● Single-button “Push to production” turns SQL queries into a Flink application Business Analyst ● Explore streaming data using SQL without learning new skills ● Build new real-time business reporting applications
  • 30. 30 © 2022 Cloudera, Inc. All rights reserved. SQL STREAM BUILDER (SSB) SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry standard SQL. No Java or Scala code development required. Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more Enrich streaming data with batch data in a single tool Democratize access to real-time data with just SQL
  • 31. © 2023 Cloudera, Inc. All rights reserved. 31 SCHEMA ● AVRO - Schema Registry ● JSON - Schema Auto-detect ● Virtual Table design pattern ● Kafka Data Source auto-created in SSB { "fields": [ { "doc": "Type inferred from '215'", "name": "userid", "type": "long" }, { "doc": "Type inferred from '94204'", "name": "amount", "type": "long" } ], "name": "inferredSchema", "type": "record" } Key Takeaway: Integrated with schema registry, also auto-detection for JSON types.
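The JSON auto-detection idea can be sketched in a few lines of Python. This is only an illustration of type inference from a sample message, not SSB's actual detection logic; the function name `infer_schema` and its type-mapping rules are assumptions, and anything that is not a boolean or number (including nulls and nested values) is simply mapped to "string" here.

```python
import json


def infer_schema(sample_json, name="inferredSchema"):
    """Build an Avro-style record schema from one sample JSON object."""
    record = json.loads(sample_json)
    fields = []
    for key, value in record.items():
        # bool must be checked before int because bool is a subclass of int
        if isinstance(value, bool):
            avro_type = "boolean"
        elif isinstance(value, int):
            avro_type = "long"
        elif isinstance(value, float):
            avro_type = "double"
        else:
            avro_type = "string"
        fields.append(
            {"doc": f"Type inferred from '{value}'", "name": key, "type": avro_type}
        )
    return {"fields": fields, "name": name, "type": "record"}


schema = infer_schema('{"userid": 215, "amount": 94204}')
print(json.dumps(schema, indent=2))
```

For the sample message used here, the result has the same shape as the inferred schema shown on the slide.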
  • 32. SSB MATERIALIZED VIEWS Key Takeaway: MVs allow data scientists, analysts, and developers to consume data from the firehose
  • 33. © 2023 Cloudera, Inc. All rights reserved. 33 Streaming ETL Data Pipeline Made Simple with SQL StreamBuilder Write Streaming Result to Kudu Join 2 Streaming User Event Topics Enrich Stream from Warehouse HR Table Enrich Stream from RT Mart Timesheet Table Filter & Transform
  • 34. Streaming Data Lineage with SDX DATA GOVERNANCE FOR THE ENTIRE STREAMING PIPELINE • Track Consumer, Producer, Topic, and Consumer Group lineage • No changes required to Consumers or Producers • End-to-end lineage from producer to consumer
  • 35. © 2023 Cloudera, Inc. All rights reserved. 35 SSB Projects - Container Structure for All Assets of SQL Streaming Job SDLC for Streaming SQL Applications With First Class Git Integration Project in SSB SSB Project provides the container structure for all the assets for your streaming app. Project is configured with a git repository SSB allows you to push/import projects to/from Git Project Represented In Git The streaming application assets in git within the project structure
  • 36. © 2023 Cloudera, Inc. All rights reserved. 36 SDLC Life Cycle with SSB Projects Create SSB Project & Configure Git Repo Step 1 Run Service Discovery to register Kafka, Hive, etc Step 2 Create/Develop Streaming Assets & Test Step 3 Check-in Project Into Git Step 4 Import Project from Git into SSB Prod, Setup Monitoring & Deploy Step 5
  • 37. © 2023 Cloudera, Inc. All rights reserved. 37 Moving Beyond Draining of Streams Into Lakes: Analytics-in-Stream Data Sources Streaming Storage Substrate Cloudera Stream Processing Kafka + NiFi enables real-time ingestion into lakes / analytics services Data Distribution Service Cloudera DataFlow Warehouses & Operational DB Data Lakes & Lake Houses Data-At-Rest Analytics Data Apps Powered by Streaming Insights and used by other Analytics Services Kafka + Flink enables streaming analytics Cloudera Stream Processing Streaming Analytics Low Latency Data Products Data-In-Motion Streaming Analytics
  • 38. © 2023 Cloudera, Inc. All rights reserved. DATAFLOW APACHE NIFI
  • 39. © 2023 Cloudera, Inc. All rights reserved. 39 Cloudera DataFlow: Universal Data Distribution Service Process Route Filter Enrich Transform Distribute Connectors Any destination Deliver Ingest Active Passive Connectors Gateway Endpoint Connect & Pull Send Data born in the cloud Data born outside the cloud UNIVERSAL DATA DISTRIBUTION WITH CLOUDERA DATAFLOW (CDF) Connect to Any Data Source Anywhere then Process and Deliver to Any Destination
  • 40. CLOUDERA DATAFLOW - POWERED BY APACHE NiFi Ingest and manage data from edge-to-cloud using a no-code interface ● #1 data ingestion/movement engine ● Strong community ● Product maturity over 11 years ● Deploy on-premises or in the cloud ● 400+ pre-built processors ● Built-in data provenance ● Guaranteed delivery ● Throttling and Backpressure
  • 41. CLOUDERA FLOW AND EDGE MANAGEMENT Enable easy ingestion, routing, management and delivery of any data anywhere (edge, cloud, data center) to any downstream system with built-in end-to-end security and provenance. Advanced tooling to industrialize flow development (Flow Development Life Cycle) ACQUIRE • Over 300 Prebuilt Processors • Easy to build your own • Parse, Enrich & Apply Schema • Filter, Split, Merge & Route • Throttle & Backpressure FTP SFTP HL7 UDP XML HTTP EMAIL HTML IMAGE SYSLOG PROCESS HASH MERGE EXTRACT DUPLICATE SPLIT ENCRYPT TALL EVALUATE EXECUTE GEOENRICH SCAN REPLACE TRANSLATE CONVERT ROUTE TEXT ROUTE CONTENT ROUTE CONTEXT ROUTE RATE DISTRIBUTE LOAD DELIVER • Guaranteed Delivery • Full data provenance from acquisition to delivery • Diverse, Non-Traditional Sources • Eco-system integration FTP SFTP HL7 UDP XML HTTP EMAIL HTML IMAGE SYSLOG
  • 42. Processing one million events per second with Apache NiFi https://blog.cloudera.com/benchmarking-nifi-performance-and-scalability/
  • 43. © 2023 Cloudera, Inc. All rights reserved. 43 PROVENANCE
  • 44. EXTENSIBILITY • Built from the ground up with extensions in mind • Service-loader pattern for… – Processors – Controller Services – Reporting Tasks – Prioritizers • Extensions packaged as NiFi Archives (NARs) – Deploy to the NiFi lib directory and restart – Same model as standard components
  • 45. © 2019 Cloudera, Inc. All rights reserved. 45 NiFi Load Balancing • Improve NiFi cluster throughput • Defined at connection level • Configurable balancing strategies • Critical for scale up paradigm in Kubernetes • Alleviates S2S balancing “hack” customers use
  • 46. © 2019 Cloudera, Inc. All rights reserved. 46 QUEUE CONFIGURATION • FlowFile Expiration - Data that cannot be processed in a timely fashion can be automatically removed from the flow. • Back Pressure Thresholds - Thresholds indicate how much data should be allowed to exist in the queue before the component that is the source of the Connection is no longer scheduled to run. This allows the system to avoid being overrun with data. • Load Balance Strategy – Strategy to distribute the data in a flow across the nodes in the cluster. When enabled, compression can be configured on FlowFile contents and attributes. • Prioritization – Determines the order in which flow files are processed.
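The interaction between back pressure and FlowFile expiration can be pictured with a small model. This is a toy sketch under simplifying assumptions, not NiFi's implementation or API: the `Connection` class and its methods are hypothetical, only an object-count threshold is modeled (no data-size threshold, load balancing, or prioritizers), and time is passed in explicitly to keep the example deterministic.

```python
from collections import deque


class Connection:
    """Toy model of a queue between two components in a flow."""

    def __init__(self, back_pressure_count, expiration_seconds):
        self.back_pressure_count = back_pressure_count
        self.expiration_seconds = expiration_seconds
        self.queue = deque()  # entries are (enqueue_time, flowfile)

    def source_may_run(self):
        """The upstream component is scheduled only while below the threshold."""
        return len(self.queue) < self.back_pressure_count

    def offer(self, flowfile, now):
        """Enqueue a FlowFile unless back pressure is currently applied."""
        if not self.source_may_run():
            return False
        self.queue.append((now, flowfile))
        return True

    def expire(self, now):
        """Drop FlowFiles that have waited longer than the expiration period."""
        removed = 0
        while self.queue and now - self.queue[0][0] > self.expiration_seconds:
            self.queue.popleft()
            removed += 1
        return removed


conn = Connection(back_pressure_count=2, expiration_seconds=60)
accepted = [conn.offer(f"ff{i}", now=0) for i in range(3)]  # third is held back
removed = conn.expire(now=61)                               # both queued items aged out
can_run_again = conn.source_may_run()
```

Once expired data is removed, the queue drops below the threshold and the upstream component becomes eligible to run again.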
  • 47. RECORD-ORIENTED DATA WITH NIFI • Record Readers - Avro, CSV, Grok, IPFIX, JASN1, JSON, Parquet, Scripted, Syslog5424, Syslog, WindowsEvent, XML • Record Writers - Avro, CSV, FreeFormText, JSON, Parquet, Scripted, XML • Record Readers and Writers support referencing a schema registry for retrieving schemas when necessary. • Enables processors that accept any data format without having to worry about the parsing and serialization logic. • Allows us to keep FlowFiles larger, each consisting of multiple records, which results in far better performance.
  • 48. © 2019 Cloudera, Inc. All rights reserved. 48 RUNNING SQL ON FLOWFILES • Evaluates one or more SQL queries against the contents of a FlowFile. • This can be used, for example, for field-specific filtering, transformation, and row-level filtering. • Columns can be renamed, simple calculations and aggregations performed. • The SQL statement must be valid ANSI SQL and is powered by Apache Calcite.
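As a rough analogy for running SQL against the records in a FlowFile, the sketch below loads a few records into an in-memory table and queries them. It uses Python's built-in `sqlite3` module as a stand-in, not Apache Calcite or NiFi itself, so dialect details differ; the table name `FLOWFILE`, the sample records, and the column names are assumptions chosen for illustration.

```python
import sqlite3

# Records as they might appear after a Record Reader has parsed a FlowFile.
records = [
    {"card": "A", "amount": 120},
    {"card": "B", "amount": 15},
    {"card": "A", "amount": 80},
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE FLOWFILE (card TEXT, amount INTEGER)")
conn.executemany("INSERT INTO FLOWFILE VALUES (:card, :amount)", records)

# Row-level filtering, a renamed column, and a simple aggregation in one query.
rows = conn.execute(
    "SELECT card AS card_id, SUM(amount) AS total "
    "FROM FLOWFILE WHERE amount > 50 GROUP BY card ORDER BY card"
).fetchall()
conn.close()
```

The query filters out the small payment for card B, renames `card` to `card_id`, and sums the remaining amounts per card.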
  • 49. Apache NiFi with Python Custom Processors Python as a 1st class citizen
  • 50. 50 © 2023 Cloudera, Inc. All rights reserved. READYFLOW GALLERY • Cloudera provided flow definitions • Cover most common data flow use cases • Optimized to work with CDP sources/destinations • Can be deployed and adjusted as needed
  • 51. 51 © 2023 Cloudera, Inc. All rights reserved. FLOW CATALOG • Central repository for flow definitions • Import existing NiFi flows • Manage flow definitions • Initiate flow deployments
  • 52. 52 © 2023 Cloudera, Inc. All rights reserved. DEPLOYMENT WIZARD • Turns flow definitions into flow deployments • Guides users through providing required configuration • Choose NiFi runtime version • Pick from pre-defined NiFi node sizes • Define KPIs for the deployment Start Deployment Wizard Provide Parameters Configure Sizing & Scaling Define KPIs
  • 53. 53 © 2023 Cloudera, Inc. All rights reserved. KEY PERFORMANCE INDICATORS • Visibility into flow deployments • Track high level flow performance • Track in-depth NiFi component metrics • Defined in Deployment Wizard • Monitoring & Alerts in Deployment Details KPI Definition in Deployment Wizard KPI Monitoring
  • 54. 54 © 2023 Cloudera, Inc. All rights reserved. DASHBOARD • Central Monitoring View • Monitors flow deployments across CDP environments • Monitors flow deployment health & performance • Drill into flow deployment to monitor system metrics and deployment events
  • 55. 55 © 2023 Cloudera, Inc. All rights reserved. DEPLOYMENT MANAGER • Manage flow deployment lifecycle (Suspend/Start/Terminate) • Add/Edit KPIs • Change sizing configuration • Update parameters • Change NiFi version of the deployment • Gateway to NiFi canvas
  • 56. NIFI VERSION UPGRADES • Pick up NiFi hotfixes easily • Upgrade (or downgrade) the hotfix version of existing deployments • Rolling upgrade (if the deployment has more than one NiFi node)
  • 57. © 2023 Cloudera, Inc. All rights reserved. BEST PRACTICES
  • 58. © 2023 Cloudera, Inc. All rights reserved. 58 STREAMING TECH DEBT TIPS • Version Control All Assets • Managed Public Cloud like Cloudera • Use DevOps and APIs • Latest Java and Python • Stream Sizing (NiFi, Kafka, Flink)
  • 59. Streaming Solutions: When to use what? Considerations: Routing vs Analytics, Listeners, Joins, In-Memory, Operational Load, Current Skills • Use NiFi – Doing more than just syndication – Not just small Kafka-sized events – Edge management is needed – Listener-type use cases that bind to ports – Lightweight ETL, Lineage, Provenance, Message Replay • Use Flink – Joining Streams – Windowing – Late Data Handling – Streaming Analytics • Use KConnect – Kafka Centric – In-Memory – Stateless
  • 60. © 2023 Cloudera, Inc. All rights reserved. RESOURCES AND WRAP-UP
  • 61. © 2023 Cloudera, Inc. All rights reserved. 61 Resources
  • 63. Upcoming Events April 26 May 10 May 9