SlideShare a Scribd company logo
1 of 79
Download to read offline
© 2023 Cloudera, Inc. All rights reserved.
Meetup:
Mastering Data Streaming
Pipelines
Tim Spann
Principal Developer Advocate
28-June-2023
© 2023 Cloudera, Inc. All rights reserved.
© 2023 Cloudera, Inc. All rights reserved.
4
© 2023 Cloudera, Inc. All rights reserved.
Future of Data - Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
© 2023 Cloudera, Inc. All rights reserved. 5
FLaNK Stack
Tim Spann
@PaasDev // Blog: www.datainmotion.dev
Principal Developer Advocate.
Princeton Future of Data Meetup.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw
Apache NiFi x Apache Kafka x Apache Flink x Java
© 2023 Cloudera, Inc. All rights reserved. 6
FLaNK Stack Weekly
This week in Apache NiFi, Apache Flink, Apache
Kafka, Apache Spark, Apache Iceberg, Python,
Java and Open Source friends.
https://bit.ly/32dAJft
© 2023 Cloudera, Inc. All rights reserved. 7
https://www.cloudera.com/about/events/cloudera-now-cdp.html
© 2023 Cloudera, Inc. All rights reserved.
FREE LEARNING ENVIRONMENT
© 2023 Cloudera, Inc. All rights reserved. 9
CSP Community
Edition
• Kafka, KConnect, SMM, SR,
Flink, and SSB in Docker
• Runs in Docker
• Try new features quickly
• Develop applications locally
● Docker compose file of CSP to run from command line w/o any
dependencies, including Flink, SQL Stream Builder, Kafka, Kafka
Connect, Streams Messaging Manager and Schema Registry
○ $> docker compose up
● Licensed under the Cloudera Community License
● Unsupported
● Community Group Hub for CSP
● Find it on docs.cloudera.com under Applications
© 2023 Cloudera, Inc. All rights reserved.
STREAMING
© 2023 Cloudera, Inc. All rights reserved. 11
CDP: AN OPEN DATA LAKEHOUSE
METADATA AND
DATA CATALOG
OBSERVABILITY REPLICATION
SECURITY &
GOVERNANCE
Private Cloud
© 2023 Cloudera, Inc. All rights reserved. 12
BUILDING REAL-TIME REQUIRES A TEAM
© 2023 Cloudera, Inc. All rights reserved. 13
DATA INGESTION ARCHITECTURE - Using NiFi to pull the Data out of anything in near-real time
INGEST PREPARE PUBLISH
DATA SOURCES
EXTERNAL DATA
(Weather)
(Air Quality)
IoT Devices
Internal Users
(After Sales)
External
Systems
ENTERPRISE
LAKEHOUSE
CAPABILITY VIEW
INGESTION
ENTERPRISE DATA
MESSAGE HUB
STORAGE
BATCH
MANAGEMENT
STREAM
CONSUMPTION
Closed Loop
Systems
SQL Stream Builder
Machine Learning
Data Visualization
Workload Manager
© 2023 Cloudera, Inc. All rights reserved. 14
Moving Beyond Draining of Streams Into Lakes: Analytics-in-Stream
Data Sources Streaming Storage
Substrate
Cloudera Stream Processing
Kafka + NiFi enables
real-time ingestion into
lakes / analytics services
Data Distribution
Service
Cloudera DataFlow
Warehouses & Operational DB
Data Lakes & Lake Houses
Data-At-Rest Analytics
Data Apps Powered by
Streaming Insights and used
by other Analytics Services
Kafka + Flink
enables streaming
analytics
Cloudera Stream Processing
Streaming
Analytics
Low Latency
Data Products
Data-In-Motion Streaming Analytics
© 2023 Cloudera, Inc. All rights reserved. 15
Sploot Nano Shaggy Stormy
© 2023 Cloudera, Inc. All rights reserved. 16
Already using Spark? Need NiFi? Need Flink?
Want unified Batch/Stream?
Want highest Throughput?
Don’t need low latency?
Large files?
Scheduled batches?
Replacing Sqoop, ETL
Simple JDBC queries?
Transform individual records?
Want easy development?
Lots of small files, events, records, rows?
Continuous stream of rows
Support many different sources
Need Microservices, Batch and Stream?
Want high Throughput?
Want Low Latency?
Want Advanced Windowing and State?
Happy with a New Solution that is
best-in-class?
Spark, NiFi, Flink? Which engine to choose?
SOURCES AND SINKS
© 2023 Cloudera, Inc. All rights reserved. 18
APACHE ICEBERG
A Flexible, Performant & Scalable Table Format
• Donated by Netflix to the Apache Foundation in 2018
• Flexibility
– Hidden partitioning
– Full schema evolution
• Data Warehouse Operations
– Atomic Consistent Isolated Durable (ACID)
Transactions
– Time travel and rollback
• Supports best in class SQL performance
– High performance at Petabyte scale
© 2023 Cloudera, Inc. All rights reserved. 19
© 2023 Cloudera, Inc. All rights reserved.
COMMON USE CASES
© 2023 Cloudera, Inc. All rights reserved. 21
BE READY TO DETECT, RESPOND AND COMPLY
Real-time alerting & Longer retention = Enhanced Readiness
Detect
Collect event data at
high events per second
Automatically identify
high fidelity alerts
Analytics for proactive
threat hunting
Respond
Respond automatically to
real-time alerts
Full context for complete
forensics and incident
response
Comply
Retain data as required
Automated compliance
assessments
3
2
1
© 2023 Cloudera, Inc. All rights reserved. 22
STREAMING LOG ANALYTICS REFERENCE ARCHITECTURE
Scalable and Secure Real-Time Log Analytics
Raw Logs Cloudera Data Flow Cloudera Data
Warehouse
Cloudera Machine Learning
Log Analytics Hub
Log
Analytics
Cloudera Streaming
Analytics
Cloudera Streams
Processing
Open Log Store
Triaging ML Models
Threat Hunting
Response and
Investigation
UEBA/Fraud
Detection
Reports
Kafka
SOAR
© 2020 Cloudera, Inc. All rights reserved. 23
CONSOLIDATE ACROSS CLOUDS
Run collection and streaming on any cloud
Raw Logs Cloudera Data Flow Cloudera Data
Warehouse
Cloudera Machine Learning
Log Analytics Hub
Log
Analytics
Cloudera Streaming
Analytics
Cloudera Streams
Processing
Open Log Store
Triaging ML Models
Threat Hunting
Response and
Investigation
UEBA/Fraud
Detection
Reports
Kafka
SOAR
Data Center
24
© 2023 Cloudera, Inc. All rights reserved.
THE PROBLEM
Making Observable Events Actionable can be Challenging
• Inability to distribute event data
• Data Silos
• Inefficient data movement
• Situational awareness
Resulting in….
• License costs
• Consumption costs
• Development overhead
• Resource utilization
Multi-structured
Distributed
Formatting
Aggregation
Signal: Noise
Inconsistent
Difficult to integrate
Proprietary Tools
● Capture events from any system anywhere
● Any data format, any protocol
● Open- Not limited to proprietary tools
Capture Observable Data
CLOUDERA DATA-IN-MOTION SOLUTION
Actionability
02
04
03
01
● Process raw data
● Integrate with enterprise data
● Build, test, deploy quickly
Create Useful Information
● Dramatically reduce time to insight
● Monitor proactively for critical events
● Distribute broadly and process
in-stream real-time via stream
Enable Real-Time Observability
● Reduce noise, boost signal
● Reduce load on other system
● Filter, compress, route anywhere
Filter and Distribute
Universal Connectivity
450+ Connectors
Hybrid Deployment-Ready
Author once, deploy anywhere
Self-Service across full lifecycle
Integrated no code UX
26
© 2023 Cloudera, Inc. All rights reserved.
NO-CODE/LOW-CODE USER EXPERIENCE
DataFlow Designer SQL Stream Builder
visual canvas + automatic provisioning=
Self-service pipeline development
Instant data access + unified SQL processing =
Self-service streaming event processing
27
© 2023 Cloudera, Inc. All rights reserved.
KEY USE CASES AND OUTCOMES
INTELLIGENT OPS
& SYSTEM RELIABILITY
“Cut our data volume in half that
translated to a 30%-40% reduction in
public cloud cost. With saving,
expanded footprint & coverage of
monitoring & visibility without a
significant $$$ impact. This drove
improved situational awareness of
app infrastructure availability &
performance”
Large US Airline
“Faster resolution of incidents,
reducing exponentially the stops of
trains in full operation.”
Metro de Madrid
CYBER SECURITY
“Cut log data by 60% resulting in
>40% reduction in Splunk costs.
Delivered cyber logs to other teams
analyzing logs resulting in a decrease
of MTTD by 90%.”
Fortune 100 Energy Company
“Improved threat hunting by
eliminating 25% of false positive
anomalies and speeding up
investigations”
Global Telco
CUSTOMER INSIGHT
“Improved Marketing Campaign
effectiveness 15% YoY. This allowed
us to redirect funds for failing
campaigns and allocate to new ones.”
Global Telco
© 2023 Cloudera, Inc. All rights reserved.
APACHE NiFi -
MiNiFi Agents
NiFi For Ants
© 2023 Cloudera, Inc. All rights reserved. 29
Cloudera Edge Management
Edge device data collection and processing with easy to use central command and control
• Small footprint agent with MiNiFi
• Java and C++ agents
• Rich edge processors (edge collection &
processing)
• End to end lineage and security
• Kubernetes support
Flow
Authorship
Flow
Deployment
Flow
Monitoring
• Central Command and Control (C2)
• Design and deploy to millions of agents
• Edge Applications lifecycle management
• Multitenancy with Agent classes
• Native integration with other CDF services
+
A lightweight edge agent that implements the core
features of Apache NiFi, focusing on data collection
and processing at the edge
Edge Flow Manager
© 2023 Cloudera, Inc. All rights reserved.
APACHE KAFKA
© 2019 Cloudera, Inc. All rights reserved. 31
KAFKA CLUSTER GEO 2
DATA SYNDICATE SERVICES
Kafka Topic
syndicate-
transmission
Kafka Topic
syndicate-
temp
Kafka Topic
syndicate-
speed
Kafka Topic
syndicate-
geo
KAFKA CLUSTER GEO 1
DATA SYNDICATE SERVICES
Kafka Topic
syndicate-
transmission
Kafka Topic
syndicate-
temp
Kafka Topic
syndicate-
speed
Kafka Topic
syndicate-
geo
Apache Kafka
DATA COLLECTION
AT THE EDGE
C++ agent
US-West Fleet
C++ agent
US-Central Fleet
C++ agent
US-East Fleet
INGEST GATEWAY
POWERED BY
KAFKA
gateway-west-
raw-sensors
gateway-central-
raw-sensors
gateway-east-
raw-sensors
DATA FLOW APPS
POWERED BY NIFI
STREAMING
ANALYTICS APPS
Micro Batch Analytics
Stream Analytics App
Micro Services
Stream Analytics App
Complex Low Latent
Stream Analytics App
Apache
Flink
Structured
Streaming
Replication /
Data Deployment
MiNiFi Apache Kafka Apache NiFi Apache Kafka Apache Flink
© 2023 Cloudera, Inc. All rights reserved.
APACHE FLINK
I Can Haz Data?
33
© 2021 Cloudera, Inc. All rights reserved.
DELIVERING STREAMING ANALYTICS
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. (second)
SQL
Parsing and
Blending Data
Streaming
Analytics
Both offline and
streaming data
Data Analysts Can
Write Queries
Across the Lines of Business
Capture Events
that Matter
Low-latency analytics use
cases
Events
Processing
34
© 2022 Cloudera, Inc. All rights reserved.
SQL STREAM BUILDER (SSB)
SQL STREAM BUILDER allows
developers, analysts, and data
scientists to write streaming
applications with industry
standard SQL.
No Java or Scala code
development required.
Simplifies access to data in Kafka
& Flink. Connectors to batch data in
HDFS, Kudu, Hive, S3, JDBC, CDC
and more
Enrich streaming data with batch
data in a single tool
Democratize access to real-time data with just SQL
© 2023 Cloudera, Inc. All rights reserved. 35
SSB MATERIALIZED VIEWS
Key Takeaway; MV’s allow data scientist, analyst and developers consume data from the firehose
© 2023 Cloudera, Inc. All rights reserved. 36
Streaming ETL Data Pipeline Made Simple with SQL StreamBuilder
Write Streaming
Result to Kudu
Join 2 Streaming
User Event Topics
Enrich Stream from
Warehouse HR Table
Enrich Stream from RT
Mart Timesheet Table
Filter &
Transform
© 2023 Cloudera, Inc. All rights reserved. 37
Infer Kafka Tables from JSON, AVRO, … Data
© 2023 Cloudera, Inc. All rights reserved.
DATAFLOW
APACHE NIFI
© 2023 Cloudera, Inc. All rights reserved.
NiFi Positioning
Apache
NiFi / MiNiFi
ETL
(Informatica, etc.)
Enterprise
Service Bus
(Fuse, Mule, etc.)
Messaging
Bus
(Kafka, MQ, etc.)
Processing
Framework
(Storm, Flink,
Spark, etc.)
© 2023 Cloudera, Inc. All rights reserved. 40
Cloudera DataFlow: Universal Data Distribution Service
Process
Route
Filter
Enrich
Transform
Distribute
Connectors
Any
destination
Deliver
Ingest
Active
Passive
Connectors
Gateway
Endpoint
Connect & Pull
Send
Data born in
the cloud
Data born
outside the
cloud
UNIVERSAL DATA DISTRIBUTION WITH CLOUDERA DATAFLOW (CDF)
Connect to Any Data Source Anywhere then Process and Deliver to Any Destination
© 2023 Cloudera, Inc. All rights reserved.
© 2019 Cloudera, Inc. All rights reserved. 41
CLOUDERA DATAFLOW - POWERED BY APACHE NiFi
Ingest and manage data from edge-to-cloud using a no-code interface
● #1 data ingestion/movement engine
● Strong community
● Product maturity over 11 years
● Deploy on-premises or in the cloud
● Over 400+ pre-built processors
● Built-in data provenance
● Guaranteed delivery
● Throttling and Back pressure
© 2023 Cloudera, Inc. All rights reserved. 42
CLOUDERA FLOW AND EDGE MANAGEMENT
Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data
center) to any downstream system with built in end-to-end security and provenance
Advanced tooling to industrialize
flow development (Flow Development
Life Cycle)
ACQUIRE
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merger & Route
• Throttle & Backpressure
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG
PROCESS
HASH
MERGE
EXTRACT
DUPLICATE
SPLIT
ENCRYPT
TALL
EVALUATE
EXECUTE
GEOENRICH
SCAN
REPLACE
TRANSLATE
CONVERT
ROUTE TEXT
ROUTE CONTENT
ROUTE CONTEXT
ROUTE RATE
DISTRIBUTE LOAD
DELIVER
• Guaranteed Delivery
• Full data provenance from
acquisition to delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLOG
© 2023 Cloudera, Inc. All rights reserved.
© 2023 Cloudera, Inc. All rights reserved. 44
PROVENANCE
© 2023 Cloudera, Inc. All rights reserved. 45
EXTENSIBILITY
• Built from the ground up with extensions in mind
• Service-loader pattern for…
– Processors
– Controller Services
– Reporting Tasks
– Prioritizers
• Extensions packaged as NiFi Archives (NARs)
– Deploy NiFi lib directory and restart
– Same model as standard components
© 2019 Cloudera, Inc. All rights reserved. 46
NiFi Load Balancing
• Improve NiFi cluster throughput
• Defined at connection level
• Configurable balancing
strategies
• Critical for scale up paradigm in
Kubernetes
• Alleviates S2S balancing “hack”
customers use
© 2019 Cloudera, Inc. All rights reserved. 47
QUEUE CONFIGURATION
• FlowFile Expiration - Data that cannot be processed in a timely
fashion can be automatically removed from the flow.
• Back Pressure Thresholds - Thresholds indicate how much data
should be allowed to exist in the queue before the component
that is the source of the Connection is no longer scheduled to
run. This allows the system to avoid being overrun with data.
• Load Balance Strategy – Strategy to distribute the data in a flow
across the nodes in the cluster. When enabled, compression can
be configured on FlowFile contents and attributes.
• Prioritization – Determines the order in which flow files are
processed.
© 2019 Cloudera, Inc. All rights reserved. 48
RECORD-ORIENTED DATA WITH NIFI
• Record Readers - Avro, CSV, Grok, IPFIX, JSAN1, JSON, Parquet,
Scripted, Syslog5424, Syslog, WindowsEvent, XML
• Record Writers - Avro, CSV, FreeFromText, Json, Parquet,
Scripted, XML
• Record Reader and Writer support referencing a schema registry
for retrieving schemas when necessary.
• Enable processors that accept any data format without having to
worry about the parsing and serialization logic.
• Allows us to keep FlowFiles larger, each consisting of multiple
records, which results in far better performance.
© 2019 Cloudera, Inc. All rights reserved. 49
RUNNING SQL ON FLOWFILES
• Evaluates one or more SQL queries against the contents of a
FlowFile.
• This can be used, for example, for field-specific filtering,
transformation, and row-level filtering.
• Columns can be renamed, simple calculations and aggregations
performed.
• The SQL statement must be valid ANSI SQL and is powered by
Apache Calcite.
Apache NiFi with Python Custom Processors
Python as a 1st class citizen
https://www.youtube.com/watch?v=9Oi_6nFmbPg
https://github.com/apache/nifi/blob/614947e4ac6798ad80817e82514c39349d5faacb/nifi-docs/src/main/
asciidoc/python-developer-guide.adoc
© 2023 Cloudera, Inc. All rights reserved. 52
© 2023 Cloudera, Inc. All rights reserved. 53
54
© 2023 Cloudera, Inc. All rights reserved.
READYFLOW
GALLERY
• Cloudera provided flow
definitions
• Cover most common data flow
use cases
• Optimized to work with CDP
sources/destinations
• Can be deployed and adjusted
as needed
55
© 2023 Cloudera, Inc. All rights reserved.
FLOW CATALOG
• Central repository for flow
definitions
• Import existing NiFi flows
• Manage flow definitions
• Initiate flow deployments
56
© 2023 Cloudera, Inc. All rights reserved.
DEPLOYMENT
WIZARD
• Turns flow definitions into flow
deployments
• Guides users through providing
required configuration
• Choose NiFi runtime version
• Pick from pre-defined NiFi node sizes
• Define KPIs for the deployment
Start Deployment Wizard Provide Parameters
Configure Sizing & Scaling Define KPIs
57
© 2023 Cloudera, Inc. All rights reserved.
KEY
PERFORMANCE
INDICATORS
• Visibility into flow deployments
• Track high level flow
performance
• Track in-depth NiFi component
metrics
• Defined in Deployment Wizard
• Monitoring & Alerts in
Deployment Details
KPI Definition in Deployment Wizard KPI Monitoring
58
© 2023 Cloudera, Inc. All rights reserved.
DASHBOARD
• Central Monitoring View
• Monitors flow deployments
across CDP environments
• Monitors flow deployment
health & performance
• Drill into flow deployment to
monitor system metrics and
deployment events
59
© 2023 Cloudera, Inc. All rights reserved.
DEPLOYMENT
MANAGER
• Manage flow deployment
lifecycle
(Suspend/Start/Terminate)
• Add/Edit KPIs
• Change sizing configuration
• Update parameters
• Change NiFi version of the
deployment
• Gateway to NiFi canvas
60
© 2023 Cloudera, Inc. All rights reserved.
NIFI VERSION
UPGRADES
• Pick up NiFi hotfixes easily
• Upgrade (or downgrade) the
hotfix version of existing
deployments
• Rolling upgrade (if the
deployment has >1 NiFi nodes)
© 2019 Cloudera, Inc. All rights reserved. 61
Apache NiFi - Release History Review
• Soon: NiFi 2.0
• Apr 23: 1.21
• Feb 23: 1.20
• Dec 22: 1.19
• Oct 22: 1.18
• Aug 22: 1.17
• Mar 22: 1.16
• Dec 21: 1.15
• Jul 21: 1.14
• Mar 21: 1.13
• Aug 20: 1.12
• Jan 20: 1.11
• Non-stop security improvements, vulnerability resolution, library
updates. “Security by default”. Secure Zookeeper.
• Key capabilities: all repository on-disk encryption, classloader
isolation for native libs, empty all queues, prometheus, Azure AD authz,
‘Run Once’, Framework level retries, Stateless NiFi in NiFi, Parameter
Contexts with inheritance and sensitive values, interactive component
validation, generate fake records, scripted record transforms
• New connectors for: IBM MQ, GCP, AWS, Azure, AMQP, MQTT,
CDC/MySQL, Kudu, CEF, Geo Hashing, SNMP Traps, SMB, Airtable,
Dropbox, Box, Hubspot, Workday, Zendesk, Iceberg, Snowflake,
Salesforce, Google Drive
© 2023 Cloudera, Inc. All rights reserved. 62
NiFi 2.0 - Quick Look
- Moves to Java 11 and Java 17 as the base
- JSON for flow serialization
- XML and Templates concepts gone
- Parameter Contexts for easier SDLC/Deployment
- Variables removed.
- Technical Debt Reduction
- Delete a bunch of legacy components.
- Focus on timer and cron based scheduling
- Event driven scheduling removed / unnecessary.
63
© 2022 Cloudera, Inc. All rights reserved.
Key / Common Themes Of Change beyond the framework
Stateless NiFi
engine
• Faster execution
(no disk)
• End to end
transactional
• Run in FaaS or
NiFi or
command line
Record Oriented
Processors
• The proven best
practice pattern
for handling any
‘record’ pattern
• Same proven
processors -
just bring your
format/schema
Scripting /
Python
• Data Engineers
and Data
Science user
base growing
• Existing script
processors
always
improving -
heavy use.
All the
connectors!
• Apache NiFi
includes 362
processors by
default with
more available
in Maven
• 132 controller
services
• 18 Reporting
Tasks
© 2023 Cloudera, Inc. All rights reserved.
BEST PRACTICES
© 2023 Cloudera, Inc. All rights reserved. 65
STREAMING TECH DEBT TIPS
• Version Control All Assets
• Managed Public Cloud like Cloudera
• Use DevOps and APIs
• Latest Java and Python
• Stream Sizing (NiFi, Kafka, Flink)
© 2023 Cloudera, Inc. All rights reserved.
EXAMPLES
© 2023 Cloudera, Inc. All rights reserved. 67
Particulate Matter Sensor Pipeline
© 2023 Cloudera, Inc. All rights reserved.
© 2021 Cloudera, Inc. All rights reserved. 68
https://docs.cloudera.com/csa/1.10.0/how-to-ssb/topics/csa-ssb-cdc-connectors.html
CDC with Debezium and Flink
SQL Stream Builder with Flink SQL
© 2023 Cloudera, Inc. All rights reserved. 69
CDC with Debezium and Kafka
Kafka Connect
© 2023 Cloudera, Inc. All rights reserved. 70
© 2023 Cloudera, Inc. All rights reserved. 71
© 2023 Cloudera, Inc. All rights reserved. 72
© 2023 Cloudera, Inc. All rights reserved.
© 2023 Cloudera, Inc. All rights reserved.
RESOURCES AND WRAP-UP
© 2023 Cloudera, Inc. All rights reserved. 75
Resources
© 2023 Cloudera, Inc. All rights reserved.
© 2021 Cloudera, Inc. All rights reserved. 76
© 2023 Cloudera, Inc. All rights reserved.
© 2021 Cloudera, Inc. All rights reserved. 77
Upcoming Events
Nov 22
https://events.pinetool.ai/3079/#sessions/101077
© 2023 Cloudera, Inc. All rights reserved.
https://medium.com/@tspann/cdc-not-cat-data-capture-e43713879c03
79
TH N Y U

More Related Content

Similar to Meetup Streaming Data Pipeline Development

Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataBuilding Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataTimothy Spann
 
Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline DevelopmentTimothy Spann
 
Implement a Universal Data Distribution Architecture to Manage All Streaming ...
Implement a Universal Data Distribution Architecture to Manage All Streaming ...Implement a Universal Data Distribution Architecture to Manage All Streaming ...
Implement a Universal Data Distribution Architecture to Manage All Streaming ...Timothy Spann
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table NotesTimothy Spann
 
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdfOSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdfTimothy Spann
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...Timothy Spann
 
GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023Timothy Spann
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Timothy Spann
 
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines DemoEvolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines DemoTimothy Spann
 
ITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming AppsITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming AppsTimothy Spann
 
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...Timothy Spann
 
PartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionPartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionTimothy Spann
 
Cloudera streaming with flink oct 29, 2020 meetup london
Cloudera streaming with flink oct 29, 2020 meetup londonCloudera streaming with flink oct 29, 2020 meetup london
Cloudera streaming with flink oct 29, 2020 meetup londonTimothy Spann
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020Adam Doyle
 
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesOSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesTimothy Spann
 
Using the FLaNK Stack for edge ai (flink, nifi, kafka, kudu)
Using the FLaNK Stack for edge ai (flink, nifi, kafka, kudu)Using the FLaNK Stack for edge ai (flink, nifi, kafka, kudu)
Using the FLaNK Stack for edge ai (flink, nifi, kafka, kudu)Timothy Spann
 
Get started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionGet started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionCloudera, Inc.
 
Building Real-Time Travel Alerts
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel AlertsTimothy Spann
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionCloudera, Inc.
 
Edge to AI: Analytics from Edge to Cloud with Efficient Movement of Machine ...
Edge to AI:  Analytics from Edge to Cloud with Efficient Movement of Machine ...Edge to AI:  Analytics from Edge to Cloud with Efficient Movement of Machine ...
Edge to AI: Analytics from Edge to Cloud with Efficient Movement of Machine ...Timothy Spann
 

Similar to Meetup Streaming Data Pipeline Development (20)

Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit DataBuilding Real-time Pipelines with FLaNK_ A Case Study with Transit Data
Building Real-time Pipelines with FLaNK_ A Case Study with Transit Data
 
Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline Development
 
Implement a Universal Data Distribution Architecture to Manage All Streaming ...
Implement a Universal Data Distribution Architecture to Manage All Streaming ...Implement a Universal Data Distribution Architecture to Manage All Streaming ...
Implement a Universal Data Distribution Architecture to Manage All Streaming ...
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table Notes
 
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdfOSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
 
GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
 
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines DemoEvolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
 
ITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming AppsITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming Apps
 
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
 
PartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionPartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC Solution
 
Cloudera streaming with flink oct 29, 2020 meetup london
Cloudera streaming with flink oct 29, 2020 meetup londonCloudera streaming with flink oct 29, 2020 meetup london
Cloudera streaming with flink oct 29, 2020 meetup london
 
Stl meetup cloudera platform - january 2020
Stl meetup   cloudera platform  - january 2020Stl meetup   cloudera platform  - january 2020
Stl meetup cloudera platform - january 2020
 
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesOSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
 
Using the FLaNK Stack for edge ai (flink, nifi, kafka, kudu)
Using the FLaNK Stack for edge ai (flink, nifi, kafka, kudu)Using the FLaNK Stack for edge ai (flink, nifi, kafka, kudu)
Using the FLaNK Stack for edge ai (flink, nifi, kafka, kudu)
 
Get started with Cloudera's cyber solution
Get started with Cloudera's cyber solutionGet started with Cloudera's cyber solution
Get started with Cloudera's cyber solution
 
Building Real-Time Travel Alerts
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel Alerts
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber Solution
 
Edge to AI: Analytics from Edge to Cloud with Efficient Movement of Machine ...
Edge to AI:  Analytics from Edge to Cloud with Efficient Movement of Machine ...Edge to AI:  Analytics from Edge to Cloud with Efficient Movement of Machine ...
Edge to AI: Analytics from Edge to Cloud with Efficient Movement of Machine ...
 

More from Timothy Spann

April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...Timothy Spann
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-PipelinesTimothy Spann
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTimothy Spann
 
2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-ProfitsTimothy Spann
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsTimothy Spann
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI PipelinesTimothy Spann
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
 
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...Timothy Spann
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkTimothy Spann
 
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data PipelinesTimothy Spann
 
AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101Timothy Spann
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC MeetupTimothy Spann
 
CoC23_ Looking at the New Features of Apache NiFi
CoC23_ Looking at the New Features of Apache NiFiCoC23_ Looking at the New Features of Apache NiFi
CoC23_ Looking at the New Features of Apache NiFiTimothy Spann
 
CoC23_ Let’s Monitor The Conditions at the Conference
CoC23_ Let’s Monitor The Conditions at the ConferenceCoC23_ Let’s Monitor The Conditions at the Conference
CoC23_ Let’s Monitor The Conditions at the ConferenceTimothy Spann
 
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel OptimizationCoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel OptimizationTimothy Spann
 
The Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingThe Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingTimothy Spann
 
Meetup - Brasil - Data In Motion - 2023 September 19
Meetup - Brasil - Data In Motion - 2023 September 19Meetup - Brasil - Data In Motion - 2023 September 19
Meetup - Brasil - Data In Motion - 2023 September 19Timothy Spann
 

More from Timothy Spann (20)

April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI Pipelines
 
2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python Processors
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
 
AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
 
CoC23_ Looking at the New Features of Apache NiFi
CoC23_ Looking at the New Features of Apache NiFiCoC23_ Looking at the New Features of Apache NiFi
CoC23_ Looking at the New Features of Apache NiFi
 
CoC23_ Let’s Monitor The Conditions at the Conference
CoC23_ Let’s Monitor The Conditions at the ConferenceCoC23_ Let’s Monitor The Conditions at the Conference
CoC23_ Let’s Monitor The Conditions at the Conference
 
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel OptimizationCoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
 
The Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingThe Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and Streaming
 
Meetup - Brasil - Data In Motion - 2023 September 19
Meetup - Brasil - Data In Motion - 2023 September 19Meetup - Brasil - Data In Motion - 2023 September 19
Meetup - Brasil - Data In Motion - 2023 September 19
 

Recently uploaded

EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityNeo4j
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsAhmed Mohamed
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesPhilip Schwarz
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEEVICTOR MAESTRE RAMIREZ
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)OPEN KNOWLEDGE GmbH
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfStefano Stabellini
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...Christina Lin
 

Recently uploaded (20)

EY_Graph Database Powered Sustainability
EY_Graph Database Powered SustainabilityEY_Graph Database Powered Sustainability
EY_Graph Database Powered Sustainability
 
Unveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML DiagramsUnveiling Design Patterns: A Visual Guide with UML Diagrams
Unveiling Design Patterns: A Visual Guide with UML Diagrams
 
Folding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a seriesFolding Cheat Sheet #4 - fourth in a series
Folding Cheat Sheet #4 - fourth in a series
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Cloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEECloud Data Center Network Construction - IEEE
Cloud Data Center Network Construction - IEEE
 
Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)Der Spagat zwischen BIAS und FAIRNESS (2024)
Der Spagat zwischen BIAS und FAIRNESS (2024)
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
Xen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdfXen Safety Embedded OSS Summit April 2024 v4.pdf
Xen Safety Embedded OSS Summit April 2024 v4.pdf
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
ODSC - Batch to Stream workshop - integration of Apache Spark, Cassandra, Pos...
 

Meetup Streaming Data Pipeline Development

  • 1. © 2023 Cloudera, Inc. All rights reserved. Meetup: Mastering Data Streaming Pipelines Tim Spann Principal Developer Advocate 28-June-2023
  • 2. © 2023 Cloudera, Inc. All rights reserved.
  • 3. © 2023 Cloudera, Inc. All rights reserved.
  • 4. 4 © 2023 Cloudera, Inc. All rights reserved. Future of Data - Princeton + Virtual @PaasDev https://www.meetup.com/futureofdata-princeton/ From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
  • 5. © 2023 Cloudera, Inc. All rights reserved. 5 FLaNK Stack Tim Spann @PaasDev // Blog: www.datainmotion.dev Principal Developer Advocate. Princeton Future of Data Meetup. ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC https://medium.com/@tspann https://github.com/tspannhw Apache NiFi x Apache Kafka x Apache Flink x Java
  • 6. © 2023 Cloudera, Inc. All rights reserved. 6 FLaNK Stack Weekly This week in Apache NiFi, Apache Flink, Apache Kafka, Apache Spark, Apache Iceberg, Python, Java and Open Source friends. https://bit.ly/32dAJft
  • 7. © 2023 Cloudera, Inc. All rights reserved. 7 https://www.cloudera.com/about/events/cloudera-now-cdp.html
  • 8. © 2023 Cloudera, Inc. All rights reserved. FREE LEARNING ENVIRONMENT
  • 9. © 2023 Cloudera, Inc. All rights reserved. 9 CSP Community Edition • Kafka, KConnect, SMM, SR, Flink, and SSB in Docker • Runs in Docker • Try new features quickly • Develop applications locally ● Docker compose file of CSP to run from command line w/o any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry ○ $> docker compose up ● Licensed under the Cloudera Community License ● Unsupported ● Community Group Hub for CSP ● Find it on docs.cloudera.com under Applications
  • 10. © 2023 Cloudera, Inc. All rights reserved. STREAMING
  • 11. © 2023 Cloudera, Inc. All rights reserved. 11 CDP: AN OPEN DATA LAKEHOUSE METADATA AND DATA CATALOG OBSERVABILITY REPLICATION SECURITY & GOVERNANCE Private Cloud
  • 12. © 2023 Cloudera, Inc. All rights reserved. 12 BUILDING REAL-TIME REQUIRES A TEAM
  • 13. © 2023 Cloudera, Inc. All rights reserved. 13 DATA INGESTION ARCHITECTURE - Using NiFi to pull the Data out of anything in near-real time INGEST PREPARE PUBLISH DATA SOURCES EXTERNAL DATA (Weather) (Air Quality) IoT Devices Internal Users (After Sales) External Systems ENTERPRISE LAKEHOUSE CAPABILITY VIEW INGESTION ENTERPRISE DATA MESSAGE HUB STORAGE BATCH MANAGEMENT STREAM CONSUMPTION Closed Loop Systems SQL Stream Builder Machine Learning Data Visualization Workload Manager
  • 14. © 2023 Cloudera, Inc. All rights reserved. 14 Moving Beyond Draining of Streams Into Lakes: Analytics-in-Stream Data Sources Streaming Storage Substrate Cloudera Stream Processing Kafka + NiFi enables real-time ingestion into lakes / analytics services Data Distribution Service Cloudera DataFlow Warehouses & Operational DB Data Lakes & Lake Houses Data-At-Rest Analytics Data Apps Powered by Streaming Insights and used by other Analytics Services Kafka + Flink enables streaming analytics Cloudera Stream Processing Streaming Analytics Low Latency Data Products Data-In-Motion Streaming Analytics
  • 15. © 2023 Cloudera, Inc. All rights reserved. 15 Sploot Nano Shaggy Stormy
  • 16. © 2023 Cloudera, Inc. All rights reserved. 16 Already using Spark? Need NiFi? Need Flink? Want unified Batch/Stream? Want highest Throughput? Don’t need low latency? Large files? Scheduled batches? Replacing Sqoop, ETL Simple JDBC queries? Transform individual records? Want easy development? Lots of small files, events, records, rows? Continuous stream of rows Support many different sources Need Microservices, Batch and Stream? Want high Throughput? Want Low Latency? Want Advanced Windowing and State? Happy with a New Solution that is best-in-class? Spark, NiFi, Flink? Which engine to choose?
  • 18. © 2023 Cloudera, Inc. All rights reserved. 18 APACHE ICEBERG A Flexible, Performant & Scalable Table Format • Donated by Netflix to the Apache Foundation in 2018 • Flexibility – Hidden partitioning – Full schema evolution • Data Warehouse Operations – Atomic Consistent Isolated Durable (ACID) Transactions – Time travel and rollback • Supports best in class SQL performance – High performance at Petabyte scale
  • 19. © 2023 Cloudera, Inc. All rights reserved. 19
  • 20. © 2023 Cloudera, Inc. All rights reserved. COMMON USE CASES
  • 21. © 2023 Cloudera, Inc. All rights reserved. 21 BE READY TO DETECT, RESPOND AND COMPLY Real-time alerting & Longer retention = Enhanced Readiness Detect Collect event data at high events per second Automatically identify high fidelity alerts Analytics for proactive threat hunting Respond Respond automatically to real-time alerts Full context for complete forensics and incident response Comply Retain data as required Automated compliance assessments 3 2 1
  • 22. © 2023 Cloudera, Inc. All rights reserved. 22 STREAMING LOG ANALYTICS REFERENCE ARCHITECTURE Scalable and Secure Real-Time Log Analytics Raw Logs Cloudera Data Flow Cloudera Data Warehouse Cloudera Machine Learning Log Analytics Hub Log Analytics Cloudera Streaming Analytics Cloudera Streams Processing Open Log Store Triaging ML Models Threat Hunting Response and Investigation UEBA/Fraud Detection Reports Kafka SOAR
  • 23. © 2020 Cloudera, Inc. All rights reserved. 23 CONSOLIDATE ACROSS CLOUDS Run collection and streaming on any cloud Raw Logs Cloudera Data Flow Cloudera Data Warehouse Cloudera Machine Learning Log Analytics Hub Log Analytics Cloudera Streaming Analytics Cloudera Streams Processing Open Log Store Triaging ML Models Threat Hunting Response and Investigation UEBA/Fraud Detection Reports Kafka SOAR Data Center
  • 24. 24 © 2023 Cloudera, Inc. All rights reserved. THE PROBLEM Making Observable Events Actionable can be Challenging • Inability to distribute event data • Data Silos • Inefficient data movement • Situational awareness Resulting in…. • License costs • Consumption costs • Development overhead • Resource utilization Multi-structured Distributed Formatting Aggregation Signal: Noise Inconsistent Difficult to integrate Proprietary Tools
  • 25. ● Capture events from any system anywhere ● Any data format, any protocol ● Open- Not limited to proprietary tools Capture Observable Data CLOUDERA DATA-IN-MOTION SOLUTION Actionability 02 04 03 01 ● Process raw data ● Integrate with enterprise data ● Build, test, deploy quickly Create Useful Information ● Dramatically reduce time to insight ● Monitor proactively for critical events ● Distribute broadly and process in-stream real-time via stream Enable Real-Time Observability ● Reduce noise, boost signal ● Reduce load on other system ● Filter, compress, route anywhere Filter and Distribute Universal Connectivity 450+ Connectors Hybrid Deployment-Ready Author once, deploy anywhere Self-Service across full lifecycle Integrated no code UX
  • 26. 26 © 2023 Cloudera, Inc. All rights reserved. NO-CODE/LOW-CODE USER EXPERIENCE DataFlow Designer SQL Stream Builder visual canvas + automatic provisioning= Self-service pipeline development Instant data access + unified SQL processing = Self-service streaming event processing
  • 27. 27 © 2023 Cloudera, Inc. All rights reserved. KEY USE CASES AND OUTCOMES INTELLIGENT OPS & SYSTEM RELIABILITY “Cut our data volume in half that translated to a 30%-40% reduction in public cloud cost. With saving, expanded footprint & coverage of monitoring & visibility without a significant $$$ impact. This drove improved situational awareness of app infrastructure availability & performance” Large US Airline “Faster resolution of incidents, reducing exponentially the stops of trains in full operation.” Metro de Madrid CYBER SECURITY “Cut log data by 60% resulting in >40% reduction in Splunk costs. Delivered cyber logs to other teams analyzing logs resulting in a decrease of MTTD by 90%.” Fortune 100 Energy Company “Improved threat hunting by eliminating 25% of false positive anomalies and speeding up investigations” Global Telco CUSTOMER INSIGHT “Improved Marketing Campaign effectiveness 15% YoY. This allowed us to redirect funds for failing campaigns and allocate to new ones.” Global Telco
  • 28. © 2023 Cloudera, Inc. All rights reserved. APACHE NiFi - MiNiFi Agents NiFi For Ants
  • 29. © 2023 Cloudera, Inc. All rights reserved. 29 Cloudera Edge Management Edge device data collection and processing with easy to use central command and control • Small footprint agent with MiNiFi • Java and C++ agents • Rich edge processors (edge collection & processing) • End to end lineage and security • Kubernetes support Flow Authorship Flow Deployment Flow Monitoring • Central Command and Control (C2) • Design and deploy to millions of agents • Edge Applications lifecycle management • Multitenancy with Agent classes • Native integration with other CDF services + A lightweight edge agent that implements the core features of Apache NiFi, focusing on data collection and processing at the edge Edge Flow Manager
  • 30. © 2023 Cloudera, Inc. All rights reserved. APACHE KAFKA
  • 31. © 2019 Cloudera, Inc. All rights reserved. 31 KAFKA CLUSTER GEO 2 DATA SYNDICATE SERVICES Kafka Topic syndicate- transmission Kafka Topic syndicate- temp Kafka Topic syndicate- speed Kafka Topic syndicate- geo KAFKA CLUSTER GEO 1 DATA SYNDICATE SERVICES Kafka Topic syndicate- transmission Kafka Topic syndicate- temp Kafka Topic syndicate- speed Kafka Topic syndicate- geo Apache Kafka DATA COLLECTION AT THE EDGE C++ agent US-West Fleet C++ agent US-Central Fleet C++ agent US-East Fleet INGEST GATEWAY POWERED BY KAFKA gateway-west- raw-sensors gateway-central- raw-sensors gateway-east- raw-sensors DATA FLOW APPS POWERED BY NIFI STREAMING ANALYTICS APPS Micro Batch Analytics Stream Analytics App Micro Services Stream Analytics App Complex Low Latent Stream Analytics App Apache Flink Structured Streaming Replication / Data Deployment MiNiFi Apache Kafka Apache NiFi Apache Kafka Apache Flink
  • 32. © 2023 Cloudera, Inc. All rights reserved. APACHE FLINK I Can Haz Data?
  • 33. 33 © 2021 Cloudera, Inc. All rights reserved. DELIVERING STREAMING ANALYTICS 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. (second) SQL Parsing and Blending Data Streaming Analytics Both offline and streaming data Data Analysts Can Write Queries Across the Lines of Business Capture Events that Matter Low-latency analytics use cases Events Processing
  • 34. 34 © 2022 Cloudera, Inc. All rights reserved. SQL STREAM BUILDER (SSB) SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry standard SQL. No Java or Scala code development required. Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more Enrich streaming data with batch data in a single tool Democratize access to real-time data with just SQL
  • 35. © 2023 Cloudera, Inc. All rights reserved. 35 SSB MATERIALIZED VIEWS Key Takeaway; MV’s allow data scientist, analyst and developers consume data from the firehose
  • 36. © 2023 Cloudera, Inc. All rights reserved. 36 Streaming ETL Data Pipeline Made Simple with SQL StreamBuilder Write Streaming Result to Kudu Join 2 Streaming User Event Topics Enrich Stream from Warehouse HR Table Enrich Stream from RT Mart Timesheet Table Filter & Transform
  • 37. © 2023 Cloudera, Inc. All rights reserved. 37 Infer Kafka Tables from JSON, AVRO, … Data
  • 38. © 2023 Cloudera, Inc. All rights reserved. DATAFLOW APACHE NIFI
  • 39. © 2023 Cloudera, Inc. All rights reserved. NiFi Positioning Apache NiFi / MiNiFi ETL (Informatica, etc.) Enterprise Service Bus (Fuse, Mule, etc.) Messaging Bus (Kafka, MQ, etc.) Processing Framework (Storm, Flink, Spark, etc.)
  • 40. © 2023 Cloudera, Inc. All rights reserved. 40 Cloudera DataFlow: Universal Data Distribution Service Process Route Filter Enrich Transform Distribute Connectors Any destination Deliver Ingest Active Passive Connectors Gateway Endpoint Connect & Pull Send Data born in the cloud Data born outside the cloud UNIVERSAL DATA DISTRIBUTION WITH CLOUDERA DATAFLOW (CDF) Connect to Any Data Source Anywhere then Process and Deliver to Any Destination
  • 41. © 2023 Cloudera, Inc. All rights reserved. © 2019 Cloudera, Inc. All rights reserved. 41 CLOUDERA DATAFLOW - POWERED BY APACHE NiFi Ingest and manage data from edge-to-cloud using a no-code interface ● #1 data ingestion/movement engine ● Strong community ● Product maturity over 11 years ● Deploy on-premises or in the cloud ● Over 400+ pre-built processors ● Built-in data provenance ● Guaranteed delivery ● Throttling and Back pressure
  • 42. © 2023 Cloudera, Inc. All rights reserved. 42 CLOUDERA FLOW AND EDGE MANAGEMENT Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data center) to any downstream system with built in end-to-end security and provenance Advanced tooling to industrialize flow development (Flow Development Life Cycle) ACQUIRE • Over 300 Prebuilt Processors • Easy to build your own • Parse, Enrich & Apply Schema • Filter, Split, Merger & Route • Throttle & Backpressure FTP SFTP HL7 UDP XML HTTP EMAIL HTML IMAGE SYSLOG PROCESS HASH MERGE EXTRACT DUPLICATE SPLIT ENCRYPT TALL EVALUATE EXECUTE GEOENRICH SCAN REPLACE TRANSLATE CONVERT ROUTE TEXT ROUTE CONTENT ROUTE CONTEXT ROUTE RATE DISTRIBUTE LOAD DELIVER • Guaranteed Delivery • Full data provenance from acquisition to delivery • Diverse, Non-Traditional Sources • Eco-system integration FTP SFTP HL7 UDP XML HTTP EMAIL HTML IMAGE SYSLOG
  • 43. © 2023 Cloudera, Inc. All rights reserved.
  • 44. © 2023 Cloudera, Inc. All rights reserved. 44 PROVENANCE
  • 45. © 2023 Cloudera, Inc. All rights reserved. 45 EXTENSIBILITY • Built from the ground up with extensions in mind • Service-loader pattern for… – Processors – Controller Services – Reporting Tasks – Prioritizers • Extensions packaged as NiFi Archives (NARs) – Deploy NiFi lib directory and restart – Same model as standard components
  • 46. © 2019 Cloudera, Inc. All rights reserved. 46 NiFi Load Balancing • Improve NiFi cluster throughput • Defined at connection level • Configurable balancing strategies • Critical for scale up paradigm in Kubernetes • Alleviates S2S balancing “hack” customers use
  • 47. © 2019 Cloudera, Inc. All rights reserved. 47 QUEUE CONFIGURATION • FlowFile Expiration - Data that cannot be processed in a timely fashion can be automatically removed from the flow. • Back Pressure Thresholds - Thresholds indicate how much data should be allowed to exist in the queue before the component that is the source of the Connection is no longer scheduled to run. This allows the system to avoid being overrun with data. • Load Balance Strategy – Strategy to distribute the data in a flow across the nodes in the cluster. When enabled, compression can be configured on FlowFile contents and attributes. • Prioritization – Determines the order in which flow files are processed.
  • 48. © 2019 Cloudera, Inc. All rights reserved. 48 RECORD-ORIENTED DATA WITH NIFI • Record Readers - Avro, CSV, Grok, IPFIX, JSAN1, JSON, Parquet, Scripted, Syslog5424, Syslog, WindowsEvent, XML • Record Writers - Avro, CSV, FreeFromText, Json, Parquet, Scripted, XML • Record Reader and Writer support referencing a schema registry for retrieving schemas when necessary. • Enable processors that accept any data format without having to worry about the parsing and serialization logic. • Allows us to keep FlowFiles larger, each consisting of multiple records, which results in far better performance.
  • 49. © 2019 Cloudera, Inc. All rights reserved. 49 RUNNING SQL ON FLOWFILES • Evaluates one or more SQL queries against the contents of a FlowFile. • This can be used, for example, for field-specific filtering, transformation, and row-level filtering. • Columns can be renamed, simple calculations and aggregations performed. • The SQL statement must be valid ANSI SQL and is powered by Apache Calcite.
  • 50. Apache NiFi with Python Custom Processors Python as a 1st class citizen https://www.youtube.com/watch?v=9Oi_6nFmbPg https://github.com/apache/nifi/blob/614947e4ac6798ad80817e82514c39349d5faacb/nifi-docs/src/main/ asciidoc/python-developer-guide.adoc
  • 51.
  • 52. © 2023 Cloudera, Inc. All rights reserved. 52
  • 53. © 2023 Cloudera, Inc. All rights reserved. 53
  • 54. 54 © 2023 Cloudera, Inc. All rights reserved. READYFLOW GALLERY • Cloudera provided flow definitions • Cover most common data flow use cases • Optimized to work with CDP sources/destinations • Can be deployed and adjusted as needed
  • 55. 55 © 2023 Cloudera, Inc. All rights reserved. FLOW CATALOG • Central repository for flow definitions • Import existing NiFi flows • Manage flow definitions • Initiate flow deployments
  • 56. 56 © 2023 Cloudera, Inc. All rights reserved. DEPLOYMENT WIZARD • Turns flow definitions into flow deployments • Guides users through providing required configuration • Choose NiFi runtime version • Pick from pre-defined NiFi node sizes • Define KPIs for the deployment Start Deployment Wizard Provide Parameters Configure Sizing & Scaling Define KPIs
  • 57. 57 © 2023 Cloudera, Inc. All rights reserved. KEY PERFORMANCE INDICATORS • Visibility into flow deployments • Track high level flow performance • Track in-depth NiFi component metrics • Defined in Deployment Wizard • Monitoring & Alerts in Deployment Details KPI Definition in Deployment Wizard KPI Monitoring
  • 58. 58 © 2023 Cloudera, Inc. All rights reserved. DASHBOARD • Central Monitoring View • Monitors flow deployments across CDP environments • Monitors flow deployment health & performance • Drill into flow deployment to monitor system metrics and deployment events
  • 59. 59 © 2023 Cloudera, Inc. All rights reserved. DEPLOYMENT MANAGER • Manage flow deployment lifecycle (Suspend/Start/Terminate) • Add/Edit KPIs • Change sizing configuration • Update parameters • Change NiFi version of the deployment • Gateway to NiFi canvas
  • 60. 60 © 2023 Cloudera, Inc. All rights reserved. NIFI VERSION UPGRADES • Pick up NiFi hotfixes easily • Upgrade (or downgrade) the hotfix version of existing deployments • Rolling upgrade (if the deployment has >1 NiFi nodes)
  • 61. © 2019 Cloudera, Inc. All rights reserved. 61 Apache NiFi - Release History Review • Soon: NiFi 2.0 • Apr 23: 1.21 • Feb 23: 1.20 • Dec 22: 1.19 • Oct 22: 1.18 • Aug 22: 1.17 • Mar 22: 1.16 • Dec 21: 1.15 • Jul 21: 1.14 • Mar 21: 1.13 • Aug 20: 1.12 • Jan 20: 1.11 • Non-stop security improvements, vulnerability resolution, library updates. “Security by default”. Secure Zookeeper. • Key capabilities: all repository on-disk encryption, classloader isolation for native libs, empty all queues, prometheus, Azure AD authz, ‘Run Once’, Framework level retries, Stateless NiFi in NiFi, Parameter Contexts with inheritance and sensitive values, interactive component validation, generate fake records, scripted record transforms • New connectors for: IBM MQ, GCP, AWS, Azure, AMQP, MQTT, CDC/MySQL, Kudu, CEF, Geo Hashing, SNMP Traps, SMB, Airtable, Dropbox, Box, Hubspot, Workday, Zendesk, Iceberg, Snowflake, Salesforce, Google Drive
  • 62. © 2023 Cloudera, Inc. All rights reserved. 62 NiFi 2.0 - Quick Look - Moves to Java 11 and Java 17 as the base - JSON for flow serialization - XML and Templates concepts gone - Parameter Contexts for easier SDLC/Deployment - Variables removed. - Technical Debt Reduction - Delete a bunch of legacy components. - Focus on timer and cron based scheduling - Event driven scheduling removed / unnecessary.
  • 63. 63 © 2022 Cloudera, Inc. All rights reserved. Key / Common Themes Of Change beyond the framework Stateless NiFi engine • Faster execution (no disk) • End to end transactional • Run in FaaS or NiFi or command line Record Oriented Processors • The proven best practice pattern for handling any ‘record’ pattern • Same proven processors - just bring your format/schema Scripting / Python • Data Engineers and Data Science user base growing • Existing script processors always improving - heavy use. All the connectors! • Apache NiFi includes 362 processors by default with more available in Maven • 132 controller services • 18 Reporting Tasks
  • 64. © 2023 Cloudera, Inc. All rights reserved. BEST PRACTICES
  • 65. © 2023 Cloudera, Inc. All rights reserved. 65 STREAMING TECH DEBT TIPS • Version Control All Assets • Managed Public Cloud like Cloudera • Use DevOps and APIs • Latest Java and Python • Stream Sizing (NiFi, Kafka, Flink)
  • 66. © 2023 Cloudera, Inc. All rights reserved. EXAMPLES
  • 67. © 2023 Cloudera, Inc. All rights reserved. 67 Particulate Matter Sensor Pipeline
  • 68. © 2023 Cloudera, Inc. All rights reserved. © 2021 Cloudera, Inc. All rights reserved. 68 https://docs.cloudera.com/csa/1.10.0/how-to-ssb/topics/csa-ssb-cdc-connectors.html CDC with Debezium and Flink SQL Stream Builder with Flink SQL
  • 69. © 2023 Cloudera, Inc. All rights reserved. 69 CDC with Debezium and Kafka Kafka Connect
  • 70. © 2023 Cloudera, Inc. All rights reserved. 70
  • 71. © 2023 Cloudera, Inc. All rights reserved. 71
  • 72. © 2023 Cloudera, Inc. All rights reserved. 72
  • 73. © 2023 Cloudera, Inc. All rights reserved.
  • 74. © 2023 Cloudera, Inc. All rights reserved. RESOURCES AND WRAP-UP
  • 75. © 2023 Cloudera, Inc. All rights reserved. 75 Resources
  • 76. © 2023 Cloudera, Inc. All rights reserved. © 2021 Cloudera, Inc. All rights reserved. 76
  • 77. © 2023 Cloudera, Inc. All rights reserved. © 2021 Cloudera, Inc. All rights reserved. 77 Upcoming Events Nov 22 https://events.pinetool.ai/3079/#sessions/101077
  • 78. © 2023 Cloudera, Inc. All rights reserved. https://medium.com/@tspann/cdc-not-cat-data-capture-e43713879c03