SlideShare a Scribd company logo
1 of 56
Download to read offline
Partner SkillUp:
Implement a Universal Data Distribution
Architecture to Manage All Streaming Data
Tim Spann
Principal Developer Advocate in Data
In Motion for Cloudera
tspann@cloudera.com
Salvador Almazan
Partner Solutions Engineer, US
salmazan@cloudera.com Cloudera Data Services
Workshop Registration
2
© 2023 Cloudera, Inc. All rights reserved.
TODAY’S LEAD
Who am I?
@PaasDev
DZone Zone Leader and Big Data MVB
Princeton NJ Future of Data Meetup
ex-Pivotal Field Engineer ex-StreamNative ex-PwC
https://github.com/tspannhw https://www.datainmotion.dev/
Principal Data-in-Motion Developer Advocate
3
© 2023 Cloudera, Inc. All rights reserved.
CHALLENGES IN DATA COLLECTION & MOVEMENT
Platform/Flow Administrators
Operational
Challenges
cloud service provider specific data
flow monitoring, management &
security tools
public & private sector data
public or private clouds
Data Architects & Executives
Multiple Point
Solutions Required
vendor specific data
aggregation hubs & interfaces
specialized tools vendor specific monitoring &
management tools
custom code to collect
data from edge devices
edge devices
custom management tools App/Integration Developers
Lack Agility
4
© 2023 Cloudera, Inc. All rights reserved.
SOLUTION: UNIVERSAL DATA DISTRIBUTION
public & private sector data
edge devices
specialized tools
multi-public & hybrid clouds Consistent data flow design, deployment,
management & monitoring.
Any data. Any source. Any destination.
Data Architects & Executives
Platform/Flow Administrators
App/Integration Developers
Simplify & Accelerate
onboarding and
managing data sources.
Use no code tools.
Standardize development
tools for reusability.
Streamline Ops to deploy,
monitor, manage, secure
and auto-scale data
flows.
5
Analytics apps
AI Models
Any Data
Consumer
Event-driven apps
Streaming
data
Data at rest
Any Data
Change Data
Capture
Real-time Processing
Any Business
Event
● Analyze data in
motion
● Continuous
monitoring
● Trends and
anomalies
Continuous Results
Any Data
Analyst
● No-Code UI
● Author once publish
anywhere
● Analytics lifecycle
management for dev/ops
Data
Products
Data Lakehouse
6
© 2023 Cloudera, Inc. All rights reserved.
Transactions
Data
Account Data
Fraud Monitoring
Dashboard
Real-Time Fraud
Scoring
Business Event Logic Continuous Results
Freeze transaction
Device Logs
Flagged
Records
Data Lakehouse
What constitutes a
suspicious
transaction?
Defined by domain
expert, deployed as
continuous query
Stop Fraud when it happens
Simplified example of deployed use case
7
© 2023 Cloudera, Inc. All rights reserved.
What Challenges Do We Address
Fragmented infrastructure is complex to manage
Devs/Analysts/Data Science
● How do I understand the contents
of pipelines when I’m so focused
on code-heavy tools?
● How do I integrate and analyze
data without compromising
integrity?
● How am I supposed to deploy
innovative solutions and
models?
Siloed data
processing tools
● How do I manage security and
access across all these data
sources and pipelines?
● How do I ensure data governance,
reliability and performance?
● How am I supposed to scale
with demand?
Platform Administrator
Operational
challenges
CDO / CIO / VP of Data Mgmt
● How do I meet demands for
business insight?
● How do I manage resources and
control expenses?
● How am I supposed to
execute on Digital
Transformation goals?
Pace of business
demand
SQL Stream Builder Makes it Easy
Platform tools
built to scale
● Integrated security and
governance
● Click and deploy clusters
with automated reliability
and performance tools
● Scale with confidence
Unified Processing for
reduced complexity
● No-code single interface
for all real-time processing
● Click and deploy clusters,
analytics and pipelines
● Domain experts focused on
high value analysis
Simplified data
architecture
● Control over resources
and expenses
● Coherent data
architecture for maximum
insight and agility
● Accelerate digital
transformation
8
© 2023 Cloudera, Inc. All rights reserved.
Responding to Critical Events
Stream Processing Use Cases Have 4 Criteria
1. Urgent need to react/respond
– Rapidly cooling data and/or limited window of
opportunity
– Eg: fraud, cybersecurity, etc
2. Analytical insight is required
– Raw data does not contain all necessary
information
– Eg: anomaly detection
3. Relevant data is dispersed
– All necessary information is not in one stream
– Eg: transaction streams + customer tables
4. Data Apps require agility
– Embedding insights into business processes is a
non-trivial exercise
– Ie: Multiple endpoints, environments, formats or
consumers
Strike while the iron is hot
-Ancient Proverb
9
© 2023 Cloudera, Inc. All rights reserved.
UNIVERSAL DATA DISTRIBUTION - KEY REQUIREMENTS
Connect to Any Data Source Anywhere then Process and Deliver to Any Destination
Ingest Process Deliver
Active
Passive
ROUTE
FILTER
ENRICH
TRANSFORM
Data born in
the cloud
Data born
outside the
cloud
Any
destination
Connectors
Gateway Endpoint
Connectors
Agile Data
Pipelines
Scaling
Dev/ops
Real-Time
Observability
Universal Data
Distribution
All Roads Lead to UDD
Event Stream
Processing
11
© 2023 Cloudera, Inc. All rights reserved.
Maximize Impact of High Velocity Data
Solve for untapped data potential across hybrid environments
Rapid Data
Decay
Technical
Friction
Data Silos
Agile Data
Pipelines
Event Stream
Processing
Real-time
Observability
Dev/Ops Velocity
Untapped
Data
Potential
12
© 2023 Cloudera, Inc. All rights reserved.
No Code, Visual Flow Designer for Data Collection & Movement
End-to-end data flow, from many sources to many destinations
DATA-IN-MOTION
PLATFORM
14
© 2023 Cloudera, Inc. All rights reserved.
CLOUDERA DATAFLOW DATA-IN-MOTION PLATFORM
EDGE & FLOW MANAGEMENT — Everything you need
to build scalable data flows from the edge to the public
cloud leveraging Apache NiFi, Apache MiNiFi and Cloudera
Edge Flow Manager
STREAMS MESSAGING — Enterprise grade
messaging products for Apache Kafka. Streams
Messaging Manager to monitor/operate clusters, Streams
Replication Manager for HA/DR deployments, support for
Kafka Connect and Cruise Control
STREAM PROCESSING & ANALYTICS — Powered by
Apache Flink with SQL StreamBuilder, it provides
low-latency stream processing capabilities with advanced
windowing & state management
CLOUDERA SDX — Secure, Monitor and Govern your
Streaming workloads with the same tooling around
Apache Ranger & Apache Atlas.
15
© 2023 Cloudera, Inc. All rights reserved.
Key Features
DataFlow Designer SQL Stream Builder
ease of administration + intuitive GUI =
Self-service flow development
Instant data access + unified SQL processing =
Self-service streaming event processing
© 2023 Cloudera, Inc. All rights reserved. 16
Already using Spark? Need NiFi? Need Flink?
Want unified Batch/Stream?
Want highest Throughput?
Don’t need low latency?
Large files?
Scheduled batches?
Replacing Sqoop, ETL
Simple JDBC queries?
Transform individual records?
Want easy development?
Lots of small files, events, records, rows?
Continuous stream of rows
Support many different sources
Need Microservices, Batch and Stream?
Want high Throughput?
Want Low Latency?
Want Advanced Windowing and State?
Happy with a New Solution that is
best-in-class?
Spark, NiFi, Flink? Which engine to choose?
USE CASES
18
© 2023 Cloudera, Inc. All rights reserved.
UNIVERSAL DATA DISTRIBUTION USE CASES
IoT &
Streaming Data Collection
Data Lakehouse &
Warehouse Ingest
Modernize data pipelines with a single
tool that works with any data lake,
lakehouse or warehouse.
Collect from any source with any type
including unstructured data and
transform the data into the format that
the lakehouse, data lake or warehouse
requires.
Send data from IoT devices at the edge to a
central data flow in the cloud that scales up
and down as needed.
Process streaming data at scale, allowing
organizations to start their IoT projects small,
but with the confidence that their data flows
can manage data bursts caused by adding
more source devices as well as handle
intermittent connectivity issues.
19
© 2023 Cloudera, Inc. All rights reserved.
MEDICAL DEVICE MANUFACTURING
● Devices must be built to
exact specifications
● Reduce scrap and failures
by detecting deviations
from specifications during
manufacturing in real time
● Messages need to be split
to perform analytics
● 700,000 Records Per
Second with streaming
analytics taking place
within 2-4 seconds
● Use SQL Stream Builder
to inspect complex
nested structures to
measure and alert on
out-of-spec optic
resolutions and color
balance.
● Non-Programmer
Analysts able to
successfully develop and
implement algorithm in
SQL
● Immediately know when
devices miss
specification
● Improve cost-efficiency
via scrap reduction
● Operates in real time with
manufacturing producing
actionable alerts in
seconds
CHALLENGE SOLUTION OUTCOMES
Processed with Flink for
actionable real-time
alerts to reduce scrap
Messages Per Second
© 2023 Cloudera, Inc. All rights reserved. 20
MINING: IOT DATA COLLECTION WITH CLOUDERA DATAFLOW
Millions of data points
generated by factories and
refineries required
operational modernization
CHALLENGE SOLUTION OUTCOMES
Using NiFi and MiNiFi to
collect millions of
messages per minute from
IoT devices and sensor
data from the machinery
producing the steel
AI has automated speed
and adjustments of the
assembly line - increasing
speed by more than 5%
Tonnes of additional metal
processing per year
21
© 2023 Cloudera, Inc. All rights reserved.
LEADER IN GIGAOM REPORT FOR STREAMING DATA PLATFORMS
GigaOm Radar for Streaming Data Platforms
Cloudera is a fast mover in the
GigaOm innovation / platform
play quadrant.
CDF is “an impressively
enterprise-ready streaming
solution that combines open
source Apache Kafka, Flink,
Spark, and NiFi technologies …
extensive data governance;
multiple stream processing
engines.”
DATAFLOW
APACHE NIFI
23
© 2023 Cloudera, Inc. All rights reserved.
Integration w/
big data
platform
The Evolution from NiFi to Data-in-Motion
NSA tool created for
field
Spread to other
Gov agencies
Quick wins,
rapid spread
through field
2006-2009
HQ -“Corporate”
data, same problem.
2010-2012
UI optimization
enables massive
scale data
collection
Request to open
source granted
Apache NiFi is born!
Spinoff company
quickly acquired by
HortonWorks
2013-2015
Edge management
launched
2016-2018
Cloudera PaaS
launched
DataFlow as a
Service (K8s)
2019-2021
DataFlow
Functions
Universal Data
Distribution
Vision Unveiled
DataFlow
Designer
2022-present
Solving for 1st mile
data collection
Solving for last mile
data ingestion
Solving for end to end process
data distribution
Key Requirements:
● Speed & scale
● Ease of configurability
● Resilience
Key Requirements:
● Centralized management capabilities
● Flexibility in data types
● Remote delivery
Key Requirements:
● Development velocity
● Cloud optimized deployment
● Completeness of capabilities
Apache Kafka
Service launched
Apache Flink
added (acquisition)
NiFi 2.0
24
© 2023 Cloudera, Inc. All rights reserved.
DATAFLOW COMPETITIVE DIFFERENTIATORS
Upleveled key differentiators
Ingest from any data source
(450+) born inside and outside
of the the cloud with any
structure
1
Ingest in a Hybrid World:
From Any Data Source
Anywhere With Any Structure
Structured
Data
Unstructured
Data
Multi-Hybrid
Cloud
Turn any data source into
streaming mode and support
streaming scale supporting
hundreds of thousands of data
generating clients.
3
Streaming Mode
and Scale
Streaming
Stream
Processing
Enterprise
Scale
Development tooling that solves
real-world data integration
needs with a flow based
low-code development
paradigm with extensibility
4
Low Code Extensible
Developer Tooling
Developer /
Engineer
Tools
Deliver to any destination
including data lakes, lake
houses, data meshes and
cloud services
Deliver to any Warehouse or
Lake
or Mesh on any Cloud
Lakehouse Data Fabric Data Mesh
2
25
© 2023 Cloudera, Inc. All rights reserved.
Apache NiFi
Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data
center) to any downstream system with built in end-to-end security and provenance
ACQUIRE PROCESS DELIVER
• Over 300 Prebuilt Processors
• Easy to build your own
• Parse, Enrich & Apply Schema
• Filter, Split, Merger & Route
• Throttle & Backpressure
• Guaranteed Delivery
• Full data provenance from acquisition to
delivery
• Diverse, Non-Traditional Sources
• Eco-system integration
Advanced tooling to industrialize flow development
(Flow Development Life Cycle)
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLO
G
FTP
SFTP
HL7
UDP
XML
HTTP
EMAIL
HTML
IMAGE
SYSLO
G
HASH
MERGE
EXTRACT
DUPLICATE
SPLIT
ROUTE TEXT
ROUTE CONTENT
ROUTE CONTEXT
CONTROL RATE
DISTRIBUTE LOAD
GEOENRICH
SCAN
REPLACE
TRANSLATE
CONVERT
ENCRYPT
TALL
EVALUATE
EXECUTE
© 2023 Cloudera, Inc. All rights reserved. 26
Provenance
27
© 2023 Cloudera, Inc. All rights reserved.
Flow Deployment & Monitoring
Easily Deploy, Manage, Monitor, and Auto-Scale Flows
© 2023 Cloudera, Inc. All rights reserved. 28
Flow
Catalog
• Central repository for
flow definitions
• Import existing NiFi
flows
• Manage flow
definitions
• Initiate flow
deployments
© 2023 Cloudera, Inc. All rights reserved. 29
ReadyFlow
Gallery
• Cloudera provided
flow definitions
• Cover most common
data flow use cases
• Optimized to work
with CDP
sources/destinations
• Can be deployed and
adjusted as needed
30
© 2023 Cloudera, Inc. All rights reserved.
Apache NiFi with Python Custom Processors
Python as a 1st class citizen
31
© 2023 Cloudera, Inc. All rights reserved.
Processing one million events per second with
NiFi
32
© 2023 Cloudera, Inc. All rights reserved.
DEVELOP & DEPLOY SERVERLESS LOW CODE FUNCTIONS
Step1. Develop functions
on local workstation or in
CDP Public Cloud using
no-code, UI designer
Step 2. Run functions on
serverless compute
services in AWS, Azure &
GCP
AWS Lambda Azure Functions Google Cloud Functions
33
© 2023 Cloudera, Inc. All rights reserved.
Highlights - Cloudera DataFlow for the Public Cloud (CDF-PC)
Integrated with CDP – Out of the box connectors and proven patterns for integration with other CDP
services such as Cloudera Data Warehouse, Cloudera Operational Database or Cloudera Machine
Learning.
Ingest & move data in a hybrid world with an ecosystem of 450+ connectors that connect to and
distribute data from the edge and on-premises data centers, to SaaS solutions, and public clouds.
Enterprise-grade security & governance – Deploy your data flows with confidence and trust with
Cloudera’s offering unified security and governance across the entire platform.
Low-code developer tooling to simplify and accelerate onboarding and managing of new data sources
with a low-code, extensible visual flow designer.
Two NiFi runtimes optimized [1] for real-time, always running workloads with DataFlow deployments
and [2] for event-driven, short-lived batch jobs with DataFlow Functions for any public cloud.
34
© 2023 Cloudera, Inc. All rights reserved.
OPERATIONAL CHALLENGES WITH NiFi FLOWS
Resource
contention
Impacts performance of all
flows in the cluster
On-demand
manual scaling
Operational nightmare
Comprehensive flow
visibility
Monitoring NiFi flows
across multiple clusters
can be challenging
Oversizing
clusters
High infrastructure costs
35
© 2023 Cloudera, Inc. All rights reserved.
ARCHITECTURAL SHIFT IN DATAFLOW
Decouple NiFi Flow Designer and the NiFi Cluster Runtime to Support Diverse Runtimes
Classic NiFi
Architecture
NiFi Flow Designer + NiFi Cluster
Runtime are tightly coupled
NiFi Cluster
NiFi Cluster
Runtime
NiFi Flow Designer
Flow
Development
New NiFi
Architecture
Runtimes
Flow
Development
NiFi Flow Designer
CEM /
MiNiFi
DataFlow
Deployment
Containerized
NiFi Clusters
Kafka
Connect
Kafka centric /
Microservices
DataFlow
Function
Serverless
functions
Develop flows in designer and deploy in different runtimes
based on use case
Agent Based
Deployment
NiFi
Cluster
Stateful
traditional NiFi
Option 1
“Nifi on VMs”
Option 2
“Nifi on K8s”
36
© 2023 Cloudera, Inc. All rights reserved.
CONTAINER BASED DATAFLOW (AS OF TODAY PUBLIC CLOUD ONLY)
Flow Deployment Flow Monitoring
Allows easy flow deployment based
on NiFi 1.18 across CDP
environments (Dev, QA, Prod)
Define and assign KPIs to your
flows
Easy NiFi version upgrades
Update/Add KPIs, Update
Parameters, Change sizing
configuration
Automatic infrastructure scaling
based on CPU utilization
Central monitoring console for all
your flows across environments
Monitor flow metrics and
infrastructure usage
Define alerts for flows breaching
assigned KPIs
Flow Catalog
Keep track of your flow definitions
and versions in a central catalog
Reuse your existing NiFi flows by
uploading them to the catalog
Discover, search and reuse existing
flows easily
37
© 2023 Cloudera, Inc. All rights reserved.
FLOW DEVELOPMENT BEST PRACTICES
good
bad
Name your
processors/
connections
Parameterize
connection
information
Tag sensitive
properties as
“sensitive”
Define controller
services on
process group
level (except
Default NiFi SSL
Context Service)
38
© 2023 Cloudera, Inc. All rights reserved.
FLOW DEVELOPMENT BEST PRACTICES
CDPEnvironment Parameter & Default SSLContextService
CDP Environment
Parameter
•Use whenever Hadoop
configuration files are
needed
•CDF detects parameter
usage, obtains the
Hadoop configuration
files from SDX, makes
them available to NiFi
pods and replaces
parameter value
accordingly
•No copying of config
files required anymore
Default NiFi SSL Context
Service
•Use whenever SSL
Context Service is
required to interact with
CDP service in target
environment
•CDF detects reference,
creates a key and
truststore for the target
environment and
configures a Default NiFi
SSL Context Service
accordingly
•No more manual
creation of truststores
and moving around of
certificates to interact
with CDP services
•Default NiFi SSL Context
Service must be an
external controller
service i.e. defined
outside of the process
group that’s exported
39
© 2023 Cloudera, Inc. All rights reserved.
FLOW MANAGEMENT CLUSTERS - CDP DATA ACCESS
Object Store Access Governed by SDX
PutCDPObjectStore
PutCDPObjectStore
-cdp_username
-workload pw
-Target path
-cdp_username
-workload pw
-Target path
IDBroker
IDBroker
Mappings
Cdp_username -- IAM Role
Check mapping
Access Credential
https://docs.cloudera.com/management-console/cloud/security-overview/topics/security_how_identity_federation_works_in_cdp.html
Uses access credential to access S3
Check mapping
Access Credential
Mappings
Cdp_username -- Azure Managed Identity
Uses access credential to access ADLS
40
© 2023 Cloudera, Inc. All rights reserved.
Data Sources – Edge 2
Data Sources – Edge 3
Data Sources – Edge 1
Data Collection
at the Edge
Collect
NiFi / MiNiFi / Kafka Clients
Distribute
NiFi
Data Filtered by
NiFi
Visualize
Data Visualization
with Cloudera
DataViz
Analyze
Streaming OLAP
Analytics & Time
Series Store Powered
by
Kudu
Analyze
Streaming Analytics Apps
Stream
Processing
Powered by SQL
Stream Builder on
Flink
Buffer
Kafka
Syndicate
topics
Syndicate
Services Powered
by Kafka
Learn
CML
Build and Enrich
Models by
Cloudera Machine
Learning
LOG ANALYTICS REFERENCE ARCHITECTURE
EDGE FLOW MANAGEMENT
42
© 2023 Cloudera, Inc. All rights reserved.
Data Collection & Management at the Edge
Central command & control to design and deploy to thousands of agents
Edge Agent
Edge Agent
…
Edge Agent
Edge Agent
Lightweight edge agents focused on data
collection and processing at the edge
43
© 2023 Cloudera, Inc. All rights reserved.
Cloudera Edge Management
Edge device data collection and processing with easy to use central command and control
• Small footprint agent with MiNiFi
• Java and C++ agents
• Rich edge processors (edge collection &
processing)
• End to end lineage and security
• Kubernetes support
Flow
Authorship
Flow
Deployment
Flow
Monitoring
• Central Command and Control (C2)
• Design and deploy to millions of agents
• Edge Applications lifecycle management
• Multitenancy with Agent classes
• Native integration with other CDF services
+
A lightweight edge agent that implements the core
features of Apache NiFi, focusing on data collection
and processing at the edge
Edge Flow Manager
STREAM PROCESSING &
ANALYTICS
45
© 2023 Cloudera, Inc. All rights reserved.
STREAMING ANALYTICS ACCESSIBILITY ACROSS THE ENTERPRISE
1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. (second)
Filtering and
Joining Data
Streaming
Analytics
Both batch and
streaming data
Data Analysts Can
Write SQL Queries
Across the Line of Businesses
Capture Events
that Matter
Low-latency analytics use
cases
Event Processing
46
© 2023 Cloudera, Inc. All rights reserved.
SQL STREAM BUILDER (SSB)
SQL STREAM BUILDER allows
developers, analysts, and data
scientists to write streaming
applications with industry
standard SQL.
No Java or Scala code
development required.
Simplifies access to data in Kafka
& Flink. Connectors to batch data in
HDFS, Kudu, Hive, S3, JDBC, CDC
and more
Enrich streaming data with batch
data in a single tool
Democratize access to real-time data with just SQL
47
© 2023 Cloudera, Inc. All rights reserved.
STREAMING SQL BUILDER ON FLINK
Fast deployment of streaming SQL queries on CDP Public & Private Cloud
● SQL Stream Builder for fast curated
event lists
● Automatic schemas for streaming
events
● Materialized Views
● Instant APIs for streaming results
● Extensible with user defined functions
Key Highlights
© 2021 Cloudera, Inc. All rights reserved. 48
APACHE FLINK
• Distributed processing engine for
stateful computations
• 10s TBs of managed state
• Flexible and expressive APIs
• Guaranteed correctness &
Exactly-once state consistency
• Event-time semantics
• Flexible deployment & large
ecosystem (K8s, YARN, S3, HDFS..)
• Support for Flink SQL API
Real-Time
Insights
Event
Processing
Low
Latency
STREAMS MESSAGING
50
© 2022 Cloudera, Inc. All rights reserved.
STREAMS
MESSAGING
MANAGER
(SMM)
● Single Monitoring Dashboard for all
your Kafka Clusters across 4 entities
● Data Explorer
● Topic administration: CRUD
● Alerts on available metrics
● REST as a First Class Citizen
● Designed for the Enterprise
○ Support for Secure Kafka cluster
○ Rich Access Control Policies
51
© 2022 Cloudera, Inc. All rights reserved.
Streams
Replication
Manager (SRM)
• Event Replication engine for Kafka
• Supports active-active, multi-cluster,
cross DC replication scenarios
• Leverage Kafka Connect for
scalability and HA
• Replicate data and configurations
(ACL, partitioning, new topics, etc)
• Offset translation for simplified
failover
• Integrate replication monitoring with
SMM
52
© 2022 Cloudera, Inc. All rights reserved.
KConnect
Connectors
• New Connectors
• CDC Debezium Connectors
• SMM UI Integration
• Reuse your Kafka
Infrastructure
• Enterprise Security for
Secrets, Authentication
and Authorization
Sources
● ActiveMQ (via JMS)
● MQTT
● Syslog over TCP
● JDBC
● JMS
● HTTP
Sinks
● AWS S3
● ADLS
● Kudu
● HDFS
● HTTP
CDC Connectors
● MySQL
● PostGreSQL
● Oracle
● SQLServer
53
© 2022 Cloudera, Inc. All rights reserved.
Cruise Control
• Self- Healing from broker,
hard-drive and other failures
• Goal based rebalancing based
on replica and broker load and
other metrics
• Simplify broker onboarding &
decommissioning; no more
JSON replica assignments
DEMOS
55
© 2023 Cloudera, Inc. All rights reserved.
Future of Data - Princeton + Virtual
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
TH N Y U

More Related Content

Similar to Implement a Universal Data Distribution Architecture to Manage All Streaming Data

Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudPart 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudCloudera, Inc.
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table NotesTimothy Spann
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database RoundtableEric Kavanagh
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsconfluent
 
Azure Overview Csco
Azure Overview CscoAzure Overview Csco
Azure Overview Cscorajramab
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionCloudera, Inc.
 
Overcoming Data Gravity in Multi-Cloud Enterprise Architectures
Overcoming Data Gravity in Multi-Cloud Enterprise ArchitecturesOvercoming Data Gravity in Multi-Cloud Enterprise Architectures
Overcoming Data Gravity in Multi-Cloud Enterprise ArchitecturesVMware Tanzu
 
Cloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera, Inc.
 
Data Driven Advanced Analytics using Denodo Platform on AWS
Data Driven Advanced Analytics using Denodo Platform on AWSData Driven Advanced Analytics using Denodo Platform on AWS
Data Driven Advanced Analytics using Denodo Platform on AWSDenodo
 
Azure Overview Business Model Overview
Azure Overview Business Model OverviewAzure Overview Business Model Overview
Azure Overview Business Model Overviewrramabad
 
SplunkLive! London - Splunk App for Stream & MINT Breakout
SplunkLive! London - Splunk App for Stream & MINT BreakoutSplunkLive! London - Splunk App for Stream & MINT Breakout
SplunkLive! London - Splunk App for Stream & MINT BreakoutSplunk
 
Horizontal Scaling for Millions of Customers!
Horizontal Scaling for Millions of Customers! Horizontal Scaling for Millions of Customers!
Horizontal Scaling for Millions of Customers! elangovans
 
Peek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and RoadmapPeek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and RoadmapNeo4j
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...DataStax Academy
 
Maximizing Oil and Gas (Data) Asset Utilization with a Logical Data Fabric (A...
Maximizing Oil and Gas (Data) Asset Utilization with a Logical Data Fabric (A...Maximizing Oil and Gas (Data) Asset Utilization with a Logical Data Fabric (A...
Maximizing Oil and Gas (Data) Asset Utilization with a Logical Data Fabric (A...Denodo
 
The 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: ExposedThe 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: ExposedCloudera, Inc.
 
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)Denodo
 

Similar to Implement a Universal Data Distribution Architecture to Manage All Streaming Data (20)

Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the CloudPart 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
Part 2: Cloudera’s Operational Database: Unlocking New Benefits in the Cloud
 
Unconference Round Table Notes
Unconference Round Table NotesUnconference Round Table Notes
Unconference Round Table Notes
 
Horses for Courses: Database Roundtable
Horses for Courses: Database RoundtableHorses for Courses: Database Roundtable
Horses for Courses: Database Roundtable
 
Unlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insightsUnlocking the Power of IoT: A comprehensive approach to real-time insights
Unlocking the Power of IoT: A comprehensive approach to real-time insights
 
Azure Overview Csco
Azure Overview CscoAzure Overview Csco
Azure Overview Csco
 
Vue d'ensemble Dremio
Vue d'ensemble DremioVue d'ensemble Dremio
Vue d'ensemble Dremio
 
Get Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber SolutionGet Started with Cloudera’s Cyber Solution
Get Started with Cloudera’s Cyber Solution
 
Overcoming Data Gravity in Multi-Cloud Enterprise Architectures
Overcoming Data Gravity in Multi-Cloud Enterprise ArchitecturesOvercoming Data Gravity in Multi-Cloud Enterprise Architectures
Overcoming Data Gravity in Multi-Cloud Enterprise Architectures
 
Cloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for AnalyticsCloudera - The Modern Platform for Analytics
Cloudera - The Modern Platform for Analytics
 
Data Driven Advanced Analytics using Denodo Platform on AWS
Data Driven Advanced Analytics using Denodo Platform on AWSData Driven Advanced Analytics using Denodo Platform on AWS
Data Driven Advanced Analytics using Denodo Platform on AWS
 
Azure Overview Business Model Overview
Azure Overview Business Model OverviewAzure Overview Business Model Overview
Azure Overview Business Model Overview
 
SplunkLive! London - Splunk App for Stream & MINT Breakout
SplunkLive! London - Splunk App for Stream & MINT BreakoutSplunkLive! London - Splunk App for Stream & MINT Breakout
SplunkLive! London - Splunk App for Stream & MINT Breakout
 
Orange Business Live 2013 cloud breakout
Orange Business Live 2013 cloud breakoutOrange Business Live 2013 cloud breakout
Orange Business Live 2013 cloud breakout
 
Horizontal Scaling for Millions of Customers!
Horizontal Scaling for Millions of Customers! Horizontal Scaling for Millions of Customers!
Horizontal Scaling for Millions of Customers!
 
inmation Presentation
inmation Presentationinmation Presentation
inmation Presentation
 
Peek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and RoadmapPeek into Neo4j Product Strategy and Roadmap
Peek into Neo4j Product Strategy and Roadmap
 
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
Cassandra Summit 2014: Internet of Complex Things Analytics with Apache Cassa...
 
Maximizing Oil and Gas (Data) Asset Utilization with a Logical Data Fabric (A...
Maximizing Oil and Gas (Data) Asset Utilization with a Logical Data Fabric (A...Maximizing Oil and Gas (Data) Asset Utilization with a Logical Data Fabric (A...
Maximizing Oil and Gas (Data) Asset Utilization with a Logical Data Fabric (A...
 
The 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: ExposedThe 5 Biggest Data Myths in Telco: Exposed
The 5 Biggest Data Myths in Telco: Exposed
 
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
Datenvirtualisierung: Wie Sie Ihre Datenarchitektur agiler machen (German)
 

More from Timothy Spann

April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024Timothy Spann
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max PrincetonTimothy Spann
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesTimothy Spann
 
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...Timothy Spann
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-PipelinesTimothy Spann
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTimothy Spann
 
2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-ProfitsTimothy Spann
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...Timothy Spann
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsTimothy Spann
 
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...Timothy Spann
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI PipelinesTimothy Spann
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkTimothy Spann
 
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...Timothy Spann
 
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesOSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesTimothy Spann
 
Building Real-Time Travel Alerts
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel AlertsTimothy Spann
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkTimothy Spann
 
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data PipelinesTimothy Spann
 
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines DemoEvolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines DemoTimothy Spann
 
AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101Timothy Spann
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC MeetupTimothy Spann
 

More from Timothy Spann (20)

April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Real-Time AI Streaming - AI Max Princeton
Real-Time AI  Streaming - AI Max PrincetonReal-Time AI  Streaming - AI Max Princeton
Real-Time AI Streaming - AI Max Princeton
 
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming PipelinesConf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
 
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
 
28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines28March2024-Codeless-Generative-AI-Pipelines
28March2024-Codeless-Generative-AI-Pipelines
 
TCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI PipelinesTCFPro24 Building Real-Time Generative AI Pipelines
TCFPro24 Building Real-Time Generative AI Pipelines
 
2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
 
Conf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python ProcessorsConf42-Python-Building Apache NiFi 2.0 Python Processors
Conf42-Python-Building Apache NiFi 2.0 Python Processors
 
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
 
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
2024 Feb AI Meetup NYC GenAI_LLMs_ML_Data Codeless Generative AI Pipelines
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
 
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesOSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
 
Building Real-Time Travel Alerts
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel Alerts
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
 
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines DemoEvolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
 
AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
 

Recently uploaded

What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....kzayra69
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEOrtus Solutions, Corp
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesŁukasz Chruściel
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样umasea
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureDinusha Kumarasiri
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)jennyeacort
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmSujith Sukumaran
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceBrainSell Technologies
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio, Inc.
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtimeandrehoraa
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfLivetecs LLC
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxTier1 app
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaHanief Utama
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanyChristoph Pohl
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 

Recently uploaded (20)

What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....What are the key points to focus on before starting to learn ETL Development....
What are the key points to focus on before starting to learn ETL Development....
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASEBATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
BATTLEFIELD ORM: TIPS, TACTICS AND STRATEGIES FOR CONQUERING YOUR DATABASE
 
Unveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New FeaturesUnveiling the Future: Sylius 2.0 New Features
Unveiling the Future: Sylius 2.0 New Features
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
办理学位证(UQ文凭证书)昆士兰大学毕业证成绩单原版一模一样
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
Call Us🔝>༒+91-9711147426⇛Call In girls karol bagh (Delhi)
 
Intelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalmIntelligent Home Wi-Fi Solutions | ThinkPalm
Intelligent Home Wi-Fi Solutions | ThinkPalm
 
CRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. SalesforceCRM Contender Series: HubSpot vs. Salesforce
CRM Contender Series: HubSpot vs. Salesforce
 
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed DataAlluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
Alluxio Monthly Webinar | Cloud-Native Model Training on Distributed Data
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
SpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at RuntimeSpotFlow: Tracking Method Calls and States at Runtime
SpotFlow: Tracking Method Calls and States at Runtime
 
How to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdfHow to Track Employee Performance A Comprehensive Guide.pdf
How to Track Employee Performance A Comprehensive Guide.pdf
 
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptxKnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
KnowAPIs-UnknownPerf-jaxMainz-2024 (1).pptx
 
React Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief UtamaReact Server Component in Next.js by Hanief Utama
React Server Component in Next.js by Hanief Utama
 
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte GermanySuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
SuccessFactors 1H 2024 Release - Sneak-Peek by Deloitte Germany
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 

Implement a Universal Data Distribution Architecture to Manage All Streaming Data

  • 1. Partner SkillUp: Implement a Universal Data Distribution Architecture to Manage All Streaming Data Tim Spann Principal Developer Advocate in Data In Motion for Cloudera tspann@cloudera.com Salvador Almazan Partner Solutions Engineer, US salmazan@cloudera.com Cloudera Data Services Workshop Registration
  • 2. 2 © 2023 Cloudera, Inc. All rights reserved. TODAY’S LEAD Who am I? @PaasDev DZone Zone Leader and Big Data MVB Princeton NJ Future of Data Meetup ex-Pivotal Field Engineer ex-StreamNative ex-PwC https://github.com/tspannhw https://www.datainmotion.dev/ Principal Data-in-Motion Developer Advocate
  • 3. 3 © 2023 Cloudera, Inc. All rights reserved. CHALLENGES IN DATA COLLECTION & MOVEMENT Platform/Flow Administrators Operational Challenges cloud service provider specific data flow monitoring, management & security tools public & private sector data public or private clouds Data Architects & Executives Multiple Point Solutions Required vendor specific data aggregation hubs & interfaces specialized tools vendor specific monitoring & management tools custom code to collect data from edge devices edge devices custom management tools App/Integration Developers Lack Agility
  • 4. 4 © 2023 Cloudera, Inc. All rights reserved. SOLUTION: UNIVERSAL DATA DISTRIBUTION public & private sector data edge devices specialized tools multi-public & hybrid clouds Consistent data flow design, deployment, management & monitoring. Any data. Any source. Any destination. Data Architects & Executives Platform/Flow Administrators App/Integration Developers Simplify & Accelerate onboarding and managing data sources. Use no code tools. Standardize development tools for reusability. Streamline Ops to deploy, monitor, manage, secure and auto-scale data flows.
  • 5. 5 Analytics apps AI Models Any Data Consumer Event-driven apps Streaming data Data at rest Any Data Change Data Capture Real-time Processing Any Business Event ● Analyze data in motion ● Continuous monitoring ● Trends and anomalies Continuous Results Any Data Analyst ● No-Code UI ● Author once publish anywhere ● Analytics lifecycle management for dev/ops Data Products Data Lakehouse
  • 6. 6 © 2023 Cloudera, Inc. All rights reserved. Transactions Data Account Data Fraud Monitoring Dashboard Real-Time Fraud Scoring Business Event Logic Continuous Results Freeze transaction Device Logs Flagged Records Data Lakehouse What constitutes a suspicious transaction? Defined by domain expert, deployed as continuous query Stop Fraud when it happens Simplified example of deployed use case
  • 7. 7 © 2023 Cloudera, Inc. All rights reserved. What Challenges Do We Address Fragmented infrastructure is complex to manage Devs/Analysts/Data Science ● How do I understand the contents of pipelines when I’m so focused on code-heavy tools? ● How do I integrate and analyze data without compromising integrity? ● How am I supposed to deploy innovative solutions and models? Siloed data processing tools ● How do I manage security and access across all these data sources and pipelines? ● How do I ensure data governance, reliability and performance? ● How am I supposed to scale with demand? Platform Administrator Operational challenges CDO / CIO / VP of Data Mgmt ● How do I meet demands for business insight? ● How do I manage resources and control expenses? ● How am I supposed to execute on Digital Transformation goals? Pace of business demand SQL Stream Builder Makes it Easy Platform tools built to scale ● Integrated security and governance ● Click and deploy clusters with automated reliability and performance tools ● Scale with confidence Unified Processing for reduced complexity ● No-code single interface for all real-time processing ● Click and deploy clusters, analytics and pipelines ● Domain experts focused on high value analysis Simplified data architecture ● Control over resources and expenses ● Coherent data architecture for maximum insight and agility ● Accelerate digital transformation
  • 8. 8 © 2023 Cloudera, Inc. All rights reserved. Responding to Critical Events Stream Processing Use Cases Have 4 Criteria 1. Urgent need to react/respond – Rapidly cooling data and/or limited window of opportunity – Eg: fraud, cybersecurity, etc 2. Analytical insight is required – Raw data does not contain all necessary information – Eg: anomaly detection 3. Relevant data is dispersed – All necessary information is not in one stream – Eg: transaction streams + customer tables 4. Data Apps require agility – Embedding insights into business processes is a non-trivial exercise – Ie: Multiple endpoints, environments, formats or consumers Strike while the iron is hot -Ancient Proverb
  • 9. 9 © 2023 Cloudera, Inc. All rights reserved. UNIVERSAL DATA DISTRIBUTION - KEY REQUIREMENTS Connect to Any Data Source Anywhere then Process and Deliver to Any Destination Ingest Process Deliver Active Passive ROUTE FILTER ENRICH TRANSFORM Data born in the cloud Data born outside the cloud Any destination Connectors Gateway Endpoint Connectors
  • 11. 11 © 2023 Cloudera, Inc. All rights reserved. Maximize Impact of High Velocity Data Solve for untapped data potential across hybrid environments Rapid Data Decay Technical Friction Data Silos Agile Data Pipelines Event Stream Processing Real-time Observability Dev/Ops Velocity Untapped Data Potential
  • 12. 12 © 2023 Cloudera, Inc. All rights reserved. No Code, Visual Flow Designer for Data Collection & Movement End-to-end data flow, from many sources to many destinations
  • 14. 14 © 2023 Cloudera, Inc. All rights reserved. CLOUDERA DATAFLOW DATA-IN-MOTION PLATFORM EDGE & FLOW MANAGEMENT — Everything you need to build scalable data flows from the edge to the public cloud leveraging Apache NiFi, Apache MiNiFi and Cloudera Edge Flow Manager STREAMS MESSAGING — Enterprise grade messaging products for Apache Kafka. Streams Messaging Manager to monitor/operate clusters, Streams Replication Manager for HA/DR deployments, support for Kafka Connect and Cruise Control STREAM PROCESSING & ANALYTICS — Powered by Apache Flink with SQL StreamBuilder, it provides low-latency stream processing capabilities with advanced windowing & state management CLOUDERA SDX — Secure, Monitor and Govern your Streaming workloads with the same tooling around Apache Ranger & Apache Atlas.
  • 15. 15 © 2023 Cloudera, Inc. All rights reserved. Key Features DataFlow Designer SQL Stream Builder ease of administration + intuitive GUI = Self-service flow development Instant data access + unified SQL processing = Self-service streaming event processing
  • 16. © 2023 Cloudera, Inc. All rights reserved. 16 Already using Spark? Need NiFi? Need Flink? Want unified Batch/Stream? Want highest Throughput? Don’t need low latency? Large files? Scheduled batches? Replacing Sqoop, ETL Simple JDBC queries? Transform individual records? Want easy development? Lots of small files, events, records, rows? Continuous stream of rows Support many different sources Need Microservices, Batch and Stream? Want high Throughput? Want Low Latency? Want Advanced Windowing and State? Happy with a New Solution that is best-in-class? Spark, NiFi, Flink? Which engine to choose?
  • 18. 18 © 2023 Cloudera, Inc. All rights reserved. UNIVERSAL DATA DISTRIBUTION USE CASES IoT & Streaming Data Collection Data Lakehouse & Warehouse Ingest Modernize data pipelines with a single tool that works with any data lake, lakehouse or warehouse. Collect from any source with any type including unstructured data and transform the data into the format that the lakehouse, data lake or warehouse requires. Send data from IoT devices at the edge to a central data flow in the cloud that scales up and down as needed. Process streaming data at scale, allowing organizations to start their IoT projects small, but with the confidence that their data flows can manage data bursts caused by adding more source devices as well as handle intermittent connectivity issues.
  • 19. 19 © 2023 Cloudera, Inc. All rights reserved. MEDICAL DEVICE MANUFACTURING ● Devices must be built to exact specifications ● Reduce scrap and failures by detecting deviations from specifications during manufacturing in real time ● Messages need to be split to perform analytics ● 700,000 Records Per Second with streaming analytics taking place within 2-4 seconds ● Use SQL Stream Builder to inspect complex nested structures to measure and alert on out-of-spec optic resolutions and color balance. ● Non-Programmer Analysts able to successfully develop and implement algorithm in SQL ● Immediately know when devices miss specification ● Improve cost-efficiency via scrap reduction ● Operates in real time with manufacturing producing actionable alerts in seconds CHALLENGE SOLUTION OUTCOMES Processed with Flink for actionable real-time alerts to reduce scrap Messages Per Second
  • 20. © 2023 Cloudera, Inc. All rights reserved. 20 MINING: IOT DATA COLLECTION WITH CLOUDERA DATAFLOW Millions of data points generated by factories and refineries required operational modernization CHALLENGE SOLUTION OUTCOMES Using NiFi and MiNiFi to collect millions of messages per minute from IoT devices and sensor data from the machinery producing the steel AI has automated speed and adjustments of the assembly line - increasing speed by more than 5% Tonnes of additional metal processing per year
  • 21. 21 © 2023 Cloudera, Inc. All rights reserved. LEADER IN GIGAOM REPORT FOR STREAMING DATA PLATFORMS GigaOm Radar for Streaming Data Platforms Cloudera is a fast mover in the GigaOm innovation / platform play quadrant. CDF is “an impressively enterprise-ready streaming solution that combines open source Apache Kafka, Flink, Spark, and NiFi technologies … extensive data governance; multiple stream processing engines.”
  • 23. 23 © 2023 Cloudera, Inc. All rights reserved. Integration w/ big data platform The Evolution from NiFi to Data-in-Motion NSA tool created for field Spread to other Gov agencies Quick wins, rapid spread through field 2006-2009 HQ -“Corporate” data, same problem. 2010-2012 UI optimization enables massive scale data collection Request to open source granted Apache NiFi is born! Spinoff company quickly acquired by HortonWorks 2013-2015 Edge management launched 2016-2018 Cloudera PaaS launched DataFlow as a Service (K8s) 2019-2021 DataFlow Functions Universal Data Distribution Vision Unveiled DataFlow Designer 2022-present Solving for 1st mile data collection Solving for last mile data ingestion Solving for end to end process data distribution Key Requirements: ● Speed & scale ● Ease of configurability ● Resilience Key Requirements: ● Centralized management capabilities ● Flexibility in data types ● Remote delivery Key Requirements: ● Development velocity ● Cloud optimized deployment ● Completeness of capabilities Apache Kafka Service launched Apache Flink added (acquisition) NiFi 2.0
  • 24. 24 © 2023 Cloudera, Inc. All rights reserved. DATAFLOW COMPETITIVE DIFFERENTIATORS Upleveled key differentiators Ingest from any data source (450+) born inside and outside of the the cloud with any structure 1 Ingest in a Hybrid World: From Any Data Source Anywhere With Any Structure Structured Data Unstructured Data Multi-Hybrid Cloud Turn any data source into streaming mode and support streaming scale supporting hundreds of thousands of data generating clients. 3 Streaming Mode and Scale Streaming Stream Processing Enterprise Scale Development tooling that solves real-world data integration needs with a flow based low-code development paradigm with extensibility 4 Low Code Extensible Developer Tooling Developer / Engineer Tools Deliver to any destination including data lakes, lake houses, data meshes and cloud services Deliver to any Warehouse or Lake or Mesh on any Cloud Lakehouse Data Fabric Data Mesh 2
  • 25. 25 © 2023 Cloudera, Inc. All rights reserved. Apache NiFi Enable easy ingestion, routing, management and delivery of any data anywhere (Edge, cloud, data center) to any downstream system with built in end-to-end security and provenance ACQUIRE PROCESS DELIVER • Over 300 Prebuilt Processors • Easy to build your own • Parse, Enrich & Apply Schema • Filter, Split, Merger & Route • Throttle & Backpressure • Guaranteed Delivery • Full data provenance from acquisition to delivery • Diverse, Non-Traditional Sources • Eco-system integration Advanced tooling to industrialize flow development (Flow Development Life Cycle) FTP SFTP HL7 UDP XML HTTP EMAIL HTML IMAGE SYSLO G FTP SFTP HL7 UDP XML HTTP EMAIL HTML IMAGE SYSLO G HASH MERGE EXTRACT DUPLICATE SPLIT ROUTE TEXT ROUTE CONTENT ROUTE CONTEXT CONTROL RATE DISTRIBUTE LOAD GEOENRICH SCAN REPLACE TRANSLATE CONVERT ENCRYPT TALL EVALUATE EXECUTE
  • 26. © 2023 Cloudera, Inc. All rights reserved. 26 Provenance
  • 27. 27 © 2023 Cloudera, Inc. All rights reserved. Flow Deployment & Monitoring Easily Deploy, Manage, Monitor, and Auto-Scale Flows
  • 28. © 2023 Cloudera, Inc. All rights reserved. 28 Flow Catalog • Central repository for flow definitions • Import existing NiFi flows • Manage flow definitions • Initiate flow deployments
  • 29. © 2023 Cloudera, Inc. All rights reserved. 29 ReadyFlow Gallery • Cloudera provided flow definitions • Cover most common data flow use cases • Optimized to work with CDP sources/destinations • Can be deployed and adjusted as needed
  • 30. 30 © 2023 Cloudera, Inc. All rights reserved. Apache NiFi with Python Custom Processors Python as a 1st class citizen
  • 31. 31 © 2023 Cloudera, Inc. All rights reserved. Processing one million events per second with NiFi
  • 32. 32 © 2023 Cloudera, Inc. All rights reserved. DEVELOP & DEPLOY SERVERLESS LOW CODE FUNCTIONS Step1. Develop functions on local workstation or in CDP Public Cloud using no-code, UI designer Step 2. Run functions on serverless compute services in AWS, Azure & GCP AWS Lambda Azure Functions Google Cloud Functions
  • 33. 33 © 2023 Cloudera, Inc. All rights reserved. Highlights - Cloudera DataFlow for the Public Cloud (CDF-PC) Integrated with CDP – Out of the box connectors and proven patterns for integration with other CDP services such as Cloudera Data Warehouse, Cloudera Operational Database or Cloudera Machine Learning. Ingest & move data in a hybrid world with an ecosystem of 450+ connectors that connect to and distribute data from the edge and on-premises data centers, to SaaS solutions, and public clouds. Enterprise-grade security & governance – Deploy your data flows with confidence and trust with Cloudera’s offering unified security and governance across the entire platform. Low-code developer tooling to simplify and accelerate onboarding and managing of new data sources with a low-code, extensible visual flow designer. Two NiFi runtimes optimized [1] for real-time, always running workloads with DataFlow deployments and [2] for event-driven, short-lived batch jobs with DataFlow Functions for any public cloud.
  • 34. 34 © 2023 Cloudera, Inc. All rights reserved. OPERATIONAL CHALLENGES WITH NiFi FLOWS Resource contention Impacts performance of all flows in the cluster On-demand manual scaling Operational nightmare Comprehensive flow visibility Monitoring NiFi flows across multiple clusters can be challenging Oversizing clusters High infrastructure costs
  • 35. 35 © 2023 Cloudera, Inc. All rights reserved. ARCHITECTURAL SHIFT IN DATAFLOW Decouple NiFi Flow Designer and the NiFi Cluster Runtime to Support Diverse Runtimes Classic NiFi Architecture NiFi Flow Designer + NiFi Cluster Runtime are tightly coupled NiFi Cluster NiFi Cluster Runtime NiFi Flow Designer Flow Development New NiFi Architecture Runtimes Flow Development NiFi Flow Designer CEM / MiNiFi DataFlow Deployment Containerized NiFi Clusters Kafka Connect Kafka centric / Microservices DataFlow Function Serverless functions Develop flows in designer and deploy in different runtimes based on use case Agent Based Deployment NiFi Cluster Stateful traditional NiFi Option 1 “Nifi on VMs” Option 2 “Nifi on K8s”
  • 36. 36 © 2023 Cloudera, Inc. All rights reserved. CONTAINER BASED DATAFLOW (AS OF TODAY PUBLIC CLOUD ONLY) Flow Deployment Flow Monitoring Allows easy flow deployment based on NiFi 1.18 across CDP environments (Dev, QA, Prod) Define and assign KPIs to your flows Easy NiFi version upgrades Update/Add KPIs, Update Parameters, Change sizing configuration Automatic infrastructure scaling based on CPU utilization Central monitoring console for all your flows across environments Monitor flow metrics and infrastructure usage Define alerts for flows breaching assigned KPIs Flow Catalog Keep track of your flow definitions and versions in a central catalog Reuse your existing NiFi flows by uploading them to the catalog Discover, search and reuse existing flows easily
  • 37. 37 © 2023 Cloudera, Inc. All rights reserved. FLOW DEVELOPMENT BEST PRACTICES good bad Name your processors/ connections Parameterize connection information Tag sensitive properties as “sensitive” Define controller services on process group level (except Default NiFi SSL Context Service)
  • 38. 38 © 2023 Cloudera, Inc. All rights reserved. FLOW DEVELOPMENT BEST PRACTICES CDPEnvironment Parameter & Default SSLContextService CDP Environment Parameter •Use whenever Hadoop configuration files are needed •CDF detects parameter usage, obtains the Hadoop configuration files from SDX, makes them available to NiFi pods and replaces parameter value accordingly •No copying of config files required anymore Default NiFi SSL Context Service •Use whenever SSL Context Service is required to interact with CDP service in target environment •CDF detects reference, creates a key and truststore for the target environment and configures a Default NiFi SSL Context Service accordingly •No more manual creation of truststores and moving around of certificates to interact with CDP services •Default NiFi SSL Context Service must be an external controller service i.e. defined outside of the process group that’s exported
  • 39. 39 © 2023 Cloudera, Inc. All rights reserved. FLOW MANAGEMENT CLUSTERS - CDP DATA ACCESS Object Store Access Governed by SDX PutCDPObjectStore PutCDPObjectStore -cdp_username -workload pw -Target path -cdp_username -workload pw -Target path IDBroker IDBroker Mappings Cdp_username -- IAM Role Check mapping Access Credential https://docs.cloudera.com/management-console/cloud/security-overview/topics/security_how_identity_federation_works_in_cdp.html Uses access credential to access S3 Check mapping Access Credential Mappings Cdp_username -- Azure Managed Identity Uses access credential to access ADLS
  • 40. 40 © 2023 Cloudera, Inc. All rights reserved. Data Sources – Edge 2 Data Sources – Edge 3 Data Sources – Edge 1 Data Collection at the Edge Collect NiFi / MiNiFi / Kafka Clients Distribute NiFi Data Filtered by NiFi Visualize Data Visualization with Cloudera DataViz Analyze Streaming OLAP Analytics & Time Series Store Powered by Kudu Analyze Streaming Analytics Apps Stream Processing Powered by SQL Stream Builder on Flink Buffer Kafka Syndicate topics Syndicate Services Powered by Kafka Learn CML Build and Enrich Models by Cloudera Machine Learning LOG ANALYTICS REFERENCE ARCHITECTURE
  • 42. 42 © 2023 Cloudera, Inc. All rights reserved. Data Collection & Management at the Edge Central command & control to design and deploy to thousands of agents Edge Agent Edge Agent … Edge Agent Edge Agent Lightweight edge agents focused on data collection and processing at the edge
  • 43. 43 © 2023 Cloudera, Inc. All rights reserved. Cloudera Edge Management Edge device data collection and processing with easy to use central command and control • Small footprint agent with MiNiFi • Java and C++ agents • Rich edge processors (edge collection & processing) • End to end lineage and security • Kubernetes support Flow Authorship Flow Deployment Flow Monitoring • Central Command and Control (C2) • Design and deploy to millions of agents • Edge Applications lifecycle management • Multitenancy with Agent classes • Native integration with other CDF services + A lightweight edge agent that implements the core features of Apache NiFi, focusing on data collection and processing at the edge Edge Flow Manager
  • 45. 45 © 2023 Cloudera, Inc. All rights reserved. STREAMING ANALYTICS ACCESSIBILITY ACROSS THE ENTERPRISE 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. (second) Filtering and Joining Data Streaming Analytics Both batch and streaming data Data Analysts Can Write SQL Queries Across the Line of Businesses Capture Events that Matter Low-latency analytics use cases Event Processing
  • 46. 46 © 2023 Cloudera, Inc. All rights reserved. SQL STREAM BUILDER (SSB) SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry standard SQL. No Java or Scala code development required. Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more Enrich streaming data with batch data in a single tool Democratize access to real-time data with just SQL
  • 47. 47 © 2023 Cloudera, Inc. All rights reserved. STREAMING SQL BUILDER ON FLINK Fast deployment of streaming SQL queries on CDP Public & Private Cloud ● SQL Stream Builder for fast curated event lists ● Automatic schemas for streaming events ● Materialized Views ● Instant APIs for streaming results ● Extensible with user defined functions Key Highlights
  • 48. © 2021 Cloudera, Inc. All rights reserved. 48 APACHE FLINK • Distributed processing engine for stateful computations • 10s TBs of managed state • Flexible and expressive APIs • Guaranteed correctness & Exactly-once state consistency • Event-time semantics • Flexible deployment & large ecosystem (K8s, YARN, S3, HDFS..) • Support for Flink SQL API Real-Time Insights Event Processing Low Latency
  • 50. 50 © 2022 Cloudera, Inc. All rights reserved. STREAMS MESSAGING MANAGER (SMM) ● Single Monitoring Dashboard for all your Kafka Clusters across 4 entities ● Data Explorer ● Topic administration: CRUD ● Alerts on available metrics ● REST as a First Class Citizen ● Designed for the Enterprise ○ Support for Secure Kafka cluster ○ Rich Access Control Policies
  • 51. 51 © 2022 Cloudera, Inc. All rights reserved. Streams Replication Manager (SRM) • Event Replication engine for Kafka • Supports active-active, multi-cluster, cross DC replication scenarios • Leverage Kafka Connect for scalability and HA • Replicate data and configurations (ACL, partitioning, new topics, etc) • Offset translation for simplified failover • Integrate replication monitoring with SMM
  • 52. 52 © 2022 Cloudera, Inc. All rights reserved. KConnect Connectors • New Connectors • CDC Debezium Connectors • SMM UI Integration • Reuse your Kafka Infrastructure • Enterprise Security for Secrets, Authentication and Authorization Sources ● ActiveMQ (via JMS) ● MQTT ● Syslog over TCP ● JDBC ● JMS ● HTTP Sinks ● AWS S3 ● ADLS ● Kudu ● HDFS ● HTTP CDC Connectors ● MySQL ● PostGreSQL ● Oracle ● SQLServer
  • 53. 53 © 2022 Cloudera, Inc. All rights reserved. Cruise Control • Self- Healing from broker, hard-drive and other failures • Goal based rebalancing based on replica and broker load and other metrics • Simplify broker onboarding & decommissioning; no more JSON replica assignments
  • 54. DEMOS
  • 55. 55 © 2023 Cloudera, Inc. All rights reserved. Future of Data - Princeton + Virtual @PaasDev https://www.meetup.com/futureofdata-princeton/ From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
  • 56. TH N Y U