© 2024 Cloudera, Inc. All rights reserved.
Building Real-Time Generative AI Pipelines
Tim Spann
Principal Developer Advocate
March 15, 2024
This week in Apache NiFi, Apache Flink,
Apache Kafka, ML, AI, Apache Spark, Apache
Iceberg, Python, Java and Open Source
friends.
https://bit.ly/32dAJft
https://www.meetup.com/futureofdata-princeton/
FLaNK Stack Weekly by Tim Spann
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
Future of Data - NYC + NJ + Philly + Virtual
https://linktr.ee/tspannhw
Tim Spann
Twitter: @PaasDev Blog: datainmotion.dev
Principal Developer Advocate
Princeton Future of Data Meetup
ex-Pivotal, ex-Hortonworks,
ex-StreamNative, ex-HPE,
ex-PwC, ex-EY.
https://medium.com/@tspann
https://github.com/tspannhw
Some common Vector DBs Open Community & Open Models
RAPID INNOVATION IN THE LLM SPACE
Too much to cover today, but you should know the common LLMs, frameworks, and tools
Notable LLMs
Closed Models: GPT-3.5, GPT-4, Claude 2
Open Models: Llama 2, Code Llama, Mistral 7B, Mixtral 8x7B
++ 100s more… check out the HuggingFace LLM Leaderboard (pretrained, domain fine-tuned, chat models, …)
Popular LLM Frameworks
When to use one over the other? Use LangChain if you need a general-purpose framework with
flexibility and extensibility. Consider LlamaIndex if you’re building a RAG-only app (retrieval/search).
LangChain is a framework for developing apps powered by LLMs
● Python and JavaScript Libraries
● Provides modules for LLM
Interface, Retrieval, & Agents
LlamaIndex is a framework designed specifically for RAG apps
● Python and JavaScript Libraries
● Provides built in optimizations /
techniques for advanced RAG
HuggingFace is an ML community for hosting &
collaborating on models, datasets, and ML applications
● Latest open source LLMs are in HuggingFace
● + great learning resources / demos
https://huggingface.co/
Open Source vs Self Hosted vs SaaS option
Enterprise Knowledge Base / Chatbot / Q&A
- Customer Support & Troubleshooting
- Enable open ended conversations with user
provided prompts
Code assistant:
- Provide relevant snippets of code as a
response to a request written in natural
language.
- Assist with creating test cases and
synthetic test data.
- Reference other relevant data such as a
company’s documentation to help provide
more accurate responses.
Social and emotional sensing
- Gauge emotions and opinions based on a
piece of text.
- Understand and deliver a more nuanced
message back based on sentiment.
ENTERPRISE WIDE USE CASES FOR AN LLM
Classification and Clustering
- Categorize and sort large volumes of data
into common themes and trends to support
more informed decision making.
Language Translation
- Globalize your content by feeding web
pages through LLMs for translation.
- Combine with chatbots to provide
multilingual support to your customer base.
Document Summarization
- Distill large amounts of text down to the
most relevant points.
Content Generation
- Provide detailed and contextually relevant
prompts to develop outlines, brainstorm
ideas and approaches for content.
Adoption dependent upon an Enterprise’s risk tolerance, restrictions, decision rights and disclosure obligations.
NLP / AI / LLM
Generative AI
Which Model and When?
Use the right model for the right job: closed or open source
Closed Source
Usage can easily scale but so can your costs
Rapidly improving AI models
Most advanced AI models
Excel at more specialized tasks
Great for a wide range of tasks
Open Source
Better cost planning
Compliance, privacy, and security risks
More control over where & how models are deployed
Adoption of Generative AI is a Journey
Identifying AI challenges in the enterprise
Challenge: Data integration barriers
What’s missing:
● Streamlined access to enterprise data
Challenge: Rigid model infrastructure
What’s missing:
● Modularity
● Flexibility
● AI Ops
Challenge: Lack of security and transparency
What’s missing:
● Model control
● Built-in security
● Visibility & governance
Data = Organization Context
Your data enables contextually accurate responses from LLMs
Without organization context: User Query → Large Language Model → Contextually Inaccurate Response
With organization context: User Query + Data (Organization Context) → Large Language Model → Contextually Accurate Response
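As a rough illustration of how organization data can be placed in front of a user query, here is a minimal sketch in plain Python. The function name `build_prompt`, the prompt wording, and the `max_chars` budget are assumptions made for this example; they are not taken from the talk or from any particular framework.

```python
def build_prompt(question: str, context_chunks: list[str], max_chars: int = 2000) -> str:
    """Prepend organization context to a user query (hypothetical helper)."""
    selected = []
    used = 0
    for chunk in context_chunks:
        # Stop adding context once the (assumed) character budget is exhausted.
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)
    context = "\n".join(f"- {c}" for c in selected)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
    )


if __name__ == "__main__":
    print(build_prompt(
        "What is the refund window?",
        ["Refunds are accepted within 30 days."],
    ))
```

The resulting string would be sent to whichever LLM is in use; without the context lines, the model has only its training data to answer from.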
Layers: CLOSED-SOURCE FOUNDATION MODELS, MODEL HUBS, OPEN SOURCE FOUNDATION MODELS, FINE-TUNED MODELS, PRIVATE VECTOR STORE, MANAGED VECTOR STORE, CLOUD INFRASTRUCTURE, SPECIALIZED HARDWARE
Examples shown: Milvus, Solr*; Meta (Llama 2); Applied Machine Learning Prototypes (AMPs); Hugging Face; Pinecone; APIs: OpenAI (GPT-4 Turbo); Amazon Bedrock: Anthropic (Claude 2), Cohere…
Pipeline stages: DATA WRANGLING; REAL-TIME DATA INGEST & ROUTING; AI MODEL TRAINING & INFERENCE / SERVING; DATA STORE & VISUALIZATION
Open Data Lakehouse
AI APPLICATIONS
Data sources: Live Q&A, Travel Advisories, Weather Reports, Documents, Social Media, Databases, Transactions, Public Data Feeds, S3 / Files, Logs, ATM Data, Live Chat, …
ARCHITECTURE
Stages: INTERACT, COLLECT, STORE, ENRICH, REPORT, Distribute, Visualize, Automate, Predict
AI BASED ENHANCEMENTS: Predict, Automate
Components: VECTOR DATABASE, LLM, Machine Learning, Data Visualization, Data Flow, Data Warehouse, SQL Stream Builder, Messaging Broker
Data: Input Sentences, Generated Text, Timestamps, Enrichments, Aggregations, Real-time alerting
DATAFLOW / STREAMING
LLM USE CASE
Vector DB
AI Model
Unstructured file types
Data in Motion on Cloudera Data Platform (CDP)
Capture, process & distribute any data, anywhere
Other enterprise data Open Data Lakehouse
Materialized Views
Structured Sources
Applications/API’s
Streams
https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
NiFi 2.0.0 Features
● Python Integration
● Parameters
● JDK 21+
● JSON Flow Serialization
● Rules Engine for Development Assistance
● Run Process Group as Stateless
● flow.json.gz
https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals
DataFlow Pipelines Can Help
External Context Ingest
Ingesting, routing, cleaning, enriching, transforming,
parsing, chunking and vectorizing structured,
unstructured, semi-structured, and binary data and
documents
Prompt engineering
Crafting and structuring queries to optimize
LLM responses
Context Retrieval
Enhancing LLM with external context such as
Retrieval Augmented Generation (RAG)
Roundtrip Interface
Act as a Discord, REST, Kafka, SQL, Slack bot to
roundtrip discussions
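The chunking, vectorizing, and context retrieval steps listed above can be sketched in a few lines of standard-library Python. This is a toy stand-in under stated assumptions: `chunk_text`, `embed`, `cosine`, and `top_k` are hypothetical helper names, and the bag-of-words "embedding" only illustrates the idea; a real pipeline would typically use an embedding model and a vector database.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
        start += step
    return chunks


def embed(text: str) -> Counter:
    """Toy vectorizer: lowercase word counts (assumption, not a real embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


if __name__ == "__main__":
    docs = [
        "NiFi routes and transforms data",
        "Flink runs SQL on streams",
        "Kafka is a messaging broker",
    ]
    print(top_k("streams SQL", docs, k=1))
```

The retrieved chunks would then be added to the prompt as external context, which is the retrieval half of RAG.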
UNSTRUCTURED DATA WITH NIFI
• Archives - tar, gzipped, zipped, …
• Images - PNG, JPG, GIF, BMP, …
• Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, …
• Videos - MP4, Clips, MOV, YouTube URL, …
• Sound - MP3, …
• Social / Chat - Slack, Discord, Twitter, REST, Email, …
• Identify Mime Types, Chunk Documents, Store to Vector Database
• Parse Documents - HTML, Markdown, PDF, Word, Excel, Powerpoint
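To make the "identify MIME types, then route" step concrete, here is a small sketch using Python's standard `mimetypes` module. It guesses from the file name only, which is a simplification (content-based detection is more reliable), and the route names and the `route_for` helper are assumptions invented for this example.

```python
import mimetypes

# Hypothetical routing table: top-level MIME type -> downstream route name.
ROUTES_BY_MAJOR_TYPE = {
    "image": "images",
    "video": "videos",
    "audio": "sound",
    "text": "documents",
}
DOCUMENT_TYPES = {"application/pdf", "application/msword", "application/rtf"}
ARCHIVE_TYPES = {"application/zip", "application/x-tar", "application/gzip"}


def route_for(filename: str) -> str:
    """Pick a route for a file based on its guessed MIME type."""
    mime, encoding = mimetypes.guess_type(filename)
    # A non-empty encoding (e.g. gzip) means the file is compressed.
    if encoding is not None or mime in ARCHIVE_TYPES:
        return "archives"
    if mime is None:
        return "unknown"
    if mime in DOCUMENT_TYPES:
        return "documents"
    major = mime.split("/", 1)[0]
    return ROUTES_BY_MAJOR_TYPE.get(major, "unknown")


if __name__ == "__main__":
    for name in ["report.pdf", "photo.png", "backup.tar.gz", "data.unknownext"]:
        print(name, "->", route_for(name))
```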
CLOUD ML/DL/AI/Vector Database Services
• Cloudera ML
• Amazon Polly, Translate, Textract, Transcribe, Bedrock, …
• Hugging Face
• IBM watsonx.ai
• Vector Stores Anywhere: Weaviate, Pinecone, Milvus,
Chroma DB, Solr, …
https://medium.com/cloudera-inc/google-gemma-for-real-time-lightweight-open-llm-inference-88efe98e580f
Python Processors
Extract Company Names
● Python 3.10+
● Hugging Face, NLP, spaCy, PyTorch
https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
WatsonX SDK To Foundation
● Python 3.10+
● LLM
● WatsonX.AI Foundation Models
● Inference
● Secure
● Official SDK from IBM
https://github.com/tspannhw/FLaNK-python-watsonx-processor
CaptionImage
● Python 3.10+
● Hugging Face
● Salesforce/blip-image-captioning-large
● Generate Captions for Images
● Adds captions to FlowFile Attributes
● Does not require download or copies of
your images
https://github.com/tspannhw/FLaNK-python-processors
RESNetImageClassification
● Python 3.10+
● Hugging Face
● Transformers
● PyTorch
● Datasets
● microsoft/resnet-50
● Adds classification label to FlowFile
Attributes
● Does not require download or copies of
your images
https://github.com/tspannhw/FLaNK-python-processors
NSFWImageDetection
● Python 3.10+
● Hugging Face
● Transformers
● Falconsai/nsfw_image_detection
● Adds normal and nsfw to FlowFile
Attributes
● Gives score on safety of image
● Does not require download or copies of
your images
https://github.com/tspannhw/FLaNK-python-processors
FacialEmotionsImageDetection
● Python 3.10+
● Hugging Face
● Transformers
● facial_emotions_image_detection
● Image Classification
● Adds labels/scores to FlowFile Attributes
● Does not require download or copies of
your images
https://github.com/tspannhw/FLaNK-python-processors
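The image processors above all end by adding labels and scores to FlowFile attributes. A minimal sketch of that last step follows, assuming the classifier returns a list of `{"label": ..., "score": ...}` dictionaries (the shape produced by Hugging Face image-classification pipelines). The helper name `scores_to_attributes` and the attribute naming scheme are hypothetical and are not taken from the linked repositories.

```python
def scores_to_attributes(results: list[dict], prefix: str = "image") -> dict[str, str]:
    """Flatten classifier output into string attributes (hypothetical naming)."""
    attributes: dict[str, str] = {}
    for item in results:
        # Normalize the label so it is safe to use inside an attribute name.
        label = str(item["label"]).strip().lower().replace(" ", "_")
        # Attribute values are emitted as strings, so format the score.
        attributes[f"{prefix}.{label}"] = f"{float(item['score']):.4f}"
    if results:
        best = max(results, key=lambda r: r["score"])
        attributes[f"{prefix}.top_label"] = str(best["label"])
    return attributes


if __name__ == "__main__":
    sample = [
        {"label": "normal", "score": 0.9731},
        {"label": "nsfw", "score": 0.0269},
    ]
    print(scores_to_attributes(sample, prefix="nsfw"))
```

Inside a NiFi 2.0 Python processor, a dictionary like this would be handed back with the transform result so downstream processors can route on the values.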
Other Python Processors
● Chunk Document, Parse Document
● Prompt ChatGPT
● Put Chroma, Query Chroma
● Put Pinecone, Query Pinecone
DEMO
Startup Grind
AI Max
Summit -
April 12 NJ
THANK YOU
CLOUDERA STREAM PROCESSING
Two Major Capabilities: Enterprise Messaging and Powerful Stream Processing
Cloudera Streams Messaging (CSM)
Powered by Apache Kafka
Enterprise grade messaging products for Apache
Kafka. Streams Messaging Manager to
monitor/operate clusters, Streams Replication
Manager for HA/DR deployments, Schema Registry for
centralized schema management, and support for
Kafka Connect and Cruise Control
Cloudera Streaming Analytics (CSA)
Powered by Apache Flink
Powered by Apache Flink with SQL StreamBuilder, it
provides low-latency stream processing capabilities
with advanced windowing & state management made
simple with SQL
ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA
Extend streams messaging services for Schema Mgmt, Replication & Monitoring
Schema Registry
Kafka Schema Governance
Streams Replication Manager
Kafka Replication Service
for Disaster Recovery
Streams Messaging Manager
Management & Monitoring Service
for all of your Kafka clusters
ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA
Kafka Data Movement, Operations and Security Made Easier
Kafka Connect Support
Simple Data Movement
Change Data Capture Connectors
Build Custom Connectors with NiFi
Ranger Security
Improved ACL and Audit for
Kafka, KConnect and Schema
Registry
Cruise Control Support
Intelligent Rebalancing
& Self-Healing of your
Kafka Clusters
NEXT GENERATION STREAMING ANALYTICS WITH APACHE FLINK
Low latency stateful stream processing
● Flink is a distributed data processing
system ideally suited for real-time, event-driven
applications.
● Unifies stream and batch processing
● Advanced features - late arriving data,
checkpointing, event time processing,
Exactly Once Processing
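To give a feel for event time processing and late arriving data, here is a simplified plain-Python sketch of tumbling-window counts. It is not Flink code and does not use the Flink API. The function name, the `(event_time_ms, key)` event shape, and the rule that the watermark is simply the highest event time seen so far are all assumptions made for this illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_ms, allowed_lateness_ms=0):
    """Count events per (window_start, key) using event time.

    `events` is an iterable of (event_time_ms, key) in arrival order.
    Returns (counts, late), where `late` lists events whose window had
    already closed when they arrived.
    """
    counts = defaultdict(int)
    late = []
    watermark = float("-inf")  # simplified: highest event time seen so far
    for event_time, key in events:
        window_start = event_time - (event_time % window_ms)
        window_end = window_start + window_ms
        if window_end + allowed_lateness_ms <= watermark:
            # The window closed before this event arrived: treat it as late.
            late.append((event_time, key))
        else:
            counts[(window_start, key)] += 1
        watermark = max(watermark, event_time)
    return dict(counts), late


if __name__ == "__main__":
    arrivals = [(1000, "a"), (1500, "a"), (2100, "b"), (900, "a"), (1900, "a")]
    print(tumbling_window_counts(arrivals, window_ms=1000))
    print(tumbling_window_counts(arrivals, window_ms=1000, allowed_lateness_ms=500))
```

Raising `allowed_lateness_ms` lets an out-of-order event still land in its window, at the cost of holding window state longer. A real engine also has to checkpoint that state, which this sketch leaves out.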
Real-Time Insights
Event Processing
Low Latency
SQL STREAM BUILDER (SSB)
SQL STREAM BUILDER allows
developers, analysts, and data
scientists to write streaming
applications with industry
standard SQL.
No Java or Scala code
development required.
Simplifies access to data in Kafka
& Flink. Connectors to batch data in
HDFS, Kudu, Hive, S3, JDBC, CDC
and more
Enrich streaming data with batch
data in a single tool
Democratize access to real-time data with just SQL
LLMs ARE FOUNDATION MODELS
Base models that can be adapted for a wide range of use cases
Terabytes of Data (Multiple Formats) → Train → Foundation Models (Billions of Parameters) → Adapt → Question/Answering, Sentiment Analysis, Doc summarization, … ++ more
➔ Historically, data scientists trained specialized models against narrow datasets to solve specific tasks.
➔ LLMs are Foundation models that can be adapted to perform a variety of tasks.
◆ It is faster to “adapt” a foundation model than it is to train a specialized model from scratch
◆ Decouples “knowledge” from “intelligence”
◆ Opens up AI use cases to software developers (instead of just specialised data scientists)
TCFPro24 Building Real-Time Generative AI Pipelines

  • 1. © 2024 Cloudera, Inc. All rights reserved. Building Real-Time Generative AI Pipelines Tim Spann Principal Developer Advocate March 15, 2024
  • 3. FLaNK Stack Weekly by Tim Spann: This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java and Open Source friends. https://bit.ly/32dAJft https://www.meetup.com/futureofdata-princeton/
  • 4. @PaasDev https://www.meetup.com/futureofdata-princeton/ Future of Data - NYC + NJ + Philly + Virtual: From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ... https://linktr.ee/tspannhw
  • 5. Tim Spann, Principal Developer Advocate. Twitter: @PaasDev Blog: datainmotion.dev Princeton Future of Data Meetup. ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-HPE, ex-PwC, ex-EY. https://medium.com/@tspann https://github.com/tspannhw
  • 6. RAPID INNOVATION IN THE LLM SPACE: Too much to cover today, but you should know the common LLMs, Frameworks, Tools. Notable LLMs: Closed Models: GPT3.5, GPT4, Claude2. Open Models: Llama2, Code Llama, Mistral7B, Mixtral8x7B, ++ 100s more… check out the HuggingFace LLM Leaderboard (pretrained, domain fine-tuned, chat models, …). Popular LLM Frameworks: LangChain is a framework for developing apps powered by LLMs ● Python and JavaScript Libraries ● Provides modules for LLM Interface, Retrieval, & Agents. LlamaIndex is a framework designed specifically for RAG apps ● Python and JavaScript Libraries ● Provides built-in optimizations / techniques for advanced RAG. When to use one over the other? Use LangChain if you need a general-purpose framework with flexibility and extensibility. Consider LlamaIndex if you’re building a RAG-only app (retrieval/search). Open Community & Open Models: HuggingFace is an ML community for hosting & collaborating on models, datasets, and ML applications ● Latest open source LLMs are in HuggingFace ● + great learning resources / demos https://huggingface.co/ Some common Vector DBs. Open Source vs Self Hosted vs SaaS option
  • 7. ENTERPRISE WIDE USE CASES FOR AN LLM. Enterprise Knowledge Base / Chatbot / Q&A: - Customer Support & Troubleshooting - Enable open-ended conversations with user-provided prompts. Code assistant: - Provide relevant snippets of code as a response to a request written in natural language. - Assist with creating test cases and synthetic test data. - Reference other relevant data such as a company’s documentation to help provide more accurate responses. Social and emotional sensing: - Gauge emotions and opinions based on a piece of text. - Understand and deliver a more nuanced message back based on sentiment. Classification and Clustering: - Categorize and sort large volumes of data into common themes and trends to support more informed decision making. Language Translation: - Globalize your content by feeding web pages through LLMs for translation. - Combine with chatbots to provide multilingual support to your customer base. Document Summarization: - Distill large amounts of text down to the most relevant points. Content Generation: - Provide detailed and contextually relevant prompts to develop outlines, brainstorm ideas and approaches for content. Adoption depends on an enterprise’s risk tolerance, restrictions, decision rights and disclosure obligations.
  • 8. NLP / AI / LLM Generative AI
  • 9. Which Model and When? Use the right model for the right job: closed or open source ● Closed Source ● Usage can easily scale but so can your costs ● Rapidly improving AI models ● Most advanced AI models ● Excel at more specialized tasks ● Great for a wide range of tasks ● Open Source ● Better cost planning ● Compliance, privacy, and security risks ● More control over where & how models are deployed
  • 10. Adoption of Generative AI is a Journey: Identifying AI challenges in the enterprise. Challenge: Data integration barriers. What’s missing: ● Streamlined access to enterprise data. Challenge: Rigid model infrastructure. What’s missing: ● Modularity ● Flexibility ● AI Ops. Challenge: Lack of security and transparency. What’s missing: ● Model control ● Built-in security ● Visibility & governance
  • 11. Data = Organization Context: Your data enables contextually accurate responses from LLMs. Without your data: User Query → Large Language Model → Contextually Inaccurate Response. With your data: User Query + Data (Organization Context) → Large Language Model → Contextually Accurate Response
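
The "query plus organization context" idea on this slide can be sketched as a small prompt builder. This is a minimal illustration, not the speaker's implementation: `build_prompt`, its `max_chars` budget and the instruction wording are all hypothetical, and the retrieved chunks are assumed to arrive already ranked by relevance.

```python
def build_prompt(question, context_chunks, max_chars=2000):
    """Assemble a RAG-style prompt: organization context first, then the question.

    Hypothetical helper: chunks are assumed to be pre-ranked, and we stop
    adding them once the (illustrative) character budget would be exceeded.
    """
    selected, used = [], 0
    for chunk in context_chunks:
        if used + len(chunk) > max_chars:
            break
        selected.append(chunk)
        used += len(chunk)

    context = "\n".join(f"- {c}" for c in selected)
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    print(build_prompt(
        "What is the refund window?",
        ["Refunds are accepted within 30 days of purchase."],
    ))
```

The resulting string would then be sent to whichever LLM the pipeline uses; that call is deliberately left out because it depends on the provider.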
  • 12. CLOSED-SOURCE FOUNDATION MODELS MODEL HUBS OPEN SOURCE FOUNDATION MODELS FINE-TUNED MODELS PRIVATE VECTOR STORE MANAGED VECTOR STORE CLOUD INFRASTRUCTURE Milvus, Solr* Meta (Llama 2) Applied Machine Learning Prototypes (AMPs) Hugging Face Pinecone SPECIALIZED HARDWARE APIs: OpenAI (GPT-4 Turbo) Amazon Bedrock: Anthropic (Claude 2), Cohere… DATA WRANGLING REAL-TIME DATA INGEST & ROUTING AI MODEL TRAINING & INFERENCE DATA STORE & VISUALIZATION Open Data Lakehouse DATA WRANGLING REAL-TIME DATA INGEST & ROUTING AI MODEL TRAINING & SERVING DATA STORE & VISUALIZATION AI APPLICATIONS
  • 13. ARCHITECTURE (diagram). Sources: Live Q&A, Travel Advisories, Weather Reports, Documents, Social Media, Databases, Transactions, Public Data Feeds, S3 / Files, Logs, ATM Data, Live Chat, … Stages: INTERACT, COLLECT, STORE, ENRICH, REPORT (Distribute, Collect, Report, Visualize, Automate); AI BASED ENHANCEMENTS (Predict, Automate). Components: Data Flow, Messaging Broker, SQL Stream Builder, VECTOR DATABASE, LLM, Machine Learning, Data Warehouse, Data Visualization. Data labels: Input Sentences, Generated Text, Timestamps, Enrichments, Aggregations, Real-time alerting
  • 14. DATAFLOW / STREAMING
  • 15. LLM USE CASE: Data in Motion on Cloudera Data Platform (CDP): Capture, process & distribute any data, anywhere. Diagram labels: Unstructured file types, Structured Sources, Applications/APIs, Streams, Other enterprise data, Vector DB, AI Model, Open Data Lakehouse, Materialized Views
  • 16. NiFi 2.0.0 Features ● Python Integration ● Parameters ● JDK 21+ ● JSON Flow Serialization ● Rules Engine for Development Assistance ● Run Process Group as Stateless ● flow.json.gz https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450 https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals
  • 17. DataFlow Pipelines Can Help. External Context Ingest: Ingesting, routing, cleaning, enriching, transforming, parsing, chunking and vectorizing structured, unstructured, semi-structured and binary data and documents. Prompt Engineering: Crafting and structuring queries to optimize LLM responses. Context Retrieval: Enhancing LLMs with external context, such as Retrieval Augmented Generation (RAG). Roundtrip Interface: Acting as a Discord, REST, Kafka, SQL or Slack bot to roundtrip discussions
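
The chunking step named on this slide might look roughly like the following. It is a minimal sketch assuming fixed-size character windows with overlap; `chunk_text` and its default sizes are hypothetical, and a production flow may split on sentences or tokens instead.

```python
def chunk_text(text, chunk_size=200, overlap=40):
    """Split text into fixed-size character chunks that overlap.

    Overlap keeps a sentence that straddles a boundary visible in both
    neighbouring chunks, which tends to help retrieval later on.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for piece in chunk_text("abcdefghij", chunk_size=4, overlap=1):
        print(piece)
```

Each chunk would then be embedded and written to a vector store by a later step in the flow.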
  • 18. UNSTRUCTURED DATA WITH NIFI • Archives - tar, gzipped, zipped, … • Images - PNG, JPG, GIF, BMP, … • Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, … • Videos - MP4, Clips, Mov, YouTube URL, … • Sound - MP3, … • Social / Chat - Slack, Discord, Twitter, REST, Email, … • Identify Mime Types, Chunk Documents, Store to Vector Database • Parse Documents - HTML, Markdown, PDF, Word, Excel, Powerpoint
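
Routing files by MIME type, as listed above, can be illustrated with the Python standard library. This sketch guesses the type from the file name only (NiFi's own IdentifyMimeType processor inspects content instead); `route_for` and the route names are hypothetical.

```python
import mimetypes

# Illustrative groupings; a real flow would choose its own routes.
DOCUMENT_TYPES = {"application/pdf", "text/html", "text/plain"}
ARCHIVE_TYPES = {"application/zip", "application/x-tar", "application/gzip"}


def route_for(filename):
    """Pick a coarse route for a file based on its guessed MIME type."""
    mime_type, _encoding = mimetypes.guess_type(filename)
    if mime_type is None:
        return "unknown"
    major = mime_type.split("/", 1)[0]
    if major in ("image", "video", "audio"):
        return major
    if mime_type in DOCUMENT_TYPES:
        return "document"
    if mime_type in ARCHIVE_TYPES:
        return "archive"
    return "other"


if __name__ == "__main__":
    for name in ("report.pdf", "photo.png", "talk.mp4", "notes"):
        print(name, "->", route_for(name))
```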
  • 19. CLOUD ML/DL/AI/Vector Database Services • Cloudera ML • Amazon Polly, Translate, Textract, Transcribe, Bedrock, … • Hugging Face • IBM WatsonX.AI • Vector Stores Anywhere: Weaviate, Pinecone, Milvus, Chroma DB, Solr, …
  • 22. Python Processors
  • 23. Extract Company Names ● Python 3.10+ ● Hugging Face, NLP, spaCy, PyTorch https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
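
For readers new to NiFi 2.0 Python processors, the general shape of a FlowFileTransform processor is sketched below. This is not the code from the linked repository: the class `NaiveCompanyNames`, the attribute names and the regex are hypothetical, and the regex is a deliberately naive stand-in for the Hugging Face / spaCy entity recognition the slide mentions. The `try`/`except` stubs exist only so the sketch can be exercised outside NiFi.

```python
import json
import re

try:
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    # Stand-ins for running outside NiFi; inside NiFi the real classes are used.
    class FlowFileTransform:
        def __init__(self, **kwargs):
            pass

    class FlowFileTransformResult:
        def __init__(self, relationship, contents=None, attributes=None):
            self.relationship = relationship
            self.contents = contents
            self.attributes = attributes

# Naive pattern: capitalized words followed by a common company suffix.
COMPANY_PATTERN = re.compile(r"\b((?:[A-Z][\w&-]*\s)+(?:Inc|Corp|LLC|Ltd))\b")


def find_company_names(text):
    """Return company-like names found by the naive pattern above."""
    return COMPANY_PATTERN.findall(text)


class NaiveCompanyNames(FlowFileTransform):
    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        description = "Adds naively detected company names as FlowFile attributes."

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        text = flowfile.getContentsAsBytes().decode("utf-8", errors="replace")
        names = find_company_names(text)
        # Attribute values are strings, so the list is JSON-encoded.
        return FlowFileTransformResult(
            relationship="success",
            attributes={
                "company.names": json.dumps(names),
                "company.count": str(len(names)),
            },
        )
```

Swapping `find_company_names` for a real NER model would leave the processor wrapper unchanged, which is the main point of the sketch.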
  • 24. WatsonX SDK To Foundation ● Python 3.10+ ● LLM ● WatsonX.AI Foundation Models ● Inference ● Secure ● Official SDK from IBM https://github.com/tspannhw/FLaNK-python-watsonx-processor
  • 25. CaptionImage ● Python 3.10+ ● Hugging Face ● Salesforce/blip-image-captioning-large ● Generate Captions for Images ● Adds captions to FlowFile Attributes ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
  • 26. RESNetImageClassification ● Python 3.10+ ● Hugging Face ● Transformers ● PyTorch ● Datasets ● microsoft/resnet-50 ● Adds classification label to FlowFile Attributes ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
  • 27. NSFWImageDetection ● Python 3.10+ ● Hugging Face ● Transformers ● Falconsai/nsfw_image_detection ● Adds normal and nsfw to FlowFile Attributes ● Gives score on safety of image ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
  • 28. FacialEmotionsImageDetection ● Python 3.10+ ● Hugging Face ● Transformers ● facial_emotions_image_detection ● Image Classification ● Adds labels/scores to FlowFile Attributes ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
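
The image processors above all end the same way: classifier labels and scores become FlowFile attributes. A minimal sketch of that last step follows, assuming predictions shaped like the output of a Hugging Face image-classification pipeline (a list of dicts with `label` and `score`). The function name and the attribute naming scheme are hypothetical, not taken from the repository.

```python
def scores_to_attributes(predictions, prefix="image", top_k=3):
    """Flatten classifier predictions into string-valued attributes.

    Attribute values are kept as strings, so scores are formatted to a
    fixed number of decimals rather than stored as floats.
    """
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)[:top_k]
    attributes = {}
    for rank, pred in enumerate(ranked, start=1):
        attributes[f"{prefix}.label.{rank}"] = str(pred["label"])
        attributes[f"{prefix}.score.{rank}"] = f"{pred['score']:.4f}"
    return attributes


if __name__ == "__main__":
    sample = [{"label": "nsfw", "score": 0.02}, {"label": "normal", "score": 0.98}]
    print(scores_to_attributes(sample))
```

Downstream processors can then route on these attributes, for example sending anything whose top label is `nsfw` to a review queue.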
  • 29. Other Python Processors ● Chunk Document, Parse Document ● Prompt ChatGPT ● Put Chroma, Query Chroma ● Put Pinecone, Query Pinecone
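
What the "Put" and "Query" vector-store processors do conceptually can be shown with a tiny in-memory stand-in. This is not the Chroma or Pinecone API: `TinyVectorStore` is hypothetical, uses brute-force cosine similarity, and expects embeddings to be supplied by the caller.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """In-memory illustration of put/query; real stores index for speed."""

    def __init__(self):
        self._rows = []

    def put(self, doc_id, vector, text):
        self._rows.append((doc_id, list(vector), text))

    def query(self, vector, top_k=2):
        scored = [(cosine(vector, v), doc_id, text) for doc_id, v, text in self._rows]
        scored.sort(key=lambda row: row[0], reverse=True)
        return scored[:top_k]


if __name__ == "__main__":
    store = TinyVectorStore()
    store.put("a", [1.0, 0.0], "NiFi ingests documents")
    store.put("b", [0.0, 1.0], "Flink aggregates streams")
    store.put("c", [0.9, 0.1], "NiFi chunks documents")
    for score, doc_id, text in store.query([1.0, 0.0]):
        print(doc_id, round(score, 3), text)
```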
  • 33. CLOUDERA STREAM PROCESSING: Two Major Capabilities: Enterprise Messaging and Powerful Stream Processing. Cloudera Streams Messaging (CSM), Powered by Apache Kafka: Enterprise-grade messaging products for Apache Kafka. Streams Messaging Manager to monitor/operate clusters, Streams Replication Manager for HA/DR deployments, Schema Registry for centralized schema management, and support for Kafka Connect and Cruise Control. Cloudera Streaming Analytics (CSA), Powered by Apache Flink: Powered by Apache Flink with SQL Stream Builder, it provides low-latency stream processing capabilities with advanced windowing & state management made simple with SQL
  • 34. ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA: Extend streams messaging services for Schema Mgmt, Replication & Monitoring. Schema Registry: Kafka Schema Governance. Streams Replication Manager: Kafka Replication Service for Disaster Recovery. Streams Messaging Manager: Management & Monitoring Service for all of your Kafka clusters
  • 35. ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA: Kafka Data Movement, Operations and Security Made Easier. Kafka Connect Support: Simple Data Movement, Change Data Capture Connectors, Build Custom Connectors with NiFi. Ranger Security: Improved ACL and Audit for Kafka, KConnect and Schema Registry. Cruise Control Support: Intelligent Rebalancing & Self-Healing of your Kafka Clusters
  • 37. NEXT GENERATION STREAMING ANALYTICS WITH APACHE FLINK: Low latency stateful stream processing ● Flink is a distributed data processing system ideally suited for real-time, event-driven applications. ● Unifies stream and batch processing ● Advanced features - late arriving data, checkpointing, event time processing, Exactly Once Processing. Real-Time Insights, Event Processing, Low Latency
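
Event-time windowing and late-arriving data, two of the features named above, can be illustrated without Flink. The sketch below is conceptual plain Python, not the Flink API: `tumbling_window_counts` is hypothetical, and its watermark is simply the largest event time seen so far, which is a simplification of how Flink watermarks are usually generated.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_ms, allowed_lateness_ms=0):
    """Count events per (window_start, key) using event time.

    events: iterable of (event_time_ms, key) in arrival order.
    An event is dropped as late when its window end plus the allowed
    lateness is already at or behind the simplified watermark.
    """
    counts = defaultdict(int)
    dropped = []
    watermark = float("-inf")
    for event_time, key in events:
        watermark = max(watermark, event_time)
        window_start = event_time - (event_time % window_ms)
        window_end = window_start + window_ms
        if window_end + allowed_lateness_ms <= watermark:
            dropped.append((event_time, key))
            continue
        counts[(window_start, key)] += 1
    return dict(counts), dropped


if __name__ == "__main__":
    arrivals = [(100, "a"), (900, "a"), (1200, "a"), (950, "a"), (2500, "b")]
    print(tumbling_window_counts(arrivals, window_ms=1000))
    print(tumbling_window_counts(arrivals, window_ms=1000, allowed_lateness_ms=500))
```

In the sample data the event at 950 ms arrives after one at 1200 ms, so it is dropped with no allowed lateness and counted when 500 ms of lateness is tolerated.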
  • 38. SQL STREAM BUILDER (SSB): SQL STREAM BUILDER allows developers, analysts, and data scientists to write streaming applications with industry-standard SQL. No Java or Scala code development required. Simplifies access to data in Kafka & Flink. Connectors to batch data in HDFS, Kudu, Hive, S3, JDBC, CDC and more. Enrich streaming data with batch data in a single tool. Democratize access to real-time data with just SQL
  • 39. LLMs ARE FOUNDATION MODELS: Base models that can be adapted for a wide range of use cases. Terabytes of Data (Multiple Formats) → Train → Foundation Models (Billions of Parameters) → Adapt → Question/Answering, Sentiment Analysis, Doc summarization, … ++ more ➔ Historically, data scientists trained specialized models against narrow datasets to solve specific tasks. ➔ LLMs are foundation models that can be adapted to perform a variety of tasks. ◆ It is faster to “adapt” a foundation model than it is to train a specialized model from scratch ◆ Decouples “knowledge” from “intelligence” ◆ Opens up AI use cases to software developers (instead of just specialised data scientists)