SlideShare a Scribd company logo
1 of 46
Download to read offline
© 2024 Cloudera, Inc. All rights reserved.
Adding Generative AI to Real-Time
Streaming Pipelines
Tim Spann
Principal Developer Advocate
May 1, 2024
© 2024 Cloudera, Inc. All rights reserved.
© 2024 Cloudera, Inc. All rights reserved. 3
TIM SPANN
Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate. Field Engineer.
Princeton/NYC Future of Data Meetups.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw
© 2024 Cloudera, Inc. All rights reserved. 4
This week in Apache NiFi, Apache Flink,
Apache Kafka, ML, AI, Apache Spark, Apache
Iceberg, Python, Java, LLM, GenAI, Vector
DB and Open Source friends.
https://bit.ly/32dAJft
https://www.meetup.com/futureofdata-
princeton/
FLaNK Stack Weekly by Tim Spann
© 2024 Cloudera, Inc. All rights reserved. 5
Confidential—Restricted
@PaasDev
https://www.meetup.com/futureofdata-princeton/
From Big Data to AI to Streaming to Containers to
Cloud to Analytics to Cloud Storage to Fast Data to
Machine Learning to Microservices to ...
Future of Data - NYC + NJ + Philly + Virtual
© 2024 Cloudera, Inc. All rights reserved. 6
Open Source Vector DBs Open Community & Open Models
RAPID INNOVATION IN THE LLM SPACE
Too much to cover today.. but you should know the common LLMs, Frameworks, Tools
Notable LLMs
Closed Models Open Models
GPT3.5
GPT4
Llama2
Mistral7B
Mixtral8x7B
Claude2
++ 100s more… check out the HuggingFace LLM
Leaderboard (pretrained, domain fine-tuned, chat models, …)
Code Llama
Popular LLM Frameworks
When to use one over the other? Use Langchain if you need a general-purpose framework with
flexibility and extensibility. Consider LlamaIndex if you’re building a RAG only app (retrieval/search)
Langchain is a framework for
developing apps powered by LLMs
● Python and JavaScript Libraries
● Provides modules for LLM
Interface, Retrieval, & Agents
LLamaIndex is a framework designed
specifically for RAG apps
● Python and JavaScript Libraries
● Provides built in optimizations /
techniques for advanced RAG
HuggingFace is an ML community for hosting &
collaborating on models, datasets, and ML applications
● Latest open source LLMs are in HuggingFace
● + great learning resources / demos
https://huggingface.co/
© 2024 Cloudera, Inc. All rights reserved. 7
Enterprise Knowledge Base / Chatbot / Q&A
- Customer Support & Troubleshooting
- Enable open ended conversations with user
provided prompts
Code assistant:
- Provide relevant snippets of code as a
response to a request written in natural
language.
- Assist with creating test cases and
synthetic test data.
- Reference other relevant data such as a
company’s documentation to help provide
more accurate responses.
Social and emotional sensing
- Gauge emotions and opinions based on a
piece of text.
- Understand and deliver a more nuanced
message back based on sentiment.
ENTERPRISE WIDE USE CASES FOR AN LLM
Classification and Clustering
- Categorize and sort large volumes of data
into common themes and trends to support
more informed decision making.
Language Translation
- Globalize your content by feeding web
pages through LLMs for translation.
- Combine with chatbots to provide
multilingual support to your customer base.
Document Summarization
- Distill large amounts of text down to the
most relevant points.
Content Generation
- Provide detailed and contextually relevant
prompts to develop outlines, brainstorm
ideas and approaches for content.
L
Adoption dependent upon an Enterprise’s risk tolerance, restrictions, decision rights and disclosure obligations.
© 2024 Cloudera, Inc. All rights reserved. 8
WHICH MODEL AND WHEN?
Use the right model for right job: closed or open-source
Closed
Source
Usage can
easily scale
but so can
your costs
Rapidly
improving
AI models
Most
advanced
AI models
Excel at more
specialized
tasks
Great for a
wide range
of tasks
Open
Source
Better cost
planning
Compliance,
privacy, and
security risks
More control
over where &
how models
are deployed
© 2024 Cloudera, Inc. All rights reserved. 9
APPLICATIONS
CLOSED-SOURCE
FOUNDATION MODELS
MODEL HUBS
OPEN SOURCE
FOUNDATION MODELS
FINE-TUNED MODELS
PRIVATE
VECTOR STORE
MANAGED
VECTOR STORE
CLOUD INFRASTRUCTURE
Milvus
Meta (Llama 2)
Applied Machine Learning Prototypes (AMPs)
Hugging Face
SPECIALIZED HARDWARE
APIs: OpenAI (GPT-4 Turbo)
Amazon Bedrock: Anthropic (Claude 2), Cohere…
DATA
WRANGLING
REAL-TIME
DATA INGEST
& ROUTING
AI MODEL
TRAINING &
INFERENCE
DATA STORE &
VISUALIZATION
Open Data Lakehouse
DATA
WRANGLING
REAL-TIME
DATA INGEST
& ROUTING
AI MODEL
TRAINING &
SERVING
DATA STORE &
VISUALIZATION
© 2019 Cloudera, Inc. All rights reserved. 10
CLOUDERA + LLMS
Knowledge Repository
Data Storage / Management
Data Preparation
Data Engineering
LLM Fine Tuning Process
Training Framework
LLM Serving
Serving Framework
Key:
CPU Task
GPU Task
CML
CDE
CDP
Vector DB
CDF
Streaming Classification
Real-Time Model Deployment
© 2024 Cloudera, Inc. All rights reserved. 11
EVOLUTION OF LLM APP ARCHITECTURES
RAG, MLOps, and Cache
L5
App Text Embed.
MLOps
Cache
Vector
DB
AuthR
Base
LLM
P
EFT
RAG and MLOps
L4
App Text Embed.
MLOps
Vector
DB
AuthR
Base
LLM
P
EFT
Governed RAG
L3
App Text Embed.
Vector
DB
Base
LLM
AuthR
Retrieval Augmented (RAG)
L2
App Text Embed.
Vector
DB
Base
LLM
Direct-to-LLM
L1
App
Base
LLM
Choose a cost-effective platform that provides all of the necessary components and
integrations
© 2024 Cloudera, Inc. All rights reserved. 12
ML OPS IN CLOUDERA MACHINE LEARNING
Enabling Production ML At Scale
MODEL & PREDICTION
MONITORING
● UUID for each prediction
● Analyze metrics
granularly to the feature
level
● Ground truth to
production
environments
MODEL
DEPLOYMENT
& HA SERVING
● One-click deployment of
models
● Robust and HA model
serving infrastructure
SHARED DATA
EXPERIENCE FOR
MODELS
● Automatic model
cataloging & lineage
● Governed and secure
production workflows
DISTRIBUTED AI
COMPUTE
● Cutting Edge
frameworks
● Support for Ray, Dask,
Modin…
© 2023 Cloudera, Inc. All rights reserved. 13
Live Q&A
Travel Advisories
Weather Reports
Documents
Social Media
Databases
Transactions
Public Data Feeds
S3 / Files
Logs
ATM Data
Live Chat
…
HYBRID CLOUD
INTERACT
COLLECT STORE
ENRICH, REPORT
Distribute
Collect
Report
REPORT
Visualize
Report, Automate
AI BASED ENHANCEMENTS
Predict, Automate
VECTOR DATABASE
LLM
Machine
Learning
Data
Visualization
Data Flow
Data
Warehouse
SQL
Stream Builder
Data
Visualization
Input Sentences
Generated Text
Timestamp
Input Sentence
Timestamps
Enrichments
Messaging
Broker
Real-time alerting
Real-time alerting
Aggregations
14
© 2023 Cloudera, Inc. All rights reserved.
REAL-TIME CONTEXT FOR GEN AI
Classic RAG Architecture
Contextualized prompt
Knowledge-base Context
(continuously updated)
User LLM Hallucinated
Response
RAG Optimized
Response
Initial Prompt LLM
Raw Prompt
● Performance
● Factuality
● Interpretability
● Domain
knowledge
● Efficiency
Vector DB
(continuously updated)
Retrieve
Augment
Vectorize
© 2024 Cloudera, Inc. All rights reserved.
NLP / AI / LLM
Generative AI
© 2024 Cloudera, Inc. All rights reserved. 16
DataFlow Pipelines Can Help
External Context Ingest
Ingesting, routing, clean, enrich, transforming,
parsing, chunking and vectorizing structured,
unstructured, semistructured, binary data and
documents
Prompt engineering
Crafting and structuring queries to optimize
LLM responses
Context Retrieval
Enhancing LLM with external context such as
Retrieval Augmented Generation (RAG)
Roundtrip Interface
Act as a Discord, REST, Kafka, SQL, Slack bot to
roundtrip discussions
© 2019 Cloudera, Inc. All rights reserved. 17
UNSTRUCTURED DATA WITH NIFI
• Archives - tar, gzipped, zipped, …
• Images - PNG, JPG, GIF, BMP, …
• Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, …
• Videos - MP4, Clips, Mov, Youtube URL…
• Sound - MP3, …
• Social / Chat - Slack, Discord, Twitter, REST, Email, …
• Identify Mime Types, Chunk Documents, Store to Vector Database
• Parse Documents - HTML, Markdown, PDF, Word, Excel, Powerpoint
© 2019 Cloudera, Inc. All rights reserved. 18
CLOUD ML/DL/AI/Vector Database Services
• Cloudera ML
• Amazon Polly, Translate, Textract, Transcribe, Bedrock, …
• Hugging Face
• IBM Watson X.AI
• Vector Stores Anywhere: Milvus, …
© 2024 Cloudera, Inc. All rights reserved.
https://medium.com/@tspann/building-a-milvus-connector-for-nifi-34372cb
3c7fa
CLOUDERA DATAFLOW
© 2024 Cloudera, Inc. All rights reserved. 21
CAPTURE ALL DATA
Vector DB
AI Model
Unstructured file types DataFlow
450 + Processors
Available
Knowledge stores and
other enterprise data
Materialized Views
Structured Sources
Applications/API’s
Streams
Edge devices and Streams
DataFlow Has a Vast Library of Processors to connect to ANYTHING
Chat Bots
© 2019 Cloudera, Inc. All rights reserved. 22
CAPTURE ALL DATA
Multimodal data
Structured Data
Traditional ML pipelines:
● Structured
● Batch/Microbatch
● Highly engineered
feature sets
● Clearly labeled data
● Metrics, KPI’s, etc
Gen AI pipelines:
● Multimodal
● Real-time/Streaming
● Parsing, chunking
required
● ELT tools poor suited
DataFlow is built for MultiModal Data
© 2023 Cloudera, Inc. All rights reserved. 23
PROVENANCE
https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
NiFi 2.0.0 Features
● Python Integration
● Parameters
● JDK 21+
● JSON Flow Serialization
● Rules Engine for Development
Assistance
● Run Process Group as Stateless
● flow.json.gz
https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals
© 2023 Cloudera, Inc. All rights reserved. 25
FLINK SQL -> NIFI -> HUGGING FACE GOOGLE GEMINI
© 2023 Cloudera, Inc. All rights reserved. 26
SSB UDF JS/JAVA + GenAI = Real-Time GenAI SQL
https://medium.com/cloudera-inc/adding-generative-ai-results-to-sql-streams-513e1fd2a6af
SELECT CALLLLM(CAST(messagetext as
STRING)) as generatedtext,
messagerealname, messageusername,
messagetext,messageusertz,
messageid, threadts, ts
FROM flankslackmessages
WHERE messagetype = 'message'
© 2024 Cloudera, Inc. All rights reserved. 27
Python Processors
Generate Synthetic Records w/
Faker
● Python 3.10+
● faker
● Choose as many as you want
● Attribute output
Download a Wiki Page as
HTML or WikiFormat (Text)
● Python 3.10+
● Wikipedia-api
● HTML or Text
● Choose your wiki page dynamically
Extract Company Names
● Python 3.10+
● Hugging Face, NLP, SpaCY, PyTorch
https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
CaptionImage
● Python 3.10+
● Hugging Face
● Salesforce/blip-image-captioning-large
● Generate Captions for Images
● Adds captions to FlowFile Attributes
● Does not require download or copies of
your images
https://github.com/tspannhw/FLaNK-python-processors
RESNetImageClassification
● Python 3.10+
● Hugging Face
● Transformers
● Pytorch
● Datasets
● microsoft/resnet-50
● Adds classification label to FlowFile
Attributes
● Does not require download or copies of
your images
https://github.com/tspannhw/FLaNK-python-processors
NSFWImageDetection
● Python 3.10+
● Hugging Face
● Transformers
● Falconsai/nsfw_image_detection
● Adds normal and nsfw to FlowFile
Attributes
● Gives score on safety of image
● Does not require download or copies of
your images
https://github.com/tspannhw/FLaNK-python-processors
FacialEmotionsImageDetection
● Python 3.10+
● Hugging Face
● Transformers
● facial_emotions_image_detection
● Image Classification
● Adds labels/scores to FlowFile Attributes
● Does not require download or copies of
your images
https://github.com/tspannhw/FLaNK-python-processors
Extract Entities
● Python 3.10+
● NLP, SpaCY
● Extract locations
● Extract organizations
● Extract money
● Extract time
● Extract events
● Extract countries
● Extract objects, food, people, quantities
https://github.com/tspannhw/FLaNK-python-processors/blob/main/ExtractEntities.py
Parse Addresses
● Python 3.10+
● PYAP Library
● Simple Library if your text includes an
address
● Address Parsing
● Address Detecting
● MIT Licensed
● Looking at other libraries, GenAI, DL, ML
https://github.com/tspannhw/FLaNK-python-processors
Address To Lat/Long
● Python 3.10+
● geopy Library
● Nominatim
● OpenStreetMaps (OSM)
● openstreetmap.org/copyright
● Returns as attributes and JSON file
● Works with partial addresses
● Categorizes location
● Bounding Box
https://github.com/tspannhw/FLaNKAI-Boston
DEMO
DEMO #1 - Cloudera Machine Learning - AMPs
DEMO #2 - Cloudera DataFlow - Milvus
REFERENCES
https://github.com/cloudera/CML_AMP_LLM_Chatbot_Augmented_with_Enterprise_Data
https://medium.com/@tspann/building-a-milvus-connector-for-nifi-34372cb3c7fa
CSP Community Edition
● Docker compose file of CSP to run from command line w/o any
dependencies, including Flink, SQL Stream Builder, Kafka, Kafka
Connect, Streams Messaging Manager and Schema Registry.
○ $>docker compose up
● Licensed under the Cloudera Community License
● Unsupported Commercially (Community Help - Ask Tim)
● Community Group Hub for CSP
● Find it on docs.cloudera.com (see QR Code)
● Kafka, Kafka Connect, SMM, SR, Flink, Flink SQL, MV, Postgresql, SSB
● Develop apps locally
45
TH N Y U
46

More Related Content

Similar to Generative AI on Enterprise Cloud with NiFi and Milvus

GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023
Timothy Spann
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
Timothy Spann
 
ITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming AppsITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming Apps
Timothy Spann
 
Greg Dixon - 2011 ScanSource POS & Barcoding Partner Conference
Greg Dixon - 2011 ScanSource POS & Barcoding Partner ConferenceGreg Dixon - 2011 ScanSource POS & Barcoding Partner Conference
Greg Dixon - 2011 ScanSource POS & Barcoding Partner Conference
ScanSource, Inc.
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Timothy Spann
 

Similar to Generative AI on Enterprise Cloud with NiFi and Milvus (20)

Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
Live Demo Jam Expands: The Leading-Edge Streaming Data Platform with NiFi, Ka...
 
Building Real-Time Travel Alerts
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel Alerts
 
GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023
 
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
2024 February 28 - NYC - Meetup Unlocking Financial Data with Real-Time Pipel...
 
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
Conf42Python -Using Apache NiFi, Apache Kafka, RisingWave, and Apache Iceberg...
 
Machine Learning in the Enterprise 2019
Machine Learning in the Enterprise 2019   Machine Learning in the Enterprise 2019
Machine Learning in the Enterprise 2019
 
Trafodion – an enterprise class sql based on hadoop
Trafodion – an enterprise class sql based on hadoopTrafodion – an enterprise class sql based on hadoop
Trafodion – an enterprise class sql based on hadoop
 
ITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming AppsITPC Building Modern Data Streaming Apps
ITPC Building Modern Data Streaming Apps
 
Introduction to pyspark new
Introduction to pyspark newIntroduction to pyspark new
Introduction to pyspark new
 
Level Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop AccelerationLevel Up – How to Achieve Hadoop Acceleration
Level Up – How to Achieve Hadoop Acceleration
 
Greg Dixon - 2011 ScanSource POS & Barcoding Partner Conference
Greg Dixon - 2011 ScanSource POS & Barcoding Partner ConferenceGreg Dixon - 2011 ScanSource POS & Barcoding Partner Conference
Greg Dixon - 2011 ScanSource POS & Barcoding Partner Conference
 
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
Budapest Data/ML - Building Modern Data Streaming Apps with NiFi, Flink and K...
 
Enabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
 
Solving the Really Big Tech Problems with IoT
 Solving the Really Big Tech Problems with IoT Solving the Really Big Tech Problems with IoT
Solving the Really Big Tech Problems with IoT
 
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...
ODSC East 2020   Accelerate ML Lifecycle with Kubernetes and Containerized Da...ODSC East 2020   Accelerate ML Lifecycle with Kubernetes and Containerized Da...
ODSC East 2020 Accelerate ML Lifecycle with Kubernetes and Containerized Da...
 
The Modern Tech Stack: Data Analytics in the Cloud for Developers and Founders
The Modern Tech Stack: Data Analytics in the Cloud for Developers and FoundersThe Modern Tech Stack: Data Analytics in the Cloud for Developers and Founders
The Modern Tech Stack: Data Analytics in the Cloud for Developers and Founders
 
LA M and E Symposium Nov 2017 All decks.pdf
LA M and E Symposium Nov 2017 All decks.pdfLA M and E Symposium Nov 2017 All decks.pdf
LA M and E Symposium Nov 2017 All decks.pdf
 
Digital Reinvention by NRB
Digital Reinvention by NRBDigital Reinvention by NRB
Digital Reinvention by NRB
 
Top Deep Learning Frameworks.pdf
Top Deep Learning Frameworks.pdfTop Deep Learning Frameworks.pdf
Top Deep Learning Frameworks.pdf
 
Video Encoding in the Cloud A Key Strategy for 2011
Video Encoding in the Cloud A Key Strategy for 2011Video Encoding in the Cloud A Key Strategy for 2011
Video Encoding in the Cloud A Key Strategy for 2011
 

More from Timothy Spann

DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
Timothy Spann
 
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesOSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
Timothy Spann
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
Timothy Spann
 
AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101
Timothy Spann
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
Timothy Spann
 
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel OptimizationCoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
Timothy Spann
 

More from Timothy Spann (19)

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...2024 XTREMEJ_  Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
2024 XTREMEJ_ Building Real-time Pipelines with FLaNK_ A Case Study with Tra...
 
2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits2024 Build Generative AI for Non-Profits
2024 Build Generative AI for Non-Profits
 
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and FlinkDBA Fundamentals Group: Continuous SQL with Kafka and Flink
DBA Fundamentals Group: Continuous SQL with Kafka and Flink
 
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
NY Open Source Data Meetup Feb 8 2024 Building Real-time Pipelines with FLaNK...
 
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time PipelinesOSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
OSACon 2023_ Unlocking Financial Data with Real-Time Pipelines
 
JConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and FlinkJConWorld_ Continuous SQL with Kafka and Flink
JConWorld_ Continuous SQL with Kafka and Flink
 
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
[EN]DSS23_tspann_Integrating LLM with Streaming Data Pipelines
 
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines DemoEvolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
Evolve 2023 NYC - Integrating AI Into Realtime Data Pipelines Demo
 
AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101AIDevWorldApacheNiFi101
AIDevWorldApacheNiFi101
 
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
26Oct2023_Adding Generative AI to Real-Time Streaming Pipelines_ NYC Meetup
 
CoC23_ Looking at the New Features of Apache NiFi
CoC23_ Looking at the New Features of Apache NiFiCoC23_ Looking at the New Features of Apache NiFi
CoC23_ Looking at the New Features of Apache NiFi
 
CoC23_ Let’s Monitor The Conditions at the Conference
CoC23_ Let’s Monitor The Conditions at the ConferenceCoC23_ Let’s Monitor The Conditions at the Conference
CoC23_ Let’s Monitor The Conditions at the Conference
 
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdfOSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
OSSFinance_UnlockingFinancialDatawithReal-TimePipelines.pdf
 
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel OptimizationCoC23_Utilizing Real-Time Transit Data for Travel Optimization
CoC23_Utilizing Real-Time Transit Data for Travel Optimization
 
The Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and StreamingThe Never Landing Stream with HTAP and Streaming
The Never Landing Stream with HTAP and Streaming
 
Meetup - Brasil - Data In Motion - 2023 September 19
Meetup - Brasil - Data In Motion - 2023 September 19Meetup - Brasil - Data In Motion - 2023 September 19
Meetup - Brasil - Data In Motion - 2023 September 19
 
PartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC SolutionPartnerSkillUp_Enable a Streaming CDC Solution
PartnerSkillUp_Enable a Streaming CDC Solution
 
Implement a Universal Data Distribution Architecture to Manage All Streaming ...
Implement a Universal Data Distribution Architecture to Manage All Streaming ...Implement a Universal Data Distribution Architecture to Manage All Streaming ...
Implement a Universal Data Distribution Architecture to Manage All Streaming ...
 

Recently uploaded

Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
pwgnohujw
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
jk0tkvfv
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
yulianti213969
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
yulianti213969
 
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di Ban...
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di  Ban...obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di  Ban...
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di Ban...
siskavia95
 
原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证
pwgnohujw
 
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
aqpto5bt
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
Amil baba
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
saurabvyas476
 

Recently uploaded (20)

Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
The Significance of Transliteration Enhancing
The Significance of Transliteration EnhancingThe Significance of Transliteration Enhancing
The Significance of Transliteration Enhancing
 
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
原件一样(UWO毕业证书)西安大略大学毕业证成绩单留信学历认证
 
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
如何办理(UCLA毕业证书)加州大学洛杉矶分校毕业证成绩单学位证留信学历认证原件一样
 
Formulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdfFormulas dax para power bI de microsoft.pdf
Formulas dax para power bI de microsoft.pdf
 
Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"Aggregations - The Elasticsearch "GROUP BY"
Aggregations - The Elasticsearch "GROUP BY"
 
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
obat aborsi Tarakan wa 081336238223 jual obat aborsi cytotec asli di Tarakan9...
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
obat aborsi Bontang wa 081336238223 jual obat aborsi cytotec asli di Bontang6...
 
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di Ban...
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di  Ban...obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di  Ban...
obat aborsi Banjarmasin wa 082135199655 jual obat aborsi cytotec asli di Ban...
 
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital AgeCredit Card Fraud Detection: Safeguarding Transactions in the Digital Age
Credit Card Fraud Detection: Safeguarding Transactions in the Digital Age
 
原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证原件一样伦敦国王学院毕业证成绩单留信学历认证
原件一样伦敦国王学院毕业证成绩单留信学历认证
 
Bios of leading Astrologers & Researchers
Bios of leading Astrologers & ResearchersBios of leading Astrologers & Researchers
Bios of leading Astrologers & Researchers
 
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
一比一原版(ucla文凭证书)加州大学洛杉矶分校毕业证学历认证官方成绩单
 
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
Jual Obat Aborsi Bandung (Asli No.1) Wa 082134680322 Klinik Obat Penggugur Ka...
 
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
Identify Customer Segments to Create Customer Offers for Each Segment - Appli...
 
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
NO1 Best Kala Jadu Expert Specialist In Germany Kala Jadu Expert Specialist I...
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 

Generative AI on Enterprise Cloud with NiFi and Milvus

  • 1. © 2024 Cloudera, Inc. All rights reserved. Adding Generative AI to Real-Time Streaming Pipelines Tim Spann Principal Developer Advocate May 1, 2024
  • 2. © 2024 Cloudera, Inc. All rights reserved.
  • 3. © 2024 Cloudera, Inc. All rights reserved. 3 TIM SPANN Twitter: @PaasDev // Blog: datainmotion.dev Principal Developer Advocate. Field Engineer. Princeton/NYC Future of Data Meetups. ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC https://medium.com/@tspann https://github.com/tspannhw
  • 4. © 2024 Cloudera, Inc. All rights reserved. 4 This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java, LLM, GenAI, Vector DB and Open Source friends. https://bit.ly/32dAJft https://www.meetup.com/futureofdata- princeton/ FLaNK Stack Weekly by Tim Spann
  • 5. © 2024 Cloudera, Inc. All rights reserved. 5 Confidential—Restricted @PaasDev https://www.meetup.com/futureofdata-princeton/ From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ... Future of Data - NYC + NJ + Philly + Virtual
  • 6. © 2024 Cloudera, Inc. All rights reserved. 6 Open Source Vector DBs Open Community & Open Models RAPID INNOVATION IN THE LLM SPACE Too much to cover today.. but you should know the common LLMs, Frameworks, Tools Notable LLMs Closed Models Open Models GPT3.5 GPT4 Llama2 Mistral7B Mixtral8x7B Claude2 ++ 100s more… check out the HuggingFace LLM Leaderboard (pretrained, domain fine-tuned, chat models, …) Code Llama Popular LLM Frameworks When to use one over the other? Use Langchain if you need a general-purpose framework with flexibility and extensibility. Consider LlamaIndex if you’re building a RAG only app (retrieval/search) Langchain is a framework for developing apps powered by LLMs ● Python and JavaScript Libraries ● Provides modules for LLM Interface, Retrieval, & Agents LLamaIndex is a framework designed specifically for RAG apps ● Python and JavaScript Libraries ● Provides built in optimizations / techniques for advanced RAG HuggingFace is an ML community for hosting & collaborating on models, datasets, and ML applications ● Latest open source LLMs are in HuggingFace ● + great learning resources / demos https://huggingface.co/
  • 7. © 2024 Cloudera, Inc. All rights reserved. 7 Enterprise Knowledge Base / Chatbot / Q&A - Customer Support & Troubleshooting - Enable open ended conversations with user provided prompts Code assistant: - Provide relevant snippets of code as a response to a request written in natural language. - Assist with creating test cases and synthetic test data. - Reference other relevant data such as a company’s documentation to help provide more accurate responses. Social and emotional sensing - Gauge emotions and opinions based on a piece of text. - Understand and deliver a more nuanced message back based on sentiment. ENTERPRISE WIDE USE CASES FOR AN LLM Classification and Clustering - Categorize and sort large volumes of data into common themes and trends to support more informed decision making. Language Translation - Globalize your content by feeding web pages through LLMs for translation. - Combine with chatbots to provide multilingual support to your customer base. Document Summarization - Distill large amounts of text down to the most relevant points. Content Generation - Provide detailed and contextually relevant prompts to develop outlines, brainstorm ideas and approaches for content. L Adoption dependent upon an Enterprise’s risk tolerance, restrictions, decision rights and disclosure obligations.
  • 8. © 2024 Cloudera, Inc. All rights reserved. 8 WHICH MODEL AND WHEN? Use the right model for right job: closed or open-source Closed Source Usage can easily scale but so can your costs Rapidly improving AI models Most advanced AI models Excel at more specialized tasks Great for a wide range of tasks Open Source Better cost planning Compliance, privacy, and security risks More control over where & how models are deployed
  • 9. © 2024 Cloudera, Inc. All rights reserved. 9 APPLICATIONS CLOSED-SOURCE FOUNDATION MODELS MODEL HUBS OPEN SOURCE FOUNDATION MODELS FINE-TUNED MODELS PRIVATE VECTOR STORE MANAGED VECTOR STORE CLOUD INFRASTRUCTURE Milvus Meta (Llama 2) Applied Machine Learning Prototypes (AMPs) Hugging Face SPECIALIZED HARDWARE APIs: OpenAI (GPT-4 Turbo) Amazon Bedrock: Anthropic (Claude 2), Cohere… DATA WRANGLING REAL-TIME DATA INGEST & ROUTING AI MODEL TRAINING & INFERENCE DATA STORE & VISUALIZATION Open Data Lakehouse DATA WRANGLING REAL-TIME DATA INGEST & ROUTING AI MODEL TRAINING & SERVING DATA STORE & VISUALIZATION
  • 10. © 2019 Cloudera, Inc. All rights reserved. 10 CLOUDERA + LLMS Knowledge Repository Data Storage / Management Data Preparation Data Engineering LLM Fine Tuning Process Training Framework LLM Serving Serving Framework Key: CPU Task GPU Task CML CDE CDP Vector DB CDF Streaming Classification Real-Time Model Deployment
  • 11. © 2024 Cloudera, Inc. All rights reserved. 11 EVOLUTION OF LLM APP ARCHITECTURES RAG, MLOps, and Cache L5 App Text Embed. MLOps Cache Vector DB AuthR Base LLM P EFT RAG and MLOps L4 App Text Embed. MLOps Vector DB AuthR Base LLM P EFT Governed RAG L3 App Text Embed. Vector DB Base LLM AuthR Retrieval Augmented (RAG) L2 App Text Embed. Vector DB Base LLM Direct-to-LLM L1 App Base LLM Choose a cost-effective platform that provides all of the necessary components and integrations
  • 12. © 2024 Cloudera, Inc. All rights reserved. 12 ML OPS IN CLOUDERA MACHINE LEARNING Enabling Production ML At Scale MODEL & PREDICTION MONITORING ● UUID for each prediction ● Analyze metrics granularly to the feature level ● Ground truth to production environments MODEL DEPLOYMENT & HA SERVING ● One-click deployment of models ● Robust and HA model serving infrastructure SHARED DATA EXPERIENCE FOR MODELS ● Automatic model cataloging & lineage ● Governed and secure production workflows DISTRIBUTED AI COMPUTE ● Cutting Edge frameworks ● Support for Ray, Dask, Modin…
  • 13. © 2023 Cloudera, Inc. All rights reserved. 13 Live Q&A Travel Advisories Weather Reports Documents Social Media Databases Transactions Public Data Feeds S3 / Files Logs ATM Data Live Chat … HYBRID CLOUD INTERACT COLLECT STORE ENRICH, REPORT Distribute Collect Report REPORT Visualize Report, Automate AI BASED ENHANCEMENTS Predict, Automate VECTOR DATABASE LLM Machine Learning Data Visualization Data Flow Data Warehouse SQL Stream Builder Data Visualization Input Sentences Generated Text Timestamp Input Sentence Timestamps Enrichments Messaging Broker Real-time alerting Real-time alerting Aggregations
  • 14. 14 © 2023 Cloudera, Inc. All rights reserved. REAL-TIME CONTEXT FOR GEN AI Classic RAG Architecture Contextualized prompt Knowledge-base Context (continuously updated) User LLM Hallucinated Response RAG Optimized Response Initial Prompt LLM Raw Prompt ● Performance ● Factuality ● Interpretability ● Domain knowledge ● Efficiency Vector DB (continuously updated) Retrieve Augment Vectorize
  • 15. © 2024 Cloudera, Inc. All rights reserved. NLP / AI / LLM Generative AI
  • 16. © 2024 Cloudera, Inc. All rights reserved. 16 DataFlow Pipelines Can Help External Context Ingest Ingesting, routing, clean, enrich, transforming, parsing, chunking and vectorizing structured, unstructured, semistructured, binary data and documents Prompt engineering Crafting and structuring queries to optimize LLM responses Context Retrieval Enhancing LLM with external context such as Retrieval Augmented Generation (RAG) Roundtrip Interface Act as a Discord, REST, Kafka, SQL, Slack bot to roundtrip discussions
  • 17. © 2019 Cloudera, Inc. All rights reserved. 17 UNSTRUCTURED DATA WITH NIFI • Archives - tar, gzipped, zipped, … • Images - PNG, JPG, GIF, BMP, … • Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, … • Videos - MP4, Clips, Mov, Youtube URL… • Sound - MP3, … • Social / Chat - Slack, Discord, Twitter, REST, Email, … • Identify Mime Types, Chunk Documents, Store to Vector Database • Parse Documents - HTML, Markdown, PDF, Word, Excel, Powerpoint
  • 18. © 2019 Cloudera, Inc. All rights reserved. 18 CLOUD ML/DL/AI/Vector Database Services • Cloudera ML • Amazon Polly, Translate, Textract, Transcribe, Bedrock, … • Hugging Face • IBM Watson X.AI • Vector Stores Anywhere: Milvus, …
  • 19. © 2024 Cloudera, Inc. All rights reserved. https://medium.com/@tspann/building-a-milvus-connector-for-nifi-34372cb 3c7fa
  • 21. © 2024 Cloudera, Inc. All rights reserved. 21 CAPTURE ALL DATA Vector DB AI Model Unstructured file types DataFlow 450 + Processors Available Knowledge stores and other enterprise data Materialized Views Structured Sources Applications/API’s Streams Edge devices and Streams DataFlow Has a Vast Library of Processors to connect to ANYTHING Chat Bots
  • 22. © 2019 Cloudera, Inc. All rights reserved. 22 CAPTURE ALL DATA Multimodal data Structured Data Traditional ML pipelines: ● Structured ● Batch/Microbatch ● Highly engineered feature sets ● Clearly labeled data ● Metrics, KPI’s, etc Gen AI pipelines: ● Multimodal ● Real-time/Streaming ● Parsing, chunking required ● ELT tools poor suited DataFlow is built for MultiModal Data
  • 23. © 2023 Cloudera, Inc. All rights reserved. 23 PROVENANCE
  • 24. https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450 NiFi 2.0.0 Features ● Python Integration ● Parameters ● JDK 21+ ● JSON Flow Serialization ● Rules Engine for Development Assistance ● Run Process Group as Stateless ● flow.json.gz https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals
  • 25. © 2023 Cloudera, Inc. All rights reserved. 25 FLINK SQL -> NIFI -> HUGGING FACE GOOGLE GEMINI
  • 26. © 2023 Cloudera, Inc. All rights reserved. 26 SSB UDF JS/JAVA + GenAI = Real-Time GenAI SQL https://medium.com/cloudera-inc/adding-generative-ai-results-to-sql-streams-513e1fd2a6af SELECT CALLLLM(CAST(messagetext as STRING)) as generatedtext, messagerealname, messageusername, messagetext,messageusertz, messageid, threadts, ts FROM flankslackmessages WHERE messagetype = 'message'
  • 27. © 2024 Cloudera, Inc. All rights reserved. 27
  • 29. Generate Synthetic Records w/ Faker ● Python 3.10+ ● faker ● Choose as many as you want ● Attribute output
  • 30. Download a Wiki Page as HTML or WikiFormat (Text) ● Python 3.10+ ● Wikipedia-api ● HTML or Text ● Choose your wiki page dynamically
  • 31. Extract Company Names ● Python 3.10+ ● Hugging Face, NLP, SpaCY, PyTorch https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
  • 32. CaptionImage ● Python 3.10+ ● Hugging Face ● Salesforce/blip-image-captioning-large ● Generate Captions for Images ● Adds captions to FlowFile Attributes ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
  • 33. RESNetImageClassification ● Python 3.10+ ● Hugging Face ● Transformers ● Pytorch ● Datasets ● microsoft/resnet-50 ● Adds classification label to FlowFile Attributes ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
  • 34. NSFWImageDetection ● Python 3.10+ ● Hugging Face ● Transformers ● Falconsai/nsfw_image_detection ● Adds normal and nsfw to FlowFile Attributes ● Gives score on safety of image ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
  • 35. FacialEmotionsImageDetection ● Python 3.10+ ● Hugging Face ● Transformers ● facial_emotions_image_detection ● Image Classification ● Adds labels/scores to FlowFile Attributes ● Does not require download or copies of your images https://github.com/tspannhw/FLaNK-python-processors
  • 36. Extract Entities ● Python 3.10+ ● NLP, SpaCY ● Extract locations ● Extract organizations ● Extract money ● Extract time ● Extract events ● Extract countries ● Extract objects, food, people, quantities https://github.com/tspannhw/FLaNK-python-processors/blob/main/ExtractEntities.py
  • 37. Parse Addresses ● Python 3.10+ ● PYAP Library ● Simple Library if your text includes an address ● Address Parsing ● Address Detecting ● MIT Licensed ● Looking at other libraries, GenAI, DL, ML https://github.com/tspannhw/FLaNK-python-processors
  • 38. Address To Lat/Long ● Python 3.10+ ● geopy Library ● Nominatim ● OpenStreetMaps (OSM) ● openstreetmap.org/copyright ● Returns as attributes and JSON file ● Works with partial addresses ● Categorizes location ● Bounding Box https://github.com/tspannhw/FLaNKAI-Boston
  • 39. DEMO
  • 40. DEMO #1 - Cloudera Machine Learning - AMPs
  • 41. DEMO #2 - Cloudera DataFlow - Milvus
  • 43. CSP Community Edition ● Docker compose file of CSP to run from command line w/o any dependencies, including Flink, SQL Stream Builder, Kafka, Kafka Connect, Streams Messaging Manager and Schema Registry. ○ $>docker compose up ● Licensed under the Cloudera Community License ● Unsupported Commercially (Community Help - Ask Tim) ● Community Group Hub for CSP ● Find it on docs.cloudera.com (see QR Code) ● Kafka, Kafka Connect, SMM, SR, Flink, Flink SQL, MV, Postgresql, SSB ● Develop apps locally
  • 44.
  • 46. 46