© 2023 Cloudera, Inc. All rights reserved.
Adding Generative AI to Real-Time Streaming Pipelines
Tim Spann
Principal Developer Advocate
April 2024
Tim Spann
Twitter: @PaasDev // Blog: datainmotion.dev
Principal Developer Advocate. Field Engineer.
Princeton/NYC Future of Data Meetups.
ex-Pivotal, ex-Hortonworks, ex-StreamNative, ex-PwC
https://medium.com/@tspann
https://github.com/tspannhw
FLaNK Stack Weekly by Tim Spann
This week in Apache NiFi, Apache Flink, Apache Kafka, ML, AI, Apache Spark, Apache Iceberg, Python, Java, LLM, GenAI, Vector DB and Open Source friends.
https://bit.ly/32dAJft
https://www.meetup.com/futureofdata-princeton/
Future of Data - NYC + NJ + Philly + Virtual
From Big Data to AI to Streaming to Containers to Cloud to Analytics to Cloud Storage to Fast Data to Machine Learning to Microservices to ...
@PaasDev
https://www.meetup.com/futureofdata-princeton/
RAPID INNOVATION IN THE LLM SPACE
Too much to cover today, but you should know the common LLMs, frameworks, and tools.
Some common Vector DBs
Open Community & Open Models
Notable LLMs
Closed Models: GPT-3.5, GPT-4, Claude 2
Open Models: Llama 2, Code Llama, Mistral 7B, Mixtral 8x7B
++ 100s more… check out the Hugging Face LLM Leaderboard (pretrained, domain fine-tuned, chat models, …)
Popular LLM Frameworks
When to use one over the other? Use LangChain if you need a general-purpose framework with flexibility and extensibility. Consider LlamaIndex if you're building a RAG-only app (retrieval/search).
LangChain is a framework for developing apps powered by LLMs
● Python and JavaScript libraries
● Provides modules for LLM interface, retrieval, and agents
LlamaIndex is a framework designed specifically for RAG apps
● Python and JavaScript libraries
● Provides built-in optimizations and techniques for advanced RAG
Hugging Face is an ML community for hosting and collaborating on models, datasets, and ML applications
● The latest open source LLMs are on Hugging Face
● Plus great learning resources and demos
https://huggingface.co/
Open Source vs Self Hosted vs SaaS option
ENTERPRISE WIDE USE CASES FOR AN LLM
Enterprise Knowledge Base / Chatbot / Q&A
- Customer Support & Troubleshooting
- Enable open-ended conversations with user-provided prompts
Code Assistant
- Provide relevant snippets of code in response to a request written in natural language.
- Assist with creating test cases and synthetic test data.
- Reference other relevant data, such as a company's documentation, to help provide more accurate responses.
Social and Emotional Sensing
- Gauge emotions and opinions based on a piece of text.
- Understand and deliver a more nuanced message back based on sentiment.
Classification and Clustering
- Categorize and sort large volumes of data into common themes and trends to support more informed decision making.
Language Translation
- Globalize your content by feeding web pages through LLMs for translation.
- Combine with chatbots to provide multilingual support to your customer base.
Document Summarization
- Distill large amounts of text down to the most relevant points.
Content Generation
- Provide detailed and contextually relevant prompts to develop outlines, brainstorm ideas and approaches for content.
Adoption depends on an enterprise's risk tolerance, restrictions, decision rights, and disclosure obligations.
Which Model and When?
Use the right model for the right job: closed or open source
Closed Source
Usage can easily scale but so can your costs
Rapidly improving AI models
Most advanced AI models
Excel at more specialized tasks
Great for a wide range of tasks
Open Source
Better cost planning
Compliance, privacy, and security risks
More control over where & how models are deployed
ECOSYSTEM PARTNERSHIPS
Best of breed capabilities for best in class Enterprise AI
RAY
COMPUTE
● Tune, manage, scale AI models and applications
● Integrated into CML Sessions
FOUNDATION
● Widest range of Foundation Models
● Serverless integration with CDP for fast time to value
PERFORMANCE
● Optimized GPU performance & accelerated data science pipelines
SEARCH
● Cloud-based semantic search made easy and at scale
● Store and manage AI representations of data in the public cloud
TOOLING
● Access to open source innovation through CML AMPs
● Embedded into CML (Model Registry & Serving)
Cloudera Generative AI Stack
APPLICATIONS: Applied Machine Learning Prototypes (AMPs)
CLOSED-SOURCE FOUNDATION MODELS: APIs: OpenAI (GPT-4 Turbo); Amazon Bedrock: Anthropic (Claude 2), Cohere…
MODEL HUBS: Hugging Face
OPEN SOURCE FOUNDATION MODELS: Meta (Llama 2)
FINE-TUNED MODELS
PRIVATE VECTOR STORE: Milvus, Solr*
MANAGED VECTOR STORE: Pinecone
CLOUD INFRASTRUCTURE
SPECIALIZED HARDWARE
Open Data Lakehouse: DATA WRANGLING, REAL-TIME DATA INGEST & ROUTING, AI MODEL TRAINING & INFERENCE / SERVING, DATA STORE & VISUALIZATION
[Architecture diagram: HYBRID CLOUD]
Sources: Live Q&A, Travel Advisories, Weather Reports, Documents, Social Media, Databases, Transactions, Public Data Feeds, S3 / Files, Logs, ATM Data, Live Chat, …
Stages: COLLECT, STORE, ENRICH, REPORT, INTERACT (Distribute, Collect, Report, Visualize, Automate, Predict)
Components: Data Flow, Messaging Broker, SQL Stream Builder, Data Warehouse, Machine Learning, Data Visualization
AI BASED ENHANCEMENTS: VECTOR DATABASE, LLM
Data shown: Input Sentences, Generated Text, Timestamps, Enrichments, Aggregations, Real-time alerting
Cloudera + LLMs
Knowledge Repository (Data Storage / Management)
Data Preparation (Data Engineering)
LLM Fine Tuning Process (Training Framework)
LLM Serving (Serving Framework)
Key: CPU Task, GPU Task
Components: CML, CDE, CDP, Vector DB, CDF
Streaming Classification
Real-Time Model Deployment
NLP / AI / LLM
Generative AI
DataFlow Pipelines Can Help
External Context Ingest
Ingesting, routing, cleaning, enriching, transforming, parsing, chunking, and vectorizing structured, unstructured, semi-structured, and binary data and documents
Prompt Engineering
Crafting and structuring queries to optimize LLM responses
Context Retrieval
Enhancing the LLM with external context, as in Retrieval Augmented Generation (RAG)
Roundtrip Interface
Acting as a Discord, REST, Kafka, SQL, or Slack bot to roundtrip discussions
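The ingest, retrieval, and prompt steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the flow used in the talk: the function names (`chunk_text`, `embed`, `retrieve`, `build_prompt`) are hypothetical, and the bag-of-words "embedding" is a stand-in for a real embedding model and vector database.

```python
import math
import re
from collections import Counter


def chunk_text(text, max_words=40, overlap=10):
    """Split text into overlapping word windows (one simple chunking strategy)."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, last_start, step)]


def embed(text):
    """Toy 'vectorizer': token counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question, chunks, top_k=2):
    """Context retrieval: return the chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question, context_chunks):
    """Prompt engineering: wrap the question with the retrieved context."""
    context = "\n".join(f"- {c}" for c in context_chunks)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical sample data, used only to exercise the sketch.
documents = [
    "NiFi ingests and routes documents into a vector database.",
    "Flink SQL runs continuous queries over Kafka topics.",
    "SEPTA publishes transit updates for Philadelphia.",
]
question = "Which tool runs SQL queries over Kafka?"
top_chunks = retrieve(question, documents, top_k=1)
prompt = build_prompt(question, top_chunks)
```

In a real flow, the resulting `prompt` string is what would be sent to the LLM, and the similarity search would be delegated to a vector store rather than computed in memory.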
UNSTRUCTURED DATA WITH NIFI
• Archives - tar, gzipped, zipped, …
• Images - PNG, JPG, GIF, BMP, …
• Documents - HTML, Markdown, RSS, PDF, Doc, RTF, Plain Text, …
• Videos - MP4, Clips, MOV, YouTube URL, …
• Sound - MP3, …
• Social / Chat - Slack, Discord, Twitter, REST, Email, …
• Identify MIME Types, Chunk Documents, Store to Vector Database
• Parse Documents - HTML, Markdown, PDF, Word, Excel, PowerPoint
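To make the "identify MIME types, then route" idea concrete, here is a minimal Python sketch. It is an assumption-laden simplification: it guesses the type from the file name with the standard library's `mimetypes` module, whereas NiFi's IdentifyMimeType processor inspects the content itself. The route names and the `route_for` helper are hypothetical.

```python
import mimetypes

# Hypothetical routing table mirroring the categories listed on the slide.
DOCUMENT_TYPES = {"application/pdf", "application/msword", "application/rtf"}
ARCHIVE_TYPES = {"application/zip", "application/x-tar"}
FAMILY_ROUTES = {"image": "images", "video": "videos", "audio": "sound", "text": "documents"}


def route_for(filename):
    """Pick a route name from the MIME type guessed for a file name."""
    mime, encoding = mimetypes.guess_type(filename)
    # A compression encoding (for example gzip) or an archive type goes to "archives".
    if encoding is not None or mime in ARCHIVE_TYPES:
        return "archives"
    if mime is None:
        return "unknown"
    if mime in DOCUMENT_TYPES:
        return "documents"
    # Fall back to the top-level family: image/*, video/*, audio/*, text/*.
    return FAMILY_ROUTES.get(mime.split("/")[0], "unknown")


for name in ["photo.png", "report.pdf", "clip.mp4", "backup.tar.gz", "README"]:
    print(name, "->", route_for(name))
```

Documents routed this way would then be parsed and chunked before being stored in a vector database.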
CLOUD ML/DL/AI/Vector Database Services
• Cloudera ML
• Amazon Polly, Translate, Textract, Transcribe, Bedrock, …
• Hugging Face
• IBM watsonx.ai
• Vector Stores Anywhere: Weaviate, Pinecone, Milvus, Chroma DB, Solr, …
https://medium.com/cloudera-inc/getting-ready-for-apache-nifi-2-0-5a5e6a67f450
NiFi 2.0.0 Features
● Python Integration
● Parameters
● JDK 21+
● JSON Flow Serialization
● Rules Engine for Development Assistance
● Run Process Group as Stateless
● flow.json.gz
https://cwiki.apache.org/confluence/display/NIFI/NiFi+2.0+Release+Goals
FLINK SQL -> CLOUDERA MACHINE LEARNING MODELS
FLINK SQL -> NIFI -> HUGGING FACE GOOGLE GEMINI
SSB UDF JS/JAVA + GenAI = Real-Time GenAI SQL
https://medium.com/cloudera-inc/adding-generative-ai-results-to-sql-streams-513e1fd2a6af
SELECT CALLLLM(CAST(messagetext AS STRING)) AS generatedtext,
       messagerealname, messageusername,
       messagetext, messageusertz,
       messageid, threadts, ts
FROM flankslackmessages
WHERE messagetype = 'message'
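For readers who think in Python rather than SQL, the query above can be read as "filter to real messages, then attach an LLM result to each row". The sketch below mirrors that logic only. It is not the SSB UDF itself (the slide says that is written in JavaScript or Java), and `fake_llm`, `enrich_messages`, and the sample rows are hypothetical stand-ins.

```python
# Column names are taken from the SELECT list in the query.
COLUMNS = (
    "messagerealname", "messageusername", "messagetext",
    "messageusertz", "messageid", "threadts", "ts",
)


def fake_llm(text):
    """Placeholder for a real model call (assumption: returns generated text)."""
    return f"[generated reply to: {text}]"


def enrich_messages(rows, call_llm=fake_llm):
    """Yield one enriched record per row whose messagetype is 'message'."""
    for row in rows:
        if row.get("messagetype") != "message":
            continue  # same effect as the WHERE clause
        enriched = {"generatedtext": call_llm(str(row["messagetext"]))}
        enriched.update({name: row.get(name) for name in COLUMNS})
        yield enriched


# Hypothetical sample rows standing in for the flankslackmessages stream.
rows = [
    {"messagetype": "message", "messagetext": "What is Apache NiFi?",
     "messageusername": "tim", "ts": "1712000000.000100"},
    {"messagetype": "channel_join", "messagetext": "joined",
     "messageusername": "bot", "ts": "1712000001.000200"},
]
results = list(enrich_messages(rows))
```

Because the function is a generator, rows are processed one at a time, which is closer in spirit to a continuous SQL query than loading a whole batch.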
https://medium.com/cloudera-inc/google-gemma-for-real-time-lightweight-open-llm-inference-88efe98e580f
Python Processors
Basics
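As a rough idea of what a basic NiFi 2.x Python processor looks like, here is a minimal sketch. It follows the `FlowFileTransform` pattern from the NiFi Python developer guide as best understood here, so verify the details against your NiFi version. The `WordCountSketch` processor and its `word.count` attribute are hypothetical, and the fallback classes exist only so the file can be exercised outside NiFi.

```python
try:
    # Available inside a NiFi 2.x Python extension environment.
    from nifiapi.flowfiletransform import FlowFileTransform, FlowFileTransformResult
except ImportError:
    # Stand-ins (assumptions for this sketch only) for running outside NiFi.
    class FlowFileTransform:
        pass

    class FlowFileTransformResult:
        def __init__(self, relationship, attributes=None, contents=None):
            self.relationship = relationship
            self.attributes = attributes
            self.contents = contents


class WordCountSketch(FlowFileTransform):
    class Java:
        implements = ["org.apache.nifi.python.processor.FlowFileTransform"]

    class ProcessorDetails:
        version = "0.0.1-SNAPSHOT"
        description = "Hypothetical example: adds a word count attribute to each FlowFile."
        tags = ["example", "text"]

    def __init__(self, **kwargs):
        pass

    def transform(self, context, flowfile):
        # Read the FlowFile content and count whitespace-separated words.
        text = flowfile.getContentsAsBytes().decode("utf-8", errors="replace")
        word_count = len(text.split())
        # Attribute values are strings; the content is left unchanged here.
        return FlowFileTransformResult(
            relationship="success",
            attributes={"word.count": str(word_count)},
        )
```

The processors on the following slides follow the same shape: read the FlowFile, call a library or model, and return the result as attributes or new content.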
Extract Text from WebVTT
● Python 3.10+
● WebVTT to Text
● Web Video Text Tracks Format Extractor
https://developer.mozilla.org/en-US/docs/Web/API/WebVTT_API
https://github.com/tspannhw/FLaNK-python-processors/blob/main/TranslateWebVTT.py
WEBVTT
1
00:00:06.066 --> 00:00:07.166
Now let's talk about
2
00:00:07.166 --> 00:00:12.033
data retrieval, views,
and materialized views.
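A minimal sketch of the WebVTT-to-text idea, using only the standard library. It is not the code from the linked `TranslateWebVTT.py`; the `webvtt_to_text` function is a hypothetical illustration that assumes cues are separated by blank lines (as the format specifies) and that cue text contains no inline tags.

```python
import re

# Matches a cue timing line such as "00:00:06.066 --> 00:00:07.166".
TIMING = re.compile(r"^(\d{2,}:)?\d{2}:\d{2}\.\d{3}\s+-->\s+(\d{2,}:)?\d{2}:\d{2}\.\d{3}")

SAMPLE = """WEBVTT

1
00:00:06.066 --> 00:00:07.166
Now let's talk about

2
00:00:07.166 --> 00:00:12.033
data retrieval, views,
and materialized views.
"""


def webvtt_to_text(vtt):
    """Return only the cue text from a WebVTT document, joined with spaces."""
    kept = []
    in_cue = False
    for line in vtt.splitlines():
        stripped = line.strip()
        if not stripped:
            in_cue = False  # a blank line ends the current cue
            continue
        if TIMING.match(stripped):
            in_cue = True  # text lines after a timing line belong to the cue
            continue
        if in_cue:
            kept.append(stripped)
        # Anything else (the WEBVTT header, cue identifiers) is skipped.
    return " ".join(kept)


print(webvtt_to_text(SAMPLE))
```

Run against the sample from the slide, this keeps the three caption lines and drops the header, cue numbers, and timestamps.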
WatsonX SDK To Foundation
● Python 3.10+
● LLM
● WatsonX.AI Foundation Models
● Inference
● Secure
● Official SDK from IBM
https://github.com/tspannhw/FLaNK-python-watsonx-processor
Generate Synthetic Records w/ Faker
● Python 3.10+
● faker
● Choose as many as you want
● Attribute output
Download a Wiki Page as HTML or WikiFormat (Text)
● Python 3.10+
● Wikipedia-api
● HTML or Text
● Choose your wiki page dynamically
Extract Company Names
● Python 3.10+
● Hugging Face, NLP, spaCy, PyTorch
https://github.com/tspannhw/FLaNK-python-ExtractCompanyName-processor
CaptionImage
● Python 3.10+
● Hugging Face
● Salesforce/blip-image-captioning-large
● Generate Captions for Images
● Adds captions to FlowFile Attributes
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
RESNetImageClassification
● Python 3.10+
● Hugging Face
● Transformers
● PyTorch
● Datasets
● microsoft/resnet-50
● Adds classification label to FlowFile Attributes
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
NSFWImageDetection
● Python 3.10+
● Hugging Face
● Transformers
● Falconsai/nsfw_image_detection
● Adds normal and nsfw to FlowFile Attributes
● Gives score on safety of image
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
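The "adds normal and nsfw to FlowFile Attributes" step can be pictured as flattening a classifier's label and score list into string attributes. The sketch below assumes the usual Hugging Face image-classification output shape (a list of dicts with `label` and `score`); the `scores_to_attributes` helper and the sample scores are hypothetical, and no model is actually called.

```python
def scores_to_attributes(predictions, prefix=""):
    """Flatten [{'label': ..., 'score': ...}, ...] into string-valued attributes."""
    attributes = {}
    for item in predictions:
        # FlowFile attributes are string key/value pairs, so format the score.
        attributes[f"{prefix}{item['label']}"] = f"{item['score']:.4f}"
    return attributes


# Hypothetical classifier output for one image.
example = [
    {"label": "normal", "score": 0.9987},
    {"label": "nsfw", "score": 0.0013},
]
attrs = scores_to_attributes(example)
print(attrs)
```

Downstream processors can then route on these attributes, for example sending images above a chosen nsfw score to a review queue.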
FacialEmotionsImageDetection
● Python 3.10+
● Hugging Face
● Transformers
● facial_emotions_image_detection
● Image Classification
● Adds labels/scores to FlowFile Attributes
● Does not require download or copies of your images
https://github.com/tspannhw/FLaNK-python-processors
Other Python Processors
● Put/Query-Pinecone (Vector DB Interface)
● ChunkDocument, ParseDocument
● ConvertCSVtoExcel
● DetectObjectInImage
● PromptChatGPT
● Put/Query-Chroma (Vector DB Interface)
DEMO
https://medium.com/@tspann/septa-transit-real-time-81082878b485
Philadelphia SEPTA
https://medium.com/cloudera-inc/streaming-street-cams-to-yolo-v8-with-python-and-nifi-to-minio-s3-3277e73723ce
Street Cameras
https://medium.com/cloudera-inc/subways-and-transit-updates-in-real-time-30c104c359ef
NYC Subway
MORE ARTICLES
● https://medium.com/cloudera-inc/watching-airport-traffic-in-real-time-32c522a6e386
● https://medium.com/cloudera-inc/building-a-real-time-data-pipeline-a-comprehensive-tutorial-on-minifi-nifi-kafka-and-flink-ee03ee6722cb
● https://medium.com/cloudera-inc/finding-the-best-way-around-7491c76ca4cb
● https://medium.com/cloudera-inc/nyc-traffic-are-you-kidding-me-6d3fa853903b
● https://medium.com/@tspann/building-a-travel-advisory-app-with-apache-nifi-in-k8-969b44c84958
● https://medium.com/@tspann/using-ollama-with-mistral-and-apache-nifi-720c17f5ff12
● https://medium.com/cloudera-inc/google-gemma-for-real-time-lightweight-open-llm-inference-88efe98e580f
● https://medium.com/@tspann/image-processing-with-custom-python-and-nifi-2-0-06eadc62c03c
● https://medium.com/@tspann/ai-augmented-devrel-part-1-4058af905a89
● https://medium.com/cloudera-inc/mixtral-generative-sparse-mixture-of-experts-in-dataflows-59744f7d28a9
● https://medium.com/@tspann/building-an-llm-bot-for-meetups-and-conference-interactivity-c211ea6e3b61
● https://medium.com/@tspann/yet-another-python-processor-45aaae6fe406
LLM 2024
Startup
Grind AI
Max
Summit -
April 12 NJ

Conf42-LLM_Adding Generative AI to Real-Time Streaming Pipelines
