https://princetonacm.acm.org/tcfpro/
18th Annual IEEE IT Professional Conference (ITPC)
Armstrong Hall at The College of New Jersey
Friday, March 15th, 2024 | 10:00 AM to 5:00 PM
IT Professional Conference at Trenton Computer Festival
TCFPro24 Building Real-Time Generative AI Pipelines
In this talk, Tim will delve into the exciting realm of building real-time generative AI pipelines with streaming capabilities. The discussion will revolve around the integration of cutting-edge technologies to create dynamic and responsive systems that harness the power of generative algorithms.
From leveraging streaming data sources to implementing advanced machine learning models, the presentation will explore the key components necessary for constructing a robust real-time generative AI pipeline. Practical insights, use cases, and best practices will be shared, offering a comprehensive guide for developers and data scientists aspiring to design and implement dynamic AI systems in a streaming environment.
Tim will give a live demo showing how Apache NiFi can provide a live chat between a person in Slack and several LLMs, all orchestrated with Apache NiFi, Apache Kafka and Python. The demo uses RAG against the Chroma and Pinecone vector data stores, LLMs from Hugging Face and WatsonX.AI, and adds further context with NiFi lookups of stocks, weather and other data streams in real time.
Timothy Spann
Tim Spann is a Principal Developer Advocate in Data In Motion for Cloudera. He works with Apache NiFi, Apache Pulsar, Apache Kafka, Apache Flink, Flink SQL, Apache Pinot, Trino, Apache Iceberg, Delta Lake, Apache Spark, Big Data, IoT, Cloud, AI/DL, machine learning, and deep learning. Tim has over ten years of experience with IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
Previously, he was a Developer Advocate at StreamNative, Principal DataFlow Field Engineer at Cloudera, a Senior Solutions Engineer at Hortonworks, a Senior Solutions Architect at AirisData, a Senior Field Engineer at Pivotal and a Team Leader at HPE. He blogs for DZone, where he is the Big Data Zone leader, and runs a popular meetup in Princeton & NYC on Big Data, Cloud, IoT, deep learning, streaming, NiFi, the blockchain, and Spark.
Tim is a frequent speaker at conferences such as ApacheCon, DeveloperWeek, Pulsar Summit and many more. He holds a BS and MS in computer science.
9. Which Model and When?
Use the right model for the right job: closed or open source
Closed Source
● Most advanced AI models
● Great for a wide range of tasks
● Usage can easily scale, but so can your costs
● Compliance, privacy, and security risks
Open Source
● Rapidly improving AI models
● Excel at more specialized tasks
● Better cost planning
● More control over where & how models are deployed
10. Adoption of Generative AI is a Journey
Identifying AI challenges in the enterprise
Challenge: Data integration barriers
What’s missing:
● Streamlined access to enterprise data
Challenge: Rigid model infrastructure
What’s missing:
● Modularity
● Flexibility
● AI Ops
Challenge: Lack of security and transparency
What’s missing:
● Model control
● Built-in security
● Visibility & governance
11. Data = Organization Context
Your data enables contextually accurate responses from LLMs
[Diagram: two flows]
Without organization context: User Query → Large Language Model → Contextually Inaccurate Response
With organization context: User Query + Data (Organization Context) → Large Language Model → Contextually Accurate Response
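A minimal sketch of the two flows on this slide, assuming the organization context is simply prepended to the prompt as text. The function name `build_prompt` and the prompt wording are hypothetical and are not taken from the talk; no real LLM is called here.

```python
from typing import List, Optional


def build_prompt(user_query: str, context_docs: Optional[List[str]] = None) -> str:
    """Return the text that would be sent to an LLM.

    With no context_docs the model sees only the question, so it must rely
    on its training data. With context_docs the question is grounded in
    organization data.
    """
    if not context_docs:
        return f"Question: {user_query}\nAnswer:"
    # Hypothetical prompt layout: one bullet per context snippet.
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {user_query}\nAnswer:"
    )


if __name__ == "__main__":
    query = "What is our refund window?"
    # Flow 1: user query only.
    print(build_prompt(query))
    print("---")
    # Flow 2: user query plus organization data (made-up example text).
    print(build_prompt(query, ["Refunds are accepted within 30 days of purchase."]))
```

In a real pipeline the context snippets would come from a lookup or a vector store rather than a hard-coded list.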
13. Live Q&A
ARCHITECTURE
[Architecture diagram. Recoverable labels follow.]
Data sources: Travel Advisories, Weather Reports, Documents, Social Media, Databases, Transactions, Public Data Feeds, S3 / Files, Logs, ATM Data, Live Chat, …
Stages: INTERACT; COLLECT; STORE; ENRICH, REPORT; Distribute; Collect; Report; Visualize; Report, Automate; Predict, Automate
AI BASED ENHANCEMENTS: VECTOR DATABASE, LLM, Machine Learning
Components: Data Flow, Messaging Broker, SQL Stream Builder, Data Warehouse, Data Visualization
Data labels: Input Sentences, Generated Text, Timestamps, Enrichments, Aggregations, Real-time alerting
15. LLM USE CASE
Data in Motion on Cloudera Data Platform (CDP): capture, process & distribute any data, anywhere
Sources: Unstructured file types, Other enterprise data, Structured Sources, Applications/APIs, Streams
Destinations: Vector DB, AI Model, Open Data Lakehouse, Materialized Views
17. DataFlow Pipelines Can Help
External Context Ingest
Ingesting, routing, cleaning, enriching, transforming, parsing, chunking and vectorizing structured, unstructured, semi-structured and binary data and documents
Prompt Engineering
Crafting and structuring queries to optimize LLM responses
Context Retrieval
Enhancing the LLM with external context, such as Retrieval Augmented Generation (RAG)
Roundtrip Interface
Acting as a Discord, REST, Kafka, SQL or Slack bot to round-trip discussions
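The chunking, vectorizing and context retrieval steps listed on this slide can be sketched in plain Python. This is an illustration under stated assumptions, not the pipeline from the talk: the word-count "embedding" is a toy stand-in for a real embedding model, the in-memory list stands in for a vector store such as Chroma or Pinecone, and all function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def chunk_text(text: str, max_words: int = 7) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def vectorize(text: str) -> Counter:
    """Toy embedding: lowercase word counts (assumption, not a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 1) -> List[str]:
    """Return the k chunks most similar to the query."""
    q = vectorize(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, vectorize(c)), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    document = (
        "NiFi ingests and routes data between systems. "
        "Kafka is a distributed broker for streams. "
        "Flink runs stateful stream processing jobs continuously."
    )
    chunks = chunk_text(document, max_words=7)
    context = retrieve("What does Kafka do?", chunks, k=1)
    # The retrieved chunk would be added to the LLM prompt as external context.
    print(context)
```

Fixed-size word chunks are the simplest choice; real pipelines often chunk on sentence or section boundaries and overlap adjacent chunks so that context is not cut mid-thought.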
33. CLOUDERA STREAM PROCESSING
Two Major Capabilities: Enterprise Messaging and Powerful Stream Processing
Cloudera Streams Messaging (CSM)
Powered by Apache Kafka
Enterprise-grade messaging products for Apache Kafka: Streams Messaging Manager to monitor/operate clusters, Streams Replication Manager for HA/DR deployments, Schema Registry for centralized schema management, and support for Kafka Connect and Cruise Control
Cloudera Streaming Analytics (CSA)
Powered by Apache Flink
Powered by Apache Flink with SQL StreamBuilder, it provides low-latency stream processing capabilities with advanced windowing & state management made simple with SQL
34. ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA
Extend streams messaging services for Schema Mgmt, Replication & Monitoring
● Schema Registry: Kafka Schema Governance
● Streams Replication Manager: Kafka Replication Service for Disaster Recovery
● Streams Messaging Manager: Management & Monitoring Service for all of your Kafka clusters
35. ENTERPRISE MANAGEMENT CAPABILITIES FOR APACHE KAFKA
Kafka Data Movement, Operations and Security Made Easier
● Kafka Connect Support: Simple Data Movement, Change Data Capture Connectors, Build Custom Connectors with NiFi
● Ranger Security: Improved ACL and Audit for Kafka, Kafka Connect and Schema Registry
● Cruise Control Support: Intelligent Rebalancing & Self-Healing of your Kafka Clusters
37. NEXT GENERATION STREAMING ANALYTICS WITH APACHE FLINK
Low latency stateful stream processing
● Flink is a distributed data processing system ideally suited for real-time, event-driven applications.
● Unifies stream and batch processing
● Advanced features: late-arriving data, checkpointing, event-time processing, exactly-once processing
Real-Time Insights
Event Processing
Low Latency
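To make "event time processing" and "late arriving data" concrete, here is a simplified pure-Python sketch of event-time tumbling windows driven by a watermark. It does not use Flink's API and its watermark rule is a simplification of Flink's semantics; the function name, parameters and sample events are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (event_time, payload)


def tumbling_window_counts(
    events: Iterable[Event], window_size: int, max_out_of_orderness: int
) -> Tuple[List[Tuple[int, int]], List[Event]]:
    """Count events per event-time window, tolerating bounded disorder.

    The watermark trails the largest event time seen so far by
    max_out_of_orderness. A window [start, start + window_size) is emitted
    once the watermark reaches its end; events for an already emitted
    window are reported as late instead of being counted.
    """
    open_windows: Dict[int, int] = defaultdict(int)
    results: List[Tuple[int, int]] = []
    late: List[Event] = []
    watermark = float("-inf")

    for event_time, payload in events:
        start = (event_time // window_size) * window_size
        if start + window_size <= watermark:
            late.append((event_time, payload))  # window already emitted
            continue
        open_windows[start] += 1
        watermark = max(watermark, event_time - max_out_of_orderness)
        # Emit every open window whose end the watermark has passed.
        for s in sorted(w for w in open_windows if w + window_size <= watermark):
            results.append((s, open_windows.pop(s)))

    # End of input: flush whatever is still open.
    for s in sorted(open_windows):
        results.append((s, open_windows.pop(s)))
    return results, late


if __name__ == "__main__":
    stream = [(1, "a"), (4, "b"), (11, "c"), (9, "d"), (13, "e"), (3, "f"), (25, "g")]
    counts, late_events = tumbling_window_counts(stream, window_size=10, max_out_of_orderness=2)
    print(counts)       # (window_start, count) pairs
    print(late_events)  # events that arrived after their window was emitted
```

In the sample stream, the event at time 9 arrives after the event at time 11 but is still counted, because the watermark has not yet passed the end of its window; the event at time 3 arrives too late and is set aside.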
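The idea of enriching streaming data with batch data can be sketched as a lookup join. In SQL Stream Builder this would be written in SQL; the Python below only illustrates the concept and is not SSB code. The function name, the `symbol` key and the sample records are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def enrich_stream(
    events: Iterable[dict], lookup: Dict[str, dict], key: str = "symbol"
) -> Iterator[dict]:
    """Join each streaming event with a batch lookup table on `key`.

    Events without a match pass through unchanged (left-join behaviour).
    Fields already present on the event take precedence over lookup fields.
    """
    for event in events:
        extra = lookup.get(event.get(key), {})
        yield {**extra, **event}


if __name__ == "__main__":
    # Batch side: reference data loaded once (made-up values).
    companies = {"ABC": {"company": "Example Corp"}}
    # Streaming side: events arriving one at a time (made-up values).
    ticks = [{"symbol": "ABC", "price": 10.0}, {"symbol": "XYZ", "price": 1.0}]
    for row in enrich_stream(ticks, companies):
        print(row)
```

Because the function is a generator, each event is enriched as it arrives rather than after the whole stream has been collected.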