Welcome to ServerlessToronto.org
Introduce Yourself:
- Where from? Why are you here?
- Looking for work or Offering work?
Help us serve you better: bit.ly/slsto
An Evening with Mark Ryan and Jerry Liu
• 6:00 - 6:10 Networking & Opening remarks
• 6:10 - 6:35 Mark Ryan: The LLM Landscape
• 6:35 - 7:15 Jerry Liu: Solving Core Challenges
in RAG Pipelines
• 7:15 - 7:45 Q&A
• 7:45 - 8:00 Manning Publications raffle
Why this Generative AI Talk?
1. Navigating the Tsunami: Understand the sweeping
changes the "GenAI tsunami" brings to industries and jobs.
2. Situational Awareness: Learn from AI leaders Mark Ryan
& Jerry Liu to gain a strategic view of the LLM and RAG
landscape.
3. Career Transformation: Learn to position yourself as the
architect of automation rather than its subject.
4. Practical Advice: Acquire actionable strategies to apply
Generative AI within your enterprise.
5. Interactive Learning: Engage in live Q&A to discuss and
clarify your AI dilemmas with experts.
Battle of Waterloo
What is Serverless Toronto about?
Serverless became New Agile & Mindset
#1 We started as Back-
end FaaS Developers
who enjoyed 'gluing
together' other people's
APIs and Managed
Services
#3 We're obsessed
with creating business
value (meaningful
Products), focusing on
Outcomes/Impact –
NOT Outputs
#2 We build bridges
between Serverless
Community (“Dev leg”),
and Front-end, Voice-First
& UX folks (“UX leg”)
#4 Achieve agility NOT by
“sprinting” faster but working
smarter (by using bigger
building blocks & less Ops)
Serverless is a State of Mind…
Way too often, we IT folks obsess over
"pimping up our cars" (infrastructure /
code / pipelines) instead of "driving
the business" forward & taking it
places ☺
... It is a way to focus on business value.
It can be applied to any Tech stack, even On-Prem
Jared Short:
1. If the platform has it, use it
2. If the market has it, buy it
3. If you can reconsider requirements, do it
4. If you have to build it, own it.
Ben Kehoe: Serverless is about how you make
decisions, not about your choices.
Upcoming ServerlessToronto.org Meetups
• Friday Lunch & Learn, April 19
• Monday evening, May 6
• Summer 2024
Knowledge Sponsor
1. Go to www.manning.com
2. Select *any* e-Book, Video course, or liveProject you want!
3. Add it to your shopping cart (no more than 1 item in the cart)
4. Raffle winners will send me the email they use on the Manning portal,
5. so the publisher can move the item to your Dashboard – as if purchased.
Fill out the Survey to win: bit.ly/slsto
Feature Presentations:
LLM Landscape
A Journey Through A Year of Evolution
Mark Ryan
Developer Knowledge Platform AI Lead, Google Cloud
ryanmark2014@gmail.com
Generative AI
Milestones
Major Generative AI Milestones: Part 1
Jun 2017
Attention Is All You Need: Seminal paper from Google that introduced transformers
Oct 2018
BERT: Google transformer-based language model
Feb 2019
GPT-2: OpenAI LLM
May 2020
GPT-3: OpenAI LLM
Jan 2021
DALL·E: OpenAI image model
May 2021
LaMDA: Google LLM
Aug 2021
Codex: OpenAI code model
Apr 2022
DALL·E 2: OpenAI image model
May 2022
Imagen: Google image model
PaLM: Google LLM
Gato: DeepMind multimodal model
Aug 2022
Stable Diffusion: Image model
Major Generative AI Milestones: Part 2
Nov 2022
ChatGPT: Consumer chat from OpenAI, initially featuring GPT-3.5
Feb 2023
Bard: Consumer chat from Google
Mar 2023
GPT-4: OpenAI flagship model
ChatGPT Plugins: Connect to third-party applications
Apr 2023
CodeWhisperer: AWS AI coding assistant
May 2023
Vertex AI Gen AI: including a curated set of Google, third-party, and open models
PaLM 2: Google flagship model
July 2023
Llama 2: Meta open-source LLM licensed for commercial use
Code Interpreter: OpenAI integrated sandbox environment for data upload and analysis
Aug 2023
Duet AI: AI assistant for Google Cloud, including chat in the console, and general-purpose (VS Code) and SQL (BigQuery) code completion/interpretation
Sept 2023
DALL·E 3: OpenAI image model
Nov 2023
Q: AWS chatbot
Grok: X chatbot
Dec 2023
Gemini: Google flagship multimodal (text / image / video) models
Feb 2024
Gemini Pro 1.5: 1M-context multimodal model
Gemma: Google open model
Sora: OpenAI text-to-video
Mar 2024
Claude 3: Anthropic flagship models
Devin: Cognition SWE AI
Ecosystem and
Vendor Landscape
The Emerging LLM Ecosystem

Vector databases
● Examples: Pinecone, Chroma, Vertex AI Vector Search
● Description: Store and find associations between embeddings, high-dimensional vector representations of data
● Use case: Grounding LLM responses in a set of documents (an example of RAG)

Encapsulated coding environments
● Examples: OpenAI Code Interpreter / Advanced Data Analysis
● Description: Upload datasets & ask questions to get visualizations and code running in a limited Python instance
● Use case: Ad hoc data analysis

Plugins / extensions
● Examples: ChatGPT plugins / GPTs, Vertex AI extensions
● Description: Connect LLMs to third-party / external applications
● Use case: Access current data / query & modify data that is external to the LLM

LLM app development frameworks
● Examples: LangChain, LlamaIndex, Autogen
● Description: LLM-centric frameworks to manage workflow (data sources, agents, models, etc.)
● Use case: Assembling LLM-based applications
Generative AI Landscape by Vendor
● Google: Gemini for Google Workspace (prod. suite assistance), Duet AI for Google Cloud (developer/ops assistant), Gemini (consumer chat), Vertex AI (enterprise Gen AI), Google AI for Developers (dev/hobbyist Gen AI), Gemma (open foundation models)
● Microsoft: CoPilot 365 (prod. suite assistance), GitHub Copilot (developer/ops assistant), Bing Chat (consumer chat), Azure OpenAI (enterprise Gen AI)
● OpenAI: ChatGPT (prod. suite assistance & consumer chat), ChatGPT Enterprise (enterprise Gen AI), ChatGPT (dev/hobbyist Gen AI)
● AWS: Q, CodeWhisperer (developer/ops assistant), Bedrock / Titan (enterprise Gen AI)
● Anthropic: Claude 3* (consumer chat & enterprise Gen AI)
● Meta: Llama 2* (open foundation models)
● Mistral: Mixtral 8x7B* (open foundation models)
Twitter: @MarkRyanMkm
LinkedIn: www.linkedin.com/in/mark-ryan-31826743
YouTube: @markryan2475
RAG in 2024
Jerry Liu, LlamaIndex co-founder/CEO
LlamaIndex: Context Augmentation for your LLM app
Paradigms for inserting knowledge
Retrieval Augmentation - Fix the model, put context into the prompt
LLM
Before college the two main
things I worked on, outside of
school, were writing and
programming. I didn't write
essays. I wrote what
beginning writers were
supposed to write then, and
probably still are: short
stories. My stories were awful.
They had hardly any plot, just
characters with strong
feelings, which I imagined
made them deep...
Input Prompt
Here is the context:
Before college the
two main things…
Given the context,
answer the following
question:
{query_str}
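The retrieval-augmentation paradigm above can be sketched in a few lines: the model stays fixed and retrieved context is stuffed into a prompt template. This is a minimal illustration; `context_str` and `build_prompt` are illustrative names, not LlamaIndex APIs.

```python
# Retrieval augmentation: fix the model, put context into the prompt.
# The template mirrors the one on the slide.
PROMPT_TEMPLATE = (
    "Here is the context:\n{context_str}\n\n"
    "Given the context, answer the following question:\n{query_str}"
)

def build_prompt(context_chunks: list[str], query_str: str) -> str:
    """Join retrieved chunks and fill in the prompt template."""
    context_str = "\n\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(context_str=context_str, query_str=query_str)

prompt = build_prompt(
    ["Before college the two main things I worked on were writing and programming."],
    "What did the author work on before college?",
)
```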
Paradigms for inserting knowledge
Fine-tuning - baking knowledge into the weights of the network
LLM
Before college the two main things
I worked on, outside of school,
were writing and programming. I
didn't write essays. I wrote what
beginning writers were supposed to
write then, and probably still are:
short stories. My stories were
awful. They had hardly any plot,
just characters with strong feelings,
which I imagined made them
deep...
RLHF, Adam, SGD, etc.
RAG Stack
Current RAG Stack for building a QA System
[Diagram: Doc → chunks → vector database (data ingestion / parsing) → LLM (data querying)]
5 Lines of Code in LlamaIndex!
Current RAG Stack (Data Ingestion/Parsing)
[Diagram: Doc → chunks → vector database]
Process:
● Split up document(s) into even chunks.
● Each chunk is a piece of raw text.
● Generate embedding for each chunk (e.g.
OpenAI embeddings, sentence_transformer)
● Store each chunk into a vector database
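The ingestion steps above can be sketched end to end. This is a toy, self-contained version: a deterministic hashing "embedder" and a plain list stand in for a real embedding model and vector database (Pinecone, Chroma, etc.).

```python
# Ingestion sketch: split a document into even chunks, embed each chunk,
# and store the (text, embedding) pairs in an in-memory "vector database".
import hashlib
import math

def embed(text: str, dim: int = 16) -> list[float]:
    """Toy deterministic embedding: hash character trigrams into `dim` buckets."""
    vec = [0.0] * dim
    for i in range(len(text) - 2):
        bucket = int(hashlib.md5(text[i:i + 3].encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def ingest(doc: str, chunk_size: int = 100) -> list[dict]:
    """Split a document into even chunks of raw text and embed each one."""
    chunks = [doc[i:i + chunk_size] for i in range(0, len(doc), chunk_size)]
    return [{"text": c, "embedding": embed(c)} for c in chunks]

vector_db = ingest("Before college I worked on writing and programming. " * 8)
```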
Current RAG Stack (Querying)
[Diagram: vector database → top-k chunks → LLM]
Process:
● Find top-k most similar chunks from vector
database collection
● Plug into LLM response synthesis module
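The query side can be sketched the same way: embed the query, take the top-k most similar chunks by cosine similarity, and hand them to the synthesis step. Embeddings here are plain float lists from any embedder; the data is illustrative.

```python
# Querying sketch: top-k retrieval by cosine similarity over stored chunks.
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query_emb: list[float], chunks: list[dict], top_k: int = 2) -> list[dict]:
    """chunks: list of {'text': str, 'embedding': list[float]}."""
    ranked = sorted(chunks, key=lambda c: cosine(query_emb, c["embedding"]), reverse=True)
    return ranked[:top_k]

db = [
    {"text": "Tesla risk factors include supply chain issues.", "embedding": [1.0, 0.0, 0.2]},
    {"text": "The author wrote short stories before college.", "embedding": [0.0, 1.0, 0.1]},
    {"text": "Tesla 2021 revenue grew substantially.", "embedding": [0.9, 0.1, 0.3]},
]
top = retrieve([1.0, 0.0, 0.0], db, top_k=2)  # query embedding leans toward the Tesla chunks
```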
Response Synthesis strategies: Create-and-refine; Tree summarize
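The create-and-refine strategy can be sketched with a stub in place of the model: answer from the first chunk, then refine the draft against each subsequent chunk. The stub `llm` just echoes the last context marker; a real model call would go there.

```python
# Create-and-refine synthesis sketch with a scripted stand-in for the LLM.
def llm(prompt: str) -> str:
    # Stub: pretend the model extracts the text after the last "context:" marker.
    return prompt.rsplit("context:", 1)[-1].strip()

def create_and_refine(question: str, chunks: list[str]) -> str:
    """Draft an answer from the first chunk, then refine it chunk by chunk."""
    answer = llm(f"Answer {question!r} using this context: {chunks[0]}")
    for chunk in chunks[1:]:
        answer = llm(
            f"Refine the existing answer {answer!r} to {question!r} "
            f"using this additional context: {chunk}"
        )
    return answer
```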
Quickstart
https://colab.research.google.com/drive/1knQpGJLHj-LTTHqlZhgcjDH5F_nJIiY0?usp=sharing
Challenges with “Naive” RAG
Naive RAG
[Pipeline: Data → Data Parsing + Ingestion → Index → Retrieval → LLM + Prompts → Response]
Typical defaults: PyPDF parsing, sentence splitting, chunk size 256, dense retrieval with top-k = 5, simple QA prompt.
Easy to Prototype, Hard to Productionize
Naive RAG approaches tend to work well for simple questions over a simple,
small set of documents.
● “What are the main risk factors for Tesla?” (over Tesla 2021 10K)
● “What did the author do during his time at YC?” (Paul Graham essay)
Easy to Prototype, Hard to Productionize
But productionizing RAG over more questions and a larger set of data is hard!
Failure Modes:
● Response Quality: Bad Retrieval, Bad Response Generation
● Hard to Improve: Too many parameters to tune
● Systems: Latency, Cost, Security
Challenges with Naive RAG (Response Quality)
● Bad Retrieval
○ Low Precision: Not all chunks in retrieved set are relevant
■ Hallucination + Lost in the Middle Problems
○ Low Recall: Not all relevant chunks are retrieved.
■ Lacks enough context for LLM to synthesize an answer
○ Outdated information: The data is redundant or out of date.
● Bad Response Generation
○ Hallucination: Model makes up an answer that isn’t in the context.
○ Irrelevance: Model makes up an answer that doesn’t answer the question.
○ Toxicity/Bias: Model makes up an answer that’s harmful/offensive.
Difference with Traditional Software
[Diagram: Data → Extract → Transform → Load → Response]
Traditional software is defined by a set of programmatic rules.
Given an input, you can easily reason about the expected output.
Difference with Traditional Software
AI-powered software is defined by a
black-box set of parameters.
It is really hard to reason about what the
function space looks like.
The model parameters are tuned, the
surrounding parameters (prompt templates)
are not.
Difference with Traditional Software
If one component of the system is a
black-box, all components of the system
become black boxes.
The more components, the more parameters
you have to tune.
Difference with Traditional Software
Every parameter affects the performance of the end system.
There Are Too Many Parameters
Every parameter affects the performance of the entire RAG pipeline.
Which parameters should a user tune? There are too many options!
● Which PDF parser should I use?
● How do I chunk my documents?
● How do I process embedded tables and charts?
● Which embedding model should I use?
● What retrieval parameters should I use?
● Dense retrieval or sparse?
● Which LLM should I use?
Mapping Pain Points to Solutions
Solution: Categorize by pain point, and establish best practices.
● "Seven Failure Points When Engineering a Retrieval Augmented Generation System", Barnett et al.
● "12 RAG Pain Points and Proposed Solutions", Wenqi Glantz
Pain Points
Response Quality Related
1. Context Missing in the Knowledge
Base
2. Context Missing in the Initial
Retrieval Pass
3. Context Missing After Reranking
4. Context Not Extracted
5. Output is in Wrong Format
6. Output has Incorrect Level of
Specificity
7. Output is Incomplete
Pain Points
Scalability
8. Can't Scale to Larger Data Volumes
11. Rate-Limit Errors
Security
12. LLM Security
Use Case Specific
9. Ability to QA Tabular Data
10. Ability to Parse PDFs
Let’s figure out solutions
1. Context Missing in the Knowledge Base
Clean your data: Pick a good
document parser (more on this
later!)
Add in Metadata: inject global
context to each chunk
Keep your data updated: Set up a
recurring data ingestion pipeline.
Upsert documents to prevent
duplicates.
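The upsert idea can be sketched directly: key each document by id and skip re-embedding when a content hash shows it is unchanged. The dict-backed `store` is a stand-in for whatever document store a real pipeline uses.

```python
# Upsert sketch for a recurring ingestion pipeline: insert new docs,
# skip unchanged ones (by content hash), and update changed ones.
import hashlib

def upsert(store: dict, doc_id: str, text: str) -> str:
    """Returns 'inserted', 'updated', or 'skipped'."""
    digest = hashlib.sha256(text.encode()).hexdigest()
    existing = store.get(doc_id)
    if existing is None:
        store[doc_id] = {"text": text, "hash": digest}
        return "inserted"
    if existing["hash"] == digest:
        return "skipped"  # unchanged: no re-embedding needed
    store[doc_id] = {"text": text, "hash": digest}
    return "updated"

store: dict = {}
status = upsert(store, "doc1", "first version")   # "inserted"
status = upsert(store, "doc1", "first version")   # "skipped"
status = upsert(store, "doc1", "second version")  # "updated"
```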
2. Context Missing in the Initial Retrieval Pass
Solution: Hyperparameter tuning for chunk
size and top-k
Solution: Reranking
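Reranking can be sketched as a second pass: a cheap retriever returns candidates, then a scorer re-scores each (query, chunk) pair and reorders. The `cross_score` here is a toy term-overlap stub; a real reranker (e.g. ColBERT or a cross-encoder) would replace it.

```python
# Second-stage reranking sketch: re-score candidates and keep the best.
def cross_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query terms in the chunk."""
    q_terms = set(query.lower().split())
    c_terms = set(chunk.lower().split())
    return len(q_terms & c_terms) / (len(q_terms) or 1)

def rerank(query: str, candidates: list[str], top_n: int = 2) -> list[str]:
    scored = sorted(candidates, key=lambda c: cross_score(query, c), reverse=True)
    return scored[:top_n]

best = rerank(
    "revenue growth of uber",
    ["Lyft revenue fell.", "Uber revenue growth was strong.", "Weather report."],
    top_n=1,
)
```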
Source: ColBERT
3. Context Missing After Reranking
Solution: try out fancier retrieval methods
(small-to-big, auto-merging, auto-retrieval,
ensembling, …)
Solution: fine-tune your embedding models
to task-specific data
4. Context is there, but not extracted by the LLM
The context is there,
but the LLM doesn’t
understand it.
“Lost in the middle”
Problems.
https://x.com/GregKamradt/status/1722386725635580292?s=20
4. Context is there, but not extracted by the LLM
Solution: Prompt Compression (LongLLMLingua, Jiang et al.)
Solution: LongContextReorder
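The LongContextReorder idea can be sketched as a pure reordering step: place the highest-scored chunks at the beginning and end of the context window, so the weakest chunks land in the middle, where models attend least. This is a minimal interleaving sketch of that behavior, not the LlamaIndex implementation.

```python
# "Lost in the middle" mitigation sketch: best chunks go to the edges.
def long_context_reorder(chunks_scored: list[tuple[str, float]]) -> list[str]:
    """Alternate ranked chunks between the front and the back of the context."""
    ranked = sorted(chunks_scored, key=lambda cs: cs[1], reverse=True)
    front: list[str] = []
    back: list[str] = []
    for i, (chunk, _score) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ordered = long_context_reorder([("a", 0.9), ("b", 0.5), ("c", 0.8), ("d", 0.2)])
# Highest-scored chunks end up at the edges of the final ordering.
```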
5. Output is in Wrong Format
A lot of use cases require outputting the
answer in JSON format.
Solutions:
Better text prompting/output parsing
Use OpenAI function calling + JSON mode
Use token-level prompting (LMQL, Guidance)
Source: Guidance
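The "better output parsing" option can be sketched as parse-with-retry: attempt to parse the model output as JSON, and on failure run a fix-up pass (here a stub that trims to the outermost braces; a real system would re-prompt the model or use provider-side JSON mode / function calling instead).

```python
# Output-parsing sketch: extract a JSON answer from chatty model output.
import json

def fixup_llm(bad_output: str) -> str:
    # Stub standing in for a re-prompt like "Return only valid JSON."
    return bad_output[bad_output.find("{"): bad_output.rfind("}") + 1]

def parse_json_answer(raw: str, max_retries: int = 2) -> dict:
    for _ in range(max_retries + 1):
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            raw = fixup_llm(raw)
    raise ValueError("could not parse model output as JSON")

result = parse_json_answer('Sure! Here is the JSON: {"answer": 42}')
```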
7. Incomplete Answer
What if you have a complex multi-part
question?
Naive RAG is primarily good for answering
simple questions about specific facts.
7. Incomplete Answer
Solution: Add Agentic Reasoning
[Diagram: spectrum from simple agents (lower cost, lower latency: routing, one-shot query planning, tool use) to advanced agents (higher cost, higher latency: ReAct, dynamic planning + execution)]
8. Scaling your Data Pipeline
Pain points:
● Processing thousands/millions of docs is slow
● How do we efficiently handle document updates?
Reference Production Ingestion Stack
● Parallelize document processing
● HuggingFace TEI
● RabbitMQ Message Queue
● AWS EKS clusters
https://github.com/run-llama/llamaindex_aws_ingestion
10. Proper RAG over Complex Documents
How do we model complex docs
with embedded tables?
RAG with naive chunking +
retrieval → leads to hallucinations!
Embedded Table
Advanced Retrieval:
Embedded Tables
Instead: model data hierarchically.
Index tables/figures by their
summaries.
The only missing component:
how do I parse out the tables from
the data?
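The hierarchical-indexing idea can be sketched concretely: retrieve over a short summary of each table, but hand the full table to the LLM on a hit. Retrieval here is stub keyword matching over the summaries; the table data is illustrative.

```python
# Hierarchical indexing sketch: index tables by their summaries,
# return the full table for synthesis when a summary matches.
TABLES = {
    "t1": {
        "summary": "quarterly revenue by region",
        "full": "Q1 NA 10 | Q1 EU 7 | Q2 NA 12 | Q2 EU 8",
    },
    "t2": {
        "summary": "employee headcount by year",
        "full": "2020 900 | 2021 1200",
    },
}

def retrieve_table(query: str) -> str:
    """Match the query against table summaries; return the best table's full text."""
    best = max(
        TABLES.values(),
        key=lambda t: len(set(query.lower().split()) & set(t["summary"].split())),
    )
    return best["full"]

table = retrieve_table("what was revenue by region")
```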
Most PDF Parsing is Inadequate
Extracts into a
messy format that is
impossible to pass
down into more
advanced
ingestion/retrieval
algorithms.
Introducing LlamaParse
A genAI-native parser
designed to let you build
RAG over complex
documents
https://github.com/run-llama/llama_parse
Introducing LlamaParse
Capabilities
✅ Extracts tables / charts
✅ Input natural language parsing
instructions
✅JSON mode
✅Image Extraction
✅Support for ~10+ document types
(.pdf, .pptx, .docx, .xml)
[Comparison: current PDFReader output vs LlamaParse output]
LlamaParse Results
The best parser at table extraction == the only parser for advanced RAG
Expanded: https://drive.google.com/file/d/1fyQAg7nOtChQzhF2Ai7HEeKYYqdeWsdt/view?usp=sharing
Steerability
Default (no instructions)
Steerability
With Instructions
What’s next for RAG: Agents?
From RAG to Agents
Agent Definition: Using LLMs for automated reasoning and tool selection
RAG is just one Tool: Agents can decide to use RAG with other tools
From Simple to Advanced Agents
● Simple (lower cost, lower latency): Routing; One-shot query planning; Tool use
● Advanced (higher cost, higher latency): ReAct; Dynamic planning + execution
Routing
Simplest form of agentic
reasoning.
Given user query and set of
choices, output subset of
choices to route query to.
Routing
Use Case: Joint QA and
Summarization
Guide
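Routing can be sketched as the minimal agentic step: given a query and a set of engines, pick one and forward the query. A real router would ask an LLM to choose from tool descriptions; here a stub classifier keys off the query wording, and the engine names are illustrative.

```python
# Routing sketch: choose between a summarization engine and a QA engine.
ENGINES = {
    "summarization": lambda q: f"[summary for: {q}]",
    "qa": lambda q: f"[specific answer for: {q}]",
}

def route(query: str) -> str:
    """Stub router: an LLM would normally make this choice."""
    choice = "summarization" if "summar" in query.lower() else "qa"
    return ENGINES[choice](query)

out = route("Summarize the Lyft 10-K")  # routed to the summarization engine
```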
Query Planning
Break down query into
parallelizable sub-queries.
Each sub-query can be
executed against any set of
RAG pipelines
[Diagram: "Compare revenue growth of Uber and Lyft in 2021" decomposes into "Describe revenue growth of Uber in 2021" (run top-2 against the Uber 10-K) and "Describe revenue growth of Lyft in 2021" (run top-2 against the Lyft 10-K)]
Query Planning
Example: Compare
revenue of Uber and Lyft in
2021
Query Planning Guide
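The example above can be sketched as code: break the comparison question into parallelizable sub-queries, run each against its own RAG pipeline, and combine the answers. The decomposition is hand-written here and the pipelines are stubs; an LLM would normally produce both.

```python
# Query-planning sketch: decompose, fan out to per-index pipelines, combine.
def decompose(query: str) -> list[tuple[str, str]]:
    # Stub plan for the Uber/Lyft comparison: one sub-query per company,
    # each tagged with the index it should run against.
    return [
        ("uber_10k", "Describe revenue growth of Uber in 2021"),
        ("lyft_10k", "Describe revenue growth of Lyft in 2021"),
    ]

PIPELINES = {
    "uber_10k": lambda q: f"[Uber 10-K answer to: {q}]",
    "lyft_10k": lambda q: f"[Lyft 10-K answer to: {q}]",
}

def run_plan(query: str) -> str:
    answers = [PIPELINES[index](sub_q) for index, sub_q in decompose(query)]
    return " ".join(answers)  # a real system would synthesize with an LLM

combined = run_plan("Compare revenue growth of Uber and Lyft in 2021")
```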
Tool Use
Use an LLM to call an API
Infer the parameters of that
API
Tool Use
In normal RAG you just pass
through the query.
But what if you used the
LLM to infer all the
parameters for the API
interface?
A key capability in many QA
use cases (auto-retrieval,
text-to-SQL, and more)
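Parameter inference can be sketched with a stub in the LLM's role: instead of passing the raw query through, structured parameters for the API call are extracted from it (as in auto-retrieval or text-to-SQL). The `revenue_api` tool and its schema are hypothetical.

```python
# Tool-use sketch: infer API parameters from the query, then call the tool.
def fake_llm_infer_params(query: str) -> dict:
    """Stub: a real LLM would emit these parameters given the tool's schema."""
    params = {"table": "revenue", "year": None}
    for token in query.split():
        if token.isdigit() and len(token) == 4:
            params["year"] = int(token)
    return params

def revenue_api(table: str, year: int) -> str:
    # Hypothetical tool the agent can call.
    return f"rows from {table} where year = {year}"

def run_tool(query: str) -> str:
    params = fake_llm_infer_params(query)
    return revenue_api(**params)

result = run_tool("Show revenue for 2021")
```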
This is cool, but:
● How can an agent tackle sequential multi-part problems?
○ Let’s make it loop
● How can an agent maintain state over time?
○ Let’s add basic memory
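Both fixes can be sketched together: a reasoning loop that keeps calling tools until the task is done, plus basic memory as a list of past steps consulted on each iteration. The "policy" is a scripted stub (use each tool once, then finish); tool names are illustrative.

```python
# Minimal agent-loop sketch: loop + basic memory over a set of tools.
def agent(task: str, tools: dict, max_steps: int = 5) -> str:
    memory: list[str] = []
    for _ in range(max_steps):
        # Stub policy standing in for LLM reasoning: use each unused tool.
        unused = [name for name in tools if f"used:{name}" not in memory]
        if not unused:
            return f"done: {task} with memory {memory}"
        name = unused[0]
        observation = tools[name](task)
        memory.append(f"used:{name}")          # state persists across steps
        memory.append(f"obs:{observation}")
    return "gave up"

result = agent(
    "compare revenues",
    {"search": lambda t: "found docs", "math": lambda t: "computed"},
)
```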
Data Agents - Core Components
Agent Reasoning Loop
● ReAct Agent (any LLM)
● OpenAI Agent (only OAI)
Tools
Query Engine Tools (RAG
pipeline)
LlamaHub Tools (30+ tools to
external services)
ReAct: Reasoning + Acting with LLMs
Source: https://react-lm.github.io/
ReAct: Reasoning + Acting with LLMs
Add a loop around
query
decomposition + tool
use
ReAct: Reasoning + Acting with LLMs
Superset of query
planning + routing
capabilities.
ReAct + RAG Guide
Can we make this even better?
● Stop being so short-sighted - plan ahead at each step
● Parallelize execution where we can
LLMCompiler
Kim et al. 2023
An agent compiler
for parallel
multi-function
planning +
execution.
LLMCompiler
Plan out steps
beforehand, and
replan as necessary
LLMCompiler Agent
Tree-based Planning
Tree of Thoughts
(Yao et al. 2023)
Reasoning via
Planning (Hao et al.
2023)
Language Agent
Tree Search (Zhou
et al. 2023)
Additional Requirements
● Observability: see the full trace of the agent
○ Observability Guide
● Control: Be able to guide the intermediate steps of an agent step-by-step
○ Lower-Level Agent API
● Customizability: Define your own agentic logic around any set of tools.
○ Custom Agent Guide
○ Custom Agent with Query Pipeline Guide
Additional Requirements
Possible through our
query pipeline syntax
Query Pipeline Guide
What’s next for RAG: Long Contexts?
Is RAG Dead?
https://x.com/Francis_YAO_/status/1759962812229800012?s=20
Gemini 1.5 Pro has a 1-10M
context window.
What does this mean for RAG?
Our Position
1. Frameworks are valuable whether or not RAG lives or dies
2. Certain RAG concepts will go away, but others will remain and evolve
Long Context LLMs will Solve the Following
1. Developers will worry less about tuning chunking algorithms
2. Developers will need to spend less time tuning retrieval and
chain-of-thought over single documents
3. Summarization will be easier
4. Personalized memory will be better and easier to build
Some Challenges Remain
1. 10M tokens is not enough for large document corpora (hundreds of
MB, GBs)
2. Embedding models are lagging behind in context length
3. Cost and Latency
4. A KV Cache takes up a significant amount of GPU memory, and has
sequential dependencies
New RAG Architectures
1. Small to Big Retrieval over Documents
2. Intelligent Routing for Latency/Cost Tradeoffs
3. Retrieval Augmented KV Caching
Small to Big Retrieval over Documents
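Small-to-big retrieval can be sketched as a two-level lookup: match the query against small chunks, but return each chunk's parent document so a long-context LLM gets the surrounding text. Matching is stub keyword overlap; documents and chunks are illustrative.

```python
# Small-to-big retrieval sketch: match small, return big.
DOCS = {
    "doc1": "Section A ... Tesla's main risk factors are supply chain and demand. ... Section B",
    "doc2": "The author wrote short stories before college. More essay text follows.",
}

# Small chunks, each pointing back at its parent document.
CHUNKS = [
    {"text": "Tesla's main risk factors", "parent": "doc1"},
    {"text": "wrote short stories before college", "parent": "doc2"},
]

def small_to_big(query: str) -> str:
    """Match against small chunks; return the parent document as context."""
    q = set(query.lower().split())
    best = max(CHUNKS, key=lambda c: len(q & set(c["text"].lower().split())))
    return DOCS[best["parent"]]  # hand back the big context, not the small chunk

context = small_to_big("what are the risk factors for tesla")
```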
Intelligent Routing for Latency/Cost Tradeoffs
Retrieval Augmented KV Caching
www.ServerlessToronto.org
Reducing the gap between IT and Business needs
Get started with Dialogflow & Contact Center AI on Google Cloud
 
Building a Data Cloud to enable Analytics & AI-Driven Innovation - Lak Lakshm...
Building a Data Cloud to enable Analytics & AI-Driven Innovation - Lak Lakshm...Building a Data Cloud to enable Analytics & AI-Driven Innovation - Lak Lakshm...
Building a Data Cloud to enable Analytics & AI-Driven Innovation - Lak Lakshm...
 
Smart Cities of Italy: Integrating the Cyber World with the IoT
Smart Cities of Italy: Integrating the Cyber World with the IoTSmart Cities of Italy: Integrating the Cyber World with the IoT
Smart Cities of Italy: Integrating the Cyber World with the IoT
 
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
Running Business Analytics for a Serverless Insurance Company - Joe Emison & ...
 
This is my Architecture to prevent Cloud Bill Shock
This is my Architecture to prevent Cloud Bill ShockThis is my Architecture to prevent Cloud Bill Shock
This is my Architecture to prevent Cloud Bill Shock
 
Lunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customersLunch & Learn BigQuery & Firebase from other Google Cloud customers
Lunch & Learn BigQuery & Firebase from other Google Cloud customers
 
Azure for AWS & GCP Pros: Which Azure services to use?
Azure for AWS & GCP Pros: Which Azure services to use?Azure for AWS & GCP Pros: Which Azure services to use?
Azure for AWS & GCP Pros: Which Azure services to use?
 
Serverless Evolution during 3 years of Serverless Toronto
Serverless Evolution during 3 years of Serverless TorontoServerless Evolution during 3 years of Serverless Toronto
Serverless Evolution during 3 years of Serverless Toronto
 
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCPSimpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
Simpler, faster, cheaper Enterprise Apps using only Spring Boot on GCP
 
AWS re:Invent 2020 Serverless Recap
AWS re:Invent 2020 Serverless RecapAWS re:Invent 2020 Serverless Recap
AWS re:Invent 2020 Serverless Recap
 
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of HoneycombSRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
SRE Topics with Charity Majors and Liz Fong-Jones of Honeycomb
 

Recently uploaded

Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Victor Rentea
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
panagenda
 

Recently uploaded (20)

Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...Stronger Together: Developing an Organizational Strategy for Accessible Desig...
Stronger Together: Developing an Organizational Strategy for Accessible Desig...
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Modernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using BallerinaModernizing Legacy Systems Using Ballerina
Modernizing Legacy Systems Using Ballerina
 
Simplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptxSimplifying Mobile A11y Presentation.pptx
Simplifying Mobile A11y Presentation.pptx
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
WSO2 Micro Integrator for Enterprise Integration in a Decentralized, Microser...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Quantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation ComputingQuantum Leap in Next-Generation Computing
Quantum Leap in Next-Generation Computing
 
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data PlatformLess Is More: Utilizing Ballerina to Architect a Cloud Data Platform
Less Is More: Utilizing Ballerina to Architect a Cloud Data Platform
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Decarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational PerformanceDecarbonising Commercial Real Estate: The Role of Operational Performance
Decarbonising Commercial Real Estate: The Role of Operational Performance
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 

All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (LlamaIndex).pdf

  • 1. Welcome to ServerlessToronto.org 1 Introduce Yourself: - Where from? Why are you here? - Looking for work or Offering work? Help us serve you better: bit.ly/slsto An Evening with Mark Ryan and Jerry Liu • 6:00 - 6:10 Networking & Opening remarks • 6:10 - 6:35 Mark Ryan: The LLM Landscape • 6:35 - 7:15 Jerry Liu: Solving Core Challenges in RAG Pipelines • 7:15 - 7:45 Q&A • 7:45 - 8:00 Manning Publications raffle
  • 2. Why this Generative AI Talk? 2 1. Navigating the Tsunami: Understand the sweeping changes the "GenAI tsunami" brings to industries and jobs. 2. Situational Awareness: Learn from AI leaders Mark Ryan & Jerry Liu to gain a strategic view of the LLM and RAG landscape. 3. Career Transformation: Learn to position yourself as the architect of automation rather than its subject. 4. Practical Advice: Acquire actionable strategies to apply Generative AI within your enterprise. 5. Interactive Learning: Engage in live Q&A to discuss and clarify your AI dilemmas with experts. Battle of Waterloo
  • 3. What is Serverless Toronto about? 3 Serverless became New Agile & Mindset #1 We started as Back- end FaaS Developers who enjoyed 'gluing together' other people's APIs and Managed Services #3 We're obsessed with creating business value (meaningful Products), focusing on Outcomes/Impact – NOT Outputs #2 We build bridges between Serverless Community (“Dev leg”), and Front-end, Voice-First & UX folks (“UX leg”) #4 Achieve agility NOT by “sprinting” faster but working smarter (by using bigger building blocks & less Ops) 1 2 3 4
  • 4. Serverless is a State of Mind… 4 Way too often, we, the IT folks, obsess over “pimping up our cars” (infrastructure / code / pipelines) instead of “driving the business” forward & taking it places ☺
  • 5. ... It is a way to focus on business value. 5 It can be applied to any Tech stack, even On-Prem Jared Short: 1. If the platform has it, use it 2. If the market has it, buy it 3. If you can reconsider requirements, do it 4. If you have to build it, own it. Ben Kehoe: Serverless is about how you make decisions, not about your choices.
  • 6. Upcoming ServerlessToronto.org Meetups 6 Friday Lunch & Learn, April 19 Monday evening, May 6 Summer 2024
  • 7. Knowledge Sponsor 1. Go to www.manning.com 2. Select *any* e-Book, Video course, or liveProject you want! 3. Add it to your shopping cart (no more than 1 item in the cart) 4. Raffle winners will send me the emails (used in Manning portal), 5. So the publisher can move it to your Dashboard – as if purchased. Fill out the Survey to win: bit.ly/slsto
  • 9. LLM Landscape A Journey Through A Year of Evolution Mark Ryan Developer Knowledge Platform AI Lead, Google Cloud ryanmark2014@gmail.com
  • 11. Major Generative AI Milestones: Part 1 Jun 2017 Attention Is All You Need: Seminal paper from Google that introduced transformers Oct 2018 BERT: Google transformer-based language model Feb 2019 GPT-2: OpenAI LLM May 2020 GPT-3: OpenAI LLM Aug 2021 Codex: OpenAI code model Apr 2022 DALLE 2: OpenAI image model Jan 2021 DALLE: OpenAI image model May 2021 LaMDA: Google LLM May 2022 Imagen: Google image model PaLM: Google LLM Gato: DeepMind multimodal model Aug 2022 Stable Diffusion: Image model
  • 12. Major Generative AI Milestones: Part 2 Nov 2022 ChatGPT: Consumer chat from OpenAI initially featuring GPT 3.5 Feb 2023 Bard: Consumer chat from Google Mar 2023 GPT-4: OpenAI flagship model ChatGPT Plugins: Connect to third-party applications Apr 2023 CodeWhisperer: AWS AI coding assistant July 2023 Llama 2: Meta open source LLM licensed for commercial use. Code Interpreter: OpenAI integrated sandbox environment for data upload and analysis Aug 2023 Duet AI: AI Assistant for Google Cloud, including chat in console, and general purpose (VSCode) and SQL (BigQuery) code completion/interpretation Sept 2023 DALLE 3: OpenAI image model Dec 2023 Gemini: Google flagship multimodal (text / image / video) models Feb 2024 Gemini Pro 1.5: 1M context multimodal model Gemma: Google open model Sora: OpenAI text to video May 2023 Vertex AI Gen AI: including curated set of Google, third-party, and open models PaLM 2: Google flagship model Nov 2023 Q: AWS chatbot Grok: X chatbot Mar 2024 Claude 3: Anthropic flagship models Devin: Cognition SWE AI
  • 14. The Emerging LLM Ecosystem Examples Description Use Case Vector databases ● Pinecone ● Chroma ● Vertex AI Vector Search Store and find associations between embeddings, high-dimensional vector representations of data Grounding LLM responses in a set of documents (example of RAG) Encapsulated coding environments OpenAI Code Interpreter / Advanced Data Analysis Upload datasets & ask questions to get visualizations and code running in a limited Python instance Ad hoc data analysis Plugins / extensions ● ChatGPT plugins / GPTs ● Vertex AI extensions Connect LLMs to third-party / external applications Access current data / query & modify data that is external to the LLM LLM app development frameworks ● LangChain ● LlamaIndex ● Autogen LLM-centric framework to manage workflow (data sources, agents, models, etc) Assembling LLM-based applications
  • 15. Generative AI Landscape by Vendor Vendor Prod. Suite Assistance Developer / Ops Assistant Consumer Chat Enterprise Gen AI Dev / Hobbyist Gen AI Open Foundation Models Google Gemini for Google Workspace Duet AI for Google Cloud Gemini Vertex AI Google AI for Developers Gemma Microsoft CoPilot 365 Github Copilot Bing Chat Azure OpenAI OpenAI ChatGPT ChatGPT ChatGPT Enterprise ChatGPT AWS ● Q ● CodeWhisperer Bedrock / Titan Anthropic Claude 3* Claude 3* Meta Llama 2* Mistral Mixtral 8x7B*
  • 17. RAG in 2024 Jerry Liu, LlamaIndex co-founder/CEO
  • 19. Paradigms for inserting knowledge Retrieval Augmentation - Fix the model, put context into the prompt LLM Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep... Input Prompt Here is the context: Before college the two main things… Given the context, answer the following question: {query_str}
  • 20. Paradigms for inserting knowledge Fine-tuning - baking knowledge into the weights of the network LLM Before college the two main things I worked on, outside of school, were writing and programming. I didn't write essays. I wrote what beginning writers were supposed to write then, and probably still are: short stories. My stories were awful. They had hardly any plot, just characters with strong feelings, which I imagined made them deep... RLHF, Adam, SGD, etc.
  • 22. Current RAG Stack for building a QA System Vector Database Doc Chunk Chunk Chunk Chunk Chunk Chunk Chunk LLM Data Ingestion / Parsing Data Querying 5 Lines of Code in LlamaIndex!
  • 23. Current RAG Stack (Data Ingestion/Parsing) Vector Database Doc Chunk Chunk Chunk Chunk Process: ● Split up document(s) into even chunks. ● Each chunk is a piece of raw text. ● Generate embedding for each chunk (e.g. OpenAI embeddings, sentence_transformer) ● Store each chunk into a vector database
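The ingestion steps on this slide can be sketched in a few lines of plain Python. This is a toy illustration, not LlamaIndex code: `embed()` is a stand-in for a real embedding model (e.g. OpenAI embeddings or sentence_transformers), and the "vector database" is just an in-memory list.

```python
# Illustrative ingestion sketch: split a document into even chunks,
# embed each chunk, and store (embedding, chunk) pairs in an in-memory store.
import math

def chunk(text: str, size: int = 256) -> list[str]:
    """Split text into even, fixed-size character chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def embed(text: str, dim: int = 8) -> list[float]:
    """Toy deterministic embedding: bag-of-characters folded into `dim` buckets."""
    vec = [0.0] * dim
    for ch in text:
        vec[ord(ch) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # unit-normalize for cosine similarity

def ingest(doc: str, store: list, size: int = 256) -> None:
    """Chunk the document and append (embedding, chunk) pairs to the store."""
    for c in chunk(doc, size):
        store.append((embed(c), c))

store: list = []
ingest("Before college the two main things I worked on were writing and programming. " * 10,
       store, size=64)
print(len(store))  # number of chunks stored
```

A production pipeline would swap in a real parser, embedding model, and vector database, but the shape (split, embed, store) is the same.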
  • 24. Current RAG Stack (Querying) Vector Database Chunk Chunk Chunk LLM Process: ● Find top-k most similar chunks from vector database collection ● Plug into LLM response synthesis module
  • 25. Current RAG Stack (Querying) Vector Database Chunk Chunk Chunk LLM Process: ● Find top-k most similar chunks from vector database collection ● Plug into LLM response synthesis module Retrieval Synthesis
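The querying side can be sketched the same way. Everything here is mocked: the embeddings are hand-written 2-D vectors and `synthesize()` stands in for the LLM response synthesis module.

```python
# Illustrative querying sketch: embed the query, find the top-k most similar
# chunks by cosine similarity, then hand them to a (mocked) synthesis step.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

def retrieve(query_vec, store, k=2):
    """Return the top-k chunks whose embeddings are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def synthesize(question, context_chunks):
    """Mock LLM response synthesis: a real system prompts an LLM with the context."""
    context = "\n".join(context_chunks)
    return f"Answer to {question!r} based on:\n{context}"

# Toy vector store: (embedding, chunk) pairs.
store = [
    ([1.0, 0.0], "Tesla's main risk factors include supply chain issues."),
    ([0.9, 0.1], "Tesla 10-K risk section continues."),
    ([0.0, 1.0], "The author wrote short stories before college."),
]
print(synthesize("What are Tesla's risk factors?", retrieve([1.0, 0.0], store, k=2)))
```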
  • 30. RAG Data Parsing & Ingestion Data Querying Index Data Data Parsing + Ingestion Retrieval LLM + Prompts Response
  • 31. Naive RAG PyPDF Sentence Splitting Chunk Size 256 Simple QA Prompt Dense Retrieval Top-k = 5 Index Data Data Parsing + Ingestion Retrieval LLM + Prompts Response
  • 32. Easy to Prototype, Hard to Productionize Naive RAG approaches tend to work well for simple questions over a simple, small set of documents. ● “What are the main risk factors for Tesla?” (over Tesla 2021 10K) ● “What did the author do during his time at YC?” (Paul Graham essay)
  • 33. Easy to Prototype, Hard to Productionize But productionizing RAG over more questions and a larger set of data is hard! Failure Modes: ● Response Quality: Bad Retrieval, Bad Response Generation ● Hard to Improve: Too many parameters to tune ● Systems: Latency, Cost, Security
  • 35. Challenges with Naive RAG (Response Quality) ● Bad Retrieval ○ Low Precision: Not all chunks in retrieved set are relevant ■ Hallucination + Lost in the Middle Problems ○ Low Recall: Not all relevant chunks are retrieved. ■ Lacks enough context for LLM to synthesize an answer ○ Outdated information: The data is redundant or out of date.
  • 36. Challenges with Naive RAG (Response Quality) ● Bad Retrieval ○ Low Precision: Not all chunks in retrieved set are relevant ■ Hallucination + Lost in the Middle Problems ○ Low Recall: Not all relevant chunks are retrieved. ■ Lacks enough context for LLM to synthesize an answer ○ Outdated information: The data is redundant or out of date. ● Bad Response Generation ○ Hallucination: Model makes up an answer that isn’t in the context. ○ Irrelevance: Model makes up an answer that doesn’t answer the question. ○ Toxicity/Bias: Model makes up an answer that’s harmful/offensive.
  • 37. Difference with Traditional Software Data Extract Response Traditional software is defined by a set of programmatic rules. Given an input, you can easily reason about the expected output. Transform Load
  • 38. Difference with Traditional Software AI-powered software is defined by a black-box set of parameters. It is really hard to reason about what the function space looks like. The model parameters are tuned, the surrounding parameters (prompt templates) are not. Index Data Data Parsing + Ingestion Retrieval LLM + Prompts Response
  • 39. Difference with Traditional Software If one component of the system is a black-box, all components of the system become black boxes. The more components, the more parameters you have to tune. Index Data Data Parsing + Ingestion Retrieval LLM + Prompts Response
  • 40. Difference with Traditional Software If one component of the system is a black-box, all components of the system become black boxes. Every parameter affects the performance of the end system. Index Data Data Parsing + Ingestion Retrieval LLM + Prompts Response RAG
  • 41. There’s Too Many Parameters Every parameter affects the performance of the entire RAG pipeline. Which parameters should a user tune? There’s too many options! Index Data Data Parsing + Ingestion Retrieval LLM + Prompts Response Which PDF parser should I use? How do I chunk my documents? How do I process embedded tables and charts? Which embedding model should I use? What retrieval parameters should I use? Dense retrieval or sparse? Which LLM should I use?
  • 42. Mapping Pain Points to Solutions
  • 43. Solution Categorize by pain point, and establish best practices
  • 44. Solution Categorize by pain point, and establish best practices “Seven Failure Points When Engineering a Retrieval Augmented Generation System”, Barnett et al.
  • 45. Solution Categorize by pain point, and establish best practices “12 RAG Pain Points and Proposed Solutions”, by Wenqi Glantz
  • 46. Pain Points Response Quality Related 1. Context Missing in the Knowledge Base 2. Context Missing in the Initial Retrieval Pass 3. Context Missing After Reranking 4. Context Not Extracted 5. Output is in Wrong Format 6. Output has Incorrect Level of Specificity 7. Output is Incomplete
  • 47. Pain Points Scalability 8. Can't Scale to Larger Data Volumes 11. Rate-Limit Errors Security 12. LLM Security Use Case Specific 9. Ability to QA Tabular Data 10. Ability to Parse PDFs
  • 48. Pain Points Scalability 8. Can't Scale to Larger Data Volumes 11. Rate-Limit Errors Security 12. LLM Security Use Case Specific 9. Ability to QA Tabular Data 10. Ability to Parse PDFs
  • 49. Let’s figure out solutions
  • 50. 1. Context Missing in the Knowledge Base Clean your data: Pick a good document parser (more on this later!) Add in Metadata: inject global context to each chunk Keep your data updated: Set up a recurring data ingestion pipeline. Upsert documents to prevent duplicates.
  • 51. 2. Context Missing in the Initial Retrieval Pass Solution: Hyperparameter tuning for chunk size and top-k Solution: Reranking Source: ColBERT
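The reranking idea can be illustrated as a two-stage pipeline: a cheap first pass pulls candidates, a slower reranker rescores them against the query. The scorers below are word-overlap heuristics standing in for a dense retriever and a cross-encoder reranker such as ColBERT; only the control flow matches a real system.

```python
# Illustrative two-stage retrieval sketch: cheap candidate retrieval,
# then a (mocked) reranker that rescores candidates and keeps the best.
def first_pass(query, corpus, k=4):
    """Cheap candidate retrieval: rank by shared word count (stand-in for dense retrieval)."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda d: len(q & set(d.lower().split())), reverse=True)[:k]

def rerank(query, candidates, k=2):
    """Mock reranker: rescore candidates by word-overlap ratio and keep top-k."""
    q = set(query.lower().split())
    def score(d):
        words = set(d.lower().split())
        return len(q & words) / max(len(words), 1)  # favors focused, relevant chunks
    return sorted(candidates, key=score, reverse=True)[:k]

corpus = [
    "Uber revenue growth accelerated in 2021",
    "Uber operates a rideshare platform with many drivers and a large global footprint",
    "Lyft revenue growth in 2021",
    "weather report for Toronto",
]
print(rerank("Uber revenue growth 2021", first_pass("Uber revenue growth 2021", corpus)))
```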
  • 52. 3. Context Missing After Reranking Solution: try out fancier retrieval methods (small-to-big, auto-merging, auto-retrieval, ensembling, …) Solution: fine-tune your embedding models to task-specific data
  • 53. 4. Context is there, but not extracted by the LLM The context is there, but the LLM doesn’t understand it. “Lost in the middle” Problems. https://x.com/GregKamradt/status/1722386725635580292?s=20
  • 54. 4. Context is there, but not extracted by the LLM Solution: Prompt Compression (LongLLMLingua) Solution: LongContextReorder LongLLMLingua by Jiang et al.
  • 56. 5. Output is in Wrong Format A lot of use cases require outputting the answer in JSON format. Solutions: Better text prompting/output parsing Use OpenAI function calling + JSON mode Use token-level prompting (LMQL, Guidance) Source: Guidance
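A minimal sketch of the output-parsing approach, assuming the application wants a JSON object with fixed keys; the raw string stands in for an LLM reply that wraps JSON in extra prose, as models often do. In production, OpenAI function calling / JSON mode or token-level prompting (LMQL, Guidance) replace this manual repair step.

```python
# Illustrative output-parsing sketch: extract and validate a JSON object
# from raw (mocked) LLM output, raising if the expected shape is missing.
import json

def parse_json_answer(raw: str, required_keys=("answer", "sources")) -> dict:
    """Extract and validate a JSON object from raw LLM output."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    data = json.loads(raw[start:end + 1])
    missing = [k for k in required_keys if k not in data]
    if missing:
        raise ValueError(f"missing keys: {missing}")
    return data

# Mock LLM output that wraps the JSON in prose.
raw = 'Sure! Here is the result:\n{"answer": "42", "sources": ["doc1"]}\nHope that helps.'
print(parse_json_answer(raw))
```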
  • 57. 7. Incomplete Answer What if you have a complex multi-part question? Naive RAG is primarily good for answering simple questions about specific facts.
  • 58. 7. Incomplete Answer Solution: Add Agentic Reasoning Agents? RAG Query Response Simple Lower Cost Lower Latency Advanced Higher Cost Higher Latency Routing One-Shot Query Planning Tool Use ReAct Dynamic Planning + Execution
  • 59. 8. Scaling your Data Pipeline Pain points: ● Processing thousands/millions of docs is slow ● How do we efficiently handle document updates?
  • 60. 8. Scaling your Data Pipeline Pain points: ● Processing thousands/millions of docs is slow ● How do we efficiently handle document updates? Reference Production Ingestion Stack ● Parallelize document processing ● HuggingFace TEI ● RabbitMQ Message Queue ● AWS EKS clusters https://github.com/run-llama/llamaindex_aws_ingestion
  • 61. 10. Proper RAG over Complex Documents
  • 62. How do we model complex docs with embedded tables? RAG with naive chunking + retrieval → leads to hallucinations! Embedded Table Advanced Retrieval: Embedded Tables
  • 63. Advanced Retrieval: Embedded Tables Instead: model data hierarchically. Index tables/figures by their summaries. The only missing component: how do I parse out the tables from the data?
  • 64. Most PDF Parsing is Inadequate Extracts into a messy format that is impossible to pass down into more advanced ingestion/retrieval algorithms.
  • 65. Introducing LlamaParse A genAI-native parser designed to let you build RAG over complex documents https://github.com/run-llama/llama_parse
  • 66. Introducing LlamaParse Capabilities ✅ Extracts tables / charts ✅ Input natural language parsing instructions ✅ JSON mode ✅ Image Extraction ✅ Support for ~10+ document types (.pdf, .pptx, .docx, .xml)
  • 68. LlamaParse Results The best parser at table extraction == the only parser for advanced RAG Expanded: https://drive.google.com/file/d/1fyQAg7nOtChQzhF2Ai7HEeKYYqdeWsdt/view?usp=sharing
  • 71. What’s next for RAG: Agents?
  • 73. From RAG to Agents Agents? RAG Query Response Agents? Agents?
  • 74. Agents? RAG Query Response From RAG to Agents Agents? Agent Definition: Using LLMs for automated reasoning and tool selection RAG is just one Tool: Agents can decide to use RAG with other tools Agents?
  • 75. From Simple to Advanced Agents Simple Lower Cost Lower Latency Advanced Higher Cost Higher Latency Routing One-Shot Query Planning Tool Use ReAct Dynamic Planning + Execution
  • 76. Routing Simplest form of agentic reasoning. Given user query and set of choices, output subset of choices to route query to.
  • 77. Routing Use Case: Joint QA and Summarization Guide
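Routing, the simplest form of agentic reasoning, can be sketched with a stand-in policy: a real router would ask an LLM to choose among the engines, while the toy version below uses a keyword heuristic so the control flow is visible.

```python
# Illustrative routing sketch for the joint QA + summarization use case:
# pick which engine handles a query. The heuristic stands in for an LLM choice.
def route(query: str) -> str:
    """Return the name of the engine the query should be sent to."""
    summarization_cues = ("summarize", "summary", "overview", "tl;dr")
    if any(cue in query.lower() for cue in summarization_cues):
        return "summary_engine"
    return "qa_engine"  # default: specific-fact question answering

engines = {
    "qa_engine": lambda q: f"[QA] answering: {q}",
    "summary_engine": lambda q: f"[SUMMARY] summarizing for: {q}",
}

for q in ("What were Uber's 2021 risk factors?", "Give me a summary of the 10-K"):
    print(engines[route(q)](q))
```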
  • 78. Query Planning Break down query into parallelizable sub-queries. Each sub-query can be executed against any set of RAG pipelines Uber 10-K chunk 4 top-2 Uber 10-K chunk 8 Lyft 10-K chunk 4 Lyft 10-K chunk 8 Compare revenue growth of Uber and Lyft in 2021 Uber 10-K Lyft 10-K Describe revenue growth of Uber in 2021 Describe revenue growth of Lyft in 2021 top-2
  • 79. Query Planning Example: Compare revenue of Uber and Lyft in 2021 Query Planning Guide Uber 10-K chunk 4 top-2 Uber 10-K chunk 8 Lyft 10-K chunk 4 Lyft 10-K chunk 8 Compare revenue growth of Uber and Lyft in 2021 Uber 10-K Lyft 10-K Describe revenue growth of Uber in 2021 Describe revenue growth of Lyft in 2021 top-2
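The Uber-vs-Lyft example above can be sketched end to end: plan sub-queries, execute them in parallel against per-document pipelines, combine. The pipelines are mocks; a real planner would use an LLM both to decompose the query and to synthesize the final comparison.

```python
# Illustrative query-planning sketch: decompose a comparison query into
# parallelizable sub-queries, run each against its own (mocked) RAG pipeline.
from concurrent.futures import ThreadPoolExecutor

def plan(query: str, entities: list[str]) -> list[str]:
    """Rewrite one comparison query into one sub-query per entity."""
    return [f"Describe revenue growth of {e} in 2021" for e in entities]

# Mocked per-document RAG pipelines (stand-ins for Uber 10-K / Lyft 10-K indexes).
pipelines = {
    "Uber": lambda q: f"[Uber 10-K] mock answer for: {q}",
    "Lyft": lambda q: f"[Lyft 10-K] mock answer for: {q}",
}

def execute(query: str, entities: list[str]) -> str:
    sub_queries = plan(query, entities)
    with ThreadPoolExecutor() as pool:  # sub-queries are independent, so run in parallel
        answers = list(pool.map(lambda pair: pipelines[pair[0]](pair[1]),
                                zip(entities, sub_queries)))
    return " vs ".join(answers)  # a real system would synthesize with an LLM

print(execute("Compare revenue growth of Uber and Lyft in 2021", ["Uber", "Lyft"]))
```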
  • 80. Tool Use Use an LLM to call an API Infer the parameters of that API
  • 81. Tool Use In normal RAG you just pass through the query. But what if you used the LLM to infer all the parameters for the API interface? A key capability in many QA use cases (auto-retrieval, text-to-SQL, and more)
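A sketch of parameter inference for tool use: instead of passing the raw query through, the system infers structured API parameters and calls the tool with them. A rule-based extractor stands in for the LLM's function-calling step, and `revenue_api` is a hypothetical tool.

```python
# Illustrative tool-use sketch: infer structured API parameters from a query,
# then invoke the tool with them (the basis of auto-retrieval, text-to-SQL, etc.).
import re

def infer_params(query: str) -> dict:
    """Stand-in for LLM parameter inference: pull company and year out of the query."""
    year = re.search(r"\b(20\d{2})\b", query)
    company = re.search(r"\b(Uber|Lyft)\b", query, re.IGNORECASE)
    return {
        "company": company.group(1).title() if company else None,
        "year": int(year.group(1)) if year else None,
    }

def revenue_api(company: str, year: int) -> str:
    """Hypothetical tool: a real agent might call a SQL engine or REST API here."""
    return f"[tool] revenue record for {company}, fiscal year {year}"

params = infer_params("What was Uber's revenue in 2021?")
print(revenue_api(**params))
```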
  • 82. This is cool but ● How can an agent tackle sequential multi-part problems? ● How can an agent maintain state over time?
  • 83. This is cool but ● How can an agent tackle sequential multi-part problems? ○ Let’s make it loop ● How can an agent maintain state over time? ○ Let’s add basic memory
  • 84. Data Agents - Core Components Agent Reasoning Loop ● ReAct Agent (any LLM) ● OpenAI Agent (only OAI) Tools Query Engine Tools (RAG pipeline) LlamaHub Tools (30+ tools to external services)
  • 85. ReAct: Reasoning + Acting with LLMs Source: https://react-lm.github.io/
  • 86. ReAct: Reasoning + Acting with LLMs Add a loop around query decomposition + tool use
  • 87. ReAct: Reasoning + Acting with LLMs Superset of query planning + routing capabilities. ReAct + RAG Guide
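The ReAct loop (reason, act, observe, repeated until an answer) can be sketched with a mocked LLM policy and a mock search tool; the scratchpad that accumulates observations is what a real agent feeds back into each LLM call.

```python
# Illustrative ReAct-style loop sketch: the (mocked) LLM either picks a tool
# call (Action) or finishes (Answer); observations are appended to a scratchpad.
def mock_llm(scratchpad: list[str]) -> str:
    """Stand-in LLM policy: call the search tool once, then answer."""
    if not any(line.startswith("Observation:") for line in scratchpad):
        return "Action: search[Uber 2021 revenue]"
    return "Answer: found it in the retrieved context"

def search_tool(arg: str) -> str:
    """Mock RAG tool."""
    return f"retrieved chunks about {arg}"

def react(question: str, max_steps: int = 5) -> str:
    scratchpad = [f"Question: {question}"]
    for _ in range(max_steps):  # loop: reason -> act -> observe
        step = mock_llm(scratchpad)
        scratchpad.append(step)
        if step.startswith("Answer:"):
            return step.removeprefix("Answer: ")
        tool_arg = step[step.index("[") + 1 : step.index("]")]
        scratchpad.append(f"Observation: {search_tool(tool_arg)}")
    return "gave up after max_steps"

print(react("What was Uber's revenue growth in 2021?"))
```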
  • 88. Can we make this even better? ● Stop being so short-sighted - plan ahead at each step ● Parallelize execution where we can
  • 89. LLMCompiler Kim et al. 2023 An agent compiler for parallel multi-function planning + execution.
  • 90. LLMCompiler Plan out steps beforehand, and replan as necessary LLMCompiler Agent
  • 91. Tree-based Planning Tree of Thoughts (Yao et al. 2023) Reasoning via Planning (Hao et al. 2023) Language Agent Tree Search (Zhou et al. 2023)
  • 92. Additional Requirements ● Observability: see the full trace of the agent ○ Observability Guide ● Control: Be able to guide the intermediate steps of an agent step-by-step ○ Lower-Level Agent API ● Customizability: Define your own agentic logic around any set of tools. ○ Custom Agent Guide ○ Custom Agent with Query Pipeline Guide
  • 93. Additional Requirements Possible through our query pipeline syntax Query Pipeline Guide
  • 94. What’s next for RAG: Long Contexts?
  • 95. Is RAG Dead? https://x.com/Francis_YAO_/status/1759962812229800012?s=20 Gemini 1.5 Pro has a 1-10M context window. What does this mean for RAG?
  • 96. Our Position 1. Frameworks are valuable whether or not RAG lives or dies 2. Certain RAG concepts will go away, but others will remain and evolve
  • 97. Long Context LLMs will Solve the Following 1. Developers will worry less about tuning chunking algorithms 2. Developers will need to spend less time tuning retrieval and chain-of-thought over single documents 3. Summarization will be easier 4. Personalized memory will be better and easier to build
  • 98. Some Challenges Remain 1. 10M tokens is not enough for large document corpuses (hundreds of MB, GB) 2. Embedding models are lagging behind in context length 3. Cost and Latency 4. A KV Cache takes up a significant amount of GPU memory, and has sequential dependencies
  • 99. New RAG Architectures 1. Small to Big Retrieval over Documents 2. Intelligent Routing for Latency/Cost Tradeoffs 3. Retrieval Augmented KV Caching
  • 100. Small to Big Retrieval over Documents
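Small-to-big retrieval can be sketched as matching on small child chunks for precision but returning their larger parent sections so the LLM gets fuller context. Word overlap stands in for embedding similarity, and the documents are invented.

```python
# Illustrative small-to-big retrieval sketch: index small child chunks,
# return the big parent section the best-matching child belongs to.
parents = {
    "sec-1": "Section 1: Uber revenue grew strongly in 2021. Bookings also rose.",
    "sec-2": "Section 2: Lyft focused on rideshare recovery after the pandemic.",
}

# Small chunks, each linked to its parent section.
children = [
    ("Uber revenue grew strongly in 2021.", "sec-1"),
    ("Bookings also rose.", "sec-1"),
    ("Lyft focused on rideshare recovery", "sec-2"),
]

def small_to_big(query: str) -> str:
    """Match against small chunks, return the big parent context."""
    q_words = set(query.lower().split())
    best = max(children, key=lambda c: len(q_words & set(c[0].lower().split())))
    return parents[best[1]]

print(small_to_big("How did Uber revenue grow in 2021?"))
```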
  • 101. Intelligent Routing for Latency/Cost Tradeoffs
  • 103. www.ServerlessToronto.org Reducing the gap between IT and Business needs