DALL-E 3: "A detailed graphic that visualizes a multimodal vector embedding space"
Multimodal LLMs
• What are Multimodal Language Models
• Background / How do they work
• LLaVA papers/projects
• LLaVA model demonstration
• Image classification project with LLaVA
Robert McDermott (he/him)
Director: Solutions, Engineering & Architecture (SEA)
rmcdermo@fredhutch.org
Deep Learning Affinity Group (DLAG)
https://research.fredhutch.org/dlag/en.html
Feb 20, 2024
Who Am I?
Link: AI Robert
The Papers to Read
LLaVA 1.0
https://arxiv.org/abs/2304.08485
https://arxiv.org/pdf/2304.08485.pdf
https://llava-vl.github.io/
https://github.com/haotian-liu/LLaVA
LLaVA 1.5
https://arxiv.org/abs/2310.03744
https://arxiv.org/pdf/2310.03744.pdf
https://huggingface.co/liuhaotian/llava-v1.5-13b
LLaVA-Med
https://arxiv.org/abs/2306.00890
https://arxiv.org/pdf/2306.00890.pdf
https://huggingface.co/microsoft/llava-med-7b-delta
https://github.com/microsoft/LLaVA-Med
Multimodal Language Models
Multimodal language models are AI systems designed to understand, interpret, and generate information across different
forms of data, such as text and images. These models leverage large datasets of annotated examples to learn associations
between text and visual content, enabling them to perform tasks that require comprehension of both textual and visual
information.
Example (diagram): a multimodal language model accepts text and image inputs and responds with text.
• Text input: "Why is the sky blue?" → "The sky appears blue because molecules in the Earth's atmosphere scatter the shorter blue wavelengths of sunlight more than other colors."
• Image input (a photo of a hang glider) → "A person wearing a red cap and sleeveless outfit is soaring through a cloudless sky on a brightly colored hang glider."
• Text input: "I like pizza" (another example prompt shown in the diagram)
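To make the flow concrete, here is a minimal sketch of sending an image plus a text question to a vision-capable chat model through the OpenAI Python client; the model name, image URL, and prompt are placeholders, not part of the slides.

```python
# Minimal sketch: ask a multimodal model a question about an image.
# Assumes the openai package is installed and OPENAI_API_KEY is set;
# the model name and image URL below are placeholders.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # any vision-capable chat model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe what is happening in this photo."},
            {"type": "image_url",
             "image_url": {"url": "https://example.com/hang-glider.jpg"}},
        ],
    }],
    max_tokens=200,
)
print(response.choices[0].message.content)
```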
Multimodal Language Models
Source: https://twitter.com/GregKamradt/status/1711772496159252981
Use Case Breakdown
Describe
• Animal Identification
• What's in this photo
Interpret
• Technical Flame Graph Interpretation
• Schematic Interpretation
• Twitter Thread Explainer
Recommend
• Food Recommendations
• Website Feedback
• Painting Feedback
Convert
• Figma Screens
• Adobe Lightroom Settings
• Suggest ad copy based on a webpage
Extract
• Structured Data From Driver's License
• Extract structured items from an image
• Handwriting Extraction
Assist
• Excel Formula Helper
• Find My Glasses
• Live Poker Advice
• Video game recommendations
Evaluate
• Dog Cuteness Evaluator
• Bounding Box Evaluator
• Thumbnail Testing
Links to Examples
AI Vision has come a long way.
2012: the state of computer vision as described by Andrej Karpathy, research scientist and founding member at OpenAI, and Sr. Director of AI at Tesla (source: https://karpathy.github.io/2012/10/22/state-of-computer-vision/).
2024: responses from GPT-4 Vision and LLaVA 1.6 34B.
What’s funny about this?
Image source: https://www.reddit.com/r/hmmm/comments/ubab5v/hmmm/
(Responses from LLaVA 1.6 34B and GPT-4 Vision shown side by side.)
What’s unusual about this image?
(Responses from LLaVA 1.6 34B and GPT-4 Vision shown side by side.)
Quick Introduction to Tokens and Embeddings
(Required to understand how LLMs process text and images.)
Text Tokenization
Tokenization is a foundational step in the preprocessing of text for many natural language processing (NLP) tasks, including for language
models like GPT-4 and Llama-2. Tokenization involves breaking down text into smaller chunks, or "tokens", which can be as short as one
character or as long as one word (or even longer in some cases). These tokens can then be processed, analyzed, and used as input for
machine learning models.
Tokenization visualized, with the resulting token IDs: https://platform.openai.com/tokenizer
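As a small illustration (assuming the tiktoken package, which exposes the tokenizers used by OpenAI models, is installed), text can be turned into token IDs and back:

```python
# Sketch: split text into tokens and token IDs with tiktoken.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")          # load the GPT-4 tokenizer
token_ids = enc.encode("Tokenization breaks text into smaller chunks.")

print(token_ids)                                    # a list of integer token IDs
print([enc.decode([t]) for t in token_ids])         # the text piece behind each ID
print(enc.decode(token_ids))                        # round-trips back to the original text
```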
Vector Embeddings
Applications
• Natural Language Processing tasks: sentiment analysis, named entity recognition, etc.
• Information retrieval: search engines, recommendation systems.
• Visualization: using dimensionality reduction to visualize semantic relationships.
Example: the sentence "A fat tuxedo cat" embedded with the "all-MiniLM-L6-v2" model* (excerpt):
5.41765615e-02 4.20716889e-02 -2.41547506e-02 1.11813843e-01
-9.33169946e-02 -7.56109739e-03 6.54651076e-02 -1.54011259e-02
-2.80906167e-02 1.97344255e-02 -1.58324391e-02 -8.46638903e-02
-1.31631363e-02 1.98841579e-02 -1.26802064e-02 -9.36008468e-02
-4.51933630e-02 -1.20324306e-02 -2.48974599e-02 4.87890420e-03
-2.54017510e-03 4.92022634e-02 5.12179844e-02 2.54505035e-02
-9.70738381e-02 1.42842624e-02 -3.46412621e-02 -8.45314115e-02
-7.38010108e-02 -2.72879936e-02 -2.81507652e-02 -5.01780510e-02
5.35405474e-03 2.96438616e-02 -5.18742464e-02 -6.24342896e-02
6.04359470e-02 -2.22260728e-02 3.36266570e-02 5.17647602e-02
-3.09484527e-02 -8.72448832e-02 -1.53413722e-02 9.27508809e-03
-4.92608221e-03 -4.97105941e-02 -1.04904985e-02 -4.15333314e-03
1.55722797e-02 -2.66851094e-02 -6.49709478e-02 -5.94373941e-02
-2.10976638e-02 3.59102758e-03 5.88850211e-03 -1.03685725e-02
5.03626876e-02 -3.31290103e-02 -7.70502910e-02 1.53052341e-02
* The "all-MiniLM-L6-v2" embedding model has 384 dimensions: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Definition
• Representations of text in numerical form.
• Convert variable-length text into fixed-size vectors in high-dimensional space.
Purpose
• Capture semantic meaning and relationships between words, phrases, or longer text.
• Enable mathematical operations on text (e.g., similarity measurement, arithmetic operations).
Characteristics
• Words with similar meanings are close in vector space.
• Allows for operations like "king" - "man" + "woman" ≈ "queen".
There are many embedding models; see the MTEB leaderboard: https://huggingface.co/spaces/mteb/leaderboard
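A minimal sketch of producing an embedding like the one above, assuming the sentence-transformers package is installed (the exact values depend on the library and model version):

```python
# Sketch: embed a sentence with the all-MiniLM-L6-v2 model.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # downloads the model on first use
vector = model.encode("A fat tuxedo cat")

print(vector.shape)   # (384,) -- one fixed-size vector per input text
print(vector[:8])     # the first few of the 384 dimensions
```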
Vector Embeddings
• There are several dozen embedding models
• They range in size from 384 to 1,536 dimensions
• They range in max sequence length from 512 to 8191 tokens
• Embedding models are generally not compatible with each other
Interactive embedding explorer:
https://blog.echen.me/embedding-explorer/
Semantic Text Similarity
Sentence 1 | Sentence 2 | Cosine Similarity
The cat sits outside | The dog plays in the garden | 0.2838
A man is playing guitar | A woman watches TV | -0.0327
The new movie is awesome | The new movie is so great | 0.8939
Jim can run very fast | James is the fastest runner | 0.6844
My goldfish is hungry | Pluto is a planet! | 0.0454
• Measures the cosine of the angle between two vectors.
• Value between -1 and 1; where 1 means vectors are identical, 0 means
orthogonal, and -1 means diametrically opposite (rare in text embeddings).
These clearly used different
embedding models
https://gist.github.com/robert-mcdermott/67cf2623237989bc2315d35a108246ef
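As a sketch of how numbers like those in the table are computed (using sentence-transformers with all-MiniLM-L6-v2 as an example model; scores vary by model, which is the point of the note above):

```python
# Sketch: cosine similarity between sentence pairs via embeddings.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

pairs = [
    ("The cat sits outside", "The dog plays in the garden"),
    ("A man is playing guitar", "A woman watches TV"),
    ("The new movie is awesome", "The new movie is so great"),
]

for s1, s2 in pairs:
    e1, e2 = model.encode(s1), model.encode(s2)
    score = util.cos_sim(e1, e2).item()   # cosine of the angle between the two vectors
    print(f"{s1!r} vs {s2!r}: {score:.4f}")
```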
Embeddings Plot Tool
https://github.com/robert-mcdermott/embeddings_plot
A command-line utility I created to visualize word embeddings. Embedding-plot renders word embeddings in 2D or 3D scatter plots using dimensionality reduction (PCA, t-SNE, or UMAP) together with clustering.
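The linked repository has the actual implementation; the sketch below only illustrates the same basic idea (embed words, reduce to 2D with PCA, cluster, plot), using sentence-transformers, scikit-learn, and matplotlib as stand-ins rather than the tool's real code:

```python
# Sketch: visualize word embeddings in 2D with PCA and k-means clusters.
import matplotlib.pyplot as plt
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

words = ["cat", "dog", "kitten", "puppy", "car", "truck", "bicycle", "apple", "banana", "pear"]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(words)                          # shape: (len(words), 384)

coords = PCA(n_components=2).fit_transform(embeddings)    # reduce 384 -> 2 dimensions
labels = KMeans(n_clusters=3, random_state=0).fit_predict(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels)
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))                             # label each point with its word
plt.title("Word embeddings reduced to 2D with PCA")
plt.show()
```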
Image Embedding Example
source: https://www.researchgate.net/publication/282181243_Learning_Visual_Clothing_Style_with_Heterogeneous_Dyadic_Co-Occurrences
“Visualization of a 2D embedding of the
style space trained with strategic sampling
computed with t-SNE. The embedding is
based on 200,000 images from the test set.
For a clear visual representation, we
discretize the style space into a grid and
pick one image from each grid cell at
random.”
CLIP (Contrastive Language-Image Pre-Training)
Source: Paper: https://arxiv.org/pdf/2103.00020.pdf Code: https://github.com/OpenAI/CLIP
High-Level Architecture
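As a usage sketch following the linked CLIP repository (the image file and candidate captions are placeholders): CLIP embeds an image and a set of captions into the same space and scores how well each caption matches the image.

```python
# Sketch: score an image against candidate captions with OpenAI's CLIP package.
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)   # placeholder image
text = clip.tokenize(["a photo of a cat", "a photo of a dog", "a diagram"]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).cpu().numpy()

print(probs)   # e.g. [[0.95 0.04 0.01]] -- probability of each caption matching the image
```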
The LLaVA Papers
LLaVA 1.0 – Large Language and Vision Assistant
• https://arxiv.org/abs/2304.08485
• https://arxiv.org/pdf/2304.08485.pdf
• https://llava-vl.github.io/
• https://github.com/haotian-liu/LLaVA
Abstract
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
LLaVA 1.0 Training
LLaVA 1.0 Architecture
Tokenizing + Embedding
LLaVA 1.0 Performance
LLaVA 1.5 – Large Language and Vision Assistant
• https://arxiv.org/abs/2310.03744
• https://arxiv.org/pdf/2310.03744.pdf
• https://huggingface.co/liuhaotian/llava-v1.5-13b
Abstract
Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
LLaVA 1.5 Changes
With simple modifications to LLaVA, namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, the authors establish stronger baselines that achieve state-of-the-art results across 11 benchmarks.
LLaVA 1.5
Training Data Sources / Performance
LLaVA 1.5 Comparison Examples
LLaVA 1.6
LLaVA-1.6-34B outperforms Gemini Pro on several benchmarks
https://llava-vl.github.io/blog/2024-01-30-llava-1-6/
Benchmarks / Example
LLaVA-Med
• https://arxiv.org/abs/2306.00890
• https://arxiv.org/pdf/2306.00890.pdf
Abstract
Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision-language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, fine-tuning LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
https://github.com/microsoft/LLaVA-Med
https://huggingface.co/microsoft/llava-med-7b-delta
LLaVA-Med
LLaVA-Med Comparison Examples
My LLaVA based Image Classifier Experiment
Full details, results and code: https://github.com/robert-mcdermott/LLM-Image-Classification
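The repository above has the full code and results; as a rough sketch of the general approach, an image can be sent to a locally hosted LLaVA model together with a constrained classification prompt. The example below uses Ollama's REST API, which is an assumption about the setup rather than the project's exact method, and the file name and labels are placeholders.

```python
# Sketch: zero-shot image classification with a local LLaVA model served by Ollama.
import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"   # default local Ollama endpoint

def classify_image(image_path: str, labels: list[str]) -> str:
    """Ask LLaVA to pick exactly one label for the image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()

    prompt = ("Classify this image. Answer with exactly one word from this list: "
              + ", ".join(labels))

    resp = requests.post(OLLAMA_URL, json={
        "model": "llava",          # any pulled LLaVA tag, e.g. llava:13b
        "prompt": prompt,
        "images": [image_b64],     # Ollama accepts base64-encoded images
        "stream": False,
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["response"].strip()

# Example call with placeholder file name and labels:
print(classify_image("samples/0001.png", ["cat", "dog", "horse", "chicken"]))
```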
LLaVA 1.5: Handwritten Digit Classification
https://github.com/robert-mcdermott/LLM-Image-Classification
LLaVA 1.5: Animal Classification
https://github.com/robert-mcdermott/LLM-Image-Classification
LLaVA 1.5: Chess Piece Identification
https://github.com/robert-mcdermott/LLM-Image-Classification
Live Demo Time
Thank you
Robert McDermott (he/him)
Director: Solutions, Engineering & Architecture (SEA)
rmcdermo@fredhutch.org