SlideShare a Scribd company logo
1 of 35
Download to read offline
DALL-E 3: "A detailed graphic that visualizes a multimodal vector embedding space"
Multimodal LLMs
• What are Multimodal Language Models
• Background / How do they work
• LLaVA papers/projects
• LLaVA model demonstration
• Image classification project with LLaVA
Robert McDermott (he/him)
Director: Solutions, Engineering & Architecture (SEA)
rmcdermo@fredhutch.org
Deep Learning Affinity Group (DLAG)
https://research.fredhutch.org/dlag/en.html
Feb 20, 2024
Who Am I?
2
Link: AI Robert
3
The Papers to Read
LLaVA 1.0
https://arxiv.org/abs/2304.08485
https://arxiv.org/pdf/2304.08485.pdf
https://llava-vl.github.io/
https://github.com/haotian-liu/LLaVA
https://arxiv.org/abs/2310.03744
https://arxiv.org/pdf/2310.03744.pdf
https://huggingface.co/liuhaotian/llava-v1.5-13b
https://arxiv.org/abs/2306.00890
https://arxiv.org/pdf/2306.00890.pdf
https://huggingface.co/microsoft/llava-med-7b-delta
https://github.com/microsoft/LLaVA-Med
LLaVA 1.5 LLaVA-Med
4
Multimodal Language Models
Multimodal language models are AI systems designed to understand, interpret, and generate information across different
forms of data, such as text and images. These models leverage large datasets of annotated examples to learn associations
between text and visual content, enabling them to perform tasks that require comprehension of both textual and visual
information.
Why is the
sky blue?
A person wearing a red cap and
sleeveless outfit is soaring through
a cloudless sky on a brightly
colored hang glider.
The sky appears blue because
molecules in the Earth's
atmosphere scatter sunlight the
shorter wavelength of blue more
than other colors.
Multimodal
Language
Model
I like pizza
5
Multimodal Language Models
Source: https://twitter.com/GregKamradt/status/1711772496159252981
Use Case Breakdown
Describe
• Animal Identification
• What's in this photo
Interpret
• Technical Flame Graph Interpretation
• Schematic Interpretation
• Twitter Thread Explainer
Recommend
• Food Recommendations
• Website Feedback
• Painting Feedback
Convert
• Figma Screens
• Adobe Lightroom Settings
• Suggest ad copy based on a webpage
Extract
• Structured Data From Driver's License
• Extract structured itemsfrom an image
• Handwriting Extraction
Assist
• Excel Formula Helper
• Find My Glasses
• Live Poker Advice
• Video game recommendations
Evaluate
• Dog Cuteness Evaluator
• Bounding Box Evaluator
• Thumbnail Testing
Links to Examples
6
AI Vision has come a long way.
GPT-4 Vision
LLaVA 1.6 34B
Research scientist and a founding member
at OpenAI. Sr. Director of AI at Telsa.
source: https://karpathy.github.io/2012/10/22/state-of-computer-vision/
2024
2012
7
What’s funny about this?
Image source: https://www.reddit.com/r/hmmm/comments/ubab5v/hmmm/
LLaVA 1.6 34B
GPT-4 Vision
What’s unusual about this image?
LLaVA 1.6 34B
GPT-4 Vision
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Quick Introduction to Tokens and Embeddings
required to understand how LLMs process text and
images.
9
Text Tokenization
10
Tokenization is a foundational step in the preprocessing of text for many natural language processing (NLP) tasks, including for language
models like GPT-4 and Llama-2. Tokenization involves breaking down text into smaller chunks, or "tokens", which can be as short as one
character or as long as one word (or even longer in some cases). These tokens can then be processed, analyzed, and used as input for
machine learning models.
https://platform.openai.com/tokenizer
Tokenization
Visualized
Resulting
Token IDs
11
Vector Embeddings
Applications
• Natural Language Processing tasks: sentiment analysis,
named entity recognition, etc.
• Information retrieval: search engines, recommendation
systems.
• Visualization: using dimensionality reduction to visualize
semantic relationships
https://huggingface.co/spaces/mteb/leaderboard
5.41765615e-02 4.20716889e-02 -2.41547506e-02 1.11813843e-01
-9.33169946e-02 -7.56109739e-03 6.54651076e-02 -1.54011259e-02
-2.80906167e-02 1.97344255e-02 -1.58324391e-02 -8.46638903e-02
-1.31631363e-02 1.98841579e-02 -1.26802064e-02 -9.36008468e-02
-4.51933630e-02 -1.20324306e-02 -2.48974599e-02 4.87890420e-03
-2.54017510e-03 4.92022634e-02 5.12179844e-02 2.54505035e-02
-9.70738381e-02 1.42842624e-02 -3.46412621e-02 -8.45314115e-02
-7.38010108e-02 -2.72879936e-02 -2.81507652e-02 -5.01780510e-02
5.35405474e-03 2.96438616e-02 -5.18742464e-02 -6.24342896e-02
6.04359470e-02 -2.22260728e-02 3.36266570e-02 5.17647602e-02
-3.09484527e-02 -8.72448832e-02 -1.53413722e-02 9.27508809e-03
-4.92608221e-03 -4.97105941e-02 -1.04904985e-02 -4.15333314e-03
1.55722797e-02 -2.66851094e-02 -6.49709478e-02 -5.94373941e-02
-2.10976638e-02 3.59102758e-03 5.88850211e-03 -1.03685725e-02
5.03626876e-02 -3.31290103e-02 -7.70502910e-02 1.53052341e-02
*
"A fat tuxedo cat" =
* The "all-MiniLM-L6-v2" embedding model has 384 dimensions
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Definition
• Representations of text in numerical form.
• Convert variable-length text into fixed-size vectors in high-
dimensional space.
Purpose
• Capture semantic meaning and relationships between words,
phrases, or longer text.
• Enable mathematical operations on text (e.g., similarity
measurement, arithmetic operations).
Characteristics
• Words with similar meanings are close in vector space.
• Allows for operations like "king" - "man" + "woman" ≈ "queen".
There are many embedding models:
12
Vector Embeddings
• There are several dozen embedding models
• They range in complexity from 384 to 1536 dimensions
• They range in max sequence length from 512 to 8191 tokens
• Embedding models are generally not compatible with each other
Interactive embedding explorer:
https://blog.echen.me/embedding-explorer/
Semantic Text Similarity
13
Sentence 1 Sentence 2 Cosine Similarity
The cat sits outside The dog plays in the garden 0.2838
A man is playing guitar A woman watches TV -0.0327
The new movie is awesome The new movie is so great 0.8939
Jim can run very fast James is the fastest runner 0.6844
My goldfish is hungry Pluto is a planet! 0.0454
• Measures the cosine of the angle between two vectors.
• Value between -1 and 1; where 1 means vectors are identical, 0 means
orthogonal, and -1 means diametrically opposite (rare in text embeddings).
These clearly used different
embedding models
https://gist.github.com/robert-mcdermott/67cf2623237989bc2315d35a108246ef
Embeddings Plot Tool
14
https://github.com/robert-mcdermott/embeddings_plot
A command line utility I created to
visualize word embeddings
Embedding-plot, is a command line
utility that can visualize word
embeddings in either 2D or 3D scatter
plots using dimensionality reduction
techniques (PCA, t-SNE or UMAP) and
clustering in a scatter plot.
Image Embedding Example
15
source: https://www.researchgate.net/publication/282181243_Learning_Visual_Clothing_Style_with_Heterogeneous_Dyadic_Co-Occurrences
“Visualization of a 2D embedding of the
style space trained with strategic sampling
computed with t-SNE. The embedding is
based on 200,000 images from the test set.
For a clear visual representation, we
discretize the style space into a grid and
pick one image from each grid cell at
random.”
CLIP (Contrastive Language-Image Pre-Training)
16
Source: Paper: https://arxiv.org/pdf/2103.00020.pdf Code: https://github.com/OpenAI/CLIP
High-Level Architecture
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
The LLaVA Papers
17
LLaVA 1.0 – Large Language and Vision Assistant
18
• https://arxiv.org/abs/2304.08485
• https://arxiv.org/pdf/2304.08485.pdf
• https://llava-vl.github.io/
• https://github.com/haotian-liu/LLaVA
Instruction tuning large language models (LLMs) using machine-
generated instruction-following data has improved zero-shot
capabilities on new tasks, but the idea is less explored in the
multimodal field. In this paper, we present the first attempt to use
language-only GPT-4 to generate multimodal language-image
instruction-following data. By instruction tuning on such generated
data, we introduce LLaVA: Large Language and Vision Assistant, an
end-to-end trained large multimodal model that connects a vision
encoder and LLM for general-purpose visual and language
understanding. Our early experiments show that LLaVA
demonstrates impressive multimodel chat abilities, sometimes
exhibiting the behaviors of multimodal GPT-4 on unseen
images/instructions and yields a 85.1% relative score compared
with GPT-4 on a synthetic multimodal instruction-following dataset.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4
achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4
generated visual instruction tuning data, our model and code base
publicly available.
Abstract
LLaVA 1.0 Training
19
LLaVA 1.0 Architecture
20
Tokenizing + Embedding Tokenizing + Embedding
LLaVA 1.0 Performance
21
22
• https://arxiv.org/abs/2310.03744
• https://arxiv.org/pdf/2310.03744.pdf
• https://huggingface.co/liuhaotian/llava
-v1.5-13b
Large multimodal models (LMM) have recently shown
encouraging progress with visual instruction tuning. In this
note, we show that the fully-connected vision-language cross-
modal connector in LLaVA is surprisingly powerful and data-
efficient. With simple modifications to LLaVA, namely, using
CLIP-ViT-L-336px with an MLP projection and adding academic-
task-oriented VQA data with simple response formatting
prompts, we establish stronger baselines that achieve state-of-
the-art across 11 benchmarks.
Our final 13B checkpoint uses merely 1.2M publicly available
data, and finishes full training in ~1 day on a single 8-A100
node.
We hope this can make state-of-the-art LMM research more
accessible. Code and model will be publicly available.
Abstract
LLaVA (1.5) – Large Language and Vision Assistant
23
LLaVA 1.5 Changes
Modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented
VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art
across 11 benchmarks.
LLaVA 1.5
24
Training Data Sources Performance
25
LLaVA 1.5 Comparison Examples
LLaVA 1.6
26
LLaVA-1.6-34B outperforms Gemini Pro on several benchmarks
https://llava-vl.github.io/blog/2024-01-30-llava-1-6/
Benchmarks Example
27
• https://arxiv.org/abs/2306.00890
• https://arxiv.org/pdf/2306.00890.pdf
Conversational generative AI has demonstrated remarkable promise for empowering
biomedical practitioners, but current investigationsfocus on unimodal text.
Multimodalconversational AI has seen rapid progress by leveraging billions of
image-text pairs from the public web, but such general-domain vision-language
models still lack sophisticationin understanding and conversing about biomedical
images. In this paper, we propose a cost-efficient approach for training a vision language
conversational assistant that can answer open-ended research questions
of biomedical images. The key idea is to leverage a large-scale, broad-coverage
biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to
self-instruct open-ended instruction-followingdata from the captions, and then
fine-tune a large general-domain vision-language model using a novel curriculum
learning method. Specifically,the model first learns to align biomedical vocabulary
using the figure-caption pairs as is, then learns to master open-ended conversational
semantics using GPT-4 generated instruction-followingdata, broadly mimicking
how a layperson gradually acquires biomedical knowledge. This enables us to train
a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less
than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational
capability and can follow open-ended instruction to assist with inquiries
about a biomedical image. On three standard biomedical visual question answering
datasets, fine-tuning LLaVA-Med outperforms previous supervised state-of-the-art
on certain metrics. To facilitate biomedical multimodal research, we will release
our instruction-followingdata and the LLaVA-Med model.
Abstract
LLaVA-Med
https://github.com/microsoft/LLaVA-Med
https://huggingface.co/microsoft/llava-med-7b-delta
28
LLaVA-Med
29
LLaVA-Med Comparison Examples
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
My LLaVA based Image Classifier Experiment
30
Full details, results and code: https://github.com/robert-mcdermott/LLM-Image-Classification
31
LLaVA 1.5: Handwritten Digit Classification
Abstract
https://github.com/robert-mcdermott/LLM-Image-Classification
32
LLaVA 1.5: Animal Classification
https://github.com/robert-mcdermott/LLM-Image-Classification
33
LLaVA 1.5: Chess Piece Identification
https://github.com/robert-mcdermott/LLM-Image-Classification
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Fred Hutchinson Cancer Center
Live Demo Time
34
Thank you
Robert McDermott (he/him)
Director: Solutions, Engineering & Architecture (SEA)
rmcdermo@fredhutch.org

More Related Content

What's hot

Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfDavid Rostcheck
 
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning ResearchJeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning ResearchAI Frontiers
 
AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1DianaGray10
 
generative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language modelsgenerative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language modelsAdventureWorld5
 
Generative AI at the edge.pdf
Generative AI at the edge.pdfGenerative AI at the edge.pdf
Generative AI at the edge.pdfQualcomm Research
 
The Future is in Responsible Generative AI
The Future is in Responsible Generative AIThe Future is in Responsible Generative AI
The Future is in Responsible Generative AISaeed Al Dhaheri
 
The Future of AI is Generative not Discriminative 5/26/2021
The Future of AI is Generative not Discriminative 5/26/2021The Future of AI is Generative not Discriminative 5/26/2021
The Future of AI is Generative not Discriminative 5/26/2021Steve Omohundro
 
Cavalry Ventures | Deep Dive: Generative AI
Cavalry Ventures | Deep Dive: Generative AICavalry Ventures | Deep Dive: Generative AI
Cavalry Ventures | Deep Dive: Generative AICavalry Ventures
 
Generative AI: Past, Present, and Future – A Practitioner's Perspective
Generative AI: Past, Present, and Future – A Practitioner's PerspectiveGenerative AI: Past, Present, and Future – A Practitioner's Perspective
Generative AI: Past, Present, and Future – A Practitioner's PerspectiveHuahai Yang
 
Generative AI: Shifting the AI Landscape
Generative AI: Shifting the AI LandscapeGenerative AI: Shifting the AI Landscape
Generative AI: Shifting the AI LandscapeDeakin University
 
Responsible AI
Responsible AIResponsible AI
Responsible AINeo4j
 
How AI is going to change the world _M.Mujeeb Riaz.pdf
How AI is going to change the world _M.Mujeeb Riaz.pdfHow AI is going to change the world _M.Mujeeb Riaz.pdf
How AI is going to change the world _M.Mujeeb Riaz.pdfMujeeb Riaz
 
Cloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptxCloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptxSahithiGurlinka
 
Landscape of AI/ML in 2023
Landscape of AI/ML in 2023Landscape of AI/ML in 2023
Landscape of AI/ML in 2023HyunJoon Jung
 
Generative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdfGenerative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdfLiming Zhu
 
Generative AI Risks & Concerns
Generative AI Risks & ConcernsGenerative AI Risks & Concerns
Generative AI Risks & ConcernsAjitesh Kumar
 
Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...
Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...
Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...Fordham University
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language ModelsLeon Dohmen
 

What's hot (20)

Large Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdfLarge Language Models - Chat AI.pdf
Large Language Models - Chat AI.pdf
 
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning ResearchJeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research
Jeff Dean at AI Frontiers: Trends and Developments in Deep Learning Research
 
AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1AI and ML Series - Introduction to Generative AI and LLMs - Session 1
AI and ML Series - Introduction to Generative AI and LLMs - Session 1
 
LLMs Bootcamp
LLMs BootcampLLMs Bootcamp
LLMs Bootcamp
 
generative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language modelsgenerative-ai-fundamentals and Large language models
generative-ai-fundamentals and Large language models
 
Generative AI at the edge.pdf
Generative AI at the edge.pdfGenerative AI at the edge.pdf
Generative AI at the edge.pdf
 
The Future is in Responsible Generative AI
The Future is in Responsible Generative AIThe Future is in Responsible Generative AI
The Future is in Responsible Generative AI
 
The Future of AI is Generative not Discriminative 5/26/2021
The Future of AI is Generative not Discriminative 5/26/2021The Future of AI is Generative not Discriminative 5/26/2021
The Future of AI is Generative not Discriminative 5/26/2021
 
Cavalry Ventures | Deep Dive: Generative AI
Cavalry Ventures | Deep Dive: Generative AICavalry Ventures | Deep Dive: Generative AI
Cavalry Ventures | Deep Dive: Generative AI
 
Generative AI: Past, Present, and Future – A Practitioner's Perspective
Generative AI: Past, Present, and Future – A Practitioner's PerspectiveGenerative AI: Past, Present, and Future – A Practitioner's Perspective
Generative AI: Past, Present, and Future – A Practitioner's Perspective
 
Generative AI: Shifting the AI Landscape
Generative AI: Shifting the AI LandscapeGenerative AI: Shifting the AI Landscape
Generative AI: Shifting the AI Landscape
 
Responsible AI
Responsible AIResponsible AI
Responsible AI
 
OpenAI Chatgpt.pptx
OpenAI Chatgpt.pptxOpenAI Chatgpt.pptx
OpenAI Chatgpt.pptx
 
How AI is going to change the world _M.Mujeeb Riaz.pdf
How AI is going to change the world _M.Mujeeb Riaz.pdfHow AI is going to change the world _M.Mujeeb Riaz.pdf
How AI is going to change the world _M.Mujeeb Riaz.pdf
 
Cloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptxCloud AI GenAI Overview.pptx
Cloud AI GenAI Overview.pptx
 
Landscape of AI/ML in 2023
Landscape of AI/ML in 2023Landscape of AI/ML in 2023
Landscape of AI/ML in 2023
 
Generative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdfGenerative-AI-in-enterprise-20230615.pdf
Generative-AI-in-enterprise-20230615.pdf
 
Generative AI Risks & Concerns
Generative AI Risks & ConcernsGenerative AI Risks & Concerns
Generative AI Risks & Concerns
 
Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...
Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...
Foundation of Generative AI: Study Materials Connecting the Dots by Delving i...
 
And then there were ... Large Language Models
And then there were ... Large Language ModelsAnd then there were ... Large Language Models
And then there were ... Large Language Models
 

Similar to Introduction to Multimodal LLMs with LLaVA

Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...Robert McDermott
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Dhruv Gohil
 
Emotion recognition from facial expression using fuzzy logic
Emotion recognition from facial expression using fuzzy logicEmotion recognition from facial expression using fuzzy logic
Emotion recognition from facial expression using fuzzy logicFinalyear Projects
 
DMDS Winter 2015 Workshop 1 slides
DMDS Winter 2015 Workshop 1 slidesDMDS Winter 2015 Workshop 1 slides
DMDS Winter 2015 Workshop 1 slidesPaige Morgan
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onDony Riyanto
 
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...Daniel Zivkovic
 
How do OpenAI GPT Models Work - Misconceptions and Tips for Developers
How do OpenAI GPT Models Work - Misconceptions and Tips for DevelopersHow do OpenAI GPT Models Work - Misconceptions and Tips for Developers
How do OpenAI GPT Models Work - Misconceptions and Tips for DevelopersIvo Andreev
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021Gérard Dupont
 
A Strong Object Recognition Using Lbp, Ltp And Rlbp
A Strong Object Recognition Using Lbp, Ltp And RlbpA Strong Object Recognition Using Lbp, Ltp And Rlbp
A Strong Object Recognition Using Lbp, Ltp And RlbpRikki Wright
 
My Curriculum Vitae
My Curriculum VitaeMy Curriculum Vitae
My Curriculum Vitaeadil raja
 
Open Source Big Graph Analytics on Neo4j with Apache Spark
Open Source Big Graph Analytics on Neo4j with Apache SparkOpen Source Big Graph Analytics on Neo4j with Apache Spark
Open Source Big Graph Analytics on Neo4j with Apache SparkKenny Bastani
 
The effect of distributed archetypes on complexity theory
The effect of distributed archetypes on complexity theoryThe effect of distributed archetypes on complexity theory
The effect of distributed archetypes on complexity theoryVinícius Uchôa
 
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docx
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docxA Survey on Domain-Specific Languages for Machine.pdfA Sur.docx
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docxbartholomeocoombs
 
Distributed tracing 101
Distributed tracing 101Distributed tracing 101
Distributed tracing 101Itiel Shwartz
 
NDC London 2014: Erlang Patterns Matching Business Needs
NDC London 2014: Erlang Patterns Matching Business NeedsNDC London 2014: Erlang Patterns Matching Business Needs
NDC London 2014: Erlang Patterns Matching Business NeedsTorben Hoffmann
 
Django in enterprise world
Django in enterprise worldDjango in enterprise world
Django in enterprise worldSimone Federici
 

Similar to Introduction to Multimodal LLMs with LLaVA (20)

Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
Basics of Generative AI: Models, Tokenization, Embeddings, Text Similarity, V...
 
Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical Nautral Langauge Processing - Basics / Non Technical
Nautral Langauge Processing - Basics / Non Technical
 
Emotion recognition from facial expression using fuzzy logic
Emotion recognition from facial expression using fuzzy logicEmotion recognition from facial expression using fuzzy logic
Emotion recognition from facial expression using fuzzy logic
 
DMDS Winter 2015 Workshop 1 slides
DMDS Winter 2015 Workshop 1 slidesDMDS Winter 2015 Workshop 1 slides
DMDS Winter 2015 Workshop 1 slides
 
Big Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-onBig Data Analytics (ML, DL, AI) hands-on
Big Data Analytics (ML, DL, AI) hands-on
 
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
All in AI: LLM Landscape & RAG in 2024 with Mark Ryan (Google) & Jerry Liu (L...
 
How do OpenAI GPT Models Work - Misconceptions and Tips for Developers
How do OpenAI GPT Models Work - Misconceptions and Tips for DevelopersHow do OpenAI GPT Models Work - Misconceptions and Tips for Developers
How do OpenAI GPT Models Work - Misconceptions and Tips for Developers
 
Tds — big science dec 2021
Tds — big science dec 2021Tds — big science dec 2021
Tds — big science dec 2021
 
A Strong Object Recognition Using Lbp, Ltp And Rlbp
A Strong Object Recognition Using Lbp, Ltp And RlbpA Strong Object Recognition Using Lbp, Ltp And Rlbp
A Strong Object Recognition Using Lbp, Ltp And Rlbp
 
Marvin_Capstone
Marvin_CapstoneMarvin_Capstone
Marvin_Capstone
 
My Curriculum Vitae
My Curriculum VitaeMy Curriculum Vitae
My Curriculum Vitae
 
DeepPavlov 2019
DeepPavlov 2019DeepPavlov 2019
DeepPavlov 2019
 
Rajesh - CV
Rajesh - CVRajesh - CV
Rajesh - CV
 
Open Source Big Graph Analytics on Neo4j with Apache Spark
Open Source Big Graph Analytics on Neo4j with Apache SparkOpen Source Big Graph Analytics on Neo4j with Apache Spark
Open Source Big Graph Analytics on Neo4j with Apache Spark
 
Distributed Tracing
Distributed TracingDistributed Tracing
Distributed Tracing
 
The effect of distributed archetypes on complexity theory
The effect of distributed archetypes on complexity theoryThe effect of distributed archetypes on complexity theory
The effect of distributed archetypes on complexity theory
 
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docx
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docxA Survey on Domain-Specific Languages for Machine.pdfA Sur.docx
A Survey on Domain-Specific Languages for Machine.pdfA Sur.docx
 
Distributed tracing 101
Distributed tracing 101Distributed tracing 101
Distributed tracing 101
 
NDC London 2014: Erlang Patterns Matching Business Needs
NDC London 2014: Erlang Patterns Matching Business NeedsNDC London 2014: Erlang Patterns Matching Business Needs
NDC London 2014: Erlang Patterns Matching Business Needs
 
Django in enterprise world
Django in enterprise worldDjango in enterprise world
Django in enterprise world
 

Recently uploaded

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxfnnc6jmgwh
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...itnewsafrica
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfAarwolf Industries LLC
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Hiroshi SHIBATA
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsYoss Cohen
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesManik S Magar
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructureitnewsafrica
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 

Recently uploaded (20)

A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptxGenerative AI - Gitex v1Generative AI - Gitex v1.pptx
Generative AI - Gitex v1Generative AI - Gitex v1.pptx
 
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...Abdul Kader Baba- Managing Cybersecurity Risks  and Compliance Requirements i...
Abdul Kader Baba- Managing Cybersecurity Risks and Compliance Requirements i...
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
Landscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdfLandscape Catalogue 2024 Australia-1.pdf
Landscape Catalogue 2024 Australia-1.pdf
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024Long journey of Ruby standard library at RubyConf AU 2024
Long journey of Ruby standard library at RubyConf AU 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Infrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platformsInfrared simulation and processing on Nvidia platforms
Infrared simulation and processing on Nvidia platforms
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotesMuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
MuleSoft Online Meetup Group - B2B Crash Course: Release SparkNotes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical InfrastructureVarsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
Varsha Sewlal- Cyber Attacks on Critical Critical Infrastructure
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 

Introduction to Multimodal LLMs with LLaVA

  • 1. DALL-E 3: "A detailed graphic that visualizes a multimodal vector embedding space" Multimodal LLMs • What are Multimodal Language Models • Background / How do they work • LLaVA papers/projects • LLaVA model demonstration • Image classification project with LLaVA Robert McDermott (he/him) Director: Solutions, Engineering & Architecture (SEA) rmcdermo@fredhutch.org Deep Learning Affinity Group (DLAG) https://research.fredhutch.org/dlag/en.html Feb 20, 2024
  • 2. Who Am I? 2 Link: AI Robert
  • 3. 3 The Papers to Read LLaVA 1.0 https://arxiv.org/abs/2304.08485 https://arxiv.org/pdf/2304.08485.pdf https://llava-vl.github.io/ https://github.com/haotian-liu/LLaVA https://arxiv.org/abs/2310.03744 https://arxiv.org/pdf/2310.03744.pdf https://huggingface.co/liuhaotian/llava-v1.5-13b https://arxiv.org/abs/2306.00890 https://arxiv.org/pdf/2306.00890.pdf https://huggingface.co/microsoft/llava-med-7b-delta https://github.com/microsoft/LLaVA-Med LLaVA 1.5 LLaVA-Med
  • 4. 4 Multimodal Language Models Multimodal language models are AI systems designed to understand, interpret, and generate information across different forms of data, such as text and images. These models leverage large datasets of annotated examples to learn associations between text and visual content, enabling them to perform tasks that require comprehension of both textual and visual information. Why is the sky blue? A person wearing a red cap and sleeveless outfit is soaring through a cloudless sky on a brightly colored hang glider. The sky appears blue because molecules in the Earth's atmosphere scatter sunlight the shorter wavelength of blue more than other colors. Multimodal Language Model I like pizza
  • 5. 5 Multimodal Language Models Source: https://twitter.com/GregKamradt/status/1711772496159252981 Use Case Breakdown Describe • Animal Identification • What's in this photo Interpret • Technical Flame Graph Interpretation • Schematic Interpretation • Twitter Thread Explainer Recommend • Food Recommendations • Website Feedback • Painting Feedback Convert • Figma Screens • Adobe Lightroom Settings • Suggest ad copy based on a webpage Extract • Structured Data From Driver's License • Extract structured itemsfrom an image • Handwriting Extraction Assist • Excel Formula Helper • Find My Glasses • Live Poker Advice • Video game recommendations Evaluate • Dog Cuteness Evaluator • Bounding Box Evaluator • Thumbnail Testing Links to Examples
  • 6. 6 AI Vision has come a long way. GPT-4 Vision LLaVA 1.6 34B Research scientist and a founding member at OpenAI. Sr. Director of AI at Telsa. source: https://karpathy.github.io/2012/10/22/state-of-computer-vision/ 2024 2012
  • 7. 7 What’s funny about this? Image source: https://www.reddit.com/r/hmmm/comments/ubab5v/hmmm/ LLaVA 1.6 34B GPT-4 Vision
  • 8. What’s unusual about this image? LLaVA 1.6 34B GPT-4 Vision
  • 9. Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Quick Introduction to Tokens and Embeddings required to understand how LLMs process text and images. 9
  • 10. Text Tokenization 10 Tokenization is a foundational step in the preprocessing of text for many natural language processing (NLP) tasks, including for language models like GPT-4 and Llama-2. Tokenization involves breaking down text into smaller chunks, or "tokens", which can be as short as one character or as long as one word (or even longer in some cases). These tokens can then be processed, analyzed, and used as input for machine learning models. https://platform.openai.com/tokenizer Tokenization Visualized Resulting Token IDs
  • 11. 11 Vector Embeddings Applications • Natural Language Processing tasks: sentiment analysis, named entity recognition, etc. • Information retrieval: search engines, recommendation systems. • Visualization: using dimensionality reduction to visualize semantic relationships https://huggingface.co/spaces/mteb/leaderboard 5.41765615e-02 4.20716889e-02 -2.41547506e-02 1.11813843e-01 -9.33169946e-02 -7.56109739e-03 6.54651076e-02 -1.54011259e-02 -2.80906167e-02 1.97344255e-02 -1.58324391e-02 -8.46638903e-02 -1.31631363e-02 1.98841579e-02 -1.26802064e-02 -9.36008468e-02 -4.51933630e-02 -1.20324306e-02 -2.48974599e-02 4.87890420e-03 -2.54017510e-03 4.92022634e-02 5.12179844e-02 2.54505035e-02 -9.70738381e-02 1.42842624e-02 -3.46412621e-02 -8.45314115e-02 -7.38010108e-02 -2.72879936e-02 -2.81507652e-02 -5.01780510e-02 5.35405474e-03 2.96438616e-02 -5.18742464e-02 -6.24342896e-02 6.04359470e-02 -2.22260728e-02 3.36266570e-02 5.17647602e-02 -3.09484527e-02 -8.72448832e-02 -1.53413722e-02 9.27508809e-03 -4.92608221e-03 -4.97105941e-02 -1.04904985e-02 -4.15333314e-03 1.55722797e-02 -2.66851094e-02 -6.49709478e-02 -5.94373941e-02 -2.10976638e-02 3.59102758e-03 5.88850211e-03 -1.03685725e-02 5.03626876e-02 -3.31290103e-02 -7.70502910e-02 1.53052341e-02 * "A fat tuxedo cat" = * The "all-MiniLM-L6-v2" embedding model has 384 dimensions https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 Definition • Representations of text in numerical form. • Convert variable-length text into fixed-size vectors in high- dimensional space. Purpose • Capture semantic meaning and relationships between words, phrases, or longer text. • Enable mathematical operations on text (e.g., similarity measurement, arithmetic operations). Characteristics • Words with similar meanings are close in vector space. • Allows for operations like "king" - "man" + "woman" ≈ "queen". There are many embedding models:
  • 12. 12 Vector Embeddings • There are several dozen embedding models • They range in complexity from 384 to 1536 dimensions • They range in max sequence length from 512 to 8191 tokens • Embedding models are generally not compatible with each other Interactive embedding explorer: https://blog.echen.me/embedding-explorer/
  • 13. Semantic Text Similarity 13 Sentence 1 Sentence 2 Cosine Similarity The cat sits outside The dog plays in the garden 0.2838 A man is playing guitar A woman watches TV -0.0327 The new movie is awesome The new movie is so great 0.8939 Jim can run very fast James is the fastest runner 0.6844 My goldfish is hungry Pluto is a planet! 0.0454 • Measures the cosine of the angle between two vectors. • Value between -1 and 1; where 1 means vectors are identical, 0 means orthogonal, and -1 means diametrically opposite (rare in text embeddings). These clearly used different embedding models https://gist.github.com/robert-mcdermott/67cf2623237989bc2315d35a108246ef
  • 14. Embeddings Plot Tool 14 https://github.com/robert-mcdermott/embeddings_plot A command line utility I created to visualize word embeddings Embedding-plot, is a command line utility that can visualize word embeddings in either 2D or 3D scatter plots using dimensionality reduction techniques (PCA, t-SNE or UMAP) and clustering in a scatter plot.
  • 15. Image Embedding Example 15 source: https://www.researchgate.net/publication/282181243_Learning_Visual_Clothing_Style_with_Heterogeneous_Dyadic_Co-Occurrences “Visualization of a 2D embedding of the style space trained with strategic sampling computed with t-SNE. The embedding is based on 200,000 images from the test set. For a clear visual representation, we discretize the style space into a grid and pick one image from each grid cell at random.”
  • 16. CLIP (Contrastive Language-Image Pre-Training) 16 Source: Paper: https://arxiv.org/pdf/2103.00020.pdf Code: https://github.com/OpenAI/CLIP High-Level Architecture
  • 17. Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center The LLaVA Papers 17
  • 18. LLaVA 1.0 – Large Language and Vision Assistant 18 • https://arxiv.org/abs/2304.08485 • https://arxiv.org/pdf/2304.08485.pdf • https://llava-vl.github.io/ • https://github.com/haotian-liu/LLaVA Instruction tuning large language models (LLMs) using machine- generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodel chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available. Abstract
  • 20. LLaVA 1.0 Architecture 20 Tokenizing + Embedding Tokenizing + Embedding
  • 22. 22 • https://arxiv.org/abs/2310.03744 • https://arxiv.org/pdf/2310.03744.pdf • https://huggingface.co/liuhaotian/llava -v1.5-13b Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross- modal connector in LLaVA is surprisingly powerful and data- efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic- task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of- the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available. Abstract LLaVA (1.5) – Large Language and Vision Assistant
  • 23. 23 LLaVA 1.5 Changes Modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks.
  • 24. LLaVA 1.5 24 Training Data Sources Performance
  • 26. LLaVA 1.6 26 LLaVA-1.6-34B outperforms Gemini Pro on several benchmarks https://llava-vl.github.io/blog/2024-01-30-llava-1-6/ Benchmarks Example
  • 27. 27 • https://arxiv.org/abs/2306.00890 • https://arxiv.org/pdf/2306.00890.pdf Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigationsfocus on unimodal text. Multimodalconversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophisticationin understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-followingdata from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically,the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-followingdata, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, fine-tuning LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-followingdata and the LLaVA-Med model. Abstract LLaVA-Med https://github.com/microsoft/LLaVA-Med https://huggingface.co/microsoft/llava-med-7b-delta
  • 30. Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center My LLaVA based Image Classifier Experiment 30 Full details, results and code: https://github.com/robert-mcdermott/LLM-Image-Classification
  • 31. 31 LLaVA 1.5: Handwritten Digit Classification Abstract https://github.com/robert-mcdermott/LLM-Image-Classification
  • 32. 32 LLaVA 1.5: Animal Classification https://github.com/robert-mcdermott/LLM-Image-Classification
  • 33. 33 LLaVA 1.5: Chess Piece Identification https://github.com/robert-mcdermott/LLM-Image-Classification
  • 34. Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Fred Hutchinson Cancer Center Live Demo Time 34
  • 35. Thank you Robert McDermott (he/him) Director: Solutions, Engineering & Architecture (SEA) rmcdermo@fredhutch.org