DALL-E 3: "A detailed graphic that visualizes a multimodal vector embedding space"
Multimodal LLMs
• What are Multimodal Language Models
• Background / How do they work
• LLaVA papers/projects
• LLaVA model demonstration
• Image classification project with LLaVA
Robert McDermott (he/him)
Director: Solutions, Engineering & Architecture (SEA)
rmcdermo@fredhutch.org
Deep Learning Affinity Group (DLAG)
https://research.fredhutch.org/dlag/en.html
Feb 20, 2024
Who Am I?
Link: AI Robert
The Papers to Read
LLaVA 1.0
https://arxiv.org/abs/2304.08485
https://arxiv.org/pdf/2304.08485.pdf
https://llava-vl.github.io/
https://github.com/haotian-liu/LLaVA
LLaVA 1.5
https://arxiv.org/abs/2310.03744
https://arxiv.org/pdf/2310.03744.pdf
https://huggingface.co/liuhaotian/llava-v1.5-13b
LLaVA-Med
https://arxiv.org/abs/2306.00890
https://arxiv.org/pdf/2306.00890.pdf
https://huggingface.co/microsoft/llava-med-7b-delta
https://github.com/microsoft/LLaVA-Med
Multimodal Language Models
Multimodal language models are AI systems designed to understand, interpret, and generate information across different
forms of data, such as text and images. These models leverage large datasets of annotated examples to learn associations
between text and visual content, enabling them to perform tasks that require comprehension of both textual and visual
information.
Figure: a Multimodal Language Model handling text and image content. Text shown in the figure:
• "Why is the sky blue?"
• "The sky appears blue because molecules in the Earth's atmosphere scatter the shorter wavelengths of blue light more than other colors."
• "A person wearing a red cap and sleeveless outfit is soaring through a cloudless sky on a brightly colored hang glider."
• "I like pizza"
Multimodal Language Models
Source: https://twitter.com/GregKamradt/status/1711772496159252981
Use Case Breakdown
Describe
• Animal Identification
• What's in this photo
Interpret
• Technical Flame Graph Interpretation
• Schematic Interpretation
• Twitter Thread Explainer
Recommend
• Food Recommendations
• Website Feedback
• Painting Feedback
Convert
• Figma Screens
• Adobe Lightroom Settings
• Suggest ad copy based on a webpage
Extract
• Structured Data From Driver's License
• Extract structured items from an image
• Handwriting Extraction
Assist
• Excel Formula Helper
• Find My Glasses
• Live Poker Advice
• Video game recommendations
Evaluate
• Dog Cuteness Evaluator
• Bounding Box Evaluator
• Thumbnail Testing
Links to Examples
AI Vision has come a long way.
GPT-4 Vision
LLaVA 1.6 34B
Research scientist and a founding member
of OpenAI. Sr. Director of AI at Tesla.
source: https://karpathy.github.io/2012/10/22/state-of-computer-vision/
2024
2012
What’s funny about this?
Image source: https://www.reddit.com/r/hmmm/comments/ubab5v/hmmm/
LLaVA 1.6 34B
GPT-4 Vision
What’s unusual about this image?
LLaVA 1.6 34B
GPT-4 Vision
Fred Hutchinson Cancer Center
Quick Introduction to Tokens and Embeddings
required to understand how LLMs process text and
images.
Text Tokenization
Tokenization is a foundational step in the preprocessing of text for many natural language processing (NLP) tasks, including for language
models like GPT-4 and Llama-2. Tokenization involves breaking down text into smaller chunks, or "tokens", which can be as short as one
character or as long as one word (or even longer in some cases). These tokens can then be processed, analyzed, and used as input for
machine learning models.
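To make the idea concrete, here is a minimal sketch of subword tokenization using greedy longest-match lookup against a tiny, made-up vocabulary. The vocabulary, the token IDs, and the function names `tokenize` and `detokenize` are illustrative assumptions; production tokenizers for models like GPT-4 and Llama-2 learn much larger vocabularies from data and use their own splitting rules.

```python
# Hypothetical vocabulary: a few multi-character pieces plus single
# characters as a fallback. IDs are simply positions in this list.
PIECES = ["token", "ization", " is", " fun"] + list(" abcdefghijklmnopqrstuvwxyz")
VOCAB = {piece: idx for idx, piece in enumerate(PIECES)}


def tokenize(text, vocab):
    """Split text into token IDs, always taking the longest matching piece."""
    max_len = max(len(piece) for piece in vocab)
    ids, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until a piece matches.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if piece in vocab:
                ids.append(vocab[piece])
                pos += size
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def detokenize(ids, vocab):
    """Map token IDs back to text by concatenating their pieces."""
    pieces = {idx: piece for piece, idx in vocab.items()}
    return "".join(pieces[i] for i in ids)


ids = tokenize("tokenization is fun", VOCAB)
print(ids)                      # four IDs, one per multi-character piece
print(detokenize(ids, VOCAB))   # round-trips to the original text
```

A word that is not in the vocabulary as a whole, such as "tokens", falls back to a known piece plus single characters, which is why token counts rarely equal word counts.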
https://platform.openai.com/tokenizer
Tokenization Visualized
Resulting Token IDs
Vector Embeddings
Definition
• Representations of text in numerical form.
• Convert variable-length text into fixed-size vectors in high-dimensional space.
Purpose
• Capture semantic meaning and relationships between words, phrases, or longer text.
• Enable mathematical operations on text (e.g., similarity measurement, arithmetic operations).
Characteristics
• Words with similar meanings are close in vector space.
• Allows for operations like "king" - "man" + "woman" ≈ "queen".
Applications
• Natural Language Processing tasks: sentiment analysis, named entity recognition, etc.
• Information retrieval: search engines, recommendation systems.
• Visualization: using dimensionality reduction to visualize semantic relationships.
"A fat tuxedo cat" = *
5.41765615e-02 4.20716889e-02 -2.41547506e-02 1.11813843e-01
-9.33169946e-02 -7.56109739e-03 6.54651076e-02 -1.54011259e-02
-2.80906167e-02 1.97344255e-02 -1.58324391e-02 -8.46638903e-02
-1.31631363e-02 1.98841579e-02 -1.26802064e-02 -9.36008468e-02
-4.51933630e-02 -1.20324306e-02 -2.48974599e-02 4.87890420e-03
-2.54017510e-03 4.92022634e-02 5.12179844e-02 2.54505035e-02
-9.70738381e-02 1.42842624e-02 -3.46412621e-02 -8.45314115e-02
-7.38010108e-02 -2.72879936e-02 -2.81507652e-02 -5.01780510e-02
5.35405474e-03 2.96438616e-02 -5.18742464e-02 -6.24342896e-02
6.04359470e-02 -2.22260728e-02 3.36266570e-02 5.17647602e-02
-3.09484527e-02 -8.72448832e-02 -1.53413722e-02 9.27508809e-03
-4.92608221e-03 -4.97105941e-02 -1.04904985e-02 -4.15333314e-03
1.55722797e-02 -2.66851094e-02 -6.49709478e-02 -5.94373941e-02
-2.10976638e-02 3.59102758e-03 5.88850211e-03 -1.03685725e-02
5.03626876e-02 -3.31290103e-02 -7.70502910e-02 1.53052341e-02
* The "all-MiniLM-L6-v2" embedding model has 384 dimensions; an excerpt of the vector is shown.
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
There are many embedding models:
https://huggingface.co/spaces/mteb/leaderboard
Vector Embeddings
• There are several dozen embedding models
• They range in complexity from 384 to 1536 dimensions
• They range in max sequence length from 512 to 8191 tokens
• Embedding models are generally not compatible with each other
Interactive embedding explorer:
https://blog.echen.me/embedding-explorer/
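The "king" - "man" + "woman" ≈ "queen" property of embeddings can be illustrated with a toy example. The sketch below uses hand-made 3-dimensional vectors (hypothetical values whose axes loosely stand for royalty, male, and female); real embedding models learn hundreds of dimensions with no such readable axes, and the helper name `analogy` is an assumption for illustration.

```python
import math

# Hand-made 3-dimensional "embeddings" (hypothetical values, not model output).
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "pizza": [0.0, 0.2, 0.2],
}


def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (word for word in embeddings if word not in (a, b, c))
    # Euclidean distance picks the nearest remaining word in vector space.
    return min(candidates, key=lambda word: math.dist(target, embeddings[word]))


print(analogy("king", "man", "woman", EMBEDDINGS))
```

With these vectors the arithmetic lands near "queen" and far from the unrelated "pizza", which is the behavior the analogy describes.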
Semantic Text Similarity
Sentence 1               | Sentence 2                  | Cosine Similarity
The cat sits outside     | The dog plays in the garden | 0.2838
A man is playing guitar  | A woman watches TV          | -0.0327
The new movie is awesome | The new movie is so great   | 0.8939
Jim can run very fast    | James is the fastest runner | 0.6844
My goldfish is hungry    | Pluto is a planet!          | 0.0454
• Measures the cosine of the angle between two vectors.
• Value between -1 and 1: 1 means the vectors point in the same direction, 0 means they are
orthogonal, and -1 means they are diametrically opposite (rare in text embeddings).
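Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch in plain Python follows; the short vectors are made up to show the three reference cases, whereas the scores in the table come from full sentence embeddings produced by a model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1
print(same_direction, orthogonal, opposite)
```

Because the result depends only on direction, scaling a vector does not change its similarity to another vector.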
These clearly used different
embedding models
https://gist.github.com/robert-mcdermott/67cf2623237989bc2315d35a108246ef
Embeddings Plot Tool
https://github.com/robert-mcdermott/embeddings_plot
A command line utility I created to visualize word embeddings.
Embedding-plot is a command line utility that visualizes word
embeddings as 2D or 3D scatter plots, using dimensionality
reduction techniques (PCA, t-SNE or UMAP) and clustering.
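As a rough illustration of the dimensionality reduction step, the sketch below projects a few vectors onto their top two principal components using PCA computed by power iteration. This is not the tool's implementation (see the repository for that); the 4-dimensional vectors are made up, and the function names are assumptions chosen for this example.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the per-dimension mean from every row."""
    dims = range(len(rows[0]))
    means = [sum(row[d] for row in rows) / len(rows) for d in dims]
    return [[row[d] - means[d] for d in dims] for row in rows]


def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, dims = len(centered), range(len(centered[0]))
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in dims] for i in dims]


def top_eigenvector(matrix, iterations=200, seed=0):
    """Power iteration: repeatedly multiply and renormalize a random vector."""
    rng = random.Random(seed)
    vec = [rng.random() for _ in matrix]
    for _ in range(iterations):
        vec = [dot(row, vec) for row in matrix]
        norm = math.sqrt(dot(vec, vec))
        vec = [v / norm for v in vec]
    return vec


def pca_2d(rows):
    """Project rows onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    first = top_eigenvector(cov)
    eigenvalue = dot(first, [dot(row, first) for row in cov])
    size = range(len(cov))
    # Deflation removes the first component so the next one can be found.
    deflated = [[cov[i][j] - eigenvalue * first[i] * first[j] for j in size] for i in size]
    second = top_eigenvector(deflated, seed=1)
    return [(dot(row, first), dot(row, second)) for row in centered]


# Made-up 4-dimensional "embeddings": two animals and two vehicles.
WORDS = {
    "cat": [0.9, 0.8, 0.1, 0.0],
    "dog": [0.8, 0.9, 0.2, 0.1],
    "car": [0.1, 0.2, 0.9, 0.8],
    "bus": [0.2, 0.1, 0.8, 0.9],
}
coords = dict(zip(WORDS, pca_2d(list(WORDS.values()))))
for word, (x, y) in coords.items():
    print(f"{word}: ({x:.3f}, {y:.3f})")
```

In the resulting 2D coordinates the two animals fall on one side of the first axis and the two vehicles on the other, which is the kind of grouping a scatter plot of reduced embeddings makes visible.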
Image Embedding Example
source: https://www.researchgate.net/publication/282181243_Learning_Visual_Clothing_Style_with_Heterogeneous_Dyadic_Co-Occurrences
“Visualization of a 2D embedding of the
style space trained with strategic sampling
computed with t-SNE. The embedding is
based on 200,000 images from the test set.
For a clear visual representation, we
discretize the style space into a grid and
pick one image from each grid cell at
random.”
CLIP (Contrastive Language-Image Pre-Training)
Source: Paper: https://arxiv.org/pdf/2103.00020.pdf Code: https://github.com/OpenAI/CLIP
High-Level Architecture
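CLIP trains an image encoder and a text encoder so that matching image and caption pairs end up with similar embeddings. At inference time this allows zero-shot classification: embed the image, embed one caption per candidate label, and pick the caption with the highest cosine similarity. The sketch below shows only that comparison step; the vectors are made-up stand-ins for encoder outputs, and the fixed `scale` value is an assumption in place of CLIP's learned temperature.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_vec, caption_vecs, scale=100.0):
    """Return a probability per caption from scaled cosine similarities."""
    image_vec = normalize(image_vec)
    captions = list(caption_vecs)
    logits = [
        scale * sum(a * b for a, b in zip(image_vec, normalize(caption_vecs[c])))
        for c in captions
    ]
    return dict(zip(captions, softmax(logits)))


# Made-up embeddings standing in for the image and text encoder outputs.
image_embedding = [0.9, 0.1, 0.2]
caption_embeddings = {
    "a photo of a cat": [0.8, 0.2, 0.1],
    "a photo of a dog": [0.1, 0.9, 0.2],
    "a photo of a car": [0.1, 0.2, 0.9],
}
probabilities = zero_shot_classify(image_embedding, caption_embeddings)
best = max(probabilities, key=probabilities.get)
print(best)
```

No classifier head is trained for the label set; changing the labels only means embedding different captions.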
The LLaVA Papers
LLaVA 1.0 – Large Language and Vision Assistant
• https://arxiv.org/abs/2304.08485
• https://arxiv.org/pdf/2304.08485.pdf
• https://llava-vl.github.io/
• https://github.com/haotian-liu/LLaVA
Abstract
Instruction tuning large language models (LLMs) using machine-generated
instruction-following data has improved zero-shot capabilities on new tasks,
but the idea is less explored in the multimodal field. In this paper, we
present the first attempt to use language-only GPT-4 to generate multimodal
language-image instruction-following data. By instruction tuning on such
generated data, we introduce LLaVA: Large Language and Vision Assistant, an
end-to-end trained large multimodal model that connects a vision encoder and
LLM for general-purpose visual and language understanding. Our early
experiments show that LLaVA demonstrates impressive multimodal chat abilities,
sometimes exhibiting the behaviors of multimodal GPT-4 on unseen
images/instructions, and yields an 85.1% relative score compared with GPT-4 on
a synthetic multimodal instruction-following dataset. When fine-tuned on
Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art
accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data,
our model and code base publicly available.
LLaVA 1.0 Training
LLaVA 1.0 Architecture
Tokenizing + Embedding Tokenizing + Embedding
LLaVA 1.0 Performance
LLaVA (1.5) – Large Language and Vision Assistant
• https://arxiv.org/abs/2310.03744
• https://arxiv.org/pdf/2310.03744.pdf
• https://huggingface.co/liuhaotian/llava-v1.5-13b
Abstract
Large multimodal models (LMM) have recently shown encouraging progress with
visual instruction tuning. In this note, we show that the fully-connected
vision-language cross-modal connector in LLaVA is surprisingly powerful and
data-efficient. With simple modifications to LLaVA, namely, using
CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA
data with simple response formatting prompts, we establish stronger baselines
that achieve state-of-the-art across 11 benchmarks.
Our final 13B checkpoint uses merely 1.2M publicly available data, and
finishes full training in ~1 day on a single 8-A100 node.
We hope this can make state-of-the-art LMM research more accessible. Code and
model will be publicly available.
LLaVA 1.5 Changes
With simple modifications to LLaVA, namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented
VQA data with simple response formatting prompts, the authors establish stronger baselines that achieve state-of-the-art
results across 11 benchmarks.
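The connector's job is to map each patch feature from the vision encoder into the width of the LLM's token embeddings, so that projected image "tokens" can sit in the same input sequence as text token embeddings. LLaVA 1.0 used a single linear layer for this; LLaVA 1.5 replaces it with a two-layer MLP. The sketch below shows the shape of that idea in plain Python with toy sizes and random, untrained weights. All dimensions and names are assumptions, and placing the image tokens before the text is a simplification.

```python
import math
import random


def linear(x, weights, bias):
    """Apply y = W x + b to one feature vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def random_layer(n_in, n_out, rng):
    """Create untrained weights and zero biases for an n_in -> n_out layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def mlp_projector(patch_features, layer1, layer2):
    """Two-layer MLP connector: linear -> GELU -> linear, applied per patch."""
    projected = []
    for patch in patch_features:
        hidden = [gelu(h) for h in linear(patch, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_PATCHES, NUM_TEXT_TOKENS = 4, 6, 3, 2  # toy sizes

# Stand-ins for vision encoder patch features and text token embeddings.
patch_features = [[rng.gauss(0, 1) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]
text_embeddings = [[rng.gauss(0, 1) for _ in range(LLM_DIM)] for _ in range(NUM_TEXT_TOKENS)]

layer1 = random_layer(VISION_DIM, LLM_DIM, rng)
layer2 = random_layer(LLM_DIM, LLM_DIM, rng)
image_tokens = mlp_projector(patch_features, layer1, layer2)

# The LLM receives one sequence of vectors that all share the same width.
llm_input = image_tokens + text_embeddings
print(len(llm_input), "vectors of width", len(llm_input[0]))
```

Only the projector has to learn the mapping between the two embedding spaces, which is one reason such a small component can be trained efficiently.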
LLaVA 1.5
Training Data Sources Performance
LLaVA 1.5 Comparison Examples
LLaVA 1.6
LLaVA-1.6-34B outperforms Gemini Pro on several benchmarks
https://llava-vl.github.io/blog/2024-01-30-llava-1-6/
Benchmarks Example
LLaVA-Med
• https://arxiv.org/abs/2306.00890
• https://arxiv.org/pdf/2306.00890.pdf
• https://github.com/microsoft/LLaVA-Med
• https://huggingface.co/microsoft/llava-med-7b-delta
Abstract
Conversational generative AI has demonstrated remarkable promise for empowering
biomedical practitioners, but current investigations focus on unimodal text.
Multimodal conversational AI has seen rapid progress by leveraging billions of
image-text pairs from the public web, but such general-domain vision-language
models still lack sophistication in understanding and conversing about biomedical
images. In this paper, we propose a cost-efficient approach for training a vision-language
conversational assistant that can answer open-ended research questions
of biomedical images. The key idea is to leverage a large-scale, broad-coverage
biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to
self-instruct open-ended instruction-following data from the captions, and then
fine-tune a large general-domain vision-language model using a novel curriculum
learning method. Specifically, the model first learns to align biomedical vocabulary
using the figure-caption pairs as is, then learns to master open-ended conversational
semantics using GPT-4 generated instruction-following data, broadly mimicking
how a layperson gradually acquires biomedical knowledge. This enables us to train
a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less
than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational
capability and can follow open-ended instruction to assist with inquiries
about a biomedical image. On three standard biomedical visual question answering
datasets, fine-tuning LLaVA-Med outperforms previous supervised state-of-the-art
on certain metrics. To facilitate biomedical multimodal research, we will release
our instruction-following data and the LLaVA-Med model.
LLaVA-Med
LLaVA-Med Comparison Examples
My LLaVA based Image Classifier Experiment
Full details, results and code: https://github.com/robert-mcdermott/LLM-Image-Classification
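The linked repository holds the actual experiment code. As a generic illustration only, the sketch below shows one common pattern for using a vision-language model as a classifier: constrain the answer to a fixed label set in the prompt, then map the free-text reply back to a label. Everything here is hypothetical: `ask_model` stands in for whatever call sends an image and prompt to the model, `fake_llava` is a stub so the example runs without a model, and the file name is a placeholder.

```python
def build_prompt(labels):
    """Ask for exactly one label from a fixed set."""
    options = ", ".join(labels)
    return (
        "Classify the image. Answer with exactly one of the following "
        f"labels and nothing else: {options}."
    )


def parse_label(reply, labels):
    """Map a free-text reply to a known label, or None if it is ambiguous."""
    text = reply.strip().lower()
    matches = [label for label in labels if label.lower() in text]
    return matches[0] if len(matches) == 1 else None


def classify(image_path, labels, ask_model):
    """Send the image and prompt to the model, then normalize its answer."""
    reply = ask_model(image_path, build_prompt(labels))
    return parse_label(reply, labels)


def fake_llava(image_path, prompt):
    """Stub standing in for a real model call; always replies ' Knight.'."""
    return " Knight.\n"


CHESS_LABELS = ["bishop", "king", "knight", "pawn", "queen", "rook"]
prediction = classify("board_square.png", CHESS_LABELS, fake_llava)
print(prediction)
```

Returning `None` for replies that name zero or several labels keeps ambiguous answers out of the accuracy count instead of silently guessing.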
LLaVA 1.5: Handwritten Digit Classification
https://github.com/robert-mcdermott/LLM-Image-Classification
LLaVA 1.5: Animal Classification
https://github.com/robert-mcdermott/LLM-Image-Classification
LLaVA 1.5: Chess Piece Identification
https://github.com/robert-mcdermott/LLM-Image-Classification
Live Demo Time
Thank you
Robert McDermott (he/him)
Director: Solutions, Engineering & Architecture (SEA)
rmcdermo@fredhutch.org

 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 

Introduction to Multimodal LLMs with LLaVA

  • 1. Multimodal LLMs. Agenda: What are Multimodal Language Models • Background / How do they work • LLaVA papers/projects • LLaVA model demonstration • Image classification project with LLaVA. Robert McDermott (he/him), Director: Solutions, Engineering & Architecture (SEA), rmcdermo@fredhutch.org. Deep Learning Affinity Group (DLAG): https://research.fredhutch.org/dlag/en.html. Feb 20, 2024. Cover image prompt, DALL-E 3: "A detailed graphic that visualizes a multimodal vector embedding space"
  • 2. Who Am I? Link: AI Robert
  • 3. The Papers to Read. LLaVA 1.0: https://arxiv.org/abs/2304.08485 https://arxiv.org/pdf/2304.08485.pdf https://llava-vl.github.io/ https://github.com/haotian-liu/LLaVA. LLaVA 1.5: https://arxiv.org/abs/2310.03744 https://arxiv.org/pdf/2310.03744.pdf https://huggingface.co/liuhaotian/llava-v1.5-13b. LLaVA-Med: https://arxiv.org/abs/2306.00890 https://arxiv.org/pdf/2306.00890.pdf https://huggingface.co/microsoft/llava-med-7b-delta https://github.com/microsoft/LLaVA-Med
  • 4. Multimodal Language Models. Multimodal language models are AI systems designed to understand, interpret, and generate information across different forms of data, such as text and images. These models leverage large datasets of annotated examples to learn associations between text and visual content, enabling them to perform tasks that require comprehension of both textual and visual information. Diagram text (Multimodal Language Model): "Why is the sky blue?"; "I like pizza"; "A person wearing a red cap and sleeveless outfit is soaring through a cloudless sky on a brightly colored hang glider."; "The sky appears blue because molecules in the Earth's atmosphere scatter the shorter wavelengths of blue light more than other colors."
  • 5. Multimodal Language Models: Use Case Breakdown. Source: https://twitter.com/GregKamradt/status/1711772496159252981. Describe: Animal Identification; What's in this photo. Interpret: Technical Flame Graph Interpretation; Schematic Interpretation; Twitter Thread Explainer. Recommend: Food Recommendations; Website Feedback; Painting Feedback. Convert: Figma Screens; Adobe Lightroom Settings; Suggest ad copy based on a webpage. Extract: Structured Data From Driver's License; Extract structured items from an image; Handwriting Extraction. Assist: Excel Formula Helper; Find My Glasses; Live Poker Advice; Video game recommendations. Evaluate: Dog Cuteness Evaluator; Bounding Box Evaluator; Thumbnail Testing. Links to Examples
  • 6. AI Vision has come a long way: 2012 vs. 2024. 2012 source: https://karpathy.github.io/2012/10/22/state-of-computer-vision/ (author: research scientist and a founding member at OpenAI; Sr. Director of AI at Tesla). 2024 models shown: GPT-4 Vision, LLaVA 1.6 34B
  • 7. What’s funny about this? Models shown: LLaVA 1.6 34B, GPT-4 Vision. Image source: https://www.reddit.com/r/hmmm/comments/ubab5v/hmmm/
  • 8. What’s unusual about this image? Models shown: LLaVA 1.6 34B, GPT-4 Vision
  • 9. Fred Hutchinson Cancer Center. Quick Introduction to Tokens and Embeddings required to understand how LLMs process text and images.
  • 10. Text Tokenization. Tokenization is a foundational step in the preprocessing of text for many natural language processing (NLP) tasks, including for language models like GPT-4 and Llama-2. Tokenization involves breaking down text into smaller chunks, or "tokens", which can be as short as one character or as long as one word (or even longer in some cases). These tokens can then be processed, analyzed, and used as input for machine learning models. Tokenizer tool: https://platform.openai.com/tokenizer (figures: Tokenization Visualized; Resulting Token IDs)
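The text-to-token-ID step described on slide 10 can be illustrated with a toy example. This is a minimal sketch, assuming a tiny hand-made vocabulary and a greedy longest-match rule; the `VOCAB` table and both function names are hypothetical, and production tokenizers for models such as GPT-4 or Llama-2 use learned subword vocabularies rather than this simplification.

```python
# Toy vocabulary (hypothetical): maps text chunks to integer token IDs.
# Real models use learned vocabularies with tens of thousands of entries.
VOCAB = {"token": 0, "ization": 1, " is": 2, " fun": 3}
ID_TO_TEXT = {token_id: text for text, token_id in VOCAB.items()}


def tokenize(text, vocab):
    """Split text into token IDs by greedily taking the longest known chunk."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink until one is in the vocabulary.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                ids.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no token covers text starting at {text[i:]!r}")
    return ids


def detokenize(ids, id_to_text):
    """Map token IDs back to their text chunks and join them."""
    return "".join(id_to_text[token_id] for token_id in ids)


token_ids = tokenize("tokenization is fun", VOCAB)
print(token_ids)                          # [0, 1, 2, 3]
print(detokenize(token_ids, ID_TO_TEXT))  # tokenization is fun
```

Note that one word ("tokenization") becomes two tokens here, while a leading space is folded into the tokens " is" and " fun", which mirrors the point that tokens can be shorter or longer than a word.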
  • 11. Vector Embeddings. Definition: representations of text in numerical form; convert variable-length text into fixed-size vectors in high-dimensional space. Purpose: capture semantic meaning and relationships between words, phrases, or longer text; enable mathematical operations on text (e.g., similarity measurement, arithmetic operations). Characteristics: words with similar meanings are close in vector space; allows for operations like "king" - "man" + "woman" ≈ "queen". Applications: Natural Language Processing tasks (sentiment analysis, named entity recognition, etc.); information retrieval (search engines, recommendation systems); visualization (using dimensionality reduction to visualize semantic relationships). There are many embedding models: https://huggingface.co/spaces/mteb/leaderboard. Example: "A fat tuxedo cat" = 5.41765615e-02 4.20716889e-02 -2.41547506e-02 1.11813843e-01 -9.33169946e-02 -7.56109739e-03 6.54651076e-02 -1.54011259e-02 -2.80906167e-02 1.97344255e-02 -1.58324391e-02 -8.46638903e-02 -1.31631363e-02 1.98841579e-02 -1.26802064e-02 -9.36008468e-02 -4.51933630e-02 -1.20324306e-02 -2.48974599e-02 4.87890420e-03 -2.54017510e-03 4.92022634e-02 5.12179844e-02 2.54505035e-02 -9.70738381e-02 1.42842624e-02 -3.46412621e-02 -8.45314115e-02 -7.38010108e-02 -2.72879936e-02 -2.81507652e-02 -5.01780510e-02 5.35405474e-03 2.96438616e-02 -5.18742464e-02 -6.24342896e-02 6.04359470e-02 -2.22260728e-02 3.36266570e-02 5.17647602e-02 -3.09484527e-02 -8.72448832e-02 -1.53413722e-02 9.27508809e-03 -4.92608221e-03 -4.97105941e-02 -1.04904985e-02 -4.15333314e-03 1.55722797e-02 -2.66851094e-02 -6.49709478e-02 -5.94373941e-02 -2.10976638e-02 3.59102758e-03 5.88850211e-03 -1.03685725e-02 5.03626876e-02 -3.31290103e-02 -7.70502910e-02 1.53052341e-02 (excerpt). The "all-MiniLM-L6-v2" embedding model has 384 dimensions: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
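The "king" - "man" + "woman" ≈ "queen" operation from slide 11 can be tried on toy data. This is a minimal sketch, assuming hand-picked 3-dimensional vectors; the values in `EMBEDDINGS` and the `analogy` helper are hypothetical and chosen only so the arithmetic is easy to follow, whereas a real model such as all-MiniLM-L6-v2 produces 384 learned dimensions.

```python
import math

# Hypothetical 3-D "embeddings", roughly (royalty, male, female).
# These numbers are made up for illustration; they do not come from any model.
EMBEDDINGS = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
}


def analogy(a, b, c, embeddings):
    """Return the word whose vector lies closest to vec(a) - vec(b) + vec(c)."""
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    # Nearest neighbour by Euclidean distance (math.dist needs Python 3.8+).
    return min(embeddings, key=lambda word: math.dist(embeddings[word], target))


result = analogy("king", "man", "woman", EMBEDDINGS)
print(result)  # queen
```

Subtracting "man" removes most of the male component, adding "woman" supplies the female one, and the royalty component is left unchanged, so the result lands next to "queen".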
  • 12. Vector Embeddings • There are several dozen embedding models • They range in complexity from 384 to 1536 dimensions • They range in max sequence length from 512 to 8191 tokens • Embedding models are generally not compatible with each other Interactive embedding explorer: https://blog.echen.me/embedding-explorer/
  • 13. Semantic Text Similarity. Cosine similarity measures the cosine of the angle between two vectors. The value lies between -1 and 1, where 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they are diametrically opposite (rare in text embeddings). Sentence 1 / Sentence 2 / Cosine Similarity: "The cat sits outside" / "The dog plays in the garden" / 0.2838; "A man is playing guitar" / "A woman watches TV" / -0.0327; "The new movie is awesome" / "The new movie is so great" / 0.8939; "Jim can run very fast" / "James is the fastest runner" / 0.6844; "My goldfish is hungry" / "Pluto is a planet!" / 0.0454. These clearly used different embedding models. https://gist.github.com/robert-mcdermott/67cf2623237989bc2315d35a108246ef
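The cosine similarity measure on slide 13 is the dot product of two vectors divided by the product of their lengths. The following is a minimal sketch in plain Python, not the code behind the table above; the function name and the short example vectors are assumptions for illustration, and real sentence embeddings would have hundreds of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v, in the range [-1, 1]."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Same direction, orthogonal, and opposite vectors (toy 2-D examples).
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # close to 1
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # 0
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # close to -1
```

Because the result depends only on direction, scaling a vector does not change its similarity to another, which is why [1, 2] and [2, 4] score as near-identical.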
  • 14. Embeddings Plot Tool: https://github.com/robert-mcdermott/embeddings_plot. A command line utility I created to visualize word embeddings. Embedding-plot visualizes word embeddings in 2D or 3D scatter plots, using dimensionality reduction techniques (PCA, t-SNE or UMAP) and clustering.
  • 15. Image Embedding Example. Source: https://www.researchgate.net/publication/282181243_Learning_Visual_Clothing_Style_with_Heterogeneous_Dyadic_Co-Occurrences “Visualization of a 2D embedding of the style space trained with strategic sampling computed with t-SNE. The embedding is based on 200,000 images from the test set. For a clear visual representation, we discretize the style space into a grid and pick one image from each grid cell at random.”
  • 16. CLIP (Contrastive Language-Image Pre-Training): High-Level Architecture (diagram). Source: Paper: https://arxiv.org/pdf/2103.00020.pdf Code: https://github.com/OpenAI/CLIP
  • 17. The LLaVA Papers
  • 18. LLaVA 1.0 – Large Language and Vision Assistant • https://arxiv.org/abs/2304.08485 • https://arxiv.org/pdf/2304.08485.pdf • https://llava-vl.github.io/ • https://github.com/haotian-liu/LLaVA Abstract: Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions and yields a 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
  • 20. LLaVA 1.0 Architecture (diagram; labels: Tokenizing + Embedding)
  • 22. LLaVA (1.5) – Large Language and Vision Assistant • https://arxiv.org/abs/2310.03744 • https://arxiv.org/pdf/2310.03744.pdf • https://huggingface.co/liuhaotian/llava-v1.5-13b Abstract: Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
  • 23. LLaVA 1.5 Changes. With simple modifications to LLaVA, namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, the authors establish stronger baselines that achieve state-of-the-art across 11 benchmarks.
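The MLP projection named on slide 23 is the connector that maps vision-encoder patch features into the language model's embedding space, so that image patches can be fed to the LLM alongside text tokens. The following is a minimal sketch in plain Python, assuming a two-layer MLP with a GELU activation (my understanding of the LLaVA 1.5 connector, not code from the project); all dimensions, weights, inputs, and helper names are hypothetical toy values.

```python
import math
import random


def gelu(x):
    """Gaussian Error Linear Unit activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Apply y = W x + b, where weight holds one row per output dimension."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def random_layer(rng, in_dim, out_dim):
    """Create toy random weights; a trained model would load learned values."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias


def mlp_projector(patch_features, layer1, layer2):
    """Project each vision patch feature into the LLM embedding dimension."""
    projected = []
    for feature in patch_features:
        hidden = [gelu(h) for h in linear(feature, *layer1)]
        projected.append(linear(hidden, *layer2))
    return projected


# Toy sizes (assumptions): real vision and LLM dimensions are in the thousands.
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4
rng = random.Random(0)
layer1 = random_layer(rng, VISION_DIM, LLM_DIM)
layer2 = random_layer(rng, LLM_DIM, LLM_DIM)
patches = [[rng.uniform(-1.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]

image_tokens = mlp_projector(patches, layer1, layer2)
print(len(image_tokens), len(image_tokens[0]))  # 4 16
```

Each of the 4 patch vectors comes out with 16 values, the same width as the toy LLM embedding, which is the property that lets projected image patches sit in the same input sequence as embedded text tokens.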
  • 24. LLaVA 1.5 (figures: Training Data Sources; Performance)
  • 26. LLaVA 1.6. LLaVA-1.6-34B outperforms Gemini Pro on several benchmarks: https://llava-vl.github.io/blog/2024-01-30-llava-1-6/ (figures: Benchmarks; Example)
  • 27. LLaVA-Med • https://arxiv.org/abs/2306.00890 • https://arxiv.org/pdf/2306.00890.pdf • https://github.com/microsoft/LLaVA-Med • https://huggingface.co/microsoft/llava-med-7b-delta Abstract: Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, fine-tuning LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
  • 30. My LLaVA-based Image Classifier Experiment. Full details, results and code: https://github.com/robert-mcdermott/LLM-Image-Classification
  • 31. LLaVA 1.5: Handwritten Digit Classification Abstract https://github.com/robert-mcdermott/LLM-Image-Classification
  • 32. LLaVA 1.5: Animal Classification https://github.com/robert-mcdermott/LLM-Image-Classification
  • 33. LLaVA 1.5: Chess Piece Identification https://github.com/robert-mcdermott/LLM-Image-Classification
  • 34. Live Demo Time
  • 35. Thank you Robert McDermott (he/him) Director: Solutions, Engineering & Architecture (SEA) rmcdermo@fredhutch.org