DALL-E 3: "A detailed graphic that visualizes a multimodal vector embedding space"
Multimodal LLMs
• What are Multimodal Language Models
• Background / How do they work
• LLaVA papers/projects
• LLaVA model demonstration
• Image classification project with LLaVA
Robert McDermott (he/him)
Director: Solutions, Engineering & Architecture (SEA)
rmcdermo@fredhutch.org
Deep Learning Affinity Group (DLAG)
https://research.fredhutch.org/dlag/en.html
Feb 20, 2024
Who Am I?
Link: AI Robert
The Papers to Read
LLaVA 1.0
https://arxiv.org/abs/2304.08485
https://arxiv.org/pdf/2304.08485.pdf
https://llava-vl.github.io/
https://github.com/haotian-liu/LLaVA
LLaVA 1.5
https://arxiv.org/abs/2310.03744
https://arxiv.org/pdf/2310.03744.pdf
https://huggingface.co/liuhaotian/llava-v1.5-13b
LLaVA-Med
https://arxiv.org/abs/2306.00890
https://arxiv.org/pdf/2306.00890.pdf
https://huggingface.co/microsoft/llava-med-7b-delta
https://github.com/microsoft/LLaVA-Med
Multimodal Language Models
Multimodal language models are AI systems designed to understand, interpret, and generate information across different
forms of data, such as text and images. These models leverage large datasets of annotated examples to learn associations
between text and visual content, enabling them to perform tasks that require comprehension of both textual and visual
information.
Why is the
sky blue?
A person wearing a red cap and
sleeveless outfit is soaring through
a cloudless sky on a brightly
colored hang glider.
The sky appears blue because
molecules in the Earth's
atmosphere scatter the shorter
wavelengths of blue light more
than other colors.
Multimodal
Language
Model
I like pizza
Multimodal Language Models
Source: https://twitter.com/GregKamradt/status/1711772496159252981
Use Case Breakdown
Describe
• Animal Identification
• What's in this photo
Interpret
• Technical Flame Graph Interpretation
• Schematic Interpretation
• Twitter Thread Explainer
Recommend
• Food Recommendations
• Website Feedback
• Painting Feedback
Convert
• Figma Screens
• Adobe Lightroom Settings
• Suggest ad copy based on a webpage
Extract
• Structured Data From Driver's License
• Extract structured items from an image
• Handwriting Extraction
Assist
• Excel Formula Helper
• Find My Glasses
• Live Poker Advice
• Video game recommendations
Evaluate
• Dog Cuteness Evaluator
• Bounding Box Evaluator
• Thumbnail Testing
Links to Examples
AI Vision has come a long way.
GPT-4 Vision
LLaVA 1.6 34B
Research scientist and a founding member
at OpenAI. Sr. Director of AI at Tesla.
source: https://karpathy.github.io/2012/10/22/state-of-computer-vision/
2024
2012
What’s funny about this?
Image source: https://www.reddit.com/r/hmmm/comments/ubab5v/hmmm/
LLaVA 1.6 34B
GPT-4 Vision
What’s unusual about this image?
LLaVA 1.6 34B
GPT-4 Vision
Fred Hutchinson Cancer Center
Quick Introduction to Tokens and Embeddings
required to understand how LLMs process text and
images.
Text Tokenization
Tokenization is a foundational step in the preprocessing of text for many natural language processing (NLP) tasks, including for language
models like GPT-4 and Llama-2. Tokenization involves breaking down text into smaller chunks, or "tokens", which can be as short as one
character or as long as one word (or even longer in some cases). These tokens can then be processed, analyzed, and used as input for
machine learning models.
https://platform.openai.com/tokenizer
Tokenization
Visualized
Resulting
Token IDs
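As a rough illustration of the idea above, the sketch below splits text into subword tokens by greedy longest-match against a tiny vocabulary and maps each token to an integer ID. The vocabulary, the IDs, and the `tokenize` function are invented for this example. Production tokenizers such as those used by GPT-4 and Llama-2 learn much larger vocabularies from data and use different algorithms (for example, byte-pair encoding), so treat this as a toy model of the concept only.

```python
# Toy vocabulary: token string -> token ID (hypothetical values, for illustration only).
VOCAB = {
    "token": 0, "ization": 1, "ize": 2, " ": 3, "is": 4, "fun": 5,
    "t": 6, "o": 7, "k": 8, "e": 9, "n": 10, "i": 11, "z": 12,
    "a": 13, "s": 14, "f": 15, "u": 16,
}
UNKNOWN_ID = -1  # assumption: characters outside the vocabulary share one "unknown" ID


def tokenize(text: str) -> list[tuple[str, int]]:
    """Greedy longest-match tokenization against VOCAB."""
    max_len = max(len(tok) for tok in VOCAB)
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until something matches.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + length]
            if piece in VOCAB:
                tokens.append((piece, VOCAB[piece]))
                pos += length
                break
        else:
            # No vocabulary entry covers this character.
            tokens.append((text[pos], UNKNOWN_ID))
            pos += 1
    return tokens


pairs = tokenize("tokenization is fun")
print([tok for tok, _ in pairs])  # the token strings
print([tid for _, tid in pairs])  # the corresponding token IDs
```

Note how a single word ("tokenization") becomes two tokens while short words map to one token each, which mirrors the "as short as one character or as long as one word" behavior described above.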
Vector Embeddings
Applications
• Natural Language Processing tasks: sentiment analysis,
named entity recognition, etc.
• Information retrieval: search engines, recommendation
systems.
• Visualization: using dimensionality reduction to visualize
semantic relationships
https://huggingface.co/spaces/mteb/leaderboard
"A fat tuxedo cat" = *
5.41765615e-02 4.20716889e-02 -2.41547506e-02 1.11813843e-01
-9.33169946e-02 -7.56109739e-03 6.54651076e-02 -1.54011259e-02
-2.80906167e-02 1.97344255e-02 -1.58324391e-02 -8.46638903e-02
-1.31631363e-02 1.98841579e-02 -1.26802064e-02 -9.36008468e-02
-4.51933630e-02 -1.20324306e-02 -2.48974599e-02 4.87890420e-03
-2.54017510e-03 4.92022634e-02 5.12179844e-02 2.54505035e-02
-9.70738381e-02 1.42842624e-02 -3.46412621e-02 -8.45314115e-02
-7.38010108e-02 -2.72879936e-02 -2.81507652e-02 -5.01780510e-02
5.35405474e-03 2.96438616e-02 -5.18742464e-02 -6.24342896e-02
6.04359470e-02 -2.22260728e-02 3.36266570e-02 5.17647602e-02
-3.09484527e-02 -8.72448832e-02 -1.53413722e-02 9.27508809e-03
-4.92608221e-03 -4.97105941e-02 -1.04904985e-02 -4.15333314e-03
1.55722797e-02 -2.66851094e-02 -6.49709478e-02 -5.94373941e-02
-2.10976638e-02 3.59102758e-03 5.88850211e-03 -1.03685725e-02
5.03626876e-02 -3.31290103e-02 -7.70502910e-02 1.53052341e-02
* The "all-MiniLM-L6-v2" embedding model has 384 dimensions
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
Definition
• Representations of text in numerical form.
• Convert variable-length text into fixed-size vectors in high-
dimensional space.
Purpose
• Capture semantic meaning and relationships between words,
phrases, or longer text.
• Enable mathematical operations on text (e.g., similarity
measurement, arithmetic operations).
Characteristics
• Words with similar meanings are close in vector space.
• Allows for operations like "king" - "man" + "woman" ≈ "queen".
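The "king" - "man" + "woman" ≈ "queen" arithmetic can be sketched with a handful of hand-made vectors. The three-dimensional values below are invented for illustration and do not come from any real embedding model; real embeddings have hundreds of dimensions and the analogy only holds approximately. The `analogy` helper is likewise a hypothetical name.

```python
import math

# Hand-made 3-dimensional "embeddings" (hypothetical values, not from a real model).
# The axes loosely stand for: royalty, maleness, femaleness.
EMBEDDINGS = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def analogy(a: str, b: str, c: str) -> str:
    """Return the word nearest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in
              zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    # "Close in vector space" is measured here with Euclidean distance.
    return min(candidates, key=lambda w: math.dist(target, EMBEDDINGS[w]))


print(analogy("king", "man", "woman"))
```

With these toy values the nearest remaining word to the computed vector is "queen", because subtracting "man" removes the maleness component and adding "woman" supplies the femaleness component while royalty is preserved.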
There are many embedding models:
Vector Embeddings
• There are several dozen embedding models
• They range in complexity from 384 to 1536 dimensions
• They range in max sequence length from 512 to 8191 tokens
• Embedding models are generally not compatible with each other
Interactive embedding explorer:
https://blog.echen.me/embedding-explorer/
Semantic Text Similarity
Sentence 1               | Sentence 2                  | Cosine Similarity
The cat sits outside     | The dog plays in the garden | 0.2838
A man is playing guitar  | A woman watches TV          | -0.0327
The new movie is awesome | The new movie is so great   | 0.8939
Jim can run very fast    | James is the fastest runner | 0.6844
My goldfish is hungry    | Pluto is a planet!          | 0.0454
• Measures the cosine of the angle between two vectors.
• Value between -1 and 1, where 1 means the vectors point in the same direction, 0 means
they are orthogonal, and -1 means they point in opposite directions (rare in text embeddings).
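The measure described above is the dot product of the two vectors divided by the product of their lengths. A minimal sketch in plain Python follows; the function name and the short example vectors are chosen for illustration, and a real pipeline would pass in vectors produced by an embedding model.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) = (a . b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))    # same direction, different length
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))    # orthogonal
print(cosine_similarity([1.0, 2.0], [-1.0, -2.0]))  # opposite direction
```

Because the result depends only on direction, two vectors of different lengths can still score (approximately) 1, which is why the measure works for comparing texts of different sizes.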
These clearly used different
embedding models
https://gist.github.com/robert-mcdermott/67cf2623237989bc2315d35a108246ef
Embeddings Plot Tool
https://github.com/robert-mcdermott/embeddings_plot
A command line utility I created to
visualize word embeddings.
Embedding-plot is a command line
utility that visualizes word
embeddings as 2D or 3D scatter
plots, using dimensionality reduction
(PCA, t-SNE or UMAP) and
clustering.
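To give a feel for what dimensionality reduction does before plotting, the sketch below projects a few 4-dimensional vectors onto their first two principal components (PCA). It is not taken from the tool above: the word vectors are invented, and the principal components are found with a simple power iteration in plain Python rather than a library routine, purely to keep the example self-contained.

```python
import math
import random


def center(rows):
    """Subtract the per-dimension mean from every vector."""
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / len(rows) for d in range(dims)]
    return [[r[d] - means[d] for d in range(dims)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    dims, n = len(rows[0]), len(rows)
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(dims)]
            for i in range(dims)]


def top_component(matrix, iterations=200, seed=0):
    """Dominant eigenvector of a symmetric matrix via power iteration."""
    rng = random.Random(seed)
    vec = [rng.random() for _ in matrix]
    for _ in range(iterations):
        vec = [sum(m * v for m, v in zip(row, vec)) for row in matrix]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return vec


def pca_2d(rows):
    """Project vectors onto their first two principal components."""
    centered = center(rows)
    cov = covariance(centered)
    size = len(cov)
    pc1 = top_component(cov)
    # Eigenvalue of pc1 (Rayleigh quotient), used to deflate the matrix.
    eigenvalue = sum(pc1[i] * cov[i][j] * pc1[j]
                     for i in range(size) for j in range(size))
    # Deflation removes the variance explained by pc1 so the next
    # power iteration converges to the second principal direction.
    deflated = [[cov[i][j] - eigenvalue * pc1[i] * pc1[j] for j in range(size)]
                for i in range(size)]
    pc2 = top_component(deflated, seed=1)
    return [(sum(a * b for a, b in zip(r, pc1)),
             sum(a * b for a, b in zip(r, pc2))) for r in centered]


# Invented 4-dimensional "embeddings": three animals, then three vehicles.
WORDS = {
    "cat":   [0.9, 0.8, 0.1, 0.0],
    "dog":   [0.8, 0.9, 0.2, 0.1],
    "horse": [0.7, 0.8, 0.0, 0.2],
    "car":   [0.1, 0.0, 0.9, 0.8],
    "truck": [0.2, 0.1, 0.8, 0.9],
    "bus":   [0.0, 0.2, 0.9, 0.7],
}
points = pca_2d(list(WORDS.values()))
for word, (x, y) in zip(WORDS, points):
    print(f"{word:>5}: ({x:+.3f}, {y:+.3f})")
```

In the printed coordinates the animals and the vehicles fall on opposite sides of the first axis, which is the kind of grouping a 2D scatter plot of real embeddings is meant to reveal.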
Image Embedding Example
source: https://www.researchgate.net/publication/282181243_Learning_Visual_Clothing_Style_with_Heterogeneous_Dyadic_Co-Occurrences
“Visualization of a 2D embedding of the
style space trained with strategic sampling
computed with t-SNE. The embedding is
based on 200,000 images from the test set.
For a clear visual representation, we
discretize the style space into a grid and
pick one image from each grid cell at
random.”
CLIP (Contrastive Language-Image Pre-Training)
Source: Paper: https://arxiv.org/pdf/2103.00020.pdf Code: https://github.com/OpenAI/CLIP
High-Level Architecture
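CLIP trains an image encoder and a text encoder so that matching image-text pairs land close together in a shared embedding space. One consequence is zero-shot classification: embed the image, embed one caption per candidate label, and pick the caption whose embedding is most similar. The sketch below shows only that final comparison step. The embedding values, the captions, and the scale factor of 100.0 are assumptions for this example; a real system would obtain the vectors from CLIP's encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, scale=100.0):
    """Return {caption: probability} from scaled cosine similarities."""
    image = normalize(image_embedding)
    captions = list(text_embeddings)
    sims = [sum(a * b for a, b in zip(image, normalize(text_embeddings[c])))
            for c in captions]
    return dict(zip(captions, softmax([scale * s for s in sims])))


# Stand-in embeddings (hypothetical values, not real CLIP outputs).
image_embedding = [0.8, 0.1, 0.3]
text_embeddings = {
    "a photo of a cat": [0.9, 0.0, 0.2],
    "a photo of a dog": [0.1, 0.9, 0.1],
    "a photo of a car": [0.0, 0.2, 0.9],
}
probs = zero_shot_classify(image_embedding, text_embeddings)
best = max(probs, key=probs.get)
print(best)
```

No classifier is trained for the label set here; changing the candidate captions is enough to change the task, which is what makes this style of model a convenient vision front end for systems like LLaVA.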
The LLaVA Papers
LLaVA 1.0 – Large Language and Vision Assistant
• https://arxiv.org/abs/2304.08485
• https://arxiv.org/pdf/2304.08485.pdf
• https://llava-vl.github.io/
• https://github.com/haotian-liu/LLaVA
Instruction tuning large language models (LLMs) using machine-
generated instruction-following data has improved zero-shot
capabilities on new tasks, but the idea is less explored in the
multimodal field. In this paper, we present the first attempt to use
language-only GPT-4 to generate multimodal language-image
instruction-following data. By instruction tuning on such generated
data, we introduce LLaVA: Large Language and Vision Assistant, an
end-to-end trained large multimodal model that connects a vision
encoder and LLM for general-purpose visual and language
understanding. Our early experiments show that LLaVA
demonstrates impressive multimodal chat abilities, sometimes
exhibiting the behaviors of multimodal GPT-4 on unseen
images/instructions and yields a 85.1% relative score compared
with GPT-4 on a synthetic multimodal instruction-following dataset.
When fine-tuned on Science QA, the synergy of LLaVA and GPT-4
achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4
generated visual instruction tuning data, our model and code base
publicly available.
Abstract
LLaVA 1.0 Training
LLaVA 1.0 Architecture
Tokenizing + Embedding Tokenizing + Embedding
LLaVA 1.0 Performance
• https://arxiv.org/abs/2310.03744
• https://arxiv.org/pdf/2310.03744.pdf
• https://huggingface.co/liuhaotian/llava-v1.5-13b
Large multimodal models (LMM) have recently shown
encouraging progress with visual instruction tuning. In this
note, we show that the fully-connected vision-language cross-
modal connector in LLaVA is surprisingly powerful and data-
efficient. With simple modifications to LLaVA, namely, using
CLIP-ViT-L-336px with an MLP projection and adding academic-
task-oriented VQA data with simple response formatting
prompts, we establish stronger baselines that achieve state-of-
the-art across 11 benchmarks.
Our final 13B checkpoint uses merely 1.2M publicly available
data, and finishes full training in ~1 day on a single 8-A100
node.
We hope this can make state-of-the-art LMM research more
accessible. Code and model will be publicly available.
Abstract
LLaVA (1.5) – Large Language and Vision Assistant
LLaVA 1.5 Changes
LLaVA 1.5 makes simple modifications to LLaVA: it uses CLIP-ViT-L-336px with an MLP projection and adds
academic-task-oriented VQA data with simple response formatting prompts. These changes establish stronger
baselines that achieve state-of-the-art across 11 benchmarks.
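To make the "MLP projection" change concrete, the sketch below contrasts a single linear projection (the connector style of the original LLaVA) with a two-layer MLP using a GELU activation (the style LLaVA 1.5 moved to). Both map a vision-encoder feature vector into the language model's embedding space. The dimensions and weights are tiny made-up values and the function names are hypothetical; real connectors use learned weights with thousands of dimensions.

```python
import math


def linear(vec, weights, bias):
    """y = W x + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def gelu(x):
    """Exact GELU activation: x * Phi(x), Phi being the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear_projector(feature, w, b):
    """Single linear layer from vision feature space to LLM embedding space."""
    return linear(feature, w, b)


def mlp_projector(feature, w1, b1, w2, b2):
    """Two linear layers with a GELU non-linearity in between."""
    hidden = [gelu(h) for h in linear(feature, w1, b1)]
    return linear(hidden, w2, b2)


# Toy sizes (assumptions): vision feature dim 3 -> LLM embedding dim 2.
patch_feature = [1.0, -2.0, 0.5]
W1 = [[0.2, 0.1, 0.0], [0.0, -0.3, 0.4]]
B1 = [0.0, 0.1]
W2 = [[1.0, 0.5], [-0.5, 1.0]]
B2 = [0.0, 0.0]

linear_out = linear_projector(patch_feature, W1, B1)
mlp_out = mlp_projector(patch_feature, W1, B1, W2, B2)
print(linear_out)
print(mlp_out)
```

Either way, the output has the same dimensionality as a text token embedding, so projected image features can be placed in the same input sequence as the embedded text tokens.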
LLaVA 1.5
Training Data Sources Performance
LLaVA 1.5 Comparison Examples
LLaVA 1.6
LLaVA-1.6-34B outperforms Gemini Pro on several benchmarks
https://llava-vl.github.io/blog/2024-01-30-llava-1-6/
Benchmarks Example
• https://arxiv.org/abs/2306.00890
• https://arxiv.org/pdf/2306.00890.pdf
Conversational generative AI has demonstrated remarkable promise for empowering
biomedical practitioners, but current investigations focus on unimodal text.
Multimodal conversational AI has seen rapid progress by leveraging billions of
image-text pairs from the public web, but such general-domain vision-language
models still lack sophistication in understanding and conversing about biomedical
images. In this paper, we propose a cost-efficient approach for training a vision language
conversational assistant that can answer open-ended research questions
of biomedical images. The key idea is to leverage a large-scale, broad-coverage
biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to
self-instruct open-ended instruction-following data from the captions, and then
fine-tune a large general-domain vision-language model using a novel curriculum
learning method. Specifically, the model first learns to align biomedical vocabulary
using the figure-caption pairs as is, then learns to master open-ended conversational
semantics using GPT-4 generated instruction-following data, broadly mimicking
how a layperson gradually acquires biomedical knowledge. This enables us to train
a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less
than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational
capability and can follow open-ended instruction to assist with inquiries
about a biomedical image. On three standard biomedical visual question answering
datasets, fine-tuning LLaVA-Med outperforms previous supervised state-of-the-art
on certain metrics. To facilitate biomedical multimodal research, we will release
our instruction-following data and the LLaVA-Med model.
Abstract
LLaVA-Med
https://github.com/microsoft/LLaVA-Med
https://huggingface.co/microsoft/llava-med-7b-delta
LLaVA-Med
LLaVA-Med Comparison Examples
My LLaVA based Image Classifier Experiment
Full details, results and code: https://github.com/robert-mcdermott/LLM-Image-Classification
LLaVA 1.5: Handwritten Digit Classification
Abstract
https://github.com/robert-mcdermott/LLM-Image-Classification
LLaVA 1.5: Animal Classification
https://github.com/robert-mcdermott/LLM-Image-Classification
LLaVA 1.5: Chess Piece Identification
https://github.com/robert-mcdermott/LLM-Image-Classification
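One way to use a chat-style multimodal model as an image classifier is to ask it to choose from a fixed list of labels and then map its free-text reply back onto that list. The sketch below shows that prompt-and-parse loop and an accuracy calculation, with a stub standing in for the model call. The function names, prompt wording, file names, and canned replies are all assumptions for illustration and are not taken from the linked repository.

```python
from typing import Callable, Optional, Sequence


def build_prompt(labels: Sequence[str]) -> str:
    """Ask the model to answer with one label from a closed set."""
    return "Classify the image. Answer with exactly one of: " + ", ".join(labels) + "."


def parse_label(reply: str, labels: Sequence[str]) -> Optional[str]:
    """Return the allowed label mentioned earliest in the reply, or None."""
    lowered = reply.lower()
    hits = [(lowered.find(label.lower()), label)
            for label in labels if label.lower() in lowered]
    return min(hits)[1] if hits else None


def evaluate(samples, labels, ask_model: Callable[[str, str], str]) -> float:
    """Fraction of (image_path, true_label) samples the model labels correctly."""
    prompt = build_prompt(labels)
    correct = 0
    for image_path, truth in samples:
        predicted = parse_label(ask_model(image_path, prompt), labels)
        correct += predicted == truth
    return correct / len(samples)


def fake_model(image_path: str, prompt: str) -> str:
    """Stub in place of a real multimodal model call (hypothetical replies)."""
    canned = {
        "img1.png": "This is a knight.",
        "img2.png": "Rook",
        "img3.png": "I am not sure.",
    }
    return canned[image_path]


labels = ["pawn", "knight", "bishop", "rook", "queen", "king"]
samples = [("img1.png", "knight"), ("img2.png", "rook"), ("img3.png", "queen")]
accuracy = evaluate(samples, labels, fake_model)
print(f"accuracy: {accuracy:.2f}")
```

Replies that mention no allowed label are counted as wrong, which keeps the accuracy figure honest when the model declines to answer or responds off-topic. Substring matching is a simplification and would need care for labels that are contained in other words.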
Live Demo Time
Thank you
Robert McDermott (he/him)
Director: Solutions, Engineering & Architecture (SEA)
rmcdermo@fredhutch.org

 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 

Introduction to Multimodal LLMs with LLaVA

  • 1. Multimodal LLMs (image: DALL-E 3, "A detailed graphic that visualizes a multimodal vector embedding space")
      Agenda: What are Multimodal Language Models; Background / How do they work; LLaVA papers/projects; LLaVA model demonstration; Image classification project with LLaVA
      Robert McDermott (he/him), Director: Solutions, Engineering & Architecture (SEA), rmcdermo@fredhutch.org
      Deep Learning Affinity Group (DLAG): https://research.fredhutch.org/dlag/en.html
      Feb 20, 2024
  • 2. Who Am I? Link: AI Robert
  • 3. The Papers to Read
      LLaVA 1.0: https://arxiv.org/abs/2304.08485 https://arxiv.org/pdf/2304.08485.pdf https://llava-vl.github.io/ https://github.com/haotian-liu/LLaVA
      LLaVA 1.5: https://arxiv.org/abs/2310.03744 https://arxiv.org/pdf/2310.03744.pdf https://huggingface.co/liuhaotian/llava-v1.5-13b
      LLaVA-Med: https://arxiv.org/abs/2306.00890 https://arxiv.org/pdf/2306.00890.pdf https://huggingface.co/microsoft/llava-med-7b-delta https://github.com/microsoft/LLaVA-Med
  • 4. Multimodal Language Models: Multimodal language models are AI systems designed to understand, interpret, and generate information across different forms of data, such as text and images. These models leverage large datasets of annotated examples to learn associations between text and visual content, enabling them to perform tasks that require comprehension of both textual and visual information.
      Diagram text (inputs and outputs of a Multimodal Language Model): "Why is the sky blue?" / "The sky appears blue because molecules in the Earth's atmosphere scatter the shorter wavelengths of blue light more than other colors." / "A person wearing a red cap and sleeveless outfit is soaring through a cloudless sky on a brightly colored hang glider." / "I like pizza"
  • 5. Multimodal Language Models: Use Case Breakdown (source: https://twitter.com/GregKamradt/status/1711772496159252981 ; Links to Examples)
      Describe: Animal Identification; What's in this photo
      Interpret: Technical Flame Graph Interpretation; Schematic Interpretation; Twitter Thread Explainer
      Recommend: Food Recommendations; Website Feedback; Painting Feedback
      Convert: Figma Screens; Adobe Lightroom Settings; Suggest ad copy based on a webpage
      Extract: Structured Data From Driver's License; Extract structured items from an image; Handwriting Extraction
      Assist: Excel Formula Helper; Find My Glasses; Live Poker Advice; Video game recommendations
      Evaluate: Dog Cuteness Evaluator; Bounding Box Evaluator; Thumbnail Testing
  • 6. AI Vision has come a long way. 2012: a blog post on the state of computer vision by Andrej Karpathy (research scientist and a founding member at OpenAI; Sr. Director of AI at Tesla), source: https://karpathy.github.io/2012/10/22/state-of-computer-vision/ 2024: GPT-4 Vision and LLaVA 1.6 34B.
  • 7. What’s funny about this? Image source: https://www.reddit.com/r/hmmm/comments/ubab5v/hmmm/ Models shown: LLaVA 1.6 34B, GPT-4 Vision
  • 8. What’s unusual about this image? Models shown: LLaVA 1.6 34B, GPT-4 Vision
  • 9. Fred Hutchinson Cancer Center: Quick introduction to the tokens and embeddings required to understand how LLMs process text and images.
  • 10. Text Tokenization: Tokenization is a foundational step in the preprocessing of text for many natural language processing (NLP) tasks, including for language models like GPT-4 and Llama-2. Tokenization involves breaking down text into smaller chunks, or "tokens", which can be as short as one character or as long as one word (or even longer in some cases). These tokens can then be processed, analyzed, and used as input for machine learning models. Slide figures: Tokenization Visualized; Resulting Token IDs ( https://platform.openai.com/tokenizer )
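
To make the idea on the tokenization slide concrete, here is a minimal sketch of turning text into token IDs. It is a toy: the vocabulary and the IDs are invented for illustration, and the greedy longest-match rule is a simplification standing in for the learned subword schemes (such as byte-pair encoding) that production tokenizers use.

```python
# Hypothetical vocabulary: token strings mapped to made-up integer IDs.
TOY_VOCAB = {"token": 0, "ization": 1, "ize": 2, " is": 3, " fun": 4, "!": 5}


def tokenize(text, vocab):
    """Greedy longest-match tokenization.

    Returns a list of (token, id) pairs. Characters that match no vocabulary
    entry become single-character tokens with the placeholder ID -1.
    """
    max_len = max(len(token) for token in vocab)
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                pieces.append((piece, vocab[piece]))
                i += size
                break
        else:
            pieces.append((text[i], -1))
            i += 1
    return pieces


pieces = tokenize("tokenization is fun!", TOY_VOCAB)
token_ids = [token_id for _, token_id in pieces]
print(pieces)
```

Note that one word ("tokenization") becomes two tokens while " is" keeps its leading space, which mirrors the slide's point that tokens can be shorter or longer than a word.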
  • 11. Vector Embeddings
      Definition: representations of text in numerical form; they convert variable-length text into fixed-size vectors in high-dimensional space.
      Purpose: capture semantic meaning and relationships between words, phrases, or longer text; enable mathematical operations on text (e.g., similarity measurement, arithmetic operations).
      Characteristics: words with similar meanings are close in vector space; this allows for operations like "king" - "man" + "woman" ≈ "queen".
      Applications: natural language processing tasks (sentiment analysis, named entity recognition, etc.); information retrieval (search engines, recommendation systems); visualization (using dimensionality reduction to visualize semantic relationships).
      There are many embedding models: https://huggingface.co/spaces/mteb/leaderboard
      Example: "A fat tuxedo cat" = a vector of values such as 5.41765615e-02, 4.20716889e-02, -2.41547506e-02, 1.11813843e-01, -9.33169946e-02, ... (excerpt of the values shown on the slide; the "all-MiniLM-L6-v2" embedding model has 384 dimensions, https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 )
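
The "king" - "man" + "woman" ≈ "queen" arithmetic from the embeddings slide can be sketched with hand-made vectors. Everything below is an assumption for illustration: the three dimensions (roughly "royalty", "male", "female") and all values are invented, whereas real embedding models learn hundreds of dimensions that have no such clean interpretation.

```python
# Invented 3-dimensional "embeddings": [royalty, male, female].
TOY_EMBEDDINGS = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.0, 0.0, 0.0],
}


def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def subtract(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def nearest_word(vector, embeddings):
    """Return the word whose embedding has the smallest squared Euclidean distance."""
    def squared_distance(word):
        return sum((x - y) ** 2 for x, y in zip(vector, embeddings[word]))

    return min(embeddings, key=squared_distance)


target = add(
    subtract(TOY_EMBEDDINGS["king"], TOY_EMBEDDINGS["man"]),
    TOY_EMBEDDINGS["woman"],
)
result = nearest_word(target, TOY_EMBEDDINGS)
print(target, result)
```

With these toy values the arithmetic lands exactly on the "queen" vector; with learned embeddings the result is only approximately equal, which is why a nearest-neighbor lookup is used.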
  • 12. Vector Embeddings: There are several dozen embedding models. They range in size from 384 to 1536 dimensions and in max sequence length from 512 to 8191 tokens. Embedding models are generally not compatible with each other. Interactive embedding explorer: https://blog.echen.me/embedding-explorer/
  • 13. Semantic Text Similarity
      Sentence 1               | Sentence 2                  | Cosine Similarity
      The cat sits outside     | The dog plays in the garden | 0.2838
      A man is playing guitar  | A woman watches TV          | -0.0327
      The new movie is awesome | The new movie is so great   | 0.8939
      Jim can run very fast    | James is the fastest runner | 0.6844
      My goldfish is hungry    | Pluto is a planet!          | 0.0454
      Cosine similarity measures the cosine of the angle between two vectors. Its value lies between -1 and 1: 1 means the vectors point in the same direction, 0 means they are orthogonal, and -1 means they are diametrically opposite (rare in text embeddings).
      These clearly used different embedding models
      https://gist.github.com/robert-mcdermott/67cf2623237989bc2315d35a108246ef
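
The cosine similarity measure on the semantic text similarity slide is short enough to write out. This is a plain-Python sketch, not the code from the linked gist; the example vectors are arbitrary two-dimensional values rather than real sentence embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity is (approximately) 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity is 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Opposite direction: similarity is (approximately) -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

Because the measure depends only on direction, two vectors do not need to be identical to score 1; scaling a vector leaves its similarity to any other vector unchanged.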
  • 14. Embeddings Plot Tool: https://github.com/robert-mcdermott/embeddings_plot A command line utility I created to visualize word embeddings. Embedding-plot visualizes word embeddings in 2D or 3D scatter plots using dimensionality reduction techniques (PCA, t-SNE or UMAP) and clustering.
  • 15. Image Embedding Example. Source: https://www.researchgate.net/publication/282181243_Learning_Visual_Clothing_Style_with_Heterogeneous_Dyadic_Co-Occurrences “Visualization of a 2D embedding of the style space trained with strategic sampling computed with t-SNE. The embedding is based on 200,000 images from the test set. For a clear visual representation, we discretize the style space into a grid and pick one image from each grid cell at random.”
  • 16. CLIP (Contrastive Language-Image Pre-Training): High-Level Architecture. Paper: https://arxiv.org/pdf/2103.00020.pdf Code: https://github.com/OpenAI/CLIP
  • 17. Fred Hutchinson Cancer Center: The LLaVA Papers
  • 18. LLaVA 1.0 – Large Language and Vision Assistant
      Links: https://arxiv.org/abs/2304.08485 https://arxiv.org/pdf/2304.08485.pdf https://llava-vl.github.io/ https://github.com/haotian-liu/LLaVA
      Abstract: Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make GPT-4 generated visual instruction tuning data, our model and code base publicly available.
  • 20. LLaVA 1.0 Architecture (diagram annotation: Tokenizing + Embedding)
  • 22. LLaVA (1.5) – Large Language and Vision Assistant
      Links: https://arxiv.org/abs/2310.03744 https://arxiv.org/pdf/2310.03744.pdf https://huggingface.co/liuhaotian/llava-v1.5-13b
      Abstract: Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
  • 23. LLaVA 1.5 Changes: With simple modifications to LLaVA, namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, the authors establish stronger baselines that achieve state-of-the-art results across 11 benchmarks.
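
As a rough illustration of "CLIP-ViT-L-336px with an MLP projection", the sketch below counts the image patches a vision transformer produces and pushes one patch feature through a tiny two-layer MLP. Several things here are assumptions not stated on the slide: the 14-pixel patch size, the GELU activation between the two layers, the toy dimensions (8 and 16), and the random weights. Real projectors map much larger vision features into the language model's embedding size using trained weights.

```python
import math
import random


def num_patches(image_size, patch_size):
    """Number of square patches (visual tokens) for a square image."""
    side = image_size // patch_size
    return side * side


def gelu(x):
    """Exact GELU activation."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_layer(rng, n_in, n_out):
    """Random weights and zero biases for a toy fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(vector, weights, bias):
    """Apply y = W x + b."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(weights, bias)]


def project(patch_feature, layer1, layer2):
    """Two-layer MLP: linear, GELU, linear."""
    hidden = [gelu(h) for h in linear(patch_feature, *layer1)]
    return linear(hidden, *layer2)


rng = random.Random(0)
VISION_DIM, LLM_DIM = 8, 16  # toy sizes, far smaller than real models
layer1 = make_layer(rng, VISION_DIM, LLM_DIM)
layer2 = make_layer(rng, LLM_DIM, LLM_DIM)

patch_feature = [rng.uniform(-1.0, 1.0) for _ in range(VISION_DIM)]
visual_token = project(patch_feature, layer1, layer2)

print(num_patches(336, 14), len(visual_token))  # prints: 576 16
```

Under the patch-size assumption, a 336-pixel input yields 576 visual tokens versus 256 for a 224-pixel input, which is one way to see why the higher-resolution encoder gives the language model more visual detail to work with.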
  • 24. LLaVA 1.5: Training Data Sources; Performance
  • 26. LLaVA 1.6: LLaVA-1.6-34B outperforms Gemini Pro on several benchmarks ( https://llava-vl.github.io/blog/2024-01-30-llava-1-6/ ). Slide figures: Benchmarks; Example
  • 27. LLaVA-Med
      Links: https://arxiv.org/abs/2306.00890 https://arxiv.org/pdf/2306.00890.pdf https://github.com/microsoft/LLaVA-Med https://huggingface.co/microsoft/llava-med-7b-delta
      Abstract: Conversational generative AI has demonstrated remarkable promise for empowering biomedical practitioners, but current investigations focus on unimodal text. Multimodal conversational AI has seen rapid progress by leveraging billions of image-text pairs from the public web, but such general-domain vision-language models still lack sophistication in understanding and conversing about biomedical images. In this paper, we propose a cost-efficient approach for training a vision language conversational assistant that can answer open-ended research questions of biomedical images. The key idea is to leverage a large-scale, broad-coverage biomedical figure-caption dataset extracted from PubMed Central, use GPT-4 to self-instruct open-ended instruction-following data from the captions, and then fine-tune a large general-domain vision-language model using a novel curriculum learning method. Specifically, the model first learns to align biomedical vocabulary using the figure-caption pairs as is, then learns to master open-ended conversational semantics using GPT-4 generated instruction-following data, broadly mimicking how a layperson gradually acquires biomedical knowledge. This enables us to train a Large Language and Vision Assistant for BioMedicine (LLaVA-Med) in less than 15 hours (with eight A100s). LLaVA-Med exhibits excellent multimodal conversational capability and can follow open-ended instruction to assist with inquiries about a biomedical image. On three standard biomedical visual question answering datasets, fine-tuning LLaVA-Med outperforms previous supervised state-of-the-art on certain metrics. To facilitate biomedical multimodal research, we will release our instruction-following data and the LLaVA-Med model.
  • 30. Fred Hutchinson Cancer Center: My LLaVA-based Image Classifier Experiment. Full details, results and code: https://github.com/robert-mcdermott/LLM-Image-Classification
  • 31. LLaVA 1.5: Handwritten Digit Classification https://github.com/robert-mcdermott/LLM-Image-Classification
  • 32. LLaVA 1.5: Animal Classification https://github.com/robert-mcdermott/LLM-Image-Classification
  • 33. LLaVA 1.5: Chess Piece Identification https://github.com/robert-mcdermott/LLM-Image-Classification
  • 34. Fred Hutchinson Cancer Center: Live Demo Time
  • 35. Thank you Robert McDermott (he/him) Director: Solutions, Engineering & Architecture (SEA) rmcdermo@fredhutch.org