© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Unraveling Multimodality with LLMs
Alex Coqueiro
O P E N A I + D A T A F O R U M
Director of Solutions Architecture for Canada Public Sector
AWS
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Multimodality
refers to a concept that utilizes multiple methods of
communication or representation
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Multimodal
Intelligence as a
service
Question
Answer
Medical
Staff
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Foundation Models (FM) as the heart of LLM
Text generation
Summarization
Information
extraction
Q&A
Chatbot
Pretrain Adapt
Tasks
Unlabeled
data
FM
Text generation
Summarization
Information
extraction
Q&A
Chatbot
Train Deploy
Tasks
ML
models
…
…
…
…
Labeled
data
…
…
…
…
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Precision
Speed
Cost
Model Evaluation Score HIL/LLM Feedback
FM1 5/5 <Feedback summary>
FM2 4/5 <Feedback summary>
FM3 3/5 <Feedback summary>
Model Cost
FM1 $$$$
FM2 $
FM3 $$$
Model Speed
FM1 ⚡⚡
FM2 ⚡
FM3 ⚡
Finding the best FM based on your use case
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Our Application (v1)
Foundational
Model
Question + Context
Answer
Medical
Staff
Could you suggest me ways to prevent allergy?
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Meta Llama
Source: https://ai.meta.com/resources/models-and-libraries/llama/
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
… New Requirements …
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Adding Multimodal Capabilities to FMs
text2text
text2image
image2text
text2audio
text2video
Pretrain Adapt
Tasks Capability
Unlabeled
data
FM
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Multimodality Business Use Case
Same Product
Image
Title
Title
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Advertising
tailors ads with deep
understanding of user
preferences with multimodal
query representation
Search Engine
Based on multimodal query
understanding
Recommendation
System
It recommends based on
diverse data sources effectively
Robotics
Enhances robotics with natural
language understanding and
decision-making
Assistant (E.g. Chat)
Enhance assistant capability
adding visual analysis
Query Suggestion
Guidance on the best way to
explore the image content
based on Multimodal query
suggestion
Multimodal Application Examples
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Visual Instruction Tuning (Llava)
• Employ LLM's reasoning capability for vision based tasks
• Addressing VQA (Visual Question Answering)
• Instruction-tuning for images with pre-trained image captions
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Hugging Face – llava-v1.6-mistral-7b
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Our Application (v2) – Multimodal Retrieval
C O L L E C T D A T A F R O M M U L T I M O D A L D O C U M E N T S
Llava
Mistral 7B
Question + Context
Answer
Medical
Staff
Could you summarize the main points
of these data?
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo
T E X T E X T R A C T
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo
M U L T I M O D A L U N D E R S T A N D I N G
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Demo
R E A S O N I N G
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved.
… Increasing Complexity …
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Latent Diffusion Models
24
https://www.amazon.science/blog/virtual-try-all-
visualizing-any-product-in-any-personal-setting
https://arxiv.org/abs/2401.13795
• Enriching Image Conditioned Inpainting in Latent
Diffusion Models
• Multimodal retrieval tasks x Multimodal generation
(different problem)
• E.g. Virtual try-all: Visualizing any product in any
personal setting
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Brief Overview of Diffusion Models
- “destroy” the data by gradually adding
small amounts of gaussian noise
- “create” data by gradually denoising a
noisy code from a stationary
distribution
Animations from https://yang-song.github.io/blog/2021/score/
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Stable Diffusion 2.0 with Fine-tuning
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Depth to Image Model (Stable Diffusion 2.0)
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
UIs / Plug-Ins for Photoshop, GIMP etc
28
https://twitter.com/wbuchw/status/1563162131024920576
https://github.com/lkwq007/stablediffusion-infinity
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Our Application (v3) – Multimodal Generation
G E N E R A T E I M A G E W I T H S T A B L E D I F F U S I O N
Question + Context
Answer
Medical
Staff
Create an image about the patient's journey from admission to discharge for my
clinical report
Stable Difusion – SDXL (or SD 2.1)
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
… Balancing Multiple Tasks …
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
LLM Multimodel
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
LangChain
A framework to simplify applications using
an LLM
Provides a common way of accessing APIs
of different LLMs
Helps with learning how to use LLMs but
may be too restrictive for some use cases
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agents can take Actions
Instructions: “you are an health assistant, helping
nurses understand about patients”
Health Assistant
What’s the Joan’s
insurance?
Insurance ACME
Vector DB
Patient
Medical records
Actions
Booking Procedure
In: name, procedure
Out: protocol number
Please book
Patients’s anesthesia
Approved and
protocol is 12345
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Agent Orchestration
Task Langchain
Final
response
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Unimodel
Multimodal Multimodel
Closings
© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Thank you!
Alex Coqueiro
Director of Solutions Architecture for Canada Public Sector
AWS

Unraveling Multimodality with Large Language Models.pdf

  • 1.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Unraveling Multimodality with LLMs Alex Coqueiro O P E N A I + D A T A F O R U M Director of Solutions Architecture for Canada Public Sector AWS
  • 2.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved.
  • 3.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved.
  • 4.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Multimodality refers to a concept that utilizes multiple methods of communication or representation
  • 5.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Multimodal Intelligence as a service Question Answer Medical Staff
  • 6.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Foundation Models (FM) as the heart of LLM Text generation Summarization Information extraction Q&A Chatbot Pretrain Adapt Tasks Unlabeled data FM Text generation Summarization Information extraction Q&A Chatbot Train Deploy Tasks ML models … … … … Labeled data … … … …
  • 7.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved.
  • 8.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Precision Speed Cost Model Evaluation Score HIL/LLM Feedback FM1 5/5 <Feedback summary> FM2 4/5 <Feedback summary> FM3 3/5 <Feedback summary> Model Cost FM1 $$$$ FM2 $ FM3 $$$ Model Speed FM1 ⚡⚡ FM2 ⚡ FM3 ⚡ Finding the best FM based on your use case
  • 9.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Our Application (v1) Foundational Model Question + Context Answer Medical Staff Could you suggest me ways to prevent allergy?
  • 10.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Meta Llama Source: https://ai.meta.com/resources/models-and-libraries/llama/
  • 11.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved.
  • 12.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved.
  • 13.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. … New Requirements …
  • 14.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Adding Multimodal Capabilities to FMs text2text text2image image2text text2audio text2video Pretrain Adapt Tasks Capability Unlabeled data FM
  • 15.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Multimodality Business Use Case Same Product Image Title Title
  • 16.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Advertising tailors ads with deep understanding of user preferences with multimodal query representation Search Engine Based on multimodal query understanding Recommendation System It recommends based on diverse data sources effectively Robotics Enhances robotics with natural language understanding and decision-making Assistant (E.g. Chat) Enhance assistant capability adding visual analysis Query Suggestion Guidance on the best way to explore the image content based on Multimodal query suggestion Multimodal Application Examples
  • 17.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Visual Instruction Tuning (Llava) • Employ LLM's reasoning capability for vision based tasks • Addressing VQA (Visual Question Answering) • Instruction-tuning for images with pre-trained image captions
  • 18.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Hugging Face – llava-v1.6-mistral-7b
  • 19.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Our Application (v2) – Multimodal Retrieval C O L L E C T D A T A F R O M M U L T I M O D A L D O C U M E N T S Llava Mistral 7B Question + Context Answer Medical Staff Could you summarize the main points of these data?
  • 20.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Demo T E X T E X T R A C T
  • 21.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Demo M U L T I M O D A L U N D E R S T A N D I N G
  • 22.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Demo R E A S O N I N G
  • 23.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. © 2018, Amazon Web Services, Inc. or Its Affiliates. All rights reserved. … Increasing Complexity …
  • 24.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Latent Diffusion Models 24 https://www.amazon.science/blog/virtual-try-all- visualizing-any-product-in-any-personal-setting https://arxiv.org/abs/2401.13795 • Enriching Image Conditioned Inpainting in Latent Diffusion Models • Multimodal retrieval tasks x Multimodal generation (different problem) • E.g. Virtual try-all: Visualizing any product in any personal setting
  • 25.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Brief Overview of Diffusion Models - “destroy” the data by gradually adding small amounts of gaussian noise - “create” data by gradually denoising a noisy code from a stationary distribution Animations from https://yang-song.github.io/blog/2021/score/
  • 26.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Stable Diffusion 2.0 with Fine-tuning
  • 27.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Depth to Image Model (Stable Diffusion 2.0)
  • 28.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. UIs / Plug-Ins for Photoshop, GIMP etc 28 https://twitter.com/wbuchw/status/1563162131024920576 https://github.com/lkwq007/stablediffusion-infinity
  • 29.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Our Application (v3) – Multimodal Generation G E N E R A T E I M A G E W I T H S T A B L E D I F F U S I O N Question + Context Answer Medical Staff Create an image about the patient's journey from admission to discharge for my clinical report Stable Difusion – SDXL (or SD 2.1)
  • 30.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved.
  • 31.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. … Balancing Multiple Tasks …
  • 32.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. LLM Multimodel
  • 33.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. LangChain A framework to simplify applications using an LLM Provides a common way of accessing APIs of different LLMs Helps with learning how to use LLMs but may be too restrictive for some use cases
  • 34.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved.
  • 35.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Agents can take Actions Instructions: “you are an health assistant, helping nurses understand about patients” Health Assistant What’s the Joan’s insurance? Insurance ACME Vector DB Patient Medical records Actions Booking Procedure In: name, procedure Out: protocol number Please book Patients’s anesthesia Approved and protocol is 12345
  • 36.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Agent Orchestration Task Langchain Final response
  • 37.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Unimodel Multimodal Multimodel Closings
  • 38.
    © 2024, AmazonWeb Services, Inc. or its affiliates. All rights reserved. Thank you! Alex Coqueiro Director of Solutions Architecture for Canada Public Sector AWS