Unraveling Multimodality with Large Language Models.pdf

© 2024, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Unraveling Multimodality with LLMs
Alex Coqueiro
O P E N A I + D A T A F O R U M
Director of Solutions Architecture for Canada Public Sector
AWS

Multimodality
refers to a concept that utilizes multiple methods of
communication or representation

Multimodal
Intelligence as a
service
Question
Answer
Medical
Staff

Foundation Models (FM) as the heart of LLM
Text generation
Summarization
Information
extraction
Q&A
Chatbot
Pretrain Adapt
Tasks
Unlabeled
data
FM
Text generation
Summarization
Information
extraction
Q&A
Chatbot
Train Deploy
Tasks
ML
models
…
…
…
…
Labeled
data
…
…
…
…

Precision
Speed
Cost
Model Evaluation Score HIL/LLM Feedback
FM1 5/5 <Feedback summary>
Model Cost
FM1 $$$$
FM2 $
FM3 $$$
Model Speed
FM1 ⚡⚡
FM2 ⚡
FM3 ⚡
Finding the best FM based on your use case

Our Application (v1)
Foundational
Model
Question + Context
Answer
Medical
Staﬀ
Could you suggest me ways to prevent allergy?

Meta Llama
Source: https://ai.meta.com/resources/models-and-libraries/llama/

… New Requirements …

Adding Multimodal Capabilities to FMs
text2text
text2image
image2text
text2audio
text2video
Pretrain Adapt
Tasks Capability
Unlabeled
data
FM

Multimodality Business Use Case
Same Product
Image
Title
Title

Advertising
tailors ads with deep
understanding of user
preferences with multimodal
query representation
Search Engine
Based on multimodal query
understanding
Recommendation
System
It recommends based on
diverse data sources effectively
Robotics
Enhances robotics with natural
language understanding and
decision-making
Assistant (E.g. Chat)
Enhance assistant capability
adding visual analysis
Query Suggestion
Guidance on the best way to
explore the image content
based on Multimodal query
suggestion
Multimodal Application Examples

Visual Instruction Tuning (Llava)
• Employ LLM's reasoning capability for vision based tasks
• Addressing VQA (Visual Question Answering)
• Instruction-tuning for images with pre-trained image captions

Hugging Face – llava-v1.6-mistral-7b

Our Application (v2) – Multimodal Retrieval
C O L L E C T D A T A F R O M M U L T I M O D A L D O C U M E N T S
Llava
Mistral 7B
Question + Context
Answer
Medical
Staff
Could you summarize the main points
of these data?

Demo
T E X T E X T R A C T

Demo
M U L T I M O D A L U N D E R S T A N D I N G

Demo
R E A S O N I N G

© 2018, Amazon Web Services, Inc. or Its Aﬃliates. All rights reserved.
… Increasing Complexity …

© 2024, Amazon Web Services, Inc. or its aﬃliates. All rights reserved.
Latent Diffusion Models
24
https://www.amazon.science/blog/virtual-try-all-
visualizing-any-product-in-any-personal-setting
https://arxiv.org/abs/2401.13795
• Enriching Image Conditioned Inpainting in Latent
Diffusion Models
• Multimodal retrieval tasks x Multimodal generation
(different problem)
• E.g. Virtual try-all: Visualizing any product in any
personal setting

Brief Overview of Diffusion Models
- “destroy” the data by gradually adding
small amounts of gaussian noise
- “create” data by gradually denoising a
noisy code from a stationary
distribution
Animations from https://yang-song.github.io/blog/2021/score/

Stable Diﬀusion 2.0 with Fine-tuning

Depth to Image Model (Stable Diffusion 2.0)

UIs / Plug-Ins for Photoshop, GIMP etc
28
https://twitter.com/wbuchw/status/1563162131024920576
https://github.com/lkwq007/stablediffusion-infinity

Our Application (v3) – Multimodal Generation
G E N E R A T E I M A G E W I T H S T A B L E D I F F U S I O N
Question + Context
Answer
Medical
Staff
Create an image about the patient's journey from admission to discharge for my
clinical report
Stable Difusion – SDXL (or SD 2.1)

… Balancing Multiple Tasks …

LLM Multimodel

LangChain
A framework to simplify applications using
an LLM
Provides a common way of accessing APIs
of different LLMs
Helps with learning how to use LLMs but
may be too restrictive for some use cases

Agents can take Actions
Instructions: “you are an health assistant, helping
nurses understand about patients”
Health Assistant
What’s the Joan’s
insurance?
Insurance ACME
Vector DB
Patient
Medical records
Actions
Booking Procedure
In: name, procedure
Out: protocol number
Please book
Patients’s anesthesia
Approved and
protocol is 12345

Agent Orchestration
Task Langchain
Final
response

Unimodel
Multimodal Multimodel
Closings

Thank you!
Alex Coqueiro
Director of Solutions Architecture for Canada Public Sector
AWS

Unraveling Multimodality with Large Language Models.pdf

More Related Content

Similar to Unraveling Multimodality with Large Language Models.pdf

More from Alex Barbosa Coqueiro

Recently uploaded

Unraveling Multimodality with Large Language Models.pdf