2023/06/01 IoT ALGYAN ChatGPT Study Group, Part 9: Presentation Materials

Fundamentals of Generative AI and
TaskMatrix (Visual ChatGPT)

About the Speaker
Tsuyoshi Matsuzaki (松崎剛)
Microsoft Japan Co., Ltd.
Partner Business Division
Cloud Solution Architect
Blog
https://tsmatz.wordpress.com/
GitHub
https://github.com/tsmatz

NLP (natural language processing) Tutorials
https://github.com/tsmatz/nlp-tutorials

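Fundamentals of OpenAI
GPT
Text generation and understanding
Embedding
Text understanding (text represented as vectors)
Codex
Code generation and understanding
DALL·E
Image generation from text prompts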

Evolution of Language Models
Source: “A Survey of Large Language Models” (Zhao et al., 2023) https://arxiv.org/abs/2303.18223

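Emergent Abilities of LLMs
Emergence = something that appears suddenly
An unpredictable phenomenon in which a language model suddenly acquires an ability once it exceeds a certain number of parameters
Source: “Emergent Abilities of Large Language Models” (Wei et al., 2022)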
https://arxiv.org/abs/2206.07682

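Language model (OpenAI GPT-3) and external tools

Thought: I need to check the invoice amount for company C
Action: GetInvoice[C]
    External tool: look up C's invoice amount in the invoice DB: 20000
Thought: I need to check the invoice amount for company F
Action: GetInvoice[F]
    External tool: look up F's invoice amount in the invoice DB: 4100
Thought: I need to calculate the total of C and F
Action: Total[20000, 4100]
    External tool: calculator 20000 + 4100 = 24100
Thought: I need to check the invoice amount for company A
Action: GetInvoice[A]
    External tool: look up A's invoice amount in the invoice DB: 2000
Thought: I need to check the invoice amount for company E
Action: GetInvoice[E]
    External tool: look up E's invoice amount in the invoice DB: 1000
Thought: I need to calculate the total of A and E
Action: Total[2000, 1000]
    External tool: calculator 2000 + 1000 = 3000
Thought: I need to calculate the difference between the total for companies C and F and the total for companies A and E
Action: Diff[24100, 3000]
    External tool: calculator | 24100 - 3000 | = 21100
Thought: Therefore, the answer is 21100
Action: Finish[21100]

OpenAI GPT handles the thinking (it proposes what to do next) = Reasoning
External tools (the function part) handle the actual work = Acting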
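
The Reasoning / Acting split above can be sketched as a small loop. This is a minimal illustration, not the implementation used in the talk: `scripted_model` is a hypothetical stand-in for the GPT call (it simply picks the next action from the observations collected so far), and `INVOICE_DB`, `TOOLS`, and `react_loop` are names invented for this sketch. Only the action syntax (`GetInvoice[...]`, `Total[...]`, `Diff[...]`, `Finish[...]`) and the invoice amounts come from the slide.

```python
import re

# Invoice amounts taken from the slide; the dict itself is a stand-in for the invoice DB.
INVOICE_DB = {"A": 2000, "C": 20000, "E": 1000, "F": 4100}

# External tools (Acting). Each tool receives the parsed argument list as strings.
TOOLS = {
    "GetInvoice": lambda args: INVOICE_DB[args[0]],
    "Total": lambda args: int(args[0]) + int(args[1]),
    "Diff": lambda args: abs(int(args[0]) - int(args[1])),
}

ACTION_RE = re.compile(r"^(\w+)\[(.*)\]$")


def parse_action(text):
    """Split an action such as 'Total[20000, 4100]' into a name and arguments."""
    match = ACTION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"Unrecognized action: {text!r}")
    name, raw_args = match.groups()
    return name, [arg.strip() for arg in raw_args.split(",")]


def scripted_model(observations):
    """Hypothetical stand-in for the LLM (Reasoning): choose the next action
    based on how many observations have been collected so far."""
    plan = [
        lambda o: "GetInvoice[C]",
        lambda o: "GetInvoice[F]",
        lambda o: f"Total[{o[0]}, {o[1]}]",
        lambda o: "GetInvoice[A]",
        lambda o: "GetInvoice[E]",
        lambda o: f"Total[{o[3]}, {o[4]}]",
        lambda o: f"Diff[{o[2]}, {o[5]}]",
        lambda o: f"Finish[{o[6]}]",
    ]
    return plan[len(observations)](observations)


def react_loop(model, max_steps=20):
    """Alternate between the model proposing an action and a tool executing it."""
    observations = []
    for _ in range(max_steps):
        name, args = parse_action(model(observations))
        if name == "Finish":
            return args[0], observations
        observations.append(TOOLS[name](args))
    raise RuntimeError("No Finish action within max_steps")


answer, trace = react_loop(scripted_model)
print(answer, trace)
```

In a real system the scripted function would be replaced by a call to the language model, with the accumulated thoughts, actions, and observations appended to the prompt at each step.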

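ReAct / MRKL Systems
• Uses the following tools
• Search
• Lookup
• Large improvement in answer accuracy (benchmark results)
• More advanced approaches also combine it with reinforcement learning, imitation learning, and similar methods
Source: “ReAct: Synergizing Reasoning and Acting in Language Models” (Yao et al., 2022)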

Prior knowledge vs Label mapping
(ICL experiments conducted by Google)

              unrelated labels    flipped labels
With IF              1                  2
Without IF           3                  4

Source: “Larger language models do in-context learning differently” (Wei et al., 2023)
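
To make the two label conditions concrete, here is a minimal sketch of how few-shot prompts for such an experiment could be built. It is an illustration only: the sentiment examples, the `foo` / `bar` labels, and the `Input:` / `Label:` template are assumptions for this sketch, not the exact setup of the cited paper. With flipped labels, following the in-context mapping contradicts the model's prior knowledge; with unrelated labels, the mapping can only be learned from the examples in the prompt.

```python
# Made-up sentiment examples for illustration.
EXAMPLES = [
    ("A wonderful, heartfelt film.", "positive"),
    ("Dull and far too long.", "negative"),
]

# How each condition rewrites the true label shown in the prompt.
LABEL_MAPS = {
    "natural": {"positive": "positive", "negative": "negative"},
    "flipped": {"positive": "negative", "negative": "positive"},
    "unrelated": {"positive": "foo", "negative": "bar"},
}


def build_icl_prompt(examples, query, setting):
    """Build a few-shot prompt whose labels follow the chosen condition."""
    mapping = LABEL_MAPS[setting]
    blocks = [f"Input: {text}\nLabel: {mapping[label]}" for text, label in examples]
    blocks.append(f"Input: {query}\nLabel:")
    return "\n\n".join(blocks)


flipped_prompt = build_icl_prompt(EXAMPLES, "I loved every minute.", "flipped")
unrelated_prompt = build_icl_prompt(EXAMPLES, "I loved every minute.", "unrelated")
print(flipped_prompt)
```

A model that relies on prior knowledge would still answer "positive" for the query in the flipped prompt, while a model that follows the in-context label mapping would answer "negative".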

Symbol Tuning
Source: “Symbol tuning improves in-context learning in language models” (Wei et al., 2023)

Expectations for Multimodality
Source: “ImageBind: One Embedding Space To Bind Them All” (Girdhar et al., 2023)

TaskMatrix (Visual ChatGPT)
• Image processing is handled mainly by Visual Foundation Models on Hugging Face
• Reasoning is done with a language model (ChatGPT)

Execution example
User request:
“replace the sofa in this image with a desk and then make it like a water-color painting”

Action 1: Replace Something From The Photo (“couch, desk”)
• Thought: Do I need to use a tool? Yes
• Action: Replace Something From The Photo
• Action Input: image/9bb5e03b.png, couch, desk
• Observation: image/5737_replace-something_9bb5e03b_9bb5e03b.png

Action 2: Instruct Image Using Text (“make it like a water-color painting”)
• Thought: Do I need to use a tool? Yes
• Action: Instruct Image Using Text
• Action Input: image/5737_replace-something_9bb5e03b_9bb5e03b.png, make it like a water-color painting
• Observation: image/770e_pix2pix_5737_9bb5e03b.png

Final response
• Thought: Do I need to use a tool? No
• AI: Here is the image you requested.
• ![image/770e_pix2pix_5737_9bb5e03b.png](image/770e_pix2pix_5737_9bb5e03b.png)
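
Each step in the execution example is plain text produced by the language model, so something has to parse it and call the matching tool. The following is a minimal sketch of that parsing and dispatch, assuming the model output follows the Thought / Action / Action Input format exactly. `parse_step`, `TOOLS`, and `replace_something` are hypothetical names for this sketch; the stub tool only fabricates an output path and does not edit any image.

```python
import re

# The tool name and its input are each expected on a single line.
ACTION_RE = re.compile(r"Action:\s*([^\n]+?)\s*\nAction Input:\s*([^\n]+)")
# A line starting with "AI:" marks the final response to the user.
FINAL_RE = re.compile(r"^AI:\s*(.*)", re.DOTALL | re.MULTILINE)


def parse_step(llm_output):
    """Return ("final", text) or ("tool", action_name, action_input)."""
    final = FINAL_RE.search(llm_output)
    if final:
        return ("final", final.group(1).strip())
    action = ACTION_RE.search(llm_output)
    if action is None:
        raise ValueError("Output does not follow the expected format")
    return ("tool", action.group(1).strip(), action.group(2).strip())


def replace_something(action_input):
    """Stub tool: a real tool would run an image model and save a new file.
    The returned file name is made up for this sketch."""
    image_path, old, new = [part.strip() for part in action_input.split(",")]
    return f"image/stub_replace_{old}_{new}.png"


# Hypothetical registry mapping tool names (as they appear in the prompt) to functions.
TOOLS = {"Replace Something From The Photo": replace_something}

step = parse_step(
    "Thought: Do I need to use a tool? Yes\n"
    "Action: Replace Something From The Photo\n"
    "Action Input: image/9bb5e03b.png, couch, desk"
)
_, tool_name, tool_input = step
observation = TOOLS[tool_name](tool_input)

done = parse_step(
    "Thought: Do I need to use a tool? No\n"
    "AI: Here is the image you requested."
)
print(step, observation, done)
```

The observation (the new file name) would then be appended to the prompt as an `Observation:` line so that the next step can use it as its input image.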

Visual ChatGPT is designed to be able to assist with a wide range of text and visual related tasks,
from answering simple questions to providing in-depth explanations and discussions on a wide range of
topics. Visual ChatGPT is able to generate human-like text based on the input it receives, allowing it
to engage in natural-sounding conversations and provide responses that are coherent and relevant to
the topic at hand.
Visual ChatGPT is able to process and understand large amounts of text and images. As a language
model, Visual ChatGPT can not directly read images, but it has a list of tools to finish different
visual tasks. Each image will have a file name formed as "image/xxx.png", and Visual ChatGPT can
invoke different tools to indirectly understand pictures. When talking about images, Visual ChatGPT is
very strict to the file name and will never fabricate nonexistent files. When using tools to generate
new image files, Visual ChatGPT is also known that the image may not be the same as the user's demand,
and will use other visual question answering tools or description tools to observe the real image.
Visual ChatGPT is able to use tools in a sequence, and is loyal to the tool observation outputs rather
than faking the image content and image file name. It will remember to provide the file name from the
last tool observation, if a new image is generated.
Human may provide new figures to Visual ChatGPT with a description. The description helps Visual
ChatGPT to understand this image, but Visual ChatGPT should use tools to finish following tasks,
rather than directly imagine from the description.
Overall, Visual ChatGPT is a powerful visual dialogue assistant tool that can help with a wide range
of tasks and provide valuable insights and information on a wide range of topics.
TOOLS:
------

Visual ChatGPT has access to the following tools:
> Get Photo Description: useful when you want to know what is inside the photo. receives image_path as
input. The input to this tool should be a string, representing the image_path.
> Remove Something From The Photo: useful when you want to remove and object or something from the
photo from its description or location. The input to this tool should be a comma seperated string of
two, representing the image_path and the object need to be removed.
> Replace Something From The Photo: useful when you want to replace an object from the object
description or location with another object from its description. The input to this tool should be a
comma seperated string of three, representing the image_path, the object to be replaced, the object to
be replaced with
> Instruct Image Using Text: useful when you want to the style of the image to be like the text. like:
make it look like a painting. or make it like a robot. The input to this tool should be a comma
seperated string of two, representing the image_path and the text.
. . . . .
To use a tool, please use the following format:
```
Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [Get Photo Description, Remove Something From The Photo,
Replace Something From The Photo, Instruct Image Using Text]
Action Input: the input to the action
Observation: the result of the action
```
When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the
format:

```
Thought: Do I need to use a tool? No
AI: [your response here]
```
You are very strict to the filename correctness and will never fake a file name if it does not exist.
You will remember to provide the image file name loyally if it's provided in the last tool
observation.
Begin!
Previous conversation history:
Human: provide a figure named image/9bb5e03b.png. The description is: a living room with a couch and a
couch in the corner. This information helps you to understand this image, but you should use tools to
finish following tasks, rather than directly imagine from my description. If you understand, say
"Received".
AI: Received.
New input: replace the sofa in this image with a desk and then make it like a water-color painting
Since Visual ChatGPT is a text language model, Visual ChatGPT must use tools to observe images rather
than imagination.
The thoughts and observations are only visible for Visual ChatGPT, Visual ChatGPT should remember to
repeat important information in the final response for Human.
Thought: Do I need to use a tool? Yes
Action: Replace Something From The Photo
Action Input: image/9bb5e03b.png, couch, desk
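
The prompt above is assembled from a fixed preamble, a list of tool descriptions, format instructions, the conversation history, the new input, and the steps taken so far. The following is a rough sketch of that assembly, under the assumption that it is plain string concatenation. `Tool`, `tools_section`, `format_instructions`, and `build_prompt` are hypothetical names, and the preamble and tool descriptions are shortened here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str


def tools_section(tools):
    """One '> name: description' line per tool, as in the prompt above."""
    return "\n".join(f"> {tool.name}: {tool.description}" for tool in tools)


def format_instructions(tools):
    """The tool-use format; the list of valid action names is generated from the tools."""
    names = ", ".join(tool.name for tool in tools)
    return (
        "Thought: Do I need to use a tool? Yes\n"
        f"Action: the action to take, should be one of [{names}]\n"
        "Action Input: the input to the action\n"
        "Observation: the result of the action"
    )


def build_prompt(preamble, tools, history, new_input, scratchpad=""):
    """Join the sections; the scratchpad holds the steps already taken in this turn."""
    parts = [
        preamble,
        "TOOLS:\n------",
        tools_section(tools),
        format_instructions(tools),
        "Previous conversation history:\n" + history,
        "New input: " + new_input,
        scratchpad,
    ]
    return "\n\n".join(part for part in parts if part)


tools = [
    Tool("Get Photo Description", "useful when you want to know what is inside the photo."),
    Tool("Instruct Image Using Text", "useful when you want to the style of the image to be like the text."),
]
scratchpad = (
    "Thought: Do I need to use a tool? Yes\n"
    "Action: Replace Something From The Photo\n"
    "Action Input: image/9bb5e03b.png, couch, desk"
)
prompt = build_prompt(
    "Visual ChatGPT is designed to be able to assist with a wide range of text and visual related tasks.",
    tools,
    history="AI: Received.",
    new_input="replace the sofa in this image with a desk and then make it like a water-color painting",
    scratchpad=scratchpad,
)
print(prompt)
```

Because the list of valid actions is generated from the registered tools, adding a Visual Foundation Model only requires registering one more name and description.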

Example requests:
• could you generate a cat for me ?
• could you replace a cat to a dog and then remove the book ?
• could you generate a canny edge of this image ?
• generate a yellow dog based on ......png

Visual Foundation Models
Get Photo Description
Generate Image From User Input Text
Remove Something From The Photo
Replace Something From The Photo
Instruct Image Using Text
Answer Question About The Image
Edge Detection On Image
Generate Image Condition On Canny Image
Line Detection On Image
Generate Image Condition On Line Image
Hed Detection On Image
Generate Image Condition On Soft Hed Boundary Image
Segmentation On Image
Generate Image Condition On Segmentations
Predict Depth On Image
Generate Image Condition On Depth
Predict Normal Map On Image
Generate Image Condition On Normal Map
Sketch Detection On Image
Generate Image Condition On Sketch Image
Pose Detection On Image
Generate Image Condition On Pose Image

Diffusions
Source: “Denoising Diffusion Probabilistic Models” (Ho et al., 2020) https://arxiv.org/abs/2006.11239
Source: “Learning Transferable Visual Models From Natural Language Supervision” (Radford et al., 2021)

Related Projects
• JARVIS (HuggingGPT) - Microsoft
https://github.com/microsoft/JARVIS
• LLM-Augmenter system - Microsoft
https://github.com/pengbaolin/LLM-Augmenter
• Transformers Agent - Hugging Face
https://huggingface.co/docs/transformers/transformers_agents
• GPT-4 + Stable-Diffusion - Berkeley AI Research
https://llm-grounded-diffusion.github.io/
