I want to search images by text! (OpenAI CLIP) - VRC-LT #15

I want to search images by text!
(OpenAI CLIP)
VRC-LT #15
2022/11/26 @shuyo

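Image search based on object detection
• The search feature in Google Photos
  • Filters by keyword
    • Objects are detected in the images ahead of time
  • Out of scope for this talk
    • Face recognition groups photos of the same person; once a group is labeled, it can be searched by name
    • Search by text that appears in the image (only for some photos?)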

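Pain points
• Unanticipated keywords are not supported
  • "Tonkotsu ramen" or "shuttle bus" cannot be searched for
• Only keywords can be searched
  • Phrases such as "blue bus" or "church at night" cannot be searched for
• An image is a hit even if the object appears only in a small part of it
  Searching for "bus" → (result screenshot on the slide)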
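
To make these limitations concrete, here is a minimal sketch of keyword search over pre-detected labels. It is not how Google Photos is implemented; the file names, the label vocabulary, and the helper names `build_index` and `keyword_search` are all made up for illustration.

```python
from collections import defaultdict

# Hypothetical detector output: image file -> labels detected in it.
# File names and labels are invented for this example.
DETECTED = {
    "street.jpg": ["bus", "car", "person"],
    "park.jpg": ["dog", "person", "bus"],  # the bus is only a small background object
    "church.jpg": ["church"],
}

def build_index(detected):
    """Map each label to the set of images it was detected in."""
    index = defaultdict(set)
    for image, labels in detected.items():
        for label in labels:
            index[label].add(image)
    return index

def keyword_search(index, keyword):
    """Exact-match lookup: a keyword outside the label vocabulary finds nothing."""
    return sorted(index.get(keyword, set()))

index = build_index(DETECTED)
print(keyword_search(index, "bus"))          # every image with any bus, however small
print(keyword_search(index, "shuttle bus"))  # not a known label, so no hits
print(keyword_search(index, "blue bus"))     # attributes are not indexed, so no hits
```

The index only knows the labels the detector was trained on, and it has no notion of how prominent an object is, which is exactly the behavior the list above complains about.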

With something called CLIP,
you can build a pretty nice image search!

70 lines of Python (web server included)

import os, io, base64, glob, tqdm
from PIL import Image
import tornado.ioloop, tornado.web
import torch
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", device=device)
tokenizer = ja_clip.load_tokenizer()

# Glob patterns for the images to index
DATASETS = [
    "/media/hdd/dataset/imagenette2-320/train/**/*.JPEG",
    "/media/hdd/dataset/imagenette2-320/test/**/*.JPEG",
    "/media/hdd/dataset/coco/val2017/*.jpg",
    "/media/hdd/dataset/coco/test2017/*.jpg",
]

imglist = []
for path in DATASETS:
    imglist.extend(glob.glob(path))

# Encode every image once at startup
features = []
for path in tqdm.tqdm(imglist):
    img = Image.open(path)
    image = preprocess(img).unsqueeze(0).to(device)
    with torch.no_grad():
        features.append(model.get_image_features(image))
features = torch.cat(features)
# Normalize each image vector to length 1
norm = features / torch.sqrt((features**2).sum(axis=1)).unsqueeze(1)

def read(path):
    # Return the image file as a base64 string for embedding in the HTML
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def search(query):
    encodings = ja_clip.tokenize(query, tokenizer=tokenizer)
    with torch.no_grad():
        text_features = model.get_text_features(**encodings)
    textnorm = text_features / torch.sqrt((text_features**2).sum())
    # Dot products of unit vectors = cosine similarity with every image
    sim = norm.matmul(textnorm.squeeze(0))
    topk = torch.topk(sim, 5)
    return [{"image_base64": read(imglist[topk.indices[i]]),
             "score": topk.values[i].item()} for i in range(5)]

class MainHandler(tornado.web.RequestHandler):
    def get(self):
        query = self.get_argument("query", "").strip()
        if query != "":
            topk = search(query)
        else:
            topk = []
        self.render("main.html", query=query, topk=topk)

if __name__ == "__main__":
    base_dir = os.path.dirname(__file__)
    app = tornado.web.Application([("/", MainHandler)],
        template_path=os.path.join(base_dir, "template"),
        static_path=os.path.join(base_dir, "static"),
    )
    app.listen(8000)
    tornado.ioloop.IOLoop.current().start()

All that is left is the CSS and
the HTML template.
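
The core of `search` is three steps: normalize the vectors, take dot products, and keep the best five. The following dependency-free sketch spells those steps out in plain Python. The helper names `normalize` and `top_k` and the 3-dimensional toy vectors are illustrative assumptions; the real listing performs the same computation on 512-dimensional tensors with `matmul` and `torch.topk`.

```python
import heapq
import math

def normalize(vec):
    """Scale a non-zero vector to length 1 (L2 norm)."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def top_k(image_vectors, text_vector, k=5):
    """Return the k (index, cosine similarity) pairs with the highest similarity."""
    text_unit = normalize(text_vector)
    scores = []
    for i, vec in enumerate(image_vectors):
        unit = normalize(vec)
        # The dot product of two unit vectors is their cosine similarity.
        scores.append((i, sum(a * b for a, b in zip(unit, text_unit))))
    return heapq.nlargest(k, scores, key=lambda pair: pair[1])

# Toy 3-dimensional vectors standing in for CLIP features.
images = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 2.0]]
query = [2.0, 0.0, 0.0]
for index, score in top_k(images, query, k=2):
    print(index, round(score, 3))
```

Normalizing the image vectors once at startup, as the listing does, means each query costs only one matrix-vector product.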

Shows the images that match the text,
out of roughly 50,000 images
(taken from the COCO and ImageNette
datasets)
Enter the search text
Results appear instantly (<100 ms)

OpenAI CLIP (Contrastive Language-Image Pre-training)
• Pre-trains vector representations of images and of texts from image-text pairs
• Contrastive learning: the cosine similarity between the vectors of a correct pair is made large, and that of mismatched pairs is made small
• Any encoder will do
  • text: Transformer-based
  • image: ResNet or ViT
Radford, Alec, et al. "Learning transferable visual models from natural language supervision."
International Conference on Machine Learning. PMLR, 2021.
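
The contrastive objective can be sketched in a few lines of plain Python. This is a toy illustration, not the training code from the paper: the function names are made up here, the temperature is fixed at 0.07 for simplicity (the paper learns it), and the 2-dimensional vectors stand in for real encoder outputs.

```python
import math

def normalize(vec):
    """Scale a non-zero vector to length 1."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]

def clip_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss for a batch in which image i belongs to text i."""
    images = [normalize(v) for v in image_vecs]
    texts = [normalize(v) for v in text_vecs]
    n = len(images)
    # logits[i][j]: cosine similarity of image i and text j, divided by the temperature
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    # Each image should pick its own text, and each text its own image.
    image_to_text = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    text_to_image = sum(cross_entropy([logits[i][j] for i in range(n)], j)
                        for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

image_batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]   # text i points the same way as image i
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]   # every pair is mismatched
print(clip_loss(image_batch, aligned_texts) < clip_loss(image_batch, swapped_texts))
```

Minimizing this loss pulls the vectors of correct pairs together and pushes all other image-text combinations in the batch apart, which is what makes text-to-image search by cosine similarity possible afterwards.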

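An image search service with CLIP
• Use a pre-trained CLIP model
  • This talk uses rinna's Japanese CLIP model (commercial use permitted)
  • https://huggingface.co/rinna/japanese-clip-vit-b-16
• Encode each image into a fixed-length vector (512 dimensions) and normalize it to length 1
  • Vectorizing 50,000 images takes about 8 minutes (RTX 3060)
• Vectorize the query text and compare it with the image vectors
  • Images with high cosine similarity are returned as the search results
  • With more images, the cosine-similarity search can be sped up with SimHash or similar techniques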
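
As a rough illustration of the SimHash idea, the sketch below uses the random-hyperplane variant: each vector is reduced to a bit signature, and the Hamming distance between signatures approximates the angle between the original vectors. Which variant the talk had in mind is an assumption, and the function names, the 8-dimensional toy vector, and the 256-bit signature length are chosen only for this example.

```python
import math
import random

def make_hyperplanes(dim, bits, seed=0):
    """Draw one random Gaussian direction per signature bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]

def simhash(vec, hyperplanes):
    """Pack the sign of each projection into a single integer signature."""
    signature = 0
    for plane in hyperplanes:
        dot = sum(a * b for a, b in zip(vec, plane))
        signature = (signature << 1) | (1 if dot >= 0 else 0)
    return signature

def hamming(a, b):
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")

def estimated_cosine(sig_a, sig_b, bits):
    """The fraction of differing bits approximates angle / pi."""
    return math.cos(math.pi * hamming(sig_a, sig_b) / bits)

BITS = 256
planes = make_hyperplanes(dim=8, bits=BITS)
vec = [0.3, -1.2, 0.5, 0.0, 2.0, -0.7, 0.1, 0.9]
opposite = [-x for x in vec]

sig = simhash(vec, planes)
print(estimated_cosine(sig, sig, BITS))                         # same direction
print(estimated_cosine(sig, simhash(opposite, planes), BITS))   # opposite direction
```

Signatures can be compared with XOR and a bit count, or used as bucket keys, so only a small candidate set needs the exact cosine similarity. The estimate gets noisier as the number of bits shrinks, so the signature length trades accuracy for speed and memory.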
