Paper introduction: Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval Chaorui Deng, Qi Chen, Pengda Qin, Da Chen, Qi Wu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 15648-15658 https://openaccess.thecvf.com/content/ICCV2023/html/Deng_Prompt_Switch_Efficient_CLIP_Adaptation_for_Text-Video_Retrieval_ICCV_2023_paper.html

Prompt Switch: Efficient CLIP
Adaptation for Text-Video Retrieval
Chaorui Deng, Qi Chen, Pengda Qin, Da Chen, Qi Wu, ICCV2023
2023/10/30

◼ Prompt
• Prompt Head
• Prompt
• Prompt
◼ Text-image CLIP Prompt
• Video Token Prompt Token
• [Jia+, ECCV2022]

◼ Text-image CLIP Prompt
1. Prompt Token
• Prompt tokens are added to the video tokens
2. Prompt Switch
• Prompt tokens are switched inside the video encoder
3. Prompt Aggregation
• Class tokens aggregate the prompt tokens
4. Captioning loss
◼
• Video up

2. Prompt Switch
𝑉 : Visual Token
𝑃 : Prompt Token
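
The slide only names the two token types, so the following is a minimal sketch of what the switch could look like. It assumes, following the paper's description of a "Prompt Cube" rather than anything stated on this slide, that each of the N_f frames carries N_f prompt tokens of dimension D, and that switching means swapping the frame axis with the prompt axis, so that prompt j of frame i becomes prompt i of frame j. The function name `prompt_switch` and the toy shapes are hypothetical.

```python
import numpy as np


def prompt_switch(prompt_cube: np.ndarray) -> np.ndarray:
    """Swap the frame axis and the prompt axis of a (N_f, N_f, D) prompt cube.

    Assumption: the cube holds as many prompt tokens per frame as there are
    frames, so the swapped cube has the same shape as the input.
    """
    n_frames, n_prompts, _ = prompt_cube.shape
    if n_frames != n_prompts:
        raise ValueError("expected as many prompt tokens per frame as frames")
    return np.swapaxes(prompt_cube, 0, 1)


# Toy example: 3 frames, 3 prompt tokens per frame, 2-dimensional tokens.
N_f, D = 3, 2
P = np.arange(N_f * N_f * D, dtype=np.float32).reshape(N_f, N_f, D)
P_switched = prompt_switch(P)
```

Under this assumption, applying the switch twice restores the original cube, which suggests how repeated switching between encoder layers could let every frame see prompts written by every other frame without any attention across frames.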

Prompt Aggregation
◼ Prompt Aggregation
• $c_i$ : Visual class token of frame $i \in [1, N_f]$, where $N_f$ is the number of frames
• $\hat{P}$ : Prompt Token
• $MHA$: Multi-Head Attention
• $LN$: Layer Normalization
• Multi-Head Attention + Add & LayerNorm
• Query: $c_i$; key, value: $\hat{P}$
• Mean-Pooling
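
The steps listed above (attention with the class token as query and the prompt tokens as key and value, a residual add followed by layer normalization, then mean-pooling) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses a single attention head with no learned projections, a layer norm without scale and shift, and it assumes each frame attends to its own set of prompt tokens. The name `aggregate_prompts` and all shapes are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def aggregate_prompts(class_tokens, prompt_tokens):
    """class_tokens: (N_f, D) queries; prompt_tokens: (N_f, N_p, D) keys/values.

    Returns a single (D,) video-level feature.
    """
    d = class_tokens.shape[-1]
    # Scaled dot-product scores between each class token and its prompts.
    scores = np.einsum("fd,fpd->fp", class_tokens, prompt_tokens) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    attended = np.einsum("fp,fpd->fd", weights, prompt_tokens)
    per_frame = layer_norm(class_tokens + attended)  # Add & LayerNorm
    return per_frame.mean(axis=0)                    # mean-pooling over frames


rng = np.random.default_rng(0)
c = rng.normal(size=(4, 8))         # 4 frames, 8-dim class tokens
P_hat = rng.normal(size=(4, 4, 8))  # 4 prompt tokens per frame
video_feature = aggregate_prompts(c, P_hat)
```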

Loss
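◼ Loss
• Contrastive Loss : $L_{con}$
• Captioning Loss : $L_{cap}$
◼ Notation
• $x_i$ : video
• $y_i$ : text
• $B$ : batch size
• $N_f$ : number of frames
• $\bar{p}_i$ : output of Prompt Aggregation
• $\tau$ : temperature
• $w_{<l}$ : caption tokens before position $l$
• $w_l$ : $l$-th caption token
• $N_w$ : number of caption tokens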
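
The exact loss formulas are not recoverable from the slide text, so the sketch below uses the standard forms these names usually refer to: a symmetric InfoNCE loss over the B matched video-text pairs in a batch with temperature τ, and a token-level negative log-likelihood for the caption. Whether the paper sums or averages the two retrieval directions, how it normalizes over $N_w$, and the value of `tau` are all assumptions here; the function names are hypothetical.

```python
import numpy as np


def log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(video_feats, text_feats, tau=0.05):
    """Symmetric InfoNCE over a batch of B matched (video, text) pairs."""
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    sim = v @ t.T / tau  # (B, B) cosine similarities; matches on the diagonal
    video_to_text = -np.diag(log_softmax(sim, axis=1)).mean()
    text_to_video = -np.diag(log_softmax(sim, axis=0)).mean()
    return video_to_text + text_to_video


def captioning_loss(token_log_probs):
    """Mean negative log-likelihood of the caption tokens.

    token_log_probs[l] stands for log p(w_l | w_<l, video).
    """
    return -float(np.mean(token_log_probs))


feats = np.eye(4)
loss_matched = contrastive_loss(feats, feats)
loss_shuffled = contrastive_loss(feats, np.roll(feats, 1, axis=0))
loss_caption = captioning_loss(np.log([0.5, 0.5]))
```

In the toy usage, correctly paired features give a loss near zero, while shifting the text features by one position makes every pair a mismatch and drives the loss up.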

◼ Datasets
• MSR-VTT [Xu+, CVPR2016]
• Video: 10k, caption: 200k
• Train: 9k, Val: 1k
• MSVD [Chen&Dolan, ACL2011]
• Video: 1970, caption: 120k
• Train: 1200, Val: 100, test: 670
• LSMDC [Rohrbach+, arXiv2015]
• Video: 118081, caption: 11808
• Train: 109673, Val: 7408, test: 1k
◼ Video
• 6
• 224×224
◼ : CLIP
• Text encoder, video encoder
• CLIP (ViT-B/32)
• Prompt Token
• (0, 0.02)
◼ Training
• Batch size : 128
• Optimizer : AdamW
• Learning rate : 3e-5
• Scheduler : CosineAnnealing
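
As an illustration of the schedule named above, here is a minimal sketch of cosine annealing starting from the listed learning rate of 3e-5. The slide gives no total step count, minimum learning rate, or warmup, so `total_steps`, `min_lr`, and the absence of warmup and restarts are assumptions made for this example.

```python
import math


def cosine_annealing_lr(step, total_steps, base_lr=3e-5, min_lr=0.0):
    """Learning rate at `step` under cosine annealing (no warmup, no restarts)."""
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical training length
schedule = [cosine_annealing_lr(s, total_steps) for s in range(total_steps + 1)]
```

The rate starts at the base value, passes through half of it at the midpoint, and decays to `min_lr` at the final step.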

Experiment 1: Ablation
◼ ablation
1. Baseline CLIP
2. (1) + Prompt Switch
3. (2) + Prompt Aggregation
4. (3) + Captioning Loss
◼ Dataset
• MSR-VTT 1K-A

Experiment 2
◼ temporal modeling
• Temporal Transformer
• Attention
• Token Shift [Liu+, ECCV2022]
• Video Proxy [Xue+, ICLR2023]
• Full Attention
• video Attention
• ours (Prompt Switch + Aggregation)
◼ Dataset
• MSR-VTT 1K-A
◼ Evaluation metric
• Recall (𝑘 = 1, 5, 10)
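
For reference, Recall at k is the fraction of queries whose correct item appears among the top k retrieved results. The sketch below assumes text-to-video retrieval with a square similarity matrix in which the correct video for query i sits at index i, and it resolves ties in favor of the correct item. The function name and the toy scores are made up for illustration.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth item ranks within the top k.

    similarity[i][j] is the score of text query i against video j, and the
    ground-truth video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = number of videos scored strictly higher than the correct one.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video ranked 2nd
    [0.7, 0.6, 0.1],  # query 2: correct video ranked 3rd
]
recalls = {k: recall_at_k(sim, k) for k in (1, 2, 3)}
```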

◼ Text-image CLIP Prompt
• Prompt Token
• Prompt Switch
• Prompt Aggregation
• Captioning loss
◼ SOTA
• MSR-VTT
• MSVD
