SlideShare a Scribd company logo
1 of 17
Download to read offline
Prompt Switch: Efficient CLIP
Adaptation for Text-Video Retrieval
Chaorui Deng,Qi Chen, Pengda Qin, Da Chen,Qi Wu, ICCV2023
2023/10/30
◼ Prompt
•
• Prompt Head
• Prompt
• Prompt
◼ Text-image CLIP Prompt
• Video Token Prompt Token
•
[Jia+, ECCV2022]
◼ CLIP [Radford+, ICML2021]
◼ Text-image CLIP Prompt
1. Prompt Token
• Video Token token
2. Prompt Switch
• Video Encoder prompt token
3. Prompt Aggregation
• Class token prompt token
4. Captioning loss
◼
• Video up
•
1. Prompt Token
2. Prompt Switch
𝑉 : Visual Token
𝑃 : Prompt Token
3. Prompt Aggregation
Prompt Aggregation
◼ Prompt Aggregation
•
• 𝑐𝑖 : Visual class token 𝑖 ∈ 1, 𝑁𝑓 , 𝑁𝑓:
• ෠
𝑃 : Prompt Token
• 𝑀𝐻𝐴: Multi-Head Attention
• 𝐿𝑁: Layer Normalization
• Multi-Head Attention + Add LayerNorm
• Query: 𝑐𝑖, key, value : ෠
𝑃
• Mean-Pooling
4. Captioning Loss
Captioning output
Loss
◼ Loss
• Contrastive Loss : 𝐿𝑐𝑜𝑛
• Captioning Loss : 𝐿𝑐𝑎𝑝
◼
• 𝑥𝑖 : video
• 𝑦𝑖 : text
• 𝐵 :
• 𝑁𝑓 :
• ഥ
𝑝𝑖 : prompt Aggregation
• 𝜏 :
• 𝑤<𝑙 : 𝑙 caption
• 𝑤𝑙 : 𝑙 caption
• 𝑁𝑤 : token
◼
• MSR-VTT [Xu+, CVPR2016]
• Video: 10k, caption: 200k
• Train: 9k, Val: 1k
• MSVD [Chen&Dolan, ACL2011]
• Video: 1970, caption: 120k
• Train: 1200, Val: 100, test: 670
• LSMDC [Rohrbach+, arXiv2015]
• Video: 118081, caption: 11808
• Train: 109673, Val: 7408, test: 1k
◼ Video
• 6
• 224×224
◼ : CLIP
•
• Text encoder, video encoder
• CLIP (ViT-B/32)
• Prompt Token
• ( 0, 0.02)
◼
• Batch size : 128
• Optimizer : AdamW
• Learning rate : 3𝑒−5
• Scheduler : CosinAnnealing
1: Ablation
◼ ablation
1. Baseline CLIP
2. 1 + Prompt Switch
3. 2 + Prompt Aggregation
4. 3 + Captioning Loss
◼
• MSR-VTT 1K-A
◼
•
2
◼ temporal modeling
• Temporal Transformer
• Attention
• Token Shift [Liu+, ECCV2022]
• Video Proxy [Xue+, ICLR2023]
• Full Attention
• video Attention
• ours (Prompt Switch + Aggregation)
◼
•
◼
• MSR-VTT 1K-A
◼
• Recall (𝑘 = 1, 5, 10)
3: SOTA
◼ MSRVTT
3: SOTA
◼ MSVD
◼ Text-image CLIP Prompt
• Prompt Token
• Prompt Switch
• Prompt Aggregation
• Captioning loss
◼ SOTA
•
• MSR-VTT
• MSVD

More Related Content

Similar to 論文紹介:Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

PR-365: Fast object detection in compressed video
PR-365: Fast object detection in compressed videoPR-365: Fast object detection in compressed video
PR-365: Fast object detection in compressed videoHyeongmin Lee
 
PR-340: DVC: An End-to-end Deep Video Compression Framework
PR-340: DVC: An End-to-end Deep Video Compression FrameworkPR-340: DVC: An End-to-end Deep Video Compression Framework
PR-340: DVC: An End-to-end Deep Video Compression FrameworkHyeongmin Lee
 
Avtex Lync 2013 Event - Fargo
Avtex Lync 2013 Event - FargoAvtex Lync 2013 Event - Fargo
Avtex Lync 2013 Event - FargoAvtex
 
Performance Measurements of 360◦ Video Streaming to Head-Mounted Displays Ove...
Performance Measurements of 360◦ Video Streaming to Head-Mounted Displays Ove...Performance Measurements of 360◦ Video Streaming to Head-Mounted Displays Ove...
Performance Measurements of 360◦ Video Streaming to Head-Mounted Displays Ove...Wen-Chih Lo
 
Seattle Video Tech Meetup August 2019: Optimal Multi-codec Streaming
Seattle Video Tech Meetup August 2019: Optimal Multi-codec StreamingSeattle Video Tech Meetup August 2019: Optimal Multi-codec Streaming
Seattle Video Tech Meetup August 2019: Optimal Multi-codec StreamingDavid Sayed
 
Video Coding for Large-Scale HTTP Adaptive Streaming Deployments: State of th...
Video Coding for Large-Scale HTTP Adaptive Streaming Deployments: State of th...Video Coding for Large-Scale HTTP Adaptive Streaming Deployments: State of th...
Video Coding for Large-Scale HTTP Adaptive Streaming Deployments: State of th...Alpen-Adria-Universität
 
Intro to Compression: Audio and Video Optimization for Learning
Intro to Compression: Audio and Video Optimization for LearningIntro to Compression: Audio and Video Optimization for Learning
Intro to Compression: Audio and Video Optimization for LearningNick Floro
 
An Overview of High Efficiency Video Codec HEVC (H.265)
An Overview of High Efficiency Video Codec HEVC (H.265)An Overview of High Efficiency Video Codec HEVC (H.265)
An Overview of High Efficiency Video Codec HEVC (H.265)Varun Ravi
 
Hacking cable TV Networks Like Die hard Movie
Hacking cable TV Networks Like Die hard MovieHacking cable TV Networks Like Die hard Movie
Hacking cable TV Networks Like Die hard MovieRahul Sasi
 
Introduction to Transcoding: Tools and Processes
Introduction to Transcoding: Tools and ProcessesIntroduction to Transcoding: Tools and Processes
Introduction to Transcoding: Tools and ProcessesPrestoCentre
 
Video smart cropping web application
Video smart cropping web applicationVideo smart cropping web application
Video smart cropping web applicationVasileiosMezaris
 
Serverless Media Workflow
Serverless Media WorkflowServerless Media Workflow
Serverless Media WorkflowMooYeol Lee
 
Mpeg4copy 120428133000-phpapp01
Mpeg4copy 120428133000-phpapp01Mpeg4copy 120428133000-phpapp01
Mpeg4copy 120428133000-phpapp01netzwelt12345
 
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7Shuo LI
 
Live, Low Delay, High Quality – How?
Live, Low Delay, High Quality – How?Live, Low Delay, High Quality – How?
Live, Low Delay, High Quality – How?Bitmovin Inc
 
Chapter 6 : VIDEO
Chapter 6 : VIDEOChapter 6 : VIDEO
Chapter 6 : VIDEOazira96
 

Similar to 論文紹介:Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval (20)

PR-365: Fast object detection in compressed video
PR-365: Fast object detection in compressed videoPR-365: Fast object detection in compressed video
PR-365: Fast object detection in compressed video
 
PR-340: DVC: An End-to-end Deep Video Compression Framework
PR-340: DVC: An End-to-end Deep Video Compression FrameworkPR-340: DVC: An End-to-end Deep Video Compression Framework
PR-340: DVC: An End-to-end Deep Video Compression Framework
 
Avtex Lync 2013 Event - Fargo
Avtex Lync 2013 Event - FargoAvtex Lync 2013 Event - Fargo
Avtex Lync 2013 Event - Fargo
 
Performance Measurements of 360◦ Video Streaming to Head-Mounted Displays Ove...
Performance Measurements of 360◦ Video Streaming to Head-Mounted Displays Ove...Performance Measurements of 360◦ Video Streaming to Head-Mounted Displays Ove...
Performance Measurements of 360◦ Video Streaming to Head-Mounted Displays Ove...
 
Seattle Video Tech Meetup August 2019: Optimal Multi-codec Streaming
Seattle Video Tech Meetup August 2019: Optimal Multi-codec StreamingSeattle Video Tech Meetup August 2019: Optimal Multi-codec Streaming
Seattle Video Tech Meetup August 2019: Optimal Multi-codec Streaming
 
Video Coding for Large-Scale HTTP Adaptive Streaming Deployments: State of th...
Video Coding for Large-Scale HTTP Adaptive Streaming Deployments: State of th...Video Coding for Large-Scale HTTP Adaptive Streaming Deployments: State of th...
Video Coding for Large-Scale HTTP Adaptive Streaming Deployments: State of th...
 
Intro to Compression: Audio and Video Optimization for Learning
Intro to Compression: Audio and Video Optimization for LearningIntro to Compression: Audio and Video Optimization for Learning
Intro to Compression: Audio and Video Optimization for Learning
 
An Overview of High Efficiency Video Codec HEVC (H.265)
An Overview of High Efficiency Video Codec HEVC (H.265)An Overview of High Efficiency Video Codec HEVC (H.265)
An Overview of High Efficiency Video Codec HEVC (H.265)
 
Hacking cable TV Networks Like Die hard Movie
Hacking cable TV Networks Like Die hard MovieHacking cable TV Networks Like Die hard Movie
Hacking cable TV Networks Like Die hard Movie
 
Introduction to Transcoding: Tools and Processes
Introduction to Transcoding: Tools and ProcessesIntroduction to Transcoding: Tools and Processes
Introduction to Transcoding: Tools and Processes
 
Video smart cropping web application
Video smart cropping web applicationVideo smart cropping web application
Video smart cropping web application
 
Serverless Media Workflow
Serverless Media WorkflowServerless Media Workflow
Serverless Media Workflow
 
Mpeg4copy 120428133000-phpapp01
Mpeg4copy 120428133000-phpapp01Mpeg4copy 120428133000-phpapp01
Mpeg4copy 120428133000-phpapp01
 
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
Criteo Labs Infrastructure Tech Talk Meetup Nov. 7
 
Vivotek presentation
Vivotek presentationVivotek presentation
Vivotek presentation
 
Apan media encoding
Apan media encodingApan media encoding
Apan media encoding
 
Live, Low Delay, High Quality – How?
Live, Low Delay, High Quality – How?Live, Low Delay, High Quality – How?
Live, Low Delay, High Quality – How?
 
Chapter 6 : VIDEO
Chapter 6 : VIDEOChapter 6 : VIDEO
Chapter 6 : VIDEO
 
Chapter 6
Chapter 6Chapter 6
Chapter 6
 
Video File & Recording Media
Video File & Recording MediaVideo File & Recording Media
Video File & Recording Media
 

More from Toru Tamaki

論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...Toru Tamaki
 
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...Toru Tamaki
 
論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNetToru Tamaki
 
論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A surveyToru Tamaki
 
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex ScenesToru Tamaki
 
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...Toru Tamaki
 
論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video SegmentationToru Tamaki
 
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New HopeToru Tamaki
 
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...Toru Tamaki
 
論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt TuningToru Tamaki
 
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in MoviesToru Tamaki
 
論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICAToru Tamaki
 
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context RefinementToru Tamaki
 
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...Toru Tamaki
 
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...Toru Tamaki
 
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusionToru Tamaki
 
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous DrivingToru Tamaki
 
論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large Motion論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large MotionToru Tamaki
 
論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense Predictions論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense PredictionsToru Tamaki
 
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understandingToru Tamaki
 

More from Toru Tamaki (20)

論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
 
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
 
論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet
 
論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey
 
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
 
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
 
論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation
 
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
 
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
 
論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning
 
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
 
論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA
 
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
 
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
 
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
 
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
 
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
 
論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large Motion論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large Motion
 
論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense Predictions論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense Predictions
 
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
 

Recently uploaded

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfjimielynbastida
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Recently uploaded (20)

My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Science&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdfScience&tech:THE INFORMATION AGE STS.pdf
Science&tech:THE INFORMATION AGE STS.pdf
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

論文紹介:Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval

  • 1. Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval Chaorui Deng,Qi Chen, Pengda Qin, Da Chen,Qi Wu, ICCV2023 2023/10/30
  • 2. ◼ Prompt • • Prompt Head • Prompt • Prompt ◼ Text-image CLIP Prompt • Video Token Prompt Token • [Jia+, ECCV2022]
  • 4. ◼ Text-image CLIP Prompt 1. Prompt Token • Video Token token 2. Prompt Switch • Video Encoder prompt token 3. Prompt Aggregation • Class token prompt token 4. Captioning loss ◼ • Video up •
  • 5.
  • 7. 2. Prompt Switch 𝑉 : Visual Token 𝑃 : Prompt Token
  • 9. Prompt Aggregation ◼ Prompt Aggregation • • 𝑐𝑖 : Visual class token 𝑖 ∈ 1, 𝑁𝑓 , 𝑁𝑓: • ෠ 𝑃 : Prompt Token • 𝑀𝐻𝐴: Multi-Head Attention • 𝐿𝑁: Layer Normalization • Multi-Head Attention + Add LayerNorm • Query: 𝑐𝑖, key, value : ෠ 𝑃 • Mean-Pooling
  • 11. Loss ◼ Loss • Contrastive Loss : 𝐿𝑐𝑜𝑛 • Captioning Loss : 𝐿𝑐𝑎𝑝 ◼ • 𝑥𝑖 : video • 𝑦𝑖 : text • 𝐵 : • 𝑁𝑓 : • ഥ 𝑝𝑖 : prompt Aggregation • 𝜏 : • 𝑤<𝑙 : 𝑙 caption • 𝑤𝑙 : 𝑙 caption • 𝑁𝑤 : token
  • 12. ◼ • MSR-VTT [Xu+, CVPR2016] • Video: 10k, caption: 200k • Train: 9k, Val: 1k • MSVD [Chen&Dolan, ACL2011] • Video: 1970, caption: 120k • Train: 1200, Val: 100, test: 670 • LSMDC [Rohrbach+, arXiv2015] • Video: 118081, caption: 11808 • Train: 109673, Val: 7408, test: 1k ◼ Video • 6 • 224×224 ◼ : CLIP • • Text encoder, video encoder • CLIP (ViT-B/32) • Prompt Token • ( 0, 0.02) ◼ • Batch size : 128 • Optimizer : AdamW • Learning rate : 3𝑒−5 • Scheduler : CosinAnnealing
  • 13. 1: Ablation ◼ ablation 1. Baseline CLIP 2. 1 + Prompt Switch 3. 2 + Prompt Aggregation 4. 3 + Captioning Loss ◼ • MSR-VTT 1K-A ◼ •
  • 14. 2 ◼ temporal modeling • Temporal Transformer • Attention • Token Shift [Liu+, ECCV2022] • Video Proxy [Xue+, ICLR2023] • Full Attention • video Attention • ours (Prompt Switch + Aggregation) ◼ • ◼ • MSR-VTT 1K-A ◼ • Recall (𝑘 = 1, 5, 10)
  • 17. ◼ Text-image CLIP Prompt • Prompt Token • Prompt Switch • Prompt Aggregation • Captioning loss ◼ SOTA • • MSR-VTT • MSVD