!"#$%&'$()*"+,"%-.&
/&01*2$3&%4&5$,6%#(7&'8,8($,(7&
8-#&928:18,"%-&5$,*")(
!"##$%&'"(")*&'+","-&./"0*&1$/&2/3*&4#$5&63-)"%0"/0&7/-"0/*&
.38"%"9&4:":&
';.&;<,=3>/0?&43%@$#A&BCDE
Tomoya Nitta (Tamaki Lab, Nagoya Institute of Technology)
!"#$%&'()#*+",-"./
nF,"?$A
nG/5$<A
• G/5$<&;"=>/<0/0?
• D動画からDつのH"=>/<0生成
• G/5$<&I$AH%/=>/<0
• D動画から詳細な文章の生成
• 多くのセンテンスによって動画の説明
• I$0H$&G/5$<&;"=>/<0/0?
• 動作の区間とそれに対応したH"=>/<0生成
• D動画から複数のH"=>/<0生成
• G/5$<&GJ'
• 動画に関する質問応答
[Embedded paper page: Senina et al., "Coherent Multi-Sentence Video Description with Variable Level of Detail". Figure 1 shows the system's output for one video at three levels of detail: a detailed multi-sentence description, a short three-sentence description, and a single sentence ("A woman entered the kitchen and sliced a cucumber").]
(Senina+, GCPR2014)
!"#$%&'()#*+",-"./
Fig. 2 (Aafaq et al.): Illustration of differences between image captioning, video captioning, and dense video captioning.
• One caption generated for each frame
• One caption generated for each detected action
• The time interval of each caption
Common approach
Fig. 8 (Aafaq et al.): Summary of deep learning–based video description methods. Most methods employ mean pooling of frame representations to represent a video. More advanced methods use attention mechanisms, semantic attribute learning, and/or employ a sequence-to-sequence approach. These methods differ in whether the visual features are fed only at the first time-step or at all time-steps of the language model.
Feature extraction (e.g., 2D CNN, 3D CNN) → Temporal aggregation → Caption generation (e.g., RNN, LSTM); a minimal sketch of this pipeline follows.
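The three-stage pipeline above can be sketched as a minimal PyTorch-style model. The class name, feature dimensions, and single-layer LSTM decoder below are illustrative assumptions, not the survey's specific architecture.

import torch
import torch.nn as nn

class MeanPoolCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)        # project per-frame CNN features
        self.embed = nn.Embedding(vocab_size, embed_dim)  # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)      # next-word logits

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim) features from a 2D/3D CNN; captions: (B, L) token ids
        video = self.proj(frame_feats).mean(dim=1, keepdim=True)  # temporal mean pooling -> (B, 1, E)
        words = self.embed(captions)                              # (B, L, E)
        inputs = torch.cat([video, words], dim=1)   # video feature fed at the first time-step only
        hidden, _ = self.lstm(inputs)
        return self.out(hidden)                     # (B, L+1, vocab_size)

feats = torch.randn(2, 16, 2048)          # 2 videos, 16 frames of 2048-d features each
caps = torch.randint(0, 10000, (2, 12))   # 2 ground-truth captions of 12 tokens
logits = MeanPoolCaptioner()(feats, caps)
print(logits.shape)                       # torch.Size([2, 13, 10000])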
Datasets
■ Cooking
• MP-II Cooking (Rohrbach+, CVPR2012)
• YouCook (Das+, CVPR2013)
• TACoS (Rohrbach+, ECCV2012)
• TACoS-MultiLevel (Senina+, GCPR2014)
• YouCook2 (Zhou+, AAAI2018)
■ Movies
• MPII-MD (Rohrbach+, CVPR2015)
• M-VAD (Torabi+, arXiv2015)
■ Social Media
• VideoStory (Gella+, EMNLP2018)
• ActivityNet Entities (Zhou+, arXiv2018)
■ Video in the Wild
• MSVD (Chen & Dolan, ACL2011)
• MSR-VTT (Xu+, CVPR2016)
• Charades (Sigurdsson+, ECCV2016)
• VTW (Zeng+, ECCV2016)
• ActivityNet Captions (Krishna+, ICCV2017)
Datasets
Table 1. Standard Datasets for Benchmarking Video Description Methods
Dataset Domain # classes # videos avg len # clips # sent # words vocab len (hrs)
MSVD [22] open 218 1970 10s 1,970 70,028 607,339 13,010 5.3
MPII Cooking [118] cooking 65 44 600s - 5,609 - - 8.0
YouCook [29] cooking 6 88 - - 2,688 42,457 2,711 2.3
TACoS [109] cooking 26 127 360s 7,206 18,227 146,771 28,292 15.9
TACos-MLevel [114] cooking 1 185 360s 14,105 52,593 2K - 27.1
MPII-MD [116] movie - 94 3.9s 68,337 68,375 653,467 24,549 73.6
M-VAD [139] movie - 92 6.2s 48,986 55,904 519,933 17,609 84.6
MSR-VTT [160] open 20 7,180 20s 10K 200K 1,856,523 29,316 41.2
Charades [130] human 157 9,848 30s - 27,847 - - 82.01
VTW [171] open - 18,100 90s - 44,613 - - 213.2
YouCook II [174] cooking 89 2K 316s 15.4K 15.4K - 2,600 176.0
ActyNet Cap [70] open - 20K 180s - 100K 1,348,000 - 849.0
ANet-Entities [173] social media - 14,281 180s 52K - - - -
VideoStory [45] social media - 20K - 123K 123K - - 396.0
Evaluation metrics
■ Qualitative evaluation
■ Quantitative evaluation
• BLEU (Papineni+, ACL2002)
• ROUGE (Lin, ACL2004)
• METEOR (Banerjee & Lavie, ACL2005)
• CIDEr (Vedantam+, CVPR2015)
• WMD (Kusner+, ICML2015)
• SPICE (Anderson+, ECCV2016)
Table 2. Summary of Metrics Used for Video Description Evaluation
Metric Name Designed For Methodology
BLEU [102] Machine translation n-gram precision
ROUGE [82] Document summarization n-gram recall
METEOR [12] Machine translation n-gram with synonym matching
CIDEr [142] Image captioning tf-idf weighted n-gram similarity
SPICE [5] Image captioning Scene-graph synonym matching
WMD [76] Document similarity Earth mover distance on word2vec
0+)*"#"./
■ Precision
• The fraction of words in the candidate sentence that also appear in the reference sentence
• Candidate (prediction): the the the the the the (6 words)
• Reference (annotation): The table is red.
• (matched words in the candidate) / (words in the candidate) = 6/6
■ Modified precision
• Each matched word is counted only up to the number of times it occurs in the reference
• Candidate: the the the the the the (counted as 1 word)
• Reference: The table is red. ("the" is used only once in the reference)
• (matched words in the candidate) / (words in the candidate) = 1/6
• Precision favors shorter candidate sentences (see the sketch below)
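A minimal sketch (not from the survey) of the plain and clipped unigram precision described above, reproducing the slide's example:

from collections import Counter

def precision(candidate, reference):
    ref_words = set(reference)
    return sum(w in ref_words for w in candidate) / len(candidate)

def modified_precision(candidate, reference):
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())  # clip by reference counts
    return clipped / len(candidate)

cand = "the the the the the the".split()
ref = "the table is red".split()
print(precision(cand, ref))           # 1.0      (6/6)
print(modified_precision(cand, ref))  # 0.166... (1/6)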
0+)*"#"./1'2)*%&&
n.<5/(/$5&0S?%",&=%$H/A/<0
• 参照文と候補文を0S?%",に分割して.<5/(/$5&0S?%",&=%$H/A/<0
• 候補文:/>&/A&?</0?&><&%"/0&"(>$%0<<0
• K/>*&/AP*&K/A*&?</0?P*&K?</0?*&><P*&K><&%"/0P*&K%"/0*&"(>$%0<<0P
• 参照文:F>&e/--&%"/0&"(>$%0<<0
• KF>*&e/--P*&Ke/--*&%"/0P*&K%"/0*&"(>$%0<<0P
• (候補文のマッチしている0S?%",の数)b(候補文の0S?%",の数)cDbX
nN$H"--
• 参照文の各単語が候補文に含まれている割合
• 候補文:/> /A&?</0?&><&%"/0 "(>$%0<<0
• 参照文:F> e/--&%"/0 "(>$%0<<0
• (参照文のマッチしている単語数)b(参照文の単語数)cQbO
• 候補文の単語数が長いほど有利
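A minimal sketch of modified n-gram precision and unigram recall, reproducing the bigram and recall examples above:

from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_ngram_precision(candidate, reference, n):
    ref_counts = Counter(ngrams(reference, n))
    cand_counts = Counter(ngrams(candidate, n))
    clipped = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    return clipped / max(len(ngrams(candidate, n)), 1)

def recall(candidate, reference):
    cand_words = set(candidate)
    return sum(w in cand_words for w in reference) / len(reference)

cand = "it is going to rain afternoon".split()
ref = "it will rain afternoon".split()
print(modified_ngram_precision(cand, ref, 2))  # 0.2  (1/5: only (rain, afternoon) matches)
print(recall(cand, ref))                       # 0.75 (3/4: it, rain, afternoon)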
BLEU
n^/-/0?3"-&V@"-3>/<0&_05$%A>35#&K^2V_P
• BLEU = 𝑒
!"#(%&
!"
!#
, ()
exp ∑*+%
,
𝑤* log 𝑝*
• ∑*+%
,
𝑤* log 𝑝*:DS?%",から0S?%",の=%$H/A/<0の対数をとった重み付き和
• 𝑝*f0S?%",での=%$H/A/<0
• 𝑤*f重み
• 𝑒
!"#(%&
!"
!#
, ()
:候補文が短すぎるときのペナルティ
• 𝑙-:参照文の単語数
• 𝑙.:候補文の単語数
• 候補文:"&,"0&/A&=-"#/0?&=/"0<
• 参照文:"0&$-5$%-#&,"0&/A&=-"#/0?&=/"0<&/0&(%<0>&<(&"&H%<e5&/0&"0&"0>$%<<,
• ^2_VSOcCaDB
• 候補文が極端に短いのでスコアが低い
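A minimal sketch of BLEU as defined above (geometric mean of modified n-gram precisions with uniform weights w_n = 1/N, times the brevity penalty); it reproduces the slide's BLEU-4 value:

import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mod_precision(cand, ref, n):
    ref_c = Counter(ngrams(ref, n))
    cand_c = Counter(ngrams(cand, n))
    return sum(min(c, ref_c[g]) for g, c in cand_c.items()) / max(len(ngrams(cand, n)), 1)

def bleu(cand, ref, N=4):
    bp = math.exp(min(1 - len(ref) / len(cand), 0))            # brevity penalty exp(min(1 - l_r/l_c, 0))
    ps = [mod_precision(cand, ref, n) for n in range(1, N + 1)]
    if min(ps) == 0:                                            # log(0) undefined; score is 0 in that case
        return 0.0
    return bp * math.exp(sum(math.log(p) / N for p in ps))      # uniform weights w_n = 1/N

cand = "a man is playing piano".split()
ref = "an elderly man is playing piano in front of a crowd in an anteroom".split()
print(round(bleu(cand, ref), 2))   # 0.12, as on the slide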
ROUGE
nN$H"--S<%/$0>$5&_05$%A>35#&(<%&7/A>/0?&V@"-3">/<0&KN`_7VP
• N`_7VS!
• 0S?%",に分割した%$H"--
n2<0?$A>&H<,,<0&A38A$)3$0H$&K2;4P
• 候補文と参照文の一致する連続した最大のシーケンスの単語数
• 候補文fF&>"9$&"0&3,8%$--"&8$H"3A$&/> /A&?</0?&><&%"/0
• 参照文:#<3&A:<3-5&>"9$&"0&3,8%$--"&8$H"3A$&/> e/--&%"/0
• 最大のシーケンス:>"9$&"0&3,8%$--"&/>&%"/0&(X単語P
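A minimal sketch of the longest-common-subsequence computation used by ROUGE-L, with the umbrella example above:

def lcs_length(a, b):
    # dp[i][j] = length of the LCS of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

cand = "I take an umbrella because it is going to rain".split()
ref = "you should take an umbrella because it will rain".split()
print(lcs_length(cand, ref))  # 6: take, an, umbrella, because, it, rain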
ROUGE
■ ROUGE-L
• ROUGE-L = ((1 + β²) · R_lcs · P_lcs) / (R_lcs + β² · P_lcs)
• R_lcs = LCS(A, B) / m,  P_lcs = LCS(A, B) / n
• A: reference, B: candidate
• m: number of words in A, n: number of words in B
• β = P_lcs / R_lcs
• Unlike ROUGE-N, there is no need to choose n for the n-grams
• Candidate: a man is playing piano
• Reference: an elderly man is playing piano in front of a crowd in an anteroom
• ROUGE-L = 0.22 (a sketch of the computation follows)
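A minimal sketch of ROUGE-L with the definitions above, applied here to the umbrella example from the previous slide rather than the piano example:

def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(candidate, reference):
    lcs = lcs_length(candidate, reference)
    r = lcs / len(reference)           # R_lcs = LCS(A, B) / m
    p = lcs / len(candidate)           # P_lcs = LCS(A, B) / n
    beta = p / r
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p)

cand = "I take an umbrella because it is going to rain".split()
ref = "you should take an umbrella because it will rain".split()
print(round(rouge_l(cand, ref), 2))   # 0.63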
Supplementary material
90:;;'<..="/>
n."g&M-"09&F0A>/>3>$&(<%&F0(<%,">/HA&K.MSFFP&;<<9/0?
• 動画数:OO
• 平均時間:[CC秒
• アノテーション数:X[CE
• 人がQCフレーム以上写ってない時
はd8"H9?%<305&"H>/@/>#dとアノテーションされて
いる
• クラス数:BDW
[Embedded paper page: "A Database for Fine Grained Activity Detection of Cooking Activities", Rohrbach et al., Max Planck Institute for Informatics. Figure 1 shows fine-grained cooking activities: (a) full scene of cut slices, and crops of (b) take out from drawer, (c) cut dice, (d) take out from fridge, (e) squeeze, (f) peel, (g) wash object, (h) grate.]
(Rohrbach+, CVPR2012)
YouCook
■ YouCook
• Number of videos: 88
• Training: 49
• Test: 39
• The background varies in most of the videos
• More people appear than in MP-II Cooking
[Embedded figure: example YouCook annotations showing keywords, sentences generated by the authors' system, and human synopses for videos such as "Cleaning an appliance", "Town hall meeting", "Renovating home", and "Metal crafts project".]
(Das+, CVPR2013)
TACoS
nR$g>3"--#&'00<>">$5&;<<9/0?&4H$0$A&KR';<4P
• 動画数:DB]
• クリップ数:]BC[
• 平均時間:Q[C秒
• クラス数:B[
• 総単語数:DO[]]D
• 動詞:BWBEB
• キャプションには開始と終了のタイムスタンプが取得可能
@A<.B:9$&-"&)C)&
■ TACoS-Multilevel
• Three types of captions for each video
• A detailed description of up to 15 sentences
• A short description of 3 to 5 sentences
• A one-sentence description
[Embedded paper page: Senina et al., Figure 1: output of their system for a video, producing coherent multi-sentence descriptions at three levels of detail, using automatic segmentation and extraction.]
(Senina+, GCPR2014)
YouCook II
■ YouCook II
• Number of videos: 2K
• Number of clips: 15.4K
• Every clip is annotated with a caption and its temporal boundaries
[Embedded figure (Zhou et al., Figure 1): an example from the YouCook2 dataset on making a BLT sandwich; each procedure step has annotated time boundaries and is described by an English sentence (YouTube ID: 4eWzsx1vAi8).]
(Zhou+, AAAI2018)
MPII-MD
■ MPII-Movie Description
• Transcribed audio descriptions of Hollywood movies
• Captions obtained from the audio descriptions
• Captions obtained from the video footage
• Number of clips: 68,337
• Number of annotations: 68,375
• Each annotation is linked to its time interval by timestamps
• Total number of words: 653,467
(Rohrbach+, CVPR2015)
M-VAD
n.<0>%$"-&G/5$<&'00<>">/<0&I">"A$>
• クリップ数:OWEW[
• 学習:QWEOE
• 検証:OWWW
• テスト:XDOE
• 平均時間:[aB秒
• アノテーション数:XXECO
(Torabi+, arXiv2015)
!"D).B-.+E
■ VideoStory
• Story narration for long videos
• Each video's narration is described as a paragraph
• Number of videos: 20K
• Training: 17,908
• Validation: 999
• Test: 1,011
[Embedded table (Table 3): example VideoStory annotations; each video has multiple paragraph descriptions (three annotators' paragraphs for a wedding video are shown) with localized time intervals for every sentence.]
(Gella+, EMNLP2018)
A*-"C"-EF)-'5/-"-")#
■ ActivityNet Entities
• Captions with bounding boxes linked to the corresponding words
• Number of clips: 52K
• Annotations
• Captions: 20
• Bounding boxes: 158K
[Embedded figure (Figure 2): an annotated example from the ActivityNet-Entities dataset for the caption "A man in a striped shirt is playing the piano on the street while people watch him", with noun phrases grounded to bounding boxes.]
(Zhou+, arXiv2018)
MSVD
■ Microsoft Video Description
• Number of clips: 1,970
• Training: 1,200
• Validation: 100
• Test: 670
• Annotations
• About 41 per video on average
• Multilingual descriptions
[Embedded figure: the Mechanical Turk data-collection instructions from Chen & Dolan; annotators watch a very short YouTube segment (usually less than 10 seconds), describe the main action or event in one complete sentence in any language they are comfortable with, and are shown examples of good and bad descriptions.]
(Chen & Dolan, ACL2011)
MSR-VTT
■ MSR-Video to Text
• Number of clips: 10K
• Annotations: 200K
• 20 per clip
• Audio is also included
[Embedded figure (Xu et al., Figure 1): examples of clips and labeled sentences in the MSR-VTT dataset; six samples, each shown with four frames and five human-labeled sentences.]
(Xu+, CVPR2016)
Charades
n;:"%$5$A
• 日常的な室内での動画
• 動画数:EWOW
• 学習:]EWX
• テスト:DW[Q
• 出現するオブジェクト数:O[
• アクションクラス:DX]
• アノテーション数:B]WO]
[Embedded figure: sample frames from the Charades dataset (Sigurdsson et al.).]
(Sigurdsson+, ECCV2016)
VTW
■ Video Titles in the Wild
• Number of clips: 18,100
• Average length: 1.5 minutes
• Number of annotations: 44,613
• Each clip is described with one sentence
• Information not shown in the clip is also described (augmented sentences)
(Zeng+, ECCV2016)
A*-"C"-EF)-'<%,-"./#
n'H>/@/>#!$>&;"=>/<0A
• 動画数:BC
• アノテーション数:DCC
• 平均単語数:DQa[W
• Dビデオの複数のキャプションは時間が
重複する部分もある
(Krishna+, ICCV2017)
METEOR
■ Metric for Evaluation of Translation with Explicit ORdering (METEOR)
• As preprocessing, words are stemmed and matched using synonyms from WordNet (Fellbaum, 1998)
• Stemming: word forms are normalized so that words with the same meaning match
• METEOR = F_mean · (1 − p)   (METEOR = 0.45 in the example)
• F_mean = 10 · P · R / (R + 9 · P)
• P, R: unigram precision and recall
• p = 0.5 · (N_c / N_u)³
• N_c: number of chunks, i.e., matched word sequences in the same order (2 in the example)
• N_u: number of matched unigrams
• Candidate: an eldery man is showing how to play piano in front of crowd in a hall room
• Reference: an elderly man is playing piano in front of a crowd in an anteroom
• Word pairs such as eldery/elderly and play/playing are treated as matches because of stemming and synonym matching (see the sketch below)
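A minimal sketch of the METEOR scoring formula above; the unigram alignment (stemming and synonym matching) is assumed to have been done already, and the numbers in the call are illustrative, not the slide's exact counts:

def meteor(p, r, n_chunks, n_matched):
    f_mean = 10 * p * r / (r + 9 * p)             # recall-weighted harmonic mean
    penalty = 0.5 * (n_chunks / n_matched) ** 3   # fragmentation penalty
    return f_mean * (1 - penalty)

print(meteor(p=0.8, r=0.85, n_chunks=2, n_matched=12))  # illustrative values only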
CIDEr
■ tf-idf
• tf = (number of occurrences of word w in document A) / (total number of word occurrences in document A)
• idf = log((total number of documents) / (number of documents containing word w))
• tf-idf = tf × idf
• tf-idf vector: the vector of tf-idf values for each word
■ Consensus-based Image Description Evaluation
• As in METEOR, stemming is applied as preprocessing
• Cosine similarity between the tf-idf vectors of the n-grams of the reference captions and the generated caption
• CIDEr(c_i, S_i) = Σ_{n=1}^{N} w_n · CIDEr_n(c_i, S_i)
• CIDEr_n(c_i, S_i) = (1/M) · Σ_j [ g^n(c_i) · g^n(s_ij) ] / ( ‖g^n(c_i)‖ · ‖g^n(s_ij)‖ )
• g^n(c_i): tf-idf vector over the n-grams of the generated caption
• g^n(s_ij): tf-idf vector over the n-grams of the j-th reference caption
• M: number of reference captions (one video has multiple reference captions; see the sketch below)
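A minimal sketch of a CIDEr_n-style score for a single n (unigrams here): tf-idf weighted vectors compared by cosine similarity and averaged over the references. The idf values and captions below are illustrative, not from the slide:

import math
from collections import Counter

def tfidf_vector(tokens, idf):
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: (c / total) * idf.get(w, 0.0) for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cider_n(candidate, references, idf):
    g_c = tfidf_vector(candidate, idf)
    return sum(cosine(g_c, tfidf_vector(r, idf)) for r in references) / len(references)

# illustrative idf values and captions (in practice idf is estimated from the reference corpus)
idf = {"man": 1.2, "piano": 2.5, "playing": 1.8, "a": 0.1, "is": 0.1}
cand = "a man is playing piano".split()
refs = ["a man plays the piano".split(), "someone is playing piano".split()]
print(cider_n(cand, refs, idf))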
WMD
■ Word Mover's Distance
• The documents being compared are converted to Bag-of-Words vectors, and the distance between the documents is measured
• The distance is computed by solving an Earth Mover's Distance problem over word2vec embeddings
SPICE
n4$,"0>/H&M%<=<A/>/<0"-&F,"?$&;"=>/<0/0?&V@"-3">/<0
• 文書から生成されるシーングラフのhDスコアを計算
• シーングラフ生成に依存 SPICE: Semantic Propositional Image Caption Evaluation 383
Fig. 1. Illustrates our method’s main principle which uses semantic propositional con-
tent to assess the quality of image captions. Reference and candidate captions are
mapped through dependency parse trees (top) to semantic scene graphs (right)—
(Anderson+, ECCV2016)
More Related Content

Similar to 文献紹介:Video Description: A Survey of Methods, Datasets, and Evaluation Metrics

文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video ClassificationToru Tamaki
 
文献紹介:TinyVIRAT: Low-resolution Video Action Recognition
文献紹介:TinyVIRAT: Low-resolution Video Action Recognition文献紹介:TinyVIRAT: Low-resolution Video Action Recognition
文献紹介:TinyVIRAT: Low-resolution Video Action RecognitionToru Tamaki
 
エクセル版 ヒストグラム(度数分布表)による工程解析 改善事例集 
エクセル版 ヒストグラム(度数分布表)による工程解析 改善事例集 エクセル版 ヒストグラム(度数分布表)による工程解析 改善事例集 
エクセル版 ヒストグラム(度数分布表)による工程解析 改善事例集 博行 門眞
 
SSII2014 チュートリアル資料
SSII2014 チュートリアル資料SSII2014 チュートリアル資料
SSII2014 チュートリアル資料Masayuki Tanaka
 
文献紹介:SlowFast Networks for Video Recognition
文献紹介:SlowFast Networks for Video Recognition文献紹介:SlowFast Networks for Video Recognition
文献紹介:SlowFast Networks for Video RecognitionToru Tamaki
 
FRT Vol. 4 インサイド・クラウドプラットフォーム
FRT Vol. 4 インサイド・クラウドプラットフォームFRT Vol. 4 インサイド・クラウドプラットフォーム
FRT Vol. 4 インサイド・クラウドプラットフォームYasunari Goto (iChain. Inc.)
 
文献紹介:Tell Me Where to Look: Guided Attention Inference Network
文献紹介:Tell Me Where to Look: Guided Attention Inference Network文献紹介:Tell Me Where to Look: Guided Attention Inference Network
文献紹介:Tell Me Where to Look: Guided Attention Inference NetworkToru Tamaki
 
市場競争力強化に向けて(2004)
市場競争力強化に向けて(2004)市場競争力強化に向けて(2004)
市場競争力強化に向けて(2004)Tsuyoshi Horigome
 
文献紹介:CutDepth: Edge-aware Data Augmentation in Depth Estimation
文献紹介:CutDepth: Edge-aware Data Augmentation in Depth Estimation文献紹介:CutDepth: Edge-aware Data Augmentation in Depth Estimation
文献紹介:CutDepth: Edge-aware Data Augmentation in Depth EstimationToru Tamaki
 
Sigir2013 勉強会資料
Sigir2013 勉強会資料Sigir2013 勉強会資料
Sigir2013 勉強会資料Mitsuo Yamamoto
 
OpenCVを用いた画像処理入門
OpenCVを用いた画像処理入門OpenCVを用いた画像処理入門
OpenCVを用いた画像処理入門uranishi
 
文献紹介:Benchmarking Neural Network Robustness to Common Corruptions and Perturb...
文献紹介:Benchmarking Neural Network Robustness to Common Corruptions and Perturb...文献紹介:Benchmarking Neural Network Robustness to Common Corruptions and Perturb...
文献紹介:Benchmarking Neural Network Robustness to Common Corruptions and Perturb...Toru Tamaki
 
動画像理解のための深層学習アプローチ
動画像理解のための深層学習アプローチ動画像理解のための深層学習アプローチ
動画像理解のための深層学習アプローチToru Tamaki
 
10分で分かるr言語入門ver2.9 14 0920
10分で分かるr言語入門ver2.9 14 0920 10分で分かるr言語入門ver2.9 14 0920
10分で分かるr言語入門ver2.9 14 0920 Nobuaki Oshiro
 
ScalableCore system at SWoPP2010 BoF-2
ScalableCore system at SWoPP2010 BoF-2ScalableCore system at SWoPP2010 BoF-2
ScalableCore system at SWoPP2010 BoF-2Shinya Takamaeda-Y
 
実践で学ぶネットワーク分析
実践で学ぶネットワーク分析実践で学ぶネットワーク分析
実践で学ぶネットワーク分析Mitsunori Sato
 
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusionToru Tamaki
 

Similar to 文献紹介:Video Description: A Survey of Methods, Datasets, and Evaluation Metrics (20)

文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
文献紹介:VideoMix: Rethinking Data Augmentation for Video Classification
 
文献紹介:TinyVIRAT: Low-resolution Video Action Recognition
文献紹介:TinyVIRAT: Low-resolution Video Action Recognition文献紹介:TinyVIRAT: Low-resolution Video Action Recognition
文献紹介:TinyVIRAT: Low-resolution Video Action Recognition
 
エクセル版 ヒストグラム(度数分布表)による工程解析 改善事例集 
エクセル版 ヒストグラム(度数分布表)による工程解析 改善事例集 エクセル版 ヒストグラム(度数分布表)による工程解析 改善事例集 
エクセル版 ヒストグラム(度数分布表)による工程解析 改善事例集 
 
SSII2014 チュートリアル資料
SSII2014 チュートリアル資料SSII2014 チュートリアル資料
SSII2014 チュートリアル資料
 
文献紹介:SlowFast Networks for Video Recognition
文献紹介:SlowFast Networks for Video Recognition文献紹介:SlowFast Networks for Video Recognition
文献紹介:SlowFast Networks for Video Recognition
 
FRT Vol. 4 インサイド・クラウドプラットフォーム
FRT Vol. 4 インサイド・クラウドプラットフォームFRT Vol. 4 インサイド・クラウドプラットフォーム
FRT Vol. 4 インサイド・クラウドプラットフォーム
 
文献紹介:Tell Me Where to Look: Guided Attention Inference Network
文献紹介:Tell Me Where to Look: Guided Attention Inference Network文献紹介:Tell Me Where to Look: Guided Attention Inference Network
文献紹介:Tell Me Where to Look: Guided Attention Inference Network
 
南幸衛・HPC専務
南幸衛・HPC専務南幸衛・HPC専務
南幸衛・HPC専務
 
JEITA4602B
JEITA4602BJEITA4602B
JEITA4602B
 
市場競争力強化に向けて(2004)
市場競争力強化に向けて(2004)市場競争力強化に向けて(2004)
市場競争力強化に向けて(2004)
 
文献紹介:CutDepth: Edge-aware Data Augmentation in Depth Estimation
文献紹介:CutDepth: Edge-aware Data Augmentation in Depth Estimation文献紹介:CutDepth: Edge-aware Data Augmentation in Depth Estimation
文献紹介:CutDepth: Edge-aware Data Augmentation in Depth Estimation
 
Sigir2013 勉強会資料
Sigir2013 勉強会資料Sigir2013 勉強会資料
Sigir2013 勉強会資料
 
OpenCVを用いた画像処理入門
OpenCVを用いた画像処理入門OpenCVを用いた画像処理入門
OpenCVを用いた画像処理入門
 
文献紹介:Benchmarking Neural Network Robustness to Common Corruptions and Perturb...
文献紹介:Benchmarking Neural Network Robustness to Common Corruptions and Perturb...文献紹介:Benchmarking Neural Network Robustness to Common Corruptions and Perturb...
文献紹介:Benchmarking Neural Network Robustness to Common Corruptions and Perturb...
 
動画像理解のための深層学習アプローチ
動画像理解のための深層学習アプローチ動画像理解のための深層学習アプローチ
動画像理解のための深層学習アプローチ
 
Ruby4Ctf
Ruby4CtfRuby4Ctf
Ruby4Ctf
 
10分で分かるr言語入門ver2.9 14 0920
10分で分かるr言語入門ver2.9 14 0920 10分で分かるr言語入門ver2.9 14 0920
10分で分かるr言語入門ver2.9 14 0920
 
ScalableCore system at SWoPP2010 BoF-2
ScalableCore system at SWoPP2010 BoF-2ScalableCore system at SWoPP2010 BoF-2
ScalableCore system at SWoPP2010 BoF-2
 
実践で学ぶネットワーク分析
実践で学ぶネットワーク分析実践で学ぶネットワーク分析
実践で学ぶネットワーク分析
 
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
論文紹介:Revealing the unseen: Benchmarking video action recognition under occlusion
 

More from Toru Tamaki

論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...Toru Tamaki
 
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...Toru Tamaki
 
論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNetToru Tamaki
 
論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A surveyToru Tamaki
 
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex ScenesToru Tamaki
 
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...Toru Tamaki
 
論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video SegmentationToru Tamaki
 
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New HopeToru Tamaki
 
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...Toru Tamaki
 
論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt TuningToru Tamaki
 
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in MoviesToru Tamaki
 
論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICAToru Tamaki
 
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context RefinementToru Tamaki
 
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...Toru Tamaki
 
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...Toru Tamaki
 
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous DrivingToru Tamaki
 
論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large Motion論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large MotionToru Tamaki
 
論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense Predictions論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense PredictionsToru Tamaki
 
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understandingToru Tamaki
 
論文紹介:Masked Vision and Language Modeling for Multi-modal Representation Learning
論文紹介:Masked Vision and Language Modeling for Multi-modal Representation Learning論文紹介:Masked Vision and Language Modeling for Multi-modal Representation Learning
論文紹介:Masked Vision and Language Modeling for Multi-modal Representation LearningToru Tamaki
 

More from Toru Tamaki (20)

論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
論文紹介:Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Gene...
 
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
論文紹介:Content-Aware Token Sharing for Efficient Semantic Segmentation With Vis...
 
論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet論文紹介:Automated Classification of Model Errors on ImageNet
論文紹介:Automated Classification of Model Errors on ImageNet
 
論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey論文紹介:Semantic segmentation using Vision Transformers: A survey
論文紹介:Semantic segmentation using Vision Transformers: A survey
 
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
論文紹介:MOSE: A New Dataset for Video Object Segmentation in Complex Scenes
 
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
論文紹介:MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Acti...
 
論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation論文紹介:Tracking Anything with Decoupled Video Segmentation
論文紹介:Tracking Anything with Decoupled Video Segmentation
 
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
論文紹介:Real-Time Evaluation in Online Continual Learning: A New Hope
 
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
論文紹介:PointNet: Deep Learning on Point Sets for 3D Classification and Segmenta...
 
論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning論文紹介:Multitask Vision-Language Prompt Tuning
論文紹介:Multitask Vision-Language Prompt Tuning
 
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies論文紹介:MovieCLIP: Visual Scene Recognition in Movies
論文紹介:MovieCLIP: Visual Scene Recognition in Movies
 
論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA論文紹介:Discovering Universal Geometry in Embeddings with ICA
論文紹介:Discovering Universal Geometry in Embeddings with ICA
 
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
論文紹介:Efficient Video Action Detection with Token Dropout and Context Refinement
 
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
論文紹介:Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Lo...
 
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
論文紹介:MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Lon...
 
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
論文紹介:Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
 
論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large Motion論文紹介:Spatio-Temporal Action Detection Under Large Motion
論文紹介:Spatio-Temporal Action Detection Under Large Motion
 
論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense Predictions論文紹介:Vision Transformer Adapter for Dense Predictions
論文紹介:Vision Transformer Adapter for Dense Predictions
 
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
動画像理解のための深層学習アプローチ Deep learning approaches to video understanding
 
論文紹介:Masked Vision and Language Modeling for Multi-modal Representation Learning
論文紹介:Masked Vision and Language Modeling for Multi-modal Representation Learning論文紹介:Masked Vision and Language Modeling for Multi-modal Representation Learning
論文紹介:Masked Vision and Language Modeling for Multi-modal Representation Learning
 

Recently uploaded

AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdf
AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdfAWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdf
AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdfFumieNakayama
 
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdf
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdfクラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdf
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdfFumieNakayama
 
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineerYuki Kikuchi
 
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?akihisamiyanaga1
 
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)UEHARA, Tetsutaro
 
業務で生成AIを活用したい人のための生成AI入門講座(社外公開版:キンドリルジャパン社内勉強会:2024年4月発表)
業務で生成AIを活用したい人のための生成AI入門講座(社外公開版:キンドリルジャパン社内勉強会:2024年4月発表)業務で生成AIを活用したい人のための生成AI入門講座(社外公開版:キンドリルジャパン社内勉強会:2024年4月発表)
業務で生成AIを活用したい人のための生成AI入門講座(社外公開版:キンドリルジャパン社内勉強会:2024年4月発表)Hiroshi Tomioka
 
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察 ~Text-to-MusicとText-To-ImageかつImage-to-Music...
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察  ~Text-to-MusicとText-To-ImageかつImage-to-Music...モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察  ~Text-to-MusicとText-To-ImageかつImage-to-Music...
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察 ~Text-to-MusicとText-To-ImageかつImage-to-Music...博三 太田
 
NewSQLの可用性構成パターン(OCHaCafe Season 8 #4 発表資料)
NewSQLの可用性構成パターン(OCHaCafe Season 8 #4 発表資料)NewSQLの可用性構成パターン(OCHaCafe Season 8 #4 発表資料)
NewSQLの可用性構成パターン(OCHaCafe Season 8 #4 発表資料)NTT DATA Technology & Innovation
 

Recently uploaded (8)

AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdf
AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdfAWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdf
AWS の OpenShift サービス (ROSA) を使った OpenShift Virtualizationの始め方.pdf
 
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdf
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdfクラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdf
クラウドネイティブなサーバー仮想化基盤 - OpenShift Virtualization.pdf
 
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer
自分史上一番早い2024振り返り〜コロナ後、仕事は通常ペースに戻ったか〜 by IoT fullstack engineer
 
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?
CTO, VPoE, テックリードなどリーダーポジションに登用したくなるのはどんな人材か?
 
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)
デジタル・フォレンジックの最新動向(2024年4月27日情洛会総会特別講演スライド)
 
業務で生成AIを活用したい人のための生成AI入門講座(社外公開版:キンドリルジャパン社内勉強会:2024年4月発表)
業務で生成AIを活用したい人のための生成AI入門講座(社外公開版:キンドリルジャパン社内勉強会:2024年4月発表)業務で生成AIを活用したい人のための生成AI入門講座(社外公開版:キンドリルジャパン社内勉強会:2024年4月発表)
業務で生成AIを活用したい人のための生成AI入門講座(社外公開版:キンドリルジャパン社内勉強会:2024年4月発表)
 
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察 ~Text-to-MusicとText-To-ImageかつImage-to-Music...
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察  ~Text-to-MusicとText-To-ImageかつImage-to-Music...モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察  ~Text-to-MusicとText-To-ImageかつImage-to-Music...
モーダル間の変換後の一致性とジャンル表を用いた解釈可能性の考察 ~Text-to-MusicとText-To-ImageかつImage-to-Music...
 
NewSQLの可用性構成パターン(OCHaCafe Season 8 #4 発表資料)
NewSQLの可用性構成パターン(OCHaCafe Season 8 #4 発表資料)NewSQLの可用性構成パターン(OCHaCafe Season 8 #4 発表資料)
NewSQLの可用性構成パターン(OCHaCafe Season 8 #4 発表資料)
 

文献紹介:Video Description: A Survey of Methods, Datasets, and Evaluation Metrics

  • 2. !"#$%&'()#*+",-"./ nF,"?$A nG/5$<A • G/5$<&;"=>/<0/0? • D動画からDつのH"=>/<0生成 • G/5$<&I$AH%/=>/<0 • D動画から詳細な文章の生成 • 多くのセンテンスによって動画の説明 • I$0H$&G/5$<&;"=>/<0/0? • 動作の区間とそれに対応したH"=>/<0生成 • D動画から複数のH"=>/<0生成 • G/5$<&GJ' • 動画に関する質問応答 Anna Senina1 Marcus Rohrbach1 Wei Qiu1,2 Annemarie Friedrich2 Sikandar Amin3 Mykhaylo Andriluka1 Manfred Pinkal2 Bernt Schiele1 1 Max Planck Institute for Informatics, Saarbrücken, Germany 2 Department of Computational Linguistics, Saarland University, Saarbrücken, Germany 3 Intelligent Autonomous Systems, Technische Universität München, München, Germany Abstract Humans can easily describe what they see in a coher- ent way and at varying level of detail. However, existing approaches for automatic video description are mainly fo- cused on single sentence generation and produce descrip- tions at a fixed level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from the SR. To produce consistent multi-sentence descriptions, we model across-sentence consistency at the level of the SR by en- forcing a consistent topic. We also contribute both to the visual recognition of objects proposing a hand-centric ap- proach as well as to the robust generation of sentences us- ing a word lattice. Human judges rate our multi-sentence descriptions as more readable, correct, and relevant than related work. To understand the difference between more Detailed: A woman turned on stove. Then, she took out a cucumber from the fridge. She washed the cucumber in the sink. She took out a cutting board and knife. She took out a plate from the drawer. She got out a plate. Next, she took out a peeler from the drawer. She peeled the skin off of the cucumber. She threw away the peels into the wastebin. The woman sliced the cucumber on the cutting board. In the end, she threw away the peels into the wastebin. Short: A woman took out a cucumber from the refrigerator. Then, she peeled the cucumber. Finally, she sliced the cucumber on the cutting board. One sentence: A woman entered the kitchen and sliced a cucumber. Figure 1: Output of our system for a video, producing co- herent multi-sentence descriptions at three levels of detail, using our automatic segmentation and extraction. 03.6173v1 [cs.CV] 24 Mar 2014 K4$0/0"L*&7;MNBCDOP
  • 3. !"#$%&'()#*+",-"./ 115:4 N. Aafaq et al. Fig. 2. Illustration of differences between image captioning, video captioning, and dense video captioning. Image (video frame) captioning describes each frame with a single sentence. Video captioning describes the フレームの枚数分 !"#$%&'生成 検出された動作の数 !"#$%&'生成 各!"#$%&'の区間
  • 4. よくある手法 ideo Description: A Survey of Methods, Datasets, and Evaluation Metrics 115:13 g. 8. Summary of deep learning–based video description methods. Most methods employ mean pooling frame representations to represent a video. More advanced methods use attention mechanisms, semantic ttribute learning, and/or employ a sequence-to-sequence approach. These methods differ in whether the sual features are fed only at first time-step or all time-steps of the language model. 特徴量抽出 KBI;!!*&QI;!!などP 時間集約 キャプション生成 KN!!*&24R.などP
  • 5. データセット n;<<9/0? • .MST&;<<9/0?&KN<:%8"H:L*& ;GMNBCDBP • U<3;<<9&KI"AL*&;GMNBCDQP • R';<4&KN<:%8"H:L*&V;;GBCDBP • R';<4S.3->/2$@$-&K4$0/0"L*& 7;MNBCDOP • U<3;<<9T&K6:<3L*&'''FBCDWP n.<@/$A • .MTS.I&KN<:%8"H:L*&;GMNBCDXP • .SG'I&KR<%"8/L*&"%Y/@BCDXP n4<H/"-&.$5/" • G/5$<4><%#&K7$--"L*&V!.2MBCDWP • 'H>/@/>#!$>&V0>/>/$A&K6:<3L*& "%Y/@BCDWP nG/5$<&/0&>:$&1/-5 • .4GI&K;:$0&Z&I<-"0*&';2BCDDP • .4NSGRR&KY3L*&;GMNBCD[P • ;:"%"5$A&K4/?3%5AA<0L*& V;;GBCD[P • GR1&K6$0?L*&V;;GBCD[P • 'H>/@/>#!$>&;"=>/<0A&K%/A:0"L*& F;;GBCD]P
  • 6. データセット 115:16 N. Aafaq et al. Table 1. Standard Datasets for Benchmarking Video Description Methods Dataset Domain # classes # videos avg len # clips # sent # words vocab len (hrs) MSVD [22] open 218 1970 10s 1,970 70,028 607,339 13,010 5.3 MPII Cooking [118] cooking 65 44 600s - 5,609 - - 8.0 YouCook [29] cooking 6 88 - - 2,688 42,457 2,711 2.3 TACoS [109] cooking 26 127 360s 7,206 18,227 146,771 28,292 15.9 TACos-MLevel [114] cooking 1 185 360s 14,105 52,593 2K - 27.1 MPII-MD [116] movie - 94 3.9s 68,337 68,375 653,467 24,549 73.6 M-VAD [139] movie - 92 6.2s 48,986 55,904 519,933 17,609 84.6 MSR-VTT [160] open 20 7,180 20s 10K 200K 1,856,523 29,316 41.2 Charades [130] human 157 9,848 30s - 27,847 - - 82.01 VTW [171] open - 18,100 90s - 44,613 - - 213.2 YouCook II [174] cooking 89 2K 316s 15.4K 15.4K - 2,600 176.0 ActyNet Cap [70] open - 20K 180s - 100K 1,348,000 - 849.0 ANet-Entities [173] social media - 14,281 180s 52K - - - - VideoStory [45] social media - 20K - 123K 123K - - 396.0
  • 7. 評価指標 n定性的評価 n定量的評価 • ^2V_&KM"=/0$0/L*&';2BCCBP • N`_7V&K2/0*&';2BCCOP • .VRV`N&K^"0$%+$$&Z&2"@/$*&';2BCCXP • ;FIV%&KG$5"0>",L*&;GMNBCDXP • 1.I&K3A0$%L*&F;.2BCDXP • 4MF;V&K'05$%A<0L*&V;;GBCD[P Video Description: A Survey of Methods, Datasets, and Evaluation Metrics 115:2 Table 2. Summary of Metrics Used for Video Description Evaluation Metric Name Designed For Methodology BLEU [102] Machine translation n-gram precision ROUGE [82] Document summarization n-gram recall METEOR [12] Machine translation n-gram with synonym matching CIDEr [142] Image captioning tf-idf weighted n-gram similarity SPICE [5] Image captioning Scene-graph synonym matching WMD [76] Document similarity Earth mover distance on word2vec
  • 8. 0+)*"#"./ nM%$H/A/<0 • 候補文の各単語が参照文に含まれている割合 • 候補文 (予測):>:$ >:$ >:$ >:$ >:$ >:$ K[単語P • 参照文 (アノテーション):R:$ >"8-$&/A&%$5a • (候補文のマッチしている単語数)b(候補文の単語数)c[b[ n.<5/(/$5&=%$H/A/<0 • M%$H/A/<0のマッチしている単語数を参照文で出た回数までしか数えない • 候補文:>:$ >:$&>:$&>:$&>:$&>:$&KD単語P • 参照文:R:$ >"8-$&/A&%$5a&K参照文でのd>:$dは1回しか使われてないP • (候補文のマッチしている単語数)b(候補文の単語数)cDb[ • 候補文の単語数が短いほど有利
  • 9. 0+)*"#"./1'2)*%&& n.<5/(/$5&0S?%",&=%$H/A/<0 • 参照文と候補文を0S?%",に分割して.<5/(/$5&0S?%",&=%$H/A/<0 • 候補文:/>&/A&?</0?&><&%"/0&"(>$%0<<0 • K/>*&/AP*&K/A*&?</0?P*&K?</0?*&><P*&K><&%"/0P*&K%"/0*&"(>$%0<<0P • 参照文:F>&e/--&%"/0&"(>$%0<<0 • KF>*&e/--P*&Ke/--*&%"/0P*&K%"/0*&"(>$%0<<0P • (候補文のマッチしている0S?%",の数)b(候補文の0S?%",の数)cDbX nN$H"-- • 参照文の各単語が候補文に含まれている割合 • 候補文:/> /A&?</0?&><&%"/0 "(>$%0<<0 • 参照文:F> e/--&%"/0 "(>$%0<<0 • (参照文のマッチしている単語数)b(参照文の単語数)cQbO • 候補文の単語数が長いほど有利
  • 10. 3456 n^/-/0?3"-&V@"-3>/<0&_05$%A>35#&K^2V_P • BLEU = 𝑒 !"#(%& !" !# , () exp ∑*+% , 𝑤* log 𝑝* • ∑*+% , 𝑤* log 𝑝*:DS?%",から0S?%",の=%$H/A/<0の対数をとった重み付き和 • 𝑝*f0S?%",での=%$H/A/<0 • 𝑤*f重み • 𝑒 !"#(%& !" !# , () :候補文が短すぎるときのペナルティ • 𝑙-:参照文の単語数 • 𝑙.:候補文の単語数 • 候補文:"&,"0&/A&=-"#/0?&=/"0< • 参照文:"0&$-5$%-#&,"0&/A&=-"#/0?&=/"0<&/0&(%<0>&<(&"&H%<e5&/0&"0&"0>$%<<, • ^2_VSOcCaDB • 候補文が極端に短いのでスコアが低い
  • 11. 27685 nN$H"--S<%/$0>$5&_05$%A>35#&(<%&7/A>/0?&V@"-3">/<0&KN`_7VP • N`_7VS! • 0S?%",に分割した%$H"-- n2<0?$A>&H<,,<0&A38A$)3$0H$&K2;4P • 候補文と参照文の一致する連続した最大のシーケンスの単語数 • 候補文fF&>"9$&"0&3,8%$--"&8$H"3A$&/> /A&?</0?&><&%"/0 • 参照文:#<3&A:<3-5&>"9$&"0&3,8%$--"&8$H"3A$&/> e/--&%"/0 • 最大のシーケンス:>"9$&"0&3,8%$--"&/>&%"/0&(X単語P
  • 12. 27685 nN`_7VS2 • N`_7VS2= (%/0$)1!#%2!#% 1!#%/0$2!#% • 𝑅3.4: 567(8,9) : , 𝑃3.4: 567(8,9) * • ':参照文*&^:候補文 • ,f'の単語数*&0f^の単語数 • 𝛽: 2!#% 1!#% • N`_7VS!と違って0S?%",を任意で決定する必要がない • 候補文:"&,"0&/A&=-"#/0?&=/"0< • 参照文:"0&$-5$%-#&,"0&/A&=-"#/0?&=/"0<&/0&(%<0>&<(&"&H%<e5&/0&"0&"0>$%<<, • N`_7VS2cCaBB
  • 14. 90:;;'<..="/> n."g&M-"09&F0A>/>3>$&(<%&F0(<%,">/HA&K.MSFFP&;<<9/0? • 動画数:OO • 平均時間:[CC秒 • アノテーション数:X[CE • 人がQCフレーム以上写ってない時 はd8"H9?%<305&"H>/@/>#dとアノテーションされて いる • クラス数:BDW A Database for Fine Grained Activity Detection of Cooking Activities Marcus Rohrbach Sikandar Amin Mykhaylo Andriluka Bernt Schiele Max Planck Institute for Informatics, Saarbrücken, Germany Abstract While activity recognition is a current focus of research the challenging problem of fine-grained activity recognition is largely overlooked. We thus propose a novel database of 65 cooking activities, continuously recorded in a realistic setting. Activities are distinguished by fine-grained body motions that have low inter-class variability and high intra- class variability due to diverse subjects and ingredients. We benchmark two approaches on our dataset, one based on articulated pose tracks and the second using holistic video features. While the holistic approach outperforms the pose- based approach, our evaluation suggests that fine-grained activities are more difficult to detect and the body model can help in those cases. Providing high-resolution videos as well as an intermediate pose representation we hope to foster research in fine-grained activity recognition. 1. Introduction Figure 1. Fine grained cooking activities. (a) Full scene of cut slices, and crops of (b) take out from drawer, (c) cut dice, (d) take out from fridge, (e) squeeze, (f) peel, (g) wash object, (h) grate (Rohrbach+, CVPR2012)
  • 15. ?.$<..= nU<3;<<9 • 動画数:WW • 学習:OE • テスト:QE • ほとんど動画で背景が変わる • .MSFF&;<<9/0?よりも多くの人が出てくる Keywords: refrigerator/OBJ cleans/VERB man/SUBJ-HUMAN clean/VERB blender/OBJ cleaning/VERB woman/SUBJ-HUMAN person/SUBJ-HUMAN stove/OBJ microwave/OBJ sponge/NOUN food/OBJ home/OBJ hose/OBJ oven/OBJ Sentences from Our System 1. A person is using dish towel and hand held brush or vacuum to clean panel with knobs and washing basin or sink. 2. Man cleaning a refrigerator. 3. Man cleans his blender. 4. Woman cleans old food out of refrigerator. 5. Man cleans top of microwave with sponge. Human Synopsis: Two standing persons clean a stove top with a vacuum clean with a hose. Keywords: meeting/VERB town/NOUN hall/OBJ microphone/OBJ talking/VERB people/OBJ podium/OBJ speech/OBJ woman/SUBJ- HUMAN man/SUBJ-HUMAN chairs/NOUN clapping/VERB speaks/VERB questions/VERB giving/VERB Sentences from Our System 1. A person is speaking to a small group of sitting people and a small group of standing people with board in the back. 2. A person is speaking to a small group of standing people with board in the back. 3. Man opens town hall meeting. 4. Woman speaks at town meeting. 5. Man gives speech on health care reform at a town hall meeting. Human Synopsis: A man talks to a mob of sitting persons who clap at the end of his short speech. Keywords: people/SUBJ-HUMAN, home/OBJ, group/OBJ, renovating/VERB, working/VERB, montage/OBJ, stop/VERB, motion/ OBJ, appears/VERB, building/VERB, floor/OBJ, tiles/OBJ, floorboards/OTHER, man/SUBJ-HUMAN, laying/VERB Sentences from Our System: 1. A person is using power drill to renovate a house. 2. A crouching person is using power drill to renovate a house. 3. A person is using trowel to renovate a house. 4. man lays out underlay for installing flooring. 5. A man lays a plywood floor in time lapsed video. Human Synopsis: Time lapse video of people making a concrete porch with sanders, brooms, vacuums and other tools. Cleaning an appliance Town hall meeting Renovating home Keywords: metal/OBJ man/SUBJ-HUMAN bending/VERB hammer/VERB piece/OBJ tools/OBJ rods/OBJ hammering/VERB craft/ VERB iron/OBJ workshop/OBJ holding/VERB works/VERB steel/OBJ bicycle/OBJ Sentences from Our System 1. A person is working with pliers. 2. Man hammering metal. 3. Man bending metal in workshop. 4. Man works various pieces of metal. 5. A man works on a metal craft at a workshop. Human Synopsis: A man is shaping a star with a hammer. Metal crafts project Keywords: bowl/OBJ pan/OBJ video/OBJ adds/VERB lady/OBJ pieces/OBJ ingredients/OBJ oil/OBJ glass/OBJ liquid/OBJ butter/SUBJ-HUMAN woman/SUBJ-HUMAN add/VERB stove/OBJ salt/OBJ Sentences from Our System: !!1. A person is cooking with bowl and stovetop. !!2. !In a pan add little butter. !!3. !!She !adds (Das+, CVPR2013)
  • 16. @A<.B nR$g>3"--#&'00<>">$5&;<<9/0?&4H$0$A&KR';<4P • 動画数:DB] • クリップ数:]BC[ • 平均時間:Q[C秒 • クラス数:B[ • 総単語数:DO[]]D • 動詞:BWBEB • キャプションには開始と終了のタイムスタンプが取得可能
  • 17. @A<.B:9$&-"&)C)& nR';<4S.3->/-$@$- • 1動画に対してQ種類のキャプション • DX文以内の詳細な説明 • QSX文の短い説明 • D文の説明 Anna Senina1 Marcus Rohrbach1 Wei Qiu1,2 Annemarie Friedrich2 Sikandar Amin3 Mykhaylo Andriluka1 Manfred Pinkal2 Bernt Schiele1 1 Max Planck Institute for Informatics, Saarbrücken, Germany 2 Department of Computational Linguistics, Saarland University, Saarbrücken, Germany 3 Intelligent Autonomous Systems, Technische Universität München, München, Germany Abstract Humans can easily describe what they see in a coher- ent way and at varying level of detail. However, existing approaches for automatic video description are mainly fo- cused on single sentence generation and produce descrip- tions at a fixed level of detail. In this paper, we address both of these limitations: for a variable level of detail we produce coherent multi-sentence descriptions of complex videos. We follow a two-step approach where we first learn to predict a semantic representation (SR) from video and then generate natural language descriptions from the SR. To produce consistent multi-sentence descriptions, we model across-sentence consistency at the level of the SR by en- forcing a consistent topic. We also contribute both to the visual recognition of objects proposing a hand-centric ap- proach as well as to the robust generation of sentences us- ing a word lattice. Human judges rate our multi-sentence descriptions as more readable, correct, and relevant than related work. To understand the difference between more Detailed: A woman turned on stove. Then, she took out a cucumber from the fridge. She washed the cucumber in the sink. She took out a cutting board and knife. She took out a plate from the drawer. She got out a plate. Next, she took out a peeler from the drawer. She peeled the skin off of the cucumber. She threw away the peels into the wastebin. The woman sliced the cucumber on the cutting board. In the end, she threw away the peels into the wastebin. Short: A woman took out a cucumber from the refrigerator. Then, she peeled the cucumber. Finally, she sliced the cucumber on the cutting board. One sentence: A woman entered the kitchen and sliced a cucumber. Figure 1: Output of our system for a video, producing co- herent multi-sentence descriptions at three levels of detail, using our automatic segmentation and extraction. ularity and generating a conceptually and linguistically co- 403.6173v1 [cs.CV] 24 Mar 2014 K4$0/0"L*&7;MNBCDOP
• 18. YouCookII
  ■ YouCook II
    • Videos: 2,000
    • Clips: 15.4K
    • Every clip is paired with a caption and temporal information
  [Figure 1: an example from the YouCook2 dataset on making a BLT sandwich; each procedure step has annotated time boundaries and is described by an English sentence] (Zhou+, AAAI2018)
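As a concrete illustration of the caption-plus-time-boundary annotation style described above, here is a minimal Python sketch. The field names and the exact sentence/time pairing are hypothetical and do not follow the dataset's official release format; the sentences are taken from the BLT-sandwich example in the figure.

```python
# Minimal sketch of timestamped procedure-step annotations in the style of
# YouCook II / dense video captioning. Field names and the exact pairing of
# sentences to times are hypothetical, not the dataset's official format.
from dataclasses import dataclass

@dataclass
class Segment:
    start: float   # start time of the step, in seconds
    end: float     # end time of the step, in seconds
    sentence: str  # caption describing this procedure step

video = {
    "video_id": "blt_sandwich_example",  # hypothetical ID
    "segments": [
        Segment(21.0, 51.0, "Cook bacon until crispy, then drain on paper towel."),
        Segment(54.0, 63.0, "Grill the tomatoes in a pan and then put them on a plate."),
    ],
}

# Every clip is paired with a caption and its temporal boundaries.
for seg in video["segments"]:
    print(f"[{seg.start:5.1f}s - {seg.end:5.1f}s] {seg.sentence}")
```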
• 19. MPII-MD
  ■ MPII Movie Description
    • Transcribed audio descriptions of Hollywood movies
      • Captions obtained from the audio
      • Captions obtained from the footage
    • Clips: 68,337
    • Annotations: 68,375
      • Each annotation is linked to its time interval via timestamps
    • Total words: 653,467
  (Rohrbach+, CVPR2015)
• 20. M-VAD
  ■ Montreal Video Annotation Dataset
    • Clips: 48,986
    • Training: 38,949
    • Validation: 4,888
    • Test: 5,149
    • Average clip length: 6.2 s
    • Annotations: 55,904
  (Torabi+, arXiv2015)
• 21. VideoStory
  ■ VideoStory
    • Story narration for long videos
    • Each video's narration is described as a paragraph
    • Videos: 20K
    • Training: 17,908
    • Validation: 999
    • Test: 1,011
  [Table 3: example annotations from the VideoStory set; each video has multiple paragraphs and localized time-interval annotations for every sentence in the paragraph] (Gella+, EMNLP2018)
• 22. ActivityNet Entities
  ■ ActivityNet Entities
    • Captions together with bounding boxes corresponding to their words
    • Clips: 52K
    • Annotations
      • Captions: 20K
      • Bounding boxes: 158K
  [Figure 2: an annotated example — "A man in a striped shirt is playing the piano on the street while people watch him."] (Zhou+, arXiv2018)
• 23. MSVD
  ■ Microsoft Video Description
    • Clips: 1,970
    • Training: 1,200
    • Validation: 100
    • Test: 670
    • Annotations
      • 41 captions per video on average
      • Multilingual descriptions
  [Figure: the Mechanical Turk annotation instructions, with examples of good and bad one-sentence descriptions] (Chen & Dolan, ACL2011)
• 24. MSR-VTT
  ■ MSR-Video to Text
    • Clips: 10,000
    • Annotations: 200K
      • 20 per clip
    • Audio is also included
  [Figure 1: examples of clips and labeled sentences in the MSR-VTT dataset; each sample shows four frames of a clip and five of its human-labeled sentences] (Xu+, CVPR2016)
• 25. Charades
  ■ Charades
    • Videos of everyday indoor activities
    • Videos: 9,848
    • Training: 7,985 / Test: 1,863
    • Objects appearing: 46
    • Action classes: 157
    • Annotations: 27,847
  [Figure: sample frames from the Charades dataset] (Sigurdsson+, ECCV2016)
• 26. VTW
  ■ Video Titles in the Wild
    • Clips: 18,100
    • Average duration: 1.5 min
    • Annotations: 44,613
      • Each clip is described with a single sentence
      • Information not shown in the clip is also written down (augmented sentences)
  (Zeng+, ECCV2016)
• 27. ActivityNet Captions
  ■ ActivityNet Captions
    • Videos: 20K
    • Annotations: 100K
    • Average words per caption: 13.68
    • The multiple captions of one video can overlap in time
  (Krishna+, ICCV2017)
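To make the "overlapping in time" property concrete, here is a minimal sketch of how two annotated intervals of the same video can share part of the timeline. The segments and captions below are hypothetical and are not entries from the dataset.

```python
# Minimal sketch of temporally overlapping caption annotations: two annotated
# (start, end) intervals of the same video can cover a shared stretch of time.

def overlap_seconds(a: tuple[float, float], b: tuple[float, float]) -> float:
    """Length of the temporal overlap between two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

# Hypothetical annotations for one video (not dataset entries).
annotations = [
    ((0.0, 25.0), "A woman walks up to the piano and sits down."),
    ((18.0, 60.0), "She plays the piano while a crowd watches."),
]

(seg_a, _), (seg_b, _) = annotations
print(overlap_seconds(seg_a, seg_b))  # 7.0 -> the two captions overlap in time
```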
• 28. METEOR
  ■ Metric for Evaluation of Translation with Explicit ORdering (METEOR)
    • As preprocessing, words are stemmed and synonyms are matched using WordNet (Fellbaum, 1998)
      • Stemming: normalizing word forms so that words with the same meaning match
    • $\mathrm{METEOR} = F_{mean}(1 - p)$ (METEOR = 0.45 in the example below)
    • $F_{mean} = \dfrac{10PR}{R + 9P}$
      • $P$, $R$: unigram precision and recall
    • $p = 0.5\left(\dfrac{N_c}{N_u}\right)^3$
      • $N_c$: number of chunks whose word order matches the reference (2 in the example below)
      • $N_u$: number of matched unigrams
    • Candidate: "an elderly man is showing how to play piano in front of crowd in a hall room"
    • Reference: "an elderly man is playing piano in front of a crowd in an anteroom"
      • Because of stemming, differently inflected word forms are counted as the same word
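A minimal sketch of the simplified METEOR score exactly as written on this slide, assuming the match statistics (P, R, N_c, N_u) are already given; this is not the official implementation, which also performs stemmed/synonym alignment search.

```python
# Simplified METEOR as defined above: METEOR = F_mean * (1 - p),
# with F_mean = 10PR / (R + 9P) and p = 0.5 * (N_c / N_u)^3.

def meteor_score(n_matched: int, cand_len: int, ref_len: int, n_chunks: int) -> float:
    """n_matched = N_u (matched unigrams), n_chunks = N_c (contiguous matched runs)."""
    if n_matched == 0:
        return 0.0
    precision = n_matched / cand_len             # P: matches / candidate length
    recall = n_matched / ref_len                 # R: matches / reference length
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    penalty = 0.5 * (n_chunks / n_matched) ** 3  # p: fragmentation penalty
    return f_mean * (1.0 - penalty)

# Hypothetical counts, only to show the arithmetic (not the slide's example pair):
print(meteor_score(n_matched=10, cand_len=16, ref_len=14, n_chunks=2))  # ~0.70
```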
• 29. CIDEr
  ■ tf-idf
    • tf = (occurrences of word w in document d) / (total occurrences of all words in document d)
    • idf = log((number of documents) / (number of documents containing word w))
    • tf-idf = tf × idf
    • tf-idf vector: the vector of tf-idf values over all words
  ■ Consensus-based Image Description Evaluation
    • As with METEOR, stemming is applied as preprocessing
    • Cosine similarity between the n-gram tf-idf vectors of the ground-truth and generated captions
    • $\mathrm{CIDEr}(c_i, S_i) = \sum_{n=1}^{N} w_n\, \mathrm{CIDEr}_n(c_i, S_i)$
    • $\mathrm{CIDEr}_n(c_i, S_i) = \dfrac{1}{m} \sum_{j} \dfrac{g^n(c_i) \cdot g^n(s_{ij})}{\lVert g^n(c_i) \rVert\, \lVert g^n(s_{ij}) \rVert}$
      • $g^n(c_i)$: tf-idf vector of the n-grams of the generated caption
      • $g^n(s_{ij})$: tf-idf vector of the n-grams of the j-th ground-truth caption
      • One video has multiple ground-truth captions ($m$ of them)
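A minimal sketch of the CIDEr_n term above for a single candidate and its references. It is only illustrative: document frequencies are computed over the given references rather than the whole corpus, no stemming is applied, and the length-penalty/clipping modifications of CIDEr-D are omitted.

```python
# Simplified CIDEr_n: average cosine similarity between n-gram tf-idf vectors
# of the candidate and each reference caption.
import math
from collections import Counter

def ngrams(sentence: str, n: int) -> Counter:
    toks = sentence.lower().split()
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

def cider_n(candidate: str, references: list[str], n: int = 1) -> float:
    ref_counts = [ngrams(r, n) for r in references]
    cand_counts = ngrams(candidate, n)

    # Document frequency of each n-gram over the reference "corpus" (simplification).
    df = Counter()
    for rc in ref_counts:
        df.update(set(rc))
    num_docs = len(references)

    def tfidf(counts: Counter) -> dict:
        total = sum(counts.values()) or 1
        return {g: (c / total) * math.log(num_docs / df[g])
                for g, c in counts.items() if df[g] > 0}

    def cosine(u: dict, v: dict) -> float:
        dot = sum(u[g] * v.get(g, 0.0) for g in u)
        nu = math.sqrt(sum(x * x for x in u.values()))
        nv = math.sqrt(sum(x * x for x in v.values()))
        return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

    g_cand = tfidf(cand_counts)
    return sum(cosine(g_cand, tfidf(rc)) for rc in ref_counts) / num_docs

# Hypothetical captions, only to exercise the function:
print(cider_n("a man is playing a piano",
              ["a man plays the piano", "an elderly man is playing piano"], n=1))
```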
• 31. SPICE
  ■ Semantic Propositional Image Caption Evaluation
    • Computes the F1 score between the scene graphs generated from the captions
    • Depends on the quality of scene-graph generation
  [Fig. 1: reference and candidate captions are mapped through dependency parse trees to semantic scene graphs, and their semantic propositional content is used to assess caption quality] (Anderson+, ECCV2016)
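A minimal sketch of the SPICE idea: captions are compared as sets of semantic tuples (objects, attributes, relations) taken from their scene graphs, scored with F1. Real SPICE builds the graphs with a dependency parser and uses WordNet synonym matching; here the tuple sets are given directly and matching is exact, so this is only illustrative.

```python
# F1 over scene-graph tuples, the scoring step of SPICE (graph construction omitted).

def spice_f1(cand_tuples: set, ref_tuples: set) -> float:
    if not cand_tuples or not ref_tuples:
        return 0.0
    matched = len(cand_tuples & ref_tuples)
    precision = matched / len(cand_tuples)
    recall = matched / len(ref_tuples)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical scene-graph tuples for "a young girl standing on a tennis court".
candidate = {("girl",), ("girl", "young"), ("girl", "stand-on", "court"), ("court",)}
reference = {("girl",), ("girl", "young"), ("girl", "stand-on", "court"),
             ("court",), ("court", "tennis")}
print(spice_f1(candidate, reference))  # ~0.889
```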