More Adaptive AOI (Automated Optical Inspection)
National Tsing Hua University
Min Sun (孫民)
Vision Science Lab (VSLab)
For more about me, please visit aliensunmin.github.io
Research Topics in Computer Vision & Machine Learning:
Analyzing Street Views, Understanding Personal Videos, 3D & Robot Vision, Human Sensing, Wearable Camera Applications, Make3D
Challenges
p AOI is similar to fine-grained recognition.
p How to adapt to changes (e.g., due to different sensors/viewpoints)?
What kind of bird? Attention should help.
image source: http://yassersouri.github.io/pages/fast-bird-part.html
Domain shift: domain adaptation should help.
image source: http://vision.cs.uml.edu/adaptation.html
Attention Model
p Attention for Image Captioning
Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015.
Attention Model
Soft-attention vs. hard-attention
Attention Model
p Fine-grained Recognition
Fu et al. Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-Grained Image Recognition. CVPR 2017.
Motivation ⎯ Adapting to Changes
p State-of-the-art segmenters suffer from domain shift.
n The appearance of road scenes differs substantially across domains (cities).
Taipei	
Rio
Cairo
New York
Frankfurt
Tokyo
Motivation
p Effect of domain shift:
n Domain bias results in inferior performance on the target domain when one applies a segmenter trained on the source domain.
[Figure: feature space with a linear classifier. A segmenter trained on the source domain (Frankfurt) separates source features well but misclassifies target-domain (Taipei) features.]
No More Discrimination: Cross City Adaptation of Road Scene Segmenters
Yi-Hsin Chen, Wei-Yu Chen, Yu-Ting Chen, Bo-Cheng Tsai, Yu-Chiang Frank Wang, Min Sun
Motivation
p Goal: use domain adaptation to mitigate the effect of domain shift.
p Approaches:
n Supervised fine-tuning: CAN access labels on the target domain.
• Straightforward, but time-consuming and expensive.
n Unsupervised adaptation: CANNOT access labels on the target domain.
• More challenging, but low cost.
Pixel labeling of one Cityscapes image takes 90 minutes on average. [4]
[4] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," in CVPR, IEEE, 2016.
Practical in real life!
Data Collection
p Use the Google Street View API to download images of different cities.
n Randomly sample locations in each city to ensure sufficient variation in visual appearance.
p Use the Time Machine feature to collect image pairs at the same location but at different times.
[Figure: unlabeled image pairs from Tokyo, Rome, Rio, and Taipei, captured at the same location (Location A, Location B) but at different times (T1, T2).]
Our Dataset
p We propose a new dataset of complex road scenes, with:
n Diverse appearance: it includes 4 different cities across continents.
n Temporal information: each city includes 1600 image pairs, which provide helpful supervision without any human interaction.
n Dense pixel annotations: each city includes 100 high-quality annotated images.
Please visit: https://yihsinchen.github.io/segmentation_adaptation/
Overview
p Total loss: L_total = L_task + λ_g · L_g + λ_class · L_class
Global Domain Alignment
p How do we extend the idea of domain adversarial learning to adapt cross-domain image segmentation?
n We take each grid in the fc7 feature map of the FCN-based segmenter as an instance.
Global Domain Alignment
p Our objective is to minimize L_g by iteratively updating the domain classifier D and the feature extractor F:
L_g = -(1/N) Σ_n [ log p_n(I_S) + log(1 - p_n(I_T)) ]
• I_S and I_T: the images from the source and target domains, respectively.
• N: the number of grids in each map.
• F(I_S) and F(I_T): the feature maps of the source- and target-domain images.
• p_n(x) = σ(D(F(x))_n): the probability that grid n of image x belongs to the source domain, where σ is the sigmoid function.
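To make the grid-wise objective concrete, here is a minimal sketch, assuming PyTorch (the slides do not prescribe a framework); the 1×1-convolution classifier and the 4096-channel fc7 width are illustrative assumptions, and the feature-extractor update is omitted.

```python
# Minimal sketch of the grid-wise domain-adversarial loss (PyTorch assumed).
import torch
import torch.nn as nn

# Hypothetical per-grid domain classifier D: maps the fc7 feature map
# (B, 4096, h, w) to one source-domain logit per grid (B, 1, h, w).
domain_classifier = nn.Conv2d(4096, 1, kernel_size=1)
bce = nn.BCEWithLogitsLoss()

def global_alignment_loss(feat_src, feat_tgt):
    """Average cross-entropy over all grids n: push p_n(I_S) toward 1
    and p_n(I_T) toward 0 when updating the domain classifier."""
    logit_src = domain_classifier(feat_src)  # D(F(I_S))
    logit_tgt = domain_classifier(feat_tgt)  # D(F(I_T))
    return (bce(logit_src, torch.ones_like(logit_src)) +
            bce(logit_tgt, torch.zeros_like(logit_tgt)))
```

When the feature extractor F is updated, the domain labels are flipped (or a gradient-reversal layer is inserted) so that F learns domain-indistinguishable features.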
Class-wise Domain Alignment
p Let each class perform domain adversarial learning individually.
p But we must first address some problems:
n Under the unsupervised setting, we don't have any labels on the target domain to link with the source domain.
• We can't do domain adversarial learning against the source domain.
n In global domain adaptation, we define each grid n in the feature space as one instance.
• We can't directly use the labels, which live in the image (pixel) space.
Solutions: a pseudo label (in pixel space) and a grid-level soft label (in feature space).
[Figure: input image → network → prediction, with up-sampling bridging the feature space and the pixel space.]
Class-wise Domain Alignment ⎯ Grid-Level Soft Label
p (In the source domain)
n Calculate the grid-wise soft label Φ_n^c(I_S) as the probability of grid n belonging to class c:
Φ_n^c(I_S) = (1/|R(n)|) Σ_{i ∈ R(n)} 1[y_i(I_S) = c]
• i: the pixel index in image space.
• n: the grid index in feature space.
• R(n): the set of pixels that correspond to grid n.
• y_i(I_S): the ground-truth label of pixel i.
[Figure: grid-level soft label vs. pixel-level ground truth]
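A small sketch of this computation, under the assumption that each R(n) is a regular pooling region, so the per-class pixel fraction reduces to one-hot encoding followed by average pooling (PyTorch assumed; `grid_soft_label` is a hypothetical helper):

```python
import torch
import torch.nn.functional as F

def grid_soft_label(y, num_classes, grid_hw):
    """y: (B, H, W) integer ground-truth labels y_i(I_S).
    Returns (B, C, h, w): the fraction of pixels in each region R(n)
    carrying class c, i.e. the grid-wise soft label Phi_n^c(I_S)."""
    onehot = F.one_hot(y.long(), num_classes).permute(0, 3, 1, 2).float()
    return F.adaptive_avg_pool2d(onehot, grid_hw)  # average over each R(n)
```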
Class-wise Domain Alignment ⎯ Pseudo Label
p (In the target domain)
n Calculate the target-domain grid-wise soft pseudo label Φ_n^c(I_T) as the probability of grid n belonging to class c:
Φ_n^c(I_T) = (1/|R(n)|) Σ_{i ∈ R(n)} φ_i^c(I_T)
• i: the pixel index in image space.
• n: the grid index in feature space.
• R(n): the set of pixels that correspond to grid n.
• φ_i^c(I_T): the pixel-wise soft pseudo label of pixel i for class c.
[Figure: grid-level soft label vs. pixel-level pseudo label]
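Under the same regular-grid assumption, the target-domain version averages the pixel-wise soft pseudo labels, here taken to be the segmenter's softmax output (a sketch, not necessarily the exact recipe):

```python
import torch.nn.functional as F

def grid_soft_pseudo_label(logits, grid_hw):
    """logits: (B, C, H, W) pixel-wise predictions on a target image I_T.
    phi_i^c(I_T) = softmax probability at pixel i for class c;
    averaging over each R(n) yields Phi_n^c(I_T)."""
    phi = logits.softmax(dim=1)                 # pixel-wise soft pseudo labels
    return F.adaptive_avg_pool2d(phi, grid_hw)  # grid-level soft pseudo labels
```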
Class-wise Domain Alignment
p Thanks to the pseudo labels and soft labels, we can "link" each class between the source and target domains.
p The same adversarial learning framework can then be applied per class, as sketched below.
[Figure: class-wise links (e.g., Road, Car) between the source domain (ground truth) and the target domain (pseudo label); the probability bar ranges from low to high.]
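To show how the per-class linking plugs into the adversarial framework, here is a hedged sketch (PyTorch assumed): one hypothetical domain classifier per class, with each grid's loss term weighted by its grid-level soft (source) or soft pseudo (target) label for that class.

```python
import torch
import torch.nn.functional as F

def classwise_alignment_loss(feat_src, feat_tgt, soft_src, soft_tgt, classifiers):
    """classifiers: one per-grid domain classifier D_c per class c.
    soft_src / soft_tgt: (B, C, h, w) grid-level soft labels Phi_n^c(I_S)
    and soft pseudo labels Phi_n^c(I_T)."""
    loss = 0.0
    for c, D_c in enumerate(classifiers):
        logit_s, logit_t = D_c(feat_src), D_c(feat_tgt)  # (B, 1, h, w) each
        # Weight each grid by how strongly it belongs to class c.
        loss = loss + F.binary_cross_entropy_with_logits(
            logit_s, torch.ones_like(logit_s), weight=soft_src[:, c:c+1])
        loss = loss + F.binary_cross_entropy_with_logits(
            logit_t, torch.zeros_like(logit_t), weight=soft_tgt[:, c:c+1])
    return loss
```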
Class-wise Domain Alignment ⎯ Static-Object Prior
• Static objects: building, road, sidewalk, etc.
• Non-static objects: person, car, motorbike, etc.
Class-wise Domain Alignment ⎯ Static-Object Prior
p Download image pairs at the same location but at different times.
Class-wise Domain Alignment ⎯ Static-Object Prior
p Perform dense matching (find matched points).
Class-wise Domain Alignment ⎯ Static-Object Prior
p Identify superpixels containing k ≥ 3 matched points as the static-object prior, as in the sketch below.
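A rough sketch of this step, assuming a superpixel id map and dense-match keypoints are already computed (NumPy assumed; the names are hypothetical):

```python
import numpy as np

def static_prior_mask(superpixels, matched_points, k=3):
    """superpixels: (H, W) integer segment ids.
    matched_points: iterable of (row, col) matched keypoint coordinates.
    A superpixel joins the static-object prior if it holds >= k matches."""
    counts = np.zeros(superpixels.max() + 1, dtype=int)
    for r, c in matched_points:
        counts[superpixels[r, c]] += 1   # count matches per superpixel
    static_ids = np.flatnonzero(counts >= k)
    return np.isin(superpixels, static_ids)  # (H, W) boolean prior mask
```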
Class-wise Domain Alignment ⎯ Static-Object Prior
p Use the static-object prior to refine the pseudo labels.
p For pixels that belong to the static-object prior, we suppress their probability for non-static-object classes.
• P_static(I_T): the set of pixels belonging to the static-object prior.
• C_static: the set of static-object classes.
• Static objects: building, road, sidewalk, etc.
• Non-static objects: person, car, motorbike, etc.
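A minimal sketch of the refinement, assuming pixel-wise soft pseudo labels of shape (B, C, H, W) and a boolean prior mask; here "suppress" is read as zeroing the non-static class probabilities on prior pixels and renormalizing, which is one plausible implementation:

```python
import torch

def refine_pseudo_label(phi, prior_mask, static_classes):
    """phi: (B, C, H, W) pixel-wise soft pseudo labels phi_i^c(I_T).
    prior_mask: (B, 1, H, W) bool, pixels in P_static(I_T).
    static_classes: indices of the classes in C_static."""
    keep = torch.zeros(phi.shape[1], dtype=torch.bool, device=phi.device)
    keep[static_classes] = True                       # keep C_static only
    refined = phi * keep.view(1, -1, 1, 1)            # zero non-static mass
    refined = refined / refined.sum(1, keepdim=True).clamp_min(1e-8)
    return torch.where(prior_mask, refined, phi)      # only on prior pixels
```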
Experiments
p We adapt a model pre-trained on Cityscapes to the other cities in our dataset.
n Source domain:
• The training set of Cityscapes.
• 2975 road-scene images with annotations.
n Target domain:
• The 4 different cities of our dataset.
• Each city has 1600 images without any annotation.
Experiments ⎯ Quantitative Results
• The global alignment method contributes a 2.6% mIoU gain.
• The class-wise alignment method contributes a further 0.9% mIoU gain.
• The static-object prior contributes another 0.6% mIoU improvement.
Experiments ⎯ t-SNE Visualization
p From the pre-trained model, to GA only, to GA+CA (prior), we observe that the bias between domains keeps decreasing.
n GA stands for Global Domain Alignment.
n CA stands for Class-wise Domain Alignment.
Experiments ⎯ Typical Examples
Recap
p AOI is similar to fine-grained recognition.
p How to adapt to changes (e.g., due to different sensors/viewpoints)?
What kind of bird? Attention should help.
image source: http://yassersouri.github.io/pages/fast-bird-part.html
Domain shift: domain adaptation should help.
image source: http://vision.cs.uml.edu/adaptation.html
Thank you

More Adaptive AOI ⎯ Applications of Deep Reinforcement Learning