1. CENTER FOR COGNITIVE UBIQUITOUS COMPUTING
CUbiC
ARIZONA STATE UNIVERSITY
A Study of Boosting based Transfer Learning for
Activity and Gesture Recognition
Ashok Venkatesan
Committee Members
Sethuraman Panchanathan, Professor (Chair)
Jieping Ye, Associate Professor
Baoxin Li, Associate Professor
Master’s Thesis Defense
Outline
• Motivation
• Transfer Learning
• Problem and Related Work
• Cost-Sensitive Boosting
• Results and Discussions
• Conclusion
Outline
• Motivation : Real World Data, Dataset Shifts, Traditional Learning
• Transfer Learning
• Problem and Related Work
• Cost-Sensitive Boosting
• Results and Discussions
• Conclusion
Real-World Data
Difficult to learn as it is Non-Stationary and Continuously Evolving
Example : Spam Filtering
A spam filter is trained on random emails tracked from a group of users
under the assumption that new users would classify spam identically.
1. What if the training data is no longer relevant?
2. What if the user preferences are not identical?
Motivational Example : Accelerometer Based 3D Gesture Recognition
A gesture recognition model is trained on mock data obtained in a controlled environment
under the assumption that real-life data would be identical.
1. What if the user has peculiar traits?
2. What if environmental factors and the objects interacted with vary and impact the
properties of the gesture?

(Example gestures: Scoop, Stir)
Dataset Shift [1]
• Simple Covariate Shift : change in P(x) due to the change in a known covariate.
• Prior Probability Shift : change in P(y) when P(y|x) is modeled as P(x|y)P(y).
• Sample Selection Bias : P(x_i) ≠ P(x).
• Imbalanced Data : change in P(y) by design.
• Domain Shift : change in the measurement system of x_i.
• Source Component Shift : involves changes in the strength of contributing components.
• Concept Drift : change in P(y|x) in continuous and real-time data streams.

[1] Quiñonero-Candela, J., et al., Dataset Shift in Machine Learning. The MIT Press, 2009
Traditional Learning
• Training and test examples are assumed to be independently drawn and identically distributed.
• NOT SUITED FOR HANDLING DATASET SHIFTS

(Figure: Traditional Learning over Multiple Domains; each task is passed to the learning algorithm separately, yielding one model per task.)
Outline
• Motivation
• Transfer Learning : Definition, Learning Settings, Notation,
Problem, What to Transfer?
• Instance-Weighting using Boosting
• Cost-Sensitive Boosting
• Results and Discussions
• Conclusion
Definition[2][3]
“Transfer Learning is a methodology that uses prior acquired
knowledge to effectively develop a new hypothesis. It emphasizes
knowledge transfer across domains, tasks and distributions that are
similar but not the same.”
[2] NIPS Inductive Transfer Workshop, 2005
[3] Pan, S.J. and Yang, Q., "A Survey on Transfer Learning", TKDE, 2009
• It is motivated by human learning: people can often transfer knowledge learnt previously to novel situations.
• e.g. knowing how to ride a bicycle might help improve learning to ride a motorbike.
• Outdated data representing prior knowledge is referred to as the Source.
• Newer data representing the newer knowledge is referred to as the Target.
• A domain is D = {X, P(X)} and a task is T = {Y, f(·)}; a shift may involve P(X), P(Y) and P(Y|X).
Transfer Learning - Illustration

(Figure: abundant source training data yields knowledge that is fed, together with insufficient target training data, into the learning algorithm, which produces the target task model.)

Transfer Learning is beneficial for lessening the labeling costs associated with re-training a model from
scratch and for making classification rapidly adaptable in real time.
Transfer Settings [3]
• Inductive Transfer : a few labeled target-domain examples are available for obtaining a weak inductive bias; source data is used as auxiliary data.
• Transductive Transfer : lots of labeled source data and lots of unlabeled target-domain data; capitalize on the difference between the domains.
• Unsupervised Transfer : both source data and target data are unlabeled; apply techniques such as clustering and density estimation.

The scope of Transfer Learning in general is to learn a classifier that performs well over
target data samples alone. Classification performance over the source tasks is ignored.

[3] Pan, S.J. and Yang, Q., "A Survey on Transfer Learning", TKDE, 2009
Notation
• Two sets of tasks, source and target, represented by instances X_source, X_target ∈ X and labels Y_source, Y_target ∈ Y such that
P(X_source, Y_source) ≠ P(X_target, Y_target)
• Training examples are grouped and named based on their task distributions:
– Same task distribution as the target: T_s = {(x_i^s, y_i^s)}_{i=1}^m
– Different task distribution from that of the target: T_d = {(x_j^d, y_j^d)}_{j=1}^n
• Unlabeled test examples representing the target tasks: S
Problem Statement
• Abundant source data: T_d = {(x_i^d, y_i^d)}_{i=1}^n, from which a source model is trained.
• Little labeled target data: T_s = {(x_j^s, y_j^s)}_{j=1}^m, from which a target model is trained.
• Unseen target data: S.

Objective: Given |T_s| ≪ |T_d| and that T_s is insufficient to learn the
target tasks, learn a model using T_d ∪ T_s that classifies target task
examples S with minimum error.
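In code, this setup can be sketched with synthetic stand-ins; all names, shapes, and the shift injected into the target data below are illustrative assumptions, not the thesis data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the notation:
#   T_d : abundant source-distribution training data (n examples)
#   T_s : scarce target-distribution training data (m examples), |T_s| << |T_d|
n, m, d = 500, 20, 44                                  # 44 features, 500 source instances
X_d, y_d = rng.normal(0.0, 1.0, (n, d)), rng.integers(0, 2, n)
X_s, y_s = rng.normal(0.5, 1.0, (m, d)), rng.integers(0, 2, m)  # shifted target

# The training pool is the union T_d ∪ T_s; a flag records each example's origin
X_train = np.vstack([X_d, X_s])
y_train = np.concatenate([y_d, y_s])
is_target = np.concatenate([np.zeros(n, bool), np.ones(m, bool)])
```

The origin flag is what lets later stages weight source and target instances separately.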
What to Transfer?
• Instance-based : reuse instances observed in the source domain that are similar to the target domain. E.g. instance reweighting, importance sampling.
• Feature-based : find an alternate feature space for learning the target domain while projecting the source domain into the new space. E.g. feature subset selection, feature space transformation.
• Model/Parameter-based : use model components such as parameters and hyper-parameters to influence learning the target task. E.g. parameter-space partitioning, superimposing shape constraints.
Outline
• Motivation
• Transfer Learning
• Instance-Weighting using Boosting : Instance
Weighting, AdaBoost, TrAdaBoost, TransferBoost, Limitations
• Cost-Sensitive Boosting
• Results and Discussions
• Conclusion
• AdaBoost[4] boosts a weak learning algorithm into a strong
learner by linearly combining an ensemble of weak
hypotheses.
• Why Boosting based Instance-Weighting?
– Provides theoretical guarantees on generalization error bounds.
– Incremental instance boosting aids in systematic selection of
important examples
– Well-defined focus areas to be modified for knowledge transfer
• Weak hypothesis loss function
• Weight update scheme
• Linear combination of the weak hypotheses
Boosting
[4] Freund, Y., Schapire, R. and Abe, N., "A Short Introduction to Boosting", JSAI, 1999
Instance-Weighting and Boosting

(Figure: a two-stage pipeline. Stage 1 computes a similarity measure; Stage 2 runs AdaBoost, whose weak-hypothesis loss function, weight update scheme and linear combination of weak hypotheses are the points modified for transfer.)
• Two recent instance-weighting algorithms adapt AdaBoost
for knowledge transfer :
• TrAdaBoost[5]
• TransferBoost[6]
[5] Dai, W., et al., "Boosting for Transfer Learning", ICML, 2007
[6] Eaton, E. and desJardins, M., "Set-Based Boosting for Instance-level Transfer", IEEE ICDM Workshops, 2009
AdaBoost
• Loss function: ε_t = Σ_{i=1}^N p_i^t |h_t(x_i) − y_i|, with α_t = (1/2) log((1 − ε_t)/ε_t)
• Weight update: w_i^{t+1} = w_i^t exp(−α_t y_i h_t(x_i))
• Linear combination: H(x) = sign(Σ_{t=1}^T α_t h_t(x))

Main Idea: increase the weights of misclassified training samples (T_d ∪ T_s)
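The three components above translate directly into code. The following is a minimal sketch, assuming a single-feature decision-stump weak learner and labels in {−1, +1}; the names and the stump choice are illustrative, not the thesis implementation:

```python
import numpy as np

def _best_stump(X, y, w):
    """Weak learner: the best single-feature threshold stump under weights w."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = np.where(X[:, j] >= thr, sign, -sign)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, (j, thr, sign), pred)
    return best[1], best[2]

def train_adaboost(X, y, T=10):
    """Minimal AdaBoost following the slide's formulas (y in {-1,+1})."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(T):
        stump, pred = _best_stump(X, y, w)
        eps = np.clip(np.sum(w * (pred != y)), 1e-10, 1 - 1e-10)  # weighted error
        alpha = 0.5 * np.log((1 - eps) / eps)                     # alpha_t
        w *= np.exp(-alpha * y * pred)                            # weight update
        w /= w.sum()                                              # renormalize
        ensemble.append((alpha, stump))
    return ensemble

def predict(ensemble, X):
    """H(x) = sign(sum_t alpha_t h_t(x))."""
    score = sum(a * np.where(X[:, j] >= thr, s, -s)
                for a, (j, thr, s) in ensemble)
    return np.sign(score)
```

Misclassified examples keep larger weights after normalization, so later stumps concentrate on them.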
Limitations
• TrAdaBoost
– decreases the weights of supporting source-domain instances, making knowledge transfer inefficient.
– converging over the target error makes it prone to overfitting.
• TransferBoost
– positive transferability is hard to come by due to the small size of T_s.
– requires external information about the structure of the data to be of any use.
Outline
• Motivation
• Transfer Learning
• Instance-Weighting using Boosting
• Cost-Sensitive Boosting : General Idea, Weight Update
Schemes, Algorithm, Cost Estimation, Dynamic Cost
• Results and Discussions
• Conclusion
General Idea

(Figure: the two-stage pipeline again; Stage 1's similarity measure supplies cost factors to Stage 2's AdaBoost weak-hypothesis loss function, weight update scheme and linear combination.)

• Compute instance-weights for T_d and T_s separately.
• Augment T_d instances with computed cost factors C.
• Learn a strong classifier to minimize the training error over T_s and reduce the net misclassification cost over T_d.
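As a sketch of a cost-augmented weight update (following the AdaC2 scheme of Sun et al. [7], where the cost multiplies the weight outside the exponent), one boosting round might look like this; the function name and the epsilon guards are assumptions:

```python
import numpy as np

def adac2_round(w, cost, y, pred):
    """One AdaC2-style boosting round (after Sun et al. [7]).

    cost[i] = 1 for T_s (target) examples and lies in [0,1] for T_d
    (source) examples, so dissimilar source instances contribute less.
    y and pred are labels/predictions in {-1,+1}.
    """
    correct = (pred == y)
    # Cost-weighted alpha: more trust when costly examples are classified right
    num = np.sum(cost[correct] * w[correct])
    den = np.sum(cost[~correct] * w[~correct])
    alpha = 0.5 * np.log((num + 1e-10) / (den + 1e-10))
    # The cost multiplies the weight outside the exponent (the AdaC2 scheme)
    w_new = cost * w * np.exp(-alpha * y * pred)
    return alpha, w_new / w_new.sum()
```

A misclassified high-cost example gains the most weight, while a misclassified low-cost source example is boosted far less aggressively.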
Weight Update Schemes[7]
[7] Sun, Y., et al., "Cost-sensitive Boosting for Classification of Imbalanced Data", Pattern Recognition, 2007
Cost Properties
• Represents the similarity of instance distributions and
classification functions between 𝑇𝑠 and 𝑇𝑑.
• Lies in the interval [0,1].
• Relevant examples have cost values lying closer to 1.
• 𝑇𝑑 examples that have a cost, 𝑐𝑖 = 0 are not used for
training.
Cost Estimation
• Instance Pruning [8] : the probability of correct classification of an instance by a model trained on T_s.
• Relevance Measure : for each T_d instance, the ratio

  Σ_{j : y_i^d ≠ y_j^s} dist(x_i^d, x_j^s) / Σ_{j : y_i^d = y_j^s} dist(x_i^d, x_j^s)

[8] Jiang, J. and Zhai, C.X., "Instance Weighting for Domain Adaptation in NLP", ACL, 2007
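A literal reading of the relevance measure can be sketched as follows; the choice of Euclidean distance and the map of the raw ratio into [0,1] are added assumptions so that the output satisfies the stated cost properties:

```python
import numpy as np

def relevance_costs(X_d, y_d, X_s, y_s):
    """Distance-ratio relevance sketch: for each T_d example, summed distance
    to differently-labelled T_s examples over summed distance to
    same-labelled ones. High ratio = source instance sits near the right
    class region of the target data = relevant."""
    costs = np.empty(len(X_d))
    for i, (x, y) in enumerate(zip(X_d, y_d)):
        dist = np.linalg.norm(X_s - x, axis=1)
        diff = dist[y_s != y].sum()
        same = dist[y_s == y].sum() + 1e-10
        ratio = diff / same
        costs[i] = ratio / (1.0 + ratio)   # assumed squashing into [0,1]
    return costs
```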
Cost Estimation
• KL Importance Estimation Procedure [9] : transductively estimates P(x_i^s)/P(x_i^t) by minimizing the KL divergence between the distributions of T_d and T_s.
• Concept Feature Vector Distance [10] : measures the distance between the Concept Feature Vectors that represent the different class labels in T_d and T_s.

[9] Sugiyama, M., et al., "Direct Importance Estimation with Model Selection and its Application to Covariate Shift Adaptation", NIPS, 2008
[10] Katakis, I., et al., "An Ensemble of Classifiers for Coping with Recurring Contexts in Data Streams", ECAI, 2008
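The cited CFV construction is not reproduced here; purely as an illustration, one plausible centroid-based variant is sketched below. Treating each class's feature centroid as its concept feature vector, and the exp(−d) map into [0,1], are both assumptions and not details taken from [10]:

```python
import numpy as np

def cfv_class_costs(X_d, y_d, X_s, y_s):
    """Per-class cost from the distance between the class centroids
    ('concept feature vectors', assumed) of T_d and T_s. Classes whose
    centroids coincide across domains get costs near 1."""
    costs = {}
    for label in np.unique(y_d):
        cfv_d = X_d[y_d == label].mean(axis=0)
        cfv_s = X_s[y_s == label].mean(axis=0)
        costs[label] = float(np.exp(-np.linalg.norm(cfv_d - cfv_s)))
    return costs
```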
Dynamic Cost-Sensitive Boosting

(Algorithm listing; the dynamic variant adds step 11: update the cost vector C by calling the cost estimation procedure along with the weights of T_s.)
Outline
• Motivation
• Transfer Learning
• Instance-Weighting using Boosting
• Cost-Sensitive Boosting
• Results and Discussions : Datasets, Classification
Accuracies, Dominance of AdaC2, vs. % of Training Data, Effect of Cost,
Dynamic Cost, Multisource Transfer
• Conclusion
Datasets
Act_gest : Accelerometer Based 3D Gesture Recognition (4 datasets)
• Source and Target Datasets
– Multi-class mock laboratory data (20 action samples from 5 users)
– Multi-class real-life data (4 users made 4 glasses of Gatorade and drank them)
– 44 features and 500 source instances.
• Factors that induce dataset shift:
– Environmental factors, including the size, shape and weight of real-world objects
– User traits
• Average cross-validation accuracy was obtained over 5 trials.
The activity gesture dataset shows clear signs of a domain shift upon performing PCA on the feature
points and projecting its instances onto the first three principal components.
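A sketch of the projection behind such a plot, assuming plain SVD-based PCA on the pooled data (the function name is illustrative):

```python
import numpy as np

def project_pca3(X_source, X_target):
    """Project both domains onto the first three principal components
    of the pooled, mean-centered data, for visualizing domain shift."""
    X = np.vstack([X_source, X_target])
    mean = X.mean(axis=0)
    # Right singular vectors of the centered data give the principal directions
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    W = Vt[:3].T
    return (X_source - mean) @ W, (X_target - mean) @ W
```

Plotting the two returned point clouds in 3D makes a shift between domains visible as separated clusters.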
Datasets
WSU Smart Home Activity Recognition (7 datasets)
• Multi-Source Datasets
– Multi-class activity data captured from 7 different smart-home test beds.
– Modeled into single source and target datasets using one vs. all.
– 19 features and a maximum of 5468 instances per source.
• Factors that introduce dataset shift:
– Different apartment layouts
– Different residents
• Average cross-validation accuracy was obtained over 5 trials.
The WSU Activity Recognition datasets show signs of a shift in P(X). Of particular interest is how the dataset shift varies in agreement with the actual task in question.
Datasets
20Newsgroups 1 (6 datasets)
• Source and Target Datasets
– 65K features were reduced to 45K using document frequency thresholding.
– All features were encoded as binary.
– Modeled into a binary classification dataset with class labels as one subcategory vs. another.
• Factors that introduce concept drift in the newsgroup datasets:
– Different term frequencies
– Synthetically generated from different subcategories.
• Average cross-validation accuracy was obtained over 5 trials.
20Newsgroups 2 (7 datasets)
• A multi-source variation containing one subcategory vs. noisy subcategories.
Outline
• Motivation
• Transfer Learning
• Instance-Weighting using Boosting
• Cost-Sensitive Boosting
• Results and Discussions
• Conclusion : Conclusion, Thesis Summary, Future Directions,
Dissemination
Conclusion
Pros
• An extension of AdaBoost for Transfer Learning.
• Performs better than existing instance-transfer techniques on real-world datasets.
• Provides flexibility in using different relatedness measures and base classifiers.
• Has a good theoretical basis.
Cons
• May be prone to overfitting.
• Performance is dependent on the effectiveness of the estimated cost.
• Relies on being a bottom-up weighting approach; does not utilize a given structure of the data.
• Cost-sensitive boosting schemes were evaluated over real-world datasets and compared against well-known algorithms.
• 3 variants of cost-sensitive boosting algorithms were investigated; AdaC2 was found to be the best among them.
• 4 different relatedness measures were evaluated; instance pruning was found to give the best results.
• Effect of maintaining a dynamic cost scheme was studied.
• Equivalence of AdaC2 with respect to multisource transfer
learning was analyzed.
Summary
• Estimating Relatedness
– Does a better a priori relatedness measure exist?
• Target Domain Instance Selection
– How to optimally select instances from the target
domain?
• Discovering Structure in datasets
– How can an existing structure in the data be capitalized on?
• System Integration
– How to best integrate these methodologies into an
application framework?
Future Directions
Dissemination
• A. Venkatesan, N. C. Krishnan, and S. Panchanathan, "Cost-sensitive Boosting for Concept Drift", ECML Workshop on Handling Concept Drift in Adaptive Information Systems (HaCDAIS), Barcelona, Spain, 2010.
• N. C. Krishnan, A. Venkatesan, S. Panchanathan, and D. Cook, "Cost-sensitive Boosting for Transfer Learning", in preparation for submission to IEEE Transactions on Knowledge and Data Engineering.