Transfer Learning: An overview


  1. 1. Transfer Learning: Theory and Applications (Transfer Learning: An Overview) Qiang Yang, HKUST Thanks: Sinno Jialin Pan, NTU, Singapore Ying Wei, HKUST, Hong Kong Ben Tan, HKUST, Hong Kong
  2. 2. A psychological point of view •  Transfer of Learning in Education and Psychology – The study of the dependency of human conduct, learning or performance on prior experience. – [Thorndike and Woodworth, 1901] explored how individuals would transfer learning in one context to another context that shares similar characteristics. •  E.g. C++ → Java; Math/Physics → Computer Science/Economics
  3. 3. Transfer Learning in the machine learning community •  The ability of a system to recognize and apply knowledge and skills learned in previous domains/tasks to novel tasks/domains, which share some commonality. •  Given a target domain/task, how to transfer knowledge from previous domains/tasks to the new (target) one? •  Key: – Representation Learning, Change of Representation
  4. 4. Why Transfer? •  Build every model from scratch? – Time consuming and expensive – Expense: data collection/labeling, privacy, time to train •  Reuse common knowledge extracted from existing systems? – More practical
  5. 5. Why Transfer Learning? [Figure: labeled source domain data (e.g. Electronics, Time Period A, Device A) and unlabeled or a few labeled target domain data (e.g. DVD, Time Period B, Device B) are passed to transfer learning algorithms for training and adaptation; the resulting predictive models are tested on target domain data.]
  6. 6. Transfer Learning in different fields •  Transfer learning for reinforcement learning. [Taylor and Stone, Transfer Learning for Reinforcement Learning Domains: A Survey, JMLR 2009] •  Transfer learning for classification and regression problems. [Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010] (Focus!)
  7. 7. Motivating Example I: Indoor WiFi localization [Figure: received signal strengths of -30dBm, -70dBm and -40dBm]
  8. 8. Indoor WiFi Localization (cont.) [Figure: a localization model is trained on labeled samples that pair signal strengths S with locations L, e.g. S=(-37dbm, .., -77dbm), L=(1, 3), and tested on unlabeled samples such as S=(-37dbm, .., -77dbm). When training and test data come from the same device (Device A), the average error distance is ~1.5 meters; when they come from different devices (Device A vs. Device B), it is ~10 meters. Drop!]
  9. 9. Difference between Domains 9 Time Period A Time Period B Device B Device A
  10. 10. Motivating Example II: Sentiment classification
  11. 11. Sentiment Classification (cont.) [Figure: a sentiment classifier trained and tested on Electronics reaches a classification accuracy of ~84.6%; when the training and test domains differ (DVD vs. Electronics), accuracy is ~72.65%. Drop!]
  12. 12. Difference in Representation 12 Electronics Video Games (1) Compact; easy to operate; very good picture quality; looks sharp! (2) A very good game! It is action packed and full of excitement. I am very much hooked on this game. (3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp. (4) Very realistic shooting action and good plots. We played this and were hooked. (5) It is also quite blurry in very dark settings. I will never buy HP again. (6) The game is so boring. I am extremely unhappy and will probably never buy UbiSoft again.
  13. 13. A Major Assumption in Traditional Machine Learning •  Training and future (test) data come from the same domain, which implies that they are – represented in the same feature space; – drawn from the same data distribution.
  14. 14. Machine Learning: Yesterday, Today and Tomorrow Yesterday: Deep Learning (features). Today: Reinforcement Learning (rewards). Tomorrow: Transfer Learning (adaptation).
  15. 15. Machine Learning: Yesterday, Today and Tomorrow Yesterday: Deep Learning (lots of data; only the rich). Today: Reinforcement Learning (lots of data; only the rich). Tomorrow: Transfer Learning (few data; everyone).
  16. 16. Different Scenarios •  Training and testing data may come from different domains: – different feature spaces/marginal distributions; – different conditional distributions or different label spaces.
  17. 17. Transfer Learning Approaches 17 Instance-based Approaches Feature-based Approaches Parameter/Model - based Approaches Relational Approaches
  18. 18. Instance-based Transfer Learning Approaches Source and target domains have a lot of overlapping features 18 General Assumption
  19. 19. Instance-based Transfer Learning Approaches Case I: Unlabeled Target Case II: Some Labels in Target 19 Problem Setting Assumption Assumption Problem Setting
  20. 20. Instance-based Approaches Case I Given a target task, 20
  21. 21. Instance-based Approaches Case I (cont.) Assumption: 21
  22. 22. Instance-based Approaches Case I (cont.) Correcting Sample Selection Bias / Covariate Shift [Quiñonero-Candela et al., Dataset Shift in Machine Learning, MIT Press 2009]
  23. 23. Instance-based Approaches Correcting sample selection bias (cont.) •  The distribution of the selector variable maps the target onto the source distribution [Zadrozny, ICML-04] – Label instances from the source domain with label 1 – Label instances from the target domain with label 0 – Train a binary classifier
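The three steps on the slide above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of [Zadrozny, ICML-04] itself: the data are synthetic 1-D Gaussians, the binary classifier is a hand-written logistic regression, and the conversion from classifier output to an importance weight, w(x) ≈ (n_S / n_T) · P(target | x) / P(source | x), is one common formulation. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_domain_classifier(source, target, lr=0.5, epochs=500):
    """Fit a 1-D logistic regression with label 1 = source, label 0 = target."""
    data = [(x, 1.0) for x in source] + [(x, 0.0) for x in target]
    a, b = 0.0, 0.0
    n = len(data)
    for _ in range(epochs):
        grad_a = 0.0
        grad_b = 0.0
        for x, label in data:
            err = sigmoid(a * x + b) - label  # gradient of the log loss
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def importance_weights(points, a, b, n_source, n_target):
    """Assumed conversion: w(x) = (n_S / n_T) * P(target | x) / P(source | x)."""
    weights = []
    for x in points:
        p_source = sigmoid(a * x + b)
        weights.append((n_source / n_target) * (1.0 - p_source) / p_source)
    return weights


random.seed(0)
source = [random.gauss(0.0, 1.0) for _ in range(200)]  # synthetic source domain
target = [random.gauss(1.0, 1.0) for _ in range(200)]  # synthetic, shifted target
a, b = fit_domain_classifier(source, target)
weights = importance_weights(source, a, b, len(source), len(target))
```

Source instances that look more like target data receive larger weights, so a model trained on the re-weighted source sample pays more attention to the region where the target distribution has mass.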
  24. 24. Instance-based Approaches Kernel mean matching (KMM) Maximum Mean Discrepancy (MMD) [Alex Smola, Arthur Gretton and Kenji Fukumizu, ICML-08 tutorial]
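The formulas for this slide are not preserved in the transcript. As a hedged illustration, the sketch below computes the standard biased empirical estimate of squared MMD with an RBF kernel; KMM then chooses source instance weights that make this kind of discrepancy small, a step that is not implemented here. The kernel choice, the `gamma` value and the sample points are assumptions.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_kernel(xs, ys, gamma):
    """Average kernel value over all pairs (x, y)."""
    total = sum(rbf_kernel(x, y, gamma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (mean_kernel(xs, xs, gamma)
            + mean_kernel(ys, ys, gamma)
            - 2.0 * mean_kernel(xs, ys, gamma))


# Hypothetical toy samples: one target set close to the source, one far away.
source = [(0.0, 0.0), (0.1, -0.1), (-0.2, 0.1)]
near_target = [(0.0, 0.1), (0.1, 0.0), (-0.1, -0.1)]
far_target = [(3.0, 3.0), (3.1, 2.9), (2.8, 3.1)]

mmd_near = mmd_squared(source, near_target)
mmd_far = mmd_squared(source, far_target)
```

The estimate is zero when both samples are identical and grows as the two samples move apart, which is what makes it usable as a distance between domains.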
  25. 25. Instance-based Approaches Direct density ratio estimation [Sugiyama et al., NIPS-07; Kanamori et al., JMLR-09] KL divergence loss [Sugiyama et al., NIPS-07] Least squares loss [Kanamori et al., JMLR-09]
  26. 26. Instance-based Approaches Case II •  Intuition: Part of the labeled data in the source domain can be reused in the target domain after re-weighting
  27. 27. Instance-based Approaches Case II (cont.) •  TrAdaBoost [Dai et al., ICML-07] – For each boosting iteration, use the same strategy as AdaBoost to update the weights of target domain data, and use a new mechanism to decrease the weights of misclassified source domain data.
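A single re-weighting step in the spirit of the slide above can be sketched like this. The update rules follow the commonly cited form of TrAdaBoost (source weights multiplied by a fixed factor below 1 when misclassified, target weights multiplied by an AdaBoost-style factor above 1 when misclassified); the function name, argument layout and toy numbers are assumptions, and training the weak learner itself is left out.

```python
import math


def tradaboost_update(src_w, tgt_w, src_wrong, tgt_wrong, n_rounds):
    """One TrAdaBoost-style weight update.

    src_wrong / tgt_wrong hold 1 where the current weak learner misclassified
    the instance and 0 where it was correct.
    """
    # Weighted error is measured on the target domain data only.
    eps = sum(w * e for w, e in zip(tgt_w, tgt_wrong)) / sum(tgt_w)
    if not 0.0 < eps < 0.5:
        raise ValueError("weighted target error must lie in (0, 0.5)")
    beta_t = eps / (1.0 - eps)  # AdaBoost factor for target data
    beta = 1.0 / (1.0 + math.sqrt(2.0 * math.log(len(src_w)) / n_rounds))
    # Misclassified source instances are down-weighted.
    new_src = [w * beta ** e for w, e in zip(src_w, src_wrong)]
    # Misclassified target instances are up-weighted.
    new_tgt = [w * beta_t ** (-e) for w, e in zip(tgt_w, tgt_wrong)]
    return new_src, new_tgt


# Hypothetical toy round: 4 source and 4 target instances, all weights 1.
new_src, new_tgt = tradaboost_update(
    src_w=[1.0, 1.0, 1.0, 1.0],
    tgt_w=[1.0, 1.0, 1.0, 1.0],
    src_wrong=[1, 0, 0, 1],
    tgt_wrong=[0, 0, 1, 0],
    n_rounds=10,
)
```

Source instances that keep disagreeing with the target-trained learner fade out over the rounds, while hard target instances gain influence as in ordinary AdaBoost.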
  28. 28. Feature-based Transfer Learning Approaches When source and target domains only have some overlapping features. (lots of features only have support in either the source or the target domain) 28
  29. 29. Feature-based Transfer Learning Approaches (cont.) How to learn ? ! Solution 1: Encode application-specific knowledge to learn the transformation. ! Solution 2: General approaches to learning the transformation. 29
  30. 30. Feature-based Approaches Encode application-specific knowledge 30 Electronics Video Games (1) Compact; easy to operate; very good picture quality; looks sharp! (2) A very good game! It is action packed and full of excitement. I am very much hooked on this game. (3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp. (4) Very realistic shooting action and good plots. We played this and were hooked. (5) It is also quite blurry in very dark settings. I will never_buy HP again. (6) The game is so boring. I am extremely unhappy and will probably never_buy UbiSoft again.
  31. 31. Feature-based Approaches Encode application-specific knowledge (cont.) Features: compact, sharp, blurry, hooked, realistic, boring. Training (Electronics): rows (1 1 0 0 0 0), (0 1 0 0 0 0), (0 0 1 0 0 0); learned classifier y = f(x) = sgn(w · x), w = [1, 1, −1, 0, 0, 0]^T. Prediction (Video Game): rows (0 0 0 1 0 0), (0 0 0 1 1 0), (0 0 0 0 0 1).
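The numbers on this slide can be checked directly. The sketch below uses the feature rows and the weight vector from the slide; the helper names are hypothetical. It shows why the Electronics classifier is uninformative on Video Game reviews: every video game row has zeros wherever the weight vector is non-zero.

```python
def sign(value):
    """Return 1, -1 or 0 depending on the sign of value."""
    return (value > 0) - (value < 0)


def predict(weights, row):
    """Linear classifier y = sgn(w . x)."""
    return sign(sum(w * x for w, x in zip(weights, row)))


FEATURES = ["compact", "sharp", "blurry", "hooked", "realistic", "boring"]
w = [1, 1, -1, 0, 0, 0]  # weights learned on Electronics (from the slide)

electronics = [
    [1, 1, 0, 0, 0, 0],  # review (1): compact, sharp
    [0, 1, 0, 0, 0, 0],  # review (3): sharp
    [0, 0, 1, 0, 0, 0],  # review (5): blurry
]
video_games = [
    [0, 0, 0, 1, 0, 0],  # review (2): hooked
    [0, 0, 0, 1, 1, 0],  # review (4): realistic, hooked
    [0, 0, 0, 0, 0, 1],  # review (6): boring
]

electronics_pred = [predict(w, row) for row in electronics]
video_game_pred = [predict(w, row) for row in video_games]
```

The Electronics predictions are positive, positive, negative, matching the reviews, while every Video Game prediction is 0, i.e. the classifier has no opinion in the target domain.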
  32. 32. Feature-based Approaches Encode application-specific knowledge (cont.) 32 Electronics Video Games (1) Compact; easy to operate; very good picture quality; looks sharp! (2) A very good game! It is action packed and full of excitement. I am very much hooked on this game. (3) I purchased this unit from Circuit City and I was very excited about the quality of the picture. It is really nice and sharp. (4) Very realistic shooting action and good plots. We played this and were hooked. (5) It is also quite blurry in very dark settings. I will never_buy HP again. (6) The game is so boring. I am extremely unhappy and will probably never_buy UbiSoft again.
  33. 33. Feature-based Approaches Encode application-specific knowledge (cont.) ! Three different types of features !  Source domain (Electronics) specific features, e.g., compact, sharp, blurry !  Target domain (Video Game) specific features, e.g., hooked, realistic, boring !  Domain independent features (pivot features), e.g., good, excited, nice, never_buy 33
  34. 34. Feature-based Approaches Encode application-specific knowledge (cont.) •  How to identify pivot features? – Term frequency on both domains – Mutual information between features and labels (source domain) – Mutual information between features and domains •  How to utilize pivots to align features across domains? – Structural Correspondence Learning (SCL) [Blitzer et al., EMNLP-06] – Spectral Feature Alignment (SFA) [Pan et al., WWW-10]
  35. 35. Feature-based Approaches Spectral Feature Alignment (SFA) ! Intuition # Use a bipartite graph to model the correlations between pivot features and other features # Discover new shared features by applying spectral clustering techniques on the graph 35
  36. 36. ! If two domain-specific words have connections to more common pivot words in the graph, they tend to be aligned or clustered together with a higher probability. ! If two pivot words have connections to more common domain-specific words in the graph, they tend to be aligned together with a higher probability. Spectral Feature Alignment (SFA) High level idea 36 exciting good never_buy sharp boring blurry hooked compact realistic Pivot features Domain-specific features 7 6 8 3 6 2 4 5 Electronics Video Game
  37. 37. exciting good never_buy sharp boring blurry hooked compact realistic Pivot features Domain-specific features 7 6 8 3 6 2 4 5 Electronics Video Game boring realistic hooked blurry sharp compact Electronics Video Game Electronics Electronics Video Game Video Game Derive new features Spectral Clustering 37
  38. 38. Spectral Feature Alignment (SFA) Derive new features (cont.) New features: sharp/hooked, compact/realistic, blurry/boring. Training (Electronics): rows (1 1 0), (1 0 0), (0 0 1); learned classifier y = f(x) = sgn(w · x), w = [1, 1, −1]^T. Prediction (Video Game): rows (1 0 0), (1 1 0), (0 0 1).
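With the aligned features from this slide, the same linear classifier transfers. The sketch below hard-codes the three clusters shown on the slide (in SFA they would come out of spectral clustering on the bipartite graph, which is not implemented here); the review word lists and helper names are illustrative assumptions.

```python
# Cluster index for each domain-specific word, as paired on the slide.
CLUSTERS = {
    "sharp": 0, "hooked": 0,
    "compact": 1, "realistic": 1,
    "blurry": 2, "boring": 2,
}


def to_cluster_features(words):
    """Map a bag of domain-specific words to the three aligned features."""
    row = [0, 0, 0]
    for word in words:
        if word in CLUSTERS:
            row[CLUSTERS[word]] = 1
    return row


def predict(weights, row):
    """Linear classifier y = sgn(w . x)."""
    score = sum(w * x for w, x in zip(weights, row))
    return (score > 0) - (score < 0)


w = [1, 1, -1]  # weights learned on Electronics in the aligned space

electronics_reviews = [["compact", "sharp"], ["sharp"], ["blurry"]]
video_game_reviews = [["hooked"], ["realistic", "hooked"], ["boring"]]

electronics_rows = [to_cluster_features(r) for r in electronics_reviews]
video_game_rows = [to_cluster_features(r) for r in video_game_reviews]
video_game_pred = [predict(w, row) for row in video_game_rows]
```

Because "hooked" now shares a feature with "sharp" and "boring" shares one with "blurry", the Video Game reviews are classified as positive, positive, negative without any video game labels.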
  39. 39. Spectral Feature Alignment (SFA) 1.  Identify P pivot features 2.  Construct a bipartite graph between the pivot and remaining features. 3.  Apply spectral clustering on the graph to derive new features 4.  Train classifiers on the source using augmented features (original features + new features) 39
  40. 40. Feature-based Approaches Develop general approaches 40 Time Period A Time Period B Device B Device A
  41. 41. Feature-based Approaches Transfer Component Analysis [Pan etal., IJCAI-09, TNN-11] 41 TargetSource Latent factors Temperature Signal properties Building structure Power of APs Motivation
  42. 42. Transfer Component Analysis (cont.) 42 TargetSource Latent factors Temperature Signal properties Building structure Power of APs Causes the data distributions between two domains to be different
  43. 43. Transfer Component Analysis (cont.) 43 TargetSource Signal properties Noisy component Building structure Principal components
  44. 44. Transfer Component Analysis (cont.) Learning by only minimizing the distance between distributions
  45. 45. Transfer Component Analysis (cont.) Main idea: the learned transformation should map the source and target domain data to the latent space spanned by the factors which can reduce domain difference and preserve the original data structure. High level optimization problem
  46. 46. Transfer Component Analysis (cont.) 46 Recall: Maximum Mean Discrepancy (MMD)
  47. 47. Transfer Component Analysis (cont.) 47 An illustrative example Latent features learned by PCA and TCA PCAOriginal feature space TCA
  48. 48. Feature-based Approaches Self-taught Feature Learning (Andrew Ng. et al.) ! Intuition: Useful higher-level features can be learned from unlabeled data. ! Steps: 1)  Learn higher-level features from a lot of unlabeled data. 2)  Use the learned higher-level features to represent the data of the target task. 3)  Train models from the new representations of the target task (supervised) ! How to learn higher-level features # Sparse Coding [Raina etal., 2007] # Deep learning [Glorot etal., 2011] 48
  49. 49. Feature-based Approaches Multi-task Feature Learning •  Assumption: If tasks are related, they should share some good common features. •  Goal: Learn a low-dimensional representation shared across related tasks. General Multi-task Learning Setting
  50. 50. Multi-task Learning Assumption: If tasks are related, they may share similar parameter vectors. For example, each task's parameter vector can be split into a common part and a specific part for the individual task [Evgeniou and Pontil, KDD-04]
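The formula for this slide is missing from the transcript. A commonly quoted form of the regularized multi-task objective with a common part and a task-specific part is given below as a sketch; the symbols (T tasks, n_t examples in task t, loss ℓ, regularization constants λ₁ and λ₂) are assumptions rather than the slide's own notation.

```latex
w_t = w_0 + v_t, \qquad t = 1, \dots, T

\min_{w_0,\, v_1, \dots, v_T} \;
\sum_{t=1}^{T} \sum_{i=1}^{n_t} \ell\!\left(y_{ti},\, (w_0 + v_t)^{\top} x_{ti}\right)
\;+\; \frac{\lambda_1}{T} \sum_{t=1}^{T} \lVert v_t \rVert^{2}
\;+\; \lambda_2 \lVert w_0 \rVert^{2}
```

Here w_0 is the common part and v_t the specific part for task t. A large λ₁ relative to λ₂ pushes the v_t towards zero, so all tasks end up with nearly the same parameter vector; a small λ₁ lets the tasks decouple.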
  51. 51. Multi-task Feature Learning [Argyriou et al., NIPS-07] [Ando and Zhang, JMLR-05] [Ji et al., KDD-08]
  52. 52. Deep Learning in Transfer Learning 52
  53. 53. Transfer Learning with Deep Learning Transfer Learning Perspective: Why do we need Deep Learning? •  Deep neural networks learn nonlinear representations – that are hierarchical; – that disentangle different explanatory factors of variation behind data samples; – that manifest invariant factors underlying different populations. Deep Learning Perspective: Why do we need Transfer Learning? •  Transfer Learning alleviates – the inability to learn on a dataset that may not be large enough to train an entire deep neural network from scratch
  54. 54. Benchmark Dataset: Office •  Description: leverage source images to improve classification of target images •  3 domains, 31 categories (e.g. backpack, bike) – amazon: 2,817 images; object images in Amazon – webcam: 957 images; low-resolution images taken from a web camera – dslr: 795 images; high-resolution images taken from a digital SLR camera
  55. 55. Results Unsupervised domain adaptation Amazon→Webcam over time [Figure: multi-class accuracy (axis from 10 to 70) by year, 2011 to 2015, for TCA, GFK, CNN, DDC, DLID, DAN and BA, grouped as TL without DL, TL with DL, and DL without TL] With Deep Learning, Transfer Learning improves. Applying Transfer Learning techniques outperforms directly applying Deep Learning models trained on the source.
  56. 56. Overview •  Two questions organize the methods: Are there labelled target data? (supervised vs. unsupervised) Are dimensions of source and target domains equal? (single modality vs. multiple modalities) •  Methods: DASH-N [1], Finetuning [2, 3], SCNN [4], Glorot et al. [5], DLID [6], DCC [7], DAN [8], BA [9], SHL-MDNN [10], ST [11], DBN [12], DBM [13], MDRNN [14]. Full entries for [1] to [14] are listed on the References slide.
  57. 57. Single Modality •  Directly applying the model parameters (deep neural network weights) from the source to the target [Figure: source domain network (input → output) and target domain network (input → output) with shared weights] Are the features transferrable?
  58. 58. Single Modality •  Transferability of layer-wise features. ImageNet (1000 classes) is randomly split into A (500 classes, source domain) and B (500 classes, target domain). – baseA: train all layers with A – baseB: train all layers with B – BnB: initialize the first n layers with baseB and fix them, randomly initialize the other layers and train with B – BnB+: initialize the first n layers with baseB, randomly initialize the other layers, and train all layers with B – AnB: initialize the first n layers with baseA and fix them, randomly initialize the other layers and train with B – AnB+: initialize the first n layers with baseA, randomly initialize the other layers, and train all layers with B
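The AnB / AnB+ (and BnB / BnB+) constructions can be sketched without any deep learning framework. In the illustration below a "layer" is just a list of numbers and a `trainable` flag stands in for freezing; the layer sizes, the initialization range and all names are hypothetical, and no actual training is performed.

```python
import random


def random_layer(size, rng):
    """Stand-in for a randomly initialized layer."""
    return [rng.uniform(-0.1, 0.1) for _ in range(size)]


def build_transfer_network(base_layers, n, fine_tune, rng):
    """Copy the first n layers from a base network and re-initialize the rest.

    fine_tune=False corresponds to AnB / BnB (copied layers stay fixed);
    fine_tune=True corresponds to AnB+ / BnB+ (all layers are trained).
    """
    network = []
    for index, layer in enumerate(base_layers):
        if index < n:
            network.append({"weights": list(layer), "trainable": fine_tune})
        else:
            network.append({"weights": random_layer(len(layer), rng),
                            "trainable": True})
    return network


rng = random.Random(0)
base_a = [random_layer(4, rng) for _ in range(8)]  # hypothetical 8-layer baseA

a3b = build_transfer_network(base_a, n=3, fine_tune=False, rng=rng)
a3b_plus = build_transfer_network(base_a, n=3, fine_tune=True, rng=rng)
```

Training `a3b` and `a3b_plus` on B and comparing them with baseB for each n is what produces the layer-wise transferability curves discussed on the following slide.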
  59. 59. Single Modality •  Transferability of layer-wise features [3] Conclusion 1: lower layer features are more general and transferrable, and higher layer features are more specific and non-transferrable. Conclusion 2: transferring features + fine-tuning always improves generalization. What if we do not have any labelled data to finetune in the target domain? What happens if the source and target domains are very dissimilar? In that case ImageNet is not randomly split, but split into A = {man-made classes} and B = {natural classes}
  60. 60. Single Modality •  General framework of unsupervised transfer [Figure: source domain network (input → output) and target domain network (input → output), with shared weights and a domain distance loss between the two networks] For lower level features (more general & transferrable), the source transfers to the target directly. For higher level features (more domain specific & not transferrable), the source transfers to the target by minimizing domain distances. If some labelled target data are available, it would be better.
  61. 61. Single Modality •  Overall training objective: source domain classification loss + domain distance loss •  Domain distance losses – Maximum Mean Discrepancy [7], computed on a particular representation, e.g. the representation after the 5th layer
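The equations on this slide are not preserved in the transcript. A sketch of the objective in the form used by the deep domain confusion approach [7] is shown below; the symbols (labeled data X_L with labels y, source data X_S, target data X_T, representation φ, trade-off λ) are assumed notation.

```latex
\mathcal{L} \;=\; \mathcal{L}_C(X_L, y) \;+\; \lambda \, \mathrm{MMD}^2(X_S, X_T)

\mathrm{MMD}(X_S, X_T) \;=\;
\left\lVert
\frac{1}{|X_S|} \sum_{x_s \in X_S} \phi(x_s)
\;-\;
\frac{1}{|X_T|} \sum_{x_t \in X_T} \phi(x_t)
\right\rVert
```

The first term is the source domain classification loss and the second the domain distance loss, where φ(·) is the particular representation the distance is measured on. Setting λ = 0 recovers plain training on the source; a larger λ trades classification accuracy on the source for domain-invariant features.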
  62. 62. Single Modality •  Domain distance losses –  MK-MMD (Multi-kernel variant of MMD) [8] –  Domain classifier [4, 9] an embedding A distribution-free metric - maximizes the domain classification error Learn a more flexible distance metric than MMD by adjusting
  63. 63. Single Modality •  Other factors to improve transfer – On which layers should the domain distance loss be applied? •  By learning, pinpoint the layer that minimizes the domain distance among all specific layers, say the fourth. [7] •  All the specific layers, say the last two layers. [8] [Figure: source domain network (input → output) and target domain network (input) joined by a domain distance loss]
  64. 64. Single Modality •  Other factors to improve transfer – What if we have some training data in the target domain? •  Soft label supervision [4]: categories without any labeled target data are still updated to output non-zero probabilities [Figure: source domain and target domain networks (input → output)]
  65. 65. Multiple Modalities •  The source domain and target domain could have different feature spaces, i.e., dimensionality. – Multimedia on the web •  Images •  Text documents •  Audio •  Video – Recommender systems •  Douban •  Taobao •  Xiami Music – Robotics •  Vision •  Audio •  Sensors How to deal with multi-modal transfer with Deep Learning?
  66. 66. Multiple Modalities •  Key: shared concept. Text: "The cat is sitting on a sofa with ears cocking." Concepts: cat, ears, kitten, eyes
  67. 67. Multiple Modalities •  General framework of unsupervised transfer [Figure: source domain input and target domain input are each mapped to a common layer and to a reconstruction layer, with a paired loss in the common layer] Reconstruction errors: Paired loss: the similarity of a pair of source and target instances is preserved in the common space.
  68. 68. Multiple Modalities •  General framework of supervised transfer [Figure: source domain input and target domain input are each mapped to a common layer and then to an output, with a paired loss in the common layer] Classification loss:
  69. 69. MIR-Flickr Dataset •  1 million images with user-generated tags •  25,000 images are labelled with 24 categories •  10,000 for training, 5,000 for validation, 10,000 for testing categories baby, female, portrait, people plant life, river, water clouds, sea, sky, transport, water animals, dog, food domain 1: images domain 2: text
  70. 70. Results Mean Average Precision (MAP) by applying LR to different layers [13] Transferring either one of the two domains to the other (joint hidden), outperforms the domain itself (image_input OR text_input). DBN [12] DBM [13]
  71. 71. References [1] Nguyen, Hien V., et al. "Joint hierarchical domain adaptation and feature learning." PAMI. 2013. [2] Oquab, Maxime, et al. "Learning and transferring mid-level image representations using convolutional neural networks." CVPR. 2014. [3] Yosinski, Jason, et al. "How transferable are features in deep neural networks?" NIPS. 2014. [4] Tzeng, Eric, et al. "Simultaneous deep transfer across domains and tasks." ICCV. 2015. [5] Glorot, Xavier, Antoine Bordes, and Yoshua Bengio. "Domain adaptation for large-scale sentiment classification: A deep learning approach." ICML. 2011. [6] Chopra, Sumit, Suhrid Balakrishnan, and Raghuraman Gopalan. "Dlid: Deep learning for domain adaptation by interpolating between domains." ICML. 2013. [7] Tzeng, Eric, et al. "Deep domain confusion: Maximizing for domain invariance." arXiv preprint arXiv:1412.3474. 2014. [8] Long, Mingsheng, and Jianmin Wang. "Learning transferable features with deep adaptation networks." arXiv preprint arXiv:1502.02791. 2015. [9] Ganin, Yaroslav, and Victor Lempitsky. "Unsupervised Domain Adaptation by Backpropagation." ICML. 2015. [10] Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." ICASSP. 2013. [11] Gupta, Saurabh, Judy Hoffman, and Jitendra Malik. "Cross Modal Distillation for Supervision Transfer." arXiv preprint arXiv:1507.00448. 2015. [12] Ngiam, Jiquan, et al. "Multimodal deep learning." ICML. 2011. [13] Srivastava, Nitish, and Ruslan Salakhutdinov. "Multimodal learning with deep Boltzmann machines." JMLR. 2014. [14] Sohn, Kihyuk, Wenling Shang, and Honglak Lee. "Improved multimodal deep learning with variation of information." NIPS. 2014.
  72. 72. Simultaneous Deep Transfer Across Domains and Tasks Eric Tzeng, Judy Hoffman, Trevor Darrell, Kate Saenko, ICCV 2015
  73. 73. Tzeng et al.: Architecture
  74. 74. Tzeng et al.: Architecture
  75. 75. Oquab, Bottou, Laptev, Sivic: Learning and Transferring Mid-Level Image Representations using Convolutional Neural Networks. CVPR 2014.
  76. 76. Transfer Learning in Convolutional Neural Networks •  Source Domain: ImageNet – 1000 classes, 1.2 million images •  Target Domain: Pascal VOC 2007 object classification – 20 classes, about 5000 images •  PRE-1000C: the proposed method
  77. 77. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition •  Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, Trevor Darrell. ICML 2014 •  Questions: – How to transfer features to tasks with different labels? – Do features extracted from the CNN generalize to other datasets? – How does performance vary with network depth? •  Algorithm: – A deep convolutional model is first trained in a fully supervised setting using a state-of-the-art method [Krizhevsky et al., 2012]. – Extract various features from this network, and evaluate the efficacy of these features on generic vision tasks.
  78. 78. Comparison: DECAF to others 79
  79. 79. Relational Transfer Learning Approaches •  Motivation: – If two logically described domains (relational; data is non-i.i.d.) are related, they must share similar relations among objects. – These relations can be used for transfer learning
  80. 80. Relational Transfer Learning Approaches (cont.) Academic domain (source): Student (B), Professor (A), Paper (T), with relations AdvisedBy and Publication: AdvisedBy (B, A) ˄ Publication (B, T) => Publication (A, T). Movie domain (target): Actor (A), Director (B), Movie (M), with relations WorkedFor and MovieMember: WorkedFor (A, B) ˄ MovieMember (A, M) => MovieMember (B, M). Shared pattern: P1(x, y) ˄ P2 (x, z) => P2 (y, z) [Mihalkova et al., AAAI-07; Davis and Domingos, ICML-09]
  81. 81. TRANSFER LEARNING APPLICATIONS
  82. 82. Query Classification and Online Advertisement •  ACM KDDCUP 05 Winner •  SIGIR 06 •  ACM Transactions on Information Systems Journal 2006 – Joint work with Dou Shen, Jiantao Sun and Zheng Chen
  83. 83. QC as Machine Learning Inspired by the KDDCUP’05 competition – Classify a query into a ranked list of categories – Queries are collected from real search engines – Target categories are organized in a tree with each node being a category
  84. 84. Target-transfer Learning in QC •  Classifier, once trained, stays constant – Target Classes Before •  Sports, Politics (European, US, China) – Target Classes Now •  Sports (Olympics, Football, NBA), Stock Market (Asian, Dow, Nasdaq), History (Chinese, World) How to allow the target to change? •  Application: – advertisements come and go, – but our query→target mapping need not be retrained! •  We call this the target-transfer learning problem
  85. 85. Solutions: Query Enrichment + Staged Classification [Figure: Phase I, the training phase: from the target categories, construct synonym-based classifiers and a statistical classifier. Phase II, the testing phase: queries are sent to a search engine; the labels and the text of the returned pages are classified, and the classified results are combined into the final results.] Solution: Bridging classifier
  86. 86. Step 1: Query enrichment •  Textual information: Title, Snippet, Full text •  Category information: Category
  87. 87. Step 2: Bridging Classifier •  Wish to avoid: – When the target is changed, training needs to be repeated! •  Solution: – Connect the target taxonomy and queries by taking an intermediate taxonomy as a bridge
  88. 88. Bridging Classifier (Cont.) •  How to connect? The relation between the target category C_i^T and the query q is built from: the relation between the intermediate category C_j^I and C_i^T; the relation between C_j^I and q; and the prior probability of C_j^I
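The formula itself is not preserved in the transcript. One way to combine the three ingredients named on the slide is to sum over intermediate categories, p(C_i^T | q) ∝ Σ_j p(C_i^T | C_j^I) · p(q | C_j^I) · p(C_j^I). The sketch below implements that sum; treat the exact factorization as an assumption, and note that the category names and probabilities are made up for illustration.

```python
def bridging_scores(p_target_given_inter, p_query_given_inter, p_inter):
    """Score target categories through an intermediate taxonomy.

    score(T_i) = sum_j p(T_i | I_j) * p(q | I_j) * p(I_j), then normalized
    so that the scores over target categories sum to 1.
    """
    scores = {}
    for target, row in p_target_given_inter.items():
        scores[target] = sum(
            row[inter] * p_query_given_inter[inter] * p_inter[inter]
            for inter in p_inter
        )
    total = sum(scores.values())
    return {target: score / total for target, score in scores.items()}


# Hypothetical intermediate categories with priors and query likelihoods.
p_inter = {"Sports/Basketball": 0.5, "Finance/Stocks": 0.5}
p_query_given_inter = {"Sports/Basketball": 0.08, "Finance/Stocks": 0.02}

# Hypothetical relation between target and intermediate categories.
p_target_given_inter = {
    "NBA": {"Sports/Basketball": 0.9, "Finance/Stocks": 0.1},
    "Nasdaq": {"Sports/Basketball": 0.1, "Finance/Stocks": 0.9},
}

scores = bridging_scores(p_target_given_inter, p_query_given_inter, p_inter)
```

Only `p_target_given_inter` depends on the target taxonomy, so when the target categories change, that table is recomputed while the query-side quantities stay as they are. This is the sense in which no retraining is needed.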
  89. 89. Category Selection for Intermediate Taxonomy – Category Selection for Reducing Complexity •  Total Probability (TP) •  Mutual Information
  90. 90. Result of Bridging Classifiers – Using the bridging classifier allows the target classes to change freely •  no need to retrain the classifier! •  Performance of the Bridging Classifier with Different Granularity of Intermediate Taxonomy
  91. 91. Cross Domain Activity Recognition [Zheng, Hu, Yang, Ubicomp 2009] •  Challenges: – A new domain of activities without labeled data •  Cross-domain activity recognition – Transfer some available labeled data from source activities to help train the recognizer for the target activities. Example domains: Cleaning Indoor, Laundry, Dishwashing
  92. 92. How to use the similarities? Source domain labeled data: <Sensor Reading, Activity Name>, e.g. <SS, “Make Coffee”>. Similarity measure (mined from the Web): sim(“Make Coffee”, “Make Tea”) = 0.6. Target domain pseudo labeled data (pseudo training data): <SS, “Make Tea”, 0.6>, used to train a weighted SVM classifier.
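The construction of pseudo training data on this slide can be sketched as below. The example triple and the 0.6 similarity come from the slide; the function names, the lookup-table similarity and the `min_sim` threshold are hypothetical, and the weighted SVM that would consume the output is not included.

```python
def make_pseudo_training_data(source_data, target_activities, similarity,
                              min_sim=0.0):
    """Relabel source readings with target activities, weighted by similarity.

    source_data: list of (sensor_reading, source_activity) pairs.
    Returns a list of (sensor_reading, target_activity, weight) triples.
    """
    pseudo = []
    for reading, source_activity in source_data:
        for target_activity in target_activities:
            weight = similarity(source_activity, target_activity)
            if weight > min_sim:
                pseudo.append((reading, target_activity, weight))
    return pseudo


# Similarity table; the single value is the example from the slide.
SIMILARITY = {("Make Coffee", "Make Tea"): 0.6}


def lookup_similarity(source_activity, target_activity):
    return SIMILARITY.get((source_activity, target_activity), 0.0)


source_data = [("SS", "Make Coffee")]
pseudo = make_pseudo_training_data(source_data, ["Make Tea"],
                                   lookup_similarity)
```

Each triple's third element plays the role of an instance weight, so a pseudo label backed by a weak similarity influences the classifier less than one backed by a strong similarity.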
  93. 93. Calculating Activity Similarities •  How similar are two activities? – Use Web search results – TFIDF: Traditional IR similarity metrics (cosine similarity) – Example: mined similarity between the activity “sweeping” and “vacuuming”, “making the bed”, “gardening” [Figure: calculated similarity with the activity "Sweeping"]
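A minimal sketch of the TF-IDF plus cosine similarity computation follows. The short strings stand in for text retrieved from Web search results for each activity and are entirely made up; the smoothed IDF formula is one common variant, not necessarily the one used in the original work.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Build sparse TF-IDF vectors (dicts) for a list of text documents."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical stand-ins for Web search result text per activity.
pages = {
    "sweeping": "broom floor dust clean sweep floor",
    "vacuuming": "vacuum floor dust clean carpet",
    "gardening": "plants soil water garden seeds",
}
names = list(pages)
vectors = dict(zip(names, tfidf_vectors([pages[name] for name in names])))

sim_vacuuming = cosine(vectors["sweeping"], vectors["vacuuming"])
sim_gardening = cosine(vectors["sweeping"], vectors["gardening"])
```

With these toy pages, "sweeping" shares vocabulary with "vacuuming" but none with "gardening", so the first similarity is positive and the second is zero.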
  94. 94. Cross-Domain AR: Performance (columns: mean accuracy with cross domain transfer; # activities in source domain; # activities in target domain; random-guess baseline) •  MIT Dataset (Cleaning to Laundry): 58.9%; 13; 8; 12.5% •  MIT Dataset (Cleaning to Dishwashing): 53.2%; 13; 7; 14.3% •  Intel Research Lab Dataset: 63.2%; 5; 6; 16.7% •  Activities in the source domain and the target domain are generated from ten random trials; mean accuracies are reported.
  95. 95. Transferring knowledge from social to physical •  Ubiquitous physical sensors motivate extensive research on ubiquitous computing. Which activity is this person performing?
  96. 96. Transferring from social to physical "I am on a business trip in New York. The Metropolitan Museum of Art is fantastic!" "Brilliant night at Chilli Food, wine, hospitality all excellent. Bristol's top restaurant." "Back in the #gym after 3.5 weeks :) feeling good #exercise"
  97. 97. Can we transfer knowledge from social media to the physical world?
  98. 98. Transfer from social to physical Cellphone Sensor Dataset •  232 sensor records •  10 volunteers •  time, GPS, tri-axial accelerometer, location POI info Sina Weibo •  10,791 tweets •  Distribution of labels •  Distribution of top 9 labels
  99. 99. Transfer from social to physical •  Results A naive combination of sensor and social features performs better than sensor features only (Combined vs. Sensor), which validates the necessity of instilling social knowledge into physical sensor data. Heterogeneous transfer learning methods show improvement over Combined: employing social messages to enrich sensor readings’ feature representation in a latent space is more effective than naive combination. •  Our method needs only 50% of the labelled data used by other methods to achieve the same performance.
  100. 100. Transfer Learning for Collaborative Filtering IMDB Database Amazon.com
  101. 101. Transfer Learning in Collaborative Filtering •  Source (Dense): Encode cluster-level rating patterns •  Target (Sparse): Map users/items to the encoded prototypes [Figure: the rows and columns of the dense auxiliary MOVIES rating matrix are permuted and reduced to groups, giving a cluster-level rating pattern; the sparse target BOOKS rating matrix, with missing entries marked "?", is matched against that pattern]
  102. 102. ADVANCED DEVELOPMENTS 103
  103. 103. Source-Free Transfer Learning Evan Wei Xiang, Sinno Jialin Pan, Weike Pan, Jian Su and Qiang Yang. Source-Selec7on-Free Transfer Learning. In Proceedings of the 22nd Interna7onal Joint Conference on Ar7ficial Intelligence (IJCAI-11), Barcelona, Spain, July 2011.
  104. 104. Transfer Learning Lack of labeled training data always happens When we have some related source domains Supervised Learning Transfer Learning
  105. 105. Where are the “right” source data? •  We may have an extremely large number of choices of poten7al sources to use.
  106. 106. SFTL – Building base models vs. vs. vs. vs. vs. vs. vs. vs. vs. vs. vs. From the taxonomy of the online informa7on source, we can “compile” a lot of base classifica7on models
  107. 107. Source Free Transfer Learning vs. vs. vs. vs. vs. For each target instance, we can obtain a combined result on the label space via aggrega=ng the predic=ons from all the base classifiers However, do we need to call the base classifiers during the predic)on phase? The answer is No! Then we can use the projec=on matrix V to transform such combined results from the label space to a latent space V Projection matrix q m Probability Label space A Target Instance
  108. 108. Compilation: Learning a projection matrix W to map the target instance to the latent space •  For each target instance, we first aggregate its prediction in the base label space, and then project it onto the latent space. •  Our regression model: loss on labeled data plus loss on unlabeled data. [Figure: target domain labeled and unlabeled data, the q × m projection matrix V, and the learned d × m projection matrix W]
  109. 109. SFTL – Predictions for the incoming test data •  With the parameter matrix W, we can make predictions on any incoming test data based on the distance to the label prototypes, without calling the base classification models. •  No need to use base models explicitly! [Figure: incoming target domain test data, the q × m projection matrix V, and the learned d × m projection matrix W]
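A minimal sketch of this prediction step, assuming W (d × m) and the label prototypes in V (q × m) have already been learned. The slides only say "distance to the label prototypes"; squared Euclidean distance is an assumption here, and the matrices, labels, and function names are hypothetical toy values.

```python
def project(x, W):
    """Map a d-dimensional instance into the m-dimensional latent space (x times W)."""
    m = len(W[0])
    return [sum(x[j] * W[j][c] for j in range(len(x))) for c in range(m)]


def predict_label(x, W, V, labels):
    """Return the label whose prototype (a row of V) is closest to the projected instance."""
    z = project(x, W)
    dists = [sum((zc - vc) ** 2 for zc, vc in zip(z, v)) for v in V]
    return labels[dists.index(min(dists))]


# Hypothetical learned matrices: d = 3 input features, m = 2 latent dimensions, q = 2 labels.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
labels = ["sports", "finance"]

first = predict_label([1.0, 0.0, 0.0], W, V, labels)
second = predict_label([0.0, 2.0, 0.0], W, V, labels)
```

Note that no base classifier is called at prediction time: only W and V are needed.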
  110. 110. Transitive Transfer Learning with intermediate domains Qiang Yang Hong Kong University of Science and Technology http://www.cse.ust.hk/~qyang
  111. 111. Far Transfer vs. Near Transfer
  112. 112. Problem definition •  Given distant source and target domains, and a set of intermediate domains, can we find one or more intermediate domains to enable transfer learning between source and target? [Figure: source and target are not directly transferable; intermediate domain 1; common factor 1]
  113. 113. Previous work and TTL •  Traditional machine learning –  training and test data should be from the same problem domain. •  Transfer learning –  training and test data should be from similar problem domains. •  Transitive transfer learning –  training and test data could be from distant problem domains. ML: Same domain TL: Similar domains TTL: Distant domains
  114. 114. Text-to-Image Classification •  Source and target domains have few overlaps •  Text-to-image classification with co-occurrence data as intermediate domain •  Accelerometer-to-gyroscope activity recognition with data from intelligent devices as intermediate domains
  115. 115. TTL: single intermediate domain Intermediate domain selection, then propagate knowledge •  Crawl a lot of images with annotations from the Internet •  Use a domain distance, such as A-distance, to identify the intermediate domain •  Transitive transfer through shared hidden factors in row by matrix tri-factorization Matrix tri-factorization for clustering/classification
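The domain-selection bullet can be illustrated with the commonly used proxy A-distance, 2(1 − 2ε), where ε is the held-out error of a classifier trained to tell two domains apart. The sketch below assumes those errors have already been measured; the candidate names, the error values, and the rule of summing the distances to source and target are illustrative assumptions, not the selection procedure from the slides.

```python
def proxy_a_distance(domain_classifier_error):
    """Proxy A-distance: 0 when domains are indistinguishable (error 0.5), 2 when fully separable."""
    return 2.0 * (1.0 - 2.0 * domain_classifier_error)


def select_intermediate(errors):
    """errors maps a candidate name to (error vs. source, error vs. target)."""
    def score(name):
        e_src, e_tgt = errors[name]
        return proxy_a_distance(e_src) + proxy_a_distance(e_tgt)
    return min(errors, key=score)


# Hypothetical domain-classifier errors for three crawled candidate domains.
candidate_errors = {
    "annotated_photos": (0.40, 0.35),
    "news_captions": (0.10, 0.45),
    "random_web_pages": (0.05, 0.05),
}
best = select_intermediate(candidate_errors)
```

A candidate that is hard to distinguish from both the source and the target gets a small total distance, which is the sense in which it can bridge the two.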
  116. 116. TTL: shared hidden factors in row by matrix tri-factorization
  117. 117. Experiments: NUS-WIDE data set •  The NUS-WIDE data set is used •  45 text-to-image tasks •  Each task is composed of 1200 text documents, 600 images, and 1600 co-occurred text-image pairs.
  118. 118. Supervised Learning w/ auto-encoder Labeled Source Domain Feature Engineering Predictive Model Learning Shared Text Classification
  119. 119. Designing the Objective Function of TTL •  Transitive transfer learning with intermediate data •  Intermediate domain weighting/selection: the weights for the intermediate domains are learned from data. •  The intermediate data help find a better hidden layer. [Figure labels: Predictive Model Learning, Feature Engineering]
  120. 120. TTL with supervised auto-encoder [Figure labels: Source, Target, Intermediates, Feature Engineering, Predictive Model Learning, Shared] •  The NUS-WIDE data •  45 text-to-image tasks •  Each task is composed of 1200 text documents, 600 images, and 1600 co-occurred text-image pairs. In each task, 1600*45 co-occurred text-image pairs will be used for knowledge transfer.
  121. 121. TTL with supervised auto-encoder [Figure labels: Source, Target, Intermediates, Feature Engineering, Predictive Model Learning, Shared] Text-to-image w/ intermediate data
  122. 122. Reinforcement Transfer Learning via Sparse Coding •  Slow learning speed remains a fundamental problem for reinforcement learning in complex environments. •  Main problem: the numbers of states and actions in the source and target domains are different. –  Existing works: hand-coded inter-task mapping between state-action pairs •  Tool: new transfer learning based on sparse coding Ammar, Tuyls, Taylor, Driessens, Weiss: Reinforcement Learning Transfer via Sparse Coding. AAMAS, 2012.
  123. 123. Reinforcement Learning Transfer via Sparse Coding The authors measured performance as the number of steps during an episode in which the pole was controlled in an upright position, for a given fixed amount of samples.
  124. 124. Reinforcement Transfer Learning via Sparse Coding •  Given state-action-state triplets in the source task, learn a dictionary. •  Using the coefficient matrix from the first step, learn the dictionary in the target task. •  Then, for each triplet in the target task, sparse projection is used to find its coefficients. •  As a result, the inter-task mapping can be learned!
  125. 125. Reference •  [Thorndike and Woodworth, The Influence of Improvement in One Mental Function upon the Efficiency of Other Functions, 1901] •  [Taylor and Stone, Transfer Learning for Reinforcement Learning Domains: A Survey, JMLR 2009] •  [Pan and Yang, A Survey on Transfer Learning, IEEE TKDE 2010] •  [Quiñonero-Candela et al., Dataset Shift in Machine Learning, MIT Press 2009] •  [Blitzer et al., Domain Adaptation with Structural Correspondence Learning, EMNLP 2006] •  [Pan et al., Cross-Domain Sentiment Classification via Spectral Feature Alignment, WWW 2010] •  [Pan et al., Transfer Learning via Dimensionality Reduction, AAAI 2008]
  126. 126. Reference (cont.) •  [Pan et al., Domain Adaptation via Transfer Component Analysis, IJCAI 2009] •  [Evgeniou and Pontil, Regularized Multi-Task Learning, KDD 2004] •  [Zhang and Yeung, A Convex Formulation for Learning Task Relationships in Multi-Task Learning, UAI 2010] •  [Agarwal et al., Learning Multiple Tasks using Manifold Regularization, NIPS 2010] •  [Argyriou et al., Multi-Task Feature Learning, NIPS 2007] •  [Ando and Zhang, A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data, JMLR 2005] •  [Ji et al., Extracting Shared Subspace for Multi-label Classification, KDD 2008]
  127. 127. Reference (cont.) •  [Raina et al., Self-taught Learning: Transfer Learning from Unlabeled Data, ICML 2007] •  [Dai et al., Boosting for Transfer Learning, ICML 2007] •  [Glorot et al., Domain Adaptation for Large-Scale Sentiment Classification: A Deep Learning Approach, ICML 2011] •  [Davis and Domingos, Deep Transfer via Second-order Markov Logic, ICML 2009] •  [Mihalkova et al., Mapping and Revising Markov Logic Networks for Transfer Learning, AAAI 2007] •  [Li et al., Cross-Domain Co-Extraction of Sentiment and Topic Lexicons, ACL 2012]
  128. 128. Reference (cont.) •  [Sugiyama et al., Direct Importance Estimation with Model Selection and Its Application to Covariate Shift Adaptation, NIPS 2007] •  [Kanamori et al., A Least-squares Approach to Direct Importance Estimation, JMLR 2009] •  [Cristianini et al., On Kernel Target Alignment, NIPS 2002] •  [Huang et al., Correcting Sample Selection Bias by Unlabeled Data, NIPS 2006] •  [Zadrozny, Learning and Evaluating Classifiers under Sample Selection Bias, ICML 2004]
  129. 129. Transfer Learning in Convolutional Neural Networks •  Convolutional neural networks (CNNs): outstanding image-classification performance. •  Learning CNNs requires a very large number of annotated image samples –  Millions of parameters: too many, which prevents application of CNNs to problems with limited training data. •  Key Idea: –  the internal layers of the CNN can act as a generic extractor of mid-level image representations –  Model-based Transfer Learning
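The key idea can be shown with a deliberately tiny stand-in. In the sketch below, a fixed linear map followed by ReLU plays the role of the pretrained internal layers (it is not a real CNN, and its weights are hypothetical); it stays frozen while only a small logistic-regression head is trained on a handful of target examples.

```python
import math

# Stand-in for pretrained internal layers: weights are frozen and never updated.
FROZEN_W = [[1.0, -1.0], [-1.0, 1.0]]


def extract_features(x):
    """Frozen feature extractor: linear map followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_W]


def train_head(samples, targets, epochs=200, lr=0.5):
    """Train only the new classifier head (logistic regression) on the frozen features."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, targets):
            f = extract_features(x)
            z = sum(wi * fi for wi, fi in zip(w, f)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * fi for wi, fi in zip(w, f)]
            b -= lr * g
    return w, b


def predict(x, w, b):
    """Classify with the frozen extractor plus the trained head."""
    f = extract_features(x)
    return 1 if sum(wi * fi for wi, fi in zip(w, f)) + b > 0 else 0


# A small hypothetical target task with only four labeled examples.
train_x = [[2.0, 0.0], [3.0, 1.0], [0.0, 2.0], [1.0, 3.0]]
train_y = [1, 1, 0, 0]
head_w, head_b = train_head(train_x, train_y)
```

Because only the head's three parameters are learned, a few labeled target samples are enough; with a real network the same pattern would freeze the pretrained convolutional layers and retrain the final layers.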
  130. 130. Thank You
