林守德 (Shou-de Lin) / Practical Issues in Machine Learning

Shou-de Lin is currently a full professor in the CSIE department of National Taiwan University. He holds a BS in EE from National Taiwan University, an MS in EE from the University of Michigan, and an MS in Computational Linguistics and a PhD in Computer Science, both from the University of Southern California. He leads the Machine Discovery and Social Network Mining Lab at NTU. Before joining NTU, he was a post-doctoral research fellow at Los Alamos National Laboratory. Prof. Lin's research covers machine learning and data mining, social network analysis, and natural language processing. His international recognition includes the best paper award at the IEEE Web Intelligence Conference 2003, a Google Research Award in 2007, a Microsoft Research Award in 2008, a merit paper award at TAAI 2010, the best paper award at ASONAM 2011, and US Aerospace AFOSR/AOARD research awards for five years. He is an all-time winner of the ACM KDD Cup, having led or co-led the NTU team to five championships, and he also led a team to win the WSDM Cup 2016 championship. He has served as a senior PC member for SIGKDD and an area chair for ACL. He is currently an associate editor for the International Journal on Social Network Mining, the Journal of Information Science and Engineering, and the International Journal of Computational Linguistics and Chinese Language Processing. He received the Young Scholars' Creativity Award from the Foundation for the Advancement of Outstanding Scholarship and the Ta-You Wu Memorial Award.

  1. 1. Practical Issues in Machine Learning - How to create an effective ML model? Prof. Shou-de Lin (林守德) CSIE, National Taiwan University sdlin@csie.ntu.edu.tw
  2. 2. Diagnosing a Machine Learning Solution - Why My ML Model Doesn't Work? Prof. Shou-de Lin (林守德) CSIE, National Taiwan University sdlin@csie.ntu.edu.tw
  3. 3. Machine Discovery and Social Network Mining Lab, CSIE, NTU • PI: Shou-de Lin – B.S. in EE, NTU – M.S. in EECS, UM – M.S. in Computational Linguistics, USC – Ph.D. in CS, USC – Postdoc at Los Alamos National Lab • Courses: – Machine Learning and Data Mining: Theory and Practice – Machine Discovery – Social Network Analysis – Technical Writing and Research Methods – Probabilistic Graphical Models • Awards: – All-time ACM KDD Cup Champion (2008, 2010, 2011, 2012, 2013) – Google Research Award 2008 – Microsoft Research Award 2009 – Best Paper Award WI 2003, TAAI 2010, ASONAM 2011, TAAI 2014 – US Aerospace AOARD Research Grant Award 2011, 2013, 2014, 2015, 2016 • Research areas: Machine Learning with Big Data, Machine Discovery, Learning with IoT Data, Applications in NLP, SNA & Recommender Systems, Practical Issues in Learning
  4. 4. Talk Materials Based on Hands-On Experience in Solving ML Tasks, Including • Participating in the ACM KDD Cup for 6 years • >20 industrial collaborations • Visiting Scholar at Microsoft, 2015-2016
  5. 5. Team NTU's Performance on the ACM KDD Cup • 2008 (Siemens): breast cancer prediction, medical data, challenge: imbalanced data, 0.2M records, >200 teams, result: Champion • 2009 (Orange): user behavior prediction, telecom data, challenge: heterogeneous data, 0.1M records, >400 teams, result: 3rd place • 2010 (PSLC DataShop): learner performance prediction, education data, challenge: time-dependent instances, 30M records, >100 teams, result: Champion • 2011 (Yahoo!): recommendation, music data, challenge: large scale, temporal + taxonomy info, 300M records, >1000 teams, result: Champion • 2012 (Tencent, track 2): internet advertising, search engine logs, challenge: click-through rate prediction, 155M records, >170 teams, result: Champion • 2013 (Microsoft): author-paper and author name identification, academic search data, challenge: aliases in names, 250K authors and 2.5M papers, >700 teams, result: Champion
  6. 6. Industrial Collaboration (industry-academia projects) • Heterogeneous mining techniques, Google, 2009 • Combining social models with game-based incentive mechanisms, IBM, 2015 • Automatic air-quality detection, Microsoft, 2014 • Sensor network content analysis, Intel, 2011-2015 • Distributed learning, US Air Force, 2011-2017 • TV-ratings and traffic-condition prediction, Institute for Information Industry, 2011-2015 • Music lyric emotion recognition, Chunghwa Telecom, 2014 • Resume recommendation analysis, 104 Job Bank, 2011-2013 • Automated customer service system, Trend Micro, 2016- • Big-data credit assessment and life-cycle prediction, KPMG Taiwan, 2015 • Data mining consulting, Delta Electronics, 2015 • Technology transfer and services: rank-based matrix factorization, ITRI ICL, 2013 • Multi-task learning models, ITRI Big Data Center, 2015 • RankNet, ITRI ICL, 2013 • Multi-label classification tools, ITRI Big Data Center, 2013 • Social network technology services, MobiApp, 2012 • Product recommendation service collaboration, i-True, 2012
  7. 7. What is Machine Learning?
  8. 8. General Def of Machine Learning (ML) • ML tries to optimize a performance criterion using example data or past experience. • Mathematically speaking: given data X, we want to learn a function mapping f(X) for a certain purpose, e.g. – f(x) = a label y → classification – f(x) = a set Y in X → clustering – f(x) = p(x) → distribution estimation – … • ML techniques tell us how to produce a high-quality f(x), given certain objectives and evaluation metrics
  9. 9. Why Do My Machine Learning Models Fail (i.e. prediction accuracy is low)? A series of analyses is required to understand why
  10. 10. The ML Diagnosis Tree: Is it ML? (No → try other solutions) → Is it the correct ML scenario? (No → try another scenario) → Have you identified a suitable model? (Do you have enough data? Is your model too complicated?) → The right features? (Have you performed feature engineering?) → Suitable evaluation metrics? A proper validation set?
  11. 11. Diagnosis 1: Is it an ML task? • Are you sure machine learning is the best solution for your task? To ML or not to ML, that is the question!!
  12. 12. Tasks Doubtful for ML • X comes from a closed set with limited variation → simply memorize all possible X→Y mappings – E.g. word translation using a dictionary • F(x) can be easily derived by writing rules – E.g. compression/de-compression • X is (sort of) independent of Y – E.g. X = <ID, name, wealth>, Y = height • f(x) is not smooth, i.e. f(x+Δx) is not close to y+Δy — Too hard!! Too easy!!
  13. 13. OK, my task is ML-solvable, but I still failed
  14. 14. The ML Diagnosis Tree: Is it ML? (No → try other solutions) → Is it the correct ML scenario? (No → try another scenario) → Have you identified a suitable model? (Do you have enough data? Is your model too complicated?) → The right features? (Have you performed feature engineering?) → Suitable evaluation metrics? A proper validation set?
  15. 15. Diagnosis 2: Which ML Scenario? • Have you modeled your task into the right ML scenario? – ML → Classification/Regression → SVM, DNN, DT • Which ML toolbox should you choose? Machine Learning → Supervised Learning → Classification → SVM
  16. 16. Let's Talk About… Understanding Machine Learning in 10 Mins: What It Can Do
  17. 17. A Variety of ML Scenarios • Supervised learning: classification & regression, multi-label learning, multi-instance learning, cost-sensitive learning, semi-supervised learning, active learning • Unsupervised learning: clustering, learning data distributions, pattern learning • Reinforcement learning • Variations: transfer learning, online learning, distributed learning
  18. 18. Supervised Learning • Given: a set of <input X, output Y> pairs • Goal: given an unseen input, predict the corresponding output • There are two kinds of outputs – Categorical: classification problem • Binary classification vs. multiclass classification – Real values: regression problem • Examples: 1. Binary classification: input: sensor readings, output: whether the sensor is broken (Intel) 2. Multi-class classification: input: the lyrics of a song, output: the mood: happy/sad/surprised/angry (CHT) 3. Regression: input: weather/traffic/air conditions, output: the PM2.5 value (Microsoft Research Asia) (see the sketch below)
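To make the two output types concrete, here is a minimal sketch in Python using scikit-learn on synthetic data; the sensor, lyric, and PM2.5 tasks named on the slide are only illustrations and are not reproduced here.

```python
# A minimal sketch of the two supervised settings, assuming scikit-learn.
# Synthetic data stands in for the sensor / PM2.5 examples on the slide.
from sklearn.datasets import make_classification, make_regression
from sklearn.linear_model import LogisticRegression, Ridge
from sklearn.model_selection import train_test_split

# Binary classification: X -> {0, 1} (e.g. "is this sensor broken?")
Xc, yc = make_classification(n_samples=1000, n_features=20, random_state=0)
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Xc_tr, yc_tr)
print("classification accuracy:", clf.score(Xc_te, yc_te))

# Regression: X -> a real value (e.g. a PM2.5 reading)
Xr, yr = make_regression(n_samples=1000, n_features=20, noise=10.0, random_state=0)
Xr_tr, Xr_te, yr_tr, yr_te = train_test_split(Xr, yr, random_state=0)
reg = Ridge().fit(Xr_tr, yr_tr)
print("regression R^2:", reg.score(Xr_te, yr_te))
```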
  19. 19. Cost-sensitive Learning • A classification task with non-uniform costs for different types of classification error. • Goal: to predict the class C* that minimizes the expected cost rather than the misclassification rate • An example cost matrix L for medical diagnosis (L_jk = cost of predicting class j when the actual class is k): predict cancer / actual cancer = 0, predict cancer / actual normal = 1, predict normal / actual cancer = 10000, predict normal / actual normal = 0 (see the sketch below)
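A minimal sketch of the decision rule this scenario implies, assuming scikit-learn and the cost matrix above; the class encoding (0 = cancer, 1 = normal) and the synthetic data are hypothetical stand-ins.

```python
# Cost-sensitive prediction: choose the class with the lowest expected cost
# instead of the most probable class. Assumes a probabilistic classifier.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# cost[j, k] = cost of predicting class j when the actual class is k
# (hypothetical encoding: class 0 = cancer, class 1 = normal)
cost = np.array([[0.0,     1.0],
                 [10000.0, 0.0]])

X, y = make_classification(n_samples=2000, weights=[0.05, 0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

proba = clf.predict_proba(X_te)        # columns: P(cancer), P(normal)
expected_cost = proba @ cost.T         # expected cost of each possible prediction
y_pred = expected_cost.argmin(axis=1)  # pick the cheaper prediction

print("'cancer' predictions, plain argmax:   ", int((clf.predict(X_te) == 0).sum()))
print("'cancer' predictions, cost-sensitive: ", int((y_pred == 0).sum()))
```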
  20. 20. A Variety of ML Scenarios • Supervised learning: classification & regression, multi-label learning, multi-instance learning, cost-sensitive learning, semi-supervised learning, active learning • Unsupervised learning: clustering, learning data distributions, pattern learning • Reinforcement learning • Variations: transfer learning, online learning, distributed learning
  21. 21. Unsupervised Learning • Learning without teachers (presumably harder than supervised learning) – Learning "what normally happens" – E.g. how babies learn their first language (unsupervised) vs. how people learn their 2nd language (supervised) • Given: a bunch of inputs X (there is no output Y) • Goal: depends on the task, for example – Estimate P(X) → then we can find argmax P(X) → Bayesian methods – Find P(X2|X1) → then we know the dependency between inputs → association rules, probabilistic graphical models – Find Sim(X1,X2) → then we can group similar X's → clustering (see the sketch below)
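As an illustration of two of these goals (estimating P(X) and clustering), here is a minimal sketch on synthetic 2-D data, assuming scikit-learn; it is not tied to any dataset from the talk.

```python
# Density estimation and clustering on unlabeled data, assuming scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # inputs only, no Y

# Estimate P(X) with a mixture model, then score how likely the data is
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
print("average log-likelihood under the fitted P(X):", gmm.score(X))

# Group similar X's with k-means clustering
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes:", [int((km.labels_ == c).sum()) for c in range(3)])
```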
  22. 22. A Variety of ML Scenarios • Supervised learning: classification & regression, multi-label learning, multi-instance learning, cost-sensitive learning, semi-supervised learning, active learning • Unsupervised learning: clustering, learning data distributions, pattern learning • Reinforcement learning • Variations: transfer learning, online learning, distributed learning
  23. 23. Reinforcement Learning (RL) • RL is a "decision making" process – How an agent should make decisions to maximize the long-term reward • RL is associated with a sequence of states X and actions Y (think of a Markov Decision Process) with certain "rewards". • Its goal is to find an optimal policy to guide the decisions. (Figure from Mark Chang)
  24. 24. AlphaGo: SL + RL • 1st stage: learn from all the world's Go players (supervised learning) – Data: past game records – Learning: f(X)=Y, X: board position, Y: next move – Result: the AI can play Go, but not at expert level • 2nd stage: surpass its past self (reinforcement learning) – Data: generated by playing against the 1st-stage AI – Learning: observation → board position, reward → whether it wins, action → next move
  25. 25. Key: Finding which ML Scenario Best Suits the Current Task
  26. 26. Case Study: Click-Through Rate Prediction (KDD Cup 2012) • CTR: the ratio of users who click on a displayed advertisement • It looks like a regression task, but indeed it is not. – CTR = #click/#view → is 3/10 really the same as 300/1000? – Better solution: treat it as binary classification and transform 3/10 into 3 positive cases and 7 negative cases (see the sketch below)
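A minimal sketch of the reformulation described on this slide: expanding an aggregated (clicks, views) record into per-impression binary instances. The feature names are hypothetical; this is not the actual KDD Cup 2012 pipeline.

```python
# Turn an aggregated CTR record into binary training instances so that an
# ordinary binary classifier can be trained on it.
def expand_ctr_record(features, clicks, views):
    """E.g. 3 clicks out of 10 views -> 3 positive and 7 negative instances."""
    positives = [(features, 1)] * clicks
    negatives = [(features, 0)] * (views - clicks)
    return positives + negatives

record = {"ad_id": 42, "query": "shoes"}   # hypothetical ad/query features
instances = expand_ctr_record(record, clicks=3, views=10)
print(len(instances), "instances,",
      sum(label for _, label in instances), "positive")
```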
  27. 27. FAQ: I have a lot of data, but it is not labelled • This is by far the most common question I have been asked. • My honest answer: try to get the data labelled, because you need the ground truth for evaluation anyway. • In several cases labelling is too costly: – Semi-supervised learning – Transfer learning (using labelled data from other domains) – Active learning (query a subset of labels)
  28. 28. A Variety of ML Scenarios • Supervised learning: classification & regression, multi-label learning, multi-instance learning, cost-sensitive learning, semi-supervised learning, active learning • Unsupervised learning: clustering, learning data distributions, pattern learning • Reinforcement learning • Variations: transfer learning, online learning, distributed learning
  29. 29. Semi-supervised Learning • Only a small portion of the data is annotated (usually due to high annotation cost) • Leverage the unlabeled data for better performance [Zhu 2008] (see the sketch below)
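One possible way to leverage unlabeled data is label spreading; the slide does not prescribe a particular method, so the sketch below is just one illustrative option, assuming scikit-learn (unlabeled points are marked with -1).

```python
# Semi-supervised learning sketch with LabelSpreading on synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import LabelSpreading

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
rng = np.random.RandomState(0)
unlabeled = rng.rand(len(y)) < 0.9    # pretend 90% of the labels are missing
y_partial = y.copy()
y_partial[unlabeled] = -1             # -1 marks an unlabeled point

model = LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_partial)
print("accuracy on the originally unlabeled points:",
      model.score(X[unlabeled], y[unlabeled]))
```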
  30. 30. Transfer Learning • Traditional ML: each learning system is trained only on its own training items • Transfer learning: knowledge from learning tasks in other domains (with more labeled data, possibly a different feature space and data distribution) is incorporated to improve a learning task that has fewer labeled data (Pan et al.)
  31. 31. Active Learning • Achieves better learning with fewer labeled training data by actively selecting a subset of unlabeled data to be annotated • Loop: the learning algorithm requests the label of an example from an expert/oracle, receives the label, retrains on the enlarged labeled set, and finally outputs a classifier (see the sketch below)
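A minimal sketch of that loop, using pool-based uncertainty sampling with scikit-learn; the oracle is simulated by revealing labels that were held back, and the query strategy shown is just one common choice.

```python
# Active learning: repeatedly query the label of the example the current
# classifier is least confident about, then retrain.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
labeled = list(range(20))                        # small initial labeled set
pool = [i for i in range(len(y)) if i not in labeled]

for _ in range(10):                              # 10 queries to the oracle
    clf = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X[pool])
    uncertainty = 1.0 - proba.max(axis=1)        # least-confident sampling
    query = pool[int(np.argmax(uncertainty))]    # index of the example to label
    labeled.append(query)                        # oracle reveals y[query]
    pool.remove(query)

print("labels used:", len(labeled), " accuracy:", clf.score(X, y))
```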
  32. 32. OK, I have identified a suitable ML scenario, but my model still doesn’t work
  33. 33. The ML Diagnosis Tree: Is it ML? (No → try other solutions) → Is it the correct ML scenario? (No → try another scenario) → Have you identified a suitable model? (Do you have enough data? Is your model too complicated?) → The right features? (Have you performed feature engineering?) → Suitable evaluation metrics? A proper validation set?
  34. 34. Did you choose a proper model? • A proper model considers – The size of the data • Small data → a linear (or simpler) model • Large data → a linear or non-linear model – The sparsity of the data • Sparse data → more tricks are needed to perform better and faster • Dense data → requires a light algorithm that consumes less memory – The balance condition of the data • Imbalanced data → special treatment for the minority class – The quality of the data (whether there is noise, missing values, etc.) • Some loss functions (e.g. 0/1 loss or L2) are more robust to noise than others (e.g. hinge loss or exponential loss)
  35. 35. Do you have enough data to train the given model? • Draw the learning curve (accuracy vs. data size) to understand whether your data is sufficient: if accuracy keeps rising as data is added, you need more data; if it has flattened out, you have enough (see the sketch below)
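A minimal sketch of drawing such a learning curve, assuming scikit-learn and synthetic data; the numbers are only illustrative.

```python
# Learning curve: train on growing subsets and compare train vs. validation
# accuracy to judge whether more data would still help.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=3000, n_features=30, random_state=0)
sizes, train_scores, valid_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), valid_scores.mean(axis=1)):
    print(f"n={n:5d}  train acc={tr:.3f}  validation acc={va:.3f}")
# If validation accuracy is still climbing at the largest n, more data may help;
# if it has flattened, collecting more data is unlikely to pay off.
```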
  36. 36. Is your model too complicated? • Increase the model complexity and check the training/testing performance – E.g. increase the latent dimension k in matrix factorization
  37. 37. Overfitting: the Biggest Enemy of ML!! FAQ: How to avoid overfitting? – Occam's razor: try simpler models first – Regularization: a way to constrain the complexity of the model – Carefully design your experiment (see later) – Be aware of the drifting phenomenon (a regularization sketch follows below)
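To illustrate the regularization point, here is a minimal sketch with ridge regression on noisy synthetic data, assuming scikit-learn; a larger alpha constrains the model more.

```python
# Sweep the regularization strength and watch train vs. validation scores:
# an under-regularized, very flexible model fits the training data well
# but generalizes worse.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = rng.uniform(-1, 1, size=(60, 1))
y = np.sin(3 * x).ravel() + rng.normal(scale=0.2, size=60)  # noisy target
X = PolynomialFeatures(degree=12).fit_transform(x)          # very flexible features
X_tr, X_va, y_tr, y_va = train_test_split(X, y, random_state=0)

for alpha in [1e-6, 0.01, 1.0, 100.0]:     # larger alpha = simpler effective model
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:>8}: train R^2={model.score(X_tr, y_tr):.2f}  "
          f"validation R^2={model.score(X_va, y_va):.2f}")
```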
  38. 38. I still cannot do well even after a reasonable model is identified
  39. 39. The ML Diagnosis Tree: Is it ML? (No → try other solutions) → Is it the correct ML scenario? (No → try another scenario) → Have you identified a suitable model? (Do you have enough data? Is your model too complicated?) → The right features? (Have you performed feature engineering?) → Suitable evaluation metrics? A proper validation set?
  40. 40. Diagnosis: Quality of the Features X • Have you identified all potentially useful features (X)?
  41. 41. Which Information Shall Be Extracted as Features? • Use domain knowledge and human judgement to determine which features to obtain for training. • The rule of thumb: if you don't know whether a feature is useful, try it!! – Human judgement can be misleading • Let ML algorithms (e.g. feature selection) determine whether certain information is useful!!
  42. 42. Feature Engineering • Feature engineering is arguably the best strategy to improve performance. • The goal is to explicitly reveal important information to the model – domain knowledge may or may not be useful • Original features → different encodings of the features → combined features (see the sketch below)
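A minimal sketch of that "original → re-encoded → combined" progression, assuming pandas; the column names and values are hypothetical.

```python
# Feature engineering: re-encode raw columns and add a combined feature that
# makes useful information explicit for the model.
import pandas as pd

df = pd.DataFrame({
    "city": ["Taipei", "Hsinchu", "Taipei"],
    "signup_time": pd.to_datetime(["2016-01-03 08:00", "2016-01-09 22:30",
                                   "2016-02-14 13:15"]),
    "visits": [10, 3, 25],
    "purchases": [2, 0, 5],
})

# Different encodings: one-hot encode a categorical, expand a timestamp
df = pd.get_dummies(df, columns=["city"])
df["signup_hour"] = df["signup_time"].dt.hour
df["signup_is_weekend"] = df["signup_time"].dt.dayofweek >= 5

# Combined feature: a ratio the model would otherwise have to discover itself
df["purchase_rate"] = df["purchases"] / df["visits"]
print(df.drop(columns=["signup_time"]))
```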
  43. 43. FAQ: What shall I do if I absolutely need to boost the accuracy? • Combine a variety of models: – If the models are diverse enough and have similar performance, there is a high chance the ensemble will boost performance. • Additional validation data are needed to learn the ensemble weights. (see the sketch below)
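A minimal sketch of blending two diverse models with a weight chosen on a held-out validation set, assuming scikit-learn; a simple grid over convex weights stands in for more elaborate stacking schemes.

```python
# Ensemble by weighted averaging of predicted probabilities; the blending
# weight is selected on validation data, as the slide suggests.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_va, X_te, y_va, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

m1 = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
m2 = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
va1, va2 = m1.predict_proba(X_va)[:, 1], m2.predict_proba(X_va)[:, 1]

best_w = max(np.linspace(0, 1, 21),               # weight on model 1
             key=lambda w: roc_auc_score(y_va, w * va1 + (1 - w) * va2))
blend = best_w * m1.predict_proba(X_te)[:, 1] + (1 - best_w) * m2.predict_proba(X_te)[:, 1]
print("weight on model 1:", float(best_w), " test AUC:", roc_auc_score(y_te, blend))
```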
  44. 44. I am really happy that my ML model FINALLY works so well!!
  45. 45. The ML Diagnosis Tree: Is it ML? (No → try other solutions) → Is it the correct ML scenario? (No → try another scenario) → Have you identified a suitable model? (Do you have enough data? Is your model too complicated?) → The right features? (Have you performed feature engineering?) → Suitable evaluation metrics? A proper validation set?
  46. 46. Is your model evaluated correctly? • Accuracy is NOT always the best way to evaluate a machine learning model
  47. 47. Case Study: A 99.999% Accurate System for Detecting Malicious Personnel • A randomly picked person is not likely to be a terrorist. • Thus, a model that always guesses 'non-terrorist' will achieve very high accuracy – But it is useless!! • The "Area under the ROC Curve" (AUC) is generally used to evaluate such a system. (see the sketch below)
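A minimal sketch of this failure mode, assuming scikit-learn: a trivial "always predict the majority class" baseline on heavily imbalanced synthetic data gets near-perfect accuracy while its AUC shows it is no better than chance.

```python
# Accuracy vs. AUC on an extremely imbalanced detection task.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

y_true = np.zeros(100_000, dtype=int)
y_true[:10] = 1                                       # 10 "terrorists" in 100,000 people
X = np.random.RandomState(0).randn(len(y_true), 5)    # features carry no signal here

baseline = DummyClassifier(strategy="most_frequent").fit(X, y_true)
y_pred = baseline.predict(X)                  # always predicts "non-terrorist"
scores = baseline.predict_proba(X)[:, 1]      # constant score for the rare class

print("accuracy:", accuracy_score(y_true, y_pred))   # ~0.9999, looks impressive
print("AUC:", roc_auc_score(y_true, scores))         # 0.5, i.e. no better than chance
```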
  48. 48. Have the Right Evaluation Data Been Used? • We normally divide the labeled data (or ground-truth data) into three parts: training, validation (or held-out), and testing • Performance on the training data is obviously biased, since the model was constructed to fit this data. – Accuracy must be evaluated on an independent (usually disjoint) test set. – You cannot peek at the test set labels!! • Use the validation set to adjust hyper-parameters • How the validation set is chosen can affect the performance of the model!!
  49. 49. Case Study: KDD Cup 2008 (identify potential cancer patients) • Training set: a set of positive and negative instances (each contains 118 features) – Each positive patient contains a set of negative instances (i.e. ROIs in the X-ray) and at least one positive instance. – ALL instances in a negative patient are negative – It is a multi-instance classification problem. • Random division for CV: 90% on the validation split but only 72% on the test set → effectively peeking at the test patients • Patient-based CV: 80% in validation, 77% on the test set (see the sketch below)
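A minimal sketch of the patient-based split described above, assuming scikit-learn: group labels keep all instances of one patient in the same fold. The data here is synthetic, so the optimistic-vs-realistic gap seen in the competition will not reproduce; the point is only how to set up the grouped cross-validation.

```python
# Random-instance CV vs. patient-grouped CV for a multi-instance setting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, KFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
patients = np.repeat(np.arange(100), 10)      # 100 patients, 10 instances each
clf = RandomForestClassifier(n_estimators=100, random_state=0)

random_cv = cross_val_score(clf, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0))
patient_cv = cross_val_score(clf, X, y, groups=patients, cv=GroupKFold(n_splits=5))
print("random-split CV accuracy: ", random_cv.mean())
print("patient-based CV accuracy:", patient_cv.mean())
```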
  50. 50. Take-Home Point: How to Solve My Task Using ML? 1. Determine whether it should be modeled as an ML task 2. Model the task into a suitable ML scenario 3. Given the scenario, determine a suitable model 4. Feature engineering 5. Parameter learning 6. Evaluation (e.g. cross validation) 7. Apply to unseen data
  51. 51. Final Remark: The (Weak) AI Era Has Arrived • Big Data, Knowledge Representation, Machine Learning, AI (Scientific American, Taiwan edition, 2015)
  52. 52. So, What’s Next for ML & AI? -My three bold predictions about the future Halle Tecco
  53. 53. On the AI Revolution • Machines will take over some non-trivial jobs that used to be done by human beings – Those jobs (unfortunately?) include data analytics
  54. 54. On Decision Making • The decision makers will gradually shift from humans to machines (Scientific American, Taiwan edition, 2015)
  55. 55. On What Comes Next for Machine Learning • The evolution of human intelligence: memorizing → learning → discovering • Likewise for machines: machine memorizing → machine learning → machine discovery • Machine Discovery (Shou-De Lin, Fall 2016, NTU CSIE) Thank you for your attention!!
