
Hierarchical Classification by Jurgen Van Gael


  1. Hierarchical Classification, Jurgen Van Gael.
  2. About
     •  Computer scientist with a background in ML.
     •  London Machine Learning Meetup.
     •  Founder of the Math.NET numerical library.
     •  Previously at Microsoft Research.
     •  Data science team lead at Rangespan.
  3. Taxonomy Classification
     •  Input: raw product data.
     •  Output: classification models, classified product data.
     (Figure: taxonomy tree. ROOT → Electronics → Audio → {Audio Cables, Amps, …}, Computers, …; Clothing → {Pants, T-Shirts, …}; Toys → {Model Rockets, …}; …)
  4. Pipeline: Data Collection → Feature Extraction → Training → Testing → Labelling.
  5. Feature Extraction
  6. Feature extraction example.
     Input: Name: INK-M50 Black Ink Cartridge (600 pages); Manufacturer: Samsung; Description: null; Label: toner-inkjet-cartridges.
     Output: "category": "toner-inkjet-cartridges", "features": ["cartridge", "samsung", "black", "ink", "ink-m50", "pages"]
     Feature extraction steps (a sketch follows below):
     •  Text cleaning (stopword removal, lexicalisation).
     •  Unigram + bigram features.
     •  LDA topic features.
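A minimal sketch of the cleaning and n-gram steps, assuming NLTK's English stopword list; the function name build_features and the exact tokenisation rules are illustrative, not the deck's actual code.

    import re
    from nltk.corpus import stopwords  # assumes the nltk stopword data is downloaded

    STOPWORDS = set(stopwords.words("english"))

    def build_features(name, manufacturer, description):
        """Turn raw product text into unigram + bigram string features."""
        text = " ".join(filter(None, [name, manufacturer, description])).lower()
        tokens = [t for t in re.findall(r"[a-z0-9-]+", text) if t not in STOPWORDS]
        bigrams = ["%s_%s" % (a, b) for a, b in zip(tokens, tokens[1:])]
        return tokens + bigrams

    # build_features("INK-M50 Black Ink Cartridge (600 pages)", "Samsung", None)
    # -> ['ink-m50', 'black', 'ink', 'cartridge', '600', 'pages', 'samsung', ...]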
  7. http://radimrehurek.com/gensim
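Slide 7 points at gensim for the LDA topic features; a minimal sketch of deriving them (the toy corpus and the choice of 100 topics are illustrative assumptions):

    from gensim import corpora, models

    # token lists as produced by the feature-extraction step above
    texts = [["cartridge", "samsung", "black", "ink"],
             ["t-shirt", "cotton", "black"]]

    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]

    # 100 topics is an illustrative choice, not a value from the deck
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=100)

    # topic features for one product: a list of (topic_id, weight) pairs
    topic_features = lda[dictionary.doc2bow(["samsung", "ink", "cartridge"])]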
  8. Training, Testing & Labelling
  9. Hierarchical Classification
     (Figure: a two-level tree over classes A–E collapsed into a single flat 4 (5)-way multiclass classification.)
  10. Hierarchical Classification
      (Figure: the same tree solved hierarchically: a 2-way multiclass classification at the top level followed by a 3-way one below, i.e. 2 + 3 way multiclass classification.)
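A minimal sketch of the per-node set-up, using scikit-learn's LogisticRegression as a stand-in for the per-node models (the deck trains with Wapiti); the node names and toy training data are illustrative.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def train_node(docs, labels):
        """Train one multiclass classifier for a single taxonomy node."""
        model = make_pipeline(CountVectorizer(), LogisticRegression())
        model.fit(docs, labels)
        return model

    # one classifier per internal node of the taxonomy
    root = train_node(["black ink cartridge", "cotton t-shirt"],
                      ["electronics", "clothing"])
    electronics = train_node(["black ink cartridge", "audio cable"],
                             ["toner-inkjet-cartridges", "audio-cables"])

    # classify by walking the tree: decide at the root, then at the chosen child
    doc = ["samsung ink cartridge"]
    top = root.predict(doc)[0]          # 'electronics'
    leaf = electronics.predict(doc)[0]  # 'toner-inkjet-cartridges'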
  11. Naïve Bayes, Neural Network, Logistic Regression, Support Vector Machines, … ?
  12. Logistic Regression - Model

      word        printer-ink    printer-hardware
      cartridge    4.0            0.3
      the          0.0            0.0
      samsung      0.5            0.5
      black        0.5            0.3
      printer     -1.0            2.0
      ink          5.0           -1.7
      …            …              …

      For each class, for each feature present, add up the weights, then exponentiate and normalize. For the example above: Σ = 10.0 vs Σ = -0.6, giving Pr = 0.99997 vs 0.00003.
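The exponentiate-and-normalize step is a softmax over the per-class weight sums; a minimal check of the slide's numbers:

    import math

    def softmax(scores):
        """Exponentiate each score and normalize so the results sum to one."""
        exps = [math.exp(s) for s in scores]
        total = sum(exps)
        return [e / total for e in exps]

    # weight sums for the "samsung black ink cartridge" example:
    #   printer-ink:      4.0 + 0.0 + 0.5 + 0.5 + 5.0 = 10.0
    #   printer-hardware: 0.3 + 0.0 + 0.5 + 0.3 - 1.7 = -0.6
    print(softmax([10.0, -0.6]))  # ~[0.99998, 0.00002], the slide's Pr row up to rounding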
  13. Logistic Regression - Inference
      •  Optimise using Wapiti.
      •  Hyperparameter optimisation using grid search (sketched after the link below).
      •  Using a development set to stop training?
  14. http://wapiti.limsi.fr/
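A minimal grid-search sketch around an external trainer such as Wapiti; the train_and_error helper, the file names, and the penalty grid are assumptions, not Wapiti flags or values from the deck.

    import random
    from itertools import product

    def train_and_error(train_file, dev_file, l1, l2):
        """Hypothetical stand-in: the real version would train one model
        (e.g. by shelling out to Wapiti) and return its error on the dev set."""
        return random.random()  # placeholder so the sketch runs end to end

    best = None
    # 4 x 5 = 20 grid points, matching the "20 hypers on grid" slide later on
    for l1, l2 in product([0.01, 0.1, 1.0, 10.0],
                          [0.01, 0.1, 1.0, 10.0, 100.0]):
        err = train_and_error("train.txt", "dev.txt", l1, l2)
        if best is None or err < best[0]:
            best = (err, l1, l2)

    print("best dev error %.4f with l1=%s, l2=%s" % best)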
  15. (Figure: the top of the taxonomy, ROOT → {Electronics, Clothing}.)
  16. Cross Validation & Calibration
      Cross validation:
      •  Estimate classifier errors.
      •  DO NOT:
         o  Test on training data.
         o  Leave data aside.
      (Figure: the training data split into five folds, with per-fold errors of 1.2%, 1.1%, 1.2%, 1.2% and 1.3%, averaging to an estimated error of 1.2%.)
      Calibration:
      •  Are my probability estimates correct?
      •  Computation:
         o  Take the data points with p(.|x) = 0.9.
         o  Check that about 90% of their labels were correct.
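A minimal sketch of that calibration check, binning predictions into tenths by predicted probability; the toy numbers are illustrative:

    from collections import defaultdict

    def calibration_table(probs, correct):
        """Group predictions by predicted probability (in tenths) and compare
        against observed accuracy per bin; well calibrated means they match."""
        bins = defaultdict(list)
        for p, ok in zip(probs, correct):
            bins[int(p * 10) / 10.0].append(ok)
        return {b: sum(oks) / float(len(oks)) for b, oks in sorted(bins.items())}

    # points predicted with p(.|x) around 0.9 should be correct ~90% of the time
    print(calibration_table([0.92, 0.91, 0.90, 0.95], [True, True, False, True]))
    # -> {0.9: 0.75}: only 75% correct, so this toy classifier is overconfident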
  17. Using Bayes' rule to chain classifiers down the tree: the probability of a leaf category is the product of the per-node conditional probabilities along its path, e.g. p(toner-inkjet-cartridges | text) = p(electronics | text) × p(toner-inkjet-cartridges | electronics, text).
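A minimal sketch of that chaining, multiplying per-node probabilities along a root-to-leaf path; the dictionary-of-models interface and the hard-coded probabilities are illustrative:

    def cascade_probability(node_models, path, text):
        """Multiply conditional probabilities along a root-to-leaf path.
        node_models maps a node name to a function text -> {child: probability}."""
        prob, parent = 1.0, "ROOT"
        for child in path:
            prob *= node_models[parent](text)[child]
            parent = child
        return prob

    # illustrative hard-coded "models" for a two-level tree
    node_models = {
        "ROOT": lambda text: {"electronics": 0.9, "clothing": 0.1},
        "electronics": lambda text: {"toner-inkjet-cartridges": 0.8,
                                     "audio-cables": 0.2},
    }

    p = cascade_probability(node_models,
                            ["electronics", "toner-inkjet-cartridges"],
                            "samsung ink cartridge")
    # p = 0.9 * 0.8 = 0.72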
  18. Active Learning
  19. (Figure: the root classifier assigns p(electronics | {text}) = 0.1 to a product: a low-confidence prediction.)
  20. •  High probability datapoints:
         o  Upload to production.
      •  Low probability datapoints (e.g. p(electronics | {text}) = 0.1):
         o  Subsample.
         o  Acquire more labels, e.g. via Mechanical Turk.
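A minimal sketch of that routing rule; the 0.9 threshold and the subsampling rate are assumptions, not values from the deck:

    import random

    def route(products, classify, threshold=0.9, subsample=0.1):
        """Split classified products into confident ones (straight to
        production) and a subsample of low-confidence ones to label."""
        production, needs_labels = [], []
        for product in products:
            label, prob = classify(product)  # best label and its probability
            if prob >= threshold:
                production.append((product, label))
            elif random.random() < subsample:
                needs_labels.append(product)  # e.g. send to Mechanical Turk
        return production, needs_labels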
  21. Implementation
  22. Implementation
      MongoDB → S3 Raw → S3 Training Data → S3 Models
      1. JSON export (MongoDB → S3 Raw).
      2. Feature extraction (S3 Raw → S3 Training Data).
      3. Training (S3 Training Data → S3 Models).
      4. Classification.
  23. Training MapReduce
      •  Dumbo on Hadoop.
      •  2000 classifiers.
      •  5-fold CV (+ full).
      •  20 hyperparameter settings on the grid.
      = 200,000 training runs (2000 × 5 × 20).
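A sketch of how such a job could be laid out in Dumbo's mapper/reducer style; the record format and the train_and_score helper are assumptions, not the deck's actual job:

    # each input record names one training run: node \t fold \t hyper-index
    # (2000 nodes x 5 folds x 20 hyperparameter settings = 200,000 records)

    def train_and_score(node, fold, hyper):
        """Hypothetical helper: train one classifier, return its CV error."""
        return 0.0  # placeholder

    def mapper(key, value):
        node, fold, hyper = value.split("\t")
        yield (node, hyper), train_and_score(node, int(fold), int(hyper))

    def reducer(key, errors):
        errors = list(errors)
        yield key, sum(errors) / len(errors)  # mean error across the folds

    if __name__ == "__main__":
        import dumbo
        dumbo.run(mapper, reducer)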
  24. Labelling
      •  128 chunks.
      •  Run the full classifier cascade on each chunk.
      (Figure: chunks 1, 2, 3, …, N, each pushed through the full tree of classifiers.)
  25. Thoughts
      •  Extras:
         o  Partial labelling: stop descending when the probability becomes low.
         o  Data ensemble learning.
      •  Most time was spent on feature engineering.
      •  Tie the parameters of the classifiers?
         o  "Frustratingly Easy Domain Adaptation", Hal Daumé III.
      •  Partially flattening the hierarchy for training?
