
Simple representations for learning: factorizations and similarities

Real-life data seldom comes in the ideal form for statistical learning.

This talk focuses on high-dimensional problems for signals and
discrete entities: when dealing with many correlated signals or
entities, it is useful to extract representations that capture these
correlations.

Matrix factorization models provide simple but powerful representations. They are used in recommender systems across discrete entities such as users and products, or to learn good dictionaries to represent images. However, they entail large computing costs on very high-dimensional data: databases with many products, or high-resolution images. I will present an
algorithm to factorize huge matrices based on stochastic subsampling that gives up to 10-fold speed-ups [1].
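
For context, the online dictionary-learning algorithm that this subsampling approach builds on is available in scikit-learn; a minimal sketch on synthetic data (random matrices standing in for image patches or signals) could look as follows.

```python
# A minimal sketch, assuming scikit-learn: the online dictionary-learning algorithm of
# Mairal et al. (2010), which the subsampling approach of [1] accelerates. Random data
# stands in for real signals or image patches.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.RandomState(0)
X = rng.randn(10_000, 256)            # n samples x p features (e.g. 16x16 patches)

dico = MiniBatchDictionaryLearning(
    n_components=50,                  # number of dictionary atoms
    alpha=1.0,                        # sparsity penalty on the codes
    batch_size=256,                   # data is streamed in mini-batches
    random_state=0,
)
codes = dico.fit_transform(X)         # sparse codes, shape (10_000, 50)
atoms = dico.components_              # learned dictionary, shape (50, 256)
print(codes.shape, atoms.shape)
```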

With discrete entities, the explosion of dimensionality may be due to variations in how a smaller number of categories are represented. Such a problem of "dirty categories" is typical of uncurated data sources. I will discuss how encoding this data based on similarities recovers a useful category structure with no preprocessing. I will show how it interpolates between one-hot encoding and techniques used in character-level natural language processing [2].


[1] A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66(1):113–128.

[2] P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018.

Simple representations for learning: factorizations and similarities

  1. 1. Simple representations for learning: factorizations and similarities. Gaël Varoquaux
  2. 2. Settings: very high dimensionality - signals (images, spectra) - many entities (customers, products) - non-standardized categories (typos, variants). Exploit links & redundancy across features. G Varoquaux 2
  3. 3. 1 Factorizing huge matrices 2 Encoding with similarities G Varoquaux 3
  4. 4. 1 Factorizing huge matrices, with A. Mensch, J. Mairal, B. Thirion [Mensch... 2016, 2017]. [Diagram: matrix factorization Y (samples × features) ≈ E · S.] Challenge: scalability. 1 Intuitions, 2 Experiments, 3 Algorithms, 4 Proof. G Varoquaux 4
  5. 5. 1 Real world data: recommender systems. Product ratings: millions of entries, hundreds of thousands of products and users; a large sparse users × products matrix. [Diagram: sparse rating matrix factorized as Y ≈ E · S.] G Varoquaux 5
  6. 6. 1 Real world data: brain imaging. Brain activity at rest: 1000 subjects with ∼ 100–10 000 samples each, images of dimensionality > 100 000. A dense matrix, large both ways. [Diagram: time × voxels matrix factorized as Y ≈ E · S.] G Varoquaux 6
  7. 7. 1 Scalable solvers for matrix factorizations. Large matrices = terabytes of data. $\operatorname{argmin}_{E,S} \|Y - E\,S^\top\|_{\mathrm{Fro}}^2 + \lambda\,\Omega(S)$ G Varoquaux 7
  8. 8. 1 Scalable solvers for matrix factorizations. Alternating minimization over data access, dictionary update, and code computation. [Diagram: data matrix with parts seen at t, seen at t+1, unseen at t.] Large matrices = terabytes of data. $\operatorname{argmin}_{E,S} \|Y - E\,S^\top\|_{\mathrm{Fro}}^2 + \lambda\,\Omega(S)$ G Varoquaux 7
  9. 9. 1 Scalable solvers for matrix factorizations. Large matrices = terabytes of data. $\operatorname{argmin}_{E,S} \|Y - E\,S^\top\|_{\mathrm{Fro}}^2 + \lambda\,\Omega(S)$. Rewrite as an expectation [Mairal... 2010]: $\operatorname{argmin}_{E} \sum_i \min_s \|Y_i - E\,s^\top\|_{\mathrm{Fro}}^2 + \lambda\,\Omega(s) = \operatorname{argmin}_E \mathbb{E}[f(E)]$ ⇒ optimize on approximations (sub-samples); a sketch follows below. G Varoquaux 7
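
To make the expectation rewriting concrete, here is a minimal NumPy sketch: the full objective is an average of per-sample losses, so a random subsample gives an estimate that can drive optimization. In this sketch $E$ denotes the dictionary and, as an assumption for simplicity, the penalty $\Omega$ is a squared L2 norm so the inner problem has a closed form; the talk also covers sparse penalties.

```python
# Minimal sketch: the objective is an average (expectation) of per-sample losses, so it
# can be estimated, and optimized, on random subsamples. Ridge penalty used for a
# closed-form inner solution; not the talk's exact setting.
import numpy as np

rng = np.random.RandomState(0)
n, p, k, lam = 1000, 50, 10, 0.1
Y = rng.randn(n, p)                   # data matrix, one sample Y_i per row
E = rng.randn(p, k)                   # current dictionary (p features x k atoms)

def per_sample_loss(y, E, lam):
    # min_s ||y - E s||^2 + lam ||s||^2, solved in closed form (ridge)
    s = np.linalg.solve(E.T @ E + lam * np.eye(E.shape[1]), E.T @ y)
    return np.sum((y - E @ s) ** 2) + lam * np.sum(s ** 2)

full_objective = np.mean([per_sample_loss(y, E, lam) for y in Y])
subsample = rng.choice(n, size=100, replace=False)
estimate = np.mean([per_sample_loss(Y[i], E, lam) for i in subsample])
print(f"full: {full_objective:.2f}   100-sample estimate: {estimate:.2f}")
```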
  10. 10. 1 Scalable solvers for matrix factorizations. Online matrix factorization: alternating minimization over data access, dictionary update, and code computation, streaming columns of the data matrix. [Diagram: data matrix with parts seen at t, seen at t+1, unseen at t.] G Varoquaux 7
  11. 11. 1 Scalable solvers for matrix factorizations. Same scheme, with run times for online matrix factorization [Mairal... 2010]: 159 h on 2 terabytes of data, or 12 h on 100 gigabytes of data. G Varoquaux 7
  12. 12. 1 Scalable solvers for matrix factorizations – SOMF. New subsampling algorithm: stream columns and subsample rows of the data matrix within the alternating minimization. Subsampled Online Matrix Factorization = SOMF. [Diagram: data matrix with parts seen at t, seen at t+1, unseen at t.] G Varoquaux 7
  13. 13. 1 Scalable solvers for matrix factorizations – SOMF. Run times: online matrix factorization [Mairal... 2010], 159 h on 2 terabytes of data or 12 h on 100 gigabytes; SOMF [Mensch... 2017], 13 h on 1 terabyte: a ×10 speed up. G Varoquaux 7
  14. 14. 1 Experimental results: resting-state fMRI. Run times: 159 h on 2 terabytes of data; 12 h on 100 gigabytes; 13 h on 1 terabyte. [Figure: test objective value (×10⁵) vs. time (100 s to 24 h) on HCP (3.5 TB), comparing SGD with the best step-size, online matrix factorization, and the proposed SOMF (r = 12).] SOMF = Subsampled Online Matrix Factorization. G Varoquaux 8
  15. 15. 1 Experimental results: large images. [Figure: test objective value vs. time on four problems: ADHD, sparse dictionary, 2 GB; Aviris, NMF, 103 GB; Aviris, dictionary learning, 103 GB; HCP, sparse dictionary, 2 TB. Methods: best step-size SGD, OMF, and SOMF with subsampling ratios r = 1, 4, 6, 8, 12, 24.] SOMF = Subsampled Online Matrix Factorization. G Varoquaux 9
  16. 16. 1 Experimental results: recommender system SOMF = Subsampled Online Matrix Factorization G Varoquaux 10
  17. 17. 1 Algorithm: online matrix factorization (prior art). Stream samples $x_t$ [Mairal... 2010]: 1. Compute code: $\alpha_t = \operatorname{argmin}_{\alpha \in \mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha)$. 2. Update the surrogate function: $g_t(D) = \frac{1}{t}\sum_{i=1}^{t} \|x_i - D\alpha_i\|_2^2 = \operatorname{trace}\big(\frac{1}{2} D^\top D A_t - D^\top B_t\big)$ up to a constant, with $A_t = (1-\frac{1}{t})A_{t-1} + \frac{1}{t}\alpha_t\alpha_t^\top$ and $B_t = (1-\frac{1}{t})B_{t-1} + \frac{1}{t}x_t\alpha_t^\top$. 3. Minimize the surrogate: $D_t = \operatorname{argmin}_{D \in \mathcal{C}} g_t(D)$, with gradient $\nabla g_t = D A_t - B_t$. G Varoquaux 11
  18. 18. 1 Algorithm: online matrix factorization (prior art). Same algorithm, with the remark that the surrogate $g_t(D)$ majorizes $\mathbb{E}_x\, l(x, D)$ and that the stored codes $\alpha_i$ are used rather than re-optimized ⇒ Stochastic Majorization-Minimization; no nasty hyper-parameters. G Varoquaux 11
  19. 19. 1 Algorithm: online matrix factorization (prior art). Same algorithm, with complexities: the code computation has a complexity that depends on p, and both the surrogate update and the surrogate minimization are O(p). A sketch of the loop follows below. G Varoquaux 11
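
A compact sketch of this online loop on synthetic data: the code step is delegated to scikit-learn's Lasso, the statistics $A_t$ and $B_t$ are running averages, and the surrogate is decreased by one pass of block coordinate descent over the atoms of $D$, each projected back onto the unit ball. Variable names mirror the slide; this is an illustrative sketch, not a tuned implementation.

```python
# Sketch of the online matrix-factorization loop (in the spirit of Mairal et al. 2010).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
p, k, lam = 100, 10, 0.1
D = rng.randn(p, k)
D /= np.linalg.norm(D, axis=0)                 # unit-norm initial atoms
A = np.zeros((k, k))                           # A_t: running average of alpha alpha^T
B = np.zeros((p, k))                           # B_t: running average of x alpha^T

for t in range(1, 1001):
    x = rng.randn(p)                           # stream one sample x_t
    # 1. code: alpha_t = argmin ||x - D alpha||^2 + lam ||alpha||_1
    alpha = Lasso(alpha=lam / (2 * p), fit_intercept=False,
                  max_iter=2000).fit(D, x).coef_
    # 2. surrogate statistics (running averages with weight 1/t)
    A += (np.outer(alpha, alpha) - A) / t
    B += (np.outer(x, alpha) - B) / t
    # 3. surrogate minimization: block coordinate descent on the atoms
    for j in range(k):
        if A[j, j] > 1e-12:
            D[:, j] += (B[:, j] - D @ A[:, j]) / A[j, j]
            D[:, j] /= max(1.0, np.linalg.norm(D[:, j]))   # project on the unit ball
```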
  20. 20. 1 Sub-sample features. Data stream: $(x_t)_t$ → masked $(M_t x_t)_t$; dimension: p → s. Use only $M_t x_t$ in the computations → complexity in O(s). [Diagram: stream $M_t x_t$, ignore the rest of the p × n data matrix.] Modify all steps to work on s features: code computation, surrogate update, surrogate minimization. G Varoquaux 12
  21. 21. 1 Sub-sample features. Original online MF: 1. code computation, $\alpha_t = \operatorname{argmin}_{\alpha\in\mathbb{R}^k} \|x_t - D_{t-1}\alpha\|_2^2 + \lambda\,\Omega(\alpha)$; 2. surrogate aggregation, $A_t = \frac{1}{t}\sum_{i=1}^{t}\alpha_i\alpha_i^\top$, $B_t = B_{t-1} + \frac{1}{t}(x_t\alpha_t^\top - B_{t-1})$; 3. surrogate minimization, $D^j \leftarrow p^{\perp}_{\mathcal{C}^r_j}\big(D^j - \frac{1}{(A_t)_{j,j}}(D A_t^j - B_t^j)\big)$. Our algorithm: 1. approximate code computation on masked data, $\beta_t \leftarrow (1-\gamma)\beta_{t-1} + \gamma\, D_{t-1}^\top M_t x_t$, $G_t \leftarrow (1-\gamma)G_{t-1} + \gamma\, D_{t-1}^\top M_t D_{t-1}$, $\alpha_t \leftarrow \operatorname{argmin}_{\alpha\in\mathbb{R}^k} \frac{1}{2}\alpha^\top G_t\alpha - \alpha^\top\beta_t + \lambda\,\Omega(\alpha)$; 2. surrogate aggregation with averaging, $A_t = \frac{1}{w_t}\alpha_t\alpha_t^\top + (1-\frac{1}{w_t})A_{t-1}$, $P_t\bar B_t \leftarrow (1-w_t)P_t\bar B_{t-1} + w_t P_t x_t\alpha_t^\top$, $P_t^{\perp}\bar B_t \leftarrow (1-w_t)P_t^{\perp}\bar B_{t-1} + w_t P_t^{\perp} x_t\alpha_t^\top$; 3. surrogate minimization on the sampled rows, $P_t D_t \leftarrow \operatorname{argmin}_{D^r\in\mathcal{C}^r} \frac{1}{2}\operatorname{tr}(D^{r\top} D^r \bar A_t) - \operatorname{tr}(D^{r\top} P_t\bar B_t)$. G Varoquaux 13
  22. 22. 1 Sub-sample features – variance reduction. Same comparison of the original online MF and the masked algorithm as on the previous slide; a toy sketch follows below. [Figure: test objective function vs. time, zoomed relative to the lowest value, for subsampling ratios r = 12 and r = 24 versus no subsampling, and for the code-computation variants: no subsampling, averaged estimators, masked loss.] G Varoquaux 13
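
To convey the structure of the masked updates, here is a toy sketch in the spirit of SOMF rather than the exact algorithm above: as simplifying assumptions, a ridge penalty gives a closed-form code, the averaging weight is simply $1/t$, and only the masked rows of $B$ and $D$ are read and written at each step, so the per-iteration cost scales with the subset size $s$ instead of $p$. The reference implementation is the modl package linked at the end of the talk.

```python
# Toy illustration of subsampled (masked) online matrix factorization; not the exact
# SOMF algorithm of [Mensch et al. 2017].
import numpy as np

rng = np.random.RandomState(0)
p, k, s, lam = 10_000, 10, 500, 0.1            # s features touched per step, s << p
D = rng.randn(p, k) / np.sqrt(p)
G = np.eye(k)                                  # running estimate of D^T D
beta = np.zeros(k)                             # running estimate of D^T x_t
A = np.zeros((k, k))
B = np.zeros((p, k))

for t in range(1, 501):
    x = rng.randn(p)                           # one streamed sample
    mask = rng.choice(p, size=s, replace=False)
    gamma = 1.0 / t
    # 1. approximate code from the masked view (rescaled by p/s)
    G = (1 - gamma) * G + gamma * (p / s) * D[mask].T @ D[mask]
    beta = (1 - gamma) * beta + gamma * (p / s) * D[mask].T @ x[mask]
    alpha = np.linalg.solve(G + lam * np.eye(k), beta)
    # 2. surrogate statistics; B is refreshed only on the masked rows
    A += (np.outer(alpha, alpha) - A) / t
    B[mask] += (np.outer(x[mask], alpha) - B[mask]) / t
    # 3. dictionary update restricted to the masked rows
    for j in range(k):
        if A[j, j] > 1e-12:
            D[mask, j] += (B[mask, j] - D[mask] @ A[:, j]) / A[j, j]
    D /= np.maximum(1.0, np.linalg.norm(D, axis=0))   # keep every atom in the unit ball
```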
  23. 23. 1 Why does it work? Objective: $D^\star = \operatorname{argmin}_{D\in\mathcal{C}} \mathbb{E}_x\,l(x, D)$, where $l(x, D) = \min_\alpha f(x, D, \alpha)$. Algorithm (online matrix factorization): the surrogate $g_t(D)$ is a majorant of $\mathbb{E}_x\,l(x, D)$, and the stored codes $\alpha_i$ are used rather than the optimal $\alpha^\star$ ⇒ Stochastic Majorization-Minimization [Mairal 2013]. G Varoquaux 14
  24. 24. 1 Why does it work? Same setting. [Diagram: SMM alternates surrogate computation and full minimization.] G Varoquaux 14
  25. 25. 1 Stochastic Approximate Majorization-Minimization. Same objective and algorithm. [Diagram: SMM alternates surrogate computation and full minimization; SAMM alternates surrogate computation, surrogate approximation, and partial minimization.] G Varoquaux 14
  26. 26. [Diagram: Y (samples × features) ≈ E · S.] Massive matrix factorization via subsampling: subsampling features ⇒ doubly stochastic; 10x speed ups on a fast algorithm; analysis via stochastic approximate majorization-minimization; conclusive on various high-dimensional problems. G Varoquaux 15
  27. 27. 2 Encoding with similarities, with P. Cerda and B. Kégl [Cerda... 2018]. When categories create a huge dimensionality. G Varoquaux 16
  28. 28. 2 Encoding with similarities, with P. Cerda and B. Kégl [Cerda... 2018]. Machine learning: let $X \in \mathbb{R}^{n\times p}$. The real world:
  Gender | Date Hired | Employee Position Title
  M | 09/12/1988 | Master Police Officer
  F | 11/19/1989 | Social Worker IV
  M | 07/16/2007 | Police Officer III
  F | 02/05/2007 | Police Aide
  M | 01/13/2014 | Electrician I
  F | 06/26/2006 | Social Worker III
  F | 01/26/2000 | Library Assistant I
  M | 11/22/2010 | Library Assistant I
  G Varoquaux 16
  29. 29. 2 Encoding with similarities. Same example table. A data cleaning problem? A feature engineering problem? A problem of representations in high dimension. G Varoquaux 16
  30. 30. 2 The problem of “dirty categories” Non-curated categorical entries Employee Position Title Master Police Officer Social Worker IV Police Officer III Police Aide Electrician I Bus Operator Bus Operator Social Worker III Library Assistant I Library Assistant I Overlapping categories “Master Police Officer”, “Police Officer III”, “Police Officer II”... High cardinality 400 unique entries in 10 000 rows Rare categories Only 1 “Architect III” New categories in test set G Varoquaux 17
  31. 31. 2 Dirty categories in the wild Employee Salaries: salary information for employees of Montgomery County, Maryland. Employee Position Title Master Police Officer Social Worker IV ... G Varoquaux 18
  32. 32. 2 Dirty categories in the wild. Employee Salaries: salary information for employees of Montgomery County, Maryland. Open Payments: payments by health care companies to medical doctors or hospitals. Company name and frequency: Pfizer Inc., 79,073; Pfizer Pharmaceuticals LLC, 486; Pfizer International LLC, 425; Pfizer Limited, 13; Pfizer Corporation Hong Kong Limited, 4; Pfizer Pharmaceuticals Korea Limited, 3; ... G Varoquaux 18
  33. 33. 2 Dirty categories in the wild Employee Salaries: salary information for employees of Montgomery County, Maryland. Open Payments: payments by health care companies to medical doctors or hospitals. Medical charges: patient discharges: utilization, payment, and hospital-specific charges across 3 000 US hospitals. ... Nothing on UCI machine-learning data repository G Varoquaux 18
  34. 34. 2 Dirty categories in the wild. Employee Salaries: salary information for employees of Montgomery County, Maryland. Open Payments: payments by health care companies to medical doctors or hospitals. Medical charges: patient discharges: utilization, payment, and hospital-specific charges across 3 000 US hospitals. ... Nothing on the UCI machine-learning data repository. Cardinality slowly increases with the number of rows. [Figure: number of categories vs. number of rows (100 to 1M) for beer reviews, road safety, traffic violations, midwest survey, open payments, employee salaries, and medical charges, with reference curves √n and 5 log₂(n).] This creates a high-dimensional learning problem. G Varoquaux 18
  35. 35. 2 Dirty categories in the wild Employee Salaries: salary information for employees of Montgomery County, Maryland. Open Payments: payments by health care companies to medical doctors or hospitals. Medical charges: patient discharges: utilization, payment, and hospital-specific charges across 3 000 US hospitals. ... Nothing on UCI machine-learning data repository Our goal: a statistical view of supervised learning on dirty categories The statistical question should inform curation Pfizer Corporation Hong Kong =? Pfizer Pharmaceuticals Korea G Varoquaux 18
  36. 36. 2 Related work: Database cleaning Recognizing / merging entities Record linkage: matching across different (clean) tables Deduplication/fuzzy matching: matching in one dirty table Techniques [Fellegi and Sunter 1969] Supervised learning (known matches) Clustering Expectation Maximization to learn a metric Outputs a “clean” database G Varoquaux 19
  37. 37. 2 Related work: natural language processing Stemming / normalization Set of (handcrafted) rules Need to be adapted to new language / new domains G Varoquaux 20
  38. 38. 2 Related work: natural language processing. Stemming / normalization: a set of (handcrafted) rules; needs to be adapted to new languages / new domains. Semantics: relate different discrete objects; formal semantics (entity resolution in knowledge bases); distributional semantics: "a word is characterized by the company it keeps". G Varoquaux 20
  39. 39. 2 Related work: natural language processing. Stemming / normalization: a set of (handcrafted) rules; needs to be adapted to new languages / new domains. Semantics: relate different discrete objects; formal semantics (entity resolution in knowledge bases); distributional semantics: "a word is characterized by the company it keeps". Character-level NLP: for entity resolution [Klein... 2003], for semantics [Bojanowski... 2017]; "London" & "Londres" may carry different information. G Varoquaux 20
  40. 40. 2 Similarity encoding: a simple solution Adding similarities to one-hot encoding 1. One-hot encoding maps categories to vector spaces 2. String similarities capture information G Varoquaux 21
  41. 41. 2 Similarity encoding: a simple solution. One-hot encoding (columns: London, Londres, Paris): Londres → (0, 1, 0); London → (1, 0, 0); Paris → (0, 0, 1). Encoded matrix $B_X \in \mathbb{R}^{n\times p}$: p grows fast; new categories? link categories? G Varoquaux 22
  42. 42. 2 Similarity encoding: a simple solution. One-hot encoding (as above): p grows fast; new categories? link categories? Similarity encoding (columns: London, Londres, Paris): Londres → (0.3, 1.0, 0.0); London → (1.0, 0.3, 0.0); Paris → (0.0, 0.0, 1.0); the 0.3 is the string distance(Londres, London). G Varoquaux 22
  43. 43. 2 Some string similarities. Levenshtein: number of edit operations on one string to match the other. Jaro-Winkler: $d_{\mathrm{jaro}}(s_1, s_2) = \frac{m}{3|s_1|} + \frac{m}{3|s_2|} + \frac{m-t}{3m}$, where m is the number of matching characters and t the number of character transpositions. n-gram similarity: an n-gram is a group of n consecutive characters; similarity = (number of n-grams in common) / (total number of n-grams). A sketch follows below. G Varoquaux 23
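
A minimal sketch of the n-gram similarity and of similarity encoding on the London/Londres/Paris example: each entry is encoded by its similarity to the reference categories seen at train time, and one-hot encoding is recovered when the similarity is the 0/1 equality indicator. The shared-over-total ratio is implemented here, as an assumption, as a Jaccard-style ratio on sets of character 3-grams.

```python
# n-gram similarity and similarity encoding, minimal sketch.
import numpy as np

def ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def ngram_similarity(a, b, n=3):
    # shared n-grams over total n-grams (Jaccard-style ratio)
    ga, gb = ngrams(a, n), ngrams(b, n)
    return len(ga & gb) / len(ga | gb) if (ga or gb) else 0.0

categories = ["London", "Londres", "Paris"]           # reference categories (train set)
entries = ["Londres", "London", "Paris", "Londons"]   # "Londons" is unseen yet encodable

encoded = np.array([[ngram_similarity(e, c) for c in categories] for e in entries])
print(np.round(encoded, 2))
# one row of similarities per entry, e.g. sim(Londres, London) ~ 0.3
```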
  44. 44. 2 Empirical study. Datasets with dirty categories (7 datasets, all open):
  dataset | # rows | # categories | count of least frequent category | prediction type
  medical charges | 160k | 100 | 613 | regression
  employee salaries | 9.2k | 385 | 1 | regression
  open payments | 100k | 973 | 1 | binary classification
  midwest survey | 2.8k | 1009 | 1 | multiclass classification
  traffic violations | 100k | 3043 | 1 | multiclass classification
  road safety | 10k | 4617 | 1 | binary classification
  beer reviews | 10k | 4634 | 1 | multiclass classification
  Experimental paradigm: cross-validation & measure prediction. Stupid simple. G Varoquaux 24
  45. 45. 2 Experiments: gradient boosted trees (slides 45-49, figure built up progressively). [Figure: prediction scores on medical charges, employee salaries, open payments, midwest survey, traffic violations, road safety, and beer reviews, for similarity encoding with 3-gram, Levenshtein-ratio, and Jaro-Winkler similarities, compared to target encoding, one-hot encoding, and hash encoding; the last panel gives the average ranking of each encoder across datasets.] G Varoquaux 25
  50. 50. 2 Experiments: ridge. [Figure: prediction scores on the seven datasets for the same encoders, with the average ranking across datasets.] The slide highlights similarity encoding with the 3-gram similarity. G Varoquaux 26
  51. 51. 2 Experiments: different learners. [Figure: one-hot encoding vs. 3-gram similarity encoding across learners (random forest, gradient boosting, ridge CV, logistic CV) on the seven datasets, with average rankings.] G Varoquaux 27
  52. 52. 2 This is just a string similarity? What similarity is defined by our encoding (kernel)? $\langle s_i, s_j\rangle_{\mathrm{sim}} = \sum_{l=1}^{k} \mathrm{sim}(s_i, s^{(l)})\,\mathrm{sim}(s_j, s^{(l)})$: a sum over the reference categories $s^{(l)}$; the categories in the train set shape the similarity. A short sketch follows below. G Varoquaux 28
  53. 53. 2 This is just a string similarity? Same kernel as above. [Figure: prediction scores on the seven datasets, now also including a bag-of-3-grams feature map and MDV and target encodings, with average rankings across datasets.] Similarity encoding outperforms a plain feature map capturing string similarities. G Varoquaux 28
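
The kernel above is just the inner product between two similarity-encoded rows; a minimal sketch, where any string similarity (for instance the `ngram_similarity` function from the earlier snippet) can be plugged in as `sim`:

```python
# Kernel induced by similarity encoding: sum over the k reference categories of
# products of similarities.
def encoding_kernel(s_i, s_j, reference_categories, sim):
    return sum(sim(s_i, c) * sim(s_j, c) for c in reference_categories)

# e.g. encoding_kernel("Londres", "London", ["London", "Londres", "Paris"], ngram_similarity)
```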
  54. 54. 2 Reducing the dimensionality. $B_X \in \mathbb{R}^{n\times p}$, but p is large: statistical problems, computational problems. G Varoquaux 29
  55. 55. 2 Reducing the dimensionality (slides 55-59, figure built up progressively). [Figure: prediction scores on employee salaries (k=355), open payments (k=910), midwest survey (k=644), traffic violations (k=2588), road safety (k=3988), and beer reviews (k=4015), where k is the cardinality of the categorical variable, comparing full one-hot encoding and full 3-gram similarity encoding with reduced versions at d = 30, 100, 300 obtained by random projections, keeping the most frequent categories, K-means, and deduplication with K-means; average rankings across datasets are given.] Also considered: factorizing the one-hot encoding (multiple correspondence analysis) and hashing n-grams (for speed, accepting collisions). G Varoquaux 29
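
Two of the reductions listed above, sketched with scikit-learn: hashing character 3-grams into a fixed number of columns, and a Gaussian random projection of a (here synthetic) similarity-encoded matrix. Both keep the dimensionality independent of the number of distinct categories; the exact pipelines used in the slides may differ.

```python
# Dimensionality reduction sketches: n-gram hashing and random projection.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.random_projection import GaussianRandomProjection

entries = ["Master Police Officer", "Police Officer III", "Police Officer II",
           "Social Worker IV", "Library Assistant I"]

# (a) hash character 3-grams into d = 128 columns (fast, accepts collisions)
hasher = HashingVectorizer(analyzer="char_wb", ngram_range=(3, 3), n_features=128)
print(hasher.transform(entries).shape)          # (5, 128)

# (b) random projection of an n x k similarity encoding down to d = 3 columns
rng = np.random.RandomState(0)
X_sim = rng.rand(5, 400)                        # stand-in for a similarity-encoded matrix
proj = GaussianRandomProjection(n_components=3, random_state=0)
print(proj.fit_transform(X_sim).shape)          # (5, 3)
```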
  60. 60. @GaelVaroquaux Representations in high dimension factorizations and similarities signals, entities, categories
  61. 61. @GaelVaroquaux Representations in high dimension factorizations and similarities signals, entities, categories Factorizations Costly in large-p, large-n Sub-sampling p gives huge speed ups Stochastic Approximate Majorization-Minimization https://github.com/arthurmensch/modl
  62. 62. @GaelVaroquaux Representations in high dimension: factorizations and similarities; signals, entities, categories. Factorizations: https://github.com/arthurmensch/modl. Similarity encoding for categories: no separate deduplication / cleaning step; creates a category-aware metric space. https://dirty-cat.github.io DirtyData project (hiring)
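
A hedged usage sketch of the dirty-cat project referenced above, assuming it exposes a scikit-learn-style SimilarityEncoder transformer; the project page (https://dirty-cat.github.io) documents the current API, which may have evolved since this talk.

```python
# Assumption: dirty_cat provides a SimilarityEncoder with a fit_transform method.
from dirty_cat import SimilarityEncoder

X = [["Master Police Officer"], ["Police Officer III"], ["Social Worker IV"]]
enc = SimilarityEncoder()
X_enc = enc.fit_transform(X)      # one column of similarities per reference category
print(X_enc.shape)
```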
  63. 63. References I. P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5(1):135–146, 2017. P. Cerda, G. Varoquaux, and B. Kégl. Similarity encoding for learning with dirty categorical variables. Machine Learning, pages 1–18, 2018. I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183, 1969. D. Klein, J. Smarr, H. Nguyen, and C. D. Manning. Named entity recognition with character-level models. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, Volume 4, pages 180–183. Association for Computational Linguistics, 2003. J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems, 2013.
  64. 64. References II J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19, 2010. A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Dictionary learning for massive matrix factorization. In ICML, 2016. A. Mensch, J. Mairal, B. Thirion, and G. Varoquaux. Stochastic subsampling for factorizing huge matrices. IEEE Transactions on Signal Processing, 66(1):113–128, 2017.
