Scaling up practical learning algorithms (Lecture by Mikhail Bilenko)


Published in: Education, Technology

  1. Scaling Up Practical Learning Algorithms. Misha Bilenko, ALMADA Summer School, Moscow 2013
  2. Preliminaries: ML-in-four-slides
  3. ML Examples
  4. Key Prediction Models
  5. Learning: Training Predictors
  6. Big Learning: Large Datasets... and Beyond
     • Large training sets: many examples (worthwhile only if accuracy improves)
     • Large models: many features, ensembles, “deep” nets
     • Model selection: hyper-parameter tuning, statistical significance
     • Fast inference: structured prediction (e.g., speech)
     • Fundamental differences across settings
       – Learning vs. inference, input complexity vs. model complexity
       – Dataflow/computation and bottlenecks are highly algorithm- and task-specific
       – Rest of this talk: practical algorithm nuggets for (1) large training sets and (2) large models
  7. Dealing with Large Training Sets (I): SGD
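The body of this slide did not survive in the transcript, so the following is only a minimal sketch of stochastic gradient descent in general, not the lecture's formulation. It assumes a logistic-regression model with log loss; the function names, learning rate, and toy data are all hypothetical.

```python
import math
import random

def predict(w, x):
    """Logistic-regression probability for feature vector x."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_logistic(examples, n_features, epochs=200, lr=0.1, seed=0):
    """Plain SGD: update the weights after every single example."""
    rng = random.Random(seed)
    w = [0.0] * n_features
    order = list(range(len(examples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit examples in a fresh random order each epoch
        for i in order:
            x, y = examples[i]
            p = predict(w, x)
            # Gradient of the log loss for one example is (p - y) * x.
            for j in range(n_features):
                w[j] -= lr * (p - y) * x[j]
    return w

# Toy data (assumed): first feature is a constant bias term, label is 0 or 1.
data = [([1.0, 0.0], 0), ([1.0, 1.0], 0), ([1.0, 3.0], 1), ([1.0, 4.0], 1)]
weights = sgd_logistic(data, n_features=2)
```

Each update touches one example only, which is why the method suits training sets too large to process in a single batch.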
  8. Dealing with Large Training Sets (II): L-BFGS
  9. Dealing with Large Datasets (II): Trees
     • Rule-based prediction is natural and powerful (non-linear)
       – Play outside: if no rain and not too hot, or if snowing but not windy.
     • Trees hierarchically encode rule-based prediction
       – Nodes test features and split
       – Leaves produce predictions
       – Regression trees: numeric outputs
     • Ensembles combine tree predictions
     [Figure: an ensemble prediction formed by summing the outputs of individual trees (values shown: 0.05, 0.01, 0.7, …)]
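To illustrate how a tree encodes the "play outside" rule, here is a small sketch. The tuple representation, the feature names, and the particular tree shape are assumptions chosen for the example; other trees can encode the same rule.

```python
from itertools import product

# A node is (feature_name, subtree_if_false, subtree_if_true); a leaf is a bool.
SNOW_BRANCH = ("snowing", False, ("windy", True, False))
TREE = ("rain",
        ("hot", True, SNOW_BRANCH),  # no rain: play unless too hot
        SNOW_BRANCH)                 # rain: only the snow rule can still apply

def predict(tree, example):
    """Walk from the root to a leaf, testing one feature per node."""
    while not isinstance(tree, bool):
        feature, if_false, if_true = tree
        tree = if_true if example[feature] else if_false
    return tree

def rule(e):
    """The slide's rule written as a flat boolean expression."""
    return (not e["rain"] and not e["hot"]) or (e["snowing"] and not e["windy"])

# Enumerate all 16 feature combinations to compare tree and rule.
FEATURES = ["rain", "hot", "snowing", "windy"]
ALL_CASES = [dict(zip(FEATURES, values))
             for values in product([False, True], repeat=len(FEATURES))]
agreement = all(predict(TREE, e) == rule(e) for e in ALL_CASES)
```

Here `agreement` checks that the tree and the flat rule give the same answer on every input.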
  10. Tree Ensemble Zoo
      • Different models can define different types of:
        – Combiner function: voting vs. weighting
        – Leaf prediction models: constant vs. regression
        – Split conditions: single vs. multiple features
      • Examples (small biased sample, some are not tree-specific)
        – Boosting: AdaBoost, LogitBoost, GBM/MART, BrownBoost, Transform Regression
        – Random Forests: Random Subspaces, Bagging, Additive Groves, BagBoo
        – Beyond regression and binary classification: RankBoost, abc-mart, GBRank, LambdaMART, MatrixNet
  11. Tree Ensembles Are Rightfully Popular
      • State-of-the-art accuracy: web, vision, CRM, bio, …
      • Efficient at prediction time
        – Multithread evaluation of individual trees; optimize/short-circuit
      • Principled: extensively studied in statistics and learning theory
      • Practical
        – Naturally handle mixed, missing, (un)transformed data
        – Feature selection embedded in algorithm
        – Well-understood parameter sweeps
        – Scalable to extremely large datasets: rest of this section
  12. Naturally Parallel Tree Ensembles
      • No interaction when learning individual trees
        – Bagging: each tree trained on a bootstrap sample of data
        – Random forests: bootstrap plus subsample features at each split
        – For large datasets, each worker's local data replaces the bootstrap sample -> embarrassingly parallel
      [Figures: bagging tree construction; random forest tree construction]
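The independence of trees can be sketched as follows. This is an illustration under assumptions, not the lecture's code: `plan_tree` and its parameters are hypothetical, and only the sampling step is shown (no tree is actually grown).

```python
import random

def plan_tree(tree_id, n_examples, n_features, k, seed=0):
    """Sampling decisions for one tree, derived only from its own seed."""
    rng = random.Random(seed + tree_id)
    # Bagging: draw n_examples row indices with replacement.
    rows = [rng.randrange(n_examples) for _ in range(n_examples)]
    # Random forest: draw k distinct candidate features for a split
    # (a real learner would redraw these at every split).
    split_features = rng.sample(range(n_features), k)
    return {"rows": rows, "split_features": split_features}

# No tree reads another tree's state, so each call could run on a different worker.
plans = [plan_tree(t, n_examples=10, n_features=6, k=2) for t in range(3)]
```

Because each plan depends only on `tree_id` and the shared seed, a worker can build its tree without communicating with the others.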
  13. Boosting: Iterative Tree Construction
      “Best off-the-shelf classifier in the world” – Breiman
      • Reweight examples for each subsequent tree to focus on errors
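One concrete form of reweighting is the AdaBoost update, sketched below. The slide does not say which boosting variant it illustrates, so treating it as AdaBoost is an assumption, as are the function name and the toy data. Labels and predictions are taken to be in {-1, +1}, and the weighted error must lie strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, labels, predictions):
    """Return (new normalized weights, alpha) after one boosting round."""
    # Weighted error of the current tree.
    err = sum(w for w, y, h in zip(weights, labels, predictions) if y != h)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Shrink weights of correct examples, grow weights of mistakes.
    raw = [w * math.exp(-alpha if y == h else alpha)
           for w, y, h in zip(weights, labels, predictions)]
    total = sum(raw)
    return [w / total for w in raw], alpha

# Five equally weighted examples; the current tree gets the last one wrong.
labels      = [+1, +1, -1, -1, +1]
predictions = [+1, +1, -1, -1, -1]
new_weights, alpha = adaboost_reweight([0.2] * 5, labels, predictions)
```

After this update the misclassified examples carry half of the total weight, which is what forces the next tree to focus on them.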
  14. Efficient Tree Construction
      • Boosting is iterative: scaling up = parallelizing tree construction
      • For every node: pick best feature to split
        – For every feature: pick best split-point
      • For every potential split-point: compute gain
        – For every example in current node, add its gain contribution for given split
      • Key efficiency: limiting and ordering the set of considered split points
        – Continuous features: discretize into bins, splits = bin boundaries
        – Allows computing split values in a single pass over data
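The binned procedure above can be sketched for one feature at one node. This is a minimal illustration that assumes a regression tree with a squared-error gain; the gain formula, function names, and toy data are assumptions, not details from the slides.

```python
def build_histogram(bin_ids, targets, n_bins):
    """Single pass over the examples: per-bin count and target sum."""
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for b, y in zip(bin_ids, targets):
        counts[b] += 1
        sums[b] += y
    return counts, sums

def best_split(counts, sums):
    """Scan bin boundaries in order, reusing running left-side totals."""
    total_n, total_s = sum(counts), sum(sums)
    best_bin, best_gain = None, 0.0
    left_n, left_s = 0, 0.0
    for b in range(len(counts) - 1):  # split between bin b and bin b + 1
        left_n += counts[b]
        left_s += sums[b]
        right_n, right_s = total_n - left_n, total_s - left_s
        if left_n == 0 or right_n == 0:
            continue
        # Reduction in squared error from splitting at this boundary.
        gain = left_s ** 2 / left_n + right_s ** 2 / right_n - total_s ** 2 / total_n
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain

# Toy data: the feature is already discretized into 4 bins.
bin_ids = [0, 0, 1, 1, 2, 2, 3, 3]
targets = [1.0, 1.0, 1.0, 1.0, 5.0, 5.0, 5.0, 5.0]
counts, sums = build_histogram(bin_ids, targets, n_bins=4)
split_bin, split_gain = best_split(counts, sums)
```

The examples are read once to fill the histogram; evaluating every candidate split then costs time proportional to the number of bins, not the number of examples.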
  15. Binned Split Evaluation
      [Figure: split histograms indexed by features and bins]
  16. Tree Construction Visualized
      • Observation 1: a single pass is sufficient per tree level
      • Observation 2: data pass can iterate by-instance or by-feature
        – Supports horizontally or vertically partitioned data
      [Figure: panels A and B; histograms indexed by features and bins, data indexed by features and instances]
  17. Data-Distributed Tree Construction
      • Master
        1. Send workers current model and set of nodes to expand
        2. Wait to receive local split histograms from workers
        3. Aggregate local split histograms, select best split for every node
      • Worker
        2a. Pass through local data, aggregating split histograms
        2b. Send completed local histograms to master
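The histogram exchange in this protocol can be sketched in a single process. Everything here is illustrative: the shard layout, the function names, and the use of plain lists in place of network messages are assumptions, and only one feature at one node is shown.

```python
def local_histogram(shard, n_bins):
    """Worker step 2a: one pass over local (bin_id, target) pairs."""
    counts = [0] * n_bins
    sums = [0.0] * n_bins
    for b, y in shard:
        counts[b] += 1
        sums[b] += y
    return counts, sums

def aggregate(histograms):
    """Master step 3, first half: element-wise sum of worker histograms."""
    counts = [sum(col) for col in zip(*(h[0] for h in histograms))]
    sums = [sum(col) for col in zip(*(h[1] for h in histograms))]
    return counts, sums

# Three workers, each holding a horizontal slice of the examples.
shards = [
    [(0, 1.0), (1, 2.0)],
    [(1, 3.0), (2, 4.0)],
    [(0, 5.0)],
]
merged = aggregate([local_histogram(s, n_bins=3) for s in shards])
# The merged histogram matches one built over all the data at once.
single_pass = local_histogram([pair for s in shards for pair in s], n_bins=3)
```

Histograms are additive, so the master only needs the per-bin totals from each worker, not the examples themselves.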
  18. Feature-Distributed Tree Construction
      • Workers maintain per-instance index of current residuals and previous splits
      • Master
        1. Request workers to expand a set of nodes
        2. Wait to receive best per-feature splits from workers
        3. Select best feature-split for every node
        4. Request best splits’ workers to broadcast per-instance assignments and residuals
      • Worker
        2a. Pass through all instances for local features, aggregating split histograms for each node
        2b. Select local features’ best splits for each node, send to master
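A single-process sketch of the split selection in this protocol follows, simplified to one node (the root). The squared-error gain, the function names, and the toy columns are assumptions; the broadcast in master step 4 is not shown.

```python
def best_local_split(columns, residuals, n_bins):
    """Worker steps 2a-2b: best (gain, feature, bin) among locally held features."""
    total_n, total_s = len(residuals), sum(residuals)
    best = (0.0, None, None)
    for feature, bin_ids in columns.items():
        # 2a: one pass over all instances for this feature.
        counts, sums = [0] * n_bins, [0.0] * n_bins
        for b, r in zip(bin_ids, residuals):
            counts[b] += 1
            sums[b] += r
        # 2b: scan bin boundaries for this feature's best split.
        left_n, left_s = 0, 0.0
        for b in range(n_bins - 1):
            left_n += counts[b]
            left_s += sums[b]
            right_n, right_s = total_n - left_n, total_s - left_s
            if left_n == 0 or right_n == 0:
                continue
            gain = left_s ** 2 / left_n + right_s ** 2 / right_n - total_s ** 2 / total_n
            if gain > best[0]:
                best = (gain, feature, b)
    return best

def master_select(worker_results):
    """Master step 3: keep the candidate with the highest gain."""
    return max(worker_results, key=lambda result: result[0])

# Every worker holds the residuals but only its own feature columns.
residuals = [1.0, 1.0, 5.0, 5.0]
worker_a = {"f0": [0, 1, 0, 1]}  # uninformative feature
worker_b = {"f1": [0, 0, 1, 1]}  # separates low from high residuals
winner = master_select([best_local_split(w, residuals, n_bins=2)
                        for w in (worker_a, worker_b)])
```

Only one small tuple per worker travels to the master, in contrast to the full histograms sent in the data-distributed scheme.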
  19. Learning with Many Features
      • How many is “many”? At least billions.
      • Exhibit A: English n-grams
        – Unigrams: 13 million
        – Bigrams: 315 million
        – Trigrams: 977 million
        – Fourgrams: 1.3 billion
        – Fivegrams: 1.2 billion
      • Exhibit B: search ads, 3 months
        – User IDs: hundreds of millions
        – Listing IDs: hundreds of millions
        – Queries: tens to hundreds of millions
        – User x Listing x Query: billions
      • Can we scale up linear learners? Yes, but there are limits:
        – Retraining: ideally real-time, definitely not more than a couple of hours
        – Modularity: ideally fit in memory, definitely decompose elastically
  20. Towards infinite features: Feature hashing …
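The rest of this slide is elided in the transcript, so the following is a generic sketch of feature hashing rather than the lecture's variant. The choice of MD5, the bucket count, and the function name are assumptions.

```python
import hashlib

def hashed_features(tokens, n_buckets):
    """Map an open-ended set of feature names into a fixed-length count vector."""
    vec = [0.0] * n_buckets
    for tok in tokens:
        # A stable hash; Python's built-in hash() is randomized per process for str.
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % n_buckets
        vec[index] += 1.0  # distinct features that collide share a bucket
    return vec

# Hypothetical feature names, including a crossed user x query feature.
tokens = ["user=42", "query=cheap flights", "user=42^query=cheap flights"]
x = hashed_features(tokens, n_buckets=16)
```

The model size is fixed by `n_buckets` no matter how many distinct feature names appear, at the cost of occasional collisions.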
  21. Scaling up ML: Concluding Thoughts
      • Learner parallelization is highly algorithm-dependent
      • High-level parallelization (MapReduce)
        – Less work, but there is a convenience penalty
        – Limits on communication and control can be algorithm-killing
      • Low-level parallelization (multicore, GPUs)
        – Harder to implement/debug
        – Successes are specific to the architecture-algorithm pairing: e.g., GPUs are great if matrix multiplication is the core operation (NNs)
        – Typical trade-off: memory/IO latency/contention vs. update complexity