- 1. Scaling Up Practical Learning Algorithms Misha Bilenko ALMADA Summer School, Moscow 2013
- 2. Preliminaries: ML-in-four-slides
- 3. ML Examples
- 4. Key Prediction Models
- 5. Learning: Training Predictors
- 6. Big Learning: Large Datasets… and Beyond • Large training sets: many examples, as long as they keep improving accuracy • Large models: many features, ensembles, “deep” nets • Model selection: hyper-parameter tuning, statistical significance • Fast inference: structured prediction (e.g., speech) • Fundamental differences across settings – Learning vs. inference, input complexity vs. model complexity – Dataflow/computation and bottlenecks are highly algorithm- and task-specific – Rest of this talk: practical algorithm nuggets for the first two settings
- 7. Dealing with Large Training Sets (I): SGD
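The SGD recipe named on this slide updates the model one example at a time, so each epoch is a single pass over (possibly streamed) data. A minimal sketch for logistic regression; the helper name `sgd_logistic` and the hyper-parameter defaults are illustrative assumptions, not code from the talk.

```python
import math
import random

def sgd_logistic(data, lr=0.1, epochs=20, seed=0):
    """Minimal SGD for logistic regression: one weight update per example.

    `data` is a list of (features, label) pairs, label in {0, 1},
    features a list of floats (include a constant 1.0 for the bias).
    """
    rng = random.Random(seed)
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        rng.shuffle(data)                      # stochastic: random example order
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))     # sigmoid
            g = p - y                          # d(log-loss)/dz
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w
```

Because only one example is touched per update, memory stays constant in the dataset size, which is what makes SGD attractive for large training sets.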
- 8. Dealing with Large Training Sets (II): L-BFGS
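L-BFGS, in contrast to SGD, is a batch method: each iteration consumes the full gradient, which itself parallelizes naturally over data partitions. A single-machine sketch using SciPy's L-BFGS-B implementation for L2-regularized logistic regression; the function name and regularization setup are assumptions, not from the slides.

```python
import numpy as np
from scipy.optimize import minimize

def fit_logreg_lbfgs(X, y, l2=1.0):
    """Batch logistic regression trained with L-BFGS.

    X: (n, d) array (include a bias column), y: labels in {0, 1}.
    """
    n, d = X.shape

    def loss_and_grad(w):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))          # predicted probabilities
        eps = 1e-12                                  # guard against log(0)
        loss = (-np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
                + 0.5 * l2 * (w @ w) / n)
        grad = X.T @ (p - y) / n + l2 * w / n
        return loss, grad

    # jac=True tells SciPy the objective returns (loss, gradient) jointly
    res = minimize(loss_and_grad, np.zeros(d), jac=True, method="L-BFGS-B")
    return res.x
```

In a distributed setting, the expensive part of each iteration is the full-gradient computation, which decomposes into per-partition partial sums followed by one aggregation.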
- 9. Dealing with Large Datasets (II): Trees • Rule-based prediction is natural and powerful (non-linear) – Play outside: if no rain and not too hot, or if snowing but not windy. • Trees hierarchically encode rule-based prediction – Nodes test features and split – Leaves produce predictions – Regression trees: numeric outputs • Ensembles combine tree predictions [diagram: ensemble prediction as a weighted sum of tree outputs]
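The hierarchical encoding described above fits in a tiny walker: internal nodes test one feature against a threshold, leaves emit numeric outputs, and an additive ensemble sums them. The nested-dict encoding is an assumed illustration, not a format from the talk.

```python
def predict_tree(node, x):
    """Walk a regression tree encoded as nested dicts.

    Leaves: {'value': v}; internal nodes:
    {'feature': i, 'threshold': t, 'left': subtree, 'right': subtree}.
    """
    while 'value' not in node:
        # Go left when the tested feature is at or below the threshold
        node = node['left'] if x[node['feature']] <= node['threshold'] else node['right']
    return node['value']

def predict_ensemble(trees, x, learning_rate=1.0):
    """Additive ensemble: (scaled) sum of per-tree predictions."""
    return learning_rate * sum(predict_tree(t, x) for t in trees)
```

Prediction cost is one root-to-leaf path per tree, which is why ensembles stay cheap at serving time and why trees can be evaluated in parallel.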
- 10. Tree Ensemble Zoo • Different models can define different types of: – Combiner function: voting vs. weighting – Leaf prediction models: constant vs. regression – Split conditions: single vs. multiple features • Examples (small biased sample, some are not tree-specific) – Boosting: AdaBoost, LogitBoost, GBM/MART, BrownBoost, Transform Regression – Random Forests: Random Subspaces, Bagging, Additive Groves, BagBoo – Beyond regression and binary classification: RankBoost, abc-mart, GBRank, LambdaMART, MatrixNet
- 11. Tree Ensembles Are Rightfully Popular • State-of-the-art accuracy: web, vision, CRM, bio, … • Efficient at prediction time – Multithread evaluation of individual trees; optimize/short-circuit • Principled: extensively studied in statistics and learning theory • Practical – Naturally handle mixed, missing, (un)transformed data – Feature selection embedded in algorithm – Well-understood parameter sweeps – Scalable to extremely large datasets: rest of this section
- 12. Naturally Parallel Tree Ensembles • No interaction when learning individual trees – Bagging: each tree trained on a bootstrap sample of data – Random forests: bootstrap plus subsample features at each split – For large datasets, local data replaces bootstrap -> embarrassingly parallel [diagrams: bagging vs. random forest tree construction]
- 13. Boosting: Iterative Tree Construction “Best off-the-shelf classifier in the world” – Breiman • Reweight examples for each subsequent tree to focus on errors
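The reweighting step this slide refers to can be illustrated with the classic AdaBoost update (one of several boosting variants listed on the previous slide): misclassified examples gain weight, correctly classified ones lose it. A sketch with the hypothetical helper name `adaboost_reweight`; it assumes the weak learner's weighted error lies strictly between 0 and 1.

```python
import math

def adaboost_reweight(weights, preds, labels):
    """One AdaBoost round: upweight the examples the weak learner got wrong.

    weights: current example weights summing to 1; preds, labels in {-1, +1}.
    Returns (alpha, new_weights), where alpha is the weak learner's vote.
    """
    err = sum(w for w, p, y in zip(weights, preds, labels) if p != y)
    alpha = 0.5 * math.log((1 - err) / err)        # smaller error -> bigger vote
    new = [w * math.exp(-alpha * p * y)            # wrong (p*y = -1) -> weight grows
           for w, p, y in zip(weights, preds, labels)]
    z = sum(new)
    return alpha, [w / z for w in new]             # renormalize to sum to 1
```

A well-known property of this update: after renormalization, the misclassified examples jointly hold exactly half the total weight, which is what forces the next tree to focus on the errors.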
- 14. Efficient Tree Construction • Boosting is iterative: scaling up = parallelizing tree construction • For every node: pick best feature to split – For every feature: pick best split-point • For every potential split-point: compute gain – For every example in current node, add its gain contribution for given split • Key efficiency: limiting and ordering the set of considered split points – Continuous features: discretize into bins, splits = bin boundaries – Allows computing split values in a single pass over data
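Once continuous features are discretized into bins, a single data pass accumulates per-bin counts and target sums, and the best split then falls out of one sweep over bin boundaries. A sketch of that sweep for squared-error (regression-tree) gain; the histogram interface is an assumption, not the talk's code.

```python
def best_binned_split(bin_counts, bin_sums):
    """Pick the best split point for one feature from its bin histogram.

    bin_counts[b]: number of examples in bin b;
    bin_sums[b]: sum of their targets (residuals).
    For constant-leaf squared error, the gain of a split is
    S_L^2/n_L + S_R^2/n_R (up to an additive constant), so one
    left-to-right sweep over boundaries suffices.
    """
    total_n, total_s = sum(bin_counts), sum(bin_sums)
    best_gain, best_bin = float('-inf'), None
    n_l, s_l = 0, 0.0
    for b in range(len(bin_counts) - 1):        # candidate split: after bin b
        n_l += bin_counts[b]
        s_l += bin_sums[b]
        n_r, s_r = total_n - n_l, total_s - s_l
        if n_l == 0 or n_r == 0:
            continue                             # degenerate split, skip
        gain = s_l * s_l / n_l + s_r * s_r / n_r
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```

Because only histograms (not raw examples) are needed to score splits, this is also the quantity workers exchange in the distributed variants on the following slides.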
- 15. Binned Split Evaluation [diagram: per-feature histograms over split bins]
- 16. Tree Construction Visualized • Observation 1: a single pass is sufficient per tree level • Observation 2: the data pass can iterate by-instance or by-feature – Supports horizontally or vertically partitioned data [diagrams: features × bins histograms accumulated by-instance and by-feature]
- 17. Data-Distributed Tree Construction • Master 1. Send workers current model and set of nodes to expand 2. Wait to receive local split histograms from workers 3. Aggregate local split histograms, select best split for every node • Worker 2a. Pass through local data, aggregating split histograms 2b. Send completed local histograms to master [diagram: master/worker communication]
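The master's aggregation step (3) is just an element-wise sum of the workers' local histograms, since bin counts and target sums are additive across data partitions. A sketch under an assumed histogram layout; the keying scheme is illustrative, not from the slides.

```python
def aggregate_histograms(local_hists):
    """Merge workers' local split histograms by element-wise summation.

    Each local histogram maps (node_id, feature_id) -> list of
    (count, target_sum) pairs, one pair per bin.
    """
    merged = {}
    for hist in local_hists:
        for key, bins in hist.items():
            if key not in merged:
                merged[key] = [(0, 0.0)] * len(bins)
            # Counts and sums are additive, so partitions combine exactly
            merged[key] = [(c0 + c1, s0 + s1)
                           for (c0, s0), (c1, s1) in zip(merged[key], bins)]
    return merged
```

Because each worker only ships fixed-size histograms rather than examples, communication cost depends on the number of features and bins, not on the dataset size.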
- 18. Feature-Distributed Tree Construction • Workers maintain a per-instance index of current residuals and previous splits • Master 1. Request workers to expand a set of nodes 2. Wait to receive best per-feature splits from workers 3. Select best feature-split for every node 4. Request best splits’ workers to broadcast per-instance assignments and residuals • Worker 2a. Pass through all instances for local features, aggregating split histograms for each node 2b. Select local features’ best splits for each node, send to master [diagram: master/worker communication]
- 19. Learning with Many Features • How many is “many”? At least billions. • Exhibit A: English n-grams – Unigrams: 13 million – Bigrams: 315 million – Trigrams: 977 million – Fourgrams: 1.3 billion – Fivegrams: 1.2 billion • Exhibit B: search ads, 3 months – User IDs: hundreds of millions – Listing IDs: hundreds of millions – Queries: tens to hundreds of millions – User x Listing x Query: billions • Can we scale up linear learners? Yes, but there are limits: – Retraining: ideally real-time, definitely not more than a couple of hours – Modularity: ideally fit in memory, definitely decompose elastically
- 20. Towards infinite features: Feature hashing
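Feature hashing sidesteps the unbounded feature dictionary entirely: each feature name is hashed into one of a fixed number of buckets, and occasional collisions are simply tolerated. A minimal sketch using CRC32 as a stable hash; the bucket count and the omission of a sign hash (used in some variants to unbias collisions) are simplifying assumptions.

```python
import zlib

def hash_features(pairs, n_buckets=2**20):
    """Map (feature_name, value) pairs into a fixed-size sparse vector.

    No dictionary is ever materialized: the bucket index is computed
    directly from the name, so unseen features need no special handling.
    Returns a sparse dict {bucket_index: accumulated_value}.
    """
    vec = {}
    for name, value in pairs:
        idx = zlib.crc32(name.encode("utf-8")) % n_buckets
        vec[idx] = vec.get(idx, 0.0) + value   # collisions just add up
    return vec
```

This is what makes the billion-scale cross features of the previous slide (User x Listing x Query) tractable: memory is bounded by the bucket count, not by the number of distinct feature names.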
- 21. Scaling up ML: Concluding Thoughts • Learner parallelization is highly algorithm-dependent • High-level parallelization (MapReduce) – Less work, but convenience carries a penalty – Limits on communication and control can be algorithm-killing • Low-level parallelization (multicore, GPUs, …) – Harder to implement/debug – Successes are architecture- and algorithm-specific: e.g., GPUs are great when matrix multiplication is the core operation (NNs) – Typical trade-off: memory/IO latency/contention vs. update complexity
