Rise of Scalable Machine Learning at Yahoo
Andy Feng
VP Architecture, Yahoo
My Talks @ Hadoop Summit
 Storm (2013)
 Spark (2014)
 Machine Learning (2015)
Use Case: Search & Advertisement
 Application needs
› Content ranking
› Ad click prediction
› Query-Ads matching
 Machine learning algorithm
› Gradient boosted decision tree
› Logistic regression
› Neural network
Challenge: Scale
1. Massive amount of examples
› Naïve solutions take days/weeks
2. Billions of features
› Model exceeds memory limits of 1 computer
3. Variety of algorithms
› Different solutions required for scale-up
Massive Hadoop at Yahoo
 600 PB HDFS
 43K computers
[Figure label: Machine Learning]
Big-Data ML in Action
[Diagram labels: ML learner, ML server, Map, Reduce]
Architecture for Scalable ML
 ML Server
› Customized in-memory stores (Hashmap, Matrix)
• Lockless concurrency
• Zero garbage created
› Map/Reduce API to move computing to servers
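The idea of a Map/Reduce API that moves computing to the servers can be pictured with the sketch below. It is an illustration only: the class `ShardedStore`, its method names, and the hash-based sharding are hypothetical and are not taken from Yahoo's ML Server. The learner hands a function to each shard and receives only a small per-shard result, instead of pulling the stored values across the network.

```python
from functools import reduce


class ShardedStore:
    """Hypothetical in-memory key/value store split across several shards."""

    def __init__(self, num_shards):
        # Each dict stands in for one server holding part of the model.
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_for(self, key):
        return self.shards[hash(key) % len(self.shards)]

    def put(self, key, value):
        self._shard_for(key)[key] = value

    def get(self, key):
        return self._shard_for(key)[key]

    def map_reduce(self, map_fn, reduce_fn, initial):
        # map_fn runs against each shard's local data ("on the server");
        # only the per-shard results are combined by reduce_fn.
        partials = [map_fn(shard) for shard in self.shards]
        return reduce(reduce_fn, partials, initial)


store = ShardedStore(num_shards=4)
for i in range(10):
    store.put(f"feature_{i}", float(i))

# Sum of squared weights, computed shard by shard.
squared_norm = store.map_reduce(
    map_fn=lambda shard: sum(w * w for w in shard.values()),
    reduce_fn=lambda a, b: a + b,
    initial=0.0,
)
```

In this toy version everything lives in one process; the point is only the shape of the call, where the function travels and the data stays put.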
3 Examples of ML Algorithms
1. Gradient Boosted Decision Tree
› Problem: Training latency
› Solution: Hadoop streaming + MPI
2. Logistic Regression
› Problem: Model size
› Solution: Spark + ML Server
3. Ad-Query Vectors
› Problem: Model size + Training latency
› Solution: Spark + ML Server
Algorithm 1: Gradient Boosted Decision Tree
 Boosting is sequential
 Training takes days for 1000s of features
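Why boosting is sequential can be seen in a minimal sketch, assuming squared loss and one-split regression stumps on a single feature. The function names and the toy data are hypothetical, and this is not the Hadoop streaming + MPI implementation named in the talk. Each round fits a new tree to the residuals left by all earlier rounds, so round t cannot start before round t-1 has finished.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (least squares)."""
    best = None
    for threshold in sorted(set(xs)):
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def boost(xs, ys, rounds, learning_rate=0.5):
    """Run gradient boosting and return the mean squared error per round."""
    predictions = [0.0] * len(xs)
    losses = []
    for _ in range(rounds):
        # The residuals depend on the predictions of every earlier round,
        # which is what forces the rounds to run one after another.
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        predictions = [
            p + learning_rate * stump(x) for p, x in zip(predictions, xs)
        ]
        losses.append(
            sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys)
        )
    return losses


# Hypothetical toy data: one feature, six examples.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0]
losses = boost(xs, ys, rounds=5)
```

The split search inside `fit_stump` is the part that can be spread over many machines; the outer loop over rounds stays serial.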
Gradient Boosted Decision Tree: 30x Speed-up
Algorithm 2: Logistic Regression
When |β| > 100B,
› 100 Billion * 16 Bytes = 1.6 TB
› β exceeds memory limit of 1 computer
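The size estimate for β can be checked with a few lines of arithmetic. The weight count and the 16 bytes per weight come from the slide; the 64 GB of usable memory per server is a hypothetical figure chosen only to show how many machines the model would have to be spread over.

```python
NUM_WEIGHTS = 100 * 10**9           # |β| = 100 billion (from the slide)
BYTES_PER_WEIGHT = 16               # from the slide
SERVER_MEMORY_BYTES = 64 * 10**9    # hypothetical: 64 GB usable per server

model_bytes = NUM_WEIGHTS * BYTES_PER_WEIGHT
model_terabytes = model_bytes / 10**12

# Ceiling division with integers: servers needed just to hold β.
servers_needed = -(-model_bytes // SERVER_MEMORY_BYTES)
```

Under that assumed server size the weights alone would occupy 25 machines, before any training data or working memory is counted.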
Logistic Regression: 1000x Scale-up
Algorithm 3: Ad-Query Vectors
 Vector: numeric representation of queries/ads
› Vector(“san jose weather”) ≈ Vector(“weather 95113”) ≈ Vector(ad123)
 Model size
› 1 Billion * 300 dimensions = 2.4 TB
 Vector computation (X*Y, aX+Y)
› Took weeks for small datasets
* Yahoo Labs: http://bit.ly/1G3f6L2
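The Ad-Query Vectors figures can be unpacked with a short sketch. The counts and the 2.4 TB total are from the slide; the bytes-per-value figure is derived from them rather than stated in the talk. The helper names `dot` and `axpy` are hypothetical labels for the two vector computations the slide lists, X*Y and aX+Y.

```python
NUM_VECTORS = 10**9     # 1 billion queries/ads (from the slide)
DIMENSIONS = 300        # from the slide
MODEL_BYTES = 2.4e12    # 2.4 TB (from the slide)

# Storage per vector component implied by the slide's numbers.
bytes_per_value = MODEL_BYTES / (NUM_VECTORS * DIMENSIONS)


def dot(x, y):
    """X*Y: inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def axpy(a, x, y):
    """aX+Y: scale x by a and add it to y, returning a new vector."""
    return [a * xi + yi for xi, yi in zip(x, y)]


x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.0]
similarity = dot(x, y)
updated = axpy(0.5, x, y)
```

The derived value works out to 8 bytes per component, which is consistent with double-precision storage, though the talk does not say which type was used.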
Ad-Query2Vec: 100x Speed/Scale-up
 Computation on servers
› (1) Negative sampling
› (2) Compute gradient: X*Y
› (3) Adjust vectors: Y=aX+Y
 Daily training enabled
› weeks → hours
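The three server-side steps (negative sampling, gradient from X*Y, vector adjustment Y=aX+Y) resemble the update used in skip-gram training with negative sampling. The sketch below follows that well-known update as an assumption about what the slide's steps correspond to; it is not Yahoo's code. It runs in a single process, and `train_pair`, the ad names other than ad123, and all parameter values are hypothetical.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def axpy(a, x, y):
    """Return a*x + y as a new list."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_pair(vectors, query, positive_ad, all_ads, num_negatives, rate, rng):
    # (1) Negative sampling: draw ads that were not paired with the query.
    candidates = [ad for ad in all_ads if ad != positive_ad]
    negatives = rng.sample(candidates, num_negatives)

    x = vectors[query]
    x_update = [0.0] * len(x)
    for ad, label in [(positive_ad, 1.0)] + [(n, 0.0) for n in negatives]:
        y = vectors[ad]
        # (2) Compute gradient: the scalar factor comes from the product X*Y.
        g = rate * (label - sigmoid(dot(x, y)))
        # Accumulate the query-side change using the ad vector before it moves.
        x_update = axpy(g, y, x_update)
        # (3) Adjust vectors: Y = aX + Y, with a = g.
        vectors[ad] = axpy(g, x, y)
    vectors[query] = axpy(1.0, x_update, x)


rng = random.Random(0)
names = ["q:san jose weather", "ad123", "ad456", "ad789"]
vectors = {n: [rng.uniform(-0.5, 0.5) for _ in range(8)] for n in names}

before = sigmoid(dot(vectors["q:san jose weather"], vectors["ad123"]))
for _ in range(200):
    train_pair(vectors, "q:san jose weather", "ad123",
               ["ad123", "ad456", "ad789"], num_negatives=2, rate=0.1, rng=rng)
after = sigmoid(dot(vectors["q:san jose weather"], vectors["ad123"]))
```

Both inner operations are exactly the dot product and aX+Y forms from the slide, which is why they are natural candidates for running next to the stored vectors rather than on the learner.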
 Asynchronous
 Faster
 More data
 Larger model
Lesson Learned: Approximate Computing → Better Accuracy
…
Summary
 Scalable machine learning at Yahoo
› critical business: search, advertisement
› daily model training w/ billions of features
 Hadoop/YARN plays a central role
› approximate computing
› CPU + GPU
Thank You!
afeng@apache.org

DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
RTTS
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Fwdays
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Thierry Lestable
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
Product School
 

Recently uploaded (20)

Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
JMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and GrafanaJMeter webinar - integration with InfluxDB and Grafana
JMeter webinar - integration with InfluxDB and Grafana
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
Empowering NextGen Mobility via Large Action Model Infrastructure (LAMI): pav...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
From Daily Decisions to Bottom Line: Connecting Product Work to Revenue by VP...
 

Surge: Rise of Scalable Machine Learning at Yahoo!

  • 1. Rise of Scalable Machine Learning at Yahoo. Andy Feng, VP Architecture, Yahoo
  • 2. My Talks @ Hadoop Summit: Storm (2013), Spark (2014), Machine Learning (2015)
  • 3. Use Case: Search & Advertisement. Application needs: › Content ranking › Ad click prediction › Query-Ads matching. Machine learning algorithms: › Gradient boosted decision tree › Logistic regression › Neural network
  • 4. Challenge: Scale. 1. Massive amount of examples › Naïve solutions take days/weeks 2. Billions of features › Model exceeds memory limits of 1 computer 3. Variety of algorithms › Different solutions required for scale-up
  • 5. Massive Hadoop at Yahoo: 600 PB HDFS, 43K computers, machine learning
  • 6. Big-Data ML in Action: ML learner, ML server, MapReduce
  • 7. Architecture for Scalable ML. ML Server: › Customized in-memory stores (Hashmap, Matrix) • Lockless concurrency • Zero garbage created › Map/Reduce API to move computing to servers
  • 8. 3 Examples of ML Algorithms. 1. Gradient Boosted Decision Tree › Problem: Training latency › Solution: Hadoop streaming + MPI 2. Logistic Regression › Problem: Model size › Solution: Spark + ML Server 3. Ad-Query Vectors › Problem: Model size + Training latency › Solution: Spark + ML Server
  • 9. Algorithm 1: Gradient Boosted Decision Tree. Boosting is sequential. Training takes days for 1000s of features
  • 10. Gradient Boosted Decision Tree: 30x Speed-up
  • 11. Algorithm 2: Logistic Regression. When |β| > 100B: › 100 Billion * 16 Bytes = 1.6 TB › β exceeds memory limit of 1 computer
  • 13. Algorithm 3: Ad-Query Vectors. Vector: numeric representation of queries/ads › Vector(“san jose weather”) ≈ Vector(“weather 95113”) ≈ Vector(ad123). Model size › 1 Billion * 300 dimensions = 2.4 TB. Vector computation (X*Y, aX+Y) › Took weeks for small datasets. * Yahoo Labs: http://bit.ly/1G3f6L2
  • 14. Ad-Query2Vec: 100x Speed/Scale-up. Computation on servers › (1) Negative sampling › (2) Compute gradient: X*Y › (3) Adjust vectors: Y=aX+Y. Daily training enabled › weeks → hours
  • 15. Lesson Learned: Approximate Computing → Better Accuracy. Asynchronous, Faster, More data, Larger model …
  • 16. Summary. Scalable machine learning at Yahoo › critical business: search, advertisement › daily model training w/ billions of features. Hadoop/YARN plays a central role › approximate computing › CPU + GPU

Editor's Notes

  1. Good afternoon. I am Andy Feng from Yahoo. In this talk, I will share our recent effort to enable large-scale machine learning on Hadoop clusters.
  2. In 2013, I talked about Yahoo’s adoption of Storm for low-latency processing. Last year, I described Yahoo’s effort to bring Spark onto YARN clusters. Today, I will cover our progress on machine learning using YARN clusters. I will cover 3 areas: WHY Yahoo applies machine learning, WHAT challenges we try to address, and HOW we address them. I will wrap up the talk with key lessons learned from our experience.
  3. Let’s start with WHY machine learning. Search is one of the key applications for Yahoo. For a user’s search phrase, we construct a result page with organic content together with ads. To generate the result page, we rank content based on its relevance to the query terms, match ads against the query, and predict the probability of an ad click. Several machine learning algorithms are applied in this process: decision trees, logistic regression and neural networks.
  4. Machine learning at Yahoo has a scalability challenge. 1st, the # of training examples. In order to produce an accurate machine learning model, Yahoo examines a massive amount of training examples. For example, we examine several months of user search activity logs. Typically, we are looking at hundreds of billions of training examples. When naïve solutions are applied, the training process could take several weeks. Models that represent what happened weeks ago have limited business value, since they don’t represent the current state of our users and content. 2nd, the # of features. We need to pick the useful signals out of all possible signals. It’s usual for Yahoo to consider billions of features in our models. 3rd, we use a variety of algorithms, and different solutions are required for scaling up these algorithms. We want our machine learning algorithms to be massively scalable.
  5. We believe that Hadoop is an ideal platform for scalable machine learning. Yahoo has one of the largest Hadoop deployments in the world. At the moment we store 600 PB on 43 thousand nodes. Last year, we decided to make Hadoop the single best system for running large-scale machine learning applications. So let’s look a bit under the hood.
  6. At Yahoo, our data scientists are applying big-data machine learning on Hadoop clusters daily. Here is a screenshot from one Hadoop cluster. In addition to various MapReduce jobs, we have a Spark job for machine learning, and an ML server for managing the data of ML models.
  7. To enable approximate computing, we are building machine learning on top of Hadoop, Spark and our machine learning servers. These servers are a YARN application, specifically designed for machine learning. All data are stored in memory with customized stores. These stores enable lockless concurrency, and can handle millions of operations per second. Our servers were implemented in Java, but create zero garbage. This enables us to run training consistently with high throughput, without worrying about garbage collection. Our API supports asynchronous machine learning and mini-batches. This ensures very fast training by many learners. To minimize data movement, we enable clients to move computing logic to the servers. For example, we enable MapReduce operations on servers, so you could perform statistical analysis of large models using MapReduce operations. Our servers provide built-in support for Hadoop file systems. You can store your models after each training run, and load previously trained models from HDFS.
  8. Let me share 3 success stories about machine learning algorithms. Our 1st story illustrates how Hadoop and MPI can dramatically reduce training latency. Our 2nd story shows how Spark and YARN enable training of very large machine learning models. Our 3rd story attacks both model size and training latency.
  9. Let’s start with our 1st story. In gradient boosted decision trees, we represent the model as a collection of decision trees. Each tree node represents a decision point using one feature. By walking these trees top-down, you reach leaf nodes with numerical values. Adding those numerical values together gives our prediction for a given example. To achieve high accuracy, we tend to have many trees. In Yahoo use cases, we use thousands of trees. To construct such trees, we have to construct them one by one. Within each tree, we have to build layer by layer. For each node, we need to select the best feature and the best value for the split. If you use a single machine, the training process could take several days.
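The prediction step described in this note can be sketched in a few lines of Python. This is only an illustration: the `Node` layout, the "go left when the feature value is at most the threshold" rule, and the toy trees are assumptions, not Yahoo's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    """One tree node. Internal nodes test a single feature; leaves carry a value."""
    feature: Optional[int] = None      # None marks a leaf
    threshold: float = 0.0
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0

    def is_leaf(self) -> bool:
        return self.feature is None


def predict_tree(root: Node, x: Sequence[float]) -> float:
    """Walk one tree top-down until a leaf is reached."""
    node = root
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.value


def predict_gbdt(trees: List[Node], x: Sequence[float]) -> float:
    """The model's prediction is the sum of the leaf values over all trees."""
    return sum(predict_tree(tree, x) for tree in trees)


# Two hypothetical one-split trees.
tree_a = Node(feature=0, threshold=0.5, left=Node(value=-1.0), right=Node(value=2.0))
tree_b = Node(feature=1, threshold=3.0, left=Node(value=0.25), right=Node(value=0.75))
score = predict_gbdt([tree_a, tree_b], [0.9, 1.0])
```

Prediction is cheap because each tree is an independent walk. Training is the expensive part, because each new tree depends on the ones built before it.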
  10. At Yahoo, we developed a GBDT algorithm on top of Hadoop and MPI, and achieved a 30X speed-up. More specifically, we use Hadoop Streaming to launch multiple GBDT workers. We partition training examples by columns, instead of rows. Each worker has a subset of features for all training examples. During training, each worker performs local computation to identify the best splits for its feature set. We then apply an MPI allreduce operation to decide the global best split among all features, and broadcast the best split to all workers. We repeat this process for all tree nodes. At the end, we have a collection of trees as our model. In this approach, we could use tens of Hadoop mappers, and fully utilize their computation power. We achieved a 30 times speedup for Yahoo use cases. For a training job that previously took days, we could now produce decision trees in about 1 hour.
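A single-process sketch of the column-partitioned split search may make this concrete. Everything here is an assumption for illustration: the squared-error gain, the `local_best_split` and `allreduce_max` names, and the toy data. In particular, `allreduce_max` is a plain Python stand-in for the MPI allreduce step, not a call into any MPI library.

```python
from typing import Dict, List, Tuple

Split = Tuple[float, int, float]  # (gain, feature id, threshold)


def sse(values: List[float]) -> float:
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def local_best_split(columns: Dict[int, List[float]], targets: List[float]) -> Split:
    """Best split among the feature columns that one worker holds.

    Each worker sees every training example, but only its own features.
    """
    best: Split = (float("-inf"), -1, 0.0)
    total = sse(targets)
    for fid, col in columns.items():
        for thr in sorted(set(col))[:-1]:  # the largest value cannot split
            left = [t for v, t in zip(col, targets) if v <= thr]
            right = [t for v, t in zip(col, targets) if v > thr]
            gain = total - sse(left) - sse(right)
            best = max(best, (gain, fid, thr))
    return best


def allreduce_max(candidates: List[Split]) -> Split:
    """Stand-in for the allreduce: every worker ends up with the global best."""
    return max(candidates)


targets = [1.0, 1.0, 5.0, 5.0]
worker_columns = [
    {0: [1.0, 2.0, 3.0, 4.0]},   # worker 0 holds feature 0
    {1: [1.0, 4.0, 2.0, 3.0]},   # worker 1 holds feature 1
]
global_best = allreduce_max([local_best_split(c, targets) for c in worker_columns])
```

Only one small tuple per worker crosses the network for each tree node, which is why partitioning by columns keeps communication low.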
  11. Our 2nd story is about logistic regression. For a given feature vector X, logistic regression predicts the outcome by applying a logistic function to the dot product of the parameters beta and the feature vector. During the training phase, we try to find the best parameters beta from the training examples. If we have 2 parameters, logistic regression finds a line in 2-dimensional space that best fits our examples. The scalability challenge is around the # of parameters. We want to produce models with over 100B parameters. Assuming that each parameter uses 16 bytes, storing our parameters requires 1.6 TB. We could not store the model in the memory of a single computer.
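As a minimal sketch, the prediction rule and the memory estimate from this note look as follows in Python. The function names are illustrative, and the 16 bytes per parameter is the figure assumed in the talk.

```python
import math
from typing import Sequence


def predict(beta: Sequence[float], x: Sequence[float]) -> float:
    """Logistic function applied to the dot product of beta and x."""
    z = sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))


def model_bytes(num_params: int, bytes_per_param: int = 16) -> int:
    """Memory needed to hold every parameter at once."""
    return num_params * bytes_per_param


p = predict([0.0, 0.0], [3.0, -1.0])            # all-zero parameters
size_tb = model_bytes(100 * 10**9) / 10**12     # 100 billion parameters, in TB
```

With 100 billion parameters at 16 bytes each, `size_tb` reproduces the 1.6 TB estimate quoted in the note.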
  12. To enable large models, we decided to use multiple servers in a YARN cluster. Each server keeps a subset of the parameters in memory. We launch logistic regression learners as a Spark job on YARN. Each learner covers a subset of the training examples from HDFS. For each example, we fetch the current parameter values from the servers, compute the gradient, and update the servers with the latest values. This new architecture enables us to scale up learning 1000 times. Our previous models had thousands or millions of parameters, and our new model now has billions of parameters. All learners perform learning independently. There is no synchronization across learners at all. Therefore, we can learn from a massive amount of training data very quickly. As a result, our model w/ billions of parameters is significantly more precise than our previous models. That has brought us meaningful business impact.
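The fetch, compute, update loop can be sketched as below. This is a single-process illustration under stated assumptions: `ParamShard`, `ShardedModel`, the modulo sharding rule, and the learning rate are hypothetical names and choices, not the API of Yahoo's ML server. The real system runs many learners concurrently, which this sketch does not.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable


class ParamShard:
    """One server: keeps only its own subset of the parameters in memory."""

    def __init__(self) -> None:
        self.weights: Dict[int, float] = defaultdict(float)

    def pull(self, keys: Iterable[int]) -> Dict[int, float]:
        return {k: self.weights[k] for k in keys}

    def push(self, deltas: Dict[int, float]) -> None:
        for k, d in deltas.items():
            self.weights[k] += d


class ShardedModel:
    """Routes each parameter id to a shard (assumed rule: id modulo shard count)."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [ParamShard() for _ in range(num_shards)]

    def _shard(self, key: int) -> ParamShard:
        return self.shards[key % len(self.shards)]

    def pull(self, keys: Iterable[int]) -> Dict[int, float]:
        out: Dict[int, float] = {}
        for k in keys:
            out.update(self._shard(k).pull([k]))
        return out

    def push(self, deltas: Dict[int, float]) -> None:
        for k, d in deltas.items():
            self._shard(k).push({k: d})


def predict(model: ShardedModel, features: Dict[int, float]) -> float:
    beta = model.pull(features.keys())           # fetch only the needed parameters
    z = sum(beta[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def sgd_step(model: ShardedModel, features: Dict[int, float], label: float,
             lr: float = 0.1) -> None:
    """One learner step: fetch, compute the gradient, push the update."""
    p = predict(model, features)
    model.push({k: -lr * (p - label) * v for k, v in features.items()})


model = ShardedModel(num_shards=2)
examples = [({0: 1.0, 7: 1.0}, 1.0), ({3: 1.0, 7: 1.0}, 0.0)]
for _ in range(200):
    for feats, y in examples:
        sgd_step(model, feats, y)
```

Because each example touches only a handful of parameters, a learner never needs the whole model in memory, only the entries it pulls for the current example.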
  13. Our 3rd story is related to search queries and ads. In this case, we are learning numerical vectors of search queries and ads from user session logs. From these vectors, we will be able to know that the query terms “san jose weather” and “weather 95113” are essentially identical. We learn vectors from users’ search sessions. Each search session has a collection of query terms and ads. Ad/query vectors are learned by applying n-gram techniques. Details of our algorithm are explained in a recent conference paper from Yahoo Labs. In this use case, we have 2 problems. First, we have billions of query terms. If each vector has 300 dimensions, we will need 2.4 TB of memory for vector storage. That’s way beyond our typical computer today. Second, vector calculation is very expensive. For each search session, we need to perform hundreds of vector operations such as multiplication and addition. Even for relatively small datasets, training could take weeks.
  14. For computing vectors of queries and ads, we use a set of matrix servers on a YARN cluster. Each server has a subset of the columns of our matrix. These servers have built-in matrix operations such as vector multiplication and addition. We use a Spark job to launch multiple learners on a YARN cluster. Each learner examines a subset of the training dataset. To reduce data movement, we conduct the majority of the computation on the servers. For each training example, we let each server produce negative examples, and calculate gradients locally. Then, our learner calculates a global coefficient based on each server’s partial gradients. Finally, we let each server adjust its vectors. This distributed solution enables us to train vectors within a few hours. Remember that it took several weeks for such a task previously. That is a 100X speedup using YARN.
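The three server-side steps (negative sampling, partial X*Y, and the aX+Y adjustment) can be sketched as follows. This is a hedged illustration, not the algorithm from the Yahoo Labs paper: `MatrixShard`, the shared-seed trick that lets every shard pick the same negative example, the sigmoid-based coefficient, and the learning rate are all assumptions.

```python
import math
import random
from typing import List, Set


class MatrixShard:
    """One matrix server: holds a slice of the dimensions of every vector."""

    def __init__(self, num_vectors: int, dims: int, seed: int) -> None:
        rng = random.Random(seed)
        self.rows = [[rng.uniform(-0.5, 0.5) for _ in range(dims)]
                     for _ in range(num_vectors)]

    def sample_negative(self, step_seed: int, exclude: Set[int]) -> int:
        # Assumption: shards share the seed, so they agree without talking.
        rng = random.Random(step_seed)
        while True:
            cand = rng.randrange(len(self.rows))
            if cand not in exclude:
                return cand

    def partial_dot(self, i: int, j: int) -> float:
        """Local share of X*Y over this shard's dimensions."""
        return sum(a * b for a, b in zip(self.rows[i], self.rows[j]))

    def adjust(self, i: int, j: int, coeff: float) -> None:
        """Local aX+Y update of both vectors, using their old values."""
        xi, yj = self.rows[i][:], self.rows[j][:]
        self.rows[i] = [x + coeff * y for x, y in zip(xi, yj)]
        self.rows[j] = [y + coeff * x for x, y in zip(xi, yj)]


def total_dot(shards: List[MatrixShard], i: int, j: int) -> float:
    return sum(s.partial_dot(i, j) for s in shards)


def agree_on_negative(shards: List[MatrixShard], step_seed: int,
                      exclude: Set[int]) -> int:
    picks = {s.sample_negative(step_seed, exclude) for s in shards}
    (neg,) = picks  # fails loudly if the shards disagree
    return neg


def train_pair(shards: List[MatrixShard], i: int, j: int, label: float,
               lr: float = 0.5) -> None:
    """The learner combines the partial results into one global coefficient."""
    sigmoid = 1.0 / (1.0 + math.exp(-total_dot(shards, i, j)))
    coeff = lr * (label - sigmoid)
    for s in shards:
        s.adjust(i, j, coeff)


shards = [MatrixShard(num_vectors=4, dims=3, seed=s) for s in (1, 2)]
train_pair(shards, 0, 1, label=1.0)                   # query 0 co-occurs with ad 1
neg = agree_on_negative(shards, step_seed=7, exclude={0, 1})
train_pair(shards, 0, neg, label=0.0)                 # push query 0 away from neg
```

Only scalars (partial dot products and one coefficient) travel between the learner and the servers. The vectors themselves never leave the shard that stores them.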
  15. From these use cases, we learned one important lesson. That is, big-data approximate computing can produce more accurate models. In all use cases, we use a set of computers to learn from a dataset, and produce a mathematical model. We want each learner to conduct its learning as fast as it can. We don’t want any synchronization across learners. We even let learners overwrite each other in the shared data model. Each execution may produce a slightly different result. We are performing approximate computing on YARN. At the end, we produce a mathematical model with a large # of parameters. Since this model represents the signals from a massive amount of data, our model is more accurate than the previous models built from precise computing.
  16. In summary, Yahoo has made significant progress on scalable machine learning. We conduct daily training w/ billions of signals for our critical businesses such as search and advertisement. Hadoop and YARN are playing a central role in this evolution. On YARN clusters, we built a framework for approximate computing. We are currently exploring both GPU and CPU in a single cluster.