Machine Learning Systems at Scale: Building Reliable ML Infrastructure


OpenAI is a non-profit research company, discovering and enacting the path to safe artificial general intelligence. As part of our work, we regularly push the limits of scalability in cutting-edge ML algorithms. We’ve found that in many cases, designing the systems we build around the core algorithms is as important as designing the algorithms themselves. This means that many systems engineering areas, such as distributed computing, networking, and orchestration, are crucial for machine learning to succeed on large problems requiring thousands of computers. As a result, engineers and researchers at OpenAI work closely together to build these large systems, rather than following a strict researcher/engineer split. In this talk, we will go over some of the lessons we’ve learned, and how they come together in the design and internals of our system for learning-based robotics research.

Bio: Jonas leads technology development for OpenAI’s robotics group, developing methods to apply machine learning and AI to robots. He also helped build the infrastructure to scale OpenAI’s distributed ML systems to thousands of machines.


Machine Learning
Systems at Scale
MLconf San Francisco
Jonas Schneider
November 10th, 2017

OpenAI
Non-profit research lab
Goal: ensure AGI is good for humanity
Teams: Robotics, Dota, basic research, …

Robots that Learn
https://blog.openai.com/robots-that-learn/

What’s in a ML system?
ML core
(e.g. PPO, A3C, …)

Data munging
Compute infra
Networking
Observability
Tooling
Regression tests
ML core (e.g. PPO, A3C, …)
Deployment/Inference
Storage
Orchestration

Example: Orchestration
Kubernetes
Azure
Our Model

Example: Orchestration
Kubernetes on Azure: Our Model
Kubernetes on GCE: Our Model
Kubernetes on on-premises hardware: Our Model

Scriptable infrastructure
exp = Experiment()
exp.add_parameter_server()
for i in range(NUM_WORKERS):
    exp.add_tensorflow_worker(my_tf_graph, cpu=24, gpu=4)
exp.run(mode='kube')  # or 'docker'
https://blog.openai.com/infrastructure-for-deep-learning/
“Building the Infrastructure that powers the future of AI”, KubeCon 2017
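
The snippet above relies on an Experiment class that the slides do not define. As a rough illustration only, here is a minimal stand-in that records job specifications and renders them for a chosen backend. Everything in it (the Job dataclass, the string output of run, the accepted modes beyond 'kube' and 'docker') is a hypothetical assumption for this sketch, not OpenAI's actual tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    """Hypothetical description of one process in an experiment."""
    role: str      # "ps" (parameter server) or "worker"
    cpu: int = 1
    gpu: int = 0


@dataclass
class Experiment:
    """Hypothetical stand-in: collects job specs, then 'runs' them.

    A real system would submit these specs to Kubernetes or Docker;
    this sketch only returns a description of what would be launched.
    """
    jobs: List[Job] = field(default_factory=list)

    def add_parameter_server(self) -> None:
        self.jobs.append(Job(role="ps"))

    def add_tensorflow_worker(self, graph: object, cpu: int, gpu: int) -> None:
        # The graph argument is accepted to mirror the slide; it is unused here.
        self.jobs.append(Job(role="worker", cpu=cpu, gpu=gpu))

    def run(self, mode: str) -> List[str]:
        if mode not in ("kube", "docker"):
            raise ValueError(f"unsupported mode: {mode}")
        return [
            f"{mode}:{job.role}-{index} cpu={job.cpu} gpu={job.gpu}"
            for index, job in enumerate(self.jobs)
        ]


NUM_WORKERS = 2
my_tf_graph = object()  # placeholder for a TensorFlow graph

exp = Experiment()
exp.add_parameter_server()
for i in range(NUM_WORKERS):
    exp.add_tensorflow_worker(my_tf_graph, cpu=24, gpu=4)
launched = exp.run(mode="kube")  # or "docker"
```

The point of the pattern is that the same experiment description can target different backends by changing a single argument.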

Instead of: Research, Engineering
Think: Systems, Algorithms
TRPO, PPO, DQN, ES, ?
https://blog.openai.com/evolution-strategies/
https://blog.openai.com/openai-baselines-ppo/

How to scale RL?
Supervised learning: gradient averaging
Large batch sizes fix many problems
Turns out, it works for reinforcement learning too
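
Gradient averaging, as named on this slide, can be sketched in a few lines. The toy example below is an assumption-laden illustration, not the talk's implementation: it uses a one-parameter least-squares model in pure Python, and all names (gradient, averaged_gradient, shards) are hypothetical. Each simulated worker computes a gradient on its own shard, and the averaged result drives a single update.

```python
from typing import List, Sequence, Tuple

Example = Tuple[float, float]  # (x, y) pair


def gradient(w: float, batch: Sequence[Example]) -> float:
    """Gradient of the mean of 0.5 * (w*x - y)^2 with respect to w."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def averaged_gradient(w: float, shards: List[Sequence[Example]]) -> float:
    """Each worker computes a gradient on its shard; the results are averaged."""
    per_worker = [gradient(w, shard) for shard in shards]
    return sum(per_worker) / len(per_worker)


# Data generated by y = 2x, split into two equally sized worker shards.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]

w = 0.0
for _ in range(100):
    w -= 0.05 * averaged_gradient(w, shards)  # one synchronous update per step
```

With equally sized shards, the averaged gradient equals the gradient over the combined batch, which is why adding workers behaves like increasing the batch size.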

Example: DDPG+HER
optimizer
worker worker worker
evaluator

Know your stack
CUDA bindings
TF Graph Language
Distributed TF
TensorFlow

Know your stack
CUDA bindings
TF Graph Language
Distributed TF
Seems fast until you see PyTorch
Performance issues on plain Ethernet
Nice design, takes getting used to

TensorFlow++
One of our stacks
CUDA bindings
TF Graph Language
MPI + Redis
Custom Ops

Track performance
https://blog.openai.com/more-on-dota-2/
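
One simple way to act on tracked performance is to compare each run against a stored baseline and flag large drops. The sketch below is a generic illustration under stated assumptions: the metric names, the 5% tolerance, and the find_regressions helper are all hypothetical and are not taken from the talk.

```python
from typing import Dict, List


def find_regressions(
    baseline: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.05,
) -> List[str]:
    """Return metrics that fell more than `tolerance` below their baseline.

    Assumes every metric is positive and that higher values are better.
    Metrics missing from `current` are skipped rather than flagged.
    """
    regressed = []
    for name, base in baseline.items():
        value = current.get(name)
        if value is None:
            continue
        if value < base * (1.0 - tolerance):
            regressed.append(name)
    return sorted(regressed)


# Hypothetical numbers for illustration only.
baseline = {"samples_per_sec": 1000.0, "mean_reward": 50.0}
current = {"samples_per_sec": 900.0, "mean_reward": 49.0}
regressions = find_regressions(baseline, current)
```

Running a check like this on every change makes a slowdown visible when it is introduced rather than weeks later.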

1. Hire a team of diverse skills.
2. Think about the entire system.
3. Track your performance.

Thanks!
Interested in working at OpenAI? Ping jonas@openai.com!

Machine Learning Systems at Scale: Building Reliable ML Infrastructure

1. Machine Learning Systems at Scale MLconf San Francisco Jonas Schneider November 10th, 2017

2. OpenAI Non-profit research lab Goal: ensure AGI is good for humanity Teams: Robotics, Dota, basic research, …

3. Robots that Learn https://blog.openai.com/robots-that-learn/

4. Dota 2 https://blog.openai.com/dota-2/

5. What’s in a ML system? ML core (e.g. PPO, A3C, …)

6. What’s in a ML system? ML core (e.g. PPO, A3C, …)

7. Data munging Compute infra Networking Observability Tooling Regression tests ML core (e.g. PPO, A3C, …) Deployment/ Inference Storage Orchestration

8. Data munging Compute infra Networking Observability Tooling Regression tests ML core (e.g. PPO, A3C, …) Deployment/ Inference Storage Orchestration

9. Example: Orchestration Kubernetes Azure Our Model

10. Kubernetes Azure Kubernetes GCE Kubernetes On-Premises Hardware Our Model Our Model Our Model Example: Orchestration

11. Scriptable infrastructure exp = Experiment() exp.add_parameter_server() for i in range(NUM_WORKERS): exp.add_tensorflow_worker(my_tf_graph, cpu=24, gpu=4) exp.run(mode='kube') # or 'docker' https://blog.openai.com/infrastructure-for-deep-learning/ “Building the Infrastructure that powers the future of AI”, KubeCon 2017

12. Think: Instead of: Research Engineering

13. Think: Instead of: Research Engineering Systems Algorithms TRPO PPO DQN ES ? https://blog.openai.com/evolution-strategies/ https://blog.openai.com/openai-baselines-ppo/

14. How to scale RL? Supervised learning: gradient averaging Large batch sizes fix many problems Turns out, it works for reinforcement learning too

15. Example: DDPG+HER optimizer worker worker worker evaluator

16. 1. Scale your models 2. Scale your team

17. Know your stack CUDA bindings TF Graph Language Distributed TF TensorFlow

18. Know your stack CUDA bindings TF Graph Language Distributed TF Seems fast until you see PyTorch Performance issues on plain Ethernet Nice design, takes getting used to

19. TensorFlow++ One of our stacks CUDA bindings TF Graph Language MPI + Redis Custom Ops

20. Track performance https://blog.openai.com/more-on-dota-2/

21. Track regressions

22. If OpenAI can do it…

23. 1. Hire a team of diverse skills. 2. Think about the entire system. 3. Track your performance.

24. Thanks! Interested in working at OpenAI? Ping jonas@openai.com!
