
Facebook ML Infrastructure - 2018 slides


  1. ML at Facebook: An Infrastructure View (Yangqing Jia, Director, Facebook AI Infra)
  2. (* This is not Facebook AI Infra)
  3. The Machine Learning Moore’s Law? [Chart: number of citations per year, 2001–2017]
  4. Machine Learning Execution Flow: Data → Features → Training → Evaluation (offline) → Inference (online)
  5. Machine Learning Execution Flow: Data → Features → Training → Eval → Model → Inference → Predictions
  6. It’s an infrastructure challenge: Data (storage challenges!), Offline Training (compute challenges!), and Online Inference, with network challenges connecting them
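The execution flow on this slide can be sketched as plain functions, one per stage. This is a hedged, illustrative sketch only: the stage names mirror the slide, but the tiny closed-form least-squares "model" is a stand-in, not anything Facebook describes.

```python
# Minimal sketch of Data -> Features -> Training -> Eval -> Model ->
# Inference -> Predictions. The 1-D least-squares model is illustrative.

def extract_features(raw):
    # Features: pair each raw value with a bias term.
    return [(x, 1.0) for x in raw]

def infer(model, feature):
    # Inference: Model -> Prediction for one example.
    x, _ = feature
    return model["w"] * x + model["b"]

def train(features, labels):
    # Training: closed-form least squares for y = w*x + b.
    n = len(features)
    xs = [f[0] for f in features]
    mx, my = sum(xs) / n, sum(labels) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, labels))
    var = sum((x - mx) ** 2 for x in xs)
    w = cov / var
    return {"w": w, "b": my - w * mx}  # the Model

def evaluate(model, features, labels):
    # Eval: mean squared error of predictions against labels.
    preds = [infer(model, f) for f in features]
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)

data = [1.0, 2.0, 3.0, 4.0]
labels = [2.0, 4.0, 6.0, 8.0]          # y = 2x
feats = extract_features(data)
model = train(feats, labels)
mse = evaluate(model, feats, labels)
```

In production the same stage boundaries exist, but each stage runs on different hardware and services, which is the point the following slides develop.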
  7. Let’s Answer Some Pressing Questions • Does Facebook leverage machine learning? • Does Facebook design hardware? • Does Facebook design hardware for machine learning? • What platforms and frameworks exist; can the community use them? • What assumptions break when supporting 2B people?
  8. Does Facebook Use Machine Learning? Classification & ranking services: News Feed, Ads, Search, Language Translation, Sigma, Facer, Lumos, Speech Recognition, Content Understanding
  9. What ML Models Do We Leverage? SVM (Support Vector Machines), GBDT (Gradient-Boosted Decision Trees), MLP (Multi-Layer Perceptron), CNN (Convolutional Neural Nets), RNN (Recurrent Neural Nets), powering services including Sigma, Facer, News Feed, Ads, Search, Lumos, Language Translation, Content Understanding, and Speech Recognition
  10. How Often Do We Train Models? (ranging from minutes to hours, days, and months, depending on the service)
  11. How Long Does Training Take? (seconds to minutes, hours, or days)
  12. How Much Compute Does Inference Consume? (roughly 1x to 10x to 100x, varying by service)
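Of the model families listed, the multi-layer perceptron is the simplest to make concrete. A minimal forward pass is sketched below; the weights and the 2→2→1 shape are purely illustrative, not any production model.

```python
import math

def mlp_forward(x, layers):
    # layers: list of (weight_matrix, bias_vector) pairs.
    # ReLU between layers, sigmoid on the output (a common scoring setup).
    for i, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bi
             for row, bi in zip(W, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]             # ReLU hidden activation
    return [1.0 / (1.0 + math.exp(-v)) for v in x]   # sigmoid output

# Illustrative 2-input, 2-hidden-unit, 1-output network.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),  # hidden layer
    ([[1.0, 1.0]], [0.0]),                    # output layer
]
score = mlp_forward([2.0, 1.0], layers)
```

The dense matrix-vector products in this loop are what makes MLP inference CPU-friendly at serving time, while the other families (CNNs, RNNs) trade structure for heavier compute.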
  13. Does Facebook Design Hardware? • Yes! Since 2010! All designs released through the Open Compute Project. • Facebook server design philosophy: identify a small number of major services with unique resource requirements, and design servers for those major services. [Diagram: a few major server designs in a global shared pool vs. customized, dedicated hardware]
  14. Does Facebook Design Hardware? Yosemite/Twin Lakes: for the web tier and other “stateless” services. Open Compute “sleds” are 2U x 3 across in an Open Compute rack. [Diagram: server card, chassis, 4-way shared NIC, SSD boot, rack]
  15. Does Facebook Design Hardware? Tioga Pass: for compute- or memory-intensive workloads. Bryce Canyon: for storage-heavy workloads.
  16. Does Facebook Design Hardware for AI/ML? • HP SL270s (2013): learning serviceability, thermal, performance, reliability, and cluster management. • Big Sur (M40) → Big Basin (P100) → Big Basin Volta (V100). Big Sur: integrated compute, 8 Nvidia M40 GPUs. Big Basin: JBOG design (CPU headnode), 8 Nvidia P100 / V100 GPUs.
  17. Putting it Together: Data → Features → Training → Evaluation → Inference, mapped onto Bryce Canyon, Big Basin + Tioga Pass, Tioga Pass, and Twin Lakes
  18. Let’s Answer Some Pressing Questions • Does Facebook leverage machine learning? • Does Facebook design hardware? • Does Facebook design hardware for machine learning? • What platforms and frameworks exist; can the community use them? • What assumptions break when supporting 2B people?
  19. Facebook AI Frameworks. Infra efficiency for production: stability, scale & speed, data integration, relatively fixed. Developer efficiency for research: flexible, fast iteration, highly debuggable, less robust.
  20. Facebook AI Frameworks: the same production/research split, backed by NNPACK, OpenGL ES, Metal™/MPSCNN, Qualcomm Snapdragon NPE, and CUDA/cuDNN
  21. Deep Learning Frameworks: framework backends × vendor and numeric libraries (Apple CoreML, Nvidia TensorRT, Intel/Nervana ngraph, Qualcomm SNPE, …) means O(n²) converter pairs
  22. Open Neural Network Exchange (ONNX): a shared model and operator representation between framework backends and vendor/numeric libraries (Apple CoreML, Nvidia TensorRT, Intel/Nervana ngraph, Qualcomm SNPE, …), going from O(n²) to O(n) pairs
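The pairing argument on these two slides can be made concrete with a small counting sketch: without a shared representation, every framework needs a converter to every backend, while with a shared format each side only needs one converter to or from that format. The names below are illustrative, not the real ONNX API.

```python
# Counting converters with and without a shared exchange format.
frameworks = ["caffe2", "pytorch", "tensorflow", "mxnet"]
backends = ["coreml", "tensorrt", "ngraph", "snpe"]

# Without a shared representation: one converter per (framework, backend)
# pair, i.e. O(n*m).
pairwise = {(f, b) for f in frameworks for b in backends}

# With a shared representation (ONNX's role): one exporter per framework
# plus one importer per backend, i.e. O(n + m).
via_shared = ({(f, "onnx") for f in frameworks}
              | {("onnx", b) for b in backends})
```

With 4 frameworks and 4 backends this is already 16 converters versus 8, and the gap widens linearly as either side grows, which is the economic case for a shared format.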
  23. Putting it Together: Data → Features → Training → Evaluation → Inference, with Data APIs, Caffe2 / PyTorch / ONNX, and the C2 Predictor layered over Bryce Canyon, Big Basin + Tioga Pass, Tioga Pass, and Twin Lakes
  24. Facebook AI Ecosystem. Frameworks (core ML software): Caffe2 / PyTorch / ONNX. Platforms (workflow management, deployment): FBLearner. Infrastructure (servers, storage, network strategy): Open Compute Project.
  25. FBLearner Platform: FBLearner Feature Store, FBLearner Flow, FBLearner Predictor • AI workflow • Model management and deployment
  26. FBLearner in ML: Data → Features (FBLearner Feature Store, CPU) → Training (FBLearner Flow, CPU+GPU) → Eval → Model → Inference (FBLearner Predictor, CPU) → Predictions
  27. Putting it All Together: Data → Features → Training → Evaluation → Inference, combining the hardware (Bryce Canyon, Big Basin + Tioga Pass, Tioga Pass, Twin Lakes), the FBLearner platform (Feature Store, FBLearner Flow, FBL Predictor), and the software stack (Data APIs, Caffe2 / PyTorch / ONNX, C2 Predictor)
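Public descriptions of FBLearner Flow center on declaring training pipelines as workflows of reusable operators whose outputs feed later steps. A hedged, minimal sketch of that pattern follows; the `Flow` class and its methods are hypothetical illustrations, not FBLearner's actual API.

```python
# Hypothetical mini "flow" engine: operators are plain functions, and a
# workflow is an ordered list of steps whose outputs feed later steps by name.

class Flow:
    def __init__(self):
        self.steps = []  # list of (output_name, fn, input_names)

    def step(self, output_name, fn, *input_names):
        # Register an operator whose inputs are earlier outputs (by name).
        self.steps.append((output_name, fn, input_names))
        return self

    def run(self, **initial):
        # Execute steps in order, threading named results through.
        results = dict(initial)
        for out, fn, inputs in self.steps:
            results[out] = fn(*[results[i] for i in inputs])
        return results

# Toy operators standing in for feature extraction, training, and eval.
flow = (Flow()
        .step("features", lambda raw: [x * 2 for x in raw], "raw")
        .step("model", lambda feats: sum(feats) / len(feats), "features")
        .step("metric", lambda m: abs(m - 5.0), "model"))

results = flow.run(raw=[1, 2, 3])
```

Declaring the pipeline as named steps is what lets a platform schedule each step on the right hardware (CPU for features and inference, CPU+GPU for training, per the slide) and cache or rerun stages independently.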
  28. What changes when you scale to over 2 billion people?
  29. Scaling Challenges / Opportunities: lots of data, lots of compute
  30. Scaling Opportunity: Free Compute!
  31. [Data center locations:] Santa Clara, California; Ashburn, Virginia; Prineville, Oregon; Forest City, North Carolina; Lulea, Sweden; Altoona, Iowa; Fort Worth, Texas; Clonee, Ireland; Los Lunas, New Mexico; Odense, Denmark; New Albany, Ohio; Papillion, Nebraska
  32. Key Takeaways: Facebook AI spans lots of data, a wide variety of models, full-stack challenges, and global scale
  33. [Credits:] Kim Hazelwood, Sarah Bird, David Brooks, Soumith Chintala, Utku Diril, Dmytro Dzhulgakov, Mohamed Fawzy, Bill Jia, Yangqing Jia, Aditya Kalro, James Law, Kevin Lee, Jason Lu, Pieter Noordhuis, Misha Smelyanskiy, Liang Xiong, Xiaodong Wang
