
Yi Wang, Tech Lead of AI Platform, Baidu, at MLconf 2017

Yi Wang is the tech lead of the AI Platform at Baidu. His team is a primary contributor to PaddlePaddle, the open-source deep learning platform originally developed at Baidu. Before Baidu, he was a founding member of ScaledInference, a Palo Alto-based AI startup. Before that, he was a senior staff engineer at LinkedIn, engineering director of the advertising system at Tencent, and a researcher at Google.

Abstract Summary:

Fault-tolerant Deep Learning on General-purpose Clusters:
Researchers are used to running deep learning jobs on dedicated clusters. In industrial applications, however, AI is built on top of big data, and deep learning is only one stage of the data pipeline. This is where MPI-based clusters fall short: general-purpose cluster management systems are needed to run Web servers like Nginx, log collectors like fluentd and Kafka, data processors on top of Hadoop, Spark, and Storm, and the deep learning jobs that improve the quality of the Web service. This talk explains how we integrate PaddlePaddle and Kubernetes to provide an open-source, fault-tolerant, large-scale deep learning platform.
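To make the integration concrete, here is a minimal sketch of what running trainer replicas under Kubernetes can look like. The Job name, container image, and replica counts are illustrative assumptions; PaddlePaddle's actual manifests and operator differ.

```yaml
# Hypothetical manifest: a batch Job running several trainer replicas.
# Kubernetes restarts failed pods, which is the hook for fault recovery.
apiVersion: batch/v1
kind: Job
metadata:
  name: paddle-trainer             # illustrative name
spec:
  parallelism: 3                   # three trainer pods at once
  completions: 3
  template:
    spec:
      restartPolicy: OnFailure     # crashed trainers are restarted
      containers:
      - name: trainer
        image: example.com/paddle-trainer:latest   # hypothetical image
        resources:
          limits:
            nvidia.com/gpu: 1      # one GPU per trainer pod
```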



  2. A REAL REQUEST FOR AI
     ▸ How to control TV sets via voice?
       ▸ An AI hub? No. An Alexa in each room?
       ▸ An AI API? No. Business owners don't want user behavior data to go to AI tech providers.
       ▸ AI on the cloud? No. GPU instances are too expensive.
       ▸ AI on on-premise clusters? Yes.
  3. Unisound, a PaddlePaddle collaborator, embedded its speech recognition technology in air conditioners, TV sets, and Android-based mirrors in cars.
  4. CLOUD AND ON-PREMISE CLUSTERS [diagram comparing deployment choices: traditional companies and big companies run on-premise clusters; small Internet companies use the cloud]
  5. THE SOLUTION - GENERAL-PURPOSE CLUSTERS [architecture diagram: Kubernetes, a distributed operating system, manages GPU, multi-GPU, and CPU servers; on top of it run nginx and a speech API server facing Internet clients (Web browsers, mobile apps, IoT devices), fluentd and Kafka collecting logs, online and offline data processing over Hadoop HDFS producing labeled data, and a PaddleSpark speech model trainer producing the model]
  6. CHALLENGES - GENERAL-PURPOSE CLUSTERS
     ▸ group replicas of processes into jobs
       ▸ Web services, data processing pipelines, machine learning jobs.
     ▸ service isolation and multi-user support
       ▸ online experiments require the real log data stream, so
       ▸ we run production jobs and experimental jobs on the same cluster.
     ▸ priority-based scheduling
       ▸ a high-priority (production) job can preempt low-priority (experiment) jobs.
     ▸ make full use of hardware
       ▸ e.g., schedule processes of a Hadoop job, which needs network and disk bandwidth, and processes of a deep learning job, which needs GPUs, on the same node.
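The priority-based scheduling point above can be illustrated with a toy Python sketch. The `ToyScheduler` class and its slot model are invented for illustration; they are not Kubernetes or PaddlePaddle code.

```python
import heapq

class ToyScheduler:
    """Toy illustration of priority-based preemption: the cluster has a
    fixed number of slots, and a high-priority job may evict the
    lowest-priority running job when no slot is free."""

    def __init__(self, slots):
        self.slots = slots
        self.running = []  # min-heap of (priority, name); smallest = first victim

    def submit(self, name, priority):
        """Schedule a job; returns the name of a preempted job, or None."""
        if len(self.running) < self.slots:
            heapq.heappush(self.running, (priority, name))
            return None
        lowest_priority, victim = self.running[0]
        if priority > lowest_priority:
            heapq.heapreplace(self.running, (priority, name))
            return victim  # preempted; a fault-tolerant job resumes later
        return None        # no free slot; the job must wait

sched = ToyScheduler(slots=2)
sched.submit("experiment-a", priority=1)
sched.submit("experiment-b", priority=1)
preempted = sched.submit("production-web", priority=10)  # evicts an experiment
```

This mirrors the slide's point: experiment jobs fill spare capacity, but a production job can always claim a slot, and the evicted experiment relies on fault recovery to continue.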
  7. CHALLENGES - FAULT-TOLERANT JOBS
     ▸ auto-scaling
       ▸ there are many active users in the daytime, so the cluster kills processes of deep learning jobs and creates more Web service processes.
       ▸ at night, it kills some Web service processes to run more deep learning processes.
     ▸ fault recovery
       ▸ a job must tolerate a varying number of processes.
     ▸ speedup vs. fault recovery
       ▸ speedup optimizes a job.
       ▸ speedup with fault tolerance optimizes the business.
  8. A PADDLEPADDLE JOB [diagram: a master dispatches tasks to trainers 1-3; each trainer holds local model shards 1/2 and 2/2 and exchanges gradients/model with parameter servers 1 and 2, which hold global model shards 1/2 and 2/2]
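The trainer/parameter-server layout on this slide can be sketched in plain Python. The classes, the plain SGD update, and the toy gradient are illustrative assumptions, not PaddlePaddle's actual API.

```python
# Toy parameter-server layout: the global model is split into shards,
# each owned by one parameter server; every trainer pulls all shards
# and pushes gradients back to the owning server.
# Illustrative only; not PaddlePaddle's real API.

class ParameterServer:
    def __init__(self, shard, lr=0.1):
        self.shard = shard          # this server's slice of the global model
        self.lr = lr

    def apply_gradients(self, grads):
        # plain SGD on the owned shard
        self.shard = [w - self.lr * g for w, g in zip(self.shard, grads)]

class Trainer:
    def __init__(self, servers):
        self.servers = servers

    def step(self):
        # pull: fetch the latest global model (all shards)
        local = [list(s.shard) for s in self.servers]
        # push: pretend the gradient of every weight is the weight itself,
        # i.e. we minimize 0.5 * w^2 and shrink the model toward zero
        for server, shard in zip(self.servers, local):
            server.apply_gradients(shard)

# global model [1.0, 2.0, 3.0, 4.0] split into two shards
servers = [ParameterServer([1.0, 2.0]), ParameterServer([3.0, 4.0])]
trainers = [Trainer(servers) for _ in range(3)]
for _ in range(2):                  # two rounds, trainers stepping in turn
    for t in trainers:
        t.step()
```

Because trainers only talk to the parameter servers, a trainer can crash and rejoin without corrupting the global model, which is what makes the varying-process-count requirement from the previous slide workable.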
  9. AUTO FAULT-RECOVERY [diagram: the masters of jobs A and B keep their task queues (todo, pending, done) in etcd; a task moves from todo to pending when dispatched, from pending to done when completed, and back to todo on timeout]
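The todo/pending/done state machine on this slide can be sketched as a small Python class. The `TaskQueue` name and method signatures are invented for illustration; the real master persists these queues in etcd rather than in memory.

```python
class TaskQueue:
    """Toy sketch of the master's task queues (todo/pending/done) with
    timeout-based recovery, as on the slide."""

    def __init__(self, tasks, timeout):
        self.todo = list(tasks)
        self.pending = {}            # task -> dispatch time
        self.done = []
        self.timeout = timeout

    def dispatch(self, now):
        # todo -> pending: hand the next task to some trainer
        task = self.todo.pop(0)
        self.pending[task] = now
        return task

    def complete(self, task):
        # pending -> done: the trainer reported success
        del self.pending[task]
        self.done.append(task)

    def recover(self, now):
        # pending -> todo: re-queue tasks whose trainer likely crashed
        for task, started in list(self.pending.items()):
            if now - started > self.timeout:
                del self.pending[task]
                self.todo.append(task)

q = TaskQueue(["task1", "task2"], timeout=5)
q.dispatch(now=0)            # task1 goes to a trainer
q.dispatch(now=1)            # task2 goes to another trainer
q.complete("task2")          # that trainer finished
q.recover(now=10)            # task1's trainer timed out; task1 back to todo
```

No task is ever lost: it either reaches done or returns to todo, so a job keeps making progress while the cluster adds and removes trainer processes.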
  10. KEEP OPEN
     ▸ Thanks to the Kubernetes community for their expertise in distributed computing and their code reviews.
     ▸ We hope to see more traditional industries run their whole business on their on-premise clusters.
     ▸ PaddlePaddle will stay open.
     ▸ We are working on open-sourcing more AI technologies based on PaddlePaddle.