Yi Wang is the tech lead of AI Platform at Baidu. The team is a primary contributor to PaddlePaddle, the open-source deep learning platform originally developed at Baidu. Before Baidu, he was a founding member of ScaledInference, a Palo Alto-based AI startup. Before that, he was a senior staff member at LinkedIn, engineering director of the advertising system at Tencent, and a researcher at Google.
Fault-tolerable Deep Learning on General-purpose Clusters:
Researchers are used to running deep learning jobs on clusters. In industrial applications, however, AI is built on top of big data, and deep learning is only one stage of the data pipeline. Here MPI-based clusters are not enough: a general-purpose cluster management system is needed to run Web servers like Nginx, log collectors like fluentd and Kafka, data processors built on Hadoop, Spark, and Storm, and the deep learning jobs that improve the quality of the Web service. This talk explains how we integrate PaddlePaddle and Kubernetes to provide an open-source, fault-tolerant, large-scale deep learning platform.