Distributed Machine Learning: 1. A New Era

Distributed Machine Learning
Yi Wang

Story Outline
• Use existing frameworks (2007~2010)

• Methods: Frequent itemset mining, Collaborative filtering, Spectral clustering, Graph partitioning, Restricted Boltzmann machine, Latent topic modeling

• Frameworks: MPI, MapReduce, Pregel, GBR

• Developing frameworks (2010~2014)

• MapReduce Lite (C++) for language models

• Peacock (Go) for latent topic modeling

Lessons
• Internet services rely on machine intelligence

• Intelligence comes from learning users’ behavior

• Value lies in long tails

• It is more about big than fast

• Good system = good algorithm + good architecture

• More about engineering than math

• It is an Industrial Revolution!

Pitfalls
• De-noise data

• Parallelize models in papers and textbooks

• Use existing frameworks

• MPI

• Mix frameworks with cluster operating systems

• Less talking about production

• Use standard measures

• Java or Python

Environment
• Balance business with killing tech development

• Separated

• Combined

• Switching

• Standalone business

• ML software

• ML frameworks

• MLaaP or MLaaS


Long-tail is scale-free.

Mean and median make no sense with long-tail distributions.

Pie-charts make no sense either.
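To make the point about mean and median concrete, here is a minimal Go sketch. It assumes a Zipf-like tail in which the site of rank r gets about top/r visits; the function names and the value of top are illustrative and not taken from the slides. Under that assumption the mean is pulled up by a handful of head sites, the median sits deep in the tail, and both move when more of the tail is observed, so neither describes a "typical" site.

```go
package main

import (
	"fmt"
	"sort"
)

// zipfCounts returns hypothetical visit counts for n sites, where the site
// of rank r receives top/r visits (an assumed Zipf-like long tail).
func zipfCounts(n int, top float64) []float64 {
	counts := make([]float64, n)
	for r := 1; r <= n; r++ {
		counts[r-1] = top / float64(r)
	}
	return counts
}

// mean returns the arithmetic mean of xs.
func mean(xs []float64) float64 {
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// median returns the median of xs without modifying the input slice.
func median(xs []float64) float64 {
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	m := len(s) / 2
	if len(s)%2 == 1 {
		return s[m]
	}
	return (s[m-1] + s[m]) / 2
}

// shareAbove returns the fraction of values strictly greater than t.
func shareAbove(xs []float64, t float64) float64 {
	n := 0
	for _, x := range xs {
		if x > t {
			n++
		}
	}
	return float64(n) / float64(len(xs))
}

func main() {
	// Observing a longer tail changes both summary statistics.
	for _, n := range []int{1000, 100000} {
		c := zipfCounts(n, 1e6)
		m := mean(c)
		fmt.Printf("sites=%d mean=%.1f median=%.1f share above mean=%.4f\n",
			n, m, median(c), shareAbove(c, m))
	}
}
```

In this sketch only a small fraction of sites lie above the mean, which is one way to see why a single average misrepresents a long-tailed population.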


application   PageRank    Indexing     pCTR    DNN
framework     Pregel      MapReduce    SETI    DistBelief
middleware    Chubby (Zookeeper, etcd), Bigtable (HBase), memcachg (memcached)
cluster OS    Borg (Mesos, YARN, Kubernetes)
filesystem    GFS (HDFS)

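// rpc.Call and extract are the slide's placeholders: rpc.Call is assumed to
// start a request and return a channel that delivers the reply.
var resp *Response
select {
case b := <-rpc.Call("B"):
	resp = extract(b)
case c := <-rpc.Call("C"):
	resp = extract(c)
case e := <-rpc.Call("E"):
	resp = extract(e)
case <-time.After(1 * time.Second): // give up after one second
	resp = nil
}
// use resp here.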
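The snippet above depends on an rpc.Call that returns a channel, which the slides do not define. The following is a self-contained sketch of the same pattern (query several replicas, take the first reply, give up on timeout) that runs with only the standard library. The Response type, the call helper that simulates a backend with a sleep, and the delays in demoDelays are all assumptions made for illustration, not the author's code.

```go
package main

import (
	"fmt"
	"time"
)

// Response is a hypothetical reply type standing in for the slide's Response.
type Response struct {
	From string
}

// call is a hypothetical stand-in for rpc.Call: it simulates a backend that
// answers after the given delay and returns a channel carrying the reply.
func call(name string, delay time.Duration) <-chan *Response {
	ch := make(chan *Response, 1) // buffered so a late reply never blocks its goroutine
	go func() {
		time.Sleep(delay)
		ch <- &Response{From: name}
	}()
	return ch
}

// demoDelays returns made-up latencies for the three simulated backends.
func demoDelays() map[string]time.Duration {
	return map[string]time.Duration{
		"B": 300 * time.Millisecond,
		"C": 20 * time.Millisecond,
		"E": 500 * time.Millisecond,
	}
}

// firstReply queries B, C and E concurrently and returns whichever answers
// first, or nil if none answers before the timeout.
func firstReply(delays map[string]time.Duration, timeout time.Duration) *Response {
	b := call("B", delays["B"])
	c := call("C", delays["C"])
	e := call("E", delays["E"])
	select {
	case r := <-b:
		return r
	case r := <-c:
		return r
	case r := <-e:
		return r
	case <-time.After(timeout):
		return nil
	}
}

func main() {
	if r := firstReply(demoDelays(), time.Second); r != nil {
		fmt.Println("first reply from", r.From)
	}
	if r := firstReply(demoDelays(), time.Millisecond); r == nil {
		fmt.Println("timed out")
	}
}
```

All three requests are started before the select, so the slowest backend never delays the answer; the buffered channels let the losing goroutines finish and be collected instead of blocking forever.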
