1. An Open Source Machine Learning Server for Developers
@PredictionIO #PredictionIO
Simon Chan
simon@prediction.io
2. Thank you for having me here today!
• Simon Chan - CEO of PredictionIO
• A small team of Data Scientists and Engineers
• Mainly based in Silicon Valley, with teams also in London and Hong Kong
3. A Top Open Source Project on GitHub
• Over 5000 developers engaged
• Powering over 200 applications
4. Talk Focus:
• Machine Learning - A (Very) Brief Review
• Challenges We Face When Building PredictionIO
15. Needs:
•Support multiple distributed engines
•Support multiple algorithms executing in parallel
How to coordinate the workflow when you have
more pending tasks than processing units?
16. Attempt #1
Use a database system to store tasks, and
have a pool of workers pull tasks from it.
•Inefficient: the database becomes a bottleneck
and a potential single point of failure (a minimal sketch follows).
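A minimal Scala sketch of this pattern, with a hypothetical TaskStore standing in for the database; every idle worker has to keep polling the same store, which is why it becomes a bottleneck and a single point of failure:

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

case class Task(id: Long, payload: String)

// Hypothetical store backed by a database table of pending tasks.
trait TaskStore {
  def claimNext(): Option[Task]   // atomically mark one row as claimed
  def complete(task: Task): Unit
}

class WorkerPool(store: TaskStore, size: Int) {
  private implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(size))

  def start(): Unit = (1 to size).foreach { _ =>
    Future {
      while (true) {
        store.claimNext() match {
          case Some(task) => store.complete(task) // real processing would go here
          case None       => Thread.sleep(500)    // idle workers re-poll the database
        }
      }
    }
  }
}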
17. Attempt #2
Use an Akka cluster.
Akka is a toolkit and runtime for building highly
concurrent, distributed, and fault tolerant event-driven
applications on the JVM.
•Fundamentally the same coordination problem as Attempt #1.
•Need to build a task-management suite on top (see the sketch below).
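For comparison, a bare-bones classic-Akka worker (a sketch, not PredictionIO code). The actors give you concurrency and fault tolerance, but tracking pending tasks, retries, and scheduling still has to be built by hand on top:

import akka.actor.{Actor, ActorSystem, Props}

case class Task(id: Long, payload: String)

class Worker extends Actor {
  def receive = {
    case Task(id, _) =>
      // Process the task; nothing here tracks pending work or failures.
      println(s"worker ${self.path.name} finished task $id")
  }
}

object Main extends App {
  val system = ActorSystem("workers")
  val worker = system.actorOf(Props[Worker], "worker-1")
  (1L to 5L).foreach(id => worker ! Task(id, "train-model"))
}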
18. Solution
Apache Spark: directed acyclic graph
(DAG) scheduling
Adapts to many different infrastructures: an Apache
Spark standalone cluster, Apache Hadoop 2 YARN,
and Apache Mesos.
Source: http://upload.wikimedia.org/wikipedia/commons/3/39/Directed_acyclic_graph_3.svg
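A small Spark sketch of the idea (input path and job are illustrative). Transformations only add nodes to the DAG; the final action hands the whole graph to Spark's scheduler, which splits it into stages and queues tasks onto whatever processing units the cluster manager provides:

import org.apache.spark.{SparkConf, SparkContext}

object DagExample extends App {
  val sc = new SparkContext(new SparkConf().setAppName("dag-example"))

  // Each transformation is lazy: it only adds a node to the DAG.
  val events = sc.textFile("hdfs:///events")               // illustrative input
  val byKey  = events.groupBy(line => line.split(",")(0))  // shuffle boundary: new stage
  val sizes  = byKey.map { case (k, vs) => (k, vs.size) }

  // The action triggers the DAG scheduler; the same code runs unchanged on a
  // standalone cluster, on YARN, or on Mesos.
  sizes.saveAsTextFile("hdfs:///event-counts")

  sc.stop()
}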
21. Needs:
•Engines produce models that are
distributed across a cluster.
•Requires a way to serve these distributed
in-memory models to queries in real time.
22. Solution
All PredictionIO engine instances are launched
inside a “SparkContext”.
A SparkContext represents the connection to a
Spark cluster, and can be used to create RDDs,
accumulators and broadcast variables on that
cluster.
Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png
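That definition in action, as a script-style sketch (the cluster URL is illustrative; the accumulator call is the Spark 1.x API of the talk's era):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("engine-instance")
  .setMaster("spark://master:7077")     // illustrative cluster URL

val sc = new SparkContext(conf)         // the connection to the Spark cluster

val numbers = sc.parallelize(1 to 1000) // an RDD created on that cluster
val factor  = sc.broadcast(10)          // a read-only broadcast variable
val total   = sc.accumulator(0)         // an accumulator (Spark 1.x API)

numbers.foreach(n => total += n * factor.value)
println(total.value)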
23. •When an engine is local to a single
machine, it loads the model into its own memory.
•When an engine is distributed, the
SparkContext automatically loads the
model across the cluster.
24. Conceptual Code for the Solution
val sc = new SparkContext(conf)
...
val model =
  if (model_is_distributed) {
    if (model_is_persisted) {
      // Load the previously persisted model from HDFS as a distributed RDD
      sc.objectFile(model_on_HDFS)
    } else {
      // No persisted copy: train a new distributed model on the cluster
      engine.algo.train()
    }
  } else {
    ...
  }
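Connecting this back to the real-time serving need from slide 21, a rough sketch (not PredictionIO's actual serving API), assuming the model is a pair RDD of (itemId, score) and the Spark 1.x imports of the talk's era:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Keep the distributed model in cluster memory and answer each query
// against it, without collecting the whole model onto one machine.
def makeServe(model: RDD[(String, Double)]): String => Seq[Double] = {
  model.cache()
  // If the model RDD has a known partitioner, lookup only touches the
  // partition that can contain the key; otherwise it scans in parallel.
  (itemId: String) => model.lookup(itemId)
}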