An Open Source Machine Learning Server 
for Developers 
@PredictionIO #PredictionIO 
Simon Chan 
simon@prediction.io
Thank you for having me here today! 
• Simon Chan - CEO of PredictionIO 
• A small team of Data Scientists and Engineers 
• Mainly based in Silicon Valley, also London and Hong Kong
Top GitHub Open Source
• Over 5000 developers engaged 
• Powering over 200 applications
Talk Focus: 
• Machine Learning - A (Very) Brief Review 
• Challenges We Face When Building PredictionIO
Machine Learning is Simple?
I am going to give an 
example that will make 
you… HUNGRY!
FOOD Club – Menu
Coding time…. 
# Using PredictionIO 
# Collect Data 
cli = predictionio.EventClient("<my_app_id>") 
cli.record_user_action_on_item("buy", "John", "BulgogiA")
# Predict top preferences 
eng = predictionio.EngineClient("<my_engine_url>") 
rec = eng.send_query({"uid" : "John", "n" : 5})
The Magic Behind: Engine 
1. Data Sourcing and Preparation 
2. Algorithm 
3. Serving 
4. Evaluation
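To make the four stages concrete, here is a minimal sketch in plain Python. It is not PredictionIO's engine API: `ToyEngine`, its method names, the popularity-count "algorithm", and the sample items other than "BulgogiA" are all hypothetical, chosen only to show how the stages hand data to one another.

```python
from collections import Counter


class ToyEngine:
    """Hypothetical four-stage engine; not PredictionIO's actual API."""

    def __init__(self, events):
        # events: list of (user, action, item) tuples
        self.events = events
        self.model = None

    def prepare(self):
        # 1. Data sourcing and preparation: keep only "buy" events.
        return [(u, i) for (u, a, i) in self.events if a == "buy"]

    def train(self):
        # 2. Algorithm: count purchases per item (a simple popularity model).
        self.model = Counter(item for _, item in self.prepare())

    def serve(self, query):
        # 3. Serving: return the top-n items the user has not bought yet.
        seen = {i for (u, i) in self.prepare() if u == query["uid"]}
        ranked = [i for i, _ in self.model.most_common() if i not in seen]
        return ranked[: query["n"]]

    def evaluate(self, held_out, n):
        # 4. Evaluation: fraction of held-out purchases found in the top-n.
        hits = sum(
            1 for (u, i) in held_out if i in self.serve({"uid": u, "n": n})
        )
        return hits / len(held_out)


# Made-up sample events for illustration only.
events = [
    ("John", "buy", "BulgogiA"),
    ("Mary", "buy", "BulgogiA"),
    ("Mary", "buy", "BibimbapB"),
    ("Alex", "buy", "BibimbapB"),
    ("Alex", "view", "KimchiC"),
]
engine = ToyEngine(events)
engine.train()
recommendations = engine.serve({"uid": "John", "n": 5})
hit_rate = engine.evaluate([("John", "BibimbapB")], n=1)
```

The query dictionary uses the same `uid` and `n` keys as the client call shown earlier, so the serving stage lines up with what an application would send.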
Challenges and Solutions
Architectural Challenge 1 
Workflow Co-ordination on a Distributed Cluster
Needs: 
•Support multiple distributed engines 
•Support multiple algorithms to execute in parallel 
How to coordinate the workflow when you have 
more pending tasks than processing units?
Attempt #1 
Use a database system to store tasks, and 
have a pool of workers pull tasks from it. 
•Inefficient: the database becomes a bottleneck
and a potential single point of failure.
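A rough sketch of this first attempt, assuming an in-memory SQLite table stands in for the task database and four threads stand in for the worker pool. The table layout, `claim_next_task`, and the worker names are hypothetical, and a task is marked done as soon as it is claimed because the real work is left out. Every claim has to pass through the same lock and the same database, which is the bottleneck the slide describes.

```python
import sqlite3
import threading

# One shared in-memory database plays the role of the central task store.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute(
    "CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT, worker TEXT)"
)
db.executemany("INSERT INTO tasks (status) VALUES (?)", [("pending",)] * 20)
db.commit()

db_lock = threading.Lock()  # every worker must pass through this one lock


def claim_next_task(worker_name):
    """Atomically claim one pending task, or return None if none remain."""
    with db_lock:
        row = db.execute(
            "SELECT id FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # Simplification: the task is marked done at claim time.
        db.execute(
            "UPDATE tasks SET status = 'done', worker = ? WHERE id = ?",
            (worker_name, row[0]),
        )
        db.commit()
        return row[0]


def worker(worker_name, claimed):
    # Keep pulling tasks until the store is empty.
    while True:
        task_id = claim_next_task(worker_name)
        if task_id is None:
            break
        claimed.append(task_id)  # list.append is thread-safe in CPython


claimed_ids = []
threads = [
    threading.Thread(target=worker, args=(f"w{i}", claimed_ids))
    for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

pending_left = db.execute(
    "SELECT COUNT(*) FROM tasks WHERE status = 'pending'"
).fetchone()[0]
```

Adding more workers does not help here, since each one still waits on the single store, and if the store goes down no worker can make progress.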
Attempt #2 
Use an Akka cluster. 
Akka is a toolkit and runtime for building highly 
concurrent, distributed, and fault tolerant event-driven 
applications on the JVM. 
•Fundamentally the same problem as above.
•Need to build management suite on top.
Solution 
Apache Spark: directed acyclic graph 
(DAG) scheduling 
Adapts to many different infrastructures:
Apache Spark standalone cluster, Apache 
Hadoop 2 YARN, Apache Mesos. 
Source: http://upload.wikimedia.org/wikipedia/commons/3/39/Directed_acyclic_graph_3.svg
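The idea behind DAG scheduling can be sketched in a few lines of standard-library Python. This is only an illustration of the concept, not how Spark's scheduler is implemented: the task names, the `run_dag` helper, and the two-thread pool are assumptions. Tasks are released only when their dependencies have finished, and the pool size caps how many run at once, which covers the case of more pending tasks than processing units.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: each task maps to the set of tasks it depends on.
dag = {
    "prepare_data": set(),
    "train_algo_a": {"prepare_data"},
    "train_algo_b": {"prepare_data"},
    "evaluate": {"train_algo_a", "train_algo_b"},
}


def run_dag(dag, max_workers=2):
    """Run tasks in dependency order using at most max_workers threads."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    finished_order = []
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Submit every task whose dependencies are all done.
            for task in sorter.get_ready():
                # The "work" here just returns the task name.
                running[pool.submit(lambda name=task: name)] = task
            # Block until at least one running task completes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                task = running.pop(future)
                finished_order.append(future.result())
                sorter.done(task)  # unlocks tasks that depended on it
    return finished_order


order = run_dag(dag)
```

The two training tasks may finish in either order, but `prepare_data` always runs first and `evaluate` always runs last.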
Solution Source Code: 
http://github.com/predictionio
Architectural Challenge 2 
Distributed In-memory Model Retrieval
Needs: 
•Engines produce models that are 
distributed across a cluster. 
Requires a way to serve these distributed 
in-memory models to queries in real-time.
Solution 
All PredictionIO engine instances are launched 
inside a “SparkContext”. 
A SparkContext represents the connection to a 
Spark cluster, and can be used to create RDDs, 
accumulators and broadcast variables on that 
cluster. 
Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png
•When an engine is local to a single 
machine, it loads the model to its memory. 
•When an engine is distributed, 
SparkContext will automatically load the 
model on a cluster.
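Conceptual Code for the Solution
import org.apache.spark.SparkContext

val sc = new SparkContext(conf)
...
val model =
  if (model_is_distributed) {
    if (model_is_persisted) {
      // Load the previously saved model back onto the cluster as an RDD
      sc.objectFile(model_on_HDFS)
    } else {
      // Nothing persisted: train again to rebuild the model
      engine.algo.train()
    }
  } else {
    // Local model: loaded into this machine's memory
    ...
  }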
PredictionIO 0.8
Built-in Engines: 
•Item Recommendation 
•Item Rank 
•Item Similarity
Create an Engine Instance Project…. 
$ pio instance io.prediction.engines.itemrec 
$ cd io.prediction.engines.itemrec 
$ pio register
Collect Event Data…. 
cli = predictionio.EventClient("<app_id>") 
cli.record_user_action_on_item("like", "John", "bulgogi_12")
cli.record_user_action_on_item("view", "John", "bimbimbap_13")
Configure the Engine Instance settings
in params/datasource.json 
{ 
"appId": <app_id>, 
"actions": [ 
"view", "like", ... 
], ... 
}
Train the Data Model 
$ pio train 
Deploy the Engine Instance 
$ pio deploy
Retrieve Prediction Results 
from predictionio import EngineClient 
client = EngineClient(url="http://localhost:8000") 
prediction = client.send_query({"uid": "John", "n": 3}) 
print prediction 
Output 
{u'items': [{u'272': 9.929327011108398},
            {u'313': 9.92607593536377},
            {u'347': 9.92170524597168}]}
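The response is a list of single-entry dictionaries, each mapping an item ID to its score. A small sketch of turning it into ranked pairs follows. The values are copied from the sample output above with the Python 2 `u` prefixes dropped; `ranked_items` is a hypothetical helper, not part of the SDK.

```python
# Response copied from the sample output (Python 2 `u` prefixes dropped).
prediction = {
    "items": [
        {"272": 9.929327011108398},
        {"313": 9.92607593536377},
        {"347": 9.92170524597168},
    ]
}


def ranked_items(prediction):
    """Flatten the single-entry dicts into (item_id, score) pairs, best first."""
    pairs = [
        (item_id, score)
        for entry in prediction["items"]
        for item_id, score in entry.items()
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


top = ranked_items(prediction)
best_item_id, best_score = top[0]
```

Sorting explicitly avoids relying on the server returning items in score order, which the sample suggests but the slide does not state.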
You can also…. 
• Change algorithm 
• Tune algorithm parameters
• Compare and evaluate algorithms
• Add custom business logic
SDKs for: 
• Python 
• Ruby 
• PHP 
• Java / Android
• Scala 
• Node.js 
• iOS 
• Meteor 
• more….
Also, 
build your own Engine!
Applications 
of 
Machine Learning 
Speech Recognition 
Personal Newsfeed 
SPAM Filtering 
Recommendation 
Driverless Car 
Churn Prediction 
Ad Targeting 
Fraud Detection 
Thank you! (감사합니다)
Korean Documentation (Beta)! 
http://docs.prediction.io/kr 
- @PredictionIO 
- prediction.io - Newsletters 
- github.com/predictionio
