1. An Open Source Machine Learning Server for Developers
@PredictionIO #PredictionIO
Simon Chan
simon@prediction.io
2. Thank you for having me here today!
• Simon Chan - CEO of PredictionIO
• A small team of Data Scientists and Engineers
• Mainly based in Silicon Valley, with teams also in London and Hong Kong
3. A Top Open Source Project on GitHub
• Over 5000 developers engaged
• Powering over 200 applications
4. Talk Focus:
• Machine Learning - A (Very) Brief Review
• Challenges We Face When Building PredictionIO
15. Needs:
•Support multiple distributed engines
•Support multiple algorithms executing in parallel
How to coordinate the workflow when you have
more pending tasks than processing units?
16. Attempt #1
Use a database system to store tasks, and
have a pool of workers pull tasks from it.
•Inefficient: the database becomes a bottleneck
and a potential single point of failure (a minimal sketch follows).
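A minimal Scala sketch of this pattern, with a hypothetical TaskStore standing in for the database; every idle worker has to keep polling the same store, which is why it becomes a bottleneck and a single point of failure:

import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

case class Task(id: Long, payload: String)

// Hypothetical store backed by a database table of pending tasks.
trait TaskStore {
  def claimNext(): Option[Task]   // atomically mark one row as claimed
  def complete(task: Task): Unit
}

class WorkerPool(store: TaskStore, size: Int) {
  private implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutor(Executors.newFixedThreadPool(size))

  def start(): Unit = (1 to size).foreach { _ =>
    Future {
      while (true) {
        store.claimNext() match {
          case Some(task) => store.complete(task) // real processing would go here
          case None       => Thread.sleep(500)    // idle workers re-poll the database
        }
      }
    }
  }
}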
17. Attempt #2
Use an Akka cluster.
Akka is a toolkit and runtime for building highly
concurrent, distributed, and fault tolerant event-driven
applications on the JVM.
•Fundamentally the same coordination problem as Attempt #1.
•Need to build a task-management suite on top (see the sketch below).
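For comparison, a bare-bones classic-Akka worker (a sketch, not PredictionIO code). The actors give you concurrency and fault tolerance, but tracking pending tasks, retries, and scheduling still has to be built by hand on top:

import akka.actor.{Actor, ActorSystem, Props}

case class Task(id: Long, payload: String)

class Worker extends Actor {
  def receive = {
    case Task(id, _) =>
      // Process the task; nothing here tracks pending work or failures.
      println(s"worker ${self.path.name} finished task $id")
  }
}

object Main extends App {
  val system = ActorSystem("workers")
  val worker = system.actorOf(Props[Worker], "worker-1")
  (1L to 5L).foreach(id => worker ! Task(id, "train-model"))
}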
18. Solution
Apache Spark: directed acyclic graph
(DAG) scheduling
Adapts to many different infrastructures: an Apache
Spark standalone cluster, Apache Hadoop 2 YARN,
and Apache Mesos.
Source: http://upload.wikimedia.org/wikipedia/commons/3/39/Directed_acyclic_graph_3.svg
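A small Spark sketch of the idea (input path and job are illustrative). Transformations only add nodes to the DAG; the final action hands the whole graph to Spark's scheduler, which splits it into stages and queues tasks onto whatever processing units the cluster manager provides:

import org.apache.spark.{SparkConf, SparkContext}

object DagExample extends App {
  val sc = new SparkContext(new SparkConf().setAppName("dag-example"))

  // Each transformation is lazy: it only adds a node to the DAG.
  val events = sc.textFile("hdfs:///events")               // illustrative input
  val byKey  = events.groupBy(line => line.split(",")(0))  // shuffle boundary: new stage
  val sizes  = byKey.map { case (k, vs) => (k, vs.size) }

  // The action triggers the DAG scheduler; the same code runs unchanged on a
  // standalone cluster, on YARN, or on Mesos.
  sizes.saveAsTextFile("hdfs:///event-counts")

  sc.stop()
}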
21. Needs:
•Engines produce models that are
distributed across a cluster.
•Requires a way to serve these distributed
in-memory models to queries in real time.
22. Solution
All PredictionIO engine instances are launched
inside a “SparkContext”.
A SparkContext represents the connection to a
Spark cluster, and can be used to create RDDs,
accumulators and broadcast variables on that
cluster.
Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png
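That definition in action, as a script-style sketch (the cluster URL is illustrative; the accumulator call is the Spark 1.x API of the talk's era):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("engine-instance")
  .setMaster("spark://master:7077")     // illustrative cluster URL

val sc = new SparkContext(conf)         // the connection to the Spark cluster

val numbers = sc.parallelize(1 to 1000) // an RDD created on that cluster
val factor  = sc.broadcast(10)          // a read-only broadcast variable
val total   = sc.accumulator(0)         // an accumulator (Spark 1.x API)

numbers.foreach(n => total += n * factor.value)
println(total.value)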
23. •When an engine is local to a single
machine, it loads the model into its own memory.
•When an engine is distributed, the
SparkContext automatically loads the
model across the cluster.
24. Conceptual Code for the Solution
val sc = new SparkContext(conf)
...
val model =
  if (model_is_distributed) {
    if (model_is_persisted) {
      // Load the previously persisted model from HDFS as a distributed RDD
      sc.objectFile(model_on_HDFS)
    } else {
      // No persisted copy: train a new distributed model on the cluster
      engine.algo.train()
    }
  } else {
    ...
  }
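Connecting this back to the real-time serving need from slide 21, a rough sketch (not PredictionIO's actual serving API), assuming the model is a pair RDD of (itemId, score) and the Spark 1.x imports of the talk's era:

import org.apache.spark.SparkContext._
import org.apache.spark.rdd.RDD

// Keep the distributed model in cluster memory and answer each query
// against it, without collecting the whole model onto one machine.
def makeServe(model: RDD[(String, Double)]): String => Seq[Double] = {
  model.cache()
  // If the model RDD has a known partitioner, lookup only touches the
  // partition that can contain the key; otherwise it scans in parallel.
  (itemId: String) => model.lookup(itemId)
}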