This document discusses using Apache PredictionIO to establish an effective machine learning platform. It provides an overview of PredictionIO, shows how to quickly start your first engine, explains how to customize an engine by modifying its components, and outlines considerations for implementing PredictionIO in an enterprise production environment.
9. Hidden Technical Debt in Machine Learning Systems
“Only a small fraction of real-world ML systems is composed of the ML code.
The required surrounding infrastructure is vast and complex.”
Hidden Technical Debt in Machine Learning Systems, by Sculley et al., NIPS, 2015
10. Big Data System with Machine Learning Stacks
Layer        Components
Apps         API Service Server
Algorithms   Spark ML, Caffe, DeepLearning4J, TensorFlow, ...
Processing   Hadoop, Spark, ...
DataStore    RDB, Hadoop HDFS, HBase, ES, ...
PredictionIO
http://sssslide.com/speakerdeck.com/takahiro/building-a-recommendation-engine-with-spark-and-apache-predictionio
18. Quick Start your first Engine
Install & Start EventServer -> Train & Deploy Prediction Engine -> Query Result via REST
Steps
1. Install and Run PredictionIO
2. Create a new Engine from an Engine Template
3. Generate App ID and Access Key
4. Collecting Data
5. Deploy the Engine as a Service
6. Use the Engine
Alternatives
● REST APIs
● SDKs
● 54 available templates
● DASE for custom needs
● Source Code
● Docker
20. Installing with Docker
● Install Docker first
● Start docker-predictionio
$ docker run -it -p 8000:8000 steveny/predictionio /bin/bash
http://predictionio.incubator.apache.org/community/projects/#docker-installation-for-predictionio
21. Installing From Source
● Latest Version : 0.12.0
● Downloading Source Code : https://github.com/apache/incubator-predictionio/
● Building Dependencies:
Ecosystem       Supported Versions       Default
Scala           2.10.x, 2.11.x           2.11.8
Spark           1.6.x, 2.0.x, 2.1.x      2.1.1
Elasticsearch   1.7.x, 5.x               5.5.2
Hadoop          2.4.x to 2.7.x           2.7.3(*)
HBase           0.98.x, 1.2.x            1.2.6(*)
https://predictionio.incubator.apache.org/install/install-sourcecode/
$ ./make-distribution.sh -Dscala.version=2.11.8 -Dspark.version=2.1.0 -Delasticsearch.version=5.3.0
● Setup and Start PredictionIO
22. Command Line
● General Commands
○ pio status : Displays the install path and running status of the PredictionIO system and its dependencies.
● Event Server Commands
○ pio eventserver : Launch the Event Server.
○ pio app : Manage apps that are used by the Event Server
● Engine Commands
○ pio build : Build the engine at the current directory.
○ pio train : Kick off a training using an engine.
○ pio deploy : Deploy an engine as an engine server. If no instance ID is specified, it will deploy the latest instance.
https://predictionio.incubator.apache.org/cli/#engine-commands
23. REST API and SDKs
● REST API :
○ Default port numbers (verify them against your own setup) :
■ Event Server : 7070
■ Engine : 8000
○ Example: send an event to the Event Server
$ curl -i -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
    -H "Content-Type: application/json" -d '{$JSON-CONTEXT}'
○ Example: query the Engine
$ curl -H "Content-Type: application/json" \
    -d '{ $JSON-CONTEXT }' http://localhost:8000/queries.json
● SDKs :
○ Java & Android
○ Python
○ PHP
○ Ruby
https://predictionio.incubator.apache.org/cli/#engine-commands
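The two REST calls above can also be prepared from a script. The following is a minimal sketch using only the Python standard library (not the PredictionIO Python SDK). The helper names, the access key, and the sample payloads are hypothetical; the event fields and the query shape are assumptions modeled on the recommendation template and may differ for your engine. The requests are built but not sent.

```python
import json
from urllib import parse, request

EVENT_SERVER = "http://localhost:7070"   # default Event Server port from the slide
ENGINE_SERVER = "http://localhost:8000"  # default Engine port from the slide


def build_event_request(access_key, event):
    """Build (but do not send) a POST to the Event Server's /events.json."""
    query_string = parse.urlencode({"accessKey": access_key})
    url = f"{EVENT_SERVER}/events.json?{query_string}"
    body = json.dumps(event).encode("utf-8")
    return request.Request(url, data=body, method="POST",
                           headers={"Content-Type": "application/json"})


def build_query_request(query):
    """Build (but do not send) a POST to the Engine's /queries.json."""
    url = f"{ENGINE_SERVER}/queries.json"
    body = json.dumps(query).encode("utf-8")
    return request.Request(url, data=body, method="POST",
                           headers={"Content-Type": "application/json"})


# Hypothetical payloads; adjust them to the events and query your engine expects.
rate_event = {
    "event": "rate",
    "entityType": "user",
    "entityId": "u1",
    "targetEntityType": "item",
    "targetEntityId": "i1",
    "properties": {"rating": 4.0},
}
event_req = build_event_request("YOUR_ACCESS_KEY", rate_event)
query_req = build_query_request({"user": "u1", "num": 4})
```

To actually send a request, pass it to `request.urlopen(...)` once the Event Server or Engine is running.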
27. Customizing your Engine with D-A-S-E
D (Data Source and Data Preparator) : DataSource.scala, Preparator.scala
A (Algorithm) : ALSAlgorithm.scala
S (Serving) : Serving.scala
E (Evaluation) : Evaluation.scala
Engine : Engine.scala, engine.json (parameter settings)
https://predictionio.incubator.apache.org/customize/
28. Engine
Engine.scala defines:
● Query (case class)
● Predicted Result (case class)
● Engine Factory (object RecommendationEngine)
Flow: Query via REST -> Engine -> Predicted Result
Parameter settings
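30. Data Source and Data Preparator
Flow: Event Server -> DataSource readTrain() -> Training Data -> Preparator prepare() (*) -> Prepared Data -> Algorithm
● DataSource (DataSource.scala) : readTrain() reads the events RDD and turns it into a ratings RDD, which becomes the Training Data.
● Preparator (Preparator.scala) : prepare() turns the Training Data into Prepared Data.
Note :
* : Performs any necessary feature selection or data processing, etc.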
31. Algorithm
Flow (training): Prepared Data -> train() (*) -> Model
Flow (prediction): Query + Model -> predict() -> Predicted Result -> Serving
● Algorithm file : ALSAlgorithm.scala
● Parameter settings
Note :
*: train() is called when you run "pio train"
32. Example of engine.json for Algorithm
{
...
"algorithms": [
{
"name": "als",
"params": {
"rank": 10,
"numIterations": 20,
"lambda": 0.01,
"seed": 3
}
}
]
...
}
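To make the structure of the algorithm section concrete, here is a small sketch that reads the parameters of a named algorithm from an engine.json-style document. It is not how PredictionIO itself loads the file. The JSON string below contains only the "algorithms" part of the example (the elided "..." sections are left out so that it parses), and the helper name `algorithm_params` is hypothetical.

```python
import json

# Only the "algorithms" section of the slide's example; the rest is omitted.
ENGINE_JSON = """
{
  "algorithms": [
    {
      "name": "als",
      "params": {"rank": 10, "numIterations": 20, "lambda": 0.01, "seed": 3}
    }
  ]
}
"""


def algorithm_params(engine_json_text, name):
    """Return the params dict of the algorithm called `name`."""
    config = json.loads(engine_json_text)
    for algo in config.get("algorithms", []):
        if algo.get("name") == name:
            return algo.get("params", {})
    raise KeyError(f"no algorithm named {name!r} in engine.json")


als_params = algorithm_params(ENGINE_JSON, "als")
```

Changing a value such as "rank" or "numIterations" in engine.json takes effect the next time the engine is trained.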
33. Serving
Flow: Query + Predicted Results (one per algorithm) -> serve() (*) -> Predicted Result -> Predicted Result (JSON)
Note:
*: The serve() method combines multiple predicted results into one if you have more than one predictive model.
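The D-A-S flow described on the preceding slides (readTrain, prepare, train, predict, serve) can be sketched end to end in a few lines. This is a language-neutral toy in Python, not the PredictionIO Scala API: the function names only mirror the method names on the slides, the "model" is a plain mean rating per item rather than ALS, the averaging strategy in `serve` is an assumption, and the "itemScores" output key is modeled on the recommendation template.

```python
from collections import defaultdict


def read_train(events):
    """D: turn raw events into (user, item, rating) tuples."""
    return [(e["user"], e["item"], e["rating"]) for e in events]


def prepare(ratings):
    """D: pass-through here; feature selection or cleaning would go in this step."""
    return ratings


def train(prepared):
    """A: toy model, the mean rating per item (a stand-in for ALS)."""
    totals = defaultdict(lambda: [0.0, 0])
    for _, item, rating in prepared:
        totals[item][0] += rating
        totals[item][1] += 1
    return {item: total / count for item, (total, count) in totals.items()}


def predict(model, query):
    """A: return the top `num` (item, score) pairs for a query."""
    ranked = sorted(model.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[: query["num"]]


def serve(query, predictions):
    """S: combine the results of several models by averaging each item's scores."""
    merged = defaultdict(list)
    for result in predictions:
        for item, score in result:
            merged[item].append(score)
    combined = [(item, sum(s) / len(s)) for item, s in merged.items()]
    combined.sort(key=lambda kv: (-kv[1], kv[0]))
    top = combined[: query["num"]]
    return {"itemScores": [{"item": i, "score": s} for i, s in top]}


# Hypothetical sample events.
events = [
    {"user": "u1", "item": "i1", "rating": 5.0},
    {"user": "u2", "item": "i1", "rating": 3.0},
    {"user": "u1", "item": "i2", "rating": 2.0},
]
model = train(prepare(read_train(events)))
query = {"user": "u1", "num": 1}
result = serve(query, [predict(model, query)])
```

With a single model, `serve` simply passes that model's result through; the combining logic only matters once several algorithms are configured.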
34. Quiz (1/4)
- Read Custom Events
Q: How do you change the two events "rate" and "buy" to "like" and "dislike"?
Before (events RDD):
val eventsRDD: RDD[Event] = PEventStore.find(
  appName = dsp.appName,
  entityType = Some("user"),
  eventNames = Some(List("rate", "buy")), // read "rate" and "buy" events
  // targetEntityType is an optional field of an event.
  targetEntityType = Some(Some("item")))(sc)
After (events RDD):
val eventsRDD: RDD[Event] = PEventStore.find(
  appName = dsp.appName,
  entityType = Some("customer"), // changed from "user" to "customer"
  eventNames = Some(List("like", "dislike")), // read "like" and "dislike" events
  // targetEntityType is an optional field of an event.
  targetEntityType = Some(Some("product")))(sc) // changed from "item" to "product"
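Once the DataSource reads "like" and "dislike" events for the entity type "customer" targeting "product", the events sent to the Event Server need to use the same names. The sketch below builds such an event payload in Python; the helper name and the sample IDs are hypothetical, and the field layout is an assumption based on the event format used by the recommendation template.

```python
import json


def make_preference_event(customer_id, product_id, liked):
    """Build a "like" or "dislike" event matching the modified DataSource."""
    return {
        "event": "like" if liked else "dislike",
        "entityType": "customer",        # matches entityType = Some("customer")
        "entityId": customer_id,
        "targetEntityType": "product",   # matches targetEntityType = Some(Some("product"))
        "targetEntityId": product_id,
    }


like_event = make_preference_event("c1", "p42", liked=True)
dislike_event = make_preference_event("c1", "p7", liked=False)
payload = json.dumps(like_event)  # body for POST /events.json on the Event Server
```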
35. Quiz (2/4)
- Map Custom Events
Q: How do you change the two events "rate" and "buy" to "like" and "dislike"? (continued)
Before (ratings RDD):
val ratingValue: Double = event.event match {
  case "rate" => event.properties.get[Double]("rating")
  case "buy" => 4.0 // map a buy event to a rating of 4.0
  case _ => throw new Exception(s"Unexpected event ${event} is read.")
}
After (ratings RDD):
val ratingValue: Double = event.event match {
  case "rate" => event.properties.get[Double]("rating")
  case "buy" => 4.0 // map a buy event to a rating of 4.0
  case "like" => 4.0 // map a like event to a rating of 4.0
  case "dislike" => 1.0 // map a dislike event to a rating of 1.0
  case _ => throw new Exception(s"Unexpected event ${event} is read.")
}
36. Quiz (3/4)
- Customizing Data Preparator
Q: How do you add a blacklist feature so that the system filters out certain products?
Before:
class Preparator
  extends PPreparator[TrainingData, PreparedData] {
  def prepare(sc: SparkContext, trainingData: TrainingData): PreparedData = {
    new PreparedData(ratings = trainingData.ratings)
  }
}
After:
import scala.io.Source // ADDED
class Preparator
  extends PPreparator[TrainingData, PreparedData] {
  def prepare(sc: SparkContext, trainingData: TrainingData): PreparedData = {
    // load the blacklist, one item ID per line
    val noTrainItems = Source.fromFile("./data/sample_not_train_data.txt").getLines.toSet
    // exclude noTrainItems from the original trainingData
    val ratings = trainingData.ratings.filter( r => !noTrainItems.contains(r.item) )
    new PreparedData(ratings)
  }
}
37. Quiz (4/4)
- Release for your Change
● How do you release the modified engine(s)?
$ pio build
$ pio train
$ pio deploy
39. Evaluation (2/4)
● Query (case class)
● Predicted Result (case class)
● Actual Result (case class)
● class DataSource
● RecommendationEngine
Flow: Query via REST -> RecommendationEngine -> Predicted Result
Parameter settings
43. Implementation on Enterprise Production
[Architecture diagram: PredictionIO Platform]
● Inputs from source clusters (labels include Test Log and Yield-En.) : Batch Data ( pio import ), Real Time Data ( Streaming + PIO SDK )
● Event Server Cluster
● Storage : Meta, Event Data, Model
● Off-line Training
● Prediction Engine Cluster : P1 Engine, P2 Engine, P3 Engine
● Output : Query via REST, Prediction Result RDD
44. Deploy the Event Server onto Prediction Cluster
Steps: Setup PredictionIO -> Run eventserver -> Setup HAProxy
● Run the following command on each Event Server to be deployed:
$ pio eventserver &
● HAProxy configuration for Event Server Cluster
listen pio_engine_7070 :7070
    mode http
    balance roundrobin
    option httpclose
    option forwardfor
    option redispatch
    retries 3
    log global
    log 127.0.0.1 local4 info
    server piovm1 192.168.56.101:7070 check weight 1 maxconn 30
    server piovm2 192.168.56.102:7070 check weight 1 maxconn 30
    server piovm3 192.168.56.103:7070 check weight 1 maxconn 30
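The `balance roundrobin` line in the HAProxy configuration for this slide spreads incoming requests across the three Event Servers in turn. The sketch below only illustrates that rotation for equal-weight backends; it is not HAProxy's implementation and ignores health checks, `maxconn`, and retries. The function name is hypothetical; the addresses are copied from the configuration.

```python
from itertools import cycle

# Backend addresses copied from the HAProxy configuration on the slide.
BACKENDS = [
    "192.168.56.101:7070",  # piovm1
    "192.168.56.102:7070",  # piovm2
    "192.168.56.103:7070",  # piovm3
]


def round_robin(backends):
    """Return an iterator that hands out backends in rotation."""
    return cycle(backends)


picker = round_robin(BACKENDS)
# The fourth request wraps around to the first server again.
first_four = [next(picker) for _ in range(4)]
```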
45. Deploy the Engine onto Prediction Cluster
Steps: Off-Line Engine Training -> $ pio deploy
Run the following command on each Prediction Server to be deployed:
$ pio deploy --port 8001 --engine-instance-id AV6dTEoKBlbECIGzXhaS
46. Summary
• Apache PredictionIO is an active and popular project.
• It lets you integrate machine learning functions into your apps effectively and efficiently.
• It is also convenient to combine multiple PredictionIO nodes with HAProxy and other components of the Hadoop ecosystem to provide a scalable and stable solution.