PredictionIO Architecture and Integration - DataCon.TW 2017

Apache PredictionIO Architecture and Integration
Establish an effective machine learning platform efficiently
李文良 (William Lee), 張峰睿 (Frank Chang)
亦思科技
2017/09/30

About us
李文良 (William Lee)
Project Manager, R&D Division, 亦思科技
williamlee@is-land.com.tw
張峰睿 (Frank Chang)
System Architect, R&D Division, 亦思科技
frank@is-land.com.tw

Outline
•Background
•PredictionIO Overview
•Quick Start your first Engine
•Customizing an Engine
•Implementation on Enterprise Production
•Summary

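Expectations for Machine Learning
According to a survey report by MIT Technology Review and Google Cloud:
• 50% of organizations plan to use machine learning to understand the data they already hold more deeply and to extract more information from it.
• Others want it to gain competitive advantage or to speed up the analysis of existing data.
• 31% even expect machine learning to reduce costs.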
https://s3.amazonaws.com/files.technologyreview.com/whitepapers/MITTR_GoogleforWork_Survey.pdf

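Machine Learning Applications in Semiconductor Manufacturing
● Semiconductor production typically involves hundreds of process steps and generates millions of data records along the way.
● During product development, R&D must define specifications (SPEC) for this data, which are used for quality checks and equipment tuning.
● IT staff are usually needed to extract the datasets and load them into statistical software before any analysis can start.
● A system that combines data collection with machine learning, generating SPECs automatically for staff to confirm, can reduce the time spent on product development.
[Diagram labels: develop product, collect data, analyze data, tune parameters, SPEC]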

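Machine Learning Applications in Finance
● The financial industry has entered the era of electronic trading. Historical transactions accumulate into large datasets, and new data keeps arriving through the trading systems.
● Information is extracted from the historical database and analyzed according to different needs.
● A system that consolidates the databases of historical data and offers several machine learning algorithms would shorten the time from data analysis to the target output.
[Diagram labels: historical data; clients (x3); analyze data (x3); product design, risk management, anomaly analysis, market forecasting]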

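Starting with Spark MLlib and running the examples seems fine

But building a real system raises many points that need attention...
● A trained model can be saved and reused later. Where is it stored? How is it managed?
● How does it integrate with apps or existing systems? How can it be used in real time and conveniently?
● How is the algorithm executed? How are parameters and the environment configured?
● How does the data come in? Where is it stored?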

Hidden Technical Debt in Machine Learning Systems
“Only a small fraction of real-world ML systems is composed of the ML code.
The required surrounding infrastructure is vast and complex.”
Source: Sculley et al., "Hidden Technical Debt in Machine Learning Systems", NIPS 2015

Big Data System with Machine Learning Stacks
Apps:        API Service Server
Algorithms:  Spark ML, Caffe, DeepLearning4J, TensorFlow, ...
Processing:  Hadoop, Spark, ...
DataStore:   RDB, Hadoop HDFS, HBase, ES, ...
PredictionIO
http://sssslide.com/speakerdeck.com/takahiro/building-a-recommendation-engine-with-spark-and-apache-predictionio

What is Apache PredictionIO
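Apache PredictionIO (incubating) is an open source machine learning server platform.
It lets developers and data scientists build predictive engines quickly and effectively,
and it integrates with application systems to deliver Machine Learning as a Service.
PredictionIO is expected to bring the following benefits:
• A simple solution for data collection and storage that unifies the platforms of the existing ecosystem.
• Developers can quickly build a machine learning engine from modules and expose it as a service, which makes integration with external systems easy.
• Custom machine learning engines can be built by modifying the modules.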

Latest release on 9/26
According to the PredictionIO JIRA site:
• Version 0.12.0 was released on 26 Sep 2017

Architecture
[Diagram labels: SDK / Service client; PredictionIO Platform (Event Server, Prediction Engine, Engine Template, Analytics Tools); Processing; Storage; Build Engine and Deploy]

Quick Start your first Engine
Flow: Install & Start EventServer, Train & Deploy Prediction Engine, Query Result via REST
Operation Steps
1. Install and Run PredictionIO
2. Create a new Engine from an Engine Template
3. Generate App ID and Access Key
4. Collecting Data
5. Deploy the Engine as a Service
6. Use the Engine
Alternatives
● REST APIs
● SDKs
● 54 available templates
● DASE for custom needs
● Source Code
● Docker

Installation & Quick Start
● See https://github.com/apache/incubator-predictionio/

Installing with Docker
● Install Docker first
● Start docker-predictionio
$ docker run -it -p 8000:8000 steveny/predictionio /bin/bash
http://predictionio.incubator.apache.org/community/projects/#docker-installation-for-predictionio

Installing From Source
● Latest version: 0.12.0
● Download the source code: https://github.com/apache/incubator-predictionio/
● Build dependencies:
Ecosystem      Supported versions       Default
Scala          2.10.x, 2.11.x           2.11.8
Spark          1.6.x, 2.0.x, 2.1.x      2.1.1
Elasticsearch  1.7.x, 5.x               5.5.2
Hadoop         2.4.x to 2.7.x           2.7.3(*)
HBase          0.98.x, 1.2.x            1.2.6(*)
https://predictionio.incubator.apache.org/install/install-sourcecode/
$ ./make-distribution.sh -Dscala.version=2.11.8 -Dspark.version=2.1.0 -Delasticsearch.version=5.3.0
● Set up and start PredictionIO

Command Line
● General Commands
○ pio status : Displays install path and running status of PredictionIO system and its
dependencies.
● Event Server Commands
○ pio eventserver : Launch the Event Server.
○ pio app : Manage apps that are used by the Event Server
● Engine Commands
○ pio build : Build the engine at the current directory.
○ pio train : Kick off a training using an engine.
○ pio deploy : Deploy an engine as an engine server. If no instance ID is specified, it will
deploy the latest instance.
https://predictionio.incubator.apache.org/cli/#engine-commands

REST API and SDKs
● REST API :
○ Default ports (verify against your own setup) :
■ Event Server : 7070
■ Engine : 8000
○ Example: send an event to the Event Server
$ curl -i -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
    -H "Content-Type: application/json" -d '{$JSON-CONTEXT}'
○ Example: query the Engine
$ curl -H "Content-Type: application/json" \
    -d '{ $JSON-CONTEXT }' http://localhost:8000/queries.json
● SDKs :
○ Java & Android
○ Python
○ PHP
○ Ruby
https://predictionio.incubator.apache.org/cli/#engine-commands
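The $JSON-CONTEXT placeholder in the two curl calls stands for a JSON body. As a rough sketch of what those bodies can look like for a recommendation-style engine, the Scala below only builds the URL and the JSON text and sends nothing. The object name, helper names, and sample IDs ("u1", "i1") are made up for illustration, and the query fields ("user", "num") are an assumption based on the recommendation template; other templates define their own query shape.

```scala
// Sketch only: builds request URLs and JSON bodies, it does not call any server.
// Names and sample values here are illustrative assumptions.
object RestPayloadSketch {
  // Wrap a value in double quotes. No escaping is done, so keep inputs simple.
  private def q(s: String): String = "\"" + s + "\""

  // Join already formatted "key":value pairs into one JSON object.
  private def obj(fields: Seq[String]): String = fields.mkString("{", ",", "}")

  // Event Server endpoint; host and port default to the values on the slide.
  def eventUrl(accessKey: String, host: String = "localhost", port: Int = 7070): String =
    "http://" + host + ":" + port + "/events.json?accessKey=" + accessKey

  // Engine endpoint for queries.
  def queryUrl(host: String = "localhost", port: Int = 8000): String =
    "http://" + host + ":" + port + "/queries.json"

  // Body of a "rate" event sent by a user about an item.
  def rateEvent(userId: String, itemId: String, rating: Double): String = obj(Seq(
    q("event") + ":" + q("rate"),
    q("entityType") + ":" + q("user"),
    q("entityId") + ":" + q(userId),
    q("targetEntityType") + ":" + q("item"),
    q("targetEntityId") + ":" + q(itemId),
    q("properties") + ":" + obj(Seq(q("rating") + ":" + rating))
  ))

  // Body of a query asking for the top `num` items for one user.
  def query(userId: String, num: Int): String =
    obj(Seq(q("user") + ":" + q(userId), q("num") + ":" + num))

  def main(args: Array[String]): Unit = {
    println(eventUrl("YOUR_ACCESS_KEY"))
    println(rateEvent("u1", "i1", 4.0))
    println(queryUrl())
    println(query("u1", 4))
  }
}
```

The printed strings can be pasted into the -d argument of the curl commands; in practice one of the SDKs listed above would normally build and send these requests.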

Customizing your Engine with D-A-S-E
D (Data Source and Data Preparator): DataSource.scala, Preparator.scala
A (Algorithm): ALSAlgorithm.scala
S (Serving): Serving.scala
E (Evaluation): Evaluation.scala
Engine: Engine.scala
Parameter settings: engine.json
https://predictionio.incubator.apache.org/customize/

Engine
[Diagram: Query via REST → Engine → Predicted Result. Labels: Query (case class), PredictedResult (case class), object RecommendationEngine (Engine Factory), parameter settings, Engine.scala]
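To make the relationship between Query, Predicted Result, and the D-A-S-E components more concrete, here is a minimal plain-Scala sketch of the flow. It deliberately uses no PredictionIO or Spark classes: every name is an illustrative stand-in, the sample ratings are made up, and the "model" is just the mean rating per item rather than ALS.

```scala
// Plain-Scala sketch of the D-A-S-E flow. Not the PredictionIO API.
object DaseSketch {
  case class Rating(user: String, item: String, rating: Double)
  case class Query(user: String, num: Int)
  case class ItemScore(item: String, score: Double)
  case class PredictedResult(itemScores: Seq[ItemScore])

  // D: the Data Source reads raw data (hard-coded sample data here).
  def readTraining(): Seq[Rating] = Seq(
    Rating("u1", "i1", 5.0), Rating("u1", "i2", 3.0),
    Rating("u2", "i1", 4.0), Rating("u2", "i3", 2.0))

  // D: the Preparator cleans the data before training.
  def prepare(ratings: Seq[Rating]): Seq[Rating] = ratings.filter(_.rating > 0.0)

  // A: train() builds a model. Toy model: mean rating per item.
  type Model = Map[String, Double]
  def train(prepared: Seq[Rating]): Model =
    prepared.groupBy(_.item).map { case (item, rs) =>
      item -> rs.map(_.rating).sum / rs.size
    }

  // A: predict() answers one query with the highest scoring items.
  def predict(model: Model, query: Query): PredictedResult =
    PredictedResult(
      model.toSeq
        .sortBy { case (item, score) => (-score, item) }
        .take(query.num)
        .map { case (item, score) => ItemScore(item, score) })

  // S: Serving picks or combines the per-algorithm results.
  def serve(query: Query, results: Seq[PredictedResult]): PredictedResult = results.head

  def main(args: Array[String]): Unit = {
    val model = train(prepare(readTraining()))
    val query = Query("u1", 2)
    println(serve(query, Seq(predict(model, query))))
  }
}
```

In a real engine each step lives in its own class (DataSource, Preparator, Algorithm, Serving) and the Engine Factory wires them together; the sketch only shows the order in which data moves through them.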

Data Source and Data Preparator
[Diagram: Event Server → DataSource.readTrain() (events RDD → ratings RDD) → TrainingData → Preparator.prepare() (*) → PreparedData → Algorithm]
Files: DataSource.scala, Preparator.scala
Action required: prepare() (*)
Note:
*: Performs any necessary feature selection or data processing, etc.

Algorithm
[Diagram: PreparedData → Algorithm.train() (*) → Model; Query + Model → predict() → Predicted Result → Serving]
File: ALSAlgorithm.scala, with parameter settings
Note:
*: train() is called when you run "pio train"

example of engine.json for Algorithm
{
  ...
  "algorithms": [
    {
      "name": "als",
      "params": {
        "rank": 10,
        "numIterations": 20,
        "lambda": 0.01,
        "seed": 3
      }
    }
  ]
  ...
}
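On the Scala side, the "params" object of an algorithm entry is typically mirrored by a parameter case class whose field names match the JSON keys. The sketch below shows that correspondence in plain Scala. The case class shape follows the keys in the engine.json example; in a real engine it would also extend PredictionIO's Params trait, and the validate helper is a hypothetical addition for illustration.

```scala
// Sketch: a parameter class whose fields mirror the "params" keys in engine.json.
object AlgorithmParamsSketch {
  case class ALSAlgorithmParams(
    rank: Int,          // number of latent factors
    numIterations: Int, // number of ALS iterations
    lambda: Double,     // regularization strength
    seed: Option[Long]) // optional random seed for reproducible training

  // The values from the engine.json example.
  val fromExample: ALSAlgorithmParams =
    ALSAlgorithmParams(rank = 10, numIterations = 20, lambda = 0.01, seed = Some(3L))

  // Hypothetical sanity check, not part of PredictionIO: returns a list of problems.
  def validate(p: ALSAlgorithmParams): Seq[String] = Seq(
    if (p.rank <= 0) Some("rank must be positive") else None,
    if (p.numIterations <= 0) Some("numIterations must be positive") else None,
    if (p.lambda < 0.0) Some("lambda must not be negative") else None
  ).flatten

  def main(args: Array[String]): Unit = {
    println(fromExample)
    println(validate(fromExample.copy(rank = 0)))
  }
}
```

Keeping tunable values in engine.json rather than in code means a retrain with different settings needs only an edit to the JSON file followed by "pio train".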

Serving
[Diagram: Query + Predicted Results (one per model) → Serving.serve() combines them (*) → Predicted Result (JSON)]
Note:
*: If you have more than one predictive model, serve() combines the multiple predicted results into one.
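How the results are combined is up to the engine author. One possible strategy, sketched below in plain Scala, is to average each item's score over all models and return the top items. This is an assumption for illustration rather than the template's behavior, and the types and sample scores are stand-ins.

```scala
// Sketch of one way serve() could combine results from several models.
object ServingSketch {
  case class ItemScore(item: String, score: Double)
  case class PredictedResult(itemScores: Seq[ItemScore])

  // Average each item's score over all models (an item missing from a model
  // counts as 0 for that model), then keep the `num` best items.
  def serve(num: Int, results: Seq[PredictedResult]): PredictedResult = {
    val modelCount = results.size.toDouble
    val combined = results
      .flatMap(_.itemScores)
      .groupBy(_.item)
      .map { case (item, scores) => ItemScore(item, scores.map(_.score).sum / modelCount) }
    PredictedResult(combined.toSeq.sortBy(s => (-s.score, s.item)).take(num))
  }

  def main(args: Array[String]): Unit = {
    val modelA = PredictedResult(Seq(ItemScore("i1", 4.0), ItemScore("i2", 2.0)))
    val modelB = PredictedResult(Seq(ItemScore("i1", 2.0), ItemScore("i3", 3.0)))
    println(serve(2, Seq(modelA, modelB)))
  }
}
```

With a single model this reduces to returning that model's top items unchanged.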

Quiz (1/4)
- Read Custom Events
Q: How do we change the two events "rate" and "buy" into "like" and "dislike"?
Before (events RDD):
val eventsRDD: RDD[Event] = PEventStore.find(
  appName = dsp.appName,
  entityType = Some("user"),
  eventNames = Some(List("rate", "buy")), // read "rate" and "buy" events
  // targetEntityType is an optional field of an event.
  targetEntityType = Some(Some("item")))(sc)
After:
val eventsRDD: RDD[Event] = PEventStore.find(
  appName = dsp.appName,
  entityType = Some("customer"), // change user to customer
  eventNames = Some(List("like", "dislike")), // read "like" and "dislike" events
  // targetEntityType is an optional field of an event.
  targetEntityType = Some(Some("product")))(sc) // change item to product

Quiz (2/4)
- Map Custom Events
Q: How do we change the two events "rate" and "buy" into "like" and "dislike"? (continued)
Before (ratings RDD):
val ratingValue: Double = event.event match {
  case "rate" => event.properties.get[Double]("rating")
  case "buy" => 4.0 // map a buy event to a rating of 4.0
  case _ => throw new Exception(s"Unexpected event ${event} is read.")
}
After:
val ratingValue: Double = event.event match {
  case "rate" => event.properties.get[Double]("rating")
  case "buy" => 4.0 // map a buy event to a rating of 4.0
  case "like" => 4.0 // map a like event to a rating of 4.0
  case "dislike" => 1.0 // map a dislike event to a rating of 1.0
  case _ => throw new Exception(s"Unexpected event ${event} is read.")
}

Quiz (3/4)
- Customizing Data Preparator
Q: How do we add a blacklist feature so that the system filters out certain products?
Before:
class Preparator
  extends PPreparator[TrainingData, PreparedData] {
  def prepare(sc: SparkContext, trainingData: TrainingData): PreparedData = {
    new PreparedData(ratings = trainingData.ratings)
  }
}
After:
import scala.io.Source // ADDED
class Preparator
  extends PPreparator[TrainingData, PreparedData] {
  def prepare(sc: SparkContext, trainingData: TrainingData): PreparedData = {
    // read the blacklist: one item ID per line
    val noTrainItems = Source.fromFile("./data/sample_not_train_data.txt").getLines.toSet
    // exclude noTrainItems from the original trainingData
    val ratings = trainingData.ratings.filter( r => !noTrainItems.contains(r.item) )
    new PreparedData(ratings)
  }
}

Quiz (4/4)
- Release for your Change
● How to release the modified engine(s)?
$ pio build
$ pio train
$ pio deploy

Evaluation (1/4)
[Diagram labels: Evaluation, AccuracyEvaluation, Evaluation Metrics, Engine Params List, Accuracy, Algo]
case class Accuracy()
  extends AverageMetric[EmptyEvaluationInfo, Query, PredictedResult, ActualResult] {
  // Score 1.0 for a correct label and 0.0 otherwise; AverageMetric averages these scores.
  def calculate(query: Query, predicted: PredictedResult, actual: ActualResult)
  : Double = (if (predicted.label == actual.label) 1.0 else 0.0)
}
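The metric above scores each (predicted, actual) pair as 1.0 or 0.0, and the average of those scores is the accuracy. The plain-Scala sketch below reproduces that arithmetic without PredictionIO. The case classes are simplified stand-ins and the labels are made-up sample data.

```scala
// Sketch: accuracy as the mean of per-query 0/1 scores.
object AccuracySketch {
  case class PredictedResult(label: Double)
  case class ActualResult(label: Double)

  // Same rule as calculate(): 1.0 when the labels match, 0.0 otherwise.
  def calculate(predicted: PredictedResult, actual: ActualResult): Double =
    if (predicted.label == actual.label) 1.0 else 0.0

  // Average the per-pair scores; an empty input is reported as 0.0 by convention here.
  def accuracy(pairs: Seq[(PredictedResult, ActualResult)]): Double =
    if (pairs.isEmpty) 0.0
    else pairs.map { case (p, a) => calculate(p, a) }.sum / pairs.size

  def main(args: Array[String]): Unit = {
    val pairs = Seq(
      (PredictedResult(1.0), ActualResult(1.0)),
      (PredictedResult(0.0), ActualResult(0.0)),
      (PredictedResult(1.0), ActualResult(0.0)),
      (PredictedResult(2.0), ActualResult(2.0)))
    println(accuracy(pairs)) // 3 of 4 labels match
  }
}
```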

Evaluation (2/4)
[Diagram: Query via REST → RecommendationEngine → Predicted Result. Labels: Query (case class), PredictedResult (case class), ActualResult (case class), class DataSource, parameter settings]

Evaluation (3/4)
[Diagram, training path: DataSource.readTrain() (events RDD → ratings RDD) → TrainingData → Preparator.prepare() (TBD) → PreparedData → Algorithm]
[Diagram, evaluation path: events DB → DataSource.readEval() with K-fold splitting → TrainingData and RDD(Query, ActualResult)]
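K-fold splitting means that each record is used for testing in exactly one fold and for training in the other k - 1 folds. A minimal sketch on an in-memory Seq follows; it assigns folds by index modulo k, which is one common approach and an assumption here. A real readEval() would apply the same idea to an RDD and turn each test record into a (Query, ActualResult) pair.

```scala
// Sketch: k-fold splitting by index modulo k on a plain Seq.
object KFoldSketch {
  // Returns k (training, test) pairs covering the whole data set.
  def split[A](data: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    require(k > 1, "k must be at least 2")
    val indexed = data.zipWithIndex
    (0 until k).map { fold =>
      // Records whose index falls in this fold are held out for testing.
      val (test, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), test.map(_._1))
    }
  }

  def main(args: Array[String]): Unit = {
    split(Seq(1, 2, 3, 4, 5, 6), 3).foreach { case (train, test) =>
      println("train=" + train + " test=" + test)
    }
  }
}
```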

Evaluation (4/4)
● Build and run the evaluation
● Deploy the best engine parameter
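Conceptually, an evaluation run scores every candidate in the engine params list with the chosen metric and keeps the best one. The sketch below shows only that selection step in plain Scala. AlgoParams, the candidate list, and the scores are all made up for illustration; in a real evaluation the scores come from running the metric on the k-fold data.

```scala
// Sketch: pick the candidate parameters with the highest metric score.
object BestParamsSketch {
  case class AlgoParams(rank: Int, numIterations: Int, lambda: Double)

  // `score` stands in for a full evaluation run that returns the metric value.
  def best(candidates: Seq[AlgoParams], score: AlgoParams => Double): Option[(AlgoParams, Double)] =
    if (candidates.isEmpty) None
    else Some(candidates.map(p => (p, score(p))).maxBy(_._2))

  def main(args: Array[String]): Unit = {
    val candidates = Seq(
      AlgoParams(5, 20, 0.01), AlgoParams(10, 20, 0.01), AlgoParams(20, 20, 0.01))
    // Made-up scores keyed by rank, used only to exercise the selection logic.
    val fakeScores = Map(5 -> 0.61, 10 -> 0.72, 20 -> 0.70)
    println(best(candidates, p => fakeScores(p.rank)))
  }
}
```

The winning parameters would then be written into engine.json before the usual build, train, and deploy cycle.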

Implementation on Enterprise Production
[Diagram: Test Log clusters (x4) feed batch data (pio import) and real-time data (Streaming + PIO SDK) into the Event Server Cluster. The PredictionIO Platform stores Meta, Event Data, and Model, runs off-line training, and serves a Prediction Engine Cluster (P1 Engine, P2 Engine, P3 Engine). Clients query via REST and receive prediction results. Other labels: Yield-En., RDD]

Deploy the Event Server onto Prediction Cluster
Steps: Setup PredictionIO, Run eventserver, Setup HAProxy
Run the following command on each Event Server to be deployed:
$ pio eventserver &
● HAProxy configuration for Event Server Cluster
listen pio_engine_7070 :7070
    mode http
    balance roundrobin
    option httpclose
    option forwardfor
    option redispatch
    retries 3
    log global
    log 127.0.0.1 local4 info
    server piovm1 192.168.56.101:7070 check weight 1 maxconn 30
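The "balance roundrobin" setting makes HAProxy hand each new request to the next server in the list, wrapping around at the end. The Scala sketch below illustrates that selection rule only; it is not how HAProxy is implemented and it ignores weights and health checks. The first address comes from the configuration above, while the second address is a hypothetical extra node.

```scala
// Sketch: plain round-robin selection over a fixed list of servers.
object RoundRobinSketch {
  final class RoundRobin(servers: Vector[String]) {
    require(servers.nonEmpty, "at least one server is required")
    private var next = 0

    // Return the next server in order, wrapping around at the end of the list.
    def pick(): String = synchronized {
      val server = servers(next)
      next = (next + 1) % servers.size
      server
    }
  }

  def main(args: Array[String]): Unit = {
    // 192.168.56.102 is a made-up second Event Server for illustration.
    val balancer = new RoundRobin(Vector("192.168.56.101:7070", "192.168.56.102:7070"))
    (1 to 4).foreach(_ => println(balancer.pick()))
  }
}
```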

Deploy the Engine onto Prediction Cluster
[Diagram: Off-Line Engine Training → $ pio deploy]
Run the following command on each Prediction Server to be deployed:
$ pio deploy --port 8001 --engine-instance-id AV6dTEoKBlbECIGzXhaS

Summary
• Apache PredictionIO is an active and popular project.
• It lets you integrate machine learning functions into your apps effectively and efficiently.
• Multiple PredictionIO nodes can be conveniently combined with HAProxy and the Hadoop ecosystem to provide a scalable and stable solution.
