PredictionIO Architecture and Integration - DataCon.TW 2017
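Apache PredictionIO is an open source machine learning server architecture. It lets developers and data scientists build the predictive engines they need quickly and effectively, and it integrates with existing systems through REST to deliver Machine Learning as a Service. We introduce how to integrate the Hadoop ecosystem with PredictionIO to help users collect and store data, train learning engines, and serve prediction results, helping enterprises uncover problems and improve tasks such as customer demand forecasting.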

Published in: Data & Analytics

  1. Apache PredictionIO Architecture and Integration
     Establish an effective machine learning platform efficiently
     李文良, 張峰睿, 亦思科技, 2017/09/30
  2. About us
     李文良 (William Lee): Project Manager, R&D Division, 亦思科技, williamlee@is-land.com.tw
     張峰睿 (Frank Chang): System Architect, R&D Division, 亦思科技, frank@is-land.com.tw
  3. Outline
     • Background
     • PredictionIO Overview
     • Quick Start your first Engine
     • Customizing an Engine
     • Implementation on Enterprise Production
     • Summary
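  4. Expectations for machine learning
     According to a report by MIT Technology Review and Google Cloud:
     • 50% of organizations plan to use machine learning to understand the data they hold more deeply and extract more information from it.
     • Others use it to gain competitive advantage or to speed up analysis of existing data.
     • 31% believe machine learning can reduce costs.
     Source: https://s3.amazonaws.com/files.technologyreview.com/whitepapers/MITTR_GoogleforWork_Survey.pdf
  5. Machine learning in the semiconductor industry
     ● Semiconductor production usually involves hundreds of process steps and generates millions of data records along the way.
     ● During product development, R&D must define SPECs for this data, which are used for quality checks and equipment tuning.
     ● IT staff usually have to extract the data sets and load them into statistical software before any analysis can start.
     ● A system that combines data collection with machine learning to generate SPECs automatically, for engineers to confirm, would reduce the time spent on product development.
     Diagram labels: product development, data collection, data analysis, parameter tuning, SPEC.
  6. Machine learning in the financial industry
     ● The financial industry now trades electronically. Historical transactions accumulate into large data sets, and the trading systems add new data continuously.
     ● Information is extracted from the historical database and analyzed according to each requirement.
     ● A system that consolidates the historical data in one database and offers several machine learning algorithms would shorten the time from data analysis to results.
     Diagram labels: historical data, clients, data analysis, product design, risk management, anomaly analysis, market forecasting.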
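  7. Starting with Spark MLlib, running the examples seems to go well
  8. Building a real system, however, raises many questions
     • The trained model can be saved for later use. Where is it stored? How is it managed?
     • How does it integrate with apps or existing systems? How can it be used conveniently and in real time?
     • How is the algorithm executed? How are parameters and the environment configured?
     • How does data come in, and where is it stored?
  9. Hidden Technical Debt in Machine Learning Systems
     "Only a small fraction of real-world ML systems is composed of the ML code. The required surrounding infrastructure is vast and complex."
     Hidden Technical Debt in Machine Learning Systems, by Sculley et al., NIPS, 2015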
  10. Big Data System with Machine Learning Stacks
      Layer        Components
      Apps         API Service Server
      Algorithms   Spark ML; Caffe, DeepLearning4J, TensorFlow, ...
      Processing   Hadoop, Spark, ...
      DataStore    RDB, Hadoop HDFS, HBase, ES, ...
      Diagram label: PredictionIO
      Source: http://sssslide.com/speakerdeck.com/takahiro/building-a-recommendation-engine-with-spark-and-apache-predictionio
  11. Outline: Background, PredictionIO Overview, Quick Start your first Engine, Customizing an Engine, Implementation on Enterprise Production, Summary
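  12. What is Apache PredictionIO
      Apache PredictionIO (incubating) is an open source machine learning server platform. It lets developers and data scientists build predictive engines quickly and effectively, and it integrates with application systems to deliver Machine Learning as a Service.
      Expected benefits of PredictionIO:
      • A simple solution for collecting and storing data that ties together platforms in the existing ecosystem.
      • Developers can quickly build a machine learning engine from existing modules and expose it as a service that external systems can integrate with.
      • Custom machine learning engines can be built by modifying those modules.
  13. Latest release on 9/26
      According to the PredictionIO JIRA site, version 0.12.0 was released on 26 September 2017.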
  14. Architecture
      Diagram labels: Client, SDK / Service, Event Server, Prediction Engine, Processing, Storage, Engine Template, Analytics Tools, PredictionIO Platform, Build Engine and Deploy.
  15. PIO Storage Alternatives
  16. Outline: Background, PredictionIO Overview, Quick Start your first Engine, Customizing an Engine, Implementation on Enterprise Production, Summary
  17. Quick Start your first Engine
      Phases: Install & Start EventServer; Train & Deploy Prediction Engine; Query Result via REST.
      Operation Steps:
      1. Install and Run PredictionIO
      2. Create a new Engine from an Engine Template
      3. Generate App ID and Access Key
      4. Collecting Data
      5. Deploy the Engine as a Service
      6. Use the Engine
      Alternatives:
      ● Source Code
      ● Docker
      ● 54 available templates
      ● DASE for custom needs
      ● REST APIs
      ● SDKs
  18. Installation & Quick Start
      ● See https://github.com/apache/incubator-predictionio/
  19. Installing with Docker
      ● Install Docker first
      ● Start docker-predictionio
          $ docker run -it -p 8000:8000 steveny/predictionio /bin/bash
      Source: http://predictionio.incubator.apache.org/community/projects/#docker-installation-for-predictionio
  20. Installing From Source
      ● Up-to-date Version : 0.12.0
      ● Downloading Source Code : https://github.com/apache/incubator-predictionio/
      ● Building Dependencies:
          Ecosystem       Versions of Dependencies   Default
          Scala           2.10.x, 2.11.x             2.11.8
          Spark           1.6.x, 2.0.x, 2.1.x        2.1.1
          Elasticsearch   1.7.x, 5.x                 5.5.2
          Hadoop          2.4.x to 2.7.x             2.7.3(*)
          HBase           0.98.x, 1.2.x              1.2.6(*)
          $ ./make-distribution.sh -Dscala.version=2.11.8 -Dspark.version=2.1.0 -Delasticsearch.version=5.3.0
      ● Setup and Start PredictionIO
      Source: https://predictionio.incubator.apache.org/install/install-sourcecode/
  21. Command Line
      ● General Commands
        ○ pio status : Displays install path and running status of PredictionIO system and its dependencies.
      ● Event Server Commands
        ○ pio eventserver : Launch the Event Server.
        ○ pio app : Manage apps that are used by the Event Server.
      ● Engine Commands
        ○ pio build : Build the engine at the current directory.
        ○ pio train : Kick off a training using an engine.
        ○ pio deploy : Deploy an engine as an engine server. If no instance ID is specified, it will deploy the latest instance.
      Source: https://predictionio.incubator.apache.org/cli/#engine-commands
  22. REST API and SDKs
      ● REST API :
        ○ Default ports (check them against your own setup) :
          ■ Event Server : 7070
          ■ Engine : 8000
        ○ Example: send an event to the Event Server
            $ curl -i -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" -H "Content-Type: application/json" -d '{$JSON-CONTEXT}'
        ○ Example: query the Engine
            $ curl -H "Content-Type: application/json" -d '{ $JSON-CONTEXT }' http://localhost:8000/queries.json
      ● SDKs :
        ○ Java & Android
        ○ Python
        ○ PHP
        ○ Ruby
      Source: https://predictionio.incubator.apache.org/cli/#engine-commands
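The event body that the slide elides as $JSON-CONTEXT could look like the following. This is a minimal sketch, assuming a "rate" event from a user on an item; the field names follow the PredictionIO Event API, while the `EventPayload` helper, the IDs, and the rating value are hypothetical and exist only for illustration.

```scala
// Minimal sketch: builds the JSON body and URL for one event sent to the Event Server.
// EventPayload and the sample values are illustrative, not part of PredictionIO.
object EventPayload {
  // Renders a user-to-item "rate" event with a single numeric "rating" property.
  // IDs are inserted verbatim, so they must not contain characters that need JSON escaping.
  def rateEvent(userId: String, itemId: String, rating: Double): String =
    s"""{"event":"rate","entityType":"user","entityId":"$userId",""" +
      s""""targetEntityType":"item","targetEntityId":"$itemId",""" +
      s""""properties":{"rating":$rating}}"""

  // Event Server endpoint; 7070 is the default port named on the slide.
  def eventsUrl(host: String, accessKey: String): String =
    s"http://$host:7070/events.json?accessKey=$accessKey"

  def main(args: Array[String]): Unit = {
    println(eventsUrl("localhost", "YOUR_ACCESS_KEY"))
    println(rateEvent("u1", "i1", 4.0))
  }
}
```

The printed body can be passed to the curl command's -d option in place of the placeholder.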
  23. Outline: Background, PredictionIO Overview, Quick Start your first Engine, Customizing an Engine, Implementation on Enterprise Production, Summary
  24. Customizing your Engine with D-A-S-E
      D (Data Source and Data Preparator) : DataSource.scala, Preparator.scala
      A (Algorithm) : ALSAlgorithm.scala
      S (Serving) : Serving.scala
      E (Evaluation) : Evaluation.scala
      Engine : Engine.scala
      Parameter settings : engine.json
      Source: https://predictionio.incubator.apache.org/customize/
  25. Engine (Engine.scala)
      Diagram labels: Query (case class), Predicted Result (case class), Engine Factory (object RecommendationEngine), Query via REST, Predicted Result, parameter settings.
  26. Example
  27. Data Source and Data Preparator (DataSource.scala, Preparator.scala)
      Event Server -> DataSource readTrain() : events RDD -> ratings RDD -> Training Data
      Training Data -> Preparator prepare() : Action Required (*) -> Prepared Data -> Algorithm
      Note : (*) Performs any necessary feature selection or data processing, etc.
  28. Algorithm (ALSAlgorithm.scala)
      train() : takes the Prepared Data and the parameter settings and produces a Model (*)
      predict() : takes the Model and a Query and produces a Predicted Result, which is passed to Serving
      Note : (*) train() is called when you run "pio train"
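The train()/predict() split can be sketched with plain Scala collections. This is a minimal sketch under stated assumptions: a mean-rating popularity model stands in for ALS, Spark RDDs are replaced by `Seq`, and every name (`ToyAlgorithm`, `Rating`, `Query`, `ItemScore`, `Model`) is hypothetical rather than taken from the template.

```scala
// Minimal sketch of the train()/predict() contract using plain Scala collections.
// A mean-rating popularity model stands in for ALS; all names are illustrative.
object ToyAlgorithm {
  case class Rating(user: String, item: String, rating: Double)
  case class Query(user: String, num: Int)
  case class ItemScore(item: String, score: Double)
  case class Model(meanByItem: Map[String, Double], seenByUser: Map[String, Set[String]])

  // train(): runs once per training job and returns the model to keep.
  def train(ratings: Seq[Rating]): Model = {
    val meanByItem = ratings.groupBy(_.item).map { case (item, rs) =>
      item -> rs.map(_.rating).sum / rs.size
    }
    val seenByUser = ratings.groupBy(_.user).map { case (user, rs) =>
      user -> rs.map(_.item).toSet
    }
    Model(meanByItem, seenByUser)
  }

  // predict(): runs per query; returns the best-rated items the user has not rated yet.
  def predict(model: Model, query: Query): Seq[ItemScore] = {
    val seen = model.seenByUser.getOrElse(query.user, Set.empty[String])
    model.meanByItem.toSeq
      .filterNot { case (item, _) => seen.contains(item) }
      .sortBy { case (item, score) => (-score, item) }
      .take(query.num)
      .map { case (item, score) => ItemScore(item, score) }
  }
}
```

Keeping the expensive work in train() and only lookups in predict() mirrors why the two steps are separate: training runs off-line, prediction runs on every REST query.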
  29. Example of engine.json for Algorithm
          {
            ...
            "algorithms": [
              {
                "name": "als",
                "params": {
                  "rank": 10,
                  "numIterations": 20,
                  "lambda": 0.01,
                  "seed": 3
                }
              }
            ]
            ...
          }
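Each key under "params" corresponds to a field of the algorithm's parameter class. The following is a minimal sketch of such a class, assuming the name `ALSAlgorithmParams`; in a real engine the class would extend PredictionIO's Params trait, which is left out here so the example has no dependencies. The `require` checks are hypothetical additions.

```scala
// Minimal sketch: a parameter class mirroring the "params" block of engine.json.
// In a real engine this class would extend PredictionIO's Params trait; it is a
// plain case class here so the example stays dependency-free.
object EngineParamsSketch {
  case class ALSAlgorithmParams(rank: Int, numIterations: Int, lambda: Double, seed: Option[Long]) {
    // Hypothetical sanity checks, not part of the template.
    require(rank > 0, "rank must be positive")
    require(numIterations > 0, "numIterations must be positive")
    require(lambda >= 0.0, "lambda must be non-negative")
  }

  // The values from the engine.json example on the slide.
  val fromSlide: ALSAlgorithmParams =
    ALSAlgorithmParams(rank = 10, numIterations = 20, lambda = 0.01, seed = Some(3L))
}
```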
  30. Serving
      serve() : takes the Query and the Predicted Results from the algorithms, combines them (*), and returns one Predicted Result as JSON
      Note : (*) serve() combines multiple predicted results into one if you have more than one predictive model
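One way to combine several predicted results is sketched below. This is a minimal sketch, assuming each result is a list of scored items; averaging the scores per item is just one possible policy chosen for illustration, and `ServingSketch`, `ItemScore`, and `PredictedResult` are hypothetical names.

```scala
// Minimal sketch of a serve() step that merges results from several algorithms.
// Averaging per item is one possible policy, chosen here only for illustration.
object ServingSketch {
  case class ItemScore(item: String, score: Double)
  case class PredictedResult(itemScores: Seq[ItemScore])

  // Averages each item's score over the results that contain it,
  // then keeps the top `num` items (ties broken by item ID).
  def serve(num: Int, predictedResults: Seq[PredictedResult]): PredictedResult = {
    val averaged = predictedResults
      .flatMap(_.itemScores)
      .groupBy(_.item)
      .map { case (item, scores) => ItemScore(item, scores.map(_.score).sum / scores.size) }
      .toSeq
      .sortBy(s => (-s.score, s.item))
    PredictedResult(averaged.take(num))
  }
}
```

With a single algorithm, the same function simply returns that algorithm's top items.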
  31. Quiz (1/4) - Read Custom Events
      Q: How do you change the two events "rate" and "buy" to "like" and "dislike"?
      Before:
          val eventsRDD: RDD[Event] = PEventStore.find(
            appName = dsp.appName,
            entityType = Some("user"),
            eventNames = Some(List("rate", "buy")), // read "rate" and "buy" events
            // targetEntityType is an optional field of an event.
            targetEntityType = Some(Some("item")))(sc)
      After:
          val eventsRDD: RDD[Event] = PEventStore.find(
            appName = dsp.appName,
            entityType = Some("customer"), // changed from "user" to "customer"
            eventNames = Some(List("like", "dislike")), // read "like" and "dislike" events
            // targetEntityType is an optional field of an event.
            targetEntityType = Some(Some("product")))(sc) // changed from "item" to "product"
  32. Quiz (2/4) - Map Custom Events
      Q: How do you change the two events "rate" and "buy" to "like" and "dislike"? (continued)
      Before:
          val ratingValue: Double = event.event match {
            case "rate" => event.properties.get[Double]("rating")
            case "buy" => 4.0 // map a buy event to a rating of 4.0
            case _ => throw new Exception(s"Unexpected event ${event} is read.")
          }
      After:
          val ratingValue: Double = event.event match {
            case "rate" => event.properties.get[Double]("rating")
            case "buy" => 4.0 // map a buy event to a rating of 4.0
            case "like" => 4.0 // map a like event to a rating of 4.0
            case "dislike" => 1.0 // map a dislike event to a rating of 1.0
            case _ => throw new Exception(s"Unexpected event ${event} is read.")
          }
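The mapping from event names to rating values can be tried out without a running Event Server. This is a minimal sketch, assuming the event has already been reduced to its name and an optional explicit rating; the `EventMapping` object and its signature are hypothetical.

```scala
// Minimal sketch: the event-to-rating mapping from the quiz as a pure function.
// EventMapping is illustrative; the real code reads these values from an Event.
object EventMapping {
  // Returns the rating implied by an event name; "rate" carries its own rating.
  def ratingValue(event: String, rating: Option[Double] = None): Double = event match {
    case "rate" =>
      rating.getOrElse(throw new IllegalArgumentException("rate event without a rating"))
    case "buy" | "like" => 4.0 // positive signals map to 4.0
    case "dislike"      => 1.0 // negative signal maps to 1.0
    case other          => throw new IllegalArgumentException(s"Unexpected event $other")
  }
}
```

Failing on unknown event names, as the slide's code does, keeps silently mislabelled events out of the training data.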
  33. Quiz (3/4) - Customizing Data Preparator
      Q: How do you add a blacklist so that the system filters out some products?
      Before:
          class Preparator extends PPreparator[TrainingData, PreparedData] {
            def prepare(sc: SparkContext, trainingData: TrainingData): PreparedData = {
              new PreparedData(ratings = trainingData.ratings)
            }
          }
      After:
          import scala.io.Source // ADDED

          class Preparator extends PPreparator[TrainingData, PreparedData] {
            def prepare(sc: SparkContext, trainingData: TrainingData): PreparedData = {
              // item IDs to exclude, one per line
              val noTrainItems = Source.fromFile("./data/sample_not_train_data.txt").getLines().toSet
              // exclude noTrainItems from the original trainingData
              val ratings = trainingData.ratings.filter(r => !noTrainItems.contains(r.item))
              new PreparedData(ratings)
            }
          }
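The blacklist filter itself can be exercised on in-memory data. This is a minimal sketch, assuming ratings are held in a `Seq` rather than an RDD and the blacklist arrives as lines of text; `BlacklistPreparator`, `parseBlacklist`, and the trimming of whitespace are hypothetical additions, not part of the slide's code.

```scala
// Minimal sketch: the blacklist step of the Preparator on plain collections.
// BlacklistPreparator and its helpers are illustrative names.
object BlacklistPreparator {
  case class Rating(user: String, item: String, rating: Double)

  // Turns blacklist lines into item IDs, trimming whitespace and skipping blank lines.
  def parseBlacklist(lines: Iterator[String]): Set[String] =
    lines.map(_.trim).filter(_.nonEmpty).toSet

  // Drops every rating whose item is on the blacklist.
  def prepare(ratings: Seq[Rating], blacklist: Set[String]): Seq[Rating] =
    ratings.filterNot(r => blacklist.contains(r.item))
}
```

Holding the blacklist in a `Set` keeps each membership check constant time, which matters when the filter runs over millions of ratings.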
  34. Quiz (4/4) - Release Your Changes
      ● How do you release the modified engine(s)?
          $ pio build
          $ pio train
          $ pio deploy
  35. Evaluation (1/4)
      Diagram labels: Evaluation (AccuracyEvaluation), Evaluation Metrics, Engine Params List, AccuracyAlgo.
          case class Accuracy()
            extends AverageMetric[EmptyEvaluationInfo, Query, PredictedResult, ActualResult] {
            // Scores 1.0 when the predicted label matches the actual label, else 0.0.
            def calculate(query: Query, predicted: PredictedResult, actual: ActualResult): Double =
              if (predicted.label == actual.label) 1.0 else 0.0
          }
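What an averaging metric reports can be sketched as the mean of the per-query scores. This is a minimal sketch, assuming labels are plain `Double` values and that an empty evaluation set scores 0.0; `AccuracySketch` and both of those choices are hypothetical, and the real averaging is done by the framework.

```scala
// Minimal sketch: per-query scoring plus the averaging an average metric performs.
// AccuracySketch and the empty-input convention are illustrative assumptions.
object AccuracySketch {
  case class PredictedResult(label: Double)
  case class ActualResult(label: Double)

  // Per-query score, as in calculate(): 1.0 for a correct label, else 0.0.
  def calculate(predicted: PredictedResult, actual: ActualResult): Double =
    if (predicted.label == actual.label) 1.0 else 0.0

  // Mean of the per-query scores; returns 0.0 when there are no pairs.
  def accuracy(pairs: Seq[(PredictedResult, ActualResult)]): Double =
    if (pairs.isEmpty) 0.0
    else pairs.map { case (p, a) => calculate(p, a) }.sum / pairs.size
}
```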
  36. Evaluation (2/4)
      Diagram labels: Query (case class), Predicted Result (case class), Actual Result (case class), RecommendationEngine, class DataSource, Query via REST, Predicted Result, parameter settings.
  37. Evaluation (3/4)
      Training : events DB -> DataSource readTrain() : events RDD -> ratings RDD -> Training Data -> Preparator prepare() (TBD) -> Prepared Data -> Algorithm
      Evaluation : events DB -> DataSource readEval() : K-fold splitting -> TrainingData and RDD(Query, ActualResult)
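K-fold splitting can be sketched as assigning each record to a fold by its index. This is a minimal sketch, assuming the data fits in a `Seq` and that record i is tested in fold i mod k; `KFoldSketch` and `split` are hypothetical names, and a readEval() implementation would apply the same idea to RDDs.

```scala
// Minimal sketch: k-fold splitting by index on a plain Seq.
// KFoldSketch is illustrative; readEval() would do the equivalent on RDDs.
object KFoldSketch {
  // Returns k (training, test) pairs; the element at index i is tested in fold i % k.
  def split[A](data: Seq[A], k: Int): Seq[(Seq[A], Seq[A])] = {
    require(k >= 2, "k must be at least 2")
    val indexed = data.zipWithIndex
    (0 until k).map { fold =>
      val (test, train) = indexed.partition { case (_, i) => i % k == fold }
      (train.map(_._1), test.map(_._1))
    }
  }
}
```

Every record lands in exactly one test set, so each is used for evaluation once and for training in the other k - 1 folds.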
  38. Evaluation (4/4)
      ● Build and run the evaluation
      ● Deploy the engine with the best parameters
  39. Outline: Background, PredictionIO Overview, Quick Start your first Engine, Customizing an Engine, Implementation on Enterprise Production, Summary
  40. Implementation on Enterprise Production
      Data sources: Test Log clusters; Batch Data (pio import); Real Time Data (Streaming + PIO SDK); Yield-En.
      PredictionIO Platform: Event Server Cluster; Prediction Engine Cluster (P1 Engine, P2 Engine, P3 Engine); Off-line Training.
      Other diagram labels: Meta, Event Data, Model, RDD, Query via REST, Prediction Result.
  41. Deploy the Event Server onto Prediction Cluster
      Steps: Set up PredictionIO, run the Event Server, set up HAProxy.
      ● On each Event Server to be deployed, run:
          $ pio eventserver &
      ● HAProxy configuration for Event Server Cluster
          listen pio_engine_7070 :7070
              mode http
              balance roundrobin
              option httpclose
              option forwardfor
              option redispatch
              retries 3
              log global
              log 127.0.0.1 local4 info
              server piovm1 192.168.56.101:7070 check weight 1 maxconn 30
              server piovm2 192.168.56.102:7070 check weight 1 maxconn 30
              server piovm3 192.168.56.103:7070 check weight 1 maxconn 30
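The effect of `balance roundrobin` over three equally weighted servers can be sketched as a simple rotation. This is a minimal sketch, assuming all servers are healthy and ignoring `maxconn`, health checks, and redispatching; `RoundRobinSketch` and `pick` are hypothetical names and do not reflect how HAProxy is implemented internally.

```scala
// Minimal sketch: round-robin selection over equally weighted backends.
// RoundRobinSketch is illustrative and ignores health checks and maxconn.
object RoundRobinSketch {
  // The three Event Server backends from the HAProxy configuration.
  val eventServers: Vector[String] =
    Vector("192.168.56.101:7070", "192.168.56.102:7070", "192.168.56.103:7070")

  // The n-th request (0-based) goes to server n mod size.
  def pick(servers: Vector[String], requestNumber: Long): String = {
    require(servers.nonEmpty, "at least one server is required")
    require(requestNumber >= 0, "requestNumber must be non-negative")
    servers((requestNumber % servers.size).toInt)
  }
}
```

Because the Event Servers share the same storage backend, any of them can accept any event, which is what makes this kind of stateless rotation workable.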
  42. Deploy the Engine onto Prediction Cluster
      Diagram labels: Off-Line Engine Training, $ pio deploy.
      ● On each Prediction Server to be deployed, run:
          $ pio deploy --port 8001 --engine-instance-id AV6dTEoKBlbECIGzXhaS
  43. Summary
      • Apache PredictionIO is an active and popular project.
      • It lets you integrate machine learning functions into your apps effectively and efficiently.
      • Multiple PredictionIO nodes can be consolidated with HAProxy and the rest of the Hadoop ecosystem to provide a scalable and stable solution.
  44. Thank You !
