This document provides an overview of PredictionIO, an open source machine learning server. It discusses what PredictionIO is, how it works, and how to set it up and build a movie recommendation engine on it. PredictionIO is built on Apache Spark and MLlib and uses HBase for data storage. It allows importing data, building models, training, evaluating, and deploying machine learning engines. The document demonstrates how to create a movie recommendation engine using PredictionIO's template, import movie rating data, train and deploy the engine, and test it by sending queries.
1. Machine Learning on
PredictionIO
A Presentation by Gladson V Manuel
Contents in slides available online at:
https://gladsonvm.wordpress.com/predictionio/
2. "I believe that at the end of the century the use of words and general
educated opinion will have altered so much that one will be able to speak
of machines thinking without expecting to be contradicted,"
- Alan Turing
The quote above shows that Alan Turing believed humans would come to accept the idea of thinking machines by the year 2000.
3. PredictionIO
What is PredictionIO?
PredictionIO is an open source Machine Learning (ML) server built on top of a state-of-the-art open source
stack, which lets developers and data scientists create predictive engines for any machine learning task.
Note that any machine learning tool is only as good as the data given to it.
Data should be given prime importance when working with machine learning tools: it must be
well formatted for the tools to read, and it must be sufficient to cover a wide range of possible
outcomes.
Built on top of:
● Apache Spark
● MLlib
● HBase
● Spray
● Elasticsearch
4. Powered by:
● Apache Spark
Spark is a large-scale data processing engine that powers the algorithm, training, and serving
processing.
● MLlib
MLlib is the scalable machine learning library of Apache Spark.
● HBase
The Event Server uses Apache HBase as its data store; it stores imported events. HBase is
required only if the PredictionIO Event Server is used.
● Spray
spray is an open-source toolkit for building REST/HTTP-based integration layers on top of Scala and
Akka.
● Elasticsearch
stores metadata such as model versions, engine versions, access key and app id mappings,
evaluation results, etc.
5. Basic ML Processes
● Build
The build process builds the ML engine from the imported data. A model is created as a result of the build
process, according to the chosen algorithm. The default algorithm is Naive Bayes; algorithms like Random
Forest are also supported by default.
● Train
Training improves the accuracy of the results: the more the engine is trained, the more accurate its
results become.
● Evaluation
Data is split into parts: the larger part is used for training and the smaller part is used to estimate
performance. PredictionIO uses k-fold cross-validation by default.
● Deploy
In this stage the engine is deployed, and the serving component replies to the queries sent by users.
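The evaluation split described above (larger part for training, smaller part for performance estimation, repeated k times) can be sketched in plain Python. This is a conceptual illustration only, not PredictionIO code:

```python
# Sketch of k-fold cross-validation: each fold is held out once for
# performance estimation while the remaining folds are used for training.
def k_fold_splits(data, k=5):
    """Yield (train, test) pairs; the larger part trains, the smaller evaluates."""
    fold_size = len(data) // k
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, test

ratings = list(range(10))  # toy stand-in for imported rating events
for train, test in k_fold_splits(ratings, k=5):
    # with 10 events and k=5, every fold holds out 2 events
    assert len(test) == 2 and len(train) == 8
```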
7. Data Flow
● HBase is the data store; all imported events are stored in HBase.
● The Data Source then reads the imported data and converts it to the desired format.
● The Data Preparator preprocesses the data and feeds it to the algorithm for model
training.
● The Algorithm component contains the ML algorithm that determines how a predictive model is
constructed.
● The Serving component accepts input from the user and returns the corresponding
results. Serving combines the results into one if the engine has multiple
algorithms. Logic can be added to the serving component to customize results.
● PredictionIO splits this architecture into two parts: an engine composed of the [D]ata
source (with preparator), [A]lgorithm, and [S]erving components, and a second part containing
only the [E]valuation metrics. PredictionIO refers to this as the DASE architecture.
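The data flow above can be illustrated with a toy Python pipeline. This is only a conceptual sketch of the DASE stages (real PredictionIO components are Scala classes); the functions and the mean-rating "model" are invented for illustration:

```python
# Toy DASE flow: DataSource -> Preparator -> Algorithm -> Serving.
def data_source():
    # Reads imported events (hard-coded stand-ins for HBase rows).
    return [("u1", "i1", 5.0), ("u1", "i2", 3.0), ("u2", "i1", 4.0)]

def preparator(events):
    # Preprocesses events into per-item rating lists for the algorithm.
    prepared = {}
    for user, item, rating in events:
        prepared.setdefault(item, []).append(rating)
    return prepared

def algorithm(prepared):
    # Builds a trivial "model": mean rating per item.
    return {item: sum(r) / len(r) for item, r in prepared.items()}

def serving(models, num):
    # Combines results (a single model here) and returns the top-num items.
    model = models[0]
    ranked = sorted(model.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:num]

model = algorithm(preparator(data_source()))
print(serving([model], 1))  # → [('i1', 4.5)]
```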
8. PredictionIO Setup
● One Liner
$ bash -c "$(curl -s https://install.prediction.io/install.sh)"
The one-liner downloads and installs PredictionIO with all the required dependencies.
● Manual Install
➢ Apache Hadoop 2.4.0 (optional, required only if YARN and HDFS are needed)
➢ Apache Spark 1.3.0 for Hadoop 2.4
➢ Java SE Development Kit 7
● Supported Databases
➢ PostgreSQL 9.1
➢ MySQL 5.1
➢ Apache HBase 0.98.6 + Elasticsearch 1.4.0
9. ● Start Prediction Server
Change directory to pio_installation_directory/bin (or add the pio path to your PATH) and run
pio-start-all
This will show the status of every task, and finally a message like the one below will be shown:
[INFO] [HBLEvents] Removing table pio_event:events_0...
[INFO] [Console$] (sleeping 5 seconds for all messages to show up...)
[INFO] [Console$] Your system is all ready to go.
● Status of server can be checked at any time using
pio status
Engine templates can be downloaded from the PredictionIO
website
http://templates.prediction.io/
10. Building a Movie recommendation Engine on PredictionIO
● A movie recommendation engine can suggest movies to watch to a user
based on the ratings by other users.
● The dataset can be pulled from the MovieLens website:
http://files.grouplens.org/datasets/movielens/ml-100k.zip
● Engine template is available for movie recommendation and can be
downloaded by:
$ pio template get PredictionIO/template-scala-parallel-recommendation <YourNewEngineDir>
This will create a directory with the name you gave for the engine and will pull all the required files into it.
11. Create a New app and import data
● pio app new <app_name>
- This will create a new app with the given name. After the app is created, its access key
together with the app ID and other details will be displayed.
● Import data
-Execute the following commands in the engine directory to pull the MovieLens data from GitHub
and import it into PredictionIO:
curl https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_movielens_data.txt
--create-dirs -o data/sample_movielens_data.txt
python data/import_eventserver.py --access_key <access_key>
Replace <access_key> with the access key of the app, obtained from pio app list.
If the data import is successful, the number of imported events will be shown.
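For reference, the sample MovieLens file stores one rating per line in a ::-delimited user::item::rating format. A minimal parser for that assumed format, turning each line into the event shape that the import script sends (the parse_rating_line helper is hypothetical, not part of the template):

```python
# Parse a ::-delimited MovieLens sample line into a "rate" event dict,
# matching the field names used by the PredictionIO event API.
def parse_rating_line(line):
    user, item, rating = line.strip().split("::")
    return {
        "event": "rate",
        "entityType": "user",
        "entityId": user,
        "targetEntityType": "item",
        "targetEntityId": item,
        "properties": {"rating": float(rating)},
    }

print(parse_rating_line("0::2::3"))
```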
12. Creating Custom Events
● Custom events can be created through the REST API, for example with curl as below:
$ curl -i -X POST "http://localhost:7070/events.json?accessKey=$ACCESS_KEY" \
-H "Content-Type: application/json" \
-d '{
"event" : "rate",
"entityType" : "user",
"entityId" : "u0",
"targetEntityType" : "item",
"targetEntityId" : "i0",
"properties" : {
"rating" : 5
},
"eventTime" : "2014-11-02T09:39:45.618-08:00"
}'
● Fields must be the same as those defined in the dataset.
13. Creating events in Python SDK
The Python package for PredictionIO must be installed to use the Python SDK. Installation can be done with pip:
pip install predictionio
● import predictionio
client = predictionio.EventClient(
access_key=<ACCESS KEY>,
url=<URL OF EVENTSERVER>,
threads=5,
qsize=500
)
# A user rates an item
client.create_event(
event="rate",
entity_type="user",
entity_id=<USER ID>,
target_entity_type="item",
target_entity_id=<ITEM ID>,
properties= { "rating" : float(<RATING>) }
)
14. Check data and implement engine
● The event server can be queried to confirm that the data import was successful:
-http://localhost:7070/events.json?accessKey=<YOUR_ACCESS_KEY>
● The same can be performed through REST API also
$ curl -i -X GET "http://localhost:7070/events.json?accessKey=$ACCESS_KEY"
● Training
-Before training, the app name must be set in the engine.json file. Open engine.json in the template
directory using an editor:
"datasource": {
"params" : {
"appName": "<app_name>"
}
},
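The appName edit can also be scripted. A sketch of a hypothetical helper (set_app_name is not a PredictionIO tool) that rewrites the datasource block, assuming engine.json has the structure shown above:

```python
import json
import os
import tempfile

def set_app_name(path, app_name):
    # Load engine.json, set datasource.params.appName, and write it back.
    with open(path) as f:
        config = json.load(f)
    config["datasource"]["params"]["appName"] = app_name
    with open(path, "w") as f:
        json.dump(config, f, indent=2)

# Demo with a temporary file standing in for engine.json:
tmp = tempfile.NamedTemporaryFile("w", suffix=".json", delete=False)
json.dump({"datasource": {"params": {"appName": "INVALID"}}}, tmp)
tmp.close()
set_app_name(tmp.name, "MyMovieApp")
with open(tmp.name) as f:
    print(json.load(f)["datasource"]["params"]["appName"])  # → MyMovieApp
os.unlink(tmp.name)
```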
15. Build, Train and Deploy Engine
● To build the engine, issue the command pio build --verbose.
- The verbose flag will display the output of each ongoing operation.
- On a successful build, PredictionIO will display the following line:
[INFO] [Console$] Your engine is ready for training.
● To train the engine, issue the command pio train.
-On successful training, PredictionIO will display the following:
[INFO] [CoreWorkflow$] Training completed successfully.
- If a low heap size issue is reported, then -- --driver-memory 8G can be appended to carry on
training with more memory. Replace 8G with the desired amount; 'G' stands for GB.
16. ● The engine can be deployed with the command pio deploy.
-Heap size issues can be solved by allocating more memory for deployment using the
--driver-memory parameter.
-On a successful run, a message will be displayed notifying that the engine is running on port
8000 and can be accessed at 0.0.0.0:8000.
● Querying Engine
-Queries can be sent to the engine via the REST API or an SDK.
● REST API
$ curl -H "Content-Type: application/json" \
-d '{ "user": "1", "num": 4 }' http://localhost:8000/queries.json
● Python SDK
import predictionio
engine_client = predictionio.EngineClient(url="http://localhost:8000")
print(engine_client.send_query({"user": "1", "num": 4}))
17. Sample Response
● The following is a sample response. Note that the number of results returned
can be changed by altering the num key of the input query, e.g. {"user": "1", "num": 2}.
{
"itemScores":[
{"item":"22","score":4.072304374729956},
{"item":"62","score":4.058482414005789},
{"item":"75","score":4.046063009943821},
{"item":"68","score":3.8153661512945325}
]
}
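A response like the one above can be consumed with a few lines of Python. A sketch, using the field names from the sample response (the top_items helper is an illustration, not part of the SDK):

```python
import json

# Sample response body from /queries.json, copied from the slide above.
body = '''{"itemScores":[
{"item":"22","score":4.072304374729956},
{"item":"62","score":4.058482414005789},
{"item":"75","score":4.046063009943821},
{"item":"68","score":3.8153661512945325}
]}'''

def top_items(response_json, n=2):
    # itemScores arrives sorted by score; keep the n best recommendations.
    scores = json.loads(response_json)["itemScores"]
    return [entry["item"] for entry in scores[:n]]

print(top_items(body))  # → ['22', '62']
```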