Machine Learning on
PredictionIO
A Presentation by Gladson V Manuel
Contents in slides available online at:
https://gladsonvm.wordpress.com/predictionio/
"I believe that at the end of the century the use of words and general
educated opinion will have altered so much that one will be able to speak
of machines thinking without expecting to be contradicted,"
- Alan Turing
In this quote, Alan Turing predicted that by the end of the twentieth century, the idea of machines thinking would be generally accepted.
PredictionIO
What is PredictionIO?
PredictionIO is an open source Machine Learning (ML) server built on top of a state-of-the-art open source
stack that lets developers and data scientists create predictive engines for any machine learning task.
Note that any machine learning tool is only as good as the data given to it.
Data should be given prime importance when working with machine learning tools. It must be well
formatted for the tools to read, and it must be sufficient to cover a wide range of possible
outcomes.
Built on top of:
● Apache Spark
● MLlib
● HBase
● Spray
● Elasticsearch
Powered by:
● Apache Spark
Spark is a large-scale data processing engine that powers the algorithm, training, and serving
stages.
● MLlib
MLlib is the scalable machine learning library of Apache Spark.
● HBase
The Event Server uses Apache HBase as its data store for imported events. HBase is required
only if the PredictionIO event server is used.
● Spray
spray is an open-source toolkit for building REST/HTTP-based integration layers on top of Scala and
Akka.
● Elasticsearch
Elasticsearch stores metadata such as model versions, engine versions, access key and app ID
mappings, evaluation results, etc.
Basic ML Processes
● Build
The build process builds the ML engine on the imported data. A model is created as a result of
the build, according to the chosen algorithm. The default algorithm is Naive Bayes; other
algorithms such as Random Forest are also supported out of the box.
● Train
Training improves the accuracy of the results: the more the engine is trained, the more
accurate its results become.
● Evaluation
Data is split into parts; the larger part is used for training and the smaller part is used to
estimate performance. PredictionIO uses k-fold cross-validation by default.
● Deploy
In this stage the engine is deployed, and the serving component replies to queries sent by users.
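The evaluation step above can be sketched in a few lines of Python. This is an illustrative implementation of k-fold splitting only, not PredictionIO's actual Spark-based evaluator:

```python
# Illustrative k-fold splitting (not PredictionIO's Spark-based evaluator).
def k_fold_splits(data, k):
    """Yield (train, test) pairs: each fold is the test set exactly once."""
    fold_size = len(data) // k
    for i in range(k):
        test = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield train, test

# Example: 10 samples, 5 folds -> each test fold holds 2 samples.
data = list(range(10))
for train, test in k_fold_splits(data, 5):
    assert len(test) == 2 and len(train) == 8
```

Each fold serves once as the held-out evaluation set while the remaining folds are used for training, so every sample contributes to both training and evaluation.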
Architecture
Data Flow
● HBase is the data store; all imported events are stored in HBase.
● The data source then reads the imported data and converts it to the desired format.
● The data preparator preprocesses the data and feeds it to the algorithm for model
training.
● The algorithm component contains the ML algorithm that determines how a predictive model is
constructed.
● The serving component accepts queries from users and returns the corresponding
results. If the engine has multiple algorithms, serving combines their results into one.
Logic can be added to the serving component to customize results.
● PredictionIO splits this architecture into two parts: an engine composed of the [D]ata
source and preparator, [A]lgorithm, and [S]erving components, and a second part containing
only the [E]valuation metrics. PredictionIO refers to this as the DASE architecture.
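The DASE flow described above can be sketched in miniature. The following Python classes are purely illustrative (real PredictionIO engines implement these components in Scala), using a toy average-rating "model":

```python
# Illustrative sketch of the DASE flow (PredictionIO's real components are
# Scala classes; this is a toy Python analogue).

class DataSource:
    def read(self, raw_events):
        # Convert imported events into (user, item, rating) training tuples.
        return [(e["user"], e["item"], e["rating"]) for e in raw_events]

class Preparator:
    def prepare(self, training_data):
        # Preprocess the data, e.g. drop non-positive ratings.
        return [t for t in training_data if t[2] > 0]

class Algorithm:
    def train(self, prepared_data):
        # Toy "model": average rating per item.
        scores = {}
        for _user, item, rating in prepared_data:
            scores.setdefault(item, []).append(rating)
        return {item: sum(r) / len(r) for item, r in scores.items()}

class Serving:
    def serve(self, model, query):
        # Return the top-N items by score.
        ranked = sorted(model.items(), key=lambda kv: -kv[1])
        return ranked[:query["num"]]

events = [{"user": "u0", "item": "i0", "rating": 5},
          {"user": "u1", "item": "i1", "rating": 3}]
model = Algorithm().train(Preparator().prepare(DataSource().read(events)))
print(Serving().serve(model, {"num": 1}))  # [('i0', 5.0)]
```

The point of the separation is that each stage can be swapped independently: a different algorithm, or extra serving logic, without touching the data pipeline.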
PredictionIO Setup
● One Liner
$ bash -c "$(curl -s https://install.prediction.io/install.sh)"
The one-liner downloads and installs PredictionIO with all the required dependencies.
● Manual Install
➢ Apache Hadoop 2.4.0 (optional, required only if YARN and HDFS are needed)
➢ Apache Spark 1.3.0 for Hadoop 2.4
➢ Java SE Development Kit 7
● Supported Databases
➢ PostgreSQL 9.1
➢ MySQL 5.1
➢ Apache HBase 0.98.6 + Elasticsearch 1.4.0
● Start Prediction Server
Change directory to pio_installation_directory/bin (or add pio to your PATH) and run
pio-start-all
This will show the status of every task, and finally a message like the one below will be shown:
[INFO] [HBLEvents] Removing table pio_event:events_0...
[INFO] [Console$] (sleeping 5 seconds for all messages to show up...)
[INFO] [Console$] Your system is all ready to go.
● The status of the server can be checked at any time using
pio status
● Engine templates can be downloaded from the PredictionIO website:
http://templates.prediction.io/
Building a Movie recommendation Engine on PredictionIO
● A movie recommendation engine can suggest movies to a user based on
the ratings given by other users.
● The dataset can be pulled from the MovieLens website:
http://files.grouplens.org/datasets/movielens/ml-100k.zip
● An engine template is available for movie recommendation and can be
downloaded with:
$ pio template get PredictionIO/template-scala-parallel-recommendation <YourNewEngineDir>
This will create a directory with the name you gave for the engine and pull all the required files into it.
Create a New app and import data
● pio app new <app_name>
- This will create a new app with the given name. After the app is created, its access key,
app ID, and other details will be displayed.
● Import data
- Execute the following commands in the engine directory to pull the MovieLens data from GitHub
and import it into PredictionIO:
curl https://raw.githubusercontent.com/apache/spark/master/data/mllib/sample_movielens_data.txt \
--create-dirs -o data/sample_movielens_data.txt
python data/import_eventserver.py --access_key <access_key>
Replace <access_key> with the access key of the app, obtained from pio app list.
If the data import is successful, the number of imported events will be shown.
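For illustration, the core of an import script like data/import_eventserver.py can be sketched as follows. This is a hypothetical simplification (the bundled script may differ), assuming input lines of the form user::item::rating:

```python
# Hypothetical core of an import script such as data/import_eventserver.py
# (the bundled script may differ). Assumes input lines "user::item::rating".

def parse_rating(line):
    """Turn one MovieLens line into keyword arguments for create_event()."""
    user, item, rating = line.strip().split("::")
    return {
        "event": "rate",
        "entity_type": "user",
        "entity_id": user,
        "target_entity_type": "item",
        "target_entity_id": item,
        "properties": {"rating": float(rating)},
    }

def import_events(client, path):
    """Send every rating in the file to the event server; return the count."""
    count = 0
    with open(path) as f:
        for line in f:
            client.create_event(**parse_rating(line))
            count += 1
    return count

print(parse_rating("1::31::2.5"))
```

With a predictionio.EventClient (created as on the Python SDK slide), import_events(client, "data/sample_movielens_data.txt") would return the number of events sent.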
Creating Custom Events
● Custom events can be created through the event server's REST API, for example with curl as below:
$ curl -i -X POST http://localhost:7070/events.json?accessKey=$ACCESS_KEY \
-H "Content-Type: application/json" \
-d '{
  "event" : "rate",
  "entityType" : "user",
  "entityId" : "u0",
  "targetEntityType" : "item",
  "targetEntityId" : "i0",
  "properties" : {
    "rating" : 5
  },
  "eventTime" : "2014-11-02T09:39:45.618-08:00"
}'
● Field names must match those defined in the dataset.
Creating events in Python SDK
The PredictionIO Python package must be installed to use the Python SDK. It can be installed with pip:
pip install predictionio
● import predictionio

client = predictionio.EventClient(
    access_key=<ACCESS KEY>,
    url=<URL OF EVENT SERVER>,
    threads=5,
    qsize=500
)

# A user rates an item
client.create_event(
    event="rate",
    entity_type="user",
    entity_id=<USER ID>,
    target_entity_type="item",
    target_entity_id=<ITEM ID>,
    properties={"rating": float(<RATING>)}
)
Check data and implement engine
● The event server can be queried in a browser to confirm the data import was successful:
- http://localhost:7070/events.json?accessKey=<YOUR_ACCESS_KEY>
● The same check can be performed through the REST API:
$ curl -i -X GET "http://localhost:7070/events.json?accessKey=$ACCESS_KEY"
● Training
- Before training, the app name must be set in the engine.json file. Open engine.json in the
template directory with an editor and update the datasource parameters:
"datasource": {
"params" : {
"appName": "<app_name>"
}
},
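For context, the datasource block above sits inside a larger engine.json. A typical file for the recommendation template looks roughly like the following; the engineFactory class name and the algorithm parameters are illustrative and may differ between template versions:

```json
{
  "id": "default",
  "description": "Default settings",
  "engineFactory": "org.template.recommendation.RecommendationEngine",
  "datasource": {
    "params": {
      "appName": "<app_name>"
    }
  },
  "algorithms": [
    {
      "name": "als",
      "params": {
        "rank": 10,
        "numIterations": 20,
        "lambda": 0.01,
        "seed": 3
      }
    }
  ]
}
```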
Build, Train and Deploy Engine
● To build the engine, issue the command pio build --verbose.
- The --verbose flag displays the output of each ongoing operation.
- On a successful build, PredictionIO will display the following line:
[INFO] [Console$] Your engine is ready for training.
● To train the engine, issue the command pio train.
- On successful training, PredictionIO will display the following:
[INFO] [CoreWorkflow$] Training completed successfully.
- If a low heap size issue is reported, pio train -- --driver-memory 8G can be used to run
training with more memory. Replace 8G with the desired amount; 'G' stands for gigabytes.
● The engine can be deployed with the command pio deploy.
- Heap size issues can be solved by allocating more memory for deployment using the
--driver-memory parameter.
- On a successful run, a message will be displayed noting that the engine is running on port
8000 and can be accessed at 0.0.0.0:8000.
● Querying the Engine
- Queries can be sent to the engine via the REST API or an SDK.
● REST API
$ curl -H "Content-Type: application/json" \
-d '{ "user": "1", "num": 4 }' http://localhost:8000/queries.json
● Python SDK
import predictionio
engine_client = predictionio.EngineClient(url="http://localhost:8000")
print(engine_client.send_query({"user": "1", "num": 4}))
Sample Response
● The following is a sample response. Note that the number of results returned
can be changed by altering the num key of the input query, e.g. {"user": "1", "num": 2}.
{
  "itemScores": [
    {"item":"22","score":4.072304374729956},
    {"item":"62","score":4.058482414005789},
    {"item":"75","score":4.046063009943821},
    {"item":"68","score":3.8153661512945325}
  ]
}
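Client code typically reads the top recommendation out of this JSON. A minimal Python sketch, assuming the response has already been parsed (e.g. with json.loads):

```python
# Reading the top recommendation out of the engine's JSON response.
response = {
    "itemScores": [
        {"item": "22", "score": 4.072304374729956},
        {"item": "62", "score": 4.058482414005789},
        {"item": "75", "score": 4.046063009943821},
        {"item": "68", "score": 3.8153661512945325},
    ]
}

# itemScores arrives sorted by score, but taking the max is robust either way.
best = max(response["itemScores"], key=lambda s: s["score"])
print(best["item"])  # 22
```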
Thank YOU
