Tutorial presentation

AWS Machine Learning
Engineering in Computer Science - Data Mining Class

Who are we?
Lukas Hermann
Milad Kiwan
Dario Molinari
Lorenzo Vitali
Daniele De Cillis
Matteo Pallotta

Where to find the material
Slideshare repository
http://www.slideshare.net/dariospin93/presentazione-tutorial-70026708
Github repository:
https://github.com/dariospin93/TutorialDataMining
Here you’ll find the files needed for this tutorial

What is Machine Learning?
“Machine learning is the subfield of computer
science that gives computers the ability to learn
without being explicitly programmed” (Wikipedia)
“A computer program is said to learn from experience
E with respect to some class of tasks T and
performance measure P if its performance at tasks
in T, as measured by P, improves with experience
E.” (Tom M. Mitchell, Chair of the Machine Learning Department at Carnegie Mellon
University)

Some ML tasks:
• Classification: inputs are divided into 2 or more
classes. The goal is to produce a model that
assigns unseen inputs to one of these classes (e.g.,
spam filtering: input = emails, output = “spam” or “not spam”).
• Regression: related to the previous category, but the
outputs are continuous rather than discrete (e.g.,
input = “size of a house”, output = “price”).

Some ML tasks:
• Clustering: divide inputs into groups. The main
difference with respect to Classification problems is
that the groups are not known beforehand
• Dimensionality reduction: map inputs into a
lower-dimensional space. (e.g: input = “set of
documents in human language”, output = “which
documents cover similar topics”)
• ...

Why Machine Learning?
• Growing flood of data
• Growing availability of computational power
• Progress in algorithms

Amazon Machine Learning
“Service that makes it easy for developers of all skill
levels to use machine learning technology. Amazon
Machine Learning provides visualization tools and
wizards that guide you through the process of
creating machine learning (ML) models without
having to learn complex ML algorithms and
technology.”

What is Amazon ML?
• Robust cloud-based service that makes it easy for
developers of all skill levels to use ML technology.
• Create ML models by finding patterns in your
existing data.
• Provides visualization tools and wizards that guide
you through the process.

When to use Amazon ML?
• No need to learn complex ML algorithms and
technology.
• Makes it easy to obtain predictions for your
application using simple APIs.
• ML is not a solution for every type of problem.
– You do not need ML if you can determine a target value by using simple rules,
computations, or predetermined steps that can be programmed.

• Many human tasks cannot be adequately solved
using a simple rule-based solution, for example recognizing
whether an email is spam or not spam.
• ML is useful when rules depend on too many factors and many
of these rules overlap or need to be tuned very
finely.

You can use Amazon ML for these specific ML
tasks:
• binary classification (predicting one of two possible
outcomes).
• multiclass classification (predicting one of more
than two outcomes).
• regression (predicting a numeric value).

Formulating The Problem
• The first step in machine learning is to decide what
you want to predict, which is known as the label or
target answer.
– Predict the number of purchases your customers will make for each
product. (regression problem)
– Predict which products will get more than 10 purchases. (binary
classification problem)
– Predict which category of products is most interesting to a customer.
(multiclass classification problem)
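
As a minimal sketch of how the same raw data can yield different targets, the snippet below derives a regression target and a binary classification target from purchase counts. The product names and counts are hypothetical and made up for illustration only.

```python
# Hypothetical purchase counts per product (illustrative values only).
purchases = {"book": 25, "lamp": 4, "mug": 11}

# Regression target: the number of purchases itself.
regression_target = dict(purchases)

# Binary classification target: 1 if the product got more than 10 purchases, else 0.
binary_target = {product: int(count > 10) for product, count in purchases.items()}

print(regression_target)
print(binary_target)
```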

Collecting Labeled Data
• Labeled data: data for which you already know
the target answer.
• The target: the answer that you want to predict.

Collecting Labeled Data
• Data is often not readily available in a labeled form.
Collecting and preparing the variables and the
target are often the most important steps in solving
an ML problem.
• You provide data that is labeled with the target to
the ML algorithm to learn from. Then, you will use
the trained ML model to predict this answer on
data for which you do not know the target answer.

What is Amazon S3 (Simple Storage Service)?
• Amazon S3 has a simple web services interface
that you can use to store and retrieve any amount
of data.
• It is designed to make web-scale computing easier
for developers.

Training and Evaluation Data
• The fundamental goal of ML is to generalize
beyond the data instances used to train models.
• When you train a model through the Amazon ML
console, Amazon ML by default uses the first 70 percent
of the input data as the training datasource and the
remaining 30 percent as the evaluation datasource.

Training and Evaluation Data
• The ML system uses the training data to train
models to see patterns, and uses the evaluation
data to evaluate the predictive quality of the trained
model
• The ML system evaluates predictive performance
by comparing predictions on the evaluation data
set with true values.
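
The split and the comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Amazon ML's implementation; the function names and the sample values are assumptions introduced here.

```python
def sequential_split(rows, train_percent=70):
    """Split rows in order: the first part for training, the rest for evaluation."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


def accuracy(predictions, true_values):
    """Fraction of predictions that match the true values."""
    matches = sum(p == t for p, t in zip(predictions, true_values))
    return matches / len(true_values)


rows = list(range(10))  # stand-in for 10 observations
train, evaluation = sequential_split(rows)
print(len(train), len(evaluation))

# Compare hypothetical predictions on the evaluation set with the true values.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
```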

Evaluation
• Threshold for prediction
can be adjusted
• Control precision and
recall

Precision and Recall
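
As a reminder of the standard definitions: precision is the fraction of predicted positives that are truly positive, and recall is the fraction of true positives that the model finds. The sketch below computes both from binary labels; the sample labels are hypothetical.

```python
def precision_recall(true_labels, predicted_labels):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels, for illustration only.
true_labels = [1, 1, 1, 0, 0, 0]
predicted_labels = [1, 1, 0, 1, 0, 0]
print(precision_recall(true_labels, predicted_labels))
```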

AWS ML Techniques
• For regression, Amazon ML uses linear regression
• For classification, Amazon ML uses logistic regression
– Despite its name, it is a classification method
– It uses an ML model similar to regression, combined with a logistic (sigmoid)
function
– It can be binomial or multinomial

Logistic Regression: Example
• Labeled data with labels y ∈ {0,1}
• e.g.:
– x: hours of study
– y: pass (1) or fail (0)
• What’s the probability of success given a certain
time spent studying?

Logistic Regression: Example
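
A logistic regression model answers this question by passing a linear function of x through the sigmoid function, which maps any number to a value between 0 and 1. The sketch below shows the shape of the computation; the intercept and slope are hypothetical, untrained values chosen only for illustration.

```python
import math


def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def pass_probability(hours, intercept=-4.0, slope=1.5):
    """P(y = 1 | x = hours) under hypothetical, untrained coefficients."""
    return sigmoid(intercept + slope * hours)


for hours in (1, 2, 3, 4):
    print(hours, round(pass_probability(hours), 3))
```

With a positive slope, the estimated probability of passing grows with the hours of study; training consists of finding the intercept and slope that best fit the labeled data.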

Limitations of Amazon ML
• Only supervised learning (no clustering etc.)
• No selection of the ML method possible
• Preprocessing of the data is a black box

AWS Machine Learning – hands-on

A cloud application
With Amazon ML, we can build and train a predictive model in a scalable cloud
solution. There is no need to install any application to run this tool, because
it runs in the cloud; we only need a web browser to access it. In this tutorial
we’ll show you the basic functionalities of Amazon ML: creating a datasource,
building a model and using this model to generate predictions.
To do this, we need a dataset, as big as possible. Our dataset is taken
from the University of California at Irvine (UCI) Machine Learning Repository,
which hosts many datasets.

What we will see in this tutorial
In this tutorial we’ll see how machine learning can be used for marketing purposes.
We’ll show you how to build and train a model that helps you make
decisions based on the data you have.
We’ll focus on selecting people based on their earnings, which may be useful for
finding who is most suitable for certain marketing offers.

Tutorial Plan
1. Preparing the data
2. Creating a training datasource
3. Creating a model
4. Reviewing the model’s predictive performance and setting a score threshold
5. Using the model to generate predictions
6. Cleaning up (to avoid incurring unwanted charges)

Step 1: Preparing the data
First, we must be sure that the tool understands the data we pass to it. To
do this, we should ensure that our dataset follows Amazon’s guidelines:
• Data must be saved in .csv format
• Each row must be a single observation
• Each column must contain a single attribute of the observation
• The first row should contain the attribute names (you can also provide them in a
separate file, but this is not recommended)
• Attributes must be separated by commas
• If you use Excel on macOS, do not save in the “comma separated value(.csv)”
format; use “windows comma separated (.csv)” instead.
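
Some of these guidelines can be checked before uploading. The sketch below is a hypothetical helper (not part of Amazon ML) that verifies that the attribute names in the first row are unique and that every row has as many values as the header; the sample data is made up for illustration.

```python
import csv
import io


def check_csv(text):
    """Return a list of problems found; an empty list means the basic checks passed."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    if len(set(header)) != len(header):
        problems.append("duplicate attribute names in the first row")
    # Data rows start at line 2 of the file.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {number} has {len(row)} values, expected {len(header)}"
            )
    return problems


# Illustrative data: the last row is missing its "class" value.
sample = "age,education,class\n39,Bachelors,0\n50,Masters\n"
print(check_csv(sample))
```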

Consider our dataset: open the “census.csv” file.
Our target is the attribute “class”: whether a person earns more than 50,000 per year
(binary: 1 if > 50,000, 0 if ≤ 50,000).

In practice, the machine will learn the characteristics of the people who
earn more than the threshold and of those who earn less. With this knowledge, we will
ask it to predict which class other people belong to.

Open the census-batch.csv file: there is no “class” attribute there. The tool’s
job now is to show us what it has learnt by working on this dataset, for which we
know the correct “class” values although they are not included in the file.

Step 2: Creating the training datasource
In order to use all our files, we have to upload them to Amazon S3
• Open https://console.aws.amazon.com/s3/
• Create a new bucket
• Choose upload in the navigation bar
• Add the files mentioned before

Now we create the datasource (it will contain only the location of the data):
• Open https://console.aws.amazon.com/machinelearning/
• Choose Get Started (or Create New) and launch
• Select S3 from “Where your data is located?”
• Type <name of your bucket>/census.csv
• Enter the name “Census data”
• Choose Verify and grant permission
• Review and choose Continue

A schema contains the information needed to interpret the input data for the model.
The simplest and fastest option is to let Amazon infer it, but we have to check that it
is correct. Review the schema and be sure that:
• Attributes with only 2 possible states are marked as binary
• Attributes that are numbers or strings used to denote a category are
marked as categorical
• Attributes that are numbers where order matters are marked as numeric
• Attributes that are plain strings are marked as text
Then choose Continue.
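
To make the four attribute types concrete, here is a rough heuristic that guesses a type from a column's values. It is only an illustration written for this tutorial, not how Amazon infers the schema; in particular, it cannot tell when numbers denote a category, which is exactly why the inferred schema needs a manual review.

```python
def guess_attribute_type(values):
    """Guess 'binary', 'numeric', 'text' or 'categorical' from a column's string values."""
    distinct = set(values)
    if len(distinct) == 2:
        return "binary"
    try:
        for value in distinct:
            float(value)  # raises ValueError if the value is not a number
        return "numeric"
    except ValueError:
        pass
    # Strings containing several words are treated as free text.
    if any(" " in value.strip() for value in distinct):
        return "text"
    return "categorical"


# Illustrative columns (values made up for this example).
print(guess_attribute_type(["0", "1", "1"]))
print(guess_attribute_type(["39", "50", "38"]))
print(guess_attribute_type(["Bachelors", "Masters", "HS-grad"]))
print(guess_attribute_type(["great value", "arrived late", "works fine"]))
```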

Finally we can choose the target attribute to predict; in this case it is “class”. We
don’t have an identifier, so we skip that step and choose Continue; the datasource will then be created.

Step 3: Creating an ML model
Amazon should redirect us to the model creation page. If not:
• From the console, click on “create a new model”
• Choose “I already created a datasource pointing to my S3 data”
• Pick the datasource we created previously and click Continue
• Be sure the model name is “ML model: Census data” and select Default
• The evaluation name must be “Evaluation ML model: Census data”; review and
finish

Now Amazon is processing our data; this may take a few minutes.

Amazon is performing the following operations:
• Splitting the training datasource into 2 parts: one containing 70% of the data
and one containing the remaining 30%
• Training the model with the 70% part
• Testing the resulting model with the 30% part
The status is now Pending. It will change to In Progress and then to Completed.


Step 4: Reviewing the model’s predictive performance
and setting a score threshold
It’s important to check whether the model is good enough for future predictions. This can
be done by looking at the model evaluation.
Take a look at the AUC (Area Under the Curve) metric: it is an industry-standard
metric that expresses the predictive quality of the model.
• Choose evaluation in the model summary
• Click on our model
• Click on summary

In short, the ML model generates a numeric prediction score for each record and
then, based on a threshold, converts these scores into binary labels.

We can interact with this evaluation: if we change the threshold, we change
how the model assigns the labels.
• On the evaluation summary page, choose “Adjust score threshold”
• Try to move the vertical line on the graph: the number of correct predictions
and errors will change:
– Moving it to the right reduces the number of false positives
– Moving it to the left reduces the number of false negatives
• Move it until the score threshold becomes 0.37 (this decreases the false
negatives)
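
The effect of moving the threshold can be reproduced on a handful of scores. This is a minimal sketch with made-up scores and labels; it assumes that a score strictly greater than the threshold is labeled 1, and how a score exactly equal to the threshold is treated is an assumption of this example.

```python
def label_scores(scores, threshold):
    """Convert prediction scores into binary labels using the threshold."""
    return [1 if score > threshold else 0 for score in scores]


def count_errors(true_labels, predicted_labels):
    """Return (false positives, false negatives)."""
    pairs = list(zip(true_labels, predicted_labels))
    false_positives = sum(1 for t, p in pairs if t == 0 and p == 1)
    false_negatives = sum(1 for t, p in pairs if t == 1 and p == 0)
    return false_positives, false_negatives


# Hypothetical scores and true labels, for illustration only.
scores = [0.10, 0.30, 0.40, 0.45, 0.60, 0.90]
true_labels = [0, 0, 1, 0, 1, 1]

for threshold in (0.5, 0.37):
    predicted = label_scores(scores, threshold)
    print(threshold, count_errors(true_labels, predicted))
```

On this toy data, lowering the threshold from 0.5 to 0.37 removes the false negative but introduces a false positive, which is the trade-off the slider exposes.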

From now on, every time the model predicts a label, it will use this new threshold.

Step 5: Using the ML model to generate predictions
There are two types of predictions:
• Real-time predictions: a prediction for a single observation that Amazon
generates on demand
• Batch predictions: a set of predictions for a group of observations (N.B.:
Amazon will charge you 0.10€ per 1000 predictions, rounding up to the next
thousand)
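
The rounding rule can be made explicit with a small calculation. This sketch uses the price quoted above; the function name is hypothetical, and actual pricing should be checked against Amazon's current price list.

```python
from decimal import Decimal


def batch_prediction_cost(num_predictions, price_per_thousand=Decimal("0.10")):
    """Cost of a batch, with the number of predictions rounded up to the next thousand."""
    thousands = -(-num_predictions // 1000)  # ceiling division on integers
    return thousands * price_per_thousand


for n in (1, 1000, 1001):
    print(n, batch_prediction_cost(n))
```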

We’ll now try batch predictions, for which we need the census-batch.csv file that we
uploaded at the beginning.
• Click on Amazon Machine Learning
• Click on Batch prediction
• Choose the model we created and click Continue
• In “Locate the input data”, choose “My data is in s3, and I need to create a
datasource”
• For the name of the datasource, type “Census data 2” and for the location of the
file type “your-bucket/census-batch.csv”
• For “Does the first line in your csv contain the column names?”, choose Yes, then
Verify and Continue

• For the destination, type the location where you uploaded the file at the
beginning
• Accept the default name
• Choose Review
• Grant permission to Amazon S3
• On the review page choose Finish
As we saw with the training, Amazon will now process our file and give us the
results.


To view the results:
• Go to https://console.aws.amazon.com/s3/
• Navigate to the output location given before
• You will find a compressed file containing the result: download it and open it

This file has 2 columns: the best answer and the score for each row of the datasource.
If the score is greater than the threshold, the best answer will be “> 50,000”.
If the score is smaller than the threshold, the best answer will be “≤ 50,000”.
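
The downloaded file can also be read with a short script. The sketch below assumes that the file is gzip-compressed CSV with a header row naming the columns bestAnswer and score; both the compression format and the column names are assumptions to verify against your own output file. The sample bytes stand in for the real download.

```python
import csv
import gzip
import io


def read_batch_results(gz_bytes):
    """Decompress gzip data and return a list of (best_answer, score) tuples."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text))
    return [(row["bestAnswer"], float(row["score"])) for row in reader]


# Stand-in for the downloaded file (values made up for illustration).
sample = gzip.compress(b"bestAnswer,score\n1,0.81\n0,0.12\n")
print(read_batch_results(sample))
```

For a real file, the bytes could be obtained with open(path, "rb").read(), where path is the location of the downloaded file.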

Step 6: Cleaning up
It’s safe to delete all the models and predictions we created so far, in order to avoid
additional charges and to keep our console clean.

AWS Machine Learning - Homework Assignment

Homework Assignment
The tutorial introduced the Amazon ML service through a
graphical interface; in practice, however, it can be useful to integrate the service
into an application.
Amazon ML addresses this need by offering a large, complete and easy-to-use set
of APIs.
http://docs.aws.amazon.com/machine-learning/latest/APIReference

Homework Assignment
Assignment:
You are asked to repeat the steps presented in the tutorial, with the exception of the
5th step (Using the model to generate predictions). Instead, you are asked to
complete that step by writing a Python script that makes use of the APIs.
Write the code needed to:
1) Generate real-time predictions
2) Generate batch predictions

Homework Assignment – Before starting
• DASHBOARD LINK:
https://eu-west-1.console.aws.amazon.com/machinelearning
• DATASOURCE_ID: once in the dashboard, click on the datasource (created in
step 2), then copy the ID
• MODEL_ID: once in the dashboard, click on the model (created in step 3), then
copy the ID
• ID and KEY: once in the dashboard, click on your username at the top right
of the screen → “My Security Credentials” → expand the “Access Keys” section
→ “Create new access key” → copy the generated ID and KEY

Homework Assignment – Before starting
• GIVE PERMISSIONS TO FILES IN S3: you must grant usage
permissions on the files uploaded to S3. To do so: right-click on the files →
Properties → Permissions → Add more permissions → select “Any authenticated
AWS user” → tick all the permissions
• ENABLE THE MODEL FOR REAL-TIME PREDICTIONS: click on the model → create
endpoint

Homework Assignment – Exercise 1
• Generate real-time predictions: in a new file, store 10 records of the
“census-batch.csv” file. Generate one real-time prediction per record and print the results.
You can make use of the following function:
from boto3.session import Session  # install the boto3 library first

MODEL_ID = 'the id of the model you have created'
ID = 'your id'
KEY = 'your key'

session = Session(aws_access_key_id=ID, aws_secret_access_key=KEY)
client = session.client('machinelearning', region_name='eu-west-1')
prediction_endpoint = "https://realtime.machinelearning.eu-west-1.amazonaws.com"

# Attribute names, in the same order as the columns of the csv file
fields = ["age", "work class", "fnlwgt", "education", "education-num", "marital-status", "occupation", "relationship", "race", "sex",
          "capital-gain", "capital-loss", "hours-per-week", "native-country"]

def real_time_prediction(line):  # line = one line of the csv file
    # Build the record: attribute name -> value (as a string)
    record = dict()
    for index, val in enumerate(line.strip().split(',')):
        record[fields[index]] = val.strip()
    response = client.predict(MLModelId=MODEL_ID, Record=record, PredictEndpoint=prediction_endpoint)
    return response.get('Prediction')

Homework Assignment – Exercise 2
• Generate batch predictions: use the “census-batch.csv” file that you’ve uploaded
and then check the results on S3.
You can make use of the following code:
import time
from boto3.session import Session  # install the boto3 library first

ID = 'your id'
KEY = 'your key'
MODEL_ID = 'the id of the model you have created'
DATASOURCE_ID = 'the id of the data source you have created (the one related to census-batch.csv)'
PREDICTION_ID = "batch_prediction_0001"  # must be unique
PREDICTION_NAME = "bp_0001"
OUTPUT_URI = "s3://your_bucket/dir_batch_0001"

session = Session(aws_access_key_id=ID, aws_secret_access_key=KEY)
client = session.client('machinelearning', region_name='eu-west-1')
client.create_batch_prediction(BatchPredictionId=PREDICTION_ID, BatchPredictionName=PREDICTION_NAME,
                               MLModelId=MODEL_ID,
                               BatchPredictionDataSourceId=DATASOURCE_ID, OutputUri=OUTPUT_URI)

# Poll the status until the batch prediction completes or fails
status = "PENDING"
while status != "COMPLETED" and status != "FAILED":
    print(status)
    time.sleep(3)
    response = client.get_batch_prediction(BatchPredictionId=PREDICTION_ID)
    status = response['Status']
print(status)
print(response)
if status == "COMPLETED":
    print("Your results are in s3!")

Homework Assignment
For this homework, you have one week to deliver the results.
The deadline is 20/12/2016, 23:59.
Please send the code and the instructions to run it in a
.zip file to one of our email addresses.

Homework Assignment
For any kind of problem or information, please contact us!
Contacts:
• Dario Molinari: molinari.1547862@studenti.uniroma1.it
• Daniele De Cillis: decillis.1528489@studenti.uniroma1.it
• Lorenzo Vitali: vitali.1526110@studenti.uniroma1.it
• Lukas Hermann: lukas.hermann@gmx.de
• Milad Kiwan: kiwan.1164659@studenti.uniroma1.it
• Matteo Pallotta: matpallotta@gmail.com