SlideShare a Scribd company logo
1 of 23
Download to read offline
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 1/23
LOG HOM DATAQUT.IO LARN DATA CINC IN YOUR ROWR

Machine learning with Pthon: A
Tutorial
Vik Paruchuri  21 OCT 2015 in tutorial
Machine learning is a eld that uses algorithms to learn from data and
make predictions. Practically, this means that we can feed data into an al-
gorithm, and use it to make predictions about what might happen in the
future. This has a vast range of applications, from self-driving cars to
stock price prediction. Not only is machine learning interesting, it’s also
starting to be widely used, making it an extremely practical skill to learn.
In this tutorial, we’ll guide you through the basic principles of machine
learning, and how to get started with machine learning with Python.
Luckily for us, Python has an amazing ecosystem of libraries that make
machine learning easy to get started with. We’ll be using the excellent
Scikit-learn, Pandas, and Matplotlib libraries in this tutorial.
If you want to dive more deeply into machine learning, and apply algo-
rithms in your browser, check out our courses here.
The dataet
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 2/23
Before we dive into machine learning, we’re going to explore a dataset,
and gure out what might be interesting to predict. The dataset is from
BoardGameGeek, and contains data on 80000 board games. Here’s a single
boardgame on the site. This information was kindly scraped into csv for-
mat by Sean Beck, and can be downloaded here.
The dataset contains several data points about each board game. Here’s a
list of the interesting ones:
name – name of the board game.
plaingtime – the playing time (given by the manufacturer).
minplatime – the minimum playing time (given by the manufacturer).
maxplatime – the maximum playing time (given by the manufacturer).
minage – the minimum recommended age to play.
uer_rated – the number of users who rated the game.
average_rating – the average rating given to the game by users. (0-10)
total_weight – Number of weights given by users. Weight is a subjective
measure that is made up by BoardGameGeek. It’s how “deep” or in-
volved a game is. Here’s a full explanation.
average_weight – the average of all the subjective weights (0-5).
Introduction to Panda
The rst step in our exploration is to read in the data and print some
quick summary statistics. In order to do this, we’ll us the Pandas library.
Pandas provides data structures and data analysis tools that make manip-
ulating data in Python much quicker and more e ective. The most com-
mon data structure is called a dataframe. A dataframe is an extension of a
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 3/23
matrix, so we’ll talk about what a matrix is before coming back to
dataframes.
Our data le looks like this (we removed some columns to make it easier
to look at):
id,tpe,name,earpulihed,minplaer,maxplaer,plaingtime 
12333,oardgame,Twilight truggle,2005,2,2,180 
120677,oardgame,Terra Mtica,2012,2,5,150 
This is in a format called csv, or comma-separated values, which you can
read more about here. Each row of the data is a di erent board game, and
di erent data points about each board game are separated by commas
within the row. The rst row is the header row, and describes what each
data point is. The entire set of one data point, going down, is a column.
We can easily conceptualize a csv le as a matrix:
    1       2           3                   4 
1   id      tpe        name                earpulihed 
2   12333   oardgame   Twilight truggle   2005 
3   120677  oardgame   Terra Mtica       2012 
We removed some of the columns here for display purposes, but you can
still get a sense of how the data looks visually. A matrix is a two-dimen-
sional data structure, with rows and columns. We can access elements in a
matrix by position. The rst row starts with id , the second row starts
with 12333 , and the third row starts with 120677 . The rst column is id ,
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 4/23
the second is tpe , and so on. Matrices in Python can be used via the
NumPy library.
A matrix has some downsides, though. You can’t easily access columns
and rows by name, and each column has to have the same datatype. This
means that we can’t e ectively store our board game data in a matrix –
the name column contains strings, and the earpulihed column contains
integers, which means that we can’t store them both in the same matrix.
A dataframe, on the other hand, can have di erent datatypes in each col-
umn. It has has a lot of built-in niceities for analyzing data as well, such
as looking up columns by name. Pandas gives us access to these features,
and generally makes working with data much simpler.
Reading in our data
We’ll now read in our data from a csv le into a Pandas dataframe, using
the read_cv method.
# Import the panda lirar. 
import panda 
 
# Read in the data. 
game = panda.read_cv("oard_game.cv") 
# Print the name of the column in game. 
print(game.column)
Index(['id', 'tpe', 'name', 'earpulihed', 'minplaer', 'maxplaer',
       'plaingtime', 'minplatime', 'maxplatime', 'minage', 'uer_rated',
       'average_rating', 'ae_average_rating', 'total_owner', 
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 5/23
The code above read the data in, and shows us all of the column names.
The columns that are in the data but aren’t listed above should be fairly
self-explanatory.
print(game.hape)
(81312, 20) 
We can also see the shape of the data, which shows that it has 81312 rows,
or games, and 20 columns, or data points describing each game.
Plotting our target variale
It could be interesting to predict the average score that a human would
give to a new, unreleased, board game. This is stored in the average_rating
column, which is the average of all the user ratings for a board game. Pre-
dicting this column could be useful to board game manufacturers who are
thinking of what kind of game to make next, for instance.
We can access a column is a dataframe with Pandas using game["aver-
age_rating"] . This will extract a single column from the dataframe.
Let’s plot a histogram of this column so we can visualize the distribution
of ratings. We’ll use Matplotlib to generate the visualization. Matplotlib is
       'total_trader', 'total_wanter', 'total_wiher', 'total_comment',
       'total_weight', 'average_weight'], 
      dtpe='oject') 
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 6/23
the main plotting infrastructure in Python, and most other plotting li-
braries, like seaborn and ggplot2 are built on top of Matplotlib.
We import Matplotlib’s plotting functions with import matplotli.pplot a 
plt . We can then draw and show plots.
# Import matplotli 
import matplotli.pplot a plt 
 
# Make a hitogram of all the rating in the average_rating column. 
plt.hit(game["average_rating"]) 
 
# how the plot. 
plt.how()
What we see here is that there are quite a few games with a 0 rating.
There’s a fairly normal distribution of ratings, with some right skew, and
a mean rating around 6 (if you remove the zeros).
xploring the 0 rating
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 7/23
Are there truly so many terrible games that were given a 0 rating? Or is
something else happening? We’ll need to dive into the data bit more to
check on this.
With Pandas, we can select subsets of data using Boolean series (vectors,
or one column/row of data, are known as series in Pandas). Here’s an
example:
game[game["average_rating"] == 0] 
The code above will create a new dataframe, with only the rows in game
where the value of the average_rating column equals 0 .
We can then index the resulting dataframe to get the values we want.
There are two ways to index in Pandas – we can index by the name of the
row or column, or we can index by position. Indexing by names looks like
game["average_rating"] – this will return the whole average_rating column of
game . Indexing by position looks like game.iloc[0] – this will return the
whole rst row of the dataframe. We can also pass in multiple index val-
ues at once – game.iloc[0,0] will return the rst column in the rst row of
game . Read more about Pandas indexing here.
# Print the firt row of all the game with zero core. 
# The .iloc method on dataframe allow u to index  poition. 
print(game[game["average_rating"] == 0].iloc[0]) 
# Print the firt row of all the game with core greater than 0. 
print(game[game["average_rating"] > 0].iloc[0])
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 8/23
id                             318 
tpe                     oardgame 
name                    Loone Leo 
uer_rated                      0 
average_rating                   0 
ae_average_rating             0 
Name: 13048, dtpe: oject 
id                                  12333 
tpe                            oardgame 
name                    Twilight truggle 
uer_rated                         20113 
average_rating                    8.33774 
ae_average_rating              8.22186 
Name: 0, dtpe: oject 
This shows us that the main di erence between a game with a 0 rating
and a game with a rating above 0 is that the 0 rated game has no reviews.
The uer_rated column is 0 . By ltering out any board games with 0 re-
views, we can remove much of the noise.
Removing game without review
# Remove an row without uer review. 
game = game[game["uer_rated"] > 0] 
# Remove an row with miing value. 
game = game.dropna(axi=0)
We just ltered out all of the rows without user reviews. While we were at
it, we also took out any rows with missing values. Many machine learning
algorithms can’t work with missing values, so we need some way to deal
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 9/23
with them. Filtering them out is one common technique, but it means that
we may potentially lose valuable data. Other techniques for dealing with
missing data are listed here.
Clutering game
We’ve seen that there may be distinct sets of games. One set (which we
just removed) was the set of games without reviews. Another set could be
a set of highly rated games. One way to gure out more about these sets
of games is a technique called clustering. Clustering enables you to nd
patterns within your data easily by grouping similar rows (in this case,
games), together.
We’ll use a particular type of clustering called k-means clustering. Scikit-
learn has an excellent implementation of k-means clustering that we can
use. Scikit-learn is the primary machine learning library in Python, and
contains implementations of most common algorithms, including random
forests, support vector machines, and logistic regression. Scikit-learn has
a consistent API for accessing these algorithms.
# Import the kmean clutering model. 
from klearn.cluter import KMean 
 
# Initialize the model with 2 parameter -- numer of cluter and random tate.
kmean_model = KMean(n_cluter=5, random_tate=1) 
# Get onl the numeric column from game. 
good_column = game._get_numeric_data() 
# Fit the model uing the good column. 
kmean_model.fit(good_column) 
# Get the cluter aignment. 
lael = kmean_model.lael_
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 10/23
In order to use the clustering algorithm in Scikit-learn, we’ll rst intialize
it using two parameters – n_cluter de nes how many clusters of games
that we want, and random_tate is a random seed we set in order to repro-
duce our results later. Here’s more information on the implementation.
We then only get the numeric columns from our dataframe. Most machine
learning algorithms can’t directly operate on text data, and can only take
numbers as input. Getting only the numeric columns removes tpe and
name , which aren’t usable by the clustering algorithm.
Finally, we t our kmeans model to our data, and get the cluster assign-
ment labels for each row.
Plotting cluter
Now that we have cluster labels, let’s plot the clusters. One sticking point
is that our data has many columns – it’s outside of the realm of human
understanding and physics to be able to visualize things in more than 3
dimensions. So we’ll have to reduce the dimensionality of our data, with-
out losing too much information. One way to do this is a technique called
principal component analysis, or PCA. PCA takes multiple columns, and
turns them into fewer columns while trying to preserve the unique infor-
mation in each column. To simplify, say we have two columns, total_own-
er , and total_trader . There is some correlation between these two col-
umns, and some overlapping information. PCA will compress this infor-
mation into one column with new numbers while trying not to lose any
information.
We’ll try to turn our board game data into two dimensions, or columns, so
we can easily plot it out.
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 11/23
We rst initialize a PCA model from Scikit-learn. PCA isn’t a machine
learning technique, but Scikit-learn also contains other models that are
useful for performing machine learning. Dimensionality reduction tech-
niques like PCA are widely used when preprocessing data for machine
learning algorithms.
We then turn our data into 2 columns, and plot the columns. When we
plot the columns, we shade them according to their cluster assignment.
# Import the PCA model. 
from klearn.decompoition import PCA 
 
# Create a PCA model. 
pca_2 = PCA(2) 
# Fit the PCA model on the numeric column from earlier. 
plot_column = pca_2.fit_tranform(good_column) 
# Make a catter plot of each game, haded according to cluter aignment.
plt.catter(x=plot_column[:,0], =plot_column[:,1], c=lael) 
# how the plot. 
plt.how()
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 12/23
The plot shows us that there are 5 distinct clusters. We could dive more
into which games are in each cluster to learn more about what factors
cause games to be clustered.
Figuring out what to predict
There are two things we need to determine before we jump into machine
learning – how we’re going to measure error, and what we’re going to
predict. We thought earlier that average_rating might be good to predict on,
and our exploration reinforces this notion.
There are a variety of ways to measure error (many are listed here). Gen-
erally, when we’re doing regression, and predicting continuous variables,
we’ll need a di erent error metric than when we’re performing classi ca-
tion, and predicting discrete values.
For this, we’ll use mean squared error – it’s easy to calculate, and simple
to understand. It shows us how far, on average, our predictions are from
the actual values.
Finding correlation
njoing thi pot? Learn data cience with Dataquet!
> Learn from the comfort of our rower.
> Work with real­life data et.
> uild a portfolio of project.
tart for Free
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 13/23
Now that we want to predict average_rating , let’s see what columns might
be interesting for our prediction. One way is to nd the correlation be-
tween average_rating and each of the other columns. This will show us
which other columns might predict average_rating the best. We can use the
corr method on Pandas dataframes to easily nd correlations. This will
give us the correlation between each column and each other column. Since
the result of this is a dataframe, we can index it and only get the correla-
tions for the average_rating column.
game.corr()["average_rating"]
id                      0.304201 
earpulihed           0.108461 
minplaer             -0.032701 
maxplaer             -0.008335 
plaingtime             0.048994 
minplatime             0.043985 
maxplatime             0.048994 
minage                  0.210049 
uer_rated             0.112564 
average_rating          1.000000 
ae_average_rating    0.231563 
total_owner            0.137478 
total_trader           0.119452 
total_wanter           0.196566 
total_wiher           0.171375 
total_comment          0.123714 
total_weight           0.109691 
average_weight          0.351081 
Name: average_rating, dtpe: float64 
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 14/23
We see that the average_weight and id columns correlate best to rating. id
are presumably assigned when the game is added to the database, so this
likely indicates that games created later score higher in the ratings.
Maybe reviewers were not as nice in the early days of BoardGameGeek, or
older games were of lower quality. average_weight indicates the “depth” or
complexity of a game, so it may be that more complex games are re-
viewed better.
Picking predictor column
Before we get started predicting, let’s only select the columns that are
relevant when training our algorithm. We’ll want to remove certain col-
umns that aren’t numeric.
We’ll also want to remove columns that can only be computed if you al-
ready know the average rating. Including these columns will destroy the
purpose of the classi er, which is to predict the rating without any previ-
ous knowledge. Using columns that can only be computed with knowledge
of the target can lead to over tting, where your model is good in a train-
ing set, but doesn’t generalize well to future data.
The ae_average_rating column appears to be derived from average_rating in
some way, so let’s remove it.
# Get all the column from the dataframe. 
column = game.column.tolit() 
# Filter the column to remove one we don't want. 
column = [c for c in column if c not in ["ae_average_rating", "average_ratin
 
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 15/23
plitting into train and tet et
We want to be able to gure out how accurate an algorithm is using our
error metrics. However, evaluating the algorithm on the same data it has
been trained on will lead to over tting. We want the algorithm to learn
generalized rules to make predictions, not memorize how to make speci c
predictions. An example is learning math. If you memorize that 1+1=2 , and
2+2=4 , you’ll be able to perfectly answer any questions about 1+1 and 2+2 .
You’ll have 0 error. However, the second anyone asks you something out-
side of your training set where you know the answer, like 3+3 , you won’t
be able to solve it. On the other hand, if you’re able to generalize and
learn addition, you’ll make occasional mistakes because you haven’t
memorized the solutions – maybe you’ll get 3453 + 353535 o by one, but
you’ll be able to solve any addition problem thrown at you.
If your error looks surprisingly low when you’re training a machine
learning algorithm, you should always check to see if you’re over tting.
In order to prevent over tting, we’ll train our algorithm on a set consist-
ing of 80% of the data, and test it on another set consisting of 20% of the
data. To do this, we rst randomly samply 80% of the rows to be in the
training set, then put everything else in the testing set.
# tore the variale we'll e predicting on. 
target = "average_rating"
# Import a convenience function to plit the et. 
from klearn.cro_validation import train_tet_plit 
 
# Generate the training et.  et random_tate to e ale to replicate reult.
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 16/23
(45515, 20) 
(11379, 20) 
Above, we exploit the fact that every Pandas row has a unique index to
select any row not in the training set to be in the testing set.
Fitting a linear regreion
Linear regression is a powerful and commonly used machine learning al-
gorithm. It predicts the target variable using linear combinations of the
predictor variables. Let’s say we have a 2 values, 3 , and 4 . A linear com-
bination would be 3 * .5 + 4 * .5 . A linear combination involves multiply-
ing each number by a constant, and adding the results. You can read more
here.
Linear regression only works well when the predictor variables and the
target variable are linearly correlated. As we saw earlier, a few of the pre-
dictors are correlated with the target, so linear regression should work
well for us.
We can use the linear regression implementation in Scikit-learn, just as
we used the k-means implementation earlier.
train = game.ample(frac=0.8, random_tate=1) 
# elect anthing not in the training et and put it in the teting et. 
tet = game.loc[~game.index.iin(train.index)] 
# Print the hape of oth et. 
print(train.hape) 
print(tet.hape)
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 17/23
# Import the linearregreion model. 
from klearn.linear_model import LinearRegreion 
 
# Initialize the model cla. 
model = LinearRegreion() 
# Fit the model to the training data. 
model.fit(train[column], train[target])
When we t the model, we pass in the predictor matrix, which consists of
all the columns from the dataframe that we picked earlier. If you pass a
list to a Pandas dataframe when you index it, it will generate a new
dataframe with all of the columns in the list. We also pass in the target
variable, which we want to make predictions for.
The model learns the equation that maps the predictors to the target with
minimal error.
Predicting error
After we train the model, we can make predictions on new data with it.
This new data has to be in the exact same format as the training data, or
the model won’t make accurate predictions. Our testing set is identical to
the training set (except the rows contain di erent board games). We se-
lect the same subset of columns from the test set, and then make predic-
tions on it.
# Import the cikit-learn function to compute error. 
from klearn.metric import mean_quared_error 
 
# Generate our prediction for the tet et. 
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 18/23
prediction = model.predict(tet[column]) 
 
# Compute error etween our tet prediction and the actual value. 
mean_quared_error(prediction, tet[target])
1.8239281903519875 
Once we have the predictions, we’re able to compute error between the
test set predictions and the actual values. Mean squared error has the for-
mula . Basically, we subtract each predicted value from
the actual value, square the di erences, and add them together. Then we
divide the result by the total number of predicted values. This will give us
the average error for each prediction.
Tring a different model
One of the nice things about Scikit-learn is that it enables us to try more
powerful algorithms very easily. One such algorithm is called random for-
est. The random forest algorithm can nd nonlinearities in data that a
linear regression wouldn’t be able to pick up on. Say, for example, that if
the minage of a game, is less than 5, the rating is low, if it’s 5-10 , it’s high,
and if it is between 10-15 , it is low. A linear regression algorithm wouldn’t
be able to pick up on this because there isn’t a linear relationship between
the predictor and the target. Predictions made with a random forest usu-
ally have less error than predictions made by a linear regression.
# Import the random foret model. 
from klearn.enemle import RandomForetRegreor 
 
# Initialize the model with ome parameter. 
( −
1
n
∑
n
i=1
yi ŷ 
i
)
2
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 19/23
1.4144905030983794 
Further exploration
We’ve managed to go from data in csv format to making predictions. Here
are some ideas for further exploration:
Try a support vector machine.
Try ensembling multiple models to create better predictions.
Try predicting a di erent column, such as average_weight .
Generate features from the text, such as length of the name of the
game, number of words, etc.
Want to learn more aout machine
learning?
At Dataquest, we o er interactive lessons on machine learning and data
science. We believe in learning by doing, and you’ll learn interactively in
your browser by analyzing real data and building projects. Check out our
machine learning lessons here.
model = RandomForetRegreor(n_etimator=100, min_ample_leaf=10, random_tate
# Fit the model to the data. 
model.fit(train[column], train[target]) 
# Make prediction. 
prediction = model.predict(tet[column]) 
# Compute the error. 
mean_quared_error(prediction, tet[target])
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 20/23
Join 40,000+ Data Scientists: Get Data Science Tips, Tricks and Tuto-
rials, Delivered Weekly.
Enter your email address Sign me up!
Vik Paruchuri
Developer and Data Scientist in San Francisco; Founder of Dataque-
st.io (Learn Data Science in your Browser).
Get in touch @vikparuchuri.
hare thi pot
  
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 21/23
16 Comments Dataquest Login

1
Share
⤤ Sort by Oldest
Join the discussion…
• Reply •
Egor Ignatenkov • a year ago
jupyter notebook says: "Unrecognized alias: '--profile=pyspark', it will probably have no
effect."
1 △ ▽
• Reply •
Vik Paruchuri • a year ago
Mod > Egor Ignatenkov
This may be due to a newer version of Jupyter no longer supporting the flag. I think
Srini used IPython 3.
△ ▽
• Reply •
ron • 10 months ago
> Vik Paruchuri
I'm having this same error as well - how can I get it working?
△ ▽
• Reply •
Robert Dupont • 6 months ago
> ron
Not sure if that will help you, after I did all of that i got the same
error, I tried to run spark/bin/pyspark.cmd and it lunched the
notebook with the sc environment.
△ ▽
• Reply •
essay writers online • a year ago
Machine learning is very complicated to learn. It involves different codes that contain
complex terms. All the data's that you have posted was relevant to a common data
structures that are applicable in the common data frame.
△ ▽
• Reply •
Samantha Zeitlin • a year ago
It would help if you could show an example of what the expected output should look like if
the spark context was initialized properly?
1 △ ▽
• Reply •
Pushpam Choudhury • 10 months ago
How can I add some user-defined property such as username, in the SparkConf object? I
can hardcode the details in pyspark.context by using setExecutorEnv, but how for security I
would like to use the details(username etc) captured from the notebook's login page. Can
you please suggest a viable approach?
△ ▽
Recommend

Share ›
Share ›
Share ›
Share ›
Share ›
Share ›
Share ›
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 22/23
• Reply •
raj chandrasekaran • 10 months ago
I found http://tonysyu.github.io/py... useful to add sparkInstallationFolder/python to add
so that there is no need of profile.
pip install pypath_magic
%load_ext pypath_magic
%cd <your spark="" python="" path="">
%pypath –a
△ ▽
• Reply •
Ebisa • 10 months ago
Very helpful explanation. I am using Windows operating system. And Jupyter Notebook is
installed and running fine. When I run the command "Jupyter profile create pyspark" on
windows cmd, I receive an error message that says "jupyter-profile is not recognized as an
internal or external command, operable program or batch file". Any ideas to go around
this?
1 △ ▽
• Reply •
Asher King Abramson • 9 months ago
In getting Spark up and running on my machine ( ), I hit an error that said "Exception:
Java gateway process exited before sending the driver its port number" and I had to change
one of the lines you guys add to .bash_profile.
I had to change
export PYSPARK_SUBMIT_ARGS="--master local[2]"
to
export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell"
to get it running.
Hope this helps someone else who hits the error down the road!
1 △ ▽
• Reply •
Josh • 9 months ago
Mod > Asher King Abramson
Hey Asher - thanks for the note. Because there have been a few changes to the
packages since this post was written, we're planning a mini-update real soon :)
△ ▽
Yuanwen Wang • 8 months ago
Hi Just wondering since Jupyter does not have the concept of "profile" anymore
http://jupyter.readthedocs....
How could we do this step:
Share ›
Share ›
Share ›
Share ›
3/27/2017 Machine learning with Python: A Tutorial
https://www.dataquest.io/blog/machine-learning-python/ 23/23
• Reply •
"Jupyter profile create pyspark"
We get errors like
"jupyter: 'profile' is not a Jupyter command"
△ ▽
• Reply •
Josh • 8 months ago
Mod > Yuanwen Wang
I'd recommend using findspark - we'll update this post soon!
https://github.com/minrk/fi...
△ ▽
• Reply •
Agustin Luques • 7 months ago
> Josh
Please help me, I was following this post.
My problem: http://stackoverflow.com/qu...
△ ▽
Josh • 7 months ago
Mod > Agustin Luques
Share ›
Share ›
Share ›
njoing thi pot? Learn data cience with Dataquet!
> Learn from the comfort of our rower.
> Work with real­life data et.
> uild a portfolio of project.
tart for Free
Dataquet log ©
2017 • All right
reerved.

More Related Content

Similar to Tutorial machine learning with python - a tutorial

PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYAPYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYAMaulik Borsaniya
 
A Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with PythonA Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with PythonTariq Rashid
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docxrohithprabhas1
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
 
Lesson 21. Pattern 13. Data alignment
Lesson 21. Pattern 13. Data alignmentLesson 21. Pattern 13. Data alignment
Lesson 21. Pattern 13. Data alignmentPVS-Studio
 
Lecture 3 intro2data
Lecture 3 intro2dataLecture 3 intro2data
Lecture 3 intro2dataJohnson Ubah
 
Using pandas library for data analysis in python
Using pandas library for data analysis in pythonUsing pandas library for data analysis in python
Using pandas library for data analysis in pythonBruce Jenks
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple stepsRenjith M P
 
IRJET- Machine Learning: Survey, Types and Challenges
IRJET- Machine Learning: Survey, Types and ChallengesIRJET- Machine Learning: Survey, Types and Challenges
IRJET- Machine Learning: Survey, Types and ChallengesIRJET Journal
 
Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...
Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...
Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...Kavika Roy
 
Internship - Python - AI ML.pptx
Internship - Python - AI ML.pptxInternship - Python - AI ML.pptx
Internship - Python - AI ML.pptxHchethankumar
 
Internship - Python - AI ML.pptx
Internship - Python - AI ML.pptxInternship - Python - AI ML.pptx
Internship - Python - AI ML.pptxHchethankumar
 
A gentle introduction to algorithm complexity analysis
A gentle introduction to algorithm complexity analysisA gentle introduction to algorithm complexity analysis
A gentle introduction to algorithm complexity analysisLewis Lin 🦊
 
First ML Experience
First ML ExperienceFirst ML Experience
First ML ExperienceAmrith Kumar
 
MachinaFiesta: A Vision into Machine Learning 🚀
MachinaFiesta: A Vision into Machine Learning 🚀MachinaFiesta: A Vision into Machine Learning 🚀
MachinaFiesta: A Vision into Machine Learning 🚀GDSCNiT
 

Similar to Tutorial machine learning with python - a tutorial (20)

PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYAPYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
 
A Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with PythonA Gentle Introduction to Coding ... with Python
A Gentle Introduction to Coding ... with Python
 
employee turnover prediction document.docx
employee turnover prediction document.docxemployee turnover prediction document.docx
employee turnover prediction document.docx
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
Lesson 21. Pattern 13. Data alignment
Lesson 21. Pattern 13. Data alignmentLesson 21. Pattern 13. Data alignment
Lesson 21. Pattern 13. Data alignment
 
Lecture 3 intro2data
Lecture 3 intro2dataLecture 3 intro2data
Lecture 3 intro2data
 
Using pandas library for data analysis in python
Using pandas library for data analysis in pythonUsing pandas library for data analysis in python
Using pandas library for data analysis in python
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
 
IRJET- Machine Learning: Survey, Types and Challenges
IRJET- Machine Learning: Survey, Types and ChallengesIRJET- Machine Learning: Survey, Types and Challenges
IRJET- Machine Learning: Survey, Types and Challenges
 
Sam python pro_points_slide
Sam python pro_points_slideSam python pro_points_slide
Sam python pro_points_slide
 
Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...
Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...
Unraveling The Meaning From COVID-19 Dataset Using Python – A Tutorial for be...
 
More on Pandas.pptx
More on Pandas.pptxMore on Pandas.pptx
More on Pandas.pptx
 
Internship - Python - AI ML.pptx
Internship - Python - AI ML.pptxInternship - Python - AI ML.pptx
Internship - Python - AI ML.pptx
 
Internship - Python - AI ML.pptx
Internship - Python - AI ML.pptxInternship - Python - AI ML.pptx
Internship - Python - AI ML.pptx
 
Python cheat-sheet
Python cheat-sheetPython cheat-sheet
Python cheat-sheet
 
AQA Computer science easter revision
AQA Computer science easter revisionAQA Computer science easter revision
AQA Computer science easter revision
 
A gentle introduction to algorithm complexity analysis
A gentle introduction to algorithm complexity analysisA gentle introduction to algorithm complexity analysis
A gentle introduction to algorithm complexity analysis
 
First ML Experience
First ML ExperienceFirst ML Experience
First ML Experience
 
MachinaFiesta: A Vision into Machine Learning 🚀
MachinaFiesta: A Vision into Machine Learning 🚀MachinaFiesta: A Vision into Machine Learning 🚀
MachinaFiesta: A Vision into Machine Learning 🚀
 
Lecture 9.pptx
Lecture 9.pptxLecture 9.pptx
Lecture 9.pptx
 

Recently uploaded

All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445ruhi
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...SUHANI PANDEY
 
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...nilamkumrai
 
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...SUHANI PANDEY
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge GraphsEleniIlkou
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Servicegwenoracqe6
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLimonikaupta
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.soniya singh
 
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.soniya singh
 
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersMoving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersDamian Radcliffe
 
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...SUHANI PANDEY
 
Enjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...roncy bisnoi
 
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.soniya singh
 
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Bookingdharasingh5698
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirtrahman018755
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...Neha Pandey
 

Recently uploaded (20)

All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
Shikrapur - Call Girls in Pune Neha 8005736733 | 100% Gennuine High Class Ind...
 
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
( Pune ) VIP Baner Call Girls 🎗️ 9352988975 Sizzling | Escorts | Girls Are Re...
 
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
Ganeshkhind ! Call Girls Pune - 450+ Call Girl Cash Payment 8005736733 Neha T...
 
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
2nd Solid Symposium: Solid Pods vs Personal Knowledge Graphs
 
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl ServiceRussian Call girl in Ajman +971563133746 Ajman Call girl Service
Russian Call girl in Ajman +971563133746 Ajman Call girl Service
 
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRLLucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
Lucknow ❤CALL GIRL 88759*99948 ❤CALL GIRLS IN Lucknow ESCORT SERVICE❤CALL GIRL
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Sarai Rohilla Escort Service Delhi N.C.R.
 
VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
VVVIP Call Girls In Connaught Place ➡️ Delhi ➡️ 9999965857 🚀 No Advance 24HRS...
 
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providersMoving Beyond Twitter/X and Facebook - Social Media for local news providers
Moving Beyond Twitter/X and Facebook - Social Media for local news providers
 
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
Sarola * Female Escorts Service in Pune | 8005736733 Independent Escorts & Da...
 
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
valsad Escorts Service ☎️ 6378878445 ( Sakshi Sinha ) High Profile Call Girls...
 
Enjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort ServiceEnjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort Service
Enjoy Night⚡Call Girls Samalka Delhi >༒8448380779 Escort Service
 
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
Call Girls Sangvi Call Me 7737669865 Budget Friendly No Advance BookingCall G...
 
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
Low Sexy Call Girls In Mohali 9053900678 🥵Have Save And Good Place 🥵
 
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Rani Bagh Escort Service Delhi N.C.R.
 
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Himatnagar 7001035870 Whatsapp Number, 24/07 Booking
 
Trump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts SweatshirtTrump Diapers Over Dems t shirts Sweatshirt
Trump Diapers Over Dems t shirts Sweatshirt
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 

Tutorial machine learning with python - a tutorial

  • 1. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 1/23 LOG HOM DATAQUT.IO LARN DATA CINC IN YOUR ROWR  Machine learning with Pthon: A Tutorial Vik Paruchuri  21 OCT 2015 in tutorial Machine learning is a eld that uses algorithms to learn from data and make predictions. Practically, this means that we can feed data into an al- gorithm, and use it to make predictions about what might happen in the future. This has a vast range of applications, from self-driving cars to stock price prediction. Not only is machine learning interesting, it’s also starting to be widely used, making it an extremely practical skill to learn. In this tutorial, we’ll guide you through the basic principles of machine learning, and how to get started with machine learning with Python. Luckily for us, Python has an amazing ecosystem of libraries that make machine learning easy to get started with. We’ll be using the excellent Scikit-learn, Pandas, and Matplotlib libraries in this tutorial. If you want to dive more deeply into machine learning, and apply algo- rithms in your browser, check out our courses here. The dataet
  • 2. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 2/23 Before we dive into machine learning, we’re going to explore a dataset, and gure out what might be interesting to predict. The dataset is from BoardGameGeek, and contains data on 80000 board games. Here’s a single boardgame on the site. This information was kindly scraped into csv for- mat by Sean Beck, and can be downloaded here. The dataset contains several data points about each board game. Here’s a list of the interesting ones: name – name of the board game. plaingtime – the playing time (given by the manufacturer). minplatime – the minimum playing time (given by the manufacturer). maxplatime – the maximum playing time (given by the manufacturer). minage – the minimum recommended age to play. uer_rated – the number of users who rated the game. average_rating – the average rating given to the game by users. (0-10) total_weight – Number of weights given by users. Weight is a subjective measure that is made up by BoardGameGeek. It’s how “deep” or in- volved a game is. Here’s a full explanation. average_weight – the average of all the subjective weights (0-5). Introduction to Panda The rst step in our exploration is to read in the data and print some quick summary statistics. In order to do this, we’ll us the Pandas library. Pandas provides data structures and data analysis tools that make manip- ulating data in Python much quicker and more e ective. The most com- mon data structure is called a dataframe. A dataframe is an extension of a
  • 3. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 3/23 matrix, so we’ll talk about what a matrix is before coming back to dataframes. Our data le looks like this (we removed some columns to make it easier to look at): id,tpe,name,earpulihed,minplaer,maxplaer,plaingtime  12333,oardgame,Twilight truggle,2005,2,2,180  120677,oardgame,Terra Mtica,2012,2,5,150  This is in a format called csv, or comma-separated values, which you can read more about here. Each row of the data is a di erent board game, and di erent data points about each board game are separated by commas within the row. The rst row is the header row, and describes what each data point is. The entire set of one data point, going down, is a column. We can easily conceptualize a csv le as a matrix:     1       2           3                   4  1   id      tpe        name                earpulihed  2   12333   oardgame   Twilight truggle   2005  3   120677  oardgame   Terra Mtica       2012  We removed some of the columns here for display purposes, but you can still get a sense of how the data looks visually. A matrix is a two-dimen- sional data structure, with rows and columns. We can access elements in a matrix by position. The rst row starts with id , the second row starts with 12333 , and the third row starts with 120677 . The rst column is id ,
  • 4. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 4/23 the second is tpe , and so on. Matrices in Python can be used via the NumPy library. A matrix has some downsides, though. You can’t easily access columns and rows by name, and each column has to have the same datatype. This means that we can’t e ectively store our board game data in a matrix – the name column contains strings, and the earpulihed column contains integers, which means that we can’t store them both in the same matrix. A dataframe, on the other hand, can have di erent datatypes in each col- umn. It has has a lot of built-in niceities for analyzing data as well, such as looking up columns by name. Pandas gives us access to these features, and generally makes working with data much simpler. Reading in our data We’ll now read in our data from a csv le into a Pandas dataframe, using the read_cv method. # Import the panda lirar.  import panda    # Read in the data.  game = panda.read_cv("oard_game.cv")  # Print the name of the column in game.  print(game.column) Index(['id', 'tpe', 'name', 'earpulihed', 'minplaer', 'maxplaer',        'plaingtime', 'minplatime', 'maxplatime', 'minage', 'uer_rated',        'average_rating', 'ae_average_rating', 'total_owner', 
  • 5. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 5/23 The code above read the data in, and shows us all of the column names. The columns that are in the data but aren’t listed above should be fairly self-explanatory. print(game.hape) (81312, 20)  We can also see the shape of the data, which shows that it has 81312 rows, or games, and 20 columns, or data points describing each game. Plotting our target variale It could be interesting to predict the average score that a human would give to a new, unreleased, board game. This is stored in the average_rating column, which is the average of all the user ratings for a board game. Pre- dicting this column could be useful to board game manufacturers who are thinking of what kind of game to make next, for instance. We can access a column is a dataframe with Pandas using game["aver- age_rating"] . This will extract a single column from the dataframe. Let’s plot a histogram of this column so we can visualize the distribution of ratings. We’ll use Matplotlib to generate the visualization. Matplotlib is        'total_trader', 'total_wanter', 'total_wiher', 'total_comment',        'total_weight', 'average_weight'],        dtpe='oject') 
  • 6. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 6/23 the main plotting infrastructure in Python, and most other plotting li- braries, like seaborn and ggplot2 are built on top of Matplotlib. We import Matplotlib’s plotting functions with import matplotli.pplot a  plt . We can then draw and show plots. # Import matplotli  import matplotli.pplot a plt    # Make a hitogram of all the rating in the average_rating column.  plt.hit(game["average_rating"])    # how the plot.  plt.how() What we see here is that there are quite a few games with a 0 rating. There’s a fairly normal distribution of ratings, with some right skew, and a mean rating around 6 (if you remove the zeros). xploring the 0 rating
  • 7. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 7/23 Are there truly so many terrible games that were given a 0 rating? Or is something else happening? We’ll need to dive into the data bit more to check on this. With Pandas, we can select subsets of data using Boolean series (vectors, or one column/row of data, are known as series in Pandas). Here’s an example: game[game["average_rating"] == 0]  The code above will create a new dataframe, with only the rows in game where the value of the average_rating column equals 0 . We can then index the resulting dataframe to get the values we want. There are two ways to index in Pandas – we can index by the name of the row or column, or we can index by position. Indexing by names looks like game["average_rating"] – this will return the whole average_rating column of game . Indexing by position looks like game.iloc[0] – this will return the whole rst row of the dataframe. We can also pass in multiple index val- ues at once – game.iloc[0,0] will return the rst column in the rst row of game . Read more about Pandas indexing here. # Print the firt row of all the game with zero core.  # The .iloc method on dataframe allow u to index  poition.  print(game[game["average_rating"] == 0].iloc[0])  # Print the firt row of all the game with core greater than 0.  print(game[game["average_rating"] > 0].iloc[0])
  • 8. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 8/23 id                             318  tpe                     oardgame  name                    Loone Leo  uer_rated                      0  average_rating                   0  ae_average_rating             0  Name: 13048, dtpe: oject  id                                  12333  tpe                            oardgame  name                    Twilight truggle  uer_rated                         20113  average_rating                    8.33774  ae_average_rating              8.22186  Name: 0, dtpe: oject  This shows us that the main di erence between a game with a 0 rating and a game with a rating above 0 is that the 0 rated game has no reviews. The uer_rated column is 0 . By ltering out any board games with 0 re- views, we can remove much of the noise. Removing game without review # Remove an row without uer review.  game = game[game["uer_rated"] > 0]  # Remove an row with miing value.  game = game.dropna(axi=0) We just ltered out all of the rows without user reviews. While we were at it, we also took out any rows with missing values. Many machine learning algorithms can’t work with missing values, so we need some way to deal
  • 9. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 9/23 with them. Filtering them out is one common technique, but it means that we may potentially lose valuable data. Other techniques for dealing with missing data are listed here. Clutering game We’ve seen that there may be distinct sets of games. One set (which we just removed) was the set of games without reviews. Another set could be a set of highly rated games. One way to gure out more about these sets of games is a technique called clustering. Clustering enables you to nd patterns within your data easily by grouping similar rows (in this case, games), together. We’ll use a particular type of clustering called k-means clustering. Scikit- learn has an excellent implementation of k-means clustering that we can use. Scikit-learn is the primary machine learning library in Python, and contains implementations of most common algorithms, including random forests, support vector machines, and logistic regression. Scikit-learn has a consistent API for accessing these algorithms. # Import the kmean clutering model.  from klearn.cluter import KMean    # Initialize the model with 2 parameter -- numer of cluter and random tate. kmean_model = KMean(n_cluter=5, random_tate=1)  # Get onl the numeric column from game.  good_column = game._get_numeric_data()  # Fit the model uing the good column.  kmean_model.fit(good_column)  # Get the cluter aignment.  lael = kmean_model.lael_
  • 10. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 10/23 In order to use the clustering algorithm in Scikit-learn, we’ll rst intialize it using two parameters – n_cluter de nes how many clusters of games that we want, and random_tate is a random seed we set in order to repro- duce our results later. Here’s more information on the implementation. We then only get the numeric columns from our dataframe. Most machine learning algorithms can’t directly operate on text data, and can only take numbers as input. Getting only the numeric columns removes tpe and name , which aren’t usable by the clustering algorithm. Finally, we t our kmeans model to our data, and get the cluster assign- ment labels for each row. Plotting cluter Now that we have cluster labels, let’s plot the clusters. One sticking point is that our data has many columns – it’s outside of the realm of human understanding and physics to be able to visualize things in more than 3 dimensions. So we’ll have to reduce the dimensionality of our data, with- out losing too much information. One way to do this is a technique called principal component analysis, or PCA. PCA takes multiple columns, and turns them into fewer columns while trying to preserve the unique infor- mation in each column. To simplify, say we have two columns, total_own- er , and total_trader . There is some correlation between these two col- umns, and some overlapping information. PCA will compress this infor- mation into one column with new numbers while trying not to lose any information. We’ll try to turn our board game data into two dimensions, or columns, so we can easily plot it out.
  • 11. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 11/23 We rst initialize a PCA model from Scikit-learn. PCA isn’t a machine learning technique, but Scikit-learn also contains other models that are useful for performing machine learning. Dimensionality reduction tech- niques like PCA are widely used when preprocessing data for machine learning algorithms. We then turn our data into 2 columns, and plot the columns. When we plot the columns, we shade them according to their cluster assignment. # Import the PCA model.  from klearn.decompoition import PCA    # Create a PCA model.  pca_2 = PCA(2)  # Fit the PCA model on the numeric column from earlier.  plot_column = pca_2.fit_tranform(good_column)  # Make a catter plot of each game, haded according to cluter aignment. plt.catter(x=plot_column[:,0], =plot_column[:,1], c=lael)  # how the plot.  plt.how()
  • 12. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 12/23 The plot shows us that there are 5 distinct clusters. We could dive more into which games are in each cluster to learn more about what factors cause games to be clustered. Figuring out what to predict There are two things we need to determine before we jump into machine learning – how we’re going to measure error, and what we’re going to predict. We thought earlier that average_rating might be good to predict on, and our exploration reinforces this notion. There are a variety of ways to measure error (many are listed here). Gen- erally, when we’re doing regression, and predicting continuous variables, we’ll need a di erent error metric than when we’re performing classi ca- tion, and predicting discrete values. For this, we’ll use mean squared error – it’s easy to calculate, and simple to understand. It shows us how far, on average, our predictions are from the actual values. Finding correlation njoing thi pot? Learn data cience with Dataquet! > Learn from the comfort of our rower. > Work with real­life data et. > uild a portfolio of project. tart for Free
  • 13. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 13/23 Now that we want to predict average_rating , let’s see what columns might be interesting for our prediction. One way is to nd the correlation be- tween average_rating and each of the other columns. This will show us which other columns might predict average_rating the best. We can use the corr method on Pandas dataframes to easily nd correlations. This will give us the correlation between each column and each other column. Since the result of this is a dataframe, we can index it and only get the correla- tions for the average_rating column. game.corr()["average_rating"] id                      0.304201  earpulihed           0.108461  minplaer             -0.032701  maxplaer             -0.008335  plaingtime             0.048994  minplatime             0.043985  maxplatime             0.048994  minage                  0.210049  uer_rated             0.112564  average_rating          1.000000  ae_average_rating    0.231563  total_owner            0.137478  total_trader           0.119452  total_wanter           0.196566  total_wiher           0.171375  total_comment          0.123714  total_weight           0.109691  average_weight          0.351081  Name: average_rating, dtpe: float64 
  • 14. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 14/23 We see that the average_weight and id columns correlate best to rating. id are presumably assigned when the game is added to the database, so this likely indicates that games created later score higher in the ratings. Maybe reviewers were not as nice in the early days of BoardGameGeek, or older games were of lower quality. average_weight indicates the “depth” or complexity of a game, so it may be that more complex games are re- viewed better. Picking predictor column Before we get started predicting, let’s only select the columns that are relevant when training our algorithm. We’ll want to remove certain col- umns that aren’t numeric. We’ll also want to remove columns that can only be computed if you al- ready know the average rating. Including these columns will destroy the purpose of the classi er, which is to predict the rating without any previ- ous knowledge. Using columns that can only be computed with knowledge of the target can lead to over tting, where your model is good in a train- ing set, but doesn’t generalize well to future data. The ae_average_rating column appears to be derived from average_rating in some way, so let’s remove it. # Get all the column from the dataframe.  column = game.column.tolit()  # Filter the column to remove one we don't want.  column = [c for c in column if c not in ["ae_average_rating", "average_ratin  
  • 15. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 15/23 plitting into train and tet et We want to be able to gure out how accurate an algorithm is using our error metrics. However, evaluating the algorithm on the same data it has been trained on will lead to over tting. We want the algorithm to learn generalized rules to make predictions, not memorize how to make speci c predictions. An example is learning math. If you memorize that 1+1=2 , and 2+2=4 , you’ll be able to perfectly answer any questions about 1+1 and 2+2 . You’ll have 0 error. However, the second anyone asks you something out- side of your training set where you know the answer, like 3+3 , you won’t be able to solve it. On the other hand, if you’re able to generalize and learn addition, you’ll make occasional mistakes because you haven’t memorized the solutions – maybe you’ll get 3453 + 353535 o by one, but you’ll be able to solve any addition problem thrown at you. If your error looks surprisingly low when you’re training a machine learning algorithm, you should always check to see if you’re over tting. In order to prevent over tting, we’ll train our algorithm on a set consist- ing of 80% of the data, and test it on another set consisting of 20% of the data. To do this, we rst randomly samply 80% of the rows to be in the training set, then put everything else in the testing set. # tore the variale we'll e predicting on.  target = "average_rating" # Import a convenience function to plit the et.  from klearn.cro_validation import train_tet_plit    # Generate the training et.  et random_tate to e ale to replicate reult.
  • 16. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 16/23 (45515, 20)  (11379, 20)  Above, we exploit the fact that every Pandas row has a unique index to select any row not in the training set to be in the testing set. Fitting a linear regreion Linear regression is a powerful and commonly used machine learning al- gorithm. It predicts the target variable using linear combinations of the predictor variables. Let’s say we have a 2 values, 3 , and 4 . A linear com- bination would be 3 * .5 + 4 * .5 . A linear combination involves multiply- ing each number by a constant, and adding the results. You can read more here. Linear regression only works well when the predictor variables and the target variable are linearly correlated. As we saw earlier, a few of the pre- dictors are correlated with the target, so linear regression should work well for us. We can use the linear regression implementation in Scikit-learn, just as we used the k-means implementation earlier. train = game.ample(frac=0.8, random_tate=1)  # elect anthing not in the training et and put it in the teting et.  tet = game.loc[~game.index.iin(train.index)]  # Print the hape of oth et.  print(train.hape)  print(tet.hape)
  • 17. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 17/23 # Import the linearregreion model.  from klearn.linear_model import LinearRegreion    # Initialize the model cla.  model = LinearRegreion()  # Fit the model to the training data.  model.fit(train[column], train[target]) When we t the model, we pass in the predictor matrix, which consists of all the columns from the dataframe that we picked earlier. If you pass a list to a Pandas dataframe when you index it, it will generate a new dataframe with all of the columns in the list. We also pass in the target variable, which we want to make predictions for. The model learns the equation that maps the predictors to the target with minimal error. Predicting error After we train the model, we can make predictions on new data with it. This new data has to be in the exact same format as the training data, or the model won’t make accurate predictions. Our testing set is identical to the training set (except the rows contain di erent board games). We se- lect the same subset of columns from the test set, and then make predic- tions on it. # Import the cikit-learn function to compute error.  from klearn.metric import mean_quared_error    # Generate our prediction for the tet et. 
  • 18. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 18/23 prediction = model.predict(tet[column])    # Compute error etween our tet prediction and the actual value.  mean_quared_error(prediction, tet[target]) 1.8239281903519875  Once we have the predictions, we’re able to compute error between the test set predictions and the actual values. Mean squared error has the for- mula . Basically, we subtract each predicted value from the actual value, square the di erences, and add them together. Then we divide the result by the total number of predicted values. This will give us the average error for each prediction. Tring a different model One of the nice things about Scikit-learn is that it enables us to try more powerful algorithms very easily. One such algorithm is called random for- est. The random forest algorithm can nd nonlinearities in data that a linear regression wouldn’t be able to pick up on. Say, for example, that if the minage of a game, is less than 5, the rating is low, if it’s 5-10 , it’s high, and if it is between 10-15 , it is low. A linear regression algorithm wouldn’t be able to pick up on this because there isn’t a linear relationship between the predictor and the target. Predictions made with a random forest usu- ally have less error than predictions made by a linear regression. # Import the random foret model.  from klearn.enemle import RandomForetRegreor    # Initialize the model with ome parameter.  ( − 1 n ∑ n i=1 yi ŷ  i ) 2
  • 19. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 19/23 1.4144905030983794  Further exploration We’ve managed to go from data in csv format to making predictions. Here are some ideas for further exploration: Try a support vector machine. Try ensembling multiple models to create better predictions. Try predicting a di erent column, such as average_weight . Generate features from the text, such as length of the name of the game, number of words, etc. Want to learn more aout machine learning? At Dataquest, we o er interactive lessons on machine learning and data science. We believe in learning by doing, and you’ll learn interactively in your browser by analyzing real data and building projects. Check out our machine learning lessons here. model = RandomForetRegreor(n_etimator=100, min_ample_leaf=10, random_tate # Fit the model to the data.  model.fit(train[column], train[target])  # Make prediction.  prediction = model.predict(tet[column])  # Compute the error.  mean_quared_error(prediction, tet[target])
  • 20. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 20/23 Join 40,000+ Data Scientists: Get Data Science Tips, Tricks and Tuto- rials, Delivered Weekly. Enter your email address Sign me up! Vik Paruchuri Developer and Data Scientist in San Francisco; Founder of Dataque- st.io (Learn Data Science in your Browser). Get in touch @vikparuchuri. hare thi pot   
  • 21. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 21/23 16 Comments Dataquest Login  1 Share ⤤ Sort by Oldest Join the discussion… • Reply • Egor Ignatenkov • a year ago jupyter notebook says: "Unrecognized alias: '--profile=pyspark', it will probably have no effect." 1 △ ▽ • Reply • Vik Paruchuri • a year ago Mod > Egor Ignatenkov This may be due to a newer version of Jupyter no longer supporting the flag. I think Srini used IPython 3. △ ▽ • Reply • ron • 10 months ago > Vik Paruchuri I'm having this same error as well - how can I get it working? △ ▽ • Reply • Robert Dupont • 6 months ago > ron Not sure if that will help you, after I did all of that i got the same error, I tried to run spark/bin/pyspark.cmd and it lunched the notebook with the sc environment. △ ▽ • Reply • essay writers online • a year ago Machine learning is very complicated to learn. It involves different codes that contain complex terms. All the data's that you have posted was relevant to a common data structures that are applicable in the common data frame. △ ▽ • Reply • Samantha Zeitlin • a year ago It would help if you could show an example of what the expected output should look like if the spark context was initialized properly? 1 △ ▽ • Reply • Pushpam Choudhury • 10 months ago How can I add some user-defined property such as username, in the SparkConf object? I can hardcode the details in pyspark.context by using setExecutorEnv, but how for security I would like to use the details(username etc) captured from the notebook's login page. Can you please suggest a viable approach? △ ▽ Recommend  Share › Share › Share › Share › Share › Share › Share ›
  • 22. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 22/23 • Reply • raj chandrasekaran • 10 months ago I found http://tonysyu.github.io/py... useful to add sparkInstallationFolder/python to add so that there is no need of profile. pip install pypath_magic %load_ext pypath_magic %cd <your spark="" python="" path=""> %pypath –a △ ▽ • Reply • Ebisa • 10 months ago Very helpful explanation. I am using Windows operating system. And Jupyter Notebook is installed and running fine. When I run the command "Jupyter profile create pyspark" on windows cmd, I receive an error message that says "jupyter-profile is not recognized as an internal or external command, operable program or batch file". Any ideas to go around this? 1 △ ▽ • Reply • Asher King Abramson • 9 months ago In getting Spark up and running on my machine ( ), I hit an error that said "Exception: Java gateway process exited before sending the driver its port number" and I had to change one of the lines you guys add to .bash_profile. I had to change export PYSPARK_SUBMIT_ARGS="--master local[2]" to export PYSPARK_SUBMIT_ARGS="--master local[2] pyspark-shell" to get it running. Hope this helps someone else who hits the error down the road! 1 △ ▽ • Reply • Josh • 9 months ago Mod > Asher King Abramson Hey Asher - thanks for the note. Because there have been a few changes to the packages since this post was written, we're planning a mini-update real soon :) △ ▽ Yuanwen Wang • 8 months ago Hi Just wondering since Jupyter does not have the concept of "profile" anymore http://jupyter.readthedocs.... How could we do this step: Share › Share › Share › Share ›
  • 23. 3/27/2017 Machine learning with Python: A Tutorial https://www.dataquest.io/blog/machine-learning-python/ 23/23 • Reply • "Jupyter profile create pyspark" We get errors like "jupyter: 'profile' is not a Jupyter command" △ ▽ • Reply • Josh • 8 months ago Mod > Yuanwen Wang I'd recommend using findspark - we'll update this post soon! https://github.com/minrk/fi... △ ▽ • Reply • Agustin Luques • 7 months ago > Josh Please help me, I was following this post. My problem: http://stackoverflow.com/qu... △ ▽ Josh • 7 months ago Mod > Agustin Luques Share › Share › Share › njoing thi pot? Learn data cience with Dataquet! > Learn from the comfort of our rower. > Work with real­life data et. > uild a portfolio of project. tart for Free Dataquet log © 2017 • All right reerved.