Introduction to
Python for Scientific Computing
João Machado • Ricardo Cruz
Introduction

What is the result of this operation?

\[
\begin{pmatrix} a & d & g \\ b & e & h \\ c & f & i \end{pmatrix}
\times
\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}
=
\begin{pmatrix} ? & ? & ? \\ ? & ? & ? \\ ? & ? & ? \end{pmatrix}
\]
Multiplying by this permutation matrix reverses the order of the columns:

\[
\begin{pmatrix} a & d & g \\ b & e & h \\ c & f & i \end{pmatrix}
\times
\begin{pmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{pmatrix}
=
\begin{pmatrix} g & d & a \\ h & e & b \\ i & f & c \end{pmatrix}
\]
    from numpy import *
    cout = print

    A = random.random((3, 3));
    B = fliplr(eye(3));
    C = dot(A, B);
    cout(C);

What programming language is this?
It’s Python!
    #include <armadillo>
    using namespace arma;
    using namespace std;

    mat A(3,3), B(3,3);
    A.randu();
    B = fliplr(B.eye());
    mat C = A * B;
    cout << C << endl;

What about this programming language?
Why use Python?

- More important than the programming language is the ecosystem – and Python has a great scientific community
- Python has good interoperability with other systems
- The entire stack can be developed in Python: machine learning, Flask, etc.
- Computations do not actually run in Python; the slow stuff is implemented in Fortran and C
[Figure: ecosystem comparison]
- Python: numpy, matplotlib, sklearn, pandas
- R: ggplot2, rpart, foreign, dplyr, survival, ggmaps, zoo
- MATLAB: Statistics Toolbox, Biostatistics Toolbox, Neural Network Toolbox
Why Python?

- Good data mining ecosystem.
- Not as centralized/monopolistic as MATLAB's
- Not as decentralized and messy as R :P
[Figure: KDnuggets poll of the most popular languages for machine learning and data science]
Source: http://www.kdnuggets.com/2017/01/most-popular-language-machine-learning-data-science.html
Some notes on Numpy
Numpy Notes

Let A and B be matrices:

    Python/Numpy    MATLAB    R
    A.dot(B)        A * B     A %*% B
    A * B           A .* B    A * B

Operations are elementwise by default (like in R).
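A quick sketch of that difference, on a concrete pair of matrices:

    import numpy as np
    A = np.array([[1, 2], [3, 4]])
    B = np.array([[0, 1], [1, 0]])
    print(A * B)     # elementwise product
    print(A.dot(B))  # matrix product (equivalently A @ B in Python 3.5+)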
    Python/Numpy                 MATLAB           R
    A.shape                      size(A)          length, nrow, ncol
    A[0:4,:] or A[0:4] or A[:4]  A(1:4,:)         A[1:4,]
    A[0:10:2]                    A(1:2:10,:)      A[seq(1, 9, 2),]
    A[-4:]                       A(end-3:end,:)   A[(nrow(A)-3):nrow(A),]
    A.T                          A.'              t(A)

Numpy in general allows for more succinct writing.

Furthermore:
- Indexing starts at zero.
- Intervals are half-open: [i, j[
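A quick sketch of those indexing rules on a toy matrix:

    import numpy as np
    A = np.arange(12).reshape(6, 2)  # a 6x2 matrix
    print(A[:4])      # first four rows
    print(A[0:10:2])  # every other row (a stop past the end is fine)
    print(A[-4:])     # last four rows
    print(A.T.shape)  # (2, 6)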
This is further aided by the fact that Numpy supports arithmetic broadcasting (unlike MATLAB or R). That is, you can do an elementwise multiplication between shapes (6,3) and (6,1): Numpy automatically assumes you want to multiply by column. In MATLAB, you would have to use bsxfun(@times,r,A) or first use repmat().
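A minimal sketch of that broadcast:

    import numpy as np
    A = np.random.random((6, 3))  # matrix
    r = np.random.random((6, 1))  # column vector
    B = A * r                     # r is broadcast across the 3 columns
    print(B.shape)                # (6, 3)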
Something like the following is valid in Numpy...

    import skimage.data
    img1 = skimage.data.astronaut()
    img2 = skimage.data.moon()
    print(img1.shape)  # (512, 512, 3)
    print(img2.shape)  # (512, 512)

    import matplotlib.pyplot as plt
    plt.subplot(1, 2, 1)
    plt.imshow(img1)
    plt.subplot(1, 2, 2)
    plt.imshow(img2, cmap='gray')
    plt.show()
Arithmetic mean

    import numpy as np
    img2 = img2[:, :, np.newaxis]  # (512, 512, 1), broadcasts against (512, 512, 3)
    img1 = img1.astype(np.uint32)  # cast up so the sum does not overflow uint8
    img2 = img2.astype(np.uint32)
    img3 = (img1 + img2)//2
    img3 = img3.astype(np.uint8)
    plt.imshow(img3)
    plt.show()
Geometric mean

    img2 = img2[:, :, np.newaxis]
    img1 = img1.astype(np.uint32)
    img2 = img2.astype(np.uint32)
    img3 = np.sqrt(img1 * img2)
    img3 = img3.astype(np.uint8)
    plt.imshow(img3)
    plt.show()
Pandas and Data Visualization –
Python for Scientific Computing
João Machado • Ricardo Cruz
Pandas

What is Pandas?

- A package for data manipulation and analysis, based on the concept of the data frame in the R language
- Optimized for performance, with critical code paths written in C
- Originally developed by Wes McKinney, while working for AQR Capital (a quantitative finance firm)
Given the previous point, it makes sense to demonstrate some of the functionalities of Pandas with a dataset comprised of financial stocks :)
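A minimal sketch of the kind of manipulation this involves, assuming a hypothetical stocks.csv with Date, Ticker and Close columns:

    import pandas as pd

    df = pd.read_csv('stocks.csv', parse_dates=['Date'])
    prices = df.pivot(index='Date', columns='Ticker', values='Close')
    returns = prices.pct_change()  # daily returns per stock
    print(prices.describe())       # summary statistics
    print(returns.corr())          # correlation between stocks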
Data Mining –
Python for Scientific Computing
João Machado • Ricardo Cruz
Models
Let us produce fake data...

\[ y(x) = 2x + 10 + \varepsilon_1 + \varepsilon_2 \]
\[ \varepsilon_1 \sim N(0, 2) \]
\[ \varepsilon_2 \sim \begin{cases} |N(0, 25)| & \text{with } p = 0.1, \\ 0 & \text{otherwise.} \end{cases} \]
Or, in a form closer to the code below:

\[ y(x) = 2x + 10 + \varepsilon_1 + b\,\varepsilon_2 \]
\[ \varepsilon_1 \sim N(0, 2), \quad b \sim B(2, 0.1), \quad \varepsilon_2 \sim |N(0, 25)| \]
Translation to numpy:

    import numpy as np
    N = 50
    x = np.linspace(0, 25, N)
    y = 2*x + 10
    y += np.random.randn(N)*2
    y += np.random.binomial(2, 0.10, N) * np.abs(np.random.randn(N)*25)
Plotting the data:

    import matplotlib.pyplot as plt
    plt.plot(x, y)
    plt.title('Data')
    plt.show()
What model could we create to explain this data?
Linear Regression

Model: \( \hat{y} = \beta_0 + \beta_1 x \)
Minimize: \( \sum_i (y_i - \hat{y}_i)^2 \)
    from sklearn.linear_model import LinearRegression
    m = LinearRegression()
    m.fit(x[:, np.newaxis], y)
    yp = m.predict(x[:, np.newaxis])

    plt.plot(x, y)
    plt.plot(x, yp)
    plt.title('Linear regression')
    plt.text(0, 70, 'm=%.1f b=%.1f' % (m.coef_[0], m.intercept_))
    plt.show()
True model: \( y(x) = 2x + 10 + \varepsilon_1 + b\,\varepsilon_2 \)
Fitted model: \( \hat{y}(x) = 2x + 18 \)

What if I want to explain only the trend? How can I avoid the impact of these spikes?
What would a statistician do?

    res = yp - y
    plt.boxplot(res)
    plt.show()
Keep only the points whose residuals fall between the quartiles, and refit:

    q1 = np.percentile(res, 25)
    q3 = np.percentile(res, 75)
    t = np.logical_and(res > q1, res < q3)
    x2 = x[t]
    y2 = y[t]

    m = LinearRegression()
    m.fit(x2[:, np.newaxis], y2)
    yp = m.predict(x[:, np.newaxis])
Approach #2: What would a statistician with some computer science knowledge do?
Model: \( \hat{y} = \beta_0 + \beta_1 x \)
Minimize: \( \sum_i |y_i - \hat{y}_i| \)
This is quantile regression (here, the median):

    from statsmodels.regression.quantile_regression import QuantReg

    m = QuantReg(y, np.c_[np.ones(N), x])
    m = m.fit(0.5)
    yp = m.predict()
Approach #3: What would a crazy computer scientist do?
Fit the model on many small random subsamples and see where the fits agree:

    plt.plot(x, y)
    m = LinearRegression()  # refit the sklearn model on each subsample
    for it in range(10):
        t = np.random.choice(N, N//10, replace=False)
        x2 = x[t]
        y2 = y[t]
        m.fit(x2[:, np.newaxis], y2)
        yp = m.predict(x[:, np.newaxis])
        plt.plot(x, yp, color='black', alpha=0.4)
    plt.show()
Sklearn already comes with this crazy model too:

    from sklearn.linear_model import RANSACRegressor
    m = RANSACRegressor()
    m.fit(x[:, np.newaxis], y)

    plt.plot(x, y)
    plt.plot(x, m.predict(x[:, np.newaxis]))
    plt.title('RANSAC')
    plt.show()
What kind of things can we use data mining / machine learning for?
Data Mining Problems

Regression: predict a continuous variable
e.g. House Price = 100 + 20 × Land Size
In scikit-learn: LinearRegression, GradientBoostingRegressor, etc. (:: RegressorMixin)
    .fit(X, y)
    .predict(X) -> yp
Classification: predict a discrete variable
e.g. House Price = Expensive if in the city center, Cheap if outside the city
In scikit-learn: LogisticRegression, GradientBoostingClassifier, etc. (:: ClassifierMixin)
    .fit(X, y)
    .predict(X) -> yp
Clustering: do not predict, aggregate
In scikit-learn: KMeans, LatentDirichletAllocation, etc. (:: ClusterMixin)
    .fit(X)
    .transform(X) -> X'
    .fit_transform(X) -> X'
Reinforcement learning: predict the best move
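A minimal sketch of the shared scikit-learn API across these mixins, on toy data (the shapes and threshold here are just for illustration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.cluster import KMeans

    X = np.random.random((100, 2))
    y = (X[:, 0] > 0.5).astype(int)

    clf = LogisticRegression()
    clf.fit(X, y)            # ClassifierMixin: .fit(X, y)
    yp = clf.predict(X)      # .predict(X) -> yp

    km = KMeans(3)
    km.fit(X)                # ClusterMixin: .fit(X)
    X2 = km.transform(X)     # .transform(X) -> X' (distances to the 3 centroids)
    print(yp[:5], X2.shape)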
Use Cases
João Machado • Ricardo Cruz
Signal processing

Packages: numpy, pandas, scipy, matplotlib
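A minimal sketch of the kind of pipeline these packages support, on a made-up noisy signal:

    import numpy as np
    from scipy import signal
    import matplotlib.pyplot as plt

    t = np.linspace(0, 1, 500)
    x = np.sin(2*np.pi*5*t) + 0.5*np.random.randn(len(t))  # noisy 5 Hz sine
    b, a = signal.butter(4, 0.05)  # low-pass Butterworth filter
    xf = signal.filtfilt(b, a, x)  # zero-phase filtering
    plt.plot(t, x, alpha=0.4)
    plt.plot(t, xf)
    plt.show()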
Text Mining w/ Twitter

Packages: tweepy, numpy, matplotlib, scikit-learn
Text Mining

    import tweepy
    auth = tweepy.OAuthHandler(api_key, api_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth)

    timeline = api.user_timeline('realDonaldTrump', count=100)
    texts = [tweet.text for tweet in timeline]
    from sklearn.feature_extraction.text import CountVectorizer
    m = CountVectorizer(stop_words='english', min_df=5, max_df=16)
    X = m.fit_transform(texts)
    words = sorted(m.vocabulary_, key=m.vocabulary_.get)

    import pandas as pd
    print(pd.DataFrame(X.todense(), columns=words).iloc[:5, :5].to_latex())

       america  big  comey  day  dems
    0        0    0      0    0     0
    1        0    1      0    0     0
    2        1    0      0    0     0
    import matplotlib.pyplot as plt
    counts = np.asarray(X.sum(0))[0]
    plt.barh(range(len(counts)), counts)
    plt.xticks(range(0, 14, 2))
    plt.yticks(range(len(counts)), words)
    plt.show()
    from sklearn.decomposition import LatentDirichletAllocation
    lda = LatentDirichletAllocation(2, learning_method='online')
    lda.fit(X)
    topics = lda.components_

Each topic is a linear combination of the words:

\[ \text{newword}_1 = \beta_{11}\,\text{word}_1 + \beta_{12}\,\text{word}_2 + \dots \]
\[ \text{newword}_2 = \beta_{21}\,\text{word}_1 + \beta_{22}\,\text{word}_2 + \dots \]
    topics = topics / topics.max(1)[:, np.newaxis]
    topics += np.random.randn(*topics.shape)*0.02  # jitter so labels don't overlap
    for i, word in enumerate(words):
        plt.text(topics[0, i], topics[1, i], word, ha='center')
    plt.show()
The same pipeline can be rerun on another account:

    timeline = api.user_timeline('marcelorebelo_', count=100)
Traditional Learning vs Deep Learning

Traditionally, hand-crafted features would be extracted from the dataset and learning would happen on top of those features. Deep learning learns from the raw data.

Packages: scikit-image, numpy, keras
Traditional Learning

Cats vs Dogs – Kaggle Competition – https://www.kaggle.com/c/dogs-vs-cats
25,000 images of cats and dogs
Feature #1: Extract histogram of colors

    import os
    import numpy as np
    from skimage.io import imread
    from skimage.color import rgb2gray  # rgb2gray lives in skimage.color

    for filename in os.listdir('train'):
        im = imread(os.path.join('train', filename))
        im = rgb2gray(im)
        f1 = np.histogram(im.flatten(), 10)[0]
        f1 = (f1/f1.sum()).cumsum()
Feature #2: Histogram of Oriented Gradients

    from skimage.transform import resize
    from skimage.feature import hog

    im2 = resize(im, (32, 32), mode='reflect')
    im2 = np.sqrt(im2)  # gamma compression before HOG
    f2 = hog(im2, block_norm='L2-Hys')
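The slides jump from the per-image features straight to the fitted models, so some glue is implied; a minimal sketch of assembling X and y (reusing the imports above, and assuming the Kaggle filenames 'cat.0.jpg', 'dog.0.jpg', ...):

    features, labels = [], []
    for filename in os.listdir('train'):
        im = rgb2gray(imread(os.path.join('train', filename)))
        f1 = np.histogram(im.flatten(), 10)[0]
        f1 = (f1 / f1.sum()).cumsum()
        im2 = np.sqrt(resize(im, (32, 32), mode='reflect'))
        f2 = hog(im2, block_norm='L2-Hys')
        features.append(np.concatenate([f1, f2]))
        labels.append(1 if filename.startswith('dog') else 0)
    X = np.array(features)
    y = np.array(labels)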
Fit a small decision tree on those features:

    from sklearn.tree import DecisionTreeClassifier, export_graphviz
    m = DecisionTreeClassifier(max_depth=3)
    m.fit(X, y)
    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier
    print(cross_val_score(RandomForestClassifier(100), X, y))

    [ 0.69642429  0.70086393  0.69851176]
Deep Learning
Linear regression:
\[ \hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots \]

Multilayer perceptron / neural network:
\[ \hat{y} = \beta_{00}\,\sigma(\beta_{10} + \beta_{11} x_1 + \beta_{12} x_2 + \dots) + \beta_{01}\,\sigma(\beta_{20} + \beta_{21} x_1 + \beta_{22} x_2 + \dots) + \dots \]
In Keras:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
    from keras.optimizers import SGD

    model = Sequential()
    model.add(Conv2D(8, 3, 1, activation='relu', input_shape=(32, 32, 1)))
    model.add(MaxPooling2D())
    model.add(Conv2D(16, 3, 1, activation='relu'))
    model.add(MaxPooling2D())
    model.add(Flatten())
    model.add(Dense(16, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    sgd = SGD()
    model.compile(sgd, 'binary_crossentropy')

    model.fit(X[tr], y[tr], validation_data=(X[ts], y[ts]),
              epochs=10, batch_size=100)
Cross-validating the network:

    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import accuracy_score

    for tr, ts in StratifiedKFold().split(X, y):
        model = ...  # build, compile and fit the model as above
        yp = (model.predict(X[ts])[:, -1] > 0.5).astype(int)
        print(accuracy_score(y[ts], yp))

    [0.57, 0.57, 0.63]
Overview of the Python deep learning landscape:

- Backends: Theano, TensorFlow, PyTorch
- High-level wrappers: Keras, Lasagne
Deep learning architectures:

- Fully connected perceptrons
- Convolutional neural networks
- Recurrent neural networks
- Neural Turing Machines
- Autoencoders
Conclusions –
Python for Scientific Computing
João Machado • Ricardo Cruz
Conclusions

Packages to know:

- Numpy: basic linear algebra
- Scipy: extensions to numpy (sparse matrices, pdfs, hypothesis tests)
- Statsmodels: several statistics models, incl. time series
- Pandas: extension to numpy for dataframe support
- Matplotlib, seaborn: drawing graphics
- scikit-learn: complete machine learning toolkit
- xgboost: famous gradient boosting model
- Keras: deep learning (and TensorFlow, Theano, Lasagne)
- OpenCV, scikit-image: image processing
- NLTK: natural language toolkit
- Gensim: natural language models
Final remarks

- Python is a "jack of all trades" type of language;
- Its speed and ease of development make it really apt for scientific computing;
- It is increasingly adopted by scientists and engineers, thanks to the third-party scientific libraries contributed by a large community;
- It has become a de facto language for advances in some fields, such as Deep Learning.
About us
João Machado
machadojpf@gmail.com
Fraunhofer Portugal research engineer
Masters in Electrical and Computer Engineering
http://www.linkedin.com/in/machadojpf
Ricardo Cruz
rpcruz@inesctec.pt
INESC TEC researcher
Computer Science & Applied Mathematics graduate
https://rpmcruz.github.io/
Subscribe to the workshops:
http://tinyurl.com/cruz-workshops

Python for Scientific Computing -- Ricardo Cruz

  • 1.
    Introduction into Python forScientific Computing Jo˜ao Machado • Ricardo Cruz
  • 2.
    Introduction a d g be h c f i × 0 0 1 0 1 0 1 0 0 = ? ? ? ? ? ? ? ? ? What is the result of this operation?
  • 3.
    Introduction a d g be h c f i × 0 0 1 0 1 0 1 0 0 = ? ? ? ? ? ? ? ? ? What is the result of this operation? a d g b e h c f i × 0 0 1 0 1 0 1 0 0 = g d a h e b i f c
  • 4.
    Introduction a d g be h c f i × 0 0 1 0 1 0 1 0 0 = ? ? ? ? ? ? ? ? ? What is the result of this operation? a d g b e h c f i × 0 0 1 0 1 0 1 0 0 = g d a h e b i f c 1 from numpy import * 2 cout = p r i n t 3 4 A = random.random ((3, 3)); 5 B = fliplr(eye (3)); 6 C = dot(A, B); 7 cout(C); What programming language is this?
  • 5.
    Introduction a d g be h c f i × 0 0 1 0 1 0 1 0 0 = ? ? ? ? ? ? ? ? ? What is the result of this operation? a d g b e h c f i × 0 0 1 0 1 0 1 0 0 = g d a h e b i f c 1 from numpy import * 2 cout = p r i n t 3 4 A = random.random ((3, 3)); 5 B = fliplr(eye (3)); 6 C = dot(A, B); 7 cout(C); It’s Python! 1 from numpy import * 2 cout = p r i n t 3 4 A = random.random ((3, 3)); 5 B = fliplr(eye (3)); 6 C = dot(A, B); 7 cout(C); What programming language is this?
  • 6.
    Introduction 1 #i nc l u d e <armadillo > 2 using namespace arma; 3 using namespace std; 4 5 mat A(3 ,3), B(3 ,3); 6 A.randu (); 7 B = fliplr(B.eye ()); 8 M3 = M1 * M2; 9 cout << M3 << endl; What about this programming language? a d g b e h c f i × 0 0 1 0 1 0 1 0 0 = g d a h e b i f c 1 from numpy import * 2 cout = p r i n t 3 4 A = random.random ((3, 3)); 5 B = fliplr(eye (3)); 6 C = dot(A, B); 7 cout(C); It’s Python! 1 from numpy import * 2 cout = p r i n t 3 4 A = random.random ((3, 3)); 5 B = fliplr(eye (3)); 6 C = dot(A, B); 7 cout(C); What programming language is this?
  • 7.
    Introduction 1 #i nc l u d e <armadillo > 2 using namespace arma; 3 using namespace std; 4 5 mat A(3 ,3), B(3 ,3); 6 A.randu (); 7 B = fliplr(B.eye ()); 8 M3 = M1 * M2; 9 cout << M3 << endl; What about this programming language? Why use Python? More important than the programming language is the ecosystem – and Python has a great scientific community Python has good interoperability with other systems The entire stack can be developed in Python: machine learning, flask, etc Computations do not run in Python; the slow stuff is implemented in Fortran and C 1 from numpy import * 2 cout = p r i n t 3 4 A = random.random ((3, 3)); 5 B = fliplr(eye (3)); 6 C = dot(A, B); 7 cout(C); It’s Python! 1 from numpy import * 2 cout = p r i n t 3 4 A = random.random ((3, 3)); 5 B = fliplr(eye (3)); 6 C = dot(A, B); 7 cout(C); What programming language is this?
  • 8.
  • 9.
    Why Python? Good datamining ecosystem. Not as centralized/monopolistic as Matlab’s Not as decentralized and messy as R :P
  • 10.
  • 11.
  • 12.
    Numpy Notes Let Aand B be matrices, Python/Numpy MATLAB R A.dot(B) A * B A %*% B A * B A .* B A * B Operations are elementwise by default (like R)
  • 13.
    Numpy Notes Let Aand B be matrices, Python/Numpy MATLAB R A.dot(B) A * B A %*% B A * B A .* B A * B Operations are elementwise by default (like R) Python/Numpy MATLAB R A.shape size(A) length, nrow, ncol A[0:4,:] or A[0:4] or A[:4] A(1:4,:) A[1:4,] A[0:10:2] A[seq(0, 9, 2)] A[-4:] A(end-4:end,:) A[nrow(A)-4:nrow(A),] A.T A.’ t(A) Numpy in general allows for more succinct writing. Furthermore: Indexing starts at zero. Intervals are of the form [i, j[
  • 14.
    Numpy Notes Let Aand B be matrices, Python/Numpy MATLAB R A.dot(B) A * B A %*% B A * B A .* B A * B Operations are elementwise by default (like R) Python/Numpy MATLAB R A.shape size(A) length, nrow, ncol A[0:4,:] or A[0:4] or A[:4] A(1:4,:) A[1:4,] A[0:10:2] A[seq(0, 9, 2)] A[-4:] A(end-4:end,:) A[nrow(A)-4:nrow(A),] A.T A.’ t(A) Numpy in general allows for more succinct writing. Furthermore: Indexing starts at zero. Intervals are of the form [i, j[ This is further aided by the fact that Numpy supports arithmetic broadcasting. (unlike MATLAB or R.) That is, you can do the following element- wise multiplication: (6,3) * (6,1). It auto- matically assumes you want to multiply by column. In MATLAB, you would have to use bsxfun(@times,r,A) or first use repmat().
  • 15.
    Numpy Notes Let Aand B be matrices, Python/Numpy MATLAB R A.dot(B) A * B A %*% B A * B A .* B A * B Operations are elementwise by default (like R) Python/Numpy MATLAB R A.shape size(A) length, nrow, ncol A[0:4,:] or A[0:4] or A[:4] A(1:4,:) A[1:4,] A[0:10:2] A[seq(0, 9, 2)] A[-4:] A(end-4:end,:) A[nrow(A)-4:nrow(A),] A.T A.’ t(A) Numpy in general allows for more succinct writing. Furthermore: Indexing starts at zero. Intervals are of the form [i, j[ Something like the following is valid in Numpy... 1 import skimage.data 2 img1 = skimage.data.astronaut () 3 img2 = skimage.data.moon () 4 p r i n t (img1.shape) # (512 , 512 , 3) 5 p r i n t (img2.shape) # (512 , 512) 6 7 import matplotlib.pyplot as plt 8 plt.subplot (1, 2, 1) 9 plt.imshow(img1) 10 plt.subplot (1, 2, 2) 11 plt.imshow(img2 , cmap=’gray ’) 12 plt.show () This is further aided by the fact that Numpy supports arithmetic broadcasting. (unlike MATLAB or R.) That is, you can do the following element- wise multiplication: (6,3) * (6,1). It auto- matically assumes you want to multiply by column. In MATLAB, you would have to use bsxfun(@times,r,A) or first use repmat().
  • 16.
    Numpy Notes Python/Numpy MATLABR A.shape size(A) length, nrow, ncol A[0:4,:] or A[0:4] or A[:4] A(1:4,:) A[1:4,] A[0:10:2] A[seq(0, 9, 2)] A[-4:] A(end-4:end,:) A[nrow(A)-4:nrow(A),] A.T A.’ t(A) Numpy in general allows for more succinct writing. Furthermore: Indexing starts at zero. Intervals are of the form [i, j[ Something like the following is valid in Numpy... 1 import skimage.data 2 img1 = skimage.data.astronaut () 3 img2 = skimage.data.moon () 4 p r i n t (img1.shape) # (512 , 512 , 3) 5 p r i n t (img2.shape) # (512 , 512) 6 7 import matplotlib.pyplot as plt 8 plt.subplot (1, 2, 1) 9 plt.imshow(img1) 10 plt.subplot (1, 2, 2) 11 plt.imshow(img2 , cmap=’gray ’) 12 plt.show () This is further aided by the fact that Numpy supports arithmetic broadcasting. (unlike MATLAB or R.) That is, you can do the following element- wise multiplication: (6,3) * (6,1). It auto- matically assumes you want to multiply by column. In MATLAB, you would have to use bsxfun(@times,r,A) or first use repmat().
  • 17.
    Numpy Notes Arithmetic mean 1img2 = img2[:, :, np.newaxis] #(512 ,512 ,1) 2 img1 = img1.astype(np.uint32) 3 img2 = img2.astype(np.uint32) 4 img3 = (img1 + img2)//2 5 img3 = img3.astype(np.uint8) 6 plt.imshow(img3) 7 plt.show () Something like the following is valid in Numpy... 1 import skimage.data 2 img1 = skimage.data.astronaut () 3 img2 = skimage.data.moon () 4 p r i n t (img1.shape) # (512 , 512 , 3) 5 p r i n t (img2.shape) # (512 , 512) 6 7 import matplotlib.pyplot as plt 8 plt.subplot (1, 2, 1) 9 plt.imshow(img1) 10 plt.subplot (1, 2, 2) 11 plt.imshow(img2 , cmap=’gray ’) 12 plt.show () This is further aided by the fact that Numpy supports arithmetic broadcasting. (unlike MATLAB or R.) That is, you can do the following element- wise multiplication: (6,3) * (6,1). It auto- matically assumes you want to multiply by column. In MATLAB, you would have to use bsxfun(@times,r,A) or first use repmat().
  • 18.
    Numpy Notes Arithmetic mean 1img2 = img2[:, :, np.newaxis] #(512 ,512 ,1) 2 img1 = img1.astype(np.uint32) 3 img2 = img2.astype(np.uint32) 4 img3 = (img1 + img2)//2 5 img3 = img3.astype(np.uint8) 6 plt.imshow(img3) 7 plt.show () Something like the following is valid in Numpy... 1 import skimage.data 2 img1 = skimage.data.astronaut () 3 img2 = skimage.data.moon () 4 p r i n t (img1.shape) # (512 , 512 , 3) 5 p r i n t (img2.shape) # (512 , 512) 6 7 import matplotlib.pyplot as plt 8 plt.subplot (1, 2, 1) 9 plt.imshow(img1) 10 plt.subplot (1, 2, 2) 11 plt.imshow(img2 , cmap=’gray ’) 12 plt.show ()
  • 19.
    Numpy Notes Arithmetic mean 1img2 = img2[:, :, np.newaxis] #(512 ,512 ,1) 2 img1 = img1.astype(np.uint32) 3 img2 = img2.astype(np.uint32) 4 img3 = (img1 + img2)//2 5 img3 = img3.astype(np.uint8) 6 plt.imshow(img3) 7 plt.show () Geometric mean 1 img2 = img2[:, :, np.newaxis] 2 img1 = img1.astype(np.uint32) 3 img2 = img2.astype(np.uint32) 4 img3 = np.sqrt(img1 * img2) 5 img3 = img3.astype(np.uint8) 6 plt.imshow(img3) 7 plt.show ()
  • 20.
    Numpy Notes Arithmetic mean 1img2 = img2[:, :, np.newaxis] #(512 ,512 ,1) 2 img1 = img1.astype(np.uint32) 3 img2 = img2.astype(np.uint32) 4 img3 = (img1 + img2)//2 5 img3 = img3.astype(np.uint8) 6 plt.imshow(img3) 7 plt.show () Geometric mean 1 img2 = img2[:, :, np.newaxis] 2 img1 = img1.astype(np.uint32) 3 img2 = img2.astype(np.uint32) 4 img3 = np.sqrt(img1 * img2) 5 img3 = img3.astype(np.uint8) 6 plt.imshow(img3) 7 plt.show ()
  • 21.
    Pandas and DataVisualization – Python for Scientific Computing Jo˜ao Machado • Ricardo Cruz
  • 22.
    Pandas What is Pandas? Apackage for data manipulation and analysis, based on the concept of data frame in the R language Optimized for performance, with critical code paths written in C Originally developed by Wes McKinney, while working for AQR Capital (a quantitative finance firm)
  • 23.
    Pandas What is Pandas? Apackage for data manipulation and analysis, based on the concept of data frame in the R language Optimized for performance, with critical code paths written in C Originally developed by Wes McKinney, while working for AQR Capital (a quantitative finance firm) Given the previous point, it makes sense to demonstrate some of the functionalities of Pandas with a dataset comprised of financial stocks :)
  • 24.
    Data Mining – Pythonfor Scientific Computing Jo˜ao Machado • Ricardo Cruz
  • 25.
  • 26.
    Models Let us producefake data... y(x) = 2x + 10 + ε1 + ε2 ε1 ∼ N(0, 2) ε2 ∼ |N(0, 25)| with p = 0.1, 0 otherwise.
  • 27.
    Models Let us producefake data... y(x) = 2x + 10 + ε1 + ε2 ε1 ∼ N(0, 2) ε2 ∼ |N(0, 25)| with p = 0.1, 0 otherwise. Let us produce fake data... y(x) = 2x + 10 + ε1 + bε2 ε1 ∼ N(0, 2) b ∼ B(2, 0.1) ε2 ∼ |N(0, 25)|
  • 28.
    Models Let us producefake data... y(x) = 2x + 10 + ε1 + ε2 ε1 ∼ N(0, 2) ε2 ∼ |N(0, 25)| with p = 0.1, 0 otherwise. Translation to numpy: 1 import numpy as np 2 N = 50 3 x = np.linspace (0, 25, N) 4 y = 2*x + 10 5 y += np.random.randn(N)*2 6 y += np.random.binomial (2, 0.10 , N)*np. abs (np.random.randn(N)*25) Let us produce fake data... y(x) = 2x + 10 + ε1 + bε2 ε1 ∼ N(0, 2) b ∼ B(2, 0.1) ε2 ∼ |N(0, 25)|
  • 29.
    Models 1 import matplotlib.pyplotas plt 2 plt.plot(x, y) 3 plt.title(’Data ’) 4 plt.show () Let us produce fake data... y(x) = 2x + 10 + ε1 + ε2 ε1 ∼ N(0, 2) ε2 ∼ |N(0, 25)| with p = 0.1, 0 otherwise. Translation to numpy: 1 import numpy as np 2 N = 50 3 x = np.linspace (0, 25, N) 4 y = 2*x + 10 5 y += np.random.randn(N)*2 6 y += np.random.binomial (2, 0.10 , N)*np. abs (np.random.randn(N)*25) Let us produce fake data... y(x) = 2x + 10 + ε1 + bε2 ε1 ∼ N(0, 2) b ∼ B(2, 0.1) ε2 ∼ |N(0, 25)|
  • 30.
    Models 1 import matplotlib.pyplotas plt 2 plt.plot(x, y) 3 plt.title(’Data ’) 4 plt.show () What model could we create to explain this data? Translation to numpy: 1 import numpy as np 2 N = 50 3 x = np.linspace (0, 25, N) 4 y = 2*x + 10 5 y += np.random.randn(N)*2 6 y += np.random.binomial (2, 0.10 , N)*np. abs (np.random.randn(N)*25) Let us produce fake data... y(x) = 2x + 10 + ε1 + bε2 ε1 ∼ N(0, 2) b ∼ B(2, 0.1) ε2 ∼ |N(0, 25)|
  • 31.
    Models 1 import matplotlib.pyplotas plt 2 plt.plot(x, y) 3 plt.title(’Data ’) 4 plt.show () What model could we create to explain this data? Translation to numpy: 1 import numpy as np 2 N = 50 3 x = np.linspace (0, 25, N) 4 y = 2*x + 10 5 y += np.random.randn(N)*2 6 y += np.random.binomial (2, 0.10 , N)*np. abs (np.random.randn(N)*25) Linear Regression Model: ˆy = β0 + β1x Minimize: i (yi − ˆyi )2
  • 32.
    Models 1 import matplotlib.pyplotas plt 2 plt.plot(x, y) 3 plt.title(’Data ’) 4 plt.show () What model could we create to explain this data? 1 from sklearn. linear_model import LinearRegression 2 m = LinearRegression () 3 m.fit(x[:, np.newaxis], y) 4 yp = m.predict(x[:, np.newaxis ]) 5 6 plt.plot(x, y) 7 plt.plot(x, yp) 8 plt.title(’Linear regression ’) 9 plt.text(0, 70, ’m=%.1f b=%.1f’ % (m.coef_ [0], m.intercept_)) 10 plt.show () Linear Regression Model: ˆy = β0 + β1x Minimize: i (yi − ˆyi )2
  • 33.
    Models What model couldwe create to explain this data? 1 from sklearn. linear_model import LinearRegression 2 m = LinearRegression () 3 m.fit(x[:, np.newaxis], y) 4 yp = m.predict(x[:, np.newaxis ]) 5 6 plt.plot(x, y) 7 plt.plot(x, yp) 8 plt.title(’Linear regression ’) 9 plt.text(0, 70, ’m=%.1f b=%.1f’ % (m.coef_ [0], m.intercept_)) 10 plt.show () Linear Regression Model: ˆy = β0 + β1x Minimize: i (yi − ˆyi )2
  • 34.
    Models y(x) = 2x+ 10 + ε1 + bε2 ˆy(x) = 2x + 18 What if I want to explain only the trend? How can I avoid the impact of these spikes? 1 from sklearn. linear_model import LinearRegression 2 m = LinearRegression () 3 m.fit(x[:, np.newaxis], y) 4 yp = m.predict(x[:, np.newaxis ]) 5 6 plt.plot(x, y) 7 plt.plot(x, yp) 8 plt.title(’Linear regression ’) 9 plt.text(0, 70, ’m=%.1f b=%.1f’ % (m.coef_ [0], m.intercept_)) 10 plt.show () Linear Regression Model: ˆy = β0 + β1x Minimize: i (yi − ˆyi )2
  • 35.
    Models y(x) = 2x+ 10 + ε1 + bε2 ˆy(x) = 2x + 18 What if I want to explain only the trend? How can I avoid the impact of these spikes? 1 from sklearn. linear_model import LinearRegression 2 m = LinearRegression () 3 m.fit(x[:, np.newaxis], y) 4 yp = m.predict(x[:, np.newaxis ]) 5 6 plt.plot(x, y) 7 plt.plot(x, yp) 8 plt.title(’Linear regression ’) 9 plt.text(0, 70, ’m=%.1f b=%.1f’ % (m.coef_ [0], m.intercept_)) 10 plt.show () What would a statistician do? 1 res = yp -y 2 plt.boxplot(res) 3 plt.show ()
  • 36.
    Models y(x) = 2x+ 10 + ε1 + bε2 ˆy(x) = 2x + 18 What if I want to explain only the trend? How can I avoid the impact of these spikes? 1 q1 = np.percentile(res , 25) 2 q3 = np.percentile(res , 75) 3 t = np.logical_and(res > q1 , res < q3) 4 x2 = x[t] 5 y2 = y[t] 6 7 m = LinearRegression () 8 m.fit(x2[:, np.newaxis], y2) 9 yp = m.predict(x[:, np.newaxis ]) What would a statistician do? 1 res = yp -y 2 plt.boxplot(res) 3 plt.show ()
  • 37.
    Models y(x) = 2x+ 10 + ε1 + bε2 ˆy(x) = 2x + 18 What if I want to explain only the trend? How can I avoid the impact of these spikes? 1 q1 = np.percentile(res , 25) 2 q3 = np.percentile(res , 75) 3 t = np.logical_and(res > q1 , res < q3) 4 x2 = x[t] 5 y2 = y[t] 6 7 m = LinearRegression () 8 m.fit(x2[:, np.newaxis], y2) 9 yp = m.predict(x[:, np.newaxis ]) What would a statistician do? 1 res = yp -y 2 plt.boxplot(res) 3 plt.show ()
  • 38.
    Models Approach #2: Whatwould a statistician with some computer science knowledge do? 1 q1 = np.percentile(res , 25) 2 q3 = np.percentile(res , 75) 3 t = np.logical_and(res > q1 , res < q3) 4 x2 = x[t] 5 y2 = y[t] 6 7 m = LinearRegression () 8 m.fit(x2[:, np.newaxis], y2) 9 yp = m.predict(x[:, np.newaxis ]) What would a statistician do? 1 res = yp -y 2 plt.boxplot(res) 3 plt.show ()
  • 39.
    Models Approach #2: Whatwould a statistician with some computer science knowledge do? 1 q1 = np.percentile(res , 25) 2 q3 = np.percentile(res , 75) 3 t = np.logical_and(res > q1 , res < q3) 4 x2 = x[t] 5 y2 = y[t] 6 7 m = LinearRegression () 8 m.fit(x2[:, np.newaxis], y2) 9 yp = m.predict(x[:, np.newaxis ]) Model: ˆy = β0 + β1x Minimize: i |yi − ˆyi |
  • 40.
    Models Approach #2: Whatwould a statistician with some computer science knowledge do? 1 from statsmodels.regression. quantile_regression import QuantReg 2 3 m = QuantReg(y, np.c_[np.ones(N), x]) 4 m = m.fit (0.5) 5 yp = m.predict () Model: ˆy = β0 + β1x Minimize: i |yi − ˆyi |
  • 41.
    Models Approach #2: Whatwould a statistician with some computer science knowledge do? 1 from statsmodels.regression. quantile_regression import QuantReg 2 3 m = QuantReg(y, np.c_[np.ones(N), x]) 4 m = m.fit (0.5) 5 yp = m.predict () Model: ˆy = β0 + β1x Minimize: i |yi − ˆyi |
  • 42.
    Models Approach #3: Whatwould a crazy com- puter scientist do? 1 from statsmodels.regression. quantile_regression import QuantReg 2 3 m = QuantReg(y, np.c_[np.ones(N), x]) 4 m = m.fit (0.5) 5 yp = m.predict () Model: ˆy = β0 + β1x Minimize: i |yi − ˆyi |
  • 43.
    Models Approach #3: Whatwould a crazy com- puter scientist do? 1 from statsmodels.regression. quantile_regression import QuantReg 2 3 m = QuantReg(y, np.c_[np.ones(N), x]) 4 m = m.fit (0.5) 5 yp = m.predict () 1 plt.plot(x, y) 2 f o r it i n range (10): 3 t = np.random.choice(N, N//10 , replace =False) 4 x2 = x[t] 5 y2 = y[t] 6 m.fit(x2[:, np.newaxis], y2) 7 yp = m.predict(x[:, np.newaxis ]) 8 plt.plot(x, yp , color=’black ’, alpha =0.4) 9 plt.show ()
  • 44.
    Models Approach #3: Whatwould a crazy com- puter scientist do? 1 plt.plot(x, y) 2 f o r it i n range (10): 3 t = np.random.choice(N, N//10 , replace =False) 4 x2 = x[t] 5 y2 = y[t] 6 m.fit(x2[:, np.newaxis], y2) 7 yp = m.predict(x[:, np.newaxis ]) 8 plt.plot(x, yp , color=’black ’, alpha =0.4) 9 plt.show ()
  • 45.
    Models Sklearn already comeswith this crazy model too: 1 from sklearn. linear_model import RANSACRegressor 2 m = RANSACRegressor () 3 m.fit(x[:, np.newaxis], y) 4 5 plt.plot(x, y) 6 plt.plot(x, m.predict(x[:, np.newaxis ])) 7 plt.title(’RANSAC ’) 8 plt.show () Approach #3: What would a crazy com- puter scientist do? 1 plt.plot(x, y) 2 f o r it i n range (10): 3 t = np.random.choice(N, N//10 , replace =False) 4 x2 = x[t] 5 y2 = y[t] 6 m.fit(x2[:, np.newaxis], y2) 7 yp = m.predict(x[:, np.newaxis ]) 8 plt.plot(x, yp , color=’black ’, alpha =0.4) 9 plt.show ()
  • 46.
    Models Sklearn already comeswith this crazy model too: 1 from sklearn. linear_model import RANSACRegressor 2 m = RANSACRegressor () 3 m.fit(x[:, np.newaxis], y) 4 5 plt.plot(x, y) 6 plt.plot(x, m.predict(x[:, np.newaxis ])) 7 plt.title(’RANSAC ’) 8 plt.show () 1 plt.plot(x, y) 2 f o r it i n range (10): 3 t = np.random.choice(N, N//10 , replace =False) 4 x2 = x[t] 5 y2 = y[t] 6 m.fit(x2[:, np.newaxis], y2) 7 yp = m.predict(x[:, np.newaxis ]) 8 plt.plot(x, yp , color=’black ’, alpha =0.4) 9 plt.show ()
  • 47.
    What kind ofthings can we use data mining / machine learning for?
  • 48.
    Data Mining Problems Regression:predict a continuous variable e.g. House Price = 100 + 20 × Land Size In scikit-learn, LinearRegression, Gradient- BoostingRegressor, etc (:: RegressorMixin) .fit(X, y) .predict(X) -> yp
  • 49.
    Data Mining Problems Regression:predict a continuous variable e.g. House Price = 100 + 20 × Land Size In scikit-learn, LinearRegression, Gradient- BoostingRegressor, etc (:: RegressorMixin) .fit(X, y) .predict(X) -> yp Classification: predict a discrete variable e.g. House Price = Expensive if in the city center Cheap if outside the city In scikit-learn, LogisticRegression, Gradient- BoostingClassifier, etc (:: ClassifierMixin) .fit(X, y) .predict(X) -> yp
  • 50.
    Data Mining Problems Regression:predict a continuous variable e.g. House Price = 100 + 20 × Land Size In scikit-learn, LinearRegression, Gradient- BoostingRegressor, etc (:: RegressorMixin) .fit(X, y) .predict(X) -> yp Classification: predict a discrete variable e.g. House Price = Expensive if in the city center Cheap if outside the city In scikit-learn, LogisticRegression, Gradient- BoostingClassifier, etc (:: ClassifierMixin) .fit(X, y) .predict(X) -> yp Clustering: not predict, aggregate In scikit-learn, KMeans, LatentDirichletAllo- cation, etc (:: ClusterMixin) .fit(X) .transform(X) -> X’ .fit transform(X) -> X’
  • 51.
    Data Mining Problems Regression:predict a continuous variable e.g. House Price = 100 + 20 × Land Size In scikit-learn, LinearRegression, Gradient- BoostingRegressor, etc (:: RegressorMixin) .fit(X, y) .predict(X) -> yp Classification: predict a discrete variable e.g. House Price = Expensive if in the city center Cheap if outside the city In scikit-learn, LogisticRegression, Gradient- BoostingClassifier, etc (:: ClassifierMixin) .fit(X, y) .predict(X) -> yp Re-inforcement learning: (predict best move) Clustering: not predict, aggregate In scikit-learn, KMeans, LatentDirichletAllo- cation, etc (:: ClusterMixin) .fit(X) .transform(X) -> X’ .fit transform(X) -> X’
  • 52.
    Use Cases Jo˜ao Machado• Ricardo Cruz
  • 53.
  • 54.
    Text Mining w/Twitter Packages: tweepy numpy matplotlib scikit-learn
  • 55.
Text Mining

    import tweepy

    auth = tweepy.OAuthHandler(api_key, api_secret)
    auth.set_access_token(access_token, access_secret)
    api = tweepy.API(auth)

    timeline = api.user_timeline('realDonaldTrump', count=100)
    texts = [tweet.text for tweet in timeline]

Turn the tweets into a document-term matrix:

    from sklearn.feature_extraction.text import CountVectorizer
    m = CountVectorizer(stop_words='english', min_df=5, max_df=16)
    X = m.fit_transform(texts)
    words = sorted(m.vocabulary_, key=m.vocabulary_.get)

    import pandas as pd
    # .ix is deprecated; .iloc does positional indexing
    print(pd.DataFrame(X.todense(), columns=words).iloc[:5, :5].to_latex())

Output (first rows of the document-term matrix):

       america  big  comey  day  dems
    0        0    0      0    0     0
    1        0    1      0    0     0
    2        1    0      0    0     0

Plot the word counts:

    import numpy as np
    import matplotlib.pyplot as plt
    counts = np.asarray(X.sum(0))[0]
    plt.barh(range(len(counts)), counts)
    plt.xticks(range(0, 14, 2))
    plt.yticks(range(len(counts)), words)
    plt.show()
Text Mining

    from sklearn.decomposition import LatentDirichletAllocation
    lda = LatentDirichletAllocation(2, learning_method='online')
    lda.fit(X)
    topics = lda.components_

Each row of the components matrix expresses a topic as a weighted combination of the words:

    topic1 = β11 word1 + β12 word2 + …
    topic2 = β21 word1 + β22 word2 + …

Plot each word at its coordinates in the two topics:

    topics = topics / topics.max(1)[:, np.newaxis]    # normalize each topic row
    topics += np.random.randn(*topics.shape) * 0.02   # jitter so labels don't overlap
    for i, word in enumerate(words):
        plt.text(topics[0, i], topics[1, i], word, ha='center')
    plt.show()

The same analysis can be repeated for another account:

    timeline = api.user_timeline('marcelorebelo_', count=100)
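Not on the slides, but a small sketch of how one might list each topic's strongest words from lda.components_ (argsort over the rows; the variable names reuse lda and words from above):

    import numpy as np

    for k, topic in enumerate(lda.components_):
        top = np.argsort(topic)[::-1][:5]              # indices of the 5 heaviest words
        print('topic %d:' % k, [words[i] for i in top])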
Traditional Learning vs Deep Learning

Traditionally, hand-crafted features would be extracted from the dataset, and learning would happen on top of those features. Deep learning learns from the raw data.

Packages: scikit-image, numpy, keras
Traditional Learning

Cats vs Dogs – Kaggle Competition – https://www.kaggle.com/c/dogs-vs-cats
25,000 images of cats and dogs

Feature #1: extract a histogram of colors

    import os
    import numpy as np
    from skimage.io import imread
    from skimage.color import rgb2gray   # rgb2gray lives in skimage.color, not skimage.transform

    for filename in os.listdir('train'):
        im = imread(os.path.join('train', filename))
        im = rgb2gray(im)
        f1 = np.histogram(im.flatten(), 10)[0]
        f1 = (f1 / f1.sum()).cumsum()    # normalized cumulative histogram

Feature #2: Histogram of Oriented Gradients

    from skimage.transform import resize
    from skimage.feature import hog

    im2 = resize(im, (32, 32), mode='reflect')
    im2 = np.sqrt(im2)                   # gamma correction
    f2 = hog(im2, block_norm='L2-Hys')

Fit a decision tree on the features:

    from sklearn.tree import DecisionTreeClassifier, export_graphviz
    m = DecisionTreeClassifier(max_depth=3)
    m.fit(X, y)

Cross-validate a random forest:

    from sklearn.model_selection import cross_val_score
    from sklearn.ensemble import RandomForestClassifier
    print(cross_val_score(RandomForestClassifier(100), X, y))

Output: [ 0.69642429  0.70086393  0.69851176]
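The slides jump from per-image features to m.fit(X, y); a minimal sketch of the missing assembly step, assuming the imports above and the Kaggle naming scheme (files named cat.0.jpg, dog.0.jpg, ...):

    X, y = [], []
    for filename in os.listdir('train'):
        im = rgb2gray(imread(os.path.join('train', filename)))
        f1 = (np.histogram(im.flatten(), 10)[0] / im.size).cumsum()
        f2 = hog(np.sqrt(resize(im, (32, 32), mode='reflect')), block_norm='L2-Hys')
        X.append(np.concatenate([f1, f2]))     # one feature row per image
        y.append(filename.startswith('cat'))   # label from the filename prefix
    X = np.asarray(X)
    y = np.asarray(y, dtype=int)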
Deep Learning

Linear regression:

    ŷ = β0 + β1 x1 + β2 x2 + …

Multilayer perceptron / neural network:

    ŷ = β00 σ(β10 + β11 x1 + β12 x2 + …) + β01 σ(β20 + β21 x1 + β22 x2 + …) + …

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
    from keras.optimizers import SGD

    model = Sequential()
    model.add(Conv2D(8, 3, 1, activation='relu', input_shape=(32, 32, 1)))
    model.add(MaxPooling2D())
    model.add(Conv2D(16, 3, 1, activation='relu'))
    model.add(MaxPooling2D())
    model.add(Flatten())
    model.add(Dense(16, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))

    sgd = SGD()
    model.compile(sgd, 'binary_crossentropy')

    model.fit(X[tr], y[tr], validation_data=(X[ts], y[ts]),
              epochs=10, batch_size=100)
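To make the perceptron formula concrete, here is a minimal numpy sketch of a single forward pass; the weights are random stand-ins for the fitted βs (illustrative only, not the slides' model):

    import numpy as np

    def sigma(z):
        return 1 / (1 + np.exp(-z))    # logistic activation

    x = np.array([0.2, 0.7])           # two input features x1, x2
    W = np.random.randn(2, 2)          # hidden weights β11, β12 / β21, β22
    b = np.random.randn(2)             # hidden biases β10, β20
    v = np.random.randn(2)             # output weights β00, β01

    y_hat = v @ sigma(b + W @ x)       # the formula above, vectorized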
Deep Learning

Evaluate the network with stratified cross-validation:

    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import accuracy_score

    for tr, ts in StratifiedKFold().split(X, y):
        model = ...  # build and fit the Sequential model from the previous slide
        ...
        yp = (model.predict(X[ts])[:, -1] > 0.5).astype(int)
        print(accuracy_score(y[ts], yp))

Output: [0.57, 0.57, 0.63]

Overview of the Python deep learning landscape: Theano, TensorFlow, PyTorch, Keras, Lasagne

Deep learning architectures:
Fully connected perceptrons
Convolutional neural networks
Recurrent neural networks
Neural Turing Machines
Autoencoders
Conclusions – Python for Scientific Computing
João Machado • Ricardo Cruz
Conclusions

Packages to know:
Numpy: basic linear algebra
Scipy: extensions to numpy – sparse matrices, pdfs, hypothesis tests (see the sketch after this list)
Statsmodels: several statistical models, incl. timeseries
Pandas: extension to numpy for dataframes support
Matplotlib, seaborn: drawing graphics
scikit-learn: complete machine learning toolkit
xgboost: famous gradient boosting model
Keras: deep learning (and TensorFlow, Theano, Lasagne)
OpenCV, scikit-image: image processing
NLTK: natural language toolkit
Gensim: natural language models
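A minimal sketch of the three Scipy items mentioned above (toy data; standard scipy.sparse / scipy.stats calls):

    import numpy as np
    from scipy import sparse, stats

    S = sparse.eye(1000, format='csr')   # sparse 1000×1000 identity matrix

    print(stats.norm.pdf(0))             # N(0,1) density at 0: ~0.3989

    a = np.random.randn(100)             # two toy samples
    b = np.random.randn(100) + 0.5
    t, p = stats.ttest_ind(a, b)         # two-sample t-test
    print(t, p)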
Final remarks

Python is a "jack of all trades" type of language;
Its speed and ease of development make it really apt for scientific computing;
It is increasingly adopted by scientists and engineers, thanks to the third-party scientific libraries contributed by a large community;
It has become a de facto language of advances in some fields, such as Deep Learning.
About us

João Machado
machadojpf@gmail.com
Fraunhofer Portugal research engineer
Masters in Electrical and Computer Engineering
http://www.linkedin.com/in/machadojpf

Ricardo Cruz
rpcruz@inesctec.pt
INESC TEC researcher
Computer Science & Applied Mathematics graduate
https://rpmcruz.github.io/

Subscribe to workshops: http://tinyurl.com/cruz-workshops