Vectorise all the things

Speeding up your code with basic linear algebra

What we’re going to cover
● Linear algebra basics
● Distance metrics and kNN
● Unoptimised solution
● Optimising distance metrics
● Optimising kNN

What is a vector?
Major axis length    Minor axis length
653.7                392.3
427.8                247.0
229.2                162.4

What is a matrix?
Major axis length    Minor axis length
653.7                392.3
427.8                247.0
229.2                162.4

Representations in NumPy
● Vectors and matrices are stored in arrays
● Arrays can be n-dimensional
○ Vectors in 1D arrays
○ Matrices in 2D arrays
○ Tensors in 3D (or nD) arrays
● Shape gives the size along each dimension
○ 1D array: (2,)
○ 2D array: (2,2)
○ 3D array: (1,2,2)
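A minimal sketch of these three shapes in NumPy. The values come from the measurement table above; the variable names are illustrative only.

```python
import numpy as np

# One row of the table as a 1D array, i.e. a vector.
vector = np.array([653.7, 392.3])

# Two rows stacked into a 2D array, i.e. a matrix.
matrix = np.array([[653.7, 392.3],
                   [427.8, 247.0]])

# Adding a leading axis of length 1 gives a 3D array.
tensor = matrix[np.newaxis, :, :]

print(vector.shape)  # (2,)
print(matrix.shape)  # (2, 2)
print(tensor.shape)  # (1, 2, 2)
```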

Distances in vector spaces
● Objects that are similar are closer (“less distant”) in vector space
● Similar objects have similar values along every dimension of the vector space
● Cali beans:
○ Similar values on features to each other
○ Distinct values compared to other beans

Manhattan distance
D = |4 - 2| + |3 - 1| = 2 + 2 = 4
[Figure: points a and b]
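The arithmetic above can be checked with a few lines of plain Python. Which point is a and which is b is an assumption here; the coordinates are simply chosen to reproduce the numbers on the slide.

```python
# Assumed coordinates: a = (4, 3) and b = (2, 1).
a = [4, 3]
b = [2, 1]

# Sum the absolute differences along each dimension.
distance = sum(abs(x - y) for x, y in zip(a, b))
print(distance)  # 4
```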

k-nearest neighbours
● Data are divided into train and test sets
● Distance between test point and training points measured
● k nearest points are retained
● Labels of k nearest points counted
● Test point assigned majority label
[Figure: neighbouring points labelled Cali, Cali, Cali, Seker, Dermason, Cali]
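The steps above can be sketched in plain Python. The toy data, the value of k and all variable names are hypothetical; this is an illustration of the procedure, not the speaker's implementation.

```python
from collections import Counter

# Hypothetical training points, their labels, and one test point.
train = [[1.0, 1.0], [1.5, 2.0], [6.0, 6.0], [2.0, 1.0]]
labels = ["Cali", "Cali", "Seker", "Dermason"]
test_point = [1.2, 1.4]
k = 3

# Manhattan distance from the test point to every training point.
distances = [
    sum(abs(x - y) for x, y in zip(point, test_point)) for point in train
]

# Indices of the k training points with the smallest distances.
nearest = sorted(range(len(train)), key=lambda i: distances[i])[:k]

# Count the labels of those neighbours and take the majority.
votes = Counter(labels[i] for i in nearest)
predicted = votes.most_common(1)[0][0]
print(predicted)  # Cali
```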

Our first code improvement
def calculate_manhattan_distance(a: list, b: list, p: int) -> float:
    """Calculates the Manhattan distance between two vectors, a and b."""
    # Note: p is not used in this function body.
    i = len(a)
    diffs = []
    # Collect the absolute difference along each dimension.
    for element in range(0, i):
        diffs.append(abs(a[element] - b[element]))
    return sum(diffs)

Vectorising the Manhattan distance

Vector and matrix subtraction
● Vectors and matrices of the same size can be subtracted
○ E.g., 4 x 1 vectors
○ E.g., 3 x 2 matrices
● Subtractions are performed elementwise
● Result is a vector or matrix of the same size
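A small NumPy sketch of elementwise subtraction. The values are made up for illustration, and the 4-element vectors are stored as 1D arrays of length 4.

```python
import numpy as np

# Two vectors with 4 elements each.
u = np.array([5, 7, 9, 11])
v = np.array([1, 2, 3, 4])
vector_diff = u - v
print(vector_diff)  # [4 5 6 7]

# Two 3 x 2 matrices.
m = np.array([[5, 6], [7, 8], [9, 10]])
n = np.ones((3, 2), dtype=int)
matrix_diff = m - n
print(matrix_diff.shape)  # (3, 2)
```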

Operations on elements of vectors
● Elements of vectors can have an operation performed on them:
○ Scalar multiplication
○ Other functions such as absolute value
● Result is a vector or matrix of the same size
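Combining elementwise subtraction with an elementwise absolute value gives a loop-free Manhattan distance. The function name below is hypothetical, and this is one possible sketch rather than the exact code from the talk.

```python
import numpy as np

def manhattan_distance_vectorised(a: np.ndarray, b: np.ndarray) -> float:
    """Hypothetical vectorised counterpart to calculate_manhattan_distance."""
    # Subtract elementwise, take absolute values elementwise, then sum.
    return float(np.abs(a - b).sum())

a = np.array([4.0, 3.0])
b = np.array([2.0, 1.0])

# Scalar multiplication is also applied to every element.
doubled = 2 * a

result = manhattan_distance_vectorised(a, b)
print(result)  # 4.0
```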

Times after vector subtraction
[Chart: speed-up labels 1.3x, 1.2x, 3.2x, 1.8x, 1.8x, 4.3x]

Nested for loops can get expensive
● Nested loops compound the issues with single loops
● Sequential processing means time scales with the product of the lengths of the two lists:
○ Small dataset = 3000 x 1000 = 3 million
○ Medium and large = 20000 x 7000 = 140 million

Our second code improvement
def apply_manhattan_distance(vectors_1: list, vectors_2: list, p: int) -> list:
    """Calculates the pairwise distances between two lists of vectors."""
    distances = []
    for train_obs in vectors_1:
        tmp_distances = []
        for test_obs in vectors_2:
            tmp_distances.append(calculate_manhattan_distance(train_obs, test_obs, p))
        distances.append(tmp_distances)
    # Transpose: row j holds the distances from vectors_2[j] to every vector in vectors_1.
    return [list(x) for x in zip(*distances)]

Getting rid of the nested for loop

Doing the matrix subtraction in one pass
[Figure: arrays with shapes (1,3,4) and (3,3,4)]

Broadcasting
● A memory-efficient way for NumPy to make arrays a compatible size for an operation
● For an operation, NumPy compares each dimension and checks:
○ Are the dimensions the same size?
○ If not, is one of the dimensions of size 1?
● Replicates or “stretches” dimensions of size 1 to match the other array
○ E.g., subtraction between a 3 x 4 matrix and a 1 x 4 vector
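A short sketch of broadcasting, first on the 3 x 4 and 1 x 4 case and then on pairwise Manhattan distances. The data are invented, and mapping the (1,3,4) and (3,3,4) shapes onto test and training arrays in this way is an assumption, not something the slides state.

```python
import numpy as np

# A 3 x 4 matrix and a 1 x 4 row vector.
matrix = np.arange(12).reshape(3, 4)
row = np.array([[1, 1, 1, 1]])

# The size-1 dimension of `row` is stretched to 3, so the result is 3 x 4.
stretched = matrix - row
print(stretched.shape)  # (3, 4)

# Pairwise distances without Python loops: 3 test and 3 training vectors,
# each with 4 features (assumed sizes).
test = np.arange(12, dtype=float).reshape(3, 4)
train = np.ones((3, 4))

# (3, 1, 4) - (1, 3, 4) broadcasts to (3, 3, 4).
diffs = test[:, np.newaxis, :] - train[np.newaxis, :, :]

# Absolute value, then sum over the feature axis: one distance per pair.
distances = np.abs(diffs).sum(axis=2)
print(distances.shape)  # (3, 3)
```

Row i of `distances` holds the distances from test vector i to every training vector, which matches the layout returned by the nested-loop version.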

Times after broadcasting
[Chart: speed-up labels 1.3x, 1.2x, 1.8x, 1.8x, 4.3x, 10x, 11x, 13x]

Our final code improvements
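from operator import itemgetter

def calculate_nearest_neighbour(distances: list, labels: list, k: int) -> str:
    """
    Returns the majority label among the k nearest neighbours
    of a test point.
    """
    # Sort (distance, label) pairs by distance; [1:] drops the closest pair.
    sorted_distances = sorted(zip(distances, labels), key=itemgetter(0))[1:]
    # Keep the labels of the first k remaining pairs.
    top_n_labels = [label for dist, label in sorted_distances][:k]
    # Return the most frequent label among them.
    return max(set(top_n_labels), key=top_n_labels.count)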

The problems with this function
● Sorting:
○ Python's sort and sorted methods are locked to Timsort
○ NumPy sort methods default to quicksort
○ NumPy's stable methods adjust to dtype
● List comprehension:
○ For loop in disguise
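One way to address both points is to sort and count with NumPy. The function below is a hypothetical sketch, not the speaker's final version; unlike the original it keeps the closest entry rather than discarding it.

```python
import numpy as np

def nearest_neighbour_vectorised(
    distances: np.ndarray, labels: np.ndarray, k: int
) -> str:
    """Hypothetical sketch: majority label among the k smallest distances."""
    # argsort returns the indices that would sort the distances.
    nearest = np.argsort(distances)[:k]
    # Count each distinct label among the k nearest neighbours.
    values, counts = np.unique(labels[nearest], return_counts=True)
    # Pick the label with the highest count.
    return str(values[np.argmax(counts)])

# Made-up distances and labels for a single test point.
distances = np.array([0.6, 0.9, 9.4, 1.2])
labels = np.array(["Cali", "Cali", "Seker", "Dermason"])
predicted = nearest_neighbour_vectorised(distances, labels, k=3)
print(predicted)  # Cali
```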

Final timings
[Chart: speed-up labels 1.3x, 1.2x, 1.8x, 1.8x, 4.3x, 10x, 11x, 13x, 1150x, 50x, 15x]
