Vectorise all the things!
Speeding up your code with basic linear algebra
What we’re going to cover
● Linear algebra basics
● Distance metrics and kNN
● Unoptimised solution
● Optimising distance metrics
● Optimising kNN
Linear algebra basics
What is a vector?
Major axis length    Minor axis length
653.7                392.3
427.8                247.0
229.2                162.4
Vector spaces
What is a matrix?
Major axis length    Minor axis length
653.7                392.3
427.8                247.0
229.2                162.4
Representations in NumPy
● Vectors and matrices are stored in arrays
● Arrays can be n-dimensional
○ Vectors in 1D arrays
○ Matrices in 2D arrays
○ Tensors in 3D (or nD) arrays
● Shape gives the size along each dimension
○ 1D array: (2,)
○ 2D array: (2,2)
○ 3D array: (1,2,2)
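The three shapes above can be reproduced with a short sketch. It assumes the bean measurements from the earlier table as values; the variable names are illustrative only.

```python
import numpy as np

# A vector: one bean's (major axis length, minor axis length).
vector = np.array([653.7, 392.3])

# A matrix: two beans stacked as rows.
matrix = np.array([[653.7, 392.3],
                   [427.8, 247.0]])

# A 3D array: the same matrix with an extra leading axis of size 1.
tensor = matrix[np.newaxis, :, :]

print(vector.shape)  # (2,)
print(matrix.shape)  # (2, 2)
print(tensor.shape)  # (1, 2, 2)
```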
K-Nearest Neighbours
Distances in vector spaces
● Objects that are similar are closer (“less distant”) in vector space
● Similar objects have similar values along every dimension of the vector space
● Cali beans:
○ Similar values on features to each other
○ Distinct values compared to other beans
Manhattan distance
Manhattan distance between points a and b:
D = |4 - 2| + |3 - 1| = 4
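For reference, the general form behind this worked example can be written as follows. The coordinates a = (4, 3) and b = (2, 1) are an assumption read off the numbers in the calculation above.

```latex
D(a, b) = \sum_{i=1}^{n} |a_i - b_i|

% With the assumed points a = (4, 3) and b = (2, 1):
D(a, b) = |4 - 2| + |3 - 1| = 2 + 2 = 4
```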
k-nearest neighbours
● Data are divided into train and test sets
● Distance between the test point and each training point is measured
● k nearest points are retained
● Labels of k nearest points counted
● Test point assigned majority label
(Figure labels: Cali, Cali, Cali, Seker, Dermason, Cali)
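The counting and majority-vote steps can be sketched in plain Python. The labels below are hypothetical values for k = 5 neighbours, chosen for illustration rather than taken from the talk's data.

```python
from collections import Counter

# Hypothetical labels of the k = 5 nearest training points.
nearest_labels = ["Cali", "Cali", "Cali", "Seker", "Dermason"]

# Count each label, then take the most frequent one.
label_counts = Counter(nearest_labels)
predicted_label = label_counts.most_common(1)[0][0]

print(predicted_label)  # Cali
```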
First pass with loops
Our first code improvement
def calculate_manhattan_distance(a: list, b: list, p: int) -> float:
    """Calculates the Manhattan distance between two vectors, a and b."""
    # Note: p is accepted but not used in this calculation.
    i = len(a)
    diffs = []
    # Collect the absolute difference for each pair of elements.
    for element in range(0, i):
        diffs.append(abs(a[element] - b[element]))
    return sum(diffs)
Vectorising the Manhattan distance
Vector and matrix subtraction
● Vectors and matrices of the same size can be subtracted
○ E.g., 4 x 1 vectors
○ E.g., 3 x 2 matrices
● Subtractions are performed elementwise
● Result is a vector or matrix of the same size
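A small sketch of elementwise subtraction in NumPy, using made-up values. Note that a 1D NumPy array of length 4 has shape (4,) rather than (4, 1).

```python
import numpy as np

# Two vectors of the same size.
a = np.array([4.0, 3.0, 2.0, 1.0])
b = np.array([1.0, 1.0, 1.0, 1.0])
vector_diff = a - b          # elementwise: [3. 2. 1. 0.]

# Two 3 x 2 matrices.
m1 = np.array([[6.0, 5.0],
               [4.0, 3.0],
               [2.0, 1.0]])
m2 = np.ones((3, 2))
matrix_diff = m1 - m2        # elementwise, still 3 x 2

print(vector_diff.shape)  # (4,)
print(matrix_diff.shape)  # (3, 2)
```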
Operations on elements of vectors
● An operation can be applied to every element of a vector:
○ Scalar multiplication
○ Other functions such as absolute value
● Result is a vector or matrix of the same size
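Combining subtraction, absolute value and a sum gives one possible loop-free Manhattan distance. This is a minimal sketch; the function name `manhattan_distance` is hypothetical and the talk's own vectorised version may differ.

```python
import numpy as np

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Manhattan distance between two equal-length 1D arrays."""
    # Subtract elementwise, take absolute values elementwise, then sum.
    return float(np.abs(a - b).sum())

a = np.array([4.0, 3.0])
b = np.array([2.0, 1.0])

scaled = 2 * a                # scalar multiplication: [8. 6.]
abs_diff = np.abs(b - a)      # absolute value: [2. 2.]
distance = manhattan_distance(a, b)

print(distance)  # 4.0
```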
Times after vector subtraction
(Chart values: 1.3x, 1.2x, 3.2x, 1.8x, 1.8x, 4.3x)
Nested for loops can get expensive
● Nested loops compound the issues with single loops
● Sequential processing means time scales as the product of the lengths of the two lists:
○ Small dataset = 3000 x 1000 = 3 million
○ Medium and large = 20000 x 7000 = 140 million
Our second code improvement
def apply_manhattan_distance(vectors_1: list, vectors_2: list, p: int) -> list:
    """Calculates the pairwise Manhattan distances between two lists of vectors."""
    distances = []
    for train_obs in vectors_1:
        tmp_distances = []
        for test_obs in vectors_2:
            tmp_distances.append(calculate_manhattan_distance(train_obs, test_obs, p))
        distances.append(tmp_distances)
    # Transpose so that each row holds the distances for one vector in vectors_2.
    return [list(x) for x in zip(*distances)]
Getting rid of the nested for loop
Doing the matrix subtraction in one pass
(Figure: arrays of shape (1,3,4) and (3,3,4))
Broadcasting
● A memory-efficient way for NumPy to make arrays of different shapes compatible for an operation
● For an operation, NumPy compares each pair of dimensions and checks:
○ Are the dimensions the same size?
○ If not, is one of the dimensions of size 1?
● Dimensions of size 1 are replicated or “stretched” to match the size of the other array
○ E.g., subtraction between a 3 x 4 matrix and a 1 x 4 vector
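The 3 x 4 minus 1 x 4 case can be sketched as below, with made-up values. `np.broadcast_to` is used only to show what the stretched array looks like; it returns a read-only view rather than a copy.

```python
import numpy as np

matrix = np.arange(12.0).reshape(3, 4)   # shape (3, 4)
row = np.array([[1.0, 1.0, 1.0, 1.0]])   # shape (1, 4)

# The size-1 dimension of `row` is stretched to 3 during the subtraction.
result = matrix - row                    # shape (3, 4)

# What the stretched operand looks like, without copying the data.
stretched = np.broadcast_to(row, matrix.shape)

print(result.shape)     # (3, 4)
print(stretched.shape)  # (3, 4)
```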
Broadcasting
(Figure: arrays of shape (1,3,4) and (3,1,4))
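One way to replace the nested loop with broadcasting is sketched here. It assumes 3 training and 3 test observations with 4 features each, to match the shapes above; which of the two shapes is the training set is not stated in the slides, and the function name `pairwise_manhattan` is hypothetical.

```python
import numpy as np

def pairwise_manhattan(train: np.ndarray, test: np.ndarray) -> np.ndarray:
    """Return an (n_test, n_train) array of Manhattan distances."""
    # test[:, np.newaxis, :]  has shape (n_test, 1, n_features)
    # train[np.newaxis, :, :] has shape (1, n_train, n_features)
    # Broadcasting stretches both to (n_test, n_train, n_features).
    diffs = test[:, np.newaxis, :] - train[np.newaxis, :, :]
    # Sum the absolute differences over the feature axis.
    return np.abs(diffs).sum(axis=2)

train = np.array([[1, 2, 3, 4],
                  [2, 3, 4, 5],
                  [0, 0, 0, 0]], dtype=float)
test = np.array([[1, 2, 3, 4],
                 [1, 1, 1, 1],
                 [5, 5, 5, 5]], dtype=float)

distances = pairwise_manhattan(train, test)
print(distances.shape)  # (3, 3)
```

Each row of `distances` holds one test observation's distances to every training observation, the same layout the nested-loop version returns after its transpose. The intermediate `diffs` array does occupy n_test x n_train x n_features values, so very large datasets may need to be processed in chunks.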
Times after broadcasting
(Chart values: 1.3x, 1.2x, 1.8x, 1.8x, 4.3x, 10x, 11x, 13x)
Vectorising kNN
Our final code improvements
from operator import itemgetter

def calculate_nearest_neighbour(distances: list, labels: list, k: int) -> str:
    """
    Returns the majority label among the k nearest neighbours
    of a test point.
    """
    # Sort (distance, label) pairs by distance, then drop the first (closest) pair.
    sorted_distances = sorted(zip(distances, labels), key=itemgetter(0))[1:]
    # Keep the labels of the next k pairs.
    top_n_labels = [label for dist, label in sorted_distances][:k]
    # Return the most frequent label.
    return max(set(top_n_labels), key=top_n_labels.count)
The problems with this function
● Sorting:
○ Python’s sort and sorted are locked to Timsort
○ NumPy sort methods default to quicksort
○ NumPy’s stable sort option adjusts its algorithm to the dtype
● List comprehension:
○ A for loop in disguise
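A possible NumPy version that avoids both the Python-level sort and the list comprehension is sketched below. The function name `predict_labels` and the sample data are hypothetical, and this is not necessarily the implementation used in the talk. Unlike the function above, the sketch does not drop the closest point, and ties are broken in favour of the label that sorts first.

```python
import numpy as np

def predict_labels(distances: np.ndarray, train_labels: np.ndarray, k: int) -> np.ndarray:
    """Majority label among the k nearest training points, per test point."""
    # Encode the string labels as integer codes 0..n_classes-1.
    classes, codes = np.unique(train_labels, return_inverse=True)
    # Indices of the k smallest distances in each row.
    nearest_idx = np.argsort(distances, axis=1)[:, :k]
    nearest_codes = codes[nearest_idx]                      # (n_test, k)
    # Count votes per class by comparing against every class code at once.
    votes = (nearest_codes[:, :, np.newaxis] == np.arange(len(classes))).sum(axis=1)
    return classes[votes.argmax(axis=1)]

# Hypothetical distances: 2 test points x 5 training points.
distances = np.array([[0.1, 0.2, 0.9, 0.8, 0.3],
                      [0.9, 0.8, 0.1, 0.2, 0.3]])
train_labels = np.array(["Cali", "Cali", "Seker", "Seker", "Dermason"])

predictions = predict_labels(distances, train_labels, k=3)
print(predictions)  # ['Cali' 'Seker']
```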
Final timings
(Chart values: 1.3x, 1.2x, 1.8x, 1.8x, 4.3x, 10x, 11x, 13x, 1150x, 50x, 15x)
