Accelerating Key Bioinformatics Tasks
100-fold by Improving Memory Access
Igor Sfiligoi, Daniel McDonald and Rob Knight
University of California San Diego
High level overview
• Microbiome studies recently moved from 100s to 10k-100k samples
• But the tools were built with 1k samples in mind
• Many of the tools are written in Python
• With NumPy used for compute-intensive portions
• This talk focuses on pcoa and mantel, which are part of scikit-bio
• The above code works well at small scales
• NumPy makes good use of the available compute resources
• But it is very slow at large(r) scales
• Lack of memory locality due to multi-step logic
• Re-implemented in Cython, with emphasis on memory locality
• Now scales well into the 100k samples range
A word about microbiome studies
• Microbiome studies composed of samples from the real world
• Often collected from many different environments
• DNA sequencing technology used to observe
what microorganisms are present in a given sample
• Typical questions
• How similar are the samples to each other?
• Can these similarities be explained by study variables?
(e.g., salinity, pH, host or not host associated, etc)
• How correlated are distance matrices created by different methods?
• Often comparing with samples from other studies, too
• Increases the total number of samples to consider
A word about CPU memory subsystem
• CPUs compute several orders of magnitude faster than DRAM
• Intel Xeon Gold 6242 takes ~100 cycles to load from DRAM
• CPUs thus have internal caches to speed up access
• But limited in size, due to high cost
• L1 cache: 4 cycles, 32 KiB per core
• L2 cache: 14 cycles, 1 MiB per core
• L3 cache: ~50 cycles, 22.5 MiB per chip
(the above is for the Xeon, other CPUs similar)
• Minimizing off-cache access thus essential
• Not a new concept, but can be overlooked when datasets grow
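The effect is easy to see on any machine. The sketch below (not from the slides; sizes and timings are illustrative) sums the same large matrix row-by-row and column-by-column: the data is identical, but the strided column access misses the caches and is markedly slower.
# Minimal sketch: traversal order vs. cache behavior on a C-ordered matrix.
# Absolute timings depend on the machine; only the ratio matters.
import time
import numpy as np

n = 8_000
a = np.random.rand(n, n)            # C order: rows are contiguous in memory

t0 = time.perf_counter()
row_total = sum(a[i, :].sum() for i in range(n))    # streams through memory
t1 = time.perf_counter()
col_total = sum(a[:, j].sum() for j in range(n))    # jumps a full row between elements
t2 = time.perf_counter()

print(f"rows: {t1 - t0:.2f}s  columns: {t2 - t1:.2f}s  same result: {np.isclose(row_total, col_total)}")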
Principal Coordinates Analysis
• The pcoa function is used to perform
a Principal Coordinates Analysis of a distance matrix
• Two-step process
• Centering of the matrix
• Computing approximate eigenvalues
Surprisingly, the first step was the most expensive!
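For context, this is roughly how pcoa is invoked from scikit-bio (a minimal usage sketch, not from the slides; output details may vary between scikit-bio versions):
# Usage sketch: PCoA on a small random distance matrix with scikit-bio.
# Real microbiome workloads use 10k-100k samples.
import numpy as np
from skbio import DistanceMatrix
from skbio.stats.ordination import pcoa

n = 500
d = np.random.rand(n, n)
d = (d + d.T) / 2            # make it symmetric
np.fill_diagonal(d, 0.0)     # and hollow
dm = DistanceMatrix(d)

ordination = pcoa(dm)                     # centering + eigendecomposition
print(ordination.samples.iloc[:3, :3])    # first few sample coordinates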
Centering the matrix
• Original implementation
simple and elegant
• But not memory efficient
• Multi-step process
• With each arithmetic operation
traversing the whole NumPy matrix
• Reads 8 and writes 5
matrix-sized buffers
import numpy as np

def e_matrix(distance_matrix):
    return distance_matrix * distance_matrix / -2

def f_matrix(E_matrix):
    row_means = E_matrix.mean(axis=1, keepdims=True)
    col_means = E_matrix.mean(axis=0, keepdims=True)
    matrix_mean = E_matrix.mean()
    return E_matrix - row_means - col_means + matrix_mean

def center_distance_matrix(distance_matrix):
    # distance_matrix must be of type np.ndarray and
    # be a 2D floating point matrix
    return f_matrix(e_matrix(distance_matrix))
OK for small matrices, but
slow once the matrix cannot fit in cache
Centering the matrix
• To keep active buffer in cache, must expose parallelism
• Move compute to inner loop
• And use a tiling pattern
• Cannot do in NumPy
• Re-implemented in Cython
0/2
import numpy as np

def center_distance_matrix(mat, centered):
    row_means = np.empty(mat.shape[0])
    global_mean = e_matrix_means_cy(mat, centered, row_means)
    f_matrix_inplace_cy(row_means, global_mean, centered)
Centering the matrix
• To keep active buffer in cache, must expose parallelism
• Move compute to inner loop
• And use a tiling pattern
• Re-implemented in Cython
• Harder to read,
but much faster
def e_matrix(distance_matrix):
    return distance_matrix * distance_matrix / -2

def f_matrix(E_matrix):
    row_means = E_matrix.mean(axis=1, keepdims=True)
    col_means = E_matrix.mean(axis=0, keepdims=True)
…
from cython.parallel import prange

def e_matrix_means_cy(float[:, :] mat,
                      float[:, :] centered, float[:] row_means):
    n_samples = mat.shape[0]
    global_sum = 0.0
    for row in prange(n_samples, nogil=True):
        row_sum = 0.0
        for col in range(n_samples):
            val = mat[row, col]
            val2 = -0.5 * val * val
            centered[row, col] = val2
            row_sum = row_sum + val2
        global_sum += row_sum
        row_means[row] = row_sum / n_samples
    return (global_sum / n_samples) / n_samples
1/2
Compute means
while creating E_matrix
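Because the kernels use prange, they must be compiled with OpenMP enabled. A minimal build sketch is shown below; the file name centering.pyx and module name are illustrative, not the actual scikit-bio layout, and the flags shown are for gcc/clang (MSVC would use /openmp).
# setup.py -- build sketch for a hypothetical "centering.pyx" holding
# e_matrix_means_cy / f_matrix_inplace_cy. prange needs OpenMP flags.
from setuptools import setup, Extension
from Cython.Build import cythonize
import numpy as np

ext = Extension(
    "centering",
    sources=["centering.pyx"],
    include_dirs=[np.get_include()],
    extra_compile_args=["-O3", "-fopenmp"],
    extra_link_args=["-fopenmp"],
)

setup(ext_modules=cythonize(ext, language_level=3))
Building in place would then be: python setup.py build_ext --inplace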
Centering the matrix
• To keep active buffer in cache, must expose parallelism
• Move compute to inner loop
• And use a tiling pattern
• Re-implemented in Cython
• Harder to read,
but much faster
def f_matrix_inplace_cy(float[:] row_means, float global_mean,
                        float[:, :] centered):
    n_samples = centered.shape[0]
    # use a tiled pattern to maximize locality of row_means,
    # since they are also column means
    for trow in prange(0, n_samples, 16, nogil=True):
        trow_max = min(trow+16, n_samples)
        for tcol in range(0, n_samples, 16):
            tcol_max = min(tcol+16, n_samples)
            for row in range(trow, trow_max, 1):
                gr_mean = global_mean - row_means[row]
                for col in range(tcol, tcol_max, 1):
                    centered[row, col] = centered[row, col] + \
                        (gr_mean - row_means[col])
def f_matrix(E_matrix):
    row_means = E_matrix.mean(axis=1, keepdims=True)
    col_means = E_matrix.mean(axis=0, keepdims=True)
    matrix_mean = E_matrix.mean()
    return E_matrix - row_means - col_means + matrix_mean
2/2
Extensively use tiling
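A quick way to gain confidence in the rewrite is to compare the two paths on a matrix small enough for NumPy. This is a sketch only: it assumes the Cython code above has been compiled into a module (here called centering, a hypothetical name) and that e_matrix/f_matrix from the earlier slide are in scope.
# Cross-check sketch: the Cython centering should reproduce f_matrix(e_matrix(...)).
import numpy as np
from centering import center_distance_matrix as center_cy   # hypothetical module name

n = 2000
d = np.random.rand(n, n).astype(np.float32)
d = (d + d.T) / 2            # symmetric
np.fill_diagonal(d, 0.0)     # hollow

expected = f_matrix(e_matrix(d.astype(np.float64)))   # original NumPy path

centered = np.empty_like(d)
center_cy(d, centered)                                # Cython path, fills "centered" in place

print(np.allclose(expected, centered, atol=1e-3))     # loose tolerance: float32 vs float64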
Observed speedup
• Between 2x and 30x, depending on available HW
• Bigger improvement on larger problems in full-CPU mode
CPU, number of used cores, matrix size      | Original | Latest
AMD EPYC 7252, 1 core, 25k x 25k            | 4.1      | 2.1
AMD EPYC 7252, 8 cores, 25k x 25k           | 4.0      | 0.3
Intel Xeon Gold 6242, 1 core, 25k x 25k     | 7.7      | 2.3
Intel Xeon Gold 6242, 16 cores, 25k x 25k   | 7.2      | 0.3
AMD EPYC 7252, 1 core, 70k x 70k            | 100      | 30
AMD EPYC 7252, 8 cores, 70k x 70k           | 125      | 4.2
Intel Xeon Gold 6242, 1 core, 100k x 100k   | 255      | 78
Intel Xeon Gold 6242, 16 cores, 100k x 100k | 254      | 9.1
Note: Original implementation cannot make effective use of multiple cores
The Mantel test
• The mantel function computes the correlation
between two distance matrices using the Mantel test.
• The statistical significance is derived from a Monte Carlo procedure,
where the input matrices are permuted K times
• Default K=999
• Implication:
• The correlation function is typically
computed 1k times on a whole matrix
Extremely slow
if the matrix cannot fit in cache
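For reference, this is how mantel is typically called from scikit-bio; with the default permutations=999 each call pays for roughly 1,000 full-matrix correlations (a minimal sketch, not from the slides, using small random matrices):
# Usage sketch: the Mantel test in scikit-bio.
import numpy as np
from skbio import DistanceMatrix
from skbio.stats.distance import mantel

def random_dm(n, seed):
    rng = np.random.default_rng(seed)
    d = rng.random((n, n))
    d = (d + d.T) / 2
    np.fill_diagonal(d, 0.0)
    return DistanceMatrix(d)

x, y = random_dm(300, 0), random_dm(300, 1)
r, p_value, n = mantel(x, y, method='pearson', permutations=999)
print(r, p_value, n)   # correlation, permutation p-value, number of samples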
Original mantel implementation
• Again, simple and elegant,
but not memory efficient
• Note that it relies
on external functions
for its core computation
from scipy.stats import pearsonr

def _mantel_stats_pearson(x, y, permutations=999):
    # pearsonr operates on the flat representation
    # of the upper triangle of the symmetric matrix;
    # it returns (r, p-value), only r is used here
    x_flat = x.condensed_form()
    y_flat = y.condensed_form()
    orig_stat = pearsonr(x_flat, y_flat)[0]
    perm_gen = (pearsonr(x.permute(condensed=True), y_flat)[0]
                for _ in range(permutations))
    permuted_stats = np.fromiter(perm_gen, np.float,
                                 count=permutations)
    return [orig_stat, permuted_stats]

def mantel(x, y, permutations=999):
    orig_stat, permuted_stats = _mantel_stats_pearson(x, y,
                                                      permutations)
    count_better = (np.absolute(permuted_stats)
                    >= np.absolute(orig_stat)).sum()
    p_value = (count_better + 1) / (permutations + 1)
    return orig_stat, p_value
Memory locality requires partial compute,
so must inline and re-implement
external functions, too
import numpy as np
from numpy.linalg import norm

def pearsonr(x_flat, y_flat):
    xm = x_flat - x_flat.mean()
    ym = y_flat - y_flat.mean()
    normxm = norm(xm)
    normym = norm(ym)
    xnorm = xm / normxm
    ynorm = ym / normym
    return np.dot(xnorm, ynorm)
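The inlined version can be cross-checked against SciPy on random vectors (a quick sketch; it assumes the pearsonr defined on this slide, with its imports, is in scope, and uses the fact that scipy.stats.pearsonr returns the correlation as its first element):
# Sanity-check sketch: the inlined pearsonr should match SciPy's r value.
import numpy as np
from scipy.stats import pearsonr as scipy_pearsonr

x_flat = np.random.rand(10_000)
y_flat = np.random.rand(10_000)

r_inline = pearsonr(x_flat, y_flat)          # inlined version from this slide
r_scipy = scipy_pearsonr(x_flat, y_flat)[0]  # SciPy reference
print(np.isclose(r_inline, r_scipy))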
Additional optimization opportunities
• With fewer black boxes,
we can optimize more, too
• The 2nd parameter is always the same
• The mean and norm of a
matrix do not change when the
matrix is permuted (see the sketch below)
import numpy as np
from numpy.linalg import norm

def pearsonr(x_flat, y_flat):
    xm = x_flat - x_flat.mean()
    ym = y_flat - y_flat.mean()
    normxm = norm(xm)
    normym = norm(ym)
    xnorm = xm / normxm
    ynorm = ym / normym
    return np.dot(xnorm, ynorm)
Reuse
results
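The "reuse" idea can be made concrete: a permutation only reorders the entries of x, so its mean and norm are unchanged, and the per-element normalization (x - xmean)/normxm can be folded into two precomputed scalars. This is a minimal numerical check of that identity (the names mul and add correspond to the constants used by the Cython kernel on the next slide):
# Identity behind the precomputation: (x - xmean)/normxm == x*mul + add,
# with mul = 1/normxm and add = -xmean/normxm, both computable once,
# because permuting x changes neither its mean nor its norm.
import numpy as np

x = np.random.rand(10_000)
xmean = x.mean()
normxm = np.linalg.norm(x - xmean)

mul = 1.0 / normxm
add = -xmean / normxm

perm = np.random.permutation(x.size)
xp = x[perm]
lhs = (xp - xp.mean()) / np.linalg.norm(xp - xp.mean())   # recomputed per permutation
rhs = xp * mul + add                                      # reuses the precomputed scalars
print(np.allclose(lhs, rhs))                              # True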
Optimized mantel implementation
• Inlining makes it harder to read
• But not much longer
from cython.parallel import prange

def mantel_perm_cy(float[:, :] x_data, Py_ssize_t[:, :] perm_order,
                   float xmean, float normxm,
                   float[:] ym_normalized,
                   float[:] permuted_stats):
    perms_n = perm_order.shape[0]
    out_n = perm_order.shape[1]
    # permutation changes neither the mean nor the norm of x,
    # so the normalization folds into two precomputed constants
    mul = 1.0 / normxm
    add = -xmean / normxm
    for p in prange(perms_n, nogil=True):
        my_ps = 0.0
        for row in range(out_n-1):
            vrow = perm_order[p, row]
            idx = row*(out_n-1) - ((row-1)*row)//2
            for icol in range(out_n-row-1):
                col = icol+row+1
                yval = ym_normalized[idx+icol]
                xval = x_data[vrow, perm_order[p, col]]*mul + add
                my_ps = yval*xval + my_ps
        permuted_stats[p] = my_ps
from numpy.linalg import norm

def _mantel(x, y, permutations):
    x_flat = x.condensed_form()
    y_flat = y.condensed_form()
    xmean = x_flat.mean()
    xm = x_flat - xmean
    ym = y_flat - y_flat.mean()
    normxm = norm(xm)
    normym = norm(ym)
    xnorm = xm / normxm
    ynorm = ym / normym
    orig_stat = np.dot(xnorm, ynorm)
    mat_n = x._data.shape[0]
    perm_order = np.empty([permutations, mat_n], dtype=np.intp)
    for row in range(permutations):
        perm_order[row, :] = np.random.permutation(mat_n)
    permuted_stats = np.empty([permutations])
    mantel_perm_cy(x._data, perm_order, xmean, normxm,
                   ynorm, permuted_stats)
    return [orig_stat, permuted_stats]
Permuted multi-matrix
dot operation
Pre-compute anything
that can be reused
Observed speedup
• Up to 200x
• Bigger improvement
on larger problems
in full-CPU mode
• Game changing
for large problems
• We would not even
have considered computing
those before
Again, original implementation cannot make effective use of multiple cores
CPU, num. of used cores, matrix size        | Original | Latest
AMD EPYC 7252, 1 core, 10k x 10k            | 2.19k    | 149
AMD EPYC 7252, 8 cores, 10k x 10k           | 2.17k    | 24
Intel Xeon Gold 6242, 1 core, 10k x 10k     | 4.2k     | 170
Intel Xeon Gold 6242, 16 cores, 10k x 10k   | 4.2k     | 20
AMD EPYC 7252, 1 core, 20k x 20k            | 8.9k     | 615
AMD EPYC 7252, 8 cores, 20k x 20k           | 8.8k     | 97
Intel Xeon Gold 6242, 1 core, 20k x 20k     | 17.9k    | 933
Intel Xeon Gold 6242, 16 cores, 20k x 20k   | 17.5k    | 108
AMD EPYC 7252, 8 cores, 70k x 70k           | -        | 1.44k
Intel Xeon Gold 6242, 16 cores, 100k x 100k | -        | 4.2k
Validation costs
• Even something as trivial as checking for symmetry can be expensive
if one does not consider memory locality
• A 100k-sample distance matrix would take several minutes!
CPU, num. of used cores, matrix size        | Original | Latest
AMD EPYC 7252, 1 core, 25k x 25k            | 1.8      | 2.6
AMD EPYC 7252, 8 cores, 25k x 25k           | 1.8      | 0.4
Intel Xeon Gold 6242, 1 core, 25k x 25k     | 4.4      | 3.2
Intel Xeon Gold 6242, 16 cores, 25k x 25k   | 4.4      | 0.24
AMD EPYC 7252, 1 core, 70k x 70k            | 18       | 21
AMD EPYC 7252, 8 cores, 70k x 70k           | 18       | 3.2
Intel Xeon Gold 6242, 1 core, 100k x 100k   | 219      | 78
Intel Xeon Gold 6242, 16 cores, 100k x 100k | 212      | 5.4
Check performed on each
object instantiation!
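The check runs inside the DistanceMatrix constructor, so every instantiation pays for it. A small timing sketch (not from the slides; the matrix size is kept modest, and an all-zeros matrix is used because it is trivially symmetric and hollow):
# Timing sketch: symmetry/hollowness validation happens when a DistanceMatrix
# object is created, so its cost is paid on every instantiation.
import time
import numpy as np
from skbio import DistanceMatrix

n = 10_000                                   # scale up (with enough RAM) to approach the slide's numbers
data = np.zeros((n, n), dtype=np.float64)    # trivially symmetric and hollow

t0 = time.perf_counter()
dm = DistanceMatrix(data)                    # validation runs here
print(f"instantiation took {time.perf_counter() - t0:.2f}s")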
Validation costs
• The solution, again,
was Cython with tiling
import numpy as np

def is_symmetric_and_hollow(mat):
    # mat must be of type np.ndarray
    # and be a 2D floating point matrix
    not_sym = (mat.T != mat).any()
    not_hollow = (np.trace(mat) != 0)
    return [not not_sym, not not_hollow]
import numpy as np
from cython.parallel import prange

def is_symmetric_and_hollow_cy(float[:, :] mat):
    in_n = mat.shape[0]
    is_sym = True
    is_hollow = True
    # use a tiled approach to maximize memory locality
    for trow in prange(0, in_n, 16, nogil=True):
        trow_max = min(trow+16, in_n)
        for tcol in range(0, in_n, 16):
            tcol_max = min(tcol+16, in_n)
            for row in range(trow, trow_max, 1):
                for col in range(tcol, tcol_max, 1):
                    is_sym &= (mat[row, col] == mat[col, row])
            if (trow == tcol):  # diagonal block
                for col in range(tcol, tcol_max, 1):
                    is_hollow &= (mat[col, col] == 0)
    return [(is_sym == True), (is_hollow == True)]
Conclusions
• We re-implemented two often-used functions, pcoa and mantel,
plus the related object-validation function
• Now up to 100x faster
• The root cause was over-reliance on standard libraries
• In particular, NumPy
• The code was simple and elegant, but not memory efficient
• The new implementation uses lower-level constructs
• Used Cython to push most of the compute into the inner loop
• This will greatly help the microbiome community
• Improvements have been accepted into the scikit-bio code base
Acknowledgements
• This work was partially funded by the
US National Science Foundation (NSF) under grants
DBI-2038509, OAC-1541349 and OAC-1826967.