Prof. Alba presented parallel biological sequence alignment with the Smith-Waterman algorithm and introduced CUDAlign, our fine-grained multi-GPU strategy. This work is part of a research project at the University of Brasilia.
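The Smith-Waterman recurrence that CUDAlign parallelizes across GPUs can be sketched in a few lines. This is a plain sequential version for illustration only, not CUDAlign itself; the scoring parameters are illustrative assumptions.

```python
# Minimal Smith-Waterman local alignment: fill the scoring matrix H with
# match/mismatch/gap scores, clamp at zero, and return the best local score.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            h[i][j] = max(0, diag, h[i - 1][j] + gap, h[i][j - 1] + gap)
            best = max(best, h[i][j])
    return best
```

The fine-grained parallelism in CUDAlign comes from the fact that all cells on one anti-diagonal of `h` depend only on the previous two anti-diagonals, so they can be computed concurrently.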
Many machine learning inference workloads compute predictions with a limited number of models deployed together in the same system. These models often share common structure and state. This scenario leaves considerable room for runtime and memory optimizations, which current systems fail to exploit because they treat ML models and tasks as black boxes and are therefore unaware of optimization and sharing opportunities.
In contrast, Pretzel adopts a white-box description of ML models, which lets the framework optimize across deployed models and running tasks, saving memory and increasing overall system performance. In this talk we present the motivation behind Pretzel, its current design, and possible future developments.
DESIGN OF DELAY COMPUTATION METHOD FOR CYCLOTOMIC FAST FOURIER TRANSFORM - sipij
In this paper, the delay computation method for the Common Subexpression Elimination (CSE) algorithm is implemented on the Cyclotomic Fast Fourier Transform (CFFT). Combined with the delay computing method, the CSE algorithm is known as the Gate-Level Delay Computation with Common Subexpression Elimination (GLDC-CSE) algorithm. Common subexpression elimination is an effective optimization method used to reduce adders in the cyclotomic Fourier transform. The delay computing method is based on a delay matrix and is suitable for computer implementation. The gate-level delay computation method is used to find the critical path delay, and it is analyzed over various finite field elements. The presented algorithm is demonstrated through a case study of the CFFT over a finite field. If the CFFT is implemented directly, the system has high additive complexity. By applying the GLDC-CSE algorithm to the CFFT, the additive complexity is reduced, along with the area and the area-delay product.
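The core idea behind CSE-based adder reduction can be illustrated with a toy greedy pass. This is a hedged sketch of the general technique, not the paper's GLDC-CSE algorithm: each output is modeled as a sum of input signals, and the pair of terms shared by the most outputs is a candidate to factor out as one new intermediate adder.

```python
# Find the pair of input terms that co-occurs in the most output sums;
# factoring that pair into one shared adder saves (uses - 1) adders.
from itertools import combinations
from collections import Counter

def most_common_pair(outputs):
    """outputs: list of frozensets of input names."""
    counts = Counter()
    for out in outputs:
        for pair in combinations(sorted(out), 2):
            counts[pair] += 1
    return counts.most_common(1)[0]

outputs = [frozenset("abc"), frozenset("abd"), frozenset("ab")]
pair, uses = most_common_pair(outputs)
# ('a', 'b') appears in all three outputs; computing a + b once and reusing
# it replaces one adder per use, saving uses - 1 = 2 adders overall.
```

Real CSE for CFFTs works over GF(2) matrices and iterates this extraction until no profitable pair remains; the single greedy step above only shows the counting idea.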
International Journal of Engineering Research and Applications (IJERA) is an open-access, online, peer-reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nanotechnology & Science, Power Electronics, Electronics & Communication Engineering, Computational Mathematics, Image Processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design, etc.
FAST AND EFFICIENT IMAGE COMPRESSION BASED ON PARALLEL COMPUTING USING MATLAB - Journal For Research
Image compression is used in many applications, for example satellite imaging, medical imaging, and video, where images require substantial storage space. Image compression techniques fall into two types, lossy and lossless. Both are widely used for compressing images, but neither is fast: compression and decompression take considerable time. For fast and efficient image compression, a parallel computing technique in MATLAB is used. In this paper we discuss the regular image compression technique, three alternatives for parallel computing in MATLAB, and a comparison of image compression with and without parallel computing.
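The tile-parallel idea the paper applies in MATLAB can be sketched in Python (an assumption for illustration; the paper uses MATLAB's parallel facilities). The image byte stream is split into tiles that workers compress independently; `zlib` releases the GIL during compression, so a thread pool genuinely parallelizes the lossless case.

```python
# Split data into tiles, compress tiles in parallel, and reassemble losslessly.
import zlib
from multiprocessing.pool import ThreadPool

def compress_tile(tile: bytes) -> bytes:
    return zlib.compress(tile, level=6)

def parallel_compress(data: bytes, tile_size: int = 1 << 16, workers: int = 4):
    tiles = [data[i:i + tile_size] for i in range(0, len(data), tile_size)]
    with ThreadPool(workers) as pool:
        return pool.map(compress_tile, tiles)

def decompress(tiles) -> bytes:
    return b"".join(zlib.decompress(t) for t in tiles)
```

Tiling trades a little compression ratio (no cross-tile redundancy) for near-linear speedup in the number of workers, the same trade-off the paper's MATLAB alternatives make.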
Machine Learning (ML) models are often composed as pipelines of operators, from "classical" ML operators to pre-processing and featurization operators. Current systems deploy pipelines as black boxes, where the same implementation used for training is run for inference. This solution is convenient but leaves large room to improve performance and resource usage. This talk presents Pretzel, a framework for deploying ML pipelines that is inspired by database systems: Pretzel inspects and optimizes pipelines end-to-end, much like queries, and manages resources common to multiple pipelines, such as operators' state. Pretzel is joint work with the University of Seoul and Microsoft Research and was recently presented at OSDI '18. After the overview, this talk also shows experimental results comparing Pretzel against state-of-the-art ML solutions and discusses limitations and extensions.
This paper looks interesting from its title alone. The paper we introduce at today's deep learning paper-reading group is DEAR: Deep Reinforcement Learning for Online Advertising Impression in Recommender Systems, an online recommender system that uses reinforcement learning. Some details are not publicly disclosed, but the ideas alone make it an enjoyable talk. Changyeon Kim of the Fundamentals team prepared the review, covering everything from the basic concepts of reinforcement learning to a detailed, in-depth walkthrough of the paper!
Thank you in advance for your interest!
One more thing: the deep learning paper-reading group also runs an open-chat listeners' room. Due to a recent rise in malicious promotional bot accounts, the room is now password-protected.
Please show the deep learning listeners' room some interest as well!
Listeners' room link: https://open.kakao.com/o/gp6GHMMc
Listeners' room password: 0501
Scalable and Adaptive Graph Querying with MapReduce - Kyong-Ha Lee
We address the problem of processing multiple graph queries over a massive set of data graphs in this letter. As the number of data graphs grows rapidly, it is often hard to process graph queries with serial algorithms in a timely manner. We propose a distributed graph querying algorithm that employs feature-based comparison and a filter-and-verify scheme working on the MapReduce framework. Moreover, we devise an efficient scheme that adaptively tunes a proper feature size at runtime by sampling data graphs. With various experiments, we show that the proposed method outperforms conventional algorithms in terms of both scalability and efficiency.
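The filter-and-verify scheme the letter builds on can be sketched on a single machine. This is a hedged illustration: the feature here (the set of edge labels) and the verification predicate are stand-in assumptions, not the paper's actual feature encoding or its MapReduce implementation.

```python
# Filter-and-verify: cheap feature containment prunes data graphs, and the
# costly verification runs only on the surviving candidates.
def features(graph):
    """graph: set of (u, v, label) edges; feature = set of edge labels."""
    return {label for (_, _, label) in graph}

def filter_and_verify(query, data_graphs, verify):
    qf = features(query)
    candidates = [g for g in data_graphs if qf <= features(g)]  # filter step
    return [g for g in candidates if verify(query, g)]          # verify step
```

The filter is sound (a graph lacking a query label can never contain the query), so pruning never loses answers; in MapReduce, the filter naturally maps over partitions of data graphs and verification runs only on the survivors.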
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar... - Kyong-Ha Lee
Subgraph matching is a fundamental operation for querying graph-structured data. Due to potential errors and noise in real-world graph data, exact subgraph matching is sometimes not appropriate in practice. In this paper we consider an approximate subgraph matching model that allows missing edges. Under this model, approximate subgraph matching finds all occurrences of a given query graph in a database graph, allowing missing edges. A straightforward approach to this problem is to first generate query subgraphs of the query graph by deleting edges and then perform exact subgraph matching for each query subgraph. In this paper we propose a sharing-based approach to approximate subgraph matching, called SASUM. Our method is based on the fact that query subgraphs are highly overlapped. Due to this overlapping nature, the matches of a query subgraph can be computed from the matches of a smaller query subgraph, which reduces the number of query subgraphs that need costly exact subgraph matching. Our method uses a lattice framework to identify sharing opportunities between query subgraphs. To further reduce the number of graphs that need exact subgraph matching, SASUM generates small base graphs that are shared by query subgraphs and chooses the minimum number of base graphs whose matches are used to derive the matching results of all query subgraphs. A comprehensive set of experiments shows that our approach outperforms the state-of-the-art approach by orders of magnitude in terms of query execution time.
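The straightforward enumeration step that SASUM optimizes away can be sketched directly; the heavy overlap among the generated query subgraphs is exactly what the lattice-based sharing exploits. Function and variable names here are illustrative assumptions.

```python
# Enumerate the query subgraphs obtained by deleting up to max_missing edges.
from itertools import combinations

def query_subgraphs(edges, max_missing):
    """edges: frozenset of query edges; returns each subgraph (as a frozenset)
    obtained by deleting at most max_missing edges."""
    subgraphs = []
    for k in range(max_missing + 1):
        for removed in combinations(sorted(edges), k):
            subgraphs.append(frozenset(edges) - frozenset(removed))
    return subgraphs

edges = frozenset({("a", "b"), ("b", "c"), ("c", "a")})
subs = query_subgraphs(edges, 1)
# One full graph plus three one-edge-deleted graphs: four query subgraphs,
# any two of which share two of three edges. That shared structure is what
# lets matches of smaller subgraphs be reused for larger ones.
```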
PR-232: AutoML-Zero: Evolving Machine Learning Algorithms From Scratch - Sunghoon Joo
Paper link: https://arxiv.org/abs/2003.03384
Video presentation link: https://youtu.be/J__uJ79m01Q
To briefly introduce Auto DeepLab first: it is a model for the semantic segmentation task. The authors set out to generate the segmentation network itself through machine learning. Architecture search is a representative AutoML technique, and that is why the paper is titled Auto DeepLab: it applies AutoML methods. On the AutoML side the authors drew on the DARTS paper, and on the segmentation side they drew heavily on DeepLab V3. Sunok Kim of the image processing team provided a detailed review of the paper!
https://youtu.be/2886fuyKo9g
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
Uncertainty-quantification tasks are often "many query" in nature, as they require repeated evaluations of a model that often corresponds to a parameterized system of nonlinear equations (e.g., arising from the spatial discretization of a PDE). To make this task tractable for large-scale models, low-fidelity models (e.g., reduced-order models, coarse-mesh solutions) must be employed. However, such approximations introduce additional error, which may be treated as a source of epistemic uncertainty that must be quantified to ensure rigor in the ultimate UQ result. We present a new approach to quantify the error (i.e., epistemic uncertainty) introduced by these low-fidelity model approximations. The approach (1) engineers features that are informative of the error using concepts related to dual-weighted residuals and rigorous error bounds, and (2) applies machine learning regression techniques (e.g., artificial neural networks, random forests, support vector machines) to construct a statistical model of the error from these features. We consider both (signed) errors in quantities of interest, as well as global state-space error norms. We present several examples to demonstrate the effectiveness of the proposed approach compared to more conventional feature and regression choices. In each of the examples, the predicted errors have a coefficient of determination value of at least 0.998.
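The two-step recipe can be sketched under simplifying assumptions: (1) engineer a feature informative of the error (here a stand-in residual-norm feature, not the paper's dual-weighted residuals), and (2) regress the observed error on that feature (here ordinary least squares in place of the paper's neural networks, random forests, or SVMs).

```python
# Step 2 in miniature: fit error ~ a * feature + b by ordinary least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Synthetic training pairs: residual-norm feature vs. observed QoI error
# (purely illustrative data; here the error happens to be 0.5 * feature).
feats = [0.1, 0.2, 0.4, 0.8]
errors = [0.05, 0.1, 0.2, 0.4]
a, b = fit_line(feats, errors)
```

Once fitted, the regression predicts the epistemic error of a new low-fidelity solve from its cheap feature alone, which is what makes the many-query UQ loop affordable.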
Revised presentation slide for NLP-DL, 2016/6/22.
Recent Progress (from 2014) in Recurrent Neural Networks and Natural Language Processing.
Profile http://www.cl.ecei.tohoku.ac.jp/~sosuke.k/
Japanese ver. https://www.slideshare.net/hytae/rnn-63761483
Comprehensive Performance Evaluation on Multiplication of Matrices using MPI - ijtsrd
Matrix multiplication is a concept used in technology applications such as digital image processing, digital signal processing, and graph problem solving. Multiplying huge matrices requires a lot of computing time, as its complexity is O(n³). Because most engineering and science applications require high computational throughput in minimum time, many sequential and parallel algorithms have been developed. In this paper, methods of matrix multiplication are selected, implemented, and analyzed. A performance analysis is carried out, and recommendations are given for using the OpenMP and MPI methods of parallel computing. Adamu Abubakar I | Oyku A | Mehmet K | Amina M. Tako, "Comprehensive Performance Evaluation on Multiplication of Matrices using MPI"
Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-2, February 2020.
URL: https://www.ijtsrd.com/papers/ijtsrd30015.pdf
Paper URL: https://www.ijtsrd.com/engineering/electrical-engineering/30015/comprehensive-performance-evaluation-on-multiplication-of-matrices-using-mpi/adamu-abubakar-i
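The row-partitioned scheme such MPI evaluations typically use can be sketched on one machine, with a thread pool standing in for MPI ranks (an assumption for illustration; with real MPI, each rank would receive its block of A's rows plus all of B and compute its block of C independently).

```python
# Each "rank" computes a contiguous block of C's rows; results are concatenated.
from multiprocessing.pool import ThreadPool

def matmul_rows(args):
    a_block, b = args
    n, p = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(n)) for j in range(p)]
            for row in a_block]

def parallel_matmul(a, b, ranks=2):
    chunk = (len(a) + ranks - 1) // ranks
    blocks = [a[i:i + chunk] for i in range(0, len(a), chunk)]
    with ThreadPool(ranks) as pool:
        parts = pool.map(matmul_rows, [(blk, b) for blk in blocks])
    return [row for part in parts for row in part]
```

Row partitioning needs no communication between ranks during the compute phase, which is why it is a common baseline in MPI performance evaluations.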
Description: WeightWatcher (WW) is an open-source diagnostic tool for analyzing Deep Neural Networks (DNNs) without needing access to training or even test data. It can be used to: analyze pre-trained PyTorch, Keras, and DNN models (Conv2D and Dense layers); monitor models, and model layers, to see if they are over-trained or over-parameterized; predict test accuracies across different models, with or without training data; and detect potential problems when compressing or fine-tuning pre-trained models. See https://weightwatcher.ai
A New Cross Diamond Search Motion Estimation Algorithm for HEVC - IJERA Editor
In this project, a novel approach to motion estimation is proposed. A few block matching algorithms exist for motion estimation. Here, a new cross diamond search algorithm is implemented; compared to diamond search, it uses fewer search points, which reduces the computational complexity. The performance of the algorithm is compared with other algorithms in terms of search points. The algorithm achieves performance close to that of three-step search and diamond search. Compared to these algorithms, cross diamond search uses fewer logic elements and has lower delay and power dissipation.
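Pattern-based block matching can be sketched with a plain diamond search; the paper's cross diamond variant adds a small cross-shaped pattern first to cut search points further. The SAD cost, the search pattern, and the simplified stopping rule below are illustrative assumptions.

```python
# Diamond-pattern block matching: move the search center to the lowest-SAD
# candidate until the center itself is best, then report the motion vector.
def sad(cur, ref, ax, ay, bx, by, size):
    return sum(abs(cur[ay + j][ax + i] - ref[by + j][bx + i])
               for j in range(size) for i in range(size))

def diamond_search(cur, ref, x, y, size=2):
    h, w = len(ref), len(ref[0])
    cx, cy = x, y
    pattern = [(0, 0), (2, 0), (-2, 0), (0, 2), (0, -2),
               (1, 1), (1, -1), (-1, 1), (-1, -1)]
    while True:
        best = min((sad(cur, ref, x, y, cx + dx, cy + dy, size),
                    (cx + dx, cy + dy))
                   for dx, dy in pattern
                   if 0 <= cx + dx <= w - size and 0 <= cy + dy <= h - size)
        if best[1] == (cx, cy):
            break
        cx, cy = best[1]
    return (cx - x, cy - y)  # motion vector

cur = [[9, 9, 0, 0], [9, 9, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
ref = [[0, 0, 9, 9], [0, 0, 9, 9], [0, 0, 0, 0], [0, 0, 0, 0]]
mv = diamond_search(cur, ref, 0, 0)  # the 2x2 block moved 2 pixels right
```

The search-point savings of cross diamond come from probing a cheap small cross before committing to the larger diamond pattern, since most real motion vectors are small and cross-shaped around the origin.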
TOWARDS MORE ACCURATE CLUSTERING METHOD BY USING DYNAMIC TIME WARPING - ijdkp
An intrinsic problem of classifiers based on machine learning (ML) methods is that their learning time grows as the size and complexity of the training dataset increases. For this reason, it is important to have efficient computational methods and algorithms that can be applied to large datasets, such that it is still possible to complete the machine learning tasks in reasonable time. In this context, we present in this paper a more accurate, simple process to speed up ML methods. An unsupervised clustering algorithm is combined with the Expectation-Maximization (EM) algorithm to develop an efficient Hidden Markov Model (HMM) training. The idea of the proposed process consists of two steps. In the first step, training instances with similar inputs are clustered, and a weight factor representing the frequency of these instances is assigned to each representative cluster. The Dynamic Time Warping technique is used as a dissimilarity function to cluster similar examples. In the second step, all formulas in the classical HMM training algorithm (EM) associated with the number of training instances are modified to include the weight factor in the appropriate terms. This process significantly accelerates HMM training while maintaining the same initial, transition, and emission probability matrices as those obtained with the classical HMM training algorithm. Accordingly, the classification accuracy is preserved. Depending on the size of the training set, speedups of up to 2200 times are possible when the size is about 100,000 instances. The proposed approach is not limited to training HMMs, but can be employed for a large variety of ML methods.
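The Dynamic Time Warping dissimilarity used in the clustering step follows directly from its standard recurrence: the cost of aligning prefixes `a[:i]` and `b[:j]` is the local distance plus the cheapest of the insert, delete, or match continuations.

```python
# Classic O(n*m) DTW distance between two numeric sequences.
def dtw(a, b):
    inf = float("inf")
    n, m = len(a), len(b)
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Best of: stretch a, stretch b, or advance both.
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]
```

Unlike Euclidean distance, DTW treats time-stretched copies of the same pattern as similar, which is why it is a good dissimilarity for grouping training sequences into the weighted clusters the paper uses.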
Molecular dynamics (MD) is a very useful tool to understand various phenomena in atomistic detail. In MD, we can overcome the size- and time-scale problems by efficient parallelization. In this lecture, I’ll explain various parallelization methods of MD with some examples of GENESIS MD software optimization on Fugaku.
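Spatial domain decomposition, central to the parallelization methods such lectures cover, can be shown in a minimal sketch: the simulation box is partitioned into cells so that each process owns the atoms in its cell and only needs neighbor-cell halos for short-range forces. The box and cell sizes below are illustrative assumptions, not GENESIS parameters.

```python
# Assign each atom to the spatial cell (and hence process) that owns it.
def assign_to_cells(positions, box=9.0, cells_per_dim=3):
    """positions: list of (x, y, z); returns {atom_index: (cx, cy, cz)}."""
    side = box / cells_per_dim
    owner = {}
    for idx, (x, y, z) in enumerate(positions):
        cx = min(int(x / side), cells_per_dim - 1)
        cy = min(int(y / side), cells_per_dim - 1)
        cz = min(int(z / side), cells_per_dim - 1)
        owner[idx] = (cx, cy, cz)
    return owner

atoms = [(0.5, 0.5, 0.5), (4.0, 4.0, 4.0), (8.9, 0.1, 0.1)]
owner = assign_to_cells(atoms)
```

Because short-range interactions only cross adjacent cell boundaries, each process communicates with a constant number of neighbors regardless of system size, which is what lets MD scale to machines like Fugaku.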
A Novel Framework and Policies for On-line Block of Cores Allotment for Multi... - ijcsa
The computer industry has widely accepted that future performance increases must largely come from increasing the number of processing cores on a die. This has led to NoC processors. Task scheduling is one of the most challenging problems facing parallel programmers today and is known to be NP-complete. A good principle is space-sharing of cores and scheduling multiple DAGs simultaneously on a NoC processor. Hence the need to find the optimal number of cores for a DAG under a particular scheduling method, and further which region of cores on the NoC to allot to a DAG. In this work, a method is proposed to find a near-optimal minimal block of cores for a DAG on a NoC processor. Further, a time-efficient framework and three on-line block allotment policies for submitted DAGs are evaluated experimentally. The objective of the policies is to improve NoC throughput. The policies are evaluated on a simulator and found to deliver better performance than policies found in the literature.
DUAL POLYNOMIAL THRESHOLDING FOR TRANSFORM DENOISING IN APPLICATION TO LOCAL ... - ijma
Thresholding operators have been used successfully for denoising signals, mostly in the wavelet domain.
These operators transform a noisy coefficient into a denoised coefficient with a mapping that depends on
signal statistics and the value of the noisy coefficient itself. This paper demonstrates that a polynomial
threshold mapping can be used for enhanced denoising of Principal Component Analysis (PCA) transform
coefficients. In particular, two polynomial threshold operators are used here to map the coefficients
obtained with the popular local pixel grouping method (LPG-PCA), which eventually improves the
denoising power of LPG-PCA. The method reduces the computational burden of LPG-PCA, by eliminating
the need for a second iteration in most cases. Quality metrics and visual assessment show the improvement.
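As a toy illustration (not the operators from the paper), a polynomial threshold operator generalizes soft thresholding: coefficients below a noise-derived threshold T are zeroed, and larger ones are shrunk by a polynomial of the excess magnitude. The function name, coefficients, and threshold below are hypothetical:

```python
def poly_threshold(x: float, T: float, coeffs=(1.0,)) -> float:
    """Map a noisy coefficient x to a denoised one (hypothetical sketch).

    T would be derived from noise statistics (e.g., the noise standard
    deviation); coeffs weight powers of the magnitude above T. With
    coeffs=(1.0,) this reduces to classical soft thresholding.
    """
    ax = abs(x)
    if ax <= T:
        return 0.0  # small coefficients are assumed noise-dominated
    # polynomial shrinkage of the excess magnitude |x| - T
    shrunk = sum(c * (ax - T) ** (k + 1) for k, c in enumerate(coeffs))
    return shrunk if x >= 0 else -shrunk

print(poly_threshold(5.0, 2.0))   # soft threshold: 3.0
print(poly_threshold(-1.0, 2.0))  # below threshold: 0.0
```

Fitting the polynomial coefficients to signal statistics is what distinguishes the operators studied in the paper from plain soft thresholding.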
Semantic Segmentation on Satellite Imagery - RAHUL BHOJWANI
This is an Image Semantic Segmentation project targeted on Satellite Imagery. The goal was to detect the pixel-wise segmentation map for various objects in Satellite Imagery including buildings, water bodies, roads etc. The data for this was taken from the Kaggle competition <https://www.kaggle.com/c/dstl-satellite-imagery-feature-detection>.
We implemented FCN, U-Net and Segnet Deep learning architectures for this task.
Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU a... - Masahito Ohue
Masahito Ohue, Marina Yamasawa, Kazuki Izawa, Yutaka Akiyama: Parallelized pipeline for whole genome shotgun metagenomics with GHOSTZ-GPU and MEGAN,
In Proceedings of the 19th annual IEEE International Conference on Bioinformatics and Bioengineering (IEEE BIBE 2019), 152-156, 2019. doi: 10.1109/BIBE.2019.00035
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds - Subhajit Sahu
Highlighted notes on Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds.
While doing research work under Prof. Kishore Kothapalli.
Laxman Dhulipala, David Durfee, Janardhan Kulkarni, Richard Peng, Saurabh Sawlani, Xiaorui Sun:
Parallel Batch-Dynamic Graphs: Algorithms and Lower Bounds. SODA 2020: 1300-1319
In this paper we study the problem of dynamically maintaining graph properties under batches of edge insertions and deletions in the massively parallel model of computation. In this setting, the graph is stored on a number of machines, each having space strongly sublinear with respect to the number of vertices, that is, n^ε for some constant 0 < ε < 1. Our goal is to handle batches of updates and queries where the data for each batch fits onto one machine in constant rounds of parallel computation, as well as to reduce the total communication between the machines. This objective corresponds to the gradual buildup of databases over time, while the goal of obtaining constant rounds of communication for problems in the static setting has been elusive for problems as simple as undirected graph connectivity. We give an algorithm for dynamic graph connectivity in this setting with constant communication rounds and communication cost almost linear in terms of the batch size. Our techniques combine a new graph contraction technique, an independent random sample extractor from correlated samples, as well as distributed data structures supporting parallel updates and queries in batches. We also illustrate the power of dynamic algorithms in the MPC model by showing that the batched version of the adaptive connectivity problem is P-complete in the centralized setting, but sub-linear sized batches can be handled in a constant number of rounds. Due to the wide applicability of our approaches, we believe it represents a practically-motivated workaround to the current difficulties in designing more efficient massively parallel static graph algorithms.
Brief Explanation about the Tau-Leaping Process, Parallel Processing and NVIDIA's CUDA architecture
And the use of cuTauLeaping for the simulation of biological systems
Fault-Tolerance Aware Multi Objective Scheduling Algorithm for Task Schedulin... - csandit
A Computational Grid (CG) creates a large heterogeneous and distributed paradigm to manage and execute computationally intensive applications. In grid scheduling, tasks are assigned to the proper processors in the grid system for execution, considering the execution policy and the optimization objectives. In this paper, the makespan and the fault-tolerance of the computational nodes of the grid, two important parameters for task execution, are considered and optimized. As grid scheduling is NP-Hard, meta-heuristic evolutionary techniques are often used to find a solution. We have proposed an NSGA-II for this purpose. The performance of the proposed Fault-tolerance Aware NSGA-II (FTNSGA-II) has been estimated with a program written in Matlab. The simulation results evaluate the performance of the proposed algorithm, and the results of the proposed model are compared with the existing Min-Min and Max-Min algorithms, which demonstrates the effectiveness of the model.
ICIAM 2019: A New Algorithm Model for Massive-Scale Streaming Graph Analysis - Jason Riedy
Applications in many areas analyze an ever-changing environment. On billion-vertex graphs, providing snapshots imposes a large performance cost. We propose the first formal model for graph analysis running concurrently with streaming data updates. We consider an algorithm valid if its output is correct for the initial graph plus some implicit subset of concurrent changes. We show theoretical properties of the model, demonstrate the model on various algorithms, and extend it to updating results incrementally.
Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics on ... - Databricks
Explore the trade-offs of performing linear algebra for data analysis and machine learning using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Apache Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks.
This session will examine three widely-used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). Learn how these methods are applied to terabyte-sized problems in particle physics, climate modeling and bioimaging, as use cases where interpretable analytics is of interest. The data matrices are tall-and-skinny, which enable the algorithms to map conveniently into Spark’s data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns and provide tuning guidance to obtain high performance. Based on joint work with Alex Gittens and many others.
The Libre-SOC Project aims to create an entirely Libre-Licensed, transparently-developed fully auditable Hybrid 3D CPU-GPU-VPU, using the Supercomputer-class OpenPOWER ISA as the foundation.
Our first test ASIC is a 180nm "Fixed-Point" Power ISA v3.0B processor, 5.1mm x 5.9mm, as a proof-of-concept for the team, whose primary expertise is in Software Engineering. Software Engineering training brings a radically different approach to Hardware development: extensive unit tests, source code revision control, automated development tools are normal. Libre Project Management brings even more: bug trackers, mailing lists, auditable IRC logs and a wiki are standard fare for Libre Projects that are simply not normal Industry-Standard practice.
This talk therefore goes through the workflow, from the original HDL through to the GDS-II layout, showing how we were able to keep track of the development that led to the IMEC 180nm tape-out in July 2021. In particular, by following a parallel development process involving "Real" and "Symbolic" Cell Libraries, developed by Chips4Makers, we will show how our developers did not need to sign a Foundry NDA but were still able to work side-by-side with a University that did. With this parallel development process, the University upheld its NDA obligations, and Libre-SOC was simultaneously able to honour its Transparency Objectives.
Workload Transformation and Innovations in POWER Architecture Ganesan Narayanasamy
The IT industry is going through two major transformations. One is the adoption of AI and its tight integration into commercial applications and enterprise workflows. The other is the transformation of software architecture through concepts like microservices and cloud-native architecture. These transformations, alongside the aggressive adoption of IoT/mobile and 5G in all our day-to-day activities, are making the world operate in a more real-time manner, which opens up a new challenge: improving hardware architecture to adapt to these requirements. These two major transformations push the boundary of the entire systems stack, making designers rethink hardware. This talk presents a picture of how the enterprise-industry-leading POWER architecture is transforming to fulfill the performance demands of these newer-generation workloads, with a primary focus on on-chip AI acceleration.
Join us on Friday, July 16th 2021, for our newest workshop with DoMS, IIT Roorkee: Concept to Solutions using the OpenPOWER Stack. It's time to discover advances in #DeepLearning tools and techniques from the world's leading innovators across industries, research, and public speakers.
Register here:
https://lnkd.in/ggxMq2N
This presentation covers two use cases using OpenPOWER Systems:
1. Diabetic Retinopathy using AI on NVIDIA Jetson Nano: the objective is to classify the diabetic level solely from retina images in a remote area with minimal doctor intervention. The model uses the VGG16 network architecture and is trained from scratch on POWER9. The model was deployed on the Jetson Nano board.
2. Classifying Covid positivity using lung X-ray images: the idea is to build ML models to detect positive cases from X-ray images. The model was trained on POWER9, and the application was developed in Python.
IBM Bayesian Optimization Accelerator (BOA) is a do-it-yourself toolkit to apply state-of-the-art Bayesian inferencing techniques and obtain optimal solutions for complex, real-world design simulations without requiring deep machine learning skills. This talk will describe IBM BOA, its differentiation and ease of use, and how researchers can take advantage of it for optimizing any arbitrary HPC simulation.
This presentation covers the partners and collaborators currently working with the OpenPOWER Foundation, use cases of OpenPOWER systems in multiple industries, OpenPOWER workgroups, and OpenCAPI features.
The IBM POWER10 processor represents the 10th generation of the POWER family of enterprise computing engines. Its performance is a result of both powerful processing cores and high-bandwidth intra- and inter-chip interconnect. POWER10 systems can be configured with up to 16 processor chips and 1920 simultaneous threads of execution. Cross-system memory sharing, through the new Memory Inception technology, and 2 Petabytes of addressing space support an expansive memory system. The POWER10 processing core has been significantly enhanced over its POWER9 predecessor, including a doubling of vector units and the addition of an all-new matrix math engine. Throughput gains from POWER9 to POWER10 average 30% at the core level and three-fold at the socket level. Those gains can reach ten- or twenty-fold at the socket level for matrix-intensive computations.
Everything is changing, from Health Care to the Automotive markets, without forgetting Financial markets or any type of engineering: everything has stopped being created by an individual or, in the best case, a team, and is now developed and perfected using AI and hundreds of computers. And even AI is something we can no longer run on a single computer, no matter how powerful it is. What drives everything today is HPC, or High-Performance Computing, heavily linked to AI. In this session we will discuss AI, HPC, the IBM Power architecture, and how it can help develop better healthcare, better automobiles, better financials, and better everything that we run on them.
Macromolecular crystallography is an experimental technique that allows exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While up to now the development of the technique was limited by the performance of scientific instruments, recently computing performance has become a key limitation. In my presentation I will describe a computing challenge: handling the 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experience in applying conventional hardware to the task and why this attempt failed. I will then present how an IC 922 server with OpenCAPI-enabled FPGA boards allowed us to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how advancement in hardware development will enable better science by users of the Swiss Light Source.
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems - Ganesan Narayanasamy
As the adoption of AI technologies increases and matures, the focus will shift from exploration to time to market, productivity, and integration with existing workflows. Governing enterprise data, scaling AI model development, and selecting a complete, collaborative hybrid platform and tools for rapid solution deployment are key focus areas for growing data-scientist teams tasked to respond to business challenges. This talk will cover the challenges and innovations for AI at scale in industries such as Healthcare and Automotive, the AI ladder and AI life cycle, and infrastructure architecture considerations.
This talk gives an introduction to healthcare use cases, the AI ladder and lifecycle, and AI-at-Scale themes. The iterative nature of the workflow and some of the important components to be aware of in developing AI healthcare solutions are discussed. The different types of algorithms, and when machine learning might be more appropriate than deep learning or the other way around, are also discussed. Example use cases are shared as part of this presentation.
Healthcare has become one of the most important aspects of everyone's life. Its importance has surged due to the latest outbreaks, and due to this latest pandemic it has become mandatory to collaborate to improve everyone's healthcare as soon as possible.
IBM has reacted quickly, sharing not only its knowledge but also its Artificial Intelligence supercomputers all around the world.
Those supercomputers are helping to overcome this outbreak and also future ones.
They have completely different features compared to proposals from other players in this supercomputer market.
We will take a quick look at the differences of those AI-focused supercomputers and how they can help in the R&D of healthcare solutions for everyone, from those with access to a big IBM AI supercomputer to those with access to only one small IBM AI-focused server.
Moving object recognition (MOR) corresponds to the localization and classification of moving objects in videos. Discriminating moving objects from static objects and background in videos is an essential task for many computer vision applications. MOR has widespread applications in intelligent visual surveillance, intrusion detection, anomaly detection and monitoring, industrial sites monitoring, detection-based tracking, autonomous vehicles, etc. In this session, Murari provided a poster about the deep learning algorithms to identify both locations and corresponding categories of moving objects with a convolutional network. The challenges in developing such algorithms have been discussed.
Clarisse Hedglin from IBM presented this as part of the 3-day International Summit. She shared the scenarios AI can solve for today using the IBM AI infrastructure.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses.
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
A tale of scale & speed: How the US Navy is enabling software delivery from l... - sonjaschweigert1
Rapid and secure feature delivery is a goal across every application team and every branch of the DoD. The Navy’s DevSecOps platform, Party Barge, has achieved:
- Reduction in onboarding time from 5 weeks to 1 day
- Improved developer experience and productivity through actionable findings and reduction of false positives
- Maintenance of superior security standards and inherent policy enforcement with Authorization to Operate (ATO)
Development teams can ship efficiently and ensure applications are cyber ready for Navy Authorizing Officials (AOs). In this webinar, Sigma Defense and Anchore will give attendees a look behind the scenes and demo secure pipeline automation and security artifacts that speed up application ATO and time to production.
We will cover:
- How to remove silos in DevSecOps
- How to build efficient development pipeline roles and component templates
- How to deliver security artifacts that matter for ATO’s (SBOMs, vulnerability reports, and policy evidence)
- How to streamline operations with automated policy checks on container images
DevOps and Testing slides at DASA Connect - Kari Kakkonen
Slides by me and Rik Marselis at the DASA Connect conference, 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We also had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Smart TV Buyer Insights Survey 2024 by 91mobiles
91mobiles recently conducted a Smart TV Buyer Insights Survey in which we asked over 3,000 respondents about the TV they own, aspects they look at on a new TV, and their TV buying preferences.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Accelerate your Kubernetes clusters with Varnish Caching - Thijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Accelerate your Kubernetes clusters with Varnish Caching
Parallel Biological Sequence Comparison in GPU Platforms
1. Alba Cristina Magalhaes Alves de Melo
Full Professor at the University of Brasilia (UnB)
CNPq/Brazil Research Fellow level 1C
IEEE Senior Member
Parallel Biological Sequence
Comparison in GPU Platforms:
Research at the University of Brasilia
OpenPower Webinar – June, 19th, 2020
https://www.compoundchem.com/2015/03/24/dna/
2. • Biological Sequences are obtained with sequencing
machines, using chromatography analysis.
DNA
sequence
Introduction
3. • Once a biological sequence is obtained, its
properties/characteristics need to be
established.
• This is mainly done by comparing the new
sequence with sequences that have already
been catalogued.
• The comparison of biological sequences is
one of the most important operations in
Bioinformatics.
– Its goal is to define how similar the sequences
are.
Introduction
4. • There are exact algorithms that
compare two sequences and produce
the optimal result. They have quadratic
time and memory complexity - O(mn),
where m and n are the sizes of the
sequences.
• Heuristic methods are faster and are
used in many genome projects.
– However, they do not guarantee that the
optimal result will be produced.
Introduction
5. • Smith-Waterman (SW) proposed an exact
algorithm based on dynamic programming
to locally align two sequences in quadratic
time and memory.
– It produces the optimal result.
– High execution times and huge memory
requirements.
• To compare the human chromosome 1
with the chimpanzee chromosome 1 (249
Millions of Base Pairs - MBP x 228 MBP),
at least 240 Petabytes of memory are
needed.
This SW comparison was considered
unfeasible in 2008.
Introduction
6. • Present and discuss our MASA-CUDAlign
strategy to compare huge chromosomes in
GPUs
– We use a highly optimized algorithm to exploit
parallelism
– We use speculative techniques to accelerate the
sequential part of the algorithm
– We were able to use up to 384 NVidia M2090 GPUs to
compare the human and chimpanzee homologous
chromosomes 1 in 2016
– We will show preliminary results of the next version of
our tool in the IBM platform with 8 NVidia Volta
• Present and discuss our MASA-OpenMP results in
Power9 for smaller sequences
– We will show comparative results Power vs Intel
– We will present preliminary covid-19 results
Goal of this talk
8. • To compare two sequences, one sequence is
placed above the other and a score is
computed.
S0 = GACGGATTAG    S1 = GATCGGAATAG

Alignment (11 characters - Base Pairs, BP):
S0: G  A  -  C  G  G  A  T  T  A  G
S1: G  A  T  C  G  G  A  A  T  A  G
    +1 +1 -2 +1 +1 +1 +1 -1 +1 +1 +1
    (+1: match, -1: mismatch, -2: gap)
score: +6
Biological Sequence Comparison
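The column-by-column scoring above can be reproduced in a few lines of Python (a sketch using the slide's parameters: +1 match, -1 mismatch, -2 gap):

```python
def alignment_score(s0: str, s1: str, ma: int = 1, mi: int = -1, g: int = -2) -> int:
    """Score a gapped alignment column by column."""
    assert len(s0) == len(s1), "aligned strings must have equal length"
    score = 0
    for a, b in zip(s0, s1):
        if a == '-' or b == '-':
            score += g   # gap column
        elif a == b:
            score += ma  # match
        else:
            score += mi  # mismatch
    return score

# The 11-character alignment from the slide:
print(alignment_score("GA-CGGATTAG", "GATCGGAATAG"))  # 6
```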
9. • Based on dynamic programming with quadratic time and
memory complexity (O(mn)).
• Executes in two steps:
• (1) calculate the DP matrix (similarity score) and
• (2) traceback (alignment)
• Having sequences S0 and S1 as input, with sizes m and n, the
(m+1) x (n+1) matrix H is computed:

H[i,j] = max( H[i-1,j-1] + p(i,j),
              H[i-1,j]   - g,
              H[i,j-1]   - g,
              0 )

p(i,j) = ma, if S0[i] = S1[j]   (match)
         mi, otherwise          (mismatch)
g: gap penalty
Smith-Waterman (SW) Algorithm
10. DP matrix for S1 = ATAGCTA (columns) and S0 = ATACGCTCTT (rows),
with values g=-2, mi=-1, ma=1; the first row and column are
initialized with zeros (cell values as shown in the slide, where
arrows mark the traceback path):

        -  A  T  A  G  C  T  A
    -   0  0  0  0  0  0  0  0
    A   0  1  0  1  0  0  0  1
    T   0  0  2  0  0  0  1  0
    A   0  1  0  3  0  0  0  2
    C   0  0  0  1  2  1  0  0
    G   0  0  0  1  2  1  0  0
    C   0  0  0  0  0  3  0  0
    T   0  0  1  0  0  1  4  2
    C   0  0  0  0  0  1  2  3
    T   0  0  1  0  0  0  2  1
    T   0  0  1  0  0  0  0  1

Each cell takes the maximum of diagonal, up, left and 0,
e.g. max(2, -2, -2, 0) = 2.
The highest score (4) marks the start of the traceback path.

Local Alignment:
A T A - G C T
A T A C G C T
Smith-Waterman (SW) Example
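The recurrence can be sketched as a plain-Python reference implementation (a toy version, not CUDAlign's GPU code); for the example sequences above it also yields a highest score of 4:

```python
def sw_matrix(s0: str, s1: str, ma: int = 1, mi: int = -1, g: int = 2):
    """Fill the Smith-Waterman DP matrix (linear gap model, penalty g)."""
    m, n = len(s0), len(s1)
    H = [[0] * (n + 1) for _ in range(m + 1)]  # first row/column stay zero
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p = ma if s0[i - 1] == s1[j - 1] else mi
            H[i][j] = max(H[i - 1][j - 1] + p,  # diagonal (match/mismatch)
                          H[i - 1][j] - g,      # up (gap)
                          H[i][j - 1] - g,      # left (gap)
                          0)                    # local-alignment floor
    return H

H = sw_matrix("ATACGCTCTT", "ATAGCTA")
print(max(max(row) for row in H))  # highest score: 4
```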
11. • [Gotoh 1982]: Computes the affine gap model, where the
value assigned to start a sequence of gaps (GapOpen) is
higher than the value assigned to extend it (GapExtend)
• Computes 3 DP matrices and provides a better
biological result
• Time and memory complexity (O(mn))
• [Hirschberg 1977]: Computes the linear gap model in
linear memory, with a divide and conquer recursive
approach
• Time complexity (O(mn)), memory complexity (O(m+n))
• [Myers-Miller 1988]: computes the affine gap model in
linear memory, with a modified version of Hirschberg’s
• Time complexity (O(mn)), memory complexity (O(m+n))
Smith-Waterman (SW) Variants
12. Smith-Waterman (SW) and its Variants
Wavefront method
(i,j) depends on (i-1,j), (i-1,j-1) and (i,j-1)
up
left
diag
[Figure: DP matrix antidiagonals d0, d1, d2, ... computed by the wavefront
method; parallelism is minimum at the first antidiagonals and maximum at
the main antidiagonal]
• Each anti-diagonal can be computed in parallel
• m+n-1 antidiagonals
• Non-uniform parallelism
minimum at the beginning (d0)
maximum at the main antidiagonal (d4)
minimum at the end (d7)
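The wavefront order can be sketched as follows; this sequential toy version visits the cells in the order a GPU kernel would parallelize (all cells of one antidiagonal are mutually independent):

```python
def sw_wavefront(s0: str, s1: str, ma: int = 1, mi: int = -1, g: int = 2):
    """Smith-Waterman filled antidiagonal by antidiagonal.

    A cell (i, j) with i + j == d depends only on antidiagonals d-1 and
    d-2, so the inner loop below could run in parallel for each d.
    """
    m, n = len(s0), len(s1)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    for d in range(2, m + n + 1):                    # m+n-1 antidiagonals
        for i in range(max(1, d - n), min(m, d - 1) + 1):
            j = d - i
            p = ma if s0[i - 1] == s1[j - 1] else mi
            H[i][j] = max(H[i - 1][j - 1] + p, H[i - 1][j] - g,
                          H[i][j - 1] - g, 0)
    return H

H = sw_wavefront("ATACGCTCTT", "ATAGCTA")
print(max(max(row) for row in H))  # same result as row-major order: 4
```

Note the non-uniform parallelism: the inner loop has few iterations for the first and last values of d and the most iterations at the main antidiagonal, exactly as described above.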
14. • Goal: compare huge DNA sequences in GPU with a
combination of Gotoh and Myers-Miller algorithms
– CUDAlign 1.0: similarity score in 1 GPU
– CUDAlign 2.0: score and alignment in 1 GPU
– CUDAlign 2.1: score and alignment in 1 GPU with pruning
– CUDAlign 3.0: similarity score in several GPUs
– MASA-CUDAlign 4.0: score and alignment in several GPUs
• PhD Thesis - Edans F. O. Sandes (Awarded the Best PhD
Thesis in Computer Science in Brazil - 2016)
• Wilkes Award 2019 – Best paper -
The Computer Journal in 2018
MASA-CUDAlign: Goal and Versions
15. (1) Find the best score (GPU)
(2) Partial traceback (GPU)
(3) Split partitions (GPU)
(4) Align partitions (CPU)
(5) Full alignment (CPU)
crosspoint
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Compute the DP matrix)
Stages 2 to 5 (Traceback)
16. •The DP Matrix is divided
into grid blocks and a set
of grid blocks compose
an external diagonal.
•Each external diagonal
is composed of B blocks,
where each block is
calculated by T threads.
Each thread computes α rows.
•Each CUDA kernel is
invoked to calculate one
external diagonal.
B=3; T=3; α=2
Size(S0)=36, Size(S1)=36
B1 B2 B3
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Compute the DP matrix)
17. • Computation of each block stops at full parallelism
and remaining cells are delegated to the next
invocation.
[Figure: grid blocks G0,0 to G7,2; the first external diagonals are
processed, then external diagonals in the middle of the matrices, then
external diagonals with non-contiguous cells delegation, causing a
small loss of parallelism]
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Parallelogram execution)
18. • The goal of the block pruning optimization is to
eliminate the calculation of blocks of cells that
surely do not belong to the optimal alignment.
• These blocks have such a small score that it is mathematically impossible for them to lead to a score higher than one that has already been produced.
(Fragment from the CUDAlign 2.1 paper: special rows are saved to disk, similarly to FastLSA (Section 3.1), which also allows resuming the computation in case of interruption; the rows flushed to disk are taken at a certain interval, at rows whose indexes are multiples of the block height.)
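The pruning condition described above can be sketched as a simple predicate (a simplified model with illustrative names; the paper's actual bound accounts for the exact distance to the end of the matrix and the scoring parameters):

```python
def can_prune(h_ij, dist_to_end, best_score, match_score=1):
    """Return True if a block can be safely skipped.

    h_ij is the best score inside the block, dist_to_end is the number
    of cells remaining on the best path to the end of the matrix, and
    match_score is the maximum gain per cell. If even a perfect run of
    matches cannot beat best_score, the block cannot contain the
    optimal alignment.
    """
    max_reachable = h_ij + dist_to_end * match_score
    return max_reachable <= best_score

assert can_prune(100, 50, 230)       # 100 + 50 = 150 <= 230: prune
assert not can_prune(200, 50, 230)   # 200 + 50 = 250 > 230: keep
```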
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Block Pruning)
Example: Hi,j = 100, maximum possible gain Di = 50, so Hmax(i,j) = 150 < best_score = 230 ⇒ pruning = true
19. (Page fragment from E. Sandes, G. Teodoro, M. Walter, E. Ayguade, X. Martorell. Figure 14: geometrical representation of the pruning area, with boundary cases f2 (j > i) and f3 (j ≤ i) defined over Hi,j.)
For similar sequences, the pruning area is characterized by four lines (f1, f2, f3, f4), forming two polygons that are connected at the end of the alignment.
Gray area: not processed.
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Stage 1 (Block Pruning)
20. MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Publications (CUDAlign 1.0)
CUDAlign: Using GPU to Accelerate the
Comparison of Megabase Genomic Sequences
Edans Flavius de O. Sandes Alba Cristina M. A. de Melo
University of Brasilia (UnB), Brazil
{edans,albamm}@cic.unb.br
Abstract
Biological sequence comparison is a very important operation in Bioinformatics. Even though there do exist exact methods to compare biological sequences, these methods are often neglected due to their quadratic time and space complexity. In order to accelerate these methods, many GPU algorithms were proposed in the literature. Nevertheless, all of them restrict the size of the smallest sequence in such a way that Megabase genome comparison is prevented. In this paper, we propose and evaluate CUDAlign, a GPU algorithm that is able to compare Megabase biological sequences with an exact Smith-Waterman affine gap variant. CUDAlign was implemented in CUDA and tested on two GPU boards, separately. For real sequences whose sizes range from 1 MBP (Megabase Pairs) to 47 MBP, a close to uniform GCUPS (Giga Cells Updates per Second) was obtained, showing the potential scalability of our approach. Also, CUDAlign was able to compare the human chromosome 21 and the chimpanzee chromosome 22. This operation took 21 hours on a GeForce GTX 280, resulting in a peak performance of 20.375 GCUPS. As far as we know, this is the first time such huge chromosomes have been compared with an exact method.
Categories and Subject Descriptors D.1.3 [Program-
ming Techniques]: Concurrent Programming; J.3 [Life
and Medical Sciences]: Biology and Genetics
General Terms Algorithms
Keywords Biological Sequence Comparison, Smith-
Waterman, GPU
1. Introduction
In the last four years, new DNA sequencing technologies have been developed that allow a hundred-fold increase in throughput over the traditional method. This means that the genomic databases, which already have an exponential growth rate, will experience an unprecedented increase in their sizes. Therefore, a huge amount of new DNA sequences will need to be compared, in order to infer functional/structural characteristics. In this scenario, the time spent in each comparison, as well as the accuracy of the result obtained, will be a fundamental factor to determine the success/failure of the next generation genome projects.
(PPoPP'10, January 9–14, 2010, Bangalore, India. Copyright 2010 ACM 978-1-60558-708-0/10/01.)
Sequence comparison is, thus, a very basic and important operation in Bioinformatics. As a result of this step, one or more sequence alignments can be produced [1]. A sequence alignment has a similarity score associated to it that is obtained by placing one sequence above the other, making clear the correspondence between the characters and possibly introducing gaps into them [2]. The most common types of sequence alignment are global and local. To solve a global alignment problem is to find the best match between the entire sequences. On the other hand, local alignment algorithms must find the best match between parts of the sequences.
One important issue to be considered is how gaps are treated. A simple solution assigns a constant penalty for gaps. However, it has been observed that keeping gaps together represents the biological relationships better. Hence, the most widely used model among biologists is the affine gap model [3], where the penalty for opening a gap is higher than the penalty for extending it.
Smith-Waterman (SW) [4] is an exact algorithm based on the longest common subsequence (LCS) concept that uses dynamic programming to find local alignments between two sequences of size m and n in O(mn) space and time. In this algorithm, a similarity matrix of size (m + 1) × (n + 1) is calculated. SW is very accurate but it needs a lot of computational resources.
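The SW variant with affine gaps can be illustrated with a small score-only sketch in the Gotoh formulation (a simplified reference implementation under assumed penalty values, not CUDAlign's CUDA kernel):

```python
def sw_affine(s0, s1, match=1, mismatch=-3, g_open=3, g_ext=2):
    """Smith-Waterman with affine gaps (Gotoh): optimal local score.

    O(mn) time; only O(n) memory is kept for this score-only phase,
    which is what makes Megabase comparisons feasible memory-wise.
    Opening a gap costs g_open + g_ext; each extension costs g_ext.
    """
    NEG = float("-inf")
    n = len(s1)
    H_prev = [0] * (n + 1)   # previous DP row
    F = [NEG] * (n + 1)      # vertical gap state, one per column
    best = 0
    for a in s0:
        H_cur = [0] * (n + 1)
        E = NEG              # horizontal gap state within the row
        for j in range(1, n + 1):
            E = max(E - g_ext, H_cur[j - 1] - g_open - g_ext)
            F[j] = max(F[j] - g_ext, H_prev[j] - g_open - g_ext)
            sub = H_prev[j - 1] + (match if a == s1[j - 1] else mismatch)
            H_cur[j] = max(0, sub, E, F[j])  # local: never below zero
            best = max(best, H_cur[j])
        H_prev = H_cur
    return best

assert sw_affine("GATTACA", "GATTACA") == 7   # seven matches
assert sw_affine("ACGT", "TGCA") == 1         # best local hit: one match
```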
In order to reduce execution time, heuristic methods such as BLAST [5] were proposed. These methods combine exact pattern matching with dynamic programming in order to produce good solutions faster. BLAST can align sequences in a very short time, still producing good results. Nevertheless, there is no guarantee that the best result will be produced.
Therefore, many efforts were made to develop methods and techniques that execute the SW algorithm on high performance architectures, allowing the production of exact results in a shorter time. One recent trend in high performance architectures is the Graphics Processing Unit (GPU). In addition to the usual graphics functions, recent GPU architectures are able to execute general purpose algorithms (GPGPU). These GPUs contain elements that execute massive vector operations in a highly parallel way. Because of their TFlops peak performance and their availability in PC desktops, the utilization of GPUs is rapidly increasing in many scientific areas.
Conference: ACM PPoPP 2010
Table 5. Comparison of the real sequences used in the tests; the best local score and its end position are presented.

Sizes          Cells     Accession 1  Accession 2  Score     End position
543K×536K      2.91E+11  NC_003064.2  NC_000914.1  48        (308558, 455134)
1044K×1073K    1.12E+12  CP000051.1   AE002160.2   88353     (1072950, 722725)
3147K×3283K    1.03E+13  BA000035.2   BX927147.1   4226      (2991493, 2689488)
5227K×5229K    2.73E+13  AE016879.1   AE017225.1   5220960   (5227292, 5228663)
7146K×5227K    3.74E+13  NC_005027.1  NC_003997.3  172       (4655867, 5077642)
23012K×24544K  5.65E+14  NT_033779.4  NT_037436.3  9063      (14651731, 11501313)
32799K×46944K  1.54E+15  BA000046.3   NC_000021.7  27206434  (32718231, 46919080)
Table 6. BLAST results.

Comparison     Time   Score
162K×172K      0.4s   18
543K×536K      0.6s   48
1044K×1073K    2.4s   6973
3147K×3283K    6.7s   3888
5227K×5229K    17.4s  36159
7146K×5227K    7.7s   157
23012K×24544K  110s   7085
32799K×46944K  -      -
For the human × chimpanzee chromosome comparison, BLAST finished its execution with a segmentation fault, due to an out-of-memory error.
Conclusion and Future Work
In this paper, we proposed and evaluated CUDAlign, a GPU-accelerated version of Smith-Waterman (SW) that compares two Megabase genomic sequences. Differently from the previous GPU Smith-Waterman (SW) proposals in the literature, our proposal does not impose severe restrictions on the size of the smallest sequence […]
(Figure 10. Runtimes (seconds) × DP matrix size (cells) in logarithmic scale, for the 8600GT (1,968 MCUPS) and GTX280 (20,375 MCUPS) boards. Results show scalability and an almost constant MCUPS ratio for Megabase sequences (cells ≥ 1e+12).)
[…] in order to exploit the characteristics of the GPU memory hierarchy.
We obtained the optimal score of the human × chimpanzee chromosome 21 comparison (32 MBP × 47 MBP) using the Nvidia GTX 280 GPU in 21 hours.
GCUPS (Giga Cells Updated Per Second): 20.3
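The GCUPS figure quoted above follows from simple arithmetic: GCUPS is the number of DP matrix cells divided by the execution time in seconds, divided by 10^9. Using the approximate sequence sizes from the tables:

```python
# GCUPS = (matrix cells) / (seconds) / 1e9, with the values quoted above.
m, n = 32_799_000, 46_944_000   # ~32 MBP and ~47 MBP sequences
seconds = 21 * 3600             # 21 hours of execution
gcups = (m * n) / seconds / 1e9

# Matches the ~20.3 GCUPS reported on the slide.
assert 20.0 < gcups < 21.0
```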
22. MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in one GPU
Publications (CUDAlign 2.1)
Retrieving Smith-Waterman Alignments with
Optimizations for Megabase Biological
Sequences Using GPU
Edans Flavius de O. Sandes and Alba Cristina M.A. de Melo, Senior Member, IEEE
Abstract—In Genome Projects, biological sequences are aligned thousands of times on a daily basis. The Smith-Waterman algorithm is able to retrieve the optimal local alignment with quadratic time and space complexity. So far, aligning huge sequences, such as whole chromosomes, with the Smith-Waterman algorithm has been regarded as unfeasible, due to huge computing and memory requirements. However, high-performance computing platforms such as GPUs are making it possible to obtain the optimal result for huge sequences in reasonable time. In this paper, we propose and evaluate CUDAlign 2.1, a parallel algorithm that uses GPU to align huge sequences, executing the Smith-Waterman algorithm combined with Myers-Miller, with linear space complexity. In order to achieve that, we propose optimizations which are able to significantly reduce the amount of data processed, while enforcing full parallelism most of the time. Using the NVIDIA GTX 560 Ti board and comparing real DNA sequences that range from 162 KBP (Thousand Base Pairs) to 59 MBP (Million Base Pairs), we show that CUDAlign 2.1 is scalable. Also, we show that CUDAlign 2.1 is able to produce the optimal alignment between the chimpanzee chromosome 22 (33 MBP) and the human chromosome 21 (47 MBP) in 8.4 hours and the optimal alignment between the chimpanzee chromosome Y (24 MBP) and the human chromosome Y (59 MBP) in 13.1 hours.
Index Terms—Bioinformatics, sequence alignment, parallel algorithms, GPU
1 INTRODUCTION
BIOINFORMATICS is an interdisciplinary field that involves computer science, biology, mathematics, and statistics [1]. One of its main goals is to analyze biological sequence data and genome content in order to obtain the function/structure of the sequences as well as evolutionary information.
Once a new biological sequence is discovered, its functional/structural characteristics must be established. The first step to achieve this goal is to compare the new sequence with the sequences that compose genomic databases, in search of similarities. This comparison is made thousands of times on a daily basis, all over the world. Sequence comparison is, therefore, one of the most basic operations in Bioinformatics. As output, a sequence comparison operation produces similarity scores and alignments. The score is a measure of similarity between the sequences and the alignment highlights the similarities/differences between the sequences. Both are very useful and often are used as building blocks for more complex problems such as multiple sequence alignment and secondary structure prediction.
Smith and Waterman (SW) [2] proposed an exact algorithm that retrieves the optimal score and local alignment between two sequences. It is based on Dynamic Programming (DP) and has time and space complexity O(mn), where m and n are the sizes of the sequences. In SW, a linear gap function was used. Nevertheless, in nature, gaps tend to occur together. For this reason, the affine gap model is often used, where the penalty for opening a gap is higher than the penalty for extending it. Gotoh [3] modified the SW algorithm to include affine gap penalties.
One of the most restrictive characteristics of SW and its variants is the quadratic space needed to store the DP matrices. For instance, in order to compare two 33 MBP (Million Base Pairs) sequences, we would need at least 4.3 PB of memory. This fact was observed by Hirschberg [4], who proposed a linear space algorithm to compute the Longest Common Subsequence (LCS). Hirschberg's algorithm was later modified by Myers and Miller (MM) [5] to compute global alignments in linear space.
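A back-of-the-envelope check of the 4.3 PB figure, assuming one 32-bit score per DP cell, also shows why the Hirschberg/Myers-Miller linear-space approach is essential here:

```python
# Quadratic vs. linear space for a 33 MBP x 33 MBP comparison,
# assuming (as an illustration) a 4-byte score per DP cell.
m = n = 33_000_000
bytes_per_cell = 4

quadratic = m * n * bytes_per_cell   # storing the whole DP matrix
linear = (m + n) * bytes_per_cell    # keeping only O(m + n) cells

assert quadratic > 4.3e15            # more than 4.3 petabytes
assert linear < 300e6                # a few hundred megabytes
```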
Another restrictive characteristic of the SW algorithm is that it is usually slow due to its quadratic time complexity. In order to accelerate the comparison between long sequences, heuristic tools such as LASTZ [6] and MUMMER [7] were created. They use seeds (LASTZ) and suffix trees (MUMMER) to scan the sequences, providing a big picture of the main differences/similarities between them. On the other hand, Smith-Waterman provides the optimal local alignment, where the regions of differences/similarities are much more accurate, as well as the gapped regions that represent inclusion/deletion of bases. Therefore, we claim that both kinds of tools should be used in a complementary way: first, MUMMER or LASTZ would be executed and […]
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 24, NO. 5, MAY 2013 1009
. E.F. de O. Sandes is with the Department of Computer Science, University
of Brasilia, Campus Darcy Ribeiro, PO Box 4466, Asa Norte, Brasilia-DF
CEP 70910-900, Brazil. E-mail: edans@cic.unb.br.
. A.C.M.A. de Melo is with the Department of Computer Science, University
of Brasilia, Campus Darcy Ribeiro, PO Box 4466, Asa Norte, Brasilia-DF
CEP 70910-900, Brazil. E-mail: albamm@cic.unb.br.
Manuscript received 16 Nov. 2011; revised 29 Apr. 2012; accepted 4 June
2012; published online 22 June 2012.
Recommended for acceptance by S. Aluru.
Digital Object Identifier no. 10.1109/TPDS.2012.194.
Journal: IEEE Transactions on Parallel
and Distributed Systems, 2013
[…] stages. The first stage processes the full DP matrix as in [27], but some special rows are saved in an area called the Special Rows Area and some blocks are pruned. The second stage processes the DP matrix in the reverse direction starting from the endpoint of the optimal alignment and also saves special columns on disk. Using an optimization called orthogonal execution, the area calculated in Stage 2 is reduced. Stage 3 increases the number of crosspoints with an execution similar to Stage 2 but in the forward direction. Stage 4 uses the MM algorithm with orthogonal execution to decrease the size of the partitions. As soon as all the partitions are smaller than the maximum partition size, Stage 5 finds the alignment of each partition and concatenates the results into the full alignment. Stage 6 is optional and presents the full alignment in textual or graphical representation.
[…] memory space. Using an SRA of 50 GB, the full alignment of these genomic sequences was obtained in 8 hours and 26 minutes, where 99.1 percent of this time was spent in the GPU stages. CUDAlign 2.1 obtained a maximum speedup of 41.64× when compared to the Z-align cluster solution with 64 cores.
As future work, we intend to further optimize the stages of the algorithm. In Stage 3, the parallelism is currently exploited intensively inside each partition; in future works many partitions may also be processed in parallel, reducing the execution time of Stage 4. We also intend to implement the block pruning optimization in Stages 2 and 3. We will also extend the tests to evaluate more powerful GPUs, including systems with dual cards and from other vendors. Finally, we will investigate the possibility of solving the multiple sequence alignment problem with the optimizations proposed in this paper.
REFERENCES
[1] D.W. Mount, Bioinformatics: Sequence and Genome Analysis, Cold Spring Harbor Laboratory Press, 2004.
[2] T.F. Smith and M.S. Waterman, "Identification of Common Molecular Subsequences," J. Molecular Biology, vol. 147, no. 1, pp. 195-197, Mar. 1981.
[3] O. Gotoh, "An Improved Algorithm for Matching Biological Sequences," J. Molecular Biology, vol. 162, no. 3, pp. 705-708, 1982.
[4] D.S. Hirschberg, "A Linear Space Algorithm for Computing Maximal Common Subsequences," Comm. ACM, vol. 18, no. 6, pp. 341-343, 1975.
[5] E.W. Myers and W. Miller, "Optimal Alignments in Linear Space," Computer Applications in the Biosciences, vol. 4, no. 1, pp. 11-17, 1988.
[6] R.S. Harris, "Improved Pairwise Alignment of Genomic DNA," PhD thesis, The Pennsylvania State Univ., 2007.
[7] S. Kurtz, A. Phillippy, A.L. Delcher, M. Smoot, M. Shumway, C. Antonescu, and S.L. Salzberg, "Versatile and Open Software for Comparing Large Genomes," Genome Biology, vol. 5, no. 2, 2004.
[8] S. Aluru, N. Futamura, and K. Mehrotra, "Parallel Biological Sequence Comparison Using Prefix Computations," J. Parallel and Distributed Computing, vol. 63, no. 3, pp. 264-272, 2003.
[9] S. Rajko and S. Aluru, "Space and Time Optimal Parallel Sequence Alignments," IEEE Trans. Parallel and Distributed Systems, vol. 15, no. 12, pp. 1070-1081, Dec. 2004.
[10] R.B. Batista, A. Boukerche, and A.C.M.A. de Melo, "A Parallel Strategy for Biological Sequence Alignment in Restricted Memory Space," J. Parallel and Distributed Computing, vol. 68, no. 4, 2008.
(Fig. 13. Plot of some alignments with pruned blocks in gray.)
We obtained the optimal alignment of the human × chimpanzee chromosome 21 comparison (32 MBP × 47 MBP) using the Nvidia GTX 560 Ti GPU in 8 hours. GCUPS: 52.85
24. (Figures from the TPDS paper: Figure 6 shows the columns distributions for 4 GPUs; Figure 8 shows the multi-GPU buffers between 4 GPUs, where each output-input pair of buffers continually transfers border cells between neighbor GPUs.)
25. (Figure 5. Multi-GPU threads chaining.)
Communication uses sockets and I/O threads.
Overlap between computation and communication: 8M buffer.
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Stage 1 – Compute the DP matrix
Multi-GPU wavefront
26. • Main challenge: how to parallelize a stage that is inherently sequential?
– Speculation
• Incremental Speculative Traceback (IST): each GPU assumes that its local maximum is also the global maximum.
(Figure: columns distributions for 4 GPUs, with the optimal and the speculated traceback crosspoints marked.)
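The speculation idea can be sketched with a toy model (all names are illustrative; the `traceback` callable stands in for the real per-GPU traceback stage): each GPU runs the traceback from its guessed crosspoint while waiting, and only redoes the work if the true crosspoint that eventually arrives differs from the guess.

```python
def ist(gpus_best_guess, true_crosspoints, traceback):
    """Toy model of Incremental Speculative Traceback.

    For each GPU, a speculative traceback is computed from its guessed
    crosspoint during otherwise idle time. On a speculation hit the
    result is reused; on a miss the traceback is redone from the real
    crosspoint. Returns (results, number_of_redone_tracebacks).
    """
    redone = 0
    results = []
    for guess, real in zip(gpus_best_guess, true_crosspoints):
        speculative = traceback(guess)      # done while waiting
        if guess == real:
            results.append(speculative)     # hit: free parallelism
        else:
            results.append(traceback(real)) # miss: pay the cost again
            redone += 1
    return results, redone

# Two of three GPUs guessed correctly; only one traceback is redone.
results, redone = ist([5, 9, 3], [5, 9, 4], traceback=lambda p: p * 10)
assert results == [50, 90, 40] and redone == 1
```

The papers above report speculation hit ratios as high as 98.2 percent, which is why this simple scheme recovers most of the lost parallelism.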
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Stages 2 to 5 - Traceback
27. (Figure 9. Traceback timelines: (a) Pipelined Traceback (PT), without speculation; (b) Incremental Speculative Traceback (IST), with speculation.)
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Incremental Speculative Traceback (IST)
28. MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Publications (CUDAlign 3.0)
Fine-Grain Parallel Megabase Sequence
Comparison with Multiple Heterogeneous GPUs
Edans F. de O. Sandes
University of Brasilia
edans@cic.unb.br
Guillermo Miranda
Barcelona Supercomputing Center
guillermo.miranda@bsc.es
Alba C. M. A. Melo
University of Brasilia
alba@cic.unb.br
Xavier Martorell
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
xavier.martorell@bsc.es
Eduard Ayguadé
Universitat Politècnica de Catalunya
Barcelona Supercomputing Center
eduard.ayguade@bsc.es
Abstract
This paper proposes and evaluates a parallel strategy to execute the exact Smith-Waterman (SW) algorithm for megabase DNA sequences in heterogeneous multi-GPU platforms. In our strategy, the computation of a single huge SW matrix is spread over multiple GPUs, which communicate border elements to the neighbour, using a circular buffer mechanism that hides the communication overhead. We compared 4 pairs of human-chimpanzee homologous chromosomes using 2 different GPU environments, obtaining a performance of up to 140.36 GCUPS (Billion of cells processed per second) with 3 heterogeneous GPUs.
Categories and Subject Descriptors D.1.3 [Programming
Techniques]: Concurrent Programming; J.3 [Life and Med-
ical Sciences]: Biology and Genetics
Keywords GPU; Biological Sequence Comparison; Smith-
Waterman;
1. Introduction
Smith-Waterman (SW) [4] is an exact algorithm based on the longest common subsequence (LCS) concept, that uses dynamic programming to find local alignments between two sequences. SW is very accurate but it needs a lot of computational resources. GPUs (Graphics Processing Units) have been considered to accelerate SW, but very few GPU strategies [1, 3] allow the comparison of Megabase sequences longer than 10 Million Base Pairs (MBP). SW# [1] uses 2 GPUs to execute a Myers-Miller [2] linear space variant of SW. CUDAlign [3] uses a single GPU to execute a combined strategy with SW and Myers-Miller. When compared to SW# (1 GPU), CUDAlign (1 GPU) presents better execution times for huge sequences [1].
(PPoPP '14, February 15–19, 2014, Orlando, Florida, USA. http://dx.doi.org/10.1145/2555243.2555280)
In this work, we modified the most computationally intensive stage of CUDAlign, parallelizing the computation of a single huge DP matrix among heterogeneous GPUs in a fine-grained way. In the proposed strategy, GPUs are logically arranged in a linear way so that each GPU calculates a subset of columns of the SW matrix, sending the border column elements to the next GPU. Experimental results collected in 2 different environments show performance of up to 140 GCUPS (Billion of cells processed per second) using 3 heterogeneous GPUs. With this performance, we are able to compare real megabase sequences in reasonable time.
2. Proposed Multi-GPU Strategy
We modified the first stage of CUDAlign [3] to parallelize the computation of a single huge DP matrix among many heterogeneous GPUs. The parallelization is done using a multi-GPU wavefront method, where the GPUs are logically arranged in a linear way, i.e., the first GPU is connected to the second, the second to the third, and so on. Each GPU computes a range of columns of the DP matrix and the GPUs transfer the cells of their last column to the next GPU. In a scenario composed of heterogeneous GPUs, assigning the same number of columns to all GPUs is not a good choice. In this case, the slowest GPU would determine the processing rate of the whole wavefront. To avoid this, we statically distribute the columns proportionally to the computational power of each GPU. This distribution can be obtained from sequence comparison benchmarks that determine each GPU […]
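The proportional static distribution can be sketched as follows (a simplified model; the function name and the benchmark numbers are illustrative):

```python
def distribute_columns(total_cols, gcups_per_gpu):
    """Split DP-matrix columns proportionally to each GPU's power.

    gcups_per_gpu holds per-GPU benchmark results (e.g., GCUPS); a
    faster GPU receives a proportionally wider range of columns, so
    the slowest GPU does not throttle the whole wavefront. The
    rounding remainder is given to the last GPU.
    """
    total_power = sum(gcups_per_gpu)
    cols = [int(total_cols * p / total_power) for p in gcups_per_gpu]
    cols[-1] += total_cols - sum(cols)
    return cols

# e.g., one GTX 580 chained with two faster GTX 680s:
shares = distribute_columns(1_000_000, [98, 110, 110])
assert sum(shares) == 1_000_000
assert shares[0] < shares[1] <= shares[2]
```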
Conference:
ACM PPoPP 2014
GTX580+GTX680+GTX680
Column distribution: 30.71% + 34.64% + 34.63%
(Figure 1. Columns distributions for 4 GPUs.)

Table 1. Sequences used in the tests.

Chr.   Human accession  Size  Chimpanzee accession  Size  Score
chr19  NC_000019.9      59M   NC_006486.3           64M   17297608
chr20  NC_000020.10     63M   NC_006487.3           62M   40050427
chr21  NC_000021.8      48M   NC_006488.2           46M   36006054
chr22  NC_000022.10     51M   NC_006489.3           50M   31510791
We obtained the optimal score of the human × chimpanzee chromosome 21 comparison (46 MBP × 47 MBP) using 3 GPUs (GTX580 + 2×GTX680) in 6 hours and 28 minutes. GCUPS: 139.63
30. Journal: IEEE Transactions on Parallel and Distributed Systems, 2016
Using 384 GPUs, we obtained the optimal alignment of chromosome 21 in a few minutes, and the optimal alignment of the human × chimpanzee chromosome 5 comparison (180 MBP × 183 MBP) in 53 minutes: 10372.56 GCUPS.
CUDAlign 4.0: Incremental Speculative
Traceback for Exact Chromosome-Wide
Alignment in GPU Clusters
Edans Flavius de Oliveira Sandes, Guillermo Miranda, Xavier Martorell, Eduard Ayguade,
George Teodoro, and Alba Cristina Magalhaes Melo, Senior Member, IEEE
Abstract—This paper proposes and evaluates CUDAlign 4.0, a parallel strategy to obtain the optimal alignment of huge DNA sequences in multi-GPU platforms, using the exact Smith-Waterman (SW) algorithm. In the first phase of CUDAlign 4.0, a huge Dynamic Programming (DP) matrix is computed by multiple GPUs, which asynchronously communicate border elements to the right neighbor in order to find the optimal score. After that, the traceback phase of SW is executed. The efficient parallelization of the traceback phase is very challenging because of the high amount of data dependency, which particularly impacts the performance and limits the application scalability. In order to obtain a highly parallel multi-GPU traceback phase, we propose and evaluate a new parallel traceback algorithm called Incremental Speculative Traceback (IST), which pipelines the traceback phase, speculating incrementally over the values calculated so far, producing results in advance. With CUDAlign 4.0, we were able to calculate SW matrices with up to 60 Peta cells, obtaining the optimal local alignments of all Human and Chimpanzee homologous chromosomes, whose sizes range from 26 Millions of Base Pairs (MBP) up to 249 MBP. As far as we know, this is the first time such a comparison was made with the SW exact method. We also show that the IST algorithm is able to reduce the traceback time from 2.15× up to 21.03×, when compared with the baseline traceback algorithm. The human × chimpanzee chromosome 5 comparison (180 MBP × 183 MBP) attained 10,370.00 GCUPS (Billions of Cells Updated per Second) using 384 GPUs, with a speculation hit ratio of 98.2 percent.
Index Terms—Bioinformatics, sequence alignment, parallel algorithms, GPU
1 INTRODUCTION
IN comparative genomics, biologists compare the sequences that represent organisms in order to infer functional/structural properties. Sequence comparison is, therefore, one of the most basic operations in Bioinformatics [1], usually solved using heuristic methods due to the excessive computation times of the exact methods.
Smith-Waterman (SW) [2] is an exact algorithm to compute pairwise local comparisons. It is based on Dynamic Programming (DP) and has quadratic time and space complexities. The SW algorithm is divided in two phases, where the first phase is responsible for calculating a DP matrix in order to obtain the optimal score and the second phase (traceback) obtains the optimal alignment. SW is usually executed to compare (a) two DNA sequences or (b) a protein sequence (query sequence) to a genomic database. In the first case, a single SW matrix is calculated and all the Processing Elements (PEs) cooperate in this calculation, communicating to exchange border elements (fine-grained computation). For Megabase DNA sequences, a huge DP matrix with several Petabytes is computed. In the second case, multiple small SW matrices are calculated, usually without communication between the PEs (coarse-grained computation). With the current genomic databases, often hundreds of thousands of SW matrices are calculated in a single query × database comparison.
In the last decades, SW approaches for both cases have been parallelized in the literature, using multiprocessors/multicores [3], [4], Cell Broadband Engines (CellBEs) [5], Field Programmable Gate Arrays (FPGAs) [6], Application Specific Integrated Circuits (ASICs) [7], Intel Xeon Phis [8] and Graphics Processing Units (GPUs) [9], [10], [11], [12]. The SW algorithm is widely used by biologists to compare sequences in many practical applications, such as identification of orthologs [13] and virus integration detection [14]. In this last application, an FPGA-based platform [6] was used to compute millions of SW alignments with small query sequences in a short time.
Nowadays, executing SW comparisons with Megabase sequences is still considered unfeasible by most researchers, which currently limits its practical use. We claim that important bioinformatics applications such as whole genome alignment (WGA) [15] could benefit from exact pairwise comparisons of long DNA sequences. WGA applications often construct global genome alignments by using local alignments as building blocks [16], [17]. In [18], the authors state that SW local alignments would be the best choice in this case. However, in order to compare 1 MBP × 1 MBP sequences, the SW tool took more than five days, preventing its use.
E. Sandes, G. Teodoro, and A. Melo are with the Department of Computer Science, University of Brasília, Brasília, DF, Brazil. E-mail: {edans, teodoro, albamm}@cic.unb.br.
G. Miranda, X. Martorell, and E. Ayguade are with the Barcelona Supercomputing Center, Barcelona, Spain. E-mail: {guillermo.miranda, xavier.martorell, eduard.ayguade}@bsc.es.
Manuscript received 29 Dec. 2014; revised 10 Dec. 2015; accepted 1 Jan. 2016. Date of publication 7 Jan. 2016; date of current version 14 Sept. 2016.
Digital Object Identifier no. 10.1109/TPDS.2016.2515597
IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 10, OCTOBER 2016
[…] spent in Stage 1 and the remaining stages (Traceback). As shown, the speedups attained with 128 nodes for chr22 and chr16 were, respectively, 26.9× and 29.7× (21.0 and 23.2 percent of parallel efficiency).
The breakdown of the total execution shows that Stage 1 of CUDAlign has a much better scalability. Stage 1 attained speedups of 84.0× and 97.3× with 128 nodes (65.6 and 76.0 percent of parallel efficiency), resulting in a peak performance of 8.3 and 9.7 TCUPS for chr22 and chr16, respectively. Stage 1 results of chr22 and chr16 are consistent with the ones obtained in CUDAlign 3.0 [12]. The PT traceback phase, on the other hand, was not able to efficiently […]; the time share of this phase increased from about 4 to 71 percent as the number of nodes used was scaled from 1 to 128. This negative impact of the traceback on the whole application performance is highly reduced when IST is used, as shown in Section 6.3.
6.3 Impact of Incremental Speculative Traceback
The experimental evaluation of the impact of IST on the performance was carried out using five pairs of homologous chromosomes: chr22, chr16, chr13, chr8, and chr5. These sequences were selected intending to provide a wide range of variation in the DP matrix size calculated (2.55, 8.13, 13.26, 21.07, 33.04 Peta cells, respectively).
Fig. 10. Alignment plots between human and chimpanzee homologous chromosomes.
MASA-CUDAlign: Goal and Versions
MASA-CUDAlign in multiple GPUs
Publications (CUDAlign 4.0)
34. The programmer chooses the type of block pruning and
the parallelization strategy
The programmer needs to code the recurrence relations
[…] must be implemented in the specific language and linked together to create a new aligner extension. Some aligners were presented in the work that presented MASA, executing on different hardware such as GPUs, multicore CPUs and Intel Phi co-processors. The code of MASA is divided in modules, according to the features: platform-independent functions (like data management and statistical procedures) and platform-dependent functions (like the parallel processing of the DP matrix and the BP module, implemented considering the target platform). The integration of these modules can be observed in Figure 2.
(FIGURE 2. MASA architecture.)
For the parallelization strategy, two approaches are suggested: the diagonal method, allowing the parallel processing of cells in the same diagonal, and the dataflow method, where the propagation is generic among nodes that represent blocks of cells. Similarly, the block pruning can be implemented using diagonal or generic execution approaches, avoiding unnecessary calculations. In order to create a specific […]
MASA Architecture
35. (Figure 5: Columns distributions for 4 GPUs. The pruning area, shown in blue, is not computed, causing heavy load imbalance.)
• Challenge: execute MASA-CUDAlign in a multi-GPU platform with block pruning.
MASA-CUDAlign – Multi-GPU with Pruning
The pruning area is obtained during the execution.
36. • The GPUs exchange their local best
results periodically.
In order to execute the sequence alignment with BP in
multiple GPUs, each one will compute a subset of columns
of the DP matrix, i.e., the sequence placed horizontally (S1
in Fig. 4) is split according to a defined static partition. Thus,
each GPU compares a part of this sequence with the entire
sequence placed vertically (S0 in Fig. 4).
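The static column split described above can be sketched as follows (illustrative Python, not the tool's code); the weights allow an uneven split for heterogeneous GPUs, as discussed later in the paper:

```python
# Sketch of the static partition of S1's columns among GPUs.
# Each GPU g receives a contiguous range of columns [start, end) and
# compares that slice of S1 against the whole vertical sequence S0.
# Weights model heterogeneous GPUs (a faster GPU gets a larger slice).

def partition_columns(n_cols, weights):
    total = sum(weights)
    bounds, start, acc = [], 0, 0
    for w in weights:
        acc += w
        end = round(n_cols * acc / total)
        bounds.append((start, end))
        start = end
    return bounds

# Homogeneous case: 4 equal GPUs over 100 columns.
print(partition_columns(100, [1, 1, 1, 1]))  # [(0, 25), (25, 50), (50, 75), (75, 100)]
# Heterogeneous case: one GPU twice as fast as the other.
print(partition_columns(90, [2, 1]))         # [(0, 60), (60, 90)]
```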
Multi-GPU with Pruning
Score sharing
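The interplay between the periodically shared best score and block pruning can be sketched as follows (illustrative Python; the match reward and the bound are simplified relative to the actual BP formulation). A block is prunable when even a perfect match from its position onward cannot beat the best score known so far, so a higher shared best score enlarges every GPU's pruning area.

```python
# Sketch of block pruning with a shared best score.
# A block at (i, j) with current maximum score `block_max` is prunable
# if block_max plus the best possible score of the remaining cells
# (perfect matches along the shortest remaining dimension) still
# cannot reach the best score already found anywhere in the matrix.

MATCH = 1  # simplified match reward

def prunable(block_max, i, j, rows, cols, shared_best):
    remaining = min(rows - i, cols - j)
    return block_max + remaining * MATCH <= shared_best

# Each GPU keeps a local best and periodically merges it with the
# scores received from the other GPUs:
def merge_best(local_best, received):
    return max(local_best, *received)

best = merge_best(50, [80, 65])  # best == 80
# A block far into the matrix with a low score can now be skipped:
print(prunable(10, 900, 950, 1000, 1000, best))  # True: 10 + 50 <= 80
```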
37. Multi-GPU with Pruning
Score sharing - Publication
[Plot: GCUPS per comparison id (1M, 3M, 5M, 7M, 10M, 23M, 47M, Ch19, Ch20, Ch21, Ch22), for 2*P100 and 4*P100, each with and without BP.]
Fig. 6: Multi-BP results in the Comet environment (2 and 4 P100 GPUs).
As can be observed, the speedup varied from 1.60x to 1.92x
with two GPUs, and 2.70x to 3.72x with four GPUs.
TABLE VII: Execution time (in hours) and speedup of Multi-BP on P100 GPUs (columns: Id, 1*P100, 2*P100, 4*P100; linear speedup reference: 1.00x, 2.00x, 4.00x).
Conference: Euromicro PDP 2020
Parallel Comparison of Huge DNA Sequences in
Multiple GPUs with Block Pruning
Marco Figueiredo Jr.
Univ. of Brasilia
160063027@aluno.unb.br marcoacf@sarah.br
Edans Sandes
Univ. of Brasilia
edans.sandes@gmail.com
George Teodoro
Univ. Fed. de Minas Gerais
george@dcc.ufmg.br
Alba C. M. A. Melo
Univ. of Brasilia
alves@unb.br
Abstract—Sequence comparison is a task performed in several
Bioinformatics applications daily all over the world. Algorithms
that retrieve the optimal result have quadratic time complexity,
requiring a huge amount of computing power when the sequences
compared are long. In order to reduce the execution time, many
parallel solutions have been proposed in the literature. Neverthe-
less, depending on the sizes of the sequences, even those parallel
solutions take hours or days to complete. Pruning techniques can
significantly improve the performance of the parallel solutions
and a few approaches have been proposed to provide pruning
capabilities for sequence comparison applications. This paper
proposes and evaluates a variant of the block pruning approach
that runs in multiple GPUs, in homogeneous or heterogeneous en-
vironments. Experimental results obtained with DNA sequences
in two testbeds show that significant performance gains are
obtained with pruning, compared to its non-pruning counterpart,
achieving the impressive performance of 694.8 GCUPS (Billions
of Cells Updated per Second) for four GPUs.
Index Terms—bioinformatics, DNA alignment, GPU, pruning
I. INTRODUCTION
Bioinformatics produces solutions that are used by various
fields of study, such as medicine and biology [1]. Biological
sequence comparison operations are executed several times
daily all over the world, either in stand-alone mode or in-
corporated into Bioinformatics applications to solve complex
problems such as evolutionary relationship determination and
drug design. Due to their quadratic time complexity, sequence
comparison algorithms that retrieve the optimal result can
take a lot of time. In order to reduce the execution time of
such algorithms, parallel solutions have been proposed in the
literature over the last decades.
The type of parallelism provided by Graphics Processing
Units (GPUs) makes these devices a very good alternative
to run sequence comparisons [2] [3]. CUDAlign 4.0 [3] is
a state-of-the-art tool that compares huge DNA sequences
in multiple GPUs and obtains the optimal result, combining
the Gotoh [4] and the Myers-Miller [5] algorithms. Using
384 GPUs, it was able to compare the homologous human x
chimpanzee chromosomes 5 (180 Million Base Pairs – MBP
– each) in 53 minutes, computing a matrix of 33.04 Petacells
at 10.37 TCUPS (Trillions of Cells Updated per Second). In
an earlier version for one GPU (CUDAlign 2.1 [6]), the block
pruning (BP) strategy was proposed to avoid the computation
of parts of the matrix that surely will not lead to the optimal
solution, with good results for one GPU. Further versions of
CUDAlign present pruning capabilities only for single GPU
executions. SW# [7] implemented the original MM algorithm
and extended the block pruning strategy [6] to be used in two
GPUs, but the performance was just a little better than the
execution of CUDAlign in one GPU [7]. As far as we know,
there is no work in the literature that obtains the optimal result
with pruning using more than two GPUs. Other works use
CPUs [8], FPGAs [9] or hybrid environments [10], but they
are outside the scope of this paper.
This paper proposes and evaluates Multi-BP, an adaptation
of block pruning for multiple GPUs. It is based on static
distribution and dynamic sharing of pruning information, lead-
ing to considerable performance gains in medium-sized GPU
environments. Multi-BP combines the multi-GPU CUDAlign
version [3] and the pruning technique proposed in [6]. The
challenges to design Multi-BP were the following: (a) ensure
that Multi-BP will not affect the performance in single GPU
executions; (b) adapt the calculation of the index of each GPU
block of cells and the evaluation of the pruning window to a
multiple GPU environment; (c) disseminate the pruning infor-
mation obtained by each GPU to all others with low overhead;
and (d) adjust the pruning technique to the heterogeneous GPU
environments, considering that the DP matrix might not be
partitioned evenly among the GPUs.
Experimental results obtained with real DNA sequences
with sizes varying from 1 to 63 MBP in two computing envi-
ronments show that very good gains were attained with Multi-
BP. The execution time of the comparison of chromosome 20
(human x chimpanzee) in a heterogeneous environment
(GTX 980 Ti + GTX 680) was reduced from 8h17min (without
Multi-BP) to 4h55min (with Multi-BP).
The remainder of this paper is organized as follows. In
Section II we present the pairwise sequence alignment problem
and in Section III we discuss pruning approaches and the block
pruning technique. Section IV discusses solutions that execute
biological sequence comparisons in multiple GPUs. Section V
describes the design of Multi-BP and Section VI details the
experiments. Finally, Section VII concludes the paper.
II. PAIRWISE BIOLOGICAL SEQUENCE COMPARISON
The field of Bioinformatics [1] demands continuous pro-
cessing improvements. Due to the huge volume of data and
performance requirements, new parallel algorithms and tools
are proposed regularly, aiming to provide faster executions.
In particular, the alignment of biological sequences (proteins, RNA or DNA) is one of the most relevant operations in this field.
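For reference, the optimal local alignment score underlying all the comparisons in this talk can be computed with the classic Smith-Waterman recurrence. A minimal sketch follows (plain Python with a linear gap penalty, whereas CUDAlign uses Gotoh's affine-gap variant and Myers-Miller linear space):

```python
# Minimal Smith-Waterman: optimal local alignment score in O(m*n)
# time and O(n) space. Linear gap penalty for brevity; CUDAlign
# combines Gotoh (affine gaps) with Myers-Miller (linear space),
# which this sketch omits.

def smith_waterman(s0, s1, match=3, mismatch=-3, gap=-2):
    cols = len(s1) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(s0) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if s0[i - 1] == s1[j - 1] else mismatch)
            # Local alignment: scores never drop below zero.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best

print(smith_waterman("GGTTGACTA", "TGTTACGG"))  # 13 (alignment GTTGAC / GTT-AC)
```

The quadratic cell count of this recurrence is exactly why chromosome-wide comparisons reach Peta-cell matrices and need GPUs and pruning.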
We obtained the optimal score of the chromosome 21 human x chimpanzee comparison (46 MBP x 47 MBP) using 4 NVidia Pascal GPUs in 56 minutes (680.81 GCUPS).
38. • Challenge: adapt the workload to a dynamic
pruning scenario.
– Execution is paused at some points: overhead
– Use in cases where the load balancing benefits are
higher than the overhead
– Sequences which are not very similar do not have a
big pruning area.
MASA-CUDAlign: Goal and Versions
Multi-GPU with Pruning
Load Balancing – ongoing work
39. [Figure: cyclic column distribution among GPU1–GPU4, shown without and with breakpoints.]
Multi-GPU with Pruning
Score sharing + cyclic + load balancing
41. Multi-GPU with Pruning
Score sharing + cyclic + load balancing
• Best result in the literature for GPUs:
– 10.3 TCUPS with 384 NVidia M2090 GPUs + Intel CPU
• Result obtained in the platform:
– 2.7 TCUPS with 8 NVidia Volta GPUs + Power9 CPU
• We estimate that we are able to beat the best result
for GPUs (10.3 TCUPS) with 40 NVidia V100 and
the best theoretical result (53 TCUPS) with 256
NVidia V100
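The estimate above can be reproduced with back-of-the-envelope arithmetic (a sketch assuming linear scaling; real multi-GPU scaling is sublinear, which is presumably why the slide quotes more GPUs than the naive projection below):

```python
import math

# Back-of-the-envelope projection from the measured 2.7 TCUPS on 8 V100s.
measured_tcups, measured_gpus = 2.7, 8
per_gpu = measured_tcups / measured_gpus  # ~0.34 TCUPS per V100

def gpus_needed(target_tcups):
    """GPUs required under (optimistic) linear scaling."""
    return math.ceil(target_tcups / per_gpu)

print(gpus_needed(10.3))  # 31 under linear scaling; the slide estimates 40
print(gpus_needed(53.0))  # 158 under linear scaling; the slide estimates 256
```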
43. • We compared MASA-OpenMP (CPU) running
in the IBM Power9, Intel i7 and Intel Xeon
platforms for the 1M x 1M comparison.
• Intel i7 (4 cores - Skylake)
– GCUPS: 1.1, time: 16 minutes (962.9 seconds)
• Intel Xeon (24 cores – Haswell)
– GCUPS: 4.5, time: 4 minutes (247.16 seconds)
• IBM Power (22 cores – Power9)
– GCUPS: 6.1, time: 3 minutes (181.4 seconds)
MASA in CPUs (MASA-OpenMP)
MASA-OpenMP – CPU Comparison
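The GCUPS figures above follow directly from the definition: GCUPS = (|S0| * |S1|) / (time in seconds * 10^9). A quick check, assuming exactly 1M x 1M cells (the computed values land slightly below the quoted 1.1 / 4.5 / 6.1, which suggests the actual sequences are a bit longer than 1M):

```python
# GCUPS = (cells in the DP matrix) / (execution time in seconds) / 1e9.
def gcups(len0, len1, seconds):
    return len0 * len1 / seconds / 1e9

# Assuming exactly 1M x 1M (= 1e12 cells):
print(round(gcups(1_000_000, 1_000_000, 962.9), 2))   # 1.04  (i7, quoted 1.1)
print(round(gcups(1_000_000, 1_000_000, 247.16), 2))  # 4.05  (Xeon, quoted 4.5)
print(round(gcups(1_000_000, 1_000_000, 181.4), 2))   # 5.51  (Power9, quoted 6.1)
```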
44. • We used MASA-OpenMP (CPU) for
these comparisons since the SARS-CoV-2
sequences are short (about 30 thousand
characters)
• We first compared SARS-CoV-2 sequences
from China, Brazil, USA, India and
Japan
– Conclusion: very similar sequences
• We then compared SARS-CoV-2
sequences from Brazil to MERS and SARS:
– Even though the sequences are quite
similar, there are regions of interest
MASA in CPUs (MASA-OpenMP)
Ongoing COVID-19 study – IBM Power9
46. Thanks to…
• My former PhD student, Edans F. O. Sandes, my PhD
student Marco Figueiredo Jr. and my undergrad student
Bernardo Nascimento
• George Teodoro, UFMG, Brazil
• Maria Emilia Walter, University of Brasilia, Brazil
• Eduard Ayguade, Xavier Martorell and Guillermo
Miranda, Universitat Politecnica de Catalunya and
Barcelona Supercomputing Center
• And Azzedine Boukerche, University of Ottawa, Manuel
Ujaldon, University of Malaga, Samuel Thibault, University
of Bordeaux, Genaina Rodrigues, University of Brasilia,
Celia Ralha, University of Brasilia, a couple of MSc students
and many undergrad students
48. The MASA code, including MASA-CUDAlign and
MASA-OpenMP, is available at
https://github.com/edanssandes/MASA-Core
MASA code
The MASA code (GPU, CPU, Intel Phi) was used in the following institutions:
Brazil – University of Brasilia, Fed Univ Rio Grande do Sul, NVidia Brazil, NEC Brazil
Croatia – University of Zagreb
France – University of Bordeaux
India - Manonmaniam Sundaranar University
Japan – NEC Japan
Singapore – Agency for Science Technology and Research
Spain – Polytechnic University of Catalonia and University of Malaga
USA – University of Delaware and IBM USA
We are open to collaborations!