What is BLAS?
Basic Linear Algebra Subprograms (BLAS) is a specification that prescribes a set of low-level routines for performing common linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication.
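As a minimal sketch of what calling one of these routines looks like, the snippet below computes a dot product through the CBLAS binding (the C interface shipped with the reference BLAS, OpenBLAS, and others); the header name and the example values are assumptions for illustration only.

/* Minimal sketch: a dot product through the CBLAS interface.
 * Assumes a CBLAS header (e.g. from OpenBLAS or the Netlib reference
 * implementation) is available; link with -lcblas or -lopenblas. */
#include <stdio.h>
#include <cblas.h>

int main(void)
{
    double x[3] = {1.0, 2.0, 3.0};
    double y[3] = {4.0, 5.0, 6.0};

    /* ddot: double-precision dot product; the increments (1, 1)
     * say both vectors are stored contiguously. */
    double d = cblas_ddot(3, x, 1, y, 1);

    printf("x . y = %f\n", d);   /* prints 32.000000 */
    return 0;
}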
Why use BLAS?
Although the BLAS specification is general, BLAS implementations are often optimized for speed on a particular machine, so using them can bring substantial performance benefits. BLAS implementations take advantage of special floating-point hardware such as vector registers or SIMD instructions.
BLAS History
BLAS originated as a Fortran library in 1979, and its interface was standardized by the BLAS Technical (BLAST) Forum, whose latest BLAS report can be found on the netlib website. This Fortran library is known as the reference implementation (sometimes confusingly referred to as the BLAS library); it is not optimized for speed but is in the public domain.
Most libraries that offer linear algebra routines conform to the BLAS interface, allowing library users to develop programs that are indifferent to the BLAS library being used. BLAS implementations have seen a spectacular explosion in use with the development of GPGPU, with cuBLAS and rocBLAS being prime examples. CPU-based examples of BLAS libraries include OpenBLAS, BLIS (BLAS-like Library Instantiation Software), Arm Performance Libraries,[5] ATLAS, and Intel Math Kernel Library (MKL).
BLAS applications
Many numerical software applications use BLAS-compatible libraries to do linear algebra computations, including LAPACK, LINPACK, Armadillo, GNU Octave, Mathematica, MATLAB, NumPy, R, and Julia.
Background
Initially, linear algebra subroutines used hard-coded loops for their low-level operations. For example, if a subroutine needed to perform a matrix multiplication, then the subroutine would have three nested loops. Linear algebra programs have many common low-level operations (the so-called "kernel" operations, not related to operating systems).
Background
Between 1973 and 1977, several of these kernel operations were identified. These kernel operations became defined subroutines that math libraries could call. The kernel calls had advantages over hard-coded loops: the library routine would be more readable, there were fewer chances for bugs, and the kernel implementation could be optimized for speed. A specification for these kernel operations using scalars and vectors, the level-1 Basic Linear Algebra Subroutines (BLAS), was published in 1979. BLAS was used to implement the linear algebra subroutine library LINPACK.
Background
The BLAS abstraction allows customization for high performance. For example, LINPACK is a general-purpose library that can be used on many different machines without modification. LINPACK could use a generic version of BLAS. To gain performance, different machines might use tailored versions of BLAS. As computer architectures became more sophisticated, vector machines appeared. BLAS for a vector machine could use the machine's fast vector operations. (While vector processors eventually fell out of favor, vector instructions in modern CPUs are essential for optimal performance in BLAS routines.)
BLAS functionality
BLAS functionality is categorized into three sets of routines called "levels", which correspond both to the chronological order of definition and publication and to the degree of the polynomial in the complexities of the algorithms: Level 1 BLAS operations typically take linear time, O(n), Level 2 operations quadratic time, O(n²), and Level 3 operations cubic time, O(n³). Modern BLAS implementations typically provide all three levels.
Level 1
This level consists of all the routines described in the original presentation of BLAS (1979),[1] which defined only vector operations on strided arrays: dot products, vector norms, and a generalized vector addition of the form

αx + y → y

(referred to as axpy, "a x plus y").
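Through the CBLAS binding, the axpy operation above might be called as sketched below; the wrapper name and its parameters are illustrative assumptions, only the cblas_daxpy call follows the standard interface.

#include <cblas.h>

/* Level 1 sketch: y <- alpha*x + y (daxpy) on contiguous vectors. */
void scaled_add(int n, double alpha, const double *x, double *y)
{
    /* increments of 1 mean x and y are stored with unit stride */
    cblas_daxpy(n, alpha, x, 1, y, 1);
}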
Level 2
This level contains matrix-vector operations, including a generalized matrix-vector multiplication (gemv) of the form

αAx + βy → y

as well as a solver for x in the linear equation Tx = y, with T being triangular. Design of the Level 2 BLAS started in 1984, with results published in 1988. The Level 2 subroutines are especially intended to improve the performance of programs using BLAS on vector processors, where Level 1 BLAS are suboptimal "because they hide the matrix-vector nature of the operations from the compiler."
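A minimal CBLAS sketch of the gemv operation above, assuming a row-major m x n matrix A; the wrapper name and arguments are illustrative assumptions.

#include <cblas.h>

/* Level 2 sketch: y <- alpha*A*x + beta*y (dgemv),
 * with A stored row-major as an m x n array. */
void matvec(int m, int n, double alpha, const double *A,
            const double *x, double beta, double *y)
{
    cblas_dgemv(CblasRowMajor, CblasNoTrans,
                m, n, alpha, A, n,   /* lda = n for row-major A */
                x, 1, beta, y, 1);
}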
Level 3
This level, formally published in 1990,[19] contains matrix-matrix
operations, including a "general matrix multiplication" (gemm), of the form
αAB + βC → C
where A and B can optionally be transposed or Hermitian-conjugated inside the routine, and all three matrices may be strided. The ordinary matrix multiplication AB can be performed by setting α to one and C to an all-zeros matrix of the appropriate size.
Also included in Level 3 are routines for computing

αT⁻¹B → B,

where T is triangular (among other functionality, solving triangular systems with many right-hand sides).
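The gemm routine itself can be called through CBLAS as sketched below, with α = 1 and β = 0 so that it computes the plain product C = AB from the text; the wrapper name and row-major layout are assumptions.

#include <cblas.h>

/* Level 3 sketch: C <- alpha*A*B + beta*C (dgemm), all matrices
 * row-major; with alpha = 1 and beta = 0 this is C = A*B. */
void matmul(int m, int n, int k,
            const double *A,   /* m x k */
            const double *B,   /* k x n */
            double *C)         /* m x n */
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A, k,     /* lda = k */
                     B, n,     /* ldb = n */
                0.0, C, n);    /* ldc = n */
}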
Batched BLAS
The traditional BLAS functions have also been ported to architectures that support large amounts of parallelism, such as GPUs. Here, the traditional BLAS functions typically provide good performance for large matrices. However, when computing, e.g., matrix-matrix products of many small matrices using the GEMM routine, those architectures show significant performance losses. To address this issue, a batched version of the BLAS functions was specified in 2017.
Batched BLAS
Taking the GEMM routine from above as an example, the batched version
performs the following computation simultaneously for many matrices:
αA[k]B[k] + βC[k] → C[k]   ∀k

The index k in square brackets indicates that the operation is performed for all matrices k in a stack. Often, this operation is implemented for a strided batched memory layout where the matrices of the batch are stored back to back in the arrays A, B, and C.
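The semantics of such a strided batched GEMM can be sketched as an ordinary loop over CBLAS gemm calls; real batched implementations fuse this loop into a single call, but the memory layout is the same. All names and stride parameters below are illustrative assumptions.

#include <cblas.h>

/* Sketch of a strided batched GEMM: for each k in the batch,
 * C[k] <- alpha*A[k]*B[k] + beta*C[k].
 * The matrices are stored back to back, so matrix k starts at
 * A + k*strideA, B + k*strideB, C + k*strideC (row-major here). */
void batched_gemm(int batch, int m, int n, int kk,
                  double alpha, const double *A, long strideA,
                  const double *B, long strideB,
                  double beta, double *C, long strideC)
{
    for (int k = 0; k < batch; ++k) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    m, n, kk,
                    alpha, A + (long)k * strideA, kk,
                           B + (long)k * strideB, n,
                    beta,  C + (long)k * strideC, n);
    }
}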
Batched BLAS
Batched BLAS functions can be a versatile tool and allow, e.g., a fast implementation of exponential integrators and Magnus integrators that handle long integration periods with many time steps.[53] Here, the matrix exponentiation, the computationally expensive part of the integration, can be implemented in parallel for all time steps by using Batched BLAS functions.