Dr REEJA S R
Associate Professor
CSE Department
Dayananda Sagar University – School of Engineering
Kudlu Gate, Bangalore
Talk given at DSCE, ISE Dept., Bangalore
 Introduction to HPC
 What is Python?
 Why Python
 Why Python for HPC
 Python in HPC
 Challenges
 Conclusions
 What is HPC?
 When do we need HPC?
 What does HPC Include?
 Rise & Fall of HPC Computer Architectures
• There is no clear definition
 Computing on high performance computers
 Solving problems / doing research using computer modeling, simulation and analysis
 Engineering design using computer modeling, simulation and analysis
• My understanding
 A huge number of computational and memory requirements
 Cannot be afforded by a PC efficiently
 Speeds and feeds are the keywords
• Who uses High-Performance Computing
 Research institutes, universities and government labs
 Weather and climate research, bioscience, energy, military, etc.
 Engineering design: more or less every product we use
 Automotive, aerospace, oil and gas exploration, digital media, financial simulation
 Mechanical simulation, package designs, silicon manufacturing, etc.
• Similar concepts
 Parallel computing: computing on parallel computers
 Supercomputing: computing on the world's 500 fastest supercomputers
• Case 1: Complete a time-consuming operation in less time
 I am an automotive engineer
 I need to design a new car that consumes less gasoline
 I’d rather have the design completed in 6 months than in 2 years
 I want to test my design using computer simulations rather than building very expensive prototypes and crashing them
• Case 2: Complete an operation under a tight deadline
 I work for a weather prediction agency
 I am getting input from weather stations/sensors
 I’d like to predict tomorrow’s forecast today
• Case 3: Perform a high number of operations per second
 I am an engineer at Amazon.com
 My web server gets 1,000 hits per second
 I’d like my web server and databases to handle 1,000 transactions per second so that customers do not experience bad delays
• High-performance computing is fast computing
• Computations in parallel over lots of compute elements (CPU,
GPU)
• Very fast network to connect between the compute elements
• Hardware
• Computer Architecture
• Vector Computers, MPP, SMP, Distributed Systems, Clusters
• Network Connections
• InfiniBand, Ethernet, proprietary (Myrinet, Quadrics, Cray SeaStar, etc.)
• Software
• Programming models
• MPI (Message Passing Interface), SHMEM (Shared Memory), PGAS (Partitioned Global Address Space), etc.
• Applications
• Open source, commercial
 Vector Computers (VC) - proprietary systems
 Provided the breakthrough needed for the emergence of computational science, but they were only a partial answer
 Massively Parallel Processors (MPP) - proprietary systems
 High cost and a low performance/price ratio.
 Symmetric Multiprocessors (SMP)
 Suffer from limited scalability
 Distributed Systems
 Difficult to use and hard to extract parallel performance
 Clusters – commodity and highly popular
 High Performance Computing - Commodity Supercomputing
 High Availability Computing - Mission Critical Applications
 Modern, interpreted, object-oriented, full-featured, high-level programming language
 Portable (Unix/Linux, Mac OS X, Windows)
 Open source; intellectual property rights are held by the Python Software Foundation
 Python versions: 2.x and 3.x
 Goal: develop a small Python program that runs multiple serial executions with different load-balancing techniques applied
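The goal above can be sketched without any parallel hardware by simulating how tasks are dealt out to workers. The example below is only an illustration: the task costs, the worker count and the three strategy functions (`block_partition`, `cyclic_partition`, `greedy_partition`) are assumptions, not part of the talk. It compares strategies by makespan, the finish time of the most heavily loaded worker.

```python
"""Compare three simple load-balancing strategies on a list of task costs."""


def block_partition(costs, workers):
    """Static: give each worker one contiguous chunk of tasks."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return [costs[i * chunk:(i + 1) * chunk] for i in range(workers)]


def cyclic_partition(costs, workers):
    """Static: deal tasks out round-robin, like cards."""
    return [costs[i::workers] for i in range(workers)]


def greedy_partition(costs, workers):
    """Dynamic-style: hand the next-largest task to the least-loaded worker."""
    bins = [[] for _ in range(workers)]
    loads = [0] * workers
    for cost in sorted(costs, reverse=True):
        idx = loads.index(min(loads))
        bins[idx].append(cost)
        loads[idx] += cost
    return bins


def makespan(bins):
    """Finish time of the slowest worker, which bounds the parallel runtime."""
    return max(sum(b) for b in bins)


# Hypothetical task costs (in seconds); later tasks are more expensive.
costs = [1, 2, 3, 4, 5, 6, 7, 8]
results = {
    "block": makespan(block_partition(costs, 4)),
    "cyclic": makespan(cyclic_partition(costs, 4)),
    "greedy": makespan(greedy_partition(costs, 4)),
}
print(results)
```

With these made-up costs the total work is 36 seconds, so four workers cannot finish in under 9 seconds. The more uneven the task costs, the more the choice of strategy matters.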
 Fast program development
 Simple syntax
 Easy to write readable code
 Large standard library
 Lots of third party libraries
 NumPy, SciPy
 mpi4py
 When you want to maximize productivity (not
necessarily performance)
 Mature language with large user base
 Huge collection of freely available software libraries
 High Performance Computing
 Engineering, Optimization, Differential Equations
 Scientific Datasets, Analysis, Visualization
 General-purpose computing
 web apps, GUIs, databases, and tons more
 Python can combine the best of both JIT- and AOT-compiled code.
 Write performance-critical loops and kernels in C/Fortran
 Write high-level logic and “boilerplate” in Python
 Unicode & bytes
 array
- Memory-efficient arrays of primitive types
 math
- Basic mathematical operations (statistics functions live in the separate statistics module)
 sqlite3
- File-based SQL storage engine
 collections
- Variety of container objects (deque, Counter and dictionary variants)
 A huge variety of third-party libraries, including NumPy, SciPy, etc.
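A short sketch of the standard-library modules listed above, using only modules that ship with Python. The sample values and the `runs` table are made up for illustration.

```python
import sqlite3
import statistics
from array import array
from collections import Counter, deque

# array: compact storage for one primitive type ('d' = C double).
samples = array("d", [2.0, 4.0, 4.0, 6.0])
mean = statistics.mean(samples)

# Counter counts hashable items; deque is a fast double-ended queue.
counts = Counter(samples)
window = deque(maxlen=2)  # keeps only the two most recent values
for value in samples:
    window.append(value)

# sqlite3: a SQL database stored in a file, or in memory as here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE runs (name TEXT, seconds REAL)")  # hypothetical table
conn.executemany("INSERT INTO runs VALUES (?, ?)", [("a", 1.5), ("b", 2.5)])
total = conn.execute("SELECT SUM(seconds) FROM runs").fetchone()[0]
conn.close()

print(mean, counts[4.0], list(window), total)
```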
 Libraries
 NumPy: a numerical Python library
 SciPy: scientific libraries
 pandas: library for data analysis
 scikit-learn: widely used machine learning library
 Biopython: bioinformatics library
 Tornado: web framework and asynchronous networking library for concurrency
 Database bindings: for communicating with virtually all databases, including Redis, MongoDB, HDF5 and SQL
 Web development frameworks: creating websites
 OpenCV: bindings for computer vision
 API bindings: for easy access to popular web APIs (Google, Twitter and LinkedIn)
 Matplotlib: python -m pip install matplotlib
 High level
- Lowers barriers, reduces time to solution
 Interfaces with the OS, libraries and other software
- Makes a great glue for automating the modern scientific workflow
- Sage ties together the biggest open source numeric software into a unified Python interface
- Reduces reinventing the wheel
 Open source
- Portable, free, transparent, verifiable
- Scales to an arbitrary number of nodes with no license costs
 Interpreted
- Interactive data analysis and plotting
- Interactive parallel computing
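 NumPy: array data structure
>>> from numpy.random import randn
>>> from pylab import hist, show
>>> hist(randn(10000), 100)  # 10,000 normal samples in 100 bins
>>> show()
>>> import math
>>> x = math.factorial(3)
>>> print("fact = %d" % x)
fact = 6
 Matrix
>>> import numpy as np
>>> np.array(np.matrix('1 2; 3 4'))  # parse a MATLAB-style string, return an ndarray
array([[1, 2],
       [3, 4]])
>>> array = [4, 2, 6]  # a plain Python list, despite the name
>>> array.append(1)
>>> print("before sorting", array)
before sorting [4, 2, 6, 1]
>>> array.sort()  # sorts in place
>>> print("after sorting", array)
after sorting [1, 2, 4, 6]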
 History of NumPy
 Features
– a powerful N-dimensional array object
– sophisticated (broadcasting) functions
– tools for integrating C/C++ and Fortran code
– useful linear algebra, Fourier transform, and
random number capabilities
 Development
– Based originally on Numeric by Jim Hugunin
– Also based on NumArray by Perry Greenfield
– Written by Travis Oliphant to bring both feature
sets together
 What makes an array so much faster?
 Data layout
– homogeneous: every item takes up the same size block of memory
– single data-type objects
– powerful array scalar types
 Universal functions (ufuncs)
– functions that operate on ndarrays in an element-by-element fashion
– vectorized wrappers for a function
– built-in functions are implemented in compiled C code
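To make the ufunc point concrete, the sketch below applies the same square root once with a Python-level loop and once with a single `np.sqrt` call, then times both with `timeit`. The array size and repeat count are arbitrary choices, and the timings depend on the machine, so none are quoted here.

```python
import math
import timeit

import numpy as np

values = np.linspace(0.0, 1.0, 100_000)  # homogeneous float64 array


def python_loop(xs):
    """Element by element in the interpreter."""
    return [math.sqrt(x) for x in xs]


def ufunc_call(xs):
    """Same operation as one ufunc call; the loop runs in compiled C."""
    return np.sqrt(xs)


loop_result = python_loop(values)
ufunc_result = ufunc_call(values)

loop_time = timeit.timeit(lambda: python_loop(values), number=5)
ufunc_time = timeit.timeit(lambda: ufunc_call(values), number=5)
print(f"loop: {loop_time:.4f}s  ufunc: {ufunc_time:.4f}s")
```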
 NumPy has a sophisticated view of data.
bool
int     int8    int16   int32   int64
uint8   uint16  uint32  uint64
float   float16 float32 float64
complex complex64 complex128
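The choice of data type fixes how many bytes each element occupies, and therefore how much memory and bandwidth an array needs. A small sketch, with an arbitrary array length of one million elements:

```python
import numpy as np

names = ("int8", "int32", "float32", "float64", "complex128")

# Bytes per element for a few of the types listed above.
sizes = {name: np.dtype(name).itemsize for name in names}

n = 1_000_000  # arbitrary array length for illustration
for name in names:
    arr = np.zeros(n, dtype=name)
    print(f"{name:>10}: {arr.itemsize:2d} bytes/item, {arr.nbytes / 1e6:.0f} MB")

# Fixed-width integers wrap around on overflow instead of growing.
wrapped = np.array([127], dtype=np.int8) + np.array([1], dtype=np.int8)
print(wrapped)
```

Smaller types save memory, but as the last two lines show, a fixed-width integer silently wraps when a result does not fit.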
 Help
>>> import pylab
>>> help(pylab)
 Speedups
- Use faster hardware: more cores, more cache, more GHz
- Use CPU vector instructions
- Byte code, and everything is an object
- Fast fetcher
- Load directly into NumPy arrays
- Improve RDBMS query speed
- Speed up data message
- Cache the previous day’s data
- Switch from batch to online architecture
- 6 process slots cut runtime to 2 hours
- Running fully parallel crashes the database
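The last two points describe bounded concurrency: enough parallel workers to cut the runtime, but not so many that the database falls over. The sketch below is one way to express that idea. It uses threads rather than processes, and `fetch` is a made-up stand-in for a real query. Only the slot count of 6 comes from the slide.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_SLOTS = 6  # slot count taken from the slide; tune for your database

active = 0
peak = 0
lock = threading.Lock()


def fetch(day):
    """Stand-in for a database query; records how many run at once."""
    global active, peak
    with lock:
        active += 1
        peak = max(peak, active)
    time.sleep(0.01)  # pretend to wait on the database
    with lock:
        active -= 1
    return day * 2  # dummy result


# 30 queries are submitted, but at most MAX_SLOTS are in flight at any time.
with ThreadPoolExecutor(max_workers=MAX_SLOTS) as pool:
    results = list(pool.map(fetch, range(30)))

print(f"peak concurrency: {peak}")
```

`pool.map` returns results in submission order, so the caller sees the same output as a serial loop would produce.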
 To speed up
 PyMPI
 Pynamic
 PyTrilinos
 ODIN
 Seamless
 PyMPI
- Was developed to extend Python’s scripting abilities to parallel and distributed codes
- Parallel extension modules are written
- Modules and processing can be combined in one convenient place to simplify processing
- A single Python script can provide setup, simulation, instruction and postprocessing
 Pynamic
- Tests a system’s linking and loading capabilities
- Pynamic drivers will perform a test of the MPI functionality
- Can also gather performance metrics, including the job startup time, module import time, function visit time and MPI test time
 PyTrilinos
- For parallel scientific computing, provides a high-level interface to the Trilinos Tpetra parallel linear algebra library
- This makes parallel linear algebra:
  - Easier to use via a simplified user interface
  - More intuitive through features such as advanced indexing
  - More useful by enabling access to it from the already extensive Python scientific software stack
 Optimized Distributed NumPy (ODIN)
- Builds on top of NumPy
- Provides a distributed array data structure that makes parallel array-based computations straightforward
- Provides built-in functions that work with distributed arrays
- Provides a framework for creating new functions that work with distributed arrays
ODIN’s approach has several advantages:
- Users have access to arrays in the same way that they think about them: either globally or locally.
- As ODIN arrays are easier to use and reason about than the MPI equivalent, this leads to faster iterative cycles, more flexibility when exploring parallel algorithms, and an overall reduction in total time-to-solution.
- ODIN is designed to work with existing MPI programs.
- By using Python, ODIN can leverage the ecosystem of speed-related third-party packages, either to wrap external code or to accelerate existing Python code.
- With the power and expressiveness of NumPy array slicing, ODIN can optimize distributed array expressions. These optimizations include loop fusion and array expression analysis to select the appropriate communication strategy between worker nodes.
• ODIN’s basic features (distributed array creation, unary and binary ufunc application, global and local modes of interaction) are prototyped and are currently being tested on systems and clusters with a small to mid-range number of nodes.
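The distributed-array idea can be illustrated with a toy, single-process stand-in. This is not ODIN's API: the `BlockArray` class and its methods are invented here purely to show creation, unary and binary ufunc application, and the local and global views, using plain NumPy blocks in place of real worker nodes.

```python
import numpy as np


class BlockArray:
    """Toy stand-in for a distributed array: one NumPy block per 'worker'."""

    def __init__(self, data, workers):
        # Distributed array creation: split the global array into local blocks.
        self.blocks = np.array_split(np.asarray(data), workers)

    def apply(self, ufunc, other=None):
        """Local mode: apply a unary or binary ufunc block by block."""
        out = BlockArray.__new__(BlockArray)
        if other is None:
            out.blocks = [ufunc(b) for b in self.blocks]
        else:
            out.blocks = [ufunc(a, b) for a, b in zip(self.blocks, other.blocks)]
        return out

    def gather(self):
        """Global mode: reassemble the full array from the local blocks."""
        return np.concatenate(self.blocks)

    def sum(self):
        """Global reduction built from per-block partial sums."""
        return sum(b.sum() for b in self.blocks)


x = BlockArray(np.arange(10.0), workers=3)
y = x.apply(np.square)   # unary ufunc on each block
z = y.apply(np.add, x)   # binary ufunc: x**2 + x
print(z.gather(), z.sum())
```

In a real distributed setting each block would live on a different node and the reduction would need communication. Here everything runs in one process, so only the programming model is shown.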
 Seamless
- Provides automatic, just-in-time compilation of Python source code.
- Aims to make node-level Python code as fast as compiled languages via dynamic compilation.
- Also allows effortless access to compiled libraries in Python, allowing easy integration of existing code bases written in statically typed languages.
• Schematic relation between PyTrilinos, ODIN, and Seamless.
• Each of the three packages is standalone.
• ODIN can use Seamless and PyTrilinos and the functionality that
these two packages provide.
• Seamless provides four principal features, while PyTrilinos
wraps several Trilinos solver packages.
 Python is too slow.
- Seamless allows compilation to fast machine code, either dynamically or statically.
 Python is yet another language to integrate with existing software.
- Seamless allows easy interaction between Python and other languages, and removes nearly all barriers to inter-language programming.
 The Python HPC ecosystem is too small.
- PyTrilinos provides access to a comprehensive suite of HPC solvers. Further, ODIN will provide a library of functions and methods designed to work with distributed arrays, and its design allows access to any existing MPI routines.
 Integrating all components is too difficult.
- ODIN provides a common framework to integrate disparate components for distributed computing.
 Performance.
- Processor capacity and memory bandwidth are scaling faster than system I/O.
- A solution is required that provides higher overall available I/O bandwidth per socket to accelerate message passing interface (MPI) rates for tomorrow’s HPC deployments.
 Cost and density.
- More components in a server limit density and increase fabric cost.
- An integrated fabric controller helps eliminate the additional costs and required space of discrete cards, enabling higher server density while freeing up a valuable PCIe slot for other storage and networking controllers.
 Reliability and power.
- Discrete interface cards consume many watts of power.
- An integrated interface card on the processor can draw less power with fewer discrete components.
 Python is a dynamic, object-oriented programming language.
 Because of its powerful and flexible syntax, Python excels as a platform for high performance computing and scientific computing.
 Versatility, simplicity of use, high portability and the large number of open source modules and packages make it very popular for scientific use.
 Although pure Python is generally slower than traditional compiled languages (C or Fortran), various techniques and libraries allow you to obtain performance comparable to that of the most common compiled languages, assuring a good balance between computational performance and time investment.
 reejasr@gmail.com
 reeja-cse@dsu.edu.in
Role of Python in HPC
