
High-Performance Computing


This Presentation was prepared by Abdussamad Muntahi for the Seminar on High Performance Computing on 11/7/13 (Thursday) Organized by BRAC University Computer Club (BUCC) in collaboration with BRAC University Electronics and Electrical Club (BUEEC).

Published in: Education, Technology

  1. High-Performance Computing (HPC) Prepared By: Abdussamad Muntahi © Copyright: Abdussamad Muntahi & BUCC, 2013
  2. Introduction • High-speed computing, originally implemented only in supercomputers for scientific research • Tools and systems are now available to implement and create high-performance computing systems • Used for scientific research and computational science • The main area of the discipline is developing parallel processing algorithms and software, so that programs can be divided into small independent parts and executed simultaneously by separate processors • HPC systems have shifted from supercomputers to computing clusters
  3. What is a Cluster? • A cluster is a group of machines interconnected in a way that they work together as a single system • Terminology o Node – an individual machine in a cluster o Head/Master node – connected to both the private network of the cluster and a public network, and used to access a given cluster. Responsible for providing users an environment to work in and for distributing tasks among the other nodes o Compute nodes – connected only to the private network of the cluster and generally used for running jobs assigned to them by the head node(s)
  4. What is a Cluster? • Types of Cluster o Storage – storage clusters provide a consistent file system image, allowing simultaneous reads and writes to a single shared file system o High-availability (HA) – provide continuous availability of services by eliminating single points of failure o Load-balancing – sends network service requests to multiple cluster nodes to balance the request load among the cluster nodes o High-performance – uses cluster nodes to perform concurrent calculations, allowing applications to work in parallel to enhance their performance; also referred to as computational clusters or grid computing
  5. Benefits of Cluster • Reduced Cost o The price of off-the-shelf consumer desktops has plummeted in recent years, and this drop in price has corresponded with a vast increase in their processing power and performance. The average desktop PC today is many times more powerful than the first mainframe computers. • Processing Power o The parallel processing power of a high-performance cluster can, in many cases, prove more cost effective than a mainframe with similar power. This reduced price-per-unit of power enables enterprises to get a greater ROI (Return On Investment). • Scalability o Perhaps the greatest advantage of computer clusters is the scalability they offer. While mainframe computers have a fixed processing capacity, computer clusters can easily be expanded as requirements change by adding nodes to the network.
  6. Benefits of Cluster • Improved Network Technology o In clusters, computers are typically connected via a single virtual local area network (VLAN), and the network treats each computer as a separate node. Information can be passed throughout these networks with very little lag, ensuring that data doesn't bottleneck between nodes. • Availability o When a mainframe computer fails, the entire system fails. However, if a node in a computer cluster fails, its operations can simply be transferred to another node within the cluster, ensuring that there is no interruption in service.
  7. Invention of HPC • The need for ever-increasing performance • And the visionary concept of parallel computing
  8. Why we need ever-increasing performance • Computational power is increasing, but so are our computation problems and needs. • Some examples: – Case 1: Complete a time-consuming operation in less time • I am an automotive engineer • I need to design a new car that consumes less gasoline • I'd rather have the design completed in 6 months than in 2 years • I want to test my design using computer simulations rather than building very expensive prototypes and crashing them – Case 2: Complete an operation under a tight deadline • I work for a weather prediction agency • I am getting input from weather stations/sensors • I'd like to predict tomorrow's forecast today
  9. Why we need ever-increasing performance – Case 3: Perform a high number of operations per second • I am an engineer at • My Web server gets 1,000 hits per second • I'd like my web server and databases to handle 1,000 transactions per second so that customers do not experience long delays
  10. Why we need ever-increasing performance • Application areas: climate modeling, protein folding, drug discovery, energy research, data analysis
  11. Where are we using HPC? • Used to solve complex modeling problems in a spectrum of disciplines • Topics include: artificial intelligence, climate modeling, automotive engineering, cryptographic analysis, geophysics, molecular biology, molecular dynamics, nuclear physics, physical oceanography, plasma physics, quantum physics, quantum chemistry, solid state physics, and structural dynamics • HPC is currently applied to business uses as well o data warehouses o transaction processing
  12. Top 10 Supercomputers for HPC – TOP500 list, June 2013 (TOP500.Org)
  13. Fastest Supercomputer: Tianhe-2 (MilkyWay-2) at China's National University of Defense Technology
  14. Changing times • From 1986 to 2002, microprocessors were speeding like a rocket, increasing in performance an average of 50% per year • Since then, it's dropped to about a 20% increase per year
  15. The Problem • Up to now, performance increases have been driven by the increasing density of transistors • But there are inherent problems • A little physics lesson – Smaller transistors = faster processors – Faster processors = increased power consumption – Increased power consumption = increased heat – Increased heat = unreliable processors
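The chain of consequences above is commonly summarized by the dynamic power equation for CMOS circuits (a standard result, not stated in the slides): switching power grows with capacitance, the square of the supply voltage, and the clock frequency, which is why clock rates could not simply keep climbing.

```latex
P_{\text{dynamic}} \approx C \cdot V^{2} \cdot f
```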
  16. An intelligent solution • Move away from single-core systems to multicore processors • "core" = processing unit • Introduction of parallelism!!! • But … – Adding more processors doesn't help much if programmers aren't aware of them… – … or don't know how to use them. – Serial programs don't benefit from this approach (in most cases)
  17. Parallel Computing • A form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently, i.e. "in parallel" • So, we need to rewrite serial programs so that they're parallel • Or write translation programs that automatically convert serial programs into parallel programs – This is very difficult to do – Success has been limited
  18. Parallel Computing • Example – Compute n values and add them together – Serial solution: add the values one at a time in a single loop
  19. Parallel Computing • Example – We have p cores, p much smaller than n – Each core performs a partial sum of approximately n/p values – Each core uses its own private variables and executes this block of code independently of the other cores
  20. Parallel Computing • Example – After each core completes execution of the code, its private variable my_sum contains the sum of the values computed by its calls to Compute_next_value – E.g., if n = 200, then • Serial – will take 200 additions • Parallel (for 8 cores) – each core will perform n/p = 25 additions – And the master will perform 8 more additions + 8 receive operations – 41 operations in total
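The slides show this example only as figures; the sketch below reconstructs it in Python, assuming Compute_next_value simply yields the i-th input value (the names compute_next_value, serial_sum, and parallel_sum are illustrative, not from the slides):

```python
# Reconstruction of the slides' sum example: n values added serially
# vs. with p cores each computing a private partial sum.

def compute_next_value(i):
    # Stand-in for the slides' abstract Compute_next_value.
    return i

def serial_sum(n):
    total = 0
    for i in range(n):
        total += compute_next_value(i)   # n additions in total
    return total

def parallel_sum(n, p):
    # Each "core" sums its own slice of n/p values into a private my_sum...
    partials = []
    for core in range(p):
        my_sum = 0
        for i in range(core * (n // p), (core + 1) * (n // p)):
            my_sum += compute_next_value(i)   # n/p additions per core
        partials.append(my_sum)
    # ...then the master receives the p partial sums and adds them.
    total = 0
    for s in partials:
        total += s                            # p more additions
    return total

print(serial_sum(200) == parallel_sum(200, 8))  # True
```

For n = 200 and p = 8 this matches the slide's count: 25 additions per core, plus 8 additions (and 8 receives) on the master.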
  21. Parallel Computing • Some coding constructs can be recognized by an automatic program generator and converted to a parallel construct • However, it's likely that the result will be a very inefficient program • Sometimes the best parallel solution is to step back and devise an entirely new algorithm • Parallel computer programs are more difficult to write than sequential programs • Potential problems – Race conditions (output depending on the sequence or timing of other events) – Communication and synchronization between the different subtasks
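To make the race-condition hazard concrete, here is a small sketch (my illustration, not from the slides) that forces the unlucky interleaving by hand; in a real multithreaded program the same lost update happens nondeterministically:

```python
# Two tasks each increment a shared counter via read -> add -> write.
# With the interleaving below, both read the old value before either
# writes, so one increment is lost (a classic race condition).
counter = 0

read_a = counter       # task A reads 0
read_b = counter       # task B also reads 0, before A has written
counter = read_a + 1   # task A writes 1
counter = read_b + 1   # task B overwrites with 1 -> A's update is lost

print(counter)  # 1, although two increments ran
```

The usual fix is to make the read-modify-write atomic, e.g. by holding a lock around it, which is exactly the synchronization cost the slide warns about.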
  22. Parallel Computing • Parallel computer classification – The semiconductor industry has settled on two main trajectories • Multicore trajectory – CPU – a few coarse, heavyweight threads; maximizes the speed of sequential programs, giving better performance per thread • Many-core trajectory – GPU – a large number of much smaller cores to improve the execution throughput of parallel applications; many fine, lightweight threads whose single-thread performance is poor
  23. CPU vs. GPU • CPU – Uses sophisticated control logic to allow single-thread execution – Uses large caches to reduce the latency of instruction and data access – Neither of these contributes to the peak calculation speed (die diagram: control logic and cache dominate, with a few ALUs, backed by DRAM)
  24. CPU vs. GPU • GPU – Needs to conduct a massive number of floating-point calculations – Optimizes the execution throughput of massive numbers of threads – Cache memories help control the bandwidth requirements and reduce DRAM accesses
  25. CPU vs. GPU • Speed – Calculation speed: 367 GFLOPS vs. 32 GFLOPS – The ratio is about 10 to 1 for GPU vs. CPU – But the speed-up depends on • Problem set • Level of parallelism • Code optimization • Memory management
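The dependence on the level of parallelism is usually formalized by Amdahl's law (a standard result, not given in the slides): if a fraction f of a program can be parallelized across p processors, the overall speed-up is bounded by

```latex
S(p) = \frac{1}{(1 - f) + \frac{f}{p}}
```

so even a device that is 10x faster on the parallel part helps little once the serial fraction (1 - f) dominates the runtime.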
  26. CPU vs. GPU • Architecture: the CPU took a right-hand turn
  27. CPU vs. GPU • Architecture: the GPU is still keeping up with Moore's Law
  28. CPU vs. GPU • Architecture and Technology – Control hardware dominates μprocessors • Complex, difficult to build and verify • Scales poorly – You pay for max throughput but sustain only average throughput – Control hardware doesn't do any math! – The industry is moving from "instructions per second" to "instructions per watt" • Traditional μprocessors are not power-efficient – We can continue to put more transistors on a chip • … but we can't scale their voltage like we used to … • … and we can't clock them as fast …
  29. Why GPU? • The GPU is a massively parallel architecture – Many problems map well to GPU-style computing – GPUs have a large amount of arithmetic capability – Increasing amount of programmability in the pipeline – CPUs have dual- and quad-core chips, but a GPU currently has 240 cores (GeForce GTX 280) • Memory bandwidth – CPU – 3.2 GB/s; GPU – 141.7 GB/s • Speed – CPU – 20 GFLOPS (per core) – GPU – 933 (single-precision or int) / 78 (double-precision) GFLOPS • Direct access to compute units in new APIs
  30. CPU + GPU • The CPU and GPU together are a powerful combination – CPUs consist of a few cores optimized for serial processing – GPUs consist of thousands of smaller, more efficient cores designed for parallel performance – Serial portions of the code run on the CPU – Parallel portions run on the GPU – Performance is significantly faster • This idea ignited the movement of GPGPU (General-Purpose computation on GPUs)
  31. GPGPU • Using a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications • GPGPU computing offers speed by – Offloading compute-intensive portions of the application to the GPU – While the remainder of the code still runs on the CPU • Data-parallel algorithms take advantage of GPU attributes – Large data arrays, streaming throughput – Fine-grain SIMD (single-instruction multiple-data) parallelism – Low-latency floating point computation
  32. Parallel Programming • HPC parallel programming models are associated with different computing technologies – Parallel programming in CPU clusters – General-purpose GPU programming
  33. Operational Model: CPU • Originally designed for distributed memory architectures • Tasks are divided among p processes • Data-parallel, compute-intensive functions should be selected to be assigned to these processes • Functions that are executed many times, but independently on different data, are prime candidates – i.e. the body of for-loops
  34. Operational Model: CPU • The execution model allows each task to operate independently • The memory model assumes that memory is private to each task – Data is moved point-to-point between processes • Some collective computations are performed, and at the end results are gathered from the different processes – Requires synchronization after the tasks of each process end
  35. Programming Language: MPI • Message Passing Interface – An application programming interface (API) specification that allows processes to communicate with one another by sending and receiving messages – Now a de facto standard for parallel programs running on distributed memory systems in computer clusters and supercomputers – A message passing API with language-independent protocol and semantic specifications – Supports both point-to-point and collective communication – Communications are defined by the APIs
  36. Programming Language: MPI • Message Passing Interface – Goals are standardization, high performance, scalability, and portability – Consists of a specific set of routines (i.e. APIs) directly callable from C, C++, Fortran and any language able to interface with such libraries – A program consists of autonomous processes • The processes may run either the same code (SPMD style) or different codes (heterogeneous) – Processes communicate with each other via calls to MPI functions
  37. Operation Model: GPU • The CPU and GPU operate with separate memory pools • CPUs are masters and GPUs are workers – CPUs launch computations onto the GPUs – CPUs can be used for other computations as well – GPUs have limited communication back to CPUs • The CPU must initiate data transfer to the GPU memory – Synchronous data transfer – the CPU waits for the transfer to complete – Asynchronous data transfer – the CPU continues with other work and checks whether the transfer is complete
  38. Operation Model: GPU • The GPU cannot directly access main memory • The CPU cannot directly access GPU memory • Data must be copied explicitly
  39. Operation Model: GPU • The GPU is viewed as a compute device operating as a coprocessor to the main CPU (host) – Data-parallel, compute-intensive functions should be off-loaded to the device – Functions that are executed many times, but independently on different data, are prime candidates • i.e. the body of for-loops – A function compiled for the device is called a kernel – The kernel is executed on the device as many different threads – Both the host (CPU) and the device (GPU) manage their own memory – host memory and device memory
  40. Programming Language: CUDA • Compute Unified Device Architecture – Introduced by Nvidia in late 2006 – A compiler and toolkit for programming NVIDIA GPUs – The API extends the C programming language – Adds library functions to access the GPU – Adds directives to translate C into instructions that run on the host CPU or the GPU as needed – Allows easy multi-threading – parallel execution on all thread processors on the GPU – Runs on thousands of threads – It is a scalable model
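The slides contain no code; as an illustrative sketch (requires nvcc and an NVIDIA GPU to actually run; the name vecAdd is hypothetical), the canonical CUDA vector-add shows the kernel, launch, and explicit host/device copy pattern described in the operation model above:

```cuda
// Kernel: compiled for the device, executed by one thread per element.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

// Host side (sketch): allocate device memory, copy in, launch, copy back.
// cudaMalloc(&d_a, n * sizeof(float));
// cudaMemcpy(d_a, h_a, n * sizeof(float), cudaMemcpyHostToDevice);
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);   // launch threads
// cudaMemcpy(h_c, d_c, n * sizeof(float), cudaMemcpyDeviceToHost);
```

Note how the explicit cudaMemcpy calls reflect the earlier point that neither processor can directly access the other's memory.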
  41. Programming Language: CUDA • Compute Unified Device Architecture – A general-purpose programming model • The user kicks off batches of threads on the GPU • Specific language and tools – Driver for loading computation programs onto the GPU • Standalone driver – optimized for computation • Interface designed for compute – a graphics-free API • Explicit GPU memory management – Objectives • Express parallelism • Give a high-level abstraction from hardware
  42. The End