Parallel Programming


Traditionally, computer software has been written for serial computation. To solve a problem, an algorithm is constructed and implemented as a serial stream of instructions. These instructions are executed on a central processing unit on one computer. Only one instruction may execute at a time—after that instruction is finished, the next is executed.

Published in: Technology

  1. Parallel Programming, by Roman Okolovich
  2. Overview
     • Traditionally, computer software has been written for serial computation. To solve a problem, an algorithm is constructed and implemented as a serial stream of instructions. These instructions are executed on a central processing unit on one computer. Only one instruction may execute at a time; after that instruction is finished, the next is executed.
     • Nowadays a single machine (PC) can have a multi-core and/or multi-processor architecture.
     • In a multiprocessor architecture, two or more identical processors connect to a single shared main memory. Most common multiprocessor systems today use an SMP (symmetric multiprocessing) architecture. In the case of multi-core processors, the SMP architecture applies to the cores, treating them as separate processors.
  3. Speedup
     • The performance gained by using a multi-core processor depends strongly on the software algorithms and their implementation. In particular, the possible gains are limited by the fraction of the software that can be "parallelized" to run on multiple cores simultaneously; this effect is described by Amdahl's law.
     • In the best case, so-called embarrassingly parallel problems may realize speedup factors near the number of cores. Many typical applications, however, do not realize such large speedup factors, so the parallelization of software is a significant ongoing topic of research.
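The limit described by Amdahl's law can be made concrete with a small calculation. The sketch below assumes the usual form of the law, S(N) = 1 / ((1 - p) + p / N), where p is the parallelizable fraction of the run time and N is the number of cores; the function name `amdahl_speedup` is hypothetical and chosen only for illustration.

```cpp
#include <cassert>

// Amdahl's law: S(N) = 1 / ((1 - p) + p / N)
//   p = fraction of the run time that can be parallelized (0..1)
//   N = number of cores
// Illustrative helper; the name is not taken from the slides.
double amdahl_speedup(double parallel_fraction, int cores) {
    assert(parallel_fraction >= 0.0 && parallel_fraction <= 1.0);
    assert(cores >= 1);
    const double serial_fraction = 1.0 - parallel_fraction;
    return 1.0 / (serial_fraction + parallel_fraction / cores);
}
```

For example, a program that is 90% parallelizable gains a factor of about 4.7 on 8 cores, and no number of cores can push it past a factor of 10, because the serial 10% always remains.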
  4. Intel Atom
     • Nokia Booklet 3G: Intel® Atom™ Z530, 1.6 GHz.
     • Intel Atom is the brand name for a line of ultra-low-voltage x86 and x86-64 CPUs (microprocessors) from Intel, designed in 45 nm CMOS and used mainly in netbooks, nettops and MIDs.
     • Intel Atom can execute up to two instructions per cycle. The performance of a single-core Atom is around half that of an equivalent Celeron.
     • Hyper-threading (officially Hyper-Threading Technology, or HTT) is an Intel-proprietary technology used to improve the parallelization of computations (doing multiple tasks at once) on PC microprocessors.
     • The operating system treats a processor with hyper-threading enabled as two processors instead of one: only one processor is physically present, but the operating system sees two virtual processors and shares the workload between them.
     • The advantages of hyper-threading are listed as improved support for multi-threaded code, the ability to run multiple threads simultaneously, and improved reaction and response time.
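Because the operating system reports each hyper-thread as a separate processor, a program that asks how many threads it can run concurrently typically sees the virtual count rather than the physical one. A minimal sketch, assuming a C++11 compiler: `std::thread::hardware_concurrency()` is the standard call, the helper name `logical_processors` is hypothetical, and the fallback to 1 is an assumption made here because the standard allows the call to return 0 when the value cannot be determined.

```cpp
#include <cassert>
#include <thread>

// Returns the number of logical processors the runtime reports.
// On a single-core CPU with hyper-threading enabled this is typically 2.
// The value is only a hint; 0 means "unknown", which is mapped to 1 here.
unsigned logical_processors() {
    const unsigned reported = std::thread::hardware_concurrency();
    const unsigned count = (reported == 0) ? 1u : reported;
    assert(count >= 1);
    return count;
}
```

The result is a reasonable default for sizing a thread pool, although it does not distinguish physical cores from hyper-threads.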
  5. Instruction-level parallelism
     • Instruction-level parallelism (ILP) is a measure of how many of the operations in a computer program can be performed simultaneously. Consider the following program:

           1. e = a + b
           2. f = c + d
           3. g = e * f

     • Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously.
     • If we assume that each operation can be completed in one unit of time, then these three instructions can be completed in a total of two units of time, giving an ILP of 3/2.
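The 3/2 figure can be reproduced mechanically: divide the number of operations by the length of the longest dependency chain. The sketch below is one way to do that, assuming every operation takes one time unit and that there are enough execution units to run all ready operations at once; the function name `ilp` and the dependency-list representation are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// deps[i] lists the indices (all smaller than i) of the operations whose
// results operation i needs. Each operation is assumed to take one time unit.
// Returns (number of operations) / (length of the longest dependency chain).
double ilp(const std::vector<std::vector<std::size_t>>& deps) {
    assert(!deps.empty());
    std::vector<int> finish(deps.size(), 0);  // time at which each op completes
    int critical_path = 0;
    for (std::size_t i = 0; i < deps.size(); ++i) {
        int start = 0;  // earliest time at which all inputs are available
        for (std::size_t d : deps[i]) {
            assert(d < i);
            start = std::max(start, finish[d]);
        }
        finish[i] = start + 1;
        critical_path = std::max(critical_path, finish[i]);
    }
    return static_cast<double>(deps.size()) / critical_path;
}
```

For the three-operation program above, operations 1 and 2 have no dependencies and operation 3 depends on both, so the input is `{{}, {}, {0, 1}}` (zero-based indices) and the result is 1.5.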
  6. Qt 4's Multithreading
     • Qt provides thread support in the form of platform-independent threading classes, a thread-safe way of posting events, and signal-slot connections across threads. This makes it easy to develop portable multithreaded Qt applications and take advantage of multiprocessor machines.
     • QThread provides the means to start a new thread.
     • QThreadStorage provides per-thread data storage.
     • QThreadPool manages a pool of threads that run QRunnable objects.
     • QRunnable is an abstract class representing a runnable object.
     • QMutex provides a mutual exclusion lock, or mutex.
     • QMutexLocker is a convenience class that automatically locks and unlocks a QMutex.
     • QReadWriteLock provides a lock that allows simultaneous read access.
     • QReadLocker and QWriteLocker are convenience classes that automatically lock and unlock a QReadWriteLock.
     • QSemaphore provides an integer semaphore (a generalization of a mutex).
     • QWaitCondition provides a way for threads to go to sleep until woken up by another thread.
     • QAtomicInt provides atomic operations on integers.
     • QAtomicPointer provides atomic operations on pointers.
  7. OpenMP
     • The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms.
     • OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.
     • The designers of OpenMP wanted to provide an easy way to thread applications without requiring the programmer to know how to create, synchronize, and destroy threads, or even to decide how many threads to create. To achieve this, they developed a platform-independent set of compiler pragmas, directives, function calls, and environment variables that explicitly instruct the compiler how and where to insert threads into the application.
     • Most loops can be threaded by inserting only one pragma right before the loop. By leaving the low-level details to the compiler and OpenMP, you can spend more time determining which loops should be threaded and how best to restructure the algorithms for maximum performance.
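The "one pragma before the loop" idea can be sketched as follows. This is a minimal illustration, not code from the slides: the function name `sum_of_squares` is hypothetical, and the `reduction(+:total)` clause is added because the loop accumulates into a shared variable, which would otherwise be a data race.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Sums the squares of the input values, splitting the loop across threads.
// The loop index is a signed int, as OpenMP's classic loop rules require.
long long sum_of_squares(const std::vector<int>& values) {
    assert(values.size() <= static_cast<std::size_t>(std::numeric_limits<int>::max()));
    const int n = static_cast<int>(values.size());
    long long total = 0;
    // Each thread keeps a private partial sum; OpenMP adds them at the end.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; ++i) {
        total += static_cast<long long>(values[i]) * values[i];
    }
    return total;
}
```

If the compiler's OpenMP support is switched off (for example, GCC without `-fopenmp`), the pragma is ignored and the loop runs serially with the same result, which makes it easy to compare the two builds.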
  8. OpenMP Example
     OpenMP places the following five restrictions on which loops can be threaded:
     • The loop variable must be of type signed integer. Unsigned integers, such as DWORDs, will not work.
     • The comparison operation must be in the form loop_variable <, <=, >, or >= loop_invariant_integer.
     • The third expression, or increment portion, of the for loop must be either integer addition or integer subtraction by a loop-invariant value.
     • If the comparison operation is < or <=, the loop variable must increment on every iteration; conversely, if the comparison operation is > or >=, the loop variable must decrement on every iteration.
     • The loop must be a basic block, meaning no jumps from the inside of the loop to the outside are permitted, with the exception of the exit statement, which terminates the whole application. If the statements goto or break are used, they must jump within the loop, not outside it. The same goes for exception handling; exceptions must be caught within the loop.

     Code from the slide (the second and third snippets are fragments; n, a, b, numPixels and the bitmap pointers are declared elsewhere):

           #include <omp.h>
           #include <stdio.h>

           int main() {
               /* Every thread in the team executes the printf once. */
               #pragma omp parallel
               printf("Hello from thread %d, nthreads %d\n",
                      omp_get_thread_num(), omp_get_num_threads());
           }

           //-------------------------------------------
           #pragma omp parallel shared(n,a,b)
           {
               /* Iterations of the outer loop are divided among the threads. */
               #pragma omp for
               for (int i=0; i<n; i++)
               {
                   a[i] = i + 1;
                   #pragma omp parallel for
                   /*-- Okay - This is a parallel region --*/
                   for (int j=0; j<n; j++)
                       b[i][j] = a[i];
               }
           } /*-- End of parallel region --*/

           //-------------------------------------------
           /* RGB to grayscale; i must be a signed integer declared earlier. */
           #pragma omp parallel for
           for (i=0; i < numPixels; i++)
           {
               pGrayScaleBitmap[i] = (BYTE)
                   (pRGBBitmap[i].red * 0.299 +
                    pRGBBitmap[i].green * 0.587 +
                    pRGBBitmap[i].blue * 0.114);
           }
  9. OpenMP and Visual Studio
  10. Intel Threading Building Blocks (TBB)
      • Intel® Threading Building Blocks (Intel® TBB) is an award-winning C++ template library that abstracts threads into tasks to create reliable, portable, and scalable parallel applications. Just as the C++ Standard Template Library (STL) extends the core language, Intel TBB offers C++ users a higher-level abstraction for parallelism. To use Intel TBB, developers write familiar C++ templates and coding style, leaving low-level threading details to the library. It is also portable between architectures and operating systems.
      • Intel® TBB for Windows (Linux, Mac OS) costs $299 per seat.
      • Code from the slide (SubStringFinder, to_scan, max and pos are elided on the slide):

            #include <iostream>
            #include <string>
            #include "tbb/parallel_for.h"
            #include "tbb/blocked_range.h"

            using namespace tbb;
            using namespace std;

            int main() {
                //...
                // Split the index range [0, to_scan.size()) into chunks and
                // apply the SubStringFinder body to each chunk in parallel.
                parallel_for(blocked_range<size_t>(0, to_scan.size()),
                             SubStringFinder(to_scan, max, pos));
                //...
                return 0;
            }
  11. Parallel Patterns Library (PPL)
      • The Concurrency Runtime is a concurrent programming framework for C++. It simplifies parallel programming and helps you write robust, scalable, and responsive parallel applications.
      • The features that the Concurrency Runtime provides are unified by a common work scheduler. This work scheduler implements a work-stealing algorithm that enables your application to scale as the number of available processors increases.
      • The Concurrency Runtime enables the following programming patterns and concepts:
        • Imperative data parallelism: parallel algorithms distribute computations on collections or on sets of data across multiple processors.
        • Task parallelism: task objects distribute multiple independent operations across processors.
        • Declarative data parallelism: asynchronous agents and message passing enable you to declare what computation has to be performed, but not how it is performed.
        • Asynchrony: asynchronous agents make productive use of latency by doing work while waiting for data.
      • The Concurrency Runtime is provided as part of the C Runtime Library (CRT).
      • Only Visual Studio 2010 supports PPL.
  12. Concurrency Runtime Architecture
      • The Concurrency Runtime is divided into four components: the Parallel Patterns Library (PPL), the Asynchronous Agents Library, the work scheduler, and the resource manager. These components reside between the operating system and applications. An illustration on the slide shows how the Concurrency Runtime components interact with the operating system and applications.
      • Code from the slide (LongRunningOperation is not shown on the slide):

            #include <agents.h>   // call, asend
            using namespace Concurrency;

            // Message carrying the arguments of the long-running operation.
            struct LongRunningOperationMsg {
                LongRunningOperationMsg(int x, int y) : m_x(x), m_y(y) {}
                int m_x;
                int m_y;
            };

            // Message block that runs the operation for each message it receives.
            call<LongRunningOperationMsg>* LongRunningOperationCall =
                new call<LongRunningOperationMsg>([](LongRunningOperationMsg msg) {
                    LongRunningOperation(msg.m_x, msg.m_y);
                });

            void SomeFunction(int x, int y) {
                // Post the message asynchronously; asend takes the target by reference.
                asend(*LongRunningOperationCall, LongRunningOperationMsg(x, y));
            }