CMPE 49B Sp. Top. in CMPE: Multi-Core Programming (Presentation Transcript)

  • CMPE 49B Spec. Topics in CMPE: Multi-core Programming (title slide; picture of ASCI White, the most powerful computer in the world in 2001)
  • Von Neumann Architecture
    • sequential computer
    (figure: CPU, RAM, and devices connected by a single BUS)
  • Memory Hierarchy (fast to slow): Registers, Cache, Real Memory, Disk, CD
  • History of Computer Architecture
    • 4 Generations (identified by logic technology)
      • Tubes
      • Transistors
      • Integrated Circuits
      • VLSI (very large scale integration)
  • PERFORMANCE TRENDS
    • Traditional mainframe/supercomputer performance 25% increase per year
    • But … microprocessor performance 50% increase per year since mid 80’s.
  • Moore’s Law
    • “Transistor density doubles every 18 months”
    • Moore is co-founder of Intel.
    • 60% increase per year
    • Exponential growth
    • PC costs decline.
    • PCs are building bricks of all future systems.
  • VLSI Generation
  • Bit Level Parallelism (up to mid-80s)
    • 4 bit microprocessors replaced by 8 bit, 16 bit, 32 bit etc.
    • doubling the width of the datapath reduces the number of cycles required to perform a full 32-bit operation
    • mid 80’s reap benefits of this kind of parallelism (full 32-bit word operations combined with the use of caches)
  • Instruction Level Parallelism (mid 80’s to mid 90’s)
    • Basic steps in instruction processing (instruction decode, integer arithmetic, address calculation) could be performed in a single cycle
    • Pipelined instruction processing
    • Reduced instruction set (RISC)
    • Superscalar execution
    • Branch prediction
  • Thread/Process Level Parallelism (mid 90’s to present)
    • On average, control transfers occur roughly once in every five instructions, so instruction level parallelism cannot be exploited at a much larger scale
    • Use multiple independent “threads” or processes
    • Concurrently running threads, processes
  • Evolution of the Infrastructure
    • Electronic Accounting Machine Era: 1930-1950
    • General Purpose Mainframe and Minicomputer Era: 1959-Present
    • Personal Computer Era: 1981 – Present
    • Client/Server Era: 1983 – Present
    • Enterprise Internet Computing Era: 1992- Present
  • Sequential vs Parallel Processing
    • Sequential:
      • physical limits reached
      • easy to program
      • expensive supercomputers
    • Parallel:
      • “raw” power unlimited
      • more memory, multiple caches
      • made up of COTS, so cheap
      • difficult to program
  • What is Multi-Core Programming ?
    • Answer: It is basically parallel programming on a single computer box (e.g. a desktop, a notebook, a blade)
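    • As a first illustration, a minimal sketch (not from the course material; it assumes a compiler with OpenMP support) of parallel programming on a single box in C, with one thread per core:

      /* hello.c: a minimal sketch of multi-core (shared-memory) parallelism.
       * Assumes OpenMP; compile with: gcc -fopenmp hello.c */
      #include <stdio.h>
      #include <omp.h>

      int main(void) {
          #pragma omp parallel      /* by default, one thread per core */
          {
              printf("hello from thread %d of %d\n",
                     omp_get_thread_num(), omp_get_num_threads());
          }
          return 0;
      }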
  • Amdahl’s Law
    • The serial percentage of a program is fixed, so the speed-up obtained by employing parallel processing is bounded.
    • Led to pessimism in the parallel processing community and prevented development of parallel machines for a long time.
    • With serial fraction s and P processors: Speedup = 1 / (s + (1 - s)/P)
    • In the limit (P → ∞): Speedup = 1/s
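    • A small worked example (the numbers are chosen here for illustration only), written out as a math block:

      \[
        S(P) = \frac{1}{s + \frac{1-s}{P}}, \qquad \lim_{P\to\infty} S(P) = \frac{1}{s}
      \]
      % With a serial fraction s = 0.1 and P = 16 processors:
      \[
        S(16) = \frac{1}{0.1 + 0.9/16} \approx 6.4, \qquad S(\infty) = 10
      \]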
  • Gustafson’s Law
    • The serial percentage is not fixed; it depends on the input (problem size) and the number of processors.
    • Broke/disproved Amdahl’s law.
    • Demonstrated more than 1000-fold speedup using 1024 processors.
    • Justified parallel processing
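    • For comparison, Gustafson’s scaled-speedup formula in its usual textbook form (not reproduced on the slide), where s is the serial fraction of the parallel run:

      \[
        S(P) = P - s\,(P - 1)
      \]
      % Example: s = 0.01 and P = 1024 give S(1024) = 1024 - 0.01 \cdot 1023 \approx 1013.8,
      % consistent with the 1000-fold speedups on 1024 processors mentioned above.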
  • Grand Challenge Applications
    • Important scientific & engineering problems identified by U.S. High Performance Computing & Communications Program (’92)
  • Flynn’s Taxonomy
    • classifies computer architectures according to:
      • Number of instruction streams it can process at a time
      • Number of data elements on which it can operate simultaneously
                               Data Streams
                               Single      Multiple
    Instruction   Single       SISD        SIMD
    Streams       Multiple     MISD        MIMD
  • SPMD Model (Single Program Multiple Data)
    • Each processor executes the same program asynchronously
    • Synchronization takes place only when processors need to exchange data
    • SPMD is an extension of SIMD (relaxes synchronized instruction execution)
    • SPMD is a restriction of MIMD (uses only one source/object)
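    • A minimal SPMD sketch in C using MPI (illustrative, not from the slides; MPI is just one possible realization): every process runs the same program and diverges only through its rank, synchronizing when data must be exchanged.

      /* spmd.c: run with e.g.  mpirun -np 4 ./spmd */
      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);

          int rank, size;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* who am I?  */
          MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many?  */

          /* Each process does its own local work asynchronously... */
          double partial = rank + 1.0;            /* stand-in for local work */

          /* ...and synchronizes only when data is exchanged. */
          double total = 0.0;
          MPI_Reduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

          if (rank == 0)
              printf("sum over %d processes = %g\n", size, total);

          MPI_Finalize();
          return 0;
      }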
  • Parallel Processing Terminology
    • Embarrassingly Parallel:
      • applications which are trivial to parallelize
      • large amounts of independent computation
      • Little communication
    • Data Parallelism :
      • model of parallel computing in which a single operation can be applied to all data elements simultaneously
      • amenable to SIMD or SPMD style of computation
    • Control Parallelism :
      • many different operations may be executed concurrently
      • require MIMD/SPMD style of computation
  • Parallel Processing Terminology
    • Scalability:
      • If the size of problem is increased, number of processors that can be effectively used can be increased (i.e. there is no limit on parallelism).
      • Cost of scalable algorithm grows slowly as input size and the number of processors are increased.
      • Data parallel algorithms are more scalable than control parallel algorithms
    • Granularity:
      • fine grain machines: employ massive number of weak processors each with small memory
      • coarse grain machines: smaller number of powerful processors each with large amounts of memory
  • Shared Memory Machines
    • Memory is globally shared, therefore processes (threads) see a single address space
    • Coordination of accesses to locations is done by use of locks provided by thread libraries
    • Example Machines: Sequent, Alliant, SUN Ultra, Dual/Quad Board Pentium PC
    • Example Thread Libraries: POSIX threads, Linux threads (a minimal sketch follows below)
    (figure: several processes/threads accessing one Shared Address Space)
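    • A minimal sketch of the shared-memory model with POSIX threads (illustrative code assuming a POSIX system; compile with -pthread): all threads share one address space, and a lock coordinates access to a shared counter.

      /* shared.c: compile with  gcc -pthread shared.c */
      #include <stdio.h>
      #include <pthread.h>

      #define NTHREADS 4

      static long counter = 0;                        /* shared: single address space */
      static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      static void *worker(void *arg) {
          (void)arg;
          for (int i = 0; i < 100000; i++) {
              pthread_mutex_lock(&lock);              /* coordinate access with a lock */
              counter++;
              pthread_mutex_unlock(&lock);
          }
          return NULL;
      }

      int main(void) {
          pthread_t tid[NTHREADS];
          for (int i = 0; i < NTHREADS; i++)
              pthread_create(&tid[i], NULL, worker, NULL);
          for (int i = 0; i < NTHREADS; i++)
              pthread_join(tid[i], NULL);
          printf("counter = %ld\n", counter);         /* 400000 */
          return 0;
      }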
  • Shared Memory Machines
    • can be classified as:
      • UMA: uniform memory access
      • NUMA: nonuniform memory access
      • based on the amount of time a processor takes to access local and global memory.
    (figure: three memory organizations built from processors (P), memories (M), and a bus or interconnection network, illustrating UMA and NUMA designs)
  • Distributed Memory Machines
    • Each processor has its own local memory (not directly accessible by others)
    • Processors communicate by passing messages to each other
    • Example Machines: IBM SP2, Intel Paragon, COWs (cluster of workstations)
    • Example Message Passing Libraries: PVM, MPI
    (figure: processes, each with a local memory M, communicating over a Network)
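    • A minimal message-passing sketch using MPI, one of the libraries named above (illustrative, not from the slides): since no process can read another’s local memory, data moves only via explicit send/receive.

      /* msg.c: run with  mpirun -np 2 ./msg */
      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char **argv) {
          MPI_Init(&argc, &argv);

          int rank;
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);

          if (rank == 1) {
              double msg = 3.14;                       /* lives in rank 1's local memory */
              MPI_Send(&msg, 1, MPI_DOUBLE, 0, /*tag=*/0, MPI_COMM_WORLD);
          } else if (rank == 0) {
              double msg;
              MPI_Recv(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
              printf("rank 0 received %g from rank 1\n", msg);
          }

          MPI_Finalize();
          return 0;
      }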
  • Beowulf Clusters
    • Use COTS, ordinary PCs and networking equipment
    • Has the best price/performance ratio
    (figure: a PC cluster)
  • Multi-Core Computing
    • A multi-core microprocessor is one which combines two or more independent processors into a single package, often a single integrated circuit.
    • A dual-core device contains only two independent microprocessors.
  • Comparison of Different Architectures: Single Core Architecture (one CPU state, one cache, one execution unit)
  • Comparison of Different Architectures: Multiprocessor (two separate processors, each with its own CPU state, cache, and execution unit)
  • Comparison of Different Architectures: Hyper-Threading Technology (two CPU states sharing a single cache and execution unit)
  • Comparison of Different Architectures: Multi-Core Architecture (two cores, each with its own CPU state, cache, and execution unit)
  • Comparison of Different Architectures: Multi-Core Architecture with Shared Cache (two cores, each with its own CPU state and execution unit, sharing one cache)
  • Comparison of Different Architectures: Multi-Core with Hyper-Threading Technology (two cores, each with two CPU states and its own cache and execution unit)
  • Top 10 Most Powerful Computers in the World (as of 6/2006)
    Rank | Site (Country) | System (Vendor) | Processors | Rmax (GFlops) | Rpeak (GFlops)
    1 | DOE/NNSA/LLNL (United States) | eServer Blue Gene Solution (IBM) | 131072 | 280600 | 367000
    2 | IBM Thomas J. Watson Research Center (United States) | eServer Blue Gene Solution (IBM) | 40960 | 91290 | 114688
    3 | DOE/NNSA/LLNL (United States) | eServer pSeries p5 575 1.9 GHz (IBM) | 12208 | 75760 | 92781
    4 | NASA/Ames Research Center/NAS (United States) | SGI Altix 1.5 GHz, Voltaire Infiniband (SGI) | 10160 | 51870 | 60960
    5 | Commissariat a l'Energie Atomique (CEA) (France) | NovaScale 5160, Itanium2 1.6 GHz, Quadrics (Bull SA) | 8704 | 42900 | 55705.6
    6 | Sandia National Laboratories (United States) | PowerEdge 1850, 3.6 GHz, Infiniband (Dell) | 9024 | 38270 | 64972.8
    7 | GSIC Center, Tokyo Institute of Technology (Japan) | Sun Fire X64 Cluster, Opteron 2.4/2.6 GHz, Infiniband (NEC/Sun) | 10368 | 38180 | 49868.8
    8 | Forschungszentrum Juelich (FZJ) (Germany) | eServer Blue Gene Solution (IBM) | 16384 | 37330 | 45875
    9 | Sandia National Laboratories (United States) | Red Storm Cray XT3, 2.0 GHz (Cray Inc.) | 10880 | 36190 | 43520
    10 | The Earth Simulator Center (Japan) | Earth-Simulator (NEC) | 5120 | 35860 | 40960
  • Most Powerful Computers in the World (as of 11/2007)
  • Top 500 Lists
    • http://www.top500.org/list/2007/11
    • http://www.top500.org/list/2007/06
    • …… ..
  • Application Areas in Top 500 List
  • Top 500 Statistics
    • http://www.top500.org/stats
  • Grid Computing
    • provide access to computing power and various resources just like accessing electrical power from the electrical grid
    • Allows coupling of geographically distributed resources
    • Provide inexpensive access to resources irrespective of their physical location or access point
    • Internet & dedicated networks can be used to interconnect distributed computational resources and present them as a single unified resource
    • Resources: supercomputers, clusters, storage systems, data resources, special devices
  • Grid Computing
    • the GRID is, in effect, a set of software tools which, when combined with hardware, lets users tap processing power off the Internet as easily as electrical power can be drawn from the electricity grid.
    • Examples of Grids:
      • TeraGrid (USA): http://www.teragrid.org
      • EGEE Grid (Europe): http://www.eu-egee.org/
      • TR-Grid (Turkey): http://www.grid.org.tr/
      • Sun Grid Compute Utility (Commercial, pay-per-use): http://www.network.com/
  • GRID COMPUTING (figure: analogy between the Power Grid and the Compute Grid)
      • Archeology
      • Astronomy
      • Astrophysics
      • Civil Protection
      • Comp. Chemistry
      • Earth Sciences
      • Finance
      • Fusion
      • Geophysics
      • High Energy Physics
      • Life Sciences
      • Multimedia
      • Material Sciences
    >250 sites, 48 countries, >50,000 CPUs, >20 PetaBytes, >10,000 users, >150 VOs, >150,000 jobs/day
  • Cloud Computing
    • Style of computing in which IT-related capabilities are provided “as a service”, allowing users to access technology-enabled services from the Internet (“in the cloud”) without knowledge of, expertise with, or control over the technology infrastructure that supports them.
    • General concept that incorporates software as a service (SaaS), Web 2.0 and other recent, well-known technology trends, in which the common theme is reliance on the Internet for satisfying the computing needs of the users.
  • Cloud Computing
    • Virtualisation provides separation between infrastructure and user runtime environment
    • Users specify virtual images as their deployment building blocks
    • Pay-as-you-go allows users to use the service when they want and only pay for what they use
    • Elasticity of the cloud allows users to start simple and explore more complex deployment over time
    • Simple interface allows easy integration with existing systems
  • Cloud computing is about much more than technological capabilities. Technology is the mechanism, but, as in any shift in business, the driver is economics. (Nicholas Carr, author of “The Big Switch”)
  • Better Economics: we want to pay only for what we use, and we want to control it accurately.
  • Facing New Challenges
    • Complexity of modern IT infrastructures: physical servers, virtual machines, clusters, Grids, geographical distribution
    • Cost of electricity
    • Credit crunch
    • Further pressures to reduce costs
    • Openness to the acceptable security concept
  • The Grid/Cloud
    • Advantages
    • Lower cost
    • Access to larger infrastructure
      • Faster calculations
      • More storage
    • Speed
      • Faster calculations
      • Easier provisioning
    • Disadvantages
    • Very complicated
    • Security
    • Lack of confidence
      • Trust
      • Compatibility
  • Grid and Clouds
    Issue: Why we need it? (The Problem)
      • Classic Grid Computing: to enable the R&D community to achieve its research goals in reasonable time; computation over large data sets, or of parallelizable compute-intensive applications.
      • Cloud Computing: reduce IT costs; on-demand scalability for all applications, including research, development and business applications.
    Issue: Main Target Market
      • Classic Grid Computing: first academia, second certain industries.
      • Cloud Computing: mainly industry.
    Issue: Business Model (where the money comes from?)
      • Classic Grid Computing: academia is sponsor-based (mainly government money); industry pays for internal implementations.
      • Cloud Computing: hosted by commercial companies, paid for by users; based on the economies of scale and expertise; only pay for what you need, when you need it (On-Demand + Pay per Use).
  • Example Cloud: Amazon Web Services
    • EC2 (Elastic Compute Cloud) is the computing service of Amazon
      • Based on hardware virtualisation
      • Users request virtual machine instances, pointing to an image (public or private) stored in S3
      • Users have full control over each instance (e.g. access as root, if required)
      • Requests can be issued via SOAP and REST
  • Example Cloud: Amazon Web Services
    • S3 (Simple Storage Service) is a service for storing and accessing data on the Amazon cloud
      • From a user’s point-of-view, S3 is independent from the other Amazon services
      • Data is organized hierarchically, grouped into buckets (i.e. containers) and objects
      • Data is accessible via various protocols
    • Elastic Block Store
      • Locally mounted storage
      • Highly available
  • Example Cloud: Amazon Web Services
    • Other AWS services:
      • SQS (Simple Queue Service)
      • SimpleDB
      • Billing services: DevPay
      • Elastic IP (Static IPs for Dynamic Cloud Computing)
      • Multiple Locations
  • Example Cloud: Amazon Web Services
    • Pricing information
    • http://aws.amazon.com/ec2/
  • Cloud Market: “By 2012, 80 percent of Fortune 1000 companies will pay for some cloud computing service, and 30 percent of them will pay for cloud computing infrastructure.” (Gartner, 2008)
  • EC2 – “Google of the Clouds”
    • According to Vogels (Amazon CTO), 370,000 developers have registered for Amazon Web Services since their start in 2002, and the company now spends more bandwidth on the developers than it does on e-commerce. http://www.theregister.co.uk/2008/06/26/amazon_trumpets_web_services/
    • In the last two months of 2007, usage of Amazon Web Services grew by 40%
    • $131 million revenues in Q1 from AWS
    • 60,000 customers
    • The majority of usage comes from banks, pharmaceuticals and other large corporations
  • Why Now? (Economy)
    • CIOs -> do more with less (energy costs / recession will boost it)
    • Lower cost for scalability
    • Enterprise IT budget: 80% spent on MAINTENANCE
    • On average, we utilize only 15% of our computing resources capacity
    • Peak-times economy
    • Enterprise IT is not the enterprise’s core business
    • Psychology of Internet/Cloud trust (SalesForce, Gmail, Internet banking, etc.)
    • Ideal for Developers
  • Models of Parallel Computers
    • 1. Message Passing Model
      • Distributed memory
      • Multicomputer
    • 2. Shared Memory Model
      • Multiprocessor
      • Multi-core
    • 3. Theoretical Model
      • PRAM
    • New architectures: combination of 1 and 2.
  • Theoretical PRAM Model
    • Used by parallel algorithm designers
    • Algorithm designers do not want to worry about low-level details: they want to concentrate on algorithmic details
    • Extends the classic RAM model
    • Consists of:
      • Control unit (common clock), synchronous
      • Global shared memory
      • Unbounded set of processors, each with its own private memory
  • Theoretical PRAM Model
    • Some characteristics
      • Each processor has a unique identifier, mypid=0,1,2,…
      • All processors operate synchronously under the control of a common clock
      • In each unit of time, each processor is allowed to execute an instruction or stay idle
  • Various PRAM Models (from weakest to strongest, according to how write conflicts to the same memory location are handled):
    • EREW (exclusive read / exclusive write)
    • CREW (concurrent read / exclusive write)
    • CRCW (concurrent read / concurrent write)
      • Common (all processors must write the same value)
      • Arbitrary (one processor is chosen arbitrarily)
      • Priority (the processor with the lowest index writes)
  • Algorithmic Performance Parameters
    • Notation
      • n : input size
      • T*(n) : time complexity of the best sequential algorithm
      • P : number of processors
      • T_P(n) : time complexity of the parallel algorithm when run on P processors
      • T_1(n) : time complexity of the parallel algorithm when run on 1 processor
  • Algorithmic Performance Parameters
    • Speed-Up: S(n, P) = T*(n) / T_P(n)
    • Efficiency: E(n, P) = S(n, P) / P
  • Algorithmic Performance Parameters
    • Work = Processors × Time
      • Informally: how much time the parallel algorithm would take to simulate on a serial machine
      • Formally: W(n, P) = P × T_P(n)
  • Algorithmic Performance Parameters
    • Work Efficient:
      • Informally: a work-efficient parallel algorithm does no more work than the best serial algorithm
      • Formally: a work-efficient algorithm satisfies P × T_P(n) = O(T*(n))
  • Algorithmic Performance Parameters
    • Scalability:
      • Informally, scalability implies that if the size of the problem is increased, the number of processors effectively used can be increased (i.e. there is no limit on parallelism)
      • Formally, scalability means:
  • Algorithmic Performance Parameters
    • Some remarks:
      • Cost of a scalable algorithm grows slowly as the input size and the number of processors are increased
      • Level of ‘control parallelism’ is usually a constant independent of problem size
      • Level of ‘data parallelism’ is an increasing function of problem size
      • Data parallel algorithms are more scalable than control parallel algorithms
  • Goals in Designing Parallel Algorithms
    • Scalability:
      • Algorithm cost grows slowly, preferably in a polylogarithmic manner
    • Work Efficient:
      • We do not want to waste CPU cycles
      • May be an important point when we are worried about power consumption or ‘money’ paid for CPU usage
  • Summing N numbers in Parallel
    • An array of N numbers can be summed in log(N) steps using N/2 processors
    (figure: pairwise reduction of x1..x8; after step 1 each pair is summed, after step 2 groups of four, after step 3 the result x1+..+x8 is available)
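    • A minimal sketch of the logarithmic reduction in C with OpenMP (illustrative; it assumes N is a power of two and pairs elements by stride rather than adjacency as in the figure, but performs the same log(N)-step tree sum):

      /* sum.c: compile with  gcc -fopenmp sum.c */
      #include <stdio.h>
      #include <omp.h>

      #define N 8   /* assume N is a power of two for simplicity */

      int main(void) {
          double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};

          /* log2(N) steps; in each step, pairs that are 'stride' apart
           * are added in parallel, one level of the reduction tree. */
          for (int stride = 1; stride < N; stride *= 2) {
              #pragma omp parallel for
              for (int i = 0; i < N; i += 2 * stride)
                  x[i] += x[i + stride];
          }
          printf("sum = %g\n", x[0]);   /* 36 */
          return 0;
      }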
  • Prefix Summing N numbers in Parallel
    • Computing partial sums of an array of N numbers can be done in log(N) steps using N processors
    (figure: log-step prefix computation on x1..x8; after step 1 each element holds a sum of two neighbours, after step 2 of four, and after step 3 element i holds xi+..+x8)
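    • A minimal sketch of the log-step prefix computation in C with OpenMP (illustrative; it computes the standard left-to-right inclusive prefix sums x1+..+xi, whereas the figure accumulates toward x8, but the stride-doubling technique is the same):

      /* prefix.c: compile with  gcc -fopenmp prefix.c */
      #include <stdio.h>
      #include <string.h>
      #include <omp.h>

      #define N 8

      int main(void) {
          double x[N] = {1, 2, 3, 4, 5, 6, 7, 8};
          double tmp[N];

          for (int stride = 1; stride < N; stride *= 2) {
              memcpy(tmp, x, sizeof x);          /* read from the previous step */
              #pragma omp parallel for
              for (int i = stride; i < N; i++)
                  x[i] = tmp[i] + tmp[i - stride];
          }
          for (int i = 0; i < N; i++)
              printf("%g ", x[i]);               /* 1 3 6 10 15 21 28 36 */
          printf("\n");
          return 0;
      }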
  • Prefix Paradigm for Parallel Algorithm Design
    • Prefix computation forms a paradigm for parallel algorithm development, just like other well-known paradigms such as:
      • divide and conquer, dynamic programming, etc.
    • Prefix Paradigm:
      • If possible, transform your problem into a prefix-type computation
      • Apply the efficient logarithmic prefix computation
    • Examples of Problems solved by Prefix Paradigm:
    • Solving linear recurrence equations
    • Tridiagonal Solver
    • Problems on trees
    • Adaptive triangular mesh refinement
  • Solving Linear Recurrence Equations
    • Given a linear recurrence equation of the form z_i = a_i z_{i-1} + b_i z_{i-2}
    • we can rewrite it as a first-order matrix recurrence on the vector (z_i, z_{i-1})
    • if we expand it, we get the solution in terms of partial products of the coefficient matrices and the initial values z_1 and z_0 (see the expansion below)
    • use prefix to compute the partial products
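    • One standard way to write the expansion (an illustrative reconstruction; the exact notation of the original slide is not preserved in this transcript):

      \[
        \begin{pmatrix} z_i \\ z_{i-1} \end{pmatrix}
        =
        \begin{pmatrix} a_i & b_i \\ 1 & 0 \end{pmatrix}
        \begin{pmatrix} z_{i-1} \\ z_{i-2} \end{pmatrix}
        =
        \underbrace{\begin{pmatrix} a_i & b_i \\ 1 & 0 \end{pmatrix}
                    \begin{pmatrix} a_{i-1} & b_{i-1} \\ 1 & 0 \end{pmatrix}
                    \cdots
                    \begin{pmatrix} a_2 & b_2 \\ 1 & 0 \end{pmatrix}}_{\text{partial products, computed by parallel prefix}}
        \begin{pmatrix} z_1 \\ z_0 \end{pmatrix}
      \]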
  • Pointer Jumping Technique
    • A linked list of N numbers can be prefix-summed in log(N) steps using N processors
    (figure: three pointer-jumping steps on the list x1..x8; after the last step, node i holds xi+..+x8)
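    • A minimal sketch of pointer jumping in C (illustrative; a serial loop stands in for the synchronous parallel step, and double buffering mimics all processors updating at once):

      /* jump.c: pointer jumping on a list of 8 values (1..8). */
      #include <stdio.h>

      #define N 8

      int main(void) {
          /* node i's successor is i+1; the last node points to -1 (nil) */
          int    next[N], new_next[N];
          double val[N],  new_val[N];
          for (int i = 0; i < N; i++) { next[i] = (i + 1 < N) ? i + 1 : -1; val[i] = i + 1; }

          for (int step = 0; step < 3; step++) {      /* log2(N) = 3 steps for N = 8 */
              for (int i = 0; i < N; i++) {           /* "in parallel" over all nodes */
                  if (next[i] != -1) {
                      new_val[i]  = val[i] + val[next[i]];   /* add successor's value */
                      new_next[i] = next[next[i]];           /* jump two hops ahead   */
                  } else {
                      new_val[i]  = val[i];
                      new_next[i] = -1;
                  }
              }
              for (int i = 0; i < N; i++) { val[i] = new_val[i]; next[i] = new_next[i]; }
          }
          for (int i = 0; i < N; i++)
              printf("%g ", val[i]);   /* suffix sums: 36 35 33 30 26 21 15 8 */
          printf("\n");
          return 0;
      }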
  • Euler Tour Technique: Tree Problems
    • Preorder numbering
    • Postorder numbering
    • Number of Descendants
    • Level of each node
    • To solve such problems, first transform the tree by linearizing it into a linked list (an Euler tour) and then apply the prefix computation
    (figure: example tree with nodes a, b, c, d, e, f, g, h, i)
  • Computing Level of Each Node by Euler Tour Technique
    • weight assignment: weights of +1 and -1 on the two directions of each tree edge
    • level(v) = pw(<v,parent(v)>), level(root) = 0
    (figure: the example tree’s Euler tour with initial weights w(<u,v>) and their prefix sums pw(<u,v>))
  • Computing Number of Descendants by Euler Tour Technique
    • weight assignment: weights of 0 and 1 on the two directions of each tree edge
    • # of descendants(v) = pw(<parent(v),v>) - pw(<v,parent(v)>), # of descendants(root) = n
    (figure: the example tree’s Euler tour with initial weights w(<u,v>) and their prefix sums pw(<u,v>))
  • Preorder Numbering by Euler Tour Technique
    • weight assignment: weights of 1 and 0 on the two directions of each tree edge
    • preorder(v) = 1 + pw(<v,parent(v)>), preorder(root) = 1
    (figure: the example tree with preorder numbers 1..9 and the Euler tour’s weights and prefix sums)
  • Postorder Numbering by Euler Tour Technique
    • weight assignment: weights of 0 and 1 on the two directions of each tree edge
    • postorder(v) = pw(<parent(v),v>), postorder(root) = n
    (figure: the example tree with postorder numbers and the Euler tour’s weights and prefix sums)