AMD Accelerated Computing - UFRJ

Usage Rights

© All Rights Reserved

  • Our new technology pillars that will help the channel differentiate
  • Explain how 3 monitors can be less expensive than a single 30” monitor, e.g. a 3 x 22” solution at roughly $500 versus a single 30” at over $1000. On productivity, also explain that ISVs continue to leverage multi-monitor setups, e.g. in MS Office 2010, PowerPoint can open multiple files in multiple windows.
  • Original legal approval – Maranello Platform Launch, March 2010. The first generation DCA introduced features now expected in the market. [Cover features at bottom quickly and go to next slide.]
  • Original legal approval – Maranello Platform Launch, March 2010. Today’s introduction brings DCA 2.0: four memory channels; 12 DIMMs per CPU; supports up to 12 cores today, and will support the next-gen core with up to 16 per CPU. Let’s take a closer look at the effect of memory on workloads. [Next slide.]
  • done
  • Add more deep blue computers
  • Add “All models ATI Radeon™”. Add “as of this date the HD5870 GPU has the highest GFLOPS/mm2 of all known products”.
  • Work on the slide (larger text)
  • Using ATI Stream technology, enjoy better visual quality when you watch streaming video online (YouTube/Hulu) with new video enhancement features.*
  • Let’s look at today’s compute platforms. On the left you have a Phenom II with 758 million transistors on 45nm process technology. On the right you see a 5870 DX11 GPU with 2.15 billion transistors on 40nm process technology. Today, with the emergence of visual computing, you see more work than ever before for the GPU, especially with what is arguably the most important workload for consumers: video. The explosion of HD video, and now HD gaming, means the GPU matters more than ever in the PC platform. More user-generated content puts more of the work onto the GPU, such as video processing and rendering and 3D user interfaces. The era of visual computing is already becoming more about mobility and being able to do more of what I’ve just described on the go. However, users do not want more compute capabilities at the expense of battery life or smaller form factors. Favoring one component over the other, or taking a niche approach to balanced visual computing platforms, does not meet the needs of the mass market. Usage scenarios favor a combination of GPU/CPU balance and low power.
  • Now, many of you are technologists, so you are probably glad to see me finally start talking about some technology: the workload changes are also dramatically impacting chip architectures. This chart does a good job of demonstrating the evolution of chip architectures. Starting on the left of the X axis, you go back in time to highly programmable, single-core CPUs, which aimed to increase throughput (Y axis) over time by first adding threads, then cores. GPUs, on the other hand, started out way to the right in terms of throughput and have been becoming more and more programmable. We call this evolution the move from Homogeneous Computing to Heterogeneous Computing, finally ending up at the top right, where the two arrows meet, in what we call an APU: a combination of different types of cores, working closely together on different types of workloads for optimum performance per watt per mm2. This is AMD’s architectural vision of the future and where we are heading with our first APU in 2011, the Llano processor, our first integrated CPU + GPU on a single piece of silicon.
  • WHERE WE ARE TODAY: Attempt to provide an environment in which optimized hardware can provide higher absolute performance, better power efficiency, and lower cost. At the same time, the goal is to dramatically improve programmer productivity, as the cost of software development is substantially the same as that of hardware development. This means support for heterogeneous multi-core hardware and a much more effective application programming environment are critical. This chart does a good job of summarizing the evolution of chip architectures. Starting on the left of the X axis, you go back in time to highly programmable, single-core CPUs, which aimed to increase throughput (Y axis) over time by first adding threads, then cores. GPUs, on the other hand, started out way to the right in terms of throughput and have been becoming more and more programmable. We call this evolution the move from Homogeneous Computing to Heterogeneous Computing, finally ending up at the top right, where the two arrows meet, in what we call an APU: a combination of different types of cores, working closely together on different types of workloads for optimum performance per watt per mm2.
  • The need for this optimal energy-efficient balance of CPU and GPU represents the beginning of a new era of computing in 2011. The Fusion of CPU and GPU compute power is what the next chapter in visual computing requires: a powerful visual computing experience at home or on the go, without compromise. Our AMD Fusion™ design is driven by mobility and is based on a low-power visual compute architecture that will enhance active and resting battery life while increasing both CPU and GPU performance. This is the culmination of the vision of ‘One AMD’, and only AMD can deliver the GPU and CPU combination that will be the future of computing.
  • Review slide to determine message
  • The industry has always tried to move away from proprietary technology and towards open standards when available. The proprietary Apple Display Connector never became popular, since DVI was license-free and widely available. 3dfx’s Glide API for 3D graphics did not last long in the market once DirectX was available on a wide variety of hardware. Nvidia’s Cg language was never widely used, since OpenGL and DirectX provided a compelling open alternative. The Unified Display Interface was a failed interface backed by Intel and Nvidia, which was deprecated in favor of the license-free DisplayPort standard. Rambus has tried to bring many proprietary memory technologies to market, but they have always been displaced by JEDEC open memory standards. CUDA is a proprietary GPGPU model whose specification is controlled by only one company; we believe it will soon be replaced by OpenCL and the DirectX Compute Shader.

Presentation Transcript

  • CPU
    GPU
    OpenCL
    DirectCompute
    Accelerated Computing
    Roberto Brandão
    AMD Latin America
  • Agenda
    X86 PROCESSOR EVOLUTION
    THE GPU AS AN ACCELERATOR
    ACCELERATED PROCESSING UNITS
    INTRODUCTION TO OpenCL
  • Evolving x86 Processors
  • AMD architecture: “Istanbul” six-core diagram
    [Diagram: native six-core processor; cores 1-6, each with its own L2 cache; shared L3 cache; crossbar; memory controller; HyperTransport links; chipset; PCI-e]
    Callouts: balanced caches; lower memory latency; fast full-duplex bus
  • 4P/24-core system example: very good scalability
    One memory controller for every processor
    Full-duplex HyperTransport links (up to 5.2GHz)
    Bus Optimization: HT Assist (Cache Probe Filtering)
    Still the only available 4P system with Direct Connect Architecture
    [Diagram: four processors, each with its own memory]
  • Direct Connect Architecture 1.0: Balanced and Scalable Design to Support up to 6 Cores
    [Diagram: four CPUs, each with 2 memory channels and 8 DIMMs per CPU]
    No front side bus
    HyperTransport™ technology
    Integrated memory controller
    NUMA memory architecture
  • Direct Connect Architecture 2.0: Balanced and Scalable Design to Support up to 16 Cores* per CPU
    [Diagram: four CPUs, each with 4 memory channels and 12 DIMMs per CPU]
    • 1-hop between processors
    • Four memory channels
    • Up to 50% more DIMMs
    • Up to 33% increase in CPU to CPU communication speed±
  • What is next for x86 CPUs
    • More processor cores to come (12, 16, 16 double cores)
    • More memory channels (improves memory bandwidth per core)
    • Improved IPC (8 per cycle is a target)
  • Top500 list - beyond the petaflop
    Datacenters in the USA will spend more than $3 billion on energy in 2009
  • 1997: Garry Kasparov vs. IBM Deep Blue
  • The World’s Most Powerful GPU = 177x IBM Deep Blue
  • 2011 GPU Architecture AMD Radeon™ HD 6900 Series
    Dual graphics engines
    New VLIW4 core architecture
    Up to 24 SIMD engines
    Up to 96 Texture Units
    Upgraded render back-ends
    Improved anti-aliasing performance
    Fast 256-bit GDDR5 memory interface
    Up to 5.5 Gbps
    New GPU compute features
  • Designing very efficient GPUs
    Full load: 180W; Idle: 27W
    [Chart: GFLOPS/W and GFLOPS/mm2 by GPU; data labels as extracted: 14.47, 7.50, 7.90, 4.50, 2.21, 2.01, 4.56, 2.24, 1.07, 1.06, 0.92, 0.42]
  • Old and New in High Performance Computing
    Old: Power is free, transistors are expensive
    New: Power is expensive, transistors are free
    (Can put more transistors on a chip than one can afford to turn on)
    Old: Multiplies are slow, memory access is fast
    New: Multiplies are fast, memory is slow
    (up to 200 clocks to DRAM memory, 4 clocks for an FP multiply)
    Old: Increasing instruction-level parallelism via compiler innovation
    New: Explicit thread and data parallelism must be exploited
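To make the "multiplies fast, memory slow" point concrete, here is a rough back-of-the-envelope sketch using the two figures quoted on the slide (up to 200 clocks per DRAM access, 4 clocks per FP multiply). The cost model is an assumption made for illustration only: it adds the latencies up serially and ignores caches, pipelining, and overlap, so it shows the trend rather than real hardware timing. The function names are hypothetical.

```python
# Figures quoted on the slide; everything else here is an illustrative assumption.
DRAM_ACCESS_CLOCKS = 200  # "up to 200 clocks to DRAM memory"
FP_MULTIPLY_CLOCKS = 4    # "4 clocks for FP multiply"


def multiplies_per_dram_access(dram_clocks=DRAM_ACCESS_CLOCKS,
                               mul_clocks=FP_MULTIPLY_CLOCKS):
    """How many FP multiplies fit in the time of one DRAM access."""
    return dram_clocks // mul_clocks


def estimated_clocks(n_multiplies, n_dram_accesses):
    """Naive serial cost: no caches, no overlap of compute and memory."""
    return (n_multiplies * FP_MULTIPLY_CLOCKS
            + n_dram_accesses * DRAM_ACCESS_CLOCKS)


# Same arithmetic work, different memory behavior.
streaming = estimated_clocks(n_multiplies=1000, n_dram_accesses=1000)
with_reuse = estimated_clocks(n_multiplies=1000, n_dram_accesses=20)

print(multiplies_per_dram_access())  # multiplies "lost" per DRAM access
print(streaming, with_reuse)
```

Under this simplified model, the version that reuses each fetched value many times spends far fewer clocks than the one that goes to DRAM for every multiply, which is why data reuse matters more than raw multiply count.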
  • GPUs: more than just gaming
    2700
    Both use GPUs
    Oil exploration platform - 2010
    Wii Sports - Golf
  • DirectX® 11 Multi-Threading
    • Application, DirectX runtime, and DirectX driver can each run in separate threads
    • Tasks like loading a texture or compiling a shader can execute in parallel with main rendering thread
    DirectX® 10
    DirectX® 11
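The multi-threading idea on this slide can be sketched with ordinary threads. This is not DirectX code: `load_texture` and `render_frame` are hypothetical stand-ins, and the sketch only shows the structure of handing a slow task to a worker thread while the main thread keeps rendering.

```python
from concurrent.futures import ThreadPoolExecutor


def load_texture(name):
    """Stand-in for slow work such as loading a texture or compiling a shader."""
    return f"{name}: loaded"


def render_frame(index):
    """Stand-in for the work done by the main rendering thread."""
    return f"frame {index}"


def run(frame_count=3):
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Kick off the slow task on a worker thread...
        pending = pool.submit(load_texture, "bricks")
        # ...and keep rendering on the main thread in the meantime.
        frames = [render_frame(i) for i in range(frame_count)]
        # Collect the result only when it is actually needed.
        texture = pending.result()
    return frames, texture


frames, texture = run()
print(frames, texture)
```

The point of the pattern is that the rendering loop never blocks on the load until the result is required.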
  • Today’s GPUs focused on
    GAMING
    ENTERTAINMENT
    PRODUCTIVITY
  • DirectX® 11 Tessellation
    DirectX® 10
    DirectX® 11
    No Tessellation
    Tessellation
    Images courtesy of Unigine Corp.
  • 5/25/2011
  • Research companies already using
    Oil exploration
    Nature simulation
    Weather forecast
    Fluid Dynamics
  • AMD Balanced Platform
    GPU is ideal for data parallel algorithms like image processing, CAE, etc
    • Great use for ATI Stream technology
    • Great use for additional GPUs
    CPU is excellent for running some algorithms
    • Ideal place to process if GPU is fully loaded
    • Great use for additional CPU cores
    Graphics Workloads
    Other Highly Parallel Workloads
    Serial/Task-Parallel Workloads
    Delivers optimal performance for a wide range of platform configurations
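As a small illustration of the split described on this slide, the sketch below contrasts a data-parallel operation (every element can be processed independently, the kind of work that suits a GPU) with a loop whose iterations depend on each other (the kind of work that, as written, suits a CPU core). The function names and the brightening operation are assumptions chosen for the example, not something taken from the presentation.

```python
def brighten(pixel, amount=10):
    """Per-pixel operation: the result depends only on this one pixel."""
    return min(pixel + amount, 255)


def brighten_image(pixels):
    """Data-parallel: each element is independent, so all of them could
    be processed at the same time on many GPU stream processors."""
    return [brighten(p) for p in pixels]


def running_total(values):
    """Serial as written: each step needs the total from the previous step."""
    total = 0
    totals = []
    for v in values:
        total += v
        totals.append(total)
    return totals


print(brighten_image([0, 100, 250]))
print(running_total([1, 2, 3]))
```

Parallel formulations of the second pattern exist, but the straightforward loop shown here carries a dependency from one iteration to the next, which is what makes it a natural fit for a fast serial core.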
  • ATI Stream Technology is…
    Heterogeneous: Developers leverage AMD GPUs and x86 CPUs for optimal application performance and user experience
    High performance: Massively parallel, programmable GPU architecture delivers unprecedented performance and power efficiency
    Industry Standards: OpenCL™ and DirectCompute 11 enable cross-platform development
    Engineering
    Sciences
    Government
    Gaming
    Digital Content Creation
    Productivity
  • Improvements already reached consumers
    ATI
    Stream
    Processor utilization
    Adobe Flash plugin used by Youtube.com
    • Better image quality and video smoothness
    • Lower processor usage
  • GPU-accelerated video transcoding
    iPod Video
    HD Video
    Up to 6x faster when using an AMD graphics card
  • Video Transcoding Sample: No GPU Acceleration
    CPU Usage: 100%
    GPU Usage: 1%
    Using four CPU cores
    [Diagram label: Frames]
  • Video Transcoding Sample: ATI GPU Acceleration
    CPU Usage: 45%
    GPU Usage: 35%
    Using hundreds of Stream Processors
    [Diagram labels: Control, Frames]
  • FUSION TECHNOLOGY
  • Today
    Multi-core CPU
    ~800 million transistors
    Multi-tasking
    TeraFLOPS-class GPU
    Up to 2 billion transistors
    Games on multiple monitors
    Full HD video and audio
  • A new Era on performance evolution
    [Figure: three performance-over-time charts (Single-Core, Multi-Core, Heterogeneous computing), each marked “We are here”; axis labels: Single-thread Performance, Performance, Time, Time x Cores]
    Challenge: power consumption, complexity
    Challenge: power consumption, software
    Pros: performance, power efficient
    Cons: software availability
  • A new Era on performance evolution
    Multi-Core
    Single-Core
    CPU
    Core efficiency
    Software
    Acceleration
    Low power consumption
    Multimedia
    Gaming
    GPU
  • Putting all together – The Future is Fusion
    [Diagram: RV500 GPU Core (2006), with ring stops, client interfaces, write crossbar switch and memory controller, shown next to the AMD “Istanbul” six-core processor (cores 1-6 with L2 caches, L3 cache, crossbar, memory controller, HyperTransport); chipset; PCI-e]
  • Putting all together – The Future is Fusion
    [Diagram: RV700 GPU Core (2008-2009), shown next to the AMD “Istanbul” six-core processor (cores 1-6 with L2 caches, L3 cache, crossbar, memory controller, HyperTransport); chipset; PCI-e]
  • Putting all together – The Future is Fusion
    RV700 GPU Core
    AMD “Istanbul” six-core processor
    CROSSBAR
    CROSSBAR
  • 2011: welcome to the APU time!
    APU
    GPU
    CPU
    “Supercomputing power in a notebook platform whose battery lasts for a full day”
  • One Design, Fewer Watts, Massive Capability
    Discrete-level DirectX® 11 GPU + Dual-Core CPU + Northbridge = “Zacate” AMD Fusion APU
    • 75 sq. mm, 18 watts
    • 59 sq. mm, 8 watts
    • 66 sq. mm, 13 watts
    • 117 sq. mm, 25 watts
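The area and power figures on this slide can be put side by side with a few lines of arithmetic. The sketch assumes that the 75 sq. mm / 18 watt pair describes the “Zacate” APU and that the other three pairs describe the three separate chips it replaces; the transcript does not say which pair belongs to which chip, so only the totals are used. The variable names are illustrative.

```python
# Assumption: the first pair on the slide is the single-chip APU.
apu_area_mm2, apu_watts = 75, 18

# Assumption: the remaining pairs are the three separate chips
# (GPU, CPU, northbridge). Which pair is which is not recoverable
# from the transcript, so they are only summed.
separate_chips = [(59, 8), (66, 13), (117, 25)]

total_area_mm2 = sum(area for area, _ in separate_chips)
total_watts = sum(watts for _, watts in separate_chips)

area_saving = 1 - apu_area_mm2 / total_area_mm2
power_saving = 1 - apu_watts / total_watts

print(f"separate chips: {total_area_mm2} sq. mm, {total_watts} W")
print(f"APU saves about {area_saving:.0%} area and {power_saving:.0%} power")
```

Under those assumptions the single-chip design uses well under half the silicon area and power of the three-chip combination, which is the message of the slide title.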
  • Graphics and Media Processing Efficiency Improvements
    2010 IGP-based Platform
    [Diagram: CPU chip (CPU cores, UNB, MC) attached to DDR3 DIMM memory at ~17 GB/sec; GPU and UVD reached over a ~7 GB/sec link; other labels: PCIe, SB Functions]
    Bandwidth pinch points and latency hold back the GPU capabilities
    2011 APU-based Platform
    [Diagram: APU chip (CPU cores, UNB / MC, GPU, UVD) attached to DDR3 DIMM memory at ~17 GB/sec; ~27 GB/sec paths to the GPU and UVD; other label: PCIe]
    3X bandwidth between GPU and memory
    Even the same sized GPU is substantially more effective in this configuration
    Eliminate latency and power associated with the extra chip crossing
    Substantially smaller physical footprint
    Graphics requires memory bandwidth to bring full capabilities to life
  • “Ontario” & “Zacate” Architecture
    APU
    • 2 x86 CPU Cores (40nm “Bobcat” core – 1 MB L2, 64-bit FPU)
    • C6 and power gating
    • Array of SIMD Engines
    • DX11 graphics performance
    • Industry leading 3D and graphics processing
    • 3rd Generation Unified Video Decoder
    • H.264, VC1, DivX/Xvid format
    • DDR3 800-1066, 2 DIMMs, 64 bit channel
    • BGA package
    Display and I/O
    • Two dedicated digital display interfaces
    • Configurable externally as HDMI, DVI, and/or DisplayPort
    • Also supports a single link LVDS for internal panels
    • Integrated VGA
    • 5x8 PCIe®
    • “Hudson” Fusion Controller Hub
  • OpenCL
    Working together
  • ATI Stream SDK: OpenCL™ For Multicore x86 CPUs and GPUs
    http://developer.amd.com/
    The Power of Fusion: Developers leverage heterogeneous architecture to deliver superior user experience
    • First complete OpenCL™ development platform
    • Certified OpenCL 1.0 compliant by the Khronos Group
    • Write code that can scale well on multi-core CPUs and GPUs
    • AMD delivers on the promise of OpenCL™, with both high-performance CPU and GPU technologies
    • Available for download now as part of ATI Stream SDK beta program – includes documentation, samples, and developer support
  • OpenCL™: Game-Changing Development. Enabling Broad Adoption of GP-GPU Capabilities
    • Industry standard API: Open, multiplatform development platform for heterogeneous architectures
    • The power of Fusion: Leverages CPUs and GPUs for balanced system approach
    • Broad industry support: Created by architects from AMD, Apple, IBM, Intel, Nvidia, Sony, etc.
    • Fast track development: Ratified in December; AMD is the first company to provide a complete OpenCL solution
    • Momentum: Enormous interest from mainstream developers and application ISVs
    More stream-enabled applications across all markets
  • Open Standards: Maximize Developer Freedom and Addressable Market
    Vendor specific (cross-platform limiters)
    • Apple Display Connector
    • 3dfx Glide
    • Nvidia CUDA
    • Nvidia Cg
    • Rambus
    • Unified Display Interface
    Vendor neutral (cross-platform enablers)
    • Digital Visual Interface
    • DirectX®
    • OpenCL™
    • OpenGL®
    • JEDEC
    • Certified DP
    OpenCL™ and DirectX® are emerging as the two most important standards for heterogeneous (CPU+GPU) compute
  • Comparing OpenCL™ and DirectX® 11 DirectCompute
    How will developers choose between OpenCL™ and DirectX® 11 DirectCompute?
    Feature set is similar in both APIs
    DirectX® 11 DirectCompute
    Easiest path to add compute capabilities to existing DirectX applications
    Windows Vista® and Windows® 7 only
    OpenCL™
    Ideal path for new applications porting to the GPU for the first time
    True multiplatform: Windows®, Linux®, MacOS
    Natural programming without dealing with a graphics API
  • Anatomy of OpenCL™
    Language Specification
    • C-based cross-platform programming interface
    • Subset of ISO C99 with language extensions - familiar to developers
    • Well-defined numerical accuracy - IEEE 754 rounding behavior with defined maximum error
    • Online or offline compilation and build of compute kernel executables
    • Includes a rich set of built-in functions
    Platform Layer API
    • A hardware abstraction layer over diverse computational resources
    • Query, select and initialize compute devices
    • Create compute contexts and work-queues
    Runtime API
    • Execute compute kernels
    • Manage scheduling, compute, and memory resources
  • OpenCL Example
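The slide's own example is not part of this transcript. As a stand-in, here is a minimal pure-Python sketch of the OpenCL execution model: a kernel is a small function that runs once per work-item, and the runtime launches it over a range of global IDs. It is a toy model under stated assumptions, not OpenCL itself. `vector_add` and `enqueue_nd_range` are hypothetical names; in real OpenCL the kernel would be written in OpenCL C, would read its index with `get_global_id(0)`, and would be launched from the host with `clEnqueueNDRangeKernel`.

```python
def vector_add(gid, a, b, out):
    """Kernel body: one work-item handles the element at index gid."""
    out[gid] = a[gid] + b[gid]


def enqueue_nd_range(kernel, global_size, *args):
    """Toy stand-in for a command queue: run the kernel once per work-item.

    A real device may run work-items in parallel and in any order;
    this loop runs them one after another for clarity.
    """
    for gid in range(global_size):
        kernel(gid, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
out = [0.0] * len(a)

enqueue_nd_range(vector_add, len(a), a, b, out)
print(out)
```

Because each work-item touches only its own index, the same kernel can scale from a few CPU cores to hundreds of GPU stream processors without changing the kernel body.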
  • Summary
    X86 PROCESSOR EVOLUTION
    THE GPU AS AN ACCELERATOR
    ACCELERATED PROCESSING UNITS
    INTRODUCTION TO OpenCL
    http://developer.amd.com
  • Thank you!
    roberto.brandao@amd.com