The past and the next 20 years? Scalable computing as a key evolution



Published in: Technology
  • The world is changing ever more rapidly: demands from a wider range of consumers are changing the pace at which the component parts need to evolve. People are using their communication channels and devices in more dynamic ways. Always on, always connected is becoming the de facto expectation. They want to do more with less. A wider range of devices is going mobile, not limited to smartphones: tablets, other mobile computing devices and machine-to-machine communications. The result is an increase in data rates: a 4G modem is ~500x more complex than a 2G one. In our increasingly connected world, it will not just be mobile handsets that need to keep up; a wider range of devices needs to be connected. Always on, always connected also raises the level of mobility required, so it is not only about connectivity but also about how long you can go between charges. Battery capacity: historical capacity growth of 11%, not well matched to Moore's Law. Continued innovation is required just to maintain 11%.
  • So what impact does this have on the underlying systems? Reuse and time to market (TTM) continue to be key challenges for OEMs, and this does not change as the underlying systems become more complex. So more has to be done to increase reuse of both hardware and software elements to address both the complexity and TTM. Designing a system with key components that can scale to address a range of segments provides points of efficiency in the design and software cycles. Power efficiency has been the dominant driver for the mobile space from the start and continues to be a constraining element. Energy efficiency demands are not limited to the mobile space but are becoming more of a concern for a wide range of devices, from motor controls to servers. So we are seeing an increasing demand for energy-conservative devices to address a wider range of segments. Energy efficiency is one key aspect of optimization that is fundamental to ARM; however, great benefits can be achieved by taking a more system-level perspective on the key components that are required: CPU, GPU, memory and interconnects.
  • The balance between power and performance is expanding outside the mobile area, and historically power-hungry devices are now required to do more with less. In today’s system-on-chip, it’s not just about the CPU: there are multi-processing CPUs, multicore GPUs, video/audio engines, security engines and lots more. This is driven by the philosophy of the right tool for the right task. Graphics or video could be done on the CPU, but that would not be optimal in terms of energy. And, depending on the application, more and more is being integrated into the system-on-chip…
  • Main message: The world is moving more and more to heterogeneous computing, with many more specialized units to do the work and to get the best performance and efficiency, but there are also strong reasons to focus on achieving the heterogeneous properties in a system that behaves as homogeneously as possible. Yes, you want your single-threaded work on the CPU, the video decode on your video decoder and the massively multi-threaded work on your GPU, but you also want pain-free and low-cost dispatch, resource sharing and communication between these units. The homogeneous forces drive the development towards system-coherent memory, sharing of address spaces, sharing of page tables, new communication channels etc. ARM is in a very special position, having full control of all the pieces of the puzzle.
  • We launched the Cortex-A15 last year, and it has stirred up the entire industry and analyst community about the new potential for ARM-based devices. The performance has exceeded the original specification and, combined with new features such as virtualization support and larger memory support, is redefining what higher-end mobile devices, such as superphones and tablets, will be capable of. Key is performance for web, shown here vs Atom? (or is this Cortex-A8?). Cortex-A15 is great for web rendering and gaming: the ability for the Cortex-A15 to be used in single and multicore configurations enables a wide range of performance points to be addressed by one product. In Q2 2011, we taped out our test chip and have accelerated the software optimization around Cortex-A15. Multiple partners have announced Cortex-A15 platforms for the mobile space. Lead-in to next slide: The software and ecosystem driven by the Cortex-A15 can find its way into mid-range and entry-level devices, based on its ability to provide an energy-efficient, high-performance solution for a range of end devices and products.
  • Even while delivering 5x the energy efficiency, the Cortex-A7 can deliver up to 50% higher performance on the same workload compared to a Cortex-A8 processor as implemented in today’s smartphones. Dual-core configurations would deliver over 2 times the performance, closely in line with today’s high-end dual-core processor products. This is more than enough horsepower to run most, if not all, of today’s workloads with the user experience level expected. This all comes with the Cortex-A7 being fully software compatible with all of today’s Cortex-A profile software and also being feature aligned with Cortex-A15. Cue to next slide: For all application software and middleware purposes, the Cortex-A7 looks identical to the Cortex-A15, and this brings several advantages when it comes to system design. Notes: BLUE BARS: Cortex-A8 in 45nm represents mainstream smartphones. GREEN BARS: Cortex-A7 in 28nm. BLUE DOTTED LINE: Today’s high-end smartphone performance level with 1GHz dual-core. Cortex-A7 CPU includes L1 caches, NEON, FP and coherency support.
  • In the middle of October, ARM announced the big.LITTLE processing concept. A cluster of “big” processors, Cortex-A15s, handles the very demanding tasks that require more than twice today’s high-end performance. The “big” is only relative, considering how little our LITTLE core is. A cluster of “LITTLE” processors, Cortex-A7s, handles the “always on” tasks; they are more than capable of handling the OS and user interface activity, and provide at least the performance being experienced in today's Cortex-A8 based phones. Most importantly, this setup is completely transparent to the applications being run, and transitions between the Cortex-A15 and Cortex-A7 are extremely rapid, in the order of 20 microseconds. We will look at this in a bit more detail in order to understand it better.
  • There is a wide range of design choices for the micro-architecture of a processor that impact its performance, efficiency and size. Basing design decisions on workload parameters enables optimization. The Cortex-A7 (LITTLE core) is the most energy-efficient core ARM has built to date. It is capable of handling the majority of mobile workloads. The Cortex-A15 is the highest-performance ARM core in a mobile envelope, and will push performance past today’s high-end smartphones. You can see from the diagrams that, even though the processors are brothers, there are some differences: the Cortex-A7 reduces its power consumption further through a simpler pipeline structure, whereas the Cortex-A15 gets its performance hike from a more complex out-of-order pipeline suited to more complicated tasks. When you put these together you get BOTH high performance AND energy efficiency. You no longer need to trade them off against one another.
  • So what are the energy benefits of this system? The Cortex-A7 fits into the footprint of the expected budget for an application processor CPU cluster. We see that the most common workloads, which lean more towards the LITTLE core, show a great deal of energy benefit. The always-on activities that enable voice calls, emails and text messages benefit the most from the energy-saving aspects of the Cortex-A7 without impacting the user experience. The battery just lasts a whole lot longer. This enables you to turn on the Cortex-A15 when you need additional performance for games and for bringing the next-generation web experience to your mobile device. Even with very efficient voltage scaling on the big core, there are still significant energy savings that can be realized by running on the LITTLE core instead. At the upper end, there is still quite some energy saved, but it is less than at the lower end. So it delivers the best of both. Cue into next: The software tuning will provide even better benefits, and to see this in action I would like to ask Nandan to come up and introduce a brief video.
  • Speaker Notes: There is a huge breadth of Mali success everywhere, from phones, to tablets, to DTV and STB. Designed for high image quality and large screen resolutions, the scalable Mali architecture is suited to all applications. Use DTV/STB examples: Samsung DTV, Skyworth, ST in Sagemcom STB. Quote the Skyworth model, from Amlogic signing the IP deal to Skyworth shipping TVs. Key wins are Trident, Mediatek and ST for TV; HTC, Nokia and Samsung. Now at the start of the growth curve: shipping tens of millions in 2011, hundreds of millions by 2014 (A faster growth rate than CPU at the same stage... ???). Actions: Images - Products identified and in marcom image queue.
  • Speaker Notes: “Desktop experience in my hand”. This is the vision for what MPD is driving towards. All about complex content: it always gets more complex. These images show the evolution of content over time. Games moved from fixed-function GLES 1.1 to programmable GLES 2.0 and, soon, GLES 3.0 and DX11 capabilities. If possible, launch the Heaven benchmark video and walk through what is happening on the screen: how the image is built, wireframe, polygon count and processing/shaders, to drive the point home. Talk about tessellation and more complex processing in terms of FLOPS per pixel output onto a screen. Actions: Simon and Anand to provide approved images for GLES 1.1 and GLES 2.0 (Unity) games. Lorna and Ed to contact Unigine to gain permission for the Heaven benchmark video. Ed to work with Scott to get a Unigine capture.
  • The CoreLink 400 series system IP provides the high-performance and power-efficient components required for your Cortex SoC. CCI-400 is the cache coherent interconnect which allows the big.LITTLE clusters to run concurrently with the same view of memory and the same operating system; in addition it provides I/O coherency for Mali-T600 GPUs. MMU-400 and GIC-400 offer virtualisation for the system masters and interrupts, and work seamlessly with the virtualisation support in A15 and A7. DMC-400 offers a high-utilisation, high-efficiency dynamic memory controller for LPDDR2 and DDR2 and DDR3. NIC-400 connects up the rest of the system and has the flexibility to use the minimum routing and power to meet the system bandwidth needs.
  • ARM provides by far the broadest coverage in the industry, spanning three major platform areas. Advanced technology platforms, 32nm and below, with 8 physical IP platforms. Notably, ARM is the only provider with physical IP platforms at every 28nm High-K Metal Gate foundry process. In 40-65nm, Artisan Physical IP is available on 15 platforms. We call these “ramping” because a huge number of companies are now reaching these nodes in their designs. From 90nm-250nm, 69 platforms are available for a broad range of general purpose and specialty process technologies. No matter what you are designing, Artisan Physical IP has a platform solution suitable for your design needs. Many of these platforms are sponsored by the foundry and are available for your use at NO LICENSE FEE.
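The big.LITTLE notes above describe a simple idea: run on the LITTLE (Cortex-A7) cluster whenever it can meet demand, and move to the big (Cortex-A15) cluster only when more performance is needed, with a switch taking in the order of 20 microseconds. The following is a minimal toy sketch of that idea, not ARM's switching software. The `Cluster` class, `pick_cluster`, `simulate`, and every capacity and power figure are hypothetical values chosen for illustration; only the 20 microsecond switch time comes from the notes.

```python
from dataclasses import dataclass

# From the speaker notes: a cluster switch takes in the order of 20 microseconds.
SWITCH_TIME_US = 20


@dataclass
class Cluster:
    name: str
    capacity: float  # relative peak performance (hypothetical units)
    power: float     # relative active power (hypothetical units)


# Hypothetical figures, NOT ARM data: big is 3x faster but draws 6x the power.
LITTLE = Cluster("Cortex-A7", capacity=1.0, power=1.0)
BIG = Cluster("Cortex-A15", capacity=3.0, power=6.0)


def pick_cluster(demand: float) -> Cluster:
    """Use LITTLE whenever it can meet the demand, otherwise use big."""
    return LITTLE if demand <= LITTLE.capacity else BIG


def simulate(demands, slice_us=10_000):
    """Return (energy, switches, switch_overhead_fraction) for a demand trace.

    Each entry in `demands` is the performance needed during one time slice.
    """
    energy = 0.0
    switches = 0
    current = LITTLE
    for demand in demands:
        target = pick_cluster(demand)
        if target is not current:
            switches += 1
            current = target
        # Fraction of the slice the chosen cluster is busy, capped at 100%.
        busy = min(demand / current.capacity, 1.0)
        energy += current.power * busy * slice_us
    total_us = len(demands) * slice_us
    return energy, switches, switches * SWITCH_TIME_US / total_us


if __name__ == "__main__":
    trace = [0.2, 0.5, 2.5, 0.3]  # mostly light work with one demanding burst
    energy, switches, overhead = simulate(trace)
    print(f"energy={energy:.0f} switches={switches} overhead={overhead:.4%}")
```

With 10 ms slices, two switches cost 40 microseconds out of 40 ms, which illustrates why a switch time of this order can stay invisible to applications. Real systems would also have to account for cache state and voltage scaling, which this sketch ignores.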

    1. The Past & The Next 20 Years. Scalable Computing As A Key Evolution. Haydn Povey, Director Product Marketing, Processor Division, ARM
    2. 1991
    3. ARM Founded 27th Nov 1990
       • A barn, some energy, experience and belief: “We’re going to be the Global Standard”
       • “I gave ARM two things for success: no staff and no money” (Sir Robin Saxby)
       • Originally 12 employees
       • Two decades of Partnership success
       • 8 Partners at first Partner meeting
       • >500 Partners at 2011 Partner meeting
    4. A 1991 View of the Industry
    5. The Early Market for 32-Bit
       • [Chart: Embedded Control Revenue in $M, 1989 to 1995; 32-bit growth >45% per annum]
       • Early ARM Design Win: ACORN Archimedes
       • Polygon Pushing at ARM
    6. The 20 Year Journey
       • ARM1: 3 µm, 6k gates, 7mm x 7mm = 49mm²
       • December 2010: Cortex-M0, 20nm, 8k gates, 0.07mm x 0.07mm
       • M0 is 1/10,000th the size
       • Cortex-M0 Subsystem
       • Phenomenal Power, Performance & Area Improvements
    7. 2011
    8. Our Increasingly Connected World
       • Faster data rates can increase complexity, power and cost
       • Devices are becoming more multi-purpose, open, general computing platforms
       • All devices are becoming energy constrained
    9. Increasing Demands On Chip Design
       • Hardware and software reuse
       • Power efficient processing
       • Optimized implementation
       • Heterogeneous design
       • Simplified software integration
       • Today The Chip Is The System
    10. Power efficiency and Performance
       • Mobile SoCs have experience of balancing power with performance
       • Optimized processing units designed for specific tasks
       • Today’s SoC contains many diverse components
    11. The Chip is the System
       • Heterogeneous hardware:
         • Optimum power efficiency requires HW perfect for each task
         • Implies a demand for multiple HW accelerators
         • Leads to a proliferation of engines, more levels of parallelism
         • Benefits from HW coherency
       • Homogeneous software:
         • Application software and OS efficiency will increasingly rely on a unified memory model
         • Aligning memory systems (page tables, address spaces, coherency) between the different units becomes critical for high performance
    12. Cortex-A15: The New Market Standard
       • Performance enables new product types
         • Large-screen, connected, slim-profile, light
       • All your compute needs in a superphone
         • Expect innovative mobile MP platforms
       • Advanced capabilities
         • Support for OS virtualization, larger memory
       • [Chart: relative performance. Cortex-A15 measurements on equivalent system. Frequency varies dependent on process]
       • Cortex-A15 available now
         • First silicon undergoing test
         • Linux and browsing optimizations reviewed and upstreamed
         • Optimized tool-chains available
    13. Cortex-A7: Redefining Energy-Efficiency
       • Most energy-efficient applications processor
         • 5x the energy efficiency of mainstream phones
       • Performance to handle common workloads
         • >2x the performance of mainstream phone
       • Feature set and software compliant with Cortex-A15
         • Full backward compatibility
         • Scalable and extensible
       • [Chart: browsing workload comparison, showing relative performance and energy efficiency at 45nm and 28nm; clock labels 1 GHz, 1.2 GHz, 1.2 GHz; reference level: today’s dual-core high-end smartphones]
    14. Introducing big.LITTLE Processing
       • Uses the right processor for the right job
       • Up to 70% energy savings on common workloads
       • Flexible and transparent to apps: importance of seamless software handover
       • [Diagram: big cluster (Cortex-A15 MPCore CPUs with L2 cache) and LITTLE cluster (Cortex-A7 MPCore CPUs with L2 cache), connected by the CCI-400 coherent interconnect, with interrupt control]
    15. Performance AND Energy efficiency
       • Cortex-A7 (LITTLE): most energy-efficient applications processor from ARM
         • Simple, in-order, 8 stage pipeline
         • Performance better than today’s mainstream, high-volume smartphones
       • Cortex-A15 (big): highest performance in mobile power envelope
         • Complex, out-of-order, multi-issue pipeline
         • Up to 5x the performance of today’s mainstream, high-volume smartphones
       • [Diagram: pipeline stages, including Queue, Issue, Integer]
    16. The Right Processor for the Right Job
       • [Chart: processing energy saved versus today’s high-end multicore phones*; regions: LITTLE cluster activity dominates (Cortex-A7), big cluster activity dominates (Cortex-A15)]
       • Cortex-A15 provides the high-end performance
       • Cortex-A7 is ideal for low to mid-range tasks
       • * Dual Cortex-A15 + Dual Cortex-A7 big.LITTLE system estimate in 32/28nm compared with a dual-Cortex-A9 system estimate in 40nm
    17. Cortex™-A Series: Optimum Performance
       • Scalable performance with low power for broad application scope
       • Wide application range: Mobile Internet, Smart TV, Automotive Infotainment, Network Infrastructure, Servers
       • Cortex-A5 MPCore: low-cost internet; migration from classic ARM
       • Cortex-A7 MPCore: most efficient ARM processor; big.LITTLE with Cortex-A15
       • Cortex-A8: first superscalar design; market proven, wide adoption
       • Cortex-A9 MPCore: high-efficiency multicore; high-performance hard macro
       • Cortex-A15 MPCore: unprecedented performance; broad application capability
       • ARMv8-A Architecture: 64-bit architecture; ARMv7-A compatibility
       • Cortex-A: High Performance, Scalable, Efficient. BROAD PORTFOLIO, WIDELY ADOPTED, MARKET PROVEN
    18. Bringing Visual Computing to Life
       • Visual & graphical expectations continue to grow
         • Scaling to all resolutions from VGA to 1080p
       • Devices shown: Samsung Galaxy SII, Hardkernel ODROID-A, WinAccord PTT 1026, Ramos W10, Samsung Smart TV, Skyworth Smart TV, TomTom GO LIVE 1000
    19. Evolving Processing Demands
       • OpenGL® ES ‘Halti’ and Microsoft® DirectX® 11 enabling advanced content
         • Content keeps advancing, look at history to predict the future
       • GPU computing: OpenCL™, Renderscript, DirectCompute
         • Expectations of a common user experience across any consumer product leading to ever-higher performance demands in low-power portable devices
       • [Images: 25x increase in complexity. Polarbit, Raging Thunder; Unigine Corp, Heaven; Unity, Sixits ExoVerse]
    20. System Design Scalability: 400 Series
       • Coherency: CCI-400
         • big.LITTLE coherency
         • I/O coherency
         • Prioritization and utilization
       • Virtualization: MMU-400
         • OS level virtualization
       • Virtualization: GIC-400
         • Virtual interrupts
         • Multicore support
       • External Memory Subsystem: DMC-400
         • DDR utilization
         • PHY integration
       • Rest of SoC Interconnect: NIC-400
         • Routing efficiency
    21. Advanced Physical IP
       • 14nm - 32nm: 8 Physical IP Platforms
       • 40nm - 65nm: 15 Physical IP Platforms
       • 90nm - 250nm: 69 Physical IP Platforms
    22. The Next 20 Years
       • [Diagram labels: 2010s Mobiles; Soon Pervasive Devices; Ubiquitous Environments; Heterogeneous Compute Engines; Functionality, Energy × $; Functionality, Available Energy × $; Functionality, $]
       • Breakthroughs? Silicon technology, non-volatile memory tech, battery technology, charging speed, ?
    23. Enabling Scalability - From 1mm³ to 1km³
       • 1mm³ platform (University of Michigan): 8.75mm³ platform, solar cell, 0.18 µm Cortex™-M3, 12 µAh Li-ion battery
       • 1km³ platform: 4200 ARM Neutrino Detectors; 70 bore holes, 2.5km deep; 60 detectors per bore hole; supported by the National Science Foundation and University of Wisconsin-Madison
    24. Thank You
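Two of the deck's scaling claims can be checked with back-of-envelope arithmetic: the "1/10,000th size" figure on the 20 Year Journey slide, and the remark in the notes that 11% battery capacity growth is not well matched to Moore's Law. The sketch below is only a sanity check. It assumes the 11% figure is an annual rate and that Moore's Law means a doubling every two years; neither assumption is stated in the deck.

```python
# Slide 6: die area of ARM1 versus the Cortex-M0 subsystem.
arm1_area_mm2 = 7.0 * 7.0      # 7 mm x 7 mm = 49 mm^2
m0_area_mm2 = 0.07 * 0.07      # 0.07 mm x 0.07 mm
area_ratio = arm1_area_mm2 / m0_area_mm2

# Speaker notes: battery capacity grows 11% (assumed here to be per year).
YEARS = 20
battery_growth = 1.11 ** YEARS

# Assumption, not from the deck: transistor density doubles every 2 years.
DOUBLING_PERIOD_YEARS = 2
density_growth = 2 ** (YEARS / DOUBLING_PERIOD_YEARS)

print(f"ARM1 / Cortex-M0 area ratio: about {area_ratio:,.0f}x")
print(f"Battery capacity after {YEARS} years: about {battery_growth:.1f}x")
print(f"Transistor density after {YEARS} years: about {density_growth:,.0f}x")
```

Under these assumptions the area ratio comes out at about 10,000, matching the slide, while battery capacity grows roughly eightfold over 20 years against roughly a thousandfold for transistor density. That gap is the reason the talk treats energy efficiency as the constraining element.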