Computing Without Computers - Oct08

A design methodology and language framework that provides a solid, scalable foundation for developing next-generation silicon-based systems.

Presentation Transcript

  • Computing Without Computers Ian Page Business Development Director, Seven Spires Investments Founder, Celoxica Ltd. Visiting Professor, Cass Business School
  • A Personal Story - Background
    • Trained as electronic engineer, but seduced by software
    • Working first in industry, then academia
    • Building hardware and software to support fast user interfaces
      • Software: silicon compiler, parallel graphics algorithms
      • Hardware: microcoded, SIMD, MIMD and ASIC processors
    • 1990, Oxford academic – ‘road to Damascus’ experience
      • Saw my first FPGA – and the future!
      • All previous threads came together - simultaneously
      • HLLs, regular architectures, algorithms in hardware, parallelism, real-time, design automation, communications, hardware o/s, program algebra, …
  • A Personal Story – A pattern emerges
    • I had been trying for many years to build complex algorithms (graphics and highly interactive user interfaces) into hardware
    • I tried:
      • User micro-coding
      • Massively parallel, SIMD array processing
      • Custom designed silicon
      • MIMD networks of transputers
    • All were short-term successes, but long-term failures - I hadn’t realised that what I was mostly doing was fighting Moore’s Law
    • None of these hardware platforms that I built or used stayed around long enough to be a stable platform
    • The largest investment - in the software – was written off each time Moore’s Law made yet another architecture redundant
  • Moore’s Law – just a reminder
    • A reminder of what an amazing industry we are embedded in
    • A doubling of transistor count every two years
    • First published 1965 and it's still driving the industry
    • It still has many more years to run
    • It is completely pervasive. Nothing escapes its influence
    • The Opportunity:
      • 4,000 transistors per circuit in 1970
      • 1 billion transistors by 2005
      • $1/transistor in 1968 to $1/50 million transistors today
    • The Problems:
      • Rock's Law - foundries double in cost each generation
        • A 300mm foundry costs $3 billion (Intel pushing for 450mm)
        • A 65nm mask set is around $3m
      • Somebody has to design these chips
  • A Personal Story – What does it all mean?
    • Moore’s Law keeps forcing entry-ticket prices up, driving ever greater integration, and reducing the number of different chip solutions available
    • What are tomorrow’s commodity chips?
      • FPGAs will be around for decades
      • 10^6 LUTs available soon
    • I see FPGA fabric as the world’s first, truly stable, parallel processing substrate
    • (though the ‘grid’ may be some sort of competition)
    • 1990 – believing that FPGAs change the nature of the game, an act of faith: “One day, most hardware designs will be done through programming languages and FPGAs”
    • And the research question was: “What do we have to do to make it come true?”
  • The Design Problem – statistics of failure
    • 18% of all projects are cancelled within 5 months*
    • 58% are late to market*
    • 20% of products are not within 50% of specification*
    • 15% of deep sub-micron designs require up to four re-spins
    • Of the products that do get to market:
      • On time and 50% over budget earn only 4% less profit over 5 years †
      • 6 months late and on budget earn 33% less profit over 5 years †
    • Every 4 weeks delay in product launch equals 14% loss in market share‡
    * Source: Current and Emerging Embedded Markets and Opportunities
    † Source: McKinsey & Co.
    ‡ Source: John Chambers, CEO, Cisco
  • The Design Problem – The Design Gap
    • Moore’s Law: chip complexity grows at over 40% CAGR (compound annual growth rate)
    • Designer productivity has historically grown at 21% CAGR*
    • The difference is the Design Gap: the gap between what you can design (with fixed resources) and what you must design (to stay in business)
    • The Design Gap increases by around 20% CAGR
    * Source: Gartner Group
  • The Design Problem – Complexity
    • Rapidly increasing complexity is the root of the problem
    • The only practical way to handle complexity is to raise the level of design abstraction
    • We are guided by previous shifts in hardware design methodology which raised the level of abstraction:
      • from schematics to HDLs
      • from assembler code to HLLs
  • Handel-C solution: treat hardware like software
    • Exploit the massive leverage created by the software industry
    • A rapid and simple flow from program to implementation
      • Compile/P&R, run, edit – in minutes, just like with software
    • Hardware and software development use same methodology
    • Hardware development in less time with a smaller team
    • Enables hardware development by system architects and software engineers as well as hardware engineers; these skills all converge
    • This might be the only design option for really complex designs
  • Choosing a Programming Language
    • Hardware implementations need to use both time and space efficiently (space = parallelism)
    • Q: Why not compile ordinary C++/C programs into hardware?
    • A: Nobody knows how to write a compiler that efficiently and reliably invents the parallelism that the designer didn’t specify
    • Conclusion: We require a language that allows (forces) the designer explicitly to denote the parallelism required in the computation
    • Q: Why not use a language such as occam, Java, …?
    • A: Nobody knows how to write a compiler that efficiently and reliably invents the timing specifications that the designer didn’t specify
    • Conclusion: We require a language that allows (forces) the designer explicitly to denote the time that computations take
    • These might appear to be two backwards steps – but NO!
  • The Handel Solution
    • No existing language met the basic requirements, so the Handel model of programming was created
    • Handel-C is the embedding of the Handel model in C language
    • Handel-C is a language for programming applications
      • Handel-C is not an HDL. Nor is it C used as an HDL
      • Handel-C is meaningful to both s/w and h/w engineers
      • Handel-C is exceptionally easy to learn and use
    • The par command gives control over space
    • The single clock assignment rule gives control over time
  • Handel-C in brief
    • Handel-C is based on ANSI-C
    • It has well-defined semantics
    • Similar to occam in spirit, but adding timing and replacing pseudo-parallelism with true parallelism
    • Other additions:
      • channels for communications between parallel processes
      • flexible bit-widths and better logical operators
      • constructs for RAM, ROM, interfacing, etc.
  • Handel-C Example – A Windowed Display System

        par {
            sync_generator (sx, sy);                      // process 1: video sync generation
            while (1)                                     // process 2: per-pixel window select
                if (inside (window1, sx, sy))
                    video = contents (window1, sx, sy);
                else if (inside (window2, sx, sy))
                    video = contents (window2, sx, sy);
                else
                    video = background_colour;
            while (1) { … mouse; update window1, 2 … }    // process 3: mouse/window update
        }
  • Our first FPGA Platform – HARP, 1991
    • FPGA + SRAM
    • Transputer + DRAM
    • Four fast serial links for expansion
    • Physically stackable (TRAM) module for arbitrary expansion
    • I confidently predicted that Xilinx and Altera would be building things like this as single chips by 1995!
  • Company ‘E’ : Redesign of a Failing Project
    • A team of 2 software engineers developed core component of IPv6 router in 2 man-months using Handel-C
    • Team of 3 hardware engineers failed to produce the design using VHDL in over 36 man-months
    Handel-C design: 33 MHz, 15% of a V1000 FPGA, 20 pages of code
    VHDL design: not completed, >100% of a V1000 FPGA, >400 pages of code
    [Chart: actual months (0–15) for the IPv6 router code]
  • Company ‘L’ : Algorithm Acceleration Trial
    • A team of 2 software engineers (with no previous HW experience) transferred an algorithm from a CPU to an FPGA
    • Run-time was 21 seconds on a 600MHz Pentium III
    • 23 times performance improvement after 42 man-days
    [Chart: signal-processing algorithm run-time (seconds, log scale) vs man-days – initial FPGA version >700 s, improving through 28 s and 16 s to 0.9 s; company training session marked; 600 MHz CPU shown for reference]
  • Customer ‘C’ : Internal Design Competition
    • Competition to design MP3 encoder between:
      • Traditional hardware design team using HDL-based approach and
      • Small group of software designers using Celoxica technology
    • Handel-C group
      • Converted existing software implementation of MP3 encoder to Handel-C
      • Produced optimized, working hardware that beat the design specification in 7 weeks (including training time)
    In the same time, the hardware group had not completed writing the specification!
  • Xilinx Design Challenge
    • A Xilinx-specified “Design Challenge”
    • To implement JPEG2000 using conventional HDL and Handel-C approaches
    • Comparison made between Handel-C and HDL approach
    • See Article in Xcell Volume 46
    • Online at www.xilinx.com/publications/xcellonline/xcell_46/xc_celoxica46.htm
  • JPEG2000 Architecture and Communication Model
    [Block diagram: Original Image → Pre-processing → RGB-to-YUV conversion → DWT (wavelet transform) → Quantisation → Tier-1 Encoder → Tier-2 Encoder → Rate Control → Coded Image; stages partitioned between hardware models and software models]
  • JPEG2000 Project Overview
    • Xilinx project benchmark to validate FPGA system tools
      • Start with C description of JPEG2000 algorithm
      • Use Software-Compiled System Design methodology
      • Partition and implement the JPEG2000 design
      • Compare results against original VHDL design performance
    [Top-level block diagram for JPEG2000 operation: Original Image → Pre-processing → RGB-to-YUV conversion → Wavelet Transform → Quantisation → Tier-1 Encoder → Tier-2 Encoder → Rate Control → Coded Image]
  • JPEG2000 Case Study – Results
    • Results (Lena image used as test-bench throughout; input bit width = 12, max 1K image width):

      Implementation             Slices  Utilization  Speed (MHz)  Lines of code  Design time (days)
      Handel-C (DK), 1st pass      646       6%          110            386              6
      Handel-C (DK), 2nd pass      546       5%          130            395            7 (6+1)
      Handel-C (DK), final        758†       7%          151            395            7 (6+1)
      VHDL (original)              800       7%          128            435             20‡

      † includes insertion of an existing HDL IP block
      ‡ does not include partitioning spec. development

    • Rapid Handel-C implementation by an engineer with no prior knowledge of JPEG2000; the primary design focus was area efficiency
    • The common language base made porting the DWT source to hardware easy, and DSM allowed partitioning, co-verification, and data to be moved easily between HW and SW
    • Optimizations included using signals instead of registers, maximum use of dual-ported memory, and reduced routing logic via syntax duplication in Handel-C; place & route tools were configured to optimize the implementation for area
    • The final implementation integrated an existing HDL IP block into the design flow for maximum design re-use value (“black boxing”)
    • Observations:
      • Area and utilization comparable
      • Handel-C implementation faster and quicker to design
      • Handel-C written by a novice vs VHDL written by experts
  • Does it work? - Demonstrations
    • RC100 Board:
      • Single Xilinx XC2S200 FPGA
      • 28 x 42 = 1176 CLBs (2352 LUTs)
      • Flash memory with stored configurations
      • PLD to reload the FPGA from the flash memory
      • Digital/Analogue converter to create video signal
    • All demos fit in 1200 CLBs – some in under 500
    • A few of them use external memory
    • No computer. No software. No operating system
    • Cheapest FPGAs: over 340 LUTs/$ (Oct08, one-off price)
  • Agility Design Solutions – shortening the time to develop and deploy complex image processing systems
    • Solutions for Algorithm Design
      • Algorithm acceleration
      • Rapid Prototyping
      • SW & FPGA Implementation
    • Technologies for Algorithm to Implementation
      • MATLAB to C
      • C to FPGA
      • System Prototyping Boards
      • IP Libraries
      • Implementation Services
    • Over 100 customers worldwide
  • Proven Customer Success
    • Lockheed – Hubble Telescope
    • Canon – PowerShot Digital Camera
    • Toyota – Prius Hybrid
    • Aeroastro – Vision Recognition
    • Harris – Satellite Communications
    • Raytheon – Airborne Systems & NLOS
  • Thank You Computing Without Computers Ian Page Business Development Director, Seven Spires Investments Founder, Celoxica Ltd. Visiting Professor, Cass Business School