Interface for Performance Environment Autoconfiguration Framework
Dotnet microarchitecture of a coarse-grain out-of-order superscalar processor
1. ECWAY TECHNOLOGIES
IEEE PROJECTS & SOFTWARE DEVELOPMENTS
OUR OFFICES @ CHENNAI / TRICHY / KARUR / ERODE / MADURAI / SALEM / COIMBATORE
CELL: +91 98949 17187, +91 875487 2111 / 3111 / 4111 / 5111 / 6111
VISIT: www.ecwayprojects.com MAIL TO: ecwaytechnologies@gmail.com
MICROARCHITECTURE OF A COARSE-GRAIN OUT-OF-ORDER
SUPERSCALAR PROCESSOR
ABSTRACT:
We explore the design, implementation, and evaluation of a coarse-grain superscalar processor in
the context of the microarchitecture of the Control Processor (CP) of the Multilevel Computing
Architecture (MLCA), a novel architecture targeted for multimedia multicore systems. The
MLCA augments a traditional multicore architecture (called the lower level) with a CP (called
the top-level), which automatically extracts parallelism among coarse-grain units of computation
(tasks), synchronizes these tasks and schedules them for execution on processors. It does so in a
fashion similar to how instruction-level parallelism is extracted by superscalar processors, i.e.,
using registers renaming, Out-of-Order Execution (OoOE) and scheduling. The coarse-grain
nature of tasks imposes challenging constraints on the direct use of these techniques, but also
offers opportunities for simpler designs.
We analyze the impact of these constraints and opportunities and present novel
microarchitectural mechanisms for coarse-grain superscalar execution, including register
renaming, task queue, dynamic out-of-order scheduling and task-issue. We design an MLCA
system around our CP microarchitecture and implement it on an FPGA. We evaluate the system
using multimedia applications and show good scalability for eight processors, limited by the
memory bandwidth of the FPGA platform. Furthermore, we show that the CP introduces little
overhead in terms of resource usage. Finally, we show scalability beyond eight processors using
cycle-accurate RTL-level simulation with an idealized memory subsystem. We demonstrate that
the CP poses no performance bottlenecks and is scalable up to 32 processors.