CEC 470, PROJECT II, DECEMBER 2014

ARM Cortex-A8: An Overview

Andrew Daws, David Franklin, Cole Laidlaw

Abstract—The purpose of this document is to provide a comprehensive overview of the ARM Cortex-A8. It includes background information, a detailed description of the Cortex-A8's RISC-based superscalar design, a brief comparison with other ARM processors, and a description of the NEON SIMD engine and other features. The Cortex-A8 is among the highest-performance ARM processors currently in service and is widely used in mobile devices and other consumer electronics. This analysis covers the basic components and processes of the Cortex-A8 architecture, including its 13-stage instruction pipeline, branch prediction, and other features that deliver high performance with low power consumption. The Cortex-A8 includes the NEON SIMD engine, which uses integer and floating-point operations to deliver greater graphics and multimedia capability. Additionally, we include a brief description of the ARMv7-A architecture to place the Cortex-A8 in context.

Index Terms—ARM, SIMD, VFP, RISC.

I. INTRODUCTION

THE increasing ubiquity of mobile computing continues to drive demand for processors that are versatile and deliver high performance. This demand stems from the need for a variety of services, including connectivity and entertainment. The ARM Cortex-A8, designed by ARM and licensed by manufacturers such as Texas Instruments, combines connectivity, performance, and multimedia. It achieves this versatility while remaining energy efficient. Its performance can be expressed in instructions per cycle (IPC) and is achieved through a balance of increased operating frequency and machine efficiency [1]. The performance increase results from superscalar execution, improvements in branch prediction, and an efficient memory system. The Cortex-A8 pipeline performs less work per stage than previous ARM designs, enabling higher clock frequencies.
It is important to analyze the Cortex-A8's features to highlight their effects on power saving and improved performance. There is also a growing demand for graphics and multimedia, which the Cortex-A8 meets with NEON. NEON achieves greater graphical capability through 64-bit-wide execution of integer and floating-point operations. NEON is a Single Instruction Multiple Data (SIMD) accelerator: it can execute one instruction across as many as 16 data sets simultaneously. This parallelism confers a host of new capabilities. The Cortex-A8 also employs the Vector Floating Point (VFP) accelerator to speed up floating-point operations. The Cortex-A8's capabilities can be illustrated by a brief comparison with other architectures within ARM's Cortex family. The Cortex-A8 implements the ARMv7-A architecture, a group that includes seven other cores. The Cortex-A8 has proven to be the most flexible processor among its related architectures; other designs can be faster and more powerful, but they lack the Cortex-A8's versatility. Any such comparison illustrates the Cortex-A8's success as a commercial-grade processor. Analyzing these concepts reveals the importance of the Cortex-A8's RISC-based superscalar design and its versatility.
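The SIMD parallelism described above can be sketched in a few lines. This is a conceptual model, not Cortex-A8 code: one "instruction" (a function call) applies the same operation to all sixteen 8-bit lanes of a 128-bit register at once.

```python
# Conceptual sketch (not Cortex-A8 code): one SIMD instruction applies
# the same operation to many data elements at once. A 128-bit NEON
# register holds sixteen 8-bit lanes.

def simd_add_u8(a, b):
    """Lane-wise add of two 16-lane vectors, modulo 256 (mirroring the
    wrap-around behavior of an unsigned 8-bit SIMD add)."""
    assert len(a) == len(b) == 16
    return [(x + y) & 0xFF for x, y in zip(a, b)]

a = list(range(16))          # lanes 0..15
b = [10] * 16                # add 10 to every lane
print(simd_add_u8(a, b))     # one "instruction", 16 results
```

A scalar (SISD) machine would need sixteen separate add instructions for the same work, which is the instruction-count reduction the text refers to.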
II. ARM PROCESSORS

The ARM processor family has had a substantial impact on the world of consumer electronics. ARM's developers founded their company in 1990 as Advanced RISC Machines Ltd [6]. The company's purpose was to develop commercial-grade general-purpose processors. ARM processors can be found on many platforms, including laptops, mobile devices, and other embedded systems. The ARM is a Reduced Instruction Set Computer (RISC). Typical RISC architectures include several features: a large register file, a load/store architecture, simple addressing modes, and uniform instruction lengths. The ARM architecture adds several features to this basic RISC foundation:
• Control of both the ALU and the shifter in most data-processing operations, to maximize their use.
• Optimization of program loops through auto-increment and auto-decrement addressing modes.
• Load- and store-multiple instructions to maximize data throughput.
• Conditional execution of almost all instructions to maximize execution throughput.
These ARM features enhance the existing RISC architecture to achieve high performance, reduced code size, and reduced power consumption. A typical ARM processor has 16 visible registers out of 31 total. There are three special-purpose registers: the stack pointer, link register, and program counter. ARM supports exception handling, in which a standard register is replaced with a register specific to the exception type. All processor state is contained in status registers. The ARM instruction set includes branch, data-processing, status-register-transfer, load/store, coprocessor, and exception-generating instructions [7].
III. ARCHITECTURE

The ARM Cortex-A8 is a microprocessor intended for general-purpose consumer electronics. The ARM architecture is load/store, with an instruction set similar to other RISC processors but with numerous special features: shift and ALU operations may be carried out in the same instruction; the program counter may be used as a general-purpose register; both 16-bit and 32-bit instruction opcodes are supported; and the instruction set is fully conditional [1]. There are 16 32-bit registers, 13 of which are general purpose. The stack pointer, link register, and program counter comprise the remaining three. These registers can be used for load/store instructions and data processing in addition to their special purposes.

Pipeline. The Cortex-A8 utilizes a sophisticated instruction pipeline. This 13-stage pipeline implements an in-order, dual-issue, superscalar processor with advanced branch prediction [1]. The main pipeline is divided into fetch, decode, and execute stages. The two fetch stages (F1 and F2; an additional F0 stage is not officially counted) are responsible for branch prediction and for placing instructions into a buffer for decoding. Decoding is implemented in five stages (D0-D4) that decode, schedule, and issue instructions [1]. Complex instruction sequences are processed, or even replayed if memory stalls. The six execute stages comprise a load/store pipeline, a multiply pipeline, and two symmetric ALU pipelines. There are additional pipelines beyond the main 13-stage pipeline: an 8-stage pipeline for the level-2 memory system and a 13-stage pipeline for debug trace execution. The NEON SIMD execution engine implements a 10-stage pipeline, with four stages for instruction decode and six for execution.
Fig. 1. Full Pipeline [1]
IV. INSTRUCTION FETCH

Instruction Fetch Pipeline. Dynamic branch prediction, the instruction queueing hardware, and the entire level-1 instruction-side memory system are located within the instruction fetch unit [1]. The instruction fetch pipeline runs decoupled from the rest of the processor and may fetch up to four instructions per cycle along the predicted execution stream. Fetched instructions are placed in a queue to await decoding. A new virtual address is generated in the F0 stage at the start of the fetch pipeline. This may be a predicted target address, or the next address calculated sequentially from the previous fetch if no branch is predicted. The F0 stage is not counted as an official pipeline stage; by ARM convention, the first instruction-cache access stage is counted as stage one. The F1 stage serves two purposes in parallel: it contains the arrays for instruction-cache access and for branch prediction. The F2 stage is the final stage of the fetch pipeline; instruction data returns from the instruction cache and is placed in the instruction queue for later use by the decode unit. If a branch is predicted, the new target address is used as the fetch address, replacing the address calculated in the F0 stage and discarding the fetch made in the F1 stage. Code sequences containing many branch instructions may suffer inaccurate branch predictions as a result of this situation [1].
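The branch prediction performed in the F1 stage relies on saturating counters (described with the global history buffer later in this section). As a simplified model, and not the Cortex-A8's exact hardware, a table of 2-bit counters predicts each branch from its recent behavior:

```python
# Conceptual sketch of 2-bit saturating-counter branch prediction
# (a simplified model, not the Cortex-A8's exact predictor hardware).
# Counter states: 0-1 predict not-taken, 2-3 predict taken.

class TwoBitPredictor:
    def __init__(self, entries=16):
        self.counters = [1] * entries   # start weakly not-taken

    def predict(self, index):
        return self.counters[index % len(self.counters)] >= 2

    def update(self, index, taken):
        i = index % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

p = TwoBitPredictor()
for _ in range(3):                      # a branch taken repeatedly...
    p.update(5, taken=True)
print(p.predict(5))                     # ...is now predicted taken: True
```

The two-bit hysteresis is what lets a loop-closing branch survive its single not-taken exit without flipping the prediction for the next loop entry.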
Fig. 2. Fetch Pipeline [1]
Instruction Cache. An instruction cache is implemented in the instruction fetch unit and is its largest component [1]. It can be configured as 16 KB or 32 KB in size and can return 64 bits of data per access. The instruction cache is a physically addressed, four-way set associative cache. A fully associative 32-entry translation lookaside buffer (TLB) is also included. The instruction and data caches are kept nearly identical for design efficiency; differences are minimized by using the same array structures and making only minor changes to the control logic [1]. These elements are consistent with conventional cache designs. The hashed virtual address buffer (HVAB) is not part of this conventional design strategy. Traditionally, the data RAMs and tag arrays are accessed in parallel, and the translated physical address is compared against the tags to verify the data held in RAM. The HVAB avoids firing these arrays in parallel: a 6-bit hash of the virtual address indexes the HVAB to predict which cache way is likely to contain the required data. The TLB translation and tag compare then verify that the hit was genuine [1]. If the prediction proves invalid, the access is cancelled and the HVAB and cache data are updated. The TLB translation and tag compare are thereby removed from the critical path of cache access. This scheme saves power but hurts performance when predictions are inaccurate, which can be mitigated by an efficient hash function with a low probability of false matches.

Instruction Queue. The purpose of the instruction queue is to smooth over discontinuities between instruction delivery and consumption. Instructions are placed in the instruction queue after they are fetched from the instruction cache, and are forwarded straight to the D0 stage if the queue is empty [1].
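The HVAB's way prediction can be sketched as follows. The hash function and table sizes here are invented for illustration, not the real structure; the point is that a cheap hash of the virtual address selects a single cache way to fire instead of reading all four in parallel.

```python
# Illustrative sketch in the spirit of the HVAB (invented hash and
# sizes, not the real hardware). A small hash of the virtual address
# predicts which cache way to read, so only one way's data array is
# fired instead of all four.

def hvab_hash(vaddr):
    """Fold a virtual address down to a 6-bit hash (assumed function)."""
    return (vaddr ^ (vaddr >> 6) ^ (vaddr >> 12)) & 0x3F

class WayPredictor:
    def __init__(self):
        self.table = {}                 # hash -> predicted way (0..3)

    def predict(self, vaddr):
        return self.table.get(hvab_hash(vaddr))

    def update(self, vaddr, correct_way):
        # On a misprediction the access is replayed and the entry fixed.
        self.table[hvab_hash(vaddr)] = correct_way

wp = WayPredictor()
wp.update(0x8000, correct_way=2)        # learn where the line lives
print(wp.predict(0x8000))               # later accesses read only way 2
```

The power saving comes from firing one data array per access; the cost, as the text notes, is a replay whenever the hash aliases two different lines.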
Decoupling allows the instruction fetch unit to prefetch ahead of the rest of the integer unit and build up a backlog of instructions awaiting decode. This backlog hides the latency of prediction changes. It also prevents decode-unit stalls from propagating back into the prefetch unit during the cycle in which a stall is recognized. The instruction queue comprises four parallel FIFOs [1]. Each FIFO holds six entries of 20 bits each - 16 bits of instruction data and four bits of control state. A single instruction may occupy up to two entries.

Branch Prediction. The branch predictor includes a 512-entry branch target buffer (BTB) and a 4096-entry global history buffer (GHB). The BTB indicates whether the current fetch address contains a branch; it is indexed by the fetch address and holds target addresses and branch-type information, covering both direct and indirect branch targets. The GHB holds counters indicating whether predicted branches should or should not be taken. Both arrays are accessed in parallel with the instruction cache during the F1 stage. A GHB entry is selected using a 10-bit global branch history combined with four low-order bits of the PC. The branch history records the taken/not-taken status of the 10 most recent branches and is kept in the global history register (GHR). This approach improves accuracy by capturing patterns of branch behavior, while the PC bits reduce aliasing between branches that happen to share similar histories.

Return Stack. Subroutine return predictions are made using a return stack with an eight-entry depth [1]. A return address is pushed onto the stack when the BTB identifies a branch as a subroutine call.
A subroutine return pops the address from the stack instead of reading it from the BTB entry. Because subroutines can be short, it is important to support multiple pushes and pops per cycle. Speculative updates can be harmful: updates from an incorrect path can desynchronize the return stack and cause mispredictions. The instruction fetch unit therefore maintains both a speculative and a non-speculative return stack. The speculative return stack is updated immediately, while the non-speculative return stack is not updated until the branch is resolved [1]. On a misprediction, the speculative stack is overwritten with the non-speculative state.

V. INSTRUCTION DECODE

Instruction Decode Pipeline. The instruction decode unit decodes, sequences, and issues instructions, and provides exception handling [1]. The decode unit occupies the D0-D4 pipeline stages. The instruction type and the destination and source operands are determined in the D0 and D1 stages. Multi-cycle instructions are divided into multiple single-cycle instructions during the D1 stage. Instructions are written into and read from the pending/replay queue structure during the D2 stage. The D3 stage implements the instruction scheduling logic: the scoreboard is consulted for the next two candidate instructions, which are also checked against each other for dependency hazards that the scoreboard cannot detect. Instructions cannot be stalled once they cross the D3/D4 boundary. Final decode of all control signals critical to the execute and load/store units occurs in the D4 stage.

Fig. 3. Decode Pipeline [1]

Static Scheduling Scoreboard. The static scheduling scoreboard predicts operand availability [1]. Each scoreboard entry indicates the number of cycles until a valid result will be available. This differs from a traditional scoreboard, which normally uses a single bit to mark whether a source operand is available.
This information, together with the source operands of incoming instructions, is used to detect possible dependency hazards. Each scoreboard entry is self-updating on a cycle-by-cycle basis: when a register is written, its entry is set, and the entry then decrements by one each cycle until a new write occurs or the counter reaches zero, which indicates availability. The static scheduling scoreboard also tracks which stage of the execution pipeline will produce each result. This information is used to generate the forwarding-multiplexer control signals that accompany instructions at issue [1]. The static scheduling scoreboard has several advantages. Used in conjunction with the replay queue, it allows a fire-and-forget pipeline with no stalls after issue [1]. This removes speed paths that would hinder high-frequency operation, and it conserves power by establishing early which execution units are required.

Instruction Scheduling. The Cortex-A8 is a dual-issue processor with two integer pipelines, pipe0 and pipe1. Pipe0 always contains the older instruction and pipe1 the newer. If the older instruction cannot issue, the instruction in pipe1 will not issue either, even in the absence of any hazard or resource conflict [1]. Pipe0 is the default when only a single instruction issues. All instructions progress through the execution pipeline in order and write their results to the register file during the E5 stage. This prevents write-after-read hazards and simplifies tracking of write-after-write hazards. The pipe0 instruction is free to issue if the scoreboard detects no hazards. Beyond the scoreboard indicators, additional constraints govern dual issue. The combination of instruction types must be considered; the following combinations are supported:
• Any two data-processing instructions
• One load/store instruction followed by one data-processing instruction
• An older multiply instruction with a newer load/store or data-processing instruction
The program counter can be changed by only one of the two issued instructions, and only by a branch instruction or by a data-processing or load instruction with the program counter as its destination register [1]. The two instructions must also be cross-checked for data dependencies.
Read-after-write or write-after-write hazards may prevent dual issue: the newer instruction may require a result register before the older instruction produces it, or both instructions may write the same register. Comparisons are made between when the data is produced and when it is needed, so dual issue is still possible if the data is not needed for one or more cycles. Examples where this occurs:
• A compare or subtract instruction that sets the flags, followed by a flag-dependent conditional branch
• Any ALU instruction followed by a dependent store of the ALU result to memory
• A move or shift instruction followed by a dependent ALU instruction
These patterns are commonplace in conventional code sequences, so permitting such dual-issue pairs is critical to the overall performance increase [1].
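The counting scoreboard and the issue check it enables can be sketched as follows. The structure is invented for illustration (not RTL, and the latencies are placeholders): each register entry counts down the cycles until its value is ready, and an instruction may issue only when all its source counters are zero.

```python
# Illustrative sketch (invented structure, not RTL): a counting
# scoreboard in the style described above. Each register entry holds
# the number of cycles until its value is available; zero means ready.

class Scoreboard:
    def __init__(self, num_regs=16):
        self.cycles_left = [0] * num_regs

    def tick(self):
        """Self-update each entry once per cycle."""
        self.cycles_left = [max(0, c - 1) for c in self.cycles_left]

    def write(self, reg, latency):
        """A new producer of `reg` delivers its result in `latency` cycles."""
        self.cycles_left[reg] = latency

    def can_issue(self, src_regs):
        """Issue only if every source operand is available."""
        return all(self.cycles_left[r] == 0 for r in src_regs)

sb = Scoreboard()
sb.write(3, latency=2)          # e.g. a load producing r3 in 2 cycles
print(sb.can_issue([3]))        # False: a dependent instruction waits
sb.tick(); sb.tick()
print(sb.can_issue([3]))        # True: result now available
```

Because the counter also says *when* a result arrives, the real hardware can schedule a dependent instruction to issue exactly as its operand becomes forwardable, which is what makes the fire-and-forget pipeline possible.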
VI. NEON PIPELINE

The Cortex-A8 has other features that complement its high performance, such as the NEON hybrid SIMD engine, which grants the Cortex-A8 increased performance in graphics and other media workloads [2]. NEON has numerous advantages. Efficient SIMD operation is supported through both aligned and unaligned data access. Integer and floating-point operations serve a broad range of applications, including 3D graphics. A single instruction stream and a unified view of memory make for a simpler tool flow, and a large register file enables efficient data handling and memory access [2]. So what is NEON? The NEON engine is a SIMD (Single Instruction Multiple Data) accelerator, also known as a vector processor: during the execution of one instruction, the same operation occurs on up to 16 data sets in parallel [2]. The purpose of this parallelism is to obtain more MIPS or FLOPS from the SIMD portion of the processor than a basic SISD (Single Instruction Single Data) processor could achieve at the same clock rate. The parallelism also decreases the instruction count needed to accomplish a task compared with an SISD machine, and thus the number of clock cycles spent on it. To determine how much speedup the NEON engine grants a specific loop, it is necessary to look at the data size of the operation. The largest NEON register is 128 bits, so an operation on 8-bit values can be performed on up to 16 elements simultaneously; with 32-bit values, up to 4 operations can proceed simultaneously [2]. Other factors also affect execution speed, however, such as loop overhead, memory speed, and data throughput.
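The lane arithmetic above reduces to one division: the number of parallel operations is the register width divided by the element size.

```python
# A small sketch of the lane arithmetic described above: the number of
# parallel operations is the register width divided by the element size.

REGISTER_BITS = 128                 # widest NEON register

def lanes(element_bits):
    return REGISTER_BITS // element_bits

for width in (8, 16, 32):
    print(f"{width}-bit elements: {lanes(width)} operations per instruction")
# 8-bit elements: 16 operations per instruction
# 16-bit elements: 8 operations per instruction
# 32-bit elements: 4 operations per instruction
```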
NEON instructions cover mainly numerical, load/store, and some logical operations, and NEON operations execute while other instructions proceed through the main ARM pipeline. NEON has four decode stages, M0-M3, similar in design to the D0-D4 decode stages of the main ARM pipeline: the first two stages decode the instruction's resource and operand requirements, and the last two perform instruction scheduling. NEON also has six execute stages, N1-N6 [1]. The NEON pipeline uses a fire-and-forget issue mechanism and a static scoreboard, similar to the ARM integer pipeline, with the primary difference being that there is no replay queue [2]. The NEON decode logic is highly capable in that it can dual-issue any load/store-permute instruction with any non-load/store-permute instruction. This requires fewer register ports than dual-issuing two data-processing instructions, since load/store data is provided directly from the load data queue. It is also the most useful pairing to dual-issue, because significant load/store bandwidth is needed to keep up with the Advanced SIMD data-processing operations [1]. Access to the 32-entry register file is handled in the M3 stage when instructions are issued [1]. Once issued, an instruction is sent to one of seven execution pipelines: integer ALU, integer multiply, integer shift, NFP add, NFP multiply, IEEE floating point, or load/store permute, with all execution datapath pipelines balanced at six stages [1].
Fig. 4. NEON Pipeline Stages [1]

VII. NEON INTEGER EXECUTION PIPELINE

Three execution pipelines are responsible for executing NEON integer instructions: multiply-accumulate (MAC), shift, and ALU. The integer MAC pipeline contains two 32x16 multiply arrays with two 64-bit accumulate units. Each 32x16 multiplier array can perform four 8x8, two 16x16, or one 32x16 multiply operation per cycle, and has dedicated register read ports for the accumulate operand. The MAC unit is optimized to support one multiply-accumulate operation per cycle, giving high performance on sequences of MAC operations with a common accumulator. The integer shift pipeline consists of three stages. When only the shift result is required, it is made available to subsequent instructions early, at the end of the N3 stage [1]. When both a shift and an accumulate operation are required, the result from the shift pipeline is forwarded directly to the MAC pipeline. The integer ALU pipeline consists of two parallel 64-bit SIMD ALUs, each permitting four 64-bit inputs. The first stage of the ALU pipeline, N1, formats the operands in preparation for the next cycle; this includes inverting operands as needed for subtraction, multiplexing vector-element pairs for folding operations, and sign/zero extension of operands [1]. The second stage, N2, performs the main ALU operations, such as add, subtract, logical, count leading sign/zero, count set bits, and sum of element pairs [1], and also computes the flags used in the following stage. The third stage, N3, performs compare, test, and max/min operations for saturation detection. The N3 stage also contains a SIMD incrementer for two's-complement generation and rounding, and a data formatter for high-half and halving operations.
Just like the shift pipeline, the ALU pipeline uses the final stages, N4 and N5, to complete any accumulate operations by forwarding the result to the MAC pipeline [1].
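The saturation detection mentioned for the N3 stage can be illustrated with a signed 8-bit saturating add. This is a behavioral model in plain Python, not a description of the datapath: results are clamped to the representable range instead of wrapping around.

```python
# Illustrative sketch of the saturation behavior handled in the ALU
# pipeline's later stages: a signed 8-bit saturating add clamps results
# to the representable range instead of wrapping around.

INT8_MIN, INT8_MAX = -128, 127

def qadd8(a, b):
    """Signed 8-bit saturating add, applied lane-wise to two vectors."""
    return [max(INT8_MIN, min(INT8_MAX, x + y)) for x, y in zip(a, b)]

print(qadd8([100, -100, 5], [50, -50, 5]))   # [127, -128, 10]
```

Saturation matters for media code: clipping an overflowing pixel or audio sample to the maximum value is far less visible than letting it wrap to the opposite extreme.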
VIII. NEON LOAD-STORE/PERMUTE EXECUTION PIPELINE

The permute pipeline is fed by the load data queue (LDQ), which holds all data associated with NEON load accesses prior to entering the NEON permute pipeline. The LDQ is 12 entries deep, each entry 128 bits wide [1]. Data can be placed into the LDQ from either the L1 cache or the L2 memory system. Accesses that hit in the L1 cache return and commit their data to the LDQ. Accesses that miss in the L1 cache initiate an L2 access; a pointer is attached to the load request as it proceeds down the L2 memory-system pipeline, and when the data returns from the L2 cache the pointer is used to update the LDQ entry reserved for that load. Each LDQ entry has a valid bit indicating that data has returned from L1 or L2. Entries may be filled by L1 or L2 out of order, but valid data within the LDQ must be read in program order: entries at the front of the LDQ are read off in order. If a load instruction reaches the M2 issue stage before its data has arrived in the LDQ, it stalls and waits for the data [1]. Data read out of the LDQ is aligned and formatted for the NEON execution units, then multiplexed with NEON register read operands in the M3 stage before issue to the NEON execute pipeline. The NEON load/store-permute pipeline is responsible for all NEON loads and stores, data transfers to and from the integer unit, and data permute operations. One of the more interesting features of the NEON instruction set is its data permute operations, which can be performed register-to-register or as part of a load or store. These operations allow the interleaving of bytes of memory into packed values in SIMD registers.
For example, when adding two eight-byte vectors, one may wish to interleave all the odd bytes of memory into register A and the even bytes into register B [1]. NEON's permute instructions allow such operations natively in the instruction set, often with a single instruction [1]. This data permute functionality is implemented by the load-store permute pipeline. Any required data permutation is done across two stages, N1-N2. In the N3 stage, store data can be forwarded from the permute pipeline to the NEON store buffer in the memory system [1].

IX. NEON FLOATING-POINT EXECUTION PIPELINES

The NEON floating-point (NFP) unit has two main pipelines: a 6-stage multiply pipeline and a 6-stage add pipeline [1]. The add pipeline adds two single-precision floating-point numbers to produce a single-precision sum; the multiply pipeline multiplies two single-precision floating-point numbers to produce a single-precision product. In both cases the pipelines are 2-way SIMD, meaning two 32-bit results are produced in parallel when executing NFP instructions [1].

X. NEON'S IEEE-COMPLIANT FLOATING-POINT ENGINE

The IEEE-compliant floating-point engine is a non-pipelined implementation of the ARM floating-point instruction set, targeted at medium-performance, IEEE 754-compliant single- and double-precision floating point [1]. It is designed to provide general-purpose floating-point capability for a Cortex-A8 processor. The engine is not pipelined for most operations and modes; instead it iterates over a single instruction until it completes. A subsequent operation is stalled until the prior operation has fully completed execution and written its result to the register file. The IEEE-compliant engine is used for any floating-point operation that cannot be executed in the NEON floating-point pipeline, including all double-precision operations and any floating-point operation run with full IEEE precision enabled.
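The even/odd interleave described for the permute pipeline can be sketched directly. In NEON this is done by permute instructions (such as a de-interleaving load); here it is modeled in plain Python purely for illustration.

```python
# A sketch of the de-interleave described above (in NEON this is done
# by permute instructions; here it is modeled in plain Python).

def deinterleave(memory_bytes):
    """Split a byte sequence into even-index and odd-index 'registers'."""
    reg_a = memory_bytes[0::2]      # even-positioned bytes
    reg_b = memory_bytes[1::2]      # odd-positioned bytes
    return reg_a, reg_b

data = [0, 1, 2, 3, 4, 5, 6, 7]
a, b = deinterleave(data)
print(a)                            # [0, 2, 4, 6]
print(b)                            # [1, 3, 5, 7]
```

This pattern is how packed multi-channel data (e.g. interleaved stereo samples or RGB components) is separated into per-channel SIMD registers in one step.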
XI. VFP

VFP (Vector Floating Point) is a floating-point hardware accelerator whose primary purpose is to perform one operation on one set of inputs and return one output, thereby speeding up floating-point calculations. Without dedicated floating-point hardware, ARM processors must fall back on considerably slower software math libraries. The VFP supports both single- and double-precision floating-point calculations compliant with IEEE 754 [2]. It is worth noting that the VFP does not deliver the same performance increase as NEON, because it lacks NEON's highly parallel, fully pipelined architecture.

XII. ARM CORTEX-A8 COMPARED TO ARM CORTEX-A17

The ARM Cortex-A8 is part of the ARMv7-A architecture. Seven cores have been designed with this architecture, including the Cortex-A8 and the Cortex-A17. The ARM Cortex-A17 is the most powerful core in the same family as the Cortex-A8, yet the differences between the two are drastic, from internal specifications to their actual use in devices. The Cortex-A17 provides a 60% increase in performance over the Cortex-A9, and the Cortex-A9 a 50% increase over the Cortex-A8 [10]; compounding these figures, the Cortex-A17 offers roughly 2.4 times the performance of the Cortex-A8, a 140% increase.

Fig. 5. Cortex-A17 performance comparison to the Cortex-A9 [8]
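The compounding behind that comparison is worth making explicit, since successive relative speedups multiply rather than add:

```python
# The compounding arithmetic behind the comparison above: successive
# relative speedups multiply rather than add.

a9_over_a8 = 1.50        # Cortex-A9 ~50% faster than Cortex-A8 [10]
a17_over_a9 = 1.60       # Cortex-A17 ~60% faster than Cortex-A9 [8]

a17_over_a8 = round(a9_over_a8 * a17_over_a9, 2)
print(a17_over_a8)                       # 2.4
print(round((a17_over_a8 - 1) * 100))    # 140 (not 60 + 50 = 110)
```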
This leads to the initial observation that the Cortex-A17 is far more powerful than the Cortex-A8, even though both implement the same 32-bit ARMv7-A architecture using the NEON SIMD engine and the VFP hardware accelerator. Like the Cortex-A8, this core is very popular in mobile devices, pairing high performance with the high efficiency first brought about by the Cortex-A8's introduction. The Cortex-A17 scales to four cores, each with a fully out-of-order pipeline, delivering the performance expected of today's premium mobile devices [8]. This is a key difference, since the Cortex-A8 is a single-core design, and it accounts for much of the Cortex-A17's speed increase. The decode width of the Cortex-A17 is only one more than the Cortex-A8's, yet the ability to decode one additional instruction in parallel yields an improvement without sacrificing efficiency. The pipeline of the Cortex-A8 is 13 stages, in order, while the Cortex-A17's is 11+ stages, out of order. The NEON SIMD datapath of the Cortex-A8 is 64 bits wide, whereas the Cortex-A17's is 128 bits wide, allowing greater parallel processing of data. The Cortex-A17 plays a large role in the big.LITTLE architecture, whereas the Cortex-A8 does not use big.LITTLE at all. The Cortex-A8 lacks a pipelined VFP accelerator; the Cortex-A17 has one, which improves floating-point performance. The Cortex-A8 is used in many commercial applications that affect our daily lives. It serves in smartphones as an application processor running fully featured mobile operating systems; the Cortex-A17 is commonly seen in smartphones as well, and also in tablets, unlike the Cortex-A8. The Cortex-A8 is also used in netbooks as a power-efficient main processor running a desktop OS, and in set-top boxes as the main processor managing a rich OS, multi-format A/V, and the UI, the same as the Cortex-A17.
Both are also used in digital-TV applications as the processor managing the rich OS, UI, and browser functions. The Cortex-A8 is used in home networking as a control processor for system management, in storage networking as a control processor managing traffic flow, and even in printers as a high-performance integrated processor [8][9]. The Cortex-A17 also targets industrial and automotive infotainment, which the Cortex-A8 did not [8]. These are devices that we interact with in our lives, some of them daily. The small size of the core is advantageous because it fits into small devices such as smartphones, netbooks, TV receivers, and printers. The Cortex-A8's power efficiency is also a major advantage: for small battery-powered devices it makes a huge difference in run time per charge. The power of the Cortex-A8 is very useful in many of these applications; with its pipelining abilities and the enhancement of the NEON SIMD engine and the VFP hardware accelerator, it gives small devices such as smartphones remarkable processing speed. The Cortex-A8 and the Cortex-A17 are very similar, yet show large performance differences. The Cortex-A8 is a high-performance processor designed for complex systems, featuring:
• A symmetric, superscalar pipeline with full dual-issue capability
• High frequency through an efficient, deep pipeline
• An advanced branch prediction unit with >95% accuracy
• An integrated level-2 cache for optimal performance in high-performance systems [9]
The Cortex-A8 is designed to handle media processing in software with NEON technology, which offers:
• A 128-bit SIMD data engine
• Twice the performance of ARMv6 SIMD
• Power savings through efficient media processing
• The flexibility to handle the media formats of the future
• The ability to easily integrate multiple codecs in software with NEON technology on the Cortex-A8 • Enhanced user interfaces [9] The Cortex-A8 boasts many features, but how do they compare to the Cortex-A17's? The Cortex-A8 features NEON, a 128-bit SIMD engine that enables high-performance media processing. It also features an optimized Level 1 cache, integrated tightly into the processor with a single-cycle access time, as well as an integrated Level 2 cache built into the core, which provides ease of integration, power efficiency, and optimal performance. The Cortex-A8 also features Thumb-2 technology, which delivers the peak performance of traditional ARM code while providing up to a 30% reduction in the memory required to store instructions. It also has dynamic branch prediction: to minimize branch misprediction penalties, the dynamic branch predictor achieves 95% accuracy across a wide range of industry benchmarks. The Cortex-A8 also features a Memory Management Unit; a full MMU enables the Cortex-A8 to run rich operating systems in a variety of applications. It also features Jazelle-RCT technology, a Java-acceleration technology that optimizes Just-in-Time (JIT) compilation and Dynamic Adaptive Compilation (DAC) and reduces memory footprint by up to three times. The Cortex-A8 also features a memory system optimized for power efficiency and high performance, and TrustZone technology, which allows for secure transactions and Digital Rights Management (DRM) [9]. This list of features comes from the ARM website and the specific product specification pages. The Cortex-A17 also has a list of specifications on the ARM website, though they differ from the Cortex-A8's. The Cortex-A17 and the Cortex-A8 share some features, such as Thumb-2 technology, TrustZone technology, NEON, and optimized Level 1 caches.
The Cortex-A17 also has an integrated Level 2 cache controller, but with a configurable cache size. The Cortex-A17 also has DSP and SIMD extensions, which increase the DSP processing capability of ARM solutions in high-performance applications while offering the low power consumption required by portable, battery-powered devices. It also includes floating-point hardware: the Cortex-A17 provides a high-performance FPU with hardware support for half-, single-, and double-precision floating-point arithmetic. The Cortex-A17 also features hardware virtualization, highly efficient hardware support for data management and arbitration whereby multiple software environments and their applications can simultaneously access the system's capabilities. It also has the Large Physical Address Extension (LPAE), which enables the processor to access up to 1TB of memory. The Cortex-A17 further features the AMBA4 CoreLink CCI-400 Cache Coherent Interconnect, which provides AMBA4 ACE ports for full coherency between multiple processors, enabling use cases like big.LITTLE [8]. This lengthy list of Cortex-A17 features for comparison with the Cortex-A8 was also retrieved from the ARM website, in the Cortex-A17 product specifications section. The comparison shows where the ARMv7-A architecture has evolved: the Cortex-A8 is one of the middle models in the line's development, whereas the Cortex-A17 is the newest and most powerful that ARM produces in this architecture set. The debugger story for the Cortex-A8 and the Cortex-A17 is the same. The ARM DS-5 Development Studio fully supports all ARM processors and IP as well as a wide range of third-party tools, operating systems, and EDA flows. DS-5 represents a comprehensive range of software tools to create, debug, and optimize systems based on the Cortex-A8 and Cortex-A17 processors [8].
This description comes from the Cortex-A17 related-products page but is nearly identical to that of the Cortex-A8. Both incorporate the DS-5 Debugger, whose powerful and intuitive graphical environment enables fast debugging of bare-metal, Linux, and Android native applications. DS-5 Debugger provides pre-defined configurations for Fixed Virtual Platforms
(built on ARM Fast Models technology) and ARM Versatile Express boards, enabling early software development before silicon availability [8][9]. This segment is the same for both the Cortex-A17 and the Cortex-A8. Both processors use the same family of products for graphics processing. The Mali™ family of products combines to provide the complete graphics stack for all embedded graphics needs, enabling device manufacturers and content developers to deliver the highest-quality, cutting-edge graphics solutions across the broadest range of consumer devices [8][9]. An example is the Mali-400 paired with the Cortex-A8, the world's first OpenGL ES 2.0 conformant multi-core GPU, which provides 2D and 3D acceleration with performance scalable up to 1080p resolution [9]. For the Cortex-A8, the ARM Physical IP Platforms deliver process-optimized IP for best-in-class implementations of the Cortex-A8 processor at 40nm and below [9]. The Cortex-A8 uses the Standard Cell Logic Libraries, which are available in a variety of architectures; ARM Standard Cell Libraries support a wide performance range for all types of SoC designs. It also supports Memory Compilers and Registers, a broad array of silicon-proven SRAM, Register File, and ROM memory compilers for all types of SoC designs, ranging from performance-critical to cost-sensitive and low-power applications. The Cortex-A8 also supports Interface Libraries, a broad portfolio of silicon-proven interface IP designed to meet varying system architectures and standards [9]. The ARM Physical IP Platforms deliver process-optimized IP for best-in-class implementations of the Cortex-A17 processor at 28nm and below [8]. This is similar to the Cortex-A8, except for the change from 40nm to 28nm.
A set of high-performance POP™ IP containing advanced ARM Physical IP for 28nm technologies supports the Cortex-A17, enabling rapid development of leadership physical implementations [8]. ARM is uniquely able to design the optimization packs in parallel with the Cortex-A17 processor, enabling the processor and physical IP combination to deliver best-in-class performance in the mobile power envelope while facilitating rapid time-to-market [8]. The Physical IP for the Cortex-A17 thus differs from the Cortex-A8's through its use of POP IP. System IP components are essential for building complex systems-on-chip; by utilizing System IP components, developers can significantly reduce development and validation cycles, saving cost and reducing time to market [9]. The Cortex-A8 uses a different set of System IP tools than the Cortex-A17; here are the differences: Cortex-A8 • Advanced AMBA 3 Interconnect IP using the AXI AMBA bus. • Dynamic Memory Controller using the AXI AMBA bus. • Adaptive Verification IP using the AXI AMBA bus. • DMA Controller using the AXI AMBA bus. • CoreSight Embedded Debug and Trace using the ATB AMBA bus. [9] The set of System IP tools that the Cortex-A17 uses is as follows: Cortex-A17 • AMBA 4 Cache Coherent Interconnect – The CCI-400 provides AMBA 4 AXI Coherency Extensions (ACE) compliant ports for full coherency between the Cortex-A17 processor and other Cortex processors, better utilizing caches and simplifying software development. This feature is essential for high-bandwidth applications, including future mobile SoCs that require clusters of coherent processors or GPUs. Combined with other available ARM CoreLink System IP, the CCI-400 increases system performance and power efficiency.
– CoreLink CCI-400 Cache Coherent Interconnect provides system coherency with Cortex processors and an IO-coherent channel with Mali IP, opening up a number of possibilities for offload and acceleration of tasks. When combined with a Cortex-A7 processor, the CCI-400 allows big.LITTLE operation with full L2 cache coherency between the Cortex-A17 and Cortex-A7 processors. – Efficient voltage scaling and power management is enabled with the CoreLink ADB-400, unlocking DVFS control of the Cortex-A17 processor. • AMBA Generic Interrupt Controller – AMBA interrupt controllers like the GIC-400 provide an efficient implementation of the ARM Generic Interrupt Specification for multi-processor systems. They are highly configurable, providing the ultimate flexibility in handling a wide range of interrupt sources that can control a single CPU or multiple CPUs. • AMBA 4 CoreLink MMU-500 – CoreLink MMU-500 provides a hardware-accelerated common memory view for all SoC components and minimizes software overhead, freeing virtual machines to get on with other system-management functions. • CoreLink TZC-400 – The Cortex-A17 processor implements a secure, optimized path to memory to further enhance its market-leading performance with the aid of the CoreLink TZC-400 TrustZone address space controller. • CoreLink DMC-400 – All interconnect components and the ARM DMC guarantee bandwidth and latency requirements by utilizing built-in dynamic QoS mechanisms. • CoreSight SoC-400 – ARM CoreSight SoC debug and trace hardware is used to profile and optimize the system software running throughout the system, from driver to OS level. • Artisan POP IP – The Cortex-A17 processor is supported through advanced physical POP IP for accelerated time to market [8]. These differences show how much more technology is in the Cortex-A17 than in the Cortex-A8, even though they belong to the same ARM architecture family (ARMv7-A).
Differences like these make clear how flexible these systems are and what can be done with them, from media processing to data crunching. They are important to understand because they lay out where this technology is headed and what changes could be, and are being, made to create more powerful yet more efficient devices. XIII. CONCLUSION The Cortex-A8 is an important example of RISC-based superscalar design. It has many features that make it a powerful and flexible processor, and the sum of its components results in increased performance and flexibility. Its instruction pipelining and branch prediction are critical to ensuring performance efficiency. The NEON SIMD possesses a very robust architecture, including its own instruction pipelines, and introduces a host of new capabilities for multimedia and graphics processing. Examination of other ARM processors further illustrates the Cortex-A8's evolution: the Cortex-A8 belongs to a family consisting of seven other processors.
A comparison with the faster Cortex-A17 demonstrates the high degree of flexibility of the Cortex-A8. This flexibility is critical to the Cortex-A8's success in consumer electronics. The processor is commercially available in a variety of applications, including mobile devices and other media. Studying the ARM Cortex-A8 is critical to understanding the role superscalar architecture plays in embedded systems.
REFERENCES
[1] Williamson, David, "ARM Cortex-A8: A High-Performance Processor for Low-Power Applications," Unique Chips and Systems (2007): 79.
[2] (n.d.). Texas Instruments Wiki, "Cortex-A8 - Texas Instruments Wiki." Retrieved from
[3] (n.d.). ARM - The Architecture For The Digital World, "NEON - ARM." Retrieved from
[4] (n.d.). ARM - The Architecture For The Digital World, "Cortex-A8 Processor - ARM." Retrieved from
[5] (n.d.). ARM - The ARM Architecture. With a focus on v7A and Cortex-A8. Retrieved from Arch A8.pdf
[6] ARM, A. (2000). Architecture Reference Manual. ARM DDI E, 100, 6. waldroj/3d1/arm arm.pdf
[7] Cortex-A17 Processor. (n.d.). Retrieved November 17, 2014, from a17-processor.php
[8] Cortex-A8 Processor. (n.d.). Retrieved November 17, 2014, from a8.php
[9] Cortex-A9 Processor. (n.d.). Retrieved November 17, 2014, from a9.php