CEC 470, PROJECT II, DECEMBER 2014 1
Andrew Daws, David Franklin, Cole Laidlaw
The purpose of this document is to provide a comprehensive overview of the ARM Cortex-A8. It includes background information, a detailed description of the Cortex-A8's RISC-based superscalar design, a brief comparison with other ARM processors, and a description of the NEON SIMD engine and other features. The Cortex-A8 is among the highest-performance ARM processors currently in service and is widely used in mobile devices and other consumer electronics. This analysis includes basic descriptions of the components and processes of the Cortex-A8 architecture, such as its 13-stage instruction pipeline, branch prediction, and other features that deliver high performance with low power needs. The Cortex-A8 includes the NEON SIMD engine, which uses integer and floating-point operations to deliver greater graphics and multimedia capabilities. Additionally, we include a brief description of the ARMv7-A architecture to establish a basis for comparison with the Cortex-A8.
Index Terms: ARM, SIMD, VFP, RISC.
The increasing ubiquity of mobile computing continues to increase the demand for processors that are versatile and deliver high performance. This demand is driven by the need for a variety of services, including connectivity and entertainment. The ARM Cortex-A8 - designed by ARM and licensed to silicon partners such as Texas Instruments - combines connectivity, performance, and multimedia. It achieves this versatility while attaining energy efficiency. Its performance can be measured in Instructions per Cycle (IPC) and is achieved through a balance of increased operating frequency and machine efficiency. The increase in performance results from superscalar execution, improvements in branch prediction, and an efficient memory system. The Cortex-A8 has a pipeline with less logic depth per stage than previous ARM cores. It is important to analyze the Cortex-A8's features to highlight their effects on power saving and improved performance. There is also an important need for graphics and multimedia. The Cortex-A8 meets this demand with the NEON engine, which achieves greater graphical capabilities by utilizing 64-bit integer and floating-point operations. NEON is a Single Instruction Multiple Data (SIMD) accelerator processor capable of executing one instruction across up to 16 data sets simultaneously. This parallelism confers a host of new capabilities. The Cortex-A8 also employs the Vector Floating Point (VFP) accelerator to speed up floating-point operations.
The Cortex-A8's capabilities can be illustrated by a brief comparison with other architectures within ARM's Cortex family. The Cortex-A8 belongs to the ARMv7-A family, a group consisting of seven processors, including the Cortex-A8. The Cortex-A8 has proven to be a more flexible processor when compared to related architectures. These other designs can be faster and more powerful but lack the Cortex-A8's versatility. Any comparison illustrates the Cortex-A8's success as a commercial-grade processor. Analyzing these concepts reveals the importance of the Cortex-A8's RISC-based superscalar design and its versatility.
II. ARM PROCESSORS
The ARM processor family has had a substantial impact on the world of consumer electronics. ARM's developers founded their company in 1990 as Advanced RISC Machines Ltd. The company's purpose was to develop commercial-grade general-purpose processors. ARM processors can be found on many platforms, including laptops, mobile devices, and other embedded systems. ARM is a Reduced Instruction Set Computer (RISC) architecture. Typical RISC architectures include several features: a large register file, a load/store architecture, simple addressing modes, and uniform instruction lengths. The ARM architecture includes several more aspects in addition to these basic RISC features:
• Control of the ALU and shifter in most data-processing operations to maximize their use
• Optimization of program loops through auto-increment and auto-decrement addressing modes
• Load and Store multiple instructions to maximize data throughput
• Maximized execution throughput through conditional execution of almost all instructions
These ARM features enhance the existing RISC architecture to reach high performance, reduced code size, and reduced power needs. A typical ARM processor has 16 visible registers out of 31 total registers. There are three special-purpose registers: the stack pointer, link register, and program counter. ARM supports exception handling, which causes a standard register to be replaced with a register specific to the respective exception type. All processor states are contained in status registers. The ARM instruction set includes branch, data processing, status register transfer, load/store, coprocessor, and exception-generating instructions.
III. THE ARM CORTEX-A8
The ARM Cortex-A8 is a microprocessor intended for general-purpose consumer electronics. The ARM architecture is load/store, with an instruction set similar to other RISC processors but containing numerous special features. Shift and ALU operations may be carried out in the same instruction. The program counter may be used as a general-purpose register. There is support for 16-bit and 32-bit instruction opcodes. Lastly, there is a fully conditional instruction set. There are sixteen 32-bit registers, thirteen of which are general purpose. The stack pointer, link register, and program counter comprise the remaining registers. These registers can be used for load/store instructions and data processing in addition to their special purposes.
Pipeline. The Cortex-A8 utilizes a sophisticated instruction pipeline architecture. This 13-stage instruction pipeline implements an in-order, dual-issue, superscalar processor with advanced branch prediction. The main pipeline is divided into fetch, decode, and execute portions. The fetch stages are responsible for branch prediction and for placing instructions into a buffer for decoding. Decoding is implemented in five stages that decode, schedule, and issue instructions. Complex instruction sequences are processed, or even replayed if the memory system stalls. The six execute stages comprise a load-store pipeline, a multiply pipeline, and two symmetric ALU pipelines. There are additional pipelines in addition to the main 13-stage pipeline: an 8-stage pipeline is used for the level-2 memory system and a 13-stage pipeline for debug trace execution. The NEON SIMD's execution engine implements a 10-stage pipeline, which includes four stages for instruction decode and six stages for execution.
Fig. 1. Full Pipeline 
IV. INSTRUCTION FETCH
Instruction Fetch Pipeline. Dynamic branch prediction, instruction queueing hardware, and the entire Level-1 instruction-side memory are located within the instruction fetch unit. The instruction fetch pipeline runs decoupled from the rest of the processor and may acquire up to four instructions per cycle along the predicted execution stream. Instructions are subsequently placed in a queue to be decoded.
A new virtual address is created at the F0 stage once the fetch pipeline begins. This may be a predicted target address, or the next address calculated sequentially from the previous instruction if no branch is made. The F0 stage is not counted as an official stage; by ARM pipeline convention, the instruction cache access stage is considered the first stage. The F1 stage serves two purposes in parallel: the instruction cache and the branch prediction arrays are both accessed during it. The F2 stage is the final stage in the instruction fetch pipeline. Instruction data is returned from the instruction cache and placed in its respective queue for future use by the decode unit. A new target address will be used as the fetch address if there is a resulting branch prediction. This replaces the address calculated in the F0 stage and discards the instruction fetch made in the F1 stage. Code sequences that contain many branches may therefore suffer repeated discarded fetches from this redirection.
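The redirect behavior described above can be sketched as a toy model. This is purely illustrative: the function name, cycle accounting, and the exact point at which the prediction fires are simplifying assumptions, not the real fetch logic.

```python
def fetch_trace(predictions, start=0, cycles=8):
    """Toy three-stage fetch model (F0-F2). `predictions` maps a fetch
    address to its predicted branch target. When a prediction fires for
    the address completing F2, the F0 address is replaced with the target
    and the speculative F1 fetch is discarded."""
    fetched = []                       # addresses whose fetch completed F2
    f0, f1, f2 = start, None, None
    for _ in range(cycles):
        if f2 is not None:
            fetched.append(f2)         # instruction data returned in F2
        if f2 in predictions:          # predicted-taken branch detected
            f0, f1, f2 = predictions[f2], None, None   # redirect F0, drop F1
        else:
            f0, f1, f2 = f0 + 1, f0, f1                # sequential advance
    return fetched
```

Running the model with a predicted branch at address 2 targeting address 10 shows the in-flight sequential fetch after the branch being discarded, which is the bubble branch-heavy code pays for.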
Fig. 2. Fetch Pipeline 
Instruction Cache. An instruction cache is implemented in the instruction fetch unit, and it is the largest component of that unit. It can be configured as 16KB or 32KB in size and can return 64 bits of data per access. The instruction cache is a physically addressed, four-way set associative cache. A fully associative 32-entry translation lookaside buffer (TLB) is also included.
Instruction and data caches are identical to ensure design efficiency. Differences are minimized by allowing access to the same array structures while making only minor changes to control logic. These elements are consistent with conventional cache designs. The hashed virtual address buffer (HVAB) is not part of this conventional design strategy. Traditionally, the data RAMs of all ways are fired in parallel, and the physical address is then compared against the tag arrays to verify the data contained in RAM. The HVAB avoids firing the arrays in parallel: a 6-bit hash of the virtual address indexes the HVAB to predict which cache way is likely to contain the required data. Translation and a tag compare using the TLB then verify whether the hit is accurate. If a hit proves invalid, the access is canceled and the HVAB and cache data are updated. The TLB translation and tag compare are thereby removed from the critical path to cache access. This process results in power savings but hinders performance when predictions are inaccurate, which can be mitigated by implementing an efficient hash function with a low probability of false matches.
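The predict-then-verify flow can be modeled as follows. The 6-bit hash width comes from the text; everything else here - the class name, the way table, the XOR-fold hash, and the cycle costs - is a made-up simplification for illustration.

```python
def way_hash(vaddr):
    # hypothetical 6-bit XOR-fold of the virtual address (an assumption)
    h = 0
    while vaddr:
        h ^= vaddr & 0x3F
        vaddr >>= 6
    return h

class ToyHvab:
    """Sketch of hashed way prediction: consult only the predicted way
    first, then verify with a full lookup, repairing the predictor on a
    false match at the cost of an extra cycle."""
    def __init__(self, contents):
        self.contents = contents       # way number -> set of resident addresses
        self.pred = {}                 # 6-bit hash -> last way that hit

    def lookup(self, vaddr):
        h = way_hash(vaddr)
        guess = self.pred.get(h)
        if guess is not None and vaddr in self.contents[guess]:
            return guess, 1            # fast path: only one way fired
        for way, lines in self.contents.items():
            if vaddr in lines:         # slow path: verify across all ways
                self.pred[h] = way     # update the predictor
                return way, 2
        return None, 2                 # cache miss
```

The power saving comes from the fast path firing a single way; the performance cost of a wrong guess shows up as the extra cycle on the slow path.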
Instruction Queue. The purpose of the instruction queue is to smooth out discontinuities between instruction delivery and instruction consumption. Instructions are placed in the instruction queue after they are fetched from the instruction cache; they are forwarded directly to the D0 stage if the instruction queue is empty. Decoupling allows the instruction fetch unit to prefetch ahead of the rest of the integer unit and establish a reserve of instructions awaiting decoding. This reserve conceals the latency of prediction changes. Decode unit stalls are also prevented from spreading back into the prefetch unit during the cycle in which a stall is recognized.
Four parallel FIFOs comprise the instruction queue. Each FIFO consists of six entries that are 20 bits wide - 16 bits of instruction data and four bits of control state. A single instruction may occupy up to two entries.
Branch Prediction. A 512-entry branch target buffer (BTB) and a 4096-entry global history buffer (GHB) are included in the branch predictor. The BTB indicates whether the current fetch address contains a branch; the GHB holds counters that indicate whether predicted branches should or should not be taken. The BTB is indexed by the fetch address and contains target addresses and branch types for both direct and indirect branches. Both arrays are accessed in parallel with the instruction cache during the F1 stage. A 10-bit global branch history and four lower bits of the PC select a GHB entry. Branch history is generated by recording the taken/not-taken status of the 10 most recent branches; this information is saved in the global history register (GHR). This approach increases efficiency by creating traces that are used to make better predictions. Low-order PC bits are used in indexing the GHB to prevent referencing overly similar histories.
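The history-plus-PC indexing can be sketched as follows. The 10-bit history, the four PC bits, and the 4096 entries are taken from the text; the exact way the two are mixed is not specified there, so the shift-and-XOR combination below is an assumption.

```python
GHB_ENTRIES = 4096

def ghb_index(ghr, pc):
    """Combine the 10-bit global history with four low-order PC bits
    into a 12-bit GHB index (4096 entries). The mixing function here
    is a plausible guess, not the documented hardware scheme."""
    hist = ghr & 0x3FF               # 10 bits of taken/not-taken history
    pc_bits = (pc >> 2) & 0xF        # four low-order bits of the word address
    return ((hist << 2) ^ pc_bits) & (GHB_ENTRIES - 1)

def update_ghr(ghr, taken):
    """Shift the newest branch outcome into the global history register."""
    return ((ghr << 1) | (1 if taken else 0)) & 0x3FF
```

Folding PC bits into the history-based index is what keeps branches with similar recent histories from colliding on the same counter.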
Return Stack. Subroutine return predictions are made using a return stack with an eight-entry depth. Return addresses are pushed onto the stack once the BTB determines that a branch is a subroutine call. A subroutine return results in the address being popped from the stack instead of being read from the BTB entry. It is important to support multiple push/pop commands at a time due to the relative shortness of each subroutine. Speculative updates may be harmful because updates from an incorrect path can result in a loss of synchronization with the return stack, causing mispredictions. The instruction fetch unit therefore maintains both speculative and non-speculative return stacks. The speculative return stack is updated immediately, while the non-speculative return stack is not updated until the branch resolves. Inaccurate predictions result in the speculative stack being overwritten by the non-speculative state.
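The two-stack recovery scheme can be sketched like this. The eight-entry depth is from the text; the method names and the exact overflow policy (discard the oldest entry) are illustrative assumptions.

```python
class ReturnStacks:
    """Sketch of paired speculative / non-speculative return-address
    stacks, eight entries deep, oldest entry discarded on overflow."""
    DEPTH = 8

    def __init__(self):
        self.spec = []        # updated immediately at predict time
        self.committed = []   # updated only when the call/return resolves

    def predict_call(self, return_addr):
        self.spec.append(return_addr)
        del self.spec[:-self.DEPTH]       # keep at most eight entries

    def predict_return(self):
        return self.spec.pop() if self.spec else None

    def commit_call(self, return_addr):
        self.committed.append(return_addr)
        del self.committed[:-self.DEPTH]

    def commit_return(self):
        if self.committed:
            self.committed.pop()

    def recover(self):
        # on a misprediction, the speculative stack is overwritten
        # with the committed, non-speculative state
        self.spec = list(self.committed)
```

A wrong-path call pushed onto the speculative stack simply disappears on `recover()`, restoring synchronization with the architecturally correct call chain.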
V. INSTRUCTION DECODE
Instruction Decode Pipeline. The instruction decode unit decodes, sequences, and issues new instructions, and provides exception handling. The decode unit is contained within the D0-D4 pipeline stages. The instruction type and the destination and source operands are determined in the D0 and D1 stages. Multi-cycle instructions are divided into multiple single-cycle instructions during the D1 stage. Instructions are written into and read from the pending/replay queue structure during the D2 stage. The D3 stage implements the instruction scheduling logic. The scoreboard is referenced for the next two possible instructions during this stage, and these instructions are analyzed to determine any dependency hazards that may not be detected by the scoreboard. Instructions cannot be stalled once they reach the D3/D4 boundary. Final decode for all control signals critical to the instruction execute and load/store units occurs in the D4 stage.
Fig. 3. Decode Pipeline 
Static Scheduling Scoreboard. The static scheduling scoreboard predicts the availability of operands. Each scoreboard value indicates the number of cycles until a valid result is available. This differs from traditional scoreboards, which normally use a single bit to indicate the availability of a source operand. This information is used with the source operands to determine possible dependency hazards. Each scoreboard entry is self-updating on a cycle-to-cycle basis to ensure proper operation when a new register is written. Each entry decrements by one every cycle until a new register write occurs or the counter reaches zero - which indicates availability. The static scheduling scoreboard also tracks each execution pipeline by its respective stage and result. This information is used to generate the forwarding multiplexer control signals that accompany instructions upon issue.
There are several advantages to the static scheduling scoreboard. It allows implementation of a fire-and-forget pipeline with no stalls when used in conjunction with the replay queue. This removes speed paths that would hinder high-frequency operation, and the design conserves power by knowing early which execution units are required.
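The countdown behavior can be sketched in a few lines. The counts-instead-of-a-busy-bit idea is from the text; the class shape, register count, and latencies are illustrative assumptions.

```python
class StaticScoreboard:
    """Sketch: each register entry counts down the cycles until its
    value is available, rather than holding a single busy bit."""
    def __init__(self, nregs=16):
        self.cycles_left = [0] * nregs

    def write(self, reg, latency):
        # a new producer resets the countdown to its result latency
        self.cycles_left[reg] = latency

    def tick(self):
        # entries self-decrement every cycle until they reach zero
        self.cycles_left = [max(0, c - 1) for c in self.cycles_left]

    def ready_in(self, reg):
        return self.cycles_left[reg]   # 0 means the operand is available
```

Because the scheduler knows *when* an operand arrives, not just *whether* it is busy, it can issue a dependent instruction so that the operand and the forwarding path line up exactly, which is what makes fire-and-forget issue possible.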
Instruction Scheduling. The Cortex-A8 is a dual-issue processor with two integer pipelines: pipe0 and pipe1. Pipe0 contains the older instruction and pipe1 the newer. If the older instruction cannot issue, the instruction in pipe1 will not issue either, even if there is no hazard or resource conflict. Pipe0 is the default for single instructions. All instructions progress through the execution pipeline and have their results recorded into the register file during the E5 stage. This process prevents write-after-read hazards and tracks write-after-write hazards. The pipe0 instruction is free to issue if no hazards are detected by the scoreboard. There are constraints that must be considered in addition to scoreboard indicators for dual issue. The combination of instruction types must be considered; the following combinations are supported:
• Any two data processing instructions
• One load/store instruction followed by one data processing instruction
• Older multiply instruction with a newer load/store or data processing instruction
The program counter can only be changed by one of the two issued instructions. Only branch instructions, or data-processing and load instructions with the program counter as the destination register, may change its value.
The two instructions must be cross-referenced to verify data dependency. Read-after-write or write-after-read hazards may prevent dual issue. Dual issue may be prevented if the newer instruction requires a destination register before it is produced by the older instruction, or if both instructions write to the same register. Comparisons are performed between when the data is produced and when it is needed, so dual issue is not prevented if the data is not needed for one or more cycles. Examples of when this occurs:
• A compare or subtract instruction that sets the flags followed by a flag-dependent conditional instruction
• Any ALU instruction followed by a dependent store of the ALU result to memory
• A move or shift instruction followed by a dependent ALU instruction
These instructions are commonplace in conventional code sequences, so handling these dual-issue instruction pairs is critical to the overall performance increase.
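The pairing rules above can be condensed into a small predicate. This sketch checks only the type combinations and the two register hazards named in the text; the timing relaxations (data not needed for a cycle or more) are deliberately omitted, and the instruction encoding as dicts is an assumption.

```python
# older-instruction kind paired with newer-instruction kind
ALLOWED_PAIRS = {
    ("dp", "dp"),    # any two data-processing instructions
    ("ls", "dp"),    # load/store followed by data processing
    ("mul", "ls"),   # older multiply with a newer load/store
    ("mul", "dp"),   # older multiply with a newer data-processing op
}

def can_dual_issue(older, newer):
    """Sketch of the dual-issue pairing check. Instructions are dicts:
    {'kind': str, 'src': set of source regs, 'dst': reg or None}."""
    if (older["kind"], newer["kind"]) not in ALLOWED_PAIRS:
        return False
    if older["dst"] is not None and older["dst"] in newer["src"]:
        return False     # read-after-write: newer needs older's result
    if older["dst"] is not None and older["dst"] == newer["dst"]:
        return False     # both instructions write the same register
    return True
```

Note that order matters: a load/store followed by a data-processing instruction pairs, but the reverse combination is not in the supported list.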
VI. NEON PIPELINE
The Cortex-A8 has other features that complement its high performance, such as the NEON hybrid SIMD, which grants the Cortex-A8 increased performance in the field of graphics and other media.
There are numerous advantages to NEON. Efficiency of SIMD operations is ensured through aligned and unaligned data access. Integer and floating-point operations support a broad range of applications, including 3D graphics. A single instruction stream and a unified memory view create a simpler tool flow. Its large register file enables efficient data handling and memory access.
So what is NEON? The NEON engine is a SIMD (Single Instruction Multiple Data) accelerator processor, also known as a vector processor, which means that during the execution of one instruction the same operation occurs on up to 16 data sets in parallel. The purpose of this parallelism is to obtain more MIPS or FLOPS from the SIMD portion of the processor than could be obtained with a basic SISD (Single Instruction Single Data) processor running at the same clock rate. This increased parallelism also decreases the instruction count necessary to accomplish a task compared to running it on an SISD processor, thus also reducing the number of clocks used to perform the same task.
To determine how much of a speed increase the NEON engine will grant a specific loop, it is necessary to look at the data size of the operation. The largest NEON register is 128 bits, so if you wish to perform an operation on 8-bit values you can perform up to 16 operations simultaneously. As another example, if you are using 32-bit values, you can perform up to four operations simultaneously. However, there are other factors to take into consideration that affect execution speed, such as loop overhead, memory speeds, and data throughput. NEON instructions are mainly numerical, load/store, and some logical operations, so NEON operations execute while other instructions proceed through the main ARM pipeline.
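The lane arithmetic above is easy to make concrete. The scalar loop below is only a model of what the SIMD datapath does in one instruction; the function names and the modular wraparound behavior are illustrative assumptions.

```python
NEON_REG_BITS = 128

def lanes(elem_bits):
    """Parallel operations per NEON instruction for a given element size."""
    return NEON_REG_BITS // elem_bits

def simd_adds(values_a, values_b, elem_bits):
    """Scalar sketch of one instruction's worth of element-wise adds,
    with unsigned wraparound at the element width."""
    n = lanes(elem_bits)
    mask = (1 << elem_bits) - 1
    return [(a + b) & mask for a, b in zip(values_a[:n], values_b[:n])]
```

With 8-bit elements one instruction covers 16 adds; with 32-bit elements, four - which is exactly the instruction-count reduction relative to an SISD loop.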
NEON has four decode stages, known as M0-M3, which are similar in design to the D0-D4 decode stages of the main ARM pipeline. This structure uses the first two stages to decode the instruction resource and operand requirements, then the last two stages for instruction scheduling. NEON also has six execute stages, N1-N6. The NEON pipeline uses a fire-and-forget issue mechanism and a static scoreboard, similar to what is used by the ARM integer pipeline, with the primary difference being that there is no replay queue.
The NEON decode logic is highly capable in that it can dual issue any LS-permute instruction with any non-LS-permute instruction. This requires fewer register ports than would be needed for dual issuing two data-processing instructions, since LS data is provided directly from the load data queue. It is also the most useful pairing of instructions to dual issue, since significant load/store bandwidth is required to keep up with the Advanced SIMD data processing operations.
Access to the 32-entry register file is handled in the M3 stage when instructions are issued. Once an instruction is issued, it is sent to one of seven execution pipelines: integer arithmetic logic unit, integer multiply, integer shift, NFP add, NFP multiply, IEEE floating point, or load/store permute, with all execution datapath pipelines being balanced at six stages.
Fig. 4. NEON Pipeline Stages 
VII. NEON INTEGER EXECUTION PIPELINE
There are three execution pipelines responsible for executing NEON integer instructions: multiply-accumulate (MAC), shift, and ALU. The integer MAC pipeline contains two 32x16 multiply arrays with two 64-bit accumulate units. Each 32x16 multiplier array can perform four 8x8, two 16x16, or one 32x16 multiply operation per cycle and has dedicated register read ports for the accumulate operand. The MAC unit is also optimized to support one multiply-accumulate operation per cycle for high performance on a sequence of MAC operations.
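The throughput figures and the sustained-MAC pattern can be written down directly. The per-cycle table restates the text; the scalar dot-product loop is only a sketch of the operation the unit performs once per cycle, not hardware behavior.

```python
# multiplies one 32x16 array can perform each cycle, by operand widths
MULTS_PER_CYCLE = {(8, 8): 4, (16, 16): 2, (32, 16): 1}

def mac_dot(a, b):
    """Scalar sketch of a sustained multiply-accumulate sequence:
    one multiply feeding the accumulator per element, which the MAC
    unit can issue back to back, one per cycle in the ideal case."""
    acc = 0
    for x, y in zip(a, b):
        acc += x * y
    return acc
```

Dot products, FIR filters, and similar kernels are exactly the "sequence of MAC operations" the unit is optimized for.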
The integer shift pipeline consists of three stages. When only the shift result is required, it is made available early for subsequent instructions, at the end of the N3 stage. When both a shift and an accumulate operation are required, the result from the shift pipeline is forwarded directly to the MAC pipeline.
The integer ALU pipeline consists of two parallel 64-bit SIMD ALUs, each permitting four 64-bit inputs. The first stage of the ALU pipeline, N1, formats the operands in preparation for the next cycle; this includes inverting operands as needed for subtract operations, multiplexing vector element pairs for folding operations, and sign/zero-extending operands. The second stage, N2, performs the main ALU operations, such as add, subtract, logical, count leading sign/zero, count set, and sum-of-element-pairs operations, and also calculates the flags to be used in the following stage. The third stage, N3, performs operations such as compare, test, and max/min operations for saturation detection. The N3 stage also contains a SIMD incrementer for generating two's complement and rounding operations, along with a data formatter for performing high-half and halving operations. Just like the shift pipeline, the ALU pipeline uses the final stages, N4 and N5, for completing any accumulate operations by forwarding them to the MAC.
VIII. NEON LOAD-STORE/PERMUTE EXECUTION PIPELINE
The permute pipeline is fed by the load data queue (LDQ), which holds all data associated with NEON load accesses prior to entering the NEON permute pipeline. It is 12 entries deep, and each entry is 128 bits wide. Data can be placed into the LDQ from either the L1 cache or the L2 memory system. Accesses that hit in the L1 cache return and commit the data to the LDQ. Accesses that miss in the L1 cache initiate an L2 access; a pointer is attached to the load request as it proceeds down the L2 memory system pipeline, and when the data is returned from the L2 cache, the pointer is used to update the LDQ entry reserved for that load request. Each entry in the LDQ has a valid bit to indicate valid data returned from the L1 cache or L2. Entries in the LDQ can be filled by L1 or L2 out of order, but valid data within the LDQ must be read in program order: entries at the front of the LDQ are read off in order. If a load instruction reaches the M2 issue stage before the corresponding data has arrived in the LDQ, it will stall and wait for the data.
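The fill-out-of-order, read-in-order discipline can be sketched as a small queue model. The 12-entry depth, valid bits, and pointer mechanism come from the text; the class shape is an illustrative simplification (pointers here are only valid until the entry is read).

```python
class LoadDataQueue:
    """Sketch of the LDQ: entries fill out of order from L1 or L2,
    but data must be read in program order; a read returns None
    (a stall) until the head entry's valid bit is set."""
    DEPTH = 12

    def __init__(self):
        self.entries = []                 # [valid, data] in program order

    def reserve(self):
        assert len(self.entries) < self.DEPTH
        self.entries.append([False, None])
        return len(self.entries) - 1      # pointer attached to the load request

    def fill(self, ptr, data):
        self.entries[ptr] = [True, data]  # L1/L2 return, possibly out of order

    def read(self):
        if not self.entries or not self.entries[0][0]:
            return None                   # M2 stalls waiting for head data
        return self.entries.pop(0)[1]
```

Even when a younger load's data arrives first, the head of the queue gates all reads, preserving program order.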
L1 and L2 data that is read out of the LDQ is aligned and formatted to be useful for the NEON
execution units. Aligned/formatted data from the LDQ is multiplexed with NEON register read
operands in the M3 stage, before it is issued to the NEON execute pipeline.
The NEON LS/Permute pipeline is responsible for all NEON loads/stores, data transfers to/from the integer unit, and data permute operations. One of the more interesting features of the NEON instruction set is the data permute operations, which can be done from register to register or as part of a load or store operation. These operations allow the interleaving of bytes of memory into packed values in SIMD registers. For example, when adding two eight-byte vectors, you may wish to gather all of the odd bytes of memory into register A and the even bytes into register B. The permute instructions in NEON allow you to do operations like this natively in the instruction set, often using only a single instruction.
This data permute functionality is implemented by the load-store permute pipeline. Any data permutation required is done across two stages, N1-N2. In the N3 stage, store data can be forwarded from the permute pipeline and sent to the NEON store buffer in the memory system.
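The odd/even example above can be shown in scalar form. This is plain Python in the spirit of NEON's structured de-interleaving loads (VLD2-style), not the actual instruction semantics.

```python
def deinterleave_bytes(mem):
    """Sketch of a structured, de-interleaving load: even-indexed bytes
    of memory go to one register, odd-indexed bytes to another."""
    reg_a = mem[0::2]      # even bytes
    reg_b = mem[1::2]      # odd bytes
    return reg_a, reg_b
```

Doing this as part of the load itself is what lets subsequent SIMD arithmetic operate on already-separated channels without extra shuffle instructions.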
IX. NEON FLOATING-POINT EXECUTION PIPELINES
The NEON Floating-Point (NFP) unit has two main pipelines: a 6-stage multiply pipeline and a 6-stage add pipeline. The add pipeline adds two single-precision floating-point numbers, producing a single-precision sum. The multiply pipeline multiplies two single-precision floating-point numbers, producing a single-precision product. In both cases, the pipelines are 2-way SIMD, which means that two 32-bit results are produced in parallel when executing NFP instructions.
X. NEON'S IEEE-COMPLIANT FLOATING-POINT ENGINE
The IEEE-compliant floating-point engine is a non-pipelined implementation of the ARM floating-point instruction set targeted at medium-performance, IEEE 754-compliant single- and double-precision floating point. It is designed to provide general-purpose floating-point capabilities for a Cortex-A8 processor. This engine is not pipelined for most operations and modes, but instead iterates over a single instruction until it has completed. A subsequent operation is stalled until the prior operation has fully completed execution and written its result to the register file. The IEEE-compliant engine is used for any floating-point operation that cannot be executed in the NEON floating-point pipeline. This includes all double-precision operations and any floating-point operations run with IEEE precision enabled.
XI. VFP
VFP (Vector Floating Point) is a floating-point hardware accelerator whose primary purpose is to perform one operation on one set of inputs and return one output, thus allowing it to speed up floating-point calculations. Considerably slower software math libraries are used by ARM processors if dedicated floating-point hardware is not available. The VFP supports both single- and double-precision floating-point calculations compliant with IEEE 754. It is also worth noting that the VFP does not offer the same performance increase that NEON grants, because it does not contain a similarly parallel and fully pipelined architecture.
XII. ARM CORTEX-A8 COMPARED TO ARM CORTEX-A17
The ARM Cortex-A8 is part of the ARMv7-A architecture. There have been seven cores designed with this architecture, including the Cortex-A8 and the Cortex-A17. The ARM Cortex-A17 is the most powerful core within the same family as the Cortex-A8, yet the differences between the two are drastic, from internal specifications to their actual use in devices. The Cortex-A17 provides a 60% increase in performance over the Cortex-A9, and the Cortex-A9 has a 50% increase in performance over the Cortex-A8. Compounding these figures, the Cortex-A17 delivers roughly 2.4 times the performance of the Cortex-A8, an increase of about 140%.
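The quoted figures compound multiplicatively rather than adding, which a quick check makes explicit:

```python
def compound_speedup(*gains):
    """Chain relative performance gains: a 50% gain followed by a
    60% gain is 1.5 * 1.6 = 2.4x, i.e. a 140% overall increase -
    not 50% + 60% = 110%."""
    ratio = 1.0
    for g in gains:
        ratio *= 1.0 + g
    return ratio
```

So Cortex-A8 -> Cortex-A9 (+50%) -> Cortex-A17 (+60%) yields `compound_speedup(0.5, 0.6)` = 2.4x.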
Fig. 5. Cortex-A17 performance comparison to the Cortex-A9 
This leads to the initial observation that the Cortex-A17 is far more powerful than the Cortex-A8, even though both use the same 32-bit ARMv7-A architecture with the NEON SIMD and the VFP hardware accelerator. Like the Cortex-A8, this core is also very popular in mobile devices, combining high performance with the high efficiency introduced by the Cortex-A8. The Cortex-A17 consists of up to four scalable cores. These cores contain a fully out-of-order pipeline delivering optimal performance for today's premium mobile devices. This is a key difference, since the Cortex-A8 only supports one core, hence the massive speed increase with the Cortex-A17. The decode width of the Cortex-A17 is only one more than the Cortex-A8's, yet that ability to decode one more instruction in parallel creates an improvement without sacrificing efficiency. The pipeline depth of the Cortex-A8 is 13 stages in order, while the Cortex-A17's is 11+ stages out of order. The NEON SIMD datapath for the Cortex-A8 is 64 bits wide, whereas the Cortex-A17's is 128 bits wide, allowing greater parallel processing of data. The Cortex-A17 plays a large role in the big.LITTLE architecture, whereas the Cortex-A8 does not use big.LITTLE at all. The Cortex-A8 does not have a pipelined VFP accelerator, whereas the Cortex-A17 does, which improves performance.
The Cortex-A8 is used in many commercial applications that affect our daily lives. One such application is smartphones, where the Cortex-A8 serves as an application processor running a fully featured mobile OS; the Cortex-A17 is commonly seen in smartphones as well, and also in tablets, unlike the Cortex-A8. The Cortex-A8 is also used in netbooks as a power-efficient main processor running a desktop OS. It is used in set-top boxes as the main processor managing a rich OS, multi-format A/V, and the UI, the same as the Cortex-A17. Both are also used in digital TV applications as the processor managing the rich OS, UI, and browser functions. The Cortex-A8 is used in home networking as a control processor for system management, in storage networking as a control processor to manage traffic flow, and even in printers as a high-performance integrated processor. The Cortex-A17 additionally targets industrial and automotive infotainment, which the Cortex-A8 did not. These are devices that we interact with in our lives, some of them daily. The small size of the core is advantageous because it can fit into small devices such as smartphones, netbooks, TV receivers, and printers. The Cortex-A8's power efficiency is also very advantageous: for small devices with small batteries, it makes a huge difference in battery life per charge. The power of the Cortex-A8 is very useful in many of these applications. With its pipelining abilities and the enhancement from the NEON SIMD and the VFP hardware accelerator, it allows small devices such as smartphones to have impressive processing speed. The Cortex-A8 and the Cortex-A17 are very similar, yet with large performance differences.
The Cortex-A8 is a high-performance processor used in complex systems. It offers:
• A symmetric, superscalar pipeline for full dual-issue capability
• High frequency through an efficient, deep pipeline
• An advanced branch prediction unit with >95% accuracy
• An integrated Level 2 cache for optimal performance in high-performance systems
The Cortex-A8 is designed to handle media processing in software with NEON Technology:
• 128-bit SIMD data engine
• 2x the performance of ARMv6 SIMD
• Power saving through efficient media processing
• Flexibility to handle the media formats of the future
• Easy integration of multiple codecs in software with NEON Technology on the Cortex-A8
• Enhanced user interfaces
The Cortex-A8 boasts many features, but how do they compare to the Cortex-A17's? The
Cortex-A8 features NEON, a 128-bit SIMD engine that enables high-performance media
processing. It also features an optimized Level 1 cache, integrated tightly into the processor with
a single-cycle access time, as well as an integrated Level 2 cache that is built into the core and
provides ease of integration, power efficiency, and optimal performance. The Cortex-A8 also
features Thumb-2 technology, which delivers the peak performance of traditional ARM code
while providing up to a 30% reduction in the memory required to store instructions. Its dynamic
branch predictor, used to minimize wrong-prediction penalties, achieves 95% accuracy across a
wide range of industry benchmarks. A full Memory Management Unit enables the Cortex-A8 to
run rich operating systems in a variety of applications. The processor also includes Jazelle-RCT
technology, a Java-acceleration technology that optimizes Just-in-Time (JIT) and Dynamic
Adaptive Compilation (DAC) and reduces memory footprint by up to three times. Its memory
system is optimized for power efficiency and high performance, and TrustZone technology
allows for secure transactions and Digital Rights Management (DRM). This list of features comes
from the ARM website and the specific product specification pages.

The Cortex-A17 also has a list of specifications on the ARM website, though they differ from the
Cortex-A8's. The two processors share several features, including Thumb-2 technology,
TrustZone technology, NEON, and optimized Level 1 caches. The Cortex-A17 also has an
integrated Level 2 cache controller, with the difference that its size is configurable. The
Cortex-A17 further includes the DSP and SIMD extensions, which increase the DSP processing
capability of ARM solutions in high-performance applications while offering the low power
consumption required by portable, battery-powered devices. Its floating-point unit is a
high-performance FPU with hardware support for half-, single-, and double-precision
floating-point arithmetic. The Cortex-A17 also features hardware virtualization: highly efficient
hardware support for data management and arbitration, whereby multiple software environments
and their applications can simultaneously access the system's capabilities. It also has the Large
Physical Address Extension (LPAE), which enables the processor to address up to 1 TB of
memory. Finally, the Cortex-A17 features the AMBA4 CoreLink CCI-400 Cache Coherent
Interconnect, which provides AMBA4 ACE ports for full coherency between multiple
processors, enabling use cases like big.LITTLE. This list of Cortex-A17 features was also
retrieved from the ARM website, in the Cortex-A17 product specifications section. The
comparison shows how far the ARMv7-A architecture has evolved: the Cortex-A8 is one of the
middle models in the line, whereas the Cortex-A17 is the newest and most powerful processor
ARM produces in this architecture set.
The debug tooling for the Cortex-A8 and the Cortex-A17 is the same.
The ARM DS-5 Development Studio fully supports all ARM processors and IP as well as a wide
range of third party tools, operating systems and EDA ﬂows. DS-5 represents a comprehensive
range of software tools to create, debug and optimize systems based on the Cortex-A8 and Cortex-
A17 processors. This description comes from the Cortex-A17 related-products page but is nearly
identical to the Cortex-A8's. Both processors incorporate the DS-5 Debugger, whose powerful and
intuitive graphical environment enables fast debugging of bare-metal, Linux and Android native
applications. DS-5 Debugger provides pre-deﬁned conﬁgurations for Fixed Virtual Platforms
(built on ARM Fast Models technology) and ARM Versatile Express boards, enabling early
software development before silicon availability. This description is likewise the same for both
the Cortex-A17 and the Cortex-A8.
Both of the Cortex-A17 and the Cortex-A8 use the same family of products for Graphic
Processing. The Mali family of products combines to provide the complete graphics stack for
all embedded graphics needs, enabling device manufacturers and content developers to deliver
the highest quality, cutting edge graphics solutions across the broadest range of consumer devices
. An example is the Mali-400, used alongside the Cortex-A8, which was the world's first OpenGL
ES 2.0 conformant multi-core GPU, providing 2D and 3D acceleration with performance
scalable up to 1080p resolution.
For the Cortex-A8, ARM Physical IP Platforms deliver process-optimized IP for best-in-class
implementations of the processor at 40nm and below. The Cortex-A8 uses the Standard Cell
Logic Libraries, which are available in a variety of architectures; ARM Standard Cell Libraries
support a wide performance range for all types of SoC designs. It also
supports Memory Compilers and Registers: a broad array of silicon-proven SRAM, Register File
and ROM memory compilers for all types of SoC designs ranging from performance critical to
cost sensitive and low power applications. The Cortex-A8 also supports Interface Libraries, a
broad portfolio of silicon-proven Interface IP designed to meet varying system architectures and
standards . The ARM Physical IP Platforms deliver process optimized IP, for best-in-class
implementations of the Cortex-A17 processor at 28nm and below. This is similar to the
Cortex-A8, except for the difference from 40nm to 28nm. A set of high-performance POP
IP containing advanced ARM Physical IP for 28nm technologies supports the Cortex-A17, to
enable rapid development of leadership physical implementations . ARM is uniquely able to
design the optimization packs in parallel with the Cortex-A17 processor, enabling the processor
and physical IP combination to deliver best-in-class performance in the mobile power envelope
while facilitating rapid time-to-market. The physical IP for the Cortex-A17 thus differs from
that of the Cortex-A8 in its use of POP IP.
System IP components are essential for building complex system on chips and by utilizing
System IP components developers can signiﬁcantly reduce development and validation cycles,
saving cost and reducing time to market. The Cortex-A8 uses a different set of System IP
components than the Cortex-A17; the Cortex-A8's set includes:
• Advanced AMBA 3 Interconnect IP using the AXI AMBA Bus.
• Dynamic Memory Controller using the AXI AMBA Bus.
• Adaptive Veriﬁcation IP using the AXI AMBA Bus.
• DMA Controller using the AXI AMBA Bus.
• CoreSight Embedded Debug and Trace using the ATB AMBA Bus. 
The set of tools that the Cortex-A17 uses for System IP are as follows:
• AMBA 4 Cache Coherent Interconnect
– The CCI-400 provides AMBA 4 AXI Coherency Extensions compliant ports for
full coherency for the Cortex-A17 processor and other Cortex processors, better
utilizing caches and simplifying software development. This feature is essential for high
bandwidth applications including future mobile SoCs that require clusters of coherent
processors or GPUs. Combined with other available ARM CoreLink System IP, the
CCI-400 increases system performance and power efﬁciency.
– CoreLink CCI-400 Cache Coherent Interconnect provides system coherency with
Cortex processors and an IO Coherent channel with Mali IP and opens up a number
of possibilities for ofﬂoad and acceleration of tasks. When combined with a Cortex-A7
processor, CCI-400 allows big.LITTLE operation with full L2 cache coherency between
the Cortex-A17 and Cortex-A7 processors.
– Efﬁcient voltage scaling and power management is enabled with the CoreLink ADB-
400 unlocking DVFS control of the Cortex-A17 processor.
• AMBA Generic Interrupt Controller
– AMBA Interrupt Controllers like the GIC-400 provide an efﬁcient implementation of
the ARM Generic Interrupt Speciﬁcation to work in multi-processor systems. They
are highly conﬁgurable to provide the ultimate ﬂexibility in handling a wide range of
interrupt sources that can control a single CPU or multiple CPUs.
• AMBA 4 CoreLink MMU-500
– CoreLink MMU-500 provides a hardware-accelerated common memory view for all
SoC components and minimizes the software overhead of virtual machines, leaving them
free for other system management functions.
• CoreLink TZC-400
– The Cortex-A17 processor implements a secure, optimized path to memory to further
enhance its market leading performance with the aid of CoreLink TZC-400 TrustZone
address space controller.
• CoreLink DMC-400
– All interconnect components and the ARM DMC guarantee bandwidth and latency
requirements by utilizing in-built dynamic QoS mechanisms.
• CoreSight SoC-400
– ARM CoreSight SoC debug and trace hardware is used to proﬁle and optimize the
system software running through-out from driver to OS level.
• Artisan POP IP
– Cortex-A17 processor is supported through advanced Physical POP IP for accelerated
time to market .
These differences show how much more technology is in the Cortex-A17 than in the Cortex-A8,
even though both belong to the same ARM architecture family (ARMv7-A). Differences like
these make clear how flexible these systems are and how much can be done with them, from
media processing to data crunching. They are important to understand because they lay out where
this technology is headed and what changes can be, and are being, made to create more powerful
yet more efficient devices.
The Cortex-A8 is an important example of RISC-based superscalar design. It has many features
that make it a powerful and flexible processor, and the sum of its components results in
increased performance and flexibility. Its instruction pipelining and branch prediction are critical
to ensuring performance efficiency. The NEON SIMD engine possesses a very robust
architecture, including its own instruction pipelines, and introduces a host of new capabilities,
including multimedia and graphics processing. Examination of other ARM processors further
illustrates the Cortex-A8's evolution: the Cortex-A8 belongs to a family consisting of seven other processors.
A comparison with the faster Cortex-A17 demonstrates the high degree of flexibility offered by
the Cortex-A8. This flexibility is critical to the Cortex-A8's success in consumer electronics; the
processor is commercially available in a variety of applications, including mobile devices and
other media products. Studying the ARM Cortex-A8 is critical to understanding the role
superscalar architecture plays in embedded systems.
Williamson, David. "ARM Cortex-A8: A High-Performance Processor for Low-Power Applications." Unique Chips and Systems.
Texas Instruments Wiki. (n.d.). "Cortex-A8 - Texas Instruments Wiki." Retrieved from
ARM - The Architecture for the Digital World. (n.d.). "NEON - ARM." Retrieved from
ARM - The Architecture for the Digital World. (n.d.). "Cortex-A8 Processor - ARM." Retrieved from
ARM. (n.d.). "The ARM Architecture: With a focus on v7A and Cortex-A8." Retrieved from http://www.arm.com/files/pdf/ARM Arch A8.pdf
ARM. (2000). Architecture Reference Manual. ARM DDI E, 100, 6. https://www.scss.tcd.ie/ waldroj/3d1/arm arm.pdf
"Cortex-A17 Processor." (n.d.). Retrieved November 17, 2014, from http://www.arm.com/products/processors/cortex-a/cortex-
"Cortex-A8 Processor." (n.d.). Retrieved November 17, 2014, from http://www.arm.com/products/processors/cortex-a/cortex-
"Cortex-A9 Processor." (n.d.). Retrieved November 17, 2014, from http://www.arm.com/products/processors/cortex-a/cortex-