Overview on iPhone Architecture
Abdelrahman H. Ibrahim Huda M. Aldosari Raed A. Alotaibi
CSE 5095 – Computer Architecture – Spring 2015
University of Connecticut
Storrs, CT, USA
[Emails: abdelrahman@engr.uconn.edu , huda.aldosari@uconn.edu , raed.alotaibi@uconn.edu ]
Abstract
Smartphone architecture generally differs from common desktop
architectures. It is constrained by power, size, and
manufacturing cost, with the goal of providing the best user
experience at minimum cost. As a result, modern microprocessors
for smartphones are designed with an architecture that has
three main components: an application processor that executes
the end user’s applications, a modem that handles baseband
radio activity, and peripheral devices for interacting with
the end user.
In this paper, we discuss the architecture of the application
processor of the Apple iPhone. Specifically, the Apple iPhone
uses processors of the ARM Cortex generation as its core. The
following sections discuss this architecture in terms of
Instruction Set Architecture, Memory Hierarchy, and
Parallelism.
Index Terms – iPhone, Architecture, ARM Cortex
I. Introduction
Apple started a new generation of iPhone with the launch of the
iPhone 5s. This release marks the start of 64-bit architecture in
smartphones. It ships with a custom ARMv8-A-based processor, the
Apple A7. The chip has a 64-bit desktop-class architecture, a modern
instruction set, twice as many general-purpose registers, twice as
many floating-point registers, and over 1 billion transistors [1].
Let’s take a closer look at the ARM Cortex-A7 processor. The
following diagram shows the components of the processor [2].
The chip is designed to meet the energy restrictions of
smartphones. At the same time, it incorporates all the features of
the other high-performance Cortex processors. It includes hardware
virtualization support, the Large Physical Address Extension (LPAE),
the NEON engine (used for SIMD instructions), and a 128-bit AMBA
coherent bus interface, which maximizes the efficiency of data
movement and storage.
II. Parallelism
Multicores:
The Cortex-A7 MPCore processor implements the ARMv7-A
architecture and has one to four processors in a single
multiprocessor device. The following figure shows an example
configuration with four processors [3].
Figure 2 - photo courtesy - infocenter.arm.com
Let’s take a closer look at each processor and how they
coordinate.
Figure 3 - photo courtesy - infocenter.arm.com
The Data Processing Unit (DPU) holds most of the
program-visible state of the processor, such as general-purpose
registers, status registers and control registers. It decodes
and executes instructions, operating on data held in the
registers in accordance with the ARM architecture (discussed in
the next section).
Figure 1 - photo courtesy - www.arm.com
The Prefetch Unit (PFU) obtains instructions from the
instruction cache or from external memory and predicts the
outcome of branches in the instruction stream, then passes
the instruction to the DPU for processing. In any given cycle,
up to a maximum of four instructions can be fetched and two
can be passed to the DPU.
The Snoop Control Unit (SCU) connects up to four processors
within a cluster. It maintains coherency between the individual
data caches in the processor. It also contains buffers that can
handle direct cache-to-cache transfers between processors
without having to read or write any data to the external memory
system.
Pipelining:
Now, let’s take a closer look inside the execution unit. The
Cortex-A7 is an in-order, non-symmetric dual-issue processor
with a pipeline length of between 8 and 10 stages [4].
Figure 4 - photo courtesy - www.arm.com
SIMD Support:
The Cortex-A7 ships with the ARM NEON™ general-purpose SIMD
engine, which efficiently processes current and future
multimedia formats, enhancing the user experience. It can
accelerate multimedia and signal processing algorithms such as
video encode/decode, 2D/3D graphics, gaming, audio and
speech processing, image processing, telephony, and sound
synthesis. It works with its own pipeline and register file.
NEON technology is a 128-bit SIMD (Single Instruction,
Multiple Data) architecture designed to provide flexible and
powerful acceleration of these applications [5].
- Registers are considered as vectors of elements of the
same data type.
- Data types can be signed/unsigned 8-bit, 16-bit, 32-bit,
64-bit or single precision floating-point.
- Instructions perform the same operation in all lanes.
Figure 5 – NEON
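The lane model described in the list above can be sketched in a few lines. This is a plain Python model, not NEON code: the function names are hypothetical, and unsigned wrap-around addition is assumed as the per-lane operation purely for illustration.

```python
# Illustrative model of a 128-bit SIMD register split into equal lanes.
# Names are hypothetical; this does not use any NEON intrinsics.

REGISTER_BITS = 128


def lane_count(element_bits):
    """Number of lanes a 128-bit register holds for a given element width."""
    if element_bits not in (8, 16, 32, 64):
        raise ValueError("unsupported element width")
    return REGISTER_BITS // element_bits


def simd_add(a, b, element_bits):
    """Apply the same add to every lane, wrapping on overflow (unsigned)."""
    lanes = lane_count(element_bits)
    if len(a) != lanes or len(b) != lanes:
        raise ValueError("vector length must match the lane count")
    mask = (1 << element_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]
```

With 8-bit elements the register holds 16 lanes, so a single call to `simd_add` stands in for 16 additions; with 32-bit elements it holds 4.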
In summary, the ARM Cortex-A7 processing module enhances
parallelism through multiple cores, pipelining, and SIMD
support. It provides good performance at lower power
consumption, which suits iPhone users.
III. Instruction Set Architecture (ISA)
The Instruction Set Architecture (ISA) is the part of the
architecture that is visible to the programmer. It includes the
native data types, registers, memory, and addressing modes. Apple,
the manufacturer of the iPhone, relies on the ISA to support the
full functionality of the iPhone operating system. The Instruction
Set Architecture most frequently used in iPhone devices is the ARM
architecture developed by ARM Holdings [6].
ARM is based on a reduced instruction set computing (RISC)
architecture, which means ARM processors typically require fewer
transistors than CISC x86 processors. Using RISC reduces cost,
heat, and power use. In smartphones such as the iPhone, the
ARMv7-A architecture is the most widely used. ARMv7 contains two
main instruction sets, the ARM and Thumb-2 instruction sets.
Thumb-2 introduces 32-bit instructions that are intermixed
with the 16-bit instructions. The Thumb-2 instruction set provides
almost all the functionality of the ARM instruction set. One
difference between the two is that most Thumb-2 instructions are
unconditional, whereas almost all ARM instructions can be
conditional. However, Thumb-2 introduces a new conditional
execution instruction, IT, which acts as a logical if-then-else.
Thumb-2 has performance close to or better than that of the ARM
instruction set and has the code density of the original Thumb ISA.
The ARMv7-A architecture also defines integer divide
instructions. Depending on the implementation, these instructions
might not be implemented, might be implemented only in the Thumb
instruction set, or might be implemented in both the Thumb and ARM
instruction sets. The last case applies when the Virtualization
Extensions are included. The 32-bit Thumb-2 instruction format is
as follows:
  31           16 15            0
  | Half-word 1  | Half-word 2  |
The length and functionality of the instruction are determined by
the first half-word (hw1). If the instruction is decoded as being
32 bits long, the second half-word (hw2) is fetched from the
instruction address plus two.
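As a rough illustration of how hw1 determines the instruction length, the sketch below applies the encoding rule from the ARMv7 architecture manual: a half-word whose top five bits are 0b11101, 0b11110, or 0b11111 is the first half of a 32-bit instruction, and anything else is a complete 16-bit instruction. The function names are chosen here for illustration and are not part of any ARM tool.

```python
def thumb_instruction_length(hw1):
    """Return the length in bytes (2 or 4) of a Thumb-2 instruction from hw1."""
    if not 0 <= hw1 <= 0xFFFF:
        raise ValueError("hw1 must be a 16-bit value")
    top5 = hw1 >> 11  # bits [15:11] of the first half-word
    return 4 if top5 in (0b11101, 0b11110, 0b11111) else 2


def split_instructions(halfwords):
    """Group a stream of half-words into 16-bit and 32-bit instructions."""
    instructions = []
    i = 0
    while i < len(halfwords):
        count = thumb_instruction_length(halfwords[i]) // 2
        # For a 32-bit instruction, hw2 is the half-word at address + 2.
        instructions.append(tuple(halfwords[i:i + count]))
        i += count
    return instructions
```

For example, `0x4770` (the 16-bit `BX LR` encoding) is reported as 2 bytes, while a half-word such as `0xF000` starts a 4-byte instruction, so the half-word that follows it is consumed as hw2 rather than decoded on its own.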
VFP (Vector Floating Point) technology is an FPU coprocessor
extension to the ARM architecture that provides low-cost
floating-point computation. The VFP architecture was intended to
support execution of short "vector mode" instructions, but these
operated on each vector element sequentially and thus did not offer
the performance of true single instruction, multiple data (SIMD)
vector parallelism. This vector mode was therefore removed shortly
after its introduction and replaced with the much more powerful
NEON Advanced SIMD unit.
The Advanced SIMD extension (NEON) is a SIMD instruction set that
provides standardized acceleration for media and signal processing
applications. NEON is able to execute MP3 audio decoding on
CPUs running at 10 MHz. NEON supports 8-, 16-, 32-, and 64-bit
integer data and single-precision (32-bit) floating-point data,
with SIMD operations for handling audio and video processing as
well as graphics and gaming processing. NEON SIMD instructions can
perform up to 16 operations simultaneously. NEON uses the same
floating-point registers as VFP [7].
The Security Extensions, known as TrustZone technology, provide
a low-cost alternative to adding another dedicated security core to
an SoC by providing two virtual processors backed by hardware-based
access control. This feature lets the application core switch
between two states, referred to as worlds, in order to prevent
information from leaking from the more trusted world to the less
trusted world. Each world, whether trusted or untrusted, can
operate independently of the other while using the same core [7].
Figure 6 - Security TrustZone System
IV. Memory Hierarchy
The Cortex-A7 has integrated L1 and L2 caches, which allow
lower transaction latencies and ultimately improve memory system
performance.
Memory close to a processor has very low latency, but is
limited in size and expensive to implement. Further from the
processor it is easier to implement larger blocks of memory but
these have increased latency. To optimize overall performance, an
ARMv7 memory system can include multiple levels of cache in a
hierarchical memory system. The following figure shows such a
system, in an ARMv7-A implementation of a VMSA, supporting
virtual addressing [8].
Figure 7 - Memory Hierarchy
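The trade-off between latency and size can be made concrete with the standard average memory access time calculation, in which each level contributes its hit latency plus its miss rate times the cost of going one level further out. The sketch below is only an illustration: the function name is hypothetical, and any latencies or miss rates passed to it are assumed values, not Cortex-A7 measurements.

```python
def average_access_time(levels, memory_latency):
    """Average access time for a cache hierarchy.

    levels: list of (hit_latency, miss_rate) pairs, ordered from the level
            closest to the processor (L1) outward.
    memory_latency: latency of the external memory behind the last level.
    """
    time = memory_latency
    # Work inward: each level adds its hit latency and pays the outer
    # cost only on the fraction of accesses that miss.
    for hit_latency, miss_rate in reversed(levels):
        time = hit_latency + miss_rate * time
    return time
```

With hypothetical values of a 2-cycle L1 that misses 10% of the time, a 10-cycle L2 that misses 20% of the time, and a 100-cycle external memory, `average_access_time([(2, 0.1), (10, 0.2)], 100)` comes to about 5 cycles, far closer to the L1 latency than to the memory latency.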
Optimized Level-1 Cache:
The performance- and power-optimized L1 caches combine minimal
access latency techniques to maximize performance and minimize
power consumption. The cache size is configurable from 8KB to 64KB
for both instruction and data caches. There is also the option of cache
coherence for enhanced inter-processor communication, or support
of a rich SMP capable OS for simplified multicore software
development [2]. The L1 memory system has a store buffer that
has four 64-bit slots with data merging capability. It handles writes
to Device, Strongly-ordered, Cacheable and Non-cacheable
memory [9]. The L1 instruction memory system has the following
features:
- Instruction side cache line length of 32-bytes
- Virtually indexed and physically tagged instruction
cache.
- Pseudo random cache replacement policy.
- 2-way set-associative instruction cache.
- Support for four sizes of memory page.
- Export of memory attributes for external memory
systems.
- Support for Security Extensions.
- Can be disabled independently, using the system
control coprocessor.
- On a cache miss, critical word first filling of the cache
is performed.
The L1 data memory system has the following features:
- Data side cache line length of 64-bytes.
- Physically indexed and physically tagged data cache.
- Pseudo random cache replacement policy.
- 4-way set-associative data cache.
- Two 32-byte line-fill buffers and one 64-byte eviction
buffer.
- A 4-entry, 64-bit merging store buffer.
- Can be disabled independently, using the system
control coprocessor.
- On a cache miss, critical word first filling of the cache
is performed.
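To make the data cache parameters above concrete, the following sketch derives the number of sets and splits a physical address into tag, index, and offset fields for a 4-way cache with 64-byte lines. The 32KB size used in the usage note is one point in the configurable 8KB to 64KB range, picked here as an assumption, and the function names are illustrative.

```python
def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    sets, remainder = divmod(size_bytes, ways * line_bytes)
    if remainder or sets == 0 or sets & (sets - 1):
        raise ValueError("geometry must give a power-of-two number of sets")
    return sets


def split_address(addr, size_bytes, ways, line_bytes):
    """Split an address into (tag, set index, byte offset within the line).

    Assumes line_bytes is a power of two.
    """
    sets = cache_sets(size_bytes, ways, line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset
```

For a 32KB, 4-way cache with 64-byte lines, `cache_sets(32 * 1024, 4, 64)` gives 128 sets, so an address carries 6 offset bits and 7 index bits, and the remaining upper bits form the tag that is compared against the four ways of the selected set.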
Integrated Level 2 Cache:
The integrated L2 cache provides low-latency, high-bandwidth access
to up to 1MB of cached memory in high-frequency designs, or in
designs that need to reduce the power consumption associated with
off-chip memory access. The L2 cache is optional on the
Cortex-A7 [2].
The L2 memory system consists of:
- An integrated Snoop Control Unit (SCU), connecting up to
four processors within a cluster. The SCU also has
duplicate copies of the L1 data cache directories for
coherency support.
- An optional tightly-coupled L2 cache that includes:
  o Configurable L2 cache size of 128KB,
    256KB, 512KB, or 1MB.
  o Fixed line length of 64 bytes.
  o Physically indexed and tagged cache.
  o 8-way set-associative cache structure.
  o Pseudo-random cache replacement policy.
The L2 memory system has a synchronous abort mechanism and
an asynchronous abort mechanism.
Memory Management:
Memory management in the Cortex-A7 processor is handled by the
ARMv7 Memory Management Unit (MMU), which is controlled by CP15
registers. The MMU works with the L1 and L2 memory systems to
translate virtual addresses to physical addresses, and it also
controls accesses to and from external memory. The memory system
supports virtual addressing, with the MMU performing virtual to
physical address translation in hardware as part of program
execution.
The MMU features in each processor of the multiprocessor device
include the following [10]:
- 10-entry fully-associative micro instruction TLB.
- 10-entry fully-associative micro data TLB.
- 2-way set-associative 256-entry unified main TLB.
- 2-way set-associative 64-entry walk cache.
- 2-way set-associative 64-entry IPA cache.
- The TLB entries include global and application specific
identifiers to prevent context switch TLB flushes.
- Virtual Machine Identifier (VMID) to prevent TLB
flushes on virtual machine switches by the hypervisor.
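The effect of tagging TLB entries with identifiers can be sketched as follows. This is a simplified software model, not the Cortex-A7 hardware: it uses the 2-way, 256-entry geometry listed above (128 sets), while the class name, the set-index function, the oldest-first replacement, and the 4KB page size are all assumptions made for illustration.

```python
PAGE_SHIFT = 12  # assumes 4KB pages


class TaggedTLB:
    """Toy set-associative TLB whose entries carry VMID and ASID tags."""

    def __init__(self, entries=256, ways=2):
        self.ways = ways
        self.num_sets = entries // ways
        self.sets = [[] for _ in range(self.num_sets)]

    def insert(self, vmid, asid, vaddr, paddr, is_global=False):
        vpn = vaddr >> PAGE_SHIFT
        ways = self.sets[vpn % self.num_sets]
        if len(ways) == self.ways:
            ways.pop(0)  # evict the oldest entry; real hardware may differ
        ways.append((vmid, asid, vpn, paddr >> PAGE_SHIFT, is_global))

    def lookup(self, vmid, asid, vaddr):
        """Return the physical address on a hit, or None on a miss."""
        vpn = vaddr >> PAGE_SHIFT
        for e_vmid, e_asid, e_vpn, e_ppn, e_global in self.sets[vpn % self.num_sets]:
            # Global entries match any ASID; the VMID must always match.
            if e_vpn == vpn and e_vmid == vmid and (e_global or e_asid == asid):
                return (e_ppn << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))
        return None
```

Because the identifiers are part of the match, two applications that map the same virtual page to different physical pages can both keep their entries in the TLB. Switching between them changes only the identifier presented on lookup, so no flush is needed.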
V. References
[1] Anupam Saxena, "iPhone 5s and Apple A7’s 64-bit architecture: What does it mean?", Sept. 11, 2013.
http://gadgets.ndtv.com/mobiles/news/iphone-5s-and-apple-a7s-64-bit-architecture-what-does-it-mean-417340
[2] ARM official website
http://www.arm.com/products/processors/cortex-a/cortex-a7.php
[3] Cortex-A7 MPCore
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464f/DDI0464F_cortex_a7_mpcore_r0p5_trm.pdf
[4] big.LITTLE Processing with ARM Cortex-A7
http://www.arm.com/files/downloads/big_LITTLE_Final_Final.pdf
[5] NEON
http://www.arm.com/products/processors/technologies/neon.php
[6] (Lettner, Tschernuth, & Mayrhofer, 2012)
http://www.c-sharpcorner.com/UploadFile/d49768/iphone-operating-system-architecture/
[7] ARM Info Center
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0338g/ch01s02s01.html
[8] ARM® Architecture Reference Manual ARM® v7-A and ARM® v7-R edition
http://www.club.cc.cmu.edu/~mjrosenb/ARM%20v7%20Architecture%20Reference%20Manual.pdf
[9] Cortex-A7 MPCore Technical Reference Manual
http://infocenter.arm.com/help/topic/com.arm.doc.ddi0464d/DDI0464D_cortex_a7_mpcore_r0p3_trm.pdf
[10] ARM Info Center
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0388f/Cihcjfjc.html