This document provides an overview of memory systems and caching in ARM processors. It discusses memory hierarchies including tightly coupled memory. It covers concepts like alignment, endianness, memory ordering models, and the virtual memory system architecture (VMSA) used in Cortex-A processors. It describes the memory protection unit (MPU) and how it provides memory protection. It also discusses caching in Cortex-A processors including cache terminology, how data is stored in caches, and an example of a memory access involving the cache.
2. AGENDA
• Memory System Hierarchy
Tightly Coupled Memory
Alignment, Endianness and Ordering
VMSA and PMSA
Caches and Coherency
Barriers and Synchronization
AAETC4v00
Memory Systems 2
3. MEMORY SUBSYSTEM
[Diagram: ARM core with L1 I-cache and D-cache, MMU/MPU, CP15, bus interface unit with write buffer, L2 cache, and AMBA interconnect — an L1/L2/L3 memory hierarchy]
• MMU
• Supports virtual memory, included by all Cortex-A processors
• MPU
• Allows memory protection only, included by all Cortex-R processors
• Some Cortex-M processors support an optional MPU
5. AGENDA
Memory System Hierarchy
• Tightly Coupled Memory
Alignment, Endianness and Ordering
VMSA and PMSA
Caches and Coherency
Barriers and Synchronization
6. WHAT IS TIGHTLY COUPLED MEMORY?
• An alternative approach to caches
– Allows for high performance operation with slow external memory
– Supported on Cortex-R processors
• Fast memory, local to the processor
– Provides high speed performance without accessing the system bus
– A smaller die size penalty compared to equivalent amount of cache
• Appears at fixed locations within the physical memory map
– Code and data can be copied to TCMs by application or library code
– DMA access or an external AXI interface to TCMs are included on some processors
• Can be used for TCM preloading
• Cortex-R4/R5 provides an external AXI slave port for access to TCMs
• Precise real-time performance can be predicted for MPU based cores
– MMU enabled cores have to perform address translation for TCM accesses
• TLB checks will be made and table walks can occur
7. TCM CONFIGURATION
• TCM enabled cores support two interfaces
– Traditionally referred to as I-TCM and D-TCM
– Also referred to as TCM-A and TCM-B, e.g. Cortex-R4
[Diagram: memory map from 0x0, with TCM-A and TCM-B blocks (TCM-B at 0x200000) overlaying external memory]
• Each TCM interface can individually be configured using CP15 operations
• Physical base address (multiple of size)
• Can overlay external memory
• Memory size (depends on core and implementation)
• Enable/Disable
• External pin(s) determines post-reset configuration
• Possible to make system boot from TCM memory
• INITRAM pin(s) enables TCMs during core reset
• LOCZRAM pin allows TCM address selection before reset
• Supported on Cortex-R4
• When enabled TCMs must not overlap
8. AGENDA
Memory System Hierarchy
Tightly Coupled Memory
• Alignment, Endianness and Ordering
VMSA and PMSA
Caches and Coherency
Barriers and Synchronization
9. ALIGNMENT AND ENDIANNESS
• ARMv4/v5 data alignment
– Prior to ARMv6, all hardware data accesses had to be size aligned (for
example, words on word boundaries)
– Unaligned accesses could be caught by hardware
– Unaligned data in software was accessed by a series of aligned memory
accesses
• ARMv6/v7 data alignment
– Data accesses can be unaligned
• Only a sub-set of load/store instructions support unaligned accesses
• Unaligned accesses only allowed to addresses marked as Normal
– The load/store unit will access memory with aligned memory accesses
and make the data available to the CPU
• ARM processors are little-endian
– But can be configured to access big-endian memory systems
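The pre-ARMv6 behavior above can be sketched in Python (an illustrative model, not from the slides): unaligned data is accessed as a series of aligned byte reads and reassembled in software.

```python
# Sketch (illustrative, not ARM code): pre-ARMv6 software accessed
# unaligned data with a series of aligned (byte) accesses, since the
# hardware required all data accesses to be size aligned.
def load_word_unaligned(mem, addr):
    """Assemble a little-endian 32-bit word from four byte reads."""
    return (mem[addr]
            | mem[addr + 1] << 8
            | mem[addr + 2] << 16
            | mem[addr + 3] << 24)

buf = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])
# Offset 1 is not 4-byte aligned, yet the word is read safely:
print(hex(load_word_unaligned(buf, 1)))  # 0x55443322
```

On ARMv6/v7, a single unaligned LDR to Normal memory would achieve the same result; the load/store unit performs the aligned accesses internally.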
10. MEMORY ORDERING MODEL
• The ARM architecture defines a weak ordering model…
… between accesses to Normal memory regions
… between Normal memory and Device memory accesses
• This means that accesses might not occur in program order
• The architecture also allows for speculative accesses
– Data or instructions fetched from memory before being explicitly referenced
– Examples of speculative accesses include:
• Branch prediction
• Out of order data loads
• Speculative cache line fills
• Speculative data accesses are only allowed to Normal memory
• Speculative instruction fetches are allowed to any region not marked as
XN
11. WHY DO I CARE ABOUT ACCESS ORDER?
• In most cases precise access order does not matter
– But sometimes it is necessary to force access ordering
• Examples of when ordering matters:
– Sharing data between different threads/CPUs
• e.g. mail boxes
– Sharing data with peripherals
• e.g. DMA operations
– Modifying instruction memory
• e.g. loading a program into RAM or scatter loading
– Modifying memory management scheme
• e.g. context switching or demand paging
• Where access order is important you may need to use barrier
instructions
• Compilers/assemblers will not automatically insert barriers for you!
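The mailbox case above can be sketched as follows. This is a hypothetical Python illustration: on ARM, the ordering between the data write and the flag write would be enforced with a DMB barrier instruction; here `threading.Event` stands in for that guarantee.

```python
import threading

# Sketch (illustrative, not ARM code): the mailbox pattern. A producer
# writes data, then raises a flag; the consumer must not read the data
# before the flag is visible. On ARM a DMB between the two writes
# enforces this; in Python, Event.set()/wait() provides the ordering.
mailbox = {"data": None}
flag = threading.Event()

def producer():
    mailbox["data"] = 42   # 1. write the message
    flag.set()             # 2. publish the flag (ordered after the write)

def consumer(out):
    flag.wait()                   # blocks until the flag is visible...
    out.append(mailbox["data"])   # ...so the data is guaranteed visible too

result = []
t = threading.Thread(target=consumer, args=(result,))
t.start()
producer()
t.join()
print(result)  # [42]
```

Without the ordering guarantee (on a weakly-ordered machine, with no barrier), the consumer could observe the flag before the data, reading stale contents from the mailbox.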
12. V6/V7 MEMORY TYPE
• In ARMv6/ARMv7 address locations must be described in terms of
a type
• The “type” tells the processor how accesses to that location must
behave
– Memory access ordering rules
– Caching and buffering behavior
– Speculation
• There are three mutually exclusive memory types
– Normal - Data and instructions
– Device - Devices/peripherals
– Strongly-ordered - Device/peripherals, or data used by legacy code
13. ACCESS ORDERING
• In Normal memory, ARM implements a weakly-ordered memory model
– This means that, in the absence of address or data dependencies, accesses may be re-ordered, combined and/or repeated without effect on the system
– Speculative accesses are permitted
• Access ordering
– The table shows the ordering enforced between two memory accesses (A1
and A2) in each type of memory
– “<“ indicates that access A1 must complete before access A2
– Barrier instructions are required to enforce ordering beyond the default
behavior in the table
14. AGENDA
Memory System Hierarchy
Tightly Coupled Memory
Alignment, Endianness and Ordering
• VMSA and PMSA
Caches and Coherency
Barriers and Synchronization
15. VMSA AND PMSA
• Protected Memory System Architecture
– Allows protection of configurable memory regions
– Regions defined as base address and length
– Number of regions available varies between processors
– Protection is on basis of access type and privilege
– Does not support virtual address translation
• Virtual Memory System Architecture
– Implements virtual memory translation
– Supported by all Cortex-A processors
– Uses page tables for translation configuration
– Also implements a full access protection scheme
– Extended to 40-bit physical addressing on latest cores (e.g.
Cortex-A15)
16. MEMORY PROTECTION UNIT
• A Memory Protection Unit (MPU) provides basic memory management
• Allows attributes to be applied to different address regions
• All accesses checked against MPU regions
• Each region has:
• Base address
• Size
• Attributes (e.g. Type)
• Available on:
• ARM1156T2(F)-S
• Cortex-R family
[Diagram: example memory map with MPU regions — region 0: 4GB background, No Access; region 1: 32MB FLASH, Read Only, Normal (Cached), Executable; region 2: 256MB, Read/Write; region 3: 256KB SRAM, Read/Write, Normal (Cached, Bufferable), Executable; Peripherals: Read/Write, Device (Bufferable), Execute Never (XN)]
17. VIRTUAL MEMORY
• Core issues “Virtual Addresses” (VA)
• Memory is accessed using “Physical Addresses” (PA)
• Translation is carried out automatically by Memory Management
Unit (MMU)
• Translation configuration is stored in page tables in external
memory
[Diagram: virtual memory map (Vectors, OS — privileged access, Application Space — user access, Peripherals — uncached; read-only regions marked) translated to a physical memory map of FLASH, RAM and Peripherals]
18. THE MEMORY MANAGEMENT UNIT
[Diagram: ARM core issues virtual addresses to the MMU (TLBs plus table walk unit), which reads translation tables in memory and presents physical addresses to the caches and memory]
• The Memory Management Unit (MMU) handles translation of virtual addresses to
physical addresses
• Provides hardware to read translation tables in memory - called table walking
• CP15 Table Base Registers (TTBR) store physical base addresses of tables
• Translation Look-aside Buffers (TLBs) cache recent translations
• Core can have separate instruction and data TLBs, or a shared unified TLB
• When the MMU is enabled all accesses by the core are passed through it
• MMU will use cached translations from the TLB(s) or perform a table walk
• Translation must occur before cache look-up can complete
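The TLB behavior described above can be modeled in a few lines. This is a hypothetical Python illustration, not ARM hardware: the `walk_tables` callback and the FIFO eviction policy are simplifying assumptions.

```python
# Sketch (illustrative): a TLB as a small cache of recent translations,
# consulted before performing a table walk, assuming 4KB pages.
class TLB:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = {}      # virtual page number -> physical page base
        self.walks = 0         # number of table walks performed

    def translate(self, va, walk_tables):
        vpn = va >> 12         # virtual page number (4KB pages)
        if vpn not in self.entries:
            self.walks += 1    # TLB miss: hardware walks the tables
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))  # simple FIFO evict
            self.entries[vpn] = walk_tables(vpn)
        return self.entries[vpn] | (va & 0xFFF)

    def invalidate_all(self):  # e.g. after the translation tables change
        self.entries.clear()

# Hypothetical mapping standing in for the real table walk:
tlb = TLB()
walk = lambda vpn: (vpn << 12) + 0x80000000
a = tlb.translate(0x00001234, walk)
b = tlb.translate(0x00001FF0, walk)   # same page: served from the TLB
print(hex(a), hex(b), tlb.walks)      # 0x80001234 0x80001ff0 1
```

The `invalidate_all` step mirrors the maintenance requirement covered later: once the tables change, cached translations are stale and must be discarded.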
19. LEVEL ONE PAGE TABLES
[Diagram: Translation Table Base (TTB) points to the first-level table (entry offsets 0x0–0x3FFC); the top bits of the VA index the table to produce the PA]
• Diagram shows a single-level page table
• VA to PA mapping at 1MB resolution
• Translation carried out in a single step
• Page table lookup is done automatically by MMU
• Recent translations are cached in internal TLB
20. LEVEL TWO PAGE TABLES
[Diagram: a first-level table entry points to a second-level table (entry offsets 0x0–0x3FC) whose page-table entries map 4KB pages; the VA indexes each level in turn]
• Second level page table allows mapping at 4KB resolution
• Translation requires two page table look-ups
21. ACCESS PERMISSIONS AND XN
• Access permission determined by AP[2:0] bits in page table
descriptor
AP Privileged User Notes
000 No access No access Permission fault
001 Read/Write No access Privileged mode access
010 Read/Write Read Permission fault on user write
011 Read/Write Read/Write Full access
100 - - Reserved
101 Read No access Privileged mode read only
110 Read Read Permission fault on writes†
111 Read Read Permission fault on writes
• “eXecute Never” (XN) prevents instruction execution from a region
• Speculative instruction fetches are also suppressed
• The core never makes speculative accesses to Device or Strongly Ordered memory
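The AP[2:0] table above can be turned into a small permission checker. This is a Python sketch derived directly from the table; the function name and return convention are illustrative.

```python
# Sketch (illustrative): access rights per AP[2:0], from the table above.
AP_TABLE = {
    0b000: ("no access", "no access"),
    0b001: ("read/write", "no access"),
    0b010: ("read/write", "read"),
    0b011: ("read/write", "read/write"),
    0b101: ("read", "no access"),
    0b110: ("read", "read"),
    0b111: ("read", "read"),
}

def check_access(ap, privileged, is_write):
    """Return True if the access is allowed, False for a permission fault."""
    if ap not in AP_TABLE:            # 0b100 is reserved
        raise ValueError("reserved AP encoding")
    rights = AP_TABLE[ap][0 if privileged else 1]
    if is_write:
        return rights == "read/write"
    return rights != "no access"

print(check_access(0b010, privileged=False, is_write=True))   # False: fault on user write
print(check_access(0b010, privileged=True, is_write=True))    # True
```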
22. MMU CONFIGURATION AND MAINTENANCE
• Enabling the MMU
– The MMU is disabled at reset and is enabled via the SCTLR.M bit
– MMU page tables contain memory type configuration
(Includes shareability, cacheability, bufferability, access permissions etc.)
– All this must be configured before the MMU is enabled
• TLB maintenance
– TLBs cache memory translation information
– Must be invalidated when translation table contents are changed
– May also need invalidation on a context switch
– ASID is provided to minimize this
– TLBs should be invalidated by the startup code on reset
• When the MMU is disabled
– PA = VA i.e. no address translation is performed
– Instruction accesses may be cached (controlled by the SCTLR.I bit)
– Data accesses will not be cached and are all treated as Strongly ordered
– No access permission checks are carried out
23. AGENDA
Memory System Hierarchy
Tightly Coupled Memory
Alignment, Endianness and Ordering
VMSA and PMSA
• Caches and Coherency
Barriers and Synchronization
24. CACHES IN CORTEX-A SERIES PROCESSORS
• Applications processors are usually implemented with two levels of cache
– Separate (Harvard) L1 Instruction and Data caches per core
• Relatively small (typically 32KB), providing fast access inside the L1 subsystem
– A single (unified) L2 cache (integrated or external, depending on the CPU)
• Relatively large (up to 8 MB), with access times slower than L1 memory
accesses
• MMU uses information contained in the translation tables to control which
memory locations are cached
[Diagram: dual-core cluster — each CPU (CPU0, CPU1) with its own MMU, CP15, L1 I-cache and D-cache, sharing a bus interface unit and L2 cache on the AMBA interconnect, alongside SRAM, external DRAM and APB peripherals]
25. CACHE TERMINOLOGY
• You should know the meaning of the following
terms…
– Line
– Way
– Set
– Tag
– Index
– Offset
– Data RAM
– Tag RAM
– Valid and Dirty Bits
AAETC4v00
Memory Systems 25
[Diagram: an address split into Tag | Index | Offset fields; the Index selects a Set across the Ways of the Tag RAM and Data RAM]
26. HOW IS DATA STORED IN MY CACHE?
• Caches handle data in lines (32 or 64 bytes per cache line)
– Physical address used to determine the location of data in cache
• Bottom bits (offset) identify word/byte in line
• Middle bits (index) identify which line
• Top bits (tag) identify remainder of address
• Each line in the cache includes:
– Tag bits from the associated physical address
– Valid bit: indicates whether line exists in the cache
– Dirty bit(s): indicates whether the line (or part of the line) is not coherent with external memory
• To reduce cache contention, ARM caches are “set associative”
– There are multiple possible cache locations (ways) for any given address
– A victim counter decides which cache way will be used for an allocation
– Replacement policy used by victim counter varies by core
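The tag/index/offset split can be sketched as follows. This is an illustrative Python model; the geometry (16-byte lines, 4 sets) is inferred from the example address 0x0000007C worked through on the next slide, not stated on this one.

```python
# Sketch (illustrative): splitting a 32-bit physical address into
# tag/index/offset, assuming 16-byte cache lines and 4 sets.
LINE_BYTES = 16   # offset field: 4 bits
NUM_SETS = 4      # index field: 2 bits

def split_address(addr):
    offset = addr & (LINE_BYTES - 1)          # bottom bits: byte in line
    index = (addr // LINE_BYTES) % NUM_SETS   # middle bits: which set
    tag = addr // (LINE_BYTES * NUM_SETS)     # top bits: remainder
    return tag, index, offset

tag, index, offset = split_address(0x0000007C)
print(tag, index, offset)  # 1 3 12  -> tag ...001, index 0b11, offset 0b1100
```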
27. EXAMPLE MEMORY ACCESS
• Memory Read:
LDR r1,[0x0000007C]
1. Cache Lookup is performed
2. Cache Miss - Tag matches fail for given Index in all Cache Ways
3. Cache Linefill is performed
4. Victim counter specifies which cache Way to use (will evict previous data)
5. Cache returns requested word to the core
[Diagram: the 32-bit address 0x0000007C split as Tag ...001, Index 11, Offset 11 00 (byte), compared against both Ways of a 2-way cache with a victim counter; main memory shown as 16-byte lines from 0x00000000 to 0x00000090]
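The lookup/miss/linefill sequence can be modeled as a toy 2-way set-associative cache. This is an illustrative Python sketch, not ARM hardware: the round-robin victim counter and the memory contents are assumptions for the example.

```python
# Sketch (illustrative): a tiny 2-way set-associative cache with a
# round-robin victim counter, mirroring the numbered steps above.
LINE_BYTES, NUM_SETS, NUM_WAYS = 16, 4, 2

class Cache:
    def __init__(self, memory):
        self.memory = memory                  # backing store: line base -> bytes
        self.sets = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
        self.victim = [0] * NUM_SETS          # per-set round-robin counter
        self.misses = 0

    def read_byte(self, addr):
        offset = addr % LINE_BYTES
        index = (addr // LINE_BYTES) % NUM_SETS
        tag = addr // (LINE_BYTES * NUM_SETS)
        ways = self.sets[index]
        for way in ways:                      # 1. lookup: compare tags in all ways
            if way is not None and way[0] == tag:
                return way[1][offset]         # hit: no external memory access
        self.misses += 1                      # 2. miss: tag match failed everywhere
        line_base = addr - offset
        line = self.memory[line_base]         # 3. linefill from main memory
        way_no = self.victim[index]           # 4. victim counter picks the way
        self.victim[index] = (way_no + 1) % NUM_WAYS
        ways[way_no] = (tag, line)            #    (evicting any previous data)
        return line[offset]                   # 5. requested byte returned to core

mem = {base: bytes(range(base % 256, base % 256 + LINE_BYTES))
       for base in range(0, 0x100, LINE_BYTES)}
cache = Cache(mem)
first = cache.read_byte(0x7C)    # miss, linefill
second = cache.read_byte(0x7C)   # hit, served from the cache
print(first == second, cache.misses)  # True 1
```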
28. CACHE BEHAVIOR
• Cache lookup
– The core checks to see if a memory address is currently in the cache
– A “cache miss” occurs if the data is not found
• The cache may then automatically load the relevant data
• This is called a “cache linefill”
– A “cache hit” occurs if the data is found
• The data is immediately returned to the core
• No external memory access takes place
• Cache Eviction
– In order to make space for new data, existing cache data may have to be
evicted
– In “writeback” mode, dirty data will have to be written back to memory first
• Victim counter
– This is an internal value used to select the data for eviction
29. CACHE MODES AND POLICIES
• Allocation policy
– Controls when new data is loaded into the cache
– A read-allocate policy only allocates new data on a read miss
– A write-allocate policy also allocates on a write miss
• Eviction policy
– Governs the selection of lines for eviction
– A round-robin policy cycles through the lines in a fixed order
– A random policy selects a line at random
• Write-through and Write-back
– Controls what happens when a write operation hits in the cache
– A write-through cache updates external memory in parallel
– A write-back cache does not update external memory until the dirty line is cleaned or evicted
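The write-through/write-back distinction can be sketched as follows. This is an illustrative Python model; the single-address "cache" and the values used are made up for the example.

```python
# Sketch (illustrative): behavior on a write hit under the two policies.
class OneLineCache:
    def __init__(self, memory, write_back):
        self.memory = memory
        self.write_back = write_back
        self.data = dict(memory)       # pretend every address already hits
        self.dirty = set()

    def write(self, addr, value):
        self.data[addr] = value
        if self.write_back:
            self.dirty.add(addr)       # external memory stale until a clean
        else:
            self.memory[addr] = value  # write-through: memory updated in parallel

    def clean(self):                   # write dirty data back (write-back only)
        for addr in self.dirty:
            self.memory[addr] = self.data[addr]
        self.dirty.clear()

mem_wt, mem_wb = {0x40: 0}, {0x40: 0}
wt = OneLineCache(mem_wt, write_back=False)
wb = OneLineCache(mem_wb, write_back=True)
wt.write(0x40, 7)
wb.write(0x40, 7)
print(mem_wt[0x40], mem_wb[0x40])  # 7 0  (write-back memory is stale)
wb.clean()
print(mem_wb[0x40])                # 7
```

The `clean()` step here is the same operation the cache maintenance slide calls "cache clean": writing out dirty data so cache and external memory become coherent.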
30. WHEN SHOULD I ENABLE CACHES?
• Caches are disabled on reset
– Architecturally, caches are not guaranteed to be in a known state at reset
– Need to be invalidated by software on Cortex-A9
– Not required on Cortex-A5/A7/A15
• The L1 instruction cache can be enabled without enabling the MMU
– Many boot loaders will enable the I cache, but not the D cache
• Data caching is only possible once the MMU is enabled
– Appropriate cache policies must be configured in the translation tables
• The L2 cache should generally be enabled with the L1 data cache
– On the Cortex A15 and A7 the L2 (unified) cache is always enabled
• But no lookup occurs unless the L1 D-cache on one of the CPUs in the cluster is
also enabled
– On Cortex-A9 or A5 an external L2 cache (like PL310) is enabled separately
• Via a write to a memory mapped control register
Performance is very poor if instructions are not fetched from cache!
31. CACHE MAINTENANCE OPERATIONS
• Caches require maintenance to ensure that the program always has
access to the correct data
– Cache clean
• Writes out “dirty data” so that external memory and cache are coherent
• Only applicable to write-back caching
– Cache invalidate
• Marks lines as invalid and therefore available for new data
• When is maintenance required?
– Context switches which modify the mapping between address tags and
physical addresses
– When writing self-modifying code, data written via the data cache must become
visible to the instruction cache
• The data cache must be cleaned and the instruction cache invalidated
– If an external engine has modified external memory (e.g. DMA)
• Data cache must be invalidated
32. COHERENCY OPERATIONS
• Type of operation
– Invalidate - clear the Valid bits on the particular cache / branch predictor entry
– Clean - updates the external memory system with Dirty cache line(s)
• Which entries
– All - the entire cache (not available for the data/unified cache)
– MVA - a specific virtual address
– Set/Way - a specific cache line (not available for the branch predictor)
• Scope
– PoC - Point of Coherency (discussed later)
– PoU - Point of Unification (discussed later)
• Inner shareable
– Operations that can be “broadcast”
33. POINT OF UNIFICATION (POU)
(Diagram: a core's instruction cache, data cache and TLB, with the CP15 system control coprocessor; the Point of Unification is marked below them)
• The point at which Instruction, data and TLB accesses see the same copy of memory
• Generally the L2 cache or memory – depends on the system design
34. POINT OF COHERENCY (POC)
(Diagram: two bus masters, Master A and Master B, each with its own caches and CP15 system control coprocessor; the Point of Coherency is marked in the shared memory system below them)
• The point at which all agents see the same copy of memory
• Generally the external memory system – again, very system dependent
35. POU V POC
(Diagrams: two example systems showing where the Point of Unification and Point of Coherency fall relative to the L1 instruction/data caches, the TLB and, in the second system, an L2 cache)
38. SMP VS. AMP
• SMP – Symmetric Multi-Processing
– All tasks share a common view of memory and peripherals
– Tasks can be dynamically shared across multiple CPUs
– Simplifies software development
• Provides increased productivity for programmer
• AMP – Asymmetric Multi-Processing
– Code portability and design flexibility
– Programmer statically assigns tasks to a CPU
– Enables tasks to be isolated from each other
• Each task may have a different view of memory
39. A MULTICORE ARM PROCESSOR
(Diagram labels: two Cortex-A9 processor cores; Snoop Control Unit maintaining L1 cache coherency; Interrupt Distributor; CoreSight debug infrastructure; shared external bus interface; shared architectural peripherals)
40. SNOOP CONTROL UNIT
(Diagram: the Snoop Control Unit maintains L1 cache coherency)
41. SNOOP CONTROL UNIT (SCU)
• The Snoop Control Unit (SCU) maintains coherency between L1 data caches
– Arbitrates accesses to L2 AXI master interface(s), for both instructions and data
– Duplicated Tag RAMs keep track of what data is allocated in each CPU’s cache
• Separate interfaces into L1 data caches for coherency maintenance
• Optionally, can use address filtering
– Directing accesses to configured memory range to AXI Master port 1
(Diagram: CPU0–CPU3, each with its own D$ and I$, connected to the Snoop Control Unit; duplicated TAG RAMs track each CPU's cache contents, with AXI Master 0 and AXI Master 1 ports below)
42. AGENDA
Memory System Hierarchy
Tightly Coupled Memory
Alignment, Endianness and Ordering
VMSA and PMSA
Caches and Coherency
• Barriers and Synchronization
43. SYNCHRONIZATION
• Shared resources in a multi-threaded or multi-processor system need
protection in critical code sections
• Operating Systems provide resources such as spinlocks or mutexes
• Here is an example of a simple spinlock using ARM’s exclusive load
and store instructions
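The spinlock code on the slide is an image and is not reproduced in this transcript. As a rough equivalent, here is a sketch using C11 atomics; on ARMv7 the compiler lowers atomic_exchange to an LDREX/STREX loop, i.e. the exclusive load and store instructions the slide refers to (the type and function names are ours):

```c
#include <stdatomic.h>

/* Minimal spinlock sketch built on C11 atomics. */
typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* Spin until we atomically swap in 1 while observing 0.
     * On ARMv7 this compiles to an LDREX/STREX retry loop. */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) != 0)
        ;  /* busy-wait; a production lock would use WFE/SEV or yield */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The acquire/release orderings stand in for the barrier semantics discussed on the next slides: accesses inside the critical section cannot drift outside it.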
44. BARRIERS
• The ARM architecture includes barrier instructions to force access order
and access completion at a specific point
DMB – Data Memory Barrier
DSB – Data Synchronization Barrier
ISB – Instruction Synchronization Barrier
• This course provides a simple introduction to barriers and their use,
but…
– If you are writing code where ordering is important we recommend also
reading:
– ARM Architecture Reference Manual ARMv7-A/R Edition (Rev C)
• A3.8 Memory access order
• B2.2.9 Ordering of cache and branch predictor maintenance operations
• B3.10.1 TLB maintenance operations and the memory order model
• Appendix G Barrier Litmus Tests
– Includes worked examples
45. DMB VS DSB
• A Data Memory Barrier (DMB) is less restrictive than a Data
Synchronization Barrier (DSB)
• For a DMB:
– No memory accesses after the DMB in program order are started until all
memory accesses before the DMB in program order have been seen by the
rest of the system
• A DSB doesn’t complete until:
– All memory accesses before the DSB in program order have completed, and
– All cache, branch predictor and TLB maintenance operations issued by the
local processor have completed
– Furthermore, no instruction that appears after the DSB in program order can
execute until the DSB completes
• Use a DSB when necessary, but don’t overuse them
46. MAIL BOX EXAMPLE
• P0 – DMB needed to ensure the mail box is updated BEFORE the flag
• P1 – DMB needed to ensure the mail box is read AFTER the flag is seen

P0 – Flag Data As Available
LDR r1, =ADDR_MAILBOX_DATA
LDR r2, =ADDR_MAILBOX_FLAG
; Write a new message into mail box
STR r5, [r1]
DMB
; Set available flag to signal mail box full
MOV r0, #0
STR r0, [r2]

P1 – Wait For Data To Be Available
LDR r1, =ADDR_MAILBOX_DATA
LDR r2, =ADDR_MAILBOX_FLAG
; Wait for flag
loop
LDR r12, [r2]
CMP r12, #0
BNE loop
DMB
; Read message
LDR r0, [r1]
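The same mail box pattern can be sketched in C11, with atomic_thread_fence standing in for DMB (on ARM the compiler emits a DMB for a sequentially consistent fence). Variable and function names are ours; as on the slide, flag == 0 signals that the mail box is full:

```c
#include <stdatomic.h>

static int        mailbox_data;
static atomic_int mailbox_flag = 1;   /* nonzero = empty */

static void producer(void)            /* plays the role of P0 */
{
    mailbox_data = 42;                         /* write message into mail box */
    atomic_thread_fence(memory_order_seq_cst); /* "DMB": message before flag  */
    atomic_store_explicit(&mailbox_flag, 0, memory_order_relaxed);
}

static int consumer(void)             /* plays the role of P1 */
{
    while (atomic_load_explicit(&mailbox_flag, memory_order_relaxed) != 0)
        ;                                      /* wait for flag               */
    atomic_thread_fence(memory_order_seq_cst); /* "DMB": flag before message  */
    return mailbox_data;                       /* read message                */
}
```

Without the fences, nothing stops the message store being reordered after the flag store (or the message load being hoisted above the flag check), which is exactly the failure the slide's DMBs prevent.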
47. ISB
• The ARM architecture defines context as the system settings in CP15
• Context-changing operations include:
– Cache, TLB, and branch predictor maintenance operations
– Changes to system control registers (e.g. SCTLR, TTBCR, TTBRn, CONTEXTIDR)
• The effect of a context-changing operation is only guaranteed to be seen
after a context synchronization event
– Taking an exception
– Returning from an exception
– Instruction Synchronization Barrier (ISB)
• An ISB flushes the pipeline, and re-fetches the instructions from the cache
(or memory)
– Guarantees that effects of any completed context-changing operation before the
ISB are visible to any instruction after the barrier
– Also guarantees that context-changing operations after the ISB instruction only
take effect after the ISB has been executed
48. CP15 EXAMPLE
• To enable FPU/NEON you must first enable access to cp10 and cp11
– This is done by writing to the Coprocessor Access Control Register (CPACR)
MRC p15, 0, r1, c1, c0, 2 ; Read CPACR into r1
ORR r1, r1, #(0xf << 20)  ; Enable full access for cp10 & cp11
MCR p15, 0, r1, c1, c0, 2 ; Write back into CPACR
ISB
MOV r0, #0x40000000
VMSR FPEXC, r0            ; Enable FPU and NEON
• Without the ISB, the processor could already have decoded the VMSR as an
Undefined Instruction (and taken the exception) by the time the MCR executes
– The ISB ensures the update to the CPACR is seen by the processor before it
decodes the VMSR
49. SELF-MODIFYING CODE (1)
P0 loads a new program into memory, which then gets executed by P0 and P1
P0
STR r11, [r1] ; Save instruction to program memory
DCCMVAU r1 ; clean D-$ so instruction visible to I-$
DSB ; ensure clean completes on all CPUs
ICIMVAU r1 ; discard stale data from I-$ …
BPIMVA r1 ; … and from Branch Predictor
DSB ; ensure I-$/BP invalidates complete for all
STR r0, [r2] ; set flag == 1 to signal completion
ISB ; synchronize context on this processor
MOV pc, r1 ; branch to new code
P1-Pn
WAIT ([r2] == 1) ; wait for flag signaling completion
; no DSB required here
ISB ; synchronize context on this processor
MOV pc, r1 ; execute newly saved instruction