This document provides an overview of memory systems and caching in ARM processors. It discusses memory hierarchies including tightly coupled memory. It covers concepts like alignment, endianness, memory ordering models, and the virtual memory system architecture (VMSA) used in Cortex-A processors. It describes the memory protection unit (MPU) and how it provides memory protection. It also discusses caching in Cortex-A processors including cache terminology, how data is stored in caches, and an example of a memory access involving the cache.
2. AGENDA
• Memory System Hierarchy
Tightly Coupled Memory
Alignment, Endianness and Ordering
VMSA and PMSA
Caches and Coherency
Barriers and Synchronization
AAETC4v00
Memory Systems 2
3. MEMORY SUBSYSTEM
[Diagram: ARM core with L1 I-cache and D-cache, MMU/MPU, CP15, bus interface unit with write buffer, L2 cache, and AMBA interconnect — an L1/L2/L3 memory hierarchy]
• MMU
• Supports virtual memory, included by all Cortex-A processors
• MPU
• Allows memory protection only, included by all Cortex-R processors
• Some Cortex-M processors support an optional MPU
5. AGENDA
Memory System Hierarchy
• Tightly Coupled Memory
Alignment, Endianness and Ordering
VMSA and PMSA
Caches and Coherency
Barriers and Synchronization
6. WHAT IS TIGHTLY COUPLED MEMORY?
• An alternative approach to caches
– Allows for high performance operation with slow external memory
– Supported on Cortex-R processors
• Fast memory, local to the processor
– Provides high speed performance without accessing the system bus
– A smaller die size penalty compared to equivalent amount of cache
• Appears at fixed locations within the physical memory map
– Code and data can be copied to TCMs by application or library code
– DMA access or an external AXI interface to TCMs are included on some processors
• Can be used for TCM preloading
• Cortex-R4/R5 provides an external AXI slave port for access to TCMs
• Precise real-time performance can be predicted for MPU based cores
– MMU enabled cores have to perform address translation for TCM accesses
• TLB checks will be made and table walks can occur
7. TCM CONFIGURATION
• TCM enabled cores support two interfaces
– Traditionally referred to as I-TCM and D-TCM
– Also referred to as TCM-A and TCM-B, e.g. Cortex-R4
[Diagram: memory map from 0x0, with TCM-A and TCM-B blocks (TCM-B at 0x200000) overlaying external memory]
• Each TCM interface can individually be configured using CP15 operations
• Physical base address (multiple of size)
• Can overlay external memory
• Memory size (depends on core and implementation)
• Enable/Disable
• External pin(s) determines post-reset configuration
• Possible to make system boot from TCM memory
• INITRAM pin(s) enables TCMs during core reset
• LOCZRAM pin allows TCM address selection before reset
• Supported on Cortex-R4
• When enabled TCMs must not overlap
8. AGENDA
Memory System Hierarchy
Tightly Coupled Memory
• Alignment, Endianness and Ordering
VMSA and PMSA
Caches and Coherency
Barriers and Synchronization
9. ALIGNMENT AND ENDIANNESS
• ARMv4/v5 data alignment
– Prior to ARMv6, all hardware data accesses had to be size aligned (for
example, words on word boundaries)
– Unaligned accesses could be caught by hardware
– Unaligned data in software was accessed by a series of aligned memory
accesses
• ARMv6/v7 data alignment
– Data accesses can be unaligned
• Only a sub-set of load/store instructions support unaligned accesses
• Unaligned accesses only allowed to addresses marked as Normal
– The load/store unit will access memory with aligned memory accesses
and make the data available to the CPU
• ARM processors are little-endian
– But can be configured to access big-endian memory systems
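The pre-ARMv6 behavior above can be sketched in Python (an illustrative model, not from the slides): unaligned data is accessed as a series of aligned byte reads and reassembled in software.

```python
# Sketch (illustrative, not ARM code): pre-ARMv6 software accessed
# unaligned data with a series of aligned (byte) accesses, since the
# hardware required all data accesses to be size aligned.
def load_word_unaligned(mem, addr):
    """Assemble a little-endian 32-bit word from four byte reads."""
    return (mem[addr]
            | mem[addr + 1] << 8
            | mem[addr + 2] << 16
            | mem[addr + 3] << 24)

buf = bytes([0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88])
# Offset 1 is not 4-byte aligned, yet the word is read safely:
print(hex(load_word_unaligned(buf, 1)))  # 0x55443322
```

On ARMv6/v7, a single unaligned LDR to Normal memory would achieve the same result; the load/store unit performs the aligned accesses internally.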
10. MEMORY ORDERING MODEL
• The ARM architecture defines a weak ordering model…
… between accesses to Normal memory regions
… between Normal memory and Device memory accesses
• This means that accesses might not occur in program order
• The architecture also allows for speculative accesses
– Data or instructions fetched from memory before being explicitly referenced
– Examples of speculative accesses include:
• Branch prediction
• Out of order data loads
• Speculative cache line fills
• Speculative data accesses are only allowed to Normal memory
• Speculative instruction fetches are allowed to any region not marked as
XN
11. WHY DO I CARE ABOUT ACCESS ORDER?
• In most cases precise access order does not matter
– But sometimes it is necessary to force access ordering
• Examples of when ordering matters:
– Sharing data between different threads/CPUs
• e.g. mail boxes
– Sharing data with peripherals
• e.g. DMA operations
– Modifying instruction memory
• e.g. loading a program into RAM or scatter loading
– Modifying memory management scheme
• e.g. context switching or demand paging
• Where access order is important you may need to use barrier
instructions
• Compilers/assemblers will not automatically insert barriers for you!
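The mailbox case above can be sketched as follows. This is a hypothetical Python illustration: on ARM, the ordering between the data write and the flag write would be enforced with a DMB barrier instruction; here `threading.Event` stands in for that guarantee.

```python
import threading

# Sketch (illustrative, not ARM code): the mailbox pattern. A producer
# writes data, then raises a flag; the consumer must not read the data
# before the flag is visible. On ARM a DMB between the two writes
# enforces this; in Python, Event.set()/wait() provides the ordering.
mailbox = {"data": None}
flag = threading.Event()

def producer():
    mailbox["data"] = 42   # 1. write the message
    flag.set()             # 2. publish the flag (ordered after the write)

def consumer(out):
    flag.wait()                   # blocks until the flag is visible...
    out.append(mailbox["data"])   # ...so the data is guaranteed visible too

result = []
t = threading.Thread(target=consumer, args=(result,))
t.start()
producer()
t.join()
print(result)  # [42]
```

Without the ordering guarantee (on a weakly-ordered machine, with no barrier), the consumer could observe the flag before the data, reading stale contents from the mailbox.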
12. V6/V7 MEMORY TYPE
• In ARMv6/ARMv7 address locations must be described in terms of
a type
• The “type” tells the processor how accesses to that location must
behave
– Memory access ordering rules
– Caching and buffering behavior
– Speculation
• There are three mutually exclusive memory types
– Normal - Data and instructions
– Device - Devices/peripherals
– Strongly-ordered - Device/peripherals, or data used by legacy code
13. ACCESS ORDERING
• In Normal memory, ARM implements a weakly-ordered memory model
– This means that, in the absence of address or data dependencies, accesses may be re-ordered, combined and/or repeated without effect on the system
– Speculative accesses are permitted
• Access ordering
– The table shows the ordering enforced between two memory accesses (A1
and A2) in each type of memory
– “<“ indicates that access A1 must complete before access A2
– Barrier instructions are required to enforce ordering beyond the default
behavior in the table
14. AGENDA
Memory System Hierarchy
Tightly Coupled Memory
Alignment, Endianness and Ordering
• VMSA and PMSA
Caches and Coherency
Barriers and Synchronization
15. VMSA AND PMSA
• Protected Memory System Architecture
– Allows protection of configurable memory regions
– Regions defined as base address and length
– Number of regions available varies between processors
– Protection is on basis of access type and privilege
– Does not support virtual address translation
• Virtual Memory System Architecture
– Implements virtual memory translation
– Supported by all Cortex-A processors
– Uses page tables for translation configuration
– Also implements a full access protection scheme
– Extended to 40-bit physical addressing on latest cores (e.g.
Cortex-A15)
16. MEMORY PROTECTION UNIT
• A Memory Protection Unit (MPU) provides basic memory management
• Allows attributes to be applied to different address regions
• All accesses checked against MPU regions
• Each region has:
• Base address
• Size
• Attributes (e.g. Type)
• Available on:
• ARM1156T2(F)-S
• Cortex-R family
[Diagram: example memory map with MPU regions — region 0: 4GB background, No Access; region 1: 32MB FLASH, Read Only, Normal (Cached), Executable; region 2: 256MB, Read/Write; region 3: 256KB SRAM, Read/Write, Normal (Cached, Bufferable), Executable; Peripherals: Read/Write, Device (Bufferable), Execute Never (XN)]
17. VIRTUAL MEMORY
• Core issues “Virtual Addresses” (VA)
• Memory is accessed using “Physical Addresses” (PA)
• Translation is carried out automatically by Memory Management
Unit (MMU)
• Translation configuration is stored in page tables in external
memory
[Diagram: virtual memory map (Vectors, OS — privileged access, Application Space — user access, Peripherals — uncached; read-only regions marked) translated to a physical memory map of FLASH, RAM and Peripherals]
18. THE MEMORY MANAGEMENT UNIT
[Diagram: ARM core issues virtual addresses to the MMU (TLBs plus table walk unit), which reads translation tables in memory and presents physical addresses to the caches and memory]
• The Memory Management Unit (MMU) handles translation of virtual addresses to
physical addresses
• Provides hardware to read translation tables in memory - called table walking
• CP15 Table Base Registers (TTBR) store physical base addresses of tables
• Translation Look-aside Buffers (TLBs) cache recent translations
• Core can have separate instruction and data TLBs, or a shared unified TLB
• When the MMU is enabled all accesses by the core are passed through it
• MMU will use cached translations from the TLB(s) or perform a table walk
• Translation must occur before cache look-up can complete
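The TLB behavior described above can be modeled in a few lines. This is a hypothetical Python illustration, not ARM hardware: the `walk_tables` callback and the FIFO eviction policy are simplifying assumptions.

```python
# Sketch (illustrative): a TLB as a small cache of recent translations,
# consulted before performing a table walk, assuming 4KB pages.
class TLB:
    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = {}      # virtual page number -> physical page base
        self.walks = 0         # number of table walks performed

    def translate(self, va, walk_tables):
        vpn = va >> 12         # virtual page number (4KB pages)
        if vpn not in self.entries:
            self.walks += 1    # TLB miss: hardware walks the tables
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))  # simple FIFO evict
            self.entries[vpn] = walk_tables(vpn)
        return self.entries[vpn] | (va & 0xFFF)

    def invalidate_all(self):  # e.g. after the translation tables change
        self.entries.clear()

# Hypothetical mapping standing in for the real table walk:
tlb = TLB()
walk = lambda vpn: (vpn << 12) + 0x80000000
a = tlb.translate(0x00001234, walk)
b = tlb.translate(0x00001FF0, walk)   # same page: served from the TLB
print(hex(a), hex(b), tlb.walks)      # 0x80001234 0x80001ff0 1
```

The `invalidate_all` step mirrors the maintenance requirement covered later: once the tables change, cached translations are stale and must be discarded.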
19. LEVEL ONE PAGE TABLES
[Diagram: Translation Table Base (TTB) points to the first-level table (entry offsets 0x0–0x3FFC); the top bits of the VA index the table to produce the PA]
• Diagram shows a single-level page table
• VA to PA mapping at 1MB resolution
• Translation carried out in a single step
• Page table lookup is done automatically by MMU
• Recent translations are cached in internal TLB
20. LEVEL TWO PAGE TABLES
[Diagram: a first-level table entry points to a second-level table (entry offsets 0x0–0x3FC) whose page-table entries map 4KB pages; the VA indexes each level in turn]
• Second level page table allows mapping at 4KB resolution
• Translation requires two page table look-ups
21. ACCESS PERMISSIONS AND XN
• Access permission determined by AP[2:0] bits in page table
descriptor
AP Privileged User Notes
000 No access No access Permission fault
001 Read/Write No access Privileged mode access
010 Read/Write Read Permission fault on user write
011 Read/Write Read/Write Full access
100 - - Reserved
101 Read No access Privileged mode read only
110 Read Read Permission fault on writes†
111 Read Read Permission fault on writes
• “eXecute Never” (XN) prevents instruction execution from a region
• Speculative instruction fetches are also suppressed
• The core never makes speculative accesses to Device or Strongly Ordered memory
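The AP[2:0] table above can be turned into a small permission checker. This is a Python sketch derived directly from the table; the function name and return convention are illustrative.

```python
# Sketch (illustrative): access rights per AP[2:0], from the table above.
AP_TABLE = {
    0b000: ("no access", "no access"),
    0b001: ("read/write", "no access"),
    0b010: ("read/write", "read"),
    0b011: ("read/write", "read/write"),
    0b101: ("read", "no access"),
    0b110: ("read", "read"),
    0b111: ("read", "read"),
}

def check_access(ap, privileged, is_write):
    """Return True if the access is allowed, False for a permission fault."""
    if ap not in AP_TABLE:            # 0b100 is reserved
        raise ValueError("reserved AP encoding")
    rights = AP_TABLE[ap][0 if privileged else 1]
    if is_write:
        return rights == "read/write"
    return rights != "no access"

print(check_access(0b010, privileged=False, is_write=True))   # False: fault on user write
print(check_access(0b010, privileged=True, is_write=True))    # True
```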
22. MMU CONFIGURATION AND MAINTENANCE
• Enabling the MMU
– The MMU is disabled at reset and is enabled via the SCTLR.M bit
– MMU page tables contain memory type configuration
(Includes shareability, cacheability, bufferability, access permissions etc.)
– All this must be configured before the MMU is enabled
• TLB maintenance
– TLBs cache memory translation information
– Must be invalidated when translation table contents are changed
– May also need invalidation on a context switch
– ASID is provided to minimize this
– TLBs should be invalidated by the startup code on reset
• When the MMU is disabled
– PA = VA i.e. no address translation is performed
– Instruction accesses may be cached (controlled by the SCTLR.I bit)
– Data accesses will not be cached and are all treated as Strongly ordered
– No access permission checks are carried out
23. AGENDA
Memory System Hierarchy
Tightly Coupled Memory
Alignment, Endianness and Ordering
VMSA and PMSA
• Caches and Coherency
Barriers and Synchronization
24. CACHES IN CORTEX-A SERIES PROCESSORS
• Applications processors are usually implemented with two levels of cache
– Separate (Harvard) L1 Instruction and Data caches per core
• Relatively small (typically 32KB), providing fast access inside the L1 subsystem
– A single (unified) L2 cache (integrated or external, depending on the CPU)
• Relatively large (up to 8 MB), with access times slower than L1 memory
accesses
• MMU uses information contained in the translation tables to control which
memory locations are cached
[Diagram: dual-core cluster — each CPU (CPU0, CPU1) with its own MMU, CP15, L1 I-cache and D-cache, sharing a bus interface unit and L2 cache on the AMBA interconnect, alongside SRAM, external DRAM and APB peripherals]
25. CACHE TERMINOLOGY
• You should know the meaning of the following
terms…
– Line
– Way
– Set
– Tag
– Index
– Offset
– Data RAM
– Tag RAM
– Valid and Dirty Bits
AAETC4v00
Memory Systems 25
[Diagram: an address split into Tag | Index | Offset fields; the Index selects a Set across the Ways of the Tag RAM and Data RAM]
26. HOW IS DATA STORED IN MY CACHE?
• Caches handle data in lines (32 or 64 bytes per cache line)
– Physical address used to determine the location of data in cache
• Bottom bits (offset) identify word/byte in line
• Middle bits (index) identify which line
• Top bits (tag) identify remainder of address
• Each line in the cache includes:
– Tag bits from the associated physical address
– Valid bit: indicates whether line exists in the cache
– Dirty bit(s): indicates whether the line (or part of the line) is not coherent with external memory
• To reduce cache contention, ARM caches are “set associative”
– There are multiple possible cache locations (ways) for any given address
– A victim counter decides which cache way will be used for an allocation
– Replacement policy used by victim counter varies by core
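The tag/index/offset split can be sketched as follows. This is an illustrative Python model; the geometry (16-byte lines, 4 sets) is inferred from the example address 0x0000007C worked through on the next slide, not stated on this one.

```python
# Sketch (illustrative): splitting a 32-bit physical address into
# tag/index/offset, assuming 16-byte cache lines and 4 sets.
LINE_BYTES = 16   # offset field: 4 bits
NUM_SETS = 4      # index field: 2 bits

def split_address(addr):
    offset = addr & (LINE_BYTES - 1)          # bottom bits: byte in line
    index = (addr // LINE_BYTES) % NUM_SETS   # middle bits: which set
    tag = addr // (LINE_BYTES * NUM_SETS)     # top bits: remainder
    return tag, index, offset

tag, index, offset = split_address(0x0000007C)
print(tag, index, offset)  # 1 3 12  -> tag ...001, index 0b11, offset 0b1100
```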
27. EXAMPLE MEMORY ACCESS
• Memory Read:
LDR r1,[0x0000007C]
1. Cache Lookup is performed
2. Cache Miss - Tag matches fail for given Index in all Cache Ways
3. Cache Linefill is performed
4. Victim counter specifies which cache Way to use (will evict previous data)
5. Cache returns requested word to the core
[Diagram: the 32-bit address 0x0000007C split as Tag ...001, Index 11, Offset 11 00 (byte), compared against both Ways of a 2-way cache with a victim counter; main memory shown as 16-byte lines from 0x00000000 to 0x00000090]
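The lookup/miss/linefill sequence can be modeled as a toy 2-way set-associative cache. This is an illustrative Python sketch, not ARM hardware: the round-robin victim counter and the memory contents are assumptions for the example.

```python
# Sketch (illustrative): a tiny 2-way set-associative cache with a
# round-robin victim counter, mirroring the numbered steps above.
LINE_BYTES, NUM_SETS, NUM_WAYS = 16, 4, 2

class Cache:
    def __init__(self, memory):
        self.memory = memory                  # backing store: line base -> bytes
        self.sets = [[None] * NUM_WAYS for _ in range(NUM_SETS)]
        self.victim = [0] * NUM_SETS          # per-set round-robin counter
        self.misses = 0

    def read_byte(self, addr):
        offset = addr % LINE_BYTES
        index = (addr // LINE_BYTES) % NUM_SETS
        tag = addr // (LINE_BYTES * NUM_SETS)
        ways = self.sets[index]
        for way in ways:                      # 1. lookup: compare tags in all ways
            if way is not None and way[0] == tag:
                return way[1][offset]         # hit: no external memory access
        self.misses += 1                      # 2. miss: tag match failed everywhere
        line_base = addr - offset
        line = self.memory[line_base]         # 3. linefill from main memory
        way_no = self.victim[index]           # 4. victim counter picks the way
        self.victim[index] = (way_no + 1) % NUM_WAYS
        ways[way_no] = (tag, line)            #    (evicting any previous data)
        return line[offset]                   # 5. requested byte returned to core

mem = {base: bytes(range(base % 256, base % 256 + LINE_BYTES))
       for base in range(0, 0x100, LINE_BYTES)}
cache = Cache(mem)
first = cache.read_byte(0x7C)    # miss, linefill
second = cache.read_byte(0x7C)   # hit, served from the cache
print(first == second, cache.misses)  # True 1
```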
28. CACHE BEHAVIOR
• Cache lookup
– The core checks to see if a memory address is currently in the cache
– A “cache miss” occurs if the data is not found
• The cache may then automatically load the relevant data
• This is called a “cache linefill”
– A “cache hit” occurs if the data is found
• The data is immediately returned to the core
• No external memory access takes place
• Cache Eviction
– In order to make space for new data, existing cache data may have to be
evicted
– In “writeback” mode, dirty data will have to be written back to memory first
• Victim counter
– This is an internal value used to select the data for eviction
29. CACHE MODES AND POLICIES
• Allocation policy
– Controls when new data is loaded into the cache
– A read-allocate policy only allocates new data on a read miss
– A write-allocate policy also allocates on a write miss
• Eviction policy
– Governs the selection of lines for eviction
– A round-robin policy cycles through the lines in a fixed order
– A random policy selects a line at random
• Write-through and Write-back
– Controls what happens when a write operation hits in the cache
– A write-through cache updates external memory in parallel
– A write-back cache does not update external memory until the dirty line is cleaned or evicted
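The write-through/write-back distinction can be sketched as follows. This is an illustrative Python model; the single-address "cache" and the values used are made up for the example.

```python
# Sketch (illustrative): behavior on a write hit under the two policies.
class OneLineCache:
    def __init__(self, memory, write_back):
        self.memory = memory
        self.write_back = write_back
        self.data = dict(memory)       # pretend every address already hits
        self.dirty = set()

    def write(self, addr, value):
        self.data[addr] = value
        if self.write_back:
            self.dirty.add(addr)       # external memory stale until a clean
        else:
            self.memory[addr] = value  # write-through: memory updated in parallel

    def clean(self):                   # write dirty data back (write-back only)
        for addr in self.dirty:
            self.memory[addr] = self.data[addr]
        self.dirty.clear()

mem_wt, mem_wb = {0x40: 0}, {0x40: 0}
wt = OneLineCache(mem_wt, write_back=False)
wb = OneLineCache(mem_wb, write_back=True)
wt.write(0x40, 7)
wb.write(0x40, 7)
print(mem_wt[0x40], mem_wb[0x40])  # 7 0  (write-back memory is stale)
wb.clean()
print(mem_wb[0x40])                # 7
```

The `clean()` step here is the same operation the cache maintenance slide calls "cache clean": writing out dirty data so cache and external memory become coherent.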
30. WHEN SHOULD I ENABLE CACHES?
• Caches are disabled on reset
– Architecturally, caches are not guaranteed to be in a known state at reset
– Need to be invalidated by software on Cortex-A9
– Not required on Cortex-A5/A7/A15
• The L1 instruction cache can be enabled without enabling the MMU
– Many boot loaders will enable the I cache, but not the D cache
• Data caching is only possible once the MMU is enabled
– Appropriate cache policies must be configured in the translation tables
• The L2 cache should generally be enabled with the L1 data cache
– On the Cortex A15 and A7 the L2 (unified) cache is always enabled
• But no lookup occurs unless the L1 D-cache on one of the CPUs in the cluster is
also enabled
– On Cortex-A9 or A5 an external L2 cache (like PL310) is enabled separately
• Via a write to a memory mapped control register
Performance is very poor if instructions are not fetched from cache!
31. CACHE MAINTENANCE OPERATIONS
• Caches require maintenance to ensure that the program always has
access to the correct data
– Cache clean
• Writes out “dirty data” so that external memory and cache are coherent
• Only applicable to write-back caching
– Cache invalidate
• Marks lines as invalid and therefore available for new data
• When is maintenance required?
– Context switches which modify the mapping between address tags and
physical addresses
– When writing self-modifying code, data written via the data cache must become
visible to the instruction cache
• The data cache must be cleaned and the instruction cache invalidated
– If an external engine has modified external memory (e.g. DMA)
• Data cache must be invalidated
32. COHERENCY OPERATIONS
• Type of operation
– Invalidate - clear the Valid bits on the particular cache / branch predictor entry
– Clean - updates the external memory system with Dirty cache line(s)
• Which entries
– All - the entire cache (not available for the data/unified cache)
– MVA - a specific virtual address
– Set/Way - a specific cache line (not available for the branch predictor)
• Scope
– PoC - Point of Coherency (discussed later)
– PoU - Point of Unification (discussed later)
• Inner shareable
– Operations that can be “broadcast”
33. POINT OF UNIFICATION (POU)
(Diagram: a core's instruction cache, data cache and TLB, with the CP15 system control coprocessor; the Point of Unification is marked below them)
• The point at which Instruction, data and TLB accesses see the same copy of memory
• Generally the L2 cache or memory – depends on the system design
34. POINT OF COHERENCY (POC)
(Diagram: two bus masters, Master A and Master B, each with its own caches and CP15 system control coprocessor; the Point of Coherency is marked in the shared memory system below them)
• The point at which all agents see the same copy of memory
• Generally the external memory system – again, very system dependent
35. POU V POC
(Diagrams: two example systems showing where the Point of Unification and Point of Coherency fall relative to the L1 instruction/data caches, the TLB and, in the second system, an L2 cache)
38. SMP VS. AMP
• SMP – Symmetric Multi-Processing
– All tasks share a common view of memory and peripherals
– Tasks can be dynamically shared across multiple CPUs
– Simplifies software development
• Provides increased productivity for programmer
• AMP – Asymmetric Multi-Processing
– Code portability and design flexibility
– Programmer statically assigns tasks to a CPU
– Enables tasks to be isolated from each other
• Each task may have a different view of memory
39. A MULTICORE ARM PROCESSOR
(Diagram labels: two Cortex-A9 processor cores; Snoop Control Unit maintaining L1 cache coherency; Interrupt Distributor; CoreSight debug infrastructure; shared external bus interface; shared architectural peripherals)
40. SNOOP CONTROL UNIT
(Diagram: the Snoop Control Unit maintains L1 cache coherency)
41. SNOOP CONTROL UNIT (SCU)
• The Snoop Control Unit (SCU) maintains coherency between L1 data caches
– Arbitrates accesses to L2 AXI master interface(s), for both instructions and data
– Duplicated Tag RAMs keep track of what data is allocated in each CPU’s cache
• Separate interfaces into L1 data caches for coherency maintenance
• Optionally, can use address filtering
– Directing accesses to configured memory range to AXI Master port 1
(Diagram: CPU0–CPU3, each with its own D$ and I$, connected to the Snoop Control Unit; duplicated TAG RAMs track each CPU's cache contents, with AXI Master 0 and AXI Master 1 ports below)
42. AGENDA
Memory System Hierarchy
Tightly Coupled Memory
Alignment, Endianness and Ordering
VMSA and PMSA
Caches and Coherency
• Barriers and Synchronization
43. SYNCHRONIZATION
• Shared resources in a multi-threaded or multi-processor system need
protection in critical code sections
• Operating Systems provide resources such as spinlocks or mutexes
• Here is an example of a simple spinlock using ARM’s exclusive load
and store instructions
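The spinlock code on the slide is an image and is not reproduced in this transcript. As a rough equivalent, here is a sketch using C11 atomics; on ARMv7 the compiler lowers atomic_exchange to an LDREX/STREX loop, i.e. the exclusive load and store instructions the slide refers to (the type and function names are ours):

```c
#include <stdatomic.h>

/* Minimal spinlock sketch built on C11 atomics. */
typedef struct {
    atomic_int locked;   /* 0 = free, 1 = held */
} spinlock_t;

static void spin_lock(spinlock_t *l)
{
    /* Spin until we atomically swap in 1 while observing 0.
     * On ARMv7 this compiles to an LDREX/STREX retry loop. */
    while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) != 0)
        ;  /* busy-wait; a production lock would use WFE/SEV or yield */
}

static void spin_unlock(spinlock_t *l)
{
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

The acquire/release orderings stand in for the barrier semantics discussed on the next slides: accesses inside the critical section cannot drift outside it.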
44. BARRIERS
• The ARM architecture includes barrier instructions to force access order
and access completion at a specific point
DMB – Data Memory Barrier
DSB – Data Synchronization Barrier
ISB – Instruction Synchronization Barrier
• This course provides a simple introduction to barriers and their use,
but…
– If you are writing code where ordering is important we recommend also
reading:
– ARM Architecture Reference Manual ARMv7-A/R Edition (Rev C)
• A3.8 Memory access order
• B2.2.9 Ordering of cache and branch predictor maintenance operations
• B3.10.1 TLB maintenance operations and the memory order model
• Appendix G Barrier Litmus Tests
– Includes worked examples
45. DMB VS DSB
• A Data Memory Barrier (DMB) is less restrictive than a Data
Synchronization Barrier (DSB)
• For a DMB:
– No memory accesses after the DMB in program order are started until all
memory accesses before the DMB in program order have been seen by the
rest of the system
• A DSB doesn’t complete until:
– All memory accesses before the DSB in program order have completed, and
– All cache, branch predictor and TLB maintenance operations issued by the
local processor have completed
– Furthermore, no instruction that appears after the DSB in program order can
execute until the DSB completes
• Use a DSB when necessary, but don’t overuse them
46. MAIL BOX EXAMPLE
• P0 – DMB needed to ensure the mail box is updated BEFORE the flag
• P1 – DMB needed to ensure the mail box is read AFTER the flag is seen

P0 – Flag Data As Available
LDR r1, =ADDR_MAILBOX_DATA
LDR r2, =ADDR_MAILBOX_FLAG
; Write a new message into mail box
STR r5, [r1]
DMB
; Set available flag to signal mail box full
MOV r0, #0
STR r0, [r2]

P1 – Wait For Data To Be Available
LDR r1, =ADDR_MAILBOX_DATA
LDR r2, =ADDR_MAILBOX_FLAG
; Wait for flag
loop
LDR r12, [r2]
CMP r12, #0
BNE loop
DMB
; Read message
LDR r0, [r1]
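The same mail box pattern can be sketched in C11, with atomic_thread_fence standing in for DMB (on ARM the compiler emits a DMB for a sequentially consistent fence). Variable and function names are ours; as on the slide, flag == 0 signals that the mail box is full:

```c
#include <stdatomic.h>

static int        mailbox_data;
static atomic_int mailbox_flag = 1;   /* nonzero = empty */

static void producer(void)            /* plays the role of P0 */
{
    mailbox_data = 42;                         /* write message into mail box */
    atomic_thread_fence(memory_order_seq_cst); /* "DMB": message before flag  */
    atomic_store_explicit(&mailbox_flag, 0, memory_order_relaxed);
}

static int consumer(void)             /* plays the role of P1 */
{
    while (atomic_load_explicit(&mailbox_flag, memory_order_relaxed) != 0)
        ;                                      /* wait for flag               */
    atomic_thread_fence(memory_order_seq_cst); /* "DMB": flag before message  */
    return mailbox_data;                       /* read message                */
}
```

Without the fences, nothing stops the message store being reordered after the flag store (or the message load being hoisted above the flag check), which is exactly the failure the slide's DMBs prevent.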
47. ISB
• The ARM architecture defines context as the system settings in CP15
• Context-changing operations include:
– Cache, TLB, and branch predictor maintenance operations
– Changes to system control registers (e.g. SCTLR, TTBCR, TTBRn, CONTEXTIDR)
• The effect of a context-changing operation is only guaranteed to be seen
after a context synchronization event
– Taking an exception
– Returning from an exception
– Instruction Synchronization Barrier (ISB)
• An ISB flushes the pipeline, and re-fetches the instructions from the cache
(or memory)
– Guarantees that effects of any completed context-changing operation before the
ISB are visible to any instruction after the barrier
– Also guarantees that context-changing operations after the ISB instruction only
take effect after the ISB has been executed
48. CP15 EXAMPLE
• To enable FPU/NEON you must first enable access to cp10 and cp11
– This is done by writing to the Coprocessor Access Control Register (CPACR)
MRC p15, 0, r1, c1, c0, 2 ; Read CPACR into r1
ORR r1, r1, #(0xf << 20)  ; Enable full access for cp10 & cp11
MCR p15, 0, r1, c1, c0, 2 ; Write back into CPACR
ISB
MOV r0, #0x40000000
VMSR FPEXC, r0            ; Enable FPU and NEON
• Without the ISB, the processor could already have decoded the VMSR as an
Undefined Instruction (and taken the exception) by the time the MCR executes
– The ISB ensures the update to the CPACR is seen by the processor before it
decodes the VMSR
49. SELF-MODIFYING CODE (1)
P0 loads a new program into memory, which then gets executed by P0 and P1
P0
STR r11, [r1] ; Save instruction to program memory
DCCMVAU r1 ; clean D-$ so instruction visible to I-$
DSB ; ensure clean completes on all CPUs
ICIMVAU r1 ; discard stale data from I-$ …
BPIMVA r1 ; … and from Branch Predictor
DSB ; ensure I-$/BP invalidates complete for all
STR r0, [r2] ; set flag == 1 to signal completion
ISB ; synchronize context on this processor
MOV pc, r1 ; branch to new code
P1-Pn
WAIT ([r2] == 1) ; wait for flag signaling completion
; no DSB required here
ISB ; synchronize context on this processor
MOV pc, r1 ; execute newly saved instruction