Here are a few things you could try to address the increased executable size and performance impact on the CPU cache:
1. Recompile the executables to only use 64-bit pointers where needed, and use 32-bit pointers elsewhere to reduce the overall size.
2. Tune compiler optimization settings (for example, optimizing for size) to pack instructions and data more tightly and improve cache utilization.
3. Consider using position independent code (PIC) to allow sharing of common code segments between processes to reduce duplicated code.
4. Profile the applications to identify hot spots and optimize those sections first, such as improving data locality.
5. Consider using link-time optimizations (LTO) to better optimize across compilation units.
6. Upgrade CPU/
2. A Class in Eight Sections
Introduction, history, computers and CPUs
Memory
Operating systems and process basics
Responder training (Kent – 3 sessions)
Approach to forensic analysis
Case study – stepping through real malware
3. History
Hacking has always followed invention
1876 - Bell demonstrates the telephone
1878 - teenagers try to take it apart
~1971 - phone phreaking starts, hacking follows
1974 - unknown 15-yo teenager acquires
privileged access to CSUS computers
- To glimpse the future, you must
understand the path that led to the present.
4. Some Numbers
2015 - $3.3B was invested in 229 startups
2017 – 780K jobs with 350K openings
2021 – 3.5million job openings (estimated)
Roughly ~250,000 unique pieces of Windows
malware appear every day
Cyber security will be a growth industry
because there is too much money in it for all
involved
5. Two Possible Futures
1. All the “bad guys” decide “it’s
just too much trouble and
give up”
2. They just keep coming and
getting more sophisticated
7. Numerology
8-bits think 256 (or 0x100)
16-bits think 64K (or 0x10000)
32-bits think 4G (or 0x100000000)
20-bits think 1M (or 0x100000)
All numbering systems start at 0
Only difference between signed and
unsigned values is semantics
1M is 1048576 not 1,000,000
Know hex like you have 16 fingers
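The sizes above can be checked with a minimal Python sketch. The helper name `as_signed` is only illustrative; it shows that the same bit pattern reads as 255 unsigned or -1 signed, so only the interpretation differs.

```python
# Number of distinct values for a given bit width.
for bits in (8, 16, 20, 32):
    print(f"{bits}-bits: {2**bits} ({hex(2**bits)})")

def as_signed(value, bits):
    """Reinterpret an unsigned bit pattern as two's-complement signed."""
    sign_bit = 1 << (bits - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)

pattern = 0xFF                       # one 8-bit pattern...
unsigned_view = pattern              # ...read as unsigned
signed_view = as_signed(pattern, 8)  # ...read as signed
print(unsigned_view, signed_view)
```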
9. The First Days (sorta)
CPU dealt with 8-bits at a time
Address was 16-bits, so <= 64Kbytes
Bus supported 16-bit address, 8-bit data
I/O was completely separate operation
I/O address was 8-bits
4MHz bus clock
Some manufacturers attached those signals
to a connector called a bus
S-100,Apple,STD,SS-50, etc
11. The Second Days
CPU dealt with 8-bits at a time
8088, 4-byte prefetch queue, really an 8-bit processor
Address = 20-bits so 1M maximum
Bus supported 20-bit address, 8-bit data
I/O was 16-bit address and 16-bit data
First bus masters appeared
6MHz bus clock
12. x86 Not Orthogonal
Orthogonal means that any register can
be used for any operation
Not orthogonal means that registers
have specific tasks that the other
registers cannot perform
13. 16-bit Registers
AX – fastest, used in most opcodes
BX – pointer, used in some opcodes
CX – counter, used in some opcodes
DX – sometimes extension of AX
32-bit number was placed in DX:AX with DX
being the most significant 16-bits and AX
being the least significant 16-bits
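The DX:AX pairing can be sketched in Python as plain shifting and masking. The function names are hypothetical and only model the arithmetic, not the CPU itself.

```python
def combine_dx_ax(dx, ax):
    """Build a 32-bit value from DX (high word) and AX (low word)."""
    return ((dx & 0xFFFF) << 16) | (ax & 0xFFFF)

def split_to_dx_ax(value):
    """Split a 32-bit value into the pair (DX, AX)."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF

print(hex(combine_dx_ax(0x1234, 0x5678)))
```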
14. More 16-bit Registers
DI – general purpose & destination pointer
SI – general purpose & source pointer
SP – stack pointer
BP – general purpose, pointer & used for
stack frame
F – flags, directly used with stack or AH
The difference
AX, BX, CX, DX have one byte subregisters
AH/AL, BH/BL, CH/CL, DH/DL
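As a rough illustration of the one-byte subregisters, the Python sketch below treats AX as a 16-bit integer whose high byte is AH and low byte is AL. The helper names are made up for this example.

```python
def split_ax(ax):
    """Return (AH, AL): the high and low bytes of a 16-bit AX value."""
    return (ax >> 8) & 0xFF, ax & 0xFF

def set_al(ax, al):
    """Writing AL changes only the low byte; AH is left untouched."""
    return (ax & 0xFF00) | (al & 0xFF)

ah, al = split_ax(0x1234)
print(hex(ah), hex(al), hex(set_al(0x1234, 0xFF)))
```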
15. The Opcode
The opcode is a set of numbers that
tell the CPU what to do
0x41 means add 1 to register CX
0x6B 0xC9 0x05 means CX = CX * 5
Think of the opcode as a verb (action)
Think of memory and registers as nouns
The opcode operates on nouns
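The two byte sequences above can be decoded by hand. The Python sketch below handles only those two encodings and assumes 16-bit operand size (in 64-bit mode, bytes 0x40 to 0x4F are REX prefixes instead). The `decode` function and `REG16` table are illustrative names, not part of any real disassembler.

```python
# Register numbering used inside x86 instruction encodings.
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]

def decode(code):
    """Decode the two example encodings (16-bit operand size assumed)."""
    op = code[0]
    if 0x40 <= op <= 0x47:
        # INC r16: the register number is in the low 3 bits of the opcode.
        return f"inc {REG16[op & 7]}"
    if op == 0x6B:
        # IMUL r16, r/m16, imm8: ModRM byte, then a sign-extended 8-bit value.
        modrm = code[1]
        mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
        if mod != 0b11:
            raise NotImplementedError("memory operands not handled in this sketch")
        imm = code[2] - 256 if code[2] >= 0x80 else code[2]
        return f"imul {REG16[reg]}, {REG16[rm]}, {imm}"
    raise NotImplementedError(hex(op))

print(decode(bytes([0x41])))
print(decode(bytes([0x6B, 0xC9, 0x05])))
```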
16. Opcode Structure
All assembly language follows:
<opcode> <v1> [,<v2> [,<v3> […]]]
or
verb noun1, noun2, …
Opcodes have a target, explicit/implied
Opcodes can have 0 to many sources
17. Opcode Targets
Implicit: something in the CPU
SAHF, CLI, HLT
Explicit: register, memory
mov ax, 3
mov [memory_variable], dx
18. Opcode Sources
Implicit: something in the CPU
LAHF – load AH with the flags
PUSHF – store the flags on the stack
Explicit: register, memory, value
mov ax, bx
mov cx, [some_memory_variable]
mov dx, 45
19. x86 Op Codes
x86 currently has 981 unique opcodes
Compilers use ~25 opcodes 99.9% of
the time
Assembly language is like any other, just
think in smaller steps
Ones you should know:
mov, push, pop, jmp(s), call, cmp, add, sub
or, and, xor, inc, dec, test, shl, shr, ror, rol
and the ones that look like them
20. A Quick Opcode Eye Chart
mov : copies data
push/pop : stack in and out
jmp/call : goto or a function call
cmp : compares two values
add/sub/mul/div : math operators
and/or/xor/not : logical operators
inc/dec : ++ and --
shl/shr/ror/rol : bit shifting/rotating
21. Addressing Modes
CPU has to access memory
Addressing modes you should know
Immediate: mov ax, A_VALUE
Direct: mov ax, [memory_location]
Indirect: mov ax, [bx]
Indirect+offset: mov ax, [bx + A_VALUE]
Indirect scaled: mov ax, [ebx*4] (32-bit registers only)
Combined: mov ax, [ebx*4 + A_VALUE]
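All of the memory addressing modes reduce to one formula: base + index * scale + displacement. The Python sketch below models just that arithmetic; the function and parameter names are assumptions for illustration. Scale factors are only available with 32-bit addressing (386 and later), since 16-bit addressing has no scale field.

```python
def effective_address(base=0, index=0, scale=1, disp=0, bits=16):
    """offset = base + index*scale + displacement, wrapped to the address size."""
    return (base + index * scale + disp) & ((1 << bits) - 1)

# Indirect+offset, like [bx + A_VALUE] with bx = 0x1000 and A_VALUE = 0x10
print(hex(effective_address(base=0x1000, disp=0x10)))
# Scaled index plus displacement (32-bit addressing), index register = 3
print(hex(effective_address(index=3, scale=4, disp=0x100, bits=32)))
```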
22. Segment Registers
Used to reference memory in 16-byte units
(e.g. segment 2 starts at address 32)
CS – code segment (ip)
DS – data segment (bx, si, di)
SS – stack segment (sp, bp)
ES – extra segment (di for string
opcodes)
23. How are Segments Used?
[Diagram: memory drawn as 16-byte segments numbered 0, 1, 2, 3, 4, 5, … FFFB, FFFC, FFFD, FFFE, FFFF, with address labels 0x00000, 0x00060, … 0xFFFB0, 0xFFFF0]
1 megabyte of memory is divided
into 64K segments of 16-bytes each
DS == 0x0002
DS:0x0037 is address 0x00057
0x0020 from DS being 2
+ 0x0037
= 0x00057
So, (segment number * 16) + offset
is the physical address.
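The segment arithmetic for this slide can be written as a one-line Python function. This is a minimal sketch with a hypothetical function name; it does not model what the hardware does when the sum exceeds the 1 MB address space.

```python
def physical_address(segment, offset):
    """Real-mode address: (segment * 16) + offset, both 16-bit values."""
    return ((segment & 0xFFFF) << 4) + (offset & 0xFFFF)

# The example from the slide: DS == 0x0002, offset 0x0037
print(hex(physical_address(0x0002, 0x0037)))
# Different segment:offset pairs can name the same physical byte
print(physical_address(0x0000, 0x0057) == physical_address(0x0002, 0x0037))
```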
24. Segment Overrides
Normally pointer registers use certain
segments
DS – data segment (bx, si, di)
An override can be used to have a
pointer use another segment instead
es:[bx] means use ES not DS
26. How Do CPUs Store Data
0x12345678 stored in memory at byte offsets +0 to +3:
Little Endian (Intel, Arm): +0 = 0x78, +1 = 0x56, +2 = 0x34, +3 = 0x12
Big Endian (Motorola, PowerPC, Arm): +0 = 0x12, +1 = 0x34, +2 = 0x56, +3 = 0x78
Most modern embedded CPUs allow you to choose the endianness
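The byte orders above can be reproduced with Python's standard `struct` module, where `<` selects little endian and `>` selects big endian. This is only a sketch of the layout, using the same example value.

```python
import struct

value = 0x12345678
little = struct.pack("<I", value)  # least significant byte at offset +0
big = struct.pack(">I", value)     # most significant byte at offset +0

for name, data in (("little", little), ("big", big)):
    print(name, [hex(b) for b in data])

# Reading big-endian bytes as little endian scrambles the value.
misread = struct.unpack("<I", big)[0]
print(hex(misread))
```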
28. 32-bit Land
CPU dealt with 16 or 32-bits at a time
Address was 32-bits
I/O was 16-bit address and 16-bit data
Registers became more orthogonal
Real, protected and V86 modes
real mode:16-bit, protected mode:32-bit
i386 had cache controller but no cache
I never saw a single system with one installed
29. Register Name Changes
AX -> EAX
BX -> EBX
CX -> ECX
DX -> EDX
DI -> EDI
SI -> ESI
SP -> ESP
BP -> EBP
Well that’s exciting!
30. New Segment Registers
CS – code segment (eip)
DS – data segment
(eax,ebx,ecx,edx,esi,edi)
SS – stack segment (esp, ebp)
ES – extra segment (edi for strings)
FS - ??? eff segment?
GS - ??? gee segment?
32. Answer
They are no longer used for 16-byte
segments
They have new properties that define
where in physical memory they start
They provide the first taste of virtual
memory
33. 32-bit Segment Register Usage
[Diagram: memory with a data area and a code area]
DS describes address and size of data area
CS describes address and size of code area
34. 32-bit Segment Register Usage
[Diagram: physical memory (PHYSMEM) mapped to virtual memory (VMEM)]
CS: VMEM code location 0 is here
DS: VMEM data location 0 is here
36. New Term: Superscaling
Superscaling allows a CPU to process
two opcodes in a single cycle
If a CPU could process two opcodes in a
cycle, then it needed to fetch opcodes
twice as fast
The opcodes can’t be dependent upon
each other
Leads to interesting opcode placement
by compilers
37. Why Wasn’t DRAM Good Enough?
[Diagram: CPU sends a byte address to DRAM (get a byte); some time later, DRAM returns the byte to the CPU]
38. Superscaling Led to Caching
In order to make simultaneous opcode
execution viable, a larger prefetch was
required (e.g. caching)
First showed up in the i486 for certain
pairs of opcodes
39. Caches
Very fast, expensive static RAM built into
the CPU
Must operate at twice the speed of the CPU
Different layers, L1, L2, maybe even L3
Each layer is faster than the one after it
L1 faster than L2 faster than L3, etc
41. Caching Led to Page Mode DRAM
Full cache lines pulled in from RAM
rather than single words
Addressing by cache lines reduced the
number of pins required for DDR
43. Faster Systems -> Faster Bus
PCI – 32-bit open specification
Microchannel – 32-bit IBM proprietary
Both attempted to become the true
standard. PCI was free and
Microchannel cost $1000’s to license
44. PCI Bus
32-bit physical addressing
32-bit data
Designed to support multiple masters
I/O mapped addressing -> memory mapped
33MHz bus clock (133Mbyte/s peak throughput)
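The throughput figure follows from clock rate times bus width. The Python sketch below works it out, assuming the nominal 33.33 MHz PCI clock and one 32-bit transfer per clock; the function name is illustrative, and real buses deliver less than this peak.

```python
def peak_throughput_mb_per_s(clock_mhz, bus_width_bits):
    """Peak rate if every clock cycle moves one full bus width of data."""
    return clock_mhz * bus_width_bits / 8

print(round(peak_throughput_mb_per_s(33.33, 32)))
```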
45. Bus Masters
Virtually all PCI devices are bus masters
Effectively a separate computer
No access to the CPU’s cache
47. PCI Led to Memory Structure
Bus masters operate on RAM directly
CPU and PCI accessing same thing is bad
news
Bus master buffers are cache line aligned
Bus master structures are aligned as well
PCI has 32-bit addressing limit so < 4GByte
PCI only deals with physical addressing so
there is no security
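The alignment and addressing constraints above can be sketched as a few checks a driver might make on a buffer. The 64-byte cache line size is an assumption (it varies by CPU), and all names here are hypothetical.

```python
CACHE_LINE = 64  # assumed cache line size in bytes; varies by CPU

def align_up(addr, alignment=CACHE_LINE):
    """Round addr up to the next multiple of alignment (a power of two)."""
    return (addr + alignment - 1) & ~(alignment - 1)

def is_aligned(addr, alignment=CACHE_LINE):
    return addr & (alignment - 1) == 0

def fits_below_4g(addr, size):
    """True if the whole buffer is reachable with 32-bit physical addresses."""
    return addr + size <= 1 << 32

buf = align_up(0x1001)
print(hex(buf), is_aligned(buf), fits_below_4g(buf, 4096))
```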
48. Memory Contention
CPU Core
L1 Cache
L2 Cache
DDR
Internal Bus
PCI Device
Drivers understand this problem
and structure themselves
accordingly.
49. PCI Issues
Parallel interface has several pins
Speed of light becomes a factor when
multiple high speed signals need to
reach their goal at the same time
At high speed, a trace becomes a
memory device
50. PCIe
High speed serial interface
Far fewer pins
Full 64-bit address range
Version 1, 2.5GT/s per lane
Version 2, 5GT/s per lane
Version 3, 8GT/s per lane
etc
51. Legacy
64-bit addressing, but structures still stay
below 4G
Still deals with physical memory addresses
Has no security
52. 64-bit
rax, rbx, rcx, rdx, rdi, rsi, rbp, rsp
Plus r8 – r15
Virtual address range from 256TB to 16PB
Physical address range from 1TB (40 bits) to
256 TB (48 bits)
For the remainder of this series, I’ll refer to the
32-bit registers, but all can be 64-bit extended
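The address range figures follow directly from the number of address bits. A small Python sketch, using binary units (1 KB = 1024 bytes) and an illustrative function name:

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]

def address_space(bits):
    """Size of the address space reachable with the given number of bits."""
    size, unit = 1 << bits, 0
    while size >= 1024 and unit < len(UNITS) - 1:
        size //= 1024
        unit += 1
    return f"{size}{UNITS[unit]}"

for bits in (16, 20, 32, 40, 48):
    print(bits, address_space(bits))
```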
54. Protection Rings
Intel has four security rings: 0 – 3
Ring 0 has full access to all opcodes
Ring 3 has limited access to opcodes
and certain memory
Drivers and OS run in ring 0/1
User software runs in ring 3
55. Problem for You to Think About
In a 16-bit x86 computer, a segment
register holds a base address in 16-byte
units. So, ES = 0x1000 would be
based at the memory location 0x10000.
In a system with 1Mbyte of RAM (max
address location 0xFFFFF), what would
happen if you load ES with 0xFFFF and
BX with 0x400 and then execute the
instruction: mov ax, es:[bx]?
56. Real World Problem
You created a 64-bit operating system.
You found that your
executables almost doubled in size. You
found that this also caused the
programs to run slower because the
increased size was a burden on the
CPU cache.
What would you do to fix that?