www.oeclib.in
Submitted By: Odisha Electronics Control Library
Seminar On Smart Memories
CONTENT
• Packet Processing Workload Challenges
• Solution: Smart Memory
• Introduction to Smart Memory
• Smart Memory Architecture
• Packet Processing Bottlenecks
• Smart Memory
• Advantages
OVERVIEW
1. High Performance Packet Processing Challenges
2. Solution: Smart Memory
3. Smart Memory Architecture
PACKET PROCESSING WORKLOAD CHALLENGES
• Sequential memory references
  - For lookups (L2, L3, L4, and L7)
  - Finite automata traversal
• Read-modify-write
  - Statistics, counters, token-bucket, mutex, etc.
• Pointer and linked-list management
  - Buffer management, packet queues, etc.
• Traditional implementations use
  - Commodity memory to store data
  - NPs and ASICs to process data in memory
Tons of memory references and minimal compute
Performance Barriers:
1. Memory and chip I/O bandwidth
2. Memory latency
3. Lock for atomic access
[Figure: 12 processors (P) and 4 memory devices]
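The token bucket named in the list above is a typical read-modify-write structure: each packet reads the stored state, does a few arithmetic steps, and writes the state back. The following is a minimal sketch under stated assumptions. The class name, the parameter names, and the example rates are illustrative and do not come from the slides.

```python
class TokenBucket:
    """Per-flow policer state (hypothetical layout): tokens plus a timestamp."""

    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes   # bucket starts full
        self.last_time = 0.0

    def allow(self, packet_bytes, now):
        """Refill for the elapsed time, then try to spend tokens."""
        elapsed = now - self.last_time                  # read stored state
        self.tokens = min(self.burst,
                          self.tokens + elapsed * self.rate)  # modify
        self.last_time = now                            # write back
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False


bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
first = bucket.allow(1500, now=0.0)    # uses the whole initial burst
second = bucket.allow(600, now=0.5)    # only 500 tokens refilled so far
third = bucket.allow(600, now=1.0)     # 1000 tokens available by now
```

The arithmetic is tiny compared with the memory traffic it causes, which is the point of the "minimal compute" remark on this slide.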
ILLUSTRATION OF PERFORMANCE BARRIER I
[Figure: processors (P) reach the memory devices through an interconnection network; an IP lookup tree with nodes 0-9 and prefixes P2, P3, P4, P5, and P10 is stored in memory]
• Requires several transactions between memory and processors
• More latency in the interconnect
• Need more processors
• Low IPC
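The IP lookup tree shows why lookups turn into a chain of dependent memory references: the address of the next node is known only after the current node has been fetched. A minimal sketch, assuming a binary trie keyed on address bits; the names `TrieNode`, `insert`, and `lookup` and the example prefixes are hypothetical and are not the tree drawn on the slide.

```python
class TrieNode:
    """One trie node; fetching a node models one memory reference."""

    def __init__(self):
        self.children = [None, None]   # child for bit 0 and bit 1
        self.next_hop = None           # set when a prefix ends here


def insert(root, prefix_bits, next_hop):
    """Add a prefix given as a string of '0'/'1' characters."""
    node = root
    for bit in prefix_bits:
        idx = int(bit)
        if node.children[idx] is None:
            node.children[idx] = TrieNode()
        node = node.children[idx]
    node.next_hop = next_hop


def lookup(root, addr_bits):
    """Longest-prefix match; returns (next_hop, memory_references)."""
    node, best, refs = root, None, 0
    for bit in addr_bits:
        refs += 1                      # fetch the current node
        if node.next_hop is not None:
            best = node.next_hop
        node = node.children[int(bit)]
        if node is None:               # no deeper prefix exists
            return best, refs
    refs += 1                          # fetch the last node reached
    if node.next_hop is not None:
        best = node.next_hop
    return best, refs


root = TrieNode()
insert(root, "0", "A")
insert(root, "011", "B")
insert(root, "1", "C")
result = lookup(root, "0110")          # walks root, 0, 01, 011
```

When the processor drives this loop, every counted reference is a round trip across the interconnect, so latency grows with the depth of the tree.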
ILLUSTRATION OF PERFORMANCE BARRIER II
[Figure: processors (P) reach the memory devices through an interconnection network; the memory holds counters and the enqueue/dequeue lists]
Enqueue:
1. Lock free-list
2. Get free node
3. Unlock free-list
4. Lock list tail
5. Read list tail
6. Link free node
7. Update list tail
8. Unlock list tail
Counter update:
1. Lock counter
2. Read counter
3. Write counter
4. Unlock counter
• Locks are often kept in memory
  - Requires another transaction
  - Adds significant latency
• Single-queue or single-counter operations are extremely slow
• Lookups are read-only, so relatively easy
• Linked lists, counters, policers, etc. are read-modify-write
  - Requires a per-memory-address lock in multi-core systems
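The counter sequence on this slide (lock, read, write, unlock) can be sketched as code that counts how many times the processor has to talk to memory. This is a simplified model, assuming every lock, unlock, read, and write is one transaction over the interconnect; `PlainMemory` and `increment_counter` are hypothetical names.

```python
class PlainMemory:
    """Commodity memory model: each call is one processor-memory transaction."""

    def __init__(self):
        self.cells = {}
        self.locks = set()
        self.transactions = 0

    def lock(self, addr):
        self.transactions += 1
        if addr in self.locks:
            return False               # held elsewhere; caller must retry
        self.locks.add(addr)
        return True

    def unlock(self, addr):
        self.transactions += 1
        self.locks.discard(addr)

    def read(self, addr):
        self.transactions += 1
        return self.cells.get(addr, 0)

    def write(self, addr, value):
        self.transactions += 1
        self.cells[addr] = value


def increment_counter(mem, addr, delta=1):
    """Lock counter, read counter, write counter, unlock counter."""
    if not mem.lock(addr):
        raise RuntimeError("counter is locked by another processor")
    value = mem.read(addr) + delta
    mem.write(addr, value)
    mem.unlock(addr)
    return value


mem = PlainMemory()
increment_counter(mem, 0x10)
increment_counter(mem, 0x10)
```

In this model one increment costs four transactions, and a second processor that wants the same counter has to wait for the unlock, which is why a single hot counter or queue becomes slow.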
SOLUTION: SMART MEMORY
• Attach simple compute with data
• Attach lock with data
• Enable local memory communication
INTRODUCTION TO SMART MEMORY
• What is the real problem?
  - Compute occurs far away from data
  - Lock acquire/release occurs far from data
• Solution: Make memory smarter by:
  - Keeping compute close to data
  - Managing locks close to data
  - Enabling local communication
Fortunately, the compute needed for packet processing jobs is very modest!
[Figure: left, processors (P) connected to plain memory devices through an interconnection network; right, the same processors connected to memory devices that each have compute attached]
Smart Memory Advantages
(Get more out of fewer transactions!)
1. Lower I/O bandwidth
2. Lower processing latency
3. Higher IPC
4. Significantly higher single counter/queue performance
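The "fewer transactions" idea can be sketched as a memory block that accepts whole operations instead of raw reads and writes. This is only an illustrative model, not the actual Smart Memory command set: `SmartMemoryBlock`, `atomic_add`, `enqueue`, and `dequeue` are hypothetical names, and the model assumes the attached engine handles one command at a time, so no separate lock traffic is needed.

```python
from collections import deque


class SmartMemoryBlock:
    """Memory bank with a small compute engine attached (hypothetical model)."""

    def __init__(self):
        self.counters = {}
        self.queues = {}
        self.transactions = 0   # requests that cross the processor interconnect

    def atomic_add(self, addr, delta=1):
        """Read-modify-write performed next to the data in one request."""
        self.transactions += 1
        self.counters[addr] = self.counters.get(addr, 0) + delta
        return self.counters[addr]

    def enqueue(self, queue_id, packet):
        """Node allocation and tail update stay inside the block."""
        self.transactions += 1
        self.queues.setdefault(queue_id, deque()).append(packet)

    def dequeue(self, queue_id):
        """Returns the oldest packet, or None if the queue is empty."""
        self.transactions += 1
        queue = self.queues.get(queue_id)
        return queue.popleft() if queue else None


sm = SmartMemoryBlock()
sm.atomic_add(7)
sm.atomic_add(7, 5)
sm.enqueue("q0", "pkt1")
sm.enqueue("q0", "pkt2")
head = sm.dequeue("q0")
```

Each operation costs the processor one request here, compared with the four-step counter sequence and the eight-step enqueue sequence that the slides describe for commodity memory.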
SMART MEMORY ARCHITECTURE
• Hybrid memory: eDRAM + DDR3 DRAM
• Serial chip I/O
SMART MEMORY CAPACITY AND BANDWIDTH @100G
[Figure: memory bandwidth (billion accesses/packet; axis from .15 to 40) versus memory capacity (MB; axis from 2 to 512+) for basic Layer 2, Layer 2 forwarding, statistics/counters, queuing/scheduling, packet buffer, and video buffer]
• 8 channels of DDR3 DRAM
• 64 banks of eDRAM
• Smart Memory uses intelligent algorithms to split the data structures between them
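The slides do not say which algorithms perform the split. As one possible illustration only, the sketch below places the structures with the most accesses per megabyte into eDRAM until it is full and leaves the rest in DDR3. The function name `split_structures`, the greedy policy, and all numbers in the example are assumptions, not values from the slides.

```python
def split_structures(structures, edram_capacity_mb):
    """Greedy placement by access density (a hypothetical policy).

    structures: list of (name, size_mb, accesses_per_packet) tuples.
    Returns (names placed in eDRAM, names placed in DDR3 DRAM).
    """
    # Highest accesses per MB first, so small hot structures win eDRAM space.
    ranked = sorted(structures, key=lambda s: s[2] / s[1], reverse=True)
    edram, dram, free_mb = [], [], edram_capacity_mb
    for name, size_mb, _ in ranked:
        if size_mb <= free_mb:
            edram.append(name)
            free_mb -= size_mb
        else:
            dram.append(name)
    return edram, dram


# Made-up sizes and access counts, for illustration only.
tables = [
    ("counters", 8, 4.0),
    ("l2_forwarding", 4, 1.0),
    ("packet_buffer", 512, 2.0),
    ("queues", 32, 3.0),
]
in_edram, in_dram = split_structures(tables, edram_capacity_mb=16)
```

With these made-up inputs the small, frequently touched tables land in eDRAM and the large buffers go to DDR3, which matches the general shape of the capacity and bandwidth chart.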
SMART MEMORY HIGH LEVEL ARCHITECTURE
[Figure: a packet processor complex of 16 processors (P) connected to the Smart Memory complex, which holds 16 eDRAM blocks, each with its own SM engine, plus a DRAM SM engine attached to DDR3 DRAM]
• Local interconnect: provides local communication between smart memory blocks
• Global interconnect: provides fair communication between processors and smart memory
Result:
• Split tables into eDRAM and DRAM
• Computation occurs close to memory, reducing latency
• Requires fewer memory transactions
I/O TECHNOLOGY CHOICE IN SMART MEMORY
• Smart Memory reduces the chip I/O bandwidth significantly
• How to further optimize it?
  - The bandwidth, latency, and I/O bandwidth gap is growing (based on MoSys data)
  - On-chip bandwidth is much higher than memory I/O
• Smart Memory uses serial I/O
  - 4X the throughput of RLDRAM and QDR
  - 3X fewer pins than DDR3 and DDR4
  - 2.5X lower I/O power
HIGH SPEED LINE CARD WITH SMART MEMORY
[Figure: a traditional line card with NP, TM, TCM, SRAM, DDR3, CPU, and PHY blocks, among others (DDR3 memory: 10+ DIMMs, 900+ pins; control plane memory; link to switch fabric), next to a line card with one NP, four SMs, PHYs, DDR3, and a CPU; annotation between the two: 2-3 times]

          Traditional Line Card    Line Card with SM
Power     540+ W                   212 W
Area      472+ cm^2                148 cm^2
Cost      5600+ $                  2520 $
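The power, area, and cost figures above can be turned into reduction factors with simple division. The sketch below uses the numbers exactly as printed; since the traditional values carry a "+", the resulting factors are lower bounds. The dictionary keys and the function name are illustrative.

```python
# Figures as printed on the slide; the traditional values are lower bounds ("+").
traditional = {"power_w": 540, "area_cm2": 472, "cost_usd": 5600}
with_sm = {"power_w": 212, "area_cm2": 148, "cost_usd": 2520}


def reduction_factors(before, after):
    """Return how many times smaller each metric is after the change."""
    return {key: before[key] / after[key] for key in before}


factors = reduction_factors(traditional, with_sm)
for metric, factor in factors.items():
    print(f"{metric}: {factor:.2f}x lower")
```

The factors come out at roughly 2.5x for power, 3.2x for area, and 2.2x for cost, which is in line with the "2-3 times" annotation on the slide.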
CONCLUDING REMARKS
• Packet Processing Bottlenecks
  - Data away from compute
  - I/O and memory bandwidth
• Smart Memory
  - Keep compute close to data
  - Keep locking close to data
  - Provide inter-memory connect
• Advantages
  - Reduced chip I/O bandwidth
  - High performance and low latency
  - Feature rich, flexible and programmable
  - Lower cost
  - One chip for several functions
Thank You