2. 『 Packet Coalescing and Server Substitution
for Energy-Proportional Operation of
Network Links and Data Servers 』
Mostowfi, Mehrgan, "Packet Coalescing and Server Substitution for Energy-Proportional Operation of
Network Links and Data Servers" (2013). Graduate School Theses and Dissertations.
http://scholarcommons.usf.edu/etd/4732
3. • Distinguishing PKT coalescing from MSG coalescing
• Packet coalescing in Energy-Efficient Ethernet
• Comparing energy consumption across EEE coalescer buffer sizes
5. Message vs. packet
Packet: a unit of data transmission, cut so that it can be carried easily over a network.
Message: a unit of information composed in a language or code suited to delivery over a communication medium, or such a unit as transmitted.
[Naver Knowledge Encyclopedia] packet (Doopedia)
[Naver Knowledge Encyclopedia] message (IT Glossary, Telecommunications Technology Association)
In other words, when a 'message' is sent over a network, it is converted into units called 'packets' (this conversion is called fragmentation).
6. Packet fragmentation via the Transport & Network layers (TCP / UDP / IP)
Packet capacity == 1500
The two cases of fragmentation (conversion into packets):
case 1. strlen(msg) > Capacity
case 2. strlen(msg) < Capacity
7. Case 1. strlen(msg) > Capacity
Packet capacity == 1500; payload per packet = 1500 - socketHeader.
The message is cut into pieces of the required size and each piece is converted into a packet. The receiving side interprets the order of the fragmented packets and reassembles the message.
(packet fragmentation via the Transport & Network layers)
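To make Case 1 concrete, here is a minimal C sketch of the splitting step; the 40-byte header size and the function names are illustrative assumptions, not the actual TCP/IP implementation:

#include <stdio.h>
#include <string.h>

#define CAPACITY 1500                  /* packet capacity from the slide */
#define HEADER   40                    /* assumed header size            */
#define PAYLOAD  (CAPACITY - HEADER)   /* "1500 - socketHeader"          */

/* Case 1: strlen(msg) > Capacity. Cut the message into payload-sized
 * fragments; the receiver reassembles them in order. Returns the
 * number of packets produced. */
int fragment(const char *msg, size_t len)
{
    int n = 0;
    (void)msg;                         /* payload copy omitted in sketch */
    for (size_t off = 0; off < len; off += PAYLOAD, n++) {
        size_t chunk = (len - off < PAYLOAD) ? len - off : PAYLOAD;
        /* a real stack would prepend headers here and hand the
         * fragment to the network layer */
        printf("pkt %d carries bytes %zu..%zu of msg\n",
               n, off, off + chunk - 1);
    }
    return n;
}

int main(void)
{
    char msg[4000];                    /* strlen(msg) > Capacity */
    memset(msg, 'x', sizeof msg);
    printf("%d packets\n", fragment(msg, sizeof msg));
    return 0;
}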
8. Case 2. strlen(msg) < Capacity
The delivered message already fits within the capacity, so it is converted into a single packet as-is (msg -> pkt).
(packet fragmentation via the Transport & Network layers)
9. Case 3. strlen(msg) < Capacity && too many messages
Each of the n small messages becomes its own packet (msg -> pkt), so an unnecessarily large number of packets is created, requiring n conversion steps, and an unnecessarily large number of transmissions is performed, requiring n send operations.
10. Case 3, Solution 1: PKT Coalescing (ex: n=10, msg size=1500 byte)
Instead of sending each of the n packets as it is produced, the NIC collects packets until it has as many as the network card can send at once, then transmits them together in a single burst (with n=10 packets of 1500 B, one 15 KB burst). This is the scheme currently used by EEE.
13. Energy-Efficient Ethernet
Energy-Efficient Ethernet (EEE) is a set of enhancements to the twisted-pair and backplane Ethernet family of computer networking standards that reduces power consumption during periods of low data activity. The goal is to cut power consumption by 50% or more while remaining fully compatible with existing equipment.[1] The IEEE approved the final standard in September 2010.[2] Before the standard was ratified, the name Green Ethernet was used.
[1] Sean Michael Kerner (2009-07-17). "Energy Efficient Ethernet hits standards milestone". Internetnews blog.
[2] "IEEE ratifies new 802.3az standard to reduce network energy footprint" (2010-10-05).
14. Referenced section - Chapter 3: Packet Coalescing for Energy Efficient Ethernet
3.1 An Analytical Energy-Delay Model for a Count-based Packet Coalescer
3.1.1 Energy-Delay Model for Coalescer
3.1.2 Delay Model for Downstream Queue
3.1.3 Numerical Results
3.2 Reducing the Energy Consumption of EEE by Packet Coalescing
3.2.1 Simulation Model of EEE with Packet Coalescing
3.2.2 Experiments
3.2.3 Results
3.2.4 Comparison Between the Analytical Model of Coalescing and the
Simulation Model of EEE with Packet Coalescing
3.3 Extending Savings of Packet Coalescing Beyond Links in Ethernet Switches
3.3.1 Switch Energy Use and Transition Times
3.3.2 The Synchronized Coalescing Method
3.3.2.1 Simple Synchronized Coalescing
3.3.2.2 Adaptive Coalescing
3.3.3 Evaluation by Simulation
3.3.4 Results and Discussion
3.4 Chapter Summary
15. EEE uses a Low Power Idle (LPI) mode to reduce power consumption between packet transmissions. EEE has transition times Ts (wake-to-sleep) and Tw (sleep-to-wake), which are significantly greater than a single packet transmission time for both 1 and 10 Gb/s EEE. By coalescing arriving packets into bursts, the overhead of Ts and Tw can be reduced and nearly energy-proportional operation can be achieved. The trade-off in coalescing is increased packet delay at the sender and, potentially, also in downstream switches or routers.
* EEE: Energy Efficient Ethernet
16. In packet coalescing, a FIFO queue in the Ethernet interface (in the host
NIC and switch or router line card) is used to collect, or coalesce, multiple
packets before sending them on a link as a burst of back-to-back packets.
This FIFO queue is called a coalescer.
Packet coalescing is already used in many high-speed Ethernet interfaces
– mostly on the receive side – to reduce CPU overhead for packet
processing [73]. Packet coalescing can be based on packet count and/or
time from first packet arrival.
In packet coalescing based on packet count (count-based coalescing),
the coalescer collects a certain number of packets before sending them on
the link in a single burst.
In packet coalescing based on time from first packet arrival, the coalescer
sets a timer, called the coalescing timer, to a certain predefined time upon
the arrival of the first packet to an empty coalescer. The timer counts down
to zero. When the timer reaches zero (or expires), the coalescer sends the
packets which are collected in the coalescer on the link.
(1) coalescing triggered by the counter; (2) coalescing triggered by the timer
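A minimal C sketch of these two release policies; the threshold values are the small-coalescer settings quoted on slide 24, and everything else (names, structure) is assumed for illustration:

#include <stdbool.h>
#include <stddef.h>

#define MAX_COUNT  10       /* count threshold (small coalescer, slide 24) */
#define TIMEOUT_US 12.0     /* coalescing-timer value (slide 24)           */

struct coalescer {
    size_t count;           /* packets currently held in the FIFO */
    double timer_us;        /* counts down from TIMEOUT_US        */
    bool   armed;
};

/* On packet arrival: the first packet into an empty coalescer arms the
 * timer; reaching MAX_COUNT triggers a count-based release. */
bool on_arrival(struct coalescer *c)
{
    if (c->count++ == 0) {
        c->timer_us = TIMEOUT_US;
        c->armed = true;
    }
    return c->count >= MAX_COUNT;
}

/* As time advances: expiry of the timer triggers a time-based release. */
bool on_tick(struct coalescer *c, double elapsed_us)
{
    if (!c->armed)
        return false;
    c->timer_us -= elapsed_us;
    return c->timer_us <= 0.0;
}

/* Either trigger sends everything queued as one back-to-back burst. */
void release(struct coalescer *c)
{
    /* send c->count packets onto the link as a burst here */
    c->count = 0;
    c->armed = false;
}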
17. FSM of PKT coalescing: count-based and time-based (simple synchronized coalescing)
18. EEE with Packet Coalescing
CTimer: tracks the packet-coalescing time.
WTimer: tracks the time spent in 'Wakeup'.
STimer: tracks the time spent in 'Sleep'.
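Slides 17-18 can be read as the following link state machine. This C sketch is an assumed rendering: the timer roles are from the slide, while the state names and step function are mine.

#include <stdbool.h>

enum link_state { ACTIVE, GOING_TO_SLEEP, LPI, WAKING_UP };

struct eee_link {
    enum link_state state;
    double stimer_us;   /* STimer: remaining Sleep-transition time  */
    double wtimer_us;   /* WTimer: remaining Wakeup-transition time */
};

#define T_WS_US 4.48    /* transition minimums quoted on slide 24 */
#define T_SW_US 2.88

/* burst_ready is the coalescer's release signal (CTimer expiry or a
 * full count); dt_us is the simulated elapsed time per step. */
void step(struct eee_link *l, bool burst_ready, double dt_us)
{
    switch (l->state) {
    case ACTIVE:                      /* transmitting the burst        */
        if (!burst_ready) {           /* queue drained: start sleeping */
            l->state = GOING_TO_SLEEP;
            l->stimer_us = T_WS_US;
        }
        break;
    case GOING_TO_SLEEP:              /* paying the sleep overhead     */
        if ((l->stimer_us -= dt_us) <= 0.0)
            l->state = LPI;
        break;
    case LPI:                         /* low-power idle                */
        if (burst_ready) {            /* coalescer says: wake up       */
            l->state = WAKING_UP;
            l->wtimer_us = T_SW_US;
        }
        break;
    case WAKING_UP:                   /* paying the wake overhead      */
        if ((l->wtimer_us -= dt_us) <= 0.0)
            l->state = ACTIVE;
        break;
    }
}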
21. Ps: power consumption in the LPI mode
Pa: power consumption during Active mode
tLPI: time spent in the LPI mode
tws: sleep time (needed to enter the low-power mode)
tsw: wake-up time (required to exit the low-power mode)
Power-consumption formula
* Source: 『 Performance Evaluation of Energy Efficient Ethernet 』, P. Reviriego, J. A. Hernández, D. Larrabeiti, and J. A. Maestro, IEEE Communications Letters, vol. 13, no. 9, September 2009
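The formula itself appears to have been an image in the original slide. A plausible reconstruction from the variables above, consistent with the assumptions on the next slide (transitions run at full power; t_active, the time spent transmitting, is my notation, not the slide's):

\bar{P} \;=\; \frac{P_a\,(t_{\mathrm{active}} + t_{ws} + t_{sw}) \;+\; P_s\,t_{LPI}}{t_{\mathrm{active}} + t_{ws} + t_{sw} + t_{LPI}}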
22. The factors in these experiments are:
• The power consumption in the LPI mode, Ps, is assumed to be 10%, following the estimates made by different NIC manufacturers during the standardization process of EEE. *
• The power consumption during transitions is assumed to be 100% (Pa), also based on estimates made by different NIC manufacturers. *
• The power consumption in Active mode is, of course, 100% of the link's consumption. *
* Source: 『 Performance Evaluation of Energy Efficient Ethernet 』, P. Reviriego, J. A. Hernández, D. Larrabeiti, and J. A. Maestro, IEEE Communications Letters, vol. 13, no. 9, September 2009
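Plugging these assumptions into the reconstructed formula above gives a quick feel for the savings; the traffic split here is hypothetical, not from the slides. If a link spends 15% of its time active, 5% in transitions, and 80% in LPI:

\bar{P} \;=\; (0.15 + 0.05)\,P_a \;+\; 0.80 \times 0.1\,P_a \;=\; 0.28\,P_a

so the link draws about 28% of its always-active power.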
23. The factors in these experiments are:
[5] J. Chou, “Low-power idle based EEE 100Base-TX,”
Mar. 2008, in IEEE 802.3az Task Force presentation.
[6] B. Kohl, “10GBase-T power budget summary,”
Mar. 2007, in IEEE 802.3az Task Force presentation.
24. The factors in these experiments are:
• Tws and Tsw: set to their minimums, 4.48 and 2.88 μs respectively.
• Distribution of packet arrivals and packet size: Poisson arrivals with a fixed packet size of 1500 B.
• For the small coalescer, 12 μs and 10 packets are used for these factors, respectively (10 × 1500 B = 15 KB).
• For the large coalescer, 120 μs and 100 packets are used (100 × 1500 B = 150 KB).
28. 『 Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach 』
Matthew J. Koop(1)(2), Terry Jones(2), Dhabaleswar K. Panda(1)
(1) Network-Based Computing Laboratory, Department of Computer Science and Engineering, The Ohio State University
(2) Lawrence Livermore National Laboratory, Livermore, CA 94550
* Published in: Cluster Computing and the Grid, 2007 (CCGRID 2007), Seventh IEEE International Symposium on
http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=4215416&url=http%3A%2F%2Fieeexplore.ieee.org%2Fxpls%2Fabs_all.jsp%3Farnumber%3D4215416
32. Comparison of high-performance interconnects

Feature | InfiniBand | PCI-X | Fibre Channel | 1 Gb & 10 Gb Ethernet | Hypertransport | Rapid I/O
Bus/link bandwidth | 2.5/10/30 Gbps | 8.51 Gbps | 1/2.1 Gbps | 1 Gb, 10 Gb | 12.8, 25.6, 51.2 Gbps | 16/3 Gbps
Bus/link bandwidth (fully duplexed) | 5/20/60 Gbps | n/a | 2.1/4.2 Gb | 2 Gb, 20 Gb | 25.6, 51.2, 102 Gbps | 32/64 Gbps
Pin count | 4/16/48 (note 4) | 90 | 4 | 4, fiber | 55, 103, 197 | 40/76
Maximum signal length | km | inches | km | km | inches | inches
Transport media | PCB, fiber, copper cable | PCB only | copper and fiber cable | copper and fiber cable | PCB only | PCB only
Maximum packet payload | 4 KB | not packet based | 2 KB | 1.5 KB (Jumbo: 9 KB) | 64 B | 256 B

Further feature rows (the per-column support marks were lost): simultaneous peer-to-peer communication (InfiniBand: 15 VLs + management lane; one column: three transaction flows), native hardware transport support, in-band management (Ethernet: not native, can use IP), RDMA support, native support for virtual interface, end-to-end management, memory partitioning, QoS (one column: limited), reliability, scalability.

Notes:
1. The raw bandwidth of an InfiniBand 1X link is 2.5 Gbps (per link). Data bandwidth (due to 8B/10B encoding) is 2.0 Gbps for 1X, 8 Gbps for 4X, and 24 Gbps for 12X; twice that for full duplex, or 4/16/48 Gbps.
2. The bandwidth of 2-Gb Fibre Channel is 2.1 Gbps, but the actual raw bandwidth (due to 8B/10B encoding) is 20% lower, or around 1.7 Gbps (twice that for full duplex).
3. Values are for 8B/16B data paths, peak, at 1-GHz operation. Speeds of 125, 250, and 500 MHz are supported.
4. The pin count for a 1X link is four pins, up to 48 pins for a 12X link.
5. Memory partitioning enables multiple hosts to access storage endpoints in a controlled manner based on a key. Access to a particular endpoint is controlled by this key, so different hosts can have access to different elements in the network.
* InfiniBand: Thinking Outside the Box Design
http://www.eetimes.com/document.asp?doc_id=1204052
33. Excerpt of the comparison above:

Feature | InfiniBand | 1 Gb & 10 Gb Ethernet | Hypertransport
Bus/link bandwidth | 2.5/10/30 Gbps | 1 Gb, 10 Gb | 51.2 Gbps
Bus/link bandwidth (fully duplexed) | 5/20/60 Gbps | 2 Gb, 20 Gb | 102 Gbps
Maximum signal length | km | km | inches
Transport media | PCB, fiber, copper cable | copper and fiber cable | PCB only

* InfiniBand: Thinking Outside the Box Design
http://www.eetimes.com/document.asp?doc_id=1204052
34. InfiniBand uses a switched-fabric topology, in contrast to the hierarchical switched networks of traditional Ethernet architectures. Every transmission begins or ends at a channel adapter: each processor has a host channel adapter (HCA) and each peripheral has a target channel adapter (TCA). These adapters can also exchange information for security and QoS.
A switch fabric looks like nodes woven together like threads in cloth. Because the links are point-to-point, no routing algorithm is needed.
* INFINIBAND by Carlo Kopp
http://www.csse.monash.edu.au/~carlo/SYSTEMS/Infiniband-Intro-0901.html
* http://ko.wikipedia.org/wiki/인피니밴드
* http://etherealmind.com/what-is-the-definition-of-switch-fabric/
35. Channel Adapters: the host channel adapter (HCA) and the target channel adapter (TCA)
The HCA provides an interface to a
host CPU and memory subsystem,
such as a web server, and supports all
software verbs defined by the
InfiniBand architecture.
A TCA, on the other hand, provides the connection from InfiniBand to an I/O device. This I/O card, which could be a network interface card (NIC), houses the subset of features necessary for each device's specific operations.
* InfiniBand: Thinking Outside the Box Design
http://www.eetimes.com/document.asp?doc_id=1204052
36. HCA location and characteristics: the message (msg) is handed to the HCA on the NIC and sent/received via InfiniBand.
* High-Performance Buses and Interconnects
http://www.pcmag.com/article2/0,2817,1154809,00.asp
37. By using InfiniBand instead of Ethernet:
• The packetization work performed in the Transport/Network layers (in the kernel) is simplified.
• CPU usage and latency therefore decrease.
* Enterprise Distributed Systems and Infiniband
http://www.cisco.com/c/en/us/products/collateral/switches/sfs-7000-series-infiniband-server-switches/prod_white_paper0900aecd804f90f3.html
38. How it Works?
When using a connection-based model, each host in a pair that wishes to communicate must set up a dedicated Queue Pair (QP) for communication with that peer. Each QP is linked to a Completion Queue (CQ) for notification of completion. In this connection-based model, each additional connection incurs additional memory usage.
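A minimal ibverbs sketch of that setup; the queue sizes and the helper name are assumptions, and device open / protection-domain allocation are omitted:

#include <infiniband/verbs.h>

/* Create a completion queue and a reliable-connected QP linked to it.
 * 'ctx' and 'pd' would come from ibv_open_device()/ibv_alloc_pd(). */
struct ibv_qp *make_qp(struct ibv_context *ctx, struct ibv_pd *pd,
                       struct ibv_cq **cq_out)
{
    struct ibv_cq *cq = ibv_create_cq(ctx, 128, NULL, NULL, 0);
    if (!cq)
        return NULL;

    struct ibv_qp_init_attr attr = {
        .send_cq = cq,                /* completions are reported here */
        .recv_cq = cq,
        .qp_type = IBV_QPT_RC,        /* the connection-based model    */
        .cap = {
            .max_send_wr  = 64,       /* outstanding send WQEs per QP  */
            .max_recv_wr  = 64,       /* per-QP receives (no SRQ here) */
            .max_send_sge = 2,
            .max_recv_sge = 2,
        },
    };
    *cq_out = cq;
    return ibv_create_qp(pd, &attr);  /* one such QP per peer          */
}

The per-QP max_send_wr/max_recv_wr allocations are the per-connection memory that grows with each connection, which is what the paper's message coalescing targets.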
39. How it Works? (cont.)
To send a message, a descriptor is posted to the QP. This descriptor contains
information about the message to be sent, including the data address, memory
keys, and message length. To receive a message using channel semantics a
receive descriptor must be posted containing the address and length of the buffer.
Upon posting a descriptor, a send Work Queue Entry (WQE), pronounced “wookie,”
is used to track the progress of the request.
Upon completion of a WQE, a Completion Queue Entry (CQE), "cookie," is placed in
the CQ. This method is used in both channel and memory semantics. CQEs can be
obtained by polling the CQ or through event-based methods.
When a QP is created, the number of send and receive WQEs must be defined.
The number of WQEs allocated determines the number of outstanding send and
receive operations allowed on a single QP. Using a Shared Receive Queue (SRQ)
allows receive WQEs and buffers to be shared rather than per QP, which allows far
better scalability. Benefits are demonstrated in [17] and we will assume SRQ is
being used. Even using a SRQ, however, send WQEs must be posted per QP. Thus,
the number of send WQEs allocated for a QP determines how many outstanding
send operations are allowed for that connection.
* 2.1 InfiniBand Architecture Overview
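The descriptor/WQE/CQE flow above might look like this in ibverbs (a sketch: the buffer is assumed to be registered with ibv_reg_mr(), and the busy-poll stands in for the event-based alternative):

#include <infiniband/verbs.h>
#include <stdint.h>

/* Post one send descriptor (data address, memory key, message length)
 * and wait for its CQE. */
int send_and_wait(struct ibv_qp *qp, struct ibv_cq *cq,
                  struct ibv_mr *mr, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr, /* data address   */
        .length = len,                 /* message length */
        .lkey   = mr->lkey,            /* memory key     */
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,               /* echoed back in the CQE */
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_SEND,     /* channel semantics      */
        .send_flags = IBV_SEND_SIGNALED,
    }, *bad;

    if (ibv_post_send(qp, &wr, &bad))  /* consumes one send WQE  */
        return -1;

    struct ibv_wc wc;                  /* the completion ("cookie") */
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                              /* poll until the CQE arrives */
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}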
40. Hardware details
The HCA scans the contents of the Work Queue (a FIFO queue), reads the corresponding message out of main memory, and converts it into packets. When the transmission completes, it records completion information in the corresponding Completion Queue. At the destination node, the transmitted packets are reassembled into a message and stored in the Work Queue.
* INFINIBAND by Carlo Kopp
http://www.csse.monash.edu.au/~carlo/SYSTEMS/Infiniband-Intro-0901.html
45. MSG Coalescing Design
1. Alter the send-flow operation.
2. Use the InfiniBand scatter/gather capabilities instead of packing into the same buffer (see the sketch after this list).
3. Cache the MPI tag-matching information for each message.
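A hedged sketch of design point 2: rather than copying small messages into one staging buffer, a single send WQE can point at both buffers through its gather list. Function and parameter names are mine, not the paper's:

#include <infiniband/verbs.h>
#include <stdint.h>

/* Coalesce two queued small messages into one send WQE: the HCA
 * gathers both regions onto the wire as a single message, so only
 * one WQE (and one transmission) is consumed instead of two. */
int post_coalesced(struct ibv_qp *qp,
                   struct ibv_mr *m1, uint32_t len1,
                   struct ibv_mr *m2, uint32_t len2)
{
    struct ibv_sge sge[2] = {
        { .addr = (uintptr_t)m1->addr, .length = len1, .lkey = m1->lkey },
        { .addr = (uintptr_t)m2->addr, .length = len2, .lkey = m2->lkey },
    };
    struct ibv_send_wr wr = {
        .sg_list    = sge,
        .num_sge    = 2,               /* scatter/gather, no copy */
        .opcode     = IBV_WR_SEND,
        .send_flags = IBV_SEND_SIGNALED,
    }, *bad;
    return ibv_post_send(qp, &wr, &bad);
}

The receiver sees one contiguous message, which is why design point 3 (caching the MPI tag-matching information per message) is needed to demultiplex the coalesced pieces.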
46. MSG Coalescing Evaluation
Our experimental testbed is a 575-node InfiniBand Linux cluster at Lawrence Livermore National Laboratory. Each compute node has four 2.5 GHz Opteron 8216 dual-core processors, for a total of 8 cores. Total memory per node is 16 GB. Each node has a Mellanox MT25208 DDR HCA. InfiniBand software support is provided through the OpenFabrics/Gen2 stack [15]. The Intel v9.0 compiler is used to compile the MVAPICH library and applications.