The document discusses Linux network drivers and provides information about:
- The Linux network subsystem and protocol stack, typically using TCP/IP.
- Network interface card (NIC) drivers which provide a uniform interface for the network layer to access physical network cards.
- Key data structures like struct sk_buff and struct net_device that network drivers interact with for packet handling and device operations.
- Functions for network device registration, open/close, interrupt handling, and flow control.
- Examples of simple network drivers and how to write one for a Realtek NIC.
TRex Realistic Traffic Generator - Stateless support - Hanoch Haim
New Stateless support in TRex provides:
- High performance packet generation of up to 22 million packets per second per core and support for interfaces from 1Gbps to 100Gbps.
- Flexible traffic profiles that can generate multiple streams of traffic with programmable fields using a field engine.
- Statistics on a per port, per stream, and per traffic profile basis including latency and jitter.
- Python API and interactive console for automation and control.
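The "field engine" mentioned above can be pictured as a set of per-stream rules that rewrite chosen packet fields on every transmitted packet. The following is a conceptual sketch of that idea in plain Python, not the actual TRex API; the function names and the choice of field (the low byte of a source IP) are illustrative.

```python
# Conceptual sketch of a field engine: a per-stream rule that cycles a
# packet field (here, the low byte of a source IP) through a range,
# advancing one step per generated packet. Not the real TRex API.

def make_field_engine(start, end):
    """Cycle a field value through [start, end], one step per packet."""
    value = start
    def next_value():
        nonlocal value
        current = value
        value = start if value >= end else value + 1
        return current
    return next_value

src_ip_low_byte = make_field_engine(1, 3)
packets = ["10.0.0.%d" % src_ip_low_byte() for _ in range(5)]
print(packets)  # field wraps: ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.1', '10.0.0.2']
```

A real field engine applies many such rules in parallel (IPs, ports, VLAN tags) and does so in the fast path, but the per-packet mutate-and-wrap behavior is the same.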
DPDK greatly improves packet processing performance and throughput by allowing applications to access hardware directly, bypassing the kernel. It can improve performance by up to 10 times, allowing over 80 Mpps (million packets per second) of throughput on a single CPU, or double that with two CPUs. This enables telecom and networking equipment manufacturers to develop products faster and at lower cost. DPDK achieves these gains through techniques such as dedicated core affinity, userspace drivers, polling instead of interrupts, and lockless synchronization.
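The packet rates quoted for DPDK can be sanity-checked with simple line-rate arithmetic: on Ethernet, each frame carries 20 extra bytes on the wire (7 preamble + 1 start-of-frame delimiter + 12 bytes of inter-frame gap), so the link speed caps the packet rate. A minimal sketch:

```python
# Back-of-envelope packet-rate math: line rate divided by the on-wire
# size of a frame (payload frame + 20 bytes of preamble/SFD/IFG overhead)
# gives the theoretical maximum packets per second.

def max_pps(link_bps, frame_bytes, overhead_bytes=20):
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_per_frame

pps_10g = max_pps(10e9, 64)   # ~14.88 Mpps per 10GbE port at 64-byte frames
print(round(pps_10g))
```

This is why per-core rates in the tens of Mpps correspond to saturating one or more 10GbE ports with minimum-size frames.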
PCIe Gen 3.0 Presentation @ 4th FPGA Camp - FPGA Central
PCIe Gen3 presentation by PLDA at 4th FPGA Camp in Santa Clara, CA. For more details visit http://www.fpgacentral.com/fpgacamp or http://www.fpgacentral.com
PCI Express (Peripheral Component Interconnect Express), abbreviated PCIe or PCI-E, is designed to replace the older PCI, PCI-X, and AGP standards. We present a data communication system developed to transfer data between the host and peripheral devices via PCIe. Both the performance and the available area on the board are affected by the use of PCIe. PCIe is a serial expansion bus interconnect used for high-speed communication; it currently represents the fastest, and most expensive, way to connect peripheral devices to a general-purpose CPU, and it provides the highest-bandwidth connection in the PC platform. In this paper, we highlight the different types of bus architecture and describe how the PCIe architecture transfers data from the CPU to its destination.
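The bandwidth claims made for PCIe come directly from its per-lane signaling rates: the raw transfer rate in GT/s is reduced by the line encoding (8b/10b for Gen1/Gen2, 128b/130b for Gen3) before dividing by 8 to get bytes. A small sketch of that calculation:

```python
# Usable per-lane bandwidth by PCIe generation: raw rate (GT/s) times
# encoding efficiency (8b/10b for Gen1/2, 128b/130b for Gen3), then /8
# to convert bits to bytes.

ENCODING = {1: 8 / 10, 2: 8 / 10, 3: 128 / 130}
RAW_GTPS = {1: 2.5, 2: 5.0, 3: 8.0}

def lane_gbytes_per_s(gen):
    return RAW_GTPS[gen] * ENCODING[gen] / 8

for gen in (1, 2, 3):
    print("Gen%d: %.3f GB/s per lane" % (gen, lane_gbytes_per_s(gen)))
```

Multiplying by the link width (x1, x4, x8, x16) gives the familiar totals, e.g. roughly 16 GB/s each way for a Gen3 x16 link.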
This document provides an overview of the verification strategy for PCI-Express. It discusses the PCI-Express protocol, including the physical, data link, transaction, and software layers. It outlines the verification paradigm, including functional verification using constrained random testing, assertions, asynchronous/power domain simulations, and performance verification. It also discusses compliance verification through electrical, data link, transaction, and system architecture checklists. Finally, it discusses design for verification through a modular and scalable architecture to promote reusability and reduce verification effort and complexity.
This document summarizes key aspects of GPU hardware and the SIMT (Single Instruction Multiple Thread) architecture used in NVIDIA GPUs. It describes the evolution of NVIDIA GPU hardware, the differences between latency-oriented CPUs and throughput-oriented GPUs, how SIMT combines SIMD and threading, warp scheduling, divergence and convergence, predicated and conditional execution.
The document provides step-by-step instructions for building and running Intel DPDK sample applications on a test environment with 3 virtual machines connected by 10G NICs. It describes compiling and running the helloworld, L2 forwarding, and L3 forwarding applications, as well as using the pktgen tool for packet generation between VMs to test forwarding performance. Key steps include preparing the Linux kernel for DPDK, compiling applications, configuring ports and MAC addresses, and observing packet drops to identify performance bottlenecks.
Building Open Data Lakes on AWS with Debezium and Apache Hudi - Gary Stafford
Build a simple open data lake on AWS using a combination of open-source software (OSS), including Red Hat’s Debezium, Apache Kafka, and Kafka Connect for change data capture (CDC), and Apache Hive, Apache Spark, Apache Hudi, and Hudi’s DeltaStreamer for managing our data lake. We will use fully-managed AWS services to host the open data lake components, including Amazon RDS, Amazon MSK, Amazon EKS, and EMR.
Link to the blog post and video: https://garystafford.medium.com/building-open-data-lakes-with-debezium-and-apache-hudi-c3370d3f86fb
DPDK is a set of drivers and libraries that allow applications to bypass the Linux kernel and access network interface cards directly for very high performance packet processing. It is commonly used for software routers, switches, and other network applications. DPDK can achieve over 11 times higher packet forwarding rates than applications using the Linux kernel network stack alone. While it provides best-in-class performance, DPDK also has disadvantages like reduced security and isolation from standard Linux services.
Creating Your Own PCI Express System Using FPGAs: Embedded World 2010 - Altera Corporation
This document discusses creating PCI Express systems using FPGA devices. It provides an overview of PCI Express, describing its key functional elements like the root complex and endpoints. It also outlines PCI Express support in Altera FPGAs, including both hard IP blocks and soft IP cores that enable PCI Express connectivity. The hard IP blocks perform the various PCI Express layers and reduce resource usage compared to soft cores.
The document provides an overview of the PCI Express system architecture. It discusses the architectural perspective of PCI Express including how it maintains backwards compatibility with PCI/PCI-X while improving performance through serial point-to-point connectivity and packet-based transactions. It also covers the PCI Express transaction model and types, including memory, I/O, configuration and message transactions, as well as posted and non-posted transaction types.
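The posted/non-posted distinction mentioned above can be summarized in a small table: posted transactions (memory writes, messages) expect no completion, while non-posted transactions (reads, I/O, and configuration accesses) make the requester wait for one. The sketch below encodes that taxonomy; the type names are illustrative labels, not spec mnemonics.

```python
# Sketch of the PCIe transaction taxonomy: posted requests are
# fire-and-forget, non-posted requests require a completion from the
# completer. Type names here are illustrative labels.

POSTED = {"memory_write", "message"}
NON_POSTED = {"memory_read", "io_read", "io_write", "config_read", "config_write"}

def needs_completion(txn_type):
    if txn_type in POSTED:
        return False
    if txn_type in NON_POSTED:
        return True
    raise ValueError("unknown transaction type: %s" % txn_type)

print(needs_completion("memory_write"))  # False: fire-and-forget
print(needs_completion("config_read"))   # True: completer returns data
```

Note that I/O and configuration *writes* are non-posted too: even though no data comes back, the requester still waits for a completion confirming the write.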
Analyzing 1.2 Million Network Packets per Second in Real-time - DataWorks Summit
The document describes Cisco's OpenSOC, an open source security operations center that can analyze 1.2 million network packets per second in real time. It discusses the business need for such a solution given how breaches often go undetected for months. The solution architecture utilizes big data technologies like Hadoop, Kafka and Storm to enable real-time processing of streaming data at large scale. It also provides lessons learned around optimizing the performance of components like Kafka, HBase and Storm topologies.
This document discusses adding support for PCI Express and new chipset emulation to Qemu. It introduces a new Q35 chipset emulator with support for 64-bit BAR, PCIe MMCONFIG, multiple PCI buses and slots. Future work includes improving PCIe hotplug, passthrough and power management as well as switching the BIOS to SeaBIOS and improving ACPI table support. The goal is to modernize Qemu's emulation of PCI features to match capabilities of newer hardware.
AMD Chiplet Architecture for High-Performance Server and Desktop Products - AMD
This document discusses AMD's chiplet architecture for high-performance server and desktop processors. Key points include:
- AMD partitions the system-on-a-chip design, using 7nm technology for CPU cores while leaving I/O interfaces in older process nodes. This improves performance and lowers costs.
- CPU dies ("chiplets") are connected using high-speed SerDes links both on-package and between dies. This allows for more chiplets and cores than traditional monolithic designs.
- Innovations in packaging, power distribution, and operating system scheduling were required to enable the multi-chiplet design and improve performance.
SAS vs SATA_ The Key Differences That You Should Know.pptx - calltutors
In this presentation, we have discussed SAS vs SATA. If you are interested in the differences between SAS and SATA, you will find it helpful.
The Architecture of 11th Generation Intel® Processor Graphics - Intel® Software
Scheduled for release this year, this next generation brings significant improvements over the widely used 9th generation of Intel® Processor Graphics. The talk begins with an overview of Intel® Graphics architecture, its building blocks, and their performance implications. Next, take an in-depth look at the new and innovative features of this latest generation of integrated graphics.
This document discusses SR-IOV (Single Root I/O Virtualization) in ACRN. It begins with an introduction to SR-IOV, describing how it allows PCIe devices to be isolated and have near bare-metal performance through the use of Physical Functions (PFs) and Virtual Functions (VFs). It then outlines the SR-IOV architecture in ACRN, including how it detects and initializes SR-IOV devices, assigns VFs to VMs, and manages the lifecycle of VFs. Finally, it provides an agenda for an SR-IOV demo using an Intel 82576 NIC and concludes with a Q&A section.
Cassandra was chosen over other NoSQL options like MongoDB for its scalability and ability to handle a projected 10x growth in data and shift to real-time updates. A proof-of-concept showed Cassandra and ActiveSpaces performing similarly for initial loads, writes and reads. Cassandra was selected due to its open source nature. The data model transitioned from lists to maps to a compound key with JSON to optimize for queries. Ongoing work includes upgrading Cassandra, integrating Spark, and improving JSON schema management and asynchronous operations.
This document discusses hardware offloading of VXLAN encapsulation and decapsulation in OVS-DPDK. It proposes representing virtual ports (vPorts) as tables to enable hardware offloading of VXLAN processing. Matching and actions on the vPort table would occur in hardware before decapsulation. Fallback processing using software would be used if full hardware offloading is not possible. The goal is to leverage intelligent NIC capabilities to accelerate VXLAN tunnel processing and improve performance for cloud, NFV, and storage workloads.
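The vPort-as-table idea described above can be modeled as a match/action lookup tried in hardware first, with a miss falling back to the software slow path. The sketch below shows that control flow; the key layout and action names are illustrative, not the actual OVS-DPDK data structures.

```python
# Conceptual model of vPort-as-table offload: hardware attempts an exact
# match before decapsulation; on a miss the packet takes the software
# fallback path. Field names and actions are illustrative only.

hw_vport_table = {
    # key: (outer_dst_ip, vxlan_vni) -> offloaded action
    ("10.0.0.1", 100): "decap_and_forward_port1",
}

def process(pkt, sw_fallback):
    key = (pkt["outer_dst_ip"], pkt["vni"])
    action = hw_vport_table.get(key)
    if action is not None:
        return "hw:" + action          # fully offloaded fast path
    return "sw:" + sw_fallback(pkt)    # software handles the miss

hit = process({"outer_dst_ip": "10.0.0.1", "vni": 100}, lambda p: "decap")
miss = process({"outer_dst_ip": "10.0.0.2", "vni": 200}, lambda p: "decap")
print(hit, miss)  # hw:decap_and_forward_port1 sw:decap
```

The design point is that matching happens on *outer* headers before decapsulation, so the NIC can both strip the VXLAN header and pick the output port without ever handing the packet to the CPU on a hit.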
The document discusses UART (Universal Asynchronous Receiver/Transmitter) communication. It describes how UARTs allow for asynchronous serial communication between devices using only 2 wires by converting parallel data to serial and vice versa. The UART communication process involves a transmitting UART adding start, stop and optionally parity bits to data before transmitting it serially bit-by-bit to a receiving UART which reconstructs the parallel data. It also discusses the TTL and RS-232 physical layer standards for UART.
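The framing process described above is small enough to simulate directly: the transmitter wraps each byte in a start bit (0), eight data bits LSB-first, an optional even-parity bit, and a stop bit (1), and the receiver reverses the process. A minimal sketch:

```python
# Simulation of UART framing: start bit, 8 data bits LSB-first,
# optional even-parity bit, stop bit; the receiver checks the framing
# bits and reassembles the byte.

def uart_frame(byte, parity=True):
    bits = [0]                                   # start bit
    bits += [(byte >> i) & 1 for i in range(8)]  # data bits, LSB first
    if parity:
        bits.append(sum(bits[1:9]) % 2)          # even parity over data bits
    bits.append(1)                               # stop bit
    return bits

def uart_deframe(bits, parity=True):
    assert bits[0] == 0 and bits[-1] == 1        # start/stop bits valid
    data = bits[1:9]
    if parity:
        assert sum(data) % 2 == bits[9]          # parity check
    return sum(b << i for i, b in enumerate(data))

frame = uart_frame(0x41)                         # ASCII 'A'
print(frame, "->", hex(uart_deframe(frame)))
```

Because both sides agree on this frame shape (and on the baud rate) in advance, no clock line is needed: the receiver resynchronizes on every start bit, which is what makes the link asynchronous.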
ExpEther is a technology that extends the PCI Express bus beyond the confines of a computer chassis via Ethernet, without any modification of existing hardware or software. Computing resources such as GPUs, NVMe SSDs, and video boards can be added to a standard Ethernet fabric as if they were installed directly in the chassis, providing scale-up flexibility. ExpEther can therefore build a new type of computing environment free of localized physical constraints, and it is cost-effective because it uses standard Ethernet equipment.
This document discusses techniques for offloading I/O transactions from the CPU to improve performance. It introduces iDMA which allows direct communication between I/O devices and system memory without CPU involvement. It also describes the "Hot Potato" approach which treats payload data as a "hot potato" passed directly between devices without CPU processing. Finally, it proposes "Device2Device" (D2D) communication which allows direct transfer of data between I/O devices like sending video data directly from a SSD to a NIC without using system memory or the CPU. Measurements show these approaches can significantly reduce latency and improve throughput and power efficiency compared to traditional CPU-managed I/O.
Multiple Cores, Multiple Pipes, Multiple Threads – Do we have more Parallelis... - Slide_N
This document summarizes a presentation given at the 2005 IEEE Hot Chips conference about parallelism in modern processors and how it relates to programming models. It discusses different types of parallelism available at the processor, system, and application levels. It then examines approaches to parallelism used by general-purpose CPUs, special-purpose CPUs like the Cell processor, and GPUs. While parallelism is increasing in these devices, programming them effectively remains challenging due to the difficulty of parallel programming and lack of appropriate language and tooling support. The document calls for more research in parallel programming models and languages to make better use of emerging multi-core architectures.
The document discusses IBM's POWER8 technology, which features up to 12 cores per socket, 8 threads per core, larger caches, improved memory bandwidth and latency, integrated I/O subsystem and PCIe controller, and fine-grained power management. It provides details on IBM Power Systems such as the S814 and S824 servers that use POWER8, including their specifications, performance improvements over previous generations, and storage options.
The document summarizes the Cell processor architecture, which was developed as a collaboration between IBM, Sony, and Toshiba to address limitations in processor performance. The Cell consists of 9 cores - 1 PowerPC core called the PPE and 8 synergistic processor elements (SPEs) optimized for SIMD operations. It has a peak performance of over 200 GFLOPS and was used in the PlayStation 3 game console to enable graphics-intensive applications. The document outlines the Cell architecture and how it aims to overcome performance walls related to power, memory, and frequency limitations.
In-Memory and TimeSeries Technology to Accelerate NoSQL Analytics - sandor szabo
The ability of Informix to combine the in-memory performance of Informix Warehouse Accelerator with the flexibility of TimeSeries and NoSQL analytics positions it to be ready for the IoT era.
The Oracle SPARC T4-1 system is a 1-socket, 2RU enterprise server featuring Oracle's SPARC T4 processor with 8 cores and 64 threads. It has up to 256GB of DDR3 RAM, 6 PCIe slots, SAS storage, and dual 10GbE ports. The T4-1 is the successor to the SPARC T3-1 and uses the same chassis and service processor, targeting database, middleware, virtualization, and security workloads.
This document discusses performance optimization for data centers on multi-core platforms and provides a case study analysis. It introduces Intel software tuning tools, describes a methodology for data center performance tuning involving system, application, and microarchitecture levels, and analyzes a case study where thread synchronization overhead was identified and reduced through the use of NPTL in Linux, improving CPU utilization and throughput.
Galil Ethernet or EtherCAT Motion Control Webinar January 26, 2016 - Electromate
This document discusses choosing between Ethernet and EtherCAT networks for motion control applications. It provides an overview of centralized vs distributed control systems, describes the key differences between Ethernet and EtherCAT at the hardware level, and gives examples of Galil motion controllers and I/O modules that support both network types. A decision tree is presented outlining considerations for determining whether EtherCAT or Ethernet is best for a given application.
The Nucleus RM Capture is a customizable 3U rackmount server platform designed for high-speed network traffic recording and analysis. It features 20 front-access removable hard drives, dual Intel Xeon processors, up to 512GB RAM, and multiple PCIe slots. The system is engineered to capture network traffic at 10Gbps or higher line rates while providing powerful processors and flexibility for future interface upgrades and application acceleration. It is fully customizable and supported by NextComputing.
CETH for XDP [Linux Meetup Santa Clara | July 2016] - IO Visor Project
This document discusses CETH (Common Ethernet Driver Framework), which aims to improve kernel networking performance for virtualization. CETH simplifies NIC drivers by consolidating common functions. It supports various NICs and accelerators. CETH features efficient memory and buffer management, flexible TX/RX scheduling, and a customizable metadata structure. It is being simplified to work with XDP for even higher performance network I/O processing in the kernel. Next steps include further optimizations and measuring performance gains when using CETH with XDP and virtualized environments.
Socionext is developing low power ARM server solutions including the SC2A11 multicore processor and SC2A20 SoC switch. They aim to build scalable small core systems with optimized performance and power efficiency compared to traditional servers. Socionext has integrated their solutions into a prototype low power scalable server and is developing the necessary software including UEFI, Linux, and applications to support various server workloads.
Heterogeneous Computing: The Future of Systems - Anand Haridass
Charts from NITK-IBM Computer Systems Research Group (NCSRG)
- Dennard Scaling,Moore's Law, OpenPOWER, Storage Class Memory, FPGA, GPU, CAPI, OpenCAPI, nVidia nvlink, Google Microsoft Heterogeneous system usage
The MYC-CZU3EG CPU Module is a powerful MPSoC System-on-Module (SoM) based on the Xilinx Zynq UltraScale+ ZU3EG, which features a 1.2 GHz quad-core ARM Cortex-A53 64-bit application processor.
The document provides information on the HPE ProLiant DL20 Gen10 Server, including:
- It is a 1U rack server powered by Intel Xeon E, Pentium, and Core i3 processors, offering flexibility and value.
- Standard features include Intel C242 chipset, up to 64GB memory, 1Gb Ethernet ports, and various storage options.
- It comes in various pre-configured models for entry, performance, and solution workloads.
EtherCAT as a Master Machine Control Tool - Design World
There is an increasing demand in the automation and motion control industries for a localized motion control solution that can coordinate motion between multiple remote components.
Previously, field bus protocols such as Modbus or Ethernet have been implemented to address this demand. Although successful in moving data across automation networks, these protocols lacked the real time performance necessary for a distributed motion control system.
The EtherCAT communication protocol provides a high speed, low overhead communication scheme that allows efficient, deterministic communication between motion controller and remote components. Based on Ethernet and streamlined specifically for point to point transmission of real time data, the EtherCAT standard is quickly becoming the preferred choice for centralized control of tightly coupled motion between remote components.
This presentation is aimed at designers of automation and motion control systems with a basic understanding of Ethernet communication.
The document describes a 5-day residency program hosted by the OpenPOWER Academic Discussion Group (ADG) at NIE Mysore from June 6-10, 2022. The program aims to bridge industry and academia knowledge in chip design by developing curriculum on OpenPOWER technology and training lab assistants. Engineers and academicians with 5+ years experience in chip design/verification are eligible to participate. They will collaborate on developing course materials and lab exercises to teach undergraduate students in fields like ECE and CSE. The program seeks to help fulfill India's goals in chip design manpower and self-reliance through initiatives like Make in India and the India Semiconductor Mission.
This document provides an overview of digital design and Verilog. It discusses binary numbers and boolean algebra as the foundation of digital systems. It also describes logic gates, combinational and sequential circuits, finite state machines, and datapath and control units. Finally, it introduces Verilog, describing different modeling types like gate level, behavioral, dataflow, and switch level modeling. It positions Verilog as a hardware description language used to more easily design digital circuits compared to manual drawing.
The Libre-SOC Project aims to create an entirely Libre-Licensed, transparently-developed fully auditable Hybrid 3D CPU-GPU-VPU, using the Supercomputer-class OpenPOWER ISA as the foundation.
Our first test ASIC is a 180nm "Fixed-Point" Power ISA v3.0B processor, 5.1mm x 5.9mm, as a proof-of-concept for the team, whose primary expertise is in Software Engineering. Software Engineering training brings a radically different approach to Hardware development: extensive unit tests, source code revision control, automated development tools are normal. Libre Project Management brings even more: bug trackers, mailing lists, auditable IRC logs and a wiki are standard fare for Libre Projects that are simply not normal Industry-Standard practice.
This talk therefore goes through the workflow, from the original HDL through to the GDS-II layout, showing how we were able to keep track of the development that led to the IMEC 180nm tape-out in July 2021. In particular, by following a parallel development process involving "Real" and "Symbolic" Cell Libraries developed by Chips4Makers, we will show how our developers did not need to sign a Foundry NDA, but were still able to work side-by-side with a University that did. With this parallel development process, the University upheld their NDA obligations, and Libre-SOC was simultaneously able to honour its Transparency Objectives.
Workload Transformation and Innovations in POWER Architecture (Ganesan Narayanasamy)
The IT industry is going through two major transformations. One is the adoption of AI and its tight integration into commercial applications and enterprise workflows. The other is the transformation of software architecture through concepts like microservices and cloud-native architecture. These transformations, alongside the aggressive adoption of IoT, mobile, and 5G in all our day-to-day activities, are making the world operate in a more real-time manner, which opens up a new challenge: improving hardware architecture to adapt to these requirements. These two major transformations push the boundary of the entire systems stack, making designers rethink hardware. This talk presents a picture of how the industry-leading enterprise POWER architecture is transforming to fulfill the performance demands of these newer-generation workloads, with a primary focus on on-chip AI acceleration.
Join us on Friday, July 16th 2021, for our newest workshop with DoMS, IIT Roorkee: Concept to Solutions using the OpenPOWER Stack. It's time to discover advances in #DeepLearning tools and techniques from the world's leading innovators across industries, research, and public speakers.
Register here:
https://lnkd.in/ggxMq2N
This presentation covers two use cases using OpenPOWER systems:
1. Diabetic Retinopathy using AI on NVIDIA Jetson Nano: the objective is to classify the diabetic level solely from a retina image in a remote area with minimal doctor involvement. The model uses the VGG16 network architecture and is trained from scratch on POWER9. The model was deployed on the Jetson Nano board.
2. Classifying Covid positivity using lung X-ray images: the idea is to build ML models to detect positive cases using X-ray images. The model was trained on POWER9, and the application was developed using Python.
IBM Bayesian Optimization Accelerator (BOA) is a do-it-yourself toolkit to apply state-of-the-art Bayesian inferencing techniques and obtain optimal solutions for complex, real-world design simulations without requiring deep machine learning skills. This talk will describe IBM BOA, its differentiation and ease of use, and how researchers can take advantage of it for optimizing any arbitrary HPC simulation.
This presentation covers the various partners and collaborators currently working with the OpenPOWER Foundation, use cases of OpenPOWER systems in multiple industries, OpenPOWER workgroups, and OpenCAPI features.
The IBM POWER10 processor represents the 10th generation of the POWER family of enterprise computing engines. Its performance is a result of both powerful processing cores and high-bandwidth intra- and inter-chip interconnect. POWER10 systems can be configured with up to 16 processor chips and 1920 simultaneous threads of execution. Cross-system memory sharing, through the new Memory Inception technology, and 2 Petabytes of addressing space support an expansive memory system. The POWER10 processing core has been significantly enhanced over its POWER9 predecessor, including a doubling of vector units and the addition of an all-new matrix math engine. Throughput gains from POWER9 to POWER10 average 30% at the core level and three-fold at the socket level. Those gains can reach ten- or twenty-fold at the socket level for matrix-intensive computations.
Everything is changing, from healthcare to the automotive markets, without forgetting the financial markets or any type of engineering: everything has stopped being created by an individual or, at best, a team, and is now being developed and perfected using AI and hundreds of computers. And even AI is something we can no longer run on a single computer, no matter how powerful it is. What drives everything today is HPC, or High-Performance Computing, heavily linked to AI. In this session we will discuss AI, HPC computing, and the IBM Power architecture, and how it can help develop better healthcare, better automobiles, better financials, and better everything that we run on them.
Macromolecular crystallography is an experimental technique for exploring the 3D atomic structure of proteins, used by academics for research in biology and by pharmaceutical companies in rational drug design. While development of the technique was previously limited by the performance of scientific instruments, computing performance has recently become a key limitation. In my presentation I will present the computing challenge of handling the 18 GB/s data stream coming from the new X-ray detector. I will show PSI's experiences in applying conventional hardware to the task and why this attempt failed. I will then present how the IC 922 server with OpenCAPI-enabled FPGA boards allowed us to build a sustainable and scalable solution for high-speed data acquisition. Finally, I will give a perspective on how this advancement in hardware development will enable better science by users of the Swiss Light Source.
AI in healthcare and Automobile Industry using OpenPOWER/IBM POWER9 systems (Ganesan Narayanasamy)
As the adoption of AI technologies increases and matures, the focus will shift from exploration to time to market, productivity, and integration with existing workflows. Governing enterprise data, scaling AI model development, and selecting a complete, collaborative hybrid platform and tools for rapid solution deployment are key focus areas for growing data-scientist teams tasked with responding to business challenges. This talk will cover the challenges and innovations for AI at scale in industries such as healthcare and automotive, the AI ladder and AI life cycle, and infrastructure architecture considerations.
This talk gives an introduction to healthcare use cases, the AI ladder and life cycle, and AI-at-scale themes. The iterative nature of the workflow and some of the important components to be aware of when developing AI healthcare solutions are discussed, along with the different types of algorithms and when machine learning might be more appropriate than deep learning, or the other way around. Example use cases are also shared as part of this presentation.
Healthcare has become one of the most important aspects of everyone's life. Its importance has surged due to the latest outbreaks, and with this latest pandemic it has become mandatory to collaborate to improve everyone's healthcare as soon as possible.
IBM has reacted quickly, sharing not only its knowledge but also its Artificial Intelligence supercomputers all around the world.
Those supercomputers are helping to prevail over this outbreak and also future ones.
They have completely different features compared to proposals from other players in this supercomputer market.
We will take a quick look at the differences between those AI-focused supercomputers and how they can help in the R&D of healthcare solutions for everyone, from those with access to a big IBM AI supercomputer to those with access to only one small IBM AI-focused server.
Moving object recognition (MOR) corresponds to the localization and classification of moving objects in videos. Discriminating moving objects from static objects and background in videos is an essential task for many computer vision applications. MOR has widespread applications in intelligent visual surveillance, intrusion detection, anomaly detection and monitoring, industrial site monitoring, detection-based tracking, autonomous vehicles, etc. In this session, Murari presented a poster on deep learning algorithms that identify both the locations and the corresponding categories of moving objects with a convolutional network. The challenges in developing such algorithms were also discussed.
The document discusses AI in the enterprise, including use cases, infrastructure considerations, and the AI lifecycle. It provides examples of how AI can be applied in various industries and common patterns of analytics using AI. It also outlines the data science model development workflow and considerations for AI infrastructure, software, and data management throughout the AI lifecycle.
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. The constant focus on speed to release software to market, combined with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today, organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their application supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
zkStudyClub - Reef: Fast Succinct Non-Interactive Zero-Knowledge Regex Proofs (Alex Pruden)
This paper presents Reef, a system for generating publicly verifiable succinct non-interactive zero-knowledge proofs that a committed document matches or does not match a regular expression. We describe applications such as proving the strength of passwords, the provenance of email despite redactions, the validity of oblivious DNS queries, and the existence of mutations in DNA. Reef supports the Perl Compatible Regular Expression syntax, including wildcards, alternation, ranges, capture groups, Kleene star, negations, and lookarounds. Reef introduces a new type of automata, Skipping Alternating Finite Automata (SAFA), that skips irrelevant parts of a document when producing proofs without undermining soundness, and instantiates SAFA with a lookup argument. Our experimental evaluation confirms that Reef can generate proofs for documents with 32M characters; the proofs are small and cheap to verify (under a second).
Paper: https://eprint.iacr.org/2023/1886
Epistemic Interaction - tuning interfaces to provide information for AI support (Alan Dix)
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Threats to mobile devices are more prevalent and increasing in scope and complexity. Users of mobile devices desire to take full advantage of the features
available on those devices, but many of the features provide convenience and capability but sacrifice security. This best practices guide outlines steps the users can take to better protect personal devices and information.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... (DanBrown980551)
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
“An Outlook of the Ongoing and Future Relationship between Blockchain Technologies and Process-aware Information Systems.” Invited talk at the joint workshop on Blockchain for Information Systems (BC4IS) and Blockchain for Trusted Data Sharing (B4TDS), co-located with with the 36th International Conference on Advanced Information Systems Engineering (CAiSE), 3 June 2024, Limassol, Cyprus.
Maruthi Prithivirajan, Head of ASEAN & IN Solution Architecture, Neo4j
Get an inside look at the latest Neo4j innovations that enable relationship-driven intelligence at scale. Learn more about the newest cloud integrations and product enhancements that make Neo4j an essential choice for developers building apps with interconnected data and generative AI.
UiPath Test Automation using UiPath Test Suite series, part 5 (DianaGray10)
Welcome to UiPath Test Automation using UiPath Test Suite series part 5. In this session, we will cover CI/CD with devops.
Topics covered:
CI/CD with in UiPath
End-to-end overview of CI/CD pipeline with Azure devops
Speaker:
Lyndsey Byblow, Test Suite Sales Engineer @ UiPath, Inc.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by me and Rik Marselis from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what Testing in DevOps is. We also had a lovely workshop with the participants, trying to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
In his public lecture, Christian Timmerer provides insights into the fascinating history of video streaming, starting from its humble beginnings before YouTube to the groundbreaking technologies that now dominate platforms like Netflix and ORF ON. Timmerer also presents provocative contributions of his own that have significantly influenced the industry. He concludes by looking at future challenges and invites the audience to join in a discussion.
Elizabeth Buie - Older adults: Are we really designing for our future selves?
PCI Express switch over Ethernet or Distributed IO Systems for Ubiquitous Computing and IoT Solutions
1. PCI Express switch over Ethernet or Distributed IO Systems for Ubiquitous Computing and IoT Solutions
Deepak Pathania, NEC
2. Challenge faced in Real-Time Data Analytics
Big Data of varying characteristics, such as live feeds, graphics, video, text, etc., comes into cloud computers, and this data must be processed and analyzed in real time to produce actionable information and real-time feedback.
To accelerate such processing (real-time analytics, deep learning, etc.), a large number of accelerators such as GPUs, FPGAs, and Xeon Phi, along with high-speed storage, are required.
However, instead of building servers with such accelerators, cloud vendors still prefer building homogeneous servers due to TCO and efficiency considerations.
3. So, What can be a Dynamic Accelerator Deployment Solution?
A technology that extends PCI Express beyond the confines of a computer chassis via Ethernet, WITHOUT any modification of existing hardware and software: PCIe switch over Ethernet (ExpEther or EE).
[Diagram: a server (CPU, memory) connects over PCI Express to an ExpEther NIC; standard Ethernet and an L2 switch carry the traffic to ExpEther engines in an IO expansion unit holding PCIe cards, which attach to the IO devices over PCI Express.]
4. Just another implementation of PCIe Switch
The ExpEther engine is seen as a PCIe switch from the CPU; the Ethernet region is invisible to the CPU.
[Diagram: on the left, a conventional PCIe switch with an upstream port (PCI bridge), an internal PCI bus, and downstream ports (PCI bridges) connecting the CPU to its IO devices. On the right, the equivalent ExpEther topology: ExpEther engines (PCI bridges) attached to the CPU and to each IO device, joined by an Ethernet switch over an Ethernet fabric that remains invisible to the CPU.]
5. Broad-Scale Single Computer
A PCI Express switch is equivalent to an Ethernet fabric; ExpEther can build a new type of computing environment without physical constraints.
[Diagram: two CPUs behind a PCIe switch reach IO devices through ExpEther engines and a hierarchy of Ethernet switches, with device pools located in the same rack, the next rack, another floor, and another building.]
6. ExpEther Architecture
• Achieves a "System on Network"
• Merges PCI Express technology into Ethernet technology
• Connects logically at the MAC layer
• No impact on the upper or lower layers of the PCIe and Ethernet standards, preserving future expansion
[Diagram: the ExpEther stack (Application, OS, PCI Driver, EFI/PCI BIOS in software; ExpEther Logic, MAC, PHY at 1G/10G/40G in hardware) shown side by side with the standard Ethernet stack (Application, OS, NDIS Driver; Ethernet Logic, MAC, PHY at 10M through 40G). Each stack is upward compatible: no modification is needed for future expansion of ExpEther or Ethernet.]
7. Resource Disaggregated Platform or ExpEther features
[Diagram: a CPU connects over PCI Express to an ExpEther engine; PCIe TLPs are carried inside Ethernet frames across an Ethernet fabric and switch to ExpEther engines fronting the I/O devices.]
1. Equivalent to direct connection (Ethernet is invisible from CPU/IO)
2. Low latency (L2 Ethernet without a software stack)
3. No packet loss (adding reliability to Ethernet)
4. I/O dynamic reconfiguration (hot-plug scheme)
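Slide 7's key idea is that PCIe TLPs travel inside plain L2 Ethernet frames, with reliability added on top. The actual ExpEther wire format is NEC-proprietary and not given in this deck; the following is a minimal sketch of the concept only, using a hypothetical frame layout with a sequence number for gap detection, and the IEEE "local experimental" EtherType 0x88B5 as a stand-in:

```python
import struct

# Stand-in EtherType: the real ExpEther EtherType is not given in the slides.
ETHERTYPE_EE = 0x88B5

def encapsulate_tlp(dst_mac, src_mac, seq, tlp):
    """Wrap a PCIe TLP in a raw L2 Ethernet frame.

    A 32-bit sequence number is prepended to the payload so the receiver
    can detect gaps and trigger retransmission -- the "adding reliability
    to Ethernet" idea, in a hypothetical layout.
    """
    header = dst_mac + src_mac + struct.pack("!H", ETHERTYPE_EE)
    return header + struct.pack("!I", seq) + tlp

def decapsulate_tlp(frame):
    """Recover the sequence number and TLP payload from such a frame."""
    (seq,) = struct.unpack("!I", frame[14:18])
    return seq, frame[18:]

# Round-trip a dummy 16-byte "TLP" from a host engine to an IO engine.
frame = encapsulate_tlp(b"\x02" * 6, b"\x04" * 6, seq=7, tlp=b"\xaa" * 16)
seq, tlp = decapsulate_tlp(frame)
```

Because the encapsulation happens below the PCI driver, the CPU and OS see only an ordinary PCIe switch, which is exactly why no software modification is needed.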
8. Dual Path for Throughput and Reliability
• Two Ethernet connections are established between the Host Chip and I/O Chip (dual-port 40G ExpEther NIC)
• Load balancing for performance: round-robin data packet transmission between the two redundant connections
• Path redundancy for failure recovery: path failures are quickly detected and traffic is switched to the surviving path
[Diagram: a CPU with a dual-port ExpEther host chip connects over Ethernet Fabric-I and Ethernet Fabric-II to two ExpEther IO chips, each fronting an I/O device.]
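The dual-path behaviour described above can be modelled in a few lines. This is an illustrative toy, not NEC's implementation: the path names and the failure-detection trigger are invented for the sketch.

```python
from itertools import cycle

class DualPathSender:
    """Toy model of dual-path transmission: round-robin frames across two
    Ethernet fabrics, and fall back to the surviving fabric on failure."""

    def __init__(self):
        self.healthy = {"fabric-I": True, "fabric-II": True}
        self._rr = cycle(self.healthy)       # round-robin over the two paths

    def fail(self, path):
        """Mark a path as failed (e.g. detected via missed keep-alives)."""
        self.healthy[path] = False

    def pick_path(self):
        """Next path in round-robin order, skipping failed ones."""
        for _ in range(len(self.healthy)):
            path = next(self._rr)
            if self.healthy[path]:
                return path
        raise RuntimeError("no healthy path left")

sender = DualPathSender()
balanced = [sender.pick_path() for _ in range(4)]   # alternates between fabrics
sender.fail("fabric-I")
recovered = [sender.pick_path() for _ in range(2)]  # only fabric-II remains
```

The same loop covers both slide bullets: in the healthy case it yields load balancing, and after a failure it degenerates into failover to the remaining fabric.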
9. Frame Rate Control
TCP/IP: rate control is triggered by packet loss (TCP Reno). [Graph: network bandwidth over time shows a slow-start ramp followed by a repeating congestion-avoidance sawtooth.] Packet loss causes significant performance degradation because of retransmission.
ExpEther: rate control is always performed by measuring network latency. [Graph: bandwidth ramps up during probing and then holds steady near link capacity during congestion avoidance.] Packet loss basically does not occur in ExpEther: the ExpEther engine continuously measures the frame arrival time on the receive side and finely adjusts the frame rate to avoid packet loss.
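The slides describe the principle (react to latency, not loss) but not the actual control law. The sketch below is an illustrative delay-based controller in that spirit; the thresholds, gains, and API are assumptions, not NEC's algorithm.

```python
class DelayBasedRateControl:
    """Illustrative delay-based pacing (not NEC's actual algorithm).

    Instead of waiting for packet loss like TCP Reno, the sender watches
    the frame-arrival latency reported by the receive side: latency above
    the empty-path baseline means a queue is building somewhere, so the
    sender backs off *before* any frame is dropped.
    """

    def __init__(self, base_latency_us, rate_mbps=100.0):
        self.base = base_latency_us   # latency of an uncongested path
        self.rate = rate_mbps

    def on_latency_sample(self, latency_us):
        if latency_us <= self.base * 1.1:
            self.rate *= 1.05         # path looks empty: probe for more bandwidth
        else:
            self.rate *= 0.8          # queueing detected: multiplicative back-off

rc = DelayBasedRateControl(base_latency_us=10.0)
rc.on_latency_sample(10.2)   # near baseline -> rate rises to 105.0
rc.on_latency_sample(25.0)   # queue building -> rate cut to 84.0
```

The key contrast with Reno is visible in the second sample: the rate is reduced while all frames are still arriving, so no retransmission penalty is ever paid.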
10. Multi-path IO with Resource Disaggregation or ExpEther
• Multi-Path IO (MPIO): MPIO is one technique for achieving high reliability. If the target IO device supports MPIO, it continues to support MPIO under ExpEther.
• Multi-Path Ethernet: supports high-speed network path failover.
[Diagram: a host with two active SAS HBAs attached directly to a SAS JBOD is equivalent to a host whose EE NIC reaches the same SAS HBAs and JBOD through two redundant Ethernet switches, combining MPIO with high-speed network failover.]
11. Dynamic Reconfiguration and Hot-Plug Capability
[Diagram: hosts and ten IO devices (A through J) attach to a shared Ethernet fabric through ExpEther engines, supervised by an EE Manager. Each engine carries a group number; the logical view shows each host with its own virtual PCIe switch holding exactly the devices in its group, e.g. Group #1 = {B, D, G, I}, Group #2 = {A, J}, Group #3 = {C, E, H}, Group #4 = {F}.]
12. Dynamic Reconfiguration and Hot-Plug Capability
• Group ID (GID: 1~4,095)
• A GID in the range 1 to 15 is set by a physical DIP switch residing on the card.
• Setting the GID to 0 allows the management software to program a soft GID.
ExpEther Manager:
• Configuration: Group ID configuration
• Monitoring: ExpEther network status, PCIe device status, new ExpEther detection, failure detection
Management frames are special Ethernet frames; the ExpEther hard-wired logic directly receives and sends them for configuration and management.
[Diagram: a management server programs Group IDs into the ExpEther engines of the hosts and IO devices and collects various status information over the same fabric.]
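The grouping rules above can be condensed into a small model. This is a toy sketch of the policy only (names and API invented for illustration): a non-zero DIP switch wins, a DIP switch at 0 accepts a software-programmed GID, and engines sharing a GID form one logical PCIe system.

```python
class ExpEtherManager:
    """Toy model of GID-based grouping (names and API are illustrative).

    A host only sees engines carrying its own Group ID, so re-assigning a
    soft GID is equivalent to hot-unplugging a device from one host's
    PCIe tree and hot-plugging it into another's.
    """

    def __init__(self):
        self.gid = {}                        # engine name -> effective GID

    def assign(self, engine, dip_switch, soft_gid=None):
        if dip_switch != 0:                  # hardware DIP switch (1-15) wins
            assert 1 <= dip_switch <= 15
            self.gid[engine] = dip_switch
        elif soft_gid is not None:           # DIP at 0: soft GID may be programmed
            assert 1 <= soft_gid <= 4095
            self.gid[engine] = soft_gid

    def group(self, gid):
        """All engines (host and IO) that form one logical PCIe system."""
        return sorted(e for e, g in self.gid.items() if g == gid)

mgr = ExpEtherManager()
mgr.assign("host-A", dip_switch=0, soft_gid=100)
mgr.assign("gpu-1", dip_switch=0, soft_gid=100)   # gpu-1 appears under host-A
mgr.assign("ssd-1", dip_switch=3)                 # hard-wired to group 3
members = mgr.group(100)
```

Moving `gpu-1` to another host would be a single `assign` call with a new soft GID; from the hosts' point of view it looks like an ordinary PCIe hot-unplug followed by a hot-plug.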
13. ExpEther Technology Architectural Possibilities
▐ Std-EE: Standard PCIe-over-Ethernet
• The foundation of ExpEther
▐ MR-EE: I/O sharing
• Multiple hosts can share an IO device by using an SR-IOV compliant device
▐ P2P-EE: I/O direct connection
• Supports peer-to-peer data transfer between I/O devices
▐ NTB-EE: Remote direct memory access by NTB
• High-speed data transfer between hosts
[Diagrams: Std-EE partitions hosts and I/O devices across an Ethernet switch; MR-EE lets several hosts share one SR-IOV device; P2P-EE routes data directly between two I/O devices without traversing the host; NTB-EE links multiple hosts through non-transparent bridges over Ethernet.]
15. Performance of EE vs Local with PCIe based SSDs

name/ssd                 | 1         | 2         | 4
local                    | 2728448.0 | 5133619.2 | 10321510.4
ExpEther(HBA1)           | 2728584.5 | 5004185.6 | 6648012.8
ExpEther(HBA2)           | -         | -         | 9974886.4
ExpEther(HBA1)/local (%) | 100.01    | 97.48     | 64.41
ExpEther(HBA2)/local (%) | -         | -         | 96.64
Theoretical Value        | 2700000   | 5400000   | 10800000

name/ssd                 | 1         | 2         | 4
local                    | 1032396.8 | 2044231.7 | 3913407.6
ExpEther(HBA1)           | 1035468.8 | 2049361.9 | 3870552.8
ExpEther(HBA2)           | -         | -         | 3901378.2
ExpEther(HBA1)/local (%) | 100.30    | 100.25    | 98.90
ExpEther(HBA2)/local (%) | -         | -         | 99.69
Theoretical Value        | 1080000   | 2160000   | 4320000

There is no impact on bandwidth in ExpEther, which can fully support PCIe Gen3 x8 (64 Gbps).
16. Performance of EE vs Local with PCIe based SSDs

name/ssd                 | 1         | 2         | 4
local                    | 455913    | 911963    | 1823617
ExpEther(HBA1)           | 455984    | 912167    | 1224985
ExpEther(HBA2)           | -         | -         | 1823856
ExpEther(HBA1)/local (%) | 100.02    | 100.02    | 67.17
ExpEther(HBA2)/local (%) | -         | -         | 100.01
Theoretical Value        | 450000.00 | 900000.00 | 1800000.00

name/ssd                 | 1         | 2         | 4
local                    | 65470     | 129356    | 259631
ExpEther(HBA1)           | 65365     | 128806    | 259631
ExpEther(HBA2)           | -         | -         | 259838
ExpEther(HBA1)/local (%) | 99.84     | 99.57     | 100.00
ExpEther(HBA2)/local (%) | -         | -         | 100.08
Theoretical Value        | 75000.00  | 150000.00 | 300000.00

ExpEther can achieve similar IOPS to local by increasing the IO depth parameter to hide the latency of Ethernet.
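The closing observation, that a deeper IO queue hides the Ethernet latency, is an instance of Little's law: sustained IOPS is roughly the number of outstanding requests divided by the per-request latency. A back-of-the-envelope sketch, with illustrative latency figures not taken from the measurement:

```python
def achievable_iops(queue_depth, latency_s):
    """Little's law: sustained throughput = outstanding requests / latency."""
    return queue_depth / latency_s

local_latency = 100e-6   # illustrative local NVMe latency (100 us)
ee_latency = 120e-6      # assumed extra ~20 us for the Ethernet hops

# Same queue depth: the added latency costs throughput...
iops_local = achievable_iops(32, local_latency)   # 320,000 IOPS
iops_ee = achievable_iops(32, ee_latency)         # ~266,667 IOPS
# ...but a slightly deeper queue hides the latency entirely:
iops_ee_deep = achievable_iops(39, ee_latency)    # 325,000 IOPS
```

This is why the tables above show near-100% ratios once the workload keeps enough requests in flight, while shallow-queue runs pay the latency directly.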
17. Service Acceleration Platform with RD or ExpEther
Accelerator Resource Pool: IO devices can be dynamically allocated to the appropriate host according to workload.
[Diagram: EE clients (KVM over USB/VGA) and host CPUs with ExpEther HBAs connect through Ethernet switches to pools of remote IO: racks of GPGPUs, FPGA accelerators, and NVMe SSDs behind ExpEther engines, plus sensors and USB controllers at the edge.]
18. Case: Resource Pool System for HPC (Osaka University)
▌64 servers and 70 IO devices for research at Osaka University
GPUs, Flash storage, and VDI accelerators (PCoIP, GRID K2) are available as IO devices.
The IO devices are dynamically connected to the servers through 10G ExpEther in accordance with each server's workload; a server system is configured according to user requirements by software provisioning.
[Diagram: six racks, each with ten servers and a top-of-rack switch, share pools of GPUs, SAS JBODs with SAS controllers, PCIe Flash, and NICs; software provisioning composes a server from CPU, GPUs, HDDs, and Flash on demand.]
19. Case: Easy Extension of Measurement Equipment (PXI)
PXI (PCI eXtensions for Instrumentation) is one of several modular electronic instrumentation platforms based on PCIe.
Current PXI products are typically extended by PCIe cable, so the measurement system is fixed and the installation location is very limited.
If an ExpEther engine is implemented in the PXI chassis, the system can host a large number of PXI modules and be configured dynamically: an Ethernet switch and optical cable can place modules more than a mile away, e.g. in a different room, and the ExpEther Manager software assigns an ID to each ExpEther module.
20. Case: Ultra-Fast Failover Recovery for a Database System with EE and ExpressCluster X
NVMe SSD is faster than Fibre Channel, so the NVMe SSD (attached via EE) is used as the journal for the DB, while the main DB stays on the FC SAN.
When the active server fails, the NVMe SSDs' connection is switched, allowing the DB journal to be restored on the standby server.
[Diagram: the legacy configuration pairs active and standby servers over FC to the main DB; the new configuration adds an EE40G board in each server, a 40G switch, an EE40G I/O expansion unit holding the NVMe journal SSDs, and EEM management on both primary and secondary servers.]
21. Living at the Edge for going Real-Time with ExpEther
IoT layers: L1 Device/Sensor (smart devices, data collection, local network), L3 Edge (abstraction and real-time processing), L5 Cloud (analytics, wide-area network).
• Cloud (L5): rack-scale resource pooling with dynamic reconfiguration allows low-cost, low-power, high-performance computing data centers, turning collected data into actionable information.
• Device (L1): ExpEther can connect devices directly to edge servers using a simple everything-in-hardware approach, with no complex software protocol stack for communication, which is high-speed and low-power, making devices smarter.
• Edge (L3): ExpEther helps bring analytics to the edge; in combination with low-power, high-performance hardware like FPGAs, one can achieve the abstraction required for real-time processing and real-time feedback.
23. Summary
• The EE or resource-disaggregated system enables next-generation computer hardware architectures thanks to the following features:
• Distance and reach, with dynamic switching capability.
• The same or similar performance for local and remotely located IOs.
• Moving in-chassis devices outside with plug-and-play ability (independent of OS, drivers, and applications).
• Making legacy devices useful, for cost-effective system realization.
• A resource-disaggregated system built on long-proven protocols like PCIe and Ethernet (EE) is simple, yet a revolutionary step towards next-generation computer hardware architectures, with the trust earned by both legacies.
26. Business Menu
• Product Sales Business
• Sales of the product, which was developed as an option for Express servers
• FPGA IP Core License Business
• Development of an FPGA IP core with ExpEther technology according to customer requirements, and release of the binary image file