Journal of Modern Computer Networks
Cloud Based Datacenter Network Acceleration
Using FPGA for Data-Offloading
Kennedy Chinedu Okafor (a,1,2), V. C. Chijindu (b,1), G. C. Ononiwu (a,1), and O. C. Nosiri (a,1)
(a) Dept. of Electrical and Electronics Engineering, Federal University of Technology Owerri, Nigeria; (b) Dept. of Electronic Engineering, University of Nigeria, Nsukka, Nigeria
Currently, the high-performance processors in the Spine-Leaf, Mesh, and Router layer-3 (SLMR-3) backend server domain have multiple cores, but data offloading from the processor to the peripheral is not keeping pace with the Quality of Service (QoS) needed to balance the workload on a Warehouse Scaled Computer (WSC) running a developed Enterprise Energy Tracking Analytic Cloud Portal (EETACP) data center network. A high-speed, low-latency interconnect between the processors and a Field Programmable Gate Array (FPGA) is critical for achieving performance benefits in EETACP deployment. Most servers in WSC architectures run at modest average utilization rates, well below their peak processing capacity. These servers are good candidates for FPGA co-processors in cloud-based data centers owing to their acceleration coherency. This paper makes a case for cloud-based FPGA support for EETACP. An FPGA-based Spine-Leaf model is proposed as an alternative to traditional network models for EETACP provisioning. The paper analyzes reconfigurable FPGAs and characterizes a simplified process model for a hyperscale FPGA cloud design description. To validate performance, comparisons were made with two similar networks, DCell and BCube, for enterprise application deployments. It was concluded that FPGA-based DCN acceleration for EETACP offers acceptable QoS.
FPGA System on Chip, Cloud Computing, Virtualization, VHDL, Net-
work Optimization, Quality of Service
1. Introduction
Cloud datacenters are designed and built for various high-
performance computing services such as office collaborative
tools (e.g., Microsoft Office 365, Google Drive), search engines (e.g., Bing, Google), global stock market analysis, entertainment (sports broadcasting, news mining, games, etc.), mechatronics integrations, and other scientific workloads [1, 2]. Today, the servers in these datacenters are interconnected using either the Spine-Leaf, Mesh, or Routed Layer-3 model (SLMR-3) [3]. Cloud application datacenter networks are large and usually connect hundreds of thousands of servers via their layer-3 switch fabrics. A good data offloading strategy in the cloud datacenter network is critical to ensure that servers, switches, routers, load-balancers, and their applications do not encounter crippling bandwidth bottlenecks due to utilization and over-subscription. This helps to isolate services from each other and allows more flexibility in workload placement, rather than having to restrict workloads to where bandwidth is available.
Besides, due to the rise of cloud computing integrations, low
latency and high throughput datacenter networking (DCN)
is now an important area of research. The use of the current SLMR-3 in a typical EETACP deployment context is novel. With cloud provisioning, the beneficial role of FPGA components in datacenter acceleration vis-à-vis SLMR-3 becomes an interesting and timely subject. Several contributions and studies on DCN have not considered datacenter acceleration for Quality of Service (QoS) improvement. For instance, topology design and routing are the focus in [4–8]. Architectural tiers are the emphasis in [3], [9]. Flow scheduling and congestion control are the consideration in [10], [11], [12]. Virtualization is the focus in [13], [14], while application support is the focus in [15], [16]. In all these studies, little attention has been given to QoS performance using FPGA service processing cores. Since cloud-based DCN is a relatively new exploration area in high-performance networks, many of the designs discussed in [5], [6], [8], [11], [14], and [16] do not investigate DCN acceleration. Using a Spine-Leaf FPGA network model has many benefits for high-performance market segments. According to [17] and [18], a 4-way Layer-3 Leaf/Spine architecture with Equal Cost Multi-Path (ECMP) processing for routing and other computing services is the new template for high-performance datacenter network designs. In a cloud-based scenario, the type of switch or even the server processor cores can contribute to congestion delays regardless of the data offloading strategy. For example, Portland [8], BCube, and Quantized Congestion Notification (QCN) [11] use rate-based congestion control, which is not efficient. Hence, the current Ethernet switches, IP routers, servers, etc., found in existing datacenter architectures cannot be used to implement high-performance datacenter designs.
A high-end FPGA System on Chip (FSoC) could be employed for data offloading, leading to improved QoS for enterprise applications. There are two types, namely the Static Random Access Memory (SRAM) and the Antifuse versions. These are semiconductor devices built on a matrix of Configurable Logic Blocks (CLBs) connected via programmable interconnects [19].
By construction, FPGAs are efficient at executing a predictable workload. Given that datacenter workloads require high computational capability, energy efficiency, and low cost, a legacy commodity server cannot satisfy these demands. As such, an FSoC can be reprogrammed to offer flexible acceleration of workloads.
The authors declare no conflict of interest.
Academic Editor: Dr Mohamad Yusof Darus, Senior Lecturer, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Malaysia
1 All authors contributed equally to this work.
2 Corresponding author email: kennedy.okafor@futo.edu.ng
To date, many cloud datacenters have not deployed FSoCs as compute accelerators. Hence, to implement efficient cloud DCN designs, rich programmability is required in the cloud DCN service processors, besides the role of Type-1 bare-metal virtualization [20].
There are two approaches in this regard, namely pure software-based programmability [21], [22] and FPGA-based programmability such as NetFPGA [23]. Software-based systems can provide full programmability while delivering a reasonable packet forwarding rate, but their performance is still not comparable to commodity switch and server Application Specific Integrated Circuits (ASICs). The batch processing used in existing server switches and software-based switches yields optimizations that introduce high latency. This is critical for various control plane functions such as signaling and congestion control [6], [8], [11] in high-performance networks.
For bandwidth-intensive applications, FPGAs can be designed for low latency, which offers higher value for cloud computing processes. Since FPGA-based systems are fully programmable [24], a datacenter backend can be optimized through in-circuit reconfiguration at power-up to support more functions and achieve seamless data offloading. Hence, the latest trend in server performance is the data-offloading paradigm. It involves pairing an x86 processor with a highly customizable FPGA device architecture. With this method, workload performance can be enhanced while accommodating changing needs in the future. Clearly, a data-offloading FSoC will improve the throughput of cloud-based Software as a Service (SaaS) by co-processing with a commodity CPU. This same concept can accelerate cloud database searches for improved performance. The major trade-off for acceleration (cloud workload offloading in this case) is that frequent or repetitive tasks or task sequences will affect power demand.
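To make the pairing concrete, the following is a minimal C++ sketch of how an application-level dispatcher might decide between the FPGA co-processor and the commodity CPU. The class name, the queue-depth parameter, and the fallback policy are illustrative assumptions rather than any vendor interface; a production system would drive the card through its PCIe/DMA driver, which the two std::function kernels stand in for here.

// Hypothetical offload dispatcher: not the paper's implementation.
#include <cstdint>
#include <functional>
#include <vector>

using Block  = std::vector<uint8_t>;
using Kernel = std::function<Block(const Block&)>;

class FpgaOffloader {
public:
    FpgaOffloader(Kernel fpga, Kernel cpu, std::size_t depth)
        : fpga_(std::move(fpga)), cpu_(std::move(cpu)), depth_(depth) {}

    // Offload when the accelerator has a free slot; otherwise stay on the
    // commodity CPU so the workload never blocks behind the co-processor.
    Block process(const Block& in) {
        if (in_flight_ < depth_) {
            ++in_flight_;            // stands in for queuing a DMA descriptor
            Block out = fpga_(in);   // stands in for the hardware pipeline
            --in_flight_;            // in a real system a completion interrupt
                                     // would decrement this asynchronously
            return out;
        }
        return cpu_(in);             // accelerator saturated: x86 fallback
    }

private:
    Kernel fpga_, cpu_;
    std::size_t depth_;
    std::size_t in_flight_ = 0;
};

The fallback branch captures the power/throughput trade-off noted above: repetitive tasks stay on the accelerator only while capacity (and the power budget) permits.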
As far as this work is concerned, little research has been carried out in the literature investigating the QoS effects of cloud network servers, routers, etc., driven by FPGA cores. Hence, there is a need to explore FPGA target device architectures in developing DCCNs for cloud-based services such as the Enterprise Energy Tracking Analytic Cloud Portal (EETACP), e.g., databases, big data analytics, and high-performance computing.
2. Related Works
2.1 Cloud Datacenter Networking
Traditional datacenter network architectures such as DCCN [3], Portland [8], DCell [25], BCube [26], R-DCN [27], Helios [28], c-Through [29], etc., have been extensively studied. Most of them use a recursive scheme for scalability and performance, while others construct a separate optical network with an expensive high port-count 3D MEMS switch side by side with the existing datacenter to add core bandwidth on the fly. Some DCNs, like OmniSwitch, which is a modular datacenter network architecture, integrate small optical circuit switches with Ethernet switches to provide both topological flexibility and large-scale connectivity. These architectures can be re-modified using the enhanced Spine-Leaf, mesh, and router layer-3 (tier-2) models running on a low-latency FPGA core. This has not been used in server-centric application deployment strategies. The author in [30] highlighted issues affecting existing commercial off-the-shelf Ethernet switches for these architectures at high link speeds, such as 10 gigabits per second (Gbps). The challenges include:
(a) Extreme complexities, particularly the switch software,
wiring and scaled troubleshooting.
(b) Availability of various failure modes in the absence of
fail-over schemes.
(c) Existing large commercial switches and routers are expen-
sive.
(d) Some datacenters require high port density at the aggre-
gate or datacenter level switches at extremely high link
bandwidth.
(e) Other issues are over-subscription, microburst detection problems when using SNMP polling for TCP incast (i.e., many-to-one traffic patterns), high queuing latency, an absence of mobility support for virtual server infrastructure, poor scalability, and inflexibility resulting from legacy designs that have compatibility issues with automated virtualized datacenters.
Therefore, many researchers have continued to evolve datacenter network architectures, with most of them focusing on the novel design philosophy of Spine-Leaf, mesh, and router layer-3 models [31], [32].
The new trend in datacenter network models is to address the issues of optimal performance, such as low latency, availability/fault tolerance, utilization, energy efficiency, and scheduling of resources, regardless of the network device.
Regarding the architectural design framework, the most closely related work to this research is the Datacenter-in-a-Box at Low cost (DIABLO) FPGA cluster prototype in [30]. The authors discussed a novel, cost-efficient evaluation methodology in which FPGAs were used while treating datacenters as whole computers with tightly integrated hardware and software. The work enumerated three model types, viz.: (i) Server models: built on top of RAMP Gold (SPARC V8 ISA), running full Linux 3.5 with a fixed-CPI timing model; (ii) Switch models: based on circuit and packet switching with abstracted models focusing on switch buffer configurations; (iii) NIC models: having a scatter/gather DMA with zero-copy drivers as well as NAPI polling support.
In integrating the cloud DCN nodes to FPGA cores, Figure 1 illustrates a high-level structure. The system used six BEE3 boards carrying 24 Xilinx Virtex-5 FPGAs [30]. The simulation was realized with 3072 servers in ninety-six racks. The network switches were used at 8.4 billion instructions/second. The validation was done on a single-rack physical system with a sixteen-node cluster of 3 GHz Xeon servers and a 16-port Asante IntraCore 35516-T switch. The physical hardware setup had two servers plus 1 to 14 clients. The software configuration included server protocols (TCP/UDP), four server worker threads (the default), and eight simulated single-core servers with a 4 GHz fixed-CPI model.
Figure 2 shows a type 1 DIABLO without inter-board connections and a type 2 DIABLO fully connected with high-speed cables. Type 2 shares similar features with this work. With
Fig. 1. DIABLO cluster physical mapping [33]
FPGAs and the use of programmable hardware platforms, simplifying the load on cloud nodes and network devices will enhance performance. As such, a cloud of general-purpose resources (FPGAs) was used to offload the processed tasks.
Putnam et al. [34] described a reconfigurable fabric (FPGA Catapult) designed to balance several performance concerns. The system was embedded into each half-rack of 48 servers in the form of a small board with a medium-sized FPGA and local DRAM attached to each server. As depicted
in Figure 2, FPGAs are directly wired to each other in a 6x8
two-dimensional torus, allowing services to allocate groups
of FPGAs to provide the necessary area to implement the
desired functionality. The work was evaluated by offloading
a significant fraction of Microsoft Bing’s ranking stack onto
groups of eight FPGAs to support each instance of this service
[34]. Based on the performance expectations of the earlier proposed EETACP (a cloud application deployed on DCCN), the key goals for any datacenter architecture include [9]:
(a) Deterministic latency
(b) Redundancy/high availability
(c) Manageability/flexibility
(d) Excellent resource allocation and scheduling
(e) Scalability and fault tolerance
An improved network architecture based on an FPGA fabric is proposed to achieve the goals above. This model has been shown to outperform the Spine-Leaf, mesh, and layer 3-routed models owing to the performance characteristics of the device: it supports lower latency, offloading, seamless integration, and computing scalability. It is therefore worth outlining the advantages and disadvantages of the current Spine-Leaf, mesh, and Layer 3-routed network designs, as shown in Table 1.
In the EETACP DCCN [34], a low-latency, fault-tolerant network was achieved. In this case, the number of network tiers was reduced to minimize system latency. An FPGA-based fabric structure likewise simplifies management, reduces cost, and allows resilient, low-latency networks to be designed, just like the Spine-Leaf model. The robust architectural concepts supported in the DCCN architectures provide high availability and deterministic low latency, and can scale up or down with demand. EETACP was tightly integrated with the OmniVista™ 2500 Virtual Machine Manager (VMM), providing a unified platform for virtual machine visibility and provisioning with virtual network profiles across the network. These allow seamless server interfacing.
By introducing an FPGA cluster into the above architectures, the advantages in cloud datacenter networks (e.g., DCCN) include:
(a) Allows multi-chassis terminated link aggregation groups to be created.
(b) Creates a loop-free edge without Spanning Tree Protocol (STP).
(c) Provides node- and link-level redundancy, particularly with the Integrated Service OpenFlow load balancer.
(d) Enables the overall architecture to be geo-independent, i.e., no co-location is required.
(e) Active support for interconnect switches using standard 10G and 40G Ethernet optics.
(f) Supports redundancy and resiliency across the switches
connecting EETACP servers.
In Web-scale data centers, boosting performance with a common FPGA device architecture across thousands of servers will save cost. Besides, leveraging FPGAs for acceleration in Spine-Leaf models will improve dynamic over-allocation and change management for large-scale data centers, because enterprise tools must track the FPGA algorithm as it is updated. This is needed for enterprise adoption. With the availability of server virtualization, a hyperscale datacenter could use FPGA capabilities. This paper argues that new processor architectures based on a programmable FPGA device have several advantages for cloud service provisioning: they allow for scalability on demand and loosely coupled system designs. Joost and Salomon [35] showed that FPGAs are best suited for
Fig. 2. DIABLO cluster prototype with 6 BEE3 boards [30]
Table 1. Advantages and Disadvantages of the Current Spine-Leaf, Mesh, and Layer 3-Routed Network Designs

Spine-Leaf Model
Advantages: offers a layer 2/3 common fabric implementation; facilitates a simpler design; fewer interconnects; easy to scale within a boundary, with better latency transition.
Disadvantages: the additional layer of transit hops may impact latency and over-subscription; scalability is limited to the number of ports in the spine layer.

Mesh Model
Advantages: offers a layer 2/3 differentiated fabric implementation; the implementation is highly scalable; no transit hop; lower latency and lower over-subscription ratios.
Disadvantages: more links are used for interconnects.

Layer 3-Routed Model
Advantages: offers an end-to-end routed fabric implementation; easy to secure at the IP layer; fewer interconnects; easy to scale.
Disadvantages: highly oversubscribed architecture; the number of transit hops is not deterministic, impacting latency; complex design and maintenance.
tackling most industrial and network-based applications, such as supervisory control systems, cloud computing, the Internet of Things, and other grid computing services. FPGAs are shown to be very powerful, relatively inexpensive, and adaptable, because their configuration is specified in an abstract hardware description language. FPGA-based implementations combine many advantages, such as rapid development cycles, high flexibility, re-usability, moderate cost, easy upgrading (owing to the use of abstract Hardware Description Languages (HDLs)), and feature extension (as long as the FPGA is not exhausted). For the network elements in the cloud DCN, the FPGA cores on the servers, switches, and load balancers are managed by a management console in the form of a Software Defined Network (SDN) that separates the data, control, and application planes. In this context, when a switching policy is updated, the network is initially mapped in the design, thereby maintaining a default state and eliminating routine reprogramming of the FPGA logic cells. The use of FPGAs can also complement other chipset accelerators (e.g., GPUs), but at the expense of writing new procedures in VHDL. The issues of power consumption and area-on-chip are vital for performance, considering the number of FPGA cores needed in the network; this trade-off is left for future research.
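As an illustration only (not the implementation used in this work), the sketch below shows how a controller could update a switching policy by rewriting a match-action table exposed by the FPGA data plane, so that the default mapping is preserved and the logic cells are never reprogrammed. The table layout, field names, and C++ types are assumptions made for the example.

// Hypothetical match-action table; a real SDN data plane would expose this
// through the controller's southbound interface rather than a C++ class.
#include <cstdint>
#include <functional>
#include <unordered_map>

struct FlowKey {
    uint32_t dst_ip;
    uint16_t dst_port;
    bool operator==(const FlowKey& o) const {
        return dst_ip == o.dst_ip && dst_port == o.dst_port;
    }
};

struct FlowKeyHash {
    std::size_t operator()(const FlowKey& k) const {
        return std::hash<uint64_t>{}((uint64_t(k.dst_ip) << 16) | k.dst_port);
    }
};

class MatchActionTable {
public:
    // Control plane: install or replace a rule at runtime.
    void install(const FlowKey& key, uint16_t egress_port) {
        table_[key] = egress_port;
    }
    // Data plane: look up the egress port, falling back to the default state.
    uint16_t lookup(const FlowKey& key, uint16_t default_port) const {
        auto it = table_.find(key);
        return it != table_.end() ? it->second : default_port;
    }
private:
    std::unordered_map<FlowKey, uint16_t, FlowKeyHash> table_;
};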
3. Methodology
In this section, the FPGA modular description is presented. A characterization scenario was used as a basis for generalization. To achieve this, an Electronic Design Automation (EDA) simulation tool (Riverbed Modeler) with an extended C++ library was employed in this study. Due consideration was given to an FPGA Virtex UltraScale-driven server machine. This was used for the Spine-Leaf DCCN design as it offers efficient performance, good system integration, and bandwidth, with the added benefit of re-programmability. In the enterprise setup, the peripheral controllers include general-purpose I/O, UART, Timer, Debug, SPI, DMA controller, and Ethernet (interface to an external MAC/PHY chip). The memory controllers include SRAM, Flash, SDRAM, and DDR SDRAM.
In context, the scalability of the Virtex UltraScale VU440 device is made possible by its ASIC-class architecture, supporting up to 90% utilization and featuring next-generation routing, ASIC-like clocking, efficient resource utilization, power management, elimination of interconnect bottlenecks, and critical-path optimizations. Its key architectural blocks include wider multipliers, high-speed memory cascading, 33G-capable transceivers, and integrated 100 Gb/s Ethernet MAC and 150 Gb/s IP cores. These devices enable multi-hundred gigabit per second levels of system performance with smart processing at full line rates. Figure 3 shows a proof of concept demonstrating the initial testbed setup for EETACP deployment. The configuration facilitates the dual-homing of servers/storage and access devices with links distributed across the DCCN switches. There is no logical loop between the edge devices and multi-chassis peer switches, even though a physical loop exists. Single-interface servers, storage, and edge devices can be connected to any DCCN switch via a virtualization management console. The setup is based on a general-purpose processor. Using Type-1 bare-metal virtualization offers VM instances that support failover, replication, and redundancy for a production environment. The assumption in this research is that the FPGA concept as well as Type-1 bare-metal virtualization must be integrated in the case of a myriad of servers, i.e., a massively scaled datacenter, to derive the expected QoS.
An FPGA scalable architecture [36] offers a template for adoption in DCCN. Specifically, a Xilinx FPGA comparison showing an optimal configuration for the Virtex UltraScale device has been enumerated in [19]. In that work, the logic cells (K), UltraRAM (Mb), Block RAM (Mb), DSP slices, transceiver count, maximum transceiver speed (Gb/s), total transceiver bandwidth (full duplex, Gb/s), memory interfaces (DDR3, DDR4), PCI Express, AES configuration, I/O pins, and I/O voltages were all compared against other device architecture variants, showing the Virtex UltraScale device as the preferred choice. This further motivated its adoption in the proposed DCN design in this section. The FPGA-based system implementation has the following characteristics:
(a) It allows for the integration of soft-core processors.
(b) It has plenty of logic resources for routing.
(c) It has plenty of RAM support. This, combined with the lack of a bypass path, led to a multi-threaded design of large modules.
In the validation analysis, this work focuses on FPGA-based datacenters for performance benchmarking. It must be stated that congestion offloading is derived through the use of the over-allocation considered in Figure 3. In this paper, the prototype design of the cloud-based datacenter has only been tested with a very small testbed running realistic micro-benchmarks for cloud computing services. The emphasis is on the QoS comparison with related datacenter cores. The role of Type-1 virtualization as a DCN accelerator is presented in Section 3.1.
3.1 Architectural Model (Type-1 Virtualization)
The goal of the FPGA-based network server model (DCCN) is to have credible workload generation that is scalable and efficient with respect to QoS for a congested traffic pool. A highly accurate framework for the cloud computing workload was developed, as shown in Figure 4. At the core, the server clusters must be capable of running complex server application software with minimal modification. In this context, an FPGA service model is responsible for executing the target procedure (router, switch, or server CPU) correctly, as well as maintaining the device architectural state in congested networks. This is made feasible by using the Type-1 virtualization strategy. The benefits of this management scheme include:
(a) Simplified mapping of the functional FPGA model. The
separation allows complex operations to take multiple
host cycles. For example, a highly-ported register file can
be mapped to a block RAM and accessed in multiple host
cycles, avoiding a large, slow mapping to FPGA registers,
multiplexers, etc.
(b) Improved flexibility and reuse of resources, even in over-allocation mode. With it, the precise server timing model can be changed without modifying the overall network model, which improves efficiency. For instance, it is
Fig. 3. DCCN EETACP server testbed (Kswitche Labs, 2015)
possible to use the same VM switch model to simulate both 10 Gbps and 100 Gbps switches just by changing the timing model.
(c) It enables a highly configurable, abstracted timing model. In the virtualized datacenter, splitting out the timing function allows the timing model to provide abstraction at the cloud layer. Looking closely at the FPGA characteristics for network architectures, this work identified a wide variety of design choices such as switch architecture, network topology, protocols, and applications.
To support data-intensive processing in an FPGA-based domain, the traffic workloads must be optimized in the cloud environment. As such, optimization via data management must be satisfied for enhanced QoS.
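The functional/timing split described in benefits (b) and (c) can be sketched as follows; the class names and the link rates are illustrative assumptions, and the only point is that one functional switch model can be reused while the timing model is swapped between a 10 Gb/s and a 100 Gb/s device.

// Illustrative sketch of the functional/timing separation, not the paper's code.
#include <cstddef>
#include <memory>

struct TimingModel {
    virtual ~TimingModel() = default;
    virtual double service_time_us(std::size_t frame_bytes) const = 0;
};

struct LinkRateTiming : TimingModel {
    explicit LinkRateTiming(double gbps) : gbps_(gbps) {}
    double service_time_us(std::size_t frame_bytes) const override {
        return frame_bytes * 8.0 / (gbps_ * 1e3);   // bits / (Gb/s) -> microseconds
    }
    double gbps_;
};

class SwitchModel {                       // functional behaviour is unchanged
public:
    explicit SwitchModel(std::unique_ptr<TimingModel> t) : timing_(std::move(t)) {}
    double forward(std::size_t frame_bytes) const {
        return timing_->service_time_us(frame_bytes);
    }
private:
    std::unique_ptr<TimingModel> timing_;
};

// Usage: SwitchModel ten(std::make_unique<LinkRateTiming>(10.0));
//        SwitchModel hundred(std::make_unique<LinkRateTiming>(100.0));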
3.2 FPGA Cloud Datacenter Specifications
This paper used the specification of the cloud datacenter network described in [37]; however, Type-1 server virtualization is considered for resource management in an FPGA-driven DCCN. The network fabric has an OpenFlow load balancer, a virtual gateway, and server instances on the hypervisor. In the network, a three-stage Clos topology using Nexus 7000 (spine) and Nexus 3000 (leaf) platforms, with FPGA-based N-Servers connected to them, forms a warehouse-scale cloud datacenter running on 10-40 Gbps links. These specifications are encapsulated in Figure 4.
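For clarity, the wiring implied by this specification can be sketched as below, where every leaf switch uplinks to every spine switch and the FPGA-based servers attach only to leaves. The switch and server counts are illustrative assumptions, not the actual testbed sizing.

// Toy enumeration of a leaf-spine (three-stage Clos) wiring plan.
#include <cstdio>
#include <string>
#include <utility>
#include <vector>

int main() {
    const int spines = 4, leaves = 8, servers_per_leaf = 16;   // assumed counts
    std::vector<std::pair<std::string, std::string>> links;

    for (int s = 0; s < spines; ++s)
        for (int l = 0; l < leaves; ++l)            // full bipartite spine-leaf mesh
            links.emplace_back("spine" + std::to_string(s),
                               "leaf" + std::to_string(l));

    for (int l = 0; l < leaves; ++l)
        for (int h = 0; h < servers_per_leaf; ++h)  // FPGA-backed servers on leaves
            links.emplace_back("leaf" + std::to_string(l),
                               "server" + std::to_string(l * servers_per_leaf + h));

    std::printf("links: %zu\n", links.size());      // 4*8 + 8*16 = 160
    return 0;
}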
3.3 Hyper-Scale Cloud Cluster Server (HCCS)
The FPGA card used in the Spine-Leaf cloud network server (shown in Figure 4) is depicted in Figure 5. This is based on Xilinx Virtex UltraScale FPGA technology (i.e., the target device). The characterization in the HCCS is mainly for the Spine-Leaf DCCN.
For data offloading at the server core, this prototype FPGA accelerator card has six Virtex-6 FPGAs linked together by a PCI-Express switch from PLX Technology. Three of them are fitted into a Supermicro SuperServer designed to accommodate three Tesla GPU coprocessors in the DCCN. This has a pair of six-core Xeon 5600-class processors, as shown in Figure 5. The processor core is depicted in Figure 3, while Figure 6 shows the logical placement. In this case, the server machine has 24 half-wide grid sockets. This pattern allows the x86 server processors (grouped into two) to fit into the testbed rack enclosure. On the server, the FPGA co-processor in Figure 6 has eight lanes mapping to Mini-SAS xSFF-8088 connectors, with two ports on each FPGA card. This speeds up data cycling and improves the utilization cycles of the CPU.
The server has enough space for a PCI-Express 3.0 peripheral card situated at the back of the server sled. It has two eight-core Xeon CPUs running at 2.1 GHz with 64 GB of main memory (DRAM). For storage capacity, four 2 TB disk drives (4 HDDs) and two 512 GB solid-state disks (2 SSDs) were introduced. The server node has a 10 Gb/s Ethernet port with redundant power supplies. Wireless connectivity via the bay ports is enabled by default in the DCCN. The DCCN server FSoC accelerator card, configured in a production setup, is distributed across the server cluster infrastructure. In the deployment context for HCCS-DCCN, two sets of cables are used to implement a ring connection with six xSFF-8088 connectors; eight connectors are used on a ring for duplication/redundancy. With the six adapter cables, the six FPGA cards (in six adjacent server nodes in the server chassis) are mapped to each other with one set of Mini-SAS ports. This arrangement allows eight different groups of FSoC nodes in a 48-node pod to be self-linked using eight adapter cables. During operation, the FSoCs run at 10 Gb/s on all the Ethernet-connected interfaces. Figure 7 shows the Virtex UltraScale VU440 device used for the service processing cores. It provides the highest system performance and bandwidth for large-scale computing and is well suited to the typical server scenario in Figure 3.
3.4 HCCS FSoC Data-Offloading Algorithm
Algorithm I describes the server interconnection read and write operations with FPGA data offloading. First, after defining the server configuration with its virtualization mappings, a 10 Gbps link is used for the link interconnection in the cluster subnet. An array of user input jobs arriving through a load balancer Lm with nonzero terms is defined for the server. To facilitate read operations from the server, the control variables (a, N, i, j) are used to execute successive read operations in matrix form.
Fig. 4. Cloud Computing Spine-Leaf Cluster (DCCN)
Fig. 5. A Typical Cloud FPGA Accelerator Network Card [38]
Fig. 6. A modified logical interfacing in the DCCN subnet cluster [38]
Fig. 7. An FPGA based Virtex UltraScale VU440 device architecture for
cloud Server board
To complete successfully, the j control checks for equal availability of server CPUs and their VMs. The first step in job processing is to select the shortest path (i.e., the one with the highest throughput) between the user job request and the server VM in the HCCS. As the workload increases, more bandwidth is over-allocated by the hypervisor virtual machine monitor (VMM), which translates into increased throughput along the path. All processed workloads are returned through the shortest path to end-users, and the cycle re-initializes and repeats the read and write operations.
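A hedged sketch of this path-selection step is given below: among the candidate leaf-spine paths to a server VM, the one with the highest available throughput is chosen, and the hypervisor may over-allocate bandwidth beyond the nominal link rate as the workload grows. The structure names and the 1.2x over-allocation factor are assumptions made for illustration, not values taken from Algorithm I.

// Illustrative path selection: pick the candidate with the most headroom.
#include <string>
#include <vector>

struct Path {
    std::string via_spine;
    double capacity_gbps;    // nominal link capacity (e.g. 10 or 40)
    double allocated_gbps;   // bandwidth already handed out by the VMM
};

const Path* select_path(const std::vector<Path>& candidates,
                        double over_allocation = 1.2) {
    const Path* best = nullptr;
    double best_avail = 0.0;
    for (const auto& p : candidates) {
        // Over-allocation lets the effective ceiling exceed the nominal rate.
        double avail = p.capacity_gbps * over_allocation - p.allocated_gbps;
        if (avail > best_avail) { best_avail = avail; best = &p; }
    }
    return best;   // null if every candidate is exhausted
}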
Using Algorithm I, the EDA study was used to explore the capabilities of the Virtex UltraScale FSoC for DCCN data offloading. In the study, the RAMs on the FSoC are used to store the simulation thread state, which is dynamically switched between threads in order to keep the data pipelines saturated. This memory strategy is called HCCS host-multithreading for low-latency data offloading. Its benefits are summarized below.
• Availability of hard-wired DSP blocks with execution units, especially Floating Point Units (FPUs), which otherwise dominate Look-Up Table (LUT) resource consumption. The implication is that by mapping functional units to DSP blocks rather than just LUTs, more resources are reserved for execution timing.
• DRAM accesses are relatively fast on the FSoC. The logic in the FSoC often runs slower than DRAM because of on-chip routing delays. This insight greatly simplifies the host memory system, as large associative caches are not needed for high performance. The trade-off between QoS performance and FPGA compute resources is the overall server cost budget parameter.
4. Simulation Validation
4.1 Experimental Design Description
First, an FPGA server process model was built for the DCCN VM clusters. This was realized using Riverbed Modeler Academic Edition 17.5 with its C++ libraries1 as the EDA tool. The implementation was on a heavily modified host-cache design. The server model supports a full 32-bit OS. At the core, the Virtex UltraScale was emulated into the service processors shown in Figure 4. In the real setup (depicted in Figure 4), the components introduced include the server farm virtual
1 https://splash.riverbed.com/community/product-lines/steelcentral/university-support-center/blog/2014/06/11/riverbed-modeler-academic-edition-release
firewall router (SFV), an emulated OpenFlow controller (OC), and application and profile configuration windows. This test-center configuration sets up the Web, Database, FTP, and Exchange servers, such as DCCN server 1, server 2, server 3, server 4, server 5, ..., N, and six locations with active users. The system servers run on the Virtex UltraScale FPGA target device. With Type-1 virtualization, servers are placed on the DCCN as VM clusters. The VMs connect user tasks to the HCCS, which processes services concurrently. The application (an HTTP service) runs on the OpenFlow controller, whose job is to dispatch the requests to the server clusters.
This facilitates resource allocation, scheduling, and load balancing in the DCCN. The simulation experiments were performed on an emulated cloud, at the IaaS level, using datacenter cardinality theory. For the DCCN VM clusters, two physical servers (2x 8-core Xeon 2.1 GHz CPUs, 64 GB DRAM, 4 HDDs of 2 TB, 2 SSDs of 512 GB, 10 Gb Ethernet, running Linux and Mac OS) were configured to run the CPU model. The VM instances were created according to the workloads per site. For acceleration, Type-1 full/active virtualization, failover, and over-allocation were simultaneously enabled to address the issues highlighted in Section 2.1. The process model experimental methodology considered four key metrics: service process latency, throughput, resource availability, and resource utilization. The execution time is measured using the timer functions provided by the C++ trace file diagnostic library. The throughput is determined at the destination as the ratio between the amount of data sent from users and the service processing time. Finally, each metric is computed based on the Riverbed framework/simulation for DCCN, DCell, and BCube. Each QoS metric is reported in the plots discussed below.
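For completeness, the following sketch shows how the reported throughput and utilization figures can be derived from trace records of this kind; the record fields are assumptions made for the example and do not reflect the actual Riverbed trace schema.

// Illustrative metric derivation from assumed trace-file records.
#include <vector>

struct TraceRecord {
    double bytes_sent;        // payload delivered for one request
    double service_time_s;    // time from dispatch to completion
    double server_busy_s;     // time the serving VM was occupied
};

double avg_throughput_bps(const std::vector<TraceRecord>& t) {
    double bytes = 0.0, secs = 0.0;
    for (const auto& r : t) { bytes += r.bytes_sent; secs += r.service_time_s; }
    return secs > 0.0 ? bytes * 8.0 / secs : 0.0;     // data sent / processing time
}

double utilization(const std::vector<TraceRecord>& t, double window_s) {
    double busy = 0.0;
    for (const auto& r : t) busy += r.server_busy_s;
    return window_s > 0.0 ? busy / window_s : 0.0;    // fraction of the window busy
}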
4.2 Performance Evaluation
After setting up three distinct network scenarios, DCell and BCube alongside the FPGA-based DCCN, a focused discussion of their services as well as their performance was presented in a previous study [38]. The first scenario measures the improvement brought to I/O-intensive FPGA applications from a service process throughput perspective. By adaptively switching end users from the leaf to the spine models, the servers read and process requests concurrently. This occurs within the data center management, which replicates processed multicast jobs and transfers them in a pipelined fashion within the deployment. This paper presents a set of results obtained from a QoS comparison among the three networks using the remote cloud storage, which hosts services such as FTP, database/storage, etc.
4.3 Analysis of Non-FPGA Cloud DCNs
The experiment in context focused on the comparative analysis of three distinct DCNs, viz. the Spine-Leaf DCCN (proposed), DCell, and BCube, for network throughput, resource availability, and resource utilization. These networks were configured using a scenario-based approach, in which the cloud computing application workload is homogeneous. A suitable framework used to evaluate the impact of FPGA acceleration on the cloud datacenter is MapReduce [39]. This makes cloud-based computation flexible, though with its performance trade-offs in the cloud. This work used an emulated cached MapReduce engine [40] and a general-purpose workflow engine
Algorithm I: DCCN Server Read/Write Operations
Procedure FPGA_DataOffloading_ReadWrite // use the FPGA to carry out data offload via read and write operations
Define DCCN-Server I/O // a distributed cloud computing server must have well-defined inputs and outputs
Program ServerMatrix(Input, Output)
Const MaxS = S_(n+1); // the recursive server chain ensures that server redundancies are maintained
While j <= K do
  S.Vm = V_(m+1); // recursive server virtual instances for internal server resources (I/O, RAM, CPU, etc.)
  FPGA acceleration = shortestPathJobOffload // initialization
  Set link = 10 Gbps // interconnection links // initialization
  If Var = Var + 1 then Sort with FSoC // Var allocates memory spaces on the CPU for read/write operations, provided they are not used up by the CPU
    Var P_0, Q_1, R_1, N_(k+1): Array [0..MaxN, 0..MaxN] of real nonzero terms;
    a, N, i, j: integer; // a = security term; N, i, j = control loop variables
  end if
end while
Begin // read operation
Readln(N); // read user jobs from the CPU
While i <= j do
  For i := 0 to N-1 do for j := 0 to N-1 do read(P[i]); // implements the read job/task request
  For i := 0 to N-1 do for j := 0 to N-1 do read(Q[i]);
  For i := 0 to N-1 do for j := 0 to N-1 do read(R[i]);
  For i := 0 to N-1 do for j := 0 to N-1 do read(N_(k+1)[i]);
  For i := 0 to N-1 do for j := 0 to N-1 do r[i] := P[i] + Q[i] + R[i] + ... + N_(k+1)[i];
  For i := 0 to N-1 do for j := 0 to N do
    If j = S_(n+1) then
      // get the job request threads with maximum throughput
      While j := i+1 to N do
        If a[j] ≠ a[MinSec] then S_(n+1) ≠ 1
          DataOffload >= 0 // server CPU
        else
          Return;
        end if
      end while
    end if
  Transferjob.shortestpath = NextPath // the recursive CPU server chain ensures that workloads are transferred efficiently over the shortest path
end while
end procedure
Fig. 8. HCCS-DCCN Read/write Algorithm
Fig. 9. Throughput Stability Response
Fig. 10. Cloud Server Utilization Response
[41] to run trace-file statistics. For the three scenarios, the number of mappers (32 MB per job), the data size (1024 MB), and the number of reducers (3) were kept the same in all cases (a toy sketch of this split is given after this paragraph). From Figure 9, it was observed that the proposed FSoC-DCCN had a relatively better throughput with an optimal virtual instance allocation coordinator. In this regard, the average throughput stability responses for DCCN, DCell, and BCube are 40.00%, 33.33%, and 26.67%, respectively.
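The sketch below illustrates only the split arithmetic assumed above (a 1024 MB input divided into 32 MB map tasks feeding 3 reducers); it is a toy stand-in, not the emulated cached MapReduce engine cited in [40].

// Toy split/partition arithmetic for the assumed MapReduce configuration.
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t input_mb = 1024, split_mb = 32, reducers = 3;
    const std::size_t mappers = input_mb / split_mb;          // 32 map tasks

    // Each mapper's output goes to a reducer chosen by a simple hash partition.
    std::size_t per_reducer[reducers] = {0, 0, 0};
    for (std::size_t m = 0; m < mappers; ++m)
        per_reducer[m % reducers] += split_mb;

    for (std::size_t r = 0; r < reducers; ++r)
        std::printf("reducer %zu handles ~%zu MB\n", r, per_reducer[r]);
    return 0;
}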
With Type-1 virtualization of the Spine-Leaf DCCN server cores alongside the FSoC acceleration, relative performance gains are feasible. Scientific workflows running in large, geographically distributed, and highly dynamic computing environments can efficiently use the FSoC-DCCN, because FSoC-based platforms can effectively satisfy the throughput stability requirement in a production deployment. In Figure 10, resource availability refers to the ability to access the FSoC-DCCN server clusters on demand while completing job requests. The complexity of the cloud datacenter architecture and its overall infrastructure makes resource utilization another important parameter. It was observed that the proposed FSoC-DCCN offered better resource utilization (for the workloads) compared with the BCube and DCell scenarios.
When all existing resources in the FSoC-DCCN server clusters are used up by means of over-allocation, additional resources can be reserved for high-priority jobs that arrive. In this context, when a job arrives, the availability of the VM is guaranteed; the issue is the availability of resources to execute the job. If the VM is available, then the job is allowed to run on the VM via dynamic allocation, considering the network density. This occurs only for Type-1 virtualization on the cloud DCN Spine-Leaf model. It was shown that the proposed scheme had about 58.06% resource utilization (i.e., when logically isolated with FPGA device cores), while the others offered 38.71% (BCube) and 3.23% (DCell), respectively (i.e., when not logically isolated with FPGA cores). The implication is that FPGA-based DCCNs will offload tasks from server processors more frequently than other accelerator options, since cloud service processing rates are high. It also implies that the proposed model will offer fairly good resource availability, leading to enhanced performance. This makes it more attractive in hyperscale datacenters for Warehouse Scaled Computers (WSCs). Hence, VM-based cloud networks, particularly the cell-based and Spine-Leaf WSCs, can benefit from this advantage.
From the plots in Figures 9 and 10, network infrastructure that processes bandwidth-intensive applications will scale optimally with the FSoC. This is because a key potential benefit of the integrated processor and FPGA system is the ability to boost system performance by accelerating compute-intensive functions in FPGA logic (i.e., hardware acceleration and cache coherency) while making more resources available. The processor performance is improved by the FSoC co-processing roles, ranging from computing cyclic redundancy checks (CRCs) to offloading the entire TCP/IP stack. When the FPGA-based accelerator produces a new result, the data needs to be passed back to the processor as quickly as possible, so that the processor can update its view of the data. As a validation, a network case of 1632 servers with FPGAs running an enterprise web search service was analyzed in Figure 11, which shows improved throughput with FPGA acceleration compared with the case without FPGA acceleration, in terms of query latency responses.
Another key benefit of integrating a General Purpose Processor (GPP) and an FPGA on a single piece of real estate is the ability to accelerate system performance by offloading critical functions to the FPGA. Transferring the data quickly and coherently is key to realizing the performance boost in cloud-based networks. Datacenter network optimization with FPGA acceleration improves bandwidth efficiency while satisfying QoS metrics. Network equipment embedded with FPGA processors would eliminate various performance bottlenecks that software-driven processors cannot overcome. Smart computing and intelligence applications with massive workloads will benefit from this alternative.
5. Conclusion
This paper has presented a super-scalar cloud datacenter network built with FPGA core support. It offers excellent throughput, low latency, and good resource utilization when compared with the DCell and BCube datacenter networks. Hence, offloading key functions from the processor to the FPGA can result in substantial improvements in system performance while reducing system power drain. As observed in existing Warehouse Scaled Computers (WSCs), high-speed, low-latency interconnects between the processors and the FSoC are
Fig. 11. FPGA query latency behavior [38]
necessary for optimal performance. The proposed datacenter network offers memory coherency through the use of FPGA acceleration coherency. With this, issues of bandwidth, performance, integration, and power requirements are addressed. In highly dynamic environments, the performance of various types of computing workloads (such as databases, big data analytics, and high-performance computing) can be improved using the proposed FPGA acceleration in the Spine-Leaf datacenter model. As more and more workloads are deployed in the cloud, it is appropriate to consider how to make FPGAs and their capabilities available in the cloud. Hence, the proposed system offers a low-latency path from the network interface to the consuming process, irrespective of network workloads. As a proof of concept and validation, a micro testbed setup on a real-life datacenter was explored. The work used DCCN to model a datacenter Spine-Leaf architecture running traffic patterns sampled from the Riverbed application engine on top of Linux KVM and the Virtex UltraScale FPGA target device. This enables isolation between multiple processes in multiple VMs, along with accurate acceleration, resource allocation, and priority-based workload scheduling for QoS. The results from the FPGA DCCN offloading strategy in Spine-Leaf designs show that Type-1 virtualization influences resource allocation and scheduling. With FPGA acceleration, the performance of cloud computing systems, particularly in QoS contexts, is enhanced. Consequently, newer processors can use FPGAs to accelerate applications (workload optimization). Furthermore, with WSCs (FPGA-based servers), the Central Processing Unit (CPU) of Spine-Leaf topologies can easily offload tasks to FPGA device architectures for hardware acceleration. The conclusion is that the global deployment of FPGA-based cloud datacenters will enable large-scale scientific workflows to improve performance and deliver fast responses regarding QoS. Future work will focus on mathematical modeling and state analysis of Markovian queues with working vacations on heterogeneous FPGA cloud-based servers. The work will also investigate power drain in high-density networks, the chip area, and comparisons with GPUs and other accelerators.
ACKNOWLEDGMENTS. We wish to specially thank the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia; the Department of Electronic Engineering, UNN; the Center for Basic Space Science, UNN; the Energy Commission of Nigeria, NCERD-UNN; and the National Agency for Science and Engineering Infrastructure (NASENI) for their immense support in the course of this research work.
References
1. Microsoft (2016) Azure successful stories, online :
http://www.windowsazure.com/en-us/case-studies/archive/.
2. Lu G et al. (2011) Serverswitch: A programmable and high
performance platform for data center networks in In NSDI’11
Proc. of 8th USENIX conference on Networked systems design
and implementation.
3. Okafor K, Ugwoke F, Obayi AA, Chijindu V, Oparaku O
(2016) Analysis of cloud network management using resource
allocation and task scheduling services. International Journal
of Advanced Computer Science & Applications 1(7):375–386.
4. Guo C et al. (2008) Dcell: a scalable and fault-tolerant net-
work structure for data centers. ACM SIGCOMM Computer
Communication Review 38(4):75–86.
5. Al-Fares M, Loukissas A, Vahdat A (2008) A scalable, com-
modity data center network architecture. SIGCOMM Comput.
Commun. Rev. 38(4):63–74.
6. Guo C et al. (2009) Bcube: a high performance, server-centric
network architecture for modular data centers. ACM SIG-
COMM Computer Communication Review 39(4):63–74.
7. Greenberg A et al. (2009) Vl2: a scalable and flexible data
center network. ACM SIGCOMM computer communication
review 39(4):51–62.
8. Niranjan Mysore R et al. (2009) Portland: A scalable fault-
tolerant layer 2 data center network fabric. SIGCOMM Com-
put. Commun. Rev. 39(4):39–50.
9. D KC (2016) Ph.D. thesis (University of Nigeria Nsukka).
10. Okafor K, Nwaodo T (2012) A synthesis vlan approach to
congestion management in datacenter internet networks. Inter-
national Journal of Electronics and Telecommunication System
Research 5(6):86–92.
11. Alizadeh M et al. (2008) Data center transport mechanisms:
Congestion control theory and ieee standardization in Commu-
nication, Control, and Computing, 2008 46th Annual Allerton
Conference on. (IEEE), pp. 1270–1277.
12. Al-Fares M, Radhakrishnan S, Raghavan B, Huang N, Vahdat
A (2010) Hedera: Dynamic flow scheduling for data center
networks. in NSDI. Vol. 10, pp. 19–19.
13. Wood T (2011) Ph.D. thesis (University of Massachusetts
Amherst).
14. Guo C et al. (2010) Secondnet: a data center network virtual-
ization architecture with bandwidth guarantees in Proceedings
of the 6th International COnference. (ACM), p. 15.
15. Shieh A, Kandula S, Sirer EG (2010) Sidecar: building
programmable datacenter networks without programmable
switches in Proceedings of the 9th ACM SIGCOMM Workshop
on Hot Topics in Networks. (ACM), p. 21.
16. Abu-Libdeh H, Costa P, Rowstron A, O’Shea G, Donnelly
A (2010) Symbiotic routing in future data centers. ACM
SIGCOMM Computer Communication Review 40(4):51–62.
17. Arista (2015) Arista universal cloud network white paper
(https://www.arista.com).
18. Cisco (2013) Cisco fabric path technology and de-
sign brkdct-2081 (http://www.valleytalk.org/wp-
content/uploads/2013/08/BRKDCT-2081-Cisco-FabricPath-
Technology-and-Design.pdf).
19. Xilinx (2016) Field programmable gate array (fpga)
(https://www.xilinx.com/training/fpga/fpga-field-
programmable-gate-array.htm).
20. Goldberg RP (1973) Architecture of virtual machines in Pro-
ceedings of the workshop on virtual computer systems. (ACM),
pp. 74–112.
21. Kohler E, Morris R, Chen B, Jannotti J, Kaashoek MF (2000)
The click modular router. ACM Transactions on Computer
Systems (TOCS) 18(3):263–297.
22. Dobrescu M et al. (2009) Routebricks: exploiting parallelism
to scale software routers in Proceedings of the ACM SIGOPS
22nd symposium on Operating systems principles. (ACM), pp.
15–28.
23. Naous J, Gibb G, Bolouki S, McKeown N (2008) Netfpga:
reusable router architecture for experimental research in Pro-
ceedings of the ACM workshop on Programmable routers for
extensible services of tomorrow. (ACM), pp. 1–7.
24. Yang R, Wang J, Clement B, Mansour A (2013) Fpga imple-
mentation of a parameterized fourier synthesizer in Electronics,
Circuits, and Systems (ICECS), 2013 IEEE 20th International
Conference on. (IEEE), pp. 473–476.
25. Kliegl M et al. (2010) Generalized dcell structure for load-
balanced data center networks in INFOCOM IEEE Conference
on Computer Communications Workshops, 2010. (IEEE), pp.
1–5.
26. Overholt M, Wang S (2016) Modularized data center
cube (http://pbg.cs.illinois.edu/courses/cs538fa11/lectures/17-
Mark-Shiguang.pdf).
27. Udeze C, Okafor K, Okezie C, Okeke I, Ezekwe C (2014) Per-
formance analysis of r-dcn architecture for next generation web
application integration in 2014 IEEE 6th International Con-
ference on Adaptive Science & Technology (ICAST). (IEEE),
pp. 1–12.
28. Farrington N et al. (2010) Helios: a hybrid electrical/optical
switch architecture for modular data centers. ACM SIGCOMM
Computer Communication Review 40(4):339–350.
29. Wang G et al. (2010) c-through: part-time optics in data
centers. SIGCOMM Comput. Commun. Rev. 41(4):–.
30. Tan Z (2013) Ph.D. thesis (Department of Electrical Engineer-
ing and Computer Sciences, University Of California, Berke-
ley).
31. Cisco (2012) Cisco’s massively scalable data center network
fabric for warehouse scale computer., (Cisco), Technical report.
32. Alcatel-Lucent (2013) Data center converged solutions design
guide, (Alcatel-Lucent), Technical report.
33. Tan Z, Qian Z, Asanovic XCK, Patterson D (2013) Diablo:
Simulating datacenter network at scale using fpgas, (ASPIRE
UC Berkeley), Technical report.
34. Putnam A et al. (2015) A reconfigurable fabric for accelerating
large-scale datacenter services. IEEE Micro 35(3):10–22.
35. Joost R, Salomon R (2005) Advantages of fpga-based multipro-
cessor systems in industrial applications in 31st Annual Con-
ference of IEEE Industrial Electronics Society, 2005. IECON
2005. (IEEE), pp. 6–pp.
36. Savaš E, Tenca AF, Koç CK (2000) A scalable and unified
multiplier architecture for finite fields gf (p) and gf (2m) in
International Workshop on Cryptographic Hardware and Em-
bedded Systems. (Springer), pp. 277–292.
37. K.C.Okafor, Ezeha G, I.E. Achumba FU, Okezie C, U.H.Diala
(2015) Harnessing fpga processor cores in evolving cloud based
datacenter network designs (dccn) in Proc. 12th International Conference of the Nigeria Computer Society - Information Technology for Inclusive Development. (Nigerian Computer Society), pp. 1–14.
38. Morgan TP (2014) How microsoft is using fpgas to speed up bing search (http://www.enterprisetech.com/2014/09/03/microsoft-using-fpgas-speed-bing-search/).
39. Dean J, Ghemawat S (2008) Mapreduce: simplified data
processing on large clusters. Communications of the ACM
51(1):107–113.
40. Chauhan A, Fontama V, Hart M, Tok WH, Buck W (2014)
Introducing Microsoft Azure HDInsight-Technical Overview.
(Microsoft Press).
41. Simmhan Y, Van Ingen C, Subramanian G, Li J (2010) Bridging
the gap between desktop and the cloud for escience applica-
tions in 2010 IEEE 3rd International Conference on Cloud
Computing. (IEEE), pp. 474–481.
Efficient Design of p-Cycles for Survivability of WDM Networks Through Distri...
 
Ship Ad-hoc Network (SANET)
Ship Ad-hoc Network (SANET)	Ship Ad-hoc Network (SANET)
Ship Ad-hoc Network (SANET)
 
A Scalable, Commodity Data Center Network Architecture
A Scalable, Commodity Data Center Network ArchitectureA Scalable, Commodity Data Center Network Architecture
A Scalable, Commodity Data Center Network Architecture
 
Towards a low cost etl system
Towards a low cost etl systemTowards a low cost etl system
Towards a low cost etl system
 
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...
 

Similar to 5 1-33-1-10-20161221 kennedy

CLOUD RAN- Benefits of Centralization and Virtualization
CLOUD RAN- Benefits of Centralization and VirtualizationCLOUD RAN- Benefits of Centralization and Virtualization
CLOUD RAN- Benefits of Centralization and Virtualization
Aricent
 
Towards achieving-high-performance-in-5g-mobile-packet-cores-user-plane-function
Towards achieving-high-performance-in-5g-mobile-packet-cores-user-plane-functionTowards achieving-high-performance-in-5g-mobile-packet-cores-user-plane-function
Towards achieving-high-performance-in-5g-mobile-packet-cores-user-plane-function
Eiko Seidel
 
Low power network on chip architectures: A survey
Low power network on chip architectures: A surveyLow power network on chip architectures: A survey
Low power network on chip architectures: A survey
CSITiaesprime
 
CONTAINERIZED SERVICES ORCHESTRATION FOR EDGE COMPUTING IN SOFTWARE-DEFINED W...
CONTAINERIZED SERVICES ORCHESTRATION FOR EDGE COMPUTING IN SOFTWARE-DEFINED W...CONTAINERIZED SERVICES ORCHESTRATION FOR EDGE COMPUTING IN SOFTWARE-DEFINED W...
CONTAINERIZED SERVICES ORCHESTRATION FOR EDGE COMPUTING IN SOFTWARE-DEFINED W...
IJCNCJournal
 
5G Edge Computing Whitepaper, FCC Advisory Council
5G Edge Computing Whitepaper, FCC Advisory Council5G Edge Computing Whitepaper, FCC Advisory Council
5G Edge Computing Whitepaper, FCC Advisory Council
DESMOND YUEN
 
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKINGENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
IJCNCJournal
 
A LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGA
A LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGAA LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGA
A LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGA
IRJET Journal
 
Hardware virtualized flexible network for wireless data center optical interc...
Hardware virtualized flexible network for wireless data center optical interc...Hardware virtualized flexible network for wireless data center optical interc...
Hardware virtualized flexible network for wireless data center optical interc...
ieeepondy
 
Network on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A surveyNetwork on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A survey
IJRES Journal
 
20607-39024-1-PB.pdf
20607-39024-1-PB.pdf20607-39024-1-PB.pdf
20607-39024-1-PB.pdf
IjictTeam
 
A 01
A 01A 01
A 01
kakaken9x
 
WIRLESS CLOUD NETWORK
WIRLESS CLOUD NETWORKWIRLESS CLOUD NETWORK
WIRLESS CLOUD NETWORKAashish Pande
 
A Flexible Software/Hardware Adaptive Network for Embedded Distributed Archit...
A Flexible Software/Hardware Adaptive Network for Embedded Distributed Archit...A Flexible Software/Hardware Adaptive Network for Embedded Distributed Archit...
A Flexible Software/Hardware Adaptive Network for Embedded Distributed Archit...
csijjournal
 
FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE
FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCEFPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE
FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE
International Research Journal of Modernization in Engineering Technology and Science
 
Network Virtualization using Shortest Path Bridging
Network Virtualization using Shortest Path Bridging Network Virtualization using Shortest Path Bridging
Network Virtualization using Shortest Path Bridging
Motty Ben Atia
 
Service oriented cloud architecture for improved performance of smart grid ap...
Service oriented cloud architecture for improved performance of smart grid ap...Service oriented cloud architecture for improved performance of smart grid ap...
Service oriented cloud architecture for improved performance of smart grid ap...
eSAT Journals
 
Service oriented cloud architecture for improved
Service oriented cloud architecture for improvedService oriented cloud architecture for improved
Service oriented cloud architecture for improved
eSAT Publishing House
 
Designing network topology.pptx
Designing network topology.pptxDesigning network topology.pptx
Designing network topology.pptx
KISHOYIANKISH
 
Design and Performance Analysis of 8 x 8 Network on Chip Router
Design and Performance Analysis of 8 x 8 Network on Chip RouterDesign and Performance Analysis of 8 x 8 Network on Chip Router
Design and Performance Analysis of 8 x 8 Network on Chip Router
IRJET Journal
 
ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...
ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...
ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...
ijgca
 

Similar to 5 1-33-1-10-20161221 kennedy (20)

CLOUD RAN- Benefits of Centralization and Virtualization
CLOUD RAN- Benefits of Centralization and VirtualizationCLOUD RAN- Benefits of Centralization and Virtualization
CLOUD RAN- Benefits of Centralization and Virtualization
 
Towards achieving-high-performance-in-5g-mobile-packet-cores-user-plane-function
Towards achieving-high-performance-in-5g-mobile-packet-cores-user-plane-functionTowards achieving-high-performance-in-5g-mobile-packet-cores-user-plane-function
Towards achieving-high-performance-in-5g-mobile-packet-cores-user-plane-function
 
Low power network on chip architectures: A survey
Low power network on chip architectures: A surveyLow power network on chip architectures: A survey
Low power network on chip architectures: A survey
 
CONTAINERIZED SERVICES ORCHESTRATION FOR EDGE COMPUTING IN SOFTWARE-DEFINED W...
CONTAINERIZED SERVICES ORCHESTRATION FOR EDGE COMPUTING IN SOFTWARE-DEFINED W...CONTAINERIZED SERVICES ORCHESTRATION FOR EDGE COMPUTING IN SOFTWARE-DEFINED W...
CONTAINERIZED SERVICES ORCHESTRATION FOR EDGE COMPUTING IN SOFTWARE-DEFINED W...
 
5G Edge Computing Whitepaper, FCC Advisory Council
5G Edge Computing Whitepaper, FCC Advisory Council5G Edge Computing Whitepaper, FCC Advisory Council
5G Edge Computing Whitepaper, FCC Advisory Council
 
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKINGENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
ENHANCING AND MEASURING THE PERFORMANCE IN SOFTWARE DEFINED NETWORKING
 
A LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGA
A LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGAA LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGA
A LIGHT WEIGHT VLSI FRAME WORK FOR HIGHT CIPHER ON FPGA
 
Hardware virtualized flexible network for wireless data center optical interc...
Hardware virtualized flexible network for wireless data center optical interc...Hardware virtualized flexible network for wireless data center optical interc...
Hardware virtualized flexible network for wireless data center optical interc...
 
Network on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A surveyNetwork on Chip Architecture and Routing Techniques: A survey
Network on Chip Architecture and Routing Techniques: A survey
 
20607-39024-1-PB.pdf
20607-39024-1-PB.pdf20607-39024-1-PB.pdf
20607-39024-1-PB.pdf
 
A 01
A 01A 01
A 01
 
WIRLESS CLOUD NETWORK
WIRLESS CLOUD NETWORKWIRLESS CLOUD NETWORK
WIRLESS CLOUD NETWORK
 
A Flexible Software/Hardware Adaptive Network for Embedded Distributed Archit...
A Flexible Software/Hardware Adaptive Network for Embedded Distributed Archit...A Flexible Software/Hardware Adaptive Network for Embedded Distributed Archit...
A Flexible Software/Hardware Adaptive Network for Embedded Distributed Archit...
 
FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE
FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCEFPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE
FPGA IMPLEMENTATION OF APPROXIMATE SOFTMAX FUNCTION FOR EFFICIENT CNN INFERENCE
 
Network Virtualization using Shortest Path Bridging
Network Virtualization using Shortest Path Bridging Network Virtualization using Shortest Path Bridging
Network Virtualization using Shortest Path Bridging
 
Service oriented cloud architecture for improved performance of smart grid ap...
Service oriented cloud architecture for improved performance of smart grid ap...Service oriented cloud architecture for improved performance of smart grid ap...
Service oriented cloud architecture for improved performance of smart grid ap...
 
Service oriented cloud architecture for improved
Service oriented cloud architecture for improvedService oriented cloud architecture for improved
Service oriented cloud architecture for improved
 
Designing network topology.pptx
Designing network topology.pptxDesigning network topology.pptx
Designing network topology.pptx
 
Design and Performance Analysis of 8 x 8 Network on Chip Router
Design and Performance Analysis of 8 x 8 Network on Chip RouterDesign and Performance Analysis of 8 x 8 Network on Chip Router
Design and Performance Analysis of 8 x 8 Network on Chip Router
 
ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...
ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...
ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...
 

More from Onyebuchi nosiri

Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
Onyebuchi nosiri
 
Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
Onyebuchi nosiri
 
Implementation of Particle Swarm Optimization Technique for Enhanced Outdoor ...
Implementation of Particle Swarm Optimization Technique for Enhanced Outdoor ...Implementation of Particle Swarm Optimization Technique for Enhanced Outdoor ...
Implementation of Particle Swarm Optimization Technique for Enhanced Outdoor ...
Onyebuchi nosiri
 
Telecom infrastructure-sharing-a-panacea-for-sustainability-cost-and-network-...
Telecom infrastructure-sharing-a-panacea-for-sustainability-cost-and-network-...Telecom infrastructure-sharing-a-panacea-for-sustainability-cost-and-network-...
Telecom infrastructure-sharing-a-panacea-for-sustainability-cost-and-network-...
Onyebuchi nosiri
 
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
Onyebuchi nosiri
 
Voltage Stability Investigation of the Nigeria 330KV Interconnected Grid Syst...
Voltage Stability Investigation of the Nigeria 330KV Interconnected Grid Syst...Voltage Stability Investigation of the Nigeria 330KV Interconnected Grid Syst...
Voltage Stability Investigation of the Nigeria 330KV Interconnected Grid Syst...
Onyebuchi nosiri
 
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
Onyebuchi nosiri
 
Quadcopter Design for Payload Delivery
Quadcopter Design for Payload Delivery Quadcopter Design for Payload Delivery
Quadcopter Design for Payload Delivery
Onyebuchi nosiri
 
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
Onyebuchi nosiri
 
Path Loss Characterization of 3G Wireless Signal for Urban and Suburban Envir...
Path Loss Characterization of 3G Wireless Signal for Urban and Suburban Envir...Path Loss Characterization of 3G Wireless Signal for Urban and Suburban Envir...
Path Loss Characterization of 3G Wireless Signal for Urban and Suburban Envir...
Onyebuchi nosiri
 
Signal Strength Evaluation of a 3G Network in Owerri Metropolis Using Path Lo...
Signal Strength Evaluation of a 3G Network in Owerri Metropolis Using Path Lo...Signal Strength Evaluation of a 3G Network in Owerri Metropolis Using Path Lo...
Signal Strength Evaluation of a 3G Network in Owerri Metropolis Using Path Lo...
Onyebuchi nosiri
 
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
Onyebuchi nosiri
 
Evaluation of Percentage Capacity Loss on LTE Network Caused by Intermodulati...
Evaluation of Percentage Capacity Loss on LTE Network Caused by Intermodulati...Evaluation of Percentage Capacity Loss on LTE Network Caused by Intermodulati...
Evaluation of Percentage Capacity Loss on LTE Network Caused by Intermodulati...
Onyebuchi nosiri
 
Modelling, Simulation and Analysis of a Low-Noise Block Converter (LNBC) Used...
Modelling, Simulation and Analysis of a Low-Noise Block Converter (LNBC) Used...Modelling, Simulation and Analysis of a Low-Noise Block Converter (LNBC) Used...
Modelling, Simulation and Analysis of a Low-Noise Block Converter (LNBC) Used...
Onyebuchi nosiri
 
Design and Implementation of a Simple HMC6352 2-Axis-MR Digital Compass
Design and Implementation of a Simple HMC6352 2-Axis-MR Digital Compass Design and Implementation of a Simple HMC6352 2-Axis-MR Digital Compass
Design and Implementation of a Simple HMC6352 2-Axis-MR Digital Compass
Onyebuchi nosiri
 
An Embedded Voice Activated Automobile Speed Limiter: A Design Approach for C...
An Embedded Voice Activated Automobile Speed Limiter: A Design Approach for C...An Embedded Voice Activated Automobile Speed Limiter: A Design Approach for C...
An Embedded Voice Activated Automobile Speed Limiter: A Design Approach for C...
Onyebuchi nosiri
 
OPTIMIZATION OF COST 231 MODEL FOR 3G WIRELESS COMMUNICATION SIGNAL IN SUBURB...
OPTIMIZATION OF COST 231 MODEL FOR 3G WIRELESS COMMUNICATION SIGNAL IN SUBURB...OPTIMIZATION OF COST 231 MODEL FOR 3G WIRELESS COMMUNICATION SIGNAL IN SUBURB...
OPTIMIZATION OF COST 231 MODEL FOR 3G WIRELESS COMMUNICATION SIGNAL IN SUBURB...
Onyebuchi nosiri
 
Coverage and Capacity Performance Degradation on a Co-Located Network Involvi...
Coverage and Capacity Performance Degradation on a Co-Located Network Involvi...Coverage and Capacity Performance Degradation on a Co-Located Network Involvi...
Coverage and Capacity Performance Degradation on a Co-Located Network Involvi...
Onyebuchi nosiri
 
Comparative Study of Path Loss Models for Wireless Communication in Urban and...
Comparative Study of Path Loss Models for Wireless Communication in Urban and...Comparative Study of Path Loss Models for Wireless Communication in Urban and...
Comparative Study of Path Loss Models for Wireless Communication in Urban and...
Onyebuchi nosiri
 
 ...
 ... ...
 ...
Onyebuchi nosiri
 

More from Onyebuchi nosiri (20)

Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
 
Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
Comparative power flow analysis of 28 and 52 buses for 330 kv power grid netw...
 
Implementation of Particle Swarm Optimization Technique for Enhanced Outdoor ...
Implementation of Particle Swarm Optimization Technique for Enhanced Outdoor ...Implementation of Particle Swarm Optimization Technique for Enhanced Outdoor ...
Implementation of Particle Swarm Optimization Technique for Enhanced Outdoor ...
 
Telecom infrastructure-sharing-a-panacea-for-sustainability-cost-and-network-...
Telecom infrastructure-sharing-a-panacea-for-sustainability-cost-and-network-...Telecom infrastructure-sharing-a-panacea-for-sustainability-cost-and-network-...
Telecom infrastructure-sharing-a-panacea-for-sustainability-cost-and-network-...
 
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
 
Voltage Stability Investigation of the Nigeria 330KV Interconnected Grid Syst...
Voltage Stability Investigation of the Nigeria 330KV Interconnected Grid Syst...Voltage Stability Investigation of the Nigeria 330KV Interconnected Grid Syst...
Voltage Stability Investigation of the Nigeria 330KV Interconnected Grid Syst...
 
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
VOLTAGE STABILITY IN NIGERIA 330KV INTEGRATED 52 BUS POWER NETWORK USING PATT...
 
Quadcopter Design for Payload Delivery
Quadcopter Design for Payload Delivery Quadcopter Design for Payload Delivery
Quadcopter Design for Payload Delivery
 
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
 
Path Loss Characterization of 3G Wireless Signal for Urban and Suburban Envir...
Path Loss Characterization of 3G Wireless Signal for Urban and Suburban Envir...Path Loss Characterization of 3G Wireless Signal for Urban and Suburban Envir...
Path Loss Characterization of 3G Wireless Signal for Urban and Suburban Envir...
 
Signal Strength Evaluation of a 3G Network in Owerri Metropolis Using Path Lo...
Signal Strength Evaluation of a 3G Network in Owerri Metropolis Using Path Lo...Signal Strength Evaluation of a 3G Network in Owerri Metropolis Using Path Lo...
Signal Strength Evaluation of a 3G Network in Owerri Metropolis Using Path Lo...
 
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
Investigation of TV White Space for Maximum Spectrum Utilization in a Cellula...
 
Evaluation of Percentage Capacity Loss on LTE Network Caused by Intermodulati...
Evaluation of Percentage Capacity Loss on LTE Network Caused by Intermodulati...Evaluation of Percentage Capacity Loss on LTE Network Caused by Intermodulati...
Evaluation of Percentage Capacity Loss on LTE Network Caused by Intermodulati...
 
Modelling, Simulation and Analysis of a Low-Noise Block Converter (LNBC) Used...
Modelling, Simulation and Analysis of a Low-Noise Block Converter (LNBC) Used...Modelling, Simulation and Analysis of a Low-Noise Block Converter (LNBC) Used...
Modelling, Simulation and Analysis of a Low-Noise Block Converter (LNBC) Used...
 
Design and Implementation of a Simple HMC6352 2-Axis-MR Digital Compass
Design and Implementation of a Simple HMC6352 2-Axis-MR Digital Compass Design and Implementation of a Simple HMC6352 2-Axis-MR Digital Compass
Design and Implementation of a Simple HMC6352 2-Axis-MR Digital Compass
 
An Embedded Voice Activated Automobile Speed Limiter: A Design Approach for C...
An Embedded Voice Activated Automobile Speed Limiter: A Design Approach for C...An Embedded Voice Activated Automobile Speed Limiter: A Design Approach for C...
An Embedded Voice Activated Automobile Speed Limiter: A Design Approach for C...
 
OPTIMIZATION OF COST 231 MODEL FOR 3G WIRELESS COMMUNICATION SIGNAL IN SUBURB...
OPTIMIZATION OF COST 231 MODEL FOR 3G WIRELESS COMMUNICATION SIGNAL IN SUBURB...OPTIMIZATION OF COST 231 MODEL FOR 3G WIRELESS COMMUNICATION SIGNAL IN SUBURB...
OPTIMIZATION OF COST 231 MODEL FOR 3G WIRELESS COMMUNICATION SIGNAL IN SUBURB...
 
Coverage and Capacity Performance Degradation on a Co-Located Network Involvi...
Coverage and Capacity Performance Degradation on a Co-Located Network Involvi...Coverage and Capacity Performance Degradation on a Co-Located Network Involvi...
Coverage and Capacity Performance Degradation on a Co-Located Network Involvi...
 
Comparative Study of Path Loss Models for Wireless Communication in Urban and...
Comparative Study of Path Loss Models for Wireless Communication in Urban and...Comparative Study of Path Loss Models for Wireless Communication in Urban and...
Comparative Study of Path Loss Models for Wireless Communication in Urban and...
 
 ...
 ... ...
 ...
 

Recently uploaded

AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
SamSarthak3
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
obonagu
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
Jayaprasanna4
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
ankuprajapati0525
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
gerogepatton
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
Pipe Restoration Solutions
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
ydteq
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
FluxPrime1
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
Jayaprasanna4
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
karthi keyan
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
Massimo Talia
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Dr.Costas Sachpazis
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
BrazilAccount1
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
BrazilAccount1
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Teleport Manpower Consultant
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
MLILAB
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
MdTanvirMahtab2
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
TeeVichai
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
Osamah Alsalih
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
thanhdowork
 

Recently uploaded (20)

AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdfAKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
AKS UNIVERSITY Satna Final Year Project By OM Hardaha.pdf
 
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
在线办理(ANU毕业证书)澳洲国立大学毕业证录取通知书一模一样
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
 
The role of big data in decision making.
The role of big data in decision making.The role of big data in decision making.
The role of big data in decision making.
 
Immunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary AttacksImmunizing Image Classifiers Against Localized Adversary Attacks
Immunizing Image Classifiers Against Localized Adversary Attacks
 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
一比一原版(UofT毕业证)多伦多大学毕业证成绩单如何办理
 
DESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docxDESIGN A COTTON SEED SEPARATION MACHINE.docx
DESIGN A COTTON SEED SEPARATION MACHINE.docx
 
ethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.pptethical hacking-mobile hacking methods.ppt
ethical hacking-mobile hacking methods.ppt
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024Nuclear Power Economics and Structuring 2024
Nuclear Power Economics and Structuring 2024
 
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
Sachpazis:Terzaghi Bearing Capacity Estimation in simple terms with Calculati...
 
English lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdfEnglish lab ppt no titlespecENG PPTt.pdf
English lab ppt no titlespecENG PPTt.pdf
 
AP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specificAP LAB PPT.pdf ap lab ppt no title specific
AP LAB PPT.pdf ap lab ppt no title specific
 
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdfTop 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
Top 10 Oil and Gas Projects in Saudi Arabia 2024.pdf
 
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang,  ICLR 2024, MLILAB, KAIST AI.pdfJ.Yang,  ICLR 2024, MLILAB, KAIST AI.pdf
J.Yang, ICLR 2024, MLILAB, KAIST AI.pdf
 
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
Industrial Training at Shahjalal Fertilizer Company Limited (SFCL)
 
Railway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdfRailway Signalling Principles Edition 3.pdf
Railway Signalling Principles Edition 3.pdf
 
MCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdfMCQ Soil mechanics questions (Soil shear strength).pdf
MCQ Soil mechanics questions (Soil shear strength).pdf
 
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
RAT: Retrieval Augmented Thoughts Elicit Context-Aware Reasoning in Long-Hori...
 

5 1-33-1-10-20161221 kennedy

Architectural tiers are the emphasis in [3], [9]. Flow scheduling and congestion control are the consideration in [10], [11], [12]. Virtualization is the focus in [13], [14], while application support is the focus in [15], [16]. In all these studies, little attention has been given to QoS performance using FPGA service processing cores. Since cloud-based DCN is a relatively new exploration area in high-performance networks, many of the designs discussed in [5], [6], [8], [11], [14], and [16] did not investigate DCN acceleration.

A Spine-Leaf FPGA network model has many benefits for high-performance market segments. According to [17] and [18], a 4-way Layer-3 Leaf/Spine with Equal Cost Multi-Path (ECMP) architectural processing for routing and other computing services is the new template for high-performance datacenter network designs. In a cloud-based scenario, the type of switch or even the server processor cores can contribute to congestion delays regardless of the data offloading strategy. For example, Portland [8], BCube, and Quantized Congestion Notification (QCN) [11] use rate-based congestion control, which is not efficient. Hence, the Ethernet switches, IP routers, servers, etc. found in existing datacenter architectures cannot be used to implement high-performance datacenter designs.

A high-end FPGA System on Chip (FSoC) could be employed for data offloading, leading to improved QoS for enterprise applications. There are two types, namely the Static Random Access Memory (SRAM) and antifuse versions. These are semiconductor devices built on a matrix of Configurable Logic Blocks (CLBs) connected via programmable interconnects [19]. By construction, FPGAs are efficient at executing a predictable workload. Given that datacenter workloads require high computational capability, energy efficiency, and low cost, a legacy commodity server cannot satisfy these demands. As such, an FSoC can be reprogrammed to offer flexible acceleration of workloads.
To date, many cloud datacenters have not deployed FSoCs as compute accelerators. Hence, to implement efficient cloud DCN designs, rich programmability is required in the cloud DCN service processors, beside the role of Type-1 bare-metal virtualization [20]. There are two approaches in this regard, namely pure software-based programmability [21], [22] and FPGA-based programmability such as NetFPGA [23]. Software-based systems can provide full programmability while sustaining a reasonable packet forwarding rate, but their performance is still not comparable to commodity switch and server FPGA Application Specific Integrated Circuits (ASICs). The batch processing used in existing server switches and software-based switches yields optimizations that introduce high latency. This is critical for various control-plane functions such as signaling and congestion control [6], [8], [11] in high-performance networks.

Considering bandwidth-intensive applications, FPGAs can be designed for low-latency operation, which offers higher value for cloud computing processes. Since FPGA-based systems are fully programmable [24], a datacenter backend can be optimized through in-circuit reconfiguration at power-up to support more functions and achieve seamless data-offloading. Hence, the latest trend in server performance is the data-offloading paradigm. It involves pairing an x86 processor with a highly customizable FPGA device architecture. With this method, workload performance can be enhanced while accommodating changing needs in the future. Clearly, a data-offloading FSoC will improve the throughput of cloud-based Software as a Service (SaaS) by co-processing with a commodity CPU. The same concept can accelerate cloud database searches for improved performance. The major trade-off for acceleration (cloud workload offloading in this case) is that frequent or repetitive tasks or task sequences will affect power demand.

As far as this work is concerned, little research has been carried out in the literature on the QoS effects of cloud network servers, routers, etc. driven by FPGA cores. Hence, there is a need to explore the FPGA target device architecture in developing DCCNs for cloud-based services such as the Enterprise Energy Tracking Analytic Cloud Portal (EETACP), e.g., databases, big data analytics, and high-performance computing.
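To make the co-processing idea above concrete, the following is a minimal, self-contained C++ sketch, not a fragment of the EETACP implementation: the host enqueues a bulk job for an accelerator and stays free for other requests until it collects the result. The fpga_accelerate() routine, the Job structure, and the queue-based hand-off are illustrative assumptions; a real FSoC offload path would move descriptors and results over PCIe/DMA, whereas here the accelerator is emulated with a worker thread so the example compiles and runs anywhere.

```cpp
// Sketch of host/accelerator co-processing: the "accelerator" is a worker
// thread standing in for a hypothetical FSoC offload path.
#include <condition_variable>
#include <future>
#include <iostream>
#include <mutex>
#include <numeric>
#include <queue>
#include <thread>
#include <vector>

struct Job {
  std::vector<int> data;            // payload handed to the accelerator
  std::promise<long long> result;   // completion written back to the host
};

void fpga_accelerate(std::queue<Job>& q, std::mutex& m,
                     std::condition_variable& cv, bool& shutdown) {
  for (;;) {
    std::unique_lock<std::mutex> lk(m);
    cv.wait(lk, [&] { return shutdown || !q.empty(); });
    if (q.empty()) return;                       // shutdown requested
    Job job = std::move(q.front());
    q.pop();
    lk.unlock();
    long long sum = std::accumulate(job.data.begin(), job.data.end(), 0LL);
    job.result.set_value(sum);                   // "DMA" the result back
  }
}

int main() {
  std::queue<Job> q;
  std::mutex m;
  std::condition_variable cv;
  bool shutdown = false;
  std::thread accel(fpga_accelerate, std::ref(q), std::ref(m),
                    std::ref(cv), std::ref(shutdown));

  Job job;
  job.data.assign(1000000, 1);                   // bulk work to offload
  std::future<long long> done = job.result.get_future();
  {
    std::lock_guard<std::mutex> lk(m);
    q.push(std::move(job));
  }
  cv.notify_one();

  // ... the CPU could keep serving latency-sensitive requests here ...
  std::cout << "offloaded sum = " << done.get() << "\n";

  { std::lock_guard<std::mutex> lk(m); shutdown = true; }
  cv.notify_one();
  accel.join();
  return 0;
}
```

The point of the pattern is that the host's service loop is decoupled from the bulk computation, which is what allows an FSoC to raise SaaS throughput without stalling the commodity CPU.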
2. Related Works

2.1 Cloud Datacenter Networking

Traditional datacenter network architectures such as DCCN [3], Portland [8], DCell [25], BCube [26], R-DCN [27], Helios [28], and c-Through [29] have been extensively studied. Most of them use a recursive scheme for scalability and performance, while others construct a separate optical network with an expensive high-port-count 3D MEMS switch side by side with the existing datacenter to add core bandwidth on the fly. Most DCNs, like OmniSwitch, a modular datacenter network architecture, integrate small optical circuit switches with Ethernet switches to provide both topological flexibility and large-scale connectivity. These architectures can be re-modelled using the enhanced Spine-Leaf, mesh, and router layer-3 (tier-2) models running on a low-latency FPGA core; this has not been used in server-centric application deployment strategies. The author in [30] highlighted issues affecting existing commercial off-the-shelf Ethernet switches for these architectures at high link speeds, such as 10 gigabits per second (Gbps). The challenges include:

(a) Extreme complexity, particularly in the switch software, wiring, and scaled troubleshooting.
(b) Various failure modes in the absence of fail-over schemes.
(c) Existing large commercial switches and routers are expensive.
(d) Some datacenters require high port density at the aggregation or datacenter-level switches at extremely high link bandwidth.
(e) Other issues are over-subscription, microburst detection problems when using SNMP polling for TCP sprawl (i.e., many-to-one traffic patterns), high queuing latency, an absence of mobility support for virtual server infrastructure, poor scalability, and inflexibility resulting from legacy designs that have compatibility issues with automated virtualized datacenters.

Therefore, many researchers have kept evolving datacenter network architectures, with most of them focusing on the novel design philosophy of Spine-Leaf, mesh, and router layer-3 models [31], [32]. The new trend in datacenter network models is to address the issues of optimal performance, such as low latency, availability/fault tolerance, utilization, energy efficiency, and scheduling of resources, regardless of the network device.

Regarding the architectural design framework, the most closely related work to this research is the Datacenter-in-a-Box at Low cost (DIABLO) FPGA cluster prototype in [30]. The authors discussed a novel cost-efficient evaluation methodology. FPGAs were used, but datacenters were treated as whole computers with tightly integrated hardware and software. The work enumerated three models:

i. Server models: built on top of RAMP Gold (SPARC V8 ISA), running full Linux 3.5 with a fixed-CPI timing model.
ii. Switch models: based on circuit and packet switching, with abstracted models focusing on switch buffer configurations.
iii. NIC models: having a scatter/gather DMA with zero-copy drivers as well as NAPI polling support.

In integrating the cloud DCN nodes to FPGA cores, Figure 1 illustrates a high-level structure. The system used 6 BEE3 boards carrying 24 Xilinx Virtex-5 FPGAs [30]. The simulation was realized with 3,072 servers in ninety-six racks. The network switches ran at 8.4 billion instructions per second. The validation was on a single-rack physical system with a sixteen-node cluster (3 GHz Xeon) and a 16-port Asante IntraCore 35516-T switch. The physical hardware setup had two servers and 1 to 14 clients. The software configuration included server protocols (TCP/UDP), server worker threads (4 by default), and eight simulated servers (single-core with a 4 GHz fixed-CPI model). Figure 2 shows a type 1 DIABLO without inter-board connections and a type 2 DIABLO fully connected with high-speed cables. Type 2 shares a similar feature with this work.
Fig. 1. DIABLO cluster physical mapping [33]

With FPGA and the use of programmable hardware platforms, the simplification of the load on cloud nodes and network devices will enhance performance. As such, a cloud of general-purpose resources (FPGAs) was used to offload the processed tasks. Andrew P. et al. [34] described a reconfigurable fabric (FPGA Catapult) designed to balance some of these performance concerns. The system was embedded into each half-rack of 48 servers in the form of a small board with a medium-sized FPGA and local DRAM attached to each server. As depicted in Figure 2, the FPGAs are directly wired to each other in a 6x8 two-dimensional torus, allowing services to allocate groups of FPGAs that provide the necessary area to implement the desired functionality. The work was evaluated by offloading a significant fraction of Microsoft Bing's ranking stack onto groups of eight FPGAs to support each instance of that service [34].

Based on the performance expectations of the earlier proposed EETACP (a cloud application deployed on DCCN), the key goals for any datacenter architecture include [9]:

(a) Deterministic latency
(b) Redundancy/high availability
(c) Manageability/flexibility
(d) Excellent resource allocation and scheduling
(e) Scalability and fault tolerance

An improved network architecture based on an FPGA fabric is proposed to achieve the goals above. This model has been shown to be better than the Spine-Leaf, mesh, and layer 3-routed models owing to the performance characteristics of the device. It supports lower latency, offloading, seamless integration, and computing scalability. It is imperative to outline the advantages and disadvantages of the current Spine-Leaf, mesh, and Layer 3-routed network designs; these are shown in Table 1.

In the EETACP DCCN [34], a low-latency and fault-tolerant network was achieved. In this case, the number of network tiers was reduced to minimize system latency. An FPGA-based fabric structure simplifies management, reduces cost, and allows resilient and low-latency networks to be designed, just like the Spine-Leaf model. The robust architectural concepts supported in the DCCN architectures provide high availability and deterministic low latency, and can scale up or down with demand. EETACP was tightly integrated with the OmniVista 2500 Virtual Machine Manager (VMM), providing a unified platform for virtual machine visibility and provisioning with virtual network profiles across the network. These allow seamless server interfacing.

By introducing an FPGA cluster in the above architectures, its advantages in cloud datacenter networks (e.g., DCCN) include:

(a) Allows multi-chassis terminated link aggregation groups to be created.
(b) Creates a loop-free edge without Spanning Tree Protocol (STP).
(c) Provides node- and link-level redundancy, particularly with the integrated-service OpenFlow load balancer.
(d) Enables the overall architecture to be geo-independent, i.e., without co-location support.
(e) Actively supports interconnect switches using standard 10G and 40G Ethernet optics.
(f) Supports redundancy and resiliency across the switches connecting EETACP servers.

In Web-scale data centers, boosting performance with a few FPGA device architectures across thousands of servers will save cost.
Besides, leveraging FPGAs for acceleration in Spine-Leaf models will improve dynamic over-allocation (change management) for large-scale data centers, because enterprise tools must track the FPGA algorithm as it is updated. This is needful for enterprise adoption. With the availability of server virtualization, a hyper-scale datacenter could use FPGA capabilities. This paper opines that new processor architectures based on a programmable FPGA device have several advantages for cloud service provisioning: they allow for scalability on demand and loosely coupled system designs.
Fig. 2. DIABLO cluster prototype with 6 BEE3 boards [30]

Table 1. Advantages and disadvantages of the current Spine-Leaf, Mesh, and Layer 3-Routed network designs

Spine-Leaf Model
Advantages:
• Offers a layer 2/3 common fabric implementation
• Facilitates simpler design
• Fewer interconnects
• Easy to scale within a boundary, with better latency transition
Disadvantages:
• The additional layer of transit hop may impact latency and oversubscription
• Scalability limited to the number of ports in the spine layer

Mesh Model
Advantages:
• Offers a layer 2/3 differentiated fabric implementation
• Highly scalable implementation
• No transit hop
• Lower latency and lower oversubscription ratios
Disadvantages:
• More links used for interconnects

Layer 3-Routed Model
Advantages:
• Offers an end-to-end routed fabric implementation
• Easy to secure at the IP layer
• Fewer interconnects
• Easy to scale
Disadvantages:
• Highly oversubscribed architecture
• Number of transit hops is not deterministic, impacting latency
• Complex design and maintenance
R. Joost and Salomon [35] showed that FPGAs are best suited for tackling most industrial and network-based applications, such as supervisory control systems, cloud computing, the Internet of Things, and other grid computing services. They showed that FPGAs are very powerful, relatively inexpensive, and adaptable, because their configuration is specified in an abstract hardware description language. FPGA-based implementations combine many advantages, such as rapid development cycles, high flexibility, re-usability, moderate cost, easy upgrading (due to the use of abstract Hardware Description Languages, HDLs), and feature extension (as long as the FPGA is not exhausted). For the network pieces in the cloud DCN, the FPGA cores on the servers, switches, and load balancers are managed by a management console in the form of a Software Defined Network (SDN) controller that separates the data, control, and application planes. In context, for updating a switching policy, the network is initially mapped in the design, thereby maintaining a default state and eliminating routine reprogramming of the FPGA logic cells. The use of FPGAs can also complement other chipset accelerators (i.e., GPUs), but at the expense of writing new procedures in VHDL. The issues of power consumption and area-on-chip are vital for performance, considering the number of FPGA cores needed in the network; this trade-off is left for future research.

3. Methodology

In this section, the FPGA modular description is presented. A characterization scenario was used as a basis for generalization. To achieve this, an electronic design simulation tool (Riverbed Modeler) with an extended C++ library was employed in this study. Due consideration was given to an FPGA Virtex UltraScale-driven server machine. This was used for the Spine-Leaf DCCN design, as it offers efficient performance, good system integration, and bandwidth, with the added benefit of re-programmability. In the enterprise setup, the peripheral controllers include general-purpose I/O, UART, timer, debug, SPI, a DMA controller, and Ethernet (an interface to an external MAC/PHY chip). The memory controllers include SRAM, Flash, SDRAM, and DDR SDRAM.

In context, the scalability of the Virtex UltraScale VU440 device is made possible by its ASIC-class architecture, with up to 90% utilization, featuring next-generation routing, ASIC-like clocking, resource utilization, power management, elimination of interconnect bottlenecks, and critical-path optimizations. Its key architectural blocks include wider multipliers, high-speed memory cascading, 33G-capable transceivers, and integrated 100 Gb/s Ethernet MAC and 150 Gb/s IP cores. These devices enable multi-hundred-gigabit-per-second levels of system performance with smart processing at full line rates.

Figure 3 shows a proof of concept demonstrating the initial testbed setup for EETACP deployment. The configuration facilitates the dual-housing of servers/storage and access devices, with links distributed across the DCCN switches. There is no logical loop between the edge devices and the multi-chassis peer switches, even though a physical loop exists. Single-interface servers, storage, and edge devices can be connected to any DCCN switch via a virtualization management console. The setup is based on a general-purpose processor. Using Type-1 bare-metal virtualization makes VM instances feasible, supporting failover, replication, and redundancy in a production environment.
The assumption in this research is that the FPGA concept, as well as Type-1 bare-metal virtualization, must be integrated in the case of myriads of servers, i.e., a massively scaled datacenter, to derive the expected QoS. An FPGA scalable architecture [36] offers a template for adoption in DCCN. Specifically, a Xilinx FPGA comparison showing an optimal configuration for the Virtex UltraScale device has been enumerated in [19]. In that work, the logic cells (K), UltraRAM (Mb), Block RAM (Mb), DSP slices, transceiver count, maximum transceiver speed (Gb/s), total full-duplex transceiver bandwidth (Gb/s), memory interfaces (DDR3 and DDR4), PCI Express, configuration AES, I/O pins, and I/O voltages were all compared against other device architecture variants, showing the Virtex UltraScale device as the most preferred choice. This further facilitated its adoption in the proposed DCN design in Section 3. The FPGA-based system implementations have the following characteristics:

(a) Allow for the integration of soft-core processors.
(b) Have plenty of logic resources for routing.
(c) Have plenty of RAM support.

This observation, combined with the lack of a bypass path, led to a multi-threaded design of large modules. In the validation analysis, this work focuses on FPGA-based datacenters for performance benchmarking. It must be stated that congestion offloading is derived through the use of the over-allocation considered in Figure 3. In this paper, the prototype design of the cloud-based datacenter has only been tested with a very small testbed running realistic micro-benchmarks for cloud computing services. The emphasis is on the QoS comparison with related datacenter cores. The role of Type-1 virtualization as a DCN accelerator is presented in Section 3.1.

3.1 Architectural Model (Type-1 Virtualization)

The goal of the FPGA-based network server model (DCCN) is to have credible workload generation that is scalable and efficient with respect to QoS for a congested traffic pool. A highly accurate framework for the cloud computing workload was developed in Figure 4. At the core, the server clusters must be capable of running complex server application software with minimal modification. In context, an FPGA service model is responsible for executing the target procedure (router, switch, or server CPU) correctly, as well as maintaining the device architectural state in congested networks. Using the Type-1 virtualization strategy makes this feasible. The benefits of this management scheme include:

(a) Simplified mapping of the functional FPGA model. The separation allows complex operations to take multiple host cycles. For example, a highly ported register file can be mapped to a block RAM and accessed in multiple host cycles, avoiding a large, slow mapping to FPGA registers, multiplexers, etc.
(b) Improved flexibility and reuse of resources, even in over-allocation mode. With it, the precise server timing model can be changed without modifying the overall network model, which improves efficiency. For instance, it is possible to use the same VM switch model to simulate both 10 Gbps and 100 Gbps switches just by changing the timing model.
(c) It enables a highly configurable, abstracted timing model. In the virtualized datacenter, splitting the timing function allows the timing model to effect abstraction in the cloud layer; a minimal code sketch of this functional/timing separation is given after this list.
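The following minimal C++ sketch illustrates the functional/timing split in items (b) and (c); it is an illustration of the idea rather than the paper's simulation code, and the SwitchModel, TimingModel, and LinkRateTiming names are assumptions. The same functional forwarding model is retimed as a 10 Gbps or a 100 Gbps switch simply by injecting a different timing object.

```cpp
#include <cstdint>
#include <iostream>
#include <memory>

// Timing model: only answers "how long does this packet occupy the switch?"
struct TimingModel {
  virtual ~TimingModel() = default;
  virtual double serviceTimeNs(std::size_t packetBytes) const = 0;
};

// Link-rate-based timing; swapping 10 Gbps for 100 Gbps changes no functional code.
struct LinkRateTiming : TimingModel {
  explicit LinkRateTiming(double gbps) : gbps_(gbps) {}
  double serviceTimeNs(std::size_t packetBytes) const override {
    return (packetBytes * 8.0) / gbps_;   // bits divided by Gbit/s gives ns
  }
 private:
  double gbps_;
};

// Functional switch model: forwarding behaviour only; timing is injected.
class SwitchModel {
 public:
  explicit SwitchModel(std::unique_ptr<TimingModel> timing)
      : timing_(std::move(timing)) {}
  double forward(std::size_t packetBytes) {
    busyUntilNs_ += timing_->serviceTimeNs(packetBytes);  // abstracted clock
    return busyUntilNs_;
  }
 private:
  std::unique_ptr<TimingModel> timing_;
  double busyUntilNs_ = 0.0;
};

int main() {
  SwitchModel tenGig(std::make_unique<LinkRateTiming>(10.0));
  SwitchModel hundredGig(std::make_unique<LinkRateTiming>(100.0));
  // Same functional model, two timing models: a 1500-byte frame.
  std::cout << "10G  completes at " << tenGig.forward(1500) << " ns\n";
  std::cout << "100G completes at " << hundredGig.forward(1500) << " ns\n";
  return 0;
}
```

Keeping the timing behind a narrow interface is what lets the abstracted timing model in item (c) be reconfigured without touching the functional model.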
Fig. 3. DCCN EETACP server testbed (Kswitche Labs, 2015)

When looking closely at the FPGA characteristics for network architectures, this work identified a wide variety of design choices, such as switch architecture, network topology, protocols, and applications. To support data-intensive processing in an FPGA-based domain, the traffic workloads must be optimized in the cloud environment. As such, optimization via data management must be satisfied for enhanced QoS.

3.2 FPGA Cloud Datacenter Specifications

This paper uses the specification of the cloud datacenter network described in [37]; however, Type-1 server virtualization is considered for resource management in an FPGA-driven DCCN. The network fabric has an OpenFlow load-balancer, a virtual gateway, and server instances on the hypervisor. In the network, a three-stage Clos topology using Nexus 7000 (spine) and Nexus 3000 (leaf) platforms, with FPGA-based N-Servers connected to them, forms a warehouse-scale cloud datacenter. These run on 10–40 Gbps links. The specifications are encapsulated in Figure 4.

3.3 Hyper-Scale Cloud Cluster Server (HCCS)

The FPGA card used in the Spine-Leaf cloud network server (shown in Figure 4) is depicted in Figure 5. It is based on Xilinx Virtex UltraScale FPGA technology (the target device). The characterization in the HCCS is mainly for the Spine-Leaf DCCN. For data-offloading at the server core, this prototype FPGA accelerator card has six Virtex-6 FPGAs linked together by a PCI-Express switch from PLX Technology. Three of them are fixed into a Supermicro SuperServer designed to accommodate three Tesla GPU coprocessors in the DCCN. This has a pair of six-core Xeon 5600-class processors, as shown in Figure 5. The processor core is depicted in Figure 3, while Figure 6 shows the logical placement. In this case, the server machine has 24 half-wide grid sockets. This pattern allows the x86 server processors (grouped into two) to fit into the testbed rack enclosure. On the server, the FPGA co-processor in Figure 6 has eight lanes mapping to Mini-SAS xSFF-8088 connectors, with two ports on each FPGA card. This speeds up data cycling and improves the utilization cycles of the CPU. The server has space for a PCI-Express 3.0 peripheral card situated at the back of the server sled. It has two eight-core Xeon CPUs running at 2.1 GHz with 64 GB of main memory (DRAM). For storage, four 2 TB disk drives (4 HDDs) and two 512 GB solid-state disks (2 SSDs) were provided. The server node has a 10 Gb/s Ethernet port with redundant power supplies. Wireless connectivity via the bay ports is provided by default in the DCCN. The DCCN server FSoC accelerator card, configured in a production setup, is distributed across the server cluster infrastructure. In the deployment context for HCCS-DCCN, two sets of cables are used to implement a ring connection with six xSFF-8088 connectors, and eight connectors are used on a ring for duplication/redundancy. With the six adapter cables, the six FPGA cards (in six adjacent server nodes in the server chassis) are mapped to each other with one set of Mini-SAS ports.
The complex arrangement allows eight different groups of FSoC nodes in a 48-node pod to be self-linked using eight adapter cables. During operation, the FSoCs run at 10 Gb/s across all the Ethernet-connected interfaces. Figure 7 shows the Virtex UltraScale VU440 device used for the service processing cores. It provides the highest system performance and bandwidth for large-scale computing, which suits the typical server scenario in Figure 3.

3.4 HCCS FSoC Data-Offloading Algorithm

Algorithm I describes the server interconnection read and write operations with FPGA data offloading. First, after defining the server configuration with its virtualization mappings, a 10 Gb/s link is used for the interconnection in the cluster subnet. An array of user input jobs arriving through a load balancer Lm, with non-zero terms, is defined for the server. To facilitate read operations from the server, the control variables (a, N, i, j) are used to execute successive read operations in matrix form.
Fig. 4. Cloud Computing Spine-Leaf Cluster (DCCN)
Fig. 5. A typical cloud FPGA accelerator network card [38]
Fig. 6. A modified logical interfacing in the DCCN subnet cluster [38]
Fig. 7. An FPGA-based Virtex UltraScale VU440 device architecture for the cloud server board

To complete successfully, the j control checks for equal availability of server CPUs and their VMs. The first step in job processing is to select the shortest path (i.e., the one with the highest throughput) between the user job request and the server VM in the HCCS. As the workload increases, more bandwidth is over-allocated by the hypervisor virtual machine monitor (VMM), which translates into increased throughput along the path. All processed workloads are returned through the shortest path to end-users, and the cycle re-initializes and repeats the read and write operations.

Using Algorithm I, the EDA study was used to explore the capabilities of the Virtex UltraScale FSoC for DCCN data-offloading. In the study, the RAMs on the FSoC are used to store the simulation thread state, and the threads are switched dynamically to keep the data pipelines saturated. This memory strategy is called HCCS host-multithreading for low-latency data-offloading. Its benefits are summarized below.

• Availability of hard-wired DSP blocks with execution units, especially Floating Point Units (FPUs), which otherwise dominate Look-Up Table (LUT) resource consumption. The implication is that by mapping functional units to DSP blocks rather than just LUTs, more resources are reserved for execution timing.

• DRAM accesses are relatively fast on the FSoC. The logic in the FSoC often runs slower than DRAM because of on-chip routing delays. This insight greatly simplifies the host memory system, as large associative caches are not needed for high performance.

The trade-off between QoS performance and FPGA compute resources is the overall server cost budget parameter.

4. Simulation Validation

4.1 Experimental Design Description

First, an FPGA server process model was built for the DCCN VM clusters. This was realized using Riverbed Modeler Academic Edition 17.5 with its C++ libraries (see https://splash.riverbed.com/community/product-lines/steelcentral/university-support-center/blog/2014/06/11/riverbed-modeler-academic-edition-release) as an EDA tool. The implementation was on a heavily modified host-cache design, and the server model supports a full 32-bit OS. At the core, the Virtex UltraScale was emulated into the service processors shown in Figure 4. In the real setup (depicted in Figure 4), the components introduced include the server farm virtual firewall router (SFV), an emulated OpenFlow controller (OC), and the application and profile configuration windows. This test-center configuration sets up the Web, Database, FTP, and Exchange servers, such as DCCN server 1, server 2, server 3, server 4, server 5, ..., N, and six locations with active users. The system servers run on the Virtex UltraScale FPGA target device. With Type-1 virtualization, servers are placed on the DCCN as VM clusters. The VMs connect user tasks to the HCCS, which processes services concurrently. The application (HTTP service) runs on the OpenFlow controller, whose job is to dispatch the requests to the server clusters. This facilitates resource allocation, scheduling, and load balancing in the DCCN. The simulation experiments were performed on an emulated cloud, at the IaaS level, using datacenter cardinality theory. For the DCCN VM clusters, two physical servers (2 x 8-core Xeon 2.1 GHz CPUs, 64 GB DRAM, 4 x 2 TB HDDs, 2 x 512 GB SSDs, 10 Gb/s Ethernet, with Linux and Mac OS) were configured to run on the CPU model.
The VM instances were created according to the workloads per site. For acceleration, Type-1 full/active virtualization, failover, and over-allocation were enabled simultaneously to address the issues highlighted in Section 2.1. The process-model experimental methodology considered four key metrics: service process latency, throughput, resource availability, and resource utilization. The execution time is measured using the timer functions provided by the C++ trace file diagnostic library. The throughput is determined at the destination as the ratio between the amount of data sent from users and the service processing time. Finally, each metric is computed from the Riverbed framework/simulation for DCCN, DCell, and BCube, and each QoS metric is reported in the plots discussed below.

4.2 Performance Evaluation

After setting up three distinct network scenarios (DCell, BCube, and the FPGA-based DCCN), their services and performance were analyzed in a previous study [38]. The first scenario measures the improvements brought to I/O-intensive FPGA applications from a service-process throughput perspective. By adaptively switching end users from the leaf to the spine models, the servers read and process requests concurrently. This occurs within the datacenter management, which replicates processed multicast jobs and transfers them in a pipelined fashion within the deployment. This paper presents a set of results obtained from a QoS comparison among the three networks using the remote cloud storage, which hosts services such as FTP and database/storage.

4.3 Analysis of Non-FPGA Cloud DCNs

The experiment in context focused on the comparative analysis of three distinct DCNs, namely the Spine-Leaf DCCN (proposed), DCell, and BCube, for network throughput, resource availability, and resource utilization. These networks were configured using a scenario-based approach, and the cloud computing application workload is homogeneous. A suitable framework for evaluating the impact of FPGA acceleration on the cloud datacenter is MapReduce [39], which makes cloud-based computation flexible, though with performance trade-offs in the cloud. This work used an emulated cached MapReduce engine [40] and a general-purpose workflow engine [41] to run trace-file statistics. For the three scenarios, the number of mappers (32 MB per job), the data size (1024 MB), and the number of reducers (3) were maintained in all cases.
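Before turning to the results, the latency and throughput metrics defined in Section 4.1 can be illustrated with a minimal instrumentation sketch. The C++ fragment below uses standard timers and computes throughput as the ratio of data sent to service processing time; it stands in for the Riverbed trace-file diagnostic library, whose exact API is not reproduced here, and the byte count is an arbitrary example value.

// Sketch of the Section 4.1 measurements: service time from a steady clock,
// throughput = bits delivered / service processing time.
#include <chrono>
#include <cstddef>
#include <iostream>

struct QoSSample {
    double latency_s;        // service process latency
    double throughput_bps;   // bits delivered per second of service time
};

template <typename ServiceFn>
QoSSample measure(ServiceFn&& service, std::size_t bytes_sent) {
    auto start = std::chrono::steady_clock::now();
    service();  // process the user job (read, offload, write back)
    auto stop  = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(stop - start).count();
    return { secs, (bytes_sent * 8.0) / secs };
}

int main() {
    auto sample = measure([] { /* emulated service processing */ }, 32 * 1024 * 1024);
    std::cout << "latency(s)=" << sample.latency_s
              << " throughput(bps)=" << sample.throughput_bps << "\n";
}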
Algorithm I: DCCN Server Read/Write Operations

procedure FPGA_DataOffloading_ReadWrite   // use the FPGA to carry out data offload via read and write operations
  define DCCN-Server I/O                  // a distributed cloud computing server must have well-defined inputs and outputs
  program ServerMatrix(Input, Output)
  const MaxS = S(n+1)                     // the recursive server chain ensures that server redundancies are maintained
  while j <= K do
    S.Vm = V(m+1)                         // recursive server virtual instances for internal server resources (I/O, RAM, CPU, etc.)
    FPGA_acceleration = ShortestPathJobOffload   // initialization
    set link = 10 Gb/s                    // interconnection links
    if Var = Var + 1 then                 // Var allocates memory spaces on the CPU for read/write operations, provided they are not used up by the CPU
      sort with FSoC
      var P, Q, R, N(k+1): array[0..MaxN, 0..MaxN] of real (non-zero terms)
      a, N, i, j: integer                 // a = security term; N, i, j = control loop variables
    end if
  end while
  begin                                   // read operation
    readln(N)                             // read user jobs from the CPU
    while i <= j do
      for i := 0 to N-1 do for j := 0 to N-1 do read(P[i])       // read the job/task requests
      for i := 0 to N-1 do for j := 0 to N-1 do read(Q[i])
      for i := 0 to N-1 do for j := 0 to N-1 do read(R[i])
      for i := 0 to N-1 do for j := 0 to N-1 do read(N(k+1)[i])
      for i := 0 to N-1 do for j := 0 to N-1 do r[i] := P[i] + Q[i] + R[i] + ... + N(k+1)[i]
      for i := 0 to N-1 do for j := 0 to N do
        if j = S(n+1) then                // get the job request thread with maximum throughput
          while j := i+1 to N do
            if a[j] != a[MinSec] then
              S(n+1) != 1
              DataOffload >= 0            // server CPU
            else
              return
            end if
          end while
        end if
      TransferJob.ShortestPath = NextPath // the recursive CPU server chain ensures that workloads are transferred along the shortest path
    end while
  end
end procedure

Fig. 8. HCCS-DCCN Read/Write Algorithm
Fig. 9. Throughput Stability Response
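Algorithm I is expressed in Pascal-like pseudocode. The C++ sketch below captures only its central offloading decision: each user job is dispatched to the server VM whose path currently offers the highest throughput (the "shortest path" in the algorithm), with the hypervisor's over-allocation reduced to a simple bandwidth increment. The structures and constants are illustrative assumptions, not the DCCN implementation.

// Sketch of the offload decision: read user jobs, send each job to the VM
// reachable over the path with the highest available throughput, and let the
// hypervisor (VMM) over-allocate bandwidth on that path as load grows.
#include <cstddef>
#include <vector>

struct ServerVm {
    double path_throughput_gbps;  // current available throughput to this VM
    double allocated_gbps = 0.0;  // bandwidth granted by the hypervisor (VMM)
};

struct Job { std::size_t bytes; };

// Pick the VM on the "shortest path", i.e. the path with maximum throughput.
std::size_t best_vm(const std::vector<ServerVm>& vms) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < vms.size(); ++i)
        if (vms[i].path_throughput_gbps > vms[best].path_throughput_gbps)
            best = i;
    return best;
}

void offload(const std::vector<Job>& jobs, std::vector<ServerVm>& vms) {
    for (const Job& j : jobs) {
        std::size_t v = best_vm(vms);                       // read/select step
        vms[v].allocated_gbps += j.bytes * 8.0 / 1e9;       // over-allocation as load rises
        vms[v].path_throughput_gbps -= 0.1;                 // crude model of the added load
    }
}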
Fig. 10. Cloud Server Utilization Response

From Figure 9, it was observed that the proposed FSoC-DCCN had a comparatively better throughput owing to its optimal virtual-instance allocation coordinator. In this regard, the average throughput stability responses for DCCN, DCell, and BCube are 40.00%, 33.33%, and 26.67%, respectively. With Type-1 virtualization of the Spine-Leaf DCCN server cores alongside the FSoC acceleration, this relative performance is feasible. Scientific workflows running in large, geographically distributed, and highly dynamic computing environments can efficiently use the FSoC-DCCN, because FSoC-based platforms can effectively satisfy throughput stability requirements in a production deployment. In Figure 10, resource availability refers to the ability to access the FSoC-DCCN server clusters on demand while completing the job requests. The complexity of the cloud datacenter architecture and its overall infrastructure makes resource utilization another important parameter. It was observed that the proposed FSoC-DCCN offered better resource utilization (for the workloads) compared with the BCube and DCell scenarios. When all existing resources in the FSoC-DCCN server clusters are used up by means of over-allocation, additional resources can be reserved for high-priority jobs that arrive. In context, when a job arrives, the availability of the VM is guaranteed; the issue is the availability of resources to execute the job. If the VM is available, then the job is allowed to run on the VM via dynamic allocation, considering the network density. This occurs only for Type-1 virtualization on the cloud DCN Spine-Leaf model. It was shown that the proposed scheme had about 58.06% resource utilization (i.e., when logically isolated with FPGA device cores), while the others offered 38.71% (BCube) and 3.23% (DCell), respectively (i.e., when not logically isolated with FPGA cores). The implication is that FPGA-based DCCNs will offload tasks from server processors more frequently than other accelerator options, since cloud service processing rates are high. It also implies that the proposed model will offer fairly good resource availability, leading to enhanced performance. This makes it more attractive in hyperscale datacenters for Warehouse Scaled Computers (WSC). Hence, VM-based cloud networks, particularly the cell-based and Spine-Leaf WSC, can benefit from this advantage.

From the plots in Figures 9 and 10, network infrastructure that processes bandwidth-intensive applications will scale optimally with the FSoC. This is because a key potential benefit of the integrated processor and FPGA system is the ability to boost system performance by accelerating compute-intensive functions in FPGA logic (i.e., hardware acceleration and cache coherency) while making more resources available. Processor performance is improved by the FSoC co-processing roles, from computing cyclic redundancy checks (CRC) to offloading the entire TCP/IP stack. When the FPGA-based accelerator produces a new result, the data needs to be passed back to the processor as quickly as possible, so that the processor can update its view of the data.
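CRC computation is cited above as a typical function moved from the server CPU to the FPGA. The routine below is a plain software reference for CRC-32 (the reflected IEEE 802.3 polynomial 0xEDB88320); it is the kind of per-byte loop an FSoC offload would replace, and it is included only for illustration rather than being taken from the EETACP design.

// Software reference for CRC-32 (reflected IEEE 802.3 polynomial).
// A hardware offload would compute the same value for the same payload.
#include <cstddef>
#include <cstdint>

uint32_t crc32_ieee(const uint8_t* data, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

Comparing this reference against the accelerator's output also gives a simple correctness check when validating an FPGA offload path.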
As a validation, a network case of 1,632 servers with FPGAs running an enterprise web search service was analyzed in Figure 11, which shows improved throughput with FPGA acceleration compared with the case without FPGA acceleration in terms of query latency responses. Another key benefit of integrating a General-Purpose Processor (GPP) and an FPGA on the same silicon real estate is the ability to accelerate system performance by offloading critical functions to the FPGA. Transferring the data quickly and coherently is key to realizing a performance boost in cloud-based networks. Datacenter network optimization with FPGA acceleration improves bandwidth efficiency while satisfying QoS metrics. Network equipment embedded with FPGA processors can eliminate performance bottlenecks that software-driven processors cannot overcome. Smart computing and intelligence applications with massive workloads will benefit from this alternative.

5. Conclusion

This paper has presented a super-scalar cloud datacenter network built with FPGA core support. It offers excellent throughput, low latency, and good resource utilization when compared with the DCell and BCube datacenter networks. Hence, offloading key functions from the processor to the FPGA can result in substantial improvement in system performance while reducing system power drain. As observed in existing Warehouse Scaled Computers (WSCs), high-speed, low-latency interconnects between the processors and the FSoC are necessary for optimal performance.
Fig. 11. FPGA Query Latency Behavior [38]

The proposed datacenter network offers memory coherency through the use of FPGA acceleration coherency. With this, issues of bandwidth, performance, integration, and power requirements are addressed. In highly dynamic environments, various types of computing workloads (such as databases, big data analytics, and high-performance computing) can be improved using the proposed FPGA acceleration in the Spine-Leaf datacenter model. As more and more workloads are deployed in the cloud, it is appropriate to consider how to make FPGAs and their capabilities available in the cloud. Hence, the proposed system offers a low-latency path from the network interface to the consuming process, irrespective of network workloads. As a proof of concept and validation, a micro-testbed setup on a real-life datacenter was explored. The work used the DCCN to model a datacenter Spine-Leaf architecture running traffic patterns sampled from the Riverbed application engine on top of Linux-KVM and the Virtex UltraScale FPGA target device. This enables isolation between multiple processes in multiple VMs, supporting accurate acceleration, resource allocation, and priority-based workload scheduling for QoS. The results from the FPGA DCCN offloading strategy in Spine-Leaf designs show that Type-1 virtualization influences resource allocation and scheduling. With FPGA acceleration, the performance of cloud computing systems, particularly in QoS contexts, is enhanced. Consequently, newer processors can use FPGAs to accelerate applications (workload optimization). Furthermore, with WSC (FPGA-based servers), the Central Processing Unit (CPU) of Spine-Leaf topologies can easily offload tasks to FPGA device architectures for hardware acceleration. The conclusion is that global deployment of FPGA-based cloud datacenters will enable large-scale scientific workflows to improve performance and deliver fast responses regarding QoS. Future work will focus on mathematical modeling and state analysis of Markovian queues with working vacation on heterogeneous FPGA cloud-based servers. The work will also investigate power drain in high-density networks, chip area, and comparisons with GPUs and other accelerators.

ACKNOWLEDGMENTS. We wish to specially thank the Cloud Computing and Distributed Systems (CLOUDS) Laboratory at the University of Melbourne, Australia; the Department of Electronic Engineering, UNN; the Center for Basic Space Science, UNN; the Energy Commission of Nigeria, NCERD-UNN; and the National Agency for Science and Engineering Infrastructure (NASENI) for their immense support in the course of this research work.

References
1. Microsoft (2016) Azure success stories. Online: http://www.windowsazure.com/en-us/case-studies/archive/.
2. Lu G et al. (2011) ServerSwitch: A programmable and high performance platform for data center networks in Proc. of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI'11).
3. Okafor K, Ugwoke F, Obayi AA, Chijindu V, Oparaku O (2016) Analysis of cloud network management using resource allocation and task scheduling services. International Journal of Advanced Computer Science & Applications 1(7):375–386.
4. Guo C et al. (2008) DCell: A scalable and fault-tolerant network structure for data centers. ACM SIGCOMM Computer Communication Review 38(4):75–86.
5. Al-Fares M, Loukissas A, Vahdat A (2008) A scalable, commodity data center network architecture.
SIGCOMM Comput. Commun. Rev. 38(4):63–74.
6. Guo C et al. (2009) BCube: A high performance, server-centric network architecture for modular data centers. ACM SIGCOMM Computer Communication Review 39(4):63–74.
7. Greenberg A et al. (2009) VL2: A scalable and flexible data center network. ACM SIGCOMM Computer Communication Review 39(4):51–62.
8. Niranjan Mysore R et al. (2009) PortLand: A scalable fault-tolerant layer 2 data center network fabric. SIGCOMM Comput. Commun. Rev. 39(4):39–50.
9. D KC (2016) Ph.D. thesis (University of Nigeria, Nsukka).
10. Okafor K, Nwaodo T (2012) A synthesis VLAN approach to congestion management in datacenter internet networks. International Journal of Electronics and Telecommunication System Research 5(6):86–92.
11. Alizadeh M et al. (2008) Data center transport mechanisms: Congestion control theory and IEEE standardization in 46th Annual Allerton Conference on Communication, Control, and Computing, 2008. (IEEE), pp. 1270–1277.
12. Al-Fares M, Radhakrishnan S, Raghavan B, Huang N, Vahdat A (2010) Hedera: Dynamic flow scheduling for data center networks in NSDI. Vol. 10, pp. 19–19.
13. Wood T (2011) Ph.D. thesis (University of Massachusetts Amherst).
14. Guo C et al. (2010) SecondNet: A data center network virtualization architecture with bandwidth guarantees in Proceedings of the 6th International Conference. (ACM), p. 15.
15. Shieh A, Kandula S, Sirer EG (2010) SideCar: Building
programmable datacenter networks without programmable switches in Proceedings of the 9th ACM SIGCOMM Workshop on Hot Topics in Networks. (ACM), p. 21.
16. Abu-Libdeh H, Costa P, Rowstron A, O'Shea G, Donnelly A (2010) Symbiotic routing in future data centers. ACM SIGCOMM Computer Communication Review 40(4):51–62.
17. Arista (2015) Arista universal cloud network white paper (https://www.arista.com).
18. Cisco (2013) Cisco FabricPath technology and design, BRKDCT-2081 (http://www.valleytalk.org/wp-content/uploads/2013/08/BRKDCT-2081-Cisco-FabricPath-Technology-and-Design.pdf).
19. Xilinx (2016) Field programmable gate array (FPGA) (https://www.xilinx.com/training/fpga/fpga-field-programmable-gate-array.htm).
20. Goldberg RP (1973) Architecture of virtual machines in Proceedings of the Workshop on Virtual Computer Systems. (ACM), pp. 74–112.
21. Kohler E, Morris R, Chen B, Jannotti J, Kaashoek MF (2000) The Click modular router. ACM Transactions on Computer Systems (TOCS) 18(3):263–297.
22. Dobrescu M et al. (2009) RouteBricks: Exploiting parallelism to scale software routers in Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles. (ACM), pp. 15–28.
23. Naous J, Gibb G, Bolouki S, McKeown N (2008) NetFPGA: Reusable router architecture for experimental research in Proceedings of the ACM Workshop on Programmable Routers for Extensible Services of Tomorrow. (ACM), pp. 1–7.
24. Yang R, Wang J, Clement B, Mansour A (2013) FPGA implementation of a parameterized Fourier synthesizer in Electronics, Circuits, and Systems (ICECS), 2013 IEEE 20th International Conference on. (IEEE), pp. 473–476.
25. Kliegl M et al. (2010) Generalized DCell structure for load-balanced data center networks in INFOCOM IEEE Conference on Computer Communications Workshops, 2010. (IEEE), pp. 1–5.
26. Overholt M, Wang S (2016) Modularized data center cube (http://pbg.cs.illinois.edu/courses/cs538fa11/lectures/17-Mark-Shiguang.pdf).
27. Udeze C, Okafor K, Okezie C, Okeke I, Ezekwe C (2014) Performance analysis of R-DCN architecture for next generation web application integration in 2014 IEEE 6th International Conference on Adaptive Science & Technology (ICAST). (IEEE), pp. 1–12.
28. Farrington N et al. (2010) Helios: A hybrid electrical/optical switch architecture for modular data centers. ACM SIGCOMM Computer Communication Review 40(4):339–350.
29. Wang G et al. (2010) c-Through: Part-time optics in data centers. SIGCOMM Comput. Commun. Rev. 41(4):–.
30. Tan Z (2013) Ph.D. thesis (Department of Electrical Engineering and Computer Sciences, University of California, Berkeley).
31. Cisco (2012) Cisco's massively scalable data center network fabric for warehouse scale computers, (Cisco), Technical report.
32. Alcatel-Lucent (2013) Data center converged solutions design guide, (Alcatel-Lucent), Technical report.
33. Tan Z, Qian Z, Chen X, Asanović K, Patterson D (2013) DIABLO: Simulating datacenter networks at scale using FPGAs, (ASPIRE, UC Berkeley), Technical report.
34. Putnam A et al. (2015) A reconfigurable fabric for accelerating large-scale datacenter services. IEEE Micro 35(3):10–22.
35. Joost R, Salomon R (2005) Advantages of FPGA-based multiprocessor systems in industrial applications in 31st Annual Conference of IEEE Industrial Electronics Society, 2005 (IECON 2005). (IEEE), 6 pp.
36. Savaš E, Tenca AF, Koç CK (2000) A scalable and unified multiplier architecture for finite fields GF(p) and GF(2^m) in International Workshop on Cryptographic Hardware and Embedded Systems. (Springer), pp. 277–292.
37. Okafor KC, Ezeha G, Achumba IE, Okezie C, Diala UH (2015) Harnessing FPGA processor cores in evolving cloud based datacenter network designs (DCCN) in Proc. 12th International Conference of the Nigeria Computer Society: Information Technology for Inclusive Development. (Nigerian Computer Society), pp. 1–14.
38. Morgan TP (2014) How Microsoft is using FPGAs to speed up Bing search (http://www.enterprisetech.com/2014/09/03/microsoft-using-fpgas-speed-bing-search/).
39. Dean J, Ghemawat S (2008) MapReduce: Simplified data processing on large clusters. Communications of the ACM 51(1):107–113.
40. Chauhan A, Fontama V, Hart M, Tok WH, Buck W (2014) Introducing Microsoft Azure HDInsight: Technical Overview. (Microsoft Press).
41. Simmhan Y, Van Ingen C, Subramanian G, Li J (2010) Bridging the gap between desktop and the cloud for eScience applications in 2010 IEEE 3rd International Conference on Cloud Computing. (IEEE), pp. 474–481.