This document discusses application behavior-aware flow control in network-on-chip (NoC) architectures. The author proposes a proactive approach to congestion detection that predicts changes in global, end-to-end network traffic patterns of running applications. Traffic prediction is based on a table-driven predictor that uses application communication patterns. The flow control algorithm schedules packet injection to avoid predicted network congestion and improve throughput. Simulation results on SPLASH-2 benchmarks and synthetic traffic show significant performance improvements with negligible overhead.
PERFORMANCE EVALUATION OF SELECTED E2E TCP CONGESTION CONTROL MECHANISM OVER ... (ijwmn)
TCP is one of the main protocols governing Internet traffic today. However, it suffers significant
performance degradation over wireless links. Since wireless networks now lead communication
technologies, it is imperative to introduce effective TCP congestion control mechanisms
for such networks. This research discusses four end-to-end TCP implementations:
TCP Westwood, Hybla, Highspeed, and NewReno. The performance of these variants is compared
in an emulated LTE environment in terms of throughput, delay, and fairness, with the ns-3 simulator used to
simulate the LTE network environment. The simulation results show that TCP Highspeed achieves the
best throughput. Although TCP Westwood recorded the lowest latency values compared to the others,
it behaved unfairly among different traffic flows. Moreover, TCP Hybla demonstrated the best fairness
behaviour among the TCP variants.
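The abstract above compares fairness across competing flows without naming a metric. A standard choice for such studies (an assumption here, not stated in the paper) is Jain's fairness index, which is 1.0 for a perfectly fair allocation and approaches 1/n when one flow dominates:

```python
def jain_fairness(throughputs):
    """Jain's fairness index over per-flow throughputs.

    Returns 1.0 for a perfectly even allocation; 1/n when a single
    flow takes everything.
    """
    n = len(throughputs)
    total = sum(throughputs)
    return total * total / (n * sum(x * x for x in throughputs))

# Even split is perfectly fair; one starved flow halves the index for n=2.
print(jain_fairness([10.0, 10.0, 10.0]))  # 1.0
print(jain_fairness([10.0, 0.0]))         # 0.5
```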
Differentiated Classes of Service and Flow Management using a Hybrid Broker (IDES Editor)
Mobile networks have recently been overloaded
with a considerable amount of data traffic. This paper
proposes a management service for mobile environments,
using policies and quality metrics, that ensures better usage
of network resources through finer-grained management
based on flows with different classes of service and
transmission rates. This flow management is supported
by an innovative closed control loop between a flexible
brokerage service in the network and agents at the mobile
terminals. It also allows the terminals to make well-informed
decisions about their connections, increasing the number of
connected flows per technology and the individual service level
offered to each flow. Our results indicate that the proposed
solution optimizes the usage of available 4G network resources
among a high number of differentiated flows in several
scenarios where access technologies are extremely overloaded,
whilst protecting, through a low-complexity scheme, the flows
of users who have signed more expensive
contracts with their network operators.
Improving Performance of TCP in Wireless Environment using TCP-P (IDES Editor)
Improving the performance of the Transmission
Control Protocol (TCP) in wireless environments has been an
active research area. The main reason behind TCP's performance
degradation is its inability to detect the actual cause
of packet losses in a wireless environment. In this paper, we
provide simulation results for TCP-P (TCP-Performance).
TCP-P is an intelligent protocol for wireless environments that
can distinguish the actual causes of packet losses and
apply an appropriate remedy to each.
TCP-P deals with three main issues: congestion in the
network, disconnection in the network, and random packet losses.
TCP-P consists of a congestion avoidance algorithm and
a disconnection detection algorithm, with some changes to the TCP
header. If congestion occurs in the network, the
congestion avoidance algorithm is applied: TCP-P counts the
packets sent and the acknowledgements received, and accordingly sets
a sending buffer value so that it can prevent
congestion from arising. In the disconnection detection algorithm,
TCP-P senses the medium continuously to detect an occurring
disconnection in the network. TCP-P also modifies the TCP packet
header so that a lost packet can itself notify the sender that it was
lost. This paper describes the design of TCP-P and presents
results from experiments using the NS-2 network simulator.
Simulations show that TCP-P is 4% more
efficient than TCP-Tahoe, 5% more efficient than TCP-Vegas,
7% more efficient than TCP-Sack, and equal in
performance to TCP-Reno and TCP-NewReno. TCP-P can
nevertheless be considered more effective than TCP-Reno and TCP-NewReno,
since it solves more of TCP's issues in a wireless
environment.
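The congestion avoidance idea described above (count packets sent versus ACKs received, then resize the sending buffer) can be sketched as follows. This is a toy illustration, not the paper's implementation; all class and method names are hypothetical:

```python
class CongestionAvoider:
    """Toy sketch of a TCP-P-style congestion avoidance rule (hypothetical
    names): compare packets sent with ACKs received and shrink the sending
    buffer when too many packets are unacknowledged."""

    def __init__(self, buffer_size=64):
        self.buffer_size = buffer_size
        self.sent = 0
        self.acked = 0

    def on_send(self):
        self.sent += 1

    def on_ack(self):
        self.acked += 1

    def adjust(self):
        in_flight = self.sent - self.acked
        if in_flight > self.buffer_size // 2:
            # Many unacknowledged packets: back off to avoid congestion.
            self.buffer_size = max(1, self.buffer_size // 2)
        else:
            # Network is keeping up: probe capacity gently.
            self.buffer_size += 1
        return self.buffer_size
```

For example, 40 packets in flight against a 64-packet buffer halves the buffer to 32; once all ACKs arrive, the buffer grows by one per adjustment.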
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr... (IDES Editor)
In this paper, we propose a novel architectural
technique that can be used to boost the performance of modern
processors. It is especially useful in certain code constructs
such as small loops and try-catch blocks. The technique
improves performance by reducing the number of
instructions that need to enter the pipeline at all. We also
demonstrate it working in a scalar pipelined soft-core
processor developed by us. Lastly, we present how a superscalar
microprocessor can take advantage of this technique to
increase its performance.
Abstract - The Transmission Control Protocol (TCP) is a
connection-oriented, reliable, end-to-end protocol that supports
flow and congestion control. With the evolution and rapid growth
of the Internet and the emergence of the Internet of Things (IoT), flow and
congestion have a clear impact on network performance. In this
paper we study the congestion control mechanisms Tahoe, Reno,
NewReno, SACK, and Vegas, which were introduced to control
network utilization and increase throughput. In the performance
evaluation we measure metrics such as
throughput, packet loss, and delivery, and reveal the impact of the congestion window (cwnd).
SACK performed better than NewReno in terms of
number of packets sent, throughput, and delivery ratio,
while Vegas showed the best performance of all.
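The cwnd dynamics these variants share can be summarized by Reno's two rules: slow start below the slow-start threshold, then additive increase of one segment per round trip, with a multiplicative halving on loss. A minimal sketch (segment-counted cwnd, simplified from RFC 5681):

```python
def reno_on_ack(cwnd, ssthresh):
    """cwnd update on one new ACK, in segments (simplified TCP Reno):
    exponential growth per RTT during slow start, ~+1 segment per RTT
    during congestion avoidance."""
    if cwnd < ssthresh:
        return cwnd + 1.0        # slow start
    return cwnd + 1.0 / cwnd     # congestion avoidance (AIMD increase)

def reno_on_loss(cwnd):
    """Multiplicative decrease after a fast-retransmit loss event."""
    return max(cwnd / 2.0, 1.0)
```

Tahoe differs by resetting cwnd to one segment on any loss, while Vegas adjusts cwnd from RTT measurements rather than loss events.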
Performance Evaluation of TCP Sack1 in WiMAX Network Asymmetry (eSAT Journals)
Abstract: The WiMAX technology supports different channel bandwidths, cyclic prefixes, modulation and coding schemes, frame durations, simultaneous two-way data transfer, and propagation models. WiMAX network asymmetry largely depends on the DL:UL ratio. This paper evaluates the performance of TCP Sack1 in a WiMAX network under network asymmetry by varying MAC-layer parameters (channel bandwidth, cyclic prefix, modulation and coding scheme, frame duration, DL:UL ratio), physical-layer parameters (propagation model and full-duplex data transfer), and other operating parameters such as downloading traffic, all of which markedly affect TCP Sack1's performance. Performance is measured in terms of throughput, goodput, and number of packets dropped. Keywords: Worldwide Interoperability for Microwave Access (WiMAX), Subscriber Stations (SSs), Downlink (DL), Uplink (UL), Medium Access Control (MAC), Transmission Control Protocol (TCP), OFDM, IEEE 802.16, Throughput, Goodput, Packets drop
International Journal of Engineering Research and Applications (IJERA) is an open access online peer reviewed international journal that publishes research and review articles in the fields of Computer Science, Neural Networks, Electrical Engineering, Software Engineering, Information Technology, Mechanical Engineering, Chemical Engineering, Plastic Engineering, Food Technology, Textile Engineering, Nano Technology & science, Power Electronics, Electronics & Communication Engineering, Computational mathematics, Image processing, Civil Engineering, Structural Engineering, Environmental Engineering, VLSI Testing & Low Power VLSI Design etc.
Abstract
The rapid growth of the Internet of Things (IoT)
has made our lives more intelligent and smart.
Developments in Wireless Sensor Networks
(WSNs), together with the wide use of embedded devices
in areas such as industry, home automation,
transport, agriculture, and health care, led
the Routing Over Low-power and Lossy networks
(ROLL) working group to introduce the IPv6 Routing
Protocol for Low-Power and Lossy Networks (RPL).
RPL nodes organize their topology as a
Directed Acyclic Graph (DAG) terminated at one
root, forming Destination Oriented DAGs
(DODAGs). In this paper, using InstantContiki3.0
and the Cooja GUI, we analyze DODAG formation,
the RPL control messages sent along downward
and upward routes to construct and maintain
the DODAG, and the Rank computation by the Objective
Function (OF) for inconsistency and loop detection.
We also evaluate the performance of RPL based
on the Expected Transmission Count (ETX) OF, which
enables RPL to select and optimize routes within an RPL
instance, and we evaluate the following metrics:
the ETX Reliability Object (ETX), Radio Duty Cycle
(RDC), energy consumption, packets received by
the motes, and neighbor count. The simulation results
show that the RPL control messages flow in a consistent
manner and the DODAG root is able to connect to all of the
neighboring motes; the Rank illustration shows no loops
and a consistent DODAG topology. The ETX
essentially controls DODAG formation
and affects the RDC ratio. Furthermore,
most of the motes show reasonably low power
consumption and an acceptable number
of received packets.
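The Rank computation under an ETX objective function described above can be illustrated with a simplified MRHOF-style parent selection: a node picks the candidate parent minimizing parent Rank plus link ETX, and that minimum becomes the node's own Rank. This is a sketch of the RFC 6719 idea only; real MRHOF adds hysteresis and rank thresholds:

```python
def select_parent(candidates):
    """Simplified MRHOF-style parent selection for RPL.

    Each candidate is a dict with its advertised 'rank' and the link 'etx'
    (both in RPL's scaled units). Returns the chosen parent and the
    resulting Rank of this node.
    """
    best = min(candidates, key=lambda c: c["rank"] + c["etx"])
    return best, best["rank"] + best["etx"]

# A worse parent over a good link can beat a better parent over a lossy link.
parent, rank = select_parent([
    {"id": "a", "rank": 256, "etx": 512},   # close to root, lossy link
    {"id": "b", "rank": 512, "etx": 128},   # farther from root, clean link
])
```

Because Rank increases monotonically away from the root, this rule is also what lets RPL detect loops: a node must never choose a parent whose Rank is not lower than its own.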
Abstract— The Internet of Things (IoT) is a new networking paradigm
in which billions of Internet things can be connected at any time and
any place; it is expected to include billions of smart devices
characterized by small memory, low transfer rates,
and low energy. Internet Protocol version 6 (IPv6) was
introduced to offer a huge address space, but it is not
compatible with the capabilities of constrained devices. The
IPv6 over Low-power Wireless Personal Area Network
(6LoWPAN) adaptation layer was therefore introduced to carry IPv6
datagrams over constrained links. In this paper we first provide
an intensive analysis of the 6LoWPAN specifications, including IPv6
encapsulation, frame format, 6LoWPAN header compression,
fragmentation of the payload datagram, and encoding of the User
Datagram Protocol (UDP), together with an implementation of
6LoWPAN in NS-3 using different payload sizes. We then
evaluate throughput, packet loss, delay,
and jitter. The results show that fragmentation affects
network throughput and increases the delay and the number of
lost packets; when the payload fits within a single frame, the
network performs better, with no packets lost and
minimum delay and jitter. In both
cases 6LoWPAN shows a reasonable packet delivery ratio.
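The fragmentation whose cost is measured above works by splitting an IPv6 datagram into link-sized fragments that each carry the total datagram size, a shared tag, and an offset in 8-byte units. A simplified sketch of the RFC 4944 scheme (dict fields stand in for the FRAG1/FRAGN header bits):

```python
def fragment(payload, mtu_payload=96, tag=0):
    """Split a datagram 6LoWPAN-style (simplified from RFC 4944).

    Every fragment except possibly the last carries mtu_payload bytes,
    which must be a multiple of 8 so offsets fit in 8-byte units.
    """
    assert mtu_payload % 8 == 0, "fragment payloads are 8-byte aligned"
    frags = []
    offset = 0
    while offset < len(payload):
        chunk = payload[offset:offset + mtu_payload]
        frags.append({
            "datagram_size": len(payload),  # total size, in every fragment
            "tag": tag,                     # links fragments of one datagram
            "offset8": offset // 8,         # offset in 8-byte units
            "data": chunk,
        })
        offset += len(chunk)
    return frags
```

A 200-byte payload with 96 bytes of room per frame yields three fragments; the receiver reassembles by tag and offset, which is exactly the per-fragment overhead (and loss amplification) the abstract's measurements reflect.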
An Implementation and Analysis of RTS/CTS Mechanism for Data Transfer in Wire... (iosrjce)
In this paper, the implementation and analysis of the RTS/CTS mechanism for data transfer in wireless
networks is studied. The Request-To-Send (RTS) and Clear-To-Send (CTS) mechanism is widely used in
wireless networks to avoid collisions due to hidden nodes by reserving the channel for transmitting data from
source to destination. Collisions caused by hidden nodes reduce network throughput and efficiency.
In the RTS/CTS mechanism, RTS/CTS packets set a timer for the neighboring nodes so that they defer their
transmissions for the entire data transmission period. But the intended
transmission may complete before this timer expires, introducing unnecessary delay. To reduce this
delay, the methodology proposed in this paper adds RTR (Ready-To-Receive) packets alongside RTS/CTS:
the receiving node sends an RTR packet to notify all neighboring nodes that the intended
communication has finished. The results show that this method improves the data transfer rate, yielding
higher throughput and network efficiency, which is reflected in the overall
information transfer time.
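The deferral-and-early-release behaviour described above can be modelled in a few lines. This is an illustrative sketch of the idea, not the paper's implementation; the class and method names are hypothetical:

```python
class Neighbor:
    """Toy model of a neighbor's channel deferral (NAV-style timer).

    An overheard RTS/CTS reserves the channel until the announced duration
    elapses; the proposed RTR frame from the receiver releases neighbors
    early once the transfer is actually done.
    """

    def __init__(self):
        self.nav_until = 0.0  # time until which this node must defer

    def on_rts_cts(self, now, duration):
        self.nav_until = now + duration   # defer for the announced period

    def on_rtr(self, now):
        self.nav_until = now              # transfer finished: release early

    def can_transmit(self, now):
        return now >= self.nav_until
```

Without RTR, a neighbor stays silent until the full reserved duration expires even if the transfer finished early; with RTR, it regains the channel at the moment the receiver announces completion, which is the source of the throughput gain the abstract reports.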
Area-Efficient Design of Scheduler for Routing Node of Network-On-Chip (VLSICS Design)
Traditional System-on-Chip (SoC) design employed shared buses for data transfer among various subsystems. As SoCs become more complex, involving a larger number of subsystems, traditional bus-based architecture is giving way to a new paradigm for on-chip communication. This paradigm is called Network-on-Chip (NoC). A communication network of point-to-point links and routing switches is used to facilitate communication between subsystems. The routing switch proposed in this paper consists of four components, namely the input ports, output ports, switching fabric, and scheduler. The scheduler design is described in this paper. The function of the scheduler is to arbitrate between requests by data packets for use of the switching fabric. The scheduler uses an improved round-robin-based arbitration algorithm. Due to the symmetric structure of the scheduler, an area-efficient design is proposed by folding the scheduler onto itself, thereby reducing its area roughly by 50%.
Network on Chip Architecture and Routing Techniques: A Survey (IJRES Journal)
Processors were designed to perform various complex logical information-exchange and processing operations at a variety of scales. They rely mainly on concurrency and synchronization, in both software and hardware, to enhance productivity and performance. As high-speed growth approaches the multi-billion-transistor integration era, some of the main problems, arising at gate lengths in the range of 60-90 nm, will come from non-scalable wire delays. Such problems may be addressed by Network on Chip (NoC) systems. In this paper, we summarize research papers and contributions in the NoC area. With advances in on-chip communication technology, faster interaction between devices is becoming vital, and NoC can be one of the solutions for faster on-chip communication. Efficient links between NoC devices require routers, so this paper also reviews implementations of routing techniques. Routing gives the higher throughput required to deal with the complexity of modern systems. The survey focuses on routing design parameters at both the system level, including traffic pattern, network topology, and routing algorithm, and the architecture level, including the arbitration algorithm.
Low Power Network on Chip Architectures: A Survey (CSIT, iaesprime)
Most communication nowadays is done through system-on-chip (SoC) models, so network-on-chip (NoC) architecture is the most appropriate solution for better performance. However, one of the major flaws of this architecture is power consumption. To gain high performance from this type of architecture, power consumption must be accounted for at design time, and power use should be diminished in every region of the NoC architecture. The remaining power consumption can be reduced through alterations to the network routers and the other devices that form the network. This research focuses on state-of-the-art methods for designing NoC architectures and on techniques to reduce their power consumption, covering the network architecture, the links between nodes, the network design, and the routers.
ANALYSIS OF LINK STATE RESOURCE RESERVATION PROTOCOL FOR CONGESTION MANAGEMEN...ijgca
With the wide spread of WiFi hotspots, concentrated traffic workload on Smart Web (SW) can slow down the network performance. This paper presents a congestion management strategy considering real time activities in today’s smart web. With the SW context, cooperative packet recovery using resource reservation procedure for TCP flows was adapted for mitigating packet losses. This is to maintain data consistency between various access points of smart web hotspot. Using a real world scenario, it was confirmed that generic TCP cannot handle traffic congestion in a SW hotspot network. With TCP in scalable workload environments, continuous packet drops at the event of congestion remains obvious. This is unacceptable for mission critical domains. An enhanced Link State Resource Reservation Protocol (LSRSVP) which serves as dynamic feedback mechanism in smart web hotspots is presented. The contextual behaviour was contrasted with the generic TCP model. For the LS-RSVP, a simulation experiment for TCP connection between servers at the remote core layer and the access layer was carried out while using selected benchmark metrics. From the results, under realistic workloads, a steady-state throughput response was achieved by TCP LS-RSVP to about 3650Bits/secs compared with generic TCP plots in a previous study. Considering network service availability, this was found to be dependent on fault-tolerance of the hotspot network. From study, a high peak threshold of 0.009 (i.e. 90%) was observed. This shows fairly acceptable service availability behaviour compared with the existing TCP schemes. For packet drop effects, an analysis on the network behaviour with respect to the LS-RSVP yielded a drop response of about 0.000106 bits/sec which is much lower compared with the case with generic TCP with over 0.38 bits/sec. The latency profile of average FTP download response was found to be 0.030secs, but with that of FTP upload response, this yielded about 0.028 sec. 
The results from the study demonstrate efficiency and optimality for realistic loads in Smart web contexts.
Analysis of Link State Resource Reservation Protocol for Congestion Managemen...ijgca
With the wide spread of WiFi hotspots, concentrated traffic workload on Smart Web (SW) can slow down
the network performance. This paper presents a congestion management strategy considering real time
activities in today’s smart web. With the SW context, cooperative packet recovery using resource
reservation procedure for TCP flows was adapted for mitigating packet losses. This is to maintain data
consistency between various access points of smart web hotspot. Using a real world scenario, it was
confirmed that generic TCP cannot handle traffic congestion in a SW hotspot network. With TCP in
scalable workload environments, continuous packet drops at the event of congestion remains obvious. This
is unacceptable for mission critical domains. An enhanced Link State Resource Reservation Protocol (LSRSVP)
which serves as dynamic feedback mechanism in smart web hotspots is presented. The contextual
behaviour was contrasted with the generic TCP model. For the LS-RSVP, a simulation experiment for TCP
connection between servers at the remote core layer and the access layer was carried out while using
selected benchmark metrics. From the results, under realistic workloads, a steady-state throughput
response was achieved by TCP LS-RSVP to about 3650Bits/secs compared with generic TCP plots in a
previous study. Considering network service availability, this was found to be dependent on fault-tolerance
of the hotspot network. From study, a high peak threshold of 0.009 (i.e. 90%) was observed. This shows
fairly acceptable service availability behaviour compared with the existing TCP schemes. For packet drop
effects, an analysis on the network behaviour with respect to the LS-RSVP yielded a drop response of about
0.000106 bits/sec which is much lower compared with the case with generic TCP with over 0.38 bits/sec.
The latency profile of average FTP download response was found to be 0.030secs, but with that of FTP
upload response, this yielded about 0.028 sec. The results from the study demonstrate efficiency and
optimality for realistic loads in Smart web contexts.
IJMTST 2016 | ISSN: 2455-3778 Traffic and Power reduction Routing Algorithm f...IJMTST Journal
With the progress of VLSI technology, the number of cores on a chip keeps increasing, Now a days we are
increasing the processing level of the chip, NOC is a best method to interconnect the core with each other core
on the chip, it reducing the overall chip power and Traffic level by sharing the work load with other cores on
the chip. And Dynamic Voltage Frequency Scaling (DVFS) is the technique for monitoring the
Frequency/Voltage level of each core of the chip and providing sufficient power to the cores, ATPT is a Table
that having (low and high) Frequency level table of the Each core. ATPT has very high prediction accuracy
system. Depends upon the data speed of the core the voltage/frequency will be given by DVFS. If the core is
in ideal state for a while, that core is moved to low power mode. So the power of the each core will be reduced.
Available network bandwidth schema to improve performance in tcp protocolsIJCNCJournal
The TCP congestion control mechanism in standard implementations presents several problems, for
example, large queue lengths in network routers and packet losses, a misleading reduce of the transmission
rate when there are link failures, among others. This paper proposes a schema to congestion control in
TCP protocols, called NGWA, witch is based on the network bandwidth. The NGWA provides information
considering the available bandwidth of the network infrastructure to the endpoints of the TCP connection.
Hence, it helps in choosing a better transmission rate for TCP improving its performance. Simulation
results show superior performance of the proposed scheme when compared to those obtained by TCP New
Reno and standard TCP. A physical implementation in the Linux kernel was performed to prove the correct
operation of the proposal.
A ULTRA-LOW POWER ROUTER DESIGN FOR NETWORK ON CHIPijaceeejournal
The design of more complex systems becomes an increasingly difficult task because of different issues
related to latency, design reuse, throughput and cost that has to be considered while designing. In
Real-time applications there are different communication needs among the cores. When NoCs (Networks
on chip) are the means to interconnect the cores, use of some techniques to optimize the
communication are indispensable. From the performance point of view, large buffer sizes ensure
performance during different applications execution. But unfortunately, these same buffers are the main
responsible for the router total power dissipation. Another aspect is that by sizing buffers for the worst case
latency incurs in extra dissipation for the mean case, which is much more frequent. Reconfigurable router
architecture for NOC is designed for processing elements communicate over a second
communication level using direct-links between another node elements. Several possibilities to use the
router as additional resources to enhance complexity of modules are presented. The reconfigurable router
is evaluated in terms of area, speed and latencies. The proposed router was described in VHDL and used
the ModelSim tool to simulate the code. Analyses the average power consumption, area, and frequency
results to a standard cell library using the Design Compiler tool. With the reconfigurable router it was
possible to reduce the congestion in the network, while at the same time reducing power dissipation and
improving energy.
Fault Injection Approach for Network on Chipijsrd.com
Packet-based on-chip interconnection networks, or Network-on-Chips (NoCs) are progressively replacing global on-chip interconnections in Multi-processor System-on-Chips (MP-SoCs) thanks to better performances and lower power consumption. However, modern generations of MP-SoCs have an increasing sensitivity to faults due to the progressive shrinking technology. Consequently, in order to evaluate the fault sensitivity in NoC architectures, there is the need of accurate test solution which allows evaluating the fault tolerance capability of NoCs. Presents an innovative test architecture based on a dual-processor system which is able to extensively test mesh based NoCs. The proposed solution improves previously developed methods since it is based on a NoC physical implementation which allows investigating the effects induced by several kind of faults thanks to the execution of on-line fault injection within all the network interface and router resources during NoC run-time operations. The solution has been physically implemented on an FPGA platform using a NoC emulation model adopting standard communication protocols. The obtained results demonstrated the effectiveness of the developed solution in term of testability and diagnostic capabilities and make our solutions suitable for testing large scale.
Improvement In LEACH Protocol By Electing Master Cluster Heads To Enhance The...Editor IJCATR
In wireless sensor networks, sensor nodes play the most prominent role. These sensor nodes are mainly un-chargeable, so it
raises an issue regarding lifetime of the network. Mainly sensor nodes collect data and transmit it to the Base Station. So, most of the
energy is consumed in the communication process between sensor nodes and the Base Station. In this paper, we present an
improvement on LEACH protocol to enhance the network lifetime. Our goal is to reduce the transmissions between cluster heads and
the sink node. We will choose optimum number of Master Cluster Heads from variation cluster heads present in the network. The
simulation results show that our proposed algorithm enhances the network lifetime as compare to the LEACH protocol.
Many intellectual property (IP) modules are present in contemporary system on chips (SoCs). This could provide an issue with interconnection among different IP modules, which would limit the system's ability to scale. Traditional bus-based SoC architectures have a connectivity bottleneck, and network on chip (NoC) has evolved as an embedded switching network to address this issue. The interconnections between various cores or IP modules on a chip have a significant impact on communication and chip performance in terms of power, area latency and throughput. Also, designing a reliable fault tolerant NoC became a significant concern. In fault tolerant NoC it becomes critical to identify faulty node and dynamically reroute the packets keeping minimum latency. This study provides an insight into a domain of NoC, with intention of understanding fault tolerant approach based on the XY routing algorithm for 4×4 mesh architecture. The fault tolerant NoC design is synthesized on field programmable gate array (FPGA).
Investigating the Performance of NoC Using Hierarchical Routing ApproachIJERA Editor
The Network-on-Chip (NoC) model has appeared as a revolutionary methodology for incorporatingmany number of intellectual property (IP) blocks in a die. As said by the International Roadmap for Semiconductors (ITRS), it is must to scale down the device size. In order to reduce the device long interconnection should be avoided. For that, new interconnect patterns are need. Three-dimensional ICs are proficient of achieving superior performance, resistance against noise and lower interconnect power consumption compared to traditional planar ICs. In this paper, network data routed by Hierarchical methodology. We are analyzing total number of logic gates and registers, power consumption and delay when different bits of data transmitted using Quartus II software.
Investigating the Performance of NoC Using Hierarchical Routing ApproachIJERA Editor
The Network-on-Chip (NoC) model has appeared as a revolutionary methodology for incorporatingmany number of intellectual property (IP) blocks in a die. As said by the International Roadmap for Semiconductors (ITRS), it is must to scale down the device size. In order to reduce the device long interconnection should be avoided. For that, new interconnect patterns are need. Three-dimensional ICs are proficient of achieving superior performance, resistance against noise and lower interconnect power consumption compared to traditional planar ICs. In this paper, network data routed by Hierarchical methodology. We are analyzing total number of logic gates and registers, power consumption and delay when different bits of data transmitted using Quartus II software.
Similar to Application Behavior-Aware Flow Control in Network-on-Chip (20)
Investigating the Performance of NoC Using Hierarchical Routing Approach
Application Behavior-Aware Flow Control in Network-on-Chip
Application Behavior-aware Flow Control in Network-on-Chip
Advisor: Chung-Ta King
Student: Huan-Yu Liu
Department of Computer Science
National Tsing Hua University
Hsinchu, Taiwan 30013
R.O.C.
July, 2010
Abstract
Multicore may be the only viable solution to the performance and power issues of future chip architectures. As the number of cores on a chip keeps increasing, traditional bus-based architectures are incapable of offering the required on-chip communication bandwidth, so the Network-on-Chip (NoC) has become the main paradigm for on-chip interconnection. NoCs not only offer significant bandwidth advantages but also provide outstanding flexibility. However, the performance of NoCs can degrade significantly if the network flow is not controlled properly. Most previous solutions try to detect network congestion by monitoring the hardware status of the network switches or links. A change of hardware status at the local end may indicate possible congestion in the network, and packet injection should then be throttled in response. The problem with these solutions is that congestion detection is based only on local status, without global information. Actual congestion may occur elsewhere and can only be detected through backpressure, which may be too passive and too slow for taking reactive measures in time.

This work takes a proactive approach to congestion detection. The idea is to predict the changes in the global, end-to-end network traffic patterns of the running application and take proactive flow control actions to avoid possible congestion. Traffic prediction is based on our recent paper [1], which uses a table-driven predictor for predicting application communication patterns. In this thesis, we discuss how to use the prediction results for effective scheduling of packet injection to avoid network congestion and improve throughput. The proposed scheme is evaluated by simulation with a SPLASH-2 benchmark as well as synthetic traffic. The results show a superior performance improvement and negligible execution overhead.
List of Tables
6.1 Simulation Configuration
6.2 Our proposed flow control algorithm leads to a large reduction in latency with a slight execution-time overhead.
6.3 Our proposed flow control algorithm for synthetic traffic leads to a large reduction in both the average and the maximum latency and a slight reduction in execution time.
List of Figures
2.1 The tile arrangement and interconnection topology used for the experiment on the TILE64 platform
2.2 The traffic of router 4 is tracked. The first diagram shows all the traffic input/output from router 4. The second to fourth diagrams show the decomposed traffic. Note that the traffic relayed by router 4 is omitted. The last one shows the output traffic from router 4 to 5.
4.1 The structure of a router
4.2 An example of an L1-table. The columns G4:G0 record the quantized transmitted data size of the last 5 time intervals.
4.3 An example of an L2-table, which is indexed by the transmission history pattern G4:G0. The corresponding data size level Gp is the value predicted to be transmitted in the next time interval.
4.4 A table which records the delayed transmissions
5.1 The diagram of the flow control algorithm
5.2 The diagram of flow control
6.1 Histograms of the packet latencies without (a) and with (b) the proposed flow control; in (b) the latencies drop drastically.
6.2 The maximum workload of links in the network without (a) and with (b) the proposed flow control.
Chapter 1
Introduction
The number of transistors on a chip has increased exponentially over the decades, following Moore's Law. At the same time, applications have also grown in complexity and therefore demand huge amounts of computation. These factors are further coupled with an increasing need for power saving as the clock frequency of a core rises. The best practice at present is to adopt a multicore architecture and parallelize applications. However, communication overhead becomes a critical bottleneck if we cannot offer substantial bandwidth among the cores. Traditional bus-based architectures suffer from increased packet latencies as the number of on-chip cores increases and are incapable of providing performance guarantees, especially for real-time applications. As a result, the Network-on-Chip (NoC) has become the de facto solution to this critical problem.

NoCs not only offer significant bandwidth but also provide outstanding flexibility and scalability. There are already multi- and many-core processors on the market that adopt a NoC as their communication fabric. For example,
Tilera's TILE64 [2], introduced in 2007, uses a 2-D mesh-based network to interconnect 64 tiles and 4 memory controllers. Indeed, NoCs are becoming the main
communication and design fabric for chip-level multiprocessors.
Since the cores are connected by a network, flow control and congestion control in a NoC are certainly important issues. If a core transmits too many packets to another core, the intermediate routers must buffer many packet flits, causing the network to become congested. Without an effective flow control mechanism, the performance of a NoC may degrade sharply due to congestion. According to [3], the accepted traffic increases linearly with the applied load until a saturation point is reached; beyond that point, the accepted traffic decreases considerably.
Many solutions already exist for congestion in off-chip networks [4–6]. However, most of them are not suitable for on-chip networks. In off-chip environments, dropping packets is usually used as a means of flow control when congestion happens, and such environments must therefore provide an acknowledgment mechanism. On the other hand, on-chip networks possess reliable on-chip wires and more effective link-level flow control, which makes on-chip NoCs almost lossless. As a result, there is no need to implement complicated protocols, such as acknowledgment, only for flow control. This difference gives us the chance to come up with a new solution.
To the best of our knowledge, very few research works discuss the congestion control problem in NoCs. In [7], the switches exchange their load information with neighboring switches to avoid hot spots through which most packets would pass. In [8, 9], a predictive closed-loop flow control mechanism is proposed based on a router model, which is used to predict how many flits the router can accept in the next k time steps. However, it ignores the flits injected by neighboring routers during the prediction period. In [10, 11], a centralized, end-to-end flow control mechanism is proposed. However, it needs a special network, called the control NoC, to transfer OS-control messages, and it relies only on locally blocked messages to decide when a processing element is able to send messages to the network.
Most of the works mentioned above detect network congestion by monitoring hardware status, such as buffer fillings, link utilization, and the number of blocked messages. However, these statuses are bounded by hardware limitations. For example, the size and number of buffers are limited, so without adding new hardware the detection may be very inaccurate. In particular, if a bursty workload exceeds the limits of the hardware, congestion might not be detected immediately. In addition, congestion detection based on hardware status is a reactive technique: it relies on backpressure to detect network congestion, so the traffic sources cannot throttle their injection rates before the network is already severely congested. Furthermore, previous work on flow control of NoCs does not take global information into consideration when making flow control decisions. Even if a certain core determines that the network is free of congestion and decides to inject packets, some links or buffers of other cores might still be congested, causing even more severe congestion.
In this thesis, we propose a proactive congestion and flow control mechanism. The core idea is to predict the future global traffic in the NoC according to the data transmission behaviors of the running applications. Based on the prediction, we can control network injection before congestion occurs. Notice that most applications show repetitive communication patterns because they are likely to execute similar code in successive time intervals, such as a loop in the program. These patterns may reflect the network states more accurately, since applications are the sources of the traffic in the network. Once the application patterns can be predicted accurately, the future traffic of every link can be estimated from this information, and the injection rate of each node can be controlled before the network goes into congestion. However, predicting the traffic in a network with high accuracy is a challenge. In this thesis, the data transmission behavior of the running application is tracked and used as the clue for predicting future traffic with a specialized table-driven predictor. This technique is inspired by the branch predictor and works well for the end-to-end traffic of the network [1].
The main contributions of this thesis are as follows. First, we predict congestion according to the data transmission behaviors of applications rather than hardware statuses, since the data transmissions of applications are the direct source of congestion in the NoC. Second, we modify the table-driven predictor proposed in [1] to not only capture and predict the data transmission behaviors of the application at run time but also make the decisions for injection rate control. Third, the implementation details of this traffic control algorithm are presented. By taking advantage of the many-core architecture, we can dedicate a core to making decisions on packet injection and achieving global performance.
This thesis is organized as follows. In Chapter 2, a motivating example is given
to show the repetitive data transmission behavior in applications. In Chapter 3,
related works are discussed. Next, we give a formal definition of the flow control
problem in Chapter 4. In Chapter 5, we present the details of the traffic control
algorithm. Evaluations are shown in Chapter 6. Finally, conclusions are given
in Chapter 7.
Chapter 2
Motivating Example
In this chapter, we show that data transmission in parallel programs exhibits repetitive patterns, taking the LU decomposition kernel of the SPLASH-2 benchmark suite as an example. The LU decomposition kernel is ported to the TILE64 platform and run on a 4 × 4 tile array, as Figure 2.1 shows. The detailed experimental setup is described in Chapter 6. We used 16 tiles for porting the applications, and the routing algorithm is X-Y dimension-order routing. In the following discussion, we use the form (source, destination) to describe the transmission pairs.
Figure 2.2 shows the transmission trace of router 4. In the first diagram, the traffic is mixed from the viewpoint of router 4. The mixed traffic is somewhat messy and hard to predict. In previous works, traffic prediction is made mainly by checking hardware status, such as the fullness of buffers, the utilization of links, and so on. The hardware status is affected by the mixed traffic, as the first diagram shows. Such irregular traffic makes hardware status unsuitable for predicting the network workload.

Figure 2.1: The tile arrangement and interconnection topology used for the experiment on the TILE64 platform (16 tiles, numbered 0–15, in a 4 × 4 mesh)
However, when we extract the traffic between the pairs (5,4), (6,4), and (7,4), as the second to fourth diagrams show (the last diagram shows the output traffic (4,5)), the flows are much more regular and predictable. The separated transmission traces are recorded from the viewpoint of end-to-end data transmission, as issued by the running application. The end-to-end data transmission follows repetitive patterns because the application executes similar operations in successive time intervals.
By exploiting this repetitive characteristic of application execution, we can predict the end-to-end data transmissions accurately by recording their history. The workload prediction for a given link in the network can then be derived by summing all the predicted end-to-end data transmissions that pass through that link. Since we can predict the NoC traffic in the next time interval, we can regulate the traffic sources ahead of packet injection and thereby realize congestion avoidance.
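To make the summation concrete, the per-link workload prediction can be sketched as follows. This is a minimal Python illustration with our own function names; it assumes row-major tile numbering and X-Y dimension-order routing, not the thesis's exact implementation:

```python
def xy_route(src, dst, n):
    """Return the directed links (as node-id pairs) that a packet traverses
    from src to dst under X-Y dimension-order routing on an n x n mesh,
    with tiles numbered row-major (an assumption for this sketch)."""
    sx, sy = src % n, src // n
    dx, dy = dst % n, dst // n
    links, cur = [], (sx, sy)
    while cur[0] != dx:                       # travel along X first
        nxt = (cur[0] + (1 if dx > cur[0] else -1), cur[1])
        links.append((cur[1] * n + cur[0], nxt[1] * n + nxt[0]))
        cur = nxt
    while cur[1] != dy:                       # then along Y
        nxt = (cur[0], cur[1] + (1 if dy > cur[1] else -1))
        links.append((cur[1] * n + cur[0], nxt[1] * n + nxt[0]))
        cur = nxt
    return links

def predict_link_workload(predicted_flows, n):
    """Sum predicted end-to-end data sizes over every link they traverse.
    predicted_flows maps (src, dst) -> predicted data size for the next interval."""
    workload = {}
    for (src, dst), size in predicted_flows.items():
        for link in xy_route(src, dst, n):
            workload[link] = workload.get(link, 0) + size
    return workload

# e.g. the flows toward router 4 from Chapter 2; the workload on link (5, 4)
# aggregates all three predicted flows
flows = {(5, 4): 256, (6, 4): 128, (7, 4): 64}
print(predict_link_workload(flows, 4))
```

With deterministic routing, the route of each flow is fixed, so a predicted per-flow data size translates directly into a predicted per-link load.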
Chapter 3
Related Work
In [7], information about a switch is sent to other switches for deciding the routing path so as to avoid congestion. The control information is exchanged locally and cannot reflect the status of the whole network. The authors of [8, 9] predict network congestion based on their proposed traffic source and router model. Using this model, each router predicts the availability of its buffers ahead of time, i.e., how many flits the router can currently accept. The traffic source cannot inject packets until the availability is greater than zero. They predict traffic from the switch perspective, whereas our predictions are made from the perspective of applications.

In [12–15], a congestion control scenario is considered that models flow control as a utility maximization problem. These works propose an iterative algorithm as the solution to the maximization problem.
The authors of [10] make use of the operating system (OS) and let the system software control the resource usage. In [11], the authors detail a NoC communication management scheme based on a centralized, end-to-end flow control mechanism that monitors hardware statuses. All of the works above need a dedicated control NoC to transfer OS-control messages and a data NoC responsible for delivering data packets. The OS refers to the blocked messages of the local processing element to limit when the element is able to send messages. In [16], almost the same network architecture is assumed, except that some extra hardware is added to support a distributed HW/SW congestion control technique.
Model Predictive Control (MPC) is used for on-chip congestion control in [17]. In that work, the link utilization of a router is used as the indicator for congestion measurement. In contrast, our work makes predictions at the application layer rather than the link layer in order to capture the transmission behaviors of the running applications. We claim that these behaviors are the primary cause of network congestion.
Chapter 4
Problem Formulation
We have seen that congestion can degrade network performance considerably, so congestion in the network should be avoided as much as possible. In [18], queueing delay is used as one metric for congestion detection; in [17], the authors use link utilization as the congestion measure. Since there is no universally accepted definition of network congestion [19], we take link utilization as the congestion measure in this thesis. The utilization of a link ei at the t-th time interval is defined as:

Util_i(t) = D_i(t) / (T × W), 0 ≤ Util_i(t) ≤ 1

where D_i(t) denotes the total data size transmitted over ei in the t-th time interval, T is the period of a time interval in seconds, and W is the maximum bandwidth of a communication link. Thus T × W denotes the maximum possible data size transmitted in one time interval.
We assume that if the link utilization of a given link in the network exceeds a properly selected threshold Th, the link is congested. Experimental results in [17] indicate that 80% link utilization yields reasonable latencies before the congestion limit. However, the selected threshold value should take hardware configurations into consideration, such as the buffer size and the link bandwidth.
We hope to prevent the network from becoming congested before it happens. This is achieved by predicting the possible traffic in the t-th time interval and preventing several traffic sources from injecting packets concurrently. By scheduling packet injection effectively, we can avoid network congestion and thus improve the average packet latency. Latency is a commonly used performance metric and can be interpreted in different ways [3]. We define latency here as the time elapsed from when the message header is injected into the network at the source node until the tail of the packet is received at the destination node.
Assume that λ is the average packet latency and texec is the total execution time without any flow control, while λ′ is the average packet latency and t′exec is the total execution time with our proposed flow control. Our goal is to maximize λ − λ′ and texec − t′exec. However, the execution time is affected by the communication dependencies between traffic sources [20]; a full treatment of program dependencies is beyond the scope of this thesis.
Figure 4.1: The structure of a router. Five input ports and five output ports are connected through a crossbar; four port pairs link to the north, west, south, and east routers, and one pair connects to the local processor.
Dest.  LRU  Data Size | G4 G3 G2 G1 G0 (transmission history)
5      0    256       | 5  3  1  2  4
8      2    128       | 3  3  0  3  3
10     1    512       | 2  2  2  2  2
13     3    64        | 5  4  3  5  4

Figure 4.2: An example of an L1-table. The columns G4:G0 record the quantized transmitted data size of the last 5 time intervals.
4.1 Application-Driven Predictor
In this subsection, we show how to predict traffic using a table-driven network traffic predictor and how to make traffic control decisions with an extra table that records the delayed transmissions. The original prediction method was proposed in [1]; however, the authors only discuss how to monitor and predict traffic without interfering with it. In this thesis, the future transmissions
G4  G3  G2  G1  G0   LRU  Gp
 5   3   1   2   4    31   2
 4   4   0   4   4    13   4
 5   4   2   5   3     5   0
 3   1   2   6   3    12   2
...
(Indexed by the L1-table transmission history)

Figure 4.3: An example of an L2-table, which is indexed by the transmission history pattern G4:G0. The corresponding data size level Gp is the value predicted to be transmitted in the next time interval.
Src.  Dest.  Data size  Priority
 9     10      256         3
 4      3       64         2
 3     12       32         0
 5      6       16         0
...

Figure 4.4: A table which records the delayed transmissions.
are explicitly controlled by our extended design. To simplify the following discussion, we assume a 2D mesh network of size N × N as the underlying topology. Note that our approach is independent of the topology and the size of the network, so it can easily be extended to other topologies and arbitrary network sizes. Each tile consists of a processor core, a memory module, and a router. We assume that the router has 5 input ports, 5 output ports, and a 5 × 5 crossbar. The structure of a router is shown in Figure 4.1. Each crossbar provides five connections: east, north, west, south, and the local processor. Each connection consists of two uni-directional communication links for sending and receiving data, respectively. A deterministic routing algorithm is assumed, so the path between a source and a destination is determined in advance. This is the most common type of routing algorithm in current NoC implementations.
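A common instance of deterministic routing on a mesh is dimension-ordered X-Y routing (also used in our simulations, Table 6.1). The following is a minimal sketch, assuming row-major node numbering on an n × n mesh:

```python
def xy_route(src, dest, n):
    """Return the list of node ids on the X-Y route from src to dest
    in an n x n mesh (row-major numbering): the packet first travels
    along the X dimension, then along the Y dimension."""
    sx, sy = src % n, src // n
    dx, dy = dest % n, dest // n
    path = [src]
    x, y = sx, sy
    while x != dx:                  # route in X first
        x += 1 if dx > x else -1
        path.append(y * n + x)
    while y != dy:                  # then route in Y
        y += 1 if dy > y else -1
        path.append(y * n + x)
    return path

print(xy_route(0, 5, 4))   # -> [0, 1, 5]
```

Because the path depends only on the source and destination, the control system can precompute the set of links each transmission will occupy.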
A table-driven predictor is employed to record past traffic, and this history is then used to predict the data size and the destination of the outgoing traffic from each router in the next time interval. Each router maintains two hierarchical tables for tracking and predicting data transmissions. The first-level table (L1-table), shown in Figure 4.2, tracks all output data transmissions. Each router uses only four entries to record transmission destinations, since a core may communicate with only a subset of
all the cores [1]. Destination entries are replaced using the LRU replacement policy to keep the table small. To map history patterns to a guess about the following transmission, a second-level table (L2-table) is required. At the beginning of the t-th time interval, the transmission history recorded in the L1-table is used to index the L2-table and obtain the predicted level of the transmission data size for the t-th time interval. During the t-th time interval, when an output transmission is issued by the processor core, its destination and data size are recorded in the L1-table. The data size is quantized and recorded in G0. The columns G0 to Gn record the quantized transmitted data sizes of the last n + 1 time intervals. The two tables are updated at the end of each predefined time interval. After checking the prediction, the value of the data size counter in the L1-table is quantized and shifted into G0. Finally, the updated transmission history in the L1-table is used to index the L2-table and retrieve the predicted data size level for the next time interval. If the transmission history cannot be found in the L2-table, the system either creates a new entry or replaces an existing entry by LRU in the L2-table, and uses the last value (G0) as the predicted transmission data size level. The recorded transmitted data size levels in the L1-table are used to check the accuracy of the prediction made in the last time interval. If the prediction was wrong, the value of Gp in the L2-table for the corresponding transmission history pattern is modified to the data size level recorded in the L1-table.
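The update cycle above can be sketched in a few lines. This is a minimal illustration of the two-level scheme of [1]; the table sizes and the quantization function are assumptions for the example, not the original design parameters:

```python
from collections import OrderedDict

HIST = 5  # history length: columns G4..G0

class TablePredictor:
    """Minimal sketch of the two-level table-driven predictor."""

    def __init__(self, l1_entries=4, l2_entries=64):
        self.l1 = OrderedDict()   # dest -> history [G4..G0], LRU order
        self.l2 = OrderedDict()   # tuple(history) -> predicted level Gp
        self.l1_entries = l1_entries
        self.l2_entries = l2_entries

    @staticmethod
    def quantize(size):
        # illustrative log2-style quantization into data size levels
        level = 0
        while size >= 16:
            size //= 2
            level += 1
        return level

    def _touch(self, table, key, default, limit):
        if key in table:
            table.move_to_end(key)        # mark as most recently used
        else:
            if len(table) >= limit:
                table.popitem(last=False) # evict the LRU entry
            table[key] = default
        return table[key]

    def end_interval(self, dest, size):
        """At the end of an interval: check the previous prediction,
        shift the quantized size into G0, and return the predicted
        level Gp for the next interval."""
        hist = self._touch(self.l1, dest, [0] * HIST, self.l1_entries)
        actual = self.quantize(size)
        key = tuple(hist)
        if self.l2.get(key) is not None and self.l2[key] != actual:
            self.l2[key] = actual         # correct a wrong prediction
        hist.pop(0)                       # shift: G1 <- G0, ...
        hist.append(actual)               # new G0
        new_key = tuple(hist)
        if new_key not in self.l2:
            # history not found: create entry, predict the last value G0
            self._touch(self.l2, new_key, actual, self.l2_entries)
        return self.l2[new_key]
```

For instance, after one interval sending 256 bytes to destination 5, the predictor's L1 history for that destination becomes [0, 0, 0, 0, 5] and that pattern indexes the L2-table for the next prediction.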
Besides the traffic predictor, we need to maintain another table to record the delayed transmissions, as shown in Figure 4.4. When the traffic control algorithm decides to delay a transmission, we record its source, destination, and traffic size. To avoid starvation, we also add a priority column: each time a transmission is delayed for another interval, its priority value is increased.
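The delay table of Figure 4.4 can be sketched as a small priority-aged structure; the dictionary layout here is an illustrative assumption:

```python
class DelayTable:
    """Sketch of the delay table (Figure 4.4): entries hold
    (src, dest, size, priority), and priorities age each interval
    so that no delayed transmission starves."""

    def __init__(self):
        self.entries = []

    def add(self, src, dest, size):
        self.entries.append({"src": src, "dest": dest,
                             "size": size, "priority": 0})

    def pop_highest_priority(self):
        """Remove and return the entry that has waited the longest."""
        best = max(self.entries, key=lambda e: e["priority"])
        self.entries.remove(best)
        return best

    def age(self):
        """Called once per interval for transmissions still delayed."""
        for e in self.entries:
            e["priority"] += 1
```

An entry delayed for three intervals thus reaches priority 3 and is considered before newly delayed entries.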
Chapter 5
Traffic Control Algorithm
In this chapter, we present a heuristic algorithm for NoC traffic management. Then, we give some possible ways to aggregate the prediction data.
5.1 Traffic Control Algorithm and Implementation
Overhead
The algorithm detailed in Algorithm 1 runs on a central control system, and the algorithm detailed in Algorithm 2 runs on each node. The control system maintains two tables: one records the transmissions that have been delayed, and the other records the transmissions that are predicted. Given these transmissions, the control system decides which should be delayed and which should be injected. The control message inject, sent from the control system to each node, determines whether source node i may inject traffic toward destination node j in the next time interval. Note that Algorithm 1 is executed at the beginning of each time interval, and Algorithm 2 is executed during the time interval.
Because this flow control algorithm operates at the end-to-end layer, we use inject to indicate whether source i can send packets to destination j. Figure 5.1 shows a simple flow chart of our flow control algorithm.
At the beginning, we assume that each source can send traffic to each destination (line 3). The algorithm then decides which transmissions in the delay table may inject (lines 5 - 22). Each transmission has its own priority to avoid starvation (line 6); the transmission with the highest priority is the one that has been delayed longest. The workload (line 10) includes both the workload that has not finished processing and the workload that may be injected in the next time interval. If the workload of any link on the path exceeds the threshold value, the control signal is set to false (line 11). The threshold value depends on the architecture. After deciding which transmissions in the delay table may inject, the remaining transmissions update their priorities (line 23). The control system then collects from the predictor the transmissions that are predicted to inject in the next time interval and decides whether each control signal should be true or false.
Algorithm 2 is executed in each source node during a time interval. Every source node receives the control message from the control system and acts on it (line 1). When there is a transmission from source i to destination j,
Figure 5.1: The diagram of the flow control algorithm
if the control message value is true, the source node is allowed to inject traffic into the network; otherwise, the source node does not inject any traffic and adds this transmission to the centralized delay table.
It is worth mentioning that the algorithms presented here are just one example of flow control given the ability to predict NoC traffic; other algorithms could solve the flow control problem as well.
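The central decision described above can be sketched compactly. The data structures below (dictionaries for transmissions and link workloads, a precomputed path map) are illustrative assumptions, and the workload test folds the candidate's size into the comparison for simplicity:

```python
def central_control(delay_table, predicted, link_workload, paths, threshold):
    """Sketch of the central control decision (cf. Algorithm 1).
    delay_table / predicted: lists of dicts with src, dest, size, priority;
    link_workload: dict link -> pending workload;
    paths[(src, dest)]: the fixed deterministic routing path."""
    inject = {}
    for batch in (delay_table, predicted):          # delayed entries first
        for t in sorted(batch, key=lambda t: -t["priority"]):
            key = (t["src"], t["dest"])
            path = paths[key]
            # admit only if no link on the path would exceed the threshold
            ok = all(link_workload[l] + t["size"] <= threshold for l in path)
            inject[key] = ok
            if ok:
                for l in path:                      # reserve link capacity
                    link_workload[l] += t["size"]
            else:
                t["priority"] += 1                  # age to avoid starvation
    return inject
```

A transmission that is denied keeps its reservation request and competes again, with a higher priority, in the next interval.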
5.2 Data Aggregation
Figure 5.2 illustrates the basic idea of our proposed method. The control system is responsible for Algorithm 1, and each node is responsible for Algorithm 2. The
Algorithm 1 Algorithm for the central control system
1: // Initialization. inject[src][dest] is a control message to decide injecting or not.
2: for all source-to-destination transmission pairs do
3: inject[src][dest] = true;
4: end for
5: for all transmissions in the delay table do
6: Selecting the transmission Tdelay i,j with the highest priority;
7: Let path be the routing path of Tdelay i,j;
8: if inject[i][j] == true then
9: for all link ∈ path do
10: if link.workload > threshold then
11: inject[i][j] = false;
12: break;
13: end if
14: end for
15: // send the control message to the nodes
16: if inject[i][j] == true then
17: Sending injection notification to node i to inject Tdelay i,j;
18: Updating link.workload;
19: Deleting Tdelay i,j from the delay table;
20: end if
21: end if
22: end for
23: Updating priorities in the delay table;
24: Collecting predicted transmissions from the application-driven predictor;
25: for all predicted transmissions do
26: Selecting the transmission Tpredict i,j with the highest priority;
27: if inject[i][j] == true then
28: for all link ∈ path do
29: if link.workload > threshold then
30: inject[i][j] = false;
31: break;
32: end if
33: end for
34: if inject[i][j] == true then
35: Updating link.workload;
36: Deleting Tpredict i,j from the predicted transmissions;
37: end if
38: end if
39: end for
Algorithm 2 Algorithm for each node i
1: receive the control message;
2: if there is a transmission to destination j then
3: if inject[i][j] == true then
4: Injecting the transmission;
5: else
6: Adding the transmission to the centralized delay table;
7: end if
8: end if
9: Updating application-driven predictor;
10: Updating link.workload;
control system sends control signals to each node via the control network, and each node sends information back to the control system via the control network to help it make decisions. The nodes communicate with each other over the data network. In [10], the authors argue that the operating system is capable of network traffic management. Accordingly, our method can be adopted on the architecture platform described in [10], with the control system implemented by the operating system. However, this approach may be too cumbersome, so we propose an alternative: since manycore chips provide many cores, we can use a dedicated core to handle the flow control decisions. This dedicated core plays the role of the control system in Figure 5.2.
5.3 Area Occupancy
We now analyze the area overhead of the NTPT. In this subsection, we use transistor counts from real manycore designs: the UC Davis AsAP has 55M transistors, and Tilera's TILE64 has 615M. Assuming
[Figure 5.2 shows the cores connected by the data network, with the control system and the application-driven traffic predictor exchanging control signals and update information with the cores via a separate control network.]
Figure 5.2: The diagram of flow control
that each bit needs 6 transistors, the application-driven predictor in our design requires 0.69M transistors for 64 cores. In addition, we maintain the control table that records the delayed transmissions; assuming 128 entries, it requires about 0.02M transistors. Together they occupy 1.29% and 0.12% of the transistor budget of AsAP and TILE64, respectively, which is a small and tolerable area overhead. In contrast, [21] reports that increasing the data path width by 138% results in an area penalty of 64% in Xpipes, a NoC architecture; this area overhead is considerable. There, the average packet latency only drops from 49 cycles to 39 cycles as the link bandwidth grows from 2.2 GB/s to 3.2 GB/s. In short, widening the links improves the average packet latency only slightly at a huge area cost. This observation motivates injection-rate flow control, since increasing the link bandwidth is not economical.
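The percentages above follow directly from the stated transistor counts (predictor plus control table, against the AsAP and TILE64 totals):

```python
# Back-of-the-envelope check of the area numbers quoted above.
predictor_mt = 0.69          # predictor transistors, millions (64 cores)
control_table_mt = 0.02      # 128-entry control table, millions
total_mt = predictor_mt + control_table_mt

asap_share = total_mt / 55.0 * 100     # AsAP: 55M transistors
tile64_share = total_mt / 615.0 * 100  # TILE64: 615M transistors

print(f"AsAP:   {asap_share:.2f}%")    # -> 1.29%
print(f"TILE64: {tile64_share:.2f}%")  # -> 0.12%
```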
Chapter 6
Experimental Results
In this chapter, we present experimental results that evaluate our proposed flow control algorithm. We adopt both real application traffic and synthetic traffic in our experiments.
6.1 Simulation Setup
The PoPNet network simulator [22] is used for our simulations and the data
transmission traces are used as the input of the simulator. The data transmission
traces record the packet injection time, the address of the source router, the
address of the destination router and the packet size. The detailed configuration
of simulation is provided in Table 6.1. The original data transmission traces are
altered by our flow control algorithm, and this results in that some transmissions
are delayed for some period so as to avoid congestion. The experimental results
presented in the following show that our algorithm exhibits huge performance
improvement.
Table 6.1: Simulation Configuration
Network Topology     4 × 4 mesh
Virtual Channels     3
Buffer Size          12
Routing Algorithm    X-Y routing
Bandwidth            32 bytes
6.2 Real Application Traffic
Tilera's TILE64 platform is used to run the benchmark programs and collect the data transmission traces. We use the SPLASH-2 blocked LU decomposition as our benchmark program; the total workload is 3991 packets. As shown in Table 6.2, the average packet latency drops from 2410.79 cycles to 771.858 cycles, and the maximum packet latency drops from 5332 cycles to 3242 cycles. The significant performance improvement comes from predicting the traffic workload of the next interval and delaying some packet injections to avoid congestion. As depicted in Figure 6.1 (a), the packet latencies without flow control range between 0 and 5500 cycles. With our proposed flow control algorithm, they range between 0 and 3300 cycles. The packet latencies decrease sharply, so the histogram shifts to the left. To further support this claim, Figure 6.2 shows more details about
                   Original         Pattern-oriented   Reduction
Ave. latency       2410.79 cycles   771.858 cycles     3.12
Max. latency       5332 cycles      3242 cycles        1.64
Simulation cycle   5600 cycles      6100 cycles        0.92

Table 6.2: Our proposed flow control algorithm leads to a huge reduction in latency with only a slight execution time overhead.
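The reduction factors in Table 6.2 can be reproduced directly from the raw numbers:

```python
pairs = {
    "avg latency":      (2410.79, 771.858),
    "max latency":      (5332, 3242),
    "simulation cycle": (5600, 6100),
}
for name, (orig, improved) in pairs.items():
    # ratio > 1 means an improvement; < 1 means overhead
    print(f"{name}: {orig / improved:.2f}x")
```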
the network congestion. We set the congestion threshold to 40 flits. The line in Figure 6.2 (b) occasionally rises above the threshold because of wrong traffic predictions; however, the impact of these mispredictions is slight, so the result remains within an acceptable range. In Figure 6.2 (a), without flow control, the maximum workload far exceeds the threshold, which causes severe network congestion.
6.3 Synthetic Traffic
Besides the real application traffic, we also extend our algorithm for synthetic
traffic. In [20], the authors state that injected network traffic possesses self-
similar temporal properties. They use a single parameter, the Hurst exponent H
to capture temporal burstiness characteristic of NoC traffic. Based on this traffic
model, we synthesize our traffic traces. In Table 6.3, we give some instances
[Figure 6.1 contains two histograms (number of packets vs. packet latency in cycles): (a) without flow control and (b) with the proposed flow control.]

Figure 6.1: Histograms of the packet latencies without (a) and with (b) the proposed flow control; in (b) the latencies decrease drastically.
of different values of the parameter H and compare them. These values are chosen from Table 1 in [20] for convenience. The average packet latency and the maximum latency both drop significantly. Moreover, the execution time with our proposed flow control is slightly better than without flow control. Relatively large H values indicate highly self-similar traffic and a higher traffic prediction accuracy. However, because the average packet size also increases with H, the reduction does not grow linearly with H.
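As a rough illustration of what the Hurst exponent measures, the aggregate-variance method estimates H from how the variance of time-aggregated traffic decays: for self-similar traffic, Var(X^(m)) ~ m^(2H−2). This is a generic sketch, not the trace generator of [20]:

```python
import math
import random

def hurst_aggregate_variance(series, block_sizes=(1, 2, 4, 8, 16)):
    """Estimate H by fitting the slope of log Var(X^(m)) vs. log m,
    where X^(m) is the series averaged over non-overlapping blocks
    of size m; then H = 1 + slope / 2."""
    xs, ys = [], []
    for m in block_sizes:
        blocks = [sum(series[i:i + m]) / m
                  for i in range(0, len(series) - m + 1, m)]
        mean = sum(blocks) / len(blocks)
        var = sum((b - mean) ** 2 for b in blocks) / len(blocks)
        xs.append(math.log(m))
        ys.append(math.log(var))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys)) /
             sum((x - mx) ** 2 for x in xs))
    return 1 + slope / 2

random.seed(0)
iid = [random.random() for _ in range(4096)]
h = hurst_aggregate_variance(iid)
print(round(h, 2))   # i.i.d. traffic has H near 0.5; bursty traffic pushes H toward 1
```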
H                           0.576     0.661     0.768     0.855     0.978
Original ave. latency       3553.14   3596.45   3649.21   3665.53   3614.56
Improved ave. latency       482.512   467.787   387.716   412.983   417.577
Reduction of ave. latency   7.364     7.688     9.412     8.876     8.656
Original max. latency       7623      7623      7710      7658      7714
Improved max. latency       1591      1532      1016      1054      1037
Reduction of max. latency   4.791     4.976     7.589     7.266     7.438
Original simulation cycle   8580      8510      8550      8480      8450
Improved simulation cycle   8280      8260      7690      7781      7731
(All latencies and simulation cycles are in cycles.)

Table 6.3: Our proposed flow control algorithm for synthetic traffic leads to a huge reduction in the average latency and the maximum latency, and a slight reduction in the execution time.
Chapter 7
Conclusion and Future Work
This thesis proposes an application-oriented flow control for packet-switched networks-on-chip. By tracking and predicting the end-to-end transmission behavior of the running applications, we can limit traffic injection when the network is heavily loaded. By delaying some transmissions judiciously, the average packet latency can be decreased significantly, and thus performance improves noticeably. In our experiments, we adopt real application traffic traces as well as synthetic traffic traces. The results show that our proposed flow control not only decreases the average and maximum packet latencies, but under some conditions even shortens the execution time.
Future work will focus on improving the accuracy of the application-oriented traffic prediction. The simulation configuration also deserves further study, as do determining the optimal parameters and tuning the flow control algorithm. In addition, we currently ignore the communication dependencies between the traffic traces because they are difficult to model.
Bibliography
[1] Y. S.-C. Huang, C.-K. Chou, C.-T. King, and S.-Y. Tseng, “NTPT: On the end-to-end traffic prediction in the on-chip networks”, in Proc. 47th ACM IEEE Design Automation Conference, 2010.
[2] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, Liewei Bao, J. Brown, M. Mattina, Chyi-Chang Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook, “TILE64 processor: A 64-core SoC with mesh interconnect”, in Digest of Technical Papers, IEEE International Solid-State Circuits Conference (ISSCC), Feb. 3–7, 2008, pp. 88–598.
[3] Jose Duato, Sudhakar Yalmanchili, and Lionel Ni, “Interconnection net-
works”, 2002, pp. 428–431.
[4] S. Mascolo, “Classical control theory for congestion avoidance in high-speed
internet”, in Proc. Decision and Control Conference, 1999.
[5] Cui-Qing Yang, “A taxonomy for congestion control algorithms in packet
switching networks”, in IEEE Network, 1995.
[6] Hua Yongru Gu, Wang Hua O., and Hong Yiguang, “A predictive conges-
tion control algorithm for high speed communication networks”, in Proc.
American Control Conference, 2001.
[7] Erland Nilsson, Mikael Millberg, Johnny Öberg, and Axel Jantsch, “Load
distribution with the proximity congestion awareness in a network on chip”,
in Proc. Design, Automation, and Test in Europe, 2003, p. 11126.
[8] U. Y. Ogras and R. Marculescu, “Prediction-based flow control for network-
on-chip traffic”, in Proc. 43rd ACM IEEE Design Automation Conference,
2006, pp. 839–844.
[9] U. Y. Ogras and R. Marculescu, “Analysis and optimization of prediction-
based flow control in networks-on-chip”, in ACM Transactions on Design
Automation of Electronic Systems, 2008.
[10] Vincent Nollet, Théodore Marescaux, and Diederik Verkest, “Operating-system controlled network on chip”, in Proc. 41st ACM IEEE Design Automation Conference, 2004.
[11] P. Avasare, J-Y. Nollet, D. Verkest, and H. Corporaal, “Centralized end-to-end flow control in a best-effort network-on-chip”, in Proc. 5th ACM International Conference on Embedded Software, 2005.
[12] Mohammad S. Talebi, Fahimeh Jafari, and Ahmad Khonsari, “A novel
congestion control scheme for elastic flows in network-on-chip based on sum-
rate optimization”, in ICCSA, 2007.
[13] M. S. Talebi, F. Jafari, and A. Khonsari, “A novel flow control scheme
for best effort traffic in noc based on source rate utility maximization”, in
MASCOTs, 2007.
[14] Mohammad S. Talebi, Fahimeh Jafari, Ahmad Khonsari, and Mohammad H.
Yaghmaeem, “Best effort flow control in network-on-chip”, in CSICC, 2008.
[15] Fahimeh Jafari, Mohammad S. Talebi, Mohammad H. Yaghmaee, Ahmad
Khonsari, and Mohamed Ould-Khaoua, “Throughput-fairness tradeoff in
best effort flow control for on-chip architectures”, in Proc. 2009 IEEE
International Symposium on Parallel and Distributed Processing, 2009.
[16] T. Marescaux, A. R˚angevall, V. Nollet, A. Bartic, and H. Corporaal, “Dis-
tributed congestion control for packet switched networks on chip”, in
ParCo, 2005.
[17] J.W. van den Brand, C. Ciordas, K. Goossens, and T. Basten, “Congestion-
controlled best-effort communication for networks-on-chip”, in Proc. De-
sign, Automation, and Test in Europe, 2007.
[18] Jin Yuho, Yum Ki Hwan, and Kim Eun Jung, “Adaptive data compression
for high-performance low-power on-chip networks”, in Proc. 41st annual
IEEE/ACM International Symposium on Microarchitecture, 2008.
[19] Srinivasan Keshav, “Congestion control in computer networks”, 1991.
[20] Vassos Soteriou, Hangsheng Wang, and Li-Shiuan Peh, “A statistical traffic
model for on-chip interconnection networks”, in Proc. 14th IEEE Interna-
tional Symposium on Modeling, Analysis, and Simulation, 2006.
[21] Anthony Leroy, “Optimizing the on-chip communication architecture of low
power systems-on-chip in deep sub-micron technology”, 2006.
[22] N. Agarwal, T. Krishna, L. Peh, and N. Jha, “Garnet: A detailed on-chip
network model inside a full-system simulator”, in Proceedings of Inter-
national Symposium on Performance Analysis of Systems and Software,
2009.