thesis

University POLITEHNICA of Bucharest
Faculty of Automatic Control and Computers,
Computer Science and Engineering Department
BACHELOR THESIS
TCP Traﬃc Monitoring and Debugging
Tool
Scientiﬁc Adviser: Author:
Şl.dr.ing. Costin Raiciu Dan-Ştefan Drăgan
Bucharest, 2015

I would like to thank to my supervisor, Costin Raiciu, for all the support and good advice he
provided, and also to Alexandru Agache who was always available for all my requests.

Abstract
As cloud computing technology usage continues to escalate, it becomes crucial for a cloud
provider to understand what its tenant applications need. The problem lies in the lack of any
means of doing accurate traffic engineering and bottleneck detection. This project aims to
provide an efficient tool that will determine and make use of metrics and statistics of a TCP
connection to help users evaluate its state. The application revolves around a simple idea which
is combining inferred methods of determining TCP connection’s most important variables with
socket-level traffic logs. The testing was realised using sendbuffer advertisement, which basically
means advertising the number of bytes waiting to be sent in a kernel buffer of a TCP connection
in every TCP segment.
Keywords: TCP kernel buffer, debugging, traffic engineering, monitoring, congestion window,
TCP segment
ii

Contents
Acknowledgements i
Abstract ii
1 Introduction 1
1.1 Project Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Project Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Project Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 Background 3
2.1 Transmission Control Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Maximum Segment Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.2 TCP Receive Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.3 TCP Congestion Window and Congestion Control . . . . . . . . . . . . . 4
2.1.4 TCP Finite State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2 TCP Sendbuffer Advertisement . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3 Related Work 8
3.1 RINC - Real-Time Inference-based Network Diagnosis in the Cloud . . . . . . . . 8
3.2 TCP Sendbuffer Advertising Patch for Multipath TCP Kernel Implementation . 10
3.3 Scalable Network-Application Profiler (SNAP) . . . . . . . . . . . . . . . . . . . 11
3.4 Inferring TCP Connection Characteristics Through Passive Measurements . . . . 12
4 Diagnosis and Monitoring Tool’s Architecture and Implementation 15
4.1 Overview of the diagnosis tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.2 Brief Description of Project’s Idea . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3 Phases in Tool’s Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4 Inference Methods of Determining Congestion Window . . . . . . . . . . . . . . . 20
4.4.1 Inferring congestion window for TCP Cubic . . . . . . . . . . . . . . . . . 20
4.5 Linux Loadable Kernel Module . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
4.6 Sendbuffer Advertising Kernel Patch . . . . . . . . . . . . . . . . . . . . . . . . . 24
5 Testing 26
5.1 The Testing Environment Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2 Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
5.3 Error Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
6 Conclusion and Further Development 34
iii

List of Figures
2.1 Congestion window - evolution in time . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Window growth function of TCP CUBIC . . . . . . . . . . . . . . . . . . . . . . 6
2.3 TCP State Machine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
3.1 RINC Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.2 SNAP socket-level monitoring and analysis . . . . . . . . . . . . . . . . . . . . . 11
3.3 TCP running sample based RTT estimation . . . . . . . . . . . . . . . . . . . . . 13
4.1 Overview of the diagnosis tool when running on sender’s host . . . . . . . . . . . 16
4.2 User space / kernel space communication on sender’s host . . . . . . . . . . . . 17
4.3 Overview of the diagnosis tool when running close to the sender’s host . . . . . 18
5.1 Testing environment’s architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5.2 Smoothed round trip time CDF representation for a delay of 1 ms; absolute value
of the difference (left), precision error (right) . . . . . . . . . . . . . . . . . . . . 29
5.3 Smoothed round trip time CDF representation for a delay of 10 ms; absolute
value of the difference (left), precision error (right) . . . . . . . . . . . . . . . . . 29
5.4 Smoothed round trip time CDF representation for a delay of 100 ms; absolute
value of the difference (left), precision error (right) . . . . . . . . . . . . . . . . . 30
5.5 Flight size CDF representation for a delay of 1 ms and a bandwidth of 10 Mb/s;
absolute value of the difference (left), precision error (right) . . . . . . . . . . . . 30
5.6 Flight size CDF representation for a delay of 10 ms and a bandwidth of 200
Mb/s; absolute value of the difference (left), precision error (right) . . . . . . . . 31
5.7 Flight size CDF representation for a delay of 10 ms and a bandwidth of 1 Gb/s;
absolute value of the difference (left), precision error (right) . . . . . . . . . . . . 31
5.8 Congestion window CDF representation for a delay of 10 ms and a bandwidth
of 200 Mb/s; absolute value of the difference (left), precision error (right) . . . . 32
5.9 Congestion window evolution over time for a 10 ms delay and a 200 Mb/s band-
width . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.10 Greedy sender detection error for a 5 ms delay between the sender and the
detection point; absolute value of the difference (left), precision error (right),
sendbuffer evolution over time corresponding to the tested scenario (center) . . . 33
iv

List of Tables
2.1 Overview of TCP connection states . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.1 Output of running list_connections.sh script . . . . . . . . . . . . . . . . . . . . 18
v

Chapter 1
Introduction
1.1 Project Description
1.1.1 Project Scope
As cloud computing began to gain ground, it also appeared the need for a better monitoring of
the infrastructure and its tenants. This project is a traffic engineering tool intended to be of
great use to cloud providers that are eager to know more in-depth information about the state
of every TCP connection. Moreover, using this application, it would be easier to find malicious
tenants that are trying to flood the network using TCP-formatted packets or selfish tenants that
are running TCP stacks with congestion control turned off to get better throughput. It also
demonstrates the usefulness of introducing sendbuffer advertisement in every TCP segment.
Another useful usage of this tool would be finding out whether endpoint applications are back-
logged or not, thus being able to take more intelligent decisions in shifting traffic around and
obtain a better network optimization. A cloud operator may decide to split a flow into different
subflows, using multipath TCP, determining why congestion occurred.
This network diagnosis tool brings a fresh approach based on combining measured and de-
termined statistics of TCP connections with socket-level traffic logs and inference methods of
determining connection’s state variables depending on the location of running it. It also uses the
research work of Costin Raiciu and Alexandru Agache1
in testing purpose. The existing similar
tools do not completely resolve the problem of finding out at any point inside the network
path if an application is backlogged or not, needs more bandwidth, or experienced a timeout.
This tool is aiming to have a very good precision when running close to the sender of the TCP
connection. For accuracy checking I will use sendbuffer advertising, which is currently the best
method of determining in the middle of the network if a TCP connection is application-limited
or not.
1.1.2 Project Objectives
The main goal of this project is creating a tool capable of identifying problems inside cloud-
based networks, by monitoring TCP connections and finding out their limiting factor. It will
also display statistics for a specified connection such as round trip times, retransmission timeouts
and other important counters and constants.
1Alexandru Agache, Costin Raiciu, Oh Flow, Are Thou Happy? TCP sendbuffer advertising for make benefit
of clouds and tenants
1

CHAPTER 1. INTRODUCTION 2
The testing of this application is based on a small Linux kernel patch that may be applied
to TCP stacks enabling information regarding transmitter’s backlogged data in every TCP
segment, introducing no bandwidth overhead in average case. The tool also uses a character
device driver for determining the congestion window value when running on the same host as
the sender of the TCP connection. This approach provides accuracy in obtaining the desired
value, by reading the exact information from the sender’s kernel variable. When running the
tool on another host located close to the sender, the congestion window’s value is inferred using
mechanisms that I will describe later.
After knowing all these information regarding the size of data the sender has in its kernel buffer
and the congestion window’s value, the root cause of any connection limitation may be easily
analysed. To accomplish that it would be also required to know the number of in-flight packets
and the retransmission timeouts.
To conclude, the purpose of this tool is providing traffic statistics for any TCP connection,
such as round trip time, retransmission timeout, or flight size to cloud operators. Moreover,
the application will display some warning flags indicating that the connection is application-
limited or network bound. It may be run on the same host as the sender of a TCP connection,
thus taking advantage of kernel information and measurements, or it may be run anywhere on
the connection’s path using inference methods of determining state variables instead of using
socket-level logs.

Chapter 2
Background
2.1 Transmission Control Protocol
The Transmission Control Protocol is amongst the main Internet protocols and the most widely
used transport protocol according to measurements that accredit TCP for 85 to 95 percent of
the Internet traffic in wide-area networks1,2,3
. It provides ordered and error free delivery of a
stream of bytes between applications that transfer data over an IP network. Richard Stevens
states in the first volume of TCP/IP Illustrated that TCP provides a byte stream service which
is reliable and connection-oriented and this last concept denotes that two applications using
TCP transport protocol must contact one another in order to establish a connection before data
exchange.
TCP handles retransmission of dropped or corrupt packets as well as acknowledgement of all
packets that arrive. A TCP mechanism known as flow control resolves the problem that appears
when there is a very slow receiver relative to the sender, introducing a method of forcing the
sender to decrease its sending rate. Another important mechanism that TCP implements is
congestion control that dictates the sender to lower its sending rate to avoid overwhelming the
network between itself and the receiver. If in the case of flow control, a window advertisement
is used to determine the sender to keep the pace with the receiver, for congestion control the
sender has to guess that it needs to lower its transmission rate based on some other evidence,
such as the congestion window.
2.1.1 Maximum Segment Size
The maximum segment size (MSS) is the size that will never be outrunned irrespective of the
high value of the current receive window. It is a parameter of the Options field of the TCP
header that specifies the maximum amount of data, in bytes, that an endpoint can receive in
a single TCP segment, not considering the IP or TCP headers. When deciding the amount
of data that should be put into a segment, each party in the TCP connection will determine
how much to put by taking into account the current window size, along with with the many
congestion-avoidance algorithms, but the amount of data will never exceed this segment size.
1K. Claffy, Greg Miller, and Kevin Thompson. The Nature of the Beast: Recent Traffic Measurements from
an Internet Backbone. In Proceedings of INET, 1998.
2Chuck Fraleigh and et. al. Packet-Level Traffic Measurements from a Tier-1 IP Backbone. ATL Technical
Report TR01-ATL-110101, Sprint, Nov. 2001.
3Sean McCreary and Kc Claffy. Trends in Wide Area IP Traffic Patterns: A View from Ames Internet
Exchange. In Proceedings of the 13th ITC Specialist Seminar on Measurement and Modeling of IP Traffic,
Monterey, CA, Jan. 2000.
3

CHAPTER 2. BACKGROUND 4
2.1.2 TCP Receive Window
The TCP receive window has the important role of dictating the amount of unacknowledged
data that might be in flight between sender and receiver. A 16-bit window field that lays
within the TCP header indicates which are the acceptable sequence numbers on the receiver’s
side relative to the acknowledgement number from the same segment. After receiving data,
acknowledgements are sent back to the sender along with a window field indicating the data
amount remaining in the receive window. So if the acknowledgement number is 1000 and after
scaling the receive window the obtained value is 10000, it means that the sender will only be
able to send segments with the sequence number in the range 1000-11000.
2.1.3 TCP Congestion Window and Congestion Control
In the Transmission Control Protocol, the congestion window (cwnd) is a data window imposed
by the sender, implemented in order to avoid routers’ overwhelming in the middle of the network
path. It estimates how much congestion exists between the endpoints of a TCP connection. The
congestion window’s initial value will be set to a small multiple of the connection’s maximum
segment size (MSS). The window will start growing exponentially until the expiration of the
timeout or until the slow-start threshold is reached. After this its value will increase linearly
after receiving each new acknowledgement at the rate of 1
cwnd packets.
It might be said that variance in the congestion window is controlled by an Additive Increase
Multiplicative Decrease approach. The essence of this approach is that, after leaving slow-start
state, the sender will add a constant to the window size, if receiving all segments and the
acknowledgements reach the sender before a timeout occurs. When it occurs, the value of the
congestion window will be set to one MSS, the slow-start threshold value will be set to half the
size of the window before packet loss started and slow-start phase is initiated.
Congestion is a very important issue that can arise in computer networks when too many
packets are present in a part of the subnet, thus causing a steep performance degradation. It
may occur when the load on the network is greater than the capacity of the network. In this
subsection TCP’s congestion control algorithm will be briefly presented, describing each of its
four phases: slow start, congestion avoidance, fast retransmit, and fast recovery.
Figure 2.1: Congestion window - evolution in time1
1RINC Architecture, Mojgan Ghashemi, Theophilus Benson, Jennifer Rexford, RINC: Real-Time Inference-
based Network Diagnosis in the Cloud

Slow Start and Congestion Avoidance
The slow-start state (SS) is the first state the TCP connection will be in after setting it up,
initially sending packets at a slow rate and trying to learn the available bandwidth by exponen-
tially increasing the rate. When receiving a packet, the receiver sends back an acknowledge-
ment packet. Upon receiving the acknowledgement, the sender starts sending more packets.
The connection remains in slow start until either a loss happens, meaning that three duplicate-
acknowledgements have been received or no acknowledgements have been received before a
retransmission timer expires (RTO), or the sender sends a predefined number of bytes, known
as slow start threshold (ssthresh). According to RFC 56811
, the initial ssthresh value will be
set to an arbitrarily high value, like for example to the maximum possible advertised window’s
size, but it will be reduced as a response to congestion.
After the TCP sender assumes that it has discovered its equitable share of the bandwidth,
it transitions into the congestion avoidance (CA) state. While in this state, the connection
increases its sending rate linearly. From this point, if a packet loss is inferred from RTO, the
connection returns to the slow-start state and resets the window size to initial window, usually
one MSS.
Fast Retransmit/Fast Recovery
If congestion is detected through duplicate acknowledgements and the connection is in con-
gestion avoidance state the fast retransmit and fast recovery algorithms will be invoked. The
connection sets its window size to half the current value and resends presumably lost segments.
If a duplicate acknowledgement is received the sender is not capable of distinguishing between
TCP segment loss or delayed and out of order segment received at the other endpoint. If more
than two duplicate acknowledgements are received by the sender, it is highly possible that min-
imum one segment was lost. When receiving three or more duplicate acknowledgements, the
sender will not wait for the retransmission timeout to expire and retransmits the segment, thus
entering the congestion avoidance state. In concludion, there will be no lost time waiting for a
timeout in order for retransmission to begin.
CUBIC TCP
CUBIC is a TCP implementation that uses an improved congestion control algorithm for high
bandwidth networks with high latency. It is used by default in Linux kernels 2.6.19 and above.
The greatest improvement CUBIC brings is that it simplifies the window control mechanism
of previous variants and enhances TCP-friendliness and round trip time fairness. The name of
this protocol is very representative, because its window growth function is a cubic function.
The congestion window of this TCP variant is described by the following function:
Wcubic = C(t − K)3
+ Wmax
In the formula above, C is a scaling factor and t is the time that had passed from the last
window shrinking. Wmax is the size of the window just before the last window decrease, K =
3
Wmax ∗ β
C , where β is a constant decrease factor applied for window shrinking every time a
loss occurs.
1M. Allman, V. Paxson, Request for Comments: 5681, Sep. 2009

Figure 2.2: Window growth function of TCP CUBIC1
The ﬁgure above presents TCP CUBIC’s growth function with the origin at Wmax. After
a window shrinking it starts growing very fast, but as it gets closer to Wmax, it reduces its
growth. The increment of the window becomes almost zero around the value of Wmax and the
algorithm will start probing for more bandwidth with an initial window slow growth, above that
value. The window growth will accelerate as its value starts exceeding the value of Wmax. The
role of this slow growth is to enhance the stability of the protocol, and increase the network’s
utilization, while the fast growth ensures the protocol’s scalability.
2.1.4 TCP Finite State Machine
Figure 2.3: TCP State Machine2
1Injong Rhee, and Lisong Xu, CUBIC: A New TCP-Friendly High-Speed TCP Variant
2W. Richard Stevens, Kevin R. Fall TCP/IP Illustrated vol. 1

State Description
LISTEN waits a connection request from an endpoint
SYN-SENT sends a connection request, then waits a matching connection
SYN-RECEIVED both received and sent a connection requests are being sent, then waits a
request acknowledgement
ESTABLISHED an open connection
FIN-WAIT-1 waits an acknowledgement of the previously sent termination request or a
termination request
FIN-WAIT-2 waits for the remote endpoint to send a termination request
CLOSE-WAIT waits a termination request from the local endpoint
CLOSING waits a termination request acknowledgement from the remote endpoint
LAST-ACK waits an acknowledgement of the previously sent termination request
TIME-WAIT waits the remote endpoint to received the termination request acknowl-
edgement
CLOSE no connection state
Table 2.1: Overview of TCP connection states
2.2 TCP Sendbuffer Advertisement
When a send system call is issued by the application, the kernel copies data into an in-kernel
buffer as much as it is capable of, then sends TCP segments on the transmission medium
according to the window advertised by receiver and the congestion window.
The sendbuffer is composed of two parts:
• In-flight segments that have been already sent but unacknowledged by the receiver. They
must be kept until acknowledged.
• Segments that are waiting to be sent, also called the backlog.
Sendbuffer advertising refers to announcing the number of bytes that composes the backlog
in every TCP segment. This information plays a very important role in detecting bandwidth-
limited flows, thus a client that observes an increase in the number of bytes in the buffer may
decide to take full advantage of multipath TCP and spread data across multiple subflows.
Another essential use case of sendbuffer advertisement would be identifying application-limited
TCP connections. It would be beneficial for application developers who find out there is an
issue in their code and try to optimize it.

Chapter 3
Related Work
3.1 RINC - Real-Time Inference-based Network Diagnosis
in the Cloud
RINC is a framework that runs within the hypervisor and uses techniques for inferring the
internal state of midstream connections, allowing the user to collect statistics and measurements
at any moment during a connection lifetime. It oﬀers the cloud providers the opportunity to
write diagnosis applications in a simpliﬁed way by providing a simple query interface to the
cloud platform.
Figure 3.1: RINC Architecture1
As presented in the picture above, RINC’s architecture contains two main components:
• Local agent - It consists of a measurement module and a communication module. The
main purposes of the measurement module is to calculate statistics, inspect packets and
reconstruct the internal state of every TCP connection. The communication module
translates queries into instructions, forwards them to the Global Coordinator and returns
the results.
• Global coordinator - It represents the integrant part that aggregates information from
the local agents. A set of mechanisms for reducing the local agent’s overhead are provided
by this component, thus improving scalability. The user has the ability to limit the set
8

CHAPTER 3. RELATED WORK 9
of statistics being collected and to determine the minimum number of connections to
investigate in order to identify problems within the network.
The solution proposed by this project is to selectively collect only a relevant subset of statistics in
an on-demand manner when the cloud operator needs them. A downside of collecting statistics
for a midstream connection is loosing track of a significant number of initial packets. Thus,
the approach of this framework is not to calculate long running averages and sampled statistics
that cannot be accurately inferred for midstream flows, but to determine relative values of these
important statistics.
Another great issue is the overhead introduced by storing statistics within the hypervisor. By
comparing the overhead of collecting all statistics against the overhead of selectively collecting
data it was concluded that the former causes a 75 percent increase in memory.
Inferring the state of the TCP state machine for midstream monitored connections is difficult
for various reasons. Variables that are generally considered to be constant, as slow-start thresh-
old actually change over time with each state transition. Another obstacle in determining the
state of a TCP connection is the behavioural resemblance of different states. For example,
TCP behaves identically in congestion avoidance and fast recovery, linearly increasing its send-
ing rate. In addition to this, a connection in slow-start not having enough application data
to send may have a linear sending rate, as the congestion avoidance or fast recovery state.
Furthermore, transitions are triggered by a sequence of events. If the connection is monitored
after three duplicate-acknowledgements has been received, it may be incorrectly presumed that
the connection is in congestion avoidance state even though the connection transitioned to fast
recovery state.
The authors of this framework have developed a series of canonical applications to demonstrate
its flexibility as a diagnosis tool:
• Detecting Long-Lived Flows - This application tries to find the flows that have lasted
longer than a threshold value using the local agent to monitor all TCP flows and to
calculate their duration, then returning a list of flow keys that have a duration more than
the threshold. RINC’s Global Coordinator will aggregate responses and present it to cloud
provider.
• Detecting Heavy Hitters - The goal of this proof of concept tool is to find the flows
that have sent more than a threshold number of bytes using the local agent to monitor
all TCP connections and for each. A list of flows matching the criterion will be created
and sent back to Global Coordinator.
• Traffic Counter - This application tries to keep track of how many distinct source
IPs send traffic to a specific destination IP by monitoring all the connections to this
destination.
• Detecting Super Spreaders - The purpose of this tool is to find the IPs who contact
more than a constant number of distinct destination IPs by monitoring all connections in
the cloud to get a list of distinct source IPs, then for each of them a query is submitted
to count its distinct number of destination IPs. The source IP is considered to be a super
spreader if the value determined earlier is greater than the specified threshold.
• Root Cause Analysis of Slow Connections - This application is finding troubled
connections and then starts a root cause analysis on them using the local agent to monitor
all the connections and to collect their sending rates, thus finding the subset of troubled
connections with a limited sending rate. For each of these connections, the tool collects
more heavy weight statistics such as round trip time, congestion window and TCP receive
window to find out if the limiting factor is network, sender or receiver.
1RINC Architecture, Mojgan Ghashemi, Theophilus Benson, Jennifer Rexford, RINC: Real-Time Inference-
based Network Diagnosis in the Cloud

3.2 TCP Sendbuffer Advertising Patch for Multipath TCP
Kernel Implementation
This project is a research paper backed by the implementation of a small kernel patch that
may be applied to the multipath TCP kernel implementation. It was motivated by the lack of
relevant information in the network regarding the intentions of TCP endpoints. The authors of
this paper thought that it would be of great use to know whether an application is backlogged
or not. As a consequence, they came up with the idea of including the number of bytes waiting
to be sent from the kernel buffer in every TCP segment.
As mentioned in the previous chapter, sendbuffer is composed of two parts: already sent seg-
ments but unacknowledged by the receiver and segments waiting to be sent, or the backlog.
Several approaches of advertising the number of bytes that form the backlog have been pro-
posed. First it was taken into consideration placing the advertisement into a new TCP option.
This would have introduced an overhead of 8 bytes in every TCP segment. The overhead is
divided as follows: 2 bytes are required for option size and type, 4 bytes for the sendbuffer
advertisement, and another additional 2 bytes for padding. Moreover, hardware offloading
support on modern network interface controllers may be disabled, thus reducing performance.
A better proposal was to introduce the sendbuffer advertisement in the receive window field of
outgoing TCP segments, using a reserved flag to indicate that the field has a different purpose.
this approach has no overhead at all. The advertisement will be encoded by the stack only
when the receive window and acknowledgement combination is the same as the one previously
sent. The idea is feasible because in most cases traffic is unidirectional, thus the receive window
advertisements from the data source are redundant. Replacing receive window’s value with
sendbuffer advertisement will create no performance issues. In the opposite direction, the
advertisement is not needed and the acknowledgement and the window value are changing,
thus it will not be sent.
The kernel patch implemented to support the ideas described above has less than 100 lines of
code. It may be applied to the kernel implementation of multipath TCP. After successfully
applying the patch, two new sysctl options will appear: one that activates and deactivates the
advertisement and one that specifies the number of the TCP option. This option will be added
to the TCP segment if there is room for it, requiring 8 bytes.
The concept of sendbuffer advertising finds usage in many circumstances. A good example is
detecting network hotspots. The authors sustain that a very precise way of finding bottlenecks in
datacenter networks would be relying on sendbuffer information carried by packets, by averaging
the sendbuffer value of every packet on a specific link. Several tests had been carried out and it
was found out that application-limited flows has the average reported sendbuffer close to zero.
When running network limited traffic through a link sendbuffer averages are on the order of
hundreds of KB. Having all these information datacenter operators may easily discover when a
link is a bottleneck.
Another important use case is helping mobile clients determine WiFi’s performance. Using
multipath TCP a mobile may use both WiFi and cellular links simultaneously, spreading traffic
over both. In consequence, the user may increase its network capacity if WiFi is not good
enough.
A downside of including sendbuffer advertisement in every packet is that the information about
the sender’s congestion window may be deduced. If total size of the sendbuffer is constant,
when the congestion window grows, the number of bytes buffered by the receiver reduces,
meaning that there are more in-flight packets. When congestion window is decreased upon a
loss, the amount of bytes buffered increases. The good thing is that the sendbuffer varies due
to application writes into the buffer and the kernel sends packets out on the link when possible,
thus these trends are not easily visible.

3.3 Scalable Network-Application Profiler (SNAP)
SNAP is a network profiler that helps developers in identifying and fixing performance issues. It
collects in a passive manner TCP statistics and socket-call logs, requiring low storage overhead
and few computational resources. These information are correlated across shared resources,
such as switches, links, or hosts, and connections to identify the root of the problem.
This diagnosis tool holds information about the the network-stack configuration, network topol-
ogy, and the mapping of applications to servers. It identifies applications with common issues,
as well as congested resources, such links and hosts that affect multiple applications by correlat-
ing all these data. It instruments the network stack in order to observe the TCP connections’s
evolution in a direct and more precise way, instead of trying to infer TCP behaviour from
packet traces. Moreover, this tool may collect finer-grain information, without resorting to
packet monitoring.
The main improvement that SNAP differs from other diagnosis tools is the efficient and accurate
way of detecting and analysing performance problems using collected measurements of the
network stack. It provides an efficient way of identifying the component responsible for the
issue, such as the sender application, receiver, send buffer or network. Furthermore, it may
gather information about connections belonging to the same application, thus providing more
insight into the root cause of problems.
Figure 3.2: SNAP socket-level monitoring and analysis1
In the above picture it is presented the overview of SNAP’s architecture. First, TCP-connection
statistics are collected in real time with low overhead, through socket-level logs of application
write and read operations. After that, a TCP classifier identifies and categorizes the amount of
time with bad performance for each socket, logging the diagnosis. The final phase is represented
by the correlation of information across connections belonging to the same application, or
sharing a common resource. This step is performed by a centralized correlator to highlight the
performance issues.
According to the set of tests carried out, this project concludes that data-center operators
should allow the TCP stack to automatically tune the receiver window and send buffer sizes,
given the facts that machines with more connections tend to have more send buffer problems
and there is a mismatch between receiver window send buffer size.
1M. Yu, A. Greenberg, D. Maltz, J. Rexford, L. Yuan, S. Kandula, and C. Kim. Profiling network performance
for multi-tier data center applications. In NSDI, 2011.

3.4 Inferring TCP Connection Characteristics Through Pas-
sive Measurements
This scientific research paper presents a passive measurement methodology to infer two essen-
tial variables associated with a TCP connection: the round trip time (RTT) and the sender’s
congestion window (cwnd), thus being able to provide a valuable diagnostic of network per-
formance from the end-user’s point of view. Comparing the amount of data sent with cwnd,
one can determine when a higher transfer rate could be supported by a TCP connection, if
there is more data available by observing the manner in which cwnd value is affected by loss.
Knowing these information it is easy to identify nonadhering TCP senders, or particular flavour
of TCP, such as Reno, New Reno, Tahoe. Identifying the TCP flavour is useful in determining
the manner in which cwnd is modified after packet loss.
A great improvement this project comes with lies in the development of a passive methodology
to infer a sender’s congestion window using a measurement point that observes TCP segments
passing through. The measurement point may be placed anywhere between the receiver and
the sender. The downside of this approach is that in case of connection losses, the estimate
of cwnd is sensitive to the TCP congestion control flavor that matches the sender’s detected
behaviour. It also implements a RTT estimation technique based on the estimated value of
cwnd.
The core idea of estimating the congestion window value is to construct a clone of the TCP
sender’s state for every monitored TCP connection, in the form of a finite state machine (FSM).
The estimated FSM updates its current sender’s cwnd based on observed receiver-to-sender
acknowledgements, that cause the sender to change its state if received. Detecting a timeout
event at the sender will result into a transition. These events are revealed in the form of
out-of-sequence retransmissions from sender to receiver.
The biggest obstacle this project meets is represented by estimating the state of a distant sender,
that introduces a great amount of uncertainty into cwnd appraisal. First, when processing
large amounts of data, the estimated FSM will maintain minimal state and perform limited
processing, thus it cannot backtrack or go back to previous state transitions. It is possible that
the FSM to not be able observing the same packet sequence as the sender. Acknowledgements
that pass through the measurement point may not arrive at the sender’s side. Furthermore,
sender’s sent packets may be duplicated or reordered on their way to the measurement point.
Another challenge is that implementation details of the TCP sender are not visible to the FSM.
A couple of variables has to be initialized in the estimated FSM. One of the variables is the
sender’s initial congestion window size (icwnd), which represents the maximum number of
bytes that a sender can transmit after completing the three-way-handshake and before any
acknowledgement arrives. The icwnd value will be set in the range starting from twice the
maximum segment size to twice this value. The slow-start threshold (ssthresh) is another
variable that will be initialized to an extremely large value, as it is frequently done in any TCP
stack.
The evolution of these variables will happen as follows. A TCP sender can be in slow-start or
congestion avoidance during its normal operations flow. The cwnd will be increased by 1 or by
1
cwnd at the arrival of a new acknowledgement. If the sender notices a loss caused by timeout,
cwnd will be set to 1 and ssthresh to max(min(rwnd,cwnd)
2 ,2), rwnd being the receiver advertised
window, described in the second chapter of this thesis.
Detecting packet loss by receiving three duplicate acknowledgements is taking into consideration
the differences between the three flavours mentioned above. For Tahoe, the sender will use the
fast retransmit algorithm, behaving as if the retransmission timeout had expired. In the case
of Reno, the fast recovery algorithm is added to Tahoe’s fast retransmit algorithm. It presumes

that each duplicate acknowledgement is a hint that another packet has reached the receiver.
The ssthresh value will be set to 1
cwnd and cwnd will be set to ssthres + 3. At this point,
the sender increments the cwnd by 1 when receiving a new duplicate acknowledgement and
resets the value of cwnd to ssthresh when receiving a new acknowledgement, thus it exits fast
recovery and returns to congestion avoidance. For NewReno flavour, there is a change in Reno’s
fast recovery mechanism occurring when the sender receives a new acknowledgement while in
the recovery phase. It consists in removing the need of loss detection through timeout when
multiple losses are detected within a single congestion window. This improvement guarantees
that the sender is capable of retransmitting a lost packet after every round trip time without
using the timeout mechanism.
The solution proposed for determining the congestion window is to keep track of its value for
all of the three TCP flavours. When monitoring traffic, if a packet is not allowed for a given
flavour, then the observed data packet represents an event that is not allowed. A count of the
number of such events is maintained for each of the candidate flavours. The sender’s flavour is
considered to be the one with the minimum number of nonadhering events.
Figure 3.3: TCP running sample based RTT estimation1
The picture above describes the basic idea behind RTT estimation mechanism. First of all,
the round trip delay should be determined from the measurement point to the receiver, then
from the receiver to the measurement point. After that, the round trip delay between the
measurement point and the sender should be determined, then it should be determined the
delay between the sender back to the measurement point. Summing these two delays will result
the estimate of the RTT.
1Sharad Jaiswal, Gianluca Iannaccone, Christophe Diot, Jim Kurose, Don Towsley, Inferring TCP Connection
Characteristics Through Passive Measurements

Chapter 4
Diagnosis and Monitoring Tool’s
Architecture and Implementation
4.1 Overview of the diagnosis tool
This diagnosis and monitoring tool’s main purpose is to help cloud providers and developers
debug TCP related problems by gaining more insight of every TCP connection. It is composed
of a series of different modules. First of all, there is an executable that takes a series of command
line arguments, such as source and destination IP addresses and destination port to identify
the TCP connection the user would like to debug. This application intercepts all the ingoing
and outgoing network packets that pass through the network device it runs on using libpcap in
accomplishing this goal.
After a network packet is intercepted, the application will get a better insight of its IP and TCP
headers and register all the important information and relative time values in some special struc-
tures, thus being able to determine round trip time, retransmission timeout and flight size. It
also keeps track of statistics regarding ingoing and outgoing packets, timeouts, retransmissions,
acknowledgements and duplicate acknowledgements to identify the congestion window.
The tool will display these statistics and metrics mentioned above, but will also inform the user
about the cause of the TCP connection’s issue. Thus, a network operator will know any time
if the connection is application-limited, receive buffer limited or network bound.
On the TCP connection’s sender’s host should always boot the multipath TCP Linux kernel
implementation. The sendbuffer advertising kernel patch should also be applied to it, thus all
the outgoing network packets will contain the backlog information in the TCP header. The
other host on the TCP connection’s path is not required to have the same operating system
distribution.
There are still some differences between running the tool on the sender’s host and running it
on another host placed close to the sender’s host. A command line parameter will be passed
to the application’s executable to be able to make the distinction between using ioctl functions
to read the kernel statistics and determining these values through inference methods. I will
further use two diagrams to help me explain the dissimilarities existent in these two situations.
14

CHAPTER 4. DIAGNOSIS AND MONITORING TOOL’S ARCHITECTURE
AND IMPLEMENTATION 15
Figure 4.1: Overview of the diagnosis tool when running on sender’s host
As one can see, in the diagram above I represented the flow of the diagnosis and monitoring
tool when running on the same host as the TCP connection’s sender. Regardless of the chosen
approach, on the sender’s host should always boot a multipath TCP kernel implementation. In
this situation a command line parameter will announce the monitoring application about the
presence of a loadable kernel module.
The kernel module provides a couple of mechanisms through a character device driver and
ioctl functions to obtain a series of metrics and statistics that the Linux kernel holds, such as
congestion window, receiver window, slow start threshold, or smoothed round trip time. It will be
compiled along with multipath TCP kernel sources and will be loaded using insmod command.
On the same host will also run a diagnosis and monitoring application, written in C++, that
knows about the character device driver’s presence. It communicates with the device driver
via ioctl functions, as presented in the diagram below, thus being able to set parameters for
identifying a specific TCP connection and get information about important kernel variables
regarding that connection. This tool uses libpcap for packet sniffing, registers all the important
fields from the IP and TCP headers and initializes some timers every time a network packet
is sent. It also waits for acknowledgements and computes round trip times, retransmission
timeouts and flight sizes.
Determining the round trip time of an outgoing network packet will be accomplished using a
hashtable, that maps the time of its interception with its sequence number read from its TCP
header. After that, when receiving an acknowledgement, its sequence number will be looked up
in the hashtable and the round trip time will be determined as a time difference. Getting the
system’s time is performed using the gettimeofday function, which is a very efficient function,
because it takes advantage of Virtual Dynamic Shared Object (VDSO) shared library. This
library allows userspace applications to perform some kernel actions without a big overhead
such as a system call.

Figure 4.2: User space / kernel space communication on sender’s host
The smoothed round trip time and the retransmission timeout are determined using the following
formulas:
RTTVAR = (1 - β) * RTTVAR + β * |SRTT - RTT|
SRTT = (1 - α) * SRTT + α * RTT
RTO = SRTT + max(K * RTTVAR, G)
The abbreviations above have the following meaning:
RTTVAR - Round Trip Time Variance
SRTT - Smoothed Round Trip Time
RTT - Measured Round Trip Time
α - smoothing factor (its recommended value is 1
8 )
β - delay variance (its recommended value is 1
4 )
K - constant set to 4
G - granularity of the clock in seconds
The flight size will be determined using the same data structure previously presented. Its value
will be the number of entries in the hashtable at a given time, which represents the number of
sent network packets that did not receive any acknowledgement.
In the second scenario, the diagnosis tool will run on a host located close to the sender’s host.
The loadable kernel module will not be present anymore and the application will be informed
about its absence and about the need of using another module, which infers the congestion win-
dow value by monitoring each outgoing packet and studying the receipt of acknowledgements,
duplicate acknowledgements and timeouts.
The mechanisms for determining TCP connection’s metrics, such as round trip times, timeouts
and flight sizes are the same as in the first use case, giving the assumption that these variables
should have almost the same values, due to the short distance between the sender and the host
the application in running on.

Figure 4.3: Overview of the diagnosis tool when running close to the sender’s host
The diagnosis application works as follows. First of all, there is a shell script that uses the
netstat command and is very useful in identifying all the existent TCP connections, source and
destination ports and IP addresses, as well as the state of each connection. An example of
output after running the script may be seen below. If we run the application on the sender’s
host, then we should ﬁrst load the tcp_stats loadable kernel module using insmod command.
After identifying the desired TCP connection the user would like to monitor, the tool will
intercept all the incoming and outgoing network packets for that connection. Every network
packet will be decapsulated and a close analysis of the IP and TCP header’s ﬁelds will be
taken. Knowing these information the tool will be able to determine metrics and statistic and
also display information about TCP connection’s main issues.
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 32 0 192.168.0.103:41998 91.189.92.10:443 CLOSE_WAIT
tcp 0 0 192.168.0.103:50385 64.233.184.189:443 ESTABLISHED
tcp6 0 0 ::1:631 :::* LISTEN
tcp 0 0 192.168.0.103:41902 68.232.35.121:443 ESTABLISHED
tcp 0 0 192.168.0.103:50736 50.19.232.199:80 TIME_WAIT
tcp 0 0 192.168.0.103:46222 54.76.102.229:80 ESTABLISHED
tcp 0 0 192.168.0.103:33890 190.93.244.58:80 TIME_WAIT
tcp 0 0 192.168.0.103:52139 199.71.183.28:80 ESTABLISHED
tcp 0 0 192.168.0.103:39231 91.198.174.192:443 ESTABLISHED
Table 4.1: Output of running list_connections.sh script

4.2 Brief Description of Project’s Idea
This project aims to identify application limited situations for every TCP connection. A sender
is considered to be “greedy” if at any moment in time the available window size equals the
number of unacknowledged packets in the network, also known as the flight size. The tool
bases on the idea that if the number of unacknowledged packets at the end of a flight, is less
than the available inferred window value at the beginning of a flight, then the sender is not
greedy. The beginning of a new flight is defined as the first data packet observed after an
acknowledgement that informs about the receipt of a data packet in the current flight.
The tool assumes that a set of packets sent by the sender in the same flight is never interleaved
with acknowledgements for that flight’s packets, meaning that all acknowledgements that in-
terrupts a flight cover network packets of previous flights. This assumption is valid as long as
the application is run somewhere near the sender’s host machine. If the observing point would
be close to the receiver, each data packet will be acknowledged immediately. Thus, the flight
size estimation will not be correct and the sender will also be consider not-greedy.
Another capability of this tool is determining if the TCP connection is in the network bound
situation. The idea behind this estimation bases on the fact that in this case the flight size
should be approximately equal to the congestion window and there are a significant number
of segment retransmissions. This scenario may also be tested by analysing the sendbuffer
advertisement and observing its size, which will be of the order of hundreds of Kbytes.
Testing the validity of the results will be accomplished using the backlog information advertised
in any TCP packet sent. For application-limited TCP connection, the average reported send-
buffer should be zero. When in network limited situation traffic through a connection obtains
sendbuffer averages that are on the order of hundreds of KB.
4.3 Phases in Tool’s Implementation
Implementing this monitoring and diagnosis application was not such an easy task as expected.
There were many obstacles and shortcomings that were overcame in its development process. In
addition to this, it was not only the coding period a time consuming activity, but also the paper
research, collecting information about how TCP works and finding the appropriate mechanisms
of determining TCP connection’s important variables created many issues.
First of all, a C++ application that uses pcap library to obtain every packet that is coming and
going through network interfaces existent on the host machine was implemented. After that,
a it was created mechanism of determining the smoothed round trip time, the retransmission
timeout, and the flight size of a TCP connection by using a hashmap data structure.
The next step was determining the congestion window value, having only information from
the network packet’s headers and some counters and timers that keep track of duplicate ac-
knowledgements and timeouts. These inference methods were implemented for a series of TCP
congestion control flavours, such as Reno, NewReno, Tahoe and Cubic. For more accuracy it
was also implemented a loadable kernel module that uses a character device driver and ioctl
functions and helps the monitoring tool retrieve in user space information, metrics and statistics
that Linux kernel knows about.
As a testing method it was used the sendbuffer advertising patch that may be applied to
multipath TCP kernel implementation. The multipath TCP development repository was cloned
and a switched to the branch the patch was intended for was made. After that, the sendbuffer
kernel patch was applied to the multipath TCP sources and the kernel was recompiled.

The actual testing was accomplished after kernel sources compilation. First of all, it was
used Qemu virtual machine monitor to test if the patch was correctly applied. The next step
was testing the diagnosis and monitoring tool along with the loadable kernel module and the
sendbuffer advertising patch on a mini-cluster. On this cluster there were available only three
physical machines and they had been used as a sender, a router and a receiver.
4.4 Inference Methods of Determining Congestion Window
The monitoring and diagnosis tool concentrates on finding the congestion window’s value. For
this purpose two solutions were designed. One is available only when running the application
on the sender’s host and uses a loadable kernel module. It will be described in a further section.
The second proposal is applicable when running the tool anywhere close to the sender’s host
machine and uses inference methods to accomplish its goal.
This application is able to find the TCP flavour implemented in the operating system as it
was described in section 3.4. It monitors all the important events for a TCP connection, such
as duplicate acknowledgements, retransmissions, and it keeps track of the congestion window
for every TCP congestion control flavour. Then, for every packet it monitors a counter that
indicates the number of violations per each flavour. The flavour with the lowest number of
violations is considered to be the one implemented in the operating system’s TCP/IP kernel
stack.
This mechanism was presented only for the three main TCP congestion control flavours: Tahoe,
Reno and NewReno. This project’s main contribution was implementing in a similar manner
a mechanism for inferring the congestion window for another commonly used flavour, which is
TCP Cubic. This module was integrated with the monitoring tool application and it simulates
the congestion control algorithm with respect of the changes observed in the network packets’
behaviour and it runs only in user space, thus reducing the overhead of reading data from kernel
space, like the other solution does.
4.4.1 Inferring congestion window for TCP Cubic
The simulation algorithm of TCP Cubic’s congestion window implemented is the same as the
one used in Linux1
. It bases on the idea that cwnd additive increase rate is a function depending
on time of the last notification of congestion, but also on the window size at the same moment
in time. It will be described and explained as follows.
First of all, the algorithm uses a modified slow start. The slow start threshold (ssthresh) will be
initialised to a value of 100 packets. When cwnd value exceeds ssthresh’s value, the algorithm
transcends from the normal slow start state to a new state where a smoother exponential in-
crease will occur. The cwnd will be increased by one packet for every 50 acknowledgements, or
doubles its value every 35 round-trip times. Another improvement represents the change in the
backoff factor from 0.5 to 0.8, in opposition with standard TCP algorithm. Cwnd is decreased
with this factor on every packet loss.
1 if (minimum_delay > 0) {
2 count = max(count, 8 * cwnd / (20 * minimum_delay));
3 }
As one may see in the listing above, minimum_delay is an estimate of the round-trip propaga-
tion delay of the flow. The additive increase rate is limited to be at most 20 ∗ minimum_delay
1D.J. Leith, R.N.Shorten, G.McCullagh, Experimental evaluation of Cubic-TCP

packets per round trip time. These may be interpreted as a cap on the increase rate of 20
packets per second independent of round trip time.
1 if (start_of_epoch == 0) {
2 point_of_origin = max(cwnd, last_maximum);
3 start_of_epoch = crt_time;
4 K = max(0.0, std::pow(b * (last_maximum - cwnd), 1/3.));
5 }
6
7 time = crt_time + minimum_delay - start_of_epoch;
8 target = point_of_origin + c * pow(time - K, 3);
9
10 if (target <= cwnd) {
11 count = 100 * cwnd;
12 } else {
13 count = cwnd / (target - cwnd);
14 }
In this code section, target - cwnd packets represents the additive increase rate per round trip
time. The purpose of this increase is to adjust cwnd to be equal to target during a single RTT.
The meaning of the variable is the following: t is the elapsed time since last backoff, added
with the value of minimum_delay and point_of_origin is a variable related to the cwnd at last
backoff. After backoff has occured, the cwnd value is 80% of its previous value.
1 if (cwnd >= last_maximum) {
2 last_maximum = cwnd;
3 } else {
4 last_maximum = 0.9 * cwnd;
5 }
6
7 cwnd_loss = cwnd;
8 cwnd = 0.8 * cwnd;
The value of point_of_origin will be adjusted based on the occurrence of last backoff in relation
with the time cwnd reached the previous point_of_origin value. It will be set equal to the
cwnd value right before backoff, when it is greater than the previous value of point_of_origin.
Otherwise point_of_origin is set equal to 90% of cwnd value before backoff.

4.5 Linux Loadable Kernel Module
Adding code to a Linux kernel may be done by including source files to the kernel source tree
and recompile the kernel, or it may be done while it is running. The pieces of code added in
the latter manner are called loadable kernel modules (LKM). They may have various purposes,
being commonly one on the list below:
1. device drivers
2. filesystem drivers
3. system calls
These functions are isolated by the kernel, so they are not wired into the rest of its code.
The loadable kernel module presented in this paper is a character device driver. It will be used
only when running the tool on the same host as the sender of the TCP connection. It also has
testing functionality, thus being able to compare the inferred state variables values with the
ones the Linux kernel knows about. Following the compilation phase a .ko file will be generated.
It is a file linked with some kernel automatically generated data structures needed by the kernel.
Insertion and deletion of the loadable kernel module will be accomplished through insmod and
rmmod commands.
The tcp_stats loadable kernel module that was implemented implemented was designed to
be used on a multipath TCP kernel implementation. If the user wants to change the Linux
distribution running on the host the LKM will be loaded on, it should first change the path to
the kernel sources in the Makefile, then recompile the module.
This LKM will implement the communication between user space and kernel space through a
set of ioctl functions. There were defined a couple of commands that may be used from user
space. Their main purpose is to help the kernel identify a specific TCP connection by sending
from user space the source and destination IP addresses and the destination port of the con-
nection.
1 /* ioctl command to pass src_address to tcp_stats driver */
2 #define MY_IOCTL_FILTER_SADDR _IOW(’k’, 1, unsigned int)
3 /* ioctl command to pass dest_address to tcp_stats driver */
4 #define MY_IOCTL_FILTER_DADDR _IOW(’k’, 2, unsigned int)
5 /* ioctl command to pass dest_port to tcp_stats driver */
6 #define MY_IOCTL_FILTER_DPORT _IOW(’k’, 3, unsigned)
7 /* ioctl command to read snd_cwnd from tcp_stats driver */
8 #define MY_IOCTL_READ_CWND _IOR(’k’, 4, unsigned long)
9 /* ioctl command to read rcv_rwnd from tcp_stats driver */
10 #define MY_IOCTL_READ_RWND _IOR(’k’, 5, unsigned long)
11 /* ioctl command to read snd_una from tcp_stats driver */
12 #define MY_IOCTL_READ_SRTT _IOR(’k’, 7, unsigned long)
13 /* ioctl command to read advmss from tcp_stats driver */
14 #define MY_IOCTL_READ_MSS _IOR(’k’, 8, unsigned long)
15 /* ioctl command to read window_clamp from tcp_stats driver */
16 #define MY_IOCTL_READ_SSTHRESH _IOR(’k’, 10, unsigned long)
Listing 4.1: ioctl user space commands
The user space buffer by which the data transfer is accomplished will be handled using copy_-
to_user or copy_from_user functions, that provide more protection. These functions check if
the provided address lies within the user portion of the address space, thus preventing userspace

applications from asking the kernel to write or read kernel addresses. If an addresses is inac-
cessible an error will be returned. They allow the error to be returned to userspace, instead of
crashing the kernel.
The character device driver’s main goal is to capture all the network packets using netfiler kernel
API. In Linux kernel, capturing packets using netfilter API is accomplished by attaching a series
of hooks. The netfilter hook used for capturing IP packets is applicable to all packets coming
from a local process. After setting the kernel internal variables related to the TCP connection
using ioctl functions, the Linux kernel module will be able to filter outgoing network packets
corresponding to that specified connection.
Every network packet that will be captured will trigger a netfilter hook hander. The hook will
be define using the nf_hook_ops structure. The second parameter of the handler’s prototype
is a structure struct sk_buff, that abstracts a network packet.
1 typedef unsigned int nf_hookfn(const struct nf_hook_ops *ops, struct
sk_buff *skb, const struct net_device *in, const struct
net_device *out, int (*okfn)(struct sk_buff *));
Listing 4.2: netfilter hook handler prototype
Having a reference to a network packet, it will be easy to obtain many information about the
headers of this packet, and more important than that it will be of great use in determining
internal metrics of the TCP connection that the kernel holds. A struct sk_buff 1
structure has
a series of fields that are of interest to our project and will be described below.
1 struct sk_buff {
2 struct sk_buff *next;
3 struct sk_buff *prev;
4 struct sock *sk;
5 union h;
6 union nh;
7 union mac;
8 unsigned char *data;
9 ...
10 };
Listing 4.3: fields of interest in sk_buff structure
next
Next network packet in list
prev
Previous network packet in list
sk
Socket that owns the network packet
h
Transport layer header
nh
Network layer header
1http://lxr.free-electrons.com/source/include/linux/skbuff.h

mac
Link layer header
data
Data head pointer
Now that the socket structure that owns the network packet was identified, it is simple to obtain
a reference to the TCP socket, which is the essential structure that keep metrics and statistics
of the TCP connection. Retrieving the structure that abstracts the TCP socket will be done
using the inline function listed below.
1 static inline struct tcp_sock *tcp_sk(const struct sock *sk)
2 {
3 return (struct tcp_sock *)sk;
4 }
Listing 4.4: inline function that returns a reference to tcp_sock structure
The Linux kernel store all the information regarding the state and important internal variable
of a TCP connection as fields inside the tcp_sock structure1
. The fields that are of great use
to the purpose of this project will be further presented.
1 struct tcp_sock {
2 ...
3 u32 rcv_nxt; /* What we want to receive next */
4 u32 snd_una; /* First byte we want an ack for */
5 u32 srtt; /* The value in usecs of smoothed
round trip time << 3 */
6 u32 packets_out; /* Packets which are "in flight" */
7 u16 advmss; /* Advertised MSS */
8 u32 window_clamp; /* Maximal window to advertise */
9 u32 snd_ssthresh; /* Slow start size threshold */
10 u32 snd_cwnd; /* Sender’s congestion window */
11 u32 rcv_wnd; /* Value of receiver’s window */
12 ...
13 }
Listing 4.5: struct tcp_sock important fields
Using ioctl functions it will be possible to transfer these information from kernel space to user
space, thus being able to find the kernel estimation of the TCP connection’s parameters. The
most important data we would like to obtain using this loadable kernel module is the cwnd and
the smoothed round trip time, to ensure the low error of the inferred methods estimation.
4.6 Sendbuffer Advertising Kernel Patch
The last part in the development process was finding an alternative mechanism of testing the
implemented application. For this purpose it was used a Linux kernel patch implemented
by Alexandru Agache from University Poiltehnica of Bucharest. This patch was designed for
multipath TCP kernel implementation and what it basically does is advertising the number of
bytes from the in-kernel buffer waiting to be sent, known also as the backlog.
1http://lxr.free-electrons.com/source/include/linux/tcp.h

After successfully applying the patch to the desired development branch of the kernel sources,
the multipath TCP kernel was recompiled and two new sysctl options were created. The first
option enables or disables the sendbuffer advertisement and the second specifies the number of
the TCP option that carries the information. The latter TCP option requires 8 bytes and it
will be added to every network packet if there is still room for it.
The setup process was not an easy task and many problems were encountered while trying to
make it work. First of all, the patch could not be applied for the last development branch of
the kernel, so a revert to the commit it was intended for was required. Afterwards, it was a
difficult process in creating a suitable configuration file for the kernel, because there was few
information regarding what dependencies sendbuffer advertising has. After all these steps were
completed, Qemu virtual machine was used along with the resulted kernel image to verify the
patch’s capabilities.
Testing the patch was the easy part. A virtual network interface was created on the host
machine to enable the communication with Qemu machine. After that, netcat tool was used to
start listening on a port on the host operating system and a large file was transferred via the
same tool from the virtual to the physical machine. Ensuring that everything works as expected
was done by using the tcpdump tool. Its outputs contained the new option number inside every
outgoing TCP packet along with the number of bytes of the backlog in hexadecimal format.
The final phase regarding sendbuffer advertising patch utilization consists of requiring access
to a small university cluster and getting familiar with it. After getting access to three physical
machines, they were used as a sender, a router and a receiver. Finally, testing the same scenario
was started as described above on this new infrastructure and a shell script was written to print
the tcpdump output into a file and parses it, then converts the number of bytes from hexadecimal
to decimal format.

Chapter 5
Testing
The testing phase was realised gradually. First, testing the measurements and statistics re-
garding a TCP connection was started on my personal computer, which is running a Linux
operating system. After implementing the loadable kernel module, it began the testing on a
Qemu virtual machine. The last part of this stage was obtaining relevant data, that could be
plotted, thus highlighting the validity of the results. This stage was accomplished using three
physical machines from a small cluster belonging to the university.
5.1 The Testing Environment Setup
Gathering useful data for realising a series of plots that describe the project’s results was
realised on three physical machines from the university’s testbed named GAINA. It consists of
ten high-end servers, each equipped with:
• a dual-port 10 Gigabit network adapter
• a quad-port Gigabit network adapter
The servers are connected to an 48-port HP Openﬂow switch and a 10 Gigabit IBM Switch1
.
Resources from this mini-cluster were used to create a custom testing environment. Its archi-
tecture will be described in the ﬁgure below.
Figure 5.1: Testing environment’s architecture
1https://systems.cs.pub.ro/testbed
25

CHAPTER 5. TESTING 26
As one can see, the physical machines we had access to were used as a mini-network. First, the
multipath TCP kernel sources was copied on the cluster’s filesystem. After that, the sendbuffer
advertising kernel patch was applied to these sources. A configuration file for the multipath
TCP kernel was created, then it was recompiled. The compilation was a success, and after en-
abling the sendbuffer advertising option, a new TCP option emerged when capturing outgoing
network packets using tcpdump tool.
1 18:50:49.583088 IP 10.0.1.1.59832 > 10.0.2.1.12345: Flags [.], seq
36024:37464, ack 1, win 229, options [nop,nop,TS val 4294832922
ecr 717377,nop,nop,unknown-211 0x00005fa0], length 1440
Listing 5.1: Example of tcpdump capture
The next step was configuring the computer acting as a router as a default gateway for the
computer acting as a sender. The same actions were considered for the computer acting as a
receiver, which was the default gateway for the router. On the computer placed in the middle of
the improvised network ip forwarding had to be enabled. For achieving this goal a configuration
script that assigns an IP address and default gateway to each network interface of this machine
was written.
1 #!/bin/bash
2
3 ifconfig eth13 up
4 ip a a 10.0.1.2/24 dev eth13
5 route add default gw 10.0.1.1 eth13
6
7 ifconfig eth10 up
8 ip a a 10.0.2.2/24 dev eth10
9 route add default gw 10.0.2.1 eth10
10
11 echo "1" > /proc/sys/net/ipv4/ip_forward
Listing 5.2: Configuration script for Computer9
Both links in this mini-network had 1 Gbps bandwidth. To simulate different situations, includ-
ing variance in detection point’s placement, dummynet tool was used for varying the bandwidth
and delay of TCP traffic on a specified network interface. Dummynet is an application that
simulates bandwidth limitations, packet losses, or delays, but also multipath effects. It may
be used both on the machine running the user’s application, or on external devices acting as
switches or routers.
1 insmod ipfw_mod.ko
2 ipfw add pipe 1 ip from any to any in via eth13
3 ipfw pipe 1 config bw 200Mbps delay 10ms
Listing 5.3: Example of dummynet tool usage on computer9

5.2 Tools
During the process of testing this project on the mini-cluster, a series of applications that helped
us accomplish this task were used. The most frequently used tool was iperf. It is an application
written in C widely used for network testing. It has the ability of creating User Datagram
Protocol (UDP) and Transmission Control Protocol (TCP) data streams and displays statistics
of the network that is carrying them. The example below presents the initialization of a data
stream transfer between a sender that tries to send data for 100 seconds to a receiver that has
a network interface with the IP address 10.0.1.2.
1 iperf -s
Listing 5.4: Iperf tool usage on computer acting as a receiver
1 iperf -c 10.0.1.2 -t 100
Listing 5.5: Iperf tool usage on computer acting as a sender
To simulate the application-limited situation cat and netcat tools were also used for reading
data from disk and sending them through a TCP network connection. Netcat is a networking
service for writing to and reading from network connections. This tool was useful both on the
sender and receiver side. On receiver side a server that listens on a specified port was started.
Sending data from sender to receiver was first realised by generating a large file using dd Linux
tool, then it was sent via netcat to the server’s port opened by the receiver. Examples of netcat
usage are presented below.
1 nc -l 12345
Listing 5.6: Starting a server listening on a specified port
1 dd if=/dev/zero of=1g.img bs=1 count=0 seek=1G
2 cat 1g.img | nc -s 10.0.1.1 10.0.2.1 12345
Listing 5.7: Sending a large file via netcat
All the line charts were realised using an online free plotting tool named plotly1
, that was very
useful and easy to use in representing the data obtained by running the tool over different
set-ups.
5.3 Error Evaluation
The first step in testing the tool on the improvised mini-network was ensuring the computed
metrics and statistics are correct. In achieving this goal the implemented Linux kernel module
was used, then a TCP connection was started as described in the section above. Another appli-
cation I implemented that sends to Linux kernel the parameters regarding the TCP connection
via ioctl calls was used. This kernel module intercept all network packets and gets on demand
information regarding metrics, statistics and state variables for that packets that match the
previously sent parameters.
1https://plot.ly/

The most important variables that dictates the evolution of a TCP connection are the smoothed
round trip time, the flight size and the congestion window. In this section it will be presented
the precision of the inferred values of these TCP connection’s variables when varying bandwidth
and delay in the testing environment. For this purpose two cumulative distribution function
(CDF) representation were used, one presenting the distribution of the absolute value of the
difference between the inferred values and the values Linux kernel knows about, and the other
presenting the distribution of the precision error. These representation were made starting from
the same data sets, but it is important to visualize them both, because it will provide a better
understanding about the evolution of the monitored variables given the network parameters’
variance.
Figure 5.2: Smoothed round trip time CDF representation for a delay of 1 ms; absolute value
of the difference (left), precision error (right)
First the CDF representations for smoothed round trip time (srtt) was plotted, when varying
the delay between computer6 - the sender and computer9 - the receiver. The experiment and
the estimation were conducted on the sender’s side. The delay variance is the most important
factor that causes precision error in estimating the srtt, varying the bandwidth will not influence
too much its value.
As one can see in these line charts the smoothed round trip time estimation error decreases
when introducing a bigger delay between the sender and the receiver, even if the the range of
absolute values of the difference between estimated and ground truth values is growing. It is
visible that the mean error decreases from 2% for 1 ms delay to approximately 1% for 10 ms
delay and then for a delay of 100 ms the error decreases under 0.5%, while the maximum error

decreases from 12% in the first case to 2.5% in the last one. The precision errors occur mainly
for small values of the srtt (under 1 ms), because there is an insignificant, but in this situation,
observable delay between the moment of determining the inferred value of this variable and the
moment of reading the srtt value estimated by the sender’s Linux kernel.
After the initial test phase of the smoothed round trip time, the delay and bandwidth between
computer6 and computer9 were varied, both having the same role as in the previous scenario,
and plotted some CDF representation of the flight size and congestion window. Representing
both, the distribution of the absolute value of the difference between inferred values and ground
truth, and the distribution of the error of this estimation, plays a very important part in
highlighting the evolution of the size and growth speed of these values.
This testing was done using dummynet and iperf tools for varying network’s parameters and to
create a client-server connection and send a TCP data stream. It is also important to mention
that the connection is not application limited.
Figure 5.5: Flight size CDF representation for a delay of 1 ms and a bandwidth of 10 Mb/s;
absolute value of the difference (left), precision error (right)
As one may see in the representation above, the mean error is 2.5-3%, the maximum error is
14% and the absolute value of the difference is less than 2 packets. These values let us conclude
that for small bandwidth we have small flight size values, and also the error is in a tolerable
range, due to the fast growth of the variable’s value. This last fact is possible because of the
small value of the round trip times, that causes the client to send more segments at a fast rate.

Figure 5.6: Flight size CDF representation for a delay of 10 ms and a bandwidth of 200 Mb/s;
In these two representations we have a comparison between the flight size error evolution when
using the same delay and varying the bandwidth from 200 Mb/s to 1Gb/s. It is visible that the
error will start growing once we increase the bandwidth. In the first scenario we have a mean
error of 1.5% and a maximum error of 5.5%, while in the second the error will grow a little,
reaching even 2% for mean error and 8% maximum error.
It may also be observe the huge difference between these two situations regarding the absolute
value of the difference. Considering the fact that the error rate in the second case is only twice
larger than the one in the first, and the absolute value of the difference between the inferred
values and the ground truth is ten times larger, we may conclude that the flight size will grow
to very big values while the bandwidth grows.
Figure 5.7: Flight size CDF representation for a delay of 10 ms and a bandwidth of 1 Gb/s;
Given the fact that the tested TCP connections were not application limited, the congestion
window values should have the same evolution as the flight size values. Below it is presented
the same test scenarios for congestion window. As one may see from these representations, the
error rate evolution retains trend, enlarging its value range once the bandwidth increases. For
a delay of 10 ms and a 200 Mb/s bandwidth the mean error will be 2.5% and the maximum
error will be 7.5%.

Figure 5.8: Congestion window CDF representation for a delay of 10 ms and a bandwidth of
200 Mb/s; absolute value of the difference (left), precision error (right)
Figure 5.9: Congestion window evolution over time for a 10 ms delay and a 200 Mb/s bandwidth
Another testing scenario was varying the detection point of the TCP connection. To accomplish
this task, all three machines were used, with computer6 as a sender, computer10 as a receiver
and computer9 as a router. The application was run only on the router and its variance in
relative distance to the sender was simulated using dummynet tool for delay insertion. First of
all, the delay between the sender and the receiver was set to 100 ms and then the delay between
router and sender was set to values of 5 ms, 10 ms and 20 ms.
The only way of testing the tool in the middle of the network was by using the sendbuffer
information from each incoming TCP segment. For this purpose the bandwidth of each link
was set to 200 Mb/s and then a greedy sender situation was simulated to test the error evolution
when varying the detection point. In this case the sendbuffer backlog information has a value in
the range 45-50 kbytes, while the flight size should be equal to the congestion window’s value.
The error is represented as the difference between the congestion window value and the flight
size relative to congestion window.

As one can see, in the testing scenario with a 5% distance between sender and detection point
relative to the total distance between sender and receiver, the results are correct in more than
90% of the cases. For 10% relative distance, the precision drops at a value of approximately
70%, while for 20% relative distance we have less than 65% precision.
Figure 5.10: Greedy sender detection error for a 5 ms delay between the sender and the detection
point; absolute value of the diﬀerence (left), precision error (right), sendbuﬀer evolution over
time corresponding to the tested scenario (center)

Chapter 6
Conclusion and Further
Development
The monitoring tool presented in this paper is a very robust and simple traffic engineering
application that may be of great use for cloud providers that are interested in debugging prob-
lematic TCP connections. It will also help network application developers who want to have
more in-depth information regarding the root cause of a client-server troublesome data trans-
fer. This tool is good enough in detecting application-limited or network bound situations when
running it on the same machine as the sender is run on, or on another machine located close
to the sender’s host.
The essential TCP connection variables that strongly influence its behaviour are the round trip
time, the flight size and the congestion window. Running the application at a considerable
distance relative to the sender’s position will result in a loss of estimation precision of the above
mentioned variables. Combining socket-level logs with inferred methods of determining these
values turned to be a good idea. If the user will run the tool on the sender’s host it may choose
to use the loadable Linux kernel module and obtain the ground truth, but it will incur overhead
due to the ioctl calls and data transfer between kernel and userspace application. In order to
reduce overhead, the estimation of these values may be also used, achieving a very small error
rate when running the application in this point. This loadable Linux kernel module was also
used as an error detection testing tool.
Sendbuffer advertisement was another testing support. It is very reasonable to conclude that
it is the most feasible method of determining application-limited and greedy sender situation
in any point of a network path.
The application may suffer many improvements. The first step would be to create a very
intuitive and eye-candy graphical user interface that will ease the process of inserting the Linux
kernel module and running the tool from command line and passing a considerable number of
arguments to it. The testing process may also be integrated in this user interface, letting the
user to start a client-server TCP connection.
Another great improvement that may be accomplished would be implementing the detection of
connection variables for different less commonly used TCP flavours using pattern recognition
and machine learning algorithms. It would be a very useful feature for computer programmers
that try to debug their applications running on hosts that have network stacks with custom
implementations of congestion control algorithms.
33

Bibliography
[1] W. Richard Stevens, Kevin R. Fall, TCP/IP Illustrated vol. 1
[2] Joao Taveira Araujo, Raul Landa, Richard G. Clegg, George Pavlou, Kensuke Fukuda, A
longitudinal analysis of Internet rate limitations.
[3] Injong Rhee, and Lisong Xu, CUBIC: A New TCP-Friendly High-Speed TCP Variant.
[4] Alexandru Agache, Costin Raiciu, Oh Flow, Are Thou Happy? TCP sendbuffer advertising
for make benefit of clouds and tenants.
[5] RINC Architecture, Mojgan Ghashemi, Theophilus Benson, Jennifer Rexford, RINC: Real-
Time Inference-based Network Diagnosis in the Cloud.
[6] Sharad Jaiswal, Gianluca Iannaccone, Christophe Diot, Jim Kurose, Don Towsley, Inferring
TCP Connection Characteristics Through Passive Measurements.
[7] D.J.Leith, R.N.Shorten, G.McCullagh, Experimental evaluation of Cubic-TCP.
[8] M. Yu, A. Greenberg, D. Maltz, J. Rexford, L. Yuan, S. Kandula, and C. Kim. Profiling
network performance for multi-tier data center applications, 2011.
[9] M. Allman, V. Paxson, Request for Comments: 5681, Sep. 2009.
[10] K. Claffy, Greg Miller, and Kevin Thompson. The Nature of the Beast: Recent Traffic
Measurements from an Internet Backbone. In Proceedings of INET, 1998.
[11] Chuck Fraleigh and et. al. Packet-Level Traffic Measurements from a Tier-1 IP Backbone.
ATL Technical Report TR01-ATL-110101, Sprint, Nov. 2001.
[12] Sean McCreary and Kc Claffy. Trends in Wide Area IP Traffic Patterns: A View from Ames
Internet Exchange. In Proceedings of the 13th ITC Specialist Seminar on Measurement and
Modeling of IP Traffic, Monterey, CA, Jan. 2000.
34

thesis

Recommended

Recommended

More Related Content

What's hot

What's hot (18)

Viewers also liked

Viewers also liked (6)

Similar to thesis

Similar to thesis (20)

thesis