Strategies and Metric for Resilience in Computer Networks

The Computer Journal, Advance Access published October 19, 2011. © The Author 2011. Published by Oxford University Press on behalf of The British Computer Society. All rights reserved. doi:10.1093/comjnl/bxr110

Ronaldo M. Salles* and Donato A. Marino Jr.

Instituto Militar de Engenharia, Seção de Engenharia de Computação, Praça Gen. Tibúrcio 80, Rio de Janeiro, RJ 22290-270, Brazil
*Corresponding author: salles@ieee.org

ABSTRACT. The use of the Internet for business-critical and real-time services is growing day after day. Random node (link) failures and targeted attacks against the network affect all types of traffic, but mainly critical services. For these services, most of the time it is not possible to wait for the complete network recovery; the best approach is to act in a proactive way by improving redundancy and network robustness. In this paper, we study network resilience and propose a resilience factor to measure the network level of robustness and protection against targeted attacks. We also propose strategies to improve resilience by simple alterations in the network topology. Our proposal is compared with previous approaches, and experimental results on selected network topologies confirmed the effectiveness of the approach.

Keywords: network resilience; network topology; graph theory; k-connectivity; targeted attacks and failures

Received 29 April 2011; revised 29 July 2011
Handling editor: Ing-Ray Chen

1. INTRODUCTION

Today, computer networks are highly complex heterogeneous systems. This is mostly due to the exponential growth of the Internet and its use as a convergence medium for all sorts of traffic and applications. There is a common perception in the field that engineers and researchers have the right knowledge to design and operate such complex networks, but they still lack a thorough understanding of system behaviour under stress conditions and anomalies. Providing fault tolerance and prompt attack recovery capabilities to networks is still a matter for further investigation.

Random node/link failures and targeted attacks against the network affect all types of traffic, but mainly critical services (e.g. e-commerce and e-government in the civilian sphere; command and control data transmission in military tactical operations). For these services, most of the time it is not possible to wait for complete network recovery; the best approach is to act in a proactive way by improving redundancy and network robustness beforehand.

Network resilience against failures and attacks relies mostly on topology redundancy and node connectivity. For instance, if a single link failure disconnects network nodes, it is implied that the network is not robust enough, given that redundancy is weak. On the other hand, assuming a full-mesh topology where each node is connected to all others, if a given set of nodes is destroyed by a targeted attack, the remaining nodes are still connected and may continue to communicate with each other. It is therefore important to provide a way to quantify this notion in order to evaluate the resilience capacity of computer networks, mainly the ones that operate in critical scenarios.

Regarding the Internet, the authors in [1] showed that although it is susceptible to both random failures and malicious attacks, the latter problem can cause greater damage to the network due to the particular Internet topological characteristics [2]. A targeted attack on a highly connected node (hub) can severely degrade performance, disconnect a whole region or isolate some network section from crucial services.

An important aspect to consider is that nodes may have different roles in the network. Some nodes are central, acting as network hubs; if they fail (or are destroyed), there will be a considerable impact on the network, since the network depends heavily on them. Other, peripheral nodes may not have such an impact if they are put out of service. Social network metrics are useful to determine the degree of centrality of each network node.

Hence, strategies to improve network resilience should not only consider redundancy by adding extra links between nodes but also try to reduce network dependency on some central nodes. Resilience strategies may be applied in the design of new network topologies or in the modification of existing ones.
The contribution of this paper is twofold:

(i) The resilience factor is proposed to quantify the degree of resilience of a given network topology. It is based on the k-connectivity property of a graph;
(ii) Two strategies are proposed to improve network resilience: Proposed preferential Addition (PropAdd) and Proposed preferential Rewiring (PropRew). The strategies use social network centrality metrics and employ link addition and rewiring.

The remainder of this paper is organized as follows. In Section 2 related works on network resilience are discussed. Section 3 reviews some important concepts applied in our proposal. Section 4 presents the resilience factor along with some numerical results. The proposed strategies are presented in Section 5, as well as numerical results comparing them with similar approaches. Finally, the work is concluded in Section 6 and a list of references cited in this paper is presented.

2. NETWORK RESILIENCE

Resilience is a broad term used to study several different types of systems and networks, ranging from socioeconomic, finance and even terrorist networks [3] to computer networks. In [4] the authors comment that resilience is an important service primitive for various computer systems and networks, but its quantification has not been done well.

According to [5], network resilience is defined as the ability for an entity to tolerate (resist and autonomically recover from) three types of severe impacts on the network and its applications: challenging network conditions, coordinated attacks and traffic anomalies.

Challenging network conditions occur, for instance, in dynamic and hostile environments where nodes are weakly and episodically connected by wireless links, given their high mobility and topography conditions.

Coordinated attacks can be logical or physical. In the first case, the main targets are network protocols and services; such attacks are typically classified as denial of service attacks (DoS or distributed DoS) [6, 7]. Physical attacks consist of network infrastructure destruction by the enemy in a war operation, terrorism or even natural disasters.

Traffic anomalies are any kind of unpredictable behaviour or failure that severely impacts network services, especially mission-critical applications.

Thus, network resilience is a broad topic of research and may include or be related to robustness, survivability, network recovery, fault and disruption tolerance. The next subsections present the context on which this work focuses.

2.1. Related work

One of the first works that studied network resilience and presented a measure to evaluate network fault tolerance was due to [8]. The measure was defined as the number of faults that a network may suffer before being disconnected. The authors computed an analytical approximation of the probability of the network becoming disconnected and validated their proposal using Monte Carlo simulation results. The simulation scenario employed three particular classes of graphs to represent network topologies: cube-connected cycles, torus and n-binary cubes; all of them are symmetric and have fixed node degrees.

Percolation theory has also been employed to characterize network robustness and fragility. The idea is to determine a certain threshold, pc, representing a fraction of network nodes and their connections, such that when this fraction is removed the integrity of the network is compromised: the network disintegrates into smaller, disconnected parts [9]. Below such a critical threshold, disconnections may occur but there still exists a large connected component spanning most of the network. A network with a higher percolation threshold is preferred in terms of resilience, since it will be more difficult to break. Other works in the literature also studied resilience under a percolation perspective; for instance, in [10] percolation was used to characterize Internet resilience.

The main goal of the work in [11] is to quantify network resilience so that it is possible to compare different networks according to this property. The authors used as resilience metric the percentage of traffic loss due to network failures; they also considered a scalability parameter with respect to network size, fault probability and network traffic volume. Resilience was evaluated taking into account uniform traffic patterns and dependent or independent link failures, with or without protection. Input traffic is modelled by Poisson processes. The authors concluded that complete topologies (mesh) are the most resilient; and considering regular topologies with the same number of nodes and links, Moore graph topologies presented a better performance.

The work in [8], and later [11], presented resilience metrics based on probability computations, using particular environments with uniform traffic patterns and regular network topologies. It is important to evaluate resilience in more realistic scenarios, considering for instance real network topologies.

Another important work that studies resilience in packet-switched networks is [12]. The authors pointed out that some network failure combinations may lead to loss of ingress–egress connectivity or to severe congestion due to rerouted traffic. They provided a thorough framework to analyse resilience according to failures and changed interdomain routing of traffic. The framework is based on the calculation of the border-to-border availability of a network and the complementary cumulative distribution function of the link load depending on network failures and traffic fluctuations.
Dekker and Colbert [13] associated the concept of resilience with the capacity of a given network to tolerate attacks and faults on nodes. The work focused on network topologies and studied connectivity and symmetry properties of the corresponding graphs. They also considered other metrics in their evaluation, such as link connectivity, distance between nodes, graph regularity (degree distribution of nodes), etc. Network topologies were divided and studied in groups: Cayley graphs, random graphs, two-level networks and scale-free networks. Their main conclusion is that to reach a good level of resilience, the network should have a high degree and low diameter, be symmetric and contain no large subrings. However, they did not evaluate realistic networks in terms of resilience, nor did they propose ways to improve their robustness.

Resilience of networks against attacks is also studied in [14], where the authors modelled the cost to the attacker of disabling a node as a function of the node degree. Depending on this cost function, a certain type of network (Poissonian, power law, etc.) may become easier (harder) to crack, i.e. less (more) resilient. Such a cost function may depend on the particular scenario under study and may also be difficult to determine.

The work in [15] considered three important metrics to evaluate network resilience with respect to random failures and targeted attacks. The first one is the largest connected component (LCC), which gives the size of the largest subgraph that still remains connected after the network is attacked or disconnected by failures. The second metric studied was the average shortest path length (ASPL), which varies according to topology alterations. In fact, the authors used the average inverse shortest path length (AISPL) to avoid numerical problems when nodes become disconnected. Finally, the network diameter was also considered to assess network robustness.

2.2. Research problems

Related works provided a glimpse of the vast literature on the topic of network resilience; they also showed different approaches to the problem. In this subsection, we direct our attention to some points that deserve further investigation and constitute the focus of this work.

First of all, it has been shown that current communication networks have topological weaknesses that seriously reduce their attack survivability [1, 2]. Such weaknesses could be exploited by malicious agents through the execution of directed attacks. This work studies topological aspects of the network and focuses on resilience against targeted attacks on nodes.

Another observation is that most previous works concentrate on synthetic topologies for analysis, and some used these topologies to test their proposals. We realize that a more practical and direct approach was missing and could add value to the problem. Hence, our approach focuses on working directly with real topologies of network domains, although we illustrate in some parts of the text how our approach works for some regular and uniform topologies (full mesh, line).

Both [13, 15] worked with scale-free topologies, which present heavy-tailed degree distributions following a power law. The Internet and other large-scale networks (biological, social, etc.) have been shown to exhibit such a property [2, 16, 17]. However, network operators and ISPs are most interested in studying the resilience of their own backbones and domains, which are not necessarily modelled by scale-free graphs. This paper deals with this issue and investigates network resilience considering realistic ISP topologies.

The work in [15] adopted three different metrics (LCC, ASPL and diameter) to represent resilience; however, we argue that those metrics are not consistent in several cases. For instance, ASPL and diameter are inconsistent when the network gets disconnected, while the LCC does not give any information about the remaining subgraphs. This work investigates this matter and proposes a new metric, the resilience factor RF, that does not suffer from those problems.

Finally, [15] also proposed strategies to improve the resilience of a given network; however, they based the strategies on random link additions and rewirings. We believe that more sophisticated approaches may achieve a better performance. In fact, this paper proposes novel strategies based on graph analysis and node qualification (centrality metrics [18]). Such strategies provide a better rationale for link additions and rewirings, reflected in improved performance.

3. BASIC CONCEPTS

The network topology is modelled by a graph G = (V, E), where V is the set of vertices or network nodes (routers, switches or any other infrastructure equipment) and E is the set of edges or network links (fibre, cable, wireless, etc.) connecting two nodes. A graph G can be represented by the connectivity matrix M of size n × n, where n = |V|. Each element of M is 1 if there is a corresponding link in E; otherwise it is 0.

The degree of a node v, d(v), is the number of links connecting the node: \( d(v) = \sum_{i \in V,\, i \neq v} M(v, i) \). For undirected graphs, M is symmetric and d(v) could also be defined in terms of M(i, v). Note that this is the case considered in this paper; however, directed graphs can also be used without loss of generality.

The average degree of a network topology is given by

\[ d(G) = \frac{1}{n} \sum_{v \in V} d(v) \qquad (1) \]

and the degree distribution by

\[ P(k) = \frac{n(k)}{n}, \qquad (2) \]

where n(k) is the number of elements v ∈ V such that d(v) = k. For instance, in scale-free networks Equation (2) follows a power-law distribution, i.e. \( P(k) \sim k^{-\gamma} \), where 2 < γ < 3 for most real networks [19].
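To make the preceding definitions concrete, the short sketch below computes node degrees, the average degree of Equation (1) and the degree distribution of Equation (2) directly from a connectivity matrix. It is an illustration only (the paper prescribes no implementation), and the 4-node line topology is hypothetical.

```python
from collections import Counter

# Hypothetical 4-node line topology 0 - 1 - 2 - 3 given as a
# connectivity matrix M (symmetric, since the graph is undirected).
M = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
n = len(M)

# Degree of each node: d(v) is the sum of row v of M.
d = [sum(row) for row in M]

# Average degree, Equation (1).
avg_degree = sum(d) / n

# Degree distribution, Equation (2): P(k) = n(k) / n.
counts = Counter(d)
P = {k: counts[k] / n for k in sorted(counts)}

print(d, avg_degree, P)  # [1, 2, 2, 1] 1.5 {1: 0.5, 2: 0.5}
```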
The distance between nodes is also an important property to be studied. The geodesic path g(u, v), where u, v ∈ V, is defined as the number of edges (links) in the shortest path connecting u to v. From this concept the network diameter (D) is defined as the largest geodesic path in the network,

\[ D = \max_{u,v \in V} g(u, v). \qquad (3) \]

A low D may indicate a redundant and robust topology. For instance, D = 1 in a full-mesh network and D = n − 1 in a line network.

Another important concept related to paths is the average shortest path length (ASPL), which is given by

\[ \mathrm{ASPL} = \frac{2 \sum_{u,v \in V} g(u, v)}{n(n-1)}. \qquad (4) \]

It is also common to consider the inverse parameter, AISPL, since in the study of resilience the network may become disconnected, yielding some g(u, v) = ∞:

\[ \mathrm{AISPL} = \frac{2 \sum_{u,v \in V} 1/g(u, v)}{n(n-1)}. \qquad (5) \]

Connectivity is also a fundamental property of a graph, especially in the study of network resilience. A graph is said to be connected if there is a path between any two nodes in V. If the graph is not connected, the size of the largest connected subgraph is usually considered instead. The LCC is computed using breadth-first search or depth-first search techniques [20]. Note that in case the graph is connected, the LCC is the graph itself. The LCC is usually given as the diameter of the largest connected subgraph.

One of the properties of a graph that is most closely related to robustness, fault tolerance and redundancy is k-connectivity [21]. Several works in the literature employ k-connectivity to represent resilience in different applications and network scenarios [22–24]. This concept is based on Menger's theorem [25]: let G = (V, E) be a graph and u and v be two non-adjacent vertices; then the minimum cut for u and v is equal to the maximum number of disjoint paths connecting u and v. According to [26], k-connectivity can be defined as follows.

Definition 3.1. Let G be a k-connected graph; then, for any two vertices u and v, there are at least k vertex-disjoint paths between u and v.

A direct implication of the above definition¹ is that a graph is k-connected when, removing any k − 1 of its vertices, the resulting subgraph remains connected. This property is usually related to network resilience since it indicates topology tolerance to faults and/or attacks on nodes. Algorithms to compute the k-connectivity of a graph are based on min-cut max-flow theorems [27, 28].

¹ There is also a similar definition considering edges (k-edge-connectivity) instead of vertices (k-vertex-connectivity). However, in this work we focus on k-vertex-connectivity given its closer relation to attacks on nodes.
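The path and connectivity metrics above are easy to evaluate with a graph library; the sketch below uses networkx, which is an assumption of this illustration and not a tool prescribed by the paper. AISPL is computed by hand so that disconnected pairs simply contribute 1/∞ = 0, as Equation (5) intends.

```python
import itertools
import networkx as nx

G = nx.cycle_graph(5)  # hypothetical example: a 5-node ring
n = G.number_of_nodes()
lengths = dict(nx.all_pairs_shortest_path_length(G))

# Diameter, Equation (3): the largest geodesic distance.
D = max(lengths[u][v] for u in G for v in lengths[u])

# AISPL, Equation (5): pairs with no path contribute 1/inf = 0.
total = sum(1.0 / lengths[u][v]
            for u, v in itertools.combinations(G.nodes, 2)
            if v in lengths[u])
aispl = 2 * total / (n * (n - 1))

# k-connectivity: minimum number of node removals that disconnect G.
k = nx.node_connectivity(G)

print(D, aispl, k)  # ring of 5 nodes: D = 2, AISPL = 0.75, k = 2
```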
Regarding network nodes, centrality metrics play an essential role in characterizing the relative importance of each node in the topology. These metrics are commonly employed in the theory of social networks [29]. In general terms, a central node can be seen as a popular actor in the social topology.

In the case studied in this paper, a node with much higher centrality than all others may constitute a problem for the network in terms of resilience. An attack or failure on this node may disconnect the network; thus it is important to study centrality and investigate how it relates to network resilience.

The first and simplest measure of centrality is known as Degree Centrality (DC). It is simply defined as the degree of a node; in fact it has already been presented above as d(v). If a given node v has d(v) = 1, there is no further implication for the network in terms of resilience, since this node is a network leaf. On the other hand, if d(v) is high, v can be considered an important node for the connectivity of the network.

The second measure of centrality considered in this work is Closeness Centrality (CC), which defines how close the node is to the centre of the network. It is computed from the sum of distances between the node and all others. The lower the sum, the closer the node is to the centre; in other words, it requires fewer intermediate nodes to reach all other nodes. Equation (6) defines CC for node u [18]:

\[ \mathrm{CC}(u) = \Bigl[ \sum_{v \in V,\, v \neq u} g(u, v) \Bigr]^{-1}. \qquad (6) \]

The third and last centrality measure studied in this paper is Betweenness Centrality (BC), which defines how often a node takes part in the geodesic paths between all other nodes. Nodes that occur on many shortest paths between other nodes have higher betweenness than those that do not. Equation (7) defines BC for node u [18]:

\[ \mathrm{BC}(u) = \sum_{\substack{j,k \in V \\ j < k}} \frac{\sigma_u(j, k)}{\sigma(j, k)}, \qquad (7) \]

where σ(j, k) is the number of shortest paths from j to k, and σ_u(j, k) is the number of shortest paths from j to k that pass through node u.
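The three measures are standard and available in common graph libraries; the sketch below again assumes networkx. Note that networkx normalizes closeness and betweenness by default, whereas Equations (6) and (7) are unnormalized, so CC is computed by hand and betweenness is requested unnormalized.

```python
import networkx as nx

# Hypothetical example: a small graph with one clear hub (node 0).
G = nx.Graph([(0, 1), (0, 2), (0, 3), (1, 2)])

# Degree Centrality: DC(v) is simply d(v).
dc = dict(G.degree())

# Closeness Centrality, Equation (6): the inverse of the sum of
# geodesic distances from u to all other nodes (unnormalized).
cc = {}
for u in G:
    dist = nx.shortest_path_length(G, source=u)
    cc[u] = 1.0 / sum(dist[v] for v in G if v != u)

# Betweenness Centrality, Equation (7): normalized=False matches the
# unnormalized sum over node pairs j < k used in the text.
bc = nx.betweenness_centrality(G, normalized=False)

print(dc[0], cc[0], bc[0])  # hub node: DC = 3, CC = 1/3, BC = 2.0
```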
4. THE RESILIENCE FACTOR

From the study of previous works in the literature, it can be said that connectivity plays a major role regarding network resilience and should be considered in the construction of the resilience factor. The analysis of attacks over networks, either physical or cyber attacks, shows that the topology may become disconnected, and so high node connectivity is desired for the topology as a proactive protection [13, 30–32].

The principal metric that describes connectivity is k-connectivity; however, other metrics such as AISPL and LCC (see Section 3) have also been proposed in previous works and will be discussed and analysed in this section.

One key observation towards our proposal is that, by analysing most commercial network backbones, it can be verified that such topologies are 2-connected (or 3-connected); i.e. if a node (or any pair of nodes) is put out of service, the network continues to operate, maintaining the remaining nodes connected.

The question that follows is whether all those topologies have the same level of resilience; in other words, how can two different 2-connected topologies be compared in terms of resilience?

4.1. Proposal

The k-connectivity property of a graph is used as a basis for the proposed resilience factor of network topologies. However, as posed by the above question, our proposal is to specialize this property to better express the notion of resilience. We propose the use of the partial k-connectivity property of a graph to represent the resilience factor.

The idea behind partial k-connectivity is explained as follows. Assume there are two different 2-connected network topologies, T1 and T2, with the same number of nodes. Since they are 2-connected, any node can be deleted without causing disconnections, but there is at least one pair of nodes that, if deleted, disconnects the topologies; they are not 3-connected. However, assume that T1 has just a single pair of nodes that when removed disconnects the topology, and T2 has two different pairs of nodes that cannot be deleted (one pair at a time); otherwise T2 gets disconnected.

Although both T1 and T2 are not 3-connected, T1 is better than T2 in terms of 3-connectivity, since there is just one case (one pair of nodes) capable of disconnecting it, while in T2 the chances are twice as high (two pairs of nodes). It is therefore natural to consider that T2 is somehow less resilient than T1. We consolidate this idea in the resilience factor (RF) below:

\[ \mathrm{RF} = \frac{\sum_{i=2}^{n-1} k(i)}{n - 2}, \qquad (8) \]

where n is the number of nodes in the topology and k(i) is the percentage of node combinations that guarantees partial i-connectivity. It is assumed that all networks considered are 1-connected and n-connected; thus these cases are excluded from the computations.

Note that for a line network topology RF = 0 and for a full-mesh topology RF = 1 (100%); all other arrangements fall in between these two cases. This conforms to the idea that a line topology presents very poor levels of resilience while a full-mesh topology enjoys full resilience, which makes RF consistent in this sense. Another important aspect is that the use of percentages gives the factor a certain independence from the number of nodes in the topology.

Figure 1 illustrates the computation of RF for the network topology in Fig. 1a. It can be seen from Fig. 1b that the topology is 2-connected, since if a single node fails or is put out of service the remaining topology is still connected; in other words, there are at least two disjoint paths connecting any pair of nodes. According to our notation, k(2) = 1 in this case.

Figure 1c checks the topology for 3-connectivity. The only case that fails is when nodes 1 and 3 are deleted at the same time, and node 2 gets disconnected from the remaining nodes (second subgraph). Hence, the topology is not 3-connected. However, this is the only case out of 10 possibilities, and such information should be taken into account for resilience matters. This is captured by partial 3-connectivity, k(3) = 0.9. Then, in Fig. 1d, k(4) is computed. Figures 1e and 1a are extreme cases (assumed to be connected) and are not considered in the RF as described in (8): RF = (1 + 0.90 + 0.7)/3 = 0.8666 (86.66%).

FIGURE 1. Example of the computation of the resilience factor RF. (a) Network topology. (b) Subgraphs with one node deleted, k(2) = 5/5 = 1. (c) Subgraphs with two nodes deleted, k(3) = 9/10 = 0.9. (d) Subgraphs with three nodes deleted, k(4) = 7/10 = 0.7. (e) Subgraphs with four nodes deleted.
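Equation (8) can be checked by brute force exactly as the Fig. 1 example does: for each removal size, delete every combination of nodes and test whether the remainder stays connected. The sketch below is one possible reading of that procedure (using networkx, an assumed tool rather than the authors' implementation); it reproduces the full-mesh extreme RF = 1 and the style of the worked example.

```python
import itertools
import networkx as nx

def resilience_factor(G):
    """Brute-force RF of Equation (8); exponential in n (Section 4.3).

    k(i) is read here as the fraction of (i - 1)-node removals that
    leave the remaining subgraph connected, following the Fig. 1 example.
    """
    nodes = list(G.nodes)
    n = len(nodes)
    k = []
    for i in range(2, n):  # partial i-connectivity, i = 2 .. n - 1
        kept = total = 0
        for removed in itertools.combinations(nodes, i - 1):
            H = G.copy()
            H.remove_nodes_from(removed)
            kept += nx.is_connected(H)
            total += 1
        k.append(kept / total)
    return sum(k) / (n - 2)

print(resilience_factor(nx.complete_graph(5)))  # full mesh: 1.0
print(resilience_factor(nx.cycle_graph(5)))     # ring: (1 + 0.5 + 0.5)/3
```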
The next subsection compares the proposed resilience factor with other resilience metrics and verifies whether the proposal is consistent.

4.2. Numerical results

In this subsection, we evaluate the proposed resilience factor (RF) and compare its results with other well-known metrics (AISPL and diameter) already discussed in previous sections.

Results are obtained for three real network topologies: Telcordia (Fig. 2), Cost239 (Fig. 3) and UKnet (Fig. 4). The topologies were used to represent typical network domains, as we wish to consider real networks in our study to be closer to practical scenarios. The topologies differ in number of nodes, links, layout and degree distribution; we believe they reasonably cover the scenario we wanted to study.

FIGURE 2. Telcordia/Bellcore topology (based on the New Jersey commercial backbone): 15 nodes, 28 links (bidirectional), degree distribution [2(13%), 3(47%), 4(13%), 5(13%), 6(13%)].

FIGURE 3. Cost239 topology (see Ref. [33]): 19 nodes, 40 links (bidirectional), degree distribution [3(21%), 4(47%), 5(21%), 6(11%)].

FIGURE 4. UKnet topology (based on the UK national backbone): 30 nodes, 52 links (bidirectional), degree distribution [2(20%), 3(40%), 4(23%), 5(13%), 6(3%)].

The letters 'B', 'C' and 'D' in the topologies mark the highest BC, CC and DC nodes, respectively, while the number '2' is used when the same node is both the highest CC and BC node (Fig. 3) or both the highest BC and DC node (Fig. 4).

Figures 5–7 show the results obtained for RF against AISPL (as in Equation (5)) and diameter (as in Equation (3)) for the Telcordia, Cost239 and UKnet topologies, respectively. In order to display the three metrics on the same scale, the diameter was normalized (Diameter*); it is important to observe the variations of this parameter for each situation.

FIGURE 5. RF, AISPL and diameter comparison for the Telcordia topology.
The topologies were subjected to seven different situations: original topology, highest DC node removed, highest 10% DC nodes removed, highest CC node removed, highest 10% CC nodes removed, highest BC node removed and highest 10% BC nodes removed, indicating different anomalies in the topologies. RF, AISPL and Diameter* were computed for each situation and the results displayed in bar graphs.

FIGURE 6. RF, AISPL and diameter comparison for the Cost239 topology.

FIGURE 7. RF, AISPL and diameter comparison for the UKnet topology.

The first black bar in the graphs represents the resilience of the original topology according to the three different metrics. It can be observed that the RF and AISPL results are close in all of Figs 5–7. In fact, the RF and AISPL results are of the same magnitude in all situations; however, RF provides greater discrimination between different situations. In Fig. 5 it can be seen that although the RF absolute results are closer to AISPL, the RF variations followed the Diameter* profile more closely. Similar behaviour is also observed for the other topologies in Figs 6 and 7. This suggests that RF enjoys important properties of both the AISPL and diameter metrics.

Another important result that emerges from the graphs is that the Cost239 topology is more resilient than the other two topologies. In Fig. 6, the RF results are higher for all seven situations when compared with Fig. 5 (Telcordia) and Fig. 7 (UKnet). This was somewhat expected given the more homogeneous layout of Cost239, and it is confirmed by the degree distribution of nodes: looking at the Fig. 3 caption, Cost239 has no node of degree 2 and a higher percentage of degree 4, 5 and 6 nodes. Also, it does not suffer severely from the removal of nodes; in Fig. 6, results within the same metric are close for all situations.

Regarding Telcordia and UKnet, a different behaviour is observed. While the Telcordia original topology is more resilient than UKnet, it suffers a greater impact when nodes are removed. RF results indicate that when the 10% highest BC or DC nodes are removed, the resulting subgraphs are even less resilient than the UKnet subgraphs.

Finally, it is important to note that in all situations presented, the topologies and their corresponding subgraphs (after node removals) remained connected, i.e. there was always a path connecting any pair of remaining nodes. This was necessary to keep the AISPL and diameter metrics consistent and to ensure a fair comparison with RF. Both AISPL and diameter are based on shortest path computation; as nodes are removed from the topology, paths grow in size, providing a steady behaviour for the metrics (either they increase or decrease). However, when a node removal produces a disconnected topology (for instance, dividing the nodes into two disconnected subgraphs), there is no longer any sense in computing shortest paths. This constitutes a serious drawback for most metrics, but it does not affect RF, since RF is based on connectivity properties.

4.3. Discussion concerning RF

The first important issue to discuss about RF is the complexity of its computation. It is necessary to evaluate all possible combinations of node removals (excluding the trivial cases: no removals and (n − 1) node removals) to obtain RF, which requires \( 2^n - (n + 1) \) tests. However, there are some considerations that make this calculation feasible for the problems of interest.

The resilience factor was primarily proposed to assess the resilience of a single network domain against attacks. Thus the number of nodes, n, is constrained by the size of real network domains, which we consider to be below 40 in general (Figs 2–4 illustrate some network domains used in our study). For a very large network that comprises several domains, our approach could be applied by taking each domain separately, one at a time, and then the backbone that connects all domains.
Another important aspect concerning the computation of RF is that it is obtained off-line, before a problem occurs. As already mentioned, our study focuses on proactive ways to improve resilience and not on how to react to a network anomaly condition. Hence, there is no strict restriction on the computational time to return RF, provided it is viable.

In the particular case where a single network domain is very large, making the computation of RF not viable in practice, we can still apply our approach with the following small modification. Instead of considering all node combinations, which in this case may be prohibitive, RF can be computed using only an initial set of combinations, say combinations of up to 10% of n. It is important to realize that what really impacts the calculation of RF are the middle cases, where the number of combinations \( \binom{n}{k} \) is maximum, for k around n/2 (Fig. 1b and c in the example).
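A sketch of this truncated variant is given below, limiting removal sets to a fraction of n. The function name, and the choice to average only over the k(i) actually evaluated, are illustrative assumptions of this sketch rather than the authors' specification.

```python
import itertools
import networkx as nx

def truncated_rf(G, max_fraction=0.1):
    """Approximate RF using removal sets of size <= max_fraction * n.

    Skips the middle binomial terms (k around n/2) that dominate the
    2**n - (n + 1) tests of the exact computation; averages only the
    k(i) actually evaluated (an assumption of this sketch).
    """
    nodes = list(G.nodes)
    n = len(nodes)
    limit = max(1, int(max_fraction * n))
    k = []
    for size in range(1, limit + 1):  # size = i - 1 removed nodes
        kept = total = 0
        for removed in itertools.combinations(nodes, size):
            H = G.copy()
            H.remove_nodes_from(removed)
            kept += nx.is_connected(H)
            total += 1
        k.append(kept / total)
    return sum(k) / len(k)

print(truncated_rf(nx.complete_graph(30)))  # 4525 tests instead of ~2**30
```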
Regarding the definition of RF as presented in Equation (8), it can be seen that all k(i) are equally weighted in the mean. From a failure perspective, for instance, this may not be appropriate, since a single failure is more likely to happen than a double failure, and so on. Thus, according to this view, each individual component k(i) could be associated with a different weight w_i in the computation of RF, for instance \( \mathrm{RF} = \sum_{i=2}^{n-1} w_i\,k(i) \big/ \sum_{i=2}^{n-1} w_i \) (if all w_i = 1, we recover Equation (8)). However, our main focus in this paper is on resilience against targeted attacks, and in this case strategies may be employed to impact a single node or multiple nodes at the same time. It is not straightforward to tell which strategy is more likely to happen and how they could be differently weighted. Another example is natural disasters, where it is also difficult to predict beforehand the size of the impact on the network. Therefore, we decided to keep RF in its simplest form, not depending on the definition of n − 2 extra parameters (weights).

It is also important to briefly discuss how RF relates to other possible metrics, besides diameter and AISPL, used to characterize resilience. One of these metrics, presented in Section 2.1, is percolation. The main focus of percolation theory applied to network resilience is to search for phase transitions; in other words, from what point, or above which threshold, the network percolates and loses its previous structure. If this point is more difficult to reach, given a higher percolation threshold, the network can be considered more robust than another. In our proposal we did not focus on a single specific property of the network; rather, we considered all disconnection scenarios through the computation of RF. In fact, one may say that RF takes into account all scenarios below and above percolation. We also believe that percolation theory is more relevant when the network under investigation is really large (some studies consider infinite topologies), such that simple disconnections do not affect the system; in this case it is more relevant to search for phase transitions that may indeed compromise the structure of the system as a whole. However, in the case of a single network domain it is important to consider every impact to the network, and the proposed resilience factor accounts for all of that.

Finally, the proposed factor does not take into account individual link problems, only the links connected to the removed nodes. It also does not evaluate the impact of traffic losses due to network alterations, since this requires knowledge of the traffic matrices, routing protocols and service level agreements of the networks under investigation, which is not part of our topological study. Such an investigation is complementary to our approach and can also be considered to provide a wider view of the problem regarding the specific network operation.

5. STRATEGIES TO IMPROVE NETWORK RESILIENCE

After studying resilience and proposing the resilience factor RF in the previous section, we are ready to move forward and investigate what can be done to improve network resilience.

It is important to evaluate whether topology alterations would improve the resilience of a given network and which alterations would be more advantageous for the case under study. Of course, alterations are not always possible given physical and financial restrictions, but the objective here is to provide insightful information to support network manager decisions on how to achieve a more resilient network.

Basically, there are two types of alterations commonly employed to increase network robustness and tolerance to failures and attacks: link additions and link rewirings. (Note that a common procedure to upgrade network infrastructure is to increase link capacity; however, this operation does not alter connectivity or the general layout of the topology and thus is not related to resilience in our context.)

In practical terms, link additions tend to be more effective than rewirings, since they directly increase redundancy and network resources; however, there may be situations where rewirings are preferred. For instance, suppose there is a radio link connecting nodes A and B, but the network manager comes to the conclusion that a much more important connection for the network would be between nodes A and C. He may then decide to remove the radio equipment from node B and install it in node C, establishing the radio link between nodes A and C. Such a rewiring operation may be several times less expensive than contracting a new link between those nodes. Therefore, unless some practical restriction is imposed on the problem, we treat additions and rewirings as equally possible strategies to be considered in the effort to improve network resilience.

5.1. Previous work

The work in [15] proposed two addition and four rewiring strategies to improve network robustness: random addition (S1a), preferential addition (S1b), random edge rewiring (S2a), random neighbour rewiring (S2b), preferential rewiring (S3a) and preferential random edge rewiring (S3b). According to the results presented in the paper, the best rewiring strategy was S3a and the best addition strategy S1b:
(i) preferential addition (S1b): add a new edge by connecting two unconnected nodes having the lowest degrees in the network;
(ii) preferential rewiring (S3a): disconnect a random edge from a highest-degree node, and reconnect that edge to a random node.

The strategies proposed in this paper were motivated by and are compared with those presented in [15], particularly S1b and S3a.

5.2. The proposed strategies

Before analysing rewiring and addition strategies, it is important to study the role each node may play towards achieving network resilience.

Regarding the degree of nodes, low-DC nodes are critical since they limit network connectivity. Let κ be the k-connectivity of the network and DCmin the minimum DC; it can be directly proved that κ ≤ DCmin [34]. Thus, a reasonable and simple strategy towards improving network resilience is to add links to the lowest-degree nodes in order to increase DCmin; that is what is done by S1b.

On the other hand, to decrease the impact of targeted attacks, it is important to reduce network dependency on certain nodes. For instance, a node of very high degree may be selected as a good attack target in the network; this is the reasoning behind S3a.

However, we believe that strategies S1b and S3a can still be improved to provide better resilience to the network. These strategies are based on a single centrality metric (DC) and do not take into account other important information provided by CC or BC. Moreover, S3a reconnects to a random node, which may sometimes be ineffective or even worsen network resilience, since there is no control over this choice.

In this work we propose the following addition and rewiring strategies (a sketch of both operations is given below):

(i) PropAdd: add a new edge by connecting the lowest DC node to the lowest CC node.
(ii) PropRew: disconnect from the highest-degree node the edge to its highest CC neighbour, and reconnect that edge to the lowest DC node in the network.

While S1b improves only DC, PropAdd tries to improve two centrality metrics, CC and DC, at the same time.

The proposed rewiring strategy tries to regularize the network topology by decreasing the degree of the highest DC node and increasing the degree of the lowest DC node. In addition, PropRew also brings the lowest DC node closer to the network centre, since it is connected to the highest CC neighbour. The effect of this is a possible reduction in the network diameter.
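Under the stated definitions, both strategies reduce to simple graph edits. The sketch below is a minimal, assumption-laden rendering: it uses networkx, the unnormalized closeness of Equation (6), and arbitrary tie-breaking via min/max, none of which the paper specifies.

```python
import networkx as nx

def _closeness(G, u):
    # Unnormalized closeness centrality of Equation (6).
    dist = nx.shortest_path_length(G, source=u)
    return 1.0 / sum(dist[v] for v in G if v != u)

def prop_add(G):
    """PropAdd: connect the lowest-DC node to the lowest-CC node."""
    u = min(G.nodes, key=G.degree)
    v = min((w for w in G.nodes if w != u and not G.has_edge(u, w)),
            key=lambda w: _closeness(G, w))
    G.add_edge(u, v)

def prop_rew(G):
    """PropRew: remove, from the highest-DC node, the edge to its
    highest-CC neighbour, then tie that neighbour to the lowest-DC node."""
    hub = max(G.nodes, key=G.degree)
    nbr = max(G.neighbors(hub), key=lambda w: _closeness(G, w))
    low = min((w for w in G.nodes
               if w not in (hub, nbr) and not G.has_edge(nbr, w)),
              key=G.degree)
    G.remove_edge(hub, nbr)
    G.add_edge(nbr, low)
```

Applied iteratively (e.g. `for _ in range(5): prop_add(G)`), this mirrors the five alteration steps evaluated in Section 5.3.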
It is important to observe that all strategies discussed in this section do not take into account practical issues such as the cost of launching or renting a communication link, technology, type of link (wired/wireless), length of links and geographic distances, political matters, etc. Therefore, they may return alterations to the topology that are not viable in a real network scenario. It is up to the network manager to decide whether to adopt the returned solutions or not. Such a decision is very much related to the particular network under administration and the corresponding practical restrictions. The strategies are also important for the network manager to gain insight into good moves towards improving resilience.

The next subsection evaluates the results produced by our proposed strategies (PropAdd and PropRew) when compared with the previous S1b and S3a.

5.3. Numerical results

The next figures illustrate the results obtained when applying the strategies presented in Section 5 to the three topologies under study: Telcordia (Fig. 8), Cost239 (Fig. 9) and UKnet (Fig. 10). In the graphs, PropAdd results are presented as blue solid lines with dots, S1b as red solid lines with triangles, PropRew as green dashed lines with stars and S3a as brown dashed lines with squares.

Each strategy starts from the original topology (step 0) and is applied again in five consecutive iterations (steps 1–5). We considered that in a practical scenario it may not be feasible to produce a large number of alterations in the network topology, and hence we allowed no more than five alterations.

Since the objective of the strategies is to improve resilience, the resilience factor (RF) should increase at each step. Of all the 60 points presented in the graphs, the resilience factor decreased in just six cases: one for PropRew (Fig. 8, from step 4 to step 5) and five for S3a (three in Fig. 8 and two in Fig. 9). Note that in the only case where PropRew failed to produce a positive result, the decrease in RF was minimal (<1%) and the two points may be considered equivalent in a practical situation. S3a produced the worst results among all the strategies, providing a decrease in RF even in the first iteration for the Telcordia and Cost239 topologies.

FIGURE 8. Strategies comparison for the Telcordia topology.
FIGURE 9. Strategies comparison for the Cost239 topology.

FIGURE 10. Strategies comparison for the UKnet topology.

Some general conclusions about the behaviour of the strategies can be observed from the results presented in the graphs and are discussed next.

First of all, strategies based on link addition (PropAdd and S1b) provided a better performance than those based on link rewiring (PropRew and S3a). This was expected, since a link addition in fact adds resources to the network and should directly contribute to increased redundancy.

The proposed strategies (PropAdd and PropRew) presented better results than the previous ones (S1b and S3a). This confirmed our expectations about the simplicity of S1b and S3a, which were based on random procedures and used only DC as a metric.

Among all four strategies, PropAdd provided the highest gains in terms of resilience in all cases for the three topologies under study. In fact, PropAdd provided a steady increase in resilience, with all results above the other strategies. If link additions are possible in the network under study, PropAdd should be adopted as the main strategy. However, if rewirings are preferred by the network operator, PropRew may also be a good choice given the gain observed over the previous rewiring strategy (S3a). S3a performed poorly when compared with the other strategies and could be discarded; its poor results may be due to the two random operations it employs.

Tables 1 and 2 summarize the gains in RF obtained by the proposed strategies when compared with the resilience of the original topologies (Telcordia, Cost239 and UKnet), and also the gains over the previous strategies S1b and S3a.

TABLE 1. PropAdd gains on resilience (RF) over the original topologies and the S1b strategy.

                     Telcordia (%)   Cost239 (%)   UKnet (%)
Original topology        37.84          22.97         41.71
S1b (max)                 9.74           7.15          6.26
S1b (step 5)              3.05           7.15          6.25

TABLE 2. PropRew gains on resilience (RF) over the original topologies and the S3a strategy.

                     Telcordia (%)   Cost239 (%)   UKnet (%)
Original topology        18.15          11.22         23.30
S3a (max)                12.94           9.66         13.48
S3a (step 5)              8.77           9.38         13.48

It can be seen from Table 1 that PropAdd improved the resilience of the UKnet topology by more than 40%. PropAdd also provided an improvement over S1b of about 6%; more precisely, a maximum gain of 6.26% and a 6.25% gain at step 5. The best result of PropAdd when compared with S1b occurred for the Telcordia topology: 9.74%.

Table 2 shows the results obtained with PropRew. PropRew improved the resilience of the UKnet topology by 23.30% and provided a gain of more than 13% when compared with S3a.

6. CONCLUSIONS

This paper studied resilience in computer networks. We showed the importance of quantifying the notion of resilience in order to measure the capacity of the network to tolerate failures and targeted attacks. A resilience factor (RF) was proposed based on connectivity properties of the graph representing the network topology under study.

The resilience factor can be applied in practice as an important tool for the network manager (designer) to evaluate how a given alteration in the topology impacts resilience against targeted attacks. It can also be used to construct strategies for future network expansions or protection.
The factor was compared with other metrics previously employed to quantify resilience in computer networks, and the advantages of using our approach were shown.

After that, the resilience factor was employed to evaluate two proposed strategies designed to improve the resilience of a given network. The strategies were based on link additions and rewirings, and also on centrality properties of the topology graph.

The strategies do not take into account some practical restrictions (geographic, economic, political, etc.) that may affect the network; such issues depend on the particular scenario under investigation and are out of the scope of this work. However, even in cases where a strategy suggests a non-viable network alteration, it can still be used as a reference for the network manager to compare with other planned actions and to give insight into good moves towards improving network resilience. The strategies were compared with previous work and shown to have a better performance.

Future work intends to study resilience in mobile wireless networks, where the topology changes frequently according to the movement of nodes. Another interesting study is to apply our approach with a focus on links, working with partial k-link-connectivity instead of partial k-node-connectivity. This provides better adherence to the problems affecting links, such as failures, but increases computational complexity, since the number of links is usually higher than the number of nodes in a network topology. We also want to apply our study to other types of networks, such as electric power distribution, water supply and social networks.

ACKNOWLEDGEMENTS

The authors thank all the support received from the Military Institute of Engineering (Brazilian Army) during this research.

FUNDING

This work was also sponsored by CNPq (Brazilian Ministry of Science and Technology) under grant number 305626/2007-8.

REFERENCES

[1] Park, S., Khrabrov, A., Pennock, D., Lawrence, S., Giles, C. and Ungar, L. (2003) Static and Dynamic Analysis of the Internet's Susceptibility to Faults and Attacks. Proc. IEEE INFOCOM 2003, San Francisco, USA, March 30–April 3, pp. 2144–2154. IEEE, NJ, USA.
[2] Albert, R., Jeong, H. and Barabási, A.-L. (2000) Error and attack tolerance of complex networks. Nature, 406, 378–382.
[3] Gutfraind, A. (2010) Optimizing topological cascade resilience based on the structure of terrorist networks. PLoS ONE, 5, 1–20.
[4] Trivedi, K.S., Kim, D.S. and Ghosh, R. (2009) Resilience in Computer Systems and Networks. Proc. IEEE/ACM Int. Conf. Computer-Aided Design, San Jose, USA, November 2–5, pp. 74–77. IEEE, NJ, USA.
[5] Aggelou, G. (2008) Wireless Mesh Networking. McGraw-Hill Professional. ISBN 0071482563.
[6] Douligeris, C. and Mitrokosta, A. (2004) DDoS attacks and defense mechanisms: classification and state-of-the-art. Comput. Netw., 44, 643–666.
[7] Liu, S. (2009) Surviving distributed denial-of-service attacks. IT Prof., 11, 51–53.
[8] Najjar, W. and Gaudiot, J.L. (1990) Network resilience: a measure of network fault tolerance. IEEE Trans. Comput., 39, 174–181.
[9] Callaway, D., Newman, M.E.J., Strogatz, S. and Watts, D.J. (2000) Network robustness and fragility: percolation on random graphs. Phys. Rev. Lett., 85, 5468–5471.
[10] Cohen, R., Erez, K., ben-Avraham, D. and Havlin, S. (2000) Resilience of the Internet to random breakdowns. Phys. Rev. Lett., 85, 4626–4628.
[11] Liu, G. and Ji, C. (2009) Scalability of network-failure resilience: analysis using multi-layer probabilistic graphical models. IEEE/ACM Trans. Netw., 17, 319–331.
[12] Menth, M., Duelli, M., Martin, R. and Milbrandt, J. (2009) Resilience analysis of packet-switched communication networks. IEEE/ACM Trans. Netw., 17, 1950–1963.
[13] Dekker, A.H. and Colbert, B.D. (2004) Network Robustness and Graph Topology. Proc. 27th Australasian Conf. Computer Science, Dunedin, New Zealand, January, pp. 359–368. Australian Computer Society.
[14] Annibale, A., Coolen, A.C.C. and Bianconi, G. (2010) Network resilience against intelligent attacks constrained by degree-dependent node removal cost. J. Phys. A: Math. Theor., 43, 1–25.
[15] Beygelzimer, A., Grinstein, G., Linsker, R. and Rish, I. (2005) Improving network robustness by edge modification. Phys. A: Stat. Mech. Appl., 357, 593–612.
[16] Faloutsos, M., Faloutsos, P. and Faloutsos, C. (1999) On Power-Law Relationships of the Internet Topology. Proc. SIGCOMM '99, Cambridge, USA, August 31–September 3; Comput. Commun. Rev., 29, 251–262. ACM, New York, USA.
[17] Barabási, A.-L., Ravasz, E. and Vicsek, T. (2001) Deterministic scale-free networks. Phys. A: Stat. Mech. Appl., 299, 559–564.
[18] Wasserman, S., Faust, K. and Iacobucci, D. (1994) Social Network Analysis: Methods and Applications. Cambridge University Press.
[19] Barabási, A.-L. and Albert, R. (2002) Statistical mechanics of complex networks. Rev. Mod. Phys., 74, 47–97.
[20] Hopcroft, J. and Tarjan, R. (1973) Efficient algorithms for graph manipulation. Commun. ACM, 16, 372–378.
[21] Bertsekas, D. and Gallager, R. (1987) Data Networks. Prentice-Hall, Upper Saddle River, NJ, USA.
[22] Yang, L. (2006) Building k-connected neighborhood graphs for isometric data embedding. IEEE Trans. Pattern Anal. Mach. Intell., 28, 827–831.
[23] Jia, X., Kim, D., Makki, S., Wan, P. and Yi, C. (2005) Power assignment for k-connectivity in wireless ad hoc networks. J. Comb. Optim., 9, 213–222.
[24] Bredin, J., Demaine, E.D., Hajiaghayi, M. and Rus, D. (2005) Deploying Sensor Networks with Guaranteed Capacity and Fault Tolerance. Proc. 6th ACM Int. Symp. Mobile Ad Hoc Networking and Computing, IL, USA, May 25–28, pp. 309–319. ACM, New York, USA.
[25] Menger, K. (1927) Zur allgemeinen Kurventheorie. Fundam. Math., 10, 96–115.
[26] Skiena, S. (2008) The Algorithm Design Manual. Springer.
[27] Kleitman, D.J. (1969) Methods for investigating connectivity of large graphs. IEEE Trans. Circuit Theory, CT-16, 232–243.
[28] Kammer, F. and Täubig, H. (2004) Graph Connectivity. Institut für Informatik, Technische Universität München.
[29] Freeman, L.C. (1979) Centrality in social networks: conceptual clarification. Soc. Netw., 1, 215–239.
[30] Sam, S.B., Sujatha, S., Kannan, A. and Vivekanandan, P. (2006) Network topology against distributed denial of service attacks. Inf. Technol. J., 5, 489–493.
[31] Dekker, A.H. and Colbert, B. (2004) Scale-Free Networks and Robustness of Critical Infrastructure Networks. Proc. 7th Asia-Pacific Conf. Complex Systems, Cairns, Australia, December 6–10, pp. 1–15.
[32] Frantz, T. and Carley, K.M. (2005) Relating Network Topology to the Robustness of Centrality Measures. Technical Report CMU-ISRI-05-117, pp. 1–24. School of Computer Science, Carnegie Mellon University, USA.
[33] O'Mahony, M.J. (1996) Results from the COST 239 Project. Proc. 22nd European Conf. Optical Communication, Oslo, Norway, September 19, pp. 3–11. IEEE, NJ, USA.
[34] Gibbons, A. (1985) Algorithmic Graph Theory. Cambridge University Press, Cambridge, NY, USA.