SybilInfer: Detecting Sybil Nodes using Social Networks George Danezis Prateek Mittal Microsoft Research, University of Illinois at Urbana-Champaign, Cambridge, UK. Illinois, USA. email@example.com firstname.lastname@example.org Abstract contributors. On-line forums, starting with USENET , to contemporary blogs or virtual worlds like Second SybilInfer is an algorithm for labelling nodes in a so- Life  always have to deal with the issue of disruptioncial network as honest users or Sybils controlled by an in the discussion threads, with persistent abusers comingadversary. At the heart of SybilInfer lies a probabilis- back under different names. All these are forms of Sybiltic model of honest social networks, and an inference attacks at the high-level application layers.engine that returns potential regions of dishonest nodes. There are two schools of Sybil defence mechanisms,The Bayesian inference approach to Sybil detection comes the centralised and decentralised ones. Centralised de-with the advantage label has an assigned probability, indi- fences assume the existence of an authority that is capablecating its degree of certainty. We prove through analytical of doing admission control for the network . Its role isresults as well as experiments on simulated and real-world to rate limit the introduction of ‘fake’ identities, to ensurenetwork topologies that, given standard constraints on the that the fraction of corrupt nodes remains under a certainadversary, SybilInfer is secure, in that it successfully dis- threshold. The practicalities of running such an authoritytinguishes between honest and dishonest nodes and is not are very system-speciﬁc and in general it would have tosusceptible to manipulation by the adversary. Further- act as a Public Key Certiﬁcation Authority as well as amore, our results show that SybilInfer outperforms state guardian of the moral standing of the nodes introduced –of the art algorithms, both in being more widely applica- a very difﬁcult problem in practice. Such centralised so-ble, as well as providing vastly more accurate results. lutions are also at odds with the decentralisation guiding principle of peer-to-peer systems. Decentralised approaches recognise the difﬁculty in1 Introduction having a single authority vouching for nodes, and dis- tribute this task across all nodes of the system. The ﬁrst The Peer-to-peer paradigm allows cooperating users to such proposal is Advogato , which aimed to reduceenjoy a service with little or no need for any centralised abuse in on-line services, followed by a proposal to useinfrastructure. While communication  and storage  introduction graphs of Distributed Hash Tables  to limitsystems employing this design philosophy have been pro- the potential for routing disruption in those systems. Theposed, the lack of any centralised control over identities state of the art SybilGuard  and SybiLimit  pro-opens such systems to Sybil attacks : a few malicious pose the use of social networks to mitigate Sybil attacks.nodes can simulate the presence of a very large number As we will see, SybilGuard suffers from high false neg-of nodes, to take over or disrupt key functions of the atives, while SybilLimit makes unrealistic assumptionsdistributed or peer-to-peer system. Any attempt to build about the knowledge of number of honest nodes in the net-fault-tolerance mechanisms is doomed since adversaries work. In both cases the systems Sybil detection strategiescan control arbitrary fractions of the system nodes. This are based on heuristics that are not optimal.Sybil attack is further made practical through the use of Our key contribution is to propose SybilInfer, a methodthe existing large number of compromised networked ma- for detecting Sybil nodes in a social network, that makeschines (often called zombies) being part of bot-nets. use of all information available to the defenders. The for- Similar problems plague Web 2.0 applications that mal model underlying our approach casts the problem ofrely on collaborative tagging, ﬁltering and editing, like detecting Sybil nodes in the context of Bayesian Infer-Wikipedia , or del.icio.us . A single user ence: given a set of stated relationships between nodes,can register under different pseudonyms, and bypass any the task is to label nodes as honest or dishonest. Based onelections of velocity check mechanism that attempts to some simple and generic assumptions, like the fact thatguarantee the quality of the data through the plurality of social networks are fast mixing , we sample cuts in
the social graph according to the probability they divide Limit, the current state of the art Sybil defence mecha-it into honest and dishonest regions. These samples not nisms. We also propose extensions that enable our solu-only allow us to label nodes as honest or Sybil attackers, tion to be implemented in decentralised settings.but also to associate with each label output by our algo- This paper is organised in the following fashion: in sec-rithm a degree of certainty. tion 2 we present an overview of our approach that can The proposed techniques can be applied in a wide vari- be used as a road-map to the technical sections. In sec-ety of settings where high reliability peer-to-peer systems, tion 3 we present our security assumptions, threat model,or Sybil-resistant collaborative mechanisms, are required the probabilistic model and sampler underpinning Sybil-even under attack: Infer; a security evaluation follows in section 4, providing • Secure routing in Distributed Hash Tables motivated analytical as well as experimental arguments supporting early research into this ﬁeld, and our proposal can the security of the method proposed along with a compar- be used instead of a centralised authority to limit the ison with SybilInfer. Section 5 discusses the settings in fraction of dishonest nodes, that could disrupt rout- which SybilInfer can be fruitfully used, followed by some ing . conclusions in section 6. • In public anonymous communication networks, such 2 Overview as Tor , our techniques can be used to eliminate the potential for a single entity introducing a large num- ber of nodes, and de-anonymize users’ circuits. This The SybilInfer algorithm takes as an input a social was so far a key open problem for securely scaling graph G and a single known good node that is part of this such systems. graph. The following conceptual steps are then applied to return the probability each node is honest or controlled by • Leader Election  and Byzantine agreement  a Sybil attacker: mechanisms that were rendered useless by the Sybil • A set of traces T are generated and stored by per- attack can again be of use, after Sybil nodes have forming special random walks over the social graph been detected and eliminated from the social graph. G. These are the only information retained about the • Finally, detecting Sybil accounts is a key step in pre- graph for the rest of the SybilInfer algorithm, and venting false email accounts used for spam, or pre- their generation is detailed in section 3.1. venting trolling and abuse of on-line communities • A probabilistic model is then deﬁned that describes and web-forums. Our techniques can be applied in the likelihood a trace T was generated by a speciﬁc all those settings, to ﬁght spam and abuse . honest set of nodes within G, called X. This modelSybilInfer applies to settings where a peer-to-peer or dis- is based on our assumptions that social networks aretributed system is somehow based on or aware of social fast mixing, while the transitions to dishonest regionsconnections between users. Properties of natural social are slow. Given the probabilistic model, the traces Tgraphs are used to classify nodes are honest or Sybils. and the set of honest nodes we are able to calculateWhile this approach might not be applicable to very tradi- Pr[T |X is honest]. The calculation of this quantitytional peer-to-peer systems , it is more an more com- is the subject of section 3.1 and section 3.2.mon for designers to make distributed systems aware of • Once the probabilistic model is deﬁned, we usethe social environment of their users. Third party social Bayes theorem to calculate for any set of nodes Xnetwork services [29, 30], can also be used to extract so- and the generated trace T , the probability that X con-cial information to protect systems against sybil attacks sists of honest nodes. Mathematically this quality isusing SybilInfer. Section 5 details deployment strategies deﬁned as Pr[X is honest|T ]. The use of Bayes the-for SybilInfer and how it is applicable to current systems. orem is described in section 3.1. We show analytically that SybilInfer is, from a theo-retical perspective, very powerful: under ideal circum- • Since it is not possible to simply enumerate all sub-stances an adversary gains no advantage by introducing sets of nodes X of the graph G, we instead sampleinto the social network any additional Sybil nodes that are from the distribution of honest node sets X, to onlynot ‘naturally’ connected to the rest of the social structure. get a few X0 , . . . , XN ∼ Pr[X is honest|T ]. UsingEven linking all dishonest nodes with each other (with- those representative sample sets of honest nodes, weout adding any Sybils) changes the characteristics of their can calculate the probability any node in the systemsocial sub-graph, and can under some circumstances be is honest or dishonest. Sampling and the approxi-detected. We demonstrate the practical efﬁcacy of our ap- mation of the sought marginal probabilities are theproach using both synthetic scale-free topologies as well subject of section 3.3.as real-world LiveJournal data. We show very signiﬁcant The key conceptual difﬁculty of our approach is thesecurity improvements over both SybilGuard and Sybil- deﬁnition of the probabilistic model over the traces T , and
graph. Several authors have shown that real-life so- cial networks are indeed fast mixing [16, 18]. 3. A node knows the complete social network topology (G) : social network topologies are relatively static, and it is feasible to obtain a global snapshot of the network. Friendship relationships are already public Attack Edges data for popular social networks like Facebook  and Orkut . This assumption can be relaxed to using sub-graphs, making SybilInfer applicable to Honest Nodes Sybil Nodes decentralised settings. Assumptions (1) and (2) are identical to those made by Figure 1. Illustration of Honest nodes, Sybil the SybilGuard and SybilInfer systems. Previously, the nodes and attack edges between them. authors of SybilGuard  observed that when the adver- sary creates too many Sybil nodes, then the graph G has a small cut: a set of edges that together have small station- ary probability and whose removal disconnects the graphits inversion using Bayes theorem to deﬁne a probability into two large sub-graphs.distribution over all possible honest sets of nodes X. Thisdistribution describes the likelihood that a speciﬁc set of This intuition can be pushed much further to build su-nodes is honest. The key technical challenge is making perior Sybil defences. It has been shown  that theuse of this distribution to extract the sought probability presence of a small cut in a graph results in slow mix-each node is honest or dishonest, that we achieve via sam- ing which means that fast mixing implies the absence ofpling. Section 3, describes in some detail how these issues small cuts. Applied to social graphs this observation un-are tackled by SybilInfer. derpins the key intuition behind our Sybil defence mech- anism: the mixing between honest nodes in the social net- works is fast, while the mixing between honest nodes and3 Model & Algorithm dishonest nodes is slow. Thus, computing the set of honest nodes in the graph is related to computing the bottleneck Let us denote the social network topology as a graph G cut of the graph.comprising vertices V , representing people and edges E, One way of formalising the notion of a bottleneck cut,representing trust relationships between people. We con- is in terms of graph conductance (Φ) , deﬁned as:sider the friendship relationship to be an undirected edgein the graph G. Such an edge indicates that two nodes trust Φ= min ΦX , X⊂V :π(X)<1/2each other to not be part of a Sybil attack. Furthermore,we denote the friendship relationship between an attackernode and an honest node as an attack edge and the hon- where ΦX is deﬁned as:est node connected to an attacker node as a naive node or Σx∈X Σy∈X π(x)Pxy /misguided node. Different types of nodes are illustrated in ΦX = ,ﬁgure 1. These relationships must be understood by users π(X)as having security implications, to restrict the promiscu-ous behaviour often observed in current social networks, and π(·) is the stationary distribution of the graph. Intu-where users often ﬂag strangers as their friends . itively for any subset of vertices X ⊂ V its conductance We build our Sybil defence around the following as- ΦX represents the probability of going from X to the restsumptions: of the graph, normalised by the probability weight of be- ing on X. When the value is minimal the bottleneck cut 1. At least one honest node in the network is known. In in the graph is detected. practise, each node trying to detect Sybil nodes can Note that performing a brute force search for this bot- use itself as the apriori honest node. This assump- tleneck cut is computationally infeasible (it is actually NP- tion is necessary to break symmetry: otherwise an Hard, given its relationship to the sparse-cut problem). attacker could simply mirror the honest social struc- Furthermore, ﬁnding the exact smallest cut is not as im- ture, and any detector would not be able to distin- portant as being able to judge how likely any cut is, to be guish which of the two regions is the honest one. dividing nodes into an honest and dishonest region. This 2. Social networks are fast mixing: this means that a probability is related to the deviation of the size of any random walk on the social graph converges quickly cut from what we would expect in a natural, fast mixing, to a node following the stationary distribution of the social network.
3.1 Inferring honest sets ProbXX ProbXX ProbXX In this paper, we propose a framework based onBayesian inference to detect approximate cuts between X Xhonest and Sybil node regions in a social graph and usethose to infer the labels of each node. A key strength ofour approach is that it, not only associates labels to each ProbXXnode, but also ﬁnds the correct probability of error that (a) A schematic representation of tran-could be used by peer-to-peer or distributed applications sition probabilities between honest X ¯ and dishonest X regions of the socialto select nodes. network. The ﬁrst step of SybilInfer is the generation of a set ofrandom walks on the social graph G. These walks are gen-erated by performing a number s of random walks, start- ProbXX = 1 / |V| + Exx Probabilitying from each node in the graph (i.e. a total of s · |V | 1 / |V|walks.) A special probability transition matrix is used, ProbXX = 1 / |V| - Exxdeﬁned as follows: Honest Nodes Sybil nodes 1 1 min( di , dj ) if i → j is an edge in G Pij = , (b) The model of the probability a short random walk of length 0 otherwise O(log |V |) starting at an honest node ends on a particular honest or dishonest (Sybil) node. If no Sybils are present the network is fastwhere di denotes the degree of vertex i in G. mixing and the probability converges to 1/|V |, otherwise it is biased This choice of transition probabilities ensures that the towards landing on an honest node.stationary distribution of the random walk is uniform overall vertices |V |. The length of the random walks is l = Figure 2. Illustrations of the SybilInfer mod-O(log |V |), which is rather short, while the number of ran- elsdom walks per node (denoted by s) is a tunable parameterof the model. Only the starting vertex and the ending ver-tex of each random walk are used by the algorithm, andwe denote this set of vertex-pairs, also called the traces,by T . Note that since the set X is honest, we assume (by as- Now consider any cut X ⊂ V of nodes in the graph, sumption (2)) fast mixing amongst its elements, meaningsuch that the a-prior honest node is an element of X. We that a short random walk reaches any element of the sub-are interested in the probability that the vertices in set X set X uniformly at random. On the other hand a randomare all honest nodes, given our set of traces T , i.e. P (X = walk starting in X is less likely to end up in the dishonestHonest|T ). Through the application of Bayes theorem we ¯ region X, since there should be an abnormally small cuthave an expression of this probability: between them. (This intuition is illustrated in ﬁgure 2(a).) Therefore we approximate the probability that a short ran- P (T |X = Honest) · P (X = Honest)P (X = Honest|T ) = , dom walk of length l = O(log |V |) starts in X and ends at Z a particular node in X is given by ProbXX = Π + EXX ,where Z is the normalization constant given by: Z = where Π is the stationary distribution given by 1/|V |, forΣX⊂V P (T |X = Honest) · P (X = Honest). Note that some EXX > 0. Similarly, we approximate the proba-Z is difﬁcult to compute because it involves the summa- bility that a random walk starts in X and does not endtion of an exponential number of terms in the size of |V |. in X is given by ProbX X = Π − EX X . Notice that ¯ ¯Only being able to compute this probability up to a mul- ProbXX > ProbX X , which captures the property that ¯tiplicative constant Z is not an impediment. The a-prior there is fast mixing amongst honest nodes and slow mix-distribution P (X = Honest) can be used to encode any ing between honest and dishonest nodes. The approxi-further knowledge about the honest nodes, or can simply mate probabilities ProbXX and ProbX X and their likely ¯be set to be uniform over all possible cuts. gap from the ideal 1/|V | are illustrated in ﬁgure 2(b). Bayes theorem has allowed us to reduce the initialproblem of inferring the set of good nodes X from the set Let NXX be the number of traces in T starting in theof traces T , to simply being able to assign a probability honest set X and ending in same honest set X. Let NX X ¯to each set of traces T given a set of honest nodes X, i.e. be the number of random walks that start at the honest setcalculating P (T |X = Honest). Our only remaining theo- ¯ X and end in the dishonest set X. NX X and NXX are ¯ ¯ ¯retical task is deriving this probability, given our model’s deﬁned similarly. Given the approximate probabilities ofassumptions. transitions from one set to the other and the counts of such
transitions we can ascribe a probability to the trace: Approximating ProbXX through the traces T provides us with a simple expression for the sought probability,P (T |X = Honest) = (ProbXX )NXX · (ProbX X )NX X · ¯ ¯ based simply on the number of walks starting in one re- (ProbX X )NX X · (ProbXX )NXX , ¯ ¯ ¯ ¯ ¯ ¯ gion and ending in another:where ProbX X and ProbXX are the probabilities a walk NXX 1 NXX ¯ ¯ ¯ P (T |X = Honest) = ( · ) ·starting in the dishonest region ends in the dishonest or NXX + NX X ¯ |X|honest regions respectively. NX X¯ 1 ( · ¯ )NX X · ¯ The model described by P (T |X = Honest) is an ap- NX X + NXX ¯ |X|proximation to reality that is suitable enough to perform NX X ¯ ¯ 1Sybil detection. It is of course unlikely that a random walk ( · ¯ )NX X · ¯ ¯ NX X + NXX ¯ ¯ ¯ |X|starting at an honest node will have a uniform probabil- NXX ¯ 1 NXXity to land on all honest or dishonest nodes respectively. ( · ) ¯ , NXX + NX X ¯ ¯ ¯ |X|Yet this simple probabilistic model relating the startingand ending nodes of traces is rich enough to capture the This expression concludes the deﬁnition of our probabilis-“probability gap” between landing on an honest or dis- tic model, and contains only quantities that can be ex-honest node, as illustrated in ﬁgure 2(b), and suitable for tracted from either the known set of nodes X, or the setSybil detection. of traces T that is assigned a probability. Note that we do not assume any prior knowledge of the size of the honest3.2 Approximating EXX ¯ set, and it is simply a variable |X| or |X| of the model. Next, we shall describe how to sample from the distri- We have reduced the problem of calculating P (T |X = bution P (X = Honest|T ) using the Metropolis-HastingsHonest) to ﬁnding a suitable EXX , representing the ‘gap’ algorithm.between the case when the full graph is fast mixing (forEXX = 0) and when there is a distinctive Sybil attack (in 3.3 Sampling honest conﬁgurationswhich case EXX >> 0.) One approach could be to try inferring EXX through a At the heart of our Sybil detection techniques lies atrivial modiﬁcation of our analysis to co-estimate P (X = model that assigns a probability to each sub-set of nodesHonest, EXX |T ). Another possibility is to approximate of being honest. This probability P (X = Honest|T ) canEXX or ProbXX directly, by choosing the most likely be calculated up to a constant multiplicative factor Z, thatcandidate value for each conﬁguration of honest nodes X is not easily computable. Hence, instead of directly calcu-considered. This can be done through the conductance or lating this probability for any conﬁguration of nodes X,through sampling random walks on the social graph. we will attempt instead to sample conﬁgurations Xi fol- Given the full graph G, ProbXX can be approximated lowing this distribution. Those samples are used to esti- Σx∈X Σy∈X Π(x)P lxy l mate the marginal probability that any speciﬁc node, oras ProbXX = Π(X) , where Pxy is the prob- collections of nodes, are honest or Sybil attackers.ability that a random walk of length l starting at x ends in Our sampler for P (X = Honest|T ) is based ony. This approximation is very closely related to the con- ¯ the established Metropolis-Hastings algorithm  (MH),ductance of the set X and X. Yet computing this measure which is an instance of a Markov Chain Monte Carlo sam-would require some effort. pler. In a nutshell, the MH algorithm holds at any point Notice that ProbXX , as calculated above, can also a sample X0 . Based on the X0 sample a new candidatebe approximated by performing many random walks of sample X is proposed according to a probability distribu-length l starting at X and computing the fraction of those tion Q, with probability Q(X |X0 ). The new sample Xwalks that end in X. Interestingly our traces already con- is ‘accepted’ to replace X0 with probability α:tain random walks over the graph of exactly the appropri-ate length, and therefore we can reuse them to estimate a P (X |T ) · Q(X0 |X )good ProbXX and related probabilities. Given the counts α = min( , 1) P (X0 |T ) · Q(X |X0 )NXX , NX X , NXX and NX X : ¯ ¯ ¯ ¯ otherwise the original sample X0 is retained. It can be NXX 1 shown that after multiple iterations this yields samples X ProbXX = · NXX + NX X |X| ¯ according to the distribution P (X|T ) irrespective of the way new candidate sets X are proposed or the initial stateand of the algorithm, i.e. a more likely state X will pop-out NX X ¯ ¯ 1 ProbX X = ¯ ¯ · ¯ , more frequently from the sampler, than less likely states. NX X + NXX |X| ¯ ¯ ¯ A relatively naive strategy can be used to propose can-and, ProbX X = 1−ProbXX and ProbXX = 1−ProbX X . ¯ ¯ ¯ ¯ didate states X given X0 for our problem. It relies on
simply considering sets of nodes X that are only different 4.1 Theoretical resultsby a single member from X0 . Thus, with some probability ¯padd a random node x ∈ X0 is added to the set to form the The security of our Sybil detection scheme hinges oncandidate X = X0 ∪ x. Alternatively, with probability two important results. First, we show that we can detectpremove , a member of X0 is removed from the set of nodes, whether a network is under Sybil attack, based on the so-deﬁning X = X0 ∩ x for x ∈ X0 . It is trivial to calculate cial graph. Second, we show that we are able to detectthe probabilities Q(X |X0 ) and Q(X |X0 ) based on padd , Sybil attackers connected to the honest social graph, andpremove and using a uniformly at random choice over nodes this for any attacker topology. ¯ ¯in X0 , X0 , X and X when necessary. Our ﬁrst result states that: A key issue when utilizing the MH algorithm is decid- T HEOREM A. In the absence of any Sybil attack,ing how many iterations are necessary to get independent the distribution of P (X = Honest|T ), for a givensamples. Our rule of thumb is that |V | · log |V | steps are size |X|, is close to uniform, and all cuts are equallylikely to guarantee convergence to the target distribution likely (EXX 0).P . After that number of steps the coupon collector’s the- This result is based on our assumption that a random walkorem states that each node in the graph would have been over a social network is fast mixing meaning that, afterconsidered at least once by the sampler, and assigned to log(|V |) steps, it visits nodes drawn from the stationarythe honest or dishonest set. In practice, given very large distribution of the graph. In our case the random walk istraces T , the number of nodes that are difﬁcult to cate- performed over a slightly modiﬁed version of the socialgorise is very small, and a non-naive sampler requires few graph, where the transition probability attached to eachsteps to produce good samples (after a certain burn in- link ij is:period that allows it to detect the most likely honest re- 1 1gion.) min( di , dj ) if i → j is an edge in G Pij = , Finally, given a set of N samples Xi ∼ P (X|T ) out- 0 otherwiseput by the MH algorithm it is possible to calculate themarginal probabilities any node is honest. This is key out- which guarantees that the stationary distribution is uni- 1put of the SybilInfer algorithm: given a node i it is pos- form over all nodes (i.e. Π = |V | ). Therefore we ex-sible to associate a probability it is honest by calculating: pect that in the absence of an adversary the short walks I(i∈Xj ) in T to end at a network node drawn at random amongstPr[i is honest] = j∈[0,N −1)N , where I(i ∈ Xj ) isan indicator variable taking value 1 if node i is in the hon- all nodes |V |. In turn this means that the number of endest sample Xj , and value zero otherwise. Enough samples nodes in the set of traces T , that end in the honest set X iscan be extracted from the sampler to estimate this proba- NXX = lim|TX |→∞ |X| · |TX |, where TX is the number |V |bility with an arbitrary degree of precision. of traces in T starting within the set |X|. Substituting this More sophisticated samplers would make use of a bet- in the equations presented in section 2.1 and 2.2 we get:ter strategy to propose candidate states X for each iter- NXX 1ation. The choice of X can, for example, be biased to- ProbXX = · ⇒ (1) NXX + NX X |X| ¯wards adding or removing nodes according to how often NXX 1random walks starting at the single honest node land on Π + EXX = · ⇒ (2) NXX + NX X |X| ¯them. We expect nodes that are reached often by randomwalks starting in the honest region to be honest, and the 1 (|X|/|V |) · |TX | 1 + EXX = · ⇒ (3)opposite to be true for dishonest nodes. In all cases this |V | |TX | |X|bias is simply an optimization for the sampling to take EXX = 0 (4)fewer iterations, and does not affect the correctness of theresults. As a result, by sufﬁciently increasing the number of ran- dom walks T performed on the social graph, we can get EXX arbitrarily close to zero. In turn this means that our4 Security evaluation distribution P (T |X = Honest) is uniform for given sizes of |X|, given our uniform a-prior P (X = Honest|T ). In a nutshell by estimating EXX for any sample X re- In this section we discuss the security of SybilInfer turned by the MH algorithm, and testing how close it iswhen under Sybil attack. We show analytically that we to zero we detect whether it corresponds to an attack (ascan detect when a social network suffers from a Sybil at- we will see from theorem B) or a natural cut in the graph.tack, and correctly label the Sybil nodes. Our assumptions We can increase the precision of the detector arbitrarily byand full proposal are then tested experimentally on syn- increasing the number of walks T .thetic as well as real-world data sets, indicating that the Our second results relates to the behaviour of the sys-theoretical guarantees hold. tem under Sybil attack:
T HEOREM B. Connecting any additional Sybil nodes 4.2 Practical considerations to the social network, through a set of corrupt nodes, lowers the dishonest sub-graph conductance to the Models and assumptions are always an approximation honest region, leading to slow mixing, and hence we of the real world. As a result, careful evaluation is nec- expect EXX > 0. essary to ensure that the theorems are robust to deviations ¯ from the ideal behaviour assumed so far.First we deﬁne the dishonest set X0 comprising all dis-honest nodes connected to honest nodes in the graph. The The ﬁrst practical issue concerns the fast mixing prop- ¯set XS contains all dishonest nodes in the system, includ- erties of social networks. There is a lot of evidence that ¯ing nodes in X0 and the Sybil nodes attached to them. It social networks exhibit this behaviour , and previous ¯ ¯must hold that |X0 | < |XS |, in case there is a Sybil attack. proposals relating to Sybil defence use and validate theSecond we note that the probability of a transition be- same assumption [27, 26]. SybilInfer makes an furthertween an honest node i ∈ X and a dishonest node j ∈ X ¯ assumption, namely that the modiﬁed random walk overcannot increase through Sybil attacks, since it is equal to the social network, that yields a uniform distribution over 1 1Pij = min( di , dj ). At worst the corrupt node will in- all nodes, is also fast mixing for real social networks. The 1 1crease its degree by connecting Sybils which has only the probability Pij = min( di , dj ), depends on the mutualpotential to decrease this probability. Therefore we have degrees of the nodes i and j, and makes the transition tothat x∈XS y∈XS Pxy ≤ x∈X0 y∈X0 Pxy . Com- ¯ ¯ ¯ ¯ nodes of higher degree less likely. This effect has the po-bining the two inequalities we get that: tential to slow down mixing times in the honest case, par- ticularly when there is a high variation in node degrees. This effect can be alleviated by removing random edges ¯ Pxy ¯ ¯ Pxy from high degree nodes to guarantee that the ratio of max- y∈XS x∈X0 y∈X0 ¯S | < ¯ ⇔ (5) imum and minimum node degree in the graph is bounded |X |X0 | (an approach also used by SybilLimit.) 1 1 y∈XS |V | Pxy ¯ ¯ x∈X0 y∈X0 |V | Pxy ¯ The second consideration also relates to the fast mix- ¯ < ¯ 1 ⇔ (6) |XS | 1 |V | |X0 | |V | ing properties of networks. While in theory fast mixing networks should not exhibit any small cuts, or regions of ¯ y∈XS π(x)Pxy ¯ x∈X0 y∈X0 π(x)Pxy ¯ < ⇔ (7) abnormally low conductance, in practice they do. This ¯S ) Π(X Π(X¯0) is especially true for regions with new users that have not ¯ ¯ Φ(XS ) < Φ(X0 ). (8) had the chance to connect to many others, as well as social networks that only contain users with particular charac- teristics (like interest, locality, or administrative groups.)This result signiﬁes that independently of the topology of Those regions yield, even in the honest case, sample cutsthe adversary region the conductance of a sub-graph con- that have the potential to be mistaken as attacks. This ef-taining Sybil nodes will be lower compared with the con- fect forces us to consider a threshold EXX under whichductance of the sub-graph of nodes that are simply com- we consider cuts to be simply false positives. In turn thispromised and connected to the social network. Lower makes the guarantees of schemes weaker in practice thanconductance in turn leads to slower mixing times be- in theory, since the adversary can introduce Sybils into atween honest and dishonest regions  which means that region undetected, as long as the set threshold EXX is notEXX > 0, even for very few Sybils. This deviation is exceeded.subject to the sampling variation introduced by the trace The threshold EXX is chosen to be α·EXXmax , whereT , but the error can be made arbitrarily small by sampling 1 1 EXXmax = |X| − |V | , and α is a constant between 0more random walks in T . and 1. Here α can be used to control the tradeoff between false positives and false negatives. A higher value of alpha These two results are very strong: they indicate that, in enables the adversary to insert a larger number of sybilstheory, a set of compromised nodes connecting to honest undetected but reduces the false positives. On the othernodes in a social network, would get no advantage by con- hand, a smaller value of α reduces the number of Sybilsnecting any additional Sybil nodes, since that would lead that can be introduced undetected but at the cost of higherto their detection. Sampling regions of the graph with ab- number of false positives.normally small conductance, through the use of the ran-dom walks T , should lead to their discovery, which is the Given these practical considerations, we can formulatetheoretical foundation of our technique. Furthermore we a weaker security guarantee for SybilInfer:established that techniques based on detecting abnormali- T HEOREM C. Given a certain “natural” thresholdties in the value of EXX are strategy proof, meaning that value for EXX in an honest social network, a dis-there is no attacker strategy (in terms of special adversary honest region performing a Sybil attack will exceedtopology) to foil detection. it after introducing a certain number of Sybil nodes.
Scale Free Topology: 1000 nodes, 100 malicious Scale Free Topology: 1000 nodes, 100 malicious 250 250 False Negatives, alpha=0.65 False Negatives,alpha=0.65 False Positives, alpha=0.65 False Positives,alpha=0.65 False Negatives, alpha=0.7 False Negatives,alpha=0.7 False Positives, alpha=0.7 False Positives,alpha=0.7 False Negatives, alpha=0.75 False Negatives,alpha=0.75 200 False Positives, alpha=0.75 200 False Positives,alpha=0.75 Number of identities Number of identities 150 150 100 100 50 50 0 0 0 100 200 300 400 500 600 700 800 900 1000 0 100 200 300 400 500 600 700 800 900 1000 Number of additional sybil identities (x) Number of additional sybil identities (x) (a) Average degree compromised nodes (b) Low degree compromised nodes Figure 3. Synthetic Scale Free Topology: SybilInfer Evaluation as a function of additional Sybil identities (ψ) introduced by colluding entities. False negatives denote the total number of dishon- est identities accepted by SybilInfer while false positives denote the number of honest nodes that are misclassiﬁed.This theorem is the result of Theorem B that demonstrates given node is proportional to the degree of that node; i.e.:that the conductance keeps decreasing as the number of Pr[(v, i)] = didj , where di is the degree of node i. In jSybils attached to a dishonest region increases. This in our simulations, we use m = 5, giving an average nodeturn will slow down the mixing time between the hon- degree of 10.est and dishonest region, leading to an increasingly large In such a scale free topology of 1000 nodes, we con-EXX . sider a fraction f = 10% of the nodes to be compro- Intuitively, as the attack becomes larger, the cut be- mised by a single adversary. The compromised nodes aretween honest and dishonest nodes becomes increasingly distributed uniformly at random in the topology. Com-distinct, which makes Sybil detection easier. It is impor- promised nodes introduce ψ additional Sybil nodes andtant to note that as more Sybils are introduced into the establish a scale free topology amongst themselves. Wedishonest region, the probability of the whole region be- conﬁgure SybilInfer to use 20 samples for computing theing detected as an attack increases, not only the new Sybil marginal probabilities, and label as honest the set of nodesnodes. This provides strong disincentives to the adver- whose marginal probability of being honest is greater thansary from performing larger Sybil attacks, since even pre- 0.5. The experiment is repeated 100 times with differentviously undetected malicious nodes might be ﬂagged as scale free topologies.Sybils. Figure 3(a) illustrates the false positives and false neg- atives classiﬁcations returned by SybilInfer, for varying4.3 Experimental evaluation using synthetic value of ψ, the number of additional Sybil nodes intro- data duced. We observe that when ψ < 100, α = 0.7 , then all the malicious identities are classiﬁed as honest by Sybil- We ﬁrst experimentally demonstrate the validity of Infer. However, there is a threshold at ψ = 100, be-Theorem C using synthetic topologies. Our experiments yond which all of the Sybil identities, including the ini-consist of building synthetic social network topologies, in- tially compromised entities are ﬂagged as attackers. Thisjecting a variable number of Sybil nodes, and applying is because beyond this point, the EXX for the Sybil regionSybilInfer to establish how many of them are detected. A exceeds the natural threshold leading to full detection, val-key issue we explore is the number of introduced Sybil idating Theorem C. The value ψ = 100 is clearly the op-nodes under which Sybil attacks are not detected. timal attack strategy, in which the attacker can introduce Social networks exhibit a scale-free (or power law) the maximal number of Sybils without being detected. Wenode degree topology . Our network synthesis al- also note that even in the worst case, the false positives aregorithm replicates this structure through preferential at- less than 5%. The false positive nodes have been misclas-tachment, following the methodology of Nagaraja . siﬁed because these nodes are closer to the Sybil region;We create m0 initial nodes connected in a clique, and SybilInfer is thus incentive compatible in the sense thatthen for each new node v, we create m new edges to nodes which have mostly honest friends are likely not toexisting nodes, such that the probability of choosing any be misclassiﬁed.
Figure 4 presents a plot of the maximum Sybil iden- Scale Free Topology: 1000 nodes 0.4 tities as a function of the compromised fraction of nodes False Negatives 0.35 Theoretical Prediction y=x f . Note that our theoretical prediction (which is strategy- independent) matches closely with the attacker strategy 0.3 Fraction of total malicious entities of connecting Sybil nodes in a scale free topology. The 0.25 adversary is able to introduce roughly about 1 additional 0.2 Sybil identity per real entity. For instance, at f = 0.2, 0.15 the total number of Sybil identities is 0.37. As we observe 0.1 from the ﬁgure the ability of the adversary to just include 0.05 about one additional Sybil identity per compromised node 0 embedded in the social network remains constant, no mat- 0 0.02 0.04 0.06 0.08 0.1 0.12 0.14 0.16 0.18 0.2 Fraction of malicious entities (f) ter the fraction f of compromised nodes in the network. Figure 4. Scale Free Topology: fraction of 4.4 Experimental evaluation using real-world total malicious and Sybil identities as a func- data tion of real malicious entities. Next we validate the security guarantees provided by We can also see the effect of varying the threshold SybilInfer using a sampled LiveJournal topology. A vari-EXX . As α is increased from 0.65 to 0.7, the ψ for the ant of snowball  sampling was used to collect the fulloptimal attacker strategy increases from 70 to 100. This is data set data, comprising over 100,000 nodes.because an increase in the threshold EXX allows the ad- To perform our experiments we chose a random nodeversary to insert more Sybils undetected. The advantage and collect all nodes in its three hop neighbourhood. Theof increasing α lies in reducing the worst case false pos- resulting social network has about 50,000 nodes. We thenitives. We can see that by increasing α from 0.7 to 0.75, perform some pre-processing step on the sub-graph:the worst case false positives can be reduced from 5% to • Nodes with degree less than 3 are removed, to ﬁlter2%. out nodes that are too new to the social network, or Note that for the remainder of the paper, we shall use inactive.α = 0.7. We also wish to show that the security of our scheme • If there is an edge between A → B, but no edgedepends primarily on the number of colluding malicious between B → A, then A → B is removed (to onlynodes and not on the number of attack edges. To this end, keep the symmetric friendship relationships.)we chose the compromised nodes to have the lowest num-ber of attack edges (instead of choosing them uniformly We note that despite this pre-processing nodes all degreesat random), and repeat the experiment. Figure 3(b) illus- can be found in the ﬁnal dataset, since nodes with ini-trates that the false positives and false negatives classiﬁca- tial degree over 3 will have some edges removed reducingtions returned by SybilInfer, where the average number of their degree to less than 3.attack edges are 500. Note that these results are very simi- After pre-processing, the social sub-graph consists oflar to the previous case illustrated in Figure 3(a), where the about 33,000 nodes. First, we ran SybilInfer on this topol-number of attack edges is around 800. This analysis in- ogy without introducing any artiﬁcial attack. We found adicates that the security provided by SybilInfer primarily bottleneck cut diving off about 2, 000 Sybil nodes. It isdepends on the number of colluding entities. The implica- impossible to establish whether these nodes are false pos-tion is that the compromise of high degree nodes does not itives (a rate of 6%) or a real-world Sybil attack presentyield any signiﬁcant advantage to the adversary. As we in the LiveJournal network. Since there is no way to es-shall see, this is in contrast to SybilGuard and SybilLimit, tablish ground truth, we do not label these nodes as eitherwhich are extremely vulnerable when high degree nodes honest/dishonest.are compromised. Next, we consider a fraction f of the nodes to be com- Our next experiment establishes the number of Sybil promised and compute the optimal attacker strategy, as innodes that can be inserted into a network given different our experiments with synthetic data. Figure 5 shows thefractions of compromised nodes. We vary the fraction of fraction of malicious identities accepted by SybilInfer ascompromised colluding nodes f , and for each value of f , a function of fraction of malicious entitites in the system.we compute the optimal number of additional Sybil iden- The trend is similar to our observations on synthetic scaletities that the attackers can insert, as in the previous exper- free topologies. At f = 0.2, the fraction of Sybil identitiesiment. accepted by SybilInfer is approximately 0.32.
LiveJournal Topology: 31603 nodes Scale Free Topology: 1000 nodes 0.45 0.6 Fraction of malicious identities SybilLimit, uniform y=x SybilLimit, low degree 0.4 SybilInfer y=x 0.5 0.35 Fraction of total malicious entities Fraction of total malicious identities 0.3 0.4 0.25 0.3 0.2 0.15 0.2 0.1 0.1 0.05 0 0 0.05 0.1 0.15 0.2 0.25 0.3 0.006 0.008 0.01 0.012 0.014 0.016 0.018 0.02 Fraction of colluding entities Fraction of malicious entities (f) Figure 5. LiveJournal Topology: fraction of Figure 6. Comparison with related work total malicious identities as a function of real malicious entities f < 0.01. In this case SybilInfer allows for 5% total ma-4.5 Comparison with SybilLimit and Sybil- licious entities, while SybilLimit allows for over 50%. Guard This is an illustration that SybilInfer performs an or- SybilGuard  and SybilLimit  are state of the art der of magnitude better than the state of the art both indecentralized protocols that defend against Sybil attacks. terms of range of applicability and performance withinSimilar to SybilInfer, both protocols exploit the fact that a that range (SybilGuard’s performance is strictly worseSybil attack disrupts the fast mixing property of the social than SybilLimit’s performance, and is not illustrated.) Annetwork topology, albeit in a heuristic fashion. A brief obvious question is: “why does SybilInfer perform sooverview of the two systems can be found in the appendix, much better than SybilGuard and SybilLimit?” It is partic-and their full descriptions is given in [27, 26]. ularly pertinent since all three systems are making use of Figure 6 compares the performance of SybilInfer with the same assumptions, and a similar intuition, that therethe performance of SybilLimit. First it is worth noting should be a “gap” between the honest and Sybil regionsthat the fraction of compromised nodes that SybilLimit of a social network. The reason SybilLimit and Sybil-tolerates is only a small fraction of the range within which Guard provide weaker guarantees is that they interpretSybilInfer provide its guarantees. SybilLimit tolerates up these assumptions in a very sensitive way: they assumeto f = 0.02 compromised nodes when the degree of at- that an overwhelming majority of random walks staring intackers is low (about degree 5 – green line), while we have the honest region will stay in the honest region, and thenalready shown the performance of SybilLimit for compro- bound the number of walks originating from the Sybil re-mised fractions up to f = 0.35 in ﬁgure 4. Within the gion via the number of corrupt edges. As a result theyinterval SybilLimit is applicable, our system systemati- are very sensitive to the length of those walks, and cancally outperforms: when very few compromised nodes are only provide strong guarantees for a very small number ofpresent in the system (f = 0.01) our system only allows corrupt edges. Furthermore the validation procedure re-them to control less than 5% of the entities in the system, lies on collisions between honest nodes, via the birthdayversus SybilLimit that allows them to control over 30% paradox, which adds a further layer of inefﬁciency to esti-of entities (rendering insecure byzantine fault tolerance mating good from bad regions.mechanisms that requires at least 2/3 honest nodes) At thelimit of SybilLimit’s applicability range when f = 0.02, SybilInfer, on the other hand, interprets the disruptionour approach caps the number of dishonest entities in the in fast-mixing between the honest and dishonest regionsystem to fewer than 8%, while SybilLimit allows about simply as a faint bias in the last node of a short random50% dishonest entities. (This large fraction renders leader walk (as illustrated in ﬁgures 2(a) and 2(b).) In our ex-election or other voting systems ineffective.) periments, as well as in theory, we observe a very large An important difference between SybilInfer and Sybil- fraction of the T walks crossing between the honest andLimit is that the former is not sensitive to the degree of dishonest regions. Yet the faint difference in the probabil-the attacker nodes. SybilLimit provides very weak guar- ity of landing on nodes in the honest and dishonest regionsantees when high degree (e.g. degree 10 – red line) nodes is present, and the sampler makes use of it to get good cutsare compromised, and can protect the system only for between the honest and dishonest nodes.
4.6 Computational and time complexity O(|V |) per iteration, making the overall complexity of the scheme O(|V |2 log |V |). Two implementations of the SybilInfer sampler werebuild in Python and C++, of about 1KLOC each. The 5 Deployment StrategiesPython implementation can handle 10K node networks,while the C++ implementation has handled up to 30K So far we presented an overview of the SybilInfer algo-node networks, returning results in seconds. rithm, as well as a theoretical and empirical evaluation of The implementation strategy for both samplers has fa- its performance when it comes to detecting Sybil nodes.voured a low time complexity over storage costs. The The core of the algorithm outperforms SybilGuard andcritical loop performs O(|V | · log |V |) Metropolis Hast- SybilLimit, and is applicable in settings beyond whichings iterations per sample returned, each only requiring the two systems provide no security guarantees whatso-about O(log |V |) operations. Two copies of the full state ever. Yet a key difference between the previous systemsare stored, as well as associated data that allows for fast and SybilInfer is the latter’s reliance on the full friendshipupdating of the state, which requires O(|V |) storage. The graph to perform the random walks that drive the infer-transcript of evidence traces T is also stored, as well as an ence engine. In this section we discuss how this constraintindex over it, which dominates the storage required and still allows SybilInfer to be used for important classes ofmakes it order O(|V | · log |V |). applications, as well as how it can be relaxed to accom- There is a serious time complexity penalty associated modate peer-to-peer systems with limited resources perwith implementing non-naive sampling strategies. Our client.Python implementation biases the candidate moves to-wards nodes that are more or less likely to be part of the 5.1 Full social graph knowledgehonest set. Yet exactly sampling nodes from this knownprobability distribution, naively may raise the cost of each The most straightforward way of applying SybilInferiteration to be O(|V |). Depending on the differential be- is using the full graph of a social network to infer whichtween the highest and lowest probabilities, faster sampling nodes are honest and which nodes are Sybils, given atechniques like rejection sampling  can be used to known honest seed node. This is applicable to centralisedbring the cost down. The Python implementation uses a on-line services, like free email hosting services, bloggingvariant of Metropolis-Hastings to implement selection of sites, and discussion forums that want to deter spammers.candidate nodes for the next move, at a computation cost Today those systems use a mixture of CAPTCHA of O(log |V |). The C++ implementation uses naive sam- and network based intrusion detection to eliminate masspling from the honest or dishonest sets, and has a very low attacks. SybilInfer could be used to either complementcost per iteration of order O(1). those mechanisms and provide additional information as The Markov chain sampling techniques used consider to which identities are suspicious, or replace those sy-sequences of states that are very close to each other, dif- stems when they are expensive and error prone. One offering at most by a single node. This enables a key the ﬁrst social network based Sybil defence systems, Ad-optimization, where the counts NXX , NX X , NXX and ¯ ¯ vogato , worked in such a centralized fashion.NX X are stored for each state and updated when the state ¯ ¯ The need to know the social graph does not precludechanges. This simple variant of self-adjusting computa- the use of SybilInfer in distributed or even peer-to-peer sy-tion , allows for very fast computations of the proba- stems. Social networks, once mature, are generally stablebilities associated with each state. Updating the counts, and do not change much over time. Their rate of changeand associated information is an order O(log |V |) opera- is by no means comparable to the churn of nodes in thetion. The alternative, of recounting these quantities from network, and as a result the structure of the social networkT would cost O(|V | log |V |) for every iteration, leading could be stored and used to perform inference on multipleto a total computational complexity for our algorithm of nodes in a network, along with a mechanisms to share oc-O((|V | log |V |)2 ). Hence implementing it is vital to get- casional updates. The storage overhead for storing largeting results fast. social networks is surprisingly low: a large social network Finally our implementations use a zero-copy strategy with 10 billion nodes (roughly the population of planetfor the state. Two states and all associated information are earth) with each node having about 1000 friends, can bemaintained at any time, the current state and the candi- stored in about 187Gb of disk space uncompressed. Indate state. Operations on the candidate state can be done such settings it is likely that SybilInfer computation willand undone in O(log |V |) per operation. Accepted moves be the bottleneck, rather than storage of the graph, for thecan be committed to the current state at the same cost. foreseeable future.These operations can be used to maintain the two states A key application of Sybil defences is to ensure thatsynchronised for use by the Metropolis-Hastings sampler. volunteer relays in anonymous communication networksThe naive strategy of re-writing the full state would cost belong to independent entities, and are not controlled
by a single adversary. Practical systems like Mixmas- tain community of interest (a social club, a committee, ater , Mixminion  and Tor  operate such a volun- town, etc.) can be extracted to form a sub-graph. SybilIn-teer based anonymity infrastructure, that are very suscep- fer can then be applied on this partial view of the graph,tible to Sybil attacks. Extending such an infrastructure to to detect nodes that are less well integrated than others inuse SybilInfer is an easy task: each relay in the system the group. There is an important distinction between us-would have to indicate to the central directory services ing SybilInfer in this mode or using it with the full graph:which other nodes it considers honest and non-colluding. while the results using the full graph output an “absolute”The graph of nodes and mutual trust relations can be used probability for each node being a Sybil, applying SybilIn-to run SybilInfer centrally by the directory service, or by fer to a partial view of the network only provides a “rel-each individual node that wishes to use the anonymizing ative” probability the node is honest in that context. It isservice. Currently, the largest of those services, the Tor likely that nodes are tagged as Sybils, because they do notnetwork has about 2000 nodes, which is well within the have many contacts within the select group, which givencomputation capabilities of our implementations. the full graph would be classiﬁed as honest. Before ap- plying SybilInfer in this mode it is important to assess, at5.2 Partial social graph knowledge least, whether the subgroup is fast-mixing or not. SybilInfer can be used to detect and prevent Sybil at- 5.3 Using SybilInfer output optimallytacks, using only a partial view of the social graph. In thecontext of a distributed or peer-to-peer system each user Unlike previous systems the output of the SybilInferdiscovers only a ﬁxed diameter sub-graph around them. algorithm is a probabilistic statement, or even more gen-For example a user may choose to retrieve and store all erally, a set of samples that allows probabilistic statementsother users two or three hops away in the social network to be estimated. So far in the work we discussed how tograph, or discover a certain threshold of nodes in a breadth make inferences about the marginal probability speciﬁcﬁrst manner. SybilInfer is then applied on the extracted nodes are honest of dishonest by using the returned sam-sub-graph to detect potential Sybil regions. This allows ples to compute Pr[i is honest] for all nodes i. In our ex-the user to prune its social neighbourhood from any Sybil periments we applied a 0.5 threshold to the probability toattacks, and is sufﬁcient for selecting a set of honest nodes classify nodes as honest or dishonest. This is a rather lim-when sampling from the full network is not required. Dis- ited use of the richer output that SybilInfer provides.tributed backup and storage, and all friend and friend- Distributed system applications can, instead of usingof-friend based sharing protocols can beneﬁt from such marginal probabilities of individual nodes, estimate theprotection. The storage and communication cost of this probability that the particular security guarantees they re-scheme is constant and relative to the number of nodes in quire hold. High latency anonymous communication sy-the chosen neighbourhood. stems, for example, require a set of different nodes such In cases where nodes can afford to know a larger frac- that with high probability at least one of them is honest.tion of the social graph, they could choose to discover Path selection is also subject to other constraints (like la-O(c· |V |) nodes in their neighbourhood, for some small tency.) In this case the samples returned by SybilInferinteger c. This increases the chances two arbitrary nodes can be used to calculate exactly the sought probability, i.e.have to know a common node, that can perform the Sybil- the probability a single node in the chosen path is hon-Infer protocol and act as an introduction point for the est. Onion routing based system, on the other hand arenodes. In this protocol Alice and Bob want to ensure the secure as long as the ﬁrst and last hop of the relayed com-other party is not a Sybil. They ﬁnd a node C that is in the munication is honest. As before, the samples returned byc · |V neighbourhood of both of them, and each make SybilInfer can be used to choose a path that has a highsure that with high probability it is honest. They then con- probability to exhibit this characteristic.tact node C that attests to both of them, given its local Other distributed applications, like peer-to-peer storagerun of the SybilInfer engine, that they are not Sybil nodes and retrieval have similar needs, but also tunable param-(with C as the honest seed.) This protocol introduces a eters that depend on the probability of a node being dis-single layer of transitive trust, and therefore it is neces- honest. Storage systems like OceanStore, use Rabin’s in-sary for Alice and Bob to be quite certain that C is indeed formation dispersion algorithm to divide ﬁles into chunkshonest. Its storage and communication cost is O( |V |), stored and retrieved to reconstruct a ﬁle. The degree ofwhich is the same order of magnitude as SybilLimit and redundancy required crucially depends on the probabilitySybilGuard. Modifying this simple minded protocol into nodes are compromised. Such algorithms can use SybilIn-a fully ﬂedged one-hop distributed hash table  is an fer to foil Sybil attacks, and calculate the probability theinteresting challenge for future work. set of nodes to be used to store particular ﬁles contains SybilInfer can also be applied to speciﬁc on-line com- certain fractions of honest nodes. This probability can inmunities. In such cases a set of nodes belonging to a cer- turn inform the choice of parameters to maximise the sur-
vivability of the ﬁles. approaches that treat user statements beyond just black Finally a note of warning should accompany any Sybil and white and make explicit use of probabilistic reasoningprevention scheme: it is not the goal of SybilInfer (or any and statements as their outputs will become increasinglyother such scheme) to ensure that all adversary nodes are important in building safe systems.ﬁltered out of the network. The job of SybilInfer is toensure that a certain fraction of existing adversary nodes Acknowledgements. This work was performed whilecannot signiﬁcantly increase its control of the system by Prateek Mittal was an intern at Microsoft Research, Cam-introducing ‘fake’ Sybil identities. As it is illustrated by bridge, UK. The authors would like to thank Carmelathe examples on anonymous communications and stor- Troncoso, Emilia K¨ sper, Chris Lesniewski-Laas, Nikita aage, system speciﬁc mechanisms are still crucial to ensure Borisov and Steven Murdoch for their helpful commentsthat a minority of adversary entities cannot compromise on the research and draft manuscript. Our shepherd, Ta-any security properties. SybilInfer can only ensure that dayoshi Kohno, and ISOC NDSS 2009 reviewers pro-this minority remains a minority and cannot artiﬁcially in- vided valuable feedback to improve this work. Barry Law-crease its share of the network. son was very helpful with the technical preparation of the Sybil defence schemes are also bound to contain false- ﬁnal manuscript.positives, namely honest nodes labeled as Sybils. For thisreason other mechanisms need to be in place to ensure that Referencesthose users can seek a remedy to the automatic classiﬁca-tion they suffered from the system, potentially by making  U. A. Acar. Self-adjusting computation. PhD thesis, Pitts-some additional effort. Proofs-of-work, social introduc- burgh, PA, USA, 2005. Co-Chair-Guy Blelloch and Co-tion services, or even payment targeting those users could Chair-Robert Harper.be a way of ensuring SybilInfer is not turned into an auto-  B. Awerbuch. Optimal distributed algorithms for mini-mated social exclusion mechanism. mum weight spanning tree, counting, leader election and related problems (detailed summary). In STOC, pages6 Conclusion 230–240. ACM, 1987.  M. Castro, P. Druschel, A. Ganesh, A. Rowstron, and D. S. Wallach. Secure routing for structured peer-to-peer over- We presented SybilInfer, an algorithm aimed at detect- lay networks. In OSDI ’02: Proceedings of the 5th sym-ing Sybil attacks against peer-to-peer networks or open posium on Operating systems design and implementation,services, and label which nodes are honest and which are pages 299–314, New York, NY, USA, 2002. ACM.dishonest. Its applicability and performance in this task is  F. Dabek. A cooperative ﬁle system. Master’s thesis, MIT,an order of magnitude better than previous systems mak- September 2001.  G. Danezis, R. Dingledine, and N. Mathewson. Mixmin-ing similar assumptions, like SybilGuard and SybilLimit, ion: Design of a Type III Anonymous Remailer Protocol.even though it requires nodes to know a substantial part of In Proceedings of the 2003 IEEE Symposium on Securitythe social structure within which honest nodes are embed- and Privacy, pages 2–15, May 2003.ded. SybilInfer illustrates how robust Sybil defences can  G. Danezis, C. Lesniewski-Laas, M. F. Kaashoek, andbe bootstrapped from distributed trust judgements, instead R. Anderson. Sybil-resistant dht routing. In ESORICSof a centralised identity scheme. This is a key enabler for 2005: Proceedings of the European Symp. Research insecure peer-to-peer architectures as well as collaborative Computer Security, pages 305–318, 2005.web 2.0 applications.  R. Dingledine, N. Mathewson, and P. Syverson. Tor: the SybilInfer is also signiﬁcant due to the use of machine second-generation onion router. In SSYM’04: Proceed-learning techniques and their careful application to a secu- ings of the 13th conference on USENIX Security Sympo- sium, pages 21–21, Berkeley, CA, USA, 2004. USENIXrity problem. Cross disciplinary designs are a challenge, Association.and applying probabilistic techniques to system defence  J. R. Douceur. The sybil attack. In IPTPS ’01: Re-should not be at the expense of strength of protection, and vised Papers from the First International Workshop onstrategy-proof designs. Our ability to demonstrate that the Peer-to-Peer Systems, pages 251–260, London, UK, 2002.underlying mechanisms behind SybilInfer is not suscepti- Springer-Verlag.ble to gaming by an adversary arranging its Sybil nodes in  L. Goodman. Snowball sampling. Annals of Mathematicala particular topology is, in this aspect, a very import part Statistics, 32:148–170.of the SybilInfer security design.  W. K. Hastings. Monte carlo sampling methods us- Yet machine learning techniques that take explicitly ing markov chains and their applications. Biometrika, 57(1):97–109, April 1970.into account noise and incomplete information, as the one  R. Kannan, S. Vempala, and A. Vetta. On clusterings:contained in the social graphs, are key to building secu- Good, bad and spectral. J. ACM, 51(3):497–515, 2004.rity systems that degrade well when theoretical guarantees  L. Lamport, R. Shostak, and M. Pease. The byzantineare not exactly matching a messy reality. As security in- generals problem. ACM Trans. Program. Lang. Syst.,creasingly becomes a “people” problem, it is likely that 4(3):382–401, 1982.
 C. Lesniewski-Laas. A sybil-proof one-hop dht. In Pro- a given honest network size n, the mixing time of the so- ceedings of the Workshop on Social Network Systems, cial network is O(log√ it sufﬁces to use a long random n), Glasgow, Scotland, April 2008. walk of length w = n · log n to gather those samples. R. Levien and A. Aiken. Attack-resistant trust metrics To prevent active attacks biasing the samples, SybilGuard for public key certiﬁcation. In SSYM’98: Proceedings of performs the random walk over constrained random route. the 7th conference on USENIX Security Symposium, pages 18–18, Berkeley, CA, USA, 1998. USENIX Association. Random routes require each node to use a pre-computed D. J. C. MacKay. Information Theory, Inference & Learn- random permutation as a one to one mapping from incom- ing Algorithms. Cambridge University Press, New York, ing edges to outgoing edges giving them the following im- NY, USA, 2002. portant properties: S. Milgram. The small world problem. 2:60–67, 1967. U. M¨ ller, L. Cottrell, P. Palfrader, and L. Sassaman. Mix- o • Convergence: Two random routes entering an honest master Protocol — Version 2. IETF Internet Draft, July node along the same edge will always exit along the 2003. same edge. S. Nagaraja. Anonymity in the wild: Mixes on unstruc- • Back-traceability: The outgoing edge of a random tured networks. In N. Borisov and P. Golle, editors, Pro- ceedings of the Seventh Workshop on Privacy Enhancing route uniquely determines the incoming edge at an Technologies (PET 2007), Ottawa, Canada, June 2007. honest node. Springer. Since the size of the set of honest nodes, n is unknown, D. J. Phillips. Defending the boundaries: Identifying and SybilGuard requires an estimation technique to ﬁgure the countering threats in a usenet newsgroup. Inf. Soc., 12(1), needed length of the random route. 1996. D. Randall. Rapidly mixing markov chains with appli- The validation criterion for accepting a node as hon- cations in computer science and physics. Computing in est in SybilGuard is that there should be an intersection Science and Engineering, 8(2):30–41, 2006. between the random route of the veriﬁer node and the sus- M. Ripeanu, I. Foster, and A. Iamnitchi. Mapping the pect node. It can be shown using the birthday paradox √ Gnutella Network: Properties of Large-Scale Peer-to-Peer that if two honest nodes are able to obtain n samples Systems and Implications for System Design. IEEE Inter- from the honest region, then their samples will have an net Computing Journal, 6(1), Aug. 2002. non empty intersection with high probability, and will thus Secondlife: secondlife.com. Sophos. Sophos facebook id probe shows 41% of users be able to validate each other. happy to reveal all to potential identity thieves, August 14 There are cases when the random route of an honest 2007. node ends up within the Sybil region, leading to a “bad” I. Stoica, R. Morris, D. Liben-Nowell, D. R. Karger, M. F. sample, and the possibility of accepting Sybil nodes as Kaashoek, F. Dabek, and H. Balakrishnan. Chord: a scal- honest. Thus, SybilGuard is only able to provide bounds able peer-to-peer lookup protocol for internet applications. on the number of Sybil identities if such an event is rare, IEEE/ACM Trans. Netw., 11(1):17–32, 2003. which translates into an assumption that the maximum L. von Ahn, M. Blum, N. Hopper, and J. Langford. √ Captcha: Using hard ai problems for security. In In Pro- number of attack edges is g = o( logn ). To further reduce n ceedings of EUROCRYPT 03, Lecture Notes in Computer the effects of random routes entering the Sybil region (bad Science, 2003. samples), nodes in SybilGuard can perform random routes H. Yu, P. B. Gibbons, M. Kaminsky, and F. Xiao. Sybil- along all their edges and validate a node only if a major- limit: A near-optimal social network defense against sybil ity of these random routes have an intersection with the attacks. In IEEE Symposium on Security and Privacy, random routes of the suspect. pages 3–17, 2008. SybilGuard’s security really depends on the number of H. Yu, M. Kaminsky, P. B. Gibbons, and A. Flaxman. Sybilguard: defending against sybil attacks via social net- attack edges in the system, connecting honest and dishon- works. SIGCOMM Comput. Commun. Rev., 36(4):267– est users. To intersect a veriﬁer’s random route, a sybil 278, 2006. node’s random route must traverse an attack edge (say A). del.icio.us: delicious.com. Due to the convergence property, the random routes of all Facebook: www.facebook.com. sybil nodes traversing A will intersect the veriﬁer’s ran- Orkut: www.orkut.com. dom route at the same node and along the same edge. All Wikipedia: www.wikipedia.org. such nodes form an equivalence group from the veriﬁer’s perspective. Thus, the number of sybil groups is boundedA An overview of SybilGuard and Sybil- by the number of attack edges, i.e. g. Moreover, due to Limit the back-traceability property, there can be at most w dis- tinct routes that intersect the veriﬁers route at a particularA.1 SybilGuard node and a particular edge. Thus, there is a bound on the √ size of the equivalence groups. To sum up, SybilGuard In SybilGuard, each node ﬁrst obtains n independent divides the accepted nodes into equivalence groups, withsamples from the set of honest nodes of size n. Since, for the guarantee that there are at most g sybil groups whose
maximum size is w. sybil identities. Unlike SybilInfer, nodes in SybilGuard do not require Assuming there was a way to estimate all parame-knowledge of the complete network topology. On the ter required by SybilLimit, our proposal, SybilInfer, pro-other hand, the bounds provided by SybilGuard are quite vides an order of magnitude better guarantees. Further-weak: in a million node topology, SybilGuard accepts more these guarantees relate to the number of (real) dis-about 2000 Sybil identities per attack edge! (Attack honest entities in the system, unlike SybilLimit that de-edges are trust relations between an honest and a dishon- pends on number of attack edges. As noted, in com-est node.) Since the bounds provided by SybilGuard de- parison with SybilGuard, SybilInfer does not assume anypend on the number of attack edges, high degree nodes threshold on the number of colluding entities, while Sybil-would be attractive targets for the attackers. To use the Limit can bound the number of sybil identities only when nsame example, in a million node topology, the compro- g = o( log n ).mise of about 3 high degree nodes with about 100 attackedges each enables the adversary to control more than 1/3of all identities in the system, and thus prevent honestnodes from reaching byzantine consensus. In contrast, thebounds provided by SybilInfer depend on the number ofcolluding entities in the social network, and not on thenumber of attack edges. Lastly, SybilGuard is only√ableto provide bounds on Sybil identities when g = o( logn ), nwhile SybilInfer is not bound by any such threshold on thenumber of colluding entities.A.2 SybilLimit In contrast to SybilGuard’s methodology of using asingle instance of a very long random route, SybilLimit √employs multiple instances ( m) of short random walks(O(log n)) to sample nodes from the honest set, where mdenotes the number of edges amongst the honest nodes. It ncan be shown that as long as g = o( log n ), then with highprobability, a short random walk is likely to stay withinthe set of honest nodes, i.e., the probability of a samplebeing “bad” is small. The validation criterion for accept-ing a node as honest in SybilLimit is that there should bean intersection amongst the last edge (i.e. the tail) of therandom routes of the veriﬁer and the suspect. Similar toSybilGuard, it can be shown using the birthday paradoxthat two honest nodes will validate each other with highprobability. Note that if a random route enters the sybilregion, then the malicious tail of that route could adver-tise intersections with many sybil identities. To counterthis attack, a further “balance” condition is imposed thatensures that a particular intersection cannot validate arbi-trary number of identities. Sybil identities can induce atotal of g · w · m intersections at honest tails. Thus, for ev-ery attack edge, SybilLimit accepts at most w = O(log n)sybil identities. A key shortcoming of SybilLimit is its reliance on thevalue of w = O(log n), the length of the short randomwalk, without any mechanisms to estimate it. Yet the cor-rect estimation of this parameter is essential to calculatethe security bounds of the scheme. Underestimating it isleads to an increase of false positives, while overestimat-ing it will result in the random walks ending in the Sybilregion, allowing the adversary to fake intersections with