On the Internet Delay Space Dimensionality

2,489 views

Published on

We investigate the dimensionality properties of the Internet delay space, i.e., the matrix of measured round-trip latencies between Internet hosts. Previous work on network coordinates has indicated that this matrix can be embedded,
with reasonably low distortion, into a 4- to 9-dimensional Euclidean space. The application of Principal Component Analysis (PCA) reveals the same dimensionality values. Our work addresses the question: to what extent is the dimensionality an intrinsic property of the delay space, defined without reference to a host metric such as Euclidean space? Is the intrinsic dimensionality of the Internet delay space
approximately equal to the dimension determined using embedding techniques or PCA? If not, what explains the discrepancy? What properties of the network contribute to
its overall dimensionality? Using datasets obtained via the King [14] method, we study different measures of dimensionality to establish the following conclusions. First, based on its power-law behavior, the structure of the delay space can be better characterized by fractal measures. Second, the intrinsic dimension is significantly smaller than the value
predicted by the previous studies; in fact by our measures it is less than 2. Third, we demonstrate a particular way in which the AS topology is reflected in the delay space; subnetworks composed of hosts which share an upstream Tier-1 autonomous system in common possess lower dimensionality than the combined delay space. Finally, we observe that
fractal measures, due to their sensitivity to non-linear structures, display higher precision for measuring the influence of subtle features of the delay space geometry.

Published in: Education, Technology, Business
0 Comments
0 Likes
Statistics
Notes
  • Be the first to comment

  • Be the first to like this

No Downloads
Views
Total views
2,489
On SlideShare
0
From Embeds
0
Number of Embeds
767
Actions
Shares
0
Downloads
7
Comments
0
Likes
0
Embeds 0
No embeds

No notes for slide
  • In this work, we view the Internet as a metric space, where the metric is defined as the round-trip time to send a packet between two hosts.  One of the most important properties that characterizes a metric space is its dimensionality.  Properly estimating the dimensionality of a metric space is crucial for characterizing its structure and designing algorithms based on that space: if one's estimate of the dimension is too low, it is impossible to find an embedding that preserves distances; if the estimate is too high then we make the algorithms inefficient as we run into the so-called curse of dimensionality: there are too many degrees of freedom to explore and the algorithms get lost.  A major application of this abstraction is a coordinate-based positioning system. These systems aim at mapping the network into a metric space in such a way that the geometric distances estimates the real latency with low degree of error. For such systems, the dimensionality of the target space is a tunable parameter. In addition the dimensionality is known to affect the accuracy of the predictions, the stability of coordinates over time and the time to converge to stable coordinates. Our uses a latency matrix obtained via the King method, which is a convenient way Of measuring the latency between nameservers without having login access to them, to answer two fundamental questions: What is the dimensionality of the Internet delay space? And… What forces contribute to its geometric properties? Apart from its implications to the performance of coordinate systems, this characterization is by itself a topic of practical interest as it uncovers properties and opens new questions on the nature and complexity of the network.
  • Measurement-based positioning systems implement the same functionality as the coordinate-based counterparts by performing measurements in a careful way.
  • In this work, we view the Internet as a metric space, where the metric is defined as the round-trip time to send a packet between two hosts.  One of the most important properties that characterizes a metric space is its dimensionality.  Properly estimating the dimensionality of a metric space is crucial for characterizing its structure and designing algorithms based on that space: if one's estimate of the dimension is too low, it is impossible to find an embedding that preserves distances; if the estimate is too high then we make the algorithms inefficient as we run into the so-called curse of dimensionality: there are too many degrees of freedom to explore and the algorithms get lost.  A major application of this abstraction is a coordinate-based positioning system. These systems aim at mapping the network into a metric space in such a way that the geometric distances estimates the real latency with low degree of error. For such systems, the dimensionality of the target space is a tunable parameter. In addition the dimensionality is known to affect the accuracy of the predictions, the stability of coordinates over time and the time to converge to stable coordinates. Our uses a latency matrix obtained via the King method, which is a convenient way Of measuring the latency between nameservers without having login access to them, to answer two fundamental questions: What is the dimensionality of the Internet delay space? And… What forces contribute to its geometric properties? Apart from its implications to the performance of coordinate systems, this characterization is by itself a topic of practical interest as it uncovers properties and opens new questions on the nature and complexity of the network. Prior work: 5- to 9- dimensional Euclidean space with “reasonably” low distortion
  • data can be approximately embedded into a low-dimensional Euclidean space the distance matrix can be accurately approximated by a low-rank matrix
  • Dimensionality notions used in prior work often reflect the assumption that low-dimensional data can be approximately embedded in a low-dimensional Euclidean space or that the distance matrix can be accurately approximated by a low-rank matrix.  Thus, one could define the embedding dimension of a space by finding the lowest-dimensional Euclidean space that admits an embedding with adequate percentiles of relative error, or one could define the dimension by using Principal Component Analysis to identify the smallest value of k for which the distance matrix has a rank-k approximation with adequate relative error.  In these two plots we see the outcome of applying this process to the Internet measurements mentioned earlier.  The vertical bar in the graph on the left shows the relative error obtained when embedding the data into a Euclidean space of dimension 1,2,3, etc. The graph on the right shows the percent of variance explained by the first k principal components of the distance matrix for varying values of k. What are the problems with these methods? 1) The problem with the first approach is that the embedding algorithm might fail to produce the real value of dimensionality, since the curse of dimensionality comes into play: after 7 dimensions we observe higher percentiles of relative errors. 2) The problem with PCA is that, 2.1) as a method grounded in linear algebra, it is oblivious to non-linear relationships between the dimensions. For instance, if we try to estimate the surface of a sphere, it will indicate a 3 dimensional object whereas it is actually 2 dimensional. 2.2) As in the case of this plot it is not always clear where to establish the cutoff point beyond which the subsequent components explain only a negligible variance. To the extent that these methods estimate the dimensionality of the delay space, they indicate a value between 4 and 7.
  • Dimensionality notions used in prior work often reflect the assumption that low-dimensional data can be approximately embedded in a low-dimensional Euclidean space or that the distance matrix can be accurately approximated by a low-rank matrix.  Thus, one could define the embedding dimension of a space by finding the lowest-dimensional Euclidean space that admits an embedding with adequate percentiles of relative error, or one could define the dimension by using Principal Component Analysis to identify the smallest value of k for which the distance matrix has a rank-k approximation with adequate relative error.  In these two plots we see the outcome of applying this process to the Internet measurements mentioned earlier.  The vertical bar in the graph on the left shows the relative error obtained when embedding the data into a Euclidean space of dimension 1,2,3, etc. The graph on the right shows the percent of variance explained by the first k principal components of the distance matrix for varying values of k. What are the problems with these methods? 1) The problem with the first approach is that the embedding algorithm might fail to produce the real value of dimensionality, since the curse of dimensionality comes into play: after 7 dimensions we observe higher percentiles of relative errors. 2) The problem with PCA is that, 2.1) as a method grounded in linear algebra, it is oblivious to non-linear relationships between the dimensions. For instance, if we try to estimate the surface of a sphere, it will indicate a 3 dimensional object whereas it is actually 2 dimensional. 2.2) As in the case of this plot it is not always clear where to establish the cutoff point beyond which the subsequent components explain only a negligible variance. To the extent that these methods estimate the dimensionality of the delay space, they indicate a value between 4 and 7.
  • If the measured distances reflect a metric other than Euclidean distance Algorithm fails to produce low-distortion embedding in any dimension! e.g., hilly terrain
  • In our work, we explore the structural and statistical properties of the Internet delay space in order to better characterize its dimensionality. For instance, metric spaces that exhibit power-law behavior can be measured using fractal dimensions Which is a intrinsic measure that work without a reference to a external metric space. One of the way from which the power law behavior arises in the delay space is When we plot in logscale the number of nodes (y-axis) that are within a given distance (x-axis). 1) The first striking feature of this plot is a power law that persists over two orders of magnitude. Datasets that display a property like that are said to behavior like a fractal. 2) Another interesting observation is that this range of distances include all RTT between 2 and 100ms, Which in turn include all non-oceanic distances. 3) When we observe this behavior, we can measure the intrisic dimensionality of the dataset as the power-law exponent. The other surprising finding is that the intrinsic dimensionality measured by this method is so much lower than what can be estimated by previous methods!
  • Is the dimensionality value less than 2D due to the fact that the Internet lives on the surface of the Earth? Is the surface of the Earth 0.9-dimensional?
  • Another way in which our work illuminates the structure of the delay space is by revealing that the delay space is not homogeneously 1.8-dimensional but is made up of a small number of low-dimensional pieces. One might expect the subnetworks to be simper because the decomposition eliminated inefficient routes that go from a network, up to its Tier-1 provider, over to another Tier-1 provider down to its customer. Or perhaps this decomposition just decomposed the network into subsnetworks of equal complexity. I ’ll now show that that’s not the case. Upon decomposing the delay space into overlapping pieces, each corresponding to A Tier-1 AS and its downstream customers, we observe that each individual piece Has dimensionality 10% lower than the combined delay space. This dimensionality shift cannot be achieved by other kinds of decompositions, namely By decomposing into pieces of low diameter, or clustered geographically, or randomly selecting a set with lower cardinality. Another interesting finding is that this dimensionality shift can only be detected by fractal measures. The embedding dimension and PCA are oblivious to it. This also demonstrates the power and applicability of the fractal measures. Disproportionate to the rest of the links What is the geometric effect of analyzing each Tier-1 network, together with its downstream customers in isolation Expect better behaved pieces
  • So far, this presentation presented no evidence that the this study may lead to better network embeddings. In fact, the evidence so far suggests the opposite. However, now we ’ll see that it does lead to better non-linear embeddings.
  • On the Internet Delay Space Dimensionality

    1. 1. InetDim: Characterizing the Internet Delay Space Dimensionality Bruno Abrahao Robert Kleinberg Cornell University
    2. 2. <ul><li>Develop a geometric framework for exploring properties of networks </li></ul><ul><li>Understanding allows us to make predictions on the consequences of the growth and evolution of the network and guide the design of distributed systems. </li></ul>InetDim Project
    3. 3. <ul><li>Concerned with properties of the Internet that contribute to the overall dimensionality of its latency space. </li></ul><ul><li>B. Abrahao and R. Kleinberg, On the Internet Delay Space Dimensionality , In Proc. of ACM/SIGCOMM Internet Measurement Conference (IMC 2008) </li></ul>Current Investigation
    4. 4. <ul><li>Internet Delay Space </li></ul><ul><ul><li>matrix of all-pairs round trip times between Internet hosts </li></ul></ul><ul><li>Dimensionality </li></ul><ul><ul><li>We ’ll consider several definitions in this talk </li></ul></ul><ul><ul><li>Value which abstracts notion of network complexity </li></ul></ul>Definitions
    5. 5. <ul><li>Why to study the Internet delay space dimensionality? </li></ul>Question
    6. 6. <ul><li>Network embedding (GNP, Vivaldi) </li></ul><ul><ul><li>Models the network as contained in a vector space </li></ul></ul><ul><ul><li>Estimate real distances with geometric distances </li></ul></ul><ul><li>Elegant, compact, relative success </li></ul><ul><li>BUT they suffer from </li></ul><ul><ul><li>inherent embedding distortion </li></ul></ul><ul><ul><li>disappointing accuracy [Lua et al ., IMC ’05][Ledlie-Gardner-Seltzer, NSDI’07] </li></ul></ul>Coordinate-based positioning systems
    7. 7. <ul><li>Meridian [Wong et al. SIGCOMM ’04] </li></ul><ul><li>Iplane [Madhyastha et al. OSDI ’06] </li></ul><ul><li>Relatively more accurate </li></ul><ul><li>BUT </li></ul><ul><ul><li>Measurement intensive </li></ul></ul><ul><ul><li>Strong scaling assumptions </li></ul></ul>Measurement based positioning systems
    8. 8. <ul><li>Dimensionality of target space is a tunable parameter </li></ul><ul><ul><li>Critical parameter influencing accuracy, convergence, stability [Dabek et al ., SIGCOMM ’04] [Ledlie-Gardner-Seltzer, NSDI’07] </li></ul></ul><ul><li>Question : What is the optimal value? </li></ul><ul><ul><li>too low: high distortion </li></ul></ul><ul><ul><li>too high: inefficiency </li></ul></ul>Motivation 1: Coordinate-based positioning systems
    9. 9. <ul><ul><li>Prior work : delay space can be embedded into an 5- to 9-dimensional Euclidean space with “reasonably” low distortion [Ng-Zhang’01; Tang-Crovella’03] </li></ul></ul><ul><ul><li>Is this value optimal? </li></ul></ul><ul><li>Invariant with scaling? </li></ul><ul><ul><li>For worst-case metric space, dimensionality increases logarithmic with the cardinality of the metric [Bourgain 75] </li></ul></ul>Motivation 1: Coordinate-based positioning systems
    10. 10. <ul><li>Scaling assumptions underlying Meridian </li></ul><ul><ul><li>Latency space is a metric of bounded doubling dimension D </li></ul></ul>Motivation 2: Measurement-based positining systems
    11. 11. <ul><li>Question 1 : What properties of the Internet contribute to its dimensionality? And to what extent? </li></ul><ul><li>Question 2 : Is the Internet delay space dimensionality homogeneous or it is made up of many lower-dimensional pieces? </li></ul><ul><ul><li>Hierarchical embedding [Zhang ‘06] </li></ul></ul>Motivation 3: Internet properties
    12. 12. <ul><li>How to generate synthetic realistic delay spaces? </li></ul><ul><li>Previous work [Zhang, IMC ’06] </li></ul><ul><ul><li>Statistical properties preserved </li></ul></ul><ul><li>Generative models? </li></ul>Motivation 4: Synthetic delay space generation
    13. 13. <ul><li>Part I : Studies different methods for characterizing the Internet delay space geometry </li></ul><ul><li>Part II : Demonstrates how to use these tools for predicting dimensionality shifts cause by structural properties </li></ul>Outline
    14. 14. <ul><li>Uses raw measurements collected via the King method by Meridian (5200 hosts) and P2PSim (1953 hosts) </li></ul><ul><ul><li>Filter out pairs with < 10 measurements total in both directions </li></ul></ul><ul><ul><li>Latency = median of measurements in both directions </li></ul></ul><ul><ul><li>Eliminate ambiguity which arises from missing values </li></ul></ul><ul><ul><li>Approximate the largest clique with the remaining pairs </li></ul></ul><ul><ul><li>After filtering </li></ul></ul><ul><ul><ul><li>Meridian : 2385 hosts </li></ul></ul></ul><ul><ul><ul><li>P2PSim : 298 hosts </li></ul></ul></ul>Datasets 1/2
    15. 15. <ul><li>Major limitation is availability of datasets </li></ul><ul><li>Other publicly available datasets: Dimes, Planetlab, DS2, King (Harvard) , … </li></ul><ul><ul><li>Small cliques, collected over long periods of time </li></ul></ul><ul><li>Combined with geolocation obtained by querying hostip.info </li></ul>Datasets 2/2
    16. 16. Meridian geolocation visualization
    17. 17. <ul><li>Question </li></ul><ul><li>How to estimate the Internet Delay Space dimensionality? </li></ul><ul><li>What methods were applied in prior work? What are their assumptions? </li></ul><ul><ul><li>data can be approximately embedded into a low-dimensional Euclidean space </li></ul></ul><ul><ul><li>the distance matrix can be accurately approximated by a low-rank matrix </li></ul></ul>Part I: Dimensionality measures
    18. 18. Embedding Dimension <ul><li>Benefit : recovers coordinates of points </li></ul><ul><li>Suggests a dimensionality value between 4 and 7 </li></ul><ul><li>The dimensionality can be estimated using an embedding algorithm, such as Vivaldi </li></ul>Meridian
    19. 19. Embedding Dimension Shortcomings (1/2) <ul><li>Embeddings using 8D and beyond are worse than 7D! </li></ul><ul><li>Curse of dimensionality : embedding algorithm is overwhelmed with so many degrees of freedom </li></ul>Meridian
    20. 20. <ul><li>Existing network embedding algorithms are suboptimal and slow to converge! </li></ul><ul><li>Finding an embedding that minimizes distortion is intractable [Matoušek-Sidiropoulos ‘08] </li></ul><ul><li>Fails if the measured distances reflect a metric other than Euclidean distance </li></ul><ul><li>Algorithm often produces lots of empty space , unnecessarily inflating the estimate of dimensionality </li></ul>Embedding Dimension Shortcomings (2/2)
    21. 21. <ul><li>Approximates the distance matrix by a low-rank matrix </li></ul><ul><li>Rotates the axes so that data variance is better captured by the components </li></ul>Principal Component Analysis Meridian
    22. 22. <ul><li>Assumes that the dimensions are independent </li></ul><ul><ul><li>e.g. surface of a sphere is reported as a 3D object </li></ul></ul><ul><li>The dimensionality cut-off is non-obvious to determine </li></ul>Principal Component Analysis ? ? ? Meridian
    23. 23. <ul><li>Question : How to avoid assuming a specific host metric space or assuming linearity? </li></ul><ul><li>If a dataset exhibits power-law behavior in its statistical or structural properties, one can model and measure its dimensionality using Fractal Measures </li></ul>Intrinsic notion of dimensionality
    24. 24. <ul><li>Power-law behavior over two decimal orders of magnitude </li></ul>Intrinsic dimensionality metrics <ul><ul><li>Correlation Fractal Dimension (pair-count plot) </li></ul></ul><ul><ul><li>Fractal measures indicate dimensionality values less than 2 </li></ul></ul>Meridian P2PSim Includes almost all intra-continental distances (usec)
    25. 25. Correlation Dimension <ul><li># samples within distance r is proportional to r </li></ul>x r
    26. 26. Correlation Dimension <ul><li># samples within distance r is proportional to </li></ul>x r
    27. 27. Sampling from a unit square
    28. 28. <ul><li>There is an infinite family of fractal dimensions, parameterized by q </li></ul><ul><li>Some practical dimensions are </li></ul><ul><ul><li>Hausdorff Dimension </li></ul></ul><ul><ul><li>Shannon ’s Entropy </li></ul></ul><ul><ul><li>Correlation Dimension </li></ul></ul><ul><li>For manifolds, i.e., spaces locally modeled on , the fractal dimension coincides with d </li></ul>Fractal Dimension Family Easy to measure Sensitive to non-linear behavior
    29. 29. <ul><li>How much does the embedded structure deviate from the original space? </li></ul>Correlation Dimension of Embedded Matrix
    30. 30. <ul><li>Question : What doe the fractal behavior implies about the Internet? Is it self-similar? Recursive? </li></ul><ul><ul><li>[Li, Alderson, Willinger, Doyle, 2004] </li></ul></ul><ul><ul><li>[Mitzenmacher, 2004] </li></ul></ul><ul><li>Fractal behavior may also arise from non-recursive structures </li></ul><ul><ul><li>e.g., snowflakes, coastlines, surface of the human brain, etc. </li></ul></ul>Fractals and Internet Models
    31. 31. <ul><li>Questions </li></ul><ul><li>What features contribute to the delay space behavior? To what extent? </li></ul><ul><li>Why is it useful to study the Internet through the lens of the fractal measures? </li></ul>Part II: Dimensionality Components
    32. 32. <ul><li>Power-law extending over two orders of magnitude. </li></ul><ul><li>Geolocation does not account for the whole delay space dimensionality, although it is a strong component </li></ul>Geographic Location
    33. 33. <ul><li>Transit links between Tier-1 AS ’es contained on a significant fraction of Internet routes </li></ul><ul><li>Decomposition into overlapping subsets, each rooted at a Tier-1 network </li></ul>Dimensionality Reducing Decomposition <ul><li>Notice: not a partition, due to multihomed networks </li></ul><ul><li>Superposition of these pieces may inflate the dimensionality </li></ul>
    34. 34. <ul><li>Approximation to ground truth decomposition using a snapshot of the inferred topology [Oliveira et al., SIGMETRICS ’08] </li></ul>Dimensionality Reducing Decomposition
    35. 35. <ul><li>We measured each piece separately using </li></ul><ul><ul><li>Embedding dimension </li></ul></ul><ul><ul><li>PCA </li></ul></ul><ul><ul><li>Correlation dimension </li></ul></ul><ul><ul><li>Hausdorff dimension (see paper) </li></ul></ul>Dimensionality Reducing Decomposition
    36. 36. Dimensionality Reducing Decomposition Results <ul><li>Embedding dimension and PCA were insensitive to decomposition </li></ul>
    37. 37. Dimensionality Reducing Decomposition Results <ul><li>Fractal measures capture the structural change and report a dimensionality reduction </li></ul><ul><li>Power-law is preserved over the same range for subsets, however, with different exponents </li></ul>
    38. 38. <ul><li>Is the reduction in dimensionality due to other side effects of the decomposition? </li></ul><ul><ul><li>Pieces of smaller diameter? </li></ul></ul><ul><ul><ul><li>Decompose the network according to geographical and latency-based clustering </li></ul></ul></ul>Sanity Check 1
    39. 39. <ul><li>Diverts the set from the power-law behavior </li></ul>Distance-based Clustering <ul><li>Not comparable to topology-based decomposition </li></ul>
    40. 40. <ul><li>Is the reduction in dimensionality due to other side effects of the decomposition? </li></ul><ul><ul><li>Pieces of smaller cardinality [Bourgain ’ 75] ? </li></ul></ul><ul><ul><ul><li>Generated 139 random subnetworks with smaller cardinality (i.e., the median of Tier-1 subnetwork the sizes) </li></ul></ul></ul>Sanity Check 2
    41. 41. Random subsets with lower cardinality <ul><li>Random subsets of lower cardinality do not explain the dimensionality reduction </li></ul>
    42. 42. <ul><li>Fractal tool reports a dimensionality shift which was not captured by Embedding Dimension or PCA </li></ul><ul><li>What explain the discrepancy? </li></ul>Dimensionality Paradox
    43. 43. <ul><li>Dimensionality reducing decomposition is real and not an artifact of the fractal methodology </li></ul>Isomap Space is rich in non-linearity
    44. 44. <ul><li>Dimensionality is critical aspect influencing performance of algorithms </li></ul><ul><li>Fractal nature and non-linear behavior: Internet delay space dimensionality better characterized by fractal measures </li></ul><ul><ul><li>Lightweight </li></ul></ul><ul><ul><li>Sensitive to Internet features and structural changes </li></ul></ul><ul><li>Properties influencing dimensionality </li></ul><ul><li>Internet delay space dimensionality is not homogeneous </li></ul>Conclusion
    45. 45. <ul><li>Inetdim Project Website </li></ul><ul><ul><li>King datasets annotated with IP addresses, code, more info </li></ul></ul><ul><ul><li>http://www.cs.cornell.edu/~abrahao/inetdim </li></ul></ul>Further Info

    ×