Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)

The first part of my lectures will be devoted to the design of practical algorithms for very large graphs. The second part will be devoted to algorithms resilient to memory errors. Modern memory devices may suffer from faults, where some bits may arbitrarily flip and corrupt the values of the affected memory cells. The appearance of such faults may seriously compromise the correctness and performance of computations, and the larger is the memory usage the higher is the probability to incur into memory errors. In recent years, many algorithms for computing in the presence of memory faults have been introduced in the literature: in particular, an algorithm or a data structure is called resilient if it is able to work correctly on the set of uncorrupted values. This part will cover recent work on resilient algorithms and data structures.

Published in: Education, Technology
  1. 1. Original Plan 1. Algorithms for BIG graphs • The centrality of centrality • How to store BIG Graphs (WebGraph Framework) • Four Degrees of Separation • Diameter and Radius 2. Big Data and Memory Errors
  2. 2. Slightly Revised Plan 1. Algorithms for BIG graphs • The centrality of centrality • Four Degrees of Separation • Diameter (no Radius) • How to store BIG Graphs (WebGraph Framework) 2. Big Data and Memory Errors
  3. 3. Four Degrees of Separation
  4. 4. Literature • Frigyes Karinthy, in his 1929 short story “Láncszemek” (“Chains”) suggested that any two persons are distanced by at most six friendship links • Just an (optimistic) positivistic statement about combinatorial explosion • Used by John Guare in his 1990 eponymous play (and 1993 movie by Fred Schepisi)
  5. 5. The Sociologists • M. Kochen, I. de Sola Pool: Contacts and influences. (Manuscript, early 50s) • A. Rapoport, W.J. Horvath: A study of a large sociogram. (Behav. Sci., 1961) • S. Milgram, An experimental study of the small world problem. (Sociometry, 1969)…
  6. 6. Milgram’s question • “Given two individuals selected randomly from the population, what is the probability that the minimum number of intermediaries required to link them is 0, 1, 2, . . . , k?”
  7. 7. Milgram’s question • What is the distance distribution of the acquaintance graph? – how many pairs are friends? – how many are not friends but have a friend in common? – … • Note on “distance”: – sociologists measure the degrees of separation – as computer scientists we measure the graph-theoretic distance (just add one)
  8. 8. Milgram’s experiment 296 people (starting population) asked to dispatch a parcel to a single individual (target) • Target: a Boston stockbroker • Starting population selected as follows: – 100 were random Boston inhabitants (group A) – 100 were random Nebraska stockholders (group B) – 96 were random Nebraska inhabitants (group C) • Rule of the game: parcels could be directly sent only to someone the sender knows personally (“first-name acquaintance”)
  9. 9. Milgram’s experiment
  10. 10. Milgram’s experiment • Actually completed 64 chains (22%) • 453 intermediaries happened to be involved in the experiments • Average distance of the completed chains was 6.2, ranging from 5.4 (Boston group) to 6.7 (random Nebraska inhabitants) • Distance 6.7 means 5.7 degrees of separation (thus six degrees of separation)
  11. 11. New Milgram’s experiment? • Can we reproduce Milgram’s experiment on a large scale? • How can one compute or approximate the distance distribution of a given huge friendship graph? (such as Facebook)
  12. 12. Graph distances and distribution • The distance d(x,y) is the length of the shortest path from x to y – d(x,y) = ∞ if one cannot go from x to y • For undirected graphs d(x,y) = d(y,x) • For every t, count the number of pairs (x,y) such that d(x,y) = t (distance distribution) • The fraction of pairs at distance t is (the density function of) a distribution
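The definition above admits a direct exact computation: one breadth-first search per source node. A minimal Python sketch (names and the adjacency-dict representation are illustrative assumptions, not part of the slides):

```python
from collections import deque

def distance_distribution(adj):
    """Exact distance distribution: one BFS per source node.

    adj maps each node to its list of successors. O(n*m) overall:
    fine for toy graphs, hopeless at Facebook scale (hence HyperANF)."""
    counts = {}
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        for d in dist.values():
            if d > 0:                       # count ordered pairs (src, y)
                counts[d] = counts.get(d, 0) + 1
    return counts
```

On the undirected 3-node path a–b–c this yields {1: 4, 2: 2}, counting ordered pairs.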
  13. 13. Previous experiments On-line social networks: • 6.6 degrees of separation on a one-month MSN Messenger communication graph (Leskovec and Horvitz [LH08]) – 180 M nodes and 1.3 G arcs • 3.67 degrees of separation in Twitter [KLPM10] – 5 G follows – Is this meaningful? In Twitter, links created without permission at both ends…
  14. 14. • 4.74 degrees of separation in Facebook (Backstrom et al. [Backstrom+12]) – 712 M people and 69 G friendship links
  15. 15. HyperANF • New tool for studying/computing the distance distribution of very large graphs • Diffusion-based approximated algorithm • Based on WebGraph (Boldi et al. [BV04]) and ANF [Palmer et al., 2002] • It uses HyperLogLog counters [Flajolet et al., 2007] and broadword programming for low-level parallelization
  16. 16. Computing Neighborhood Function • Neighborhood function N(t): for each t, number of pairs at distance ≤ t – provides data about how fast the “average ball” around each node expands • Many breadth-first visits: O(mn), need direct access • Sampling: a fraction of breadth-first visits, very unreliable results on graphs that are not strongly connected, needs direct access • Edith Cohen’s [JCSS 1997] size estimation framework: very powerful (strong theoretical guarantees) but does not scale really well, needs direct access
  17. 17. Alternative: Diffusion • Basic idea: Palmer et al. [PGF02] • Let Bt(x) be the ball of radius t around x (nodes at distance at most t from x) • Clearly B0(x) = {x} • But also Bt+1(x) = ∪x→y Bt(y) ∪ {x} • So we can compute balls by enumerating the arcs x→y and performing unions (on sets) • The neighborhood function at t is given by the sum of the sizes of the balls of radius t, i.e., |Bt(x1)| + |Bt(x2)| + … + |Bt(xn)|
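The diffusion recurrence can be sketched directly with exact Python sets; this is precisely the costly version that the following slides replace with probabilistic counters. Names and the adjacency-dict representation are illustrative assumptions:

```python
def neighborhood_function(adj, tmax):
    """Diffusion: B_{t+1}(x) = {x} ∪ ⋃_{x→y} B_t(y); N(t) = Σ_x |B_t(x)|.

    Exact sets cost O(n) bits each, O(n^2) overall; HyperANF keeps this
    same loop but swaps the sets for HyperLogLog counters."""
    ball = {x: {x} for x in adj}                   # B_0(x) = {x}
    n_of_t = [sum(len(b) for b in ball.values())]  # N(0) = n
    for _ in range(tmax):
        # one pass over all arcs x→y, performing unions
        ball = {x: {x}.union(*(ball[y] for y in adj[x])) for x in adj}
        n_of_t.append(sum(len(b) for b in ball.values()))
    return n_of_t
```

On the directed path a→b→c this returns [3, 5, 6, 6]: N(t) stabilizes once every ball stops growing.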
  18. 18. Easy but Costly: Approximate • Every set needs O(n) bits: overall O(n²). Too many. • All we need is cardinality estimators: estimate # of distinct elements (nodes) in very large multisets (balls around nodes), presented as massive streams • Do not estimate cardinality exactly: use probabilistic counting to get approximate estimates • Wish to choose approximate sets such that unions can be computed quickly
  19. 19. ANF and HyperANF ANF [PGF02]: • Diffusion with Flajolet–Martin (FM) counters (log n + c space) • FM counters can be combined with OR HyperANF [BRV11a]: • Diffusion with HyperLogLog counters [Flajolet+07] (loglog n space) • Use broadword programming to combine HyperLogLog counters quickly!
  20. 20. HyperLogLog counters
  21. 21. Rough intuition: let x be the unknown cardinality of M. Each substream will contain approximately x/m different elements. Then, its Max parameter should be close to log2(x/m). The harmonic mean of the quantities 2^Max (mZ in our notation) is likely to be of the order of x/m. Thus, m²Z should be of the order of x. The constant αm corrects the systematic multiplicative bias in m²Z.
  22. 22. HyperLogLog counters • Instead of actually counting, observe a statistical feature of a set (stream) of elements • Feature: # of trailing zeroes of the value of a very good hash function • Keep track of the maximum Max (log log n bits!) • # of distinct elements ∝ 2^Max • Important: the counter of stream AB is simply the maximum of the counters of A and B! • With 40 bits can count up to 4 billion with a standard deviation of 6%
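A toy version of such a counter can be sketched as follows. This is an illustrative approximation of the HyperLogLog idea (registers holding the Max of a trailing-zero feature, union by register-wise max, harmonic-mean estimate), not Flajolet et al.'s tuned algorithm: the small- and large-range corrections are omitted, and the hash choice is an assumption:

```python
import hashlib

class HLLCounter:
    """Toy HyperLogLog counter: 2^b registers; union = register-wise max."""

    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.reg = [0] * self.m

    def _hash(self, item):
        # assumption: a 64-bit hash good enough to act as uniform random bits
        h = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
        return int.from_bytes(h, 'big')

    def add(self, item):
        h = self._hash(item)
        j = h & (self.m - 1)          # low b bits pick the register
        w = h >> self.b
        rho = 1                        # 1 + # trailing zeroes of remaining bits
        while w & 1 == 0 and rho <= 64 - self.b:
            rho += 1
            w >>= 1
        self.reg[j] = max(self.reg[j], rho)

    def union(self, other):
        # the counter of stream AB is just the register-wise maximum
        for j in range(self.m):
            self.reg[j] = max(self.reg[j], other.reg[j])

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        z = 1.0 / sum(2.0 ** -r for r in self.reg)   # harmonic-mean term
        return alpha * self.m * self.m * z           # α_m · m² · Z
```

With b = 10 (1024 registers) the relative standard error is roughly 1.04/√m ≈ 3%, so a stream of 10,000 distinct items is estimated within a few percent, and merging two counters estimates the cardinality of the union of their streams.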
  23. 23. Many many counters… • To increase confidence, need several counters (usually 2^b, b ≥ 4) and take their harmonic mean • Thus each set is represented by a list of small counters • How small? Typically 5 bits is enough: 2^31 > 1G. For huge graphs, 7 bits (unlikely > 7 bits!) • To compute the union of two sets these must be maximized one-by-one • Extracting by shifts, maximizing and putting back by shifts is unbearably slow • Flajolet–Martin (ANF) just ORs the features! • Exponential reduction in space (from log n to loglog n) but unions get more complicated…
  24.–34. Broadword Programming (worked example): slides 24–34 step through computing, with a handful of whole-word operations, the field-wise maximum of two 32-bit words holding the packed 8-bit counters 9 | 0 | 3 | 2 and 7 | 3 | 3 | 6. Using a mask with the high bit of each field set and packed subtractions, the slides derive, field by field, a selection mask marking which word holds the larger counter, without ever unpacking the counters.
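The manipulation the broadword slides illustrate can be sketched as SWAR ("SIMD within a register") code. Assuming 8-bit fields whose values stay below 128, so the top bit of each field is spare (true for 5–7-bit HyperLogLog registers), a plausible rendition is:

```python
H = 0x80808080  # the high (spare) bit of every 8-bit field in a 32-bit word

def swar_max8(x, y):
    """Field-wise max of packed 8-bit counters (each value < 128),
    computed without extracting the fields one by one.

    (x | H) - y leaves the spare bit set exactly in the fields where
    x >= y, since no borrow can cross field boundaries; those bits are
    then expanded to full-field 0xFF masks to select the winner."""
    m = ((x | H) - y) & H          # spare bit set where x_field >= y_field
    mask = (m - (m >> 7)) | m      # per field: 0x80 -> 0xFF, 0x00 -> 0x00
    return (x & mask) | (y & ~mask)
```

On the slides' example, the packed counters 9 | 0 | 3 | 2 and 7 | 3 | 3 | 6 combine to 9 | 3 | 3 | 6 in a constant number of word operations, instead of four extract/compare/insert rounds.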
  35. 35. Real speed? • Large size: HADI [Kang et al., 2010] is a Hadoop-conscious implementation of ANF. Takes 30 minutes on a 200K-node graph (on one of the 50 world largest supercomputers). • HyperANF does the same in a couple of minutes on a workstation (tens of minutes on a laptop).
  36. 36. Experiments • 24-core machine with: – 72 GB of RAM – 1 TB of disk space
  37. 37. Experiments (time) Ran experiments on snapshots of Facebook • Jan 1, 2007 • Jan 1, 2008 • ... • Jan 1, 2011 • May, 2011 (721.1M nodes, 68.7G edges)
  38. 38. Experiments (dataset) Considered: • fb: the whole Facebook graph • it / se: only Italian / Swedish users • it+se: only Italian & Swedish users • us: only US users Based on users’ current geo-IP location
  39. 39. Facebook distance distribution (average: 4.74)
  40. 40. Distance distributions
  41. 41. Average Distance
  42. 42. Average Degree and Density (fb). Density defined as 2m / n(n−1)
  43. 43. Diameter (fb)
  44. 44. How to compute the diameter? Lower bounds are a byproduct of HyperANF runs. What about the exact diameter? Computation was relatively fast: the diameter of Facebook required 10 hours of computation on a machine with 1 TB of RAM (although 256 GB would have been sufficient)
  45. 45. The WebGraph Framework (Boldi & Vigna 2003)
  46. 46. “The” Web graph • Set U of URLs. Directed graph with: – U as set of nodes – an arc from x to y iff the page with URL x has a hyperlink pointing to URL y. • Transpose graph: reverse all arcs (useful for several advanced ranking algs, e.g., HITS) • Web graph is HUGE! (≈ 100 M nodes, 1 G links) [diagram: a hyperlink, with its anchor, from Page A to Page B]
  47. 47. Storing the Web graph • What does it mean “to store (part of) the Web graph”? – being able to know the successors of each node in a reasonable time (e.g., much less than 1 ms/link) – having a simple way to know the node corresponding to a URL (e.g., minimal perfect hash) – having a simple way to know the URL corresponding to a node (e.g., front-coded lists). • W.l.o.g., assume each URL is represented by an integer (0, 1, . . . , n−1, where n = |U|)
  48. 48. Adjacency lists • The set of neighbors of a node • E.g., for a 4 billion page web, need 32 bits per node (each URL represented by an integer) • Naively, this demands 64 bits to represent each hyperlink • That’s too much: a Web graph with 1 G links would require more than 64 Gb
  49. 49. (Some) History • Connectivity Server [Bharat+98] ≈ 32 bits/link • Algorithms for separable graphs [BBK03] ≈ 5 bits/link • WebBase [Cho+06] ≈ 5.6 bits/link • LINK database [Randall+02] ≈ 4.5 bits/link • WebGraph [Boldi Vigna 04] ≈ 3 bits/link http://webgraph.dsi.unimi.it/
  50. 50. Exploit features of Web Graphs 1. Locality: usually most links in a page have navigational nature and so are local (i.e., they point to other pages on the same host). Source and target of those links share a long common prefix. If URLs are sorted lexicographically, the indices of source and target are close to each other. E.g.: – www.stanford.edu/alchemy – www.stanford.edu/biology – www.stanford.edu/biology/plant – www.stanford.edu/biology/plant/copyright – www.stanford.edu/biology/plant/people – www.stanford.edu/chemistry Literature reports that on average 80% of the links are local
  51. 51. Locality Property: with URLs lexicographically ordered, for many arcs x→y we often have small |x − y| Improvement: represent the successors y1 < y2 < ··· < yk of x using their gaps instead: y1 − x, y2 − y1 − 1, ... , yk − yk−1 − 1 All integers are non-negative, except possibly the first one; to handle this, the first integer is written through a coding that interleaves non-negative and negative values
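The gap transformation can be sketched in a few lines (function names are illustrative; the signed first gap is kept as-is here, rather than mapped through the signed-to-unsigned coding):

```python
def to_gaps(x, succ):
    """Successor list -> gap list. succ must be sorted increasingly;
    the first entry may be negative when y1 < x."""
    gaps = [succ[0] - x]
    for prev, y in zip(succ, succ[1:]):
        gaps.append(y - prev - 1)       # y_{i+1} - y_i - 1, always >= 0
    return gaps

def from_gaps(x, gaps):
    """Inverse transformation: gap list -> successor list."""
    succ = [x + gaps[0]]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ
```

For node x = 10 with successors [8, 12, 13, 20] the gaps are [-2, 3, 0, 6]; locality makes these gaps small, so they compress well with codes favoring small integers.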
  52. 52. Naïve Representation and with Gaps
  53. 53. Exploit features of Web Graphs 2. Similarity: Pages that are close to each other (in lexicographic order) tend to have many common successors. This is because many navigational links are the same on the same local cluster of pages (even non-navigational links are often copied from one page to another on the same host)
  54. 54. Similarity • Property: URLs that are close in lexicographic order are likely to belong to the same site (probably to the same level of the site hierarchy) • Consequence: they are likely to have similar successor lists • Improvement: code the successor list by referentiation – a reference r which tells us to start from the list of x − r (if r > 0) – a bit string which tells us which successors must be copied – a list of extra nodes for the remaining nodes Choice of r is critical: r is chosen as the value between 0 and W (fixed parameter, window size) that gives the best compression. Larger W yields a better compression rate but slower and more memory-consuming compression and decompression
  55. 55. Reference compression
  56. 56. Differential compression Look at the copy list as a sequence of 1-blocks and 0-blocks. The length of each block is decremented by 1, except for the 1st block. The 1st copy block is always assumed to be a 1-block (so the 1st copy block is 0 if the copy list starts with 0). The last block is always omitted: its value can be deduced from the # of blocks and the outdegree
  57. 57. Differential compression Copy blocks specify by inclusion/exclusion sublists that must be alternately copied or discarded. Using typical codes, such as γ coding, copying a list entirely costs 1 bit: can code a link in less than 1 bit!
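A decoder for this block scheme might look as follows; a sketch under the stated conventions (first length stored undecremented, later lengths decremented by 1, last block omitted), with illustrative names:

```python
def expand_copy_blocks(blocks, ref_len):
    """Copy blocks -> copy list (one bit per successor of the reference).

    The first block counts copied (bit 1) successors and is 0 if the copy
    list starts with skips; subsequent lengths are stored decremented by 1;
    the final block is omitted and reconstructed from ref_len."""
    bits, bit = [], 1
    for i, b in enumerate(blocks):
        n = b if i == 0 else b + 1
        bits.extend([bit] * n)
        bit ^= 1                                 # blocks alternate 1/0
    bits.extend([bit] * (ref_len - len(bits)))   # the omitted last block
    return bits

def copied_successors(ref_succ, blocks):
    """Successors of the reference node selected by the copy blocks."""
    bits = expand_copy_blocks(blocks, len(ref_succ))
    return [y for y, b in zip(ref_succ, bits) if b]
```

For a reference list of 9 successors and copy list 110011100, only the three block lengths 2, 1, 2 need to be coded: copying an entire list reduces to a single block.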
  58. 58. Exploit features of Web Graphs 3. Consecutivity: Many links within the same page are likely to be consecutive (w.r.t. lexicographic order). Mainly due to two distinct phenomena: 1. Most pages contain sets of navigational links which point to a fixed level of the hierarchy. Because of the hierarchical nature of URLs, links in pages at the bottom level of the hierarchy tend to be adjacent lexicographically. 2. In the transposed Web graph: pages that are high in the site hierarchy (e.g., the home page) are pointed to by most pages of the site.
  59. 59. Consecutivity • More in general, consecutivity is the dual of distance-one similarity. If a graph is easily compressible using similarity at distance one, its transpose must sport large intervals of consecutive links, and vice versa.
  60. 60. Intervalization • Exploit consecutivity among nodes. Instead of compressing them directly using gaps, first isolate subsequences corresponding to integer intervals (only intervals with length above a threshold Lmin are considered). Compress the list of extra nodes as follows: – A list of integer intervals: each interval represented by its left extreme and its length; left extremes compressed through the difference with the previous right extreme − 2, interval lengths decremented by Lmin – A list of residuals (remaining integers), compressed through differences.
  61. 61. Lmin = 2 Interval [15,19]: left extreme 15−15=0, length 5−2=3 Interval [23,24]: left extreme 23−19−2=2, length 2−2=0 Residuals are 13, 203, 315, 1034, which gives 13−15=−2 (encoded as 2|−2|+1=5), 203−13−1=189, 315−203−1=111, 1034−315−1=718.
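The interval/residual split in the example above can be sketched as follows (illustrative names; only the separation step is shown, not the subsequent difference coding). Run on the extra nodes of the example, it reproduces intervals [15,19] and [23,24] with residuals 13, 203, 315, 1034:

```python
LMIN = 2  # minimum run length worth keeping as an interval

def intervalize(nodes):
    """Split a sorted extra-node list into maximal runs of consecutive
    integers (kept as intervals when their length >= LMIN) plus residuals."""
    intervals, residuals = [], []
    i = 0
    while i < len(nodes):
        j = i
        while j + 1 < len(nodes) and nodes[j + 1] == nodes[j] + 1:
            j += 1                      # extend the consecutive run
        if j - i + 1 >= LMIN:
            intervals.append((nodes[i], nodes[j]))
        else:
            residuals.extend(nodes[i:j + 1])
        i = j + 1
    return intervals, residuals
```

Each interval then costs just two small integers (offset left extreme, decremented length) no matter how many consecutive links it covers.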
  62. 62. Intervalization
  63. 63. Choices in the reference scheme • How do you choose the reference node for x? • You consider the successor lists of the last W nodes, but… you do not consider lists which would cause a recursive reference chain longer than R (maximum reference count). With no limit on R, accessing the adjacency list of x may require decompressing all lists up to x! • Tuning the parameter R is essential for deciding the compression/speed trade-off: small R gives worse compression but shorter (random) access times. • W essentially affects compression time only.
  64. 64. Implementation • Random access to successor lists is implemented lazily through a cascade of iterators. • Each series of intervals and each list of residuals causes the creation of an iterator; the same happens for references. • The results of all iterators are then merged. • The advantage of laziness is that we never have to build an actual list of successors in memory, so the overhead is limited to the number of actual reads, not to the number of successor lists that would be necessary to re-create a given one.
  65. 65. Access speed • Access speed to a compressed graph is commonly measured in the time required to access a link (≈ 300 ns for WebGraph). • This quantity, however, is strongly dependent on the architecture (e.g., cache size), and, even more, on low-level optimizations (e.g., hard-coding of the first codewords of an instantaneous code). • To compare speeds reliably, we need public data, that anyone can access, and a common framework for the low-level operations. • A first step is http://webgraph-data.dsi.unimi.it/. Freely available data to compare compression techniques.
  66. 66. Summary WebGraph provides methods to manage very large (web) graphs. It consists of: • ζ codes: particularly suitable for storing web graphs (or integers with power-law distribution in a certain exponential range) • Algorithms for compressing web graphs that exploit referentiation, intervalization and ζ codes to provide a high compression ratio • Algorithms for accessing a compressed graph without actually decompressing it, using lazy techniques that delay decompression until it is actually necessary
  67. 67. Conclusions WebGraph combines new codes, new insights on the structure of the Web graph and new algorithmic techniques to achieve a very high compression ratio, while still retaining a good access speed (can it be better?). Software is highly tunable: one can experiment with dozens of codes, algorithmic techniques and compression parameters, and there is a large unexplored space of combinations. A theoretically interesting question is how to optimally combine differential compression and intervalization: can you do better than the current greedy approach (first copy as much as you can, then intervalize)?
  68. 68. Conclusions The compression techniques are specialized for Web Graphs. The average link size decreases as the graph grows. The average link access time increases as the graph grows. ζ codes seem to achieve a great trade-off between avg. bit size and access time.
  69. 69. References 1/2 1. [AJB00] Albert, Jeong, and Barabási. Error and attack tolerance of complex networks. Nature, 406:378–382, 2000. 2. [Backstrom+12] Backstrom, Boldi, Rosa, Ugander, and Vigna. 2012. Four degrees of separation. In Proceedings of the 3rd Annual ACM Web Science Conference (WebSci '12) 3. [BBK03] Blandford, Blelloch, and Kash. 2003. Compact representations of separable graphs. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms (SODA '03) 4. [Bharat+98] Bharat, Broder, Henzinger, Kumar, and Venkatasubramanian. 1998. The Connectivity Server: fast access to linkage information on the Web. In Proceedings of the Seventh International Conference on World Wide Web (WWW7) 5. [Bra01] Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001. 6. [BRV11a] Boldi, Rosa, and Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th International Conference on World Wide Web (WWW '11) 7. [BRV11b] Boldi, Rosa, and Vigna. 2011. Robustness of social networks: comparative results based on distance distributions. In Proceedings of the Third International Conference on Social Informatics (SocInfo '11) 8. [BV04] Boldi and Vigna. 2004. The WebGraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web (WWW '04) 9. [Bor05] Borgatti. Centrality and network flow. Social Networks, 27(1):55–71, 2005. 10. [Cho+06] Cho, Garcia-Molina, Haveliwala, Lam, Paepcke, Raghavan, and Wesley. 2006. Stanford WebBase components and applications. ACM Trans. Internet Technol. 6
  70. 70. References 2/2 11. [Crescenzi+12] Crescenzi, Grossi, Habib, Lanzi, and Marino. On computing the diameter of real-world undirected graphs. Theoretical Computer Science, 2012. 12. [Flajolet+07] Flajolet, Fusy, Gandouet, and Meunier. 2007. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Proceedings of the 2007 International Conference on Analysis of Algorithms (AofA '07) 13. [Firmani+12] Firmani, Italiano, Laura, Orlandi, and Santaroni. 2012. Computing strong articulation points and strong bridges in large scale graphs. In Proceedings of the 11th International Conference on Experimental Algorithms (SEA '12) 14. [Italiano+12] Italiano, Laura, and Santaroni. Finding strong articulation points and strong bridges in linear time. Theoretical Computer Science, vol. 447, 74–84, 2012. 15. [KLPM10] Kwak, Lee, Park, and Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (WWW '10) 16. [LH08] Leskovec and Horvitz. 2008. Planetary-scale views on a large instant-messaging network. In Proceedings of the 17th International Conference on World Wide Web (WWW '08) 17. [Mislove+07] Mislove, Marcon, Gummadi, Druschel, and Bhattacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC '07) 18. [PGF02] Palmer, Gibbons, and Faloutsos. 2002. ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02) 19. [Randall+02] Randall, Stata, Wiener, and Wickremesinghe. 2002. The Link Database: Fast Access to Graphs of the Web. In Proceedings of the Data Compression Conference (DCC '02) 20. [RAK07] Raghavan, Albert, and Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E (Statistical, Nonlinear, and Soft Matter Physics), 76(3), 2007.