Algorithms for Big Data: Graphs and Memory Errors 3 (Lecture by Giuseppe Italiano)
The first part of my lectures will be devoted to the design of practical algorithms for very large graphs. The second part will be devoted to algorithms resilient to memory errors. Modern memory devices may suffer from faults, where some bits may arbitrarily flip and corrupt the values of the affected memory cells. The appearance of such faults may seriously compromise the correctness and performance of computations, and the larger the memory usage, the higher the probability of incurring memory errors. In recent years, many algorithms for computing in the presence of memory faults have been introduced in the literature: in particular, an algorithm or a data structure is called resilient if it is able to work correctly on the set of uncorrupted values. This part will cover recent work on resilient algorithms and data structures.

Presentation Transcript

  • 1. Original Plan 1. Algorithms for BIG graphs • The centrality of centrality • How to store BIG Graphs (WebGraph Framework) • Four Degrees of Separation • Diameter and Radius 2. Big Data and Memory Errors
  • 2. Slightly Revised Plan 1. Algorithms for BIG graphs • The centrality of centrality • Four Degrees of Separation • Diameter (no Radius) • How to store BIG Graphs (WebGraph Framework) 2. Big Data and Memory Errors
  • 3. Four Degrees of Separation
  • 4. Literature • Frigyes Karinthy, in his 1929 short story “Láncszemek” (“Chains”), suggested that any two persons are distanced by at most six friendship links • Just an (optimistic) positivistic statement about combinatorial explosion • Used by John Guare in his 1990 eponymous play (and 1993 movie by Fred Schepisi)
  • 5. The Sociologists • M. Kochen, I. de Sola Pool: Contacts and influences. (Manuscript, early 50s) • A. Rapoport, W.J. Horvath: A study of a large sociogram. (Behav. Sci. 1961) • S. Milgram, An experimental study of the small world problem. (Sociometry, 1969)…
  • 6. Milgram's question • “Given two individuals selected randomly from the population, what is the probability that the minimum number of intermediaries required to link them is 0, 1, 2, . . . , k?”
  • 7. Milgram's question • What is the distance distribution of the acquaintance graph? – how many pairs are friends? – how many are not friends but have a friend in common? – … • Note on “distance”: – sociologists measure the degrees of separation – as computer scientists we measure the graph-theoretic distance (just add one)
  • 8. Milgram's experiment 296 people (starting population) asked to dispatch a parcel to a single individual (target) • Target: a Boston stockholder • Starting population selected as follows: – 100 were random Boston inhabitants (group A) – 100 were random Nebraska stockholders (group B) – 96 were random Nebraska inhabitants (group C) • Rule of the game: parcels could be directly sent only to someone the sender knows personally (“first-name acquaintance”)
  • 9. Milgram's experiment [figure]
  • 10. Milgram's experiment • Actually completed 64 chains (22%) • 453 intermediaries happened to be involved in the experiments • Average distance of the completed chains was 6.2, ranging from 5.4 (Boston group) to 6.7 (random Nebraska inhabitants) • Distance 6.7 was 5.7 degrees of separation (thus six degrees of separation)
  • 11. New Milgram's experiment? • Can we reproduce Milgram's experiment on a large scale? • How can one compute or approximate the distance distribution of a given huge friendship graph? (such as Facebook)
  • 12. Graph distances and distribution • The distance d(x,y) is the length of the shortest path from x to y – d(x,y) = ∞ if one cannot go from x to y • For undirected graphs d(x,y) = d(y,x) • For every t, count the number of pairs (x,y) such that d(x,y) = t (distance distribution) • The fraction of pairs at distance t is (the density function of) a distribution
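A minimal sketch (mine, not from the slides) of this definition: one breadth-first search per source gives the exact distance distribution in O(nm) time, fine for toy graphs but hopeless at the scales discussed below, which is precisely what motivates HyperANF.

```python
from collections import deque

def distance_distribution(adj):
    """adj: dict mapping node -> list of neighbours.
    Returns {t: number of ordered pairs (x, y) with d(x, y) = t}."""
    counts = {}
    for s in adj:                      # one BFS per source node
        dist = {s: 0}
        queue = deque([s])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + 1
                    queue.append(y)
        for x, d in dist.items():
            if d > 0:
                counts[d] = counts.get(d, 0) + 1
    return counts

# Toy example: the path 0 - 1 - 2.
print(distance_distribution({0: [1], 1: [0, 2], 2: [1]}))  # {1: 4, 2: 2}
```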
  • 13. Previous experiments On-line social networks: • 6.6 degrees of separation on a one-month MSN Messenger communication graph (Leskovec and Horvitz [LH08]) – 180 M nodes and 1.3 G arcs • 3.67 degrees of separation in Twitter [KLPM10] – 5 G follows – Is this meaningful? In Twitter, links created without permission at both ends…
  • 14. • 4.74 degrees of separation in Facebook (Backstrom et al. [Backstrom+12]) – 712 M people and 69 G friendship links
  • 15. HyperANF • New tool for studying/computing the distance distribution of very large graphs • Diffusion-based approximate algorithm • Based on WebGraph (Boldi et al. [BV04]) and ANF [Palmer et al., 2002] • It uses HyperLogLog counters [Flajolet et al., 2007] and broadword programming for low-level parallelization
  • 16. Computing the Neighborhood Function • Neighborhood function N(t): for each t, number of pairs at distance ≤ t – provides data about how fast the “average ball” around each node expands • Many breadth-first visits: O(mn), needs direct access ☹ • Sampling: a fraction of breadth-first visits, very unreliable results on graphs that are not strongly connected, needs direct access ☹ • Edith Cohen's [JCSS 1997] size estimation framework: very powerful (strong theoretical guarantees) but does not scale really well, needs direct access ☹
  • 17. Alternative: Diffusion • Basic idea: Palmer et al. [PGF02] • Let Bt(x) be the ball of radius t around x (nodes at distance at most t from x) • Clearly B0(x) = {x} • But also Bt+1(x) = ∪x→y Bt(y) ∪ {x} • So we can compute balls by enumerating the arcs x→y and performing unions (on sets) • The neighborhood function at t is given by the sum of the sizes of the balls of radius t, i.e., |Bt(x1)| + |Bt(x2)| + … + |Bt(xn)|
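To make the diffusion scheme concrete, here is a hedged sketch (names are mine) that uses real sets in place of probabilistic counters, so it is exact but needs O(n²) bits overall, which is exactly the problem the next slide addresses.

```python
def neighborhood_function(nodes, arcs, t_max):
    """nodes: iterable of node ids; arcs: list of (x, y) meaning x -> y.
    Returns [N(0), N(1), ...] where N(t) = number of pairs at distance <= t."""
    ball = {x: {x} for x in nodes}                # B_0(x) = {x}
    nf = [sum(len(b) for b in ball.values())]
    for _ in range(t_max):
        new_ball = {x: set(b) for x, b in ball.items()}
        for x, y in arcs:                          # B_{t+1}(x) ∪= B_t(y) for each arc x -> y
            new_ball[x] |= ball[y]
        ball = new_ball
        nf.append(sum(len(b) for b in ball.values()))   # N(t) = sum of ball sizes
    return nf

# Toy example: directed path 0 -> 1 -> 2.
print(neighborhood_function([0, 1, 2], [(0, 1), (1, 2)], 2))  # [3, 5, 6]
```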
  • 18. Easy but Costly: Approximate • Every set needs O(n) bits: overall O(n²). Too many. • All we need is cardinality estimators: estimate the # of distinct elements (nodes) in very large multisets (balls around nodes), presented as massive streams • Do not estimate cardinality exactly: use probabilistic counting to get approximate estimates • Wish to choose approximate sets such that unions can be computed quickly
  • 19. ANF and HyperANF ANF [PGF02]: • Diffusion with Martin–Flajolet (MF) counters (log n + c space) • MF counters can be combined with OR HyperANF [BRV11a]: • Diffusion with HyperLogLog counters [Flajolet+07] (loglog n space) • Use broadword programming to combine HyperLogLog counters quickly!
  • 20. HyperLogLog counters
  • 21. Rough Intuition: let x be the unknown cardinality of M. Each substream will contain approximately x/m distinct elements. Then its Max parameter should be close to log2(x/m). The harmonic mean of the quantities 2^Max (mZ in our notation) is likely to be of the order of x/m. Thus m²Z should be of the order of x. The constant αm corrects the systematic multiplicative bias in m²Z.
  • 22. HyperLogLog counters • Instead of actually counting, observe a statistical feature of a set (stream) of elements • Feature: # of trailing zeroes of the value of a very good hash function • Keep track of the maximum, Max (log log n bits!) • # of distinct elements ∝ 2^Max • Important: the counter of stream AB is simply the maximum of the counters of A and B! • With 40 bits one can count up to 4 billion with a standard deviation of 6%
  • 23. Many many counters… • To increase confidence, need several counters (usually 2^b, b ≥ 4) and take their harmonic mean • Thus each set is represented by a list of small counters • How small? Typically 5 bits are enough: 2^32 > 1 G. For huge graphs, 7 bits (unlikely > 7 bits!) • To compute the union of two sets, these must be maximized one by one • Extracting by shifts, maximizing and putting back by shifts is unbearably slow • Martin–Flajolet (ANF) just ORs the features! • Exponential reduction in space (from log n to loglog n) but unions get more complicated…
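The following toy counter is a hedged sketch of the idea on slides 20–23 (registers kept in a plain Python list rather than packed 5-bit fields, SHA-1 standing in for the "very good hash function", names are mine). It shows the three properties the slides rely on: the trailing-zero feature, harmonic-mean estimation with the bias-correction constant αm, and union by register-wise max.

```python
import hashlib

B = 6                     # m = 2^B = 64 registers (slides: usually 2^b, b >= 4)
M = 1 << B
ALPHA = 0.709             # bias-correction constant alpha_m for m = 64

def _hash(item):          # stand-in for the "very good hash function"
    return int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], 'big')

class HLL:
    def __init__(self):
        self.reg = [0] * M            # HyperANF really packs these as 5-bit fields

    def add(self, item):
        h = _hash(item)
        j = h & (M - 1)               # low B bits select a register
        w = h >> B
        rho = 1                       # position of lowest 1-bit = trailing zeroes + 1
        while (w & 1) == 0 and rho < 64:
            rho += 1
            w >>= 1
        self.reg[j] = max(self.reg[j], rho)

    def union(self, other):           # the key property: register-wise max
        u = HLL()
        u.reg = [max(a, b) for a, b in zip(self.reg, other.reg)]
        return u

    def estimate(self):               # alpha_m * m^2 * Z, with Z = 1 / sum 2^-reg
        return ALPHA * M * M / sum(2.0 ** -r for r in self.reg)

a, b = HLL(), HLL()
for i in range(10000):
    a.add(i)
for i in range(5000, 15000):
    b.add(i)
print(round(a.estimate()), round(a.union(b).estimate()))  # roughly 10000 and 15000
```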
  • 24.–34. Broadword Programming [animation spanning several slides: eight counters 9, 0, 3, 2, 7, 3, 3, 6 are packed into the 8-bit lanes of a single word, and a lane-wise maximum is computed with whole-word subtractions, masks, and shifts (intermediate lane values such as 2, 125, 0, 124 appear), rather than by extracting each counter]
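A hedged reconstruction of the broadword idea in those slides (my code, not WebGraph's actual implementation): eight counters packed into the 8-bit lanes of one 64-bit word, with the lane-wise maximum computed by whole-word operations only. The first operand is the slides' 9 0 3 2 7 3 3 6; the second is a made-up example.

```python
H = 0x8080808080808080    # high bit of each 8-bit lane
MASK64 = (1 << 64) - 1

def packed_max(x, y):
    """Lane-wise max; lane values must fit in 7 bits (true for 5-bit registers)."""
    m = ((x | H) - y) & H            # high bit of lane i set  <=>  x_i >= y_i
    mask = (m >> 7) * 0xFF           # expand each 0x80 lane into a 0xFF lane
    return (x & mask) | (y & ~mask & MASK64)

def pack(vals):                      # demo helpers: lane i = bits 8i .. 8i+7
    w = 0
    for i, v in enumerate(vals):
        w |= v << (8 * i)
    return w

def unpack(w):
    return [(w >> (8 * i)) & 0xFF for i in range(8)]

x = pack([9, 0, 3, 2, 7, 3, 3, 6])   # the counters shown on the slides
y = pack([4, 5, 3, 7, 1, 3, 8, 2])   # hypothetical second set of counters
print(unpack(packed_max(x, y)))      # [9, 5, 3, 7, 7, 3, 8, 6]
```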
  • 35. Real speed? • Large size: HADI [Kang et al., 2010] is a Hadoop-conscious implementation of ANF. Takes 30 minutes on a 200K-node graph (on one of the 50 largest supercomputers in the world). • HyperANF does the same in a couple of minutes on a workstation (tens of minutes on a laptop).
  • 36. Experiments • 24-core machine with: – 72 GB of RAM – 1 TB of disk space
  • 37. Experiments (time) Ran experiments on snapshots of Facebook • Jan 1, 2007 • Jan 1, 2008 • ... • Jan 1, 2011 • May, 2011 (721.1M nodes, 68.7G edges)
  • 38. Experiments (dataset) Considered: • fb: the whole Facebook graph • it / se: only Italian / Swedish users • it+se: only Italian & Swedish users • us: only US users Based on users' current geo-IP location
  • 39. Facebook distance distribution [plot; average distance 4.74]
  • 40. Distance distributions [plot]
  • 41. Average Distance [plot]
  • 42. Average Degree and Density (fb) [plot] Density defined as 2m / (n(n−1))
  • 43. Diameter (fb) [plot]
  • 44. How to compute the diameter? Lower bounds are a byproduct of HyperANF runs. What about the exact diameter? Computation was relatively fast: the diameter of Facebook required 10 hours of computation on a machine with 1 TB of RAM (although 256 GB would have been sufficient)
  • 45. The WebGraph Framework (Boldi & Vigna 2003)
  • 46. “The” Web graph • Set U of URLs. Directed graph with: – U as set of nodes – an arc from x to y iff the page with URL x has a hyperlink pointing to URL y. • Transpose graph: reverse all arcs (useful for several advanced ranking algorithms, e.g., HITS) • The Web graph is HUGE! (≈ 100 M nodes, 1 G links) [figure: page A links to page B through a hyperlink with anchor text]
  • 47. Storing the Web graph • What does it mean “to store (part of) the Web graph”? – being able to know the successors of each node in a reasonable time (e.g., much less than 1 ms/link) – having a simple way to know the node corresponding to a URL (e.g., minimal perfect hash) – having a simple way to know the URL corresponding to a node (e.g., front-coded lists). • W.l.o.g., assume each URL is represented by an integer (0, 1, . . . , n−1, where n = |U|)
  • 48. Adjacency lists • The set of neighbors of a node • E.g., for a 4 billion page web, need 32 bits per node (each URL represented by an integer) • Naively, this demands 64 bits to represent each hyperlink • That's too much: a Web graph with 1 G links would require more than 64 Gbits (8 GB)
  • 49. (Some) History • Connectivity Server [Bharat+98] ≈ 32 bits/link • Algorithms for separable graphs [BBK03] ≈ 5 bits/link • WebBase [Cho+06] ≈ 5.6 bits/link • LINK database [Randall+02] ≈ 4.5 bits/link • WebGraph [Boldi Vigna 04] ≈ 3 bits/link http://webgraph.dsi.unimi.it/
  • 50. Exploit features of Web Graphs 1. Locality: usually most links in a page have a navigational nature and so are local (i.e., they point to other pages on the same host). Source and target of those links share a long common prefix. If URLs are sorted lexicographically, the indices of source and target are close to each other. E.g.: – www.stanford.edu/alchemy – www.stanford.edu/biology – www.stanford.edu/biology/plant – www.stanford.edu/biology/plant/copyright – www.stanford.edu/biology/plant/people – www.stanford.edu/chemistry The literature reports that on average 80% of the links are local
  • 51. Locality Property: with URLs lexicographically ordered, for many arcs x→y we often have small |x − y| Improvement: represent the successors y1 < y2 < ··· < yk of x using their gaps instead: y1 − x, y2 − y1 − 1, ... , yk − yk−1 − 1 All integers are non-negative, except possibly the first one. To avoid this, use the following coding for the first integer: a non-negative x is mapped to 2x, a negative x to 2|x|+1 (cf. the encoding of −2 as 5 in the worked example on slide 61)
  • 52. Naïve Representation and with Gaps [figure]
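A small sketch of the gap representation just described (helper names are mine; the successor list is the one reused later in the intervalization example on slide 61):

```python
def to_gaps(x, succ):
    """succ: sorted successor list of node x. First gap may be negative (y1 - x)."""
    gaps = [succ[0] - x]
    for prev, cur in zip(succ, succ[1:]):
        gaps.append(cur - prev - 1)    # strictly increasing list, so always >= 0
    return gaps

def from_gaps(x, gaps):
    succ = [x + gaps[0]]
    for g in gaps[1:]:
        succ.append(succ[-1] + g + 1)
    return succ

succ = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
print(to_gaps(15, succ))                           # [-2, 1, 0, 0, 0, 0, 3, 0, 178, 111, 718]
print(from_gaps(15, to_gaps(15, succ)) == succ)    # True
```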
  • 53. Exploit features of Web Graphs 2. Similarity: Pages that are close to each other (in lexicographic order) tend to have many common successors. This is because many navigational links are the same on the same local cluster of pages (even non-navigational links are often copied from one page to another on the same host)
  • 54. Similarity • Property: URLs that are close in lexicographic order are likely to belong to the same site (probably to the same level of the site hierarchy) • Consequence: they are likely to have similar successor lists • Improvement: code the successor list by referentiation – a reference r which tells us to start from the list of x − r (if r > 0) – a bit string which tells us which successors must be copied – a list of extra nodes for the remaining nodes The choice of r is critical: r is chosen as the value between 0 and W (a fixed parameter, the window size) that gives the best compression. Larger W yields a better compression rate but slower and more memory-consuming compression and decompression
  • 55. Reference compression [figure]
  • 56. Differential compression Look at the copy list as a sequence of 1-blocks and 0-blocks. The length of each block is decremented by 1, except for the 1st block. The 1st copy block is always assumed to be a 1-block (so the 1st copy block is 0 if the copy list starts with 0). The last block is always omitted: its value can be deduced from the # of blocks and the outdegree
  • 57. Differential compression Copy blocks specify by inclusion/exclusion sublists that must be alternately copied or discarded. Using typical codes, such as γ coding, copying a list entirely costs 1 bit: can code a link in less than 1 bit!
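To illustrate copy blocks, here is a hedged decoder sketch (my reading of the scheme, with simplified input formats): it expands the encoded block lengths back into the copy list and extracts the copied successors from a reference list.

```python
def decode_copy_list(blocks, ref_len):
    """blocks: encoded block lengths. Returns the 0/1 copy list of length ref_len."""
    bits, bit = [], 1                    # first block is a 1-block by convention
    for i, b in enumerate(blocks):
        length = b if i == 0 else b + 1  # every length but the first was decremented
        bits.extend([bit] * length)
        bit ^= 1
    bits.extend([bit] * (ref_len - len(bits)))   # last block omitted: fill the rest
    return bits

def copied_successors(ref_succ, blocks):
    flags = decode_copy_list(blocks, len(ref_succ))
    return [s for s, f in zip(ref_succ, flags) if f]

ref = [13, 15, 16, 17, 18, 19, 23, 24, 203]      # hypothetical reference list
# Copy list 1 1 1 0 0 1 1 1 0 has blocks of lengths 3, 2, 3, 1;
# encoded as 3, 1, 2 (all but the first decremented by 1, the last omitted).
print(copied_successors(ref, [3, 1, 2]))         # [13, 15, 16, 19, 23, 24]
```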
  • 58. Exploit features of Web Graphs 3. Consecutivity: Many links within the same page are likely to be consecutive (w.r.t. lexicographic order). Mainly due to two distinct phenomena: 1. Most pages contain sets of navigational links which point to a fixed level of the hierarchy. Because of the hierarchical nature of URLs, links in pages at the bottom level of the hierarchy tend to be adjacent lexicographically. 2. In the transposed Web graph: pages that are high in the site hierarchy (e.g., the home page) are pointed to by most pages of the site.
  • 59. Consecutivity • More in general, consecutivity is the dual of distance-one similarity. If a graph is easily compressible using similarity at distance one, its transpose must sport large intervals of consecutive links, and vice versa.
  • 60. Intervalization • Exploit consecutivity among nodes. Instead of compressing them directly using gaps, first isolate the subsequences corresponding to integer intervals (only intervals with length above a threshold Lmin are considered). Compress the list of extra nodes as follows: – A list of integer intervals: each interval represented by its left extreme and its length; left extremes compressed through the difference with the previous right extreme −2, interval lengths decremented by Lmin – A list of residuals (the remaining integers), compressed through differences.
  • 61. Lmin = 2 Interval [15,19]: left extreme 15−15=0, length 5−2=3 Interval [23,24]: left extreme 23−19−2=2, length 2−2=0 Residuals are 13, 203, 315, 1034, which gives 13−15=−2 (encoded as 2|−2|+1=5), 203−13−1=189, 315−203−1=111, 1034−315−1=718.
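The same worked example as a code sketch (helper names are mine; the final encoding of the residual gaps, e.g. −2 → 5, is left to the codes mentioned earlier):

```python
LMIN = 2

def intervalize(x, extras):
    """Split extra nodes into maximal integer intervals (length >= LMIN) and residuals."""
    intervals, residuals, i = [], [], 0
    while i < len(extras):
        j = i
        while j + 1 < len(extras) and extras[j + 1] == extras[j] + 1:
            j += 1                                     # extend the consecutive run
        if j - i + 1 >= LMIN:
            intervals.append((extras[i], j - i + 1))   # (left extreme, length)
        else:
            residuals.extend(extras[i:j + 1])
        i = j + 1
    enc, prev_right = [], None        # first left extreme relative to x, then to the
    for left, length in intervals:    # previous right extreme - 2; lengths minus LMIN
        enc.append((left - x if prev_right is None else left - prev_right - 2,
                    length - LMIN))
        prev_right = left + length - 1
    return enc, residuals             # residuals are then gap-encoded (13 - 15 = -2, ...)

print(intervalize(15, [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]))
# ([(0, 3), (2, 0)], [13, 203, 315, 1034]), matching the numbers on the slide
```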
  • 62. Intervalization [figure]
  • 63. Choices in the reference scheme • How do you choose the reference node for x? • You consider the successor lists of the last W nodes, but... you do not consider lists which would cause a recursive reference chain of more than R (maximum reference count) hops. With no limit on R, accessing the adjacency list of x may require a decompression of all lists up to x! • The tuning parameter R is essential for deciding the compression/speed ratio: small R gives worse compression but shorter (random) access times. • W essentially affects compression time only.
  • 64. Implementation • Random access to successor lists is implemented lazily through a cascade of iterators. • Each sequence of intervals and each list of residuals causes the creation of an iterator; the same happens for references. • The results of all iterators are then merged. • The advantage of laziness is that we never have to build an actual list of successors in memory, so the overhead is limited to the number of actual reads, not to the number of successor lists that would be necessary to re-create a given one.
  • 65. Access speed • Access speed to a compressed graph is commonly measured as the time required to access a link (≈ 300 ns for WebGraph). • This quantity, however, is strongly dependent on the architecture (e.g., cache size) and, even more, on low-level optimizations (e.g., hard-coding of the first codewords of an instantaneous code). • To compare speeds reliably, we need public data that anyone can access, and a common framework for the low-level operations. • A first step is http://webgraph-data.dsi.unimi.it/: freely available data to compare compression techniques.
  • 66. Summary WebGraph provides methods to manage very large (web) graphs. It consists of: • ζ codes: particularly suitable for storing web graphs (or integers with a power-law distribution in a certain exponential range) • Algorithms for compressing web graphs that exploit referentiation, intervalization and ζ codes to provide a high compression ratio • Algorithms for accessing a compressed graph without actually decompressing it, using lazy techniques that delay decompression until it is actually necessary
  • 67. Conclusions WebGraph combines new codes, new insights on the structure of the Web graph and new algorithmic techniques to achieve a very high compression ratio, while still retaining a good access speed (can it be better?). The software is highly tunable: one can experiment with dozens of codes, algorithmic techniques and compression parameters, and there is a large unexplored space of combinations. A theoretically interesting question is how to optimally combine differential compression and intervalization: can you do better than the current greedy approach (first copy as much as you can, then intervalize)?
  • 68. Conclusions The compression techniques are specialized for Web graphs. The average link size decreases as the graph grows. The average link access time increases as the graph grows. ζ codes seem to achieve a great trade-off between avg. bit size and access time.
  • 69. References 1/2 1. [AJB00] Albert, Jeong, and Barabási. Error and attack tolerance of complex networks. Nature, 406:378–382, 2000. 2. [Backstrom+12] Backstrom, Boldi, Rosa, Ugander, and Vigna. 2012. Four degrees of separation. In Proceedings of the 3rd Annual ACM Web Science Conference (WebSci '12) 3. [BBK03] Blandford, Blelloch, and Kash. 2003. Compact representations of separable graphs. In Proceedings of the 14th ACM-SIAM Symposium on Discrete Algorithms (SODA '03) 4. [Bharat+98] Bharat, Broder, Henzinger, Kumar, and Venkatasubramanian. 1998. The connectivity server: fast access to linkage information on the Web. In Proceedings of the Seventh International Conference on World Wide Web (WWW7) 5. [Bra01] Brandes. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2):163–177, 2001. 6. [BRV11a] Boldi, Rosa, and Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th International Conference on World Wide Web (WWW '11) 7. [BRV11b] Boldi, Rosa, and Vigna. 2011. Robustness of social networks: comparative results based on distance distributions. In Proceedings of the Third International Conference on Social Informatics (SocInfo '11) 8. [BV04] Boldi and Vigna. 2004. The WebGraph framework I: compression techniques. In Proceedings of the 13th International Conference on World Wide Web (WWW '04) 9. [Bor05] Borgatti. Centrality and network flow. Social Networks, 27(1):55–71, 2005. 10. [Cho+06] Cho, Garcia-Molina, Haveliwala, Lam, Paepcke, Raghavan, and Wesley. 2006. Stanford WebBase components and applications. ACM Trans. Internet Technol. 6
  • 70. References 2/2 11. [Crescenzi+12] Crescenzi, Grossi, Habib, Lanzi, and Marino. On computing the diameter of real-world undirected graphs. Theoretical Computer Science, 2012. 12. [Flajolet+07] Flajolet, Fusy, Gandouet, and Meunier. 2007. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In Proceedings of the 2007 International Conference on Analysis of Algorithms (AofA '07) 13. [Firmani+12] Firmani, Italiano, Laura, Orlandi, and Santaroni. 2012. Computing strong articulation points and strong bridges in large scale graphs. In Proceedings of the 11th International Conference on Experimental Algorithms (SEA '12) 14. [Italiano+12] Italiano, Laura, and Santaroni. Finding strong articulation points and strong bridges in linear time. Theoretical Computer Science, vol. 447, 74–84, 2012. 15. [KLPM10] Kwak, Lee, Park, and Moon. 2010. What is Twitter, a social network or a news media? In Proceedings of the 19th International Conference on World Wide Web (WWW '10) 16. [LH08] Leskovec and Horvitz. 2008. Planetary-scale views on a large instant-messaging network. In Proceedings of the 17th International Conference on World Wide Web (WWW '08) 17. [Mislove+07] Mislove, Marcon, Gummadi, Druschel, and Bhattacharjee. 2007. Measurement and analysis of online social networks. In Proceedings of the 7th ACM SIGCOMM Conference on Internet Measurement (IMC '07) 18. [PGF02] Palmer, Gibbons, and Faloutsos. 2002. ANF: a fast and scalable tool for data mining in massive graphs. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '02) 19. [Randall+02] Randall, Stata, Wiener, and Wickremesinghe. 2002. The Link Database: fast access to graphs of the Web. In Proceedings of the Data Compression Conference (DCC '02) 20. [RAK07] Raghavan, Albert, and Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3), 2007.