Algorithms for Big Data: Graphs and Memory Errors 1 (Lecture by Giuseppe Italiano)


The first part of my lectures will be devoted to the design of practical algorithms for very large graphs. The second part will be devoted to algorithms resilient to memory errors. Modern memory devices may suffer from faults, where some bits may arbitrarily flip and corrupt the values of the affected memory cells. Such faults may seriously compromise the correctness and performance of computations, and the larger the memory usage, the higher the probability of incurring memory errors. In recent years, many algorithms for computing in the presence of memory faults have been introduced in the literature: in particular, an algorithm or a data structure is called resilient if it is able to work correctly on the set of uncorrupted values. This part will cover recent work on resilient algorithms and data structures.



  1. 1. Algorithms for BIG DATA: Graphs and Memory Errors Giuseppe F. Italiano Università di Roma “Tor Vergata” italiano@disp.uniroma2.it ALMADA, July-August 2013
  2. 2. Some advertising first: School on Graph Theory, Algorithms & Applications, Erice, Italy, September 8-16, 2014. Consider applying to the School!
  3. 3. BIG data. NYT, Feb 11, 2012: The Age of Big Data. •  What is Big Data? A meme and a marketing term, for sure, but also shorthand for advancing trends in technology that open the door to a new approach to understanding the world and making decisions. … A lot of people are talking about big data, but most people are just creating it. The real value is in the analysis.
  6. 6. Why BIG data? In God we trust. All others must bring data. (attributed to W Edwards Deming)
  7. 7. And all others (we) are indeed bringing data! What happens in an Internet minute! Every two days now we create as much information as we did from the dawn of civilization up until 2003 (Eric Schmidt, Google CEO)
  8. 8. Latest News…
  9. 9. How do we view BIG data?
  10. 10. BIG data: Is it the size or the network? Big Data is notable not because of its size, but because of its relationality to other data. Due to efforts to mine and aggregate data, Big Data is fundamentally networked (threaded with connections). Its value comes from the patterns that can be derived by making connections between pieces of data, about an individual, about individuals in relation to others, about groups of people, or simply about the structure of information itself.
  11. 11. Networked BIG data? Not only social networks or Web graphs. Recommendation systems: generate better song or movie suggestions (Pandora or Netflix). Data analytics, e.g., monitoring trending topics on Twitter. Etc…
  12. 12. Networks •  Networks represent interaction among units. •  In the case of social and economic networks, these units (nodes) are individuals or organizations. •  At some broad level, the study of networks can encompass the study of all kinds of interactions: •  Transportation. •  Communication. •  Social: friendship, trust, trade, credit and financial flows. •  Information transmission: Web links, information exchange, diffusion of ideas and innovation. •  Spread of epidemics.
  13. 13. From Demetrescu et al., McGraw Hill 2004. The network of loans among financial institutions can be used to analyze the roles that different participants play in the financial system, and how the interactions among these roles affect the health of individual participants and the system as a whole. (Bech and Atalay 2008)
  14. 14. The bow-tie graph structure of the Web (Broder et al. 2000)
  15. 15. Networks. The theory of network structure and behavior addresses simultaneous challenges deriving from: •  Economics: theories for strategic interaction among small numbers of parties, as well as for cumulative behavior of large, homogeneous populations. •  Sociology: some fundamental insights into the structure of social networks, but network methodology refined only in domains and scales where data collection was traditionally possible (well-defined groups with tens to hundreds of people). •  Computer Science: with the rise of the Web and social media, dealt with design constraints on large computing systems which are not only technological but also human (the complex feedback that human audiences create when humans collectively use the Web for communication, self-expression, and creation of knowledge).
  16. 16. Networks. The ability to work with massive network datasets has enriched the picture: we can study networks with billions of interacting items at a level of resolution where each connection is recorded (this is exactly what an Internet search engine is doing!). It is an ongoing and challenging scientific problem to bridge these vastly different levels of scale, so that predictions and principles from one level can be reconciled with those of others.
  18. 18. Need for data analytics. Just one example (although important in many applications): “node centrality”, the degree of influence or importance of a node within the social domain under consideration. One expects such importance to be reflected in the structure of the social network. How do we measure node centrality?
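
The slide leaves the question open at this point. Two standard candidate answers are degree centrality (how many neighbours a node has) and closeness centrality (how near a node is, on average, to everyone it can reach). The following is a minimal Python sketch of both, assuming an undirected, unweighted graph stored as a dictionary of neighbour sets. The function names and the four-node `toy` graph are hypothetical illustrations, not taken from the lecture.

```python
from collections import deque

def degree_centrality(adj):
    """Fraction of the other nodes that each node is directly connected to."""
    n = len(adj)
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}

def closeness_centrality(adj):
    """(number of reachable nodes) / (sum of BFS distances to them)."""
    result = {}
    for source in adj:
        # Breadth-first search gives hop distances from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total = sum(dist.values())
        # Isolated nodes reach nobody; give them centrality 0.
        result[source] = (len(dist) - 1) / total if total > 0 else 0.0
    return result

# Hypothetical toy network: "a" knows everyone, "d" knows only "a".
toy = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
deg = degree_centrality(toy)
clo = closeness_centrality(toy)
```

On this toy graph both measures rank "a" highest and "d" lowest. Note that the closeness computation runs one breadth-first search per node, which is already too slow for graphs with billions of edges.
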
  20. 20. 15th Century Florentine Marriages Data
  22. 22. The social network of friendships within a 34-person karate club provides clues to the fault lines that eventually split the club apart (Zachary, 1977)
  24. 24. Routing in transportation networks. Road networks, point-to-point shortest paths: seconds (Dijkstra) → microseconds. A. V. Goldberg. The hub labeling algorithm. SEA 2013.
  25. 25. Internet and the WWW •  The world-wide web can be represented as a directed graph •  Web search and crawl: traversal •  Link analysis, ranking: PageRank and HITS •  Document classification and clustering •  Internet topologies (router networks) are naturally modeled as graphs
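
To make the link-analysis bullet concrete, here is a minimal sketch of PageRank computed by power iteration on a directed graph. It assumes the common textbook formulation with a damping factor of 0.85 and uniform redistribution of rank from pages without out-links. The function name, parameters and the four-page `web` graph are hypothetical, not taken from the slides.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Power iteration for PageRank on a dict: page -> list of linked pages."""
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by dangling pages (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n
                    for v in nodes}
        # Every other page splits its rank evenly among its out-links.
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for w in out_links[v]:
                    new_rank[w] += share
        rank = new_rank
    return rank

# Hypothetical toy web: every page eventually links towards "C".
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(web)
```

Each iteration touches every node and every edge once, so its cost is linear in the size of the graph, which is what makes the method usable at Web scale.
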
  26. 26. Scientific Computing •  Reorderings for sparse solvers •  Fill reducing orderings •  Partitioning, eigenvectors •  Heavy diagonal to reduce pivoting (matching) •  Data structures for efficient exploitation of sparsity •  Derivative computations for optimization •  Matroids, graph colorings, spanning trees •  Preconditioning •  Incomplete Factorizations •  Partitioning for domain decomposition •  Graph techniques in algebraic multigrid •  Independent sets, matchings, etc. •  Support Theory •  Spanning trees & graph embedding techniques. B. Hendrickson, Graphs and HPC: Lessons for Future Architectures, http://www.er.doe.gov/ascr/ascac/Meetings/Oct08/Hendrickson%20ASCAC.pdf Image source: Yifan Hu, A gallery of large graphs. Image source: Tim Davis, UF Sparse Matrix Collection.
  27. 27. Large-scale data analysis •  Graph abstractions are very useful to analyze complex data sets. •  Sources of data: petascale simulations, experimental devices, the Internet, sensor networks •  Challenges: data size, heterogeneity, uncertainty, data quality. Astrophysics: massive datasets, temporal variations. Bioinformatics: data quality, heterogeneity. Social Informatics: new analytics challenges, data uncertainty. Image sources: (1) http://physics.nmt.edu/images/astro/hst_starfield.jpg (2,3) www.visualComplexity.com
  28. 28. Data Analysis and Graph Algorithms in Systems Biology •  Study of the interactions between various components in a biological system •  Graph-theoretic formulations are pervasive: •  Predicting new interactions: modeling •  Functional annotation of novel proteins: matching, clustering •  Identifying metabolic pathways: paths, clustering •  Identifying new protein complexes: clustering, centrality. Image Source: Giot et al., A Protein Interaction Map of Drosophila melanogaster, Science 302, 1722-1736, 2003.
  29. 29. Graph-theoretic problems in social networks –  Community identification: clustering –  Targeted advertising: centrality –  Information spreading: modeling. Image Source: Nexus (Facebook application)
  30. 30. Network Analysis for Intelligence and Surveillance •  [Krebs 04] Post 9/11 Terrorist Network Analysis from public domain information •  Plot masterminds correctly identified from interaction patterns: centrality •  A global view of entities is often more insightful •  Detect anomalous activities by exact/approximate subgraph isomorphism. Image Source: http://www.orgnet.com/hijackers.html Image Source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, CACM, 47 (3, March 2004): pp 45-47
  31. 31. Power of Recommendation Systems. Big Hits are now being informed by Big Data? •  Old (1990) British TV series was still popular •  Films featuring Kevin Spacey had always done well •  Movies directed by David Fincher (“The Social Network”) had a healthy share
  32. 32. Power of Recommendation Systems
  33. 33. We Need: 1. “Bigger machines”. Two main ideas behind Google’s computing platform: •  Google File System (GFS), a way of distributing data across hundreds or thousands of inexpensive computers. •  MapReduce, which breaks a given job into smaller pieces, sends those tasks out to the different computers, then gathers the answers in one central node. Hadoop is an open source implementation. Is this enough? MapReduce was not designed to analyze data sets threaded with connections… Google’s Pregel system was developed to work with graph structures, since MapReduce had fallen short.
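
The map, shuffle and reduce steps described on this slide can be imitated in a few lines of plain Python. The sketch below runs in a single process and does not use Hadoop's or Google's actual API. The function names, the edge list and the split into two `chunks` are hypothetical stand-ins for data spread across machines. It computes the degree of every node of a small graph.

```python
from collections import defaultdict

def map_phase(edge):
    """Map: each edge contributes one unit of degree to both endpoints."""
    u, v = edge
    yield (u, 1)
    yield (v, 1)

def shuffle(pairs):
    """Shuffle: group all emitted values by key (here, by node)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single answer."""
    return key, sum(values)

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
# In a real deployment each chunk would live on a different machine.
chunks = [edges[:2], edges[2:]]
mapped = [pair for chunk in chunks for edge in chunk
          for pair in map_phase(edge)]
degrees = dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())
```

A per-node count like this fits the pattern in a single round. Computations that follow chains of connections, such as shortest paths, typically need many such rounds, which is one way to read the slide's remark that MapReduce falls short on graph-structured data.
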
  34. 34. Need for “Bigger machines” In God we trust. All others must bring data. (attributed to W Edwards Deming)
  35. 35. We Need: 2. Smarter algorithms. We need more algorithms capable of turning “meaningless” numbers into actionable insights. Collecting large amounts of statistics and numbers brings little benefit if there is no layer of added algorithmic intelligence. Detecting signals from large amounts of real, live data is much like rapidly fishing for needles in a haystack. It is like finding needles the moment they are dropped into the haystack… NSA knows about it!
  36. 36. E.g., Anomaly detection
  37. 37. We Need: 3. Faster algorithms. “Progress in Algorithms Beats Moore’s Law” (from The White House advisory report 2010). Or, you cannot just throw HW at problems! •  Linear Programming: in 20 years, speed-ups quite evenly divided between algorithms and hardware improvements. •  Sparse linear systems: in 25 years, 10^4 hardware, 10^6 algorithms. •  The N-Body Problem: in 30 years, 10^7 hardware, 10^10 algorithms. Need staggering algorithmic advances for "big data".
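
Reading the figures on this slide as independent multiplicative speed-up factors (an assumption, since the slide does not say how they combine), the overall improvement would be their product:

```latex
\[
\underbrace{10^{4}}_{\text{hardware}} \times \underbrace{10^{6}}_{\text{algorithms}} = 10^{10}
\quad \text{(sparse linear systems, 25 years)},
\qquad
\underbrace{10^{7}}_{\text{hardware}} \times \underbrace{10^{10}}_{\text{algorithms}} = 10^{17}
\quad \text{(N-body problem, 30 years)}.
\]
```

Under that reading, the algorithmic factor exceeds the hardware factor by a ratio of $10^{2}$ for sparse linear systems and $10^{3}$ for the N-body problem.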
  38. 38. Google or Bing Maps
  40. 40. Routing in Road Networks. Typical road networks are huge: tens of millions of nodes and arcs. Getting directions with classical shortest-path algorithms (Dijkstra) will require seconds. That’s too slow! The algorithms have to run in milliseconds! A. V. Goldberg. The hub labeling algorithm. SEA 2013.
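
For reference, the classical baseline that this slide calls too slow can be sketched as follows: Dijkstra's algorithm with a binary heap, answering a single point-to-point query. This is not the hub labeling algorithm cited on the slide. The function name and the four-node `roads` graph with its travel times are hypothetical.

```python
import heapq

def dijkstra(graph, source, target):
    """Shortest-path distance from source to target, or None if unreachable.

    `graph` maps each node to a list of (neighbour, non-negative weight) pairs.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d  # first time the target is popped, d is final
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None

# Hypothetical toy road network with travel times in minutes.
roads = {
    "home": [("junction", 4), ("ring", 2)],
    "ring": [("junction", 1), ("office", 7)],
    "junction": [("office", 3)],
    "office": [],
}
best = dijkstra(roads, "home", "office")
```

Each query may explore a large part of the graph before it reaches the target, which is why a plain search of this kind takes seconds on networks with tens of millions of nodes.
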
  41. 41. We’ll focus on 3. Faster algorithms. For graphs with m edges and n nodes, this means that the algorithms should run in linear time and space, O(m+n), with small constant factors. Quadratic time and space is too much. Constants do matter.
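
As one illustration of what an O(m+n) algorithm looks like, the sketch below labels the connected components of an undirected graph with breadth-first search over adjacency lists. The function name and the six-node example are hypothetical. Nodes are assumed to be numbered 0 to n-1.

```python
from collections import deque

def connected_components(n, edges):
    """Label the connected components of an undirected graph in O(m+n)."""
    adj = [[] for _ in range(n)]      # O(n) space for the list headers
    for u, v in edges:                # O(m) time and space for the edges
        adj[u].append(v)
        adj[v].append(u)
    label = [-1] * n                  # -1 means "not visited yet"
    count = 0
    for start in range(n):
        if label[start] != -1:
            continue
        label[start] = count
        queue = deque([start])
        while queue:                  # each node is enqueued exactly once
            u = queue.popleft()
            for v in adj[u]:          # each edge is scanned twice in total
                if label[v] == -1:
                    label[v] = count
                    queue.append(v)
        count += 1
    return count, label

count, label = connected_components(6, [(0, 1), (1, 2), (3, 4)])
```

The adjacency-list representation matters as much as the traversal: an n × n adjacency matrix for a graph with ten million nodes would need on the order of 10^14 entries, which is the kind of quadratic space the slide rules out.
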
  42. 42. A Methodological Break
  43. 43. Theory: In theory, theory and practice are the same.
  44. 44. The real world out there… In practice, theory and practice are different...
  45. 45. Bridging the Gap between Theory and Practice? Theory is when you know something, but it doesn't work. Practice is when something works, but you don't know why. Wish to combine theory and practice… i.e., nothing works and you don't know why.
  52. 52. Disclaimer: BIG data is like teenage sex. Everyone talks about it. Nobody really knows how to do it. Everyone thinks everyone else is doing it. So everyone claims they are doing it… And like sex, the ones getting the most are smart enough not to talk about it!
  53. 53. Outline of Lectures. 1. Algorithms for BIG graphs: •  The centrality of centrality •  How to store BIG Graphs (WebGraph Framework) •  Four Degrees of Separation •  Diameter and Radius. 2. Big Data and Memory Errors.