Compute "Closeness" in Graphs using Apache Giraph.

Compute “Closeness” in Graphs using Apache Giraph
… using probabilistic data structures.
Today: Validation

IMPRO-3, TU Berlin, Winter 13/14
Robert Metzger, Robert Waury
13.1.2014, DIMA - TU Berlin

Quick Recap on our Task
● Measure the number of nodes reachable within s steps from a node a in a graph.
→ N(a,s)
N(“Robert”,1)=80
N(“Robert”,2)=10413
…
● The smallest s at which N(a,s) stops growing for every node a is the graph diameter.
(Figure: Robert’s Xing Network)

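The definition of N(a,s) above can be sketched on a toy graph. This is a minimal single-machine illustration, not the Giraph code from the slides: each node keeps a bit set of the nodes it has reached and ORs in its neighbours' sets once per step, which is the idea behind an exact "bitfield" variant. The function name `neighborhood_sizes`, the toy graph, and the choice to exclude the node itself from its own count are assumptions.

```python
def neighborhood_sizes(adjacency, max_steps):
    """Return {node: [N(node,1), ..., N(node,max_steps)]} for an undirected graph.

    `adjacency` maps every node to a list of its neighbours (hypothetical input
    format). A Python int serves as the bit set.
    """
    index = {node: i for i, node in enumerate(adjacency)}
    # Bit i of reach[node] is set when the node with index i has been reached.
    reach = {node: 1 << index[node] for node in adjacency}
    sizes = {node: [] for node in adjacency}
    for _ in range(max_steps):
        updated = {}
        for node, neighbours in adjacency.items():
            bits = reach[node]
            for nb in neighbours:
                bits |= reach[nb]  # union with what the neighbour reached so far
            updated[node] = bits
        reach = updated
        for node in adjacency:
            # Assumption: the node itself is not counted in N(node, s).
            sizes[node].append(bin(reach[node]).count("1") - 1)
    return sizes


# Path graph a - b - c - d
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
sizes = neighborhood_sizes(graph, 3)
print(sizes["a"])  # neighbourhood sizes of "a" for s = 1, 2, 3
```

On this path graph the counts of the end nodes keep growing until s = 3, which matches its diameter of 3.
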
What happened so far ...
● Giraph Implementation:
○ a) Bitfield
○ b) Flajolet Martin Sketch
■ 32 bit with Thomas Wang’s integer hash
■ 64 bit with MurmurHash 2.0
○ c) HyperLogLogSketch with MurmurHash 2.0
● Drafted Stratosphere “Spargel” implementation
● Benchmarked a) and b) for AIM-3

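To illustrate why a Flajolet Martin sketch can stand in for the bitfield, here is a minimal sketch of a single FM bitmask. It is not the authors' implementation: it uses MD5 from the standard library as a stand-in hash (the slides name Thomas Wang's integer hash and MurmurHash 2.0), and the helper names `fm_add`, `fm_merge`, and `fm_estimate` are hypothetical. The key property is that two sketches merge with a bitwise OR, just like two bitfields.

```python
import hashlib

PHI = 0.77351  # correction constant from the Flajolet-Martin analysis


def _hash32(item):
    """Stand-in 32-bit hash (assumption: MD5 instead of the hashes in the slides)."""
    digest = hashlib.md5(str(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def fm_add(sketch, item, bits=32):
    """Set the bit at the position of the lowest set bit of the item's hash."""
    h = _hash32(item)
    position = (h & -h).bit_length() - 1 if h else bits - 1
    return sketch | (1 << position)


def fm_merge(left, right):
    """The sketch of a union is the bitwise OR of the sketches."""
    return left | right


def fm_estimate(sketch):
    """Estimate the cardinality from the index of the lowest unset bit."""
    r = 0
    while sketch & (1 << r):
        r += 1
    return (2 ** r) / PHI


sketch_a = 0
for node in range(0, 50):
    sketch_a = fm_add(sketch_a, node)
sketch_b = 0
for node in range(25, 100):
    sketch_b = fm_add(sketch_b, node)
sketch_union = 0
for node in range(0, 100):
    sketch_union = fm_add(sketch_union, node)

print(fm_merge(sketch_a, sketch_b) == sketch_union)  # True
```

A single bitmask gives a coarse estimate with high variance; practical implementations typically average over several bitmasks.
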
Validating the correctness of the implementation ...
● Approach: Assume the “bitfield” implementation as the reference and measure the correlation with the results from the other implementations.
● On two (small) datasets:
○ General Relativity and Quantum Cosmology collaboration network (coauthor relationships). Largest connected component (CC): 4,158 nodes.
○ Enron email network. Largest CC: 33,696 nodes.

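Both datasets are restricted to their largest connected component. How the authors extracted it is not stated; a minimal breadth-first sketch, assuming an undirected graph given as a dict from node to neighbour list (a hypothetical input format), could look like this:

```python
from collections import deque


def largest_connected_component(adjacency):
    """Return the node set of the largest connected component.

    Assumes an undirected graph in which every node appears as a key.
    """
    seen = set()
    best = set()
    for start in adjacency:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nb in adjacency[node]:
                if nb not in component:
                    component.add(nb)
                    queue.append(nb)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


toy = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
largest = largest_connected_component(toy)
print(sorted(largest))
```
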
Statistical Methods to determine correlation
● Kendall's τ (tau)
○ -1 ≤ τ ≤ 1
○ expects an order (ranking), e.g. Comparable interface ;-)
● Spearman's ρ (rho)
○ same range and rank-based input as Kendall; checks whether the relation is monotonic (not just linear)
● Pearson’s r
○ checks for linear correlation
○ uses the actual values (not just ranks)

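The three coefficients listed above can be sketched in plain Python. This is an illustration only: the slides do not say which library or which Kendall variant was used, so the tau-a variant (no tie correction) and the function names are assumptions, and the sample values are made up.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Linear correlation computed on the actual values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman(x, y):
    """Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau_a(x, y):
    """(concordant - discordant) pairs divided by all pairs; O(n^2)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(x)
    return (concordant - discordant) / (n * (n - 1) / 2)


exact = [10, 20, 30, 40, 50]    # made-up reference values
approx = [12, 18, 33, 90, 41]   # made-up estimates, last two swapped in rank
print(kendall_tau_a(exact, approx), spearman(exact, approx), pearson(exact, approx))
```

The example swaps the order of one pair: of the 10 pairs, 9 are concordant and 1 is discordant. The quadratic pair loop is fine for a toy input but would be slow on tens of thousands of nodes.
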
Coauthorship Results (I)

        Kendall’s τ          Spearman’s ρ         Pearson’s r
FM32    0.906881050538273    0.98765689317449     0.991695076216846
FM64    0.905736944670186    0.987400738579957    0.991700042774567
HLL     0.931782793461063    0.993272573234886    0.9956213651786

→ High (linear) correlation with all metrics ✔
→ HyperLogLog has highest correlation and has best memory properties

Coauthorship Results (II)

        Top10    Top100    Top1000     Last1    Last100
FM32    6/10     76/100    891/1000    1/1      94/100
FM64    5/10     69/100    881/1000    1/1      94/100
HLL     8/10     80/100    932/1000    1/1      95/100

→ HLL the best approximation
→ outliers can be identified with higher confidence than central nodes
→ nodes with highest closeness tend to have similar values

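The slides do not state how the Top-k and Last-k figures were computed. One plausible reading, assumed here, is the size of the overlap between the k best (or worst) ranked nodes of the reference and of the approximation. The function name `top_k_overlap` and the scores are made up for illustration.

```python
def top_k_overlap(reference, estimate, k, largest=True):
    """Count nodes that are in the top-k (or bottom-k) of both score dicts.

    Ties are broken by dict insertion order, since sorted() is stable.
    """
    def pick(scores):
        ordered = sorted(scores, key=scores.get, reverse=largest)
        return set(ordered[:k])

    return len(pick(reference) & pick(estimate))


reference = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0, "e": 1.0}
estimate = {"a": 5.2, "b": 2.9, "c": 3.1, "d": 2.5, "e": 0.7}

top2 = top_k_overlap(reference, estimate, 2)                  # like "Top2"
last2 = top_k_overlap(reference, estimate, 2, largest=False)  # like "Last2"
print(f"{top2}/2", f"{last2}/2")
```

When many nodes have nearly equal scores, small estimation errors reorder them, which lowers the overlap for that part of the ranking.
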
Enron Results (I)

        Kendall’s τ           Spearman’s ρ          Pearson’s r
FM32    0.9138299158409239    0.9880939188638478    0.9935462917118506
FM64    0.8894530452951206    0.9803803899254973    0.9902062846287614
HLL     0.9335364446051608    0.9927569721570411    0.9966840593148085

→ High (linear) correlation with all metrics ✔
→ HyperLogLog has highest correlation and has best memory properties

Enron Results (II)

        Top10    Top100    Top1000     Last1    Last100
FM32    5/10     80/100    877/1000    1/1      96/100
FM64    7/10     66/100    839/1000    1/1      97/100
HLL     8/10     86/100    889/1000    1/1      97/100

→ HLL again best approximation
→ outliers can be identified with higher confidence than central nodes

Validation Summary
● HyperLogLog exhibits the highest correlation in all experiments. It also has the lowest memory footprint.
● We assume that these results hold for larger data sets.

Next step
● Benchmark implementations with larger datasets (that require Giraph out-of-core execution)
● Datasets:

Description                                             Name            Vertices       Edges            Text File Size in GB
The data of Stanford's WebBase 2001 crawl as a graph    webbase-2001    118,142,155    1,019,903,190    9.46
Follower relationships                                  twitter-2010    41,652,230     1,468,365,182    12.49

References
U. Kang, Charalampos E. Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. 2011. HADI: Mining Radii of Large Graphs. ACM Trans. Knowl. Discov. Data 5, 2, Article 8 (February 2011), 24 pages.
U. Kang, Spiros Papadimitriou, Jimeng Sun, and Hanghang Tong. 2011. Centralities in Large Networks: Algorithms and Observations. SIAM International Conference on Data Mining (SDM) 2011, Mesa, Arizona, USA.
Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT '13). ACM, New York, NY, USA, 683-692.
Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.
Formulas taken from Wikipedia.
