Compute "Closeness" in Graphs using Apache Giraph.
Transcript

  • 1. Compute “Closeness” in Graphs using Apache Giraph … using probabilistic data structures
    Today: Validation
    IMPRO-3, TU Berlin, Winter 13/14
    Robert Metzger, Robert Waury
    13.1.2014, DIMA - TU Berlin
  • 2. Quick Recap on our Task
    ● Measure the number of nodes reachable within s steps from a node n in a graph → N(n,s).
      Example (Robert’s Xing network): N(“Robert”,1)=80, N(“Robert”,2)=10413, …
    ● The largest step count s at which N(n,s) still grows for some node n is the graph diameter.
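The neighbourhood function from the recap can be illustrated with a plain breadth-first search. This is only a minimal single-machine sketch, not the Giraph implementation from the slides; the function name `neighbourhood_size`, the adjacency-dict input, and the choice to exclude the start node from the count are assumptions made for illustration.

```python
from collections import deque


def neighbourhood_size(adj, start, max_steps):
    """Count the nodes reachable from `start` in at most `max_steps` hops.

    `adj` maps each node to a list of its neighbours.
    Assumption: the start node itself is not counted.
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_steps:
            continue  # do not expand beyond the step limit
        for neighbour in adj.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return len(dist) - 1


# Hypothetical toy graph: a path a-b-c-d plus an extra edge a-e.
toy = {"a": ["b", "e"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"], "e": ["a"]}
for s in (1, 2, 3):
    print(s, neighbourhood_size(toy, "a", s))
```

Running one search per node is exact but expensive on large graphs, which is the motivation for propagating compact set summaries instead.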
  • 3. What happened so far ...
    ● Giraph implementation:
      ○ a) Bitfield
      ○ b) Flajolet-Martin sketch
        ■ 32 bit with Thomas Wang’s integer hash
        ■ 64 bit with MurmurHash 2.0
      ○ c) HyperLogLog sketch with MurmurHash 2.0
    ● Drafted Stratosphere “Spargel” implementation
    ● Benchmarked a) and b) for AIM-3
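The bitfield and Flajolet-Martin variants listed above share one idea: every node holds a bit mask describing the set of nodes it has reached, and in each step it ORs in the masks of its neighbours. The following is a rough single-machine sketch of that idea, not the authors' Giraph code. All function names are hypothetical, MD5 is used only as a stand-in for the hash functions named on the slide, and a single Flajolet-Martin bitmap is used, which gives a very coarse estimate.

```python
import hashlib


def exact_bit(node_id):
    # Bitfield variant: one dedicated bit per node (exact, size grows with the graph).
    return 1 << node_id


def fm_bit(node_id):
    # Flajolet-Martin variant: set the bit at the index of the lowest set bit
    # of a 32-bit hash. MD5 is an assumption; the slides use other hashes.
    digest = hashlib.md5(str(node_id).encode()).digest()
    h = int.from_bytes(digest[:4], "big")
    if h == 0:
        return 1 << 31
    return 1 << ((h & -h).bit_length() - 1)


def propagate(adj, init, steps):
    """Run `steps` rounds in which each node ORs in its neighbours' masks."""
    state = {v: init(v) for v in adj}
    for _ in range(steps):
        new_state = {}
        for v, neighbours in adj.items():
            mask = state[v]
            for u in neighbours:
                mask |= state[u]  # uses masks from before this round
            new_state[v] = mask
        state = new_state
    return state


def exact_count(mask):
    # Number of set bits = number of reached nodes, including the node itself.
    return bin(mask).count("1")


def fm_estimate(mask):
    # R = index of the lowest zero bit; estimate is 2^R / 0.77351.
    r = 0
    while mask & (1 << r):
        r += 1
    return (2 ** r) / 0.77351


# Hypothetical toy graph with integer node ids.
toy = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2], 4: [0]}
print(exact_count(propagate(toy, exact_bit, 2)[0]) - 1)  # exact N(0, 2)
print(fm_estimate(propagate(toy, fm_bit, 2)[0]))         # coarse estimate
```

Because both representations merge with a bitwise OR, the propagation loop is identical and only the per-node initial mask and the counting function differ. A HyperLogLog sketch would merge by taking the register-wise maximum instead.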
  • 4. Validating the correctness of the implementation ...
    ● Approach: Take the “bitfield” implementation as the reference and measure how well the results of the other implementations correlate with it.
    ● On two (small) datasets:
      ○ General Relativity and Quantum Cosmology collaboration network (coauthor relationships). Largest connected component: 4,158 nodes.
      ○ Enron email network. Largest connected component: 33,696 nodes.
  • 5. Statistical Methods to determine correlation
    ● Kendall's τ (tau)
      ○ -1 ≤ τ ≤ 1
      ○ expects an order (ranking), e.g. Comparable interface ;-)
    ● Spearman's ρ (rho)
      ○ same properties as Kendall's τ, but checks whether the relation is monotonic (not just linear)
    ● Pearson’s r
      ○ checks for linear correlation
      ○ uses the actual values (not just ranks)
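To make the three measures concrete, here is a small pure-Python sketch. It is not the tooling used for the slides. The Kendall variant shown is tau-a (ties count as neither concordant nor discordant) and Spearman's ρ is computed as Pearson's r on average ranks; which variants the authors actually used is not stated, so both choices are assumptions.

```python
import math


def pearson_r(x, y):
    # Linear correlation on the actual values.
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    # 1-based ranks; tied values share the average of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def spearman_rho(x, y):
    # Monotonic correlation: Pearson's r applied to the ranks.
    return pearson_r(ranks(x), ranks(y))


def kendall_tau(x, y):
    # tau-a: (concordant pairs - discordant pairs) / number of pairs.
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            if product > 0:
                score += 1
            elif product < 0:
                score -= 1
    return score / (n * (n - 1) / 2)


# Hypothetical values: a reference result and an approximation of it.
reference = [1, 2, 3, 4]
approximation = [1, 3, 2, 4]
print(kendall_tau(reference, approximation))
print(spearman_rho(reference, approximation))
print(pearson_r(reference, approximation))
```

The pairwise loop in `kendall_tau` is quadratic in the number of nodes, which is acceptable for small validation graphs but would need a faster algorithm on larger ones.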
  • 6. Coauthorship Results (I)
              Kendall’s τ          Spearman’s ρ         Pearson’s r
      FM32    0.906881050538273    0.98765689317449     0.991695076216846
      FM64    0.905736944670186    0.987400738579957    0.991700042774567
      HLL     0.931782793461063    0.993272573234886    0.9956213651786
    → High (linear) correlation with all metrics ✔
    → HyperLogLog has the highest correlation and the best memory properties
  • 7. Coauthorship Results (II)
              Top10    Top100    Top1000     Last1    Last100
      FM32    6/10     76/100    891/1000    1/1      94/100
      FM64    5/10     69/100    881/1000    1/1      94/100
      HLL     8/10     80/100    932/1000    1/1      95/100
    → HLL is the best approximation
    → outliers can be identified with higher confidence than central nodes
    → nodes with the highest closeness tend to have similar values
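The Top-k and Last-k columns can be read as set overlaps between the reference ranking and an approximate ranking. The sketch below shows one way to compute such an overlap. It assumes that a higher score means a higher rank and that ties are broken by node id; the function name and both conventions are illustrative assumptions, not taken from the slides.

```python
def top_k_overlap(reference, approximation, k, last=False):
    """Count nodes that appear in both top-k (or bottom-k) sets.

    `reference` and `approximation` map node ids to scores.
    Assumption: higher score means higher rank; ties are broken by node id.
    """
    def pick(scores):
        ordered = sorted(scores, key=lambda v: (scores[v], str(v)), reverse=not last)
        return set(ordered[:k])

    return len(pick(reference) & pick(approximation))


# Hypothetical scores for five nodes.
reference = {"a": 5, "b": 4, "c": 3, "d": 2, "e": 1}
approximation = {"a": 4.9, "b": 3.1, "c": 3.2, "d": 1.0, "e": 1.5}
print(top_k_overlap(reference, approximation, 2))             # overlap of the top 2
print(top_k_overlap(reference, approximation, 2, last=True))  # overlap of the last 2
```

A cell such as 8/10 in the table would correspond to an overlap of 8 for k = 10.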
  • 8. Enron Results (I)
              Kendall’s τ           Spearman’s ρ          Pearson’s r
      FM32    0.9138299158409239    0.9880939188638478    0.9935462917118506
      FM64    0.8894530452951206    0.9803803899254973    0.9902062846287614
      HLL     0.9335364446051608    0.9927569721570411    0.9966840593148085
    → High (linear) correlation with all metrics ✔
    → HyperLogLog has the highest correlation and the best memory properties
  • 9. Enron Results (II)
              Top10    Top100    Top1000     Last1    Last100
      FM32    5/10     80/100    877/1000    1/1      96/100
      FM64    7/10     66/100    839/1000    1/1      97/100
      HLL     8/10     86/100    889/1000    1/1      97/100
    → HLL is again the best approximation
    → outliers can be identified with higher confidence than central nodes
  • 10. Validation Summary
    ● HyperLogLog exhibits the highest correlation in all experiments. It also has the lowest memory footprint.
    ● We assume that these results hold for larger data sets.
  • 11. Next step
    ● Benchmark implementations with larger datasets (that require Giraph out-of-core execution)
    ● Datasets:
      Name            Description                                             Vertices       Edges            Text file size (GB)
      webbase-2001    The data of Stanford's WebBase 2001 crawl as a graph    118,142,155    1,019,903,190    9.46
      twitter-2010    Follower relationships                                  41,652,230     1,468,365,182    12.49
  • 12. References
    ● U. Kang, Charalampos E. Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. 2011. HADI: Mining Radii of Large Graphs. ACM Trans. Knowl. Discov. Data 5, 2, Article 8 (February 2011), 24 pages.
    ● U Kang, Spiros Papadimitriou, Jimeng Sun, and Hanghang Tong. 2011. Centralities in Large Networks: Algorithms and Observations. In SIAM International Conference on Data Mining (SDM) 2011, Mesa, Arizona, USA.
    ● Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT '13). ACM, New York, NY, USA, 683-692.
    ● Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.
    ● Formulas taken from Wikipedia.
