Compute "Closeness" in Graphs using Apache Giraph.
 

    Presentation Transcript

    • Compute “Closeness” in Graphs using Apache Giraph … using probabilistic data structures.
      Today: Validation
      IMPRO-3, TU Berlin, Winter 13/14
      Robert Metzger, Robert Waury
      13.1.2014, DIMA - TU Berlin
    • Quick Recap on our Task
      ● Measure the number of nodes reachable within s steps from a node a in a graph → N(a,s).
        N(“Robert”,1)=80
        N(“Robert”,2)=10413
        …
      ● The largest s at which N() still grows for some node is the graph diameter.
      (Figure: Robert’s Xing Network)
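To make the definition of N(a,s) concrete, here is a minimal sketch that computes it exactly with a breadth-first search on a toy graph. The function name `neighbourhood_sizes`, the adjacency-dictionary input, and the choice not to count the start node itself are assumptions for illustration; this is not the Giraph implementation from the talk.

```python
from collections import deque


def neighbourhood_sizes(adj, start, max_steps):
    """Return [N(start, 0), ..., N(start, max_steps)].

    N(a, s) is taken to be the number of nodes reachable from a in at most
    s steps, not counting a itself (an assumption of this sketch).
    """
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == max_steps:
            continue  # do not expand beyond the step limit
        for neighbour in adj.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)

    # Count the nodes found at each exact distance, then accumulate.
    at_distance = [0] * (max_steps + 1)
    for node, d in dist.items():
        if node != start:
            at_distance[d] += 1
    sizes, total = [], 0
    for count in at_distance:
        total += count
        sizes.append(total)
    return sizes


# Hypothetical toy graph: a path a-b-d-e with an extra leaf c attached to a.
toy = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a"], "d": ["b", "e"], "e": ["d"]}
sizes = neighbourhood_sizes(toy, "a", 4)
```

On this toy graph the counts stop growing once every node has been reached, which is the behaviour the recap relies on.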
    • What happened so far ...
      ● Giraph implementation:
        ○ a) Bitfield
        ○ b) Flajolet-Martin sketch
          ■ 32 bit with Thomas Wang’s integer hash
          ■ 64 bit with MurmurHash 2.0
        ○ c) HyperLogLog sketch with MurmurHash 2.0
      ● Drafted Stratosphere “Spargel” implementation
      ● Benchmarked a) and b) for AIM-3
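As a rough illustration of option b), the following sketch shows a single 64-bit Flajolet-Martin bitmap with a union by bitwise OR. It is a simplified sketch under stated assumptions: `hashlib.blake2b` stands in for the hash functions named on the slide, the helper names are hypothetical, and a practical estimator would average many such bitmaps to reduce variance.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


def hash64(item):
    """64-bit hash of an item (stand-in for the hashes named on the slide)."""
    digest = hashlib.blake2b(str(item).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def fm_add(sketch, item, bits=64):
    """Set the bit at the index of the lowest set bit of the item's hash."""
    h = hash64(item)
    position = bits - 1 if h == 0 else (h & -h).bit_length() - 1
    return sketch | (1 << position)


def fm_merge(left, right):
    """The sketch of a union of two sets is the bitwise OR of their sketches."""
    return left | right


def fm_estimate(sketch):
    """Estimate cardinality from the index of the lowest unset bit."""
    r = 0
    while sketch & (1 << r):
        r += 1
    return (2 ** r) / PHI


def fm_of(items):
    """Build a sketch from an iterable of items."""
    sketch = 0
    for item in items:
        sketch = fm_add(sketch, item)
    return sketch


first = fm_of(range(0, 100))
second = fm_of(range(50, 150))
merged = fm_merge(first, second)
```

The OR-based merge is what makes such sketches a drop-in replacement for the bitfield: a vertex can combine the sketches received from its neighbours without knowing which individual nodes they contain.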
    • Validating the correctness of the implementation ...
      ● Approach: Assume the “bitfield” implementation as the reference and measure the correlation with the results from the other implementations.
      ● On two (small) datasets:
        ○ General Relativity and Quantum Cosmology collaboration network (coauthor relationships). Largest connected component (CC): 4,158 nodes.
        ○ Enron email network. Largest CC: 33,696 nodes.
    • Statistical Methods to determine correlation
      ● Kendall's τ (tau)
        ○ -1 ≤ τ ≤ 1
        ○ expects an order (ranking), e.g. Comparable interface ;-)
      ● Spearman's ρ (rho)
        ○ same properties as Kendall's τ, but checks whether the relation is monotonic (not just linear)
      ● Pearson’s r
        ○ checks for linear correlation
        ○ uses the actual values (not just ranks)
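The three measures can be sketched in plain Python as follows. This is an illustrative implementation, not the tooling used for the talk: the function names are assumptions, Kendall's τ is computed in its simple tau-a form (tied pairs count as neither concordant nor discordant), and Spearman's ρ is computed as Pearson's r on average ranks.

```python
import math


def pearson(x, y):
    """Pearson's r: linear correlation of the actual values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson(ranks(x), ranks(y))


def kendall_tau(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    n = len(x)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (x[i] - x[j]) * (y[i] - y[j])
            score += (product > 0) - (product < 0)
    return score / (n * (n - 1) / 2)


# Hypothetical reference values and approximations for five nodes.
exact = [10, 20, 30, 40, 50]
approx = [12, 19, 33, 90, 45]
tau = kendall_tau(exact, approx)
rho = spearman(exact, approx)
r = pearson(exact, approx)
```

In this made-up example only the last two nodes swap places, so the rank-based measures stay high, while the outlier value 90 affects Pearson's r because it uses the actual values.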
    • Coauthorship Results (I)

              Kendall’s τ          Spearman’s ρ         Pearson’s r
      FM32    0.906881050538273    0.98765689317449     0.991695076216846
      FM64    0.905736944670186    0.987400738579957    0.991700042774567
      HLL     0.931782793461063    0.993272573234886    0.9956213651786

      → High (linear) correlation with all metrics ✔
      → HyperLogLog has the highest correlation and the best memory properties
    • Coauthorship Results (II)

              Top10    Top100    Top1000     Last1    Last100
      FM32    6/10     76/100    891/1000    1/1      94/100
      FM64    5/10     69/100    881/1000    1/1      94/100
      HLL     8/10     80/100    932/1000    1/1      95/100

      → HLL is the best approximation
      → outliers can be identified with higher confidence than central nodes
      → nodes with the highest closeness tend to have similar values
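The slides do not state how the Top-k and Last-k figures were computed. One plausible reading, sketched here as an assumption, is the size of the intersection between the k highest-ranked (or lowest-ranked) nodes of the reference and of the approximation. The function name `top_k_overlap` and the score dictionaries are hypothetical.

```python
def top_k_overlap(reference, approx, k, largest=True):
    """Count the nodes that are among the k best of both score dictionaries.

    With largest=False the k lowest-scoring nodes are compared instead,
    which corresponds to the "Last" columns under this sketch's reading.
    """
    def best(scores):
        # Ties are broken by node name so the result is deterministic.
        ordered = sorted(scores, key=lambda node: (scores[node], node),
                         reverse=largest)
        return set(ordered[:k])

    return len(best(reference) & best(approx))


# Hypothetical scores for five nodes.
reference = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 2.0, "e": 1.0}
approx = {"a": 5.2, "b": 2.9, "c": 3.1, "d": 2.0, "e": 0.5}
top2 = top_k_overlap(reference, approx, 2)
last1 = top_k_overlap(reference, approx, 1, largest=False)
```

Because nodes near the top can have very similar scores, a small estimation error is enough to swap them, which fits the observation that the top ranks match less often than the bottom ones.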
    • Enron Results (I)

              Kendall’s τ           Spearman’s ρ          Pearson’s r
      FM32    0.9138299158409239    0.9880939188638478    0.9935462917118506
      FM64    0.8894530452951206    0.9803803899254973    0.9902062846287614
      HLL     0.9335364446051608    0.9927569721570411    0.9966840593148085

      → High (linear) correlation with all metrics ✔
      → HyperLogLog has the highest correlation and the best memory properties
    • Enron Results (II)

              Top10    Top100    Top1000     Last1    Last100
      FM32    5/10     80/100    877/1000    1/1      96/100
      FM64    7/10     66/100    839/1000    1/1      97/100
      HLL     8/10     86/100    889/1000    1/1      97/100

      → HLL is again the best approximation
      → outliers can be identified with higher confidence than central nodes
    • Validation Summary
      ● HyperLogLog exhibits the highest correlation in all experiments. It also has the lowest memory footprint.
      ● We assume that these results hold for larger data sets.
    • Next step
      ● Benchmark implementations with larger datasets (that require Giraph out-of-core execution)
      ● Datasets:

      Description                                               Name            Vertices       Edges            Text File Size in GB
      The data of Stanford's WebBase 2001 crawl as a graph      webbase-2001    118,142,155    1,019,903,190    9.46
      Follower relationships                                    twitter-2010    41,652,230     1,468,365,182    12.49
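A back-of-the-envelope calculation suggests why datasets of this size call for sketches and out-of-core execution. The sketch below assumes an uncompressed bitfield with one bit per vertex stored at every vertex, and a single fixed-size 64-bit sketch per vertex. Both are simplifying assumptions for illustration, not measurements of the actual implementations.

```python
def bitfield_bytes(num_vertices):
    """Total bytes if every vertex keeps one bit for every vertex."""
    bytes_per_vertex = (num_vertices + 7) // 8  # round up to whole bytes
    return bytes_per_vertex * num_vertices


def sketch_bytes(num_vertices, sketch_bits=64):
    """Total bytes if every vertex keeps one fixed-size sketch."""
    return num_vertices * sketch_bits // 8


# Vertex counts taken from the dataset table.
datasets = {"webbase-2001": 118_142_155, "twitter-2010": 41_652_230}

for name, vertices in datasets.items():
    print(name,
          "bitfield:", round(bitfield_bytes(vertices) / 1e12, 1), "TB,",
          "64-bit sketch:", round(sketch_bytes(vertices) / 1e9, 2), "GB")
```

Under these assumptions the bitfield grows quadratically with the number of vertices, while the per-vertex sketch grows only linearly.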
    • References
      ● U. Kang, Charalampos E. Tsourakakis, Ana Paula Appel, Christos Faloutsos, and Jure Leskovec. 2011. HADI: Mining Radii of Large Graphs. ACM Trans. Knowl. Discov. Data 5, 2, Article 8 (February 2011), 24 pages.
      ● U Kang, Spiros Papadimitriou, Jimeng Sun, and Hanghang Tong. 2011. Centralities in Large Networks: Algorithms and Observations. In SIAM International Conference on Data Mining (SDM) 2011, Mesa, Arizona, USA.
      ● Stefan Heule, Marc Nunkesser, and Alexander Hall. 2013. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm. In Proceedings of the 16th International Conference on Extending Database Technology (EDBT '13). ACM, New York, NY, USA, 683-692.
      ● Paolo Boldi, Marco Rosa, and Sebastiano Vigna. 2011. HyperANF: approximating the neighbourhood function of very large graphs on a budget. In Proceedings of the 20th international conference on World wide web (WWW '11). ACM, New York, NY, USA, 625-634.
      ● Formulas taken from Wikipedia.