Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Benchmarking graph databases on the problem of community detection


Published on

Presentation given in ADBIS 2014.

Published in: Technology

Benchmarking graph databases on the problem of community detection

  1. 1. Benchmarking graph databases on the problem of community detection Sotirios Beis, Symeon Papadopoulos, Yiannis Kompatsiaris Centre for Research and Technology Hellas, Information Technologies Institute (CERTH-ITI) ADBIS 2014 Conference Ohrid, Republic of Macedonia, September 7-10, 2014
  2. 2. Important Note • This is an update of the actual presentation given in September 2014. After the presentation, we received comments about the implementation of our benchmark, which led us to update the implementation and rerun the experiments. We would like to thank @lgarulli and @OrientDB for their contribution. • The updated paper is now available on: while the original is still available on the Springer site: #2
  3. 3. #3 Overview • Problem formulation • Related work • Workload description • Evaluation • Conclusions
  4. 4. Motivation • The rapid growth of Online Social Networks contributes to the creation of high-volume and velocity graph data – Twitter had 145M users in 2010 and 300M in 2011 • The need for massive graph data management and processing systems is constantly increasing – Many methods and technologies available (RDBMS, OODBMS and graph databases) • Benchmarks to evaluate candidate solutions are absolutely essential! – We had to choose one in the context of the SocialSensor research project #4
  5. 5. Problem Formulation • Given a set of database management systems assess the performance of each system with respect to a common mining task: community detection • Many DBMS available to use – Graph databases selected, as they are designed to store and manage efficiently big graph data – Benchmarked systems: Titan, OrientDB and Neo4j • Many operations used in databases to benchmark – Workloads that simulate common operations in OSNs employed #5
  6. 6. Related Work (1) • Evaluation between different DBMS – Giatsoglou et al. focused on the Social Tagging System use case scenario – Angles et al. and Armstrong et al. proposed a synthetic graph generator with OSN characteristics and use this data for evaluation – Grossniklaus et al. used a workloads that cover a wide variety of graph data use case – Vicknair et al. utilized query workload that simulates typical operations performed in provenance system #6
  7. 7. Related Work (2) • Evaluation between graph DBMS – Bader et al. proposed a benchmark with four operations: insert, h-hops node and edge selection, while Dominguez et al. reports the implementation. – A similar workload is proposed by Jouili et al. emphasizing the effects of increasing multiple concurrent users. – Ciglan et al. proposed a benchmark focusing on graph traversal operations. – Dayarathna et al. used similar workload, focusing on graph databases server and cloud environments. #7
  8. 8. Workloads Description (1) Our Contribution •Clustering Workload (CW) – Very important due to its numerous applications, such as topic detection, photo clustering and event detection. – Until now most community detection algorithms used main memory to store the graph and perform the required operations  fast execution time, unable to manage big graphs. – We propose an implementation of Louvain Method on top of graph databases employing cache techniques  take advantage of both graph database capabilities and in-memory execution speed. #8
  9. 9. Workload Description (2) Supplementary Workloads •Massive Insertion Workload (MIW) – Populates a graph with bulk load operations. – Simulates batch creation of a graph. •Single Insertion Workload (SIW) – Every object insertion is committed directly. – Simulates the growth of a OSN. •Query Workload (QW) – FindNeighbors query (FN)  Finds the neighbors of all nodes • simulates the friends retrieval a Facebook user – FindAdjacentNodes query (FA)  Finds the adjacent nodes of all edges • used to find whether two user joined the same Facebook group – FindShortestPath (FS)  Finds the shortest paths between 100 nodes • used to find the level two LinkedIn users are connected #9
  10. 10. Experimental Study • Datasets – Synthetic for CW, generated with LFR generator. – Real for MIW, SIW & QW, from SNAP dataset collection. • Evaluation Measure: execution time (second) #10
  11. 11. Results: Comparison wrt CW #11
  12. 12. Results: Cache size impact #12 Titan Neo4j OrientDB
  13. 13. Results: comparison wrt MIW and QW #13
  14. 14. Results: Comparison wrt SIW YouTube LiveJournal #14
  15. 15. Conclusions SUMMARY • OrientDB is the most efficient solution for CW, while Titan is the winner in SIW • Neo4j is the fastest graph database for MIW and QW • Titan and OrientDB have competitive performance in MIW and QW respectively  don't scale as good as Neo4j FUTURE WORK • Test with bigger graphs. • Run the benchmark employing the distributed version of Titan and OrientDB and other graph databases (Sparksee already added). • Improve performance of the implemented community detection method. #15
  16. 16. References (1) Evaluation between different DBMS Giatsoglou, M., Papadopoulos, S., Vakali, A.: Massive graph management for the web and web 2.0. In Vakali, A., Jain, L., eds.: New Directions in Web Data Management 1. Volume 331 of Studies in Computational Intelligence. Springer Berlin Heidelberg (2011) 19-58 Angles, R., Prat-Perez, A., Dominguez-Sal, D., Larriba-Pey, J.L.: Benchmarking database systems for social network applications. In: First International Workshop on Graph Data Management Experiences and Systems. GRADES '13, New York, NY, USA, ACM (2013) 15:1-15:7 Armstrong, T.G., Ponnekanti, V., Borthakur, D., Callaghan, M.: Linkbench: a database benchmark based on the facebook social graph. (2013) Grossniklaus, M., Leone, S., Zaschke, T.: Towards a benchmark for graph data management and processing. (2013) Vicknair, C., Macias, M., Zhao, Z., Nan, X., Chen, Y., Wilkins, D.: A comparison of a graph database and a relational database: A data provenance perspective. In: Proceedings of the 48th Annual Southeast Regional Conference. ACM SE '10, New York, NY, USA, ACM (2010) 42:1-42:6 #16
  17. 17. References (2) Evaluation between graph DBMS Bader, D.A., Feo, J., Gilbert, J., Kepner, J., Koester, D., Loh, E., Madduri, K., Mann, B., Meuse, T., Robinson, E.: Hpc scalable graph analysis benchmark (2009) Dominguez-Sal, D., Urbn-Bayes, P., Gimnez-Va, A., Gmez-Villamor, S., Martnez- Bazn, N., Larriba-Pey, J.: Survey of graph database performance on the hpc scalable graph analysis benchmark. In Shen, H., Pei, J., zsu, M., Zou, L., Lu, J., Ling, T.W., Yu, G., Zhuang, Y., Shao, J., eds.:Web-Age Information Management. Volume 6185 of Lecture Notes in Computer Science. Springer Berlin Heidelberg (2010) 37-48 Ciglan, M., Averbuch, A., Hluchy, L.: Benchmarking traversal operations over graph databases. In: Data Engineering Workshops (ICDEW), 2012 IEEE 28th International Conference on. (April 2012) 186-189 Jouili, S., Vansteenberghe, V.: An empirical comparison of graph databases. In: Social Computing (SocialCom), 2013 International Conference on. (Sept 2013) 708- 715 Dayarathna, M., Suzumura, T.: Xgdbench: A benchmarking platform for graph stores in exascale clouds. In: Cloud Computing Technology and Science (Cloud-Com), 2012 IEEE 4th International Conference on. (Dec 2012) 363-370 #17
  18. 18. Source: #18 Questions Further contact: / Acknowledgement: