This is my comprehensive viva report version 4.
While doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
A graph is a generic data structure and is a superset of lists and trees. Binary search on a sorted list can be interpreted as a balanced binary tree search. Database tables can be thought of as indexed lists, and table joins represent relations between columns; this can be modeled as graphs instead. Assignment of registers to variables (by a compiler) and assignment of available channels to radio transmitters are also graph problems. Finding the shortest path between two points and sorting web pages in order of importance are graph problems as well. Neural networks are graphs too. Interactions between messenger molecules in the body and interactions between people on social media are also modeled as graphs.
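The binary-search-as-balanced-tree observation can be sketched directly; this is an illustrative example, not material from the report itself:

```python
def binary_search(sorted_list, target):
    """Each probe visits a node of the implicit balanced binary search tree
    rooted at the middle element of the list."""
    lo, hi = 0, len(sorted_list) - 1
    while lo <= hi:
        mid = (lo + hi) // 2              # the "tree node" visited at this level
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            lo = mid + 1                  # descend into the right subtree
        else:
            hi = mid - 1                  # descend into the left subtree
    return -1

print(binary_search([2, 3, 5, 7, 11, 13], 11))  # 4
```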
The document proposes using an A* algorithm along with a relational framework to more efficiently calculate shortest paths in graph data stored in a relational database. The system initializes a source node, then iteratively selects the next frontier node and expands paths until the target node is found. Experimental results on road network data show the proposed approach has faster execution time than bidirectional search, especially on larger datasets containing over 500,000 records. The approach requires more memory than bidirectional search but is more efficient than other shortest path algorithms.
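The select-frontier-then-expand loop described above is, in essence, A*. A minimal in-memory sketch follows; the paper's relational, database-backed variant is not reproduced here, and the graph and heuristic are made up for illustration:

```python
import heapq

def a_star(edges, h, source, target):
    """A* shortest path. `edges` maps node -> [(neighbor, weight)]; `h` is an
    admissible heuristic (h = 0 reduces this to Dijkstra's algorithm)."""
    dist = {source: 0.0}
    frontier = [(h(source), source)]
    while frontier:
        _, u = heapq.heappop(frontier)    # select the most promising frontier node
        if u == target:
            return dist[u]
        for v, w in edges.get(u, []):
            nd = dist[u] + w              # expand the path through u
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(frontier, (nd + h(v), v))
    return float("inf")

graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 2)], "C": []}
print(a_star(graph, lambda n: 0, "A", "C"))  # 3.0 (via A -> B -> C)
```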
Skyline Query Processing using Filtering in Distributed Environment (IJMER)
This document summarizes a research paper about skyline query processing in distributed databases. Skyline queries return multidimensional data points that are not dominated by other points. In distributed databases, skyline queries must be processed across multiple data sites. The paper proposes using multiple filtering points selected from each local skyline result to reduce the number of false positive results and communication costs between sites. Two heuristics called MaxSum and MaxDist are described for selecting filtering points that maximize their combined dominating potential across sites to improve distributed skyline query processing performance.
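The dominance relation underlying skyline queries can be sketched as follows — a naive quadratic skyline for illustration, not the paper's filtered distributed algorithm:

```python
def dominates(p, q):
    """p dominates q if p is at least as good in every dimension and strictly
    better in at least one (here: smaller is better)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep the points not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

pts = [(1, 4), (2, 2), (3, 1), (4, 4)]
print(skyline(pts))  # [(1, 4), (2, 2), (3, 1)] -- (4, 4) is dominated by (2, 2)
```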
TOWARDS REDUCTION OF DATA FLOW IN A DISTRIBUTED NETWORK USING PRINCIPAL COMPO... (cscpconf)
For performing distributed data mining, two approaches are possible: first, data from several sources are copied to a data warehouse and mining algorithms are applied to it; second, mining can be performed at the local sites and the results aggregated. When the number of features is high, a lot of bandwidth is consumed in transferring datasets to a centralized location. For this reason, dimensionality reduction can be done at the local sites. In dimensionality reduction, an encoding is applied to the data to obtain a compressed form. The reduced features obtained at the local sites are then aggregated, and data mining algorithms are applied to them. There are several methods of performing dimensionality reduction; two of the most important are Discrete Wavelet Transforms (DWT) and Principal Component Analysis (PCA). Here a detailed study is done on how PCA can be useful in reducing data flow across a distributed network.
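A rough sketch of what local-site PCA reduction looks like, assuming NumPy; the function name and data are illustrative, not the paper's:

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto the top-k principal components, so each
    record is shipped as k numbers instead of the full feature vector."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # k-dimensional representation

X = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [3.0, 1.0, 1.0],
              [1.0, 2.0, 0.0]])
Z = pca_reduce(X, 2)
print(Z.shape)  # (4, 2): 3 features compressed to 2 per record
```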
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
The document proposes an extension to the M-tree family of index structures called M*-tree. M*-tree improves upon M-tree by maintaining a nearest-neighbor graph within each node. The nearest-neighbor graph stores, for each entry in a node, a reference and distance to its nearest neighbor among the other entries in that node. This additional structure allows for more efficient filtering of non-relevant subtrees during search queries through the use of "sacrifice pivots". The experiments showed that M*-tree can perform searches significantly faster than M-tree while keeping construction costs low.
The document discusses advanced database topics including temporal data, spatial and geographic databases, and multimedia databases. It provides details on:
- Modeling data that changes over time using temporal databases, representing facts with valid and transaction time intervals.
- Storing spatial information like maps and geometric objects using vector and raster formats, and indexing techniques for spatial data like R-trees.
- Applications of geographic information systems for tasks like vehicle navigation and utility network management.
- Emerging areas of multimedia databases for non-traditional data types like images, video and audio.
Clustering, also known as data segmentation, aims to partition a data set into groups (clusters) according to similarity. Cluster analysis has been studied extensively, and many algorithms exist for different types of clustering. These classical algorithms cannot be applied to big data due to its distinct features, and it is a challenge to apply traditional techniques to large unstructured data. This study proposes a hybrid model to cluster big data using the well-known traditional K-means clustering algorithm. The proposed model consists of three phases: a Mapper phase, a Clustering phase, and a Reduce phase. The first phase uses a map-reduce algorithm to split big data into small datasets; the second phase runs the traditional K-means algorithm on each of the split small datasets; the last phase is responsible for producing the overall cluster output for the complete data set. Two functions, Mode and Fuzzy Gaussian, were implemented and compared in the last phase to determine the more suitable one. The experimental study used four benchmark big data sets: Covtype, Covtype-2, Poker, and Poker-2. The results demonstrated the efficiency of the proposed model in clustering big data with the traditional K-means algorithm, and the experiments show that the Fuzzy Gaussian function produces more accurate results than the traditional Mode function.
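The three-phase model can be sketched in miniature — a simplified single-machine stand-in for the real map-reduce deployment; the Mode and Fuzzy Gaussian combiners are not shown:

```python
def kmeans(points, k, iters=20):
    """Plain 1-D k-means, standing in for the Clustering phase."""
    centroids = list(points[:k])
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each point to its nearest centroid
            clusters[min(range(k), key=lambda i: abs(p - centroids[i]))].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

def hybrid_cluster(data, k, n_splits):
    chunks = [data[i::n_splits] for i in range(n_splits)]  # Mapper phase: split the data
    local = [kmeans(chunk, k) for chunk in chunks]         # Clustering phase: per-split k-means
    merged = [c for cs in local for c in cs]               # Reduce phase: combine the local
    return kmeans(merged, k)                               # centroids into global clusters

data = [1.0, 1.2, 0.8, 10.0, 10.3, 9.7, 1.1, 9.9]
print(hybrid_cluster(data, k=2, n_splits=2))  # two centroids, roughly 1.1 and 10.1
```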
TASK-DECOMPOSITION BASED ANOMALY DETECTION OF MASSIVE AND HIGH-VOLATILITY SES... (ijdpsjournal)
This document summarizes a research paper that presents a task-decomposition based anomaly detection system for analyzing massive and highly volatile session data from the Science Information Network (SINET), Japan's academic backbone network. The system uses a master-worker design with dynamic task scheduling to process over 1 billion sessions per day. It discriminates incoming and outgoing traffic using GPU parallelization and generates histograms of traffic volumes over time. Long short-term memory (LSTM) neural networks detect anomalies like spikes in incoming traffic volumes. The experiment analyzed SINET data from February 27 to March 8, 2021, detecting some anomalies while processing 500-650 gigabytes of daily session data.
BRA: a bidirectional routing abstraction for asymmetric mobile ad hoc networks... (Mumbai Academisc)
This document summarizes a paper that presents a framework called BRA that provides a bidirectional abstraction of asymmetric mobile ad hoc networks to enable off-the-shelf routing protocols to work. BRA maintains multi-hop reverse routes for unidirectional links, improves connectivity by using unidirectional links, enables reverse route forwarding of control packets, and detects packet loss on unidirectional links. Simulations show packet delivery increases substantially when AODV is layered on BRA in asymmetric networks compared to regular AODV.
Implementation of query optimization for reducing run time (Alexander Decker)
This document discusses query optimization techniques to improve performance. It proposes performing query optimization at compile-time using histograms of data statistics rather than at run-time. Histograms are used to estimate selectivity of query joins and predicates at compile-time, allowing a query plan to be constructed in advance and executed without run-time optimization. The technique uses a split and merge algorithm to incrementally maintain histograms as data changes. Selectivity estimation with histograms allows join and predicate ordering to be determined at compile-time for query plan generation. Experimental results showed this compile-time optimization approach improved runtime performance over traditional run-time optimization.
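The core idea — estimating predicate selectivity from precomputed histogram statistics at compile time — can be sketched as follows, using an equi-width histogram for simplicity; the paper's split-and-merge maintenance is omitted and the helper names are ours:

```python
def build_histogram(values, n_buckets, lo, hi):
    """Equi-width histogram kept as compile-time data statistics."""
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        counts[min(int((v - lo) / width), n_buckets - 1)] += 1
    return counts, lo, width

def estimate_selectivity(hist, predicate_hi):
    """Estimate the fraction of rows with value < predicate_hi, assuming
    values are uniformly distributed within each bucket."""
    counts, lo, width = hist
    total = sum(counts)
    sel = 0.0
    for i, c in enumerate(counts):
        b_lo, b_hi = lo + i * width, lo + (i + 1) * width
        if predicate_hi >= b_hi:
            sel += c                                      # whole bucket qualifies
        elif predicate_hi > b_lo:
            sel += c * (predicate_hi - b_lo) / width      # partial bucket
    return sel / total

hist = build_histogram(list(range(100)), n_buckets=10, lo=0, hi=100)
print(estimate_selectivity(hist, 25))  # 0.25
```

A query planner can compare such estimates across candidate join orders at compile time and fix the cheapest plan before execution.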
Instruction level parallelism using PPM branch prediction (IAEME Publication)
This document summarizes an approach to instruction level parallelism using prediction by partial matching (PPM) branch prediction. It proposes a hybrid PPM-based branch predictor that uses both local and global branch histories. The two predictors are combined using a neural network. Key aspects of the implementation include:
1. Using local and global history PPM predictors and combining their predictions with a neural network.
2. Enhancements to the basic PPM approach like program counter tagging, efficient history encoding using run-length encoding, tracking pattern bias, and dynamic pattern length selection.
3. Details of the global history PPM predictor, including the use of tables and linked lists to store patterns of different lengths and handle collisions.
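A stripped-down, single-table PPM predictor illustrates the longest-matching-history idea; the paper's hybrid local/global design with a neural combiner is far richer than this sketch:

```python
from collections import defaultdict

def ppm_train(outcomes, max_order=3):
    """Record taken/not-taken counts for every history context up to max_order."""
    counts = defaultdict(lambda: [0, 0])       # ctx -> [taken, not_taken]
    history = []
    for out in outcomes:
        for order in range(1, min(max_order, len(history)) + 1):
            ctx = tuple(history[-order:])
            counts[ctx][0 if out else 1] += 1
        history.append(out)
    return counts

def ppm_predict(history, counts, max_order=3):
    """Predict using the longest matching context (prediction by partial matching),
    backing off to shorter contexts when the long one was never seen."""
    for order in range(min(max_order, len(history)), 0, -1):
        ctx = tuple(history[-order:])
        if ctx in counts:
            taken, not_taken = counts[ctx]
            return taken >= not_taken
    return True                                 # default: predict taken

# A loop branch pattern: taken, taken, not-taken, repeated.
counts = ppm_train([True, True, False] * 20)
print(ppm_predict([True, True], counts))  # False: after two takens, the loop exits
```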
Practice of Streaming Processing of Dynamic Graphs: Concepts, Models, and Sys... (Subhajit Sahu)
Highlighted notes on Practice of Streaming Processing of Dynamic Graphs: Concepts, Models, and Systems.
While doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
This is a huge review paper discussing several graph streaming frameworks and graph databases. How can I summarize this! The GPU frameworks covered are cuSTINGER, EvoGraph, Hornet, faimGraph, and GPMA. The gap between databases and frameworks seems to be closing.
Scalable and Adaptive Graph Querying with MapReduce (Kyong-Ha Lee)
This document summarizes a research paper that proposes a distributed graph querying algorithm called MR-Graph that employs MapReduce. MR-Graph uses a filter-and-verify scheme to first filter graphs based on contained features before verifying subgraph isomorphism. It also adaptively tunes the feature size at runtime by sampling data graphs to determine the most appropriate size. The experiments showed MR-Graph outperforms conventional algorithms in scalability and efficiency for processing multiple graph queries over massive datasets.
This document summarizes a research paper that analyzes and evaluates the performance of processing large data sets using Hadoop. It discusses how Hadoop Distributed File System (HDFS) and MapReduce provide parallel and distributed processing of large structured and unstructured data at scale. The paper also presents the results of experiments conducted on Hadoop to classify and cluster large data sets using machine learning algorithms. The experiments showed that Hadoop can process large data sets more efficiently and reliably compared to processing on a single computer.
Study on Theoretical Aspects of Virtual Data Integration and its Applications (IJERA Editor)
Data integration is the technique of merging data residing at different sources in different locations and providing users with an integrated, reconciled view of these data. Such a unified view is called a global or mediated schema; it represents the intensional level of the integrated and reconciled data. Our area of interest in this paper is a data integration system characterized by an architecture based on a global schema and a set of sources or source schemas. The objective of this paper is to provide a study of the theoretical aspects of data integration systems and to present a comprehensive review of the applications of data integration in various fields including biomedicine, the environment, and social networks. It also discusses a privacy framework for protecting users' privacy with privacy views and privacy policies.
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar... (Kyong-Ha Lee)
This document proposes a new approach called SASUM for approximate subgraph matching in large graphs. Approximate subgraph matching allows missing edges in query matches, which is important for real-world graphs that may be incomplete. SASUM improves upon the basic approach of generating all possible query subgraphs and doing exact matching for each. It exploits the overlapping nature of query subgraphs to reduce the number that require costly exact matching. SASUM uses a lattice framework to identify sharing opportunities between query subgraphs. It generates small "base graphs" that are shared between queries and chooses a minimum set of these to match, from which it can derive matches for all queries. The approach outperforms the state-of-the-art by orders of magnitude.
Optimized Access Strategies for a Distributed Database Design (Waqas Tariq)
Abstract: Distributed database query optimization has been an active area of research for the database community in this decade. The work mostly involves mathematical programming and new algorithm design techniques to minimize the combined cost of storing the database, processing transactions, and communication among the various storage sites. The complete problem, and most of its subsets as well, are NP-hard. Most solutions proposed to date are based on enumerative techniques or heuristics. In this paper we show the benefits of using genetic algorithms (GA) to optimize the sequence of sub-query operations over enumerative methods and heuristics. A stochastic simulator has been designed, and experimental results show encouraging improvements in decreasing the total cost of a query. An exhaustive enumerative method is also applied, and its solutions are compared with those of the GA on various parameters of a distributed query, with up to 12 joins and 10 sites. Keywords: Distributed Query Optimization, Database Statistics, Query Execution Plan, Genetic Algorithms, Operation Allocation.
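A toy illustration of GA-style search over join orders: mutation-only evolution with an invented cost model, not the paper's simulator or cost formulas:

```python
import random

def query_cost(order, sizes):
    """Hypothetical cost model: join relations left to right; each intermediate
    result is charged its estimated row count (0.1% join selectivity assumed)."""
    cost, inter = 0.0, sizes[order[0]]
    for r in order[1:]:
        inter *= sizes[r] * 0.001
        cost += inter
    return cost

def ga_optimize(sizes, pop=30, gens=60, seed=1):
    """Genetic search over join-order permutations with swap mutation and elitism."""
    rng = random.Random(seed)
    n = len(sizes)
    population = [rng.sample(range(n), n) for _ in range(pop)]
    for _ in range(gens):
        population.sort(key=lambda o: query_cost(o, sizes))
        survivors = population[:pop // 2]        # keep the cheapest plans
        children = []
        for parent in survivors:
            child = parent[:]
            i, j = rng.randrange(n), rng.randrange(n)
            child[i], child[j] = child[j], child[i]   # swap mutation
            children.append(child)
        population = survivors + children
    return min(population, key=lambda o: query_cost(o, sizes))

sizes = [1000, 10, 500, 50]   # rows per relation
best = ga_optimize(sizes)
print(best, query_cost(best, sizes))  # small relations tend to be joined first
```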
This document describes a web application for analyzing building energy management data using predictive modeling and machine learning techniques. The application contains years of sensor data from a CUNY building and allows users to visualize the data, perform statistical analysis, and generate forecasts using Python modules. Key features include interactive data visualization, filtering and selecting subsets of data, defining expressions of sensor variables, and applying machine learning models for prediction. The application provides a customizable platform for exploring time series data while allowing different users to share their work.
Decision tree clustering: a column-store's tuple reconstruction (csandit)
Column-stores have gained market share as a promising physical storage alternative for analytical queries. However, for multi-attribute queries column-stores pay performance penalties due to on-the-fly tuple reconstruction. This paper presents an adaptive approach for reducing tuple reconstruction time. The proposed approach exploits a decision tree algorithm to cluster attributes for each projection and also eliminates frequent database scanning. Experimentation with TPC-H data shows the effectiveness of the proposed approach.
This document discusses machine learning algorithms and their suitability for parallelization using MapReduce. It begins by introducing machine learning and different types of algorithms, including supervised, unsupervised, and reinforcement learning. It then discusses how several common machine learning algorithms can be expressed using the MapReduce framework. Single-pass algorithms like language modeling and naive Bayes classification are well-suited since they involve extracting statistics from each data point. Iterative algorithms can also be parallelized by chaining multiple MapReduce jobs, though the map tasks may need access to global parameters. Overall, the document analyzes how different machine learning algorithms map to the data processing patterns of MapReduce.
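The single-pass pattern described above — extract a statistic from each data point in map, combine associatively in reduce — can be shown with word counts, the classic MapReduce example (illustrative, not from the document):

```python
from collections import Counter
from functools import reduce

def map_phase(doc):
    """Map: emit a partial statistic (word counts) from one data point."""
    return Counter(doc.split())

def reduce_phase(a, b):
    """Reduce: merge partial counts. Addition of counts is associative,
    which is what lets the reduce step run in parallel."""
    return a + b

docs = ["the cat sat", "the dog sat", "the cat ran"]
totals = reduce(reduce_phase, map(map_phase, docs))
print(totals["the"], totals["cat"])  # 3 2
```

Naive Bayes training follows the same shape: map emits per-class word counts, reduce sums them into the model's statistics.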
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIES (csandit)
This document summarizes a research paper that proposes a system to enhance keyword search over relational databases using ontologies. The system builds structures during pre-processing like a reachability index to store connectivity information and an ontology concept graph. During querying, it maps keywords to concepts, uses the ontology to find related concepts and tuples, and generates top-k answer trees combining syntactic and semantic matches while limiting redundant results. The system is expected to perform better than existing approaches by reducing storage requirements through its approach to materializing neighborhood information in the reachability index.
Survey on scalable continual top-k keyword search in relational databases (eSAT Journals)
Abstract: Keyword search in relational databases is a technique of high relevance in the present world. Extracting data from a large set of databases is important because it reduces manpower and time consumption. Data extraction from a large database using relevant keywords, based on the information needed, is interactive and user friendly: without knowing any database schema or query language such as SQL, the user can still get information, so data extraction becomes simpler. However, the database content keeps changing in real-time applications — for example, a database that stores publication data: when new publications arrive they are added, so the content changes over time, and because the database is updated frequently the results should change as well. To handle database updates, the top-k results are taken from the currently updated data for each search. Top-k keyword search means taking the k best results based on document relevance. Keyword search in relational databases finds structural information from tuples in the database; the two types of keyword search are the schema-based and graph-based approaches. Instead of executing all query results, top-k search takes the highest-scoring k, and by handling database updates it finds new results and removes expired ones.
The document proposes an improved clustering algorithm for social network analysis. It combines BSP (Business System Planning) clustering with Principal Component Analysis (PCA) to group social network objects into classes based on their links and attributes. Specifically, it applies PCA before BSP clustering to reduce the dimensionality of the social network data and retain only the most important variables for clustering. This improves the BSP clustering results by focusing on the key information in the social network.
This document summarizes various techniques for scalable continual top-k keyword search in relational databases. There are two main approaches: schema-based and graph-based. Schema-based methods generate candidate networks from the database schema and evaluate them. Graph-based methods represent the database as a graph and use techniques like bidirectional expansion. Top-k keyword search finds the highest scoring k results instead of all results. Methods like the Global Pipeline algorithm and Skyline-Sweeping algorithm efficiently process top-k queries over multiple candidate networks. Techniques for updating results with database changes include maintaining an initial top-k and recalculating scores. Lattice-based methods share computational costs for keyword search in data streams.
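The bounded-heap idea behind top-k retrieval can be sketched as follows; scoring and candidate-network evaluation are omitted, and the data is made up:

```python
import heapq

def top_k(scored_results, k):
    """Keep only the k highest-scoring results with a size-k min-heap,
    instead of materializing and sorting every answer."""
    heap = []
    for score, item in scored_results:
        if len(heap) < k:
            heapq.heappush(heap, (score, item))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, item))   # evict the current minimum
    return sorted(heap, reverse=True)

results = [(0.9, "t1"), (0.2, "t2"), (0.7, "t3"), (0.95, "t4"), (0.1, "t5")]
print(top_k(results, 2))  # [(0.95, 't4'), (0.9, 't1')]
```

When the database changes, only results whose scores cross the current k-th score threshold need to enter or leave the maintained top-k set.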
Exploring optimizations for dynamic pagerank algorithm based on CUDA: V3 (Subhajit Sahu)
This is my comprehensive viva report version 3.
While doing research work under Prof. Dip Banerjee and Prof. Kishore Kothapalli.
A graph is a generic data structure and is a superset of lists and trees. Binary search on a sorted list can be interpreted as a balanced binary tree search. Database tables can be thought of as indexed lists, and table joins represent relations between columns; this can be modeled as graphs instead. Assignment of registers to variables (by a compiler) and assignment of available channels to radio transmitters are also graph problems. Finding the shortest path between two points and sorting web pages in order of importance are graph problems as well. Neural networks are graphs too. Interactions between messenger molecules in the body and interactions between people on social media are also modeled as graphs.
Scalability has been an essential factor for any kind of computational algorithm when considering its performance. In this big data era, gathering large amounts of data is becoming easy, but data analysis on big data is not feasible with existing Machine Learning (ML) algorithms, which perform poorly on it. This is because the computational logic of these algorithms was designed in a sequential way. MapReduce becomes the solution for handling billions of records efficiently. In this report we discuss the basic building blocks of the computations behind ML algorithms, two different attempts to parallelize machine learning algorithms using MapReduce, and the overhead involved in parallelizing ML algorithms.
Parallel algorithms for multi-source graph traversal and its applications (Subhajit Sahu)
Highlighted notes on Parallel algorithms for multi-source graph traversal and its applications.
While doing research work under Prof. Kishore Kothapalli.
Seema is working on Multi-source BFS with hybrid-CSR, with applications in APSP, diameter, centrality, reachability.
BFS can be either top-down (from visited nodes, mark the next frontier) or bottom-up (from unvisited nodes, mark the next frontier). She mentioned that the hybrid approach is more efficient. EtaGraph uses unified degree cut (UDC) graph partitioning, and also overlaps data transfer with kernel execution. iCENTRAL uses biconnected components for betweenness centrality on dynamic graphs.
Hybrid CSR uses an additional value array for storing packed "has edge/neighbour" bits. This can give a better memory access pattern if many bits are set, but can cause many threads to wait if many bits are zero. She mentioned the Volta architecture has an independent PC and stack per thread (similar to a CPU?). Does it then not matter if the threads in a block diverge?
(BFS = G*v, Multi-source BFS = G*vs)
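The packed "has edge" bits mentioned above might look like this — our own sketch of the idea, not Seema's implementation (which would live in CUDA alongside the CSR arrays):

```python
def pack_edge_bits(neighbors, n_vertices):
    """Pack per-vertex 'has edge/neighbour' flags into one integer bitset,
    as in a hybrid-CSR value array."""
    bits = 0
    for v in neighbors:
        assert v < n_vertices
        bits |= 1 << v                  # set the bit for each present neighbour
    return bits

def has_edge(bits, v):
    """Test a single neighbour bit."""
    return (bits >> v) & 1 == 1

row = pack_edge_bits([0, 2, 5], n_vertices=8)
print(bin(row))                            # 0b100101
print(has_edge(row, 2), has_edge(row, 3))  # True False
```

Probing a bit is one shift and mask, which gives the dense, regular access pattern the notes describe when many bits are set.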
STIC-D: algorithmic techniques for efficient parallel pagerank computation on... (Subhajit Sahu)
Authors:
Paritosh Garg
Kishore Kothapalli
Publication:
ICDCN '16: Proceedings of the 17th International Conference on Distributed Computing and Networking. January 2016.
Article No.: 15 Pages 1–10
https://doi.org/10.1145/2833312.2833322
Big Graph: Tools, Techniques, Issues, Challenges and Future Directions (csandit)
Analyzing interconnection structures among data through the use of graph algorithms and graph analytics has been shown to provide tremendous value in many application domains (such as social networks, protein networks, transportation networks, bibliographical networks, knowledge bases, and many more). Nowadays, graphs with billions of nodes and trillions of edges have become very common. In principle, graph analytics is an important big data discovery technique. Therefore, with the increasing abundance of large-scale graphs, designing scalable systems for processing and analyzing large-scale graphs has become one of the timeliest problems facing the big data research community. In general, distributed processing of big graphs is a challenging task due to their size and the inherent irregular structure of graph computations. In this paper, we present a comprehensive overview of the state of the art to better understand the challenges of developing highly scalable graph processing systems. In addition, we identify a set of current open research challenges and discuss some promising directions for future research.
Bra a bidirectional routing abstraction for asymmetric mobile ad hoc networks...Mumbai Academisc
This document summarizes a paper that presents a framework called BRA that provides a bidirectional abstraction of asymmetric mobile ad hoc networks to enable off-the-shelf routing protocols to work. BRA maintains multi-hop reverse routes for unidirectional links, improves connectivity by using unidirectional links, enables reverse route forwarding of control packets, and detects packet loss on unidirectional links. Simulations show packet delivery increases substantially when AODV is layered on BRA in asymmetric networks compared to regular AODV.
Implementation of query optimization for reducing run timeAlexander Decker
This document discusses query optimization techniques to improve performance. It proposes performing query optimization at compile-time using histograms of data statistics rather than at run-time. Histograms are used to estimate selectivity of query joins and predicates at compile-time, allowing a query plan to be constructed in advance and executed without run-time optimization. The technique uses a split and merge algorithm to incrementally maintain histograms as data changes. Selectivity estimation with histograms allows join and predicate ordering to be determined at compile-time for query plan generation. Experimental results showed this compile-time optimization approach improved runtime performance over traditional run-time optimization.
Instruction level parallelism using ppm branch predictionIAEME Publication
This document summarizes an approach to instruction level parallelism using prediction by partial matching (PPM) branch prediction. It proposes a hybrid PPM-based branch predictor that uses both local and global branch histories. The two predictors are combined using a neural network. Key aspects of the implementation include:
1. Using local and global history PPM predictors and combining their predictions with a neural network.
2. Enhancements to the basic PPM approach like program counter tagging, efficient history encoding using run-length encoding, tracking pattern bias, and dynamic pattern length selection.
3. Details of the global history PPM predictor, including the use of tables and linked lists to store patterns of different lengths and handle collisions.
Practice of Streaming Processing of Dynamic Graphs: Concepts, Models, and Sys...Subhajit Sahu
Highlighted notes on Practice of Streaming Processing of Dynamic Graphs: Concepts, Models, and Systems.
While doing research work under Prof. Dip Banerjee, Prof. Kishore Kothapalli.
This is a huge review paper discussing several graph streaming frameworks and graph databases. How can I summarize this! GPU frameworks given are cuSTINGER, EvoGraph, Hornet, faimGraph, GPMA. The gap between databases and frameworks seems to be closing.
Scalable and Adaptive Graph Querying with MapReduceKyong-Ha Lee
This document summarizes a research paper that proposes a distributed graph querying algorithm called MR-Graph that employs MapReduce. MR-Graph uses a filter-and-verify scheme to first filter graphs based on contained features before verifying subgraph isomorphism. It also adaptively tunes the feature size at runtime by sampling data graphs to determine the most appropriate size. The experiments showed MR-Graph outperforms conventional algorithms in scalability and efficiency for processing multiple graph queries over massive datasets.
This document summarizes a research paper that analyzes and evaluates the performance of processing large data sets using Hadoop. It discusses how Hadoop Distributed File System (HDFS) and MapReduce provide parallel and distributed processing of large structured and unstructured data at scale. The paper also presents the results of experiments conducted on Hadoop to classify and cluster large data sets using machine learning algorithms. The experiments showed that Hadoop can process large data sets more efficiently and reliably compared to processing on a single computer.
Study on Theoretical Aspects of Virtual Data Integration and its ApplicationsIJERA Editor
Data integration is the technique of merging data residing at different sources in different locations, and providing users with an integrated, reconciled view of these data. Such a unified view is called the global or mediated schema; it represents the intensional level of the integrated and reconciled data. The data integration system of interest in this paper is characterized by an architecture based on a global schema and a set of sources or source schemas. The objective of this paper is to provide a study of the theoretical aspects of data integration systems and to present a comprehensive review of the applications of data integration in various fields including biomedicine, environment, and social networks. It also discusses a privacy framework for protecting users' privacy with privacy views and privacy policies.
IJRET : International Journal of Research in Engineering and Technology is an international peer-reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academicians, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
SASUM: A Sharing-based Approach to Fast Approximate Subgraph Matching for Lar...Kyong-Ha Lee
This document proposes a new approach called SASUM for approximate subgraph matching in large graphs. Approximate subgraph matching allows missing edges in query matches, which is important for real-world graphs that may be incomplete. SASUM improves upon the basic approach of generating all possible query subgraphs and doing exact matching for each. It exploits the overlapping nature of query subgraphs to reduce the number that require costly exact matching. SASUM uses a lattice framework to identify sharing opportunities between query subgraphs. It generates small "base graphs" that are shared between queries and chooses a minimum set of these to match, from which it can derive matches for all queries. The approach outperforms the state-of-the-art by orders of magnitude.
Optimized Access Strategies for a Distributed Database DesignWaqas Tariq
Abstract Distributed Database Query Optimization has been an active area of research for the database research community in this decade. Research work mostly involves mathematical programming and evolving new algorithm design techniques in order to minimize the combined cost of storing the database, processing transactions and communication amongst various sites of storage. The complete problem, and most of its subsets as well, are NP-Hard. Most of the proposed solutions to date are based on the use of Enumerative Techniques or Heuristics. In this paper we have shown the benefits of using innovative Genetic Algorithms (GA) for optimizing the sequence of sub-query operations over the enumerative methods and heuristics. A stochastic simulator has been designed, and experimental results show encouraging improvements in decreasing the total cost of a query. An exhaustive enumerative method is also applied and its solutions are compared with those of the GA on various parameters of a Distributed Query, like up to 12 joins and 10 sites. Keywords: Distributed Query Optimization, Database Statistics, Query Execution Plan, Genetic Algorithms, Operation Allocation.
This document describes a web application for analyzing building energy management data using predictive modeling and machine learning techniques. The application contains years of sensor data from a CUNY building and allows users to visualize the data, perform statistical analysis, and generate forecasts using Python modules. Key features include interactive data visualization, filtering and selecting subsets of data, defining expressions of sensor variables, and applying machine learning models for prediction. The application provides a customizable platform for exploring time series data while allowing different users to share their work.
Decision tree clustering a columnstores tuple reconstructioncsandit
Column-stores have gained market share as a promising physical storage alternative for analytical queries. However, for multi-attribute queries column-stores pay performance penalties due to on-the-fly tuple reconstruction. This paper presents an adaptive approach for reducing tuple reconstruction time. The proposed approach exploits a decision tree algorithm to cluster attributes for each projection and also eliminates frequent database scanning. Experimentation with TPC-H data shows the effectiveness of the proposed approach.
This document discusses machine learning algorithms and their suitability for parallelization using MapReduce. It begins by introducing machine learning and different types of algorithms, including supervised, unsupervised, and reinforcement learning. It then discusses how several common machine learning algorithms can be expressed using the MapReduce framework. Single-pass algorithms like language modeling and naive Bayes classification are well-suited since they involve extracting statistics from each data point. Iterative algorithms can also be parallelized by chaining multiple MapReduce jobs, though the map tasks may need access to global parameters. Overall, the document analyzes how different machine learning algorithms map to the data processing patterns of MapReduce.
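The single-pass pattern described here can be illustrated with a tiny in-process stand-in for MapReduce (all names are illustrative; a real job would distribute the map and reduce tasks across a cluster):

```python
from collections import defaultdict

# Toy in-process simulation of map/shuffle/reduce for a single-pass
# statistic: word counts, as a unigram language model would need.

def map_phase(doc):
    # Each mapper emits a (word, 1) pair for every token of its record.
    return [(w, 1) for w in doc.split()]

def shuffle(pairs):
    # The framework groups intermediate pairs by key between map and reduce.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Each reducer sums the counts for one word.
    return {k: sum(vs) for k, vs in groups.items()}

docs = ["the cat sat", "the cat ran"]
counts = reduce_phase(shuffle(p for d in docs for p in map_phase(d)))
print(counts["the"])  # 2
```

Iterative algorithms would chain several such jobs, with the reduce output of one round feeding the map phase of the next.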
ENHANCING KEYWORD SEARCH OVER RELATIONAL DATABASES USING ONTOLOGIEScsandit
This document summarizes a research paper that proposes a system to enhance keyword search over relational databases using ontologies. The system builds structures during pre-processing like a reachability index to store connectivity information and an ontology concept graph. During querying, it maps keywords to concepts, uses the ontology to find related concepts and tuples, and generates top-k answer trees combining syntactic and semantic matches while limiting redundant results. The system is expected to perform better than existing approaches by reducing storage requirements through its approach to materializing neighborhood information in the reachability index.
Survey on scalable continual top k keyword search in relational databaseseSAT Journals
Abstract Keyword search in relational databases is a technique of high relevance in the present world. Extracting data from a large database is important because it reduces manpower and time consumption. Data extraction from a large database using relevant keywords, based on the information needed, is interactive and user friendly: without knowing any database schema or query language like SQL, the user can get information. Using keywords, data extraction from a relational database becomes simpler; the user does not need to know a query language to search. But for real-time applications the database content is always changing. For example, in a database which stores publication data, new publications are added as they arrive, so the database content changes over time. Because the database is updated frequently, the results should change as well. To handle database updates, the top-k results are taken from the currently updated data for each search. Top-k keyword search means taking the greatest k results based on the relevance of documents. Keyword search in a relational database means finding structural information from tuples in the database. The two types of keyword search are the schema-based method and the graph-based approach. Using top-k keyword search, instead of executing all query results, the highest-scoring k are taken. When handling database updates, new results are found and expired ones removed.
The document proposes an improved clustering algorithm for social network analysis. It combines BSP (Business System Planning) clustering with Principal Component Analysis (PCA) to group social network objects into classes based on their links and attributes. Specifically, it applies PCA before BSP clustering to reduce the dimensionality of the social network data and retain only the most important variables for clustering. This improves the BSP clustering results by focusing on the key information in the social network.
This document summarizes various techniques for scalable continual top-k keyword search in relational databases. There are two main approaches: schema-based and graph-based. Schema-based methods generate candidate networks from the database schema and evaluate them. Graph-based methods represent the database as a graph and use techniques like bidirectional expansion. Top-k keyword search finds the highest scoring k results instead of all results. Methods like the Global Pipeline algorithm and Skyline-Sweeping algorithm efficiently process top-k queries over multiple candidate networks. Techniques for updating results with database changes include maintaining an initial top-k and recalculating scores. Lattice-based methods share computational costs for keyword search in data streams.
Exploring optimizations for dynamic pagerank algorithm based on CUDA : V3Subhajit Sahu
This is my comprehensive viva report version 3.
While doing research work under Prof. Dip Banerjee, Prof. Kishore Kothapalli.
Graph is a generic data structure and is a superset of lists and trees. Binary search on sorted lists can be interpreted as balanced binary tree search. Database tables can be thought of as indexed lists, and table joins represent relations between columns; this can be modeled as graphs instead. Assignment of registers to variables (by a compiler), and assignment of available channels to a radio transmitter, are also graph problems. Finding the shortest path between two points, and sorting web pages in order of importance, are graph problems too. Neural networks are graphs as well. Interactions between messenger molecules in the body, and interactions between people on social media, are also modeled as graphs.
Scalability has been an essential factor for any computational algorithm when considering its performance. In this Big Data era, gathering large amounts of data is becoming easy, but data analysis on Big Data is not feasible using the existing Machine Learning (ML) algorithms, which perform poorly because their computational logic was originally designed in a sequential way. MapReduce becomes the solution for handling billions of data items efficiently. In this report we discuss the basic building block for the computations behind ML algorithms, two different attempts to parallelize machine learning algorithms using MapReduce, and a brief description of the overhead in parallelization of ML algorithms.
Parallel algorithms for multi-source graph traversal and its applicationsSubhajit Sahu
Highlighted notes on Parallel algorithms for multi-source graph traversal and its applications.
While doing research work under Prof. Kishore Kothapalli.
Seema is working on Multi-source BFS with hybrid-CSR, with applications in APSP, diameter, centrality, reachability.
BFS can be either top-down (from visited nodes, mark the next frontier) or bottom-up (from unvisited nodes, mark the next frontier). She mentioned that the hybrid approach is more efficient. EtaGraph uses unified degree cut (UDC) graph partitioning, and also overlaps data transfer with kernel execution. iCENTRAL uses biconnected components for betweenness centrality on dynamic graphs.
Hybrid CSR uses an additional value array for storing packed "has edge/neighbour" bits. This can give a better memory access pattern if many bits are set, but can cause many threads to wait if many bits are zero. She mentioned the Volta architecture has an independent PC and stack per thread (similar to a CPU?). Does it not matter then if the threads in a block diverge?
(BFS = G*v, Multi-source BFS = G*vs)
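The two frontier directions, and the hybrid switch between them, can be sketched as follows (the graph, the switch threshold, and all names are illustrative assumptions):

```python
def bfs_topdown_step(adj, frontier, visited):
    # From visited frontier nodes, mark their unvisited neighbours.
    nxt = set()
    for u in frontier:
        for v in adj[u]:
            if v not in visited:
                nxt.add(v)
    return nxt

def bfs_bottomup_step(adj, frontier, visited):
    # From unvisited nodes, check whether any neighbour is in the frontier
    # (an undirected graph is assumed, so adj[v] doubles as in-neighbours).
    nxt = set()
    for v in adj:
        if v not in visited and any(u in frontier for u in adj[v]):
            nxt.add(v)
    return nxt

def bfs_hybrid(adj, source):
    # Switch to bottom-up once the frontier grows large; the threshold
    # here is arbitrary, real systems tune it on edge counts.
    visited, frontier = {source}, {source}
    while frontier:
        step = (bfs_bottomup_step if len(frontier) > len(adj) // 4
                else bfs_topdown_step)
        frontier = step(adj, frontier, visited)
        visited |= frontier
    return visited

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: []}
print(sorted(bfs_hybrid(adj, 0)))  # [0, 1, 2, 3]
```

Multi-source BFS would carry a set of frontiers (one bit per source) per vertex instead of a single visited flag.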
STIC-D: algorithmic techniques for efficient parallel pagerank computation on...Subhajit Sahu
Authors:
Paritosh Garg
Kishore Kothapalli
Publication:
ICDCN '16: Proceedings of the 17th International Conference on Distributed Computing and Networking. January 2016.
Article No.: 15 Pages 1–10
https://doi.org/10.1145/2833312.2833322
BIG GRAPH: TOOLS, TECHNIQUES, ISSUES, CHALLENGES AND FUTURE DIRECTIONScscpconf
The International Journal of Engineering and Science (The IJES)theijes
The International Journal of Engineering & Science is aimed at providing a platform for researchers, engineers, scientists, or educators to publish their original research results, to exchange new ideas, to disseminate information in innovative designs, engineering experiences and technological skills. It is also the Journal's objective to promote engineering and technology education. All papers submitted to the Journal will be blind peer-reviewed. Only original articles will be published.
This paper presents efficient parallel algorithms for hypergraph processing implemented in a new framework called Hygra. Hygra extends the Ligra graph processing framework to support hypergraphs. It represents hypergraphs using a bipartite graph and implements optimizations from Ligra. The paper introduces parallel hypergraph algorithms for betweenness centrality, maximal independent set, k-core decomposition, hypertrees, hyperpaths, connected components, PageRank, and single-source shortest paths. Experiments show the algorithms in Hygra achieve good parallel speedup and outperform existing hypergraph frameworks.
Web Graph Clustering Using Hyperlink Structureaciijournal
Information is now useful in every environment, and time similarity is an important case. Most people are strongly interested in the Internet, where web pages are linked through hyperlinks that contain useful information. Using hyperlinks, web graphs are constructed for time-similar web links which users have seen in the past. These activities can be used to trace who used the websites for something at a given time, so this paper provides the history of the users who connected to the person who started the news. We found that the normalized-cut method with the new similarity metric is particularly effective, as demonstrated on a web log file.
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation of ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
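The levelwise idea can be sketched in a few lines, assuming the strongly connected components and their topological levels are already computed and the graph has no dead ends (the precondition above). The function name and the tiny example graph are illustrative assumptions, not the report's actual code:

```python
def pagerank_levelwise(out_edges, levels, n, alpha=0.85, tol=1e-10):
    # Initial rank 1/n for every vertex.
    rank = {v: 1.0 / n for lvl in levels for v in lvl}
    # Build reverse edges so each vertex can pull rank from in-neighbours.
    in_edges = {v: [] for v in rank}
    for u, vs in out_edges.items():
        for v in vs:
            in_edges[v].append(u)
    for lvl in levels:
        # Iterate only this level's vertices; ranks of earlier levels are
        # already final, so no cross-level communication per iteration.
        while True:
            delta = 0.0
            for v in lvl:
                r = (1 - alpha) / n + alpha * sum(
                    rank[u] / len(out_edges[u]) for u in in_edges[v])
                delta = max(delta, abs(r - rank[v]))
                rank[v] = r
            if delta < tol:
                break
    return rank

# Tiny block-graph: {0, 1} form an SCC at level 0; vertex 2 (with a
# self-loop, so there is no dead end) sits alone at level 1.
out_edges = {0: [1, 2], 1: [0], 2: [2]}
ranks = pagerank_levelwise(out_edges, levels=[[0, 1], [2]], n=3)
print(round(sum(ranks.values()), 6))  # 1.0
```

Because ranks only flow forward along the component DAG, each level converges independently once its predecessors are done.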
Bragged Regression Tree Algorithm for Dynamic Distribution and Scheduling of ...Editor IJCATR
In the past few years, Grid computing came up as a next-generation computing platform: a combination of heterogeneous computing resources joined by a network across dynamic and geographically separated organizations. It thus provides the perfect computing environment to solve large-scale computational demands. Grid computing demands keep increasing day by day due to the rise in the number of complex jobs worldwide, so jobs may take much longer to complete due to poor distribution of batches or groups of jobs to inappropriate CPUs. There is therefore a need for an efficient dynamic job-scheduling algorithm that assigns jobs to appropriate CPUs dynamically. The main problem dealt with in this paper is how to distribute the jobs when the payload, importance, urgency, flow time, etc. keep changing dynamically as the grid expands or is flooded with job requests from different machines within the grid.
In this paper, we present a scheduling strategy which takes advantage of a decision tree algorithm to make dynamic decisions based on the current scenario, and which automatically incorporates factor analysis when considering the distribution of jobs.
The document discusses experimenting with big data technologies using the Tengu platform. Tengu allows customers to easily set up environments to experiment with big data stores like Cassandra and Elasticsearch. It also supports different types of big data analysis like stream processing, batch analysis, and the Lambda architecture. Tengu handles all the deployment and configuration of these technologies so users can focus on experimenting with their applications in a big data context without having to deal with integration and setup.
Hortizontal Aggregation in SQL for Data Mining Analysis to Prepare Data SetsIJMER
International Journal of Modern Engineering Research (IJMER) is Peer reviewed, online Journal. It serves as an international archival forum of scholarly research related to engineering and science education.
International Journal of Modern Engineering Research (IJMER) covers all the fields of engineering and science: Electrical Engineering, Mechanical Engineering, Civil Engineering, Chemical Engineering, Computer Engineering, Agricultural Engineering, Aerospace Engineering, Thermodynamics, Structural Engineering, Control Engineering, Robotics, Mechatronics, Fluid Mechanics, Nanotechnology, Simulators, Web-based Learning, Remote Laboratories, Engineering Design Methods, Education Research, Students' Satisfaction and Motivation, Global Projects, and Assessment…. And many more.
WEB LOG PREPROCESSING BASED ON PARTIAL ANCESTRAL GRAPH TECHNIQUE FOR SESSION ...cscpconf
Web access log analysis studies the patterns of web site usage and the features of user behavior. Normal log data is very noisy and unclear, so it is vital to preprocess the log data for an efficient web usage mining process. Preprocessing comprises three phases: data cleaning, user identification, and session construction. Session construction is vital, since numerous real-world problems can be modeled as traversals on a graph, and mining from these traversals serves the preprocessing phase. However, existing works only consider traversals on unweighted graphs. This paper generalizes this to the case where the vertices of the graph are given weights to reflect their significance. The proposed method constructs sessions as a Partial Ancestral Graph (PAG) which contains pages with calculated weights. This will help site administrators find the interesting pages for users and redesign their web pages. After weighting each page according to browsing time, a PAG structure is constructed for each user session. The existing system has the problem of learning with latent variables in the data, a problem which the proposed method overcomes.
The document provides an overview of discrete mathematics and its applications. It begins by defining discrete mathematics as the study of mathematical structures that are discrete rather than continuous. Some key points made include:
- Discrete mathematics deals with objects that can only assume distinct, separated values. Fields like combinatorics, graph theory, and computation theory are considered parts of discrete mathematics.
- Research in discrete mathematics increased in the latter half of the 20th century due to the development of digital computers which operate using discrete bits.
- The document then gives several examples of applications of discrete mathematics, such as in computer science, networking, cryptography, logistics, and scheduling problems.
- Discrete mathematics is widely used in fields like
The document describes an assignment given to Md. Mehedi Hasan on the topic of applying numerical methods in computer science engineering. The assignment was given by five students and includes an index listing numerical methods to cover: error analysis, N-R method, interpolation, differentiation and max/min, curve fitting, and integration.
HyPR: Hybrid Page Ranking on Evolving Graphs (NOTES)Subhajit Sahu
Highlighted notes on HyPR: Hybrid Page Ranking on Evolving Graphs.
While doing research work under Prof. Dip Banerjee, Prof. Kishore Kothapalli.
In Hybrid PageRank the vertices are divided into 3 groups: V_old, V_border, V_new. Scaling for old and border vertices is N/N_new, and 1/N_new for V_new (I do this too). Then PR is run only on V_border, V_new.
"V_border which is the set of nodes which have edges in Bi connecting V_old and V_new and is reachable using a breadth first traversal."
Does that mean V_border = V_batch(i) ∩ V_old? BFS from where?
"We can assume that the new batch of updates is topologically sorted since the PR scores of the new nodes in Bi is guaranteed to be lower than those in Co."
Is sum(PR) in V_old > sum(PR) in V_new always?
"For performing the comparisons with GPMA and GPMA+, we configure the experiment to run HyPR on the same platform as used in [1] which is a Intel Xeon CPU connected to a Titan X Pascal GPU, and also the same datasets."
Old GPUs are going to be slower ...
Like we were discussing last time, it is not possible to scale old ranks, and skip the unchanged components (or here V_old). Please check this simple counter example that shows skipping leads to incorrect ranks.
https://github.com/puzzlef/pagerank-levelwise-skip-unchanged-components
Another omission in the paper is that Hybrid PR (just like STICD) won't work for graphs which have dead ends. This is a precondition for the algorithm.
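For reference, the rank-scaling step described in these notes can be sketched as follows (illustrative names; assumes the old ranks sum to 1, so the scaled ranks still do):

```python
def scale_ranks(old_ranks, new_vertices):
    # When new vertices arrive, scale old ranks by N_old/N_new and give
    # each new vertex 1/N_new, keeping the total probability mass at 1.
    n_old = len(old_ranks)
    n_new = n_old + len(new_vertices)
    ranks = {v: r * n_old / n_new for v, r in old_ranks.items()}
    ranks.update({v: 1.0 / n_new for v in new_vertices})
    return ranks

r = scale_ranks({0: 0.5, 1: 0.5}, [2, 3])
print(round(sum(r.values()), 6))  # 1.0
```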
Query Optimization Techniques in Graph Databasesijdms
Graph databases (GDB) have recently arisen to overcome the limits of traditional databases for storing and managing data with a graph-like structure. Today, they represent a requirement for many applications that manage graph-like data, like social networks. Most of the techniques applied to optimize queries in graph databases have been used in traditional databases, distributed systems, … or they are inspired from graph theory. However, their reuse in graph databases should take care of the main characteristics of graph databases, such as dynamic structure, highly interconnected data, and the ability to efficiently access data relationships. In this paper, we survey the query optimization techniques in graph databases. In particular, we focus on the features they have in
GRAPH MATCHING ALGORITHM FOR TASK ASSIGNMENT PROBLEMIJCSEA Journal
Task assignment is one of the most challenging problems in distributed computing environment. An optimal task assignment guarantees minimum turnaround time for a given architecture. Several approaches of optimal task assignment have been proposed by various researchers ranging from graph partitioning based tools to heuristic graph matching. Using heuristic graph matching, it is often impossible to get optimal task assignment for practical test cases within an acceptable time limit. In this paper, we have parallelized the basic heuristic graph-matching algorithm of task assignment which is suitable only for cases where processors and inter processor links are homogeneous. This proposal is a derivative of the basic task assignment methodology using heuristic graph matching. The results show that near optimal assignments are obtained much faster than the sequential program in all the cases with reasonable speed-up.
Application of discrete mathematics in ITShahidAbbas52
This document discusses discrete mathematics and its applications. It begins with defining discrete mathematics and providing examples of its different fields like graphs, networks, and logic. It then discusses various real-world applications of discrete mathematics in areas like computers, encryption, Google Maps, and scheduling. Discrete mathematical concepts like graphs, algorithms, and logic are widely used in fields like computer science, engineering, operations research, and social sciences.
Similar to Exploring optimizations for dynamic PageRank algorithm based on GPU : V4 (20)
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...Subhajit Sahu
TrueTime is a service that enables the use of globally synchronized clocks, with bounded error. It returns a time interval that is guaranteed to contain the clock’s actual time for some time during the call’s execution. If two intervals do not overlap, then we know calls were definitely ordered in real time. In general, synchronized clocks can be used to avoid communication in a distributed system.
The underlying source of time is a combination of GPS receivers and atomic clocks. As there are “time masters” in every datacenter (redundantly), it is likely that both sides of a partition would continue to enjoy accurate time. Individual nodes however need network connectivity to the masters, and without it their clocks will drift. Thus, during a partition their intervals slowly grow wider over time, based on bounds on the rate of local clock drift. Operations depending on TrueTime, such as Paxos leader election or transaction commits, thus have to wait a little longer, but the operation still completes (assuming the 2PC and quorum communication are working).
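A small sketch of the interval reasoning described above (the type and function names here are my own assumptions, not Spanner's actual API):

```python
from collections import namedtuple

# A TrueTime reading is an interval [earliest, latest] guaranteed to
# contain the actual time at some point during the call.
TTInterval = namedtuple("TTInterval", ["earliest", "latest"])

def definitely_before(a, b):
    # a happened before b in real time iff a's interval ends strictly
    # before b's begins; overlapping intervals prove nothing.
    return a.latest < b.earliest

def commit_wait(now, commit_ts):
    # Commit wait: only report a commit once commit_ts has surely passed,
    # i.e. even the earliest possible current time is beyond it.
    return now.earliest > commit_ts

a = TTInterval(10, 14)
b = TTInterval(15, 19)
c = TTInterval(13, 17)
print(definitely_before(a, b))  # True
print(definitely_before(a, c))  # False (intervals overlap)
```

Wider intervals (clock drift during a partition) just make `commit_wait` wait longer; the operation still completes.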
Adjusting Bitset for graph : SHORT REPORT / NOTESSubhajit Sahu
Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is commonly used for efficient graph computations. Unfortunately, using CSR for dynamic graphs is impractical since addition/deletion of a single edge can require on average (N+M)/2 memory accesses, in order to update source-offsets and destination-indices. A common approach is therefore to store edge-lists/destination-indices as an array of arrays, where each edge-list is an array belonging to a vertex. While this is good enough for small graphs, it quickly becomes a bottleneck for large graphs. What causes this bottleneck depends on whether the edge-lists are sorted or unsorted. If they are sorted, checking for an edge requires about log(E) memory accesses, but adding an edge on average requires E/2 accesses, where E is the number of edges of a given vertex. Note that both addition and deletion of edges in a dynamic graph require checking for an existing edge, before adding or deleting it. If edge lists are unsorted, checking for an edge requires around E/2 memory accesses, but adding an edge requires only 1 memory access.
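The sorted/unsorted trade-off above can be sketched as follows (illustrative Python; the comments mirror the access-count estimates in the text):

```python
import bisect

def add_edge_sorted(edges, v):
    # Sorted edge-list: membership check costs ~log(E) accesses...
    i = bisect.bisect_left(edges, v)
    if i == len(edges) or edges[i] != v:
        # ...but insertion shifts ~E/2 elements on average.
        edges.insert(i, v)

def add_edge_unsorted(edges, v):
    # Unsorted edge-list: membership check scans ~E/2 elements...
    if v not in edges:
        # ...but appending is a single access.
        edges.append(v)

s, u = [], []
for v in [5, 1, 3, 3]:          # duplicate 3 is detected and skipped
    add_edge_sorted(s, v)
    add_edge_unsorted(u, v)
print(s)  # [1, 3, 5]
print(u)  # [5, 1, 3]
```

Deletion has the same shape: both layouts must first locate the edge, paying the same check cost.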
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
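Plain power-iteration PageRank, the baseline that all of these techniques optimize, can be sketched as follows (a simplified version without dangling-node handling; this is not the STICD algorithm itself):

```cpp
#include <vector>
#include <algorithm>
#include <cmath>

// Minimal power-iteration PageRank on out-neighbour lists.
std::vector<double> pagerank(const std::vector<std::vector<int>>& out,
                             double damping = 0.85, double tol = 1e-10) {
    int n = (int)out.size();
    std::vector<double> r(n, 1.0 / n), rnew(n);
    for (int iter = 0; iter < 100; ++iter) {
        std::fill(rnew.begin(), rnew.end(), (1.0 - damping) / n);
        for (int u = 0; u < n; ++u)
            for (int v : out[u])
                rnew[v] += damping * r[u] / out[u].size();
        double err = 0;
        for (int i = 0; i < n; ++i) err += std::fabs(rnew[i] - r[i]);
        r.swap(rnew);
        // Convergence check; the "skip converged vertices" optimization
        // would apply this test per vertex instead of globally.
        if (err < tol) break;
    }
    return r;
}
```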
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, rely on efficient graph representations such as Compressed Sparse Row (CSR), an adjacency-list based representation commonly used for graph computations.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Experiments with Primitive operations : SHORT REPORT / NOTES (Subhajit Sahu)
This includes:
- Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
- Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
- Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
- Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Adjusting OpenMP PageRank : SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives (namely sumAt and multiply) in sequential mode.
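The distinction between the two primitive variants can be sketched as follows (hypothetical names mirroring the sum primitives mentioned above; the uniform approach would call the OpenMP version, the hybrid approach the sequential one):

```cpp
#include <vector>

// Vector element sum, sequential (what the hybrid approach would use).
double sumSeq(const std::vector<double>& x) {
    double a = 0;
    for (double v : x) a += v;
    return a;
}

// Vector element sum with an OpenMP reduction (the uniform approach);
// falls back to serial execution when compiled without -fopenmp.
double sumOmp(const std::vector<double>& x) {
    double a = 0;
    #pragma omp parallel for reduction(+:a)
    for (long long i = 0; i < (long long)x.size(); ++i) a += x[i];
    return a;
}
```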
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o... (Subhajit Sahu)
Below are the important points I note from the 2020 paper by Martin Grohe:
- 1-WL distinguishes almost all graphs, in a probabilistic sense
- Classical WL is two dimensional Weisfeiler-Leman
- DeepWL is an unlimited version of WL that runs in polynomial time.
- Knowledge graphs are essentially graphs with vertex/edge attributes
ABSTRACT:
Vector representations of graphs and relational structures, whether handcrafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view.
Starting with a survey of embedding techniques that have been used in practice, in this paper we propose two theoretical approaches that we see as central for understanding the foundations of vector embeddings. We draw connections between the various approaches and suggest directions for future research.
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES (Subhajit Sahu)
https://gist.github.com/wolfram77/54c4a14d9ea547183c6c7b3518bf9cd1
There exist a number of dynamic graph generators. The Barabási-Albert model iteratively attaches new vertices to pre-existing vertices in the graph using preferential attachment (edges to high-degree vertices are more likely - rich get richer - Pareto principle). However, graph size increases monotonically, and the density of the graph keeps increasing (sparsity decreasing).
Gorke's model uses a defined clustering to uniformly add vertices and edges. Purohit's model uses motifs (e.g. triangles) to mimic properties of existing dynamic graphs, such as growth rate, structure, and degree distribution. Kronecker graph generators are used to increase the size of a given graph, with power-law distribution.
To generate dynamic graphs, we must choose a metric to compare two graphs. Common metrics include diameter, clustering coefficient (modularity?), triangle counting (triangle density?), and degree distribution.
In this paper, the authors propose DyGraph, a dynamic graph generator that uses degree distribution as the only metric. The authors observe that many real-world graphs differ from the power-law distribution at the tail end. To address this issue, they propose binning, where the vertices beyond a certain degree (minDeg = min(deg) s.t. |V(deg)| < H, where H~10 is the number of vertices with a given degree below which are binned) are grouped into bins of degree-width binWidth, max-degree localMax, and number of degrees in bin with at least one vertex binSize (to keep track of sparsity). This helps the authors generate graphs with a more realistic degree distribution.
The process of generating a dynamic graph is as follows. First, the difference between the desired and the current degree distribution is calculated. The authors then create an edge-addition set where each vertex is present as many times as the number of additional incident edges it must receive. Edges are then created by connecting two vertices randomly from this set, and removing both from the set once connected. Currently, the authors reject self-loops and duplicate edges. Removal of edges is done in a similar fashion.
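The edge-addition step just described can be sketched as follows (a simplified version; the paper's actual generator also tracks bins and the target degree distribution):

```cpp
#include <vector>
#include <set>
#include <utility>
#include <algorithm>
#include <random>

// Given how many extra edges each vertex must receive, build the
// edge-addition set (each vertex appears once per needed edge),
// shuffle it, and pair off vertices, rejecting self-loops and
// duplicate edges as described above.
std::vector<std::pair<int,int>> addEdges(const std::vector<int>& extraDeg,
                                         unsigned seed = 42) {
    std::vector<int> slots;
    for (int v = 0; v < (int)extraDeg.size(); ++v)
        for (int k = 0; k < extraDeg[v]; ++k) slots.push_back(v);
    std::mt19937 rng(seed);
    std::shuffle(slots.begin(), slots.end(), rng);
    std::set<std::pair<int,int>> seen;
    std::vector<std::pair<int,int>> edges;
    for (size_t i = 0; i + 1 < slots.size(); i += 2) {
        int u = std::min(slots[i], slots[i + 1]);
        int v = std::max(slots[i], slots[i + 1]);
        if (u == v || seen.count({u, v})) continue;  // reject
        seen.insert({u, v});
        edges.push_back({u, v});
    }
    return edges;
}
```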
Authors observe that adding edges with power-law properties dominates the execution time, and consider parallelizing DyGraph as part of future work.
My notes on shared memory parallelism.
Shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory [REF].
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES (Subhajit Sahu)
**Community detection methods** can be *global* or *local*. **Global community detection methods** divide the entire graph into groups. Existing global algorithms include:
- Random walk methods
- Spectral partitioning
- Label propagation
- Greedy agglomerative and divisive algorithms
- Clique percolation
https://gist.github.com/wolfram77/b4316609265b5b9f88027bbc491f80b6
There is a growing body of work in *detecting overlapping communities*. **Seed set expansion** is a **local community detection method** where relevant *seed vertices* of interest are picked and *expanded to form communities* surrounding them. The quality of each community is measured using a *fitness function*.
**Modularity** is a *fitness function* which compares the number of intra-community edges to the expected number in a random-null model. **Conductance** is another popular fitness score that measures the community cut or inter-community edges. Many *overlapping community detection* methods **use a modified ratio** of intra-community edges to all edges with at least one endpoint in the community.
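For reference, the modularity fitness function described above can be written as (standard Newman-Girvan formulation, not specific to this paper):

```latex
Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j)
```

where $A_{ij}$ is the adjacency matrix, $k_i$ the degree of vertex $i$, $m$ the total number of edges, and $\delta(c_i, c_j)$ is 1 when vertices $i$ and $j$ are in the same community. Conductance of a community $S$ is correspondingly $\phi(S) = \mathrm{cut}(S) / \min(\mathrm{vol}(S), \mathrm{vol}(V \setminus S))$.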
Andersen et al. use a **Spectral PageRank-Nibble method** which minimizes conductance and is formed by adding vertices in order of decreasing PageRank values. Andersen and Lang develop a **random walk approach** in which some vertices in the seed set may not be placed in the final community. Clauset gives a **greedy method** that *starts from a single vertex* and then iteratively adds neighboring vertices *maximizing the local modularity score*. Riedy et al. **expand multiple vertices** via maximizing modularity.
Several algorithms for **detecting global, overlapping communities** use a *greedy*, *agglomerative approach* and run *multiple separate seed set expansions*. Lancichinetti et al. run **greedy seed set expansions**, each with a *single seed vertex*. Overlapping communities are produced by sequentially running expansions from nodes not yet in a community. Lee et al. use **maximal cliques as seed sets**. Havemann et al. **greedily expand cliques**.
The authors of this paper discuss a dynamic approach for **community detection using seed set expansion**. Simply marking the neighbours of changed vertices is a **naive approach**, and has *severe shortcomings*. This is because *communities can split apart*. The simple updating method *may fail even when it outputs a valid community* in the graph.
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES (Subhajit Sahu)
A **community** (in a network) is a subset of nodes which are _strongly connected among themselves_, but _weakly connected to others_. Neither the number of output communities nor their size distribution is known a priori. Community detection methods can be divisive or agglomerative. **Divisive methods** use _betweenness centrality_ to **identify and remove bridges** between communities. **Agglomerative methods** greedily **merge two communities** that provide maximum gain in _modularity_. Newman and Girvan introduced the **modularity metric**. The problem of community detection is then reduced to the problem of modularity maximization, which is **NP-complete**. The **Louvain method** is a variant of the _agglomerative strategy_, in that it is a _multi-level heuristic_.
https://gist.github.com/wolfram77/917a1a4a429e89a0f2a1911cea56314d
In this paper, the authors discuss **four heuristics** for community detection using the _Louvain algorithm_, implemented upon the recently developed **Grappolo**, a parallel variant of the Louvain algorithm. They are:
- Vertex following and Minimum label
- Data caching
- Graph coloring
- Threshold scaling
With the **Vertex following** heuristic, the _input is preprocessed_ and all single-degree vertices are merged with their corresponding neighbours. This helps reduce the number of vertices considered in each iteration, and also helps initial seeds of communities to be formed. With the **Minimum label heuristic**, when a vertex is making the decision to move to a community and multiple communities provide the same modularity gain, the community with the smallest id is chosen. This helps _minimize or prevent community swaps_. With the **Data caching** heuristic, community information is stored in a vector instead of a map, and is reused in each iteration, but with some additional cost. With the **Vertex ordering via Graph coloring** heuristic, _distance-k coloring_ of graphs is performed in order to group vertices into colors. Then, each set of vertices (by color) is processed _concurrently_, and synchronization is performed after that. This enables us to mimic the behaviour of the serial algorithm. Finally, with the **Threshold scaling** heuristic, _successively smaller values of modularity threshold_ are used as the algorithm progresses. This allows the algorithm to converge faster, and has been observed to achieve a good modularity score as well.
From the results, it appears that _graph coloring_ and _threshold scaling_ heuristics do not always provide a speedup and this depends upon the nature of the graph. It would be interesting to compare the heuristics against baseline approaches. Future work can include _distributed memory implementations_, and _community detection on streaming graphs_.
Application Areas of Community Detection: A Review : NOTES (Subhajit Sahu)
This is a short review of community detection methods (on graphs), and their applications. A **community** is a subset of a network whose members are *highly connected*, but *loosely connected* to others outside their community. Different community detection methods *can return differing communities*, since these algorithms are **heuristic-based**. **Dynamic community detection** involves tracking the *evolution of community structure* over time.
https://gist.github.com/wolfram77/09e64d6ba3ef080db5558feb2d32fdc0
Communities can be of the following **types**:
- Disjoint
- Overlapping
- Hierarchical
- Local
The following **static** community detection **methods** exist:
- Spectral-based
- Statistical inference
- Optimization
- Dynamics-based
The following **dynamic** community detection **methods** exist:
- Independent community detection and matching
- Dependent community detection (evolutionary)
- Simultaneous community detection on all snapshots
- Dynamic community detection on temporal networks
**Applications** of community detection include:
- Criminal identification
- Fraud detection
- Criminal activities detection
- Bot detection
- Dynamics of epidemic spreading (dynamic)
- Cancer/tumor detection
- Tissue/organ detection
- Evolution of influence (dynamic)
- Astroturfing
- Customer segmentation
- Recommendation systems
- Social network analysis (both)
- Network summarization
- Privacy, group segmentation
- Link prediction (both)
- Community evolution prediction (dynamic, hot field)
## References
- [Application Areas of Community Detection: A Review : PAPER](https://ieeexplore.ieee.org/document/8625349)
This paper discusses a GPU implementation of the Louvain community detection algorithm. The Louvain algorithm obtains hierarchical communities as a dendrogram through modularity optimization. Given an undirected weighted graph, all vertices are first considered to be their own communities. In the first phase, each vertex greedily decides to move to the community of one of its neighbours which gives the greatest increase in modularity. If moving to no neighbour's community leads to an increase in modularity, the vertex chooses to stay with its own community. This is done sequentially for all the vertices. If the total change in modularity is more than a certain threshold, this phase is repeated. Once this local moving phase is complete, all vertices have formed their first hierarchy of communities. The next phase is called the aggregation phase, where all the vertices belonging to a community are collapsed into a single super-vertex, such that edges between communities are represented as edges between respective super-vertices (edge weights are combined), and edges within each community are represented as self-loops in respective super-vertices (again, edge weights are combined). Together, the local moving and the aggregation phases constitute a stage. This super-vertex graph is then used as input for the next stage. This process continues until the increase in modularity is below a certain threshold. As a result, from each stage we have a hierarchy of community memberships for each vertex as a dendrogram.
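The per-vertex decision in the local moving phase can be sketched as a scan over neighbouring communities (a simplified gain computation with hypothetical inputs; a full implementation also accounts for removing the vertex from its current community):

```cpp
#include <unordered_map>

// One local-moving decision for a single vertex i. Given the weight of
// i's edges into each neighbouring community (edgesTo), each
// community's total weight (commWeight), i's degree ki, and total edge
// weight m, pick the community with the largest modularity gain:
//   dQ = kiin/m - commWeight[c]*ki/(2*m*m)   (constant factors dropped)
// Stay in the current community if no move gives a positive gain.
int bestCommunity(const std::unordered_map<int,double>& edgesTo,
                  const std::unordered_map<int,double>& commWeight,
                  double ki, double m, int current) {
    int best = current;
    double bestGain = 0.0;
    for (const auto& [c, kiin] : edgesTo) {
        double gain = kiin / m - commWeight.at(c) * ki / (2 * m * m);
        if (gain > bestGain) { bestGain = gain; best = c; }
    }
    return best;
}
```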
Approaches to perform the Louvain algorithm can be divided into coarse-grained and fine-grained. Coarse-grained approaches process a set of vertices in parallel, while fine-grained approaches process all vertices in parallel. A coarse-grained hybrid-GPU algorithm using multiple GPUs has been implemented by Cheong et al., which grabbed my attention. In addition, their algorithm does not use hashing for the local moving phase, but instead sorts each neighbour list based on the community id of each vertex.
https://gist.github.com/wolfram77/7e72c9b8c18c18ab908ae76262099329
Survey for extra-child-process package : NOTES (Subhajit Sahu)
Useful additions to inbuilt child_process module.
📦 Node.js, 📜 Files, 📰 Docs.
Please see attached PDF for literature survey.
https://gist.github.com/wolfram77/d936da570d7bf73f95d1513d4368573e
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER (Subhajit Sahu)
This paper presents two algorithms for efficiently computing PageRank on dynamically updating graphs in a batched manner: DynamicLevelwisePR and DynamicMonolithicPR. DynamicLevelwisePR processes vertices level-by-level based on strongly connected components and avoids recomputing converged vertices on the CPU. DynamicMonolithicPR uses a full power iteration approach on the GPU that partitions vertices by in-degree and skips unaffected vertices. Evaluation on real-world graphs shows the batched algorithms provide speedups of up to 4000x over single-edge updates and outperform other state-of-the-art dynamic PageRank algorithms.
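The "skips unaffected vertices" idea amounts to marking every vertex reachable from a changed edge's endpoints, since only those ranks can change (an illustrative simplification, not the paper's actual implementation):

```cpp
#include <vector>

// Mark all vertices whose PageRank may change after a batch of edge
// updates: those reachable via out-edges from any changed vertex.
// Unmarked vertices keep their previous ranks.
std::vector<bool> affectedVertices(const std::vector<std::vector<int>>& out,
                                   const std::vector<int>& changed) {
    std::vector<bool> mark(out.size(), false);
    std::vector<int> stack(changed.begin(), changed.end());
    while (!stack.empty()) {
        int u = stack.back(); stack.pop_back();
        if (mark[u]) continue;
        mark[u] = true;
        for (int v : out[u]) stack.push_back(v);
    }
    return mark;
}
```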
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up... (Subhajit Sahu)
For the PhD forum an abstract submission is required by 10th May, and poster by 15th May. The event is on 30th May.
https://gist.github.com/wolfram77/1c1f730d20b51e0d2c6d477fd3713024
Fast Incremental Community Detection on Dynamic Graphs : NOTES (Subhajit Sahu)
In this paper, the authors describe two approaches for dynamic community detection using the CNM algorithm: BasicDyn and FastDyn. CNM is a hierarchical, agglomerative algorithm that greedily maximizes modularity. BasicDyn backtracks merges of communities until each marked (changed) vertex is its own singleton community. FastDyn undoes a merge only if the quality of the merge, as measured by the induced change in modularity, has significantly decreased compared to when the merge initially took place. FastDyn also allows more than two vertices to contract together if, in the previous time step, these vertices eventually ended up contracted in the same community. In the static case, merging several vertices together in one contraction phase could lead to deteriorating results. FastDyn is able to do this, however, because it uses information from the merges of the previous time step. Intuitively, merges that previously occurred are more likely to be acceptable later.
https://gist.github.com/wolfram77/1856b108334cc822cdddfdfa7334792a
Can you fix farming by going back 8000 years : NOTES (Subhajit Sahu)
1. Human population didn't explode, but plateaued.
2. Fertilizer prices are skyrocketing.
3. Farmers are looking for alternatives such as animal waste (manure) or even human waste.
4. Manure prices are also going up.
5. Switching to organic farming is not an option.
https://gist.github.com/wolfram77/49067fc3ddc1ba2e1db4f873056fd88a
Exploring optimizations for dynamic PageRank algorithm based on GPU : V4
Exploring optimizations for dynamic PageRank
algorithm based on GPU
Subhajit Sahu
Advisor: Kishore Kothapalli
Center for Security, Theory, and Algorithmic Research (CSTAR)
International Institute of Information Technology, Hyderabad (IIITH)
Gachibowli, Hyderabad, India - 500 032
subhajit.sahu@research.iiit.ac.in
1. Introduction
The Königsberg bridge problem, which was posed and answered in the negative by Euler
in 1736, represents the beginning of graph theory [1]. A graph is a generic data structure
and is a superset of lists and trees. Binary search on sorted lists can be interpreted as a
balanced binary tree search. Database tables can be thought of as indexed lists, and table
joins represent relations between columns; these can be modeled as graphs instead.
Assignment of registers to variables (by a compiler), and assignment of available channels
to a radio transmitter, are also graph problems. Finding the shortest path between two
points, and sorting web pages in order of importance, are also graph problems. Neural
networks are graphs too. Interactions between messenger molecules in the body, and
interactions between people on social media, are also modeled as graphs.
Figure 1.1: Number of websites online from 1992 to 2019 [2].
The web has a bowtie structure on many levels, as shown in figure 1.2. There is usually
one giant strongly connected component, with several pages pointing into this component,
several pages pointed to by the component, and a number of disconnected pages. This
structure is seen as a fractal on many different levels [3].
Figure 1.2: Web’s bow tie structure on different aggregation levels [4].
Static graphs are those which do not change with time. Static graph algorithms are
techniques used to solve problems on such graphs, and have been developed since the
1940s. To solve larger and larger problems, a number of optimizations (both algorithmic and
hardware/software techniques) have been developed to take advantage of vector processors
(like Cray), multicores, and GPUs. A lot of research has gone into finding ways to enhance
concurrency, including a number of concurrency models, locking techniques, and
transactions. This is especially important given the lack of single-core performance
improvements.
Graphs whose relations vary with time are called temporal graphs, and as you might guess,
many problems use them. A temporal graph can be thought of as a series of static graphs at
different points in time. To solve a graph problem on a temporal graph, one would normally
take the graph at a certain point in time, and run the necessary static graph algorithm on it.
This works, but as the size of the temporal graph grows, this repeated computation becomes
increasingly slow. It is possible to take advantage of previous results, in order to compute
the result for the next time point. Such algorithms are called dynamic graph algorithms.
This is an ongoing area of research, which includes new algorithms and hardware/software
optimization techniques for distributed systems, multicores (shared memory), GPUs, and
even FPGAs. Optimization of algorithms can focus on space complexity (memory usage),
time complexity (query time), preprocessing time, and even accuracy of results.
While dynamic algorithms only focus on optimizing the algorithm’s computation time,
dynamic graph data structures focus on improving graph update time, and memory usage.
Dense graphs are usually represented by an adjacency matrix (bit matrix). Sparse graphs
can be represented with variations of adjacency lists (like CSR), and edge lists. Sparse
graphs can also be thought of as sparse matrices, and edges of a vertex can be considered
a bitset. In fact, a number of graph algorithms can be modeled as linear algebra operations
(see nvGraph, cuGraph frameworks). A number of dynamic graph data structures have also
been developed to improve update speed (like PMA), or enable concurrent updates and
computation (like Aspen’s compressed functional trees) [5]. These data formats are
illustrated in figure 1.3.
Figure 1.3: Illustration of fundamental graph representations (Adjacency Matrix, Adjacency List, Edge
List, CSR) [6].
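To make the CSR layout above concrete, the following sketch builds a CSR structure from an edge list in two passes (a minimal illustration assuming 0-based vertex-ids; the struct and function names are ours, not from any of the frameworks mentioned):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Minimal CSR: sourceOffsets[v] .. sourceOffsets[v+1] delimits the slice of
// destinationIndices holding the out-neighbours of vertex v.
struct CSR {
  std::vector<int> sourceOffsets;
  std::vector<int> destinationIndices;
};

CSR buildCSR(int N, const std::vector<std::pair<int,int>>& edges) {
  CSR g;
  g.sourceOffsets.assign(N + 1, 0);
  for (auto& e : edges) g.sourceOffsets[e.first + 1]++;   // count out-degrees
  for (int v = 0; v < N; ++v)                              // prefix sum
    g.sourceOffsets[v + 1] += g.sourceOffsets[v];
  g.destinationIndices.resize(edges.size());
  std::vector<int> fill(g.sourceOffsets.begin(), g.sourceOffsets.end() - 1);
  for (auto& e : edges)                                    // scatter edges
    g.destinationIndices[fill[e.first]++] = e.second;
  return g;
}
```

For the 3-vertex graph with edges 0→1, 0→2, 2→0 this yields offsets [0, 2, 2, 3] and destination-indices [1, 2, 0].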
Streaming / dynamic / time-evolving graph data structures maintain only the latest graph
information. Historical graphs on the other hand keep track of all previous states of the
graph. Changes to the graphs can be thought of as edge insertions and deletions, which
are usually done in batches. Except for functional techniques, updating a graph usually
involves modifying a shared structure using some kind of fine-grained synchronization. It
might also be possible to store additional information along with vertices/edges, though this
is usually not the focus of research (graph databases do). In the recent decade or so, a
number of graph streaming frameworks have been developed, each with a certain focus
area, and targeting a certain platform (distributed system / multiprocessor / GPU / FPGA /
ASIC). Such frameworks focus on designing an improved dynamic graph data structure, and
define a fundamental model of computation. For GPUs, the following frameworks exist:
cuSTINGER, aimGraph, faimGraph, Hornet, EvoGraph, and GPMA [5].
2. PageRank algorithm
The PageRank algorithm is a technique used to sort web pages (or vertices of a graph) by
importance. It is popularly known as the algorithm published by the founders of Google. Other
link analysis algorithms include HITS [7], TrustRank, and Hummingbird. Such algorithms
are also used for word sense disambiguation in lexical semantics, urban planning [8],
ranking streets by traffic [9], identifying communities [10], measuring their impact on the
web, maximizing influence [11], providing recommendations [12], analysing neural/protein
networks, determining species essential for health of the environment, or even quantifying
the scientific impact of researchers [13].
In order to understand the PageRank algorithm, consider this random (web) surfer model.
Each web page is modeled as a vertex, and each hyperlink as an edge. The surfer (such as
you) initially visits a web page at random. He then follows one of the links on the page,
leading to another web page. After following some links, the surfer would eventually decide
to visit another web page (at random). The probability of the random surfer being on a
certain page is what the PageRank algorithm returns. This probability (or importance) of a
web page depends upon the importance of web pages pointing to it (markov chain). This
definition of PageRank is recursive, and takes the form of an eigen-value problem. Solving
for PageRank thus requires multiple iterations of computation, which is known as the
power-iteration method. Each computation is essentially a (sparse) matrix multiplication.
A damping factor (of 0.85) is used to counter the effect of spider-traps (like self-loops),
which can otherwise suck up all importance. Dead-ends (web pages with no out-links) are
countered by effectively linking it to all vertices of the graph (making markov matrix column
stochastic), which otherwise would leak out importance [14]. See figure 2.1 for example.
Figure 2.1: Example of web pages with hyperlinks and respective PageRanks [15].
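The random-surfer computation described above can be sketched as a plain single-threaded power iteration (a minimal sketch, assuming damping p = 0.85, dangling ranks spread uniformly over all vertices, and L1-norm convergence; all names are illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Pull-style power-iteration PageRank: each vertex pulls rank from its
// in-neighbours. `in[v]` lists in-neighbours of v; `outdeg[u]` is u's out-degree.
std::vector<double> pagerank(const std::vector<std::vector<int>>& in,
                             const std::vector<int>& outdeg,
                             double p = 0.85, double tol = 1e-10) {
  int N = (int)in.size();
  std::vector<double> r(N, 1.0 / N), rn(N);
  for (int it = 0; it < 100; ++it) {
    double dangling = 0;
    for (int v = 0; v < N; ++v)
      if (outdeg[v] == 0) dangling += r[v];
    // common teleport contribution: random jump + dangling vertices
    double c0 = (1 - p) / N + p * dangling / N;
    for (int v = 0; v < N; ++v) {
      double s = 0;
      for (int u : in[v]) s += r[u] / outdeg[u];  // pull from in-neighbours
      rn[v] = c0 + p * s;
    }
    double err = 0;  // L1-norm between successive iterations
    for (int v = 0; v < N; ++v) err += std::fabs(rn[v] - r[v]);
    r.swap(rn);
    if (err < tol) break;
  }
  return r;
}
```

On a symmetric 3-vertex cycle (0→1→2→0), every vertex converges to rank 1/3.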
Note that, as originally conceived, the PageRank model does not factor a web browser’s
back button into a surfer’s hyperlinking possibilities. Surfers in one class, if teleporting, may
be much more likely to jump to pages about sports, while surfers in another class may be
much more likely to jump to pages pertaining to news and current events. Such differing
teleportation tendencies can be captured in two different personalization vectors. However,
this makes the once query-independent, user-independent PageRankings user-dependent and
more calculation-laden. Nevertheless, this little personalization vector has had more
significant side effects. The personalization vector, along with a non-uniform/weighted
version of PageRank [16], can help control spamming by the so-called link farms [3].
PageRank algorithms almost always take the following parameters: damping, tolerance,
and maximum iterations. Here, tolerance defines the acceptable error between the ranks of
the previous and the current iteration. Though this is usually the L1-norm, the L2 and
L∞-norms are also used sometimes. Both damping and tolerance control the rate of
convergence of the algorithm, as does the choice of tolerance function. However, adjusting
the damping factor can give completely different PageRank values. Since the ordering of
vertices is what matters, and not the exact values, it can usually be a good idea to choose
a larger tolerance value.
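For reference, the three error norms mentioned above can be computed over successive rank vectors as follows (a small sketch; the function names are illustrative):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// L1-norm: sum of absolute differences between successive rank vectors.
double l1(const std::vector<double>& a, const std::vector<double>& b) {
  double e = 0;
  for (size_t i = 0; i < a.size(); ++i) e += std::fabs(a[i] - b[i]);
  return e;
}

// L2-norm: Euclidean distance between the two vectors.
double l2(const std::vector<double>& a, const std::vector<double>& b) {
  double e = 0;
  for (size_t i = 0; i < a.size(); ++i) e += (a[i] - b[i]) * (a[i] - b[i]);
  return std::sqrt(e);
}

// L∞-norm: largest single-component difference.
double linf(const std::vector<double>& a, const std::vector<double>& b) {
  double e = 0;
  for (size_t i = 0; i < a.size(); ++i) e = std::max(e, std::fabs(a[i] - b[i]));
  return e;
}
```

The iteration stops once the chosen norm of the difference drops below the tolerance; since L∞ ≤ L2 ≤ L1 for the same vectors, the choice of norm effectively rescales the tolerance.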
3. Optimizing PageRank
Techniques to optimize the PageRank algorithm usually fall in two categories. One is to try
reducing the work per iteration, and the other is to try reducing the number of
iterations. These goals are often at odds with one another. The adapting PageRank
technique “locks” vertices which have converged, and saves iteration time by skipping their
computation [3]. Identical nodes, which have the same in-links, can be removed to reduce
duplicate computations and thus reduce iteration time. Road networks often have chains
which can be short-circuited before PageRank computation to improve performance. Final
ranks of chain nodes can be easily calculated. This reduces both the iteration time, and the
number of iterations. If a graph has no dangling nodes, the PageRank of each strongly
connected component can be computed in topological order. This helps reduce the iteration
time and the number of iterations, and also enables concurrency in PageRank computation. The
combination of all of the above methods is the STICD algorithm (see figure 3.1) [17]. A
somewhat similar aggregation algorithm is BlockRank which computes the PageRank of
hosts, local PageRank of pages within hosts independently, and aggregates them with
weights for the final rank vector. The global PageRank solution can be found in a
computationally efficient manner by computing the sub-PageRank of each connected
component, then pasting the sub-PageRanks together to form the global PageRank, using
the method of Avrachenkov et al. These methods exploit the inherent reducibility in the graph.
Bianchini et al. suggest using the Jacobi method to compute the PageRank vector [3].
Monte Carlo based PageRank methods consider several random walks on the input graph
to get approximate PageRanks. Optimizations of this approach exist for distributed PageRank
computation (especially for undirected graphs) [18], as a map-reduce algorithm for
personalized PageRank [19], and as a reordering strategy (to reduce space and compute
complexity on GPUs) for local PageRank [20].
Figure 3.1: STIC-D: Algorithmic optimizations for PageRank [17].
Iteration time can be reduced further by taking note of the fact that the traditional algorithm is
not computationally bound, and generates fine granularity random accesses (it exhibits
irregular parallelism). This causes poor memory bandwidth and compute utilization, and the
extent of this is quite dependent upon the graph structure [21], [22]. Four strategies for
neighbour iteration were attempted, to help reason about the expected impact of a graph’s
structure on the performance of each strategy [21]. CPUs/GPUs are generally designed and
optimized to load memory in blocks (cache-lines on CPUs, coalesced memory reads on
GPUs), and not for fine-grained accesses. Being able to adjust this behaviour depending
upon application (PageRank) can lead to performance improvements. Techniques like
prefetching to SRAM, using a high-performance shuffle network [23], indirect memory
prefetcher (of the form A[B[i]]), partial cache line accessing mechanisms [24], adjusting data
layout [22] (for sequential DRAM access [25]), and branch avoidance mechanisms (with
partitioning) [22] are used. Large graphs can be partitioned or decomposed into subgraphs
to help reduce cross-partition data access that helps both in distributed, as well as shared
memory systems (by reducing random accesses). Techniques like chunk partitioning [26],
cache/propagation blocking [27], partition-centric processing with gather-apply-scatter model
[22], edge-centric scatter-gather with non-overlapping vertex-set [28], exploiting node-score
sparsity [29], and even personalized PageRank based partitioning [30] have been used.
Graph/subgraph compression can also help reduce memory bottlenecks [26] [31], and
enable processing of larger graphs in memory. A number of techniques can be used to compress
adjacency lists, such as delta encoding of edge/neighbour ids [32], and referring to sets of
edges in edge lists [33] [34] (though reference vertices are hard to find) [3]. Since the rank vector
(possibly even including certain additional page-importance estimates) must reside entirely
in main memory, a few compression techniques have been attempted. These include lossy
encoding schemes based on scalar quantization seeking to minimize the distortion of
search-result rankings [35] [3], and using custom half-precision floating-point formats [36].
As new software/hardware platforms appear on the horizon, researchers have been eager
to test the performance of PageRank on the hardware. This is because each platform offers
its own unique architecture and engineering choices, and also because PageRank often
serves as a good benchmark for the capability of the platform to handle various other graph
algorithms. Attempts have been made on distributed frameworks like Hadoop [37], and even
RDBMS [38]. A number of implementations have been done on standard multicores [38],
Cell BE [39] [28], AMD GPUs [40], NVIDIA/CUDA GPUs [41] [28] [42], GPU clusters [26],
FPGAs [43] [23] [31], CPU-FPGA hybrids [44] [45] [29], and even on SpMV ASICs [46].
The PageRank algorithm is a live algorithm, which means that an ongoing computation can be
paused during a graph update, and simply resumed afterwards (instead of restarting it).
The first updating paper, by Chien et al. (2002), identifies a small portion of the web graph
“near” the link changes, models the rest of the web as a single node in a new, much
smaller graph, computes a PageRank for this small graph, and transfers these results to the
much bigger, original graph [3].
4. Graph streaming frameworks / databases
STINGER [47] uses an extended form of CSR, with edge lists represented as linked lists of
contiguous blocks. Each edge has 2 timestamps, and fine-grained locking is used per edge.
cuSTINGER extends STINGER for CUDA GPUs and uses a contiguous edge list (CSR)
instead. faimGraph [48] is a GPU framework with fully dynamic vertex and edge updates. It
has an in-GPU memory manager, and uses a paged linked-list for edges similar to
STINGER. Hornet [49] also implements its own memory manager, and uses B+ trees to
maintain blocks efficiently, and keep track of empty space. LLAMA uses a variant of CSR
with large multi-versioned arrays. It stores all snapshots of a graph, and persists old
snapshots to disk. GraphIn uses CSR along with edge lists, and updates CSR after edge
lists are large enough. GraphOne [50] is also similar, and uses page-aligned memory for
high-degree vertices. GraphTau is based on Apache Spark and uses read-only partitioned
collections of data sets. It uses a window sliding model for graph snapshots. Aspen [51]
uses C-tree (tree of trees) based on purely functional compressed search trees to store
graph structures. Elements are stored in chunks and compressed using difference encoding.
It allows any number of readers and a single writer, and the framework guarantees strict
serializability. Tegra stores the full history of the graph and relies on recomputing graph
algorithms on affected subgraphs. It also uses a cost model to guess when full
recomputation might be better. It uses an adaptive radix tree as the core data structure for
efficient updates and range scans [5].
Unlike graph streaming frameworks, graph databases focus on rich attached data, complex
queries, transactional support with ACID properties, data replication and sharding. A few
graph databases have started to support global analytics as well. However, most graph
databases do not offer dedicated support for incremental changes. Little research exists into
accelerating streaming graph processing using low-cost atomics, hardware transactions,
FPGAs, high-performance networking hardware. On average, the highest rate of ingestion
is achieved by shared memory single-node designs [5]. An overview of the graph
frameworks is shown in figure 4.1.
Figure 4.1: Overview of the domains and concepts in the practice and theory of streaming and
dynamic graph processing and algorithms [6].
5. NVIDIA Tesla V100 GPU Architecture
NVIDIA Tesla was a line of products targeted at stream processing / general-purpose
graphics processing units (GPGPUs). In May 2020, NVIDIA retired the Tesla brand because
of potential confusion with the brand of cars. Its new GPUs are branded NVIDIA Data
Center GPUs as in the Ampere A100 GPU [52].
The NVIDIA Tesla GV100 (Volta) is a 21.1 billion transistor TSMC 12 nm FinFET chip with a
die size of 815 mm². Here is a short summary of its features:
● 84 SMs, each with 64 independent FP, INT cores.
● Shared memory size configurable up to 96 KB / SM.
● 8 512-bit memory controllers (total 4096-bit).
● Up to 6 bidirectional NVLinks, 25 GB/s per direction (for IBM Power 9 CPUs).
● 4 dies / HBM stack, with 4 stacks. 16 GB with 900 GB/s HBM2 (Samsung).
● Native/sideband SECDED (1 correct, 2 detect) ECC (for HBM, REG, L1, L2).
Each SM has 4 processing blocks (each handles 1 warp of 32 threads). L1 data cache is
combined with shared memory of 128 KB / SM (explicit caching not as necessary anymore).
Volta also supports write-caching (not just load, as previous architectures). NVLink
supports coherency allowing data reads from GPU memory to be stored in CPU cache.
Address Translation Service (ATS) allows the GPU to access CPU page tables directly
(malloc ptr). The new copy engine doesn't need pinned memory. Volta's per-thread
program counter and call stack allow interleaved execution of warp threads (see figure
5.1), enabling fine-grained synchronization between threads within a warp (using
__syncwarp()). Cooperative groups enable synchronization between warps, grid-wide,
across multiple GPUs, cross-warp, and sub-warp [53].
Figure 5.1: Programs use Explicit Synchronization to Reconverge Threads in a Warp [53].
6. Experiments
6.1 Adjusting CSR format for graph
Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
commonly used for efficient graph computations. However, given N vertices, M edges, and a
32-bit / 4-byte vertex-id, it occupies a space of 4(N + M) bytes. Note however that a 32-bit
unsigned integer is limited to just 4 billion ids, and thus massive graphs would need to use
a 64-bit / 8-byte vertex-id. This further raises the occupied space to 8(N + M) bytes. Since
large memories are difficult to make and tend to be slower than smaller ones [?], it makes
sense to try to reduce this space requirement. Hybrid CSR is a graph representation that
combines the idea behind adjacency-list and adjacency-matrix [bfs-seema], with its
edge-lists being similar to roaring bitmaps [lemire]. Unlike CSR, which stores a list of indices
of destination vertices for each vertex, hybrid CSR uses smaller indices, each combined with
a dense bitset. This allows it to represent dense regions of a graph in a compact form.
An experiment was conducted to assess the size needed for graph representation for
various possible hybrid CSR formats, by adjusting the size of dense bitset (block), and
hence the index-bits. Both 32-bit and 64-bit hybrid CSR are studied, and compared with
32-bit regular CSR. A 32-bit regular CSR is represented using a uint32_t data type, and
uses all 32 bits for vertex-index (index-bits). It can support graphs with a maximum of 2^32
vertices (or simply a 32-bit vertex-id). A 32-bit hybrid CSR is also represented using a
uint32_t data type, where lower b bits are used to store the dense bitset (block), and upper i
= 32-b bits to store the index-bits. It supports an effective vertex-id of i+log2(b) =
32-b+log2(b) bits. For this experiment, the size of block b is adjusted from 4 to 16 bits.
Similarly, a 64-bit hybrid CSR is represented using uint64_t data type, where lower b bits
are used to store the dense bitset (block) and upper i = 64-b bits to store the index-bits.
Hence, the effective vertex-id supported is of i+log2(b) = 64-b+log2(b) bits. For this
experiment, the size of block b is adjusted from 4 to 32 bits. For a given vertex-id v, the
index-bits are defined as v >> log2(b), the block-bit as 1 << (v mod b), and thus the
hybrid CSR entry is (index-bits << b) | block-bit. Finding an edge-id in an edge-list involves
scanning all entries with matching index-bits, and once matched, checking if the appropriate
block-bit is set (for both hybrid CSRs). Since lowering the number of index-bits reduces the
maximum possible order of graph representable by the format, the effective bits usable for
vertex-id for each hybrid CSR variation is listed for reference. For this experiment, edge-ids
for each graph are first loaded into a 32-bit array of arrays structure, and then converted to
the desired CSR formats.
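The packing above can be sketched for the 32-bit hybrid CSR with a 16-bit block (a minimal illustration; the constant and function names are ours, with B = log2(b) bits of the vertex-id selecting a bit within the block):

```cpp
#include <cassert>
#include <cstdint>

// 32-bit hybrid CSR entry: lower b = 16 bits hold the dense bitset (block),
// upper 16 bits hold the index. B = log2(b) = 4.
const uint32_t B = 4;        // log2(block size)
const uint32_t b = 1u << B;  // block size in bits (16)

uint32_t entryOf(uint32_t v) {
  uint32_t index = v >> B;       // index-bits of the vertex-id
  uint32_t pos   = v & (b - 1);  // bit position within the block (v mod b)
  return (index << b) | (1u << pos);
}

// Vertices sharing the same index-bits merge into a single entry; this is
// how dense regions of a graph are represented compactly.
uint32_t merged(uint32_t e1, uint32_t e2) { return e1 | e2; }

bool hasVertex(uint32_t entry, uint32_t v) {
  return (entry >> b) == (v >> B) && (entry & (1u << (v & (b - 1))));
}
```

For example, vertices 17 and 19 share index 1 and occupy bits 1 and 3 of the same block, so both fit in one merged entry, halving the number of destination-indices for that pair.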
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the
SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using
GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack
server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous
Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release
7.9.2009 (Core). Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and the necessary tables are set up with the help of the FILTER function to create the
charts.
It is observed that, for a given n-bit hybrid CSR, using the highest possible block size
(taking into account effective index-bits) results in the smallest space usage. The 32-bit
hybrid CSR with a 16-bit block is able to achieve a maximum space usage (bytes) reduction
of ~5x, but is unable to represent all the graphs under test (it has a 20-bit effective vertex-id).
With an 8-bit block the space usage is reduced by ~3x - 3.5x for coPapersCiteseer,
coPapersDBLP, and indochina-2004. The 64-bit hybrid CSR with a 32-bit block is able to
achieve a maximum space usage reduction of ~3.5x, but generally does not perform well.
However, for massive graphs which cannot be represented with a 32-bit vertex-id, it is likely
to provide a significant reduction in space usage. This can be gauged by comparing the
number of destination-indices needed for each CSR variant, where it achieves a maximum
destination-indices reduction of ~7x. This reduction is likely to be higher with graphs
partitioned by hosts / heuristics / clustering algorithms, as is usually necessary for
massive graphs deployed in a distributed setting. This could be assessed in a future study.
Table 6.1.1: List of variations of CSR attempted, followed by list of programs including results &
figures.
block size   | regular 32-bit | hybrid 32-bit          | hybrid 64-bit
single bit   | 32-bit index   |                        |
4-bit block  |                | 28-bit index (30 eff.) | 60-bit index (62 eff.)
8-bit block  |                | 24-bit index (27 eff.) | 56-bit index (59 eff.)
16-bit block |                | 16-bit index (20 eff.) | 48-bit index (52 eff.)
32-bit block |                |                        | 32-bit index (37 eff.)
1. Comparing space usage of regular vs hybrid CSR (various sizes).
Figure 6.1.1: Space usage (bytes) reduction ratio of each format. For graphs that cannot be
represented with the given format, it is set to 0.
Figure 6.1.2: Destination-indices (total number of edge values) reduction ratio of each format. For
graphs that cannot be represented with the given format, it is set to 0.
6.2 Adjusting Bitset for graph
Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
commonly used for efficient graph computations. Unfortunately, using CSR for dynamic
graphs is impractical since addition/deletion of a single edge can require on average
(N+M)/2 memory accesses, in order to update source-offsets and destination-indices. A
common approach is therefore to store edge-lists/destination-indices as an array of arrays,
where each edge-list is an array belonging to a vertex. While this is good enough for small
graphs, it quickly becomes a bottleneck for large graphs. What causes this bottleneck
depends on whether the edge-lists are sorted or unsorted. If they are sorted, checking for
an edge requires about log(E) memory accesses, but adding an edge on average requires
E/2 accesses, where E is the number of edges of a given vertex. Note that both addition and
deletion of edges in a dynamic graph require checking for an existing edge, before adding or
deleting it. If edge lists are unsorted, checking for an edge requires around E/2 memory
accesses, but adding an edge requires only 1 memory access.
An experiment was conducted in an attempt to find a suitable data structure for
representing bitset, which can be used to represent edge-lists of a graph. The data
structures under test include single-buffer ones like unsorted bitset, and sorted bitset;
single-buffer partitioned (by integers) like partially-sorted bitset; and multi-buffer ones like
small-vector optimization bitset (unsorted), and 16-bit subrange bitset (todo). An unsorted
bitset consists of a vector (in C++) that stores all the edge ids in the order they arrive. Edge
lookup consists of a simple linear search. Edge addition is a simple push-back (after lookup).
Edge deletion is a vector-delete, which requires all edge-ids after it to be moved back (after
lookup). A sorted bitset maintains edge ids sorted in ascending order of edge ids. Edge
lookup consists of a binary search. Edge addition is a vector-insert, which requires all
edge-ids after it to be shifted one step ahead. Edge deletion is a vector-delete, just like
unsorted bitset. A partially-sorted bitset tries to amortize the cost of sorting edge-ids by
keeping the recently added edges unsorted at the end (up to a limit), and maintains the old
edges as sorted. Edge lookup consists of binary search in the sorted partition, and then
linear search in the unsorted partition, or the other way around. Edge addition is usually a
simple push-back and updating of partition size. However, if the unsorted partition grows
beyond a certain limit, it is merged with the sorted partition in one of the following ways: sort
both partitions as a whole, merge partitions using in-place merge, merge partitions using
extra space for sorted partition, or merge partitions using extra space for unsorted partition
(this requires a merge from the back end). Edge deletion checks to see if the edge can be
brought into the unsorted partition (within limit). If so, it simply swaps it out with the last
unsorted edge id (and updates partition size). However, if it cannot be brought into the
unsorted partition a vector-delete is performed (again, updating partition size). A
small-vector optimization bitset (unsorted) makes use of an additional fixed-size buffer
(this size is adjusted to different values) to store edge-ids until this buffer overflows, when all
edge-ids are moved to a dynamic (heap-allocated) vector. Edge lookups, additions, and
deletions are similar to that of an unsorted bitset, except that count of edge-ids in the
fixed-size buffer and the selection of buffer or dynamic vector needs to be done with each
operation.
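The partially-sorted scheme described above can be sketched as follows (a minimal illustration, assuming an unsorted-tail limit of 4 and the in-place-merge variant; the struct name and limit are ours):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Partially-sorted edge-list: ids[0..sorted) stay sorted, recent additions
// are appended unsorted at the end, and merged in once the tail overflows.
struct PartiallySorted {
  std::vector<int> ids;
  size_t sorted = 0;                  // size of the sorted prefix
  static constexpr size_t LIMIT = 4;  // max unsorted-tail size

  bool contains(int v) const {
    // Binary search in the sorted partition, then linear search in the tail.
    if (std::binary_search(ids.begin(), ids.begin() + sorted, v)) return true;
    return std::find(ids.begin() + sorted, ids.end(), v) != ids.end();
  }

  void add(int v) {
    if (contains(v)) return;            // both add and delete check first
    ids.push_back(v);                   // simple push-back into the tail
    if (ids.size() - sorted > LIMIT) {  // tail overflow: merge partitions
      std::sort(ids.begin() + sorted, ids.end());
      std::inplace_merge(ids.begin(), ids.begin() + sorted, ids.end());
      sorted = ids.size();
    }
  }
};
```

This keeps edge addition a cheap push-back most of the time, while lookups in the sorted prefix remain logarithmic; the sort cost is paid only once per LIMIT additions.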
All variants of the data structures were tested with real-world temporal graphs. These are
stored in a plain text file in “u, v, t” format, where u is the source vertex, v is the destination
vertex, and t is the UNIX epoch time in seconds. All of them are obtained from the Stanford
Large Network Dataset Collection. The experiment is implemented in C++, and compiled
using GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740
Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4
Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux
release 7.9.2009 (Core). The execution time of each test case is measured using
std::chrono::high_resolution_clock. This is done 5 times for each test case, and timings
are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and necessary tables are set up with the help of the FILTER function to create the
charts. Similar charts are combined together into a single GIF (to help with interpretation of
results).
From the results, it appears that transposing graphs based on a sorted bitset is clearly faster
than with an unsorted bitset. However, with reading graph edges there is no clear winner
(sometimes sorted is faster, especially for large graphs, and sometimes unsorted). Perhaps
when new edges have many duplicates there are fewer inserts, and hence the sorted version is
faster (since a sorted bitset has slow insert time). Transposing a graph based on a fully-sorted
bitset is clearly faster than with a partially-sorted bitset. This is possibly because partially-sorted
bitset based graphs cause higher cache misses due to random accesses (while reversing
edges). However, with reading graph edges there is again no clear winner (sometimes
partially-sorted is faster, especially for large graphs, and sometimes fully-sorted). For the
small-vector optimization bitset, on average, a buffer size of 4 seems to give a small
improvement. Any further increase in buffer size slows down performance. This is possibly
because of the unnecessarily large contiguous memory allocation needed by the buffer, and a low
cache-hit percentage due to widely separated edge data (due to the static buffer). In fact, it even
crashes when 26 instances of graphs with varying buffer sizes cannot all be held in memory.
Hence, small-vector optimization is not so useful, at least when used for graphs.
Table 6.2.1: List of data structures for bitset attempted, followed by list of programs inc. results &
figures.
single-buffer | single-buffer partitioned | multi-buffer
unsorted      | partially-sorted          | small-vector (optimization)
sorted        |                           | subrange-16bit
1. Testing the effectiveness of sorted vs unsorted list of integers for BitSet.
2. Comparing various unsorted sizes for partially sorted BitSet.
3. Performance of fully sorted vs partially sorted BitSet (inplace-s128).
4. Comparing various buffer sizes for BitSet with small vector optimization.
5. Comparing various switch points for 16-bit subrange based BitSet.
6.3 Adjusting data types for rank vector
When PageRank is computed in a distributed setting for massive graphs, it is necessary
to communicate the ranks of a subgraph computed at one machine to the other machines
over a network. Depending upon the algorithm, this message passing either needs to be
done every iteration [?], or after the subgraph has converged [sticd]. Minimizing this data
transfer can help improve the performance of the PageRank algorithm. One approach is to
compress the data using existing compression algorithms. Depending upon the achievable
compression ratio, and the time required to compress and decompress the data, it might be
a viable approach. Another approach would be to use smaller lower-precision data types
that can be directly used in the computation (or converted to a floating-point number on the
fly), without requiring any separate compression or decompression step.
An experiment was conducted to assess the ability of BFloat16 being used in place of
Float32 as a storage type (converted to Float32 on the fly, during computation). BFloat16 is
a 2-byte lower-precision data type specially developed for use in machine learning, and is
available for use in recent GPUs. It is, quite simply, the upper 16 bits of IEEE 754
single-precision floating point format (Float32). Conversion to and from BFloat16 is done
using bit-shift operators and reinterpret_cast. To make BFloat16 trivially replaceable with
Float32 in the PageRank algorithm, it is implemented as a class with appropriate
constructors (default, copy), and operator overloads (typecast, assignment). The experiment
is performed on a Xeon CPU, with a single thread, and using the standard power-iteration
(pull) formulation of PageRank. The rank of a vertex in an iteration is calculated as c0 +
pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor (0.85), rn is the
previous rank of vertex with an incoming edge, dn is the out-degree of the incoming-edge
vertex, and N is the total number of vertices in the graph. The common teleport contribution
c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex
in the graph due to the damping factor (1-p)/N, and teleport from dangling vertices (with no
outgoing edges) in the graph pΣrn/N. This is because a random surfer jumps to a random
page upon visiting a page with no links, in order to avoid the rank-sink effect. The ranks
obtained from the BFloat16 approach are compared with standard Float32 data type
approach using L1 norm (sum of absolute error).
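The storage-type wrapper described above might look like the following sketch (std::memcpy is used here in place of reinterpret_cast for well-defined type punning, and conversion is by simple truncation, matching the experiment's bit-shift approach; the class shown is illustrative, not the exact code used):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// BFloat16 storage type: the upper 16 bits of an IEEE 754 single-precision
// float, truncated on store and zero-extended back to Float32 on load.
class BFloat16 {
  uint16_t bits = 0;
public:
  BFloat16() = default;
  BFloat16(float f) {          // truncating conversion (no rounding)
    uint32_t u;
    std::memcpy(&u, &f, 4);
    bits = uint16_t(u >> 16);
  }
  operator float() const {     // widen back: lower 16 mantissa bits are zero
    uint32_t u = uint32_t(bits) << 16;
    float f;
    std::memcpy(&f, &u, 4);
    return f;
  }
};
```

Because BFloat16 keeps Float32's 8-bit exponent but only 7 mantissa bits, round-trips preserve range yet lose precision (roughly 2-3 significant decimal digits), which is consistent with the high L1 error observed below.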
All graphs used in this experiment are stored in the MatrixMarket (.mtx) file format, and
obtained from the SuiteSparse Matrix Collection. The experiment is implemented in C++,
and compiled using GCC 9 with optimization level 3 (-O3). The system used is a Dell
PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB
DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running
CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured
using std::chrono::high_resolution_clock. This is done 5 times for each test case, and
timings are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and necessary tables are set up with the help of the FILTER function to create the
charts.
It is observed that the error associated with using BFloat16 as storage type is too high, and
thus unsuitable for use with the PageRank algorithm. Future work may explore the usability
of BFloat16 only during message passing steps (after a full iteration, or after convergence of
a subgraph), or attempting other custom data types suitable for PageRank (possibly
non-byte aligned).
Table 6.3.1: List of data types attempted, followed by programs inc. results & figures.
Custom fp16 bfloat16 float double
1. Performance of vector element sum using float vs bfloat16 as the storage type.
2. Comparison of PageRank using float vs bfloat16 as the storage type (pull, CSR).
3. Performance of PageRank using 32-bit floats vs 64-bit floats (pull, CSR).
6.4 Adjusting PageRank parameters
Adjusting the damping factor can have a significant effect on the convergence rate of the
PageRank algorithm (as mentioned in literature), both in terms of time and iterations. For this
experiment, the damping factor p (which is usually 0.85) is varied from 0.50 to 1.00 in steps
of 0.05, in order to compare the performance variation with each damping factor. The
calculated error is the L1 norm wrt default PageRank (p=0.85). The PageRank algorithm
used here is the standard power-iteration
(pull) based PageRank. The rank of a vertex in an iteration is calculated as c0 + pΣrn/dn,
where c0 is the common teleport contribution, p is the damping factor, rn is the previous rank
of vertex with an incoming edge, dn is the out-degree of the incoming-edge vertex, and N is
the total number of vertices in the graph. The common teleport contribution c0, calculated as
(1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due
to the damping factor (1-p)/N, and teleport from dangling vertices (with no outgoing edges) in
the graph pΣrn/N. This is because a random surfer jumps to a random page upon visiting a
page with no links, in order to avoid the rank-sink effect.
All graphs used in this experiment are stored in the MatrixMarket (.mtx) file format, and
obtained from the SuiteSparse Matrix Collection. The experiment is implemented in C++,
and compiled using GCC 9 with optimization level 3 (-O3). The system used is a Dell
PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB
DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running
CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured
using std::chrono::high_resolution_clock. This is done 5 times for each test case, and
timings are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and necessary tables are set up with the help of the FILTER function to create the
charts.
As expected, increasing the damping factor beyond 0.85 significantly increases
convergence time, and lowering it below 0.85 decreases convergence time. Note that a
higher damping factor implies that a random surfer follows links with higher probability (and
jumps to a random page with lower probability). Also note that 500 is the maximum iterations
allowed here.
CHARTS HERE
Observing that adjusting the damping factor has a significant effect, another experiment was
performed. The experiment was to adjust the damping factor (alpha) in steps. Start with a
small alpha, change it when PageRank is converged, until the final desired value of alpha.
For example, start initially with alpha = 0.5, let PageRank converge quickly, and then switch
to alpha = 0.85 and run PageRank until it converges. Using a single step like this seems like
it might help reduce iterations. Unfortunately it doesn't. Trying with multiple steps tends to
have even higher iteration count.
CHARTS HERE
Similar to the damping factor, adjusting the tolerance value has a significant effect as well.
Apart from that, it is observed that different implementations make use of different error
functions for the convergence check. Although the L1 norm is commonly used, it appears
nvGraph uses the L2 norm instead, and a discussion on Stack Overflow suggests per-vertex
tolerance comparison, which is essentially the L∞ norm. This
experiment was for comparing the performance between L1, L2 and L∞ norms for various
tolerance values. Each approach was attempted on a number of graphs, varying the
tolerance from 10^-0 to 10^-10 for each tolerance function. Results show that the L∞ norm
is a faster convergence check for all graphs. For road networks, which have a large no. of
vertices, using the L∞ norm is orders of magnitude faster. For smaller values of tolerance the
ranks converge in just 1 iteration too. This is possibly because the per-vertex update of
ranks is smaller than 10^-6. Also note that L2 norm is initially faster than L1 norm, but
quickly slows down wrt L1 norm for most graphs. However, it is always faster for road
networks.
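For reference, the three error functions compared here can be sketched as follows (the helper names are ours), each measuring the difference between the previous rank vector x and the current one y:

```cpp
#include <vector>
#include <cmath>
#include <algorithm>
using std::vector;

// L1 norm: sum of absolute per-vertex differences.
double errorL1(const vector<double>& x, const vector<double>& y) {
  double e = 0;
  for (size_t i=0; i<x.size(); ++i) e += std::fabs(x[i]-y[i]);
  return e;
}
// L2 norm: Euclidean distance between the two rank vectors.
double errorL2(const vector<double>& x, const vector<double>& y) {
  double e = 0;
  for (size_t i=0; i<x.size(); ++i) e += (x[i]-y[i])*(x[i]-y[i]);
  return std::sqrt(e);
}
// L-infinity norm: largest per-vertex change (per-vertex tolerance check).
double errorLi(const vector<double>& x, const vector<double>& y) {
  double e = 0;
  for (size_t i=0; i<x.size(); ++i) e = std::max(e, std::fabs(x[i]-y[i]));
  return e;
}
```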
CHARTS HERE
Table 6.4.1: List of parameter adjustments attempted, followed by programs inc. results & figures.
Damping factor: adjust, dynamic-adjust
Tolerance: L1 norm, L2 norm, L∞ norm
1. Comparing the effect of using different values of damping factor, with PageRank (pull, CSR).
2. Experimenting PageRank improvement by adjusting damping factor (α) between iterations.
3. Comparing the effect of using different functions for convergence check, with PageRank (...).
4. Comparing the effect of using different values of tolerance, with PageRank (pull, CSR).
6.5 Adjusting ranks for dynamic graphs
When a graph is updated, there are a number of strategies to set up the initial rank
vector for obtaining PageRanks of the updated graph, using the ranks from the old
graph. One approach is to zero-fill ranks of the new vertices. Another approach is to use
1/N for the new vertices. Yet another approach is to scale the existing vertices and use 1/N
for the new vertices.
An experiment is conducted with each technique on different temporal graphs, updating
each graph with multiple batch sizes. For each batch size, static as well as the 3 dynamic
rank adjustment methods are tested. All rank adjustment strategies are performed using a
common adjustment function that adds a value to the old ranks, multiplies them by a value,
and sets a value for the new ranks. The PageRank algorithm used is the standard
power-iteration (pull) based approach that optionally accepts initial ranks. The rank of a
vertex in an iteration is calculated as
c0 + pΣrn/dn, where c0 is the common teleport contribution, p is the damping factor (0.85), rn is
the previous rank of vertex with an incoming edge, dn is the out-degree of the incoming-edge
vertex, and N is the total number of vertices in the graph. The common teleport contribution
c0, calculated as (1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex
in the graph due to the damping factor (1-p)/N, and teleport from dangling vertices (with no
outgoing edges) in the graph pΣrn/N. This is because a random surfer jumps to a random
page upon visiting a page with no links, in order to avoid the rank-sink effect.
All graphs (temporal) used in this experiment are stored in a plain text file in “u, v, t” format,
where u is the source vertex, v is the destination vertex, and t is the UNIX epoch time in
seconds. All of them are obtained from the Stanford Large Network Dataset Collection. If
initial ranks are not provided, they are set to 1/N. Error check is done using L1 norm with
static PageRank (without initial ranks). The experiment is implemented in C++, and compiled
using GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740
Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4
Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux
release 7.9.2009 (Core). The execution time of each test case is measured using
std::chrono::high_resolution_clock. This is done 5 times for each test case, and timings
are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and necessary tables are set up with the help of the FILTER function to create the
charts.
Each rank adjustment method (for dynamic PageRank) can have a different number of
iterations to convergence. The 3rd approach, which scales the old ranks and uses 1/N for
new vertices, seems to perform best. It is also seen that as batch size increases, the
convergence iterations (time) of dynamic PageRank increases. In some cases it even
becomes slower than static PageRank.
Table 6.5.1: List of rank adjustment strategies attempted, followed by programs inc. results & figures.
Update new: zero fill, 1/N fill
Update old, new: scale, 1/N fill
1. Comparing strategies to update ranks for dynamic PageRank (pull, CSR).
6.6 Adjusting OpenMP PageRank
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate PageRank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement PageRank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for PageRank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
Before starting an OpenMP implementation, a good sequential PageRank implementation
needs to be set up. There are two ways (algorithmically) to think of the PageRank
calculation. One approach (push) is to find PageRank by pushing contributions to
out-vertices. The push method is somewhat easier to implement, and is described in this
lecture. With this approach, in an iteration for each vertex, the ranks of vertices connected to
its outgoing edge are cumulated with p×rn, where p is the damping factor (0.85), and rn is the
rank of the (source) vertex in the previous iteration. But, if a vertex has no out-going edges, it
is considered to have out-going edges to all vertices in the graph (including itself). This is
because a random surfer jumps to a random page upon visiting a page with no links, in order
to avoid the rank-sink effect. However, it requires multiple writes per source vertex, due to
the cumulation (+=) operation. The other approach (pull) is to pull contributions from
in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is
the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of
vertex with an incoming edge, dn is the out-degree of the incoming-edge vertex, and N is the
total number of vertices in the graph. The common teleport contribution c0, calculated as
(1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due
to the damping factor (1-p)/N, and teleport from dangling vertices (with no outgoing edges) in
the graph pΣrn/N (to avoid the rank-sink effect). This approach requires 2 additional
calculations per-vertex, i.e., non-teleport contribution of each vertex, and total teleport
contribution (to all vertices). However, it requires only 1 write per destination vertex. For this
experiment both of these approaches are assessed on a number of different graphs.
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the
SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using
GCC 9 with OpenMP flag (-fopenmp), optimization level 3 (-O3). The system used is a Dell
PowerEdge R740 Rack server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB
DIMM DDR4 Synchronous Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running
CentOS Linux release 7.9.2009 (Core). The execution time of each test case is measured
using std::chrono::high_resolution_clock. This is done 5 times for each test case, and
timings are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and necessary tables are set up with the help of the FILTER function to create the
charts.
While it might seem that the pull method would be a clear winner, the results indicate that
although pull is always faster than push approach, the difference between the two depends
on the nature of the graph. The next step is to compare the performance between finding
PageRank using C++ DiGraph class directly (using arrays of edge-lists) vs its CSR
(Compressed Sparse Row) representation (contiguous). Using a CSR representation has
the potential for performance improvement due to information on vertices and edges being
stored contiguously.
Table 6.6.1: Adjusting Sequential approach
Push Pull Class CSR
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
Both uniform and hybrid OpenMP techniques were attempted on different types of graphs.
All OpenMP based functions are defined with a parallel for clause and static scheduling with
a chunk size of 4096. When necessary, a reduction clause is used. The number of threads
for this experiment (set using OMP_NUM_THREADS) was varied from 2 to 48. Results show
that the hybrid approach performs worse in most cases, and is only slightly better than the
uniform approach in a few cases. This could possibly be because OpenMP already handles
chip/core scheduling well when it is used with all the primitives.
Table 6.6.2: Adjusting OpenMP approach
Map Reduce Uniform Hybrid
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Performance of sequential execution based vs OpenMP based vector element sum.
3. Performance of uniform-OpenMP based vs hybrid-OpenMP based PageRank (pull, CSR).
In the final experiment, the performance of OpenMP based PageRank is contrasted with the
sequential approach and nvGraph PageRank. OpenMP based PageRank does seem to
provide a clear benefit for most graphs wrt sequential PageRank. This speedup is not
directly proportional to the number of threads, as Amdahl's law would predict. However,
nvGraph is clearly much faster than the OpenMP version. This is as expected, because
nvGraph makes use of the GPU for performance.
Table 6.6.3: Comparing sequential approach
Sequential vs: OpenMP, nvGraph
OpenMP vs: nvGraph
1. Performance of sequential execution based vs OpenMP based PageRank (pull, CSR).
2. Performance of sequential execution based vs nvGraph based PageRank (pull, CSR).
3. Performance of OpenMP based vs nvGraph based PageRank (pull, CSR).
6.7 Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD)
Techniques to optimize the PageRank algorithm usually fall in two categories. One is to try
reducing the work per iteration, and the other is to try reducing the number of iterations.
These goals are often at odds with one another. Skipping computation on vertices which
have already converged has the potential to save iteration time. Skipping in-identical
vertices, with the same in-links, helps reduce duplicate computations and thus could help
reduce iteration time. Road networks often have chains which can be short-circuited before
PageRank computation to improve performance. Final ranks of chain nodes can be easily
calculated. This could reduce both the iteration time, and the number of iterations. If a graph
has no dangling nodes, PageRank of each strongly connected component can be
computed in topological order. This could help reduce the iteration time, no. of iterations,
and also enable multi-iteration concurrency in PageRank computation. The combination of
all of the above methods is the STICD algorithm [sticd]. For dynamic graphs, unchanged
components whose ranks are unaffected can be skipped altogether.
However, the STICD algorithm requires the graph to be free of dangling nodes. Although this
can easily be dealt with by adding self-loops to those nodes, such a modification of the
graph may be undesirable in some cases. Another way to deal with the issue is to perform
PageRank computation of the entire graph at once, instead of doing it in topological
ordering. With this approach, dangling nodes are dealt with by a calculation of teleport
contribution that is shared among all nodes in the graph (like with standard pull-based
PageRank). However, we can still take advantage of the locality benefits of splitting the
graph by components, skipping in-identicals to reduce iteration time, skipping chains to
reduce iteration time and number of iterations, and skipping converged nodes (as
mentioned).
Before starting any algorithmic optimization, a good monolithic PageRank implementation
needs to be set up. There are two ways (algorithmically) to think of the PageRank
calculation. One approach (push) is to find PageRank by pushing contributions to
out-vertices. The push method is somewhat easier to implement, and is described in this
lecture. With this approach, in an iteration for each vertex, the ranks of vertices connected to
its outgoing edge are cumulated with p×rn, where p is the damping factor (0.85), and rn is the
rank of the (source) vertex in the previous iteration. But, if a vertex has no out-going edges, it
is considered to have out-going edges to all vertices in the graph (including itself). This is
because a random surfer jumps to a random page upon visiting a page with no links, in order
to avoid the rank-sink effect. However, it requires multiple writes per source vertex, due to
the cumulation (+=) operation. The other approach (pull) is to pull contributions from
in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is
the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of
vertex with an incoming edge, dn is the out-degree of the incoming-edge vertex, and N is the
total number of vertices in the graph. The common teleport contribution c0, calculated as
(1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due
to the damping factor (1-p)/N, and teleport from dangling vertices (with no outgoing edges) in
the graph pΣrn/N (to avoid the rank-sink effect). This approach requires 2 additional
calculations per-vertex, i.e., non-teleport contribution of each vertex, and total teleport
contribution (to all vertices). However, it requires only 1 write per destination vertex. For this
experiment both of these approaches are assessed on a number of different graphs.
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the
SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using
GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack
server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous
Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release
7.9.2009 (Core). The execution time of each test case is measured using
std::chrono::high_resolution_clock. This is done 5 times for each test case, and timings
are averaged. Statistics of each test case are printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and necessary tables are set up with the help of the FILTER function to create the
charts.
While it might seem that the pull method would be a clear winner, the results indicate that
although pull is always faster than push approach, the difference between the two depends
on the nature of the graph. The next step is to compare the performance between finding
PageRank using C++ DiGraph class directly (using arrays of edge-lists) vs its CSR
(Compressed Sparse Row) representation (contiguous). Using a CSR representation has
the potential for performance improvement due to information on vertices and edges being
stored contiguously.
Table 6.7.1: Adjusting Monolithic (Sequential) approach
Push Pull Class CSR
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
Next an experiment is conducted to assess the performance benefit of each algorithmic
optimization separately. For splitting graph by components optimization, the following
approaches are compared: PageRank without optimization, PageRank with vertices split by
components, and finally PageRank with components sorted in topological order.
Components of the graph are obtained using Kosaraju’s algorithm. Topological ordering is
done by representing the graph as a block-graph, where each component is represented as
a vertex, and cross-edges between components are represented as edges. This block-graph
is then topologically sorted, and this vertex-order in block-graph is used to reorder the
components in topological order. Vertices and their respective edges are simply reordered
accordingly before computing PageRank (no graph partitioning is done). Each approach was
attempted on a number of graphs. On a few graphs, splitting vertices by components
provides a speedup, but sorting components in topological order provides no additional
speedup. For road networks, like germany_osm which only have one component, the
speedup is possibly because of the vertex reordering caused by dfs() which is required for
splitting by components. For skipping in-identicals optimization, comparison is done with
unoptimized PageRank. In-identical vertices are detected by hashing each vertex by its
in-vertices, and scanning vertices with matching hashes for identical in-edges. Except the first in-identical vertex of an in-identicals-group,
remaining vertices are skipped during PageRank computation. After each iteration ends,
rank of the first in-identical vertex is copied to the remaining vertices of the in-identicals
group. The vertices to be skipped are marked with negative source-offset in CSR. On
indochina-2004 graph, skipping in-identicals provides a speedup of ~1.8, but on average
provides no speedup for other graphs. This is likely due to the fact that the graph
indochina-2004 has a large number of in-identicals and in-identical groups, although it
doesn't have the highest in-identicals % or the highest avg. in-identical group size. For
skipping chains optimization, comparison is done with unoptimized PageRank. It is
important to note that a chain here means a set of unidirectional links connecting one vertex
to the next, without any additional edges. Bi-directional links are not considered as chains.
Chain vertices are obtained by traversing 2-degree vertices in both directions and marking
visited ones. Except the first chain vertex of a chains-group, remaining vertices are skipped
during PageRank computation. After each iteration ends, ranks of the remaining vertices in
each chains-group are updated using the (GP) formula c0×(1-p^n)/(1-p) + p^n×r, where c0
is the common teleport contribution, p is the damping factor, n is the distance from the first
chain vertex, and r is the rank of the first chain vertex in the previous iteration. The vertices
to be
skipped are marked with negative source-offset in CSR. On average, skipping chain vertices
provides no speedup. This is likely because most graphs don't have enough chains to
provide an advantage. Road networks do have chains, but they are bi-directional, and thus
not considered here. For skipping converged vertices optimization, the following
approaches are compared: PageRank without optimization, PageRank skipping converged
vertices with re-check (in 2-16 turns), and PageRank skipping converged vertices after
several turns (in 2-64 turns). Skip with re-check (skip-check) approach skips the current
iteration for a vertex if its rank for the last two iterations match, and the current turn (iteration)
is not a “check” turn. The check turn is adjusted between 2-16 turns. Skip after turns
(skip-after) skips all future iterations of a vertex after its rank does not change for “after”
turns. The after turns are adjusted between 2-64 turns. On average, neither skip-check nor
skip-after gives better speed than the default (unoptimized) approach. This could be due to
the unnecessary iterations added by skip-check (mistakenly skipped vertices), and the
increased memory accesses performed by skip-after (tracking the converged count).
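The chain-rank update follows from repeatedly applying r' = c0 + p·r down the chain; a small sketch (illustrative function name):

```cpp
#include <cmath>

// Rank of the vertex n links down a chain. Repeatedly applying
// r' = c0 + p*r gives the geometric-progression closed form:
//   r_n = c0*(1-p^n)/(1-p) + p^n * r
// where r is the first chain vertex's rank in the previous iteration.
double chainRank(double c0, double p, int n, double r) {
  double pn = std::pow(p, n);
  return c0*(1-pn)/(1-p) + pn*r;
}
```

This lets all chain vertices be updated in one pass after each iteration, instead of participating in the rank computation themselves.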
Table 6.7.2: Adjusting Monolithic optimizations (from STICD)
Split components Skip in-identicals Skip chains Skip converged
1. Performance benefit of PageRank with vertices split by components (pull, CSR).
2. Performance benefit of skipping in-identical vertices for PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for PageRank (pull, CSR).
For this experiment Monolithic PageRank (static and dynamic) is contrasted with
nvGraph PageRank (static and dynamic). For dynamic PageRank (monolithic / nvGraph),
initial ranks are set to the ranks obtained from static PageRank of the graph in the previous
instant (or batch). Temporal graphs are stored in a plain text file in “u, v, t” format, where u
is the source vertex, v is the destination vertex, and t is the UNIX epoch time in seconds. All
of them are obtained from the Stanford Large Network Dataset Collection. They are loaded
in multiple batch sizes (1, 5, 10, 50, ...). New edges are incrementally added to the graph
batch-by-batch until the entire graph is complete. Fixed graphs are stored in the
MatrixMarket (.mtx) file format, and are obtained from the SuiteSparse Matrix Collection.
They are loaded in multiple batch sizes (1, 5, 10, 50, ...), as with temporal graphs. For each
batch size B, the same number of random edges are added to the graph, with probability of
a random edge being added to a vertex as directly proportional to its out-degree. As
expected, results show dynamic PageRank to be clearly faster than static PageRank for
most cases (for both temporal and fixed graphs).
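The random batch updates for fixed graphs might be generated along these lines (an illustrative sketch; the exact sampling used in the experiment may differ, e.g., in how zero-out-degree vertices are handled):

```cpp
#include <vector>
#include <random>
#include <utility>
using std::vector; using std::pair;

// Generate `batch` random edges: each source is picked with probability
// proportional to its out-degree (plus one, so isolated vertices can still
// be picked -- an assumption of this sketch), and the target uniformly.
vector<pair<int,int>> randomEdges(const vector<int>& outDeg, int batch, unsigned seed=1) {
  std::mt19937 rng(seed);
  vector<double> w(outDeg.begin(), outDeg.end());
  for (double& x : w) x += 1;                          // smooth zero degrees
  std::discrete_distribution<int> src(w.begin(), w.end());
  std::uniform_int_distribution<int> dst(0, int(outDeg.size())-1);
  vector<pair<int,int>> es;
  for (int i=0; i<batch; ++i) es.push_back({src(rng), dst(rng)});
  return es;
}
```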
Table 6.7.3: Comparing dynamic approach with static
nvGraph dynamic Monolithic dynamic
nvGraph static vs: temporal
Monolithic static vs: fixed, temporal
1. Performance of nvGraph based static vs dynamic PageRank (temporal).
2. Performance of static vs dynamic PageRank (temporal).
3. Performance of static vs dynamic levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates; temporal ⇒ batches of edge
updates from temporal graphs.
The purpose of this experiment is to settle on a good CUDA implementation of static
PageRank. PageRank uses map-reduce primitives in each iteration step (like multiply and
sum). Two floating-point vectors x and y, with no. of elements 1E+6 to 1E+9 were multiplied
using CUDA. Each no. of elements was attempted with various CUDA launch configs,
running each config 5 times to get a good time measure. Multiplication here represents any
memory-aligned independent operation. Using a large grid_limit and a block_size of 256
could be a decent choice (for both float and double).
A floating-point vector x, with no. of elements 1E+6 to 1E+9 was summed up using CUDA
(Σx). Each no. of elements was attempted with various CUDA launch configs, running each
config 5 times to get a good time measure. Sum here represents any reduction operation
that processes several values to a single value. This sum can be performed with two
possible approaches: memcpy, or inplace. With memcpy approach, partial results are
transferred to CPU, where the final sum is calculated. If the result can be used within the
GPU itself, it might be faster to calculate complete sum in-place instead of transferring to
CPU. This is done using either 2 (if grid_limit is 1024) or 3 kernel calls (otherwise). A
block_size of 128 (decent choice for sum) is used for the 2nd kernel, if there are 3 kernels.
With memcpy approach, using float values, a grid_limit of 1024 and a block_size of 128 is a
decent choice. For double values, a grid_limit of 1024 and a block_size of 256 is a decent
choice. With in-place approach, a number of possible optimizations including multiple reads
per loop iteration, loop unrolled reduce, and atomic adds provided no benefit. A simple one
read per loop iteration and standard reduce loop (minimizing warp divergence) is both
shorter and works best. For float, a grid_limit of 1024 and a block_size of 128 is a decent
choice. For double, a grid_limit of 1024 and a block_size of 256 is a decent choice.
Comparing both approaches shows similar performance.
This experiment was for finding a suitable launch config for CUDA thread-per-vertex
PageRank. For the launch config, the block-size (threads) was adjusted from 32-1024, and
the grid-limit (max grid-size) was adjusted from 1024-32768. Each config was run 5 times
per graph to get a good time measure.
On average, the launch config doesn't seem to have a significant impact on performance.
However, 8192x128 appears to be a good config. Here 8192 is the grid-limit, and 128 is the
block-size. Comparing with the graph properties, it seems it would be better to use 8192x512
for graphs with high avg. density, and 8192x32 for graphs with high avg. degree. Maybe
sorting the vertices by degree can have a good effect (due to less warp divergence). Note
that this applies to Tesla V100 PCIe 16GB, and would be different for other GPUs. In order
to measure error, nvGraph PageRank is taken as a reference.
For this experiment, sorting of vertices and/or edges was either NO, ASC, or DESC. This
gives a total of 3 * 3 = 9 cases. Each case is run on multiple graphs, running each 5 times
per graph for good time measure.
Results show that sorting in most cases is slower. Maybe this is because sorted
arrangement tends to overflood certain memory chunks with too many requests. In order to
measure error, nvGraph PageRank is taken as a reference.
This experiment was for finding a suitable launch config for CUDA block-per-vertex
PageRank. For the launch config, the block-size (threads) was adjusted from 32-1024, and
the grid-limit (max grid-size) was adjusted from 1024-32768. Each config was run 5 times
per graph to get a good time measure.
MAXx64 appears to be a good config for most graphs. Here MAX is the grid-limit, and 64 is
the block-size. This launch config is for the entire graph, and could be slightly different for a
subset of graphs. Also note that this applies to Tesla V100 PCIe 16GB, and could be
different for other GPUs. In order to measure error, nvGraph PageRank is taken as a
reference.
For this experiment, sorting of vertices and/or edges was either NO, ASC, or DESC. This
gives a total of 3 * 3 = 9 cases. Each case is run on multiple graphs, running each 5 times
per graph for good time measure.
Results show that sorting in most cases is not faster. In fact, in a number of cases, sorting
actually slows down performance. This may be because sorted arrangements
tend to flood certain memory chunks with too many requests. In order to measure error,
nvGraph PageRank is taken as a reference.
This experiment was for finding a suitable launch config for CUDA switched-per-vertex
PageRank for thread approach. For the launch config, the block-size (threads) was
adjusted from 32-1024, and the grid-limit (max grid-size) was adjusted from 1024-32768.
Each config was run 5 times per graph to get a good time measure.
MAXx512 appears to be a good config for most graphs. Here MAX is the grid-limit, and 512
is the block-size. Note that this applies to Tesla V100 PCIe 16GB, and would be different for
other GPUs. In order to measure error, nvGraph PageRank is taken as a reference.
This experiment was for finding a suitable launch config for CUDA switched-per-vertex
PageRank for block approach. For the launch config, the block-size (threads) was
adjusted from 32-1024, and the grid-limit (max grid-size) was adjusted from 1024-32768.
Each config was run 5 times per graph to get a good time measure.
MAXx256 appears to be a good config for most graphs. Here MAX is the grid-limit, and 256
is the block-size. Note that this applies to Tesla V100 PCIe 16GB, and would be different for
other GPUs. In order to measure error, nvGraph PageRank is taken as a reference.
For this experiment, sorting of vertices and/or edges was either NO, ASC, or DESC. This
gives a total of 3 * 3 = 9 cases. NO here means that vertices are partitioned by in-degree
(edges remain unchanged). Each case is run on multiple graphs, running each 5 times per
graph for good time measure.
Results show that sorting in most cases is not faster. It's better to simply partition vertices by
degree. In order to measure error, nvGraph PageRank is taken as a reference.
For this experiment, the objective is to find a good switch point for CUDA
switched-per-vertex PageRank. To assess this, switch_degree was varied from 2 - 1024,
and switch_limit was varied from 1 - 1024. switch_degree defines the in-degree at which
PageRank kernel switches from thread-per-vertex approach to block-per-vertex. switch_limit
defines the minimum block size for thread-per-vertex / block-per-vertex approach (if a block
size is too small, it is merged with the other approach block). Each case is run on multiple
graphs, running each 5 times per graph for good time measure.
It seems switch_degree of 64 and switch_limit of 32 would be a good choice.
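The switch-point logic above can be sketched as a host-side partition step (the names `switchPartition` and `Partition` are illustrative; the actual kernel presumably works on a CSR-ordered vertex range rather than explicit id lists):

```cpp
#include <vector>
#include <cstddef>

// Partition vertex ids into a low-degree set (thread-per-vertex) and a
// high-degree set (block-per-vertex), switching at switch_degree. If either
// side ends up smaller than switch_limit, it is merged into the other, so
// tiny partitions don't get their own kernel approach.
struct Partition { std::vector<int> low, high; };

Partition switchPartition(const std::vector<int>& degree,
                          int switch_degree, size_t switch_limit) {
  Partition p;
  for (int v = 0; v < (int)degree.size(); ++v)
    (degree[v] < switch_degree ? p.low : p.high).push_back(v);
  if (p.low.size() < switch_limit) {            // merge tiny low side
    p.high.insert(p.high.end(), p.low.begin(), p.low.end());
    p.low.clear();
  } else if (p.high.size() < switch_limit) {    // merge tiny high side
    p.low.insert(p.low.end(), p.high.begin(), p.high.end());
    p.high.clear();
  }
  return p;
}
```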
Table 6.7.4: Adjusting Monolithic CUDA approach
Map: launch
Reduce: memcpy launch, in-place launch, memcpy vs in-place
Thread /V: launch, sort/p. vertices, sort edges
Block /V: launch, sort/p. vertices, sort edges
Switched /V: thread launch, block launch, switch-point
1. Comparing various launch configs for CUDA based vector multiply.
2. Comparing various launch configs for CUDA based vector element sum (memcpy).
3. Comparing various launch configs for CUDA based vector element sum (in-place).
4. Performance of memcpy vs in-place based CUDA based vector element sum.
5. Comparing various launch configs for CUDA thread-per-vertex based PageRank (pull, CSR).
6. Sorting vertices and/or edges by in-degree for CUDA thread-per-vertex based PageRank.
7. Comparing various launch configs for CUDA block-per-vertex based PageRank (pull, CSR).
8. Sorting vertices and/or edges by in-degree for CUDA block-per-vertex based PageRank.
9. Launch configs for CUDA switched-per-vertex based PageRank focusing on thread approach.
10. Launch configs for CUDA switched-per-vertex based PageRank focusing on block approach.
11. Sorting vertices and/or edges by in-degree for CUDA switched-per-vertex based PageRank.
12. Comparing various switch points for CUDA switched-per-vertex based PageRank (pull, ...).
Note: sort/p. vertices ⇒ sorting vertices by ascending or descending order of in-degree, or simply
partitioning (by in-degree). sort edges ⇒ sorting edges by ascending or descending order of id.
This experiment was for checking the benefit of splitting vertices of the graph by
components. This was done by comparing performance between: PageRank without
optimization, PageRank with vertices split by components, PageRank with components
sorted in topological order. Each approach was attempted on a number of graphs, running
each approach 5 times to get a good time measure.
On a few graphs, splitting vertices by components provides a speedup, but sorting
components in topological order provides no additional speedup. For road networks like
germany_osm, which have only one component, the speedup is possibly because of the
vertex reordering caused by dfs(), which is required for splitting by components. However, on
average there is no speedup.
This experiment was for checking the benefit of skipping rank calculation of in-identical
vertices. This optimization, and the control approach, was attempted on a number of
graphs, running each approach 5 times to get a good time measure.
On indochina-2004 graph, skipping in-identicals provides a speedup of ~1.3, but on average
provides no speedup for other graphs. This could be due to the fact that the graph
indochina-2004 has a large number of in-identicals and in-identical groups. However, it
doesn't have the highest in-identicals % or the highest avg. in-identical group size, so I am not
so sure.
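One way to obtain such in-identical groups is to key vertices by their (sorted) in-neighbor lists. The experiment hashes in-vertices, but a `std::map` keyed on the list keeps this sketch short (names illustrative):

```cpp
#include <vector>
#include <map>

// Group vertices that have identical in-neighbor lists ("in-identicals").
// Only the first vertex of each group needs its rank computed; the rest
// can copy that rank after every iteration. Assumes in-lists are sorted.
std::vector<std::vector<int>> inIdenticalGroups(
    const std::vector<std::vector<int>>& inNeighbors) {
  std::map<std::vector<int>, std::vector<int>> byKey;  // in-list -> vertices
  for (int v = 0; v < (int)inNeighbors.size(); ++v)
    byKey[inNeighbors[v]].push_back(v);
  std::vector<std::vector<int>> groups;
  for (auto& kv : byKey)
    if (kv.second.size() > 1) groups.push_back(kv.second);
  return groups;
}
```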
This experiment was for checking the benefit of skipping rank calculation of chain
vertices. This optimization, and the control approach, was attempted on a number of
graphs, running each approach 5 times to get a good time measure.
On average, skipping chain vertices provides no speedup. A chain here means a set of
unidirectional links connecting one vertex to the next, without any additional edges.
Bi-directional links are not considered as chains. Note that most graphs don't have enough
chains to provide an advantage. Road networks do have chains, but they are bi-directional,
and thus not considered here.
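A simplified test for chain vertices under the definition above, using only in/out adjacency (the actual detection also walks each chain in both directions to group its vertices; names are illustrative):

```cpp
#include <vector>

// Simplified chain-vertex test: in-degree 1, out-degree 1, and the single
// in-neighbor differs from the single out-neighbor (if they coincide, the
// link is bi-directional, which the experiment excludes from chains).
std::vector<bool> chainVertices(const std::vector<std::vector<int>>& out,
                                const std::vector<std::vector<int>>& in) {
  int N = (int)out.size();
  std::vector<bool> chain(N, false);
  for (int v = 0; v < N; ++v)
    chain[v] = in[v].size() == 1 && out[v].size() == 1 && in[v][0] != out[v][0];
  return chain;
}
```

On the path 0→1→2→3, only vertices 1 and 2 qualify: the endpoints lack an in- or out-link.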
This experiment was for checking the benefit of skipping converged vertices. This was
done by comparing performance between: PageRank without optimization, PageRank
skipping converged vertices with re-check (in 2-16 turns), PageRank skipping converged
vertices after several turns (in 2-64 turns). Each approach was attempted on a number of
graphs, running each approach 5 times to get a good time measure. Skip with re-check
(skip-check) is done every 2-16 turns. Skip after turns (skip-after) is done after 2-64 turns.
On average, neither skip-check, nor skip-after gives better speed than the default
(unoptimized) approach. This could be due to the unnecessary iterations added by
skip-check (mistakenly skipped), and increased memory accesses performed by skip-after
(tracking converged count).
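The skip-after bookkeeping can be sketched as a per-vertex counter (a minimal sequential sketch; `skipAfter` and the tolerance are assumptions, not the experiment's exact code):

```cpp
#include <vector>
#include <cmath>

// Skip-after bookkeeping: a vertex whose rank has not changed (within
// tolerance) for `after` consecutive turns is skipped in all later turns.
// Returns true if vertex v should be skipped this turn; otherwise updates
// its unchanged-counter based on the latest rank change.
bool skipAfter(std::vector<int>& unchanged, int v,
               double oldRank, double newRank,
               int after, double tol = 1e-10) {
  if (unchanged[v] >= after) return true;          // considered converged
  if (std::fabs(newRank - oldRank) < tol) ++unchanged[v];
  else unchanged[v] = 0;                           // rank changed: reset
  return false;
}
```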
Table 6.7.5: Adjusting Monolithic CUDA optimizations (from STICD)
Split components Skip in-identicals Skip chains Skip converged
1. Performance benefit of CUDA based PageRank with vertices split by components.
2. Performance benefit of skipping in-identical vertices for CUDA based PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for CUDA based PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for CUDA based PageRank (pull, CSR).
This experiment is for comparing the performance between: Monolithic dynamic
PageRank, Monolithic static PageRank, nvGraph dynamic PageRank, and nvGraph static
PageRank. This is done with both fixed, and temporal graphs. For temporal graphs, updating
of each graph is done in multiple batch sizes (1, 5, 10, 50, ...). New edges are incrementally
added to the graph batch-by-batch until the entire graph is complete. For fixed graphs, each
batch size was run with 5 different updates to the graph, and each specific update was run 5
times for each approach to get a good time measure.
On average, Monolithic dynamic PageRank is faster than the static approach.
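The core idea behind the dynamic approach is to seed the iteration with the previous snapshot's ranks instead of the uniform 1/N vector, so that after a small batch update far fewer iterations are needed. A generic sketch (the `Step` function and names are illustrative):

```cpp
#include <vector>
#include <cmath>

// Iterate step() from an initial rank vector until the L1 change per
// iteration falls below tol; return the number of iterations taken.
// Dynamic PageRank passes the previous snapshot's ranks as `r`, which is
// already near the new fixed point after a small batch of edge updates.
template <class Step>  // Step: vector<double> -> vector<double>
int iterationsToConverge(Step step, std::vector<double> r, double tol = 1e-10) {
  for (int it = 1; ; ++it) {
    std::vector<double> rn = step(r);
    double err = 0;
    for (size_t i = 0; i < r.size(); ++i) err += std::fabs(rn[i] - r[i]);
    r = std::move(rn);
    if (err < tol) return it;
  }
}
```

With any contraction as the step, starting near the fixed point converges in fewer iterations than starting from scratch, which is exactly the advantage measured here.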
Table 6.7.6: Comparing dynamic CUDA approach with static
nvGraph dynamic Monolithic dynamic
nvGraph static vs: fixed, temporal vs: fixed, temporal
Monolithic static vs: fixed, temporal vs: fixed, temporal
1. Performance of static vs dynamic CUDA based PageRank (fixed).
2. Performance of static vs dynamic CUDA based PageRank (temporal).
3. Performance of CUDA based static vs dynamic levelwise PageRank (fixed).
4. Performance of static vs dynamic CUDA based levelwise PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment is for comparing the performance between: Monolithic dynamic PageRank,
Monolithic static PageRank, nvGraph dynamic PageRank, and nvGraph static PageRank.
Here, unaffected vertices are skipped from PageRank computation. This is done with
fixed graphs. Each batch size was run with 5 different updates to the graph, and each
specific update was run 5 times for each approach to get a good time measure.
On average, Monolithic dynamic PageRank is faster than the static approach.
Table 6.7.7: Comparing dynamic optimized CUDA approach with static
nvGraph dynamic Monolithic dynamic
nvGraph static vs: fixed vs: fixed
Monolithic static vs: fixed vs: fixed
1. Performance of CUDA based optimized dynamic monolithic vs levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
6.8 Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD)
Techniques to optimize the PageRank algorithm usually fall in two categories. One is to try
reducing the work per iteration, and the other is to try reducing the number of iterations.
These goals are often at odds with one another. Skipping computation on vertices which
have already converged has the potential to save iteration time. Skipping in-identical
vertices, with the same in-links, helps reduce duplicate computations and thus could help
reduce iteration time. Road networks often have chains which can be short-circuited before
PageRank computation to improve performance. Final ranks of chain nodes can be easily
calculated. This could reduce both the iteration time, and the number of iterations. If a graph
has no dangling nodes, PageRank of each strongly connected component can be
computed in topological order. This could help reduce the iteration time, no. of iterations,
and also enable multi-iteration concurrency in PageRank computation. The combination of
all of the above methods is the STICD algorithm [17]. For dynamic graphs, unchanged
components whose ranks are unaffected can be skipped altogether.
Before starting any algorithmic optimization, a good monolithic PageRank implementation
needs to be set up. There are two ways (algorithmically) to think of the PageRank
calculation. One approach (push) is to find PageRank by pushing contributions to
out-vertices. The push method is somewhat easier to implement, and is described in this
lecture. With this approach, in each iteration, every vertex adds p×rn/dn to the rank of each
vertex on its outgoing edges, where p is the damping factor (0.85), rn is the rank of the
(source) vertex in the previous iteration, and dn is its out-degree. But, if a vertex has no out-going edges, it
is considered to have out-going edges to all vertices in the graph (including itself). This is
because a random surfer jumps to a random page upon visiting a page with no links, in order
to avoid the rank-sink effect. However, it requires multiple writes per source vertex, due to
the cumulation (+=) operation. The other approach (pull) is to pull contributions from
in-vertices. Here, the rank of a vertex in an iteration is calculated as c0 + pΣrn/dn, where c0 is
the common teleport contribution, p is the damping factor (0.85), rn is the previous rank of
vertex with an incoming edge, dn is the out-degree of the incoming-edge vertex, and N is the
total number of vertices in the graph. The common teleport contribution c0, calculated as
(1-p)/N + pΣrn/N, includes the contribution due to a teleport from any vertex in the graph due
to the damping factor (1-p)/N, and teleport from dangling vertices (with no outgoing edges) in
the graph pΣrn/N (to avoid the rank-sink effect). This approach requires 2 additional
calculations per-vertex, i.e., non-teleport contribution of each vertex, and total teleport
contribution (to all vertices). However, it requires only 1 write per destination vertex. For this
experiment both of these approaches are assessed on a number of different graphs.
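The pull formulation described above can be sketched sequentially on a reverse (in-edge) CSR. This is a minimal single-threaded sketch, not the experiment's exact code; names are illustrative:

```cpp
#include <vector>

// One pull-based PageRank iteration on a reverse (in-edge) CSR.
// r'[v] = c0 + p * sum(r[u]/d[u]) over in-neighbors u of v, where
// c0 = (1-p)/N + p * sum(r[d])/N over dangling vertices d (out-degree 0).
// Dangling vertices never appear in any in-adjacency list, so no division
// by zero occurs in the inner loop.
std::vector<double> pageRankIteration(
    const std::vector<int>& inOff,   // CSR offsets into inAdj, size N+1
    const std::vector<int>& inAdj,   // concatenated in-neighbor lists
    const std::vector<int>& outdeg,  // out-degree of each vertex
    const std::vector<double>& r,    // ranks from previous iteration
    double p = 0.85) {
  int N = (int)outdeg.size();
  double dangling = 0;
  for (int v = 0; v < N; ++v)
    if (outdeg[v] == 0) dangling += r[v];
  double c0 = (1 - p) / N + p * dangling / N;  // common teleport contribution
  std::vector<double> rnew(N);
  for (int v = 0; v < N; ++v) {
    double s = 0;
    for (int i = inOff[v]; i < inOff[v + 1]; ++i)
      s += r[inAdj[i]] / outdeg[inAdj[i]];
    rnew[v] = c0 + p * s;   // single write per destination vertex
  }
  return rnew;
}
```

On the 3-cycle 0→1→2→0 with uniform initial ranks, one iteration leaves the ranks at 1/3 each and their sum at 1, as expected for a stochastic update.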
All graphs used are stored in the MatrixMarket (.mtx) file format, and obtained from the
SuiteSparse Matrix Collection. The experiment is implemented in C++, and compiled using
GCC 9 with optimization level 3 (-O3). The system used is a Dell PowerEdge R740 Rack
server with two Intel Xeon Silver 4116 CPUs @ 2.10GHz, 128GB DIMM DDR4 Synchronous
Registered (Buffered) 2666 MHz (8x16GB) DRAM, and running CentOS Linux release
7.9.2009 (Core). The execution time of each test case is measured using
std::chrono::high_resolution_clock. This is done 5 times for each test case, and timings
are averaged. Statistics of each test case is printed to standard output (stdout), and
redirected to a log file, which is then processed with a script to generate a CSV file, with
each row representing the details of a single test case. This CSV file is imported into Google
Sheets, and necessary tables are set up with the help of the FILTER function to create the
charts.
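The 5-run averaging can be sketched with std::chrono (the helper name is illustrative):

```cpp
#include <chrono>

// Run fn() `runs` times and return the average elapsed milliseconds,
// mirroring the 5-run averaging used for each test case.
template <class F>
double averageMillis(F fn, int runs = 5) {
  double total = 0;
  for (int i = 0; i < runs; ++i) {
    auto t0 = std::chrono::high_resolution_clock::now();
    fn();
    auto t1 = std::chrono::high_resolution_clock::now();
    total += std::chrono::duration<double, std::milli>(t1 - t0).count();
  }
  return total / runs;
}
```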
While it might seem that the pull method would be a clear winner, the results indicate that
although pull is always faster than the push approach, the difference between the two depends
on the nature of the graph. The next step is to compare the performance between finding
PageRank using C++ DiGraph class directly (using arrays of edge-lists) vs its CSR
(Compressed Sparse Row) representation (contiguous). Using a CSR representation has
the potential for performance improvement due to information on vertices and edges being
stored contiguously.
Table 6.8.1: Adjusting Monolithic (Sequential) approach
Push Pull Class CSR
1. Performance of contribution-push based vs contribution-pull based PageRank.
2. Performance of C++ DiGraph class based vs CSR based PageRank (pull).
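Building the contiguous CSR from the DiGraph's edge lists is a two-pass count-and-scatter; a minimal sketch (names illustrative):

```cpp
#include <vector>
#include <utility>

// Build a CSR (offsets + flat adjacency) from an edge list, so that the
// out-neighbors of vertex v occupy adj[off[v] .. off[v+1]) contiguously.
void edgesToCsr(int N, const std::vector<std::pair<int, int>>& edges,
                std::vector<int>& off, std::vector<int>& adj) {
  off.assign(N + 1, 0);
  for (auto& e : edges) ++off[e.first + 1];        // count out-degrees
  for (int v = 0; v < N; ++v) off[v + 1] += off[v]; // prefix sum -> offsets
  adj.resize(edges.size());
  std::vector<int> pos(off.begin(), off.end() - 1); // next free slot per vertex
  for (auto& e : edges) adj[pos[e.first]++] = e.second;
}
```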
Next an experiment is conducted to assess the performance benefit of each algorithmic
optimization separately. For splitting graph by components optimization, the following
approaches are compared: PageRank without optimization, PageRank with vertices split by
components, and finally PageRank with components sorted in topological order.
Components of the graph are obtained using Kosaraju’s algorithm. Topological ordering is
done by representing the graph as a block-graph, where each component is represented as
a vertex, and cross-edges between components are represented as edges. This block-graph
is then topologically sorted, and this vertex-order in block-graph is used to reorder the
components in topological order. Vertices, and their respective edges are accordingly simply
reordered before computing PageRank (no graph partitioning is done). Each approach was
attempted on a number of graphs. On a few graphs, splitting vertices by components
provides a speedup, but sorting components in topological order provides no additional
speedup. For road networks like germany_osm, which have only one component, the
speedup is possibly because of the vertex reordering caused by dfs(), which is required for
splitting by components. For skipping in-identicals optimization, comparison is done with
unoptimized PageRank. In-identical vertices are found by hashing the in-vertex list of each
vertex and comparing vertices with matching hashes. Except for the first in-identical vertex of an in-identicals-group,
remaining vertices are skipped during PageRank computation. After each iteration ends,
rank of the first in-identical vertex is copied to the remaining vertices of the in-identicals
group. The vertices to be skipped are marked with negative source-offset in CSR. On
indochina-2004 graph, skipping in-identicals provides a speedup of ~1.8, but on average
provides no speedup for other graphs. This is likely due to the fact that the graph
indochina-2004 has a large number of in-identicals and in-identical groups, although it does
not have the highest in-identicals % or the highest avg. in-identical group size. For
skipping chains optimization, comparison is done with unoptimized PageRank. It is
important to note that a chain here means a set of unidirectional links connecting one vertex
to the next, without any additional edges. Bi-directional links are not considered as chains.
Chain vertices are obtained by traversing 2-degree vertices in both directions and marking
visited ones. Except the first chain vertex of a chains-group, remaining vertices are skipped
during PageRank computation. After each iteration ends, ranks of the remaining vertices in
each chains-group are updated using the (GP) formula c0×(1-p^n)/(1-p) + p^n×r, where c0 is
the common teleport contribution, p is the damping factor, n is the distance from the first
chain vertex, and r is the rank of the first chain vertex in the previous iteration. The vertices to be
skipped are marked with negative source-offset in CSR. On average, skipping chain vertices
provides no speedup. This is likely because most graphs don't have enough chains to
provide an advantage. Road networks do have chains, but they are bi-directional, and thus
not considered here. For skipping converged vertices optimization, the following
approaches are compared: PageRank without optimization, PageRank skipping converged
vertices with re-check (in 2-16 turns), and PageRank skipping converged vertices after
several turns (in 2-64 turns). Skip with re-check (skip-check) approach skips the current
iteration for a vertex if its rank for the last two iterations match, and the current turn (iteration)
is not a “check” turn. The check turn is adjusted between 2-16 turns. Skip after turns
(skip-after) skips all future iterations of a vertex after its rank does not change for “after”
turns. The after turns are adjusted between 2-64 turns. On average, neither skip-check, nor
skip-after gives better speed than the default (unoptimized) approach. This could be due to
the unnecessary iterations added by skip-check (mistakenly skipped), and increased
memory accesses performed by skip-after (tracking converged count).
Table 6.8.2: Adjusting Monolithic optimizations (from STICD)
Split components Skip in-identicals Skip chains Skip converged
1. Performance benefit of PageRank with vertices split by components (pull, CSR).
2. Performance benefit of skipping in-identical vertices for PageRank (pull, CSR).
3. Performance benefit of skipping chain vertices for PageRank (pull, CSR).
4. Performance benefit of skipping converged vertices for PageRank (pull, CSR).
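The geometric-progression short-circuit for chains described above follows from the per-step recurrence r' = c0 + p·r (a chain vertex has a single in-neighbor of out-degree 1), and can be checked against it numerically; a sketch:

```cpp
#include <cmath>

// Closed-form rank of the n-th vertex down a chain, given rank r of the
// first chain vertex: applying r' = c0 + p*r repeatedly gives
// r_n = c0*(1 - p^n)/(1 - p) + p^n * r (a geometric-progression sum).
double chainRank(double c0, double p, int n, double r) {
  return c0 * (1 - std::pow(p, n)) / (1 - p) + std::pow(p, n) * r;
}

// Reference: apply the per-step recurrence n times.
double chainRankIterative(double c0, double p, int n, double r) {
  for (int i = 0; i < n; ++i) r = c0 + p * r;
  return r;
}
```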
This experiment was for comparing the performance of levelwise PageRank with various
min. compute sizes, ranging from 1 - 1E+7. Here, min. compute size is the minimum number
of nodes in each PageRank compute using the standard (monolithic) algorithm. Each min.
compute size was attempted on different types of graphs, running each size 5 times per
graph to get a good time measure. Levelwise PageRank is the STIC-D algorithm, without
ICD optimizations (using single-thread). Although there is no clear winner, it appears a min.
compute size of 10 would be a good choice. Note that the levelwise approach does not
make use of SIMD instructions which are available on all modern hardware.
This experiment was for comparing performance between: monolithic PageRank, monolithic
PageRank skipping teleport, levelwise PageRank, levelwise PageRank skipping teleport.
Each approach was attempted on different types of graphs, running each approach 5 times
per graph to get a good time measure. Levelwise PageRank is the STIC-D algorithm, without
ICD optimizations (using single-thread).
Except for soc-LiveJournal1 and coPapersCiteseer, in all cases skipping teleport calculations
is slightly faster (possibly measurement noise for those two). The improvement is most
prominent in the case of road networks and certain web graphs.
Table 6.8.3: Adjusting Levelwise (STICD) approach
Min. component size Min. compute size Skip teleport calculation
1. Comparing various min. component sizes for topologically-ordered components (levelwise...).
2. Comparing various min. compute sizes for topologically-ordered components (levelwise...).
3. Checking performance benefit of levelwise PageRank when teleport calculation is skipped.
Note: min. component size merges small components even before generating the block-graph /
topological-ordering, but min. compute size does it before PageRank computation.
This experiment was for comparing performance between: PageRank with standard
algorithm (monolithic), PageRank in topologically-ordered components fashion (levelwise).
Both approaches were attempted on different types of graphs, running each approach 5
times per graph to get a good time measure. Levelwise PageRank is the STIC-D algorithm,
without ICD optimizations (using single-thread).
On average, levelwise PageRank is faster than the monolithic approach. Note that neither
approach makes use of SIMD instructions which are available on all modern hardware.
Table 6.8.4: Comparing Levelwise (STICD) approach
Monolithic nvGraph
Levelwise (STICD) vs
1. Performance of monolithic vs topologically-ordered components (levelwise) PageRank.
This experiment was for comparing performance between: static levelwise PageRank,
dynamic levelwise PageRank (process all components), dynamic levelwise PageRank
skipping unchanged components. Each approach was attempted on a number of graphs
(fixed and temporal), running each with multiple batch sizes (1, 5, 10, 50, ...). Levelwise
PageRank is the STIC-D algorithm, without ICD optimizations (using single-thread).
On average, skipping unchanged components is barely faster than not skipping.
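Skipping unchanged components amounts to marking the components touched by the batch, plus everything downstream of them in the block-graph, since rank changes propagate only along out-edges; a sketch (names illustrative):

```cpp
#include <vector>
#include <queue>

// Mark components affected by a batch of edge updates: a component is
// affected if it directly contains a changed edge endpoint, or is reachable
// from an affected component in the block-graph (component-level DAG).
std::vector<bool> affectedComponents(
    int C, const std::vector<std::vector<int>>& blockAdj,
    const std::vector<int>& touched) {  // component ids with changed edges
  std::vector<bool> affected(C, false);
  std::queue<int> q;
  for (int c : touched)
    if (!affected[c]) { affected[c] = true; q.push(c); }
  while (!q.empty()) {                  // BFS over the block-graph
    int c = q.front(); q.pop();
    for (int d : blockAdj[c])
      if (!affected[d]) { affected[d] = true; q.push(d); }
  }
  return affected;
}
```

Components left unmarked keep their previous ranks and are skipped entirely.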
Table 6.8.5: Adjusting Levelwise (STICD) dynamic approach
Skip unaffected components For fixed graphs For temporal graphs
1. Checking for correctness of levelwise PageRank when unchanged components are skipped.
2. Perf. benefit of levelwise PageRank when unchanged components are skipped (fixed).
3. Perf. benefit of levelwise PageRank when unchanged components are skipped (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing performance between: static PageRank using standard
algorithm (monolithic), static PageRank using levelwise algorithm, dynamic PageRank using
levelwise algorithm. Each approach was attempted on a number of graphs, running each
with multiple batch sizes (1, 5, 10, 50, ...). Each PageRank computation was run 5 times for
both approaches to get a good time measure. Levelwise PageRank is the STIC-D algorithm,
without ICD optimizations (using single-thread).
Clearly, dynamic levelwise PageRank is faster than the static approach for many batch
sizes.
Table 6.8.6: Comparing dynamic approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: temporal
Monolithic static vs: fixed, temporal vs: fixed, temporal
Levelwise static vs: fixed vs: fixed, temporal
1. Performance of nvGraph based static vs dynamic PageRank (temporal).
2. Performance of static vs dynamic PageRank (temporal).
3. Performance of static vs dynamic levelwise PageRank (fixed).
4. Performance of levelwise based static vs dynamic PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing the performance of levelwise CUDA PageRank with
various min. compute sizes, ranging from 1E+3 - 1E+7. Here, min. compute size is the
minimum number of nodes in each PageRank compute using the standard (monolithic
CUDA) algorithm. Each min. compute size was attempted on different types of graphs, running each
size 5 times per graph to get a good time measure. Levelwise PageRank is the STIC-D
algorithm, without ICD optimizations (using single-thread).
Although there is no clear winner, it appears a min. compute size of 5E+6 would be a good
choice.
Table 6.8.7: Adjusting Levelwise (STICD) CUDA approach
Min. component size Min. compute size Skip teleport calculation
1. Min. component sizes for topologically-ordered components (levelwise, CUDA) PageRank.
2. Min. compute sizes for topologically-ordered components (levelwise CUDA) PageRank.
Note: min. component size merges small components even before generating the block-graph /
topological-ordering, but min. compute size does it before PageRank computation.
This experiment was for comparing performance between: CUDA based PageRank with
standard algorithm (monolithic), CUDA based PageRank in topologically-ordered
components fashion (levelwise). Both approaches were attempted on different types of
graphs, running each approach 5 times per graph to get a good time measure. Levelwise
PageRank is the STIC-D algorithm, without ICD optimizations (using single-thread).
On average, levelwise PageRank performs on par with the monolithic approach.
Table 6.8.8: Comparing Levelwise (STICD) CUDA approach
nvGraph Monolithic CUDA
Monolithic vs vs
Monolithic CUDA vs
Levelwise CUDA vs vs
1. Performance of sequential execution based vs CUDA based PageRank (pull, CSR).
2. Performance of nvGraph vs CUDA based PageRank (pull, CSR).
3. Performance of Monolithic CUDA vs Levelwise CUDA PageRank (pull, CSR, ...).
This experiment was for comparing the performance between: static PageRank of updated
graph, dynamic PageRank of updated graph. Both techniques were attempted on different
temporal graphs, updating each graph with multiple batch sizes (1, 5, 10, 50, ...). New edges
are incrementally added to the graph batch-by-batch until the entire graph is complete.
Dynamic PageRank is clearly faster than the static approach for many batch sizes.
Table 6.8.9: Comparing dynamic CUDA approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
Monolithic static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
Levelwise static vs: fixed, temporal vs: fixed, temporal vs: fixed, temporal
1. Performance of static vs dynamic CUDA based PageRank (fixed).
2. Performance of static vs dynamic CUDA based PageRank (temporal).
3. Performance of CUDA based static vs dynamic levelwise PageRank (fixed).
4. Performance of static vs dynamic CUDA based levelwise PageRank (temporal).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
This experiment was for comparing performance between: static PageRank of updated
graph using nvGraph, dynamic PageRank of updated graph using nvGraph, static monolithic
CUDA based PageRank of updated graph, dynamic monolithic CUDA based PageRank of
updated graph, static levelwise CUDA based PageRank of updated graph, dynamic
levelwise CUDA based PageRank of updated graph. Each approach was attempted on a
number of graphs, running each with multiple batch sizes (1, 5, 10, 50, ...). Each batch size
was run with 5 different updates to the graph, and each specific update was run 5 times for
each approach to get a good time measure. Levelwise PageRank is the STIC-D algorithm,
without ICD optimizations.
Indeed, dynamic levelwise PageRank is faster than the static approach for many batch
sizes. In order to measure error, nvGraph PageRank is taken as a reference.
Table 6.8.10: Comparing dynamic optimized CUDA approach with static
nvGraph dynamic Monolithic dynamic Levelwise dynamic
nvGraph static vs: fixed vs: fixed vs: fixed
Monolithic static vs: fixed vs: fixed vs: fixed
Levelwise static vs: fixed vs: fixed vs: fixed
1. Performance of CUDA based optimized dynamic monolithic vs levelwise PageRank (fixed).
Note: fixed ⇒ static graphs with batches of random edge updates. temporal ⇒ batches of edge
updates from temporal graphs.
7. Packages
1. CLI for SNAP dataset, which is a collection of more than 50 large networks.
This is for quickly fetching SNAP datasets that you need right from the CLI. Currently there is
only one command clone, where you can provide filters for specifying exactly which datasets
you need, and where to download them. If a dataset already exists, it is skipped. A summary
is shown at the end. You can install this with npm install -g snap-data.sh.
2. CLI for nvGraph, which is a GPU-based graph analytics library written by NVIDIA,
using CUDA.
This is for running nvGraph functions right from the CLI with graphs in MatrixMarket format
(.mtx) directly. It just needs an x86_64 Linux machine with NVIDIA GPU drivers installed.
Execution time, along with the results, can be saved in a JSON/YAML file. The executable code
is written in C++. You can install this with npm install -g nvgraph.sh.
8. Further action
List dynamic graph algorithms
List dynamic graph data structures
List graph processing frameworks
List graph applications
Package graph processing frameworks
9. Bibliography
[1] E. W. Weisstein, “Königsberg Bridge Problem.,” MathWorld--A Wolfram Web
Resource. https://mathworld.wolfram.com/KoenigsbergBridgeProblem.html (accessed
Jul. 23, 2021).
[2] M. A. F. Richter, “Infographic: How Many Websites Are There?,” Statista Infographics,
Oct. 2019.
[3] A. Langville and C. Meyer, “Deeper Inside PageRank,” Internet Math., vol. 1, no. 3, pp.
335–380, Jan. 2004, doi: 10.1080/15427951.2004.10129091.
[4] R. Meusel, “The graph structure in the web – analyzed on different aggregation levels,”
JWS, vol. 1, no. 1, pp. 33–47, Aug. 2015, doi: 10.1561/106.00000003.
[5] M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler, “Practice of Streaming
and Dynamic Graphs: Concepts, Models, Systems, and Parallelism,” CoRR, vol.
abs/1912.12740, 2019.
[6] M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler, “Practice of Streaming
Processing of Dynamic Graphs: Concepts, Models, and Systems,” 2021.
[7] Ong Kok Chien, Poo Kuan Hoong, and Chiung Ching Ho, “A comparative study of
HITS vs PageRank algorithms for Twitter users analysis,” in 2014 International
Conference on Computational Science and Technology (ICCST), Aug. 2014, pp. 1–6,
doi: 10.1109/ICCST.2014.7045007.
[8] Q. Zhang and T. Yuan, “Analysis of China’s Urban Network Structure from the
Perspective of ‘Streaming,’” in 2018 26th International Conference on Geoinformatics,
Jun. 2018, pp. 1–7, doi: 10.1109/GEOINFORMATICS.2018.8557078.
[9] Y.-Y. Kim, H.-A. Kim, C.-H. Shin, K.-H. Lee, C.-H. Choi, and W.-S. Cho, “Analysis on
the transportation point in cheongju city using pagerank algorithm,” in Proceedings of
the 2015 International Conference on Big Data Applications and Services - BigDAS
’15, New York, New York, USA, Oct. 2015, pp. 165–169, doi:
10.1145/2837060.2837087.
[10] I. M. Kloumann, J. Ugander, and J. Kleinberg, “Block models and personalized
PageRank.,” Proc Natl Acad Sci USA, vol. 114, no. 1, pp. 33–38, Jan. 2017, doi:
10.1073/pnas.1611275114.
[11] B. Zhang, Y. Wang, Q. Jin, and J. Ma, “A Pagerank-Inspired Heuristic Scheme for
Influence Maximization in Social Networks,” International Journal of Web Services
Research, vol. 12, no. 4, pp. 48–62, Oct. 2015, doi: 10.4018/IJWSR.2015100104.
[12] S. Chaudhari, A. Azaria, and T. Mitchell, “An entity graph based Recommender
System,” AIC, vol. 30, no. 2, pp. 141–149, May 2017, doi: 10.3233/AIC-170728.
[13] Contributors to Wikimedia projects, “PageRank,” Wikipedia, Jul. 2021.
https://en.wikipedia.org/wiki/PageRank (accessed Mar. 01, 2021).
[14] J. Leskovec, “PageRank Algorithm, Mining massive Datasets (CS246), Stanford
University,” YouTube, 2019.
[15] J. F. Jardine, “PageRanks-Example,” Nov. 2007.
[16] H. Dubey, N. Khare, K. K. Appu Kuttan, and S. Bhatia, “Improved parallel pagerank
algorithm for spam filtering,” Indian J. Sci. Technol., vol. 9, no. 38, Oct. 2016, doi:
10.17485/ijst/2016/v9i38/90410.
[17] P. Garg and K. Kothapalli, “STIC-D: Algorithmic techniques for efficient parallel
pagerank computation on real-world graphs,” in Proceedings of the 17th International
Conference on Distributed Computing and Networking - ICDCN ’16, New York, New
York, USA, Jan. 2016, pp. 1–10, doi: 10.1145/2833312.2833322.
[18] D. Frey, Distributed Computing and Networking, 2nd ed. Berlin: Springer Nature, 2013,
p. 366.
[19] B. Bahmani, K. Chakrabarti, and D. Xin, “Fast personalized PageRank on
MapReduce,” in Proceedings of the 2011 international conference on Management of
data - SIGMOD ’11, New York, New York, USA, Jun. 2011, p. 973, doi:
10.1145/1989323.1989425.
[20] S. Lai, B. Shao, Y. Xu, and X. Lin, “Parallel computations of local PageRank problem
based on Graphics Processing Unit,” Concurrency Computat.: Pract. Exper., vol. 29,
no. 24, p. e4245, Aug. 2017, doi: 10.1002/cpe.4245.
[21] S. Hunold, Euro-Par 2015: Parallel Processing Workshops: Euro-Par 2015
International Workshops, Vienna, Austria, August 24-25, 2015, Revised Selected
Papers (Lecture Notes in Computer Science Book 9523), 1st ed. 2015. Cham:
Springer, 2015, p. 882.
[22] K. Lakhotia, R. Kannan, and V. Prasanna, “Accelerating PageRank using
Partition-Centric Processing,” in 2018 USENIX Annual Technical Conference (USENIX
ATC ’18), Boston, MA, Jul. 2018, pp. 427–440.
[23] X. Wang, L. Huang, Y. Zhu, Y. Zhou, H. Peng, and H. Xiong, “Addressing memory wall
problem of graph computation in reconfigurable system,” in 2015 IEEE 17th
International Conference on High Performance Computing and Communications, 2015