This document describes research on efficiently computing PageRank using GPUs. The key points are:
1. PageRank is computed using an iterative power method, which can be sped up using GPU parallelization. Common operations like sparse matrix-vector multiplication and vector operations are implemented using CUDA libraries.
2. Experiments show the parallel GPU implementation significantly outperforms the serial CPU implementation in terms of time taken for convergence, especially for large web graphs. Faster convergence is also achieved for higher damping factors.
3. Periodic use of Aitken extrapolation can further accelerate convergence by refining the vectors between power iterations. Results show reductions in the number of iterations required for convergence compared to the standard power method.
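To make point 2 concrete, below is a minimal CUDA sketch of the core sparse matrix-vector product (one thread per row over a CSR matrix). This is a hand-written stand-in for illustration only: the paper itself relies on CUSPARSE for SpMV and CUBLAS for vector operations rather than custom kernels, and the kernel, its parameter names, the launch configuration, and the toy CSR data here are all assumptions.

#include <cstdio>
#include <cuda_runtime.h>

// One thread per row: y = M * x, with M stored in CSR form
// (rowPtr, colIdx, val are the usual CSR arrays; n is the number of rows).
__global__ void spmvCsr(int n, const int *rowPtr, const int *colIdx,
                        const float *val, const float *x, float *y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n) {
    float sum = 0.0f;
    for (int i = rowPtr[row]; i < rowPtr[row + 1]; ++i)
      sum += val[i] * x[colIdx[i]];
    y[row] = sum;
  }
}

int main() {
  // Toy 3x3 sparse matrix in CSR form (hypothetical example data).
  const int n = 3, nnz = 4;
  int   hRowPtr[] = {0, 1, 3, 4};
  int   hColIdx[] = {1, 0, 2, 1};
  float hVal[]    = {0.5f, 1.0f, 1.0f, 0.5f};
  float hX[]      = {1.0f / 3, 1.0f / 3, 1.0f / 3}, hY[n];

  int *dRowPtr, *dColIdx; float *dVal, *dX, *dY;
  cudaMalloc(&dRowPtr, (n + 1) * sizeof(int));
  cudaMalloc(&dColIdx, nnz * sizeof(int));
  cudaMalloc(&dVal, nnz * sizeof(float));
  cudaMalloc(&dX, n * sizeof(float));
  cudaMalloc(&dY, n * sizeof(float));
  cudaMemcpy(dRowPtr, hRowPtr, (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(dColIdx, hColIdx, nnz * sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(dVal, hVal, nnz * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dX, hX, n * sizeof(float), cudaMemcpyHostToDevice);

  spmvCsr<<<(n + 255) / 256, 256>>>(n, dRowPtr, dColIdx, dVal, dX, dY);
  cudaMemcpy(hY, dY, n * sizeof(float), cudaMemcpyDeviceToHost);
  for (int i = 0; i < n; ++i) printf("y[%d] = %f\n", i, hY[i]);

  cudaFree(dRowPtr); cudaFree(dColIdx); cudaFree(dVal);
  cudaFree(dX); cudaFree(dY);
  return 0;
}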
Cost Efficient PageRank Computation using GPU : NOTES
Cost Efficient PageRank Computation using GPU
Praveen K.∗, Vamshi Krishna K.†, Anil Sri Harsha B.‡, S. Balasubramanian, P.K. Baruah (∗†‡ Student Author)
Sri Satya Sai Institute of Higher Learning, Prasanthi Nilayam, India
{praveen.kkp, k.vamshi.krish, anilharsha.b}@gmail.com, {sbalasubramanian, pkbaruah}@sssihl.edu.in
Abstract—The PageRank algorithm for determining the "importance" of Web pages forms the core component of Google's search technology. As the Web graph is very large, containing over a billion nodes, PageRank is generally computed offline, during the preprocessing of the Web crawl, before any queries have been issued. Viewed mathematically, PageRank is nothing but the principal Eigenvector of a sparse matrix, which can be computed using any of the standard iterative methods. In this particular work, we attempt to parallelize the famous iterative method, the Power method, and its variant obtained through Aitken extrapolation, for computing PageRank using CUDA. From our findings, we conclude that our parallel implementation of the PageRank algorithm is highly cost effective not only in terms of the time taken for convergence, but also in terms of the number of iterations for higher values of the damping factor.

Index Terms—PageRank, Principal Eigenvector, Eigenvalue, Extrapolation, Markov Matrix, CUDA, CUBLAS, CUSPARSE.
I. INTRODUCTION

The evolution of GPU technology has outperformed the traditional Parallel Programming paradigms, introducing a new phase in the field of scientific computing. GPU based solutions have outperformed the corresponding sequential implementations by leaps and bounds. Particularly, for scientific computing applications, they provide a very valuable resource to parallelize and achieve optimum performance.

Determining a page's relevance to query terms is a complex problem for a search engine, and this decides how good a search engine is. Google solves this complex problem through its PageRank [1][2] algorithm, which quantitatively rates the importance of each page on the web, allowing it to rank the pages and thereby present to the user the more important (and typically most relevant and helpful) pages first.
As the Web graph is very large, containing over a billion nodes, PageRank is generally computed offline, during the preprocessing of the Web crawl, before any queries have been issued [3]. In this particular work, we parallelized a customized serial implementation of the power method and a variant of it obtained through Aitken extrapolation [4] for computing PageRank.

The core component in the power method for computing PageRank is the Sparse Matrix-Vector product (SpMV), along with certain vector-vector operations. The parallel implementation was done using the CUDA programming model, developed by NVIDIA. In this parallel implementation, we exploited the power of CUBLAS [5] and CUSPARSE [6] routines for the same. From our results, it is evident that the parallel implementation outperforms the serial equivalent in terms of both factors, viz. convergence time and number of iterations, for specific values of the damping factor (α). We validated our work on publicly available datasets [7][8], which are matrices that represent web graphs of sizes varying from 884 x 884 to 281903 x 281903.
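To make the vector-operation side concrete, here is a minimal sketch (not the authors' code) of how the per-iteration convergence check δ = ||x̄_{k+1} − x̄_k||_1 can be expressed with CUBLAS calls, assuming the rank vectors already reside on the device; the helper name l1Difference, the scratch-vector approach, and all variable names are my own assumptions.

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Computes delta = ||xNew - xOld||_1 on the device using CUBLAS.
// A scratch vector is used so that xNew and xOld are left untouched.
float l1Difference(cublasHandle_t handle, int n,
                   const float *dXNew, const float *dXOld, float *dScratch) {
  const float minusOne = -1.0f;
  float delta = 0.0f;
  cublasScopy(handle, n, dXNew, 1, dScratch, 1);            // d = xNew
  cublasSaxpy(handle, n, &minusOne, dXOld, 1, dScratch, 1); // d = xNew - xOld
  cublasSasum(handle, n, dScratch, 1, &delta);              // delta = ||d||_1
  return delta;
}

int main() {
  const int n = 4;
  float hNew[] = {0.30f, 0.25f, 0.25f, 0.20f};
  float hOld[] = {0.25f, 0.25f, 0.25f, 0.25f};

  float *dNew, *dOld, *dScratch;
  cudaMalloc(&dNew, n * sizeof(float));
  cudaMalloc(&dOld, n * sizeof(float));
  cudaMalloc(&dScratch, n * sizeof(float));
  cudaMemcpy(dNew, hNew, n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(dOld, hOld, n * sizeof(float), cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  printf("delta = %f\n", l1Difference(handle, n, dNew, dOld, dScratch));
  cublasDestroy(handle);

  cudaFree(dNew); cudaFree(dOld); cudaFree(dScratch);
  return 0;
}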
The remainder of this paper is organized as follows. Section II reviews the related work done in this area. Section III gives an overview of the power method and its extrapolation variant for computing PageRank. Section IV discusses the methodology used to parallelize the serial implementation. The experimental results obtained are discussed in Section V. Finally, we point to some conclusions and future work that can be undertaken in Section VI.
II. RELATED WORK

Standard works in the literature for computing PageRank in parallel focused on exploiting the power of large cluster based computation. In general, the main focus of any parallel implementation of the PageRank algorithm was on reducing the communication overhead and on efficient problem partitioning. Zhu et al. [9] used an iterative aggregation and disaggregation method to effectively speed up the PageRank computation in a distributed environment. Gleich et al. [10] modeled the computation of PageRank as a linear system and evaluated the performance of a parallel implementation on a 70-node Beowulf cluster. A parallel implementation of the standard power method for computing PageRank was done by Wu et al. [11] on AMD GPUs using the OpenCL programming model; this work essentially concentrates on parallelizing the sparse matrix-vector product (SpMV) using the AMD GPUs. Another work in this regard, by Cevahir et al. [12], focused on extending a CPU cluster based parallel implementation onto a GPU cluster, using up to 100 GPU nodes.

Our work is an attempt at measuring performance gains in computational costs for calculating PageRank using the power method and its extrapolation variant on a single NVIDIA GPU, using the standard CUBLAS and CUSPARSE libraries. To the best of our knowledge, no work has been done on parallelizing PageRank computation using a single GPU.
III. POWER METHOD FOR COMPUTING PAGERANK
Finding an eigenvector of a given matrix is a numerical
linear algebra problem that can be solved using several
methods available in the literature. The power method is
often the method of choice because of its ability to compute
the dominant eigenpair. It is an apt choice for the PageRank
problem because we are interested only in the principal
eigenvector of the given matrix. Procedurally, it is an
iterative method that computes the vector x̄_{k+1} from x̄_k
as x̄_{k+1} = A x̄_k until x̄_{k+1} converges to the desired
tolerance.
When the power method converges, the vector x̄_{k+1} is the
eigenvector of the given matrix corresponding to the dominant
eigenvalue (which, in this case, is 1).
The power method for computing PageRank is given as
Algorithm 1. Here, δ is the 1-norm difference between the
vectors of the kth and (k+1)th iterations and ε is the
desired convergence threshold.

Data: matrix A, initial vector x̄_0
Result: x̄_{k+1}, the eigenvector of matrix A
repeat
    x̄_{k+1} = A x̄_k;
    δ = ||x̄_{k+1} − x̄_k||_1;
until δ < ε;
Algorithm 1: Power method for computing PageRank

Note that A = αP^T + (1 − α)E, where E is a stochastic
matrix introduced to model the complete behaviour of a random
surfer on the web. This reformulation is a consequence of the
fact that the sparse matrix P alone does not capture the
complete behaviour of a random surfer. It introduces a new
parameter α, known as the “damping factor”, whose value
ranges between 0 and 1. It has been shown that the damping
factor α determines the sub-dominant eigenvalue of the Google
matrix [13]. Observe that P is a sparse matrix, and the
sparsity is lost by taking this convex combination.
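To make this concrete, the following LaTeX sketch spells out the teleportation term under the standard rank-one choice of E; this specific form is an assumption on our part, since the text above only states that E is stochastic.

\[
  E = \bar{v}\,\bar{e}^{T}, \qquad \bar{v} = \tfrac{1}{n}\,\bar{e},
  \qquad
  A\bar{x} \;=\; \alpha P^{T}\bar{x} \;+\; (1-\alpha)\,(\bar{e}^{T}\bar{x})\,\bar{v}.
\]

Every column of E equals v̄, so E (and hence A) is completely dense. This is why A is never formed explicitly: only the sparse product P^T x̄ is computed, and the dense part is added back as a scalar multiple of v̄ (up to an additional correction for dangling pages, this is the wv̄ term of Algorithm 2 below).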
A. An Efficient Power method type Algorithm for the
PageRank problem
We are dealing with matrices that represent the hyperlink
structure of the web and that are therefore very large.
Forming A explicitly and applying the power method to it
would lead to memory overflow. This can be overcome by
restructuring the matrix-vector product Ax̄ as illustrated
in Algorithm 2.
Data: matrix P, initial vector x̄_0, damping factor α, personalization vector v̄
Result: x̄_{k+1}, the eigenvector of matrix P
repeat
    x̄_{k+1} = αP^T x̄_k;
    w = ||x̄_k||_1 − ||x̄_{k+1}||_1;
    x̄_{k+1} = x̄_{k+1} + w v̄;
    δ = ||x̄_{k+1} − x̄_k||_1;
until δ < ε;
Algorithm 2: Modified power method for computing PageRank
Here, v̄ is the personalization vector, taken to be (1/n)ē,
where ē is a column vector of all ones. The irreducibility
of A implies that the dominant eigenvalue is 1.
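As a reference point for the GPU version described later, the following is a minimal serial C sketch of Algorithm 2. It assumes P^T is already stored in CSR form; the array names and the helpers spmv_csr() and norm1() are illustrative and are not part of the original implementation (which is in Matlab).

#include <math.h>
#include <stdlib.h>

/* y = alpha * M * x, where M (here P^T) is stored in CSR form. */
static void spmv_csr(int n, const int *row_ptr, const int *col_idx,
                     const double *val, double alpha,
                     const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = row_ptr[i]; j < row_ptr[i + 1]; j++)
            sum += val[j] * x[col_idx[j]];
        y[i] = alpha * sum;
    }
}

/* 1-norm of a vector. */
static double norm1(int n, const double *x) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += fabs(x[i]);
    return s;
}

/* Algorithm 2: modified power method.  On entry x holds x̄_0; on exit it
 * holds the converged PageRank vector.  The personalization vector
 * v̄ = (1/n)ē is folded in directly below. */
void pagerank_power(int n, const int *row_ptr, const int *col_idx,
                    const double *val, double alpha, double eps, double *x) {
    double *y = malloc(n * sizeof(double));
    double delta;
    do {
        spmv_csr(n, row_ptr, col_idx, val, alpha, x, y);  /* y = αP^T x       */
        double w = norm1(n, x) - norm1(n, y);             /* lost probability */
        delta = 0.0;                                      /* mass             */
        for (int i = 0; i < n; i++) {
            y[i] += w / n;                                /* y = y + w v̄      */
            delta += fabs(y[i] - x[i]);                   /* δ = ||y − x||_1  */
            x[i] = y[i];
        }
    } while (delta >= eps);
    free(y);
}

Keeping the correction wv̄ as a scalar update of the iterate avoids ever forming the dense matrix A.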
B. Extrapolation Techniques for accelerating Conver-
gence of Power Method
Due to the sheer size of the web (over 50 billion links),
PageRank computation by the power method can take several
days to complete, so speeding up this computation is highly
desirable.
Although the power method is an apt choice for the PageRank
problem, its convergence slows considerably as α approaches 1.
Hence, there is a need for variants of the power method that
can accelerate the computation of the desired principal
eigenvector of the given matrix.
1) Aitken Extrapolation Technique: The Aitken Ex-
trapolation technique [4] for computing PageRank is
an extension of the basic power method in which the
component values of the intermediary vectors in the
power iterations are periodically pruned so that the
resulting vectors are refined in the direction of the desired
vector. The algorithm is given as Algorithm 3, which calls
the Aitken routine of Algorithm 4.
Data: matrix A, initial vector x̄_0
Result: x̄_{k+1}, the eigenvector of matrix A
repeat
    x̄_{k+1} = A x̄_k;
    δ = ||x̄_{k+1} − x̄_k||_1;
    periodically x̄_k = Aitken(x̄_{k−2}, x̄_{k−1}, x̄_k);
until δ < ε;
Algorithm 3: Aitken power method for computing PageRank
function x̄ = Aitken(x̄_{k−2}, x̄_{k−1}, x̄_k)
for i ← 1 to n do
    g_i = (x_i^{(k−1)} − x_i^{(k−2)})^2;
    h_i = x_i^{(k)} − 2x_i^{(k−1)} + x_i^{(k−2)};
    x_i = x_i^{(k)} − g_i / h_i;
end
Algorithm 4: The Aitken routine
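A direct C rendering of Algorithm 4 might look like the following sketch. The guard against a (near-)zero second difference h_i is our own addition and an assumption about how degenerate components should be handled; the original routine does not specify this.

#include <math.h>

/* Componentwise Aitken extrapolation of the three most recent power
 * iterates xkm2 = x̄_{k−2}, xkm1 = x̄_{k−1}, xk = x̄_k.  The extrapolated
 * vector overwrites xk. */
void aitken(int n, const double *xkm2, const double *xkm1, double *xk) {
    for (int i = 0; i < n; i++) {
        double g = (xkm1[i] - xkm2[i]) * (xkm1[i] - xkm2[i]);
        double h = xk[i] - 2.0 * xkm1[i] + xkm2[i];
        if (fabs(h) > 1e-30)            /* skip components whose second  */
            xk[i] -= g / h;             /* difference is (nearly) zero   */
    }
}

In Algorithm 3 this routine would be invoked only periodically, e.g. once every few power iterations, on copies of the three most recent iterates; the exact period is a tunable choice and is not fixed by the algorithm.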
2) Operation Count: For an extrapolation method such as
Aitken extrapolation to be useful, its overhead, i.e., the
cost in addition to that of the power iterations used to
generate the iterates, should be minimal. The operation count
of the loop in Algorithm 4 is O(n), where n is the number of
pages on the web. The operation count of one extrapolation
step is therefore less than that of a single iteration of the
power method, and since Aitken extrapolation is applied only
periodically, its overhead is minimal. In the implementation,
the additional cost of each application of Aitken
extrapolation is about 1% of the cost of a single
power-method iteration, which is negligible.
IV. SCOPE FOR PARALLELIZATION
The power method for computing PageRank is highly
parallelizable since most of the computation involves
Sparse Matrix-Vector product (SpMV) and vector-vector
operations. Since the matrix under consideration has
a sparse structure, we used the CUSPARSE library
provided by NVIDIA to store the matrix and perform
the corresponding operations. The sparse matrix is con-
structed by reading an adjacency list, which represents
the nonzero values in the matrix.
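To illustrate this construction step, the following sketch builds the CSR arrays of P^T from a list of directed edges. The edge-list representation and the normalization of each entry by the source page's out-degree are assumptions on our part; the actual input format of the datasets may differ.

#include <stdlib.h>

/* Build a CSR representation of P^T from nnz directed edges
 * src[e] -> dst[e]; entry (dst, src) receives the value 1/outdeg(src).
 * row_ptr has length n+1; col_idx and val have length nnz. */
void build_csr_pt(int n, int nnz, const int *src, const int *dst,
                  int *row_ptr, int *col_idx, double *val) {
    int *outdeg = calloc(n, sizeof(int));
    int *next   = malloc(n * sizeof(int));

    for (int e = 0; e < nnz; e++) outdeg[src[e]]++;

    /* Count entries per row of P^T (one row per destination page),
     * then turn the counts into row offsets by a prefix sum. */
    for (int i = 0; i <= n; i++) row_ptr[i] = 0;
    for (int e = 0; e < nnz; e++) row_ptr[dst[e] + 1]++;
    for (int i = 0; i < n; i++) row_ptr[i + 1] += row_ptr[i];

    /* Scatter column indices and values into their rows. */
    for (int i = 0; i < n; i++) next[i] = row_ptr[i];
    for (int e = 0; e < nnz; e++) {
        int pos = next[dst[e]]++;
        col_idx[pos] = src[e];
        val[pos] = 1.0 / outdeg[src[e]];
    }

    free(outdeg);
    free(next);
}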
The operations involved in the power method for computing
PageRank and the equivalent CUSPARSE/CUBLAS library calls
used in our implementation can be summarized as follows; a
combined sketch is given after the list.
• ȳ = αP^T x̄ → cusparseDcsrmv(), to perform the product of
the sparse matrix P^T with the dense vector x̄
• ||x̄||_1 and ||ȳ||_1 → cublasDasum(), to calculate the
1-norms of the vectors (here, both x̄ and ȳ are dense)
• ȳ = ȳ + wv̄ → cublasDaxpy(), to update the vector ȳ, a
simple axpy operation
• δ = ||ȳ − x̄||_1 → a simple custom GPU kernel to calculate δ
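Putting these calls together, one iteration of Algorithm 2 on the device can be sketched as below. This is a simplified sketch: it assumes the device arrays are already populated, uses the CUBLAS v2 API and the legacy cusparseDcsrmv() interface (exact signatures vary across CUDA toolkit versions), and omits all error checking; the function and variable names are illustrative.

#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <cusparse.h>

/* Custom kernel: elementwise absolute difference d[i] = |y[i] - x[i]|,
 * whose 1-norm (via cublasDasum) gives δ = ||ȳ − x̄||_1. */
__global__ void abs_diff(int n, const double *x, const double *y, double *d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = fabs(y[i] - x[i]);
}

/* One iteration of the modified power method (Algorithm 2) on the GPU.
 * d_val/d_rowptr/d_colidx hold P^T in CSR form on the device; d_x, d_y,
 * d_v and d_tmp are device vectors of length n.  Returns δ. */
double power_iteration(cublasHandle_t cb, cusparseHandle_t cs,
                       cusparseMatDescr_t descr, int n, int nnz, double alpha,
                       const double *d_val, const int *d_rowptr,
                       const int *d_colidx, double *d_x, double *d_y,
                       const double *d_v, double *d_tmp) {
    const double zero = 0.0;
    double nx, ny, w, delta;

    /* ȳ = αP^T x̄  (SpMV via CUSPARSE) */
    cusparseDcsrmv(cs, CUSPARSE_OPERATION_NON_TRANSPOSE, n, n, nnz,
                   &alpha, descr, d_val, d_rowptr, d_colidx, d_x, &zero, d_y);

    /* w = ||x̄||_1 − ||ȳ||_1  (1-norms via CUBLAS) */
    cublasDasum(cb, n, d_x, 1, &nx);
    cublasDasum(cb, n, d_y, 1, &ny);
    w = nx - ny;

    /* ȳ = ȳ + wv̄  (axpy via CUBLAS) */
    cublasDaxpy(cb, n, &w, d_v, 1, d_y, 1);

    /* δ = ||ȳ − x̄||_1  (custom kernel + cublasDasum) */
    abs_diff<<<(n + 255) / 256, 256>>>(n, d_x, d_y, d_tmp);
    cublasDasum(cb, n, d_tmp, 1, &delta);

    /* x̄ ← ȳ for the next iteration */
    cublasDcopy(cb, n, d_y, 1, d_x, 1);
    return delta;
}

The host loop simply calls such a function until δ < ε; the personalization vector d_v and the scratch buffer d_tmp are allocated once, so every iteration consists of exactly the four operations listed above.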
A. Experimental Setup
For the experiment, our parallelized version of the power
method was run on a Tesla T20-based “Fermi” GPU. A serial
implementation of the same was run in Matlab R2010a on an
Intel(R) Core(TM)2 Duo 3.00 GHz processor with 3 GB RAM.
B. Evaluation Dataset
The performance of the parallel implementation of the power
method for computing PageRank and its extrapolation variant
is evaluated on publicly available datasets [7] [8]. We also
evaluated the performance on the Stanford.edu dataset, which
consists of 281903 pages and 2382912 links. The link
structure is represented as an adjacency list, which is read
as input to construct the sparse matrix P. Table I summarizes
the details of the datasets used in our experimentation.
TABLE I: DATASET DESCRIPTION

Name                          | Size
------------------------------|-----------------
Computational Complexity (CC) | 884 × 884
Abortion                      | 2292 × 2292
Genetic                       | 3468 × 3468
Stanford.edu                  | 281903 × 281903
Fig. 1. Execution time for α = 0.75 on stanford.edu dataset
TABLE II: SAMPLE RESULTS FOR EXTRAPOLATION ON GENETIC DATASET

α    | ε     | Reduction in It (Serial) | Reduction in It (GPU)
-----|-------|--------------------------|----------------------
0.75 | 1e-3  | 0                        | 1
0.75 | 1e-8  | 3                        | 3
0.75 | 1e-10 | 2                        | 2
0.75 | 1e-12 | 1                        | 2
0.80 | 1e-3  | 0                        | 0
0.80 | 1e-8  | 1                        | 4
0.80 | 1e-10 | 4                        | 4
0.80 | 1e-12 | 3                        | 3
0.99 | 1e-3  | 112                      | 114
0.99 | 1e-8  | 116                      | 129
0.99 | 1e-10 | 126                      | 130
TABLE III: SAMPLE RESULTS FOR EXTRAPOLATION ON CC DATASET

α    | ε     | Reduction in It (Serial) | Reduction in It (GPU)
-----|-------|--------------------------|----------------------
0.85 | 1e-3  | 3                        | 4
0.85 | 1e-8  | 4                        | 5
0.85 | 1e-10 | 4                        | 5
0.85 | 1e-12 | 5                        | 5
0.95 | 1e-3  | 4                        | 7
0.95 | 1e-8  | 5                        | 7
0.95 | 1e-10 | 6                        | 7
0.95 | 1e-12 | 7                        | 7
V. RESULTS¹
In this section, we quantify the performance of our parallel
implementation of the power method and its extrapolation
variant on datasets of varying sizes. The performance is
evaluated for varying values of α and of the tolerance ε. In
our experiments, α ranged from 0.75 to 0.99 and ε ranged from
1e−3 to 1e−15. Sample results for the Genetic and
Computational Complexity (CC) datasets are tabulated in
Tables II and III, respectively. Here, “It” refers to the
total number of iterations taken for the power method to
converge and “Ti” refers to the time taken for convergence in
seconds.
From our findings, we observed substantial performance gains
on all the datasets considered for experimentation, most
pronounced in the case of the Stanford.edu dataset. As
expected, the parallel implementation outperformed the serial
implementation in terms of time taken for convergence on all
the datasets. We also found a potential gain in terms of the
number of iterations for convergence as α became closer to 1
and for tighter tolerances. For instance, for α = 0.99 and
ε = 1e-15, the serial implementation converged in 18,234
iterations, whereas the parallel implementation took merely
3013 iterations, i.e., only about 16.52% of the serial
iteration count.

¹ Due to page limit constraints, we present only a sample of
our obtained results.

Fig. 2. Execution time for α = 0.85 on stanford.edu dataset
Fig. 3. Execution time for α = 0.90 on stanford.edu dataset
We also obtained positive results for Aitken extrapolation,
measured as the reduction in the number of iterations taken
for convergence; these findings are summarized in Tables II
and III. Specifically, for the Genetic dataset with α = 0.99
and varying ε, the reduction in iterations achieved by the
parallel implementation exceeded that of the serial
implementation by about 6 on average.
VI. CONCLUSIONS AND FUTURE WORK
PageRank computation, which forms the core component of
Google’s search technology, has been successfully parallelized
using the CUDA programming model on a single GPU. We performed
an extensive evaluation of our parallel implementation for
different input parameter values and on datasets of varying
sizes. From our results, it is clearly evident that our
approach not only reduces the time involved in computing
PageRank but also reduces the number of iterations required
for convergence.
Fig. 4. Execution time for α = 0.95 on stanford.edu dataset
Fig. 5. Execution time for α = 0.99 on stanford.edu dataset
We intend to extend this parallel implementation to larger
datasets such as stanford-berkeley.edu [14], which better
reflect the size of the World Wide Web. All the datasets
considered for experimentation are large but still fit within
the limits of the GPU memory. The single-GPU approach for
computing PageRank may not scale to larger matrices, as the
on-board GPU memory is a major constraint; we will therefore
look for alternatives to accommodate large-scale matrices. We
will also pursue higher performance gains through better
optimization techniques at various levels of the algorithm.
Though Aitken extrapolation gave positive results, it suffers
a performance loss because the pruning of the component
values of the iterates may not be optimal. In this regard, we
also intend to parallelize another variant of the power
method, obtained using the quadratic extrapolation technique,
which prunes the values in a more optimal way.
ACKNOWLEDGMENT
We would like to express our deepest sense of grati-
tude to our Founder Chancellor Bhagwan Sri Sathya Sai
Baba. Without His grace, this work would have been
impossible. We also acknowledge NVIDIA Pune division
for providing the GPU infrastructure and support.
Fig. 6. Number of iterations for convergence on stanford.edu dataset
(power method), ε = 1e-15
REFERENCES
[1] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Wino-
grad, “The pagerank citation ranking: Bringing order to the
web.,” Technical report, Stanford InfoLab, 1999.
[2] Kurt Bryan and Tanya Leise, “The $25,000,000,000 eigenvec-
tor: the linear algebra behind google,” SIAM Review, vol. 48,
pp. 569–581, 2006.
[3] Sergey Brin and Lawrence Page, “The anatomy of a large-scale
hypertextual web search engine,” Computer Networks and ISDN
Systems, vol. 30, no. 1-7, pp. 107 – 117, 1998, Proceedings of
the Seventh International World Wide Web Conference.
[4] Sepandar Kamvar, Taher Haveliwala, Christopher Manning, and
Gene Golub, “Extrapolation methods for accelerating pagerank
computations,” in In Proceedings of the Twelfth International
World Wide Web Conference. 2003, pp. 261–270, ACM Press.
[5] CUDA CUBLAS Library, PG-05326-040V01, NVIDIA Corpo-
ration, April 2011.
[6] CUDA CUSPARSE Library, PG-05329-040V01, NVIDIA Cor-
poration, January 2011.
[7] Sepandar D. Kamvar, “Test datasets for link analysis algo-
rithms, http://kamvar.org/assets/data/stanford-web.tar.gz,” Ac-
cessed Online.
[8] University of Toronto Department of Computer Science,
“Datasets for evaluating link analysis algorithms,
www.cs.toronto.edu/~tsap/experiments/datasets/index.html,”
Accessed Online.
[9] Yangbo Zhu, Shaozhi Ye, and Xing Li, “Distributed PageR-
ank computation based on iterative aggregation-disaggregation
methods,” in Proceedings of the 14th ACM international
conference on Information and knowledge management.
[10] David Gleich, “Fast parallel PageRank: A linear system
approach,” Tech. Rep., 2004.
[11] Tianji Wu, Bo Wang, Yi Shan, Feng Yan, Yu Wang, and Ningyi
Xu, “Efficient PageRank and SpMV computation on AMD GPUs,”
Parallel Processing, International Conference on, vol. 0, pp.
81–89, 2010.
[12] Ali Cevahir, Cevdet Aykanat, Ata Turk, B. Barla Cambazoglu,
Akira Nukada, and Satoshi Matsuoka, “Efficient PageRank
on GPU clusters,” 18th Workshop for the evaluation of high-
performance computing architecture, 2009.
[13] Taher H. Haveliwala and Sepandar D. Kamvar,
“The second eigenvalue of the google matrix,” 2003.
[14] Sepandar Kamvar, Taher Haveliwala, and Gene Golub, “Adap-
tive methods for the computation of PageRank,” Linear Algebra
and its Applications, vol. 386, no. 0, pp. 51 – 65, 2004.