Basic Computer Architecture and the Case for GPUs : NOTES

•

0 likes•89 views

Computer architectures are facing issues: Memory latencies are far higher. Benefits from instruction level parallelism (ILP) is reducing. With increasing clock rates, power consumption is increasing. Increasing complexity with multi-stage pipelines, intermediate buffers, multi-level caches, out-of-order execution, branch prediction, ... GPUs are parallel computer architectures that are good at some tasks, not so good at others. Running routines with high arithmetic intensity with overlapped memory access is the preferred approach. They may be unsuitable for irregular algorithms, where it is difficult to get high efficiency due to the high latency of accesses. They are less versatile compared to CPUs, using SIMD parallelism, and are dense compute-wise (per currency). NVIDIA's CUDA programming model enables GPUs to be used for general-purpose computing, and hence the term GPGPU. GPU Architectural, Programming, and Performance Models presentation at PPoPP, 2010, Bangalore, India. By Prof. Kishore Kothapalli with Prof. P. J. Narayanan and Suryakant Patidar. https://gist.github.com/wolfram77/43a6660121eef45b78c10d4e652dad6c

Devices & Hardware

PPoPP 2010, Bangalore, India.
Basic Architecture Concepts
CPU Architecture
4 stages of instruction execution
Too many cycles per instruction (CPI)
Fetch Decode Execute Write
t=1 2 3 4 5

PPoPP 2010, Bangalore, India.
Basic Architecture Concepts
CPU Architecture
4 stages of instruction execution
Too many cycles per instruction (CPI)
To reduce the CPI, introduce pipelined execution
Fetch Decode Execute Write
Fetch Decode Execute Write
Fetch Decode Execute Write
Fetch Decode Execute Write
t=1 2 3 4 5

PPoPP 2010, Bangalore, India.
Basic Architecture Concepts
CPU Architecture
4 stages of instruction execution
Too many cycles per instruction (CPI)
To reduce the CPI, introduce pipelined execution
Needs buffers to store results across stages.
A cache to handle slow memory access times
Fetch Decode Execute Write
Fetch Decode Execute Write
Fetch Decode Execute Write
Fetch Decode Execute Write
t=1 2 3 4 5
Cache

PPoPP 2010, Bangalore, India.
Basic Architecture Concepts
Fetch Decode Execute Write
Fetch Decode Execute Write
Fetch Decode Execute Write
Fetch Decode Execute Write
t=1 2 3 4 5
Cache
CPU Architecture
4 stages of instruction execution
Too many cycles per instruction (CPI)
To reduce the CPI, introduce pipelined execution
Needs buffers to store results across stages.
A cache to handle slow memory access times
Multilevel caches, outoforder execution, branch prediction, ...

PPoPP 2010, Bangalore, India.
Basic Architecture Concepts
CPU architecture getting too complex.
Not translating to equivalent performance
benefits
Need a rethink on traditional CPU architectures.

PPoPP 2010, Bangalore, India.
Basic Architecture Concepts
Couple with this the new wisdom in computer
architectures.
 Memory Wall – memory latencies far higher
 ILP Wall – Reducing benefits from instruction
level parallelism
 Power Wall – Increase in power consumption
with increase in clock rates.
Multicore is the way forward
Ex: GPUs, Cell, Intel Quad core, ...
Predicted that 100+ core computers would be a
reality soon.

PPoPP 2010, Bangalore, India.
Multicore and Manycore Processors
IBM Cell
NVidia GeForce 8800 includes 128 scalar processors
and Tesla
Sun T1 and T2
Tilera Tile64
Picochip combines 430 simple RISC cores
Cisco 188
TRIPS

PPoPP 2010, Bangalore, India.
The Case for the GPUs
GPUs are now common. They also have high computing
power per dollar, compared to the CPU
Today’s computer system has a CPU and a GPU, with the
GPU being used primarily for graphics.
GPUs are good at some tasks and not so good at others.
They are especially good at processing large data such as
images.
Let us use the right processor for the right task.
Goal: Increase the overall throughput of the computer system
on the given task. Use CPU and GPU synergistically.

PPoPP 2010, Bangalore, India.
Evolution of GPUs
Graphics: a few hundred triangles/vertices map to afewhundred
thousand pixels
Process pixels in parallel. Do the same thing on a large number of
different items.
Data parallel model : parallelism provided by the data
Thousands to millions of data elements
Same program/instruction on all of them
Hardware: 816 cores to process vertices and 64128 to process
pixels by 2005
Less versatile than CPU cores
SIMD mode of computations. Less hardware for instruction issue
No caching, branch prediction, out of order execution, etc.
‐ ‐
Can pack more cores in same silicon die area

PPoPP 2010, Bangalore, India.
GPUs as a Case Study
GPGPU – General Purpose Programming on
GPUs
OpenGL extensions
Very difficult to program
Recently manufactures started supporting Clike

programming abstraction to program GPUs
CUDA from NVidia
Other benefits of GPGPU
Affordable cost, easy availability, computational
power

PPoPP 2010, Bangalore, India.
GPUs as a Case Study
GPUs suited for routines with high arithmetic
intensity.
One feature is high memory latency, depending
on the nature of access.
Should overlap memory with arithmetic.

PPoPP 2010, Bangalore, India.
CPU Vs GPU
Few powerful cores Vs. lots of small cores

GPUs: For good performance, applications
need high arithmetic intensity
GPUs : No system managed cache.

PPoPP 2010, Bangalore, India.
GPGPU as a Case Study
Regular algorithms
Map well to data parallel model of GPUs
Each work item operates by itself or with a few
neighbors
Example settings : image processing.
Threads can share data, e.g., apron pixels in an
image processing kernel.

PPoPP 2010, Bangalore, India.
GPU as a Case Study
Irregular algorithms
Applications with data accesses that are not regular
in nature.
Occurs in settings such as graph algorithms, data
structures building, etc.
Difficult to get high efficiency due to high memory
latency of accesses.

PPoPP 2010, Bangalore, India.
GPGPU Tools and APIs
OpenGL
CUDA
OpenCL
Brook

Recommended

About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...

About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...

About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...

TrueTime is a service that enables the use of globally synchronized clocks, with bounded error. It returns a time interval that is guaranteed to contain the clock’s actual time for some time during the call’s execution. If two intervals do not overlap, then we know calls were definitely ordered in real time. In general, synchronized clocks can be used to avoid communication in a distributed system. The underlying source of time is a combination of GPS receivers and atomic clocks. As there are “time masters” in every datacenter (redundantly), it is likely that both sides of a partition would continue to enjoy accurate time. Individual nodes however need network connectivity to the masters, and without it their clocks will drift. Thus, during a partition their intervals slowly grow wider over time, based on bounds on the rate of local clock drift. Operations depending on TrueTime, such as Paxos leader election or transaction commits, thus have to wait a little longer, but the operation still completes (assuming the 2PC and quorum communication are working).

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...

Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.

Adjusting Bitset for graph : SHORT REPORT / NOTES

Adjusting Bitset for graph : SHORT REPORT / NOTES

Adjusting Bitset for graph : SHORT REPORT / NOTES

Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is commonly used for efficient graph computations. Unfortunately, using CSR for dynamic graphs is impractical since addition/deletion of a single edge can require on average (N+M)/2 memory accesses, in order to update source-offsets and destination-indices. A common approach is therefore to store edge-lists/destination-indices as an array of arrays, where each edge-list is an array belonging to a vertex. While this is good enough for small graphs, it quickly becomes a bottleneck for large graphs. What causes this bottleneck depends on whether the edge-lists are sorted or unsorted. If they are sorted, checking for an edge requires about log(E) memory accesses, but adding an edge on average requires E/2 accesses, where E is the number of edges of a given vertex. Note that both addition and deletion of edges in a dynamic graph require checking for an existing edge, before adding or deleting it. If edge lists are unsorted, checking for an edge requires around E/2 memory accesses, but adding an edge requires only 1 memory access.

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...

Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.

Adjusting primitives for graph : SHORT REPORT / NOTES

Adjusting primitives for graph : SHORT REPORT / NOTES

Adjusting primitives for graph : SHORT REPORT / NOTES

Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is Multiply with different modes (map) 1. Performance of sequential execution based vs OpenMP based vector multiply. 2. Comparing various launch configs for CUDA based vector multiply. Sum with different storage types (reduce) 1. Performance of vector element sum using float vs bfloat16 as the storage type. Sum with different modes (reduce) 1. Performance of sequential execution based vs OpenMP based vector element sum. 2. Performance of memcpy vs in-place based CUDA based vector element sum. 3. Comparing various launch configs for CUDA based vector element sum (memcpy). 4. Comparing various launch configs for CUDA based vector element sum (in-place). Sum with in-place strategies of CUDA mode (reduce) 1. Comparing various launch configs for CUDA based vector element sum (in-place).

Experiments with Primitive operations : SHORT REPORT / NOTES

Experiments with Primitive operations : SHORT REPORT / NOTES

Experiments with Primitive operations : SHORT REPORT / NOTES

This includes: - Multiply with different modes (map) 1. Performance of sequential execution based vs OpenMP based vector multiply. 2. Comparing various launch configs for CUDA based vector multiply. - Sum with different storage types (reduce) 1. Performance of vector element sum using float vs bfloat16 as the storage type. - Sum with different modes (reduce) 1. Performance of sequential execution based vs OpenMP based vector element sum. 2. Performance of memcpy vs in-place based CUDA based vector element sum. 3. Comparing various launch configs for CUDA based vector element sum (memcpy). 4. Comparing various launch configs for CUDA based vector element sum (in-place). - Sum with in-place strategies of CUDA mode (reduce) 1. Comparing various launch configs for CUDA based vector element sum (in-place).

PageRank Experiments : SHORT REPORT / NOTES

PageRank Experiments : SHORT REPORT / NOTES

PageRank Experiments : SHORT REPORT / NOTES

This includes: - Adjusting data types for rank vector - Adjusting Pagerank parameters - Adjusting Sequential approach - Adjusting OpenMP approach - Comparing sequential approach - Adjusting Monolithic (Sequential) optimizations (from STICD) - Adjusting Levelwise (STICD) approach - Comparing Levelwise (STICD) approach - Adjusting ranks for dynamic graphs - Adjusting Levelwise (STICD) dynamic approach - Comparing dynamic approach with static - Adjusting Monolithic CUDA approach - Adjusting Monolithic CUDA optimizations (from STICD) - Adjusting Levelwise (STICD) CUDA approach - Comparing Levelwise (STICD) CUDA approach - Comparing dynamic CUDA approach with static - Comparing dynamic optimized CUDA approach with static

Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...

Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...

Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...

Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.

Recommended

About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...

About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...

About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...

TrueTime is a service that enables the use of globally synchronized clocks, with bounded error. It returns a time interval that is guaranteed to contain the clock’s actual time for some time during the call’s execution. If two intervals do not overlap, then we know calls were definitely ordered in real time. In general, synchronized clocks can be used to avoid communication in a distributed system. The underlying source of time is a combination of GPS receivers and atomic clocks. As there are “time masters” in every datacenter (redundantly), it is likely that both sides of a partition would continue to enjoy accurate time. Individual nodes however need network connectivity to the masters, and without it their clocks will drift. Thus, during a partition their intervals slowly grow wider over time, based on bounds on the rate of local clock drift. Operations depending on TrueTime, such as Paxos leader election or transaction commits, thus have to wait a little longer, but the operation still completes (assuming the 2PC and quorum communication are working).

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...

Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...

Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.

Adjusting Bitset for graph : SHORT REPORT / NOTES

Adjusting Bitset for graph : SHORT REPORT / NOTES

Adjusting Bitset for graph : SHORT REPORT / NOTES

Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is commonly used for efficient graph computations. Unfortunately, using CSR for dynamic graphs is impractical since addition/deletion of a single edge can require on average (N+M)/2 memory accesses, in order to update source-offsets and destination-indices. A common approach is therefore to store edge-lists/destination-indices as an array of arrays, where each edge-list is an array belonging to a vertex. While this is good enough for small graphs, it quickly becomes a bottleneck for large graphs. What causes this bottleneck depends on whether the edge-lists are sorted or unsorted. If they are sorted, checking for an edge requires about log(E) memory accesses, but adding an edge on average requires E/2 accesses, where E is the number of edges of a given vertex. Note that both addition and deletion of edges in a dynamic graph require checking for an existing edge, before adding or deleting it. If edge lists are unsorted, checking for an edge requires around E/2 memory accesses, but adding an edge requires only 1 memory access.

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...

Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...

Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.

Adjusting primitives for graph : SHORT REPORT / NOTES

Adjusting primitives for graph : SHORT REPORT / NOTES

Adjusting primitives for graph : SHORT REPORT / NOTES

Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is Multiply with different modes (map) 1. Performance of sequential execution based vs OpenMP based vector multiply. 2. Comparing various launch configs for CUDA based vector multiply. Sum with different storage types (reduce) 1. Performance of vector element sum using float vs bfloat16 as the storage type. Sum with different modes (reduce) 1. Performance of sequential execution based vs OpenMP based vector element sum. 2. Performance of memcpy vs in-place based CUDA based vector element sum. 3. Comparing various launch configs for CUDA based vector element sum (memcpy). 4. Comparing various launch configs for CUDA based vector element sum (in-place). Sum with in-place strategies of CUDA mode (reduce) 1. Comparing various launch configs for CUDA based vector element sum (in-place).

Experiments with Primitive operations : SHORT REPORT / NOTES

Experiments with Primitive operations : SHORT REPORT / NOTES

Experiments with Primitive operations : SHORT REPORT / NOTES

This includes: - Multiply with different modes (map) 1. Performance of sequential execution based vs OpenMP based vector multiply. 2. Comparing various launch configs for CUDA based vector multiply. - Sum with different storage types (reduce) 1. Performance of vector element sum using float vs bfloat16 as the storage type. - Sum with different modes (reduce) 1. Performance of sequential execution based vs OpenMP based vector element sum. 2. Performance of memcpy vs in-place based CUDA based vector element sum. 3. Comparing various launch configs for CUDA based vector element sum (memcpy). 4. Comparing various launch configs for CUDA based vector element sum (in-place). - Sum with in-place strategies of CUDA mode (reduce) 1. Comparing various launch configs for CUDA based vector element sum (in-place).

PageRank Experiments : SHORT REPORT / NOTES

PageRank Experiments : SHORT REPORT / NOTES

PageRank Experiments : SHORT REPORT / NOTES

This includes: - Adjusting data types for rank vector - Adjusting Pagerank parameters - Adjusting Sequential approach - Adjusting OpenMP approach - Comparing sequential approach - Adjusting Monolithic (Sequential) optimizations (from STICD) - Adjusting Levelwise (STICD) approach - Comparing Levelwise (STICD) approach - Adjusting ranks for dynamic graphs - Adjusting Levelwise (STICD) dynamic approach - Comparing dynamic approach with static - Adjusting Monolithic CUDA approach - Adjusting Monolithic CUDA optimizations (from STICD) - Adjusting Levelwise (STICD) CUDA approach - Comparing Levelwise (STICD) CUDA approach - Comparing dynamic CUDA approach with static - Comparing dynamic optimized CUDA approach with static

Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...

Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...

Algorithmic optimizations for Dynamic Monolithic PageRank (from STICD) : SHOR...

Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate pagerank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement pagerank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).

word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...

word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...

word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...

Below are the important points I note from the 2020 paper by Martin Grohe: - 1-WL distinguishes almost all graphs, in a probabilistic sense - Classical WL is two dimensional Weisfeiler-Leman - DeepWL is an unlimited version of WL graph that runs in polynomial time. - Knowledge graphs are essentially graphs with vertex/edge attributes ABSTRACT: Vector representations of graphs and relational structures, whether handcrafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view. Starting with a survey of embedding techniques that have been used in practice, in this paper we propose two theoretical approaches that we see as central for understanding the foundations of vector embeddings. We draw connections between the various approaches and suggest directions for future research.

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES

https://gist.github.com/wolfram77/54c4a14d9ea547183c6c7b3518bf9cd1 There exist a number of dynamic graph generators. Barbasi-Albert model iteratively attach new vertices to pre-exsiting vertices in the graph using preferential attachment (edges to high degree vertices are more likely - rich get richer - Pareto principle). However, graph size increases monotonically, and density of graph keeps increasing (sparsity decreasing). Gorke's model uses a defined clustering to uniformly add vertices and edges. Purohit's model uses motifs (eg. triangles) to mimick properties of existing dynamic graphs, such as growth rate, structure, and degree distribution. Kronecker graph generators are used to increase size of a given graph, with power-law distribution. To generate dynamic graphs, we must choose a metric to compare two graphs. Common metrics include diameter, clustering coefficient (modularity?), triangle counting (triangle density?), and degree distribution. In this paper, the authors propose Dygraph, a dynamic graph generator that uses degree distribution as the only metric. The authors observe that many real-world graphs differ from the power-law distribution at the tail end. To address this issue, they propose binning, where the vertices beyond a certain degree (minDeg = min(deg) s.t. |V(deg)| < H, where H~10 is the number of vertices with a given degree below which are binned) are grouped into bins of degree-width binWidth, max-degree localMax, and number of degrees in bin with at least one vertex binSize (to keep track of sparsity). This helps the authors to generate graphs with a more realistic degree distribution. The process of generating a dynamic graph is as follows. First the difference between the desired and the current degree distribution is calculated. The authors then create an edge-addition set where each vertex is present as many times as the number of additional incident edges it must recieve. Edges are then created by connecting two vertices randomly from this set, and removing both from the set once connected. Currently, authors reject self-loops and duplicate edges. Removal of edges is done in a similar fashion. Authors observe that adding edges with power-law properties dominates the execution time, and consider parallelizing DyGraph as part of future work.

Shared memory Parallelism (NOTES)

Shared memory Parallelism (NOTES)

Shared memory Parallelism (NOTES)

My notes on shared memory parallelism. Shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory [REF].

A Dynamic Algorithm for Local Community Detection in Graphs : NOTES

A Dynamic Algorithm for Local Community Detection in Graphs : NOTES

A Dynamic Algorithm for Local Community Detection in Graphs : NOTES

**Community detection methods** can be *global* or *local*. **Global community detection methods** divide the entire graph into groups. Existing global algorithms include: - Random walk methods - Spectral partitioning - Label propagation - Greedy agglomerative and divisive algorithms - Clique percolation https://gist.github.com/wolfram77/b4316609265b5b9f88027bbc491f80b6 There is a growing body of work in *detecting overlapping communities*. **Seed set expansion** is a **local community detection method** where a relevant *seed vertices* of interest are picked and *expanded to form communities* surrounding them. The quality of each community is measured using a *fitness function*. **Modularity** is a *fitness function* which compares the number of intra-community edges to the expected number in a random-null model. **Conductance** is another popular fitness score that measures the community cut or inter-community edges. Many *overlapping community detection* methods **use a modified ratio** of intra-community edges to all edges with atleast one endpoint in the community. Andersen et al. use a **Spectral PageRank-Nibble method** which minimizes conductance and is formed by adding vertices in order of decreasing PageRank values. Andersen and Lang develop a **random walk approach** in which some vertices in the seed set may not be placed in the final community. Clauset gives a **greedy method** that *starts from a single vertex* and then iteratively adds neighboring vertices *maximizing the local modularity score*. Riedy et al. **expand multiple vertices** via maximizing modularity. Several algorithms for **detecting global, overlapping communities** use a *greedy*, *agglomerative approach* and run *multiple separate seed set expansions*. Lancichinetti et al. run **greedy seed set expansions**, each with a *single seed vertex*. Overlapping communities are produced by a sequentially running expansions from a node not yet in a community. Lee et al. use **maximal cliques as seed sets**. Havemann et al. **greedily expand cliques**. The authors of this paper discuss a dynamic approach for **community detection using seed set expansion**. Simply marking the neighbours of changed vertices is a **naive approach**, and has *severe shortcomings*. This is because *communities can split apart*. The simple updating method *may fail even when it outputs a valid community* in the graph.

Scalable Static and Dynamic Community Detection Using Grappolo : NOTES

Scalable Static and Dynamic Community Detection Using Grappolo : NOTES

Scalable Static and Dynamic Community Detection Using Grappolo : NOTES

A **community** (in a network) is a subset of nodes which are _strongly connected among themselves_, but _weakly connected to others_. Neither the number of output communities nor their size distribution is known a priori. Community detection methods can be divisive or agglomerative. **Divisive methods** use _betweeness centrality_ to **identify and remove bridges** between communities. **Agglomerative methods** greedily **merge two communities** that provide maximum gain in _modularity_. Newman and Girvan have introduced the **modularity metric**. The problem of community detection is then reduced to the problem of modularity maximization which is **NP-complete**. **Louvain method** is a variant of the _agglomerative strategy_, in that is a _multi-level heuristic_. https://gist.github.com/wolfram77/917a1a4a429e89a0f2a1911cea56314d In this paper, the authors discuss **four heuristics** for Community detection using the _Louvain algorithm_ implemented upon recently developed **Grappolo**, which is a parallel variant of the Louvain algorithm. They are: - Vertex following and Minimum label - Data caching - Graph coloring - Threshold scaling With the **Vertex following** heuristic, the _input is preprocessed_ and all single-degree vertices are merged with their corresponding neighbours. This helps reduce the number of vertices considered in each iteration, and also help initial seeds of communities to be formed. With the **Minimum label heuristic**, when a vertex is making the decision to move to a community and multiple communities provided the same modularity gain, the community with the smallest id is chosen. This helps _minimize or prevent community swaps_. With the **Data caching** heuristic, community information is stored in a vector instead of a map, and is reused in each iteration, but with some additional cost. With the **Vertex ordering via Graph coloring** heuristic, _distance-k coloring_ of graphs is performed in order to group vertices into colors. Then, each set of vertices (by color) is processed _concurrently_, and synchronization is performed after that. This enables us to mimic the behaviour of the serial algorithm. Finally, with the **Threshold scaling** heuristic, _successively smaller values of modularity threshold_ are used as the algorithm progresses. This allows the algorithm to converge faster, and it has been observed a good modularity score as well. From the results, it appears that _graph coloring_ and _threshold scaling_ heuristics do not always provide a speedup and this depends upon the nature of the graph. It would be interesting to compare the heuristics against baseline approaches. Future work can include _distributed memory implementations_, and _community detection on streaming graphs_.

Application Areas of Community Detection: A Review : NOTES

Application Areas of Community Detection: A Review : NOTES

Application Areas of Community Detection: A Review : NOTES

This is a short review of Community detection methods (on graphs), and their applications. A **community** is a subset of a network whose members are *highly connected*, but *loosely connected* to others outside their community. Different community detection methods *can return differing communities* these algorithms are **heuristic-based**. **Dynamic community detection** involves tracking the *evolution of community structure* over time. https://gist.github.com/wolfram77/09e64d6ba3ef080db5558feb2d32fdc0 Communities can be of the following **types**: - Disjoint - Overlapping - Hierarchical - Local. The following **static** community detection **methods** exist: - Spectral-based - Statistical inference - Optimization - Dynamics-based The following **dynamic** community detection **methods** exist: - Independent community detection and matching - Dependent community detection (evolutionary) - Simultaneous community detection on all snapshots - Dynamic community detection on temporal networks **Applications** of community detection include: - Criminal identification - Fraud detection - Criminal activities detection - Bot detection - Dynamics of epidemic spreading (dynamic) - Cancer/tumor detection - Tissue/organ detection - Evolution of influence (dynamic) - Astroturfing - Customer segmentation - Recommendation systems - Social network analysis (both) - Network summarization - Privary, group segmentation - Link prediction (both) - Community evolution prediction (dynamic, hot field) <br> <br> ## References - [Application Areas of Community Detection: A Review : PAPER](https://ieeexplore.ieee.org/document/8625349)

Community Detection on the GPU : NOTES

Community Detection on the GPU : NOTES

Community Detection on the GPU : NOTES

This paper discusses a GPU implementation of the Louvain community detection algorithm. Louvain algorithm obtains hierachical communities as a dendrogram through modularity optimization. Given an undirected weighted graph, all vertices are first considered to be their own communities. In the first phase, each vertex greedily decides to move to the community of one of its neighbours which gives greatest increase in modularity. If moving to no neighbour's community leads to an increase in modularity, the vertex chooses to stay with its own community. This is done sequentially for all the vertices. If the total change in modularity is more than a certain threshold, this phase is repeated. Once this local moving phase is complete, all vertices have formed their first hierarchy of communities. The next phase is called the aggregation phase, where all the vertices belonging to a community are collapsed into a single super-vertex, such that edges between communities are represented as edges between respective super-vertices (edge weights are combined), and edges within each community are represented as self-loops in respective super-vertices (again, edge weights are combined). Together, the local moving and the aggregation phases constitute a stage. This super-vertex graph is then used as input fof the next stage. This process continues until the increase in modularity is below a certain threshold. As a result from each stage, we have a hierarchy of community memberships for each vertex as a dendrogram. Approaches to perform the Louvain algorithm can be divided into coarse-grained and fine-grained. Coarse-grained approaches process a set of vertices in parallel, while fine-grained approaches process all vertices in parallel. A coarse-grained hybrid-GPU algorithm using multi GPUs has be implemented by Cheong et al. which grabbed my attention. In addition, their algorithm does not use hashing for the local moving phase, but instead sorts each neighbour list based on the community id of each vertex. https://gist.github.com/wolfram77/7e72c9b8c18c18ab908ae76262099329

Survey for extra-child-process package : NOTES

Survey for extra-child-process package : NOTES

Survey for extra-child-process package : NOTES

Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER

Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER

Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER

Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...

Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...

Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...

Fast Incremental Community Detection on Dynamic Graphs : NOTES

Fast Incremental Community Detection on Dynamic Graphs : NOTES

Fast Incremental Community Detection on Dynamic Graphs : NOTES

In this paper, the authors describe two approaches for dynamic community detection using the CNM algorithm. CNM is a hierarchical, agglomerative algorithm that greedily maximizes modularity. They define two approaches: BasicDyn and FastDyn. BasicDyn backtracks merges of communities until each marked (changed) vertex is its own singleton community. FastDyn undoes a merge only if the quality of merge, as measured by the induced change in modularity, has significantly decreased compared to when the merge initially took place. FastDyn also allows more than two vertices to contract together if in the previous time step these vertices eventually ended up contracted in the same community. In the static case, merging several vertices together in one contraction phase could lead to deteriorating results. FastDyn is able to do this, however, because it uses information from the merges of the previous time step. Intuitively, merges that previously occurred are more likely to be acceptable later. https://gist.github.com/wolfram77/1856b108334cc822cdddfdfa7334792a

Can you ﬁx farming by going back 8000 years : NOTES

Can you ﬁx farming by going back 8000 years : NOTES

Can you ﬁx farming by going back 8000 years : NOTES

HITS algorithm : NOTES

HITS algorithm : NOTES

HITS algorithm : NOTES

1. Webpages tend to behave as authorities or hubs. 2. An authority represents an research thesis, and a hub represents an encyclopedia. 3. Each page has an authority and a hub score. 4. The graph is based on query, included pointed to and from pages. 5. Authority score is the sum of scores of all hubs pointing to it. 6. Hub score is the sum of scores of all authorities is pointing to. 7. Score are normalized with L2-norm in each iteration (root of sum of squares). 8. Needs to be performed at query time. 9. Two scores are returned, instead of just one. https://gist.github.com/wolfram77/3d9ef6c5a5b63f53caabce4812c7ea81

Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES

Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES

Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES

For the IPDPS ParSocial event a presentation submission is required by 15th May. The event is on 3rd June. https://gist.github.com/wolfram77/51b15ca09eb28f6909673a2deb1a314d DYNAMIC BATCH PARALLEL ALGORITHMS FOR UPDATING PAGERANK Subhajit Sahut, Kishore Kothapallit and Dip Sankar Banerjeet tInternational Institute of Information Technology Hyderabad, India. tIndian Institute of Technology Jodhpur, India. subhajit.sahu@research. ,kkishore@iiit.ac.in, dipsankarb@iitj.ac.in This work is partially supported by a grant from the Department of Science and Technology (DST), India, under the National Supercomputing Mission (NSM) R&D in Exascale initiative vide Ref. No: DST/NSM/R&D Exascale/2021/16. FACEBOOK 15 TAKING A PAGE OUT OF GOOGLE’S PLAYBOOK 10 STOP FAKE NEWS FROM GOING VIRAL PUBLISHED APR 2015 BY SALVADOR RODRIGUEZ Click-Gap: When is Facebook is driving disproportionate amounts of traffic to websites. Effort to rid fakes news from Facebook’s services. Is a website relying on Facebook to drive significant traffic, but not well ranked by the rest of the web? Also News Citation Graph. PAGERANK APPLICATIONS Ranking of websites. Measuring scientific impact of researchers. Finding the best teams and athletes. Ranking companies by talent concentration. Predicting road/foot traffic in urban spaces. Analysing protein networks. Finding the most authoritative news sources Identifying parts of brain that change jointly. Toxic waste management. PAGERANK APPLICATIONS Debugging complex software systems (Moni torRank) Finding the most original writers (BookRank) Finding topical authorities (TwitterRank) WHAT IS PAGERANK l—-d Plu = Cus + —— UCIiNny Pru u->v = (1-—d) x “us ( ) outdegy, PageRank is a lLink-analysis algorithm. By Larry Page and Sergey Brin in 1996. For ordering information on the web. Represented with a random-surfer model. Rank of a page is defined recursively. Calculate iteratively with power-iteration.

Are Satellites Covered in Gold Foil : NOTES

Are Satellites Covered in Gold Foil : NOTES

Are Satellites Covered in Gold Foil : NOTES

Satellites are usually covered in aluminized polyimide. The yellowish gold color of polyimide with silver aluminium side facing in gives the satellite the appearance of being wrapped in gold. The material is called Multi-layer Insulation (MLI). It helps in radiative insulation of the onboard instruments of satellite. Gold is actually used in electrical contacts to prevent corrosion due to Ultra-violet light or X-rays. https://gist.github.com/wolfram77/8ae2de1a29caf1a2f84babed79943389

Taxation for Traders < Markets and Taxation : NOTES

Taxation for Traders < Markets and Taxation : NOTES

Taxation for Traders < Markets and Taxation : NOTES

This tutorial discusses on tax calculation for long-term and short-term capital gains, as well as for business income that speculative (intraday, BTST) and non-speculative (F&O, BTST). Also a nice concept on tax-loss harvesting. When turnover for business income is more than 5Cr or gains less than 6% an audit is required by a CA. Business income requires maintaining balance sheet and income statement. I may need an audit by a CA.

A Generalization of the PageRank Algorithm : NOTES

A Generalization of the PageRank Algorithm : NOTES

A Generalization of the PageRank Algorithm : NOTES

This paper discusses a method of Generalizing PageRank algorithm for different types of networks. Rank of each vertex is considered to be dependent upon both the in- and out-edges. Each edge can also have differing importance. This solves the problem of dead ends and spider traps without the need of taxation (?). --- Abstract— PageRank is a well-known algorithm that has been used to understand the structure of the Web. In its classical formulation the algorithm considers only forward looking paths in its analysis- a typical web scenario. We propose a generalization of the PageRank algorithm based on both out-links and in-links. This generalization enables the elimination network anomalies- and increases the applicability of the algorithm to an array of new applications in networked data. Through experimental results we illustrate that the proposed generalized PageRank minimizes the effect of network anomalies, and results in more realistic representation of the network. Keywords- Search Engine; PageRank; Web Structure; Web Mining; Spider-Trap; dead-end; Taxation;Web spamming

ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...

ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...

ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...

With biomedical signal processing algorithms, such as the Pan-Tompkins QRS peak detection algorithm, FIR filters are utilized. Raw ECG signal can be fed to a Moving window filter, which helps filter out noise and get the signal of interest. These FIR filters involve the use of multipliers and adders, which take in several input sample and output a single sample. This paper replaces accurate adders in such filters with 10 16-bit signed approximate adders (power ve error parameters) from the EvoApprox library. Functional validation is done in MATLAB with Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR) of Moving Window Integration; and Mean Square Error (MSE) of thresholds as error metrics. MIT-BIH Arrythmia database is used as the raw ECG input. In hardware evaluation, a 100-point FIR filter is implemented with a single Multiply-accumulate unit where the exact adder is replaced with selected approximate adder. RTL model is synthesized with 45nm NandGate Open Cell library in Synopsys design compiler. Area, Average power, and Worst-case delay are measured. On average the presented methodology provides an area-saving of 19.71% and power-saving of 19.27%.

Income Tax Calender 2021 (ITD) : NOTES

Income Tax Calender 2021 (ITD) : NOTES

Income Tax Calender 2021 (ITD) : NOTES

1. Old vs New tax regime [Form26AS] 2. Quarterly Tax Deducted at Source (TDS) [Form16] 3. Quarterly Tax Deducted at Source (other income) [Form 15G/H] 4. Quarterly Advance Tax (extra income) [Challan ITNS 280] Extras: - Public Provident Fund scheme (PPF) - National Pension Scheme (NPS) References: - https://www.incometax.gov.in/iec/foportal - https://finshots.in/archive/finshots-money-resolution-4-axe-your-tax/

More Related Content

More from Subhajit Sahu

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared memory system with multiple CPUs, each with multiple cores, to accelerate pagerank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments are conducted to implement pagerank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid approach runs certain primitives in sequential mode (i.e., sumAt, multiply).

word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...

word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...

word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...

Below are the important points I note from the 2020 paper by Martin Grohe: - 1-WL distinguishes almost all graphs, in a probabilistic sense - Classical WL is two dimensional Weisfeiler-Leman - DeepWL is an unlimited version of WL graph that runs in polynomial time. - Knowledge graphs are essentially graphs with vertex/edge attributes ABSTRACT: Vector representations of graphs and relational structures, whether handcrafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view. Starting with a survey of embedding techniques that have been used in practice, in this paper we propose two theoretical approaches that we see as central for understanding the foundations of vector embeddings. We draw connections between the various approaches and suggest directions for future research.

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES

https://gist.github.com/wolfram77/54c4a14d9ea547183c6c7b3518bf9cd1 There exist a number of dynamic graph generators. Barbasi-Albert model iteratively attach new vertices to pre-exsiting vertices in the graph using preferential attachment (edges to high degree vertices are more likely - rich get richer - Pareto principle). However, graph size increases monotonically, and density of graph keeps increasing (sparsity decreasing). Gorke's model uses a defined clustering to uniformly add vertices and edges. Purohit's model uses motifs (eg. triangles) to mimick properties of existing dynamic graphs, such as growth rate, structure, and degree distribution. Kronecker graph generators are used to increase size of a given graph, with power-law distribution. To generate dynamic graphs, we must choose a metric to compare two graphs. Common metrics include diameter, clustering coefficient (modularity?), triangle counting (triangle density?), and degree distribution. In this paper, the authors propose Dygraph, a dynamic graph generator that uses degree distribution as the only metric. The authors observe that many real-world graphs differ from the power-law distribution at the tail end. To address this issue, they propose binning, where the vertices beyond a certain degree (minDeg = min(deg) s.t. |V(deg)| < H, where H~10 is the number of vertices with a given degree below which are binned) are grouped into bins of degree-width binWidth, max-degree localMax, and number of degrees in bin with at least one vertex binSize (to keep track of sparsity). This helps the authors to generate graphs with a more realistic degree distribution. The process of generating a dynamic graph is as follows. First the difference between the desired and the current degree distribution is calculated. The authors then create an edge-addition set where each vertex is present as many times as the number of additional incident edges it must recieve. Edges are then created by connecting two vertices randomly from this set, and removing both from the set once connected. Currently, authors reject self-loops and duplicate edges. Removal of edges is done in a similar fashion. Authors observe that adding edges with power-law properties dominates the execution time, and consider parallelizing DyGraph as part of future work.

Shared memory Parallelism (NOTES)

Shared memory Parallelism (NOTES)

Shared memory Parallelism (NOTES)

My notes on shared memory parallelism. Shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory [REF].

A Dynamic Algorithm for Local Community Detection in Graphs : NOTES

A Dynamic Algorithm for Local Community Detection in Graphs : NOTES

A Dynamic Algorithm for Local Community Detection in Graphs : NOTES

**Community detection methods** can be *global* or *local*. **Global community detection methods** divide the entire graph into groups. Existing global algorithms include: - Random walk methods - Spectral partitioning - Label propagation - Greedy agglomerative and divisive algorithms - Clique percolation https://gist.github.com/wolfram77/b4316609265b5b9f88027bbc491f80b6 There is a growing body of work in *detecting overlapping communities*. **Seed set expansion** is a **local community detection method** where a relevant *seed vertices* of interest are picked and *expanded to form communities* surrounding them. The quality of each community is measured using a *fitness function*. **Modularity** is a *fitness function* which compares the number of intra-community edges to the expected number in a random-null model. **Conductance** is another popular fitness score that measures the community cut or inter-community edges. Many *overlapping community detection* methods **use a modified ratio** of intra-community edges to all edges with atleast one endpoint in the community. Andersen et al. use a **Spectral PageRank-Nibble method** which minimizes conductance and is formed by adding vertices in order of decreasing PageRank values. Andersen and Lang develop a **random walk approach** in which some vertices in the seed set may not be placed in the final community. Clauset gives a **greedy method** that *starts from a single vertex* and then iteratively adds neighboring vertices *maximizing the local modularity score*. Riedy et al. **expand multiple vertices** via maximizing modularity. Several algorithms for **detecting global, overlapping communities** use a *greedy*, *agglomerative approach* and run *multiple separate seed set expansions*. Lancichinetti et al. run **greedy seed set expansions**, each with a *single seed vertex*. Overlapping communities are produced by a sequentially running expansions from a node not yet in a community. Lee et al. use **maximal cliques as seed sets**. Havemann et al. **greedily expand cliques**. The authors of this paper discuss a dynamic approach for **community detection using seed set expansion**. Simply marking the neighbours of changed vertices is a **naive approach**, and has *severe shortcomings*. This is because *communities can split apart*. The simple updating method *may fail even when it outputs a valid community* in the graph.

Scalable Static and Dynamic Community Detection Using Grappolo : NOTES

Scalable Static and Dynamic Community Detection Using Grappolo : NOTES

Scalable Static and Dynamic Community Detection Using Grappolo : NOTES

A **community** (in a network) is a subset of nodes which are _strongly connected among themselves_, but _weakly connected to others_. Neither the number of output communities nor their size distribution is known a priori. Community detection methods can be divisive or agglomerative. **Divisive methods** use _betweeness centrality_ to **identify and remove bridges** between communities. **Agglomerative methods** greedily **merge two communities** that provide maximum gain in _modularity_. Newman and Girvan have introduced the **modularity metric**. The problem of community detection is then reduced to the problem of modularity maximization which is **NP-complete**. **Louvain method** is a variant of the _agglomerative strategy_, in that is a _multi-level heuristic_. https://gist.github.com/wolfram77/917a1a4a429e89a0f2a1911cea56314d In this paper, the authors discuss **four heuristics** for Community detection using the _Louvain algorithm_ implemented upon recently developed **Grappolo**, which is a parallel variant of the Louvain algorithm. They are: - Vertex following and Minimum label - Data caching - Graph coloring - Threshold scaling With the **Vertex following** heuristic, the _input is preprocessed_ and all single-degree vertices are merged with their corresponding neighbours. This helps reduce the number of vertices considered in each iteration, and also help initial seeds of communities to be formed. With the **Minimum label heuristic**, when a vertex is making the decision to move to a community and multiple communities provided the same modularity gain, the community with the smallest id is chosen. This helps _minimize or prevent community swaps_. With the **Data caching** heuristic, community information is stored in a vector instead of a map, and is reused in each iteration, but with some additional cost. With the **Vertex ordering via Graph coloring** heuristic, _distance-k coloring_ of graphs is performed in order to group vertices into colors. Then, each set of vertices (by color) is processed _concurrently_, and synchronization is performed after that. This enables us to mimic the behaviour of the serial algorithm. Finally, with the **Threshold scaling** heuristic, _successively smaller values of modularity threshold_ are used as the algorithm progresses. This allows the algorithm to converge faster, and it has been observed a good modularity score as well. From the results, it appears that _graph coloring_ and _threshold scaling_ heuristics do not always provide a speedup and this depends upon the nature of the graph. It would be interesting to compare the heuristics against baseline approaches. Future work can include _distributed memory implementations_, and _community detection on streaming graphs_.

Application Areas of Community Detection: A Review : NOTES

Application Areas of Community Detection: A Review : NOTES

Application Areas of Community Detection: A Review : NOTES

This is a short review of Community detection methods (on graphs), and their applications. A **community** is a subset of a network whose members are *highly connected*, but *loosely connected* to others outside their community. Different community detection methods *can return differing communities* these algorithms are **heuristic-based**. **Dynamic community detection** involves tracking the *evolution of community structure* over time. https://gist.github.com/wolfram77/09e64d6ba3ef080db5558feb2d32fdc0 Communities can be of the following **types**: - Disjoint - Overlapping - Hierarchical - Local. The following **static** community detection **methods** exist: - Spectral-based - Statistical inference - Optimization - Dynamics-based The following **dynamic** community detection **methods** exist: - Independent community detection and matching - Dependent community detection (evolutionary) - Simultaneous community detection on all snapshots - Dynamic community detection on temporal networks **Applications** of community detection include: - Criminal identification - Fraud detection - Criminal activities detection - Bot detection - Dynamics of epidemic spreading (dynamic) - Cancer/tumor detection - Tissue/organ detection - Evolution of influence (dynamic) - Astroturfing - Customer segmentation - Recommendation systems - Social network analysis (both) - Network summarization - Privary, group segmentation - Link prediction (both) - Community evolution prediction (dynamic, hot field) <br> <br> ## References - [Application Areas of Community Detection: A Review : PAPER](https://ieeexplore.ieee.org/document/8625349)

Community Detection on the GPU : NOTES

Community Detection on the GPU : NOTES

Community Detection on the GPU : NOTES

This paper discusses a GPU implementation of the Louvain community detection algorithm. Louvain algorithm obtains hierachical communities as a dendrogram through modularity optimization. Given an undirected weighted graph, all vertices are first considered to be their own communities. In the first phase, each vertex greedily decides to move to the community of one of its neighbours which gives greatest increase in modularity. If moving to no neighbour's community leads to an increase in modularity, the vertex chooses to stay with its own community. This is done sequentially for all the vertices. If the total change in modularity is more than a certain threshold, this phase is repeated. Once this local moving phase is complete, all vertices have formed their first hierarchy of communities. The next phase is called the aggregation phase, where all the vertices belonging to a community are collapsed into a single super-vertex, such that edges between communities are represented as edges between respective super-vertices (edge weights are combined), and edges within each community are represented as self-loops in respective super-vertices (again, edge weights are combined). Together, the local moving and the aggregation phases constitute a stage. This super-vertex graph is then used as input fof the next stage. This process continues until the increase in modularity is below a certain threshold. As a result from each stage, we have a hierarchy of community memberships for each vertex as a dendrogram. Approaches to perform the Louvain algorithm can be divided into coarse-grained and fine-grained. Coarse-grained approaches process a set of vertices in parallel, while fine-grained approaches process all vertices in parallel. A coarse-grained hybrid-GPU algorithm using multi GPUs has be implemented by Cheong et al. which grabbed my attention. In addition, their algorithm does not use hashing for the local moving phase, but instead sorts each neighbour list based on the community id of each vertex. https://gist.github.com/wolfram77/7e72c9b8c18c18ab908ae76262099329

Survey for extra-child-process package : NOTES

Survey for extra-child-process package : NOTES

Survey for extra-child-process package : NOTES

Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER

Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER

Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER

Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...

Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...

Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...

Fast Incremental Community Detection on Dynamic Graphs : NOTES

Fast Incremental Community Detection on Dynamic Graphs : NOTES

Fast Incremental Community Detection on Dynamic Graphs : NOTES

In this paper, the authors describe two approaches for dynamic community detection using the CNM algorithm. CNM is a hierarchical, agglomerative algorithm that greedily maximizes modularity. They define two approaches: BasicDyn and FastDyn. BasicDyn backtracks merges of communities until each marked (changed) vertex is its own singleton community. FastDyn undoes a merge only if the quality of merge, as measured by the induced change in modularity, has significantly decreased compared to when the merge initially took place. FastDyn also allows more than two vertices to contract together if in the previous time step these vertices eventually ended up contracted in the same community. In the static case, merging several vertices together in one contraction phase could lead to deteriorating results. FastDyn is able to do this, however, because it uses information from the merges of the previous time step. Intuitively, merges that previously occurred are more likely to be acceptable later. https://gist.github.com/wolfram77/1856b108334cc822cdddfdfa7334792a

Can you ﬁx farming by going back 8000 years : NOTES

Can you ﬁx farming by going back 8000 years : NOTES

Can you ﬁx farming by going back 8000 years : NOTES

HITS algorithm : NOTES

HITS algorithm : NOTES

HITS algorithm : NOTES

1. Webpages tend to behave as authorities or hubs. 2. An authority represents an research thesis, and a hub represents an encyclopedia. 3. Each page has an authority and a hub score. 4. The graph is based on query, included pointed to and from pages. 5. Authority score is the sum of scores of all hubs pointing to it. 6. Hub score is the sum of scores of all authorities is pointing to. 7. Score are normalized with L2-norm in each iteration (root of sum of squares). 8. Needs to be performed at query time. 9. Two scores are returned, instead of just one. https://gist.github.com/wolfram77/3d9ef6c5a5b63f53caabce4812c7ea81

Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES

Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES

Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES

For the IPDPS ParSocial event a presentation submission is required by 15th May. The event is on 3rd June. https://gist.github.com/wolfram77/51b15ca09eb28f6909673a2deb1a314d DYNAMIC BATCH PARALLEL ALGORITHMS FOR UPDATING PAGERANK Subhajit Sahut, Kishore Kothapallit and Dip Sankar Banerjeet tInternational Institute of Information Technology Hyderabad, India. tIndian Institute of Technology Jodhpur, India. subhajit.sahu@research. ,kkishore@iiit.ac.in, dipsankarb@iitj.ac.in This work is partially supported by a grant from the Department of Science and Technology (DST), India, under the National Supercomputing Mission (NSM) R&D in Exascale initiative vide Ref. No: DST/NSM/R&D Exascale/2021/16. FACEBOOK 15 TAKING A PAGE OUT OF GOOGLE’S PLAYBOOK 10 STOP FAKE NEWS FROM GOING VIRAL PUBLISHED APR 2015 BY SALVADOR RODRIGUEZ Click-Gap: When is Facebook is driving disproportionate amounts of traffic to websites. Effort to rid fakes news from Facebook’s services. Is a website relying on Facebook to drive significant traffic, but not well ranked by the rest of the web? Also News Citation Graph. PAGERANK APPLICATIONS Ranking of websites. Measuring scientific impact of researchers. Finding the best teams and athletes. Ranking companies by talent concentration. Predicting road/foot traffic in urban spaces. Analysing protein networks. Finding the most authoritative news sources Identifying parts of brain that change jointly. Toxic waste management. PAGERANK APPLICATIONS Debugging complex software systems (Moni torRank) Finding the most original writers (BookRank) Finding topical authorities (TwitterRank) WHAT IS PAGERANK l—-d Plu = Cus + —— UCIiNny Pru u->v = (1-—d) x “us ( ) outdegy, PageRank is a lLink-analysis algorithm. By Larry Page and Sergey Brin in 1996. For ordering information on the web. Represented with a random-surfer model. Rank of a page is defined recursively. Calculate iteratively with power-iteration.

Are Satellites Covered in Gold Foil : NOTES

Are Satellites Covered in Gold Foil : NOTES

Are Satellites Covered in Gold Foil : NOTES

Satellites are usually covered in aluminized polyimide. The yellowish gold color of polyimide with silver aluminium side facing in gives the satellite the appearance of being wrapped in gold. The material is called Multi-layer Insulation (MLI). It helps in radiative insulation of the onboard instruments of satellite. Gold is actually used in electrical contacts to prevent corrosion due to Ultra-violet light or X-rays. https://gist.github.com/wolfram77/8ae2de1a29caf1a2f84babed79943389

Taxation for Traders < Markets and Taxation : NOTES

Taxation for Traders < Markets and Taxation : NOTES

Taxation for Traders < Markets and Taxation : NOTES

This tutorial discusses on tax calculation for long-term and short-term capital gains, as well as for business income that speculative (intraday, BTST) and non-speculative (F&O, BTST). Also a nice concept on tax-loss harvesting. When turnover for business income is more than 5Cr or gains less than 6% an audit is required by a CA. Business income requires maintaining balance sheet and income statement. I may need an audit by a CA.

A Generalization of the PageRank Algorithm : NOTES

A Generalization of the PageRank Algorithm : NOTES

A Generalization of the PageRank Algorithm : NOTES

This paper discusses a method of Generalizing PageRank algorithm for different types of networks. Rank of each vertex is considered to be dependent upon both the in- and out-edges. Each edge can also have differing importance. This solves the problem of dead ends and spider traps without the need of taxation (?). --- Abstract— PageRank is a well-known algorithm that has been used to understand the structure of the Web. In its classical formulation the algorithm considers only forward looking paths in its analysis- a typical web scenario. We propose a generalization of the PageRank algorithm based on both out-links and in-links. This generalization enables the elimination network anomalies- and increases the applicability of the algorithm to an array of new applications in networked data. Through experimental results we illustrate that the proposed generalized PageRank minimizes the effect of network anomalies, and results in more realistic representation of the network. Keywords- Search Engine; PageRank; Web Structure; Web Mining; Spider-Trap; dead-end; Taxation;Web spamming

ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...

ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...

ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...

With biomedical signal processing algorithms, such as the Pan-Tompkins QRS peak detection algorithm, FIR filters are utilized. Raw ECG signal can be fed to a Moving window filter, which helps filter out noise and get the signal of interest. These FIR filters involve the use of multipliers and adders, which take in several input sample and output a single sample. This paper replaces accurate adders in such filters with 10 16-bit signed approximate adders (power ve error parameters) from the EvoApprox library. Functional validation is done in MATLAB with Structural Similarity Index (SSIM) and Peak Signal-to-Noise Ratio (PSNR) of Moving Window Integration; and Mean Square Error (MSE) of thresholds as error metrics. MIT-BIH Arrythmia database is used as the raw ECG input. In hardware evaluation, a 100-point FIR filter is implemented with a single Multiply-accumulate unit where the exact adder is replaced with selected approximate adder. RTL model is synthesized with 45nm NandGate Open Cell library in Synopsys design compiler. Area, Average power, and Worst-case delay are measured. On average the presented methodology provides an area-saving of 19.71% and power-saving of 19.27%.

Income Tax Calender 2021 (ITD) : NOTES

Income Tax Calender 2021 (ITD) : NOTES

Income Tax Calender 2021 (ITD) : NOTES

1. Old vs New tax regime [Form26AS] 2. Quarterly Tax Deducted at Source (TDS) [Form16] 3. Quarterly Tax Deducted at Source (other income) [Form 15G/H] 4. Quarterly Advance Tax (extra income) [Challan ITNS 280] Extras: - Public Provident Fund scheme (PPF) - National Pension Scheme (NPS) References: - https://www.incometax.gov.in/iec/foportal - https://finshots.in/archive/finshots-money-resolution-4-axe-your-tax/

More from Subhajit Sahu (20)

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

Adjusting OpenMP PageRank : SHORT REPORT / NOTES

word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...

word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...

word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES

DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES

Shared memory Parallelism (NOTES)

Shared memory Parallelism (NOTES)

Shared memory Parallelism (NOTES)

A Dynamic Algorithm for Local Community Detection in Graphs : NOTES

A Dynamic Algorithm for Local Community Detection in Graphs : NOTES

A Dynamic Algorithm for Local Community Detection in Graphs : NOTES

Scalable Static and Dynamic Community Detection Using Grappolo : NOTES

Scalable Static and Dynamic Community Detection Using Grappolo : NOTES

Scalable Static and Dynamic Community Detection Using Grappolo : NOTES

Application Areas of Community Detection: A Review : NOTES

Application Areas of Community Detection: A Review : NOTES

Application Areas of Community Detection: A Review : NOTES

Community Detection on the GPU : NOTES

Community Detection on the GPU : NOTES

Community Detection on the GPU : NOTES

Survey for extra-child-process package : NOTES

Survey for extra-child-process package : NOTES

Survey for extra-child-process package : NOTES

Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER

Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER

Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER

Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...

Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...

Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...

Fast Incremental Community Detection on Dynamic Graphs : NOTES

Fast Incremental Community Detection on Dynamic Graphs : NOTES

Fast Incremental Community Detection on Dynamic Graphs : NOTES

Can you ﬁx farming by going back 8000 years : NOTES

Can you ﬁx farming by going back 8000 years : NOTES

Can you ﬁx farming by going back 8000 years : NOTES

HITS algorithm : NOTES

HITS algorithm : NOTES

HITS algorithm : NOTES

Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES

Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES

Dynamic Batch Parallel Algorithms for Updating Pagerank : SLIDES

Are Satellites Covered in Gold Foil : NOTES

Are Satellites Covered in Gold Foil : NOTES

Are Satellites Covered in Gold Foil : NOTES

Taxation for Traders < Markets and Taxation : NOTES

Taxation for Traders < Markets and Taxation : NOTES

Taxation for Traders < Markets and Taxation : NOTES

A Generalization of the PageRank Algorithm : NOTES

A Generalization of the PageRank Algorithm : NOTES

A Generalization of the PageRank Algorithm : NOTES

ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...

ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...

ApproxBioWear: Approximating Additions for Efficient Biomedical Wearable Comp...

Income Tax Calender 2021 (ITD) : NOTES

Income Tax Calender 2021 (ITD) : NOTES

Income Tax Calender 2021 (ITD) : NOTES

Basic Computer Architecture and the Case for GPUs : NOTES

1. PPoPP 2010, Bangalore, India. Basic Architecture Concepts CPU Architecture 4 stages of instruction execution Too many cycles per instruction (CPI) Fetch Decode Execute Write t=1 2 3 4 5

2. PPoPP 2010, Bangalore, India. Basic Architecture Concepts CPU Architecture 4 stages of instruction execution Too many cycles per instruction (CPI) To reduce the CPI, introduce pipelined execution Fetch Decode Execute Write Fetch Decode Execute Write Fetch Decode Execute Write Fetch Decode Execute Write t=1 2 3 4 5

3. PPoPP 2010, Bangalore, India. Basic Architecture Concepts CPU Architecture 4 stages of instruction execution Too many cycles per instruction (CPI) To reduce the CPI, introduce pipelined execution Needs buffers to store results across stages. A cache to handle slow memory access times Fetch Decode Execute Write Fetch Decode Execute Write Fetch Decode Execute Write Fetch Decode Execute Write t=1 2 3 4 5 Cache

4. PPoPP 2010, Bangalore, India. Basic Architecture Concepts Fetch Decode Execute Write Fetch Decode Execute Write Fetch Decode Execute Write Fetch Decode Execute Write t=1 2 3 4 5 Cache CPU Architecture 4 stages of instruction execution Too many cycles per instruction (CPI) To reduce the CPI, introduce pipelined execution Needs buffers to store results across stages. A cache to handle slow memory access times Multilevel caches, outoforder execution, branch prediction, ...

5. PPoPP 2010, Bangalore, India. Basic Architecture Concepts CPU architecture getting too complex. Not translating to equivalent performance benefits Need a rethink on traditional CPU architectures.

6. PPoPP 2010, Bangalore, India. Basic Architecture Concepts Couple with this the new wisdom in computer architectures.  Memory Wall – memory latencies far higher  ILP Wall – Reducing benefits from instruction level parallelism  Power Wall – Increase in power consumption with increase in clock rates. Multicore is the way forward Ex: GPUs, Cell, Intel Quad core, ... Predicted that 100+ core computers would be a reality soon.

7. PPoPP 2010, Bangalore, India. Multicore and Manycore Processors IBM Cell NVidia GeForce 8800 includes 128 scalar processors and Tesla Sun T1 and T2 Tilera Tile64 Picochip combines 430 simple RISC cores Cisco 188 TRIPS

8. PPoPP 2010, Bangalore, India. The Case for the GPUs GPUs are now common. They also have high computing power per dollar, compared to the CPU Today’s computer system has a CPU and a GPU, with the GPU being used primarily for graphics. GPUs are good at some tasks and not so good at others. They are especially good at processing large data such as images. Let us use the right processor for the right task. Goal: Increase the overall throughput of the computer system on the given task. Use CPU and GPU synergistically.

9. PPoPP 2010, Bangalore, India. Evolution of GPUs Graphics: a few hundred triangles/vertices map to afewhundred thousand pixels Process pixels in parallel. Do the same thing on a large number of different items. Data parallel model : parallelism provided by the data Thousands to millions of data elements Same program/instruction on all of them Hardware: 816 cores to process vertices and 64128 to process pixels by 2005 Less versatile than CPU cores SIMD mode of computations. Less hardware for instruction issue No caching, branch prediction, out of order execution, etc. ‐ ‐ Can pack more cores in same silicon die area

10. PPoPP 2010, Bangalore, India. GPUs as a Case Study GPGPU – General Purpose Programming on GPUs OpenGL extensions Very difficult to program Recently manufactures started supporting Clike programming abstraction to program GPUs CUDA from NVidia Other benefits of GPGPU Affordable cost, easy availability, computational power

11. PPoPP 2010, Bangalore, India. GPUs as a Case Study GPUs suited for routines with high arithmetic intensity. One feature is high memory latency, depending on the nature of access. Should overlap memory with arithmetic.

12. PPoPP 2010, Bangalore, India. CPU Vs GPU Few powerful cores Vs. lots of small cores  GPUs: For good performance, applications need high arithmetic intensity GPUs : No system managed cache.

13. PPoPP 2010, Bangalore, India. GPGPU as a Case Study Regular algorithms Map well to data parallel model of GPUs Each work item operates by itself or with a few neighbors Example settings : image processing. Threads can share data, e.g., apron pixels in an image processing kernel.

14. PPoPP 2010, Bangalore, India. GPU as a Case Study Irregular algorithms Applications with data accesses that are not regular in nature. Occurs in settings such as graph algorithms, data structures building, etc. Difficult to get high efficiency due to high memory latency of accesses.

15. PPoPP 2010, Bangalore, India. GPGPU Tools and APIs OpenGL CUDA OpenCL Brook