Highlighted notes on Optimizing Parallel Reduction in CUDA
While doing research work under Prof. Dip Banerjee, Prof. Kishore Kothapalli.
Interesting optimizations, i should try these soon as PageRank is basically lots of sums.
A Comparison of Stock Trend Prediction Using Accuracy Driven Neural Network V...idescitation
In the recent scenario, nevertheless to say, modern
finance is facing many hurdles to find effective ways to gather
information about stock market data at one shot. At the same
time it is inevitable for both individuals & institutions to
visualize, summarize & enhance their knowledge about the
market behavior for making wise decisions. This paper surveys
recent literature in the domain of neural network variants to
forecast the stock market trends. Classification is made in
terms of dependant variables, data preprocessing techniques
used, network structure, performance analysis and other
useful modeling information. Through the surveyed papers it
is shown that the neural network variants are widely accepted
to study and evaluate stock market behavior compared to
standalone neural network.
After a long period, I bring you new - fresh Presentation which gives you a brief idea on sub-problem of Dynamic Programming which is called as -"Longest Common Subsequence".I hope this presentation may help to all my viewers....
A Comparison of Stock Trend Prediction Using Accuracy Driven Neural Network V...idescitation
In the recent scenario, nevertheless to say, modern
finance is facing many hurdles to find effective ways to gather
information about stock market data at one shot. At the same
time it is inevitable for both individuals & institutions to
visualize, summarize & enhance their knowledge about the
market behavior for making wise decisions. This paper surveys
recent literature in the domain of neural network variants to
forecast the stock market trends. Classification is made in
terms of dependant variables, data preprocessing techniques
used, network structure, performance analysis and other
useful modeling information. Through the surveyed papers it
is shown that the neural network variants are widely accepted
to study and evaluate stock market behavior compared to
standalone neural network.
After a long period, I bring you new - fresh Presentation which gives you a brief idea on sub-problem of Dynamic Programming which is called as -"Longest Common Subsequence".I hope this presentation may help to all my viewers....
In this deck from ATPESC 2019, Jack Dongarra from UT Knoxville presents: Adaptive Linear Solvers and Eigensolvers.
"Success in large-scale scientific computations often depends on algorithm design. Even the fastest machine may prove to be inadequate if insufficient attention is paid to the way in which the computation is organized. We have used several problems from computational physics to illustrate the importance of good algorithms, and we offer some very general principles for designing algorithms. Two subthemes are, first, the strong connection between the algorithm and the architecture of the target machine; and second, the importance of non-numerical methods in scientific computations."
Watch the video: https://wp.me/p3RLHQ-lq3
Learn more: https://extremecomputingtraining.anl.gov/archive/atpesc-2019/agenda-2019/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Exploring Parallel Merging In GPU Based Systems Using CUDA C.Rakib Hossain
We present a program that implemented to execute Adaptive merge sort algorithm in parallel on a GPU based system. Parallel implementation is used to get better performance than serial implementation in runtime perspective. Parallel implementation executes independent executable operation in parallel using large number of cores in GPU based system. Results from a parallel implementation of the algorithm is given and compared with its serial implementation on run time basis. The parallel version is implemented with CUDA platform in a system based on NVIDIA GPU (GTX 650)
Accelerating HPC Applications on NVIDIA GPUs with OpenACCinside-BigData.com
In this deck from the Stanford HPC Conference, Doug Miles from NVIDIA presents: Accelerating HPC Applications on NVIDIA GPUs with OpenACC."
"OpenACC is a directive-based parallel programming model for GPU accelerated and heterogeneous parallel HPC systems. It offers higher programmer productivity compared to use of explicit models like CUDA and OpenCL.
Application source code instrumented with OpenACC directives remains portable to any system with a standard Fortran/C/C++ compiler, and can be efficiently parallelized for various types of HPC systems – multicore CPUs, heterogeneous CPU+GPU, and manycore processors.
This talk will include an introduction to the OpenACC programming model, provide examples of its use in a number of production applications, explain how OpenACC and CUDA Unified Memory working together can dramatically simplify GPU programming, and close with a few thoughts on OpenACC future directions."
Watch the video: https://youtu.be/CaE3n89QM8o
Learn more: https://www.openacc.org/
and
http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Speedrunning the Open Street Map osm2pgsql LoaderGregSmith458515
The Open Street Map project provides invaluable data that keeps driving users toward the PostGIS and PostgreSQL stacks. Loading today’s full Planet data set takes a 120GB XML file and unrolls it into over a terabyte of database data. Crunchy’s benchmark labs have followed the expansion of that Planet data over the last six database releases, as the re-ignition of the CPU wars combined with parallel execution features landing in the database. We’ll take a look at that data evolution, which server configurations worked, and which metrics techniques still matter in the all SSD era.
RAPIDS: ускоряем Pandas и scikit-learn на GPU Павел Клеменков, NVidiaMail.ru Group
Все мы знаем, что наш любимый Pandas исключительно однопоточный, а модели из scikit-learn часто учатся не очень быстро даже в несколько процессов. Поэтому в докладе я расскажу о проекте RAPIDS - наборе библиотек для анализа данных и построения предиктивных моделей с использованием NVIDIA GPU. В докладе я предложу подискутировать о том, что закон Мура больше не выполняется, рассмотрю принципы работы архитектуры CUDA. Разберу библиотеки cuDF и cuML, а также постараюсь предельно честно рассказать о том, ждать ли чуда от перехода на GPU и в каких случаях чудо неизбежно.
IJERA (International journal of Engineering Research and Applications) is International online, ... peer reviewed journal. For more detail or submit your article, please visit www.ijera.com
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin...Subhajit Sahu
TrueTime is a service that enables the use of globally synchronized clocks, with bounded error. It returns a time interval that is guaranteed to contain the clock’s actual time for some time during the call’s execution. If two intervals do not overlap, then we know calls were definitely ordered in real time. In general, synchronized clocks can be used to avoid communication in a distributed system.
The underlying source of time is a combination of GPS receivers and atomic clocks. As there are “time masters” in every datacenter (redundantly), it is likely that both sides of a partition would continue to enjoy accurate time. Individual nodes however need network connectivity to the masters, and without it their clocks will drift. Thus, during a partition their intervals slowly grow wider over time, based on bounds on the rate of local clock drift. Operations depending on TrueTime, such as Paxos leader election or transaction commits, thus have to wait a little longer, but the operation still completes (assuming the 2PC and quorum communication are working).
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Subhajit Sahu
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
Adjusting Bitset for graph : SHORT REPORT / NOTESSubhajit Sahu
Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is commonly used for efficient graph computations. Unfortunately, using CSR for dynamic graphs is impractical since addition/deletion of a single edge can require on average (N+M)/2 memory accesses, in order to update source-offsets and destination-indices. A common approach is therefore to store edge-lists/destination-indices as an array of arrays, where each edge-list is an array belonging to a vertex. While this is good enough for small graphs, it quickly becomes a bottleneck for large graphs. What causes this bottleneck depends on whether the edge-lists are sorted or unsorted. If they are sorted, checking for an edge requires about log(E) memory accesses, but adding an edge on average requires E/2 accesses, where E is the number of edges of a given vertex. Note that both addition and deletion of edges in a dynamic graph require checking for an existing edge, before adding or deleting it. If edge lists are unsorted, checking for an edge requires around E/2 memory accesses, but adding an edge requires only 1 memory access.
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Adjusting primitives for graph : SHORT REPORT / NOTESSubhajit Sahu
Graph algorithms, like PageRank Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Experiments with Primitive operations : SHORT REPORT / NOTESSubhajit Sahu
This includes:
- Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
- Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
- Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
- Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Techniques to optimize the pagerank algorithm usually fall in two categories. One is to try reducing the work per iteration, and the other is to try reducing the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, with the same in-links, helps reduce duplicate computations and thus could help reduce iteration time. Road networks often have chains which can be short-circuited before pagerank computation to improve performance. Final ranks of chain nodes can be easily calculated. This could reduce both the iteration time, and the number of iterations. If a graph has no dangling nodes, pagerank of each strongly connected component can be computed in topological order. This could help reduce the iteration time, no. of iterations, and also enable multi-iteration concurrency in pagerank computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
Adjusting OpenMP PageRank : SHORT REPORT / NOTESSubhajit Sahu
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o...Subhajit Sahu
Below are the important points I note from the 2020 paper by Martin Grohe:
- 1-WL distinguishes almost all graphs, in a probabilistic sense
- Classical WL is two dimensional Weisfeiler-Leman
- DeepWL is an unlimited version of WL graph that runs in polynomial time.
- Knowledge graphs are essentially graphs with vertex/edge attributes
ABSTRACT:
Vector representations of graphs and relational structures, whether handcrafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view.
Starting with a survey of embedding techniques that have been used in practice, in this paper we propose two theoretical approaches that we see as central for understanding the foundations of vector embeddings. We draw connections between the various approaches and suggest directions for future research.
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTESSubhajit Sahu
https://gist.github.com/wolfram77/54c4a14d9ea547183c6c7b3518bf9cd1
There exist a number of dynamic graph generators. Barbasi-Albert model iteratively attach new vertices to pre-exsiting vertices in the graph using preferential attachment (edges to high degree vertices are more likely - rich get richer - Pareto principle). However, graph size increases monotonically, and density of graph keeps increasing (sparsity decreasing).
Gorke's model uses a defined clustering to uniformly add vertices and edges. Purohit's model uses motifs (eg. triangles) to mimick properties of existing dynamic graphs, such as growth rate, structure, and degree distribution. Kronecker graph generators are used to increase size of a given graph, with power-law distribution.
To generate dynamic graphs, we must choose a metric to compare two graphs. Common metrics include diameter, clustering coefficient (modularity?), triangle counting (triangle density?), and degree distribution.
In this paper, the authors propose Dygraph, a dynamic graph generator that uses degree distribution as the only metric. The authors observe that many real-world graphs differ from the power-law distribution at the tail end. To address this issue, they propose binning, where the vertices beyond a certain degree (minDeg = min(deg) s.t. |V(deg)| < H, where H~10 is the number of vertices with a given degree below which are binned) are grouped into bins of degree-width binWidth, max-degree localMax, and number of degrees in bin with at least one vertex binSize (to keep track of sparsity). This helps the authors to generate graphs with a more realistic degree distribution.
The process of generating a dynamic graph is as follows. First the difference between the desired and the current degree distribution is calculated. The authors then create an edge-addition set where each vertex is present as many times as the number of additional incident edges it must recieve. Edges are then created by connecting two vertices randomly from this set, and removing both from the set once connected. Currently, authors reject self-loops and duplicate edges. Removal of edges is done in a similar fashion.
Authors observe that adding edges with power-law properties dominates the execution time, and consider parallelizing DyGraph as part of future work.
My notes on shared memory parallelism.
Shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory [REF].
A Dynamic Algorithm for Local Community Detection in Graphs : NOTESSubhajit Sahu
**Community detection methods** can be *global* or *local*. **Global community detection methods** divide the entire graph into groups. Existing global algorithms include:
- Random walk methods
- Spectral partitioning
- Label propagation
- Greedy agglomerative and divisive algorithms
- Clique percolation
https://gist.github.com/wolfram77/b4316609265b5b9f88027bbc491f80b6
There is a growing body of work in *detecting overlapping communities*. **Seed set expansion** is a **local community detection method** where a relevant *seed vertices* of interest are picked and *expanded to form communities* surrounding them. The quality of each community is measured using a *fitness function*.
**Modularity** is a *fitness function* which compares the number of intra-community edges to the expected number in a random-null model. **Conductance** is another popular fitness score that measures the community cut or inter-community edges. Many *overlapping community detection* methods **use a modified ratio** of intra-community edges to all edges with atleast one endpoint in the community.
Andersen et al. use a **Spectral PageRank-Nibble method** which minimizes conductance and is formed by adding vertices in order of decreasing PageRank values. Andersen and Lang develop a **random walk approach** in which some vertices in the seed set may not be placed in the final community. Clauset gives a **greedy method** that *starts from a single vertex* and then iteratively adds neighboring vertices *maximizing the local modularity score*. Riedy et al. **expand multiple vertices** via maximizing modularity.
Several algorithms for **detecting global, overlapping communities** use a *greedy*, *agglomerative approach* and run *multiple separate seed set expansions*. Lancichinetti et al. run **greedy seed set expansions**, each with a *single seed vertex*. Overlapping communities are produced by a sequentially running expansions from a node not yet in a community. Lee et al. use **maximal cliques as seed sets**. Havemann et al. **greedily expand cliques**.
The authors of this paper discuss a dynamic approach for **community detection using seed set expansion**. Simply marking the neighbours of changed vertices is a **naive approach**, and has *severe shortcomings*. This is because *communities can split apart*. The simple updating method *may fail even when it outputs a valid community* in the graph.
Scalable Static and Dynamic Community Detection Using Grappolo : NOTESSubhajit Sahu
A **community** (in a network) is a subset of nodes which are _strongly connected among themselves_, but _weakly connected to others_. Neither the number of output communities nor their size distribution is known a priori. Community detection methods can be divisive or agglomerative. **Divisive methods** use _betweeness centrality_ to **identify and remove bridges** between communities. **Agglomerative methods** greedily **merge two communities** that provide maximum gain in _modularity_. Newman and Girvan have introduced the **modularity metric**. The problem of community detection is then reduced to the problem of modularity maximization which is **NP-complete**. **Louvain method** is a variant of the _agglomerative strategy_, in that is a _multi-level heuristic_.
https://gist.github.com/wolfram77/917a1a4a429e89a0f2a1911cea56314d
In this paper, the authors discuss **four heuristics** for Community detection using the _Louvain algorithm_ implemented upon recently developed **Grappolo**, which is a parallel variant of the Louvain algorithm. They are:
- Vertex following and Minimum label
- Data caching
- Graph coloring
- Threshold scaling
With the **Vertex following** heuristic, the _input is preprocessed_ and all single-degree vertices are merged with their corresponding neighbours. This helps reduce the number of vertices considered in each iteration, and also help initial seeds of communities to be formed. With the **Minimum label heuristic**, when a vertex is making the decision to move to a community and multiple communities provided the same modularity gain, the community with the smallest id is chosen. This helps _minimize or prevent community swaps_. With the **Data caching** heuristic, community information is stored in a vector instead of a map, and is reused in each iteration, but with some additional cost. With the **Vertex ordering via Graph coloring** heuristic, _distance-k coloring_ of graphs is performed in order to group vertices into colors. Then, each set of vertices (by color) is processed _concurrently_, and synchronization is performed after that. This enables us to mimic the behaviour of the serial algorithm. Finally, with the **Threshold scaling** heuristic, _successively smaller values of modularity threshold_ are used as the algorithm progresses. This allows the algorithm to converge faster, and it has been observed a good modularity score as well.
From the results, it appears that _graph coloring_ and _threshold scaling_ heuristics do not always provide a speedup and this depends upon the nature of the graph. It would be interesting to compare the heuristics against baseline approaches. Future work can include _distributed memory implementations_, and _community detection on streaming graphs_.
Application Areas of Community Detection: A Review : NOTESSubhajit Sahu
This is a short review of Community detection methods (on graphs), and their applications. A **community** is a subset of a network whose members are *highly connected*, but *loosely connected* to others outside their community. Different community detection methods *can return differing communities* these algorithms are **heuristic-based**. **Dynamic community detection** involves tracking the *evolution of community structure* over time.
https://gist.github.com/wolfram77/09e64d6ba3ef080db5558feb2d32fdc0
Communities can be of the following **types**:
- Disjoint
- Overlapping
- Hierarchical
- Local.
The following **static** community detection **methods** exist:
- Spectral-based
- Statistical inference
- Optimization
- Dynamics-based
The following **dynamic** community detection **methods** exist:
- Independent community detection and matching
- Dependent community detection (evolutionary)
- Simultaneous community detection on all snapshots
- Dynamic community detection on temporal networks
**Applications** of community detection include:
- Criminal identification
- Fraud detection
- Criminal activities detection
- Bot detection
- Dynamics of epidemic spreading (dynamic)
- Cancer/tumor detection
- Tissue/organ detection
- Evolution of influence (dynamic)
- Astroturfing
- Customer segmentation
- Recommendation systems
- Social network analysis (both)
- Network summarization
- Privary, group segmentation
- Link prediction (both)
- Community evolution prediction (dynamic, hot field)
<br>
<br>
## References
- [Application Areas of Community Detection: A Review : PAPER](https://ieeexplore.ieee.org/document/8625349)
This paper discusses a GPU implementation of the Louvain community detection algorithm. Louvain algorithm obtains hierachical communities as a dendrogram through modularity optimization. Given an undirected weighted graph, all vertices are first considered to be their own communities. In the first phase, each vertex greedily decides to move to the community of one of its neighbours which gives greatest increase in modularity. If moving to no neighbour's community leads to an increase in modularity, the vertex chooses to stay with its own community. This is done sequentially for all the vertices. If the total change in modularity is more than a certain threshold, this phase is repeated. Once this local moving phase is complete, all vertices have formed their first hierarchy of communities. The next phase is called the aggregation phase, where all the vertices belonging to a community are collapsed into a single super-vertex, such that edges between communities are represented as edges between respective super-vertices (edge weights are combined), and edges within each community are represented as self-loops in respective super-vertices (again, edge weights are combined). Together, the local moving and the aggregation phases constitute a stage. This super-vertex graph is then used as input fof the next stage. This process continues until the increase in modularity is below a certain threshold. As a result from each stage, we have a hierarchy of community memberships for each vertex as a dendrogram.
Approaches to perform the Louvain algorithm can be divided into coarse-grained and fine-grained. Coarse-grained approaches process a set of vertices in parallel, while fine-grained approaches process all vertices in parallel. A coarse-grained hybrid-GPU algorithm using multi GPUs has be implemented by Cheong et al. which grabbed my attention. In addition, their algorithm does not use hashing for the local moving phase, but instead sorts each neighbour list based on the community id of each vertex.
https://gist.github.com/wolfram77/7e72c9b8c18c18ab908ae76262099329
Survey for extra-child-process package : NOTESSubhajit Sahu
Useful additions to inbuilt child_process module.
📦 Node.js, 📜 Files, 📰 Docs.
Please see attached PDF for literature survey.
https://gist.github.com/wolfram77/d936da570d7bf73f95d1513d4368573e
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTERSubhajit Sahu
For the PhD forum an abstract submission is required by 10th May, and poster by 15th May. The event is on 30th May.
https://gist.github.com/wolfram77/692d263f463fd49be6eb5aa65dd4d0f9
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up...Subhajit Sahu
For the PhD forum an abstract submission is required by 10th May, and poster by 15th May. The event is on 30th May.
https://gist.github.com/wolfram77/1c1f730d20b51e0d2c6d477fd3713024
Fast Incremental Community Detection on Dynamic Graphs : NOTESSubhajit Sahu
In this paper, the authors describe two approaches for dynamic community detection using the CNM algorithm. CNM is a hierarchical, agglomerative algorithm that greedily maximizes modularity. They define two approaches: BasicDyn and FastDyn. BasicDyn backtracks merges of communities until each marked (changed) vertex is its own singleton community. FastDyn undoes a merge only if the quality of merge, as measured by the induced change in modularity, has significantly decreased compared to when the merge initially took place. FastDyn also allows more than two vertices to contract together if in the previous time step these vertices eventually ended up contracted in the same community. In the static case, merging several vertices together in one contraction phase could lead to deteriorating results. FastDyn is able to do this, however, because it uses information from the merges of the previous time step. Intuitively, merges that previously occurred are more likely to be acceptable later.
https://gist.github.com/wolfram77/1856b108334cc822cdddfdfa7334792a
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...Amil baba
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
MATHEMATICS BRIDGE COURSE (TEN DAYS PLANNER) (FOR CLASS XI STUDENTS GOING TO ...PinkySharma900491
Class khatm kaam kaam karne kk kabhi uske kk innings evening karni nnod ennu Tak add djdhejs a Nissan s isme sniff kaam GCC bagg GB g ghan HD smart karmathtaa Niven ken many bhej kaam karne Nissan kaam kaam Karo kaam lal mam cell pal xoxo
2. 2
Parallel Reduction
Common and important data parallel primitive
Easy to implement in CUDA
Harder to get it right
Serves as a great optimization example
We’ll walk step by step through 7 different versions
Demonstrates several important optimization strategies
3. 3
Parallel Reduction
Tree-based approach used within each thread block
Need to be able to use multiple thread blocks
To process very large arrays
To keep all multiprocessors on the GPU busy
Each thread block reduces a portion of the array
But how do we communicate partial results between
thread blocks?
4 7 5 9
11 14
25
3 1 7 0 4 1 6 3
4. 4
Problem: Global Synchronization
If we could synchronize across all thread blocks, could easily
reduce very large arrays, right?
Global sync after each block produces its result
Once all blocks reach sync, continue recursively
But CUDA has no global synchronization. Why?
Expensive to build in hardware for GPUs with high processor
count
Would force programmer to run fewer blocks (no more than #
multiprocessors * # resident blocks / multiprocessor) to avoid
deadlock, which may reduce overall efficiency
Solution: decompose into multiple kernels
Kernel launch serves as a global synchronization point
Kernel launch has negligible HW overhead, low SW overhead
6. 6
What is Our Optimization Goal?
We should strive to reach GPU peak performance
Choose the right metric:
GFLOP/s: for compute-bound kernels
Bandwidth: for memory-bound kernels
Reductions have very low arithmetic intensity
1 flop per element loaded (bandwidth-optimal)
Therefore we should strive for peak bandwidth
Will use G80 GPU for this example
384-bit memory interface, 900 MHz DDR
384 * 1800 / 8 = 86.4 GB/s
7. 7
Reduction #1: Interleaved Addressing
__global__ void reduce0(int *g_idata, int *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
// do reduction in shared mem
for(unsigned int s=1; s < blockDim.x; s *= 2) {
if (tid % (2*s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
9. 9
Reduction #1: Interleaved Addressing
__global__ void reduce1(int *g_idata, int *g_odata) {
extern __shared__ int sdata[];
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
// do reduction in shared mem
for (unsigned int s=1; s < blockDim.x; s *= 2) {
if (tid % (2*s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
Problem: highly divergent
warps are very inefficient, and
% operator is very slow
10. 10
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Note: Block Size = 128 threads for all tests
Bandwidth
Time (222 ints)
11. 11
for (unsigned int s=1; s < blockDim.x; s *= 2) {
if (tid % (2*s) == 0) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
for (unsigned int s=1; s < blockDim.x; s *= 2) {
int index = 2 * s * tid;
if (index < blockDim.x) {
sdata[index] += sdata[index + s];
}
__syncthreads();
}
Reduction #2: Interleaved Addressing
Just replace divergent branch in inner loop:
With strided index and non-divergent branch:
13. 13
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
15. 15
for (unsigned int s=1; s < blockDim.x; s *= 2) {
int index = 2 * s * tid;
if (index < blockDim.x) {
sdata[index] += sdata[index + s];
}
__syncthreads();
}
for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
Reduction #3: Sequential Addressing
Just replace strided indexing in inner loop:
With reversed loop and threadID-based indexing:
// Already using this
16. 16
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Kernel 3:
sequential addressing
1.722 ms 9.741 GB/s 2.01x 4.68x
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
Insert text here
17. 17
for (unsigned int s=blockDim.x/2; s>0; s>>=1) {
if (tid < s) {
sdata[tid] += sdata[tid + s];
}
__syncthreads();
}
Idle Threads
Problem:
Half of the threads are idle on first loop iteration!
This is wasteful…
18. 18
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();
Reduction #4: First Add During Load
Halve the number of blocks, and replace single load:
With two loads and first add of the reduction:
19. 19
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Kernel 3:
sequential addressing
1.722 ms 9.741 GB/s 2.01x 4.68x
Kernel 4:
first add during global load
0.965 ms 17.377 GB/s 1.78x 8.34x
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
20. 20
Instruction Bottleneck
At 17 GB/s, we’re far from bandwidth bound
And we know reduction has low arithmetic intensity
Therefore a likely bottleneck is instruction overhead
Ancillary instructions that are not loads, stores, or
arithmetic for the core computation
In other words: address arithmetic and loop overhead
Strategy: unroll loops
21. 21
Unrolling the Last Warp
As reduction proceeds, # “active” threads decreases
When s <= 32, we have only one warp left
Instructions are SIMD synchronous within a warp
That means when s <= 32:
We don’t need to __syncthreads()
We don’t need “if (tid < s)” because it doesn’t save any
work
Let’s unroll the last 6 iterations of the inner loop
22. __device__ void warpReduce(volatile int* sdata, int tid) {
sdata[tid] += sdata[tid + 32];
sdata[tid] += sdata[tid + 16];
sdata[tid] += sdata[tid + 8];
sdata[tid] += sdata[tid + 4];
sdata[tid] += sdata[tid + 2];
sdata[tid] += sdata[tid + 1];
}
// later…
for (unsigned int s=blockDim.x/2; s>32; s>>=1) {
if (tid < s)
sdata[tid] += sdata[tid + s];
__syncthreads();
}
if (tid < 32) warpReduce(sdata, tid);
22
Reduction #5: Unroll the Last Warp
Note: This saves useless work in all warps, not just the last one!
Without unrolling, all warps execute every iteration of the for loop and if statement
IMPORTANT:
For this to be correct,
we must use the
“volatile” keyword!
23. 23
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Kernel 3:
sequential addressing
1.722 ms 9.741 GB/s 2.01x 4.68x
Kernel 4:
first add during global load
0.965 ms 17.377 GB/s 1.78x 8.34x
Kernel 5:
unroll last warp
0.536 ms 31.289 GB/s 1.8x 15.01x
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
24. 24
Complete Unrolling
If we knew the number of iterations at compile time,
we could completely unroll the reduction
Luckily, the block size is limited by the GPU to 512 threads
Also, we are sticking to power-of-2 block sizes
So we can easily unroll for a fixed block size
But we need to be generic – how can we unroll for block
sizes that we don’t know at compile time?
Templates to the rescue!
CUDA supports C++ template parameters on device and
host functions
25. 25
Unrolling with Templates
Specify block size as a function template parameter:
template <unsigned int blockSize>
__global__ void reduce5(int *g_idata, int *g_odata)
26. 26
Reduction #6: Completely Unrolled
if (blockSize >= 512) {
if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256) {
if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128) {
if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
if (tid < 32) warpReduce<blockSize>(sdata, tid);
Note: all code in RED will be evaluated at compile time.
Results in a very efficient inner loop!
Template <unsigned int blockSize>
__device__ void warpReduce(volatile int* sdata, int tid) {
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
27. 27
Invoking Template Kernels
Don’t we still need block size at compile time?
Nope, just a switch statement for 10 possible block sizes:
switch (threads)
{
case 512:
reduce5<512><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 256:
reduce5<256><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 128:
reduce5<128><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 64:
reduce5< 64><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 32:
reduce5< 32><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 16:
reduce5< 16><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 8:
reduce5< 8><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 4:
reduce5< 4><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 2:
reduce5< 2><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
case 1:
reduce5< 1><<< dimGrid, dimBlock, smemSize >>>(d_idata, d_odata); break;
}
28. 28
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Kernel 3:
sequential addressing
1.722 ms 9.741 GB/s 2.01x 4.68x
Kernel 4:
first add during global load
0.965 ms 17.377 GB/s 1.78x 8.34x
Kernel 5:
unroll last warp
0.536 ms 31.289 GB/s 1.8x 15.01x
Kernel 6:
completely unrolled
0.381 ms 43.996 GB/s 1.41x 21.16x
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
29. 29
Parallel Reduction Complexity
Log(N) parallel steps, each step S does N/2S
independent ops
Step Complexity is O(log N)
For N=2D, performs S[1..D]2D-S = N-1 operations
Work Complexity is O(N) – It is work-efficient
i.e. does not perform more operations than a sequential
algorithm
With P threads physically in parallel (P processors),
time complexity is O(N/P + log N)
Compare to O(N) for sequential reduction
In a thread block, N=P, so O(log N)
30. 30
What About Cost?
Cost of a parallel algorithm is processors time
complexity
Allocate threads instead of processors: O(N) threads
Time complexity is O(log N), so cost is O(N log N) : not
cost efficient!
Brent’s theorem suggests O(N/log N) threads
Each thread does O(log N) sequential work
Then all O(N/log N) threads cooperate for O(log N) steps
Cost = O((N/log N) * log N) = O(N) cost efficient
Sometimes called algorithm cascading
Can lead to significant speedups in practice
31. 31
Algorithm Cascading
Combine sequential and parallel reduction
Each thread loads and sums multiple elements into
shared memory
Tree-based reduction in shared memory
Brent’s theorem says each thread should sum
O(log n) elements
i.e. 1024 or 2048 elements per block vs. 256
In my experience, beneficial to push it even further
Possibly better latency hiding with more work per thread
More threads per block reduces levels in tree of recursive
kernel invocations
High kernel launch overhead in last levels with few blocks
On G80, best perf with 64-256 blocks of 128 threads
1024-4096 elements per thread
32. 32
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();
Reduction #7: Multiple Adds / Thread
Replace load and add of two elements:
With a while loop to add as many as necessary:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
sdata[tid] += g_idata[i] + g_idata[i+blockSize];
i += gridSize;
}
__syncthreads();
33. 33
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
__syncthreads();
Reduction #7: Multiple Adds / Thread
Replace load and add of two elements:
With a while loop to add as many as necessary:
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) {
sdata[tid] += g_idata[i] + g_idata[i+blockSize];
i += gridSize;
}
__syncthreads();
Note: gridSize loop stride
to maintain coalescing!
34. 34
Performance for 4M element reduction
Kernel 1:
interleaved addressing
with divergent branching
8.054 ms 2.083 GB/s
Kernel 2:
interleaved addressing
with bank conflicts
3.456 ms 4.854 GB/s 2.33x 2.33x
Kernel 3:
sequential addressing
1.722 ms 9.741 GB/s 2.01x 4.68x
Kernel 4:
first add during global load
0.965 ms 17.377 GB/s 1.78x 8.34x
Kernel 5:
unroll last warp
0.536 ms 31.289 GB/s 1.8x 15.01x
Kernel 6:
completely unrolled
0.381 ms 43.996 GB/s 1.41x 21.16x
Kernel 7:
multiple elements per thread
0.268 ms 62.671 GB/s 1.42x 30.04x
Kernel 7 on 32M elements: 73 GB/s!
Step
Speedup
Bandwidth
Time (222 ints)
Cumulative
Speedup
35. 35
template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}
template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
extern __shared__ int sdata[];
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*(blockSize*2) + tid;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }
__syncthreads();
if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }
if (tid < 32) warpReduce(sdata, tid);
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
Final Optimized Kernel
// I guess for global memory, 2 loads in 1 loop good enough
// For shared memory, better load as mush as possible at once (near instruction bottleneck)
38. 38
Conclusion
Understand CUDA performance characteristics
Memory coalescing
Divergent branching
Bank conflicts
Latency hiding
Use peak performance metrics to guide optimization
Understand parallel algorithm complexity theory
Know how to identify type of bottleneck
e.g. memory, core computation, or instruction overhead
Optimize your algorithm, then unroll loops
Use template parameters to generate optimal code
Questions: mharris@nvidia.com