This document is a slide deck on parallel and high performance computing. It begins with definitions of key terms: parallel computing, high performance computing, asymptotic notation, speedup, work and time optimality, latency, bandwidth, and concurrency. It then covers parallel architecture and programming models, including SIMD, MIMD, shared and distributed memory, data and task parallelism, and synchronization methods. Worked examples of parallel sorting and prefix sums (scans) are provided. Programming tools such as OpenMP, PPL, PLINQ, and MPI are also summarized, along with scheduling techniques like work stealing.
4. "Parallel and High Performance"?
"Parallel computing is a form of computation in which many calculations are carried out simultaneously." (G.S. Almasi and A. Gottlieb, Highly Parallel Computing, Benjamin/Cummings, 1994)
A High Performance (Super) Computer is:
- One of the 500 fastest computers as measured by HPL, the High Performance Linpack benchmark
- A computer that costs 200,000,000 rubles or more
- Necessarily parallel, at least since the 1970s
5. Recent Developments
- For 20 years, parallel and high performance computing have been the same subject
- Parallel computing is now mainstream: it reaches well beyond HPC into client systems such as desktops, laptops, and mobile phones
- HPC software once had to stand alone; now it can be based on parallel PC software
- The result: better tools and new possibilities
6. The Emergence of the Parallel Client
- Uniprocessor performance is leveling off:
  - Instruction-level parallelism nears a limit (ILP Wall)
  - Power is getting painfully high (Power Wall)
  - Caches show diminishing returns (Memory Wall)
- Meanwhile, logic density continues to grow (Moore's Law)
- So uniprocessors will collapse in area and cost, and cores per chip need to increase exponentially
- We must all learn to write parallel programs so that new "killer apps" will enjoy more speed
7. The ILP Wall
- Instruction-level parallelism preserves the serial programming model while getting speed from "undercover" parallelism
- For example, see HPS†: out-of-order issue, in-order retirement, register renaming, branch prediction, speculation, ...
- At best, we get a few instructions per clock
† Y.N. Patt et al., "Critical Issues Regarding HPS, a High Performance Microarchitecture," Proc. 18th Ann. ACM/IEEE Int'l Symp. on Microarchitecture, 1985, pp. 109-116.
8. The Power Wall
- In the old days, power was kept roughly constant:
  - Dynamic power, equal to CV²f, dominated
  - Every shrink of 0.7 in feature size halved transistor area
  - Capacitance C and voltage V also decreased by 0.7, so even with the clock frequency f increased by 1.4, power per transistor was cut in half
- Now, shrinking no longer reduces V very much, so even at constant frequency, power density doubles
- Static (leakage) power is also getting worse
- Simpler, slower processors are more efficient, and to conserve power we can turn some of them off
9. The Memory Wall
- We can get bigger caches from more transistors; does this suffice, or is there a problem scaling up?
- To speed up 2x without changing bandwidth below the cache, the miss rate must be halved
- How much bigger does the cache have to be?†
  - For dense matrix multiply or dense LU: 4x bigger
  - For sorting or FFTs: the square of its former size
  - For sparse or dense matrix-vector multiply: impossible
- Deeper interconnects increase miss latency, so latency tolerance needs memory access parallelism
† H.T. Kung, "Memory requirements for balanced computer architectures," 13th International Symposium on Computer Architecture, 1986, pp. 49-54.
10. Overcoming the Memory Wall
- Provide more memory bandwidth:
  - Increase DRAM I/O bandwidth per gigabyte
  - Increase microprocessor off-chip bandwidth
- Use architecture to tolerate memory latency:
  - More latency means more threads or longer vectors
  - No change in programming model is needed
- Use caches for bandwidth as well as latency:
  - Let compilers control locality
  - Keep cache lines short
  - Avoid mis-speculation
11. The End of the von Neumann Model
- "Instructions are executed one at a time..."
- We have relied on this idea for 60 years; now it (and the things it brought) must change
- Serial programming is easier than parallel programming, at least for the moment
- But serial programs are now slow programs
- We need parallel programming paradigms that will make all programmers successful
- The stakes for our field's vitality are high: computing must be reinvented
13. Asymptotic Notation
- Quantities are often meaningful only within a constant factor (algorithm performance analyses, for example)
- f(n) = O(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≤ |cg(n)|
- f(n) = Ω(g(n)) means there exist constants c and n0 such that n ≥ n0 implies |f(n)| ≥ |cg(n)|
- f(n) = Θ(g(n)) means both f(n) = O(g(n)) and f(n) = Ω(g(n))
14. Speedup, Time, and Work
- The speedup of a computation is how much faster it runs in parallel compared to serially
- If one processor takes T1 and p of them take Tp, then the p-processor speedup is Sp = T1/Tp
- The work done is the number of operations performed, either serially or in parallel
- W1 = O(T1) is the serial work, Wp the parallel work
- We say a parallel computation is work-optimal if Wp = O(W1) = O(T1)
- We say a parallel computation is time-optimal if Tp = O(W1/p) = O(T1/p)
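A hypothetical worked example of these definitions (the numbers are illustrative, not from the slides): suppose a program takes T1 = 100 s on one processor and T8 = 20 s on eight. Then

\[
S_8 = \frac{T_1}{T_8} = \frac{100\,\mathrm{s}}{20\,\mathrm{s}} = 5 .
\]

A time-optimal run on 8 processors would take O(T1/8), i.e. about 12.5 s up to constant factors; the gap between 20 s and 12.5 s is parallel overhead such as synchronization and load imbalance.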
15. Latency, Bandwidth, & Concurrency
- In any system that moves items from input to output without creating or destroying them, latency × bandwidth = concurrency
- Queueing theory calls this result Little's law
- The slide's figure illustrates it with latency = 3, bandwidth = 2, and concurrency = 6
17. Parallel Processor Architecture
- SIMD: each instruction operates concurrently on multiple data items
- MIMD: multiple instruction sequences execute concurrently
- Concurrency is expressible in space or time:
  - Spatial: the hardware is replicated
  - Temporal: the hardware is pipelined
18. Trends in Parallel Processors
- Today's chips are spatial MIMD at the top level, to get enough performance even in PCs
- Temporal MIMD is also used
- SIMD is tending back toward spatial
- Intel's Larrabee combines all three
- Temporal concurrency is easily "adjusted": vector length or number of hardware contexts
- Temporal concurrency tolerates latency: memory latency in the SIMD case; for MIMD, branches and synchronization also
19. Parallel Memory Architecture
- A shared memory system is one in which any processor can address any memory location
  - Quality of access can be either uniform (UMA) or nonuniform (NUMA), in latency and/or bandwidth
- A distributed memory system is one in which processors can't address most of memory
  - The disjoint memory regions and their associated processors are usually called nodes
- A cluster is a distributed memory system with more than one processor per node
- Nearly all HPC systems are clusters
20. Parallel Programming Variations
- Data parallelism and task parallelism
- Functional style and imperative style
- Shared memory and message passing
- ...and more we won't have time to look at
- A parallel application may use all of them
21. Data Parallelism and Task Parallelism
- A computation is data parallel when similar independent sub-computations are done simultaneously on multiple data items
  - Applying the same function to every element of a data sequence, for example
- A computation is task parallel when dissimilar independent sub-computations are done simultaneously
  - Controlling the motions of a robot, for example
- It sounds like SIMD vs. MIMD, but isn't quite: some kinds of data parallelism need MIMD
22. Functional and Imperative Programs
- A program is said to be written in (pure) functional style if it has no mutable state: computing = naming and evaluating expressions
- Programs with mutable state are usually called imperative because the state changes must be done when and where specified:

    while (z < x) { x = y; y = z; z = f(x, y); }
    return y;

- Often, programs can be written either way; the loop above is equivalent to:

    let w(x, y, z) = if (z < x) then w(y, z, f(y, z)) else y;
23. Shared Memory and Message Passing
- Shared memory programs access data in a shared address space
  - When to access the data is the big issue; subcomputations therefore must synchronize
- Message passing programs transmit data between subcomputations
  - The sender computes a value and then sends it; the receiver receives a value and then uses it
  - Synchronization can be built in to communication
- Message passing can be implemented very well on shared memory architectures
24. Barrier Synchronization
- A barrier synchronizes multiple parallel sub-computations by letting none proceed until all have arrived
- It is named after the barrier used to start horse races
- It guarantees everything before the barrier finishes before anything after it begins
- It is a central feature in several data-parallel languages such as OpenMP
25. Mutual Exclusion
- This type of synchronization ensures only one subcomputation can do a thing at any time
  - If the thing is a code block, it is a critical section
- It classically uses a lock: a data structure with which subcomputations can stop and start
- Basic operations on a lock object L might be:
  - Acquire(L): blocks until other subcomputations are finished with L, then acquires exclusive ownership
  - Release(L): yields L and unblocks some Acquire(L)
- A lot has been written on these subjects
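A minimal sketch of the Acquire/Release pattern just described, using standard C++ primitives (this example is not from the slides):

    #include <mutex>
    #include <thread>

    std::mutex L;          // the lock object
    long counter = 0;      // shared state protected by L

    void worker() {
        for (int i = 0; i < 100000; i++) {
            L.lock();      // Acquire(L): blocks until we own L exclusively
            counter++;     // the critical section
            L.unlock();    // Release(L): unblocks some waiting Acquire(L)
        }
    }

    int main() {
        std::thread t1(worker), t2(worker);
        t1.join();
        t2.join();
        // counter is exactly 200000: mutual exclusion prevented lost updates
    }

In idiomatic C++ one would wrap the lock in std::lock_guard so Release happens automatically, but the explicit lock()/unlock() calls mirror the slide's Acquire/Release vocabulary.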
26. Non-Blocking Synchronization
- The basic idea is to achieve mutual exclusion using memory read-modify-write operations
- Most commonly used is compare-and-swap: CAS(addr, old, new) reads memory at addr and, if it contains old, replaces old with new
- Arbitrary update operations at an addr require that {read old; compute new; CAS(addr, old, new);} be repeated until the CAS operation succeeds
- If there is significant updating contention at addr, the repeated computation of new may be wasteful
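A minimal sketch of that retry loop with C++ atomics (an illustration, not the slides' code); compare_exchange_weak plays the role of CAS(addr, old, new):

    #include <atomic>

    std::atomic<int> value{0};

    // Apply an arbitrary update to 'value' without blocking.
    void update() {
        int old = value.load();
        int desired;
        do {
            desired = old * 2 + 1;   // compute new from old (any pure function works)
        } while (!value.compare_exchange_weak(old, desired));
        // On failure, compare_exchange_weak reloads 'old' with the current
        // contents of 'value', so we recompute 'desired' and retry;
        // this is exactly the wasted recomputation the slide warns about
        // under heavy contention.
    }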
27. Load Balancing
- Some processors may be busier than others
- To balance the workload, subcomputations can be scheduled on processors dynamically
  - A technique for parallel loops is self-scheduling: processors repetitively grab chunks of iterations (sketched below)
  - In guided self-scheduling, the chunk sizes shrink
- Analogous imbalances can occur in memory
  - Overloaded memory locations are called hot spots
  - Parallel algorithms and data structures must be designed to avoid them
- Imbalanced messaging is sometimes seen
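A minimal sketch of self-scheduling, assuming a shared atomic counter (my illustration, not the slides'): each thread repeatedly grabs the next fixed-size chunk of iterations until the loop is exhausted.

    #include <atomic>
    #include <algorithm>

    std::atomic<int> next{0};     // index of the first unclaimed iteration
    const int N = 1000000;        // total loop iterations
    const int CHUNK = 1024;       // fixed chunk size (guided self-scheduling would shrink it)

    void self_scheduled_worker() {
        for (;;) {
            int begin = next.fetch_add(CHUNK);   // atomically grab the next chunk
            if (begin >= N) break;               // nothing left to do
            int end = std::min(begin + CHUNK, N);
            for (int i = begin; i < end; i++) {
                // body of the parallel loop goes here
            }
        }
    }

Busy threads naturally claim more chunks than slow ones, which is how the workload balances itself.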
29. A Data Parallel Example: Sorting

    void sort(int *src, int *dst, int size, int nvals) {
        int i, j, t1[nvals], t2[nvals];
        for (j = 0; j < nvals; j++) { t1[j] = 0; }
        for (i = 0; i < size; i++) { t1[src[i]]++; }
        // t1[] now contains a histogram of the values
        t2[0] = 0;
        for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }
        // t2[j] now contains the origin for value j
        for (i = 0; i < size; i++) { dst[t2[src[i]]++] = src[i]; }
    }
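A small driver for this routine (hypothetical, for illustration); note that every value in src[] must lie in [0, nvals):

    #include <stdio.h>

    int main(void) {
        int src[] = {3, 1, 4, 1, 5, 9, 2, 6};
        int dst[8];
        sort(src, dst, 8, 10);    // 8 elements, values in [0, 10)
        for (int i = 0; i < 8; i++) printf("%d ", dst[i]);   // prints: 1 1 2 3 4 5 6 9
        return 0;
    }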
30. When Is a Loop Parallelizable?
- The loop instances must safely interleave
  - One way to do this is to only read the data
  - Another way is to isolate data accesses
- Look at the first loop:

    for (j = 0; j < nvals; j++) { t1[j] = 0; }

- The accesses to t1[] are isolated from each other, so this loop can run in parallel "as is"
31. Isolating Data Updates
- The second loop seems to have a problem:

    for (i = 0; i < size; i++) { t1[src[i]]++; }

- Two iterations may access the same t1[src[i]]; if both reads precede both increments, oops!
- A few ways to isolate the iteration conflicts (see the sketch below):
  - Use an "isolated update" (lock prefix) instruction
  - Use an array of locks, perhaps as big as t1[]
  - Use non-blocking updates
  - Use a transaction
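The first of those options can be expressed with an OpenMP atomic directive, which on x86 typically compiles to a lock-prefixed instruction; a minimal sketch (mine, not the slides'):

    void histogram(const int *src, int *t1, int size, int nvals) {
        for (int j = 0; j < nvals; j++) { t1[j] = 0; }
        #pragma omp parallel for
        for (int i = 0; i < size; i++) {
            #pragma omp atomic
            t1[src[i]]++;      // the read-modify-write is now indivisible
        }
    }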
32. Dependent Loop Iterations
- The third loop is an interesting challenge:

    for (j = 1; j < nvals; j++) { t2[j] = t2[j-1] + t1[j-1]; }

- Each iteration depends on the previous one
- This loop is an example of a prefix computation
  - If • is an associative binary operation on a set S, the •-prefixes of the sequence x0, x1, x2, ... of values from S are x0, x0•x1, x0•x1•x2, ...
- Prefix computations are often known as scans
- Scan can be done efficiently in parallel
33. Cyclic Reduction
- Each vertical line represents a loop iteration; the associated sequence element is to its right
- On step k of the scan, iteration j prefixes its own value with the value from iteration j - 2^k

    a   ab   bc   cd    de     ef      fg      (step 0: distance 1)
    a   ab   abc  abcd  bcde   cdef    defg    (step 1: distance 2)
    a   ab   abc  abcd  abcde  abcdef  abcdefg (step 2: distance 4)

  starting from the input sequence a, b, c, d, e, f, g.
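A minimal C++ sketch of these cyclic-reduction steps (my illustration, not the slides'), written as data-parallel rounds over a copy so that every iteration reads step k's values while writing step k+1's:

    #include <vector>

    // In-place inclusive scan (prefix sums) by cyclic reduction.
    // Performs ceil(log2(n)) rounds; total work is O(n log n),
    // but each round's inner loop is fully parallelizable.
    void scan(std::vector<int> &x) {
        int n = (int)x.size();
        for (int d = 1; d < n; d *= 2) {        // d = 2^k
            std::vector<int> prev = x;          // values from step k
            for (int j = d; j < n; j++) {       // could be a parallel loop
                x[j] = prev[j - d] + prev[j];   // prefix with the value from iteration j - 2^k
            }
        }
    }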
34. Applications of Scan
- Linear recurrences like the third loop
- Polynomial evaluation
- String comparison
- High-precision addition
- Finite automata: each xi is the next-state function given the i-th input symbol, and • is function composition
- APL compress
- When only the final value is needed, the computation is called a reduction instead; it's a little bit cheaper than a full scan
35. More Iterations n Than Processors p
- Wp = 3n + O(p log p), Tp = 3n/p + O(log p)
- (The slide's figure, omitted here, illustrates how the scan is blocked across the p processors to achieve these bounds)
36. OpenMP
- OpenMP is a widely-implemented extension to C++ and Fortran for data† parallelism
- It adds directives to serial programs
- A few of the more important directives:

    #pragma omp parallel for <modifiers>
    <for loop>

    #pragma omp atomic
    <binary op=, ++ or -- statement>

    #pragma omp critical <name>
    <structured block>

    #pragma omp barrier

† And perhaps task parallelism soon
37. The Sorting Example in OpenMP
- Only the third "scan" loop is a problem
- We can at least do this loop "manually":

    nt = omp_get_num_threads();
    int ta[nt], tb[nt];
    #pragma omp parallel for
    for (myt = 0; myt < nt; myt++) {
        // Set ta[myt] = local sum of nvals/nt elements of t1[]
        #pragma omp barrier
        for (k = 1; k <= myt; k *= 2) {
            tb[myt] = ta[myt];
            ta[myt] += tb[myt - k];
            #pragma omp barrier
        }
        fix = (myt > 0) ? ta[myt - 1] : 0;
        // Set nvals/nt elements of t2[] to fix + local scan of t1[]
    }
38. Parallel Patterns Library (PPL)
- PPL is a Microsoft C++ library built on top of the ConcRT user-mode scheduling runtime
- It supports mixed data- and task-parallelism: parallel_for, parallel_for_each, parallel_invoke, agent, send, receive, choice, join, task_group
- Parallel loops use C++ lambda expressions:

    parallel_for(0, nvals, [&t1](int j) { t1[j] = 0; });

- Updates can be isolated using intrinsic functions:

    (void)_InterlockedIncrement(&t1[src[i]]);

- Microsoft and Intel plan to unify PPL and TBB
39. Dynamic Resource Management
- PPL programs are written for an arbitrary number of processors, which could be just one
- Load balancing is mostly done by work stealing
- There are two kinds of work to steal:
  - Work that is unblocked and waiting for a processor
  - Work that is not yet started and is potentially parallel
- Work of the latter kind will be done serially unless it is first stolen by another processor
  - This makes recursive divide and conquer easy: there is no concern about when to stop parallelism
40. A Quicksort Example

    void quicksort(vector<int>::iterator first, vector<int>::iterator last) {
        if (last - first < 2) { return; }
        int pivot = *first;
        auto mid1 = partition(first, last, [=](int e) { return e < pivot; });
        auto mid2 = partition(mid1, last, [=](int e) { return e == pivot; });
        parallel_invoke(
            [=] { quicksort(first, mid1); },
            [=] { quicksort(mid2, last); }
        );
    }
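A hypothetical call site, assuming <ppl.h>, <vector>, and <algorithm> are included and the concurrency and std namespaces are in scope:

    vector<int> v = {5, 3, 8, 1, 9, 2};
    quicksort(v.begin(), v.end());   // sorts v in place; the two recursive
                                     // calls may be stolen and run in parallel

Because parallel_invoke exposes the recursion as potentially parallel work, idle processors steal subproblems while busy ones proceed serially, so no explicit cutoff is needed.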
41. LINQ and PLINQ
- LINQ (Language Integrated Query) extends the .NET languages C#, Visual Basic, and F#
  - A LINQ query is really just a functional monad
  - It queries databases, XML, or any IEnumerable
- PLINQ is a parallel implementation of LINQ
  - Non-isolated functions must be avoided
  - Otherwise it is hard to tell the two apart
42. A LINQ Example

    var q = from n in names
            where n.Name == queryInfo.Name &&
                  n.State == queryInfo.State &&
                  n.Year >= yearStart && n.Year <= yearEnd
            orderby n.Year ascending
            select n;

- The slide's PLINQ callout: changing the data source to names.AsParallel() turns this into a parallel PLINQ query
43. Message Passing Interface (MPI)
- MPI is a widely used message passing library for distributed memory HPC systems
- Some of its basic functions: MPI_Init, MPI_Comm_rank, MPI_Comm_size, MPI_Send, MPI_Recv
- A few of its "collective communication" functions: MPI_Reduce, MPI_Allreduce, MPI_Scan, MPI_Exscan, MPI_Barrier, MPI_Gather, MPI_Allgather, MPI_Alltoall
44. Sorting in MPI
- Roughly, it could work like this on n nodes:
  - Run the first two loops locally
  - Use MPI_Allreduce to build a global histogram (see the sketch below)
  - Run the third loop (redundantly) at every node
  - Allocate n value intervals to nodes (redundantly), balancing the data per node as well as possible
  - Run the fourth loop using the local histogram
  - Use MPI_Alltoall to redistribute the data
  - Merge the n sorted subarrays on each node
- Collective communication is expensive, but sorting needs it (see the Memory Wall slide)
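A minimal sketch (mine, not the slides') of the MPI_Allreduce step: each node contributes its local histogram, and every node receives the elementwise global sum.

    #include <mpi.h>

    // local[] holds this node's histogram of its share of the data.
    // After the call, global[] on every node holds the sum of all
    // nodes' local histograms, so the scan can be run redundantly.
    void global_histogram(const int *local, int *global, int nvals) {
        MPI_Allreduce(local, global, nvals, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    }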
45. Another Way to Sort in MPI
- The Samplesort algorithm is like Quicksort
- It works like this on n nodes:
  - Sort the local data on each node independently
  - Take s samples of the sorted data on each node
  - Use MPI_Allgather to send all nodes all samples
  - Compute n - 1 splitters (redundantly) on all nodes, balancing the data per node as well as possible
  - Use MPI_Alltoall to redistribute the data
  - Merge the n sorted subarrays on each node
47. Parallel Computing Has Arrived
- We must rethink how we write programs, and we are definitely doing that
- Other things will also need to change:
  - Architecture
  - Operating systems
  - Algorithms
  - Theory
  - Application software
- We are seeing the biggest revolution in computing since its very beginnings