This chapter discusses the various classifications of parallel architectures. It also introduces the related parallel programming models and shows how these models map onto parallel architectures. Notions covered include data parallelism, task parallelism, tightly and loosely coupled systems, UMA/NUMA, multicore computing, symmetric multiprocessing, distributed computing, cluster computing, and shared memory programming with and without threads.
2. CHAPTER 2
Parallel and Distributed Computer
Architectures, Performance Metrics
And Parallel Programming Models
Previous … Chap 1: General Introduction (Parallel and Distributed Computing)
3. CONTENTS
• INTRODUCTION
• Why Parallel Architecture?
• Modern Classification of Parallel Computers
• Structural Classification of Parallel Computers
• Parallel Computers Memory Architectures
• Hardware Classification
• Performance of Parallel Computer Architectures
- Peak and Sustained Performance
• Measuring Performance of Parallel Computers
• Other Common Benchmarks
• Parallel Programming Models
- Shared Memory Programming Model
- Thread Model
- Distributed Memory
- Data Parallel
- SPMD/MPMD
• Conclusion
Exercises (Check Your Progress, Further Reading and Evaluation)
4. Previously on Chap 1
Part 1- Introducing Parallel and Distributed Computing
• Background Review of Parallel and Distributed Computing
• INTRODUCTION TO PARALLEL AND DISTRIBUTED COMPUTING
• Some key terminologies
• Why parallel Computing?
• Parallel Computing: the Facts
• Basic Design Computer Architecture: the von Neumann Architecture
• Classification of Parallel Computers (SISD,SIMD,MISD,MIMD)
• Assignment 1a
Part 2- Initiation to Parallel Programming Principles
• High Performance Computing (HPC)
• Speed: a need to solve Complexity
• Some Case Studies Showing the need of Parallel Computing
• Challenge of explicit Parallelism
• General Structure of Parallel Programs
• Introduction to Amdahl's LAW
• The GUSTAFSON’s LAW
• SCALABILITY
• Fixed Size Versus Scale Size
• Assignment 1b
• Conclusion
5. INTRODUCTION
• Parallel computer architecture is the discipline of organizing and maximizing computer resources to achieve maximum performance.
- Performance, at any instant in time, is achievable only within the limits given by the technology.
- The same system may be characterized both as "parallel" and as "distributed"; the processors in a typical distributed system run concurrently in parallel.
• Using more processors to compute tasks simultaneously contributes more capability to computer systems.
• In a parallel architecture, processors may have access to a shared memory during computation in order to exchange information.
Image source: Wikipedia, Distributed Computing, 2020
6. • In a distributed architecture, each processor makes use of its own private memory (distributed memory) during computation. In this case, information is exchanged by passing messages between the processors.
• Significant characteristics of distributed systems are: concurrency of components, lack of a global clock (clock synchronization), and independent failure of components.
• The use of distributed systems to solve computational problems is called Distributed Computing (divide the problem into many tasks; each task is handled by one or more computers, which communicate with each other via message passing).
• High-performance parallel computation on a shared-memory multiprocessor uses parallel algorithms, while the coordination of a large-scale distributed system uses distributed algorithms.
INTRODUCTION
Image source: Wikipedia, Distributed Computing, 2020
7. • Parallelism is nowadays present at all levels of computer architecture.
• Enhancements to processors account for much of the success in the development of parallelism.
• Today, processors are superscalar (they execute several instructions in parallel in each clock cycle).
- In addition, the advancement of the underlying Very Large-Scale Integration (VLSI) technology allows larger and larger numbers of components to fit on a chip and clock rates to increase.
• Three main elements define the structure and performance of a multiprocessor:
- Processors
- Memory hierarchies (registers, cache, main memory, magnetic discs, magnetic tapes)
- Interconnection network
• But the performance gap between processor and memory is still increasing.
• Parallelism is used by computer architecture to translate the raw potential of the technology into greater performance and expanded capability of the computer system.
• Diversity in parallel computer architecture makes the field challenging to learn and challenging to present.
INTRODUCTION (Cont…)
8. Remember that:
A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.
• The attempt to solve these large problems raises fundamental questions whose answers can only be found by understanding:
- The various components of parallel and distributed systems (design and operation),
- How large a problem a given parallel and distributed system can solve,
- How processors cooperate and communicate / transmit data between them,
- The primitive abstractions that the hardware and software provide to the programmer for better control,
- And how to ensure a proper translation into performance once these elements are under control.
INTRODUCTION (Cont…)
9. Why Parallel Architecture?
• No matter the performance of a single processor at a given time, we can in principle achieve higher performance by utilizing many such processors, as long as we are ready to pay the price (cost).
Parallel architecture is needed to:
Respond to application trends
• Advances in hardware capability enable new application functionality, which in turn drives parallel architecture harder, since parallel architecture focuses on the most demanding of these applications.
• At the low end we have the largest volume of machines and the greatest number of users; at the high end, the most demanding applications.
• Consequence: pressure for increased performance; the most demanding applications must be written as parallel programs to respond to the demand generated from the high end.
Satisfy the need for high-performance computing in the fields of computational science and engineering
- A response to the need to simulate physical phenomena that are impossible or very costly to observe through empirical means (modeling global climate change over long periods, the evolution of galaxies, the atomic structure of materials, etc.)
10. Respond to technology trends
• Can't "wait for the single processor to get fast enough".
Respond to architectural trends
• Advances in technology determine what is possible; architecture translates the potential of the technology into performance and capability.
• There have been four generations of computer architecture (tubes, transistors, integrated circuits, and VLSI), strongly distinguished by the type of parallelism implemented (bit-level parallelism: 4 bits to 64 bits; 128 bits is the future).
• There have been tremendous architectural advances over this period: bit-level parallelism, instruction-level parallelism, thread-level parallelism.
All these forces driving the development of parallel architectures can be summed up in one main quest: achieve absolute maximum performance (supercomputing).
Why Parallel Architecture? (Cont…)
11. Modern Classification (According to Sima, Fountain, Kacsuk)
Before the modern classification, recall Flynn's taxonomy of computers,
based on the number of instructions that can be executed and how they operate on data.
Four main types:
• SISD: traditional sequential architecture
• SIMD: processor arrays, vector processors
- Parallel computing on a budget: reduced control-unit cost
- Many early supercomputers
• MIMD: the most general-purpose parallel computers today
- Clusters, MPPs, data centers
• MISD: not a general-purpose architecture
Note: Globally, four types of parallelism are implemented:
- Bit-level parallelism: processor performance based on word size (bits)
- Instruction-level parallelism: gives processors the ability to execute more than one instruction per clock cycle
- Task parallelism: characterizes parallel programs
- Superword-level parallelism: based on vectorization techniques
(Figure: computer architectures: SISD, SIMD, MIMD, MISD)
12. • Classification here is based on how parallelism is achieved:
- by operating on multiple data: data parallelism
- by performing many functions in parallel: task parallelism (function parallelism)
- control parallelism or task parallelism, depending on the level of the functional parallelism.
Modern Classification (According to Sima, Fountain, Kacsuk)
Parallel architectures divide into data-parallel architectures and function-parallel architectures:
Data-parallel architectures
- The same operations are performed on different subsets of the same data
- Synchronous computation
- Speedup is higher, as there is only one execution thread operating on all sets of data
- The amount of parallelization is proportional to the input data size
- Designed for optimum load balance on multiprocessor systems
- Applicability: arrays, matrices
Function-parallel architectures
- Different operations are performed on the same or different data
- Asynchronous computation
- Speedup is lower, as each processor executes a different thread or process on the same or a different set of data
- The amount of parallelization is proportional to the number of independent tasks to be performed
- Load balancing depends on the availability of the hardware and on scheduling algorithms (static and dynamic scheduling)
- Applicability: pipelining
13. • Flynn's classification focuses on the behavioral aspect of computers.
• Looking at the structure, parallel computers can instead be classified by how processors communicate with memory:
- When multiprocessors communicate through global shared memory modules, the organization is called a shared memory computer, or tightly coupled system.
- When every processor in a multiprocessor system has its own local memory and the processors communicate via messages transmitted between their local memories, the organization is called a distributed memory computer, or loosely coupled system.
Structural Classification of Parallel Computers
14. Parallel Computer Memory Architectures
Shared Memory Parallel Computer Architecture
- Processors can access all memory as global
address space
- Multi-processors can operate independently but
share the same memory resources
- Changes in a memory location effected by one
processor are visible to all other processors
Based on memory access time, we can
classify Shared memory Parallel Computers into
two:
Uniform Memory Access (UMA)
Non-Uniform Memory Access (NUMA)
15. Parallel Computer Memory Architectures (Cont…)
Uniform Memory Access (UMA) (also known as Cache Coherent UMA, CC-UMA)
• Most commonly represented today by Symmetric Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
Note: Cache coherence is a hardware mechanism whereby any update of a location in shared memory by one processor is announced to all the other processors.
Source: images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory
16. Non-Uniform Memory Access (NUMA)
• The architecture often links two or more SMPs such that:
- One SMP can directly access the memory of another SMP
- Not all processors have equal access time to all memories
- Memory access across the link is slower
Note: if cache coherence is maintained, the architecture is also called Cache Coherent NUMA (CC-NUMA).
• The proximity of memory to CPUs on a shared memory parallel computer makes data sharing between tasks fast and uniform.
• But there is a lack of scalability between memory and CPUs.
Parallel Computer Memory Architectures (Cont…)
Sources: images retrieved from https://computing.llnl.gov/tutorials/parallel_comp/#SharedMemory; Bruce Jacob, ... David T. Wang, in Memory Systems, 2008
18. Distributed Memory Parallel Computer Architecture
• Comes in as many varieties as shared memory computers.
• Requires a communication network to connect inter-processor memory:
- Each processor operates independently with its own local memory
- Changes an individual processor makes do not affect the memory of other processors
- Cache coherency does not apply here!
• Access to data in another processor is usually the task of the programmer (who must explicitly define how and when data is communicated).
• This architecture is cost-effective (it can use commodity, off-the-shelf processors and networking).
• But the programmer carries more of the responsibility for data communication between processors.
Source: retrieved from https://www.futurelearn.com/courses/supercomputing/0/steps/24022
Parallel Computer Memory Architectures (Cont…)
19. Parallel Computer Memory Architectures (Cont…)
Overview of parallel memory architecture
Note: - The largest and fastest computers in the world today employ both shared and distributed memory architectures (hybrid memory).
- In a hybrid design, the shared memory component can be a shared memory machine and/or graphics processing units (GPUs).
- The distributed memory component is the networking of multiple shared-memory/GPU machines.
- This type of memory architecture will continue to prevail.
Source: Nikolaos Ploskas, Nikolaos Samaras, in GPU Programming in MATLAB, 2016
20. • Parallel computers can be roughly classified according to the level at which the hardware in the parallel architecture supports parallelism.
Multicore Computing
- The processor includes multiple processing units (called "cores") on the same chip.
- Issues multiple instructions per clock cycle from multiple instruction streams.
- Differs from a superscalar processor; but each core in a multi-core processor can potentially be superscalar as well.
(Superscalar: issues multiple instructions per clock cycle from one instruction stream, i.e. one thread.)
- Example: IBM's Cell microprocessor in the Sony PlayStation 3
Symmetric Multiprocessing (tightly coupled multiprocessing)
- A computer system with multiple identical processors that share memory and connect via a bus.
- Does not comprise more than 32 processors, to minimize bus contention.
- Symmetric multiprocessors are extremely cost-effective.
Hardware Classification
Source: retrieved from https://en.wikipedia.org/wiki/Parallel_computing#Bit-level_parallelism, 2020
21. Hardware Classification (Cont…)
Distributed Computing (distributed memory multiprocessor)
• Not to be confused with decentralized computing (the allocation of resources, hardware + software, to individual workstations).
• Components are located on different networked computers, which communicate and coordinate their actions by passing messages to one another.
• Components interact in order to achieve a common goal.
• Characterized by concurrency of components, lack of a global clock, and independent failure of components.
• Can include heterogeneous computations, where some nodes perform a lot more computation, some perform very little, and a few others perform specialized functionality.
• Example: multiplayer online games
Cluster Computing
• Loosely coupled computers that work together closely.
• In some respects they can be regarded as a single computer.
• Multiple standalone machines constitute a cluster, connected by a network.
• Computer clusters have each node set to perform the same task, controlled and scheduled by software.
• Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers.
• Example: IBM's Sequoia
Sources: Dinkar Sitaram, Geetha Manjunath, in Moving To The Cloud, 2012; Cisco Systems, 2003
23. Performance of parallel architectures
There are various ways to measure the performance of a parallel algorithm running on a parallel processor.
Most commonly used measurements:
- Speed-up
- Efficiency / isoefficiency
- Elapsed time
- Price/performance (a very important factor: elapsed time for a program divided by the cost of the machine that ran the job)
Note: none of these metrics should be used independently of the run time of the parallel system.
Common metrics of performance
• FLOPS and MIPS are units of measure for the numerical computing performance of a computer.
• Distributed computing uses the Internet to link personal computers to achieve more FLOPS.
- MIPS: millions of instructions per second
MIPS = instruction count / (execution time x 10^6)
- MFLOPS: millions of floating-point operations per second
MFLOPS = FP operations in program / (execution time x 10^6)
• Which metric is better?
• FLOP count is more closely related to the runtime of a task in numerical code;
the number of FLOPs per program is determined by the matrix size.
See Chapter 1
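A brief worked example, with hypothetical numbers: a numerical kernel that executes 2 x 10^9 floating-point operations in 4 seconds achieves 2 x 10^9 / (4 x 10^6) = 500 MFLOPS; if the same run retires 8 x 10^9 instructions, its rate is 8 x 10^9 / (4 x 10^6) = 2000 MIPS. Note that the two metrics need not move together: a code dominated by memory and integer instructions can post a high MIPS figure while doing very little floating-point work.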
24. “In June 2020, Fugaku turned in a High Performance Linpack (HPL) result
of 415.5 petaFLOPS, besting the now second-place Summit system by a
factor of 2.8x. Fugaku is powered by Fujitsu’s 48-core A64FX SoC,
becoming the first number one system on the list to be powered by ARM
processors. In single or further reduced precision, used in machine learning
and AI applications, Fugaku’s peak performance is over 1,000 petaflops (1
exaflops). The new system is installed at RIKEN Center for Computational
Science (R-CCS) in Kobe, Japan ” (wikipedia Flops, 2020).
Performance of parallel architectures
(Figure: performance trend over time, annotated "Here we are!", contrasting single-CPU performance with the projected parallel future.)
25. Peak and sustained performance
Peak performance
• Measured in MFLOPS
• The highest possible MFLOPS, when the system does nothing but numerical computation
• A rough hardware measure
• Gives little indication of how the system will perform in practice
Peak theoretical performance
• Node performance in GFLOPS = (CPU speed in GHz) x (number of CPU cores) x (floating-point operations per cycle) x (number of CPUs per node)
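A worked example with hypothetical hardware: a node with two 2.5 GHz CPUs, each having 16 cores capable of 8 floating-point operations per cycle, has a peak of 2.5 x 16 x 8 x 2 = 640 GFLOPS.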
26. Peak and sustained performance
• Sustained performance
• The MFLOPS rate that a program achieves over the entire run.
• Measuring sustained performance
• Using benchmarks
• Peak MFLOPS is usually much larger than sustained MFLOPS
• Efficiency rate = sustained MFLOPS / peak MFLOPS
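To make the peak/sustained distinction concrete, below is a minimal sketch in C; the kernel, array sizes and repetition count are arbitrary illustration choices, not a standard benchmark. It times a simple loop and reports its sustained MFLOPS; dividing the result by the node's peak gives the efficiency rate defined above (for instance, 80 GFLOPS sustained on the hypothetical 640 GFLOPS node of the previous slide would be an efficiency of 12.5%).

/* Minimal sketch: estimating the sustained MFLOPS of a DAXPY-style
   kernel (y = a*x + y). Illustrative only; real benchmarks such as
   LINPACK are far more careful about timing and compiler effects. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const int n = 1000000, reps = 100;
    double *x = malloc(n * sizeof *x), *y = malloc(n * sizeof *y);
    if (!x || !y) return 1;
    for (int i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }

    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        for (int i = 0; i < n; i++)
            y[i] = 3.0 * x[i] + y[i];          /* 2 FLOPs per element */
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;

    double flops = 2.0 * n * reps;             /* total FP operations */
    printf("check: %f\n", y[0]);               /* keep the loop live */
    printf("sustained rate: %.1f MFLOPS\n", flops / (secs * 1e6));
    free(x); free(y);
    return 0;
}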
27. Measuring the performance of
parallel computers
• Benchmarks: programs that are used to measure the
performance.
• LINPACK benchmark: a measure of a system’s floating point
computing power
• Solving a dense N by N system of linear equations Ax = b
• Used to rank supercomputers in the TOP500 list.
No. 1 since June 2020:
Fugaku, powered by Fujitsu's 48-core A64FX SoC, the first number-one
system on the list to be powered by ARM processors.
29. PARALLEL PROGRAMMING MODELS
A programming perspective on implementing parallelism in parallel
and distributed computer architectures
30. Parallel Programming Models
Parallel programming models exist as an abstraction above hardware
and memory architectures.
Several parallel programming models are in common use:
• Shared Memory (without threads)
• Threads
• Distributed Memory / Message Passing
• Data Parallel
• Hybrid
• Single Program Multiple Data (SPMD)
• Multiple Program Multiple Data (MPMD)
These models are NOT specific to a particular type of machine or memory architecture (a given model can be implemented on any underlying hardware).
Example: a SHARED memory model on a DISTRIBUTED memory machine (machine memory is physically distributed across networked machines, but appears at the user level as a single shared global address space, as in Kendall Square Research (KSR) ALLCACHE).
31. Which model to use?
There is no "best" model.
However, there are certainly better implementations of some models than of others.
Parallel Programming Models
32. Shared Memory Programming Model (Without Threads)
• A thread is the basic unit to which the operating system allocates processor time; it is the smallest sequence of programmed instructions that can be scheduled.
• In a shared memory programming model (without threads):
- Processes/tasks share a common address space, which they read and write to asynchronously.
- Mechanisms such as locks/semaphores are used to control access to the shared memory, resolve contention, and prevent race conditions and deadlocks.
• This may be considered the simplest parallel programming model.
33. • Note: Locks, mutexes and semaphores are types of synchronization objects in a shared-resource environment. They are abstract concepts.
- Lock: protects access to some kind of shared resource and confers the right to access the protected resource while owned.
For example, with a lockable object ABC you may:
- acquire the lock on ABC,
- take the lock on ABC,
- lock ABC,
- take ownership of ABC, or relinquish ownership of ABC when no longer needed.
- Mutex (MUTual EXclusion): a lockable object that can be owned by exactly one thread at a time.
Example: in C++, std::mutex, std::timed_mutex, std::recursive_mutex
- Semaphore: a very relaxed type of lockable object, with a predefined maximum count and a current count.
Shared Memory Programming Model (Cont…)
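To illustrate the lock concept concretely, here is a minimal sketch using the POSIX (pthreads) mutex API mentioned later in this chapter; the counter, loop count and thread count are arbitrary illustration choices, and threads stand in here for the cooperating tasks. Without the mutex, the concurrent increments would race and the final count would be unpredictable.

/* Minimal sketch: protecting a shared counter with a POSIX mutex.
   Compile with: cc prog.c -pthread */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;                          /* shared resource */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                /* acquire the lock */
        counter++;                                /* critical section */
        pthread_mutex_unlock(&lock);              /* release the lock */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);           /* 2000000 with the lock */
    return 0;
}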
34. Advantages:
• No need to specify explicitly the communication of data between tasks, so no need to implement "ownership"; very advantageous for the programmer.
• All processes see, and have equal access to, shared memory.
• Program development can often be simplified.
Disadvantages:
• It becomes more difficult to understand and manage data locality.
• Keeping data local to a given process conserves memory accesses, cache refreshes and bus traffic, but controlling data locality is hard to understand and may be beyond the control of the average user.
Shared Memory Programming Model (Cont…)
During implementation:
• Case: stand-alone shared memory machines
- Native operating systems, compilers and/or hardware provide support for shared memory programming; e.g., the POSIX standard provides an API for using shared memory.
• Case: distributed memory machines
- Memory is physically distributed across a network of machines, but made global through specialized hardware and software.
35. Thread Model
• This is a type of shared memory programming.
• Here, a single "heavy weight" process can have multiple "light weight", concurrent execution paths.
• To understand this model, consider the execution of a main program a.out, scheduled to run by the native operating system:
- a.out starts by loading and acquiring all of the necessary system and user resources to run. This constitutes the "heavy weight" process.
- a.out performs some serial work, and then creates a number of tasks (threads) that can be scheduled and run by the operating system concurrently.
- Each thread has local data, but also shares the entire resources of a.out ("light weight"), and benefits from a global memory view because it shares the memory space of a.out.
- Synchronization is needed to ensure that no more than one thread updates the same global address at any time.
36. • During implementation, thread implementations commonly comprise:
- A library of subroutines that are called from within parallel source code
- A set of compiler directives embedded in either serial or parallel source code
Note: the programmer is often responsible for determining the parallelism.
• Unrelated standardization efforts have resulted in two very different implementations of threads:
- POSIX Threads
Specified by the IEEE POSIX 1003.1c standard (1995). C language only; part of Unix/Linux operating systems. Very explicit parallelism; requires significant programmer attention to detail.
- OpenMP (used for the tutorials in the context of this course)
Industry standard; compiler-directive based; portable/multi-platform, including Unix and Windows platforms; available in C/C++ and Fortran implementations. Can be very easy and simple to use; provides for "incremental parallelism" and can begin with serial code.
Others include:
- Microsoft threads
- Java, Python threads
- CUDA threads for GPUs
Thread Model (Cont…)
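Since OpenMP is the implementation used for this course's tutorials, here is a minimal sketch of its compiler-directive style; the array and loop are arbitrary illustration choices. Compiled without OpenMP support the pragma is simply ignored and the code runs serially, which is the "incremental parallelism" idea.

/* Minimal sketch: an OpenMP parallel region with a work-shared loop.
   Compile with e.g.: cc -fopenmp prog.c */
#include <omp.h>
#include <stdio.h>

int main(void) {
    double a[1000];

    /* The directive asks the runtime to split loop iterations across
       the available threads; the loop index is private to each thread. */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        a[i] = 2.0 * i;

    printf("threads available: %d, a[999] = %.1f\n",
           omp_get_max_threads(), a[999]);
    return 0;
}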
37. • In this model:
- A set of tasks use their own local memory during computation.
- Multiple tasks can reside on the same physical machine and/or across an arbitrary number of machines.
- Tasks exchange data through communication (sending/receiving messages).
- Data transfer requires cooperation between processes (e.g., a send operation must have a matching receive operation).
During implementation:
• The programmer is responsible for determining all parallelism.
• Message passing implementations usually comprise a library of subroutines that are embedded in source code.
• MPI is the "de facto" industry standard for message passing.
- Message Passing Interface (MPI), specification available at http://www.mpi-forum.org/docs/.
Distributed Memory / Message Passing Model
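A minimal sketch of the cooperative send/receive style that MPI standardizes; the value, tag and two-process layout are arbitrary illustration choices. Each rank works in its own local memory, and data moves only through explicit messages.

/* Minimal sketch: point-to-point message passing with MPI.
   Run with e.g.: mpirun -np 2 ./a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size >= 2) {
        if (rank == 0) {
            value = 42;               /* lives in rank 0's local memory */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }
    }

    MPI_Finalize();
    return 0;
}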
38. Can also be referred to as the Partitioned Global Address Space (PGAS) model.
Here:
- The address space is treated globally.
- Most of the parallel work focuses on performing operations on a data set, typically organized into a common structure such as an array or cube.
- A set of tasks works collectively on the same data structure; however, each task works on a different partition of it.
- Tasks perform the same operation on their partition of the work, for example, "add 4 to every array element".
- The model can be implemented on shared memory architectures (the data structure is accessed through global memory) and on distributed memory architectures (the global data structure can be split logically and/or physically across tasks).
Data Parallel Model
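A minimal sketch of the slide's own example ("add 4 to every array element") in the data-parallel style, using OpenMP threads as the tasks and an explicit block partition so that each task visibly owns a different piece of the same structure; the sizes and names are arbitrary illustration choices.

/* Minimal sketch: each task applies the same operation ("add 4")
   to its own partition of one shared array. Sizes are arbitrary. */
#include <omp.h>
#include <stdio.h>

#define N 16

int main(void) {
    int a[N] = {0};

    #pragma omp parallel
    {
        int t = omp_get_thread_num();
        int nt = omp_get_num_threads();
        int chunk = (N + nt - 1) / nt;         /* block size per task */
        int lo = t * chunk;
        int hi = lo + chunk < N ? lo + chunk : N;
        for (int i = lo; i < hi; i++)
            a[i] += 4;                         /* same op, own partition */
    }

    printf("a[0]=%d a[%d]=%d\n", a[0], N - 1, a[N - 1]);
    return 0;
}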
39. For the implementation:
• Various popular, and sometimes developmental, parallel programming environments are based on the Data Parallel / PGAS model:
- Coarray Fortran: compiler dependent.
Further reading: https://en.wikipedia.org/wiki/Coarray_Fortran
- Unified Parallel C (UPC): an extension to the C programming language for SPMD parallel programming.
Further reading: http://upc.lbl.gov/
- Global Arrays: a shared-memory-style programming environment in the context of distributed array data structures.
Further reading: https://en.wikipedia.org/wiki/Global_Arrays
Data Parallel Model (Cont…)
40. Single Program Multiple Data (SPMD) / Multiple Program Multiple Data (MPMD)
Both are "high level" programming models that can be built on top of any combination of the parallel programming models above.
Single Program Multiple Data (SPMD):
- Why SINGLE PROGRAM? All tasks execute their copy of the same program (threads, message passing, data parallel or hybrid) simultaneously.
- Why MULTIPLE DATA? All tasks may use different data.
- "Intelligent enough": tasks do not necessarily have to execute the entire program, perhaps only a portion of it.
Multiple Program Multiple Data (MPMD):
- Why MULTIPLE PROGRAM? Tasks may execute different programs (threads, message passing, data parallel or hybrid) simultaneously.
- Why MULTIPLE DATA? All tasks may use different data.
- Not "intelligent" in the SPMD sense, but may be better suited for certain types of problems (functional decomposition problems).
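A minimal sketch of the SPMD idea, reusing MPI from the previous model; the division of roles is an arbitrary illustration choice. Every process runs the same executable, but each uses its rank to execute only its own portion of the program.

/* Minimal sketch: SPMD -- one program, every task runs its own copy
   and uses its rank to select which part of the work it performs. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        printf("task 0 of %d: doing I/O and coordination\n", size);
    else
        printf("task %d of %d: doing its share of the computation\n",
               rank, size);

    MPI_Finalize();
    return 0;
}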
41. Conclusion
• Parallel computer architectures contribute to achieving maximum performance within the limits given by the technology.
• Diversity in parallel computer architecture makes the field challenging to learn and challenging to present.
• Classification can be based on the number of instructions that can be executed and how they operate on data: Flynn (SISD, SIMD, MISD, MIMD).
• Classification can also be based on how parallelism is achieved (data-parallel architectures, function-parallel architectures).
• Classification can as well focus on how processors communicate with memory (shared memory computer / tightly coupled system; distributed memory computer / loosely coupled system).
• There must be a way to assess the performance of a parallel architecture.
• FLOPS and MIPS are units of measure for the numerical computing performance of a computer.
• Parallelism is made possible by the implementation of adequate parallel programming models.
• The simplest model appears to be the Shared Memory Programming Model.
• SPMD and MPMD programming require mastery of the previous programming models for proper implementation.
• How, then, do we design a parallel program for effective parallelism?
See next chapter: Designing Parallel Programs and understanding the notions of Concurrency and Decomposition.
42. Challenge your understanding
1. What is the difference between a parallel computer and parallel computing?
2. What do you understand by true data dependency and resource dependency?
3. Illustrate the notions of vertical waste and horizontal waste.
4. In your view, which design architecture can provide better performance? Use performance metrics to justify your arguments.
5. What is a concurrent-read, concurrent-write (CRCW) PRAM?
6. The figure shows bus-based interconnects (a) with no local caches and (b) with local memory/caches.
Explain the difference, focusing on:
- The design architecture
- The operation
- The pros and cons
7. Discuss Handler's classification of computer architectures compared with Flynn's and other classifications.
43. Class Work: Group and Presentation
• Purpose: demonstrate the conditions for detecting potential parallelism.
"Parallel computing requires that the segments to be executed in parallel must be independent of each other. So, before exploiting parallelism, all the conditions of parallelism between the segments must be analyzed."
Use Bernstein's conditions for the detection of parallelism to demonstrate when instructions i1, i2, …, in can be said to be "parallelizable".
44. REFERENCES
1. Xin Yuan, CIS4930/CDA5125: Parallel and Distributed Systems. Retrieved from http://www.cs.fsu.edu/~xyuan/cda5125/index.html
2. EECC722, Shaaban, lec #3, Fall 2000, 9-18-2000
3. Blaise Barney, Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/parallel_comp/#ModelsOverview, last modified 11/02/2020
4. J. Blazewicz et al., Handbook on Parallel and Distributed Processing, International Handbooks on Information Systems, Springer, 2000
5. Phillip J. Windley, Parallel Architectures, Lesson 6, CS462: Large-Scale Distributed Systems, 2020
6. A. Grama et al., Introduction to Parallel Computing, Lecture 3