Highlighted notes of an article, taken while studying Concurrent Data Structures (CSE):
The Concurrency Challenge
Wen-mei W. Hwu, Kurt Keutzer, Tim Mattson
IEEE Design & Test of Computers, vol. 25, July-August 2008, pp. 312-320
DOI: 10.1109/MDT.2008.110
Wen-mei Hwu is a professor and the Sanders-AMD Endowed Chair of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign. His research interests include architecture and compilation for parallel-computing systems. He has a BS in electrical engineering from National Taiwan University, and a PhD in computer science from the University of California, Berkeley. He is a Fellow of both the IEEE and the ACM.
Kurt Keutzer is a professor of electrical engineering and computer science at the University of California, Berkeley, and a principal investigator in UC Berkeley’s Universal Parallel Computing Research Center. His research focuses on the design and programming of ICs. He has a BS in mathematics from Maharishi International University, and an MS and a PhD in computer science from Indiana University, Bloomington. He is a Fellow of the IEEE and a member of the ACM.
Timothy G. Mattson is a principal engineer in the Applications Research Laboratory at Intel. His research interests focus on performance modeling for future multicore microprocessors and how different programming models map onto these systems. He has a BS in chemistry from the University of California, Riverside; an MS in chemistry from the University of California, Santa Cruz; and a PhD in theoretical chemistry from the University of California, Santa Cruz. He is a member of the American Association for the Advancement of Science (AAAS).
https://ieeexplore.ieee.org/document/4584454
RESILIENT INTERFACE DESIGN FOR SAFETY-CRITICAL EMBEDDED AUTOMOTIVE SOFTWARE (csandit)
The replacement of the former, purely mechanical, functionality with mechatronics-based solutions, the introduction of new propulsion technologies, and the connection of cars to their environment are just a few reasons for the continuously increasing electrical and/or electronic
system (E/E system) complexity in modern passenger cars. Smart methodologies and techniques have been introduced in system development to cope with these new challenges. A topic that is often neglected is the definition of the interface between the hardware and software subsystems.
However, during the development of safety-critical E/E systems, according to the automotive
functional safety standard ISO 26262, an unambiguous definition of the hardware-software interface (HSI) has become vital. This paper presents a domain-specific modelling approach for mechatronic systems with an integrated hardware-software interface definition feature. The
newly developed model-based domain-specific language is tailored to the needs of mechatronic system engineers and supports the system’s architectural design including the interface definition, with a special focus on safety-criticality.
THE UNIFIED APPROACH FOR ORGANIZATIONAL NETWORK VULNERABILITY ASSESSMENT (ijseajournal)
Today's business network infrastructure changes quickly, with new servers, services, connections, and ports added often, sometimes daily, and with an uncontrolled influx of laptops, storage media, and wireless networks. With the growing number of vulnerabilities and exploits, coupled with the continual evolution of IT infrastructure, organizations now require more frequent vulnerability assessments. This paper proposes a new approach to network vulnerability assessment, the Unified Process for Network Vulnerability Assessment (hereafter called unified NVA), derived from the Unified Software Development Process (Unified Process), a popular iterative and incremental software development process framework.
Integrating profiling into MDE compilers (ijseajournal)
Scientific computation demands ever more performance from its algorithms. New massively parallel architectures suit these algorithms well and are known for offering high performance and power efficiency. Unfortunately, because parallel programming for these architectures requires a complex distribution of tasks and data, developers find it difficult to implement their applications effectively. Although source-to-source approaches aim to provide a low learning curve for parallel programming and to exploit architecture features to create optimized applications, programming remains difficult for newcomers. This work aims to improve performance by feeding back to the high-level models specific execution data from a profiling tool, enhanced by advice computed by an analysis engine. To keep the link between execution and model, the process is based on a traceability mechanism. Once the model is automatically annotated, it can be refactored to improve performance in the regenerated code. Hence, this work keeps model and code coherent while still harnessing the power of parallel architectures. To illustrate and clarify key points of this approach, we provide an experimental example in the GPU context, using a transformation chain from UML-MARTE models to OpenCL code.
Almost all software development activities require collaboration, and model-based software development is no exception. In modern model-based development, collaboration comes at two levels. We start from collaborative language creation (aka metamodeling) and describe the benefits it can provide, and then do the same for collaborative language use (aka modeling). We conclude by inspecting how this collaboration enables scalability in terms of multiple engineers, multiple languages, large models, and transformations.
STATISTICAL ANALYSIS FOR PERFORMANCE COMPARISON (ijseajournal)
Performance responsiveness and scalability is a make-or-break quality for software. Nearly everyone runs into performance problems at one time or another. This paper discusses performance issues faced during the Pre Examination Process Automation System (PEPAS), implemented in Java technology, the challenges faced during the life cycle of the project, and the mitigation actions performed. It compares three Java technologies and shows how improvements in the application's response time are made through statistical analysis. The paper concludes with an analysis of the results.
Software Architecture: Introduction to the Abstraction (Henry Muccini)
The software architecture is the earliest model of the whole software system created along the software lifecycle.
A software architecture can be designed along four perspectives:
- as a set of components and connectors communicating through interfaces
- as a set of architecture design decisions
- with a focus on a set of views and viewpoints
- written according to architectural styles
Model-Driven Development of Web Applications (idescitation)
Over the last few years Model-Driven Development (MDD) has been regarded as
the future of Software Engineering, offering architects the possibility of creating artifacts to
illustrate the design of the software solutions, contributing directly to the implementation of
the product after performing a series of model transformations on them. The model-to-text
transformations are the most important operations from the point of view of the automatic
code generation. The automatic generation or the fast prototyping of applications implies an
acceleration of the development process and a reduction of time and effort, which could
materialize in a noticeable cost reduction. This paper proposes a practical approach for the
model-based development of web applications, offering a solution for the layered and
platform independent modeling of web applications, as well as for the automatic generation
of software solutions realized using the ASP.NET technology.
Improving Consistency of UML Diagrams and Its Implementation Using Reverse En... (journalBEEI)
Software development deals with various changes and evolution that cannot be avoided, since development processes are vastly incremental and iterative. In Model Driven Engineering, inconsistency between a model and its implementation has a huge impact on the software development process in terms of added cost, time, and effort; the later the inconsistencies are found, the more cost they add to the software project. Thus, this paper describes the development of a tool that improves the consistency between Unified Modeling Language (UML) design models and their C# implementation using a reverse engineering approach. A list of consistency rules is defined to check vertical and horizontal consistency between structural (class diagram) and behavioral (use case diagram and sequence diagram) UML diagrams and the implemented C# source code. The inconsistencies found between UML diagrams and source code are presented in a textual description and visualized in a tree view structure.
Simulations on Computer Network An Improved Study in the Simulator Methodolog... (YogeshIJTSRD)
Generally, a network simulator is used to analyse the performance and behaviour of a network. Simulation software plays a vital role before real-world implementation, since hardware setups of network topologies are very costly and strenuous to modify often. Network simulators include Ns-2, Ns-3, OMNeT, NetSim, J-SIM, REAL, OPNET, OMNEST, QualNet, TraNS, NTCUns, etc. Selecting a network simulator based on the requirements of a user's specific job is quite a tedious process. This paper gives a comparison and a general analysis of various network simulators. J. Sumitha | S. Swathi | Ellakkiya. M, "Simulations on Computer Network: An Improved Study in the Simulator Methodologies and Their Applications", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-5, Issue-4, June 2021. URL: https://www.ijtsrd.com/papers/ijtsrd41176.pdf Paper URL: https://www.ijtsrd.com/computer-science/simulation/41176/simulations-on-computer-network-an-improved-study-in-the-simulator-methodologies-and-their-applications/j-sumitha
This paper presents a vision of how the software development process could become a fully unified, mechanized process by benefiting from advances in the Natural Language Processing and Program Synthesis fields. The process begins from requirements written in natural language, which are translated to sentences in logical form. A program synthesizer takes those sentences in logical form (the translator's outcome) and generates source code. Finally, a compiler produces ready-to-run software. To find out how the building blocks of our proposed approach work, we conducted exploratory research on the literature in the fields of Requirements Engineering, Natural Language Processing, and Program Synthesis. Currently, this approach is difficult to accomplish in a fully automatic way due to the ambiguities inherent in natural language, the reasoning context of the software, and the limitations of program synthesizers in generating source code from logic.
‘O’ Model for Component-Based Software Development Process (ijceronline)
Technological advancement has made users ever more dependent on information technology, and hence on software, since software provides the platform for implementing information technology. Component-Based Software Engineering (CBSE) has been adopted by the software community to counter the challenges posed by the fast-growing demand for heavy and complex software systems. One of the essential reasons for adopting CBSE is the fast development of complicated software systems within well-defined boundaries of time and budget: CBSE builds systems by assembling already existing reusable components, developed autonomously as pieces of software. The paper proposes a novel CBSE model named the O model, keeping an eye on the available CBSE lifecycles.
A TLM based platform to specify and verify component-based real-time systems (ijseajournal)
This paper is about modeling and verification languages, with their pros and cons. Modeling is a dynamic part of the system development process before realization; cost and risk oblige the designer to model a system before production, and modeling gives the designer a more flexible and dynamic image of the realized system. Formal languages and modeling methods are ways to model and verify systems, but they have their own difficulties in specifying systems. Some, like TRIO, are very precise but make it hard to specify complex systems; others do not support object-oriented design or hardware/software co-design in real-time systems. In this paper we introduce SystemC and the more abstract method called TLM 2.0, which address all the mentioned problems.
PROGRAMMING REQUESTS/RESPONSES WITH GREATFREE IN THE CLOUD ENVIRONMENT (ijdpsjournal)
Programming requests with GreatFree is an efficient programming technique to implement distributed polling in the cloud computing environment. GreatFree is a distributed programming environment through which diverse distributed systems can be established through programming rather than configuring or scripting. GreatFree emphasizes the importance of programming since it offers developers the opportunity to leverage their distributed knowledge and programming skills. Additionally, programming is the unique way to construct creative, adaptive, and flexible systems that accommodate various distributed computing environments. With the support of GreatFree's code-level Distributed Infrastructure Patterns, Distributed Operation Patterns, and APIs, this difficult procedure is accomplished in a programmable, rapid, and highly-patterned manner, i.e., the programming behaviors are simplified to the repeatable operation of Copy-Paste-Replace. Since distributed polling is one of the fundamental techniques for constructing distributed systems, GreatFree provides developers with relevant APIs and patterns to program requests/responses in this novel programming environment.
Is Multicore Hardware For General-Purpose Parallel Processing Broken? : Notes (Subhajit Sahu)
Highlighted notes of an article, taken while studying Concurrent Data Structures (CSE):
Is Multicore Hardware For General-Purpose Parallel Processing Broken?
By Uzi Vishkin
Communications of the ACM, April 2014, Vol. 57 No. 4, Pages 35-39
DOI: 10.1145/2580945
Concurrent Matrix Multiplication on Multi-core Processors (CSCJournals)
With the advent of multi-cores, every processor has built-in parallel computational power, which can be fully utilized only if the program in execution is written accordingly. This study is part of ongoing research into the design of a new parallel programming model for multi-core architectures. In this paper we present a simple, highly efficient, and scalable implementation of a common matrix multiplication algorithm using a newly developed parallel programming model, SPC3 PM, for general-purpose multi-core processors. From our study it is found that matrix multiplication done concurrently on multi-cores using SPC3 PM requires much less execution time than with present standard parallel programming environments like OpenMP. Our approach also shows scalability, better and more uniform speedup, and better utilization of available cores than the same algorithm written using standard OpenMP or similar parallel programming tools. We have tested our approach on up to 24 cores with matrix sizes varying from 100 x 100 to 10000 x 10000 elements, and in all these tests our proposed approach has shown much improved performance and scalability.
Developing Real-Time Systems on Application Processors (Toradex)
Guaranteeing real-time and deterministic behavior on SoC-based systems can be challenging. In this blog post, we offer three approaches to add real-time control to systems that use a SoC running a feature-rich OS such as Linux. https://www.toradex.com/blog/developing-real-time-systems-on-application-processors
Improved Strategy for Distributed Processing and Network Application Developm... (Editor IJCATR)
The complexity of software development abstraction and the new developments in multi-core computers have shifted the burden of distributed software performance from network and chip designers to software architects and developers. We need to look at software development strategies that integrate parallelization of code, concurrency factors, multithreading, distributed resource allocation, and distributed processing. In this paper, a new software development strategy that integrates these factors is further experimented on for parallelism. The strategy is multidimensional and aligns distributed conceptualization along a path. This development strategy mandates that application developers reason along usability, simplicity, resource distribution, parallelization of code where necessary, processing time and cost factor realignment, as well as security and concurrency issues, in a balanced path from the originating point of the network application to its retirement.
A Generic Open Source Framework for Auto Generation of Data Manipulation Comm... (iosrjce)
The development of embedded applications (such as Wireless Sensor Network protocols) often requires a shift to formal specifications. To ensure the reliability and performance of WSNs, such protocols must be designed following methods that reduce the error rate. Formal methods (such as automata, Petri nets, algebras, logics, etc.) have been used extensively in the specification, analysis, and verification of these protocols. After that, their implementation is an important phase in deploying, testing, and using those protocols in real environments. The main objective of the current paper is to formalize the transformation from a high-level specification (in Timed Automata) to a low-level implementation (in the NesC language on the TinyOS system) and to automate this transformation. The proposed transformation approach defines a set of rules that allow the passage between these two levels. We implemented our solution and illustrated the proposed approach on a protocol case study for "humidity" and "temperature" sensing in WSN applications.
EFFECTIVE EMBEDDED SYSTEMS SOFTWARE DESIGN METHODOLOGIES (cscpconf)
This paper argues that universities need to improve their curricula for technology students to meet industry standards, which will be helpful for their careers. With today's advancing technologies, the study of embedded systems is required to understand electronic circuits, and curricula should include emerging technologies such as the multiprocessor system-on-chip (MPSoC), which is used in many real-time applications. In this paper, design-based tutorials for understanding the multiprocessor system-on-chip are discussed. MPSoCs are difficult for students to understand and should be taught in a way that meets expectations from industry. Since it is a vast area, this paper proposes an efficient tutoring method for the multiprocessor system-on-chip.
An Integrated Prototyping Environment For Programmable Automation (MeshDynamics)
We are implementing a rapid prototyping environment for robotic systems, based on tenets of modularity,
reconfigurability and extendibility that may help build robot systems "faster, better and cheaper". Given a task
specification, (e.g. repair brake assembly), the user browses through a library of building blocks that include both
hardware and software components. Software advisors or critics recommend how blocks may be "snapped" together to
speedily construct alternative ways to satisfy task requirements. Mechanisms to allow "swapping" competing modules
for comparative test and evaluation studies are also included in the prototyping environment. After some iterations, a
stable configuration or "wiring diagram" emerges. This customized version of the general prototyping environment still
contains all the hooks needed to incorporate future improvements in component technologies and to obviate unplanned obsolescence...
About TrueTime, Spanner, Clock synchronization, CAP theorem, Two-phase lockin... (Subhajit Sahu)
TrueTime is a service that enables the use of globally synchronized clocks, with bounded error. It returns a time interval that is guaranteed to contain the clock’s actual time for some time during the call’s execution. If two intervals do not overlap, then we know calls were definitely ordered in real time. In general, synchronized clocks can be used to avoid communication in a distributed system.
The underlying source of time is a combination of GPS receivers and atomic clocks. As there are “time masters” in every datacenter (redundantly), it is likely that both sides of a partition would continue to enjoy accurate time. Individual nodes however need network connectivity to the masters, and without it their clocks will drift. Thus, during a partition their intervals slowly grow wider over time, based on bounds on the rate of local clock drift. Operations depending on TrueTime, such as Paxos leader election or transaction commits, thus have to wait a little longer, but the operation still completes (assuming the 2PC and quorum communication are working).
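To make the interval semantics concrete, here is a minimal sketch of the idea in C++; the names (TTInterval, tt_now), the microsecond units, and the fixed uncertainty are illustrative assumptions, not Spanner's actual API.

```cpp
// Sketch of TrueTime-style bounded-uncertainty clocks (illustrative only).
#include <cstdint>
#include <cstdio>

struct TTInterval {
  int64_t earliest, latest;  // bounds guaranteed to contain the true time
};

// A real implementation queries GPS/atomic-clock time masters; during a
// partition the uncertainty (eps) grows with the local clock's drift bound.
TTInterval tt_now(int64_t local_us, int64_t eps_us) {
  return {local_us - eps_us, local_us + eps_us};
}

// If two intervals do not overlap, the calls were definitely ordered in real time.
bool definitely_before(const TTInterval& a, const TTInterval& b) {
  return a.latest < b.earliest;
}

int main() {
  TTInterval commit = tt_now(1000000, 5000);  // +/- 5 ms uncertainty
  TTInterval read   = tt_now(1020000, 5000);
  // Commit-wait: a transaction holds its locks until commit.latest passes,
  // so any later operation observes a non-overlapping, strictly later interval.
  std::printf("definitely ordered: %d\n", (int)definitely_before(commit, read));
}
```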
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components, and processes them in topological order, one level at a time. This enables calculation for ranks in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition of the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. Slowdown on the GPU is likely caused by a large submission of small workloads, and expected to be non-issue when the computation is performed on massive graphs.
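For reference, the per-vertex update that both variants iterate (the standard damped power-iteration formula, stated here from general knowledge rather than from the report) is

$$r_v = \frac{1-\alpha}{N} + \alpha \sum_{u \in \mathrm{in}(v)} \frac{r_u}{d^{+}_u},$$

where $\alpha$ is the damping factor, $N$ the number of vertices, and $d^{+}_u$ the out-degree of $u$. Levelwise PageRank applies this same update one block-graph level at a time, in topological order, so the ranks feeding into a level are already final when that level iterates.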
Adjusting Bitset for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Compressed Sparse Row (CSR) is an adjacency-list based graph representation that is commonly used for efficient graph computations. Unfortunately, using CSR for dynamic graphs is impractical since addition/deletion of a single edge can require on average (N+M)/2 memory accesses, in order to update source-offsets and destination-indices. A common approach is therefore to store edge-lists/destination-indices as an array of arrays, where each edge-list is an array belonging to a vertex. While this is good enough for small graphs, it quickly becomes a bottleneck for large graphs. What causes this bottleneck depends on whether the edge-lists are sorted or unsorted. If they are sorted, checking for an edge requires about log(E) memory accesses, but adding an edge on average requires E/2 accesses, where E is the number of edges of a given vertex. Note that both addition and deletion of edges in a dynamic graph require checking for an existing edge, before adding or deleting it. If edge lists are unsorted, checking for an edge requires around E/2 memory accesses, but adding an edge requires only 1 memory access.
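A minimal sketch of the sorted edge-list trade-off described above; the vector-of-vectors layout and the function names are assumptions for illustration.

```cpp
// Sorted per-vertex edge lists: ~log(E) lookup via binary search, but an
// insertion/deletion shifts the tail of the array, ~E/2 moves on average.
#include <algorithm>
#include <vector>

using EdgeList = std::vector<int>;  // sorted destination indices of one vertex

bool hasEdge(const EdgeList& es, int v) {
  return std::binary_search(es.begin(), es.end(), v);  // ~log(E) accesses
}

void addEdge(EdgeList& es, int v) {
  auto it = std::lower_bound(es.begin(), es.end(), v);
  if (it == es.end() || *it != v) es.insert(it, v);  // shifts tail: ~E/2
}

void removeEdge(EdgeList& es, int v) {
  auto it = std::lower_bound(es.begin(), es.end(), v);
  if (it != es.end() && *it == v) es.erase(it);      // shifts tail: ~E/2
}
```

With an unsorted list the costs flip: hasEdge becomes a ~E/2 linear scan, while adding a known-new edge is a single push_back.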
Techniques to optimize the PageRank algorithm usually fall into two categories. One tries to reduce the work per iteration, and the other tries to reduce the number of iterations. These goals are often at odds with one another. Skipping computation on vertices which have already converged has the potential to save iteration time. Skipping in-identical vertices, which share the same in-links, helps avoid duplicate computations and thus could reduce iteration time. Road networks often have chains which can be short-circuited before PageRank computation to improve performance, since the final ranks of chain nodes can be easily calculated; this could reduce both the iteration time and the number of iterations. If a graph has no dangling nodes, the PageRank of each strongly connected component can be computed in topological order, which could reduce the iteration time and the number of iterations, and also enable multi-iteration concurrency in the computation. The combination of all of the above methods is the STICD algorithm. [sticd] For dynamic graphs, unchanged components whose ranks are unaffected can be skipped altogether.
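As a sketch of the first technique, skipping converged vertices, here is an illustrative single iteration in C++; the tolerance handling and names are assumptions, not the STICD paper's code.

```cpp
// One PageRank iteration that skips vertices whose rank has stabilized.
#include <cmath>
#include <vector>

struct Graph {
  std::vector<std::vector<int>> in;  // in-neighbours of each vertex
  std::vector<int> outdeg;           // out-degree of each vertex
};

void step(const Graph& g, const std::vector<double>& r, std::vector<double>& rn,
          std::vector<bool>& done, double alpha, double tol) {
  int N = (int)r.size();
  for (int v = 0; v < N; ++v) {
    if (done[v]) { rn[v] = r[v]; continue; }  // skip converged vertex
    double s = 0;
    for (int u : g.in[v]) s += r[u] / g.outdeg[u];
    rn[v] = (1 - alpha) / N + alpha * s;
    if (std::fabs(rn[v] - r[v]) < tol) done[v] = true;  // mark as converged
  }
}
```

One caveat: a vertex marked converged may need re-activation if the ranks of its in-neighbours later move, which is why this optimization can trade a little accuracy for speed.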
Adjusting primitives for graph : SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, commonly operate on Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Experiments with Primitive operations : SHORT REPORT / NOTES (Subhajit Sahu)
This includes (a small code sketch follows the list):
- Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
- Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
- Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
- Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
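A minimal sketch of the first comparison above (sequential vs OpenMP map-style vector multiply); this is illustrative code, not the report's actual benchmark harness (compile with -fopenmp).

```cpp
// Element-wise vector multiply, sequential vs OpenMP.
#include <omp.h>
#include <cstdio>
#include <vector>

void multiplySeq(std::vector<float>& a, const std::vector<float>& x,
                 const std::vector<float>& y) {
  for (size_t i = 0; i < a.size(); ++i) a[i] = x[i] * y[i];
}

void multiplyOmp(std::vector<float>& a, const std::vector<float>& x,
                 const std::vector<float>& y) {
  #pragma omp parallel for schedule(static)
  for (long long i = 0; i < (long long)a.size(); ++i) a[i] = x[i] * y[i];
}

int main() {
  size_t N = 1 << 24;
  std::vector<float> a(N), x(N, 2.0f), y(N, 3.0f);
  double t0 = omp_get_wtime(); multiplySeq(a, x, y);
  double t1 = omp_get_wtime(); multiplyOmp(a, x, y);
  double t2 = omp_get_wtime();
  std::printf("seq: %.3fs  omp: %.3fs\n", t1 - t0, t2 - t1);
}
```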
Adjusting OpenMP PageRank : SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take
advantage of a shared memory system with multiple CPUs, each with multiple cores, to
accelerate pagerank computation. If the NUMA architecture of the system is properly taken
into account with good vertex partitioning, the speedup can be significant. To take steps in
this direction, experiments are conducted to implement pagerank in OpenMP using two
different approaches, uniform and hybrid. The uniform approach runs all primitives required
for pagerank in OpenMP mode (with multiple threads). On the other hand, the hybrid
approach runs certain primitives in sequential mode (i.e., sumAt, multiply).
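To make the uniform/hybrid distinction concrete, here is a small illustrative sketch; the primitive names mirror the ones mentioned above, but their exact signatures are assumptions.

```cpp
// sumAt: gather-and-sum primitive, in sequential and OpenMP variants.
#include <omp.h>
#include <vector>

double sumAtSeq(const std::vector<double>& x, const std::vector<int>& is) {
  double s = 0;
  for (int i : is) s += x[i];  // sequential: no thread launch overhead
  return s;
}

double sumAtOmp(const std::vector<double>& x, const std::vector<int>& is) {
  double s = 0;
  #pragma omp parallel for reduction(+:s)
  for (long long j = 0; j < (long long)is.size(); ++j) s += x[is[j]];
  return s;
}
// Uniform approach: call sumAtOmp (and an OpenMP multiply) everywhere.
// Hybrid approach: keep the main rank-update loop in OpenMP, but call
// sumAtSeq for small reductions where threading overhead dominates.
```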
word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings o... (Subhajit Sahu)
Below are the important points I note from the 2020 paper by Martin Grohe:
- 1-WL distinguishes almost all graphs, in a probabilistic sense
- Classical WL is the two-dimensional Weisfeiler-Leman algorithm
- DeepWL is an unlimited version of WL that runs in polynomial time
- Knowledge graphs are essentially graphs with vertex/edge attributes
ABSTRACT:
Vector representations of graphs and relational structures, whether handcrafted feature vectors or learned representations, enable us to apply standard data analysis and machine learning techniques to the structures. A wide range of methods for generating such embeddings have been studied in the machine learning and knowledge representation literature. However, vector embeddings have received relatively little attention from a theoretical point of view.
Starting with a survey of embedding techniques that have been used in practice, in this paper we propose two theoretical approaches that we see as central for understanding the foundations of vector embeddings. We draw connections between the various approaches and suggest directions for future research.
DyGraph: A Dynamic Graph Generator and Benchmark Suite : NOTES (Subhajit Sahu)
https://gist.github.com/wolfram77/54c4a14d9ea547183c6c7b3518bf9cd1
There exist a number of dynamic graph generators. The Barabási-Albert model iteratively attaches new vertices to pre-existing vertices in the graph using preferential attachment (edges to high-degree vertices are more likely: rich get richer, the Pareto principle). However, graph size increases monotonically, and the density of the graph keeps increasing (sparsity keeps decreasing).
Gorke's model uses a defined clustering to uniformly add vertices and edges. Purohit's model uses motifs (e.g. triangles) to mimic properties of existing dynamic graphs, such as growth rate, structure, and degree distribution. Kronecker graph generators are used to increase the size of a given graph, with a power-law distribution.
To generate dynamic graphs, we must choose a metric to compare two graphs. Common metrics include diameter, clustering coefficient (modularity?), triangle counting (triangle density?), and degree distribution.
In this paper, the authors propose Dygraph, a dynamic graph generator that uses degree distribution as the only metric. The authors observe that many real-world graphs differ from the power-law distribution at the tail end. To address this issue, they propose binning, where the vertices beyond a certain degree (minDeg = min(deg) s.t. |V(deg)| < H, where H~10 is the number of vertices with a given degree below which are binned) are grouped into bins of degree-width binWidth, max-degree localMax, and number of degrees in bin with at least one vertex binSize (to keep track of sparsity). This helps the authors to generate graphs with a more realistic degree distribution.
The process of generating a dynamic graph is as follows. First the difference between the desired and the current degree distribution is calculated. The authors then create an edge-addition set where each vertex is present as many times as the number of additional incident edges it must receive. Edges are then created by connecting two vertices randomly from this set, and removing both from the set once connected. Currently, the authors reject self-loops and duplicate edges. Removal of edges is done in a similar fashion.
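A minimal sketch of this edge-addition step; the data layout and the simple reject-and-drop policy are assumptions for illustration (a real generator might retry rejected pairs).

```cpp
// Pair vertices randomly from an edge-addition multiset, rejecting
// self-loops and duplicate edges, as described above.
#include <algorithm>
#include <random>
#include <set>
#include <utility>
#include <vector>

std::vector<std::pair<int,int>> addEdges(const std::vector<int>& need,
                                         std::set<std::pair<int,int>>& edges,
                                         std::mt19937& rng) {
  std::vector<int> pool;  // vertex v appears need[v] times
  for (int v = 0; v < (int)need.size(); ++v)
    pool.insert(pool.end(), need[v], v);
  std::shuffle(pool.begin(), pool.end(), rng);
  std::vector<std::pair<int,int>> added;
  while (pool.size() >= 2) {
    int u = pool.back(); pool.pop_back();
    int v = pool.back(); pool.pop_back();
    std::pair<int,int> e = std::minmax(u, v);   // canonical undirected form
    if (u == v || edges.count(e)) continue;     // reject self-loop/duplicate
    edges.insert(e);
    added.push_back(e);
  }
  return added;
}
```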
Authors observe that adding edges with power-law properties dominates the execution time, and consider parallelizing DyGraph as part of future work.
My notes on shared memory parallelism.
Shared memory is memory that may be simultaneously accessed by multiple programs with an intent to provide communication among them or avoid redundant copies. Shared memory is an efficient means of passing data between programs. Using memory for communication inside a single program, e.g. among its multiple threads, is also referred to as shared memory [REF].
A Dynamic Algorithm for Local Community Detection in Graphs : NOTES (Subhajit Sahu)
**Community detection methods** can be *global* or *local*. **Global community detection methods** divide the entire graph into groups. Existing global algorithms include:
- Random walk methods
- Spectral partitioning
- Label propagation
- Greedy agglomerative and divisive algorithms
- Clique percolation
https://gist.github.com/wolfram77/b4316609265b5b9f88027bbc491f80b6
There is a growing body of work in *detecting overlapping communities*. **Seed set expansion** is a **local community detection method** where relevant *seed vertices* of interest are picked and *expanded to form communities* surrounding them. The quality of each community is measured using a *fitness function*.
**Modularity** is a *fitness function* which compares the number of intra-community edges to the expected number under a random null model. **Conductance** is another popular fitness score that measures the community cut, or inter-community edges. Many *overlapping community detection* methods **use a modified ratio** of intra-community edges to all edges with at least one endpoint in the community.
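For reference, the two fitness functions can be written as follows; these are the standard textbook definitions, not formulas copied from this paper:

$$Q = \frac{1}{2m}\sum_{i,j}\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta(c_i, c_j), \qquad \phi(S) = \frac{\mathrm{cut}(S)}{\min\big(\mathrm{vol}(S),\, \mathrm{vol}(V \setminus S)\big)},$$

where $m$ is the number of edges, $A$ the adjacency matrix, $k_i$ the degree of vertex $i$, and $\delta(c_i, c_j) = 1$ exactly when $i$ and $j$ share a community; $\mathrm{cut}(S)$ counts inter-community edges and $\mathrm{vol}(S)$ sums the degrees in $S$.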
Andersen et al. use a **Spectral PageRank-Nibble method** which minimizes conductance and is formed by adding vertices in order of decreasing PageRank values. Andersen and Lang develop a **random walk approach** in which some vertices in the seed set may not be placed in the final community. Clauset gives a **greedy method** that *starts from a single vertex* and then iteratively adds neighboring vertices *maximizing the local modularity score*. Riedy et al. **expand multiple vertices** via maximizing modularity.
Several algorithms for **detecting global, overlapping communities** use a *greedy*, *agglomerative approach* and run *multiple separate seed set expansions*. Lancichinetti et al. run **greedy seed set expansions**, each with a *single seed vertex*; overlapping communities are produced by sequentially running expansions from nodes not yet in a community. Lee et al. use **maximal cliques as seed sets**. Havemann et al. **greedily expand cliques**.
The authors of this paper discuss a dynamic approach for **community detection using seed set expansion**. Simply marking the neighbours of changed vertices is a **naive approach**, and has *severe shortcomings*. This is because *communities can split apart*. The simple updating method *may fail even when it outputs a valid community* in the graph.
Scalable Static and Dynamic Community Detection Using Grappolo : NOTES (Subhajit Sahu)
A **community** (in a network) is a subset of nodes which are _strongly connected among themselves_, but _weakly connected to others_. Neither the number of output communities nor their size distribution is known a priori. Community detection methods can be divisive or agglomerative. **Divisive methods** use _betweenness centrality_ to **identify and remove bridges** between communities. **Agglomerative methods** greedily **merge two communities** that provide the maximum gain in _modularity_. Newman and Girvan introduced the **modularity metric**; the problem of community detection is then reduced to the problem of modularity maximization, which is **NP-complete**. The **Louvain method** is a variant of the _agglomerative strategy_, in that it is a _multi-level heuristic_.
https://gist.github.com/wolfram77/917a1a4a429e89a0f2a1911cea56314d
In this paper, the authors discuss **four heuristics** for community detection with the _Louvain algorithm_, implemented upon the recently developed **Grappolo**, a parallel variant of the Louvain algorithm. They are:
- Vertex following and Minimum label
- Data caching
- Graph coloring
- Threshold scaling
With the **Vertex following** heuristic, the _input is preprocessed_ and all single-degree vertices are merged with their corresponding neighbours. This reduces the number of vertices considered in each iteration, and also helps initial seeds of communities to form. With the **Minimum label** heuristic, when a vertex decides to move to a community and multiple communities provide the same modularity gain, the community with the smallest id is chosen. This helps _minimize or prevent community swaps_. With the **Data caching** heuristic, community information is stored in a vector instead of a map, and is reused in each iteration, at some additional cost. With the **Vertex ordering via Graph coloring** heuristic, _distance-k coloring_ of the graph is performed in order to group vertices into colors. Then, each set of vertices (by color) is processed _concurrently_, with synchronization after each color. This mimics the behaviour of the serial algorithm. Finally, with the **Threshold scaling** heuristic, _successively smaller values of the modularity threshold_ are used as the algorithm progresses. This allows the algorithm to converge faster, and a good modularity score has been observed as well.
From the results, it appears that _graph coloring_ and _threshold scaling_ heuristics do not always provide a speedup and this depends upon the nature of the graph. It would be interesting to compare the heuristics against baseline approaches. Future work can include _distributed memory implementations_, and _community detection on streaming graphs_.
Application Areas of Community Detection: A Review : NOTES (Subhajit Sahu)
This is a short review of community detection methods (on graphs) and their applications. A **community** is a subset of a network whose members are *highly connected*, but *loosely connected* to others outside their community. Different community detection methods *can return differing communities*, since these algorithms are **heuristic-based**. **Dynamic community detection** involves tracking the *evolution of community structure* over time.
https://gist.github.com/wolfram77/09e64d6ba3ef080db5558feb2d32fdc0
Communities can be of the following **types**:
- Disjoint
- Overlapping
- Hierarchical
- Local
The following **static** community detection **methods** exist:
- Spectral-based
- Statistical inference
- Optimization
- Dynamics-based
The following **dynamic** community detection **methods** exist:
- Independent community detection and matching
- Dependent community detection (evolutionary)
- Simultaneous community detection on all snapshots
- Dynamic community detection on temporal networks
**Applications** of community detection include:
- Criminal identification
- Fraud detection
- Criminal activities detection
- Bot detection
- Dynamics of epidemic spreading (dynamic)
- Cancer/tumor detection
- Tissue/organ detection
- Evolution of influence (dynamic)
- Astroturfing
- Customer segmentation
- Recommendation systems
- Social network analysis (both)
- Network summarization
- Privacy, group segmentation
- Link prediction (both)
- Community evolution prediction (dynamic, hot field)
## References
- [Application Areas of Community Detection: A Review : PAPER](https://ieeexplore.ieee.org/document/8625349)
This paper discusses a GPU implementation of the Louvain community detection algorithm. The Louvain algorithm obtains hierarchical communities as a dendrogram through modularity optimization. Given an undirected weighted graph, all vertices are first considered to be their own communities. In the first phase, each vertex greedily decides to move to the community of one of its neighbours which gives the greatest increase in modularity. If moving to no neighbour's community leads to an increase in modularity, the vertex chooses to stay in its own community. This is done sequentially for all the vertices. If the total change in modularity is more than a certain threshold, this phase is repeated.

Once this local moving phase is complete, all vertices have formed their first hierarchy of communities. The next phase is called the aggregation phase, where all the vertices belonging to a community are collapsed into a single super-vertex, such that edges between communities are represented as edges between the respective super-vertices (edge weights are combined), and edges within each community are represented as self-loops on the respective super-vertices (again, edge weights are combined). Together, the local moving and the aggregation phases constitute a stage. This super-vertex graph is then used as input for the next stage. The process continues until the increase in modularity falls below a certain threshold. As a result, each stage yields one level of the hierarchy of community memberships for each vertex, forming a dendrogram.
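To make the local-moving phase concrete, here is a compact sequential sketch in C++; it is a toy illustration of the standard Louvain move step (using the usual modularity-gain formula), not the paper's GPU code.

```cpp
// One sequential local-moving pass: each vertex greedily joins the
// neighbouring community with the highest modularity gain, or stays put.
#include <unordered_map>
#include <utility>
#include <vector>

struct WGraph {  // undirected weighted graph as adjacency lists
  std::vector<std::vector<std::pair<int,double>>> adj;  // (neighbour, weight)
  double m = 0;  // total edge weight
};

bool localMovingPass(const WGraph& g, std::vector<int>& comm,
                     std::vector<double>& ctot,        // total degree per community
                     const std::vector<double>& kdeg)  // weighted degree per vertex
{
  bool moved = false;
  for (int v = 0; v < (int)g.adj.size(); ++v) {
    std::unordered_map<int,double> kin;  // edge weight from v into each community
    for (auto [u, w] : g.adj[v]) if (u != v) kin[comm[u]] += w;
    ctot[comm[v]] -= kdeg[v];            // evaluate gains with v removed
    kin[comm[v]];                        // ensure staying put is considered
    int best = comm[v]; double bestGain = -1e300;
    for (auto [c, k] : kin) {
      // dQ = k_in/(2m) - ctot[c]*kdeg[v]/(2*m*m)  (standard gain formula)
      double gain = k / (2*g.m) - ctot[c] * kdeg[v] / (2 * g.m * g.m);
      if (gain > bestGain) { bestGain = gain; best = c; }
    }
    if (best != comm[v]) { comm[v] = best; moved = true; }
    ctot[comm[v]] += kdeg[v];
  }
  return moved;  // the phase repeats while passes keep improving modularity
}
```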
Approaches to performing the Louvain algorithm can be divided into coarse-grained and fine-grained. Coarse-grained approaches process a set of vertices in parallel, while fine-grained approaches process all vertices in parallel. A coarse-grained hybrid algorithm using multiple GPUs has been implemented by Cheong et al., which grabbed my attention. In addition, their algorithm does not use hashing for the local-moving phase, but instead sorts each neighbour list based on the community id of each vertex.
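As a quick illustration of the local-moving phase described above, here is a minimal sequential C++17 sketch (my own, not from the paper): it assumes an adjacency-list graph, tracks per-community total degree, and uses one common form of the modularity-gain formula; self-loops and the aggregation phase are omitted.

```cpp
#include <unordered_map>
#include <utility>
#include <vector>

// Adjacency list for an undirected weighted graph: adj[u] = {(v, w), ...}.
using Graph = std::vector<std::vector<std::pair<int, double>>>;

// One sequential local-moving pass. comm[u]: community of u; ctot[c]: total
// weighted degree of community c; m: total edge weight of the graph.
// Returns true if any vertex changed community (the pass is then repeated).
bool local_moving_pass(const Graph& adj, std::vector<int>& comm,
                       std::vector<double>& ctot, double m) {
  bool moved = false;
  for (int u = 0; u < (int)adj.size(); ++u) {
    double ku = 0;                        // weighted degree of u
    std::unordered_map<int, double> kin;  // edge weight from u into each community
    for (auto [v, w] : adj[u]) { kin[comm[v]] += w; ku += w; }
    int cu = comm[u];
    ctot[cu] -= ku;                       // tentatively detach u from its community
    int best = cu;
    double best_gain = 0;                 // gain 0 == staying isolated / staying put
    for (auto [c, kuc] : kin) {
      // One common form of the modularity gain for joining community c.
      double gain = kuc / m - ctot[c] * ku / (2 * m * m);
      if (gain > best_gain) { best_gain = gain; best = c; }
    }
    ctot[best] += ku;                     // commit the best move (possibly back)
    if (best != cu) { comm[u] = best; moved = true; }
  }
  return moved;
}
```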
https://gist.github.com/wolfram77/7e72c9b8c18c18ab908ae76262099329
Survey for extra-child-process package : NOTES (Subhajit Sahu)
Useful additions to inbuilt child_process module.
📦 Node.js, 📜 Files, 📰 Docs.
Please see attached PDF for literature survey.
https://gist.github.com/wolfram77/d936da570d7bf73f95d1513d4368573e
Dynamic Batch Parallel Algorithms for Updating PageRank : POSTER (Subhajit Sahu)
For the PhD forum an abstract submission is required by 10th May, and poster by 15th May. The event is on 30th May.
https://gist.github.com/wolfram77/692d263f463fd49be6eb5aa65dd4d0f9
Abstract for IPDPS 2022 PhD Forum on Dynamic Batch Parallel Algorithms for Up... (Subhajit Sahu)
For the PhD forum an abstract submission is required by 10th May, and poster by 15th May. The event is on 30th May.
https://gist.github.com/wolfram77/1c1f730d20b51e0d2c6d477fd3713024
Fast Incremental Community Detection on Dynamic Graphs : NOTES (Subhajit Sahu)
In this paper, the authors describe two approaches for dynamic community detection using the CNM algorithm. CNM is a hierarchical, agglomerative algorithm that greedily maximizes modularity. They define two approaches: BasicDyn and FastDyn. BasicDyn backtracks merges of communities until each marked (changed) vertex is its own singleton community. FastDyn undoes a merge only if the quality of merge, as measured by the induced change in modularity, has significantly decreased compared to when the merge initially took place. FastDyn also allows more than two vertices to contract together if in the previous time step these vertices eventually ended up contracted in the same community. In the static case, merging several vertices together in one contraction phase could lead to deteriorating results. FastDyn is able to do this, however, because it uses information from the merges of the previous time step. Intuitively, merges that previously occurred are more likely to be acceptable later.
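A tiny C++ sketch of the FastDyn backtracking rule as I read it (the struct, names, and tolerance parameter are my own invention, not from the paper): a merge from the previous time step is kept only if its modularity gain on the updated graph has not dropped too far below the gain recorded when the merge was made.

```cpp
#include <vector>

// A community merge recorded in the previous time step, with the modularity
// gain it produced at that time (all names here are my own bookkeeping).
struct Merge {
  int a, b;          // the two communities that were merged
  double gain_then;  // modularity gain when the merge was made
};

// Keep a previous merge only if its gain on the updated graph has not
// dropped significantly (below tol * gain_then); otherwise undo it and let
// its members re-enter the local-moving phase as separate communities.
std::vector<Merge> keep_or_undo(const std::vector<Merge>& history,
                                double (*gain_now)(int, int), double tol) {
  std::vector<Merge> kept;
  for (const Merge& m : history)
    if (gain_now(m.a, m.b) >= tol * m.gain_then)
      kept.push_back(m);
  return kept;
}
```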
https://gist.github.com/wolfram77/1856b108334cc822cdddfdfa7334792a
The Concurrency Challenge : Notes
Source: ResearchGate, https://www.researchgate.net/publication/220650851
Article in IEEE Design and Test of Computers, August 2008. DOI: 10.1109/MDT.2008.110.
The Concurrency Challenge
Wen-mei Hwu, University of Illinois at Urbana-Champaign
Kurt Keutzer, University of California, Berkeley
Timothy G. Mattson, Intel

Editor's note: The evolutionary path of microprocessor design includes both multicore and many-core architectures. Harnessing the most computing throughput from these architectures requires concurrent or parallel execution of instructions. The authors describe the challenges facing the industry as parallel-computing platforms become even more widely available. (David C. Yeh, Semiconductor Research Corp.)
THE SEMICONDUCTOR INDUSTRY has settled on two main trajectories for designing microprocessors. The multicore trajectory began with two-core processors, with the number of cores doubling with each semiconductor process generation. A current exemplar is the recent Intel Core 2 Extreme microprocessor with four processor cores, each of which is an out-of-order, multiple-instruction-issue processor supporting the full X86 instruction set. The many-core trajectory begins with a large number of far smaller cores and, once again, the number of cores doubles with each generation. A current example is the Nvidia GeForce 8800 GTX graphics processing unit (GPU) with 128 cores, each of which is a heavily multithreaded, single-instruction-issue, in-order processor that shares its control and instruction cache with seven other cores.
Although these processors vary in their microarchitectures, they impose the same challenge: to benefit from the continued increase of computing power according to Moore's law, application software must support concurrent execution by multiple processor cores. Unless software adapts to utilize these parallel systems, the fundamental value proposition behind the semiconductor and computer industries will falter.

In this article, we identify the challenges underlying the current state of the art of parallel programming and propose an application-centric methodology to protect application programmers from these complexities in future parallel-application developments. For these multiple-core processors to be useful, the software industry must adapt and embrace concurrency. Successful programming environments for these processors must be application centric. Concurrent programming environments must be natural and must protect the application programmer from as many hardware idiosyncrasies as possible. Finally, solutions should not be ad hoc but should be derived according to well-established engineering and architectural principles.
Today’s major programming models
Microprocessors have long employed concurrency at the hardware level. They have taken one of the four approaches shown in Figure 1 to hide the complexity of parallel execution from programmers [1]. In this figure, the top horizontal line depicts the interface to human, high-level programmers; the middle line depicts the machine-level programming interface to compilers and debuggers.
In Figure 1a, the instruction-set architecture (ISA) and its hardware implementation completely hide the complexity of concurrent execution from the human programmer and the compiler. This model is used by superscalar processors, such as the Pentium 4, where the hardware's execution of instructions in parallel and out of their program order is hidden by a sequential retirement mechanism. The execution state exposed through the debuggers and exception handlers to the human programmers is completely that of sequential execution. Thus, programmers never deal with any complexity or incorrect execution results due to concurrent execution. This is still the programming model for the vast majority of programmers today.
Figure 1b illustrates a model where parallel execution is exposed at the machine programming level to the compilers and object-level tools, but not to the human programmers. The most prominent microprocessors based on this model are VLIW (very long instruction word) and Intel Itanium Processor Family processors. These microprocessors use traditional sequential-programming models such as C and C++. A parallelizing compiler identifies the parallel-execution opportunities and performs the required code transformations. Programmers must recompile their source and, in some cases, rewrite their code to use algorithms with more inherent parallelism. Sometimes, they also need to adapt their coding style with constructs that the parallelizing compiler can more effectively analyze and manipulate. Vendors often provide critical math libraries as a domain API implemented at the machine programming level to further enhance their processors' execution efficiency. The complexities of parallel execution are exposed to human programmers whenever they need to use debuggers or when they need to optimize their code, because the hardware does not maintain a sequential state. Acceptance of these processors in the general-purpose market has been hampered by the need to recompile source code and the need to deal with parallel execution complexities when debugging and optimizing programs.
Figure 1c indicates an approach that has been modestly successful in particular application domains such as computer graphics, image processing, and network protocol processing. In this approach, programmers use a language that is naturally adapted to the needs of a particular application domain and make the expression of parallelism natural; in some cases, it is even more natural than with a traditional sequential language. A good example is the network programming language Click and its parallel extension, NP-Click [2].

There are three main challenges with this approach. First, programmers resist new languages, but this can be ameliorated by choosing a popular language in the application domain. Second, these programming models are almost always tailored to a particular multiprocessor family. Programmers are reticent to tie their applications so strongly to a single architectural family. Finally, today's increasingly complex systems require assembly of diverse application domains. Integrating a programming environment with a variety of domain-specific languages is an unsolved problem.
All three models have seen some success in their target markets. The approach in Figure 1a is waning because of the power demanded in architectures that use hardware to manage parallelism. The approach in Figure 1b generally fails to extract sufficient amounts of parallelism for multiple-core microprocessors. The successes of Figure 1c have been too narrow to capture widespread interest. Thus, the approach in Figure 1d dominates current practice in programming multiple-core processors. For general-purpose parallel programming, MPI (Message Passing Interface) and OpenMP are the main parallel-programming models. The new CUDA (Compute Unified Device Architecture) language from Nvidia lets programmers write massively threaded parallel programs without dealing with the graphics API functions [3]. These models force application developers to explicitly write parallel code, either by parallelizing individual application programs or by defining their applications in terms of multiple independent programs that run concurrently on multiple-processor cores. This seriously impacts the costs of developing, verifying, and supporting software products. This drastic shift to explicit parallel programming suggests that the concurrency revolution is primarily a software revolution [4].

[Figure 1. Current programming models of major semiconductor programmable platforms: hardware hides all complexity of concurrent execution (a); parallelizing compilers and tools help contain the complexity of concurrent execution (b); domain-specific languages and environments let programmers express concurrency in ways natural to a particular domain (c); parallel programming models require programmers to write code for a particular model of concurrent execution (d). (CUDA: Compute Unified Device Architecture; MPI: Message Passing Interface.) (Source: W. Hwu et al., "Implicit Parallel Programming Models for Thousand-Core Microprocessors," Proc. 44th Design Automation Conf. (DAC 07), ACM Press, 2007, pp. 754-759. © 2007 ACM Inc. Reprinted by permission.)]
Underlying problems of applications software
Creating applications software is more of an art than an engineering discipline. Compare software engineering to a more established engineering discipline such as civil engineering. Civil engineers build bridges using established practice well-founded in the material sciences and basic physics. Civil engineers create detailed plans of their bridges and have them constructed independently. However, software engineers don't have a well-established practice to follow. By analogy to civil engineering, software engineers build bridges to the best of their understanding and experience. Some of these "bridges" stay up, and other engineers try to understand the principles that allowed them to stay up. Some fall down, and other engineers try to understand the design flaws that caused them to fall down. But the reproducible practices, the underlying architectural principles and fundamental laws, are poorly understood.

Thus, we come to the problem of concurrent software engineering with a weak engineering foundation in sequential software engineering. Any long-term solution to this problem will need to establish reproducible practices and software architectures as opposed to programs that are hand-crafted point solutions. Parallel programming is complicated [5], and most programmers are not prepared to deal with that complexity. As a result, after 30 years of hard work in the high-performance computing community, only a tiny fraction of programmers write parallel software. The complexities of parallel programming impact every phase of supporting a software product, from code development and verification to user support and maintenance.

By reviewing some of the most pertinent challenges of parallel programming, we can determine the new engineering principles and software architectures required for addressing these challenges.
Understanding the system
A program implies a control flow, data structures, and memory management. These combine to create a huge number of states that a software system can occupy at any point in time. This is why software systems tend to be far more difficult to debug and verify than hardware systems. Concurrency multiplies this with an additional set of issues arising from the multiple threads of control that are active and making progress at any given time. Understanding the behavior of the entire system against the backdrop of so many states is difficult. We have few conceptual tools beyond Unified Modeling Language (UML) sequence diagrams to help us. We are largely left to puzzle through complex interactions in our heads or on a whiteboard.
Discovering concurrency
To write a parallel application program, developers must identify and expose the tasks within a system that can execute concurrently. Once these tasks are available, their granularity, their ordering constraints, and how they interact with the program's data structures must be managed. Typically, the burden of discovering and managing concurrency falls on programmers, forcing them to think carefully about the entire application's architecture and their algorithms' concurrent behavior.
Multithreaded reading
Programmers write code and then read it, building an execution model in their minds. How they read a program affects how they understand a program and how they write code in the first place. In a serial program, there is a single thread of control working through their models. The current generation of programmers has been trained to think in these terms and read their programs with this single-threaded approach. A concurrent program, however, requires a multithreaded reading. At any point in the code, programmers must understand the state of the data structures and the flow of control in terms of every semantically allowed interleaving of statements. With current programming tools, this interleaving is not captured in the program's text. A programmer must imagine the interplay of parallel activities and ensure that they correspond to the given functional specification.
Nondeterministic execution
The system state in a concurrent program depends on the way execution overlaps across processor cores. Because the detailed scheduling of threads is controlled by the system and not the programmer, this leads to a fundamental nondeterminism in program execution. Debugging and maintaining a program in the face of nondeterminism is extremely difficult.
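A classic demonstration of this nondeterminism (a generic illustration, not from the article): two unsynchronized threads increment a shared counter, and the lost updates make the final value vary from run to run.

```cpp
#include <iostream>
#include <thread>

int main() {
  int counter = 0;  // shared and unprotected: this is formally a data race
  auto work = [&counter] {
    for (int i = 0; i < 100000; ++i) ++counter;  // load, add, store: interleavable
  };
  std::thread t1(work), t2(work);
  t1.join();
  t2.join();
  // Lost updates make the result vary between runs; it is usually < 200000.
  std::cout << "counter = " << counter << '\n';
  return 0;
}
```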
Interactions between threads
Threads in all but the most trivial parallel programs interact. They pass messages, synchronize, update shared data structures, and compete for system resources. Reasoning about interactions between threads is unique to parallel programming and a topic about which most programmers have little or no training.
Testing and verification
Applications include test suites to verify correctness as they are optimized, ported to new platforms, or extended with new features. Developers face two major challenges when moving test suites to a parallel platform. First, a parallel test suite must exercise a range of thread counts and a more diverse range of hardware configurations. Second, the tests must deal with rounding errors that change as the number and scheduling of threads modify the order of associated operations. Mathematically, every semantically allowed ordering of statements is equally valid, so it is incorrect to pick one arbitrary ordering (such as the serial order) and insist on bitwise conformity to that result. Addressing these two issues may require major changes to the testing infrastructure, which can be a formidable task if the test suite's original developers are no longer available to help.
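A small illustration of the rounding-error point (generic, not from the article): because floating-point addition is not associative, summing the same values in two different orders, as two different thread schedules would, gives results that differ in the low bits, yet both are mathematically valid.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
  std::vector<float> x;
  for (int i = 0; i < 1000000; ++i) x.push_back(1.0f / (i + 1));
  // The same values summed in two orders, as two thread schedules might do.
  float fwd = std::accumulate(x.begin(), x.end(), 0.0f);
  float rev = std::accumulate(x.rbegin(), x.rend(), 0.0f);
  std::printf("forward = %.7f, reversed = %.7f\n", fwd, rev);  // typically differ
  return 0;
}
```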
Distributed development teams
Modern software development is a team effort. In most cases, these teams are distributed across multiple time zones and organizations. Changes in data structures and high-level algorithms often cause subtle problems in modules owned by other team members. Managing those distributed teams of programmers is difficult for sequential and concurrent programming. But concurrent programs add complexities of their own. First, concurrent modules do not compose very well. With the commonly available parallel-programming environments, it's not always possible to use parallel components to build large parallel systems and expect correct and efficient execution. Program components that behave reasonably well in isolation can result in serious problems when they are combined into a concurrent program [4]. This makes traditional approaches based on components created and tested in isolation and later assembled into a larger system untenable.
Optimization and porting
The purpose of parallel programming is increased performance. Hence, it is fundamentally a program optimization problem. For sequential programming, the architectural parameters involved in performance optimizations are well-understood in terms of commonly accepted hardware models. The optimization problem is difficult, but the common models with a well-known hierarchy of latencies (from registers to cache to memory) support a mature optimization technology folded into the compilers. For parallel programming, the state of affairs is far less mature. The memory hierarchies are radically different between platforms. How cores are connected on a single processor and how many-core processors are connected into a larger platform vary widely. The optimization problem is consequently dramatically more complicated.

Current tools for concurrent programming put the burden of optimization largely on the programmer. To optimize a parallel program today, a programmer must understand the capabilities and constraints of the specific hardware and perform optimizations, determining their effects with tedious and error-prone experimentation. Each time the software is moved between platforms, this experimentation must be repeated. To a generation of programmers used to leaving optimization to the compiler, this presents a huge hurdle to the adoption of parallel programming.
Features versus performance
Parallel programming has been largely an activity in the high-performance computing (HPC) community, where the culture is to pursue performance at all costs. But for software vendors working outside HPC, performance takes a back seat to adding and supporting new features. Consumers purchase software on the basis of the available features and how well these features are implemented, not execution speed. New concurrent-programming tool chains for software vendors must keep this perspective in mind.
Economic considerations
Historically, the complexities of parallel programming have limited its commercial adoption. Understanding this history is critical if we want to succeed this time around. Figure 2a illustrates the classical interface between hardware and software [1,6]. The top triangle illustrates the design space to explore, to find a particular application solution. The bottom triangle illustrates the design space that must be explored to find the right target hardware architecture to host the application.

[Figure 2. Magnification of value and complexity exposed to the programmers: classic view of hardware and software interface (a); weighted view according to associated revenues (b). (ISA: instruction-set architecture.) (Source: W. Hwu et al., "Implicit Parallel Programming Models for Thousand-Core Microprocessors," Proc. 44th Design Automation Conf. (DAC 07), ACM Press, 2007, pp. 754-759. © 2007 ACM Inc. Reprinted by permission.)]

Implicit in Figure 2a is the misleading notion that the hardware search space is as important as the application search space. This is not the case. The triangles in Figure 2b are sized according to the relative size of the two industries. According to the US Department of Commerce, the 2005 revenue of the semiconductor industry was $138 billion, whereas the revenue of the applications industry was $1,152 billion. In economic terms, it is the hardware architectures that should serve the needs of programmers, who in turn serve the needs of the application industry. This suggests that to meet the concurrency challenge, future hardware solutions must be designed from an application- and software-centric point of view.

An additional consideration is the increasing influence of consumer and media applications. The consumer electronics market often facilitates a winner-take-all dynamic, resulting in reduced diversity. Ultimately, this could reduce the number of platforms that programmers need to support.
Addressing the challenge
From these observations, we define three key principles that underlie our approach to the concurrency challenge:

- We must approach the concurrency challenge in an application-driven and software-centric manner.
- The best way to deal with concurrency problems is to obviate them by using application development frameworks, parallel libraries, and tools. Compilers and autotuners should do the heavy lifting so that programmers can quickly, or even automatically, derive high-performance solutions to new problems.
- When application programmers must face the concurrency challenge directly, we should help them employ a sound engineering approach, so that they can create a design in terms of well-understood software architectures with proven, replicable solutions appropriate to that architecture. Programming models play an important role, but they add value only to the extent that they support the chosen software architecture.
We strive to shield the vast majority of application programmers from the complexity of parallel programming. We advocate an application development model where programmers classify their problems within an application-oriented taxonomy that leads them naturally to known solution strategies. They then turn these strategies into code by using frameworks or by adapting program skeletons that are designed and implemented by expert parallel programmers. As these application codes move through the tool chain, compilers and autotuning tools optimize the code for high-performance, parallel execution.
Application frameworks
Application programmers should address their parallel-software problems in familiar terms. Fortunately, many applications (games, media, signal processing, and financial analytics) are naturally concurrent. In many cases, application developers are already familiar with the concurrency in their problems; they just need technologies to express that concurrency in ways natural to their application domain.

Consider a photo-retrieval application. An application developer solving a face-recognition problem would use an image-processing application framework that supports primitives for feature extraction and classification. The application developer selects the primitives and specifies the interactions between them to form a solution to the face-recognition problem. Feature extraction and classification algorithms contain ample opportunities for data parallelism, but there is no reason to trouble the application developer with these details. The framework engineers that develop and support these primitives will manage any complications due to the concurrency.
Linking applications and programming
We have been developing a methodology to systematically link application developers to the appropriate concurrent software frameworks. We identified a set of recurrent application solution strategies that capture a developer's strategy for a particular solution, yet are general enough to permit a variety of implementations specific to the needs of a particular problem and target architecture. We call these strategies dwarfs [7]. Conceptually, they form a platform (the arrow in Figure 2b) that provides a high-level link between applications and hardware.

We have identified 13 dwarfs: dense linear algebra, sparse linear algebra, spectral methods, n-body methods, structured grids, unstructured grids, MapReduce, combinational logic, graph traversals, dynamic programming, backtrack branch and bound, graphical models, and finite-state machines.
In the face-recognition problem, the MapReduce dwarf supports image-feature extraction well. An engineer working on image-feature extraction would choose MapReduce as the basic solution strategy. A collection of different MapReduce implementations would be encapsulated into a single MapReduce programming framework for feature extraction [8]. Likewise, developers working in quantum chemistry are well-versed in dense linear algebra and understand how their problems are decomposed into dense linear-algebra operations, just as programmers working in game physics understand their problem as a composition of n-body and structured-grid solutions.
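As a sketch of what such a MapReduce-style primitive might look like in C++ (all names here are hypothetical placeholders, not any real framework's API): an independent map over image tiles followed by a reduction combining the per-tile results, which is exactly the shape a framework can parallelize on the programmer's behalf.

```cpp
#include <cstdio>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical tile type and map step; the names are placeholders.
struct Tile { std::vector<unsigned char> pixels; };

double extract_feature(const Tile& t) {
  return static_cast<double>(t.pixels.size());  // stand-in feature computation
}

double feature_score(const std::vector<Tile>& tiles) {
  // Map (extract_feature, independent per tile) + reduce (sum): the framework
  // is free to run the map step across cores without the caller knowing.
  return std::transform_reduce(tiles.begin(), tiles.end(), 0.0,
                               std::plus<>{}, extract_feature);
}

int main() {
  std::vector<Tile> tiles(4, Tile{std::vector<unsigned char>(64)});
  std::printf("score = %f\n", feature_score(tiles));
  return 0;
}
```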
The ParLab group at the University of California, Berkeley has worked with application communities such as HPC, computational music, and visual computing to grow the well-known list of seven dwarfs of high-performance computing to 13 [7]. Is this list of 13 dwarfs complete? Probably not, but there's no barrier to adding more.
Programming support
Although dwarfs provide a critical link between application-level problems and their solution strategies, translating a solution strategy described by dwarfs into working parallel code requires many supporting steps. Fortunately, we can ease this task with programming frameworks and skeletons, which raise the abstraction level and make the programming process considerably easier. In the examples we have discussed so far, a generic MapReduce programming framework could easily be tailored to implement feature extraction, and a broadly applicable dense-linear-algebra framework could serve to craft quantum chemistry solutions. A higher-level framework could define how these solutions are assembled into an entire application.

We envision that parallel applications (such as a media-search engine) will be expressed in application frameworks (such as an image-manipulation framework) and a library of primitives (such as image retrieval). Each primitive will be implemented with one or more well-documented solution strategies (such as the MapReduce dwarf) and corresponding skeletons that the programmer can adapt to implement a particular application. A rich set of frameworks will be developed and documented to cover the needs of applications. These programming frameworks will likely be built on languages such as C or C++ with OpenMP and MPI.

Although our long-term goal is to raise the abstraction level and spare programmers from the details of managing concurrency, we realize we cannot provide complete coverage with frameworks. Some application programmers will inevitably need to program at a low level, and hence, parallel programming models will continue to be an important part of programming systems.

Current parallel-programming models such as MPI, OpenMP, and the Partitioned Global Address Space (PGAS) languages have all been shown to work for both expert parallel programmers and application programmers. We expect that as many-core systems evolve, these programming models will evolve as well, not only to better match the hardware but also to better support the frameworks we envision.
For example, the next release of OpenMP (OpenMP 3.0) will include a flexible task queue mechanism [9]. This greatly expands the range of algorithms (and dwarfs) that can be addressed with OpenMP. We anticipate a healthy interplay between frameworks and programming models, so that they can evolve together to better meet the needs of application programmers.
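For a concrete flavor of the OpenMP 3.0 task mechanism mentioned here, the sketch below walks a linked list, an irregular workload that pre-3.0 worksharing loops handled poorly: one thread creates a task per node and the team executes the tasks concurrently (compile with -fopenmp or equivalent).

```cpp
#include <cstdio>
#include <omp.h>

struct Node { int value; Node* next; };

void process(Node* n) {
  std::printf("node %d handled by thread %d\n", n->value, omp_get_thread_num());
}

void walk(Node* head) {
  #pragma omp parallel
  #pragma omp single          // one thread walks the list and creates tasks
  for (Node* n = head; n; n = n->next) {
    #pragma omp task firstprivate(n)  // each node becomes an independent task
    process(n);
  }                           // barrier at the end of single: all tasks finish
}

int main() {
  Node c{3, nullptr}, b{2, &c}, a{1, &b};
  walk(&a);
  return 0;
}
```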
Code-optimization space exploration
Both frameworks and programming models will require lower-level support to create efficient code. For example, for a given skeleton and user-supplied fill-in code, there are multiple ways to exploit concurrency; the best choice is often determined by the underlying hardware organization.

Ryoo et al., working with the CUDA programming model and an Nvidia 8800 GPU, found that different plausible rearrangements of code resulted in significant performance differences [10]. Different allocations of resources among the large number of concurrent threads led to nonlinear performance results as several application kernels were optimized. Even a simple computational kernel had multiple dimensions of optimization based on, for example, loop tiling, thread work assignment, loop unrolling, or data prefetch.
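To make one of these optimization dimensions concrete, here is a toy C++ sketch (mine, not from Ryoo et al.): a matrix-multiply kernel parameterized by tile size, plus a brute-force sweep that times each candidate and keeps the fastest, which is the simplest possible autotuner. A real autotuner would sweep several dimensions at once and prune the space with performance models.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <vector>

// n x n matrix multiply, blocked by a tile size T (one optimization knob).
void matmul_tiled(const std::vector<double>& A, const std::vector<double>& B,
                  std::vector<double>& C, int n, int T) {
  for (int ii = 0; ii < n; ii += T)
    for (int kk = 0; kk < n; kk += T)
      for (int jj = 0; jj < n; jj += T)
        for (int i = ii; i < std::min(ii + T, n); ++i)
          for (int k = kk; k < std::min(kk + T, n); ++k) {
            double a = A[i * n + k];
            for (int j = jj; j < std::min(jj + T, n); ++j)
              C[i * n + j] += a * B[k * n + j];
          }
}

int main() {
  const int n = 256;
  std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n);
  int best_T = 0;
  double best_ms = 1e30;
  for (int T : {8, 16, 32, 64, 128}) {  // the parameter space being explored
    std::fill(C.begin(), C.end(), 0.0);
    auto t0 = std::chrono::steady_clock::now();
    matmul_tiled(A, B, C, n, T);
    auto t1 = std::chrono::steady_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    std::printf("tile = %3d: %7.2f ms\n", T, ms);
    if (ms < best_ms) { best_ms = ms; best_T = T; }
  }
  std::printf("best tile size: %d\n", best_T);
  return 0;
}
```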
Figure 3 shows the execution time of different versions of SAD (sum of absolute differences between elements of two image areas) [10]. Each curve shows the effect of varying the thread granularity while keeping other optimization parameters fixed. There is a 4x spread across these different choices of optimization parameters. Taking these factors into account is critical to the success of future programming frameworks. However, it is unreasonable to expect a human programmer to explore such a complex optimization space.

[Figure 3. Performance profiles for different optimization parameters for a single SAD (sum of absolute differences between elements of two image areas) program. (Source: S. Ryoo et al., "Program Optimization Space Pruning for a Multithreaded GPU," Proc. 6th Ann. IEEE/ACM Int'l Symp. Code Generation and Optimization (CGO 08), ACM Press, 2008, pp. 195-204. © ACM Inc. Reprinted by permission.)]

We envision that parameterized programming and automatic parameter space explorations guided by performance models (such as autotuners) will be sufficient to achieve high levels of performance with reasonable programmer effort [10]. In particular, autotuners combined with frameworks should help nonexpert programmers achieve optimal performance. Related work has shown that efficient code for multicore microprocessors from Intel and AMD can be derived using this methodology [11].
TO BE SUCCESSFUL, the programming frameworks we describe must be retargetable to a range of multiple-core architectures and produce high-performance code. To address these often conflicting goals, we envision that frameworks and parallel-programming models will be supported by powerful parallel-compilation technology and autotuning tools to help automatically optimize a solution for a particular platform. This will require frameworks that are parameterized and annotated to describe the computation's fundamental properties. A new generation of parallel compilers and autotuning tools will use these parameterizations and annotations to optimize software well beyond the capabilities of current-generation compilers and tools.
We understand those who question whether our proposed application-centric approach, even if it works well in particular application domains, will ever cover the full range of future applications. We answer these skeptics in two ways. First, we believe that a consumer-oriented, winner-take-all market dynamic will work to our advantage. Just as consolidation reduced the number of players in consumer electronics, over time the same effect will occur for consumer-oriented application software. We've seen this already, for example, in the dominance of Microsoft products in the office-productivity domain. The same effect will happen over time in other application domains. For example, videogame software suppliers (such as Electronic Arts) are playing a significant role in the software industry, as are their console makers (such as Sony) and their semiconductor suppliers (such as Nvidia). They are all influential in their respective industries. Over time, these roles might evolve into positions of dominance and the subsequent consolidation will make it easier for a few application frameworks to support an entire industry.

Second, we don't need 100% application coverage to have a major impact. There will always be classes of applications that will not decompose into well-understood problems (such as dwarfs) with well-understood parallel solutions. In most cases, the same features that make these applications hard to decompose into dwarfs will also make them hard to parallelize. Hence, there will always be a need for research on new methodologies, compiler techniques, and tools to address these pathologically difficult applications.
Acknowledgments
We acknowledge the support of the Gigascale Systems Research Center (GSRC).
References
1. W. Hwu et al., "Implicit Parallel Programming Models for Thousand-Core Microprocessors," Proc. 44th Design Automation Conf. (DAC 07), ACM Press, 2007, pp. 754-759; http://doi.acm.org/10.1145/1278480.1278669.
2. N. Shah et al., "NP-Click: A Productive Software Development Approach for Network Processors," IEEE Micro, vol. 24, no. 5, Sept./Oct. 2004, pp. 45-54.
3. NVIDIA CUDA Compute Unified Device Architecture Programming Guide, v1.0, Nvidia Corp., June 2007, http://developer.download.nvidia.com/compute/cuda/1_0/NVIDIA_CUDA_Programming_Guide_1.0.pdf.
4. H. Sutter and J. Larus, "Software and the Concurrency Revolution," ACM Queue, vol. 3, no. 7, Sept. 2005, pp. 54-62.
5. T.G. Mattson, B.A. Sanders, and B.L. Massingill, Patterns of Parallel Programming, Addison-Wesley Professional, 2004.
6. K. Keutzer et al., "System Level Design: Orthogonalization of Concerns and Platform-Based Design," IEEE Trans. Computer-Aided Design, vol. 19, no. 12, Dec. 2000, pp. 1523-1543.
7. K. Asanovic et al., The Landscape of Parallel Computing Research: A View from Berkeley, tech. report UCB/EECS-2006-183, EECS Dept., Univ. of California, Berkeley, 2006.
8. J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," Proc. 6th Symp. Operating System Design and Implementation (OSDI 04), Usenix, 2004, pp. 137-150.
9. "OpenMP Application Program Interface," OpenMP Architecture Review Board, May 2005, http://www.openmp.org.
10. S. Ryoo et al., "Program Optimization Space Pruning for a Multithreaded GPU," Proc. 6th Ann. IEEE/ACM Int'l Symp. Code Generation and Optimization (CGO 08), ACM Press, 2008, pp. 195-204; http://doi.acm.org/10.1145/1356058.1356084.
11. S. Williams et al., "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms," Proc. ACM/IEEE Conf. Supercomputing (SC 07), ACM Press, 2007, article 38.
Direct questions and comments about this article to Wen-mei Hwu, Univ. of Illinois at Urbana-Champaign, Coordinated Science Lab, 1308 W. Main St., Urbana, IL 61801; w-hwu@uiuc.edu.