The document discusses techniques for speculatively executing load instructions in superscalar processors. It proposes combining value prediction to resolve register dependencies and a Load Forwarding History Table (LFHT) to resolve memory dependencies. Value prediction predicts operand values to allow early execution of load address calculations. The LFHT tracks past instances of load-store forwarding to identify loads that can speculatively execute before prior stores. Together these allow more loads to issue earlier, achieving around a 15% speedup on average over the baseline architecture.
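The LFHT idea can be illustrated with a small sketch. This is a hedged approximation, not the paper's design: the table size, indexing scheme, and use of 2-bit saturating counters are my assumptions.

```python
# Hypothetical sketch of a Load Forwarding History Table (LFHT).
# Entry count, indexing, and counter width are illustrative assumptions.

class LFHT:
    def __init__(self, entries=256):
        self.entries = entries
        # 2-bit saturating counter per entry: a high value means this load
        # has recently forwarded from an in-flight store, so it should wait.
        self.counters = [0] * entries

    def _index(self, load_pc):
        return load_pc % self.entries

    def may_speculate(self, load_pc):
        # Predict "safe to issue before prior stores" when the counter is low.
        return self.counters[self._index(load_pc)] < 2

    def update(self, load_pc, forwarded):
        # Train on the observed outcome of each load instance.
        i = self._index(load_pc)
        if forwarded:  # the load actually needed a prior store's value
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

A load predicted safe issues speculatively; a misprediction trains the counter so the same static load waits next time.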
In a simultaneous multithreaded system, a core’s pipeline resources are sometimes partitioned and otherwise shared amongst numerous active threads. One mutual resource is the write buffer, which acts as an intermediary between a store instruction’s retirement from the pipeline and the store value being written to cache. The write buffer takes a completed store instruction from the load/store queue and eventually writes the value to the level-one data cache. Once a store is buffered with a write-allocate cache policy, the store must remain in the write buffer until its cache block is in level-one data cache. This latency may vary from as little as a single clock cycle (in the case of a level-one cache hit) to several hundred clock cycles (in the case of a cache miss). This paper shows that cache misses routinely dominate the write buffer’s resources and deny cache hits from being written to memory, thereby degrading performance of simultaneous multithreaded systems. This paper proposes a technique to reduce denial of resources to cache hits by limiting the number of cache misses that may concurrently reside in the write buffer and shows that system performance can be improved by using this technique.
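The proposed policy can be sketched as an admission check on the write buffer. The capacity, the miss limit, and the entry format here are illustrative assumptions, not the paper's parameters.

```python
# Illustrative sketch of the proposed policy: cap how many cache-missing
# stores may occupy the write buffer at once, so long-latency misses
# cannot crowd out stores that would hit in the level-one cache.

class WriteBuffer:
    def __init__(self, capacity=8, miss_limit=2):
        self.capacity = capacity
        self.miss_limit = miss_limit
        self.entries = []  # list of (address, is_miss) tuples

    def try_insert(self, address, is_miss):
        if len(self.entries) >= self.capacity:
            return False  # buffer full: the store stalls in the LSQ
        buffered_misses = sum(1 for _, m in self.entries if m)
        if is_miss and buffered_misses >= self.miss_limit:
            return False  # too many long-latency misses already buffered
        self.entries.append((address, is_miss))
        return True
```

With the limit in place, a burst of misses from one thread leaves buffer slots free for other threads' cache hits.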
The document proposes a fully centralized and partially distributed load balancing algorithm that dynamically distributes tasks from a master processor to slave processors organized into communicators. The master processor monitors the workload and response time of each communicator to dynamically map additional tasks as communicators complete their work, improving resource utilization and response time. The algorithm forms a matrix to track the workload and response time of each communicator for different task types to aid the master processor in optimally balancing the load over time.
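The master's bookkeeping can be sketched as follows. The matrix layout and the cost heuristic are illustrative assumptions; the paper's exact mapping policy is not reproduced.

```python
# Hedged sketch of the master processor's bookkeeping: a matrix of
# average response times per (communicator, task type), consulted when
# mapping the next task.

class Master:
    def __init__(self, communicators, task_types):
        # matrix[c][t] = running average response time of communicator c
        # on task type t (0.0 means "no observation yet").
        self.matrix = {c: {t: 0.0 for t in task_types} for c in communicators}
        self.pending = {c: 0 for c in communicators}  # outstanding tasks

    def assign(self, task_type):
        # Prefer the communicator with the lowest estimated backlog:
        # outstanding work weighted by its history on this task type.
        def cost(c):
            return self.pending[c] * max(self.matrix[c][task_type], 1.0)
        chosen = min(self.matrix, key=cost)
        self.pending[chosen] += 1
        return chosen

    def report(self, comm, task_type, elapsed):
        # Called when a communicator finishes a task; refresh the matrix.
        self.pending[comm] -= 1
        old = self.matrix[comm][task_type]
        self.matrix[comm][task_type] = elapsed if old == 0.0 else (old + elapsed) / 2
```

As communicators report completions, the matrix steers new tasks toward the historically faster, less loaded ones.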
The document proposes a new multithreaded execution model and multi-ring architecture to exploit instruction-level parallelism. The model uses multiple instruction threads that are scheduled for execution based on data availability. The instructions from different threads are grouped together for parallel execution. The proposed multi-ring architecture features large resident thread activations, a high-speed buffer to avoid load/store stalls, and a dynamic instruction scheduler that selects instructions from multiple threads each cycle for execution on multiple pipelines. Initial simulation results show the architecture can achieve parallel execution of 7 instructions per cycle.
The document discusses an instruction-based stride prefetcher that uses a reference prediction table (RPT) to track load instructions and memory addresses. It examines how the maximum prefetch degree affects prefetcher performance. The results show that increasing the maximum degree improves coverage but performance peaks at a degree of 5, providing the highest average speedup of 1.085 compared to no prefetching. While a higher degree increases coverage, prefetching too many blocks negatively impacts performance.
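A minimal sketch of such a prefetcher follows. The RPT entry format and the confidence rule are assumptions for illustration; only the capped degree mirrors the parameter the document studies.

```python
# Minimal sketch of an instruction-based stride prefetcher with a
# reference prediction table (RPT) and a capped prefetch degree.

class StridePrefetcher:
    def __init__(self, max_degree=5):
        self.max_degree = max_degree
        self.rpt = {}  # pc -> {"last": addr, "stride": s, "confident": bool}

    def access(self, pc, addr):
        """Record a load and return the addresses to prefetch (if any)."""
        e = self.rpt.get(pc)
        if e is None:
            self.rpt[pc] = {"last": addr, "stride": 0, "confident": False}
            return []
        stride = addr - e["last"]
        # Gain confidence only when the same nonzero stride repeats.
        e["confident"] = stride != 0 and stride == e["stride"]
        e["stride"], e["last"] = stride, addr
        if not e["confident"]:
            return []
        # Prefetch up to max_degree blocks ahead along the stride.
        return [addr + stride * i for i in range(1, self.max_degree + 1)]
```

Raising `max_degree` issues more prefetches per confident load, which is exactly the coverage-versus-pollution trade-off the results describe.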
The document discusses key concepts in computer architecture including Amdahl's law for predicting speedup from parallelization, branch predictors for improving instruction flow, cache organization and types, hazards that can occur in pipelines, interrupts for responding to events, and jump instructions for breaking standard instruction flow. It provides simplified explanations of these concepts in an accessible way for learning purposes.
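Of the concepts listed, Amdahl's law is easy to make concrete: if a fraction `p` of a program is parallelized across `n` processors, overall speedup is `1 / ((1 - p) + p / n)`.

```python
# Amdahl's law: overall speedup when a fraction p of a program is
# parallelized across n processors (the remaining 1 - p stays serial).

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Example: parallelizing 90% of a program on 8 processors gives
# 1 / (0.1 + 0.9/8) = 1 / 0.2125, roughly 4.7x, not 8x -- the serial
# 10% bounds the achievable speedup.
```

Even with unlimited processors, speedup is capped at `1 / (1 - p)`, which is why the serial fraction dominates the analysis.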
This document summarizes the key components of MavStream, a data stream management system for processing continuous queries over data streams. It discusses the client-server architecture, with the client accepting queries and the server processing queries and returning outputs. The server components include an input processor that generates query plans, an instantiator that initializes operators, a scheduler that schedules operators, feeders that feed streams to operators, various stream operators such as join and aggregation, buffer management to handle mismatches between input and processing rates, and a runtime optimizer that monitors performance and tunes scheduling strategies.
The document summarizes a project report for a MOOC communication backbone system. It discusses a 5 node cluster with a mesh topology that uses MongoDB as a distributed database. The cluster provides course listing, description, and question/answer functionality to clients. It elects leaders using a Floodmax algorithm and ballot IDs. An inter-MOOC voting strategy is also proposed to decide which cluster hosts a competition.
1. The document describes a project to build RESTful web services for a social media application like Pinterest. It includes functionalities implemented, high-level architecture with Python web services, CouchDB database, and client testing with cURL.
2. The web services were built with Bottle microframework and exposed CRUD operations as HTTP methods. Data was stored in CouchDB with a flat schema for each user document.
3. While the schema had limitations of returning full documents, it mapped requests to data through URL traversal and operations through HTTP methods as required for REST.
This document summarizes a paper that proposes and evaluates the performance of a multithreaded architecture capable of exploiting both coarse-grained parallelism and fine-grained instruction-level parallelism. The architecture distributes processing across multiple processing elements connected by an interconnection network. Each processing element supports multiple concurrently executing threads by grouping instructions from different threads. The architecture introduces a distributed data structure cache to reduce network latency when accessing remote data. Simulation results indicate the architecture achieves high processor throughput and the data structure cache significantly reduces network latency.
UCLA CS219 Course Project Report (Prof. George Varghese)
StateKeeper: Generalizing Reachability
Today there are many tools built using formal methods to verify networks. Most of these tools check network specifications and configurations to identify only a limited class of failures, such as reachability, forwarding loops, and slicing. They do not take into account the rate-limiting rules that affect system throughput, nor link delays; both of these quantities determine Quality of Service (QoS). We believe that without decent throughput and affordable delay, a network is effectively dead even if it passes the tests for reachability, forwarding-loop detection, and other such failure classes. We developed StateKeeper, a tool based on ideas from the Atomic Predicates Verifier, which tracks both network performance and network reachability. In our experiments we use link delay and throughput as the performance metrics. StateKeeper allows network operators to verify Quality of Service at each port along a route by showing information about the state of each step.
IRJET — Implementation of MESI Protocol using Verilog (IRJET Journal)
This document describes the implementation of the MESI cache coherence protocol using Verilog. It begins with an abstract that introduces MESI and discusses maintaining data consistency in multiprocessor systems using caches. It then provides background on cache coherence protocols like MI, MSI, and MESI. The rest of the document details the design of a direct-mapped cache with three states (Modified, Shared, Invalid) using Verilog, including the inputs, outputs, and simulation results for operations like reads and writes. The goal is to design a cache that uses the MESI protocol to maintain coherence across multiple processor caches sharing the same memory space.
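The protocol's behavior can be summarized as a state transition table. This is a hedged, generic MESI sketch with made-up event names; it does not reproduce the paper's Verilog design, which implements a three-state subset.

```python
# Hedged sketch of MESI transitions for one cache line, driven by local
# (processor) and remote (bus-snooped) events. Event names are illustrative.

MESI = {
    # (state, event): next state
    ("I", "local_read_shared"): "S",  # another cache already holds the line
    ("I", "local_read_excl"):   "E",  # no other cache holds it
    ("I", "local_write"):       "M",
    ("S", "local_write"):       "M",  # upgrade, invalidating other copies
    ("E", "local_write"):       "M",  # silent upgrade, no bus traffic
    ("M", "remote_read"):       "S",  # write back dirty data, then share
    ("E", "remote_read"):       "S",
    ("M", "remote_write"):      "I",
    ("E", "remote_write"):      "I",
    ("S", "remote_write"):      "I",
}

def next_state(state, event):
    # Events with no entry (e.g. a read hit in S) leave the state unchanged.
    return MESI.get((state, event), state)
```

The Exclusive state is MESI's improvement over MSI: a core that reads a line no one else holds can later write it without any bus transaction.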
Comparative Study on Cache Coherence Protocols (iosrjce)
IOSR Journal of Computer Engineering (IOSR-JCE) is a double-blind, peer-reviewed international journal that provides rapid publication (within a month) of articles in all areas of computer engineering and its applications. The journal welcomes high-quality papers on theoretical developments and practical applications in computer technology. Original research papers, state-of-the-art reviews, and high-quality technical notes are invited for publication.
The document provides an overview of basic concepts related to parallelization and data locality optimization. It discusses loop-level parallelism as a major target for parallelization, especially in applications using arrays. Long running applications tend to have large arrays, which lead to loops that have many iterations that can be divided among processors. The document also covers data locality and how the organization of computations can significantly impact performance by improving cache usage. It introduces concepts like symmetric multiprocessors and affine transform theory that are useful for parallelization and locality optimizations.
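A tiny example shows the loop-level parallelism the document targets: when iterations over an array carry no dependences, they can be divided among workers. The chunking scheme below is an illustrative choice.

```python
# Loop-level parallelism sketch: the iterations of this array loop are
# independent, so contiguous chunks can be handed to separate workers.

from concurrent.futures import ThreadPoolExecutor

def scale(a, factor, workers=4):
    out = [0] * len(a)

    def do_chunk(lo, hi):
        for i in range(lo, hi):  # no iteration reads another's output
            out[i] = a[i] * factor

    step = max(1, len(a) // workers)
    bounds = [(i, min(i + step, len(a))) for i in range(0, len(a), step)]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        for b in bounds:
            ex.submit(do_chunk, *b)
        # leaving the "with" block waits for all chunks to finish
    return out
```

Contiguous chunks also serve the data-locality point: each worker walks a dense region of the array, making good use of its cache lines.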
Cache Coherency Controller for MESI Protocol Based on FPGA (IJECE, IAES)
In modern processor design, manufacturers place more than one processor on a single integrated circuit (chip); each processor is called a core, and the resulting chips are called multi-core processors. This design lets the cores work simultaneously on different jobs, or all in parallel on the same job. All cores are identical in design, each core has its own cache memory, and all cores share the same main memory. So when one core requests a block of data from main memory into its cache, there must be a protocol that declares the status of that block in main memory and in the other cores; this is called the cache coherency (or cache consistency) of a multi-core system. In this paper a special circuit is designed in VHDL (very high speed integrated circuit hardware description language) and implemented using Xilinx ISE software. The protocol used in this design is the Modified, Exclusive, Shared, and Invalid (MESI) protocol. Test results were taken using a test bench and showed that all states of the protocol work correctly.
This document discusses bus-based multiprocessor systems. It begins by defining a bus as a communication system that transfers data between computer components. It then explains that a bus-based multiprocessor connects multiple CPUs to a shared bus and memory. Issues like bus arbitration and caching are discussed to prevent access conflicts and reduce bus overload. Write-through and ownership caching protocols are described for maintaining cache coherency across CPUs. The document provides examples and discusses advantages and disadvantages of different approaches.
Multiprotocol Label Switching (MPLS) speeds up packet forwarding by forwarding packets based on labels, confining routing-table lookups to the label edge routers (LERs), while the label switch routers (LSRs) use the Label Distribution Protocol (LDP) or the Resource Reservation Protocol (RSVP) for label allocation and a label table for packet forwarding. A dynamic protocol is implemented that carries update packets with the details of label switched paths. A feedback mechanism is also introduced that finds the shortest path through the MPLS network and helps overcome congestion; this feedback operates hop by hop rather than end to end, providing a more reliable, faster, and congestion-free path for packets.
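The core LSR operation is a single table lookup on the incoming label. The labels and port names below are made up for illustration.

```python
# Toy sketch of LSR label forwarding: one lookup on the incoming label
# replaces a full routing-table search at every hop.

# in_label -> (out_label, out_port)
label_table = {
    17: (23, "eth1"),
    23: (46, "eth2"),
}

def forward(in_label):
    if in_label not in label_table:
        return None  # no label switched path for this label: drop
    out_label, port = label_table[in_label]
    # The label is swapped and the packet sent out the chosen port.
    return {"label": out_label, "port": port}
```

Because the lookup is an exact match on a short label rather than a longest-prefix match on an IP address, it is cheap to do at line rate.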
This document covers important concepts of process synchronization: background, the critical-section problem, critical regions, synchronization hardware, and semaphores. It is beneficial for engineering students of Aryabhatta Knowledge University of Bihar (A.K.U. Bihar).
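The semaphore material can be illustrated with the classic critical-section example: a binary semaphore serializes updates to a shared counter across threads.

```python
# Critical-section sketch with a binary semaphore: only one thread at a
# time may increment the shared counter, so no updates are lost.

import threading

counter = 0
sem = threading.Semaphore(1)  # binary semaphore guarding the counter

def worker(iterations):
    global counter
    for _ in range(iterations):
        sem.acquire()   # wait (P): enter the critical section
        counter += 1    # critical section: read-modify-write
        sem.release()   # signal (V): leave the critical section

threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter is now exactly 4 * 1000
```

Without the acquire/release pair, two threads could read the same old value and both write back `old + 1`, losing an increment.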
This document proposes two solutions to address issues with live migration of virtual machines (VMs) using DPDK for packet processing.
The first solution maintains separate memory pools for the host and VM. Buffers contain both a structure part stored on the host and a message content part stored in shared memory (SHM). Rings manipulate buffers using indexes instead of pointers. This allows hiding physical addresses from the VM. Memory pools are reinitialized after migration to prevent memory leaks.
The second, simpler solution initializes the receiving DPDK driver queue only after migration. This may introduce more downtime but avoids the memory copy. It still has potential memory-leak issues. Both solutions require further testing.
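The index-based rings of the first solution can be sketched as follows. The ring size and field names are illustrative; the point is that producer and consumer exchange buffer indexes into the shared-memory pool, never raw pointers, so host physical addresses stay hidden from the VM.

```python
# Hedged sketch of an index-based ring: slots carry buffer *indexes*
# into a shared-memory pool instead of pointers.

class IndexRing:
    def __init__(self, size=8):
        self.size = size
        self.slots = [None] * size  # holds buffer indexes, not pointers
        self.head = 0               # next slot to produce into
        self.tail = 0               # next slot to consume from

    def enqueue(self, buf_index):
        if self.head - self.tail == self.size:
            return False            # ring full
        self.slots[self.head % self.size] = buf_index
        self.head += 1
        return True

    def dequeue(self):
        if self.tail == self.head:
            return None             # ring empty
        idx = self.slots[self.tail % self.size]
        self.tail += 1
        return idx
```

Either side resolves an index against its own mapping of the pool, which is what allows the pools to be torn down and reinitialized after migration without dangling pointers.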
This project implements a Disaster Recovery Manager for a data center that monitors virtual machines and recovers them if they fail. It uses the VMware vSphere API to connect to ESXi hypervisors and vCenter. The manager includes components to ping VMs, take periodic snapshots, and recover failed VMs either by reverting snapshots or migrating VMs to a new host. It detects failures by checking for missed heartbeat pings and excludes VMs intentionally powered off by a user from recovery. The manager was implemented in Java using multithreading and allows for conversion between image formats to support multiple hypervisors.
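The failure-detection rule can be sketched in a few lines. The miss threshold and record format are assumptions for illustration, not the project's actual values.

```python
# Illustrative sketch of the manager's failure check: a VM is declared
# failed after missing several consecutive heartbeat pings, unless the
# user powered it off intentionally.

def check_vms(vms, missed_threshold=3):
    """vms: name -> {"missed": int, "user_powered_off": bool}"""
    to_recover = []
    for name, state in vms.items():
        if state["user_powered_off"]:
            continue  # intentional shutdown: exclude from recovery
        if state["missed"] >= missed_threshold:
            to_recover.append(name)  # revert snapshot or migrate to a new host
    return to_recover
```

Requiring several consecutive misses, rather than one, keeps a transient network hiccup from triggering an unnecessary snapshot revert.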
This document discusses using big data tools like Lucene to simplify debugging of failing tests by extracting and analyzing data from large simulation log files. It describes parsing UVM log files and storing message elements in a Lucene database for fast querying. Graphical representations of the log file data are presented to aid analysis, showing messages within a time range or containing specific strings. Using big data tools in this way can shorten debug time and verification schedules.
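The parse-then-query flow can be approximated with a simple in-memory stand-in. The log format below is a common UVM message shape, and this is emphatically not the Lucene API; it only illustrates the kind of time-range and substring queries the document describes.

```python
# Simplified stand-in for the Lucene-backed flow: parse UVM-style log
# lines into records, then query by time range or message substring.

import re

LINE = re.compile(r"UVM_(\w+)\s+@\s+(\d+):\s+(.*)")

def parse(log_text):
    records = []
    for line in log_text.splitlines():
        m = LINE.match(line.strip())
        if m:
            records.append({"severity": m.group(1),
                            "time": int(m.group(2)),
                            "message": m.group(3)})
    return records

def query(records, t_min=0, t_max=float("inf"), contains=""):
    # e.g. "all errors between two timestamps" or "messages mentioning reset"
    return [r for r in records
            if t_min <= r["time"] <= t_max and contains in r["message"]]
```

A real deployment would index the fields in Lucene so the same queries stay fast over multi-gigabyte log files.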
This document contains the second assignment set for the course Operating Systems, consisting of 10 questions worth 6 marks each, for a total of 60 marks. The questions cover topics related to operating systems such as virtual memory, scheduling algorithms, semaphores, and distributed systems. Sample answers are provided for each question that describe key concepts in more detail.
This document summarizes a lecture on speculative multithreading. The lecture discussed two papers: one that categorized hardware support for speculative multithreading, and one that described a software-only approach using transactional memory. The class discussion covered dividing responsibilities between hardware and software, sources of parallelism for speculation beyond just loops, scaling speculation, and programming support options.
Richard Sinnott is an experienced, reliable, and hardworking individual seeking a new opportunity. He has a wide range of experience in areas such as home health care, property maintenance, warehouse/logistics, and electro-mechanical assembly. Mr. Sinnott has excellent problem solving skills and is not afraid to ask questions or climb the leadership ladder to overcome obstacles. He has an excellent work ethic and is able to adapt to different work environments and tasks.
The Can Safe Cover provides hygienic drinking from beverage cans by creating a barrier between the drinker's mouth and the can. It prevents insects and debris from entering open cans and protects users from cuts from sharp can edges. Available in 5 colors, the cover allows users to distinguish their can from others while enjoying the benefits of hygienic drinking, protection from bugs, and safety from cuts.
Datadog provides a SaaS platform for aggregating, correlating, and collaborating on various data streams from development and operations teams. It collects metrics, events, and other data from servers, devices, applications, and third party services. This data is aggregated, correlated, and made available for visualization and analysis to help with capacity planning, issue resolution, and other tasks. Datadog discusses how it has designed its architecture to handle the large volume and variety of data ingested, including using eventual consistency and soft-state designs.
This document presents a guide to traditional zoomorphic dances from different regions of Chile. It provides multiple-choice questions about the names, movements, and animals represented in dances such as El Pavo, El Torito, El Gallinazo, El Pequén, and El Jote. It also includes sections for marking statements as true or false and for listing the typical dances of each geographic zone of the country.
The document discusses the Internet as a global communication network that connects computers around the world to exchange data. It explains that the Internet can be used to search for information through engines such as Google and Yahoo, and offers advice on using it safely and respectfully, such as not opening inappropriate pages, protecting one's privacy, and not cheating.
A Novel Approach for Detection of Routes with Misbehaving Nodes in MANETs - IDES Editor
Network nodes in a MANET are free to move randomly, so the network topology may change rapidly. Routing protocols for MANETs are used to deliver data packets from a source to the desired destination, and they are typically designed on the assumption that all participating nodes cooperate fully. However, because battery energy is scarce, node misbehaviour may occur. One such routing misbehaviour is that some nodes act selfishly: they participate in route discovery and maintenance, but refuse to forward packets in order to save their own energy. To solve this problem we propose a reputation-based scheme in which a watchdog passively overhears nodes and assigns each a value as appreciation, or adds nuggets to it. In this proposal, nodes with the highest value are strongly recommended for data forwarding, allowing nodes to avoid misbehaving nodes in future route selection. The Ad hoc On-Demand Distance Vector (AODV) routing protocol may be used to obtain, from neighbouring nodes, recommendation details about the node intended to forward the packet. This paper proposes a novel method to avoid routes containing misbehaving nodes and also suggests a way to detect whether an intruder is present in the cluster of participating nodes, using the security-aware AODV protocol.
- Working Title Films is a UK film production company founded in 1983 that produces feature films and TV productions.
- Lions Gate is a North American entertainment company founded in 1997 that has been the most popular and successful independent film and TV distribution company in North America since 2012.
- Using a well-known production company like Working Title Films or distribution company like Lions Gate UK could help gain more attention for a thriller film by leveraging their reputation and wider audience reach.
Describing himself as a geek, drawn to technology and programming, Brian SOLIS (his blog is available here), Principal Analyst at ALTIMETER GROUP, gave us the great pleasure of speaking at a video conference organized as part of ISTA's MSc e-Business Manager programme.
Held at the Fonderie campus in Mulhouse, the session's theme was: a prospective vision of social networks.
A look back at 3 key ideas...
This document provides information on the hotel portfolio of Anantara Hotels, Resorts & Spas. It lists over 149 hotels across 22 countries and 19,497 rooms under various brands, including Anantara, AVANI, Tivoli, Oaks, Elewana, and PER AQUUM hotels & resorts. The document includes lists of locations, room counts, and brief descriptions for each brand.
As part of ISTA's bac+5/6 e-business manager programme, we run a two-day workshop each year, led by Hubmode, during which a 90-minute webinar is organized. The 2016 theme was: "A prospective vision of the store of the future".
To conclude the workshop, the students of this MSc produced a white paper: What will the store of tomorrow look like?
This document discusses and compares different hardware data prefetching mechanisms and their impact on performance. It covers four main types of prefetching techniques: stream buffer, stride prefetchers based on program counter or cache block address, and locality-based prefetchers. For each technique, it describes the basic algorithm, hardware requirements, advantages and disadvantages. It provides detailed explanations of how stream buffers, stride prefetchers using a reference prediction table, and address-based stride prediction work. It also outlines the design of a locality-based stream prefetcher that dynamically adjusts prefetch degree and distance based on feedback.
This document provides an overview of code scheduling constraints and techniques. It discusses:
- Three types of constraints - control dependence, data dependence, and resource constraints. Changing operation order must not alter program results.
- Data dependence analysis to identify true, anti, and output dependences between memory accesses. This is challenging for arrays and pointers.
- Tradeoffs between register usage and parallelism due to limited register files.
- Supporting speculative execution through prefetching, poison bits, and predicated instructions.
- A basic machine model for representing hardware resources and operation properties like latency.
The document discusses load balancing algorithms for cluster computing environments. It proposes a fully centralized and partially distributed algorithm (FCPDA) that dynamically maps jobs to communicators (groups of processors) to improve response time and performance. The algorithm allows a communicator to take on additional jobs if it completes its initial job early. This approach aims to better balance the workload compared to other algorithms and reduce overall job completion time.
This document discusses techniques for basic block scheduling and global code scheduling in compiler optimization. It covers:
1. List scheduling for basic block scheduling by constructing a data dependence graph and visiting nodes in prioritized topological order.
2. Global code scheduling techniques like upward and downward code motion to move instructions between basic blocks while preserving data and control dependencies.
3. An algorithm for region-based global scheduling that supports moving operations to control-equivalent or dominating blocks.
This document discusses load balancing in distributed systems. It provides definitions of static and dynamic load balancing, compares their approaches, and describes several dynamic load balancing algorithms. Static load balancing assigns tasks at compile time without migration, while dynamic approaches migrate tasks at runtime based on current system state. Dynamic approaches have overhead from migration but better utilize resources. Specific dynamic algorithms discussed include nearest neighbor, random, adaptive contracting with neighbor, and centralized information approaches.
Enhanced equally distributed load balancing algorithm for cloud computing - eSAT Publishing House
IJRET: International Journal of Research in Engineering and Technology is an international, peer-reviewed online journal published by eSAT Publishing House for the enhancement of research in various disciplines of engineering and technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching, and research in the fields of engineering and technology. We bring together scientists, academicians, field engineers, scholars, and students of related fields.
Enhanced equally distributed load balancing algorithm for cloud computing - eSAT Journals
Abstract: Cloud computing, as the name suggests, is a style of computing in which different users use resources on the go, i.e. over the Internet. In recent years, this technology has emerged as a strong option not only for large organizations but also for small organizations that access and use only the resources they need. Recent research shows that many organizations lose a significant part of their revenue handling client requests on their web servers: an inability to balance the load across web servers results in loss of data, delays, and increased costs. This paper presents a new enhanced load balancing algorithm that can increase the performance of a web application. The algorithm addresses major drawbacks such as delay and the response-to-request ratio.
LATENCY-AWARE WRITE BUFFER RESOURCE CONTROL IN MULTITHREADED CORES - ijmvsc
In a simultaneous multithreaded system, a core’s pipeline resources are sometimes partitioned and
otherwise shared amongst numerous active threads. One mutual resource is the write buffer, which acts as
an intermediary between a store instruction’s retirement from the pipeline and the store value being written
to cache. The write buffer takes a completed store instruction from the load/store queue and eventually
writes the value to the level-one data cache. Once a store is buffered with a write-allocate cache policy, the
store must remain in the write buffer until its cache block is in level-one data cache. This latency may vary
from as little as a single clock cycle (in the case of a level-one cache hit) to several hundred clock cycles
(in the case of a cache miss). This paper shows that cache misses routinely dominate the write buffer’s
resources and deny cache hits from being written to memory, thereby degrading performance of
simultaneous multithreaded systems. This paper proposes a technique to reduce denial of resources to
cache hits by limiting the number of cache misses that may concurrently reside in the write buffer and
shows that system performance can be improved by using this technique.
Iaetsd appliances of harmonizing model in cloud - Iaetsd Iaetsd
This document proposes and evaluates a load balancing model for cloud computing environments that aims to optimize resource utilization and minimize energy usage. Key points:
- It introduces a "skewness" metric to measure uneven resource utilization across servers and develops algorithms to minimize skewness to improve overall utilization.
- The algorithms dynamically allocate resources based on demand, detecting "hot spots" that are overloaded and migrating VMs to reduce overload, as well as detecting "cold spots" that are underutilized to power them off to save energy.
- It evaluates the algorithms through trace-driven simulation and experimentation, finding it achieves good performance in load balancing while saving energy.
Modified Active Monitoring Load Balancing with Cloud Computing - ijsrd.com
Cloud computing is internet-based computing in which large groups of remote servers are networked to allow centralized data storage and online access to computer services or resources. Load balancing is essential for efficient operation in distributed environments. As cloud computing grows rapidly and clients demand more services and better results, load balancing for the cloud has become a very interesting and important research area. Without a proper load balancing strategy, the growth of cloud computing will never match predictions. The main focus of this paper is to verify the approach proposed in the model paper [3]. An efficient load balancing algorithm can reduce data center processing time and overall response time and cope with the dynamic changes of cloud computing environments. The traditional Active Monitoring load balancing algorithm has been modified to achieve better data center processing time and overall response time. The algorithm presented in this paper efficiently distributes requests to all the VMs for execution, considering the CPU utilization of all VMs.
This document summarizes research on scheduling algorithms for loading streaming data into real-time data warehouses. The goal is to minimize data staleness over time. It describes how streaming warehouses continuously ingest incoming data streams to support time-critical analyses, unlike traditional warehouses which are periodically refreshed. It presents a model for temporal consistency and defines data staleness. It formulates the streaming warehouse update problem as a scheduling problem to minimize staleness and proves that any online, non-preemptive scheduling algorithm can achieve staleness within a constant factor of optimal if processors are sufficiently fast and no processor is idly waiting.
Continental division of load and balanced ant - IJCI JOURNAL
The increasing use of the Internet is creating huge volumes of data that industries must manage, but this should not affect processing time, which would inconvenience the end user. As the cloud is emerging as the backbone of the IT industry, many enhancements to it are needed. Many companies are moving their data from small storage locations to the cloud; the migration is done so that companies do not bear the burden of purchasing infrastructure but merely rent it, and the migration should not come at the cost of the speed of storing or retrieving data from the server. Load balancing is one of the major issues in cloud computing, but these problems are tractable. There are many load balancing algorithms, each with advantages over the others; in this paper the ant colony algorithm is studied.
A method for balancing heterogeneous request load in DHT-based P2P - IAEME Publication
This document proposes an adaptive load balancing method for distributed hash table (DHT)-based peer-to-peer (P2P) systems to address load imbalance caused by skewed access distributions. The method distinguishes between query routing load and query answering load. It uses three strategies: 1) adaptive object replication to balance query load by replicating popular objects, 2) adaptive routing replication to share routing load by duplicating overloaded nodes' routing tables, and 3) dynamic routing table reconfiguration to relieve overloaded nodes by replacing their entries with less loaded nodes. The strategies aim to dynamically redistribute load from overloaded to underloaded nodes. Experiments showed the balancing algorithms effectively improve system load balance and performance.
Scalable Distributed Job Processing with Dynamic Load Balancing - ijdpsjournal
We present here a cost-effective framework for a robust, scalable, distributed job processing system that adapts easily to dynamic computing needs with efficient load balancing for heterogeneous systems. The design is such that each component is self-contained and does not depend on the others. Yet they are still interconnected through an enterprise message bus to ensure safe, secure, and reliable communication based on transactional features, avoiding duplication as well as data loss. Load balancing, fault tolerance, and failover recovery are built into the system through a health-check facility and queue-based load balancing. The system has a centralized repository with central monitors to track the progress of job executions and the status of processors in real time. The basic requirement of assigning a priority and processing per priority is built into the framework. The most important aspect of the framework is that it avoids the need for job migration by computing the target processors based on the current load and various cost factors. The framework can scale horizontally as well as vertically to achieve the required performance, effectively minimizing the total cost of ownership.
Load balancing is one of the main challenges in every structured peer-to-peer (P2P) system that uses distributed hash tables to map and distribute data items (objects) onto the nodes of the system. In a typical P2P system with N nodes, the use of random hash functions for distributing keys among peer nodes can lead to O(log N) imbalance. Most existing load balancing algorithms for structured P2P systems do not adapt to objects' varying loads under different system conditions, assume a uniform distribution of objects in the system, and often ignore node heterogeneity. In this paper we propose a load balancing algorithm that addresses these issues by applying node movement and replication mechanisms during load balancing. Given the high overhead of replication, we postpone this mechanism as much as possible, using it only when necessary. Simulation results show that our algorithm balances the load to within 85% of the optimal value.
This is a presentation for Chapter 7, Distributed System Management.
Book: DISTRIBUTED COMPUTING , Sunita Mahajan & Seema Shah
Prepared by Students of Computer Science, Ain Shams University - Cairo - Egypt
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORS - cscpconf
The main aim of our research is to find the limit of Amdahl's Law for multicore processors, i.e. the number of cores that gives the best efficiency for the overall architecture of the CMP (Chip Multi-Processor, a.k.a. multicore processor). We expect this limit to lie in the architecture of the multicore processor or in the programming. We surveyed the multicore processor architectures of various chip manufacturers, namely INTEL™, AMD™, IBM™, etc., and the various techniques they use to improve the performance of multicore processors. We conducted cluster experiments to find this limit. In this paper we propose an alternate multicore processor design based on the results of our cluster experiment.
Load Speculation
Jong-Jiann Shieh
Department of Computer Science and Engineering
Tatung University
shieh@ttu.edu.tw
and
Cheng-Chun Lin
Night.cola@msa.hinet.net
and
Shin-Rung Chen
apsras@amigo.cse.ttu.edu.tw
Abstract
A superscalar processor must issue instructions as early as possible to enhance performance, but a load instruction can only be issued once its register dependencies are resolved and its memory dependencies are known. A register dependence forces a load to wait until the prior instruction that writes its source register has completed. A memory dependence means a load cannot be issued before its address ambiguities with prior stores are resolved. Therefore, a load can only be issued when no register dependencies exist and all prior stores' effective addresses have been calculated. This paper combines two mechanisms, value prediction (VP) and a load forwarding history table (LFHT), to speculatively execute load instructions. Our study shows that doing so yields about a 15% average speedup over the baseline architecture.
Keywords: load speculation, register dependence, memory dependence, value prediction, load forwarding
1. Introduction
Modern superscalar processors allow instructions to
execute out of program order to find more instruction level
parallelism (ILP). These processors must monitor data
dependencies to maintain correct program behavior. There are
two types of data dependencies, register dependence and
memory dependence.
Register dependence is detected in the instruction decode stage by examining instructions' register operand fields. If a load depends on a prior instruction, the load must wait until that instruction completes before the operand value can be used.
The lack of information about memory dependence at
instruction decode time is a problem for an out-of-order
instruction scheduler. If the scheduler executes a load before a
prior store that writes to the same memory location, the load
will read the wrong value. In this event the load and all
subsequent dependent instructions must be re-executed,
resulting in a huge performance penalty.
To avoid these memory order violations, the instruction scheduler can conservatively prevent loads from executing until all prior stores have executed. This approach decreases performance because, in the majority of cases, loads are made falsely dependent on stores that do not alias them, as the data in Section 3 show.
In this paper, we use a simple value predictor to predict operand values, thereby avoiding register dependences, and propose a structure called the Load Forwarding History Table (LFHT) to exploit memory dependence speculation at run time. When these two mechanisms are combined, the predictor helps the LFHT let more load instructions execute without waiting for prior stores' effective addresses to be calculated, so more loads are issued earlier. When a load instruction is speculatively executed, instructions that depend on it are also speculatively executed.
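The combined issue condition just described can be sketched as follows. This is a hedged illustration; the predicate names are ours, not the paper's.

```python
# Illustrative sketch of the combined issue decision for a load under the
# mechanism described above. Predicate names are ours, not the paper's.

def may_issue_load(operands_ready, operand_predictable,
                   prior_store_addrs_known, lfht_predicts_no_alias):
    """A load may issue once its effective address is available (its
    operands are ready, or the value predictor supplies them) and its
    memory dependences are resolved (all prior store addresses are
    known, or the LFHT predicts the load aliases no prior store)."""
    address_available = operands_ready or operand_predictable
    mem_deps_resolved = prior_store_addrs_known or lfht_predicts_no_alias
    return address_available and mem_deps_resolved
```

Either mechanism alone helps, but only together do they let a load issue when neither its operands nor the prior store addresses are yet known.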
The rest of this paper is organized as follows. Section 2 surveys related work. Section 3 illustrates the whole structure within a superscalar processor. Section 4 describes our CPU model and simulation environment. Performance is evaluated in Section 5. Finally, the conclusion of this paper is presented in Section 6.
2. Related Works
Traditional work on memory disambiguation was done in the context of compiler and hardware mechanisms for non-speculative disambiguation that ensure program correctness. Franklin and Sohi [2] proposed the address resolution buffer (ARB). The ARB sorts memory references into bins according to their address; the bins are used to enforce a temporal order between references to the same address. The ARB is a banked structure: multiple disambiguation requests can be dispatched in one cycle, provided they all go to different banks.
Chrysos and Emer used a predictor to solve the memory disambiguation problem in [5]. The designers' goal was to schedule load instructions as soon as possible without causing any memory order violations. The proposed predictor is based on store sets. A store set for a specific load is the set of all stores upon which the load has ever depended. The processor adds a store to a load's store set if a memory order violation occurs when the load executes before that store. On the next instance of that load instruction, the store set is accessed to determine which stores the load must wait for before executing.
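The store-set idea can be illustrated with a small sketch. This is a simplification of Chrysos and Emer's design: a real implementation uses an SSIT/LFST table pair rather than literal sets, and the class and method names here are ours.

```python
# Simplified sketch of store-set memory dependence prediction [5].
# A real design uses finite tables (SSIT/LFST), not unbounded sets.

from collections import defaultdict

class StoreSets:
    def __init__(self):
        # load PC -> set of store PCs the load has ever violated with
        self.sets = defaultdict(set)

    def record_violation(self, load_pc, store_pc):
        """Called when a load executed before an aliasing prior store."""
        self.sets[load_pc].add(store_pc)

    def stores_to_wait_on(self, load_pc, in_flight_store_pcs):
        """Before issuing, the load waits on any in-flight store
        belonging to its store set; an empty result means the load
        may issue ahead of the remaining stores."""
        return self.sets[load_pc] & set(in_flight_store_pcs)
```

A load with an empty store set is never delayed, while a load that has violated once is serialized only against the specific stores it conflicted with, rather than against all prior stores.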
A. Yoaz, M. Erez, R. Ronen, and S. Jourdan designed the CHT predictor [7]. The CHT predictor predicts whether a load instruction will conflict with any store in the instruction window, allocating a new entry only when a load collides for the first time and invalidating the entry when its state changes to non-colliding. It does not predict which store instruction the load will conflict with; it is therefore easier to design, but it does not provide the best possible information for disambiguation purposes.
Color sets [10] are a simple mechanism that incorporates multiple speculation levels within the processor and classifies load and store instructions at run time into the appropriate speculation level. Each speculation level is termed a color, and the sets of load and store instructions are called color sets. These colors divide the load instructions into distinct sets, starting with the base color, which corresponds to the no-violation case: this set contains the load instructions that have never collided with unready store instructions in the past. Each color in the spectrum represents an increasing level of aggressiveness in load speculation; a load instruction is allowed to issue only if its color is less than or equal to the current speculation level. If the processor later discovers that the load has collided with a store, the color assigned to the load instruction in the predictor is increased.
3. VALUE PREDICTION AND LFHT
3.1 Issuing a Load
When a load or store instruction is executed, it is split into two micro-instructions inside the processor [1]. One calculates the effective address, and the other performs the memory access once the effective address has been calculated and any potential store alias dependencies have been resolved. In the baseline architecture, each store and load instruction must wait until its effective address calculation completes. In addition, all stores are issued in order with respect to prior stores, and each load must wait on the most recent store before it can be speculatively issued.
There are three sources of delay on which a load instruction spends cycles: (1) waiting for its effective address calculation (ea), (2) waiting for prior store addresses to be calculated (dep), and (3) the latency of fetching the data (mem). This paper focuses on (1) and (2): we use data prediction to address problem (1) and the LFHT to address problem (2).
Figure 3.1 shows how many cycles each load instruction spends waiting on its effective address [13]. As the figure shows, each load instruction must wait 7 cycles on average before it can obtain its effective address, which wastes a great deal of time.
[Figure: bar chart over the benchmarks bzip2, crafty, gap, gcc, gzip, mcf, parser, twolf, vortex, and vpr; y-axis from 0 to 30 cycles.]
Figure 3.1: Cycles per load instruction spent waiting on its effective address in the baseline architecture.
In the conventional memory dependence disambiguation mechanism [14], load-forwarding behavior can detect store aliases and forward store data. Figure 3.2 shows the percentage of loads that can take advantage of load-forwarding behavior [12]. Most load instructions neither forward store data nor conflict with prior stores on the baseline simulation architecture (described in Section 4): the average fraction of forwarding loads is 12.7%, and the lowest is only 2.7%. This means that most load instructions are pending unnecessarily while memory dependences are disambiguated.
[Figure: bar chart over the benchmarks bzip2, crafty, gap, gzip, mcf, parser, and twolf; y-axis from 0.0% to 50.0%.]
Figure 3.2: Percentage of forwarding load instructions.
3.2 Value Prediction
All loads must wait until their effective address is calculated before they can be issued. If a load is on the critical path and its address can be accurately predicted, it can be beneficial to speculate on the value of the address and load the data as soon as possible.
A load instruction is effectively split into two instructions inside the processor, one of which calculates the effective address. To predict this instruction, we predict the instruction's operand, so that we need not wait for the prior instruction on which it depends.
Since we predict instruction operands, instructions that have no register dependence on a prior instruction need no prediction, because they already have the exact register value. In our simulation, we only predict instructions with register dependences, but all load instructions update the predictor so as to maintain its accuracy.
Data prediction speeds up the effective address calculation for a load; the load then only has to wait on potential store aliases before issuing. If the operand was incorrectly predicted, a recovery mechanism takes over once the actual operand is available.
Value prediction has been studied for a long time and many schemes have been proposed [4, 6, 11]. In this paper we use the simplest scheme, the stride predictor, to predict the operand of a load instruction.
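As an illustration, a stride predictor of the kind assumed here can be sketched as follows; the table size and the dictionary-based table are illustrative choices, not details from the paper.

```python
# Minimal sketch of a stride value predictor for a load's register operand.
# Each entry records the last observed value and the stride between the
# last two values; the prediction is last value plus stride.

class StridePredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = {}  # index -> (last_value, stride)

    def _index(self, pc):
        # Direct-mapped indexing by the instruction's PC.
        return pc % self.entries

    def predict(self, pc):
        """Predict the next operand value, or None if no history exists."""
        entry = self.table.get(self._index(pc))
        if entry is None:
            return None
        last, stride = entry
        return last + stride

    def update(self, pc, actual):
        """Train on the actual operand value once it is known."""
        idx = self._index(pc)
        entry = self.table.get(idx)
        if entry is None:
            self.table[idx] = (actual, 0)
        else:
            last, _ = entry
            self.table[idx] = (actual, actual - last)
```

For example, after observing operand values 100 and 104 at the same PC, the predictor learns a stride of 4 and predicts 108 for the next instance.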
3.3 Load Forwarding History Table
When a load is issued, it looks up the store buffer for a non-committed aliased store while performing its data cache access in parallel. If a store alias is found, the store data are forwarded to the load, giving it a shorter latency. If there is no store alias and the access hits in the data cache, the load has a longer latency because the data cache is pipelined. If the access misses in the data cache, the miss is processed only if no alias is found in the store buffer; load-forwarding behavior can detect the store alias and forward the store data. In this way, load instructions can be issued out of order without waiting for prior stores to execute [8, 9, 14].
A conventional memory-disambiguation mechanism
cannot provide this information for a load instruction at the
decode stage. To exploit load-forwarding behavior and obtain
these benefits, we propose the load forwarding history table
(LFHT). The LFHT records the load-forwarding outcome the
last time a load instruction executed, and uses it to decide
whether to issue the load out of order when the same instruction
is encountered in the future.
Each LFHT entry contains two fields: a tag field and an
alias bit. The LFHT is organized as a direct-mapped cache
indexed by the PC. The alias bit is sticky: once a load is
observed to conflict with a store at execution time, the load is
always treated as aliased thereafter and waits until all prior store
addresses have been calculated before issuing.
The LFHT is allocated and updated according to a load
instruction's forwarding behavior at run time. On an LFHT miss,
part of the load's PC is written into the corresponding entry as
the tag, and the alias bit is set or cleared according to the
observed forwarding behavior. On an LFHT hit, the alias bit is
likewise set or cleared, except that if the alias bit is already set
and no conflict with a prior store is observed, the bit remains set.
On an LFHT hit with a clear alias bit, the load instruction is
speculatively executed.
Validation or invalidation of a speculative load is
performed as each prior store address is calculated. Each time a
store address is computed, all executed speculative loads that
follow the store in the instruction window have their addresses
checked for an alias. If an alias is found, recovery action is
taken: the load is re-issued, and the corresponding alias bit is
set to avoid incorrect speculative execution in the future.
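The lookup and update policy described above can be sketched as follows. This is a minimal model under assumptions: the entry count, the tag derivation, and the method names are illustrative; the cycle_clear method anticipates the periodic alias-bit clearing evaluated in section 5.3.

```python
# Minimal sketch of the LFHT: direct-mapped, tag + sticky alias bit.
# Sizes and names are assumptions, not the paper's exact design.
class LFHT:
    def __init__(self, entries=2048):
        self.entries = entries
        self.table = [None] * entries  # each entry: (tag, alias_bit)

    def _split(self, pc):
        # Low PC bits index the table; the rest serve as the tag.
        return (pc // self.entries, pc % self.entries)

    def may_issue_speculatively(self, pc):
        tag, idx = self._split(pc)
        # Speculate only on a hit whose alias bit is clear.
        return self.table[idx] == (tag, False)

    def update(self, pc, forwarded):
        # 'forwarded' is True when execution observed a store alias.
        tag, idx = self._split(pc)
        entry = self.table[idx]
        if entry is None or entry[0] != tag:
            # Miss: allocate with the observed forwarding behavior.
            self.table[idx] = (tag, forwarded)
        elif forwarded:
            # Hit with a conflict: set the alias bit.
            self.table[idx] = (tag, True)
        # Hit without a conflict: the alias bit is sticky and stays as-is.

    def cycle_clear(self):
        # Periodic clearing of all alias bits (the Cycle Clear policy),
        # preventing stale entries from creating false dependences.
        self.table = [None if e is None else (e[0], False)
                      for e in self.table]
```

Note the asymmetry that makes the bit sticky: a conflict always sets it, but a conflict-free execution never clears it; only cycle_clear does.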
3.4 Combining VP and LFHT
We combine the LFHT introduced in section 3.3 with the
value predictor discussed in section 3.2.
If only VP is used, some load instructions obtain their
operands sooner, but they must still wait until prior stores'
effective addresses are calculated to ensure there is no memory
dependence.
If only the LFHT is used, the memory disambiguation
problem is overcome, but register dependences remain: some
loads' operands are not yet available, so those loads must wait
for their operands before issuing to a functional unit.
We therefore combine the two mechanisms, as shown in
figure 3.3, to resolve both memory and register dependences.
[Block diagram: instruction fetch unit, decode unit, register
update unit, load/store and function units, and register file, with
the VP and LFHT attached via additional data paths.]
Figure 3.3 Architecture data path with VP and LFHT.
4. Evaluation Methodology
4.1 Machine Model
The simulator used in this work is derived from the
SimpleScalar 2.0 and 3.0c tool sets [3], a suite of functional and
timing simulation tools. The instruction set architecture
employed is the Alpha AXP ISA.
Table 1 summarizes some of the parameters used in our
baseline architecture. Table 2 shows the architectures we
studied in this evaluation.
Table 1 Baseline Architecture Configuration
Instruction fetch 8 instructions per cycle.
Out-of-Order
execution
mechanism
Issue of 8 instructions /cycle, 256 entry
RUU(which is the ROB and the IW
combined), 128 entry load/store queue.
Loads executed only after all preceding
store addresses are known. Value
bypassed to loads from matching stores
ahead in the load/store queue. 2 cycle
load forwarding latency.
Architected registers 32 integer, hi, lo, 32 floating point.
Functional units
(FU)
8-integer ALUs, 8 load/store units, 4-FP
adders, 1-Integer MULT/DIV, 1-FP
MULT/DIV
FU latency int alu--1, load/store--1, int mult--3, int
div--12, fp adder--2, fp mult--4, fp
div--12, fp sqrt--24
L1 Instruction cache 64K bytes, 2-way set assoc., 32 byte
block, 4 cycles hit latency.
L1 Data cache 64K bytes, 2-way set assoc., 32 byte
block, 4 cycles hit latency. Dual ported.
L2 unified cache 1024K bytes, 4-way set assoc., 64 byte
block, 12 cycles hit latency
Memory Memory access latency (first-36, rest-4)
cycle. Width of memory bus is 32 bytes.
TLB miss 30 cycles
Table 2 Architectures we studied
Baseline Baseline architecture
VP Baseline + VP
VP + LFHT Baseline + VP + LFHT
VP + LFHT with Cycle
Clear
Baseline + VP + LFHT with
Cycle Clear
Perfect VP + Perfect LFHT Baseline + Perfect VP + Perfect
LFHT
Note: Cycle Clear is a keyword detailed in section 5.3
4.2 Benchmarks
To perform our experimental study, we collected results
on the SPEC2000 benchmarks. The programs were compiled
with the gcc compiler included in the tool set. Table 4 shows the
input data set for each integer and floating-point benchmark. In
simulating the benchmarks, we skipped the first billion
instructions and collected statistics over the next five hundred
million instructions.
Table 4 Input data set for benchmarks
SPECint 2000 Input SPECfp 2000 Input
bzip2 input.source ammp ammp.in
crafty crafty.in applu applu.in
gap ref.in art a10.img &
gcc 166.i equake inp.in
gzip input.graphic galgel galgel.in
mcf inp.in lucas lucas2.in
parser ref.in mesa mesa.in
twolf ./twolf/ref mgrid mgrid.in
vortex lendian.raw swim swim.in
vpr net.in & arch.in
5. Performance Analysis
In this section, we examine the performance improvement
gained by using the proposed mechanisms. We also explore
detailed configurations of VP and the LFHT.
First, figures 5.1 and 5.2 characterize the load instructions
in each benchmark, which helps in analyzing the data.
Note: load instructions with the same program counter are
dynamic instances of the same static instruction.
5.1 VP: Cycles Wasted Waiting for the EA
Figure 5.1 shows how many cycles per load are wasted
waiting for the effective address when VP is used. Compared
with figure 3.1 in section 3.1, 4.5 cycles per load instruction are
saved after using VP.
Figure 5.1 Cycles per load spent waiting for its effective address
after using VP.
5.2 LFHT
The average LFHT hit rates for the integer benchmarks
are 68% for 128 entries, 85% for 512, 94% for 2048, and 99%
for 8192; for the floating-point benchmarks they are 80% for
128, 99% for 512, 97% for 2048, and 99% for 8192.
Figure 5.2 shows how many cycles per load are wasted
waiting for the effective address when using the LFHT; each
load spends 3.43 cycles on average.
Figure 5.3 shows the cycles per load spent waiting for the
effective address after combining VP and the LFHT; each load
now spends 1.91 cycles on average.
Two factors keep this above zero: (1) the effective
address calculation needs at least one cycle, and (2) about 40%
of loads with unavailable operands cannot be predicted, because
their state field in the VP is still init or transient. Hence loads
still wait 1.91 cycles on average.
5.3 IPC & Speedup
Figure 5.4 shows IPC for the integer benchmarks; figure
5.5 shows IPC for the floating-point benchmarks.
The alias bit in the LFHT is sticky: it stays set after the
first conflicting store is detected. Once set, the load always
waits until all prior store addresses have been calculated before
issuing, whether or not a store alias actually recurs. Over a long
run this can create false data dependences and erode the benefit
of speculative loads.
To keep the LFHT from becoming too conservative, all
alias bits in the LFHT are cleared at a regular interval. In this
paper, we model a clearing interval of 50,000 cycles, called
Cycle Clear (CC).
Figure 5.6 shows integer benchmarks’ speedup over the
baseline. The average speedups are 14.5% without cycle clear,
16.1% with cycle clear. Figure 5.7 shows floating-point
benchmarks’ speedup over the baseline. The average speedups
are both 5% with or without cycle clear.
Speedup is calculated by:
(New schemes’IPC – baseline’s IPC) / baseline’s IPC
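For example, the formula works out as follows; the IPC values here are illustrative placeholders, not measured results from the paper.

```python
# Speedup relative to baseline, as defined above.
def speedup(new_ipc, baseline_ipc):
    return (new_ipc - baseline_ipc) / baseline_ipc

# An illustrative case: a scheme reaching IPC 2.9 over a baseline of 2.5.
print(f"{speedup(2.9, 2.5):.1%}")  # -> 16.0%
```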
In our simulation, we also model a perfect scheme for
both VP and the LFHT, to check how much performance can at
most be gained from this approach.
For VP, we use perfect prediction on all load instructions:
when a load is dispatched to the RUU, its operand's actual value
can be used whether or not it is available, since the prediction is
100% accurate.
For the LFHT, the perfect method means a load waits
only for a store instruction with the same effective address; no
time is spent unnecessarily waiting on other stores' effective
addresses.
The average speedup with perfect prediction is 20.5% for
the integer benchmarks and 7.5% for the floating-point
benchmarks. Our simulation achieves speedups of 16.1% for the
integer benchmarks and 5% for the floating-point benchmarks,
which is close to the perfect case.
The main reason vortex performs worse than the baseline
is its low value-prediction accuracy, which costs extra cycles
squashing instructions and re-fetching them into the instruction
window.
Figure 5.2 Cycles per load spent waiting for its effective address
after using the LFHT.
Figure 5.3 Cycles per load spent waiting for its effective address
after combining VP and the LFHT.
Figure 5.4 IPC (integer benchmarks).
Figure 5.5 IPC (floating-point benchmarks).
Figure 5.6 Speedup over baseline (integer benchmarks).
Figure 5.7 Speedup over baseline (floating-point benchmarks).
6. Conclusions
In this paper we present a combined mechanism for
improving the load-issue policy of modern superscalar
processors. Conventionally, a load instruction is issued only
after all its dependences are known to be resolved, which
reduces instruction-level parallelism. We proposed a scheme
combining two mechanisms, value prediction (VP) and a load
forwarding history table (LFHT), to speculatively execute load
instructions.
All VP and LFHT information is established and updated
at run time from load-instruction history and load-forwarding
behavior, and it supplies memory-disambiguation information
for speculative load issue at issue time. Throughout this study,
we have not only examined the load-issue policy but also
revisited memory dependence and disambiguation from two
aspects: first, we studied the characteristics of load instructions
and used that information for memory dependence and
disambiguation; second, we proposed a combined scheme that
improves instruction-level parallelism.
We evaluated the performance of the proposed
architecture with SimpleScalar. VP alone provides an average
speedup of 8.5% over the baseline architecture. With VP and
the LFHT, the speedup is 14.5% over the baseline. With Cycle
Clear added to the LFHT, a 16.1% speedup over the baseline is
achieved.
References
[1] M. Johnson, "Superscalar Microprocessor Design,"
Prentice Hall, Englewood Cliffs, 1991.
[2] M. Franklin and G. S. Sohi, "ARB: A Hardware
Mechanism for Dynamic Reordering of Memory
References," IEEE Transactions on Computers, May 1996.
[3] D. C. Burger and T. M. Austin, "The SimpleScalar Tool Set,
Version 2.0," Technical Report CS-TR-97-1342, University
of Wisconsin, Madison, June 1997.
[4] K. Wang and M. Franklin, "Highly Accurate Data Value
Prediction Using Hybrid Predictors," in Proceedings of the
30th Annual IEEE/ACM International Symposium on
Microarchitecture, pages 281-290, Dec. 1997.
[5] G. Z. Chrysos and J. S. Emer, "Memory Dependence
Prediction Using Store Sets," in Proceedings of the 25th
Annual International Symposium on Computer
Architecture, pages 142-153, June 1998.
[6] G. Reinman and B. Calder, "Predictive Techniques for
Aggressive Load Speculation," in Proceedings of the 31st
Annual ACM/IEEE International Symposium on
Microarchitecture, pages 127-137, Nov. 1998.
[7] A. Yoaz, M. Erez, R. Ronen, and S. Jourdan, "Speculation
Techniques for Improving Load Related Instruction
Scheduling," in Proceedings of the 26th International
Symposium on Computer Architecture, May 1999.
[8] G. Reinman and B. Calder, "A Comparative Survey of Load
Speculation Architectures," Journal of Instruction-Level
Parallelism, May 2000.
[9] A. Moshovos and G. S. Sohi, "Reducing Memory Latency
via Read-after-Read Memory Dependence Prediction,"
IEEE Transactions on Computers, pages 313-326, March
2002.
[10] S. Onder, "Cost Effective Memory Dependence Prediction
Using Speculation Levels and Color Sets," in Proceedings
of the 2002 International Conference on Parallel
Architectures and Compilation Techniques, pages 232-241,
Sept. 2002.
[11] Huiyang Zhou, J. Flanagan, and T. M. Conte, "Detecting
Global Stride Locality in Value Streams," in Proceedings of
the 30th Annual International Symposium on Computer
Architecture, pages 324-335, June 2003.
[12] Shin-Rung Chen, "Memory Disambiguation Using Load
Forwarding," Master's Thesis, Department of Computer
Science and Engineering, Tatung University, July 2004.
[13] Cheng-Chun Lin, "Load Speculation," Master's Thesis,
Department of Computer Science and Engineering, Tatung
University, July 2005.
[14] J. P. Shen and M. H. Lipasti, Modern Processor Design:
Fundamentals of Superscalar Processors, McGraw-Hill,
2005.