The paper includes a study of the most recent prefetching techniques developed for modern-day processors, classifying them by different criteria and performing a qualitative and quantitative evaluation of their performance. It also includes an evaluation of the performance of a compiler-based data prefetching scheme using the GCC compiler's built-in prefetcher.
Abstract:
Extensive research has been done on prefetching techniques that hide memory latency in microprocessors, leading to performance improvements. However, the energy aspect of prefetching is relatively unknown. While aggressive prefetching techniques often help to improve performance, they increase energy consumption by as much as 30% in the memory system. This paper provides a detailed evaluation of the energy impact of hardware data prefetching and then presents a set of new energy-aware techniques to overcome the prefetching energy overhead of such schemes. These include compiler-assisted and hardware-based energy-aware techniques and a new power-aware prefetch engine that can reduce hardware-prefetching-related energy consumption by 7-11×. Combined with the effect of leakage energy reduction due to performance improvement, the total energy consumption of the memory system after applying these techniques can be up to 12% less than a baseline with no prefetching.
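As a concrete illustration of the kind of hardware scheme whose energy the paper evaluates, here is a minimal stride-prefetcher sketch (the class, its confirmation rule, and the one-block lookahead are illustrative assumptions, not the paper's prefetch engine):

```python
# Minimal stride-prefetcher sketch (illustrative, not the paper's engine).
# It watches the address stream of one load, detects a constant stride,
# and issues a prefetch for the next expected address once confirmed.

class StridePrefetcher:
    def __init__(self):
        self.last_addr = None
        self.stride = None

    def access(self, addr):
        """Record a demand access; return a prefetch address or None."""
        prefetch = None
        if self.last_addr is not None:
            new_stride = addr - self.last_addr
            if new_stride == self.stride:
                # Stride confirmed twice in a row: prefetch one block ahead.
                prefetch = addr + self.stride
            self.stride = new_stride
        self.last_addr = addr
        return prefetch

pf = StridePrefetcher()
# Sequential accesses one 64-byte cache line apart.
issued = [pf.access(a) for a in (0, 64, 128, 192)]
print(issued)  # [None, None, 192, 256]
```

Every issued prefetch that misses its prediction is the energy overhead the paper targets: the useless block still costs tag lookups and memory traffic.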
PERFORMANCE ENHANCEMENT WITH SPECULATIVE-TRACE CAPPING AT DIFFERENT PIPELINE ...caijjournal
Simultaneous Multi-Threading (SMT) processors improve system performance by allowing concurrent execution of multiple independent threads that share key datapath components, making better use of resources. Speculative execution allows modern processors to fetch continuously and reduce the delays of control instructions. However, a significant amount of resources is usually wasted due to mis-speculation, resources that could have been used by other valid instructions, and such waste is even more pronounced in an SMT system. In order to minimize this waste, a speculative trace capping technique [1] was proposed to limit the number of speculative instructions in the system. In this paper, a thorough analysis is given to investigate the trade-offs of applying this capping mechanism at different pipeline stages so as
to maximize its benefits. Our simulations show that the best choice can improve overall system throughput
by a very significant margin (up to 46%) without sacrificing execution fairness among the threads.
In the era of big data, even with large infrastructure, stored data varies in size, format, variety, and volume across platforms such as Hadoop and the cloud, so an application faces the problem of how to process data that varies in size and format. A workflow whose data and available resources vary at run time is called a dynamic workflow. Using large infrastructure and a huge amount of resources for the analysis of data is time-consuming and wasteful; it is better to use a scheduling algorithm to analyse the given data set, so that the data set executes efficiently without wasting time, and to evaluate which scheduling algorithm is best suited to the given data set. We evaluate different data sets to understand which algorithm is most suitable for efficient analysis and execution of a data set, and we store the data after analysis.
Efficient Resource Management Mechanism with Fault Tolerant Model for Computa...Editor IJCATR
Grid computing provides a framework and deployment environment that enables resource sharing, accessing, aggregation and management. It allows coordinated use of various resources in dynamic, distributed virtual organizations. Grid scheduling is responsible for resource discovery, resource selection and job assignment over a decentralized heterogeneous system. In the existing system, a primary-backup approach is used for fault tolerance in a single environment. In this approach, each task has a primary copy and a backup copy on two different processors. For dependent tasks, the precedence constraints among tasks must be considered when scheduling backup copies and overloading backups. Two algorithms have then been developed to schedule backups of dependent and independent tasks. The proposed work manages resource failures in grid job scheduling. In this method, data sources and resources are integrated from different geographical environments. Fault-tolerant scheduling with the primary-backup approach is used to handle job failures in the grid environment. The impact of communication protocols is considered: protocols such as the Transmission Control Protocol (TCP) and the User Datagram Protocol (UDP) are used to distribute the messages of each task to grid resources.
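The primary-backup idea above can be sketched in a few lines (the round-robin placement below is a hypothetical illustration, not the paper's two scheduling algorithms): each task gets a primary and a backup copy, and the two copies must land on different processors so a single failure cannot kill both.

```python
# Sketch of primary-backup task placement (illustrative placement policy).

def place_primary_backup(tasks, processors):
    """Round-robin placement returning {task: (primary_proc, backup_proc)}."""
    n = len(processors)
    placement = {}
    for i, task in enumerate(tasks):
        primary = processors[i % n]
        backup = processors[(i + 1) % n]  # guaranteed different when n >= 2
        placement[task] = (primary, backup)
    return placement

plan = place_primary_backup(["t1", "t2", "t3"], ["p0", "p1", "p2"])
for task, (pri, bak) in plan.items():
    assert pri != bak  # the fault-tolerance invariant
print(plan)
```

Scheduling backups of dependent tasks additionally has to respect precedence constraints, which this toy ignores.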
Survey of streaming data warehouse update scheduling - eSAT Journals
In this paper, we study the scheduling problem of updates for streaming data warehouses, which combine traditional data warehouses and data stream systems. Here, jobs are the processes responsible for loading new data into the tables, with the purpose of decreasing data staleness. In addition, the approach handles well the challenges faced by streaming warehouses, such as data consistency, view hierarchies, heterogeneity of update jobs caused by dissimilar arrival times and sizes of data, and update preemption. The staleness of data is the scheduling metric considered here.
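The staleness metric can be sketched as follows (the function names and timestamps are illustrative assumptions, not the surveyed system's API): staleness is the elapsed time since a table's freshest loaded data, and a simple scheduler always runs the update job of the stalest table.

```python
# Staleness-driven update scheduling, sketched.

def staleness(now, freshest_loaded):
    """Time elapsed since the newest data loaded into a table."""
    return now - freshest_loaded

def pick_next_update(now, tables):
    """tables: {name: freshest_loaded_timestamp}. Returns the stalest table."""
    return max(tables, key=lambda t: staleness(now, tables[t]))

tables = {"orders": 90, "clicks": 40, "sensors": 70}
# At time 100, staleness is orders=10, clicks=60, sensors=30.
assert pick_next_update(now=100, tables=tables) == "clicks"
```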
Keywords: partitioning strategy, scalable scheduling, data stream management system.
Review of the paper: Traffic-aware Frequency Scaling for Balanced On-Chip Net...Luca Sinico
This work was done as an assignment and as part of the exam for the Distributed Systems course, while attending the Master's Degree in Computer Engineering at the University of Padua.
If you find something wrong or unclear, or if you disagree with the work done or with the grades in the assessment, please tell me.
Performance of a speculative transmission scheme for scheduling latency reduc...Mumbai Academisc
This document proposes a speculative transmission scheme to reduce latency in input-queued centrally-scheduled cell switches for high-performance computing. The scheme allows cells to proceed without waiting for a grant under certain conditions, significantly reducing average control-path latency. Using this model, performance measures like mean delay and successful speculative transmission rate are derived. Results show latency can be almost entirely eliminated between request and response for loads up to 50%. Simulations confirm the analytical results.
Analysis of Hierarchical Scheduling for Heterogeneous Traffic over Network - IJCNCJournal
Scheduling real-time and non-real-time packets at network nodes has an important impact in reducing processing overhead, queuing delay and response time. Most existing packet scheduling algorithms used in networks are based on First-In First-Out (FIFO), non-preemptive priority, and preemptive priority scheduling. However, these algorithms incur large processing overhead, queuing delay and response time, and do not adapt dynamically to changes in data traffic. In this paper, we present a new hierarchical priority-assignment scheduling algorithm, Hierarchical Hybrid EDF/FIFO, which not only serves real-time traffic but also provides best-effort service to non-real-time traffic. To examine our approach, we carried out an analytical study to express the worst-case queuing delay and the worst-case response time for the different traffic classes. The simulation results show that Hierarchical Hybrid EDF/FIFO achieves the minimum packet delay and an adequate packet loss for non-real-time traffic when compared with Hierarchical FIFO. In general, the performance of our approach comes close to that of Hierarchical EDF, which confirms its effectiveness.
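A simplified reading of the two-level scheme can be sketched as follows (the class below is an illustrative assumption, not the paper's algorithm): real-time packets are served in Earliest-Deadline-First order, and the best-effort FIFO drains only when no real-time packet is waiting.

```python
import heapq
from collections import deque

# Sketch of a two-level hybrid EDF/FIFO packet scheduler.

class HybridScheduler:
    def __init__(self):
        self.rt = []        # real-time: min-heap keyed by deadline (EDF)
        self.be = deque()   # non-real-time: best-effort FIFO

    def enqueue(self, packet, deadline=None):
        if deadline is None:
            self.be.append(packet)
        else:
            heapq.heappush(self.rt, (deadline, packet))

    def dequeue(self):
        if self.rt:                          # real-time traffic always first
            return heapq.heappop(self.rt)[1]
        return self.be.popleft() if self.be else None

s = HybridScheduler()
s.enqueue("web1")                  # non-real-time, arrives first
s.enqueue("voip", deadline=5)
s.enqueue("video", deadline=9)
order = [s.dequeue() for _ in range(3)]
print(order)  # ['voip', 'video', 'web1']
```

Starvation of best-effort traffic under sustained real-time load is the usual cost of this priority structure, which is why the paper measures packet loss for the non-real-time class.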
AN ALTERNATE APPROACH TO RESOURCE ALLOCATION STRATEGY USING NETWORK METRICSIN...ijgca
Monitoring in a grid environment involves the analysis of all resource metrics and network metrics. Monitoring the resource metrics helps the grid middleware decide which job should be submitted to which resource. The decision to submit a job is improved by also considering the network metrics, and the network metrics are tuned if performance degrades.
Load distribution of analytical query workloads for database cluster architec...Matheesha Fernando
The document summarizes a research paper on optimizing the distribution of analytical query workloads across multiple database servers. It discusses:
1) How database clusters work and the idea of using materialized query tables (MQTs) to optimize analytical queries.
2) The proposed framework which uses a genetic algorithm-based scheduler to optimize mapping of queries and MQTs to servers to minimize overall workload completion time.
3) An evaluation of the genetic algorithm approach against exhaustive search and greedy algorithms on synthetic workloads, finding it provides results close to exhaustive search.
The document discusses process management in operating systems. It covers control blocks, interrupts, process states, scheduling algorithms like FIFO, SJF, SRTF, Round Robin and priority scheduling. It also discusses queuing, multiprogramming vs time sharing and scheduling criteria like CPU utilization, throughput, turnaround time and waiting time. Scheduling can be long, medium or short term and algorithms include priority queues and multilevel feedback queues.
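Round Robin, one of the algorithms listed above, can be sketched compactly (burst times and the quantum are illustrative; all processes are assumed to arrive at time 0, and real schedulers also track arrivals and priorities):

```python
from collections import deque

# Compact Round Robin: each process runs for at most one quantum, then
# goes to the back of the ready queue if work remains.

def round_robin(bursts, quantum):
    """bursts: {pid: cpu_burst}. Returns {pid: completion_time}."""
    remaining = dict(bursts)
    ready = deque(bursts)
    clock = 0
    completion = {}
    while ready:
        pid = ready.popleft()
        run = min(quantum, remaining[pid])
        clock += run
        remaining[pid] -= run
        if remaining[pid] == 0:
            completion[pid] = clock
        else:
            ready.append(pid)      # preempted: back of the queue
    return completion

done = round_robin({"A": 5, "B": 3, "C": 1}, quantum=2)
print(done)  # {'C': 5, 'B': 8, 'A': 9}
```

Turnaround time here is the completion time (arrival is 0), and waiting time is turnaround minus burst, the criteria the document lists.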
Task scheduling methodologies for high speed computing systems - ijesajournal
High-speed computing meets ever-increasing real-time computational demands by leveraging flexibility and parallelism. Flexibility is achieved when the computing platform is designed with heterogeneous resources to support the multifarious tasks of an application, whereas task scheduling brings parallel processing. Efficient task scheduling is critical to obtain optimized performance in heterogeneous computing systems (HCS). In this paper, we review various application scheduling models that provide parallelism for homogeneous and heterogeneous computing systems, as well as scheduling methodologies targeted at high-speed computing systems, and we prepare a summary chart. The comparative study of scheduling methodologies for high-speed computing systems is carried out based on attributes of both the platform and the application: execution time, nature of the task, task-handling capability, and type of host and computing platform. The summary chart demonstrates the need to develop scheduling methodologies for Heterogeneous Reconfigurable Computing Systems (HRCS), an emerging high-speed computing platform for real-time applications.
This document discusses memory management techniques in operating systems. It provides background on how programs must be brought into memory to execute and techniques for organizing memory like segmentation and paging. It describes the multistep process a user program goes through before execution including being placed in a process in memory. It also discusses logical versus physical addresses, the memory management unit that maps virtual to physical addresses, and dynamic loading and linking of code.
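The logical-to-physical mapping performed by the memory management unit can be sketched with a toy page table (the page size and table contents are illustrative assumptions): a virtual address splits into a page number and an offset, the page table swaps the page number for a frame number, and the offset is kept unchanged.

```python
# Toy MMU translation with paging.

PAGE_SIZE = 4096

def translate(vaddr, page_table):
    """Map a logical address to a physical one via the page table."""
    page, offset = divmod(vaddr, PAGE_SIZE)
    frame = page_table[page]          # KeyError here models a page fault
    return frame * PAGE_SIZE + offset

page_table = {0: 7, 1: 2}             # page number -> frame number
assert translate(0x0004, page_table) == 7 * PAGE_SIZE + 4
assert translate(PAGE_SIZE + 100, page_table) == 2 * PAGE_SIZE + 100
```

Real hardware does the split with bit masking rather than division, and caches recent translations in a TLB.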
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...vtunotesbysree
Here are three major complications that concurrent processing adds to an operating system:
1. Resource allocation and scheduling becomes more complex. The OS must allocate CPU time, memory, file descriptors, etc. among multiple concurrent processes and ensure all processes receive adequate resources. It must also schedule which process runs at what time on what CPU core.
2. Synchronization and communication between processes is more difficult. The OS must provide mechanisms for processes to synchronize their actions when accessing shared resources and to allow inter-process communication. This introduces challenges around things like race conditions and deadlocks.
3. Reliability and fault tolerance are harder. If one process crashes or hangs, it should not affect other processes. The OS must be able to isolate the failure so that the remaining processes keep running.
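Complication 2 can be made concrete with a short race-condition example: without the lock, the read-modify-write on the shared counter can lose updates when threads interleave; with it, every increment survives.

```python
import threading

# Two-line demonstration of the synchronization complication: a shared
# counter incremented from several threads, protected by a lock.

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:          # remove this and increments can be lost
            counter += 1

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: no increment was lost
```

Deadlock, the other hazard named above, arises when two such locks are acquired in opposite orders by two threads.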
Distributed Dynamic Replication Management Mechanism Based on Accessing Frequ...May Sit Hman
Master's thesis in Computer Science, based on distributed systems and database systems. It is about replica management on a distributed database system.
Resource monitoring involves obtaining information about the utilization of system resources like CPU, bandwidth, memory, and storage. These resources impact system performance and user productivity. There are two categories of systems to monitor - those experiencing performance issues and those running well for capacity planning. Key resources to monitor include CPU usage, memory usage, disk I/O, and network bandwidth. A variety of tools like top, vmstat, and iostat can help monitor resource utilization.
Learning scheduler parameters for adaptive preemptioncsandit
An operating system scheduler is expected to not al
low processor stay idle if there is any
process ready or waiting for its execution. This pr
oblem gains more importance as the numbers
of processes always outnumber the processors by lar
ge margins. It is in this regard that
schedulers are provided with the ability to preempt
a running process, by following any
scheduling algorithm, and give us an illusion of si
multaneous running of several processes. A
process which is allowed to utilize CPU resources f
or a fixed quantum of time (termed as
timeslice for preemption) and is then preempted for
another waiting process. Each of these
'process preemption' leads to considerable overhead
of CPU cycles which are valuable resource
for runtime execution. In this work we try to utili
ze the historical performances of a scheduler
and predict the nature of current running process,
thereby trying to reduce the number of
preemptions. We propose a machine-learning module t
o predict a better performing timeslice
which is calculated based on static knowledge base
and adaptive reinforcement learning based
suggestive module. Results for an "adaptive timesli
ce parameter" for preemption show good
saving on CPU cycles and efficient throughput time.
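The "adaptive timeslice parameter" idea can be sketched without the learning machinery (the doubling/halving rule and the bounds below are illustrative assumptions, not the proposed reinforcement-learning module): a process that keeps exhausting its quantum looks CPU-bound and earns a longer slice, cutting preemptions; one that yields early looks I/O-bound and has its slice shrunk back.

```python
# Heuristic adaptive timeslice, a stand-in for the learned predictor.

def adapt_timeslice(slice_ms, used_ms, floor=10, ceil=200):
    if used_ms >= slice_ms:           # used the whole quantum: CPU-bound
        return min(ceil, slice_ms * 2)
    return max(floor, slice_ms // 2)  # yielded early: I/O-bound

ts = 20
for used in (20, 40, 80, 5):          # three full quanta, then an early yield
    ts = adapt_timeslice(ts, used)
print(ts)  # 80: grew to 160, then halved after the early yield
```

The paper replaces this fixed rule with a prediction drawn from the scheduler's historical performance.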
COMPARATIVE ANALYSIS OF FCFS, SJN & RR JOB SCHEDULING ALGORITHMS - ijcsit
This document compares the performance of three job scheduling algorithms - First Come First Serve (FCFS), Shortest Job Next (SJN), and Round Robin (RR). It presents the results of simulations run using five sample processes with different arrival times and execution times. The results show that RR has the lowest average response time but highest average turnaround time. SJN has the lowest average waiting time. FCFS generally has the highest average waiting, response, and turnaround times. The document concludes that no single algorithm is superior across all metrics and that further research could explore additional scheduling algorithms.
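The waiting-time metric compared in the document can be reproduced on a toy workload (the burst times below are illustrative, not the paper's five sample processes, and all arrivals are at time 0): waiting time is the time a process sits in the queue before starting, and SJN minimizes its average.

```python
# Average waiting time for a non-preemptive service order.

def avg_waiting(order, bursts):
    clock, total_wait = 0, 0
    for pid in order:
        total_wait += clock        # this process waited until 'clock'
        clock += bursts[pid]
    return total_wait / len(order)

bursts = {"P1": 6, "P2": 8, "P3": 7, "P4": 3}
fcfs = ["P1", "P2", "P3", "P4"]            # arrival order
sjn = sorted(bursts, key=bursts.get)       # shortest job next
print(avg_waiting(fcfs, bursts))  # 10.25
print(avg_waiting(sjn, bursts))   # 7.0
```

This mirrors the document's finding that SJN has the lowest average waiting time while FCFS generally has the highest.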
International Journal of Engineering and Science Invention (IJESI) - inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field of Engineering, Science and Technology, covering new teaching methods, assessment, validation and the impact of new technologies, and it will continue to provide information on the latest trends and developments in this ever-expanding subject. Papers are selected through double peer review to ensure originality, relevance, and readability. The articles published in our journal can be accessed online.
This document summarizes research on scheduling algorithms for loading streaming data into real-time data warehouses. The goal is to minimize data staleness over time. It describes how streaming warehouses continuously ingest incoming data streams to support time-critical analyses, unlike traditional warehouses which are periodically refreshed. It presents a model for temporal consistency and defines data staleness. It formulates the streaming warehouse update problem as a scheduling problem to minimize staleness and proves that any online, non-preemptive scheduling algorithm can achieve staleness within a constant factor of optimal if processors are sufficiently fast and no processor is idly waiting.
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...CSCJournals
An ideal Network Processor, that is, a programmable multi-processor device, must be capable of offering both the flexibility and the speed required for packet processing. But current Network Processor systems generally fall short of the above benchmarks due to traffic fluctuations inherent in packet networks, and the resulting workload variation on individual pipeline stages over a period of time ultimately affects the overall performance of even an otherwise sound system. One potential solution would be to change the code running at these stages so as to adapt to the fluctuations; a near-robust system withstanding traffic fluctuations is the dynamically adaptive processor, which reconfigures the entire system and which we introduce and study to some extent in this paper. We achieve this by using a crucial decision-making model, transferring the binary code to the processor through the SOAP protocol.
The document summarizes a proposed scheduling technique called Real Time Conflict-free Query Scheduling (RTCQS) for wireless sensor networks. RTCQS aims to increase throughput for high data rate sensor applications while supporting real-time queries. It uses a query planner to construct transmission plans for queries as sequential conflict-free steps. A query scheduler then schedules the query instances, using preemption for higher priority queries or concurrent execution when no conflicts exist. The goal is high throughput, low latency, and adaptability to varying workloads.
The document describes a proposed Fuzzy-AQM algorithm for congestion control in wireless ad-hoc networks. It begins by summarizing common Active Queue Management (AQM) policies and their issues. It then discusses congestion in ad-hoc networks and how the proposed Fuzzy-AQM algorithm uses fuzzy logic rules based on queue size and neighbor density to dynamically calculate packet drop probability, aiming to improve network performance. Simulation results showed the effectiveness of Fuzzy-AQM for congestion detection and avoidance.
Data prefetching has been considered an effective way to bridge the performance gap between processor and memory and to mask the data access latency caused by cache misses. Data prefetching brings data closer to a processor before it is actually needed, with hardware and/or software support. Many prefetching techniques have been proposed in recent years to reduce data access latency by taking advantage of multi-core architectures. In this paper, a taxonomy is proposed that classifies the various design concerns in developing a prefetching strategy, along with the various prefetching strategies and the issues that have to be considered in designing a prefetching strategy for multi-core processors. Yee Yee Soe, "A Taxonomy of Data Prefetching Mechanisms", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3 | Issue-6, October 2019, URL: https://www.ijtsrd.com/papers/ijtsrd29339.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/29339/a-taxonomy-of-data-prefetching-mechanisms/yee-yee-soe
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDS - Maurvi04
This document discusses fault tolerance techniques for computational grids. It begins with an introduction to grid computing and defines some key terms related to faults and failures. It then discusses different types of faults that can occur in grids, including physical faults, network faults, and process faults. It outlines several fault tolerance techniques used in grids, including job and data replication, checkpointing, scheduling approaches, and load balancing strategies. The document concludes with suggestions for future work, such as optimizing checkpoint storage and granularity.
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...cscpconf
The processor's performance is highly dependent on the regular supply of correct instructions at the right time. One mechanism proposed to reduce instruction cache misses is instruction prefetching, which in turn increases the instruction supply to the processor. Technology developments in these fields indicate that in future the gap between the processing speed of the processor and the data transfer speed of memory is likely to increase. Memory bandwidth can be increased significantly using prefetching, but unsuccessful prefetches pollute the primary cache. Prefetching can be done either in software or in hardware. In software prefetching, the compiler inserts prefetch code into the program; in this case, since the actual memory capacity is not known to the compiler, some harmful prefetches result. Hardware prefetching, instead of inserting prefetch code, makes use of extra hardware that is utilized during execution. The most significant source of lost performance is when the processor waits for the availability of the next instruction. All the prefetching methods stress only the fetching of the instruction for execution, not the overall performance of the processor. This paper is an attempt to study branch handling in a uniprocessing environment where, whenever a branch is identified, proper cache memory management is enabled inside the memory management unit.
The effective way of processor performance enhancement by proper branch handlingcsandit
CS 301 Computer Architecture, Student # 1, EID 09, Kingdom of .docx - faithxdunce63732
This document summarizes the results of simulations run to analyze the performance of different processor configurations with varying levels of instruction-level parallelism. The key findings are:
1) For processors with significant memory latency, there is little performance difference between simple in-order and more complex out-of-order designs, as memory latency dominates execution time.
2) Supporting just two concurrently pending instructions provides most of the benefit of more complex out-of-order execution, while greatly reducing hardware complexity.
3) As the mismatch between processor and memory system performance increases, all designs see similar performance, regardless of the level of instruction-level parallelism exploited.
International Journal of Engineering Research and Development (IJERD)IJERD Editor
AN ATTEMPT TO IMPROVE THE PROCESSOR PERFORMANCE BY PROPER MEMORY MANAGEMENT F...IJCSEA Journal
The performance of the processor is highly dependent on a steady supply of the correct instructions at the right time. Whenever a data miss occurs in the cache memory, the processor must spend extra cycles on the fetch operation. One method used to reduce instruction cache misses is instruction prefetching, which increases the instruction supply to the processor. Technology trends in these fields indicate that the gap between processor speed and memory transfer speed is likely to widen. Branch predictors play a critical role in achieving effective performance in many modern pipelined microprocessor architectures.
Prefetching can be done either in software or in hardware. In software prefetching, the compiler inserts prefetch instructions into the program; because the actual memory capacity is not known to the compiler, some of these prefetches may be harmful. Hardware prefetching, instead of inserting prefetch code, uses extra hardware that operates during execution. The most significant source of lost performance is the processor waiting for the availability of the next instruction, and the time wasted on a branch misprediction equals the number of pipeline stages from the fetch stage to the execute stage. Existing prefetching methods stress only the fetching of instructions for execution, not the overall performance of the processor. In this paper we attempt to study branch handling in a uniprocessing environment where, whenever a branch is identified, proper cache memory management is enabled inside the memory management unit instead of invoking branch prediction.
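The abstract's observation that a misprediction wastes as many cycles as there are pipeline stages between fetch and execute reduces to a one-line cost model; the figures below are assumed purely for illustration.

```python
# Cycles lost to branch mispredictions: each misprediction flushes the
# pipeline from fetch to execute, so the per-event penalty equals that depth.
def misprediction_cycles(branches, mispredict_rate, fetch_to_execute_stages):
    return int(branches * mispredict_rate * fetch_to_execute_stages)

# Assumed example: 1e6 branches, 5% mispredicted, 5-stage fetch-to-execute path.
print(misprediction_cycles(1_000_000, 0.05, 5))  # 250000 wasted cycles
```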
AN ALTERNATE APPROACH TO RESOURCE ALLOCATION STRATEGY USING NETWORK METRICSIN...ijgca
Monitoring in a grid environment involves the analysis of all resource metrics and network metrics. Monitoring the resource metrics helps the grid middleware decide which job should be submitted to which resource, and the decision is better still when the network metrics are also considered. The network metrics are also tuned if performance degrades.
Load distribution of analytical query workloads for database cluster architec...Matheesha Fernando
The document summarizes a research paper on optimizing the distribution of analytical query workloads across multiple database servers. It discusses:
1) How database clusters work and the idea of using materialized query tables (MQTs) to optimize analytical queries.
2) The proposed framework which uses a genetic algorithm-based scheduler to optimize mapping of queries and MQTs to servers to minimize overall workload completion time.
3) An evaluation of the genetic algorithm approach against exhaustive search and greedy algorithms on synthetic workloads, finding it provides results close to exhaustive search.
The document discusses process management in operating systems. It covers control blocks, interrupts, process states, scheduling algorithms like FIFO, SJF, SRTF, Round Robin and priority scheduling. It also discusses queuing, multiprogramming vs time sharing and scheduling criteria like CPU utilization, throughput, turnaround time and waiting time. Scheduling can be long, medium or short term and algorithms include priority queues and multilevel feedback queues.
Task scheduling methodologies for high speed computing systemsijesajournal
High-speed computing meets ever-increasing real-time computational demands by leveraging flexibility and parallelism. Flexibility is achieved when the computing platform is designed with heterogeneous resources to support the varied tasks of an application, whereas task scheduling brings parallel processing. Efficient task scheduling is critical to obtaining optimized performance in heterogeneous computing systems (HCS). In this paper we review various application scheduling models that provide parallelism for homogeneous and heterogeneous computing systems, survey scheduling methodologies targeted at high-speed computing systems, and prepare a summary chart. The comparative study of scheduling methodologies for high-speed computing systems is carried out based on attributes of both the platform and the application: execution time, nature of the task, task-handling capability, and type of host and computing platform. The summary chart demonstrates the need to develop scheduling methodologies for Heterogeneous Reconfigurable Computing Systems (HRCS), an emerging high-speed computing platform for real-time applications.
This document discusses memory management techniques in operating systems. It provides background on how programs must be brought into memory to execute and techniques for organizing memory like segmentation and paging. It describes the multistep process a user program goes through before execution including being placed in a process in memory. It also discusses logical versus physical addresses, the memory management unit that maps virtual to physical addresses, and dynamic loading and linking of code.
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...vtunotesbysree
Here are three major complications that concurrent processing adds to an operating system:
1. Resource allocation and scheduling becomes more complex. The OS must allocate CPU time, memory, file descriptors, etc. among multiple concurrent processes and ensure all processes receive adequate resources. It must also schedule which process runs at what time on what CPU core.
2. Synchronization and communication between processes is more difficult. The OS must provide mechanisms for processes to synchronize their actions when accessing shared resources and to allow inter-process communication. This introduces challenges around things like race conditions and deadlocks.
3. Reliability and fault tolerance are harder. If one process crashes or hangs, it should not affect other processes. The OS must be able to
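The synchronization challenge in complication 2 can be illustrated with a minimal sketch: four threads increment a shared counter, and the lock serializes the read-modify-write so no updates are lost. Without the lock, interleaved updates could make the final total come up short.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    global counter
    for _ in range(n):
        with lock:  # serialize the read-modify-write on the shared counter
            counter += 1

threads = [threading.Thread(target=add, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000: the lock makes the result deterministic
```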
Distributed Dynamic Replication Management Mechanism Based on Accessing Frequ...May Sit Hman
Thesis (Master) of Computer Science. It's based on Distributed System and Database System. It's about management replica on distributed database systemm.
Resource monitoring involves obtaining information about the utilization of system resources like CPU, bandwidth, memory, and storage. These resources impact system performance and user productivity. There are two categories of systems to monitor - those experiencing performance issues and those running well for capacity planning. Key resources to monitor include CPU usage, memory usage, disk I/O, and network bandwidth. A variety of tools like top, vmstat, and iostat can help monitor resource utilization.
Learning scheduler parameters for adaptive preemptioncsandit
An operating system scheduler is expected not to let a processor stay idle if any process is ready or waiting for execution. This problem gains importance as processes always outnumber processors by large margins. It is in this regard that schedulers are given the ability to preempt a running process, following some scheduling algorithm, giving the illusion that several processes run simultaneously. A process is allowed to utilize CPU resources for a fixed quantum of time (termed the timeslice) and is then preempted in favour of another waiting process. Each such preemption incurs a considerable overhead of CPU cycles, which are a valuable runtime resource. In this work we use a scheduler's historical performance to predict the nature of the currently running process and thereby reduce the number of preemptions. We propose a machine-learning module that predicts a better-performing timeslice, calculated from a static knowledge base and an adaptive, reinforcement-learning-based suggestive module. Results for an "adaptive timeslice parameter" for preemption show good savings in CPU cycles and efficient throughput time.
COMPARATIVE ANALYSIS OF FCFS, SJN & RR JOB SCHEDULING ALGORITHMSijcsit
This document compares the performance of three job scheduling algorithms - First Come First Serve (FCFS), Shortest Job Next (SJN), and Round Robin (RR). It presents the results of simulations run using five sample processes with different arrival times and execution times. The results show that RR has the lowest average response time but highest average turnaround time. SJN has the lowest average waiting time. FCFS generally has the highest average waiting, response, and turnaround times. The document concludes that no single algorithm is superior across all metrics and that further research could explore additional scheduling algorithms.
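As a rough illustration of why SJN achieves the lowest average waiting time while FCFS generally does not, here is a sketch for the non-preemptive case with all jobs arriving at time 0. The burst values are invented for illustration, not the paper's dataset.

```python
# Waiting time of each job under First Come First Serve: a job waits for the
# combined bursts of everything scheduled before it.
def fcfs_waits(bursts):
    waits, t = [], 0
    for b in bursts:
        waits.append(t)
        t += b
    return waits

# Shortest Job Next is FCFS over the bursts sorted ascending, which is exactly
# what minimizes the average wait for non-preemptive scheduling.
def sjn_waits(bursts):
    return fcfs_waits(sorted(bursts))

bursts = [6, 8, 7, 3, 4]   # CPU bursts; all jobs assumed to arrive at time 0
print(sum(fcfs_waits(bursts)) / len(bursts))  # 13.0 cycles average wait
print(sum(sjn_waits(bursts)) / len(bursts))   # 8.6 cycles, always <= FCFS
```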
International Journal of Engineering and Science Invention (IJESI)inventionjournals
International Journal of Engineering and Science Invention (IJESI) is an international journal intended for professionals and researchers in all fields of computer science and electronics. IJESI publishes research articles and reviews within the whole field Engineering Science and Technology, new teaching methods, assessment, validation and the impact of new technologies and it will continue to provide information on the latest trends and developments in this ever-expanding subject. The publications of papers are selected through double peer reviewed to ensure originality, relevance, and readability. The articles published in our journal can be accessed online
This document summarizes research on scheduling algorithms for loading streaming data into real-time data warehouses. The goal is to minimize data staleness over time. It describes how streaming warehouses continuously ingest incoming data streams to support time-critical analyses, unlike traditional warehouses which are periodically refreshed. It presents a model for temporal consistency and defines data staleness. It formulates the streaming warehouse update problem as a scheduling problem to minimize staleness and proves that any online, non-preemptive scheduling algorithm can achieve staleness within a constant factor of optimal if processors are sufficiently fast and no processor is idly waiting.
Run-Time Adaptive Processor Allocation of Self-Configurable Intel IXP2400 Net...CSCJournals
An ideal network processor, that is, a programmable multi-processor device, must offer both the flexibility and the speed required for packet processing. Current network processor systems generally fall short of this benchmark because of the traffic fluctuations inherent in packet networks: the resulting workload variation on individual pipeline stages over time ultimately degrades the overall performance of an otherwise sound system. One potential solution is to change the code running at these stages so as to adapt to the fluctuations; in this paper we introduce and study a dynamically adaptive processor that withstands traffic fluctuations by reconfiguring the entire system. We achieve this by using a crucial decision-making model, transferring the binary code to the processor through the SOAP protocol.
The document summarizes a proposed scheduling technique called Real Time Conflict-free Query Scheduling (RTCQS) for wireless sensor networks. RTCQS aims to increase throughput for high data rate sensor applications while supporting real-time queries. It uses a query planner to construct transmission plans for queries as sequential conflict-free steps. A query scheduler then schedules the query instances, using preemption for higher priority queries or concurrent execution when no conflicts exist. The goal is high throughput, low latency, and adaptability to varying workloads.
The document describes a proposed Fuzzy-AQM algorithm for congestion control in wireless ad-hoc networks. It begins by summarizing common Active Queue Management (AQM) policies and their issues. It then discusses congestion in ad-hoc networks and how the proposed Fuzzy-AQM algorithm uses fuzzy logic rules based on queue size and neighbor density to dynamically calculate packet drop probability, aiming to improve network performance. Simulation results showed the effectiveness of Fuzzy-AQM for congestion detection and avoidance.
Data prefetching has been considered an effective way to bridge the performance gap between processor and memory and to mask the data access latency caused by cache misses. Data prefetching brings data closer to the processor, with hardware and/or software support, before it is actually needed. Many prefetching techniques have been proposed in recent years to reduce data access latency by taking advantage of multi-core architectures. In this paper, a taxonomy is proposed that classifies the various design concerns in developing a prefetching strategy, together with the prefetching strategies themselves and the issues that must be considered in designing a prefetching strategy for multi-core processors. Yee Yee Soe, "A Taxonomy of Data Prefetching Mechanisms", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-3, Issue-6, October 2019. URL: https://www.ijtsrd.com/papers/ijtsrd29339.pdf Paper URL: https://www.ijtsrd.com/computer-science/other/29339/a-taxonomy-of-data-prefetching-mechanisms/yee-yee-soe
FAULT TOLERANCE OF RESOURCES IN COMPUTATIONAL GRIDSMaurvi04
This document discusses fault tolerance techniques for computational grids. It begins with an introduction to grid computing and defines some key terms related to faults and failures. It then discusses different types of faults that can occur in grids, including physical faults, network faults, and process faults. It outlines several fault tolerance techniques used in grids, including job and data replication, checkpointing, scheduling approaches, and load balancing strategies. The document concludes with suggestions for future work, such as optimizing checkpoint storage and granularity.
THE EFFECTIVE WAY OF PROCESSOR PERFORMANCE ENHANCEMENT BY PROPER BRANCH HANDL...cscpconf
Paper chosen for DesignCon 2015: Critical Memory Performance Metrics for DDR4. Is DDR4 the end of the DDR line of memory technologies? If so, then stretching DDR4 to give that much more performance is critical. This paper discusses how to measure the intricate performance metrics of your DDR4 system and why they matter. Understanding these critical parameters can lead to better system design, memory controller architecture and software design. Metrics such as Power Management, Page Hits, Bank Group and Bank Utilization, Multiple Open Bank Analysis, Data Bus Utilization and overhead on a DDR4 memory bus are demonstrated and discussed.
Enhancing proxy based web caching system using clustering based pre fetching ...eSAT Publishing House
IJRET : International Journal of Research in Engineering and Technology is an international peer reviewed, online journal published by eSAT Publishing House for the enhancement of research in various disciplines of Engineering and Technology. The aim and scope of the journal is to provide an academic medium and an important reference for the advancement and dissemination of research results that support high-level learning, teaching and research in the fields of Engineering and Technology. We bring together Scientists, Academician, Field Engineers, Scholars and Students of related fields of Engineering and Technology.
Performing initiative data prefetchingKamal Spring
This paper presents an initiative data prefetching scheme on the storage servers in distributed file systems for cloud
computing. In this prefetching technique, the client machines are not substantially involved in the process of data prefetching, but the
storage servers can directly prefetch the data after analyzing the history of disk I/O access events, and then send the prefetched data
to the relevant client machines proactively. To put this technique to work, the information about client nodes is piggybacked onto the
real client I/O requests, and then forwarded to the relevant storage server. Next, two prediction algorithms have been proposed to
forecast future block access operations for directing what data should be fetched on storage servers in advance. Finally, the prefetched
data can be pushed to the relevant client machine from the storage server. Through a series of evaluation experiments with a
collection of application benchmarks, we have demonstrated that our initiative prefetching technique can benefit distributed file systems in cloud environments by achieving better I/O performance. In particular, configuration-limited client machines in the cloud are relieved of the task of predicting I/O access operations, which contributes to better system performance on them.
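One simple way to "forecast future block access operations" of the kind this paper describes is a first-order Markov predictor over the observed block sequence. This sketch is an assumption for illustration, not the paper's actual prediction algorithms.

```python
from collections import Counter, defaultdict

# First-order Markov predictor: record which block follows which, and predict
# the most frequent successor of the most recently accessed block.
class NextBlockPredictor:
    def __init__(self):
        self.successors = defaultdict(Counter)
        self.last = None

    def observe(self, block):
        if self.last is not None:
            self.successors[self.last][block] += 1
        self.last = block

    def predict(self):
        seen = self.successors.get(self.last)
        return seen.most_common(1)[0][0] if seen else None

p = NextBlockPredictor()
for b in [1, 2, 3, 1, 2, 3, 1, 2]:   # repeating block-access pattern
    p.observe(b)
print(p.predict())  # 3: the block that most often follows block 2
```

A storage server running such a predictor could push the predicted block to the client ahead of the actual request, which is the "initiative" part of the scheme.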
The document discusses parallel computing platforms and trends in microprocessor architectures that enable implicit parallelism. It covers topics like pipelining, superscalar execution, limitations of memory performance, and how caches can improve effective memory latency. The key points are:
1) Microprocessor clock speeds have increased dramatically but limitations remain regarding memory latency and bandwidth. Parallelism addresses performance bottlenecks in processors, memory, and communication.
2) Techniques like pipelining and superscalar execution exploit implicit parallelism by executing multiple instructions concurrently, but dependencies and branch prediction limit performance gains.
3) Memory latency is often the bottleneck, but caches can reduce effective latency through data reuse and temporal locality.
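The claim in point 3, that caches reduce effective memory latency, is usually quantified as average memory access time (AMAT); the figures below are assumed for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average memory access time: every access pays hit_time, and the
    # fraction that misses additionally pays the memory penalty.
    return hit_time + miss_rate * miss_penalty

# Assumed figures: 1-cycle hit, 2% miss rate, 100-cycle memory access.
print(amat(1, 0.02, 100))  # about 3 cycles on average, versus 100 uncached
```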
Second phase report on "ANALYZING THE EFFECTIVENESS OF THE ADVANCED ENCRYPTIO...Nikhil Jain
To implement and improve the performance of Advanced Encryption Standard algorithm by using multicore systems and Open MP API extracting as much parallelism as possible from the algorithm in parallel implementation approach.
An octa core processor with shared memory and message-passingeSAT Journals
Abstract: In this era of fast, high-performance computing, efficient optimizations are needed both in the processor architecture and in the memory hierarchy. Every day, the advancing applications in communication and multimedia systems push the main processor toward more cores: dual-core, quad-core, octa-core and so on. But to enhance the overall performance of a multi-processor chip, there are stringent requirements to improve inter-core synchronization. Thus, an MPSoC with 8 cores supporting both message-passing and shared-memory inter-core communication mechanisms is implemented on a Virtex 5 LX110T FPGA. Each core is based on the MIPS III (Microprocessor without Interlocked Pipelined Stages) ISA, handles only integer instructions, and has a six-stage pipeline with a data hazard detection unit and forwarding logic. The eight processing cores and one central shared-memory core are interconnected using a 3x3 2-D mesh topology based Network-on-Chip (NoC) with a virtual-channel router. The router is four-stage pipelined, supports the DOR X-Y routing algorithm, and uses round-robin arbitration. For verification and functionality testing of the fully synthesized multi-core processor, a matrix multiplication operation is mapped onto it; the multiplications and additions for each element of the resultant matrix are partitioned and scheduled among the eight cores to obtain maximum throughput. All processor design code is written in Verilog HDL. Keywords: MPSoC, message-passing, shared memory, MIPS, ISA, wormhole router, network-on-chip, SIMD, data level parallelism, 2-D Mesh, virtual channel
Query Evaluation Techniques for Large Databases.pdfRayWill4
This document surveys techniques for efficiently executing queries over large databases. It describes algorithms for sorting, hashing, aggregation, joins and other operations. It also discusses parallel query execution, complex query plans, and techniques for non-traditional data models. The goal is to provide a foundation for designing query execution facilities in new database management systems.
Performance Review of Zero Copy TechniquesCSCJournals
E-government and corporate servers will require higher performance and security as usage increases. Zero copy refers to a collection of techniques which reduce the number of copies of blocks of data in order to make data transfer more efficient. By avoiding redundant data copies, the consumption of memory and CPU resources are reduced, thereby improving performance of the server. To eliminate costly data copies between user space and kernel space or between two buffers in the kernel space, various schemes are used, such as memory remapping, shared buffers, and hardware support. However, the advantages are sometimes overestimated and new security issues arise. This paper describes different approaches to implementing zero copy and evaluates these methods for their performance and security considerations, to help when evaluating these techniques for use in e-government applications
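A user-space analogue of the copy-avoidance idea: in Python, a `memoryview` exposes a slice of a buffer without materializing a copy, while a `bytes` slice does copy. This is illustrative only; the kernel-level zero-copy schemes the paper evaluates rely on mechanisms such as memory remapping and shared buffers.

```python
data = bytearray(b"e-government payload")

copy_slice = bytes(data[2:12])        # materializes a new 10-byte object
view_slice = memoryview(data)[2:12]   # zero-copy window onto the same buffer

data[2:12] = b"GOVERNMENT"            # same length, so no buffer resize
print(bytes(view_slice))  # b'GOVERNMENT': the view sees the in-place update
print(copy_slice)         # b'government': the earlier copy does not
```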
This document discusses energy-efficient hardware data prefetching. It begins with an introduction to data prefetching and why it is needed due to the growing gap between processor and memory speeds. It then covers different types of prefetching techniques including software-based, hardware-based, sequential, stride, and pointer prefetching. It also discusses the tradeoffs between software and hardware approaches. Finally, it introduces the concept of energy-aware data prefetching to reduce the increased energy consumption from aggressive prefetching techniques.
AFFECT OF PARALLEL COMPUTING ON MULTICORE PROCESSORScscpconf
The main aim of our research is to find the limit of Amdahl's Law for multicore processors, so that the number of cores contributes more efficiency to the overall architecture of the CMP (Chip Multi-Processor, a.k.a. multicore processor). This limit is expected to lie either in the architecture of the multicore processor or in the programming. We surveyed the multicore processor architectures of various chip manufacturers, namely INTEL™, AMD™ and IBM™, and the techniques they follow to improve multicore performance. We conducted cluster experiments to find this limit, and in this paper we propose an alternative multicore processor design based on the results of our cluster experiment.
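The limit the authors look for follows directly from Amdahl's Law: with serial fraction s, speedup is bounded by 1/s no matter how many cores the CMP provides. A quick sketch, with s = 0.05 assumed for illustration:

```python
def amdahl_speedup(serial_fraction, cores):
    # Amdahl's Law: the serial fraction bounds the achievable speedup.
    return 1 / (serial_fraction + (1 - serial_fraction) / cores)

for cores in (2, 8, 64, 1_000_000):
    print(cores, round(amdahl_speedup(0.05, cores), 2))
# With 5% serial work the speedup saturates near 1/0.05 = 20x,
# regardless of core count.
```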
Affect of parallel computing on multicore processorscsandit
1. The document discusses research activities related to reducing energy consumption by at least 30% through the development of core source technologies for universal operating systems.
2. It describes four papers being presented, including ones on system and device latency modeling, power management frameworks for embedded systems, and automatic selection of power policies for operating systems.
3. It also summarizes four research topics from the National University, including performance evaluation of parallel applications using a power-aware paging method on next-generation memory architectures.
Study on Fog Computing and Data Concurrency in IoT. Includes an analysis of different data concurrency techniques, their principle and some recent developments in the area. Also covers the topic of Fog Computing and its development and application in IoT.
Texture based feature extraction and object trackingPriyanka Goswami
This document provides a project report on texture-based feature extraction and object tracking. It discusses using various texture analysis techniques like Local Binary Pattern (LBP), Local Derivative Pattern (LDP), and Local Ternary Pattern (LTP) to extract features from images for tasks like cloud tracking. It implements these techniques in MATLAB and evaluates them on standard datasets to extract features and represent images with histograms for tasks like image recognition and analysis while reducing computational requirements compared to using raw images. The techniques are then applied to track cloud motion in weather satellite images by analyzing differences in texture histograms over time.
The project involved studying some of the popular filters and prediction algorithms used for stock market analysis. Based on that Moving Average Filter, Adaptive Kalman Filter, Multiple Linear Regression Filter, Bollinger Bands, and Chaikin Oscillator were developed and implemented in MATLAB. For carrying out the analysis, daily stock market data of 10 popular companies, over a period of 1 year was used. The overall project developed can be used as a complete package to carry out accurate and efficient stock market analysis and trend study.
The document describes an embedded system payload controller for an electro-optical satellite payload. The payload controller consists of embedded hardware implemented on an FPGA using various IP cores. It includes a Core8051 microcontroller, MIL-STD-1553 IP core, CoreTimer, and AMBA bus. The Core8051 initializes the IP cores and runs application software including a scheduler to control payload functions like commanding, power control, and temperature regulation on a defined schedule.
This document discusses data acquisition systems. It describes the typical components of a data acquisition system including sensors, data acquisition hardware, and computer software. The hardware acquires analog signals from sensors, converts the signals to digital values using an analog-to-digital converter, and transfers the data to a computer. The software analyzes and stores the digital data. Common applications of data acquisition systems include industrial processes and laboratory research. The document also provides examples of components such as Arduino boards and LabVIEW software that can be used to build simple, low-cost data acquisition systems.
The document discusses data acquisition systems. It defines data acquisition as the process of sampling real-world signals and converting them to digital values. A data acquisition system consists of sensors, DAQ hardware, and software. The key components are sensors that measure physical variables, signal conditioning hardware, analog-to-digital converters, and software for processing and analysis. Data acquisition systems are used widely in industries for measurement and control applications.
Biomedical Image Processing
Topics covered: Biomedical imaging, Need of image processing in medicine, Principles of image processing, Components of image processing, Application of image processing in different medical imaging systems
This document discusses thermal imaging and its various applications. It begins by explaining that thermal imaging produces images based on the heat detected from objects and was originally developed for military purposes. It then provides details on:
- How thermal imaging cameras work to detect differences in temperature and produce images.
- Common applications of thermal imaging in fields like firefighting, law enforcement, medical, agriculture, and more.
- The advantages of thermal imaging like its ability to see in total darkness and penetrate obscurants like smoke.
- Specific uses of thermal imaging in border security, condition monitoring, night vision, medical screening, and evaluating solar panels.
Electric vehicle and photovoltaic advanced roles in enhancing the financial p...IJECEIAES
Climate change's impact on the planet forced the United Nations and governments to promote green energies and electric transportation. The deployments of photovoltaic (PV) and electric vehicle (EV) systems gained stronger momentum due to their numerous advantages over fossil fuel types. The advantages go beyond sustainability to reach financial support and stability. The work in this paper introduces the hybrid system between PV and EV to support industrial and commercial plants. This paper covers the theoretical framework of the proposed hybrid system including the required equation to complete the cost analysis when PV and EV are present. In addition, the proposed design diagram which sets the priorities and requirements of the system is presented. The proposed approach allows setup to advance their power stability, especially during power outages. The presented information supports researchers and plant owners to complete the necessary analysis while promoting the deployment of clean energy. The result of a case study that represents a dairy milk farmer supports the theoretical works and highlights its advanced benefits to existing plants. The short return on investment of the proposed approach supports the paper's novelty approach for the sustainable electrical system. In addition, the proposed system allows for an isolated power setup without the need for a transmission line which enhances the safety of the electrical network
Optimizing Gradle Builds - Gradle DPE Tour Berlin 2024Sinan KOZAK
Sinan from the Delivery Hero mobile infrastructure engineering team shares a deep dive into performance acceleration with Gradle build cache optimizations. Sinan shares their journey into solving complex build-cache problems that affect Gradle builds. By understanding the challenges and solutions found in our journey, we aim to demonstrate the possibilities for faster builds. The case study reveals how overlapping outputs and cache misconfigurations led to significant increases in build times, especially as the project scaled up with numerous modules using Paparazzi tests. The journey from diagnosing to defeating cache issues offers invaluable lessons on maintaining cache integrity without sacrificing functionality.
DEEP LEARNING FOR SMART GRID INTRUSION DETECTION: A HYBRID CNN-LSTM-BASED MODELgerogepatton
As digital technology becomes more deeply embedded in power systems, protecting the communication
networks of Smart Grids (SG) has emerged as a critical concern. Distributed Network Protocol 3 (DNP3)
represents a multi-tiered application layer protocol extensively utilized in Supervisory Control and Data
Acquisition (SCADA)-based smart grids to facilitate real-time data gathering and control functionalities.
Robust Intrusion Detection Systems (IDS) are necessary for early threat detection and mitigation because
of the interconnection of these networks, which makes them vulnerable to a variety of cyberattacks. To
solve this issue, this paper develops a hybrid Deep Learning (DL) model specifically designed for intrusion
detection in smart grids. The proposed approach is a combination of the Convolutional Neural Network
(CNN) and the Long-Short-Term Memory algorithms (LSTM). We employed a recent intrusion detection
dataset (DNP3), which focuses on unauthorized commands and Denial of Service (DoS) cyberattacks, to
train and test our model. The results of our experiments show that our CNN-LSTM method is much better
at finding smart grid intrusions than other deep learning algorithms used for classification. In addition,
our proposed approach improves accuracy, precision, recall, and F1 score, achieving a high detection
accuracy rate of 99.50%.
Harnessing WebAssembly for Real-time Stateless Streaming PipelinesChristina Lin
Traditionally, dealing with real-time data pipelines has involved significant overhead, even for straightforward tasks like data transformation or masking. However, in this talk, we’ll venture into the dynamic realm of WebAssembly (WASM) and discover how it can revolutionize the creation of stateless streaming pipelines within a Kafka (Redpanda) broker. These pipelines are adept at managing low-latency, high-data-volume scenarios.
Advanced control scheme of doubly fed induction generator for wind turbine us...IJECEIAES
This paper describes a speed control device for generating electrical energy on an electricity network based on the doubly fed induction generator (DFIG) used for wind power conversion systems. At first, a double-fed induction generator model was constructed. A control law is formulated to govern the flow of energy between the stator of a DFIG and the energy network using three types of controllers: proportional integral (PI), sliding mode controller (SMC) and second order sliding mode controller (SOSMC). Their different results in terms of power reference tracking, reaction to unexpected speed fluctuations, sensitivity to perturbations, and resilience against machine parameter alterations are compared. MATLAB/Simulink was used to conduct the simulations for the preceding study. Multiple simulations have shown very satisfying results, and the investigations demonstrate the efficacy and power-enhancing capabilities of the suggested control system.
A SYSTEMATIC RISK ASSESSMENT APPROACH FOR SECURING THE SMART IRRIGATION SYSTEMSIJNSA Journal
The smart irrigation system represents an innovative approach to optimize water usage in agricultural and landscaping practices. The integration of cutting-edge technologies, including sensors, actuators, and data analysis, empowers this system to provide accurate monitoring and control of irrigation processes by leveraging real-time environmental conditions. The main objective of a smart irrigation system is to optimize water efficiency, minimize expenses, and foster the adoption of sustainable water management methods. This paper conducts a systematic risk assessment by exploring the key components/assets and their functionalities in the smart irrigation system. The crucial role of sensors in gathering data on soil moisture, weather patterns, and plant well-being is emphasized in this system. These sensors enable intelligent decision-making in irrigation scheduling and water distribution, leading to enhanced water efficiency and sustainable water management practices. Actuators enable automated control of irrigation devices, ensuring precise and targeted water delivery to plants. Additionally, the paper addresses the potential threat and vulnerabilities associated with smart irrigation systems. It discusses limitations of the system, such as power constraints and computational capabilities, and calculates the potential security risks. The paper suggests possible risk treatment methods for effective secure system operation. In conclusion, the paper emphasizes the significant benefits of implementing smart irrigation systems, including improved water conservation, increased crop yield, and reduced environmental impact. Additionally, based on the security analysis conducted, the paper recommends the implementation of countermeasures and security approaches to address vulnerabilities and ensure the integrity and reliability of the system. 
By incorporating these measures, smart irrigation technology can revolutionize water management practices in agriculture, promoting sustainability, resource efficiency, and safeguarding against potential security threats.
Comparative analysis between traditional aquaponics and reconstructed aquapon...bijceesjournal
The aquaponic system of planting is a method that does not require soil usage. It is a method that only needs water, fish, lava rocks (a substitute for soil), and plants. Aquaponic systems are sustainable and environmentally friendly. Its use not only helps to plant in small spaces but also helps reduce artificial chemical use and minimizes excess water use, as aquaponics consumes 90% less water than soil-based gardening. The study applied a descriptive and experimental design to assess and compare conventional and reconstructed aquaponic methods for reproducing tomatoes. The researchers created an observation checklist to determine the significant factors of the study. The study aims to determine the significant difference between traditional aquaponics and reconstructed aquaponics systems propagating tomatoes in terms of height, weight, girth, and number of fruits. The reconstructed aquaponics system’s higher growth yield results in a much more nourished crop than the traditional aquaponics system. It is superior in its number of fruits, height, weight, and girth measurement. Moreover, the reconstructed aquaponics system is proven to eliminate all the hindrances present in the traditional aquaponics system, which are overcrowding of fish, algae growth, pest problems, contaminated water, and dead fish.
ACEP Magazine edition 4th launched on 05.06.2024Rahul
This document provides information about the third edition of the magazine "Sthapatya" published by the Association of Civil Engineers (Practicing) Aurangabad. It includes messages from current and past presidents of ACEP, memories and photos from past ACEP events, information on life time achievement awards given by ACEP, and a technical article on concrete maintenance, repairs and strengthening. The document highlights activities of ACEP and provides a technical educational article for members.
KuberTENes Birthday Bash Guadalajara - K8sGPT first impressionsVictor Morales
K8sGPT is a tool that analyzes and diagnoses Kubernetes clusters. This presentation was used to share the requirements and dependencies to deploy K8sGPT in a local environment.
Redefining brain tumor segmentation: a cutting-edge convolutional neural netw...IJECEIAES
Medical image analysis has witnessed significant advancements with deep learning techniques. In the domain of brain tumor segmentation, the ability to
precisely delineate tumor boundaries from magnetic resonance imaging (MRI)
scans holds profound implications for diagnosis. This study presents an ensemble convolutional neural network (CNN) with transfer learning, integrating
the state-of-the-art Deeplabv3+ architecture with the ResNet18 backbone. The
model is rigorously trained and evaluated, exhibiting remarkable performance
metrics, including an impressive global accuracy of 99.286%, a high-class accuracy of 82.191%, a mean intersection over union (IoU) of 79.900%, a weighted
IoU of 98.620%, and a Boundary F1 (BF) score of 83.303%. Notably, a detailed comparative analysis with existing methods showcases the superiority of
our proposed model. These findings underscore the model’s competence in precise brain tumor localization, underscoring its potential to revolutionize medical
image analysis and enhance healthcare outcomes. This research paves the way
for future exploration and optimization of advanced CNN models in medical
imaging, emphasizing addressing false positives and resource efficiency.
Using recycled concrete aggregates (RCA) for pavements is crucial to achieving sustainability. Implementing RCA for new pavement can minimize carbon footprint, conserve natural resources, reduce harmful emissions, and lower life cycle costs. Compared to natural aggregate (NA), RCA pavement has fewer comprehensive studies and sustainability assessments.
Study and Analysis of Software-Based Data Prefetching Schemes
Priyanka Goswami
Electrical and Computer Engineering
The University of Arizona
Tucson, USA
priyankag@email.arizona.edu
Shalaka Satam
Electrical and Computer Engineering
The University of Arizona
Tucson, USA
shalakasatam@email.arizona.edu
Abstract— For multicore processors, data prefetching is an effective technique to reduce latency caused by data access and cache misses. There are two main methods of prefetching – software and hardware prefetching. A third category is the hybrid prefetching scheme, which combines the software and hardware schemes and aims at combining the benefits of both techniques. For this project, we will perform a study of prefetching techniques and the features used for classifying different techniques. We will then study the most recent prefetching techniques developed for modern-day processors, classify them based on different criteria, and perform a qualitative and quantitative evaluation of their performance. We will also evaluate the performance of a compiler-based data prefetching scheme using the built-in prefetcher of the gcc compiler.
Keywords—prefetching; latency; memory; cache; processor;
multicore; software; data
I. INTRODUCTION
Prefetching is a technique used for transferring data from
the main memory to a temporary storage before it is required
by an application. It is used for reducing memory latencies by
overlapping computation with communication (data access).
An effective prefetcher should be able to accurately predict
the address of the data required in the future, thus helping in
reducing cache misses, besides improving the overall speed of
execution. Prefetching has become important for modern
computers mainly because of the exponential increase of
dataset sizes and the significant difference between DRAM
and processor speeds [1]. Prefetching has also become a key
component of Network on Chip (NoC) based multiprocessor
systems, where the effectiveness of the prefetching technique
directly translates to an improvement in overall system
performance. This is because in such a system the memory access latency depends on the network traffic as well as the distance between the processor requesting the data and the storage location [2].
Although prefetching is one of the most widely used methods to reduce memory access latency, many factors and design constraints must be considered for it to be effective. For example, in chip multiprocessors (CMPs), increasing lower-level cache sizes is not feasible, and hence it is important that the prefetching technique used is highly accurate, to prevent cache pollution [3]. Additionally, it is important that the prefetcher is lightweight and has low overhead.
Over the years, many different prefetching techniques have been designed that aim to reduce memory access latency and predict future data accesses with high accuracy, while being lightweight and having minimal overhead. The techniques have also evolved to adapt to different processor architectures. In this report, we first introduce the different criteria used for classifying prefetching techniques and introduce a new possible criterion for classifying prefetchers, based on recent developments in processor architecture. The following section focuses on classifying and analyzing recently developed software-based prefetching techniques, the performance improvements they achieve, and the type of processor architecture they are developed for. We also examine the advantages and disadvantages of hardware- and software-based prefetching techniques and how hybrid prefetching techniques have helped combine the best features of both. Then we study the effectiveness of software-based data prefetching by implementing compiler-based prefetching on loop-intensive code and analyzing the results.
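The compiler-based prefetching evaluated later in this report relies on GCC's __builtin_prefetch intrinsic. As a minimal illustration (the loop, array name, and prefetch distance below are our own, not taken from any benchmark), a programmer or compiler can insert prefetches into a loop like this:

```c
#include <stddef.h>

/* Sum an array while prefetching ahead, as a compiler (or programmer)
   might after identifying the loop. PREFETCH_DISTANCE is a tuning
   parameter: far enough ahead to hide memory latency, not so far that
   lines are evicted before use. */
#define PREFETCH_DISTANCE 16

long sum_with_prefetch(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* rw = 0 (read access), locality = 3 (keep in all levels) */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        sum += a[i];
    }
    return sum;
}
```

On targets without a prefetch instruction GCC lowers the builtin to nothing, so the transformation only affects performance, never correctness, even when the distance is mistuned.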
II. CLASSIFICATION OF PREFETCHERS
Over the years, many distinctive characteristics of prefetchers have been used to classify them. The most commonly used characteristics are [4] [5] [6]:
• Based on where the prefetching is implemented: This is the most popular and oldest criterion used for classifying prefetching techniques. The techniques are classified as:
1. Hardware based prefetcher – Prefetchers use information from current and previous cache accesses to fetch blocks of data. This technique generally requires additional hardware (like extra memory to store the history table) or modifications to the current hardware. It is generally used where the data access patterns are regular, which makes it easier for the prediction algorithm to correctly determine future data accesses.
2. Software based prefetcher – This is a compiler optimization technique, in which prefetch instructions are either inserted by the compiler by identifying loops in the code, or by the programmer. The time to prefetch depends on the loop execution time, to ensure that communication time does not exceed computation time. Software based prefetchers are preferred for irregular short streams of data, as seen in out-of-order processors. This also includes branch prediction algorithms.
3. Hybrid prefetcher – A combination of hardware and software based prefetching.
• Based on what is prefetched: This is applicable if separate
memory is used for instructions and data.
1. Instruction prefetching
2. Data prefetching
• Based on events that trigger prefetching: This can be event triggered (like a cache miss), software controlled, prediction based, or based on a look-ahead counter.
• Based on the source and destination of the data being prefetched: The source can be the main memory or a lower-level cache (like L1), and the destination is a higher-level cache (which can be private, shared, or exclusively for storing prefetched data).
• Based on component initializing the prefetching:
1. Push based in which the memory or server prefetches
data and sends it to the processor performing
computation.
2. Pull based in which the processor requests the
(prefetched) data from the memory or lower
level cache.
• Based on the technique used for identifying the data to be
retrieved: The technique can be classified as
1. Static prefetcher
2. Dynamic prefetcher
We introduce another possible classification scheme for prefetching algorithms, based on the type of processor or processor components a technique is suitable for, such as the number of cores a processor has (single-core, multi-core, or many-core [7]), network-based multiprocessor systems, chip multiprocessor systems, etc. We will use this scheme, in addition to the previously described characteristics, to classify the different software-based prefetching techniques analyzed in the following section. Additionally, prefetchers can also be classified based on the data they retrieve, i.e. temporal, spatial, or both.
Some of the main metrics used to analyze the performance of prefetchers are prefetch degree, accuracy, overhead, timeliness of prefetching, and the difference in application execution speed.
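To make the accuracy metric concrete, the following toy sketch (our own illustration; the next-block policy, table size, and function name are invented, not from any cited paper) simulates a simple next-block prefetcher over an address trace and reports the fraction of issued prefetches that were actually used by a later demand access:

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy trace-driven evaluation of a next-block prefetcher: after each
   demand access to block b, block b+1 is prefetched.
   accuracy = useful prefetches / issued prefetches. */
#define MAX_BLOCKS 1024

double next_block_accuracy(const int *trace, size_t n) {
    bool prefetched[MAX_BLOCKS] = { false };
    long issued = 0, useful = 0;
    for (size_t i = 0; i < n; i++) {
        int b = trace[i];
        if (prefetched[b]) {            /* demand hit on a prefetched block */
            useful++;
            prefetched[b] = false;
        }
        if (b + 1 < MAX_BLOCKS && !prefetched[b + 1]) {
            prefetched[b + 1] = true;   /* issue next-block prefetch */
            issued++;
        }
    }
    return issued ? (double)useful / issued : 0.0;
}
```

On a sequential trace this policy scores near-perfect accuracy; every irregular jump in the trace wastes one prefetch, which is exactly the behavior the accuracy metric is meant to expose.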
III. ANALYSIS AND CLASSIFICATION OF PREFETCHING
TECHNIQUES
In this section, we analyze some recently developed prefetching techniques for modern-day processors, such as NoC multiprocessors. Based on the analysis, we classify each technique using the above criteria and note the performance improvement achieved. A similar classification and review is given in [4], but in this report we focus only on data prefetching techniques, especially those that have not been covered in previously published surveys.
In [8], Jain and Lin (2013) introduce the Irregular Stream Buffer prefetcher. It uses an additional level of indirection to convert correlated addresses into consecutive addresses for irregular sequences of memory references. It identifies temporal streams of data by predicting the next memory access based on the current reference. The main components of the prefetcher are: the training unit, the address mapping caches, and the stream predictor.
• It prefetches data from the main memory (pull-based prefetcher) and uses a software-controlled, prediction-based algorithm. It is designed for single- and multi-core processors and is a dynamic prefetcher.
• The prefetcher is tested using the SPEC 2006 benchmarks
on Marss simulator and has obtained 23.1% speed-up and
93.7% accuracy.
• On-chip storage is 32KB with 8.4% traffic overhead.
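The indirection idea can be sketched in a few lines. The following is a deliberately simplified illustration of the concept (the table sizes, function names, and flat-array mapping are ours, not the ISB hardware design): training assigns consecutive "structural" addresses to a temporally correlated miss stream, so predicting the next physical address reduces to a structural +1 lookup followed by a reverse translation:

```c
#include <stddef.h>

/* Simplified ISB-style indirection: two small translation tables map a
   physical address to a structural address and back. Addresses in a
   correlated stream get consecutive structural addresses, making the
   irregular stream look sequential in structural space. */
#define TBL 256

static int phys_to_struct[TBL];   /* physical -> structural, -1 = unmapped */
static int struct_to_phys[TBL];   /* structural -> physical */
static int next_struct;           /* next free structural address */

void isb_init(void) {
    for (int i = 0; i < TBL; i++) phys_to_struct[i] = -1;
    next_struct = 0;
}

/* Training: assign consecutive structural addresses to a miss stream. */
void isb_train(const int *stream, size_t n) {
    for (size_t i = 0; i < n; i++) {
        phys_to_struct[stream[i]] = next_struct;
        struct_to_phys[next_struct] = stream[i];
        next_struct++;
    }
}

/* Predict the physical address expected after 'addr', or -1 if unknown. */
int isb_predict(int addr) {
    int s = phys_to_struct[addr];
    if (s < 0 || s + 1 >= next_struct) return -1;
    return struct_to_phys[s + 1];
}
```

The real design caches these translations near the TLB and synchronizes them with page mappings; this sketch only shows why the indirection turns temporal correlation into a trivially sequential lookup.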
In [9], Mao et al. (2014) describe two methods: request prioritization (RP) and hybrid local-global prefetch control (HLGPC). RP assigns priorities to different last-level cache (LLC) accesses (read requests, write-back requests, etc.) and uses a miss status holding register (MSHR) to track the elapsed time for servicing each request. HLGPC controls the aggressiveness (prefetch degree and distance) of the prefetcher using two metrics: the prefetch frequency of each core and the global access frequency of the LLC.
• It prefetches data from the main memory (pull-based prefetcher), stores it in the L1 cache, uses a software-controlled algorithm, and is dynamic in nature. It is designed for chip multiprocessors with multi-core systems and an STT-RAM LLC.
• The prefetcher is tested using SPEC 2000/2006 benchmarks with the Intel Sandy Bridge configuration in the MacSim simulator, and energy calculations are done using CACTI and Synopsys tools. The performance improvement increases with larger LLC sizes, with a maximum improvement of 9.1%, while the energy saving decreases, with a minimum improvement of 5.6% (for an 8 MB LLC).
• The prefetcher requires additional registers (128 7-bit MSHR entries and a 20-entry write buffer).
In [2], Cireno et al. (2016) designed a prefetching system consisting of a client (responsible for handling prefetch requests and loading data into the local cache) and a server (responsible for tracking time, notifying the client, and maintaining the directory). The prefetching scheme initiates when there is a cache miss, and a time-series-based prediction is then used to determine when the prefetch request is to be serviced. In case of a miss, the register is updated and the prefetching parameters are adjusted. For timed prefetching, a separate time estimate is maintained for each client.
• It is a pull-based prefetcher, as the client gets the data from the main memory into the higher-level caches (L1 and L2), and it uses a combination of event-triggered and prediction-based prefetching. It is developed for NoC-based multi-core processors and is dynamic in nature. The prefetcher is used to fetch temporal data.
• The prefetcher is evaluated using the Extended SPARC V8 architecture, simulated on the Infinity Platform. For a 16-core system, the performance improvement is 6.25%. The metrics used for evaluation are processor penalty, miss rate, and network transactions.
In [10], Bakhshalipour et al. (2017) propose a prefetching technique called Domino, which uses the two previous miss addresses from a global history table to determine the next address to be prefetched. There are two Miss History Tables (MHTs) per core: MHT-1 stores the first miss (1 tag + prediction field + valid bit), and MHT-2 stores two consecutive misses (2 tags + prediction field + identifier). A single predicted address is determined by XORing the addresses from MHT-2. A prefetch request is sent if there is a hit in either MHT-1 or MHT-2.
• The technique prefetches temporal data from the main memory directly into the L1 data cache (pull-based prefetcher) and is intended for multi-core server processors. The prefetcher uses an event trigger (cache miss) to initiate and is dynamic in nature.
• The Domino prefetcher is evaluated using the Flexus simulator with a 16-core processor (UltraSPARC III), an 8 MB LLC, and a 32 KB L1 data cache, using benchmarks from CloudSuite such as MapReduce.
• The Domino prefetcher improves system performance by 26% (more for certain benchmarks) and performs better than [8] (which also prefetches temporal data but uses PC and address correlation) for server workloads. However, the evaluation assumes that infinite space is available to store the MHT-1/2 tables, which is not possible in real-world systems.
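The core lookup of this pair-based correlation can be sketched as follows (a simplified illustration with an invented hash function and a flat table, not the paper's tag-matched per-core MHT organization): the pair of the two most recent misses indexes a table that records the miss observed to follow them:

```c
/* Simplified Domino-style correlation table: keyed on the pair of the
   two most recent miss addresses, it records the miss that followed
   them; on a new miss, the pair (prev, cur) yields a prefetch candidate. */
#define DOMINO_TBL 4096

static int next_after[DOMINO_TBL];   /* predicted next miss, -1 = empty */

static unsigned pair_hash(int a, int b) {
    return ((unsigned)a * 31u ^ (unsigned)b) % DOMINO_TBL;
}

void domino_init(void) {
    for (int i = 0; i < DOMINO_TBL; i++) next_after[i] = -1;
}

/* Training: record that miss 'next' followed the pair (prev, cur). */
void domino_train(int prev, int cur, int next) {
    next_after[pair_hash(prev, cur)] = next;
}

/* Predict the miss expected after (prev, cur), or -1 if untrained. */
int domino_predict(int prev, int cur) {
    return next_after[pair_hash(prev, cur)];
}
```

Keying on a pair rather than a single address is what distinguishes Domino from single-miss correlation: two consecutive misses disambiguate streams that happen to share one address.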
In [11], Fuchs et al. (2014) use code block working sets (CBWS), which provide the address trace of ordered lines accessed by loops; the prefetcher uses this data to fetch the entire data block required for a loop iteration. The proposed prefetcher is implemented as an addition to the spatial memory streaming (SMS) prefetcher. BLOCK_BEGIN and BLOCK_END (instructions added to the ISA) are used to determine block boundaries. To avoid timing constraints in prediction, the prefetcher stores a history of depth k, which enables predicting the required blocks farther ahead. Although developed for in-order pipelines, the prefetcher can also be used for out-of-order pipelines by fetching the address during the commit stage.
• The prefetcher is used to fetch spatial data from the main memory to the L2 cache (pull-based prefetcher) and is software controlled, using prediction to prefetch data. It is dynamic and developed for multi-core processors. Since it uses additional hardware space to store the history table, this prefetcher can be classified as a hybrid prefetcher.
• The CBWS prefetcher is implemented as an add-on to the SMS prefetcher, and performance evaluation is done in the GEM5 simulator using SPEC2006 benchmarks. The performance is compared with using only the SMS prefetcher. The metrics used for evaluation are cache misses, accuracy, timely prefetches, and overall speed-up.
• The CBWS + SMS prefetcher shows fewer cache misses for the majority of benchmarks, compared to the stride, GHB, and SMS prefetchers. Also, compared to SMS alone, CBWS + SMS shows an improvement in timely accesses, though only for memory-intensive benchmarks. Improvement is also seen in accuracy, while the performance speed-up is 1.16× over the SMS prefetcher across all benchmarks.
• This prefetcher uses less than 1 KB of storage, but additional space is needed to store the differential history table (DHT). For evaluation, a DHT of size 16 is used.
In [3], Kadjo et al. propose a prefetcher that performs control-flow speculation and effective-address value speculation to accurately predict future memory references. The B-Fetch design depends on the expected path through future basic blocks and the effective addresses of upcoming load instructions. First, the future execution path is predicted by the branch predictor. B-Fetch then analyzes the variation in register contents caused by earlier branch instructions, and this information is used to predict the effective address. The Memory History Table (MHT) stores the source, current, and offset values. Using the variation observed in register values over the history of effective addresses enables correct prefetching even for instructions that display irregular control flow and data access patterns.
• The technique uses a software-controlled hybrid prefetcher that is dynamic in nature. It is pull based, moving data from the memory to the LLC.
• The prefetcher is tested using SPEC CPU2006 benchmarks in the GEM5 simulator, and the results obtained by the prefetcher are compared with the Stride and SMS prefetchers. For evaluation, single- and multi-threaded loads are used.
• The B-Fetch prefetcher achieves a mean speed-up of
23.0% over the set baseline with a 12.94KB storage budget.
However, additional hardware in the form of memory is
required to store the MHT.
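To make the mechanism concrete, the following is a minimal, hypothetical C sketch of B-Fetch-style address speculation: a table keyed by load PC learns the offset between a source register and the load's effective address, and a prediction adds that offset to the speculated register value along the branch-predicted path. The table layout, size and field names are illustrative assumptions, not the structures from the paper.

```c
#include <stdint.h>

/* Hypothetical sketch of B-Fetch-style effective address speculation.
 * The table is indexed by load PC and records the offset between the
 * load's source register and its effective address; names and the
 * table size are illustrative, not taken from the paper. */
#define MHT_SIZE 64

struct mht_entry {
    uint64_t load_pc;   /* PC of the load instruction          */
    int64_t  offset;    /* effective address - source register */
    int      valid;
};

static struct mht_entry mht[MHT_SIZE];

/* Training: on an observed load, learn the register-to-address offset. */
void mht_train(uint64_t load_pc, uint64_t src_reg, uint64_t eff_addr)
{
    struct mht_entry *e = &mht[load_pc % MHT_SIZE];
    e->load_pc = load_pc;
    e->offset  = (int64_t)(eff_addr - src_reg);
    e->valid   = 1;
}

/* Prediction: for a load on the branch-predicted path, combine the
 * speculated source register value with the learned offset.
 * Returns 1 and fills *pred_addr on a table hit, 0 otherwise. */
int mht_predict(uint64_t load_pc, uint64_t spec_reg, uint64_t *pred_addr)
{
    struct mht_entry *e = &mht[load_pc % MHT_SIZE];
    if (!e->valid || e->load_pc != load_pc)
        return 0;
    *pred_addr = (uint64_t)((int64_t)spec_reg + e->offset);
    return 1;
}
```

For example, a load trained with register value 0x1000 and address 0x1010 (offset 0x10), later reached with a speculated register value of 0x1040, yields a predicted prefetch address of 0x1050.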
In [14], Aziz et al. propose a method to control prefetching
aggressiveness for network-based multiprocessors. The
controller minimizes the processor penalty by adjusting
prefetching aggressiveness. The Hill Climbing approach is
used to reduce the penalty. This method has five steps. First,
the read-miss transactions that arrive at the directory are
captured. Next, address prediction estimates the next address
the processor will demand, and aggressiveness prediction
determines the number of cache blocks that need to be sent to
the private cache. This is followed by building the network
packet containing all the predicted block addresses, and
finally the read request for the predicted addresses is made to
the private cache.
The prefetcher performs transactions on the first level of
cache. This helps in avoiding network coherence delays.
• The prefetcher uses a software-controlled algorithm and
can operate in static or dynamic mode (with respect to
the prefetching degree). It is pull based
and uses a private type of cache for storing the retrieved
data.
• The prefetcher is tested using PARSEC and SPLASH – 2
benchmarks. Additionally, benchmarks from MiBench
have also been used. The testing is done using real
systems - four Extended Sparc V8 ArchC with L1 and L2
cache.
• The prefetcher can reduce the penalty by 7% and achieves
an increase of 24% in prefetching accuracy for a fixed
prefetching degree.
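The hill-climbing step can be sketched in a few lines of C. This is a hypothetical illustration of the idea, not the controller from the paper: the prefetching degree is nudged up or down each sampling interval, and the direction reverses whenever the measured penalty worsens. All names, bounds and thresholds are assumptions.

```c
/* Hypothetical hill-climbing controller: adjust the prefetching degree
 * (blocks pushed per read miss) to minimize the measured penalty.
 * Bounds, the initial degree and the step size are illustrative. */
#define MIN_DEGREE 1
#define MAX_DEGREE 8

static int degree = 2;           /* current prefetching degree       */
static int step = 1;             /* direction of the last adjustment */
static double last_penalty = 1e9;

/* Called once per sampling interval with the measured penalty
 * (e.g. stall cycles attributable to read misses); returns the
 * degree to use for the next interval. */
int adjust_degree(double penalty)
{
    if (penalty > last_penalty)
        step = -step;            /* last move hurt: reverse direction */
    degree += step;
    if (degree < MIN_DEGREE) degree = MIN_DEGREE;
    if (degree > MAX_DEGREE) degree = MAX_DEGREE;
    last_penalty = penalty;
    return degree;
}
```

With successive penalties of 100, 80 and 90, the controller climbs from 2 to 3, then to 4, then backs off to 3 when the penalty rises.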
In [12], Garside and Audsley propose a stream-based
prefetcher consisting of a Prefetch Unit (PU). The PU includes
a stream buffer, prefetch buffer and squash buffer. It gathers
information about the data access trends by snooping on all
transactions made to the main memory. For each CPU, a
separate shared memory tile is present to access its memory.
The PU gets the memory request and if it is present in the
prefetch buffer, the requested address is added to the squash
buffer, else it is considered as a miss and the stream buffer is
updated.
• The prefetcher is developed for NoC multiprocessors, and
is event-triggered and dynamic in type.
• The PU is implemented within 4x4 Bluetiles NoC on
Xilinx Virtex-7 FPGA, with external memory and each
CPU has connection to the shared memory tree. The
CPUs are configured to run at 50MHz.
• It is observed that large prefetch distances yield better
results if the memory load increases. Improvement is also
seen in timely prefetching and accuracy of fetches.
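The PU lookup described above can be sketched as follows. This is a hypothetical C illustration under assumed buffer sizes and a fully-associative search: a request that hits the prefetch buffer is recorded in the squash buffer (so the in-flight prefetch is not issued again), while a miss retrains the stream buffer's next expected address.

```c
#include <stdint.h>

/* Hypothetical sketch of the PU lookup: prefetch-buffer hits go to the
 * squash buffer; misses retrain the stream buffer. Buffer sizes, the
 * linear search and the block size are illustrative assumptions. */
#define BUF_SIZE 16
#define BLOCK    64u

static uint32_t prefetch_buf[BUF_SIZE], squash_buf[BUF_SIZE];
static int pf_len, sq_len;
static uint32_t stream_next;   /* next address the stream buffer expects */

static int in_buf(const uint32_t *buf, int len, uint32_t addr)
{
    for (int i = 0; i < len; i++)
        if (buf[i] == addr) return 1;
    return 0;
}

/* Handle one memory request; returns 1 on a prefetch-buffer hit. */
int pu_lookup(uint32_t addr)
{
    if (in_buf(prefetch_buf, pf_len, addr)) {
        if (sq_len < BUF_SIZE)
            squash_buf[sq_len++] = addr;  /* suppress redundant fetch */
        return 1;
    }
    stream_next = addr + BLOCK;           /* miss: retrain the stream */
    return 0;
}
```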
In [1], Khan et al. propose the use of low overhead runtime
sampling and fast cache modeling for prefetching. The
prefetcher improves the single thread performance and it also
minimizes off-chip traffic and off-chip bandwidth
consumption. The prefetcher samples the memory instructions
in a random manner. The data cache blocks accessed by the
sampled instructions are monitored to detect reuse. If a block
is reused, a stride sample is recorded: the difference between
the current and the previous memory addresses accessed by
the instruction. The reuse
samples are used to form the per-instruction cache
performance model. The stride samples are analyzed to find
the appropriate prefetch distance. Then the prefetch instruction
is scheduled for the load.
• This prefetching technique uses software based algorithm
and is dynamic in nature.
• The technique was evaluated on the SPEC CPU2006
benchmarks. With a 512kB L2 cache, it covered 94% of
misses on average.
• This prefetching method maintains minimum off-chip
traffic and it avoids LLC pollution. It also lowers off-chip
bandwidth demand. The results showed that the
multicores achieved higher throughput when resource
efficient prefetching is used.
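The stride-sampling step can be sketched in C. This is a hypothetical illustration of the sampling idea, not the paper's implementation: a stride is the difference between consecutive effective addresses of a sampled instruction, and it is treated as stable once it repeats; the structure and names are assumptions.

```c
#include <stdint.h>

/* Hypothetical per-instruction stride sampling: record the difference
 * between consecutive effective addresses of a sampled load and accept
 * the stride once it repeats. Names are illustrative. */
struct stride_sample {
    uint64_t last_addr;
    int64_t  last_stride;
    int      confirmed;   /* stride observed at least twice in a row */
    int      seen;        /* at least one prior access recorded      */
};

/* Feed one sampled access of the monitored instruction. */
void record_access(struct stride_sample *s, uint64_t addr)
{
    if (s->seen) {
        int64_t stride = (int64_t)(addr - s->last_addr);
        s->confirmed = (stride == s->last_stride);
        s->last_stride = stride;
    }
    s->last_addr = addr;
    s->seen = 1;
}

/* Once the stride is stable, prefetch `distance` iterations ahead. */
uint64_t prefetch_target(const struct stride_sample *s, int distance)
{
    return s->last_addr + (uint64_t)(s->last_stride * distance);
}
```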
From the above study and analysis, most of the recent
works in prefetching have focused on developing lightweight
prefetchers, especially for NoC multiprocessors. A common
feature of all the NoC based prefetchers is storing the data
directly from the main memory to the L1 cache, instead of
having a dedicated prefetch cache. Additionally, it is seen that
all prefetchers using prediction for determining the future
accesses need additional hardware (dedicated memory space)
for storing the history and hence are better classified under
hybrid prefetchers rather than simple software based
prefetchers. Also, the features used for carrying out
predictions and the amount of history stored varies for
different techniques.
IV. SOFTWARE BASED DATA PREFETCHING
Microprocessors implement software prefetching through
fetch instructions. Fetch instructions do not block on memory
operations and therefore require a lockup-free cache, which
lets the processor bypass outstanding memory operations
while prefetching is in progress. Software-initiated
prefetching requires minimal hardware compared to the other
prefetching techniques; its complexity lies in the placement of
the fetch instructions within the target application.
Most software-based prefetching techniques are generally
used along with a standard technique such as the SMS or
GHB prefetcher; in [11], for example, the CBWS is used as an
additional component of the SMS prefetcher to improve
overall performance.
In the project, we have used the built-in prefetch command
available in the gcc compiler and applied it in loop-intensive
code to prefetch the data that will be required in the next
iteration. This is a simple and efficient approach that can be
used for applications related to image processing, matrix
operations, etc. For loops that access data in strides, or that
contain additional computation within the loop body so that
prefetching must be done several iterations ahead rather than
just one iteration ahead, a prefetch distance must be
determined. This distance is a function of the latency caused
by cache misses and the cycle time of the shortest path
through one iteration of the loop [6].
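Concretely, with a miss latency of l cycles and a shortest loop-iteration time of s cycles, the distance rule of [6] is d = ceil(l / s). A small C sketch, with an illustrative reduction loop using gcc's __builtin_prefetch (the loop and its parameters are our own example, not taken from [6]):

```c
/* Prefetch distance per [6]: with miss latency l cycles and a shortest
 * loop-iteration time of s cycles, data must be requested
 * d = ceil(l / s) iterations ahead to arrive before it is used. */
int prefetch_distance(int miss_latency, int iter_cycles)
{
    return (miss_latency + iter_cycles - 1) / iter_cycles;  /* ceil(l/s) */
}

/* Illustrative use of the distance d in a simple reduction loop. */
double sum_with_prefetch(const double *a, int n, int d)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + d < n)                        /* keep the hint in bounds */
            __builtin_prefetch(&a[i + d], 0, 1);
        sum += a[i];
    }
    return sum;
}
```

For instance, a 100-cycle miss latency over a 25-cycle iteration gives a distance of 4 iterations.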
V. EVALUATION AND RESULTS
For evaluating the built-in compiler-based prefetcher, we
initially developed a C++ program to compute the transpose
of a matrix and used _mm_prefetch(char *p, int k) to prefetch
the data before the next loop iteration, where p is the address
of the data to be prefetched and k specifies the type of
prefetching to be done.
Figure 1. Types of prefetches using _mm_prefetch() [16].
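A minimal sketch of such a transpose kernel is shown below; the matrix size and the _MM_HINT_T1 locality hint are illustrative choices, and the code we actually evaluated may differ in detail.

```c
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_* (x86 SSE) */

#define N 64             /* illustrative matrix size */

/* Matrix transpose with the source element needed one iteration ahead
 * prefetched via _mm_prefetch; _MM_HINT_T1 (prefetch into L2) is an
 * illustrative choice of hint. */
void transpose(float in[N][N], float out[N][N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            if (j + 1 < N)
                _mm_prefetch((const char *)&in[i][j + 1], _MM_HINT_T1);
            out[j][i] = in[i][j];
        }
    }
}
```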
Another prefetching function is __builtin_prefetch(&i, j, k),
where &i is the address to be fetched, and j and k define the
type of prefetching (read/write access and the degree of
temporal locality). This was used in loop-intensive MiBench
benchmarks such as fft, which also contains many nested
loops. The figure below shows part of the fft loop with the
prefetching commands.
#define DO_PREFETCH
....
for (i = 0; i < MAXSIZE; i++) {
    RealIn[i] = 0;
    ImagIn[i] = 0;
    for (j = 0; j < MAXWAVES; j++) {
        if (rand() % 2)
            RealIn[i] += coeff[j] * cos(amp[j] * i);
        else
            RealIn[i] += coeff[j] * sin(amp[j] * i);
#ifdef DO_PREFETCH
        __builtin_prefetch(&coeff[j + 1], 0, 1);
        __builtin_prefetch(&amp[j + 1], 0, 1);
#endif
    }
}
Figure 2. Loop of the fft (main) function with prefetching.
For comparison, the prefetch instruction was also applied to
the Dijkstra benchmark, which has just one main loop.
Running the code with the gcc compiler on our system, the
maximum speed-up for fft was about 15%, and for matrix
transpose a speed-up of 5% was achieved. In comparison,
Dijkstra showed a negligible performance difference. It has
been observed that compiler-based prefetching, combined
with other optimization techniques such as loop unrolling, can
show significant improvement in performance.
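As one illustration of that combination, the following is a hypothetical four-way unrolled loop with __builtin_prefetch; the unroll factor and the prefetch distance of 16 elements are arbitrary choices for the sketch, and n is assumed to be a multiple of 4.

```c
/* Four-way unrolled scaling loop combined with __builtin_prefetch.
 * The unroll factor and the prefetch distance (16 elements) are
 * illustrative; n is assumed to be a multiple of 4. */
void scale_unrolled(float *a, int n, float k)
{
    for (int i = 0; i < n; i += 4) {
        if (i + 16 < n)                          /* keep the hint in bounds */
            __builtin_prefetch(&a[i + 16], 1, 1);  /* 1 = prefetch for write */
        a[i]     *= k;
        a[i + 1] *= k;
        a[i + 2] *= k;
        a[i + 3] *= k;
    }
}
```

Unrolling amortizes the loop overhead and leaves one prefetch hint per four elements instead of one per element.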
VI. CONCLUSION
Data prefetching is one of the most commonly used
techniques in processors for reducing memory access latency
and cache misses, achieving higher accuracy and improving
the overall performance of the system. While hardware-based
prefetching schemes were suitable for in-order processors
with predictable memory accesses, with the use of out-of-
order execution, software-based prefetching techniques have
shown better performance. With the development of modern-
day processors such as NoC multiprocessors and chip
multiprocessors, combinations of hardware and software
prefetching schemes are being used, and this class of
prefetchers is called hybrid prefetchers. In this project, we
have reviewed various prefetching techniques that have been
developed for modern-day processors. We have classified
each technique based on a set of features, which makes it
easier to identify the type of application and processor it is
suited for. We have also listed the evaluation methodology,
assumptions and the performance improvement achieved by
each prefetching technique. Finally, we have implemented a
simple compiler-based software data prefetching technique
and evaluated the results.
REFERENCES
[1] M. Khan, A. Sandberg and E. Hagersten, "A Case for
Resource Efficient Prefetching in Multicores," 2014 43rd
International Conference on Parallel Processing, Minneapolis
MN, 2014, pp. 101-110. doi: 10.1109/ICPP.2014.19
[2] M. Cireno, A. Aziz and E. Barros, "Temporized data
prefetching algorithm for NoC-based multiprocessor systems,"
2016 IEEE 27th International Conference on Application-
specific Systems, Architectures and Processors (ASAP),
London, 2016, pp. 235-236. doi: 10.1109/ASAP.2016.7760805
[3] D. Kadjo, J. Kim, P. Sharma, R. Panda, P. Gratz and D.
Jimenez, "B-Fetch: Branch Prediction Directed Prefetching for
Chip-Multiprocessors," 2014 47th Annual IEEE/ACM
International Symposium on Microarchitecture, Cambridge,
2014, pp. 623-634. doi: 10.1109/MICRO.2014.29
[4] S. Byna, Y. Chen and X. H. Sun, "A Taxonomy of Data
Prefetching Mechanisms," 2008 International Symposium on
Parallel Architectures, Algorithms, and Networks (i-span
2008), Sydney, NSW, 2008, pp. 19-24. doi:
10.1109/I-SPAN.2008.24
[5] S. Mittal, "A Survey of Recent Prefetching Techniques for
Processor Caches," ACM Computing Surveys, 2016.
[6] S. VanderWiel and D. J. Lilja, "A Survey of Data
Prefetching Techniques," Technical Report No. HPPC-96-05,
CiteSeerX: 10.1.1.2.4449 [online]: http://citeseerx.ist.psu.edu/
[7]“Manycore -vs- Multicore”, [online]:
https://goparallel.sourceforge.net/ask-james-reinders-
multicore-vs-manycore/
[8] A. Jain and C. Lin, "Linearizing irregular memory
accesses for improved correlated prefetching," in
International Symposium on Microarchitecture, December
2013, pp. 247-259. doi: 10.1145/2540708.2540730
[9] J. Li, C. J. Xue and Yinlong Xu, "STT-RAM based
energy-efficiency hybrid cache for CMPs," 2011 IEEE/IFIP
19th International Conference on VLSI and System-on-Chip,
Hong Kong, 2011, pp. 31-36. doi:
10.1109/VLSISoC.2011.6081626
[10] M. Bakhshalipour, P. Lotfi-Kamran and H. Sarbazi-Azad,
"An Efficient Temporal Data Prefetcher for L1 Caches," IEEE
Computer Architecture Letters, vol. PP, no. 99, pp. 1-1. doi:
10.1109/LCA.2017.2654347
[11] A. Fuchs, S. Mannor, U. Weiser and Y. Etsion, "Loop-
Aware Memory Prefetching Using Code Block Working
Sets," 2014 47th Annual IEEE/ACM International Symposium
on Microarchitecture, Cambridge, 2014, pp. 533-544. doi:
10.1109/MICRO.2014.27
[12] J. Garside and N. C. Audsley, "Prefetching across a
shared memory tree within a Network-on-Chip architecture,"
2013 International Symposium on System on Chip (SoC),
Tampere, 2013, pp. 1-4. doi: 10.1109/ISSoC.2013.6675268
[13] D. Kadjo, J. Kim, P. Sharma, R. Panda, P. Gratz and D.
Jimenez, "B-Fetch: Branch Prediction Directed Prefetching for
Chip-Multiprocessors," 2014 47th Annual IEEE/ACM
International Symposium on Microarchitecture, Cambridge,
2014, pp. 623-634. doi: 10.1109/MICRO.2014.29
[14] A. Aziz, M. Cireno, E. Barros and B. Prado, "Balanced
prefetching aggressiveness controller for NoC-based
multiprocessor," 2014 27th Symposium on Integrated Circuits
and Systems Design (SBCCI), Aracaju, 2014, pp. 1-7. doi:
10.1145/2660540.2660541
[15] M. Khan, A. Sandberg and E. Hagersten, "A Case for
Resource Efficient Prefetching in Multicores," 2014 43rd
International Conference on Parallel Processing, Minneapolis
MN, 2014, pp. 101-110. doi: 10.1109/ICPP.2014.19
[16] Lee, J., Kim, H., and Vuduc, R. 2012. When prefetching
works, when it doesn’t, and why. ACM Trans. Archit. Code
Optim. 9, 1, Article 2 (March 2012), 29 pages
http://doi.acm.org/10.1145/2133382.2133384