The main aim of our research is to find the limit of Amdahl's Law for multicore processors, i.e., the number of cores beyond which adding more cores yields no further efficiency in the overall architecture of the CMP (Chip Multiprocessor, a.k.a. multicore processor). We expect this limit to lie either in the architecture of the multicore processor or in the programming. We surveyed the architectures of the multicore processors of various chip manufacturers, namely INTEL™, AMD™, IBM™, etc., and the various techniques they follow to improve the performance of multicore processors. We conducted cluster experiments to find this limit. In this paper we propose an alternate design of a multicore processor based on the results of our cluster experiments.
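As a point of reference for this limit, Amdahl's Law bounds the speedup of a program whose parallelizable fraction is p when run on n cores. The following minimal sketch illustrates the diminishing returns; the value of p is assumed for illustration and is not taken from our cluster experiments.

# Minimal sketch: the speedup ceiling predicted by Amdahl's Law.
# The parallel fraction p is an illustrative value, not a measurement
# from the cluster experiments reported in this paper.

def amdahl_speedup(p: float, n: int) -> float:
    """Speedup on n cores of a program whose parallelizable fraction is p."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.95
for n in (1, 2, 4, 8, 16, 64, 1024):
    print(f"{n:5d} cores -> speedup {amdahl_speedup(p, n):6.2f}")

# As n grows the speedup approaches 1 / (1 - p) = 20 for p = 0.95, so
# beyond some core count additional cores add almost no efficiency.

The sketch shows why simply increasing the core count cannot improve performance indefinitely: the sequential fraction of the program caps the achievable speedup.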
Figure 1 INTEL™ Multicore Architecture
In this paper we present a detailed discussion of the various factors affecting the performance of a multicore processor. The next section (Background) describes the various techniques followed for improving the performance of a CMP. In Section 3 we detail the various factors affecting a CMP. In Section 4 we discuss the problems in a CMP along with an analysis of our cluster experiment results. In Section 5 we conclude with a proposal for a multicore design with better performance.
Figure 2 AMD™ Multicore Architecture
2. BACKGROUND: SPEED-UP IN MULTICORE PROCESSORS
In the present-day scenario, processor performance is of the highest interest to end users, and all the chip manufacturers are coming out with multicore processors, also known as Chip Multiprocessors (CMPs), which we call the next generation of processors. These multicore processors are the present state-of-the-art, and all the chip manufacturers concentrate on increasing the number of cores on the die year after year, i.e., dual core, quad core, octa core, etc. At the same time, techniques for the efficient use of these cores need to be exploited to achieve more parallelisation, i.e., to gain speed-up. To gain speed-up on these multicore processors, several techniques have been identified by many researchers; they are listed hereunder:
1. Cascaded Execution,
2. Execution Migration,
3. Speculative Pre-computation,
4. Dynamic Speculative Pre-computation,
5. Dynamic Prefetching Thread,
6. Core Spilling.
2.1. Cascaded Execution
Due to the presence of inherently sequential code, and because of certain limitations of parallelising compilers, many application programs achieve poor speed-up on SMP processors. Cascaded execution is a technique for SMPs in which a non-parallel loop is distributed across multiple processors, but only a single processor at a time executes the loop body.
In cascaded execution the processors that are idle execute a helper thread to optimise their memory state [3]. In the helper phase a processor predicts and loads into shared memory the data that will be needed in the immediate future. All the processors alternate between helper and execution phases, but only one processor is in the execution phase of the non-parallel loop body while all the others are in their helper phases.
Figure 3 Cascaded Execution
In general a helper phase is responsible either for Prefetching or for sequential buffer data
restructuring.
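The following is a minimal sketch in Python (a hypothetical workload, not the mechanism of [3]) of the cascaded-execution idea: only one worker at a time is in the execution phase of the sequential loop, while the next worker runs its helper phase, warming a software "cache" with the data it will need soon.

    import threading

    DATA = list(range(10_000))      # backing store (stands in for slow memory)
    warmed = {}                     # shared memory populated by helper phases

    def helper_phase(lo, hi):
        # Predict-and-load: fetch and restructure the chunk that will be
        # executed next, so the execution phase finds it ready.
        for i in range(lo, hi):
            warmed[i] = DATA[i] * 2

    def run_cascaded(workers=4):
        chunk = len(DATA) // workers
        total = 0
        for w in range(workers):
            lo, hi = w * chunk, (w + 1) * chunk
            # The next worker is in its helper phase while this worker
            # executes the sequential loop body.
            nxt = threading.Thread(target=helper_phase,
                                   args=(hi, min(hi + chunk, len(DATA))))
            nxt.start()
            for i in range(lo, hi):            # execution phase: sequential
                total += warmed.get(i, DATA[i] * 2)
            nxt.join()
        return total

    print(run_cascaded())   # same result as the plain sequential loop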
2.2. Execution Migration
In this technique a sequential program migrates automatically from one core to another during execution. This is made possible by exploiting the different levels of cache available on a Multicore processor: migration brings a large amount of cache capacity close to the executing core and hence reduces the average memory access latency. When a single-threaded process (or a sequential task) executes on a Multicore processor, only one core is used and the cache capacity of the other cores is wasted.
The execution migration technique uses a characteristic called the split-ability of the working sets to identify the cache misses that warrant migration, using the affinity algorithm [4].
Execution migration from one core to another is based on the principle proposed in [5]: "If we cannot bring the data to the processor as fast as we would like, we could instead take the processor to the data."
In general, a working set that does not fit in a single L2 cache may still fit in the combined L2 capacity, so the single sequential task is assigned to the whole processor and its execution is migrated with some frequency under a random access policy. For better results, [6] suggests circular accesses, where the migration frequency will be around ½.
The performance loss due to this migration on a Multicore with private L1 and a shared L2 can be minimised if: (a) a migrating thread continues its execution on a core that was previously visited by the thread, and (b) the cores remember their predictor state from the previous activation. Dean M. Tullsen, Susan J. Eggers and others in [6] presented an architecture for simultaneous multithreading, a.k.a. hyper-threading, that is best suited for (a) minimising changes to the conventional superscalar design and (b) providing higher throughput with multithreaded execution, while also remaining well suited to single-threaded execution (when executing alone).
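A minimal sketch in Python (toy LRU caches and a hypothetical circular access pattern, not the affinity algorithm of [4]) of why migration helps: a working set that overflows one core's L2 fits in the combined L2 of all cores, so moving the thread to the core that holds the needed slice turns steady-state misses into hits.

    L2_LINES_PER_CORE = 4                    # tiny, for illustration
    CORES = 4
    working_set = list(range(12))            # 12 lines: > 4, but <= 4 * 4

    def access(cache, line):
        """LRU access on one core's private L2; returns 1 on a miss."""
        if line in cache:
            cache.remove(line); cache.append(line)
            return 0
        cache.append(line)
        if len(cache) > L2_LINES_PER_CORE:
            cache.pop(0)
        return 1

    def misses_without_migration(passes=10):
        cache, misses = [], 0
        for _ in range(passes):
            for line in working_set:         # circular access thrashes LRU
                misses += access(cache, line)
        return misses

    def misses_with_migration(passes=10):
        caches, misses = [[] for _ in range(CORES)], 0
        for _ in range(passes):
            for line in working_set:
                core = line * CORES // len(working_set)   # migrate to the
                misses += access(caches[core], line)      # slice's owner
        return misses

    print(misses_without_migration(), misses_with_migration())  # 120 vs 12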
2.3. Speculative Pre-computation
Processor frequencies keep increasing relative to memory access times, and hence memory latency dominates processor performance. One way to overcome this is the technique of SMT, but it does not improve the performance of a sequential program (single-threaded execution). H. Wang, J. P. Shen et al. propose a technique called Speculative Pre-computation (SP) on an SMT [7] or a CMP [8] processor, which uses idle hardware thread contexts to execute speculative threads. SP is a special prefetch mechanism that targets load instructions that are difficult to prefetch, such as chains of loads. SP concentrates on delinquent loads, as they cause more than 80% of L1 cache misses. The addresses accessed by the delinquent loads are computed by a set of dependent instructions known as pre-computation slices (p-slices). These p-slices are executed by speculative threads: when a speculative thread is spawned, it pre-computes the address accessed by a future delinquent load and prefetches the data. This speculative thread spawning happens in two ways:
• Basic trigger: a speculative thread is spawned when a designated instruction in the non-speculative thread reaches the commit stage.
• Chaining trigger: a speculative thread explicitly spawns another speculative thread.
The process of SP requires the following tasks:
1. Identification of the delinquent loads: these are identified by memory access profiling, performed either by the compiler or by a memory access simulator. The load instructions with the greatest impact on performance are selected as delinquent loads (mainly using L1 misses; sometimes L2, L3 or memory misses may also be considered).
2. Construction of pre-computation slices (p-slices): when a delinquent load executes, an instruction that executed roughly 128 instructions earlier is identified as a candidate basic trigger. If the delinquent load executes every time this instruction executes, the trigger is confirmed and the p-slice is constructed ending at the load instruction. The constructed p-slices are then optimised by eliminating redundant and useless triggers.
3. Establishment of the triggers (basic and chaining): for this, the p-slices are added to the object code.
Because the prefetching threads run independently of the main thread, yet the addresses are calculated using code extracted from the main thread, SP is more efficient than regular prefetching techniques. SP spawns the speculative thread either at the rename stage of the trigger or at the commit stage. In both cases hardware support is needed to copy the live-in values from the main thread's context to the child thread's context. In simulation, higher performance is possible with more thread contexts, as they allow more speculation; in a realistic environment, however, performance can degrade due to the overhead of speculation when using basic triggers, since each speculative thread occupies a thread context for a long period and may stall the main thread. These problems are overcome by chaining triggers.
A p-slice for a chaining trigger has three parts:
1. a prologue,
2. a spawn instruction for spawning another copy of the p-slice, and
3. an epilogue.
The prologue contains the loop induction variables, and the epilogue contains the actual instructions that produce the address of the delinquent load. Spawning a thread via a chaining trigger incurs less overhead, as the trigger requires no action from the main thread: the speculative thread directly stores the live-in values into the live-in buffer and spawns the child thread. On a processor with few thread contexts it is better to keep a pending slice queue (PSQ) that holds pending p-slices whenever all thread contexts are busy and a chaining trigger spawns speculative threads.
Because the chaining trigger spawns child threads independently, it must be verified that:
• the speculation is not too aggressive, i.e. data required by the main thread is not evicted, and
• the spawning does not continue after the main thread has exited.
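A minimal Python sketch (with a toy direct-mapped cache and a synthetic trace, both hypothetical) of the first SP task, delinquent-load identification: count cache misses per static load and keep the few loads that account for most of the misses.

    from collections import Counter

    def find_delinquent_loads(trace, cache_sets=64, line=64, coverage=0.8):
        """trace: list of (load_pc, address) pairs from profiling."""
        tags = [None] * cache_sets               # toy direct-mapped cache
        misses, total = Counter(), 0
        for pc, addr in trace:
            s = (addr // line) % cache_sets
            tag = addr // (line * cache_sets)
            if tags[s] != tag:                   # cache miss
                tags[s] = tag
                misses[pc] += 1
                total += 1
        delinquent, covered = [], 0
        for pc, n in misses.most_common():       # cover `coverage` of misses
            delinquent.append(pc)
            covered += n
            if covered >= coverage * total:
                break
        return delinquent

    # Hypothetical trace: load 0x400 streams through a large array (many
    # misses) while load 0x500 re-reads one address (almost all hits).
    trace = [(0x400, 64 * i) for i in range(10_000)] + [(0x500, 0)] * 10_000
    print([hex(pc) for pc in find_delinquent_loads(trace)])   # ['0x400']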
2.4. Dynamic Speculative Pre-computation
J. D. Collins, H. Wang, D. M. Tullsen et al. proposed a technique called Dynamic Speculative Pre-computation (DSP) [9], which performs all the necessary instruction analysis, extraction and optimisation with back-end instruction-analysis hardware located off the processor's critical path. DSP is a hardware-controlled, thread-based counterpart of the software-based, manually constructed SP of the previous subsection.
This architecture uses:
a) a Delinquent Load Identification Table for identifying delinquent loads,
b) a Retired Instruction Buffer for constructing pre-computation slices, and
c) a Spawn Instruction Table for spawning instructions.
2.5. Dynamic Prefetching thread
Utilising the resources of a CMP well is a difficult task for sequential programs. Hou Rui, Longbing Zhang and Weiwu Hu in their paper [10] describe hardware-generated prefetching helper threads that accelerate sequential programs, known as Dynamic Prefetching Threads (DPT), together with two aggressive thread-construction policies, "Self-Loop" and "Fork-on-recursive-call".
In the basic policy the Dynamic Prefetching Thread (DPT) is constructed automatically from the delinquent loads that are the main cause of memory stall cycles; the DPT runs on an idle core of the CMP and obtains its context from the "shadow register".
The shadow register is a memory component added between the two cores of the dual-core design; it maintains the same data as the registers of the main core (the core executing the main thread). The main core running the actual program has write permission on the shadow register, while the other core, running the helper thread, has read permission only. With this basic thread-construction policy the speed-up is very low due to limited prefetching coverage.
• Self-Loop policy: in the basic policy a delinquent load is dispatched to the idle core once, whereas under the "Self-Loop" policy the delinquent load is dispatched to the idle core to prefetch the next instances of the same delinquent load, i.e. the next instances are prefetched into the same core by the same DPT, so that more instances of the delinquent load are available on that core and memory contention on the shadow register is reduced (see the sketch after this list).
• Fork-on-recursive-call: neither the basic policy nor "Self-Loop" can improve the performance of sequential programs that use tree- or graph-like linked data structures. In both structures every node is connected to adjacent nodes, so a node of the graph or tree can be viewed as the root of a sub-graph or sub-tree; such structures are therefore traversed by recursive function calls, and each recursive call is identified as a prefetching thread that can be dispatched to an idle core. The two policies are combined in the thread-construction policy, with "Fork-on-recursive-call" given higher priority than "Self-Loop", to achieve higher speed-up.
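A minimal Python sketch of the Self-Loop idea (software threads stand in for the idle core, and a set stands in for cache lines warmed through the shadow-register path): the helper keeps prefetching the next instances of the same delinquent load, here a node.next pointer chase, ahead of the main traversal.

    import threading, time

    class Node:
        __slots__ = ("value", "next")
        def __init__(self, value, nxt=None):
            self.value, self.next = value, nxt

    head = None
    for v in reversed(range(100_000)):     # build a linked list
        head = Node(v, head)

    warm = set()                           # lines the DPT has prefetched

    def dpt(node):                         # runs on the "idle core"
        while node is not None:            # Self-Loop: prefetch the next
            warm.add(id(node))             # instance, again and again
            node = node.next

    helper = threading.Thread(target=dpt, args=(head,))
    helper.start()
    time.sleep(0.01)                       # let the helper run ahead

    hits = misses = 0
    node = head
    while node is not None:                # the main thread's traversal
        if id(node) in warm: hits += 1     # would hit in the warmed cache
        else: misses += 1
        node = node.next
    helper.join()
    print(f"warmed: {hits}, cold: {misses}")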
Figure 4 Architecture of Dual core with DPT
2.6. Core spilling
Logically a CMP should give speed-up on sequential applications as well, given the higher transistor density available. One way to build a CMP is to emulate a cluster with shared common fetch and dispatch units. In [11] Cong, Han, Jagannathan et al. went orthogonal to this approach and proposed a scheme of statically partitioned cores (which may subsequently be clustered).
One bottleneck of the clustering approach was the centralised cluster scheduling unit, which limits scalability. The core spilling method they introduced claims to overcome this and give higher scalability.
The core spilling technique, presented for single-threaded applications, includes the following functional parts:
1. An application can continue speculative execution even after all available resources, such as the physical registers and the scheduling window, are exhausted.
2. A core spill occurs when either the instruction window or the scheduling window fills up on a core.
For any given spill two cores are involved:
a) the parent, which executes the programme before the spill, and
b) the child, which receives the spill from the parent.
Spilling can be implemented by augmenting each core with two additional registers (a sketch of this per-core state follows below):
1. Spilling Register: a single-bit register showing the status of the core spill (happened / not happened).
2. Descendent Register: keeps the core ID of the child of the given core.
In addition, one register file, a store buffer and an in-flight store need to be created. Core spilling performance can be improved by
1. prefetching, and
2. locality-based filtering.
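A minimal Python sketch (hypothetical structures, not the authors' hardware) of the per-core spill state: a one-bit spilling register, a descendent register naming the child core, and a spill when a toy instruction window fills up.

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class Core:
        core_id: int
        spilled: bool = False              # Spilling Register (1 bit)
        descendent: Optional[int] = None   # Descendent Register: child ID
        window: list = field(default_factory=list)

    WINDOW_SIZE = 4                        # toy instruction-window capacity

    def issue(cores, cid, insn):
        """Issue insn on core cid; spill to a free core when the window fills."""
        core = cores[cid]
        if len(core.window) < WINDOW_SIZE:
            core.window.append(insn)
            return cid
        # Window full -> core spill: the parent records its child, and the
        # child continues the speculative execution.
        child = next(c.core_id for c in cores
                     if not c.window and c.core_id != cid)
        core.spilled, core.descendent = True, child
        cores[child].window.append(insn)
        return child

    cores = [Core(i) for i in range(4)]
    cid = 0
    for i in range(6):                     # six instructions overflow core 0
        cid = issue(cores, cid, f"insn{i}")
    print(cores[0].spilled, cores[0].descendent, cores[1].window)
    # True 1 ['insn4', 'insn5']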
2.7. Comparison of the different techniques
The different techniques discussed above claim the following speed-ups:
Speculative Pre-computation on Chip Multiprocessors: 10 to 12%
Dynamic Speculative Pre-computation (simple p-slices): 14%
Dynamic Speculative Pre-computation (aggressive optimisations): 33%
Dynamic Prefetching Thread (basic thread): 3.8%
Dynamic Prefetching Thread (aggressive thread-construction policies): 29.6%
Core Spilling: 40%
Figure 5 Comparison of the different techniques for speedup on a Multicore
3. FACTORS AFFECTING PERFORMANCE
CMPs are not designed mainly for performance improvement but to reduce the heat produced. In the spirit of Murphy's Law, something that evolved for one particular characteristic or reason cannot be expected to yield something else; of course, the improvement in performance of a Multicore is a by-product of the multiprocessing capabilities of the CMP.
There are many factors that affect the performance of a CMP:
Number of cores,
Effective cache,
Core speed,
Density of elements, and
Amdahl's Law.
For this discussion we consider the Multicore architectures of two competing processor manufacturers, INTEL™ and AMD™, which produce two different CMP designs: AMD™ has both L2 and L3, whereas INTEL™ has simply L2 (refer to Figures 1 and 2).
3.1. Number of Cores
The performance of a CMP is linearly proportional to the number of cores, whereby a processor can accommodate a number of process contexts:
a higher number of cores can take on a higher number of processes, depending on their availability;
when not enough processes are available, the other cores can take on supporting tasks such as prefetching and profiling to increase the overall throughput;
in such a situation some of the cores can also be involved in housekeeping activity and fault-tolerant processing.
Performance ∝ C
Accordingly, a Multicore processor's performance is always better than a single-core processor's; it deteriorates only under one condition: there is a single process to execute, and the process and its data are strictly sequential with a lot of jumps, so that profiling and/or prefetching cannot increase the speed-up or throughput.
The constant of proportionality in the above relation depends on several factors:
1. the number of parallelisable processes,
2. the number of cores involved in supporting tasks such as prefetching and profiling,
3. the inherent parallelism of the job, and
4. Amdahl's Law.
3.2. Effective cache
The degree of multiprocessing depends on the total number of process contexts available in memory, which in turn depends on the total cache available to a core. This is in line with the stored-program concept, which states that a task can only be carried out when its context is available in memory. In general the size of the cache controls the degree of multiprocessing, as it accommodates the process contexts. So the degree of multiprocessing depends mainly on the total cache (L2 or L3) available per core, not on that of the entire chip.
Effective cache = cache per core ∝ L2 per core + L3 (in the case of AMD™)
Effective cache = cache per core ∝ L2 (in the case of INTEL™)
Processing performance depends on 1) the seek time and 2) the fault frequency; of the two, performance depends linearly on the seek time but exponentially on the fault time. When a fault occurs, performance drops; with more processes, more faults occur and degrade performance in the AMD™ design, in spite of its superior layout. One major reason is that AMD™ is still at 45 nm technology whereas INTEL™ is at 32 nm, so INTEL™ can afford to install more L2, as memory requires a higher density of elements. The performance of a CMP is linearly proportional to the seek time and the size of the Effective cache:
Performance ∝ Ts × L
Performance is inversely proportional to the memory fault time, at exponential order in the Effective cache size:
Performance ∝ Tf / log L
where Ts is the seek time, Tf is the fault time and L is the Effective cache.
Having both L2 and L3 is not a technical advantage of AMD™; rather, they counter INTEL™ by reducing the cost of the processor. Until the seek time reaches the fault time, AMD™ cannot remove the drawback of lower performance, and this will never happen, because when a fault occurs the data and/or instructions must be fetched from slower memory.
3.3. Core Speed
Performance is also proportional to the core speed. Since in a CMP a number of cores work at the same frequency, performance is exponentially proportional to the core speed, at the order of the number of cores. When the core speed is high, the number of mnemonics executed is high: the mnemonics executed depend on the T-cycle, so at a higher core speed more T-cycles are executed per unit of time, more instructions complete in that time, and the throughput per core rises accordingly.
Performance ∝ Cs^n
where Cs is the core speed and n is a constant.
The value of n depends on several factors, such as the limit of Amdahl's Law, the intercommunication between the cores, the losses due to cache conflicts, and the inherent parallelism of the code.
3.4. Density of element
Higher density of elements improves performance, because more memory can be put on the die, which reduces the number of faults; with more space on the die, INTEL™ can afford more L2 and/or more cores. Accordingly there is an automatic improvement in performance, as in the INTEL™ design. At higher density one can afford more cores or elements, which automatically means more functional units and/or more threads per core, and hence the improvement in performance is exponential.
Performance ∝ D^m
where D is the density and m is a constant.
The value of m depends on other factors, such as the number of interconnects. INTEL™ processors perform better for an additional reason: they have removed external interconnects in a pin-less design, moving them into the die, since at 32 nm technology it is very difficult to place pins. More interconnects imply more impedance, and the distance between interconnects is inversely proportional to the capacitance; more impedance produces more heat, and the only way to counter this is to reduce the frequency so that less wattage is consumed.
So higher density automatically calls for more cores running at a lower frequency, and more cores with more memory provide better performance; this is the secret of INTEL™. INTEL™ goes for Multicore not because it wants more cores but because it has reached the point where it must, whereas AMD™ went to Multicore just to demonstrate Multicore; that is the reason the hex-core AMD™ came to market much earlier than the hex-core INTEL™. INTEL™ Multicores can be overclocked whereas AMD™'s cannot. This is clearly evident in that INTEL™ uses technology only when it is needed, and the Core 2 Duo and Dual Core are very popular, while AMD™ tries to increase the number of cores, and hence the degree of multiprocessing, at 45 nm technology without increasing the Effective cache per core (per-core L2 + total L3); with a smaller Effective cache and more process contexts available, the total number of faults grows and degrades performance to a high degree.
3.5. Amdahl’s Law
Amdahl's law for parallel speed-up is given by the formula
Speedup = T(1) / T(N)
where T(N) is the time it takes to execute the program when using N processors [12, 13]. For parallelism, this law states that if P is the proportion of a task that can benefit from parallelisation and (1 − P) is the part that cannot go parallel and remains serial, then the maximum speed-up that can be achieved using N processors is 1 / ((1 − P) + P/N). When N is large, i.e. in the limit as N tends to infinity, the speed-up is 1 / (1 − P), which is independent of N, the number of processors (cores), and there is no improvement from increasing the number of processing elements. This means there is an upper limit on the parallelisation of a task and, accordingly, on the number of cores in a Multicore processor.
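A short numerical illustration (P = 0.9 is an assumed parallel fraction, not a figure from this paper) of how the Amdahl speed-up 1 / ((1 − P) + P/N) plateaus near 1 / (1 − P) as cores are added:

    def amdahl_speedup(p, n):
        """p: parallelisable fraction of the task, n: number of cores."""
        return 1.0 / ((1.0 - p) + p / n)

    P = 0.9
    for n in (1, 2, 4, 8, 16, 32, 1024):
        print(f"N={n:5d}  speedup={amdahl_speedup(P, n):5.2f}")
    print(f"limit  speedup={1 / (1 - P):5.2f}")    # 10.00 for P = 0.9

Doubling the cores from 16 to 32 only moves the speed-up from 6.40 to 7.80, while the limit for P = 0.9 is 10 however many cores are added.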
4. PROBLEMS IN A MULTICORE PROCESSOR
A Multicore processor is bundled with a few problems:
• In a Multicore processor, not only the rate of data transfer but also the amount of data transferred has an impact on the throughput or speed-up, since the processor must run a number of tasks (processes) in parallel; accordingly, not only the core speed but also the size of the cache (Effective cache) is of major concern, and
• as per Amdahl's Law, the number of cores available for parallel execution is another factor that limits the performance.
We categorise these problems of the Multicore as follows:
Performance: the performance of the Multicore processor should increase with the number of cores.
Reliability: the reliability of the Multicore processor should improve with the number of cores.
Backward compatibility: the Multicore processor should be backward compatible with all the software already in use; in other words, the increase in the number of cores should not render existing software techniques obsolete.
Localised processing: the Multicore processor should localise processing at each core as much as possible; the best example here is a neural network, in which a large number of neurons working in parallel achieve the highest performance with reliability.
4.1. How to improve the Multicore Processor’s Performance
There are a few things to be considered at this point.
Collaboration among the cores: the major contribution to Multicore processor performance comes from the number of cores executing the task in parallel. Performance therefore definitely depends on the arrangement of these cores and on their interconnecting bus, which enables collaboration. Accordingly, the design should provide a mechanism for enhanced collaboration among the cores through a communication channel between them.
Limit of collaboration: if the highest degree of collaboration is provided, there will be more communication between the cores, which becomes an overhead for the communication channel provided.
The limit of Amdahl's Law: to identify this limit of parallelisation, we conducted the experiment described in [14].
4.2. Finding the Limit of Amdahl's Law
With the proposed design it is possible to combat the architectural problems in Multicore processors. It now remains to identify the limit of Amdahl's Law for Multicore processors. For this, we at J.B. Institute of Engineering & Technology, Hyderabad, tested the performance of a cluster by adding nodes one after another and plotted the GFlops against the cluster size with the HPL benchmark [15] on a PelicanHPC [16] cluster. For more details, refer to our paper [14].
4.3. Important Observations
The results of our experiment show no improvement in speed-up beyond 8 cores; they are thus in line with Amdahl's law, which predicts the maximum number of parallelisable cores. We term this state of the processor 'The Multicore Bottleneck'.
4.4. Reasons
The problem is a communication bottleneck on the bus that connects the last-level cache to the RAM via the crossbar switch: its bandwidth is not equal to the sum of the bandwidths of the individual buses that connect the first-level caches to the last level. Accordingly, not all cores can receive or send data/instructions simultaneously. This situation forces the majority of the cores into the idle state: increasing the number of cores in a processor (the value of N in Amdahl's law) makes more cores idle, with no gain in speed-up. Already in 2010 the average speed-up fraction was 0.82 with N = 6 cores; it is expected to be 0.87 in 2013 with N = 32 cores, and reaching the remaining 12% (up to 0.99) would require N = infinity, or around some thousands of cores! Is it reasonable to go for thousand-core processors to gain this small improvement in efficiency or speed-up?
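A hedged sketch (P = 0.55 is an assumed parallel fraction chosen for illustration; it does not reproduce the exact 0.82/0.87 figures above) of how slowly the achieved fraction of the asymptotic speed-up 1 / (1 − P) approaches 1 as cores are added:

    def fraction_of_limit(p, n):
        speedup = 1.0 / ((1.0 - p) + p / n)
        return speedup * (1.0 - p)          # speedup / (1 / (1 - p))

    P = 0.55
    for n in (6, 32, 1024, 100_000):
        print(f"N={n:6d}  fraction of limit = {fraction_of_limit(P, n):.3f}")
    # The last few percent of the limit cost orders of magnitude more
    # cores, which is exactly the point made above.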
5. OUR PROPOSED DESIGN OF THE MULTICORE PROCESSOR
5.1. Design Objectives
On the basis of our observations, we propose that processors will perform better when designed with the following points in mind:
a) a number of cores within the limits of Amdahl's law;
b) the largest possible Effective cache, or making all the cache Effective cache for all the cores;
c) the L2 cache designed as shown in Figure 6, so that each L2 serves a number of cores;
d) all the L2 caches set-associative with the other L2 caches, for improved collaboration among the cores;
e) a higher density of elements, to facilitate more cores and a large L3 cache;
f) running at the maximum possible core speed (within the limits of the power budget), so as to keep the T-cycle short and execute a large number of mnemonics in a given slot of time.
Figure 6 Our Proposed Multicore Processor Architecture
5.2. Advantages with the proposed design
Our proposed design is a hybrid model that takes into consideration the advantages of both INTEL™ and AMD™:
Our design keeps the L1 cache as in AMD™, to improve localised processing at each core;
all the L2 caches are set-associative with the other L2 caches, so that all the cores can engage in cooperative processing in the absence of multiple tasks;
when a limited number of tasks is present, the second core attached to each L2 can be used to run the helper thread under the Cascaded Execution policy;
the L2 cache can serve as the shadow register for the Dynamic Prefetching Thread execution policy;
when executing a single sequential task, all the L2 and L3 can be used by any core to accommodate a large amount of data and/or instructions in memory, improving sequential execution performance;
our design adopts a higher density of elements, as in INTEL™, to make room for all three levels of cache on the chip.
REFERENCES
[1] Simcha Gochman, Avi Mendelson, Alon Naveh, Efraim Rotem; Introduction to INTEL™ Core Duo® Processor Architecture; INTEL™ Technology Journal, Volume 10, Issue 02, May 15, 2006, ISSN 1535-864X.
[2] AMD™ 64 Architecture Programmer's Manual, Volume 1: Application Programming; Advanced Micro Devices, Publication No. 24592, Revision 3.14, September 2007, page 97.
[3] R. E. Anderson, Thu D. Nguyen, et al.; Cascaded Execution: Speeding up unparallelised execution on shared memory multiprocessors; In Proceedings of the 13th International Parallel Processing Symposium.
[4] Pierre Michaud; Exploiting the cache capacity of a single-chip multicore processor with execution migration; In Proceedings of the 10th International Symposium on High Performance Computer Architecture, 2004.
[5] Garcia-Molina, Lipton, Valdes; 'A Massive Memory Machine'; IEEE Transactions on Computers, May 1984.
[6] Dean M. Tullsen, Susan J. Eggers, Joel S. Emer, Henry M. Levy, Jack L. Lo, and Rebecca L. Stamm; Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor; In Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, May 1996.
[7] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y. Lee, D. Lavery, J. Shen; Speculative Pre-computation: Long-range Prefetching of delinquent loads; In Proceedings of the 28th Annual International Symposium on Computer Architecture, July 2001.
[8] J. A. Brown, H. Wang, G. Chrysos, P. H. Wang, and J. P. Shen; "Speculative Pre-computation on Chip Multiprocessors"; In Proceedings of MEAC 6, November 2001.
[9] J. D. Collins, H. Wang, D. M. Tullsen, C. Hughes, Y. Lee, D. Lavery, J. Shen; Dynamic Speculative Pre-computation; In Proceedings of the 34th International Symposium on Microarchitecture, December 2001.
[10] Hou Rui, Longbing Zhang, Weiwu Hu; Accelerating sequential applications on Chip Multiprocessors via Dynamic Prefetching thread; Journal of Microprocessors and Microsystems, Elsevier, 2007.
[11] J. Cong, G. Han, A. Jagannathan, G. Reinman, K. Rutkowski; Accelerating Sequential Applications on CMPs Using Core Spilling; IEEE Transactions on Parallel and Distributed Systems, 2007, Volume 18, pages 1094-1107.
[12] G. M. Amdahl; 'Validity of the Single-Processor Approach to Achieving Large-Scale Computing Capabilities'; Proceedings of the American Federation of Information Processing Societies Conf., vol. 30 (Atlantic City, N.J., Apr. 18-20), AFIPS Press, 1967, pp. 483-485.
[13] J. L. Gustafson; 'Reevaluating Amdahl's Law'; Communications of the ACM, May 1988, Volume 31, Number 5, pp. 532-533.
[14] K. V. Ranga Rao and Niraj Upadhayaya; The Multicore Bottleneck; International Journal of Multidisciplinary Research and Advances in Engineering (IJMRAE), ISSN 0975-7074, Volume 2, No. II, July 2010, pages 149-159.
[15] http://www.netlib.org/benchmark/hpl
[16] http://pareto.uab.es/mcreel/PelicanHPC/download/pelicanhpc-v1.8-64bit.iso
Authors
K. V. Ranga Rao obtained his Bachelor's degree in Computer Science & Engineering from Nagarjuna University, Guntur, his Master of Science in Software Systems from BITS (DLPD), Pilani, and his Master of Technology in Computer Science from JNT University, Hyderabad. He is currently working as Associate Professor in the Department of CSE of JB Institute of Engineering & Technology, Hyderabad, and has submitted his Ph.D. in CSE (Parallel Computing) to JNTUK, Kakinada. His research interests include high-performance computer architectures.
Dr. Niraj Upadhyaya, Ph.D., is an eminent scholar, a professor in the CSE Department and Dean of Academics at J.B. Institute of Engineering & Technology, Hyderabad. He is established in the field of parallel computing and has many articles, papers and journal publications to his name. He received his Ph.D. from the University of the West of England, Bristol, UK, on the topic of "Memory Management of Cluster HPCs". He has more than 20 years of experience, from which his students always benefit.