Report on hyperthreading


this is the report of hyper threading. which is make

Published in: Technology, Business

  1. INTRODUCTION

1. Introduction to Hyper-Threading

Hyper-Threading is a technology incorporated into Intel® Xeon™ processors and is the Intel implementation of an architectural technique called simultaneous multithreading (SMT). SMT technology was conceived to exploit a program's instruction- and thread-level parallelism through the simultaneous execution of multiple threads on a single physical processor. Several commercial processors have recently incorporated SMT technology.

To understand Hyper-Threading, the basic issues that provided the impetus to develop such a technology for Intel Architecture (IA-32) processors must first be understood. According to Intel, processor efficiency for today's typical server workloads is low. On average, only one-third of processor resources are utilized. Computer users running a Microsoft® Windows® operating system (OS) can monitor processor utilization with the Task Manager tool. Unfortunately, this tool provides only a high-level view of processor utilization. The Task Manager might show 100 percent processor utilization, but the actual utilization of internal processor-execution resources is often much lower. This inefficient use of processor resources limits the system's overall performance.

To satisfy the growing needs of computer users, several techniques have been developed over the years to overcome processor inefficiency and improve system performance. These performance strategies were primarily based on two techniques: software improvement and resource redundancy.

Software improvements range from algorithmic changes to code modifications (improved compilers, programming languages, parallel coding, and so forth) to multithreading. These innovations were intended to provide better mapping of an application architecture to the hardware architecture. A perfect mapping would mean that no hardware resource is left unused at any given time.
Despite these efforts, a perfectly mapped scenario is yet to be achieved.

The second technique, resource redundancy, avoids any attempt to improve performance through better utilization. Rather, it takes an approach opposite to software improvement and actually degrades efficiency by duplicating resources. Although the duplication degrades efficiency, it helps improve overall performance. The Intel processors of the 1980s had only a few execution unit resources and, consequently, could handle only a few instructions at any given time. Today, Intel Pentium® 4 processors (the Intel Xeon processor is derived from the Pentium 4 processor core) are seven-way superscalar machines with the ability to keep 126 instructions in flight in the pipeline. Multiprocessor systems with these Intel processors allow multiple threads to be scheduled for parallel execution, resulting in improved application performance. This redundancy within a processor as well as within the entire computing system addresses growing performance needs, but it does so at the cost of diminishing returns.
  2. To improve system performance cost-effectively, inefficiencies in processor resource utilization should be alleviated or removed. Hyper-Threading technology provides an opportunity to achieve this objective. With less than a 5 percent increase in die size, Hyper-Threading incorporates multiple logical processors in each physical processor package. The logical processors share most of the processor resources and can increase system performance when concurrently executing multithreaded or multitasked workloads.

2. What is Hyper-Threading Technology?

Hyper-Threading (HT) is a feature built into many Intel CPUs. Intel brought the technology to desktop PCs when it released the first 3 GHz-class Pentium 4 processor. HT technology makes a single processor appear as two virtual processors so that it can handle two sets of instructions simultaneously. It is meant to increase the performance of a processor by using parts of the processor that would otherwise sit idle. This enables the processor to perform tasks faster (reportedly a 25% to 40% speed increase) than a processor without HT.

Generally, a single-core processor appears as one processor to the OS and executes only one set of instructions at a time. An HT-enabled single-core processor, however, appears as two processors to the OS and can execute two application threads at once. For example, a dual-core processor that supports HT appears to the OS as four processors, a quad-core processor that supports HT appears as eight processors, and so on. The essentials needed to benefit from HT are described next.

Requirements of Hyper-Threading technology

Hyper-Threading technology requires the following fundamentals:

- A processor with built-in HT technology. Not all processors support HT, so before purchasing a computer make sure that it supports HT technology. You can identify an HT-enabled processor by checking its specification and CPU logo.
Normally, Intel clearly labels processors with built-in HT. Intel processor families that support HT technology include Intel Atom, Xeon, Core i-series, Pentium 4, and mobile Pentium processors. More information is available on the Intel website, in the list of Intel processors that support HT Technology.

- An operating system that supports HT. An HT-enabled single processor appears as two processors to the operating system. However, if the OS does not support HT, you cannot benefit from this technology even though you have an HT-enabled processor. The OS must recognize the HT-enabled processor so that it can schedule two threads or sets of instructions for processing. Windows XP and later operating systems are optimized for HT technology.

Hyper-Threading is a technology developed by Intel that enables multithreaded software applications to execute threads in parallel on a single processor instead of processing threads in a linear fashion.

Hyper-Threading Technology is an innovation that can significantly improve processor performance. Pioneered by Intel on the Intel® Xeon™ processor family for servers, Hyper-Threading Technology has enabled greater productivity and enhanced the user experience. View Cast introduced Hyper-Threading on the Niagara Power Stream systems in April 2003.

Hyper-Threading Technology is now supported on the Intel® Pentium® 4 Processor with HT Technology. Hyper-Threading provides a performance boost that is particularly suited to today's computing climate, applications, and operating systems.
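The processor counts mentioned above (a dual-core HT processor appearing as four processors, a quad-core as eight) follow from simple multiplication. The sketch below illustrates this; the helper name `logical_processors` and the default of two threads per core are assumptions made for illustration, not part of any Intel or operating-system API. `os.cpu_count()` is the standard-library call that reports how many logical processors the current OS sees.

```python
import os


def logical_processors(physical_cores, threads_per_core=2):
    """Processors the OS would see (hypothetical helper, for illustration).

    threads_per_core=2 models Hyper-Threading as described in the text;
    threads_per_core=1 models a processor without HT.
    """
    if physical_cores < 1 or threads_per_core < 1:
        raise ValueError("counts must be positive")
    return physical_cores * threads_per_core


if __name__ == "__main__":
    for cores in (1, 2, 4):
        print(cores, "core(s) with HT appear as",
              logical_processors(cores), "processors")
    # Logical processor count reported by the OS on this machine (may be None).
    print("This machine reports:", os.cpu_count())
```

Note that `os.cpu_count()` alone cannot tell whether the count comes from physical cores or from Hyper-Threading; it only reports what the OS exposes.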
  3. Intel Hyper-Threading

Intel Hyper-Threading Technology allows a single processor core to run two threads simultaneously, thus increasing the performance of the CPU. The technology uses simultaneous multithreading and is intended to increase both throughput and energy efficiency.

Basics of Hyper-Threading Technology

Intel's Hyper-Threading Technology duplicates certain sections of the processor, specifically those areas that store the CPU's architectural state, and not the main execution resources. By duplicating only the architectural state, Hyper-Threading makes one physical processor appear as two processors. This allows the operating system to schedule two threads or processes simultaneously on each physical processor.

Hyper-Threading System Requirements

The operating system must provide symmetric multiprocessing (SMP) support. With SMP, Hyper-Threading presents logical processors as standard separate processors.

Enabling Hyper-Threading Technology

To enter your computer's BIOS, you typically press F2 while powering up the computer until the BIOS setup option appears; the exact key and menu layout vary by manufacturer. Select the "Hyper Threading" option, press Enter to enable it, and then choose "Save and Exit" to save the setting. You can repeat the process to disable Hyper-Threading.

What Hyper-Threading Improves

According to Intel, Hyper-Threading adds support for multithreaded code, brings multiple-thread use to a single processor, and improves reaction and response times. System performance is said to be 15 to 30 percent better when Hyper-Threading is used.

3. How Hyper-Threading Works

Faster clock speeds are an important way to deliver more computing power. But clock speed is only half the story. The other route to higher performance is to accomplish more work on each clock cycle, and that's where Hyper-Threading Technology comes in.
A single processor supporting Hyper-Threading Technology presents itself to modern operating systems and applications as two virtual
  4. processors. The processor can work on two sets of tasks simultaneously, use resources that otherwise would sit idle, and get more work done in the same amount of time.

HT Technology takes advantage of the multithreading capability that's built in to Windows XP and many advanced applications. Multithreaded software divides its workloads into processes and threads that can be independently scheduled and dispatched. In a multiprocessor system, those threads execute on different processors.

Fig. 3: Working principle of Hyper-Threading

4. Hyper-Threading Technology Architecture

Hyper-Threading Technology makes a single physical processor appear as multiple logical processors. To do this, there is one copy of the architecture state for each logical processor, and the logical processors share a single set of physical execution resources. From a software or architecture perspective, this means operating systems and user programs can schedule processes or threads to logical processors as they would on conventional physical processors in a multiprocessor system. From a microarchitecture perspective, this means that instructions from logical processors will persist and execute simultaneously on shared execution resources.
  5. Figure 4.1: Processors without Hyper-Threading Technology

As an example, Figure 4.1 shows a multiprocessor system with two physical processors that are not Hyper-Threading Technology-capable. Figure 4.2 shows a multiprocessor system with two physical processors that are Hyper-Threading Technology-capable. With two copies of the architectural state on each physical processor, the system appears to have four logical processors.

Figure 4.2: Processors with Hyper-Threading Technology
  6. The first implementation of Hyper-Threading Technology is being made available on the Intel Xeon processor family for dual and multiprocessor servers, with two logical processors per physical processor. By more efficiently using existing processor resources, the Intel Xeon processor family can significantly improve performance at virtually the same system cost. This implementation of Hyper-Threading Technology added less than 5% to the relative chip size and maximum power requirements, but can provide performance benefits much greater than that.

Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors needed to store the architecture state is an extremely small fraction of the total. Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses. Each logical processor has its own interrupt controller or APIC. Interrupts sent to a specific logical processor are handled only by that logical processor.

5. Implementation on the Intel Xeon Processor Family

Several goals were at the heart of the microarchitecture design choices made for the Intel Xeon processor MP implementation of Hyper-Threading Technology.
One goal was to minimize the die area cost of implementing Hyper-Threading Technology. Since the logical processors share the vast majority of microarchitecture resources and only a few small structures were replicated, the die area cost of the first implementation was less than 5% of the total die area.

A second goal was to ensure that when one logical processor is stalled, the other logical processor can continue to make forward progress. A logical processor may be temporarily stalled for a variety of reasons, including servicing cache misses, handling branch mispredictions, or waiting for the results of previous instructions. Independent forward progress is ensured by managing buffering queues such that no logical processor can use all the entries when two active software threads are executing. This is accomplished by either partitioning or limiting the number of active entries each thread can have.

A third goal was to allow a processor running only one active software thread to run at the same speed on a processor with Hyper-Threading Technology as on a processor without this capability. This means that partitioned resources should be recombined when only one software thread is active. A high-level view of the microarchitecture pipeline is shown in Figure 4. As shown, buffering queues separate major pipeline logic blocks. The buffering queues are either partitioned or duplicated to ensure independent forward progress through each logic block.
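The second and third goals can be illustrated with a toy model of one buffering queue. This is only a sketch under stated assumptions: the class `SharedQueue`, its methods, and the even half-and-half split are hypothetical names and policies chosen for illustration, not a description of the actual Xeon hardware.

```python
class SharedQueue:
    """Toy model of a buffering queue shared by two logical processors."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {0: 0, 1: 0}  # entries in use per logical processor
        self.active = {0, 1}         # logical processors running a thread

    def limit(self):
        # Two active threads: each may hold at most half of the entries,
        # so neither can starve the other. One active thread: the
        # partitions are recombined and it may use the whole queue.
        return self.capacity // 2 if len(self.active) == 2 else self.capacity

    def allocate(self, lp):
        """Try to take one entry for logical processor lp."""
        if lp not in self.active:
            return False
        in_use = sum(self.entries.values())
        if self.entries[lp] >= self.limit() or in_use >= self.capacity:
            return False
        self.entries[lp] += 1
        return True

    def deactivate(self, lp):
        """Model a logical processor going idle: free its entries."""
        self.active.discard(lp)
        self.entries[lp] = 0


if __name__ == "__main__":
    q = SharedQueue(capacity=8)
    print(sum(q.allocate(0) for _ in range(10)))  # thread 0 is capped at half
    print(q.allocate(1))                          # thread 1 still makes progress
    q.deactivate(1)
    print(sum(q.allocate(0) for _ in range(10)))  # thread 0 takes the rest
```

The cap enforces independent forward progress, while recombining the partitions when one logical processor goes idle models the requirement that a single thread should not run slower than on a processor without Hyper-Threading.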
  7. Single-task and multi-task modes

To optimize performance when there is one software thread to execute, there are two modes of operation, referred to as single-task (ST) and multi-task (MT). In MT-mode, there are two active logical processors and some of the resources are partitioned as described earlier. There are two flavors of ST-mode: single-task logical processor 0 (ST0) and single-task logical processor 1 (ST1). In ST0- or ST1-mode, only one logical processor is active, and resources that were partitioned in MT-mode are recombined to give the single active logical processor use of all of the resources.

The IA-32 Intel Architecture has an instruction called HALT that stops processor execution and normally allows the processor to go into a lower-power mode. HALT is a privileged instruction, meaning that only the operating system or other ring-0 processes may execute this instruction. User-level applications cannot execute HALT. On a processor with Hyper-Threading Technology, executing HALT transitions the processor from MT-mode to ST0- or ST1-mode, depending on which logical processor executed the HALT. For example, if logical processor 0 executes HALT, only logical processor 1 would be active; the physical processor would be in ST1-mode and partitioned resources would be recombined, giving logical processor 1 full use of all processor resources.
If the remaining active logical processor also executes HALT, the physical processor would then be able to go to a lower-power mode. In ST0- or ST1-mode, an interrupt sent to the halted logical processor would cause a transition to MT-mode. The operating system is responsible for managing MT-mode transitions.
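The mode transitions described above can be summarized as a small state machine. The sketch below is a simplified model: the names `Mode` and `PhysicalProcessor` are hypothetical, and the behavior when an interrupt wakes one logical processor while both are halted (returning to the matching ST-mode) is an assumption, since the text does not spell out that case.

```python
from enum import Enum


class Mode(Enum):
    MT = "MT"                # both logical processors active
    ST0 = "ST0"              # only logical processor 0 active
    ST1 = "ST1"              # only logical processor 1 active
    LOW_POWER = "low-power"  # both logical processors halted


class PhysicalProcessor:
    """Toy model of HALT and interrupt handling on an HT processor."""

    def __init__(self):
        self.halted = {0: False, 1: False}

    @property
    def mode(self):
        h0, h1 = self.halted[0], self.halted[1]
        if not h0 and not h1:
            return Mode.MT
        if h0 and h1:
            return Mode.LOW_POWER
        # Exactly one logical processor is halted; the other owns all resources.
        return Mode.ST1 if h0 else Mode.ST0

    def halt(self, lp):
        """Logical processor lp executes HALT (privileged, OS only)."""
        self.halted[lp] = True

    def interrupt(self, lp):
        """An interrupt sent to logical processor lp wakes it up."""
        self.halted[lp] = False


if __name__ == "__main__":
    cpu = PhysicalProcessor()
    cpu.halt(0)
    print(cpu.mode)   # Mode.ST1
    cpu.interrupt(0)
    print(cpu.mode)   # Mode.MT
```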
  8. Fig 5.1: Resource Allocation

Figure 5.1 summarizes this discussion. On a processor with Hyper-Threading Technology, resources are allocated to a single logical processor if the processor is in ST0- or ST1-mode. In MT-mode, resources are shared between the two logical processors.

Operating System and Application

A system with processors that use Hyper-Threading Technology appears to the operating system and application software as having twice the number of processors that it physically has. Operating systems manage logical processors as they do physical processors, scheduling runnable tasks or threads to logical processors. However, for best performance, the operating system should implement two optimizations.

The first is to use the HALT instruction if one logical processor is active and the other is not. HALT will allow the processor to transition to either the ST0- or ST1-mode. An operating system that does not use this optimization would execute on the idle logical processor a sequence of instructions that repeatedly checks for work to do. This so-called "idle loop" can consume significant execution resources that could otherwise be used to make faster progress on the other active logical processor.

The second optimization is in scheduling software threads to logical processors. In general, for best performance, the operating system should schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor. This optimization allows software threads to use different physical execution resources when possible.
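The second optimization amounts to a placement order over logical processors. A minimal sketch follows, assuming a hypothetical helper `placement_order` and identifying each logical processor by a `(physical, logical)` pair; real operating-system schedulers weigh many more factors than this.

```python
def placement_order(physical_processors, threads_per_processor=2):
    """Order in which to fill logical processors (illustrative only).

    Every physical processor receives one thread before any physical
    processor receives a second one.
    """
    return [
        (phys, logical)
        for logical in range(threads_per_processor)
        for phys in range(physical_processors)
    ]


if __name__ == "__main__":
    # Two physical processors with Hyper-Threading: four logical processors.
    for n, slot in enumerate(placement_order(2), start=1):
        print("thread", n, "-> physical", slot[0], "logical", slot[1])
```

With two physical processors, the first two threads land on different physical processors and only the third and fourth share execution resources with an already running thread.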
  9. Performance

The Intel Xeon processor family delivers the highest server system performance of any IA-32 Intel architecture processor introduced to date. Initial benchmark tests show up to a 65% performance increase on high-end server applications when compared to the previous-generation Pentium® III Xeon™ processor on 4-way server platforms. A significant portion of those gains can be attributed to Hyper-Threading Technology.

Fig 5.2: Performance increases from Hyper-Threading Technology on an OLTP workload

Figure 5.2 shows the online transaction processing performance, scaling from a single-processor configuration through to a 4-processor system with Hyper-Threading Technology enabled. This graph is normalized to the performance of the single-processor system. It can be seen that there is a significant overall performance gain attributable to Hyper-Threading Technology, 21% in the cases of the single- and dual-processor systems.
  10. Fig 5.3: Web server benchmark performance

Figure 5.3 shows the benefit of Hyper-Threading Technology when executing other server-centric benchmarks. The workloads chosen were two different benchmarks that are designed to exercise data and Web server characteristics and a workload that focuses on exercising a server-side Java environment. In these cases the performance benefit ranged from 16 to 28%.

All the performance results quoted above are normalized to ensure that readers focus on the relative performance and not the absolute performance. Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing.

6. Media Applications on Hyper-Threading Technology

To date, computational power has typically increased over time because of the evolution from simple pipelined designs to the complex speculation and out-of-order execution of many of today's deeply pipelined superscalar designs. While processors are now much faster than they used to be, the rapidly growing complexity of such designs also makes achieving significant additional gains more difficult. Consequently, processors and systems that can run multiple software threads have received increasing attention as a means of boosting overall performance. In this
  11. paper, we first characterize the workloads of video decoding, encoding, and watermarking on current superscalar architectures, and then we characterize the same workloads using the recently announced Hyper-Threading Technology. Our goal is to provide a better understanding of performance improvements in multimedia applications on processors with Hyper-Threading Technology.

Figure 6.1 shows a high-level view of Hyper-Threading Technology and compares it to a dual-processor system. In the first implementation of Hyper-Threading Technology, one physical processor exposes two logical processors. Similar to a dual-core or dual-processor system, a processor with Hyper-Threading Technology appears to an application as two processors. Two applications or threads can be executed in parallel. The major difference between systems that use Hyper-Threading Technology and dual-processor systems is the amount of duplicated resources. In today's Hyper-Threading Technology, only a small set of the microarchitecture state is duplicated, while the front-end logic, execution units, out-of-order retirement engine, and memory hierarchy are shared. Thus, compared to processors without Hyper-Threading Technology, the die size is increased by less than 5% [7]. While sharing some resources may increase the latency of some single-threaded applications, the overall throughput is higher for multi-threaded or multi-process applications.
  12. Fig. 6.1: High-level diagram of (a) a processor with Hyper-Threading Technology and (b) a dual-processor system

This paper is organized as follows. First, we provide a brief review of the basic principles behind most current video codecs, describing the overall application behavior of video decoding/encoding/watermarking and the implications of the key kernels for current and emerging architectures. Then, we show the multi-threaded software architectures of our applications, including data-domain and functional decomposition. Additionally, we describe some potential pitfalls when developing software on processors with Hyper-Threading Technology and our techniques to avoid them. Finally, we provide some performance numbers and our observations.

Multimedia Workloads

This section describes the workload characterization of selected multimedia applications on current superscalar architectures. Although the workloads are well optimized for Pentium® 4 processors, due to the inherent constitution of the algorithms, most of the modules in these workloads cannot fully utilize all the execution resources available in the microprocessor. The particular workloads we target are video decoding, encoding, and watermark detection, which are key components in both current and many future applications and are representative of many media workloads.
  13. MPEG Decoder and Encoder

The Moving Picture Experts Group (MPEG) is a standards group founded in 1988. Since its inception, the group has defined a number of popular audio and video compression standards, including MPEG-1, MPEG-2, and MPEG-4. The standards incorporate three major compression techniques: (1) predictive coding; (2) transform-based coding; and (3) entropy coding. To implement these, the MPEG encoding pipeline consists of motion estimation, Discrete Cosine Transform (DCT), quantization, and variable-length coding. The MPEG decoding pipeline consists of the counterpart operations of Variable-Length Decoding (VLD), Inverse Quantization (IQ), Inverse Discrete Cosine Transform (IDCT), and Motion Compensation (MC), as shown in Figure 6.2.

Figure 6.2: Block diagram of MPEG decoder

A digital video watermark, which is invisible and hard for others to alter, is information embedded in the video content. A watermark can be made by slightly changing the video content according to a secret pattern. For example, when just a few out of the millions of pixels in a picture are adjusted, the change is imperceptible to the human eye. A decoder can detect and retrieve the watermark by using the key that was used to create the watermark.

Video Watermarking

Another application that we studied is video watermark detection. Our watermark detector has two basic stages: video decoding and image-domain watermark detection. The application is optimized with MPL (as the video decoder) and the Intel IPL (for the image manipulations used during watermark detection). A UPC of 1.01 also indicates that there is room for improvement.
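The idea of embedding a watermark by slightly changing pixels according to a secret pattern, and detecting it with the same key, can be sketched in a few lines. This is a deliberately simplified illustration and not the detector studied in the report: the functions `pattern`, `embed`, and `detect`, the strength of 2, the 0.5 threshold, and the use of the original frame during detection are all assumptions made for the example.

```python
import random


def pattern(key, n):
    """Secret sequence of +1/-1 values derived from the key."""
    rng = random.Random(key)
    return [rng.choice((-1, 1)) for _ in range(n)]


def embed(pixels, key, strength=2):
    """Nudge every pixel up or down slightly according to the pattern."""
    return [p + strength * s for p, s in zip(pixels, pattern(key, len(pixels)))]


def detect(marked, original, key, strength=2):
    """Correlate the pixel changes with the key's pattern.

    The normalized score is 1.0 when the changes match the pattern exactly
    and close to 0.0 for an unmarked frame or an unrelated key.
    """
    pat = pattern(key, len(marked))
    score = sum((m - o) * s for m, o, s in zip(marked, original, pat))
    return score / (strength * len(marked)) > 0.5


if __name__ == "__main__":
    rng = random.Random(0)
    frame = [rng.randint(100, 150) for _ in range(4096)]  # stand-in for pixels
    marked = embed(frame, key=1234)
    print("right key:", detect(marked, frame, key=1234))
    print("wrong key:", detect(marked, frame, key=4321))
    print("unmarked: ", detect(frame, frame, key=1234))
```

The pixel values are kept in the middle of the 0 to 255 range so the small adjustments never need clipping; a practical scheme would also have to survive compression and would usually not require the original frame.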
  14. Figure 6.3: Two slice-based task partitioning schemes between two threads: (a) half-and-half dispatching (static scheduling) and (b) slice-by-slice scheduling (dynamic scheduling)

Task Partitioning and Scheduling

In general, multimedia applications, such as video encoding and decoding, exhibit not only data- and instruction-level parallelism, but also the possibility for substantial thread-level parallelism. Such workloads are good candidates for speed-up on a number of different multithreading architectures. This section discusses the trade-offs of different software multithreading methods.

Data-Domain Decomposition: Slice-Based Dispatching

As shown in Figure 6.3, a picture in a video bit stream can be divided into slices of macroblocks. Each slice, consisting of blocks of pixels, is a unit that can be decoded independently. Here we compare two methods to decode the pictures in parallel:

1. Half-and-half (aka static partitioning): In this method, one thread is statically assigned the first half of the picture, while another thread is assigned the other half of the picture (as shown in Figure 6.3 (a)). Assuming that the complexity of the first half and second half is similar, these two threads will finish the task at roughly the same time. However, some areas of the picture may be easier to decode than others. This may lead to one thread being idle while the other thread is still busy.
  15. 2. Slice-by-slice (aka dynamic partitioning): In this method, slices are dispatched dynamically. A new slice is assigned to a thread when the thread has finished its previously assigned slice. In this case, we don't know which slices will be assigned to which thread. Instead, the assignment depends on the complexity of the slices assigned. As a result, one thread may decode a larger portion of the picture than the other if its assignments are easier than those of the other thread. The execution time difference between two threads, in the worst case, is the decoding time of the last slice.

In both cases, each thread performs Variable-Length Decoding (VLD), Inverse Discrete Cosine Transform (IDCT), and Motion Compensation (MC) in its share of the pictures, macroblock by macroblock. While one thread is working on MC (memory intensive), the other thread may work on VLD or IDCT (less memory intensive). Although the partitioning does not explicitly interleave computations and memory references, on average, it better balances the use of resources.

Implications of Software Design for Hyper-Threading Technology

During the implementation of our applications on processors with Hyper-Threading Technology, we had a number of observations.
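The difference between the two partitioning methods can be seen with a small simulation. The sketch below is illustrative only: the per-slice decoding costs are made-up numbers, and the function names `static_half_and_half` and `dynamic_slice_by_slice` are hypothetical. It computes how long each of two threads stays busy under each scheme.

```python
def static_half_and_half(costs):
    """Thread 0 decodes the first half of the slices, thread 1 the rest."""
    half = len(costs) // 2
    return sum(costs[:half]), sum(costs[half:])


def dynamic_slice_by_slice(costs):
    """Hand each slice, in order, to the thread that becomes free first."""
    finish = [0, 0]
    for cost in costs:
        thread = 0 if finish[0] <= finish[1] else 1
        finish[thread] += cost
    return finish[0], finish[1]


if __name__ == "__main__":
    # Hypothetical decoding times: the lower half of the picture is harder.
    slice_costs = [1, 1, 1, 1, 4, 4, 4, 4]
    print("static :", static_half_and_half(slice_costs))    # (4, 16)
    print("dynamic:", dynamic_slice_by_slice(slice_costs))  # (10, 10)
```

In this example the statically partitioned picture is finished after 16 time units while one thread sits idle for 12 of them, whereas dynamic dispatching finishes after 10 units with both threads busy throughout.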
In this section, we discuss some general software techniques to help readers design their applications better on systems with Hyper-Threading Technology.

Using Hyper-Threading Technology, performance can be lost when the loads are not balanced. Because two logical processors share resources on one physical processor with Hyper-Threading Technology, each logical processor does not get all the resources a single processor would get. When only a single thread of the application is actively working and the other thread is waiting (especially spin-waiting), this portion of the application could have less than 100% of the resources when compared to a single processor, and it might run slower on a processor with simultaneous multithreading capability than on processors without simultaneous multithreading capability. Thus, it is important to reduce the portion in which only one thread is actively working. For better performance, effective load balancing is crucial.

The foremost advantage of the dynamic scheduling scheme is its good load balance between the two threads. Because some areas of the picture may be easier to decode than others, one thread under the static partitioning scheme may be idle while another thread still has a lot of work to do. In the dynamic partitioning scheme, we have very good load balance. As we assign a new slice to a thread only when it has finished its previous slice, the execution time difference between the two threads, in the worst case, is the decoding time of a slice.

Because two logical processors share one physical processor, the effective sizes of the caches for each logical processor are roughly one half of the original size. Thus, it is important for multithreaded applications to target one half of the caches for each application thread. For example, when considering code size optimization, excessive loop unrolling should be avoided.
While sharing caches may be a drawback for some applications running on processors with Hyper-Threading Technology, it can provide better
  16. Fig 6.4: Cache localities during (a) motion compensation, in (b) static partitioning, and in (c) dynamic partitioning

cache locality between the two logical processors for other applications. For example, Wang et al. use one logical processor to pre-fetch data into the shared caches to reduce a substantial amount of the memory latency of the application in the other logical processor. We now
  17. illustrate the advantage of sharing caches in our application. On dual-processor systems, each processor has a private cache. Thus, there may be a drawback to dynamic partitioning in terms of cache locality. Figure 6.4 illustrates the cache locality in multiple frames of video. During motion compensation, the decoder uses part of the previous picture, the referenced part of which is roughly co-located in the previous reference frame, to reconstruct the current frame. It is faster to decode the picture when the co-located part of the picture is still in the cache. In the case of a dual-processor system, each thread is running on its own processor, each with its own cache. If the co-located part of the picture in the previous frame is decoded by the same thread, it is more likely that the local cache will have the pictures that have just been decoded.

Since we dynamically assign slices to different threads, it is more likely that the co-located portion of the previous picture will not be in the local cache when each thread is running on its own physical processor and cache. Thus, dynamic partitioning may incur more bus transactions. In contrast, the cache is shared between logical processors on a processor with Hyper-Threading Technology, and thus cache localities are preserved. We obtain the best of both worlds with dynamic scheduling: there is load balancing between the threads, and there is the same effective cache locality as for static scheduling on a dual-processor system.
  18. 7. Conclusion

Intel Xeon Hyper-Threading has a positive impact on the Linux kernel and on multithreaded applications.

Today, with Hyper-Threading Technology, processor-level threading can be utilized, which offers more efficient use of processor resources for greater parallelism and improved performance on today's multi-threaded software.

8. References