03 intel v_tune_session_04

Published in: Technology, News & Politics

  1. Code Optimization & Performance Tuning using Intel VTune
     Objectives
     In this session, you will learn to:
       • Measure performance-related data for processors
       • Identify the hierarchy of memory
       • Benchmark processor performance
     Ver. 1.0 Slide 1 of 23
  2. Examining Processor Specifications
     The processor:
       • Computes the instructions in a program and calculates the result.
       • Should be used optimally by the application.
       • Affects application performance through its own performance.
       • Should have its performance measured to show how it is utilized.
  3. Identifying Processor Performance
     Processors consist of functional units that execute specific instructions. Different types of processors execute instructions at different speeds.
     Before beginning to optimize application performance, you need to:
       • Identify the processor speed
       • Identify the execution process
       • Identify the functional units of a processor
  4. Identifying Processor Performance (Contd.)
     Pipelining is an important concept used in high-performance computing.
     [Figure: pipelining. Instructions 1, 2, and 3 each pass through four stages (read the instruction, read the data, compute the instruction, write the result), plotted against clock cycles one to six.]
  5. Identifying Processor Performance (Contd.)
       • Pipelining has multiple stages.
       • Different parts of the pipeline perform different jobs.
       • Some parts of the pipeline can be duplicated so that less work is done at each stage.
       • Pipelining has a substantial impact on the performance of the application.
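The effect of pipelining on cycle counts can be sketched with a little arithmetic. The following is a minimal illustration, not part of the original slides: it assumes an ideal pipeline in which every stage takes exactly one clock cycle and a new instruction enters on each cycle. The function names and the `STAGES` list are hypothetical.

```python
# Assumption: an ideal pipeline, one clock cycle per stage, no stalls.
STAGES = ["read instruction", "read data", "compute", "write result"]


def sequential_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when each instruction finishes before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles needed when a new instruction enters the pipeline every cycle.

    The first instruction needs num_stages cycles; every later instruction
    completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def pipeline_schedule(num_instructions: int, stages: list) -> dict:
    """Map instruction number -> {clock cycle: stage name} (1-based)."""
    return {
        i + 1: {i + s + 1: stage for s, stage in enumerate(stages)}
        for i in range(num_instructions)
    }


if __name__ == "__main__":
    print("sequential:", sequential_cycles(3, len(STAGES)))
    print("pipelined: ", pipelined_cycles(3, len(STAGES)))
    for instr, cycles in pipeline_schedule(3, STAGES).items():
        print(f"instruction {instr}: {cycles}")
```

Under these assumptions, three four-stage instructions finish in six cycles, which matches the six clock cycles in the pipelining figure, against twelve cycles without overlap.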
  6. Identifying Processor Performance (Contd.)
     A process consists of different phases of processor and memory utilization. Processes follow this sequence:
       ► Phase 1: Memory burst
           • Read the instruction to be executed.
           • Read the data from the memory.
       ► Phase 2: CPU burst
           • During this time, the process is either running or waiting for the processor.
       ► Phase 3: Memory burst
           • During this time, the process is waiting for the memory write operation.
  7. Identifying Processor Performance (Contd.)
     Instructions for different applications are of diverse types. Typically, each application has multiple types of instructions. Different parts of the processor, called functional units, execute different types of instructions.
     Functional units are of the following types:
       • Memory operations
       • Integer operations
       • Floating-point operations
  8. Measuring Processor Performance
     Processor performance is measured in terms of the following parameters:
       ► Branch mispredictions: The branch executed is not the same as the branch predicted by the processor. In such a case, there is an additional overhead in loading the data values for the branch not executed by the processor.
       ► Loads/Stores complete: Loads refer to the process of loading data from the memory, and stores refer to writing data back to the memory, per unit time.
       ► Throughput: The number of processes that complete their execution per unit time.
       ► Turnaround time: The amount of time taken to execute a particular process. It is also called execution time.
       ► Instruction execution time: The execution time for an instruction.
       ► Program execution time: The execution time for a program. It is the sum total of the execution time for each instruction.
       ► Waiting time: The amount of time a process has been waiting in the ready queue.
       ► Response time: The amount of time taken to generate a response to a request.
       ► CPU utilization: The fraction of time a process is using the CPU.
       ► CPU efficiency: The fraction of time the CPU is processing instructions.
     The difference between CPU utilization and CPU efficiency is that CPU utilization is the fraction of time when the CPU is not idle, while CPU efficiency is the amount of time when the CPU is computing instructions.
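Several of these parameters can be computed from a simple schedule. The sketch below is an illustration only: it assumes a single CPU that runs processes first come, first served, measures elapsed time from time 0, and uses a hypothetical `Process` record with made-up arrival and burst times. None of these names come from the slides or from VTune.

```python
from dataclasses import dataclass


@dataclass
class Process:
    name: str
    arrival: float  # time the process enters the ready queue
    burst: float    # CPU time the process needs


def fcfs_metrics(processes):
    """Waiting time, turnaround time, throughput, and CPU utilization
    for a first-come, first-served schedule on one CPU (an assumption)."""
    if not processes:
        raise ValueError("at least one process is required")
    clock = 0.0  # current time
    busy = 0.0   # total time the CPU spent running processes
    waiting, turnaround = {}, {}
    for p in sorted(processes, key=lambda proc: proc.arrival):
        start = max(clock, p.arrival)  # the CPU may sit idle until arrival
        finish = start + p.burst
        waiting[p.name] = start - p.arrival      # time in the ready queue
        turnaround[p.name] = finish - p.arrival  # arrival to completion
        busy += p.burst
        clock = finish
    return {
        "waiting": waiting,
        "turnaround": turnaround,
        "throughput": len(processes) / clock,  # processes per unit time
        "cpu_utilization": busy / clock,       # fraction of time not idle
    }


if __name__ == "__main__":
    jobs = [Process("A", 0, 4), Process("B", 1, 3), Process("C", 10, 2)]
    for key, value in fcfs_metrics(jobs).items():
        print(key, value)
```

In this example the CPU is idle between time 7 and time 10 while it waits for process C, so the utilization comes out below 1.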
  9. Measuring Processor Performance (Contd.)
     Some standard metrics to measure processor performance are:
       ► Instructions retired: This metric measures the number of instructions that are retired during program execution. When the execution of an instruction is complete, the processor does not require the instruction any longer. When the processor discards these instructions, they are said to be retired.
       ► Clock Cycles Per Instruction Retired (CPI): CPI is the ratio of the number of clock cycles to the number of instructions retired. It is a measure of a processor's internal resource utilization. A high value indicates low resource utilization.
       ► Percentage of floating-point instructions: This metric reports the percentage of retired floating-point instructions. A high percentage of floating-point instructions indicates that the program is using only a specific resource while other resources are idle.
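The two ratio metrics above reduce to simple divisions. The following is a minimal sketch with hypothetical function names and made-up counter values. In practice the counts would come from a profiler's hardware event counters, which this example does not read.

```python
def cycles_per_instruction(clock_cycles: int, instructions_retired: int) -> float:
    """CPI: clock cycles divided by instructions retired."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return clock_cycles / instructions_retired


def floating_point_percentage(fp_retired: int, instructions_retired: int) -> float:
    """Share of retired instructions that were floating-point, in percent."""
    if instructions_retired <= 0:
        raise ValueError("instructions_retired must be positive")
    return 100.0 * fp_retired / instructions_retired


if __name__ == "__main__":
    # Made-up counter values, for illustration only.
    print(cycles_per_instruction(3_000_000, 2_000_000))
    print(floating_point_percentage(250_000, 1_000_000))
```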
  10. Just a minute
      How can you measure processor performance?
      Answer: Processor performance is measured in terms of the following parameters:
        • Branch mispredictions
        • Loads/Stores complete
        • Throughput
        • Turnaround time
        • Instruction execution time
        • Program execution time
        • Waiting time
        • Response time
        • CPU utilization
        • CPU efficiency
  11. Examining Memory Specifications
      The performance of a processor also depends on how fast data can be read from and written to the main memory. Memory speed is considerably slower than processor speed, and this difference affects application performance.
      Even on computers with greater processing power, processor speed alone has no substantial impact on application performance, because the processor still has to wait for memory. The solution is to minimize the mismatch between the processor and memory speeds.
      To optimize application performance, it is important to understand the memory hierarchy on a computer and the performance of the different components of the memory.
  12. Understanding the Memory Hierarchy
      The memory hierarchy on a computer system runs from faster and smaller levels at the top to slower and larger levels at the bottom:
        ► Registers: Registers speed up the execution of instructions by providing fast access to intermediate values computed during a calculation.
        ► Level 1 cache: This is the lowest level of cache memory, which is faster and smaller.
        ► Level 2 cache: It is larger in size but slower than the L1 cache.
        ► Main memory: It is slower and cheaper than cache memory but faster and more expensive than virtual memory.
        ► Virtual memory: The processor cannot directly access virtual memory. It is measured in megabytes. When data referenced by a virtual address is requested, the virtual address is translated to a main memory address.
      [Figure: Memory Hierarchy]
  13. Just a minute
      What is the purpose of cache memory?
      Answer: Cache memory reduces the mismatch in the speeds of the processor and the main memory.
  14. Understanding Memory Performance
      When executing an instruction, the processor waits for the data to be fetched from the memory. The processor cannot execute any other instruction while waiting because the previous instructions are loaded into registers.
      To achieve optimal performance, you must store the data as near as possible to the processor so that the processor is not idle. This reduces the time spent on memory access and improves processor utilization.
  15. Understanding Memory Performance (Contd.)
      You can calculate the time taken for memory access by knowing the hit and miss ratios.
        • The hit ratio is the ratio of the number of times the required data is available to the total number of times data is requested from memory.
        • The miss ratio is the ratio of the number of times the data is not found to the total number of times data is requested from memory.
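One common way to turn hit and miss ratios into an access time is a weighted average of the fast and slow access times. The slides do not give a formula, so the model below is an assumption, and the function names and timing values are hypothetical.

```python
def hit_and_miss_ratios(hits: int, total_requests: int):
    """Return (hit_ratio, miss_ratio) for a given number of requests."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= hits <= total_requests:
        raise ValueError("hits must be between 0 and total_requests")
    misses = total_requests - hits
    return hits / total_requests, misses / total_requests


def average_access_time(hit_ratio: float, hit_time_ns: float, miss_time_ns: float) -> float:
    """Assumed model: weighted average of the hit and miss access times.

    miss_time_ns is the full time to serve a request that misses,
    for example a fetch from a slower level of the hierarchy.
    """
    miss_ratio = 1.0 - hit_ratio
    return hit_ratio * hit_time_ns + miss_ratio * miss_time_ns


if __name__ == "__main__":
    hit_ratio, miss_ratio = hit_and_miss_ratios(hits=950, total_requests=1000)
    print(hit_ratio, miss_ratio)
    # Illustrative timings only: 2 ns on a hit, 100 ns on a miss.
    print(average_access_time(hit_ratio, hit_time_ns=2.0, miss_time_ns=100.0))
```

With these illustrative timings, a small miss ratio still dominates the average, which is why the slides stress keeping data close to the processor.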
  16. Understanding Memory Performance (Contd.)
      To improve the performance of memory, you should ensure that the data the processor requests is at the nearest location. For this, you must be able to predict which data the processor will reference. This can be accomplished using the principle of locality of reference.
      The two types of locality of reference are:
        ► Spatial locality: Memory locations near each other are usually used together. If a program accesses a particular memory location, it might soon access a nearby memory location. This behavior is called spatial locality.
        ► Temporal locality: If a program accesses a particular memory location, it might soon access the same memory location again. This behavior is called temporal locality.
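The benefit of locality can be seen by counting hits in a toy cache model. The sketch below is illustrative only: it assumes a small fully associative cache with least-recently-used replacement, 8 elements per cache line, 16 lines, and a 64 x 64 matrix stored row by row. These parameters and the function names are made up for the example and do not describe any particular processor.

```python
from collections import OrderedDict


def simulate_hit_ratio(addresses, line_size=8, num_lines=16):
    """Hit ratio of a toy fully associative LRU cache (assumed model)."""
    cache = OrderedDict()  # cache line number -> True, oldest entry first
    hits = 0
    total = 0
    for address in addresses:
        total += 1
        line = address // line_size
        if line in cache:
            hits += 1
            cache.move_to_end(line)        # mark as most recently used
        else:
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict the least recently used line
            cache[line] = True
    return hits / total if total else 0.0


N = 64  # matrix is N x N, stored row by row (row-major order)

# Row-wise traversal touches neighbouring addresses: good spatial locality.
row_wise = [r * N + c for r in range(N) for c in range(N)]

# Column-wise traversal jumps N elements on every access: poor locality.
column_wise = [r * N + c for c in range(N) for r in range(N)]

# Re-reading one location repeatedly: temporal locality.
same_location = [42] * 100

if __name__ == "__main__":
    print("row-wise:     ", simulate_hit_ratio(row_wise))
    print("column-wise:  ", simulate_hit_ratio(column_wise))
    print("same location:", simulate_hit_ratio(same_location))
```

In this model the row-wise traversal misses only once per cache line, while the column-wise traversal evicts every line before it is reused.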
  17. Analyzing Issues Affecting Memory Performance
      Some of the issues that affect memory performance are:
        ► Cache compulsory loads: When the required data is not found in the cache, it has to be loaded in the cache. This is known as a cache compulsory load. This occurs when the data is loaded for the first time in the cache.
        ► Cache capacity loads: At times, the cache has to remove recently used data to accommodate other data requested by the processor. This is because the capacity of the cache is limited.
        ► Cache conflict loads: Cache conflict loads occur if the processor accesses five or more units of data that use the same row. You can avoid cache conflict loads by changing memory alignment, using registers for holding data, or using algorithms that use fewer regions of memory.
        ► Cache efficiency: Cache efficiency is the ratio of data loaded into the cache to the data used.
        ► Data alignment: Data alignment is the organization of data in memory. Effective data alignment can improve cache efficiency.
        ► Software prefetch: Software prefetch enables a processor to load a specific location of memory before it is required for processing. As a result, the time taken for reads and writes is reduced by the amount of time that is saved while the data is being loaded in the cache.
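Cache conflict loads can be illustrated with a toy set-associative cache. The sketch below assumes a 4-way cache (consistent with the "five or more units" remark above, but still an assumption), 64 sets, 64-byte lines, and least-recently-used replacement within each set. The class and function names are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement (assumed parameters)."""

    def __init__(self, num_sets=64, ways=4, line_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        """Access a byte address; return True on a hit, False on a miss."""
        line = address // self.line_size
        index = line % self.num_sets   # which set (row) the line maps to
        tag = line // self.num_sets    # identifies the line within the set
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)      # mark as most recently used
            self.hits += 1
            return True
        if len(ways) >= self.ways:
            ways.popitem(last=False)   # evict the least recently used line
        ways[tag] = True
        self.misses += 1
        return False


def count_misses(addresses, rounds=10):
    """Replay the same address sequence several times and count misses."""
    cache = SetAssociativeCache()
    for _ in range(rounds):
        for address in addresses:
            cache.access(address)
    return cache.misses


STRIDE = 64 * 64  # num_sets * line_size: addresses this far apart share a set

four_conflicting = [i * STRIDE for i in range(4)]  # fits in the 4 ways
five_conflicting = [i * STRIDE for i in range(5)]  # one more than the ways
five_padded = [i * (STRIDE + 64) for i in range(5)]  # shifted into other sets

if __name__ == "__main__":
    print("4 blocks, same set:", count_misses(four_conflicting))
    print("5 blocks, same set:", count_misses(five_conflicting))
    print("5 blocks, padded:  ", count_misses(five_padded))
```

In this model, four blocks that share a set miss only on first use, five blocks that share a set miss on every access, and padding the layout by one cache line per block spreads the five blocks over different sets. The padding is a simple instance of avoiding conflicts by changing memory alignment.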
  18. Benchmarking
      A benchmark is a standard that is used for comparison. In terms of application performance, you can consider processor and memory benchmarks.
      To arrive at a specific benchmark, you can use tests to compare the performance of hardware and software running a specified workload. If you use graphic applications, a benchmark that tests graphics speed might be useful.
  19. Benchmarking (Contd.)
      The different types of benchmarks are:
        ► Single stream benchmarks: Single stream benchmarks measure the time taken by the computer to execute a collection of programs.
        ► Throughput benchmarks: Throughput benchmarks benchmark processor performance for several jobs or a mix of codes running simultaneously.
        ► Interactive benchmarks: Interactive benchmarks benchmark the components of a computer, such as the input/output system, operating system, and networks.
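A single stream benchmark can be sketched as a loop that times each program in a collection, one after another. The example below is a rough illustration with hypothetical workloads and function names. It times Python callables with `time.perf_counter`, so the numbers include interpreter overhead and are not comparable to published benchmark results.

```python
import time


def integer_work():
    """Hypothetical integer-heavy workload."""
    return sum(i * i for i in range(100_000))


def float_work():
    """Hypothetical floating-point-heavy workload."""
    return sum(i * 0.5 for i in range(100_000))


def single_stream_benchmark(programs, repeats=3):
    """Run each program alone, several times, and keep its best time in seconds."""
    results = {}
    for name, func in programs.items():
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            func()
            best = min(best, time.perf_counter() - start)
        results[name] = best
    return results


if __name__ == "__main__":
    timings = single_stream_benchmark({"integer": integer_work, "float": float_work})
    for name, seconds in timings.items():
        print(f"{name}: {seconds:.6f} s")
```

Keeping the best of several runs is one design choice for reducing noise from other activity on the machine. Reporting the mean or median would be equally reasonable.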
  20. Just a minute
      What are the various benchmarks for measuring processor performance?
      Answer: The different types of benchmarks are:
        • Single stream benchmarks
        • Throughput benchmarks
        • Interactive benchmarks
  21. Reading CPU Cycles to Measure Processor Performance
      The benchmarks for processor performance are:
        • Read Time Stamp Counter (RDTSC)
        • Million Instructions Per Second (MIPS)
        • Million Floating-Point Operations Per Second (MFLOPS)
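MIPS and MFLOPS are rates derived from an operation count and an elapsed time. The sketch below is an approximation under stated assumptions: Python cannot issue the RDTSC instruction directly, so `time.perf_counter_ns` stands in for the time stamp counter, and the measured loop includes interpreter overhead, so its result is far below what the hardware can deliver. All function names are hypothetical.

```python
import time


def mips(instruction_count: int, seconds: float) -> float:
    """Million instructions per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return instruction_count / (seconds * 1e6)


def mflops(floating_point_ops: int, seconds: float) -> float:
    """Million floating-point operations per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return floating_point_ops / (seconds * 1e6)


def measure_multiply_mflops(n: int = 200_000) -> float:
    """Time n floating-point multiplications and report the rate in MFLOPS.

    perf_counter_ns is a stand-in for reading the time stamp counter;
    the loop overhead of the interpreter is included in the measurement.
    """
    x = 1.0000001
    acc = 1.0
    start = time.perf_counter_ns()
    for _ in range(n):
        acc *= x
    elapsed_seconds = (time.perf_counter_ns() - start) / 1e9
    return mflops(n, elapsed_seconds)


if __name__ == "__main__":
    print(mips(50_000_000, 2.0))
    print(mflops(3_000_000, 1.5))
    print(measure_multiply_mflops())
```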
  22. Summary
      In this session, you learned that:
        • Application performance is closely related to hardware resources, such as processors and memory.
        • Processor speed is measured in clock cycles per second. This is an indication of the number of instructions executed in unit time.
        • Pipelining is an approach used for high-performance computing to obtain maximum processor output.
        • The execution process of an instruction consists of CPU and memory bursts.
        • A processor contains different functional units for executing memory, integer, and floating-point instructions.
  23. Summary (Contd.)
        • Processor performance can be measured in terms of branch mispredictions, loads/stores complete, throughput, turnaround time, instruction execution time, program execution time, waiting time, response time, CPU utilization, and CPU efficiency.
        • Computer memory consists of registers, cache memory, main memory, and virtual memory.
        • The performance of memory depends on the speed of the memory.
        • Cache compulsory loads, cache capacity loads, cache conflict loads, data alignment, and the software prefetch capability affect memory performance.
        • Performance benchmarking is the process of defining standards for application performance in terms of processors and memory.
