- 1. Software & Services Group
Developer Products Division Copyright© 2011, Intel Corporation. All rights reserved.
*Other brands and names are the property of their respective owners.
Features Of Modern Intel
Microprocessors
Prepared By:
Krunal P Siddhapathak (10BEC097)
- 2. Core and Multi-Core Processor
What is a Core?
A standard processor has one core (single-core). A single-core processor
executes only one instruction stream at a time. Internally it uses pipelines,
which keep several instructions in flight at different stages, but all of
them belong to the same stream.
What is a Multi-Core Processor?
A multi-core processor consists of two or more independent cores,
each capable of executing its own instruction stream. A dual-core processor
contains two cores, a quad-core processor contains four cores, and a
hexa-core processor contains six cores.
- 3. Need Of Multi-Core Processors
Multiple cores can run two programs side by side. When an
intensive program is running (an antivirus scan, video
conversion, CD ripping, etc.), another core remains available
to run your browser or check your email.
Multiple cores really shine when a program can use more than
one core at once (called parallelization), which improves the
program's throughput and responsiveness. Programs such as
graphics software and games can run multiple instruction
streams at the same time and deliver faster, smoother
results.
If you use CPU-intensive software, multiple cores will likely
provide a better computing experience. If you use your PC only to
check email and watch the occasional video, you do not really
need a multi-core processor.
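How much a program gains from extra cores depends on how much of its work can run in parallel. A rough way to estimate this is Amdahl's law, a standard model that is not part of the original slides. The sketch below assumes the parallel part scales perfectly across cores; the function name and the 90% figure are illustrative only.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when `parallel_fraction` of the run time scales across `cores`.

    Assumption: the parallel part divides evenly among the cores and the
    remaining serial part does not speed up at all.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical program in which 90% of the work can be parallelized.
    for cores in (1, 2, 4, 6):
        print(cores, "cores ->", round(amdahl_speedup(0.9, cores), 2), "x")
```

Under this model a program that is mostly serial gains little from a quad-core or hexa-core part, which matches the advice above for light email and video use.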
- 4. Core 2 Duo vs. Core i3 vs. Core i5
                     Core 2 Duo         Core i3      Core i5
Number of Threads    Two                Four         Four
Socket               775 (45/65 nm)     1156 (nm)    1156 (nm)
Compatible RAM       DDR2               DDR3         DDR3
Turbo Boost          No                 No           Yes
Overclocking         No                 Yes          No
- 5. Do I need an i3, i5, or i7?
As with all computer hardware, the type of processor you
need depends on your workload, how long you want your
computer to stay current, and your budget.
If you:
Browse the internet, check email, and play the occasional Flash game (like
Farmville): get a single-core netbook or desktop.
Do word processing and spreadsheets, listen to music often, and watch
movies: get an i3 (or any dual-core processor, e.g. a Core 2 Duo).
Play the occasional game and are happy with lower resolution and lower
quality graphics, or watch HD movies: get an i5. (This suggestion assumes
the graphics processor in a pre-built PC is well matched to the CPU.)
Do graphic publishing, music creation, or programming (and
compiling), watch HD movies, or like to play visually appealing games:
get a quad-core i5 or an i7.
Like to have the very best hardware and play the most graphically
intense games: get a quad-core or hexa-core i7 Extreme.
- 6. Intel Sandy Bridge Microarchitecture
Many of the bottlenecks of previous designs have been dealt with in the Sandy Bridge.
Instruction fetch and predecoding have been a serious bottleneck in Intel designs for
many years. In the NetBurst architecture they tried to fix this problem by caching
decoded µops, without much success. In the Sandy Bridge design, they are caching
instructions both before and after decoding. The limited size of the µop cache is
therefore less problematic, and the µop cache appears to be very efficient.
The limited number of register read ports has been a serious, and often neglected,
bottleneck since the old Pentium Pro. This bottleneck has now finally been removed in
the Sandy Bridge.
Previous Intel processors have only one memory read port where AMD processors have
two. This was a bottleneck in many math applications. The Sandy Bridge has two read
ports, whereby this bottleneck is removed.
The branch prediction has been improved by having bigger buffers and a shorter
misprediction penalty, but it has no loop predictor, and mispredictions are still
quite common.
The new AVX instruction set is an important improvement. The throughput of floating
point addition and multiplication is doubled when the new 256-bit YMM registers are
used. The new non-destructive three-operand instructions are quite convenient for
reducing register pressure and avoiding register move instructions. There is, however, a
serious performance penalty for mixing vector instructions with and without the VEX
prefix. This penalty is easily avoided if the programming guidelines are followed, but I
suspect that it will be a very common programming error in the future to inadvertently
mix VEX and non-VEX instructions, and such errors will be difficult to detect.
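The doubling of floating point throughput follows from the register width: a 256-bit YMM register holds twice as many elements as a 128-bit XMM register. The small calculation below is only an illustration of that arithmetic; the helper name `lanes` is an assumption and not part of any Intel tool.

```python
XMM_BITS = 128  # SSE register width
YMM_BITS = 256  # AVX register width


def lanes(register_bits, element_bits):
    """Number of elements of `element_bits` that fit in one vector register."""
    return register_bits // element_bits


if __name__ == "__main__":
    for name, element_bits in (("single precision", 32), ("double precision", 64)):
        print(name,
              "- XMM lanes:", lanes(XMM_BITS, element_bits),
              "- YMM lanes:", lanes(YMM_BITS, element_bits))
```

One YMM addition therefore processes eight single-precision or four double-precision values per instruction, against four and two for XMM.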
- 7. Intel Sandy Bridge Microarchitecture (Contd.)
Whenever the narrowest bottleneck is removed from a
system, the next less narrow bottleneck will become the
limiting factor. The new bottlenecks that require attention in
the Sandy Bridge are the following:
The µop cache: This cache can ideally hold up to 1536 µops. The effective utilization
will be much less in most cases. The programmer should pay attention to make sure
the most critical inner loops fit into the µop cache.
Instruction fetch and decoding: The fetch/decode rate has not been improved over
previous processors and is still a potential bottleneck for code that doesn’t fit into
the µop cache.
Data cache bank conflicts: The increased memory read bandwidth means that the
frequency of cache conflicts will increase. Cache bank conflicts are almost
unavoidable in programs that utilize the memory ports to their maximum capacity.
Branch prediction: While the branch history buffer and branch target buffers are
probably bigger than in previous designs, mispredictions are still quite common.
Sharing of resources between threads: Many of the critical resources are shared
between the two threads of a core when hyperthreading is on. It may be wise to turn
off hyperthreading when multiple threads depend on the same execution resources.
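As a rough aid for the first point in the list above, the sketch below checks whether an inner loop is likely to fit in the µop cache. The 1536 figure comes from the slide; the `effective_utilization` default of 0.5 is purely an assumed placeholder, since the slide only says the effective capacity is much lower than the ideal one.

```python
UOP_CACHE_CAPACITY = 1536  # ideal capacity in µops, as quoted above


def fits_in_uop_cache(loop_uops, effective_utilization=0.5):
    """Return True if a loop of `loop_uops` µops fits in the usable cache.

    `effective_utilization` is a hypothetical fraction of the ideal capacity
    that is actually usable; measure your own code rather than trusting it.
    """
    if not 0.0 < effective_utilization <= 1.0:
        raise ValueError("effective_utilization must be in (0, 1]")
    return loop_uops <= UOP_CACHE_CAPACITY * effective_utilization


if __name__ == "__main__":
    for uops in (200, 700, 1000, 1600):
        print(uops, "µops fit:", fits_in_uop_cache(uops))
```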
- 8. Intel Ivy Bridge Microarchitecture
Ivy Bridge is the codename for an Intel microprocessor using
the Sandy Bridge microarchitecture. The name is also applied
more broadly to the 22 nm die shrink of the microarchitecture
based on tri-gate ("3D") transistors, which is also used in the
future Ivy Bridge-EX and Ivy Bridge-EP microprocessors. Ivy
Bridge processors are backwards-compatible with the Sandy
Bridge platform, but might require a firmware update (vendor
specific). Intel has released new 7-series Panther Point
chipsets with integrated USB 3.0 to complement Ivy Bridge.
Volume production of Ivy Bridge chips began in the third
quarter of 2011. Quad-core and dual-core mobile models
launched on April 29, 2012 and May 31, 2012, respectively.
Core i3 desktop processors, as well as the first 22 nm Pentium,
were launched and available in the first week of September
2012.
- 9. Intel Ivy Bridge Microarchitecture (Contd.)
How much faster are the Ivy Bridge processors?
The base clock frequency of these processors ranges from 2.8
GHz (for Core i5-3450S) to 3.5 GHz (for Core i7-3770K).
What different types of the Ivy Bridge processors
are available?
There are many types of processors in the Ivy Bridge family. The
type is indicated by a suffix appended to the CPU model name. The
following list explains these suffixes:
K – Unlocked, ready to be overclocked.
S – Performance optimized. Low power consumption.
T – Power optimized. Ultra low power consumption.
M – Mobile processors for mobile devices.
Q – Quad core processors.
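The suffix list above can be expressed as a small lookup. This is a minimal sketch: the function name and the parsing rule (letters that follow the last run of digits in the model name) are assumptions made for illustration, and the meanings are copied from the list above.

```python
import re

# Suffix meanings as given in the list above.
SUFFIX_MEANINGS = {
    "K": "Unlocked, ready to be overclocked",
    "S": "Performance optimized, low power consumption",
    "T": "Power optimized, ultra low power consumption",
    "M": "Mobile processor for mobile devices",
    "Q": "Quad-core processor",
}


def describe_suffix(model_name):
    """Return the meaning of each suffix letter after the model number."""
    match = re.search(r"\d+([A-Z]*)$", model_name.strip())
    if match is None:
        return []
    return [SUFFIX_MEANINGS.get(letter, "unknown suffix")
            for letter in match.group(1)]


if __name__ == "__main__":
    for model in ("Core i5-3450S", "Core i7-3770K", "Core i7-3770"):
        print(model, "->", describe_suffix(model))
```

A model name without trailing letters yields an empty list, meaning the standard part.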
- 10. Intel Ivy Bridge Microarchitecture (Contd.)
Features present in Ivy Bridge:
HD Graphics – Ivy Bridge processors have a GPU built in.
The GPU supports DirectX 11 (Sandy Bridge supports version 10.1) and
OpenGL 3.1 (Sandy Bridge supports version 3.0). Ivy Bridge processors
have the Intel HD 4000 or HD 2500 GPU. This means that you do not
need an add-on graphics card.
Quick Sync Video – This feature, first introduced with Sandy Bridge, is
carried over to the Intel 3rd generation processors. It uses dedicated
media processing hardware to make video creation and conversion faster,
whether you want to create DVDs, create, convert and edit 2D/3D videos,
or upload to your favorite social networking sites.
WiDi 3.0 – Wireless Display technology allows you to stream media
content to a multitude of your Wi-Fi connected display devices. You can
share a 1080p 60FPS video using WiDi.
Turbo Boost Technology 2.0 – Turbo Boost lets Ivy Bridge processors
run faster than their base frequency when power and thermal headroom
allow. For example, a 3.5 GHz Core i7 can run at 3.9 GHz for some time.
- 11. Core 2 Duo vs. Core i3
The Core 2 Duo is Intel's veteran, covering a wide range of price and
performance sweet spots. It is now being replaced, however, by
Intel's rookie Core i3. So, is the Core i3 actually better than the Core
2 Duo, or can you hold off upgrading for a while longer?
The Core 2 Duo has been the processor of choice in laptops for about
three years. Over those three years the average speeds of Core 2
Duo processors have advanced significantly and many of today's
Core 2 Duo laptops have speeds of around 2.2 GHz or faster. Core 2
Duo processors have also been the go-to for many less expensive
desktop systems, with speeds reaching over 3 GHz.
However, there is a newcomer which is challenging the Core 2 Duo:
the Core i3. It is very similar to the Core 2 Duo in many
ways. Both are dual-core processors, and most Core 2 Duos and Core
i3s have similar clock speeds. However, the processors are based on
different architectures.
So, which one is better?
- 12. Core 2 Duo vs. Core i3 (Contd.)
Architecture
The Core 2 Duo processors are based on the Core 2 architecture.
The Core and Core 2 architectures were arguably Intel's most
successful architectures, as they replaced the Pentium 4
processors in desktop systems and made Intel competitive in that
space once again.
The Core i3 is based on a new architecture called Nehalem. The
Nehalem architecture has numerous advantages over the Core 2
architecture. Nehalem is better constructed for quad-core
processors, has hyper-threading available, and can use a feature
called Turbo Boost which maximizes processor speed. However,
because the Core i3 is the low-end Nehalem variant, most of
these features are disabled or not relevant - the Core i3 is a dual
core processor and Turbo Boost is disabled, but hyper-threading
is enabled.
- 13. Core 2 Duo vs. Core i3 (Contd.)
Processor Performance
The Core i3 is the slowest variant of the Nehalem-based processors. The
Core 2 Duo processors, however, don't have the same differentiation
between versions of the same architecture. The fastest Core 2 Duo
desktop processor has a speed of 3.33 GHz, while the fastest Core i3
desktop processor is clocked at 3.06 GHz.
You might therefore expect that the Core 2 Duo would have the edge -
particularly when you consider that the Core 2 Duo costs almost three
times as much if you buy it individually - but in fact the Core i3 is faster,
and often by no small margin. The Core i3 is faster even in single-
threaded applications, but the performance gap really widens in multi-
threaded applications. This is because the Core i3 has hyper-threading,
which turns the two real cores into four virtual cores. Windows works with
the Core i3 as if it were a quad-core processor.
These results remain true in the mobile space, as well. Core i3 processors
punch at least 500 MHz above their weight in single-thread applications,
and are virtually always faster in multi-threaded applications, no matter
the clock speeds of the Core 2 Duo and Core i3 processors you are
comparing.
- 14. Core 2 Duo vs. Core i3 (Contd.)
Power Usage and Heat
A look at the technical specifications of the Core i3 processors initially puts
them in a negative light when it comes to power consumption. The desktop Core i3
parts are listed as having a 73 Watt TDP, while most Core 2 Duo desktop parts have a
65 Watt TDP. In laptops the Core i3 has a 35 Watt TDP, while Core 2 Duo mobile
processors usually have a 25 Watt TDP.
These differences pan out about how you'd expect them to when it comes to
absolute power consumption. The Core i3 processors do consume slightly more
power than Core 2 Duo processors at load and at idle. We're talking a difference of
around 10 Watts on desktops and a few Watts on laptops - nothing huge, but a difference
nonetheless.
However, when it comes to power efficiency the answer becomes less clear. In order
for a processor to be power efficient, it needs to not only have low power
consumption but also the ability to complete tasks quickly. This lowers the overall
"task energy" because a faster processor will be done with a task before a slower
processor, and once done it will slip back into an idle state.
When viewed from this perspective, the Core i3 is much more efficient than the Core
2 Duo on both the desktop and the laptop. This means that the Core i3 will probably
not use any more power than a Core 2 Duo - and may actually use less - unless your
usage patterns place a constant load on your processor.
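The "task energy" argument can be checked with simple arithmetic: energy is power multiplied by time, summed over the busy and idle parts of a fixed window. The sketch below is illustrative only; every wattage and duration in it is a made-up number chosen to mirror the qualitative claim above, not a measurement of any real processor.

```python
def task_energy_joules(load_watts, idle_watts, busy_seconds, window_seconds):
    """Energy used over a fixed window: busy at load power, then idle."""
    if busy_seconds > window_seconds:
        raise ValueError("the task does not finish inside the window")
    idle_seconds = window_seconds - busy_seconds
    return load_watts * busy_seconds + idle_watts * idle_seconds


if __name__ == "__main__":
    WINDOW = 100  # seconds; hypothetical observation window

    # Hypothetical faster chip: higher load and idle power, finishes sooner.
    faster = task_energy_joules(load_watts=75, idle_watts=10,
                                busy_seconds=60, window_seconds=WINDOW)
    # Hypothetical slower chip: lower power, but busy for longer.
    slower = task_energy_joules(load_watts=65, idle_watts=8,
                                busy_seconds=90, window_seconds=WINDOW)
    print("faster chip:", faster, "J")
    print("slower chip:", slower, "J")
```

With these assumed figures the faster chip uses less energy overall even though it draws more power at every instant. Under a constant load there is no idle period to return to, so that advantage disappears.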
- 15. Various Core Processors Of Intel
Core i3 Series
Intel's Core i3 processor line has always been a budget option. These
processors remain dual-core, unlike the rest of the Core line, which is
made up of quad core processors. Intel's Core i3 processors also have
many features restricted.
The main feature that is withheld from the Core i3 processors is Turbo Boost,
the dynamic overclocking available on most Intel processors. This,
along with the dual-core design, accounts for most of the performance
difference between Core i3 processors and the i5 and i7 options.
One feature that Core i3 has - and i5 doesn't - is hyper-threading. This is
Intel's logical-core duplication technology, which allows each physical core to
be used as two logical cores. The result of this is that Windows will display a
dual-core Core i3 processor as if it were a quad-core.
Finally, Core i3 processors have their integrated graphics processor (IGP)
restricted to a maximum clock speed of 1100 MHz, and all Core i3
processors have the 2000 series IGP, which is restricted to 6 execution
units. This will result in slightly lower IGP performance overall, but the
difference is frankly inconsequential in many situations.
- 16. Various Core Processors Of Intel (Contd.)
Core i5 Series
Intel used to split the Core i5 processor brand into two different
lines, one of which was dual-core and one of which was quad-
core.
All Sandy Bridge Core i5 processors are quad-core processors;
they all have Turbo Boost, and they all lack Hyper-Threading.
Most of the Core i5 processors, apart from the K series (explained
later), use the same 2000 series IGP with a maximum clock speed
of 1100 MHz and six execution units.
In the i3 vs. i5 vs. i7 battle, the Core i5 processor is now
obviously the main-stream option no matter which product you
buy. The only substantial difference between the Core i5 options
is the clock speed, which ranges from 2.8 GHz to 3.3 GHz.
Obviously, the products with a quicker clock speed are more
expensive than those that are slower.
- 17. Various Core Processors Of Intel (Contd.)
Core i7 Series
These processors are virtually identical to the Core i5. They have a 100
MHz higher base clock speed, which is inconsequential in most situations.
The real feature difference is the addition of hyper-threading on the Core
i7, which means that the processor will appear as an 8-core processor in
Windows. This improves threaded performance and can result in a
substantial boost if you're using a program that is able to take advantage
of 8 threads.
Of course, most programs can't take advantage of 8 threads. Those that
can are usually meant for enterprise or advanced video editing
applications - 3D rendering programs, photo editing programs, and
scientific programs are categories of software frequently designed to use
8 threads. The average user is unlikely to see the full benefit of the
hyper-threading feature. In the Core i3 vs. i5 vs. i7 battle, the i7 has limited
appeal.
The IGP on Core i7 processors can also reach a higher maximum clock
speed of 1350 MHz; however, this difference is largely
inconsequential when measuring real-world performance.
- 18. Various Core Processors Of Intel (Contd.)
The K series processor
Late in the lifespan of Intel's previous Core i branded products,
Intel introduced the "K" series. These processors had unlocked
multipliers, making them easier to overclock.
Intel has kept this line of products alive with the new Sandy
Bridge architecture by introducing a K series Core i5 and i7
processor. As before, these processors have unlocked multipliers.
However, they also have a new feature - better integrated
graphics processors.
This comes in the form of the 3000 series IGP, which has 12
execution units instead of 6. The maximum clock speed remains
limited by the processor brand - the Core i5 K is limited to 1100
MHz, while the Core i7 K can reach 1350 MHz. The additional
execution units can result in better performance in games,
although, to be honest, the IGP isn't remotely cut out for desktop
gaming.
- 19. Various Core Processors Of Intel (Contd.)
The IGP Features: Sandy Bridge
The most important new feature added to Intel's Sandy Bridge
processors is the inclusion of an IGP on the processor. Intel did this
before with Core i3 and some Core i5 processors, but the IGP was still
separate from the processor itself - the IGP and CPU were placed in the
same package, but didn't physically work together.
Now Intel has taken the IGP integration a step further and worked the IGP
into the CPU architecture. It even shares cache with the processor. What
this means, in practical terms, is that the on-board graphics of Intel's new
processors are superior to anything they've offered before. It also enables
Quick Sync, a video transcoding feature that provides blazing
performance when converting videos to a different format.
Intel is offering two different types of IGPs on its processors. The 2000
has 6 execution units, while the 3000 has 12 execution units. Obviously,
the latter is quicker. Intel hasn't tied the IGP that you receive to the type
of processor you choose, however. Instead, it has tied the 3000 series
IGP to the "K" series processors. If you see a "K" at the end of the
processor's name, it has the 3000 series IGP. So far, Intel doesn't offer a
Core i3 K series processor, but that could change in the future.
- 20. Various Core Processors Of Intel (Contd.)
Laying Out the Chipset
The staggered release of Intel's previous Core i3/i5/i7 products also
resulted in a staggered release of processor sockets and their related
chipsets. First came the LGA 1366 processor socket, which was tied to some
Core i7 processors. Then Intel confused things by releasing the LGA 1156
socket, which was made available on several different chipsets and
processor types. Choosing the right socket and chipset for a processor
wasn't easy.
Intel has now clarified matters by releasing a single processor socket and
two processor chipsets alongside Sandy Bridge. The new socket is LGA
1155, and it isn't backwards compatible with anything Intel has previously
offered. The new chipsets are P67 and H67, with the P variant being
performance-oriented and the H variant targeted at general use. The main
difference is that P67 allows for processor overclocking, while H67 does
not. P67 also offers 16 additional PCIe lanes. Both Core i3 and i5
processors are compatible with either chipset.
- 21. Core i5 vs. Core i7
Core i5: The New Middle Class
While the hardware has changed, Intel's branding scheme remains the
same, and Core i5 remains Intel's primary mid-range processor. It is
targeted at the heart of the market, with pricing that is not at budget
levels but still affordable, and performance that is extremely quick but not
the fastest Intel offers.
Intel's high-end processor line is the Core i7. Many users who are looking
for a high-performance part end up considering both i5 and i7 products.
A Unified Socket and Chipset
Perhaps the best news to come out of Intel's new line of i5 and i7
processors is the introduction of a single socket for all Sandy Bridge Core
i3/i5/i7 processors. For now, the Sandy Bridge processors all
use the LGA 1155 socket. In case you're wondering, this socket is not
backwards compatible with previous LGA 1156 processors.
- 22. Core i5 vs. Core i7 (Contd.)
Intel Turbo Boost
Intel has made Turbo Boost a standard feature on all Core i5 and
i7 processors, from the least to most expensive. Intel has also
reduced the gap between the maximum turbo boost frequencies
on different processors. Previously, some of the older Core i7
processors actually had a much less efficient Turbo Boost feature
than some newer Core i5s.
All of Intel's current Core i5 and i7 processors offer a boost of
between 300 and 400 MHz. The least expensive i5s offer the 300
MHz boost - for example, the Core i5 2300 has a base clock
speed of 2.8 GHz and a maximum Turbo Boost speed of 3.1 GHz.
The Intel Core i7 2600, on the other hand, offers a base clock
speed of 3.4 GHz and a maximum Turbo Boost of 3.8 GHz.
Besides the clock speed difference, Turbo Boost is essentially the
same on the i5 and i7 processors.
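To put the boost figures above in proportion, the short calculation below expresses each boost in MHz and as a percentage of the base clock. The clock values are the ones quoted on this slide; the function name is only illustrative.

```python
def turbo_gain(base_mhz, turbo_mhz):
    """Return the Turbo Boost gain as (MHz, percent of base clock)."""
    gain_mhz = turbo_mhz - base_mhz
    gain_percent = 100.0 * gain_mhz / base_mhz
    return gain_mhz, gain_percent


if __name__ == "__main__":
    # (base MHz, maximum turbo MHz) as quoted above.
    examples = {
        "Core i5 2300": (2800, 3100),
        "Core i7 2600": (3400, 3800),
    }
    for name, (base, turbo) in examples.items():
        mhz, percent = turbo_gain(base, turbo)
        print(name, "gains", mhz, "MHz, about", round(percent, 1), "percent")
```

Both parts gain roughly a tenth of their base clock, which supports the point that Turbo Boost behaves essentially the same on the i5 and the i7.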
- 23. Core i5 vs. Core i7 (Contd.)
Difference in Hyper-Threading
Another significant performance difference is how the Core i7 and
Core i5 products handle hyper-threading. Hyper-threading
is a technology used by Intel to present more logical cores
than physically exist on the processor. While Core i7 products have
all been quad-cores, they appear in Windows as having eight
cores. This further improves performance when using programs
that make good use of multi-threading.
All Sandy Bridge Core i5 processors have hyper-threading
disabled, and all Sandy Bridge Core i7 processors have hyper-
threading enabled. This is a major feature difference of Core i5
vs. Core i7 processors, and it will give the Core i7 products an
advantage over Core i5 processors in some heavily multi-
threaded applications.
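The core counts that Windows reports follow directly from the number of physical cores and whether hyper-threading is enabled. The sketch below reproduces that arithmetic for the Sandy Bridge line-up described in these slides; the helper name is an assumption. The last line uses Python's standard `os.cpu_count()`, which returns the number of logical processors the operating system exposes (or `None` if it cannot be determined).

```python
import os


def logical_processors(physical_cores, hyper_threading):
    """Logical processors the OS sees: two per core with hyper-threading."""
    threads_per_core = 2 if hyper_threading else 1
    return physical_cores * threads_per_core


if __name__ == "__main__":
    # (physical cores, hyper-threading enabled) as described in the slides.
    lineup = {
        "Core i3": (2, True),
        "Core i5": (4, False),
        "Core i7": (4, True),
    }
    for name, (cores, ht) in lineup.items():
        print(name, "appears as", logical_processors(cores, ht), "logical processors")

    print("This machine reports", os.cpu_count(), "logical processors")
```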
- 24. Core i5 vs. Core i7 (Contd.)
The New IGP
All of Intel's Sandy Bridge processors make use of a new
IGP that is part of the processor architecture.
While far from a gaming-grade video solution, the
IGP offers reasonable performance without
consuming much power. It also enables features like Quick
Sync, which can transcode video extremely quickly.
There are two versions of this IGP: the 2000 and the 3000.
The only difference between the two is the number of
execution units. The 2000 has 6, while the 3000 has 12.
This doesn't mean the 3000 is twice as quick, but it does
mean the 3000 is about 50% quicker in most
benchmarks.
- 25. Core i5 vs. Core i7 (Contd.)
i5 vs. i7: What it means to Consumers and Power Users
Currently, the Core i5 processor brand makes up most of Intel's Sandy
Bridge processor line. The prices of these processors range from $177 to
$216 with base clock speeds between 2.8 GHz and 3.3 GHz. Intel only
offers two Core i7 products, the Core i7-2600 and Core i7-2600K, both of
which have a 3.4 GHz base clock speed. The i7-2600 has a price tag of
$294.
As you may have guessed, paying about $80 more for the 100 MHz clock
speed increase between the fastest i5 and the i7 isn't a great deal. The
main reason to pay this additional cash for an i7 is hyper-threading, but
this advantage will only be evident if you frequently use programs that
can actually make use of 8 threads.
For most users, the i5 is clearly the better deal. The i5-2500 makes the
most sense in my opinion, as it offers an extremely quick base clock
speed of 3.3 GHz for about $200. Of course, the value of this is subject to
change in the future as Intel fleshes out its product line with new models.
- 26. Hyper Threading
Hyper-Threading Technology brings the concept of simultaneous multi-threading to
the Intel Architecture. Hyper-Threading Technology makes a single physical
processor appear as two logical processors. The physical execution resources are
shared and the architecture state is duplicated for the two logical processors. From
a software or architecture perspective, this means operating systems and user
programs can schedule processes or threads to logical processors as they would on
multiple physical processors. From a microarchitecture perspective, this means
that instructions from both logical processors will persist and execute
simultaneously on shared execution resources.
The amazing growth of the Internet and telecommunications is powered by ever-
faster systems demanding increasingly higher levels of processor performance. To
keep up with this demand we cannot rely entirely on traditional approaches to
processor design. Microarchitecture techniques used to achieve past processor
performance improvement–super-pipelining, branch prediction, super-scalar
execution, out-of-order execution, caches–have made microprocessors
increasingly more complex, have more transistors, and consume more power. In
fact, transistor counts and power are increasing at rates greater than processor
performance. Processor architects are therefore looking for ways to improve
performance at a greater rate than transistor counts and power dissipation. Intel’s
Hyper-Threading Technology is one solution.
- 27. Hyper Threading (Contd.)
A look at today’s software trends reveals that server applications consist of
multiple threads or processes that can be executed in parallel. On-line
transaction processing and Web services have an abundance of software
threads that can be executed simultaneously for faster performance. Even
desktop applications are becoming increasingly parallel. Intel architects have
been trying to leverage this so-called thread-level parallelism (TLP) to gain a
better performance vs. transistor count and power ratio.
In both the high-end and mid-range server markets, multiprocessors have
been commonly used to get more performance from the system. By adding
more processors, applications potentially get substantial performance
improvement by executing multiple threads on multiple processors at the
same time. These threads might be from the same application, from different
applications running simultaneously, from operating system services, or from
operating system threads doing background maintenance. Multiprocessor
systems have been used for many years, and high-end programmers are
familiar with the techniques to exploit multiprocessors for higher performance
levels.
- 28. Hyper Threading (Contd.)
In recent years a number of other techniques to further exploit TLP have been
discussed and some products have been announced. One of these techniques is
chip multiprocessing (CMP), where two processors are put on a single die. The two
processors each have a full set of execution and architectural resources. The
processors may or may not share a large on-chip cache. CMP is largely orthogonal
to conventional multiprocessor systems, as you can have multiple CMP processors
in a multiprocessor configuration. Recently announced processors incorporate two
processors on each die. However, a CMP chip is significantly larger than the size of
a single-core chip and therefore more expensive to manufacture; moreover, it
does not begin to address the die size and power considerations.
Another approach is to allow a single processor to execute multiple threads by
switching between them. Time-slice multithreading is where the processor
switches between software threads after a fixed time period. Time-slice
multithreading can result in wasted execution slots but can effectively minimize
the effects of long latencies to memory. Switch-on-event multithreading would
switch threads on long latency events such as cache misses. This approach can
work well for server applications that have large numbers of cache misses and
where the two threads are executing similar tasks. However, both the time-slice
and the switch-on-event multithreading techniques do not achieve optimal
overlap of many sources of inefficient resource usage, such as branch
mispredictions, instruction dependencies, etc.
- 29. Software & Services Group
Hyper Threading(Contd.)
Finally, there is simultaneous multi-threading, where multiple threads can
execute on a single processor without switching. The threads execute
simultaneously and make much better use of the resources. This approach
makes the most effective use of processor resources: it maximizes the
performance vs. transistor count and power consumption. Hyper-Threading
Technology brings the simultaneous multi-threading approach to the Intel
architecture. In this paper we discuss the architecture and the first
implementation of Hyper-Threading Technology on the Intel Xeon processor
family.
Hyper-Threading Technology makes a single physical processor appear as
multiple logical processors. To do this, there is one copy of the architecture
state for each logical processor, and the logical processors share a single set
of physical execution resources. From a software or architecture perspective,
this means operating systems and user programs can schedule processes or
threads to logical processors as they would on conventional physical
processors in a multiprocessor system. From a microarchitecture perspective,
this means that instructions from logical processors will persist and execute
simultaneously on shared execution resources.
- 30. Software & Services Group
Hyper Threading(Contd.)
There are a few elements of a CPU that need to be understood in order to
understand Hyper-Threading Technology:
Registers - Registers are basically circuits that each hold a single 64-bit value and are
the fastest form of storage available on a computer. The x86 architecture provides a
number of General Purpose Registers that are used by an executing program. In a
multicore chip, registers are unique to each core, so if you have a quad-core
processor, there will be 4 sets of general purpose registers.
Cache - Cache is essentially a form of storage that falls between registers and RAM
in terms of speed. In modern processors there are generally three levels, and in the
case of the i7, Levels 1 and 2 are private to each core and Level 3 is shared by all the
cores on a chip. The most important thing to know is that accessing the cache is
slower than accessing registers but still faster than accessing RAM.
Execution Unit - This is the section of the CPU responsible for actually executing
the instructions. If you tell the computer to add 2 + 3, this is the part in which that
operation is performed.
Front-End - This is a unit of the processor that is also known as Instruction
Fetch/Decode. Essentially, this unit grabs instructions from either cache or RAM
and decodes them into a form that the execution unit can understand.
Branch Predictor - This unit attempts to predict branches in program code. If
there is an “if-then” statement in a program, it guesses which statements will be
executed and prefetches them for the front-end.
- 31. Software & Services Group
Hyper Threading(Contd.)
In a core with HT, the registers are all duplicated. This means that one core
will have 2 sets of registers, and each set is what the operating system will see as a
“logical core”, since the sum of the registers represents the processor’s state.
We’ll call these sets A and B. Even though the core appears as two cores, the two
sets still share the same cache, branch predictor, front-end, and, most
importantly, execution unit. Because they share so many resources, in this
simplified view only one thread executes at a time. The advantage of adding the HT
logic is that if a thread is executing and stalls for any reason, the other
thread can be switched in very quickly while the cause of the stall in the first
thread is addressed. To better illustrate how this works, consider the
following:
Set A is considered the current state of the processor.
Thread A starts executing.
Thread A needs a value from memory that isn’t in the cache.
Memory access is very time consuming in CPU terms, so thread A is considered
stalled.
Instead of wasting cycles waiting for the memory operation to complete, set B is
considered the current state.
Thread B now executes until it stalls or until thread A can execute again (its memory
operation finishes).
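The steps above can be sketched as a toy simulation. This is only an illustrative model of the simplified description on this slide, not of real Hyper-Threading hardware: the names `LogicalThread`, `run`, and `MISS_LATENCY`, the trace format, and the 3-cycle miss penalty are all assumptions made for the example.

```python
from dataclasses import dataclass

MISS_LATENCY = 3  # assumed stall length in cycles after a cache miss


@dataclass
class LogicalThread:
    """One register set (A or B) plus a hypothetical instruction trace."""
    name: str
    trace: list             # "op" = ordinary instruction, "miss" = load that misses the cache
    pc: int = 0             # index of the next instruction to execute
    stalled_until: int = 0  # first cycle in which the thread may execute again


def run(threads):
    """Return, cycle by cycle, which thread used the execution unit ('-' = idle)."""
    log = []
    cycle = 0
    while any(t.pc < len(t.trace) for t in threads):
        # Pick the first thread (in list order) that is unfinished and not stalled.
        ready = [t for t in threads
                 if t.pc < len(t.trace) and t.stalled_until <= cycle]
        if ready:
            t = ready[0]
            if t.trace[t.pc] == "miss":
                # The thread issues the load, then stalls until memory responds.
                t.stalled_until = cycle + 1 + MISS_LATENCY
            t.pc += 1
            log.append(t.name)
        else:
            log.append("-")  # every unfinished thread is stalled: wasted cycle
        cycle += 1
    return log


if __name__ == "__main__":
    a = LogicalThread("A", ["op", "miss", "op", "op"])
    b = LogicalThread("B", ["op", "op", "op", "op"])
    print("".join(run([a, b])))
```

Running thread A alone under the same assumptions leaves the execution unit idle while the miss is outstanding; with thread B available, those cycles are filled.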
- 32. Software & Services Group
Hyper Threading(Contd.)
This process continues constantly. Now,
there should be an obvious question: What can cause a
thread to stall? There are a few things; the simplest one to
understand is a cache miss. This is when the thread goes to
access a value that isn’t currently in the cache or any of the
registers. A branch misprediction can also occur, when the
branch predictor prefetches the wrong instructions into the
cache.
There is another time Hyper-Threading kicks in, and that is if
one thread is using Floating-Point resources while the second is
only using Integer resources. HT will allow them both to
execute simultaneously as long as they don't conflict.
- 33. Software & Services Group
Does hyper threading actually help?
Hyper-Threading has some interesting performance characteristics as a
result of its nature. HT will provide close to zero advantage if instruction
decoding or execution is the limiting factor in performance. In the
Nehalem architecture this is rarely the case. HT performs ideally when
there are a lot of cache misses or branch mispredictions, since the
execution unit would otherwise be idle waiting for these issues to be
resolved.
Basically, certain applications will benefit more than others. A
more parallel workload such as rendering or encoding video will see a
nice benefit from HT since it’s likely both threads will be accessing the
same data, so they aren’t really competing for cache. Additionally, the
relatively small amount of local L2 cache in the i7 (256 KB) means there
will be a decent amount of memory access, giving the second thread time
to execute. HT can also result in a more responsive machine when not much
is going on, since threads will have very low execution time and it’s much
faster for the CPU to switch the active register set than to grab another
thread from RAM and load it into the registers.
- 34. Software & Services Group
Are there drawbacks?
As with most engineering decisions, there are drawbacks to
HT. One of the more obvious ones is that since HT keeps the
execution unit fed more efficiently, the unit spends less time idle,
which can result in higher operating temperatures. More idle
time would give the CPU a chance to cool down before
the next execution burst and would result in a lower maximum
temperature.
There are also programs that will either see no benefit
from HT or see decreased performance. Typically,
something whose performance is limited by cache, instruction
decode, the execution unit, or memory access will see little or
even negative improvement from HT (one of the reasons the i7 has
so much memory bandwidth).
- 35. Software & Services Group
Are there drawbacks?(Contd.)
Running more than one multithreaded, computationally intensive task at
a time can also be a situation where HT doesn’t help performance. If a
processor core is running threads that come from different programs or that
operate on different data, all of the shared resources are effectively
halved (data cache, branch prediction, instruction cache). This means
branch mispredictions and cache misses become even more common,
possibly to the point where both threads are stalled. Depending on the
specific program, this can mean either lower performance (compared to
HT being disabled) or worse scaling than expected.
The last drawback is probably the most important one: The benefit of HT
is inconsistent and dependent upon the specific operating environment
and programs being run. Because of the way it works, code that is
heavily optimized is likely to show less benefit, as it would be designed to
reduce branch mispredictions and cache misses. The inconsistency of HT
while multitasking won’t show up in benchmarks since they’re designed
to test only a single task at a time.
- 36. Software & Services Group
Is Hyper-Threading Technology worth using?
If you do a lot of 3D rendering or video
transcoding, then it probably is, since this is the
workload HT is best suited for. If you find that you
generally run multiple intensive tasks
simultaneously (like playing a game while encoding
a video or recompiling the Linux kernel in a VM),
then HT could have a negative impact on overall
performance (though not necessarily). One thing
that is for sure is that its impact is exaggerated in
synthetic benchmarks, almost to the point where it
becomes misleading.
- 37. Software & Services Group
Virtualization
Server virtualization:
Huge data centers contain a large number of servers. Workload,
user activity, and other factors decide which servers are used and
when. Even for servers that are not used to their full
capacity, companies still spend money, energy, and
resources keeping them updated and protecting them from
crashes and overheating. Server virtualization
consolidates those physical servers onto fewer, more
powerful, and more energy-efficient servers, each of which runs
virtual machines (VMs) that imitate
multiple servers on the network. The virtual server environment is
transparent on the network, so each user can interact with the virtual
servers as if they were still multiple physical servers. The main
advantage is that only a few energy-efficient servers need to be
maintained instead of many, which saves
resources, energy, and money.
- 38. Software & Services Group
Virtualization(Contd.)
As shown in the figure, in the traditional architecture the hardware runs a
single operating system, and the different applications run on that
operating system.
Because this arrangement is not energy efficient, a virtual
environment was developed that lets different operating systems run on
a single machine.
- 39. Software & Services Group
Virtual Machine
A virtual machine monitor (VMM) is a host program that allows a
single computer to support multiple, identical execution
environments. All the users see their systems as self-contained
computers isolated from other users, even though every user is
served by the same machine. In this context, a virtual machine is an
operating system (OS) that is managed by an underlying control
program. For example, IBM's VM/ESA can control multiple virtual
machines on an IBM S/390 system.
Server virtualization is done to reduce energy costs and to simplify
manageability and disaster management.
In server virtualization, VMM software is added so that the
hardware can run more than one OS.
Major components of the server:
Processor
Chipset
Network interface
- 40. Software & Services Group
Virtual Machine(Contd.)
The individual technologies that make up Intel VT are built into these
components to boost performance, reliability, and
flexibility.
Intel VT supports virtual machine architectures comprised of two
principal classes of software:
Virtual-Machine Monitor (VMM): A VMM acts as a host and has full control
of the processor(s) and other platform hardware. VMM presents guest
software (see below) with an abstraction of a virtual processor and allows
it to execute directly on a logical processor. A VMM is able to retain
selective control of processor resources, physical memory, interrupt
management, and I/O.
Guest Software: Each virtual machine is a guest software environment
that supports a stack consisting of an operating system (OS) and
application software. Each operates independently of other virtual
machines and uses the same interface to processor(s), memory, storage,
graphics, and I/O provided by a physical platform. The software stack
acts as if it were running on a platform with no VMM. Software executing
in a virtual machine must operate with reduced privilege so that the VMM
can retain control of platform resources.
- 41. Software & Services Group
Intel Virtualization Technology-Flex Migration
(Intel VT-X)
Obviously, as IT adds new systems, it would be much more convenient and
efficient if an IT manager could simply add new resources to existing pools without
having to worry about differences in processor generation. For this reason, Intel
has developed Intel VT Flex Migration. When combined with support from
virtualization software, it ensures that the hypervisor can expose a consistent set
of instructions across all servers in the pool. Intel VT Flex Migration support starts
with Intel® Core™ microarchitecture and will be available in future generations of
the Intel Xeon processor family.
With Intel VT Flex Migration, IT managers can easily add current and future Intel
Xeon processor-based systems to the same resource pool when using supporting
hypervisor software. This gives IT the power to choose the right server platform
when it is needed to optimize performance, cost, power, and reliability, without
having to worry about forward and backward compatibility across generations of
Intel Xeon processor-based servers starting with Intel Core microarchitecture and
extending into future generations of Intel Xeon processors. IT managers can pool
server resources using multiple generations of Intel Xeon processors whether they
are single, dual- or multi-processor based. This creates a dynamic virtual server
infrastructure that enables the use of live VM migration to improve usage models
such as failover, load balancing, disaster recovery, and server maintenance.
- 42. Software & Services Group
Intel VT-X(Contd.)
Current Intel® Xeon® 5400 and 5200 processor series, 3300 and 3100 processor
series, as well as future Intel Xeon processors, support Intel VT Flex Migration.
Using virtualization software that is enabled to take advantage of this feature,
Intel servers based on these processors can be pooled with earlier generations of
Intel Core microarchitecture processors. These include Intel® Xeon® 7300, 5300,
5100, 3200, and 3000 series processors. A major component of Intel VT-x is Intel VT
Flex Migration. By using this technology, we are able to migrate an application
from one server to another and recover from a disaster.
With Intel VT Flex Migration, one can migrate between two processor generations
and so react quickly to changing conditions, making it much easier to keep
servers up and running.
- 43. Software & Services Group
Flex Priority
Intel VT Flex Priority optimizes and accelerates interrupt virtualization by
improving virtual machine access to the Task Priority Register thereby
enabling efficient Symmetric Multi-Processing (SMP) configurations of 32-bit
guest operating systems. For users, this translates into more efficient
performance in virtual environments for their critical enterprise applications.
Intel VT Flex Priority was designed to accelerate virtualization interrupt
handling thereby improving virtualization performance. Intel VT Flex Priority
accelerates interrupt handling by preventing unnecessary VMExits on
accesses to the Advanced Programmable Interrupt Controller.
Intel VT Flex Priority improves virtualization performance by 35%.
A processor is constantly bombarded with interrupts, and not all of them are
critical enough to be handled the moment they occur. Intel VT Flex Priority
acts like a receptionist who alerts the processor only when an
interrupt is critical, so the processor works more efficiently because it is
interrupted less often.
- 44. Software & Services Group
Virtualization for directed I/O
A VMM must support virtualization of I/O requests from guest software. I/O
virtualization may be supported by a VMM through any of the following
models:
Emulation: A VMM may expose a virtual device to guest software by emulating an existing
(legacy) I/O device. VMM emulates the functionality of the I/O device in software over whatever
physical devices are available on the physical platform. I/O virtualization through emulation
provides good compatibility (by allowing existing device drivers to run within a guest), but poses
limitations with performance and functionality.
New Software Interfaces: This model is similar to I/O emulation, but instead of emulating
legacy devices, VMM software exposes a synthetic device interface to guest software. The
synthetic device interface is defined to be virtualization-friendly to enable efficient virtualization
compared to the overhead associated with I/O emulation. This model provides improved
performance over emulation, but has reduced compatibility (due to the need for specialized guest
software or drivers utilizing the new software interfaces).
Assignment: A VMM may directly assign the physical I/O devices to VMs. In this model, the driver
for an assigned I/O device runs in the VM to which it is assigned and is allowed to interact directly
with the device hardware with minimal or no VMM involvement. Robust I/O assignment requires
additional hardware support to ensure the assigned device accesses are isolated and restricted to
resources owned by the assigned partition. The I/O assignment model may also be used to create
one or more I/O container partitions that support emulation or software interfaces for virtualizing
I/O requests from other guests. The I/O-container-based approach removes the need for running
the physical device drivers as part of VMM privileged software.
- 45. Software & Services Group
Virtualization for directed I/O(Contd.)
Models contd.:
I/O Device Sharing: In this model, which is an extension to the I/O
assignment model, an I/O device supports multiple functional interfaces,
each of which may be independently assigned to a VM. The device
hardware itself is capable of accepting multiple I/O requests through any
of these functional interfaces and processing them utilizing the device's
hardware resources.
Depending on the usage requirements, a VMM may support any of
the above models for I/O virtualization. For example, I/O emulation
may be best suited for virtualizing legacy devices. I/O assignment
may provide the best performance when hosting I/O-intensive
workloads in a guest. Using new software interfaces makes a trade-off
between compatibility and performance, and I/O device sharing
provides more virtual devices than the number of physical devices in
the platform.
- 46. Software & Services Group
Overview Of Intel Virtualization
A general requirement for all of the above I/O virtualization
models is the ability to isolate and restrict device accesses to
the resources owned by the partition managing the device.
Intel VT for Directed I/O provides VMM software with the
following capabilities:
I/O device assignment: For flexibly assigning I/O devices to
VMs and extending the protection and isolation properties of VMs
for I/O operations.
DMA remapping: For supporting independent address
translations for Direct Memory Accesses (DMA) from devices.
Interrupt remapping: For supporting isolation and routing of
interrupts from devices and external interrupt controllers to
appropriate VMs.
Reliability: For recording and reporting to system software DMA
and interrupt errors that may otherwise corrupt memory or
impact VM isolation.
- 47. Software & Services Group
DMA Remapping
DMA remapping facilities have been implemented in a variety of contexts
in the past to facilitate different usages. In workstations and server
platforms, traditional I/O memory management units (IOMMUs) have
been implemented in PCI root bridges to efficiently support
scatter/gather operations or I/O devices with limited DMA addressability.
Other well-known examples of DMA remapping facilities include the AGP
Graphics Aperture Remapping Table (GART), the Translation and
Protection Table (TPT) defined in the Virtual Interface Architecture, and
subsequently influencing a similar capability in the InfiniBand
Architecture and Remote DMA (RDMA) over TCP/IP specifications. DMA
remapping facilities have also been explored in the context of NICs
designed for low latency cluster interconnects.
Traditional IOMMUs typically support an aperture-based architecture. All
DMA requests that target a programmed aperture address range in the
system physical address space are translated irrespective of the source
of the request. While this is useful for handling legacy device limitations
(such as limited DMA addressability or scatter/gather capabilities), it
is not adequate for I/O virtualization usages that require full DMA
isolation.
- 48. Software & Services Group
DMA Remapping(Contd.)
The VT-d architecture is a generalized IOMMU architecture that enables
system software to create multiple DMA protection domains. A protection
domain is abstractly defined as an isolated environment to which a subset of
the host physical memory is allocated. Depending on the software usage
model, a DMA protection domain may represent memory allocated to a VM,
or the DMA memory allocated by a guest-OS driver running in a VM or as
part of the VMM itself. The VT-d architecture enables system software to
assign one or more I/O devices to a protection domain. DMA isolation is
achieved by restricting access to a protection domain's physical memory from
I/O devices not assigned to it, through address-translation tables.
The I/O devices assigned to a protection domain can be provided a view of
memory that may be different than the host view of physical memory. VT-d
hardware treats the address specified in a DMA request as a DMA virtual
address (DVA). Depending on the software usage model, a DVA may be the
Guest Physical Address (GPA) of the VM to which the I/O device is assigned,
or some software-abstracted virtual I/O address (similar to CPU linear
addresses). VT-d hardware transforms the address in a DMA request issued
by an I/O device to its corresponding Host Physical Address (HPA).
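The protection-domain idea described above can be sketched in a few lines. This is a deliberately simplified model under stated assumptions: real VT-d hardware uses multi-level translation structures, whereas the sketch uses one flat dictionary per domain, and the names `RemappingUnit`, `assign`, `map_page`, and `translate`, as well as the 4 KB page size, are hypothetical choices made for illustration.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class RemappingUnit:
    """Toy model of DMA remapping with per-domain address-translation tables."""

    def __init__(self):
        self.device_domain = {}  # device id -> protection domain id
        self.domain_tables = {}  # domain id -> {DVA page number: HPA page number}

    def assign(self, device, domain):
        """Assign an I/O device to a protection domain."""
        self.device_domain[device] = domain
        self.domain_tables.setdefault(domain, {})

    def map_page(self, domain, dva_page, hpa_page):
        """Give the domain access to one page of host physical memory."""
        self.domain_tables.setdefault(domain, {})[dva_page] = hpa_page

    def translate(self, device, dva):
        """Translate the DMA virtual address in a request to a host physical address."""
        domain = self.device_domain.get(device)
        if domain is None:
            raise PermissionError(f"{device} is not assigned to any domain")
        page, offset = divmod(dva, PAGE_SIZE)
        table = self.domain_tables[domain]
        if page not in table:
            # The device tried to reach memory outside its own domain.
            raise PermissionError(f"{device} has no mapping for DVA {dva:#x}")
        return table[page] * PAGE_SIZE + offset


if __name__ == "__main__":
    unit = RemappingUnit()
    unit.assign("device1", 1)
    unit.assign("device2", 2)
    unit.map_page(1, 0, 100)  # domain 1: DVA page 0 -> HPA page 100
    unit.map_page(2, 0, 200)  # domain 2: DVA page 0 -> HPA page 200
    # The same DVA resolves to different host memory depending on the domain.
    print(hex(unit.translate("device1", 0x10)))
    print(hex(unit.translate("device2", 0x10)))
```

Because the lookup is keyed by the requesting device's domain, two devices can issue the same DMA address and still land in separate regions of host memory, which is the isolation property the text describes.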
- 49. Software & Services Group
DMA Remapping(Contd.)
Figure 5 illustrates DMA address translation in a multi-domain usage. I/O
devices 1 and 2 are assigned to protection domains 1 and 2, respectively, each
with its own view of the DMA address space.
Figure 6 illustrates a PC platform configuration with VT-d hardware
implemented in the north-bridge component.
Figure 5: DMA remapping
Figure 6: Platform configuration with VT-d
- 50. Software & Services Group
Intel Smart Memory Access
Intel Smart Memory Access improves system performance by
optimizing the use of the available data bandwidth from the memory
subsystem and hiding the latency of memory accesses. The goal is
to ensure that data can be used as quickly as possible and is located
as close as possible to where it’s needed to minimize latency and
thus improve efficiency and speed.
Intel Smart Memory Access includes a new capability called memory
disambiguation, which increases the efficiency of out-of-order
processing by providing the execution cores with the built-in
intelligence to speculatively load data for instructions that are about
to execute before all previous store instructions are executed.
Intel Smart Memory Access also includes an instruction pointer-based
prefetcher that “prefetches” memory contents before they are
requested so they can be placed in cache and readily accessed when
needed. Increasing the number of loads that occur from cache
versus main memory reduces memory latency and improves
performance.
- 51. Software & Services Group
Intel Smart Memory Access(Contd.)
How does Intel Smart Memory Access improve execution
throughput?
The Intel Core microarchitecture memory cluster (the level 1 data memory
subsystem) is highly out-of-order, non-blocking, and speculative.
It has a variety of methods of caching and buffering to help
achieve its performance. Included among these are Intel Smart
Memory Access and its two key features: memory disambiguation
and the instruction pointer-based (IP-based) prefetcher to the level 1
data cache.
- 52. Software & Services Group
Memory Disambiguation
Since the Intel Pentium Pro, all Intel processors have featured a sophisticated
out-of-order memory engine allowing the CPU to execute non-dependent
instructions in any order. However, they had a significant shortcoming: these
processors were built around a conservative set of assumptions concerning
which memory accesses could proceed out of order. They would not move a
load in the execution order above a store having an unknown address (cases
where a prior store has not been executed yet). This was because if the store
and load end up sharing the same address, the result is incorrect
instruction execution. Yet many loads are to locations unrelated to recently
executed stores. Prior hardware implementations created false dependencies
when they blocked such loads based on unknown store addresses. All these false
dependencies resulted in many lost opportunities for out-of-order execution.
In designing Intel Core microarchitecture, Intel sought a way to eliminate
false dependencies using a technique known as memory disambiguation.
(“Disambiguation” is defined as the clarification that follows the removal of an
ambiguity.) Through memory disambiguation, Intel Core microarchitecture is
able to resolve many of the cases where the ambiguity of whether a
particular load and store share the same address thwarts out-of-order
execution.
- 53. Software & Services Group
Memory Disambiguation(Contd.)
Memory disambiguation uses a predictor and accompanying
algorithms to eliminate these false dependencies that block a load
from being moved up and completed as soon as possible. The basic
objective is to be able to ignore unknown store-address blocking
conditions whenever a load operation dispatched from the
processor’s reservation station (RS) is predicted to not collide with a
store. This prediction is eventually verified by checking all RS-
dispatched store addresses for an address match against newer
loads that were predicted non-conflicting and already executed. If
there is an offending load already executed, the pipe is flushed and
execution restarted from that load.
The memory disambiguation predictor is based on a hash table that
is indexed with a hashed version of the load’s EIP address bits.
(“EIP” is used here to represent the instruction pointer in all x86
modes.) Each predictor entry behaves as a saturating counter, with
reset.
- 54. Software & Services Group
Memory Disambiguation(Contd.)
The predictor has two write operations, both done during the
load’s retirement:
Increment the entry if the load “behaved well.” That is, if it met
unknown store addresses but none of them collided.
Reset the entry to zero if the load “misbehaved.” That is, if it
collided with at least one older store that was dispatched by the
RS after the load. The reset is done regardless of whether the
load was actually disambiguated.
The predictor takes a conservative approach. In order to allow
memory disambiguation, it requires that a number of
consecutive iterations of a load having the same EIP behave
well. This isn’t necessarily a guarantee of success though. If
two loads with different EIPs clash in the same predictor
entry, their prediction will interact.
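The predictor described above can be sketched as a small table of saturating counters. This is only an illustrative model: the table size, the saturation threshold, the hash function, and all names (`DisambiguationPredictor`, `retire_load`, `allows_disambiguation`) are assumptions chosen for the example, not details of Intel's implementation.

```python
class DisambiguationPredictor:
    """Hash table of saturating counters with reset, indexed by a hash of the load's EIP."""

    def __init__(self, num_entries=256, saturation=15):
        # num_entries and saturation are illustrative values, not Intel's.
        self.num_entries = num_entries
        self.saturation = saturation
        self.table = [0] * num_entries

    def _index(self, eip):
        # Hypothetical hash: fold the second address byte into the lowest one.
        return (eip ^ (eip >> 8)) % self.num_entries

    def allows_disambiguation(self, eip):
        # Lookup: only a saturated counter lets the load ignore unknown store addresses.
        return self.table[self._index(eip)] >= self.saturation

    def retire_load(self, eip, met_unknown_store, collided):
        # Both write operations happen when the load retires.
        i = self._index(eip)
        if collided:
            self.table[i] = 0  # "misbehaved": reset the entry to zero
        elif met_unknown_store:
            # "behaved well": increment, but never past the saturation value
            self.table[i] = min(self.table[i] + 1, self.saturation)


predictor = DisambiguationPredictor(num_entries=256, saturation=3)
load_a = 0x401000
load_b = 0x411000  # a different EIP that happens to hash to the same entry

# Three consecutive well-behaved iterations saturate the counter for load_a.
for _ in range(3):
    predictor.retire_load(load_a, met_unknown_store=True, collided=False)
trusted_after_training = predictor.allows_disambiguation(load_a)

# A collision by the aliasing load_b resets the shared entry, so load_a loses its "go".
predictor.retire_load(load_b, met_unknown_store=True, collided=True)
trusted_after_alias_reset = predictor.allows_disambiguation(load_a)

print(trusted_after_training, trusted_after_alias_reset)
```

The last two steps show the interaction mentioned in the text: two loads with different EIPs that share an entry affect each other's predictions.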
- 55.
Memory Disambiguation (Contd.)
Predictor lookup
The predictor is looked up when a load instruction is dispatched from the
RS to the memory pipe. If the respective counter is saturated, the load is
assumed to be safe and the result is written to the “disambiguation
allowed bit” in the load buffer. This means that if the load finds a
relevant unknown store address, the condition is ignored and the load is
allowed to go on. If the predictor is not saturated, the load behaves as
in prior implementations. In other words, if there is a relevant unknown
store address, the load is blocked.
Load dispatch
If the load meets an older unknown store address, it sets the
“update bit,” indicating that the load should update the predictor. If the
prediction was “go,” the load is dispatched and sets the “done” bit,
indicating that disambiguation was done. If the prediction was “no go,”
the load is conservatively blocked until all older store addresses are
resolved.
- 56.
Memory Disambiguation (Contd.)
Prediction verification
To recover from a misprediction by the disambiguation predictor, the
address of every store operation dispatched from the RS to the Memory
Order Buffer must be compared with the addresses of all loads that are
younger than the store. If a match is found, the respective “reset bit”
is set. When a load that was disambiguated retires with its reset bit set,
the pipe is restarted from that load so that it and all its dependent
instructions are re-executed correctly.
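The verification step can be modeled roughly as follows. This is a simplified sketch under stated assumptions: the `LoadEntry` fields, the function names, the use of a sequence number for program order, and the exact-equality address match are all illustrative (real hardware would compare overlapping byte ranges and track far more state).

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    seq: int                  # program order; a larger value means a younger instruction
    address: int
    executed: bool = False
    done_bit: bool = False    # load went ahead past an unknown store address
    reset_bit: bool = False   # an older store later resolved to the same address


def store_address_resolved(store_seq, store_address, load_buffer):
    """Compare a newly dispatched store address with all younger, executed loads."""
    for load in load_buffer:
        if load.seq > store_seq and load.executed and load.address == store_address:
            load.reset_bit = True


def must_restart_at_retirement(load):
    """The pipe restarts only for a load that was disambiguated and then hit by a store."""
    return load.done_bit and load.reset_bit


load_buffer = [
    LoadEntry(seq=2, address=0x1000, executed=True),                 # older than the store
    LoadEntry(seq=5, address=0x1000, executed=True, done_bit=True),  # younger, same address
    LoadEntry(seq=6, address=0x2000, executed=True, done_bit=True),  # younger, other address
]

# A store with sequence number 3 finally resolves its address to 0x1000.
store_address_resolved(store_seq=3, store_address=0x1000, load_buffer=load_buffer)

restart_from = [load.seq for load in load_buffer if must_restart_at_retirement(load)]
print(restart_from)
```

Only the load with sequence number 5 is flagged: it is younger than the store, was speculatively executed, and read the address the store writes.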
Watchdog mechanism
Because disambiguation is based on prediction and mispredictions can
cause an execution pipe flush, it is important to build in safeguards
against rare cases of performance loss. Consequently, Intel Core
microarchitecture includes a mechanism that temporarily disables memory
disambiguation when it is hurting performance. This mechanism constantly
monitors the success rate of the disambiguation predictor.
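The text does not say how the watchdog measures the success rate or how long it keeps disambiguation off. One way such a monitor could work is sketched below; the sliding window, the threshold, the cool-down length, and the `Watchdog` class itself are hypothetical choices made purely for illustration.

```python
from collections import deque


class Watchdog:
    """Disables disambiguation for a while when recent predictions fail too often."""

    def __init__(self, window=64, min_success_rate=0.9, cooldown=1000):
        # All three parameters are assumed values, not documented hardware settings.
        self.outcomes = deque(maxlen=window)
        self.min_success_rate = min_success_rate
        self.cooldown = cooldown
        self.disabled_for = 0

    def record(self, success):
        # Track the outcome of one disambiguated load (True = no collision).
        self.outcomes.append(success)
        if len(self.outcomes) == self.outcomes.maxlen:
            rate = sum(self.outcomes) / len(self.outcomes)
            if rate < self.min_success_rate:
                self.disabled_for = self.cooldown
                self.outcomes.clear()

    def tick(self):
        # Called once per time step; counts down the temporary disable period.
        if self.disabled_for > 0:
            self.disabled_for -= 1

    @property
    def disambiguation_enabled(self):
        return self.disabled_for == 0


watchdog = Watchdog(window=4, min_success_rate=0.75, cooldown=2)
for outcome in (True, True, False, False):
    watchdog.record(outcome)
enabled_after_failures = watchdog.disambiguation_enabled

watchdog.tick()
watchdog.tick()
enabled_after_cooldown = watchdog.disambiguation_enabled

print(enabled_after_failures, enabled_after_cooldown)
```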
- 57.
Advanced smart cache
Intel Advanced Smart Cache is a multi-core optimized cache that
improves performance and efficiency by increasing the probability
that each execution core of a dual-core processor can access data
from a higher-performance, more efficient cache subsystem.
To accomplish this, Intel Core microarchitecture shares the Level 2
(L2) cache between the cores. This makes better use of cache
resources by storing data in one place that each core can access.
Because the L2 cache is shared, each core can dynamically use up to
100 percent of the available L2 cache, and threads can claim the
cache capacity they need.
As an extreme example, if one of the cores is inactive, the other core
has access to the full cache. Intel Advanced Smart Cache also
enables very efficient sharing of data between threads running on
different cores, and it allows data to be obtained from cache at
higher throughput rates for better performance. It provides a peak
transfer rate of 96 GB/sec (at 3 GHz frequency).
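The quoted peak rate can be converted into a per-cycle figure with simple arithmetic. The sketch below assumes that 1 GB means 10**9 bytes and that the 3 GHz figure is the clock at which the cache transfers data; the variable names are illustrative.

```python
peak_bytes_per_second = 96e9   # 96 GB/sec from the text, taking 1 GB = 10**9 bytes (assumption)
clock_hz = 3e9                 # 3 GHz from the text

bytes_per_cycle = peak_bytes_per_second / clock_hz
bits_per_cycle = bytes_per_cycle * 8

print(f"{bytes_per_cycle:.0f} bytes ({bits_per_cycle:.0f} bits) per clock cycle")
```

Under these assumptions, the peak rate corresponds to 32 bytes, or 256 bits, moved per clock cycle.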
- 58.
Wide dynamic execution
Intel Wide Dynamic Execution significantly enhances dynamic execution,
enabling delivery of more instructions per clock cycle to improve execution
time and energy efficiency. Every execution core is 33 percent wider than
previous generations, allowing each core to fetch, decode, and retire up to
four full instructions simultaneously.
Intel Wide Dynamic Execution also includes a new and innovative capability
called Macrofusion. Macrofusion combines certain common x86 instructions
into a single instruction that is executed as a single entity, increasing the
peak throughput of the engine to five instructions per clock. The wide
execution engine, when Macrofusion comes into play, is then capable of up to
six instructions per cycle of throughput for even greater energy-efficient
performance.
Intel Core microarchitecture also uses extended microfusion, a technique that
“fuses” micro-ops derived from the same macro-op to reduce the number of
micro-ops that need to be executed. Studies have shown that micro-op fusion
can reduce the number of micro-ops handled by the out-of-order logic by
more than 10 percent. Intel Core microarchitecture “extends” the number of
micro-ops that can be fused internally within the processor.
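The effect of Macrofusion on an instruction stream can be illustrated with a toy pass over a list of instructions. The text only says that "certain common x86 instructions" are combined; the sketch assumes the commonly cited case of a compare or test followed by a conditional jump, and it ignores the additional restrictions real hardware applies. The function name `macrofuse` and the tuple representation are hypothetical.

```python
# Assumed fusible pairs: a compare/test immediately followed by a conditional jump.
FUSIBLE_FIRST = {"cmp", "test"}
CONDITIONAL_JUMPS = {"je", "jne", "jl", "jge", "jg", "jle", "jb", "jae"}


def macrofuse(instructions):
    """Merge each fusible adjacent pair into one entry that is handled as a single unit."""
    fused = []
    i = 0
    while i < len(instructions):
        current = instructions[i]
        following = instructions[i + 1] if i + 1 < len(instructions) else None
        if (
            following is not None
            and current[0] in FUSIBLE_FIRST
            and following[0] in CONDITIONAL_JUMPS
        ):
            # Combine mnemonics and operands into one fused entry.
            fused.append((current[0] + "+" + following[0],) + current[1:] + following[1:])
            i += 2
        else:
            fused.append(current)
            i += 1
    return fused


program = [
    ("mov", "eax", "[esi]"),
    ("add", "eax", "1"),
    ("cmp", "eax", "ecx"),
    ("jne", "loop_top"),
    ("ret",),
]

fused_program = macrofuse(program)
print(len(program), "instructions become", len(fused_program), "entries")
```

Five instructions occupy four slots after fusion, which is how a machine that handles four entries per clock can reach an effective peak of five instructions per clock.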
- 59.
Wide dynamic execution (Contd.)
Intel Core microarchitecture also incorporates an updated ESP
(Extended Stack Pointer) Tracker. Stack tracking allows safe
early resolution of stack references by keeping track of the value
of the ESP register. About 25 percent of all loads are stack loads
and 95 percent of these loads may be resolved in the front end,
again contributing to greater energy efficiency [Bekerman].
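Combining the two percentages gives the share of all loads that the ESP Tracker may resolve in the front end. The figures come from the text; the variable names are illustrative.

```python
stack_load_share_pct = 25      # share of all loads that are stack loads (from the text)
front_end_resolved_pct = 95    # share of stack loads resolvable in the front end (from the text)

# Share of ALL loads that may be resolved early, in percent.
resolved_share_of_all_loads_pct = stack_load_share_pct * front_end_resolved_pct / 100

print(f"{resolved_share_of_all_loads_pct:.2f} percent of all loads")
```

That is, close to a quarter of all loads (23.75 percent) may be resolved before they reach the out-of-order engine.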
Micro-op reduction resulting from micro-op fusion, Macrofusion,
ESP Tracker, and other techniques makes various resources in the
engine appear virtually deeper than their actual size and results
in executing a given amount of work with less toggling of
signals, two factors that provide more performance for the same
or less power.
Intel Core microarchitecture also provides deep out-of-order
buffers to allow for more instructions in flight, enabling more
out-of-order execution to better exploit instruction-level
parallelism.
- 60.
Advanced Digital media boost
Intel Advanced Digital Media Boost helps achieve similarly dramatic gains
in throughput for programs that use SSE instructions on 128-bit
operands. (SSE instructions enhance Intel architecture by enabling
programmers to develop algorithms that can mix packed single-precision
and double-precision floating-point and integer data.)
These throughput gains come from combining a 128-bit-wide internal
data path with Intel Wide Dynamic Execution and matching widths and
throughputs in the relevant caches. Intel Advanced Digital Media Boost
enables most 128-bit instructions to be dispatched at a throughput rate
of one per clock cycle, effectively doubling the speed of execution and
resulting in peak floating-point performance of 24 GFlops (on each core,
single precision, at 3 GHz frequency).
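The 24 GFlops figure can be broken down with a little arithmetic. The lane count follows directly from the 128-bit operand width; the interpretation of the remaining factor of two as one packed add plus one packed multiply issued in the same cycle is an assumption, not something the text states.

```python
simd_width_bits = 128
single_precision_bits = 32
lanes = simd_width_bits // single_precision_bits   # single-precision values per 128-bit operand

clock_ghz = 3          # from the text
peak_gflops = 24       # from the text (per core, single precision)

flops_per_cycle = peak_gflops / clock_ghz
# Assumed interpretation: this many packed floating-point instructions
# (for example one add and one multiply) complete per cycle.
packed_ops_per_cycle = flops_per_cycle / lanes

print(lanes, flops_per_cycle, packed_ops_per_cycle)
```

The numbers work out to 4 lanes and 8 floating-point operations per cycle, which implies two packed single-precision operations completing each clock.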
Intel Advanced Digital Media Boost is particularly useful when running
many important multimedia operations involving graphics, video, and
audio, and processing other rich data sets that use SSE, SSE2, and SSE3
instructions.
- 61.
Intelligent power capability
Intel Intelligent Power Capability is a set of capabilities for reducing
power consumption and device design requirements. This feature
manages the runtime power consumption of all the processor’s
execution cores. It includes an advanced power-gating capability
that allows for an ultra fine-grained logic control that turns on
individual processor logic subsystems only if and when they are
needed.
Additionally, many buses and arrays are split so that data required in
some modes of operation can be put in a low-power state when not
needed. In the past, implementing such power gating has been
challenging because of the power consumed in powering down and
ramping back up, as well as the need to maintain system
responsiveness when returning to full power [Wechsler].
Through Intel Intelligent Power Capability, Intel has been able to
address these concerns, delivering significant power savings without
sacrificing responsiveness.
- 62.
References
http://www.brighthub.com
http://mintywhite.com
http://www.flyertalk.com
http://www.overclock.net/a/hyperthreading-explained
http://download.intel.com/technology/computing/vptech/Intel(r)_VT_for_Direct_IO.pdf
http://software.intel.com/sites/default/files/m/3/4/d/6/3/18374-sma.pdf
http://www.youtube.com/watch?v=gqZrarZiHp8
http://www.youtube.com/watch?v=3fcI6G7Scqk
http://www.youtube.com/watch?v=V9AiN7oJaIM
http://www.youtube.com/watch?v=kkrqyEpINSQ
http://www.youtube.com/watch?v=y0Q40pBoIwA
- 63.
Thank You