Sample Intel chipset details:
1. Intel Corporation is an American multinational technology company headquartered in Santa Clara, California. Intel is one of the world's largest and highest-valued semiconductor chip makers, based on revenue.
2. The following are basic details of the Core 2 Duo processor.
Hiep Hong
CS 147
Spring 2009
Intel Core 2 Duo
CPU Chronology
Pre-Intel 8086:
Intel 4004
• 108 kHz
• 2300 transistors
Intel 8008
• 500-800 kHz
• 3500 transistors
Intel 8080
• 2 MHz
• 4500 transistors
Dual-Core or Core 2 Duo
• Core 2 Duo is an Intel brand name.
• Dual-core is a generic description meaning two separate physical cores in one chip package.
• Examples: Pentium Dual-Core, Core Duo and Core 2 Duo.
Intel Core 2 Duo
• 64-bit computing.
• x86-64 instruction set.
• The second generation of dual-core processors from Intel.
• Two independent processor cores.
• The two cores share up to 6MB of L2 cache.
• Developed with a new architecture called the Core microarchitecture.
Inside Intel Core 2 Duo Die
Sequence of processing
Core Microarchitecture
• Advanced Smart Cache.
• Macro-fusion.
• Advanced Digital Media Boost.
• Memory disambiguation.
• Advanced power gating.
Advanced Smart Cache
• If one core has minimal cache requirements, the other core can dynamically increase its share of the L2 cache.
  – Reduces cache misses.
  – Improves performance.
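To see why dynamic sharing helps, here is a minimal Python sketch (not Intel's implementation) comparing one shared cache against the same capacity split statically between two cores. The fully associative LRU organization, the 8-line capacity and the access trace are all assumptions chosen for illustration; the real Core 2 L2 is set associative.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache with least-recently-used replacement (a simplification)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.lines: OrderedDict = OrderedDict()
        self.misses = 0

    def access(self, tag) -> None:
        if tag in self.lines:
            self.lines.move_to_end(tag)        # hit: mark line as most recently used
            return
        self.misses += 1                       # miss: the line must be fetched
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)     # evict the least recently used line
        self.lines[tag] = True


def count_misses(trace, shared: bool, capacity: int = 8) -> int:
    """Replay (core, address) pairs and return total misses for both cores."""
    if shared:
        cache = LRUCache(capacity)             # both cores compete for all lines
        for core, addr in trace:
            cache.access((core, addr))
        return cache.misses
    halves = [LRUCache(capacity // 2), LRUCache(capacity // 2)]  # fixed 50/50 split
    for core, addr in trace:
        halves[core].access(addr)
    return sum(half.misses for half in halves)


# Hypothetical workload: core 0 cycles through 6 lines, core 1 keeps reusing 1 line.
trace = []
for _ in range(10):
    trace += [(0, addr) for addr in range(6)]
    trace.append((1, 0))

print("shared:     ", count_misses(trace, shared=True))   # 7 (only cold misses)
print("partitioned:", count_misses(trace, shared=False))  # 61 (core 0 thrashes its 4 lines)
```

Because core 1 needs only one line, letting core 0 use the rest removes nearly all misses in this toy trace.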
Macro-Fusion
• Enables common pairs of instructions, such as a compare followed by a conditional jump, to be combined into a single micro-operation during decoding.
• Reduces the total number of micro-operations executed.
• Allows the processor to execute more instructions in less time.
• Increases performance.
Macro-Fusion continued
Without macro-fusion:
  1  mov eax, [mem1]
  2  cmp eax, [mem2]
  3  jne target
With macro-fusion:
  1  mov eax, [mem1]
  2  cmp eax, [mem2] + jne target
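The effect of fusion on the decoded stream can be modeled with a small Python pass. It is a sketch only: the set of fusible pairs below (CMP or TEST followed by one of a few conditional jumps) is a simplified assumption, since the exact pairing rules differ between processor generations, and the output strings are labels rather than real micro-op encodings.

```python
# Simplified, assumed pairing rule: CMP/TEST immediately followed by one of these jumps.
FUSIBLE_FIRST = {"cmp", "test"}
FUSIBLE_JUMPS = {"je", "jz", "jne", "jnz", "jb", "jae", "jbe", "ja"}


def macro_fuse(instructions):
    """Return the decoded stream, merging each fusible pair into one entry."""
    decoded = []
    i = 0
    while i < len(instructions):
        op, operands = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if op in FUSIBLE_FIRST and nxt is not None and nxt[0] in FUSIBLE_JUMPS:
            decoded.append((f"{op}+{nxt[0]}", f"{operands} -> {nxt[1]}"))  # one fused entry
            i += 2
        else:
            decoded.append((op, operands))
            i += 1
    return decoded


program = [("mov", "eax, [mem1]"), ("cmp", "eax, [mem2]"), ("jne", "target")]
for op, operands in macro_fuse(program):
    print(op, operands)
# mov eax, [mem1]
# cmp+jne eax, [mem2] -> target
```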
Advanced Digital Media Boost
• Improves performance when executing Streaming SIMD Extensions (SSE, SSE2, SSE3) instructions.
• Accelerates video, speech and image processing, photo processing, encryption, and financial, engineering and scientific applications.
• Executes 128-bit SSE, SSE2 and SSE3 instructions at a throughput of one per clock cycle.
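A 128-bit SSE register holds four 32-bit single-precision values, so one packed instruction performs four additions at once. The Python sketch below only models that data layout with the standard struct module; the function name simd_add_ps is hypothetical and merely echoes the SSE ADDPS instruction, and nothing here measures real hardware throughput.

```python
import struct


def simd_add_ps(a: bytes, b: bytes) -> bytes:
    """Add two 128-bit values lane by lane, treating each as four 32-bit floats."""
    if len(a) != 16 or len(b) != 16:
        raise ValueError("a 128-bit register holds exactly 16 bytes")
    lanes_a = struct.unpack("<4f", a)
    lanes_b = struct.unpack("<4f", b)
    return struct.pack("<4f", *(x + y for x, y in zip(lanes_a, lanes_b)))


xmm0 = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)
xmm1 = struct.pack("<4f", 10.0, 20.0, 30.0, 40.0)
print(struct.unpack("<4f", simd_add_ps(xmm0, xmm1)))  # (11.0, 22.0, 33.0, 44.0)
```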
Memory Disambiguation
• Accelerates the execution of memory-related instructions.
• Loads data for instructions that are about to execute before all previous store instructions have executed, when the hardware predicts that those stores do not write to the same address.
• Allows memory-related instructions to be executed out of order.
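The idea can be sketched in Python as a load that runs ahead of an older store whose address is not yet known. This is a simplified model under stated assumptions: memory is a dictionary, the predictor's decision is passed in as a flag, and recovery is shown as simply re-executing the load, whereas real hardware tracks this in its load and store buffers.

```python
def load_after_store(memory, store_addr, store_value, load_addr, predict_no_alias=True):
    """Program order: an older store, then a younger load.

    Returns (value the load produces, whether the load had to be re-executed).
    """
    mem = dict(memory)
    if not predict_no_alias:
        mem[store_addr] = store_value          # conservative: wait for the store first
        return mem.get(load_addr, 0), False

    value = mem.get(load_addr, 0)              # speculative: load runs before the store
    mem[store_addr] = store_value              # the store's address resolves later
    if store_addr == load_addr:                # the prediction was wrong: they alias
        return mem[load_addr], True            # re-execute the load to get fresh data
    return value, False


memory = {0x10: 1, 0x20: 2}
print(load_after_store(memory, store_addr=0x10, store_value=99, load_addr=0x20))  # (2, False)
print(load_after_store(memory, store_addr=0x20, store_value=99, load_addr=0x20))  # (99, True)
```

Either way the load ends up with the same value the conservative path would produce; speculation only changes when it gets it.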
Advanced Power Gating
Newer and better!
06/20/16
Intel Itanium Architecture
Itanium is a new processor family and architecture, designed by Intel and HP with future high-end servers and workstations in mind.
Features of Itanium
• 64-bit addressing
• EPIC (Explicit Parallel Instruction Computing)
• Wide parallel execution core
• Predication
• FPU, ALU and rotating registers
• Large, fast cache
• High clock speed
• Scalability
• Error handling
• Fast bus architecture
Itanium Specifications
• Physical characteristics
  – 25.4M transistors
  – 0.18-micron CMOS process
  – 6 metal layers
  – C4 (flip-chip) assembly technology
  – 1012-pad organic land grid array
  – 733MHz and 800MHz initial release clock speeds
Itanium Specifications Cont…
• Instruction dispersal
  – 2 bundle dispersal windows
  – 3 instructions per bundle
  – 9 functional-unit slots: 2 integer, 2 floating-point, 2 memory and 3 branch
  – Maximum of 6 instructions issued each cycle
Itanium Specifications Cont…
• Floating-point units
  – 2 extended- and double-precision FMACs (floating-point multiply-accumulate units)
  – 4 double- or single-precision operations per clock maximum
  – 3.2 GFLOPS of peak double-precision floating-point performance at 800MHz
  – 2 additional single-precision FMACs
  – 4 single-precision operations per clock maximum
  – 6.4 GFLOPS of peak single-precision floating-point performance in total at 800MHz
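The peak figures above follow from simple arithmetic: number of units times operations per unit per clock times clock rate. A quick Python check, assuming (as the FMAC name implies) that each FMAC performs one multiply and one add, that is two floating-point operations, per clock:

```python
def peak_gflops(units: int, ops_per_unit_per_clock: int, clock_mhz: float) -> float:
    """Peak rate in GFLOPS = units x operations per unit per clock x clock (MHz) / 1000."""
    return units * ops_per_unit_per_clock * clock_mhz / 1000.0


FLOPS_PER_FMAC = 2                                   # one multiply + one add each clock
double_precision = peak_gflops(2, FLOPS_PER_FMAC, 800)
extra_single = peak_gflops(2, FLOPS_PER_FMAC, 800)   # the two single-precision FMACs
print(f"peak double precision: {double_precision:.1f} GFLOPS")                 # 3.2
print(f"peak single precision: {double_precision + extra_single:.1f} GFLOPS")  # 6.4
```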
Itanium Specifications Cont…
• Integer and branch units
  – 4 single-cycle integer ALUs
  – 4 MMX units
  – 3 branch units
Itanium Specifications Cont…
• Level 3 cache
  – Off-die in two or four chips
  – 2MB or 4MB
  – Runs at core clock
  – 4-way set associative
  – Up to 294.8 million transistors
  – 128-bit bus
  – 21+ cycle latency
Itanium Specifications Cont…
• Level 2 cache
  – On-die
  – 96KB of full-speed cache
  – 6-way set associative
  – 256-bit bus
  – 6+ cycle latency
Itanium Specifications Cont…
• Level 1 cache
  – On-die
  – 16KB instruction cache
  – 4-way set associative
  – 16KB integer-only data cache
  – 2+ cycle latency
Itanium Specifications Cont…
• x86 compatibility
  – Hardware decoder turns x86 instructions into EPIC instructions
  – Dynamic scheduler optimizes x86 code for the EPIC microarchitecture
  – Shared cache
  – Shared execution core
64-bit addressing
• EPIC processors are capable of addressing a 64-bit memory space. In comparison, 32-bit x86 processors access a relatively small 32-bit address space, or up to 4GB of memory.
• A 32-bit address space can be a limiting factor to performance. 64-bit addressing gives the Itanium the memory-addressing ability needed to meet current and foreseeable future high-end processing needs.
64-bit addressing cont…
• Through bank switching, x86 processors such as the Intel Pentium III Xeon and the AMD Athlon can address more than 4GB of memory. Unfortunately, bank switching carries hardware and software overhead that harms performance and increases complexity.
64-bit addressing cont…
• The first generation of Itanium systems, using the 460GX chipset, will be expandable to up to 64GB of memory. Later generations will be able to take more memory, and higher-end Itanium systems designed by the likes of SGI, IBM and HP should eventually be able to take far more than 64GB.
• While it may be hard to imagine 4GB or even 64GB of memory being a performance bottleneck, SGI has mentioned plans to eventually build machines with 512 Itanium processors accessing more than a terabyte of data in main memory. Against that, 64GB of memory, let alone 4GB, begins to look rather small.
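The address-space sizes in this section come straight from powers of two. The Python sketch below uses binary units (1GB = 2^30 bytes, an assumption about how the figures are meant) and shows how many address bits the 4GB, 64GB and 1TB figures require.

```python
GB = 2 ** 30   # binary gigabyte
TB = 2 ** 40   # binary terabyte


def addressable_bytes(address_bits: int) -> int:
    """Bytes reachable with a flat address of the given width."""
    return 2 ** address_bits


def bits_needed(num_bytes: int) -> int:
    """Smallest address width that can name every byte of num_bytes."""
    return (num_bytes - 1).bit_length()


print(addressable_bytes(32) // GB, "GB with 32-bit addresses")   # 4
print(bits_needed(64 * GB), "bits to address 64GB")              # 36
print(bits_needed(1 * TB), "bits to address 1TB")                # 40
print(addressable_bytes(64) // TB, "TB with 64-bit addresses")   # 16777216
```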
EPIC
• A new computer architecture standard set by Intel for its new Itanium architecture.
• Previously, computer architectures were classified as RISC, CISC or VLIW.
• EPIC uses complex instructions in addition to basic instructions. These complex instructions include information on how to run the instruction in parallel with other instructions.
• The compiler puts EPIC instructions together into groups of three called bundles.
Bundling
EPIC cont…
• A bundle is a three-instruction-wide word that improves instruction-level parallelism. Each bundle contains three instructions and a template field, which are set during code generation by the compiler or the assembler.
• Bundles are then sent to the CPU.
• In the CPU, instructions from bundles are combined into instruction groups.
• An instruction group is a set of instructions that have no “read after write or write after write dependencies between them and may execute in parallel.” This means that the instructions in a group do not depend on each other's data, so they can run together without getting in each other's way.
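In the IA-64 encoding, a bundle is 128 bits wide: a 5-bit template field in the low bits followed by three 41-bit instruction slots (5 + 3 × 41 = 128). The Python sketch below packs and unpacks that layout; the template and slot values are arbitrary placeholders, not real Itanium instructions.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template: int, slots: list[int]) -> int:
    """Build a 128-bit bundle: template in bits 0-4, slot i starting at bit 5 + 41*i."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3 or not all(0 <= s <= SLOT_MASK for s in slots):
        raise ValueError("a bundle holds exactly three 41-bit instructions")
    bundle = template
    for i, slot in enumerate(slots):
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int) -> tuple[int, list[int]]:
    """Split a 128-bit bundle back into its template and three instruction slots."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK for i in range(3)]
    return template, slots


bundle = pack_bundle(0x10, [0x1, 0x2, SLOT_MASK])   # placeholder values
print(f"{bundle:032x}")                             # 128 bits as 32 hex digits
print(unpack_bundle(bundle))
```

In the real encoding the template also tells the hardware which kind of functional unit each slot needs and where stops fall; decoding that meaning is beyond this sketch.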
EPIC cont…
• In any given clock cycle, the processor executes as many instructions from one instruction group as its resources allow.
• An instruction group must contain at least one instruction, but the number of instructions in an instruction group is not limited.
• An instruction group ends either at a cycle break (a stop) placed by the compiler or dynamically at run time when a branch is taken.
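The grouping rule can be sketched as a greedy pass in Python that places a stop whenever an instruction reads or writes a register already written earlier in the current group (a read-after-write or write-after-write dependency). This is an illustration under simplifying assumptions: real IA-64 code has some permitted exceptions to these rules, and a compiler also reorders instructions rather than only placing stops.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    text: str
    dests: frozenset   # registers written
    srcs: frozenset    # registers read


def instr(text: str, dests: set, srcs: set) -> Instr:
    return Instr(text, frozenset(dests), frozenset(srcs))


def form_groups(program: list[Instr]) -> list[list[Instr]]:
    """Split a sequence into instruction groups free of RAW and WAW dependencies."""
    groups: list[list[Instr]] = []
    current: list[Instr] = []
    written: set = set()
    for ins in program:
        raw = ins.srcs & written           # reads a value produced in this group
        waw = ins.dests & written          # overwrites a value produced in this group
        if current and (raw or waw):
            groups.append(current)         # place a stop: close the current group
            current, written = [], set()
        current.append(ins)
        written |= ins.dests
    if current:
        groups.append(current)
    return groups


program = [
    instr("add r1 = r2, r3", {"r1"}, {"r2", "r3"}),
    instr("add r4 = r5, r6", {"r4"}, {"r5", "r6"}),
    instr("sub r7 = r1, r4", {"r7"}, {"r1", "r4"}),   # needs r1 and r4: RAW
    instr("ld8 r8 = [r9]", {"r8"}, {"r9"}),
]
for n, group in enumerate(form_groups(program)):
    print(f"group {n}:", [i.text for i in group])
```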
EPIC cont…
• In addition to grouping operations into instructions, the compiler handles several other important tasks that improve efficiency, parallelism and speed.
• CISC puts most of the burden of scheduling instructions on the CPU hardware. RISC gives some of this responsibility to the compiler, and VLIW gives the compiler even more.
• EPIC improves on previous technology by adding branch hints, register stacking and rotation, data and control speculation, and memory hints. It also uses branch prediction.
Predication
• Predication is a compiling technique that optimises or removes branching code by rearranging it so that much of the code runs in parallel.
• It minimises the time it takes to run if-then-else constructs by using the processor's width to run both the ‘then’ and ‘else’ paths in parallel.
• Once the ‘if’ condition is resolved, the result of the path not taken is discarded.
• By removing branches and making code more parallel, predication reduces the number of cycles it takes to complete a task while making use of a wide processor.
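If-conversion, the transformation behind predication, can be illustrated in Python. The branching function takes a conditional jump; the predicated version computes a complementary pair of predicates (IA-64 keeps these in 1-bit predicate registers), evaluates both arms, and keeps only the result whose predicate is true. Python still executes sequentially, so this shows the data flow, not the speed-up.

```python
def with_branch(x: int) -> int:
    if x > 0:              # a conditional branch the CPU must predict
        return x * 2
    return -x


def predicated(x: int) -> int:
    p_then = x > 0         # compare sets two complementary predicates
    p_else = not p_then
    then_result = x * 2    # both arms are computed; on a wide EPIC machine
    else_result = -x       # they could issue in the same cycle
    # Select without branching: only the arm whose predicate is true contributes.
    return p_then * then_result + p_else * else_result


print([predicated(x) for x in (-3, 0, 4)])  # [3, 0, 8]
```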
Predication
According to Jerry Huck of HP:
• “Imagine that you are walking into the bank. You will make either a deposit or a withdrawal. The teller may predict you will make a withdrawal, as they know you usually do, so they fill out a withdrawal form as you get in line. If you get to the front and make a withdrawal, all is well, but if you are there to make a deposit, the teller then has to fill out the deposit slip, and the time it takes to complete the transaction increases.
• With predication, the teller is ambidextrous and, when you get in line, they fill out both a withdrawal and a deposit slip, so that when you get to the front, no matter what task you intend to do, the process will run without a hitch.”
Predication cont…
• In the metaphor, predication is the teller's knowledge that they should fill out both the deposit and withdrawal forms before they know exactly what you want. The teller's ambidexterity, the ability to fill out both forms at once, is akin to the ability of an EPIC processor to run instructions in parallel. Predication removes the penalty of if-then-else and allows the if-then-else process to run with as few steps as possible.
• A side benefit of predication is that removing branches leads to fewer branch mispredictions. A branch misprediction requires the pipeline to be flushed, which is a very expensive procedure in cycles, so predication reduces wasted processor time.
Wide Parallel Execution core
• Itanium processors are very wide: they are intended to run multiple instructions and operations in parallel.
• Itanium processors will also be deep, with a ten-stage pipeline.
• The first-generation Itanium processor will be able to issue six EPIC instructions in parallel every clock cycle.
• The six-issue (two-bundle) scheduler disperses instructions into nine functional slots: two integer slots, two floating-point slots, two memory slots and three branch slots.
Wide Parallel Execution core cont…
• This limits the number of instructions of each type that can be assigned in a single clock cycle. If an instruction cannot be executed because all slots of its type are filled, it is delayed until the next cycle.
• This means that good compiler design is crucial to the Itanium's performance.
• Backing up the Itanium's six-issue scheduler are eleven execution units: four integer, two floating-point, three branch and two load/store units.
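The effect of per-type slot limits can be sketched with a simplified in-order dispersal model in Python. The limits (2 integer, 2 floating-point, 2 memory, 3 branch, at most 6 per cycle) come from the specifications in this document; reducing instructions to single-letter unit types and ignoring bundle templates are simplifying assumptions.

```python
from collections import Counter

SLOT_LIMITS = {"I": 2, "F": 2, "M": 2, "B": 3}   # integer, floating-point, memory, branch
MAX_ISSUE = 6


def disperse(stream: list[str]) -> list[list[str]]:
    """Issue instructions in order; end a cycle when the next one has no free slot."""
    cycles = []
    i = 0
    while i < len(stream):
        used = Counter()
        issued = []
        while i < len(stream) and len(issued) < MAX_ISSUE:
            kind = stream[i]
            if used[kind] == SLOT_LIMITS[kind]:
                break                      # slots of this type are full: wait a cycle
            used[kind] += 1
            issued.append(kind)
            i += 1
        cycles.append(issued)
    return cycles


print(disperse(["M", "I", "I", "F", "M", "B"]))  # [['M', 'I', 'I', 'F', 'M', 'B']]
print(disperse(["I", "I", "I", "M", "F", "B"]))  # [['I', 'I'], ['I', 'M', 'F', 'B']]
```

A compiler that interleaves instruction types, as in the first stream, keeps every issue cycle full.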
Wide Parallel Execution core cont…
• This helps support the various EPIC instructions that can launch more than one operation in a single instruction, such as SIMD and floating-point operations.
• Combined with the EPIC instruction set, the Itanium can execute up to 20 operations in a single cycle on some floating-point-intensive tasks.
FPU, ALU and Rotating Registers
• FPU
  – The Itanium contains four pipelined FMAC (floating-point multiply-accumulate) units: two for extended and double precision, and an additional two tuned for 3D applications. The 3D units are each capable of processing up to two single-precision floating-point operations per clock, which yields another 3.2 GFLOPS of single-precision processing power at 800MHz. All together, the Itanium has a theoretical maximum of 6.4 GFLOPS of single-precision floating-point processing power.
FPU, ALU and Rotating Registers cont…
• ALU
  – There are four pipelined ALUs (arithmetic logic units) in the original Itanium. Each can process one integer calculation per cycle, and they can also process MMX-type instructions. While the Itanium has the potential to be a massive floating-point powerhouse, its integer performance also has tremendous potential.
FPU, ALU and Rotating Registers cont…
• Plentiful registers
  – The Itanium will come with 128 floating-point and 128 integer registers. When processing up to 20 operations in a single clock, the registers give plenty of room for data inside the processor. This reduces the chance that an instruction is delayed because its data could not be held locally. This is especially important because the Itanium can process up to eight floating-point operations in a single clock; with eight operations running in a single clock, having too few registers could be a serious bottleneck.
FPU, ALU and Rotating Registers cont…
• The registers can also rotate. Rotating registers allow the processor to perform an operation on multiple software-accessible registers in turn.
• This increases CPU pipeline utilization and efficiency when processing streams of data.
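Register rotation can be modeled in a few lines of Python. In this sketch, a simplification of the IA-64 scheme with the window size, base register number and loop body chosen for illustration, a rotating register base is decremented at every loop iteration, so the value written as r32 in one iteration is read back as r33 in the next. That lets a software-pipelined loop keep several iterations in flight without copying registers.

```python
class RotatingRegisters:
    """Toy rotating register file: logical r(base + k) maps to physical
    register (k + rrb) mod size, and rotate() decrements rrb."""

    def __init__(self, size: int, base: int = 32) -> None:
        self.size = size
        self.base = base
        self.rrb = 0                      # rotating register base
        self.phys = [0] * size

    def _index(self, reg: int) -> int:
        return (reg - self.base + self.rrb) % self.size

    def write(self, reg: int, value: int) -> None:
        self.phys[self._index(reg)] = value

    def read(self, reg: int) -> int:
        return self.phys[self._index(reg)]

    def rotate(self) -> None:
        # Done at the loop branch: what was r32 is now visible as r33.
        self.rrb = (self.rrb - 1) % self.size


regs = RotatingRegisters(size=8)
data = [3, 1, 4, 1, 5]
out = []
for i in range(len(data) + 1):
    if i < len(data):
        regs.write(32, data[i])          # stage 1: "load" element i into r32
    if i >= 1:
        out.append(regs.read(33) * 10)   # stage 2: use element i-1, now named r33
    regs.rotate()
print(out)  # [30, 10, 40, 10, 50]
```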
Large Fast Cache
• When a processor is waiting for data or instructions, time is wasted, and the longer it takes for data and instructions to reach the CPU, the worse it gets. When data and instructions are in cache, the processor can fetch them much more quickly than from slow main memory. Not only is cache latency much lower than DRAM latency, but cache bandwidth is also much higher.
Large Fast Cache cont…
• There are some clever programming techniques in use to keep often-used data and instructions in cache, and they are not the kind of techniques you learn in a high school BASIC course.
• Still, the easiest way to keep data and instructions in cache is to have a lot of cache to keep them in. Intel knew that when it designed the Itanium.
Large Fast Cache cont…
• The Itanium has three levels of cache. L1 and L2 are on-die, while L3 is on the cartridge. According to Intel, the L3 cache is 2MB or 4MB of four-way set-associative cache on two or four 1MB chips.
• IDC reports that the L2 cache is 96KB in size, and that the L1 cache, which does not handle floating-point data, consists of a 16KB integer data cache and a 16KB instruction cache.
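A set-associative cache, like the four-way L3 described above, maps each memory line to exactly one set and lets it occupy any of that set's ways. The Python model below uses LRU replacement and an assumed 64-byte line size; neither detail is stated in this document, so treat the numbers as illustrative.

```python
class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement inside each set."""

    def __init__(self, size_bytes: int, ways: int, line_bytes: int) -> None:
        self.ways = ways
        self.line_bytes = line_bytes
        self.num_sets = size_bytes // (ways * line_bytes)
        self.sets = [[] for _ in range(self.num_sets)]   # each list ordered LRU -> MRU
        self.hits = 0
        self.misses = 0

    def access(self, address: int) -> bool:
        line = address // self.line_bytes
        index = line % self.num_sets      # which set the line maps to
        tag = line // self.num_sets       # distinguishes lines sharing that set
        entries = self.sets[index]
        if tag in entries:
            entries.remove(tag)
            entries.append(tag)           # move to most-recently-used position
            self.hits += 1
            return True
        if len(entries) == self.ways:
            entries.pop(0)                # evict the least-recently-used line
        entries.append(tag)
        self.misses += 1
        return False


l3 = SetAssociativeCache(size_bytes=4 * 2**20, ways=4, line_bytes=64)
print(l3.num_sets)                        # 16384 sets

stride = l3.num_sets * l3.line_bytes      # addresses this far apart share one set
for _ in range(2):
    for k in range(4):                    # four lines fit in a four-way set
        l3.access(k * stride)
print(l3.hits, l3.misses)                 # 4 4
```

Cycling through a fifth line with the same stride would make every access miss under LRU, which is why associativity matters.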
Large Fast Cache cont…
• The 4MB level 3 cache, built from 294.8 million transistors, runs at the full processor speed, giving 12.8GBps of cache bandwidth at 800MHz.
• With 2MB or 4MB of L3 cache on the Itanium, the chances of the required data and instructions being in cache are quite good, so bus traffic is reduced and performance increases. With six pipelines hungry for instructions and data, the Itanium needs all the cache it can get.
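The 12.8GBps figure is the L3 bus width in bytes times the clock rate. A short Python check, assuming one 128-bit transfer per clock and decimal gigabytes (10^9 bytes):

```python
def peak_bandwidth_gbps(bus_bits: int, clock_mhz: float, transfers_per_clock: int = 1) -> float:
    """Peak bandwidth in GB/s (10**9 bytes): bytes per transfer x transfers per second."""
    return (bus_bits / 8) * clock_mhz * 1e6 * transfers_per_clock / 1e9


print(f"{peak_bandwidth_gbps(128, 800):.1f} GBps")  # 12.8 GBps
```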
Large Fast Cache cont…
• To make caching even more effective, Intel uses data speculation and cache hints. Data speculation means loading data that may be needed before it is known whether an intervening store will change it, so that if the data is needed and has not changed, the CPU does not take a latency hit from fetching it.
• The processor, with the help of compiled instructions, looks ahead, anticipates what information it may need, and then brings it into cache or into the processor. This helps hide memory latency. Cache hints are two-bit markers on memory loads, set by the compiler, that help the CPU manage data in the cache hierarchy. This improves the speed of retrieving data from cache.
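Itanium implements data speculation with advanced loads: the compiler hoists a load above a store that might write the same address, the hardware records the load's address in a table (the ALAT, Advanced Load Address Table), a store to a matching address removes the entry, and a later check reloads the data only if the entry is gone. The Python model below is a toy version of that flow; the class and method names are hypothetical, and the real table matches on partial addresses and has limited capacity.

```python
class ToyALAT:
    """Hypothetical model of an advanced-load address table."""

    def __init__(self) -> None:
        self.entries: dict[str, int] = {}      # target register -> loaded address

    def advanced_load(self, memory: dict, reg: str, addr: int) -> int:
        self.entries[reg] = addr               # remember where this early load read from
        return memory[addr]

    def store(self, memory: dict, addr: int, value: int) -> None:
        memory[addr] = value
        # A store to the same address invalidates any advanced load from it.
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, memory: dict, reg: str, addr: int, early_value: int) -> int:
        """Keep the early value if its entry survived; otherwise reload from memory."""
        if self.entries.get(reg) == addr:
            return early_value
        return memory[addr]


def run(store_addr: int) -> int:
    memory = {0x100: 7, 0x200: 9}
    alat = ToyALAT()
    r1 = alat.advanced_load(memory, "r1", 0x100)   # load hoisted above the store
    alat.store(memory, store_addr, 42)             # store that might alias the load
    return alat.check(memory, "r1", 0x100, r1)     # recovery happens only on conflict


print(run(0x200))  # 7: no conflict, the early load was valid
print(run(0x100))  # 42: the store hit the same address, so the data is reloaded
```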
Clock Speed
• The first generation of Itanium processors will arrive in the first half of 2001 at 733MHz and 800MHz. The first generation's clock speed may not be particularly high, but Intel already has several later generations of the Itanium in the works that should increase performance.
• Intel claims it has plenty of clock headroom in the Itanium design and is aiming for a clock speed above 1GHz with its second-generation Itanium processor, McKinley, which will have the L3 cache on-die.
Scalability
• The Itanium was not designed for small systems; it is intended for workstations and servers with 1 to 4000 processors.
• Several Itanium features are designed to help with hardware scalability: a full-CPU-speed Level 3 bus, a large L3 cache, deferred-transaction support and flexible page sizes.
Scalability cont…
• The full-CPU-speed Level 3 bus provides quick communication between CPUs. The large L3 cache reduces inter-CPU bus traffic by keeping data close to the CPU that needs it.
• Deferred-transaction support can stop one CPU from getting in the way of another. Flexible page sizes, from 4KB to 256MB, give the Itanium family the flexibility to access small amounts of memory in small chunks and massive amounts of memory in massive chunks without the overhead of smaller page sizes.
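Why large pages cut overhead: each page needs its own address-translation entry, so mapping a large region with small pages takes many more entries. A Python count for a 64GB region, the first-generation memory limit given in this document, using binary units as an assumption:

```python
KB, MB, GB = 2 ** 10, 2 ** 20, 2 ** 30


def pages_needed(region_bytes: int, page_bytes: int) -> int:
    """Number of pages (and so translation entries) to map a region, rounded up."""
    return -(-region_bytes // page_bytes)


for page in (4 * KB, 256 * MB):
    print(f"{page // KB:>6} KB pages: {pages_needed(64 * GB, page):>8} pages for 64GB")
```

With 4KB pages the region needs over sixteen million translations; with 256MB pages it needs 256.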
Scalability cont…
• The first-generation Itanium chipset, the 460GX, will support up to four processors, and OEMs will be able to build eight-way and larger systems.
• Successive generations of chipsets should be increasingly scalable, and third-party solutions should also increase scalability.
Error Handling
• The Itanium will have extensive error-handling capabilities. It features ECC and parity error checking on most processor caches and buses.
• If a machine error occurs and a piece of data becomes corrupted, ECC or parity checking will allow the machine to recognize the error and either fix it if possible or flag the data as corrupted.
• The processor can also kill an application or thread that has experienced a machine error without having to reboot.
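To show how an error-correcting code can both detect and fix a corrupted bit, here is a Hamming(7,4) sketch in Python. It is a textbook single-error-correcting code chosen for brevity; this document does not say which ECC codes the Itanium uses, and production ECC typically protects wider words and also detects double-bit errors. A single parity bit, by contrast, can detect an odd number of flipped bits but cannot locate them.

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits as 7 bits laid out as p1 p2 d1 p3 d2 d3 d4 (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4          # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(c: list[int]) -> tuple[list[int], int]:
    """Return (corrected data bits, 1-based position of the flipped bit, or 0 if none)."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # the syndrome is the error position
    if syndrome:
        c[syndrome - 1] ^= 1           # single-bit error: flip it back
    return [c[2], c[4], c[5], c[6]], syndrome


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                           # simulate a single-bit error at position 5
fixed, pos = hamming74_decode(word)
print(fixed, pos)                      # [1, 0, 1, 1] 5
```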
Error Handling cont…
• Chipset, OS and system designers, including the likes of HP, IBM, Compaq, SGI, Microsoft and Intel, will bring out their own error-handling and reliability processes that should further push Itanium-based server uptime to 99.9% and beyond.
Fast Bus Architecture
• A major link in the food delivery system for the Itanium is the system bus. The Itanium will use a 2.1GBps multi-drop system bus to keep it well fed with data and instructions. We expect it to be a 128-bit, 133MHz bus.
• The memory subsystem and I/O will be determined by the chipset used. First-generation systems should use dual-ported SDRAM memory, giving 4.2GBps of memory bandwidth. Later generations will have the option to use DDR SDRAM or RDRAM.
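The bus figures can be checked as bytes per transfer times transfers per second. This Python sketch uses the expected 128-bit, 133MHz bus quoted above, assumes one transfer per clock and decimal gigabytes, and assumes the dual-ported memory simply doubles that rate.

```python
def peak_bandwidth_gbps(bus_bits: int, clock_mhz: float) -> float:
    """Peak bandwidth in GB/s (10**9 bytes), assuming one transfer per clock."""
    return (bus_bits / 8) * clock_mhz * 1e6 / 1e9


system_bus = peak_bandwidth_gbps(128, 133)
dual_ported_memory = 2 * system_bus        # two memory ports running in parallel
print(f"system bus:  {system_bus:.3f} GBps")           # 2.128, quoted as 2.1GBps
print(f"memory (x2): {dual_ported_memory:.3f} GBps")  # 4.256, quoted as 4.2GBps
```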
Fast Bus Architecture cont…
• Eventually, Intel plans to move server platforms to DDR II. 64-bit, 66MHz PCI and AGP Pro (4x) should be common on Itanium motherboards, and support for them will be included in Intel's 460GX chipset.
Itanium Roadmap
Future
• According to Intel, the EPIC architecture was designed with about 25 years of headroom for future development in mind.
• McKinley will follow the original Itanium and will integrate its L3 cache onto the CPU die. McKinley will arrive in the first half of 2002.
Future cont…
• Madison may also arrive in 2002 on a 0.13-micron process. Deerfield will arrive not long after, also on a 0.13-micron process, at a lower price and performance level but with more performance per dollar than Madison.
• Furthermore, it will offer larger amounts of L3 cache.
• Deerfield will be positioned as a value part alongside Madison, much as the Celeron relates to the Pentium III today. It might be the CPU that targets consumer desktops.
Competition
Sun UltraSPARC
IBM PowerPC
Compaq's Alpha
AMD's Sledgehammer
Competition
• Sun UltraSPARC
• In 1995, 8 years after the first SPARC workstation was introduced, Sun went 64-bit with the introduction of the UltraSPARC 1 RISC processor. The first model ran at 143MHz and had 128-bit datapaths.
• In 1996, it became the first 64-bit CPU to incorporate multimedia extensions to handle complex 2D/3D graphics.
Competition cont…
• In 1997, the UltraSPARC 2 was released at 250MHz, while the UltraSPARC 3 (with new 256-bit datapaths) is due for release in the second quarter of 2000. Other plans include an UltraSPARC 4 that will be pumped up to 1GHz and an UltraSPARC 5 that will run at 1.5GHz.
Competition cont…
• The UltraSPARC 2 was designed using a 0.25-micron process, while the UltraSPARC 3 employed a 0.18-micron process.
• As a side issue, Sun believes that its UltraSPARC 3, 4 and 5 will be ahead of the Itanium before it arrives, because binary application code written for the UltraSPARC 2 will run unmodified on the later series, making the transition easy.
Competition cont…
• Sun puts forth several arguments against the Itanium:
  – Sun has been supplying 64-bit solutions since 1995 and has ironed out its bugs. While the Itanium may be arriving soon, the testing of applications and enterprise solutions could well take much longer.
  – Furthermore, Sun produces Solaris (which has been truly 64-bit since 1999), so it has the necessary experience in ironing out problems.
Competition cont…
• Lastly, Sun claims that its Visual Instruction Set can be used to speed up networking, I/O and memory management by using special instructions to optimize the passing of data blocks through protocol stacks.
Competition cont…
• IBM PowerPC
  – IBM's PowerPC RISC processor made its debut on 14 February 1990.
  – In 1991, IBM, Motorola and Apple formed an alliance, and the PowerPC is still being developed for Macs to this day.
  – PowerPC went 64-bit in 1998 with the Power3, a codename that covers the PowerPC 604e and PowerPC RS64 processors.
  – IBM's roadmap included building a Power4 in the last quarter of 2000 that was to run at 1GHz.
Competition cont…
• Compaq's Alpha
  – While Sun's UltraSPARC and IBM's PowerPC processors went 64-bit in 1995 and 1998 respectively, Digital's Alpha CPU has been 64-bit ever since its birth in 1992.
  – Even then the Alpha was a powerful CPU: it launched at 200MHz, when the 64-bit MIPS R4000 ran at 100MHz and Intel's 32-bit 386 ran at only 25MHz.
  – Today this processor belongs to Compaq.
Competition cont…
• The AlphaServer SC can run from 64 to 512 processors. Furthermore, AlphaServer SC systems would form the largest supercomputer in Europe, running 2500 Alpha EV67 CPUs and handling 5 trillion instructions per second.
• It is therefore believed that the Alpha would be more appropriate for huge number-crunching scientific and multimedia/entertainment applications.
Competition cont…
• AMD's Sledgehammer
  – At the Microprocessor Forum on 5 October 1999, AMD announced details of its 64-bit processor, codenamed Sledgehammer.
  – AMD plans to extend Intel's original x86 instruction set to include a 64-bit mode. This is to maintain compatibility with 32-bit applications while benefiting from a 64-bit platform.
Competition cont…
  – Sledgehammer will also employ AMD's future system bus, named Lightning Data Transport (LDT). LDT is a chip-to-chip interconnect that can deliver up to 6.4Gbits/sec of bandwidth, about 20 times faster than current 266Mbits/sec system interconnects.
  – Finally, Sledgehammer's unique selling point is that it will be the only chip with full native x86 32-bit and 64-bit compatibility. Other 64-bit chips may offer some kind of x86 compatibility, but according to AMD, each of them relegates x86 instructions to second-class status.
Conclusion
The Itanium is a complex, bleeding-edge, forward-looking processor family that holds promise for huge gains in processing power. The processor uses the entirely new EPIC architecture, which has the potential to deliver large improvements in processor parallelism. It is all about speed, and the Itanium has the ability to deliver it, but the real test will come once the Itanium hits the consumer market.
Important processors and links to their details
Details for Core i7 processor
http://www.intel.com/content/www/us/en/chipsets/8-series-chipset-pch-datasheet.html
Details for Xeon processor
http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-1600-2600-vol-1-datasheet.html
Gigabit ET, ET2 and EF multi-port server adapters brief
http://www.intel.in/content/dam/doc/product-brief/gigabit-et-et2-ef-multi-port-server-adapters-brief.pdf

QSpiders - Basic intel architecture

  • 1.
    Sample Intel chipset details- 1.Intel Corporation is an American multinational technology company headquartered in Santa Clara, California. Intel is one of the world's largest and highest valued semiconductor chip makers, based on revenue. 2.Following is a basic details of core 2 duo processor
  • 2.
    Hiep Hong CS 147 Spring2009 2 Intel Core 2 Duo
  • 3.
  • 4.
    CPU Chronology Intel 4004 108 KHz  2300 transistors Intel 8008  500-800 KHz  3500 transistors Intel 8080  2 MHz  4500 transistors 4 Pre-Intel 8086:
  • 5.
  • 6.
  • 7.
    Dual-Core or Core2 Duo  Core 2 Duo is a brand name by Intel.  Dual-Core is a generic description meaning two separate physical cores in one chip package.  Example: Pentium Dual Core, Core Duo and Core 2 Duo. 7
  • 8.
  • 9.
    Intel Core 2Duo  64 bit computing.  x86-64 instruction set.  The second generation of dual-core processors from Intel.  Two independent processor cores.  Share up to 6MB of L2 cache.  Developed with a new Architecture called Core Microarchitacture. 9
  • 10.
    Inside Intel Core2 Duo Die 10
  • 11.
  • 12.
  • 13.
  • 14.
    Core Microarchitecture  Advancedsmart cache.  Macro-fusion.  Advanced digital media boost.  Memory disambiguation.  Advanced power gating. 14
  • 15.
  • 16.
    Advanced smart cachecontinued  If one core has minimal cache requirements, the other core can dynamically increase its share of L2 cache   Reduce cache misses.   Improve performance. 16
  • 17.
  • 18.
  • 19.
    Macro-Fusion continued  Enablecommon pair of instructions to be combined into a single instruction during decoding.  Reduce the total of executed instructions.  Allow processor to execute more instructions in less time.  Increase performance. 19
  • 20.
    Macro-Fusion continued Without macro-fusionWith macro-fusion 1 load eax, [mem1] 2 cmp eax, [mem2] 3 jne target 1 load eax, [mem1] 2 cmp eax, [mem2] + jne target 20
  • 21.
    Advanced Digital MediaBoost  Improve performance when executing Streaming SIMD Extension (SSE, SSE2, SEE3) instructions.  Accelerate video, speech, image, speech and image, photo processing, encryption, financial, engineering and scientific applications. 21
  • 22.
    Advanced Digital MediaBoost 128-bit Streaming SIMD Extension (SSE, SSE2, SEE3) instructions. 22
  • 23.
    Memory Disambiguation  Acceleratethe execution of memory-related instructions.  Load data for instructions about to be executed before all previous store instructions were executed.  Memory-related instructions that can be executed out of order. 23
  • 24.
  • 25.
  • 26.
  • 27.
  • 28.
  • 29.
    06/20/16 Intel Itanium Architecture Itaniumis a new processor family and architecture, design by Intel and HP with the future of high end server and workstation in mind.
  • 30.
    06/20/16 Features of Itanium 64-bit addressing  EPIC (Explicit Parallel Instruction Computing)  Wide Parallel Execution core  Prediction  FPU, ALU and Rotating registers  Large fast Cache  High Clock Speed  Scalability  Error Handling  Fast Bus Architecture
  • 31.
    06/20/16 Itanium Specifications  PhysicalCharacteristics – 25.4M transistors – .18micron CMOS process – 6 metal layers – C4 (flip-chip) assembly technology – 1012-pad organic land grid array – 733MHz and 800MHz initial release clock speeds
  • 32.
    06/20/16 Itanium Specifications Cont… Instruction Dispersal – 2 bundle dispersal windows – 3 instructions per bundle – 9 function unit slots – 2 integer slots – 2 floating point slots – 2 memory slots – 3 branch slots – Maximum of 6 instructions issued each cycle
  • 33.
    06/20/16 Itanium Specifications Cont… Floating Point Units  2 extended and double precision FMACs (Floating- point Multiply Add Calculators)  4 double or single precision operations per clock maximum  3.2 GFLOPS of peak double precision floating point performance at 800MHz  2 additional single precision FMACs  4 single precision operations per clock maximum  6.4 GFLOPS of peak single precision floating point performance total at 800MHz
  • 34.
    06/20/16 Itanium Specifications Cont… Integer and Branch Units – 4 single cycle integer ALUs – 4 MMX units – 3 branch units
  • 35.
    06/20/16 Itanium Specifications Cont… Level3 Cache – Off-die in two or four chips – 2MB or 4MB – Runs at core clock – 4-way set associative – Up to 294.8 million transistors – 128-bit bus – 21+ cycle latency
  • 36.
    06/20/16 Itanium Specifications Cont… Level 2 Cache – On-die – 96k of full-speed cache – 6-way set associative – 256-bit bus – 6-cycle + latency
  • 37.
    06/20/16 Itanium Specifications Cont… Level 1 Cache – On-die – 16k instruction cache – 4-way set associative – 16k integer only data cache – 2-cycle + latency
  • 38.
    06/20/16 Itanium Specifications Cont… x86 Compatibility – Hardware decoder turns x86 instructions into EPIC instructions – Dynamic scheduler optimizes x86 for EPIC micro-architecture – Shared cache – Shared execution core
  • 39.
    06/20/16 64-bit addressing  EPICprocessors are capable of addressing a 64-bit memory space. In comparison, 32- bit x86 processors access a relatively small 32-bit address space, or up to 4GB of memory.  A 64-bit memory space may be a limiting factor to performance. This gives the Itanium the memory addressing ability needed to meet current and foreseeable future high-end processing needs.
  • 40.
    06/20/16 64-bit addressing cont… Through bank switching, x86 processors, such as the Intel Pentium III Xeon and the AMD Athlon, can address more than 4GB of memory. Unfortunately, there is hardware and software overhead to bank switching that harms performance and increases complexity.
  • 41.
    06/20/16 64-bit addressing cont… The first generation of Itanium systems, using the 460GX chipset, will be expandable with up to 64GB of memory. Generations beyond that will be able to take more memory. Higher end Itanium systems designed by the likes of SGI, IBM and HP should eventually be able to take far more than 64GB.  While it may be hard to imagine 4GB or even 64GB of memory being a bottleneck to performance, when one considers SGI has mentioned plans to eventually build machines using 512 Itanium processors accessing more than a terabyte of data in main memory, 64GB of memory, let alone 4GB, begins to look rather small.
  • 42.
    06/20/16 EPIC  New ComputerArchitecture standard set by Intel on its new itanium architecture  Previously Computer architectures only consisted of RISC, CISC and VLIW  EPIC Uses complex instruction in additions to basic instruction. This complex instruction includes information on how to run the instruction parallel with other instructions.  EPIC instructions are put together by the compiler into a threesome called a bundle.
  • 43.
  • 44.
    06/20/16 EPIC continue…..  Bundleis a three instruction wide word - improves instruction level parallelism. Each Bundle Contains three instructions and a template field which are set during code generation, by a compiler, or the assembler.  Bundles are then sent to the CPU.  Bundles in the CPU are put together in an instruction group with other instructions  An instruction group is a set of instructions which do not have “read after write or write after write dependencies between them and may execute in parallel.” This means that the bundle do not affect each other with the data they are working on, so they can run together without getting in each others way.
  • 45.
    06/20/16 EPIC continue….  Inany given clock cycle, the processor executes as many instructions from one instruction group as it can according to resources.  An instruction group must contain at least one instruction but the number of instructions in an instruction group is not limited.  The instruction groups can end by cycle breaks or end dynamically during run time by taken branch
  • 46.
    06/20/16 EPIC continues…..  Inaddition of grouping operations into instructions, the compiler handles several other important tasks that improve efficiency, parallelism and speed.  CISC puts most of the burden of scheduling instructions onto the CPU hardware. RISC gives some of this responsibility to the compiler. VLIW gives even more importance to the compiler.  EPIC improves on previous technology by adding branch hints, register stack and rotation, data and control speculation and memory hints. It also uses branch prediction.
  • 47.
    06/20/16 Prediction  It isa compiling technique that optimises or removes branching code by working it so that much of the code runs in parallel.  It minimises the time it takes to run if – then – else situations and uses processor width to run both the ‘then’ and ‘else’ in parallel.  When the ‘if’ branch is determined, the incorrect branch result is discarded.  By removing branches and making code more parallel, prediction reduces the number of cycles it takes to complete a task while making use of a wide processor.
  • 48.
    Predication
    According to Jerry Huck of HP:
    "Imagine that you are walking into the bank. You will make either a deposit or a withdrawal. The teller may predict you will make a withdrawal, as they know you usually do, so they fill out a withdrawal form as you get in line. If you get to the front and make a withdrawal, all is well, but if you are there to make a deposit, the teller then has to fill out the deposit slip and the time it takes to complete the transaction increases.
    With predication, the teller is ambidextrous and, when you get in line, they fill out both a withdrawal and a deposit slip, so that when you get to the front, no matter what task you intend on doing, the process will run without a hitch."
  • 49.
    Predication continued…
    • In the metaphor, predication is the teller's knowledge that they should fill out both the deposit and the withdrawal form before they know exactly what you want. The teller's ambidexterity, the ability to fill out both forms at once, is akin to the ability of an EPIC processor to run instructions in parallel. Predication removes the penalty of if-then-else and allows the if-then-else process to run in as few steps as possible.
    • A side benefit of predication is that removing branches leaves fewer branches to mispredict. A branch misprediction requires the pipeline to be flushed, which is a very cycle-expensive procedure, so predication reduces wasted processor time.
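The if-then-else case can be illustrated with a small Python sketch of if-conversion. It is only an analogy for what an IA-64 compiler does with predicate registers set by compare instructions: the comparison produces a predicate and its complement, both arms are computed, and only the result guarded by the true predicate is kept. The function names are hypothetical, and the arithmetic select stands in for predicated instructions.

```python
def branchy_abs_diff(a: int, b: int) -> int:
    # Conventional code: a conditional branch picks one path.
    if a > b:
        return a - b
    return b - a


def predicated_abs_diff(a: int, b: int) -> int:
    # A compare writes a predicate and its complement (like p1, p2 = cmp ...).
    p_then = a > b
    p_else = not p_then
    # Both arms are computed unconditionally, as if issued in parallel.
    then_result = a - b
    else_result = b - a
    # Commit only the result guarded by the true predicate; the other result
    # is discarded. The arithmetic select below contains no branch.
    return int(p_then) * then_result + int(p_else) * else_result


for a, b in [(7, 3), (3, 7), (5, 5)]:
    print(a, b, predicated_abs_diff(a, b), branchy_abs_diff(a, b))
```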
  • 50.
    06/20/16 Wide Parallel Executioncore  Itanium processors are very wide.  They are intended to run multiple instructions and operations in parallel.  Itanium processors will be deep with a ten stage pipeline.  The first generation itanium processor will be able to issue six EPIC instruction in parallel every clock cycle.  The six issue (two bundler) scheduler disperses instructions into nine functional slots, two integer slots, two memory slots and three branch slots, giving a total of nine dispersal slots.
  • 51.
    Wide Parallel Execution Core cont…
    • This limits the number of instructions of each type that can be issued in a single clock cycle. If an instruction cannot be issued because all slots of its type are already filled, it is delayed until the next cycle.
    • This means that proper compiler design is crucial to how well the Itanium performs.
    • Backing up the Itanium's six-issue scheduler are eleven execution units: four integer, two floating-point, three branch and two load/store units.
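The effect of the slot limits can be sketched with a simple Python model. It assumes six instructions per cycle and nine dispersal slots split as two memory, two integer, two floating-point and three branch, and it issues instructions strictly in order, ending a cycle as soon as the next instruction's slot type is exhausted. It ignores dependencies, templates and latencies, so it only illustrates why the instruction mix the compiler produces matters.

```python
from collections import Counter
from typing import List

ISSUE_WIDTH = 6                                   # two bundles of three
SLOT_LIMITS = {"M": 2, "I": 2, "F": 2, "B": 3}    # assumed dispersal slots per cycle


def schedule(kinds: List[str]) -> List[List[str]]:
    """Issue instruction kinds in order, returning one list per clock cycle."""
    cycles: List[List[str]] = []
    i = 0
    while i < len(kinds):
        used: Counter = Counter()
        cycle: List[str] = []
        while i < len(kinds) and len(cycle) < ISSUE_WIDTH:
            kind = kinds[i]
            if used[kind] == SLOT_LIMITS[kind]:
                break                             # this slot type is full: wait a cycle
            used[kind] += 1
            cycle.append(kind)
            i += 1
        cycles.append(cycle)
    return cycles


balanced = ["M", "I", "F", "M", "I", "F"]
memory_heavy = ["M", "M", "M", "M", "I", "I"]
print(len(schedule(balanced)))       # 1 cycle
print(len(schedule(memory_heavy)))   # 2 cycles
```

Both sequences contain six instructions, but the memory-heavy one runs out of memory slots after two loads and needs a second cycle.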
  • 52.
    Wide Parallel Execution Core cont…
    • This helps support the various EPIC instructions that can launch more than one operation in a single instruction, such as SIMD and floating-point operations.
    • Combined with the EPIC instruction set, the Itanium can execute up to 20 operations in a single cycle when doing some floating-point-intensive tasks.
  • 53.
    06/20/16 FPU, ALU andRotating Registers  FPU – The Itanium contains 4 pipelined FMAC (Floating Point Multiple Add Calculator) units. There are an additional two FMACs tuned for 3D applications. They are each capable of processing up to two single- precision floating-point operations per clock. That yields another 3.2GFLOPS of single- precision processing power. All together, the Itanium has a theoretical max of 6.4GLOPS of single-precision floating point processing power.
  • 54.
    06/20/16 FPU, ALU andRotating Registers cont…  ALU – There are four pipelined ALUs (Arithmetic Logic Unit) in the original Itanium. Each can process one integer calculation per cycle. They can also process MMX type instructions. While the Itanium has the potential to be a massive floating-point powerhouse, its integer performance also has tremendous potential.
  • 55.
    FPU, ALU and Rotating Registers cont…
    • Plentiful Registers – The Itanium will come with 128 floating-point and 128 integer registers. When processing up to 20 operations in a single clock, the registers give plenty of room for data inside the processor. This reduces the chance that an instruction is delayed because its data could not be held locally.
    • This is especially important since the Itanium can process up to eight floating-point operations in a single clock. With eight operations possibly running in a single clock, having too few registers could be a serious bottleneck.
  • 56.
    FPU, ALU and Rotating Registers cont…
    • The registers can also rotate. Rotating registers allow the processor to perform an operation on multiple software-accessible registers in turn.
    • This increases CPU pipeline utilization and efficiency when dealing with streams of data to process.
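A toy model of register rotation: a logical register name is translated through a rotation offset that changes once per loop iteration, so a value written to the same logical register on each iteration lands in a different physical register, and older values stay reachable under shifted names without any copying. The class and its methods are hypothetical and simplified; on IA-64 the rotation is driven by special loop branches and applies only to a defined rotating region of the register file.

```python
from typing import List


class RotatingFile:
    """Toy model of a rotating register region (hypothetical, simplified)."""

    def __init__(self, size: int) -> None:
        self.regs: List[int] = [0] * size
        self.base = 0                      # rotation offset, changed once per iteration

    def _physical(self, logical: int) -> int:
        # The same logical name maps to a different physical register
        # after each rotation.
        return (logical + self.base) % len(self.regs)

    def write(self, logical: int, value: int) -> None:
        self.regs[self._physical(logical)] = value

    def read(self, logical: int) -> int:
        return self.regs[self._physical(logical)]

    def rotate(self) -> None:
        # After rotating, what was logical r(k) is visible as logical r(k+1).
        self.base = (self.base - 1) % len(self.regs)


rf = RotatingFile(size=8)
seen = []
for x in [10, 20, 30, 40]:
    rf.write(0, x)            # an early loop stage writes logical r0
    seen.append(rf.read(2))   # a later stage reads the value from two iterations ago
    rf.rotate()
print(seen)                   # [0, 0, 10, 20]
```

This is the pattern software-pipelined loops rely on: each stage of the loop sees its operand under a fixed register name even though the data moves through the register file.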
  • 57.
    06/20/16 Large Fast Cache When a processor is waiting for data or instructions, time is wasted. The longer it takes for data and instructions to get to the CPU, the worse it gets. When data and instructions are in cache, the processor can grab them much quicker than when having to go to slow main memory. Not only is cache latency much lower than DRAM latency, the bandwidth is much higher.
  • 58.
    06/20/16 Large Fast Cachecont…  There are some trick programming techniques in use out there to keep often-used data and instructions in cache and they are not the kind of techniques you learn in your high school BASIC course.  Still, the easiest way to keep data and instructions in cache is to have a lot of cache to keep them in. Intel knew that when they designed the Itanium.
  • 59.
    06/20/16 Large Fast Cachecont…  The Itanium has three levels of cache. L1 and L2 are on-die while L3 is on cartridge. According to Intel, the L3 cache weighs in at 2MB or 4MB of four-way set associative cache on two or four 1MB chips.  IDC reports that the L2 cache size is 96k in size, and the L1 cache, which does not deal with floating point data, has a 16KB integer data and a 16KB instruction cache.
  • 60.
    06/20/16 Large Fast Cachecont…  The 294.8 million transistors of (4MB) level three cache runs at the full processor speed, giving 12.8GBps of memory bandwidth at 800MHz.  With 2MB or 4MB of L3 cache on the Itanium, the chances of the required data and instructions being in cache are quite good, bus traffic can be reduced, and performance increases. With six pipelines hungry for instructions and data, the Itanium needs all the cache it can get.
  • 61.
    06/20/16 Large Fast Cachecont…  To make caching even more effective, Intel uses data speculation and cache hints. Data speculation is caching and calling for data that may be needed or may be changed before it is needed, so that, in the case that the data is needed and it has not changed, the CPU does not have to take a latency impact from calling for the data.  The processor, with the help of compiled instructions, looks ahead, anticipates what info it may need, and then brings it to cache or into the processor. This helps hide memory latency. Cache hints are two-bit markers for memory loads set by the compiler that help the CPU find data in cache. This improves the speed of retrieving data from cache.
  • 62.
    06/20/16 Clock Speed  Thefirst generation of Itanium processors will come in the first half of 2001 at 733MHz and 800MHz. The first generation's clock speed may not be particularly quick, but Intel has several generations ahead of the Itanium already in the works that should increase performance.  Intel claims they have plenty of clock headroom in the Itanium design and are aiming for a greater than 1GHz clock speed with their second generation Itanium processor, McKinley, which will have the L3 cache on-die.
  • 63.
    06/20/16 Scalability  The Itaniumwas not designed for small systems, it is intended for 1 to 4000 processor workstations and servers.  There are several Itanium features designed to help with hardware scalability: a full-CPU- speed Level 2 bus, a large L3 cache, deferred- transaction support and flexible page sizes.
  • 64.
    06/20/16 Scalability cont…  Thefull-CPU-speed Level 3 bus provides quick communication between CPUs. The large L2 cache reduces inter-CPU bus traffic by keeping data close to the CPU that needs it.  Deferred-transaction support can stop one CPU from getting in the way of another. Flexible page sizes, from 4KB to 256MB, give the Itanium family the flexibility to access small amounts of memory in small chunks and massive amounts of memory in massive chunks without the overhead of smaller page sizes.
  • 65.
    06/20/16 Scalability cont…  Thefirst generation Itanium chipset, the 460GX, will support up to four processors, and OEMs will be able to build eight-way and larger systems.  Successive generations of chipsets should be successively more scalable. Third party solutions should also increase scalability.
  • 66.
    06/20/16 Error Handling  TheItanium will have extensive error handling capabilities. It features ECC and parity error checking on most processor caches and busses.  If a machine error occurs and a piece of data becomes corrupted, the ECC or parity checking will allow the machine to recognize the error, fix it if possible, or flag it as corrupted.  The processor also has the capability to kill an application or thread that has experienced a machine error without having to reboot.
  • 67.
    06/20/16 Error Handling cont… Chipset, OS, and system designers, which will include the likes of HP, IBM, Compaq, SGI, Microsoft and Intel, will bring out their own error handling and reliability processes that should further enhance Itanium-based server uptime to 99.9% and beyond.
  • 68.
    06/20/16 Fast Bus Architecture A major link in the food delivery system for the Itanium is the system bus. The Itanium will use a 2.1GBps multi-drop system bus to keep well fed with data and instructions. We expect it will have a 128-bit 133MHz bus.  The memory subsystem and I/O will be determined by the chipset used. First generation systems should use dual-memory ported SDRAM giving 4.2GBps of memory bandwidth. Later generations will have the option to use DDR SDRAM or RDRAM.
  • 69.
    Fast Bus Architecture cont…
    • Eventually, Intel plans to move server platforms to DDR II. 64-bit, 66MHz PCI and AGP Pro (4x) should be common on Itanium motherboards, and support for them will be included in Intel's 460GX chipset.
  • 70.
  • 71.
    06/20/16 Future  According toIntel, the EPIC architecture was designed with about 25 years of headroom for future development in mind.  McKinley will follow the original Itanium and will integrate its L3 cache onto the CPU die. McKinley will arrive in the first half of 2002. Madison may also arrive in 2002 on a .13-micron process. Deerfield will arrive not long after, also on a .13 process, at a lower price and performance level but with more performance for the dollar than Madison.
  • 72.
    06/20/16 Future cont…  Madisonmay also arrive in 2002 on a .13- micron process. Deerfield will arrive not long after, also on a .13 process, at a lower price and performance level but with more performance for the dollar than Madison.  Furthermore it will offer larger amounts of L3 cache.  Deerfield will be positioned as a value part in conjunction with Madison the same way as a P3 and Celeron compares today. It might be the CPU targeting consumer desktops.
  • 73.
  • 74.
    06/20/16 Competition  Sun UltraSPARC In 1995, 8 years after the first SPARC station was introduced, Sun went 64 bit with the introduction of UltraSPARC 1 RISC processor. The first model ran at 143Mhz and had 128 bit datapaths.  In 1996, it became the first 64 bit CPU to incoporate multimedia extensions to handle complex 2D/3D graphics.
  • 75.
    06/20/16 Competition cont…  In1997, the UltraSPARC 2 was released at 250Mhz while the UltraSPARC 3 (with new 256 bit data paths) is released in the second quarter of 2000. Other plans include a UltraSPARC 4 which will be pumped up to 1 Ghz and UltraSPARC 5 which will run at 1.5Ghz  In 1997, the UltraSPARC 2 was released at 250Mhz while the UltraSPARC 3 (with new 256 bit datapaths) is released in the second quarter of 2000. Other plans include a UltraSPARC 4 which will be pumped up to 1 Ghz and UltraSPARC 5 which will run at 1.5Ghz
  • 76.
    06/20/16 Competition cont…  TheUltraSPARC 2 was designed using a 0.25 micron process, while the UltraSPARC 3 employed a 0.18 micron process.  As a side issue, Sun believes that its UltraSPARC 3, 4 and 5 will be ahead of Itanium before it arrives because the binary application code written for the UltraSPARC 2 will run unmodified on the other series' making the transition easy.
  • 77.
    06/20/16 Competition cont…  Thereare several arguments that Sun puts forth against the Itanium: – Sun has been supplying 64 bit solutions since 1995 and has ironed out its bugs. While Itanium may be arriving soon, the testing of applications and enterprise solutions could well take much longer. – Furthermore Sun produces Solaris (which has been a true 64 bit since 1999), so it has the necessary experience in ironing out problems.
  • 78.
    06/20/16 Competition cont…  LastlySun claims that its Visual Instruction Set can be used to speed up networking, I/O and memory management by optimizing the passing of data blocks through protocol stacks with the special instructions.
  • 79.
    06/20/16 Competition cont…  IBMPowerPC – IBM’s PowerPC RISC processor made its debut on the 14th of February 1990. – In 1991, an alliance was formed between IBM, Motorola and Apple and the PowerPC is still being developed for MACS until today. – PowerPC’s went 64 bit in 1998 with the codename Power3 which covers the PowerPC 604e and Power PC RS64 processors. – IBM’s roadmap included building a Power4 in the last quarter of 2000 which was to run at 1 Ghz.
  • 80.
    06/20/16 Competition cont…  Compaq’sAlpha – While Sun’s UltraSPARC and IBM’s PowerPC processors went 64 bit in 1995 and 1998, Digital Alpha’s CPU was 64 bit ever since its birth. That was during 1992. – Even during then the Alpha was a powerful CPU, launched at 200Mhz, when the MIPS 64 bit R4000 ran at 100Mhz and Intel’s 32 bit 386 only ran at 25Mhz – Today this processor belongs to Compaq.
  • 81.
    06/20/16 Competition cont…  TheAlpha Server SC can run from 64 to 512 processors. Furthermore the Alpha SC server would form the largest supercomputer in Europe running 2500 Alpha EV67 CPU’s and would handle 5 trillion instructions per second.  It is believed that accordingly the Alpha’s would be more appropriate for huge number crunching scientific and multimedia - entertainment applications.
  • 83.
    06/20/16 Competition cont…  AMD’sSledgehammer – At the microprocessor forum on 5th October 1999, AMD announced details of its 64 bit processor. This 64 bit processor is codenamed Sledgehammer. – With regards to this AMD plans to extend Intel’s original x86 instruction to include a 64 bit mode. This is to maintain compatibility with 32 bit apps while benefiting from a 64 bit platform.
  • 84.
    Competition cont…
      – Sledgehammer will also employ AMD's future system bus, named Lightning Data Transport (LDT). LDT is a chip-to-chip interconnect that can deliver up to 6.4 Gbits/sec of bandwidth, more than 20 times faster than current 266Mbits/sec system interconnects.
      – Finally, Sledgehammer's unique selling point is that it will be the only chip with full native x86 32-bit and 64-bit compatibility. Other 64-bit chips may offer some kind of x86 compatibility, but according to AMD, each of these relegates x86 instructions to second-class status.
  • 85.
    Conclusion
    The Itanium is a complex, bleeding-edge, forward-looking processor family that holds promise for huge gains in processing power. The processor uses the entirely new EPIC architecture, which has the potential to deliver large improvements in processor parallelism. It is all about speed, and the Itanium has the ability to deliver it, but the real test will come once the Itanium hits the consumer market.
  • 86.
    Important processors and links for their details
    • Details for Core i7 processor: http://www.intel.com/content/www/us/en/chipsets/8-series-chipset-pch-datasheet.html
    • Details for Xeon processor: http://www.intel.com/content/www/us/en/processors/xeon/xeon-e5-1600-2600-vol-1-datasheet.html
    • Gigabit ET, ET2 and EF multi-port server adapters brief: http://www.intel.in/content/dam/doc/product-brief/gigabit-et-et2-ef-multi-port-server-adapters-brief.pdf