LARRABEE ARCHITECTURE:
Intel's Larrabee is built out of a number of x86 cores that look, at a very high level, like this:
Each core is a dual-issue, in-order architecture loosely derived from the original Pentium
microprocessor. The Pentium core was modified to include support for 64-bit operations,
the updates to the x86 instruction set, larger caches, 4-way SMT/Hyper Threading and a 16-
wide vector ALU.
While the team that ended up working on Atom may have originally worked on the Larrabee
cores, there are some significant differences between Larrabee and Atom. Atom is geared
towards much higher single-threaded performance, with a deeper pipeline, a larger L2 cache
and additional microarchitectural tweaks to improve general desktop performance.
| | Intel Larrabee Core | Intel Pentium Core (P54C) | Intel Atom Core |
| --- | --- | --- | --- |
| Manufacturing Process | 45nm | 0.60µm | 45nm |
| Simultaneous Multi-Threading | 4-way | 1-way | 2-way |
| Issue Width | dual-issue | dual-issue | dual-issue |
| Pipeline Depth | 5 stages (?) | 5 stages | 16 stages |
| Scalar Execution Resources | 2 x Integer ALUs (?), 1 x FPU (?) | 2 x Integer ALUs, 1 x FPU | 2 x Integer ALUs, 1 x FPU |
| Vector Execution Resources | 16-wide Vector ALU | None | 1 x SIMD SSE |
| L1 Cache (I/D) | 32KB/32KB | 8KB/8KB | 32KB/24KB |
| L2 Cache | 256KB | None (external) | 512KB |
| ISA | 64-bit x86, SSEn support?, Parallel/Graphics? | 32-bit x86 | 64-bit x86, full Merom ISA compatibility |
Larrabee, on the other hand, is more Pentium-like to begin with; Intel states that Larrabee's
execution pipeline is "short" and followed up by telling us that it's closer to the 5-stage
pipeline of the original Pentium than the 16-stage pipeline of Atom. While both Atom and
Larrabee support SMT (Simultaneous Multi-Threading), Larrabee can work on four threads
concurrently compared to two on Atom and one on the original Pentium.
L1 cache sizes are similar between Larrabee and Atom, but Larrabee gets a full 32KB data
cache compared to 24KB on Atom. If you remember back to our architectural discussion of
Atom, the smaller L1 D-cache was a side effect of going to a register file instead of a small
signal array for the cache. Die size increased but operating voltage decreased, forcing Atom
to have a smaller L1 D-cache but enabling it to reach lower power targets. Larrabee is a little
less constrained and thus we have conventional balanced L1 caches, at 4x the size of that in
the original Pentium.
The Pentium had no on-die L2 cache; it relied on external SRAM installed on the motherboard. In order to maintain good desktop performance Atom came equipped with a 512KB L2 cache, while each Larrabee core will feature a 256KB L2 cache. Larrabee's architecture does stress the importance of large, fast caches as you'll soon see, but 256KB is the right size for Intel's architecture at this point. Larrabee's default OpenGL/DirectX renderer is tile based, and it turns out that most 64x64 or 128x128 tiles with 32-bit color/32-bit Z can fit in a 128KB space, leaving the other 128KB free for caching additional data. And remember, this is just on one Larrabee core - the whole GPU will be built out of many more.
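As a quick back-of-the-envelope check on those tile numbers (our arithmetic, not a published Intel figure), each pixel in such a tile carries 4 bytes of color plus 4 bytes of Z:

    // Rough tile footprint for a tile-based renderer: 32-bit color + 32-bit Z = 8 bytes per pixel.
    #include <cstdio>

    int main() {
        const int bytes_per_pixel = 4 /* color */ + 4 /* Z */;
        const int sides[] = {64, 128};
        for (int side : sides) {
            int kilobytes = side * side * bytes_per_pixel / 1024;
            std::printf("%dx%d tile: %d KB\n", side, side, kilobytes);  // prints 32 KB and 128 KB
        }
        return 0;
    }

A 128x128 tile fills exactly half of a core's 256KB L2 partition, which is where the "128KB left over" figure comes from.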
The big difference between Larrabee, Pentium and Atom is on the vector execution side. The original Pentium had no SIMD units, Atom added support for SSE, and Larrabee takes a giant leap with a massive 16-wide vector ALU. This unit is able to work on up to sixteen 32-bit floating point operations simultaneously, making it far wider than any of the aforementioned cores.
Given the nature of the applications that Larrabee will be targeting, such a wide vector unit
makes total sense.
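To put that width in perspective, here is a minimal sketch in plain C++ (the function name saxpy16 is ours, and this is not Larrabee's native instruction set, which Intel has not disclosed) of the kind of loop a 16-wide single-precision ALU could retire as a single vector operation:

    // All sixteen of these single-precision multiply-adds map onto one 16-wide vector op.
    void saxpy16(float a, const float* x, const float* y, float* out) {
        for (int lane = 0; lane < 16; ++lane) {   // 16 lanes = one Larrabee-style vector
            out[lane] = a * x[lane] + y[lane];
        }
    }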
Other changes to the Pentium core that made it into Larrabee include 64-bit x86 support and hardware prefetchers, although it is unknown how these compare to Atom's prefetchers. It is a fair guess that prefetching will include optimizations for data-parallel situations, but whether this is in addition to other prefetch technology or a
replacement for it is something we'll have to wait to find out.
The vector unit is key: within it you've got a ton of registers and a very wide vector ALU, which leads us to the fundamental building block of Larrabee. NVIDIA's GT200 is built out of Streaming Processors, AMD's RV770 out of Stream Processing Units, and Larrabee's performance comes from its 16-wide vector ALUs.
The vector ALU can behave as a 16-wide single precision ALU or an 8-wide double precision ALU, although that doesn't necessarily translate into equivalent throughput (which Intel would
not at this point clarify). Compared to ATI and NVIDIA, here's how Larrabee looks at a basic
execution unit level:
NVIDIA's SPs work on a single operation, AMD's can work on five, and Larrabee's vector unit
can work on sixteen. NVIDIA has a couple hundred of these SPs in its high end GPUs, AMD
has 160 and Intel is expected to have anywhere from 16 - 32 of these cores in Larrabee. If
NVIDIA is on the tons-of-simple-hardware end of the spectrum, Intel is on the exact
opposite end of the scale.
We've already shown that AMD's architecture requires a lot of help from the compiler to properly schedule and maximize the utilization of the execution resources within one of its 5-wide SPs; with Larrabee, the importance of the compiler is tremendous. Luckily for Larrabee,
some of the best (if not the best) compilers are made by Intel. If anyone could get away with
this sort of an architecture, it's Intel.
At the same time, while we don't have a full understanding of the details yet, we get the idea that Larrabee's vector unit is something of a chameleon. From the information we have, these vector units could execute atomic 16-wide ops for a single thread of a running program and can handle register swizzling across all 16 execution units. This implies something very AMD-like and wide. But it also looks like each of the 16 vector execution units, using the mask registers, can branch independently (looking much more like NVIDIA's solution).
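As an illustration of the mask-register idea (a generic SIMD-predication sketch in plain C++ with names of our choosing, not Larrabee's documented behavior), lanes don't literally branch; both paths are evaluated and a per-lane mask selects which result each lane keeps:

    // Generic SIMD predication: a mask blends results instead of branching per lane.
    void masked_select(const bool mask[16], const float a[16],
                       const float b[16], float out[16]) {
        for (int lane = 0; lane < 16; ++lane) {
            float taken     = a[lane] * 2.0f;   // "then" path, computed for every lane
            float not_taken = b[lane] + 1.0f;   // "else" path, computed for every lane
            out[lane] = mask[lane] ? taken : not_taken;  // mask register picks per lane
        }
    }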
We've already seen how AMD's and NVIDIA's architectural differences show distinct advantages and disadvantages against each other in different games. If Intel is able to adapt
the way the vector unit is used to suit specific situations, they could have something huge
on their hands. Again, we don't have enough detail to tell what's going to happen, but things
do look very interesting.
Intel is keeping two important details of Larrabee very quiet: the details of the instruction
set and the configuration of the finished product. Remember that Larrabee won't ship until
sometime in 2009 or 2010 and the first chips aren't even back from the fab yet, so not wanting to discuss how many cores Intel will be able to fit on a single Larrabee GPU makes sense. The final product will be built from some multiple of 8 Larrabee cores; we originally expected to see something in the 24-to-32 core range, but that largely depends on the targeted die size, as we'll soon explain.
Intel's own block diagrams indicated two memory controller partitions, but it's unclear
whether or not we should read into this. AMD and NVIDIA both use 64-bit memory
controllers and simply group multiples of them on a single chip. Given that Intel's Larrabee
will be more memory bandwidth efficient than what AMD and NVIDIA have put out, it's
quite possible that Larrabee could have a 128-bit memory interface, although we do believe
that'd be a very conservative move (we'd expect a 256-bit interface). Coupled with GDDR5 (which should be both cheaper and highly available by the Larrabee timeframe), however, anything is possible.
All of the cores are connected via a bi-directional ring bus (512 bits in each direction),
presumably running at core speed. Given that Larrabee is expected to run at 2GHz+, this is
going to be one very high-bandwidth bus. This is half the bit-width of AMD's R600/RV670
ring bus, but the higher speed should more than make up the difference.
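A back-of-the-envelope bandwidth estimate (our own arithmetic, assuming the ring really does run at a ~2GHz core clock, which Intel has not confirmed):

    // Ring bus bandwidth estimate: 512 bits = 64 bytes per direction per clock.
    #include <cstdio>

    int main() {
        const double bytes_per_clock_per_dir = 512.0 / 8.0;  // 64 bytes
        const double clock_ghz = 2.0;                        // assumed core clock
        const double per_dir_gbs = bytes_per_clock_per_dir * clock_ghz;  // GB/s, one direction
        std::printf("~%.0f GB/s per direction, ~%.0f GB/s aggregate\n",
                    per_dir_gbs, 2.0 * per_dir_gbs);         // ~128 and ~256 GB/s
        return 0;
    }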
AMD recently abandoned its ring bus memory architecture, citing die-area savings and a lack of need for such a robust solution. A ring bus, as memory buses go, is fairly straightforward and less complex than other options. The disadvantage is that it takes a lot of wires and it delivers high bandwidth to all the clients on the bus whether they need it
or not. Of course, if all your memory clients need or can easily use high bandwidth then
that's a win for the ring bus.
Intel may have a better reason for going with a ring bus than AMD did: cache coherency and inter-core communication. Partitioning the L2 and using the ring bus to maintain coherency and facilitate communication could make good use of this massive amount of data-moving power. While Cell also allows for internal communication, Intel's solution of providing direct
access to low latency, coherent L1 and L2 partitions while enabling massive bandwidth
behind the L2 cache could result in a much faster and easier to program architecture when
data sharing is required.
How Many Cores in a Larrabee?
Initial estimates put Larrabee somewhere in the 16 to 32-core range; we figured 32 cores would be a sweet spot (not least because Intel's charts and graphs showed diminishing returns beyond 32 cores) but that 24 cores would be more likely for an initial product. Intel, however, shared some data that made us question all of that.
Remember the design experiment? Intel was able to fit a 10-core Larrabee into the space of
a Core 2 Duo die. Given the specs of the Core 2 Duo Intel used (4MB L2 cache), it appears to
be a 65nm Conroe/Merom based Core 2 Duo - with a 143 mm^2 die size.
At 143 mm^2, Intel could fit 10 Larrabee-like cores, so let's double that. Now we're at 286mm^2 (still smaller than GT200 and about the size of AMD's RV770) with 20 cores. Double that once more and we've got 40 cores on a 572mm^2 die, virtually the same size as NVIDIA's GT200 but on a 65nm process.
The move to 45nm could scale as well as 50%, but chances are we'll see something closer to
60 - 70% of the die size simply by moving to 45nm (which is the node that Larrabee will be
built on). Our 40-core Larrabee is now at ~370mm^2 on 45nm. If Intel wanted to push for an NVIDIA-like die size, we could easily see a 64-core Larrabee at launch for the high end, with 24 or 32-core versions aimed at the mainstream. Update: One thing we did not consider here is power limitations. So while Intel may be able to produce a 64-core Larrabee with a GT200-like die size, such a chip may exceed physical power limits. It's far more likely that we'll see something in the 16 to 32 core range at 45nm due to power constraints rather
than die size constraints.
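The scaling argument above boils down to simple arithmetic. Here it is spelled out (our own extrapolation from the 10-core/143 mm^2 data point, not Intel figures), with the 45nm shrink modeled as 60 to 70% of the 65nm area:

    // Die-area extrapolation from 10 cores in 143 mm^2 at 65nm.
    #include <cstdio>

    int main() {
        const double mm2_per_10_cores_65nm = 143.0;
        for (int cores = 10; cores <= 40; cores *= 2) {
            double area_65nm = mm2_per_10_cores_65nm * cores / 10.0;
            std::printf("%2d cores: ~%3.0f mm^2 at 65nm, ~%3.0f-%3.0f mm^2 at 45nm\n",
                        cores, area_65nm, area_65nm * 0.60, area_65nm * 0.70);
        }
        return 0;   // 40 cores: ~572 mm^2 at 65nm, roughly 340-400 mm^2 at 45nm
    }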
Cache and Memory Hierarchy: Architected for Low Latency Operation
Intel has had a lot of experience building very high performance caches. Intel's caches are denser than what AMD has been able to produce on the x86 microprocessor front, and as we saw in our Nehalem preview, Intel is also able to deliver significantly lower latency caches than the competition. Thus it should come as no surprise that
Larrabee's strengths come from being built on fully programmable x86 cores, and from
having very large, very fast coherent caches.
Each Larrabee core features 4x the L1 caches of the original Pentium. The Pentium had an 8KB L1 data cache and an 8KB L1 instruction cache; each Larrabee core has a 32KB/32KB L1 D/I cache. The reasoning is that each Larrabee core can work on 4x the threads of the original Pentium, and thus with an L1 four times as large the architecture remains balanced. The original Pentium didn't have an integrated L2 cache, but each Larrabee core has access to its own L2 cache partition - 256KB in size.
Larrabee's L2 pool grows with each core. An 8-core Larrabee would have 2MB of total L2 cache (256KB per core x 8 cores), while a 32-core Larrabee would have an 8MB L2 cache. Each core only has direct access to its own L2 partition; it can read/write its 256KB portion of the pool and that's it. Communication with other Larrabee cores happens over the ring bus: a single core will first look for data in its own L2, and if it doesn't find it there it will place the request on the ring bus, with the data eventually ending up in its L2.
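A tiny sanity check of that pool arithmetic (just the 256KB-per-core figure from above; the helper name is ours):

    // Total coherent L2 as a function of core count, at 256KB per core.
    constexpr int total_l2_kb(int cores) { return cores * 256; }
    static_assert(total_l2_kb(8)  == 2 * 1024, "8 cores -> 2MB of L2");
    static_assert(total_l2_kb(32) == 8 * 1024, "32 cores -> 8MB of L2");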
Intel doesn't attempt to hide latency nearly as much as NVIDIA does, instead relying on its high speed, low latency caches. The ratio of compute resources to cache size is much lower with Larrabee than with either AMD's or NVIDIA's architecture.
| | AMD RV770 | NVIDIA GT200 | Intel Larrabee |
| --- | --- | --- | --- |
| Scalar ops per L1 Cache | 80 | 24 | 16 |
| L1 Cache Size | 16KB | unknown | 32KB |
| Scalar ops per L2 Cache | 100 | 30 | 16 |
| L2 Cache Size | unknown | unknown | 256KB |
While both AMD and NVIDIA are very shy about giving out cache sizes, we do know that RV670 had a 256KB L2 cache for the entire chip, and we can expect RV770 to have something larger, but not large enough to come close to what Intel has with Larrabee. NVIDIA is much
closer in the compute-to-cache ratio than AMD, which makes sense given its approach to
designing much larger GPUs, but we have no reason to believe that NVIDIA has larger caches
on the GT200 die than Intel with Larrabee.
The caches are fully coherent, just like they are on a multi-core desktop CPU. The fully coherent caches make for some interesting cases when looking at multi-GPU
configurations. While Intel wouldn't get specific with multi-GPU Larrabee plans, it did state
that with a multi-GPU Larrabee setup Intel doesn't "expect to have quite as much pain as
they [AMD/NVIDIA] do".
We asked whether there was any limitation to maintaining cache coherence across multiple chips, and the answer was that it could be possible with enough bandwidth between the
two chips. While NVIDIA and AMD are still adding bits and pieces to refine multi-GPU
rendering, Intel could have a very robust solution right out of the gate if desired (think
shared framebuffer and much more efficient work load division for a single frame).
Programming for Larrabee
The Larrabee programming model is what sets it apart from the competition. While
competing GPU architectures have become increasingly programmable over the years,
Larrabee starts from a position of being fully programmable. To the developer, it appears as
exactly what it is - an arrangement of fully cache coherent x86 microprocessors. The first
iteration of Larrabee will hide this fact from the OS through its graphics driver, but future
versions of the chip could conceivably populate task manager just like your desktop x86
cores do today.
You have two options for harnessing the power of Larrabee: writing standard
DirectX/OpenGL code, or writing directly to the hardware using Larrabee C/C++, which as it
turns out is standard C (you can use compilers from MS, Intel, GCC, etc...). In a sense, this is
no different than what NVIDIA offers with its GPUs - they will run DirectX/OpenGL code, or
they can also run C code thanks to CUDA. The difference here is that writing directly to Larrabee gives you some additional programming flexibility thanks to the GPU being an array of fully functional x86 cores. Programming for x86 architectures is a paradigm that the software community as a whole is used to: there's no learning curve, no new hardware limitations to worry about, and no waiting on additional iterations of CUDA to enable new features. You treat Larrabee like you treat your host CPU.
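As a hedged illustration of what "treat Larrabee like your host CPU" could mean in practice (standard C++ threading used here as a stand-in; run_on_all_threads is our own name, not a Larrabee API, and the native toolchain details have not been published), you would simply spawn one software thread per hardware thread and hand each a slice of the work, exactly as on a multi-core desktop chip:

    #include <cstddef>
    #include <thread>
    #include <vector>

    // Split n work items across however many hardware threads the chip exposes.
    void run_on_all_threads(std::size_t n, void (*kernel)(std::size_t begin, std::size_t end)) {
        unsigned hw = std::thread::hardware_concurrency();   // e.g. cores x 4 SMT threads
        if (hw == 0) hw = 1;
        std::vector<std::thread> pool;
        const std::size_t chunk = (n + hw - 1) / hw;
        for (unsigned t = 0; t < hw; ++t) {
            std::size_t begin = t * chunk;
            std::size_t end   = begin + chunk < n ? begin + chunk : n;
            if (begin >= end) break;
            pool.emplace_back(kernel, begin, end);            // ordinary threads, no driver calls
        }
        for (auto& th : pool) th.join();
    }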
Game developers aren't big on learning new tricks however, especially on an unproven,
unreleased hardware platform such as Larrabee. Larrabee must run DirectX/OpenGL code
out of the box, and to do this Intel has written its own Larrabee native software renderer to
interface between DX/OGL and the Larrabee hardware.
In AMD/NVIDIA GPUs, DirectX/OpenGL instructions map to an internal GPU instruction set
at runtime. With Larrabee Intel does this mapping in software, taking DX/OGL instructions,
mapping them to its software renderer, and then running its software renderer on the
Larrabee hardware.
This intermediate stage should incur a performance penalty, as writing directly to Larrabee is always going to be faster. However, Intel has apparently produced a highly optimized software renderer for Larrabee, one that's efficient enough that any performance penalty introduced by the intermediate stage is made up for by the reduction in memory bandwidth enabled by the software renderer (we'll get to how this is possible in a moment).
Developers can also take a hybrid approach to Larrabee development. Larrabee can run standard DX/OGL code, but if there are features developers want to implement that aren't enabled in the current DirectX version, they can simply write those features in Larrabee C/C++.
Without hardware it's difficult to tell exactly how well Larrabee will run DirectX/OpenGL code, but Intel knows it must run current games very well in order to make this GPU a success.