LARRABEE ARCHITECTURE:

Intel's Larrabee is built out of a number of x86 cores that look, at a very high level, like this:

Each core is a dual-issue, in-order architecture loosely derived from the original Pentium microprocessor. The Pentium core was modified to include support for 64-bit operations,
the updates to the x86 instruction set, larger caches, 4-way SMT/Hyper-Threading and a 16-wide vector ALU.

While the team that ended up working on Atom may have originally worked on the Larrabee cores, there are some significant differences between Larrabee and Atom. Atom is geared towards much higher single-threaded performance, with a deeper pipeline, a larger L2 cache and additional microarchitectural tweaks to improve general desktop performance.

                              Intel Larrabee Core        Intel Pentium Core (P54C)  Intel Atom Core
Manufacturing Process         45nm                       0.60µm                     45nm
Simultaneous Multi-Threading  4-way                      1-way                      2-way
Issue Width                   dual-issue                 dual-issue                 dual-issue
Pipeline Depth                5-stages (?)               5-stages                   16-stages
Scalar Execution Resources    2 x Integer ALUs (?)       2 x Integer ALUs           2 x Integer ALUs
                              1 x FPU (?)                1 x FPU                    1 x FPU
Vector Execution Resources    16-wide Vector ALU         None                       1 x SIMD SSE
L1 Cache (I/D)                32KB/32KB                  8KB/8KB                    32KB/24KB
L2 Cache                      256KB                      None (External)            512KB
ISA                           64-bit x86                 32-bit x86                 64-bit x86
                              SSEn support?                                         Full Merom ISA
                              Parallel/Graphics?                                    compatibility

Larrabee on the other hand is more Pentium-like to begin with; Intel states that Larrabee's execution pipeline is "short" and followed up with us by saying that it's closer to the 5-stage pipeline of the original Pentium than the 16-stage pipeline of Atom. While both Atom and Larrabee support SMT (Simultaneous Multi-Threading), Larrabee can work on four threads concurrently compared to two on Atom and one on the original Pentium.

L1 cache sizes are similar between Larrabee and Atom, but Larrabee gets a full 32KB data cache compared to 24KB on Atom. If you remember back to our architectural discussion of Atom, the smaller L1 D-cache was a side effect of going to a register file instead of a small signal array for the cache. Die size increased but operating voltage decreased, forcing Atom
to have a smaller L1 D-cache but enabling it to reach lower power targets. Larrabee is a little less constrained and thus we have conventional balanced L1 caches, at 4x the size of that in the original Pentium.

The Pentium had no on-die L2 cache; it relied on external SRAM installed on the motherboard. In order to maintain good desktop performance Atom came equipped with a 512KB L2 cache, while each Larrabee core will feature a 256KB L2 cache. Larrabee's architecture does stress the importance of large, fast caches as you'll soon see, but 256KB is the right size for Intel's architecture at this point. Larrabee's default OpenGL/DirectX renderer is tile based and it turns out that most 64x64 or 128x128 tiles with 32-bit color/32-bit Z can fit in a 128KB space, leaving an additional 128KB left over for caching additional data. And remember, this is just on one Larrabee core - the whole GPU will be built out of many more.

The big difference between Larrabee, Pentium and Atom is in the vector execution side. The original Pentium had no SIMD units, Atom added support for SSE and Larrabee takes a giant leap with a massive 16-wide vector ALU. This unit is able to work on up to 16 32-bit floating point operations simultaneously, making it far wider than any of the aforementioned cores. Given the nature of the applications that Larrabee will be targeting, such a wide vector unit makes total sense.

Other changes to the Pentium core that made it into Larrabee are things like 64-bit x86 support and hardware prefetchers, although it is unknown how these compare to Atom's prefetchers. It is a fair guess to say that prefetching will include optimizations for data parallel situations, but whether this is in addition to other prefetch technology or a replacement for it is something we'll have to wait to find out.
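
As a quick check on the tile-size claim earlier in this section, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: it assumes exactly 8 bytes per pixel (32-bit color plus 32-bit Z) with no multisampling or other per-pixel data, and the function and constant names are made up for illustration.

```python
# Rough tile working-set estimate, assuming 32-bit color + 32-bit Z per pixel.
# Names here are illustrative only; they do not come from Intel.
BYTES_PER_PIXEL = 4 + 4      # 4 bytes of color + 4 bytes of depth
L2_BYTES = 256 * 1024        # per-core L2 size quoted in the article


def tile_bytes(width, height, bytes_per_pixel=BYTES_PER_PIXEL):
    """Return the number of bytes needed to hold one tile's color and Z data."""
    return width * height * bytes_per_pixel


for side in (64, 128):
    size = tile_bytes(side, side)
    left = L2_BYTES - size
    print(f"{side}x{side} tile: {size // 1024}KB used, {left // 1024}KB of L2 left")
```

Under these assumptions a 128x128 tile is the case that lands exactly on the 128KB figure, while a 64x64 tile needs only a quarter of that.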
The vector unit is key and within that unit you've got a ton of registers and a very wide vector ALU, which leads us to the fundamental building block of Larrabee. NVIDIA's GT200 is built out of Streaming Processors, AMD's RV770 out of Stream Processing Units and Larrabee's performance comes from these 16-wide vector ALUs:
The vector ALU can behave as a 16-wide single precision ALU or an 8-wide double precision ALU, although that doesn't necessarily translate into equivalent throughput (which Intel would not at this point clarify). Compared to ATI and NVIDIA, here's how Larrabee looks at a basic execution unit level:
NVIDIA's SPs work on a single operation, AMD's can work on five, and Larrabee's vector unit can work on sixteen. NVIDIA has a couple hundred of these SPs in its high end GPUs, AMD has 160 and Intel is expected to have anywhere from 16 - 32 of these cores in Larrabee. If NVIDIA is on the tons-of-simple-hardware end of the spectrum, Intel is on the exact opposite end of the scale.

We've already shown that AMD's architecture requires a lot of help from the compiler to properly schedule and maximize the utilization of its execution resources within one of its 5-wide SPs; with Larrabee the importance of the compiler is tremendous. Luckily for Larrabee, some of the best (if not the best) compilers are made by Intel. If anyone could get away with this sort of an architecture, it's Intel.

At the same time, while we don't have a full understanding of the details yet, we get the idea that Larrabee's vector unit is sort of a chameleon. From the information we have, these vector units could execute atomic 16-wide ops for a single thread of a running program and
can handle register swizzling across all 16 execution units. This implies something very AMD-like and wide. But it also looks like each of the 16 vector execution units, using the mask registers, can branch independently (looking very much more like NVIDIA's solution).

We've already seen how AMD and NVIDIA architectural differences show distinct advantages and disadvantages against each other in different games. If Intel is able to adapt the way the vector unit is used to suit specific situations, they could have something huge on their hands. Again, we don't have enough detail to tell what's going to happen, but things do look very interesting.

Intel is keeping two important details of Larrabee very quiet: the details of the instruction set and the configuration of the finished product. Remember that Larrabee won't ship until sometime in 2009 or 2010 and the first chips aren't even back from the fab yet, so not wanting to discuss how many cores Intel will be able to fit on a single Larrabee GPU makes sense.

The final product will be some assembly of a multiple of 8 Larrabee cores; we originally expected to see something in the 24-to-32 core range but that largely depends on targeted die size, as we'll soon explain:

Intel's own block diagrams indicated two memory controller partitions, but it's unclear whether or not we should read into this. AMD and NVIDIA both use 64-bit memory controllers and simply group multiples of them on a single chip. Given that Intel's Larrabee will be more memory bandwidth efficient than what AMD and NVIDIA have put out, it's quite possible that Larrabee could have a 128-bit memory interface, although we do believe that'd be a very conservative move (we'd expect a 256-bit interface). Coupled with GDDR5 (which should be both cheaper and highly available by the Larrabee timeframe), however, anything is possible.
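
To make the mask register idea from the vector unit discussion above a little more concrete, here is a minimal Python sketch of how a 16-lane unit can let each lane take its own side of a branch. This is a generic illustration of predicated (masked) execution, not Larrabee's actual instruction set, which Intel has not disclosed; `masked_op` and `abs_diff` are hypothetical names.

```python
LANES = 16  # width of the vector unit described in the article


def masked_op(op, dest, a, b, mask):
    """Apply op lane by lane; lanes whose mask bit is False keep dest's old value."""
    assert len(dest) == len(a) == len(b) == len(mask) == LANES
    return [op(x, y) if m else d for d, x, y, m in zip(dest, a, b, mask)]


def abs_diff(a, b):
    """Per-lane 'if a > b: r = a - b else: r = b - a' done as two masked passes."""
    mask = [x > y for x, y in zip(a, b)]          # one mask bit per lane
    result = [0.0] * LANES
    # Pass 1: lanes where the condition holds execute the 'then' side.
    result = masked_op(lambda x, y: x - y, result, a, b, mask)
    # Pass 2: the remaining lanes execute the 'else' side.
    result = masked_op(lambda x, y: y - x, result, a, b, [not m for m in mask])
    return result


a = [float(i) for i in range(LANES)]
b = [8.0] * LANES
print(abs_diff(a, b))
```

Every lane runs both passes, but the mask decides which pass is allowed to write its result, so the 16 lanes appear to branch independently while the unit still issues one wide instruction at a time.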
All of the cores are connected via a bi-directional ring bus (512 bits in each direction), presumably running at core speed. Given that Larrabee is expected to run at 2GHz+, this is going to be one very high-bandwidth bus. This is half the bit-width of AMD's R600/RV670 ring bus, but the higher speed should more than make up the difference.

AMD recently abandoned their ring bus memory architecture, citing a savings in die area and a lack of need for such a robust solution as the reason. A ring bus, as memory buses go, is fairly straightforward and less complex than other options. The disadvantage is that it is a lot of wires and it delivers high bandwidth to all the clients on the bus whether they need it or not. Of course, if all your memory clients need or can easily use high bandwidth then that's a win for the ring bus.

Intel may have a better use for going with the ring bus than AMD: cache coherency and inter-core communication. Partitioning the L2 and using the ring bus to maintain coherency and facilitate communication could make good use of this massive amount of data moving power. While Cell also allows for internal communication, Intel's solution of providing direct access to low latency, coherent L1 and L2 partitions while enabling massive bandwidth behind the L2 cache could result in a much faster and easier to program architecture when data sharing is required.

How Many Cores in a Larrabee?

Initial estimates put Larrabee at somewhere in the 16 to 32-core range; we figured 32 cores would be a sweet spot (not in the least because Intel's charts and graphs showed diminishing returns over 32 cores) but 24 cores would be more likely for an initial product. Intel however shared some data that made us question all of that.
Remember the design experiment? Intel was able to fit a 10-core Larrabee into the space of a Core 2 Duo die. Given the specs of the Core 2 Duo Intel used (4MB L2 cache), it appears to be a 65nm Conroe/Merom based Core 2 Duo - with a 143 mm^2 die size.

At 143 mm^2, Intel could fit 10 Larrabee-like cores so let's double that. Now we're at 286 mm^2 (still smaller than GT200 and about the size of AMD's RV770) and 20 cores. Double that once more and we've got 40 cores and have a 572 mm^2 die, virtually the same size as NVIDIA's GT200 but on a 65nm process.

The move to 45nm could scale as well as 50%, but chances are we'll see something closer to 60 - 70% of the die size simply by moving to 45nm (which is the node that Larrabee will be built on). Our 40-core Larrabee is now at ~370 mm^2 on 45nm. If Intel wanted to push for an NVIDIA-like die size we could easily see a 64-core Larrabee at launch for the high end, with 24 or 32-core versions aiming at the mainstream.

Update: One thing we did not consider here is power limitations. So while Intel may be able to produce a 64-core Larrabee with a GT200-like die size, such a chip may exceed physical power limitations. It's far more likely that we'll see something in the 16 - 32 core range at 45nm due to power constraints rather than die size constraints.

Cache and Memory Hierarchy: Architected for Low Latency Operation

Intel has had a lot of experience building very high performance caches. Intel's caches are more dense than what AMD has been able to produce on the x86 microprocessor front, and as we saw in our Nehalem preview - Intel is also able to deliver significantly lower latency caches than the competition as well. Thus it should come as no surprise to anyone that Larrabee's strengths come from being built on fully programmable x86 cores, and from having very large, very fast coherent caches.

Each Larrabee core features 4x the L1 caches of the original Pentium.
The Pentium had an 8KB L1 data cache and an 8KB L1 instruction cache; each Larrabee core has a 32KB/32KB L1 D/I cache. The reasoning is that each Larrabee core can work on 4x the threads of the original Pentium and thus with a 4x as large L1 the architecture remains balanced. The original Pentium didn't have an integrated L2 cache, but each Larrabee core has access to its own L2 cache partition - 256KB in size.

Larrabee's L2 pool increases with each core. An 8-core Larrabee would have 2MB of total L2 cache (256KB per core x 8 cores), a 32-core Larrabee would have an 8MB L2 cache. Each core only has access to its L2 cache partition; it can read/write to its 256KB portion of the pool and that's it. Communication with other Larrabee cores happens over the ring bus; a single core will look for data in its L2 cache, if it doesn't find it there it will place the request on the ring bus and will eventually find the data in its L2.

Intel doesn't attempt to hide latency nearly as much as NVIDIA does, instead relying on its high speed, low latency caches. The ratio of compute resources to cache size is much lower with Larrabee than either AMD or NVIDIA's architectures.
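
The per-core L2 partitioning described above is simple enough to sketch in Python. The 256KB partition size and 16-wide vector unit come from the article; the function names are hypothetical, and the last line is just one rough way to express how much L2 sits behind each vector lane.

```python
L2_PER_CORE_KB = 256   # per-core L2 partition size from the article
VECTOR_LANES = 16      # width of each core's vector ALU


def total_l2_mb(cores):
    """Total L2 pool across the chip, since each core brings its own partition."""
    return cores * L2_PER_CORE_KB / 1024


def l2_kb_per_lane():
    """L2 capacity available per vector lane within a single core."""
    return L2_PER_CORE_KB / VECTOR_LANES


for cores in (8, 16, 24, 32):
    print(f"{cores} cores: {total_l2_mb(cores):.0f}MB total L2")
print(f"{l2_kb_per_lane():.0f}KB of L2 per vector lane")
```

Because the pool grows linearly with core count, the amount of L2 per vector lane stays constant no matter how many cores the final product ships with.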
                          AMD RV770   NVIDIA GT200   Intel Larrabee
Scalar ops per L1 Cache   80          24             16
L1 Cache Size             16KB        unknown        32KB
Scalar ops per L2 Cache   100         30             16
L2 Cache Size             unknown     unknown        256KB

While both AMD and NVIDIA are very shy on giving out cache sizes, we do know that RV670 had a 256KB L2 cache for the entire chip and can expect RV770 to have something larger, but not large enough to come close to what Intel has with Larrabee. NVIDIA is much closer in the compute-to-cache ratio than AMD, which makes sense given its approach to designing much larger GPUs, but we have no reason to believe that NVIDIA has larger caches on the GT200 die than Intel with Larrabee.

The caches are fully coherent, just like they are on a multi-core desktop CPU. The fully coherent caches make for some interesting cases when looking at multi-GPU configurations. While Intel wouldn't get specific with multi-GPU Larrabee plans, it did state that with a multi-GPU Larrabee setup Intel doesn't "expect to have quite as much pain as they [AMD/NVIDIA] do".

We asked whether there was any limitation to maintaining cache coherence across multiple chips and the answer was that it could be possible with enough bandwidth between the two chips. While NVIDIA and AMD are still adding bits and pieces to refine multi-GPU rendering, Intel could have a very robust solution right out of the gate if desired (think shared framebuffer and much more efficient work load division for a single frame).

Programming for Larrabee

The Larrabee programming model is what sets it apart from the competition. While competing GPU architectures have become increasingly programmable over the years, Larrabee starts from a position of being fully programmable. To the developer, it appears as exactly what it is - an arrangement of fully cache coherent x86 microprocessors.
The first iteration of Larrabee will hide this fact from the OS through its graphics driver, but future versions of the chip could conceivably populate task manager just like your desktop x86 cores do today.

You have two options for harnessing the power of Larrabee: writing standard DirectX/OpenGL code, or writing directly to the hardware using Larrabee C/C++, which as it turns out is standard C (you can use compilers from MS, Intel, GCC, etc...). In a sense, this is no different than what NVIDIA offers with its GPUs - they will run DirectX/OpenGL code, or they can also run C code thanks to CUDA. The difference here is that writing directly to Larrabee gives you some additional programming flexibility thanks to the GPU being an
array of fully functional x86 CPUs. Programming for x86 architectures is a paradigm that the software community as a whole is used to; there's no learning curve, no new hardware limitations to worry about and no waiting on additional iterations of CUDA to enable new features. You treat Larrabee like you treat your host CPU.

Game developers aren't big on learning new tricks however, especially on an unproven, unreleased hardware platform such as Larrabee. Larrabee must run DirectX/OpenGL code out of the box, and to do this Intel has written its own Larrabee native software renderer to interface between DX/OGL and the Larrabee hardware.

In AMD/NVIDIA GPUs, DirectX/OpenGL instructions map to an internal GPU instruction set at runtime. With Larrabee Intel does this mapping in software, taking DX/OGL instructions, mapping them to its software renderer, and then running its software renderer on the Larrabee hardware.

This intermediate stage should incur a performance penalty, as writing directly to Larrabee is always going to be faster. However Intel has apparently produced a highly optimized software renderer for Larrabee, one that's efficient enough so that any performance penalty introduced by the intermediate stage is made up for by the reduction of memory bandwidth enabled by the software renderer (we'll get to how this is possible in a moment).

Developers can also use a hybrid approach to Larrabee development. Larrabee can run standard DX/OGL code but if there are features developers want to implement that aren't enabled in the current DirectX version, they can simply write those features in Larrabee C/C++.

Without hardware it's difficult to tell exactly how well Larrabee will run DirectX/OpenGL code, but Intel knows it must succeed at running current games very well in order to make this GPU a success.