Tegra 4i expands the market



  1. TEGRA 4I EXPANDS MARKET

Cortex-A9r4 CPU Peps Up Nvidia’s First Integrated Processor

By Linley Gwennap (March 4, 2013)

© The Linley Group • Microprocessor Report March 2013

The moment Nvidia hinted at when it purchased Icera has now arrived. The new Tegra 4i processor (formerly known by its code-name, Grey) combines Icera’s cellular technology with Nvidia’s application processors to create an integrated smartphone processor. Although Nvidia is far from the first to make this combination, the new product will greatly expand the company’s target market, which today is restricted to tablets and premium smartphones (Nvidia calls them superphones). By reducing the cost and size of reference designs, Tegra 4i aims for mainstream smartphones that sell for a few hundred dollars.

Nvidia demonstrated the new chip at Mobile World Congress and expects it to ship in commercial smartphones by the end of this year. Tegra 4i offers four Cortex-A9 CPUs, upgraded to release 4 (r4), running at clock speeds as high as 2.3GHz. As Table 1 shows, the graphics unit is considerably improved from Tegra 3 and is similar to that of Tegra 4. The new chip also includes Tegra 4’s computational-photography unit. Its integrated cellular modem, based on Nvidia’s standalone i500 chip, is compatible with LTE networks as well as older 3G services.

Extending Cortex-A9

Tegra 4i is the first processor to use the r4 version of Cortex-A9. Unlike previous releases, which contained mostly bug fixes, the A9r4 includes some significant improvements to the branch predictor, TLB, and cache-memory system. Because the basic microarchitecture stays the same, these improvements have no effect on simple benchmarks such as Dhrystone. But Nvidia has measured a 15% improvement in SPECint performance and a 25% gain in BrowserBench performance at the same clock speed.

ARM announced Cortex-A9 in 2007, eons ago in the fast-moving mobile market. The company designed it to run simple user interfaces and small programs such as email readers; modern apps did not yet exist (the Apple App Store opened in July 2008). In this environment, ARM focused on die area and cost, keeping critical elements in the CPU small. Since that time, Cortex-A9 has been applied to many new tasks, but the design is poorly optimized for some of them.

The branch history table (BHT), for example, has only 512 entries in the original Cortex-A9. This size is fine for small programs, but in more complex software, multiple active branches often hash to the same entry, confusing the BHT’s predictions. The A9r4 expands the BHT to 2,048

                      Tegra 3               Tegra 4i               Tegra 4
Main CPU Cores        4xCortex-A9           4xCortex-A9r4          4xCortex-A15
Max CPU Speed         1.7GHz                2.3GHz                 1.9GHz
Companion Core?       Yes                   Yes                    Yes
L2 Cache Size         1MB                   1MB                    2MB
SPECint Score*        590                   920                    1,168
GPU Shaders           8 texture + 4 vertex  48 texture + 12 vertex 48 texture + 24 vertex
GPU Clock Speed       520MHz                660MHz                 672MHz
GLBenchmark 2.5†      12fps                 30fps§                 57fps
Video Decode‡         1080p 24fps           1080p 60fps            4K 30fps
Photog Engine?        No                    Yes                    Yes
DRAM Channels         1x32-bit              1x32-bit               2x32-bit
Max DRAM Speed        LPDDR2-1066           LPDDR3-2133            LPDDR3-2133
LTE Baseband          External              Integrated             External
Process Technology    40nm LP               28nm HPM               28nm HPL
Die Size              80mm²§                62mm²§                 85mm²§
Package               14mm PoP              12mm PoP               14mm PoP
Production            4Q11                  4Q13 (est)             2Q13 (est)

Table 1. Comparison of recent Nvidia Tegra processors. Performance data for Tegra 4/4i is preliminary. *SPECint2000_base compiled using GCC -O3; †Egypt C24Z16 Offscreen 1080p; ‡H.264 High Profile. (Source: Nvidia, except §The Linley Group estimate)
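The aliasing problem described above for the 512-entry BHT can be illustrated with a short sketch. The index function here is an assumption made purely for illustration (low bits of the word-aligned branch address); the article does not describe the hash that Cortex-A9 actually uses, and the branch addresses are hypothetical.

```python
def aliased_branches(branch_pcs, bht_entries):
    """Return how many branches share a BHT entry with another branch."""
    buckets = {}
    for pc in branch_pcs:
        # Assumed index function: low bits of the word-aligned branch address.
        index = (pc >> 2) % bht_entries
        buckets.setdefault(index, []).append(pc)
    return sum(len(pcs) for pcs in buckets.values() if len(pcs) > 1)


# Hypothetical hot branches spaced 2KB apart (for example, one per loop body).
branch_pcs = [0x1000 + k * 0x800 for k in range(4)]

for entries in (512, 2048):
    print(entries, "entries:", aliased_branches(branch_pcs, entries), "aliased branches")
```

Under this assumed index, all four branches collide in a 512-entry table and none collide in a 2,048-entry table. Real code and a real hash will behave less neatly, but the direction of the effect is the same.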
  2. entries (the same size as Cortex-A15’s). Nvidia measured the branch-misprediction rate of one SPECint2000 program at 48% on the original A9 and only 8% on the A9r4. This example is extreme, but many programs will see some benefit from the expanded BHT.

Similarly, Cortex-A9’s original TLB had 128 entries, providing address translations for 512KB of data (using standard 4KB pages). This space is enough for small programs, but complex modern apps need more. The A9r4 increases the TLB to 512 entries, offering access to four times as much data without thrashing the TLB.

The new CPU retains the same 32KB data cache as in previous Cortex-A9 designs, but it improves the prefetcher’s effectiveness. The original A9 included prefetch logic that attempted to detect a series of sequential memory accesses and continue fetching additional data before it was needed. It was the first ARM CPU with this feature, however, and the prefetcher too often fetched the wrong data, wasting cycles and power; most operating systems simply turn off this feature. The new prefetcher, based on a few more years of experience, correctly handles most common access patterns.

As with all Cortex CPUs, ARM implemented the new design. Nvidia provided vigorous input regarding the changes and is the lead customer for this version. ARM will deliver the r4 design to other Cortex-A9 licensees, so we expect to see this version become more widespread over time.

The design improvements are unusual for an existing core, and their performance impact is significant. ARM declined to create a new name for this core, perhaps to avoid diluting its emphasis on Cortex-A15. We believe Cortex-A10 or Cortex-A9+ would be more appropriate monikers than Cortex-A9 r4. (Readers should avoid confusing this core with Cortex-R4, a low-end ARM design intended for real-time applications.)

Flooring the Gas Pedal

After working with ARM to rev up the CPU’s logical design, Nvidia then optimized the physical design to maximize its clock speed. Tegra 4i targets 2.3GHz, compared with 1.7GHz for the Cortex-A9 CPUs in Tegra 3+. Much of this speed boost comes from the shrink to TSMC’s 28nm HPM process, but 2.3GHz is among the fastest Cortex-A9 speeds yet announced, trailing only ST-Ericsson’s 2.5GHz design (which uses exotic 28nm FD-SOI technology).

Tegra 4i’s power curve will affect its working speeds. The company did not disclose the chip’s power, but we expect it requires overvoltage to achieve its top speed, pushing power to about 5W with all four CPUs running at 2.3GHz. To fit Tegra 4i into the power envelope of a smartphone, Nvidia is likely to limit the CPUs to a slower clock rate (perhaps 1.8GHz) when all four are running. With only a single CPU running, however, the chip should operate at its full rated speed.

As in Tegra 3, Tegra 4i includes a fifth “companion” core that uses the Cortex-A9 microarchitecture but is optimized for low power and runs at a lower clock speed (see MPR 11/21/11, “Nvidia Leads With Quad-Core AP”). The low-power core handles light workloads, like email and social media, but it transfers operation to the main CPUs when the processing load picks up. In maximum-performance mode, the four main CPUs run while the low-power core is shut down.

Digging Into the DXP

The Tegra 4i cellular modem derives from technology Nvidia received when it acquired startup Icera in 2011 (see MPR 5/16/11, “Nvidia Picks Up the Phone”). The modem is the same as in the i500 LTE chip that Nvidia recently announced. It supports a number of protocols, including GSM/EDGE, WCDMA/HSPA, and Release 8 LTE in FDD and TDD modes.

As part of the Tegra 4i launch, Nvidia revealed details of the Icera architecture for the first time. The modem employs a processor known as the DXP, which implements a custom instruction set optimized for cellular processing. For example, it includes instructions to accelerate voice codecs and encryption. As Figure 1 shows, the DXP is essentially a RISC CPU with a large vector unit. The CPU fetches and executes one instruction per cycle and has a standard 32-entry register file called the C registers, which are backed by a small data cache configurable as either 16KB or 32KB.

The vector unit has its own register file with 64 entries. These D registers are logically 256 bits wide, but physically they are broken into four “channels.” Each channel can contain two 32-bit values, four 16-bit values, or eight 8-bit values. The vector unit performs the same operation across all the values and across all four channels, creating a

[Figure 1 block diagram. Front end: 32KB I-cache and 128KB instruction memory feeding branch prediction, fetch, and one-instruction decode. Scalar side: 32x32-bit C registers, ALU/control, load/store, and address/branch units, backed by a 32KB data cache. Vector side: 64x256-bit D registers, a permute unit, a chain of vector ALUs, and 512KB of vector data memory (D-Mem).]

Figure 1. Nvidia DXP microarchitecture. The DXP pairs a simple scalar CPU (left) with a 256-bit-wide multistage vector engine (right) to generate large amounts of compute at low power.
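The D-register layout described above (four channels, each holding two 32-bit, four 16-bit, or eight 8-bit values) can be modeled in a few lines. This is only a sketch of the data layout: the function names are hypothetical, and the wrapping add is an assumption, since the article does not say whether the DXP wraps or saturates on overflow.

```python
CHANNELS = 4       # a D register is split into four channels
CHANNEL_BITS = 64  # 256-bit register / 4 channels


def lanes_per_register(element_bits):
    """Number of values one D register holds at a given element width."""
    if element_bits not in (8, 16, 32):
        raise ValueError("element width must be 8, 16, or 32 bits")
    return CHANNELS * (CHANNEL_BITS // element_bits)


def vector_add(a, b, element_bits):
    """Lane-wise add across a whole register (wrapping is an assumption)."""
    lanes = lanes_per_register(element_bits)
    if len(a) != lanes or len(b) != lanes:
        raise ValueError("operands must fill exactly one register")
    mask = (1 << element_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


for width in (32, 16, 8):
    print(width, "bit elements:", lanes_per_register(width), "lanes")
```

One register therefore carries 8, 16, or 32 values depending on element width, and a single vector operation touches all of them.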
  3. large amount of data parallelism. The D registers are fed by a local 512KB memory, which can provide one 256-bit operand per cycle.

The vector unit is unusual in being deep as well as wide. Although Nvidia did not disclose full details, the unit has several pipelined stages that can each be configured for different computations. In this way, a single instruction can perform a complicated operation such as a matrix multiply. Nvidia claims each channel can execute up to 95 8-bit arithmetic operations (e.g., add or multiply) with a single instruction. This approach provides lots of computational horsepower with a small amount of instruction decoding. With vast amounts of parallelism, this architecture is well suited to the high-speed signal processing required by modern cellular algorithms.

Tegra 4i includes two full DXP cores that operate at up to 1.3GHz. These two cores handle the entire physical layer, implementing algorithms such as a rake receiver, diversity, turbo decoding, and HARQ (error correction) in firmware. A third DXP core implements only the scalar portion of the instruction set; this smaller, simpler core handles the cellular protocol stack.

In addition to the 1MB of local SRAM for the two vector units, the chip includes about 6MB of additional SRAM to implement HARQ buffers and other data storage. As Figure 2 shows, this memory consumes more than half the baseband area. (Figure 2 shows an actual die photo of the i500 baseband, which differs radically from the artistic rendering of the chip that has been widely published.) Because it is a standalone device, the i500 includes a number of system interfaces, such as a USB port and serial ports, that are unnecessary when the baseband is implemented as part of the Tegra 4i SoC.

Software-Defined Radio

Most other baseband designs use a combination of DSPs and hard-wired accelerators. These accelerators are customized for each protocol; thus, the traditional baseband has separate units for GSM, WCDMA, and LTE. Nvidia provides a single set of hardware that can switch protocols simply by switching firmware. As a result, the programmable baseband is smaller than competing designs; for example, the i500 die measures 14mm², versus 35mm² for Qualcomm’s MDM9215, a 28nm LTE modem chip with similar data rates. To be fair, we note that the Qualcomm chip includes a Cortex-A5 application CPU and a GPS baseband, and it has been in production for nearly a year.

The programmable design can be upgraded via new firmware. The initial release of the i500 (and Tegra 4i) will support Category 3 LTE (100Mbps down, 50Mbps up), twice the speed of Nvidia’s previous implementation. But the company is still optimizing its firmware for LTE, and it expects to hit Category 4 speeds (150Mbps down) with a future release. Similarly, it expects to boost the top HSPA speed from the initial 42Mbps to 84Mbps. The company (as Icera) has followed a similar path in the past, improving the speed of its HSPA modem from 10Mbps to 14Mbps to 17Mbps using only firmware upgrades.

Nvidia is planning additional firmware upgrades to implement LTE Release 10 features such as carrier aggregation, which allows data to be transmitted on two frequencies (carriers) at once to reach the maximum rate. This feature is important because few cellular providers have the 20MHz of contiguous spectrum required to maximize LTE performance on a single carrier. Nvidia is also developing firmware for TD-SCDMA. The company did not announce a schedule for delivering any of these speed or feature improvements.

Software execution typically requires more power than offloading functions to hard-wired engines, but Nvidia’s team has been careful to minimize power. With its wide vector units and custom instruction set, the DXP uses less power than a traditional DSP for cellular processing. Because a single instruction can execute hundreds of operations, the DXP wastes little power in overhead tasks such as instruction decoding and branching. Other architectures use a mix of DSP and hard-wired logic, so on average, their power is similar to that of the Icera design. The i500 will adjust the DXP clock speed and voltage as needed for the available data rate, so it will run at 1.3GHz only when performing LTE at the peak data rate, which happens rarely (if ever) in the real world.

The original 3G Icera modem has been certified with carriers around the world for use in data cards and USB dongles, but not for voice devices. The ZTE Mimosa X, which began shipping in 2Q12, was the first smartphone to use this modem design; it achieved voice certification with two carriers (Swisscom and EE, a UK carrier) and also shipped to a few carriers that do not require certification. Nvidia’s LTE modem has undergone certification for

[Figure 2 die photo. Background: Qualcomm MDM9215. Foreground: i500 baseband with labeled blocks for two vector DXPs, one scalar DXP, two scratch memories, two D-Mem blocks, and system interfaces.]

Figure 2. Die photo of Nvidia i500 baseband (foreground). The i500 requires only 40% of the die area of a comparable Qualcomm modem chip (background). (Photo by Nvidia, overlay by The Linley Group)
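The statement that a single DXP instruction can execute “hundreds of operations” follows from the figures quoted above: up to 95 8-bit operations per channel, four channels per vector unit, two full DXP cores, and a 1.3GHz top clock. The sketch below turns those into an upper bound. It assumes one maximum-width vector instruction issues every cycle, which real firmware is unlikely to sustain, so the result is a ceiling rather than a measured rate.

```python
OPS_PER_CHANNEL = 95       # Nvidia's claimed maximum per instruction (8-bit ops)
CHANNELS = 4               # channels per vector unit
VECTOR_CORES = 2           # full DXP cores in Tegra 4i and the i500
CLOCK_HZ = 1_300_000_000   # 1.3GHz peak DXP clock

# Assumption: one maximum-width vector instruction completes every cycle.
ops_per_instruction = OPS_PER_CHANNEL * CHANNELS
peak_ops_per_core = ops_per_instruction * CLOCK_HZ
peak_ops_total = peak_ops_per_core * VECTOR_CORES

print("ops per instruction:", ops_per_instruction)
print("peak 8-bit ops/s per core:", peak_ops_per_core)
print("peak 8-bit ops/s, both cores:", peak_ops_total)
```

Under these assumptions the ceiling is 380 operations per instruction and just under one trillion 8-bit operations per second across both vector cores.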
  4. AT&T’s data network, and the company is working to certify this design with other LTE carriers.

Scoring in the Low 60s

Tegra 4i’s GPU retains the same split-shader design as Tegra 3 and Tegra 4, preventing it from supporting modern graphics APIs such as DirectX 10 or 11 and OpenCL. But software optimized for other Tegra processors should run well on Tegra 4i. The new chip includes 60 shaders, far more than in Tegra 3. As Figure 3 shows, the design allocates 48 texture shaders and 12 vertex shaders, providing the same pixel processing but half the vertex processing of Tegra 4. Both chips clock the GPU at about the same speed: 660MHz for Tegra 4i and 672MHz for Tegra 4.

Whereas Tegra 4 supports two 32-bit memory channels, Tegra 4i has only one, halving its peak memory bandwidth. Thus, graphics tests that are limited by either memory bandwidth or vertex processing will run half as fast on Tegra 4i relative to Tegra 4. On the other hand, the two chips are equally good at pushing pixels. We estimate Tegra 4i will score about 30fps on GLBenchmark 2.5, putting it in the same class as high-end application processors shipping in phones today.

Compared with Tegra 3, Tegra 4i offers a theoretical 2x gain in memory bandwidth, even though both use a single memory channel. This doubling assumes the use of LPDDR3-2133 memory chips, however, which do not exist today. Initial Tegra 4i smartphones are likely to use LPDDR3-1600, providing a 50% memory-bandwidth boost over Tegra 3. As faster LPDDR3 speed grades become available, Tegra 4i can take advantage of them.

Tegra 4i includes the computational-photography engine that Nvidia introduced with Tegra 4 (see MPR 1/21/13, “Tegra 4 Shows First Quad A15”). This unit supports advanced features such as high-dynamic-range (HDR) images. The new chip can encode or decode 1080p video at 60fps (H.264 High Profile), twice the rate of Tegra 3 but half the rate of Tegra 4, which also supports UltraHD video.

Although Tegra 4i doesn’t quite match Tegra 4’s performance, the changes are designed to keep the cost of the chip low. Each Cortex-A9 r4 CPU measures just 1.15mm², 57% smaller than the Cortex-A15 cores in Tegra 4. The GPU is also smaller, given the reduced number of shaders, and the video engine is about half the size. Cutting the second DRAM channel both simplifies the memory-controller logic and greatly reduces the number of pads. At 1MB, the L2 cache is also half the size of Tegra 4’s. These changes leave room for the integrated baseband, which we estimate consumes only 8mm². Even after adding the baseband, Tegra 4i has a die size in the “low 60s,” according to Nvidia, compared with the “mid-80s” for Tegra 4.

Rebirth of the Reference Design

The integrated baseband allows Tegra 4i to fit into compact and inexpensive phones. Nvidia offers a reference design, code-named Phoenix, for Tegra 4i. As Figure 4 shows, the main components fit into a narrow space on the phone’s left edge. In addition to the processor, only a few other components are required. Shrinking the circuit-board size leaves more room for the battery; smartphone designers can choose a smaller battery, yielding a thinner phone, or a larger battery for longer life. The Phoenix design is 8mm thick, but OEMs may be able to reduce this dimension. (X-Men aficionados will appreciate the connection between Grey and Phoenix.)

The reference design includes Nvidia’s ICE9245 RF transceiver. This chip, which also works with the i500, has inputs and outputs for eight configurable bands, with diversity for each of the receive bands. It can support additional bands by using external switches and converged

[Figure 3 block diagram. Three vertex units; IDX / clip / setup; raster / early Z; two texture units, each with an L1 cache; a shared L2 cache; and a memory controller with one 32-bit channel (Chan 0).]

Figure 3. Tegra 4i GPU design. The GPU has two pixel pipelines with 24 fragment shaders each, plus three vertex units with 4 shaders each.

Figure 4. Nvidia’s Phoenix reference design. The main circuitry of the smartphone, including the Tegra 4i processor (shown in false color), fits on a PC board less than one inch wide. (Photo source: Nvidia)
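The memory-bandwidth comparisons above (2x over Tegra 3 with LPDDR3-2133, 50% with LPDDR3-1600, and half of Tegra 4) can be checked with simple arithmetic. This sketch assumes the number in each speed grade is the data rate in megatransfers per second and computes theoretical peak bandwidth only; the helper name is hypothetical.

```python
def peak_bandwidth_mb_s(data_rate_mt_s, channels, bus_bits=32):
    """Theoretical peak bandwidth in MB/s: transfers/s x bytes per transfer."""
    return data_rate_mt_s * channels * bus_bits // 8


tegra3 = peak_bandwidth_mb_s(1066, channels=1)           # LPDDR2-1066
tegra4i_initial = peak_bandwidth_mb_s(1600, channels=1)  # LPDDR3-1600
tegra4i_max = peak_bandwidth_mb_s(2133, channels=1)      # LPDDR3-2133
tegra4 = peak_bandwidth_mb_s(2133, channels=2)           # LPDDR3-2133, two channels

print("Tegra 4i (max) vs Tegra 3:", round(tegra4i_max / tegra3, 2))
print("Tegra 4i (initial) vs Tegra 3:", round(tegra4i_initial / tegra3, 2))
print("Tegra 4i (max) vs Tegra 4:", tegra4i_max / tegra4)
```

The ratios come out at roughly 2.0, 1.5, and 0.5, matching the figures in the text.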
  5. power amps (PAs). Built in a TSMC 65nm process to reduce cost, the chip integrates all low-noise amplifiers (LNAs) for a highly integrated solution.

Unlike most of its competitors, Nvidia does not supply its own connectivity chips. The Phoenix design uses a Broadcom Wi-Fi combo with a separate Broadcom GPS chip, probably the BCM4334 and BCM4752, respectively. Nvidia has also qualified Tegra 4i with Wi-Fi combos from Texas Instruments, but TI is a poor second supplier, since it announced it will exit the smartphone market this year. Because Broadcom competes with Nvidia in mainstream smartphones, it can charge customers that use Tegra 4i a higher Wi-Fi price compared with customers that use Broadcom’s own processors, tilting the playing field in its favor. Nvidia declined to purchase TI’s Wi-Fi business, leaving it with no acquisition options in this area.

Although Nvidia calls Phoenix a reference design, the company has yet to reach the same level as MediaTek and Qualcomm, which offer much more complete packages (see MPR 2/25/13, “Qualcomm Clashes With MediaTek”) that even the smallest smartphone makers can use. Phoenix will help Nvidia attract mid-tier and large OEMs. ZTE, which will follow Mimosa X with a smartphone using Tegra 4 and the i500, is a likely Tegra 4i customer.

Finding a Hole in the Wall

Tegra 4i exploits a hole in Qualcomm’s otherwise solid product line. For premium phones, Qualcomm offers the 2.3GHz 8974, a quad-core Krait processor that will outrun Tegra 4i but carries a premium price tag. For mainstream smartphones, the company offers the MSM8960T, which has only two CPUs. On single-thread programs, this chip’s 1.7GHz Krait CPU should match the 2.3GHz Cortex-A9, but the MSM8960T lacks both the marketing cachet and the multithreaded-benchmark scores of a quad-core chip.

This situation should give Tegra 4i an advantage in mainstream phones. The Tegra 4i die is about half the size of the 8974, making it difficult for Qualcomm to compete on price using this die. The company may be able to develop a cost-reduced quad-core chip by the time Tegra 4i is available, or perhaps shortly thereafter, but such a chip may fail to match Tegra 4i’s performance.

Other integrated quad-core chips will fall far short of Tegra 4i. MediaTek’s MT6589 and Qualcomm’s MSM8226 both use Cortex-A7 CPUs running at speeds of about 1.2GHz. For its MP6530, Renesas uses an unusual “Two and a Half Men” configuration, with two A15 CPUs and two A7s, that won’t add up to the performance of Nvidia’s quad A9r4 cores.

Nvidia’s lack of carrier qualifications for both voice and LTE is a concern. The company is working feverishly to remedy this situation, and at least one customer (ZTE) is moving forward with the i500. Nvidia’s initial focus appears to be on AT&T, the only carrier with which it has achieved LTE certification; if that carrier also signs off on Nvidia’s 3G voice, it will create a sizable smartphone opportunity. But to generate enough business to pay back its Icera investment, Nvidia needs more than one carrier customer. The good news is that the company still has several months to certify its modem technology before the first Tegra 4i phones are ready to ship.

Price and Availability

Tegra 4i is currently sampling to lead customers; Nvidia expects the first smartphones using the processor to ship in 4Q13. The company withheld pricing. For more information on Tegra 4i, access Nvidia’s web site at www.nvidia.com/object/tegra-4-processor.html.

Closer to Tegra 4

At 2.3GHz, Tegra 4i delivers a 50% boost in CPU performance (on SPECint) compared with Tegra 3 and at least twice the graphics performance. In fact, the new chip is closer to Tegra 4 than to Tegra 3 in performance, although Nvidia has been careful to leave enough of a gap that Tegra 4 remains a viable product for premium smartphones. According to the company’s testing, Tegra 4 scores 27% better on SPECint than Tegra 4i, and it is considerably better on both 3D graphics and video decoding. The second DRAM channel will boost Tegra 4’s performance, particularly with the larger screens used in tablets. Both Tegra 4i and Tegra 4 offer Nvidia’s new computational-photography engine.

For a smartphone processor with integrated modem, Tegra 4i provides excellent performance. Few integrated quad-core processors have been announced, and many of them target the low-end smartphone market with lower-frequency CPUs such as Cortex-A5 and Cortex-A7. Integrated processors with dual Cortex-A15 or Krait cores won’t match Tegra 4i in benchmark performance, although they will perform well on most apps. Nvidia has also packed plenty of GPU performance into Tegra 4i; on graphics tests, it should outperform the quad-A7 chips and at least match the dual-A15 processors. Qualcomm’s 8974 is the only integrated processor that should equal Tegra 4i’s performance, but we don’t expect that company to cut the price of its flagship processor enough to compete with the integrated Tegra.

Nvidia has been successful in tablets, but until now, its premium positioning and lack of a low-cost option have limited its sales into smartphones. Tegra 2 and Tegra 3 have appeared mainly in hero phones such as the HTC One X, but these high-end devices ship in relatively small volumes. As a result, Nvidia holds less than 2% of the smartphone market. A move into the mainstream offers greater volume opportunities; here, Nvidia will compete mainly against Qualcomm’s Snapdragon. By bringing Tegra features and performance to lower price points, the new processor is an attractive alternative to Snapdragon. ♦
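The SPECint comparisons in “Closer to Tegra 4” can be reproduced from the Table 1 scores. The sketch below also normalizes each score by the chip’s maximum clock, which assumes the scores were measured at those clock speeds; the table does not state this explicitly, so treat the per-GHz figures as an approximation.

```python
SPECINT = {"Tegra 3": 590, "Tegra 4i": 920, "Tegra 4": 1168}  # from Table 1
MAX_GHZ = {"Tegra 3": 1.7, "Tegra 4i": 2.3, "Tegra 4": 1.9}   # from Table 1


def percent_gain(new, old):
    """Percentage by which `new` exceeds `old`."""
    return (new / old - 1) * 100


def score_per_ghz(chip):
    """SPECint score per GHz, assuming the score was taken at the max clock."""
    return SPECINT[chip] / MAX_GHZ[chip]


print("Tegra 4i over Tegra 3:",
      round(percent_gain(SPECINT["Tegra 4i"], SPECINT["Tegra 3"])), "%")
print("Tegra 4 over Tegra 4i:",
      round(percent_gain(SPECINT["Tegra 4"], SPECINT["Tegra 4i"])), "%")
print("A9r4 over A9 per GHz:",
      round(percent_gain(score_per_ghz("Tegra 4i"), score_per_ghz("Tegra 3"))), "%")
```

Under that assumption, the per-GHz gain of the A9r4 over the original A9 works out to about 15%, consistent with the same-clock SPECint improvement Nvidia reported, and the Tegra 4 advantage rounds to the 27% quoted in the text.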