Tegra 4i expands the market

TEGRA 4I EXPANDS MARKET
Cortex-A9r4 CPU Peps Up Nvidia’s First Integrated Processor
By Linley Gwennap (March 4, 2013)

The moment Nvidia hinted at when it purchased Icera has now arrived. The new Tegra 4i processor (formerly known by its code-name, Grey) combines Icera’s cellular technology with Nvidia’s application processors to create an integrated smartphone processor. Although Nvidia is far from the first to make this combination, the new product will greatly expand the company’s target market, which today is restricted to tablets and premium smartphones (Nvidia calls them superphones). By reducing the cost and size of reference designs, Tegra 4i aims for mainstream smartphones that sell for a few hundred dollars.

Nvidia demonstrated the new chip at Mobile World Congress and expects it to ship in commercial smartphones by the end of this year. Tegra 4i offers four Cortex-A9 CPUs, upgraded to release 4 (r4), running at clock speeds as high as 2.3GHz. As Table 1 shows, the graphics unit is considerably improved from Tegra 3 and is similar to that of Tegra 4. The new chip also includes Tegra 4’s computational-photography unit. Its integrated cellular modem, based on Nvidia’s standalone i500 chip, is compatible with LTE networks as well as older 3G services.

Extending Cortex-A9
Tegra 4i is the first processor to use the r4 version of Cortex-A9. Unlike previous releases, which contained mostly bug fixes, the A9r4 includes some significant improvements to the branch predictor, TLB, and cache-memory system. Because the basic microarchitecture stays the same, these improvements have no effect on simple benchmarks such as Dhrystone. But Nvidia has measured a 15% improvement in SPECint performance and a 25% gain in BrowserBench performance at the same clock speed.

ARM announced Cortex-A9 in 2007, eons ago in the fast-moving mobile market. The company designed it to run simple user interfaces and small programs such as email readers; modern apps did not yet exist (the Apple App Store opened in July 2008). In this environment, ARM focused on die area and cost, keeping critical elements in the CPU small. Since that time, Cortex-A9 has been applied to many new tasks, but the design is poorly optimized for some of them.

The branch history table (BHT), for example, has only 512 entries in the original Cortex-A9. This size is fine for small programs, but in more complex software, multiple active branches often hash to the same entry, confusing the BHT’s predictions. The A9r4 expands the BHT to 2,048

                    Tegra 3                Tegra 4i                 Tegra 4
Main CPU Cores      4xCortex-A9            4xCortex-A9r4            4xCortex-A15
Max CPU Speed       1.7GHz                 2.3GHz                   1.9GHz
Companion Core?     Yes                    Yes                      Yes
L2 Cache Size       1MB                    1MB                      2MB
SPECint Score*      590                    920                      1,168
GPU Shaders         8 texture + 4 vertex   48 texture + 12 vertex   48 texture + 24 vertex
GPU Clock Speed     520MHz                 660MHz                   672MHz
GLBenchmark 2.5†    12fps                  30fps§                   57fps
Video Decode‡       1080p 24fps            1080p 60fps              4K 30fps
Photog Engine?      No                     Yes                      Yes
DRAM Channels       1x32-bit               1x32-bit                 2x32-bit
Max DRAM Speed      LPDDR2-1066            LPDDR3-2133              LPDDR3-2133
LTE Baseband        External               Integrated               External
Process Technology  40nm LP                28nm HPM                 28nm HPL
Die Size            80mm²§                 62mm²§                   85mm²§
Package             14mm PoP               12mm PoP                 14mm PoP
Production          4Q11                   4Q13 (est)               2Q13 (est)

Table 1. Comparison of recent Nvidia Tegra processors. Performance data for Tegra 4/4i is preliminary. *SPECint2000_base compiled using GCC -O3; †Egypt C24Z16 Offscreen 1080p; ‡H.264 High Profile. (Source: Nvidia, except §The Linley Group estimate)

© The Linley Group • Microprocessor Report March 2013


entries (the same size as Cortex-A15’s). Nvidia measured the branch-misprediction rate of one SPECint2000 program at 48% on the original A9 and only 8% on the A9r4. This example is extreme, but many programs will see some benefit from the expanded BHT.

Similarly, Cortex-A9’s original TLB had 128 entries, providing address translations for 512KB of data (using standard 4KB pages). This space is enough for small programs, but complex modern apps need more. The A9r4 increases the TLB to 512 entries, offering access to four times as much data (2MB with 4KB pages) without thrashing the TLB.

The new CPU retains the same 32KB data cache as in previous Cortex-A9 designs, but it improves the prefetcher’s effectiveness. The original A9 included prefetch logic that attempted to detect a series of sequential memory accesses and continue fetching additional data before it was needed. It was the first ARM CPU with this feature, however, and the prefetcher too often fetches the wrong data, wasting cycles and power; most operating systems simply turn off this feature. The new prefetcher, based on a few more years of experience, correctly handles most common access patterns.

As with all Cortex CPUs, ARM implemented the new design. Nvidia provided vigorous input regarding the changes and is the lead customer for this version. ARM will deliver the r4 design to other Cortex-A9 licensees, so we expect to see this version become more widespread over time.

The design improvements are unusual for an existing core, and their performance impact is significant. ARM declined to create a new name for this core, perhaps to avoid diluting its emphasis on Cortex-A15. We believe Cortex-A10 or Cortex-A9+ would be more appropriate monikers than Cortex-A9 r4. (Readers should avoid confusing this core with Cortex-R4, a low-end ARM design intended for real-time applications.)

Flooring the Gas Pedal
After working with ARM to rev up the CPU’s logical design, Nvidia then optimized the physical design to maximize its clock speed. Tegra 4i targets 2.3GHz, compared with 1.7GHz for the Cortex-A9 CPUs in Tegra 3+. Much of this speed boost comes from the shrink to TSMC’s 28nm HPM process, but 2.3GHz is among the fastest Cortex-A9 speeds yet announced, trailing only ST-Ericsson’s 2.5GHz design (which uses exotic 28nm FD-SOI technology).

Tegra 4i’s power curve will affect its working speeds. The company did not disclose the chip’s power, but we expect it requires overvoltage to achieve its top speed, pushing power to about 5W with all four CPUs running at 2.3GHz. To fit Tegra 4i into the power envelope of a smartphone, Nvidia is likely to limit the CPUs to a slower clock rate (perhaps 1.8GHz) when all four are running. With only a single CPU running, however, the chip should operate at its full rated speed.

Like Tegra 3, Tegra 4i includes a fifth “companion” core that uses the Cortex-A9 microarchitecture but is optimized for low power and runs at a lower clock speed (see MPR 11/21/11, “Nvidia Leads With Quad-Core AP”). The low-power core handles light workloads, like email and social media, but it transfers operation to the main CPUs when the processing load picks up. In maximum-performance mode, the four main CPUs run while the low-power core is shut down.

Digging Into the DXP
The Tegra 4i cellular modem derives from technology Nvidia received when it acquired startup Icera in 2011 (see MPR 5/16/11, “Nvidia Picks Up the Phone”). The modem is the same as in the i500 LTE chip that Nvidia recently announced. It supports a number of protocols, including GSM/EDGE, WCDMA/HSPA, and Release 8 LTE in FDD and TDD modes.

As part of the Tegra 4i launch, Nvidia revealed details of the Icera architecture for the first time. The modem employs a processor known as the DXP, which implements a custom instruction set optimized for cellular processing. For example, it includes instructions to accelerate voice codecs and encryption. As Figure 1 shows, the DXP is essentially a RISC CPU with a large vector unit. The CPU fetches and executes one instruction per cycle and has a standard 32-entry register file called the C registers, which are backed by a small data cache configurable as either 16KB or 32KB.

The vector unit has its own register file with 64 entries. These D registers are logically 256 bits wide, but physically they are broken into four “channels.” Each channel can contain two 32-bit values, four 16-bit values, or eight 8-bit values. The vector unit performs the same operation across all the values and across all four channels, creating a

[Block diagram: a 32KB I-cache and 128KB instruction memory feed branch prediction, fetch (one instruction), and decode. The scalar side has 32x32-bit C registers, ALU/load/address, branch, and store units, and a 32KB data cache. The vector side has 64x256-bit D registers, permute and vector-control logic, a chain of vector ALUs, and a 512KB data memory (D-Mem).]

Figure 1. Nvidia DXP microarchitecture. The DXP pairs a simple scalar CPU (left) with a 256-bit-wide multistage vector engine (right) to generate large amounts of compute at low power.
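
The D-register layout described in this section reduces to simple lane arithmetic, sketched below. The register width, channel count, and element sizes come from the article; the function names and the wraparound behavior of the example add are assumptions made for illustration, since Nvidia did not disclose how the DXP handles overflow.

```python
"""Lane arithmetic for the DXP D registers as described in the article."""

D_REGISTER_BITS = 256   # logical register width
CHANNELS = 4            # physical channels per register
CHANNEL_BITS = D_REGISTER_BITS // CHANNELS  # 64 bits per channel


def lanes_per_channel(element_bits):
    """Number of values one channel holds at a given element size."""
    if element_bits not in (8, 16, 32):
        raise ValueError("the article lists only 8-, 16-, and 32-bit values")
    return CHANNEL_BITS // element_bits


def lanes_per_register(element_bits):
    """Number of values a full 256-bit D register holds."""
    return CHANNELS * lanes_per_channel(element_bits)


def vector_add(a, b, element_bits):
    """Apply one add across every lane of two registers.

    Wraparound on overflow is an assumption; the article does not say
    whether the DXP wraps or saturates.
    """
    lanes = lanes_per_register(element_bits)
    if len(a) != lanes or len(b) != lanes:
        raise ValueError(f"expected {lanes} lanes")
    mask = (1 << element_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


if __name__ == "__main__":
    for bits in (32, 16, 8):
        print(bits, lanes_per_channel(bits), lanes_per_register(bits))
```

At the 8-bit element size, one instruction touches 32 values per register, which is where the data parallelism comes from.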



large amount of data parallelism. The D registers are fed by a local 512KB memory, which can provide one 256-bit operand per cycle.

The vector unit is unusual in being deep as well as wide. Although Nvidia did not disclose full details, the unit has several pipelined stages that can each be configured for different computations. In this way, a single instruction can perform a complicated operation such as a matrix multiply. Nvidia claims each channel can execute up to 95 8-bit arithmetic operations (e.g., add or multiply) with a single instruction. This approach provides lots of computational horsepower with a small amount of instruction decoding. With vast amounts of parallelism, this architecture is well suited to the high-speed signal processing required by modern cellular algorithms.

Tegra 4i includes two full DXP cores that operate at up to 1.3GHz. These two cores handle the entire physical layer, implementing algorithms such as a rake receiver, diversity, turbo decoding, and HARQ (error correction) in firmware. A third DXP core implements only the scalar portion of the instruction set; this smaller, simpler core handles the cellular protocol stack.

In addition to the 1MB of local SRAM for the two vector units, the chip includes about 6MB of additional SRAM to implement HARQ buffers and other data storage. As Figure 2 shows, this memory consumes more than half the baseband area. (Figure 2 shows an actual die photo of the i500 baseband, which differs radically from the artistic rendering of the chip that has been widely published.) Because it is a standalone device, the i500 includes a number of system interfaces, such as a USB port and serial ports, that are unnecessary when the baseband is implemented as part of the Tegra 4i SoC.

Software-Defined Radio
Most other baseband designs use a combination of DSPs and hard-wired accelerators. These accelerators are customized for each protocol; thus, the traditional baseband has separate units for GSM, WCDMA, and LTE. Nvidia provides a single set of hardware that can switch protocols simply by switching firmware. As a result, the programmable baseband is smaller than competing designs; for example, the i500 die measures 14mm², versus 35mm² for Qualcomm’s MDM9215, a 28nm LTE modem chip with similar data rates. To be fair, we note that the Qualcomm chip includes a Cortex-A5 application CPU and a GPS baseband, and it has been in production for nearly a year.

The programmable design can be upgraded via new firmware. The initial release of the i500 (and Tegra 4i) will support Category 3 LTE (100Mbps down, 50Mbps up), twice the speed of Nvidia’s previous implementation. But the company is still optimizing its firmware for LTE, and it expects to hit Category 4 speeds (150Mbps down) with a future release. Similarly, it expects to boost the top HSPA speed from the initial 42Mbps to 84Mbps. The company (as Icera) has followed a similar path in the past, improving the speed of its HSPA modem from 10Mbps to 14Mbps to 17Mbps using only firmware upgrades.

Nvidia is planning additional firmware upgrades to implement LTE Release 10 features such as carrier aggregation, which allows data to be transmitted on two frequencies (carriers) at once to reach the maximum rate. This feature is important because few cellular providers have the 20MHz of contiguous spectrum required to maximize LTE performance on a single carrier. Nvidia is also developing firmware for TD-SCDMA. The company did not announce a schedule for delivering any of these speed or feature improvements.

Software execution typically requires more power than offloading functions to hard-wired engines, but Nvidia’s team has been careful to minimize power. With its wide vector units and custom instruction set, the DXP uses less power than a traditional DSP for cellular processing. Because a single instruction can execute hundreds of operations, the DXP wastes little power in overhead tasks such as instruction decoding and branching. Other architectures use a mix of DSP and hard-wired logic, so on average, their power is similar to that of the Icera design. The i500 will adjust the DXP clock speed and voltage as needed for the available data rate, so it will run at 1.3GHz only when performing LTE at the peak data rate, which happens rarely (if ever) in the real world.

The original 3G Icera modem has been certified with carriers around the world for use in data cards and USB dongles, but not for voice devices. The ZTE Mimosa X, which began shipping in 2Q12, was the first smartphone to use this modem design; it achieved voice certification with two carriers (Swisscom and EE, a UK carrier) and also shipped to a few carriers that do not require certification. Nvidia’s LTE modem has undergone certification for

[Die photo labels: scalar DXP, two vector DXPs with their D-Mem blocks, scratch memory, and system interfaces on the i500; Qualcomm MDM9215 in the background.]

Figure 2. Die photo of Nvidia i500 baseband (foreground). The i500 requires only 40% of the die area of a comparable Qualcomm modem chip (background). (Photo by Nvidia, overlay by The Linley Group)
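
The DXP throughput figures quoted in this section can be combined into a rough upper bound. The sketch below uses only numbers from the article (95 operations per channel, four channels, two full DXP cores, 1.3GHz). It assumes, optimistically, that a maximal vector instruction issues on every cycle, which Nvidia did not claim, so the result is a ceiling rather than an expected rate. The constant and function names are illustrative.

```python
"""Back-of-envelope ceiling on DXP 8-bit operation throughput."""

OPS_PER_CHANNEL = 95   # Nvidia's claimed maximum per channel per instruction
CHANNELS = 4           # channels per D register
FULL_DXP_CORES = 2     # cores with the vector unit in Tegra 4i
MAX_CLOCK_MHZ = 1300   # peak DXP clock


def ops_per_instruction():
    """Maximum 8-bit operations one vector instruction can perform."""
    return OPS_PER_CHANNEL * CHANNELS


def peak_gops(cores=FULL_DXP_CORES, clock_mhz=MAX_CLOCK_MHZ):
    """Ceiling in billions of 8-bit operations per second.

    Assumes one maximal vector instruction per cycle per core.
    """
    return cores * ops_per_instruction() * clock_mhz / 1000


if __name__ == "__main__":
    print(ops_per_instruction(), "ops per instruction")
    print(peak_gops(cores=1), "GOPS per core (ceiling)")
    print(peak_gops(), "GOPS for both vector cores (ceiling)")
```

The 380 operations per instruction is consistent with the article's statement that a single instruction can execute hundreds of operations.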



AT&T’s data network, and the company is working to certify this design with other LTE carriers.

Scoring in the Low 60s
Tegra 4i’s GPU retains the same split-shader design as Tegra 3 and Tegra 4, preventing it from supporting modern graphics APIs such as DirectX 10 or 11 and OpenCL. But software optimized for other Tegra processors should run well on Tegra 4i. The new chip includes 60 shaders, far more than in Tegra 3. As Figure 3 shows, the design allocates 48 texture shaders and 12 vertex shaders, providing the same pixel processing but half the vertex processing of Tegra 4. Both chips clock the GPU at about the same speed: 660MHz for Tegra 4i and 672MHz for Tegra 4.

Whereas Tegra 4 supports two 32-bit memory channels, Tegra 4i has only one, halving its peak memory bandwidth. Thus, graphics tests that are limited by either memory bandwidth or vertex processing will run half as fast on Tegra 4i relative to Tegra 4. On the other hand, the two chips are equally good at pushing pixels. We estimate Tegra 4i will score about 30fps on GLBenchmark 2.5, putting it in the same class as high-end application processors shipping in phones today.

Compared with Tegra 3, Tegra 4i offers a theoretical 2x gain in memory bandwidth, even though both use a single memory channel. This doubling assumes the use of LPDDR3-2133 memory chips, however, which do not exist today. Initial Tegra 4i smartphones are likely to use LPDDR3-1600, providing a 50% memory-bandwidth boost over Tegra 3. As faster LPDDR3 speed grades become available, Tegra 4i can take advantage of them.

Tegra 4i includes the computational-photography engine that Nvidia introduced with Tegra 4 (see MPR 1/21/13, “Tegra 4 Shows First Quad A15”). This unit supports advanced features such as high-dynamic-range (HDR) images. The new chip can encode or decode 1080p video at 60fps (H.264 High Profile), twice the rate of Tegra 3 but half the rate of Tegra 4, which also supports UltraHD video.

Although Tegra 4i doesn’t quite match Tegra 4’s performance, the changes are designed to keep the cost of the chip low. Each Cortex-A9 r4 CPU measures just 1.15mm², 57% smaller than the Cortex-A15 cores in Tegra 4. The GPU is also smaller, given the reduced number of shaders, and the video engine is about half the size. Cutting the second DRAM channel both simplifies the memory-controller logic and greatly reduces the number of pads. At 1MB, the L2 cache is also half the size of Tegra 4’s. These changes leave room for the integrated baseband, which we estimate consumes only 8mm². Even after adding the baseband, Tegra 4i has a die size in the “low 60s,” according to Nvidia, compared with the “mid-80s” for Tegra 4.

Rebirth of the Reference Design
The integrated baseband allows Tegra 4i to fit into compact and inexpensive phones. Nvidia offers a reference design, code-named Phoenix, for Tegra 4i. As Figure 4 shows, the main components fit into a narrow space on the phone’s left edge. In addition to the processor, only a few other components are required. Shrinking the circuit-board size leaves more room for the battery; smartphone designers can choose a smaller battery, yielding a thinner phone, or a larger battery for longer life. The Phoenix design is 8mm thick, but OEMs may be able to reduce this dimension. (X-Men aficionados will appreciate the connection between Grey and Phoenix.)

The reference design includes Nvidia’s ICE9245 RF transceiver. This chip, which also works with the i500, has inputs and outputs for eight configurable bands, with diversity for each of the receive bands. It can support additional bands by using external switches and converged

[Block diagram: three vertex units feed IDX/clip/setup and raster/early-Z stages, followed by two texture (pixel) pipelines with L1 caches, a shared L2 cache, and a single 32-bit memory controller (channel 0).]

Figure 3. Tegra 4i GPU design. The GPU has two pixel pipelines with 24 fragment shaders each, plus three vertex units with 4 shaders each.

Figure 4. Nvidia’s Phoenix reference design. The main circuitry of the smartphone, including the Tegra 4i processor (shown in false color), fits on a PC board less than one inch wide. (Photo source: Nvidia)
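
The memory-bandwidth comparisons in the “Scoring in the Low 60s” section follow from channel count, bus width, and transfer rate. The sketch below reproduces that arithmetic, treating the number in each LPDDR speed grade as mega-transfers per second. These are theoretical peaks, and the function and constant names are illustrative rather than taken from any Nvidia material.

```python
"""Theoretical peak DRAM bandwidth for the Tegra configurations discussed."""


def peak_bandwidth_mb_s(channels, bus_bits, transfers_mt_s):
    """Peak bandwidth in MB/s: channels x bytes per transfer x MT/s."""
    return channels * (bus_bits // 8) * transfers_mt_s


# Configurations from Table 1 and the surrounding text.
TEGRA_3 = peak_bandwidth_mb_s(1, 32, 1066)           # LPDDR2-1066
TEGRA_4I_INITIAL = peak_bandwidth_mb_s(1, 32, 1600)  # LPDDR3-1600 (likely at launch)
TEGRA_4I_MAX = peak_bandwidth_mb_s(1, 32, 2133)      # LPDDR3-2133
TEGRA_4 = peak_bandwidth_mb_s(2, 32, 2133)           # two channels, LPDDR3-2133

if __name__ == "__main__":
    for name, value in [
        ("Tegra 3", TEGRA_3),
        ("Tegra 4i (LPDDR3-1600)", TEGRA_4I_INITIAL),
        ("Tegra 4i (LPDDR3-2133)", TEGRA_4I_MAX),
        ("Tegra 4", TEGRA_4),
    ]:
        print(f"{name:24s} {value / 1000:6.2f} GB/s  "
              f"({value / TEGRA_3:.2f}x Tegra 3)")
```

The ratios come out at roughly 1.5x and 2x over Tegra 3 for the two Tegra 4i speed grades, and exactly 2x for Tegra 4 over Tegra 4i at the same speed grade, matching the article's figures.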



power amps (PAs). Built in a TSMC 65nm process to reduce cost, the chip integrates all low-noise amplifiers (LNAs) for a highly integrated solution.

Unlike most of its competitors, Nvidia does not supply its own connectivity chips. The Phoenix design uses a Broadcom Wi-Fi combo with a separate Broadcom GPS chip, probably the BCM4334 and BCM4752, respectively. Nvidia has also qualified Tegra 4i with Wi-Fi combos from Texas Instruments, but TI is a poor second supplier, since it announced it will exit the smartphone market this year. Because Broadcom competes with Nvidia in mainstream smartphones, it can charge customers that use Tegra 4i a higher Wi-Fi price compared with customers that use Broadcom’s own processors, tilting the playing field in its favor. Nvidia declined to purchase TI’s Wi-Fi business, leaving it with no acquisition options in this area.

Although Nvidia calls Phoenix a reference design, the company has yet to reach the same level as MediaTek and Qualcomm, which offer much more complete packages (see MPR 2/25/13, “Qualcomm Clashes With MediaTek”) that even the smallest smartphone makers can use. Phoenix will help Nvidia attract mid-tier and large OEMs. ZTE, which will follow Mimosa X with a smartphone using Tegra 4 and the i500, is a likely Tegra 4i customer.

Finding a Hole in the Wall
Tegra 4i exploits a hole in Qualcomm’s otherwise solid product line. For premium phones, Qualcomm offers the 2.3GHz 8974, a quad-core Krait processor that will outrun Tegra 4i but carries a premium price tag. For mainstream smartphones, the company offers the MSM8960T, which has only two CPUs. On single-thread programs, this chip’s 1.7GHz Krait CPU should match the 2.3GHz Cortex-A9, but the MSM8960T lacks both the marketing cachet and the multithreaded-benchmark scores of a quad-core chip.

This situation should give Tegra 4i an advantage in mainstream phones. The Tegra 4i die is about half the size of the 8974, making it difficult for Qualcomm to compete on price using this die. The company may be able to develop a cost-reduced quad-core chip by the time Tegra 4i is available, or perhaps shortly thereafter, but such a chip may fail to match Tegra 4i’s performance.

Other integrated quad-core chips will fall far short of Tegra 4i. MediaTek’s MT6589 and Qualcomm’s MSM8226 both use Cortex-A7 CPUs running at speeds of about 1.2GHz. For its MP6530, Renesas uses an unusual “Two and a Half Men” configuration, with two A15 CPUs and two A7s, that won’t add up to the performance of Nvidia’s quad A9r4 cores.

Nvidia’s lack of carrier qualifications for both voice and LTE is a concern. The company is working feverishly to remedy this situation, and at least one customer (ZTE) is moving forward with the i500. Nvidia’s initial focus appears to be on AT&T, the only carrier with which it has achieved LTE certification; if that carrier also signs off on Nvidia’s 3G voice, it will create a sizable smartphone opportunity. But to generate enough business to pay back its Icera investment, Nvidia needs more than one carrier customer. The good news is that the company still has several months to certify its modem technology before the first Tegra 4i phones are ready to ship.

Closer to Tegra 4
At 2.3GHz, Tegra 4i delivers a 50% boost in CPU performance (on SPECint) compared with Tegra 3 and at least twice the graphics performance. In fact, the new chip is closer to Tegra 4 than to Tegra 3 in performance, although Nvidia has been careful to leave enough of a gap that Tegra 4 remains a viable product for premium smartphones. According to the company’s testing, Tegra 4 scores 27% better on SPECint than Tegra 4i, and it is considerably better on both 3D graphics and video decoding. The second DRAM channel will boost Tegra 4’s performance, particularly with the larger screens used in tablets. Both Tegra 4i and Tegra 4 offer Nvidia’s new computational-photography engine.

For a smartphone processor with integrated modem, Tegra 4i provides excellent performance. Few integrated quad-core processors have been announced, and many of them target the low-end smartphone market with lower-frequency CPUs such as Cortex-A5 and Cortex-A7. Integrated processors with dual Cortex-A15 or Krait cores won’t match Tegra 4i in benchmark performance, although they will perform well on most apps. Nvidia has also packed plenty of GPU performance into Tegra 4i; on graphics tests, it should outperform the quad-A7 chips and at least match the dual-A15 processors. Qualcomm’s 8974 is the only integrated processor that should equal Tegra 4i’s performance, but we don’t expect that company to cut the price of its flagship processor enough to compete with the integrated Tegra.

Nvidia has been successful in tablets, but until now, its premium positioning and lack of a low-cost option have limited its sales into smartphones. Tegra 2 and Tegra 3 have appeared mainly in hero phones such as the HTC One X, but these high-end devices ship in relatively small volumes. As a result, Nvidia holds less than 2% of the smartphone market. A move into the mainstream offers greater volume opportunities; here, Nvidia will compete mainly against Qualcomm’s Snapdragon. By bringing Tegra features and performance to lower price points, the new processor is an attractive alternative to Snapdragon. ♦

Price and Availability
Tegra 4i is currently sampling to lead customers; Nvidia expects the first smartphones using the processor to ship in 4Q13. The company withheld pricing. For more information on Tegra 4i, access Nvidia’s web site at www.nvidia.com/object/tegra-4-processor.html.

