Evolution of Personal Computing by Microprocessors and SoCs

For Credit Seminar: EEC7203 (Internal Assessment)

Submitted to
Dr. T. Shanmuganantham
Associate Professor, Department of Electronics Engineering

Azmath Moosa
Reg No: 13304006
M. Tech 1st Yr
Department of Electronics Engineering, School of Engg & Tech, Pondicherry University
Abstract

Throughout history, new and improved technologies have transformed the human experience. In the 20th century, the pace of change sped up radically as we entered the computing age. For nearly 40 years the microprocessor, driven by the innovations of companies like Intel, has continuously created new possibilities in the lives of people around the world. In this paper, I hope to capture the evolution of this amazing device that has raised computing to a whole new level and made it relevant to all fields: engineering, research, medicine, academia, business, manufacturing, commuting and more. I will highlight the significant strides made in each generation of processors and the remarkable ways in which engineers overcame seemingly insurmountable challenges and continued to push the evolution to where it is today.
Table of Contents

1. Abstract
2. Table of Contents
3. List of Figures
4. Introduction
5. x86 and Birth of the PC
6. The Pentium
7. Pipelined Design
8. The Pentium 4
9. The Core Microarchitecture
10. Tick-Tock Cadence
11. The Nehalem Microarchitecture
12. The Sandy Bridge Microarchitecture
13. The Haswell Microarchitecture
14. Performance Comparisons
15. Shift in Computing Trends
16. Advanced RISC Machines
17. System on Chip (SoC)
18. Conclusion
19. References
List of Figures

Figure 1: 4004 Layout
Figure 2: Pentium Chip
Figure 3: Pentium CPU based PC architecture
Figure 4: Pentium 2 logo
Figure 5: Pentium 3 logo
Figure 6: Pentium 4 HT technology illustration
Figure 7: NetBurst architecture feature presentation at Intel Developer Forum
Figure 8: The NetBurst pipeline
Figure 9: The Core architecture feature presentation at Intel Developer Forum
Figure 10: The Core architecture pipeline
Figure 11: Macro fusion explained at IDF
Figure 12: Power management capabilities of the Core architecture
Figure 13: Intel's new tick-tock strategy revealed at IDF
Figure 14: Nehalem pipeline backend
Figure 15: Nehalem pipeline frontend
Figure 16: Improved Loop Stream Detector
Figure 17: Nehalem CPU based PC architecture
Figure 18: Sandy Bridge architecture overview at IDF
Figure 19: Sandy Bridge pipeline frontend
Figure 20: Sandy Bridge pipeline backend
Figure 21: Video transcoding capabilities of Nehalem
Figure 22: Typical planar transistor
Figure 23: FinFET tri-gate transistor
Figure 24: FinFET delay vs power
Figure 25: SEM photograph of fabricated FinFET tri-gate transistors
Figure 26: Haswell pipeline frontend
Figure 27: Haswell pipeline backend
Figure 28: Performance comparisons of 5 generations of Intel processors
Figure 29: Market share of personal computing devices
Figure 30: A smartphone SoC; TI's OMAP
Figure 31: A SoC for tablets; NVIDIA Tegra
Introduction

In 1968, Intel was founded with the aim of manufacturing memory devices. Its first product was a Schottky TTL bipolar SRAM memory chip. A Japanese company, Nippon Calculating Machine Corporation, approached Intel to design 12 custom chips for its new calculator. Intel engineers suggested a family of just four chips instead, including one that could be programmed for use in a variety of products. Intel designed this set of four chips, known as the MCS-4. It included a central processing unit (CPU) chip, the 4004, as well as a supporting read-only memory (ROM) chip for the custom application programs, a random-access memory (RAM) chip for processing data, and a shift-register chip for the input/output (I/O) port.

Figure 1: 4004 Layout

The MCS-4 was a "building block" that engineers could purchase and then customize with software to perform different functions in a wide variety of electronic devices. And thus, the microprocessor industry was born.

The 4004 had 2,300 pMOS transistors at 10 µm and was clocked at 740 kHz. Four pins were multiplexed for both address and data (16-pin IC). In the very next year, the 8008 was introduced. It was an 8-bit processor clocked at 500 kHz with 3,500 pMOS transistors at the same 10 µm. It was actually slower, at 0.05 MIPS (millions of instructions per second), compared to the 4004's 0.07. It was in 1974 that the 8080, with 10 times the performance of the 8008 and a different transistor technology, was launched. It used 4,500 NMOS transistors of size 6 µm and was clocked at 2 MHz with a whopping 0.29 MIPS. Finally, in March 1976, the 8085, clocked at 3 MHz and built with yet another newer transistor technology (depletion-type NMOS transistors of size 3 µm), was launched. It was capable of 0.37 MIPS. The 8085 was a popular device of its time and is still used in universities across the globe to introduce students to microprocessors.
x86 and Birth of the PC

The 8086 16-bit processor made its debut in 1978. New techniques, such as memory segmentation to extend addressing capacity and pipelining to speed up execution, were introduced. It was designed to be compatible with 8085 assembly mnemonics. It had 29,000 transistors of 3 µm channel length and was clocked at 5, 8 and 10 MHz, delivering a full 0.75 MIPS at maximum clock. Under the segmentation scheme, a 16-bit segment value and a 16-bit offset combine to form a 20-bit physical address; a short sketch of the calculation appears at the end of this section. The 8086 was the father of what is now known as the x86 architecture, which eventually turned out to be Intel's most successful line of processors, powering many computing devices even today. Introduced soon after was the processor that powered the first PC, the 8088. Clocked at 5-8 MHz with 0.33-0.66 MIPS, it was an 8086 with an external 8-bit bus.

In 1981, a revolution seized the computer industry, stirred by the IBM PC. By the late '70s, personal computers were available from many vendors, such as Tandy, Commodore, TI and Apple. Computers from different vendors were not compatible: each vendor had its own architecture, its own operating system, its own bus interface, and its own software. Backed by IBM's marketing might and name recognition, the IBM PC quickly captured the bulk of the market. Other vendors either left the PC market (TI), pursued niche markets (Commodore, Apple) or abandoned their own architecture in favor of IBM's (Tandy). With a market share approaching 90%, the PC became a de facto standard. Software houses wrote operating systems (Microsoft DOS, Digital Research DOS), spreadsheets (Lotus 1-2-3), word processors (WordPerfect, WordStar) and compilers (Microsoft C, Borland C) that ran on the PC. Hardware vendors built disk drives, printers and data acquisition systems that connected to the PC's external bus.

Although IBM initially captured the PC market, it subsequently lost it to clone vendors. Accustomed to being a monopoly supplier of mainframe computers, IBM was unprepared for the fierce competition that arose as Compaq, Leading Edge, AT&T, Dell, ALR, AST, Ampro, Diversified Technologies and others all vied for a share of the PC market. Besides low prices and high performance, the clone vendors provided one other very important thing to the PC market: an absolute hardware standard. In order to sell a PC clone, the manufacturer had to be able to guarantee that it would run all of the customer's existing PC software and work with all of the customer's existing peripheral hardware. The only way to do this was to design the clone to be identical to the original IBM PC at the register level. Thus, the standard that the IBM PC defined became graven in stone as dozens of clone vendors shipped millions of machines that conformed to it in every detail. This standardization has been an important factor in the low cost and wide availability of PC systems.
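To make the segmentation scheme concrete, the following minimal C sketch computes a 20-bit physical address from a 16-bit segment:offset pair exactly as the 8086 does (segment shifted left by four bits, plus offset). The specific addresses used are illustrative only.

```c
#include <stdio.h>
#include <stdint.h>

/* 8086 real-mode address translation: the 16-bit segment register is
 * shifted left by 4 bits and added to the 16-bit offset, giving a
 * 20-bit physical address and hence the 1 MB addressing limit. */
static uint32_t phys_addr(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

int main(void)
{
    /* 0xB800:0x0000 -> 0xB8000, the classic text-mode video buffer. */
    printf("0xB800:0x0000 -> 0x%05X\n", (unsigned)phys_addr(0xB800, 0x0000));
    /* Different segment:offset pairs can alias the same physical byte. */
    printf("0x1234:0x0010 -> 0x%05X\n", (unsigned)phys_addr(0x1234, 0x0010));
    printf("0x1235:0x0000 -> 0x%05X\n", (unsigned)phys_addr(0x1235, 0x0000));
    return 0;
}
```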
The 8086 and 80186/88 were limited to addressing 1 MB of memory. Thus, the PC was also limited to this range. This limitation was raised to 16 MB by the 80286, released in 1982. It had a maximum clock of 16 MHz with more than 2 MIPS, and 134,000 transistors at 1.5 µm. The processors and the PC up to this point were all 16-bit. The 80386 range of processors, released in 1985, were the first 32-bit processors to be used in the PC. The first of these had 275,000 transistors at 1 µm and was clocked at up to 33 MHz with 5.1 MIPS. It could physically address 4 GB of memory, with a far larger virtual address space. Over the next few years, Intel modified the architecture and provided some improvements in terms of memory addressing range and clock speed.

The 80486 range of processors, released in 1989, brought significant advancements in computing capability, with a whopping 41 MIPS for a processor clocked at 50 MHz with 1.2 million transistors at 0.8 µm, or 800 nm. It introduced a new technique to speed up RAM reads and writes: cache memory. The cache was integrated onto the CPU die and was referred to as level 1 or L1 cache (as opposed to the L2 cache available on the motherboard); a sketch of the basic cache-lookup idea appears at the end of this section. As with the previous series, Intel slightly modified the architecture and released higher-clocked versions over the next few years.

The Pentium

The Intel Pentium microprocessor was introduced in 1993. Its microarchitecture, dubbed P5, was Intel's fifth-generation design and its first superscalar 32-bit microarchitecture. A superscalar architecture is one in which multiple execution units or functional units (such as adders, shifters and multipliers) are provided and operate in parallel. As a direct extension of the 80486 architecture, it included dual integer pipelines, a faster floating-point unit, a wider data bus, separate code and data caches, and features for further reduced address calculation latency.

Figure 2: Pentium Chip

In 1996, the Pentium with MMX Technology (often simply referred to as Pentium MMX) was introduced with the same basic microarchitecture, complemented with the MMX instruction set, larger caches, and some other enhancements. The Pentium was based on 0.8 µm process technology, comprised 3.1 million transistors, and was clocked at 60 MHz with 100 MIPS. The Pentium was truly capable of addressing 4 GB of RAM without any operating-system-based virtualization.
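As an aside on the caching idea introduced with the 80486, the toy C sketch below shows the basic mechanics of a direct-mapped cache lookup: the address is split into tag and index, a matching tag is a fast hit, and a mismatch is a miss that must be filled from slower RAM. The geometry (64 lines of 16 bytes) is an assumption for illustration, not the 486's actual cache organisation.

```c
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define LINES     64    /* number of cache lines (illustrative) */
#define LINE_SIZE 16    /* bytes per line (illustrative) */

struct line { bool valid; uint32_t tag; };
static struct line cache[LINES];

/* Direct-mapped lookup: each address maps to exactly one line. */
static bool lookup(uint32_t addr)
{
    uint32_t index = (addr / LINE_SIZE) % LINES;
    uint32_t tag   = addr / (LINE_SIZE * LINES);
    if (cache[index].valid && cache[index].tag == tag)
        return true;                /* hit: data served without RAM access */
    cache[index].valid = true;      /* miss: fill the line from RAM */
    cache[index].tag = tag;
    return false;
}

int main(void)
{
    /* Second access hits (same line); 0x2000 evicts it (same index,
     * different tag), so the final access misses again. */
    uint32_t addrs[] = {0x1000, 0x1004, 0x2000, 0x1008};
    for (int i = 0; i < 4; i++)
        printf("0x%04X -> %s\n", (unsigned)addrs[i],
               lookup(addrs[i]) ? "hit" : "miss");
    return 0;
}
```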
The next microarchitecture was the P6, or the Pentium Pro, released in 1995. It had an integrated L2 cache. One major change Intel brought to the PC architecture was the FSB (Front Side Bus), which managed the CPU's communications with the RAM and other I/O. RAM and the graphics card were high-speed peripherals and were interfaced through the Northbridge; other I/O devices like the keyboard and speakers were interfaced through the Southbridge.

Figure 3: Pentium CPU based PC architecture

The Pentium II followed soon after in 1997. It had MMX, improved 16-bit performance and double the L2 cache. The Pentium II had 7.5 million transistors, starting with 0.35 µm process technology, though later revisions utilised 0.25 µm transistors.

Figure 4: Pentium 2 logo

The Pentium III followed in 1999 with 9.5 million 0.25 µm transistors and a new instruction set, SSE (Streaming SIMD Extensions), that assisted DSP and graphics processing. Intel was able to push the clock speed higher and higher with the Pentium III, with some variants clocked as high as 1 GHz.

Figure 5: Pentium 3 logo

Pipelined Design

At a high level, the goal of a CPU is to grab instructions from memory and execute those instructions. All of the tricks and improvements we see from one generation to the next just help to accomplish that goal faster. The assembly-line analogy for a pipelined microprocessor is overused, but that's because it is quite accurate. Rather than seeing one instruction worked on at a time, modern processors
feature an assembly line of steps that breaks up the grab/execute process to allow for higher throughput.

The basic pipeline is as follows: fetch, decode, execute, and commit to memory. One would first fetch the next instruction from memory (there is a counter and pointer that tells the CPU where to find the next instruction). One would then decode that instruction into an internally understood format (this is key to enabling backwards compatibility). Next, one would execute the instruction (this stage, like most here, is split up into fetching data needed by the instruction, among other things). Finally, one would commit the results of that instruction to memory and start the process over again. A toy simulation of these four stages appears at the end of this section.

Modern CPU pipelines feature many more stages than what has been outlined above. Pipelines are divided into two halves: the frontend and the backend. The frontend is responsible for fetching and decoding instructions, while the backend deals with executing them. The division between the two halves of the CPU pipeline also separates the part of the pipeline that must execute in order from the part that can execute out of order. Instructions have to be fetched and completed in program order (you can't click Print until you click File first), but they can be executed in any order possible so long as the result is correct. Many instructions are either dependent on one another (e.g. C=A+B followed by E=C+D) or need data that is not immediately available and has to be fetched from main memory (a process that can take hundreds of cycles, an eternity in the eyes of the processor). Being able to reorder instructions before they are executed allows the processor to keep doing work rather than just sitting around waiting. This document aims to highlight changes to the x86 pipeline with each generation of processors.

The Pentium 4

The NetBurst microarchitecture started with the Pentium 4. This line of processors started in 2000, clocked at 1.4 GHz, with 42 million transistors at 0.18 µm process size and the SSE2 instruction set. The early variants were codenamed Willamette (1.9 to 2.0 GHz) and later ones Northwood (up to 3.0 GHz) and Prescott.
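The four-stage pipeline described above can be visualised with a toy C simulation: each cycle, every in-flight instruction advances one stage, so once the pipeline is full one instruction completes per cycle even though each individual instruction still takes four cycles. This is purely illustrative and models no real processor.

```c
#include <stdio.h>

#define STAGES 4
#define NINSTR 6

int main(void)
{
    const char *stage_names[STAGES] = {"Fetch", "Decode", "Execute", "Commit"};

    /* Instruction i occupies stage s during cycle i + s, so with the
     * pipeline full, four instructions are in flight simultaneously. */
    for (int cycle = 0; cycle < NINSTR + STAGES - 1; cycle++) {
        printf("cycle %d:", cycle + 1);
        for (int s = STAGES - 1; s >= 0; s--) {
            int i = cycle - s;              /* instruction in stage s */
            if (i >= 0 && i < NINSTR)
                printf("  %s(I%d)", stage_names[s], i + 1);
        }
        printf("\n");
    }
    /* Once full, one instruction commits per cycle even though each
     * instruction still spends four cycles travelling the pipeline. */
    return 0;
}
```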
Figure 7: NetBurst architecture feature presentation at Intel Developer Forum

The diagram is from Intel's feature presentation of the NetBurst architecture. Willamette was an early variant with SSE2, the Rapid Execution Engine (in which the ALUs operate at twice the core clock frequency) and the Instruction Trace Cache (which cached decoded instructions for faster loop execution). HT Technology refers to the prevention of CPU wastage by assigning the CPU to execute one thread or application while another waits for data to arrive from RAM. This essentially acts like a dual-processor system.

Figure 6: Pentium 4 HT technology illustration

The NetBurst pipeline was 20 stages long. As illustrated in the figure, the BTB (Branch Target Buffer) helps to define the address of the next micro-op in the trace cache (TC Nxt IP). Then micro-ops are fetched out of the trace cache (TC Fetch) and are transferred (Drive) into the RAT (register alias table). After that, the necessary resources, such as load queues and store buffers, are allocated (Alloc), and logical registers are renamed (Rename). Micro-ops are put in the Queue until space frees up in the schedulers. There, micro-ops' dependencies are resolved, and then micro-ops are transferred to the register files of the corresponding dispatch units, where each micro-op is executed and flags are calculated. When a jump instruction is executed, the real branch address and the predicted
one are compared (Branch Check). After that, the new address is recorded in the BTB (Drive).

Northwood and Prescott were later variations with certain enhancements, as illustrated in the diagram; processor-specific details are unnecessary here. The next major advancement was the 64-bit NetBurst released in 2005. The Prescott line-up continued with maximum clock speeds of 3.8 GHz and transistor sizes of 90 nm. It had 2 MB of cache and EIST (Enhanced Intel SpeedStep Technology), which allowed dynamic processor clock speed scaling through software. EIST was particularly useful for mobile processors, as a lot of power was conserved when running at low clock speeds. The NetBurst family continued to grow with the Pentium D (dual-core processors with HT disabled) and Pentium Extreme Edition processors (dual-core with HT enabled).

Figure 8: The NetBurst Pipeline

The Core Microarchitecture

The high power consumption and heat intensity, the resulting inability to effectively increase clock speed, and other shortcomings such as the inefficient pipeline were the primary reasons why Intel abandoned the NetBurst microarchitecture and switched to a completely different architectural design, delivering high efficiency through a short pipeline rather than high clock speeds. Intel's solution was the Core microarchitecture, released in 2006. The first of these were sold under the brand name "Core 2" with duo and quad variants (dual and quad CPUs).
Merom was for mobile computing, Conroe was for desktop systems, and Woodcrest was for servers and workstations. While architecturally identical, the three processor lines differed in the socket used, bus speed, and power consumption.

Figure 9: The Core architecture feature presentation at Intel Developer Forum

The diagram below illustrates the Conroe architecture. The 14-stage pipeline of the Core architecture was a trade-off between long and short pipeline designs.

Figure 10: The Core architecture pipeline

The architectural highlights of this generation are given below. Wide Dynamic Execution referred to two things: first, the ability of the processor to fetch, dispatch, execute and retire four instructions simultaneously; second, a technique called macro-fusion, in which two x86 instructions could be combined into a single micro-op to increase performance.
Figure 11: Macro fusion explained at IDF

Figure 12: Power Management capabilities of Core architecture

In previous generations, the execution units typically broke a 128-bit SSE instruction into two halves, which resulted in two micro-ops and thus two execution clock cycles. In this generation, Intel extended the execution width of the ALU and the load/store units to 128 bits, allowing eight single-precision or four double-precision floating-point operations to be processed per cycle. The feature was called Advanced Digital Media Boost, because it applied to the SSE instructions utilised by multimedia transcoding applications; a short illustration of this 128-bit SIMD style appears below.

Intel Advanced Smart Cache referred to the unified L2 cache that allowed a large L2 cache (2 MB or 4 MB) to be shared by two processing cores. Caching was more effective now because data was no longer stored twice in different L2 caches (no replication). This freed the system bus from being overloaded with RAM read/write activity, as each core could share data directly through the cache.

The Smart Memory Access feature referred to the inclusion of prefetchers. A prefetcher gets data into a higher-level unit using very speculative algorithms. It is designed to provide data that is very likely to be requested soon, which can reduce memory access latency and increase efficiency. The memory prefetchers constantly watch memory access patterns, trying to predict whether there is something they could move into the L2 cache from RAM, just in case that data is requested next.
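To illustrate the 128-bit SIMD style that Advanced Digital Media Boost accelerates, here is a minimal C example using SSE intrinsics, in which a single instruction adds four packed single-precision floats. It assumes an x86 compiler with SSE support (enabled by default on x86-64); the data values are arbitrary.

```c
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

int main(void)
{
    float a[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    float b[4] = {10.0f, 20.0f, 30.0f, 40.0f};
    float r[4];

    __m128 va = _mm_loadu_ps(a);        /* load 4 floats into a 128-bit register */
    __m128 vb = _mm_loadu_ps(b);
    __m128 vr = _mm_add_ps(va, vb);     /* 4 additions in one instruction */
    _mm_storeu_ps(r, vr);

    printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);
    return 0;
}
```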
Intelligent Power Capability was a culmination of many techniques. The 65 nm process provided a good basis for efficient ICs. Clock gating and sleep transistors made sure that all units, as well as single transistors, that were not needed remained shut down. Enhanced SpeedStep still reduced the clock speed when the system was idle or under a low load, and was also capable of controlling each core separately.

Some features were also made available, such as the Execute Disable Bit, by which an operating system with support for the bit may mark certain areas of memory as non-executable. The processor will then refuse to execute any code residing in these areas of memory. The general technique, known as executable space protection, is used to prevent certain types of malicious software from taking over computers by inserting their code into another program's data storage area and running it from within this section; this is known as a buffer overflow attack. It is also to be noted that HyperThreading was removed.

Tick-Tock Cadence

Since 2007, Intel has adopted a "Tick-Tock" model in which every microarchitectural change is followed by a die shrink of the process technology. Every "Tick" is a shrink of the previous microarchitecture's process technology, and every "Tock" is a new microarchitecture. Every 12 to 18 months, there is expected to be one Tick or Tock.

Figure 13: Intel's new tick tock strategy revealed at IDF

In 2007, the Core microarchitecture underwent a "Tick" to the 45 nm process; the resulting processors were codenamed Penryn. Process shrinking always brings down energy consumption and improves power savings.

The Nehalem Microarchitecture

The next Tock was introduced in 2008 with the Nehalem microarchitecture. The transistor count in this generation was nearing the billion mark, with around 700 million transistors in the i7. The pipeline frontend and backend are illustrated below.
Figure 15: Nehalem pipeline frontend

Figure 14: Nehalem pipeline backend

The new changes to the pipeline in this generation were as follows:

• Loop Stream Detector – detected and cached loops, to avoid fetching the same instructions from cache and decoding them again and again.

• Improved Branch Predictor – fetched branch instructions prior to execution based on an improved prediction algorithm.

• SSE4+ – new instructions helpful for operations on databases and DNA sequencing were introduced.

Figure 16: Improved Loop Stream Detector

Other changes to the architecture were:

• HyperThreading – HT was reintroduced.

• Turbo Boost – the processor could intelligently control its clock speed as per application requirements and thus dynamically conserve power. Unlike EIST, no OS intervention is required.

Figure 17: Nehalem CPU based PC architecture
• QPI – the QuickPath Interconnect was the new system bus, replacing the FSB. Intel had moved the memory controller onto the CPU die.

• L3 Cache – shared between all four cores.

The next Tick came in 2010, codenamed Westmere, with the process shrinking to 32 nm.

The Sandy Bridge Microarchitecture

The next Tock came in 2011 with the Sandy Bridge microarchitecture, also marketed as the 2nd generation of i3, i5 and i7 processors. With Sandy Bridge, Intel surpassed the 1 billion transistor mark. The architectural improvements in this generation can be summarised in the diagram below:

Figure 18: Sandy Bridge architecture overview at IDF

Changes to the pipeline were as follows:

• A Micro-op Cache – when Sandy Bridge's fetch hardware grabs a new instruction, it first checks whether the instruction is in the micro-op cache; if it is, the cache services the rest of the pipeline and the frontend is powered down. The decode hardware is a very complex part of the x86 pipeline, and turning it off saves a significant amount of power.
Figure 19: Sandy Bridge pipeline frontend

Figure 20: Sandy Bridge pipeline backend

• Redesigned Branch Prediction Unit – Sandy Bridge caches twice as many branch targets as Nehalem, with more effective and longer storage of branch history.

• Physical Register File – a physical register file stores micro-op operands in the register file; as a micro-op travels down the OoO (out-of-order) execution engine, it carries only pointers to its operands and not the data itself. This significantly reduces the power of the OoO hardware (moving large amounts of data around a chip eats power), and it also reduces die area further down the pipe. The die savings are translated into a larger out-of-order window.

• AVX Instruction Set – Advanced Vector Extensions are a group of instructions suitable for floating-point-intensive calculations in multimedia, scientific and financial applications. Sandy Bridge features 256-bit operands for this instruction set; a short illustration appears below.

Other changes to the architecture were:

• Ring On-Die Interconnect – with Nehalem/Westmere, all cores, whether two, four or six of them, had their own private path to the last-level (L3) cache. That is roughly 1,000 wires per core. The problem with this approach is that it does not scale well as more agents need access to the L3 cache. Sandy Bridge adds a GPU and a video transcoding engine on-die that share the L3 cache, so rather than laying out another 2,000 wires to the L3 cache, Intel introduced a ring bus.
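A minimal C illustration of the 256-bit AVX operands mentioned above: one intrinsic (compiling to one instruction) multiplies eight packed single-precision floats, twice the width of an SSE register. It assumes a compiler and CPU with AVX support (e.g. gcc with -mavx); the data values are arbitrary.

```c
#include <stdio.h>
#include <immintrin.h>   /* AVX intrinsics */

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float r[8];

    __m256 va = _mm256_loadu_ps(a);     /* load 8 floats into a 256-bit register */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vr = _mm256_mul_ps(va, vb);  /* 8 multiplications in one instruction */
    _mm256_storeu_ps(r, vr);

    for (int i = 0; i < 8; i++)
        printf("%g ", r[i]);
    printf("\n");
    return 0;
}
```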
• On-Die GPU and QuickSync – the Sandy Bridge GPU is on-die, built out of the same 32 nm transistors as the CPU cores, and gets equal access to the L3 cache. The GPU is on its own power island and clock domain, and can be powered down or clocked up independently of the CPU. Graphics turbo is available on both desktop and mobile parts. QuickSync is a hardware acceleration technology for video transcoding; rendering videos is faster and more efficient.

• Multimedia Transcoding – media processing in Sandy Bridge is composed of two major components: video decode and video encode. The entire video pipeline is now decoded via fixed-function units. This is in contrast to Intel's previous design, which used the EU array for some video decode stages. Processor power is cut in half for HD video playback.

Figure 21: Video transcoding capabilities of Nehalem

• More Aggressive Turbo Boost

The next Tick came in 2012 with the Ivy Bridge microarchitecture. The die was shrunk to a 22 nm process, and it was marketed as the 3rd generation of i3, i5 and i7 processors. Intel used the FinFET tri-gate transistor structure for the first time. Comparisons of the new structure released by Intel are provided below.

Figure 22: Typical planar transistor

Figure 23: FinFET Tri-Gate transistor

As the diagram shows, a FinFET structure, or a 3D gate as Intel calls it, allows for more control over the channel by maximizing the gate area. This means high ON current and extremely low leakage current, which directly translates into lower operating voltages, lower TDPs and hence higher clock frequencies. Comparisons of delay and operating voltage between the two structures are shown below.
Figure 24: FinFET Delay vs Power

Figure 25: SEM photograph of fabricated FinFET trigate transistors

A scanning electron microscope image of the fabricated transistors is also shown. A single transistor consists of multiple fins, as parallel conduction paths maximize current flow.

The Haswell Microarchitecture

Ivy Bridge was followed by the next Tock in 2013, the Haswell microarchitecture. It is currently being marketed as the 4th generation of Core i3, i5 and i7 processors. Changes to the pipeline were as follows:

• Wider Execution Unit – Haswell adds two more execution ports, one for integer math and branches (port 6) and one for store address calculation (port 7). The extra ALU and port do one of two things: either improve performance for integer-heavy code, or allow integer work to continue while FP math occupies ports 0 and 1.

• AVX2 and FMA – the other major addition to the execution engine is support for Intel's AVX2 instructions, including FMA (Fused Multiply-Add). Ports 0 and 1 now include newly designed 256-bit FMA units. As each FMA operation is effectively two floating-point operations, these two units double the peak floating-point throughput of Haswell compared to Sandy/Ivy Bridge; a short FMA illustration appears below.
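A minimal C sketch of the fused multiply-add introduced with Haswell: _mm256_fmadd_ps computes a*b + c across eight single-precision lanes as one operation, which is why two FMA ports double peak floating-point throughput. It assumes a compiler and CPU with FMA support (e.g. gcc with -mfma); the values are arbitrary.

```c
#include <stdio.h>
#include <immintrin.h>   /* AVX/FMA intrinsics */

int main(void)
{
    __m256 a = _mm256_set1_ps(2.0f);    /* broadcast 2.0 to all 8 lanes */
    __m256 b = _mm256_set1_ps(3.0f);
    __m256 c = _mm256_set1_ps(1.0f);

    /* One fused operation per lane: r = a*b + c = 7.0 in every lane. */
    __m256 r = _mm256_fmadd_ps(a, b, c);

    float out[8];
    _mm256_storeu_ps(out, r);
    printf("%g\n", out[0]);
    return 0;
}
```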
Figure 26: Haswell pipeline frontend

Figure 27: Haswell pipeline backend

The architectural improvements in this generation can be summarised as follows:

• Improved L3 Cache – the cache bandwidth has been increased, and the cache is now also capable of clocking itself separately from the cores.

• GPU and QuickSync – notable performance improvements have been made to the on-die GPU. QuickSync, the hardware acceleration technology for multimedia transcoding, improves on image quality and adds support for certain codecs such as SVC, Motion JPEG and MPEG-2.

Performance Comparisons

Before concluding this document, a comparison of the performance of these processors is in order. The following graphs showcase the performance improvements of Intel processors over five generations, starting with Conroe all the way up to Haswell.
Figure 28: Performance comparisons of 5 generations of Intel processors

Intel is about half a century old. From the 4004 to the current 4th generation of i7, i5 and i3 processors, a lot has changed in the electronics industry. But this is not the end; the evolution will continue. Intel's next Tick will be Broadwell, scheduled for this year and utilizing 14 nm transistor technology.
Shift in Computing Trends

With its powerful x86 architecture and excellent business strategy, Intel has managed to dominate the PC market for almost its entire existence. Now, however, market analysts have noticed a significant new shift in computing trends: more and more customers are losing interest in the PC and moving towards more mobile computing platforms. The chart below (courtesy: Gartner) highlights this shift, plotting unit sales from 2012 to 2014 for PCs (desktop and notebook), ultramobiles, tablets and smartphones (normalised by 4).

Figure 29: Market share of personal computing devices

PC sales are beginning to drop, as is evident, while the era of tablets and smartphones is beginning. A common mistake many industry giants make is to underestimate such shifts, and they end up losing it all. It happened with IBM (it lost the PC market), and Intel will be no exception unless it is careful.

Advanced RISC Machines

The battle for the mainstream processor market has been fought between two main protagonists, Intel and AMD, while semiconductor manufacturers like Sun and IBM traditionally concentrated on the more specialist Unix server and workstation markets. Unnoticed by many, another company rose to a point of dominance, with sales of chips based on its technology far surpassing those of Intel and AMD combined. That pioneering company is ARM Holdings, and while it's not a name that's on everyone's lips in the same way that the
'big two' are, indications suggest that this company will continue to go from strength to strength.

Early 8-bit microprocessors like the Intel 8080 or the Motorola 6800 had only a few simple instructions. They didn't even have an instruction to multiply two integer numbers, for example, so this had to be done using long software routines involving multiple shifts and additions (a sketch of such a routine appears at the end of this section). Working on the belief that hardware was fast but software was slow, subsequent microprocessor development involved providing processors with more instructions to carry out ever more complicated functions. Called the CISC (complex instruction set computer) approach, this was the philosophy that Intel adopted and that, more or less, is still followed by today's latest Core i7 processors. In the early 1980s, a radically different philosophy called RISC (reduced instruction set computer) was conceived. According to this model of computing, processors would have only a few simple instructions but, as a result of this simplicity, those instructions would be superfast, most of them executing in a single clock cycle. So while much more of the work would have to be done in software, an overall gain in performance would be achievable. ARM was established on this philosophy.

Semiconductor companies usually design their chips and fabricate them at their own facilities (like Intel) or outsource fabrication to a foundry such as TSMC. ARM, however, designs processors but neither manufactures silicon chips nor markets ARM-branded hardware. Instead it sells, or more accurately licences, intellectual property (IP), which allows other semiconductor companies to manufacture ARM-based hardware. Designs are supplied as a circuit description, from which the manufacturer creates a physical design to meet the needs of its own manufacturing processes. The description is provided in a hardware description language, which gives a textual definition of how the building blocks connect together; the language used is RTL (register-transfer level).
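As noted above, early 8-bit processors had no multiply instruction, so multiplication was performed in software with shifts and adds. The following minimal C rendering of that classic shift-and-add routine is illustrative; real 8080/6800 implementations were written in assembly.

```c
#include <stdio.h>
#include <stdint.h>

/* Shift-and-add multiplication of two 8-bit operands: for each set
 * bit of b, add the correspondingly shifted copy of a to the product.
 * This is the software routine early 8-bit CPUs had to run in place
 * of a hardware multiply instruction. */
static uint16_t mul8(uint8_t a, uint8_t b)
{
    uint16_t product = 0;
    uint16_t addend = a;
    while (b) {
        if (b & 1)              /* low bit of b set? */
            product += addend;  /* add the shifted multiplicand */
        addend <<= 1;           /* shift multiplicand left */
        b >>= 1;                /* examine the next bit of b */
    }
    return product;
}

int main(void)
{
    printf("23 * 17 = %u\n", mul8(23, 17));  /* prints 391 */
    return 0;
}
```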
System on Chip (SoC)

A processor is the large component that forms the heart of the PC. A core, on the other hand, is the heart of a microprocessor that semiconductor manufacturers can build into their own custom chip designs. That customised chip will often be much more than what most people would think of as a processor, and can provide a significant proportion of the functionality required in a particular device. Referred to as a system on chip (SoC) design, this type of chip minimises the number of components, which in turn keeps down both the cost and the size of the circuit board, both of which are essential for high-volume portable products such as smartphones.

ARM-powered SoCs are included in games consoles, personal media players, set-top boxes, internet radios, home automation systems, GPS receivers, ebook readers, TVs, DVD and Blu-ray players, digital cameras and home media servers. Cheaper, less powerful chips are found in home products, including toys, cordless phones and even coffee makers. They're even used in cars to drive dashboard displays, anti-lock braking, airbags and other safety-related systems, and for engine management. Healthcare products have also been a major growth area over the last five years, with products varying from remote patient monitoring systems to medical imaging scanners. ARM devices are used extensively in hard disk and solid state drives. They also crop up in wireless keyboards, and are used as the driving force behind printers and networking devices like wireless routers/access points.

Figure 30: A smartphone SoC; TI's OMAP

Modern SoCs also come with advanced (DirectX 9-equivalent) graphics capabilities that can surpass game consoles like the Nintendo Wii. Imagination Technologies, once known in the PC world for its "PowerVR" graphics cards, licenses its graphics processor designs to many SoC makers, including Samsung, Apple and many more. Others like Qualcomm or NVIDIA design their own graphics architecture.
Texas Instruments markets its SoCs under the OMAP series, Qualcomm under the Snapdragon brand, NVIDIA under the Tegra brand, and other companies such as Apple market theirs as the A series. HTC, LG, Nokia and other smartphone manufacturers do not design their own SoCs but use those mentioned above.

Finally, SoCs come with a myriad of smaller co-processors that are critical to overall system performance. The video encoding and decoding hardware powers the video functionality of smartphones, the image processor ensures that photos are processed properly and saved quickly, and the audio processor frees the CPU(s) from having to work on audio signals. Together, all those components (and their associated drivers and software) define the overall performance of a system.

Figure 31: A SoC for tablets; NVIDIA Tegra
Conclusion

Computers have truly revolutionized our world and have changed the way we work, communicate and entertain ourselves. Fuelled by constant innovations in chip design and transistor technology, this evolution shows no sign of stopping. In recent years, there have been tremendous shifts in computing trends, with mobile computers such as tablets and smartphones becoming more and more preferred, possibly due to falling costs and prices. While computing did start with the microprocessor, it is headed towards a scheme that incorporates the microprocessor as a smaller subset of a larger system, one that integrates graphics, memory, modem and video transcoding co-processors on a single chip. The SoC era has begun…
References

[1] Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture, [online] Available: http://www.intel.com/products/processor/manuals

[2] King, J.; Quinnell, E.; Galloway, F.; Patton, K.; Seidel, P.; Dinh, J.; Hai Bui and Bhowmik, A., "The Floating-Point Unit of the Jaguar x86 Core," in 21st IEEE Symposium on Computer Arithmetic (ARITH), 2013, pp. 7-16.

[3] Ibrahim, A.H.; Abdelhalim, M.B.; Hussein, H.; Fahmy, A., "Analysis of x86 instruction set usage for Windows 7 applications," in 2nd International Conference on Computer Technology and Development (ICCTD), 2010, pp. 511-516.

[4] PC Architecture, Acid Reviews, [online] 2014, http://acidreviews.blogspot.in/2008/12/pc-architecture.html (Accessed: 2nd February 2014).

[5] Alpert, D. and Avnon, D., "Architecture of the Pentium microprocessor," IEEE Micro, vol. 13, issue 3, pp. 11-21, 1993.

[6] Computer Processor History, Computer Hope, [online] 2014, http://www.computerhope.com/history/processor.htm (Accessed: 2nd February 2014).

[7] Gartner Press Release, Gartner, [online] 2014, http://www.gartner.com/newsroom/id/2610015 (Accessed: 8th February 2014).

[8] Intel Processor Number, CPU World, [online] 2014, http://www.cpuworld.com/info/Intel/processor-number.html (Accessed: 9th February 2014).
