Solution Manual for Modern Processor Design by John Paul
Shen and Mikko H. Lipasti
This book emerged from the course Superscalar Processor Design, which has been taught at
Carnegie Mellon University since 1995. Superscalar Processor Design is a mezzanine course targeting
seniors and first-year graduate students. Quite a few of the more aggressive juniors have taken the
course in the spring semester of their junior year. The prerequisite to this course is the
Introduction to Computer Architecture course. The objectives for the Superscalar Processor Design
course include: (1) to teach modern processor design skills at the microarchitecture level of
abstraction; (2) to cover current microarchitecture techniques for achieving high performance
via the exploitation of instruction-level parallelism (ILP); and (3) to impart insights and hands-on
experience for the effective design of contemporary high-performance microprocessors for mobile,
desktop, and server markets. In addition to covering the contents of this book, the course contains
a project component that involves the microarchitectural design of a future-generation superscalar
microprocessor.

Here, in the following posts, I am going to post solutions for the same textbook (Modern
Processor Design by John Paul Shen and Mikko H. Lipasti). If you find any difficulty or want to
suggest anything, feel free to comment. :)

Link: http://targetiesnow.blogspot.in/p/solution-manual-for-modern-processor.html

Modern Processor Design by John Paul Shen and Mikko H.
Lipasti : Exercise 1.6 and 1.7 Solution
Q.1.6: A program's run time is determined by the product of instructions per program, cycles
per instruction, and clock cycle time. Assume the following instruction mix for a MIPS-like RISC
instruction set: 15% stores, 25% loads, 15% branches, 35% integer arithmetic, 5% integer shift,
and 5% integer multiply. Given that load instructions require two cycles, branches require four
cycles, integer ALU instructions require one cycle, and integer multiplies require ten cycles,
compute the overall CPI.

Q.1.7: Given the parameters of Problem 6, consider a strength-reducing optimization that
converts multiplies by a compile-time constant into a sequence of shifts and adds. For this
instruction mix, 50% of the multiplies can be converted to shift-add sequences with an average
length of three instructions. Assuming a fixed frequency, compute the change in instructions per
program, cycles per instruction, and overall program speedup.
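
A minimal sketch of how Q.1.6 and Q.1.7 can be set up, normalizing to 100 baseline instructions. The problem does not state the latency of stores and shifts, so treating both as one cycle is an assumption here; the numbers are illustrative, not the book's official solution.

#include <stdio.h>

int main(void) {
    /* Q.1.6 baseline, per 100 instructions; 1-cycle stores and shifts are an assumption */
    double cycles = 15*1 + 25*2 + 15*4 + 35*1 + 5*1 + 5*10;   /* = 215 */
    printf("baseline CPI = %.3f\n", cycles / 100.0);

    /* Q.1.7: half of the 5 multiplies become 3-instruction shift/add sequences (1 cycle each) */
    double new_insts  = 100 - 2.5 + 2.5*3;                    /* = 105 */
    double new_cycles = cycles - 2.5*10 + 2.5*3*1;            /* = 197.5 */
    printf("new CPI = %.3f, instruction growth = %.1f%%, speedup = %.3f\n",
           new_cycles / new_insts, new_insts - 100, cycles / new_cycles);
    return 0;
}

At a fixed frequency the speedup is simply the ratio of total cycles before and after the optimization.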

Solution: http://targetiesnow.blogspot.in/2013/11/modern-processor-design-byjohn-paul_9765.html#links

Ex 1.8, 1.9 and 1.10 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti

Q.1.8: Recent processors like the Pentium 4 do not implement single-cycle shifts. Given the
scenario of Problem 7, assume that s = 50% of the additional integer and shift instructions
introduced by strength reduction are shifts, and that shifts now take four cycles to execute.
Recompute the cycles per instruction and overall program speedup. Is strength reduction still a
good optimization?

Q.1.9: Given the assumptions of Problem 8, solve for the break-even ratio s (percentage of
additional instructions that are shifts). That is, find the value of s (if any) for which program
performance is identical to the baseline case without strength reduction (Problem 6).

Q.1.10: Given the assumptions of Problem 8, assume you are designing the shift unit on the
Pentium 4 processor. You have concluded there are two possible implementation options for the
shift unit: 4-cycle shift latency at a frequency of 2 GHz, or 2-cycle shift latency at 1.9 GHz. Assume
the rest of the pipeline could run at 2 GHz, and hence the 2-cycle shifter would set the entire
processor’s frequency to 1.9 GHz. Which option will provide better overall performance?
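
A hedged way to frame Q.1.10: with the instruction count fixed, execution time is proportional to CPI / f, so the 4-cycle shifter at 2 GHz is the better option exactly when

    CPI(4-cycle shifts) / 2.0 GHz  <  CPI(2-cycle shifts) / 1.9 GHz,

where the two CPI values come from reworking Problem 8 with the respective shift latencies.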

Solution: http://targetiesnow.blogspot.in/2013/11/modern-processor-design-by-john-paul_13.html

Q.2.4: Consider that you would like to add a load-immediate instruction to the TYP instruction
set and pipeline. This instruction extracts a 16-bit immediate value from the instruction word,
sign-extends the immediate value to 32 bits, and stores the result in the destination register
specified in the instruction word. Since the extraction and sign-extension can be accomplished
without the ALU, your colleague suggests that such instructions be able to write their results
into the register in the decode (ID) stage. Using the hazard detection algorithm described in
Figure 2-15, identify what additional hazards such a change might introduce.

Q.2.5: Ignoring pipeline interlock hardware (discussed in Problem 6), what additional pipeline
resources does the change outlined in Problem 4 require? Discuss these resources and their cost.

Q.2.6: Considering the change outlined in Problem 4, redraw the pipeline interlock hardware
shown in Figure 2-18 to correctly handle the load-immediate instructions.

Solution: http://targetiesnow.blogspot.in/2013/11/ex-24-25-26-solution-modern-processor.html

Ex 2.8, 2.9 & 2.15 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti

Q.2.8: Consider adding a store instruction with indexed addressing mode to the TYP pipeline.
This store differs from the existing store with register+immediate addressing mode by computing
its effective address as the sum of two source registers; that is, stx r3, r4, r5 performs
MEM[r4+r5] <- r3. Describe the additional pipeline resources needed to support such an
instruction in the TYP pipeline. Discuss the advantages and disadvantages of such an instruction.
Q.2.9: Consider adding a load-update instruction with register+immediate and post-update
addressing mode. In this addressing mode, the effective address for the load is computed as
register+immediate, and the resulting address is written back into the base register. That is,
lwu r3,8(r4) performs r3 <- MEM[r4+8]; r4 <- r4+8. Describe the additional pipeline resources
needed to support such an instruction in the TYP pipeline.

Q.2.15: The IBM study of pipelined processor performance assumed an instruction mix based on
popular C programs in use in the 1980s. Since then, object-oriented languages like C++ and Java
have become much more common. One of the effects of these languages is that object inheritance
and polymorphism can be used to replace conditional branches with virtual function calls. Given
the IBM instruction mix and CPI shown in the following table, perform the following transformations
to reflect the use of C++/Java, and recompute the overall CPI and speedup or slowdown due to this
change:
• Replace 50% of taken conditional branches with a load instruction followed by a jump register
instruction (the load and jump register implement a virtual function call).
• Replace 25% of not-taken branches with a load instruction followed by a jump register
instruction.

Solution: http://targetiesnow.blogspot.in/2013/11/ex-28-29-215-solution-modernprocessor.html

Q.2.16:

In a TYP-based pipeline design with a data cache, load
instructions check the tag array for a cache hit in parallel with accessing
the data array to read the corresponding memory location. Pipelining stores
to such a cache is more difficult, since the processor must check the tag
first, before it overwrites the data array. Otherwise, in the case of a cache
miss, the wrong memory location may be overwritten by the store. Design a
solution to this problem that does not require sending the store down the
pipe twice, or stalling the pipe for every store instruction. Referring to
Figure 2-15, are there any new RAW, WAR, and/or WAW memory
hazards?
Q.2.17: The MIPS pipeline shown in Table 2-7 employs a two-phase
clocking scheme that makes efficient use of a shared TLB, since instruction
fetch accesses the TLB in phase one and data fetch accesses in phase
two. However, when resolving a conditional branch, both the branch target
address and the branch fall-through address need to be translated during
phase one in parallel with the branch condition check in phase one of the
ALU stage to enable instruction fetch from either the target or the fall-through during phase two. This seems to imply a dual-ported TLB. Suggest
an architected solution to this problem that avoids dual-porting the TLB.

Solution: http://targetiesnow.blogspot.in/2013/11/ex-216-217-solution-modernprocessor.html

Q.3.1: Given the following benchmark code, and assuming a virtually-addressed fully-associative
cache with infinite capacity and 64-byte blocks, compute the overall miss rate (number of misses
divided by number of references). Assume that all variables except array locations reside in
registers, and that arrays A, B, and C are placed consecutively in memory.

double A[1024], B[1024], C[1024];
for (int i = 0; i < 1000; i += 2) {
    A[i] = 35.0 * B[i] + C[i+1];
}
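
A minimal simulation sketch for Q.3.1: with infinite capacity and full associativity, only cold misses remain, so counting distinct 64-byte blocks touched in reference order is enough. Placing A at address 0 with B and C consecutive, and ordering each iteration's references as B[i], C[i+1], then the store to A[i], are modeling assumptions.

#include <stdio.h>

int main(void) {
    enum { BLOCK = 64 };
    static char seen[3 * 1024 * 8 / BLOCK];     /* one flag per 64-byte block of A, B, C */
    long A = 0, B = 8192, C = 16384;            /* consecutive arrays of 1024 8-byte doubles */
    long refs = 0, misses = 0;

    for (int i = 0; i < 1000; i += 2) {
        long addr[3] = { B + 8L*i, C + 8L*(i+1), A + 8L*i };
        for (int k = 0; k < 3; k++) {
            long blk = addr[k] / BLOCK;
            refs++;
            if (!seen[blk]) { misses++; seen[blk] = 1; }
        }
    }
    printf("references = %ld, misses = %ld, miss rate = %.4f\n",
           refs, misses, (double)misses / refs);
    return 0;
}
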
Q.3.3: Given the example code in Problem 1, and assuming a virtually-addressed two-way
set-associative cache of capacity 8KB and 64-byte blocks, compute the overall miss rate (number
of misses divided by number of references). Assume that all variables except array locations
reside in registers, and that arrays A, B, and C are placed consecutively in memory.

Solution: http://targetiesnow.blogspot.in/2013/11/ex-31-33-solution-modernprocessor.html
Q.3.4: Consider a cache with 256 bytes. Word size is 4 bytes and block size is 16 bytes. Show
the values in the cache and tag bits after each of the following memory access operations for the
two cache organizations, direct-mapped and 2-way set-associative. Also indicate whether each
access was a hit or a miss, and justify. The addresses are in hexadecimal representation. Use the
LRU (least recently used) replacement algorithm wherever needed. (An address-breakdown sketch
follows the access list.)
1. Read 0010
2. Read 001C
3. Read 0018
4. Write 0010
5. Read 0484
6. Read 051C
7. Read 001C
8. Read 0210
9. Read 051C
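
A minimal sketch of the address decomposition only, assuming byte addresses and the usual offset/index/tag split; the hit/miss walk-through itself is left to the linked solution. The 16-byte blocks give a 4-bit offset; the direct-mapped cache has 256/16 = 16 sets (4 index bits) and the 2-way cache has 8 sets (3 index bits).

#include <stdio.h>

int main(void) {
    unsigned addrs[] = { 0x0010, 0x001C, 0x0018, 0x0010, 0x0484,
                         0x051C, 0x001C, 0x0210, 0x051C };
    for (int i = 0; i < 9; i++) {
        unsigned a = addrs[i];
        printf("%04X: DM set %2u tag 0x%02X | 2-way set %u tag 0x%02X\n",
               a, (a >> 4) & 0xF, a >> 8, (a >> 4) & 0x7, a >> 7);
    }
    return 0;
}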

Solution: http://targetiesnow.blogspot.in/2013/11/modern-processor-design-byjohn-paul_1.html

Ex. 3.13 Solution : Modern Processor Design by John Paul
Shen and Mikko H. Lipasti : Solution Manual

Q.3.13: Consider a processor with 32-bit virtual addresses, 4KB pages, and 36-bit physical
addresses. Assume memory is byte-addressable (i.e., the 32-bit VA specifies a byte in memory).
L1 instruction cache: 64 Kbytes, 128-byte blocks, 4-way set-associative, indexed and tagged with
virtual address.
L1 data cache: 32 Kbytes, 64-byte blocks, 2-way set-associative, indexed and tagged with physical
address, write-back.
TLB: 4-way set-associative with 128 entries in all. Assume the TLB keeps a dirty bit, a reference
bit, and 3 permission bits (read, write, execute) for each entry.

Specify the number of offset, index, and tag bits for each of these structures in the table below.
Also, compute the total size in number of bit cells for each of the tag and data arrays.
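
As a rough cross-check (not the book's worked table), the field widths fall out of the capacities directly; the bit-cell totals are left to the linked solution:

L1 I-cache: 64 KB / 128-byte blocks = 512 blocks; 512 / 4 ways = 128 sets, so 7 offset bits, 7 index bits, and a virtual tag of 32 - 7 - 7 = 18 bits.
L1 D-cache: 32 KB / 64-byte blocks = 512 blocks; 512 / 2 ways = 256 sets, so 6 offset bits, 8 index bits, and a physical tag of 36 - 6 - 8 = 22 bits.
TLB: 128 entries / 4 ways = 32 sets, so 5 index bits taken from the 20-bit virtual page number (the 12-bit page offset is not translated), leaving a 15-bit tag.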

Solution: http://targetiesnow.blogspot.in/2013/11/ex-313-solution-modernprocessor-design.html
Q.3.16: Assume a two-level cache hierarchy with a private level-one instruction cache (L1I), a
private level-one data cache (L1D), and a shared level-two data cache (L2). Given local miss rates
of 4% for L1I, 7.5% for L1D, and 35% for L2, compute the global miss rate for the L2 cache.

Q.3.17: Assuming 1 L1I access per instruction and 0.4 data accesses per instruction, compute the
misses per instruction for the L1I, L1D, and L2 caches of Problem 16.

Q.3.18: Given the miss rates of Problem 16, and assuming that accesses to the L1I and L1D caches
take one cycle, accesses to the L2 take 12 cycles, accesses to main memory take 75 cycles, and a
clock rate of 1 GHz, compute the average memory reference latency for this cache hierarchy.

Q.3.19: Assuming a perfect-cache CPI (cycles per instruction) for a pipelined processor equal to
1.15, compute the MCPI and overall CPI for a pipelined processor with the memory hierarchy
described in Problem 18 and the miss rates and access rates specified in Problem 16 and
Problem 17.
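
A sketch for Problems 3.16 through 3.19 under one common set of assumptions: miss penalties are not overlapped, the L2 is accessed on every L1 miss, and the "global" L2 miss rate means L2 misses divided by all L1 references. Other accountings are possible, so treat the output as illustrative rather than the book's answer.

#include <stdio.h>

int main(void) {
    double l1i_acc = 1.0, l1d_acc = 0.4;                  /* accesses per instruction */
    double l1i_mr = 0.04, l1d_mr = 0.075, l2_mr = 0.35;   /* local miss rates */

    double l1i_mpi = l1i_acc * l1i_mr;                    /* 0.04  misses/instr */
    double l1d_mpi = l1d_acc * l1d_mr;                    /* 0.03  misses/instr */
    double l2_mpi  = (l1i_mpi + l1d_mpi) * l2_mr;         /* 0.0245 misses/instr */
    double global_l2 = l2_mpi / (l1i_acc + l1d_acc);

    /* Average latency per reference at 1 GHz: 1-cycle L1, 12-cycle L2, 75-cycle memory */
    double l1_mr = (l1i_mpi + l1d_mpi) / (l1i_acc + l1d_acc);
    double avg_lat = 1 + l1_mr * (12 + l2_mr * 75);

    /* Memory CPI on top of a 1.15 perfect-cache CPI */
    double mcpi = (l1i_mpi + l1d_mpi) * 12 + l2_mpi * 75;
    printf("L1I/L1D/L2 misses per instruction: %.3f %.3f %.4f\n", l1i_mpi, l1d_mpi, l2_mpi);
    printf("global L2 miss rate: %.4f, avg reference latency: %.2f cycles\n", global_l2, avg_lat);
    printf("MCPI = %.2f, overall CPI = %.2f\n", mcpi, 1.15 + mcpi);
    return 0;
}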

Solution: http://targetiesnow.blogspot.in/2013/11/ex-316-317-318-and-319solution-modern.html

Q.3.28: Assume a synchronous front-side processor-memory bus that operates at 100 MHz and has
an 8-byte data bus. Arbitration for the bus takes one bus cycle (10 ns), issuing a cache line
read command for 64 bytes of data takes one cycle, memory controller latency (including DRAM
access) is 60 ns, after which data double words are returned in back-to-back cycles. Further
assume the bus is blocking or circuit-switched. Compute the latency to fill a single 64-byte
cache line. Then compute the peak read bandwidth for this processor-memory bus, assuming the
processor arbitrates for the bus for a new read in the bus cycle following completion of the
last read.
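
A minimal sketch for Q.3.28 with a 10 ns bus cycle (100 MHz): one cycle to arbitrate, one to issue the read command, 60 ns of controller/DRAM latency, then 64 B / 8 B = 8 back-to-back data cycles. Treating the 60 ns as fully exposed (not overlapped with anything else) is an assumption.

#include <stdio.h>

int main(void) {
    double cycle = 10.0;                               /* ns per bus cycle */
    double latency = cycle /*arbitration*/ + cycle /*command*/
                   + 60.0 /*memory controller + DRAM*/ + 8 * cycle /*data*/;
    printf("line fill latency = %.0f ns\n", latency);                      /* 160 ns */
    printf("peak read bandwidth = %.0f MB/s\n", 64.0 / latency * 1000.0);  /* bytes/ns -> MB/s */
    return 0;
}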

Q.3.31: Consider finite DRAM bandwidth at a memory controller, as follows. Assume double-data-rate
DRAM operating at 100 MHz in a parallel non-interleaved organization, with an 8-byte interface to
the DRAM chips. Further assume that each cache line read results in a DRAM row miss, requiring a
precharge and RAS cycle, followed by row-hit CAS cycles for each of the double words in the cache
line. Assuming memory controller overhead of one cycle (10 ns) to initiate a read operation, and
one cycle of latency to transfer data from the DRAM data bus to the processor-memory bus, compute
the latency for reading one 64-byte cache block. Now compute the peak data bandwidth for the
memory interface, ignoring DRAM refresh cycles.
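
A hedged sketch for Q.3.31, assuming the precharge and the RAS each take one 10 ns DRAM cycle and that DDR returns two 8-byte double words per cycle (so the eight double words of a 64-byte line need 4 CAS cycles). The peak-bandwidth figure below further assumes no overlap between back-to-back reads; other overlap assumptions give different numbers.

#include <stdio.h>

int main(void) {
    double cycle = 10.0;                                    /* ns per DRAM cycle */
    double latency = cycle /*controller*/ + cycle /*precharge*/ + cycle /*RAS*/
                   + 4 * cycle /*CAS, DDR*/ + cycle /*transfer to bus*/;
    printf("64-byte block read latency = %.0f ns\n", latency);                       /* 80 ns */
    printf("peak bandwidth (no overlap assumed) = %.0f MB/s\n", 64.0 / latency * 1000.0);
    return 0;
}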

Solution: http://targetiesnow.blogspot.in/2013/11/ex-328-and-331-modernprocessor-design.html

Q.3.34: Assume a single-platter disk drive with an average seek time of 4.5 ms, rotation speed of
7200 rpm, data transfer rate of 10 Mbytes/s per head, and controller overhead and queueing of
1 ms. What is the average access latency for a 4096-byte read?

Q.3.35: Recompute the average access latency for Problem 34 assuming a rotation speed of 15K rpm,
two platters, and an average seek time of 4.0 ms.
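
A minimal sketch for Q.3.34 and Q.3.35: access latency modeled as seek + half a rotation + transfer + controller/queueing overhead. Treating the two platters in Q.3.35 as leaving the per-head transfer rate unchanged is an assumption.

#include <stdio.h>

/* seek and overhead in ms, rotation in rpm, transfer rate in MB/s, request size in bytes */
static double access_ms(double seek_ms, double rpm, double mb_per_s, double bytes, double ovh_ms) {
    double half_rot = 0.5 * 60000.0 / rpm;                 /* ms for half a rotation */
    double transfer = bytes / (mb_per_s * 1e6) * 1000.0;   /* ms to stream the data  */
    return seek_ms + half_rot + transfer + ovh_ms;
}

int main(void) {
    printf("Q.3.34: %.2f ms\n", access_ms(4.5, 7200.0, 10.0, 4096.0, 1.0));
    printf("Q.3.35: %.2f ms\n", access_ms(4.0, 15000.0, 10.0, 4096.0, 1.0));
    return 0;
}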

Solution: http://targetiesnow.blogspot.in/2013/11/ex-328-334-and-335-solution-modern.html

Ex 4.8 Solution : Modern Processor Design by John Paul
Shen and Mikko H. Lipasti : Solution manual

Q.4.8: In an in-order pipelined processor, pipeline latches are used to hold result operands from
the time an execution unit computes them until they are written back to the register file during
the writeback stage. In an out-of-order processor, rename registers are used for the same purpose.
Given a four-wide out-of-order processor TYP pipeline, compute the minimum number of rename
registers needed to prevent rename register starvation from limiting concurrency. What happens to
this number if frequency demands force a designer to add five extra pipeline stages between
dispatch and execute, and five more stages between execute and retire/writeback?

Solution: http://targetiesnow.blogspot.in/2013/11/modern-processor-design-by-john-paul.html

Ex 5.1 and 5.2 Solution : Modern Processor Design by John
Paul Shen and Mikko H. Lipasti : Solution manual

Q.5.1: The displayed code that follows steps through the elements of two arrays (A[] and B[])
concurrently, and for each element, it puts the larger of the two values into the corresponding
element of a third array (C[]). The three arrays are of length N.

The instruction set used for Problems 5.1 through 5.6 is as follows:
Identify the basic blocks of this benchmark code by listing the static instructions belonging to each
basic block in the following table. Number the basic blocks based on the lexical ordering of the
code.
Note: There may be more boxes than there are basic blocks.

Q.5.2: Draw the control flow graph for this benchmark.

Solution: http://targetiesnow.blogspot.in/2013/11/modern-processor-design-by-john-paul_22.html

Ex 5.7 through 5.13 Solution : Modern Processor Design by
John Paul Shen and Mikko H. Lipasti : Solution manual

Q.5.7 through Q.5.13: Consider the following code segment within a loop body for Problems 5.7
through 5.13:

if (x is even) then                 (branch b1)
    increment a                     (b1 taken)
if (x is a multiple of 10) then     (branch b2)
    increment b                     (b2 taken)

Assume that the following list of 9 values of x is to be processed by 9 iterations of this loop:
8, 9, 10, 11, 12, 20, 29, 30, 31
Note: assume that predictor entries are updated by each dynamic branch before the next dynamic
branch accesses the predictor (i.e., there is no update delay).

Q.5.7: Assume that a one-bit (history bit) state machine (see above) is used as the prediction
algorithm for predicting the execution of the two branches in this loop. Indicate the predicted
and actual branch directions of the b1 and b2 branch instructions for each iteration of this loop.
Assume an initial state of 0, i.e., NT, for the predictor.
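
A minimal simulation sketch for Q.5.7 through Q.5.9: a one-bit predictor per branch, initialized to 0 (NT) and updated immediately after each dynamic branch. Reading "b1 taken" as "x is even" and "b2 taken" as "x is a multiple of 10", per the annotations in the loop body, is an interpretation of the problem statement.

#include <stdio.h>

int main(void) {
    int x[] = { 8, 9, 10, 11, 12, 20, 29, 30, 31 };
    int p1 = 0, p2 = 0;                 /* one history bit per branch, 0 = NT */
    int hit1 = 0, hit2 = 0;

    for (int i = 0; i < 9; i++) {
        int t1 = (x[i] % 2 == 0);       /* actual direction of b1 */
        int t2 = (x[i] % 10 == 0);      /* actual direction of b2 */
        printf("x=%2d  b1 pred %s actual %s   b2 pred %s actual %s\n", x[i],
               p1 ? "T " : "NT", t1 ? "T " : "NT",
               p2 ? "T " : "NT", t2 ? "T " : "NT");
        hit1 += (p1 == t1);  p1 = t1;   /* update after the branch resolves */
        hit2 += (p2 == t2);  p2 = t2;
    }
    printf("b1 accuracy %d/9, b2 accuracy %d/9, overall %d/18\n", hit1, hit2, hit1 + hit2);
    return 0;
}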

Q.5.8: What are the prediction accuracies for b1 and b2?
Q.5.9: What is the overall prediction accuracy?
Q.5.10: Assume a two-level branch prediction scheme is used. In addition to the one-bit predictor,
a one-bit global register (g) is used. Register g stores the direction of the last branch executed
(which may not be the same branch as the branch currently being predicted) and is used to index
into two separate one-bit branch history tables (BHTs) as shown below. Depending on the value of
g, one of the two BHTs is selected and used to do the normal one-bit prediction. Again, fill in
the predicted and actual branch directions of b1 and b2 for nine iterations of the loop. Assume
the initial value of g = 0, i.e., NT. For each prediction, depending on the current value of g,
only one of the two BHTs is accessed and updated. Hence, some of the entries below should be
empty.
Note: assume that predictor entries are updated by each dynamic branch before the next dynamic
branch accesses the predictor (i.e., there is no update delay).

Q.5.11: What are the prediction accuracies for b1 and b2?
Q.5.12: What is the overall prediction accuracy?
Q.5.13: What is the prediction accuracy of b2 when g = 0? Explain why.

Solution: http://targetiesnow.blogspot.in/2013/11/ex-57-through-513-solution-modern.html

Exercise 5.14 Solution : Modern Processor Design by John
Paul Shen and Mikko H. Lipasti : Solution manual

Q.5.14: Below is the control flow graph of a simple program. The CFG is annotated with three
different execution trace paths. For each execution trace, circle which branch predictor (bimodal,
local, or Gselect) will best predict the branching behavior of the given trace. More than one
predictor may perform equally well on a particular trace. However, you are to use each of the
three predictors exactly once in choosing the best predictors for the three traces. Circle your
choice for each of the three traces and add. (Assume each trace is executed many times and every
node in the CFG is a conditional branch. The branch history register for the local, global, and
Gselect predictors is limited to 4 bits.)
Solution: http://targetiesnow.blogspot.in/2013/11/ex-514-solution-modern-processor-design.html

Ex 5.19 and 5.20 Solution : Modern Processor Design by John
Paul Shen and Mikko H. Lipasti : Solution Manual

Q.5.19 and 5.20: Register Renaming
Given the DAXPY kernel shown in Figure 5.31 and the IBM RS/6000 (RIOS-I) floating-point load
renaming scheme also discussed in class (both are shown in the following figure), simulate the
execution of two iterations of the DAXPY loop and show the state of the floating-point map table,
the pending target return queue, and the free list.
• Assume the initial state shown in the table for Problem 5.19.
• Note the table only contains columns for the registers that are referenced in the DAXPY loop.
• As in the RS/6000 implementation discussed, assume only a single load instruction is renamed per
cycle and that only a single floating-point instruction can complete per cycle.
• Only floating-point load, multiply, and add instructions are shown in the table, since only these
are relevant to the renaming scheme.
• Remember that only load destination registers are renamed.
• The first load from the loop prologue is filled in for you.
Q.5.19: Fill in the remaining rows in the following table with the map table state and pending
target return queue state after the instruction is renamed, and the free list state after the
instruction completes.

Solution: http://targetiesnow.blogspot.in/2013/11/ex-519-and-520-solution-modern.html

Ex 5.21 and 5.22 Solution : Modern Processor Design by John
Paul Shen and Mikko H. Lipasti : Solution manual

Q.5.21: Simulate the execution of the following code snippet using Tomasulo’s algorithm.
Show the contents of the reservation station entries, register file busy, tag (the tag is the RS ID
number), and data fields for each cycle (make a copy of the table below for each cycle
that you simulate). Indicate which instruction is executing in each functional unit in each cycle.
Also indicate any result forwarding across a common
data bus by circling the producer and consumer and connecting them with an arrow.
i: R4 <- R0 + R8
j: R2 <- R0 * R4
k: R4 <- R4 + R8
l: R8 <- R4 * R2
Assume dual dispatch and a dual CDB (common data bus). Add latency is two cycles, and multiply
latency is three cycles. An instruction can begin execution in the same cycle that it is
dispatched, assuming all dependencies are satisfied.

Q.5.22: Determine whether or not the code in Problem 5.21 executes at the data flow limit.
Explain why or why not. Show your work.
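
A hedged aid for Q.5.22 (not the official solution): the true dependences in the snippet are i -> j (R4), i -> k (R4), j -> l (R2), and k -> l (R4), so with 2-cycle adds and 3-cycle multiplies the dataflow-limited critical path is i (2) + j (3) + l (3) = 8 cycles; the question is whether the Tomasulo schedule from Q.5.21, with dual dispatch and dual CDBs, actually completes l by that bound.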

Solution: http://targetiesnow.blogspot.in/2013/11/ex-521-and-522-solution-modern.html

Ex 5.23 & 5.31 Solution : Modern Processor Design by John
Paul Shen and Mikko H. Lipasti : Solution Manual

Q.5.23: As presented in this chapter, load bypassing is a technique for enhancing memory data
flow. With load bypassing, load instructions are allowed to jump ahead of earlier store
instructions. Once address generation is done, a store instruction can be completed
architecturally and can then enter the store buffer to await an available bus cycle for writing
to memory. Trailing loads are allowed to bypass these stores in the store buffer if there is no
address aliasing.
In this problem you are to simulate such load bypassing (there is no load forwarding). You are
given a sequence of load/store instructions and their addresses (symbolic). The number to the left
of each instruction indicates the cycle in which that instruction is dispatched to the reservation
station; it can begin execution in that same cycle. Each store instruction will have an additional
number to its right, indicating the cycle in which it is ready to retire, i.e., exit the store
buffer and write to the memory.
Assumptions:
• All operands needed for address calculation are available at dispatch.
• One load and one store can have their addresses calculated per cycle.
• One load OR store can be executed, i.e., allowed to access the cache, per cycle.
• The reservation station entry is deallocated the cycle after address calculation and issue.
• The store buffer entry is deallocated when the cache is accessed.
• A store instruction can access the cache the cycle after it is ready to retire.
• Instructions are issued in order from the reservation stations.
• Assume 100% cache hits.
Q.5.31: A victim cache is used to augment a direct-mapped cache to reduce conflict misses. For
additional background on this problem, read Jouppi's paper on victim caches [Jouppi, 1990]. Please
fill in the following table to reflect the state of each cache line in a 4-entry direct-mapped
cache and a 2-entry fully-associative victim cache following each memory reference shown. Also,
record whether the reference was a cache hit or a cache miss. The reference addresses are shown in
hexadecimal format. Assume the direct-mapped cache is indexed with the low-order bits above the
16-byte line offset (e.g., address 40 maps to set 0, address 50 maps to set 1, etc.). Use '' to
indicate an invalid line and the address of the line to indicate a valid line. Assume an LRU
policy for the victim cache and mark the LRU line as such in the table.

Solution: http://targetiesnow.blogspot.in/2013/12/solution-manual-modern-processor-design.html

Ex 6.3, 6.11 & 6.12 : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual

Q.6.3: Given the dispatch and retirement bandwidth specified, how many integer ARF (architected
register file) read and write ports are needed to sustain peak throughput? Given instruction mixes
in Table 5-2, also compute average ports needed for each benchmark. Explain why you would not just
build for the average case. Given the actual number of read and write ports specified, how likely
is it that dispatch will be port-limited? How likely is it that retirement will be port-limited?
Q.6.11: The IBM POWER3 can detect up to four regular access streams and issue prefetches
for future references. Construct an address reference trace that will utilize all four streams.

Q.6.12: The IBM POWER4 can detect up to eight regular access streams and issue prefetches for
future references. Construct an address reference trace that will utilize all eight streams.

Solution: http://targetiesnow.blogspot.in/2013/12/ex-63-611-modern-processor-design-by.html

Ex 7.10, 7.11, 7.12, 11.1, 11.8 & 11.10 Solution : Modern
Processor Design by John Paul Shen and Mikko H. Lipasti :
Solution Manual

Q.7.10: If the P6 microarchitecture had to support an instruction set that included predication,
what effect would that have on the register renaming process?

Q.7.11: As described in the text, the P6 microarchitecture splits store operations into a STA and
STD pair for handling address generation and data movement. Explain why this makes sense from a
microarchitectural implementation perspective.

Q.7.12: Following up on Problem 7, would there be a performance benefit (measured in instructions
per cycle) if stores were not split? Explain why or why not?

Q.11.1: Using the syntax in Figure 11-2, show how to use the load-linked/store
conditional primitives to synthesize a compare-and-swap operation.
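
The Figure 11-2 syntax is not reproduced in this post, so the following is only a C-style sketch with hypothetical load_linked() and store_conditional() primitives; it shows the usual shape of the answer (retry until the store-conditional succeeds, fail if the loaded value does not match), not the book's notation.

/* Hypothetical primitives standing in for the Figure 11-2 syntax:
 * load_linked(addr) returns *addr and sets a reservation on it;
 * store_conditional(addr, v) writes v only if the reservation still holds,
 * returning 1 on success and 0 on failure. */
int load_linked(int *addr);
int store_conditional(int *addr, int value);

/* Atomically: if (*addr == expected) { *addr = new_value; return 1; } else return 0; */
int compare_and_swap(int *addr, int expected, int new_value) {
    for (;;) {
        int old = load_linked(addr);
        if (old != expected)
            return 0;                          /* value differs: CAS fails   */
        if (store_conditional(addr, new_value))
            return 1;                          /* swap performed atomically  */
        /* reservation lost (another writer intervened): retry */
    }
}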

Q.11.8: Real coherence controllers include numerous transient states in addition to the
ones shown in Figure to support split-transaction buses. For example, when a processor issues
a bus read for an invalid line (I), the line is placed in an IS transient state until the processor
has received a valid data response that then causes the line to transition into shared state
(S). Given a split-transaction bus that separates each bus command (bus read, bus write, and
bus upgrade) into a request and response, augment the state table and state transition diagram
of Figure to incorporate all necessary transient states and bus responses. For simplicity,
assume that any bus command for a line in a transient state gets a negative acknowledge
(NAK) response that forces it to be retried after some delay.

Q.11.10: Assuming a processor frequency of 1 GHz, a target CPI of 2, a level-2 cache miss rate of
1% per instruction, a snoop-based cache-coherent system with 32 processors, and 8-byte address
messages (including command and snoop addresses), compute the inbound and outbound snoop bandwidth
required at each processor node.
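
A minimal sketch for Q.11.10, assuming every level-2 miss broadcasts one 8-byte address message that every other node must receive and snoop, and that the target CPI of 2 is actually achieved.

#include <stdio.h>

int main(void) {
    double freq = 1e9, cpi = 2.0, miss_per_instr = 0.01;
    int processors = 32, msg_bytes = 8;

    double instr_per_s  = freq / cpi;                      /* 0.5e9 instructions/s   */
    double misses_per_s = instr_per_s * miss_per_instr;    /* 5e6 misses/s per node  */
    double outbound = misses_per_s * msg_bytes;            /* this node's own misses */
    double inbound  = outbound * (processors - 1);         /* everyone else's misses */
    printf("outbound snoop bandwidth per node: %.0f MB/s\n", outbound / 1e6);
    printf("inbound  snoop bandwidth per node: %.0f MB/s\n", inbound / 1e6);
    return 0;
}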

Solution: http://targetiesnow.blogspot.in/2013/12/ex-710-711-712-111-118-1110-solution.html

More Related Content

What's hot

Serial Peripheral Interface
Serial Peripheral InterfaceSerial Peripheral Interface
Serial Peripheral InterfaceAnurag Tomar
 
I2c protocol - Inter–Integrated Circuit Communication Protocol
I2c protocol - Inter–Integrated Circuit Communication ProtocolI2c protocol - Inter–Integrated Circuit Communication Protocol
I2c protocol - Inter–Integrated Circuit Communication ProtocolAnkur Soni
 
High-Level Synthesis with GAUT
High-Level Synthesis with GAUTHigh-Level Synthesis with GAUT
High-Level Synthesis with GAUTAdaCore
 
System verilog assertions
System verilog assertionsSystem verilog assertions
System verilog assertionsHARINATH REDDY
 
I2C Bus (Inter-Integrated Circuit)
I2C Bus (Inter-Integrated Circuit)I2C Bus (Inter-Integrated Circuit)
I2C Bus (Inter-Integrated Circuit)Varun Mahajan
 
Extending the Life of your SS7 Network with SIGTRAN
Extending the Life of your SS7 Network with SIGTRANExtending the Life of your SS7 Network with SIGTRAN
Extending the Life of your SS7 Network with SIGTRANAlan Percy
 
Pdh and sdh1
Pdh and sdh1Pdh and sdh1
Pdh and sdh1Khant Oo
 
The ethernet frame a walkthrough
The ethernet frame a walkthroughThe ethernet frame a walkthrough
The ethernet frame a walkthroughMapYourTech
 
FS S5800 Series 48xGigabit SFP with 4x10GbE SFP+ Switch
FS S5800 Series 48xGigabit SFP with 4x10GbE SFP+ Switch FS S5800 Series 48xGigabit SFP with 4x10GbE SFP+ Switch
FS S5800 Series 48xGigabit SFP with 4x10GbE SFP+ Switch Katherine Wang
 
Level sensitive scan design(LSSD) and Boundry scan(BS)
Level sensitive scan design(LSSD) and Boundry scan(BS)Level sensitive scan design(LSSD) and Boundry scan(BS)
Level sensitive scan design(LSSD) and Boundry scan(BS)Praveen Kumar
 

What's hot (20)

SPI Protocol
SPI ProtocolSPI Protocol
SPI Protocol
 
Serial Peripheral Interface
Serial Peripheral InterfaceSerial Peripheral Interface
Serial Peripheral Interface
 
VLSI testing and analysis
VLSI testing and analysisVLSI testing and analysis
VLSI testing and analysis
 
I2c protocol - Inter–Integrated Circuit Communication Protocol
I2c protocol - Inter–Integrated Circuit Communication ProtocolI2c protocol - Inter–Integrated Circuit Communication Protocol
I2c protocol - Inter–Integrated Circuit Communication Protocol
 
High-Level Synthesis with GAUT
High-Level Synthesis with GAUTHigh-Level Synthesis with GAUT
High-Level Synthesis with GAUT
 
Token ring
Token ringToken ring
Token ring
 
I2C introduction
I2C introductionI2C introduction
I2C introduction
 
Vlsi design
Vlsi designVlsi design
Vlsi design
 
Traffic light control
Traffic light controlTraffic light control
Traffic light control
 
SPI Bus Protocol
SPI Bus ProtocolSPI Bus Protocol
SPI Bus Protocol
 
System verilog assertions
System verilog assertionsSystem verilog assertions
System verilog assertions
 
I2C Bus (Inter-Integrated Circuit)
I2C Bus (Inter-Integrated Circuit)I2C Bus (Inter-Integrated Circuit)
I2C Bus (Inter-Integrated Circuit)
 
Extending the Life of your SS7 Network with SIGTRAN
Extending the Life of your SS7 Network with SIGTRANExtending the Life of your SS7 Network with SIGTRAN
Extending the Life of your SS7 Network with SIGTRAN
 
Pdh and sdh1
Pdh and sdh1Pdh and sdh1
Pdh and sdh1
 
serdes
serdesserdes
serdes
 
Eigrp new
Eigrp newEigrp new
Eigrp new
 
The ethernet frame a walkthrough
The ethernet frame a walkthroughThe ethernet frame a walkthrough
The ethernet frame a walkthrough
 
cplds
cpldscplds
cplds
 
FS S5800 Series 48xGigabit SFP with 4x10GbE SFP+ Switch
FS S5800 Series 48xGigabit SFP with 4x10GbE SFP+ Switch FS S5800 Series 48xGigabit SFP with 4x10GbE SFP+ Switch
FS S5800 Series 48xGigabit SFP with 4x10GbE SFP+ Switch
 
Level sensitive scan design(LSSD) and Boundry scan(BS)
Level sensitive scan design(LSSD) and Boundry scan(BS)Level sensitive scan design(LSSD) and Boundry scan(BS)
Level sensitive scan design(LSSD) and Boundry scan(BS)
 

Viewers also liked

Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...neeraj7svp
 
Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...neeraj7svp
 
Ugc net solutions at target ies
Ugc net solutions at target iesUgc net solutions at target ies
Ugc net solutions at target iesneeraj7svp
 
Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)Cdiscount
 
Memory mapping techniques and low power memory design
Memory mapping techniques and low power memory designMemory mapping techniques and low power memory design
Memory mapping techniques and low power memory designUET Taxila
 
Parallel computing
Parallel computingParallel computing
Parallel computingvirend111
 
Cpu presentation
Cpu presentationCpu presentation
Cpu presentationHarry Singh
 
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...vtunotesbysree
 

Viewers also liked (11)

Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...
 
Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...
 
ca-ap9222-pdf
ca-ap9222-pdfca-ap9222-pdf
ca-ap9222-pdf
 
Ugc net solutions at target ies
Ugc net solutions at target iesUgc net solutions at target ies
Ugc net solutions at target ies
 
Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)
 
Memory mapping techniques and low power memory design
Memory mapping techniques and low power memory designMemory mapping techniques and low power memory design
Memory mapping techniques and low power memory design
 
Parallel computing
Parallel computingParallel computing
Parallel computing
 
Caches microP
Caches microPCaches microP
Caches microP
 
Cpu
CpuCpu
Cpu
 
Cpu presentation
Cpu presentationCpu presentation
Cpu presentation
 
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...
SOLUTION MANUAL OF OPERATING SYSTEM CONCEPTS BY ABRAHAM SILBERSCHATZ, PETER B...
 

Similar to Full solution manual for modern processor design by john paul shen and mikko h. lipasti

Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...neeraj7svp
 
Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...neeraj7svp
 
186 devlin p-poster(2)
186 devlin p-poster(2)186 devlin p-poster(2)
186 devlin p-poster(2)vaidehi87
 
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...IDES Editor
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...NECST Lab @ Politecnico di Milano
 
Integrating research and e learning in advance computer architecture
Integrating research and e learning in advance computer architectureIntegrating research and e learning in advance computer architecture
Integrating research and e learning in advance computer architectureMairaAslam3
 
Pipelining and vector processing
Pipelining and vector processingPipelining and vector processing
Pipelining and vector processingKamal Acharya
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPAnil Bohare
 
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Qualcomm Developer Network
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORVLSICS Design
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORVLSICS Design
 
Design and Analysis of A 32-bit Pipelined MIPS Risc Processor
Design and Analysis of A 32-bit Pipelined MIPS Risc ProcessorDesign and Analysis of A 32-bit Pipelined MIPS Risc Processor
Design and Analysis of A 32-bit Pipelined MIPS Risc ProcessorVLSICS Design
 
ECET 365 Success Begins/Newtonhelp.com
ECET 365 Success Begins/Newtonhelp.comECET 365 Success Begins/Newtonhelp.com
ECET 365 Success Begins/Newtonhelp.comledlang1
 
A survey of paradigms for building and
A survey of paradigms for building andA survey of paradigms for building and
A survey of paradigms for building andcseij
 
Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467IJRAT
 

Similar to Full solution manual for modern processor design by john paul shen and mikko h. lipasti (20)

Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...
 
Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...Solution manual for modern processor design by john paul shen and mikko h. li...
Solution manual for modern processor design by john paul shen and mikko h. li...
 
186 devlin p-poster(2)
186 devlin p-poster(2)186 devlin p-poster(2)
186 devlin p-poster(2)
 
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
Implementing True Zero Cycle Branching in Scalar and Superscalar Pipelined Pr...
 
Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...Pretzel: optimized Machine Learning framework for low-latency and high throug...
Pretzel: optimized Machine Learning framework for low-latency and high throug...
 
1.My Presentation.pptx
1.My Presentation.pptx1.My Presentation.pptx
1.My Presentation.pptx
 
Integrating research and e learning in advance computer architecture
Integrating research and e learning in advance computer architectureIntegrating research and e learning in advance computer architecture
Integrating research and e learning in advance computer architecture
 
Pipelining and vector processing
Pipelining and vector processingPipelining and vector processing
Pipelining and vector processing
 
Tute
TuteTute
Tute
 
Pipeline Computing by S. M. Risalat Hasan Chowdhury
Pipeline Computing by S. M. Risalat Hasan ChowdhuryPipeline Computing by S. M. Risalat Hasan Chowdhury
Pipeline Computing by S. M. Risalat Hasan Chowdhury
 
Parallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMPParallelization of Coupled Cluster Code with OpenMP
Parallelization of Coupled Cluster Code with OpenMP
 
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Env...
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
 
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSORDESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
DESIGN AND ANALYSIS OF A 32-BIT PIPELINED MIPS RISC PROCESSOR
 
Design and Analysis of A 32-bit Pipelined MIPS Risc Processor
Design and Analysis of A 32-bit Pipelined MIPS Risc ProcessorDesign and Analysis of A 32-bit Pipelined MIPS Risc Processor
Design and Analysis of A 32-bit Pipelined MIPS Risc Processor
 
ECET 365 Success Begins/Newtonhelp.com
ECET 365 Success Begins/Newtonhelp.comECET 365 Success Begins/Newtonhelp.com
ECET 365 Success Begins/Newtonhelp.com
 
J045075661
J045075661J045075661
J045075661
 
A survey of paradigms for building and
A survey of paradigms for building andA survey of paradigms for building and
A survey of paradigms for building and
 
Paper id 25201467
Paper id 25201467Paper id 25201467
Paper id 25201467
 
Matopt
MatoptMatopt
Matopt
 

Recently uploaded

“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...
“Oh GOSH! Reflecting on Hackteria's Collaborative Practices in a Global Do-It...Marc Dusseiller Dusjagr
 
Micromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersMicromeritics - Fundamental and Derived Properties of Powders
Micromeritics - Fundamental and Derived Properties of PowdersChitralekhaTherkar
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
URLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppURLs and Routing in the Odoo 17 Website App
URLs and Routing in the Odoo 17 Website AppCeline George
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 
Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)Software Engineering Methodologies (overview)
Software Engineering Methodologies (overview)eniolaolutunde
 
CARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxCARE OF CHILD IN INCUBATOR..........pptx
CARE OF CHILD IN INCUBATOR..........pptxGaneshChakor2
 
Science 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsScience 7 - LAND and SEA BREEZE and its Characteristics
Science 7 - LAND and SEA BREEZE and its CharacteristicsKarinaGenton
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
The basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxThe basics of sentences session 2pptx copy.pptx
The basics of sentences session 2pptx copy.pptxheathfieldcps1
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introductionMaksud Ahmed
 
MENTAL STATUS EXAMINATION format.docx
Full Solution Manual for Modern Processor Design by John Paul Shen and Mikko H. Lipasti

  • 2. Q.1.6: A program's run time is determined by the product of instructions per program, cycles per instruction, and clock frequency. Assume the following instruction mix for a MIPS-like RISC instruction set: 15% stores, 25% loads, 15% branches, 35% integer arithmetic, 5% integer shift, and 5% integer multiply. Given that load instructions require two cycles, branches require four cycles, integer ALU instructions require one cycle, and integer multiplies require ten cycles, compute the overall CPI.
Q.1.7: Given the parameters of Problem 6, consider a strength-reducing optimization that converts multiplies by a compile-time constant into a sequence of shifts and adds. For this instruction mix, 50% of the multiplies can be converted to shift-add sequences with an average length of three instructions. Assuming a fixed frequency, compute the change in instructions per program, cycles per instruction, and overall program speedup.
Solution: http://targetiesnow.blogspot.in/2013/11/modern-processor-design-byjohn-paul_9765.html#links
Ex 1.8, 1.9 and 1.10 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti
Q.1.8: Recent processors like the Pentium 4 do not implement single-cycle shifts. Given the scenario of Problem 7, assume that s = 50% of the additional integer and shift instructions introduced by strength reduction are shifts, and that shifts now take four cycles to execute. Recompute the cycles per instruction and overall program speedup. Is strength reduction still a good optimization?
Q.1.9: Given the assumptions of Problem 8, solve for the break-even ratio s (percentage of additional instructions that are shifts). That is, find the value of s (if any) for which program performance is identical to the baseline case without strength reduction (Problem 6).
Q.1.10: Given the assumptions of Problem 8, assume you are designing the shift unit on the Pentium 4 processor. You have concluded there are two possible implementation options for the shift unit: 4-cycle shift latency at a frequency of 2 GHz, or 2-cycle shift latency at 1.9 GHz. Assume the rest of the pipeline could run at 2 GHz, and hence the 2-cycle shifter would set the entire processor's frequency to 1.9 GHz. Which option will provide better overall performance?
Solution: http://targetiesnow.blogspot.in/2013/11/modern-processor-design-by-john-paul_13.html
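As a quick cross-check for Q.1.6 through Q.1.8, the short Python sketch below plugs in the instruction mix and latencies from the problem statements. The structure and variable names are mine, and it assumes stores take one cycle (the problem does not state a store latency); treat it as a sanity check rather than the official solution.

```python
# Sketch for Q.1.6-Q.1.8: baseline CPI and strength-reduction speedup.
# Mix fractions and latencies come from the problem text; store latency of
# one cycle is an assumption, as is the overall structure of the script.

mix = {"store": 0.15, "load": 0.25, "branch": 0.15,
       "alu": 0.35, "shift": 0.05, "mul": 0.05}
cycles = {"store": 1, "load": 2, "branch": 4, "alu": 1, "shift": 1, "mul": 10}

base_cpi = sum(mix[i] * cycles[i] for i in mix)

# Q.1.7: 50% of multiplies become 3-instruction shift/add sequences.
converted = 0.5 * mix["mul"]          # fraction of original instructions converted
extra = converted * 3 - converted     # net new instructions per original instruction
new_count = 1.0 + extra               # relative instruction count

# Removed multiplies cost 10 cycles each; the replacement shift/add
# instructions cost one cycle each (single-cycle shifts).
new_cycles = base_cpi - converted * cycles["mul"] + converted * 3 * 1
new_cpi = new_cycles / new_count
speedup = base_cpi / new_cycles       # fixed frequency, so time tracks total cycles

print(f"baseline CPI = {base_cpi:.3f}")
print(f"after strength reduction: IC x {new_count:.3f}, CPI = {new_cpi:.3f}, "
      f"speedup = {speedup:.3f}")

# Q.1.8 variant: a fraction s of the replacement instructions are shifts
# that now take four cycles instead of one.
s = 0.5
new_cycles_q8 = (base_cpi - converted * cycles["mul"]
                 + converted * 3 * (s * 4 + (1 - s) * 1))
print(f"Q.1.8 speedup = {base_cpi / new_cycles_q8:.3f}")
```

Running it gives a baseline CPI of 2.15 and shows how the speedup shrinks once shifts cost four cycles, which is the comparison Q.1.8 asks you to make.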
  • 3. Q.2.4: Consider that you would like to add a load-immediate instruction to the TYP instruction set and pipeline. This instruction extracts a 16-bit immediate value from the instruction word, sign-extends the immediate value to 32 bits, and stores the result in the destination register specified in the instruction word. Since the extraction and sign-extension can be accomplished without the ALU, your colleague suggests that such instructions be able to write their results into the register file in the decode (ID) stage. Using the hazard detection algorithm described in Figure 2-15, identify what additional hazards such a change might introduce.
Q.2.5: Ignoring pipeline interlock hardware (discussed in Problem 6), what additional pipeline resources does the change outlined in Problem 4 require? Discuss these resources and their cost.
Q.2.6: Considering the change outlined in Problem 4, redraw the pipeline interlock hardware shown in Figure 2-18 to correctly handle the load-immediate instructions.
Solution: http://targetiesnow.blogspot.in/2013/11/ex-24-25-26-solution-modern-processor.html
Ex 2.8, 2.9 & 2.15 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti
Q.2.8: Consider adding a store instruction with indexed addressing mode to the TYP pipeline. This store differs from the existing store with register+immediate addressing mode by computing its effective address as the sum of two source registers; that is, stx r3, r4, r5 performs MEM[r4+r5] <- r3. Describe the additional pipeline resources needed to support such an instruction in the TYP pipeline. Discuss the advantages and disadvantages of such an instruction.
  • 4. Q.2.9: Consider adding a load-update instruction with register+immediate and post-update addressing mode. In this addressing mode, the effective address for the load is computed as register+immediate, and the resulting address is written back into the base register. That is, lwu r3,8(r4) performs r3 <- MEM[r4+8]; r4 <- r4+8. Describe the additional pipeline resources needed to support such an instruction in the TYP pipeline.
Q.2.15: The IBM study of pipelined processor performance assumed an instruction mix based on popular C programs in use in the 1980s. Since then, object-oriented languages like C++ and Java have become much more common. One of the effects of these languages is that object inheritance and polymorphism can be used to replace conditional branches with virtual function calls. Given the IBM instruction mix and CPI shown in the following table, perform the following transformations to reflect the use of C++/Java, and recompute the overall CPI and speedup or slowdown due to this change (see the illustrative sketch after this slide):
• Replace 50% of taken conditional branches with a load instruction followed by a jump register instruction (the load and jump register implement a virtual function call).
• Replace 25% of not-taken branches with a load instruction followed by a jump register instruction.
Solution: http://targetiesnow.blogspot.in/2013/11/ex-28-29-215-solution-modernprocessor.html
Q.2.16: In a TYP-based pipeline design with a data cache, load instructions check the tag array for a cache hit in parallel with accessing the data array to read the corresponding memory location. Pipelining stores to such a cache is more difficult, since the processor must check the tag first, before it overwrites the data array. Otherwise, in the case of a cache miss, the wrong memory location may be overwritten by the store. Design a solution to this problem that does not require sending the store down the pipe twice, or stalling the pipe for every store instruction. Referring to Figure 2-15, are there any new RAW, WAR, and/or WAW memory hazards?
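The IBM instruction mix and CPI table referenced in Q.2.15 did not survive the page scrape, so the sketch below uses placeholder numbers purely to illustrate the bookkeeping: each replaced branch becomes one load plus one jump-register instruction, which changes both the instruction count and the per-instruction CPI. Substitute the values from the book's table before drawing any conclusions; the jump-register latency used here is also an assumption.

```python
# Illustrative sketch for Q.2.15. The mix/CPI numbers below are PLACEHOLDERS,
# not the IBM values from the book; only the transformation logic matters.

mix = {
    "load":        {"freq": 0.25, "cpi": 2.0},
    "store":       {"freq": 0.15, "cpi": 1.0},
    "arith":       {"freq": 0.40, "cpi": 1.0},
    "br_taken":    {"freq": 0.12, "cpi": 3.0},
    "br_nottaken": {"freq": 0.08, "cpi": 1.0},
    "jump_reg":    {"freq": 0.00, "cpi": 3.0},   # assumed jr latency
}

def overall_cpi(m):
    total = sum(v["freq"] for v in m.values())
    return sum(v["freq"] * v["cpi"] for v in m.values()) / total

base_cpi = overall_cpi(mix)

# Transformation: each replaced branch becomes one load plus one jump register,
# so the total instruction count grows by one per replaced branch.
for kind, frac in (("br_taken", 0.50), ("br_nottaken", 0.25)):
    moved = mix[kind]["freq"] * frac
    mix[kind]["freq"] -= moved
    mix["load"]["freq"] += moved
    mix["jump_reg"]["freq"] += moved

new_cpi = overall_cpi(mix)                        # CPI of the transformed mix
rel_instr = sum(v["freq"] for v in mix.values())  # relative instruction count (old = 1.0)
rel_time = rel_instr * new_cpi / base_cpi         # fixed frequency
print(f"CPI: {base_cpi:.3f} -> {new_cpi:.3f}, relative run time = {rel_time:.3f}")
```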
  • 5. Q.2.17: The MIPS pipeline shown in Table 2-7 employs a two-phase clocking scheme that makes efficient use of a shared TLB, since instruction fetch accesses the TLB in phase one and data fetch accesses it in phase two. However, when resolving a conditional branch, both the branch target address and the branch fall-through address need to be translated during phase one, in parallel with the branch condition check in phase one of the ALU stage, to enable instruction fetch from either the target or the fall-through during phase two. This seems to imply a dual-ported TLB. Suggest an architected solution to this problem that avoids dual-porting the TLB.
Solution: http://targetiesnow.blogspot.in/2013/11/ex-216-217-solution-modernprocessor.html
Q.3.1: Given the following benchmark code, and assuming a virtually-addressed fully-associative cache with infinite capacity and 64-byte blocks, compute the overall miss rate (number of misses divided by number of references). Assume that all variables except array locations reside in registers, and that arrays A, B, and C are placed consecutively in memory.
    double A[1024], B[1024], C[1024];
    for (int i = 0; i < 1000; i += 2) {
        A[i] = 35.0 * B[i] + C[i+1];
    }
Q.3.3: Given the example code in Problem 1, and assuming a virtually-addressed two-way set-associative cache of capacity 8KB and 64-byte blocks, compute the overall miss rate (number of misses divided by number of references). Assume that all variables except array locations reside in registers, and that arrays A, B, and C are placed consecutively in memory.
Solution: http://targetiesnow.blogspot.in/2013/11/ex-31-33-solution-modernprocessor.html
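For Q.3.1 and Q.3.3, a tiny simulation is an easy way to check a hand-derived miss rate. The sketch below assumes 8-byte doubles, the three arrays laid out back to back starting at address 0, and a reference order of B[i], C[i+1], A[i] within each iteration; those details are not spelled out in the problem, so adjust them if your reading differs.

```python
# Sketch for Q.3.1/Q.3.3: count misses for the benchmark loop's reference stream.

BLOCK = 64
base_A, base_B, base_C = 0, 1024 * 8, 2 * 1024 * 8   # arrays placed consecutively

def references():
    for i in range(0, 1000, 2):
        yield base_B + 8 * i        # read  B[i]
        yield base_C + 8 * (i + 1)  # read  C[i+1]
        yield base_A + 8 * i        # write A[i]

def miss_rate_fully_assoc_infinite():
    seen, misses, refs = set(), 0, 0
    for addr in references():
        refs += 1
        blk = addr // BLOCK
        if blk not in seen:          # infinite capacity: only cold misses
            misses += 1
            seen.add(blk)
    return misses / refs

def miss_rate_two_way_8kb():
    sets = (8 * 1024) // (BLOCK * 2)          # 64 sets
    cache = [[] for _ in range(sets)]         # each set: tags in LRU order
    misses, refs = 0, 0
    for addr in references():
        refs += 1
        blk = addr // BLOCK
        idx, tag = blk % sets, blk // sets
        ways = cache[idx]
        if tag in ways:
            ways.remove(tag)                  # hit: refresh to MRU position
        else:
            misses += 1
            if len(ways) == 2:
                ways.pop(0)                   # evict LRU way
        ways.append(tag)
    return misses / refs

print("Q3.1 miss rate:", miss_rate_fully_assoc_infinite())
print("Q3.3 miss rate:", miss_rate_two_way_8kb())
```

The interesting contrast is that the infinite fully-associative cache sees only cold misses, while in the 8KB two-way cache the three arrays land in the same sets and thrash each other.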
  • 6. Q.3.4: Consider a cache with 256 bytes. Word size is 4 bytes and block size is 16 bytes. Show the values in the cache and the tag bits after each of the following memory access operations for the two cache organizations, direct-mapped and 2-way set-associative. Also indicate whether each access was a hit or a miss. Justify. The addresses are in hexadecimal representation. Use the LRU (least recently used) replacement algorithm wherever needed.
1. Read 0010
2. Read 001C
3. Read 0018
4. Write 0010
5. Read 0484
6. Read 051C
7. Read 001C
8. Read 0210
9. Read 051C
Solution: http://targetiesnow.blogspot.in/2013/11/modern-processor-design-byjohn-paul_1.html
Ex. 3.13 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.3.13: Consider a processor with 32-bit virtual addresses, 4KB pages, and 36-bit physical addresses. Assume memory is byte-addressable (i.e., the 32-bit VA specifies a byte in memory). L1 instruction cache: 64 Kbytes, 128-byte blocks, 4-way set-associative, indexed and tagged with the virtual address. L1 data cache: 32 Kbytes, 64-byte blocks, 2-way set-associative, indexed and tagged with the physical address, write-back. 4-way set-associative TLB with 128 entries in all. Assume the TLB keeps a dirty bit, a reference bit, and 3 permission bits (read, write, execute) for each entry. Specify the number of offset, index, and tag bits for each of these structures in the table below. Also compute the total size in number of bit cells for each of the tag and data arrays.
Solution: http://targetiesnow.blogspot.in/2013/11/ex-313-solution-modernprocessor-design.html
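The field widths asked for in Q.3.13 fall straight out of the stated sizes, and a few lines of arithmetic make the derivation explicit. In this sketch the choice of virtual versus physical address for each cache follows the problem statement; indexing the TLB with the virtual page number is an assumption, since the problem only gives its entry count and associativity. The total tag/data array sizes then follow by multiplying entry counts by the per-entry widths.

```python
# Sketch for Q.3.13: offset/index/tag widths from the stated parameters.
from math import log2

VA_BITS, PA_BITS, PAGE = 32, 36, 4096

def cache_fields(size_bytes, block_bytes, ways, addr_bits):
    offset = int(log2(block_bytes))
    sets = size_bytes // (block_bytes * ways)
    index = int(log2(sets))
    tag = addr_bits - index - offset
    return offset, index, tag

# L1 I-cache: 64 KB, 128 B blocks, 4-way, virtually indexed and tagged.
print("L1I (VA) offset/index/tag:", cache_fields(64 * 1024, 128, 4, VA_BITS))
# L1 D-cache: 32 KB, 64 B blocks, 2-way, physically indexed and tagged.
print("L1D (PA) offset/index/tag:", cache_fields(32 * 1024, 64, 2, PA_BITS))

# TLB: 128 entries, 4-way set-associative, indexed with the virtual page
# number (assumption).
page_offset = int(log2(PAGE))            # 12 bits, untranslated
vpn_bits = VA_BITS - page_offset         # 20-bit virtual page number
tlb_sets = 128 // 4                      # 32 sets
tlb_index = int(log2(tlb_sets))
tlb_tag = vpn_bits - tlb_index
ppn_bits = PA_BITS - page_offset         # 24-bit physical page number
tlb_data = ppn_bits + 1 + 1 + 3          # PPN + dirty + reference + 3 permission bits
print("TLB: page offset", page_offset, "index", tlb_index,
      "tag", tlb_tag, "data bits/entry", tlb_data)
```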
  • 7. Q.3.16: Assume a two-level cache hierarchy with a private level-one instruction cache (L1I), a private level-one data cache (L1D), and a shared level-two data cache (L2). Given local miss rates of 4% for the L1I, 7.5% for the L1D, and 35% for the L2, compute the global miss rate for the L2 cache.
Q.3.17: Assuming 1 L1I access per instruction and 0.4 data accesses per instruction, compute the misses per instruction for the L1I, L1D, and L2 caches of Problem 16.
Q.3.18: Given the miss rates of Problem 16, and assuming that accesses to the L1I and L1D caches take one cycle, accesses to the L2 take 12 cycles, accesses to main memory take 75 cycles, and a clock rate of 1 GHz, compute the average memory reference latency for this cache hierarchy.
Q.3.19: Assuming a perfect-cache CPI (cycles per instruction) for a pipelined processor equal to 1.15, compute the MCPI and overall CPI for a pipelined processor with the memory hierarchy described in Problem 18 and the miss rates and access rates specified in Problem 16 and Problem 17.
Solution: http://targetiesnow.blogspot.in/2013/11/ex-316-317-318-and-319solution-modern.html
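Questions 3.16 through 3.19 chain together, so a small script helps keep the rates straight. The sketch below is only one reasonable composition: it assumes every L1 miss pays the L2 access time, every L2 miss additionally pays the main-memory time, and the L2's local miss rate applies uniformly to all L1 misses.

```python
# Sketch for Q.3.16-Q.3.19, using the rates and latencies from the problems.

l1i_local, l1d_local, l2_local = 0.04, 0.075, 0.35
l1i_apI, l1d_apI = 1.0, 0.4            # accesses per instruction

# Q.3.16: global L2 miss rate = fraction of all references missing in L1 and L2.
refs_per_instr = l1i_apI + l1d_apI
l1_misses_per_instr = l1i_apI * l1i_local + l1d_apI * l1d_local   # = L2 accesses
l2_misses_per_instr = l1_misses_per_instr * l2_local
l2_global_miss_rate = l2_misses_per_instr / refs_per_instr

# Q.3.17: misses per instruction at each level.
mpi = {"L1I": l1i_apI * l1i_local,
       "L1D": l1d_apI * l1d_local,
       "L2":  l2_misses_per_instr}

# Q.3.18: average memory reference latency (1 GHz, so 1 cycle = 1 ns).
l1_hit, l2_hit, mem = 1, 12, 75
l1_miss_rate = l1_misses_per_instr / refs_per_instr
amat = l1_hit + l1_miss_rate * (l2_hit + l2_local * mem)

# Q.3.19: memory CPI on top of a perfect-cache CPI of 1.15.
mcpi = l1_misses_per_instr * l2_hit + l2_misses_per_instr * mem
cpi = 1.15 + mcpi

print(f"L2 global miss rate = {l2_global_miss_rate:.4f}")
print("misses/instruction:", {k: round(v, 4) for k, v in mpi.items()})
print(f"average memory latency = {amat:.3f} cycles, MCPI = {mcpi:.3f}, CPI = {cpi:.3f}")
```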
  • 8. Q.3.28: Assume a synchronous front-side processor-memory bus that operates at 100 MHz and has an 8-byte data bus. Arbitration for the bus takes one bus cycle (10 ns), issuing a cache line read command for 64 bytes of data takes one cycle, memory controller latency (including DRAM access) is 60 ns, after which data double-words are returned in back-to-back cycles. Further assume the bus is blocking or circuit-switched. Compute the latency to fill a single 64-byte cache line. Then compute the peak read bandwidth for this processor-memory bus, assuming the processor arbitrates for the bus for a new read in the bus cycle following completion of the last read.
Q.3.31: Consider finite DRAM bandwidth at a memory controller, as follows. Assume double-data-rate DRAM operating at 100 MHz in a parallel non-interleaved organization, with an 8-byte interface to the DRAM chips. Further assume that each cache line read results in a DRAM row miss, requiring a precharge and RAS cycle, followed by row-hit CAS cycles for each of the double-words in the cache line. Assuming memory controller overhead of one cycle (10 ns) to initiate a read operation, and one cycle latency to transfer data from the DRAM data bus to the processor-memory bus, compute the latency for reading one 64-byte cache block. Now compute the peak data bandwidth for the memory interface, ignoring DRAM refresh cycles.
Solution: http://targetiesnow.blogspot.in/2013/11/ex-328-and-331-modernprocessor-design.html
Q.3.34: Assume a single-platter disk drive with an average seek time of 4.5 ms, rotation speed of 7200 rpm, data transfer rate of 10 Mbytes/s per head, and controller overhead and queueing of 1 ms. What is the average access latency for a 4096-byte read?
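A back-of-the-envelope script for Q.3.28 and Q.3.34 is shown below. It assumes the "100 Hz" in the scraped text means 100 MHz (consistent with the 10 ns bus cycle given in the same sentence) and that the average rotational delay is half a revolution; both are assumptions on my part, not statements from the book.

```python
# Sketch for Q.3.28 (bus fill latency / peak bandwidth) and Q.3.34 (disk latency).

# --- Q.3.28: 100 MHz bus (10 ns cycle), 8-byte data bus, 64-byte line.
cycle_ns = 10
arbitration_ns = 1 * cycle_ns
command_ns = 1 * cycle_ns
controller_ns = 60                     # memory controller latency, incl. DRAM
data_beats = 64 // 8                   # eight back-to-back 8-byte transfers
data_ns = data_beats * cycle_ns
line_fill_ns = arbitration_ns + command_ns + controller_ns + data_ns
# Peak bandwidth: the next read arbitrates in the cycle after the previous one
# completes, so one 64-byte line is delivered every line_fill_ns nanoseconds.
peak_bw_MBps = 64 / (line_fill_ns * 1e-9) / 1e6
print(f"Q3.28: line fill = {line_fill_ns} ns, peak read bandwidth = {peak_bw_MBps:.0f} MB/s")

# --- Q.3.34: 4.5 ms average seek, 7200 rpm, 10 MB/s transfer, 1 ms overhead.
seek_ms = 4.5
rot_ms = 0.5 * (60_000 / 7200)         # average rotational latency: half a revolution
xfer_ms = 4096 / 10e6 * 1e3            # 4096 bytes at 10 MB/s
overhead_ms = 1.0
print(f"Q3.34: average access latency = {seek_ms + rot_ms + xfer_ms + overhead_ms:.3f} ms")
```

The same skeleton answers Q.3.35 by substituting the 15K-rpm rotation speed and 4.0 ms seek time.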
  • 9. Q.3.35: Recompute the average access latency for Problem 34 assuming a rotation speed of 15 K rpm, two platters, and an average seek time of 4.0 ms.
Solution: http://targetiesnow.blogspot.in/2013/11/ex-328-334-and-335-solution-modern.html
Ex 4.8 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.4.8: In an in-order pipelined processor, pipeline latches are used to hold result operands from the time an execution unit computes them until they are written back to the register file during the writeback stage. In an out-of-order processor, rename registers are used for the same purpose. Given a four-wide out-of-order TYP pipeline, compute the minimum number of rename registers needed to prevent rename-register starvation from limiting concurrency. What happens to this number if frequency demands force a designer to add five extra pipeline stages between dispatch and execute, and five more stages between execute and retire/writeback?
Solution: http://targetiesnow.blogspot.in/2013/11/modern-processor-design-by-john-paul.html
Ex 5.1 and 5.2 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.5.1: The displayed code that follows steps through the elements of two arrays (A[] and B[]) concurrently, and for each element, it puts the larger of the two values into the corresponding element of a third array (C[]). The three arrays are of length N. The instruction set used for Problems 5.1 through 5.6 is as follows:
  • 10. Identify the basic blocks of this benchmark code by listing the static instructions belonging to each basic block in the following table. Number the basic blocks based on the lexical ordering of the code. Note: there may be more boxes than there are basic blocks.
Q.5.2: Draw the control flow graph for this benchmark.
Solution: http://targetiesnow.blogspot.in/2013/11/modern-processor-design-by-john-paul_22.html
  • 11. Ex 5.7 through 5.13 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.5.7 through Q.5.13: Consider the following code segment within a loop body:
    if (x is even) then               (branch b1)
        increment a                   (b1 taken)
    if (x is a multiple of 10) then   (branch b2)
        increment b                   (b2 taken)
Assume that the following list of 9 values of x is to be processed by 9 iterations of this loop: 8, 9, 10, 11, 12, 20, 29, 30, 31.
Note: assume that predictor entries are updated by each dynamic branch before the next dynamic branch accesses the predictor (i.e., there is no update delay).
Q.5.7: Assume that a one-bit (history bit) state machine (see above) is used as the prediction algorithm for predicting the execution of the two branches in this loop. Indicate the predicted and actual branch directions of the b1 and b2 branch instructions for each iteration of this loop. Assume an initial state of 0, i.e., NT, for the predictor.
Q.5.8: What are the prediction accuracies for b1 and b2?
Q.5.9: What is the overall prediction accuracy?
Q.5.10: Assume a two-level branch prediction scheme is used. In addition to the one-bit predictor, a one-bit global register (g) is used. Register g stores the direction of the last branch executed (which may not be the same branch as the branch currently being predicted) and is used to index into two separate one-bit branch history tables (BHTs) as shown below. Depending on the value of g, one of the two BHTs is selected and used to do the normal one-bit prediction. Again, fill in the predicted and actual branch directions of b1 and b2 for nine iterations of the loop. Assume the initial value of g = 0, i.e., NT. For each prediction, depending on the current value of g, only one of the two BHTs is accessed and updated. Hence, some of the entries below should be empty. Note: assume that predictor entries are updated by each dynamic branch before the next dynamic branch accesses the predictor (i.e., there is no update delay).
Q.5.11: What are the prediction accuracies for b1 and b2?
  • 12. Q.5.12: What is the overall prediction accuracy?
Q.5.13: What is the prediction accuracy of b2 when g = 0? Explain why.
Solution: http://targetiesnow.blogspot.in/2013/11/ex-57-through-513-solution-modern.html
Exercise 5.14 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.5.14: Below is the control flow graph of a simple program. The CFG is annotated with three different execution trace paths. For each execution trace, circle which branch predictor (bimodal, local, or Gselect) will best predict the branching behavior of the given trace. More than one predictor may perform equally well on a particular trace. However, you are to use each of the three predictors exactly once in choosing the best predictors for the three traces. Circle your choice for each of the three traces and add. (Assume each trace is executed many times and every node in the CFG is a conditional branch. The branch history register for the local, global, and Gselect predictors is limited to 4 bits.)
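For Q.5.7 through Q.5.9 it is easy to generate the per-iteration prediction table mechanically. The sketch below assumes, following the annotations in the code segment two slides up, that a branch counts as "taken" exactly when its condition is true (b1 taken when x is even, b2 taken when x is a multiple of 10), that each branch has its own one-bit entry, and that both entries start at NT; if your reading of the taken convention differs, flip the conditions.

```python
# Sketch for Q.5.7-Q.5.9: one-bit (last-outcome) predictors for b1 and b2.
# Assumption: "taken" means the branch condition is true, and each branch has
# its own history bit, initially NT (False).

xs = [8, 9, 10, 11, 12, 20, 29, 30, 31]
state = {"b1": False, "b2": False}          # False = NT, True = T
correct = {"b1": 0, "b2": 0}

print("iter   x    b1 pred/actual   b2 pred/actual")
for it, x in enumerate(xs, 1):
    actual = {"b1": x % 2 == 0, "b2": x % 10 == 0}
    row = [f"{it:>4}  {x:>3}"]
    for b in ("b1", "b2"):
        pred = state[b]
        correct[b] += (pred == actual[b])
        state[b] = actual[b]                # one-bit predictor: remember last outcome
        row.append(f"{'T' if pred else 'N'}/{'T' if actual[b] else 'N'}")
    print("            ".join(row))

for b in ("b1", "b2"):
    print(f"{b} accuracy = {correct[b]}/{len(xs)}")
print(f"overall accuracy = {correct['b1'] + correct['b2']}/{2 * len(xs)}")
```

Extending the same loop with a one-bit global register selecting between two BHTs reproduces the two-level table asked for in Q.5.10.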
  • 13. Solution: http://targetiesnow.blogspot.in/2013/11/ex-514-solution-modern-processor-design.html
Ex 5.19 and 5.20 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.5.19 and Q.5.20: Register renaming. Given the DAXPY kernel shown in Figure 5.31 and the IBM RS/6000 (RIOS-I) floating-point load renaming scheme also discussed in class (both are shown in the following figure), simulate the execution of two iterations of the DAXPY loop and show the state of the floating-point map table, the pending target return queue, and the free list.
• Assume the initial state shown in the table for Problem 5.19.
• Note that the table only contains columns for the registers that are referenced in the DAXPY loop.
• As in the RS/6000 implementation discussed, assume only a single load instruction is renamed per cycle and that only a single floating-point instruction can complete per cycle.
• Only floating-point load, multiply, and add instructions are shown in the table, since only these are relevant to the renaming scheme.
• Remember that only load destination registers are renamed.
• The first load from the loop prologue is filled in for you.
  • 14. Q.5.19: Fill in the remaining rows in the following table with the map table state and pending target return queue state after each instruction is renamed, and the free list state after the instruction completes.
Solution: http://targetiesnow.blogspot.in/2013/11/ex-519-and-520-solution-modern.html
  • 15. Ex 5.21 and 5.22 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.5.21: Simulate the execution of the following code snippet using Tomasulo's algorithm. Show the contents of the reservation station entries, register file busy, tag (the tag is the RS ID number), and data fields for each cycle (make a copy of the table below for each cycle that you simulate). Indicate which instruction is executing in each functional unit in each cycle. Also indicate any result forwarding across a common data bus by circling the producer and consumer and connecting them with an arrow.
    i: R4 <- R0 + R8
    j: R2 <- R0 * R4
    k: R4 <- R4 + R8
    l: R8 <- R4 * R2
Assume dual dispatch and dual CDBs (common data buses). Add latency is two cycles, and multiply latency is three cycles. An instruction can begin execution in the same cycle that it is dispatched, assuming all dependencies are satisfied.
Q.5.22: Determine whether or not the code executes at the data flow limit for Problem 1. Explain why or why not. Show your work. (A short dataflow-limit sketch follows the load-bypassing assumptions below.)
Solution: http://targetiesnow.blogspot.in/2013/11/ex-521-and-522-solution-modern.html
Ex 5.23 & 5.31 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.5.23: As presented in this chapter, load bypassing is a technique for enhancing memory data flow. With load bypassing, load instructions are allowed to jump ahead of earlier store instructions. Once address generation is done, a store instruction can be completed architecturally and can then enter the store buffer to await an available bus cycle for writing to memory. Trailing loads are allowed to bypass these stores in the store buffer if there is no address aliasing.
  • 16. In this problem you are to simulate such load bypassing (there is no load forwarding). You are given a sequence of load/store instructions and their addresses (symbolic). The number to the left of each instruction indicates the cycle in which that instruction is dispatched to the reservation station; it can begin execution in that same cycle. Each store instruction will have an additional number to its right, indicating the cycle in which it is ready to retire, i.e., exit the store buffer and write to the memory. Assumptions:
• All operands needed for address calculation are available at dispatch.
• One load and one store can have their addresses calculated per cycle.
• One load OR store can be executed, i.e., allowed to access the cache, per cycle.
• The reservation station entry is deallocated the cycle after address calculation and issue.
• The store buffer entry is deallocated when the cache is accessed.
• A store instruction can access the cache the cycle after it is ready to retire.
• Instructions are issued in order from the reservation stations.
• Assume 100% cache hits.
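Returning to Q.5.21/Q.5.22: the dataflow limit depends only on the true (RAW) dependences and the functional-unit latencies (add = 2, multiply = 3), so a minimal sketch can compute the earliest finish time of each instruction under unlimited resources and compare it against your Tomasulo schedule. The instruction and register names come from the snippet in Q.5.21; letting a dependent start the cycle after its producer finishes, and counting cycle 1 as the first possible execution cycle, are my assumptions.

```python
# Sketch for Q.5.22: earliest completion times implied by RAW dependences alone
# (unlimited reservation stations, FUs, and CDBs), i.e., the dataflow limit.

ADD_LAT, MUL_LAT = 2, 3

# (name, destination, sources, latency) for the Q.5.21 snippet.
program = [
    ("i", "R4", ("R0", "R8"), ADD_LAT),   # i: R4 <- R0 + R8
    ("j", "R2", ("R0", "R4"), MUL_LAT),   # j: R2 <- R0 * R4
    ("k", "R4", ("R4", "R8"), ADD_LAT),   # k: R4 <- R4 + R8
    ("l", "R8", ("R4", "R2"), MUL_LAT),   # l: R8 <- R4 * R2
]

ready = {}    # register -> cycle in which its newest value becomes available
finish = {}
for name, dst, srcs, lat in program:
    # Assumption: a dependent instruction starts the cycle after its producer
    # finishes; registers never written (here R0, and R8 initially) are ready
    # from cycle 1.
    start = max([1] + [ready.get(s, 0) + 1 for s in srcs])
    finish[name] = start + lat - 1
    ready[dst] = finish[name]             # later readers see this (renamed) value

print("earliest finish cycles:", finish)
print("dataflow-limit length:", max(finish.values()), "cycles")
```

If your cycle-by-cycle Tomasulo table finishes in the same number of cycles, the code runs at the dataflow limit; any extra cycles must come from structural limits such as dispatch width or CDB contention.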
  • 17. Q.5.31: A victim cache is used to augment a direct-mapped cache to reduce conflict misses. For additional background on this problem, read Jouppi's paper on victim caches [Jouppi, 1990]. Please fill in the following table to reflect the state of each cache line in a 4-entry direct-mapped cache and a 2-entry fully associative victim cache following each memory reference shown. Also record whether the reference was a cache hit or a cache miss. The reference addresses are shown in hexadecimal format. Assume the direct-mapped cache is indexed with the low-order bits above the 16-byte line offset (e.g., address 40 maps to set 0, address 50 maps to set 1, etc.). Use a blank entry to indicate an invalid line and the address of the line to indicate a valid line. Assume an LRU policy for the victim cache and mark the LRU line as such in the table.
Solution: http://targetiesnow.blogspot.in/2013/12/solution-manual-modern-processor-design.html
Ex 6.3, 6.11 & 6.12 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.6.3: Given the dispatch and retirement bandwidth specified, how many integer ARF (architected register file) read and write ports are needed to sustain peak throughput? Given the instruction mixes in Table 5-2, also compute the average number of ports needed for each benchmark. Explain why you would not just build for the average case. Given the actual number of read and write ports specified, how likely is it that dispatch will be port-limited? How likely is it that retirement will be port-limited?
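The reference trace for Q.5.31 is in a table that did not survive the page scrape, so the sketch below only wires up the organization the problem describes: a 4-entry direct-mapped cache with 16-byte lines backed by a 2-entry fully associative, LRU-replaced victim cache. The trace at the bottom is a made-up example, not the one from the exercise; feed in the exercise's addresses to fill the table.

```python
# Sketch for Q.5.31: direct-mapped cache backed by a 2-entry LRU victim cache.
# The example trace below is hypothetical; substitute the exercise's trace.

LINE = 16
SETS = 4

dm = [None] * SETS        # per-set resident line address (None = invalid)
victim = []               # line addresses, index 0 = LRU, last = MRU

def access(addr):
    line = addr // LINE * LINE
    idx = (addr // LINE) % SETS
    if dm[idx] == line:
        return "hit (DM)"
    if line in victim:
        # Victim hit: swap the victim line with the line in the DM set.
        victim.remove(line)
        if dm[idx] is not None:
            victim.append(dm[idx])
        dm[idx] = line
        return "hit (victim)"
    # Miss: evicted DM line (if any) moves into the victim cache, LRU replaced.
    if dm[idx] is not None:
        if len(victim) == 2:
            victim.pop(0)
        victim.append(dm[idx])
    dm[idx] = line
    return "miss"

for a in (0x40, 0x50, 0x80, 0x40, 0xC0, 0x80):   # hypothetical trace
    print(f"{a:#05x}: {access(a):12s} DM={dm} victim={victim}")
```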
  • 18. Q.6.11: The IBM POWER3 can detect up to four regular access streams and issue prefetches for future references. Construct an address reference trace that will utilize all four streams.
Q.6.12: The IBM POWER4 can detect up to eight regular access streams and issue prefetches for future references. Construct an address reference trace that will utilize all eight streams.
Solution: http://targetiesnow.blogspot.in/2013/12/ex-63-611-modern-processor-design-by.html
Ex 7.10, 7.11, 7.12, 11.1, 11.8 & 11.10 Solution : Modern Processor Design by John Paul Shen and Mikko H. Lipasti : Solution Manual
Q.7.10: If the P6 microarchitecture had to support an instruction set that included predication, what effect would that have on the register renaming process?
Q.7.11: As described in the text, the P6 microarchitecture splits store operations into a STA and STD pair for handling address generation and data movement. Explain why this makes sense from a microarchitectural implementation perspective.
Q.7.12: Following up on Problem 7, would there be a performance benefit (measured in instructions per cycle) if stores were not split? Explain why or why not.
Q.11.1: Using the syntax in Figure 11-2, show how to use the load-linked/store-conditional primitives to synthesize a compare-and-swap operation.
  • 19. Q.11.8: Real coherence controllers include numerous transient states in addition to the ones shown in the figure to support split-transaction buses. For example, when a processor issues a bus read for an invalid line (I), the line is placed in an IS transient state until the processor has received a valid data response, which then causes the line to transition into the shared state (S). Given a split-transaction bus that separates each bus command (bus read, bus write, and bus upgrade) into a request and a response, augment the state table and state transition diagram of the figure to incorporate all necessary transient states and bus responses. For simplicity, assume that any bus command for a line in a transient state gets a negative acknowledge (NAK) response that forces it to be retried after some delay.
Q.11.10: Assuming a processor frequency of 1 GHz, a target CPI of 2, a per-instruction level-2 cache miss rate of 1% per instruction, a snoop-based cache-coherent system with 32 processors, and 8-byte address messages (including command and snoop addresses), compute the inbound and outbound snoop bandwidth required at each processor node.
Solution: http://targetiesnow.blogspot.in/2013/12/ex-710-711-712-111-118-1110-solution.html
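A short calculation for Q.11.10 follows. The outbound traffic at a node is its own L2-miss address broadcasts; the inbound traffic is the broadcasts arriving from the other 31 processors. Treating every L2 miss as exactly one 8-byte snoop message is an assumption of this sketch.

```python
# Sketch for Q.11.10: snoop bandwidth at each node of a 32-processor system.
# Assumption: every L2 miss produces one 8-byte address/command broadcast.

freq_hz   = 1e9      # 1 GHz
cpi       = 2.0
miss_rate = 0.01     # L2 misses per instruction
procs     = 32
msg_bytes = 8

instr_per_sec  = freq_hz / cpi
misses_per_sec = instr_per_sec * miss_rate            # per processor

outbound = misses_per_sec * msg_bytes                  # this node's own snoops
inbound  = misses_per_sec * msg_bytes * (procs - 1)    # snoops from other nodes

print(f"outbound: {outbound / 1e6:.0f} MB/s per node")
print(f"inbound:  {inbound / 1e6:.0f} MB/s per node")
```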
  • 21. Architecture:
• John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, third edition, Morgan Kaufmann, New York, 2003. See www.mkp.com/CA3.
• David A. Patterson and John L. Hennessy, Computer Organization and Design: The Hardware/Software Interface. Text for COEN 171.
• Gerrit A. Blaauw and Frederick P. Brooks, Jr., Computer Architecture: Concepts and Evolution, Addison Wesley, 1997.
• William Stallings, Computer Organization and Architecture, Prentice Hall, 2000.
• Miles J. Murdocca and Vincent P. Heuring, Principles of Computer Architecture, Prentice Hall, 2000.
• John D. Carpinelli, Computer Systems Organization and Architecture, Addison Wesley, 2001.
• John Paul Shen and Mikko H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors, McGraw-Hill, 2003. Highly recommended as a complement to Hennessy and Patterson.
UNIT I
Introduction: review of basic computer architecture, quantitative techniques in computer design, measuring and reporting performance. CISC and RISC processors.
UNIT II
  • 22. Pipelining: Basic concepts, instruction and arithmetic pipeline, data hazards, control hazards, and structural hazards, techniques for handling hazards. Exception handling. Pipeline optimization techniques. Compiler techniques for improving performance.
UNIT III
Hierarchical memory technology: Inclusion, coherence and locality properties; cache memory organizations, techniques for reducing cache misses; virtual memory organization, mapping and management techniques, memory replacement policies. Instruction-level parallelism: basic concepts, techniques for increasing ILP, superscalar, super-pipelined and VLIW processor architectures. Array and vector processors.
UNIT IV
Multiprocessor architecture: taxonomy of parallel architectures. Centralized shared-memory architecture: synchronization, memory consistency, interconnection networks. Distributed shared-memory architecture. Cluster computers. Non-von Neumann architectures: data flow computers, reduction computer architectures, systolic architectures.
MODERN PROCESSOR ARCHITECTURE (L T P C: 3--3)
Processor Design: The evolution of processors, instruction set processor design, principles of processor performance, instruction-level parallel processing.
Pipelined processors: Pipelining fundamentals, pipelined processor design, deeply pipelined processors.