Global Built-In Self-Repair for 3D Memories with Redundancy Sharing and Parallel Testing

Xiaodong Wang1, Dilip Vasudevan†, Hsien-Hsin S. Lee
xw285@cornell.edu firstname.lastname@example.org email@example.com
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
†CEOL, Department of Computer Science, University College Cork, Cork, Ireland

ABSTRACT

3D integration is a promising technology that provides high memory bandwidth, reduced power, shortened latency, and smaller form factor. Among many issues in 3D IC design and production, testing remains one of the major challenges. This paper introduces a new design-for-test technique called 3D-GESP, an efficient Built-In-Self-Repair (BISR) algorithm to fulfill the test and reliability needs of 3D-stacked memories. Instead of the local testing and redundancy allocation that most current BISR techniques employ, we introduce a global 3D BISR scheme, which not only enables redundancy sharing, but also parallelizes the BISR procedure among all the stacked layers of a 3D memory. Our simulation results show that the proposed technique significantly increases the memory repair rate and reduces the test time.

1. INTRODUCTION

3D stacked IC is an emerging integration process. By stacking individual dies vertically using face-to-face vias or through-silicon vias (TSVs), it promises to improve interconnect latency, power, bandwidth, etc. Additionally, it results in a more compact form for the integrated system. More importantly, it continues to increase device density and functionality for a given footprint, tracking Moore’s Law without scaling down the devices.

Among many different 3D-stacked architecture alternatives, homogeneous 3D-stacked memory is becoming one of the first commercial 3D IC products. Recently, Samsung announced plans to mass-produce stacked 40nm DDR3 DRAM using TSVs.
Other stacking architectures, such as memory-on-logic, have also been studied or prototyped [2, 4, 10, 13, 20] to demonstrate the benefits of stacking memory tiers directly atop the processing elements to improve performance (both latency and bandwidth) and power consumption.

Yield will gradually become a critical issue for 3D memories as the number of stacked layers grows [?]. Built-in self-repair (BISR), a common technique to boost the yield of traditional 2D memories [7, 11, 16], should be appropriately applied to 3D memories. For 3D memories, techniques such as TSV redundancy structures that replace faulty TSVs were recently prototyped [5, 9].

At first glance, it seems straightforward to apply the BISR algorithm of 2D memories directly to 3D memories without any modification, because the way memory chips are accessed remains the same for 2D and 3D memories even though the physical structure has changed greatly. However, after examining the characteristics of 3D memories more carefully, we found this direct application inefficient and propose a real global BISR algorithm and physical structure specifically tailored for 3D memories. We found that, on average, the repair rate of our scheme is 27.01% higher than that of the traditional local BISR scheme, and 8.26% higher than that of another global BISR algorithm, MESP. Meanwhile, the testing time can be reduced to 1/n (where n is the number of layers) of that of 2D memories with the same memory capacity.

The rest of the paper is organized as follows. Section 2 reviews background. Section 3 motivates and introduces our global BISR for 3D memories. Section 4 proposes our 3D-GESP algorithm. Section 5 analyzes our simulation results. Section 6 concludes.

1 Xiaodong Wang is currently a graduate student at Cornell University. This research was performed when he was an exchange student at Georgia Tech.

Figure 1: 3D SRAM Array Architecture

2. BACKGROUND

2.1 3D Memory Architecture

3D TSV-based memories are generally designed by stacking 2D planar memory layers with the address and data lines running across them vertically. The vertical connections of the address and data lines through the silicon are made with TSVs. Thus the data storage spans multiple die layers, in contrast to the 2D design, where both the logic and memory are on the same plane.

A typical 3D memory architecture with vertical bitlines and 3D decoders was described in [?]. An abstract view of the architecture is illustrated in Figure 1. In it, several banks of SRAM are distributed vertically across different die layers. The original 2D decoder is also dissected and partitioned across the 3D layers. This example is one variant of possible 3D SRAM implementations. According to [?], their 128-entry multi-ported SRAM array can reduce energy and latency by 36% and 55%, respectively.
Figure 2: Two Typical BISR schemes (a) Decoder Redirection BISR, and (b) Fault Cache BISR

2.2 Typical BISR Architecture

The BISR technique requires several spare rows and columns manufactured as part of the memory cells in order to replace the faulty cells in the array. In general, almost all BISR designs and optimizations are based on two basic architectures: Decoder Redirection BISR [?] and Fault Cache BISR [?].

2.2.1 Decoder Redirection BISR

The block diagram of this conventional BISR architecture [?] is depicted in Figure 2(a). It consists of four major parts.

• Redundant Row/Column. As shown in Figure 2(a), the faulty cells (marked as “X”) are replaced by the redundant row and column (shaded squares), with the necessary modifications to the decoder to guarantee correct memory operation.
• Built-in Self-Test (BIST) circuit. It generates test patterns and sends them to the memories. The march-like algorithm, which gives high coverage, linear testing time, and a simple hardware implementation, is the most widely used technique in memory testing.
• Built-in Redundancy-Analysis (BIRA) circuit. It collects the fault information from the BIST and then allocates the redundant units for the memory array.
• Fuse Macro. It stores the fault information and modifies the decoder to redirect the address from faulty cells to redundant units.

This architecture is easy to implement. However, it is inherently a local architecture, because it is difficult for a decoder to redirect its local faulty cells to neighboring redundant resources.

2.2.2 Fault Cache BISR

The block diagram of this BISR architecture [?] is depicted in Figure 2(b).
It comprises the BIST, the BIRA circuit, the Fault Cache, and spare memories called Global Redundant Units (GRUs).

• Global Redundant Unit. Rather than being manufactured together with the memory blocks, the GRU is fabricated separately. Nonetheless, its function of replacing the faulty cells remains the same.
• Fault Cache. It stores the information about the BISR replacements. During normal operation, the Fault Cache determines whether the memory cell the system accesses has been replaced by the GRU. If so, it generates a “hit” signal along with the local address to the GRU array, and selects the data from the GRU for the data bus. On a cache miss, the data retrieved from the main memory is used.

This “Fault Cache” BISR architecture does not reconfigure the decoder, so its redundancy resources can be global. However, this scheme requires extra components and more complicated wiring than the Decoder Redirection BISR.

3. GLOBAL BISR SCHEME

3.1 Motivation

Based on the two basic BISR architectures in Section 2, a number of optimizations have been proposed to improve the repair rate and the area overhead. Tseng et al. [?] proposed a ReBISR scheme for the RAMs in SOCs. In their scheme, multiple RAMs can share the same global BIRA circuits. However, the redundancy resources are not shareable: neighboring memory blocks cannot share their redundancy with each other. Such local replacement may cause a problem. When the number of redundant units around a single block is insufficient, the repair fails, while some redundancy resources in other blocks remain unused and are thus wasted. To solve this “local” problem, studies in [1, 21] proposed global replaceable redundancy schemes, allowing the redundancy in one memory block to repair faults in others.
However, these techniques not only require large area overheads, but also reduce the routability of the memory, and thus are less practical for traditional 2D memories. Fortunately, one key feature of 3D ICs is that the total interconnect length can be reduced considerably via TSVs. Therefore, for 3D memories, the routability problem that 2D memories face with “global” redundancy can be resolved by intelligent 3D design. Toward this, Chou et al. [?] proposed a memory repair technique that stitches together the good parts of bad dies and stacks them through TSVs. However, their replacement is at the block level rather than at the row level used by most BISR schemes, so it results in substantial wastage. More recently, Jiang et al. [?] introduced a wise redundancy sharing technique across the dies for 3D memories. However, their scheme is pairwise only and does not scale to multi-layer 3D designs. In addition, their proposal did not consider the test time, which is critical for cost and profitability.

3.2 Real Global 3D BISR Architecture

As discussed above, the current local BISR redundancy allocation cannot fully utilize the redundant resources on chip. This situation becomes even worse for 3D memories. Since each memory
layer is produced separately, the number of faulty cells may vary from layer to layer. Therefore, besides the block-level wastage, layer-level wastage will occur, where the redundant resources may be insufficient in certain layers while wasted in others.

In addition, it may not be desirable to conduct the BISR procedure serially across the memory layers. For 2D memories, it is difficult to parallelize the test among memory blocks, because a new datapath that directly connects the BISR circuit to every block needs to be created. This complicates routing and incurs too much area overhead for a planar design. For 3D memories, TSVs greatly alleviate the wiring constraint with modest area overhead.

In this paper, global is defined according to the discussion above.

• Shareable global redundancy. The redundant resources can be shared by all the memory layers of a single 3D memory chip. In this way, the redundant resources can be fully utilized, and the overall yield will be increased.
• Parallel testing. Besides the yield issue, the global BISR should also provide parallel testing among memory blocks. The test time will be significantly reduced, and so will the cost.

We choose the “Fault Cache BISR” architecture as the basis of our scheme because it helps realize a real global BISR design. As Figure 2(b) demonstrates, the Global Redundant Unit (GRU) can be used to repair the faulty cells of all the memory blocks, avoiding the restriction set by the decoder.

Figure 3: Schematic of our Global BISR Design
On the other hand, the wiring (i.e., routability) problem of this architecture can be resolved by 3D technology through intelligent design. The detailed hardware and software BISR design is discussed in Section 4.

4. 3D-GESP ALGORITHM

In this section, we introduce our global 3D BISR hardware design. The architecture is illustrated in Figure 3. To support this hardware layout and realize the real global BISR scheme defined in Section 3, we also introduce a 3D-Global ESP (3D-Global Essential Spare Pivoting) algorithm. This algorithm is a combination of two algorithms we propose: a Global ESP (GESP) and a 3D-BISR algorithm.

4.1 Global ESP Algorithm Extension

The GESP algorithm is specifically designed for the shareable global redundancy, corresponding to the first definition of global in Section 3. As Figure 3 shows, the redundancies (GRUs), Fault Cache, BIST, BIRA circuits, and all other auxiliary circuits are placed on the bottom layer, called the “BISR Layer”, and are shared by all the memory layers.

The MESP scheme, a widely used algorithm in industry [?], can be applied directly to our hardware in Figure 3. However, the original scheme was specifically designed for traditional 2D memories. For our 3D BISR, we made the improvements described below to further exploit the global characteristics and increase the repair rate.

Because the GRU architecture uses the Fault Cache to determine whether the main memory cell or the GRU should be accessed, there is no need to set the boundary required by the architecture that modifies the decoder. This situation is shown in Figure 2. Thus we propose two more characteristics for achieving an efficient GESP algorithm.

1. We do not differentiate between spare rows and columns as MESP does. One GRU entry can be used either as a spare row or as a spare column, according to the preference of the BIRA algorithm. This avoids the situation where the BIRA needs one more spare row but all the spare rows have been used up. In that case, MESP
may have to use at least one, and typically more, spare columns to replace that one-row repair. In our scheme, we can dynamically configure the spares as spare rows or columns, and thus, as long as some spares are left, we can always use them as the BIRA wishes. This further exploits the global characteristics of our scheme.

2. Unlike conventional MESP, where the replacement must start at an aligned boundary in a memory row or column (shown in Figure 4(a)), in our architecture each GRU entry can point to any arbitrary location of a memory row or column for replacement, as demonstrated in Figure 4(b).
According to the figure, after a new fault is detected that is not covered by any allocated GRU, that fault always serves as the “start point” of the newly allocated GRU. Assume this point’s location is (xi, yi) and the coverage length of a GRU is L. When future faults are detected, the BIRA checks whether they reside within the range of the previous GRU coverage, i.e., [xi + L, yi] (row repair) or [xi, yi + L] (column repair). In contrast, for MESP, the “start point” of a GRU entry is always the boundary of the memory blocks, no matter where the faults are located.

Figure 4: The Comparison between BIRA Algorithms. (a) Repairing of the MESP, and (b) Repairing of our GESP.

As shown in Figure 4, for the same fault map, MESP requires five GRUs (three row entries plus two column entries) whereas our GESP requires only four. If clustered faults are present, in particular faults crossing the alignment boundary in the memory array (eight squares in the figure), our GESP algorithm provides more benefit and a higher repair rate. We show our simulation results in Section 5.

These two improvements can also be applied to traditional 2D memories. However, they are only applicable to the “Fault Cache BISR” described in Section 2.2.2.
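The cross-boundary effect described above can be illustrated with a small Python sketch that counts the spares needed for the faults on a single row under the two policies. This is a one-dimensional simplification rather than the BIRA implementation of the paper: the function names are hypothetical, and the half-open coverage interval [x, x + L) and the ascending detection order are assumptions made only for illustration.

```python
def mesp_spares(fault_cols, length):
    """Aligned policy: each spare covers one block [j*length, (j+1)*length)."""
    return len({x // length for x in fault_cols})


def gesp_spares(fault_cols, length):
    """Unaligned policy: a new spare starts at the first uncovered fault."""
    spares = 0
    covered_until = None  # exclusive end of the latest spare (assumed half-open)
    for x in sorted(fault_cols):  # assumed: faults are found in ascending order
        if covered_until is None or x >= covered_until:
            spares += 1
            covered_until = x + length
    return spares


cluster = [3, 4, 5]  # hypothetical cluster straddling the boundary at column 4
print(mesp_spares(cluster, 4), gesp_spares(cluster, 4))  # prints: 2 1
```

With a spare length of 4, the aligned policy spends two spares on this cluster because it straddles a block boundary, while the unaligned policy covers it with one.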
For the “Decoder Redirection BISR” scheme, row decoders do not have the ability to access spare columns, and column decoders cannot access spare rows.

However, the “Decoder Redirection BISR” is more common in industry, because it is simpler and incurs fewer routability problems, as Figure 2 demonstrates. Moreover, even if the memory uses the “Fault Cache BISR”, implementing our improvements incurs additional area overhead, which is discussed in Section 5. However, this overhead can be hidden by our 3D memory structure, which is also discussed in Section 5.

In the literature, other prior efforts were made to improve on MESP and achieve higher repair rates. For example, Huang et al. [?] proposed HYPERA to effectively increase the repair rate, but their design incurs a severe timing penalty when accessing memories. Our GESP algorithm, however, does not suffer from such timing penalties, because the main memory and the redundancies are accessed simultaneously, according to Figure 2(b). Some additional overhead is required, but it is comparable with that of traditional MESP and can be hidden by our 3D memory layout. We analyze these overheads in detail in Section 5.

Figure 5: State Transition Diagram of the Control FSM. States: No Fault (reset=0, #layer=HiZ), Fault (reset=0, #layer=HiZ), Repair (reset=1, #layer=X). Transition labels: fault=0; upper=1; fault=1; fault=0; grant=1, upper=0.

4.2 3D-BISR Algorithm

Besides the shareable global redundancy, our hardware design in Figure 3 also enables parallel testing, corresponding to the second definition of global in Section 3. Combined with our 3D-BISR algorithm, the BISR procedure is parallelized across all memory layers. Therefore, no matter how many layers are stacked, the testing time remains the same as for testing a one-layer memory. The key point of this 3D parallel testing is to test all the memory layers simultaneously. At the system level, the 3D-BISR can be described as follows.

• Step 1: Perform BIST for one cell among all layers.
The address allocator (1-to-2 decoder in Figure 3) ignores the layer address (higher-order bits), sending the data and local address to every memory layer.
• Step 2: All memory layers determine whether any cell is faulty.
• Step 3: From Layer 1 to N, serially report to the bottom layer whether any tested cell of that layer is faulty.
• Step 4: Allocate GRU resources. Return to BIST.

This system-level scheme needs one OR gate, one comparator, and one FSM controller to be added to each layer, as shown in Figure 3. Figure 5 shows the state transition diagram. The variables in the diagram are defined as follows.

• Fault: When the layer comparator detects that the content of the memory differs from the reference value provided by the BISR layer, it indicates a fault. The variable is set to ‘1’, otherwise ‘0’. It is one input of the OR gate, as depicted in Figure 3.
• Upper: For a certain local location, when there is a fault in the upper layer, the “Upper” signal is set to ‘1’. It is the other input of the OR gate.
• Request: The output of the OR gate. For a certain layer, when there are faults in that layer or in its upper layer, this signal is set to ‘1’.
• Reset: Clears the comparator when set to ‘1’. This sets the “Fault” signal to ‘0’ and continues the 3D-BISR procedure.
• Grant: The signal sent by the BISR layer, telling the FSM of a layer in its “Fault” state to report the layer index if its “Upper” signal is ‘0’.
• #layer: Reports the layer index to the BISR layer when the FSM enters the “Repair” state. For example, layer 1 reports “01”, and layer 2 reports “10”.
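The interplay of these signals can be sketched as a small behavioural model in Python. It is only an illustration under stated assumptions, not the hardware of Figure 3: the function name `report_order` is hypothetical, layer 0 is taken to be the top layer, and each loop iteration stands for one grant from the BISR layer without modelling exact cycle timing.

```python
from enum import Enum


class State(Enum):
    NO_FAULT = "No Fault"
    FAULT = "Fault"
    REPAIR = "Repair"


def report_order(faults):
    """Return the layer indices in the order they reach the BISR layer.

    faults[i] is True when layer i's comparator flagged the tested cell;
    layer 0 is assumed to be the top layer.
    """
    n = len(faults)
    fault = list(faults)
    state = [State.FAULT if f else State.NO_FAULT for f in fault]
    reported = []
    while n > 0:
        # OR-gate chain: request = fault OR upper, where "upper" is the
        # request coming from the layer directly above.
        upper, request = [False] * n, [False] * n
        for i in range(n):
            upper[i] = request[i - 1] if i > 0 else False
            request[i] = fault[i] or upper[i]
        if not request[-1]:  # the BISR layer sees no pending request
            break
        for i in range(n):  # the BISR layer asserts "grant" to every layer
            if state[i] is State.FAULT and not upper[i]:
                state[i] = State.REPAIR
                reported.append(i)  # drive "#layer" with this layer's index
                fault[i] = False  # "reset" clears the comparator
                state[i] = State.NO_FAULT
    return reported


print(report_order([True, True, True, False]))  # prints: [0, 1, 2]
```

Because a layer reports only while its “Upper” input is ‘0’, the faulty layers are served one at a time from the top of the stack downwards, which is why a single shared index bus is enough.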
Figure 6: Timing Diagram of an Example for 3D-BISR

Whenever the BIST detects a faulty cell in certain layers, the “request” signal of the BIRA circuit is set to ‘1’ by the comparators and OR gates, as shown in Figure 3. The BIST procedure is stopped, and resumed only after the fault information has been reported. For a specific layer, the 3D-BISR can be described as follows.

• Step 1: Perform BIST for the specified cell.
• Step 2: Determine whether the cell is faulty. If not, wait for the next cell’s BIST. If yes, set “Fault” to ‘1’ and go to step 3. The FSM enters the “Fault” state.
• Step 3: If the layer’s “Upper” equals ‘0’ and “Grant” equals ‘1’, go to step 4. The FSM enters the “Repair” state. Otherwise, stay in step 3.
• Step 4: Report the layer index through the “#layer” signal. Set the “Fault” signal to ‘0’. The FSM enters the “No Fault” state. Then wait for the BIST of the following memory cells.

4.2.1 Example for 3D-BISR algorithm

We show an example to detail our algorithm. Suppose a four-layer memory stack where layers 0, 1, and 2 each contain a faulty cell at the same location, with local row address 0x00 and local column address 0x00; that is, the memory locations 0x00000, 0x10000, and 0x20000 are faulty. We assume layer 0 is the top layer. First, the BIST begins testing 0x0000 of all four memory layers. Then the comparators determine whether the tested cell is faulty. After that, our 3D-BISR circuits operate according to the algorithm described in Section 4.2. Figure 6 shows the timing diagram, which illustrates how our 3D-BISR scheme works cycle by cycle.

4.2.2 New 3D redundancy structure

As can be seen from the above example, the memory access and result comparison are done in parallel, while the reporting is done serially. This is because we want to reduce the number of TSVs and make our design scalable.
Our design only needs log2(N) reporting TSVs, shown in Figure 3, while at least N TSVs are needed for a parallel reporting scheme. However, in the worst case, when all the cells along the vertical axis are faulty, the test and the repair procedure will be completely serial.

Figure 7: 3DR Architecture: (a) Modified Block Diagram, (b) Detailed Replacement Scheme

To address this shortcoming, we propose a novel 3D redundant structure. The current 2D BISR algorithm uses spare rows and columns for 2D replacement. For 3D memories, a 3D redundancy structure follows intuitively. Besides the spare rows and columns, a spare cylinder structure is introduced in this section.

On the additional BISR layer, the 3D redundancy (3DR) unit, the Redundant Cylinder, is added as shown in Figure 7(a). Basically, our redundant cylinder has the same functionality as the redundant row/column. However, instead of replacing a faulty row/column in a 2D manner, our redundant cylinder structure replaces the faulty cells along the vertical axis, as demonstrated in Figure 7(b).

To support this cylinder replacement scheme, the fault cache should store the local row and column address of the faulty cylinder and ignore the layer address. Whenever there is an address “hit” during normal operation, the system accesses the redundant cylinder, rather than the faulty cells, for reads and writes.

We now describe how this Redundant Cylinder structure helps to fully parallelize the BISR procedure. When executing BIST, the same cells along the vertical axis of all the 3D memory layers are tested simultaneously.
If no more than one fault is detected and reported, the BISR procedure remains the same as described previously. However, if multiple faults are detected, our BISR algorithm does not spend more cycles accepting the fault information from the second and subsequent faults. To be more specific, after the BISR logic receives the first fault information and sends a “grant” signal, it finds that the “request” signal is still ‘1’, which means some other layer wants to report a fault. In this case, our scheme does not spend time listening to any more fault information. Instead, it immediately allocates a redundant cylinder, which replaces all the memory cells that have the same local location, as shown in Figure 7(b). Therefore, the
maximum test time is constrained to the upper limit of a 2-layer 3D memory, no matter how many layers the 3D memory has.

Figure 8: Timing Diagram of an Example for 3DR (stages shown: No Fault, No Fault, Accept Info, Allocate Cylinder)

4.2.3 Example for 3DR allocation

We show an example for our 3DR structure. As in the previous example, layers 0, 1, and 2 each contain a faulty cell at local address 0x0000. Figure 8 shows the timing diagram. The first two cycles remain the same as in Figure 6.

Cycle 3: The “Request” input of the BIRA is still ‘1’, indicating that there is more than one fault along the vertical axis. Instead of accepting the second faulty layer’s index, it sends the “Reset” signal to all the memory layers (this needs one additional shared TSV to transfer “Reset” in Figure 3) and allocates a spare cylinder to this location, as shown in Figure 7.

Cycle 4: All the comparators have been reset, so the “Request” input of the BISR goes back to ‘0’. Then the BISR circuit de-asserts the “hold” signal to the BIST and testing continues. No more cycles are needed.

5. SIMULATION RESULTS AND ANALYSIS

5.1 Experiments

To evaluate the effect of our proposed algorithm, we use the clustered fault model in [?]. According to [?], the probability of not introducing an error into a given circuit area during a time interval ∆t of the manufacturing process is:

$$p(\Delta t \mid k, l_1, l_2, \ldots, l_n) = c(x, y) + b \times k + d \times l(x, y)$$

where l(x, y) is the number of faulty neighboring memory cells around a memory location (x, y), c(x, y) is the susceptibility parameter specified by the fabrication process, k is the number of faults already on that layer, and b and d are the clustering factors.

∆t0: This is the beginning of the fault generation, with l(x, y) = 0 and k = 0. The probability given by the equation is c(x, y) uniformly across the layer.
Then we generate an evenly distributed random value (0 to 1) for each memory cell. If the value is larger than its corresponding probability, we determine that there is a fault at that location.

∆t1: First we calculate l(x, y) and k after the first interval ∆t0. Then, according to the probability equation, we again generate an evenly distributed random value (0 to 1) for each memory cell to determine whether there is a fault.

∆t2 to ∆tn: Repeat ∆t1.

Figure 9: Comparison of Global and Local BISR Schemes (repair rate versus GRU = 5 to 10 for the Local, Semi-Global, and Global schemes)

There is no need to set an average number of faults for each layer, since this equation implicitly sets that value. Assume c = 0.85, b = 0.04, and d = −0.1. The maximum l(x, y) is 8 (i.e., all the neighboring cells are faulty). When the probability equation generates a value greater than 100%, no faults are generated. The relationship is shown below.

$$p(\Delta t_n \mid k, l_1, l_2, \ldots, l_n) = c(x, y) + b \times k + d \times l(x, y)$$

$$p = 0.85 + 0.04 \times k - 0.1 \times 8 < 100\%$$

$$k < \frac{1 - 0.85 + 0.1 \times 8}{0.04} = 23.75$$

Therefore, when k > 23.75, even if l(x, y) = 8, the probability is never smaller than 100%, indicating that no fault is generated. The upper limit of faults on one layer in this example is therefore 24.

Here we also introduce the concept of the length of training intervals (LTI). The fault generation procedure presented in [?] is similar to training. Given enough training intervals, the number of faults on one layer will be very close to the maximum number of faults. Therefore, if we set the LTI long enough, the number of faults on one layer will be stable. In this way, we can better simulate real manufacturing circumstances.

In the following analysis, each simulated memory layer has a size of 1024 × 1024 × 8 bits, with an average of 23.5340 faulty cells in each layer (implicitly set as discussed above).
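The saturation bound derived above can be checked with a few lines of Python. This is only a worked example of the arithmetic under the stated parameters (c = 0.85, b = 0.04, d = −0.1); the function names are hypothetical, and the sketch does not attempt to reproduce the full fault-generation procedure.

```python
import math


def no_fault_probability(c, b, d, k, l):
    """p = c + b*k + d*l, the probability of NOT introducing a fault."""
    return c + b * k + d * l


def fault_saturation_bound(c, b, d, l_max=8):
    """Value of k at which p reaches 1 even when all l_max neighbours are faulty."""
    return (1.0 - c - d * l_max) / b


bound = fault_saturation_bound(0.85, 0.04, -0.1)
print(bound, math.ceil(bound))  # approximately 23.75, and 24
```

With 23 existing faults and eight faulty neighbours the probability is still below 1, whereas with 24 faults it is at least 1, which matches the per-layer limit of 24 stated above.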
The parameters varied in the simulation are the following.

• GRU: The number of global redundancy units available for replacing faulty rows/columns.
• Grid: The width of a row/column that a GRU replaces. For example, if grid = 32, a single GRU entry has 32 × 8 bits = 256 bits, which replaces 256 bits horizontally (row) or vertically (column) in the main memory.
• Cylinder: The number of Cylinder units used during the whole BISR process.

5.2 Performance of Proposed GESP Algorithm

First, we quantify the efficiency of our shareable global redundancy structure. This simulation is based on 1,000 samples of an eight-layer 3D memory. The grid size is chosen to be 128. The results are shown in Figure 9, where Global means that the eight memory layers share the GRUs; Semi-global means that every four memory
layers share the GRUs, which are not shareable between groups; Local means that every layer has its own GRUs, with no sharing across layers. The total number of GRUs is the same for these three architectures. The Global scheme achieves a 27% higher repair rate than the Local scheme on average (59.9% maximum improvement for GRU=8 on each layer). Compared with the Semi-global scheme, our Global scheme improves the repair rate by 8.6%, with a maximal improvement of 22.3%.

Figure 10: Comparison of BIRA algorithms (MESP versus GESP; repair rate versus GRU = 20 to 40). (a) grid=4. (b) grid=8. (c) grid=16. (d) grid=32. (e) grid=64. (f) grid=128. (g) grid=256. (h) grid=512.

Second, we evaluate the performance gain of the “cross-boundary” technique in our proposed GESP algorithm over the MESP algorithm. Our simulation is based on 1,000 samples of a four-layer memory. The selected grid sizes are 4, 8, 16, 32, 64, 128, 256, and 512. Figure 10 shows the results. Our GESP performs much better than MESP when the grid is small, and the gap closes as the grid size increases. When the grid reaches half of the entire memory size, the improvement diminishes. This is not difficult to understand, as we use the clustered fault model to evaluate the algorithms.
When the grid is small, MESP is more likely to encounter the “cross-boundary” problem, wasting many GRU entries on repair. For our proposed GESP algorithm, there is no such “cross-boundary” issue, since a GRU can be placed at any arbitrary location, as explained earlier with Figure 4. When the grid becomes larger, the likelihood of the “cross-boundary” problem dwindles, and thus the performance gap between the two algorithms shrinks. Overall, our GESP algorithm outperforms the MESP algorithm by 8.26% on average (27.60% at maximum), with only a little hardware overhead, which is described next.

5.3 Hardware overhead analysis

Performance-wise, our 3D-GESP algorithm does not incur a timing penalty when accessing the memory, because the memory and the redundancy resources are accessed concurrently. For area, however, our 3D-GESP scheme needs space for its BISR module. This overhead has two sources. First, for each memory layer, we dedicate a comparator, an FSM, and an OR gate to support our global 3D BISR scheme. Given that these are simple logic structures, the main contributor to the area overhead is the TSVs. Using data in [?], the TSV pitch is 4 to 10 µm, and for a 50nm process the DRAM density is 27.9 Mb/mm². Assume a four-layer 3D memory. According to our 3D BISR design in Figure 3, the comparator needs 8 TSVs, the FSM needs (log2 4) TSVs, the OR gate needs 1 TSV, and the “Cylinder Repair” needs 1 TSV for “request” and 1 TSV for “grant”. In total, each layer requires 13 TSVs, occupying 1300 µm² with a 10 µm TSV pitch. A 1MB main memory on each layer takes approximately 285,714 µm². The area overhead is merely 0.455%.

In the more general case, assume the 3D memory has a layers, each layer has b cells, and each cell contains c bits. The total number of TSVs required by our scheme is: log2 a + c + 3.
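The TSV count and the four-layer example can be reproduced with the short Python sketch below. It uses only the figures quoted in the text (a 10 µm pitch and roughly 285,714 µm² for a 1MB layer); the function names are hypothetical, and rounding log2 a up for layer counts that are not powers of two is an assumption.

```python
import math


def tsvs_per_layer(layers, bits_per_cell):
    """log2(a) + c + 3 TSVs: layer index, comparator data, and three control lines."""
    return math.ceil(math.log2(layers)) + bits_per_cell + 3


def tsv_area_um2(tsv_count, pitch_um):
    """Footprint when each TSV occupies one pitch-by-pitch square."""
    return tsv_count * pitch_um * pitch_um


count = tsvs_per_layer(4, 8)          # four layers, 8 bits per cell
area = tsv_area_um2(count, 10)        # 10 um TSV pitch
overhead = area / 285714              # 1MB layer area quoted in the text
print(count, area, round(100 * overhead, 3))  # prints: 13 1300 0.455
```

For the four-layer, 8-bit configuration this gives 13 TSVs, 1300 µm², and an overhead of about 0.455%, in line with the numbers above.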
Therefore:

$$\text{total area overhead} = \frac{3488 \times (\log_2 a + c + 3)}{b \times c}$$

For most 3D memory configurations, the area overhead will be less than 1%.

Second, for the BIST and BISR circuits, the area overhead of the MESP scheme was 8.7%, as shown in [?]. Our 3D-GESP scheme, which exploits cross-boundary repair, may need more area. For MESP, the Fault Cache only needs to store a partial address of the faulty cells because of the aligned boundary. For example, only 15 bits are needed in the Fault Cache for a four-layer 3D memory with a grid size of 128. For our 3D-GESP scheme, all the address bits need to be stored, which is 31.8% more. Apart from that, no other area overhead is needed for our scheme. In the worst case, our hardware overhead amounts to 11.5%. However, in our 3D BISR design as depicted in Figure 3, we dedicate an entire layer to BISR. In this way, the hardware overhead of the BISR modules is not a problem for implementing 3D-GESP. The real area overhead on the memory layers is 0.39%.

5.4 Implication of 3D Memories

Finally, during simulation, we made an interesting observation. As shown in Figure 11, given 1-, 4-, and 8-layer 3D memories with 4, 5, 6, 7, 8, 9, and 10 GRUs per layer, so that the hardware overhead ratio is the same, the 8-layer memory has the lowest repair rate when the number of GRUs is small; in contrast, the 8-layer memory has the highest repair rate when the number of GRUs is sufficient. Here the overhead ratio is defined as:
$$\text{Overhead} = \frac{\text{GRU resources provided}}{\text{total cells of 3D memory product}}$$

Figure 11: Layer’s effect on the repair rate given the same GRU/layer (repair rate, 0% to 100%, for GRU = 5 to 10; series: 1-layer, 4-layer, 8-layer)

This is a characteristic of 3D memory. We assume that all the memory layers are produced separately and have no correlation in their fault maps, and that for each layer the average number of faults is m and the standard deviation is σ, the same as for a traditional 2D memory. For an n-layer 3D memory, the average number of faults per layer is still m, but its standard deviation is σ/√n. Since the standard deviation of a 3D memory is smaller than that of a 2D memory, the average number of faults of each 3D memory layer will converge into the range m ± p. To simplify the problem, consider the following example: assume the GRU can repair at most 16 faults per layer and the average number of faults is m = 20. Since the standard deviation is smaller for 3D memories, the probability that a given 3D memory product has fewer than 16 faults per layer is assumed to be 10%, while that of a 2D memory is 20%. Therefore, the 2D memory will have a higher repair rate. On the other hand, when the GRU can repair up to 24 faults per layer, the probability that a given 3D memory product cannot be repaired is 10%, while that of a 2D memory is 20%. The 2D memory will suffer from a lower repair rate under that circumstance.

Since it is always desirable to have a repair rate above 90%, 3D memories will show their advantage over 2D. As shown in Figure 11, to reach a repair rate above 90%, the 3D memories need fewer GRU per layer than their 2D counterpart.

6. CONCLUSION

In this paper, we present a novel hardware and software design for 3-D BISR. Our scheme, 3D-GESP, is a real global BISR technique, which enables global redundancy sharing and parallel testing.
The experimental results showed that our 3D-GESP scheme can achieve a 27.01% higher repair rate compared to the local BISR, and 8.26% over another global algorithm, MESP. In addition, our scheme only requires 1/n of the testing time of the traditional BISR procedure, where n is the number of stacked layers of the 3D memory. Therefore, our scheme will significantly improve the manufacturing yield, repair rate, and testing throughput of 3D die-stacked memories.

7. ACKNOWLEDGMENTS

This work was sponsored by NSF grant CCF-0811738, ITRI Taiwan, and a gift from Intel Corp. Xiaodong Wang conducted this work while he was an exchange student at Georgia Tech from Shanghai Jiao Tong University. Dilip Vasudevan was a visiting scholar at Georgia Tech sponsored by University College Cork, Ireland.

8. REFERENCES

[1] S. Bahl. A Sharable Built-in Self-repair for Semiconductor Memories with 2-D Redundancy Scheme. In 22nd IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, pages 331–339, 2007.
[2] B. Black et al. Die Stacking (3D) Microarchitecture. In Proc. of the International Symposium on Microarchitecture, 2006.
[3] Y.-F. Chou, D.-M. Kwai, and C.-W. Wu. Memory Repair by Die Stacking with Through Silicon Vias. In IEEE International Workshop on Memory Technology, Design, and Testing, pages 53–58, 2009.
[4] M. Healy, K. Athikulwongse, R. Goel, M. Hossain, D. Kim, Y.-J. Lee, D. Lewis, T.-W. Lin, C. Liu, M. Jung, B. Ouellette, M. Pathak, H. Sane, G. Shen, D. H. Woo, X. Zhao, G. Loh, H.-H. S. Lee, and S. K. Lim. Design and Analysis of 3D-MAPS: A Many-Core 3D Processor with Stacked Memory. In IEEE Custom Integrated Circuits Conference, 2010.
[5] A.-C. Hsieh, T.-T. Hwang, M.-T. Chang, M.-H. Tsai, C.-M. Tseng, and H.-C. Li. TSV Redundancy: Architecture and Design Issues in 3D IC. In Proceedings of the Conference on Design Automation and Test in Europe, pages 166–171, 2010.
[6] T.-C. Huang et al. HYPERA: High-Yield Performance-Efficient Redundancy Analysis. In 19th IEEE Asian Test Symposium (ATS), pages 231–236, 2010.
[7] W. K. Huang, Y.-N. Shen, and F. Lombardi. New approaches for the repairs of memories with redundancy by row/column deletion for yield enhancement. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 9(3), March 1990.
[8] L. Jiang, R. Ye, and Q. Xu. Yield Enhancement for 3D-Stacked Memory by Redundancy Sharing across Dies. In Proc. of International Conference on Computer-Aided Design, 2010.
[9] U. Kang et al. 8 Gb 3-D DDR3 DRAM Using Through-Silicon Via Technology. IEEE Journal of Solid-State Circuits, 45(1):111–119, 2010.
[10] D. H. Kim, K. Athikulwongse, M. B. Healy, M. M. Hossain, M. Jung, I. Khorosh, G. Kumar, Y.-J. Lee, D. Lewis, T.-W. Lin, C. Liu, S. Panth, M. Pathak, M. Ren, G. Shen, T. Song, D. H. Woo, X. Zhao, J. Kim, H. Choi, G. H. Loh, H.-H. S. Lee, and S. K. Lim. 3D-MAPS: 3D Massively Parallel Processor with Stacked Memory. In IEEE International Solid-State Circuits Conference (ISSCC), 2012.
[11] I. Kim et al. Built-in self-repair for embedded high density SRAM. In IEEE International Test Conference, pages 1112–1119, 1998.
[12] H.-H. S. Lee and K. Chakrabarty. Test Challenges for 3D Integrated Circuits. IEEE Design and Test of Computers, Special Issue on 3D IC Design and Test, 26(5):26–35, Sep/Oct 2009.
[13] G. H. Loh. 3D-Stacked Memory Architectures for Multi-Core Processors. In Proceedings of the International Symposium on Computer Architecture, pages 453–464, 2008.
[14] S.-K. Lu et al. Efficient BISR Techniques for Embedded Memories Considering Cluster Faults. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 18(2), February 2010.
[15] K. Puttaswamy and G. Loh. 3D-Integrated SRAM Components for High-Performance Microprocessors. IEEE Transactions on Computers, 58(10):1369–1381, 2009.
[16] R. Rajsuman. Design and test of large embedded memories: An overview. IEEE Design and Test of Computers, 18(3), May 2001.
[17] C. H. Stapper. Simulation of spatial fault distributions for integrated circuit yield estimations. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 8(12), 1989.
[18] A. Tanabe et al. A 30-ns 64-Mb DRAM with Built-in Self-Test and Self-Repair Function. IEEE Journal of Solid-State Circuits, 27(11):1525–1533, 1992.
[19] T.-W. Tseng, J.-F. Li, and C.-C. Hsu. ReBISR: A Reconfigurable Built-In Self-Repair Scheme for Random Access Memories in SOCs. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 18(6), 2010.
[20] D. H. Woo, N. H. Seong, D. L. Lewis, and H.-H. S. Lee. An optimized 3D-stacked memory architecture by exploring excessive, high-density TSV bandwidth. In Proceedings of the 16th International Symposium on High-Performance Computer Architecture, pages 429–440, 2010.
[21] T. Yamagata et al. A Distributed Globally Replaceable Redundancy Scheme for Sub-Half-Micron ULSI Memories and Beyond. IEEE Journal of Solid-State Circuits, 31(2):195–201, 1996.