This paper proposes a fault tolerant design for integer parallel matrix-vector multiplications (MVMs). The scheme combines ideas from error correction codes and the self-checking capability of MVMs. It adds a detection matrix and sum matrix to the original parallel matrices to enable error detection and correction. Field-programmable gate array evaluation shows the proposed scheme can significantly reduce overheads compared to protecting each MVM individually. The detection matrix is generated using a checksum of each original matrix and a Hamming code. The sum matrix is the direct sum of the original matrices. By comparing results from the original MVMs and detection matrix, faulty outputs can be identified and corrected.
An efficient fault tolerance design for integer parallel matrix vector
1. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
An Efficient Fault-Tolerance Design for Integer Parallel Matrix–Vector
Multiplications
Abstract:
Parallel matrix processing is a typical operation in many systems, and in particular
matrix–vector multiplication (MVM) is one of the most common operations in the
modern digital signal processing and digital communication systems. This paper proposes
a fault tolerant design for integer parallel MVMs. The scheme combines ideas from error
correction codes with the self-checking capability of MVM. Field-programmable gate
array evaluation shows that the proposed scheme can significantly reduce the overheads
compared to the protection of each MVM on its own. Therefore, the proposed technique
can be used to reduce the cost of providing fault tolerance in practical implementations.
Software Implementation:
Modelsim
Xilinx 14.2
Existing System:
Matrix–vector multiplication (MVM) is a typical computation in domain transformation,
projection, or feature extraction. With increasing demand for high performance or
throughput, parallel computing is becoming more important in high-performance
computing systems, cloud computing systems, and communication systems. For example,
parallel matrix processing with a common input vector (parallel MVMs) is performed for
pre coding in multiuser multi-in multi-out (MIMO) (especially large-scale MIMO) and
high-performance low density parity check decoders. Parallel processing may run on a
large number of processing units, e.g., GPUs or distributed systems, and experiments
show that soft errors can affect the outputs. In this ABFT design, the two input matrices
2. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
are changed to a “row checksum” matrix and a “column checksum” matrix, respectively,
and single errors in the resulting “full checksum” matrix can be detected and corrected
based on its row checksum and column checksum.
Since the row checksum does not exist for an input vector, such ABFT scheme can only
be used for error detection in MVM. In the ABFT schemes proposed , checksum
technique is used to locate the affected result element or blocks of result elements in
MVM, and partial re computation is used for error correction. Such schemes are suitable
for the cases in which the delay introduced by re computation is acceptable. But this is
not always the case, especially in high-throughput communication receivers where the
inputs arrive in real time from A/D converters. A general method to protect linear
dynamic systems was proposed. Partof that system model can be extracted as an MVM. I
n a method to protect parallel digital filters was proposed. The scheme is based on
considering each filter as a bit in an error correction code (ECC) so that parity check
filters are added and used to detect and correct errors. The schemes in share a similar
approach in that redundant rows are added to the processing matrix by combination (or
coding) of the original rows of the matrix so that the relations among the multiple outputs
can be used to locate and correct the erroneous elements. However, this approach only
works for single MVM. In the protection of parallel fast Fourier transform (FFT)
operations on different input vectors was studied. In that case, a self checking property of
the FFT was combined with the ECC method to reduce the overhead required for
protection. we proposed an idea based on ECC and MVM self-checking (checksum) for
error protection of parallel MVMs. We assume real-time applications in which integer
vectors come from ADCs with a fixed timing and processing is synchronous and
repetitive, so a re computation is not tolerable.
This paper is an extension of that two page conference paper, and provides new results
based on the FGPA implementation. To the best of the authors’ knowledge, no other
3. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
fault-tolerant design for parallel MVMs has been proposed. In the proposed structure, a
“detection matrix” D and a “sum matrix” A are added for fault tolerance. D is generated
by performing ECC to the checksum row of each matrix, and A is a combination of all
matrices under protection. By comparing the results of the original MVMs and that of the
check matrix, the fault can be located. Then, the erroneous output vector can be
recovered by subtracting the correct output vectors from the sum output vector.
Disadvantages:
Overheads is not reduced
Cost is high
Errors are not identified easily
Proposed System:
Fault tolerance design for parallel matrix processing
A. Basic Structure The basic structure of the proposed fault-tolerant system for parallel
MVMs is shown in Fig. 1. From Fig. 1, we see that the original P matrices remain
unchanged, and the redundant part includes two matrices (regardless of the number of
parallel matrices under protection):
4. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Fig. 1. Fault-tolerant structure for parallel MVMs
detection matrix and sum matrix . By comparing the P outputs
with the checking result the faulty vector
can be detected, and then the correct output vector , can be obtained by subtracting
the correct vectors among from the “sum output vector” . The
number of rows of D (Rp) is determined by the number of parallel matrices (P). When
considering the case that only one MVM may fail, a Hamming code can be used for
protection, and the relationship between P and Rp would be
On the other hand, in our proposed technique (Fig. 1), a branch is an entire MVM
processing.
Generation of Detection Matrix and SUM Matrix
5. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
The sum matrix is just the direct sum of the original matrix
For the detection matrix, let us assume P = 4 to facilitate the illustration of the basic idea,
and assuming only one branch output may contain errors. First, a checksum row for each
matrix can be generated as
Then, the rows of detection D can be generated ,Hamming code as
Then, the error detection matrix can be defined as matrix as
6. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
if there are no errors, we should have
where “sum(x)” performs addition of all items within vector x, and
In (12), the result of is an integer vector, and sum(
produces an integer scalar. In summary, if we denote
7. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Fig. 2. Implementation structure of MVM (N = 20)
error detection of a single branch can also be done. After identifying the failing branch,
the corresponding output vector can be recovered by subtracting the right output vectors
from the sum output vector
Since the error detection is based on the sum of the output vectors, and the whole vector
could be recovered during the correction, the proposed scheme could deal with multiple
errors in a single output vector
Advantages:
Overheads are reduced
8. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
Cost is low
Error can be easily identified by error detection method
References:
[1] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing. Englewood Cliffs, NJ, USA:
Prentice-Hall, 1989.
[2] J. S. Lim, Two-Dimensional Digital Signal Processing. Englewood Cliffs, NJ, USA: Prentice-Hall,
1989.
[3] A. Grama, A. Gupta, G. Karypis, and V. Kumar, Introduction to Parallel Computing, 2nd ed.
London, U.K.: Pearson, Jan. 2003.
[4] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. Hoboken, NJ,
USA: Wiley, 1999.
[5] S. Roger, C. Ramiro, A. Gonzalez, V. Almenar, and A. M. Vidal, “Fully parallel GPU
implementation of a fixed-complexity soft-output MIMO detector,” IEEE Trans. Veh. Technol., vol. 61,
no. 8, pp. 3796–3800, Oct. 2012.
[6] Q. Qi and C. Chakrabarti, “Parallel high throughput soft-output sphere decoder,” in Proc. IEEE
Workshop Signal Process. Syst., Oct. 2010, pp. 174–179.
[7] M. Wu, C. Dick, and J. R. Cavallaro, “Improving MIMO sphere detection through antenna detection
order scheduling,” in Proc. SDR Forum, 2011, pp. 280–284.
[8] G. Wang, M. Wu, Y. Sun, and J. R. Cavallaro, “A massively parallel implementation of QC-LDPC
decoder on GPU,” in Proc. IEEE 9th Symp. Appl. Specific Process. (SASP), Jun. 2011, pp. 82–85.
[9] G. Wang, H. Shen, B. Yin, M. Wu, Y. Sun, and J. R. Cavallaro, “Parallel nonbinary LDPC decoding
on GPU,” in Proc. ASILOMAR Conf., Nov. 2012, pp. 1277–1281.
[10] I. S. Haque and V. S. Pande, “Hard data on soft errors: A largescale assessment of real-world error
rates in GPGPU,” in Proc. 10th IEEE/ACM Int. Conf. Cluster, Cloud Grid Comput. (CCGrid), May
2010, pp. 691–696.
9. NXFEE INNOVATION
(SEMICONDUCTOR IP &PRODUCT DEVELOPMENT)
(ISO : 9001:2015Certified Company),
# 45, Vivekanandar Street, Dhevan kandappa Mudaliar nagar, Nainarmandapam,
Pondicherry– 605004, India.
Buy Project on Online :www.nxfee.com | contact : +91 9789443203 |
email : nxfee.innovation@gmail.com
_________________________________________________________________
[11] P. Rech, C. Aguiar, C. Frost, and L. Carro, “An efficient and experimentally tuned software-based
hardening strategy for matrix multiplication on GPUs,” IEEE Trans. Nucl. Sci., vol. 60, no. 4, pp. 2797–
2804, Aug. 2013.
[12] K.-H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for matrix operations,” IEEE
Trans. Comput., vol. C-33, no. 6, pp. 518–528, Jun. 1984.
[13] C. Ding, C. Karlsson, H. Liu, T. Davies, and Z. Chen, “Matrix multiplication on GPUs with on-line
fault tolerance,” in Proc. IEEE 9th Int. Symp. Parallel Distrib. Process. Appl., May 2011, pp. 311–317.