
ICSRS_R038.pptx


Paper presentation at the 6th International Conference on System Reliability and Safety (2022). Title: "A Methodology for Selective Protection of Matrix Multiplications: a Diagnostic Coverage and Performance Trade-off for CNNs Executed on GPUs"


  1. Javier Fernández (1,3), Irune Agirre (3), Jon Perez-Cerrolaza (3), Francisco J. Cazorla (1), Jaume Abella (1,2)
  2. CONTENTS: 01 CONTEXTUALIZATION, 02 PROPOSED SOLUTION, 03 EVALUATION, 04 CONCLUSIONS
  3. Main Concepts. Artificial Intelligence: it has made enormous progress, reaching near-human accuracy in several safety-related tasks. Functional safety standards: IEC 61508, IEC 61513, EN 5012X, and ISO 26262 (example in the automotive domain).
  4. Main Concepts. Baseline concept: detection of faults at runtime in the matrix-matrix multiplication, using a catalog of diagnostic techniques and CUTLASS, a high-performance matrix-matrix multiplication library ("On the Safe Deployment of Matrix Multiplication in Massively Parallel Safety-Related Systems"), applied to an object detector application based on CNNs (Tiny YOLO-v3). Matrix-matrix multiplication (MMM) is the backbone of convolutional neural networks in terms of execution time: sequential implementation, 98.5%; vectorized implementation, 87%; CUDA-based implementation, 67%. Example: A (3×2) × B (2×3) = C (3×3).
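As an illustrative sketch only (the paper uses the CUTLASS GPU library, not reference Python code), a naive matrix-matrix multiplication makes the cost structure on this slide concrete: every element of C needs K multiply-accumulates, M×N×K in total, which is why the MMM dominates CNN execution time.

```python
# Naive matrix-matrix multiplication C = A x B (illustrative sketch only).
# A is M x K, B is K x N, so C is M x N.
def matmul(A, B):
    M, K = len(A), len(A[0])
    K2, N = len(B), len(B[0])
    assert K == K2, "inner dimensions must match"
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            # Each C[i][j] costs K multiply-accumulate operations,
            # M*N*K in total -- the work that dominates CNN layers.
            for k in range(K):
                C[i][j] += A[i][k] * B[k][j]
    return C

# 3x2 times 2x3 example, matching the slide's A (3x2) x B (2x3) = C (3x3)
A = [[1, 2], [3, 4], [5, 6]]
B = [[7, 8, 9], [10, 11, 12]]
C = matmul(A, B)
```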
  5. CONTENTS (section divider).
  6.–16. PROPOSED SOLUTION (figure sequence): slides building up Stage 1 and Stage 2 of the methodology, illustrated on the Berkeley DeepDrive dataset (1) (https://www.bdd100k.com/).
  17. Stage 2, PROPOSED SOLUTION. DC computation per fault source:
      A. Faults injected at the arithmetic level or at the register level:
         DC = (Σ_{i=1}^{4} Nblocks_{Bi} × Ndet_{Bi}) / Nfi
      B. Faults injected at the global memory level:
         DetA = (B1detA† + B3detA†) × N_BRT1 + B2detA* + B4detA*
         DetB = (B1detB⊗ + B2detB⊗) × N_BCT1 + B3detB△ + B4detB△
         DC = (DetA + DetB) / ((M + N) × K × data_size)
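The arithmetic/register-level formula on this slide can be sketched directly in code. The per-block counts below are made-up illustration values, not data from the paper; only the formula DC = Σᵢ(Nblocks_Bi × Ndet_Bi) / Nfi is taken from the slide.

```python
# Diagnostic coverage (DC) for faults injected at the arithmetic or register
# level, following the slide's formula: DC = sum_i(Nblocks_Bi * Ndet_Bi) / Nfi.
def diagnostic_coverage(nblocks, ndet, nfi):
    """nblocks[i]: number of blocks of type B1..B4;
    ndet[i]: detected injected faults per block of that type;
    nfi: total number of injected faults."""
    detected = sum(nb * nd for nb, nd in zip(nblocks, ndet))
    return detected / nfi

# Toy example: 4+2+2+1 blocks, hypothetical detection counts, 100 injections
dc = diagnostic_coverage(nblocks=[4, 2, 2, 1], ndet=[10, 8, 8, 5], nfi=100)
```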
  18. Stage 3, PROPOSED SOLUTION (figure), illustrated on the Berkeley DeepDrive dataset (1) (https://www.bdd100k.com/).
  19. CONTENTS (section divider).
  20. EVALUATION. Set-up: matrix multiplication A (M×K) × B (K×N) = C (M×N), and its implementation.
  21. EVALUATION. Set-up, Stage 1: sensitivity to misclassification.
  22. EVALUATION. Set-up, Stage 1, Stage 2. Performance impact (without compiler optimization): L1 (minimum): from 1.02 to 82.5; L7 (maximum): from 1.04 to 171.5. Performance impact (maximum compiler optimization): L1 (minimum): from 1.01 to 1.37; L3 (maximum): from 1.002 to 1.18.
  23. EVALUATION. Stage 2: DC of each layer of Tiny YOLO-v3.
  24. EVALUATION. Stage 3: selective protection. Remarks: while such performance impact is high, it could be reduced if diagnostics are executed only once per period. For example, for the highest diagnostic coverage, PI = 3.8× the CNN execution time; with a process safety time of 100× a single classification task, PI is lower than 5%.
  25. CONTENTS (section divider).
  26. CONCLUSIONS. We propose a methodology to selectively protect CNNs deployed on GPUs, decomposed into three stages, and demonstrate its applicability on a tiny version of an object detector, Tiny YOLO-v3. Additionally, we remark:
      • For this CNN, we observe a higher tendency to misclassify (from 83.4% to 99.6%) in the initial layers (L1–L8). However, the final layers also present lower but still high misclassification rates (from 55.2% to 74.34%).
      • For the given example, the lowest performance impact to achieve the high, medium, and low DC ranges is 3.8, 3.33, and 2.61, respectively.
  27. THANK YOU. IKERLAN, P.º José María Arizmendiarrieta, 2 - 20500 Arrasate-Mondragón. T. +34 943712400, F. +34 943796944.
  28. NAME: Javier Fernández Muñoz. EMAIL: JAVIER.FERNANDEZ@IKERLAN.ES. Acknowledgements:
      • Ikerlan authors have received funding from the Elkartek grant project KK-2021/00123 of the Basque government.
      • BSC authors have been partially supported by the Spanish Ministry of Science and Innovation under grant PID2019-107255GBC21/AEI/10.13039/501100011033.
  29. Classification is correct if: (1) the central point of the box is less than 50 pixels away; (2) the width and height of the boxes vary by less than 25 pixels; (3) the accuracy differs by less than 15%.
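A minimal sketch of the three correctness criteria on this slide. The box representation (centre x, centre y, width, height) and the function name are assumptions for illustration; the thresholds (50 px, 25 px, 15%) come from the slide.

```python
import math

# Hypothetical check of the slide's three criteria for a correct
# classification. Boxes are assumed to be (cx, cy, w, h) tuples in pixels;
# accuracies are assumed to be confidences in [0, 1].
def classification_correct(golden_box, observed_box, golden_acc, observed_acc):
    gx, gy, gw, gh = golden_box
    ox, oy, ow, oh = observed_box
    centre_ok = math.hypot(gx - ox, gy - oy) < 50       # criterion 1
    size_ok = abs(gw - ow) < 25 and abs(gh - oh) < 25   # criterion 2
    acc_ok = abs(golden_acc - observed_acc) < 0.15      # criterion 3
    return centre_ok and size_ok and acc_ok

ok = classification_correct((100, 100, 60, 40), (110, 120, 70, 45), 0.90, 0.85)
bad = classification_correct((100, 100, 60, 40), (200, 200, 60, 40), 0.90, 0.85)
```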
  30. Safe architectural patterns proposed (figure).

Editor's Notes

  • To employ checksum algorithms as diagnostic techniques to compute an Execution Signature (ES) of all the values of the input and output matrices.
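A toy sketch of such an execution signature, using a Fletcher-16-style checksum (one of the checksum families a diagnostics catalog might include; the exact variant, the integer quantisation of values, and the traversal order here are all assumptions, not the paper's implementation):

```python
# Fletcher-16-style execution signature (ES) over all matrix values.
# Illustrative only: real diagnostics would run alongside the GPU kernel.
def execution_signature(matrix):
    s1, s2 = 0, 0
    for row in matrix:
        for value in row:
            s1 = (s1 + int(value) % 256) % 255  # running sum of bytes
            s2 = (s2 + s1) % 255                # running sum of sums
    return (s2 << 8) | s1

A = [[1, 2], [3, 4]]
es_golden = execution_signature(A)  # golden ES, computed without faults
A[0][0] = 9                         # simulate a corrupted value (injected fault)
fault_detected = execution_signature(A) != es_golden
```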
  • 9.06e9 bits; 13,163 years; 45 ms.
  • evaluates the execution time penalty incurred by each diagnostic technique included in the safe catalog. To this end, we apply the diagnostics in all CNN layers and measure the execution time of each one. This process is repeated for the different types of protection techniques provided in the diagnostics catalog.
  • computes an array of golden ESs by including the safe library of diagnostic techniques in the MMM execution of
    each layer (without fault injections).
  • However, an exhaustive fault injection campaign may be unaffordable for large matrices due to the required number of iterations to cover all input combinations
  • We denote as B1 those blocks whose dimensions match the size of the blocks launched to the GPU, B2 as those with an equal number of columns but different rows, B3 if the number of rows matches but columns differ, and finally, B4 if both the number of rows and columns differ.

    In this case, the errors injected in the input matrices A and B affect several blocks. Therefore, a proper DC computation requires
    verifying if previous blocks have already counted the detected errors. To do this, we propose distinguishing between errors detected from the fault injection in A (DetA) and B (DetB) matrices
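The B1–B4 taxonomy in this note can be sketched as a small classifier. The function name and argument order are assumptions; the four cases follow the note's definitions (B1: both dimensions match the launched block size; B2: equal columns, different rows; B3: equal rows, different columns; B4: both differ).

```python
# Classify a matrix block against the block size launched to the GPU,
# following the B1-B4 taxonomy described in the notes.
def classify_block(rows, cols, launch_rows, launch_cols):
    if rows == launch_rows and cols == launch_cols:
        return "B1"  # dimensions match the launched block size
    if cols == launch_cols:
        return "B2"  # equal number of columns, different rows
    if rows == launch_rows:
        return "B3"  # rows match, columns differ
    return "B4"      # both rows and columns differ

# With 32x32 launched blocks, the four edge cases of a tiled matrix:
kinds = [classify_block(r, c, 32, 32)
         for r, c in [(32, 32), (16, 32), (32, 16), (16, 16)]]
```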
  • FP: the average number of new objects that appear (False Positives). FN: undetected objects (False Negatives).

    Note that L11 errors do not produce as many FNs and FPs as the rest of the final layers since the concatenation with L5 and the absence of errors on
    the other branch (L9 and L10) mitigate their appearance.
  • The relative performance impact is quite insensitive to layer dimensions.

    This increase is associated with the high optimization of the MMM on GPUs. Including new data (the array of ESs) in the computation exacerbates one of the main problems associated with GPU platforms: the bottleneck created by data access. This bottleneck is the main reason for the high performance impact of the CRC implementation, since this diagnostic is based on memory access. Moreover, the Fletcher diagnostic has a performance similar to CRC; however, a key reason for its timing penalty lies in the modulo operator, which is highly inefficient in GPU implementations.
  • After defining classification features such as the confidence level, error margin, and the total number of possible errors in the weights, we compute a statistically representative random sampling size.
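One standard way to compute such a sample size is Cochran's formula with a finite-population correction; the paper's exact method is not stated in these notes, so this is a hypothetical sketch, and the 95%/5% parameters below are illustration values.

```python
import math

# Hypothetical sample-size computation (Cochran's formula with a
# finite-population correction) for a statistically representative random
# sampling over the population of possible errors in the weights.
def sample_size(z, margin, population, p=0.5):
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)        # finite-population correction
    return math.ceil(n)

# 95% confidence (z = 1.96), 5% error margin, 9.06e9 possible bit errors
n = sample_size(1.96, 0.05, int(9.06e9))
```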
  • Diagnostic Test Interval: defined at design time, it is the interval between online tests, with a specified diagnostic coverage, to detect faults in a safety-related system.
