1. L07-1
6.5930/1
Hardware Architectures for Deep Learning
Vectorized Kernel Computation
Joel Emer and Vivienne Sze
Massachusetts Institute of Technology
Electrical Engineering & Computer Science
February 27, 2023
2. L07-2
Sze and Emer
Goals of Today’s Lecture
• Understand parallelism and improved efficiency through:
– loop unrolling, and
– vectorization
February 27, 2023
3. L07-3
Background Reading
• Vector architectures
– Computer Architecture: A Quantitative Approach,
6th edition, by Hennessy and Patterson
• Ch 4: p282-310, App G
• Ch 4: p262-288, App G
These books and their online/e-book versions are available through
MIT libraries.
4. L07-4
Fully Connected Computation
[Figure: an input fmap of size C × H × W is combined with M filters, each of size C × H × W, to produce output fmaps with M channels of size 1 × 1]
6. L07-6
Fully-Connected (FC) Layer
[Figure: Filters (M × CHW matrix) × Input fmaps (CHW × 1 vector) = Output fmaps (M × 1 vector)]
• Matrix–Vector Multiply:
• Multiply all inputs in all channels by a weight and sum
7. L07-7
Filter Memory Layout
F[M0 C0 H0 W0] F[M0 C0 H0 W1] …
F[M0 C0 H1 W0] F[M0 C0 H1 W1] …
F[M0 C0 H2 W0] F[M0 C0 H2 W1] …
.
.
F[M0 C1 H0 W0] F[M0 C1 H0 W1] …
F[M0 C1 H1 W0] F[M0 C1 H1 W1] …
F[M0 C1 H2 W0] F[M0 C1 H2 W1] …
.
.
.
F[M1 C0 H0 W0] F[M1 C0 H0 W1] …
F[M1 C0 H1 W0] F[M1 C0 H1 W1] …
F[M1 C0 H2 W0] F[M1 C0 H2 W1] …
.
.
.
[Figure annotations on the layout of the M × C × H × W filter tensor: each entry is a filter weight; each new line is the next row; each new group of lines is the next input channel; each new block of groups is the next output channel]
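The layout above can be summarized as a single offset calculation. The following is a minimal sketch, assuming the weights are stored with W varying fastest, then H, then C, then M, as the listing suggests; the function name `filter_offset` is hypothetical and not taken from the lecture.

```python
def filter_offset(m, c, h, w, C, H, W):
    """Offset of F[m][c][h][w] in a flat array.

    Assumes the layout shown on the slide: w varies fastest,
    then h (next row), then c (next input channel),
    then m (next output channel).
    """
    return ((m * C + c) * H + h) * W + w


# Example with C=2, H=3, W=4: one filter occupies C*H*W = 24 entries.
next_weight = filter_offset(0, 0, 0, 1, C=2, H=3, W=4)          # neighbour in the same row
next_row = filter_offset(0, 0, 1, 0, C=2, H=3, W=4)             # skips W entries
next_input_channel = filter_offset(0, 1, 0, 0, C=2, H=3, W=4)   # skips H*W entries
next_output_channel = filter_offset(1, 0, 0, 0, C=2, H=3, W=4)  # skips C*H*W entries
```

The same formula without the `m` term would give the offset into the flattened input activations.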
8. L07-8
Flattened FC Loops
int i[C*H*W];    # Input activations (flattened tensor)
int f[M*C*H*W];  # Filter Weights (flattened tensor)
int o[M];        # Output activations
CHWm = -C*H*W
for m in [0, M):
    o[m] = 0
    CHWm += C*H*W    # Offset to start of current output filter
                     # (loop invariant hoisted and strength reduced)
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]
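For readers who want to run the loop nest, here is a minimal Python sketch of the same flattened computation. The function name `fc_flattened` and the use of plain Python lists are assumptions made for illustration; `CHW` stands for the product C*H*W.

```python
def fc_flattened(i, f, M, CHW):
    """Fully connected layer over flattened tensors.

    i: input activations, length CHW
    f: filter weights, length M*CHW, one filter after another
    Returns the M output activations.
    """
    o = [0] * M
    chw_m = -CHW                 # offset to the start of the current filter
    for m in range(M):
        chw_m += CHW             # strength reduced: add instead of m*CHW
        for chw in range(CHW):
            o[m] += i[chw] * f[chw_m + chw]
    return o


# Two filters over three inputs: all-ones weights and all-twos weights.
example = fc_flattened([1, 2, 3], [1, 1, 1, 2, 2, 2], M=2, CHW=3)
```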
29. L07-29
Flattened FC Loops
int i[C*H*W];    # Input activations
int f[M*C*H*W];  # Filter Weights
int o[M];        # Output activations
CHWm = -C*H*W
for m in [0, M):
    o[m] = 0
    CHWm += C*H*W
    for chw in [0, C*H*W):
        o[m] += i[chw] * f[CHWm + chw]    # Most of the time is spent here!
40. L07-40
Loop Unrolling (2chw)
int i[C*H*W];    # Input activations
int f[M*C*H*W];  # Filter Weights
int o[M];        # Output activations
CHWm = -C*H*W
for m in [0, M):
    CHWm += C*H*W
    for chw in [0, C*H*W, 2):    # Step by 2: loop overhead amortized over more computation
        o[m] += (i[chw] * f[CHWm + chw])             # Operands accessed in pairs
              + (i[chw + 1] * f[CHWm + chw + 1])     # Operands accessed in pairs
Index calculation amortized since i[chw+1] => &(i+1)[chw]
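A runnable sketch of the unrolled loop is given below. It is an illustration under assumptions, not the lecture's code: the name `fc_unrolled2` is hypothetical, and the tail handling for an odd C*H*W is an addition, since the slide steps by 2 without addressing that case.

```python
def fc_unrolled2(i, f, M, CHW):
    """FC layer with the inner (chw) loop unrolled by a factor of 2."""
    o = [0] * M
    even_end = CHW - (CHW % 2)       # last index covered by full pairs
    for m in range(M):
        base = m * CHW               # start of filter m
        acc = 0
        for chw in range(0, even_end, 2):
            # Two multiply-accumulates per loop iteration.
            acc += (i[chw] * f[base + chw]) + (i[chw + 1] * f[base + chw + 1])
        if even_end < CHW:
            # Assumed tail step for odd CHW (not on the slide).
            acc += i[even_end] * f[base + even_end]
        o[m] = acc
    return o


odd_case = fc_unrolled2([1, 2, 3], [1, 1, 1, 2, 2, 2], M=2, CHW=3)
even_case = fc_unrolled2([1, 2, 3, 4], [1, 0, 0, 1, 0, 1, 1, 0], M=2, CHW=4)
```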
70. L07-70
Fully-Connected (FC) Layer
[Figure: Filters (M × CHW) × Input fmaps (CHW × 1) = Output fmaps (M × 1), annotated with the operand pair chw = C*H*W-2, C*H*W-1]
73. L07-73
Fully-Connected (FC) Layer
[Figure: Filters (M × CHW) × Input fmaps (CHW × 1) = Output fmaps (M × 1), annotated with the operand pair chw = C*H*W-2, C*H*W-1]
Can we incorporate this “pairing” into the architecture?
74. L07-74
Vector Programming Model
Scalar Registers: r0 … r15
Vector Registers: v0 … v15, each holding elements [0] [1] [2] … [VLRMAX-1]
Vector Length Register: VLR
Vector Arithmetic Instructions: ADDV v3, v1, v2
[Figure: v3 = v1 + v2, with one addition per element for elements [0] [1] … [VLR-1]]
VLRMAX – number of elements in a vector register
VLR – number of elements to use in an instruction
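The role of VLR can be illustrated with a small software model of ADDV. This is only a sketch: the value `VLRMAX = 8`, the function name `addv`, and the choice to leave elements beyond VLR unchanged are all assumptions made for the example, not properties stated in the lecture.

```python
VLRMAX = 8  # assumed number of elements per vector register


def addv(v1, v2, v3, vlr):
    """Model of ADDV v3, v1, v2 operating on the first vlr elements.

    Elements at positions >= vlr are left untouched in v3
    (an assumption of this sketch).
    """
    if not 0 <= vlr <= VLRMAX:
        raise ValueError("VLR must be between 0 and VLRMAX")
    for e in range(vlr):
        v3[e] = v1[e] + v2[e]
    return v3


v1 = [1] * VLRMAX
v2 = [2] * VLRMAX
v3 = addv(v1, v2, [0] * VLRMAX, vlr=3)   # only elements 0..2 are computed
```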
75. L07-75
Vector Programming Model
Scalar Registers: r0 … r15
Vector Registers: v0 … v15, each holding elements [0] [1] [2] … [VLRMAX-1]
Vector Length Register: VLR
Vector Load and Store Instructions: LDV v1, r1, r2
[Figure: LDV fills vector register v1 from memory, starting at the base address in r1 and advancing by the stride in r2 for each element]
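A strided load and store can be modelled in a few lines. The sketch below treats memory as a Python list and addresses as list indices; the function names `ldv` and `stv` mirror the instruction mnemonics, but their Python signatures are hypothetical.

```python
def ldv(memory, base, stride, vlr):
    """Model of a vector load: element e comes from memory[base + e*stride]."""
    return [memory[base + e * stride] for e in range(vlr)]


def stv(memory, vreg, base, stride, vlr):
    """Model of a vector store: element e goes to memory[base + e*stride]."""
    for e in range(vlr):
        memory[base + e * stride] = vreg[e]


memory = list(range(100, 120))
unit_stride = ldv(memory, base=2, stride=1, vlr=4)   # contiguous elements
strided = ldv(memory, base=2, stride=3, vlr=4)       # every third element
broadcast = ldv(memory, base=5, stride=0, vlr=3)     # stride 0 repeats one element
stv(memory, [0, 0], base=0, stride=10, vlr=2)        # writes memory[0] and memory[10]
```

A stride of 0 replicates a single memory element across the register, which is one way to broadcast a scalar operand.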
76. L07-76
Compiler-based Vectorization
for i in [0:N):
    Z[i] = A[i] + B[i];
[Figure: scalar code versus vector code over time. The scalar code issues Ld Ai, Ld Bi, Add, St Zi, then Ld Ai+1, Ld Bi+1, Add, St Zi+1, …; the vector code covers the same operations.]
Compiler recognizes independent operations with loop dependence analysis
77. L07-77
Loop Unrolled
int i[C*H*W];    # Input activations
int f[M*C*H*W];  # Filter Weights
int o[M];        # Output activations
for m in [0, M):
    CHWm = C*H*W*m
    for chw in [0, C*H*W, 2):
        o[m] += (i[chw] * f[CHWm + chw])             # Sequential loads
              + (i[chw + 1] * f[CHWm + chw + 1])     # Sequential loads
Will this vectorize?
Reduction crosses elements
79. L07-79
Fully Connected – Loop Permutation
int i[C*H*W];    # Input activations
int f[M*C*H*W];  # Filter Weights
int o[M];        # Output activations
for m in [0, M):
    for chw in [0, C*H*W, 2):
        o[m] += i[chw] * f[CHW*m + chw]
        o[m] += i[chw + 1] * f[CHW*m + chw + 1]
No output is dependent on another output (other than commutative order)
81. L07-81
Fully Connected – Loop Permutation
int i[C*H*W];    # Input activations
int f[M*C*H*W];  # Filter Weights
int o[M];        # Output activations
// Original loop order
for m in [0, M):
    for chw in [0, C*H*W, 2):
        o[m] += i[chw] * f[CHW*m + chw]
        o[m] += i[chw + 1] * f[CHW*m + chw + 1]
// Loops permuted
for chw in [0, C*H*W, 2):
    for m in [0, M):
        o[m] += i[chw] * f[CHW*m + chw]
        o[m] += i[chw + 1] * f[CHW*m + chw + 1]
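That the two loop orders compute the same outputs can be checked directly. The sketch below uses hypothetical function names and drops the unroll-by-2 to keep the comparison short; in both orders each o[m] receives its contributions in increasing chw order, so the results agree exactly.

```python
def fc_m_outer(i, f, M, CHW):
    """Original order: outputs in the outer loop, inputs in the inner loop."""
    o = [0] * M
    for m in range(M):
        for chw in range(CHW):
            o[m] += i[chw] * f[CHW * m + chw]
    return o


def fc_chw_outer(i, f, M, CHW):
    """Permuted order: inputs in the outer loop, outputs in the inner loop."""
    o = [0] * M
    for chw in range(CHW):
        for m in range(M):
            o[m] += i[chw] * f[CHW * m + chw]
    return o


inputs = [1, 2, 3, 4]
weights = [1, 0, 2, 0, 0, 3, 0, 1, 5, 5, 5, 5]   # M=3 filters of CHW=4 weights
same = fc_m_outer(inputs, weights, 3, 4) == fc_chw_outer(inputs, weights, 3, 4)
```

With m innermost, the iterations of the inner loop write different outputs, which is what makes them independent.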
82. L07-82
FC – Permuted/Unrolled
// Loops permuted
for chw in [0, C*H*W):
    for m in [0, M):
        o[m] += i[chw] * f[CHW*m + chw]
// Unrolled inner loop
for chw in [0, C*H*W):
    for m in [0, M, 2):
        o[m] += i[chw] * f[CHW*m + chw]            # Unrolled calculation
        o[m+1] += i[chw] * f[CHW*(m+1) + chw]      # Unrolled calculation
84. L07-84
FC – Permuted/Unrolled/Hoisted
// Unrolled inner loop
for chw in [0, C*H*W):
    for m in [0, M, 2):
        o[m] += i[chw] * f[CHW*m + chw]
        o[m+1] += i[chw] * f[CHW*(m+1) + chw]
// Loop invariant hoisting of i[chw]
for chw in [0, C*H*W):
    i_chw = i[chw]    # Same for all calculations: load hoisted out of loop
    for m in [0, M, 2):
        o[m] += i_chw * f[CHW*m + chw]
        o[m+1] += i_chw * f[CHW*(m+1) + chw]
85. L07-85
Fully Connected Computation
Filter weights:
F[M0 C0 H0 W0] F[M0 C0 H0 W1] …
F[M0 C0 H1 W0] F[M0 C0 H1 W1] …
F[M0 C0 H2 W0] F[M0 C0 H2 W1] …
.
.
F[M0 C1 H0 W0] F[M0 C1 H0 W1] …
F[M0 C1 H1 W0] F[M0 C1 H1 W1] …
F[M0 C1 H2 W0] F[M0 C1 H2 W1] …
.
.
.
F[M1 C0 H0 W0] F[M1 C0 H0 W1] …
F[M1 C0 H1 W0] F[M1 C0 H1 W1] …
F[M1 C0 H2 W0] F[M1 C0 H2 W1] …
.
.
.
Input activations:
I[C0 H0 W0] I[C0 H0 W1] …
I[C0 H1 W0] I[C0 H1 W1] …
I[C0 H2 W0] I[C0 H2 W1] …
.
.
I[C1 H0 W0] I[C1 H0 W1] …
I[C1 H1 W0] I[C1 H1 W1] …
I[C1 H2 W0] I[C1 H2 W1] …
.
.
.
// Loop invariant hoisting of i[chw]
for chw in [0, C*H*W):
    i_chw = i[chw]
    for m in [0, M, 2):
        o[m] += i_chw * f[CHW*m + chw]
        o[m+1] += i_chw * f[CHW*(m+1) + chw]
Weights needed together are far apart.
What can we do?
86. L07-86
FC – Layered Loops
// Unrolled inner loop
for chw in [0, C*H*W):
    i_chw = i[chw]
    for m in [0, M, 2):
        o[m] += i_chw * f[CHW*m + chw]
        o[m+1] += i_chw * f[CHW*(m+1) + chw]
// Level 2 loops
for chw in [0, C*H*W):
    i_chw = i[chw]
    for m1 in [0, M/VL):
        // Level 1 loops
        parallel_for m0 in [0, VL):
            o[m1*VL+m0] += i_chw * f[CHW*(m1*VL+m0) + chw]
Level 0 is a set of vector operations
Limit of m1 (M/VL) times limit of m0 (VL) is M
m = m1*VL+m0
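The split of m into m1 and m0 can be tried out with the sketch below. It assumes M is a multiple of VL, which the slide's M/VL loop bound implies; the function name `fc_layered` is hypothetical, and the `parallel_for` is modelled as an ordinary loop whose iterations are independent.

```python
def fc_layered(i, f, M, CHW, VL):
    """FC layer with the output loop split into m1 (blocks) and m0 (lanes)."""
    if M % VL != 0:
        raise ValueError("this sketch assumes M is a multiple of VL")
    o = [0] * M
    for chw in range(CHW):
        i_chw = i[chw]                    # hoisted: same for every output
        for m1 in range(M // VL):
            # Level 1: these VL iterations touch different outputs,
            # so they could be executed as one vector operation.
            for m0 in range(VL):
                m = m1 * VL + m0
                o[m] += i_chw * f[CHW * m + chw]
    return o


layered = fc_layered([1, 2], [1, 1, 2, 2, 3, 3, 4, 4], M=4, CHW=2, VL=2)
```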
98. L07-98
Fully Connected – Vectorized
        mv   r1, 0             # r1 holds chw
        mv   r4, 0             # r4 holds CHWVLm1_chw
xloop:  ldv  v1, i(r1), 0      # fill v1 with i[chw]
        mv   r2, 0             # r2 holds m1VL
mloop:  ldv  v3, f(r4), CHW    # v3 holds f[]
        ldv  v5, o(r2), 1      # v5 holds o[]
        macv v5, v1, v3        # multiply f[] * i[] and accumulate into o[]
        stv  v5, o(r2), 1      # store o
        add  r2, r2, VL        # update m1VL
        add  r4, r4, CHWVL     # update CHWVLm1_chw
        blt  r2, M, mloop
        add  r1, r1, 1         # update chw
        mv   r4, r1            # reset CHWVLm1_chw to chw (m1 = 0)
        blt  r1, CHW, xloop
Strength reduced: CHWVLm1_chw is updated by addition instead of being recomputed with a multiplication
How many MACs/cycle (ignoring stalls)?
Can we unroll this to get even more?
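To check the address arithmetic, the register-level loop can be mirrored in Python. This is a sketch under assumptions: memory is modelled as Python lists, `ldv` is a hypothetical helper that gathers VL elements from a base offset with a given stride, M is assumed to be a multiple of VL, and the variables r1, r2 and r4 follow the register roles named in the comments above.

```python
def ldv(mem, base, stride, vl):
    """Gather vl elements starting at mem[base], stepping by stride."""
    return [mem[base + e * stride] for e in range(vl)]


def fc_vectorized(i, f, M, CHW, VL):
    if M % VL != 0:
        raise ValueError("this sketch assumes M is a multiple of VL")
    o = [0] * M
    r1 = 0                                   # chw
    while r1 < CHW:                          # xloop
        v1 = ldv(i, r1, 0, VL)               # stride 0: broadcast i[chw]
        r2 = 0                               # m1*VL
        r4 = r1                              # CHW*VL*m1 + chw, with m1 = 0
        while r2 < M:                        # mloop
            v3 = ldv(f, r4, CHW, VL)         # weights CHW apart in memory
            v5 = ldv(o, r2, 1, VL)           # unit-stride load of outputs
            v5 = [acc + a * b for acc, a, b in zip(v5, v1, v3)]   # macv
            o[r2:r2 + VL] = v5               # unit-stride store of outputs
            r2 += VL
            r4 += CHW * VL                   # strength reduced update
        r1 += 1
    return o


vectorized = fc_vectorized([1, 2], [1, 1, 2, 2, 3, 3, 4, 4], M=4, CHW=2, VL=2)
```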
100. L07-100
FC – Layered Loops
// Level 2 loops
for chw in [0, C*H*W):
    for m1 in [0, M/VL):
        // Level 1 loops
        parallel_for m0 in [0, VL):
            o[m1*VL+m0] += i[chw] * f[VL*CHW*m1+CHW*m0+chw]
// Level 2 loops (permuted)
for m1 in [0, M/VL):
    for chw in [0, C*H*W):
        // Level 1 loops
        parallel_for m0 in [0, VL):
            o[m1*VL+m0] += i[chw] * f[VL*CHW*m1+CHW*m0+chw]
No constraints on loop permutations!
Loop order affects where loop invariants can be moved
101. L07-101
Vector ISA Attributes
• Compact
– one short instruction encodes N operations
– many implicit bookkeeping/control operations
• Expressive, hardware knows the N operations:
– are independent
– use the same functional unit
– access disjoint registers
– access registers in same pattern as previous instructions
– access a contiguous block of memory (unit-stride load/store)
– access memory in a known pattern (strided load/store)
Vector instructions make “explicit” many things that are “implicit” with standard instructions
102. L07-102
Vector ISA Hardware Implications
• Large amount of work per instruction
-> Less instruction fetch bandwidth requirements
-> Allows simplified instruction fetch design
• Implicit bookkeeping operations
-> Bookkeeping can run in parallel with main compute
• Disjoint vector element accesses
-> Banked rather than multi-ported register files
• No data dependence within a vector
-> Amenable to deeply pipelined/parallel designs
• Known regular memory access pattern
-> Allows for banked memory for higher bandwidth
103. L07-103
Vector Arithmetic Execution
• Use deep pipeline (=> fast clock) to execute element operations
• Simplifies control of deep pipeline because elements in vector are independent (=> no hazards!)
[Figure: six stage multiply pipeline computing V3 <- V1 * V2]
104. L07-104
Vector Instruction Execution
ADDV C,A,B
[Figure: execution using one pipelined functional unit versus four pipelined functional units. The single unit is shown with operand pairs A[3] B[3] … A[6] B[6] and results C[0], C[1], C[2]. The four units are shown with operand pairs A[12] B[12] … A[27] B[27] and results C[0] … C[11], with consecutive elements assigned to different units: the first unit handles elements 0, 4, 8, 12, …, the second 1, 5, 9, 13, …, the third 2, 6, 10, 14, …, and the fourth 3, 7, 11, 15, ….]
105. L07-105
Vector Unit Structure
[Figure: four lanes connecting the vector registers to the memory subsystem. Each lane contains function units (adder, multiplier) and the part of the vector registers holding its elements:]
– Elements 0, 4, 8, …
– Elements 1, 5, 9, …
– Elements 2, 6, 10, …
– Elements 3, 7, 11, …
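The element-to-lane assignment shown in the figure is a simple interleaving. The sketch below reproduces it; the function names are hypothetical, and the lane numbering starting at 0 is an assumption made for the example.

```python
def lane_of(element, num_lanes):
    """Lane that holds a given vector element under interleaved assignment."""
    return element % num_lanes


def elements_in_lane(lane, num_lanes, vector_length):
    """All element indices of one vector register that live in a given lane."""
    return list(range(lane, vector_length, num_lanes))


lanes = [elements_in_lane(lane, num_lanes=4, vector_length=12) for lane in range(4)]
```

Because each lane only ever touches its own elements, the register file can be split into per-lane banks rather than built as one multi-ported structure.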
106. L07-106
Vector Instruction Parallelism
Can overlap execution of multiple vector instructions
– example machine has 32 elements per vector register and 8 lanes
[Figure: instruction issue over time across the Load Unit, Multiply Unit and Add Unit, issuing load, mul, add, load, mul, add with overlapping execution]
Complete 24 operations/cycle while issuing 1 short instruction/cycle
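The 24 operations/cycle figure follows from the example machine's parameters. The short calculation below is a sketch that assumes all three units (load, multiply, add) are busy at once and that each unit processes one element per lane per cycle; the function names are hypothetical.

```python
def peak_ops_per_cycle(num_lanes, num_units):
    """Operations completed per cycle when every unit is busy in every lane."""
    return num_lanes * num_units


def cycles_per_vector_instruction(elements_per_register, num_lanes):
    """Cycles one unit is occupied by a full-length vector instruction."""
    return -(-elements_per_register // num_lanes)   # ceiling division


ops = peak_ops_per_cycle(num_lanes=8, num_units=3)                            # 8 * 3
occupancy = cycles_per_vector_instruction(elements_per_register=32, num_lanes=8)  # 32 / 8
```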
107. L07-107
ISA Datatypes
Type    Bit fields (width)        Range                 Accuracy
FP32    S (1), E (8), M (23)      10^-38 – 10^38        .000006%
FP16    S (1), E (5), M (10)      6x10^-5 – 6x10^4      .05%
Int32   S (1), M (31)             0 – 2x10^9            ½
Int16   S (1), M (15)             0 – 6x10^4            ½
Int8    S (1), M (7)              0 – 127               ½
S = sign, E = exponent, M = mantissa
Image Source: B. Dally
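The floating-point rows of the table can be related to the field widths with a short calculation. This is a sketch assuming IEEE-754-style formats with an implicit leading one and round-to-nearest, so the worst-case relative rounding error is half a unit in the last place; the function names are hypothetical.

```python
def relative_accuracy_percent(mantissa_bits):
    """Worst-case relative rounding error, in percent, for normal numbers."""
    return 2.0 ** -(mantissa_bits + 1) * 100.0


def normal_range(exponent_bits, mantissa_bits):
    """Smallest and largest positive normal values for the given field widths."""
    bias = 2 ** (exponent_bits - 1) - 1
    smallest = 2.0 ** (1 - bias)
    largest = (2.0 - 2.0 ** -mantissa_bits) * 2.0 ** bias
    return smallest, largest


fp32_accuracy = relative_accuracy_percent(23)   # about 0.000006 %
fp16_accuracy = relative_accuracy_percent(10)   # about 0.05 %
fp16_smallest, fp16_largest = normal_range(5, 10)
fp32_smallest, fp32_largest = normal_range(8, 23)
```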