Gradient-Based Learning Applied to Document Recognition
Y. LeCun, L. Bottou, Y. Bengio and P. Haffner
Proceedings of the IEEE, 86(11):2278–2324, November 1998
01
LeNet
Speaker: Chia-Jung Ni
• History of Representative CNN models
• Three key ideas for CNN
• Local Receptive Fields
• Shared Weights
• Sub-sampling
• Model Architecture
• Implementation
• Keras
02
Outline
Slide: https://drive.google.com/file/d/12YWNNbqB-_JHl0CrNEl6loINBJoGHgE3/view?usp=sharing
Code: https://drive.google.com/file/d/1wDcDgoF8VSj29ab-cXsN82Q1pxdBiaUx/view?usp=sharing
03
History of Representative CNN models
• 1980s: CNN proposed
• 1998: LeNet (the first to use back-propagation to update model parameters)
• 2012: AlexNet (the first to use GPUs to accelerate computation)
• 2015: VGGNet
• 2015: GoogLeNet
• 2016: ResNet
• 2017: DenseNet
• Why local connectivity? (what)
• Spatial correlation is local
• Reduce # of parameters
04
Three key ideas : Local Receptive Fields (1/3)
Example. WLOG
- 1000x1000 image
- 3x3 filter (kernel)
Fully connected: 10^6 + 1 params per hidden unit
Locally connected (3x3): 3^2 + 1 params per hidden unit
• Why weight sharing? (where)
• Statistics are similar at different locations (stationarity)
• Reduce # of parameters
05
Three key ideas : Shared Weights (2/3)
Example. WLOG
- # input units (neurons) = 7
- # hidden units = 3
Without sharing: 3 * 3 + 3 = 12 params. With sharing: 3 * 1 + 3 = 6 params.
• Why Sub-sampling? (Size)
• Sub-sampling the pixels does not change the object's identity
• Reduce memory consumption
06
Three key ideas : Sub-sampling (3/3)
Input (4x4):
1 2 2 0
1 2 3 2
3 1 3 2
0 2 0 2

Max-pooling (2x2, stride 2):
2 3
3 3

Avg-pooling (2x2, stride 2):
1.5 1.75
1.5 1.75
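As a quick check, a short NumPy sketch (my own illustration, not from the slides) that reproduces the max- and average-pooling outputs above:

```python
import numpy as np

x = np.array([[1, 2, 2, 0],
              [1, 2, 3, 2],
              [3, 1, 3, 2],
              [0, 2, 0, 2]], dtype=float)

def pool2x2(a, reduce_fn):
    """Apply a 2x2 pooling window with stride 2."""
    h, w = a.shape
    out = np.empty((h // 2, w // 2))
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            out[i // 2, j // 2] = reduce_fn(a[i:i + 2, j:j + 2])
    return out

print(pool2x2(x, np.max))   # [[2. 3.] [3. 3.]]
print(pool2x2(x, np.mean))  # [[1.5 1.75] [1.5 1.75]]
```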
• Architecture of LeNet-5
• Two sets of convolutional and average pooling layers
• Followed by a flattening convolutional layer
• Then two fully-connected layers and finally a softmax classifier
07
Model Architecture
• This plays the role of the activation function
• The feature maps of the first six layers (C1, S2, C3, S4, C5, F6) are passed
through this nonlinear, scaled hyperbolic tangent function
08
Model Architecture – Squashing Function
f(a) = A·tanh(S·a), where A = 1.7159 and S = 2/3.
With this choice of parameters, the equalities f(1) = 1 and f(−1) = −1 are satisfied.
(Figure: plots of f(a), f′(a), and f′′(a).)
Some details
- Symmetric functions yield faster convergence, although learning can become very slow if the weights are too large or too small.
- The absolute value of the second derivative of f(a) is maximal at +1 and −1, which also improves convergence toward the end of the learning session.
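A minimal NumPy sketch (my own, for illustration) of the scaled squashing function, confirming f(±1) = ±1 with A = 1.7159 and S = 2/3:

```python
import numpy as np

A = 1.7159
S = 2.0 / 3.0

def squash(a):
    """LeNet-5 squashing function f(a) = A * tanh(S * a)."""
    return A * np.tanh(S * a)

print(squash(1.0))   # ~1.0
print(squash(-1.0))  # ~-1.0
```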
09
Model Architecture – 1st layer (1/7)
• Trainable params
= (weight * input map channel + bias) * output map channel
= (5*5*1 + 1) * 6 = 156
• Connections
= (weight * input map channel + bias) * output map channel * output map size
= (5*5 *1 + 1) * 6 * (28*28) = 122,304
W_l = int((W_{l-1} - F + 2P) / S) + 1
H_l = int((H_{l-1} - F + 2P) / S) + 1
Convolution layer 1 (C1)
with 6 feature maps or filters
having size 5×5, a stride of one,
and ‘same’ padding!
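A small Python check (my own sketch) of the output-size formula and C1's counts, assuming the paper's 32×32 input with F = 5, S = 1, P = 0, which gives the 28×28 maps used above:

```python
def out_size(w_prev, f, s=1, p=0):
    """W_l = int((W_{l-1} - F + 2P) / S) + 1"""
    return (w_prev - f + 2 * p) // s + 1

print(out_size(32, 5))                      # 28
print((5 * 5 * 1 + 1) * 6)                  # 156    trainable params of C1
print((5 * 5 * 1 + 1) * 6 * (28 * 28))      # 122304 connections of C1
```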
10
Model Architecture – 2nd layer (2/7)
• Trainable params
= (weight + bias) * output map channel
= (1 + 1) * 6 = 12
• Connections
= (kernel size + bias) * output map channel * output map size
= (2*2 + 1) * 6 * (14*14) = 5,880
Subsampling layer 2 (S2)
with a filter size 2×2, a
stride of two, and ‘valid’
padding!
W_l = int((W_{l-1} - F + 2P) / S) + 1
H_l = int((H_{l-1} - F + 2P) / S) + 1
11
Model Architecture – 3rd layer (3/7)
• Trainable params
= ∑group [ (weight * input map channel + bias) * output map channel ]
= (5*5*3 + 1) * 6 + (5*5*4 + 1) * 6 + (5*5*4 + 1) * 3 + (5*5*6 + 1) * 1 = 456 + 606 + 303 +151 = 1,516
• Connections
= (weight * input map channel + bias) * output map channel * output map size
= [(5*5*3 + 1) * 6 + (5*5*4 + 1) * 6 + (5*5*4 + 1) * 3 + (5*5*6 + 1) * 1] * (10*10) = 151,600
Convolution layer 3 (C3)
with 16 feature maps having
size 5×5 and a stride of one,
and ‘valid’ padding!
To keep the number of connections within reasonable bounds (and to break symmetry), the C3 feature maps are not connected to every S2 map; a parameter-count check for this grouping follows the size formula below.
• The first 6 feature maps are connected to 3 contiguous input maps
• The next 6 feature maps are connected to 4 contiguous input maps
• The next 3 feature maps are connected to 4 non-contiguous input maps
• The last feature map is connected to all 6 input maps
W_l = int((W_{l-1} - F + 2P) / S) + 1
H_l = int((H_{l-1} - F + 2P) / S) + 1
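A short Python sketch (my own) recomputing C3's trainable parameters and connections from the grouping above:

```python
# Each tuple: (number of S2 maps feeding the group, number of C3 maps in the group)
groups = [(3, 6), (4, 6), (4, 3), (6, 1)]

params_c3 = sum((5 * 5 * n_in + 1) * n_out for n_in, n_out in groups)
print(params_c3)              # 1516
print(params_c3 * 10 * 10)    # 151600 connections
```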
12
Model Architecture – 4th layer (4/7)
• Trainable params
= (weight + bias) * output map channel
= (1 + 1) * 16 = 32
• Connections
= (kernel size + bias) * output map channel * output map size
= (2*2 + 1) * 16 * (5*5) = 2,000
Subsampling layer 4 (S4)
with a filter size 2×2, a
stride of two, and ‘valid’
padding!
W_l = int((W_{l-1} - F + 2P) / S) + 1
H_l = int((H_{l-1} - F + 2P) / S) + 1
13
Model Architecture – 5th layer (5/7)
• Trainable params
= (weight * input map channel + bias) * output map channel
= (5*5*16 + 1) * 120 = 48,120
• Connections
= (weight * input map channel + bias) * output map channel * output map size
= (5*5*16 + 1) * 120 * (1*1) = 48,120
Convolution layer 5 (C5)
with 120 feature maps or
filters having size 5×5, a stride
of one, and ‘valid’ padding!
W_l = int((W_{l-1} - F + 2P) / S) + 1
H_l = int((H_{l-1} - F + 2P) / S) + 1
14
Model Architecture – 6th layer (6/7)
• Trainable params
= (weight + bias) * output map channel
= (120 + 1) * 84 = 10,164
• Connections
= (weight + bias) * output map channel
= (120 + 1) * 84 = 10,164
Fully-connected layer (F6)
with 84 neuron units!
W_l = int((W_{l-1} - F + 2P) / S) + 1
H_l = int((H_{l-1} - F + 2P) / S) + 1
15
Model Architecture – Output layer (7/7)
Output layer
with Euclidean Radial Basis Function (RBF) units
The output of each RBF unit y_i is computed as follows:
y_i = Σ_j (x_j − w_ij)^2
Loss Function
The mean squared error (MSE) criterion is used to measure the discrepancy:
E(W) = (1/P) Σ_{p=1..P} y_{D_p}(Z_p, W),
where y_{D_p} is the output of the D_p-th RBF unit, i.e., the one that corresponds to the correct class of input pattern Z_p.
The output of a particular RBF can be interpreted as a penalty term measuring the fit between the input pattern and a model of the class associated with that RBF. In probabilistic terms, the RBF output can be interpreted as the unnormalized negative log-likelihood of a Gaussian distribution in the space of configurations of layer F6.
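A tiny NumPy sketch (my own illustration; the random values and the chosen class index are arbitrary) of the RBF outputs and one pattern's contribution to the MSE criterion described above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=84)                      # F6 output for one input pattern
W = rng.choice([-1.0, 1.0], size=(10, 84))   # RBF parameter vectors (fixed to +/-1 in the paper)

# y_i = sum_j (x_j - w_ij)^2 : squared Euclidean distance to each class's parameter vector
y = ((x[None, :] - W) ** 2).sum(axis=1)      # shape (10,)

D_p = 3                                      # index of the correct class for this pattern
loss_p = y[D_p]                              # this pattern's contribution to the MSE criterion
print(y.shape, loss_p)
```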
Model Architecture (LeNet-5)
Notation: W, H = feature-map width and height; F = filter (kernel) size; S = stride; P = padding.

Layer         | # Channels | Feature Map Size | Filter (Kernel) Size | Stride | Padding | Activation
Input (Image) | 1          | 32x32            | -                    | -      | -       | -
1 Convolution | 6          | 28x28            | 5x5                  | 1      | 0       | tanh
2 Avg-Pooling | 6          | 14x14            | 2x2                  | 2      | 0       | tanh
3 Convolution | 16         | 10x10            | 5x5                  | 1      | 0       | tanh
4 Avg-Pooling | 16         | 5x5              | 2x2                  | 2      | 0       | tanh
5 Convolution | 120        | 1x1              | 5x5                  | 1      | 0       | tanh
6 FC          | -          | 84               | -                    | -      | -       | tanh
Output FC     | -          | 10               | -                    | -      | -       | RBF

W_l = int((W_{l-1} - F + 2P) / S) + 1
H_l = int((H_{l-1} - F + 2P) / S) + 1
16
Implementation – Download Data Set & Normalize
17
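The slide shows a screenshot of this step; below is a hedged Keras sketch of what such a cell typically looks like (the exact variable names and preprocessing choices are my assumptions, not taken from the slide):

```python
import numpy as np
from tensorflow import keras

# Download MNIST and normalize pixel values to [0, 1]
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

# Add a channel dimension and pad the 28x28 images to the 32x32 input size used by LeNet-5
x_train = np.pad(x_train[..., None], ((0, 0), (2, 2), (2, 2), (0, 0)))
x_test = np.pad(x_test[..., None], ((0, 0), (2, 2), (2, 2), (0, 0)))

# One-hot encode the 10 digit classes
y_train = keras.utils.to_categorical(y_train, 10)
y_test = keras.utils.to_categorical(y_test, 10)
```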
Implementation – Define LeNet-5 Model
18
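Again a hedged sketch rather than the slide's exact code: a minimal Keras definition of LeNet-5 following the table above (average pooling and tanh throughout; a softmax output replaces the original RBF layer, as is common in modern reimplementations):

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_lenet5(input_shape=(32, 32, 1), num_classes=10):
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(6, kernel_size=5, activation="tanh"),    # C1: 6 @ 28x28
        layers.AveragePooling2D(pool_size=2, strides=2),       # S2: 6 @ 14x14
        layers.Conv2D(16, kernel_size=5, activation="tanh"),   # C3: 16 @ 10x10
        layers.AveragePooling2D(pool_size=2, strides=2),       # S4: 16 @ 5x5
        layers.Conv2D(120, kernel_size=5, activation="tanh"),  # C5: 120 @ 1x1
        layers.Flatten(),
        layers.Dense(84, activation="tanh"),                   # F6
        layers.Dense(num_classes, activation="softmax"),       # Output
    ])
    return model

model = build_lenet5()
model.summary()
```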
Implementation – Define LeNet-5 Model & Evaluate
19
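A hedged sketch of the compile/train/evaluate step, continuing from the previous cells (the optimizer, batch size, and epoch count are my assumptions and may differ from the slide):

```python
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

history = model.fit(x_train, y_train,
                    batch_size=128,
                    epochs=10,
                    validation_split=0.1)

test_loss, test_acc = model.evaluate(x_test, y_test, verbose=0)
print(f"Test accuracy: {test_acc:.4f}")
```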
Implementation – Visualize the Training Process
20
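And a matplotlib sketch (my own) for visualizing the training curves stored in history from the previous cell:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Accuracy curves
ax1.plot(history.history["accuracy"], label="train")
ax1.plot(history.history["val_accuracy"], label="validation")
ax1.set_xlabel("epoch"); ax1.set_ylabel("accuracy"); ax1.legend()

# Loss curves
ax2.plot(history.history["loss"], label="train")
ax2.plot(history.history["val_loss"], label="validation")
ax2.set_xlabel("epoch"); ax2.set_ylabel("loss"); ax2.legend()

plt.tight_layout()
plt.show()
```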
21
Thank you for listening.
22
Appendix 1. It is common to zero-pad the border
Example. WLOG
- input 7x7
- 3x3 filter, applied with stride 1
- pad with a 1-pixel border => what is the output? 7x7 (the spatial size is preserved)
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2, which preserves the spatial size.
• F = 3 => zero-pad with 1
• F = 5 => zero-pad with 2
W_l = int((W_{l-1} - F + 2P) / S) + 1
H_l = int((H_{l-1} - F + 2P) / S) + 1
23
Appendix 2. Sub-Sampling v.s. Pooling
• Sub-sampling is simply average-pooling followed by a learnable scale (weight) and bias per feature map; a Keras sketch of this appears below.
• Sub-sampling is therefore a generalization of average-pooling.
Input (4x4):
1 2 2 0
1 2 3 2
3 1 3 2
0 2 0 2

Avg-pooling (2x2, stride 2):
1.5   1.75
1.5   1.75

Sub-sampling (2x2, stride 2):
1.5w + b   1.75w + b
1.5w + b   1.75w + b
, where w and b ∈ R are trainable (one pair per feature map).
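A hedged Keras sketch (my own, not from the slides) of such a sub-sampling layer: average pooling followed by one trainable scale w and bias b per feature map:

```python
import tensorflow as tf
from tensorflow.keras import layers

class SubSampling2D(layers.Layer):
    """Average pooling with a trainable per-channel scale (w) and bias (b)."""

    def __init__(self, pool_size=2, **kwargs):
        super().__init__(**kwargs)
        self.pool = layers.AveragePooling2D(pool_size=pool_size)

    def build(self, input_shape):
        channels = input_shape[-1]
        self.w = self.add_weight(name="w", shape=(channels,), initializer="ones")
        self.b = self.add_weight(name="b", shape=(channels,), initializer="zeros")

    def call(self, x):
        return self.pool(x) * self.w + self.b

# Example: 6 feature maps of size 28x28 -> 6 maps of size 14x14
y = SubSampling2D()(tf.zeros((1, 28, 28, 6)))
print(y.shape)  # (1, 14, 14, 6)
```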
24
Appendix 3. Radial Basis Function (RBF) units
(Diagram: each RBF output y_i, i = 1, …, 10, is connected to all 84 F6 outputs x_j through parameters w_ij, from w_1,1 to w_10,84.)
Note.
1) x_j ∈ R is the output of the F6 layer after the squashing function f(a) = A·tanh(S·a), for all j = 1, …, 84.
2) The components of the parameter vectors {w_ij | i = 1, …, 10; j = 1, …, 84} were chosen by hand and fixed to −1 or +1 (at least initially), forming stylized 7×12 bitmap images of the corresponding character classes.
