© 2020 Samsung
Improving Power Efficiency for
Edge Inferencing Through
Memory Management
Optimizations
Nathan Levy
Samsung
September 2020
© 2020 Samsung
Outline
• Introduction
• Tiling and forwarding
• Filters management
• Case study of different modern convolutional neural networks
• Conclusion
2
© 2020 Samsung
Counting the I/O in the TOps per Watt KPI
• Power consumption is composed of:
• Compute power: perform the arithmetic computations
• I/O power: bring the data to the compute engine and back
• For convolutional neural networks, I/O power can be higher than compute power
• The compiler must optimize the memory management
• Scope: convolutional neural network (CNN) acceleration
• Focus on convolutions
• Assume the network is fixed (and quantized, pruned…)
• Inference only
3
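One way to write this decomposition explicitly (notation ours, not from the slide):

$$ E \approx E_{\text{compute}} + E_{\text{I/O}} = N_{\text{MAC}} \, e_{\text{MAC}} + N_{\text{bytes}} \, e_{\text{byte}} $$

where N_MAC is the number of multiply-accumulates, N_bytes the number of bytes moved between the compute engine and remote memory, and e_MAC, e_byte the corresponding per-operation and per-byte energies. A remote (off-chip) access typically costs far more energy than a single MAC, which is why the I/O term can dominate for CNNs.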
© 2020 Samsung
Description of the working model
What is a convolution?
• Input: feature map and filters
• Apply: multiply and accumulate
• Output: feature map
A CNN can be processed by a CPU, DSP, GPU, or NPU/TPU
• These often have multiple cores
• These have multiple levels of memory (cache and scratch-pad)
4
[Figure: an Xi × Yi input feature map is convolved with filters za and zb; each filter produces an output activation, building up the Xo × Yo output feature map]
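For reference, each output activation in the figure is one multiply-and-accumulate reduction; for a K x K filter z over C_in input channels (notation ours, stride and padding omitted):

$$ O_z[x, y] = \sum_{c=1}^{C_{\text{in}}} \sum_{i=1}^{K} \sum_{j=1}^{K} W_z[i, j, c] \cdot I[x+i, \, y+j, \, c] $$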
© 2020 Samsung
Description of the working model
Simplified toy example:
• Single processing core
• Small local memory with “free”
access and big remote memory
with expensive access
Remote memory contains:
• Input and output feature maps
• Convolution filters
• Intermediate feature maps (if
needed)
5
[Diagram: a processing unit with a small local memory next to a large remote memory; the remote memory holds Feature Map 1 … Feature Map N, the filters, and convolutions Conv 1 … Conv M]
© 2020 Samsung
Typical orders of magnitude for NPU
• The memory management problem
depends on the relative size of:
• The local memory of the processor
• The typical feature map of the
network
• Our working point:
Feature map size ~
10 * Local memory size
6
[Chart: feature map size [B] (10 kB to 100 MB) versus local memory size [B] (10 kB to 10 MB), with IoT, Mobile and Automotive operating regions marked]
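A purely illustrative instance of this working point (all numbers are ours, not from the chart):

```python
# A mid-network feature map versus a mobile-class scratch-pad (illustrative only).
fm_bytes = 512 * 256 * 32 * 1      # H x W x C at 8-bit activations, about 4.2 MB
local_mem_bytes = 400 * 1024       # hypothetical 400 kB local memory
print(fm_bytes / local_mem_bytes)  # about 10: the feature map is ~10x the local memory
```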
© 2020 Samsung
Tiling and forwarding
© 2020 Samsung
Tiling and Forwarding
How to overcome the small memory size:
• Split the input feature map into smaller
parts → tiles
• Apply the convolution on the input feature
map tile
→ Creates an output feature map tile
The output feature map tile can be either:
• Written in remote memory
• Used for the next convolution
→ This is called forwarding
8
[Diagram: for Conv i, an input feature map tile is brought from remote memory into local memory, the processing unit produces an output feature map tile, and the full IFM/OFM are not resident in local memory; remote memory holds Feature Map 1 … Feature Map N, Conv 1 … Conv M and the filters]
© 2020 Samsung
How to tile and forward
How does the compiler tile and forward to minimize power
consumption?
Key optimization parameters:
• Power consumption: bandwidth (I/O) + compute
→ Focus of this talk
• Processing throughput (operations/second)
• Chip area → cost
• System complexity, flexibility and maintenance
9
© 2020 Samsung
Tiling and increasing receptive fields
2 convolutions with 3x4 tiling, with no access to remote memory in between (forwarding)
[Diagram: feature map loaded from remote memory → 3x3 conv → intermediate feature map → 3x3 conv → output feature map stored back to remote memory, processed tile by tile]
For i in [1-12]:
• Load tile i of blue feature map from remote memory
• Run 1st convolution → tile i of green feature map
• Run 2nd convolution → tile i of yellow feature map
• Store tile i of yellow feature map to remote memory
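A minimal, self-contained Python sketch of this loop (single channel, toy conv3x3 helper; margins are deliberately ignored here, and none of this is the talk's actual implementation):

```python
import numpy as np

def conv3x3(x, w):
    """Naive single-channel 3x3 convolution with zero padding (illustration only)."""
    xp = np.pad(x, 1)
    out = np.zeros_like(x, dtype=np.float32)
    for i in range(3):
        for j in range(3):
            out += w[i, j] * xp[i:i + x.shape[0], j:j + x.shape[1]]
    return out

def forward_two_convs(ifm, w1, w2, tile_grid=(3, 4)):
    """Tile the input and run two convolutions per tile without writing the
    intermediate (green) feature map back to 'remote' memory (forwarding).
    Tile margins are ignored, which is exactly the problem the next slides discuss."""
    h, w = ifm.shape
    th, tw = h // tile_grid[0], w // tile_grid[1]
    ofm = np.zeros_like(ifm, dtype=np.float32)               # simulated remote memory
    for r in range(tile_grid[0]):
        for c in range(tile_grid[1]):
            tile = ifm[r*th:(r+1)*th, c*tw:(c+1)*tw].copy()  # load: remote -> local
            mid = conv3x3(tile, w1)                          # stays in local memory
            res = conv3x3(mid, w2)                           # forwarding: no remote access
            ofm[r*th:(r+1)*th, c*tw:(c+1)*tw] = res          # store: local -> remote
    return ofm

# Example: forward_two_convs(np.random.rand(96, 128).astype(np.float32),
#                            np.full((3, 3), 1/9), np.full((3, 3), 1/9))
```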
© 2020 Samsung
Tiling and increasing receptive fields
2 convolutions with 3x4 tiling, with no access to remote memory in between (forwarding)
[Diagram: same tiled two-convolution chain as above]
Problem: a 3x3 convolution requires margins around each tile
© 2020 Samsung
Tiling and increasing receptive fields
2 convolutions with 3x4 tiling, with no access to remote memory in between (forwarding)
[Diagram: the first (blue) tile is enlarged by a 2-pixel margin and the intermediate (green) tile by a 1-pixel margin]
Solution: bigger blue and green tiles
Problems:
• Load more data for the first (blue) layer
• Compute more data for the intermediate (green) layer
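To make the overlap growth explicit (our notation, stride-1 3x3 convolutions): forwarding a tile through k convolutions requires a k-pixel margin on each side of the first-layer tile, so for a t_h × t_w output tile

$$ \text{pixels loaded} = (t_h + 2k)(t_w + 2k) \qquad \text{vs.} \qquad \text{useful output} = t_h \, t_w , $$

and the i-th intermediate layer (i = 1…k) must compute $(t_h + 2(k-i))(t_w + 2(k-i))$ activations instead of $t_h t_w$. These extra loads and this redundant compute are what the next slides quantify.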
© 2020 Samsung
Tiling and increasing receptive fields
Consider a long chain of convolutions. How many stages of forwarding do you want?
• Forwarding tiles through as many layers as possible avoids dumping to remote
memory
• This causes the overlaps to grow
• More data to load for the first layer
• Redundant processing
18
[Diagram: a chain of four convolutions processed tile by tile]
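A back-of-the-envelope way to see this trade-off numerically. The sketch below estimates remote traffic and multiply-accumulates for a chain of stride-1 3x3 convolutions as a function of the forwarding depth; all assumptions (dimensions, equal channel counts everywhere, one byte per activation, filters ignored) are ours. It is not the model behind the tables on the next slides, but it reproduces the same qualitative trend: less bandwidth per convolution, more operations per convolution.

```python
def chain_cost(h, w, c, t, n_convs, forward_depth):
    """Estimate remote-memory traffic (bytes) and MACs for a chain of 3x3
    convolutions over an h x w x c feature map, tiled into t x t output tiles,
    forwarding `forward_depth` convolutions per tile before writing back."""
    n_tiles = (h // t) * (w // t)
    traffic, macs, done = 0, 0, 0
    while done < n_convs:
        d = min(forward_depth, n_convs - done)      # convolutions in this forwarded group
        in_tile = (t + 2 * d) ** 2                  # input tile including its d-pixel margin
        traffic += n_tiles * c * (in_tile + t * t)  # load input tile + store output tile
        for i in range(d):                          # redundant compute on the margins
            out_side = t + 2 * (d - 1 - i)          # output side of the i-th conv in the group
            macs += n_tiles * (out_side ** 2) * 9 * c * c
        done += d
    return traffic, macs

# Compare "no forwarding" (depth 1) with forwarding 4 convolutions at a time:
base = chain_cost(h=96, w=128, c=64, t=32, n_convs=4, forward_depth=1)
fwd = chain_cost(h=96, w=128, c=64, t=32, n_convs=4, forward_depth=4)
print("traffic ratio:", round(fwd[0] / base[0], 2),
      "MACs ratio:", round(fwd[1] / base[1], 2))
```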
© 2020 Samsung
2 convolutions
# of Convolutions | Feature map tiling | Bandwidth | Bandwidth per convolution | Operations per convolution | Energy per convolution
2 | 3x4 | +6% | -47% | +5% | -30%
3 | 3x4 | +12% | -63% | +10% | -39%
4 | 3x4 | +17% | -71% | +16% | -43%
5 | 4x4 | +29% | -74% | +26% | -42%
19
All numbers are compared to “no forwarding”
[Diagram: two forwarded convolutions between a load from and a store to remote memory; the feature map is broken into 12 tiles to fit in memory; the overlaps must be loaded and computed (less I/O, more compute)]
© 2020 Samsung
3 convolutions
[Same table as on the previous slide]
20
All numbers are compared to “no forwarding”
[Diagram: three forwarded convolutions between a load from and a store to remote memory]
© 2020 Samsung
4 convolutions
[Same table as on the previous slide]
21
All numbers are compared to “no forwarding”
[Diagram: four forwarded convolutions between a load from and a store to remote memory]
© 2020 Samsung
5 convolutions
[Same table as on the previous slide]
22
All numbers are compared to “no forwarding”
[Diagram: five forwarded convolutions between a load from and a store to remote memory; the overlaps are now so big that smaller tiles (4x4 tiling) are needed to fit in local memory]
© 2020 Samsung
Tiling and increasing receptive fields
23
• The trade-off between I/O and compute power leads to a U-shaped behavior
• The trade-off requires an optimization mechanism in the compiler
• In our toy example, all the convolutions are identical
• In a real CNN, the convolutions are different, so the optimization problem is not one-dimensional
[Plot: energy [J] versus number of forwarded convolutions; a U-shaped curve with the minimal-energy point marked]
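In this toy setting, the compiler-side optimization is a one-dimensional search over the forwarding depth. A minimal sketch, reusing the chain_cost sketch from earlier and two hypothetical per-unit energies (values are illustrative, not measured):

```python
E_PER_BYTE = 100e-12   # J per byte of remote-memory traffic (assumed)
E_PER_MAC = 1e-12      # J per multiply-accumulate (assumed)

def estimated_energy(forward_depth, n_convs=8, h=96, w=128, c=64, t=32):
    traffic, macs = chain_cost(h, w, c, t, n_convs, forward_depth)  # earlier sketch
    return E_PER_BYTE * traffic + E_PER_MAC * macs

# The minimizer is the bottom of the U-shape. (This ignores the local-memory
# limit that forces smaller tiles at large depths, as on the 5-convolution slide.)
best_depth = min(range(1, 9), key=estimated_energy)
```

In a real network, the same evaluation has to be repeated per segment of the graph with its own convolution shapes, which is what makes the problem multi-dimensional.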
© 2020 Samsung
Tiling and increasing receptive fields
More constraints to take into account:
• Some hardware will prefer vertical tiling for memory access
• Increases overlap sizes
• Some hardware will impose constraints on tile size (multiple of a given number)
Out-of-scope ways to deal with overlaps:
• Multi-core processing → hardware considerations and system complexity
• Keep overlap data instead of loading/computing again: takes up memory and requires memory
fragmentation management → system complexity
24
[Diagram: two ways to split the same feature map into four tiles: four full-width horizontal bands versus a 2x2 grid]
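For the tile-size constraints mentioned above, a compiler needs a helper of roughly this shape (entirely hypothetical: full-width band tiles, one byte per element, input and output tiles co-resident in local memory):

```python
def pick_tile_height(fm_w, ch, margin, local_mem_bytes, align=8, bpe=1):
    """Largest tile height that is a multiple of `align` and whose padded input
    band plus output band fit together in local memory. Returns None if nothing fits."""
    h = local_mem_bytes // (fm_w * ch * bpe)  # optimistic upper bound, ignoring margins
    h -= h % align
    while h >= align:
        in_bytes = (h + 2 * margin) * (fm_w + 2 * margin) * ch * bpe
        out_bytes = h * fm_w * ch * bpe
        if in_bytes + out_bytes <= local_mem_bytes:
            return h
        h -= align
    return None

# e.g. pick_tile_height(fm_w=128, ch=64, margin=2, local_mem_bytes=400 * 1024) -> 16
```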
© 2020 Samsung
Filters management
© 2020 Samsung
Filters management
Option 1: reload the filters from remote to
local memory for each tile
→ wasted bandwidth
26
Option 2: keep the filters of all convolutions in local memory while processing all the tiles
→ wasted space in the local memory
→ may increase the number of tiles
[Diagram: Option 1: local memory holds the filters for one convolution at a time, reloaded #tiles times. Option 2: local memory holds the filters for convolutions 1…n, loaded once]
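A toy accounting of the two options (our own; filter_bytes is a list with each convolution's filter size in bytes):

```python
def traffic_reload_each_tile(filter_bytes, n_tiles):
    """Option 1: reload every convolution's filters for every tile."""
    return n_tiles * sum(filter_bytes)

def traffic_keep_resident(filter_bytes):
    """Option 2: load all filters once and keep them in local memory
    (costing local-memory space, which may force more, smaller tiles)."""
    return sum(filter_bytes)

def traffic_per_conv_choice(filter_bytes, keep_flags, n_tiles):
    """The per-convolution decision argued for on the next slide."""
    return sum(fb if keep else fb * n_tiles
               for fb, keep in zip(filter_bytes, keep_flags))
```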
© 2020 Samsung
Filters management
• No single strategy is the best all the time
• Choosing the right strategy can save up to
20% energy
• In practice, the problem is even more
complicated due to different convolutions
through the network
• The compiler should decide whether to keep or reload the filters for each convolution independently
27
[Plot: energy [J] versus local memory size [kB], from 0 to 500 kB]
© 2020 Samsung
Case study of different modern
convolutional neural networks
28
© 2020 Samsung
Skip connections – residual block
Skip connections were introduced in the ResNet paper (He 2016) as a way to facilitate backpropagation
• Forwarding becomes more challenging: one strives to keep both the main path and the skip connection in local memory
Intensive usage in DenseNet (Huang 2017)
• Up to 6 skip connections alive simultaneously
29
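A crude way to state the forwarding constraint above (sizes in bytes, purely illustrative): the skip-connection tile has to stay live for the whole main path so that the final addition needs no remote access. For DenseNet-style blocks, the skip term becomes the sum over all concurrently live skip connections.

```python
def residual_block_fits(main_path_peak_bytes, skip_tile_bytes, filter_bytes,
                        local_mem_bytes):
    """True if a tile can be forwarded through the whole residual block: the
    main-path working set, the live skip-connection tile and the resident
    filters must fit in local memory at the same time."""
    return (main_path_peak_bytes + skip_tile_bytes + filter_bytes
            <= local_mem_bytes)
```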
© 2020 Samsung
Skip connections – segmentation networks
U-net adds skip connections to an encoder-decoder structure to improve the quality of segmentation
30
Ronneberger 2015.
© 2020 Samsung
Skip connections – segmentation networks
It can be mixed with residual blocks
31
Singla 2019
© 2020 Samsung
Inverted bottleneck
• The inverted bottleneck was introduced in the MobileNet family
• Splitting the 3x3 convolution into a 3x3 depthwise convolution + a 1x1 convolution reduces the number of parameters
• The smaller filters can more easily be kept in memory and/or reloaded
• 1x1 convolutions don’t create overlaps
• The skip connection is placed on the thinner feature map to minimize the memory impact
32
Sandler 2018
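A quick count behind the parameter-reduction bullet (C input and output channels assumed equal; generic numbers, not MobileNetV2's actual layer sizes):

```python
C = 64
standard_3x3 = 3 * 3 * C * C        # 36,864 parameters
depthwise_3x3 = 3 * 3 * C           # 576 (one 3x3 filter per channel)
pointwise_1x1 = 1 * 1 * C * C       # 4,096
print(standard_3x3, depthwise_3x3 + pointwise_1x1)   # 36864 vs 4672, ~7.9x fewer
```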
© 2020 Samsung
Conclusion
• Memory management is a complicated optimization problem
• The compiler affects many performance indices, in particular power and runtime
• The tradeoff is not intuitive and requires some evaluation mechanism
How do we do it?
• Map all the performance indices (power, runtime…) to a single cost
• The mapping may depend on the use case
• The compiler is able to analyze its decisions and evaluate the cost
• The compiler optimizes memory management to minimize the cost
33
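A minimal sketch of that single-cost mapping, with hypothetical weights (as noted above, the mapping depends on the use case, e.g. battery-bound versus latency-bound):

```python
def total_cost(energy_j, runtime_s, w_energy=1.0, w_runtime=0.1):
    """Collapse several performance indices into one scalar the compiler can
    minimize when comparing candidate memory-management plans."""
    return w_energy * energy_j + w_runtime * runtime_s
```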
© 2020 Samsung
Conclusion
What we didn’t cover:
• Pre-fetching for runtime reduction
• Multicore edge devices
• We described fixed hardware and fixed network. We actually co-design:
• Memory size, processing throughput, hardware limitations
• Network design to avoid access to remote memory
• Use of network architecture search
• Quantization and pruning to avoid access to remote memory
• Choice of bit-width and pruning rate per layer
34
© 2020 Samsung
Resources
He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE
conference on computer vision and pattern recognition. 2016.
Huang, Gao, et al. "Densely connected convolutional networks." Proceedings of the IEEE
conference on computer vision and pattern recognition. 2017.
Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for
biomedical image segmentation." International Conference on Medical image computing and
computer-assisted intervention. Springer, Cham, 2015.
Singla 2019: https://github.com/Nishanksingla/UNet-with-ResBlock
Sandler, Mark, et al. "MobileNetV2: Inverted residuals and linear bottlenecks." Proceedings of the
IEEE conference on computer vision and pattern recognition. 2018.
35
