Basic deep learning & Deep learning application to medicine

Deep learning application to medicine
Department of Nuclear Medicine, Seoul National University Hospital
Hongyoon Choi

CONTENTS
• Basic knowledge: Linear regression to deep learning
• Overview of deep learning
• Real application to medical data

Basic knowledge:
Linear regression to deep learning

Age prediction
No. of wrinkles
Age
Regression:
Predict continuous value
Supervised Learning:
“Right answers” given

Lung tumor (Benign vs. malignant)
Tumor Size
Classification:
Predict discrete value
Supervised Learning:
“Right answers” given
0: Benign
1: Malignant

Supervised Learning: Regression, Classification
Unsupervised Learning: Clustering, Generative model
Semi-Supervised Learning
Reinforcement Learning

Training Set → Learning Algorithm → h
Height → h → Estimated Weight
Training set
Height (X)   Weight (Y)
170          71
155          49
180          80
175          63
190          91
160          52
...          ...
h(X) = a0 + a1X

h(X) = a0 + a1X
a → Parameters
How to choose “a”?
Basic idea:
Choose a0 and a1 to minimize “Some Target”
Some Target = Cost Function
J(a0, a1) = Mean of (h(X) - Y)^2
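The cost function can be sketched in a few lines of Python. This is only an illustration: the data are the six height/weight pairs from the training-set slide, and the two parameter guesses are hypothetical values chosen to show that a better line gives a lower cost.

```python
# Training set from the slide: height in cm (X) and weight in kg (Y).
heights = [170, 155, 180, 175, 190, 160]
weights = [71, 49, 80, 63, 91, 52]


def h(x, a0, a1):
    """Hypothesis: a straight line with intercept a0 and slope a1."""
    return a0 + a1 * x


def cost(a0, a1, xs, ys):
    """J(a0, a1): mean of the squared differences between h(X) and Y."""
    return sum((h(x, a0, a1) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Two hypothetical parameter guesses (not from the slides).
j_rough = cost(0.0, 0.4, heights, weights)
j_better = cost(-130.0, 1.15, heights, weights)
print(j_rough, j_better)  # the second guess has the lower cost
```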

Simply, a0 = 0 and only consider a1
[Figure: data points on X vs. Y axes, and the cost J(a1) plotted against a1]

Hypothesis: h(X) = a0 + a1X
Parameters: a0 and a1
Cost Function: J(a0, a1) = Mean of (h(X) - Y)^2
Target: Minimize Cost

[Figure: cost surface J over a0 and a1]

[Figure: fitted line on Height vs. Weight axes, and the optimal point of a0 and a1 on the cost surface]

How to optimize?
Gradient Descent
• Basic algorithm for training deep learning model
• Briefly,
• Start at any point (a0, a1)
• Iteratively change a0 and a1 until reaching a minimum of J(a0, a1)

Gradient Descent Algorithm
Repeat until convergence:
$a_j \leftarrow a_j - \alpha \frac{\partial}{\partial a_j} J(a_0, a_1)$ (j = 0 and 1)
𝛂: Learning Rate
[Figure: J(a1) vs. a1; at a point with positive slope, the update decreases a1]
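A minimal sketch of this update rule for the linear hypothesis h(X) = a0 + a1X, assuming the mean-squared-error cost from the earlier slides. The toy data (points on the line Y = X), the learning rate, the starting point, and the fixed number of steps (used here in place of a convergence test) are all illustrative assumptions.

```python
def gradient_descent(xs, ys, alpha=0.1, steps=2000, a0=0.0, a1=0.0):
    """Repeat a_j <- a_j - alpha * dJ/da_j for j = 0 and 1."""
    n = len(xs)
    for _ in range(steps):
        errors = [a0 + a1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J = mean((h(X) - Y)^2).
        grad_a0 = 2.0 * sum(errors) / n
        grad_a1 = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        # Update both parameters simultaneously.
        a0, a1 = a0 - alpha * grad_a0, a1 - alpha * grad_a1
    return a0, a1


# Hypothetical toy data lying exactly on Y = X; start from an arbitrary point.
a0_fit, a1_fit = gradient_descent([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], a0=2.0, a1=-1.0)
print(a0_fit, a1_fit)  # approaches a0 = 0, a1 = 1
```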

[Figure: gradient descent steps on J(a1) vs. a1, with small 𝛂 and with large 𝛂]

• Multiple Variables
h(X) = a0 + a1X1 + a2X2+a3X3 …
Weight Prediction using Height, Waist circumference, Head circumference, …
J(a) = Mean of (h(X) - Y)^2
Repeat: $a_j \leftarrow a_j - \alpha \frac{\partial}{\partial a_j} J(a)$

[Figure: cost J for a single parameter and for two parameters]
>3 Parameters: n-dimensional parabolic shape

Classification
[Figure: Tumor Size vs. label (0: Benign, 1: Malignant)]
h(X) = aX
h(X) = 1 if X > 3 cm
h(X) = 0 if X < 3 cm

Classification
[Figure: Tumor Size vs. label (0: Benign, 1: Malignant)]
Borderline → value of 0~1
What we want…
1) 0 < h(X) < 1
2) For borderline, h(X) ~ 0.5
“Logistic Function” = “Sigmoid Function”
$h(X) = \frac{1}{1 + e^{-Z}}$
Z = a0 + a1X (X: tumor size)

Classification
Decision Boundary
[Figure: sigmoid h(X) over Tumor Size (0: Benign, 1: Malignant), crossing 0.5 at 3 cm]
$h(X) = \frac{1}{1 + e^{-Z}}$
Z = a0 + a1X (X: tumor size)
h(X) > 0.5 → Malignant
h(X) < 0.5 → Benign
Interpretation of h(X) ~ Probability of malignancy
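A short sketch of the logistic hypothesis and its decision boundary. The parameter values a0 = -6 and a1 = 2 are hypothetical, picked only so that Z = 0 (and therefore h(X) = 0.5) falls at a tumor size of 3 cm, as on the slide.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real Z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters: Z = a0 + a1 * X is zero at X = 3 cm.
A0, A1 = -6.0, 2.0


def h(size_cm):
    """h(X): interpreted as the probability that the tumor is malignant."""
    return sigmoid(A0 + A1 * size_cm)


def classify(size_cm):
    """Threshold h(X) at 0.5; exactly 0.5 is treated as benign here (an arbitrary choice)."""
    return "Malignant" if h(size_cm) > 0.5 else "Benign"


for size in (1.0, 3.0, 4.0):
    print(size, h(size), classify(size))
```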

Classification
Classification with two variables
[Figure: data points on x1 and x2 axes; the decision boundary (= threshold of the sigmoid output) separates Y=1 from Y=0]

Classification
Classification with multiple variables
[Figure: decision boundary on x1 and x2 axes]
$h(X) = \frac{1}{1 + e^{-Z}}$, Z = a0 + a1X1 + a2X2
Linear regression: h(X) = a0 + a1x1 + a2x2 + …
Logistic classification: h(X) = sig(Z), Z = a0 + a1x1 + a2x2 + …

Classification
[Figure: sigmoid curves over Tumor Size (0: Benign, 1: Malignant), showing the effect of changing a0 and of changing a1]
$h(X) = \frac{1}{1 + e^{-Z}}$, Z = a0 + a1X
How to optimize aj?
→ Make convex cost function
For linear regression, J(a0, a1) = Mean of (h(X) - Y)^2
[Figure: convex cost curve for linear regression]

Classification
Cost function for logistic classification
J(a0, a1) = -log(h(X)) if Y = 1
J(a0, a1) = -log(1-h(X)) if Y = 0
[Figure: J vs. h(X) for Y=1 and for Y=0 (h(X) is 0~1)]

Classification
Cost function for logistic classification
J(a0, a1) = -[Y log (h(X)) + (1-Y) log (1-h(X))]
Cost function for logistic = “Binary cross-entropy”
Optimization algorithm: Same as linear regression
Repeat: $a_j \leftarrow a_j - \alpha \frac{\partial}{\partial a_j} J(a)$
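The following sketch trains a logistic classifier with the binary cross-entropy cost and the same gradient descent loop. The tumor sizes and labels are hypothetical toy data, and the learning rate and step count are arbitrary. The gradient used, mean((h(X) - Y) * X_j), is the standard derivative of the mean cross-entropy for a sigmoid output.

```python
import math

# Hypothetical toy data: tumor size in cm and label (0: benign, 1: malignant).
sizes = [1.0, 1.5, 2.0, 2.5, 3.5, 4.0, 4.5, 5.0]
labels = [0, 0, 0, 0, 1, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def h(x, a0, a1):
    return sigmoid(a0 + a1 * x)


def cost(a0, a1, xs, ys):
    """Mean binary cross-entropy: -[Y log h(X) + (1 - Y) log(1 - h(X))]."""
    eps = 1e-12  # keeps log() away from 0
    total = 0.0
    for x, y in zip(xs, ys):
        p = min(max(h(x, a0, a1), eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(xs)


def train(xs, ys, alpha=0.1, steps=10000):
    """Gradient descent: a_j <- a_j - alpha * dJ/da_j."""
    a0 = a1 = 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [h(x, a0, a1) - y for x, y in zip(xs, ys)]
        grad_a0 = sum(errors) / n
        grad_a1 = sum(e * x for e, x in zip(errors, xs)) / n
        a0, a1 = a0 - alpha * grad_a0, a1 - alpha * grad_a1
    return a0, a1


initial_cost = cost(0.0, 0.0, sizes, labels)
a0_fit, a1_fit = train(sizes, labels)
final_cost = cost(a0_fit, a1_fit, sizes, labels)
print(initial_cost, final_cost)
```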

• Logistic regression as a Perceptron
Inputs: x1 (Lesion size), x2 (Circularity), x3 (Hounsfield unit)
Z = w1x1 + w2x2 + w3x3 + b0
Activation function: Sigmoid(Z)
Output: 1: Malignancy, 0: Benign
Find optimized W for minimized error
→ Gradient descent

• Perceptron vs neuron

• Limitation of single-layer perceptron
[Figure: linear classification vs. nonlinear classification on x1 and x2 axes]

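Single-layer perceptron to neural network
Layer 1 (input) → Layer 2 → Layer 3 (output layer, H(X))
$a_1^{(2)} = g(w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + w_{13}^{(1)} x_3)$
$a_2^{(2)} = g(w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + w_{23}^{(1)} x_3)$
$a_3^{(2)} = g(w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + w_{33}^{(1)} x_3)$
$a_1^{(3)} = H(X) = g(w_{11}^{(2)} a_1^{(2)} + w_{12}^{(2)} a_2^{(2)} + w_{13}^{(2)} a_3^{(2)})$
where g: activation function (sigmoid); the superscript on w marks the layer the weight belongs to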

Non-linear classification example: XOR/XNOR
x1 and x2 are binary (0 or 1).
[Figure: x1 vs. x2 plane with Y=1 and Y=0 points (XNOR Problem)]

Non-linear classification example: XOR
AND: H(X) = sigmoid(-30 + 20 X1 + 20 X2)
OR: H(X) = sigmoid(-10 + 20 X1 + 20 X2)
sigmoid(-10) ~ 0
sigmoid(10) ~ 1

Non-linear classification example: XOR/XNOR
a1 = X1 AND X2: a1 = sigmoid(-30 + 20 X1 + 20 X2)
a2 = (NOT x1) AND (NOT x2): a2 = sigmoid(10 - 20 X1 - 20 X2)
H(X) = a1 OR a2: H(X) = sigmoid(-10 + 20 a1 + 20 a2)
X1  X2  a1  a2  H(X)
0   0   0   1   1
0   1   0   0   0
1   0   0   0   0
1   1   1   0   1
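The two-layer network on this slide can be checked directly. The sketch below hard-codes the bias and weight values shown in the diagram (-30, 20, 20 for AND; 10, -20, -20 for NOT-AND-NOT; -10, 20, 20 for OR) and rounds the sigmoid output to 0 or 1. The function name is illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def xnor_net(x1, x2):
    """Hidden units a1, a2 feed an OR unit; weights are taken from the slide."""
    a1 = sigmoid(-30 + 20 * x1 + 20 * x2)   # x1 AND x2
    a2 = sigmoid(10 - 20 * x1 - 20 * x2)    # (NOT x1) AND (NOT x2)
    return sigmoid(-10 + 20 * a1 + 20 * a2)  # a1 OR a2


# Truth table as (x1, x2, rounded output).
table = [(x1, x2, round(xnor_net(x1, x2))) for x1 in (0, 1) for x2 in (0, 1)]
for row in table:
    print(row)
```

The rounded outputs reproduce the H(X) column of the truth table: 1 when both inputs are equal, 0 otherwise.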

Non-linear classification
[Figure: network output on Lesion size vs. Circularity axes, with a nonlinear boundary between Malignancy and Benign]

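How to calculate the gradient for gradient descent?
Network: X → (w1) → H → (w2) → O → Output
Zh = W1X, H = sig(Zh)
ZO = W2H, O = sig(ZO), with $Z_O = \sum W_2 H$
J = Cost (O, Y), (O = h(X))
J = -[Y log (h(X)) + (1-Y) log (1-h(X))]
$\frac{\partial J}{\partial w_2} = \frac{\partial J}{\partial O} \frac{\partial O}{\partial Z_O} \frac{\partial Z_O}{\partial w_2}$
$\frac{\partial J}{\partial w_1} = \frac{\partial J}{\partial Z_h} \frac{\partial Z_h}{\partial w_1}$
$\frac{\partial J}{\partial Z_h} = \frac{\partial J}{\partial Z_O} \frac{\partial Z_O}{\partial H} \frac{\partial H}{\partial Z_h}$
“Back Propagation”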
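As a sanity check on the chain rule, here is a sketch for the smallest possible case: one input, one hidden unit, and one output, so W1 and W2 are single numbers. The input, label, and weight values are hypothetical. The analytic gradients are compared with finite-difference estimates.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, w1, w2):
    """Zh = w1 * x, H = sig(Zh), ZO = w2 * H, O = sig(ZO)."""
    hidden = sigmoid(w1 * x)
    out = sigmoid(w2 * hidden)
    return hidden, out


def cost(out, y):
    """Binary cross-entropy for a single example."""
    return -(y * math.log(out) + (1 - y) * math.log(1 - out))


def backprop(x, y, w1, w2):
    """Apply the chain rule from the output back to w1."""
    hidden, out = forward(x, w1, w2)
    dJ_dO = -y / out + (1 - y) / (1 - out)
    dO_dZo = out * (1 - out)
    dJ_dZo = dJ_dO * dO_dZo
    dJ_dw2 = dJ_dZo * hidden                        # dZO/dw2 = H
    dJ_dZh = dJ_dZo * w2 * hidden * (1 - hidden)    # dZO/dH = w2, dH/dZh = H(1-H)
    dJ_dw1 = dJ_dZh * x                             # dZh/dw1 = x
    return dJ_dw1, dJ_dw2


def numeric_grad(x, y, w1, w2, eps=1e-6):
    """Central finite differences, used only to check backprop."""
    def j(a, b):
        return cost(forward(x, a, b)[1], y)
    g1 = (j(w1 + eps, w2) - j(w1 - eps, w2)) / (2 * eps)
    g2 = (j(w1, w2 + eps) - j(w1, w2 - eps)) / (2 * eps)
    return g1, g2


# Hypothetical example values.
analytic = backprop(1.5, 1.0, 0.4, -0.7)
numeric = numeric_grad(1.5, 1.0, 0.4, -0.7)
print(analytic, numeric)
```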

The era of artificial brain!

You see this:
But the camera sees this:
• Limitation of conventional neural network

• Limitation of conventional neural network
[Figure: Cars vs. “Non”-Cars plotted on pixel 1 and pixel 2 axes]
50 x 50 pixel images → 2500 pixels (7500 if RGB)
Each image = a point in a 7500-dimensional space
→ Logistic regression with 7500 variables

Curse of dimensionality

More dimensions → Sparser data in the data space → Easy to overfit

More dimensions → More weights (parameters to learn)
1990~2000: Better manual features instead of raw pixel values + kernel-based learning

• Hard to learn deep layer
Problem of Vanishing Gradient

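• Hard to learn deep layer
[Figure: sigmoid output with Slope~0 at both tails]
Error Backpropagation
O = sig(ZO)
$\frac{\partial J}{\partial w_2} = \frac{\partial J}{\partial O} \frac{\partial O}{\partial Z_O} \frac{\partial Z_O}{\partial w_2}$, where $\frac{\partial O}{\partial Z_O} \approx 0$ in the flat regions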

Overcome Limitations &
To Deep Learning…
• Automatic feature extraction from raw data
• Learning deep-layered neural network
Algorithm Hardware Big Data

Deep learning
• Nonlinear problem & multilayer perceptron
[Figure: raw pixel values of a 128x128 image (= 16,384 inputs) fed into a multilayer perceptron → Output]
- Require good manual features
- Raw data → Too big.
- More layers? → Difficulty in learning

Deep learning
• MLP to Deep learning
- Require good manual features
- Raw data → Too big.
- More layers? → Difficulty in learning
• Automatic feature extraction from raw data
• New activation function & Stochastic gradient descent
• Methods for reducing overfitting

Deep learning
• Train deep layer
Problem of Vanishing Gradient
[Figure: sigmoid with Slope~0 at both tails]

Deep learning
• Train deep layer
Nonlinearity function
Sigmoid → ReLU, tanh, ELU, Leaky ReLU
[Figure: error backpropagation from the output through Sigmoid (Slope~0 at both tails) vs. ReLU (Slope = 1 for positive input)]
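A small numeric sketch of why the slope matters. Backpropagation multiplies the local slopes of the activation functions layer by layer. The code below ignores the weights and evaluates every layer at the same input, which is a deliberate simplification, and the depth of 10 layers is an arbitrary choice.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_slope(z):
    """Derivative of the sigmoid: at most 0.25, and close to 0 for large |z|."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_slope(z):
    """Derivative of ReLU: 1 for positive input, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def chained_slope(slope_fn, z, layers):
    """Product of the local slopes over several layers (weights ignored)."""
    product = 1.0
    for _ in range(layers):
        product *= slope_fn(z)
    return product


sig_chain = chained_slope(sigmoid_slope, 0.0, 10)   # best case for sigmoid
relu_chain = chained_slope(relu_slope, 1.0, 10)
print(sig_chain, relu_chain)
```

Even at the sigmoid's steepest point the product shrinks to 0.25 to the power of 10, while the ReLU product stays at 1 for positive inputs.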

Deep learning
• Train big data
Gradient Descent
Repeat: $a_j \leftarrow a_j - \alpha \frac{\partial}{\partial a_j} J(a)$
50x50x3 ~ 7500 pixel values per image
100,000 images of cars and non-cars
Per iteration:
Forward:
- Estimate 100,000 h(X)
Backward = update ‘a’
- Cost calculated from 100,000 cost(Y, h(X))
[Figure: J(a1) vs. a1 with small 𝛂]

Deep learning
• Train big data
Gradient Descent (= batch gradient):
Training data → 1 weight update
Stochastic Gradient Descent (= minibatch stochastic gradient):
Training data → 1 weight update per mini-batch → Multiple weight updates
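The difference in the number of weight updates can be sketched with a small linear-regression example. Everything here is an illustrative assumption: the synthetic data (Y = 2X + 1 without noise), the batch size of 50, the learning rate, and the 20 passes over the data.

```python
import random


def gradient(a0, a1, batch):
    """Gradient of the mean squared error on one batch of (x, y) pairs."""
    n = len(batch)
    errors = [(a0 + a1 * x - y, x) for x, y in batch]
    return (2.0 * sum(e for e, _ in errors) / n,
            2.0 * sum(e * x for e, x in errors) / n)


def train(data, batch_size, epochs, alpha=0.1, seed=0):
    """One weight update per (mini-)batch; batch_size=len(data) is batch gradient descent."""
    rng = random.Random(seed)
    data = list(data)
    a0 = a1 = 0.0
    updates = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            g0, g1 = gradient(a0, a1, data[start:start + batch_size])
            a0, a1 = a0 - alpha * g0, a1 - alpha * g1
            updates += 1
    return a0, a1, updates


def distance_to_truth(a0, a1):
    """Distance from the parameters that generated the data (a0 = 1, a1 = 2)."""
    return abs(a0 - 1.0) + abs(a1 - 2.0)


# Hypothetical synthetic data: 1000 points on Y = 2X + 1.
data_rng = random.Random(42)
data = [(x, 2.0 * x + 1.0) for x in (data_rng.random() for _ in range(1000))]

full = train(data, batch_size=len(data), epochs=20)
mini = train(data, batch_size=50, epochs=20)
print(full[2], distance_to_truth(full[0], full[1]))
print(mini[2], distance_to_truth(mini[0], mini[1]))
```

With the same 20 passes over the data, the batch version makes 20 updates and the mini-batch version makes 400, so the latter ends much closer to the generating parameters in this toy setting.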

Deep learning
• Train big data
Faster and more efficient at approaching a minimum
(reaching the global minimum is not guaranteed)

Deep learning
• Problem of overfitting
Apple
Not apple
Because it’s yellow
Not apple
Because it’s not round

Deep learning
• Problem of overfitting
Dropout
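A minimal sketch of dropout applied to one layer's activations. It assumes the common "inverted dropout" variant, in which the surviving activations are rescaled during training so that nothing changes at test time. The slide does not say which variant is meant, and the drop probability of 0.5 is only an example.

```python
import random


def dropout(activations, drop_prob, training, rng):
    """Randomly zero activations during training; pass them through at test time."""
    if not training or drop_prob == 0.0:
        return list(activations)
    keep = 1.0 - drop_prob
    # Inverted dropout (assumed variant): rescale kept units by 1 / keep.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
train_out = dropout([1.0] * 10000, 0.5, True, rng)
test_out = dropout([1.0, 2.0, 3.0], 0.5, False, rng)
print(sum(train_out) / len(train_out), test_out)
```

Because a different random subset of units is silenced on every training step, the network cannot rely on any single feature, which is the intuition behind its use against overfitting.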

Deep learning
• Efficient automatic feature extraction
– Convolutional Neural Network for Image
– Recurrent Neural Network for Sequential Data

Deep learning
• Convolutional Neural Network
• Local feature extraction
• Translational invariance
• Sparsity compared with fully-connected layer

Deep learning
• Local feature extraction
• Translational invariance
• Sparsity compared with fully-connected layer
[Figure: Fully-connected layer vs. Convolutional layer]

Deep learning
Convolutional filter
Image:
0 0 0 0 0 0 0 0
Image (6 x 8):
0 1 1 3 5 7 8 8
0 0 0 1 3 3 5 3
0 0 1 2 4 7 7 1
0 0 2 3 8 5 7 6
2 5 8 8 8 4 9 5
0 0 8 8 6 4 2 3
Filter (3 x 3):
1 0 1
0 1 0
1 0 1
Output (4 x 6):
2 6 12 22 27 28
2 5 15 16 30 24
11 17 24 29 33 24
15 19 32 28 27 27
Number of feature maps = number of convolutional filters
(instead of number of nodes)
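The output on this slide can be reproduced by sliding the 3 x 3 filter over the 6 x 8 image and summing the element-wise products at each position. The sketch assumes no padding and a stride of 1, which matches the 4 x 6 output size. As in most CNN libraries the filter is not flipped, which makes no difference for this symmetric filter.

```python
IMAGE = [
    [0, 1, 1, 3, 5, 7, 8, 8],
    [0, 0, 0, 1, 3, 3, 5, 3],
    [0, 0, 1, 2, 4, 7, 7, 1],
    [0, 0, 2, 3, 8, 5, 7, 6],
    [2, 5, 8, 8, 8, 4, 9, 5],
    [0, 0, 8, 8, 6, 4, 2, 3],
]

FILTER = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]


def convolve2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


feature_map = convolve2d_valid(IMAGE, FILTER)
for row in feature_map:
    print(row)
```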

Deep learning
identify line / some texture
identify head lights and wheels
identify Car!

Deep learning

Deep learning
Initial Data: 256 x 256
After convolutions and poolings = Abstracted features
[Figure: feature maps responding to Ear, Eye, Nose, Tail, Foot]

Deep learning
Input → Abstracted Features → Feature vectors → Multivariate Logistic
Dimension: 256x256x3 → 4x4x1024 → 4096 → 1000

Deep learning
ImageNet Challenge Results (top-5 error)
Year  Error   Model
2010  28.2%   Shallow model
2011  25.8%   Shallow model
2012  16.4%   AlexNet (8 layers)
2013  11.7%
2014  6.7%    GoogleNet (22 layers)
2015  3.57%   ResNet (152 layers)

Deep learning
ImageNet Challenge Results
AlexNet
GoogleNet
ResNet

Deep learning
Pedestrian Car Motorcycle Truck
• Cf> Multiple output (instead of binary classification)
4 Output nodes,
instead of 1 node
Y = [ 1, 0 , 0, 0] for pedestrian
Y = [ 0, 1, 0, 0 ] for car
Y = [ 0, 0, 1, 0 ] for motorcycle
Y = [ 0, 0, 0, 1 ] for truck
Activation function:
Softmax, instead of sigmoid
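A short sketch of the softmax output for the four classes on this slide, together with the one-hot label encoding. The raw scores fed into softmax are hypothetical numbers standing in for the network's last-layer outputs.

```python
import math

CLASSES = ["pedestrian", "car", "motorcycle", "truck"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(label):
    """Y = [1, 0, 0, 0] for pedestrian, [0, 1, 0, 0] for car, and so on."""
    return [1 if name == label else 0 for name in CLASSES]


# Hypothetical scores from the 4 output nodes.
scores = [0.5, 2.0, 0.1, -1.0]
probs = softmax(scores)
predicted = CLASSES[probs.index(max(probs))]
print(probs, predicted, one_hot(predicted))
```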

Deep learning
• Recurrent Neural Network

Deep learning
• Recurrent Neural Network
Neural Machine Translation: Google Translate
Text, Music Generation: https://www.youtube.com/watch?v=A2gyidoFsoI
Combined with CNN: Image caption generation
Vision (Deep CNN) → Language Generating RNN →
“A group of people shopping at an outdoor market. There are many vegetables at the fruit stands.”
Deep learning
• Current Concept of Deep learning
Deep layered neural network
+ Data type-specific layers: Convolution, Recurrent
+ Modification for training: ReLU activation, SGD training, Dropout, Batch normalization, Variable Cost Function, …

Deep learning
Nomenclature
Machine Learning
Supervised Learning
• Regression
• Classification
Unsupervised Learning
• Clustering
• Generative model
Reinforcement Learning
• Algorithms react to environment

Deep learning
Current trends of deep learning
High accuracy to various purposes/situations
• Unsupervised learning, particularly generative model
• Transfer learning
• One-shot learning
• Bayesian modeling
• Mobile-friendly model
• Manifold and non-Euclidean data

Deep learning
Flexible and scalable deep learning
• Transfer learning
ImageNet-based model as a feature extractor
Train only last layer
[Figure: pretrained network with a retrained last layer, output "Car"]

Deep learning
• Generative model
z~N(0,1) → G: generator → Fake image
Fake image / Real image → D: Discriminator → 1: real, 0: fake
Can be composed with convolutional, FC layers, batch normalization, regularization, etc.
Various cost functions / combined cost functions: MSE, CE, Adversarial, KLD, etc.

Deep learning
Ref. DeepMind, NIPS 2017
https://tykimos.github.io/

Deep learning in Medicine
ROC curve
- better than dermatologists
Esteva, Andre, et al. Nature 2017

DL for medical imaging:
Supervised learning using CNN
Diabetic Retinopathy
→ Better or equivalent to ophthalmologists
[Figure: Normal vs. DM]
Gulshan, Varun, et al. JAMA 2016
CheXNet
→ Equivalent/Superior to radiologists (?)
Pranav Rajpurkar, … Andrew Ng, 2017. Arxiv

FDA approves a device for diagnosing diabetic retinopathy (2018.4)
AI-aided system (CT angiography for large vessel occlusion)
Year of AI invasion into the clinic

DL for medical imaging:
Supervised learning using CNN
AD & NC
MCI-converter & non-converter
FDG and amyloid PET to predict future cognitive decline
Choi H and Jin KH Arxiv 2017

https://adfdgpet.appspot.com
Online Demo
Input file
Web application
Output: likelihood for AD
& predicted cognitive score
Output:
Cognitive dysfunction-related map
p(Alzheimer|X)

https://insight.lunit.io/
Web application

Web application
https://modelderm.com
Han SS, et al. J Invest Derm 2018

Laborious Work Replaced by DL:
Segmentation
Choi, H., & Jin, K. H. J Neurosci Methods 2016
de Brebisson, et al. CVPR 2015.

Laborious Work Replaced by DL:
Detection
Liu Y, et al. Arxiv 2017

Enhance Image Acquisition & Quality
Dahl et al. Arxiv 2017
[Figure: Normal dose abdomen CT vs. Low dose abdomen CT vs. Low dose abdomen CT+CNN]
Chen H et al. Biomed Opt Exp 2017

Image Generation
Common deep learning model: image → “Cat”
z = f(x)
where x: data, z: discriminative features
f: classifier model
Generative model: “Cat” → image
x = g(z)
where x: data, z: latent
g: generation function

Generative Adversarial Network
z~N(0,1)
G:
generator
Karras T, et al. Arxiv 2017
G:
generator
Isola P, et al. arxiv, 2016.

Generative Adversarial Network
Structural MR generation from PET
Generative Adversarial Networks for MR generation
Florbetapir PET (z) → Generator: U-net with skip connection → Generated MR, G(z)
Discriminator: PET and generated MR (z & G(z)) vs. PET and real MR (z & x) → Real or Fake
[Figure: 18F-Florbetapir PET, Real MRI, Generated MRI]
Choi H and Lee DS, J Nucl Med 2017.

Conditional Generation
Antipov G, Arxiv 2017

Conditional Generation
VAE model for brain PET generation
Encoder → Latent features + Age → Generator
→ Brain metabolism aging movie
Choi H,… Lee DS. Biorxiv 2017

Choi H,… Lee DS. Biorxiv 2017
Estimating normal population distribution

Omics data
Quang D et al. NAR 2016
Predicting Function from
DNA sequences

Omics data
Low risk group
High risk group
Deep learning-based risk score
Choi H and Na KJ, Biomed Res Int. 2017

Disruptive Innovation: Raw medical & healthcare data
Diet + Previous Glucose Level
Future Glucose Level &
Scheduling Insulin
Sugar.iq from Medtronic

HTC DeepQ Tricorder
Predicting PVC from daily EKG
Diagnosis of otitis media

Deep learning facilitates left-shifting
