# Introduction to Convolutional Neural Network with TensorFlow

An explanation of the basic mechanism of the Convolutional Neural Network, with sample TensorFlow code.

Sample codes: https://github.com/enakai00/cnn_introduction

### Introduction to Convolutional Neural Network with TensorFlow

1. Introduction to Convolutional Neural Network with TensorFlow
   Etsuji Nakai, Cloud Solutions Architect at Google
   2017/03/24 ver1.0
2. Background & Objective
3. Image Classification Transfer Learning with Inception v3
   ● What's happening here?!
   https://codelabs.developers.google.com/codelabs/cpb102-txf-learning
4. Convolutional Neural Network with Two Convolution Layers
   ● Let's study the underlying mechanism with this (relatively) simple CNN.
   [Figure: Raw Image → Convolution Filters → Pooling Layer → Convolution Filters → Pooling Layer → Fully-connected Layer → Dropout Layer → Softmax Function]
5. Jupyter Notebooks
   ● Launch Cloud Datalab.
     ○ https://cloud.google.com/datalab/docs/quickstarts
   ● Open a new notebook and execute the following command.
     ○ !git clone https://github.com/enakai00/cnn_introduction.git
   ● Find the notebook files in the "cnn_introduction" folder.
6. Logistic Regression
7. Sample Problem
   ● Training set:
     ○ N data points on the (x, y) plane.
     ○ Each data point belongs to one of two categories, labeled t = 1 or t = 0 (plotted as ● and ✕).
   ● Problem to solve:
     ○ Find a straight line that classifies the given data.
     ○ If no line separates the data without any misclassification, find one that is optimal in some sense.
8. Logistic Regression: Theoretical Ground
   ● Define the straight line as $f(x, y) = w_0 + w_1 x + w_2 y = 0$.
   ● We apply the maximum likelihood method to determine the parameters $w = (w_0, w_1, w_2)$.
   ● In other words, we define the "probability of obtaining the training set" and maximize it.
9. Logistic Sigmoid Function
   ● The probability $P(x, y)$ of t = 1 for a new data point at (x, y) should have the following properties.
     ○ $P(x, y) = 0.5$ on the separation line.
     ○ $P(x, y) \to 1$ or $0$ when moving away from the separation line, depending on the side.
   ● This can be satisfied by translating $f(x, y)$ into a probability through the logistic sigmoid function $\sigma(a) = 1 / (1 + e^{-a})$, that is, $P(x, y) = \sigma(f(x, y))$.
   [Figure: the (x, y) plane with ● and ✕ points and an arrow marked "P(x, y) increases in this direction"]
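The two properties of the probability can be checked numerically. The following is a minimal sketch in plain Python, assuming the linear form f(x, y) = w0 + w1·x + w2·y; the parameter values and sample points are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(a):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

def f(x, y, w0, w1, w2):
    """Linear function whose zero set is the separation line."""
    return w0 + w1 * x + w2 * y

# Hypothetical parameters: the separation line is x + y - 1 = 0.
w0, w1, w2 = -1.0, 1.0, 1.0

p_on_line = sigmoid(f(0.5, 0.5, w0, w1, w2))      # a point on the line
p_far_pos = sigmoid(f(10.0, 10.0, w0, w1, w2))    # far away on the f > 0 side
p_far_neg = sigmoid(f(-10.0, -10.0, w0, w1, w2))  # far away on the f < 0 side

print(p_on_line, p_far_pos, p_far_neg)
```

The point on the line gets probability 0.5, and the two distant points get probabilities close to 1 and 0 respectively.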
10. Likelihood Function of Logistic Regression
    ● Using the probability defined on the previous page, calculate the probability of reproducing the training set $\{(x_n, y_n, t_n)\}_{n=1}^{N}$.
      ○ If $t_n = 1$, the probability of observing it at $(x_n, y_n)$ is $P(x_n, y_n)$.
      ○ If $t_n = 0$, the probability of observing it at $(x_n, y_n)$ is $1 - P(x_n, y_n)$.
      ○ These results can be expressed by a single equation: $P_n = P(x_n, y_n)^{t_n} \{1 - P(x_n, y_n)\}^{1 - t_n}$. (Remember that $x^0 = 1$ for any x.)
    ● Hence, the total probability of reproducing all data (the likelihood function) is $P = \prod_{n=1}^{N} P_n$.
11. Loss Function
    ● Instead of maximizing the likelihood function, we generally minimize the following loss function, which avoids underflow in numerical calculation: $E = -\log P = -\sum_{n=1}^{N} \{ t_n \log P(x_n, y_n) + (1 - t_n) \log (1 - P(x_n, y_n)) \}$.
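The underflow issue can be seen with a small experiment. This sketch assumes the standard negative log-likelihood form of the loss (the same expression used for `loss` in the TensorFlow code later in the deck); the probabilities and labels are made-up values.

```python
import math

def likelihood(probs, labels):
    """Product over data points of p^t * (1 - p)^(1 - t)."""
    result = 1.0
    for p, t in zip(probs, labels):
        result *= (p ** t) * ((1.0 - p) ** (1 - t))
    return result

def loss(probs, labels):
    """Negative log-likelihood, accumulated as a sum instead of a product."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1.0 - p)
                for p, t in zip(probs, labels))

# Small hypothetical data set: both forms agree (loss = -log(likelihood)).
probs = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
small_likelihood = likelihood(probs, labels)
small_loss = loss(probs, labels)

# Many data points: the product underflows to 0.0, the sum stays finite.
many_probs = [0.5] * 2000
many_labels = [1] * 2000
big_likelihood = likelihood(many_probs, many_labels)
big_loss = loss(many_probs, many_labels)

print(small_likelihood, small_loss, big_likelihood, big_loss)
```

With 2000 data points the likelihood is no longer representable as a floating-point number, while the loss is simply 2000·log 2.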
12. Gradient Descent Optimization
    ● By incrementally modifying the parameters in the direction opposite to the gradient vector of the loss function, we may eventually reach the minimum.
13. Learning Rate and Convergence Issue
    ● The learning rate ε decides the "step size" of each modification.
    ● Whether the optimization converges or diverges depends on the value of the learning rate.
    [Figure: "Converge" and "Diverge" examples, from http://sebastianruder.com/optimizing-gradient-descent/]
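The effect of the learning rate can be illustrated with a one-parameter toy problem. This is only a sketch: the loss E(w) = w² and the two learning rates are hypothetical choices, not values from the slides.

```python
def gradient_descent(grad, w_init, learning_rate, steps):
    """Repeatedly move w against the gradient by learning_rate * grad(w)."""
    w = w_init
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w

def grad(w):
    """Gradient of the toy loss E(w) = w**2, whose minimum is at w = 0."""
    return 2.0 * w

# Each update multiplies w by (1 - 2 * learning_rate).
w_small = gradient_descent(grad, w_init=1.0, learning_rate=0.1, steps=50)
w_large = gradient_descent(grad, w_init=1.0, learning_rate=1.1, steps=50)

print(w_small, w_large)
```

With ε = 0.1 each step shrinks w by a factor of 0.8, so it approaches the minimum; with ε = 1.1 each step multiplies w by -1.2, so it overshoots further every time and diverges.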
14. TensorFlow Programming
15. Programming Style of TensorFlow
    ● All data is represented as a "multidimensional list".
      ○ In many cases, you can use a two-dimensional list, which is equivalent to a matrix. So by expressing models (functions) in matrix form, you can translate them into TensorFlow code.
    ● As a concrete example, we will write the following model (functions) in TensorFlow code: $f(x, y) = w_0 + w_1 x + w_2 y$ and $P(x, y) = \sigma(f(x, y))$.
      ○ Take care to distinguish the following three kinds of objects.
        ■ Placeholder: a variable that stores training data.
        ■ Variable: a parameter to be adjusted by the training algorithm.
        ■ Functions constructed from Placeholders and Variables.
16. Programming Style of TensorFlow
    ● The linear function representing the straight line can be expressed in matrix form as $f = (x, y) \begin{pmatrix} w_1 \\ w_2 \end{pmatrix} + w_0$.
    ● $(x, y)$ should be treated as a Placeholder, which in general holds multiple data points simultaneously. So let $(x_n, y_n)$ represent the n-th data point, and using the matrix $X$ whose n-th row is $(x_n, y_n)$ for $n = 1, \dots, N$, you can write down the matrix equation $f = Xw + w_0$ with $w = (w_1, w_2)^{T}$.
      ○ Here the n-th element $f_n$ of $f$ is the value of f for the n-th data point, and the "broadcast rule" is applied to the last part, $+ w_0$. This means adding $w_0$ to all matrix elements.
17. Programming Style of TensorFlow
    ● Finally, by applying the sigmoid function σ to each element of f, the probability $P_n = \sigma(f_n)$ for each data point is calculated.
      ○ The "broadcast rule" is applied to $\sigma(f)$, meaning that σ is applied to each element of f.
    ● These relationships are expressed in TensorFlow code as below.
    ```python
    import tensorflow as tf  # TensorFlow 1.x API

    x = tf.placeholder(tf.float32, [None, 2])
    w = tf.Variable(tf.zeros([2, 1]))
    w0 = tf.Variable(tf.zeros([1]))
    f = tf.matmul(x, w) + w0
    p = tf.sigmoid(f)
    ```
18. Programming Style of TensorFlow
    ● This explains the relationship between matrix calculations and TensorFlow code.
    ```python
    # The Placeholder stores training data. [None, 2] is the matrix size;
    # the number of rows is None so that it can hold an arbitrary number of data points.
    x = tf.placeholder(tf.float32, [None, 2])
    # Variables represent the parameters to be trained (initialized to 0 here).
    w = tf.Variable(tf.zeros([2, 1]))
    w0 = tf.Variable(tf.zeros([1]))
    # The "broadcast rule" (similar to NumPy arrays) is applied to the calculations.
    f = tf.matmul(x, w) + w0
    p = tf.sigmoid(f)
    ```
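To make the shapes and the broadcast rule concrete, the same calculation can be traced by hand with nested lists. This is a plain-Python sketch rather than TensorFlow code; the `matmul` helper and all numeric values are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply an [N, K] nested list by a [K, M] nested list."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

# X plays the role of the Placeholder: N = 3 data points, shape [3, 2].
X = [[0.0, 0.0],
     [1.0, 2.0],
     [3.0, -1.0]]
w = [[0.5], [-0.25]]   # shape [2, 1], like the Variable w
w0 = 0.1               # like the Variable w0, added to every element

# f = Xw + w0: adding w0 to every element is the "broadcast rule".
f = [[v + w0 for v in row] for row in matmul(X, w)]
# The sigmoid is also applied element by element.
p = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in f]

print(f)
print(p)
```

The result has one row per data point and a single column, which is the [None, 1] shape that the label Placeholder uses later.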
19. Error Function and Training Algorithm
    ● To train the model (i.e. to adjust the parameters), we need to define the error function and the training algorithm.
    ```python
    t = tf.placeholder(tf.float32, [None, 1])
    # tf.reduce_sum adds up all matrix elements.
    loss = -tf.reduce_sum(t*tf.log(p) + (1-t)*tf.log(1-p))
    # Use the Adam optimizer to minimize "loss".
    train_step = tf.train.AdamOptimizer().minimize(loss)
    ```
20. Calculations inside Session
    ● The TensorFlow code we have prepared so far just defines functions and their relations without doing any calculation. We prepare a "Session", and the actual calculations are executed in the session.
    [Figure: a Session containing Placeholders, Variables, and calculations]
21. Using Session to Train the Model
    ● Create a new session and initialize the Variables inside the session.
    ```python
    sess = tf.Session()
    sess.run(tf.initialize_all_variables())
    ```
    ● By evaluating the training algorithm inside the session, the Variables are adjusted with the gradient descent method.
      ○ "feed_dict" specifies the data to be stored in the Placeholders.
      ○ When functions are evaluated in the session, the corresponding values are calculated using the current values of the Variables.
    ```python
    # Note: `accuracy` is not defined on the slides shown so far.
    i = 0
    for _ in range(20000):
        i += 1
        # The gradient descent method is applied using the training data specified by feed_dict.
        sess.run(train_step, feed_dict={x:train_x, t:train_t})
        if i % 2000 == 0:
            # Calculate "loss" and "accuracy" using the current values of the Variables.
            loss_val, acc_val = sess.run(
                [loss, accuracy], feed_dict={x:train_x, t:train_t})
            print ('Step: %d, Loss: %f, Accuracy: %f' % (i, loss_val, acc_val))
    ```
22. Exercise
    ● Run through the Notebook:
      ○ No.1 Tensorflow Programming
23. Linear Multicategory Classifier
24. Recap: Logistic Regression
    ● Logistic regression gives the "probability of being classified as t = 1" for each data point in the training set.
    ● Parameters are adjusted to minimize the following error function: $E = -\sum_{n=1}^{N} \{ t_n \log P(x_n, y_n) + (1 - t_n) \log (1 - P(x_n, y_n)) \}$.
    [Figure: the (x, y) plane with ● and ✕ points and an arrow marked "P(x, y) increases in this direction"]
25. Graphical Interpretation of Logistic Regression
    ● Drawing a 3-dimensional graph of $z = f(x, y)$, we can see that the "tilted plate" divides the plane into two classes.
    ● The logistic function σ translates the height on the plate into the probability of t = 1.
26. Building Multicategory Linear Classifier
    ● How can we divide the plane into three classes (instead of two)?
    ● We can define three linear functions $f_i(x, y)$ ($i = 1, 2, 3$) and classify a point based on "which of them has the maximum value at that point."
      ○ This is equivalent to dividing the plane with three tilted plates.
27. Translation to Probability with Softmax Function
    ● We can define the probability that $(x, y)$ belongs to the i-th class with the following softmax function: $P_i(x, y) = \frac{e^{f_i(x, y)}}{\sum_{j=1}^{3} e^{f_j(x, y)}}$.
    ● This translates the magnitude of $f_i(x, y)$ into a probability satisfying the following (reasonable) conditions: $0 \le P_i \le 1$, $\sum_i P_i = 1$, and $f_i > f_j \Rightarrow P_i > P_j$.
    [Figure: one-dimensional example of "Softmax translation"]
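The softmax translation is easy to try on a handful of numbers. The sketch below assumes the usual definition (exponentiate each value and divide by the sum); the input values are hypothetical. Subtracting the maximum before exponentiating is a common numerical-stability step and does not change the result.

```python
import math

def softmax(values):
    """Translate a list of real numbers into probabilities that sum to 1."""
    m = max(values)  # shift by the maximum to avoid overflow in exp
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical values of three linear functions at one point.
f_values = [2.0, 1.0, -1.0]
p = softmax(f_values)

print(p)
```

Each output lies between 0 and 1, the outputs sum to 1, and a larger input always receives a larger probability.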
28. Image Classification with Linear Multicategory Classifier
29. Images as Points in High Dimensional Space
    ● A grayscale image with 28x28 pixels can be represented as a 784-dimensional vector, that is, a collection of 784 float numbers.
      ○ In other words, it corresponds to a single point in a 784-dimensional space!
    ● When we spread a bunch of images over this 784-dimensional space, similar images may come together to form clusters.
      ○ If this assumption is correct, we can classify the images by dividing the 784-dimensional space with the softmax function.
30. Matrix Representation
    ● To divide an M-dimensional space into K classes, we prepare K linear functions $f_k(x_1, \dots, x_M) = w_{0k} + w_{1k} x_1 + \dots + w_{Mk} x_M$ ($k = 1, \dots, K$).
    ● Defining the n-th image data as $x_n = (x_{1n}, \dots, x_{Mn})$, the values of the linear functions for all data can be represented as $F = XW + w$, where the n-th row of $X$ is $x_n$, $W$ is the $M \times K$ matrix of $w_{mk}$, and $w = (w_{01}, \dots, w_{0K})$. (The broadcast rule is applied to the "+ w" operation.)
31. Matrix Representation
    ● Here is the summary of the matrix representation: $F$ is $N \times K$, $X$ is $N \times M$, $W$ is $M \times K$, and $w$ (length K) is added to every row of $XW$ by the broadcast rule.
32. Matrix Representation
    ● Finally, we can translate the result into probabilities by applying the softmax function. The probability that the n-th data is classified as the k-th category is $P_k(x_n) = \frac{e^{f_k(x_n)}}{\sum_{k'=1}^{K} e^{f_{k'}(x_n)}}$.
    ● TensorFlow has the "tf.nn.softmax" function, which calculates them directly from the matrix F.
33. TensorFlow Codes of the Model
    ● The matrix representations we have built so far can be written in TensorFlow code as below.
      ○ Pay attention to the difference between Placeholders and Variables.
    ```python
    x = tf.placeholder(tf.float32, [None, 784])
    w = tf.Variable(tf.zeros([784, 10]))
    w0 = tf.Variable(tf.zeros([10]))
    f = tf.matmul(x, w) + w0
    p = tf.nn.softmax(f)
    ```
34. Loss Function
    ● The class label of the n-th data is given by a vector $t_n = (t_{1n}, \dots, t_{Kn})$ in the one-of-K representation. Only its k-th element is 1, meaning that its class is k.
    ● Since the probability of having the correct answer for this data is $P_n = \prod_{k'=1}^{K} \{P_{k'}(x_n)\}^{t_{k'n}}$ (the exponent is 1 only for k' = k, the class of the n-th data), the probability of having correct answers for all data is $P = \prod_{n=1}^{N} P_n$.
    ● We define the loss function as $E = -\log P = -\sum_{n=1}^{N} \sum_{k'=1}^{K} t_{k'n} \log P_{k'}(x_n)$. Minimizing the loss function is then equivalent to maximizing the probability P.
35. TensorFlow Codes for Loss Function
    ● The loss function and the optimization algorithm can be written in TensorFlow code as below.
    ```python
    t = tf.placeholder(tf.float32, [None, 10])
    loss = -tf.reduce_sum(t * tf.log(p))
    train_step = tf.train.AdamOptimizer().minimize(loss)
    ```
    ● The following code calculates the accuracy of the model.
      ○ "correct_prediction" is a list of bool values meaning "correct or incorrect."
      ○ "accuracy" is calculated by taking the mean of the bool values (1 for correct, 0 for incorrect).
    ```python
    correct_prediction = tf.equal(tf.argmax(p, 1), tf.argmax(t, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))
    ```
36. Comparing Predictions and Class Labels
    ● The following shows how we calculate the correctness of predictions.
      ○ tf.argmax(p, 1) predicts the class of each data according to the maximum probability.
      ○ tf.argmax(t, 1) indicates the correct class of each data.
      ○ tf.equal compares these to check the answer.
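The argmax comparison can be followed on a tiny example. This is a plain-Python sketch of the same idea, not the TensorFlow implementation; the `argmax` and `accuracy` helpers and the three-class data are hypothetical.

```python
def argmax(row):
    """Index of the largest element (the first one in case of ties)."""
    return max(range(len(row)), key=lambda k: row[k])

def accuracy(probs, labels):
    """Fraction of rows whose most probable class matches the one-of-K label."""
    correct = [argmax(p) == argmax(t) for p, t in zip(probs, labels)]
    return sum(correct) / len(correct)

# Hypothetical softmax outputs for four data points and three classes.
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.3, 0.6],
         [0.4, 0.5, 0.1],   # predicted class 1, but the label says class 0
         [0.2, 0.2, 0.6]]
labels = [[1, 0, 0],
          [0, 0, 1],
          [1, 0, 0],
          [0, 0, 1]]

acc = accuracy(probs, labels)
print(acc)
```

Three of the four predictions match their labels, so the mean of the correct/incorrect values is 0.75.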
37. Mini-Batch Optimization of Parameters
    ● We repeat the optimization operations using 100 samples at a time.
    ```python
    i = 0
    for _ in range(2000):
        i += 1
        batch_xs, batch_ts = mnist.train.next_batch(100)
        sess.run(train_step, feed_dict={x: batch_xs, t: batch_ts})
        if i % 100 == 0:
            loss_val, acc_val = sess.run([loss, accuracy],
                feed_dict={x:mnist.test.images, t: mnist.test.labels})
            print ('Step: %d, Loss: %f, Accuracy: %f' % (i, loss_val, acc_val))
    ```
    [Figure: the image data and label data are taken 100 samples at a time as batch_xs and batch_ts, and each batch is used for one optimization step]
38. Mini-Batch Optimization of Parameters
    ● Mini-batch optimization has the following advantages.
      ○ It reduces memory usage.
      ○ The random movement helps to avoid being trapped in local minima.
    [Figure: paths toward the minimum for simple gradient descent using all training data at once and for stochastic gradient descent with the mini-batch method; a second panel shows the true minimum and local minima]
39. Exercise
    ● Run through the Notebook:
      ○ No.2 Softmax classifier for MNIST
    [Figure: examples of correct and incorrect predictions]
40. Basic Strategy of Convolutional Network
41. The Limitation of Linear Categorizer
    ● The linear categorizer assumes that samples can be classified with flat planes.
    ● This assumption cannot be perfect, and the classifier fails to capture the global (topological) features of handwritten digits.
    [Figure: correct and incorrect examples from the result of the linear classifier]
42. The Overview of Convolutional Neural Network
    ● The convolutional neural network (CNN) uses image filters to extract features from images and applies hidden layers to classify them.
    [Figure: Raw Image → Convolution Filters → Pooling Layer → Convolution Filters → Pooling Layer → Fully-connected Layer → Dropout Layer → Softmax Function]
43. Examples of Convolutional Filters
    ● Convolutional filters are ... just the image filters you sometimes apply in Photoshop!
    [Figure: a filter to blur images and a filter to extract vertical edges]
44. Question
    ● To classify the following training set, what would be the best filters?
45. How Convolutional Neural Network Works
    ● Apply image filters to capture various features of the image.
      ○ For example, if we want to classify the three characters "+", "-", "|", we can apply filters that extract vertical and horizontal edges.
    ● Apply the pooling layer to (deliberately) reduce the image resolution.
      ○ The information necessary for classification is just the density of the filtered image.
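46. TensorFlow code to apply the filters
    ```python
    import numpy as np
    import tensorflow as tf  # TensorFlow 1.x API

    def edge_filter():
        # filter0 changes sign from left to right: it extracts vertical edges.
        filter0 = np.array(
            [[ 2, 1, 0,-1,-2],
             [ 3, 2, 0,-2,-3],
             [ 4, 3, 0,-3,-4],
             [ 3, 2, 0,-2,-3],
             [ 2, 1, 0,-1,-2]]) / 23.0
        # filter1 changes sign from top to bottom: it extracts horizontal edges.
        filter1 = np.array(
            [[ 2, 3, 4, 3, 2],
             [ 1, 2, 3, 2, 1],
             [ 0, 0, 0, 0, 0],
             [-1,-2,-3,-2,-1],
             [-2,-3,-4,-3,-2]]) / 23.0
        # Filter shape: [height, width, input channels, number of filters]
        filter_array = np.zeros([5,5,1,2])
        filter_array[:,:,0,0] = filter0
        filter_array[:,:,0,1] = filter1
        return tf.constant(filter_array, dtype=tf.float32)

    x = tf.placeholder(tf.float32, [None, 784])
    x_image = tf.reshape(x, [-1,28,28,1])

    W_conv = edge_filter()
    # Apply the filters and take the absolute value of the response.
    h_conv = tf.abs(tf.nn.conv2d(x_image, W_conv,
                                 strides=[1,1,1,1], padding='SAME'))
    # Cut off weak responses below 0.2.
    h_conv_cutoff = tf.nn.relu(h_conv-0.2)
    # Halve the resolution with 2x2 max pooling.
    h_pool = tf.nn.max_pool(h_conv_cutoff, ksize=[1,2,2,1],
                            strides=[1,2,2,1], padding='SAME')
    ```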
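The filter, cutoff, and pooling steps can also be traced without TensorFlow. The following is a simplified plain-Python sketch: it uses hypothetical 3x3 edge kernels and 6x6 images instead of the 5x5 filters and 28x28 images of the slide, and the helper names are made up. Like tf.nn.conv2d, it slides the kernel over the image without flipping it and pads the border with zeros.

```python
def conv2d_same(image, kernel):
    """Slide an odd-sized kernel over a 2D image, zero-padding the borders."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            total = 0.0
            for di in range(kh):
                for dj in range(kw):
                    y, x = i + di - kh // 2, j + dj - kw // 2
                    if 0 <= y < h and 0 <= x < w:
                        total += image[y][x] * kernel[di][dj]
            out[i][j] = total
    return out

def max_pool_2x2(image):
    """Halve the resolution by taking the maximum of each 2x2 block."""
    h, w = len(image), len(image[0])  # both assumed to be even
    return [[max(image[i][j], image[i][j + 1],
                 image[i + 1][j], image[i + 1][j + 1])
             for j in range(0, w, 2)]
            for i in range(0, h, 2)]

def edge_density(image, kernel, cutoff=0.2):
    """abs -> cutoff (ReLU) -> pooling, then sum up the remaining intensity."""
    filtered = conv2d_same(image, kernel)
    cut = [[max(abs(v) - cutoff, 0.0) for v in row] for row in filtered]
    return sum(sum(row) for row in max_pool_2x2(cut))

# Hypothetical 3x3 kernels for vertical and horizontal edges.
VERTICAL_EDGE = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
HORIZONTAL_EDGE = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]

# 6x6 toy images: a "|" (one bright column) and a "-" (one bright row).
vertical_bar = [[1.0 if j == 2 else 0.0 for j in range(6)] for i in range(6)]
horizontal_bar = [[1.0 if i == 2 else 0.0 for j in range(6)] for i in range(6)]

bar_v = edge_density(vertical_bar, VERTICAL_EDGE)
bar_h = edge_density(vertical_bar, HORIZONTAL_EDGE)
dash_v = edge_density(horizontal_bar, VERTICAL_EDGE)
dash_h = edge_density(horizontal_bar, HORIZONTAL_EDGE)
print(bar_v, bar_h, dash_v, dash_h)
```

For the "|" image the vertical-edge density dominates, and for the "-" image the horizontal-edge density dominates, which is the kind of feature a softmax layer can separate.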
47. Simple Model to Classify "+", "-", "|"
    ● In this model, we use pre-defined (fixed) filters to capture vertical and horizontal edges.
    ● Question: How can we choose appropriate filters for more general images?
    [Figure: Input image → Convolution filter → Pooling layer → Softmax]
48. Exercise
    ● Run through the Notebook:
      ○ No.3 Convolutional Filter Example
      ○ No.4 Toy model with static filters
49. Dynamic Optimization of Convolution Filters
50. Dynamic Optimization of Filters
    ● In the convolutional neural network, we define the filters as Variables. The optimization algorithm tries to adjust the filter values to achieve better predictions.
      ○ The following code applies 16 filters to images with 28x28 pixels (= 784-dimensional vectors).
    ```python
    num_filters = 16

    # Placeholder to store images
    x = tf.placeholder(tf.float32, [None, 784])
    x_image = tf.reshape(x, [-1,28,28,1])

    # Define filters as Variables
    W_conv = tf.Variable(tf.truncated_normal([5,5,1,num_filters],
                                             stddev=0.1))

    # Apply filters and pooling layer
    h_conv = tf.nn.conv2d(x_image, W_conv,
                          strides=[1,1,1,1], padding='SAME')
    h_pool = tf.nn.max_pool(h_conv, ksize=[1,2,2,1],
                            strides=[1,2,2,1], padding='SAME')
    ```
51. Exercise
    ● Run through the Notebook:
      ○ No.5 Single layer CNN for MNIST
      ○ No.6 Single layer CNN for MNIST result
    (Since the filtered images contain negative pixel values, the background of the images is not necessarily white.)
52. Multi-layer Convolutional Neural Network
    ● By adding more filter (and pooling) layers, we can build a multi-layer CNN.
      ○ Filters in different layers are believed to recognize different kinds of features, but the details are still under study.
      ○ The dropout layer is used to avoid overfitting by randomly cutting part of the connections during training.
    [Figure: Raw Image → Convolution Filters → Pooling Layer → Convolution Filters → Pooling Layer → Fully-connected Layer → Dropout Layer → Softmax Function]
53. Exercise
    ● Run through the Notebook:
      ○ No.7 CNN Handwriting Recognizer
    [Figure: the images that passed through the second filters, and the prediction for the handwritten number]
54. Neural Network Basics
55. Single Layer Neural Network
    ● This is an example of a single layer neural network.
      ○ Two nodes in the hidden layer transform the value of a linear function with the activation function.
      ○ There are several choices for the activation function (logistic sigmoid, hyperbolic tangent, ReLU). We will use the hyperbolic tangent in the following examples.
    [Figure: a network with a hidden layer and an output layer]
56. Single Layer Neural Network
    ● Since the output of the hyperbolic tangent changes quickly from -1 to 1, the outputs from the hidden layer effectively split the input space into discrete regions with straight lines.
      ○ In this example, the plane is split into 4 regions (①, ②, ③, ④).
57. Single Layer Neural Network
    ● Since the logistic sigmoid in the output node can classify the plane with a straight line, this single layer network can classify the 4 regions into two classes as below.
    [Figure: regions ① to ④, three of them labeled ◯ and one labeled ✕]
58. Limitation of Single Layer Network
    ● On the other hand, this neural network cannot classify data in the following pattern.
      ○ How can you extend the network to cope with this data?
    [Figure: regions ① to ④, two labeled ◯ and two labeled ✕, which cannot be separated with a straight line]
59. Neural Network as Logical Units
    ● A single node (consisting of a linear function and the activation function) works as a logical unit for AND or OR as below.
    [Figure: input patterns classified by the AND unit and by the OR unit]
60. Neural Network as Logical Units
    ● Since the previous pattern is equivalent to XOR, we can combine the AND and OR units to make an XOR unit. As a result, the following "enhanced output node" can classify the previous pattern.
    [Figure: AND, OR, and XOR units applied to regions ① to ④, two labeled ◯ and two labeled ✕]
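The construction of an XOR unit from AND and OR units can be sketched with a few tanh nodes. The weights, biases, and the `gain` factor below are hypothetical values chosen for illustration, not the ones used on the slides; inputs are encoded as -1 (false) and +1 (true), matching the output range of the hyperbolic tangent.

```python
import math

def node(inputs, weights, bias, gain=10.0):
    """A single node: linear function followed by a steep tanh activation."""
    a = bias + sum(w * x for w, x in zip(weights, inputs))
    return math.tanh(gain * a)

def and_unit(z1, z2):
    """Close to +1 only when both inputs are +1."""
    return node([z1, z2], [1.0, 1.0], -1.0)

def or_unit(z1, z2):
    """Close to +1 when at least one input is +1."""
    return node([z1, z2], [1.0, 1.0], 1.0)

def xor_unit(z1, z2):
    """OR and not AND: an output node fed by the two units above."""
    return node([or_unit(z1, z2), and_unit(z1, z2)], [1.0, -1.0], -1.0)

for z1 in (-1.0, 1.0):
    for z2 in (-1.0, 1.0):
        print(z1, z2, round(xor_unit(z1, z2), 3))
```

The output is positive exactly when the two inputs differ, which is the pattern that a single straight line could not separate.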
61. Neural Network as Logical Units
    ● Combining the hidden layer and the "enhanced output unit" results in the following 2-layer neural network.
      ○ The first hidden layer extracts features as a combination of binary variables, and the second hidden layer plus the output node classify them as an XOR logical unit.
    [Figure: "Extracting Features" (first hidden layer) and "Classifying with XOR Logical Unit" (second hidden layer and output node)]
62. Exercise
    ● You can see the actual result on Neural Network Playground.
      ○ http://goo.gl/VIvOaQ
63. Thank you!