Understanding Deep Learning for Big Data: The complexity and scale of big data impose tremendous challenges on their analysis. Yet big data also offer us great opportunities. Some nonlinear phenomena, features, or relations, which are unclear or cannot be inferred reliably from small and medium-sized data, become clear and can be learned robustly from big data. Typically, the form of the nonlinearity is unknown to us and needs to be learned from data as well. Being able to harness the nonlinear structures in big data could allow us to tackle problems that were previously impossible, or to obtain results far better than the previous state of the art.
Nowadays, deep neural networks are the methods of choice for large-scale nonlinear learning problems. What makes deep neural networks work? Is there any general principle for tackling high-dimensional nonlinear problems that we can learn from deep neural networks? Can we design competitive or better alternatives based on such knowledge? To make progress on these questions, my machine learning group performed both theoretical and experimental analyses of existing and new deep learning architectures, investigating three crucial aspects: the usefulness of the fully connected layers, the advantage of the feature learning process, and the importance of the compositional structures. Our results point to some promising directions for future research and provide guidelines for building new deep learning models.
5. Traditional image features not learned end-to-end
[Pipeline diagram: divide image into patches → handcrafted feature extractor (e.g., SIFT) → combine features → learn classifier; a code sketch of this baseline follows]
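The diagram above is the pre-deep-learning baseline that the talk contrasts with end-to-end training: features are designed by hand and the classifier is learned separately. Below is a minimal sketch of such a pipeline, assuming HOG descriptors in place of SIFT and toy placeholder data; the patch size, dataset, and classifier are illustrative, not the talk's actual setup.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def handcrafted_features(image, patch=16):
    """Divide a grayscale image into patches and describe each with HOG (stand-in for SIFT)."""
    h, w = image.shape
    descriptors = []
    for i in range(0, h - patch + 1, patch):
        for j in range(0, w - patch + 1, patch):
            d = hog(image[i:i + patch, j:j + patch],
                    pixels_per_cell=(8, 8), cells_per_block=(1, 1))
            descriptors.append(d)
    return np.concatenate(descriptors)          # "combine features" step

# Placeholder data: (N, 64, 64) grayscale images with labels in {0, ..., 9}.
images = np.random.rand(100, 64, 64)
labels = np.random.randint(0, 10, size=100)

X = np.stack([handcrafted_features(im) for im in images])
clf = LinearSVC().fit(X, labels)                # classifier trained separately, not end-to-end
```

The point of the contrast: nothing in this pipeline adapts the features to the task, which is exactly what end-to-end training changes.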
11. Kernel machine learns faster
ImageNet: 1.3M original images, 1000 classes
Random cropping and mirroring of images in a streaming fashion
[Plot: test top-1 error (%) vs. number of training samples (10^5 to 10^8), comparing jointly-trained neural net, fixed neural net, and doubly SGD; final errors of 47.8%, 44.5%, and 42.6%; training took about 1 week using a GPU; random guessing gives 99.9% error]
(a sketch of the doubly SGD update follows)
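The "doubly SGD" curve refers to doubly stochastic gradients for kernel machines: each update is stochastic in both the sampled data point and the sampled random feature used to approximate the kernel. Below is a minimal sketch of that core loop, assuming an RBF kernel, squared loss, and toy-scale data; none of these choices reflect the ImageNet configuration above.

```python
import numpy as np

def doubly_sgd(X, y, n_iters=1000, gamma=1.0, step=1.0):
    """Fit f(x) = sum_t alpha_t * phi_t(x) using one data point and one random feature per step."""
    n, d = X.shape
    alpha = np.zeros(n_iters)

    def feature(x, t):
        # Seeding by t lets us regenerate (w_t, b_t) on demand instead of storing them.
        rng = np.random.default_rng(t)
        w = rng.normal(scale=np.sqrt(2.0 * gamma), size=d)   # spectral measure of the RBF kernel
        b = rng.uniform(0.0, 2.0 * np.pi)
        return np.sqrt(2.0) * np.cos(x @ w + b)              # random Fourier feature

    for t in range(n_iters):
        i = np.random.randint(n)                              # stochastic in the data
        pred = sum(alpha[s] * feature(X[i], s) for s in range(t))
        grad = pred - y[i]                                    # squared-loss residual
        alpha[t] = -(step / (t + 1)) * grad * feature(X[i], t)  # stochastic in the feature

    def predict(x):
        return sum(alpha[t] * feature(x, t) for t in range(n_iters))
    return predict

# Toy regression example (purely illustrative):
X = np.random.randn(200, 5)
y = np.sin(X[:, 0])
f = doubly_sgd(X, y)
```

The seed-by-iteration trick means random features never need to be stored, which is what lets this kind of kernel machine scale in the number of features.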
12. Similar results with MNIST8M
Classification with handwritten digits
8M images, 10 classes
LeNet5
13. Similar results with CIFAR10
Classification with internet images
60K images, 10 classes
32. Deep learning with time-varying computation graph
[Figure: mini-batch 1 of interaction events at times t0 < t1 < t2 < t3 along the time axis]
Computation graph of the RNN determined by:
1. The bipartite interaction graph
2. The temporal ordering of events
(a code sketch of this construction follows)
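Below is a minimal sketch of how such a time-varying computation graph can be laid out for one mini-batch: two coupled recurrent cells walk through the events in temporal order, and each event updates the embeddings of the user and the item it connects in the bipartite graph. The cell type, dimensions, and names are illustrative assumptions in the spirit of the coevolution model, not the exact architecture from the talk.

```python
import torch
import torch.nn as nn

class CoevolveSketch(nn.Module):
    def __init__(self, n_users, n_items, dim=32):
        super().__init__()
        self.users = nn.Parameter(torch.zeros(n_users, dim))   # initial user embeddings
        self.items = nn.Parameter(torch.zeros(n_items, dim))   # initial item embeddings
        self.user_cell = nn.GRUCell(dim, dim)   # user state driven by the item it touched
        self.item_cell = nn.GRUCell(dim, dim)   # item state driven by the user who touched it

    def forward(self, events):
        """events: [(user_id, item_id), ...] sorted by event time t0 < t1 < ..."""
        users, items = self.users, self.items
        for u, i in events:                      # temporal ordering defines the graph
            new_u = self.user_cell(items[i:i + 1], users[u:u + 1])
            new_i = self.item_cell(users[u:u + 1], items[i:i + 1])
            users = torch.cat([users[:u], new_u, users[u + 1:]])  # functional update keeps autograd intact
            items = torch.cat([items[:i], new_i, items[i + 1:]])
        return users, items

# One mini-batch of (user, item) interaction events, already time-ordered.
model = CoevolveSketch(n_users=5, n_items=7)
users, items = model([(0, 3), (2, 3), (0, 1), (4, 6)])
```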
33. Much improved prediction on Reddit dataset
[Two result panels: next item prediction and return time prediction]
1,000 users, 1,403 groups, ~10K interactions
MAR: mean absolute rank difference
MAE: mean absolute error (hours)
34. Predicting efficiency of solar panel materials
Task: predict the Power Conversion Efficiency (PCE, 0-12%) of organic solar panel materials

Dataset: Harvard Clean Energy Project
Data points: 2.3 million
Type: molecule
Atom types: 6
Avg # nodes: 28
Avg # edges: 33
36. Improved prediction with small model
Structure2vec gets ~4% relative error with a 10,000x smaller model!

Method         | Test MAE | Test RMSE | # parameters
Mean predictor | 1.986    | 2.406     | 1
WL level-3     | 0.143    | 0.204     | 1.6M
WL level-6     | 0.096    | 0.137     | 1,378M
structure2vec  | 0.085    | 0.117     | 0.1M

10% of the data held out for testing
37. Take Home Message:
Deep fully connected layers not the key
Exploit structure (CNN, Coevolution, Structure2vec)
Train end-to-end
Editor's Notes
Explain why we focus on the performance rather than interpreting the results
The task: classification (maybe one slide)
Have one slide for the neural networks.
The actual classification number
Not improving, finish it.
Make the meaning of convergence clearer: for a given number of samples, lower error; for the same error, fewer samples.
Emphasize what it means to be scalable (compare to alternative methods).
Take features from the last pooling layer of LeNet5 [LeCun'12]
Put H(x) on the same line. Too busy: remove the top, use a smaller figure, show fewer g's.
Need theory cited. Lower bound.
Here we tried a large dataset, where the task is to predict the power conversion efficiency (PCE) of molecules. Accurate prediction is essential for screening new forms of energy and materials. The dataset we used consists of 2.3 million samples from the Harvard Clean Energy Project, and the figure here shows that the PCE ranges from 0 to 11.
Now is the time to put them together. We start with the zero embeddings, and then perform one step of the fixed point equation update. For example, to update mu_2, we use its neighborhood embeddings and input features. Similarly, we can get updates for all other posterior marginal embeddings. As in traditional graphical model inference, we need to iterate the fixed point update several times. Intuitively, this allows each embedding to capture more and more neighborhood information. In the last step, we merge those marginal embeddings to get a vector representation of the entire structured data. This model can be trained in an end-to-end fashion. Also, the parameters in the embedding iteration layers are shared, which makes it similar to a recurrent neural network. We can simply extend it by using an LSTM to formulate the fixed point equation.
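Below is a minimal sketch of the iteration just described, assuming a simple ReLU form of the fixed point update and illustrative layer sizes (the published structure2vec update may differ in detail): embeddings start at zero, the shared-parameter update is applied a few times, and the node embeddings are then merged and passed to a readout.

```python
import torch
import torch.nn as nn

class Structure2VecSketch(nn.Module):
    def __init__(self, in_dim, dim=64, n_iters=4):
        super().__init__()
        self.n_iters = n_iters
        self.w_node = nn.Linear(in_dim, dim)    # lift input (e.g. atom) features
        self.w_neigh = nn.Linear(dim, dim)      # shared across iterations, RNN-like
        self.readout = nn.Linear(dim, 1)        # e.g. predict PCE as a scalar

    def forward(self, x, adj):
        """x: (n_nodes, in_dim) node features; adj: (n_nodes, n_nodes) adjacency matrix."""
        mu = torch.zeros(x.size(0), self.w_neigh.out_features)   # zero-initialized embeddings
        for _ in range(self.n_iters):                             # fixed point updates
            mu = torch.relu(self.w_node(x) + self.w_neigh(adj @ mu))
        graph_vec = mu.sum(dim=0)                                  # merge node embeddings
        return self.readout(graph_vec)                             # end-to-end trainable
```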
Here are the results we reported. We compared with the Weisfeiler-Lehman kernel of different degrees. Since the kernel matrix cannot be computed at this scale, we manually created a high-dimensional explicit feature map for it. Due to its high dimensionality, we can work with at most degree 6.
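For illustration, here is a sketch of such an explicit Weisfeiler-Lehman feature map: relabel each node with a hash of its own label and its sorted neighbor labels, and count label occurrences at every level up to the chosen degree. Hashing into a fixed-width index space is an assumption made here to keep the feature dimension bounded, not necessarily how it was done in the experiments.

```python
from collections import Counter

def wl_explicit_features(labels, adj_list, degree=3, width=2**20):
    """labels: node labels (e.g. atom types); adj_list: neighbor index lists per node."""
    counts = Counter()
    for _ in range(degree + 1):
        for lab in labels:
            counts[hash(lab) % width] += 1      # count labels at the current WL level
        # WL relabeling: new label = hash of (own label, sorted neighbor labels)
        labels = [hash((lab, tuple(sorted(labels[u] for u in nbrs))))
                  for lab, nbrs in zip(labels, adj_list)]
    return counts                                # sparse explicit feature map

# Tiny example graph: a path of three atoms C-O-C.
feats = wl_explicit_features(["C", "O", "C"], [[1], [0, 2], [1]], degree=2)
```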
We can see that we get about 4% relative error on this prediction task. Also, to get a comparable result, the Weisfeiler-Lehman kernel requires 1.3 billion parameters. We get better results with only 0.1M parameters, a model 10,000 times smaller than the alternatives.