This document summarizes Yvonne Matos' presentation on learning predictive modeling by participating in Kaggle challenges using TSA passenger screening data.
The key points are:
1) Matos started with a small subset of 120 images from one body zone to build initial neural network models and address challenges of large data sizes and compute requirements.
2) Through iterative tuning, her best model achieved good performance identifying non-threat images but had a high false negative rate for threats.
3) Her next steps were to reduce the false negative rate, run models on Google Cloud to handle full data sizes, and prepare the best model for real-world use.
Learning Predictive Modeling with TSA and Kaggle
1. Learning Predictive Modeling with TSA and Kaggle: Tips for Beginners
Yvonne K. Matos
ChiPy Data Science SIG Meeting, November 15, 2017
Photo: Benoit Tessier/Reuters
2. Where to start with deep learning?
Activation functions, backpropagation, neural networks, output layers, weights, recurrent neural networks, sigmoid functions, loss functions, decision trees…
3. Define goals, pick a project, dive in!
1. Work with large datasets and cloud computing
2. Develop deep learning algorithms
3. Increase experience with Python
4. Get hired as a data scientist!
My Goals, ChiPy Mentorship Program
Data Science Pipeline: Business Question → Data Question → Data Collection → Data Loading → Data Cleaning → Preprocessing → Modeling → Validation → Data Answer → Business Answer, with Exploratory Analysis throughout.
Data science is 80% cleaning and preprocessing, 20% modeling.
4. TSA Passenger Algorithm Screening Challenge
Problem: High false alarm rates create bottlenecks at airport checkpoints.
Challenge: Create an algorithm with a lower false alarm rate, using a dataset of scan images with simulated threats.
5. Exploration: Visualizing the Images
• 3D images
• 3 TB dataset
Raw data for the lowest-res 10 MB image: 4D array, shape (n, 512, 660, 16)
Higher-res 330 MB image: shape (n, 512, 512, 660)
Other 3D images: 5D array, shape (n, 128, 128, 128, 1)
6. TSA 3D images vs 2D RGB images
2D RGB image: (n, 512, 660, 3) → samples, dimensions, 3 channels
TSA 3D image: (n, 512, 660, 16) → samples, dimensions, 16 channels
Same 4D layout, just 16 channels instead of 3. ("If I fits, I sits!")
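To make the analogy concrete, here is a minimal sketch (not from the talk; the random data is a stand-in for real scans) comparing the two array layouts in NumPy:

import numpy as np

# A batch of 4 RGB images: (samples, height, width, 3 color channels)
rgb_batch = np.random.rand(4, 512, 660, 3)

# A batch of 4 TSA low-res scans: same 4D layout, 16 channels instead of 3
tsa_batch = np.random.rand(4, 512, 660, 16)

print(rgb_batch.shape)  # (4, 512, 660, 3)
print(tsa_batch.shape)  # (4, 512, 660, 16)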
7. Anticipated Challenges
• First Python project
• 10 MB per low res file = long run times
• Enormous scope
– Full training in cloud = $$$$$
Plan of attack
• Begin small
• Scale up locally
• Run in the cloud
8. Start small with a data subset
19,499 potential threats from 17 body zones:
• 17,628 non-threat
• 1,871 threat (9.6%)
Zone 6: 1,148 total images
• 1,032 non-threat
• 116 threat (10%)
Lowest-res images: ~10 MB each → start with 120 images
9. Image Preprocessing: Getting x and y Data
120 samples: 102 non-threat, 18 threat

X data:

import os
import numpy as np

# List the sample files and build full paths
z6samlist = os.listdir('/Users/Yvonne/Desktop/TSA_Kaggle/Z6_n30_9.18.17')
z6paths = ['/Users/Yvonne/Desktop/TSA_Kaggle/Z6_n30_9.18.17/' + z6sam for z6sam in z6samlist]
del z6paths[0]  # drop the first listing entry (likely a hidden file, e.g. .DS_Store)

# Read each image and stack into a single 4D array
arr_list = [read_data(z6path) for z6path in z6paths]
x = np.stack(arr_list, axis=0)

# X scaling: min-max normalize all values to [0, 1]
maximum = np.max(x)
minimum = np.min(x)
x = (x - minimum) / (maximum - minimum)

X shape: (120, 512, 660, 16); X size: 2.6 GB

Y data:

# Labels come from the third column of the sample DataFrame
y = z6sample_120.iloc[:, 2].values

Y shape: (120,)
10. Neural Networks Attempt to Model the Human Brain
X = independent variables for each observation. Scaling X is a must!
W = weights; ŷ = predicted value; y = actual value.
Diagram: inputs x1…x4 pass through weighted sums (Σ) in a hidden layer to an output layer that produces ŷ.
Goal: minimize the cost C, the error between the prediction ŷ and the actual value y.
Slide concept credit: SuperDataScience
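As a minimal illustration of those pieces (a sketch, not code from the talk; the inputs, weights, and squared-error cost below are placeholder choices):

import numpy as np

def sigmoid(z):
    # Sigmoid activation squashes the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.7, 0.1, 0.9])  # one observation x1..x4, already scaled
w = np.random.rand(4)               # weights w (learned during training)
y = 1.0                             # actual value

y_hat = sigmoid(np.dot(w, x))       # neuron: weighted sum, then activation
C = 0.5 * (y_hat - y) ** 2          # cost: prediction error to minimize
print(y_hat, C)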
12. Additional Info on Neural Networks: Getting Started
• Udemy course: Deep Learning A-Z™: Hands-On Artificial Neural Networks
• YouTube
• Blog: https://adeshpande3.github.io/adeshpande3.github.io/
• Online book: http://neuralnetworksanddeeplearning.com/index.html
13. Challenges: Getting Ready to Develop the First Model
No GPU support on Mac for TensorFlow
• Attempted solution: build TensorFlow from an unsupported version compatible with OpenCL
• Lesson: create a separate environment
14. Building the First Model
One line of code = one layer.
Samples: 96 train, 24 test.

# Hold out 20% of the 120 samples for testing (96 train / 24 test)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

from keras.models import Sequential
from keras.layers import Dense, Flatten

classifier = Sequential()
# Input layer over the (512, 660, 16) image
classifier.add(Dense(25, input_shape=(512, 660, 16),
                     activation='relu', kernel_initializer='uniform'))
classifier.add(Flatten())
# Hidden layer
classifier.add(Dense(25, activation='relu',
                     kernel_initializer='uniform'))
# Output layer: sigmoid for binary threat / non-threat
classifier.add(Dense(1, activation='sigmoid',
                     kernel_initializer='uniform'))
classifier.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])
classifier.fit(x_train, y_train, batch_size=10, epochs=50)
15. First Model Learns on the Training Set
Diagram: for each training sample (index 0, 1, 2, …, 95), the model compares its prediction ŷ against the label y (0 or 1), computes the cost C, and adjusts the weights w1, w2, ….
16. First Model Validation on the Test Set
Challenge: the Jupyter notebook disconnects mid-run.
• Alternative: do long runs in a .py file

# Predict on the held-out test set and threshold probabilities at 0.5
y_pred = classifier.predict(x_test)
y_pred = (y_pred > 0.5)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

Confusion matrix (rows = actual, columns = predicted):

              Predicted No           Predicted Yes
Actual No     22                     0 (false positives)
Actual Yes    2 (false negatives)    0

Overfitting is an issue! The model predicts "No" for every test sample.
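A quick way to read those numbers off the matrix (a sketch; the unpacking order follows scikit-learn's convention of rows = actual, columns = predicted):

# cm is the 2x2 matrix printed above: [[22, 0], [2, 0]]
tn, fp, fn, tp = cm.ravel()            # true neg, false pos, false neg, true pos
false_negative_rate = fn / (fn + tp)   # 2 / (2 + 0) = 1.0: every threat missed
print(tn, fp, fn, tp, false_negative_rate)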
17. Some Ways to Tune a Model and Address Overfitting
• Increase or decrease epochs and batch size
• Add dropout layers
• Add hidden layers and increase nodes
• Test different activation functions
(See the sketch below for how these knobs appear in code.)
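A rough sketch of those tuning knobs as parameters of a model-building function (not the talk's code; build_model and its defaults are illustrative, following the Keras patterns used elsewhere in the slides):

from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout

def build_model(hidden_layers=1, nodes=25, dropout_rate=0.2, activation='relu'):
    model = Sequential()
    model.add(Dense(nodes, input_shape=(512, 660, 16),
                    activation=activation, kernel_initializer='uniform'))
    model.add(Dropout(rate=dropout_rate))  # dropout fights overfitting
    model.add(Flatten())
    for _ in range(hidden_layers):         # more hidden layers / more nodes
        model.add(Dense(nodes, activation=activation,
                        kernel_initializer='uniform'))
        model.add(Dropout(rate=dropout_rate))
    model.add(Dense(1, activation='sigmoid', kernel_initializer='uniform'))
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Epochs and batch size are tuned at fit time:
# build_model(hidden_layers=4, nodes=50).fit(x_train, y_train, batch_size=25, epochs=50)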
20. Challenge: Code Terminates
Memory usage is too high with GridSearchCV, which evaluates many parameter sets (Parameters 1, 2, 3, 4, …) in one run.

print(best_parameters)
{'batch_size': 25, 'nb_epoch': 100, 'optimizer': 'adam'}
print(best_accuracy)
0.796
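For context, those results presumably come from a grid search along these lines (a sketch, not shown in the slides; build_classifier is assumed to accept an optimizer argument and return a compiled model like the one on slide 14, and the candidate values are illustrative):

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

# Wrap the Keras model builder so scikit-learn can drive it
classifier = KerasClassifier(build_fn=build_classifier)
parameters = {'batch_size': [10, 25],
              'nb_epoch': [50, 100],
              'optimizer': ['adam', 'rmsprop']}
grid_search = GridSearchCV(estimator=classifier, param_grid=parameters,
                           scoring='accuracy', cv=10)
grid_search = grid_search.fit(x_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_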
Alternative: take an iterative approach instead, trying one parameter setting at a time so only one model lives in memory, e.g.:

# Sketch: vary one hyperparameter at a time instead of a full grid
for batch_size in [10, 25]:
    classifier = build_classifier()
    classifier.fit(x_train, y_train, batch_size=batch_size, epochs=50)
21. Scaling Up From 120 to ½ the Zone 6 Samples (573)
Best model (its confusion matrix was shown as a figure):
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout

classifier = Sequential()
classifier.add(Dense(25, input_shape=(512, 660, 16),
                     activation='relu', kernel_initializer='uniform'))
classifier.add(Dropout(rate=0.2))
classifier.add(Flatten())
# Four hidden layers of 50 nodes, each followed by dropout
classifier.add(Dense(50, activation='relu', kernel_initializer='uniform'))
classifier.add(Dropout(rate=0.2))
classifier.add(Dense(50, activation='relu', kernel_initializer='uniform'))
classifier.add(Dropout(rate=0.2))
classifier.add(Dense(50, activation='relu', kernel_initializer='uniform'))
classifier.add(Dropout(rate=0.2))
classifier.add(Dense(50, activation='relu', kernel_initializer='uniform'))
classifier.add(Dropout(rate=0.2))
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='uniform'))
classifier.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])
classifier.fit(x_train, y_train, batch_size=25, epochs=50)
22. Challenges Scaling Up
Next step: scale up to the entire Zone 6 sample (1,147 images).

arr_list = [read_data(z6path) for z6path in z6paths]
x = np.stack(arr_list, axis=0)

X size = ~24.8 GB. TOO BIG to hold in memory at once!
Alternative: use an online learning model, iterating through each image with a partial fit (see the sketch below).
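A minimal sketch of that online approach (not the talk's code; Keras's train_on_batch is one way to do a partial fit here, and the labels list plus the reuse of the earlier minimum/maximum scaling values are assumptions):

import numpy as np

# Assumes `classifier` is the compiled Keras model, `z6paths` and `read_data`
# are as defined earlier, and `labels[i]` is the threat label for z6paths[i]
for epoch in range(5):
    for path, label in zip(z6paths, labels):
        img = read_data(path)                                # one ~10 MB image at a time
        x_batch = img[np.newaxis, ...]                       # -> shape (1, 512, 660, 16)
        x_batch = (x_batch - minimum) / (maximum - minimum)  # same min-max scaling
        classifier.train_on_batch(x_batch, np.array([label]))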
23. Working with Google Cloud
Can take a while to connect.
• Good tutorial: http://cs231n.github.io/gce-tutorial/
Limited free credits ($300)
• Rate depends on power & region, so estimate run cost first
• Plan: use all credits for 1–2 runs
24. Working Model to Date
• High rate of identifying non-threats
• Low rate of false-positive threat IDs
BUT…
• Also a high rate of false negatives
26. What’s Next?
One month to go:
• Reduce the false negative rate
• Run in Google Cloud Platform
• Productionize the model (see the sketch below)
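A common first step toward productionizing (a sketch, not from the talk; the file name is illustrative) is persisting the trained Keras model so a separate process can load it for scoring:

from keras.models import load_model

# Save architecture + weights + optimizer state to a single file
classifier.save('tsa_zone6_model.h5')

# Later, e.g. inside a scoring service, reload and predict
restored = load_model('tsa_zone6_model.h5')
predictions = restored.predict(x_test)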
27. Key Takeaways from Challenging Projects
• Big-data challenges arise even with few samples
• Stay flexible with project scope
• Don't be intimidated
• Lots can be learned in a short time!
28. Thanks!
• Thanks for coming tonight
• ChiPy mentorship program
• Trunk Club for hosting
Editor's Notes
The focus is on model building, but some image recognition and preprocessing is required.
There is an extension for Jupyter notebooks that helps with presentations.
Scaling X is a must.
A neuron computes a weighted sum of its inputs.
Neural networks learn by adjusting their weights.
The cost function is the error in prediction, and the goal is to minimize it.