This document summarizes Yvonne Matos' presentation on learning predictive modeling by participating in Kaggle challenges using TSA passenger screening data.
The key points are:
1) Matos started with a small subset of 120 images from one body zone to build initial neural network models and address challenges of large data sizes and compute requirements.
2) Through iterative tuning, her best model achieved good performance identifying non-threat images but had a high false negative rate for threats.
3) Her next steps were to reduce the false negative rate, run models on Google Cloud to handle full data sizes, and prepare the best model for real-world use.
Learning Predictive Modeling with TSA and Kaggle
1. Learning Predictive Modeling with TSA and Kaggle: Tips for Beginners
Yvonne K. Matos
ChiPy Data Science SIG Meeting, November 15, 2017
Photo: Benoit Tessier/Reuters
2. Where to start with deep learning?
Activation functions, backpropagation, neural networks, output layers, weights, recurrent neural networks, sigmoid functions, loss functions, decision trees…
3. Define goals, pick a project, dive in!
1. Work with large datasets and cloud computing
2. Develop deep learning algorithms
3. Increase experience with Python
4. Get hired as a data scientist!
My Goals, ChiPy Mentorship Program
Data Science Pipeline: Business Question → Data Question → Data Collection → Data Loading → Data Cleaning → Preprocessing → Modeling → Validation → Data Answer → Business Answer, with Exploratory Analysis throughout.
Data science is 80% cleaning and preprocessing, 20% modeling.
4. TSA Passenger Algorithm Screening Challenge
Problem: High false alarm rates create bottlenecks at airport checkpoints.
Challenge: Create an algorithm with a lower false alarm rate, using a dataset of scan images with simulated threats.
5. Exploration: Visualizing the Images
• 3D images
• 3 TB dataset
Raw data for the lowest-res 10 MB image: 4D array, shape (n, 512, 660, 16)
Higher-res 330 MB image: shape (n, 512, 512, 660)
Other 3D images: 5D array, shape (n, 128, 128, 128, 1)
6. TSA 3D images vs 2D RGB images
2D RGB image: (n, 512, 660, 3) → samples, dimensions, 3 channels
TSA 3D image: (n, 512, 660, 16) → samples, dimensions, 16 channels
Same 4D layout, just 16 channels instead of 3. ("If I fits, I sits!")
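To make the analogy concrete, here is a minimal sketch (not from the talk; the random data is a stand-in for real scans) comparing the two array layouts in NumPy:

import numpy as np

# A batch of 4 RGB images: (samples, height, width, 3 color channels)
rgb_batch = np.random.rand(4, 512, 660, 3)

# A batch of 4 TSA low-res scans: same 4D layout, 16 channels instead of 3
tsa_batch = np.random.rand(4, 512, 660, 16)

print(rgb_batch.shape)  # (4, 512, 660, 3)
print(tsa_batch.shape)  # (4, 512, 660, 16)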
7. Anticipated Challenges
• First Python project
• 10 MB per low res file = long run times
• Enormous scope
– Full training in cloud = $$$$$
Plan of attack
• Begin small
• Scale up locally
• Run in the cloud
8. Start small with a data subset
19,499 potential threats from 17 body zones:
• 17,628 non-threat
• 1,871 threat (9.6%)
Zone 6: 1,148 total images
• 1,032 non-threat
• 116 threat (10%)
Lowest-res images: ~10 MB each → start with 120 images
9. Image Preprocessing: Getting x and y Data
120 samples: 102 non-threat, 18 threat

X data:

import os
import numpy as np

# List the sample files and build full paths
z6samlist = os.listdir('/Users/Yvonne/Desktop/TSA_Kaggle/Z6_n30_9.18.17')
z6paths = ['/Users/Yvonne/Desktop/TSA_Kaggle/Z6_n30_9.18.17/' + z6sam for z6sam in z6samlist]
del z6paths[0]  # drop the first listing entry (likely a hidden file, e.g. .DS_Store)

# Read each image and stack into a single 4D array
arr_list = [read_data(z6path) for z6path in z6paths]
x = np.stack(arr_list, axis=0)

# X scaling: min-max normalize all values to [0, 1]
maximum = np.max(x)
minimum = np.min(x)
x = (x - minimum) / (maximum - minimum)

X shape: (120, 512, 660, 16); X size: 2.6 GB

Y data:

# Labels come from the third column of the sample DataFrame
y = z6sample_120.iloc[:, 2].values

Y shape: (120,)
10. Neural Networks Attempt to Model the Human Brain
X = independent variables for each observation. Scaling X is a must!
W = weights; ŷ = predicted value; y = actual value.
Diagram: inputs x1…x4 pass through weighted sums (Σ) in a hidden layer to an output layer that produces ŷ.
Goal: minimize the cost C, the error between the prediction ŷ and the actual value y.
Slide concept credit: SuperDataScience
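As a minimal illustration of those pieces (a sketch, not code from the talk; the inputs, weights, and squared-error cost below are placeholder choices):

import numpy as np

def sigmoid(z):
    # Sigmoid activation squashes the weighted sum into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.2, 0.7, 0.1, 0.9])  # one observation x1..x4, already scaled
w = np.random.rand(4)               # weights w (learned during training)
y = 1.0                             # actual value

y_hat = sigmoid(np.dot(w, x))       # neuron: weighted sum, then activation
C = 0.5 * (y_hat - y) ** 2          # cost: prediction error to minimize
print(y_hat, C)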
12. Additional Info on Neural Networks: Getting Started
• Udemy course: Deep Learning A-Z™: Hands-On Artificial Neural Networks
• YouTube
• Blog: https://adeshpande3.github.io/adeshpande3.github.io/
• Online book: http://neuralnetworksanddeeplearning.com/index.html
13. Challenges: Getting Ready to Develop the First Model
No GPU support on Mac for TensorFlow
• Attempted solution: build TensorFlow from an unsupported version compatible with OpenCL
• Lesson: create a separate environment
14. Building the First Model
One line of code = one layer.
Samples: 96 train, 24 test.

# Hold out 20% of the 120 samples for testing (96 train / 24 test)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

from keras.models import Sequential
from keras.layers import Dense, Flatten

classifier = Sequential()
# Input layer over the (512, 660, 16) image
classifier.add(Dense(25, input_shape=(512, 660, 16),
                     activation='relu', kernel_initializer='uniform'))
classifier.add(Flatten())
# Hidden layer
classifier.add(Dense(25, activation='relu',
                     kernel_initializer='uniform'))
# Output layer: sigmoid for binary threat / non-threat
classifier.add(Dense(1, activation='sigmoid',
                     kernel_initializer='uniform'))
classifier.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])
classifier.fit(x_train, y_train, batch_size=10, epochs=50)
15. First Model Learns on the Training Set
Diagram: for each training sample (index 0, 1, 2, …, 95), the model compares its prediction ŷ against the label y (0 or 1), computes the cost C, and adjusts the weights w1, w2, ….
16. First Model Validation on the Test Set
Challenge: the Jupyter notebook disconnects mid-run.
• Alternative: do long runs in a .py file

# Predict on the held-out test set and threshold probabilities at 0.5
y_pred = classifier.predict(x_test)
y_pred = (y_pred > 0.5)
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
print(cm)

Confusion matrix (rows = actual, columns = predicted):

              Predicted No           Predicted Yes
Actual No     22                     0 (false positives)
Actual Yes    2 (false negatives)    0

Overfitting is an issue! The model predicts "No" for every test sample.
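A quick way to read those numbers off the matrix (a sketch; the unpacking order follows scikit-learn's convention of rows = actual, columns = predicted):

# cm is the 2x2 matrix printed above: [[22, 0], [2, 0]]
tn, fp, fn, tp = cm.ravel()            # true neg, false pos, false neg, true pos
false_negative_rate = fn / (fn + tp)   # 2 / (2 + 0) = 1.0: every threat missed
print(tn, fp, fn, tp, false_negative_rate)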
17. Some Ways to Tune a Model and Address Overfitting
• Increase or decrease epochs and batch size
• Add dropout layers
• Add hidden layers and increase nodes
• Test different activation functions
(See the sketch below for how these knobs appear in code.)
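A rough sketch of those tuning knobs as parameters of a model-building function (not the talk's code; build_model and its defaults are illustrative, following the Keras patterns used elsewhere in the slides):

from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout

def build_model(hidden_layers=1, nodes=25, dropout_rate=0.2, activation='relu'):
    model = Sequential()
    model.add(Dense(nodes, input_shape=(512, 660, 16),
                    activation=activation, kernel_initializer='uniform'))
    model.add(Dropout(rate=dropout_rate))  # dropout fights overfitting
    model.add(Flatten())
    for _ in range(hidden_layers):         # more hidden layers / more nodes
        model.add(Dense(nodes, activation=activation,
                        kernel_initializer='uniform'))
        model.add(Dropout(rate=dropout_rate))
    model.add(Dense(1, activation='sigmoid', kernel_initializer='uniform'))
    model.compile(optimizer='adam', loss='binary_crossentropy',
                  metrics=['accuracy'])
    return model

# Epochs and batch size are tuned at fit time:
# build_model(hidden_layers=4, nodes=50).fit(x_train, y_train, batch_size=25, epochs=50)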
20. Challenge: Code Terminates
Memory usage is too high with GridSearchCV, which evaluates many parameter sets (Parameters 1, 2, 3, 4, …) in one run.

print(best_parameters)
{'batch_size': 25, 'nb_epoch': 100, 'optimizer': 'adam'}
print(best_accuracy)
0.796
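For context, those results presumably come from a grid search along these lines (a sketch, not shown in the slides; build_classifier is assumed to accept an optimizer argument and return a compiled model like the one on slide 14, and the candidate values are illustrative):

from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import GridSearchCV

# Wrap the Keras model builder so scikit-learn can drive it
classifier = KerasClassifier(build_fn=build_classifier)
parameters = {'batch_size': [10, 25],
              'nb_epoch': [50, 100],
              'optimizer': ['adam', 'rmsprop']}
grid_search = GridSearchCV(estimator=classifier, param_grid=parameters,
                           scoring='accuracy', cv=10)
grid_search = grid_search.fit(x_train, y_train)
best_parameters = grid_search.best_params_
best_accuracy = grid_search.best_score_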
Alternative: take an iterative approach instead, trying one parameter setting at a time so only one model lives in memory, e.g.:

# Sketch: vary one hyperparameter at a time instead of a full grid
for batch_size in [10, 25]:
    classifier = build_classifier()
    classifier.fit(x_train, y_train, batch_size=batch_size, epochs=50)
21. Scaling Up From 120 to ½ the Zone 6 Samples (573)
Best model (its confusion matrix was shown as a figure):
from keras.models import Sequential
from keras.layers import Dense, Flatten, Dropout

classifier = Sequential()
classifier.add(Dense(25, input_shape=(512, 660, 16),
                     activation='relu', kernel_initializer='uniform'))
classifier.add(Dropout(rate=0.2))
classifier.add(Flatten())
# Four hidden layers of 50 nodes, each followed by dropout
classifier.add(Dense(50, activation='relu', kernel_initializer='uniform'))
classifier.add(Dropout(rate=0.2))
classifier.add(Dense(50, activation='relu', kernel_initializer='uniform'))
classifier.add(Dropout(rate=0.2))
classifier.add(Dense(50, activation='relu', kernel_initializer='uniform'))
classifier.add(Dropout(rate=0.2))
classifier.add(Dense(50, activation='relu', kernel_initializer='uniform'))
classifier.add(Dropout(rate=0.2))
classifier.add(Dense(1, activation='sigmoid', kernel_initializer='uniform'))
classifier.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])
classifier.fit(x_train, y_train, batch_size=25, epochs=50)
22. Challenges Scaling Up
Next step: scale up to the entire Zone 6 sample (1,147 images).

arr_list = [read_data(z6path) for z6path in z6paths]
x = np.stack(arr_list, axis=0)

X size = ~24.8 GB. TOO BIG to hold in memory at once!
Alternative: use an online learning model, iterating through each image with a partial fit (see the sketch below).
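A minimal sketch of that online approach (not the talk's code; Keras's train_on_batch is one way to do a partial fit here, and the labels list plus the reuse of the earlier minimum/maximum scaling values are assumptions):

import numpy as np

# Assumes `classifier` is the compiled Keras model, `z6paths` and `read_data`
# are as defined earlier, and `labels[i]` is the threat label for z6paths[i]
for epoch in range(5):
    for path, label in zip(z6paths, labels):
        img = read_data(path)                                # one ~10 MB image at a time
        x_batch = img[np.newaxis, ...]                       # -> shape (1, 512, 660, 16)
        x_batch = (x_batch - minimum) / (maximum - minimum)  # same min-max scaling
        classifier.train_on_batch(x_batch, np.array([label]))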
23. Working with Google Cloud
Can take a while to connect.
• Good tutorial: http://cs231n.github.io/gce-tutorial/
Limited free credits ($300)
• Rate depends on power & region, so estimate run cost first
• Plan: use all credits for 1–2 runs
24. Working Model to Date
• High rate of identifying non-threats
• Low rate of false-positive threat IDs
BUT…
• Also a high rate of false negatives
26. What’s Next?
One month to go:
• Reduce the false negative rate
• Run in Google Cloud Platform
• Productionize the model (see the sketch below)
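A common first step toward productionizing (a sketch, not from the talk; the file name is illustrative) is persisting the trained Keras model so a separate process can load it for scoring:

from keras.models import load_model

# Save architecture + weights + optimizer state to a single file
classifier.save('tsa_zone6_model.h5')

# Later, e.g. inside a scoring service, reload and predict
restored = load_model('tsa_zone6_model.h5')
predictions = restored.predict(x_test)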
27. Key Takeaways from Challenging Projects
• Big-data challenges arise even with few samples
• Stay flexible with project scope
• Don't be intimidated
• Lots can be learned in a short time!
28. Thanks!
• Thanks for coming tonight
• ChiPy mentorship program
• Trunk Club for hosting
Editor's Notes
The focus is on model building, but some image recognition and preprocessing is required.
There is an extension for Jupyter notebooks that helps with presentations.
Scaling X is a must.
A neuron computes a weighted sum of its inputs.
Neural networks learn by adjusting their weights.
The cost function is the error in prediction, and the goal is to minimize it.