Building a Better TSA Screening Algorithm:
a ChiPy Mentorship Project
Yvonne K. Matos
ChiPy Meeting, December 14, 2017
Photo: Benoit Tessier/Reuters
My Goals, ChiPy Mentorship Program
1. Get more experience with Python
2. Develop deep learning algorithms
3. Work with large datasets and cloud computing
4. Get hired as a data scientist!
Data Science Pipeline
Data science is 80% cleaning and preprocessing, 20% modeling
Business Question → Data Question → Data Collection → Data Loading → Data Cleaning → Preprocessing → Modeling → Validation → Data Answer → Business Answer
Exploratory Analysis
TSA Passenger Screening Algorithm Challenge
Problem: High false alarm rates create bottlenecks at airport checkpoints.
Challenge: Create an algorithm with a lower false alarm rate, using a dataset of scan images with simulated threats.
Anticipated Challenges
• First Python project
• Huge files = long run times
• Enormous scope
– Full training in cloud = $$$$$
Plan of attack
• Begin small
• Scale up locally
• Run in the cloud
Exploration: Visualizing the Images
• 3D images
• 1,147 total
• 3 TB dataset
Raw data for lowest-res 10 MB image: 4D array (n, 512, 660, 16)
Axes: samples (n), dimensions (512, 660), channels (16)
Higher-res 330 MB image: (n, 512, 512, 660)
Start with two body zones
17 body zones * 1,147 images = 19,499 potential threats
Zone 17
• 1052 non-threat
• 95 threat (8.3%)
Zone 5
• 1041 non-threat
• 106 threat (9.2%)
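The counts above imply heavily imbalanced classes, so plain accuracy needs a reference point. A minimal sketch of that arithmetic, using only the numbers on this slide (the function names are illustrative, not from the project):

```python
# Counts copied from the slide; each zone covers the same 1,147 scans.
ZONE_COUNTS = {
    "Zone 17": {"non_threat": 1052, "threat": 95},
    "Zone 5": {"non_threat": 1041, "threat": 106},
}


def threat_rate(counts):
    """Fraction of scans in a zone that contain a threat."""
    total = counts["non_threat"] + counts["threat"]
    return counts["threat"] / total


def majority_baseline_accuracy(counts):
    """Accuracy of a model that always predicts the larger class."""
    total = counts["non_threat"] + counts["threat"]
    return max(counts.values()) / total


for zone, counts in ZONE_COUNTS.items():
    print(
        f"{zone}: threat rate {threat_rate(counts):.1%}, "
        f"always-non-threat accuracy {majority_baseline_accuracy(counts):.1%}"
    )
```

A model that never flags anything would already score about 91% accuracy on either zone, which is why precision and recall matter more than accuracy for this problem.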
Preprocessing: 3D to 2D Classification Problem
Raw data 10 MB image: 4D array (n, 512, 660, 16), 16 channels
Crop images and save to PNG format
[Images: Zone 5 cropped, Zone 17 cropped]
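The crop step can be sketched with NumPy slicing. This is an illustration under stated assumptions, not the project's code: the scan here is synthetic random data with the slide's (512, 660, 16) shape, and the crop window, the chosen channel, and the helper names `crop_zone` and `to_uint8` are all hypothetical.

```python
import numpy as np


def crop_zone(scan, channel, rows, cols):
    """Take one 2D channel of a (512, 660, 16) scan and crop a window from it."""
    view = scan[:, :, channel]          # 2D slice, shape (512, 660)
    r0, r1 = rows
    c0, c1 = cols
    return view[r0:r1, c0:c1]


def to_uint8(image):
    """Rescale pixel values to 0..255 so the crop fits an 8-bit PNG."""
    image = image.astype(np.float64)
    lo, hi = image.min(), image.max()
    if hi == lo:                        # flat image: avoid dividing by zero
        return np.zeros(image.shape, dtype=np.uint8)
    return np.round((image - lo) / (hi - lo) * 255).astype(np.uint8)


# Synthetic stand-in for a single scan (assumption: real scans share this shape).
rng = np.random.default_rng(0)
scan = rng.random((512, 660, 16))

# Hypothetical window; real zone coordinates would come from inspecting the images.
crop = to_uint8(crop_zone(scan, channel=0, rows=(0, 250), cols=(200, 450)))
print(crop.shape, crop.dtype)
```

An image library such as Pillow could then write each crop to disk, for example with `Image.fromarray(crop).save(...)`, sorted into one folder per class.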
Developing Models: Convolutional Neural Network
from keras.models import Sequential
from keras.layers import Convolution2D, MaxPooling2D, Flatten, Dense
from keras.preprocessing.image import ImageDataGenerator

# Convolutional Layers: three conv + max-pool stages on 250x250, 3-channel input
classifier = Sequential()
classifier.add(Convolution2D(32, (3, 3), input_shape=(250, 250, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
classifier.add(Convolution2D(32, (3, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
classifier.add(Convolution2D(64, (3, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))

# Fully Connected Layers: flatten, two hidden layers, one sigmoid output for the binary label
classifier.add(Flatten())
classifier.add(Dense(units=128, activation='relu'))
classifier.add(Dense(units=256, activation='relu'))
classifier.add(Dense(units=1, activation='sigmoid'))
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Image Augmentation: random shear, zoom, and horizontal flip on training images only
train_datagen = ImageDataGenerator(rescale=1./255, shear_range=0.2,
                                   zoom_range=0.2, horizontal_flip=True)
test_datagen = ImageDataGenerator(rescale=1./255)

# Read PNGs from one subdirectory per class (paths appear as on the slide)
training_set = train_datagen.flow_from_directory('C:PathZ17_training_png',
    target_size=(250, 250), batch_size=32, class_mode='binary')
test_set = test_datagen.flow_from_directory('C:PathZ17_test_png',
    target_size=(250, 250), batch_size=32, class_mode='binary')

# Train for 10 epochs, then save the model.
# Note: in Keras 2, steps_per_epoch and validation_steps count batches, not images.
classifier.fit_generator(training_set, steps_per_epoch=917, epochs=10,
    validation_data=test_set, validation_steps=230)
classifier.save('C:PathZ17_classifier.h5')
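In Keras 2, `steps_per_epoch` and `validation_steps` count batches drawn from the generator, not images. The values 917 and 230 match an 80/20 split of the 1,147 scans, so they may be image counts; that reading is an assumption, not something the slide states. Under that assumption, a short sketch of how many steps one pass over the data would take:

```python
import math


def steps_for(num_images, batch_size):
    """Number of batches needed to see every image once."""
    return math.ceil(num_images / batch_size)


BATCH_SIZE = 32      # from the flow_from_directory calls
TRAIN_IMAGES = 917   # assumption: steps_per_epoch on the slide is an image count
TEST_IMAGES = 230    # assumption: validation_steps on the slide is an image count

print("steps for one pass over training data:", steps_for(TRAIN_IMAGES, BATCH_SIZE))
print("steps for one pass over test data:", steps_for(TEST_IMAGES, BATCH_SIZE))
print("images drawn per epoch with steps_per_epoch=917:", TRAIN_IMAGES * BATCH_SIZE)
```

With the slide's settings, each epoch draws roughly 32 augmented variants of every training image. That is not necessarily wrong, but it lengthens run time, and an "epoch" then means many passes over the data rather than one.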
Preliminary Pipeline
Zone 17 Model
• Accuracy: 95%
• Precision: 67%
• Recall: 86%
Zone 5 Model
• Accuracy: 78%
• Precision: 27%
• Recall: 77%
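For reference, the three reported metrics and the false alarm rate named in the challenge all come from the same four confusion-matrix counts. A minimal sketch; the toy labels below are hypothetical and are not the project's data:

```python
def classification_metrics(y_true, y_pred):
    """Compute binary metrics, where 1 = threat and 0 = non-threat."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # threats caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # threats missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # correctly cleared
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Hypothetical toy data: 3 threats and 7 non-threats.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Low precision means that many flagged scans are false alarms, which is the bottleneck the challenge targets; recall measures how many real threats are caught.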
1 row = 1 body scan image
Body scan image → Crop, PNG → Model → Predictions → If predicted threat → FLAG!!
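The final decision step of this flow can be sketched as a simple rule: flag a scan if any zone model predicts a threat. The 0.5 threshold, the row layout, and the probabilities below are all hypothetical; the slide does not say how the predictions are combined.

```python
THRESHOLD = 0.5  # assumption: cut-off applied to each model's sigmoid output


def flag_scan(zone_probabilities, threshold=THRESHOLD):
    """Return the zones whose predicted threat probability reaches the threshold."""
    return [
        zone
        for zone, probability in sorted(zone_probabilities.items())
        if probability >= threshold
    ]


# Hypothetical rows: one row per body scan, one column per zone model.
rows = [
    {"scan_id": "scan_a", "zone_5": 0.10, "zone_17": 0.05},
    {"scan_id": "scan_b", "zone_5": 0.80, "zone_17": 0.20},
]

for row in rows:
    probabilities = {k: v for k, v in row.items() if k.startswith("zone_")}
    flagged = flag_scan(probabilities)
    print(row["scan_id"], "FLAG" if flagged else "clear", flagged)
```

Lowering the threshold would raise recall at the cost of more false alarms, so the cut-off is a tuning choice rather than a fixed value.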
Thanks!
• Thanks for coming tonight
• ChiPy mentorship program
• mHUB for hosting
