1st Place Presentation of the EY NextWave Data Science Challenge in 2019.
The following presentation covers the background, methods, deep learning models, parameters, and other details of the model we used for the data science challenge.
Overview of the Model: LSTM variants (multiplicative LSTM, LSTM-CNN, simple LSTM) were used to tackle a variable-length sequence binary classification problem.
This is a description of the model that won 1st place in the EY NextWave Data Science Competition (Hong Kong region). We approached the variable-length sequence binary classification problem using various LSTM models (multiplicative LSTM, LSTM-CNN, simple LSTM).
1st Place in EY Data Science Challenge
1. EY Hong Kong NextWave Data Science Challenge
Byung Eun Jeon - byungeuni
Hyunju Shim - sg04088
University of Hong Kong
May 30, 2019
2. Disclaimer
These presentation slides are not official EY presentation slides.
The winning team of the EY Hong Kong NextWave Data Science Competition produced the slides for the presentation on May 30, 2019 at Citic Tower, Admiralty, Hong Kong.
3. Agenda
Page 2 EY Hong Kong Data Science Challenge | Byung Eun Jeon & Hyunju Shim (HKU)
1 Methodology and Algorithms (p. 3)
2 Findings and Patterns (p. 11)
3 Opportunities to Improve Performance (p. 19)
4 Smart Cities Applications (p. 21)
4. Methodology and Algorithms
5. Overview of Methodology
1. Problem Formulation
2. EDA & Feature Engineering
3. Model Exploration and Selection
4. Training & Fine-Tuning
5. Prediction & Ensembling
6. Problem Formulation
Objective: a model that predicts whether a specific citizen will be in a predefined city center
Variable-Length Sequence Binary Classification:
Hash is used to link several trajectories into one sequence
Number of trajectories for each unique hash ranges from 1 to 20
Target variable: 0 (not in center) or 1 (in center)
Meaningful Points:
ID resets every 24 hours, so the same device cannot be traced across days
Exact date / day of the week is not known, which limits the use of weekend/weekday trends and public holidays
⅔ of velocity-related data is missing, so data handling is an issue
7. Selection of Approach: Deep Learning
Why Deep Learning?
Size of the data is large
Difficult to hand-design useful columns
Non-ML statistical models (e.g. ARIMA, Holt-Winters method) require many assumptions, yet it is difficult to make good assumptions
Other machine learning models (Random Forest, k-Nearest Neighbors) find it difficult to capture a sense of time
Deep learning requires minimal feature engineering because it is flexible at approximating the non-linear functions useful for prediction
Why Recurrent NN over Convolutional NN? The CNN family has no sense of time (i.e. it is difficult to learn the seasonality during the day)
Why LSTM? It copes better with vanishing gradients and is capable of learning long-term dependencies
8. Feature Engineering
Modification of Given Features:
Entry Time (to seconds), Exit Time (to seconds)
Handling Missing Values of Velocity-related Features:
Fill with each group's median and mark NaN values with new "valid" columns
If time spent in the trajectory is 0, fill velocities with 0; otherwise, fill velocities with the median of the non-NaN values
Design of New Features:
Entry Center, Exit Center
Time Spent, Time After
New Hash (first trajectory of the hash), Last Hash (last trajectory of the hash)
Vmin Valid, Vmean Valid, Vmax Valid
MinMax Normalization on Continuous Variables:
Modify so that inputs with different ranges share the same scale
Finally, concatenate trajectories with the same hash to create one variable-length sequence for each unique hash
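As a rough illustration of the missing-value and scaling steps above, here is a minimal NumPy sketch (function names are ours, and for brevity it uses a single global median rather than each group's median):

```python
import numpy as np

def fill_velocity(v, time_spent):
    # v: velocities with NaN where missing; time_spent: seconds per trajectory
    v = np.asarray(v, dtype=float).copy()
    valid = (~np.isnan(v)).astype(float)        # new "valid" indicator column
    median = np.nanmedian(v)                    # median of the non-NaN values
    missing = np.isnan(v)
    v[missing & (np.asarray(time_spent) == 0)] = 0.0  # zero time spent -> zero velocity
    v[np.isnan(v)] = median                     # remaining NaNs -> median
    return v, valid

def minmax(x):
    # rescale one continuous feature to the [0, 1] range
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
```

The "valid" columns let the model distinguish imputed values from observed ones, so the imputation does not silently masquerade as real data.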
9. Input with Variable Sequence Length
Problem: if sequence length ranges from 1 to 20, the input data becomes too sparse
Objective: create batches of similar-length sequences to reduce the sparsity of the input data
General way of "Variable Sequence Length LSTM": zero-pad each batch to the maximum sequence length of that batch, which costs about 1.2 million zero-paddings per epoch on average (from 1,000 simulations)
Our Approach: combine trajectories to create one long sequence for each unique hash, then bucket and zero-pad by sorting hashes according to the number of trajectories, which costs only 313 zero-paddings per epoch
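A small simulation, using made-up lengths and a made-up batch size rather than the competition's actual data, illustrates why bucketing cuts the padding cost:

```python
import random

def padding_cost(lengths, batch_size, bucketed):
    # count zero-padded timesteps per epoch when each batch is
    # padded only to its own maximum sequence length
    order = sorted(lengths) if bucketed else random.sample(lengths, len(lengths))
    total = 0
    for i in range(0, len(order), batch_size):
        batch = order[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

random.seed(0)
lengths = [random.randint(1, 20) for _ in range(1000)]  # hashes with 1-20 trajectories
print(padding_cost(lengths, 32, bucketed=True))   # sorted buckets: little padding
print(padding_cost(lengths, 32, bucketed=False))  # random batches: heavy padding
```

Because each batch is padded only to its own maximum, sorting by length makes most batches nearly uniform, so the padding almost vanishes.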
10. Multiplicative LSTM
Motivation: an LSTM architecture with hidden-to-hidden transition functions that are input-dependent, and thus better suited to recovering from surprising inputs
Larger number of parameters (x 1.25): a trade-off between model flexibility and training time
Source: Krause, B., Lu, L., Murray, I., and Renals, S. Multiplicative LSTM for Sequence Modelling. arXiv, 2016.
[Figure: Traditional LSTM vs. Multiplicative LSTM cell diagrams]
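The core idea in Krause et al. is an intermediate state m_t = (W_mx x_t) * (W_mh h_{t-1}) that replaces the previous hidden state in every gate, making the hidden-to-hidden transition input-dependent. A minimal NumPy sketch of one cell step (weight naming and sizes are our own illustration; biases omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x, h, c, W):
    # m replaces h in every gate computation; the extra W_mx, W_mh
    # matrices account for the roughly 1.25x parameter count
    m = (W["mx"] @ x) * (W["mh"] @ h)
    i = sigmoid(W["ix"] @ x + W["im"] @ m)       # input gate
    f = sigmoid(W["fx"] @ x + W["fm"] @ m)       # forget gate
    o = sigmoid(W["ox"] @ x + W["om"] @ m)       # output gate
    c_hat = np.tanh(W["cx"] @ x + W["cm"] @ m)   # candidate cell state
    c_new = f * c + i * c_hat
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = {k: rng.standard_normal((d_h, d_in if k.endswith("x") else d_h)) * 0.1
     for k in ["mx", "mh", "ix", "im", "fx", "fm", "ox", "om", "cx", "cm"]}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = mlstm_step(rng.standard_normal(d_in), h, c, W)
```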
11. Training & Prediction
Loss Function
• The metric (F1) is not differentiable and, when modified into a differentiable form, does not converge well
• Use Binary Cross-Entropy loss instead
Optimizer
• Adadelta optimizer for the first part of training
• Adam optimizer with learning-rate decay for the later part
Train/Val Data Split
• 70/30 split when exploring various models
• 95/5 split when fine-tuning
Computing
• Cloud computing for GPU computation
Prediction: simple weighted ensembling with higher weights on the predictions with higher scores
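The loss and the final ensembling step can be sketched in NumPy as follows (the weighting scheme shown here, proportional to each model's score, is one plausible reading of "higher weights on higher-scoring predictions", not necessarily the exact weights we used):

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    # binary cross-entropy: the differentiable surrogate for the F1 metric
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def weighted_ensemble(preds, scores):
    # simple weighted average: higher-scoring models get larger weights
    w = np.asarray(scores, dtype=float)
    return np.average(np.asarray(preds, dtype=float), axis=0, weights=w / w.sum())
```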
12. Findings and Patterns
13. Inspired by theory and practice, we present domain-specific findings
Disharmony between BN & Dropout
Batch Normalization (BN) and Dropout combined do not work well together
Inconsistency in variance: a recent study suggests that Dropout shifts the variance of a specific neural unit, while BN maintains it
Conducted experiments on three cases: i) BN and Dropout, ii) Dropout only, iii) Dropout only after BN
Empirically observed that using Dropout only performs best
Significance of Velocity-related Variables
In NLP, tokenization with greater granularity sometimes achieves better results
Tried training the model with each trajectory separated into two positions
This forced us to either delete or duplicate the velocity-related variables
Neither deleting nor duplicating velocity led to better predictions
Although ⅔ of the data is missing, the information about velocity is valuable
Experiment on LSTM+CNN
Some practitioners have achieved SOTA results using LSTM+CNN
Replicated the model and tried to generalize it to the given geolocation domain
For the given domain, our approach (LSTM with FC at the end) performed better than LSTM+CNN
Sources:
1. Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift. arXiv, 2018.
14. Through the Exploratory Data Analysis, we found patterns of citizens in the city
Time: 00:00 ~ 04:00
Target Variable: trajectories inside the yellow box (city center)
15. Through the Exploratory Data Analysis, we found patterns of citizens in the city
Time: 04:00 ~ 08:00
Target Variable: trajectories inside the yellow box (city center)
16. Through the Exploratory Data Analysis, we found patterns of citizens in the city
Time: 08:00 ~ 12:00
Target Variable: trajectories inside the yellow box (city center)
17. Through the Exploratory Data Analysis, we found patterns of citizens in the city
Time: 12:00 ~ 16:00
Target Variable: trajectories inside the yellow box (city center)
Populated areas such as residential areas, highways, and business areas can be inferred from this graph
19. We explored the data extensively, and this led to better feature engineering
Both the distribution of time within a trajectory and the time between trajectories are right-skewed
"Broken GPS" records exist but are negligible: no time is spent in the trajectory, yet the position shifts
This happens very rarely, and deep learning handles small noise well (robustness of DL)
Limitations worth noting
Information about seasonality, such as weekly trends and holidays, is missing and cannot be inferred from the given information
20. Opportunities to Improve Performance
21. With fewer constraints on resources, each stage of the process can be improved
Feature Engineering: designing a "Highway" column
Ensemble results and EDA suggest that hashes on the highway are difficult to predict
A similar approach to designing the "center" column, which significantly improved performance
We prioritized fine-tuning the model over hand-designing columns
Feeding the Input: Stratified Random Shuffling (SRS)
When bucketing and zero-padding, randomly shuffle among hashes that have the same sequence length
Keras does not support SRS; it should be implemented using TF and NumPy
Less prioritized due to the time limit
Hyperparameter Tuning: Grid Search with Multi-GPU (e.g. over learning rate and number of epochs)
If more computing power were available, grid search may have outperformed manual fine-tuning
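The SRS idea, reshuffling hashes within each equal-length bucket every epoch so that batches vary across epochs while staying dense, can be sketched in NumPy (this is an illustration of the idea, not the team's actual implementation):

```python
import numpy as np

def stratified_shuffle(seq_lengths, rng):
    # return an index order where hashes stay grouped by sequence length
    # but the order *within* each equal-length group is re-randomized
    seq_lengths = np.asarray(seq_lengths)
    order = np.argsort(seq_lengths, kind="stable")
    sorted_lens = seq_lengths[order]
    out = order.copy()
    start = 0
    for end in range(1, len(order) + 1):
        # a group ends when the length changes (or at the array's end)
        if end == len(order) or sorted_lens[end] != sorted_lens[start]:
            out[start:end] = rng.permutation(order[start:end])
            start = end
    return out
```

Calling this once per epoch before slicing batches keeps the bucketing benefit (minimal zero-padding) while avoiding identical batches every epoch.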
22. Smart Cities Applications
23. Atlanta stands to benefit from a Data-driven Litter Collection Application
While urbanization and the surge of trash are expected to continue, Atlanta's ecosystem presents an opportunity to develop technology for trash management
Trends in Atlanta
Demography: Metro Atlanta in 2019 has the 4th fastest-growing population in the U.S., and a higher percentage change in employment than the U.S. average in 2014~19
Policy: major changes to the trash collection schedule were made in each of the last 2 consecutive years (April 1, 2019 and July 9, 2018)
Ecosystem: Rubicon Global, the first tech unicorn in trash, is Atlanta-based, and a "Smart Dumpster" for recycling has been developed by the Government of Atlanta
Implications
Demography: development of urban culture, which leads to an increase in economic activity and consumption
Policy: frequent adjustments to policies indicate an increase in trash, and show the state's willingness to tackle the surging amount of trash
Ecosystem: potential to leverage the existing infrastructure and synergies with the existing services
Sources: 1. Forbes 2. U.S. Bureau of Labor Statistics 3. U.S. Census Bureau 4. Atlanta Journal-Constitution 5. TechRepublic 6. Bisnow
24. Agile Software Development with a clear KPI is crucial for the Application
Value Proposition
Improve citizens' quality of life
Protect soil and water quality
Save the cost of cleaning
Development Cycle
Analyze: determine requirements by involving users continually
Design: database, server, and UI/UX for real-time prediction
Develop: predict the locations and crowdedness of public gatherings
Test: similar to the EY NextWave Competition, aim for a higher F1 score
Deploy: e.g. population density before and after a baseball game
Review: Community Appearance Index (CAI), number of complaints
The prediction system can effectively allocate cleaning staff to temporarily crowded areas that yield more pedestrian litter, and optimize trash truck routes
Sources: 1. Team's Analysis 2. Keep Atlanta Beautiful Commission