1st Place Presentation of the EY NextWave Data Science Challenge in 2019.
The following presentation covers the background, methods, deep learning models, parameters, and other details of the model we used for the data science challenge.
Overview of the Model: LSTM variants (multiplicative LSTM, LSTM-CNN, simple LSTM) were used to tackle a variable-length sequence binary classification problem.
This is a description of the model that won 1st place in the EY NextWave Data Science Competition (Hong Kong region). We approached the variable-length sequence binary classification problem using various LSTM models (multiplicative LSTM, LSTM-CNN, simple LSTM).
1st Place in EY Data Science Challenge
1. EY Hong Kong NextWave Data Science Challenge
Byung Eun Jeon - byungeuni
Hyunju Shim - sg04088
University of Hong Kong
May 30, 2019
2. Disclaimer
These presentation slides are not official EY presentation slides.
The winning team of the EY Hong Kong NextWave Data Science Competition produced the slides for the presentation on May 30, 2019 at Citic Tower, Admiralty, Hong Kong.
3. Agenda
Page 2 EY Hong Kong Data Science Challenge | Byung Eun Jeon & Hyunju Shim (HKU)
1 Methodology and Algorithms (p. 3)
2 Findings and Patterns (p. 11)
3 Opportunities to Improve Performance (p. 19)
4 Smart Cities Applications (p. 21)
4. Methodology and Algorithms
5. Overview of Methodology
1. Problem Formulation
2. EDA & Feature Engineering
3. Model Exploration and Selection
4. Training & Fine-Tuning
5. Prediction & Ensembling
6. Problem Formulation
Objective: a model that predicts whether a specific citizen will be in a predefined city center
Variable-Length Sequence Binary Classification:
Hash is used to link several trajectories into one sequence
Number of trajectories for each unique hash ranges from 1 to 20
Target variable: 0 (not in center) or 1 (in center)
Meaningful Points:
ID resets every 24 hours, so the same device cannot be traced across days
Exact date / day of the week is not known, which limits the use of weekend/weekday trends and public holidays
⅔ of velocity-related data is missing, so data handling is an issue
7. Selection of Approach: Deep Learning
Why Deep Learning?
Size of the data is large
Difficult to hand-design useful columns
Non-ML statistical models (e.g. ARIMA, Holt-Winters method) require many assumptions, yet it is difficult to make good assumptions
Other machine learning models (Random Forest, k-Nearest Neighbors) find it difficult to capture a sense of time
Deep learning requires minimal feature engineering because it is flexible at approximating the non-linear functions useful for prediction
Why Recurrent NN over Convolutional NN? The CNN family has no sense of time (i.e. it is difficult to learn the seasonality during the day)
Why LSTM? It copes better with vanishing gradients and is capable of learning long-term dependencies
8. Feature Engineering
Modification of Given Features:
Entry Time (to seconds), Exit Time (to seconds)
Handling Missing Values of Velocity-related Features:
Fill with each group's median and mark NaN values with new "valid" columns
If time spent in the trajectory is 0, fill velocities with 0; otherwise, fill velocities with the median of the non-NaN values
Design of New Features:
Entry Center, Exit Center
Time Spent, Time After
New Hash (first trajectory of the hash), Last Hash (last trajectory of the hash)
Vmin Valid, Vmean Valid, Vmax Valid
MinMax Normalization on Continuous Variables:
Modify so that inputs with different ranges share the same scale
Finally, concatenate trajectories with the same hash to create one variable-length sequence for each unique hash
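As a rough illustration of the missing-value and scaling steps above, here is a minimal NumPy sketch (function names are ours, and for brevity it uses a single global median rather than each group's median):

```python
import numpy as np

def fill_velocity(v, time_spent):
    # v: velocities with NaN where missing; time_spent: seconds per trajectory
    v = np.asarray(v, dtype=float).copy()
    valid = (~np.isnan(v)).astype(float)        # new "valid" indicator column
    median = np.nanmedian(v)                    # median of the non-NaN values
    missing = np.isnan(v)
    v[missing & (np.asarray(time_spent) == 0)] = 0.0  # zero time spent -> zero velocity
    v[np.isnan(v)] = median                     # remaining NaNs -> median
    return v, valid

def minmax(x):
    # rescale one continuous feature to the [0, 1] range
    x = np.asarray(x, dtype=float)
    lo, hi = x.min(), x.max()
    return (x - lo) / (hi - lo) if hi > lo else np.zeros_like(x)
```

The "valid" columns let the model distinguish imputed values from observed ones, so the imputation does not silently masquerade as real data.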
9. Input with Variable Sequence Length
Problem: if sequence length ranges from 1 to 20, the input data becomes too sparse
Objective: create batches of similar-length sequences to reduce the sparsity of the input data
General way of "Variable Sequence Length LSTM": zero-pad each batch to the maximum sequence length of that batch, which costs about 1.2 million zero-paddings per epoch on average (from 1,000 simulations)
Our Approach: combine trajectories to create one long sequence for each unique hash, then bucket and zero-pad by sorting hashes according to the number of trajectories, which costs only 313 zero-paddings per epoch
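A small simulation, using made-up lengths and a made-up batch size rather than the competition's actual data, illustrates why bucketing cuts the padding cost:

```python
import random

def padding_cost(lengths, batch_size, bucketed):
    # count zero-padded timesteps per epoch when each batch is
    # padded only to its own maximum sequence length
    order = sorted(lengths) if bucketed else random.sample(lengths, len(lengths))
    total = 0
    for i in range(0, len(order), batch_size):
        batch = order[i:i + batch_size]
        total += sum(max(batch) - n for n in batch)
    return total

random.seed(0)
lengths = [random.randint(1, 20) for _ in range(1000)]  # hashes with 1-20 trajectories
print(padding_cost(lengths, 32, bucketed=True))   # sorted buckets: little padding
print(padding_cost(lengths, 32, bucketed=False))  # random batches: heavy padding
```

Because each batch is padded only to its own maximum, sorting by length makes most batches nearly uniform, so the padding almost vanishes.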
10. Multiplicative LSTM
Motivation: an LSTM architecture with hidden-to-hidden transition functions that are input-dependent, and thus better suited to recovering from surprising inputs
Larger number of parameters (x 1.25): a trade-off between model flexibility and training time
Source: Krause, B., Lu, L., Murray, I., and Renals, S. Multiplicative LSTM for Sequence Modelling. arXiv, 2016.
[Figure: Traditional LSTM vs. Multiplicative LSTM cell diagrams]
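The core idea in Krause et al. is an intermediate state m_t = (W_mx x_t) * (W_mh h_{t-1}) that replaces the previous hidden state in every gate, making the hidden-to-hidden transition input-dependent. A minimal NumPy sketch of one cell step (weight naming and sizes are our own illustration; biases omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlstm_step(x, h, c, W):
    # m replaces h in every gate computation; the extra W_mx, W_mh
    # matrices account for the roughly 1.25x parameter count
    m = (W["mx"] @ x) * (W["mh"] @ h)
    i = sigmoid(W["ix"] @ x + W["im"] @ m)       # input gate
    f = sigmoid(W["fx"] @ x + W["fm"] @ m)       # forget gate
    o = sigmoid(W["ox"] @ x + W["om"] @ m)       # output gate
    c_hat = np.tanh(W["cx"] @ x + W["cm"] @ m)   # candidate cell state
    c_new = f * c + i * c_hat
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
d_in, d_h = 4, 8
W = {k: rng.standard_normal((d_h, d_in if k.endswith("x") else d_h)) * 0.1
     for k in ["mx", "mh", "ix", "im", "fx", "fm", "ox", "om", "cx", "cm"]}
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = mlstm_step(rng.standard_normal(d_in), h, c, W)
```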
11. Training & Prediction
Loss Function
• The metric (F1) is not differentiable and, when modified into a differentiable form, does not converge well
• Use Binary Cross-Entropy loss instead
Optimizer
• Adadelta optimizer for the first part of training
• Adam optimizer with learning-rate decay for the later part
Train/Val Data Split
• 70/30 split when exploring various models
• 95/5 split when fine-tuning
Computing
• Cloud computing for GPU computation
Prediction: simple weighted ensembling with higher weights on the predictions with higher scores
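The loss and the final ensembling step can be sketched in NumPy as follows (the weighting scheme shown here, proportional to each model's score, is one plausible reading of "higher weights on higher-scoring predictions", not necessarily the exact weights we used):

```python
import numpy as np

def bce(y_true, y_pred, eps=1e-7):
    # binary cross-entropy: the differentiable surrogate for the F1 metric
    p = np.clip(np.asarray(y_pred, dtype=float), eps, 1 - eps)
    y = np.asarray(y_true, dtype=float)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def weighted_ensemble(preds, scores):
    # simple weighted average: higher-scoring models get larger weights
    w = np.asarray(scores, dtype=float)
    return np.average(np.asarray(preds, dtype=float), axis=0, weights=w / w.sum())
```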
12. Findings and Patterns
13. Inspired by theory and practice, we present domain-specific findings
Disharmony between BN & Dropout
Batch Normalization (BN) and Dropout combined do not work well together
Inconsistency in variance: a recent study suggests that Dropout shifts the variance of a specific neural unit, while BN maintains it
Conducted experiments on three cases: i) BN and Dropout, ii) Dropout only, iii) Dropout only after BN
Empirically observed that using Dropout only performs best
Significance of Velocity-related Variables
In NLP, tokenization with greater granularity sometimes achieves better results
Tried training the model with each trajectory separated into two positions
This forced us to either delete or duplicate the velocity-related variables
Neither deleting nor duplicating velocity led to better predictions
Although ⅔ of the data is missing, the information about velocity is valuable
Experiment on LSTM+CNN
Some practitioners have achieved SOTA results using LSTM+CNN
Replicated the model and tried to generalize it to the given geolocation domain
For the given domain, our approach (LSTM with FC at the end) performed better than LSTM+CNN
Sources:
1. Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the Disharmony between Dropout and Batch Normalization by Variance Shift. arXiv, 2018.
14. Through the Exploratory Data Analysis, we found patterns of citizens in the city
Time: 00:00 ~ 04:00
Target Variable: trajectories inside the yellow box (city center)
15. Through the Exploratory Data Analysis, we found patterns of citizens in the city
Time: 04:00 ~ 08:00
Target Variable: trajectories inside the yellow box (city center)
16. Through the Exploratory Data Analysis, we found patterns of citizens in the city
Time: 08:00 ~ 12:00
Target Variable: trajectories inside the yellow box (city center)
17. Through the Exploratory Data Analysis, we found patterns of citizens in the city
Time: 12:00 ~ 16:00
Target Variable: trajectories inside the yellow box (city center)
Populated areas such as residential areas, highways, and business areas can be inferred from this graph
19. We explored the data extensively, and this led to better feature engineering
Both the distribution of time within a trajectory and the time between trajectories are right-skewed
"Broken GPS" records exist but are negligible: no time is spent in the trajectory, yet the position shifts
This happens very rarely, and deep learning handles small noise well (robustness of DL)
Limitations worth noting
Information about seasonality, such as weekly trends and holidays, is missing and cannot be inferred from the given information
20. Opportunities to Improve Performance
21. With fewer constraints on resources, each stage of the process can be improved
Feature Engineering: designing a "Highway" column
Ensemble results and EDA suggest that hashes on the highway are difficult to predict
A similar approach to designing the "center" column, which significantly improved performance
We prioritized fine-tuning the model over hand-designing columns
Feeding the Input: Stratified Random Shuffling (SRS)
When bucketing and zero-padding, randomly shuffle among hashes that have the same sequence length
Keras does not support SRS; it should be implemented using TF and NumPy
Less prioritized due to the time limit
Hyperparameter Tuning: Grid Search with Multi-GPU (e.g. over learning rate and number of epochs)
If more computing power were available, grid search may have outperformed manual fine-tuning
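The SRS idea, reshuffling hashes within each equal-length bucket every epoch so that batches vary across epochs while staying dense, can be sketched in NumPy (this is an illustration of the idea, not the team's actual implementation):

```python
import numpy as np

def stratified_shuffle(seq_lengths, rng):
    # return an index order where hashes stay grouped by sequence length
    # but the order *within* each equal-length group is re-randomized
    seq_lengths = np.asarray(seq_lengths)
    order = np.argsort(seq_lengths, kind="stable")
    sorted_lens = seq_lengths[order]
    out = order.copy()
    start = 0
    for end in range(1, len(order) + 1):
        # a group ends when the length changes (or at the array's end)
        if end == len(order) or sorted_lens[end] != sorted_lens[start]:
            out[start:end] = rng.permutation(order[start:end])
            start = end
    return out
```

Calling this once per epoch before slicing batches keeps the bucketing benefit (minimal zero-padding) while avoiding identical batches every epoch.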
22. Smart Cities Applications
23. Atlanta stands to benefit from a Data-driven Litter Collection Application
While urbanization and the surge of trash are expected to continue, Atlanta's ecosystem presents an opportunity to develop technology for trash management
Trends in Atlanta
Demography: Metro Atlanta in 2019 has the 4th fastest-growing population in the U.S., and a higher percentage change in employment than the U.S. average in 2014~19
Policy: major changes to the trash collection schedule were made in each of the last 2 consecutive years (April 1, 2019 and July 9, 2018)
Ecosystem: Rubicon Global, the first tech unicorn in trash, is Atlanta-based, and a "Smart Dumpster" for recycling has been developed by the Government of Atlanta
Implications
Demography: development of urban culture, which leads to an increase in economic activity and consumption
Policy: frequent adjustments to policies indicate an increase in trash, and show the state's willingness to tackle the surging amount of trash
Ecosystem: potential to leverage the existing infrastructure and synergies with the existing services
Sources: 1. Forbes 2. U.S. Bureau of Labor Statistics 3. U.S. Census Bureau 4. Atlanta Journal-Constitution 5. TechRepublic 6. Bisnow
24. Agile Software Development with a clear KPI is crucial for the Application
Value Proposition
Improve citizens' quality of life
Protect soil and water quality
Save the cost of cleaning
Development Cycle
Analyze: determine requirements by involving users continually
Design: database, server, and UI/UX for real-time prediction
Develop: predict the locations and crowdedness of public gatherings
Test: similar to the EY NextWave Competition, aim for a higher F1 score
Deploy: e.g. population density before and after a baseball game
Review: Community Appearance Index (CAI), number of complaints
The prediction system can effectively allocate cleaning staff to temporarily crowded areas that yield more pedestrian litter, and optimize trash truck routes
Sources: 1. Team's Analysis 2. Keep Atlanta Beautiful Commission