EY Hong Kong NextWave
Data Science Challenge
Byung Eun Jeon - byungeuni
Hyunju Shim - sg04088
University of Hong Kong
May 30, 2019
Disclaimer
These presentation slides are not official EY presentation slides.
A winning team of the EY Hong Kong NextWave Data Science
Competition produced the slides for the presentation on May 30,
2019 at Citic Tower, Admiralty, Hong Kong.
Agenda
Page 2 EY Hong Kong Data Science Challenge | Byung Eun Jeon & Hyunju Shim (HKU)
1. Methodology and Algorithms
2. Findings and Patterns
3. Opportunities to Improve Performance
4. Smart Cities Applications
Methodology and Algorithms
Overview of Methodology
1. Problem Formulation
2. EDA & Feature Engineering
3. Model Exploration and Selection
4. Training & Fine-Tuning
5. Prediction & Ensembling
Problem Formulation
Objective: Build a model that predicts whether a specific citizen will
be inside a predefined city center
Meaningful Points:
• ID resets every 24 hours: the same device cannot be traced across days
• Exact date and day of the week are not known: this limits the use of weekend/weekday trends and public holidays
• ⅔ of the velocity-related data is missing, so data handling is an issue
• A hash links several trajectories into one sequence
Task: Variable-Length Sequence Binary Classification
• Number of trajectories for each unique hash ranges from 1 to 20
• Target variable: 0 (not in center) or 1 (in center)
Selection of Approach: Deep Learning
Why Deep Learning?
• The data set is large
• It is difficult to hand-design useful columns
• Non-ML statistical models (e.g. ARIMA, Holt-Winters method) require many
assumptions, and good assumptions are difficult to make
Machine Learning
• Deep Learning: requires minimal feature engineering because DL is flexible
at approximating non-linear functions useful for prediction
  • Convolutional NN family: has no sense of time
  (i.e. difficult to learn the seasonality during the day)
  • Recurrent NN family: why LSTM? It copes better with vanishing gradients and
  is capable of learning long-term dependencies
• Other Machine Learning Models (Random Forest, k-Nearest-Neighbors):
difficult to capture a sense of time
Feature Engineering
Modification of Given Features
• Target
• Entry Time (to seconds), Exit Time (to seconds)
Handling Missing Values of Velocity-related Features
• Fill with each group’s median and mark NaN values with new “valid” columns
• If the time spent in the trajectory is 0, fill velocities with 0. Otherwise, fill velocities with the
median of the non-NaN values
Design of New Features
• Entry Center, Exit Center
• Time Spent, Time After
• New Hash (first trajectory of the hash), Last Hash (last trajectory of the hash)
• Vmin Valid, Vmean Valid, Vmax Valid
MinMax Normalization on Continuous Variables
• Rescale so that inputs with different ranges share the same scale
Finally, concatenate trajectories with the same hash
to create one variable-length sequence for each unique hash
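The missing-value rule and the MinMax step on this slide can be sketched as follows. This is a minimal illustration, not the team's code: the column names `time_spent` and `vmean`, the helper names, and the use of a single global median (the slide says "each group's median" without defining the groups) are all assumptions.

```python
from statistics import median

def fill_velocity(rows, key):
    """Fill missing values of one velocity column and add a '<key>_valid' flag.

    Assumption: one global median is used; the slides mention per-group medians
    but do not say how the groups are defined.
    """
    observed = [r[key] for r in rows if r[key] is not None]
    fallback = median(observed) if observed else 0.0
    for r in rows:
        # Record whether the value was originally present (1) or missing (0).
        r[key + "_valid"] = 0 if r[key] is None else 1
        if r[key] is None:
            # Zero-duration trajectories get velocity 0; the rest get the median.
            r[key] = 0.0 if r["time_spent"] == 0 else fallback
    return rows

def min_max(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical trajectories: None marks a missing velocity.
rows = [
    {"time_spent": 0, "vmean": None},
    {"time_spent": 120, "vmean": 2.0},
    {"time_spent": 300, "vmean": None},
    {"time_spent": 60, "vmean": 6.0},
]
rows = fill_velocity(rows, "vmean")
scaled = min_max([r["vmean"] for r in rows])
```

Keeping the "valid" flag next to the filled value lets the model tell an observed velocity apart from an imputed one.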
Input with Variable Sequence Length
Problem
• If sequence length ranges from 1 to 20, input data becomes too sparse
Objective
• Create batches of similar-length sequences to reduce sparsity of input data
Combine trajectories to create one long sequence for each unique hash
General way of “Variable Sequence Length LSTM”
• Zero-pad each batch to the maximum sequence length within that batch
• 1.2 million zero-paddings per epoch, on average over 1000 simulations
Our Approach
• Bucketing and zero-padding: sort hashes according to number of trajectories
• 313 zero-paddings per epoch
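The effect of bucketing on padding can be illustrated with a small simulation. The sketch below uses synthetic sequence lengths and a hypothetical batch size of 64, so its counts will not match the figures on the slide; it only shows why sorting by length before batching reduces padding.

```python
import random

def count_padding(lengths, batch_size):
    """Total zero-padded time steps when each batch is padded to its own longest sequence."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total

random.seed(0)
# Hypothetical data: 10,000 hashes with 1 to 20 trajectories each.
lengths = [random.randint(1, 20) for _ in range(10_000)]

# Batches drawn in arbitrary order mix short and long sequences.
shuffled_padding = count_padding(lengths, batch_size=64)
# Bucketing: sort by length first, so each batch holds similar lengths.
bucketed_padding = count_padding(sorted(lengths), batch_size=64)
print(shuffled_padding, bucketed_padding)
```

After sorting, padding only appears in the few batches that straddle two different lengths.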
Multiplicative LSTM
Motivation
• An LSTM architecture whose hidden-to-hidden transition functions are
input-dependent, and which is thus better suited to recover from surprising inputs
• Larger number of parameters (×1.25)
• Trade-off between flexibility of the model and training time
Source: Krause, B., Lu, L., Murray, I., and Renals, S. Multiplicative LSTM for sequence
modelling. arXiv, 2016.
[Figure: architecture diagrams of Multiplicative LSTM and Traditional LSTM]
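One reading of the ×1.25 figure is a simple weight count. In the formulation of Krause et al., the gates and the candidate read an intermediate state m_t = (W_mx x_t) ⊙ (W_mh h_{t-1}) instead of h_{t-1}, which adds one pair of weight matrices to the four pairs of a standard LSTM. The sketch below assumes that m_t has the same size as the hidden state and ignores biases; the function names and sizes are illustrative only.

```python
def lstm_weights(input_size, hidden_size):
    # Four blocks (input gate, forget gate, output gate, candidate),
    # each with one input-to-hidden and one hidden-to-hidden matrix.
    return 4 * (hidden_size * input_size + hidden_size * hidden_size)

def mlstm_weights(input_size, hidden_size):
    # The same four blocks, now reading m_t instead of h_{t-1}, plus the
    # extra pair (W_mx, W_mh) that builds m_t. Assumes m_t has hidden_size
    # units and that biases are excluded.
    return 5 * (hidden_size * input_size + hidden_size * hidden_size)

# Hypothetical sizes: 16 input features, 128 hidden units.
ratio = mlstm_weights(16, 128) / lstm_weights(16, 128)
print(ratio)
```

Under these assumptions the ratio is 5/4 regardless of the chosen sizes.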
Training & Prediction
Loss Function
• The metric (F1) is not differentiable and, when modified to be
differentiable, does not converge well
• Use Binary Cross Entropy loss
Optimizer
• Adadelta optimizer for the first
part of training
• Adam optimizer with learning
rate decay for the later part
Train/Val Data Split
• Use 70/30 split when exploring
various models
• Use 95/5 split when fine-tuning
Computing
• Cloud computing for GPU
computation
Simple weighted ensembling, with higher weights
on predictions with higher scores
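The training loss, the evaluation metric, and the ensembling step can be sketched in a few lines. This is an illustration under assumptions: the weights are taken proportional to each model's validation score, the decision threshold is 0.5, and the probabilities and scores are made up. The slides do not specify any of these.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def f1_score(y_true, y_pred):
    """F1 on hard 0/1 predictions: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

def weighted_ensemble(prob_lists, scores):
    """Blend per-model probabilities, weighting each model by its score (an assumption)."""
    total = sum(scores)
    n = len(prob_lists[0])
    return [
        sum(s * probs[i] for s, probs in zip(scores, prob_lists)) / total
        for i in range(n)
    ]

y_true = [1, 0, 1, 0]
model_a = [0.9, 0.4, 0.3, 0.2]  # hypothetical model with validation score 0.80
model_b = [0.6, 0.7, 0.8, 0.1]  # hypothetical model with validation score 0.60
blended = weighted_ensemble([model_a, model_b], scores=[0.80, 0.60])
y_pred = [1 if p >= 0.5 else 0 for p in blended]
print(binary_cross_entropy(y_true, blended), f1_score(y_true, y_pred))
```

F1 depends on thresholded counts, which is why a smooth loss such as binary cross entropy is used for training while F1 is only used for evaluation.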
Findings and Patterns
Inspired by theory and practice,
we present domain-specific findings
Disharmony between BN & Dropout
• Batch Normalization (BN) and Dropout do not work well when combined
• Inconsistency in variance: a recent study suggests that Dropout shifts the
variance of a specific neural unit, while BN maintains its variance
• Conducted experiments on three cases:
i) BN and Dropout ii) Only Dropout iii) Dropout only after BN
• Empirically observed that using only Dropout performs best
Significance of Velocity-related Variables
• In NLP, tokenization with finer granularity sometimes achieves better results
• Tried training the model with each trajectory separated into two positions
• This forced us to either delete or duplicate the velocity-related variables
• Neither deleting nor duplicating velocity led to better prediction
• Although ⅔ of the data is missing, information about velocity is valuable
Experiment on LSTM+CNN
• Others have achieved state-of-the-art (SOTA) results using LSTM+CNN
• Replicated the model and tried to generalize it to the given geolocation domain
• For the given domain, our approach (LSTM with FC at the end)
performed better than LSTM+CNN
Sources:
1. Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch
normalization by variance shift. arXiv, 2018.
Through the Exploratory Data Analysis,
we found patterns of citizens in the city
[Figures, one per time window: 00:00 ~ 04:00, 04:00 ~ 08:00, 08:00 ~ 12:00, 12:00 ~ 16:00.
Target variable: trajectories inside the yellow box (city center)]
Analysis of activity-percentage
complements the visualization of counts
[Figure panels: Trajectory Count, City-center Percentage]
• Populated areas such as residential areas, highways, and business areas can be
inferred from this graph
We explored the data extensively, and
this led to better feature engineering
• Both the distribution of time within a trajectory and that of
time between trajectories are right-skewed
• “Broken GPS” records exist but are negligible
  • No time spent in the trajectory, yet a shift in position
  • Happens very rarely. Deep Learning handles small noise well (robustness of DL)
Limitations worth noting
• Information about seasonality, such as weekly trends and holidays, is
missing and cannot be inferred from the given information
Opportunities to Improve Performance
With fewer resource constraints,
each stage of the process can be improved
Feature Engineering: Designing a “Highway” Column
• Ensemble and EDA suggest that hashes on highways are difficult to predict
• Similar approach to designing the “center” column, which significantly improved
performance
• We prioritized fine-tuning the model over hand-designing columns
Feeding the Input: Stratified Random Shuffling (SRS)
• When bucketing and zero-padding, randomly shuffle among hashes that
have the same sequence length
• Keras does not support SRS. It should be implemented using TF and NumPy
• Less prioritized due to the time limit
Hyperparameter Tuning: Grid Search with Multi-GPU
• If more computing power were available, Grid Search may
have outperformed manual fine-tuning
[Figure: grid over Learning Rate and Number of Epochs]
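Stratified Random Shuffling, as described on this slide, can be sketched without any deep learning framework. The version below is a plain-Python illustration rather than the TF/NumPy implementation the slide calls for; the function name is hypothetical, and shuffling the order of the finished batches is an extra assumption not stated in the slides.

```python
import random
from collections import defaultdict

def stratified_shuffle_batches(lengths, batch_size, rng):
    """Return batches of sequence indices.

    Sequences stay grouped by length (bucketing), but the order of sequences
    that share the same length is re-randomized on every call.
    """
    by_length = defaultdict(list)
    for idx, n in enumerate(lengths):
        by_length[n].append(idx)

    ordered = []
    for n in sorted(by_length):
        group = by_length[n]
        rng.shuffle(group)  # shuffle only among equal-length sequences
        ordered.extend(group)

    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    rng.shuffle(batches)  # assumption: also vary the order in which batches are fed
    return batches

rng = random.Random(0)
lengths = [3, 1, 2, 1, 3, 2, 1, 3]  # hypothetical number of trajectories per hash
batches = stratified_shuffle_batches(lengths, batch_size=2, rng=rng)
```

Calling the function once per epoch would give each epoch different batch compositions while keeping the padding as low as with plain bucketing.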
Smart Cities Applications
Atlanta stands to benefit from
a data-driven litter collection application
While urbanization and the surge of trash are expected to continue,
Atlanta’s ecosystem presents an opportunity to develop technology for trash management
Demography
• Trends in Atlanta: higher percentage change in employment than the U.S. average in 2014~19;
Metro Atlanta in 2019 has the 4th fastest growing population in the U.S.
• Implication: development of the urban culture, which leads to an increase
in economic activities and consumption
Policy
• Trends in Atlanta: major changes to the trash collection schedule were made in each of the last 2
consecutive years (April 1, 2019 and July 9, 2018)
• Implication: frequent adjustments to policies indicate the increase of trash;
willingness of the state to tackle the surging amount of trash
Ecosystem
• Trends in Atlanta: Rubicon Global, the first tech unicorn in trash, is Atlanta-based;
a “Smart Dumpster” for recycling has been developed by the Government of Atlanta
• Implication: potential to leverage the existing infrastructure;
synergies with the existing services
Sources: 1. Forbes 2. U.S. Bureau of Labor Statistics 3. U.S. Census Bureau 4. Atlanta Journal-Constitution
5. TechRepublic 6. Bisnow
Agile Software Development with
a clear KPI is crucial for the Application
• Analyze: Determine requirements by involving users continually
• Design: Database, server, and UI/UX for real-time prediction
• Develop: Predict locations/crowdedness of public gatherings
• Test: Similar to the EY NextWave Competition, aim for a higher F1 score
• Deploy: e.g. population density before and after a baseball game
• Review: Community Appearance Index (CAI), Number of Complaints
Value Proposition
• Improve citizens’ quality of life
• Protect soil and water quality
• Save cleaning costs
The prediction system can effectively allocate cleaning staff to temporarily crowded
areas that yield more pedestrian litter, and optimize trash truck routes
Sources: 1. Team’s Analysis 2. Keep Atlanta Beautiful Commission
Thank you!

Winner of EY NextWave Data Science Challenge 2019

  • 1. EY Hong Kong NextWave Data Science Challenge Byung Eun Jeon - byungeuni Hyunju Shim - sg04088 University of Hong Kong May 30, 2019
  • 2. Disclaimer These presentation slides are not official EY presentation slides. A winning team of EY Hong Kong NextWave Data Science Competition produced the slides for the presentation on May 30, 2019 at the Citic Tower, Admiralty, Hong Kong.
  • 3. Agenda Page 2 EY Hong Kong Data Science Challenge | Byung Eun Jeon & Hyunju Shim (HKU) 1 Methodology and Algorithms Findings and Patterns Opportunities to Improve Performance Smart Cities Applications 3 11 19 21 2 3 4
  • 4. Page 3 EY Hong Kong Data Science Challenge | Byung Eun Jeon & Hyunju Shim (HKU) Methodology and Algorithms
  • 5. Overview of Methodology 1. Problem Formulation 2. EDA & Feature Engineering 3. Model Exploration and Selection 4. Training & Fine-Tuning 5. Prediction & Ensembling Page 4 EY Hong Kong Data Science Challenge | Byung Eun Jeon & Hyunju Shim (HKU)
  • 6. Problem Formulation Page 5 EY Hong Kong Data Science Challenge | Byung Eun Jeon & Hyunju Shim (HKU) Objective Model that finds out whether a specific citizen will be in a predefined city-center Meaningful Points: ID resets every 24 hours Cannot trace the same device across days Exact date / Day of the week is not known Limits the usage of Weekend/Weekday trends, Public Holidays ⅔ of Velocity Related Data is missing: Data handling is an issue Hash is used to link several trajectories to a sequence Variable Length Sequence Binary Classification Number of Trajectories for each unique hash ranges from 1 to 20 Target Variable: 0 (Not in center) or 1 (In center)
  • 7. Selection of Approach: Deep Learning Why Deep Learning? Size of Data is Big Difficult to hand-design useful columns Non-ML Statistical Models (e.g. ARIMA, Holt-Winters Method) require many assumptions, yet it is difficult to make good assumptions Machine Learning Deep Learning Requires minimal feature engineering because DL is flexible at approximating non-linear functions useful for prediction Convolutional NN – family Has no sense of time (i.e. Difficult to learn the seasonality during the day) Recurrent NN – family Why LSTM? better cope with vanishing gradient and capable of learning long-term dependencies Other Machine Learning Models Difficult to capture sense of time (Random Forest, k-Nearest- Neighbors) Page 6 EY Hong Kong Data Science Challenge | Byung Eun Jeon & Hyunju Shim (HKU)
  • 8. Feature Engineering. Modification of given features: Target; Entry Time (converted to seconds); Exit Time (converted to seconds). Handling missing values of velocity-related features: fill with each group's median and mark NaN values with new "valid" columns; if the time spent in the trajectory is 0, fill the velocities with 0, otherwise fill them with the median of the non-NaN values. Design of new features: Entry Center, Exit Center; Time Spent, Time After; New Hash (first trajectory of the hash), Last Hash (last trajectory of the hash); Vmin Valid, Vmean Valid, Vmax Valid. Min-max normalization on continuous variables: rescale so that inputs with different ranges share the same scale. Finally, concatenate trajectories with the same hash to create one variable-length sequence for each unique hash.
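
The missing-value and normalization steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the team's code: the function names, the sample arrays, and the use of a single column-wide median (the slide mentions "each group's median" without defining the groups) are all assumptions.

```python
import numpy as np


def fill_velocity(velocity, time_spent):
    """Return (filled values, valid flags) for one velocity column.

    Assumption: one median over the whole column; the slide's "group" is unspecified.
    """
    velocity = np.asarray(velocity, dtype=float)
    time_spent = np.asarray(time_spent, dtype=float)
    missing = np.isnan(velocity)
    valid = (~missing).astype(int)        # 1 = observed, 0 = originally NaN
    median = np.nanmedian(velocity)       # median of the non-NaN values only
    filled = velocity.copy()
    filled[missing & (time_spent == 0)] = 0.0      # no time spent: velocity 0
    filled[missing & (time_spent != 0)] = median   # otherwise: median
    return filled, valid


def min_max(values):
    """Rescale values to [0, 1]; a constant column maps to all zeros."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if hi == lo:
        return np.zeros_like(values)
    return (values - lo) / (hi - lo)


# Hypothetical sample: four trajectories, two with missing mean velocity.
time_spent = np.array([0.0, 120.0, 300.0, 60.0])
vmean = np.array([np.nan, 2.0, np.nan, 4.0])

vmean_filled, vmean_valid = fill_velocity(vmean, time_spent)
vmean_scaled = min_max(vmean_filled)
```

Keeping the "valid" flag next to the filled value lets a model distinguish an observed velocity from an imputed one, which matters when most of the column is imputed.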
  • 9. Input with Variable Sequence Length. Objective: create batches of similar-length sequences to reduce the sparsity of the input data. Problem: if sequence length ranges from 1 to 20, the input data becomes too sparse. Trajectories are combined to create one long sequence for each unique hash. General way of "Variable Sequence Length LSTM": zero-pad each batch to the maximum sequence length in that batch, which gave 1.2 million zero-paddings per epoch on average over 1000 simulations. Our approach: bucketing and zero-padding, with hashes sorted according to number of trajectories, which gave 313 zero-paddings per epoch.
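
To illustrate why sorting by sequence length before batching reduces padding, here is a small sketch on synthetic data. The sequence lengths, batch size, and function name are assumptions made for the example, so the counts it produces are not expected to match the figures on the slide.

```python
import random


def count_padding(lengths, batch_size):
    """Count zero-padded steps when each batch is padded to its own max length."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


rng = random.Random(0)
# Synthetic stand-in: 10,000 hashes with 1 to 20 trajectories each.
lengths = [rng.randint(1, 20) for _ in range(10_000)]

# Batches drawn in random order: short and long sequences get mixed.
shuffled = lengths[:]
rng.shuffle(shuffled)
padding_random = count_padding(shuffled, batch_size=64)

# Bucketing: sort by length first, so each batch holds similar lengths.
padding_bucketed = count_padding(sorted(lengths), batch_size=64)

print(padding_random, padding_bucketed)
```

With sorted input, padding only appears in batches that straddle a boundary between two lengths, which is why the bucketed count stays small regardless of dataset size.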
  • 10. Multiplicative LSTM. Motivation: an LSTM architecture whose hidden-to-hidden transition functions are input-dependent and thus better suited to recover from surprising inputs. Larger number of parameters (×1.25): a trade-off between flexibility of the model and training time. [Figure: diagrams of the multiplicative LSTM and the traditional LSTM] Source: Krause, B., Lu, L., Murray, I., and Renals, S. Multiplicative LSTM for sequence modelling. arXiv, 2016.
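
A single multiplicative LSTM step can be sketched in NumPy as below. The intermediate state `m` and the gate equations follow the formulation in the cited Krause et al. paper. The output line uses the common LSTM form `o * tanh(c)`, biases are omitted, and the weight names and sizes are illustrative assumptions rather than the team's implementation.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def init_params(n_in, n_hid, rng):
    """Create weight matrices only (biases omitted for brevity)."""
    cols = {"mx": n_in, "mh": n_hid}          # the two extra mLSTM matrices
    for gate in ("h", "i", "f", "o"):
        cols[gate + "x"] = n_in               # input-to-gate
        cols[gate + "m"] = n_hid              # intermediate-state-to-gate
    return {name: rng.normal(0.0, 0.1, size=(n_hid, c)) for name, c in cols.items()}


def mlstm_step(x, h_prev, c_prev, W):
    # Input-dependent intermediate state replaces h_prev in every gate.
    m = (W["mx"] @ x) * (W["mh"] @ h_prev)
    h_hat = W["hx"] @ x + W["hm"] @ m
    i = sigmoid(W["ix"] @ x + W["im"] @ m)
    f = sigmoid(W["fx"] @ x + W["fm"] @ m)
    o = sigmoid(W["ox"] @ x + W["om"] @ m)
    c = f * c_prev + i * np.tanh(h_hat)
    h = o * np.tanh(c)                        # assumption: common LSTM output form
    return h, c


rng = np.random.default_rng(0)
n_in, n_hid = 8, 16                           # hypothetical sizes
W = init_params(n_in, n_hid, rng)
h, c = mlstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W)

mlstm_weights = sum(w.size for w in W.values())
lstm_weights = 4 * n_hid * (n_in + n_hid)     # plain LSTM, biases omitted
print(mlstm_weights / lstm_weights)
```

Ignoring biases, a plain LSTM has four input matrices and four hidden matrices, while this sketch has five of each, which is consistent with the ×1.25 parameter count stated on the slide.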
  • 11. Training & Prediction. Loss function: the metric (F1) is not differentiable and, when modified, does not converge well, so binary cross-entropy loss is used. Optimizer: Adadelta for the first part of training, then Adam with learning rate decay for the later part. Train/validation split: 70/30 when exploring various models, 95/5 when fine-tuning. Computing: cloud computing for GPU computation. Ensembling: simple weighted ensembling, with higher weights on predictions with higher scores.
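
The ensembling step can be illustrated with a short sketch. The slide only says that higher-scoring predictions receive higher weights. Making the weights proportional to each model's validation score, the 0.5 threshold, and all sample numbers are assumptions for this example.

```python
import numpy as np


def f1_score(y_true, y_pred):
    """F1 for binary labels, computed from true/false positives and negatives."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_ensemble(probabilities, scores, threshold=0.5):
    """Blend per-model probabilities with weights proportional to their scores."""
    probabilities = np.asarray(probabilities, dtype=float)  # (n_models, n_samples)
    weights = np.asarray(scores, dtype=float)
    weights = weights / weights.sum()          # normalize so weights sum to 1
    blended = weights @ probabilities          # weighted average per sample
    return (blended >= threshold).astype(int)


# Hypothetical predicted probabilities from three models on four hashes.
model_probs = [
    [0.9, 0.2, 0.6, 0.4],
    [0.8, 0.4, 0.3, 0.7],
    [0.6, 0.1, 0.7, 0.2],
]
validation_scores = [0.90, 0.85, 0.80]         # hypothetical validation F1 scores
y_true = [1, 0, 1, 0]

ensemble = weighted_ensemble(model_probs, validation_scores)
print(ensemble, f1_score(y_true, ensemble))
```

Because F1 depends on hard 0/1 decisions, it is only computed here for evaluation, while training would still rely on a differentiable loss such as binary cross-entropy.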
  • 12. Findings and Patterns
  • 13. Inspired by theory and practice, we present domain-specific findings. (1) Disharmony between BN & Dropout: Batch Normalization (BN) and Dropout do not work well when combined. Inconsistency in variance: a recent study suggests that Dropout shifts the variance of a specific neural unit, while BN maintains its variance. We conducted experiments on three cases: i) BN and Dropout, ii) only Dropout, iii) Dropout only after BN. We empirically observed that using only Dropout performs best. (2) Significance of velocity-related variables: in NLP, tokenization with greater granularity sometimes achieves better results, so we tried training the model with each trajectory separated into two positions. This forced us to either delete or duplicate the velocity-related variables. Neither deleting nor duplicating velocity led to better prediction. Although ⅔ of the data is missing, information about velocity is valuable. (3) Experiment on LSTM+CNN: some people achieved SOTA using LSTM+CNN. We replicated the model and tried to generalize it to the given geolocation domain. For the given domain, our approach (LSTM with FC at the end) performed better than LSTM+CNN. Source: Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. arXiv, 2018.
  • 14. Through the exploratory data analysis, we found patterns of citizens in the city. Time: 00:00 ~ 04:00. Target variable: trajectories inside the yellow box (city center).
  • 15. Through the exploratory data analysis, we found patterns of citizens in the city. Time: 04:00 ~ 08:00. Target variable: trajectories inside the yellow box (city center).
  • 16. Through the exploratory data analysis, we found patterns of citizens in the city. Time: 08:00 ~ 12:00. Target variable: trajectories inside the yellow box (city center).
  • 17. Through the exploratory data analysis, we found patterns of citizens in the city. Time: 12:00 ~ 16:00. Target variable: trajectories inside the yellow box (city center). Populated areas such as residential areas, highways, and business areas can be inferred from this graph.
  • 18. Analysis of activity-percentage complements the visualization of counts. [Figure panels: Trajectory Count; City-center Percentage]
  • 19. We explored the data extensively, and this led to better feature engineering. Both the distribution of time within a trajectory and the distribution of time between trajectories are right-skewed. "Broken GPS" records exist but are negligible: no time is spent in the trajectory, yet the position shifts. This happens very rarely, and deep learning handles small amounts of noise well (robustness of DL). Limitation worth noting: information about seasonality, such as weekly trends and holidays, is missing and cannot be inferred from the given information.
  • 20. Opportunities to Improve Performance
  • 21. With fewer constraints on resources, each stage of the process can be improved. (1) Feature engineering, designing a "Highway" column: the ensemble and EDA suggest that hashes on the highway are difficult to predict. The approach would be similar to designing the "center" column, which significantly improved performance. We prioritized fine-tuning the model over hand-designing columns. (2) Feeding the input, Stratified Random Shuffling (SRS): when bucketing and zero-padding, shuffle randomly among hashes that have the same sequence length. Keras does not support SRS, so it should be implemented using TF and NumPy. This was given lower priority due to the time limit. (3) Hyperparameter tuning, grid search with multi-GPU: if more computing power had been available, grid search might have outperformed manual fine-tuning. [Figure axes: Learning Rate; Number of Epochs]
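
The Stratified Random Shuffling idea described on this slide, shuffling only among hashes that share the same sequence length, could look roughly like the following. The slide proposes TF and NumPy. This sketch uses only the Python standard library to show the ordering logic, and the hash IDs, lengths, and function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_shuffle(hash_lengths, rng):
    """Return hash IDs sorted by sequence length, shuffled within each length."""
    by_length = defaultdict(list)
    for hash_id, length in hash_lengths.items():
        by_length[length].append(hash_id)

    order = []
    for length in sorted(by_length):      # keep buckets in ascending length order
        group = by_length[length]
        rng.shuffle(group)                # randomize only inside the bucket
        order.extend(group)
    return order


# Hypothetical hashes mapped to their number of trajectories.
hash_lengths = {"a": 1, "b": 3, "c": 1, "d": 2, "e": 3, "f": 2}

# Calling this once per epoch gives a fresh within-bucket order each time.
order = stratified_shuffle(hash_lengths, random.Random(0))
print(order)
```

Since the overall order stays sorted by length, batches cut from this list keep the low padding of bucketing while the composition of each batch still varies between epochs.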
  • 22. Smart Cities Applications
  • 23. Atlanta stands to benefit from a data-driven litter collection application. While urbanization and the surge of trash are expected to continue, Atlanta's ecosystem presents an opportunity to develop technology for trash management. Trends in Atlanta. Demography: a higher % change in employment than the U.S. average in 2014~19; Metro Atlanta had the 4th fastest growing population in the U.S. in 2019. Policy: major changes to the trash collection schedule were made in each of the last 2 consecutive years (July 9, 2018 and April 1, 2019). Ecosystem: Rubicon Global, the first tech unicorn in trash, is Atlanta-based; a "Smart Dumpster" for recycling has been developed by the Government of Atlanta. Implications: development of the urban culture, which leads to an increase in economic activity and consumption; frequent adjustments to policies indicate an increase in trash; willingness of the state to tackle the surging amount of trash; potential to leverage the existing infrastructure; synergies with the existing services. Sources: 1. Forbes 2. U.S. Bureau of Labor Statistics 3. U.S. Census Bureau 4. Atlanta Journal-Constitution 5. TechRepublic 6. Bisnow
  • 24. Agile software development with a clear KPI is crucial for the application. Development cycle: Analyze (determine requirements by involving users continually); Design (database, server, and UI/UX for real-time prediction); Develop (predict locations/crowdedness of public gatherings); Test (similar to the EY NextWave Competition, aim for a higher F1 score); Deploy (e.g. population density before and after a baseball game); Review (Community Appearance Index (CAI), number of complaints). Value proposition: improve citizens' quality of life; protect soil and water quality; save cleaning costs. The prediction system can effectively allocate cleaning staff to temporarily crowded areas that yield more pedestrian litter, and optimize trash truck routes. Sources: 1. Team's Analysis 2. Keep Atlanta Beautiful Commission