EY Hong Kong NextWave
Data Science Challenge
Byung Eun Jeon - byungeuni
Hyunju Shim - sg04088
University of Hong Kong
May 30, 2019
Disclaimer
These presentation slides are not official EY presentation slides.
A winning team of EY Hong Kong NextWave Data Science
Competition produced the slides for the presentation on May 30,
2019 at the Citic Tower, Admiralty, Hong Kong.
Agenda
Page 2 EY Hong Kong Data Science Challenge | Byung Eun Jeon & Hyunju Shim (HKU)
1 Methodology and Algorithms
2 Findings and Patterns
3 Opportunities to Improve Performance
4 Smart Cities Applications
Methodology and Algorithms
Overview of Methodology
1. Problem Formulation
2. EDA & Feature Engineering
3. Model Exploration and Selection
4. Training & Fine-Tuning
5. Prediction & Ensembling
Problem Formulation
Objective: A model that predicts whether a specific citizen will be in a predefined city center
Meaningful Points:
• ID resets every 24 hours: cannot trace the same device across days
• Exact date / day of the week is not known: limits the use of weekend/weekday trends and public holidays
• ⅔ of velocity-related data is missing: data handling is an issue
• Hash is used to link several trajectories into a sequence
Variable-Length Sequence Binary Classification
• Number of trajectories for each unique hash ranges from 1 to 20
• Target Variable: 0 (not in center) or 1 (in center)
Selection of Approach: Deep Learning
Why Deep Learning?
• Size of data is big
• Difficult to hand-design useful columns
• Non-ML statistical models (e.g. ARIMA, Holt-Winters method) require many assumptions, yet it is difficult to make good assumptions
Machine Learning
• Other machine learning models (Random Forest, k-Nearest-Neighbors): difficult to capture a sense of time
• Deep Learning: requires minimal feature engineering because DL is flexible at approximating non-linear functions useful for prediction
Deep Learning
• Convolutional NN family: has no sense of time (i.e. difficult to learn the seasonality during the day)
• Recurrent NN family: why LSTM? It copes better with vanishing gradients and is capable of learning long-term dependencies
Feature Engineering
Modification of Given Features
• Target
• Entry Time (to seconds), Exit Time (to seconds)
Handling Missing Values of Velocity-related Features
• Fill with each group’s median and mark NaN values with new “valid” columns
• If time spent in the trajectory is 0, fill velocities with 0. Otherwise, fill velocities with the median of non-NaN values
Design of New Features
• Entry Center, Exit Center
• Time Spent, Time After
• New Hash (first trajectory of the hash), Last Hash (last trajectory of the hash)
• Vmin Valid, Vmean Valid, Vmax Valid
MinMax Normalization on Continuous Variables
• Rescale so that inputs with different ranges have the same scale across features
Finally, concatenate trajectories with the same hash to create one variable-length sequence for each unique hash
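The missing-value handling and MinMax normalization described on this slide can be sketched roughly as follows. This is a minimal illustration, not the team's code: the function names, the sample values, and the use of a single overall median (the slide mentions a per-group median without defining the groups) are assumptions.

```python
from statistics import median

def impute_velocity(values, time_spent):
    """Fill missing velocities and return (filled, valid_flags).

    values: list of float or None (None marks a missing velocity).
    time_spent: seconds spent in each trajectory.
    Assumption: one overall median is used instead of a per-group median.
    """
    observed = [v for v in values if v is not None]
    fallback = median(observed) if observed else 0.0
    filled, valid = [], []
    for v, t in zip(values, time_spent):
        if v is not None:
            filled.append(v)
            valid.append(1)          # value was present in the data
        elif t == 0:
            filled.append(0.0)       # no time spent, so velocity is set to 0
            valid.append(0)
        else:
            filled.append(fallback)  # otherwise use the median of observed values
            valid.append(0)
    return filled, valid

def min_max(values):
    """Rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical mean-velocity column with two missing entries
vmean = [2.0, None, 6.0, None, 4.0]
time_spent = [30, 0, 120, 60, 45]

filled, valid = impute_velocity(vmean, time_spent)
scaled = min_max(filled)
```

The "valid" flags keep the information that a value was imputed, so the model can still distinguish observed velocities from filled ones after normalization.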
Input with Variable Sequence Length
Problem
• If sequence length ranges from 1 to 20, input data becomes too sparse
Objective
• Create batches of similar-length sequences to reduce the sparsity of input data
General way of “Variable Sequence Length LSTM”
• Zero-pad each batch to the maximum sequence length in that batch
• 1.2 million zero-paddings per epoch on average over 1000 simulations
Our Approach
• Combine trajectories to create one long sequence for each unique hash
• Bucketing and zero-padding: sort hashes according to number of trajectories
• 313 zero-paddings per epoch
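The effect of bucketing on the amount of zero-padding can be illustrated with a small simulation. This is only a sketch under assumed inputs: the sequence lengths are drawn uniformly from 1 to 20, and the batch size of 64 is hypothetical, so the counts will not match the figures quoted on the slide.

```python
import random

def count_padding(lengths, batch_size):
    """Count zero-padded steps when each batch is padded to its own maximum length."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total

rng = random.Random(0)  # fixed seed so the sketch is repeatable
lengths = [rng.randint(1, 20) for _ in range(10_000)]  # hypothetical sequence lengths

# Batches formed in arbitrary order versus batches formed after sorting by length
unsorted_padding = count_padding(lengths, batch_size=64)
bucketed_padding = count_padding(sorted(lengths), batch_size=64)
```

After sorting, padding only occurs in the few batches that straddle two different lengths, which is why the bucketed count is much smaller.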
Multiplicative LSTM
Motivation
LSTM architecture that has hidden-to-hidden transition functions that are
input-dependent and thus better suited to recover from surprising inputs
Larger Number of Parameters (x 1.25)
Trade-off between Flexibility of Model and Training Time
Source: Krause, B., Lu, L., Murray, I., and Renals, S. Multiplicative LSTM for sequence
modelling. ArXiv, 2016.
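The ×1.25 parameter figure can be checked with a rough weight count. The sketch below assumes the formulation in Krause et al., where an intermediate state m_t = (W_mx x_t) ⊙ (W_mh h_{t-1}) has the same size as the hidden state and replaces h_{t-1} in the four gate computations; biases are ignored, and the sizes used are arbitrary.

```python
def lstm_weight_count(input_size, hidden_size):
    """Weights in a standard LSTM layer, ignoring biases.

    Four transforms (input gate, forget gate, output gate, cell candidate),
    each with an input-to-hidden and a hidden-to-hidden matrix.
    """
    return 4 * (hidden_size * input_size + hidden_size * hidden_size)

def mlstm_weight_count(input_size, hidden_size):
    """Weights in a multiplicative LSTM layer, ignoring biases.

    Assumption: the intermediate state m_t has the same size as the hidden
    state, so it adds one input matrix (W_mx) and one hidden matrix (W_mh)
    on top of the four LSTM transforms.
    """
    return 5 * (hidden_size * input_size + hidden_size * hidden_size)

# Arbitrary example sizes; the ratio does not depend on them
ratio = mlstm_weight_count(16, 128) / lstm_weight_count(16, 128)
```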
[Figure: Traditional LSTM vs. Multiplicative LSTM]
Training & Prediction
Loss Function
• Metric (F1) is not differentiable, and a modified version of it does not converge well
• Use Binary Cross Entropy Loss
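A small example may help show why F1 is awkward as a training loss while binary cross entropy is not. The sketch below uses made-up labels and probabilities, and the 0.5 threshold is an assumption: two sets of predictions get the same F1 because thresholding discards how confident the model is, whereas the cross entropy still distinguishes them.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

def f1_score(y_true, y_prob, threshold=0.5):
    """F1 after thresholding probabilities; the thresholding step is not differentiable."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 1)
    fp = sum(1 for y, yp in zip(y_true, y_pred) if y == 0 and yp == 1)
    fn = sum(1 for y, yp in zip(y_true, y_pred) if y == 1 and yp == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels and two sets of predictions on the same side of the threshold
y_true = [1, 0, 1, 0]
hesitant = [0.6, 0.4, 0.7, 0.3]
confident = [0.9, 0.1, 0.95, 0.05]

f1_hesitant = f1_score(y_true, hesitant)
f1_confident = f1_score(y_true, confident)
bce_hesitant = binary_cross_entropy(y_true, hesitant)
bce_confident = binary_cross_entropy(y_true, confident)
```

Here both F1 values are identical, so F1 gives no signal for moving from the hesitant to the confident predictions, while the cross entropy decreases.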
Optimizer
• Adadelta Optimizer for first
part of training
• Adam Optimizer with learning
rate decay for later part
Train/Val Data Split
• Use 70/30 split when exploring
various models
• Use 95/5 split when fine-tuning
Computing
• Cloud Computing for GPU
computation
Simple Weighted Ensembling with higher weights
on predictions with higher score
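One simple way to realize such a weighted ensemble is sketched below. The slide does not state how the weights were chosen, so weighting each model in proportion to a validation score is an assumption, as are the probabilities, the scores, and the 0.5 threshold.

```python
def weighted_ensemble(predictions, scores):
    """Blend per-model probabilities, weighting each model by its score.

    predictions: one list of probabilities per model, all the same length.
    scores: one score per model; higher scores get larger weights.
    """
    total = sum(scores)
    n = len(predictions[0])
    return [
        sum(s * preds[i] for preds, s in zip(predictions, scores)) / total
        for i in range(n)
    ]

# Hypothetical probabilities from two models and their validation scores
model_probs = [
    [0.9, 0.2, 0.6],
    [0.7, 0.4, 0.4],
]
val_scores = [0.75, 0.25]

blended = weighted_ensemble(model_probs, val_scores)
labels = [1 if p >= 0.5 else 0 for p in blended]
```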
Findings and Patterns
Inspired by the theory and practice,
we present domain-specific findings
Disharmony between BN & Dropout
• Batch Normalization (BN) and Dropout combined together do not work well
• Inconsistency in variance: a recent study suggests that Dropout shifts the variance of a specific neural unit, while BN maintains its variance
• Conducted experiments on three cases: i) BN and Dropout ii) Only Dropout iii) Dropout only after BN
• Empirically observed that using only Dropout performs the best
Significance of Velocity-related Variables
• In NLP, tokenization with greater granularity sometimes achieves better results
• Tried training the model with each trajectory separated into two positions
• This forced us to either delete or duplicate the velocity-related variables
• Neither deleting nor duplicating velocity led to better prediction
• Although ⅔ of the data is missing, information about velocity is valuable
Experiment on LSTM+CNN
• Some people achieved SOTA using LSTM+CNN
• Replicated the model and tried to generalize it to the given geolocation domain
• For the given domain, our approach (LSTM with FC at the end) performed better than LSTM+CNN
Sources:
1. Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. ArXiv, 2018
Through the Exploratory Data Analysis, we found patterns of citizens in the city
[Figures: maps of trajectories for the time windows 00:00 ~ 04:00, 04:00 ~ 08:00, 08:00 ~ 12:00, and 12:00 ~ 16:00]
Target Variable: trajectories inside the yellow box (city center)
Analysis of activity-percentage complements the visualization of counts
[Figure: panels “Trajectory Count” and “City-center Percentage”]
Populated areas such as residential areas, highways, and business areas can be inferred from this graph
We explored the data extensively, and
this led to better feature engineering
Both the distribution of time within a trajectory and the distribution of time between trajectories are right-skewed
“Broken GPS” records exist but are negligible
• No time spent in the trajectory, yet the position shifts
• Happens very rarely. Deep Learning handles small noise well (robustness of DL)
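A check for such "Broken GPS" records could look roughly like the sketch below. The field names (time_entry, time_exit, x_entry, and so on) and the sample records are hypothetical and only illustrate the rule stated above: zero time spent, but different entry and exit positions.

```python
def is_broken_gps(traj):
    """Flag a trajectory with zero duration whose entry and exit positions differ."""
    no_time_spent = traj["time_exit"] == traj["time_entry"]
    position_shifted = (traj["x_entry"], traj["y_entry"]) != (traj["x_exit"], traj["y_exit"])
    return no_time_spent and position_shifted

# Hypothetical trajectories; times are in seconds, positions in arbitrary units
trajectories = [
    # stationary point: zero duration, same position
    {"time_entry": 100, "time_exit": 100, "x_entry": 1.0, "y_entry": 2.0, "x_exit": 1.0, "y_exit": 2.0},
    # zero duration but the position changes: flagged
    {"time_entry": 100, "time_exit": 100, "x_entry": 1.0, "y_entry": 2.0, "x_exit": 5.0, "y_exit": 2.0},
    # ordinary movement over five minutes
    {"time_entry": 100, "time_exit": 400, "x_entry": 1.0, "y_entry": 2.0, "x_exit": 5.0, "y_exit": 2.0},
]

flags = [is_broken_gps(t) for t in trajectories]
broken_share = sum(flags) / len(flags)
```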
Limitations worth noting
• Information about seasonality, such as weekly trends and holidays, is missing and cannot be inferred from the given information
Opportunities to Improve Performance
With fewer constraints on resources,
each stage of the process can be improved
Feature Engineering: Designing “Highway” Column
• Ensemble and EDA suggest that hashes on highway are difficult to predict
• Similar approach to designing “center” column, which significantly improved performance
• Prioritized fine-tuning model over hand-designing columns
Feeding the Input: Stratified Random Shuffling (SRS)
• When bucketing and zero-padding, random shuffle among hashes that have the same sequence length
• Keras does not support SRS. Should be implemented using TF and NumPy
• Less prioritized due to time limit
Hyperparameter Tuning: Grid Search with Multi-GPU
• If more computing power were available, Grid Search may have outperformed manual fine-tuning
[Figure: grid over Learning Rate and Number of Epochs]
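The stratified random shuffling idea (shuffle only among hashes that share a sequence length, so bucketing is preserved) can be sketched in plain Python. The slide proposes TF and NumPy for the real implementation; this standard-library version, with made-up hash names and lengths, is only meant to show the logic.

```python
import random
from collections import defaultdict

def stratified_shuffle(lengths_by_hash, seed=None):
    """Order hashes by sequence length, shuffling only within each length group."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for hash_id, n in lengths_by_hash.items():
        groups[n].append(hash_id)
    ordered = []
    for n in sorted(groups):
        bucket = groups[n]
        rng.shuffle(bucket)  # randomize order inside the bucket only
        ordered.extend(bucket)
    return ordered

# Hypothetical hashes mapped to their number of trajectories
lengths_by_hash = {"a": 1, "b": 3, "c": 1, "d": 2, "e": 3, "f": 2}

order = stratified_shuffle(lengths_by_hash, seed=42)
order_lengths = [lengths_by_hash[h] for h in order]
```

Calling the function with a different seed each epoch changes which hashes share a batch while keeping the sequences sorted by length, so the amount of zero-padding stays low.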
Smart Cities Applications
Atlanta stands to benefit from
Data-driven Litter Collection Application
Sources: 1. Forbes 2. U.S. Bureau of Labor Statistics 3. U.S. Census Bureau 4. Atlanta Journal-Constitution
5. TechRepublic 6. Bisnow
While urbanization and the surge of trash are expected to continue,
Atlanta’s ecosystem presents an opportunity to develop technology for trash management
Demography
• Trends in Atlanta: Higher % change in employment than the U.S. average in 2014~19; Metro Atlanta in 2019 has the 4th fastest growing population in the U.S.
• Implication: Development of the urban culture, which leads to an increase in economic activities and consumption
Policy
• Trends in Atlanta: Major changes to the trash collection schedule made in the last 2 consecutive years (July 9, 2018 and April 1, 2019)
• Implication: Frequent adjustments to policies indicate the increase of trash; willingness of the state to tackle the surging amount of trash
Ecosystem
• Trends in Atlanta: Rubicon Global, the first tech unicorn in trash, is Atlanta-based; a “Smart Dumpster” for recycling has been developed by the Government of Atlanta
• Implication: Potential to leverage the existing infrastructure; synergies with the existing services
Agile Software Development with
a clear KPI is crucial for the Application
Value Proposition
• Improve citizens’ quality of life
• Protect soil and water quality
• Save cost of cleaning
• Analyze: Determine requirements by involving users continually
• Design: Database, server, and UI/UX for real-time prediction
• Develop: Predict locations/crowdedness of public gatherings
• Test: Similar to EY NextWave Competition, aim for higher F1 score
• Deploy: e.g. population density before and after the baseball game
• Review: Community Appearance Index (CAI), number of complaints
Sources: 1. Team’s Analysis 2. Keep Atlanta Beautiful Commission
A prediction system can effectively allocate cleaning staff to temporarily crowded
areas that yield more pedestrian litter, and optimize trash truck routes
Thank you!

More Related Content

What's hot

Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic SegmentationSemantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation岳華 杜
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Gaurav Mittal
 
Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...
Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...
Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...Thien Q. Tran
 
오토인코더의 모든 것
오토인코더의 모든 것오토인코더의 모든 것
오토인코더의 모든 것NAVER Engineering
 
Object Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning FrameworkObject Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning FrameworkNader Karimi
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer VisionSungjoon Choi
 
PyTorch, PixyzによるGenerative Query Networkの実装
PyTorch, PixyzによるGenerative Query Networkの実装PyTorch, PixyzによるGenerative Query Networkの実装
PyTorch, PixyzによるGenerative Query Networkの実装Shohei Taniguchi
 
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...Kirill Eremenko
 
Factorization Machines and Applications in Recommender Systems
Factorization Machines and Applications in Recommender SystemsFactorization Machines and Applications in Recommender Systems
Factorization Machines and Applications in Recommender SystemsEvgeniy Marinov
 
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)Universitat Politècnica de Catalunya
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural NetworksAshray Bhandare
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You NeedDaiki Tanaka
 
Deep learning for object detection
Deep learning for object detectionDeep learning for object detection
Deep learning for object detectionWenjing Chen
 
Machine Learning vs. Deep Learning
Machine Learning vs. Deep LearningMachine Learning vs. Deep Learning
Machine Learning vs. Deep LearningBelatrix Software
 
Object Detection using Deep Neural Networks
Object Detection using Deep Neural NetworksObject Detection using Deep Neural Networks
Object Detection using Deep Neural NetworksUsman Qayyum
 
Deep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & FutureDeep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & FutureRouyun Pan
 
Image Object Detection Pipeline
Image Object Detection PipelineImage Object Detection Pipeline
Image Object Detection PipelineAbhinav Dadhich
 
Interpretability beyond feature attribution quantitative testing with concept...
Interpretability beyond feature attribution quantitative testing with concept...Interpretability beyond feature attribution quantitative testing with concept...
Interpretability beyond feature attribution quantitative testing with concept...MLconf
 
Deep Learning - Overview of my work II
Deep Learning - Overview of my work IIDeep Learning - Overview of my work II
Deep Learning - Overview of my work IIMohamed Loey
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Preferred Networks
 

What's hot (20)

Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic SegmentationSemantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
Semantic Segmentation - Fully Convolutional Networks for Semantic Segmentation
 
Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)Convolutional Neural Networks (CNN)
Convolutional Neural Networks (CNN)
 
Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...
Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...
Deep learning in healthcare: Oppotunities and challenges with Electronic Medi...
 
오토인코더의 모든 것
오토인코더의 모든 것오토인코더의 모든 것
오토인코더의 모든 것
 
Object Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning FrameworkObject Detection Using R-CNN Deep Learning Framework
Object Detection Using R-CNN Deep Learning Framework
 
Deep Learning in Computer Vision
Deep Learning in Computer VisionDeep Learning in Computer Vision
Deep Learning in Computer Vision
 
PyTorch, PixyzによるGenerative Query Networkの実装
PyTorch, PixyzによるGenerative Query Networkの実装PyTorch, PixyzによるGenerative Query Networkの実装
PyTorch, PixyzによるGenerative Query Networkの実装
 
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Vanishing Gradient ...
 
Factorization Machines and Applications in Recommender Systems
Factorization Machines and Applications in Recommender SystemsFactorization Machines and Applications in Recommender Systems
Factorization Machines and Applications in Recommender Systems
 
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
Image Segmentation (D3L1 2017 UPC Deep Learning for Computer Vision)
 
Convolutional Neural Networks
Convolutional Neural NetworksConvolutional Neural Networks
Convolutional Neural Networks
 
[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need[Paper Reading] Attention is All You Need
[Paper Reading] Attention is All You Need
 
Deep learning for object detection
Deep learning for object detectionDeep learning for object detection
Deep learning for object detection
 
Machine Learning vs. Deep Learning
Machine Learning vs. Deep LearningMachine Learning vs. Deep Learning
Machine Learning vs. Deep Learning
 
Object Detection using Deep Neural Networks
Object Detection using Deep Neural NetworksObject Detection using Deep Neural Networks
Object Detection using Deep Neural Networks
 
Deep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & FutureDeep Learning Hardware: Past, Present, & Future
Deep Learning Hardware: Past, Present, & Future
 
Image Object Detection Pipeline
Image Object Detection PipelineImage Object Detection Pipeline
Image Object Detection Pipeline
 
Interpretability beyond feature attribution quantitative testing with concept...
Interpretability beyond feature attribution quantitative testing with concept...Interpretability beyond feature attribution quantitative testing with concept...
Interpretability beyond feature attribution quantitative testing with concept...
 
Deep Learning - Overview of my work II
Deep Learning - Overview of my work IIDeep Learning - Overview of my work II
Deep Learning - Overview of my work II
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
 

Similar to 1st Place in EY Data Science Challenge

Model Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep LearningModel Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep LearningPramit Choudhary
 
Traffic Prediction from Street Network images.pptx
Traffic Prediction from  Street Network images.pptxTraffic Prediction from  Street Network images.pptx
Traffic Prediction from Street Network images.pptxchirantanGupta1
 
Drsp dimension reduction for similarity matching and pruning of time series ...
Drsp  dimension reduction for similarity matching and pruning of time series ...Drsp  dimension reduction for similarity matching and pruning of time series ...
Drsp dimension reduction for similarity matching and pruning of time series ...IJDKP
 
RunPool: A Dynamic Pooling Layer for Convolution Neural Network
RunPool: A Dynamic Pooling Layer for Convolution Neural NetworkRunPool: A Dynamic Pooling Layer for Convolution Neural Network
RunPool: A Dynamic Pooling Layer for Convolution Neural NetworkPutra Wanda
 
Combinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learningCombinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learning민재 정
 
MODELS 2022 Journal-First presentation: ETeMoX - explaining reinforcement lea...
MODELS 2022 Journal-First presentation: ETeMoX - explaining reinforcement lea...MODELS 2022 Journal-First presentation: ETeMoX - explaining reinforcement lea...
MODELS 2022 Journal-First presentation: ETeMoX - explaining reinforcement lea...Antonio García-Domínguez
 
Developing Competitive Strategies in Higher Education through Visual Data Mining
Developing Competitive Strategies in Higher Education through Visual Data MiningDeveloping Competitive Strategies in Higher Education through Visual Data Mining
Developing Competitive Strategies in Higher Education through Visual Data MiningGurdal Ertek
 
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...ijsrd.com
 
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...ijsrd.com
 
Analysis of computational
Analysis of computationalAnalysis of computational
Analysis of computationalcsandit
 
An Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) Tools
An Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) ToolsAn Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) Tools
An Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) ToolsIJMER
 
IEEE Big data 2016 Title and Abstract
IEEE Big data  2016 Title and AbstractIEEE Big data  2016 Title and Abstract
IEEE Big data 2016 Title and Abstracttsysglobalsolutions
 
KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning
KagNet: Knowledge-Aware Graph Networks for Commonsense ReasoningKagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning
KagNet: Knowledge-Aware Graph Networks for Commonsense ReasoningKorea University
 
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...IJCNCJournal
 
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...IJCNCJournal
 
An accumulative computation framework on MapReduce ppl2013
An accumulative computation framework on MapReduce ppl2013An accumulative computation framework on MapReduce ppl2013
An accumulative computation framework on MapReduce ppl2013Yu Liu
 

Similar to 1st Place in EY Data Science Challenge (20)

Vtc9252019
Vtc9252019Vtc9252019
Vtc9252019
 
Model Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep LearningModel Evaluation in the land of Deep Learning
Model Evaluation in the land of Deep Learning
 
Traffic Prediction from Street Network images.pptx
Traffic Prediction from  Street Network images.pptxTraffic Prediction from  Street Network images.pptx
Traffic Prediction from Street Network images.pptx
 
Drsp dimension reduction for similarity matching and pruning of time series ...
Drsp  dimension reduction for similarity matching and pruning of time series ...Drsp  dimension reduction for similarity matching and pruning of time series ...
Drsp dimension reduction for similarity matching and pruning of time series ...
 
RunPool: A Dynamic Pooling Layer for Convolution Neural Network
RunPool: A Dynamic Pooling Layer for Convolution Neural NetworkRunPool: A Dynamic Pooling Layer for Convolution Neural Network
RunPool: A Dynamic Pooling Layer for Convolution Neural Network
 
Presentation
PresentationPresentation
Presentation
 
Combinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learningCombinatorial optimization and deep reinforcement learning
Combinatorial optimization and deep reinforcement learning
 
PointNet
PointNetPointNet
PointNet
 
MODELS 2022 Journal-First presentation: ETeMoX - explaining reinforcement lea...
MODELS 2022 Journal-First presentation: ETeMoX - explaining reinforcement lea...MODELS 2022 Journal-First presentation: ETeMoX - explaining reinforcement lea...
MODELS 2022 Journal-First presentation: ETeMoX - explaining reinforcement lea...
 
Developing Competitive Strategies in Higher Education through Visual Data Mining
Developing Competitive Strategies in Higher Education through Visual Data MiningDeveloping Competitive Strategies in Higher Education through Visual Data Mining
Developing Competitive Strategies in Higher Education through Visual Data Mining
 
Geometric Deep Learning
Geometric Deep Learning Geometric Deep Learning
Geometric Deep Learning
 
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
 
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
Artificial Neural Network Based Graphical User Interface for Estimation of Fa...
 
Analysis of computational
Analysis of computationalAnalysis of computational
Analysis of computational
 
An Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) Tools
An Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) ToolsAn Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) Tools
An Approach for Project Scheduling Using PERT/CPM and Petri Nets (PNs) Tools
 
IEEE Big data 2016 Title and Abstract
IEEE Big data  2016 Title and AbstractIEEE Big data  2016 Title and Abstract
IEEE Big data 2016 Title and Abstract
 
KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning
KagNet: Knowledge-Aware Graph Networks for Commonsense ReasoningKagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning
KagNet: Knowledge-Aware Graph Networks for Commonsense Reasoning
 
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
 
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
Effective Multi-Stage Training Model for Edge Computing Devices in Intrusion ...
 
An accumulative computation framework on MapReduce ppl2013
An accumulative computation framework on MapReduce ppl2013An accumulative computation framework on MapReduce ppl2013
An accumulative computation framework on MapReduce ppl2013
 

Recently uploaded

BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxolyaivanovalion
 
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls Punjabi Bagh 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...


1st Place in EY Data Science Challenge

  • 1. EY Hong Kong NextWave Data Science Challenge Byung Eun Jeon - byungeuni Hyunju Shim - sg04088 University of Hong Kong May 30, 2019
  • 2. Disclaimer These presentation slides are not official EY presentation slides. A winning team of EY Hong Kong NextWave Data Science Competition produced the slides for the presentation on May 30, 2019 at the Citic Tower, Admiralty, Hong Kong.
  • 3. Agenda: 1. Methodology and Algorithms; 2. Findings and Patterns; 3. Opportunities to Improve Performance; 4. Smart Cities Applications. Page 2 EY Hong Kong Data Science Challenge | Byung Eun Jeon & Hyunju Shim (HKU)
  • 4. Methodology and Algorithms
  • 5. Overview of Methodology: 1. Problem Formulation; 2. EDA & Feature Engineering; 3. Model Exploration and Selection; 4. Training & Fine-Tuning; 5. Prediction & Ensembling.
  • 6. Problem Formulation. Objective: a model that predicts whether a specific citizen will be in a predefined city center. Meaningful points: (1) IDs reset every 24 hours, so the same device cannot be traced across days. (2) The exact date and day of the week are not known, which limits the use of weekend/weekday trends and public holidays. (3) ⅔ of the velocity-related data is missing, so data handling is an issue. (4) A hash links several trajectories into a sequence, which makes this a variable-length sequence binary classification problem; the number of trajectories for each unique hash ranges from 1 to 20. Target variable: 0 (not in center) or 1 (in center).
  • 7. Selection of Approach: Deep Learning. Why deep learning? The data is large, and it is difficult to hand-design useful columns. Non-ML statistical models (e.g. ARIMA, Holt-Winters method) require many assumptions, yet it is difficult to make good assumptions. Within machine learning, deep learning requires minimal feature engineering because DL is flexible at approximating non-linear functions useful for prediction. Convolutional NN family: has no sense of time (i.e. difficult to learn the seasonality during the day). Recurrent NN family: why LSTM? It copes better with vanishing gradients and is capable of learning long-term dependencies. Other machine learning models (Random Forest, k-Nearest-Neighbors): difficult to capture a sense of time.
  • 8. Feature Engineering. Modification of given features: Target; Entry Time (to seconds), Exit Time (to seconds). Handling missing values of velocity-related features: fill with each group’s median and mark NaN values with new “valid” columns; if time spent in the trajectory is 0, fill velocities with 0, otherwise fill velocities with the median of non-NaN values. Design of new features: Entry Center, Exit Center; Time Spent, Time After; New Hash (first trajectory of the hash), Last Hash (last trajectory of the hash); Vmin Valid, Vmean Valid, Vmax Valid. MinMax normalization on continuous variables: rescale so that inputs with different ranges have the same scale. Finally, concatenate trajectories with the same hash to create one variable-length sequence for each unique hash.
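The missing-velocity handling and MinMax scaling from slide 8 can be sketched roughly as follows. This is a minimal illustration, not the team's code: the column names (`vmean`, `time_spent`), the helper names, and the use of a single group for the median are all assumptions, since the slides do not say how the groups were defined.

```python
from statistics import median


def fill_velocity(rows, key="vmean"):
    """Fill missing velocities and add a '<key>_valid' indicator column.

    rows: list of dicts with 'time_spent' and `key` (a float, or None if missing).
    Assumption: all rows form one group; the slides mention per-group medians
    without defining the groups.
    """
    observed = [r[key] for r in rows if r[key] is not None]
    fallback = median(observed) if observed else 0.0
    filled = []
    for r in rows:
        new = dict(r)
        # 1 if the velocity was actually observed, 0 if it had to be imputed.
        new[key + "_valid"] = 0 if r[key] is None else 1
        if r[key] is None:
            # No time spent in the trajectory means no movement: velocity 0.
            new[key] = 0.0 if r["time_spent"] == 0 else fallback
        filled.append(new)
    return filled


def min_max(values):
    """Rescale values linearly to [0, 1]; a constant column maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

Keeping the indicator column lets the model tell an imputed velocity apart from a measured one, which matters when about two thirds of the values are missing.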
  • 9. Input with Variable Sequence Length. Problem: if sequence length ranges from 1 to 20, input data becomes too sparse. Objective: create batches of similar-length sequences to reduce the sparsity of input data. General way of “Variable Sequence Length LSTM” versus our approach: combine trajectories to create one long sequence for each unique hash; zero-pad each batch to the maximum sequence length in that batch (1.2 million zero-paddings per epoch on average from 1000 simulations); bucketing and zero-padding: sort hashes according to number of trajectories (313 zero-paddings per epoch).
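The effect of bucketing on padding, as described on slide 9, can be illustrated with a small simulation. The sequence lengths below are drawn uniformly from 1 to 20 purely for illustration, and the batch size of 64 is an assumption, so the counts will not match the figures quoted on the slide.

```python
import random


def padding_count(lengths, batch_size):
    """Count zero-padded steps when each batch is padded to its own max length."""
    total = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        longest = max(batch)
        total += sum(longest - n for n in batch)
    return total


# Hypothetical data: 10,000 hashes with 1 to 20 trajectories each.
rng = random.Random(0)
lengths = [rng.randint(1, 20) for _ in range(10_000)]

# Batches taken in arbitrary order versus batches taken after sorting by length.
unsorted_padding = padding_count(lengths, batch_size=64)
bucketed_padding = padding_count(sorted(lengths), batch_size=64)
print(unsorted_padding, bucketed_padding)
```

After sorting, only the few batches that straddle two lengths need any padding at all, which is why the bucketed count is far smaller.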
  • 10. Multiplicative LSTM. Motivation: an LSTM architecture whose hidden-to-hidden transition functions are input-dependent and thus better suited to recover from surprising inputs. Larger number of parameters (x 1.25): trade-off between flexibility of model and training time. Figure: Traditional LSTM vs. Multiplicative LSTM. Source: Krause, B., Lu, L., Murray, I., and Renals, S. Multiplicative LSTM for sequence modelling. ArXiv, 2016.
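One way to see where the "x 1.25" on slide 10 may come from is to count weight matrices. The sketch below assumes the formulation in Krause et al., where an intermediate state m_t = (W_mx x_t) ⊙ (W_mh h_{t-1}) replaces h_{t-1} in the four LSTM blocks; it ignores biases and assumes m_t has the same size as the hidden state. The function names are illustrative.

```python
def lstm_weights(input_dim, hidden_dim):
    """Weights in a standard LSTM layer, biases ignored.

    Four blocks (input, forget, output gates and the candidate), each with
    one input-to-hidden and one hidden-to-hidden matrix.
    """
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim)


def mlstm_weights(input_dim, hidden_dim):
    """Weights in a multiplicative LSTM layer, biases ignored.

    Same four blocks (now fed by m_t instead of h_{t-1}), plus the extra
    pair W_mx, W_mh that builds the intermediate state m_t.
    """
    return 5 * (input_dim * hidden_dim + hidden_dim * hidden_dim)


ratio = mlstm_weights(10, 128) / lstm_weights(10, 128)
print(ratio)
```

Under these assumptions the ratio is 5/4 regardless of the chosen dimensions, which is the flexibility versus training-time trade-off the slide refers to.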
  • 11. Training & Prediction. Loss function: the metric (F1) is not differentiable and, when modified, does not converge well; use binary cross-entropy loss. Optimizer: Adadelta optimizer for the first part of training; Adam optimizer with learning rate decay for the later part. Train/val data split: use a 70/30 split when exploring various models; use a 95/5 split when fine-tuning. Computing: cloud computing for GPU computation. Ensembling: simple weighted ensembling with higher weights on predictions with higher score.
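The weighted ensembling mentioned on slide 11 might look like the sketch below. The slide only says that predictions with a higher score get a higher weight; making the weights proportional to each model's validation F1 and thresholding at 0.5 are assumptions made here for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels, computed as 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_ensemble(prob_lists, scores, threshold=0.5):
    """Blend per-model probabilities with weights proportional to `scores`.

    prob_lists: one list of predicted probabilities per model.
    scores: one validation score per model (assumed positive).
    """
    total = sum(scores)
    weights = [s / total for s in scores]
    n = len(prob_lists[0])
    blended = [
        sum(w * probs[i] for w, probs in zip(weights, prob_lists))
        for i in range(n)
    ]
    return [1 if b >= threshold else 0 for b in blended]
```

For example, `weighted_ensemble([[0.9, 0.2], [0.7, 0.4]], scores=[0.9, 0.6])` gives the first model a weight of 0.6 and the second 0.4 before thresholding.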
  • 12. Findings and Patterns
  • 13. Inspired by the theory and practice, we present domain-specific findings. Disharmony between BN & Dropout: Batch Normalization (BN) and Dropout combined together do not work well. Inconsistency in variance: a recent study suggests that Dropout shifts the variance of a specific neural unit, while BN maintains its variance. We conducted experiments on three cases: i) BN and Dropout, ii) only Dropout, iii) Dropout only after BN, and empirically observed that using only Dropout performs the best. Significance of velocity-related variables: in NLP, tokenization with greater granularity sometimes achieves better results, so we tried training the model with each trajectory separated into two positions. This forced us to either delete or duplicate the velocity-related variables. Neither deleting nor duplicating velocity led to better prediction; although ⅔ of the data is missing, information about velocity is valuable. Experiment on LSTM+CNN: some people achieved SOTA using LSTM+CNN. We replicated the model and tried to generalize it to the given geolocation domain. For the given domain, our approach (LSTM with FC at the end) performed better than LSTM+CNN. Sources: 1. Xiang Li, Shuo Chen, Xiaolin Hu, and Jian Yang. Understanding the disharmony between dropout and batch normalization by variance shift. ArXiv, 2018
  • 14. Through the exploratory data analysis, we found patterns of citizens in the city. Time: 00:00 ~ 04:00. Target variable: trajectories inside the yellow box (city center).
  • 15. Through the exploratory data analysis, we found patterns of citizens in the city. Time: 04:00 ~ 08:00. Target variable: trajectories inside the yellow box (city center).
  • 16. Through the exploratory data analysis, we found patterns of citizens in the city. Time: 08:00 ~ 12:00. Target variable: trajectories inside the yellow box (city center).
  • 17. Through the exploratory data analysis, we found patterns of citizens in the city. Time: 12:00 ~ 16:00. Target variable: trajectories inside the yellow box (city center). Populated areas such as residential areas, highways, and business areas can be inferred from this graph.
  • 18. Analysis of activity-percentage complements the visualization of counts. Figure panels: Trajectory Count; City-center Percentage.
  • 19. We explored the data extensively, and this led to better feature engineering. Both the distribution of time within the trajectory and the distribution of time between trajectories are right-skewed. “Broken GPS” records exist but are negligible: no time is spent in the trajectory, yet the position shifts. This happens very rarely, and deep learning handles small noise well (robustness of DL). Limitations worth noting: information about seasonality, such as weekly trends and holidays, is missing and cannot be inferred from the given information.
  • 20. Opportunities to Improve Performance
  • 21. With fewer constraints on resources, each stage of the process can be improved. Feature engineering, designing a “Highway” column: ensemble and EDA suggest that hashes on the highway are difficult to predict. This is a similar approach to designing the “center” column, which significantly improved performance. We prioritized fine-tuning the model over hand-designing columns. Feeding the input, Stratified Random Shuffling (SRS): when bucketing and zero-padding, randomly shuffle among hashes that have the same sequence length. Keras does not support SRS, so it should be implemented using TF and NumPy. This was given lower priority due to the time limit. Hyperparameter tuning, grid search with multi-GPU: if more computing power were available, grid search (over learning rate and number of epochs) may have outperformed manual fine-tuning.
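The stratified random shuffling proposed on slide 21 could be sketched as below: sequences stay sorted by length so that bucketing still works, but the order inside each length group changes every epoch. This is a plain-Python illustration with hypothetical names, not the TF and NumPy implementation the slide has in mind.

```python
import random
from collections import defaultdict


def stratified_shuffle(lengths, rng):
    """Return sample indices sorted by length, shuffled within equal lengths.

    lengths: sequence length (number of trajectories) for each hash.
    rng: a random.Random instance, so each epoch can use a fresh ordering.
    """
    groups = defaultdict(list)
    for idx, n in enumerate(lengths):
        groups[n].append(idx)
    order = []
    for n in sorted(groups):
        rng.shuffle(groups[n])  # randomize only among same-length hashes
        order.extend(groups[n])
    return order


# Hypothetical usage: seven hashes with one to three trajectories each.
lengths = [3, 1, 2, 1, 3, 2, 1]
order = stratified_shuffle(lengths, random.Random(0))
print(order)
```

Because the shuffle never crosses length groups, the amount of zero-padding per batch is the same as with plain bucketing, while batch composition still varies between epochs.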
  • 22. Smart Cities Applications
  • 23. Atlanta stands to benefit from a data-driven litter collection application. While urbanization and the surge of trash are expected to continue, Atlanta’s ecosystem presents an opportunity to develop technology for trash management. Trends in Atlanta. Demography: higher % change in employment than that of the U.S. average in 2014~19; Metro Atlanta in 2019 has the 4th fastest growing population in the U.S. Policy: major changes to the trash collection schedule were made in the last 2 consecutive years (April 1, 2019 and July 9, 2018). Ecosystem: Rubicon Global, the first tech unicorn in trash, is Atlanta-based; a “Smart Dumpster” for recycling has been developed by the Government of Atlanta. Implications: development of the urban culture, which leads to an increase in economic activities and consumption; frequent adjustments to policies indicate the increase of trash; willingness of the state to tackle the surging amount of trash; potential to leverage the existing infrastructure; synergies with the existing services. Sources: 1. Forbes 2. U.S. Bureau of Labor Statistics 3. U.S. Census Bureau 4. Atlanta Journal-Constitution 5. TechRepublic 6. Bisnow
  • 24. Agile software development with a clear KPI is crucial for the application. Analyze: determine requirements by involving users continually. Design: database, server, and UI/UX for real-time prediction. Develop: predict locations/crowdedness of public gatherings. Test: similar to the EY NextWave Competition, aim for a higher F1 score. Deploy: e.g. population density before and after a baseball game. Review: Community Appearance Index (CAI), number of complaints. Value proposition: improve citizens’ quality of life; protect soil and water quality; save the cost of cleaning. The prediction system can effectively allocate cleaning staff to temporarily crowded areas that yield more pedestrian litter, and optimize trash truck routes. Sources: 1. Team’s Analysis 2. Keep Atlanta Beautiful Commission