The Monthly Challenge was a fantastic journey of learning and experimenting.
The video: https://youtu.be/Y3-tWGheXUA
Read more about the challenge here: https://www.datasciencesociety.net/events/data-monthly-challenge/
The articles: http://bit.ly/2NmKIC2
It is also part of a university program: an innovative seminar in the Master’s program in Business Analytics at FEBA, Sofia University. The challenge is free and takes participants from all around the world through the steps of predicting air pollution from a given dataset.
2. How are we going to present?
• The steps we have taken so far
• Which approach we chose for each step
• Why?
• How? (We mean the technical part here.)
3. Step 1 – Import the data
• Import all the datasets and look at the classes
• Load the map of Sofia to have a look (package “ggmap”)
• Left-join the Metadata with the EEA data
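A minimal sketch of this step in R (the file names and the join key are illustrative assumptions; get_map() also needs a Google Maps API key registered via register_google()):

library(readr)
library(dplyr)
library(ggmap)

# Import all the datasets (illustrative file names) and inspect the classes
eea      <- read_csv("EEA_data.csv")
metadata <- read_csv("metadata.csv")
citizen  <- read_csv("citizen_science.csv")
str(eea)

# Load a map of Sofia for a first look
sofia_map <- get_map(location = "Sofia, Bulgaria", zoom = 12)
ggmap(sofia_map)

# Left-join the station Metadata onto the EEA measurements
# ("station" is an assumed join key)
eea <- left_join(eea, metadata, by = "station")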
4. Step 2 – Dealing with the official measurements data
• Split the EEA data based on the measurement time – hours vs days
• Interpolate the PM10 measurements using the imputeTS package’s na_kalman() method
• Check the stations on the map and the PM10 values over 50 µg/m³
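A small sketch of the interpolation, assuming an hourly series eea_hourly$pm10 (illustrative names) with NAs at the missing timestamps:

library(imputeTS)

# na_kalman() fits a structural time-series model and fills the gaps
# by Kalman smoothing
eea_hourly$pm10 <- na_kalman(eea_hourly$pm10, model = "StructTS")

# Flag the values above the 50 µg/m³ daily limit for the map check
eea_hourly$over_limit <- eea_hourly$pm10 > 50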
8. Step 3 – Dealing with the citizen data
• Clean the data:
o Remove records without a geohash
o Keep only the records within Sofia
o Remove duplicates by geohash and time – take the mean value
o Basic stats
o Remove mismeasurements
o What about the PM10 measurements?
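The cleaning can be sketched as a single dplyr pipeline; geohash, time, and p1 are assumed column names, sofia_geohashes an assumed precomputed set of geohashes covering Sofia, and the mismeasurement cut-offs are illustrative:

library(dplyr)

citizen_clean <- citizen %>%
  filter(!is.na(geohash), geohash != "") %>%   # remove records without geohash
  filter(geohash %in% sofia_geohashes) %>%     # keep only the records in Sofia
  group_by(geohash, time) %>%                  # duplicates by geohash and time...
  summarise(p1 = mean(p1, na.rm = TRUE),       # ...collapsed to the mean value
            .groups = "drop") %>%
  filter(p1 >= 0, p1 <= 1000)                  # remove obvious mismeasurements

summary(citizen_clean$p1)                      # basic stats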
10. Step 4 – Clustering
• First try: build clusters with k-means (k = 15)
• Look at the mean PM10 per cluster
• Re-cluster
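A first clustering pass might look like this; stations with lon/lat/pm10 columns is an assumed input, and k = 15 follows the slide:

set.seed(42)  # k-means is sensitive to the random initialisation

# Cluster the measurement points by location (column names illustrative)
km <- kmeans(stations[, c("lon", "lat")], centers = 15, nstart = 25)
stations$cluster <- km$cluster

# Look at the mean PM10 per cluster; re-cluster where the means
# are too heterogeneous
aggregate(pm10 ~ cluster, data = stations, FUN = mean)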