OpenStreetMap Project
Data Wrangling with MongoDB
By Edmond Chin Juin Fung
Map Area: Singapore
Content:
1. Problems Encountered in the map
1.1 Nodes that are not relevant to Singapore
1.1.1 is_in:country
1.1.2 is_in
1.2 Problems related to the subtags of the nodes
1.3 Omission of “name” attribute
1.4 Auditing “street”
1.5 Auditing Post Code
1.5.1 Letters and whitespace in postcodes
1.5.2 Cleaning data
- Stage 1 – Filtering with city
- Stage 2 – Filtering with street
- Stage 3 – Filtering with name
- Stage 4 – Filtering the rest of the documents
- Final stage – Cleaning
2. Data Overview
3. Additional Idea
References
1. Problems Encountered in the Map
After converting the Singapore.osm file into a JSON file and importing it into MongoDB (refer to conversion_original.py), I ran some queries in the mongo shell to analyse the data and noticed multiple problems, which are discussed below.
1.1 Nodes that are not relevant to Singapore
1.1.1 is_in:country
Basic querying revealed that some of the nodes have an “is_in:country” attribute. Using a query such as find({"is_in:country": {"$exists": 1}}) together with .count(), I printed the list of nodes with this attribute and their total number. Many of them are not in Singapore but in neighbouring countries such as Malaysia and Indonesia. Example as below:
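Expressed in pymongo rather than the mongo shell, the query above might look like this minimal sketch (the database and collection names “osm” and “singapore” are assumptions, not the actual names used in the project):

from pymongo import MongoClient

# Connect to the local MongoDB instance holding the converted OSM data.
db = MongoClient("localhost", 27017).osm

# Count and list the documents that carry an "is_in:country" tag.
query = {"is_in:country": {"$exists": 1}}
print(db.singapore.count_documents(query))
for doc in db.singapore.find(query):
    print(doc.get("is_in:country"), doc.get("name"))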
From the example above, the place is stated to be in the state of Johor, Malaysia. The name “Layang-Layang” is definitely a place in Malaysia, so this document should not belong in the Singapore data.
Unfortunately, we cannot assume that every document with a non-Singapore value for “is_in:country” is actually located outside Singapore, since the value may simply be a user input mistake. An input mistake is less likely, however, if the user entered two non-Singapore values into two different attributes. In the example above, “is_in:country” has the value “Malaysia” and “is_in:state” has the value “Johor”; both are non-Singapore values, so the document is less likely to be an input mistake and more likely to be genuine non-Singapore data.
(The Python code used for the following paragraphs can be found in verification.py in the zip file.)
Using this argument, I first found the total number of documents that have “is_in:country” and its unique values. Results as below:
I then found the unique values of “is_in:state” among the documents whose “is_in:country” value is not “Singapore”. The results are as follows:
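The query behind this step, as a pymongo sketch (same assumed database and collection names as before), groups the non-Singapore documents by their “is_in:state” value:

from pymongo import MongoClient

db = MongoClient("localhost", 27017).osm

# Distinct "is_in:state" values among documents whose "is_in:country" is not "Singapore".
pipeline = [
    {"$match": {"is_in:country": {"$exists": 1, "$ne": "Singapore"}}},
    {"$group": {"_id": "$is_in:state", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in db.singapore.aggregate(pipeline):
    print(row)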
All 6 of them are non-Singapore values, so those documents are suitable for removal. Since the next most common attribute shared among the remaining documents is “name”, I had to verify them one by one. Fortunately, only 39 documents were left to go through, and they are not difficult to verify, since Singapore locations are named in English whereas Malaysian and Indonesian locations are named in Malay. I ran some code and part of the results are as follows:
All of their names are in Malay, so they are non-Singapore locations and it is safe to remove them from the data.
From this analysis, it is now safe to conclude that we can eliminate all documents that have a non-Singapore value for the “is_in:country” attribute. I went back to the OSM file and used Python code (the cleanup_is_in function in conversion_revised.py) to filter out these documents before converting to JSON and importing into MongoDB. The results are as below:
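The real filter is the cleanup_is_in function in conversion_revised.py; a simplified sketch of the idea (helper names are hypothetical and the element handling is reduced to the essentials) could look like this:

import xml.etree.ElementTree as ET

def is_outside_singapore(element):
    # True when the element carries an "is_in:country" tag with a non-Singapore value.
    for tag in element.iter("tag"):
        if tag.attrib.get("k") == "is_in:country" and tag.attrib.get("v") != "Singapore":
            return True
    return False

def kept_elements(osm_file):
    # Stream through the OSM file and yield only the nodes and ways worth keeping.
    for _, element in ET.iterparse(osm_file):
        if element.tag in ("node", "way") and not is_outside_singapore(element):
            yield element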
1.1.2 is_in
(Additional cleaning; code can be found in verification.py)
After cleaning the data, further querying revealed nodes that do not have an “is_in:country” attribute but have an “is_in” attribute that also points to locations outside Singapore. These nodes were missed by the previous analysis, which was based on the “is_in:country” attribute. Example below:
From the example, we see an “is_in” attribute that points to a non-Singapore location without any “is_in:country” attribute. I then ran a query on the data. Result as below:
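A pymongo sketch of this query (collection name assumed as before): list the distinct “is_in” values among documents that lack “is_in:country”.

from pymongo import MongoClient

db = MongoClient("localhost", 27017).osm

# Distinct "is_in" values for documents with "is_in" but without "is_in:country".
values = db.singapore.distinct(
    "is_in", {"is_in": {"$exists": 1}, "is_in:country": {"$exists": 0}}
)
print(len(values))
for value in values:
    print(value)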
From the result above, apart from 3 values (“Singapore, , Singapore”, “Singapore”, “Sentosa”), the rest do not belong to Singapore. As in the previous analysis, to make sure they are not input mistakes, I ran the query again together with the “name” attribute. If the “name” value is in Malay, we can assume the document does not belong in Singapore. Part of the results are as below:
I identified the names as Malay, so it is safe to remove those documents. The remaining documents are those that do not have a “name”. Example as below:
From the example above, there is no additional data that allows me to determine whether the node is genuinely outside Singapore or simply a user input mistake. Therefore, I removed these documents as well (please refer to conversion_revised.py for the identification and removal functions). In the same code I also replaced the “is_in” value “Singapore,,Singapore” with “Singapore”. The results are as below:
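In the report these edits happen while re-converting the OSM file; an equivalent in-database sketch (collection name and the exact string variants are assumptions) would be:

from pymongo import MongoClient

db = MongoClient("localhost", 27017).osm

# Normalise the odd "Singapore,,Singapore" variants to plain "Singapore".
db.singapore.update_many(
    {"is_in": {"$in": ["Singapore,,Singapore", "Singapore, , Singapore"]}},
    {"$set": {"is_in": "Singapore"}},
)

# Drop documents whose "is_in" points outside Singapore and that have no "is_in:country".
keep = ["Singapore", "Sentosa"]
db.singapore.delete_many(
    {"is_in": {"$exists": 1, "$nin": keep}, "is_in:country": {"$exists": 0}}
)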
1.2 Problems related to the subtags of the nodes
While querying the number of nodes and ways, I found a discrepancy between the total number of documents and the sum of the numbers of nodes and ways.
The total number of documents:
The total number of nodes:
The total number of ways:
As you can see above, the values do not add up: there are 42 documents that are neither node nor way. This is odd, because during the conversion from OSM to JSON I specifically programmed it so that only elements of type “node” and “way” would be written to the JSON file. So what are these 42 unknowns? Some querying reveals that the type attribute has been overwritten by other information (refer to the diagram below).
The problem lies in the way I wrote my Python code combined with the OSM data itself (please refer to the green comments in conversion_original.py for the exact problem). I assign the “type” key early in the code, yet tag.attrib[“k”] can also have the value “type”, which then overwrites my earlier assignment. I decided not to discard this information but to store it under a new key, as it might be useful. I modified the code and added a function as below:
The function makes sure that tag.attrib[“k”] is stored under a slightly different name whenever it matches one of the reserved keys “id”, “type”, “visible”, “created”, “pos”, “address” or “node_refs”. This prevents the values of those predetermined attributes from being overwritten (a small sketch of the idea is given after the counts below). The results are as below:
The total number of documents:
The total number of nodes:
And the total number of ways:
Now they finally add up.
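A minimal sketch of the renaming idea described above (the reserved-key list comes from the report; the helper name and the exact renaming scheme are assumptions about the real code in conversion_revised.py):

RESERVED = {"id", "type", "visible", "created", "pos", "address", "node_refs"}

def safe_key(key):
    # Store colliding tag keys under a prefixed name such as "sub-tag_type"
    # instead of overwriting the document's own top-level keys.
    return "sub-tag_" + key if key in RESERVED else key

# Usage while building a document from an OSM <tag> element:
#   node[safe_key(tag.attrib["k"])] = tag.attrib["v"]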
Furthermore, I also created a new attribute “sub-tag_type” for that extra data. Results as below (code can be found in verification.py):
1.3 Omission of “name” attribute
In this data, “name” is one of the most common attributes among documents that carry meaningful information, yet not all such documents have a “name” attribute. Example as below:
Adding a “name” attribute to those documents makes the data more consistent and allows easier coding and analysis in the future. I worked on the most common documents, namely those that have one of the attributes “highway”, “network” or “amenity”. I used the function below for the editing (it can also be found in conversion_revised.py):
In my code, I first identify those documents and then add a “name” attribute with the same value as the “highway”, “network” or “amenity” attribute. The results are as follows:
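As a rough, illustrative sketch of that edit (the function name here is hypothetical; the real implementation is in conversion_revised.py):

def add_missing_name(node):
    # Copy the value of the first identifying attribute into "name" when "name" is absent.
    if "name" not in node:
        for key in ("highway", "network", "amenity"):
            if key in node:
                node["name"] = node[key]
                break
    return node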
1.4 Auditing “street”
Some of the documents have an “addr:street” attribute, which tells us which street the particular node is on. However, the naming is somewhat inconsistent: some names use abbreviations, for example “Drive” written as “Dr.” and “Avenue” written as “Ave.”. Below are some abbreviated street examples found in the data:
(For coding details, please refer to update_street(tag, mapping, node) function in conversion_revised.py)
To solve this, I first created a dictionary that stores the abbreviations as keys and the full words as values, then converted all abbreviations to the full words. Part of the results are as below:
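A small sketch of this mapping approach (only a few dictionary entries are shown for illustration; the full mapping and the actual update_street(tag, mapping, node) function live in conversion_revised.py and may differ in shape):

import re

mapping = {"Dr.": "Drive", "Dr": "Drive", "Ave.": "Avenue", "Ave": "Avenue",
           "Rd": "Road", "St": "Street"}
street_type_re = re.compile(r"\b\S+\.?$")

def update_street_name(name, mapping):
    # Replace a trailing abbreviated street type with its full word.
    match = street_type_re.search(name)
    if match and match.group() in mapping:
        return name[:match.start()] + mapping[match.group()]
    return name

print(update_street_name("Yishun Ave", mapping))  # -> "Yishun Avenue"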
1.5 Auditing Post Code
Some querying reveals that the postcodes are not consistent. Singapore postcodes have 6 digits, but some postcodes in the data do not have 6 digits and some contain letters. Example as below:
In the example, the postcode does not have 6 digits, and information such as “addr:city” shows that it is not even a Singapore location. Still, we cannot assume that every document without a 6-digit postcode is outside Singapore, as the postcode might be a user input mistake. We therefore also have to rely on other information, such as “addr:city”, to determine the location. We also need to consider postcodes that contain letters or whitespace. We shall first settle the letter and whitespace problem, then move on to cleaning the data.
1.5.1 Letters and whitespace in postcodes
Some postcodes contain letters or whitespace. Example as such:
From the example above, we see a Singapore location whose postcode contains the word “Singapore”. I therefore removed all non-digit characters from these values so that only digits remain (please refer to the all_digit(node) function in conversion_revised.py). Some of the resulting changes are as follows:
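The core of the fix is simply stripping non-digit characters; a minimal sketch (the real all_digit(node) function in conversion_revised.py operates on the whole document rather than a bare string):

import re

def digits_only(postcode):
    # Keep only the digits of a postcode string, e.g. "Singapore 238823" -> "238823".
    return re.sub(r"\D", "", postcode)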
1.5.2 Cleaning data
Since there are non-six-digit postcodes in the data, we have to cross-check them against other attributes to determine whether the documents really are not from Singapore. Because not all documents have the same attributes, the auditing and cleaning has to go through several stages. Below is some information regarding the documents:
Following the information above, we shall filter the documents by attribute in this order: city, street, name, then the rest of the documents.
Stage 1 - Filtering with city:
Running code that filters for postcodes whose length is not 6 and that have an “addr:city” attribute gives me a list of cities as such:
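A sketch of the Stage 1 query (collection name assumed; the length check is expressed here as a regex on the cleaned, digits-only postcode, which may differ from the actual filtering code):

import re
from pymongo import MongoClient

db = MongoClient("localhost", 27017).osm

# Documents whose postcode is not exactly six digits and that also carry "addr:city",
# grouped by city.
not_six_digits = {"$not": re.compile(r"^\d{6}$")}
pipeline = [
    {"$match": {"addr:postcode": not_six_digits, "addr:city": {"$exists": 1}}},
    {"$group": {"_id": "$addr:city", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in db.singapore.aggregate(pipeline):
    print(row)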
Apart from “Singapore”, the cities are not in Singapore, so it is safe to remove those documents. I had to dig deeper into why there are 53 documents with “Singapore” as the city value but without a 6-digit postcode. Querying these documents for their street info gives the result below:
It appears all of them are indeed located in Singapore, but the postcodes are malformed. I will therefore remove the “addr:postcode” attribute from these documents.
Stage 2 - Filtering with street:
Running code that filters for postcodes whose length is not 6, without “addr:city” but with an “addr:street” attribute, gives me a list of streets, part of which is shown below:
Apart from “Chancellor Drive”, the names are in Malay, so we can consider them non-Singapore locations. Further querying of the “Chancellor Drive” documents reveals the information below:
Since all the details are in English and relate to a university, I typed the postcode 79200 into Google Maps, and it returned a location in Johor where the University of Southampton is located [1]. Therefore, I conclude that these are not Singapore-related documents either.
Stage 3 - Filtering with name:
Running code that filters for postcodes whose length is not 6, without “addr:city” and “addr:street” but with a “name” attribute, gives me a list of names, part of which is shown below:
Apart from a few Malay names that I can recognise, there are several documents whose location I cannot verify. I will therefore examine them together with the rest of the leftover documents in the next stage.
Stage 4 - Filtering the rest of the documents:
Running code that filters the remaining documents by their “user” attribute, I get results such as below:
UTM stands for Universiti Teknologi Malaysia, so we can remove the documents whose user name starts with UTM. That leaves 3 documents to analyse in full, one by one. After choosing some attributes to present, the results are as below:
As mentioned previously, postcode 79200 points to a university town in Malaysia. Since the names of these documents also seem to come from a university background, we shall assume they are not from Singapore and remove them. As for the “buffet” document, we have no practical way to identify it, so we shall keep it in the database but remove its inconsistent postcode.
Final Stage - Cleaning (code may be found in clean_postcode(node) function in conversion_revised.py)
I therefore removed all documents that do not have a 6-digit postcode, apart from the documents retained in Stage 1 and Stage 4. To verify that the cleaning has been done, the new total number of documents should be 1,015,776 (documents before cleaning) − (5,668 (documents without a 6-digit postcode) − 54 (documents retained from Stage 1 and Stage 4)) = 1,010,162. The result is as below:
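As a quick arithmetic check of that expected figure:

# Expected document count after removing bad postcodes, keeping the Stage 1/Stage 4 exceptions.
before, bad_postcodes, retained = 1015776, 5668, 54
print(before - (bad_postcodes - retained))  # -> 1010162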
2. Data Overview
The file sizes of the OSM and JSON files are as follows:
Singapore.osm - 188 MB
Singapore.osm.json - 284 MB
Number of documents
Number of nodes
Number of ways
Some fundamental percentages related to the data
(Code may be found in data_overview.py)
Top 20 appearing amenities
(Code may be found in data_overview.py)
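A sketch of the kind of overview queries behind these figures (see data_overview.py for the actual code; database and collection names are assumptions):

from pymongo import MongoClient

db = MongoClient("localhost", 27017).osm

print(db.singapore.count_documents({}))                # number of documents
print(db.singapore.count_documents({"type": "node"}))  # number of nodes
print(db.singapore.count_documents({"type": "way"}))   # number of ways

# Top 20 amenities by document count.
top_amenities = db.singapore.aggregate([
    {"$match": {"amenity": {"$exists": 1}}},
    {"$group": {"_id": "$amenity", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 20},
])
for row in top_amenities:
    print(row)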
3. Additional Idea
It is quite disappointing to see that only around 10% of the data contains any useful information (the percentage of nodes without a name or description is 90.45%). On the other hand, a whopping 68.7% of the data that does have useful information concerns highways and roads. It seems the users who input the data are very interested in cars and highways; indeed, the most commonly listed amenity is parking. This is quite puzzling, as the total number of motor vehicles in Singapore is around 900k [2] and the total population of Singapore is around 5.5 million [3], which means only around 16% of people actually own a car. Yet around 70% of the data we get concerns highways or parking. It would be great if there were more information on businesses and public transport, as that information is far more useful to the many people who do not drive.
Furthermore, the data does not follow any consistent format, which makes analysis rather difficult. For example, documents about highways have the identifying attribute “highway”, whereas documents about subways have the identifying attribute “network”. It would be a lot more convenient to put these identifiers into a single category, such as amenity or facility type. This would make the data tidier and allow easier analysis, since every type of document could then be identified by querying a single attribute.
On the other hand, not all documents have the same attributes. For example, among the documents that have “addr:postcode”, not all have “addr:city” or “addr:street”. Such inconsistencies make analysis quite difficult. It would therefore be more efficient if users followed a certain data-input convention, for example inserting “addr:city” whenever they insert “addr:postcode”. Unfortunately, as the data depends on public user input, it is quite difficult to insist on a particular format. One way of encouraging it would be to show users a note about the appropriate format while they are entering data. Another is to make some input fields mandatory.
Still, insisting that users take certain actions might discourage them from contributing. Yet, if OSM is popular and useful enough, they may look past these restrictions. On the other hand, mandatory fields might also invite odd input, as users might simply enter garbage to get past them. To overcome this, we can ensure the fields are filled in properly with drop-down selections, or by allowing only certain characters and formats in the field.
References
1. https://www.google.com.sg/maps/place/Educity+Student+Village/@1.4303999,103.6127955,17.24z/data=!4m5!1m2!2m1!1s79200!3m1!1s0x31da0ba2bc9f9ec1:0x80b6846292a4d575
2. https://www.lta.gov.sg/content/dam/ltaweb/corp/PublicationsResearch/files/FactsandFigures/MVP01-1_MVP_by_type.pdf
3. http://www.singstat.gov.sg/statistics/latest-data#8/