OpenStreetMap Project
Data Wrangling with MongoDB
By Edmond Chin Juin Fung
Map Area: Singapore
Content:
1. Problems Encountered in the map
1.1 Nodes that are not relevant to Singapore
1.1.1 is_in:country
1.1.2 is_in
1.2 Problems related to the subtags of the nodes
1.3 Omission of “name” attribute
1.4 Auditing “street”
1.5 Auditing Post Code
1.5.1 Letters and whitespace in postcodes
1.5.2 Cleaning data
- Stage 1 – Filtering with city
- Stage 2 – Filtering with street
- Stage 3 – Filtering with name
- Stage 4 – Filtering the rest of the documents
- Final stage – Cleaning
2. Data Overview
3. Additional Idea
References
1. Problems Encountered in the Map
After converting the Singapore.osm file into a JSON file and importing it into MongoDB (refer to conversion_original.py), I ran some queries in the mongo shell to analyse the data and noticed multiple problems, which are discussed below.
1.1 Nodes that are not relevant to Singapore
1.1.1 is_in:country
Basic querying revealed that some of the nodes have an “is_in:country” attribute. Using a query such as find({"is_in:country": {"$exists": 1}}) together with .count(), I printed the list of nodes with this attribute and their total number. Many of them are not in Singapore but in neighbouring countries such as Malaysia and Indonesia. Example as below:
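Expressed in pymongo rather than the mongo shell, the query above might look like this minimal sketch (the database and collection names “osm” and “singapore” are assumptions, not the actual names used in the project):

from pymongo import MongoClient

# Connect to the local MongoDB instance holding the converted OSM data.
db = MongoClient("localhost", 27017).osm

# Count and list the documents that carry an "is_in:country" tag.
query = {"is_in:country": {"$exists": 1}}
print(db.singapore.count_documents(query))
for doc in db.singapore.find(query):
    print(doc.get("is_in:country"), doc.get("name"))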
From the example above, the place is stated to be in the state of Johor, Malaysia. The name “Layang-Layang” is definitely a place in Malaysia, so this document should not belong in the Singapore data.
Unfortunately, we cannot assume that every document with a non-Singapore value for “is_in:country” is actually located outside Singapore, since the value may simply be a user input mistake. An input mistake is less likely, however, if the user entered two non-Singapore values into two different attributes. In the example above, “is_in:country” has the value “Malaysia” and “is_in:state” has the value “Johor”; both are non-Singapore values, so the document is less likely to be an input mistake and more likely to be genuine non-Singapore data.
(The Python code used for the following paragraphs can be found in verification.py in the zip file.)
Using this argument, I first found the total number of documents that have “is_in:country” and its unique values. Results as below:
I then found the unique values of “is_in:state” among the documents whose “is_in:country” value is not “Singapore”. The results are as follows:
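The query behind this step, as a pymongo sketch (same assumed database and collection names as before), groups the non-Singapore documents by their “is_in:state” value:

from pymongo import MongoClient

db = MongoClient("localhost", 27017).osm

# Distinct "is_in:state" values among documents whose "is_in:country" is not "Singapore".
pipeline = [
    {"$match": {"is_in:country": {"$exists": 1, "$ne": "Singapore"}}},
    {"$group": {"_id": "$is_in:state", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in db.singapore.aggregate(pipeline):
    print(row)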
All 6 of them are non-Singapore values, so those documents are suitable for removal. Since the next most common attribute shared among the remaining documents is “name”, I had to verify them one by one. Fortunately, only 39 documents were left to go through, and they are not difficult to verify, since Singapore locations are named in English whereas Malaysian and Indonesian locations are named in Malay. I ran some code and part of the results are as follows:
All of their names are in Malay, so they are non-Singapore locations and it is safe to remove them from the data.
From this analysis, it is now safe to conclude that we can eliminate all documents that have a non-Singapore value for the “is_in:country” attribute. I went back to the OSM file and used Python code (the cleanup_is_in function in conversion_revised.py) to filter out these documents before converting to JSON and importing into MongoDB. The results are as below:
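The real filter is the cleanup_is_in function in conversion_revised.py; a simplified sketch of the idea (helper names are hypothetical and the element handling is reduced to the essentials) could look like this:

import xml.etree.ElementTree as ET

def is_outside_singapore(element):
    # True when the element carries an "is_in:country" tag with a non-Singapore value.
    for tag in element.iter("tag"):
        if tag.attrib.get("k") == "is_in:country" and tag.attrib.get("v") != "Singapore":
            return True
    return False

def kept_elements(osm_file):
    # Stream through the OSM file and yield only the nodes and ways worth keeping.
    for _, element in ET.iterparse(osm_file):
        if element.tag in ("node", "way") and not is_outside_singapore(element):
            yield element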
1.1.2 is_in
(Additional cleaning; code can be found in verification.py)
After cleaning the data, further querying revealed nodes that do not have an “is_in:country” attribute but have an “is_in” attribute that also points to locations outside Singapore. These nodes were missed by the previous analysis, which was based on the “is_in:country” attribute. Example below:
From the example, we see an “is_in” attribute that points to a non-Singapore location without any “is_in:country” attribute. I then ran a query on the data. Result as below:
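A pymongo sketch of this query (collection name assumed as before): list the distinct “is_in” values among documents that lack “is_in:country”.

from pymongo import MongoClient

db = MongoClient("localhost", 27017).osm

# Distinct "is_in" values for documents with "is_in" but without "is_in:country".
values = db.singapore.distinct(
    "is_in", {"is_in": {"$exists": 1}, "is_in:country": {"$exists": 0}}
)
print(len(values))
for value in values:
    print(value)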
From the result above, apart from 3 values (“Singapore, , Singapore”, “Singapore”, “Sentosa”), the rest do not belong to Singapore. As in the previous analysis, to make sure they are not input mistakes, I ran the query again together with the “name” attribute. If the “name” value is in Malay, we can assume the document does not belong in Singapore. Part of the results are as below:
I identified the names as Malay, so it is safe to remove those documents. The remaining documents are those that do not have a “name”. Example as below:
From the example above, there is no additional data that allows me to determine whether the node is genuinely outside Singapore or simply a user input mistake. Therefore, I removed these documents as well (please refer to conversion_revised.py for the identification and removal functions). In the same code I also replaced the “is_in” value “Singapore,,Singapore” with “Singapore”. The results are as below:
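In the report these edits happen while re-converting the OSM file; an equivalent in-database sketch (collection name and the exact string variants are assumptions) would be:

from pymongo import MongoClient

db = MongoClient("localhost", 27017).osm

# Normalise the odd "Singapore,,Singapore" variants to plain "Singapore".
db.singapore.update_many(
    {"is_in": {"$in": ["Singapore,,Singapore", "Singapore, , Singapore"]}},
    {"$set": {"is_in": "Singapore"}},
)

# Drop documents whose "is_in" points outside Singapore and that have no "is_in:country".
keep = ["Singapore", "Sentosa"]
db.singapore.delete_many(
    {"is_in": {"$exists": 1, "$nin": keep}, "is_in:country": {"$exists": 0}}
)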
1.2 Problems related to the subtags of the nodes
While querying the number of nodes and ways, I found a discrepancy between the total number of documents and the sum of the numbers of nodes and ways.
The total number of documents:
The total number of nodes:
The total number of ways:
As you can see above, the values do not add up: there are 42 documents that are neither node nor way. This is odd, because during the conversion from OSM to JSON I specifically programmed it so that only elements of type “node” and “way” would be written to the JSON file. So what are these 42 unknowns? Some querying reveals that the type attribute has been overwritten by other information (refer to the diagram below).
The problem lies in the way I wrote my Python code combined with the OSM data itself (please refer to the green comments in conversion_original.py for the exact problem). I assign the “type” key early in the code, yet tag.attrib[“k”] can also have the value “type”, which then overwrites my earlier assignment. I decided not to discard this information but to store it under a new key, as it might be useful. I modified the code and added a function as below:
The function makes sure that tag.attrib[“k”] is stored under a slightly different name whenever it matches one of the reserved keys “id”, “type”, “visible”, “created”, “pos”, “address” or “node_refs”. This prevents the values of those predetermined attributes from being overwritten (a small sketch of the idea is given after the counts below). The results are as below:
The total number of documents:
The total number of nodes:
And the total number of ways:
Now they finally add up.
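A minimal sketch of the renaming idea described above (the reserved-key list comes from the report; the helper name and the exact renaming scheme are assumptions about the real code in conversion_revised.py):

RESERVED = {"id", "type", "visible", "created", "pos", "address", "node_refs"}

def safe_key(key):
    # Store colliding tag keys under a prefixed name such as "sub-tag_type"
    # instead of overwriting the document's own top-level keys.
    return "sub-tag_" + key if key in RESERVED else key

# Usage while building a document from an OSM <tag> element:
#   node[safe_key(tag.attrib["k"])] = tag.attrib["v"]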
Furthermore, I also created a new attribute “sub-tag_type” for that extra data. Results as below (code can be found in verification.py):
1.3 Omission of “name” attribute
In this data, “name” is one of the most common attributes among documents that carry meaningful information, yet not all such documents have a “name” attribute. Example as below:
Adding a “name” attribute to those documents makes the data more consistent and allows easier coding and analysis in the future. I worked on the most common documents, namely those that have one of the attributes “highway”, “network” or “amenity”. I used the function below for the editing (it can also be found in conversion_revised.py):
In my code, I first identify those documents and then add a “name” attribute with the same value as the “highway”, “network” or “amenity” attribute. The results are as follows:
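As a rough, illustrative sketch of that edit (the function name here is hypothetical; the real implementation is in conversion_revised.py):

def add_missing_name(node):
    # Copy the value of the first identifying attribute into "name" when "name" is absent.
    if "name" not in node:
        for key in ("highway", "network", "amenity"):
            if key in node:
                node["name"] = node[key]
                break
    return node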
1.4 Auditing “street”
Some of the documents have an “addr:street” attribute, which tells us which street the particular node is on. However, the naming is somewhat inconsistent: some names use abbreviations, for example “Drive” written as “Dr.” and “Avenue” written as “Ave.”. Below are some abbreviated street examples found in the data:
(For coding details, please refer to update_street(tag, mapping, node) function in conversion_revised.py)
To solve this, I first created a dictionary that stores the abbreviations as keys and the full words as values, then converted all abbreviations to the full words. Part of the results are as below:
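A small sketch of this mapping approach (only a few dictionary entries are shown for illustration; the full mapping and the actual update_street(tag, mapping, node) function live in conversion_revised.py and may differ in shape):

import re

mapping = {"Dr.": "Drive", "Dr": "Drive", "Ave.": "Avenue", "Ave": "Avenue",
           "Rd": "Road", "St": "Street"}
street_type_re = re.compile(r"\b\S+\.?$")

def update_street_name(name, mapping):
    # Replace a trailing abbreviated street type with its full word.
    match = street_type_re.search(name)
    if match and match.group() in mapping:
        return name[:match.start()] + mapping[match.group()]
    return name

print(update_street_name("Yishun Ave", mapping))  # -> "Yishun Avenue"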
1.5 Auditing Post Code
Some querying reveals that the postcodes are not consistent. Singapore postcodes have 6 digits, but some postcodes in the data do not have 6 digits and some contain letters. Example as below:
In the example, the postcode does not have 6 digits, and information such as “addr:city” shows that it is not even a Singapore location. Still, we cannot assume that every document without a 6-digit postcode is outside Singapore, as the postcode might be a user input mistake. We therefore also have to rely on other information, such as “addr:city”, to determine the location. We also need to consider postcodes that contain letters or whitespace. We shall first settle the letter and whitespace problem, then move on to cleaning the data.
1.5.1 Letters and whitespace in postcodes
Some postcodes contain letters or whitespace. Example as such:
From the example above, we see a Singapore location whose postcode contains the word “Singapore”. I therefore removed all non-digit characters from these values so that only digits remain (please refer to the all_digit(node) function in conversion_revised.py). Some of the resulting changes are as follows:
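The core of the fix is simply stripping non-digit characters; a minimal sketch (the real all_digit(node) function in conversion_revised.py operates on the whole document rather than a bare string):

import re

def digits_only(postcode):
    # Keep only the digits of a postcode string, e.g. "Singapore 238823" -> "238823".
    return re.sub(r"\D", "", postcode)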
1.5.2 Cleaning data
Since there are non-six-digit postcodes in the data, we have to cross-check them against other attributes to determine whether the documents really are not from Singapore. Because not all documents have the same attributes, the auditing and cleaning has to go through several stages. Below is some information regarding the documents:
Following the information above, we shall filter the documents by attribute in this order: city, street, name, then the rest of the documents.
Stage 1 - Filtering with city:
Running code that filters for postcodes whose length is not 6 and that have an “addr:city” attribute gives me a list of cities as such:
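A sketch of the Stage 1 query (collection name assumed; the length check is expressed here as a regex on the cleaned, digits-only postcode, which may differ from the actual filtering code):

import re
from pymongo import MongoClient

db = MongoClient("localhost", 27017).osm

# Documents whose postcode is not exactly six digits and that also carry "addr:city",
# grouped by city.
not_six_digits = {"$not": re.compile(r"^\d{6}$")}
pipeline = [
    {"$match": {"addr:postcode": not_six_digits, "addr:city": {"$exists": 1}}},
    {"$group": {"_id": "$addr:city", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
]
for row in db.singapore.aggregate(pipeline):
    print(row)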
Apart from “Singapore”, the cities are not in Singapore, so it is safe to remove those documents. I had to dig deeper into why there are 53 documents with “Singapore” as the city value but without a 6-digit postcode. Querying these documents for their street info gives the result below:
It appears all of them are indeed located in Singapore, but the postcodes are malformed. I will therefore remove the “addr:postcode” attribute from these documents.
Stage 2 - Filtering with street:
Running code that filters for postcodes whose length is not 6, without “addr:city” but with an “addr:street” attribute, gives me a list of streets, part of which is shown below:
Apart from “Chancellor Drive”, the names are in Malay, so we can consider them non-Singapore locations. Further querying of the “Chancellor Drive” documents reveals the information below:
Since all the details are in English and relate to a university, I typed the postcode 79200 into Google Maps, and it returned a location in Johor where the University of Southampton is located [1]. Therefore, I conclude that these are not Singapore-related documents either.
Stage 3 - Filtering with name:
Running code that filters for postcodes whose length is not 6, without “addr:city” and “addr:street” but with a “name” attribute, gives me a list of names, part of which is shown below:
Apart from a few Malay names that I can recognise, there are several documents whose location I cannot verify. I will therefore examine them together with the rest of the leftover documents in the next stage.
Stage 4 - Filtering the rest of the documents:
Running code that filters the remaining documents by their “user” attribute, I get results such as below:
UTM stands for Universiti Teknologi Malaysia, so we can remove the documents whose user name starts with UTM. That leaves 3 documents to analyse in full, one by one. After choosing some attributes to present, the results are as below:
As mentioned previously, postcode 79200 points to a university town in Malaysia. Since the names of these documents also seem to come from a university background, we shall assume they are not from Singapore and remove them. As for the “buffet” document, we have no practical way to identify it, so we shall keep it in the database but remove its inconsistent postcode.
Final Stage - Cleaning (code may be found in clean_postcode(node) function in conversion_revised.py)
I therefore removed all documents that do not have a 6-digit postcode, apart from the documents retained in Stage 1 and Stage 4. To verify that the cleaning has been done, the new total number of documents should be 1,015,776 (documents before cleaning) − (5,668 (documents without a 6-digit postcode) − 54 (documents retained from Stage 1 and Stage 4)) = 1,010,162. The result is as below:
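As a quick arithmetic check of that expected figure:

# Expected document count after removing bad postcodes, keeping the Stage 1/Stage 4 exceptions.
before, bad_postcodes, retained = 1015776, 5668, 54
print(before - (bad_postcodes - retained))  # -> 1010162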
2. Data Overview
The file sizes of the OSM and JSON files are as follows:
Singapore.osm - 188 MB
Singapore.osm.json - 284 MB
Number of documents
Number of nodes
Number of ways
Some fundamental percentages related to the data
(Code may be found in data_overview.py)
Top 20 appearing amenities
(Code may be found in data_overview.py)
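A sketch of the kind of overview queries behind these figures (see data_overview.py for the actual code; database and collection names are assumptions):

from pymongo import MongoClient

db = MongoClient("localhost", 27017).osm

print(db.singapore.count_documents({}))                # number of documents
print(db.singapore.count_documents({"type": "node"}))  # number of nodes
print(db.singapore.count_documents({"type": "way"}))   # number of ways

# Top 20 amenities by document count.
top_amenities = db.singapore.aggregate([
    {"$match": {"amenity": {"$exists": 1}}},
    {"$group": {"_id": "$amenity", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 20},
])
for row in top_amenities:
    print(row)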
3. Additional Idea
It is quite disappointing to see that only around 10% of the data contains any useful information (the percentage of nodes without a name or description is 90.45%). On the other hand, a whopping 68.7% of the data that does have useful information concerns highways and roads. It seems the users who input the data are very interested in cars and highways; indeed, the most commonly listed amenity is parking. This is quite puzzling, as the total number of motor vehicles in Singapore is around 900k [2] and the total population of Singapore is around 5.5 million [3], which means only around 16% of people actually own a car. Yet around 70% of the data we get concerns highways or parking. It would be great if there were more information on businesses and public transport, as that information is far more useful to the many people who do not drive.
Furthermore, the data does not follow any consistent format, which makes analysis rather difficult. For example, documents about highways have the identifying attribute “highway”, whereas documents about subways have the identifying attribute “network”. It would be a lot more convenient to put these identifiers into a single category, such as amenity or facility type. This would make the data tidier and allow easier analysis, since every type of document could then be identified by querying a single attribute.
On the other hand, not all documents have the same attributes. For example, among the documents that have “addr:postcode”, not all have “addr:city” or “addr:street”. Such inconsistencies make analysis quite difficult. It would therefore be more efficient if users followed a certain data-input convention, for example inserting “addr:city” whenever they insert “addr:postcode”. Unfortunately, as the data depends on public user input, it is quite difficult to insist on a particular format. One way of encouraging it would be to show users a note about the appropriate format while they are entering data. Another is to make some input fields mandatory.
Still, insisting that users take certain actions might discourage them from contributing. Yet, if OSM is popular and useful enough, they may look past these restrictions. On the other hand, mandatory fields might also invite odd input, as users might simply enter garbage to get past them. To overcome this, we can ensure the fields are filled in properly with drop-down selections, or by allowing only certain characters and formats in the field.
References
1. https://www.google.com.sg/maps/place/Educity+Student+Village/@1.4303999,103.6127955,17.24z/data=!4m5!1m2!2m1!1s79200!3m1!1s0x31da0ba2bc9f9ec1:0x80b6846292a4d575
2. https://www.lta.gov.sg/content/dam/ltaweb/corp/PublicationsResearch/files/FactsandFigures/MVP01-1_MVP_by_type.pdf
3. http://www.singstat.gov.sg/statistics/latest-data#8/