Thank you for the opportunity to present our work today. This is joint work with Christoph Martin, Axel Polleres, and Patrik Schneider from WU Vienna.
At Siemens we have a practical use case, which is to compare cities. The way we usually do that is by collecting data, computing metrics, and then ranking the cities.
Our example, the Green City Index, ranks cities according to their greenness or sustainability. The ranking is computed from several quantitative and qualitative indicators, such as waste or CO2 emissions per capita. Creating studies like this involves a lot of tedious manual research for the underlying data, and moreover the data of such a report is outdated as soon as it is printed. Our assumption is that the quantitative indicators can be found in open data.
If we could gather this open data automatically, we could compile reports and studies like this in a more up-to-date and dynamic fashion. For that we built the City Data Pipeline, a system to compute these comparable indicators from open data sources. The system is organized in three parts. Data integration crawls different open data sources such as Eurostat, DBpedia, or UNdata; this also involves an ontology for data integration, which is extended by attribute equations to specify relations between numerical attributes. In data refinement and enrichment we clean the data and try to fill in the large number of missing values. In this talk I will focus on different techniques for approximating missing values; I will come to that in a second. Eventually we publish all the data via a SPARQL endpoint, as Linked Open Data, and in a searchable web user interface.
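To make the three stages concrete, here is a minimal sketch of the pipeline's shape. All names, and the per-source fetch() wrapper, are illustrative assumptions, not the actual implementation:

```python
# Minimal sketch of the three pipeline stages; all names are illustrative.

def integrate(sources):
    """Crawl and combine records from open data sources (Eurostat, DBpedia, UNdata, ...)."""
    records = []
    for source in sources:
        records.extend(source.fetch())  # fetch() is a hypothetical per-source wrapper
    return records

def refine_and_enrich(records):
    """Clean the raw records; enrichment then fills in missing values (see below)."""
    return [r for r in records if r.get("city") is not None]

def publish(records):
    """Expose the result, e.g. via a SPARQL endpoint, Linked Data, and a web UI."""
    for record in records:
        print(record)
```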
To show how severe this problem of missing values is, let us look at two of the datasets we integrated. We organize our data as a big matrix where the rows are cities and the columns are indicators (after some cleanup we have 207 indicators and 4,438 city/year combinations in the total dataset). We find that Eurostat has 51% missing values, while UNdata even has 97% missing values. That is because, while some indicators like population are available for every city, many other indicators, such as the length of the transport network, are only available for very few cities. This is the usual power-law distribution you will find in many real-world datasets. But note that missing values within a single dataset are not the only problem: particularly by combining different datasets covering different indicators and cities, we get huge empty areas in the matrix. The situation gets even worse when we integrate data from different years, as you can imagine.
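As a toy illustration of this matrix view (with synthetic numbers, not the real integrated dataset), the overall missing-value ratio can be inspected like this:

```python
# Toy city-by-indicator matrix with roughly 90% of the cells knocked out.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
matrix = pd.DataFrame(rng.random((4438, 207)))        # rows: city/year pairs, columns: indicators
matrix = matrix.mask(rng.random(matrix.shape) < 0.9)  # simulate missing cells

missing_ratio = matrix.isna().mean().mean()
print(f"missing values: {missing_ratio:.0%}")
```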
So the main question we address in this work is: how can we fill in these missing values? One possibility is to find more data from other sources, but that means we will get even more missing values. Another possibility is to use some kind of domain knowledge to compute indicators based on other indicators. A third possibility is to fill in the values automatically.
For example, Eurostat provides 62 equations for the Urban Audit dataset, which define the set of derived indicators. We could also use unit conversions given, for example, by QUDT. Both of these could be computed by attribute equations, which we presented two years ago at ESWC. The problem is that this covers only very few indicators, and often indicators are not computable but must be measured somehow.
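As a sketch of the attribute-equation idea, assume a simple illustrative equation relating population, area, and population density; whichever attribute is missing can be solved for symbolically:

```python
# Hedged sketch of an attribute equation; the equation itself is illustrative.
import sympy as sp

population, area, density = sp.symbols("population area density", positive=True)
equation = sp.Eq(density, population / area)   # e.g. population density

known = {population: 1_800_000, area: 415.0}   # density is the missing attribute
solution = sp.solve(equation.subs(known), density)
print(solution)   # [4337.34939759036]
```

Unit conversions work the same way: a conversion factor between, say, km and miles is just another linear attribute equation.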
The third possibility is predicting the missing values by some automatic process, namely machine learning, where we tested three different standard methods. And we realized that not a single one of these methods works equally well on all indicators. For validation we apply stratified 10-fold cross-validation, and as quality measure we use the normalized root mean square error in percent. The problem now is that most machine learning methods again need complete training data. We tested two different approaches to build the training data sets from our incomplete integrated dataset.
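A minimal sketch of this validation setup on synthetic data; for simplicity it uses plain (not stratified) 10-fold splits, and normalizes the RMSE by the value range, which is one common convention:

```python
# 10-fold cross-validation with a normalized RMSE in percent (synthetic data).
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=5.0, random_state=0)

def nrmse_percent(estimator, X, y):
    pred = estimator.predict(X)
    rmse = np.sqrt(np.mean((pred - y) ** 2))
    return 100.0 * rmse / (y.max() - y.min())   # normalize by the value range

scores = cross_val_score(KNeighborsRegressor(), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0),
                         scoring=nrmse_percent)
print(f"NRMSE: {scores.mean():.1f}%")
```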
In the first approach we try to find complete subsets of relevant indicators. We can apply this only on submatrices where we have values for all k indicators, which of course means that for higher k we have fewer cities to train the model. We apply all three methods and select the best method for each indicator.
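A sketch of this complete-subset selection, assuming a pandas DataFrame with cities as rows and indicators as columns; the three methods and the per-indicator selection mirror the description above, while the helper itself is illustrative:

```python
# Pick the best of three standard regressors on the complete submatrix only.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

def best_method(frame: pd.DataFrame, predictors, target):
    subset = frame[predictors + [target]].dropna()   # rows with all k predictors present
    X, y = subset[predictors].values, subset[target].values
    methods = {"knn": KNeighborsRegressor(),
               "lm": LinearRegression(),
               "rf": RandomForestRegressor(random_state=0)}
    scores = {name: -cross_val_score(m, X, y, cv=10,
                                     scoring="neg_root_mean_squared_error").mean()
              for name, m in methods.items()}
    return min(scores, key=scores.get)   # method with the lowest RMSE for this target
```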
Now let us see how the different methods perform in this setting with an increasing number of predictors. First we see that random forest actually gets worse with an increasing number of predictors because, as mentioned before, the training set gets very small and thus we get overfitting. KNN is usually the best method in this setting, as the red line for the best method is nearly always on top of the line for KNN. The good news is that by selecting the best method for each indicator, we arrive at a very low RMSE, less than 1%. The problem of this approach is that we can predict only very few indicators.
The idea for predicting more indicators is to generate new features with principal component analysis. In regularized iterative PCA, introduced by Roweis in '97, the missing values are first filled in with values that are neutral with respect to the following PCA. This approach allows us to fill in all the values, not only a small subset as before. We then use the principal components as training data for all three methods and again select the best method per indicator.
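A minimal sketch of the iterative PCA imputation idea (without the regularization of the actual method): fill missing cells with neutral values, here column means, then alternate PCA fitting and reconstruction:

```python
# EM-style PCA imputation: iteratively refine the missing cells.
import numpy as np
from sklearn.decomposition import PCA

def iterative_pca_impute(X, n_components=5, n_iter=50):
    X = X.copy()
    missing = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.where(missing)[1])  # neutral starting values
    for _ in range(n_iter):
        pca = PCA(n_components=n_components)
        scores = pca.fit_transform(X)
        reconstruction = pca.inverse_transform(scores)
        X[missing] = reconstruction[missing]               # refresh missing cells only
    return X
```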
First we see that the error measures are worse than for the first approach based on complete subsets. But, as I said before, the advantage of this approach is that we can predict all indicators. We can also see that the best method for up to 20 predictors is again KNN, but for a higher number of predictors the linear regression model is selected as best method for more and more indicators. So combining different methods for different indicators pays off. For 80 predictors we end up with an average error rate of nearly 3%.
Note that so far we have treated the whole thing as a single integrated dataset. Now let us focus specifically on these white areas. The question is: can we predict the Eurostat indicators for cities that are not in Europe (the green arrow), or the UN indicators for cities which are only in Eurostat (the blue arrow)? We again use approach 2 to fill in the missing values of one dataset and use that as training data for the indicators of the other dataset.
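A sketch of this cross-dataset setup with illustrative variable names: the imputed matrix of one source serves as features for an indicator of the other source, trained on the cities both sources share:

```python
# Train on the overlap of two sources, then predict across the gap.
def cross_dataset_model(eurostat_imputed, undata, indicator, regressor):
    """eurostat_imputed, undata: pandas DataFrames indexed by city."""
    common = eurostat_imputed.index.intersection(undata[indicator].dropna().index)
    X = eurostat_imputed.loc[common].values
    y = undata.loc[common, indicator].values
    return regressor.fit(X, y)  # can now predict the UN indicator for Eurostat-only cities
```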
Let us have a look at the error measures. On the left we see the performance of the models predicting Eurostat data from UN data. Linear regression gets worse with more predictors. The best method stays at about the same level, around 14%, even with an increasing number of predictors. On the right we see the other direction: predicting UN data from Eurostat data. Here the error actually gets worse with an increasing number of indicators. This is probably due to the bias introduced because Eurostat contains only European cities, while UNdata contains cities worldwide.
Another possibility for this cross-dataset prediction is learning ontology axioms from the instance data. We compare the values of each Eurostat indicator with each UN indicator using robust linear regression. This gives us a few equivalent-property axioms, as well as linear dependencies between indicators from different sources. These linear dependencies are again represented as attribute equations and can automatically bridge indicators given in different units.
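A hedged sketch of this axiom-learning step, using Huber regression as a stand-in robust regressor and invented thresholds: a near-perfect fit with slope 1 and zero intercept suggests an equivalent-property axiom, while any other strong linear fit yields an attribute equation (for instance, a unit conversion):

```python
# Fit a robust linear model between one indicator pair; return a candidate axiom.
from sklearn.linear_model import HuberRegressor

def linear_dependency(x, y, r2_threshold=0.99):
    model = HuberRegressor().fit(x.reshape(-1, 1), y)
    r2 = model.score(x.reshape(-1, 1), y)
    if r2 < r2_threshold:
        return None                               # no usable linear dependency
    slope, intercept = model.coef_[0], model.intercept_
    kind = ("equivalent" if abs(slope - 1) < 0.01 and abs(intercept) < 1e-6
            else "linear")                        # thresholds are illustrative
    return kind, slope, intercept
```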
Now let us compare the different approaches we tried. The first approach, complete-subset regression, gives very good results, but is limited to only a few cities and indicators. The second approach predicts nearly all missing values, but the quality is not as good. Cross-dataset prediction offers a high gain by giving mappings between data sources, but has the worst error rates so far. With ontology learning from instance data we also get mappings between data sources, as well as linear dependencies expressible in the ontology as attribute equations.
The dataset is openly available at citydata.wu.ac.at. The data is provided as Linked Open Data, via SPARQL, and in a searchable web UI. We have all the original values with provenance information, and all the predicted values together with error estimates.
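For programmatic access, a query against the endpoint might look as follows; the endpoint path and the generic class-counting query are assumptions, so see citydata.wu.ac.at for the actual access details and vocabulary:

```python
# Query the SPARQL endpoint for the most common classes (endpoint path assumed).
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://citydata.wu.ac.at/ns/sparql")  # assumed path
sparql.setQuery("""
    SELECT ?class (COUNT(?s) AS ?n)
    WHERE { ?s a ?class }
    GROUP BY ?class
    ORDER BY DESC(?n)
    LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["class"]["value"], row["n"]["value"])
```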
ISWC 2015 - Collecting, Integrating, Enriching and Republishing Open City Data as Linked Data