
ISWC 2015 - Collecting, integrating, enriching and republishing open city data as linked data


ISWC 2015 - Linked Data Track
Collecting, integrating, enriching and republishing open city data as linked data
http://citydata.wu.ac.at/


  1. Collecting, integrating, enriching and republishing open city data as linked data
     Stefan Bischof (Siemens, WU Vienna), Christoph Martin (WU Vienna), Axel Polleres (WU Vienna), Patrik Schneider (WU Vienna, TU Vienna)
     © Siemens AG 2015. All rights reserved
  2. Which city is the best? Compare cities!
     Page 2, October 2015, Corporate Technology. © Siemens AG 2015. All rights reserved
  3. What we have: European Green City Index
  4. Use data to compare cities
     Idea: Exploit available open data on cities to compute comparable indicators.
     Use standard Semantic Web technologies for:
     • ontology-based data integration (including lightweight provenance, temporal and spatial context)
     • data refinement and enrichment (approximating missing values, resolving quality issues)
     • data publication (SPARQL, LOD, web UI)
     [Diagram labels: City Data Pipeline, Comparable city indicators]
  5. Integrated Open Data is very sparse
     [Figure: cities × indicators matrix, annotated "51% values missing" and "97% values missing"]
     But we need base indicators for all cities to compute comparable indicators.
  6. How can we fill in missing values?
     • Get more data, which makes the data even sparser
     • Use domain knowledge …
     • Try to automatically fill in values …
  7. Use domain knowledge to predict missing values
     • Eurostat: 62 equations for derived indicators (e.g., population density)
     • Unit conversions (e.g., QUDT ontology)
     • Use materialization or query rewriting for value computation [ESWC13]
     Covers only a few indicators. How can we get more domain knowledge?
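A minimal sketch of how an equation for a derived indicator and a unit conversion could fill a missing value. The record layout and the key names (`population`, `area_ha`, `area_km2`, `population_density`) are hypothetical and not taken from the City Data Pipeline; the slides mention materialization or query rewriting over ontologies, whereas this only illustrates the arithmetic.

```python
from typing import Dict, Optional


def hectares_to_km2(hectares: float) -> float:
    """Unit conversion: 100 hectares make one square kilometre."""
    return hectares / 100.0


def population_density(population: Optional[float],
                       area_km2: Optional[float]) -> Optional[float]:
    """Inhabitants per km^2, or None when an input is missing or invalid."""
    if population is None or area_km2 is None or area_km2 <= 0:
        return None
    return population / area_km2


def enrich(city: Dict[str, Optional[float]]) -> Dict[str, Optional[float]]:
    """Return a copy of the record with derivable indicators filled in."""
    out = dict(city)
    # Step 1: derive the area in km^2 from hectares if only hectares exist.
    if out.get("area_km2") is None and out.get("area_ha") is not None:
        out["area_km2"] = hectares_to_km2(out["area_ha"])
    # Step 2: apply the equation density = population / area.
    if out.get("population_density") is None:
        out["population_density"] = population_density(
            out.get("population"), out.get("area_km2"))
    return out


# Hypothetical record: the numbers are made up for illustration.
example = enrich({"population": 500_000.0, "area_ha": 25_000.0})
print(example["area_km2"], example["population_density"])
```

A value stays `None` when its inputs are missing, which matches the slide's point that equations alone cover only a few indicators.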
  8. Use machine learning to predict missing values
     Deploy and combine a portfolio of different regression methods:
     • Multiple linear regression (MLR)
     • K-nearest neighbour (KNN)
     • Random forest decision trees (RFD)
     Validation: 10-fold cross-validation
     Quality measure to pick the best method per indicator: normalized root mean square error in % (RMSE%)
     However: many, if not most, machine learning methods need more or less complete training data!
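The validation loop can be sketched roughly as follows, using only a small k-nearest-neighbour regressor written from scratch. Two things here are assumptions rather than facts from the slides: the RMSE is normalized by the range of the actual values, and the folds are formed by striding over a shuffled list. The function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (predictor values, target value)


def rmse_percent(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE divided by the range of the actual values, in percent.

    Normalizing by the range is an assumption of this sketch.
    """
    value_range = max(actual) - min(actual)
    if value_range == 0:
        raise ValueError("cannot normalize: all actual values are equal")
    mse = sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    return 100.0 * math.sqrt(mse) / value_range


def knn_predict(train: List[Row], x: Sequence[float], k: int = 3) -> float:
    """Average the targets of the k training rows closest to x."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def cross_validate(rows: List[Row],
                   predict: Callable[[List[Row], Sequence[float]], float],
                   folds: int = 10, seed: int = 0) -> float:
    """k-fold cross-validation; returns RMSE% over all held-out rows."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    actual, predicted = [], []
    for i in range(folds):
        # Every folds-th row (offset i) is held out; the rest is training data.
        test = shuffled[i::folds]
        train = [r for j, r in enumerate(shuffled) if j % folds != i]
        for x, y in test:
            actual.append(y)
            predicted.append(predict(train, x))
    return rmse_percent(actual, predicted)


# Synthetic data (made up): the target is exactly twice the single predictor.
rows = [((float(x),), 2.0 * x) for x in range(50)]
score = cross_validate(rows, knn_predict)
print(f"KNN RMSE%: {score:.2f}")
```

Running the same `cross_validate` call with other predictors (for example a linear model) and keeping the lowest score per indicator would mirror the "pick the best method" step.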
  9. Approach 1: Complete subset regression
     For each target indicator:
     • Find the top-k predictors based on the correlation matrix and form a complete subset
     • Apply all methods (MLR, KNN, RFD), compute RMSE% and select the best method
     [Figure: cities × indicators matrix feeding MLR, KNN and RFD]
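The selection step of Approach 1 might look like the sketch below. It assumes that correlations are Pearson coefficients computed over the cities where both indicators are present, and that a "complete subset" means the cities with no missing value in the target or in any chosen predictor; the slides do not spell out either detail. The indicator names and values are made up.

```python
import math
from typing import Dict, List, Optional

Column = List[Optional[float]]  # one value per city, None if missing


def pearson(xs: Column, ys: Column) -> float:
    """Pearson correlation over the cities where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    if len(pairs) < 3:
        return 0.0  # too little overlap to say anything
    mx = sum(x for x, _ in pairs) / len(pairs)
    my = sum(y for _, y in pairs) / len(pairs)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in pairs))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in pairs))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def top_k_predictors(data: Dict[str, Column], target: str, k: int) -> List[str]:
    """Indicators with the strongest absolute correlation to the target."""
    scores = {name: abs(pearson(col, data[target]))
              for name, col in data.items() if name != target}
    return sorted(scores, key=scores.get, reverse=True)[:k]


def complete_subset(data: Dict[str, Column], columns: List[str]) -> List[int]:
    """Indices of cities that have a value for every listed indicator."""
    n_cities = len(next(iter(data.values())))
    return [i for i in range(n_cities)
            if all(data[c][i] is not None for c in columns)]


# Hypothetical indicators for six cities.
data = {
    "target": [1.0, 2.0, 3.0, 4.0, None, 6.0],
    "a": [2.0, 4.0, 6.0, 8.0, 10.0, None],
    "b": [5.0, 3.0, None, 4.0, 1.0, 2.0],
    "c": [1.0, 1.0, 2.0, None, 3.0, 3.0],
}
predictors = top_k_predictors(data, "target", k=2)
training_cities = complete_subset(data, ["target"] + predictors)
print(predictors, training_cities)
```

The example also shows the trade-off noted in the conclusions: each extra predictor can shrink the set of cities usable for training.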
  10. Approach 1: How many predictors needed?
  11. Approach 2: Principal component regression
      Fill in missing values with neutral values w.r.t. PCA [Roweis'97]
      For each target indicator:
      • Find the top-k predictors among the principal components (PCs) based on the correlation matrix
      • Again: apply MLR, KNN, RFD; compute RMSE%; select the best method
      [Figure: cities × indicators matrix transformed into principal components, feeding MLR, KNN and RFD]
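A rough, simplified sketch of principal component regression on sparse data. It makes several assumptions that go beyond the slides: "neutral value" is read as the column mean (zero after standardization), only the first principal component is used instead of the top-k, the component is found by plain power iteration rather than the EM procedure of [Roweis'97], and the regression is a simple least-squares line. All function names and numbers are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Matrix = List[List[Optional[float]]]  # rows = cities, columns = indicators


def standardize_and_fill(matrix: Matrix) -> List[List[float]]:
    """Column-wise z-scores; a missing cell becomes 0.0, i.e. the column mean.

    Assumes every column has at least one present value.
    """
    columns = []
    for col in zip(*matrix):
        present = [v for v in col if v is not None]
        mean = sum(present) / len(present)
        std = math.sqrt(sum((v - mean) ** 2 for v in present) / len(present)) or 1.0
        columns.append([0.0 if v is None else (v - mean) / std for v in col])
    return [list(row) for row in zip(*columns)]


def first_component(rows: List[List[float]], iterations: int = 200) -> List[float]:
    """Leading eigenvector of X^T X via power iteration (unit length)."""
    d = len(rows[0])
    w = [1.0 / (j + 1) for j in range(d)]  # asymmetric start vector
    for _ in range(iterations):
        scores = [sum(r[j] * w[j] for j in range(d)) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm == 0.0:
            raise ValueError("start vector is orthogonal to the data")
        w = [v / norm for v in w]
    return w


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares slope and intercept for one predictor."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx


def predict_missing(indicators: Matrix,
                    target: List[Optional[float]]) -> List[float]:
    """Regress the target on the first PC score and fill its missing values."""
    filled = standardize_and_fill(indicators)
    w = first_component(filled)
    scores = [sum(v * wj for v, wj in zip(row, w)) for row in filled]
    known = [(s, t) for s, t in zip(scores, target) if t is not None]
    slope, intercept = fit_line([s for s, _ in known], [t for _, t in known])
    return [t if t is not None else slope * s + intercept
            for s, t in zip(scores, target)]


# Made-up data: three strongly related indicators, one cell missing.
indicators = [[0.0, 1.0, 0.0], [1.0, 3.0, -1.0], [2.0, None, -2.0],
              [3.0, 7.0, -3.0], [4.0, 9.0, -4.0], [5.0, 11.0, -5.0]]
target = [1.0, 3.0, 5.0, None, 9.0, 11.0]
completed = predict_missing(indicators, target)
print(round(completed[3], 2))
```

Because every city gets a score even when some of its indicators are missing, this variant can predict for more cities than the complete-subset approach, at the price of the imputation noise visible in the example.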
  12. Approach 2: How many predictors needed?
  13. Cross-dataset prediction 1/2: (How) can this be used for cross-dataset prediction?
      [Figure: cities × indicators matrix]
  14. Cross-dataset prediction 1/2: (How) can this be used for cross-dataset prediction?
      Con:
      • Not great: average RMSE for both directions is over 10%
      • Could transfer a "bias" from one dataset's context to the other
      Pro:
      • For some indicators it works quite well
  15. Cross-dataset prediction 2/2: Pairwise linear regression can be used to "learn ontology mappings" from values
      • Compare the values of each Eurostat indicator with each UN indicator
      • Find linear dependencies between pairs (equations) or equal pairs (equivalent properties)
      • Robust linear regression is necessary to handle outliers
      [Figure: cities × indicators matrix]
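One way to sketch the pairwise comparison is with the Theil-Sen estimator, a simple robust regression that takes the median of all pairwise slopes. The slides only say "robust linear regression", so the choice of Theil-Sen, the tolerance `tol`, and the labels returned by the hypothetical `classify_pair` function are assumptions of this example.

```python
from itertools import combinations
from statistics import median
from typing import Sequence, Tuple


def theil_sen(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Robust line fit: median pairwise slope, then median intercept.

    Needs at least two distinct x values.
    """
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
              if x2 != x1]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return slope, intercept


def classify_pair(xs: Sequence[float], ys: Sequence[float],
                  tol: float = 0.05) -> Tuple[str, float, float]:
    """Label an indicator pair as 'equivalent', 'linear' or 'unrelated'."""
    slope, intercept = theil_sen(xs, ys)
    residuals = [abs(y - (slope * x + intercept)) for x, y in zip(xs, ys)]
    scale = median(abs(y) for y in ys) or 1.0
    if median(residuals) > tol * scale:
        return "unrelated", slope, intercept
    if abs(slope - 1.0) <= tol and abs(intercept) <= tol * scale:
        return "equivalent", slope, intercept  # candidate equivalent property
    return "linear", slope, intercept          # candidate equation y = a*x + b


# Made-up values for the same six cities in two datasets.
# Pair 1: same quantity reported in thousands vs. units, with one outlier.
in_thousands = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
in_units = [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 99999.0]
# Pair 2: identical values except for one outlier.
left = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
right = [10.0, 20.0, 30.0, 40.0, 50.0, 500.0]

print(classify_pair(in_thousands, in_units))
print(classify_pair(left, right))
```

In both pairs an ordinary least-squares fit would be pulled toward the outlier, while the median-based fit recovers the factor of 1000 and the identity mapping.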
  16. Conclusion on various things we tried:
      Approach 1: complete subsets
      • good results, 0.25 RMSE%
      • covers only a few cities/indicators
      Approach 2: principal component regression
      • predicts more missing values
      • quality is not always good
      Cross-dataset prediction in general:
      • interesting ("highest gain")
      • bad error rates with the methods tested so far
      Ontology learning from instance data:
      • several "conjectured" relationships derivable
      • needs datasets with overlapping cities (usable to "confirm/reject" manual mappings)
      [Figure label: US Census ?]
  17. Now what's Semantic Web/Linked Data here?
      City Data Pipeline
      Dataset available openly! http://citydata.wu.ac.at/
      • Data accessible as Linked Open Data, via SPARQL endpoint, and web UI
      • Original values including data source for each value
      • Predicted values including error estimates
  18. Current and future work
      Ongoing
      • Encode data in RDF Data Cube vocabulary and provenance in PROV
      • Use other methods, e.g., SVM or robust linear regression for PCR
      Future work
      • Some form of time-series analysis
      • Add more data sources (Carbon Disclosure Project, QuerioCity)
      • Integrate GIS data sources (OSM, Linked Geo Data)
      Last, but not least, our assumption/driver: predictions get better the more open data we integrate.
