
RPI Research in Linked Open Government Systems

Open Government systems are changing the way that many governments around the world interact with their citizenry. Transparency and innovation are both enhanced by government openness and especially by government data sharing. Further, the linked data approach, using maturing semantic web technologies, has been shown to be very valuable in creating "mashups" in which government datasets can be combined in new and innovative ways, and turned into "live" infographics. In this talk, presented to the computer science department at PUC-RIO, we describe the research aspects of RPI's ongoing work in this area.

Published in: Technology, Education

  1. 1. Linked Open Government Data
       http://logd.tw.rpi.edu
       Jim Hendler
       Tetherless World Professor of Computer and Cognitive Science
       Assistant Dean of Information Technology and Web Science
       Rensselaer Polytechnic Institute
       http://www.cs.rpi.edu/~hendler
       @jahendler (Twitter)
  2. 2. Demo of our site http://logd.tw.rpi.edu
  3. 3. Government Data on the Web
  4. 4. Data.gov community: International
  5. 5. Government Data Sharing
       “Openness will strengthen our democracy and promote efficiency and effectiveness in Government.” (President Obama)
       [Timeline figure, 2009 to 2010; labels listed in extraction order]
       • Dates: January 1, 2009; May 21, 2009; January 19, 2010; May 21, 2010; June 30, 2009; December 8, 2009
       • Events: Putting govt data online; Data.gov.uk beta; data.gov.uk online; data.gov online; data.gov relaunch with semantic web featured; “Open Government Directive” released
       • Data set counts: 57; ~6000; ~2000; >305,000
  6. 6. New ways to see data sets David McCandless
  7. 7. Important to the citizens: e.g., education
  8. 8. What’s promising
       • Linked open government data (data.gov, data.gov.uk)
           • Of many kinds
       • Markup languages and semantics and tools to enable transparency
       • Lower barriers to internet visualization, e.g. Google vis, MIT simile, many more…
       • Web 2.0 to put people in the loop and use and contribute to annotations
  9. 9. Moving data.gov to linked data (UK)
       • Built around “linked data” from the start
       • Authorization for this from the Prime Minister
  10. 10. Moving data.gov to linked data (US)
        • Third parties (like RPI) translate the government datasets into linked data formats
        • US Data.gov hosts 6.4B RDF triples (5/21/2010)
            • Acknowledges Semantic Web as a key technology for open government data
  11. 11. Linked Open Data goes beyond govt
        http://linkeddata.org/
        Government data is currently over ½ the cloud in size (~17B triples), with tens of thousands of links to other data (within and without)
  12. 12. Create Mashups
        More than 50 of these at http://logd.tw.rpi.edu
  13. 13. Data.gov + epa.gov
  14. 15. Adding some Web magic Web Analytics Social Data Networks External Links
  15. 16. Linking GDP of the US and China [Temporal Mashup]
        GDP of China (Billion Chinese Yuan), GDP of the US (Billion Dollar)
        bea.gov + federalreserve.gov + stats.gov.cn
  16. 17. Linking GDP of the US and China [Temporal Mashup]
        GDP of China (Billion Chinese Yuan), GDP of the US (Billion Dollar)
        bea.gov + federalreserve.gov + stats.gov.cn
        This mashup was built in less than 4 hours, including conversion of data, web interface, and visualization!
  17. 18. Mashups allow comparisons that single data sets cannot
        Trends in Smoking Prevalence, Tobacco Policy Coverage and Tobacco Prices (1991-2007)
        • Extensible Mashups via Linked Data
        • Diverse datasets from NIH
        • Potentially linking to “unemployment rate”
        • Accountable Mashups via Provenance
        • Annotate datasets used in demos
        • Feed users’ comments back to gov contact (e.g. %)
  18. 19. Integrate with Social media
  19. 20. Our process
        [Process diagram. Stages: Convert, Access, Enhance, Version, SemDiff. Edge labels: derive, create, revision]
  20. 21. Conversion of data sets
  21. 22. csv2rdf4lod (from logd.tw.rpi.edu): install csv2rdf4lod
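
The converter named on this slide is not reproduced here. As a rough illustration of what a "raw" CSV-to-RDF conversion involves, the sketch below turns each CSV row into N-Triples lines using only the Python standard library. The base URI, the `row/` and `vocab/` URI patterns, and the function names are assumptions made for this example, not the tool's actual output conventions; the sample values are taken from the EPA example later in the talk.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URI, chosen only for this illustration.
BASE = "http://example.org/dataset/demo"


def escape_literal(value):
    """Escape characters that must not appear raw in an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def csv_to_ntriples(csv_text, base=BASE):
    """Emit one triple per non-empty cell: row URI, column URI, plain literal."""
    reader = csv.DictReader(io.StringIO(csv_text))
    triples = []
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{base}/row/{row_number}>"
        for column, value in row.items():
            # Skip overflow cells (column is None) and empty or missing values.
            if column is None or value is None or value == "":
                continue
            local_name = quote(column.strip().replace(" ", "_"))
            predicate = f"<{base}/vocab/{local_name}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


sample = "Latitude,Longitude,ST\n40.416944,-75.935,42\n"
for line in csv_to_ntriples(sample):
    print(line)
```

Every value stays an untyped string literal here; typing values and linking them to shared URIs is what the later "enhance" stage of the process adds.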
  22. 23. Metadata is critical: what kinds of metadata are simple to create, powerful enough for search, and internationalizable (esp. beyond English)?
  23. 24. Work in Progress
        • Automated linking
            • Can we discover link points in the data given the standard URI and metadata collections we have?
            • Approach
                • High quality experimentation on small dataset (gold standard)
                    • MS by Johanna Flores, Web Science poster, 2011
                • Take best heuristics to large-scale data
                    • Ongoing
                • Evaluate
                    • Ongoing
                • Repeat as needed…
  24. 25. Datasets are incomplete
  25. 26. RDF encodings from our metadata collection
  26. 27. Process
  27. 28. Tried three heuristic approaches
        • Bag of words
        • LED on strings
        • String match
        • Various weighted combinations
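
A minimal sketch of how three such scores might be combined, assuming "LED" stands for Levenshtein edit distance. The talk does not give the actual scoring functions or weights, so the token-overlap measure, the normalization, and the default weights below are all hypothetical choices for illustration.

```python
def bag_of_words_similarity(a, b):
    """Jaccard overlap of lowercase whitespace tokens (one possible bag-of-words score)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical ignoring case."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def exact_match(a, b):
    """1.0 if the strings are equal after trimming and lowercasing, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def combined_score(a, b, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three scores; the default weights are illustrative only."""
    scores = (bag_of_words_similarity(a, b), edit_similarity(a, b), exact_match(a, b))
    return sum(w * s for w, s in zip(weights, scores))


print(combined_score("State", "state"))
print(combined_score("state name", "state code"))
```

Trying "various weighted combinations" then amounts to sweeping the `weights` tuple against a small gold-standard set and keeping the setting that links best.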
  28. 29. Simple Example: EPA Toxic Release Data

        Facility ID   …   Latitude    Longitude    ST:val
        …             …   40.416944   -75.935      42
        …             …   42.955383   -85.480074   26
        …             …   43.1698     -88.01829    55
        …             …   38.87025    -77.00905    14
        …             …   …           …            …

        • This looks like it could be state identifiers.
        • Look for possible state identifiers:
            • Names: “Pennsylvania”, “Michigan”, “Wisconsin”
            • Abbr: “PA”, “MI”, “WI”
            • FIPS: “42”, “26”, “55”
        • 75% match state identifiers. If this meets our threshold, then recommend interpreting as state and integrating with linked data on the web.
        • Federal Information Processing Standards (FIPS) 14 is “Guam” which is not a US state
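
The column check in this example can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function names are made up, only FIPS codes are checked (not names or abbreviations), the code set covers the 50 states plus the District of Columbia, and the 0.75 threshold is a hypothetical value, since the slide does not say what threshold was used.

```python
# Two-digit FIPS codes for the 50 states plus DC: 01 through 56,
# minus the codes in that range that are not assigned to a state (14 among them).
US_STATE_FIPS = {f"{n:02d}" for n in range(1, 57)} - {"03", "07", "14", "43", "52"}


def state_match_fraction(values, known_codes=US_STATE_FIPS):
    """Fraction of non-empty column values that are known state FIPS codes."""
    cleaned = [str(v).strip().zfill(2) for v in values if str(v).strip()]
    if not cleaned:
        return 0.0
    hits = sum(1 for v in cleaned if v in known_codes)
    return hits / len(cleaned)


def recommend_state_column(values, threshold=0.75):
    """Recommend a 'state' interpretation when enough values match (threshold is hypothetical)."""
    return state_match_fraction(values) >= threshold


# The ST:val column from the slide: three state codes and the non-state code 14.
st_column = ["42", "26", "55", "14"]
print(state_match_fraction(st_column))    # 3 of 4 values match
print(recommend_state_column(st_column))
```

Values that fail the lookup, like 14 here, are left unlinked, which fits a heuristic that produces omissions but not wrong links.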
  29. 30. Results
        • Analyzed 1,396 “raw” Data.gov datasets
            • About 1.66B triples of converted CSV to RDF
            • Did not include metadata, provenance, linking or other products of the “enhanced” conversion
        • Simple heuristics were able to identify 3,432 meaningful database labels, yielding 1.2M US state identifiers and 3.8M geo-coordinates
            • Parallelized enhancement system is able to process 65k triples/second/process
            • Analyzed 1,396 “raw” Data.gov datasets in 3.1 min
                • On 256 processors of the CCNI Opteron cluster
                • Currently porting to an IBM BlueGene
        • Analysis (ongoing) found no errors in links produced (but many errors of omission)
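
As a back-of-envelope check on these figures, the sketch below compares the reported 3.1 minutes with the time implied by the per-process rate. It assumes that the 3.1-minute run covered the full 1.66B triples, that 65k triples/second/process is a sustained rate, and that one process ran per processor; none of these is stated on the slide.

```python
TRIPLES = 1.66e9           # "about 1.66B triples"
RATE_PER_PROCESS = 65_000  # triples per second per process
PROCESSES = 256            # assuming one process per processor
REPORTED_MINUTES = 3.1

# Time if all 256 processes ran at the quoted rate for the whole job.
ideal_seconds = TRIPLES / (RATE_PER_PROCESS * PROCESSES)
reported_seconds = REPORTED_MINUTES * 60
ratio = ideal_seconds / reported_seconds

print(f"ideal:    {ideal_seconds:.1f} s")
print(f"reported: {reported_seconds:.1f} s")
print(f"ideal / reported: {ratio:.2f}")
```

Under these assumptions the ideal time is roughly 100 seconds against 186 seconds reported, so the end-to-end run achieved a little over half the quoted peak rate. The slide does not say where the remaining time went.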
  30. 31. Next Steps
        • Explore
            • Use of mapping heuristics
                • Talking w/PUC-RIO about
            • “real” LED and machine-learning approaches
            • Metadata analysis
                • Clustering & ML (336,000 labeled examples)
                • Metadata linking (esp. re: languages)
                    • Govt terms provide a good start
            • Try other mapping tools (cf. SERIMI)
  31. 32. Challenge
        • Ontology and vocabulary issues
            • How do we compare across heterogeneous and unreconciled data?
                • Good news and bad news
  32. 33. Good news – easy to do comparisons
  33. 34. Good news - Even if not “rationalized” together
  34. 35. Bad news – real comparisons are hard across govts
  35. 36. Presents a challenge Same or different?
  36. 37. Different “ontologies”? Definitely not the expected result!!
  37. 38. And many other interesting issues
        • Trust
            • Government data is controversial, and potentially biased
                • How do we confirm or dispute?
        • Combination
            • When we combine data we need to keep the provenance of information (see trust)
                • How can we show and use?
        • Scaling
            • LOGD has already converted 8,678,741,017 triples
            • ~500 of 390,000 reported US datasets
        • Versioning and updating
        • Archiving
        • Searching in the data
        • …
  38. 39. Summary
        • The Open Govt data is a critical resource
            • Government data released as RDF (UK)
            • Government data converted to RDF (US)
            • Government data that can be found in many forms and used or converted (WWW)
        • Government transparency comes through in the “mashing up” of data from many sites
            • Key to linked data
        • But many challenges remain
            • Scaling, Trust, Provenance, Archiving, Curation, …
        • The Research agenda for linked government data is an important area for a Web-Science based approach
  39. 40. Questions? http://logd.tw.rpi.edu
  40. 41. Govt systems can use linked data web for context Correlates fires, acres burned, and agency budgets
  41. 42. Visualization can help identify data errors Were there really no fires in 1985?