RPI Research in Linked Open Government Systems


Open Government systems are changing the way that many governments around the world interact with their citizenry. Transparency and innovation are both enhanced by government openness and especially by government data sharing. Further, the linked data approach, using maturing semantic web technologies, has been shown to be very valuable in creating "mashups" in which government datasets can be combined in new and innovative ways, and turned into "live" infographics. In this talk, presented to the computer science department at PUC-RIO, we describe the research aspects of RPI's ongoing work in this area.

Published in: Technology, Education
  • The table shows sample data from an EPA Toxic Release dataset. We focus on state (ST) for our example. Our heuristic looks for full state names, state abbreviations, or appropriate FIPS codes. Guam is colored red because it is not technically a state. Other items in this class include DC, Puerto Rico, American Samoa, the US Native American tribal entities, and certain municipalities (such as New York City) that are large enough to have their own codes. Why is this hard? If we see a database column with the number “36” in it, we have no way to tell what it represents. But if it appears in a list of values that all have two digits and whose maximum is under 60, the values may be codes (or they may be ages, so we also have to look for other clues). Even more confusing, Albany has no separate FIPS code (it falls under 36, for New York State), but Manhattan has its own FIPS code of 36061. Telling states from municipalities can therefore be hard, and we cannot reject a column just because some entries are not in the right range.
  • Computational Center for Nanotechnology Innovations (CCNI). Our test “checked” triples for heuristic matches at a rate of 65k triples/second/process. (The exact numbers are intentionally not mentioned in the slides: it took 3m20s to make recommendations for 209M triples using 16 processes on the Opteron blade cluster at the CCNI.) Our two heuristics can be summed up as follows: (1) if the column header (property name) looks like it could be about states, and at least 75% of the values in that column (object values for that property) could be interpreted as states, then recommend that the column be considered as specifying states (the property has range state); (2) if the column header (property name) looks like it could be about latitudes (resp. longitudes), and at least 75% of the values in the column (object values for that property) could be interpreted as latitudes (resp. longitudes), then recommend that the column be considered as specifying latitudes (resp. longitudes) (the property has range latitude (resp. longitude)).
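The two column heuristics described in the notes can be sketched in Python. This is a minimal illustration under stated assumptions, not the RPI implementation: the lookup tables are abbreviated to a few values, and the names `HEADER_HINTS`, `VALUE_CHECKS`, and `recommend_range`, along with the header regular expressions, are hypothetical. Only the header-plus-values test and the 75% threshold come from the notes.

```python
import re

# Hypothetical, abbreviated lookup tables (assumption); a real run needs all states.
STATE_NAMES = {"pennsylvania", "michigan", "wisconsin", "new york"}
STATE_ABBRS = {"PA", "MI", "WI", "NY"}
STATE_FIPS = {"42", "26", "55", "36"}

# Assumed patterns for "the header looks like it could be about X".
HEADER_HINTS = {
    "state": re.compile(r"\b(st|state)\b", re.IGNORECASE),
    "latitude": re.compile(r"\blat(itude)?\b", re.IGNORECASE),
    "longitude": re.compile(r"\b(lon|long|longitude)\b", re.IGNORECASE),
}


def looks_like_state(value):
    """True if the value is a known state name, abbreviation, or FIPS code."""
    v = value.strip()
    return (
        v.lower() in STATE_NAMES
        or v.upper() in STATE_ABBRS
        or v.zfill(2) in STATE_FIPS
    )


def in_range(value, low, high):
    """True if the value parses as a number within [low, high]."""
    try:
        return low <= float(value) <= high
    except ValueError:
        return False


VALUE_CHECKS = {
    "state": looks_like_state,
    "latitude": lambda v: in_range(v, -90.0, 90.0),
    "longitude": lambda v: in_range(v, -180.0, 180.0),
}


def recommend_range(header, values, threshold=0.75):
    """Return 'state', 'latitude', 'longitude', or None for one column."""
    if not values:
        return None
    for kind, pattern in HEADER_HINTS.items():
        if not pattern.search(header):
            continue  # the header gives no hint for this kind
        matches = sum(1 for v in values if VALUE_CHECKS[kind](v))
        if matches / len(values) >= threshold:
            return kind
    return None
```

With the four ST:val values from the EPA example (42, 26, 55, 14), three of the four are state codes, so the column sits exactly at the 75% threshold and `recommend_range("ST:val", ["42", "26", "55", "14"])` recommends `"state"`. Requiring a header hint as well as matching values is what keeps a column of ages from being mistaken for FIPS codes.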
  • RPI Research in Linked Open Government Systems

    1. 1. Linked Open Government Data (http://logd.tw.rpi.edu). Jim Hendler, Tetherless World Professor of Computer and Cognitive Science, Assistant Dean of Information Technology and Web Science, Rensselaer Polytechnic Institute. http://www.cs.rpi.edu/~hendler, @jahendler (Twitter)
    2. 2. Demo of our site http://logd.tw.rpi.edu
    3. 3. Government Data on the Web
    4. 4. Data.gov community: International
    5. 5. Government Data Sharing January 1, 2009 “Openness will strengthen our democracy and promote efficiency and effectiveness in Government.” --- President Obama Putting Govt Data online- Data.gov.uk beta May 21, 2009 January 19, 2010 data.gov.uk online May 21, 2010 data.gov online data.gov relaunch with semantic web featured June 30, 2009 December 8, 2009 “Open Government Directive” released 2009 2010 … 57 Data Sets ~6000 Data Sets ~2000 Data Sets >305,000 Data Sets
    6. 6. New ways to see data sets David McCandless
    7. 7. Important to the citizens: e.g., Education
    8. 8. What’s promising
        • Linked open government data (data.gov, data.gov.uk)
            • Of many kinds
        • Markup languages and semantics and tools to enable transparency
        • Lower barriers to internet visualization, e.g. Google vis, MIT simile, many more…
        • Web 2.0 to put people in the loop and use and contribute to annotations
    9. 9. Moving data.gov to linked data (UK)
        • Built around “linked data” from the start
        • Authorization for this from the Prime Minister
    10. 10. Moving data.gov to linked data (US)
        • Third parties (like RPI) translate the government datasets into linked data formats
        • US Data.gov hosts 6.4B RDF triples (5/21/2010)
            • acknowledges Semantic Web as a key technology for open government data
    11. 11. Linked Open Data goes beyond govt http://linkeddata.org/ Government Data is currently over ½ the cloud in size (~17B triples), 10s of thousands of links to other data (within and without)
    12. 12. Create Mashups: more than 50 of these at http://logd.tw.rpi.edu
    13. 13. Data.gov + epa.gov
    14. 15. Adding some Web magic Web Analytics Social Data Networks External Links
    15. 16. Linking GDP of the US and China [Temporal Mashup]: GDP of China (Billion Chinese Yuan), GDP of the US (Billion Dollar); bea.gov + federalreserve.gov + stats.gov.cn
    16. 17. Linking GDP of the US and China [Temporal Mashup]: GDP of China (Billion Chinese Yuan), GDP of the US (Billion Dollar); bea.gov + federalreserve.gov + stats.gov.cn. This mashup was built in less than 4 hours, including conversion of data, web interface, and visualization!
    17. 18. Mashups allow comparisons that single data sets cannot: Trends in Smoking Prevalence, Tobacco Policy Coverage and Tobacco Prices (1991-2007)
        • Extensible Mashups via Linked Data
        • Diverse datasets from NIH
        • Potentially linking to “unemployment rate”
        • Accountable Mashups via Provenance
        • Annotate datasets used in demos
        • Feed users’ comments back to the gov contact (e.g. %)
    18. 19. Integrate with Social media
    19. 20. Our process (diagram; stages: Convert, Access, Enhance, Version, SemDiff; edge labels: derive, create, revision)
    20. 21. Conversion of data sets
    21. 22. Csv2rdf4lod (from logd.tw.rpi.edu): Install csv2rdf4lod
    22. 23. Metadata is critical: what kinds of metadata are simple to create, powerful enough for search, and internationalizable (esp. beyond English)?
    23. 24. Work in Progress
        • Automated linking
            • Can we discover link points in the data given the standard URI and metadata collections we have?
            • Approach
                • High quality experimentation on small dataset (gold standard)
                    • MS by Johanna Flores, Web Science poster, 2011
                • Take best heuristics to large-scale data
                    • Ongoing
                • Evaluate
                    • Ongoing
                • Repeat as needed…
    24. 25. Datasets are incomplete
    25. 26. RDF encodings from our metadata collection
    26. 27. Process
    27. 28. Tried three heuristic approaches
        • Bag of words
        • LED on strings
        • String Match
        • Various Weighted Combinations
    28. 29. Simple Example: EPA Toxic Release Data
        Facility ID | … | Latitude  | Longitude  | ST:val
        …           | … | 40.416944 | -75.935    | 42
        …           | … | 42.955383 | -85.480074 | 26
        …           | … | 43.1698   | -88.01829  | 55
        …           | … | 38.87025  | -77.00905  | 14
        …           | … | …         | …          | …
        • This (the ST:val column) looks like it could be state identifiers.
        • Look for possible state identifiers:
            • Names: “Pennsylvania”, “Michigan”, “Wisconsin”
            • Abbr: “PA”, “MI”, “WI”
            • FIPS (Federal Information Processing Standards): “42”, “26”, “55”
        • 75% match state identifiers. If this meets our threshold, then recommend interpreting as state and integrating with linked data on the web.
        • FIPS 14 is “Guam”, which is not a US state.
    29. 30. Results
        • Analyzed 1,396 “raw” Data.gov datasets
            • About 1.66B triples of converted CSV to RDF
            • Did not include metadata, provenance, linking or other products of the “enhanced” conversion
        • Simple heuristics were able to identify 3,432 meaningful database labels, yielding 1.2M US state identifiers and 3.8M geo-coordinates
            • Parallelized enhancement system is able to process 65k triples/second/process
            • Analyzed 1,396 “raw” Data.gov datasets in 3.1 min
                • on 256 processors of the CCNI Opteron cluster
                • Currently porting to an IBM BlueGene
        • Analysis (ongoing) found no errors in links produced (but many errors of omission)
    30. 31. Next Steps
        • Explore
            • Use of mapping heuristics
                • Talking w/PUC-RIO about
            • “real” LED and machine-learning approaches
            • Metadata analysis
                • Clustering & ML (336,000 labeled examples)
                • Metadata linking (esp. re: languages)
                    • Govt terms provide a good start
            • Try other mapping tools (cf. SERIMI)
    31. 32. Challenge
        • Ontology and vocabulary issues
            • How do we compare across heterogeneous and unreconciled data?
                • Good news and Bad news
    32. 33. Good news – easy to do comparisons
    33. 34. Good news - Even if not “rationalized” together
    34. 35. Bad news – real comparisons are hard across govts
    35. 36. Presents a challenge Same or different?
    36. 37. Different “ontologies” ? Definitely not the expected result!!
    37. 38. And many other interesting issues
        • Trust
            • Government data is controversial, and potentially biased
                • How do we confirm or dispute?
        • Combination
            • When we combine data we need to keep the provenance of information (see trust)
                • How can we show and use?
        • Scaling
            • LOGD has already converted 8,678,741,017 triples
            • ~500 of 390,000 reported US datasets
        • Versioning and updating
        • Archiving
        • Searching in the data
        • …
    38. 39. Summary
        • The Open Govt data is a critical resource
            • Government data released as RDF (UK)
            • Government data converted to RDF (US)
            • Government data that can be found in many forms and used or converted (WWW)
        • Government transparency comes through in the “mashing up” of data from many sites
            • Key to linked data
        • But many challenges remain
            • Scaling, Trust, Provenance, Archiving, Curation, …
        • The Research agenda for linked government data is an important area for a Web-Science based approach
    39. 40. Questions? http://logd.tw.rpi.edu
    40. 41. Govt systems can use linked data web for context Correlates fires, acres burned, and agency budgets
    41. 42. Visualization can help identify data errors Were there really no fires in 1985?
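
Slide 28 lists three string heuristics and weighted combinations of them. A minimal Python sketch of one such combination follows. It assumes that “LED” stands for Levenshtein edit distance (the slides do not expand the acronym), uses Jaccard overlap as the bag-of-words measure, and picks the weights (0.4, 0.4, 0.2) arbitrarily. None of these choices, nor the function names, come from the talk.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Edit distance rescaled to [0, 1], where 1 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def bag_of_words_similarity(a, b):
    """Jaccard overlap of the lower-cased word sets of two labels."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)


def exact_match(a, b):
    """1.0 if the labels are equal ignoring case and outer whitespace."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def combined_score(a, b, weights=(0.4, 0.4, 0.2)):
    """Weighted sum of the three measures; the weights are illustrative."""
    w_bow, w_led, w_exact = weights
    return (w_bow * bag_of_words_similarity(a, b)
            + w_led * edit_similarity(a.lower(), b.lower())
            + w_exact * exact_match(a, b))
```

A misspelled label such as “Pennsylvnia” still scores well against “Pennsylvania” through the edit-distance term, while the bag-of-words term rewards reordered multi-word labels. That complementarity is one plausible reason to try combinations rather than a single measure.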