RPI Research in Linked Open Government Systems

Open Government systems are changing the way that many governments around the world interact with their citizenry. Transparency and innovation are both enhanced by government openness and especially by government data sharing. Further, the linked data approach, using maturing semantic web technologies, has been shown to be very valuable in creating "mashups" in which government datasets can be combined in new and innovative ways, and turned into "live" infographics. In this talk, presented to the computer science department at PUC-RIO, we describe the research aspects of RPI's ongoing work in this area.

  • The table is sample data from an EPA Toxic Release dataset. We focus on state (ST) for our example. Note that our heuristic looks for full state names, state abbreviations, or appropriate FIPS codes. Guam is colored red because it is not technically a state. Other items in this class include DC, Puerto Rico, American Samoa, the US Native American tribal entities, and certain municipalities (such as New York City) that are large enough to have their own codes. Explanation: Why is this hard? If we see a database column with the number “36” in it, we have no way to tell what it represents. But if it is in a list of values that are all two digits with a maximum under 60, they may be codes (or they may be ages, so we also have to look for other clues). Even more confusing, Albany has no separate FIPS code (it falls under 36, for New York State), but Manhattan has its own FIPS code of 36061. So telling states from municipalities can be hard, and we cannot reject columns just because some entries are not in the right range.
  • Computational Center for Nanotechnology Innovations (CCNI)
    • Our test achieved a rate of “checking” triples for heuristic matches of 65k triples/second/process. (The exact numbers are intentionally not mentioned in the slides. Specifically, it took 3m20s to make recommendations for 209M triples using 16 processes on the Opteron blade cluster at the CCNI.)
    • Our two heuristics can be summed up as follows:
      • (1) If the column header (property name) looks like it could be about states, and if at least 75% of the values in that column (object values for that property) could be interpreted as states, then recommend that the column be considered as specifying states (the property has range state).
      • (2) If the column header (property name) looks like it could be about latitudes (resp. longitudes), and if at least 75% of the values in the column (object values for that property) could be interpreted as latitudes (resp. longitudes), then recommend that the column be considered as specifying latitudes (resp. longitudes) (the property has range latitude (resp. longitude)).
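Heuristic (2) above can be sketched roughly as follows. This is a minimal illustration, not the code that ran at the CCNI: the header word lists, the function names, and the numeric ranges used to decide whether a value "could be interpreted as" a latitude or longitude are all assumptions. Only the 75% threshold comes from the notes.

```python
import re

# Assumed header vocabularies; the notes only say the header should
# "look like" it is about latitudes or longitudes.
LAT_WORDS = {"lat", "latitude"}
LON_WORDS = {"lon", "lng", "long", "longitude"}
THRESHOLD = 0.75  # from the notes: at least 75% of values must fit


def header_tokens(header):
    """Split a column header into lowercase alphabetic tokens."""
    return set(re.findall(r"[a-z]+", header.lower()))


def fraction_in_range(values, low, high):
    """Fraction of values that parse as numbers within [low, high]."""
    if not values:
        return 0.0
    hits = 0
    for value in values:
        try:
            number = float(value)
        except (TypeError, ValueError):
            continue  # non-numeric values count as misses
        if low <= number <= high:
            hits += 1
    return hits / len(values)


def recommend_coordinate_range(header, values):
    """Return 'latitude', 'longitude', or None for one column."""
    tokens = header_tokens(header)
    if tokens & LAT_WORDS and fraction_in_range(values, -90, 90) >= THRESHOLD:
        return "latitude"
    if tokens & LON_WORDS and fraction_in_range(values, -180, 180) >= THRESHOLD:
        return "longitude"
    return None
```

Requiring both the header clue and the value check reflects the point made in the notes: a numeric range alone is ambiguous, so a column of plausible numbers under an unrelated header is not recommended.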

Presentation Transcript

  • Linked Open Government Data http://logd.tw.rpi.edu Jim Hendler Tetherless World Professor of Computer and Cognitive Science Assistant Dean of Information Technology and Web Science Rensselaer Polytechnic Institute http://www.cs.rpi.edu/~hendler @jahendler (twitter)
  • Demo of our site http://logd.tw.rpi.edu
  • Government Data on the Web
  • Data.gov community: International
  • Government Data Sharing (timeline, 2009 to 2010)
    • January 1, 2009: “Openness will strengthen our democracy and promote efficiency and effectiveness in Government.” (President Obama)
    • Milestones shown: “Putting Govt Data online”; Data.gov.uk beta; data.gov.uk online; data.gov online; data.gov relaunch with semantic web featured; “Open Government Directive” released
    • Other dates shown: May 21, 2009; January 19, 2010; May 21, 2010; June 30, 2009; December 8, 2009
    • Dataset counts shown: 57 Data Sets; ~6000 Data Sets; ~2000 Data Sets; >305,000 Data Sets
  • New ways to see data sets David McCandless
  • Important to the citizens: e.g., education
  • What’s promising
    • Linked open government data (data.gov, data.gov.uk)
      • Of many kinds
    • Markup languages and semantics and tools to enable transparency
    • Lower barriers to internet visualization, e.g. Google vis, MIT simile, many more…
    • Web 2.0 to put people in the loop and use and contribute to annotations
  • Moving data.gov to linked data (UK)
    • Built around “linked data” from the start
    • Authorization for this from the Prime Minister
  • Moving data.gov to linked data (US)
    • Third parties (like RPI) translate the government datasets into linked data formats
    • US Data.gov hosts 6.4B RDF triples (5/21/2010)
      • acknowledges Semantic Web as a key technology for open government data
  • Linked Open Data goes beyond govt (http://linkeddata.org/): Government data is currently over ½ the cloud in size (~17B triples), with tens of thousands of links to other data (within and without)
    • Create Mashups
    More than 50 of these at http://logd.tw.rpi.edu
  • Data.gov + epa.gov
  • Adding some Web magic Web Analytics Social Data Networks External Links
  • Linking GDP of the US and China [Temporal Mashup]: GDP of China (billion Chinese yuan) and GDP of the US (billion dollars). Sources: bea.gov + federalreserve.gov + stats.gov.cn. This mashup was built in less than 4 hours, including conversion of data, web interface, and visualization!
  • Mashups allow comparisons that single data sets cannot: Trends in Smoking Prevalence, Tobacco Policy Coverage and Tobacco Prices (1991-2007)
    • Extensible Mashups via Linked Data
    • Diverse datasets from NIH
    • Potentially linking to “unemployment rate”
    • Accountable Mashups via Provenance
    • Annotate datasets used in demos
    • Feed users’ comments back to gov contacts (e.g. %)
  • Integrate with Social media
  • Our process: Convert, Access, Enhance, Version, SemDiff (diagram edge labels: derive, derive, create, derive, revision)
  • Conversion of data sets
  • csv2rdf4lod (from logd.tw.rpi.edu): install csv2rdf4lod
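To make the idea of a "raw" CSV-to-RDF conversion concrete, here is a minimal sketch in which every row becomes a subject, every column a property, and every cell a plain literal. It does not reproduce the behavior or output format of the converter named on the slide: the base URI, the URI layout, and the function names are hypothetical and chosen only for illustration.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical base URI for the converted dataset.
BASE = "http://example.org/dataset/epa-tri"


def escape_literal(text):
    """Escape characters that are not allowed raw in an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def csv_to_ntriples(csv_text, base=BASE):
    """Turn CSV text into a list of N-Triples lines, one per non-empty cell."""
    reader = csv.DictReader(io.StringIO(csv_text))
    lines = []
    for row_number, row in enumerate(reader, start=1):
        subject = f"<{base}/row/{row_number}>"
        for column, value in row.items():
            if column is None or not value:
                continue  # skip overflow cells and empty values
            predicate = f"<{base}/property/{quote(column.strip())}>"
            lines.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return lines


sample = "Facility ID,Latitude,ST:val\nF1,40.416944,42\n"
for line in csv_to_ntriples(sample):
    print(line)
```

A conversion like this carries no metadata, provenance, or links; those would come from a later enhancement step.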
  • Metadata is critical: what kinds of metadata are simple to create, powerful enough for search, and internationalizable (esp. beyond English)?
  • Work in Progress
    • Automated linking
      • Can we discover link points in the data given the standard URI and metadata collections we have?
      • Approach
        • High quality experimentation on small dataset (gold standard)
          • MS by Johanna Flores, Web Science poster, 2011
        • Take best heuristics to large-scale data
          • Ongoing
        • Evaluate
          • Ongoing
        • Repeat as needed…
  • Datasets are incomplete
  • RDF encodings from our metadata collection
  • Process
    • Tried three heuristic approaches
      • Bag of words
      • LED on strings
      • String match
    • Various weighted combinations
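One way the three heuristics and a weighted combination might look is sketched below. The slides do not give the scoring functions or the weights, so everything here is an assumption: "LED" is read as Levenshtein edit distance, bag of words is scored as token-set overlap (Jaccard), and the default weights are arbitrary placeholders.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens of a label."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def bag_of_words_score(a, b):
    """Jaccard overlap of the two labels' token sets."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_distance_score(a, b):
    """Edit distance rescaled to a similarity in [0, 1]."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def string_match_score(a, b):
    """1.0 for a case-insensitive exact match, else 0.0."""
    return 1.0 if a.strip().lower() == b.strip().lower() else 0.0


def combined_score(a, b, weights=(0.4, 0.4, 0.2)):
    """Weighted average of the three scores; the weights are placeholders."""
    w_bag, w_edit, w_exact = weights
    total = w_bag + w_edit + w_exact
    return (w_bag * bag_of_words_score(a, b)
            + w_edit * edit_distance_score(a, b)
            + w_exact * string_match_score(a, b)) / total
```

For example, `combined_score("state", "states")` scores higher than `combined_score("state", "county")`, because the edit-distance term rewards near-identical labels even when the token sets do not overlap.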
  • Simple Example: EPA Toxic Release Data
    • This looks like it could be state identifiers. Look for possible state identifiers:
      • Names: “Pennsylvania”, “Michigan”, “Wisconsin”
      • Abbr: “PA”, “MI”, “WI”
      • FIPS: “42”, “26”, “55”
    • 75% match state identifiers. If this meets our threshold, then recommend interpreting as state and integrating with linked data on the web.
    • Federal Information Processing Standards (FIPS) 14 is “Guam”, which is not a US state.

    Facility ID   …   Latitude    Longitude    ST:val
    …             …   40.416944   -75.935      42
    …             …   42.955383   -85.480074   26
    …             …   43.1698     -88.01829    55
    …             …   38.87025    -77.00905    14
    …             …   …           …            …
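The example above can be sketched as a small check over the ST:val column. This is an illustration under stated assumptions rather than the project's code: the lookup sets contain only the three states named on the slide (a real table would cover all states), and the header word list and function names are made up for the sketch. The 75% threshold and the sample values come from the slide.

```python
import re

# Illustrative subset only: the three states named on the slide.
STATE_NAMES = {"pennsylvania", "michigan", "wisconsin"}
STATE_ABBRS = {"PA", "MI", "WI"}
STATE_FIPS = {"42", "26", "55"}

STATE_HEADER_WORDS = {"st", "state"}  # assumed header clues
THRESHOLD = 0.75                      # from the slide


def looks_like_state(value):
    """True if the value is a known state name, abbreviation, or FIPS code."""
    text = str(value).strip()
    return (text.lower() in STATE_NAMES
            or text.upper() in STATE_ABBRS
            or text.zfill(2) in STATE_FIPS)


def match_fraction(values):
    """Fraction of column values that can be read as state identifiers."""
    if not values:
        return 0.0
    return sum(looks_like_state(v) for v in values) / len(values)


def recommend_state_column(header, values):
    """Recommend 'state' only if the header and enough of the values agree."""
    header_words = set(re.findall(r"[a-z]+", header.lower()))
    if not header_words & STATE_HEADER_WORDS:
        return False
    return match_fraction(values) >= THRESHOLD


st_column = ["42", "26", "55", "14"]   # the ST:val column from the slide
print(match_fraction(st_column))       # 0.75
print(recommend_state_column("ST:val", st_column))  # True
```

The value 14 does not match, yet the column is still recommended. That is the behavior the slide describes: a few out-of-range entries should not cause the whole column to be rejected.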
  • Results
    • Analyzed 1,396 “raw” Data.gov datasets
      • About 1.66B triples of converted CSV to RDF
      • Did not include metadata, provenance, linking or other products of the “enhanced” conversion
    • Simple heuristics were able to identify 3,432 meaningful database labels, yielding 1.2M US state identifiers and 3.8M geo-coordinates
      • Parallelized enhancement system is able to process 65k triples/second/process
      • Analyzed 1,396 “raw” Data.gov datasets in 3.1 min
        • on 256 processors of the CCNI Opteron cluster
        • Currently porting to an IBM BlueGene
    • Analysis (ongoing) found no errors in links produced (but many errors of omission)
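The 65k triples/second/process figure can be checked against the timing given in the speaker notes (3m20s for 209M triples on 16 processes). The helper name below is made up for the sketch, and the calculation assumes the work was divided evenly across processes.

```python
def triples_per_second_per_process(triples, seconds, processes):
    """Average per-process throughput, assuming an even split of the work."""
    return triples / (seconds * processes)


elapsed_seconds = 3 * 60 + 20  # 3m20s, from the speaker notes
rate = triples_per_second_per_process(209_000_000, elapsed_seconds, 16)
print(f"{rate:.1f}")  # 65312.5
```

The result rounds to the 65k triples/second/process quoted on the slide.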
  • Next Steps
    • Explore
      • Use of mapping heuristics
        • Talking w/PUC-RIO about
      • “real” LED and machine-learning approaches
      • Metadata analysis
        • Clustering & ML (336,000 labeled examples)
        • Metadata linking (esp. re: languages)
          • Govt terms provide a good start
      • Try other mapping tools ( cf. SERIMI )
  • Challenge
    • Ontology and vocabulary issues
      • How do we compare across heterogeneous and unreconciled data?
        • Good news and Bad news
  • Good news – easy to do comparisons
  • Good news - Even if not “rationalized” together
  • Bad news – real comparisons are hard across govts
  • Presents a challenge: same or different?
  • Different “ontologies”? Definitely not the expected result!!
  • And many other interesting issues
    • Trust
      • Government data is controversial, and potentially biased
        • How do we confirm or dispute?
    • Combination
      • When we combine data we need to keep the provenance of information (see trust)
        • How can we show and use it?
    • Scaling
      • LOGD has already converted 8,678,741,017 triples
      • ~500 of 390,000 reported US datasets
    • Versioning and updating
    • Archiving
    • Searching in the data
  • Summary
    • Open government data is a critical resource
      • Government data released as RDF (UK)
      • Government data converted to RDF (US)
      • Government data that can be found in many forms and used or converted (WWW)
    • Government transparency comes through in the “mashing up” of data from many sites
      • Key to linked data
    • But many challenges remain
      • Scaling, Trust, Provenance, Archiving, Curation, …
    • The Research agenda for linked government data is an important area for a Web-Science based approach
  • Questions? http://logd.tw.rpi.edu
  • Govt systems can use the linked data web for context: correlates fires, acres burned, and agency budgets
  • Visualization can help identify data errors: were there really no fires in 1985?