Integrating Government Data New

Dean Allemang's Presentation at ISWC 2009

Transcript of "Integrating Government Data New"

  1. Integrating Government Data using Semantic Web technology
     Dean Allemang, Chief Scientist, TopQuadrant Inc.
     Prepared for ISWC 2009
  2. Government Data Sources
     - Recent efforts have changed the face of government data distribution
       - Better motivated
       - More sources
       - ‘Mandate’ (well, memorandum, anyway) for sharing data
     - Government data sources
       - Data.gov (main focus)
       - DOI Architecture - http://www.doi.gov/ocio/architecture/
       - USGS Earthquakes - http://earthquake.usgs.gov/eqcenter/
       - USASpending.gov
     - Non-government data sources
       - Dbpedia.org
       - oeGov
  3. “Objets trouvés”
     - Artwork made from “found objects”
     - Project Runway, etc.
     [Image: Lal Hitchcock sculptures]
  4. “Found data”
     - Data integration efforts try to make data reusable
       - Data ‘wholesale’ instead of ‘retail’
       - Multiple efforts result in multiple data formats
       - Many efforts to ‘unify’ how data is represented: (competing) global data standards.
       - Maybe one day, one will win.
     - Until that time, we have to make do with “found data”: data that is already available, in whatever form it takes.
     - RDF (etc.) can help us do that
  5. Formats for “Found Data” in government

     | Format          | Examples                       | Notes                                                       |
     |-----------------|--------------------------------|-------------------------------------------------------------|
     | Spreadsheets    | Data.gov, USASpending.gov, DOI | Flexibility makes it popular, but makes work at re-use time |
     | XML             | Data.gov                       | Not really a single format, but can be parsed uniformly     |
     | RSS             | USASpending.gov, USGS          | Syntax wars largely irrelevant now. Easy to read, dynamic   |
     | RDFa            | <none?>                        | New kid on the block, supported by Google, Yahoo!, Drupal   |
     | SPARQL Endpoint | Dbpedia.org                    | Most flexible of all, dynamic                               |
     | RDF/N3/SKOS     | OEGov, Tetherless World        | Flexible, relatively static. Great for vocabularies etc.    |
  6. Quality Considerations of Found Data
     - Correctness
       - Usual notion for data quality; is it right?
       - Misspellings, out-of-date data, etc.
     - Understandability
       - Found data requires interpretation.
       - E.g., what do columns in a spreadsheet mean?
     - Accessibility
       - How easily can the data be organized?
       - E.g., spreadsheets can have haphazard organization
       - E.g., RSS feeds that aren’t dynamic, don’t have readable fields, etc.
     - Reusability/Repurposing
       - References to Controlled Vocabularies
       - Use of standardized ‘columns’ (properties)
  7. A few species of Found Data
     - Quantitative Data feeds
       - This is what we are usually actually interested in
       - Data is described using properties, units, tags, etc.
     - Vocabularies *
       - Structured, unstructured
       - Sometimes with strong standards behind them (Westlaw, AGROVOC)
       - Not always advertised as ‘vocabularies’; also as org diagrams, architectures, or even data
         - FEA, TOGAF
         - Geographical entities (States, cities, countries), FAO Geopolitical ontology
         - Units of measure, structure of gov’t agencies
     - Schema *
       - Used to standardize properties (columns, XML tags, etc.)
         - DC, WGS, FOAF, SIOC
         - 11179
     - * Two kinds of “controlled vocabulary”, often confused!
  8. Integration strategy using RDF
     - IMPORT data into RDF
       - RDF is a sort of ‘least common denominator’ data representation
     - MERGE data
       - A wide variety of technologies available here
       - Semantic Web approach: you MODEL your mapping.
     - ANALYZE and DISPLAY conclusions
       - RDF is a sort of ‘least common denominator’ data representation
  9. Import data into RDF
     - RDF as Common Data representation
     - ‘Rote’ transformations

     XML source:

         <Person id="3">
           <name>Irene Polikoff</name>
           <employer>TopQuadrant</employer>
           <position>CEO</position>
         </Person>

     Spreadsheet source:

     | Name           | Address          | Company     | Title           |
     |----------------|------------------|-------------|-----------------|
     | Dean Allemang  | 10 Downing St.   | TopQuadrant | Chief Scientist |
     | Michael Brodie | 14 Wysteria Lane | Verizon     | Chief Scientist |
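The ‘rote’ transformation of the spreadsheet above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `http://example.org/contacts#` namespace and the `row1`, `row2` subject names are hypothetical, triples are plain tuples rather than objects from an RDF library, and this is not how TopBraid or any other tool performs the import.

```python
import csv
import io

# Hypothetical namespace for illustration; it does not appear in the slides.
BASE = "http://example.org/contacts#"

# The spreadsheet from the slide, written out as CSV text.
SHEET = """Name,Address,Company,Title
Dean Allemang,10 Downing St.,TopQuadrant,Chief Scientist
Michael Brodie,14 Wysteria Lane,Verizon,Chief Scientist
"""


def rows_to_triples(text, base=BASE):
    """Rote import: one subject per row, one predicate per column."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(text)), start=1):
        subject = f"{base}row{i}"  # assumed naming scheme for row entities
        for column, value in row.items():
            triples.append((subject, base + column, value))
    return triples


triples = rows_to_triples(SHEET)
for triple in triples:
    print(triple)
```

Nothing is interpreted at this stage: column names become predicates verbatim, which is why a separate mapping step is needed afterwards.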
  10. Import Data into RDF
     - Each common data type can be input ‘rote’ into RDF
       - Input preserves information from the original; entities for, e.g., spreadsheet rows, XML elements, database tables, RSS channels, etc.
       - Often “found data” requires further processing to make sense, e.g.:
         - Extracting trees from spreadsheets
         - Resolving references in XML
       - SPARQL CONSTRUCT is useful for any of these, once data is ‘rote’ translated into RDF

     Tree to extract:

         Canus
           Dog
             Collie
             Beagle
             Terrier
           Wolf
             Steppen
             Lone

     Spreadsheet source:

     | Genus | Species | Sub-species |
     |-------|---------|-------------|
     | Canus | Dog     | Collie      |
     | Canus | Dog     | Beagle      |
     | Canus | Dog     | Terrier     |
     | Canus | Wolf    | Steppen     |
     | Canus | Wolf    | Lone        |
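As a rough illustration of “extracting trees from spreadsheets”, the sketch below rebuilds the hierarchy from the flattened Genus/Species/Sub-species rows. The slide suggests SPARQL CONSTRUCT for this step; plain Python is used here instead so the example runs on its own, and the `childOf` predicate name is an assumption made for the example.

```python
# Rows copied from the slide's Genus / Species / Sub-species table.
ROWS = [
    ("Canus", "Dog", "Collie"),
    ("Canus", "Dog", "Beagle"),
    ("Canus", "Dog", "Terrier"),
    ("Canus", "Wolf", "Steppen"),
    ("Canus", "Wolf", "Lone"),
]


def rows_to_tree(rows):
    """Fold each row into a nested dict; shared prefixes collapse into one node."""
    tree = {}
    for row in rows:
        node = tree
        for label in row:
            node = node.setdefault(label, {})
    return tree


def tree_to_edges(tree, parent=None):
    """Flatten the nested dict into (child, 'childOf', parent) triples."""
    edges = []
    for label, children in tree.items():
        if parent is not None:
            edges.append((label, "childOf", parent))  # hypothetical predicate
        edges.extend(tree_to_edges(children, label))
    return edges


tree = rows_to_tree(ROWS)
edges = tree_to_edges(tree)
for edge in edges:
    print(edge)
```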
  11. Data Quality and Controlled Vocabularies
     - Do you reference a controlled vocabulary?
       - Flickr, del.icio.us, no
       - DOI, GSA, FTF, etc. reference FEA
       - Some reference more than one, e.g., GSA references TOGAF also
       - Legal briefs reference West Key Numbering System (WestLaw)
       - If you reference one (or more), then information sharing becomes possible along that vocabulary
     - Did you tell us which one you referenced?
       - Reference is often implicit, or hidden in column name “Service Standard” (did you recognize that as FEA?)
       - Reference is often explicit but informal: ISBN-10: 0123735564
       - RDF provides global means of referencing vocabulary with a URI
       - http://www.fao.org/aos/agrovoc#c_16080 rdfs:label "Cow milk"@en
  12. Data Quality and Controlled Vocabularies (cont)
     - How did you specify the term?
       - del.icio.us, Flickr, etc. use (uncontrolled) strings
       - FEA uses controlled strings (which notion of “Quality” do you mean?)
       - WestLaw uses Key Numbering System: 2233(2) “Regular income”
       - RDF/SKOS uses global means of referring to terms with the URI
       - http://www.fao.org/aos/agrovoc#c_16080 rdfs:label "Cow milk"@en
       - Sounds familiar? The URI solves many problems of reference with respect to shared controlled vocabularies!
  13. Unstructured data and Controlled Vocabularies
     - Found Data sometimes doesn’t refer to vocabularies directly
       - “Microsoft announced today that negotiations to acquire search giant Yahoo! have stalled.”
       - MICROSOFT, YAHOO! etc. could be controlled terms!
       - ‘Standard’ terms might not match exactly (SEC names, etc.)
     - Concept Extraction technology can be relevant here
       - Reuters Calais reads news stories and extracts concepts in a controlled vocabulary
       - Still has all the reference issues from before
       - Calais uses RDF (URIs) to resolve this.
       - Hooray for Calais!
  14. Merging Data
     - “Schema mapping”
       - Useful when multiple data sources provide the same information about similar items
       - Same information is described using different terms (columns, properties)
     - “Tagging” or “Sorting”
       - ‘Tags’ data (like del.icio.us or Library of Congress)
       - Useful for grouping similar items for search and discovery
     - Both can be used together
       - E.g., use tags to find similar things, then map schemas to report data uniformly
  15. Data mapping Style 1: Schema Mapping Examples
     - Different sources use different names

     XML source:

         <Person id="3">
           <name>Irene Polikoff</name>
           <employer>TopQuadrant</employer>
           <position>CEO</position>
         </Person>

     Spreadsheet source:

     | Name           | Address          | Company     | Title           |
     |----------------|------------------|-------------|-----------------|
     | Dean Allemang  | 10 Downing St.   | TopQuadrant | Chief Scientist |
     | Michael Brodie | 14 Wysteria Lane | Verizon     | Chief Scientist |

     Name = name, but Company = employer and Title = position
  16. Schema Mapping Examples (cont)
     - Different structures for similar data

     Latitude and longitude directly on the item:

         <rss:item ID="3">
           <wgs:lat>39.945345</wgs:lat>
           <wgs:long>-79.34524</wgs:long>
         </rss:item>

     Latitude and longitude nested in a point:

         <image src="doggie.jpg">
           <wgs:Point>
             <wgs:lat>39.945345</wgs:lat>
             <wgs:long>-79.34524</wgs:long>
           </wgs:Point>
         </image>

     Latitude and longitude packed into one string:

         <Entry>
           <position>39.945345,-79.34524</position>
         </Entry>
  17. Schema mapping solutions:
     - With RDFS/OWL:

           :employer owl:equivalentProperty :Company .
           :position owl:equivalentProperty :Title .

     - With SPARQL:

           CONSTRUCT { ?x wgs:lat ?lat . ?x wgs:long ?long . }
           WHERE { ?x sxml:child ?point . ?point a :Point .
                   ?point wgs:lat ?lat . ?point wgs:long ?long }

     - With SPARQL extensions:

           CONSTRUCT { ?x wgs:lat ?lat . ?x wgs:long ?long . }
           WHERE { ?x :position ?pos .
                   LET (?lat := str:before(?pos, ","))
                   LET (?long := str:after(?pos, ",")) }
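To make the effect of these mappings concrete, here is a minimal Python sketch over plain tuples. It only imitates the outcome and is not an OWL reasoner or a SPARQL engine: `normalize` rewrites in one direction, whereas `owl:equivalentProperty` is symmetric, and `split_position` mirrors the `str:before` / `str:after` step from the slide. The subject identifiers `p3` and `e1` are hypothetical.

```python
# Property pairs taken from the slide's owl:equivalentProperty statements.
EQUIVALENT = {"employer": "Company", "position": "Title"}


def normalize(triples, mapping=EQUIVALENT):
    """Rewrite each predicate to its mapped name, leaving unmapped ones alone."""
    return [(s, mapping.get(p, p), o) for s, p, o in triples]


def split_position(triples):
    """Turn a ':position' value of the form 'lat,long' into wgs:lat and wgs:long."""
    out = []
    for s, p, o in triples:
        if p == ":position" and "," in o:
            lat, _, long_ = o.partition(",")  # text before / after the first comma
            out.append((s, "wgs:lat", lat.strip()))
            out.append((s, "wgs:long", long_.strip()))
        else:
            out.append((s, p, o))
    return out


people = normalize([
    ("p3", "name", "Irene Polikoff"),
    ("p3", "employer", "TopQuadrant"),
    ("p3", "position", "CEO"),
])
points = split_position([("e1", ":position", "39.945345,-79.34524")])
print(people)
print(points)
```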
  18. Schema mapping solutions (cont)
     - With a controlled meta-vocabulary and RDFS:
     - E.g., 11179

           :employer rdfs:subPropertyOf 11179:Concept1234 .
           :position rdfs:subPropertyOf 11179:Concept5678 .
  19. Role of Standards in the Mapping
     - Schema standards like WGS:
       - If all parties use them, no mapping necessary!!
       - Simple standards encourage reuse: Microformats
     - Schema meta-standards like 11179
       - If all parties map to them, no new mapping necessary; just use theirs!
       - One mega-standard makes re-use difficult
       - Meta-standard (don’t use my words, just map to them) makes reuse easier
     - Vocabulary standards (AGROVOC, WestLaw, FEA, etc.)
       - Not very applicable at this stage
       - Will come into their own in the next step . . .
  20. Data Mapping Style 2: Tagging or Sorting
     - Like del.icio.us etc.

         <Bookmark href="http://www.topquadrant.com">
           <tag>Semantic Web</tag>
         </Bookmark>

         <System name="Central Bookkeeping">
           <Evaluation>
             <PerformanceMeasure>Quality</PerformanceMeasure>
             <Result>Fair</Result>
           </Evaluation>
         </System>

     The PerformanceMeasure value is an FEA reference! The tag on the bookmark: where does this come from?
  21. Role of Standards in the Mapping
     - Vocabulary standards (AGROVOC, WestLaw, FEA, etc.)
       - Useful for organizing collaboration among groups
       - Used extensively by libraries, professional organizations, focused domain groups, etc.
       - Not used by del.icio.us, Flickr, etc.
       - Related to “Folksonomies”
  22. Analysis and Display
     - Wide variety of options, including, e.g.:
       - Use tags and tag structure to amalgamate data
       - Display merged properties in a table
       - Display merged data on a specific widget (e.g., mapping geospatial data)
       - Business Intelligence reporting: pie charts, bars, graphs, etc.
  23. Tags as Amalgamation
     [Diagram: FEA, DOI, GSA]
     If two sources use the same controlled vocabulary, they can be amalgamated along that dimension.
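Amalgamating along a shared vocabulary amounts to grouping items from different sources by the term they reference. The sketch below assumes each source has already been reduced to (item, tag) pairs; all system names and the `fea:Timeliness` term are made up for illustration, and only “Quality” as an FEA performance measure comes from the slides.

```python
from collections import defaultdict

# Made-up records for illustration; the item names and tags are hypothetical.
DOI = [("doi:SystemA", "fea:Quality"), ("doi:SystemB", "fea:Timeliness")]
GSA = [("gsa:SystemX", "fea:Quality")]


def amalgamate(*sources):
    """Group items from every source under the controlled term they reference."""
    groups = defaultdict(list)
    for source in sources:
        for item, tag in source:
            groups[tag].append(item)
    return dict(groups)


groups = amalgamate(DOI, GSA)
for tag, items in groups.items():
    print(tag, items)
```

The grouping only works because both sources use the same identifier for the term; with uncontrolled strings, “Quality” in one source and “quality of service” in another would land in separate groups.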
  24. Mapping Columns
  25. Model-driven displays

         SELECT ?lat ?long
         WHERE { ?item a :DisplayLocation .
                 ?item geo:lat ?lat .
                 ?item geo:long ?long . }

     | Name      | latitude | longitude |
     |-----------|----------|-----------|
     | Slausen   | -171.3   | 38.4      |
     | Union     | -171.4   | 38.2      |
     | Vine      | -170.9   | 37.9      |
     | McArthur  | -170.4   | 38.1      |
     | Anaheim   | -171.3   | 38.2      |
     | Chinatown | -171.1   | 38.5      |
     | Beverly   | -171.3   | 38.1      |

     [Diagram: latitude, longitude, and Station related to geo:lat, geo:long, and :DisplayLocation through domain, subPropertyOf, and subClassOf links]
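The idea behind a model-driven display is that the query never mentions the source's own column names; the model links them to the properties the display understands. The sketch below assumes one reading of the slide's diagram (latitude and longitude as sub-properties of geo:lat and geo:long, Station as a subclass of :DisplayLocation) and applies a single inference step by hand. It is an illustration over plain tuples, not an RDFS reasoner, and it does not compute transitive closure.

```python
# Assumed reading of the slide's diagram; the link directions are an assumption.
SUBPROPERTY = {"latitude": "geo:lat", "longitude": "geo:long"}
SUBCLASS = {"Station": ":DisplayLocation"}

# Two rows from the slide's table, written as triples ('a' stands for rdf:type).
DATA = [
    ("Slausen", "a", "Station"),
    ("Slausen", "latitude", "-171.3"),
    ("Slausen", "longitude", "38.4"),
    ("Union", "a", "Station"),
    ("Union", "latitude", "-171.4"),
    ("Union", "longitude", "38.2"),
]


def infer(triples):
    """Add the triples implied by one subClassOf / subPropertyOf step."""
    inferred = set(triples)
    for s, p, o in triples:
        if p == "a" and o in SUBCLASS:
            inferred.add((s, "a", SUBCLASS[o]))
        elif p in SUBPROPERTY:
            inferred.add((s, SUBPROPERTY[p], o))
    return inferred


def display_locations(triples):
    """Counterpart of the slide's SELECT: every :DisplayLocation with geo:lat and geo:long."""
    facts = infer(triples)
    lookup = {(s, p): o for s, p, o in facts}
    items = sorted(s for s, p, o in facts if p == "a" and o == ":DisplayLocation")
    return [
        (item, lookup[(item, "geo:lat")], lookup[(item, "geo:long")])
        for item in items
        if (item, "geo:lat") in lookup and (item, "geo:long") in lookup
    ]


locations = display_locations(DATA)
print(locations)
```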
  26. Exercises
     - Will use TopBraid™ Ensemble and TopBraid™ Composer
     - Using data from
       - oeGov
       - USASpending.gov
       - … others TBD …
     - Merge, slice, amalgamate, etc…