http://www.planningalerts.com/ : email alerts of planning applications near a location. Data comes from screen-scraping some UK local councils’ websites. http://ishortman.com/projects/expendituremap/ : map of public expenditure data by UK region, covering services such as defence, public order, science and technology, agriculture, and transport. Data is based on normalised spreadsheet data from the UK’s Office for National Statistics Annual Abstract of Statistics.
Common tasks involved in publishing linked data. The following slides give a brief overview of each stage.
Working with linked data has two main parts: Publication, i.e. making your linked data available to the public, and Consumption, i.e. others consuming and using it.
Uses standard SW technologies (RDF, OWL, SPARQL). Uses Garlik JXT triplestore.
Be clear about the questions you are asking.
Data normalisation: data sources come in different formats. We chose RDF/Turtle for its compactness and clarity. Data conversion to RDF: we used Python scripts and Java (Jena) to convert the files to RDF. Modelling the datasets: much of the data was multi-dimensional, so we used SCOVO to model it.
Modelling the Home Office datasets: each row consists of the data for one police force. The columns of each row contain crime values for offences such as “Violence against the person”, “Robbery”, and “Offences against vehicles”. We modelled the time period (2008/09), the geographical regions, and the different crime types as “scovo:Dimension”.
Very difficult to integrate data from disparate sources
Asserted owl:sameAs relations between the geographic concepts of the datasets and the corresponding relevant entities in the O.S. Admin Geography (using string matching).
Do psiusecase demo here.
Demo of http://map.psi.enakting.org/
DBpedia is an example of a dataset.
World Sense-Making using Linked Data. Tope Omitola (joint work with Prof. Nigel Shadbolt). Faculty Research Seminar Talk, Birmingham City University, UK, Thurs 8 Dec. 2011
Talk Outline
– EnAKTing: Its story
– From the Web to Semantic Web to Linked Data
– Public Sector Datasets: Publication and Consumption
– Findability of Appropriate Data Sources – Service Descriptions
– Provenance and Trust in Linked Data
What is EnAKTing? EPSRC-funded project addressing 3 key research problems: (1) how to build ontologies quickly that are capable of exploiting the potential of large-scale user participation, (2) how to query an unbounded web of linked data, (3) how to visualise, explore, browse and navigate this mass of data. Project Leaders: Prof. Sir Tim Berners-Lee, Prof. Dame Wendy Hall, and Prof. Nigel Shadbolt.
From the Web to Semantic Web to Linked Data The Web of Data Problems with the Web of Document RDF Linked Data
The Web of Data (a.k.a Semantic Web/Linked Data) Traditional Web of Documents Internet, Documents, Links Documents in HTML Links using URLs HTTP for document access and transfer
Some more problems with Web of Documents Difficult to Integrate Data Example Use Case: Making a Travel Plan Data Integration by looking and typing Slow Unproductive Workflow Difficult for apps to make “sense” of HTML text
Solutions Use RDF to give some structure to the data RDF <-> subject predicate object RDF links things, not just documents, and they are typed
RDF is a language (for data)
– Words: URIs and literal text
– Nouns and verbs: classes and properties
– Sentence structure: RDF statements (triples)
– Paragraphs: RDF graphs
– Footnotes: URIs [Domain Name Service]
– Dictionaries: RDF Schemas
• Generic grammar for languages of description
• Functions as native language, second language, or pidgin.
RDF and Ontology The AAA Slogan: “Anyone can say Anything about Any topic.” s p o . (subject predicate object .) <http://en.wikipedia.org/wiki/Tony_Benn> <http://purl.org/dc/elements/1.1/title> "Tony Benn" . RDF is used to build ontologies: formal representations of shared knowledge as a set of concepts within a domain and the relationships between them. Examples: Finance ontology; MusicBrainz, music ontology; GO, gene ontology, etc.
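The subject-predicate-object idea can be sketched in a few lines of Python, assuming plain tuples stand in for RDF statements. The `match` helper and the second triple are illustrative assumptions, not part of the talk; a real application would use an RDF library and a triplestore.

```python
# RDF statements modelled as (subject, predicate, object) tuples.
DC_TITLE = "http://purl.org/dc/elements/1.1/title"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
TONY_BENN = "http://en.wikipedia.org/wiki/Tony_Benn"

graph = {
    # The statement shown on the slide.
    (TONY_BENN, DC_TITLE, "Tony Benn"),
    # An assumed extra statement, added only to make the graph non-trivial.
    (TONY_BENN, RDF_TYPE, "http://xmlns.com/foaf/0.1/Document"),
}


def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# All titles asserted for the Tony Benn resource.
titles = [o for (_, _, o) in match(graph, s=TONY_BENN, p=DC_TITLE)]
```

Because every statement has the same three-part shape, graphs from different sources can be merged by taking the union of their triple sets.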
What is Linked Data? Data, data, everywhere: We are surrounded by data: School performance, car fuel efficiency, etc Data help us to make better decisions You can discern the shape and structure of an entity by looking at the data it generates Data shapes conversations and markets
What is Linked Data? Linked Data: Framework where data is a first class citizen on the Web Evolving the current Web into a Global Data Space TimBL: 4 principles of Linked Data Use URIs as names for things, Use HTTP URIs, When someone looks up a URI, provide useful information, using the standards (RDF, etc), Include links to other URIs, so that they can discover more things
The Web of Linked Data Link everything. No silos. [Diagram: a network of interlinked Things]
The Web of Linked Data Linked Data (Semantic Web) is a graph database.
Linked Data Advantage comes from linking the RDF(s) together.
Some Linked Datastores BBC NY Times Guardian DBpedia Geonames …
Linking (Linked) Open Data cloud linkeddata.org Many of the datastores are being linked together to form a network/graph.
Linked Data In summary, Linked Data provides:
– A data model: RDF
– A standardized data access mechanism: HTTP
– Hyperlink-based data discovery, using URIs
– Self-descriptive data, through the use of shared vocabularies
Government Linked Data Explosion of Government (Linked) Open Data efforts and projects. Examples: data.gov, data.gov.uk, data.gov.au
Public Sector Datasets Inherent value in opening up public government data Systems and Services can be tailored to citizens’ priorities. Likely questions citizens may need answers to are: – “Where can I find a good school, a good investment advisor, a good employer?”
Public Sector Datasets (contd.) Integration of datasets enables more complex questions to be asked and answered Some examples: – http://www.planningalerts.com/ – http://ishortman.com/projects/expendituremap/ Governments freeing up their data. Holy grail is information integration: Meshing.
Issues we focus on Findability of appropriate data sources SEARCH: Look at the data sources EXTRACT: Slicing of data sources INTEGRATE: Unifying the views EXPLORE: Answering the questions.
Workflow: Identify Dataset → Design/Select Vocabularies → Extract and convert data into RDF → Publish as Linked Data → Consume Linked Data (Application)
Publishing your data as Linked Data: Some Things to Consider
– How do you choose a good URI to name things? There are guidelines for this. Examples: http://dbpedia.org/resource/Wildlife_photography ; Tope Omitola @ Univ of Southampton: http://id.ecs.soton.ac.uk/person/24123
– Describing a dataset using voiD (the Vocabulary of Interlinked Datasets)
– Choosing and using vocabularies to describe data (SKOS, RDFS, OWL, SCOVO)
– Sourcing datasets: where do you get the datasets from (e.g. Semantic Web search engines, manual search)?
– Choice of join points: when you have different datasets, where do you join them together?
– Data normalization: using RDF makes things easier
– Alignment of datasets
Architecture [Diagram: Data Sources → Gatherers and Extractors → RDF → RDF Triplestore (4store) → SPARQL → Services; Data Integration; Infer new concepts and relationships]
Data Publication – Challenges and Solutions
Research Questions:
– In our case, dealing with data centred on the United Kingdom’s democratic system,
– Using geography data from the UK’s Ordnance Survey as the “join-point” with data for criminal statistics, Members of Parliament, mortality rates, etc.
Sourcing the datasets:
– Many government datasets are in pdf, html, or xls files, so automatic discovery methods are not possible (yet),
– We went through a manual discovery process, searching for them,
– We found some in pdf, html, and xls,
– We decided against pdf and html.
Data Publication – Challenges and Solutions (contd.)
– We went for data in xls format. Why?
• Ability to source from a wider range of public sector domains.
Data Source | Format | Dataset
Publicwhip.org.uk | HTML | MP votes records, etc
Theyworkforyou.com | XML dump | Parliament, Parliament expenses
Homeoffice.gov.uk | Excel | Recorded crime (England, 2008/09)
Statistics.gov.uk | Excel | Hospital Waiting List (England 2008/09)
Performance.doh.gov.uk | Excel | Mortality rates (England 2008/09)
Ordnancesurvey.co.uk | Linked Data | UK’s mapping agency
Data Publication – Challenges and Solutions (contd.) Data normalisation. RDF as our standard model. Data conversion to RDF. Python + Java. Modelling the datasets: Multi-dimensional, used SCOVO.
Data Publication – Challenges and Solutions (contd.) Crime dataset:
Table 7.03 Recorded crime by offence group by police force area, English region and Wales, 2008/09 (numbers of recorded crimes)
Police force area, English region and Wales | Total | Violence against the person | Sexual offences | Robbery | Burglary | Offences against vehicles | Other theft offences | Fraud and forgery | Criminal damage | Drug offences | Other offences
Cleveland | 55,094 | 10,662 | 566 | 404 | 6,175 | 5,224 | 13,697 | 905 | 13,746 | 2,636 | 1,079
Durham | 45,074 | 7,435 | 476 | 170 | 6,226 | 4,940 | 9,674 | 835 | 13,027 | 1,327 | 964
Northumbria | 105,234 | 19,147 | 989 | 732 | 11,418 | 11,620 | 24,042 | 2,909 | 27,178 | 5,166 | 2,033
North East Region | 205,402 | 37,244 | 2,031 | 1,306 | 23,819 | 21,784 | 47,413 | 4,649 | 53,951 | 9,129 | 4,076
Modelling in Turtle:
    :TimePeriod rdf:type owl:Class ;
        rdfs:subClassOf scovo:Dimension .
    :TP2008_09 rdf:type :TimePeriod .
    :GeographicalRegion rdfs:subClassOf scovo:Dimension ;
        dc:title "Police force area, English region and Wales" .
    :CriminalOffenceType rdf:type owl:Class ;
        rdfs:subClassOf scovo:Dimension .
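The conversion step can be illustrated with a minimal Python sketch that turns one row of the crime table into SCOVO items, one per offence column. This is not the project's actual script: the local names (`:Cleveland`, `:Robbery`, the item naming scheme) are hypothetical, and the sketch assumes the dimension instances and prefixes are declared elsewhere.

```python
# Column headings from the Home Office table, mapped to hypothetical local names.
OFFENCES = {
    "Violence against the person": "ViolenceAgainstThePerson",
    "Robbery": "Robbery",
    "Burglary": "Burglary",
}


def parse_count(text):
    """Turn a spreadsheet value such as '10,662' into an integer."""
    return int(text.replace(",", ""))


def row_to_scovo(area, cells, period="TP2008_09"):
    """Render one table row as Turtle: one scovo:Item per offence column.

    Each item carries its value plus three dimensions: the time period,
    the geographical region, and the offence type.
    """
    blocks = []
    for heading, local in OFFENCES.items():
        value = parse_count(cells[heading])
        blocks.append(
            f":{area}_{local}_{period} a scovo:Item ;\n"
            f"    rdf:value {value} ;\n"
            f"    scovo:dimension :{period} ;\n"
            f"    scovo:dimension :{area} ;\n"
            f"    scovo:dimension :{local} ."
        )
    return "\n\n".join(blocks)


# Values taken from the Cleveland row of the table.
cleveland = {
    "Violence against the person": "10,662",
    "Robbery": "404",
    "Burglary": "6,175",
}
turtle = row_to_scovo("Cleveland", cleveland)
```

Modelling each cell as its own item is what lets a multi-dimensional table be queried by any combination of region, period, and offence type.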
Some Issues in Linked Data Co-referencing, i.e. different sources referring to the same entities by different names.
Cardiff in DBpedia: http://dbpedia.org/resource/Cardiff or http://dbpedia.org/resource/Cardiff_City
Cardiff in Geonames: http://sws.geonames.org/2172349/
Which Cardiff shall we use? Solution: the sameAs service from Southampton.
Alignment of Datasets (contd.) Asserted owl:sameAs relations between the geographic concepts of our datasets and the O.S. (using string matching). For example, the English county of Cumbria was aligned as follows: <http://enakting.ecs.soton.ac.uk/statistics/data/Cumbria> <http://www.w3.org/2002/07/owl#sameAs> <http://data.ordnancesurvey.co.uk/id/7000000000024876> . A few special cases: “Yorkshire and the Humber Region” vs “Yorkshire & the Humber”. NHS Trusts were labelled differently, e.g. South Tyneside NHS Trust had no equivalent in the O.S., so we used Google Maps.
Recap: Data Publication Sourcing: Many not in RDF yet. Some in html, pdf, and xls. We chose xls. Selection of RDF as the normal form. Used SCOVO to model multidimensional data. We used owl:sameAs to assert equivalences between geo regions. We used string matching. Some did not work, e.g. Yorkshire and the Humber. Some have no equivalent OS entities, so we had to go via the Google Maps API.
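A rough Python sketch of label-based alignment, assuming each dataset is available as a mapping from label to URI. The normalisation rules (treating "&" as "and", dropping the word "region") are assumptions chosen to handle the special case mentioned on the slide; they are not necessarily the rules the project used.

```python
import re

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def normalise(label):
    """Lower-case, treat '&' as 'and', drop the word 'region', squeeze spaces."""
    text = label.lower().replace("&", " and ")
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    text = re.sub(r"\bregion\b", " ", text)
    return " ".join(text.split())


def align(local, reference):
    """Yield N-Triples lines for labels that match after normalisation.

    Both arguments map a human-readable label to a URI.
    """
    index = {normalise(label): uri for label, uri in reference.items()}
    for label, uri in local.items():
        target = index.get(normalise(label))
        if target is not None:
            yield f"<{uri}> <{OWL_SAME_AS}> <{target}> ."


# The Cumbria URIs are the ones shown on the slide.
local = {"Cumbria": "http://enakting.ecs.soton.ac.uk/statistics/data/Cumbria"}
reference = {"Cumbria": "http://data.ordnancesurvey.co.uk/id/7000000000024876"}
links = list(align(local, reference))
```

Labels with no counterpart after normalisation produce no link, which is where a fallback such as the geocoding lookup mentioned above would come in.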
Consuming Linked Data How do you visualize linked datasets? Linked Data browsers, e.g. Disco, Tabulator. Linked Data search engines, e.g. Sig.ma, Falcons, Sindice. Domain-specific applications and mashups, e.g. dayta.me (from Southampton), US Global Foreign Aid Mashup.
Data Consumption Application acts as an aggregator of information based on user’s postal (zip) code. Generates data views based on geographical region of postal code. Shows political representatives (MPs) for constituencies, their voting records, and their expenses.
Data Consumption (contd.) Challenges:
– The lack of UIs to quickly browse, search or visualise views on a wide range of differently modelled data,
– Lack of suitable tools which allow efficient aggregation and presentation of data to the UI from multiple datasets,
– Data consumers having partial knowledge of the domain and finding it difficult to understand the domain and the data being modelled.
This points to the need for a toolset that helps developers give a better description of the domain being modelled.
Recap: Publish and Consume
Information Integration: one of the holy grails.
Problems with data sources: different formats, etc. RDF can act as a standard model.
Publication to RDF. Challenges. Solutions.
– SCOVO for multi-dimensional data
– String matching and its complexities
Consuming the data. Challenges. Solutions.
– Aggregating data based on zip code
– Complexities of geo boundaries
We have re-published the data we generated into the linked data cloud: EnAKTing datasets www.enakting.org/enakting/datasets
Some of our Outputs
– http://geoservice.psi.enakting.org : service to discover geographical resources
– http://map.psi.enakting.org/ : integrates different PSI Linked Data sources by querying the backlinking service
– http://backlinks.psi.enakting.org : service to discover backlinks in PSI
– http://void.rkbexplorer.com/ : describes the contents of datasets, enabling discovery and reuse of resources
– http://bagatelles.ecs.soton.ac.uk/psi/ : platform for integrating several PSI catalogues from the Web
– http://4sreasoner.ecs.soton.ac.uk/ : scalable reasoning in 4store; 4sr is a branch of 4store where backward-chained reasoning is implemented
– http://apps.seme4.com/see-uk/ : visualization tool for some UK data
Findability of Appropriate Data Sources – Service Descriptions How do you tell the world about your new linked datasets? Provide good service descriptions of your datasets. Use the Vocabulary of Interlinked Datasets (voiD).
Vocabulary of Interlinked Datasets (voiD) Allows description of datasets and their interlinking, e.g. "there are 200k links of type gr: predicates between dataset X and dataset Y; dataset Y mainly offers data about homes and X about mortgages". A dataset: a set of RDF triples published, maintained or aggregated by a single provider, and accessible on the Web, e.g. :DBpedia a void:Dataset . Also allows the description of RDF links between datasets (using void:Linkset).
Three Areas of voiD General Metadata Access Metadata Structural Metadata
voiD (contd.) General metadata: the dataset's title, description, date of creation, creator, publisher, licence, subject(s), etc.
    :DBpedia a void:Dataset ;
        dcterms:title "DBPedia" ;
        dcterms:description "RDF data extracted from Wikipedia" ;
        dcterms:contributor :FU_Berlin ;
        dcterms:modified "2008-11-17"^^xsd:date ;
        dcterms:contributor :OpenLink_Software .
Access metadata: describes how the RDF data(set) can be accessed.
Using SPARQL, e.g.
    :DBpedia a void:Dataset ;
        void:sparqlEndpoint <http://dbpedia.org/sparql> .
Using URI lookup,
    :Sindice a void:Dataset ;
        void:uriLookupEndpoint <http://api.sindice.com/v2/search?qt=term&q=> .
Using RDF dumps,
    :NYTimes a void:Dataset ;
        void:dataDump <http://data.nytimes.com/people.rdf> .
Structural metadata: describes the structure and schema of datasets.
Naming some representative example entities for a dataset.
Stating whether a dataset's entities share a common URI space:
    :DBpedia a void:Dataset ;
        void:uriSpace "http://dbpedia.org/resource/" .
Stating the vocabularies used in a dataset:
    :LiveJournal a void:Dataset ;
        void:vocabulary <http://xmlns.com/foaf/0.1/> .
Providing statistics about datasets, e.g. expressing the number of RDF triples or the number of entities of a dataset:
    :DBpedia a void:Dataset ;
        void:triples 1000000000 ;
        void:entities 3400000 .
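The three areas of metadata can be combined into one description. The sketch below is a hypothetical helper (`void_description` is not part of any library) that renders a voiD description as a Turtle string; it assumes the `void:` and `dcterms:` prefixes are declared elsewhere and that the title contains no double quotes.

```python
def void_description(name, title, sparql_endpoint=None, uri_space=None,
                     vocabularies=(), triples=None):
    """Render a void:Dataset description as a Turtle string.

    Covers general metadata (title), access metadata (SPARQL endpoint)
    and structural metadata (URI space, vocabularies, triple count).
    """
    lines = [f":{name} a void:Dataset", f'dcterms:title "{title}"']
    if sparql_endpoint:
        lines.append(f"void:sparqlEndpoint <{sparql_endpoint}>")
    if uri_space:
        lines.append(f'void:uriSpace "{uri_space}"')
    for vocab in vocabularies:
        lines.append(f"void:vocabulary <{vocab}>")
    if triples is not None:
        lines.append(f"void:triples {triples}")
    # Predicate-object pairs of one subject are separated by ';' in Turtle.
    return " ;\n    ".join(lines) + " ."


# Values taken from the DBpedia examples on the slides.
example = void_description(
    "DBpedia",
    "DBPedia",
    sparql_endpoint="http://dbpedia.org/sparql",
    uri_space="http://dbpedia.org/resource/",
    triples=1000000000,
)
```

Generating the description from the same script that produces the data keeps the published statistics in step with the dataset itself.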
Publishing voiD files as void.ttl in the root directory of the site, with a local “hash URI” for the dataset, e.g. http://example.com/void.ttl#MyDataset. Using the root URI of the site, such as http://example.com/, as the dataset URI, and serving both HTML and an RDF format via content negotiation from that URI. Embedding the VoID description as HTML+RDFa into homepage of dataset, with a local “hash URI” for the dataset, yielding URI such as http://example.com/#MyDataset.
Why is voiD useful? voiD Discovery: by enabling the discovery and usage of linked datasets.
– Through a sitemap: a sitemap such as http://www.yoursite.com/sitemap.xml references void.ttl, and sitemap.xml is added to robots.txt. A search engine crawls the website, indexing void.ttl plus a cache of the RDF triples referenced in this voiD file.
– Through backlinks: <document.rdf> void:inDataset <void.ttl#MyDataset> .
– Through a well-known URI: void.ttl can be placed in /.well-known/void on any Web server, e.g. http://www.example.com/.well-known/void
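Two of these discovery routes are simple enough to sketch in Python. The helper names below are hypothetical; the sketch only builds the well-known URI and the backlink statement as strings and does not fetch anything.

```python
from urllib.parse import urljoin


def well_known_void(site_url):
    """Return the /.well-known/void URI for the server hosting site_url."""
    # A path starting with '/' replaces the whole path of the base URL.
    return urljoin(site_url, "/.well-known/void")


def backlink(document_uri, dataset_uri):
    """Return a Turtle statement linking a document to its voiD dataset."""
    return f"<{document_uri}> void:inDataset <{dataset_uri}> ."


void_uri = well_known_void("http://www.example.com/data/page.html")
statement = backlink("document.rdf", "void.ttl#MyDataset")
```

A client that knows only one document URI can therefore locate the dataset description either by following the backlink or by probing the well-known location on the same host.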
@prefix void: <http://rdfs.org/ns/void#> .
@prefix scovo: <http://purl.org/NET/scovo#> .

<http://crime.psi.enakting.org/id/void> a void:Dataset ;
    foaf:homepage <http://crime.psi.enakting.org/> ;
    rdfs:label "crime.psi.enakting.org Linked Data Repository" ;
    dcterms:date "2010-09-13T11:30:29"^^xsd:date ;
    dcterms:title "crime.psi.enakting.org Linked Data Repository" ;
    foaf:nick "crime" ;
    dcterms:description "United Kingdoms crime statistics per region for the year 2008/09, provided by the United Kingdom Home Office. Dataset provenance: http://www.homeoffice.gov.uk/rds/pdfs09/hosb1109chap7.xls" ;
    dcterms:publisher <http://crime.psi.enakting.org> ;
    void:statItem [
        scovo:dimension void:numberOfTriples ;
        rdf:value 4988 ;
        rdfs:label "4,988 triples" ;
    ] ;
    void:subset [
        a void:Linkset ;
        rdfs:label "crime.psi.enakting.org CRS -> http://data.ordnancesurvey.co.uk/" ;
        void:subjectsTarget <http://crime.psi.enakting.org/id/void> ;
        void:objectsTarget <http://void.rkbexplorer.com/id/dataset/d1d473f29a9091069644824242e9ae07> ;
        void:linkPredicate coref:duplicate ;
        void:statItem [
            rdfs:label "133 URI equivalences" ;
            rdf:value 133 ;
            scovo:dimension void:numberOfTriples ;
        ]
    ] .
Provenance and Trust in Linked Data Whom do you trust on the Web?
Provenance and Trust Mash-ups, aggregation, integration, data re-use. How do you establish reliability and accuracy? Generate trust by revealing as much information about yourself and your data as possible. This enables consumers to judge the quality and trustworthiness of your data. Useful for Data Discovery/Mining + Query Planning.
Different kinds of Provenance When was x derived (when-provenance). How was x derived (how-provenance). What data was used to derive x (what-provenance). Who carried out the transformation(s) from whence x came (who-provenance).
Provenance Models for Linked Datasets Provenance Vocabulary Ontology
Provenance Models for Linked Datasets (contd) Open Provenance Model
Provenance Models for Linked Datasets (contd) Provenance for Datasets (voidp) http://www.enakting.org/provenance/voidp/
voiD Provenance Extension voidp Designed to be simple and lightweight. Mainly for (RDF) data publishers. Includes the necessary information about the process, its inputs, and its outputs. The basis is simple: an agent runs a process on data (or a dataset) to produce other data (or another dataset). Agent → Process → Data → Data’ . @prefix voidp: <http://purl.org/void/provenance/ns> .
voidp Classes and Predicates
– voidp:ProvenanceEvent: items under provenance control.
– voidp:actor: actor, person, group, software or physical artifact involved in this provenance event.
– voidp:certification: used to contain the dataset’s signature elements.
– voidp:contact: details of whom to contact should people have queries about this dataset.
– voidp:item: the provenance characteristics of a data item under provenance control.
– voidp:processType: the type of transformation or conversion procedure carried out on the item’s source.
– voidp:resultingDataset: dataset that is the result of this provenance event.
– voidp:sourceDataset: source dataset for the data item under provenance control.
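The Agent → Process → Data → Data’ idea can be sketched as a small Python record that renders a provenance event in Turtle using the predicates listed above. This is an illustration under assumptions: the event and actor URIs are hypothetical, the `voidp:` prefix is assumed to be declared as on the slide, and treating voidp:processType as a plain string literal is a guess rather than something the vocabulary is known to require.

```python
from dataclasses import dataclass


@dataclass
class ProvenanceEvent:
    """One step of Agent -> Process -> Data -> Data'."""
    uri: str                # identifier of the event (hypothetical)
    actor: str              # who or what ran the process
    process_type: str       # kind of transformation carried out
    source_dataset: str     # Data
    resulting_dataset: str  # Data'

    def to_turtle(self):
        """Render the event using the voidp predicates from the slide."""
        return (
            f"<{self.uri}> a voidp:ProvenanceEvent ;\n"
            f"    voidp:actor <{self.actor}> ;\n"
            f'    voidp:processType "{self.process_type}" ;\n'
            f"    voidp:sourceDataset <{self.source_dataset}> ;\n"
            f"    voidp:resultingDataset <{self.resulting_dataset}> ."
        )


# Source and result URIs come from the crime dataset description;
# the event and actor URIs are placeholders.
event = ProvenanceEvent(
    uri="http://example.org/provenance/event1",
    actor="http://example.org/agents/converter-script",
    process_type="Excel to RDF conversion",
    source_dataset="http://www.homeoffice.gov.uk/rds/pdfs09/hosb1109chap7.xls",
    resulting_dataset="http://crime.psi.enakting.org/id/void",
)
turtle = event.to_turtle()
```

Recording the source, the actor, and the process type together answers the what-, who-, and how-provenance questions for a single derivation step.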