SlideShare a Scribd company logo
1 of 15
Data.dcs: Converting Legacy Data into Linked Data Matthew Rowe Organisations, Information and Knowledge Group University of Sheffield
Outline Problem Legacy data contained within the Department of Computer Science Motivation Why produce linked data? Converting Legacy Data into Linked Data: Triplification of Legacy Data Coreference  Resolution Linking into the Web of Linked Data Deployment Conclusions
Problem ,[object Object],People Research groups Publications Legacy data is defined as important information which is stored in proprietary formats  Each member of the DCS maintains his/her own web page Heterogeneous formatting Different presentation of content Devoid of any semantic markup Hard to find information due to its distribution and heterogeneity E.g. cannot query for all publications which two given people have worked on in the past year Must manually search and filter through a lot of information ,[object Object],Rarely updated  Contains duplicates and errors Not machine-readable
Motivation Leveraging legacy data from the DCS in a machine-readable and consistent form would allow related information to be linked together People would be linked with their publications Research groups would be linked to their members Co-authors of papers could be found Linking DCS data into the Web of Linked Data would allow additional information to be provided: Listing conferences which DCS members have attended Provide up-to-date publication listings Via external linked datasets The DCS web site use case is indicative of the makeup of the WWW overall: Organisations provide legacy data which is intended for human consumption HTML documents are provided which lack any explicit machine-readable semantics (Mika et al, 2009) presented crawling logs from 2008-2009 where only 0.6% of web documents contained RDFa and 2% contained the hCardmicroformat
Converting Legacy Data into Linked Data The approach is divided into 3 different stages: Triplification Converting legacy data into RDF triples Coreference Resolution Identifying coreferring entities into the RDF dataset Linked to the Web of Linked Data Querying and discovering links into the Linked Data cloud
Triplification of Legacy Data The DCS publication database provides publication listings as XML However, all publication information is contained within the same <description> element (title, author, year, book title): <description>   <![CDATA[Rowe, M. (2009). Interlinking Distributed Social Graphs.    In <i>Proceedings of Linked Data on the Web Workshop, WWW 2009,    Madrid, Spain. (2009)</i>. Madrid, Madrid, Spain.<br>   <br>Edited by Sarah Duffy on Tue, 08 Dec 2009 09:31:30 +0000.]]> </description> DCS web site provides person information and research group listings in HTML documents Information is devoid of markup and is provided in a heterogeneous formats Extracting person information from HTML documents requires the derivation of context windows Portions of a given HTML document which contain person information (e.g. a person’s name together with their email address and homepage)
Triplification of Legacy Data
Triplification of Legacy Data Context windows are generated by identifying portions of a HTML document which contain a person’s name The structure of the HTML DOM is then used to partition the window such that it contains information about a single person HTML markup provides clues as to the segmentation of legacy data within the document Once a name is identified a set of algorithms moves up the DOM tree until a layout element is discovered (e.g. <div>, <li>, <tr>) Provided with a set of context windows, legacy data is then extracted from the window using Hidden Markov Models (HMMs) Context windows are extracted from HTML documents for DCS members and are provided from the publication listings XML HMMs:  Input: sequence of tokens from the context window Output: most likely state labels for each of the tokens States correspond to information which is to be extracted
Triplification of Legacy Data An RDF dataset is built from the extracted legacy data This provides the sourcedataset from which a linked dataset is built For person information triples are formed as follows: <http://data.dcs.shef.ac.uk/person/12025>  rdf:typefoaf:Person ; foaf:name "Matthew Rowe" . <http://www.dcs.shef.ac.uk/~mrowe/publications.html>  foaf:topic <http://data.dcs.shef.ac.uk/person/12025> For publication information triples are formed as follows: <http://data.dcs.shef.ac.uk/paper/239>  rdf:typebib:Entry ; bib:title "Interlinking Distributed Social Graphs." ; bib:hasYear "2009" ;	 bib:hasBookTitle "Proceedings of Linked Data on the Web Workshop, WWW, Madrid, Spain." ; foaf:maker _:a1 . _:a1 rdf:typefoaf:Person ;  foaf:name "Matthew Rowe"
Coreference Resolution The triplification of legacy data contained within the DCS web sites (from ~12,000 HTML documents)  produced 17,896 instances of foaf:Person and 1,088 instances of bib:Entry Contains many equivalent foaf:Person instances Must also assign people to their publications We create information about each research group manually to relate DCS members with their research groups: <http://data.dcs.shef.ac.uk/group/oak>	 rdf:typefoaf:Group ; foaf:name  "Organisations, Information and Knowledge Group" ; foaf:workplaceHomepage  <http://oak.dcs.shef.ac.uk> Querying the source dataset for all the people to appear on each of the group personnel listings pages provides the members of each group This provides a list of people who are members of the DCS For each person equivalent instances are found by Graph smushing: detecting equivalent instances which have the same email address or homepage Co-occurrence queries: detecting equivalent instances by matching coworkers who appear in the same page People are assigned to their publications using queries containing an abbreviated form of their name Applicable in this context due to the localised nature of the dataset
Coreference Resolution
Linking to the Web of Linked Data Our dataset at this stage in the approach is not linked data All links are internal to the dataset To overcome the burden of researchers updating their publications we query the DBLP linked dataset using a Networked Graph SPARQL query: The query detects authored research papers in DBLP based on coauthorship with co-workers The CONSTRUCT clause creates a triple relating DCS members with their papers in DBLP Using foaf:made When the Data.dcs member is later looked up, a follow-your-nose strategy can be used to get all papers from DBLP Can also look up conferences attended, countries visited, etc CONSTRUCT {   ?qfoaf:made ?paper .   ?pfoaf:made ?paper  } WHERE {   ?group foaf:member ?q .   ?group foaf:member ?p .   ?qfoaf:name ?n .   ?pfoaf:name ?c .    GRAPH <http://www4.wiwiss.fu-berlin.de/dblp/>    {     ?paper dc:creator ?x .     ?xfoaf:name ?n .     ?paper dc:creator ?y .     ?yfoaf:name ?c .   }   FILTER (?p != ?q) } <http://data.dcs.shef.ac.uk/person/Fabio-Ciravegna> foaf:made  <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/icml/IresonCCFKL05> ; foaf:made  <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ijcai/BrewsterCW01>
Deployment Data.dcs is now up and running and can be accessed at the following URL: http://data.dcs.shef.ac.uk (please try it!) The data is deployed using Recipe 1 from “How to Publish Linked Data” http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/ Recipe 2 for Slash Namespaces from “Best Practices for Publishing RDF Vocabularies” http://www.w3.org/TR/swbp-vocab-pub/ Viewing <http://data.dcs.shef.ac.uk/group/oak> using OpenLinks’sURIBurner
Conclusions Leveraging legacy data requires information extraction components able to handle heterogeneous formats Hidden Markov Models provide a single solution to this problem, however other methods exist which could be explored Presented methods are applicable to other domains, simply requires a different topology and training   Current methods for Linked DCS Data into the Web of Linked Data are conservative: Bespoke SPARQL queries Future work will include the exploration of machine learning classification techniques to perform URI disambiguation This work is now being used as a blueprint for producing linked data from the entire University of Sheffield Covering different faculties and departments Intended to foster and identify possible collaborations across work areas
Twitter: @mattroweshow Web: http://www.dcs.shef.ac.uk/~mrowe Email: m.rowe@dcs.shef.ac.uk Questions? (Mika et al, 2009) - Peter Mika, Edgar Meij, and Hugo Zaragoza. Investigating the semantic gap through query log analysis. In 8th International Semantic Web Conference (ISWC2009), October 2009.

More Related Content

What's hot

Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic Web
Tomek Pluskiewicz
 
Wed roman tut_open_datapub
Wed roman tut_open_datapubWed roman tut_open_datapub
Wed roman tut_open_datapub
eswcsummerschool
 
Semantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic DataSemantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic Data
Matthew Rowe
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)
Besnik Fetahu
 
Get on the Linked Data Web!
Get on the Linked Data Web!Get on the Linked Data Web!
Get on the Linked Data Web!
Armin Haller
 

What's hot (20)

Introduction to the Semantic Web
Introduction to the Semantic WebIntroduction to the Semantic Web
Introduction to the Semantic Web
 
RDFa Introductory Course Session 4/4 When RDFa
RDFa Introductory Course Session 4/4 When RDFaRDFa Introductory Course Session 4/4 When RDFa
RDFa Introductory Course Session 4/4 When RDFa
 
Microformats Workshop (2009)
Microformats Workshop  (2009)Microformats Workshop  (2009)
Microformats Workshop (2009)
 
Linked Data In Action
Linked Data In ActionLinked Data In Action
Linked Data In Action
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
The Semantic Data Web, Sören Auer, University of Leipzig
The Semantic Data Web, Sören Auer, University of LeipzigThe Semantic Data Web, Sören Auer, University of Leipzig
The Semantic Data Web, Sören Auer, University of Leipzig
 
An introduction to Linked (Open) Data
An introduction to Linked (Open) DataAn introduction to Linked (Open) Data
An introduction to Linked (Open) Data
 
Wed roman tut_open_datapub
Wed roman tut_open_datapubWed roman tut_open_datapub
Wed roman tut_open_datapub
 
Semantic Web 2.0: Creating Social Semantic Information Spaces
Semantic Web 2.0: Creating Social Semantic Information SpacesSemantic Web 2.0: Creating Social Semantic Information Spaces
Semantic Web 2.0: Creating Social Semantic Information Spaces
 
Semantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic DataSemantic Technologies: Representing Semantic Data
Semantic Technologies: Representing Semantic Data
 
Linked Data in Libraries
Linked Data in LibrariesLinked Data in Libraries
Linked Data in Libraries
 
Linked data HHS 2015
Linked data HHS 2015Linked data HHS 2015
Linked data HHS 2015
 
Introduction to Linked Data
Introduction to Linked DataIntroduction to Linked Data
Introduction to Linked Data
 
How links can make your open data even greater
How links can make your open data even greaterHow links can make your open data even greater
How links can make your open data even greater
 
Search Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited LectureSearch Joins with the Web - ICDT2014 Invited Lecture
Search Joins with the Web - ICDT2014 Invited Lecture
 
The Linked Data Lifecycle
The Linked Data LifecycleThe Linked Data Lifecycle
The Linked Data Lifecycle
 
euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)euclid_linkedup WWW tutorial (Besnik Fetahu)
euclid_linkedup WWW tutorial (Besnik Fetahu)
 
Get on the Linked Data Web!
Get on the Linked Data Web!Get on the Linked Data Web!
Get on the Linked Data Web!
 
Jarrar: Linked Data
Jarrar: Linked DataJarrar: Linked Data
Jarrar: Linked Data
 
Library Linked Data and the Future of Bibliographic Control
Library Linked Data and the Future of Bibliographic ControlLibrary Linked Data and the Future of Bibliographic Control
Library Linked Data and the Future of Bibliographic Control
 

Viewers also liked

Information extraction systems aspects and characteristics
Information extraction systems  aspects and characteristicsInformation extraction systems  aspects and characteristics
Information extraction systems aspects and characteristics
George Ang
 
Adaptive information extraction
Adaptive information extractionAdaptive information extraction
Adaptive information extraction
unyil96
 
Information Extraction from the Web - In today's web
Information Extraction from the Web - In today's webInformation Extraction from the Web - In today's web
Information Extraction from the Web - In today's web
Benjamin Habegger
 

Viewers also liked (9)

Information extraction systems aspects and characteristics
Information extraction systems  aspects and characteristicsInformation extraction systems  aspects and characteristics
Information extraction systems aspects and characteristics
 
Adaptive information extraction
Adaptive information extractionAdaptive information extraction
Adaptive information extraction
 
Information Extraction from the Web - In today's web
Information Extraction from the Web - In today's webInformation Extraction from the Web - In today's web
Information Extraction from the Web - In today's web
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Open Calais
Open CalaisOpen Calais
Open Calais
 
Web Scale Information Extraction tutorial ecml2013
Web Scale Information Extraction tutorial ecml2013Web Scale Information Extraction tutorial ecml2013
Web Scale Information Extraction tutorial ecml2013
 
Predicting Online Community Churners using Gaussian Sequences
Predicting Online Community Churners using Gaussian SequencesPredicting Online Community Churners using Gaussian Sequences
Predicting Online Community Churners using Gaussian Sequences
 
Social Computing Research with Apache Spark
Social Computing Research with Apache SparkSocial Computing Research with Apache Spark
Social Computing Research with Apache Spark
 
Comparing Ontotext KIM and Apache Stanbol
Comparing Ontotext KIM and Apache StanbolComparing Ontotext KIM and Apache Stanbol
Comparing Ontotext KIM and Apache Stanbol
 

Similar to Data.dcs: Converting Legacy Data into Linked Data

15. session 15 data binding
15. session 15   data binding15. session 15   data binding
15. session 15 data binding
Phúc Đỗ
 
15. session 15 data binding
15. session 15   data binding15. session 15   data binding
15. session 15 data binding
Phúc Đỗ
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data intro
vafopoulos
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked data
vafopoulos
 
Search Engines After The Semanatic Web
Search Engines After The Semanatic WebSearch Engines After The Semanatic Web
Search Engines After The Semanatic Web
samar_slideshare
 
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
cscpconf
 
A semantic based approach for information retrieval from html documents using...
A semantic based approach for information retrieval from html documents using...A semantic based approach for information retrieval from html documents using...
A semantic based approach for information retrieval from html documents using...
csandit
 

Similar to Data.dcs: Converting Legacy Data into Linked Data (20)

An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...An Implementation of a New Framework for Automatic Generation of Ontology and...
An Implementation of a New Framework for Automatic Generation of Ontology and...
 
15. session 15 data binding
15. session 15   data binding15. session 15   data binding
15. session 15 data binding
 
15. session 15 data binding
15. session 15   data binding15. session 15   data binding
15. session 15 data binding
 
2011 05-02 linked data intro
2011 05-02 linked data intro2011 05-02 linked data intro
2011 05-02 linked data intro
 
2011 05-01 linked data
2011 05-01 linked data2011 05-01 linked data
2011 05-01 linked data
 
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
 
Linked Data Tutorial
Linked Data TutorialLinked Data Tutorial
Linked Data Tutorial
 
IRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description FrameworkIRJET- Data Retrieval using Master Resource Description Framework
IRJET- Data Retrieval using Master Resource Description Framework
 
Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011 Linked Data and Locah, UKSG2011
Linked Data and Locah, UKSG2011
 
Gt ea2009
Gt ea2009Gt ea2009
Gt ea2009
 
The Social Data Web
The Social Data WebThe Social Data Web
The Social Data Web
 
Linked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIGLinked Data Overview - AGI Technical SIG
Linked Data Overview - AGI Technical SIG
 
Search Engines After The Semanatic Web
Search Engines After The Semanatic WebSearch Engines After The Semanatic Web
Search Engines After The Semanatic Web
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
20100614 ISWSA Keynote
20100614 ISWSA Keynote20100614 ISWSA Keynote
20100614 ISWSA Keynote
 
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
A SEMANTIC BASED APPROACH FOR INFORMATION RETRIEVAL FROM HTML DOCUMENTS USING...
 
A semantic based approach for information retrieval from html documents using...
A semantic based approach for information retrieval from html documents using...A semantic based approach for information retrieval from html documents using...
A semantic based approach for information retrieval from html documents using...
 
Information Intermediaries
Information IntermediariesInformation Intermediaries
Information Intermediaries
 
Data Portability with SIOC and FOAF
Data Portability with SIOC and FOAFData Portability with SIOC and FOAF
Data Portability with SIOC and FOAF
 
Data Integration And Visualization
Data Integration And VisualizationData Integration And Visualization
Data Integration And Visualization
 

More from Matthew Rowe

Existing Research and Future Research Agenda
Existing Research and Future Research AgendaExisting Research and Future Research Agenda
Existing Research and Future Research Agenda
Matthew Rowe
 
Modelling and Analysis of User Behaviour in Online Communities
Modelling and Analysis of User Behaviour in Online CommunitiesModelling and Analysis of User Behaviour in Online Communities
Modelling and Analysis of User Behaviour in Online Communities
Matthew Rowe
 
Anticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community ForumsAnticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community Forums
Matthew Rowe
 
Forecasting Audience Increase on Youtube
Forecasting Audience Increase on YoutubeForecasting Audience Increase on Youtube
Forecasting Audience Increase on Youtube
Matthew Rowe
 
Predicting Discussions on the Social Semantic Web
Predicting Discussions on the Social Semantic WebPredicting Discussions on the Social Semantic Web
Predicting Discussions on the Social Semantic Web
Matthew Rowe
 
PhD Viva - Disambiguating Identity Web References using Social Data
PhD Viva - Disambiguating Identity Web References using Social DataPhD Viva - Disambiguating Identity Web References using Social Data
PhD Viva - Disambiguating Identity Web References using Social Data
Matthew Rowe
 

More from Matthew Rowe (20)

Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
Transferring Semantic Categories with Vertex Kernels: Recommendations with Se...
 
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting RatingsSemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
SemanticSVD++: Incorporating Semantic Taste Evolution for Predicting Ratings
 
The Semantic Evolution of Online Communities
The Semantic Evolution of Online CommunitiesThe Semantic Evolution of Online Communities
The Semantic Evolution of Online Communities
 
From Mining to Understanding: The Evolution of Social Web Users
From Mining to Understanding: The Evolution of Social Web UsersFrom Mining to Understanding: The Evolution of Social Web Users
From Mining to Understanding: The Evolution of Social Web Users
 
Mining User Lifecycles from Online Community Platforms and their Application ...
Mining User Lifecycles from Online Community Platforms and their Application ...Mining User Lifecycles from Online Community Platforms and their Application ...
Mining User Lifecycles from Online Community Platforms and their Application ...
 
From User Needs to Community Health: Mining User Behaviour to Analyse Online ...
From User Needs to Community Health: Mining User Behaviour to Analyse Online ...From User Needs to Community Health: Mining User Behaviour to Analyse Online ...
From User Needs to Community Health: Mining User Behaviour to Analyse Online ...
 
Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...
Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...
Changing with Time: Modelling and Detecting User Lifecycle Periods in Online ...
 
Identity: Physical, Cyber, Future
Identity: Physical, Cyber, FutureIdentity: Physical, Cyber, Future
Identity: Physical, Cyber, Future
 
Measuring the Topical Specificity of Online Communities
Measuring the Topical Specificity of Online CommunitiesMeasuring the Topical Specificity of Online Communities
Measuring the Topical Specificity of Online Communities
 
Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...
Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...
Who will follow whom? Exploiting Semantics for Link Prediction in Attention-I...
 
Attention Economics in Social Web Systems
Attention Economics in Social Web SystemsAttention Economics in Social Web Systems
Attention Economics in Social Web Systems
 
What makes communities tick? Community health analysis using role compositions
What makes communities tick? Community health analysis using role compositionsWhat makes communities tick? Community health analysis using role compositions
What makes communities tick? Community health analysis using role compositions
 
Existing Research and Future Research Agenda
Existing Research and Future Research AgendaExisting Research and Future Research Agenda
Existing Research and Future Research Agenda
 
Tutorial: Social Semantics
Tutorial: Social SemanticsTutorial: Social Semantics
Tutorial: Social Semantics
 
Modelling and Analysis of User Behaviour in Online Communities
Modelling and Analysis of User Behaviour in Online CommunitiesModelling and Analysis of User Behaviour in Online Communities
Modelling and Analysis of User Behaviour in Online Communities
 
Using Behaviour Analysis to Detect Cultural Aspects in Social Web Systems
Using Behaviour Analysis to Detect Cultural Aspects in Social Web SystemsUsing Behaviour Analysis to Detect Cultural Aspects in Social Web Systems
Using Behaviour Analysis to Detect Cultural Aspects in Social Web Systems
 
Anticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community ForumsAnticipating Discussion Activity on Community Forums
Anticipating Discussion Activity on Community Forums
 
Forecasting Audience Increase on Youtube
Forecasting Audience Increase on YoutubeForecasting Audience Increase on Youtube
Forecasting Audience Increase on Youtube
 
Predicting Discussions on the Social Semantic Web
Predicting Discussions on the Social Semantic WebPredicting Discussions on the Social Semantic Web
Predicting Discussions on the Social Semantic Web
 
PhD Viva - Disambiguating Identity Web References using Social Data
PhD Viva - Disambiguating Identity Web References using Social DataPhD Viva - Disambiguating Identity Web References using Social Data
PhD Viva - Disambiguating Identity Web References using Social Data
 

Recently uploaded

Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
Chris Hunter
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 

Recently uploaded (20)

Advance Mobile Application Development class 07
Advance Mobile Application Development class 07Advance Mobile Application Development class 07
Advance Mobile Application Development class 07
 
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptxSOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
SOCIAL AND HISTORICAL CONTEXT - LFTVD.pptx
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docx
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
fourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writingfourth grading exam for kindergarten in writing
fourth grading exam for kindergarten in writing
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
microwave assisted reaction. General introduction
microwave assisted reaction. General introductionmicrowave assisted reaction. General introduction
microwave assisted reaction. General introduction
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdf
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..Sports & Fitness Value Added Course FY..
Sports & Fitness Value Added Course FY..
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 

Data.dcs: Converting Legacy Data into Linked Data

  • 1. Data.dcs: Converting Legacy Data into Linked Data Matthew Rowe Organisations, Information and Knowledge Group University of Sheffield
  • 2. Outline Problem Legacy data contained within the Department of Computer Science Motivation Why produce linked data? Converting Legacy Data into Linked Data: Triplification of Legacy Data Coreference Resolution Linking into the Web of Linked Data Deployment Conclusions
  • 3.
  • 4. Motivation Leveraging legacy data from the DCS in a machine-readable and consistent form would allow related information to be linked together People would be linked with their publications Research groups would be linked to their members Co-authors of papers could be found Linking DCS data into the Web of Linked Data would allow additional information to be provided: Listing conferences which DCS members have attended Provide up-to-date publication listings Via external linked datasets The DCS web site use case is indicative of the makeup of the WWW overall: Organisations provide legacy data which is intended for human consumption HTML documents are provided which lack any explicit machine-readable semantics (Mika et al, 2009) presented crawling logs from 2008-2009 where only 0.6% of web documents contained RDFa and 2% contained the hCardmicroformat
  • 5. Converting Legacy Data into Linked Data The approach is divided into 3 different stages: Triplification Converting legacy data into RDF triples Coreference Resolution Identifying coreferring entities into the RDF dataset Linked to the Web of Linked Data Querying and discovering links into the Linked Data cloud
  • 6. Triplification of Legacy Data The DCS publication database provides publication listings as XML However, all publication information is contained within the same <description> element (title, author, year, book title): <description> <![CDATA[Rowe, M. (2009). Interlinking Distributed Social Graphs. In <i>Proceedings of Linked Data on the Web Workshop, WWW 2009, Madrid, Spain. (2009)</i>. Madrid, Madrid, Spain.<br> <br>Edited by Sarah Duffy on Tue, 08 Dec 2009 09:31:30 +0000.]]> </description> DCS web site provides person information and research group listings in HTML documents Information is devoid of markup and is provided in a heterogeneous formats Extracting person information from HTML documents requires the derivation of context windows Portions of a given HTML document which contain person information (e.g. a person’s name together with their email address and homepage)
  • 8. Triplification of Legacy Data Context windows are generated by identifying portions of a HTML document which contain a person’s name The structure of the HTML DOM is then used to partition the window such that it contains information about a single person HTML markup provides clues as to the segmentation of legacy data within the document Once a name is identified a set of algorithms moves up the DOM tree until a layout element is discovered (e.g. <div>, <li>, <tr>) Provided with a set of context windows, legacy data is then extracted from the window using Hidden Markov Models (HMMs) Context windows are extracted from HTML documents for DCS members and are provided from the publication listings XML HMMs: Input: sequence of tokens from the context window Output: most likely state labels for each of the tokens States correspond to information which is to be extracted
  • 9. Triplification of Legacy Data An RDF dataset is built from the extracted legacy data This provides the sourcedataset from which a linked dataset is built For person information triples are formed as follows: <http://data.dcs.shef.ac.uk/person/12025> rdf:typefoaf:Person ; foaf:name "Matthew Rowe" . <http://www.dcs.shef.ac.uk/~mrowe/publications.html> foaf:topic <http://data.dcs.shef.ac.uk/person/12025> For publication information triples are formed as follows: <http://data.dcs.shef.ac.uk/paper/239> rdf:typebib:Entry ; bib:title "Interlinking Distributed Social Graphs." ; bib:hasYear "2009" ; bib:hasBookTitle "Proceedings of Linked Data on the Web Workshop, WWW, Madrid, Spain." ; foaf:maker _:a1 . _:a1 rdf:typefoaf:Person ; foaf:name "Matthew Rowe"
  • 10. Coreference Resolution The triplification of legacy data contained within the DCS web sites (from ~12,000 HTML documents) produced 17,896 instances of foaf:Person and 1,088 instances of bib:Entry Contains many equivalent foaf:Person instances Must also assign people to their publications We create information about each research group manually to relate DCS members with their research groups: <http://data.dcs.shef.ac.uk/group/oak> rdf:typefoaf:Group ; foaf:name "Organisations, Information and Knowledge Group" ; foaf:workplaceHomepage <http://oak.dcs.shef.ac.uk> Querying the source dataset for all the people to appear on each of the group personnel listings pages provides the members of each group This provides a list of people who are members of the DCS For each person equivalent instances are found by Graph smushing: detecting equivalent instances which have the same email address or homepage Co-occurrence queries: detecting equivalent instances by matching coworkers who appear in the same page People are assigned to their publications using queries containing an abbreviated form of their name Applicable in this context due to the localised nature of the dataset
  • 12. Linking to the Web of Linked Data Our dataset at this stage in the approach is not linked data All links are internal to the dataset To overcome the burden of researchers updating their publications we query the DBLP linked dataset using a Networked Graph SPARQL query: The query detects authored research papers in DBLP based on coauthorship with co-workers The CONSTRUCT clause creates a triple relating DCS members with their papers in DBLP Using foaf:made When the Data.dcs member is later looked up, a follow-your-nose strategy can be used to get all papers from DBLP Can also look up conferences attended, countries visited, etc CONSTRUCT { ?qfoaf:made ?paper . ?pfoaf:made ?paper } WHERE { ?group foaf:member ?q . ?group foaf:member ?p . ?qfoaf:name ?n . ?pfoaf:name ?c . GRAPH <http://www4.wiwiss.fu-berlin.de/dblp/> { ?paper dc:creator ?x . ?xfoaf:name ?n . ?paper dc:creator ?y . ?yfoaf:name ?c . } FILTER (?p != ?q) } <http://data.dcs.shef.ac.uk/person/Fabio-Ciravegna> foaf:made <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/icml/IresonCCFKL05> ; foaf:made <http://www4.wiwiss.fu-berlin.de/dblp/resource/record/conf/ijcai/BrewsterCW01>
  • 13. Deployment Data.dcs is now up and running and can be accessed at the following URL: http://data.dcs.shef.ac.uk (please try it!) The data is deployed using Recipe 1 from “How to Publish Linked Data” http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/ Recipe 2 for Slash Namespaces from “Best Practices for Publishing RDF Vocabularies” http://www.w3.org/TR/swbp-vocab-pub/ Viewing <http://data.dcs.shef.ac.uk/group/oak> using OpenLinks’sURIBurner
  • 14. Conclusions Leveraging legacy data requires information extraction components able to handle heterogeneous formats Hidden Markov Models provide a single solution to this problem, however other methods exist which could be explored Presented methods are applicable to other domains, simply requires a different topology and training Current methods for Linked DCS Data into the Web of Linked Data are conservative: Bespoke SPARQL queries Future work will include the exploration of machine learning classification techniques to perform URI disambiguation This work is now being used as a blueprint for producing linked data from the entire University of Sheffield Covering different faculties and departments Intended to foster and identify possible collaborations across work areas
  • 15. Twitter: @mattroweshow Web: http://www.dcs.shef.ac.uk/~mrowe Email: m.rowe@dcs.shef.ac.uk Questions? (Mika et al, 2009) - Peter Mika, Edgar Meij, and Hugo Zaragoza. Investigating the semantic gap through query log analysis. In 8th International Semantic Web Conference (ISWC2009), October 2009.