Introduction to LDL 2012
 


see http://ldl2012.lod2.eu


Presentation Transcript

  • Linked Data in Linguistics: Representing and Connecting Language Data and Language Metadata. Sebastian Hellmann, Christian Chiarcos, Sebastian Nordhoff. 34th Annual Meeting of the German Linguistic Society (DGfS), AG 2, Frankfurt/M., Germany, March 7th – 9th, 2012. If not otherwise noted, content is cc-by.
  • Overview: Technological Background (SH); Linked Open Data and Collaborative Research (SH); Linked Data for Linguistics (CC); Building a Linguistic Linked Open Data Cloud; Prospects of Linked Data in Linguistics (CC); Annotated Corpora (CC); Lexical-Semantic Resource (SH); Linguistic Databases (SN); What to Expect from LDL-2012. [Slide footer: March 7th, 2012, Linked Data in Linguistics 2012]
  • From Excel to RDF and Linked Data
  • A data collection about sailing ships. Source: http://en.wikipedia.org/wiki/File:Bounty_modified_photo.jpg
  • Add the Gorch Fock. Source: http://en.wikipedia.org/wiki/File:Gorch_Fock_unter_Segeln_Kieler_Foerde_2006.jpg
  • Add the auxiliary propulsion of the Gorch Fock. The field is now irregular.
  • A first empty field is introduced.
  • Entity Attribute Value: data represented in triples.
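The step from a sparse table to Entity Attribute Value triples can be sketched in a few lines of Python. This is only an illustration: the table on the slide is not preserved in the transcript, so the column names and cell values below are assumptions.

```python
# Hypothetical sample table; column names and values are illustrative only.
rows = [
    {"name": "Bounty", "type": "SailingShip", "auxiliary_propulsion": None},
    {"name": "Gorch Fock", "type": "SailingShip", "auxiliary_propulsion": "Diesel"},
]


def table_to_triples(rows, key="name"):
    """Turn every non-empty cell into an (entity, attribute, value) triple."""
    triples = []
    for row in rows:
        entity = row[key]
        for attribute, value in row.items():
            if attribute == key or value is None:
                continue  # an empty cell simply produces no triple
            triples.append((entity, attribute, value))
    return triples


triples = table_to_triples(rows)
for triple in triples:
    print(triple)
```

The empty field of the table disappears: an entity without a value for an attribute just has no triple for it.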
  • XML also does not produce sparsity or anomalies, but what about: 1. Automatically infer rows (reduces size); 2. Check consistency (not validity); 3. Merge two tables (not only syntactically, but semantically); 4. Enrich with external data (also retrieve updates); 5. Query
  • Description Logic (DL) is a family of formal knowledge representation languages: fragments of first-order logic; usually decidable inference problems; well-researched complexity; basis for the Web Ontology Language (OWL); reasoner implementations available. Reference: Franz Baader, Ian Horrocks, and Ulrike Sattler. Chapter 3: Description Logics. In Frank van Harmelen, Vladimir Lifschitz, and Bruce Porter, editors, Handbook of Knowledge Representation. Elsevier, 2007.
  • Description Logic inference
  • Description Logic constraints: possible to detect inconsistencies, e.g. Gorch Fock must not be a SailingShip
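A toy illustration of inference and consistency checking, in Python. It is not a Description Logic reasoner; it only closes class membership under subclass axioms and tests disjointness. The axioms on the slides are not preserved in the transcript, so the three axioms below (and the class name MotorShip) are assumptions chosen to mirror the sailing ship example.

```python
# Assumed axioms, for illustration only.
SUBCLASS_OF = {"SailingShip": "Ship", "MotorShip": "Ship"}
DISJOINT = [("SailingShip", "MotorShip")]


def inferred_types(asserted):
    """Close a set of asserted classes under the subclass axioms."""
    types = set(asserted)
    frontier = list(asserted)
    while frontier:
        parent = SUBCLASS_OF.get(frontier.pop())
        if parent is not None and parent not in types:
            types.add(parent)
            frontier.append(parent)
    return types


def violations(types):
    """Return every disjointness axiom that the given class set breaks."""
    return [(a, b) for a, b in DISJOINT if a in types and b in types]


bounty = inferred_types({"SailingShip"})
gorch_fock = inferred_types({"SailingShip", "MotorShip"})
print(sorted(bounty))
print(violations(gorch_fock))
```

The first call shows inference (membership in Ship is never stated, only derived); the second shows how a constraint exposes an inconsistent description.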
  • Uniform Resource Identifiers (URIs): Agree on a common vocabulary and names for entities. On the schema level, coherence of properties and types is required for data integration. URIs allow for globally unique identifiers: “Gorch Fock” vs. http://en.wikipedia.org/wiki/Gorch_Fock_(1958) vs. http://dbpedia.org/resource/Gorch_Fock_(1958), abbreviated dbpedia:Gorch_Fock_(1958)
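The relation between the full URI and the prefixed form dbpedia:Gorch_Fock_(1958) can be sketched as a simple prefix table in Python. The dbpedia namespace is the one shown on the slide; the `my` namespace URI is a hypothetical placeholder.

```python
PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",  # namespace from the slide
    "my": "http://example.org/my/",             # hypothetical local namespace
}


def expand(prefixed_name):
    """Expand a prefixed name such as dbpedia:X into a full URI."""
    prefix, local = prefixed_name.split(":", 1)
    return PREFIXES[prefix] + local


def shorten(uri):
    """Abbreviate a URI if it falls into a known namespace."""
    for prefix, namespace in PREFIXES.items():
        if uri.startswith(namespace):
            return prefix + ":" + uri[len(namespace):]
    return uri  # unknown namespace: leave the URI untouched


full = expand("dbpedia:Gorch_Fock_(1958)")
print(full)
print(shorten(full))
```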
  • Last table before we get more technical: 4 types of object
  • [Figure: RDF graph. my:Gorch_Fock owl:sameAs dbpedia:Gorch_Fock_(1958), with a further owl:sameAs link to other datasets; my:Gorch_Fock my:owner my:German_Navy; my:German_Navy owl:sameAs dbpedia:German_Navy; the graph also carries dbprop:shipLength “81.2”^^xsd:double and more data.]
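How owl:sameAs links allow two descriptions to be merged can be sketched in Python with a small union-find over identifiers. The triples follow the figure as far as it can be recovered; which node carries dbprop:shipLength is an assumption, and choosing the alphabetically smallest name as the canonical one is an arbitrary illustration choice.

```python
def merge_same_as(triples):
    """Rewrite triples so that all owl:sameAs-equivalent names share one identifier."""
    parent = {}

    def find(name):
        parent.setdefault(name, name)
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for s, p, o in triples:
        if p == "owl:sameAs":
            root_s, root_o = find(s), find(o)
            if root_s != root_o:
                # arbitrary choice: the alphabetically smaller name wins
                parent[max(root_s, root_o)] = min(root_s, root_o)

    return {(find(s), p, find(o)) for s, p, o in triples if p != "owl:sameAs"}


triples = [
    ("my:Gorch_Fock", "owl:sameAs", "dbpedia:Gorch_Fock_(1958)"),
    ("my:German_Navy", "owl:sameAs", "dbpedia:German_Navy"),
    ("my:Gorch_Fock", "my:owner", "my:German_Navy"),
    ("dbpedia:Gorch_Fock_(1958)", "dbprop:shipLength", "81.2"),  # assumed subject
]
merged = merge_same_as(triples)
for triple in sorted(merged):
    print(triple)
```

After merging, the owner from the local dataset and the ship length from DBpedia are attached to the same node.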
  • RDF and OWL recap. RDF (Resource Description Framework): Entity Attribute Value + URIs; triples; shared vocabularies; graphs. OWL (Web Ontology Language): based on Description Logic and extends RDF; OWL-DL; reasoning; consistency checks. Both are W3C standards.
  • Syntax training: Presenters will probably show you some code during the next days. On the next slide you will see some syntax examples.
  • Serialization: Turtle and XML
  • SPARQL: Ability to merge data and query it using the W3C standard SPARQL (SPARQL Protocol and RDF Query Language). SPARQL is the SQL of the Semantic Web.
    SELECT ?ship WHERE {
      ?ship rdf:type my:SailingShip .
      ?ship my:propulsion ?engine .
      ?engine my:fuelType my:Diesel .
      ?ship dbprop:shipLength ?length .
      FILTER (xsd:double(?length) >= 80.0)
    }
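What such a query does can be approximated in plain Python: each triple pattern is matched against every triple of the graph, and variable bindings are carried from pattern to pattern. This is a minimal sketch of basic graph pattern matching, not a SPARQL engine; the graph data (my:engine1, my:Bounty, the lengths) is made up for illustration.

```python
def match(pattern, triple, binding):
    """Extend a variable binding so that pattern equals triple, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if isinstance(term, str) and term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None  # variable already bound to something else
        elif term != value:
            return None      # constant does not match
    return new


def query(patterns, graph):
    """Evaluate a list of triple patterns as a conjunction."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in graph:
                candidate = match(pattern, triple, binding)
                if candidate is not None:
                    extended.append(candidate)
        bindings = extended
    return bindings


# Illustrative data only.
graph = [
    ("my:Gorch_Fock", "rdf:type", "my:SailingShip"),
    ("my:Gorch_Fock", "my:propulsion", "my:engine1"),
    ("my:engine1", "my:fuelType", "my:Diesel"),
    ("my:Gorch_Fock", "dbprop:shipLength", 81.2),
    ("my:Bounty", "rdf:type", "my:SailingShip"),
    ("my:Bounty", "dbprop:shipLength", 30.0),
]
patterns = [
    ("?ship", "rdf:type", "my:SailingShip"),
    ("?ship", "my:propulsion", "?engine"),
    ("?engine", "my:fuelType", "my:Diesel"),
    ("?ship", "dbprop:shipLength", "?length"),
]
ships = sorted({b["?ship"] for b in query(patterns, graph) if float(b["?length"]) >= 80.0})
print(ships)
```

The final comprehension plays the role of the FILTER clause.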
  • Linked Data
  • Linked Open Data cloud [diagram]. Source: http://lod-cloud.net
  • 4 Rules of Linked Data: 1. Use URIs as names for things. 2. Use HTTP URIs so that people can look up those names. 3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL). 4. Include links to other URIs, so that they can discover more things. Source: http://www.w3.org/DesignIssues/LinkedData.html
  • Linked Data - Content Negotiation: Different views for different data consumers: Browser
  • Linked Data - Content Negotiation: Different views for different data consumers: Applications
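The server side of content negotiation can be sketched in Python as a lookup on the HTTP Accept header. This is a deliberately simplified sketch: it ignores q-values and wildcards, and the file names in the table are hypothetical.

```python
# Hypothetical representations available for one resource.
REPRESENTATIONS = {
    "text/html": "page.html",            # view for browsers
    "text/turtle": "data.ttl",           # views for applications
    "application/rdf+xml": "data.rdf",
}


def negotiate(accept_header, default="text/html"):
    """Return the first media type in the Accept header that we can serve.

    Simplification: q-values and wildcards such as */* are ignored.
    """
    for item in accept_header.split(","):
        media_type = item.split(";")[0].strip()
        if media_type in REPRESENTATIONS:
            return media_type
    return default


print(negotiate("text/html,application/xhtml+xml"))
print(negotiate("application/rdf+xml, text/turtle;q=0.9"))
```

The same URI thus yields an HTML page for a browser and RDF for an application.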
  • Linked Data: A dataset is a set of RDF triples that is published, maintained or aggregated by a single provider. An RDF link is an RDF triple whose subject and object are described in different datasets. A linkset is a collection of such RDF links between two datasets.
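These three definitions can be made concrete with a small Python sketch. It approximates "described in a dataset" as "occurs as a subject in that dataset", which is an assumption made for illustration; the sample triples are made up.

```python
def described(dataset):
    """Resources a dataset describes, approximated as its set of subjects."""
    return {s for s, _, _ in dataset}


def linkset(source, target):
    """RDF links from source to target: the object is described only in target."""
    remote = described(target) - described(source)
    return {(s, p, o) for s, p, o in source if o in remote}


# Illustrative datasets from two providers.
my_data = {
    ("my:Gorch_Fock", "owl:sameAs", "dbpedia:Gorch_Fock_(1958)"),
    ("my:Gorch_Fock", "my:owner", "my:German_Navy"),
    ("my:German_Navy", "rdfs:label", "German Navy"),
}
dbpedia_data = {
    ("dbpedia:Gorch_Fock_(1958)", "dbprop:shipLength", "81.2"),
}
links = linkset(my_data, dbpedia_data)
print(links)
```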
  • Why go for the fifth star? Central Contractor Registration (CCR); Geonames. Source: http://webofdata.wordpress.com/2011/05/22/why-we-link/
  • Open licences allow republishing and reuse. Motivation for collaboration: high potential that invested efforts can be reused, i.e. data, links, vocabularies, schemas. (Effortful) feedback: users complement data, extend vocabularies and contribute changes. VoCamps for achieving coherence. Source: Chiarcos, Hellmann, Nordhoff, Towards a Linguistic Linked Open Data cloud: The Open Linguistics Working Group, Traitement Automatique des Langues, to appear.
  • Example DBpedia: Data is extracted from Wikipedia. Wikipedia just publishes the unstructured data. A small DBpedia team creates RDF. A community of stakeholders cleans the data and creates links. Estimate: 10-20% to consolidate community effort.
  • Scalability; Golden Hammer anti-pattern; adequacy
  • Linked Data for Linguistics
  • Linked Data for Linguistics: representation and modelling; structural interoperability; integrating distributed resources; conceptual interoperability; dynamic import
  • Representation and modelling; structural interoperability
  • Representation and modelling: Different linguistic subcommunities have developed representation standards, e.g., LMF: Lexical Markup Framework (Francopoulo et al. 2009) for lexical-semantic resources; GrAF: Graph Annotation Framework (Ide and Suderman 2007) for annotated corpora; based on labelled directed acyclic graphs (feature structures). RDF data model: labelled directed (multi-)graphs; uniform formalism for different resource types; sublanguages (e.g., RDFS, OWL) allow defining domain-specific vocabularies.
  • Structural interoperability: With different language resources represented in RDF, we can combine both sources of information freely: cross-resource queries with RDF query languages (e.g., SPARQL). Given a corpus with WordNet sense annotations (e.g., the Manually Annotated Sub-Corpus MASC) (Ide et al. 2010): “Retrieve all sentences that describe locations”, i.e., sentences containing a token annotated with a WordNet sense that is a hyponym of “location”. Difficult to realize with GrAF or LMF.
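The cross-resource query "retrieve all sentences that describe locations" can be sketched in Python over two toy resources: a sense-annotated corpus and a hypernym hierarchy. Both are invented for illustration and stand in for MASC and WordNet only in shape, not in content.

```python
# Toy lexical-semantic resource: sense -> direct hypernym (illustrative only).
HYPERNYM = {"city": "location", "capital": "city", "dog": "animal"}

# Toy corpus: sentence id -> list of (token, sense annotation or None).
CORPUS = {
    "s1": [("Berlin", "city"), ("is", None), ("big", None)],
    "s2": [("The", None), ("dog", "dog"), ("barks", None)],
}


def is_hyponym_of(sense, ancestor):
    """Walk up the hypernym chain and check whether ancestor is reached."""
    parent = HYPERNYM.get(sense)
    while parent is not None:
        if parent == ancestor:
            return True
        parent = HYPERNYM.get(parent)
    return False


def sentences_about(ancestor):
    """Sentences containing a token whose sense is a hyponym of ancestor."""
    return sorted(
        sentence_id
        for sentence_id, tokens in CORPUS.items()
        if any(sense is not None and is_hyponym_of(sense, ancestor) for _, sense in tokens)
    )


print(sentences_about("location"))
```

With both resources in one graph model, the join between corpus annotation and lexical hierarchy is a single traversal rather than a format conversion.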
  • Integrating distributed resources: SPARQL supports nested subqueries to run on different repositories. No physical integration of resources in a single database required. Easy to link to centralized repositories of reference terminology, etc.
  • Conceptual interoperability: Resources should specify which vocabulary (e.g., for annotation) they use and how it is defined, by reference to community-maintained terminology repositories, e.g., GOLD (Farrar and Langendoen 2010), ISOcat (Windhouwer and Wright @ LDL-2012). Can be used, e.g., for disambiguation: if a lexeme in a lexicon has a certain morphosyntactic categorization, we can retrieve all sentences from a corpus with corresponding annotations, e.g., land as a noun, but not as a verb.
  • Dynamic import: Linking resources implemented with URIs, which can be resolved on-the-fly to update and enrich data sets. For a token in a corpus, additional information can be aggregated from different repositories by resolving links (retrieving senses from a lexical-semantic repository or concepts from a terminology repository). If the information in the target resource was updated since the original annotation was performed, then the updates are available at query time. Inconsistencies can be avoided through versioning.
  • Ecosystem, infrastructure and community: RDF and related standards are maintained by an active and relatively large community. Different fields of application: libraries, GeoData, BioMed, ... Established W3C standard and technological infrastructure. Linguistically relevant resources already provided: lexical-semantic resources (e.g., WordNet). RDF facilitates distributed development, re-using data, and, indirectly, interdisciplinary cooperation.
  • Building a Linguistic Linked Open Data cloud
  • In LOD cloud: lexical-semantic resources; linguistic metadata. Further relevant types for linguistic research: annotated corpora; input and output of NLP tools; linguistic databases; repositories of linguistic terminology.
  • Each single provider has different incentives to use Linked Data and/or RDF. Concepts of RDF and Linked Data have been brought up to solve open problems in different subcommunities of linguistics and neighboring fields. As an illustration, we briefly introduce three examples.
  • Annotated corpora. Underlying problem: structural and conceptual interoperability. Natural Language Processing for the Semantic Web. Underlying problem: NLP output represented in idiosyncratic formalisms, results to be represented in RDF. Typological databases. Underlying problem: globally unique identifiers (not just for “languages”, but for dialects, language families, etc.)
  • Annotated corpora: Linked Data and Corpus Interoperability
  • Linked Data and Corpus Interoperability: Linked Data can be used to address interoperability issues of annotated corpora. Corpus: collection of texts developed to analyze language and to develop tools for this purpose => annotation. Different types of annotations, different communities involved, different languages => interoperability challenge. Structural interoperability: interoperable representation form. Conceptual interoperability: reference definitions for linguistic categories and features.
  • Structural interoperability: Analyses produced by different researchers / NLP tools use different representation formalisms: word annotations (‘tokens’); span annotations (‘markables’); tree-like annotations; relational annotations.
  • Structural interoperability, state-of-the-art approaches: graph-based data model; represent data in standoff XML (Ide and Suderman 2007, Chiarcos et al. 2008, Eckart et al. @ LDL); presentation of Nancy Ide @ LDL 2012
  • XML standoff: MASC corpus, GrAF format
  • Working with XML standoff: How to store, retrieve and query XML standoff data efficiently? Direct use with XML databases is inefficient (Eckart 2008); inline XML (e.g., Dipper et al. 2007); relational DB formats (e.g., Eckart et al. @ LDL). RDF as another possibility (e.g., Chiarcos 2012): databases are optimized for graph querying; extensive (open source) infrastructure available; conceptual interoperability; integration with Linked Data resources.
  • Corpus interoperability with RDF. Structural interoperability: e.g. POWLA (http://purl.org/powla), lossless transformation to RDF from standoff XML; linking to lexical-semantic resources (WordNet). Conceptual interoperability: cross-linking to terminology repositories (OLiA, GOLD, ISOcat); entity-linking to metadata (Geodata, LOD cloud).
  • Natural Language Processing Interchange Format (NIF)
  • NLP Interchange Format (NIF): NIF is an RDF/OWL-based format. Achieve interoperability for: output of NLP tools; linguistic data in RDF; text documents; Web of Data (LOD cloud).
  • A Transparent Formalization of Text for Machines
  • Opaque to machines
  • Universe of discourse is defined as the words over the alphabet of Unicode characters. URI: http://example.org/sample#offset_0_42 for “The city Berlin is the capital of Germany.”
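How an offset-based URI like the one on the slide addresses a piece of text can be sketched in Python. The sketch only mirrors the pattern visible on the slide (document URI, then `#offset_`, begin and end); the normative URI scheme is defined in the NIF 1.0 specification, and end-exclusive character offsets are an assumption that happens to be consistent with the 0_42 example for a 42-character sentence.

```python
def offset_uri(document_uri, begin, end):
    """Build a URI for the substring text[begin:end] of a document (assumed scheme)."""
    return f"{document_uri}#offset_{begin}_{end}"


text = "The city Berlin is the capital of Germany."
document = "http://example.org/sample"

# URI for the whole sentence, as on the slide.
whole = offset_uri(document, 0, len(text))

# URI for a single word inside the sentence.
begin = text.index("Berlin")
berlin = offset_uri(document, begin, begin + len("Berlin"))

print(whole)
print(berlin)
```

Every substring thereby gets its own identifier, so annotations from different tools can attach to exactly the same span.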
  • NLP Interchange Format: specification for NIF 1.0 (http://nlp2rdf.org/nif-1-0/); different implementations (alpha/beta) are available as open source (UIMA, GATE ANNIE, Stanford Parser, DBpedia Spotlight); mailing list available at http://nlp2rdf.org; demo: http://nlp2rdf.lod2.eu/demo.php; poster during the poster session, Thursday 13:00-14:30
  • Typological databases: Glottolog/Langdoc
  • Glottolog/Langdoc: two subprojects. Glottolog provides identifiers and additional information for 100k languoids (languages, dialects, families); main competitor projects: ISO 639-3/Ethnologue, Multitree. Langdoc provides identifiers and additional information for 180k references; main competitor project: OLAC.
  • Problems to address: existing identifiers are not granular enough (ISO 639-3: 7k); existing identifiers have unclear reference (Multitree altc refers to both Micro-Altaic and Macro-Altaic); existing identifiers have no verifiable empirical basis. Solutions: 21k identifiers for main tree; total of 104k identifiers for all nodes of Multitree trees.
  • RDF: glottolog:12345 gl:sublanguoid glottolog:41202 . glottolog:12345 gl:superlanguoid glottolog:94211 .
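A short Python sketch of how such triples can be used to walk up the languoid tree. It assumes that `X gl:sublanguoid Y` means "Y is a sublanguoid of X" and that `X gl:superlanguoid Y` means "Y is the superlanguoid of X"; the slide does not state the direction explicitly, and the identifiers are the sample ones shown there.

```python
triples = [
    ("glottolog:12345", "gl:sublanguoid", "glottolog:41202"),
    ("glottolog:12345", "gl:superlanguoid", "glottolog:94211"),
]


def parent_map(triples):
    """Map each languoid to its direct superlanguoid (direction is assumed)."""
    parent = {}
    for s, p, o in triples:
        if p == "gl:superlanguoid":
            parent[s] = o
        elif p == "gl:sublanguoid":
            parent[o] = s
    return parent


def ancestors(languoid, parent):
    """List all superlanguoids from the nearest one up to the root."""
    chain = []
    while languoid in parent:
        languoid = parent[languoid]
        chain.append(languoid)
    return chain


parents = parent_map(triples)
print(ancestors("glottolog:41202", parents))
```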
  • Langdoc: 180k references to literature treating (mostly) lesser-known languages; annotated for language, document type, macro-area; limited full text indexing. “Give me any grammar or grammar sketch from an Afro-Asiatic language spoken in Eurasia where the word dual occurs in the text”
  • RDF: glottolog:12345 gl:immediatelydescribedin langdoc:23456 .
  • Position of G/L in the LLOD cloud
  • Availability: XHTML: http://glottolog.livingsources.org; RDF: http://glottolog.livingsources.org/sparql
  • Outlook
  • From OWLG to DGfS: The Open Knowledge Foundation Working Group for Open Data in Linguistics (OKFN-OWLG) was founded in late 2010. We first established a series of meetings and a mailing list: build the structure, create momentum. Two workshops: OKCon 2011 in Berlin, this workshop. This afternoon: Christian Kreutz presents the OKFN.
  • Building the Linguistic Linked Data Cloud
  • This workshop: exploratory workshop. Chart domains as to the amount and kind of data which can be integrated into the LLOD cloud. Increase coverage: more domains. Increase density: more links between resources. Increase discussion between independent subcommunities.
  • Spread the word: http://linguistics.okfn.org/; open-linguistics@lists.okfn.org; poster at DGfS-CL session on Thursday. Start of this workshop, first talk: Declerck et al., “Towards Linked Language Data (LLD) for Digital Humanities”
  • We would like to thank: MPI, Springer, LOD2