Using the Web of Data for Information Extraction

Talk at Insiders Technologies, 21.01.2010. It covers publishing RDF data with D2R Server, linking the data to get Linked Data, querying the data with SPARQL via SQUIN, and finally annotating text with this data by using RDFa in Epiphany.

Published in: Education, Technology

  1. Insiders, January 2010. Using the Web of Data for Information Extraction. Keywords: scoobie, sparql, rdfa, D2R server, rdf, squin, epiphany, Linked Data, OBIE. Benjamin Adrian, http://www.dfki.uni-kl.de/~adrian
  2. Are you still surfing ...
  3. … or overloaded?
  4. A simple question ... What are the cities of the universities in Rhineland-Palatinate, and what is the unemployment rate of these cities?
  5. A simple question ... What are the cities of the universities in Rhineland-Palatinate, and what is the unemployment rate of these cities?
       PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
       PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
       PREFIX owl: <http://www.w3.org/2002/07/owl#>
       PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
       PREFIX eurostat: <http://www4.wiwiss.fu-berlin.de/eurostat/resource/eurostat/>
       PREFIX dbpedia: <http://dbpedia.org/ontology/>
       PREFIX dbpedia_cat: <http://dbpedia.org/resource/Category:>
       SELECT ?dbpcity ?cityName ?ur WHERE {
         ?uni skos:subject dbpedia_cat:Universities_and_colleges_in_Rhineland-Palatinate ;
              dbpedia:city ?dbpcity .
         ?dbpcity owl:sameAs ?statcity .
         ?statcity rdfs:label ?cityName ;
                   eurostat:unemployment_rate_total ?ur .
       }
     SPARQL specification: http://www.w3.org/TR/rdf-sparql-query/
  6. … and its answer.
       dbpcity                               cityName   ur
       http://dbpedia.org/resource/Koblenz   Koblenz    8.8
       http://dbpedia.org/resource/Trier     Trier      7.3
     Data sources: http://epp.eurostat.ec.europa.eu, http://wiki.dbpedia.org, http://www4.wiwiss.fu-berlin.de/eurostat/
     Query engine: SQUIN - Query the Web of Linked Data, http://squin.sourceforge.net/
  7. So much data out there, too much?
  8. What data do you have?
  9. Are you still surfing ...
  10. Agenda. To use the Web of Data for information extraction, you have to understand its basics:
       ● RDF on one slide
       ● Publish data in RDF with D2R Server
       ● Publish RDF as Linked Data
       ● Query Linked Data with SPARQL and Squin
       ● Use RDF for information extraction
       ● Bring Linked Data to text via RDFa
  11. Wouldn't this be nice. [Diagram: Data]
  12. Wouldn't this be nice. [Diagram: Data; Text; User-defined Filter; Extraction Pipeline; Extraction Results; edge label: enrich]
  13. Wouldn't this be nice. [Diagram: Data; Text; annotated text; User-defined Filter; Extraction Pipeline; Extraction Results; edge labels: annotate, enrich]
  14. Wouldn't this be nice. [Diagram: Data; Text; annotated text; User-defined Filter; Extraction Pipeline; Extraction Results; edge labels: annotate, populate, enrich]
  15. RDF on one slide
       @prefix dblp_author: <http://dblp.l3s.de/d2r/page/authors/> .
       @prefix foaf: <http://xmlns.com/foaf/0.1/> .
       @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
       @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
       @prefix dc: <http://purl.org/dc/terms/> .
       @prefix owl: <http://www.w3.org/2002/07/owl#> .
       @prefix acm: <http://acm.rkbexplorer.com/description/> .

       dblp_author:Michael_Gillmann
           foaf:name "Michael Gillmann" ;
           rdfs:seeAlso <http://www.bibsonomy.org/uri/author/Michael+Gillmann> ;
           rdf:type foaf:Agent ;
           owl:sameAs acm:person-197117-81d3fccbfd0249fc33c0d00f03a30af4 ;
           foaf:isMakerOf <http://dblp.l3s.de/d2r/resource/publications/dblp_pub:conf/icdar/SchulzEGAAD09> .

       <http://dblp.l3s.de/d2r/resource/publications/dblp_pub:conf/icdar/SchulzEGAAD09>
           dc:creator dblp_author:Michael_Gillmann ;
           dc:creator dblp_author:Markus_Ebbecke ;
           dc:title "Seizing the Treasure: Transferring Knowledge in Invoice Analysis" .
     * From: http://sig.ma/entity/ddcb76b935e91940e5508a460619a2ac.rdf
  16. RDF on one slide (the Turtle example from slide 15 again, with this part highlighted: Vocabularies)
  17. RDF on one slide (the Turtle example from slide 15 again, with this part highlighted: URLs / URIs)
  18. RDF on one slide (the Turtle example from slide 15 again, with this part highlighted: Subjects)
  19. RDF on one slide (the Turtle example from slide 15 again, with this part highlighted: Predicates)
  20. RDF on one slide (the Turtle example from slide 15 again, with this part highlighted: Objects)
  21. RDF data is graph data.
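
The statement that RDF data is graph data can be made concrete with a minimal sketch: each triple is a labelled edge from its subject to its object. The triples are abbreviated from the Turtle example on slide 15; the short name `dblp_pub:SchulzEGAAD09` and the helper names are assumptions made for this illustration only.

```python
from collections import defaultdict

# Triples abbreviated from the slide-15 example.
# "dblp_pub:SchulzEGAAD09" is a hypothetical shorthand for the publication URI.
TRIPLES = [
    ("dblp_author:Michael_Gillmann", "foaf:name", '"Michael Gillmann"'),
    ("dblp_author:Michael_Gillmann", "rdf:type", "foaf:Agent"),
    ("dblp_author:Michael_Gillmann", "foaf:isMakerOf", "dblp_pub:SchulzEGAAD09"),
    ("dblp_pub:SchulzEGAAD09", "dc:creator", "dblp_author:Michael_Gillmann"),
    ("dblp_pub:SchulzEGAAD09", "dc:creator", "dblp_author:Markus_Ebbecke"),
    ("dblp_pub:SchulzEGAAD09", "dc:title",
     '"Seizing the Treasure: Transferring Knowledge in Invoice Analysis"'),
]


def build_graph(triples):
    """Return {subject: [(predicate, object), ...]}.

    Subjects and objects are the nodes; predicates label the directed edges.
    """
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return graph


def neighbours(graph, node):
    """All nodes reachable from `node` over one outgoing edge."""
    return {obj for _, obj in graph.get(node, [])}


graph = build_graph(TRIPLES)
print(neighbours(graph, "dblp_pub:SchulzEGAAD09"))
```

The author node points to the publication and the publication points back to the author, which is exactly the kind of cycle a table cannot express directly but a graph can.
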
  22. Publishing relational data in RDF
  23. Publishing relational data in RDF. D2R Server - Publishing Relational Databases on the Semantic Web, http://www4.wiwiss.fu-berlin.de/bizer/d2r-server/. Two small command line calls, first generating the mapping file and then starting the server with it:
       ./generate-mapping -o mydatabase.n3 -b http://projects.dfki.uni-kl.de/mydatabase/ jdbc:mysql://localhost:3306/mydatabase
       ./d2r-server -p 80 -b http://projects.dfki.uni-kl.de/mydatabase/ mydatabase.n3
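
The idea behind publishing a relational database as RDF can be sketched without D2R Server itself: every row becomes a subject URI and every non-key column becomes one triple. This is not D2R's implementation or its mapping language; the table, the `resource/` and `vocab/` URI layout, and the function name are assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical URI layout below the base URI used on the slide.
BASE = "http://projects.dfki.uni-kl.de/mydatabase/resource/"
VOCAB = "http://projects.dfki.uni-kl.de/mydatabase/vocab/"


def table_to_triples(conn, table, key):
    """Map each row to a subject URI and each non-key column to one triple.

    `table` is interpolated into the SQL text, so it must be a trusted name.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    triples = []
    for row in cursor:
        record = dict(zip(columns, row))
        subject = f"<{BASE}{table}/{record[key]}>"
        for column, value in record.items():
            if column == key or value is None:
                continue  # the key is already part of the URI; NULLs yield no triple
            predicate = f"<{VOCAB}{table}_{column}>"
            triples.append((subject, predicate, f'"{value}"'))
    return triples


# An in-memory stand-in for the "Employees DB" of the talk (invented sample row).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO employees VALUES (1, 'Alice Example', 'Kaiserslautern')")

triples = table_to_triples(conn, "employees", "id")
for s, p, o in triples:
    print(s, p, o, ".")
```

A real mapping would also turn foreign keys into links between resources rather than literals, which is what makes the result usable as Linked Data.
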
  24. Linked Data: Linking RDF data from different sources. How to interlink these datasets? [Diagram: Customer DB; Employees DB; Project DB; DBpedia]
  25. Linked Data: Linking RDF data from different sources. Linked Data Principles (TimBL, 2006):
       1. Use URIs as names for things (e.g., http://dbpedia.org/resource/Berlin).
       2. Use HTTP URIs so that people can look up those names.
       3. Provide useful information in RDF when someone looks up a URI.
       4. Include links to other URIs to enable discovery of more information.
     Example:
       <http://dbpedia.org/resource/Berlin>
           owl:sameAs opencyc:en/CityOfBerlinGermany ;
           owl:sameAs opencyc:en/Berlin_StateGermany ;
           owl:sameAs <http://sws.geonames.org/2950159/> ;
           owl:sameAs <http://www4.wiwiss.fu-berlin.de/eurostat/resource/regions/Berlin> ;
           owl:sameAs freebase:http://dbpedia.org/resource/Berlin .
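
Because owl:sameAs is symmetric and transitive, links such as those in the Berlin example partition URIs into groups that name the same thing. A minimal sketch of computing these groups with a union-find structure follows; the function name and the two `example.org` URIs are assumptions added only to show a second, separate group.

```python
def same_as_clusters(links):
    """Group URIs that are connected by owl:sameAs links (union-find)."""
    parent = {}

    def find(uri):
        parent.setdefault(uri, uri)
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path halving
            uri = parent[uri]
        return uri

    for left, right in links:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


LINKS = [
    ("http://dbpedia.org/resource/Berlin", "http://sws.geonames.org/2950159/"),
    ("http://dbpedia.org/resource/Berlin",
     "http://www4.wiwiss.fu-berlin.de/eurostat/resource/regions/Berlin"),
    # Hypothetical, unrelated pair to show that clusters stay separate.
    ("http://example.org/a", "http://example.org/b"),
]

clusters = same_as_clusters(LINKS)
for cluster in clusters:
    print(sorted(cluster))
```
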
  26. SPARQL: Querying RDF data. SPARQL is the RDF query language. In contrast to SQL, its data model is not set-oriented but graph-oriented. Some examples:
     Resulting in tuples:
       PREFIX foaf: <http://xmlns.com/foaf/0.1/>
       SELECT ?interest ?friend WHERE {
         <http://www.w3.org/People/Berners-Lee/card#i> foaf:knows ?friend .
         ?friend foaf:interest ?interest .
       }
     Resulting in a graph:
       PREFIX foaf: <http://xmlns.com/foaf/0.1/>
       CONSTRUCT { ?friend foaf:interest ?interest } WHERE {
         <http://www.w3.org/People/Berners-Lee/card#i> foaf:knows ?friend .
         ?friend foaf:interest ?interest .
       }
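
The difference between the two query forms can be sketched with a toy matcher for basic graph patterns: SELECT projects variable bindings into tuples, CONSTRUCT instantiates a triple template. This is an illustration of the idea, not a SPARQL engine; the `ex:` data and all function names are hypothetical.

```python
def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; None on mismatch."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Variable: bind it, or check it against an earlier binding.
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def evaluate(patterns, triples):
    """Join the matches of all triple patterns (a basic graph pattern)."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [
            extended
            for binding in bindings
            for triple in triples
            if (extended := match_pattern(pattern, triple, binding)) is not None
        ]
    return bindings


def select(variables, patterns, triples):
    """SELECT: one tuple of variable values per solution."""
    return [tuple(b[v] for v in variables) for b in evaluate(patterns, triples)]


def construct(template, patterns, triples):
    """CONSTRUCT: fill the triple template once per solution."""
    return {tuple(b.get(t, t) for t in template) for b in evaluate(patterns, triples)}


TIMBL = "<http://www.w3.org/People/Berners-Lee/card#i>"
DATA = [  # hypothetical sample data
    (TIMBL, "foaf:knows", "ex:alice"),
    ("ex:alice", "foaf:interest", "ex:rdf"),
    ("ex:bob", "foaf:interest", "ex:sql"),
]
WHERE = [(TIMBL, "foaf:knows", "?friend"), ("?friend", "foaf:interest", "?interest")]

rows = select(("?interest", "?friend"), WHERE, DATA)
graph = construct(("?friend", "foaf:interest", "?interest"), WHERE, DATA)
print(rows)
print(graph)
```
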
  27. SPARQL: Query Linked Data from different sources. How to access these datasets with a single SPARQL query? [Diagram: Customer DB; Employees DB; Project DB; DBpedia]
  28. SPARQL: Query Linked Data from different sources. Squin: Query the Web of Linked Data, http://squin.sourceforge.net/. Squin follows a link traversal approach over HTTP URIs. Remember:
       SELECT DISTINCT ?c ?cityName ?ur WHERE {
         ?u skos:subject dbpedia_cat:Universities_and_colleges_in_Rhineland-Palatinate ;
            dbpedia:city ?c .
         ?c owl:sameAs [ rdfs:label ?cityName ;
                         eurostat:unemployment_rate_total ?ur ] .
       }
     [Diagram: Customer DB; Employees DB; Project DB; DBpedia; four D2R Server boxes; SQUIN]
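
A link traversal approach can be sketched as a loop: look up a URI, add the returned triples to the local dataset, evaluate the query patterns on what is known so far, and look up every URI that appears in an intermediate solution. The sketch below is not SQUIN's implementation; the `WEB` dictionary stands in for HTTP lookups, and the abbreviated URIs and function names are assumptions (the Trier value 7.3 is taken from slide 6).

```python
# Hypothetical mini "Web": looking up a URI returns the triples published there.
WEB = {
    "ex:uni": [("ex:uni", "dbpedia:city", "dbp:Trier")],
    "dbp:Trier": [("dbp:Trier", "owl:sameAs", "stat:Trier")],
    "stat:Trier": [
        ("stat:Trier", "rdfs:label", "Trier"),
        ("stat:Trier", "eurostat:unemployment_rate_total", "7.3"),
    ],
}


def dereference(uri):
    """Stand-in for an HTTP lookup; unknown names return no data."""
    return WEB.get(uri, [])


def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; None on mismatch."""
    extended = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if extended.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return extended


def join_prefixes(patterns, triples):
    """Solutions for each prefix of `patterns`; the last entry is the full result."""
    stages, bindings = [], [{}]
    for pattern in patterns:
        bindings = [e for b in bindings for t in triples
                    if (e := match(pattern, t, b)) is not None]
        stages.append(bindings)
    return stages


def link_traversal_query(seeds, patterns):
    known, seen, frontier = set(), set(), list(seeds)
    while frontier:
        uri = frontier.pop()
        if uri in seen:
            continue
        seen.add(uri)
        known.update(dereference(uri))
        # Every value bound in an intermediate solution becomes a lookup candidate.
        for stage in join_prefixes(patterns, known):
            for binding in stage:
                frontier.extend(v for v in binding.values() if v not in seen)
    return join_prefixes(patterns, known)[-1]


PATTERNS = [
    ("?u", "dbpedia:city", "?c"),
    ("?c", "owl:sameAs", "?s"),
    ("?s", "rdfs:label", "?cityName"),
    ("?s", "eurostat:unemployment_rate_total", "?ur"),
]
result = link_traversal_query(["ex:uni"], PATTERNS)
print(result)
```

The query never names the statistics source; it is discovered only because the city's owl:sameAs link is followed during evaluation.
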
  29. Using RDF and Linked Data for Information Extraction. [Diagram: User; Query; Linked Data; Text; Extraction Pipeline; Result Graph; edge labels: asks question, about, answers to]
  30. Using RDF and Linked Data for Information Extraction. What data do we have? Example RDF data:
       <http://dblp.l3s.de/d2r/resource/publications/dblp_pub:conf/icdar/SchulzEGAAD09>
           rdf:type foaf:Document ;
           dc:creator dblp_author:Markus_Ebbecke ;
           dc:title "Seizing the Treasure: Transferring Knowledge in Invoice Analysis" .
     What it contains:
       Classes:              foaf:Document, foaf:Person
       Instances:            .../SchulzEGAAD09, .../Markus_Ebbecke
       Datatype Properties:  dc:title, foaf:name, foaf:firstName, foaf:surName
       Object Properties:    dc:creator, foaf:knows
       Literals:             "Markus", "Ebbecke", "Seizing the Treasure: Transferring Knowledge in Invoice Analysis"
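
How an RDF data set breaks down into classes, instances, properties, and literals can be sketched in a few lines, together with a simple gazetteer that maps literal text back to instances. The three author triples are assumptions (the slide lists those terms but not these exact statements), and all function names are hypothetical.

```python
PUB = "<http://dblp.l3s.de/d2r/resource/publications/dblp_pub:conf/icdar/SchulzEGAAD09>"
AUTHOR = "dblp_author:Markus_Ebbecke"

TRIPLES = [
    (PUB, "rdf:type", "foaf:Document"),
    (PUB, "dc:creator", AUTHOR),
    (PUB, "dc:title", '"Seizing the Treasure: Transferring Knowledge in Invoice Analysis"'),
    # Assumed for illustration; not stated as triples on the slide.
    (AUTHOR, "rdf:type", "foaf:Person"),
    (AUTHOR, "foaf:firstName", '"Markus"'),
    (AUTHOR, "foaf:surName", '"Ebbecke"'),
]


def is_literal(term):
    return term.startswith('"')


def summarize(triples):
    """Sort the terms of a triple set into the five groups named on the slide."""
    summary = {key: set() for key in (
        "classes", "instances", "datatype_properties", "object_properties", "literals")}
    for subject, predicate, obj in triples:
        summary["instances"].add(subject)
        if predicate == "rdf:type":
            summary["classes"].add(obj)
        elif is_literal(obj):
            summary["datatype_properties"].add(predicate)
            summary["literals"].add(obj)
        else:
            summary["object_properties"].add(predicate)
            summary["instances"].add(obj)
    return summary


def build_gazetteer(triples):
    """Map the text of each literal to the instances it describes."""
    gazetteer = {}
    for subject, _, obj in triples:
        if is_literal(obj):
            gazetteer.setdefault(obj.strip('"'), set()).add(subject)
    return gazetteer


summary = summarize(TRIPLES)
gazetteer = build_gazetteer(TRIPLES)
print(summary["classes"], summary["object_properties"])
print(gazetteer["Ebbecke"])
```
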
  31. SCOOBIE: Domain Adaptation. [Diagram: Structured Data; Text Corpus; Patterns and Gazetteers; Vocabulary Data; Instance Data; Preprocessing & Learning (offline); Information Extraction (online)]
  32. SCOOBIE: Ecosystem. [Diagram: Index; Domain Knowledge; Models; Text Corpus; Training Corpus; Session Data; Instances; Ontology; Models; Patterns + Gazetteers; Preprocess; Train; Extract; Tasks; API]
  33. SCOOBIE: OBIE Pipeline
       Normalization: Text Extraction, Language Detection
       Segmentation: Tokenization, Sentence Extraction, POS-Tagging
       Symbolization: Named Entity Recognition, Structured Entity Recognition, Noun Phrase Chunking, Symbol Recognition
       Instantiation: Instance Recognition, Instance Disambiguation, Chunk Classification
       Contextualization: Fact Extraction, Fact Selection
       Population: Query Answering
  34. Used Machine Learning Models
       Semi-Supervised Learning:
         ● CRF-based Noun Phrase Chunker
       Supervised Learning:
         ● Gazetteer matching statistics (Named Entity Recognition)
         ● Regex matching statistics (Structured Entity Recognition)
       Unsupervised or Instance-based Learning:
         ● TF/IDF-based instance re-ranking (Instance Disambiguation)
         ● K-Nearest-Neighbor chunk classifier (Chunk Classification)
         ● Spreading Activation-based fact ranking (Fact Selection)
  35. Used Machine Learning: Conditional Random Field. CRFs are sequence taggers.
     Train it with:
       Bill CAPITALIZED noun
       slept LOWERCASE non-noun
       here LOWERCASE non-noun
     Test it with:
       He CAPITALIZED
       visited LOWERCASE
       London CAPITALIZED
     CRF results:
       noun
       non-noun
       non-noun
     MALLET - MAchine Learning for LanguagE Toolkit, http://mallet.cs.umass.edu/
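
The training and test lines shown for the sequence tagger follow a simple layout: one token per line, followed by its features and, for training data, its label. A minimal sketch that produces this layout is given below; it only prepares input and does not train or run a CRF. The function names are hypothetical, and the single shape feature mirrors the CAPITALIZED/LOWERCASE feature on the slide.

```python
def shape_feature(token):
    """The one feature used on the slide: does the token start with a capital?"""
    return "CAPITALIZED" if token[:1].isupper() else "LOWERCASE"


def to_tagger_lines(tokens, labels=None):
    """One line per token: token, feature, and the label if one is given."""
    lines = []
    for index, token in enumerate(tokens):
        parts = [token, shape_feature(token)]
        if labels is not None:
            parts.append(labels[index])
        lines.append(" ".join(parts))
    return lines


train = to_tagger_lines(["Bill", "slept", "here"], ["noun", "non-noun", "non-noun"])
test = to_tagger_lines(["He", "visited", "London"])
print("\n".join(train))
print("\n".join(test))
```

Richer features (suffixes, part-of-speech tags, gazetteer hits) would be added as further columns in the same way.
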
  36. Bringing Linked Data to Text. Annotate plain text or HTML with RDF data.
     Plain text:
       I'm working at DFKI.
     RDFa offers an HTML extension:
       I'm working at <span about="dbpedia:DFKI" property="rdfs:label">DFKI</span>
     Now let's generate RDFa automatically ...
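
Generating such RDFa automatically can be sketched as a dictionary lookup: every known surface form in the text is wrapped in a span that carries the resource as `about` and the property as `property`. This is a simplified illustration, not the Epiphany implementation; the function name and the one-entry gazetteer are assumptions.

```python
import html
import re


def annotate(text, gazetteer, prop="rdfs:label"):
    """Wrap each known surface form in an RDFa span; escape everything else."""
    if not gazetteer:
        return html.escape(text, quote=False)
    # Longest names first, so longer surface forms win over their prefixes.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    parts, position = [], 0
    for found in pattern.finditer(text):
        parts.append(html.escape(text[position:found.start()], quote=False))
        resource = html.escape(gazetteer[found.group(1)], quote=True)
        label = html.escape(found.group(1), quote=False)
        parts.append(f'<span about="{resource}" property="{prop}">{label}</span>')
        position = found.end()
    parts.append(html.escape(text[position:], quote=False))
    return "".join(parts)


GAZETTEER = {"DFKI": "dbpedia:DFKI"}  # surface form -> resource (hypothetical)
annotated = annotate("I'm working at DFKI.", GAZETTEER)
print(annotated)
```

Plain string matching like this cannot disambiguate between several resources with the same label, which is why the pipeline described in the talk includes instance disambiguation steps.
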
  37. Do you remember? [Diagram: Data; Text; annotated text; User-defined Filter; Extraction Pipeline; Extraction Results; edge labels: annotate, populate, enrich]
  38. RDF Epiphany. Epiphany takes the original webpage …
  39. RDF Epiphany. Epiphany takes the original webpage … and SCOOBIE initialized with an RDF data set …
  40. RDF Epiphany. Epiphany takes the original webpage … and SCOOBIE initialized with an RDF data set … It extracts RDF information from text and annotates it as RDFa …
  41. RDF Epiphany. Epiphany takes the original webpage … and SCOOBIE initialized with an RDF Linked Data set … It extracts RDF information from text and annotates it as RDFa … Clicking on RDFa annotations opens further information from the Linked Data set.
  42. RDF Epiphany at a glance:
       ● Epiphany is a free web service.
       ● Epiphany uses SCOOBIE.
       ● Epiphany can be initialized with any RDF Linked Data set.
       ● Epiphany generates an RDF document about a web page.
       ● Epiphany annotates RDF as RDFa in the web page.
     http://projects.dfki.uni-kl.de/epiphany/
  43. Summary. [Diagram: Customer DB; Employees DB; Project DB; DBpedia; four D2R Server boxes; SQUIN; User-defined Filter; Extraction Pipeline; Extraction Results; Text; annotated text; edge labels: annotate, populate, enrich]
  44. Outlook. [Diagram: Customer DB; Employees DB; Project DB; DBpedia; four D2R Server boxes; SQUIN; User-defined Filter; Extraction Pipeline; Extraction Results; E-Mail; annotated E-Mail; edge labels: annotate, populate, enrich]
  45. Thank you! Keywords: scoobie, sparql, rdfa, D2R server, rdf, squin, epiphany, Linked Data, OBIE.
