
What you Can Make Out of Linked Data

Tutorial given at the 38th Internationalization & Unicode Conference.

Published in: Technology

  1. What you can make out of Linked Data. Marco Fossati <fossati@spaziodati.eu>, Steven R. Loomis <srloomis@us.ibm.com>
  2. Let's meet the presenters first!
  3. Marco Fossati: Natural Language Processing Advocate, Recommender Systems Aficionado, Open Data Apologist
  4. Steven R. Loomis: IBM; Chair, Unicode ULI-TC; Projects: ICU, CLDR, ULI
  5. Outline: 1. Linked Open Data 101; 2. DBpedia; 3. The ULI use case
  6. Warning! Highly interactive tutorial
  7. Let's get started!
  8. Linked Open Data 101: The Big Picture
  9. What is data? Data is how we express facts in a reusable form
  10. Why data? The ingredients for... ...Information, Knowledge, Wisdom
  11. OK, it's data, what else? Big: billions of facts (“Santa Clara is a city”); Linked: richly structured; Open: open licenses
  12. Facts, not words. A fact is... an assertion about the world; subject + predicate + object; a triple. Human mind; Natural language; Machine
  13. Human mind: perceiving relationships between entities
  14. Natural language: "Elvis Presley sings Jailhouse Rock"
  15. Machine: the triple (Elvis Presley, sings, Jailhouse Rock)
  16. The graph: rich structure made of triples
  17. From the web of documents...
  18. ...to the web of entities
  19. The web of entities. An entity can be... identified; described through relationships; understood both by humans and machines
  20. Towards a WWW of entities. Identify via HTTP URIs (http://dbpedia.org/resource/Elvis_Presley); describe via RDF statements (:Elvis_Presley :sings :Jailhouse_Rock); understand via HTML for humans, RDF for machines
  21. Hands-on Time! https://pad.okfn.org/p/DBpediaULI
  22. Next in line…
  23. DBpedia: Extracting Knowledge from Wikipedia
  24. DBpedia is… A. …a framework for extracting data from semi-structured Wikipedia content; B. …an open-source community effort
  25. Why?
  26. Wikipedia can’t answer simple questions: “What do Santa Clara and San Francisco have in common?”
  27. Wikipedia can’t answer complex questions: “Which are the black and white movies produced in Italy that have soundtracks which were composed by musicians who were born in a city of the Trentino-Alto Adige region with fewer than 40,000 inhabitants?”
  28. The story so far: project started in 2007; from good ol’ PHP to Java + Scala; steadily growing community; Internationalization Committee; freely available on GitHub
  29. Data in Wikipedia: title, short abstract, long abstract
  30. Structure in Wikipedia: infobox, images
  31. Structure in Wikipedia: links, categories
  32. Structure in Wikipedia: interlanguage links
  33. Much more at http://dbpedia.org/Datasets
  34. DBpedia Extraction Framework (DEF): Wikipedia dump -> extractors -> RDF graph
  35. Extractors. Article features: abstract, redirects, categories, geo-coordinates, interlanguage links, etc. Infobox: raw, mapping-based
  36. Raw Infobox Extractor: :Elvis_Presley :born “Elvis Aaron Presley…” ; :died “August 16, 1977…” ; :restingPlace “Graceland…” ; :education “L.C. Humes…” ; :occupation “Singer…”
  37. The Big Issues: Data is heterogeneous! Data is multilingual!
  38. (no text on this slide)
  39. Solution: the DBpedia ontology as a multilingual glue; Wikipedia-to-ontology mapping
  40. DBpedia Ontology: encoding the worldwide encyclopedic knowledge
  41. Mapping-based Extractor: combines what belongs together, separates what is different
  42. DIEF: Mapping-based Infobox Extractor
  43. The Mappings Wiki: anybody can contribute to mappings.dbpedia.org
  44. Download the latest DBpedia dump at http://downloads.dbpedia.org/current/
  45. English SPARQL endpoint: dbpedia.org/sparql
  46. Language chapters: DBpedia in your mother tongue
  47. Active chapters: International (English-based), Basque, Czech, Dutch, French, German, Greek, Indonesian, Italian, Japanese, Korean, Polish, Portuguese, Spanish
  48. Host your own language chapter!
  49. Applications: get the best out of DBpedia data
  50. Knowledge Graphs: highly informative summaries in your own language
  51. Question Answering: “Who is Bram Stoker?”
  52. Entity Linking: detecting things in text
  53. Automatic Huge Gazetteers: language- and domain-specific resources for short-sentence classification
  54. DBpedia Stakeholders: who is using the knowledge base?
  55. Open Government: linking local data
  56. Digital Libraries: enriching the catalogue
  57. Data-driven Journalism: building infographics
  58. Hands-on Time! https://pad.okfn.org/p/DBpediaULI
  59. And now the final part!
  60. The ULI use case: putting Linked Open Data to work
  61. What’s wrong with Localization Interoperability? Inconsistent application, implementation, and interpretation of standards; lack of clear requirements for localization data interchange
  62. Unicode Localization Interoperability (ULI): a Technical Committee of Unicode. Focus areas: 1. Translation memory; 2. Translation source strings / translations; 3. Segmentation rules
  63. ULI: Segmentation. Given: “Thanks to Dr. Jones for this effort.” UAX #29 segmentation: |Thanks to Dr.| Jones for this effort.| English: |Thanks to Dr. Jones for this effort.|
  64. ULI Suppression: Abbreviations. English: Mr., Mrs., Dr., St., …; Spanish: Sr., Dto., Sra., Avda., …; Russian: проф., февр., тел., кв., …
  65. Demo: ULI Breaks, http://demo.icu-project.org/icu-bin/icusegments
  66. DBpedia applied to ULI (University of Leipzig): Sebastian Hellmann, Martin Brümmer, Dimitris Kontokostas. Opportunity: help segmentation by supplying abbreviation data
  67. Yes! Evaluation shows that, especially for small texts, abbreviations can contribute to precision and recall of segmentation
  68. Success rate
  69. Multilingual with over 100 languages; structured data eases extraction; additional data like entity types and categories
  70. Example: Mr. The “MR” disambiguation page links to the “Mr.” article. The title ends in a full stop, so it may be an abbreviation.
  71. The “Mr.” SPARQL query (demo):
          PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
          SELECT ?entryExample ?exampleTested ?indegreeRanking
          WHERE {
            <http://dbpedia.org/resource/Mr.> rdfs:label ?entryExample ;
                                              rdfs:comment ?exampleTested .
            FILTER ( lang(?entryExample) = lang(?exampleTested) )
            # subselect: count the triples that point at the resource
            {
              SELECT (COUNT(?in) AS ?indegreeRanking)
              WHERE { ?in ?p <http://dbpedia.org/resource/Mr.> }
            }
          }
          LIMIT 100
  72. Example DBpedia data (English): St.: Street, <http://en.wikipedia.org/wiki/Street>, <http://schema.org/Place>, <http://dbpedia.org/ontology/Place>, <http://dbpedia.org/ontology/PopulatedPlace>
  73. Example DBpedia data (Russian): Проф.: Профессор (Professor), <http://ru.wikipedia.org/wiki/Профессор>
  74. Step 1: Get abbreviation URIs
  75. Step 2: Load DBpedia data into a local DB
  76. Step 3: Query the data with SPARQL and output TSV
  77. 22,859 abbreviations with 78,197 meanings in 99 languages
  78. 22,859 abbreviations with 78,197 meanings in 99 languages. Long tail: only 25 languages have >100 abbreviations; only 7 languages have >1,000 abbreviations
  79. Long tail (total abbrevs)
  80. Long tail (total abbrevs) (zoom)
  81. ULI Process (diagram labels): Wikipedia, DBpedia, Extraction, ULI Review, Translation Memory, Comparison, Manual review, CLDR, CLDR abbrs., CLDR Suppressions. Image credit: "Lupa.na.encyklopedii" by Julo, own work, licensed under Public domain via Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Lupa.na.encyklopedii.jpg
  82. Comparison with Translation Memory. Entry / % in TM: Corp. / 0.0307%; St. / 0.0023%; P.T.T.C. / 0%. Image credit: "Trichtermitfilter" by Gmhofmann, own work, licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons, https://commons.wikimedia.org/wiki/File:Trichtermitfilter.jpg
  83. CLDR Input: extract abbreviations from CLDR localized data. Days of week: Sun. Mon. Tue. Wed. Thu. … Months: Jan. Feb. Mar. … etc.
  84. Manual Review
  85. CLDR output format (excerpt):
          <segmentations>
            <segmentation type="SentenceBreak">
              <!--From ULI data, http://uli.unicode.org-->
              <suppressions type="standard">
                <suppression>Port.</suppression>
                <suppression>Alt.</suppression>
                <suppression>Di.</suppression>
                <suppression>Ges.</suppression>
                <suppression>frz.</suppression>
              </suppressions>
            </segmentation>
          </segmentations>
  86. CLDR 26 Output (http://cldr.unicode.org), “Break Suppression”: de 239, en 151, es 164, fr 82, it 45, pt 170, ru 18
  87. Challenges. "Long tail" languages: harder to find existing TM data; harder to find linguistic rules/review; harder to find tagged corpora to benchmark. Systematic issues with using redirects/disambiguation
  88. Opportunity. Scope: non-full-stop punctuation ("Yahoo!"); language-specific abbreviation rules; context (medical, business, …); leverage schema/taxonomy (“Place” vs. “Person”, etc.) to filter DBpedia lists; additional LOD
  89. Thank You! Further Q&A? Slides & contact info: https://pad.okfn.org/p/DBpediaULI
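
Note on slides 12 to 16 (facts as triples, the graph): the idea that a fact is a subject + predicate + object triple, and that a graph is a set of such triples, can be sketched in a few lines of Python. This is only an illustration, not how DBpedia stores data: the `Triple` class, the `build_graph` and `objects` helpers, and the `:is_a` / `:City` names are assumptions made up for the example; only the Elvis Presley triple comes from the slides.

```python
from collections import defaultdict
from typing import NamedTuple


class Triple(NamedTuple):
    """One fact: subject + predicate + object (hypothetical helper type)."""
    subject: str
    predicate: str
    obj: str


def build_graph(triples):
    """Index triples by subject: subject -> list of (predicate, object)."""
    graph = defaultdict(list)
    for triple in triples:
        graph[triple.subject].append((triple.predicate, triple.obj))
    return graph


def objects(graph, subject, predicate):
    """Return every object linked to `subject` through `predicate`."""
    return [obj for pred, obj in graph.get(subject, []) if pred == predicate]


facts = [
    # The triple from the slides.
    Triple(":Elvis_Presley", ":sings", ":Jailhouse_Rock"),
    # Hypothetical encoding of "Santa Clara is a city"; names are illustrative.
    Triple(":Santa_Clara", ":is_a", ":City"),
]

graph = build_graph(facts)
songs = objects(graph, ":Elvis_Presley", ":sings")
print(songs)
```

Indexing by subject is just one convenient choice for lookups like "what does Elvis Presley sing?"; a real triple store also indexes by predicate and object.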

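
Note on slides 63 and 64 (segmentation and suppressions): the effect of an abbreviation suppression list on sentence breaking can be sketched as below. This is a deliberately naive stand-in, not the UAX #29 rules and not ICU's implementation: it treats `.`, `!` or `?` followed by whitespace as a candidate break and skips the break when the word ending there is in the suppression set. The function name `split_sentences` and its parameters are assumptions for the example.

```python
import re


def split_sentences(text, suppressions=frozenset()):
    """Naively split `text` into sentences, honoring a suppression list.

    A candidate break is '.', '!' or '?' followed by whitespace. The break
    is suppressed when the word ending at that mark (e.g. "Dr.") is listed
    in `suppressions`.
    """
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]\s+", text):
        end = match.start() + 1  # position just after the punctuation mark
        words = text[start:end].split()
        last_word = words[-1] if words else ""
        if last_word in suppressions:
            continue  # known abbreviation: do not break here
        sentences.append(text[start:end])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])  # trailing sentence
    return sentences


text = "Thanks to Dr. Jones for this effort."

# Without suppressions the splitter breaks after "Dr.".
default_breaks = split_sentences(text)

# With an English suppression list the sentence stays whole.
english_breaks = split_sentences(text, suppressions={"Mr.", "Mrs.", "Dr.", "St."})

print(default_breaks)
print(english_breaks)
```

The suppression set here reuses the English abbreviations listed on slide 64; in the process described in the slides, such lists are shipped per locale in CLDR.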