2013 Linked Data in Practice Workshop (LDPW2013), 30 November, 2013

Building DBpedia Japanese and Linked Data Cloud in Japanese
Presented at 2013 Linked Data in Practice Workshop (LDPW2013), 30 November, 2013

Published in: Technology, Education
Transcript of "Building DBpedia Japanese and Linked Data Cloud in Japanese"

  1. 2013 Linked Data in Practice Workshop (LDPW2013), 30 November, 2013
     Building DBpedia Japanese and Linked Data Cloud in Japanese
     Fumihiro Kato, Hideaki Takeda, Seiji Koide, Ikki Ohmukai
     {fumi, takeda, koide, i2k}@nii.ac.jp
     National Institute of Informatics (NII)
     Research Organization of Information and Systems (ROIS)
     Graduate University for Advanced Studies (Sokendai)
  2. Two Driving Forces to Push LOD in Japan
     • LOD for ACademia (LODAC) Project, since 2010
       – A research project in ROIS and NII
       – Research on Linked Data for research
     • Linked Open Data Initiative, Inc. (LODI), since 2012
       – Non-profit organization
       – Promotion of LOD in Japan
       – Collaboration with various stakeholders
         • Government, public sector, companies
     • The members of the two groups largely overlap
  3. LODAC Project - connecting academic data
     • LODAC Location: Integration of location information
     • LODAC SPECIES: Connecting species data by name
       – Diagram labels: Specimen DB, Species Info. DB, Research DB, GBIF, BioSci. DB, Taxon Name, App. for query expansion
       – No. of names: 113,118
       – No. of triples: 14,532,449
     • LODAC Museum: LOD of data in museums
       – Diagram labels: Raw data from Source A, Data from Source B, Integrated data; Raw data for entities, Data for entities, Minimum data to identify entities
       – Entities: Work, Creator, Museum
       – Properties: dc:creator, dc:references, crm:P55_has_current_location
     • DBpedia Japanese
     • CKAN Japanese: Catalog for Open Data
  4. LODAC Museum
     • Integrated database for information on museums in Japan
       – Data
         • No. of museums: 114
         • No. of triples: 40,059,131
     • Integration by creator, work and institute
     • Data publication by RDF
     • Some applications using the data

     Type of information                          RDF type                      No. of items
     Collections (total)                          lodac:Specimen + lodac:Work   ca. 1,770,000
     Collections (specimen)                       lodac:Specimen                ca. 1,690,000
     Collections (creative and historical work)   lodac:Work                    ca. 130,000
     Creators                                     foaf:Person                   ca.
     Institutes                                   foaf:Organization             ca. 200,000
     (table value not placed in the transcript: 8,800)
  5. Use: Yokohama Art Spot
     • LODAC Museum × Yokohama Art LOD × PinQA
       – Application using museum and local data
       – Data related to art in Yokohama
         • Collections
         • Events
         • Q&A
     • http://lod.ac/apps/yas/
  6. LODAC SPECIES: Linking Species Information with names
     • Diagram labels: Museum Specimen DB, Species Info. DB, Research DB, GBIF, BioSci. DB, Taxon Name LOD
     • No. of species names: 113,118
     • No. of triples: 14,532,449
  7. Search application with LODAC SPECIES
     • http://lod.ac/apps/lsdcs
  8. Specified Non-profit Corporation Linked Open Data Initiative, Inc.
  9. Prospectus
     • LOD is becoming an infrastructure of our society
       – Similar to the impact the Web has had on our society
       – LOD helps our society mature and diversify
     • We want to spread LOD further in Japan!
       – For governments (central and local)
       – For companies
       – For citizens
     • How?
       – By researchers, engineers, and citizens together
  10. Projects
      • Platforms
        – CKAN Japanese
        – DBpedia Japanese
      • Collaborative Projects
        – with the Ministry of Economy, Trade and Industry (METI)
          • Open Data METI
        – with the National Statistics Center
          • Scheme Design for Area Code
        – Collaboration with Sabae City
          • e.g., “Sabae Burari”
      • Promotional Events
  11. Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  12. provided by NDL
  13. Motivation
      • Data hub for Japanese resources
        – To promote LOD in Japan
        – To connect datasets in Japanese
      • Two linguistic datasets
        – DBpedia Japanese
        – RDFized Japanese WordNet
  14. DBpedia Japanese
      • DBpedia i18n project
        – 14 chapters
      • Generated from Japanese Wikipedia dump files
        – DIEF (DBpedia Information Extraction Framework)
        – ~80 million triples
      • Linking to
        – Japanese WordNet
        – Japanese Wikipedia Ontology
        – other DBpedia chapters
      • http://ja.dbpedia.org
  15. i18n/l10n efforts
      • IRI, IRI, IRI, ...
      • Configurations for Extractors and Parsers
      • DBpedia Mappings for each chapter
  16. Extraction process
      • ref: D. Kontokostas et al. "Internationalization of Linked Data. The case of the Greek DBpedia edition." Journal of Web Semantics: Science, Services and Agents on the World Wide Web, vol. 15, No. 3, Sep. 2012, pp. 51-61
  17. DBpedia Information Extraction Framework
      • Software to extract data from Wikipedia dumps
        – including custom extractors/parsers to apply language-specific configurations
      • Extractors / Parsers
        – DisambiguationExtractor
        – HomepageExtractor
        – ImageExtractor
        – PersondataExtractor
  18. DisambiguationExtractor
      • "ja" -> "(曖昧さ回避)"
  19. HomepageExtractor
      • propertyNamesMap
        – "ja" -> Set("homepage", "website", " ", " ", "Web サイト", "Webサイト")
      • externalLinkSectionsMap
        – "ja" -> "外部リンク"
      • officialMap
        – "ja" -> "公式"
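The per-language settings on slides 18 and 19 can be pictured as lookup tables keyed by language code. The following is a minimal Python sketch of that idea, not DIEF's actual Scala code: the function names `is_disambiguation_title` and `find_homepage` are hypothetical, the infobox is assumed to be a plain dictionary, and the two unreadable entries in the slide's property-name set are left out rather than guessed.

```python
# Per-language settings copied from the slides. The two unreadable
# entries in the slide's property-name set are intentionally omitted.
DISAMBIGUATION_SUFFIX = {"ja": "(曖昧さ回避)"}
HOMEPAGE_PROPERTY_NAMES = {"ja": {"homepage", "website", "Web サイト", "Webサイト"}}


def is_disambiguation_title(title, lang):
    """Return True if the page title ends with the language's disambiguation marker."""
    suffix = DISAMBIGUATION_SUFFIX.get(lang)
    return suffix is not None and title.endswith(suffix)


def find_homepage(infobox, lang):
    """Return the first infobox value whose property name is a configured homepage name."""
    names = HOMEPAGE_PROPERTY_NAMES.get(lang, set())
    for key, value in infobox.items():
        if key.strip() in names:
            return value
    return None
```

A language without an entry simply yields no match, which is roughly why each chapter has to supply its own configuration before these extractors produce any triples.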
  20. ImageExtractor
      • "ja" -> """(?i){{s?(Non free|Non-free pubart)s?}}""".r
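The ImageExtractor setting is a case-insensitive regular expression for templates that mark non-free images. Below is a rough Python equivalent, written under the assumption that the `s?` tokens in the transcript were originally `\s?` (optional whitespace) and lost their backslashes during extraction. The function name `has_non_free_template` is hypothetical.

```python
import re

# Assumption: "s?" in the transcript stood for "\s?" in the original Scala regex.
NON_FREE_TEMPLATE = re.compile(
    r"\{\{\s?(Non free|Non-free pubart)\s?\}\}",
    re.IGNORECASE,
)


def has_non_free_template(wikitext):
    """Return True if the wikitext contains one of the non-free image templates."""
    return NON_FREE_TEMPLATE.search(wikitext) is not None
```

For example, `has_non_free_template("{{ non-free pubart }}")` matches because of the `(?i)` flag and the optional whitespace, while an unrelated template such as `{{Infobox}}` does not.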
  21. PersondataExtractor
      • Names of templates for personal information
      • “名前” (name)
      • “別名” (alias)
      • “概要” (abstract)
      • dates and places for birth and death
  22. Extracted triples after configurations
      Type             Triples
      disambiguation   106,386
      homepages        49,355
      images           843,170
      persondata       1,811
  23. Image of Infobox Extraction
      • Diagram labels: Template, Mapping Infobox to ontology (used for extraction process), Data Extraction
  24. {{TemplateMapping
      | mapToClass = ComicsCreator
      | mappings =
        {{PropertyMapping | templateProperty = 名前 | ontologyProperty = foaf:name }}
        {{PropertyMapping | templateProperty = 本名 | ontologyProperty = foaf:name }}
        {{PropertyMapping | templateProperty = 生年 | ontologyProperty = birthYear }}
        {{PropertyMapping | templateProperty = 生地 | ontologyProperty = birthPlace }}
        {{PropertyMapping | templateProperty = 没年 | ontologyProperty = deathYear }}
        {{PropertyMapping | templateProperty = 没地 | ontologyProperty = deathPlace }}
        {{PropertyMapping | templateProperty = 国籍 | ontologyProperty = nationality }}
        {{PropertyMapping | templateProperty = 受賞 | ontologyProperty = award }}
        {{PropertyMapping | templateProperty = 公式サイト | ontologyProperty = foaf:homepage }}
        {{PropertyMapping | templateProperty = 画像 | ontologyProperty = foaf:depiction }}
        {{PropertyMapping | templateProperty = ジャンル | ontologyProperty = genre }}
        {{PropertyMapping | templateProperty = 画像サイズ | ontologyProperty = imageSize }}
        {{PropertyMapping | templateProperty = 職業 | ontologyProperty = occupation }}
        {{PropertyMapping | templateProperty = 代表作 | ontologyProperty = notableWork }}
      }}
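To show what such a mapping does during extraction, here is a small Python sketch that applies a subset of the property mappings above to an infobox and emits triples. It only illustrates the idea and is not how DIEF is implemented: the function `map_infobox`, the dictionary representation, and the sample infobox values (including the unmapped 血液型 key) are assumptions made for the example.

```python
# Subset of the ComicsCreator mapping from the slide:
# infobox property name -> ontology property.
COMICS_CREATOR_MAPPING = {
    "名前": "foaf:name",
    "生年": "birthYear",
    "生地": "birthPlace",
    "受賞": "award",
    "代表作": "notableWork",
}


def map_infobox(subject, infobox, mapping, map_to_class):
    """Turn mapped infobox properties into (subject, predicate, object) triples."""
    triples = [(subject, "rdf:type", map_to_class)]
    for template_property, value in infobox.items():
        ontology_property = mapping.get(template_property)
        if ontology_property is not None:  # unmapped properties are skipped
            triples.append((subject, ontology_property, value))
    return triples


# Hypothetical infobox content, for illustration only.
example_infobox = {"名前": "石ノ森章太郎", "生年": "1938", "血液型": "A"}
example_triples = map_infobox(
    "dbp:石ノ森章太郎", example_infobox, COMICS_CREATOR_MAPPING, "ComicsCreator"
)
```

Properties without a mapping are skipped in this sketch, so the number of mapped properties directly limits how much of each infobox reaches the ontology-based dataset.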
  25. Statistics for DBpedia Mappings (DBpedia Japanese vs. DBpedia English)
      • Share of Wikipedia templates that are mapped: 4.67% (81 of 1,733) vs. 6.33% (369 of 5,826)
      • Share of Wikipedia template properties that are mapped: 2.47% (1,581 of 62,679) vs. 3.47% (6,169 of 177,599)
      • Share of template occurrences in Wikipedia that are mapped: 47.99% (286,858 of 597,696) vs. 82.24% (2,435,773 of 2,728,357)
      • Share of property occurrences in Wikipedia that are mapped: 38.75% (3,128,208 of 8,071,982) vs. 54.95% (27,283,343 of 49,654,072)
  26. "Mapping Party"
      • The mapping task is not easy
        – Wikipedia Template
        – DBpedia Ontology
        – Well-known vocabularies
      • We held hands-on sessions
        – Aug. 2012: 10 people
        – Mar. 2013: 25 people
  27. DBpedia Publishing Architecture
  28. URI case
      • Diagram labels: URI (at each component), decode URI for users
  29. IRI case
      • Diagram labels: IRI (at each component), IRI to URI
  30. IRI issues
      1. IRIs have to be used properly in queries
      2. Input URIs must be decoded to IRIs
      3. Some serializations cannot use IRIs
      4. Don't decode
      5. Use the latest version
      • Diagram labels: IRI, IRI to URI
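The conversions behind the "IRI to URI" and "decode URI" steps come down to percent-encoding and decoding the UTF-8 bytes of non-ASCII characters. The following is a minimal Python sketch using the standard library. The function names are hypothetical, and the resource name 東京都 is an illustrative example rather than one taken from the slides.

```python
from urllib.parse import quote, unquote

# Characters left untouched when encoding: URI reserved characters plus "%",
# so that escapes already present in the input are not encoded a second time.
_SAFE = ":/?#[]@!$&'()*+,;=%~"


def iri_to_uri(iri):
    """Percent-encode non-ASCII characters of an IRI as UTF-8 bytes."""
    return quote(iri, safe=_SAFE)


def uri_to_iri(uri):
    """Decode percent-escapes back to characters for display to users.

    Note: this decodes every escape, including reserved characters such as
    %2F, so it is only suitable for simple resource names.
    """
    return unquote(uri)


iri = "http://ja.dbpedia.org/resource/東京都"
uri = iri_to_uri(iri)
```

Here `uri` becomes `http://ja.dbpedia.org/resource/%E6%9D%B1%E4%BA%AC%E9%83%BD`, and `uri_to_iri(uri)` returns the original IRI.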
  31. Query: Notable comics written by comics creators who have received the Tezuka Osamu Cultural Prize

      PREFIX dbp: <http://ja.dbpedia.org/resource/>
      PREFIX dbp-owl: <http://dbpedia.org/ontology/>
      SELECT ?creatorName ?comicName WHERE {
        ?creator a dbp-owl:ComicsCreator ;
                 dbp-owl:award dbp:手塚治虫文化賞 ;
                 dbp-owl:notableWork ?comic ;
                 rdfs:label ?creatorName .
        ?comic a dbp-owl:Comics ;
               rdfs:label ?comicName .
      }

      • Graph diagram, resources: dbp:石ノ森章太郎, dbp:サイボーグ009, dbp:宮城県, dbp:村井嘉浩, dbp:手塚治虫文化賞
      • Graph diagram, classes: dbp-owl:ComicsCreator, dbp-owl:Comics, dbp-owl:AdministrativeRegion, foaf:Person
      • Graph diagram, properties: rdf:type, rdfs:label, dbp-owl:notableWork, dbp-owl:birthPlace, dbp-owl:leaderName, dbp-owl:award, dbp-prop:生年
      • Graph diagram, literals: 石ノ森章太郎, サイボーグ009, 宮城県, 1938
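One way to send the query on slide 31 to a SPARQL endpoint over HTTP is to pass it as the `query` parameter of a GET request, as the SPARQL protocol describes. The sketch below only builds the request URL and does not contact any server. The endpoint address `http://ja.dbpedia.org/sparql` is an assumption (the slides give only the site URL), and the `rdfs` prefix declaration is added here for portability since the slide's query does not declare it.

```python
from urllib.parse import urlencode

# Assumed endpoint location; the slides only give http://ja.dbpedia.org
ENDPOINT = "http://ja.dbpedia.org/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbp: <http://ja.dbpedia.org/resource/>
PREFIX dbp-owl: <http://dbpedia.org/ontology/>
SELECT ?creatorName ?comicName WHERE {
  ?creator a dbp-owl:ComicsCreator ;
           dbp-owl:award dbp:手塚治虫文化賞 ;
           dbp-owl:notableWork ?comic ;
           rdfs:label ?creatorName .
  ?comic a dbp-owl:Comics ;
         rdfs:label ?comicName .
}
"""


def build_request_url(endpoint, query):
    """Build a SPARQL-protocol GET URL; urlencode percent-encodes the IRIs as UTF-8."""
    return endpoint + "?" + urlencode({"query": query})


request_url = build_request_url(ENDPOINT, QUERY)
```

The query itself is written with IRIs, while the URL that goes over the wire contains only percent-encoded ASCII, which is the distinction the IRI slides draw. A client would typically also send an `Accept` header such as `application/sparql-results+json` to choose the result format.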
  32. Japanese Linked Data Cloud
      • 21 datasets
      • Criteria
        – providing more than 1,000 triples
        – providing either dereferenceable URIs, a data dump, or a SPARQL endpoint
        – including Japanese labels
        – linking to other datasets in the LOD cloud or JLDC
      • Open license is not mandatory
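The inclusion criteria on slide 32 amount to a simple conjunction of checks. A minimal Python sketch follows, assuming a hypothetical `Dataset` record whose field names are invented for this example. No real dataset figures are used.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # Hypothetical record; field names are illustrative, not from the slides.
    name: str
    triples: int
    dereferenceable: bool
    has_dump: bool
    has_sparql_endpoint: bool
    has_japanese_labels: bool
    links_to_lod_or_jldc: bool
    open_license: bool = False  # not required for JLDC


def meets_jldc_criteria(dataset):
    """Apply the four JLDC criteria; an open license is deliberately not checked."""
    accessible = (
        dataset.dereferenceable or dataset.has_dump or dataset.has_sparql_endpoint
    )
    return (
        dataset.triples > 1000
        and accessible
        and dataset.has_japanese_labels
        and dataset.links_to_lod_or_jldc
    )
```

Leaving `open_license` out of the check mirrors the slide's note that an open license is not mandatory for JLDC.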
  33. JLDC with LOD cloud criteria
      • 21 → 9
  34. Links to/from Japanese WordNet
                   links    WN nouns   DBpedia IRIs   WN to DBpedia   DBpedia to WN
      resources    33,017   65,788     1,456,158      50.1%           2.3%
      properties   1,245    65,788     16,020         1.9%            7.8%
  35. Ongoing Work
      • More Wikipedia entries and infoboxes
        – Wikipedia Town
      • More DBpedia mappings
        – Mapping Party
      • Parsers for Japanese
        – Japanese Calendar: 慶応3年1月2日 => "1868-01-02"^^xsd:date
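As a rough illustration of what a Japanese calendar parser involves, the sketch below converts era-based dates to ISO 8601 strings. It is a simplification under stated assumptions, not the parser the authors are building. It covers only the 大正, 昭和 and 平成 eras, does not check era boundaries, and leaves out earlier eras such as 慶応, whose dates follow the lunisolar calendar used in Japan before 1873 and need a real calendar conversion. The function name `parse_japanese_date` is hypothetical.

```python
import datetime
import re

# Gregorian year = offset + era year (e.g. 平成1年 = 1989).
ERA_OFFSETS = {"大正": 1911, "昭和": 1925, "平成": 1988}

_DATE_PATTERN = re.compile(r"(大正|昭和|平成)(元|\d+)年(\d+)月(\d+)日")


def parse_japanese_date(text):
    """Return an ISO 8601 date string, or None if the text is not a supported era date."""
    match = _DATE_PATTERN.fullmatch(text)
    if match is None:
        return None
    era, year_text, month, day = match.groups()
    era_year = 1 if year_text == "元" else int(year_text)  # 元年 means year 1
    year = ERA_OFFSETS[era] + era_year
    return datetime.date(year, int(month), int(day)).isoformat()
```

For example, `parse_japanese_date("平成25年11月30日")` returns `"2013-11-30"`, the date of the workshop.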
  36. Summary
      • Linked Data in Japan is steadily expanding
        – Started as a research project
        – Now extended to various areas
      • Creating a local chapter of DBpedia is key to promoting Linked Data in the local language
        – A hub in the local language
        – People in any area can find connections between DBpedia and their own data
      • Promotion of open licenses is still in progress