DBpedia ♥ Commons

Extract semi-structured data from Wikimedia Commons to RDF using the DBpedia Extraction Framework

  1. DBpedia ♥ Commons
      Gaurav Vaidya, Dimitris Kontokostas, Andrea Di Menna, Jim O'Regan
      2nd DBpedia Meeting, Leipzig, 03.09.2014
  2. ~23M pages like this
  3. ~23M pages like this
  4. A lot of pages like this
  5. Many pages like this
  6. Not very similar to pages like this
  7. DBpedia Extraction Framework
      ✔ “Wiki agnostic”
      ✔ Pluggable extractors
      ✔ Out-of-the-box support for common metadata
      ✗ Tuned for extraction in the main namespace (not File:)
      ✗ Many other challenges left
  8. Challenges
      ✔ File metadata
      ✔ KML files
      ✔ Image Galleries
      ✔ Image Annotations
      ✔ Mappings Wiki
      ✔ Bootstrap community mappings
      ✔ Template Statistics
      ✔ Licensing
      ✔ Technical details I'll not go into
  9. Out-of-the-box support
      ● Categories (skos)
      ● External links
      ● Geo-coordinates
      ● Raw infobox properties
      ● Labels
      ● PageIds / Revisions
      ● Links (internal / external)
      ● Mappings Wiki (with some tweaking / more on that later)
  10. File metadata
      ● New extractor
      ● New file class hierarchy
         – dbo:File, dbo:Image, dbo:StillImage, dbo:MovingImage and dbo:Sound
      Sample output:
         :Aeropetes.JPG a dbo:StillImage, dbo:Image, dbo:Document, dbo:File, dbo:Work ;
             dcterms:type dbo:StillImage ;
             dbo:fileExtension "jpg" ;
             dcterms:format "image/jpeg" ;
             dbo:fileURL commons-path:Aeropetes.JPG ;
             foaf:depiction commons-path:Aeropetes.JPG ;
             dbo:thumbnail commons-path:Aeropetes.JPG?width=300 .
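
The file metadata step can be pictured as a lookup from file extension to MIME type and file class. The following is a minimal sketch, not the framework's actual extractor: the `FILE_TYPES` table and the `describe_file` helper are hypothetical, and only the class names and the `jpg` row are taken from the slide.

```python
# Hypothetical lookup table: extension -> (MIME type, most specific class).
# Only the class names and the jpg row come from the slide; the rest is assumed.
FILE_TYPES = {
    "jpg": ("image/jpeg", "dbo:StillImage"),
    "png": ("image/png", "dbo:StillImage"),
    "webm": ("video/webm", "dbo:MovingImage"),
    "ogg": ("audio/ogg", "dbo:Sound"),
}


def describe_file(file_name):
    """Return (property, value) pairs for a Commons file name, or [] if unknown."""
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    if extension not in FILE_TYPES:
        return []
    mime_type, file_class = FILE_TYPES[extension]
    return [
        ("rdf:type", file_class),
        ("rdf:type", "dbo:File"),
        ("dbo:fileExtension", f'"{extension}"'),
        ("dcterms:format", f'"{mime_type}"'),
        ("dbo:fileURL", f"commons-path:{file_name}"),
    ]


# Print one statement per pair for the file used on the slide.
for prop, value in describe_file("Aeropetes.JPG"):
    print(f":Aeropetes.JPG {prop} {value} .")
```

The extension is lower-cased before the lookup, which is why `Aeropetes.JPG` yields the extension "jpg", as in the sample output.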
  11. Image Galleries
      ● Attach each gallery item to the page resource
         :Colorado dbo:hasGalleryItem Colorado.JPG, Denver_Colorado_Art.jpg, ColoradoCenter1.jpg .
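
Gallery items on a wiki page come from MediaWiki `<gallery>` blocks, one file per line with an optional caption after a `|`. The sketch below shows one way to pull the file names out of raw wikitext; the `gallery_items` helper and the sample wikitext are assumptions for illustration and are not the framework's code.

```python
import re

# Matches the body of each <gallery ...>...</gallery> block.
GALLERY_RE = re.compile(r"<gallery[^>]*>(.*?)</gallery>", re.DOTALL | re.IGNORECASE)


def gallery_items(wikitext):
    """Return the file names listed in all <gallery> blocks, in order."""
    items = []
    for body in GALLERY_RE.findall(wikitext):
        for line in body.splitlines():
            # Everything before the first "|" is the file; the rest is a caption.
            target = line.split("|", 1)[0].strip()
            if not target:
                continue
            # Drop an optional "File:" or "Image:" namespace prefix.
            name = re.sub(r"^(File|Image):", "", target, flags=re.IGNORECASE)
            items.append(name.replace(" ", "_"))
    return items


# Hypothetical page content, loosely modelled on the slide's example.
page = """<gallery>
File:Colorado.JPG|The state
Denver Colorado Art.jpg
File:ColoradoCenter1.jpg|Downtown
</gallery>"""

for item in gallery_items(page):
    print(f":Colorado dbo:hasGalleryItem {item} .")
```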
  12. Image Annotations
      ● Annotation Gadget
      ● Boxes with optional description
  13. Image Annotations
      ● W3C Media Fragments recommendation
      ● Embed the box in the URI
         – ?width=15130&height=1886#xywh=pixel:10431,324,1670,1208> .
      ● Add descriptions to the new resource
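
The annotation URI combines the image dimensions as query parameters with a spatial fragment of the form `#xywh=pixel:x,y,w,h` from the Media Fragments recommendation. A minimal sketch of building such a URI follows; the `annotation_uri` helper and the `example.org` base URI are assumptions, since the slide only shows the query and fragment part.

```python
def annotation_uri(image_uri, image_width, image_height, x, y, box_width, box_height):
    """Build a URI naming a rectangular region of an image.

    The query string records the image size the box refers to; the fragment
    is a pixel-based xywh spatial selector.
    """
    query = f"?width={image_width}&height={image_height}"
    fragment = f"#xywh=pixel:{x},{y},{box_width},{box_height}"
    return f"{image_uri}{query}{fragment}"


# Hypothetical base URI; the numbers are the ones shown on the slide.
uri = annotation_uri("http://example.org/Some_image.jpg", 15130, 1886, 10431, 324, 1670, 1208)
print(uri)
```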
  14. Mappings Wiki
  15. Template Statistics
  16. Licensing
      ● Automatically identified and imported ~360 license templates
      ● Use the mappings wiki
      ● Needed some hacking to make it work
         – e.g. {{Self|GFDL|cc-by-sa-3.0,2.5,2.0,1.0}}
      :Acraea_circeis.JPG dbo:license <http://creativecommons.org/publicdomain/mark/1.0/> .
      :Antepipona_deflenda_-_2012-10-17.webm dbo:license <http://creativecommons.org/licenses/by-sa/3.0/> .
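
The slide does not say what the hack was. One plausible reading of the example is that a `{{Self|...}}` wrapper has to be unpacked into individual licenses, with a multi-version argument such as `cc-by-sa-3.0,2.5,2.0,1.0` expanded into one entry per version. The sketch below illustrates that reading only; `expand_self_template` and `cc_license_uri` are hypothetical helpers, and the URI pattern is inferred from the by-sa/3.0 URI on the slide.

```python
import re


def expand_self_template(template):
    """Return the license names wrapped by a {{Self|...}} template call."""
    inner = template.strip()[2:-2]
    parts = [part.strip() for part in inner.split("|")]
    if not parts or parts[0].lower() != "self":
        return []
    licenses = []
    for arg in parts[1:]:
        if not arg or "=" in arg:
            continue  # skip empty and named parameters such as author=...
        multi = re.fullmatch(r"(cc-[a-z-]+?)-(\d\.\d(?:,\d\.\d)+)", arg, re.IGNORECASE)
        if multi:
            # "cc-by-sa-3.0,2.5" -> "cc-by-sa-3.0", "cc-by-sa-2.5"
            licenses.extend(f"{multi.group(1)}-{v}" for v in multi.group(2).split(","))
        else:
            licenses.append(arg)
    return licenses


def cc_license_uri(name):
    """Map a name like 'cc-by-sa-3.0' to a Creative Commons URI, else None."""
    match = re.fullmatch(r"cc-([a-z-]+)-(\d\.\d)", name.lower())
    if not match:
        return None
    return f"http://creativecommons.org/licenses/{match.group(1)}/{match.group(2)}/"


names = expand_self_template("{{Self|GFDL|cc-by-sa-3.0,2.5,2.0,1.0}}")
print(names)
print(cc_license_uri(names[1]))
```

Names that are not Creative Commons licenses, such as GFDL, return None from `cc_license_uri` and would need their own mapping.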
  17. KML Annotations attached to media
      Attach raw KML data to resource with custom extractor
      Sample output:
         :Yellowstone_1871b.jpg dbo:hasKMLData """<?xml version="1.0" encoding="UTF-8"?>
         <kml xmlns="http://earth.google.com/kml/2.2">
         <GroundOverlay>
         <name>Yorktown, Indiana (1878)</name>
         <description>An 1878 map of Yorktown in Tippecanoe County, Indiana. Source: Kingman Brothers&apos; Combination Atlas Map of Tippecanoe County, Indiana, 1878.</description>
         <color>99ffffff</color><Icon><href>BIG_LINK_HERE</href>
         <viewBoundScale>0.75</viewBoundScale></Icon>
         <LatLonBox>
         <north>40.26126145890567</north><south>40.25777915632657</south>
         <east>-86.77033439383223</east><west>-86.77398493316619</west>
         <rotation>-1.123009884936565</rotation></LatLonBox>
         </GroundOverlay></kml>"""^^rdfs:XMLLiteral .
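
Because the KML is attached as a raw literal, a consumer has to parse it to get at the coordinates. The sketch below reads the bounding box of a ground overlay with Python's standard `xml.etree.ElementTree`; the `lat_lon_box` helper is hypothetical and `SAMPLE_KML` is a trimmed copy of the slide's sample.

```python
import xml.etree.ElementTree as ET

# Namespace used by the KML sample on the slide.
KML_NS = {"kml": "http://earth.google.com/kml/2.2"}

# Trimmed copy of the slide's sample, keeping only the bounding box.
SAMPLE_KML = """<kml xmlns="http://earth.google.com/kml/2.2">
  <GroundOverlay>
    <name>Yorktown, Indiana (1878)</name>
    <LatLonBox>
      <north>40.26126145890567</north><south>40.25777915632657</south>
      <east>-86.77033439383223</east><west>-86.77398493316619</west>
    </LatLonBox>
  </GroundOverlay>
</kml>"""


def lat_lon_box(kml_text):
    """Return the first LatLonBox as a dict of floats, or None if absent."""
    root = ET.fromstring(kml_text)
    box = root.find(".//kml:LatLonBox", KML_NS)
    if box is None:
        return None
    return {
        side: float(box.findtext(f"kml:{side}", namespaces=KML_NS))
        for side in ("north", "south", "east", "west")
    }


print(lat_lon_box(SAMPLE_KML))
```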
  18. Remaining TODOs
      ● Nested templates are commonly used and cannot be handled by the mappings wiki at the moment
         – e.g. media descriptions (although mapped) are missing
           {{Information |Description= {{en|Logo of the [[w:en:DBpedia|DBpedia project]]}} {{fr| Logo du projet [[w:fr:DBpedia|DBpedia]]}}
      ● Annotation descriptions need some tweaking
         – Need to render wikitext
      ● Put it under a SPARQL endpoint
      ● Provide Linked Data
         – http://commons.dbpedia.org
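
To make the nested-template problem concrete: the value of `Description=` is itself a sequence of language templates, so it has to be parsed with brace matching rather than read as plain text. The following sketch is an illustration of that idea, not a proposal for how the framework or the mappings wiki should handle it; both helper names are hypothetical.

```python
def top_level_templates(text):
    """Return the inner text of each outermost {{...}} found in text."""
    found = []
    depth = 0
    start = 0
    i = 0
    while i < len(text) - 1:
        pair = text[i:i + 2]
        if pair == "{{":
            if depth == 0:
                start = i + 2
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            if depth == 0:
                found.append(text[start:i])
            i += 2
        else:
            i += 1
    return found


def language_descriptions(description_value):
    """Map language codes to text for values like '{{en|...}} {{fr|...}}'."""
    result = {}
    for inner in top_level_templates(description_value):
        # Split on the first "|" only: wikilinks in the text also contain "|".
        name, separator, body = inner.partition("|")
        if separator:
            result[name.strip()] = body.strip()
    return result


value = ("{{en|Logo of the [[w:en:DBpedia|DBpedia project]]}} "
         "{{fr| Logo du projet [[w:fr:DBpedia|DBpedia]]}}")
print(language_descriptions(value))
```

The descriptions still contain wikilink markup, which matches the slide's other TODO about rendering wikitext.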
  19. Thank You!
      Special thanks to:
      ● Alexandru Todor (importing the License templates)
      ● Google Summer of Code for sponsoring this project (Gaurav Vaidya)
      Questions?
      Dataset: http://nl.dbpedia.org/downloads/commonswiki
      Dataset samples: https://github.com/gaurav/commons-extraction
