WDPlus: Leveraging Wikidata to Link and Extend Tabular Data

Today, data about any domain can be found on the web in data repositories, web APIs and many millions of spreadsheets and CSV files. Researchers and organizations publish these data in a myriad of formats, layouts, terminologies and levels of cleanliness, which makes them difficult to integrate. As a result, researchers aiming to use data in their analyses face three main challenges. The first is finding datasets related to a feature, variable or topic of interest. For example, climate scientists need years of observational data from authoritative sources to estimate the climate of a region. The second challenge is completing a given dataset with existing knowledge: machine learning applications are data hungry and need as many data points and features as possible to improve their predictions, which often requires integrating data from different sources. The third challenge is sharing integrated results: once several datasets have been merged, how can they be made available to the rest of the community?

  1. WDPlus: Leveraging Wikidata to Link and Extend Tabular Data
     Daniel Garijo, Pedro Szekely
     Information Sciences Institute and Department of Computer Science
     @dgarijov, dgarijo@isi.edu
  2. Abundance of data sources in the Web
     Users of data face three challenges:
     • How do I find relevant datasets for my problem?
     • How do I augment my dataset with existing information?
     • How can I share my integrated results with the community?
     [Figure: raw data]
     Daniel Garijo and Pedro Szekely. WDPlus: Leveraging Wikidata to Link and Extend Tabular Data. AKTS 2019 (eScience 2019)
  3. Popular initiatives for addressing these challenges
     • Search individual items
         • Search is manual, based on user input
     • LOD cloud of connected datasets
         • Knowledge engineers are needed to map and augment content
     • ETL frameworks (e.g., Karma, OpenRefine)
         • Pipelines are custom, expertise required
         • Often not shared
     Sources: https://lod-cloud.net/versions/2019-03-29/lod-cloud.png; https://panoply.io/data-warehouse-guide/3-ways-to-build-an-etl-process/
  4. WDPlus
     A framework designed to:
     • Discover data on the Web
     • Improve raw data to make it useful
     • Search by querying dataset structure
     • Download fresh data
     • Combine existing datasets
     • Share improved data and methods
     [Figure: core (example entity: Harry Truman), satellites (weather, shopping, crime, sports) and metadata index]
  5. WDPlus architecture: Wikidata as a core KG
     • 60 million items
     • 700 million statements
     • 20,000+ contributors
     • 1+ billion edits
     • Collaborative!
     [Figure: core (example entity: Harry Truman)]
  6. WDPlus architecture: Satellite organization
     • Detailed information on a domain
         • Crime records, sport events, etc.
     • Linked to the Wikidata core
         • Link-first strategy
     • Custom properties and Qnodes
         • Extensions to core model
         • Synchronized with core
     • Decentralized
         • 1 satellite may be maintained by 1 community
     [Figure: core and satellites (weather, shopping, crime, sports)]
  7. WDPlus architecture: Table models
     • Tables are not materialized
     • Able to become a satellite on demand
     • Described in a machine-readable metadata index
         • Indexing column names and relevant instances for fast retrieval
         • Link to table model is preserved
     [Figure: core, satellites (weather, shopping, crime, sports) and metadata index]
  8. Towards WDPlus
     [Figure: core, satellites (weather, shopping, crime, sports) and metadata index]
  9. WDPlus framework: Metadata index and table augmentation
     • Search
         • Keywords, variables or content
         • Wikifier may be used in search
     • Download
         • Download a dataset or its metadata
     • Augment
         • Merge your dataset with contents from other datasets automatically
     • Upload
         • Add new datasets (automated metadata profiling and provenance)
     • Enrich
         • Header enrichment for search efficiency
  10. WDPlus framework: T2WML
      [Figure panels:]
      • Table overview
      • Cell-based mapping. This mapping is saved in WDPlus for future reference
      • Entity linking
      • Result sample
      • Easy to share!
  11. Creating Wikidata satellites: Challenges
      • Identify new properties to model satellites
          • Currently done by hand by knowledge engineers
      • Creation of new Qnodes for satellite instances
          • Identified a schema for each satellite
      • Feedback loop to Wikidata
          • How to select a “trusty” statement when several values are available?
      • Namespace issues
          • Single namespace, or namespace per satellite?
      • Inter-satellite linkages
  12. Conclusions
      • Tabular data exists in heterogeneous formats
          • Difficult to find, use, augment and share
      • WDPlus is a framework to help discover, improve, search, augment, combine and share tabular data
          • WDPlus framework for profiling and enriching datasets
          • T2WML language to generate linked instances from tabular data
      • Encouraging early results on usability
  13. Help us extend WDPlus!
      Do you have comments, suggestions or use cases? Contact me at: dgarijo@isi.edu
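
The slides describe a metadata index over column names and relevant instances, and an "augment" operation that merges a dataset with contents from other datasets. The following is a minimal Python sketch of those two ideas only; it is not the WDPlus implementation or its API. The class MetadataIndex, the function augment, the "qnode" key column and the identifier Q_EXAMPLE_1 are all hypothetical names introduced for illustration, and the sketch keeps table rows in memory for simplicity even though the slides note that tables are not materialized.

```python
from collections import defaultdict


class MetadataIndex:
    """Toy inverted index over column names and cell values (hypothetical, not WDPlus)."""

    def __init__(self):
        self._tables_by_term = defaultdict(set)  # lowercased term -> table ids
        self._tables = {}                        # table id -> list of row dicts

    def add_table(self, table_id, rows):
        """Register a table and index its column names and cell values."""
        self._tables[table_id] = rows
        for row in rows:
            for column, value in row.items():
                self._tables_by_term[column.lower()].add(table_id)
                self._tables_by_term[str(value).lower()].add(table_id)

    def search(self, keyword):
        """Return the ids of tables whose columns or cells match the keyword exactly."""
        return sorted(self._tables_by_term.get(keyword.lower(), set()))

    def get(self, table_id):
        return self._tables[table_id]


def augment(rows, other_rows, key):
    """Left-join other_rows onto rows using a shared key column.

    Assumption: both tables were already linked to the same entity
    identifiers (here a made-up "qnode" column), so the join is exact.
    Existing columns in rows are never overwritten.
    """
    lookup = {other[key]: other for other in other_rows}
    merged_rows = []
    for row in rows:
        merged = dict(row)
        for column, value in lookup.get(row[key], {}).items():
            merged.setdefault(column, value)
        merged_rows.append(merged)
    return merged_rows


# Usage with made-up identifiers and values taken from the slide figure.
index = MetadataIndex()
index.add_table("presidents", [{"qnode": "Q_EXAMPLE_1", "name": "Harry Truman"}])
index.add_table("birthplaces", [{"qnode": "Q_EXAMPLE_1", "birthplace": "Lamar"}])

my_rows = [{"qnode": "Q_EXAMPLE_1", "position": "President"}]
hits = index.search("birthplace")
augmented = augment(my_rows, index.get(hits[0]), key="qnode")
```

Joining on a shared entity identifier is one way to read the "link first" strategy: once both tables refer to the same entities, the merge reduces to an exact key lookup rather than fuzzy matching on names.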
