A Template-Based Approach for Annotating Long-Tailed Datasets

An increasing amount of data is shared on the Web through heterogeneous spreadsheets and CSV files. To homogenize and query these data, the scientific community has developed Extract, Transform and Load (ETL) tools and services that help make these files machine-readable as Knowledge Graphs (KGs). However, tabular data may be complex, and the level of expertise required by existing ETL tools makes it difficult for users to describe their own data. In this paper we propose a simple annotation schema to guide users when transforming complex tables into KGs. We have implemented our approach by extending T2WML, a table annotation tool designed to help users annotate their data and upload the results to a public KG. We evaluated our approach with six non-expert users, obtaining promising preliminary results.

  1. Information Sciences Institute. A Template-Based Approach for Annotating Long-Tailed Datasets. Daniel Garijo, Ke-Thia Yao, Amandeep Singh and Pedro Szekely, {dgarijo, kyao, amandeep, szeke}@isi.edu, @dgarijov. This work was funded by the Defense Advanced Research Projects Agency (DARPA).
  2. Transforming tabular data into KGs... (figure label: Expert)
  3. Transforming tabular data into KGs... How can we ease the process for non-experts? (figure label: Expert)
  4. Challenges: Annotation. Example table: Oil production. Figure labels: Subject to annotate, Variable (predicate), Object (values), Time, Qualifiers.
  5. Challenges: Annotation. Figure labels: Oil production, Oil price, Units!, Time.
     - Multiple variables, missing values, etc.
  6. Challenges: Summary. How to create a way for non-experts to annotate their data…
     - Without having to learn a mapping language
     - Capturing qualifiers of described variables
     - Ignoring undesired columns/incomplete cells
     - Share the results as part of a public KG
  7. Proposed workflow. Users should be able to:
     1. Annotate their data
     2. Preview their progress
     3. Share their results (KG)
  8. Annotation schema
     - We adopt the Wikidata data model (s,p,o,q,r)
     - Add 7 rows to define metadata
     - https://t2wml-annotation.readthedocs.io
  9. T2WML Extension. Steps: Load data; Link and review; Preview and upload (or save). https://github.com/usc-isi-i2/t2wml
  10. Sharing annotated datasets with Datamart. https://github.com/usc-isi-i2/datamart-api. Implementation (password protected): https://dsbox02.isi.edu/datamart-api/. REST API. Datamart:
     - Metadata catalog (search variables, datasets, locations, etc.)
     - Data catalog (time series data)
  11. (Very) preliminary results. Evaluation with users:
     - 6 users (not familiar with Semantic Web technologies)
     - Knowledge in Data Science/Scripting
     - 1 hour training in T2WML/schema
     - 3 datasets (each dataset was assigned to two users)
     Results:
     - All users were able to describe and upload their data
     - Trouble understanding differences between variables and qualifiers
  12. Conclusions and future work
     - Non-experts should be empowered to populate existing KGs with their own data.
     - We propose a simple workflow to let users annotate, preview and share their data as a KG.
     - Next steps: incorporate table understanding approaches in the annotation process, so that less effort is required from users.
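
The annotation idea on slides 4 to 8 (a subject, a variable used as predicate, a value, and qualifiers such as time and units) can be sketched in Python. This is a minimal illustration under stated assumptions, not T2WML's implementation: the `Statement` class, the role names in `ROLES`, the `UNITS` mapping and the toy rows are all hypothetical, and references (the r in the (s,p,o,q,r) model) are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Statement:
    """One (subject, predicate, value) claim plus its qualifiers."""
    subject: str
    predicate: str
    value: float
    qualifiers: dict = field(default_factory=dict)


# Hypothetical column roles for this sketch (not T2WML's actual vocabulary).
ROLES = {
    "country": "subject",
    "year": "qualifier",
    "oil_production": "variable",
    "oil_price": "variable",
    "notes": "ignore",
}

# Hypothetical units, attached to each variable as an extra qualifier.
UNITS = {"oil_production": "barrels/day", "oil_price": "USD/barrel"}

# Made-up toy rows; the empty cell stands for a missing value.
ROWS = [
    {"country": "Country A", "year": "2019",
     "oil_production": "100", "oil_price": "60", "notes": "x"},
    {"country": "Country A", "year": "2020",
     "oil_production": "", "oil_price": "40", "notes": ""},
]


def table_to_statements(rows, roles, units):
    """Turn annotated table rows into statements, one per non-empty variable cell."""
    subject_col = next(col for col, role in roles.items() if role == "subject")
    qualifier_cols = [col for col, role in roles.items() if role == "qualifier"]
    variable_cols = [col for col, role in roles.items() if role == "variable"]

    statements = []
    for row in rows:
        for var in variable_cols:
            cell = row.get(var, "").strip()
            if not cell:
                continue  # skip incomplete cells instead of failing
            qualifiers = {col: row[col] for col in qualifier_cols}
            if var in units:
                qualifiers["unit"] = units[var]
            statements.append(
                Statement(row[subject_col], var, float(cell), qualifiers)
            )
    return statements


if __name__ == "__main__":
    for statement in table_to_statements(ROWS, ROLES, UNITS):
        print(statement)
```

With the toy rows above, the sketch yields three statements: the empty oil production cell for 2020 is skipped, and the column marked `ignore` never produces a statement.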
