Resource Classification as the Basis for a Visualization Pipeline in LOD Scenarios

Slide support for MTSR'15

Published in: Science

  1. Resource Classification as the Basis for a Visualization Pipeline in LOD Scenarios
     Oscar Peña del Rio, Unai Aguilera and Diego López de Ipiña
     DeustoTech, University of Deusto
  2. Motivation
     • The Semantic Web is still waiting for widespread adoption
     • Current work focuses on ontology development, provenance and the supporting technology stack
     • Little is known about the Semantic Web (SW) outside the research community
     • Its potential benefits should be aimed at non-technical users
  3. Data overview
     • Users need to get the whole picture of a dataset before working with it
     • Time/resource constraints and a lack of expert knowledge are usually present
     • Based on Ben Shneiderman’s Overview task (from his Visual Information Seeking Mantra)
     • Diverse approaches compute basic statistics to fulfill this task (counts, averages, min/max, etc.)
  4. Natural approaches to data exploration
     • Take some ideas from Tukey’s Exploratory Data Analysis (EDA) field
     • EDA proposes different approaches to get an overview of the data
     • The techniques lack the rigor of more formal methodologies and take a more data-driven perspective
     • Data discovery is more natural this way, in line with the "follow your nose" principle
  5. Visualization Pipeline
     • raw data: rdf, json-ld, …
     • analysis operators: statistical analysis, datatype inference…
     • visual transformations: how to encode data in visual elements
     • recommender engine: learned lessons, best practices & fit models
     • end user visualizations: Web browser accessible visualization
     Defending visualization as the means for a coherent, understandable Semantic Web beneficial for all actors
  6. Extract resource features
     • We focus on the data itself to infer its structure and relevance within the whole dataset
     • The data is accessed directly through SPARQL queries
     • Property usage: # unique class instances / # instance objects
     • Completeness ratio: # values assigned to property / # instance objects
     • Example values: dc:title -> 1, foaf:nick -> 3.4, dc-terms:license -> 0.12, foaf:name -> 1, foaf:title -> 0.36
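The two ratios on this slide can be sketched in a few lines of Python. This is only one possible reading of the slide's definitions, not the authors' implementation: "property usage" is taken here as the share of a class's instances that use the property at least once, and "completeness ratio" as the number of values assigned to the property divided by the number of instances. The function name `property_metrics` and the in-memory triple list are hypothetical; in the deck the counts come from SPARQL queries.

```python
from collections import defaultdict

def property_metrics(triples, instances):
    """Return {property: (usage, completeness)} for one class.

    triples   -- iterable of (subject, property, value) tuples
    instances -- subjects that belong to the class under study
    Assumed definitions (an interpretation of the slide):
      usage        = instances using the property / all instances
      completeness = values assigned to the property / all instances
    """
    instances = set(instances)
    n = len(instances)
    if n == 0:
        return {}
    subjects_using = defaultdict(set)  # property -> instances that use it
    value_count = defaultdict(int)     # property -> number of values assigned
    for subject, prop, _value in triples:
        if subject in instances:
            subjects_using[prop].add(subject)
            value_count[prop] += 1
    return {
        prop: (len(subjects_using[prop]) / n, value_count[prop] / n)
        for prop in value_count
    }

# Toy data: two instances, one of which has three nicknames.
triples = [
    ("ex:a", "foaf:name", "Ann"),
    ("ex:b", "foaf:name", "Bob"),
    ("ex:a", "foaf:nick", "annie"),
    ("ex:a", "foaf:nick", "an"),
    ("ex:a", "foaf:nick", "a"),
]
metrics = property_metrics(triples, ["ex:a", "ex:b"])
```

Under this reading a completeness ratio above 1 (such as foaf:nick -> 3.4 on the slide) simply means that instances carry several values for the property on average.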
  7. Primitive datatype inference
     • Required to understand how each property may be interpreted, the operations it allows, and how properties relate to each other
     • We define the following classification categories:
       • Integer
       • Float
       • Boolean
       • IRI
       • String
       • Geographical component
       • Datetime component
       • Categorical data
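To make the idea of inferring a primitive datatype from untyped literals concrete, here is a minimal Python sketch. It is not the inference algorithm from the deck, whose rules are not given on the slide. The regular expressions, the majority vote over a property's values, the `categorical_ratio` threshold and both function names are assumptions chosen for illustration, and the geographical category is left out entirely.

```python
import re
from collections import Counter
from datetime import datetime

def classify_literal(value):
    """Guess a category for a single literal (hypothetical rules)."""
    text = value.strip()
    if text.lower() in ("true", "false"):
        return "Boolean"
    if re.fullmatch(r"[+-]?\d+", text):
        return "Integer"
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)([eE][+-]?\d+)?", text):
        return "Float"
    if re.fullmatch(r"[A-Za-z][A-Za-z0-9+.-]*://\S+", text):
        return "IRI"
    try:
        datetime.fromisoformat(text)  # ISO 8601 dates and datetimes only
        return "Datetime"
    except ValueError:
        return "String"

def classify_property(values, categorical_ratio=0.2):
    """Majority vote over a property's values.

    A string property whose distinct values are few compared with the
    total (assumed threshold: categorical_ratio) is labeled Categorical.
    Returns None when there are no values to inspect.
    """
    values = list(values)
    if not values:
        return None
    kind, _count = Counter(classify_literal(v) for v in values).most_common(1)[0]
    distinct_share = len(set(values)) / len(values)
    if kind == "String" and len(values) >= 5 and distinct_share <= categorical_ratio:
        return "Categorical"
    return kind
```

For example, `classify_property(["red", "blue"] * 10)` yields `"Categorical"` under these assumptions, while `classify_property(["1", "2", "3"])` yields `"Integer"`.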
  8. Results
     • 190 properties evaluated (149 unique)
     • 5 datasets (approx. 10M triples)
     • Diverse topics
     • Inference algorithm tested against agreement between 6 experts (>80% agreement, 5 out of 6)

     Dataset            TP    TN   FP   FN   Cat   Correct
     Air quality        17   160    2   10     5    93.65%
     Restaurants        17   201    3   17     5    91.6%
     Historical sites   14   165    4   13     3    91.33%
     MORElab            56   399   15   13    12    94.2%
     Teseo              22   162    4    1     3    97.35%
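The slide does not say how the Correct column is computed, but the figures are consistent with plain accuracy, (TP + TN) / (TP + TN + FP + FN). The short Python check below reproduces them under that assumption; the `accuracy` helper and the `results` dictionary are illustrative names, and the counts are copied from the results table.

```python
def accuracy(tp, tn, fp, fn):
    """Share of correct decisions: (TP + TN) / all decisions."""
    return (tp + tn) / (tp + tn + fp + fn)

# (TP, TN, FP, FN) per dataset, as reported in the results table.
results = {
    "Air quality": (17, 160, 2, 10),
    "Restaurants": (17, 201, 3, 17),
    "Historical sites": (14, 165, 4, 13),
    "MORElab": (56, 399, 15, 13),
    "Teseo": (22, 162, 4, 1),
}

for name, counts in results.items():
    print(f"{name}: {accuracy(*counts):.2%}")
```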
  9. Conclusions & future steps
     • Common pitfalls addressed
       • Missing property datatypes & ranges
       • Incorrect typing / usage
       • Redundancy
       • Most instances typed as plain, literal strings
     • Feed all the features to a classifier in order to create Entity Visualization Templates (work in progress)
     • Recommend coherent visual representations for each template
  10. Thank you
      Oscar Peña del Rio
      oscar.pena@deusto.es
