Web Data Management
Advanced Database Presentation
By:
Navid Sedighpour
Professor :
Dr. Alireza Bagheri
November 2015
1
Interest
Lack of schema
Data is unstructured or at best “semi-structured”
Missing data, additional attributes, similar data but not identical
Volatility
May conform to one schema now, but not later
Scale
How to capture everything?
Querying Difficulty
What is the user language?
What are the primitives?
Aren’t Search Engines sufficient?
2
More Recent Approaches to Web Querying
Fusion Tables
Users contribute data in spreadsheets
Possible joins between multiple data sets
Extensive visualization
3
More Recent Approaches to Web Querying
XML
Data exchange language
Tree based structure
4
More Recent Approaches to Web Querying
RDF
W3C Recommendation
Simple, self-descriptive model
5
RDF Data Volumes
90% of the world's data was generated over the last two years
Data are growing fast
Size almost doubling every year
6
RDF Data Volumes
March 2009 – 89 Datasets
7
RDF Data Volumes
September 2010 – 203 datasets
8
RDF Data Volumes
September 2011 – 295 Datasets
9
RDF Data Volumes
April 2014 – 1091 Datasets
10
RDF Introduction
Everything is a uniquely named resource
Prefixes can be used to shorten names
Properties of resources can be defined
Relationships with other resources can be defined
Resource descriptions can be contributed by different people/groups and can be located anywhere on the web
Integrated web “database”
11
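A minimal sketch of how prefixes shorten resource names. The prefix table and the `expand` helper are hypothetical names used only for illustration; `http://example.org/` is a placeholder namespace.

```python
# Hypothetical prefix table: short prefix -> full namespace URI
PREFIXES = {
    "ex": "http://example.org/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand(name, prefixes=PREFIXES):
    """Expand a prefixed name such as 'foaf:name' into a full URI.

    Names whose prefix is not in the table are returned unchanged.
    """
    prefix, sep, local = name.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return name

full_name = expand("foaf:name")
untouched = expand("http://example.org/film/1")
```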
RDF Data Model
Triple : Subject, Predicate (Property) , Object
Subject : The entity that is described (URI or Blank Node)
Predicate : a feature of the entity
Object : value of the feature
Set of RDF Triples is called “RDF Graph”
12
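The data model above can be sketched in a few lines. This is only an illustration: the `Triple` type, the `describe` helper and the example resources are assumptions, not part of the presentation.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str    # the entity being described
    predicate: str  # a feature (property) of the entity
    obj: str        # the value of that feature

EX = "http://example.org/"  # placeholder namespace

# An RDF graph is simply a set of triples (made-up example data)
rdf_graph = {
    Triple(EX + "film/1", EX + "title", "The Shining"),
    Triple(EX + "film/1", EX + "directedBy", EX + "person/1"),
    Triple(EX + "person/1", EX + "name", "Stanley Kubrick"),
}

def describe(graph, subject):
    """Return the sorted (predicate, object) pairs of one subject."""
    return sorted((t.predicate, t.obj) for t in graph if t.subject == subject)

person = describe(rdf_graph, EX + "person/1")
```

The object of one triple (`person/1`) is the subject of another, which is what turns the set of triples into a graph.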
RDF Example Instance
13
RDF Graph
14
SPARQL Queries
15
Naïve Triple Store Design
16
Naïve Triple Store Design
Easy to implement
But too many self-joins
17
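A small sketch of why the naïve design leads to self-joins, using Python's built-in `sqlite3`. The table layout (one three-column table) follows the slide; the column names and the data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Naive design: every triple is one row of a single table
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, obj TEXT)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("film1", "title", "The Shining"),
    ("film1", "directedBy", "person1"),
    ("person1", "name", "Stanley Kubrick"),
    ("film2", "title", "Alien"),
    ("film2", "directedBy", "person2"),
    ("person2", "name", "Ridley Scott"),
])

# "Film titles with the names of their directors" has three triple
# patterns, so the triples table is joined with itself twice.
query = """
SELECT t.obj, n.obj
FROM triples AS t
JOIN triples AS d ON d.subject = t.subject
JOIN triples AS n ON n.subject = d.obj
WHERE t.predicate = 'title'
  AND d.predicate = 'directedBy'
  AND n.predicate = 'name'
ORDER BY t.obj
"""
rows = conn.execute(query).fetchall()
```

Each additional triple pattern in a query adds one more self-join over the same large table.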
Property Tables
Grouping by Entities
Types :
Clustered Property Tables
Property Class Tables
18
Clustered Property Tables
Group together the properties that tend to occur in the same (or similar) subjects
19
Property Class Tables
Cluster subjects of the same type (class) into one property table
20
Property Tables
Advantages :
Fewer Joins
Disadvantages :
Lots of NULLs
Clustering is not trivial
Multi-valued properties are complicated
21
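A rough sketch of a property table built from triples, to make the NULL problem visible. The column choice, the data and the `property_table` helper are assumptions for illustration; `None` stands in for SQL NULL.

```python
# Made-up triples: (subject, predicate, object)
triples = [
    ("film1", "title", "The Shining"),
    ("film1", "year", "1980"),
    ("film2", "title", "Alien"),
    ("book1", "title", "Dune"),
    ("book1", "author", "Frank Herbert"),
]
columns = ["title", "year", "author"]  # properties clustered into one table

def property_table(triples, columns):
    """One row per subject, one column per property; None plays the role of NULL.

    A second value for the same (subject, property) would overwrite the
    first, which is why multi-valued properties need extra handling.
    """
    rows = {}
    for s, p, o in triples:
        row = rows.setdefault(s, dict.fromkeys(columns))
        row[p] = o
    return rows

table = property_table(triples, columns)
null_count = sum(v is None for row in table.values() for v in row.values())
```

All properties of a subject sit in one row, so no join is needed to read them, but 4 of the 9 cells in this tiny table are NULL.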
Binary Tables
Grouping by properties: for each property, build a two-column table containing both subject and object, ordered by subject
Also called the “Vertically Partitioned Approach”
n two-column tables (n is the number of unique properties in the data)
22
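A minimal sketch of vertical partitioning: one two-column table per property, each ordered by subject. The function name and the data are illustrative assumptions.

```python
from collections import defaultdict

def vertically_partition(triples):
    """Split triples into one (subject, object) table per property."""
    tables = defaultdict(list)
    for s, p, o in triples:
        tables[p].append((s, o))
    for rows in tables.values():
        rows.sort()  # order each binary table by subject
    return dict(tables)

triples = [
    ("film1", "title", "The Shining"),
    ("film1", "actor", "Shelley Duvall"),
    ("film1", "actor", "Jack Nicholson"),  # multi-valued property: just another row
    ("film2", "title", "Alien"),
]
tables = vertically_partition(triples)
```

A subject without a given property simply has no row in that table, so there are no NULLs.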
Binary Tables
Advantages :
Support multi-valued Properties
No NULLs
No Clustering
Good performance for subject-subject joins
Disadvantages:
Not useful for subject-object joins
Expensive inserts
23
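Because every binary table is ordered by subject, a subject-subject join can be done as a merge join in one pass over both tables. The sketch below is a generic merge join with made-up data, not code from any particular system.

```python
def merge_join(left, right):
    """Join two (subject, object) tables, both sorted by subject, on subject."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        ls, rs = left[i][0], right[j][0]
        if ls < rs:
            i += 1
        elif ls > rs:
            j += 1
        else:
            # Find the run of rows sharing this subject on each side
            i_end = i
            while i_end < len(left) and left[i_end][0] == ls:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == ls:
                j_end += 1
            for _, l_obj in left[i:i_end]:
                for _, r_obj in right[j:j_end]:
                    out.append((ls, l_obj, r_obj))
            i, j = i_end, j_end
    return out

title = [("film1", "The Shining"), ("film2", "Alien")]
actor = [("film1", "Jack Nicholson"), ("film1", "Shelley Duvall"), ("film3", "Unknown")]
joined = merge_join(title, actor)
```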
Graph-Based Approach
Answering SPARQL query = Subgraph Matching
gStore
24
Graph-Based Approach
Two steps need to be done:
1. For each node of Q*, get the list of nodes in G* that include that node
2. Do a multi-way join to get the candidate list
Alternatives:
Sequential scan of G*
• Both steps are inefficient
S-Tree
• Height-balanced tree over signatures
• Run an inclusion query for each node of Q* and get the list of nodes in G* that include that node (q & s = q)
VS-Tree
• Supports both steps efficiently
• Grouping by vertices
25
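The inclusion test q & s = q can be sketched with integer bitstrings. The hash-based `signature` function below is a simplified assumption for illustration and is not claimed to be gStore's actual encoding; `candidates` is the sequential-scan alternative for step 1.

```python
import zlib

SIG_BITS = 32  # hypothetical signature width

def signature(edges):
    """OR together one bit per (predicate, neighbour) pair of a vertex."""
    sig = 0
    for predicate, neighbour in edges:
        h = zlib.crc32(f"{predicate}|{neighbour}".encode())
        sig |= 1 << (h % SIG_BITS)
    return sig

def candidates(q_sig, data_sigs):
    """Sequential scan: keep data vertices whose signature includes q_sig."""
    return [v for v, s in data_sigs.items() if (q_sig & s) == q_sig]

# Explicit bitstrings make the inclusion test easy to follow
matches = candidates(0b0101, {"u": 0b1101, "w": 0b1001})

# With hashed signatures, a vertex having all query edges always passes
data = {"v1": signature([("title", "Alien"), ("year", "1979")])}
query_sig = signature([("title", "Alien")])
hashed_matches = candidates(query_sig, data)
```

The test can produce false positives (different edges may hash to the same bit) but no false negatives, so the result is a candidate list that still has to be verified.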
S-Tree
26
Pruning
S-Tree
27
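A toy sketch of pruning in a tree over signatures, assuming each internal node stores the bitwise OR of its children's signatures. The `Node` class and the hand-built two-level tree are illustrative assumptions; a real S-Tree is height balanced and built by insertion.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    sig: int
    vertex: Optional[str] = None              # set only on leaves
    children: List["Node"] = field(default_factory=list)

def parent(children):
    """Internal node whose signature is the OR of its children's signatures."""
    sig = 0
    for child in children:
        sig |= child.sig
    return Node(sig=sig, children=list(children))

def search(node, q, visited):
    """Inclusion query: skip any subtree whose signature does not include q."""
    visited.append(node)
    if (q & node.sig) != q:
        return []                              # prune this whole subtree
    if node.vertex is not None:
        return [node.vertex]
    found = []
    for child in node.children:
        found.extend(search(child, q, visited))
    return found

leaves = [Node(0b0011, "a"), Node(0b0001, "b"), Node(0b1100, "c"), Node(0b1000, "d")]
root = parent([parent(leaves[:2]), parent(leaves[2:])])

visited = []
result = search(root, 0b0010, visited)
```

Here the right subtree (signature `1100`) is rejected at its root, so leaves `c` and `d` are never inspected.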
S-Tree
28
S-Tree
29
S-Tree
30
VS-Tree
31
VS-Tree
32
Conclusion
RDF data seems to hold considerable promise for web data management
We covered four approaches to managing RDF data: naïve triple store design, property tables, binary tables and the graph-based approach
Among the graph-based alternatives, VS-Tree has the best performance
gStore is more efficient than the other approaches discussed
33
References
[1] D. J. Abadi, A. Marcus, S. R. Madden, and K. Hollenbach, "Scalable semantic web data management using vertical partitioning," in Proceedings of the 33rd International Conference on Very Large Data Bases, 2007, pp. 411-422.
[2] L. Zou, J. Mo, L. Chen, M. T. Özsu, and D. Zhao, "gStore: answering SPARQL queries via subgraph matching," Proceedings of the VLDB Endowment, vol. 4, pp. 482-493, 2011.
[3] L. Zou, M. T. Özsu, L. Chen, X. Shen, R. Huang, and D. Zhao, "gStore: a graph-based SPARQL query engine," The VLDB Journal, vol. 23, pp. 565-590, 2014.
[4] X. Shen, L. Zou, M. T. Özsu, L. Chen, Y. Li, S. Han, et al., "A Graph-based RDF Triple Store."
34

Scalable Web Data Management using RDF
