Semantic Repositories - Performance factors and design choices
 


presentation from the SemData@Sofia Roundtable, March 2010


    • Semantic Repositories: Performance factors and design choices. Atanas Kiryakov, SemData@Sofia Roundtable, Mar, 2010
    • Semantic Repositories
      • Semantic repositories combine characteristics of:
        • database management systems (DBMS) and
        • inference engines
      • Semantic repositories take the role of web servers
        • They are able to hold, interpret, and serve requests from multiple users against massive amounts of data
      • Semantic repositories are still in the initial phase of rapid development
        • Since 2004, the engines have become an order of magnitude faster and more scalable every couple of years
    • Semantic Repositories = Track-laying machines
      • Each new development allows loading more data, dealing with more comprehensive schemata and ontologies, and answering more complex queries in less time
      • As in mountain climbing, each new achievement opens new opportunities and challenges
      • Semantic repositories can also be seen as track-laying machines that extend the reach of the data railways step by step, changing the data economy of entire domains and areas by allowing more and more complex data to be handled at lower cost
    • Semantic Repositories = Track-laying machines (2)
    • Semantic Repositories = Track-laying machines (3)
    • Semantic Repositories
      • We build upon lightweight semantics that is easy to understand, deploy, and manage
      • For instance, think of ontologies as database schemata with simple interpretation rules. Plenty of obvious (but useful) implicit facts can be inferred and matched by queries right away
    • It is simple
      [Figure: RDF graph of the example. Node and edge labels: myData:Maria, myData:Ivan, ptop:Agent, ptop:Person, ptop:Woman; rdf:type, rdfs:subClassOf, rdfs:range, rdfs:subPropertyOf, ptop:childOf, ptop:parentOf, ptop:relativeOf, owl:inverseOf, owl:SymmetricProperty; some statements are marked as inferred]
    • Get more facts – Match more queries
      • <C1,rdfs:subClassOf,C2> and <C2,rdfs:subClassOf,C3> => <C1,rdfs:subClassOf,C3>
      • <I,rdf:type,C1> and <C1,rdfs:subClassOf,C2> => <I,rdf:type,C2>
      • <P1,owl:inverseOf,P2> and <I1,P1,I2> => <I2,P2,I1>
      • <P1,rdf:type,owl:SymmetricProperty> => <P1,owl:inverseOf,P1>
        • The database will return Ivan as a result of the query
        • Maria relativeOf ?x
        • when the asserted fact was
        • Ivan childOf Maria
    • The Semantics is Encoded in Simple Rules
      • <C1,rdfs:subClassOf,C2> and <C2,rdfs:subClassOf,C3> => <C1,rdfs:subClassOf,C3>
      • <I,rdf:type,C1> and <C1,rdfs:subClassOf,C2> => <I,rdf:type,C2>
      • <P1,owl:inverseOf,P2> and <I1,P1,I2> => <I2,P2,I1>
      • <P1,rdf:type,owl:SymmetricProperty> => <P1,owl:inverseOf,P1>
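The rules on this slide can be applied repeatedly until no new statements appear. The following is a minimal sketch of that idea, not the engine's actual implementation. The schema statements are assumptions reconstructed from the labels in the figure, and the rdfs:subPropertyOf rule is an extra, standard RDFS rule that is not in the list above but is needed to reach ptop:relativeOf.

```python
# Minimal forward-chaining sketch over (subject, predicate, object) triples.
SUBCLASS = "rdfs:subClassOf"
SUBPROP = "rdfs:subPropertyOf"  # assumption: rule not listed on the slide
TYPE = "rdf:type"
INVERSE = "owl:inverseOf"
SYMMETRIC = "owl:SymmetricProperty"


def infer_once(facts):
    """Apply every rule once and return the triples that can be derived."""
    derived = set()
    for (s, p, o) in facts:
        if p == SUBCLASS:
            for (s2, p2, o2) in facts:
                # <C1,subClassOf,C2> and <C2,subClassOf,C3> => <C1,subClassOf,C3>
                if p2 == SUBCLASS and s2 == o:
                    derived.add((s, SUBCLASS, o2))
                # <I,type,C1> and <C1,subClassOf,C2> => <I,type,C2>
                if p2 == TYPE and o2 == s:
                    derived.add((s2, TYPE, o))
        elif p == INVERSE:
            # <P1,inverseOf,P2> and <I1,P1,I2> => <I2,P2,I1>
            for (s2, p2, o2) in facts:
                if p2 == s:
                    derived.add((o2, o, s2))
        elif p == TYPE and o == SYMMETRIC:
            # <P1,type,SymmetricProperty> => <P1,inverseOf,P1>
            derived.add((s, INVERSE, s))
        elif p == SUBPROP:
            # assumed rule: <P1,subPropertyOf,P2> and <I1,P1,I2> => <I1,P2,I2>
            for (s2, p2, o2) in facts:
                if p2 == s:
                    derived.add((s2, o, o2))
    return derived


def materialize(triples):
    """Repeat the rules until a fixpoint is reached (full materialization)."""
    facts = set(triples)
    while True:
        new = infer_once(facts) - facts
        if not new:
            return facts
        facts |= new


# Assumed schema, reconstructed from the figure labels.
SCHEMA = {
    ("ptop:Woman", SUBCLASS, "ptop:Person"),
    ("ptop:Person", SUBCLASS, "ptop:Agent"),
    ("ptop:childOf", INVERSE, "ptop:parentOf"),
    ("ptop:parentOf", SUBPROP, "ptop:relativeOf"),
    ("ptop:relativeOf", TYPE, SYMMETRIC),
}
DATA = {
    ("myData:Maria", TYPE, "ptop:Woman"),
    ("myData:Ivan", "ptop:childOf", "myData:Maria"),  # the only asserted relation
}

closure = materialize(SCHEMA | DATA)
# Query: Maria relativeOf ?x
relatives = {o for (s, p, o) in closure
             if s == "myData:Maria" and p == "ptop:relativeOf"}
print(relatives)  # {'myData:Ivan'}
```

The nested loops are quadratic in the number of statements, which is fine for a toy example; a real repository would rely on indexes to find matching premises.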
    • Physical data representation: RDF vs. RDBMS
      Relational tables:
        Person
          ID | Name     | Gender
          1  | Maria P. | F
          2  | Ivan Jr. | M
          3  | …
        Parent
          ParID | ChiID
          1     | 2
          …
        Spouse
          S1ID | S2ID | From | To
          1    | 3    | …
      RDF statements:
        Statement
          Subject    | Predicate  | Object
          myo:Person | rdf:type   | rdfs:Class
          myo:gender | rdf:type   | rdf:Property
          myo:parent | rdfs:range | myo:Person
          myo:spouse | rdfs:range | myo:Person
          myd:Maria  | rdf:type   | myo:Person
          myd:Maria  | rdfs:label | “Maria P.”
          myd:Maria  | myo:gender | “F”
          myd:Ivan   | rdfs:label | “Ivan Jr.”
          myd:Ivan   | myo:gender | “M”
          myd:Maria  | myo:parent | myd:Ivan
          myd:Maria  | myo:spouse | myd:John
          …
    • Semantic Repositories vs. RDBMS
      • The major differences from a DBMS can be summarized as follows:
      • they use ontologies as semantic schemata, which allows them to automatically reason about the data;
      • they work with generic physical data models, which allows them to easily accommodate updates and extensions to the schemata, i.e. to the structure of the data.
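To illustrate the point about generic physical data models, here is a small sketch that flattens the Person rows from the previous slide into statements and then extends the data with a property the relational table never had. The identifier scheme (myd:person1) and the property myo:nickname with its value are hypothetical and only serve the illustration.

```python
# Rows of the relational Person table from the previous slide.
person_rows = [
    {"ID": 1, "Name": "Maria P.", "Gender": "F"},
    {"ID": 2, "Name": "Ivan Jr.", "Gender": "M"},
]


def person_to_triples(row):
    """Flatten one Person row into (subject, predicate, object) statements."""
    subject = f"myd:person{row['ID']}"  # hypothetical identifier scheme
    return [
        (subject, "rdf:type", "myo:Person"),
        (subject, "rdfs:label", row["Name"]),
        (subject, "myo:gender", row["Gender"]),
    ]


statements = [t for row in person_rows for t in person_to_triples(row)]

# Extending the structure of the data is just one more statement with a new
# predicate; no table definition has to change.
# myo:nickname and its value are made up for this example.
statements.append(("myd:person1", "myo:nickname", "Mimi"))

for statement in statements:
    print(statement)
```

In a relational design the same extension would typically mean adding a column or a new table, which is the contrast the slide draws.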
    • Semantic Repositories vs. Column Stores
      • Although RDF databases and column stores share many design principles, a typical column store differs from an RDF-based semantic repository in several ways:
        • Globally unique identifiers. An important feature of RDF as a data representation model is that it is based on the notion of Uniform Resource Identifiers (URIs)
        • Standard compliance. While there are no well-established standards in the area of column stores, RDF-based semantic repositories are highly interoperable with one another on the basis of a whole ecosystem of languages for schema definition, ontology definition, and querying
      • Semantic repositories can be described as “RDF-based column stores with inference capabilities”.
    • Semantic Repositories: Major Characteristics
      • Easy integration of multiple data sources
        • once the schemata of these sources are semantically aligned, the inference capabilities of the engine support the interlinking and combination of facts from the different sources;
      • Easy querying against rich or diverse data schemata
        • inference is applied to match the semantics of the query to the semantics of the data, regardless of the vocabulary and the data modeling patterns used to encode the data;
    • Semantic Repositories: Major Characteristics (2)
      • Great analytical power
        • one can count on the semantics being thoroughly applied, even when this requires recursive inference over multiple steps
        • uncover facts by interlinking long chains of evidence; the vast majority of those facts would remain unspotted in a DBMS
      • Efficient data interoperability
        • importing RDF data from one store to another is straightforward thanks to the use of globally unique identifiers
    • Tasks to be Benchmarked
      • Data loading
        • including parsing, persistence, and indexing
      • Query evaluation
        • including query preparation and optimization and fetching
      • Data modification
        • which may involve changes to the ontologies and the schemata
      • Inference is not a first-level activity
        • Depending on the implementation, it can affect the performance of the other activities. In the current implementation of the data layer, inference is performed during loading and affects its performance.
    • Performance Factors for Loading
      • Materialization
        • whether and to what extent forward-chaining is performed at load time; complexity of the forward-chaining;
      • Data model complexity
        • support for extended RDF data models, e.g. those that include named graphs, is computationally more "expensive"
      • Indexing specifics
        • repositories can apply different indexing strategies depending on the data loaded, usage patterns, hardware constraints, etc.;
      • Transaction Isolation
    • Performance Factors for Query Evaluation
      • Deduction
        • whether and how complex backward-chaining is involved, whether it is recursive, etc.
      • Size of the result-set
        • fetching large result-sets can take considerable time
      • Query complexity
        • the number of the constraints (e.g. triple-pattern joins), the semantics of the query (e.g. negation- and disjunction-related clauses), the usage of operators that are tough to support through indexing (e.g. LIKE)
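As a contrast to materialization at load time, the deduction factor above can be pictured as work done per query. The sketch below answers a type query by walking the class hierarchy backwards from the goal at query time, so nothing inferred is ever stored. It is an illustrative assumption of how backward-chaining might look for a single rule, and the function names are hypothetical.

```python
def subclasses(target, subclass_of):
    """Return target plus every class that reaches it via rdfs:subClassOf."""
    found, frontier = {target}, [target]
    while frontier:
        parent = frontier.pop()
        for child, sup in subclass_of:
            if sup == parent and child not in found:
                found.add(child)
                frontier.append(child)  # recursive step: follow the chain down
    return found


def instances_of(target, type_facts, subclass_of):
    """Answer '?x rdf:type target' using only explicit type statements."""
    wanted = subclasses(target, subclass_of)
    return {inst for inst, cls in type_facts if cls in wanted}


# Explicit statements only; nothing is materialized.
SUBCLASS_OF = [("ptop:Woman", "ptop:Person"), ("ptop:Person", "ptop:Agent")]
TYPE_FACTS = [("myData:Maria", "ptop:Woman")]

agents = instances_of("ptop:Agent", TYPE_FACTS, SUBCLASS_OF)
print(agents)  # {'myData:Maria'}
```

The cost of the hierarchy walk is paid on every query, which is why recursive backward-chaining shows up as a query-time performance factor.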
    • Performance Factors for Query Evaluation (2)
      • Number of clients
        • number of simultaneous client requests
      • Quality of results
        • the quality of results required in modes where incomplete answers are allowed
    • Performance Dimensions
      • Scale
        • the size of the repository in terms of number of RDF triples
      • Schema and data complexity
        • the complexity of the ontology/logical language
        • the specific ontology (or schema) and the dataset
        • E.g. a highly interconnected dataset with long chains of transitive properties can be quite challenging for reasoning
        • sparse versus dense datasets
        • presence and size of literals
        • number of predicates used
        • usage of owl:sameAs and other alignment primitives
    • Performance Dimensions (2)
      • Hardware and software setup
        • version and configuration of the compiler or the virtual machine
        • the operating system and the file system
          • and their configuration
        • the configuration of the engine itself
        • the hardware configuration, of course
    • Scale! Which one?
      • Number of inserted statements (NIS)
        • How many statements have been inserted in the repository?
      • Number of stored statements (NSS)
        • How many statements have been stored and indexed?
        • Duplicates can make NSS smaller than NIS
        • For engines using forward-chaining and materialization, the volume of the data to be indexed includes the inferred triples
      • Number of retrievable statements (NRS)
        • How many different statements can be retrieved?
        • This number can be different from NSS when the repository supports some sort of backward-chaining
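The three counts can diverge even on a tiny dataset. The sketch below computes them for a hypothetical three-statement insert (the ex: names are made up), once for a store that materializes at load time and once for a store that infers at query time. Only the type/subclass rule is applied, and a single pass is enough for this data.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

# Hypothetical input: the same statement is inserted twice.
inserted = [
    ("ex:a", RDF_TYPE, "ex:C1"),
    ("ex:a", RDF_TYPE, "ex:C1"),  # duplicate insert
    ("ex:C1", SUBCLASS, "ex:C2"),
]


def inferred_types(facts):
    """<I,rdf:type,C1> and <C1,rdfs:subClassOf,C2> => <I,rdf:type,C2>."""
    return {
        (i, RDF_TYPE, c2)
        for (i, p, c1) in facts if p == RDF_TYPE
        for (c, q, c2) in facts if q == SUBCLASS and c == c1
    }


nis = len(inserted)                             # inserted, duplicates included
explicit = set(inserted)                        # duplicates collapse
inferred = inferred_types(explicit) - explicit  # implicit statements

# Forward-chaining store: inferred triples are stored and indexed.
nss_forward = len(explicit | inferred)
nrs_forward = nss_forward

# Backward-chaining store: only explicit triples are stored,
# but inferred ones are still retrievable at query time.
nss_backward = len(explicit)
nrs_backward = len(explicit | inferred)

print(nis, nss_forward, nrs_forward, nss_backward, nrs_backward)  # 3 3 3 2 3
```

Reporting which of the three numbers a benchmark refers to avoids comparing a materializing engine's stored volume with another engine's explicit volume.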
    • Full-Cycle Benchmarking
      • We call full-cycle benchmarking any methodology that provides a complete picture of performance with respect to the full “life cycle” of the data within the engine
        • At a high level, this means publishing data on both loading and query evaluation performance within a single experiment or benchmark run.
        • Full-cycle benchmarking requires load performance data (e.g. “5 billion triples of LUBM were loaded in 30 hours”) to be matched with query evaluation data (e.g. “… and the evaluation of the 14 queries took 1 hour on a warm database.”)
    • Full-Cycle Benchmarking (2)
      • Typical set of activities to be covered:
      • Loading input RDF files from the storage system
      • Parsing the RDF files
      • Indexing and storing the triples
      • Forward-chaining and materialization (optional)
      • Query parsing
      • Query optimization
        • Query re-writing (optional)
      • Query evaluation, involving
        • Backward-chaining (optional)
        • Fetching of the results
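A full-cycle run can be pictured as one harness that times loading and query evaluation together and reports both. The sketch below is a toy illustration under stated assumptions: full_cycle and triples_per_second are hypothetical helper names, the "store" is just a Python set, and the queries are plain functions rather than a real query language.

```python
import time


def full_cycle(load, queries):
    """Time loading and query evaluation in a single run (seconds per phase)."""
    t0 = time.perf_counter()
    store = load()                             # parse, index, store (and materialize)
    t1 = time.perf_counter()
    results = [query(store) for query in queries]  # evaluate and fetch
    t2 = time.perf_counter()
    return {"load_s": t1 - t0, "query_s": t2 - t1, "results": results}


def triples_per_second(triples, hours):
    """Convert a 'N triples loaded in H hours' claim into a load rate."""
    return triples / (hours * 3600)


# Toy data standing in for input RDF files.
toy_triples = [("ex:a", "ex:p", "ex:b"), ("ex:b", "ex:p", "ex:c")]

report = full_cycle(
    load=lambda: set(toy_triples),
    queries=[
        lambda store: len(store),
        lambda store: sorted(s for s, _, _ in store),
    ],
)
print(report["results"])  # [2, ['ex:a', 'ex:b']]

# The example figure quoted earlier: 5 billion triples in 30 hours.
print(round(triples_per_second(5_000_000_000, 30)))  # 46296
```

Publishing both timings from the same run keeps a fast load figure from being quoted without the query cost that the chosen loading strategy implies.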
    • Distribution Goals and Approaches
      • Semantic repositories can be distributed to meet different objectives:
        • (1) Handling larger volumes of data
        • (2) Speeding up loading, modification, and indexing of data
        • (3) Faster query evaluation (quicker handling of a single complex query)
        • (4) Better handling of concurrent query loads and large numbers of users (more queries/minute; smaller average response time)
        • (5) Failover (being able to survive the failure of one or more nodes/repositories)
      • Two major approaches (and objectives they could meet):
        • Data partitioning: 1, 2?, 3?, 4?, 5
        • Replication: 3?, 4, 5
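The two approaches differ in what each node holds. The sketch below is a simplified illustration, not a description of any particular engine: partitioning assigns each statement to one node by hashing its subject (an assumed partitioning key), while replication gives every node a full copy. The function names are hypothetical.

```python
import zlib


def partition_by_subject(triples, n_nodes):
    """Data partitioning: each triple lives on exactly one node."""
    nodes = [set() for _ in range(n_nodes)]
    for triple in triples:
        # crc32 gives a hash that is stable across runs, unlike built-in hash()
        index = zlib.crc32(triple[0].encode("utf-8")) % n_nodes
        nodes[index].add(triple)
    return nodes


def replicate(triples, n_nodes):
    """Replication: every node holds the complete dataset."""
    return [set(triples) for _ in range(n_nodes)]


TRIPLES = [
    ("myd:Maria", "rdf:type", "myo:Person"),
    ("myd:Maria", "myo:parent", "myd:Ivan"),
    ("myd:Ivan", "myo:gender", "M"),
    ("myd:John", "rdf:type", "myo:Person"),
]

parts = partition_by_subject(TRIPLES, 3)
replicas = replicate(TRIPLES, 3)

# Partitioning spreads the volume: node sizes add up to the dataset size.
print([len(node) for node in parts], sum(len(node) for node in parts))
# Replication multiplies it: every node can answer any query on its own.
print([len(node) for node in replicas])
```

With partitioning, a query that joins statements about different subjects may have to combine data from several nodes, whereas with replication any surviving node can serve the whole query, which matches the concurrency and failover objectives listed for it.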
    • Single-Server Scalability Notes
      • The most cost-efficient standard database server configuration in Q3 of 2009 was as follows:
        • 2 x Xeon 5500 series CPUs, each with 4 cores with hyper-threading
        • 72GB of RAM
        • 8 x 144GB 15krpm SAS drives in RAID5 with a good RAID controller
      • A less classical configuration would include SSD drives
        • SSDs (solid-state drives) use flash memory instead of disks
        • They are one order of magnitude faster than HDDs, but also more expensive
        • An optimal configuration with SSDs could include less RAM, a cheaper RAID controller, and 4-8 120GB SSDs in RAID0 or RAID10
        • SSDs are still relatively new and should be selected and configured carefully
      • A classical configuration would cost $5,000-10,000
    • Multi-core Parallelism
      • The cost-efficient DB servers today come with 8-12 CPU cores
        • each of which is a quite autonomous computing device
        • all these cores share the same RAM
      • A multi-threaded semantic repository can benefit from parallel execution:
        • Using multiple CPUs for computation, as with repositories distributed across several machines
        • But without the communication costs and overheads of the distributed approach, in particular without the “remote join” problem
      • Efficient multi-threading with respect to loading and modifications requires non-trivial locking and synchronization
      • Multi-threaded (read) query evaluation is straightforward
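The reason read-only queries parallelize easily is that threads sharing the same memory never modify it, so no locks are needed. The sketch below shows that shape with an immutable set of statements and a hypothetical match function. It is an illustration only: in CPython the global interpreter lock limits CPU parallelism for pure-Python code, so this demonstrates the structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared, immutable data: all threads read the same in-memory statements.
TRIPLES = frozenset({
    ("myData:Ivan", "ptop:childOf", "myData:Maria"),
    ("myData:Maria", "ptop:parentOf", "myData:Ivan"),
    ("myData:Maria", "rdf:type", "ptop:Woman"),
})


def match(pattern):
    """Return the sorted triples matching (s, p, o); None is a wildcard."""
    return sorted(
        triple for triple in TRIPLES
        if all(want is None or want == got for want, got in zip(pattern, triple))
    )


PATTERNS = [
    ("myData:Maria", None, None),    # everything about Maria
    (None, "ptop:childOf", None),    # all childOf statements
    (None, None, "myData:Ivan"),     # everything pointing at Ivan
]

# No locking: the queries only read, so they can run side by side.
with ThreadPoolExecutor(max_workers=4) as pool:
    answers = list(pool.map(match, PATTERNS))

for pattern, answer in zip(PATTERNS, answers):
    print(pattern, len(answer))
```

Loading and modification do not fit this pattern, because concurrent writers would need to coordinate updates to the shared indexes.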
    • Multi-core Parallelism (2)
      • At present, this is the most efficient solution for semantic repositories with data volumes of up to 40B statements
      • The most cost-efficient DB servers, priced at $5,000-10,000, can efficiently handle 10-20B statements and several hundred queries per minute
      • Database servers with 4 CPUs and 128GB of RAM can be purchased for approximately $20,000-$30,000
      • Such machines can handle datasets of up to 40-50 billion explicit statements and query loads approaching 1000 queries/min.
    • RDF Search