Faceted Views Over Large Scale Linked Data


Transcript

  • 1. OpenLink Virtuoso - Faceted Views Faceted Views over Large-Scale Linked Data Orri Erling Program Manager - Virtuoso Development Team, OpenLink Software
  • 2. Dimensions of Web Usage
    • Web 1.0: Publishing for All (Citizen Publisher) via Web Sites
    • Web 2.0: Commentary for All (Citizen Journalist) via Blogging and Social Networks, with User Generated Content across Data Silos
    • Web 3.0: Analysis for All (Citizen Analyst), via Linked Data enabling Data Mobility and Meshing, Your Data Is Your Statement, Applications Float on a Cloud of Data across a federation of HTTP accessible Data Spaces
    • Meanwhile, in the DBMS world, ad hoc data access and manipulation has consistently won over hard-coded alternatives, e.g., SQL over CODASYL. Today we see Linked Data as delivering the "ad hoc" factor in "best of both worlds" fashion, relative to alternatives (including RDBMS), across the Web and/or within Intranets & Extranets.
  • 3. The Challenges
    • Scale of Instance Data - 10^9 - 10^11 Triples
    • Scale of Ontology - 100,000s of Classes
    • Faceted Browsing, Text and Structure
    • Deployment and Provisioning
  • 4. It Is Not Only About The Warehouse
    • Up until now, you design the warehouse for the application, load the data, make a data island
    • With Linked Data, the warehouse is self-filling, based on published data using terms from commonly shared vocabularies
    • Virtuoso facilitates the above through integrated RDF-ization middleware; you populate the warehouse as you query, and the system simply gets smarter in line with your natural work patterns.
  • 5. It Is Not Only About Publishing Your Data
    • Having secrets does not mean using a secret language
    • Private environments still benefit from common vocabularies and terms
    • People and organizations publish anyway: Now it is about publishing for use in applications and integration, internet, extranet, intranet
    • Linked Data and Virtuoso deliver on the Data Spaces concept: Express any statement for which there is a vocabulary, and the data exposed by the statement can be found, joined and processed (e.g. Meshups). Basically, the network is the database.
  • 6. Solutions
    • Virtuoso 6, Single Server and Cluster Editions
    • SPARQL and SQL With The Right Extensions for serious BI style analytics
    • Integrated Web Services Platform, Suite of RDF-izers (Extractor & LOD Cloud Lookup variants)
    • Server Hosted Facet Browsing Service (via REST API), Entity Ranking, Other Building Blocks for Web 2.0 Style Development
  • 7. The lod.openlinksw.com Demo
    • 4.2 GTriples on 2 Commodity Servers
    • Full Text and Structured Querying
    • SPARQL End Point
    • Faceted Browsing Interface for Quick Discovery and Simple Report Composition
    • Usage Statistics across Source & Reference Graph IRIs, plus IFP and owl:sameAs usage stats
    • VoiD Graph Providing Rich Description of hosted Data Sets
    If OpenLink does not host it with enough capacity or the right data, you can procure your own infrastructure and get the software from us. From now on, anybody who chooses can be a search and analytics player.
  • 8. Technology
    • SPARQL Augmented With Run Time Inferencing
    • Entity Ranks for Better Search
    • Anytime Query Answering for Quick Approximate Results
    • A User Interface Combining Discovery and Query Building
    • Easy Web Services APIs and SPARQL for Developing Applications
  • 9. Technology
  • 10. Run Time Taxonomies
    • No Materialization, Select Taxonomy At Query Time
    • Query Optimization Knows About Class and Property Hierarchies
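The idea on slide 10 can be illustrated with a minimal Python sketch. It is not Virtuoso code: the triples, the two taxonomies, and the function names are hypothetical and exist only for illustration. Only the asserted `rdf:type` triples are stored; the subclass closure is computed when the query runs, and the caller picks which taxonomy applies.

```python
from collections import defaultdict

# Hypothetical toy data: asserted (subject, predicate, object) triples only.
TRIPLES = [
    ("ex:rex", "rdf:type", "ex:Dog"),
    ("ex:tom", "rdf:type", "ex:Cat"),
    ("ex:bob", "rdf:type", "ex:Person"),
]

# Two alternative (subclass, superclass) hierarchies; one is chosen per query.
TAXONOMIES = {
    "biology": [("ex:Dog", "ex:Mammal"), ("ex:Cat", "ex:Mammal"),
                ("ex:Person", "ex:Mammal")],
    "household": [("ex:Dog", "ex:Pet"), ("ex:Cat", "ex:Pet")],
}


def subclasses_of(cls, subclass_pairs):
    """Return cls plus all of its transitive subclasses."""
    children = defaultdict(set)
    for sub, sup in subclass_pairs:
        children[sup].add(sub)
    seen, stack = {cls}, [cls]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def instances_of(cls, taxonomy_name):
    """Find instances of cls under the taxonomy selected at query time.

    No inferred triples are written back to TRIPLES (no materialization).
    """
    wanted = subclasses_of(cls, TAXONOMIES[taxonomy_name])
    return sorted(s for s, p, o in TRIPLES if p == "rdf:type" and o in wanted)


if __name__ == "__main__":
    print(instances_of("ex:Mammal", "biology"))
    print(instances_of("ex:Pet", "household"))
```

Because nothing is materialized, switching from one taxonomy to the other changes the answer without reloading or rewriting any data.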
  • 11. Run Time Identity
    • Optionally Follow owl:sameAs links
    • Optionally consider any two entities sharing an IFP (inverse functional property) value to be the same
    • No materialization, Control sameAs and IFP following at query time, at the triple pattern level
    • For Ad Hoc, Do Identity at Run Time
    • For Deep Analytics and Batch Processing, Normalize Identities at Load Time
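Run-time identity as described on slide 11 can be sketched in Python as follows. This is an illustration under assumptions, not Virtuoso's implementation: the data, the `IFPS` set, and both function names are hypothetical, and the switches here apply per query, whereas the slide describes control down to the triple pattern level. Identity classes are built with a small union-find only when a query asks for them; the stored triples are never rewritten.

```python
# Hypothetical toy data; ex:a, ex:b and ex:c may or may not be "the same".
TRIPLES = [
    ("ex:a", "foaf:mbox", "mailto:kim@example.org"),
    ("ex:a", "foaf:name", "Kim"),
    ("ex:b", "foaf:mbox", "mailto:kim@example.org"),
    ("ex:b", "foaf:nick", "kimmy"),
    ("ex:b", "owl:sameAs", "ex:c"),
    ("ex:c", "ex:worksFor", "ex:acme"),
]
IFPS = {"foaf:mbox"}  # assumed set of inverse functional properties


def identity_classes(follow_same_as, follow_ifp):
    """Build identity classes for one query and return a find() function."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(x, y):
        parent[find(x)] = find(y)

    first_subject = {}  # (ifp, value) -> first subject seen with that value
    for s, p, o in TRIPLES:
        find(s)
        if follow_same_as and p == "owl:sameAs":
            union(s, o)
        elif follow_ifp and p in IFPS:
            if (p, o) in first_subject:
                union(s, first_subject[(p, o)])
            else:
                first_subject[(p, o)] = s
    return find


def describe(subject, follow_same_as=False, follow_ifp=False):
    """Return the (predicate, object) pairs of everything identical to subject."""
    find = identity_classes(follow_same_as, follow_ifp)
    root = find(subject)
    return sorted({(p, o) for s, p, o in TRIPLES
                   if find(s) == root and p != "owl:sameAs"})


if __name__ == "__main__":
    print(describe("ex:a"))
    print(describe("ex:a", follow_ifp=True))
    print(describe("ex:a", follow_same_as=True, follow_ifp=True))
```

The same stored data yields three different answers depending on the switches, which is the ad hoc case on the slide. For batch analytics, the same classes could instead be computed once at load time and used to normalize identifiers.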
  • 12. Entity Ranking
    • References and the Rank of the Referrer Contribute to Rank, as In Web Search
    • Can Customize Weight By Graph, Predicate
    • Can Run Ranks on Selected Subsets
    • Ranks Are Calculated in a Batch Run
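The slides do not give the ranking formula. As a hedged illustration of "references and the rank of the referrer contribute to rank", here is a generic PageRank-style batch iteration in Python with per-predicate weights. The function name, the damping factor, the iteration count, and the sample links are assumptions, and this should not be read as Virtuoso's actual algorithm.

```python
def entity_ranks(links, predicate_weight, iterations=20, damping=0.85):
    """Batch-compute ranks from (referrer, predicate, referenced) links.

    predicate_weight maps a predicate to its weight; unlisted predicates
    default to 1.0. Rank held by entities with no outgoing links is simply
    dropped in this sketch.
    """
    nodes = sorted({n for s, _, o in links for n in (s, o)})
    rank = {n: 1.0 / len(nodes) for n in nodes}

    # Total outgoing weight per referrer, used to split its rank.
    out_weight = {n: 0.0 for n in nodes}
    for s, p, _ in links:
        out_weight[s] += predicate_weight.get(p, 1.0)

    for _ in range(iterations):
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for s, p, o in links:
            if out_weight[s] > 0:
                share = predicate_weight.get(p, 1.0) / out_weight[s]
                new_rank[o] += damping * rank[s] * share
        rank = new_rank
    return rank


if __name__ == "__main__":
    links = [
        ("ex:a", "ex:cites", "ex:hub"),
        ("ex:a", "ex:mentions", "ex:side"),
    ]
    # Weighting ex:cites above ex:mentions lifts ex:hub over ex:side.
    print(entity_ranks(links, {"ex:cites": 3.0, "ex:mentions": 1.0}))
```

Ranking a selected subset amounts to passing only the links of that subset; weighting by graph would need a fourth element per link, which this sketch omits.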
  • 13. Entity Name Service
    • Autocompletion of URIs
    • Autocompletion of Label-Like Properties
    • Ranked List of Synonyms
    • Statistics on Where a URI is Defined and Where it is Referenced
  • 14. Virtuoso Anytime Query Feature
    • Partial Results in Fixed Time
    • Useful for Interactive Browsing, Query Development over large data sets (e.g. LOD Cloud)
    • On public SPARQL end points, Protects Against DoS While Still Giving Samples of the Answers
    • Metering of query resource utilization
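The anytime behavior on slide 14 can be sketched generically in Python: do as much work as fits in a fixed time budget, then return what has been found along with a flag saying whether the answer is complete. The function name, its parameters, and the returned dictionary are hypothetical; Virtuoso exposes this feature through its own settings, which are not shown here.

```python
import time


def anytime_count(rows, predicate, budget_seconds, clock=time.monotonic):
    """Count rows matching predicate, stopping when the time budget runs out.

    Returns the partial (or full) count, how many rows were scanned (a crude
    usage meter), and whether the scan finished. The clock is injectable so
    the cutoff can be tested deterministically.
    """
    deadline = clock() + budget_seconds
    matched = scanned = 0
    complete = True
    for row in rows:
        if clock() >= deadline:
            complete = False  # out of time: report a partial answer
            break
        scanned += 1
        if predicate(row):
            matched += 1
    return {"matched": matched, "scanned": scanned, "complete": complete}


if __name__ == "__main__":
    result = anytime_count(range(10_000_000), lambda n: n % 2 == 0, 0.05)
    print(result)  # counts depend on machine speed
```

On a public endpoint the budget caps the cost of any single query, while the caller still receives a usable sample and can tell from `complete` that the answer is approximate.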
  • 15. The LOD Cloud Faceted Search, Find, and Lookup Services
    • Access via Web Services, SPARQL
    • Developed in Virtuoso using SQL, SPARQL, and Stored Procedures
    • Part of Virtuoso 6.x Open Source Edition (Single Server Edition only)
  • 16. Experience
    • If Data In Memory, Interactive Time and Linear Scale
    • RDF Aware Query Optimizer is Key
    • Parallel Execution Engine 1 Thread/Query/Partition
    • For Generic Linked Data, RDF Representation With 4 Indices, plus Full Text Indexing on Literal Objects
    • For Specialized Tasks, SQL + Stored Procedures With Parallel Programming Model (of course output will be Linked Data)
    • Unlimited Cross Partition Joining at Near Full Platform Utilization; Not a Problem With the Right Message Flow
  • 17. Some Performance Data
    • Current Live Instance Setup: 2 Linux boxes with 2x4 core Xeons each with 32G RAM for a data set in excess of 4.2 Billion Triples
    • 3.2 Million Single Triple Lookups Per Second
    • Load Rates over 100K Triples/sec
    • Entity Ranks for 4.2 GTriples in 30m/Iteration
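A few derived figures follow from the numbers on slide 17 by plain arithmetic, sketched here in Python. The constant names are ours, and the core count assumes that "2x4 core Xeons each" means two quad-core CPUs per box, i.e. 16 cores in total. Since the slide gives the load rate as a lower bound, the load time below is an upper bound.

```python
# Figures taken from the slide.
TRIPLES = 4.2e9                  # data set size
LOAD_RATE_PER_SEC = 100_000      # "over 100K triples/sec", so a lower bound
LOOKUPS_PER_SEC = 3.2e6          # single triple lookups per second
RANK_ITERATION_MINUTES = 30      # one entity rank iteration

# Assumption: 2 boxes x 2 CPUs x 4 cores.
CORES = 2 * 2 * 4

load_hours = TRIPLES / LOAD_RATE_PER_SEC / 3600
lookups_per_core = LOOKUPS_PER_SEC / CORES
rank_triples_per_sec = TRIPLES / (RANK_ITERATION_MINUTES * 60)

if __name__ == "__main__":
    print(f"Full load at 100K triples/sec: at most {load_hours:.1f} hours")
    print(f"Lookups per core per second: {lookups_per_core:,.0f}")
    print(f"Triples covered per second of ranking: {rank_triples_per_sec:,.0f}")
```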
  • 18. Deployment
    • For Intermittent Use, 1TB of RAM, 256 Virtual Cores at EC2 is $1228 Per Day
    • For Purchase, Cluster of 1TB RAM, 120 Nehalem Cores Lists Around $75K
    Rent or Buy? To Handle 10 GT 100% in RAM or 50 GT With Decent Working Set. Prices are April 2009 US retail.
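The rent-or-buy question can be made concrete with a small break-even calculation in Python. It uses only the two prices from the slide and deliberately ignores power, hosting, administration, and the fact that the two configurations differ in core count, so treat the result as a rough guide. The function names are ours.

```python
EC2_COST_PER_DAY_USD = 1228        # 1TB RAM, 256 virtual cores, rented
CLUSTER_LIST_PRICE_USD = 75_000    # 1TB RAM, 120 Nehalem cores, purchased


def break_even_days(purchase_price, rent_per_day):
    """Days of rental whose total cost equals the purchase price."""
    return purchase_price / rent_per_day


def cheaper_option(days_of_use):
    """Compare total rental cost with the one-off purchase price."""
    rent_total = days_of_use * EC2_COST_PER_DAY_USD
    return "rent" if rent_total < CLUSTER_LIST_PRICE_USD else "buy"


if __name__ == "__main__":
    days = break_even_days(CLUSTER_LIST_PRICE_USD, EC2_COST_PER_DAY_USD)
    print(f"Break-even after about {days:.0f} days of rental")
    print(cheaper_option(30), cheaper_option(90))
```

At these prices, renting wins for intermittent use totaling up to about two months of rental days, after which the purchase price has been spent.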
  • 19. Conclusions
    • Applications exploiting open data access across heterogeneous data sources at Web Scale are now within anyone's reach
    • Usable for public Web Sites or for in-house Business Analytics
    • Web enabled Open Data Access & Analysis for All!