Cloud-based Linked Data Management for Self-service Application Development

Transcript of "Cloud-based Linked Data Management for Self-service Application Development"

  1. 1. Peter Haase, Michael Schmidt fluid Operations AG Cloud-based Linked Data Management for Self-service Application Development International Workshop on Scalable Semantic Computing Hangzhou, November 6, 2010
  2. 2. Increasing Popularity of Linked Open Data • LOD cloud as of May 2009 • 4.7 billion triples • 142 million RDF links • LOD cloud as of Sep 2010 • 25 billion triples • 395 million RDF links • Covering various domains • Media • Life Science • Geography • Publications • … Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
  3. 3. Agenda • Linked Data Application Development Opportunities and Challenges • Information Workbench as Platform for Linked Data Application Development • Accessing Linked Data as a Service Vision and First Experiences • Conclusions
  4. 4. New Opportunities • Established standards define common data models, vocabularies, semantics • RDF/RDFS, OWL, SPARQL • From data silos to a web of data • Ease of specifying relationships in a decentralized way • Innovative applications that integrate data from various domains and sources • Linked Government Data • Linked Open Data • Benefits of Linked Data in the enterprise • Semantically integrate and interlink data scattered among systems • Cross the chasm between enterprise-internal and public data • Leverage semantic technologies for improved search and presentation
  5. 5. Challenges in Building Linked Data Applications • Heterogeneity in various dimensions ▪ Location of data (internal / external, open / closed) ▪ Identifiers, structure and vocabularies ▪ Ownership of data • Structured and unstructured data • Quality of Linked Data • Various forms of imperfection (erroneous, incomplete, imprecise data) • Trustworthiness • End-user oriented interfaces and interaction paradigms • Interfaces that operate over large amounts of data, flexible and dynamic schemas • Meaningful aggregation of the data • Support for expressive queries, while retaining intuitive interfaces • User-generated content • Collaborative annotation and knowledge acquisition
  6. 6. The Information Workbench • Platform for Linked Data application development • Base functionality to build applications without any programming • SDK for easy extensions • Covering the entire lifecycle of interacting with Linked Data ▪ Discovery of data sources ▪ Integration of data sources ▪ Visualization ▪ Search and Exploration ▪ Collaborative generation of data • Targeted at • Semantic Web Community • Linked Open Data community • Innovative Enterprises • Demo and source available at http://iwb.fluidops.com/.
  7. 7. The IWB Application Development Process (1) Linked Open Data Discovery • Visually explore data sets registered to global registries • Sort/filter data sets by domain, location, and many more facets to identify relevant data [Figure: LOD Discovery with the Information Workbench]
  8. 8. The IWB Application Development Process (1) Linked Open Data Discovery • Visually explore data sets registered to global registries • Sort/filter data sets by domain, location, and many more facets to identify relevant data (2) Data Integration • Integrate discovered Linked Data • Add providers for internal and external legacy data sources • Improve data quality, e.g. via incremental refinement of ontology
  9. 9. The IWB Application Development Process (1) Linked Open Data Discovery • Visually explore data sets registered to global registries • Sort/filter data sets by domain, location, and many more facets to identify relevant data (2) Data Integration • Integrate discovered Linked Data • Add providers for internal and external legacy data sources • Improve data quality, e.g. via incremental refinement of ontology (3) Customization • Declaratively specify UI based on available pool of widgets • Embed reports and charts into wiki pages and wiki page templates • Semantically annotate and interlink connected resources
  10. 10. The IWB Application Development Process (1) Linked Open Data Discovery • Visually explore data sets registered to global registries • Sort/filter data sets by domain, location, and many more facets to identify relevant data (2) Data Integration • Integrate discovered Linked Data • Add providers for internal and external legacy data sources • Improve data quality, e.g. via incremental refinement of ontology (3) Customization • Declaratively specify UI based on available pool of widgets • Embed reports and charts into wiki pages and wiki page templates • Semantically annotate and interlink connected resources (4) Advanced System Configuration and Extensions • Use APIs and SDKs to implement own widgets and mashups • Script data providers to integrate data behind non-standard interfaces • Develop and integrate own modules, e.g. for customized search and information extraction
  11. 11. Information Workbench Architecture • Extensible, widget-based UI • Resource-centric presentation • Living UI, which exploits semantics of underlying data • Large collection of predefined widgets, easily extendable • Search and information access • Coexistence of structured and unstructured data • Different search paradigms (keyword and faceted search, semantic query completion) • Data integration through providers • Convert data from a data source into the RDF data format • Customizable, easily extensible • Use of public LOD registries
  12. 12. Information Workbench Architecture In the remainder of the talk • Focus on challenges in data integration layer • In particular: virtualized, cloud-based integration of data sources
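The provider idea on slide 11 (convert data from a data source into the RDF data format) can be illustrated with a minimal sketch. The function names, the `http://example.org/` namespace, and the CSV layout are assumptions made for illustration; this is not the Information Workbench provider API.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not from the talk


def escape_literal(value: str) -> str:
    """Escape a string for use inside an N-Triples literal."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_ntriples(csv_text: str, id_column: str) -> list[str]:
    """Emit one triple per non-empty, non-id cell of each CSV row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}resource/{quote(row[id_column], safe='')}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the key column and empty cells
            predicate = f"<{BASE}property/{quote(column, safe='')}>"
            triples.append(f'{subject} {predicate} "{escape_literal(value)}" .')
    return triples


# Hypothetical legacy data: the second row has no city.
SAMPLE = "id,name,city\n1,fluid Operations,Walldorf\n2,Example Corp,\n"
for line in rows_to_ntriples(SAMPLE, "id"):
    print(line)
```

A real provider would also map columns to established vocabularies and typed literals; the sketch only shows the row-to-triple step.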
  13. 13. Linked Data Integration – Where we are • Non-RDF data stored locally in the repository • On demand, this data can be updated periodically • RDF data can be… • persisted in repository, or • connected via naive federation layer (where possible)
  14. 14. Linked Data Integration – Our Vision • Current way of publishing • Authors provide RDF dumps linked on some homepage • Provisioning information missing (data zipped, split, available in different formats, …) • Often also SPARQL endpoints (typically with poor response times) • How it should be done • Rich meta-data describing content, structure, properties of the data • Enable exploration of data via meta repositories • Efforts have been made (see CKAN), but… • … poor quality of meta data and data • Possibility for end-users to buy service guarantees • Integration details should be irrelevant to the end-user
  15. 15. Software Components • Definition of „Software Components“ "A software component is a unit of composition with contractually specified interfaces and explicit context dependencies only. A software component can be deployed independently and is subject to composition by third parties." (wikipedia.org)
  16. 16. Data Components • What we need for Linked Data: „Data Components“ • Interfaces: data components with precise interfaces and metadata • Deployment: easy provisioning and integration in applications • Composition: transparent access to atomic or composite units • Definition of „Software Components“ "A software component is a unit of composition with contractually specified interfaces and explicit context dependencies only. A software component can be deployed independently and is subject to composition by third parties." (wikipedia.org)
  17. 17. Next Step: Data-as-a-Service • Idea • Producer provides data components • Consumers can access data components as a service
  18. 18. Next Step: Data-as-a-Service • Idea • Producer provides data components • Consumers can access data components as a service • Possible realization: use cloud technology! • Sold on demand • Elastic • Fully managed by provider • The characteristics of cloud services such as AWS exactly match the needs (just as is the case for Software-as-a-Service)
  19. 19. Next Step: Data-as-a-Service Virtualized Semantic Repositories: Identification, composition, and use of (fragments of) datasets in manners that abstract the applications from the specific setup of the data management service (such as local vs. remote, federation, and distribution) • Idea • Producer provides data components • Consumers can access data components as a service • Possible realization: use cloud technology! • Sold on demand • Elastic • Fully managed by provider • The characteristics of cloud services such as AWS exactly match the needs (just as is the case for Software-as-a-Service)
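The abstraction described for virtualized semantic repositories (applications do not see whether data is local, remote, or federated) might look like the following sketch. All class and function names are hypothetical, and the in-memory store merely stands in for any concrete setup.

```python
from typing import Iterable, Iterator, Optional, Protocol

Triple = tuple[str, str, str]
Pattern = tuple[Optional[str], Optional[str], Optional[str]]  # None = wildcard


class DataComponent(Protocol):
    """The only interface application code is allowed to depend on."""

    def match(self, pattern: Pattern) -> Iterator[Triple]: ...


class InMemoryComponent:
    """Stand-in for a concrete setup (local store, remote endpoint, ...)."""

    def __init__(self, triples: Iterable[Triple]) -> None:
        self._triples = list(triples)

    def match(self, pattern: Pattern) -> Iterator[Triple]:
        for triple in self._triples:
            if all(p is None or p == t for p, t in zip(pattern, triple)):
                yield triple


class CompositeComponent:
    """Composes several components behind the same interface."""

    def __init__(self, parts: Iterable[DataComponent]) -> None:
        self._parts = list(parts)

    def match(self, pattern: Pattern) -> Iterator[Triple]:
        seen: set[Triple] = set()
        for part in self._parts:
            for triple in part.match(pattern):
                if triple not in seen:  # drop duplicates across components
                    seen.add(triple)
                    yield triple


def labels(repo: DataComponent) -> list[str]:
    """Application code: unaware of whether `repo` is atomic or composite."""
    return sorted(o for _, _, o in repo.match((None, "rdfs:label", None)))


a = InMemoryComponent([("ex:a", "rdfs:label", "Alpha")])
b = InMemoryComponent([("ex:b", "rdfs:label", "Beta"), ("ex:a", "rdfs:label", "Alpha")])
print(labels(a))
print(labels(CompositeComponent([a, b])))
```

The design point is that `labels` never changes when a single component is swapped for a composite one.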
  20. 20. Challenge 1: Precise Interfaces • Standardization efforts for RDF metadata descriptions • Statistical Core Vocabulary (SCOVO) • Very flexible • Forms a good basis for describing RDF statistics • Vocabulary of Interlinked Datasets (voiD) • Based on SCOVO • Used to publish meta information about Linked Data sources • voiD 2 (in progress) • Dataset meta information, like source, description, dump, license • Used vocabularies/ontologies • Dataset interlinking • Statistics (e.g. distinct subject count, triples with given predicate etc.) • Open data registries • Comprehensive Knowledge Archive Network (CKAN) • Based on Dublin Core and DERI's data catalog vocabulary (dcat)
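The kind of statistics named on slide 20 (distinct subject count, triples with a given predicate) is straightforward to compute once the triples are at hand. The sketch below is illustrative only: the dictionary keys are made-up names rather than voiD terms, and the sample triples are invented.

```python
from collections import Counter

Triple = tuple[str, str, str]


def dataset_statistics(triples: list[Triple]) -> dict:
    """Compute a few dataset-level statistics over (s, p, o) triples."""
    return {
        "triples": len(triples),
        "distinct_subjects": len({s for s, _, _ in triples}),
        # How many triples use each predicate.
        "triples_per_predicate": dict(Counter(p for _, p, _ in triples)),
    }


# Invented sample data for illustration.
SAMPLE = [
    ("ex:a", "rdfs:label", "Alpha"),
    ("ex:a", "ex:linksTo", "ex:b"),
    ("ex:b", "rdfs:label", "Beta"),
]
print(dataset_statistics(SAMPLE))
```

A publisher would serialize such numbers in a metadata vocabulary so that consumers and query optimizers can read them without downloading the data.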
  21. 21. Challenge 2: Deployment • Based on Interfaces • Possibly based on cloud technologies • State-of-the-art not satisfying • URLs pointing to human readable description, but not the actual endpoint • Various forms of syntax errors in RDF documents • MIME types incorrect or missing • Endpoints/servers not reachable • Endpoint/file password protected
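The deployment problems listed on slide 21 can be turned into an automated check. The following sketch classifies the outcome of probing a dataset URL; the `ProbeResult` type, the category strings, and the short MIME type list are assumptions for illustration, and the actual HTTP request is deliberately left out. Detecting RDF syntax errors would additionally require a parser and is not covered.

```python
from dataclasses import dataclass
from typing import Optional

# A small, non-exhaustive set of RDF serialization MIME types.
RDF_MIME_TYPES = {
    "application/rdf+xml",
    "text/turtle",
    "application/n-triples",
    "text/n3",
}


@dataclass
class ProbeResult:
    status: Optional[int]        # HTTP status code, None if there was no response
    content_type: Optional[str]  # raw Content-Type header, None if absent


def classify(probe: ProbeResult) -> str:
    """Map a probe outcome to one of the problem classes from the slide."""
    if probe.status is None:
        return "not reachable"
    if probe.status == 401:  # server demands credentials
        return "password protected"
    if probe.status >= 400:
        return "not reachable"
    if not probe.content_type:
        return "MIME type missing"
    mime = probe.content_type.split(";")[0].strip().lower()
    if mime == "text/html":
        return "human-readable description, not data"
    if mime not in RDF_MIME_TYPES:
        return "MIME type incorrect"
    return "ok"


print(classify(ProbeResult(200, "text/turtle; charset=utf-8")))
print(classify(ProbeResult(200, "text/html")))
print(classify(ProbeResult(401, None)))
```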
  22. 22. Some Statistics Based on subset of LOD cloud (excluding a few extremely large datasets)
  23. 23. Challenge 3: Composition Query Processing over Federation: State-of-the-Art • First public implementations exist • AliBaba federation layer on top of Sesame • Benchmark results show severe bottlenecks • Efficiency issues • Which data sets deliver results for which graph patterns? • Localized execution of subqueries • Global estimation of subquery result sizes • Join order optimization • Incremental processing with completeness/correctness guarantees Peter Haase, Tobias Mathäß, Michael Ziller: An Evaluation of Approaches to Federated Query Processing over Linked Data. In Proc. I-Semantics 2010.
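Two of the efficiency issues on slide 23, source selection and join order, can be sketched with per-source predicate statistics. Everything below is hypothetical: the source names, the counts, and the greedy heuristic are chosen for illustration and do not describe the AliBaba or Sesame implementations.

```python
Pattern = tuple[str, str, str]  # (subject, predicate, object); "?x" marks a variable

# Hypothetical statistics: source name -> predicate -> number of triples.
SOURCE_STATS = {
    "dbpedia": {"rdfs:label": 1000, "dbo:birthPlace": 200},
    "geonames": {"rdfs:label": 500, "geo:lat": 300},
}


def select_sources(pattern: Pattern, stats: dict) -> list[str]:
    """Return the sources that can contribute results for a triple pattern."""
    predicate = pattern[1]
    if predicate.startswith("?"):
        return sorted(stats)  # unbound predicate: every source is a candidate
    return sorted(name for name, preds in stats.items() if predicate in preds)


def estimate(pattern: Pattern, stats: dict) -> int:
    """Crude global result-size estimate: sum of matching predicate counts."""
    predicate = pattern[1]
    if predicate.startswith("?"):
        return sum(sum(preds.values()) for preds in stats.values())
    return sum(preds.get(predicate, 0) for preds in stats.values())


def order_patterns(patterns: list[Pattern], stats: dict) -> list[Pattern]:
    """Greedy join order: evaluate the most selective pattern first."""
    return sorted(patterns, key=lambda p: estimate(p, stats))


QUERY = [
    ("?p", "rdfs:label", "?name"),
    ("?p", "dbo:birthPlace", "?city"),
    ("?city", "geo:lat", "?lat"),
]
for pattern in order_patterns(QUERY, SOURCE_STATS):
    print(pattern, "->", select_sources(pattern, SOURCE_STATS))
```

The heuristic ignores whether consecutive patterns share variables; a real optimizer would also avoid cross products.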
  24. 24. Linked Data Federation: Vision [Architecture diagram. Labels: Data Source (×4), SPARQL Endpoint, Virtualized Federation Layer, Consumer, Publisher, Local Repository, RDF Dump, Data Component (×2), Self-service Data Provisioning (Data-as-a-Service)]
  25. 25. Challenge 3: Composition Rich theory in database community for Federated Query Processing exists • Data Statistics • Accuracy vs. index size • Updating statistics • Query Optimization • Join types (e.g., semi-joins) • Minimizing communication cost • Optimizing execution localization • Streaming results Olaf Görlitz, Steffen Staab: Federated Data Management and Query Optimization for Linked Open Data. In „New Directions of Web Data Management“, to appear.
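The semi-join mentioned on slide 25 reduces communication cost by sending only the join values to the second source, so that it returns matching rows only. The sketch below simulates both sources in memory and counts transferred items as a rough cost proxy; the data, the function names, and the cost model are assumptions for illustration.

```python
def hash_join(left_rows, right_rows, key):
    """Join two lists of dict rows on `key` at the mediator."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left_rows for r in index.get(l[key], [])]


def naive_join(left_rows, right_rows, key):
    """Ship both full inputs to the mediator, then join there."""
    transferred = len(left_rows) + len(right_rows)
    return hash_join(left_rows, right_rows, key), transferred


def semi_join(left_rows, right_rows, key):
    """Send the distinct join values to the right source before joining."""
    join_values = {row[key] for row in left_rows}
    # In a real federation this filter would run at the remote source.
    reduced_right = [row for row in right_rows if row[key] in join_values]
    transferred = len(left_rows) + len(join_values) + len(reduced_right)
    return hash_join(left_rows, reduced_right, key), transferred


# Invented data: two people, and a large table of city coordinates.
PEOPLE = [
    {"city": "ex:Walldorf", "person": "ex:p1"},
    {"city": "ex:Hangzhou", "person": "ex:p2"},
]
CITIES = [{"city": f"ex:c{i}", "lat": i} for i in range(98)] + [
    {"city": "ex:Walldorf", "lat": 49},
    {"city": "ex:Hangzhou", "lat": 30},
]

naive_result, naive_cost = naive_join(PEOPLE, CITIES, "city")
semi_result, semi_cost = semi_join(PEOPLE, CITIES, "city")
print(naive_cost, semi_cost, naive_result == semi_result)
```

The benefit depends on selectivity: when most rows of the second source match anyway, shipping the join values first only adds overhead.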
  26. 26. Challenges • Satisfying and standardized statistics framework for RDF • voiD 2.0 not yet fully satisfying (e.g. histograms missing) • Therefore: • Establish comprehensive, standardized statistics framework for RDF • Should also be tailored to query optimization • Address specifics of RDF and SPARQL • Graph-structured data model • Importance of efficient merge joins • OPTIONAL queries • Exploit built-in semantics of RDFS • Semantic Query Optimization Michael Schmidt, Michael Meier, Georg Lausen: Foundations of SPARQL Query Optimization. In Proc. ICDT 2010.
  27. 27. Conclusion • Clear benefits of Linked Data application development platform • Discovery of relevant data • Virtualized integration of data sources as a key step to success • Fast customization and extensions • Information Workbench addressing these needs • Still some work left to do • Metadata quality and standardization • Data quality in general, trust • Data-as-a-Service • Efficient federated query processing
  28. 28. Thank you for your attention! CONTACT fluid Operations AG Email: info@fluidOps.com Altrottstr. 31 Website: www.fluidOps.com Walldorf, Germany Tel.: +49 6227 3849-567