
Sindice warehousing meetup

Published in: Technology, Education
  • Search records (instead of entities). Record-centric indexing model.
  • Use case: let's index the entire web of data. Docs/s, Lucene in action, uptime, etc.
  • How important is a dataset to my information need? How can we help users browse and filter out irrelevant datasets? How can I measure the quality of a dataset? Data quality, objective measures. Two datasets can overlap and provide similar information, but one may provide fresher information and be updated more frequently. Concrete scenarios are needed to test such assumptions. Data quality can also be useful for improving data acquisition, optimising resources to retrieve only top-quality data.
  • Define "relationships" when introducing the graph, before talking about the numbers.
  • Number of entities per class. Number of relations for a given predicate. Other metadata can be added to a class, e.g., other predicates used with the entities of that class.

    1. Real Time Semantic Warehousing: Sindice.com technology for the enterprise. Giovanni Tummarello, Ph.D., Data Intensive Infrastructure Unit, DERI.ie; CEO, SindiceTech
    2. How we started: Sindice.com. 80 billion triples, 500,000,000 RDF graphs, 5 TB of data. The Sindice Suite powers Sindice.com. Online with 99.9%+
    3. Semantic Sandboxes on Sindice.com: Data Sandboxes in Sindice.com, powered by CloudSpaces
    4. And then we met people asking: can you do it for us?
    5. Example story (pharmaceutical company)
        To stay competitive, pharmaceutical companies need to leverage all the data available from inside sources as well as from the increasingly many public HCLS data sources. Because this data is diverse in nature, format, and quality, there are complex integration issues. Traditional data warehousing technology requires big upfront thinking and is handled within a company in the "go via the IT department" approach. This does not meet the needs of data scientists, who are the only ones that can do the complex cross-use-case thinking required.
        Via Real Time Semantic Data Warehousing (RETIS), data scientists expect to get:
        • The ability to speed up "in silico" scientific workflows (interrelation of diverse large datasets) by orders of magnitude by relying on a data warehousing approach.
        • The ability to create large-scale "data maps" or "aggregated views" that allow researchers to see "trends" and gather high-level insights that would not be possible with data accessed via single lookups.
        • The ability to receive recommendations and suggestions for new data connections based on an ever-evolving ecosystem of available experimental datasets.
        • The ability to provide their R&D departments with superior tools for investigating their internal knowledge: search engines and data browsing tools that provide unified views of multiple, evolving, live datasets without leaking specific "queries" to the outside world, which would reveal internal research trends.
        • The ability to leverage the ever-increasing body of public, crowd-curated open data.
    6. Linked Data clouds for the Enterprise
        – Strategic knowledge spaces, where new databases can be added and "leveraged" with unprecedented ease
        – Integration "pay as you go": explore now, fine-tune later.
        – It's Big Data (cluster + clouds) meets RDF and Semantic Technologies
    7. Sindice.com
    8. Because you need Semantic Sandboxes
    9. A Dataspace Template: a typical implementation template. (Figure label: Semantic Web Data)
        Dataspaces own:
        • Resources
        • Services
        • Datasets for others to reuse
    10. Dataspace Composition: scalable cascading semantic "Dataspaces"
        • Resources allocated in public/private clouds
        • Allows users to take Sindice data and mix or process it for private purposes
    11. Cloud powered!
        <dataspace id="iphonedataspace">
          <dependencies>
            http://ecommerce01.dataspace.sindice.net/
            http://price01.dataspace.sindice.net/
          </dependencies>
          <resources>
            <mysql name="sql" />
            <hbase size="10g" />
            <siren name="index" />
            <triplestore name="sparql" kind="virtuoso" />
          </resources>
          <retention> (see later)
            <update-rate>1D</update-rate>
            <timeout>1D</timeout>
          </retention>
        </dataspace>
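To make the descriptor on this slide concrete, here is a minimal Python sketch that reads a well-formed variant of it with the standard library. The exact schema is an assumption: the slide text is not valid XML as transcribed, so the self-closing resource tags and the plain-text dependency URLs below are a guess at the intended shape, and `summarize_dataspace` is a hypothetical helper, not part of any Sindice tool.

```python
import xml.etree.ElementTree as ET

# Assumed, well-formed rendering of the slide's descriptor (not an official schema).
DESCRIPTOR = """
<dataspace id="iphonedataspace">
  <dependencies>
    http://ecommerce01.dataspace.sindice.net/
    http://price01.dataspace.sindice.net/
  </dependencies>
  <resources>
    <mysql name="sql" />
    <hbase size="10g" />
    <siren name="index" />
    <triplestore name="sparql" kind="virtuoso" />
  </resources>
  <retention>
    <update-rate>1D</update-rate>
    <timeout>1D</timeout>
  </retention>
</dataspace>
"""


def summarize_dataspace(xml_text):
    """Hypothetical helper: extract id, dependencies, resources and retention."""
    root = ET.fromstring(xml_text.strip())
    return {
        "id": root.get("id"),
        # Dependencies are whitespace-separated URLs inside <dependencies>.
        "dependencies": (root.findtext("dependencies") or "").split(),
        # One entry per resource element, keyed by tag name.
        "resources": {r.tag: dict(r.attrib) for r in root.find("resources")},
        "retention": {e.tag: e.text for e in root.find("retention")},
    }


summary = summarize_dataspace(DESCRIPTOR)
print(summary["id"], summary["dependencies"])
```

The point of the sketch is only that a dataspace declares what it depends on, which resources it needs, and how long data is retained; how a real deployment interprets those fields is not stated in the slides.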
    12. Scale is only one dimension. Multiple dimensions of Web data integration:
        • RDF tool stack → flexibility
        • Cluster scalable processing → scalability
        • "Cloud" pipelines → dynamicity
    13. Full JSON-like search. On Solr. All operators supported.
    14. What is SIREn?
        • Plugin to Solr
        • Built for searching and operating on semistructured data and relational data structures
    15. SIREn: Semantic IR Engine
        • Extension to the enterprise search engine Solr
        • Semantic, full-text, incremental updates, distributed search
        (Figure labels: Semantic, SIREn, Databases, Constant time)
    16. Limitations of Apache Solr
        • Not efficient with highly heterogeneous structured data sources
          – Limitation on the number of attributes: dictionary size explosion
    17. Dictionary Size Explosion
        Record 1: label = Renaud Delbru; name = Renaud Delbru
    18. Dictionary Size Explosion
        Record 1: label = Renaud Delbru; name = Renaud Delbru
        Dictionary: label:renaud, label:delbru, name:renaud, name:delbru
        Dictionary construction:
        • Concatenation of attribute name and term
        • N * M complexity (worst case)
        • 2 attributes * 2 terms = 4 dictionary entries
        • 100K attributes * 1B terms = 100T entries
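The dictionary construction on this slide can be sketched in a few lines of Python. This is an illustration of the "attribute name concatenated with term" scheme only, under the assumption of lowercased whitespace tokenization; `field_term_dictionary` and `worst_case_entries` are hypothetical names, not Solr or Lucene APIs.

```python
def field_term_dictionary(records):
    """Collect one dictionary entry per distinct (attribute, term) pair."""
    entries = set()
    for record in records:
        for attribute, value in record.items():
            # Assumed analysis: lowercase, split on whitespace.
            for term in value.lower().split():
                entries.add(f"{attribute}:{term}")
    return entries


def worst_case_entries(n_attributes, n_terms):
    """Upper bound when every term occurs under every attribute (N * M)."""
    return n_attributes * n_terms


record_1 = {"label": "Renaud Delbru", "name": "Renaud Delbru"}
entries = field_term_dictionary([record_1])
print(sorted(entries))
print(worst_case_entries(2, 2))
```

With the single record from the slide, the two attributes and two terms yield four entries, which is exactly the worst-case bound for that input.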
    19. Limitations of Apache Solr
        • Not efficient with highly heterogeneous structured data sources
          – Limitation on the number of attributes: dictionary size explosion
          – Query clause explosion when searching across all attributes
    20. Limitations of Apache Solr
        • Not efficient with highly heterogeneous structured data sources
          – Limitation on the number of attributes: dictionary size explosion
          – Query clause explosion when searching across all attributes
        • Limited support for structured queries
          – Multi-valued attributes
    21. Multi-valued attributes
        • No support in Solr for "all words must match in the same value of a multi-valued field".
        • A field value is a bag of words
          – No distinction between multiple values
        Record 1: label = "mans best friend", "pooch"
        Record 2: label = "mans worst enemy", "friend to no one"
    22. Multi-valued attributes
        • No support in Solr for "all words must match in the same value of a multi-valued field".
        • A field value is a bag of words
          – No distinction between multiple values
        • Query example
          – label : man's friend
          – Solr returns Records 1 and 2 as results
        Record 1: label = "mans best friend", "pooch"
        Record 2: label = "mans worst enemy", "friend to no one"
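The difference between the two matching semantics can be shown with a short Python sketch. It models the behaviour described on the slide rather than Solr itself: `bag_of_words_match` treats all values of the field as one bag of terms, while `same_value_match` requires every query term to occur within a single value. The tokenizer and the apostrophe-free query form "mans friend" are assumptions made to line up with the record text shown.

```python
def tokens(text):
    """Assumed analysis: lowercase and split on whitespace."""
    return text.lower().split()


def bag_of_words_match(values, query):
    """All query terms appear somewhere in the field, in any value."""
    bag = {term for value in values for term in tokens(value)}
    return all(term in bag for term in tokens(query))


def same_value_match(values, query):
    """All query terms appear together inside one value of the field."""
    wanted = tokens(query)
    return any(all(term in tokens(value) for term in wanted) for value in values)


records = {
    "Record 1": ["mans best friend", "pooch"],
    "Record 2": ["mans worst enemy", "friend to no one"],
}
query = "mans friend"

bag_hits = [rid for rid, vals in records.items() if bag_of_words_match(vals, query)]
value_hits = [rid for rid, vals in records.items() if same_value_match(vals, query)]
print(bag_hits, value_hits)
```

Record 2 matches under the bag-of-words model only because "mans" and "friend" come from two different values, which is the false positive the slide describes.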
    23. Limitations of Apache Solr
        • Not efficient with highly heterogeneous structured data sources
          – Limitation on the number of attributes: dictionary size explosion
          – Query clause explosion when searching across all attributes
        • Limited support for structured queries
          – Multi-valued attributes
          – No full-text search on attribute names
    24. Full-text search on attribute names
        • No support in Solr for "keyword search in attribute names".
        • Query example
          – (name OR label) = "Renaud Delbru"
          – Solr is unable to find the records without the exact attribute name
        Record 1: rdfs:label = Renaud Delbru
        Record 2: foaf:name = Renaud Delbru
        Record 3: sioc:name = Renaud Delbru
        Record 4: full_name = Renaud Delbru
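A small Python sketch of what "keyword search in attribute names" means for the four records on this slide. It is an illustration only: attribute names are tokenized on non-alphanumeric characters, which is an assumed analysis, and `keyword_search` and `exact_name_search` are hypothetical functions rather than SIREn or Solr calls.

```python
import re

records = {
    "Record 1": {"rdfs:label": "Renaud Delbru"},
    "Record 2": {"foaf:name": "Renaud Delbru"},
    "Record 3": {"sioc:name": "Renaud Delbru"},
    "Record 4": {"full_name": "Renaud Delbru"},
}


def name_tokens(attribute):
    """Split an attribute name into keywords, e.g. 'rdfs:label' -> {'rdfs', 'label'}."""
    return {t for t in re.split(r"[^a-z0-9]+", attribute.lower()) if t}


def exact_name_search(records, names, value):
    """Match only when the attribute name equals one of the given names."""
    return [rid for rid, rec in records.items()
            if any(attr in names and val == value for attr, val in rec.items())]


def keyword_search(records, name_keywords, value):
    """Match when any keyword occurs among the tokens of the attribute name."""
    wanted = {k.lower() for k in name_keywords}
    return [rid for rid, rec in records.items()
            if any(name_tokens(attr) & wanted and val == value
                   for attr, val in rec.items())]


exact_hits = exact_name_search(records, {"name", "label"}, "Renaud Delbru")
keyword_hits = keyword_search(records, {"name", "label"}, "Renaud Delbru")
print(exact_hits, keyword_hits)
```

Exact-name lookup finds nothing because no record uses the bare attribute `name` or `label`, while keyword matching on the attribute names retrieves all four records.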
    25. Limitations of Apache Solr
        • Not efficient with highly heterogeneous structured data sources
          – Limitation on the number of attributes: dictionary size explosion
          – Query clause explosion when searching across all attributes
        • Limited support for structured queries
          – Multi-valued attributes
          – No full-text search on attribute names
          – No 1:N relationship materialisation
    26. Relationship materialization
        • It's JSON-like indexing and searching
        • Materialize the relationships between your entities and others.
    27. Some numbers: SIREn on Sindice
        Data collection:
        • 500M web data documents (RDF, RDFa, Microformat, etc.)
        • 200K datasets
        • 50B triples
        Settings:
        • Cluster of 4 nodes
        • 2 nodes for indexing
        • 2 nodes for querying
        • Replication
        Indexing performance:
        • Full index construction takes approx. 24 hours
        • 436K triples / second
        Services:
        • Keyword and structured queries
        • Dataset search
        • >> 99% uptime
    28. Large-scale RDF "Summaries"
    29. Introducing large-scale RDF "Summaries". We do it for:
        • Data exploration
          – How to find datasets about movies?
        • Assisted SPARQL Query Editor
          – What is the data structure?
        • Dataset quality
          – How to differentiate relevant from irrelevant datasets?
    30. Large-scale RDF summaries. Class level. (Figure labels: 12M relationships; 10B relationships)
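As a rough illustration of what a class-level summary might contain, the Python sketch below counts entities per class and relationships between classes per predicate, the two statistics mentioned in the slide notes. It is a toy under stated assumptions, not the Sindice pipeline: the `ex:` triples are invented sample data, `rdf:type` is written as a plain string, and `class_level_summary` is a hypothetical function.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def class_level_summary(triples):
    """Return (entities per class, relationships per (class, predicate, class))."""
    # First pass: record the classes of every typed subject.
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    entities_per_class = Counter(c for classes in types.values() for c in classes)

    # Second pass: lift each entity-level link to a class-level edge.
    class_edges = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE or o not in types:
            continue  # skip type statements, literals and untyped objects
        for subject_class in types.get(s, ()):
            for object_class in types[o]:
                class_edges[(subject_class, p, object_class)] += 1
    return entities_per_class, class_edges


# Invented sample data for illustration only.
triples = [
    ("ex:film1", "rdf:type", "ex:Movie"),
    ("ex:film2", "rdf:type", "ex:Movie"),
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:film1", "ex:director", "ex:alice"),
    ("ex:film2", "ex:director", "ex:alice"),
    ("ex:film1", "ex:title", "Some title"),
]

entities_per_class, class_edges = class_level_summary(triples)
print(entities_per_class, class_edges)
```

Many entity-level links collapse into a single weighted class-level edge, which is why a summary can be far smaller than the data it describes.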
    31. Sindice Analytics Widget Demo
        • http://test01.sindice.net:9001/sindice-stats-webapp/
        • http://test01.sindice.net/szydan/dataset-view/dataset/default/www.bbc.co.uk
    32. Relational Faceted Browsing. At the speed of light. Patent pending
    33. SPARQL is awesome. And now your guys can actually use it.
    34. Thank you. Sindice.com team, April 2012. With the contribution of
