Real Time Semantic Warehousing: Sindice.com technology for the enterprise
Giovanni Tummarello, Ph.D.
Data Intensive Infrastructure Unit, DERI.ie
CEO, SindiceTech
How we started: Sindice.com




 80 billion triples, 500,000,000 RDF graphs, 5 TB of data.
The Sindice Suite powers Sindice.com. Online with 99.9%+ uptime.
Semantic Sandboxes on: Sindice.com




 Data Sandboxes in Sindice.com – Powered by CloudSpaces
And then we met people asking:
      can you do it for us?
Example story (pharmaceutical company)
To stay competitive, pharmaceutical companies need to leverage all the data available from
internal sources as well as from the growing number of public HCLS data sources. Because
this data is diverse in nature, format, and quality, integrating it is complex.
Traditional data warehousing technology requires extensive upfront design and is handled
within a company through the “go via the IT department” approach. This does not meet the needs of
data scientists, who are the only ones able to do the complex cross-use-case thinking required.
Via Real Time Semantic Data Warehousing (RETIS), data scientists expect to get:

•   The ability to speed up “in silico” scientific workflows (interrelation of diverse large
    datasets) by orders of magnitude by relying on a data warehousing approach.
•   The ability to create large-scale “data maps” or “aggregated views” that allow
    researchers to see “trends” and gather high-level insights that would not be possible
    when data is accessed via single lookups.
•   The ability to receive recommendations and suggestions for new data connections based on
    an ever-evolving ecosystem of available experimental datasets.
•   Superior tools for their R&D departments to investigate internal
    knowledge: search engines and data browsing tools that provide unified views of multiple,
    evolving, live datasets without leaking specific “queries” to the outside world, which would
    reveal internal research trends.
•   The ability to leverage the ever-increasing body of public, crowd-curated open data.

Linked Data clouds for the Enterprise

  – Strategic knowledge spaces, where new
    databases can be added and “leveraged” with
    unprecedented ease
  – “Pay as you go” integration: explore now, fine-tune
    later.
  – It's Big Data (clusters + clouds) meets RDF and
    Semantic Technologies
Sindice.com
Because you need Semantic SandBoxes
A Dataspace Template




A typical implementation template.
Dataspaces own:
• Resources
• Services
• Datasets for others to reuse
[Diagram label: Semantic Web Data]
Dataspace Composition




   Scalable cascading semantic “Dataspaces”
   • Resources allocated in public/private clouds
   • Lets you take Sindice data and mix or process it for private purposes


Cloud powered!
<dataspace id="iphonedataspace">

<dependencies>
  http://ecommerce01.dataspace.sindice.net/
  http://price01.dataspace.sindice.net/
</dependencies>

<resources>
   <mysql name="sql" />
   <hbase size="10g" />
   <siren name="index" />
   <triplestore name="sparql" kind="virtuoso" />
</resources>

<retention> <!-- see later -->
  <update-rate>1D</update-rate>
  <timeout>1D</timeout>
</retention>
</dataspace>
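As a rough illustration of how a descriptor like the one above could be consumed, here is a minimal Python sketch. The element and attribute names are taken from the slide; the `load_dataspace` helper and the shape of the returned dictionary are assumptions made for this example and are not part of the Sindice tooling.

```python
import xml.etree.ElementTree as ET

# Descriptor mirroring the slide, written as well-formed XML.
DESCRIPTOR = """
<dataspace id="iphonedataspace">
  <dependencies>
    http://ecommerce01.dataspace.sindice.net/
    http://price01.dataspace.sindice.net/
  </dependencies>
  <resources>
    <mysql name="sql" />
    <hbase size="10g" />
    <siren name="index" />
    <triplestore name="sparql" kind="virtuoso" />
  </resources>
  <retention>
    <update-rate>1D</update-rate>
    <timeout>1D</timeout>
  </retention>
</dataspace>
"""


def load_dataspace(xml_text):
    """Hypothetical helper: turn a dataspace descriptor into a plain dict."""
    root = ET.fromstring(xml_text)
    # Dependencies are whitespace-separated URLs inside <dependencies>.
    dependencies = (root.findtext("dependencies") or "").split()
    # Each child of <resources> is one resource; keep its attributes.
    resources = {child.tag: dict(child.attrib) for child in root.find("resources")}
    # Each child of <retention> is a simple text setting.
    retention = {child.tag: child.text for child in root.find("retention")}
    return {
        "id": root.get("id"),
        "dependencies": dependencies,
        "resources": resources,
        "retention": retention,
    }


config = load_dataspace(DESCRIPTOR)
print(config["id"], config["dependencies"], sorted(config["resources"]))
```

The sketch only reads the descriptor; how resources are actually allocated in a cloud is outside what the slide describes.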



Scale is only 1 dimension




Multiple dimensions of Web data integration
• RDF tool stack → flexibility
• Cluster scalable processing → scalability
• “Cloud” pipelines → dynamicity
Full JSON-like search.
         On Solr.
All operators supported.
What is SIREn ?

• Plugin to Solr
• Built for searching and operating on
  semi-structured data and relational
  data structures
SIREn: Semantic IR Engine

• Extension to Enterprise Search Engine Solr
• Semantic, full-text, incremental updates,
  distributed search
[Figure labels: Semantic Databases, SIREn, Constant time]
Limitations of Apache Solr

• Not efficient with highly heterogeneous
  structured data sources
  – Limitation on the number of attributes:
     Dictionary size explosion
Dictionary Size Explosion

    Record 1                          Dictionary
    label    Renaud Delbru            label:renaud
    name     Renaud Delbru            label:delbru
                                      name:renaud
                                      name:delbru

    Dictionary construction
           Concatenation of attribute name and term
           N * M complexity (worst case)
    2 attributes * 2 terms = 4 dictionary entries
    100K attributes * 1B terms = 100T entries
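The dictionary construction described on this slide can be sketched in a few lines of Python. This is only an illustration of the "attribute name + term" concatenation; the function names are made up for the example, and the tokenization (lowercase, split on whitespace) is an assumption.

```python
def build_dictionary(records):
    """Collect the 'attribute:term' entries a field-based index would store."""
    entries = set()
    for record in records:
        for attribute, value in record.items():
            # Assumed tokenization: lowercase and split on whitespace.
            for term in value.lower().split():
                entries.add(f"{attribute}:{term}")
    return entries


def worst_case_entries(n_attributes, n_terms):
    """Upper bound when every term occurs under every attribute (N * M)."""
    return n_attributes * n_terms


records = [{"label": "Renaud Delbru", "name": "Renaud Delbru"}]
dictionary = build_dictionary(records)
print(sorted(dictionary))
print(worst_case_entries(2, 2))
```

For the single record on the slide, the two attributes and two terms yield four distinct entries, which matches the worst-case bound.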
Multi-valued attributes
  • No support in Solr for "all words must match
    in the same value of a multi-valued field".
  • A field value is a bag of words
        – No distinction between multiple values
  • Query example
        – label : man's friend
        – Solr returns both Record 1 and Record 2, although only Record 1
          contains both words within a single value

Record 1    label    "man's best friend", "pooch"
Record 2    label    "man's worst enemy", "friend to no one"
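The difference between bag-of-words matching and per-value matching can be illustrated with a small Python sketch. It does not call Solr or SIREn; both matching functions are simplified stand-ins written for this example, using the two records from the slide.

```python
def tokens(text):
    """Assumed tokenization: lowercase and split on whitespace."""
    return set(text.lower().split())


def bag_of_words_match(values, query):
    """All values of the field are merged into one bag before matching."""
    bag = set()
    for value in values:
        bag |= tokens(value)
    return tokens(query) <= bag


def same_value_match(values, query):
    """Every query word must occur within one single value."""
    wanted = tokens(query)
    return any(wanted <= tokens(value) for value in values)


record1 = ["man's best friend", "pooch"]
record2 = ["man's worst enemy", "friend to no one"]
query = "man's friend"

for name, values in (("Record 1", record1), ("Record 2", record2)):
    print(name, bag_of_words_match(values, query), same_value_match(values, query))
```

Under the bag-of-words model both records match; when value boundaries are respected, only Record 1 does.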
Full-text search on attribute names
• No support in Solr for "keyword search in
  attribute names".
• Query example
       – (name OR label) = "Renaud Delbru"
       – Solr is unable to find the records without the exact
         attribute name

Record 1    rdfs:label    Renaud Delbru
Record 2    foaf:name     Renaud Delbru
Record 3    sioc:name     Renaud Delbru
Record 4    full_name     Renaud Delbru
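To make the query example concrete, here is a minimal Python sketch of keyword search over attribute names. It is not how SIREn is implemented; the idea of tokenizing attribute names on non-alphanumeric characters, and the helper names, are assumptions chosen for the illustration.

```python
import re


def name_tokens(attribute):
    """Split an attribute name such as 'foaf:name' or 'full_name' into keywords."""
    return {part for part in re.split(r"[^a-z0-9]+", attribute.lower()) if part}


def exact_name_search(records, names, value):
    """Lookup that requires the exact attribute name."""
    return [rid for rid, fields in records.items()
            if any(fields.get(name) == value for name in names)]


def keyword_name_search(records, keywords, value):
    """Lookup that matches keywords inside the attribute name."""
    wanted = {keyword.lower() for keyword in keywords}
    return [rid for rid, fields in records.items()
            if any(name_tokens(attr) & wanted and field_value == value
                   for attr, field_value in fields.items())]


records = {
    1: {"rdfs:label": "Renaud Delbru"},
    2: {"foaf:name": "Renaud Delbru"},
    3: {"sioc:name": "Renaud Delbru"},
    4: {"full_name": "Renaud Delbru"},
}

print(exact_name_search(records, ["name", "label"], "Renaud Delbru"))
print(keyword_name_search(records, ["name", "label"], "Renaud Delbru"))
```

The exact-name lookup finds nothing because no record has an attribute literally called `name` or `label`, while the keyword-based lookup reaches all four records.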
Limitations of Apache Solr
• Not efficient with highly heterogeneous
  structured data sources
  – Limitation on the number of attributes:
     • Dictionary size explosion
     • Query clause explosion when searching across all
       attributes
• Limited support for structured query
  – Multi-valued attributes
  – No full-text search on attribute names
  – No 1:N relationship materialisation
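The "query clause explosion" item in the list above can be sketched as follows. This is only an illustration under the assumption that each attribute is indexed as its own field, so a keyword search across all attributes needs one `field:"term"` clause per attribute; the attribute names are hypothetical.

```python
def all_attributes_query(attributes, keyword):
    """Build a Lucene-style query string with one clause per attribute."""
    clauses = [f'{attribute}:"{keyword}"' for attribute in attributes]
    return " OR ".join(clauses)


# Hypothetical attribute names; a heterogeneous collection may have far more.
small = ["label", "name", "title", "full_name"]
query = all_attributes_query(small, "delbru")
print(query)

# The number of clauses grows linearly with the number of attributes.
large = [f"attr_{i}" for i in range(100_000)]
clause_count = all_attributes_query(large, "delbru").count(" OR ") + 1
print(clause_count)
```

With four attributes the query is still readable; with 100K attributes the same search needs 100K clauses.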
Relationship materialization

• It's JSON-like indexing and searching

• Materialize the relationships between your
  entities and others.
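One way to picture relationship materialization is as embedding related entities into a single nested, JSON-like document before indexing. The Python sketch below illustrates that idea only; the `materialize` helper, the entity identifiers, and the document layout are assumptions and do not reflect SIREn's actual document format.

```python
import json


def materialize(entity_id, entities, relations):
    """Return a nested document that embeds the entities related to entity_id."""
    doc = {"id": entity_id, **entities[entity_id]}
    for subject, predicate, obj in relations:
        if subject == entity_id:
            # 1:N relationships become a list of embedded documents.
            doc.setdefault(predicate, []).append({"id": obj, **entities[obj]})
    return doc


# Hypothetical data: one product with two offers.
entities = {
    "phone1": {"label": "Example phone"},
    "offer1": {"price": "199"},
    "offer2": {"price": "249"},
}
relations = [
    ("phone1", "offer", "offer1"),
    ("phone1", "offer", "offer2"),
]

doc = materialize("phone1", entities, relations)
print(json.dumps(doc, indent=2))
```

Once the related entities live inside the document, a single query can combine conditions on the product and on its offers.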
Some numbers: SIREn on Sindice

Data Collection
• 500M web data documents (RDF, RDFa, Microformat, etc.)
• 200K datasets
• 50B triples

Settings
• Cluster of 4 nodes
  – 2 nodes for indexing
  – 2 nodes for querying
• Replication

Indexing Performance
• Full index construction takes approx. 24 hours
• 436K triples / second

Services
• Keyword and structured queries
• Dataset search
• >> 99% uptime
Introducing large scale RDF “Summaries”

We do it for:
• Data exploration
  – How to find datasets about movies?
• Assisted SPARQL Query Editor
  – What is the data structure?
• Dataset Quality
  – How to differentiate relevant from irrelevant
    datasets?
Large Scale RDF summaries

[Figure labels: Class Level, 12M relationships; 10B relationships]
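The slides do not say how the summaries are computed. As a hedged illustration of what a class-level summary could look like, the Python sketch below collapses entity-level relationships into relationships between classes, using `rdf:type` statements to find each entity's classes. The function name and the sample triples are made up for the example.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def class_level_summary(triples):
    """Count relationships between classes instead of between entities."""
    # First pass: record the classes of every typed entity.
    types = {}
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            types.setdefault(subject, set()).add(obj)

    # Second pass: lift each entity-level edge to the class level.
    summary = Counter()
    for subject, predicate, obj in triples:
        if predicate == RDF_TYPE:
            continue
        for subject_class in types.get(subject, ()):
            for object_class in types.get(obj, ()):
                summary[(subject_class, predicate, object_class)] += 1
    return summary


triples = [
    ("ex:film1", RDF_TYPE, "ex:Movie"),
    ("ex:film2", RDF_TYPE, "ex:Movie"),
    ("ex:alice", RDF_TYPE, "ex:Person"),
    ("ex:bob", RDF_TYPE, "ex:Person"),
    ("ex:film1", "ex:director", "ex:alice"),
    ("ex:film2", "ex:director", "ex:bob"),
    ("ex:film2", "ex:actor", "ex:alice"),
]

summary = class_level_summary(triples)
for edge, count in sorted(summary.items()):
    print(edge, count)
```

In this toy dataset three entity-level relationships collapse into two class-level ones, which is the kind of reduction that makes a summary small enough to browse or to drive a query editor.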
Sindice Analytics Widget Demo

• http://test01.sindice.net:9001/sindice-stats-webapp/

• http://test01.sindice.net/szydan/dataset-view/dataset/default/www.bbc.co.uk
Relational Faceted Browsing. At the speed of light




                                   Patent Pending
SPARQL is awesome.
And now your guys can actually use it.
Thank you




              Sindice.com team April 2012

With the contribution of

More Related Content

What's hot

Contributing to the Smart City Through Linked Library Data
Contributing to the Smart City Through Linked Library DataContributing to the Smart City Through Linked Library Data
Contributing to the Smart City Through Linked Library Data
Marcia Zeng
 
AAT LOD Microthesauri
AAT LOD MicrothesauriAAT LOD Microthesauri
AAT LOD Microthesauri
Marcia Zeng
 
Piloting Linked Data to Connect Library and Archive Resources to the New Worl...
Piloting Linked Data to Connect Library and Archive Resources to the New Worl...Piloting Linked Data to Connect Library and Archive Resources to the New Worl...
Piloting Linked Data to Connect Library and Archive Resources to the New Worl...
Laura Akerman
 
Semantic Web Austin Yahoo
Semantic Web Austin YahooSemantic Web Austin Yahoo
Semantic Web Austin Yahoo
Peter Mika
 
Metadata Provenance Tutorial at SWIB 13, Part 1
Metadata Provenance Tutorial at SWIB 13, Part 1Metadata Provenance Tutorial at SWIB 13, Part 1
Metadata Provenance Tutorial at SWIB 13, Part 1
Kai Eckert
 
It's 2017, and I still want to sell you a graph database
It's 2017, and I still want to sell you a graph databaseIt's 2017, and I still want to sell you a graph database
It's 2017, and I still want to sell you a graph database
Swanand Pagnis
 
2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP Spain2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP Spain
Christian Martorella
 
An introduction to Semantic Web and Linked Data
An introduction to Semantic Web and Linked DataAn introduction to Semantic Web and Linked Data
An introduction to Semantic Web and Linked Data
Fabien Gandon
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
EUCLID project
 
Semantic Web
Semantic WebSemantic Web
SDA2013 Pundit: Creating, Exploring and Consuming Annotations
SDA2013 Pundit: Creating, Exploring and Consuming AnnotationsSDA2013 Pundit: Creating, Exploring and Consuming Annotations
SDA2013 Pundit: Creating, Exploring and Consuming Annotations
Marco Grassi
 
Linked data HHS 2015
Linked data HHS 2015Linked data HHS 2015
Linked data HHS 2015
Cason Snow
 
NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
National Information Standards Organization (NISO)
 
Introduction To RDF and RDFS
Introduction To RDF and RDFSIntroduction To RDF and RDFS
Introduction To RDF and RDFS
Nilesh Wagmare
 
An introduction to Semantic Web and Linked Data
An introduction to Semantic  Web and Linked DataAn introduction to Semantic  Web and Linked Data
An introduction to Semantic Web and Linked DataGabriela Agustini
 
Linked Data Usecases
Linked Data UsecasesLinked Data Usecases
Linked Data Usecases
Myungjin Lee
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services gen
robin fay
 

What's hot (19)

Contributing to the Smart City Through Linked Library Data
Contributing to the Smart City Through Linked Library DataContributing to the Smart City Through Linked Library Data
Contributing to the Smart City Through Linked Library Data
 
AAT LOD Microthesauri
AAT LOD MicrothesauriAAT LOD Microthesauri
AAT LOD Microthesauri
 
Piloting Linked Data to Connect Library and Archive Resources to the New Worl...
Piloting Linked Data to Connect Library and Archive Resources to the New Worl...Piloting Linked Data to Connect Library and Archive Resources to the New Worl...
Piloting Linked Data to Connect Library and Archive Resources to the New Worl...
 
Semantic Web Austin Yahoo
Semantic Web Austin YahooSemantic Web Austin Yahoo
Semantic Web Austin Yahoo
 
ITWS Capstone Lecture (Spring 2013)
ITWS Capstone Lecture (Spring 2013)ITWS Capstone Lecture (Spring 2013)
ITWS Capstone Lecture (Spring 2013)
 
Metadata Provenance Tutorial at SWIB 13, Part 1
Metadata Provenance Tutorial at SWIB 13, Part 1Metadata Provenance Tutorial at SWIB 13, Part 1
Metadata Provenance Tutorial at SWIB 13, Part 1
 
It's 2017, and I still want to sell you a graph database
It's 2017, and I still want to sell you a graph databaseIt's 2017, and I still want to sell you a graph database
It's 2017, and I still want to sell you a graph database
 
2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP Spain2011 and still bruteforcing - OWASP Spain
2011 and still bruteforcing - OWASP Spain
 
An introduction to Semantic Web and Linked Data
An introduction to Semantic Web and Linked DataAn introduction to Semantic Web and Linked Data
An introduction to Semantic Web and Linked Data
 
Usage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application ScenariosUsage of Linked Data: Introduction and Application Scenarios
Usage of Linked Data: Introduction and Application Scenarios
 
Semantic Web
Semantic WebSemantic Web
Semantic Web
 
SDA2013 Pundit: Creating, Exploring and Consuming Annotations
SDA2013 Pundit: Creating, Exploring and Consuming AnnotationsSDA2013 Pundit: Creating, Exploring and Consuming Annotations
SDA2013 Pundit: Creating, Exploring and Consuming Annotations
 
Linked data HHS 2015
Linked data HHS 2015Linked data HHS 2015
Linked data HHS 2015
 
NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
NISO/DCMI Webinar: International Bibliographic Standards, Linked Data, and th...
 
Introduction To RDF and RDFS
Introduction To RDF and RDFSIntroduction To RDF and RDFS
Introduction To RDF and RDFS
 
XML Bible
XML BibleXML Bible
XML Bible
 
An introduction to Semantic Web and Linked Data
An introduction to Semantic  Web and Linked DataAn introduction to Semantic  Web and Linked Data
An introduction to Semantic Web and Linked Data
 
Linked Data Usecases
Linked Data UsecasesLinked Data Usecases
Linked Data Usecases
 
Challenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services genChallenges and opportunities in library discovery services gen
Challenges and opportunities in library discovery services gen
 

Similar to Sindice warehousing meetup

ISWC GoodRelations Tutorial Part 2
ISWC GoodRelations Tutorial Part 2ISWC GoodRelations Tutorial Part 2
ISWC GoodRelations Tutorial Part 2
Martin Hepp
 
GoodRelations Tutorial Part 2
GoodRelations Tutorial Part 2GoodRelations Tutorial Part 2
GoodRelations Tutorial Part 2
guestecacad2
 
10-15-13 “Metadata and Repository Services for Research Data Curation” Presen...
10-15-13 “Metadata and Repository Services for Research Data Curation” Presen...10-15-13 “Metadata and Repository Services for Research Data Curation” Presen...
10-15-13 “Metadata and Repository Services for Research Data Curation” Presen...
DuraSpace
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
Ontotext
 
RDF Graph Data Management in Oracle Database and NoSQL Platforms
RDF Graph Data Management in Oracle Database and NoSQL PlatformsRDF Graph Data Management in Oracle Database and NoSQL Platforms
RDF Graph Data Management in Oracle Database and NoSQL Platforms
Graph-TA
 
Redis - Your Magical superfast database
Redis - Your Magical superfast databaseRedis - Your Magical superfast database
Redis - Your Magical superfast database
the100rabh
 
Knowledge Representation, Semantic Web
Knowledge Representation, Semantic WebKnowledge Representation, Semantic Web
Knowledge Representation, Semantic WebSerendipity Seraph
 
RDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itRDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use it
Jose Luis Lopez Pino
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
Rahul Jain
 
New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1
Stefan Schmidt
 
Finding Love with MongoDB
Finding Love with MongoDBFinding Love with MongoDB
Finding Love with MongoDBMongoDB
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
Rahul Jain
 
A review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic WebA review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic Web
Simon Price
 
Introduction to RDF
Introduction to RDFIntroduction to RDF
Introduction to RDF
Dr Sukhpal Singh Gill
 
Duraspace Hot Topics Series 6: Metadata and Repository Services
Duraspace Hot Topics Series 6: Metadata and Repository ServicesDuraspace Hot Topics Series 6: Metadata and Repository Services
Duraspace Hot Topics Series 6: Metadata and Repository Services
Matthew Critchlow
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic Web
Roberto García
 
Solr 8 interview
Solr 8 interview Solr 8 interview
Solr 8 interview
Alihossein shahabi
 
SPARQL in the Semantic Web
SPARQL in the Semantic WebSPARQL in the Semantic Web
SPARQL in the Semantic WebJan Beeck
 
An Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jAn Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4j
Debanjan Mahata
 

Similar to Sindice warehousing meetup (20)

ISWC GoodRelations Tutorial Part 2
ISWC GoodRelations Tutorial Part 2ISWC GoodRelations Tutorial Part 2
ISWC GoodRelations Tutorial Part 2
 
GoodRelations Tutorial Part 2
GoodRelations Tutorial Part 2GoodRelations Tutorial Part 2
GoodRelations Tutorial Part 2
 
10-15-13 “Metadata and Repository Services for Research Data Curation” Presen...
10-15-13 “Metadata and Repository Services for Research Data Curation” Presen...10-15-13 “Metadata and Repository Services for Research Data Curation” Presen...
10-15-13 “Metadata and Repository Services for Research Data Curation” Presen...
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
 
RDF Graph Data Management in Oracle Database and NoSQL Platforms
RDF Graph Data Management in Oracle Database and NoSQL PlatformsRDF Graph Data Management in Oracle Database and NoSQL Platforms
RDF Graph Data Management in Oracle Database and NoSQL Platforms
 
Redis - Your Magical superfast database
Redis - Your Magical superfast databaseRedis - Your Magical superfast database
Redis - Your Magical superfast database
 
Knowledge Representation, Semantic Web
Knowledge Representation, Semantic WebKnowledge Representation, Semantic Web
Knowledge Representation, Semantic Web
 
RDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use itRDFa: introduction, comparison with microdata and microformats and how to use it
RDFa: introduction, comparison with microdata and microformats and how to use it
 
Introduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and UsecasesIntroduction to Lucene & Solr and Usecases
Introduction to Lucene & Solr and Usecases
 
New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1New Persistence Features in Spring Roo 1.1
New Persistence Features in Spring Roo 1.1
 
Finding Love with MongoDB
Finding Love with MongoDBFinding Love with MongoDB
Finding Love with MongoDB
 
Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )Case study of Rujhaan.com (A social news app )
Case study of Rujhaan.com (A social news app )
 
Knowledge mangement
Knowledge mangementKnowledge mangement
Knowledge mangement
 
A review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic WebA review of the state of the art in Machine Learning on the Semantic Web
A review of the state of the art in Machine Learning on the Semantic Web
 
Introduction to RDF
Introduction to RDFIntroduction to RDF
Introduction to RDF
 
Duraspace Hot Topics Series 6: Metadata and Repository Services
Duraspace Hot Topics Series 6: Metadata and Repository ServicesDuraspace Hot Topics Series 6: Metadata and Repository Services
Duraspace Hot Topics Series 6: Metadata and Repository Services
 
Exploring the Semantic Web
Exploring the Semantic WebExploring the Semantic Web
Exploring the Semantic Web
 
Solr 8 interview
Solr 8 interview Solr 8 interview
Solr 8 interview
 
SPARQL in the Semantic Web
SPARQL in the Semantic WebSPARQL in the Semantic Web
SPARQL in the Semantic Web
 
An Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4jAn Introduction to NOSQL, Graph Databases and Neo4j
An Introduction to NOSQL, Graph Databases and Neo4j
 

More from Semantic Web San Diego

2013 april gruff webinar san diego copy
2013 april  gruff webinar   san diego copy2013 april  gruff webinar   san diego copy
2013 april gruff webinar san diego copy
Semantic Web San Diego
 
The RDFa, seo wave
The RDFa, seo waveThe RDFa, seo wave
The RDFa, seo wave
Semantic Web San Diego
 
Semantic Web and the Web Of Commerce - pdf version
Semantic Web and the Web Of Commerce - pdf versionSemantic Web and the Web Of Commerce - pdf version
Semantic Web and the Web Of Commerce - pdf version
Semantic Web San Diego
 
Simplifying semantics for biomedical applications
Simplifying semantics for biomedical applicationsSimplifying semantics for biomedical applications
Simplifying semantics for biomedical applications
Semantic Web San Diego
 
Bio Seminar 2010
Bio Seminar 2010Bio Seminar 2010
Bio Seminar 2010
Semantic Web San Diego
 
San Diego 2010
San Diego 2010San Diego 2010
San Diego 2010
Semantic Web San Diego
 

More from Semantic Web San Diego (8)

2013 april gruff webinar san diego copy
2013 april  gruff webinar   san diego copy2013 april  gruff webinar   san diego copy
2013 april gruff webinar san diego copy
 
The RDFa, seo wave
The RDFa, seo waveThe RDFa, seo wave
The RDFa, seo wave
 
Rdfa semtech2011
Rdfa semtech2011Rdfa semtech2011
Rdfa semtech2011
 
Semantic Web and the Web Of Commerce - pdf version
Semantic Web and the Web Of Commerce - pdf versionSemantic Web and the Web Of Commerce - pdf version
Semantic Web and the Web Of Commerce - pdf version
 
Simplifying semantics for biomedical applications
Simplifying semantics for biomedical applicationsSimplifying semantics for biomedical applications
Simplifying semantics for biomedical applications
 
Sd sem weboct252010
Sd sem weboct252010Sd sem weboct252010
Sd sem weboct252010
 
Bio Seminar 2010
Bio Seminar 2010Bio Seminar 2010
Bio Seminar 2010
 
San Diego 2010
San Diego 2010San Diego 2010
San Diego 2010
 

Recently uploaded

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
GraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge GraphGraphRAG is All You need? LLM & Knowledge Graph
GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3UiPath Test Automation using UiPath Test Suite series, part 3
UiPath Test Automation using UiPath Test Suite series, part 3
DianaGray10
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi"Impact of front-end architecture on development cost", Viktor Turskyi
"Impact of front-end architecture on development cost", Viktor Turskyi
Sindice warehousing meetup

  • 1. Real Time Semantic Warehousing: Sindice.com technology for the enterprise Giovanni Tummarello, Ph.D Data Intensive Infrastructure UNIT - DERI.ie CEO SindiceTech
  • 2. How we started: Sindice.com. 80 billion triples, 500,000,000 RDF graphs, 5 TB of data. The Sindice Suite powers Sindice.com. Online with 99.9%+
  • 3. Semantic Sandboxes on: Sindice.com Data Sandboxes in Sindice.com – Powered by CloudSpaces
  • 4. And then we met people asking: can you do it for us?
  • 5. Example story (pharmaceutical company). To stay competitive, pharmaceutical companies need to leverage all the data available from internal sources as well as from the growing number of public HCLS data sources. Because this data is diverse in nature, format, and quality, integration is complex. Traditional data warehousing technology requires extensive upfront design and is handled within a company through the “go via the IT department” approach. This does not meet the needs of data scientists, who are the only ones able to do the complex cross-use-case thinking required. Via Real Time Semantic Data Warehousing (RETIS), data scientists expect to get:
      • The ability to speed up “in silico” scientific workflows (interrelation of diverse large datasets) by orders of magnitude by relying on a data warehousing approach.
      • The ability to create large-scale “data maps” or “aggregated views” that let researchers see “trends” and gather high-level insights that are not possible when data is accessed via single lookups.
      • The ability to receive recommendations and suggestions for new data connections based on an ever-evolving ecosystem of available experimental datasets.
      • Superior tools for their R&D departments to investigate internal knowledge: search engines and data browsing tools that provide unified views of multiple, evolving, live datasets without leaking specific “queries” to the outside world, which would reveal internal research trends.
      • The ability to leverage the ever-increasing body of public, crowd-curated open data.
  • 6. Linked Data clouds for the Enterprise
      – Strategic knowledge spaces, where new databases can be added and “leveraged” with unprecedented ease
      – Integration “pay as you go”: explore now, fine-tune later.
      – It's Big Data (cluster + clouds) meets RDF and Semantic Technologies
  • 8. Because you need Semantic SandBoxes
  • 9. A Dataspace Template Semantic Web A typical implementation template. Data Dataspaces own: • Resources • Services • Datasets for others to reuse
  • 10. Dataspace Composition. Scalable cascading semantic “Dataspaces” • Resources allocated in public/private clouds • Allows getting Sindice data and mixing/processing it for private purposes
  • 11. Cloud powered!
      <dataspace id="iphonedataspace">
        <dependencies>
          http://ecommerce01.dataspace.sindice.net/
          http://price01.dataspace.sindice.net/
        </dependencies>
        <resources>
          <mysql name="sql" />
          <hbase size="10g" />
          <siren name="index" />
          <triplestore name="sparql" kind="virtuoso" />
        </resources>
        <retention> <!-- see later -->
          <update-rate>1D</update-rate>
          <timeout>1D</timeout>
        </retention>
      </dataspace>
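
To illustrate how tooling might read a dataspace descriptor like the one on slide 11, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. It assumes a well-formed variant of the slide's XML (self-closing resource elements, dependency URLs separated by whitespace). The actual CloudSpaces schema is not given in the slides, and the helper name `load_dataspace` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumed well-formed variant of the slide's descriptor; not an official schema.
DESCRIPTOR = """
<dataspace id="iphonedataspace">
  <dependencies>
    http://ecommerce01.dataspace.sindice.net/
    http://price01.dataspace.sindice.net/
  </dependencies>
  <resources>
    <mysql name="sql" />
    <hbase size="10g" />
    <siren name="index" />
    <triplestore name="sparql" kind="virtuoso" />
  </resources>
  <retention>
    <update-rate>1D</update-rate>
    <timeout>1D</timeout>
  </retention>
</dataspace>
"""


def load_dataspace(xml_text):
    """Return the descriptor as a plain dict (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("id"),
        # Dependencies are listed as whitespace-separated URLs.
        "dependencies": (root.findtext("dependencies") or "").split(),
        # One entry per resource element, keyed by element name.
        "resources": {r.tag: dict(r.attrib) for r in root.find("resources")},
        "retention": {c.tag: c.text for c in root.find("retention")},
    }


spec = load_dataspace(DESCRIPTOR)
print(spec["id"], spec["dependencies"], sorted(spec["resources"]))
```

Keeping the parsed form as a plain dictionary makes it easy to inspect which upstream dataspaces and resources a dataspace declares before anything is provisioned.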
  • 12. Scale is only one dimension. Multiple dimensions of Web data integration: • RDF tool stack → flexibility • Cluster-scalable processing → scalability • “Cloud” pipelines → dynamicity
  • 13. Full JSON-like search. On Solr. All operators supported.
  • 14. What is SIREn ? • Plugin to Solr • Built for searching and operating on semistructured data and relational datastructures
  • 15. SIREn: Semantic IR Engine • Extension to Enterprise Search Engine Solr • Semantic, full-text, incremental updates, distributed search Semantic SIREn Databases Constant time
  • 16. Limitations of Apache Solr
      • Not efficient with highly heterogeneous structured data sources
          – Limitation on the number of attributes: dictionary size explosion
  • 17. Dictionary Size Explosion
      Record 1: label = Renaud Delbru; name = Renaud Delbru
  • 18. Dictionary Size Explosion
      Record 1: label = Renaud Delbru; name = Renaud Delbru
      Dictionary: label:renaud, label:delbru, name:renaud, name:delbru
      • Dictionary construction
      • Concatenation of attribute name and term
      • N * M complexity (worst case)
      • 2 attributes * 2 terms = 4 dictionary entries
      • 100K attributes * 1B terms = 100T (10^14) entries
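
The dictionary construction on slide 18 can be sketched in a few lines of Python. This is a toy model of a field-based term dictionary, not Solr's actual implementation; the function names and the lowercase whitespace tokenisation are assumptions made for illustration. Note that 100K attributes times 1B terms works out to 10^14 entries in the worst case.

```python
def build_dictionary(records):
    """Collect the distinct 'attribute:term' keys a field-based index would store."""
    entries = set()
    for record in records:
        for attribute, value in record.items():
            # Naive analyzer: lowercase and split on whitespace.
            for term in value.lower().split():
                entries.add(f"{attribute}:{term}")
    return entries


def worst_case_entries(n_attributes, n_terms):
    """Upper bound when every term occurs under every attribute (N * M)."""
    return n_attributes * n_terms


records = [{"label": "Renaud Delbru", "name": "Renaud Delbru"}]
dictionary = build_dictionary(records)
print(sorted(dictionary))
print(worst_case_entries(2, 2))
print(worst_case_entries(100_000, 1_000_000_000))
```

Because each key embeds the attribute name, the same term is stored once per attribute it appears under, which is why the dictionary grows with the product of the two counts.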
  • 19. Limitations of Apache Solr
      • Not efficient with highly heterogeneous structured data sources
          – Limitation on the number of attributes: dictionary size explosion; query clause explosion when searching across all attributes
  • 20. Limitations of Apache Solr
      • Not efficient with highly heterogeneous structured data sources
          – Limitation on the number of attributes: dictionary size explosion; query clause explosion when searching across all attributes
      • Limited support for structured query
          – Multi-valued attributes
  • 21. Multi-valued attributes
      • No support in Solr for "all words must match in the same value of a multi-valued field".
      • A field value is a bag of words
          – No distinction between multiple values
      Record 1: label = "man's best friend", "pooch"
      Record 2: label = "man's worst enemy", "friend to no one"
  • 22. Multi-valued attributes
      • No support in Solr for "all words must match in the same value of a multi-valued field".
      • A field value is a bag of words
          – No distinction between multiple values
      • Query example
          – label : man's friend
          – Solr returns Record 1 & 2 as results
      Record 1: label = "man's best friend", "pooch"
      Record 2: label = "man's worst enemy", "friend to no one"
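
The difference slide 22 describes can be reproduced with a small Python sketch. It models a multi-valued field in two ways: as one bag of words spanning all values, and as separate values that must each satisfy the whole query. The tokenizer and function names are assumptions for illustration and do not reflect Solr or SIREn internals.

```python
import re


def tokens(text):
    """Lowercase word tokens; a deliberately simple stand-in for a real analyzer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def bag_of_words_match(values, query):
    """All query terms appear somewhere in the field, across any of its values."""
    bag = set().union(*(tokens(v) for v in values))
    return tokens(query) <= bag


def same_value_match(values, query):
    """All query terms appear together inside a single value."""
    wanted = tokens(query)
    return any(wanted <= tokens(v) for v in values)


record1 = ["man's best friend", "pooch"]
record2 = ["man's worst enemy", "friend to no one"]
query = "man's friend"

for name, values in (("Record 1", record1), ("Record 2", record2)):
    print(name, bag_of_words_match(values, query), same_value_match(values, query))
```

Under the bag-of-words model both records match, because "man's" and "friend" each occur somewhere in the label field. Only Record 1 matches when the terms must co-occur within one value.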
  • 23. Limitations of Apache Solr
      • Not efficient with highly heterogeneous structured data sources
          – Limitation on the number of attributes: dictionary size explosion; query clause explosion when searching across all attributes
      • Limited support for structured query
          – Multi-valued attributes
          – No full-text search on attribute names
  • 24. Full-text search on attribute names
      • No support in Solr for "keyword search in attribute names".
      • Query example
          – (name OR label) = "Renaud Delbru"
          – Solr is unable to find the records without the exact attribute name
      Record 1: rdfs:label = Renaud Delbru
      Record 2: foaf:name = Renaud Delbru
      Record 3: sioc:name = Renaud Delbru
      Record 4: full_name = Renaud Delbru
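
As a sketch of what keyword search over attribute names could look like for slide 24's example, the Python below tokenises each attribute name and matches it against the query keywords, alongside an exact-name lookup for contrast. The splitting rule and function names are assumptions for illustration only, not SIREn's actual analysis chain.

```python
import re


def name_tokens(attribute):
    """Split an attribute name such as 'foaf:name' or 'full_name' into keywords."""
    return set(re.split(r"[^a-z0-9]+", attribute.lower())) - {""}


def exact_search(records, attribute, value):
    """Match only when the attribute name is identical to the one queried."""
    return [rid for rid, fields in records.items() if fields.get(attribute) == value]


def keyword_search(records, attribute_keywords, value):
    """Match when any keyword occurs in the attribute name and the value is equal."""
    wanted = {k.lower() for k in attribute_keywords}
    hits = []
    for rid, fields in records.items():
        for attribute, field_value in fields.items():
            if name_tokens(attribute) & wanted and field_value == value:
                hits.append(rid)
                break
    return hits


records = {
    1: {"rdfs:label": "Renaud Delbru"},
    2: {"foaf:name": "Renaud Delbru"},
    3: {"sioc:name": "Renaud Delbru"},
    4: {"full_name": "Renaud Delbru"},
}

print(exact_search(records, "name", "Renaud Delbru"))
print(keyword_search(records, ["name", "label"], "Renaud Delbru"))
```

The exact lookup finds nothing because no record has an attribute literally called `name`, while the keyword variant reaches all four records regardless of vocabulary prefix.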
  • 25. Limitations of Apache Solr
      • Not efficient with highly heterogeneous structured data sources
          – Limitation on the number of attributes: dictionary size explosion; query clause explosion when searching across all attributes
      • Limited support for structured query
          – Multi-valued attributes
          – No full-text search on attribute names
          – No 1:N relationship materialisation
  • 26. Relationship materialization • It's JSON-like indexing and searching • Materialize the relationships between your entities and others.
  • 27. Some numbers: SIREn on Sindice
      Data collection
        • 500M web data documents (RDF, RDFa, Microformat, etc.)
        • 200K datasets
        • 50B triples
      Settings
        • Cluster of 4 nodes
        • 2 nodes for indexing
        • 2 nodes for querying
        • Replication
      Indexing performance
        • Full index construction takes approx. 24 hours
        • 436K triples / second
      Services
        • Keyword and structured queries
        • Dataset search
        • >> 99% uptime
  • 28. Large scale RDF “Summaries”
  • 29. Introducing large scale RDF “Summaries”. We do it for: • Data exploration – How to find datasets about movies? • Assisted SPARQL Query Editor – What is the data structure? • Dataset Quality – How to differentiate relevant from irrelevant datasets?
  • 30. Large Scale RDF summaries Class Level 12M relationships 10B relationships
  • 31. Sindice Analytics Widget Demo • http://test01.sindice.net:9001/sindice-stats-webapp/ • http://test01.sindice.net/szydan/dataset-view/dataset/default/www.bbc.co.uk
  • 32. Relational Faceted Browsing. At speed of light Patent Pending
  • 33. SPARQL is awesome. And now your guys can actually use it.
  • 34. Thank you Sindice.com team April 2012 With the contribution of

Editor's Notes

  1. Search record (instead of entity). Record-centric indexing model.
  2. Use case: let's index the entire web of data. Doc/s, Lucene in action, uptime, etc.
  3. How important is a dataset to my information need? How to help users browse and filter out irrelevant datasets? How can I measure the quality of a dataset? Data quality, objective measures. Two datasets can overlap and provide similar information, but one may provide fresher information and be updated more frequently. Concrete scenarios to test such assumptions. Data quality can also be useful for improving data acquisition, optimising resources to retrieve only top-quality data.
  4. Define “relationships” when introducing the graph, BEFORE talking about the numbers.
  5. Number of entities per class. Number of relations of a certain predicate. Other metadata can be added to a class, e.g., other predicates used with the entities of that class.
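
The per-class and per-predicate counts mentioned in note 5 can be sketched as a small aggregation over triples. The Python below is only an illustration of the idea under simple assumptions: triples are plain tuples, class membership comes from `rdf:type`, and the example data and the `summarize` function are made up. It is not the Sindice summary pipeline.

```python
from collections import Counter

RDF_TYPE = "rdf:type"


def summarize(triples):
    """Build a class-level summary from (subject, predicate, object) triples."""
    # Map each typed entity to its set of classes.
    types = {}
    for s, p, o in triples:
        if p == RDF_TYPE:
            types.setdefault(s, set()).add(o)

    # Number of entities per class.
    entities_per_class = Counter(c for classes in types.values() for c in classes)

    # Number of relations per (subject class, predicate, object class).
    class_relations = Counter()
    for s, p, o in triples:
        if p == RDF_TYPE or o not in types:
            continue  # skip type statements and literals / untyped objects
        for subject_class in types.get(s, ()):
            for object_class in types[o]:
                class_relations[(subject_class, p, object_class)] += 1
    return entities_per_class, class_relations


# Hypothetical example data.
triples = [
    ("ex:film1", "rdf:type", "ex:Movie"),
    ("ex:film2", "rdf:type", "ex:Movie"),
    ("ex:anna", "rdf:type", "ex:Person"),
    ("ex:film1", "ex:director", "ex:anna"),
    ("ex:film2", "ex:director", "ex:anna"),
    ("ex:film1", "ex:title", "Some Title"),
]

entities_per_class, class_relations = summarize(triples)
print(dict(entities_per_class))
print(dict(class_relations))
```

Collapsing entity-level edges into class-level edges is what keeps such a summary far smaller than the data it describes, while still answering questions like "what is the data structure?" for a query editor.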