SlideShare a Scribd company logo
Data Collection and Integration,
  Linked Data Management
    Tutorial at the RENDER Kick-off Meeting


                  Oct 2010
Presentation Outline

• Introduction
• Web Mining
• Data Integration Strategy
• Linked Data Management
   – Linked Data: Introduction and Challenges
   – FactForge: Fast Track to the Center of the Web of Data




                   Data Collection and Integration, LD Management   Oct 2010   #2
Content and Data Management Solutions




       Data Collection and Integration, LD Management   Oct 2010   #3
Content and Data Management Solutions




       Data Collection and Integration, LD Management   Oct 2010   #4
Web Mining

• Web Crawling
• Focused Crawling
• Screen-scrapping
• Data feeds, RSS feeds




             Data Collection and Integration, LD Management   Oct 2010   #5
Web Mining Projects

• Namerimi – a national search engine
• Innovantage – job offers in UK
• Ronsmap – car offers in USA
• ANAM – recepie knowledge base
• Semantic KB of the Government Web Archive
  (TNA)




             Data Collection and Integration, LD Management   Oct 2010   #6
TNA SemKB Conceptual Model for Data Integration

Ontology                                   Central Government
                      PROTON
                                                Ontology
                  Mapping
                                            Mapping


               FactForge          Semantic Knowledge                     Semi-
                                    Base Ontology                     structured
                 LOD                                                     Data
                Linked                                          ETL
Factual                      Mapping
                 Data                  TNA Struct.
knowledge                                Data
                                                                      Government
                            Mapping                   Mapping
                                                                         Web
                                       Web Archive                     DataBase
                                        Extracts

Metadata
                      Government Web
                      Archive Metadata

Unstructured
Content               Government Web
                          Archive
                                                                      Legend:
                                                                                Reference
Presentation Outline

• Introduction
• Web Mining
• Data Integration Strategy
• Linked Data Management
   – Linked Data: Introduction and Challenges
   – FactForge: Fast Track to the Center of the Web of Data




                   Data Collection and Integration, LD Management   Oct 2010   #8
Data Integration Strategy

• Determine data sources
• Determine extraction policies
• Conceptual Model for Data Integration
• Transformation routines
• Identity Resolution




             Data Collection and Integration, LD Management   Oct 2010   #9
Conceptual Model for Data Integration In RENDER

• Build a stack of reference ontologies
   – a relatively stable controlled vocabulary
• Use those for:
   – Integration of RENDER-specific data to LOD
   – Refer to it from various components and resources
• The Stack:
   –   RENDER schema – very tiny set of primitives
   –   PROTON – few hundred schema primitives
   –   UMBEL – several thousand classes and subjects
   –   FactForge – gateway to the central LOD datasets

                Data Collection and Integration, LD Management   Oct 2010   #10
Presentation Outline

• Introduction
• Web Mining
• Data Integration Strategy
• Linked Data Management
   – Linked Data: Introduction and Challenges
   – FactForge: Fast Track to the Center of the Web of Data




                   Data Collection and Integration, LD Management   Oct 2010   #11
Linking Open Data (LOD)

• Linking Open Data W3C SWEO Community project
  http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData




• Initiative for publishing “linked data” which already
  includes 50+ interlinked datasets and about 15B facts

                      Data Collection and Integration, LD Management    Oct 2010   #12
Why Do People Not Use Linked Data?

• Plenty of people in the IT world have heard about
  linked data and like the idea
• However, the impact of linked data in enterprises
  is still very limited
• Because:
   – There are no well established opinions about what linked data can
     “buy” for the enterprise or best practices for using it
       • What are the concrete benefits?
   – It is not clear what it would cost
       • What are the problems?
       • What are the associated risks?




                   Data Collection and Integration, LD Management   Oct 2010   #13
Linked Data in the Enterprise: Why?

• To facilitate data integration
   – One can use LOD as an “interlingua” for enterprise data integration
   – Additional public information can help alignment and linking

• To add value to proprietary data
   – Public data can allow more analytics on top of proprietary data
   – For instance, by linking to spatial data from Geonames
   – Better description and access to content, e.g. search for images

• Make enterprise data more open
   – To make enterprise data easier to use outside the enterprise
   – Public identifiers and vocabularies can be used to access them




                 Data Collection and Integration, LD Management   Oct 2010   #14
Linked Data in the Enterprise: Challenges

• LOD is hard to comprehend
   – Diversity comes at a price
   – One needs to make a query against 200 different schemata and
     hundreds of thousands of classes and properties

• LOD is unreliable
   – Most of the servers behind LOD today are slow
      • High down time
   – Dealing with data distributed on the web is slow
       • A federated SPARQL query that uses 2-3 servers within several joins
         can be *very* slow
   – No kind of consistency is guaranteed
       • Low commitment to the formal semantics and intended usage of the
         ontologies and schemata


                  Data Collection and Integration, LD Management   Oct 2010    #15
Reason-able Views to the Web of Data

• Reason-able views represent an approach for
  reasoning and management of linked data
• Key ideas:
   – Group selected datasets and ontologies in a compound dataset
       • Clean up, post-process and enrich the datasets if necessary
       • Do this conservatively, in a clearly documented and automated manner, so that:
            – the operation can easily be performed each time a new version of one of the datasets is
              published
            – Users can easily understand the intervention made to the original dataset
   – Load the compound dataset in a single semantic repository
   – Perform inference with respect to tractable OWL dialects
   – Define a set of sample queries against the compound dataset
       • These determine the “level of service” or the “scope of consistency” contract
         offered by the reason-able view


                       Data Collection and Integration, LD Management               Oct 2010            #16
Reason-able Views: Objectives

• Make reasoning and query evaluation feasible
• Guarantee a basic level of consistency
   – The sample queries guarantee the consistency of the data in the same
     way in which regression tests do for the quality of software

• Guarantee availability
   – In the same way in which web search engines are usually more reliable
     than most of the web sites; they also to caching

• Easier exploration and querying of unseen data
   – Lower the cost of entry through URI auto-complete and RDF search
   – Sample queries provide re-usable extraction patterns, which reduce
     the time for acquaintance with the datasets and their interconnections


                   Data Collection and Integration, LD Management   Oct 2010   #17
Two Reason-able Views to the Web of Linked Data

• FactForge (indicated in red on the next slide)
   – Some of the central LOD datasets
   – General-purpose information (not specific to a domain)
   – 1.2B explicit plus .9M inferred indexed, 10B retrievable statements
   – The largest upper-level knowledge base
   – http://www.factforge.net/

• Linked Life Data (indicated in yellow)
   – 25 of the most popular life-science datasets
   – Complemented by gluing ontologies
   – 2.7B explicit and 1.4B inferred, total of 4.1B indexed statements
   – The largest non-synthetic dataset that was used for reasoning
   – http://www.linkedlifedata.com

                 Data Collection and Integration, LD Management   Oct 2010   #18
Linking Open Data Datasets and Views (red and yellow)




              Data Collection and Integration, LD Management   Oct 2010   #19
Presentation Outline

• Introduction
• Web Mining
• Data Integration Strategy
• Linked Data Management
   – Linked Data: Introduction and Challenges
   – FactForge: Fast Track to the Center of the Web of Data




                   Data Collection and Integration, LD Management   Oct 2010   #20
FactForge: Fast Track to the Center of the Web of Data

 • Datasets: DBPedia, Freebase, Geonames, UMBEL,
   MusicBrainz, Wordnet, CIA World Factbook, Lingvoj
 • Ontologies: Dublin Core, SKOS, RSS, FOAF
 • Inference: materialization with respect to OWL 2 RL
    – Seems to completely cover the semantics of the data
    – owl:sameAs optimization in BigOWLIM allows reduction of the
      indices without loss of semantics, but big gains in performance
 • Free public service at http://www.factforge.com,
    –   Incremental URI auto-suggest
    –   Query and explore through Forest and Tabulator
    –   RDF Search: retrieve ranked list of URIs by keywords (see slide 26)
    –   SPARQL end-point

                    Data Collection and Integration, LD Management   Oct 2010   #21
FactForge Loading and Inference Statistics

                           Explicit      Inferred       Total # of    Entities
                          Indexed        Indexed         Stored       ('000 of Inferred
        Dataset
                           Triples        Triples        Triples     nodes in closure
                            ('000)        ('000)          ('000)    the graph)   ratio
Sechmata and ontologies            11            7               18            6     0.6
DBpedia (categories)            2,877       42,587           45,464       1,144     14.8
DBpedia (sameAs)                5,544          566            6,110       8,464      0.1
UMBEL                           5,162       42,212           47,374         500      8.2
Lingvoj                            20          863              883           18    43.8
CIA Factbook                       76            4               80           25     0.1
Wordnet                         2,281        9,296           11,577         830      4.1
Geonames                       91,908      125,025         216,933       33,382       1.4
DBpedia core                 560,096       198,043         758,139      127,931       0.4
Freebase                     463,689        40,840         504,529       94,810       0.1
MusicBrainz                    45,536      421,093         466,630       15,595       9.2
Total                     1,177,961        881,224       2,058,185      283,253      0.7


                      Data Collection and Integration, LD Management       Oct 2010     #22
Post-processing

• Several kinds of post-processing were performed
   − Goal: to allow easier navigation and browsing
   − Mechanisms: the results are available through system predicates
   − For instance: preferred labels, text snippets and RDF Ranks for all
     nodes
• Final Statistics
   − Number of entities (RDF graph nodes): 405M
   − Number of inserted statements (NIS): 1,2B
   − Number of stored statements (NSS): 2.2B
   − Number of retrievable statements (NRS): 9.8B
       − 7.6B statements “compressed” through BigOWLIM’s owl:sameAs
         optimisation



                    Data Collection and Integration, LD Management   Oct 2010   #23
FactForge Provides Unique Query Capabilities

• Unmatched factual knowledge querying capabilities
   – Scope: the largest and most diverse integrated dataset,
     including 8 of the central LOD datasets
   – Analytical power: the semantics of the data and links
     between them is fully accounted for during query evaluation
   – RDFRank indicates importance in the integrated dataset
• No other service supports even one of the following:
   – Query evaluation against DBPedia, Freebase or Geonames
     considering their semantics
   – Allows simultaneous querying of multiple datasets,
     considering the semantics of the links between them

                 Data Collection and Integration, LD Management   Oct 2010   #24
Guess who is the most popular German entertainer?

PREFIX rdf: …          (run the query at http://factforge.net/sparql)
SELECT * WHERE {
     ?Person dbp-ont:birthPlace ?BirthPlace ;
             rdf:type opencyc:Entertainer ;
             ff:hasRDFRank ?RR .
     ?BirthPlace geo-ont:parentFeature dbpedia:Germany .
} ORDER BY DESC(?RR) LIMIT 100


•   Without FF, answering such queries in real time is impossible:
•   Used data from: DBPedia, Geonames, UMBEL and MusicBrainz
•   Inference over types, sub-classes, and transitive relationships
•   The most popular entertainer born in Germany is:
                                                                             F. Nietzsche
     – Asking factual questions to a global KB can bring unexpected and strange results
     – We ask who is the most popular person, who qualifies as an entertainer
     – It uses a simple notion of popularity: RDFRank


                           Data Collection and Integration, LD Management        Oct 2010   #25
The Modigliani Test for the Semantic Web

• ReadWriteWeb’s founder Richard McManus:
  “…the tipping point for the Semantic Web may be
  when one can … deliver – using Linked Data – a
  comprehensive list of locations of original Modigliani
  art works …”
http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php




                     Data Collection and Integration, LD Management   Oct 2010   #26
The LDSR Query Passing the Modigliani Test

PREFIX fb: …        (check the query at http://factforge.net/sparql)

SELECT DISTINCT
  ?painting_l ?owner_l ?city_fb_con ?city_db_loc ?city_db_cit
WHERE {
  ?p fb:visual_art.artwork.artist dbpedia:Amedeo_Modigliani ;
     fb:visual_art.artwork.owners [
        fb:visual_art.artwork_owner_relationship.owner ?ow ] ;
     ff:preferredLabel ?painting_l.
   ?ow ff:preferredLabel ?owner_l .
   OPTIONAL { ?ow fb:location.location.containedby [
                   ff:preferredLabel ?city_fb_con ] } .
   OPTIONAL { ?ow dbp-prop:location ?loc.
               ?loc rdf:type umbel-sc:City ;
                    ff:preferredLabel ?city_db_loc }
   OPTIONAL { ?ow dbp-ont:city [ ff:preferredLabel ?city_db_cit ]
   }
}




                    Data Collection and Integration, LD Management   Oct 2010   #27
Thank you!




        Questions?


       Comments?




Data Collection and Integration, LD Management   Oct 2010   #28

More Related Content

What's hot

Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
EUCLID project
 
Industry Ontologies: Case Studies in Creating and Extending Schema.org
Industry Ontologies: Case Studies in Creating and Extending Schema.org Industry Ontologies: Case Studies in Creating and Extending Schema.org
Industry Ontologies: Case Studies in Creating and Extending Schema.org
sopekmir
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked Data
EUCLID project
 
Linked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and ExamplesLinked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and Examples
Open Data Support
 
IWMW 1998: Deploying new web technologies
IWMW 1998: Deploying new web technologiesIWMW 1998: Deploying new web technologies
IWMW 1998: Deploying new web technologies
IWMW
 
Big Linked Data - Creating Training Curricula
Big Linked Data - Creating Training CurriculaBig Linked Data - Creating Training Curricula
Big Linked Data - Creating Training Curricula
EUCLID project
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Ontotext
 
Structured Data for the Financial Industry
Structured Data for the Financial Industry Structured Data for the Financial Industry
Structured Data for the Financial Industry
sopekmir
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
Lars Marius Garshol
 
The Semantic Data Web, Sören Auer, University of Leipzig
The Semantic Data Web, Sören Auer, University of LeipzigThe Semantic Data Web, Sören Auer, University of Leipzig
The Semantic Data Web, Sören Auer, University of Leipzig
LOD2 Creating Knowledge out of Interlinked Data
 
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
eswcsummerschool
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
EUCLID project
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open Data
Ontotext
 
Providing Linked Data
Providing Linked DataProviding Linked Data
Providing Linked Data
EUCLID project
 
Linked open data project
Linked open data projectLinked open data project
Linked open data project
Faathima Fayaza
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
Ontotext
 

What's hot (16)

Scaling up Linked Data
Scaling up Linked DataScaling up Linked Data
Scaling up Linked Data
 
Industry Ontologies: Case Studies in Creating and Extending Schema.org
Industry Ontologies: Case Studies in Creating and Extending Schema.org Industry Ontologies: Case Studies in Creating and Extending Schema.org
Industry Ontologies: Case Studies in Creating and Extending Schema.org
 
Interaction with Linked Data
Interaction with Linked DataInteraction with Linked Data
Interaction with Linked Data
 
Linked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and ExamplesLinked Open Data Principles, Technologies and Examples
Linked Open Data Principles, Technologies and Examples
 
IWMW 1998: Deploying new web technologies
IWMW 1998: Deploying new web technologiesIWMW 1998: Deploying new web technologies
IWMW 1998: Deploying new web technologies
 
Big Linked Data - Creating Training Curricula
Big Linked Data - Creating Training CurriculaBig Linked Data - Creating Training Curricula
Big Linked Data - Creating Training Curricula
 
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural HeritageBuild Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
Build Narratives, Connect Artifacts: Linked Open Data for Cultural Heritage
 
Structured Data for the Financial Industry
Structured Data for the Financial Industry Structured Data for the Financial Industry
Structured Data for the Financial Industry
 
Linked Open Data
Linked Open DataLinked Open Data
Linked Open Data
 
The Semantic Data Web, Sören Auer, University of Leipzig
The Semantic Data Web, Sören Auer, University of LeipzigThe Semantic Data Web, Sören Auer, University of Leipzig
The Semantic Data Web, Sören Auer, University of Leipzig
 
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
ESWC SS 2012 - Wednesday Tutorial Barry Norton: Building (Production) Semanti...
 
Microtask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked DataMicrotask Crowdsourcing Applications for Linked Data
Microtask Crowdsourcing Applications for Linked Data
 
The Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open DataThe Power of Semantic Technologies to Explore Linked Open Data
The Power of Semantic Technologies to Explore Linked Open Data
 
Providing Linked Data
Providing Linked DataProviding Linked Data
Providing Linked Data
 
Linked open data project
Linked open data projectLinked open data project
Linked open data project
 
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the CloudFirst Steps in Semantic Data Modelling and Search & Analytics in the Cloud
First Steps in Semantic Data Modelling and Search & Analytics in the Cloud
 

Similar to Data Collection and Integration, Linked Data Management

Linked_Open_Data_Rome_Netcamp_13
Linked_Open_Data_Rome_Netcamp_13Linked_Open_Data_Rome_Netcamp_13
Linked_Open_Data_Rome_Netcamp_13
Michele Piunti
 
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo
 
Cloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application DevelopmentCloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application Development
Peter Haase
 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data Virtualization
Denodo
 
Linked Open data: CNR
Linked Open data: CNRLinked Open data: CNR
Linked Open data: CNR
DatiGovIT
 
SemTechBiz 2012 Panel on Linking Enterprise Data
SemTechBiz 2012 Panel on Linking Enterprise DataSemTechBiz 2012 Panel on Linking Enterprise Data
SemTechBiz 2012 Panel on Linking Enterprise Data
3 Round Stones
 
Dublinked tech workshop_15_dec2011
Dublinked tech workshop_15_dec2011Dublinked tech workshop_15_dec2011
Dublinked tech workshop_15_dec2011
Dublinked .
 
Linked Data for the Enterprise: Opportunities and Challenges
Linked Data for the Enterprise: Opportunities and ChallengesLinked Data for the Enterprise: Opportunities and Challenges
Linked Data for the Enterprise: Opportunities and Challenges
Marin Dimitrov
 
[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap...
[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap...[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap...
[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap...
Data Beers
 
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data Virtualization
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data VirtualizationMyth Busters III: I’m Building a Data Lake, So I Don’t Need Data Virtualization
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data Virtualization
Denodo
 
Modern Data Management for Federal Modernization
Modern Data Management for Federal ModernizationModern Data Management for Federal Modernization
Modern Data Management for Federal Modernization
Denodo
 
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...
Denodo
 
Linked Data Management
Linked Data ManagementLinked Data Management
Linked Data Management
Marin Dimitrov
 
A Survey of Exploratory Search Systems Based on LOD Resources
A Survey of Exploratory Search Systems Based on LOD ResourcesA Survey of Exploratory Search Systems Based on LOD Resources
A Survey of Exploratory Search Systems Based on LOD Resources
Karwan Jacksi
 
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
semanticsconference
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
Jesse Wang
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
Anja Jentzsch
 
Cni research data_oxford_horstmann_jefferies
Cni research data_oxford_horstmann_jefferiesCni research data_oxford_horstmann_jefferies
Cni research data_oxford_horstmann_jefferies
BDLSS
 
Configuring and Visualizing The Data Resources in a Cloud-based Data Collect...
Configuring and Visualizing The Data Resources  in a Cloud-based Data Collect...Configuring and Visualizing The Data Resources  in a Cloud-based Data Collect...
Configuring and Visualizing The Data Resources in a Cloud-based Data Collect...
FAST-Lab. Factory Automation Systems and Technologies Laboratory, Tampere University of Technology
 
Data Mesh using Microsoft Fabric
Data Mesh using Microsoft FabricData Mesh using Microsoft Fabric
Data Mesh using Microsoft Fabric
Nathan Bijnens
 

Similar to Data Collection and Integration, Linked Data Management (20)

Linked_Open_Data_Rome_Netcamp_13
Linked_Open_Data_Rome_Netcamp_13Linked_Open_Data_Rome_Netcamp_13
Linked_Open_Data_Rome_Netcamp_13
 
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
Denodo: Enabling a Data Mesh Architecture and Data Sharing Culture at Landsba...
 
Cloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application DevelopmentCloud-based Linked Data Management for Self-service Application Development
Cloud-based Linked Data Management for Self-service Application Development
 
Unlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data VirtualizationUnlock Your Data for ML & AI using Data Virtualization
Unlock Your Data for ML & AI using Data Virtualization
 
Linked Open data: CNR
Linked Open data: CNRLinked Open data: CNR
Linked Open data: CNR
 
SemTechBiz 2012 Panel on Linking Enterprise Data
SemTechBiz 2012 Panel on Linking Enterprise DataSemTechBiz 2012 Panel on Linking Enterprise Data
SemTechBiz 2012 Panel on Linking Enterprise Data
 
Dublinked tech workshop_15_dec2011
Dublinked tech workshop_15_dec2011Dublinked tech workshop_15_dec2011
Dublinked tech workshop_15_dec2011
 
Linked Data for the Enterprise: Opportunities and Challenges
Linked Data for the Enterprise: Opportunities and ChallengesLinked Data for the Enterprise: Opportunities and Challenges
Linked Data for the Enterprise: Opportunities and Challenges
 
[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap...
[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap...[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap...
[Databeers] 06/05/2014 - Boris Villazon: “Data Integration - A Linked Data ap...
 
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data Virtualization
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data VirtualizationMyth Busters III: I’m Building a Data Lake, So I Don’t Need Data Virtualization
Myth Busters III: I’m Building a Data Lake, So I Don’t Need Data Virtualization
 
Modern Data Management for Federal Modernization
Modern Data Management for Federal ModernizationModern Data Management for Federal Modernization
Modern Data Management for Federal Modernization
 
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...
Myth Busters: I’m Building a Data Lake, So I Don’t Need Data Virtualization (...
 
Linked Data Management
Linked Data ManagementLinked Data Management
Linked Data Management
 
A Survey of Exploratory Search Systems Based on LOD Resources
A Survey of Exploratory Search Systems Based on LOD ResourcesA Survey of Exploratory Search Systems Based on LOD Resources
A Survey of Exploratory Search Systems Based on LOD Resources
 
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
Ben Gardner | Delivering a Linked Data warehouse and integrating across the w...
 
The Web of data and web data commons
The Web of data and web data commonsThe Web of data and web data commons
The Web of data and web data commons
 
Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)Linked Data (1st Linked Data Meetup Malmö)
Linked Data (1st Linked Data Meetup Malmö)
 
Cni research data_oxford_horstmann_jefferies
Cni research data_oxford_horstmann_jefferiesCni research data_oxford_horstmann_jefferies
Cni research data_oxford_horstmann_jefferies
 
Configuring and Visualizing The Data Resources in a Cloud-based Data Collect...
Configuring and Visualizing The Data Resources  in a Cloud-based Data Collect...Configuring and Visualizing The Data Resources  in a Cloud-based Data Collect...
Configuring and Visualizing The Data Resources in a Cloud-based Data Collect...
 
Data Mesh using Microsoft Fabric
Data Mesh using Microsoft FabricData Mesh using Microsoft Fabric
Data Mesh using Microsoft Fabric
 

More from RENDER project

Text Stream Processing Tutorial @WIMS 2012
Text Stream Processing Tutorial @WIMS 2012Text Stream Processing Tutorial @WIMS 2012
Text Stream Processing Tutorial @WIMS 2012
RENDER project
 
Internals Of An Aggregated Web News Feed
Internals Of An Aggregated Web News FeedInternals Of An Aggregated Web News Feed
Internals Of An Aggregated Web News Feed
RENDER project
 
Unterstützungswerkzeuge für Wikipedia
Unterstützungswerkzeuge für WikipediaUnterstützungswerkzeuge für Wikipedia
Unterstützungswerkzeuge für WikipediaRENDER project
 
Render Review: Wikipedia Case Study, Year 1
Render Review: Wikipedia Case Study, Year 1Render Review: Wikipedia Case Study, Year 1
Render Review: Wikipedia Case Study, Year 1
RENDER project
 
Wiki case study - Review year 1
Wiki case study  - Review year 1Wiki case study  - Review year 1
Wiki case study - Review year 1
RENDER project
 
Towards a diversity-minded Wikipedia
Towards a diversity-minded WikipediaTowards a diversity-minded Wikipedia
Towards a diversity-minded Wikipedia
RENDER project
 
Diversiweb2011 02 Opening- Devika P. Madalli
Diversiweb2011 02 Opening- Devika P. MadalliDiversiweb2011 02 Opening- Devika P. Madalli
Diversiweb2011 02 Opening- Devika P. Madalli
RENDER project
 
Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochamp...
Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochamp...Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochamp...
Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochamp...
RENDER project
 
Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus
Diversiweb2011 07 Approximate subgraph matching - Mitja TrampusDiversiweb2011 07 Approximate subgraph matching - Mitja Trampus
Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus
RENDER project
 
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
RENDER project
 
Diversiweb2011 05 Scalable Detection of Sentiment-Based Contradictions - Mika...
Diversiweb2011 05 Scalable Detection of Sentiment-Based Contradictions - Mika...Diversiweb2011 05 Scalable Detection of Sentiment-Based Contradictions - Mika...
Diversiweb2011 05 Scalable Detection of Sentiment-Based Contradictions - Mika...
RENDER project
 
Diversiweb2011 04 Expressing Opinion Diversity - Delia Rusu
Diversiweb2011 04 Expressing Opinion Diversity - Delia RusuDiversiweb2011 04 Expressing Opinion Diversity - Delia Rusu
Diversiweb2011 04 Expressing Opinion Diversity - Delia Rusu
RENDER project
 
Diversiweb2011 03 Towards a Knowledge Diversity Model - Denny Vrandecic
Diversiweb2011 03 Towards a Knowledge Diversity Model - Denny VrandecicDiversiweb2011 03 Towards a Knowledge Diversity Model - Denny Vrandecic
Diversiweb2011 03 Towards a Knowledge Diversity Model - Denny Vrandecic
RENDER project
 
Diversiweb2011 01 Opening - Elena Simperl
Diversiweb2011 01 Opening - Elena SimperlDiversiweb2011 01 Opening - Elena Simperl
Diversiweb2011 01 Opening - Elena Simperl
RENDER project
 
Diversity toolkit
Diversity toolkitDiversity toolkit
Diversity toolkit
RENDER project
 
RENDER Telefonica
RENDER TelefonicaRENDER Telefonica
RENDER Telefonica
RENDER project
 
Defining Diversity
Defining DiversityDefining Diversity
Defining Diversity
RENDER project
 
Render Project introduction and overview
Render Project introduction and overviewRender Project introduction and overview
Render Project introduction and overview
RENDER project
 

More from RENDER project (18)

Text Stream Processing Tutorial @WIMS 2012
Text Stream Processing Tutorial @WIMS 2012Text Stream Processing Tutorial @WIMS 2012
Text Stream Processing Tutorial @WIMS 2012
 
Internals Of An Aggregated Web News Feed
Internals Of An Aggregated Web News FeedInternals Of An Aggregated Web News Feed
Internals Of An Aggregated Web News Feed
 
Unterstützungswerkzeuge für Wikipedia
Unterstützungswerkzeuge für WikipediaUnterstützungswerkzeuge für Wikipedia
Unterstützungswerkzeuge für Wikipedia
 
Render Review: Wikipedia Case Study, Year 1
Render Review: Wikipedia Case Study, Year 1Render Review: Wikipedia Case Study, Year 1
Render Review: Wikipedia Case Study, Year 1
 
Wiki case study - Review year 1
Wiki case study  - Review year 1Wiki case study  - Review year 1
Wiki case study - Review year 1
 
Towards a diversity-minded Wikipedia
Towards a diversity-minded WikipediaTowards a diversity-minded Wikipedia
Towards a diversity-minded Wikipedia
 
Diversiweb2011 02 Opening- Devika P. Madalli
Diversiweb2011 02 Opening- Devika P. MadalliDiversiweb2011 02 Opening- Devika P. Madalli
Diversiweb2011 02 Opening- Devika P. Madalli
 
Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochamp...
Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochamp...Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochamp...
Diversiweb2011 08 Mining Diverse Views from Related Articles - Ravali Pochamp...
 
Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus
Diversiweb2011 07 Approximate subgraph matching - Mitja TrampusDiversiweb2011 07 Approximate subgraph matching - Mitja Trampus
Diversiweb2011 07 Approximate subgraph matching - Mitja Trampus
 
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
Diversiweb2011 06 Faceted Approach To Diverse Query Processing - Devika P. Ma...
 
Diversiweb2011 05 Scalable Detection of Sentiment-Based Contradictions - Mika...
Diversiweb2011 05 Scalable Detection of Sentiment-Based Contradictions - Mika...Diversiweb2011 05 Scalable Detection of Sentiment-Based Contradictions - Mika...
Diversiweb2011 05 Scalable Detection of Sentiment-Based Contradictions - Mika...
 
Diversiweb2011 04 Expressing Opinion Diversity - Delia Rusu
Diversiweb2011 04 Expressing Opinion Diversity - Delia RusuDiversiweb2011 04 Expressing Opinion Diversity - Delia Rusu
Diversiweb2011 04 Expressing Opinion Diversity - Delia Rusu
 
Diversiweb2011 03 Towards a Knowledge Diversity Model - Denny Vrandecic
Diversiweb2011 03 Towards a Knowledge Diversity Model - Denny VrandecicDiversiweb2011 03 Towards a Knowledge Diversity Model - Denny Vrandecic
Diversiweb2011 03 Towards a Knowledge Diversity Model - Denny Vrandecic
 
Diversiweb2011 01 Opening - Elena Simperl
Diversiweb2011 01 Opening - Elena SimperlDiversiweb2011 01 Opening - Elena Simperl
Diversiweb2011 01 Opening - Elena Simperl
 
Diversity toolkit
Diversity toolkitDiversity toolkit
Diversity toolkit
 
RENDER Telefonica
RENDER TelefonicaRENDER Telefonica
RENDER Telefonica
 
Defining Diversity
Defining DiversityDefining Diversity
Defining Diversity
 
Render Project introduction and overview
Render Project introduction and overviewRender Project introduction and overview
Render Project introduction and overview
 

Data Collection and Integration, Linked Data Management

  • 1. Data Collection and Integration, Linked Data Management Tutorial at the RENDER Kick-off Meeting Oct 2010
  • 2. Presentation Outline • Introduction • Web Mining • Data Integration Strategy • Linked Data Management – Linked Data: Introduction and Challenges – FactForge: Fast Track to the Center of the Web of Data Data Collection and Integration, LD Management Oct 2010 #2
  • 3. Content and Data Management Solutions Data Collection and Integration, LD Management Oct 2010 #3
  • 4. Content and Data Management Solutions Data Collection and Integration, LD Management Oct 2010 #4
  • 5. Web Mining • Web Crawling • Focused Crawling • Screen-scrapping • Data feeds, RSS feeds Data Collection and Integration, LD Management Oct 2010 #5
  • 6. Web Mining Projects • Namerimi – a national search engine • Innovantage – job offers in UK • Ronsmap – car offers in USA • ANAM – recepie knowledge base • Semantic KB of the Government Web Archive (TNA) Data Collection and Integration, LD Management Oct 2010 #6
  • 7. TNA SemKB Conceptual Model for Data Integration Ontology Central Government PROTON Ontology Mapping Mapping FactForge Semantic Knowledge Semi- Base Ontology structured LOD Data Linked ETL Factual Mapping Data TNA Struct. knowledge Data Government Mapping Mapping Web Web Archive DataBase Extracts Metadata Government Web Archive Metadata Unstructured Content Government Web Archive Legend: Reference
  • 8. Presentation Outline • Introduction • Web Mining • Data Integration Strategy • Linked Data Management – Linked Data: Introduction and Challenges – FactForge: Fast Track to the Center of the Web of Data Data Collection and Integration, LD Management Oct 2010 #8
  • 9. Data Integration Strategy • Determine data sources • Determine extraction policies • Conceptual Model for Data Integration • Transformation routines • Identity Resolution Data Collection and Integration, LD Management Oct 2010 #9
  • 10. Conceptual Model for Data Integration In RENDER • Build a stack of reference ontologies – a relatively stable controlled vocabulary • Use those for: – Integration of RENDER-specific data to LOD – Refer to it from various components and resources • The Stack: – RENDER schema – very tiny set of primitives – PROTON – few hundred schema primitives – UMBEL – several thousand classes and subjects – FactForge – gateway to the central LOD datasets Data Collection and Integration, LD Management Oct 2010 #10
  • 11. Presentation Outline • Introduction • Web Mining • Data Integration Strategy • Linked Data Management – Linked Data: Introduction and Challenges – FactForge: Fast Track to the Center of the Web of Data Data Collection and Integration, LD Management Oct 2010 #11
  • 12. Linking Open Data (LOD) • Linking Open Data W3C SWEO Community project http://esw.w3.org/topic/SweoIG/TaskForces/CommunityProjects/LinkingOpenData • Initiative for publishing “linked data” which already includes 50+ interlinked datasets and about 15B facts Data Collection and Integration, LD Management Oct 2010 #12
  • 13. Why Do People Not Use Linked Data? • Plenty of people in the IT world have heard about linked data and like the idea • However, the impact of linked data in enterprises is still very limited • Because: – There are no well established opinions about what linked data can “buy” for the enterprise or best practices for using it • What are the concrete benefits? – It is not clear what it would cost • What are the problems? • What are the associated risks? Data Collection and Integration, LD Management Oct 2010 #13
  • 14. Linked Data in the Enterprise: Why? • To facilitate data integration – One can use LOD as an “interlingua” for enterprise data integration – Additional public information can help alignment and linking • To add value to proprietary data – Public data can allow more analytics on top of proprietary data – For instance, by linking to spatial data from Geonames – Better description and access to content, e.g. search for images • Make enterprise data more open – To make enterprise data easier to use outside the enterprise – Public identifiers and vocabularies can be used to access them Data Collection and Integration, LD Management Oct 2010 #14
  • 15. Linked Data in the Enterprise: Challenges • LOD is hard to comprehend – Diversity comes at a price – One needs to make a query against 200 different schemata and hundreds of thousands of classes and properties • LOD is unreliable – Most of the servers behind LOD today are slow • High down time – Dealing with data distributed on the web is slow • A federated SPARQL query that uses 2-3 servers within several joins can be *very* slow – No kind of consistency is guaranteed • Low commitment to the formal semantics and intended usage of the ontologies and schemata Data Collection and Integration, LD Management Oct 2010 #15
  • 16. Reason-able Views to the Web of Data • Reason-able views represent an approach for reasoning and management of linked data • Key ideas: – Group selected datasets and ontologies in a compound dataset • Clean up, post-process and enrich the datasets if necessary • Do this conservatively, in a clearly documented and automated manner, so that: – the operation can easily be performed each time a new version of one of the datasets is published – Users can easily understand the intervention made to the original dataset – Load the compound dataset in a single semantic repository – Perform inference with respect to tractable OWL dialects – Define a set of sample queries against the compound dataset • These determine the “level of service” or the “scope of consistency” contract offered by the reason-able view Data Collection and Integration, LD Management Oct 2010 #16
  • 17. Reason-able Views: Objectives • Make reasoning and query evaluation feasible • Guarantee a basic level of consistency – The sample queries guarantee the consistency of the data in the same way in which regression tests do for the quality of software • Guarantee availability – In the same way in which web search engines are usually more reliable than most of the web sites; they also to caching • Easier exploration and querying of unseen data – Lower the cost of entry through URI auto-complete and RDF search – Sample queries provide re-usable extraction patterns, which reduce the time for acquaintance with the datasets and their interconnections Data Collection and Integration, LD Management Oct 2010 #17
  • 18. Two Reason-able Views to the Web of Linked Data • FactForge (indicated in red on the next slide) – Some of the central LOD datasets – General-purpose information (not specific to a domain) – 1.2B explicit plus .9M inferred indexed, 10B retrievable statements – The largest upper-level knowledge base – http://www.factforge.net/ • Linked Life Data (indicated in yellow) – 25 of the most popular life-science datasets – Complemented by gluing ontologies – 2.7B explicit and 1.4B inferred, total of 4.1B indexed statements – The largest non-synthetic dataset that was used for reasoning – http://www.linkedlifedata.com Data Collection and Integration, LD Management Oct 2010 #18
  • 19. Linking Open Data Datasets and Views (red and yellow) Data Collection and Integration, LD Management Oct 2010 #19
  • 20. Presentation Outline • Introduction • Web Mining • Data Integration Strategy • Linked Data Management – Linked Data: Introduction and Challenges – FactForge: Fast Track to the Center of the Web of Data Data Collection and Integration, LD Management Oct 2010 #20
  • 21. FactForge: Fast Track to the Center of the Web of Data • Datasets: DBPedia, Freebase, Geonames, UMBEL, MusicBrainz, Wordnet, CIA World Factbook, Lingvoj • Ontologies: Dublin Core, SKOS, RSS, FOAF • Inference: materialization with respect to OWL 2 RL – Seems to completely cover the semantics of the data – owl:sameAs optimization in BigOWLIM allows reduction of the indices without loss of semantics, but big gains in performance • Free public service at http://www.factforge.com, – Incremental URI auto-suggest – Query and explore through Forest and Tabulator – RDF Search: retrieve ranked list of URIs by keywords (see slide 26) – SPARQL end-point Data Collection and Integration, LD Management Oct 2010 #21
  • 22. FactForge Loading and Inference Statistics Explicit Inferred Total # of Entities Indexed Indexed Stored ('000 of Inferred Dataset Triples Triples Triples nodes in closure ('000) ('000) ('000) the graph) ratio Sechmata and ontologies 11 7 18 6 0.6 DBpedia (categories) 2,877 42,587 45,464 1,144 14.8 DBpedia (sameAs) 5,544 566 6,110 8,464 0.1 UMBEL 5,162 42,212 47,374 500 8.2 Lingvoj 20 863 883 18 43.8 CIA Factbook 76 4 80 25 0.1 Wordnet 2,281 9,296 11,577 830 4.1 Geonames 91,908 125,025 216,933 33,382 1.4 DBpedia core 560,096 198,043 758,139 127,931 0.4 Freebase 463,689 40,840 504,529 94,810 0.1 MusicBrainz 45,536 421,093 466,630 15,595 9.2 Total 1,177,961 881,224 2,058,185 283,253 0.7 Data Collection and Integration, LD Management Oct 2010 #22
  • 23. Post-processing • Several kinds of post-processing were performed − Goal: to allow easier navigation and browsing − Mechanisms: the results are available through system predicates − For instance: preferred labels, text snippets and RDF Ranks for all nodes • Final Statistics − Number of entities (RDF graph nodes): 405M − Number of inserted statements (NIS): 1,2B − Number of stored statements (NSS): 2.2B − Number of retrievable statements (NRS): 9.8B − 7.6B statements “compressed” through BigOWLIM’s owl:sameAs optimisation Data Collection and Integration, LD Management Oct 2010 #23
  • 24. FactForge Provides Unique Query Capabilities • Unmatched factual knowledge querying capabilities – Scope: the largest and most diverse integrated dataset, including 8 of the central LOD datasets – Analytical power: the semantics of the data and links between them is fully accounted for during query evaluation – RDFRank indicates importance in the integrated dataset • No other service supports even one of the following: – Query evaluation against DBPedia, Freebase or Geonames considering their semantics – Allows simultaneous querying of multiple datasets, considering the semantics of the links between them Data Collection and Integration, LD Management Oct 2010 #24
  • 25. Guess who is the most popular German entertainer? PREFIX rdf: … (run the query at http://factforge.net/sparql) SELECT * WHERE { ?Person dbp-ont:birthPlace ?BirthPlace ; rdf:type opencyc:Entertainer ; ff:hasRDFRank ?RR . ?BirthPlace geo-ont:parentFeature dbpedia:Germany . } ORDER BY DESC(?RR) LIMIT 100 • Without FF, answering such queries in real time is impossible: • Used data from: DBPedia, Geonames, UMBEL and MusicBrainz • Inference over types, sub-classes, and transitive relationships • The most popular entertainer born in Germany is: F. Nietzsche – Asking factual questions to a global KB can bring unexpected and strange results – We ask who is the most popular person, who qualifies as an entertainer – It uses a simple notion of popularity: RDFRank Data Collection and Integration, LD Management Oct 2010 #25
  • 26. The Modigliani Test for the Semantic Web • ReadWriteWeb’s founder Richard McManus: “…the tipping point for the Semantic Web may be when one can … deliver – using Linked Data – a comprehensive list of locations of original Modigliani art works …” http://www.readwriteweb.com/archives/the_modigliani_test_for_linked_data.php Data Collection and Integration, LD Management Oct 2010 #26
  • 27. The LDSR Query Passing the Modigliani Test PREFIX fb: … (check the query at http://factforge.net/sparql) SELECT DISTINCT ?painting_l ?owner_l ?city_fb_con ?city_db_loc ?city_db_cit WHERE { ?p fb:visual_art.artwork.artist dbpedia:Amedeo_Modigliani ; fb:visual_art.artwork.owners [ fb:visual_art.artwork_owner_relationship.owner ?ow ] ; ff:preferredLabel ?painting_l. ?ow ff:preferredLabel ?owner_l . OPTIONAL { ?ow fb:location.location.containedby [ ff:preferredLabel ?city_fb_con ] } . OPTIONAL { ?ow dbp-prop:location ?loc. ?loc rdf:type umbel-sc:City ; ff:preferredLabel ?city_db_loc } OPTIONAL { ?ow dbp-ont:city [ ff:preferredLabel ?city_db_cit ] } } Data Collection and Integration, LD Management Oct 2010 #27
  • 28. Thank you! Questions? Comments? Data Collection and Integration, LD Management Oct 2010 #28