WeST – Web Science & Technologies
                            University of Koblenz Landau, Germany




   Information Extraction
             for
  Building Knowledge Bases

                    Steffen Staab
    Saqib Mir – European Bioinformatics Institute
Ermelinda d‘Oro, Massimo Ruffolo – Univ. Calabria, Italy
A FEW SLIDES WHERE WEST
     COMES FROM

WeST – Web Science &   Steffen Staab          Slide 2
Technologies           staab@uni-koblenz.de
Institut WeST – Web Science & Technologies




Semantic Web, Web Retrieval, Social Web, Multimedia Web, Software Web, GESIS




We (co-)organize conferences and schools




We build applications and develop methods…


• BTC 1st Prize 2008
• BTC 1st Prize 2011
• German KM 1st Prize 2011
• 1st Prize, German Linked Open Gov Data Competition 2012




We teach Web Science

Master in Web Science@Koblenz
• Free tuition
• Start Fall 2012
• English

Master in eGov@Koblenz
• Free tuition
• Start Fall 2012
• English

2012 Web Science Summer School
Lorentz Center, Leiden, The Netherlands, 9-13 July 2012
We are active in joint projects

• EU Integrated Project ROBUST (10 Partners):
  Risk and Opportunity management of huge-scale
  BUSiness communiTy cooperation
• EU Live+Gov - Reality Sensing, Mining and Augmentation
  for Mobile Citizen–Government Dialogue
• EU WeGov – where eGovernment meets the eSociety
• EU IP SocialSensor - Sensing User Generated Input for
  Improved Media Discovery and Experience
• EU Net2 – a network for networked knowledge
• EU MOST – Marrying Ontologies and Software
  Technologies



Steffen Staab,
     Saqib Mir, European Bioinformatics Institute
     Ermelinda d‘Oro, Massimo Ruffolo, Univ Calabria, Italy

     INFORMATION EXTRACTION
     FOR
     BUILDING KNOWLEDGE BASES
GENERAL MOTIVATION


General objective: Extracting to LOD

                       useAsExample                         hasLivedIn




General objective: Analysing LOD




                       useAsExample                       hasLivedIn



http://lisa.west.uni-koblenz.de/lisa-demo/
Family's analysis of Munich LOD + Open Street Map data




http://lisa.west.uni-koblenz.de/lisa-demo/
Entrepreneur's analysis of Munich LOD + Open Street Map data




OBSERVATIONS ON
     INFORMATION EXTRACTION

Challenges & Opportunities for IE

Not all web pages are created equal




Challenges & Opportunities for IE

Some challenges are the same, e.g. finding type instances




Challenges & Opportunities for IE

Some challenges are the same, e.g. finding relation instances




Challenges & Opportunities for IE

Some contain concepts and their descriptions, some don't
    No types here, few relation types




Challenges & Opportunities for IE

Knowing that they are instances and of which type
    Textual indication        Positional indication




Challenges & Opportunities for IE

To some extent
positional and layout
indications work across
languages and sites




Challenges & Opportunities for IE




             owl:sameAs
                                       We should not only think about
                                       Web pages, but about Web sites




Challenges & Opportunities for IE
                                         We should not only think about
                                         Web pages, but about Web sites




                       owl:sameAs




Comparing related work to our objectives

Related work objectives              Our objectives
• IE on Web pages                    • IE on Web sites
• Acquiring instances and            • Acquiring items
  relationship instances             • Classifying items in
                                         • Instances
                                         • Concepts
                                         • Relation instances
                                         • Relationships
• IE based on linear text            • IE also based on spatial position

There is overlap and there are few exceptions in related work
Outline

The Social Media Case                         The Bio-Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
    • Spatial Data Model
    • Syntax & Semantics
    • Complexity
• Implementation
• Evaluation


Presentation-oriented documents


Acquiring a music band profile:
a music band photo that has its
descriptive information to the east of it




Music band profile


     band photo

    band name




Presentation-oriented documents




Presentation-oriented documents

•    HTML DOM structure is site specific
•    Spatial arrangements are rarely explicit
•    Spatial layout is hidden in complex nesting of layout elements
•    Intricate DOM tree structures are conceptually difficult to
     query for the user (or a tool!)




Related Work

Web Query languages
• XPath 1.0 and XQuery 1.0
    • Established
    • Too difficult to use for scraping from intricate DOM structures

Visual languages
• Spatial Graph Grammars [Kong et al.] are quite complex in
  terms of both usability and efficiency
• Algebras for creating and querying multimedia interactive
  presentations (e.g. ppt) [Subrahmanian et al.]

Web wrapper induction exploiting visual interface
[Gottlob et al.] [Sahuguet et al.]
    • generate XPath location paths of DOM nodes
    • can benefit from using Spatial XPath
Idea: Use Spatial Relations among DOM Nodes







Idea: Use Spatial Relations among DOM Nodes




Idea: Use Spatial Relations among DOM Nodes




Spatial DOM (SDOM)




Spatial Relations Among Nodes

• Rectangular Cardinal Relations (RCR)

    Example: r1 E:NE r2

    Spatial models allow for expressing
    disjunctive relations among regions
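
As a rough illustration of a rectangular cardinal relation such as `r1 E:NE r2`, the following Python sketch lists the tiles of the reference rectangle's 3x3 grid that another rectangle overlaps. The `Rect` type, the function name `cardinal_tiles`, the tile label `B` for the reference rectangle's own tile, and the screen-coordinate convention (y grows downward, so north means smaller y) are assumptions made for this example, not definitions taken from the slides.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Axis-aligned bounding box in screen coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float


def _bands(lo: float, hi: float, ref_lo: float, ref_hi: float) -> list[int]:
    """Bands of the reference interval that [lo, hi] overlaps:
    -1 = before it, 0 = within it, 1 = after it."""
    bands = []
    if lo < ref_lo:
        bands.append(-1)
    if max(lo, ref_lo) < min(hi, ref_hi):
        bands.append(0)
    if hi > ref_hi:
        bands.append(1)
    return bands


# (column band, row band) -> tile name; row -1 is north because y grows downward.
TILES = {
    (-1, -1): "NW", (0, -1): "N", (1, -1): "NE",
    (-1, 0): "W",   (0, 0): "B",  (1, 0): "E",
    (-1, 1): "SW",  (0, 1): "S",  (1, 1): "SE",
}


def cardinal_tiles(r1: Rect, r2: Rect) -> set[str]:
    """Tiles of r2's grid that r1 overlaps, i.e. the relation 'r1 R r2'."""
    cols = _bands(r1.left, r1.right, r2.left, r2.right)
    rows = _bands(r1.top, r1.bottom, r2.top, r2.bottom)
    return {TILES[(c, r)] for c in cols for r in rows}


r2 = Rect(left=0, top=0, right=10, bottom=10)    # reference rectangle
r1 = Rect(left=12, top=-5, right=20, bottom=5)   # right of r2, partly above it
print(":".join(sorted(cardinal_tiles(r1, r2))))  # E:NE
```

Returning a set keeps the sketch neutral about the order in which tiles are written; the alphabetical sort in the last line is only a display choice.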
• Topological Relations




XPath Example




SXPath Example




From XPath 1.0 towards Spatial Querying with SXPath

SXPath features
• adopts intuitive path notation:
    • axis::nodetest [pred]*
• adds to XPath
    • spatial axes
    • spatial position functions
• natural semantics for spatial querying
• maintains polynomial time combined complexity




Why SXPath?




• an XPath for information extraction
• resilient wrappers
• human-oriented web applications
• familiarity
• simplicity
• efficiency
Spatial DOM (SDOM)




Spatial Navigation Axes




Spatial Navigation Axes




Syntax of SXPath




Complexity Results




SXPath System Architecture




SXPath System




Results of Experiments




Formative User Study




Summative User Study




Summative User Study




Summative User Study




Existing Extensions to PDF




[Figure labels: Page Header; Text Area and Paragraphs; Table; Item List; Page Number; Page Footer]
Outline

The Social Media Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
    • Spatial Data Model
    • Syntax & Semantics
    • Complexity
• Implementation
• Evaluation

The Bio-Case
• Motivation
• The (Biochemical) Deep Web
• Contributions
    • Page-level wrapper induction
    • Site-wide wrapper generation
    • Error Correction by Mutual Reinforcement
• Conclusions and Future Directions
>1000 Life Science DBs, number growing quickly




Biochemical Web Sites: Observations - 1


• Labeled Data

Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)

Total    Labeled    Unlabeled    Unlabeled (Redundant)
754      719        19           16

Table 1: Data fields across 20 Biochemical Web sites


Biochemical Web Sites: Observations - 2

• Dynamic Web Pages




Biochemical Web Sites: Observations - 3

• Rich Site Structure




Biochemical Web Sites: Observations - 4

• Web Services
    • Survey: 11 of 100 databases [1] provide APIs
    • Incomplete coverage
    • Varying granularity
    • No semantics in the service description

[1] Databases indexed by the Nucleic Acids Research Journal
    (http://www3.oup.co.uk/nar/database/). Complete survey available at
    http://sabiork.villa-bosch.de/index.html/survey.html




Biochemical Web Sites: Implications




              Induce Wrapper



                                                         Induce Wrapper




                          Induce Wrapper




Contributions


• Unsupervised Page-Level Wrapper Induction

• Unsupervised Site-Wide Wrapper Induction
  (Site Structure Discovery)

• Automatic Error Detection and Correction by
  Mutual Reinforcement




Page-Level Wrapper Induction – 1
         D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47,…}
         O1 = {Entry, Name,…, Reaction, R00026, Enzyme,…, 3.2.1.21}




                                                                      //*[text()]




        D2 = {C00185, Cellobiose,…, R00306, 1.1.99.18,… }
        O2 = {Entry, Name,…, Reaction, R00026, Enzyme,…, 3.2.1.21}
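
One plausible reading of the sets above is that text appearing on both sample pages (O) is a label candidate, while text unique to a page (D) is a data candidate. The sketch below illustrates that reading with plain Python sets, using only the values visible on the slide; the elided entries ("…") are left out, and the function name `split_overlap` is hypothetical. Note that `3.2.1.21` still lands in O at this stage because it happens to occur on both pages, which is what the later reclassification step addresses.

```python
def split_overlap(page_a: set[str], page_b: set[str]):
    """Split two pages' text entries into page-specific (D) and shared (O) sets."""
    overlap = page_a & page_b
    return page_a - overlap, page_b - overlap, overlap


# Text entries as they might be returned by //*[text()] on two sample pages.
shared = {"Entry", "Name", "Reaction", "R00026", "Enzyme", "3.2.1.21"}
page1 = shared | {"C00221", "beta-D-Glucose", "R01520", "1.1.1.47"}
page2 = shared | {"C00185", "Cellobiose", "R00306", "1.1.99.18"}

d1, d2, o = split_overlap(page1, page2)
print(sorted(d1))  # ['1.1.1.47', 'C00221', 'R01520', 'beta-D-Glucose']
print(sorted(o))   # label candidates, still including '3.2.1.21'
```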
Page-Level Wrapper Induction - 2

• Reclassify – Growing Data Regions




Page-Level Wrapper Induction - 3
                D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21 …}
                O1 = {Entry, Name,…, Reaction, R00026, Enzyme,…,}




                D2 = {C00185, Cellobiose,…, R00306, 1.1.99.18, 3.2.1.21 … }
                O2 = {Entry, Name,…, Reaction, R00026, Enzyme,…,}

Page-Level Wrapper Induction - 4


• Selecting Labels for Data
  html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1]
    (“1.1.1.47” )

  html/…./table[1]/tr[6]/th[1]/…/code[1]/
    (“Reaction”)
  html/…./table[1]/tr[8]/th[1]/…/code[1]/
    (“Enzyme”)
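
A simple way to picture label selection is to choose the label whose DOM path shares the longest prefix with the path of the data entry. This is an assumed heuristic for illustration; the slides do not spell out the selection rule. The paths below drop the elided "…" segments, and `select_label` and `common_prefix_len` are hypothetical helpers.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading path steps shared by two paths."""
    n = 0
    for step_a, step_b in zip(a, b):
        if step_a != step_b:
            break
        n += 1
    return n


def select_label(data_path: str, label_paths: dict[str, str]) -> str:
    """Return the label whose path shares the longest prefix with data_path."""
    data_steps = data_path.split("/")
    return max(
        label_paths,
        key=lambda label: common_prefix_len(data_steps, label_paths[label].split("/")),
    )


labels = {
    "Reaction": "html/table[1]/tr[6]/th[1]/code[1]",
    "Enzyme": "html/table[1]/tr[8]/th[1]/code[1]",
}
print(select_label("html/table[1]/tr[8]/td[1]/code[1]/a[1]", labels))  # Enzyme
```

Here "Enzyme" wins because its path agrees with the data path up to `tr[8]`, whereas "Reaction" diverges one step earlier at `tr[6]`.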




Page-Level Wrapper Induction - 5



• Anchor the Path
    Enzyme - html/table[1]/tr[8]/th[1]/code[1]/
    html/table[1]/tr[8]/td[1]/code[1]/a[1]
    html/table[1]/tr[8]/td[1]/code[1]/a[2]

    //*[text()='Enzyme']/../…./../td[1]/code[1]/a[position()>=2]/text()


             Pivot       Relative                        Generalize
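
The pivot and relative parts can be sketched as follows: the label text becomes the pivot, `..` steps climb from the label node to the deepest ancestor it shares with the data node, and the remaining data steps descend from there. `anchor_path` is a hypothetical helper, and the generalization of the final position predicate is deliberately left out of this sketch.

```python
def anchor_path(label_text: str, label_path: str, data_path: str) -> str:
    """Build an XPath for the data node relative to its label node."""
    label_steps = label_path.strip("/").split("/")
    data_steps = data_path.strip("/").split("/")
    # Length of the shared prefix (the common ancestors of both nodes).
    shared = 0
    while (shared < min(len(label_steps), len(data_steps))
           and label_steps[shared] == data_steps[shared]):
        shared += 1
    up = [".."] * (len(label_steps) - shared)   # climb from label to common ancestor
    down = data_steps[shared:]                  # descend to the data node
    return "/".join([f"//*[text()='{label_text}']", *up, *down, "text()"])


print(anchor_path("Enzyme",
                  "html/table[1]/tr[8]/th[1]/code[1]/",
                  "html/table[1]/tr[8]/td[1]/code[1]/a[1]"))
# //*[text()='Enzyme']/../../td[1]/code[1]/a[1]/text()
```

Anchoring on the label text rather than on the absolute path from `html` means the expression no longer depends on the row index `tr[8]`, which may differ between pages.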




Selected Sources

• KEGG, ChEBI, MSDChem
    • Basic qualitative data
    • Popular
    • Overlapping/complementary data




Wrapper Induction - Evaluation

SOURCE                                      #L   #D    #S   TP    FN    FP   P      R
KEGG Compound                               10   762   3    411   351   46   89.9   53.9
http://www.genome.jp/kegg/compound/                    15   759   3     0    100    99.6
KEGG Reaction                               10   205   3    173   32    0    100    84.4
http://www.genome.jp/kegg/reaction/                    15   205   0     0    100    100
ChEBI                                       22   831   3    595   236   41   93.5   71.6
http://www.ebi.ac.uk/chebi                             15   829   2     0    100    99.7
MSDChem                                     30   600   3    600   0     20   96.7   100
http://www.ebi.ac.uk/msd-srv/msdchem/                  15   600   0     20   96.7   100
Average (based on final wrappers for each source)                            99.1   99.8

Table 2: Page-level wrapper induction results, 20 test pages
(L=Labels, D=Data entries, S=Training pages)
~9 samples – ~99% P, ~98% R
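
Assuming P and R denote precision and recall in percent, computed from true positives (TP), false positives (FP), and false negatives (FN) in the usual way, the KEGG Compound rows of Table 2 can be recomputed as a sanity check. The function name is chosen for this example.

```python
def precision_recall(tp: int, fn: int, fp: int) -> tuple[float, float]:
    """Precision and recall in percent, rounded to one decimal place."""
    precision = 100 * tp / (tp + fp)
    recall = 100 * tp / (tp + fn)
    return round(precision, 1), round(recall, 1)


# KEGG Compound, 3 training pages: TP=411, FN=351, FP=46
print(precision_recall(411, 351, 46))   # (89.9, 53.9)
# KEGG Compound, 15 training pages: TP=759, FN=3, FP=0
print(precision_recall(759, 3, 0))      # (100.0, 99.6)
```

In both rows TP + FN equals the 762 data entries (#D), so recall is simply the share of data entries that the wrapper extracted.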

Site-Wide Wrapper Induction: Observations

• Not all pages contain data (e.g. legal disclaimers,
  contact pages, navigational menus)
    • An efficient approach should ignore these pages
    • We don't need to learn the entire site structure




Site-Wide Wrapper Induction: Observations - 2


• Classified Link-Collections point to data-intensive
  pages of the same class.




Site-Wide Wrapper Induction: Observations - 3

• Pages belonging to the same class describe the same
  concepts
    • Some concepts are sometimes omitted
    • Ordering is always the same
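
Under these observations, a page fits a class if its label sequence is an ordered subsequence of the class's full label sequence: labels may be missing, but never reordered. The sketch below shows such a check; `fits_class` and the example label template are illustrative assumptions, not the authors' implementation.

```python
def fits_class(page_labels: list[str], class_labels: list[str]) -> bool:
    """True if page_labels occur in class_labels in the same relative order."""
    remaining = iter(class_labels)
    # `label in remaining` consumes the iterator up to the first match,
    # so later labels can only match further to the right.
    return all(label in remaining for label in page_labels)


template = ["Entry", "Name", "Reaction", "Enzyme"]
print(fits_class(["Entry", "Reaction", "Enzyme"], template))  # True  (Name omitted)
print(fits_class(["Reaction", "Entry"], template))            # False (order differs)
```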




Site-Wide Wrapper Induction


     1.     Start with C0
     2.     Follow all classified link-collections
     3.     Generate wrappers for each set of target pages
     4.     Determine if new class is formed
     5.     Add navigation step
     6.     Repeat 2 – 5 for each new class formed in 4

     [Figure labels: classes C0, C1, C2, C3; link-collections L1, L2, L3]

     S = {C0}
     If C0 != Ci (i>0): S = S + Ci

     Navigation Steps
     W = {(C0 → L1 → C0),
          (C0 → L2 → C2),
          (C0 → L3 → C3)}
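
The steps above can be sketched as a breadth-first traversal over page classes. In this sketch the wrapper generation and class comparison of steps 3 and 4 are replaced by a lookup table (`TOY_SITE`) that directly names the class each link-collection leads to; that stand-in, and the names `discover_site` and `TOY_SITE`, are assumptions for illustration only.

```python
from collections import deque


def discover_site(start, follow):
    """Explore page classes breadth-first.

    follow(cls) yields (link_collection, target_class) pairs and stands in
    for steps 2-4 (following links, inducing wrappers, comparing classes).
    """
    known = [start]            # S: classes discovered so far
    steps = []                 # W: navigation steps (source, link, target)
    queue = deque([start])
    while queue:
        cls = queue.popleft()
        for link, target in follow(cls):
            steps.append((cls, link, target))    # step 5
            if target not in known:              # step 4: a new class was formed
                known.append(target)
                queue.append(target)             # step 6: repeat for the new class
    return known, steps


# Toy site mirroring the slide: L1 leads back to class C0, L2 to C2, L3 to C3.
TOY_SITE = {"C0": [("L1", "C0"), ("L2", "C2"), ("L3", "C3")], "C2": [], "C3": []}

known, steps = discover_site("C0", lambda cls: TOY_SITE.get(cls, []))
print(known)  # ['C0', 'C2', 'C3']
print(steps)  # [('C0', 'L1', 'C0'), ('C0', 'L2', 'C2'), ('C0', 'L3', 'C3')]
```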


Site-Wide Wrapper Induction – Evaluation
         SOURCE     #C    #C'   #D     TP     FN     FP    P      R
         MSDChem    1     1     N/A    N/A    N/A    N/A   N/A    N/A
         ChEBI      3     1     1711   1195   516    0     100    69.8
         KEGG       10    7     6223   5044   1179   188   97     81.1
         Average                                           98.5   75.5

         Table 3: Site-wide wrapper induction results, 20 test pages for each class
         (C=Classes, C'=Classes discovered, D=Data entries)




Error Detection and Correction:
Mutual Reinforcement


• Observation: Certain data reappear on more
  than one class of pages




Error Detection and Correction:
Mutual Reinforcement
• Reinforcement if reappearing data is correctly classified as
  Data
• Otherwise it points to misclassification
    • Label-Data Mismatch
        • Correction: Introduce more samples
    • Label-Label Mismatch
        • Cannot be detected
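
A minimal sketch of the detection step, assuming each page class yields a mapping from extracted text to its classification (`"label"` or `"data"`): a value classified as data on one class but as a label on another is flagged as a label-data mismatch. The page class names and the classifications below are made up for illustration, and `label_data_mismatches` is a hypothetical helper.

```python
def label_data_mismatches(classified: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Values that are 'data' on at least one page class and 'label' on another."""
    seen: dict[str, dict[str, str]] = {}
    for page_class, entries in classified.items():
        for value, kind in entries.items():
            seen.setdefault(value, {})[page_class] = kind
    return {
        value: per_class
        for value, per_class in seen.items()
        if {"label", "data"} <= set(per_class.values())
    }


# Hypothetical classifications for two page classes.
classified = {
    "compound": {"Enzyme": "label", "3.2.1.21": "label", "C00221": "data"},
    "enzyme":   {"3.2.1.21": "data", "Name": "label"},
}
print(label_data_mismatches(classified))
# {'3.2.1.21': {'compound': 'label', 'enzyme': 'data'}}
```

A value that is consistently misclassified as a label on every class produces no conflict, which matches the slide's remark that label-label mismatches cannot be detected this way.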




Where to go next?

• Reverse engineering production
  1. LOD: emitting RDF & RDFS
  2. Navigation model: what belongs to what
  3. Interaction model (not treated at all by us so far)
  4. Layout model: spatial positioning

• Capture this generative model using machine learning
    • Relational learning
        • Markov logic programmes?
        • …?




Bibliography

• Linda d’Oro, Massimo Ruffolo, Steffen Staab. SXPath –
  Extending XPath towards Spatial Querying on Web
  Documents. In: PVLDB – Proceedings of the VLDB
  Endowment, 4(2): 129-140, 2010.
• S. Mir, S. Staab, I. Rojas. Site-Wide Wrapper Induction for
  Life Science Deep Web Databases. In: DILS-2009 – Proc.
  of the Data Integration in the Life Sciences Workshop,
  Manchester, UK, July 20-22, LNCS, Springer, 2009.
• Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised
  Approach for Acquiring Ontologies and RDF Data from
  Online Life Science Databases. In: 7th Extended Semantic
  Web Conference (ESWC2010), Heraklion, Greece, May
  30-June 3, 2010, pp. 319-333.
WeST – Web Science & Technologies
              University of Koblenz Landau, Germany




Thank you for your attention!

Opinion Formation and SpreadingOpinion Formation and Spreading
Opinion Formation and SpreadingSteffen Staab
 
10 Jahre Web Science
10 Jahre Web Science10 Jahre Web Science
10 Jahre Web ScienceSteffen Staab
 
(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contentsSteffen Staab
 
Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad DataSteffen Staab
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with ContextSteffen Staab
 
Closing Session ISWC 2015
Closing Session ISWC 2015Closing Session ISWC 2015
Closing Session ISWC 2015Steffen Staab
 
Semantic Technologies and Programmatic Access to Semantic Data
Semantic Technologies and Programmatic Access to Semantic Data Semantic Technologies and Programmatic Access to Semantic Data
Semantic Technologies and Programmatic Access to Semantic Data Steffen Staab
 
Seamless semantics - avoiding semantic discontinuity
Seamless semantics - avoiding semantic discontinuitySeamless semantics - avoiding semantic discontinuity
Seamless semantics - avoiding semantic discontinuitySteffen Staab
 
The Semantic Web - Interacting with the Unknown
The Semantic Web - Interacting with the UnknownThe Semantic Web - Interacting with the Unknown
The Semantic Web - Interacting with the UnknownSteffen Staab
 
Experiments in Computer Science - Don't loathe them, but love them
Experiments in Computer Science - Don't loathe them, but love themExperiments in Computer Science - Don't loathe them, but love them
Experiments in Computer Science - Don't loathe them, but love themSteffen Staab
 
Information-Rich Programming in F# with Semantic Data
Information-Rich Programming in F# with Semantic DataInformation-Rich Programming in F# with Semantic Data
Information-Rich Programming in F# with Semantic DataSteffen Staab
 

More from Steffen Staab (20)

Knowledge graphs for knowing more and knowing for sure
Knowledge graphs for knowing more and knowing for sureKnowledge graphs for knowing more and knowing for sure
Knowledge graphs for knowing more and knowing for sure
 
Symbolic Background Knowledge for Machine Learning
Symbolic Background Knowledge for Machine LearningSymbolic Background Knowledge for Machine Learning
Symbolic Background Knowledge for Machine Learning
 
Soziale Netzwerke und Medien: Multi-disziplinäre Ansätze für ein multi-dimens...
Soziale Netzwerke und Medien: Multi-disziplinäre Ansätze für ein multi-dimens...Soziale Netzwerke und Medien: Multi-disziplinäre Ansätze für ein multi-dimens...
Soziale Netzwerke und Medien: Multi-disziplinäre Ansätze für ein multi-dimens...
 
Web Futures: Inclusive, Intelligent, Sustainable
Web Futures: Inclusive, Intelligent, SustainableWeb Futures: Inclusive, Intelligent, Sustainable
Web Futures: Inclusive, Intelligent, Sustainable
 
Concepts in Application Context ( How we may think conceptually )
Concepts in Application Context ( How we may think conceptually )Concepts in Application Context ( How we may think conceptually )
Concepts in Application Context ( How we may think conceptually )
 
Storing and Querying Semantic Data in the Cloud
Storing and Querying Semantic Data in the CloudStoring and Querying Semantic Data in the Cloud
Storing and Querying Semantic Data in the Cloud
 
Semantics reloaded
Semantics reloadedSemantics reloaded
Semantics reloaded
 
Ontologien und Semantic Web - Impulsvortrag Terminologietag
Ontologien und Semantic Web - Impulsvortrag TerminologietagOntologien und Semantic Web - Impulsvortrag Terminologietag
Ontologien und Semantic Web - Impulsvortrag Terminologietag
 
Opinion Formation and Spreading
Opinion Formation and SpreadingOpinion Formation and Spreading
Opinion Formation and Spreading
 
The Web We Want
The Web We WantThe Web We Want
The Web We Want
 
10 Jahre Web Science
10 Jahre Web Science10 Jahre Web Science
10 Jahre Web Science
 
(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents(Semi-)Automatic analysis of online contents
(Semi-)Automatic analysis of online contents
 
Programming with Semantic Broad Data
Programming with Semantic Broad DataProgramming with Semantic Broad Data
Programming with Semantic Broad Data
 
Text Mining using LDA with Context
Text Mining using LDA with ContextText Mining using LDA with Context
Text Mining using LDA with Context
 
Closing Session ISWC 2015
Closing Session ISWC 2015Closing Session ISWC 2015
Closing Session ISWC 2015
 
Semantic Technologies and Programmatic Access to Semantic Data
Semantic Technologies and Programmatic Access to Semantic Data Semantic Technologies and Programmatic Access to Semantic Data
Semantic Technologies and Programmatic Access to Semantic Data
 
Seamless semantics - avoiding semantic discontinuity
Seamless semantics - avoiding semantic discontinuitySeamless semantics - avoiding semantic discontinuity
Seamless semantics - avoiding semantic discontinuity
 
The Semantic Web - Interacting with the Unknown
The Semantic Web - Interacting with the UnknownThe Semantic Web - Interacting with the Unknown
The Semantic Web - Interacting with the Unknown
 
Experiments in Computer Science - Don't loathe them, but love them
Experiments in Computer Science - Don't loathe them, but love themExperiments in Computer Science - Don't loathe them, but love them
Experiments in Computer Science - Don't loathe them, but love them
 
Information-Rich Programming in F# with Semantic Data
Information-Rich Programming in F# with Semantic DataInformation-Rich Programming in F# with Semantic Data
Information-Rich Programming in F# with Semantic Data
 

Recently uploaded

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024Scott Keck-Warren
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Neo4j
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 

Recently uploaded (20)

SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024Build your next Gen AI Breakthrough - April 2024
Build your next Gen AI Breakthrough - April 2024
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 

Information extraction for building knowledge bases

  • 1. WeST – Web Science & Technologies University of Koblenz Landau, Germany Information Extraction for Building Knowledge Bases Steffen Staab Saqib Mir – European Bioinformatics Institute Ermelinda d‘Oro, Massimo Ruffolo – Univ. Calabria, Italy
  • 2. A FEW SLIDES WHERE WEST COMES FROM WeST – Web Science & Steffen Staab Slide 2 Technologies staab@uni-koblenz.de
  • 4. Institut WeST – Web Science & Technologies: Semantic Web, Web Retrieval, Social Web, Multimedia Web, Software Web, GESIS
  • 5. We (co-)organize conferences and schools
  • 6. We build applications and develop methods… BTC 1. Prize 2011; 1. Prize German Linked Open Gov Data Competition 2012; BTC 1. Prize 2008; German KM 1. Prize 2011
  • 7. We teach Web Science. Master in Web Science@Koblenz: free tuition, start Fall 2012, English. Master in eGov@Koblenz: free tuition, start Fall 2012, English. 2012 Web Science Summer School, Lorentz Center, Leiden, The Netherlands, 9-13 July 2012
  • 8. We are active in joint projects: EU Integrated Project ROBUST (10 partners): Risk and Opportunity management of huge-scale BUSiness communiTy cooperation; EU Live+Gov: Reality Sensing, Mining and Augmentation for Mobile Citizen–Government Dialogue; EU WeGov: where eGovernment meets the eSociety; EU IP SocialSensor: Sensing User Generated Input for Improved Media Discovery and Experience; EU Net2: a network for networked knowledge; EU MOST: Marrying Ontologies and Software Technologies
  • 9. INFORMATION EXTRACTION FOR BUILDING KNOWLEDGE BASES. Steffen Staab; Saqib Mir, European Bioinformatics Institute; Ermelinda d'Oro, Massimo Ruffolo, Univ. Calabria, Italy
  • 10. GENERAL MOTIVATION
  • 11. General objective: Extracting to LOD (diagram labels: useAsExample, hasLivedIn)
  • 12. General objective: Analysing LOD (diagram labels: useAsExample, hasLivedIn)
  • 13. Family's analysis of Munich: LOD + Open Street Map data (http://lisa.west.uni-koblenz.de/lisa-demo/)
  • 14. Entrepreneur's analysis of Munich: LOD + Open Street Map data (http://lisa.west.uni-koblenz.de/lisa-demo/)
  • 15. OBSERVATIONS ON INFORMATION EXTRACTION
  • 16. Challenges & Opportunities for IE: Not all web pages are created equal
  • 17. Challenges & Opportunities for IE: Some challenges are the same, e.g. finding type instances
  • 18. Challenges & Opportunities for IE: Some challenges are the same, e.g. finding relation instances
  • 19. Challenges & Opportunities for IE: Some contain concepts and their descriptions, some don't. No types here, few relation types
  • 20. Challenges & Opportunities for IE: Knowing that they are instances and of which type (textual indication, positional indication)
  • 21. Challenges & Opportunities for IE: To some extent positional and layout indications work across languages and sites
  • 22. Challenges & Opportunities for IE: We should not only think about Web pages, but about Web sites (owl:sameAs)
  • 23. Challenges & Opportunities for IE: We should not only think about Web pages, but about Web sites (owl:sameAs)
  • 24. Comparing related work to our objectives. Related work objectives: IE on Web pages; acquiring instances and relationship instances; IE based on linear text. Our objectives: IE on Web sites; acquiring items; classifying items in instances, concepts, relation instances, relationships; IE also based on spatial position. There is overlap and there are few exceptions in related work
  • 25. Outline. The Social Media-Case: Motivation; State-of-the-Art; Core idea of SXPath; SXPath Language; Spatial Data Model; Syntax & Semantics; Complexity; Implementation; Evaluation. The Bio-Case
  • 26. Presentation-oriented documents. Acquiring a music band profile: a music band photo that has its descriptive information to its east (diagram labels: music band profile, band photo, band name)
  • 27. Presentation-oriented documents
  • 28. Presentation-oriented documents: HTML DOM structure is site specific; spatial arrangements are rarely explicit; spatial layout is hidden in complex nesting of layout elements; intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)
  • 29. Related Work. Web query languages: XPath 1.0 and XQuery 1.0 (established; too difficult to use for scraping from intricate DOM structures). Visual languages: Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency; algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]. Web wrapper induction exploiting visual interface [Gottlob et al.] [Sahuguet et al.]: generate XPath location paths of DOM nodes; can benefit from using Spatial XPath
  • 30. Outline (same as slide 25)
  • 31. Idea: Use Spatial Relations among DOM Nodes (diagram labels: b, e)
  • 32. Idea: Use Spatial Relations among DOM Nodes
  • 33. Idea: Use Spatial Relations among DOM Nodes
  • 34. Spatial DOM (SDOM)
  • 35. Spatial Relations Among Nodes. Rectangular Cardinal Relations (RCR), e.g. r1 E:NE r2; Topological Relations. Spatial models allow for expressing disjunctive relations among regions
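A relation such as r1 E:NE r2 can be read as "r1 overlaps the east and north-east tiles of the 3x3 grid spanned by the edges of r2". The following is a minimal sketch of that reading, not the SXPath implementation. It assumes rectangles are given as (x1, y1, x2, y2) in screen coordinates where y grows downward, and the function name cardinal_tiles is hypothetical.

```python
def cardinal_tiles(a, b):
    """Return the tiles of b's 3x3 grid that rectangle a overlaps.

    Rectangles are (x1, y1, x2, y2) with y growing downward (screen
    coordinates, an assumption of this sketch). "B" is the tile
    occupied by b itself.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inf = float("inf")
    # Column and row bands induced by the edges of the reference rectangle b.
    cols = {"W": (-inf, bx1), "": (bx1, bx2), "E": (bx2, inf)}
    rows = {"N": (-inf, by1), "": (by1, by2), "S": (by2, inf)}

    def overlaps(lo1, hi1, lo2, hi2):
        # True when two intervals share more than a boundary point.
        return max(lo1, lo2) < min(hi1, hi2)

    tiles = []
    for row_name, (row_lo, row_hi) in rows.items():
        for col_name, (col_lo, col_hi) in cols.items():
            if overlaps(ay1, ay2, row_lo, row_hi) and overlaps(ax1, ax2, col_lo, col_hi):
                tiles.append((row_name + col_name) or "B")
    return tiles


reference = (0, 100, 100, 200)   # r2
candidate = (120, 50, 200, 150)  # r1, to the right of r2 and partly above it
print(cardinal_tiles(candidate, reference))
```

Because the result is a set of tiles rather than a single direction, one relation can cover several positions at once, which is one way to picture the disjunctive relations mentioned on the slide.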
  • 36. XPath Example
  • 37. SXPath Example
  • 39. From XPath 1.0 towards Spatial Querying with SXPath. SXPath features: adopts intuitive path notation axis::nodetest [pred]*; adds spatial axes and spatial position functions to XPath; natural semantics for spatial querying; maintains polynomial time combined complexity
  • 40. Why SXPath? (diagram keywords: resilient wrappers, an XPath for familiarity, information extraction, simplicity, human oriented, efficiency, web applications)
  • 41. Outline (same as slide 25)
  • 42. Spatial DOM (SDOM)
  • 43. Spatial Navigation Axes
  • 44. Spatial Navigation Axes
  • 45. Syntax of SXPath
  • 46. Complexity Results
  • 47. Outline (same as slide 25)
  • 48. SXPath System Architecture
  • 49. SXPath System
  • 50. Results of Experiments
  • 51. Formative User Study
  • 52. Summative User Study
  • 53. Summative User Study
  • 54. Summative User Study
  • 55. Existing Extensions to PDF
  • 56. Page Header; Text Area and Paragraphs; Table; Item List; Page Number; Page Footer
  • 57. Outline. The Social Media Case: Motivation; State-of-the-Art; Core idea of SXPath; SXPath Language; Spatial Data Model; Syntax & Semantics; Complexity; Implementation; Evaluation. The Bio-Case: Motivation; The (Biochemical) Deep Web; Contributions (Page-level wrapper induction, Site-wide wrapper generation, Error Correction by Mutual Reinforcement); Conclusions and Future Directions
  • 58. >1000 Life Science DBs, number growing quickly
  • 59. Biochemical Web Sites: Observations - 1. Labeled Data. Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404). Table 1: Data fields across 20 Biochemical Web sites. Total: 754; Labeled: 719; Unlabeled: 19; Unlabeled (Redundant): 16
  • 60. Biochemical Web Sites: Observations - 2. Dynamic Web Pages
  • 61. Biochemical Web Sites: Observations - 3. Rich Site Structure
  • 62. Biochemical Web Sites: Observations - 4. Web Services. Survey: 11 of 100 databases [1] provide APIs; incomplete coverage; varying granularity; no semantics in the service description. [1] Databases indexed by the Nucleic Acids Research Journal (http://www3.oup.co.uk/nar/database/). Complete survey available at http://sabiork.villa-bosch.de/index.html/survey.html
  • 63. Biochemical Web Sites: Implications. Induce Wrapper (label appears three times in the diagram)
  • 64. Contributions: Unsupervised Page-Level Wrapper Induction; Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery); Automatic Error Detection and Correction by Mutual Reinforcement
  • 65. Page-Level Wrapper Induction - 1. Text nodes selected with //*[text()]. D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47,…}; O1 = {Entry, Name,…, Reaction, R00026, Enzyme,…, 3.2.1.21}; D2 = {C00185, Cellobiose,…, R00306, 1.1.99.18,… }; O2 = {Entry, Name,…, Reaction, R00026, Enzyme,…, 3.2.1.21}
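One way to arrive at sets like D1 and O1 is to compare the text nodes of two sample pages: text that changes between pages is a data candidate, text that repeats is everything else. The sketch below illustrates that idea on two toy pages. It is an assumption about the first step, not the authors' code, and the helper names text_nodes and split_data_other are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_nodes(xhtml):
    """Collect the stripped leading text of every element, in document order.

    This approximates //*[text()]; tail text after child elements is
    ignored to keep the sketch short.
    """
    root = ET.fromstring(xhtml)
    return [el.text.strip() for el in root.iter() if el.text and el.text.strip()]


def split_data_other(page_a, page_b):
    """Split the text nodes of page_a into (data, other) by comparing with page_b."""
    texts_a, texts_b = text_nodes(page_a), text_nodes(page_b)
    shared = set(texts_a) & set(texts_b)
    data = [t for t in texts_a if t not in shared]
    other = [t for t in texts_a if t in shared]
    return data, other


PAGE_1 = ("<html><table>"
          "<tr><th>Entry</th><td>C00221</td></tr>"
          "<tr><th>Name</th><td>beta-D-Glucose</td></tr>"
          "<tr><th>Enzyme</th><td>3.2.1.21</td></tr>"
          "</table></html>")
PAGE_2 = ("<html><table>"
          "<tr><th>Entry</th><td>C00185</td></tr>"
          "<tr><th>Name</th><td>Cellobiose</td></tr>"
          "<tr><th>Enzyme</th><td>3.2.1.21</td></tr>"
          "</table></html>")

d1, o1 = split_data_other(PAGE_1, PAGE_2)
print("D1 =", d1)
print("O1 =", o1)
```

In the toy pages the value 3.2.1.21 occurs on both pages and therefore lands in O1 next to the labels, which mirrors the sets on the slide where 3.2.1.21 is also listed under O1 and O2.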
  • 66. Page-Level Wrapper Induction - 2. Reclassify: Growing Data Regions
  • 67. Page-Level Wrapper Induction - 3. D1 = {C00221, beta-D-Glucose, …, R01520, 1.1.1.47, 3.2.1.21 …}; O1 = {Entry, Name,…, Reaction, R00026, Enzyme,…,}; D2 = {C00185, Cellobiose,…, R00306, 1.1.99.18, 3.2.1.21 … }; O2 = {Entry, Name,…, Reaction, R00026, Enzyme,…,}
  • 68. Page-Level Wrapper Induction - 4. Selecting Labels for Data: html/…./table[1]/tr[8]/td[1]/…/code[1]/a[1] (“1.1.1.47”); html/…./table[1]/tr[6]/th[1]/…/code[1]/ (“Reaction”); html/…./table[1]/tr[8]/th[1]/…/code[1]/ (“Enzyme”)
  • 69. Page-Level Wrapper Induction - 5. Anchor the Path. Label path for Enzyme: html/table[1]/tr[8]/th[1]/code[1]/; data paths: html/table[1]/tr[8]/td[1]/code[1]/a[1], html/table[1]/tr[8]/td[1]/code[1]/a[2]. Steps: Pivot, Relative, Generalize. Resulting expression: //*[text()=‘Enzyme’] ../…./../td[1]/code[1]/a[position()≥2]/text()
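The pivot and relative steps can be pictured as rewriting an absolute data path into a path that starts at the label node. The sketch below does this by climbing from the label to the deepest common ancestor and descending to the data node. It is a simplified illustration using the two paths from the slide; the function name relative_path is hypothetical and the generalize step is left out.

```python
def relative_path(label_path, data_path):
    """Express data_path relative to label_path using '..' steps.

    Both arguments are slash-separated absolute location paths. The result
    climbs from the label node to the deepest common ancestor and then
    descends to the data node.
    """
    label_steps = label_path.strip("/").split("/")
    data_steps = data_path.strip("/").split("/")
    common = 0
    for label_step, data_step in zip(label_steps, data_steps):
        if label_step != data_step:
            break
        common += 1
    ups = [".."] * (len(label_steps) - common)
    return "/".join(ups + data_steps[common:])


label = "html/table[1]/tr[8]/th[1]/code[1]/"
data = "html/table[1]/tr[8]/td[1]/code[1]/a[1]"
rel = relative_path(label, data)
# Pivot on the label text instead of the absolute position in the page.
anchored = "//*[text()='Enzyme']/" + rel
print(anchored)
```

Anchoring on the label text rather than on table[1]/tr[8] means the expression can still match when rows are added or omitted above the label, as long as the local layout around the label stays the same.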
  • 70. Selected Sources: KEGG, ChEBI, MSDChem. Basic qualitative data; popular; overlapping/complementary data
  • 71. Wrapper Induction - Evaluation. Table 2: Page-level wrapper induction results, 20 test pages (L=Labels, D=Data entries, S=Training pages)
        SOURCE          #L   #D    #S   TP    FN    FP   P      R
        KEGG Compound   10   762   3    411   351   46   89.9   53.9
                                   15   759   3     0    100    99.6
        KEGG Reaction   10   205   3    173   32    0    100    84.4
                                   15   205   0     0    100    100
        ChEBI           22   831   3    595   236   41   93.5   71.6
                                   15   829   2     0    100    99.7
        MSDChem         30   600   3    600   0     20   96.7   100
                                   15   600   0     20   96.7   100
        Average (based on final wrappers for each source)      99.1   99.8
        Sources: KEGG Compound http://www.genome.jp/kegg/compound/; KEGG Reaction http://www.genome.jp/kegg/reaction/; ChEBI http://www.ebi.ac.uk/chebi; MSDChem http://www.ebi.ac.uk/msd-srv/msdchem/
        ~9 samples: ~99% P, ~98% R
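The P and R columns are consistent with the usual definitions of precision and recall over true positives (TP), false negatives (FN) and false positives (FP). As a small worked check, the sketch below recomputes both values for the first KEGG Compound row; the function name precision_recall is hypothetical.

```python
def precision_recall(tp, fn, fp):
    """Return (precision, recall) in percent from raw counts."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return precision, recall


# First KEGG Compound row of Table 2: TP=411, FN=351, FP=46.
p, r = precision_recall(tp=411, fn=351, fp=46)
print(f"P = {p:.2f}, R = {r:.2f}")
```

Note that TP + FN equals the number of data entries #D (411 + 351 = 762), so recall is the share of all data entries that the wrapper extracted.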
  • 72. Site-Wide Wrapper Induction: Observations
      Not all pages contain data (e.g. legal disclaimers, contact pages, navigational menus)
      • An efficient approach should ignore these pages
      • We don't need to learn the entire site structure
  • 73. Site-Wide Wrapper Induction: Observations - 2
      Classified link-collections point to data-intensive pages of the same class.
  • 74. Site-Wide Wrapper Induction: Observations - 3
      • Pages belonging to the same class describe the same concepts
      • Some concepts are sometimes omitted
      • Ordering is always the same
  • 75. Site-Wide Wrapper Induction
      1. Start with C0; S = {C0}
      2. Follow all classified link-collections
      3. Generate wrappers for each set of target pages
      4. Determine if new class is formed
      5. Add navigation step. If C0 != Ci (i > 0): S = S + Ci
      6. Repeat 2 – 5 for each new class formed in 4
      Navigation steps: W = {(C0 → L1 → C0), (C0 → L2 → C2), (C0 → L3 → C3)}
      [Diagram: classes C0, C1, C2, C3 connected by link-collections L1, L2, L3]
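The six steps on slide 75 read like a worklist algorithm over page classes. The following is a sketch under assumptions, not the authors' code: `follow` and `class_of` are hypothetical callables that stand in for crawling the classified link-collections and for wrapper generation plus the "new class?" test.

```python
from collections import deque


def site_wide_induction(start, follow, class_of):
    """Worklist sketch of the six steps.

    start    -- identifier of the start class C0
    follow   -- follow(cls) -> {link_collection: target_pages} (hypothetical)
    class_of -- class_of(target_pages) -> class identifier (hypothetical);
                abstracts wrapper generation and class comparison
    """
    classes = [start]                 # S, in order of discovery (step 1)
    steps = []                        # W, navigation steps
    queue = deque([start])
    while queue:
        source = queue.popleft()
        for link, pages in follow(source).items():    # step 2
            target = class_of(pages)                  # steps 3 and 4
            steps.append((source, link, target))      # step 5
            if target not in classes:
                classes.append(target)
                queue.append(target)                  # step 6
    return classes, steps


# Toy site mirroring the slide: L1 leads back to C0, L2 to C2, L3 to C3.
site = {"C0": {"L1": "C0", "L2": "C2", "L3": "C3"}}
classes, steps = site_wide_induction(
    "C0", lambda c: site.get(c, {}), lambda pages: pages
)
```

On this toy site the navigation steps come out as the set W shown on the slide, and only classes that are actually reachable through link-collections are ever visited, which matches the observation that the entire site structure need not be learned.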
  • 76. Site-Wide Wrapper Induction – Evaluation
      SOURCE    #C  #C’    #D    TP    FN   FP     P     R
      MSDChem    1    1   N/A   N/A   N/A  N/A   N/A   N/A
      ChEBI      3    1  1711  1195   516    0   100  69.8
      KEGG      10    7  6223  5044  1179  188    97  81.1
      Average                                   98.5  75.5
      Table 3: Site-wide wrapper induction results, 20 test pages for each class (C = Classes, C’ = Classes discovered, D = Data entries)
  • 77. Error Detection and Correction: Mutual Reinforcement
      Observation: Certain data reappear on more than one class of pages
  • 78. Error Detection and Correction: Mutual Reinforcement
      • Reinforcement if reappearing data correctly classified as Data
      • Otherwise it points to misclassification
      • Label-Data Mismatch
          • Correction: Introduce more samples
      • Label-Label Mismatch
          • Cannot be detected
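One possible reading of the mutual-reinforcement check on slides 77 and 78, sketched under assumptions: each extraction is a `(page_class, value, role)` triple with role `"label"` or `"data"`, and `find_label_data_mismatches` is a hypothetical name. A value classified as data on several page classes is reinforced; a value that is a label on one class but data on another is flagged as a label-data mismatch. A value wrongly classified as a label everywhere produces no conflict, which is why a label-label mismatch stays undetected.

```python
from collections import defaultdict


def find_label_data_mismatches(extractions):
    """Return (reinforced, mismatched) sets of values.

    extractions -- iterable of (page_class, value, role) triples,
                   with role in {"label", "data"}
    """
    roles = defaultdict(set)      # value -> roles it was assigned
    seen_in = defaultdict(set)    # value -> page classes it appeared on
    for page_class, value, role in extractions:
        roles[value].add(role)
        seen_in[value].add(page_class)

    reinforced = {
        v for v, r in roles.items() if r == {"data"} and len(seen_in[v]) > 1
    }
    mismatched = {v for v, r in roles.items() if r == {"label", "data"}}
    return reinforced, mismatched
```

A flagged value would then trigger the correction named on the slide, that is, inducing the wrapper again with more sample pages.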
  • 79. Where to go next?
      • Reverse engineering production
          1. LOD: emitting RDF & RDFS
          2. Navigation model: what belongs to what
          3. Interaction model (not treated at all by us so far)
          4. Layout model: spatial positioning
      • Capture this generative model using machine learning
      • Relational learning
          • Markov logic programmes?
          • …?
  • 80. Bibliography
      • Linda d’Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.
      • S. Mir, S. Staab, I. Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.
      • Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.
  • 81. WeST – Web Science & Technologies University of Koblenz Landau, Germany Thank you for your attention!

Editor's Notes

  1. The spatial layout of the content elements of a Web page helps human readers understand the semantics of the content. On this page, a user can identify the details of a music band because the descriptive information is close to the image and every band is shown using the same visual pattern. Introduce "human-oriented": the browser makes rectangles. [Use the Last.fm page.] The Web designer lays out pages to present unitary pieces of information. This unity comes from the spatial consistency of a piece of information, which conveys its semantics to the user, and we want to exploit this spatial consistency to query pages. This is possible by exploiting how layout engines work. This is a presentation-oriented page. [Show Last.fm in the browser.] This is its internal representation, and this is the layout the browser produces. [Highlight.] The spatial layout of the nodes [draw a rectangle] lets the user identify homogeneous parts of information, here the descriptions of these music bands. Presentation-oriented means human-oriented: a human understands that these pieces of information refer to the same music band because they have a certain spatial continuity.
  2. Layout engines of Web browsers assign a rectangle to each DOM element. This is the internal code of a page. How can we query the page using spatial information? When the browser renders a page, it places each piece of information in a rectangle, which we can call its minimum bounding rectangle. The layout engine assigns one to each node [show the DOM in parallel with what you see: "Coldplay" is written in here and lights up, then the img lights up], based on the stylesheet, that is, on what the Web designer specified. Because pages are presentation-oriented, style is also used for emphasis so that humans notice the important information, e.g. the name in bold (future work: exploit these cues).
  3. As the figure shows, the complex, deeply nested structure of the DOM has a clear visual presentation that enables users to read and understand the meaning of the information presented on the Web page.
  4. Rectangular algebra (RA) extends Allen's interval algebra to the two-dimensional case. For example, the relation x (b,e) y in this case is obtained intuitively by applying interval algebra to both sides of the rectangle, that is, to its horizontal and vertical extents. So we can use the spatial model of geospatial databases to represent the mutual relationships between objects. [Show RA.] Rectangular algebra defines 169 relations, covering all possible relations between rectangles. [Show the large figure.] Between this rectangle and this one, the relation has this name in rectangular algebra. [Highlight; crop a single rectangle.] Summary: models from the geospatial world represent the mutual relations (RA). [Highlight 2.] The tree is based not on nesting but on containment and relations.
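A rectangular algebra relation can be computed as a pair of Allen interval relations, one per axis, which also explains the count of 13 × 13 = 169 relations. The sketch below is illustrative only: the function names, the `(x1, y1, x2, y2)` rectangle encoding and the abbreviations (`b`, `m`, `o`, `s`, `d`, `f`, `e` plus the inverses `bi`, `mi`, `oi`, `si`, `di`, `fi`) are assumptions, and intervals are assumed to be proper (start < end).

```python
ALLEN = ("b", "m", "o", "s", "d", "f", "e", "bi", "mi", "oi", "si", "di", "fi")


def allen(a_start, a_end, b_start, b_end):
    """Allen relation of interval a with respect to interval b."""
    if a_end < b_start:
        return "b"                                  # before
    if a_end == b_start:
        return "m"                                  # meets
    if b_end < a_start:
        return "bi"                                 # after
    if b_end == a_start:
        return "mi"                                 # met by
    if a_start == b_start and a_end == b_end:
        return "e"                                  # equals
    if a_start == b_start:
        return "s" if a_end < b_end else "si"       # starts / started by
    if a_end == b_end:
        return "f" if a_start > b_start else "fi"   # finishes / finished by
    if b_start < a_start and a_end < b_end:
        return "d"                                  # during
    if a_start < b_start and b_end < a_end:
        return "di"                                 # contains
    return "o" if a_start < b_start else "oi"       # overlaps / overlapped by


def ra_relation(r, s):
    """RA relation of rectangle r w.r.t. s; rectangles are (x1, y1, x2, y2)."""
    return (allen(r[0], r[2], s[0], s[2]), allen(r[1], r[3], s[1], s[3]))


# x lies entirely to the left of y and spans the same vertical extent.
x, y = (0, 0, 10, 10), (20, 0, 30, 10)
print(ra_relation(x, y))
```

Swapping the arguments yields the inverse relation on each axis, which is the invertibility property the notes mention for optimized evaluation.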
  5. To represent the spatial relations between pairs of content items (rectangles) that the layout engine of a Web browser lays out when presenting a Web page, we use the rectangular algebra (RA) relations model. This model is well known and widely adopted in geospatial databases, and it has very interesting properties, such as invertibility, that enable optimized evaluation of the SXPath language. So we can use the spatial model of geospatial databases to represent the mutual relationships between objects. [Show RA.] Rectangular algebra defines 169 relations, covering all possible relations between rectangles. [Show the large figure.] Between this rectangle and this one, the relation has this name in rectangular algebra. [Highlight; crop a single rectangle.] Summary: models from the geospatial world represent the mutual relations (RA). [Highlight 2.] The tree is based not on nesting but on containment and relations.
  6. No comment; everything is already on the slide. The model also has very interesting properties, such as invertibility, that enable optimized evaluation of the SXPath language. So we can use the spatial model of geospatial databases to represent the mutual relationships between objects. [Show RA.] Rectangular algebra defines 169 relations, covering all possible relations between rectangles. [Show the large figure.] Between this rectangle and this one, the relation has this name in rectangular algebra. [Highlight; crop a single rectangle.] Summary: models from the geospatial world represent the mutual relations (RA). [Highlight 2.] The tree is based not on nesting but on containment and relations.
  7. Representing the RA (spatial) relations yields the SDOM, in which continuous arrows represent spatial containment and dotted arrows represent RA relations. We thus have a model of a Web page that represents all spatial relations between each pair of DOM nodes. Spatial relations also allow a spatial ordering to be defined along the four main directions (North, South, East, and West), as shown in the figure. Intuition behind the DOM: I can build a tree of the page based not on the nesting of tags but on spatial containment and spatial relations. [Bring out the SDOM, keeping the animation running and always showing the two chosen elements.] Between the image and "Radiohead" there is the spatial relation (s, bi). This data model does not capture the simple nesting of tags; it captures the spatial arrangement of the objects on the page. [With animations.] This is the new data model that I use, called the Spatial DOM: the Document Object Model with the objects of the DOM, where the dark arrows are containment relations and the dashed ones are RA relations. This model makes it possible to introduce an ordering on the page. Summary: the new model I use is the SDOM; introduce the fact that it allows a spatial ordering to be defined on the page.
  8. The RA relations are too fine-grained and verbose, and therefore difficult for a human to use. So we also introduce Rectangular Cardinal Relations (RCR) and topological relations, two of the most intuitive and widespread spatial models, which map onto RA relations and let users query spatial relations more intuitively. The RA relations are very complicated and we need more intuitive relations, so we use another geospatial model, RCR, together with topological relations, both mapped to the RA model. RCR divides the plane into regional tiles, which makes it simple.
  9. The RA relations are too fine-grained and verbose, and therefore difficult for a human to use. So we also introduce Rectangular Cardinal Relations (RCR) and topological relations, two of the most intuitive and widespread spatial models, which map onto RA relations and let users query spatial relations more intuitively. The RA relations are very complicated and we need more intuitive relations, so we use another geospatial model, RCR, together with topological relations, both mapped to the RA model. RCR divides the plane into regional tiles, which makes it simple.
  10. This slide compares XPath and SXPath. Suppose a user needs to extract the details of a music band. With XPath, the user needs to know the intricate DOM structure. With SXPath, the user can exploit the visual pattern that the Web designers adopted to organize the details of the music bands.
  11. This slide compares XPath and SXPath. Suppose a user needs to extract the details of a music band. With XPath, the user needs to know the intricate DOM structure. With SXPath, the user can exploit the visual pattern that the Web designers adopted to organize the details of the music bands.
  12. SXPath expressions are also resilient: a given visual pattern can be queried in the same way on different Web pages that have different internal encodings. Another advantage is generality. For instance, a single query can match different DOMs because their spatial representation is the same, so the query generalizes the patterns; our language captures visual patterns on Web pages in a general way. Example 2: a single data record can be split across different subtrees, whereas wrapper induction techniques like DEPTA [Zhai et al.] recognize data records only when they are encoded in the DOM as consecutive similar subtrees. Summary (Example 2): another advantage is that different DOMs are matched; the language captures visual patterns in a general way.
  13. The SXPath language was designed to support information extraction from presentation-oriented documents. Because it derives from XPath, users do not have to learn it from scratch; it is simple to learn and more human-oriented than XPath. SXPath maintains polynomial combined complexity and is a stepping stone for different kinds of Web applications aimed at acquiring information from the Web.
  14. The SDOM is essentially the traditional DOM enriched with the set of rectangular algebra relations between each pair of nodes.
  15. The SXPath language is an extension of the XPath language. Besides the traditional axes, SXPath provides a new set of axes called spatial axes. Spatial axes are expressed by rectangular cardinal relations and topological relations, which are more intuitive for humans to use and can easily be mapped to rectangular algebra relations.
  16. Spatial axes are defined as interpreted binary relations expressed by RCR and mapped to RA by means of the function mu.
  17. As said before, the SXPath language extends the XPath language with spatial axes and spatial position functions. To get a clear picture of the expressiveness and complexity of the language, we studied interesting fragments of SXPath that correspond to XPath fragments already studied in the literature. In particular, we studied Core XPath/Core SXPath (the navigational core of XPath/SXPath) and the WF/Spatial WF fragments (which allow position/spatial position functions).
  18. The semantics of SXPath is given using the concept of context introduced by Wadler and also adopted by Gottlob in his studies on XPath expressiveness and complexity. In SXPath, the context must be extended with the spatial positions of nodes and the context sizes for each direction, so we have a 12-tuple instead of a 3-tuple.
  19. We gave the formal semantics of SXPath using the denotational semantics approach. The main difference from the XPath formal semantics is the function that computes the spatial axes, defined as shown here.
  20. Each expression is evaluated over the context, as shown here.
  21. The study of the combined computational complexity of different SXPath fragments shows that SXPath maintains polynomial-time complexity. SXPath has a greater exponent in the polynomial because of the quadratic number of relations stored in the SDOM that must be explored when evaluating spatial axes. We compute spatial axes with the same dynamic programming approach suggested by Gottlob, but we have to explore a quadratic number of additional relations in the SDOM. Core SXPath queries can be evaluated in time O(|D|^2 · |Q|), where |D| is the size of the XML document and |Q| is the size of the query Q. Proof sketch: there are O(|V_v|^2) many spatial relations to be considered in addition to the O(|V|) many relations of the DOM, incurring a higher polynomial worst-case complexity. To obtain a polynomial-time combined complexity bound for SXPath query evaluation, we use dynamic programming, adopting the Context-Value Table (CV-Table) principle introduced by Gottlob et al. Position and size are computed on demand; we compute all spatial position functions in a loop over all pairs of previous and current nodes. Full SXPath computational costs are dominated by the string operations belonging to XPath 1.0. In SWF, the computation of the spatial ordering generates a higher polynomial worst case than XPath 1.0.
  22. The system architecture consists of a parser of SXPath expressions (the query parser), a builder of the SDOM, and an engine that efficiently evaluates SXPath queries.
  23. The GUI shows the DOM, lets the user write queries, and displays the query results on screen so that they can be checked.
  24. These two log-log plots show data efficiency and query efficiency. To evaluate data efficiency we used documents of growing size; to evaluate query efficiency we used a query with an increasing number of location steps. The plots show that the system's behavior is polynomial with respect to both data and query size.
  25. To evaluate usability, we asked some students who already knew the XPath language to learn SXPath and use it to extract product names and prices from Web pages. The experiment showed that users found the language usable and effective.
  26. In the second experiment we evaluated the effectiveness of SXPath with respect to XPath. We found that being able to exploit the visual appearance of Web pages lets users write queries in fewer attempts than with XPath, that SXPath location paths are more concise, and that SXPath is resilient (the same query can be used on different Web sites that have very different internal encodings in terms of DOM trees).
  27. In the second experiment we evaluated the effectiveness of SXPath with respect to XPath. We found that being able to exploit the visual appearance of Web pages lets users write queries in fewer attempts than with XPath, that SXPath location paths are more concise, and that SXPath is resilient (the same query can be used on different Web sites that have very different internal encodings in terms of DOM trees).
  28. In the second experiment we evaluated the effectiveness of SXPath with respect to XPath. We found that being able to exploit the visual appearance of Web pages lets users write queries in fewer attempts than with XPath, that SXPath location paths are more concise, and that SXPath is resilient (the same query can be used on different Web sites that have very different internal encodings in terms of DOM trees).
  29. In presentation-oriented documents, the layout given to the elements of the internal representation provides visual cues that help human users understand the meaning of the contents. The contents of both Web pages and PDF documents are presented to users on a two-dimensional Cartesian plane, and their meaning is clear only after rendering. For example, the PDF encoding consists of a (completely flat) stream of strings, each equipped with the position at which it must appear on the page. The table in the figure can be understood only after rendering.
  30. By using document layout analysis and document understanding techniques, combined with table recognition methods, the different parts of a PDF document can be recognized. In this way, the elements of a PDF document can be represented in a more abstract format such as the DOM or SDOM.