This document summarizes a presentation about the WeST (Web Science & Technologies) institute at the University of Koblenz Landau in Germany. WeST conducts research in areas like the semantic web, social web, multimedia web, and software web. It organizes conferences and schools, builds applications, and teaches master's programs in web science. WeST is also involved in several European Union projects related to risk management, open government data, e-government, social media, and linking ontologies and software technologies. The presentation provides an overview of WeST's research areas, projects, and educational activities.
Information extraction for building knowledge bases
1. WeST – Web Science & Technologies
University of Koblenz Landau, Germany
Information Extraction
for
Building Knowledge Bases
Steffen Staab
Saqib Mir – European Bioinformatics Institute
Ermelinda d‘Oro, Massimo Ruffolo – Univ. Calabria, Italy
2. A FEW SLIDES WHERE WEST COMES FROM
WeST – Web Science & Technologies, Steffen Staab, staab@uni-koblenz.de, Slide 2
4. Institut WeST – Web Science & Technologies
Semantic Web Web Retrieval Social Web Multimedia Web Software Web GESIS
5. We (co-)organize conferences and schools
6. We build applications and develop methods…
BTC 1st Prize 2008
BTC 1st Prize 2011
German KM 1st Prize 2011
1st Prize, German Linked Open Gov Data Competition 2012
7. We teach Web Science
Master in Web Science@Koblenz: free tuition, start Fall 2012, English
Master in eGov@Koblenz: free tuition, start Fall 2012, English
2012 Web Science Summer School, Lorentz Center, Leiden, The Netherlands, 9-13 July 2012
8. We are active in joint projects
EU Integrated Project ROBUST (10 Partners): Risk and Opportunity management of huge-scale BUSiness communiTy cooperation
EU Live+Gov - Reality Sensing, Mining and Augmentation for Mobile Citizen–Government Dialogue
EU WeGov – where eGovernment meets the eSociety
EU IP SocialSensor - Sensing User Generated Input for Improved Media Discovery and Experience
EU Net2 – a network for networked knowledge
EU MOST – Marrying Ontologies and Software Technologies
9. Steffen Staab,
Saqib Mir, European Bioinformatics Institute
Ermelinda d‘Oro, Massimo Ruffolo, Univ Calabria, Italy
INFORMATION EXTRACTION
FOR
BUILDING KNOWLEDGE BASES
15. OBSERVATIONS ON
INFORMATION EXTRACTION
16. Challenges & Opportunities for IE
Not all web pages are created equal
17. Challenges & Opportunities for IE
Some challenges are the same, e.g. finding type instances
18. Challenges & Opportunities for IE
Some challenges are the same, e.g. finding relation instances
19. Challenges & Opportunities for IE
Some contain concepts and their descriptions, some don't
No types here, few relation types
20. Challenges & Opportunities for IE
Knowing that they are instances and of which type
Textual indication
Positional indication
21. Challenges & Opportunities for IE
To some extent positional and layout indications work across languages and sites
22. Challenges & Opportunities for IE
owl:sameAs
We should not only think about
Web pages, but about Web sites
23. Challenges & Opportunities for IE
We should not only think about
Web pages, but about Web sites
owl:sameAs
24. Comparing related work to our objectives
Related work objectives                         Our objectives
IE on Web pages                                 IE on Web sites
Acquiring instances and relationship instances  Acquiring items; classifying items into instances, concepts, relation instances, relationships
IE based on linear text                         IE also based on spatial position
There is overlap and there are few exceptions in related work
25. Outline
The Social Media-Case The Bio-Case
Motivation
State-of-the-Art
Core idea of SXPath
SXPath Language
Spatial Data Model
Syntax & Semantics
Complexity
Implementation
Evaluation
26. Presentation-oriented documents
Acquiring a music band profile:
A music band photo that has its descriptive information to its east
Music band profile: band photo, band name
28. Presentation-oriented documents
• HTML DOM structure is site-specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)
29. Related Work
Web query languages
• XPath 1.0 and XQuery 1.0
• Established
• Too difficult to use for scraping from intricate DOM structures
Visual languages
• Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency
• Algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]
Web wrapper induction exploiting visual interfaces [Gottlob et al.] [Sahuguet et al.]
• generate XPath location paths of DOM nodes
• can benefit from using Spatial XPath
31. Idea: Use Spatial Relations among DOM Nodes
34. Spatial DOM (SDOM)
35. Spatial Relations Among Nodes
Rectangular Cardinal Relations (RCR)
r1 E:NE r2
Spatial models allow for expressing disjunctive relations among regions
Topological Relations
36. XPath Example
37. SXPath Example
39. From XPath 1.0 towards Spatial Querying with SXPath
SXPath features
adopts intuitive path notation:
axis::nodetest [pred]*
adds to XPath
spatial axes
spatial position functions
natural semantics for spatial querying
maintains polynomial time combined complexity
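The path notation axis::nodetest [pred]* named on this slide can be illustrated with a small parser. This is only a sketch of the notation, not the SXPath implementation: the function name parse_step is hypothetical, predicates containing "]" inside string literals are not handled, and the use of "E" as a spatial axis name is an assumption based on the cardinal relation names shown elsewhere in the talk.

```python
import re

# One location step: axis "::" nodetest, followed by zero or more [predicates].
STEP = re.compile(r"([\w-]+)::([\w*()-]+)((?:\[[^\]]*\])*)")


def parse_step(step: str):
    """Split a location step into (axis, nodetest, list of predicate strings)."""
    match = STEP.fullmatch(step.strip())
    if match is None:
        raise ValueError(f"not a location step: {step!r}")
    axis, nodetest, predicates = match.groups()
    return axis, nodetest, re.findall(r"\[([^\]]*)\]", predicates)


# A plain XPath 1.0 step with two predicates.
xpath_step = parse_step("descendant::img[@alt][position()=1]")

# A step with a hypothetical spatial axis name (assumption, see above).
spatial_step = parse_step("E::b")
```

Because spatial axes reuse the same step shape, the same parser covers both kinds of step; only the set of allowed axis names would differ.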
40. Why SXPath?
resilient wrappers
an XPath for familiarity
Information extraction
Simplicity
human oriented
efficiency
web applications
42. Spatial DOM (SDOM)
55. Existing Extensions to PDF
56. Page Header
Text Area and Paragraphs
Table
Item List
Page Number
Page Footer
57. Outline
The Social Media Case
Motivation
State-of-the-Art
Core idea of SXPath
SXPath Language
• Spatial Data Model
• Syntax & Semantics
• Complexity
Implementation
Evaluation
The Bio-Case
Motivation
The (Biochemical) Deep Web
Contributions
• Page-level wrapper induction
• Site-wide wrapper generation
• Error Correction by Mutual Reinforcement
Conclusions and Future Directions
58. >1000 Life Science DBs, number growing quickly
59. Biochemical Web Sites: Observations - 1
Labeled Data
Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)
Total  Labeled  Unlabeled  Unlabeled (Redundant)
754    719      19         16
Table 1: Data fields across 20 Biochemical Web sites
60. Biochemical Web Sites: Observations - 2
Dynamic Web Pages
61. Biochemical Web Sites: Observations - 3
Rich Site Structure
62. Biochemical Web Sites: Observations - 4
Web Services
Survey: 11 of 100 databases [1] provide APIs
Incomplete coverage
Varying granularity
No semantics in the service description
[1] Databases indexed by the Nucleic Acids Research Journal (http://www3.oup.co.uk/nar/database/). Complete survey available at http://sabiork.villa-bosch.de/index.html/survey.html
63. Biochemical Web Sites: Implications
Induce Wrapper
64. Contributions
Unsupervised Page-Level Wrapper Induction
Unsupervised Site-Wide Wrapper Induction (Site Structure Discovery)
Automatic Error Detection and Correction by Mutual Reinforcement
72. Site-Wide Wrapper Induction: Observations
Not all pages contain data (e.g. legal disclaimers, contact pages, navigational menus)
An efficient approach should ignore these pages
We don't need to learn the entire site structure
73. Site-Wide Wrapper Induction: Observations - 2
Classified link-collections point to data-intensive pages of the same class.
74. Site-Wide Wrapper Induction: Observations - 3
Pages belonging to the same class describe the same concepts
Some concepts are sometimes omitted
Ordering is always the same
75. Site-Wide Wrapper Induction
1. Start with C0; S = {C0}
2. Follow all classified link-collections
3. Generate wrappers for each set of target pages
4. Determine if new class is formed: if C0 != Ci (i>0), S = S + Ci
5. Add navigation step
6. Repeat 2 – 5 for each new class formed in 4
Navigation Steps
W = {(C0 → L1 → C0), (C0 → L2 → C2), (C0 → L3 → C3)}
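The six steps on this slide can be sketched as a breadth-first traversal over page classes. This is a minimal illustration under strong assumptions: the toy SITE dictionary is hypothetical, the class of each link-collection's target pages is given up front (in the talk it is determined by generating and comparing wrappers), and page-level wrapper generation itself is left out.

```python
from collections import deque

# Hypothetical toy site: page class -> {link-collection: class of its target pages}.
SITE = {
    "C0": {"L1": "C0", "L2": "C2", "L3": "C3"},
    "C2": {},
    "C3": {},
}


def induce_site_wrapper(site, start="C0"):
    """Return (S, W): discovered classes and navigation steps."""
    known = [start]            # S = {C0}                      (step 1)
    steps = []                 # W, the navigation steps
    queue = deque([start])
    while queue:
        source = queue.popleft()
        for link, target in site[source].items():   # follow link-collections (step 2)
            # Step 3, generating wrappers for the target pages, is not modelled here.
            steps.append((source, link, target))    # add navigation step    (step 5)
            if target not in known:                 # new class formed?      (step 4)
                known.append(target)
                queue.append(target)                # repeat for new classes (step 6)
    return known, steps


classes, navigation = induce_site_wrapper(SITE)
```

With this toy input the navigation steps come out as the three triples listed for W on the slide, and L1 adds no new class because its targets belong to C0 again.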
76. Site-Wide Wrapper Induction – Evaluation
SOURCE    #C  #C'  #D    TP    FN    FP   P     R
MSDChem   1   1    N/A   N/A   N/A   N/A  N/A   N/A
ChEBI     3   1    1711  1195  516   0    100   69.8
KEGG      10  7    6223  5044  1179  188  97    81.1
Average                                   98.5  75.5
Table 3: Site-wide wrapper induction results, 20 test pages for each class
(C=Classes, C'=Classes discovered, D=Data entries, P=Precision in %, R=Recall in %)
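The P and R columns can be related to the counts in the table with the standard definitions of precision and recall. The slide does not state the formulas, so using the standard ones is an assumption, and the function name precision_recall is hypothetical.

```python
def precision_recall(tp: int, fn: int, fp: int):
    """Standard precision and recall, returned as percentages."""
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return precision, recall


# Counts taken from the ChEBI and KEGG rows of the table.
chebi_precision, chebi_recall = precision_recall(tp=1195, fn=516, fp=0)
kegg_precision, kegg_recall = precision_recall(tp=5044, fn=1179, fp=188)
```

In both rows TP + FN equals #D, so recall is the share of data entries that were extracted. Values computed this way can differ slightly from the percentages on the slide, for example through rounding or through how the slide's figures were aggregated.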
77. Error Detection and Correction: Mutual Reinforcement
Observation: Certain data reappear on more than one class of pages
78. Error Detection and Correction: Mutual Reinforcement
Reinforcement if reappearing data are correctly classified as data
Otherwise it points to misclassification
Label-Data Mismatch
• Correction: Introduce more samples
Label-Label Mismatch
• Cannot be detected
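The detection side of mutual reinforcement can be sketched as a consistency check across page classes. The input format, the sample strings, and the function name below are all hypothetical; the sketch only shows why a label-data mismatch is visible while a label-label mismatch is not.

```python
from collections import defaultdict


def find_label_data_mismatches(classified):
    """classified: page class -> {string: "label" or "data"} (hypothetical wrapper output).

    Returns the strings whose role differs between page classes.
    """
    roles = defaultdict(dict)
    for page_class, strings in classified.items():
        for text, role in strings.items():
            roles[text][page_class] = role
    return {
        text: by_class
        for text, by_class in roles.items()
        if len(set(by_class.values())) > 1
    }


# Hypothetical wrapper output for two page classes of one site.
WRAPPER_OUTPUT = {
    "compound": {"Name": "label", "Glucose": "data", "C6H12O6": "data"},
    "reaction": {"Substrate": "label", "Glucose": "label", "C6H12O6": "data"},
}
mismatches = find_label_data_mismatches(WRAPPER_OUTPUT)
```

Here "C6H12O6" is classified as data on both classes, which reinforces the classification, while "Glucose" is flagged. A string that is wrongly treated as a label on every class produces no disagreement, which matches the slide's remark that label-label mismatches cannot be detected.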
79. Where to go next?
Reverse engineering production
1. LOD: emitting RDF & RDFS
2. Navigation model: what belongs to what
3. Interaction model (not treated at all by us so far)
4. Layout model: spatial positioning
Capture this generative model using machine learning
Relational learning
• Markov logic programmes?
• …?
80. Bibliography
Linda d’Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.
Saqib Mir, Steffen Staab, Isabel Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.
Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.
81. WeST – Web Science & Technologies
University of Koblenz Landau, Germany
Thank you for your attention!
Editor's Notes
The spatial layout of the content elements of a Web page helps human readers understand the semantics of the content. On this page (a last-fm page), a user can identify the details of a music band because the descriptive information is close to the image and each music band is shown using the same visual pattern. Web designers lay out pages so that each unit of information is presented as a whole. This unity comes from the spatial consistency of the information, which conveys the semantics to the user, and we want to exploit that spatial consistency for querying pages. This is possible by exploiting how layout engines work. The page is presentation-oriented: it has an internal representation, and the browser produces the layout from it. The spatial layout of the nodes allows the user to identify homogeneous pieces of information, here the descriptions of the music bands. Presentation-oriented means human-oriented: a human understands that pieces of information refer to the same music band because they have a certain spatial continuity.
Layout engines of Web browsers assign a rectangle to each DOM element. Given the internal code of a page, how can we query the page using spatial information? When the browser renders a page, it presents the information in rectangles that we can call minimum bounding rectangles. The layout engine assigns one to each node, based on the stylesheet chosen by the Web designer. (In the animation, the DOM and the rendered page are shown side by side: "Coldplay" is written inside its rectangle, which lights up, and then the img lights up.) Because pages are presentation-oriented, style is also used for emphasis so that a human recognizes the important information, for example a name in bold. Exploiting such cues is left for future developments.
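As a small illustration of the minimum bounding rectangles mentioned in this note, the sketch below computes the rectangle that encloses a set of node rectangles. The Rect type, the function name, and the pixel coordinates are hypothetical; a real layout engine would supply the boxes.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in page pixels (hypothetical layout-engine output)."""
    x1: float
    y1: float
    x2: float
    y2: float


def minimum_bounding_rectangle(rects: Iterable[Rect]) -> Rect:
    """Smallest axis-aligned rectangle that encloses every rectangle in rects."""
    rects = list(rects)
    if not rects:
        raise ValueError("need at least one rectangle")
    return Rect(
        min(r.x1 for r in rects),
        min(r.y1 for r in rects),
        max(r.x2 for r in rects),
        max(r.y2 for r in rects),
    )


# Hypothetical boxes for a band photo and the band name next to it.
photo = Rect(10, 10, 110, 110)
name = Rect(120, 10, 260, 40)
profile = minimum_bounding_rectangle([photo, name])
```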
As shown in the figure, the complex, involved, and nested structure of the DOM has a clear presentation that enables users to read and understand the meaning of the information presented in the Web page.
The rectangular algebra (RA) is an extension of Allen's interval algebra to the two-dimensional case. For example, the relation x (b,e) y is obtained intuitively by applying the interval algebra to both sides of the rectangle. We can therefore use a spatial model from geospatial databases to represent the mutual relationships between objects. The rectangular algebra defines 169 relations, which cover all possible relations between two rectangles. The resulting tree is based not on nesting but on containment and spatial relations.
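The idea of applying the interval algebra to both sides of a rectangle can be sketched as follows. The relation names are the usual abbreviations for Allen's 13 interval relations (b, m, o, s, d, f, e and their inverses); representing a rectangle as an (x1, y1, x2, y2) tuple and the function names are assumptions made for this illustration.

```python
from typing import Tuple

Interval = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed representation

ALLEN = ["b", "m", "o", "s", "d", "f", "e", "bi", "mi", "oi", "si", "di", "fi"]


def allen(a: Interval, b: Interval) -> str:
    """Allen relation that holds between proper intervals a and b."""
    a1, a2 = a
    b1, b2 = b
    if a2 < b1:
        return "b"                      # a before b
    if a2 == b1:
        return "m"                      # a meets b
    if b2 < a1:
        return "bi"                     # a after b
    if b2 == a1:
        return "mi"                     # a met by b
    if a1 == b1 and a2 == b2:
        return "e"                      # equal
    if a1 == b1:
        return "s" if a2 < b2 else "si"  # starts / started by
    if a2 == b2:
        return "f" if a1 > b1 else "fi"  # finishes / finished by
    if a1 > b1 and a2 < b2:
        return "d"                      # during
    if a1 < b1 and a2 > b2:
        return "di"                     # contains
    return "o" if a1 < b1 else "oi"     # overlaps / overlapped by


def ra_relation(r: Rect, s: Rect) -> Tuple[str, str]:
    """Pair of Allen relations on the x projections and on the y projections."""
    return allen((r[0], r[2]), (s[0], s[2])), allen((r[1], r[3]), (s[1], s[3]))


# x lies to the left of y at the same height and with the same vertical extent.
relation = ra_relation((0, 0, 10, 10), (20, 0, 30, 10))
number_of_relations = len({(p, q) for p in ALLEN for q in ALLEN})
```

The count of 169 follows directly from pairing the 13 interval relations on one axis with the 13 on the other.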
To represent the spatial relations between pairs of content items (rectangles) laid out by the layout engine of a Web browser when it presents a Web page, we use the rectangular algebra relation model. This model is well known and widely adopted in geospatial databases, and it has useful properties, such as invertibility, that enable optimized evaluation of the SXPath language.
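Invertibility can be made concrete with a lookup table: if x R y holds, then y R' x holds for the inverse relation R'. The sketch below assumes the usual names for Allen's interval relations and inverts an RA relation component by component. How the SXPath engine actually uses this property is not described in the notes; one plausible reading is that only one direction of each pair needs to be stored.

```python
# Inverse of each Allen interval relation; "e" (equal) is its own inverse.
INVERSE = {"b": "bi", "m": "mi", "o": "oi", "s": "si", "d": "di", "f": "fi", "e": "e"}
INVERSE.update({inverse: relation for relation, inverse in list(INVERSE.items())})


def invert(ra_relation):
    """Inverse of a rectangular algebra relation given as (x_relation, y_relation)."""
    x_relation, y_relation = ra_relation
    return INVERSE[x_relation], INVERSE[y_relation]


inverted = invert(("b", "e"))
```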
No comment; everything is already on the slide.
By representing the RA relations we obtain the SDOM, in which continuous arrows represent spatial containment and dotted arrows represent RA relations. This gives us a model of a Web page that represents the spatial relations between each pair of DOM nodes. Spatial relations also enable the definition of a spatial ordering along the four main directions (North, South, East, and West), as shown in the figure. The intuition is that we can build a tree of the page that is based not on the nesting of tags but on spatial containment and spatial relations. For example, between the image and "Radiohead" there is the spatial relation (s, bi). This new data model, the Spatial DOM, is the Document Object Model in which the solid edges are containment relations and the dashed edges are RA relations. It captures the spatial arrangement of the objects on the page rather than the simple nesting of tags, and it makes it possible to define a spatial ordering in the page.
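The containment part of such a model can be sketched by giving every rectangle the smallest rectangle that encloses it as its parent. This is an assumption about how a containment tree could be derived, not the construction used in the SXPath system; node names, coordinates, and function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed representation


def contains(outer: Rect, inner: Rect) -> bool:
    """True if inner lies within outer and the two rectangles are not identical."""
    return (
        outer != inner
        and outer[0] <= inner[0]
        and outer[1] <= inner[1]
        and outer[2] >= inner[2]
        and outer[3] >= inner[3]
    )


def area(r: Rect) -> float:
    return (r[2] - r[0]) * (r[3] - r[1])


def containment_parents(rects: Dict[str, Rect]) -> Dict[str, Optional[str]]:
    """Map each node to the smallest node that spatially contains it (None for roots)."""
    parents: Dict[str, Optional[str]] = {}
    for name, rect in rects.items():
        candidates = [n for n, other in rects.items() if n != name and contains(other, rect)]
        parents[name] = min(candidates, key=lambda n: area(rects[n]), default=None)
    return parents


# Hypothetical rendered boxes for a page with one band profile.
BOXES = {
    "page": (0, 0, 800, 600),
    "profile": (10, 10, 260, 110),
    "img": (10, 10, 110, 110),
    "name": (120, 10, 260, 40),
}
parents = containment_parents(BOXES)
```

The RA relations between siblings (the dotted arrows) would be added on top of this tree.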
The RA relations are too fine-grained and verbose, and they are difficult for a human to use. We therefore also introduce Rectangular Cardinal Relations (RCR) and topological relations, two of the most intuitive and widespread spatial models. They are mapped onto RA relations and let users query spatial relations in a more intuitive way. RCR divides the plane into regional tiles, which makes it simple to use.
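The tiling behind cardinal relations can be sketched as follows: the sides of a reference rectangle b split the plane into nine tiles (NW, N, NE, W, B, E, SW, S, SE), and the relation of a to b is the set of tiles that a overlaps. The sketch assumes screen coordinates with y growing downwards, so north means smaller y. The tuple representation and the function name are hypothetical.

```python
from typing import Set, Tuple

Rect = Tuple[float, float, float, float]  # (x1, y1, x2, y2), y grows downwards (assumption)


def rcr_tiles(a: Rect, b: Rect) -> Set[str]:
    """Tiles of the 3x3 partition induced by b that rectangle a overlaps."""
    columns = []
    if a[0] < b[0]:
        columns.append("W")               # part of a lies left of b
    if a[0] < b[2] and a[2] > b[0]:
        columns.append("")                # part of a lies in b's column
    if a[2] > b[2]:
        columns.append("E")               # part of a lies right of b
    rows = []
    if a[1] < b[1]:
        rows.append("N")                  # part of a lies above b
    if a[1] < b[3] and a[3] > b[1]:
        rows.append("")                   # part of a lies in b's row
    if a[3] > b[3]:
        rows.append("S")                  # part of a lies below b
    # The middle tile has no direction letters; it is named "B".
    return {(row + column) or "B" for row in rows for column in columns}


reference = (0, 0, 20, 10)
east_and_north_east = rcr_tiles((30, -5, 40, 5), reference)
inside = rcr_tiles((5, 2, 10, 8), reference)
```

The first result is the tile set written as E:NE on the RCR slide: the rectangle lies to the right of the reference and straddles its top edge.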
This slide compares XPath and SXPath. Suppose a user needs to extract the details of a music band. With XPath, the user needs to know the intricate DOM structure. With SXPath, the user can exploit the visual pattern adopted by the Web designers for organizing the details of the music bands.
SXPath expressions are also resilient: a given visual pattern can be queried in the same way on different Web pages that have different internal encodings. A single query can therefore match different DOMs because their spatial representation is the same, so the language captures visual patterns on Web pages in a general way. Example 2: a single data record can be split across different sub-trees, whereas wrapper induction techniques like DEPTA [Zhai et al.] recognize data records when they are encoded in the DOM as consecutive similar subtrees.
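The resilience argument can be emulated in plain Python. The two snippets below are hypothetical encodings of the same visual pattern, and the x1..y2 attributes stand in for rectangles that a real layout engine would compute. The layout-based function is not SXPath syntax; it only mimics the idea of selecting the text that lies east of an image.

```python
import xml.etree.ElementTree as ET

# Same visual pattern, two different (hypothetical) DOM encodings.
PAGE_A = """<div><table><tr>
  <td><img x1="0" y1="0" x2="100" y2="100"/></td>
  <td><b x1="110" y1="0" x2="250" y2="30">Coldplay</b></td>
</tr></table></div>"""

PAGE_B = """<div>
  <span><img x1="0" y1="0" x2="100" y2="100"/></span>
  <p><b x1="110" y1="0" x2="250" y2="30">Coldplay</b></p>
</div>"""


def rect(element):
    """Read the stand-in rectangle (x1, y1, x2, y2) from an element's attributes."""
    return tuple(float(element.get(key)) for key in ("x1", "y1", "x2", "y2"))


def east_of(a, b):
    """True if a lies entirely to the east of b and shares some vertical extent with it."""
    return a[0] >= b[2] and a[1] < b[3] and a[3] > b[1]


def names_by_structure(page):
    """Structure-based query: depends on the table nesting of PAGE_A."""
    return [b.text for b in ET.fromstring(page).findall("./table/tr/td/b")]


def names_by_layout(page):
    """Layout-based query: bold text that lies east of some image."""
    root = ET.fromstring(page)
    images = [rect(img) for img in root.iter("img")]
    return [b.text for b in root.iter("b") if any(east_of(rect(b), i) for i in images)]
```

The structural path finds the band name only on the page it was written for, while the layout-based selection returns it on both pages.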
The SXPath language was designed to support information extraction from presentation-oriented documents. It derives from XPath, so users do not have to learn it from scratch. It is simple to learn and more human-oriented than XPath. SXPath maintains polynomial combined complexity and is a stepping stone for different kinds of Web applications aimed at acquiring information from the Web.
The SDOM is essentially the traditional DOM enriched with the set of rectangular algebra relations between each pair of nodes.
The SXPath language is an extension of the XPath language. Besides the traditional axes, SXPath provides users with a new set of axes called spatial axes. Spatial axes are expressed by rectangular cardinal relations and topological relations, which are more intuitive for humans and can easily be mapped into rectangular algebra relations.
Spatial axes are defined as interpreted binary relations expressed by RCR and mapped into RA by means of the function mu.
As said before, the SXPath language extends the XPath language with spatial axes and spatial position functions. We studied fragments of SXPath that correspond to XPath fragments already studied in the literature, in order to get a clear picture of the expressivity and complexity of the language. In particular, we studied Core XPath/SXPath (the navigational core of XPath/SXPath) and the WF/Spatial WF fragments (which allow position/spatial position functions).
The semantics of SXPath is given using the concept of context introduced by Wadler and also adopted by Gottlob in his studies on XPath expressivity and complexity. In SXPath the context must be extended with the spatial positions of nodes and the context sizes for each direction, so we have a 12-tuple instead of a 3-tuple.
We have given the formal semantics of SXPath using the denotational semantics approach. The main difference from the XPath formal semantics lies in the function that computes the spatial axes, defined as shown here.
Each expression is evaluated over the context, as shown here.
The study of the combined computational complexity of different SXPath fragments shows that SXPath maintains polynomial-time computational complexity. SXPath has a greater exponent in the polynomial because of the quadratic number of relations stored in the SDOM that need to be explored during the evaluation of spatial axes. We compute spatial axes using the same dynamic programming approach suggested by Gottlob, but we have to explore a quadratic number of further relations in the SDOM.
Core SXPath queries can be evaluated in time O(|D|^2 · |Q|), where |D| is the size of the XML document and |Q| is the size of the query Q.
Proof sketch: there are O(|V_v|^2) many spatial relations to be considered in addition to the O(|V|) many relations of the DOM, incurring a higher polynomial worst-case complexity.
To obtain a polynomial-time combined complexity bound for SXPath query evaluation, we use dynamic programming, adopting the Context-Value Table (CV-Table) principle introduced by Gottlob et al.
Position and size are computed on demand; we compute all spatial position functions in a loop over all pairs of previous/current nodes.
Full SXPath computational costs are dominated by the string operations belonging to XPath 1.0.
In SWF, the computation of spatial ordering generates a higher polynomial worst case than XPath 1.0.
The architecture of the system consists of a parser for SXPath expressions (query parser), a builder of the SDOM, and an engine that efficiently evaluates SXPath queries.
The GUI shows the DOM, lets the user write queries, and displays the query results on the screen.
These two log-log plots show data efficiency and query efficiency. To evaluate data efficiency we used documents of growing size; to evaluate query efficiency we used a query with an increasing number of location steps. The plots show that the system's behavior is polynomial with respect to both data and query size.
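Polynomial behavior shows up in a log-log plot as a straight line whose slope is the exponent. The sketch below estimates that slope with an ordinary least-squares fit. The measurements are synthetic, generated from a quadratic for illustration only; they are not the timings from the talk, and the function name is hypothetical.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    return covariance / variance


# Synthetic data (assumption): running time grows quadratically with document size.
sizes = [100, 200, 400, 800]
times = [3e-6 * size ** 2 for size in sizes]
slope = loglog_slope(sizes, times)
```

For the synthetic quadratic data the estimated slope is 2 up to floating-point error; a slope that keeps growing with the input size would indicate super-polynomial behavior instead.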
To evaluate usability, we asked students who already knew the XPath language to learn SXPath and use it to extract product names and prices from Web pages. The experiment showed that users found the language usable and effective.
In the second experiment we evaluated the effectiveness of SXPath with respect to XPath. We found that being able to exploit the visual appearance of Web pages lets users write queries in fewer attempts than with XPath, that SXPath location paths are more concise, and that SXPath is resilient (the same query can be used on different Web sites that have very different internal encodings in terms of DOM trees).
In presentation-oriented documents, the layout of the elements provides visual cues that help human users understand the meaning of the content. The contents of both Web pages and PDF documents are presented to users on a two-dimensional Cartesian plane, and the meaning of the content is clear only after rendering. For example, the PDF encoding consists of a (completely flat) stream of strings, each equipped with the position at which it must appear on the page. The table in the figure can be understood only after rendering.
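The point about flat streams can be illustrated with a toy example: strings arrive in arbitrary order with their page positions, and the table rows only become visible once the strings are grouped by vertical position and ordered horizontally. The (x, y, text) format, the tolerance value, and the function name are assumptions for this sketch; it is not a PDF parser.

```python
def rows_from_stream(stream, y_tolerance=2.0):
    """Group positioned strings into rows (similar y), each row ordered by x."""
    rows = []  # list of (y of the row's first string, [(x, text), ...])
    for x, y, text in sorted(stream, key=lambda cell: (cell[1], cell[0])):
        if rows and abs(rows[-1][0] - y) <= y_tolerance:
            rows[-1][1].append((x, text))     # same row as the previous string
        else:
            rows.append((y, [(x, text)]))     # start a new row
    return [[text for _, text in sorted(cells)] for _, cells in rows]


# Hypothetical flat stream of (x, y, text) triples in arbitrary order.
STREAM = [
    (200, 100.5, "R"),
    (50, 100.0, "SOURCE"),
    (50, 120.0, "ChEBI"),
    (200, 119.5, "69.8"),
]
table = rows_from_stream(STREAM)
```

Only after this positional grouping does the header line up with its values, which is the sense in which the table can be understood only after rendering.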
Using document layout analysis and document understanding techniques, combined with table recognition methods, the different parts of a PDF document can be recognized. In this way the elements of a PDF document can be represented in a more abstract format such as DOM or SDOM.