On the Separability of Structural Classes of Communities (Bruno Abrahao)
Three major factors govern the intricacies of community extraction in networks: (1) the application domain includes a wide variety of networks of fundamentally different natures, (2) the literature offers a multitude of disparate community detection algorithms, and (3) there is no consensus characterizing how to discriminate communities from non-communities. In this paper, we present a comprehensive analysis of community properties through a class separability framework. Our approach enables the assessment of the structural dissimilarity among the output of multiple community detection algorithms and between the output of algorithms and communities that arise in practice. To demonstrate this concept, we furnish our method with a large set of structural properties and multiple community detection algorithms. Applied to a diverse collection of large scale network datasets, the analysis reveals that (1) the different detection algorithms extract fundamentally different structures; (2) the structure of communities that arise in practice is closest to that of communities that random-walk-based algorithms extract, although still significantly different from that of the output of all the algorithms; and (3) a small subset of the properties are nearly as discriminative as the full set, while making explicit the ways in which the algorithms produce biases. Our framework enables an informed choice of the most suitable community detection method for a given purpose and network and allows for a comparison of existing community detection algorithms while guiding the design of new ones.
Using HISCO and HISCAM to code and analyze occupations (Richard Zijdeman)
This is the lab session I provided for the European Historical Sample Network Summer School on why occupations are important in historical research and how we can appropriately deal with them using HISCO and HISCAM.
CAMA 2007 Visions of the Future for Contextualized Attention Metadata (Wayne Hodgins)
Invited keynote presentation by Wayne Hodgins at the CAMA 2007 (http://ariadne.cs.kuleuven.ac.be/cama2007/) Contextualized Attention Metadata workshop at the Joint Conference on Digital Libraries, JCDL 2007 (http://www.jcdl2007.org/), in Vancouver, British Columbia, Canada, on June 23, 2007.
Application Architecture Patterns talk tailored for PHP / Symfony developers. (2016). We describe the traditional layers, a Domain Model, Hexagonal architecture and how the pieces fit together.
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-... (Steffen Staab)
Data spaces in distributed environments should be allowed to evolve in agile ways, giving data space owners great flexibility about which data they store. Agility and heterogeneity, however, jeopardize data exchanges because representations may build on varying ontologies and data consumers may not rely on the semantic correctness of their queries in the context of semantically heterogeneous, evolving data spaces. Graph data spaces are one example of a powerful model for representing and querying data whose semantics may change over time. To assert and enforce conditions on individual graph data spaces, shape languages (e.g. SHACL) have been developed. We investigate the question of how querying and programming can be guarded by reasoning over SHACL constraints in a distributed setting and we sketch a picture of what a future landscape based on semantically heterogeneous data spaces might look like.
Knowledge graphs for knowing more and knowing for sure (Steffen Staab)
Knowledge graphs have been conceived to collect heterogeneous data and knowledge about large domains, e.g. medical or engineering domains, and to allow versatile access to such collections by means of querying and logical reasoning. A surge of methods has responded to additional requirements in recent years. (i) Knowledge graph embeddings use similarity and analogy of structures to speculatively add to the collected data and knowledge. (ii) Queries with shapes and schema information can be typed to provide certainty about results. We survey both developments and find that the development of techniques happens in disjoint communities that mostly do not understand each other, thus limiting the proper and most versatile use of knowledge graphs.
Symbolic Background Knowledge for Machine Learning (Steffen Staab)
Machine learning aims at learning complex functions from data. Very often, this challenge remains ill-defined given the available amount of data; however, background knowledge that is available as knowledge graphs, ontologies or symbolic (physical) equations allows for an improved specification of the targeted solution. In this talk, we want to discuss several use cases that include symbolic background knowledge as regularizing priors, as constraints or as other inductive biases in machine learning tasks.
Social Networks and Media: Multi-disciplinary Approaches for a multi-dimens... (Steffen Staab)
Presentation by Oul Han and Steffen Staab
Workshop "Social Networks and Media" (Soziale Netzwerke und Medien) at the meeting of the Fakultätentag Informatik, 14 November 2019, Hamburg
Web Futures: Inclusive, Intelligent, Sustainable (Steffen Staab)
Almost from its very beginning, the Web has been ambivalent.
It has facilitated freedom of information, but this also included the freedom to spread misinformation. It has facilitated intelligent personalization, but at the cost of intrusion into our private lives. It has included more people than any other system before, but at the risk of exploiting them.
The Web is full of such ambivalences and the usage of artificial intelligences threatens to further amplify these ambivalences. To further the good and to contain the negative consequences, we need a research agenda studying and engineering the Web, as well as numerous activities by societies at large. In this talk, I will present and discuss a joint effort by an interdisciplinary team of Web Scientists to prepare and pursue such an agenda.
Concepts in Application Context (How we may think conceptually) (Steffen Staab)
Formal concept analysis (FCA) derives a hierarchy of concepts in a formal context that relates objects with attributes. This approach is very well aligned with the traditions of Frege, Saussure and Peirce, which relate a signifier (e.g. a word/an attribute) to a mental concept evoked by this word and meant to refer to a specific object in the real world. However, in the practice of natural languages as well as artificial languages (e.g. programming languages), the application context often constitutes a latent variable that influences the interpretation of a signifier. We present some of our current work that analyzes the usage of words in natural language in varying application contexts as well as the usage of variables in programming languages in varying application contexts in order to provide conceptual constraints on these signifiers.
Storing and Querying Semantic Data in the Cloud (Steffen Staab)
Daniel Janke and Steffen Staab. Tutorial at Reasoning Web
With the proliferation of semantic data, there is a need to cope with trillions of triples by horizontally scaling data management in the cloud. To this end, one needs to advance (i) strategies for data placement over compute and storage nodes, (ii) strategies for distributed query processing, and (iii) strategies for handling failure of compute and storage nodes. In this tutorial, we want to review challenges and how they have been addressed by research and development in the last 15 years.
Talk at Leopoldina Symposium on Digitization and its Effects on Man and Society
(Die Digitalisierung und ihre Auswirkungen auf Mensch und Gesellschaft)
leopoldina.org/de/veranstaltungen/veranstaltung/event/2464/
The evolution of the Web should move forward in an upward spiral that cycles between guiding values, engineering and science. Guiding values should comprise social values as well as system principles that further stabilization and growth of the Web. Principles I will talk about will include social inclusion, connectedness and fairness. Example efforts improve Web access for disabled people, critically assess Web structures and Web growth, and try to transfer knowledge about previously found patterns of Web growth to analogous cases.
(Semi-)Automatic analysis of online contents (Steffen Staab)
How can media and discourse analyses combine approaches from the humanities and statistical methods to deeply analyse large amounts of online content?
Invited talk at the section workshop (Fachgruppen-Workshop) of the Deutsche Gesellschaft für Publizistik- und Kommunikationswissenschaft
Social Media: Echo Chamber or Public Space?
Approaches to the Computer-Assisted Analysis of Internet Corpora
6 October 2016, Karlsruhe Institute of Technology (KIT)
Joint Keynote at Int. Conference on Knowledge Engineering and Semantic Web and Prague Computer Science Seminar, Prague, September 22, 2016
The challenges of Big Data are frequently explained by dealing with Volume, Velocity, Variety and Veracity. The large variety of data in organizations results from accessing different information systems with heterogeneous schemata or ontologies. In this talk I will present the research efforts that target the management of such broad data.
They include: (i) an integrated development environment for programming with broad data, (ii) a query language that allows for typing of query results, (iii) a typed lambda-calculus based on description logics, and (iv) efficient access to data repositories via schema indices.
We use metadata of various kinds to improve and enrich text document clustering using an extension of Latent Dirichlet Allocation (LDA). The methods are fully implemented and evaluated, and the software is available on GitHub.
These are the slides of an invited talk I gave September 8 at the Alexandria Workshop of TPDL-2016: http://alexandria-project.eu/events/3rd-workshop/
Essentials of Automations: Optimizing FME Workflows with Parameters (Safe Software)
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti... (Jeffrey Haguewood)
Sidekick Solutions uses Bonterra Impact Management (fka Social Solutions Apricot) and automation solutions to integrate data for business workflows.
We believe integration and automation are essential to user experience and the promise of efficient work through technology. Automation is the critical ingredient to realizing that full vision. We develop integration products and services for Bonterra Case Management software to support the deployment of automations for a variety of use cases.
This video focuses on the notifications, alerts, and approval requests using Slack for Bonterra Impact Management. The solutions covered in this webinar can also be deployed for Microsoft Teams.
Interested in deploying notification automations for Bonterra Impact Management? Contact us at sales@sidekicksolutionsllc.com to discuss next steps.
DevOps and Testing slides at DASA Connect (Kari Kakkonen)
Slides by Rik Marselis and me from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps is. We closed with a lovely workshop in which the participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... (James Anderson)
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 (Tobias Schneck)
As AI technology pushes into IT, I wondered, as an “infrastructure container Kubernetes guy”, how this fancy AI technology can be managed from an infrastructure operations point of view. Is it possible to apply our beloved cloud native principles as well? What benefits could the two technologies bring to each other?
Let me take these questions and give you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss which cloud/on-premise strategy we may need to make it work on our own infrastructure from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and of what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insight into the approaches I already have working for real.
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Neuro-symbolic is not enough, we need neuro-*semantic* (Frank van Harmelen)
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this is illustrated with link prediction over knowledge graphs, but the argument is general.
Knowledge engineering: from people to machines and back
Building and Using Knowledge Bases
1. WeST – Web Science & Technologies
University of Koblenz-Landau, Germany
Building and Using Knowledge Bases
Steffen Staab
Saqib Mir – European Bioinformatics Institute
Ermelinda d'Oro, Massimo Ruffolo – Univ. Calabria, Italy
& WeST Team
2. Institut WeST – Web Science & Technologies
Semantic Web, Web Retrieval, Social Web, Multimedia Web, Software Web, GESIS
WeST – Web Science & Technologies, Steffen Staab, staab@uni-koblenz.de
3. PhD thesis trauma 17 years ago
„Nach dem Auspacken der LPS 105 präsentiert sich dem
Betrachter ein stabiles Laufwerk, das genauso geringe
Außenmaße besitzt wie die Maxtor.“
„Having unwrapped the LPS 105 – reveals itself to the onlooker – a stable disk drive, which has similarly small volume as the Maxtor.“
4. GENERAL MOTIVATION
General motivation is not information extraction,
but it is solving tasks!
5. General objective: Extracting to LOD
useAsExample hasLivedIn
Crucial to know: Ontologies nowadays reflect this structure
Ontologies are
• Modular (vs one to rule them all)
• Distributed (vs defined in one place)
• Connected (vs isolated templates)
• Extensible (vs claimed to be finished)
• Lightweight (vs computationally intractable)
• Popular ones are used more often (vs people disagreeing)
Ontologies – LEGO style
6. Most famous applications
Steve Macbeth (Microsoft): - discussion wrt Schema.org -
“about 7% of pages we crawl have mark-up”
http://www.w3.org/2012/06/06-schema-minutes.html
LOD Cloud
Google Knowledge Graph
Bing gets its own knowledge graph
http://searchengineland.com/bing-britannica-partnership-123930
7. Example ontology-based application 1:
ANALYSIS OF URBAN PARAMETERS
8. General objective: Analysing LOD
useAsExample hasLivedIn
11. Example ontology-based application:
FACETED MULTIMEDIA EXPLORATION
12. Making Web 2.0 More Accessible
[Schenk et al; JoWS 2009]
[Figure labels: GeoNames, Links, Location, Persons, Knowledge, Tags, low- to midlevel features]
13. Choosing between Koblenz – and Koblenz
Video at: http://vimeo.com/2057249
16. A tag view of „Koblenz“ & „Castle“
17. Semantic Identity – Festung Ehrenbreitstein
18. Persons – Celebrities, FOAFers & Flickr Users
Billion Triples Challenge 2008, 1st Prize
[Schenk et al; JoWS 2009]
19. Now on to information extraction:
OBSERVATIONS ON INFORMATION EXTRACTION
20. Challenges & Opportunities for IE
Not all web pages are created equal
21. Challenges & Opportunities for IE
Some challenges are the same, e.g. finding type instances
22. Challenges & Opportunities for IE
Some challenges are the same, e.g. finding relation instances
23. Challenges & Opportunities for IE
Some contain concepts and their descriptions, some don't
No types here, few relation types
24. Challenges & Opportunities for IE
Knowing that they are instances and of which type
Textual indication, positional indication
25. Challenges & Opportunities for IE
To some extent positional and layout indications work across languages and sites
26. Challenges & Opportunities for IE
owl:sameAs
We should not only think about Web pages, but about Web sites
27. Challenges & Opportunities for IE
We should not only think about Web pages, but about Web sites
owl:sameAs
28. Comparing related work to our objectives
Related work objectives                         | Our objectives
IE on Web pages                                 | IE on Web sites
Acquiring instances and relationship instances  | Acquiring items; classifying items into instances, concepts, relation instances, relationships
IE based on linear text                         | IE also based on spatial position
There is overlap and of course there are exceptions in related work
29. Outline
The Social Media-Case The Bio-Case
Motivation
State-of-the-Art
Core idea of SXPath
Implementation
Evaluation
[Oro et al; VLDB 2010]
31. Presentation-oriented documents
• HTML DOM structure is site specific
• Spatial arrangements are rarely explicit
• Spatial layout is hidden in complex nesting of layout elements
• Intricate DOM tree structures are conceptually difficult to query for the user (or a tool!)
32. Related Work
Web Query languages
• XPath 1.0 and XQuery 1.0
  • Established
  • Too difficult to use for scraping from intricate DOM structures
Visual languages
• Spatial Graph Grammars [Kong et al.] are quite complex in terms of both usability and efficiency
• Algebras for creating and querying multimedia interactive presentations (e.g. ppt) [Subrahmanian et al.]
Web wrapper induction exploiting visual interface [Gottlob et al.] [Sahuguet et al.]
• generate XPath location paths of DOM nodes
• can benefit from using Spatial XPath
33. Outline
The Social Media-Case The Bio-Case
Motivation
State-of-the-Art
Core idea of SXPath
Implementation
Evaluation
34. Representing Spatial Relations between DOM Nodes
35. Idea: Use Spatial Relations among DOM Nodes
36. Spatial DOM (SDOM)
38. Querying for Relations Among Nodes
Rectangular Cardinal Relations (RCR)
r1 E:NE r2
Spatial models allow for expressing disjunctive relations among regions
Topological Relations
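A relation such as "r1 E:NE r2" can be read as: r1 lies partly in the east tile and partly in the north-east tile of r2. The following is a minimal sketch of that idea, not the SXPath implementation. It assumes axis-aligned rectangles, a y axis that grows northwards (screen coordinates would need to be flipped), and the usual nine tile names (B for the tile covered by the reference rectangle itself). `Rect` and `cardinal_relation` are hypothetical names introduced only for this illustration.

```python
from typing import NamedTuple, Set


class Rect(NamedTuple):
    """Axis-aligned rectangle; assumption: y grows upward (north)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def _bands(lo, hi, ref_lo, ref_hi):
    """Bands of the reference interval that (lo, hi) overlaps:
    -1 = before it, 0 = within it, +1 = after it."""
    bands = []
    if lo < ref_lo:
        bands.append(-1)
    if hi > ref_lo and lo < ref_hi:
        bands.append(0)
    if hi > ref_hi:
        bands.append(1)
    return bands


# (column band, row band) -> tile name around the reference rectangle
TILE = {
    (-1, 1): "NW", (0, 1): "N", (1, 1): "NE",
    (-1, 0): "W", (0, 0): "B", (1, 0): "E",
    (-1, -1): "SW", (0, -1): "S", (1, -1): "SE",
}


def cardinal_relation(r1: Rect, r2: Rect) -> Set[str]:
    """Set of tiles of reference rectangle r2 that rectangle r1 overlaps."""
    cols = _bands(r1.x_min, r1.x_max, r2.x_min, r2.x_max)
    rows = _bands(r1.y_min, r1.y_max, r2.y_min, r2.y_max)
    return {TILE[(c, r)] for c in cols for r in rows}


reference = Rect(0, 0, 10, 10)
other = Rect(12, 5, 20, 15)
print(cardinal_relation(other, reference))  # the tiles E and NE
```

Because both regions are rectangles, the overlapped tiles are always the product of the overlapped column bands and row bands, which is why two one-dimensional checks suffice.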
39. XPath Example
40. SXPath Example
42. From XPath 1.0 towards Spatial Querying with SXPath
SXPath features
adopts intuitive path notation:
axis::nodetest [pred]*
adds to XPath
spatial axes
spatial position functions
natural semantics for spatial querying
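The path notation axis::nodetest [pred]* is the location-step shape of XPath 1.0. As a rough illustration of that shape only, the sketch below splits one unabbreviated step into its axis, node test and predicates. `split_step` is a hypothetical helper, the example uses the standard XPath axis following-sibling because the spatial axis names are not listed on this slide, and brackets inside string literals are not handled.

```python
def split_step(step: str):
    """Split one location step 'axis::nodetest[pred]...' into its parts."""
    axis, sep, rest = step.partition("::")
    if not sep:
        raise ValueError("expected the unabbreviated form axis::nodetest")
    first_bracket = rest.find("[")
    nodetest = rest if first_bracket == -1 else rest[:first_bracket]
    predicates, depth, start = [], 0, 0
    for i, ch in enumerate(rest):
        if ch == "[":
            if depth == 0:
                start = i + 1  # predicate text begins after the bracket
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth == 0:
                predicates.append(rest[start:i])
    return axis.strip(), nodetest.strip(), predicates


print(split_step("following-sibling::td[@class='price'][1]"))
```

Tracking bracket depth keeps a nested predicate such as tr[td[2]='x'] together as one predicate instead of cutting it at the inner bracket.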
44. Complexity Results
Formal model defined in the paper
[Oro et al; VLDB 2010]
45. Outline
The Social Media-Case The Bio-Case
Motivation
State-of-the-Art
Core idea of SXPath
Implementation
Evaluation
46. SXPath System
50. Outline
The Social Media Case
• Motivation
• State-of-the-Art
• Core idea of SXPath
• SXPath Language
  • Spatial Data Model
  • Syntax & Semantics
  • Complexity
• Implementation
• Evaluation
The Bio-Case
• Motivation
• The (Biochemical) Deep Web
• Contributions
  • Page-level wrapper induction
  • Site-wide wrapper generation
  • Error Correction by Mutual Reinforcement
• Conclusions and Future Directions
51. >1000 Life Science DBs, number growing quickly
52. Biochemical Web Sites: Observations - 1
Labeled Data
Full survey: http://sabio.villa-bosch.de/labelsurvey.html (404)
Total   Labeled   Unlabeled   Unlabeled (Redundant)
754     719       19          16
Table 1: Data fields across 20 Biochemical Web sites
53. Biochemical Web Sites: Observations - 2
Dynamic Web Pages
54. Biochemical Web Sites: Observations - 3
Rich Site Structure
55. Biochemical Web Sites: Observations - 4
Semantics is often only in the report,
not in the underlying relational database
Web Services
Survey: 11 of 100 databases [1] provide APIs
Incomplete coverage
Varying granularity
No semantics in the service description
[1] Databases indexed by the Nucleic Acids Research journal (http://www3.oup.co.uk/nar/database/). The complete survey was available at http://sabiork.villa-bosch.de/index.html/survey.html
56. Biochemical Web Sites: Extraction Tasks
[Mir et al; DILS 2009]
[Mir et al; ESWC 2010]
[Figure labels: "Induce Wrapper" (three times)]
57. Contributions
Unsupervised Page-Level Wrapper Induction
Unsupervised Site-Wide Wrapper Induction
(Site Structure Discovery)
(Acquiring the Schema/Ontology)
Automatic Error Detection and Correction by
Mutual Reinforcement
65. Site-Wide Wrapper Induction: Observations
Not all pages contain data (e.g. legal disclaimers,
contact pages, navigational menus)
An efficient approach should ignore these pages
We don't need to learn the entire site structure
66. Site-Wide Wrapper Induction: Observations - 2
Classified Link-Collections point to data-intensive
pages of the same class.
67. Site-Wide Wrapper Induction: Observations - 3
Pages belonging to the same class describe the same
concepts
Some concepts are sometimes omitted
Ordering is always the same
68. Site-Wide Wrapper Induction
1. Start with C0: S = {C0}
2. Follow all classified link-collections
3. Generate wrappers for each set of target pages
4. Determine if new class is formed: if C0 != Ci (i > 0), S = S + Ci
5. Add navigation step
6. Repeat 2 – 5 for each new class formed in 4
Navigation Steps
W = {(C0 → L1 → C0),
(C0 → L2 → C2),
(C0 → L3 → C3)}
[Figure: page classes C0, C1, C2, C3 connected by link-collections L1, L2, L3]
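The six steps above can be read as a breadth-first loop over page classes. The following is only an illustrative sketch, not the implementation from the cited papers: the names `induce_site_wrapper`, `follow`, and `classify` and the toy site are assumptions, and `classify` is a stand-in for wrapper generation plus the class comparison of steps 3 and 4.

```python
from collections import deque

def induce_site_wrapper(start, follow, classify):
    """Discover page classes S and navigation steps W, breadth first.

    start:    the start page class (C0)
    follow:   hypothetical helper, page class -> {link_collection: target_pages}
    classify: hypothetical helper, target_pages -> class id; stands in for
              wrapper generation and class comparison (steps 3 and 4)
    """
    classes = [start]            # S = {C0}
    steps = []                   # W, the navigation steps
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for link, targets in follow(current).items():    # step 2
            target_class = classify(targets)              # steps 3 and 4
            steps.append((current, link, target_class))   # step 5
            if target_class not in classes:               # a new class was formed
                classes.append(target_class)              # S = S + Ci
                queue.append(target_class)                # step 6
    return classes, steps

# Toy site mirroring the slide: L1 leads back to class C0, L2 to C2, L3 to C3.
TOY_SITE = {"C0": {"L1": "C0", "L2": "C2", "L3": "C3"}}

classes, steps = induce_site_wrapper(
    "C0",
    follow=lambda page_class: TOY_SITE.get(page_class, {}),
    classify=lambda targets: targets,  # toy: targets are already class names
)
```

On the toy site this yields the three navigation steps listed in W and the classes C0, C2, and C3.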
69. Site-Wide Wrapper Induction – Evaluation
SOURCE    #C   #C'  #D    TP    FN    FP   P     R
MSDChem   1    1    N/A   N/A   N/A   N/A  N/A   N/A
ChEBI     3    1    1711  1195  516   0    100   69.8
KEGG      10   7    6223  5044  1179  188  97    81.1
Average                                    98.5  75.5
Table 3: Site-wide wrapper induction results, 20 test pages for each class
(C = Classes, C' = Classes discovered, D = Data entries)
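The slide does not define P and R. Assuming they are precision and recall in percent, computed from the TP, FN, and FP counts with the standard definitions, the figures can be rechecked as sketched below. The function name `precision_recall` is an assumption. With these definitions the recall values match the table, while KEGG precision comes out at about 96.4, which the slide reports as 97.

```python
def precision_recall(tp, fn, fp):
    """Return (precision, recall) in percent, rounded to one decimal.

    Assumes the standard definitions:
    precision = TP / (TP + FP), recall = TP / (TP + FN).
    """
    precision = 100.0 * tp / (tp + fp)
    recall = 100.0 * tp / (tp + fn)
    return round(precision, 1), round(recall, 1)

# Counts copied from Table 3.
chebi = precision_recall(tp=1195, fn=516, fp=0)
kegg = precision_recall(tp=5044, fn=1179, fp=188)
```

In both rows TP + FN equals #D, so recall is the share of data entries that were extracted.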
70. Error Detection and Correction:
Mutual Reinforcement
Observation: Certain data reappear on more
than one class of pages
71. Error Detection and Correction:
Mutual Reinforcement
Reinforcement if reappearing data is correctly classified as
data
Otherwise it points to a misclassification
Label-Data Mismatch
• Correction: Introduce more samples
Label-Label Mismatch
• Cannot be detected
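The check described on this slide can be sketched as a comparison of how the same string was classified on different page classes. This is a minimal sketch under stated assumptions, not the authors' method: the function name `check_reappearing`, the "data"/"label" tags, and the toy observations are all hypothetical.

```python
def check_reappearing(observations):
    """Compare classifications of values that reappear across page classes.

    observations: {value: {page_class: "data" or "label"}} (hypothetical format)
    """
    report = {}
    for value, by_class in observations.items():
        if len(by_class) < 2:
            continue  # the value does not reappear on a second page class
        kinds = set(by_class.values())
        if kinds == {"data"}:
            report[value] = "reinforced"
        elif kinds == {"data", "label"}:
            report[value] = "label-data mismatch"  # correction: add samples
        else:
            report[value] = "not detectable"  # labeled consistently everywhere
    return report

# Toy observations, invented for illustration only.
observations = {
    "C00031": {"compound": "data", "reaction": "data"},
    "Formula": {"compound": "label", "reaction": "data"},
    "Name": {"compound": "label", "reaction": "label"},
    "180.16": {"compound": "data"},
}
report = check_reappearing(observations)
```

A value tagged as a label on every page class gives the check nothing to compare against, which is why a label-label mismatch cannot be detected this way.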
72. Where to go next?
Reverse engineering production
1. LOD: emitting RDF & RDFS
2. Navigation model: what belongs to what
3. Interaction model (not treated at all by us so far)
4. Layout model: spatial positioning
Capture this generative model using machine learning
Relational learning
• Markov logic programmes?
• …?
73. Bibliography
Ermelinda Oro, Massimo Ruffolo, Steffen Staab. SXPath – Extending XPath towards Spatial Querying on Web Documents. In: PVLDB – Proceedings of the VLDB Endowment, 4(2): 129-140, 2010.
Saqib Mir, Steffen Staab, Isabel Rojas. Site-Wide Wrapper Induction for Life Science Deep Web Databases. In: DILS-2009 – Proc. of the Data Integration in the Life Sciences Workshop, Manchester, UK, July 20-22, LNCS, Springer, 2009.
Saqib Mir, Steffen Staab, Isabel Rojas. An Unsupervised Approach for Acquiring Ontologies and RDF Data from Online Life Science Databases. In: 7th Extended Semantic Web Conference (ESWC2010), Heraklion, Greece, May 30-June 3, 2010, pp. 319-333.
74. WeST – Web Science & Technologies
University of Koblenz Landau, Germany
Thank you for your attention!
Editor's Notes
Layout engines of Web browsers assign a rectangle to each DOM element. This is the internal code of a page: how can we query the page using its spatial information? When the browser renders a page, it presents the content of each node in a rectangle, which we call the minimum bounding rectangle. The layout engine assigns this rectangle to each node based on the stylesheet written by the Web designer. [Presenter cue: draw the parallel between the DOM and what you see; "Coldplay" is written inside this node and lights up, then the img lights up.] Pages are presentation-oriented: style is also used for emphasis so that a human reader recognizes the important information, for example a name set in bold. (Future work: exploit these style cues.)
As shown in the figure, the complex, nested structure of the DOM has a clear visual presentation that enables users to read and understand the meaning of the information presented in the Web page.
The rectangular algebra (RA) is an extension of Allen's interval algebra to the two-dimensional case. For example, the relation x (b,e) y is obtained intuitively by applying the interval algebra to both sides of the rectangles, that is, to their projections on the two axes. So we can use a spatial model from geospatial databases to represent the mutual relationships between objects. The rectangular algebra defines 169 relations, which cover all possible relations between two rectangles. [Presenter cues: show RA; show the large figure; highlight two rectangles and name the relation that holds between them; crop a single rectangle.] Summary: models from the geospatial world represent the mutual relations; the resulting tree is based not on nesting but on containment and spatial relations.
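How a rectangle relation is built from two interval relations can be sketched in a few lines. This is a minimal illustration, not code from the SXPath system: the function names `allen` and `ra_relation` and the rectangle encoding `(x_min, y_min, x_max, y_max)` are assumptions. The 13 relation names follow Allen's usual abbreviations (b, m, o, s, d, f, e and the inverses bi, mi, oi, si, di, fi), and 13 x 13 pairs give the 169 relations mentioned in the note.

```python
ALLEN_RELATIONS = ["b", "m", "o", "s", "d", "f", "e",
                   "bi", "mi", "oi", "si", "di", "fi"]

def allen(x, y):
    """Allen relation between intervals x = (x1, x2) and y = (y1, y2),
    assuming x1 < x2 and y1 < y2."""
    (x1, x2), (y1, y2) = x, y
    if x2 < y1:
        return "b"                       # x before y
    if x2 == y1:
        return "m"                       # x meets y
    if y2 < x1:
        return "bi"                      # x after y
    if y2 == x1:
        return "mi"                      # x met by y
    if x1 == y1 and x2 == y2:
        return "e"                       # equal
    if x1 == y1:
        return "s" if x2 < y2 else "si"  # starts / started by
    if x2 == y2:
        return "f" if x1 > y1 else "fi"  # finishes / finished by
    if y1 < x1 and x2 < y2:
        return "d"                       # x during y
    if x1 < y1 and y2 < x2:
        return "di"                      # x contains y
    return "o" if x1 < y1 else "oi"      # proper overlap

def ra_relation(r, s):
    """Pair of Allen relations between the x and y projections of two
    rectangles given as (x_min, y_min, x_max, y_max)."""
    return (allen((r[0], r[2]), (s[0], s[2])),
            allen((r[1], r[3]), (s[1], s[3])))

# Two boxes side by side at the same height: before on x, equal on y.
relation = ra_relation((0, 0, 10, 5), (20, 0, 30, 5))
```

Here `relation` is the pair `("b", "e")`, matching the x (b,e) y example in the note.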
No comment; everything is already on the slide. RA also has very interesting properties, such as invertibility, that enable optimized evaluation of the SXPath language.
By representing RA relations between nodes we obtain the SDOM, where continuous arrows represent spatial containment and dotted arrows represent RA relations. This gives a model of a Web page that represents the spatial relations existing between each pair of DOM nodes. Spatial relations also enable the definition of a spatial ordering along the four main directions (north, south, east, and west), as shown in the figure. The intuition is to build a tree of the page based not on the nesting of tags but on spatial containment and spatial relations. For example, between the image and "Radiohead" the spatial relation (s, bi) holds. This new data model, the Spatial DOM (SDOM), is the Document Object Model with its objects, in which the dark arrows are containment relations and the dashed arrows are RA relations. It captures the spatial arrangement of the objects on the page rather than the simple nesting of tags, and it allows an ordering to be introduced on the page. [Presenter cues: bring out the SDOM step by step with the animations, always showing the two chosen elements.]
The architecture of the system consists of a parser for SXPath expressions (the query parser), a builder of the SDOM, and an engine that efficiently evaluates SXPath queries.
The RA relations are too fine-grained and verbose to be used easily by a human. We therefore also introduce the Rectangular Cardinal Relations (RCR) and topological relations, two of the most intuitive and widespread spatial models, and map them onto RA relations so that users can query spatial relations in a more intuitive way. RCR divides the plane into regional tiles, which makes it simple to use.
This slide shows a comparison between XPath and SXPath. Suppose a user needs to extract the details of a music band. With XPath, the user needs to know the intricate DOM structure. With SXPath, the user can exploit the visual pattern that Web designers adopt to organize the details of music bands.
SXPath expressions are also resilient: a given visual pattern can be queried in the same way on different Web pages that have different internal encodings. Another advantage is generality. For instance, a single query can match different DOMs because their spatial representation is the same, so the language captures visual patterns on Web pages in a general way. Example 2: a single data record can be split across different subtrees, whereas wrapper induction techniques like DEPTA [Zhai et al.] recognize data records when they are encoded in the DOM as consecutive similar subtrees.
The study of the combined computational complexity of different SXPath fragments shows that SXPath retains polynomial-time complexity. SXPath has a greater exponent in the polynomial because of the quadratic number of relations stored in the SDOM that must be explored during the evaluation of spatial axes. We compute spatial axes with the same dynamic programming approach suggested by Gottlob et al., but we have to explore a quadratic number of further relations in the SDOM.
Core SXPath queries can be evaluated in time O(|D|^2 · |Q|), where |D| is the size of the XML document and |Q| is the size of the query Q.
Proof sketch: there are O(|Vv|^2) many spatial relations to be considered in addition to the O(|V|) many relations of the DOM, incurring a higher polynomial worst-case complexity. To obtain a polynomial-time combined complexity bound for SXPath query evaluation, we use dynamic programming, adopting the Context-Value Table (CV-Table) principle introduced by Gottlob et al. Position and size are computed on demand; we compute all spatial position functions in a loop for all pairs of previous … current nodes.
Full SXPath computational costs are dominated by the string operations belonging to XPath 1.0.
In SWF, the computation of spatial ordering generates a higher polynomial worst case than XPath 1.0.
The GUI shows the DOM, allows the user to write queries, and displays the query results on screen so that they can be checked.
In the second experiment we evaluated the effectiveness of SXPath with respect to XPath. We found that the possibility of exploiting the visual appearance of Web pages allows users to write queries in fewer attempts than with XPath, that SXPath location paths are more concise, and that SXPath is resilient (the same query can be used on different Web sites that have very different internal encodings in terms of DOM trees).