This document summarizes the DIADEM data extraction methodology project. It introduces the team members and provides an overview of the technical approach and components. The technology extracts data from thousands of websites at large scale with high accuracy, using ROSeAnn for entity extraction, OPAL for form understanding, AMBER for record identification, and OXPath as the extraction language. The system adapts to new domains and outperforms other data extraction systems in precision, recall, and attribute labeling.
Abstract:
A growing number of resources are available for enriching documents with semantic annotations. While originally focused on a few standard classes of annotations, the ecosystem of annotators is now becoming increasingly diverse. Although annotators often have very different vocabularies, with both high-level and specialist concepts, they also have many semantic interconnections. We will show that both the overlap and the diversity in annotator vocabularies motivate the need for semantic annotation integration: middleware that produces a unified annotation on top of diverse semantic annotators. On the one hand, the diversity of vocabulary allows applications to benefit from the much richer vocabulary available in an integrated vocabulary. On the other hand, we present evidence that the most widely-used annotators on the web suffer from serious accuracy deficiencies: the overlap in vocabularies from individual annotators allows an integrated annotator to boost accuracy by exploiting inter-annotator agreement and disagreement.
The integration of semantic annotations leads to new challenges, both compared to usual data integration scenarios and to standard aggregation of machine learning tools. We overview an approach to these challenges that performs ontology-aware aggregation. We introduce an approach that requires no training data, making use of ideas from database repair. We experimentally compare this with a supervised approach, which adapts maximum entropy Markov models to the setting of ontology-based annotations. We further experimentally compare both these approaches with ontology-unaware supervised approaches, and with individual annotators.
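As an illustration of how inter-annotator agreement can be exploited, here is a minimal Python sketch of aggregation by majority vote; it is not the paper's repair-based or Markov-model method, and the span/type representation is invented for the example:

```python
from collections import Counter

def aggregate(annotator_outputs):
    """Combine the outputs of several annotators by majority vote:
    each annotator maps a text span to an entity type; the integrated
    annotation keeps, per span, the type with the most votes
    (ties broken deterministically by type name)."""
    votes = {}
    for output in annotator_outputs:
        for span, entity_type in output.items():
            votes.setdefault(span, Counter())[entity_type] += 1
    return {span: min(c.items(), key=lambda kv: (-kv[1], kv[0]))[0]
            for span, c in votes.items()}
```

For instance, if two of three hypothetical annotators tag the span "Oxford" as a Location and one as an Organisation, the integrated annotation keeps Location.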
PAGOdA (Pay-as-you-go OWL Query Answering Using a Triple Store) presentation by Bernardo Cuenca Grau
Abstract: We present an enhanced hybrid approach to OWL query answering that combines an RDF triple-store with an OWL reasoner in order to provide scalable pay-as-you-go performance. The enhancements presented here include an extension to deal with arbitrary OWL ontologies, and optimisations that significantly improve scalability. We have implemented these techniques in a prototype system, a preliminary evaluation of which has produced very encouraging results.
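The pay-as-you-go idea can be sketched schematically: cheap lower and upper bounds computed over the triple store bracket the answer set, and the full OWL reasoner is called only on the "gap" answers where the bounds disagree. The function interfaces below are hypothetical, not PAGOdA's actual API:

```python
def pay_as_you_go(query, lower_bound, upper_bound, owl_reasoner):
    """Hybrid query answering, pay-as-you-go style: the cheap bounds
    are computed by the triple store; the expensive OWL reasoner is
    invoked only for answers on which the two bounds disagree."""
    lo = lower_bound(query)   # sound but possibly incomplete answers
    hi = upper_bound(query)   # complete but possibly unsound answers
    gap = hi - lo             # candidates needing full reasoning
    return lo | {a for a in gap if owl_reasoner(query, a)}
```

When the bounds coincide, the reasoner is never called, which is where the scalable common-case performance comes from.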
PDQ: Proof-driven Query Answering over Web-based Data
Abstract: The data needed to answer queries is often available through Web-based APIs. Indeed, for a given query there may be many Web-based sources which can be used to answer it, with the sources overlapping in their vocabularies, and differing in their access restrictions (required arguments) and cost.
We introduce PDQ (Proof-Driven Query Answering), a system for determining a query plan in the presence of web-based sources. It is: (i) constraint-aware -- exploiting relationships between sources to rewrite an expensive query into a cheaper one, (ii) access-aware -- abiding by any access restrictions known in the sources, and (iii) cost-aware -- making use of any cost information that is available about services.
PDQ takes the novel approach of generating query plans from proofs that a query is answerable. We demonstrate the use of PDQ and its effectiveness in generating low-cost plans.
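A toy sketch of the access- and cost-aware part of such planning follows, with an invented source schema; the real system plans over proofs rather than over this simple structure:

```python
def cheapest_plan(sources, bound_arguments):
    """Access- and cost-aware source selection: a source is usable
    only if every argument it requires as input is already bound;
    among the usable sources, the cheapest is chosen. Returns None
    when the query is not answerable under the access restrictions."""
    usable = [s for s in sources if s["requires"] <= bound_arguments]
    return min(usable, key=lambda s: s["cost"], default=None)
```

With two hypothetical sources, one requiring an `isbn` argument at cost 5 and one requiring `isbn` and `year` at cost 1, a query binding only `isbn` must use the more expensive source, while binding both arguments unlocks the cheaper one.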
Semantic Faceted Search with SemFacet presentation (DBOnto)
Abstract
An increasing number of applications rely on RDF, OWL 2, and SPARQL for storing and querying data. SPARQL, however, is not targeted towards end-users, and suitable query interfaces are needed. Faceted search is a prominent approach for end-user data access, and several RDF-based faceted search systems have been developed. There is, however, a lack of rigorous theoretical underpinning for faceted search in the context of RDF and OWL 2. In this paper, we provide such solid foundations. We formalise faceted interfaces for this context, identify a fragment of first-order logic capturing the underlying queries, and study the complexity of answering such queries for RDF and OWL 2 profiles. We then study interface generation and update, and devise efficiently implementable algorithms. Finally, we have implemented and tested our faceted search algorithms for scalability, with encouraging results.
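As a rough illustration of the faceted-search interaction (not SemFacet's algorithms), a conjunctive filter with facet-count updates might look like this in Python, over an invented item schema:

```python
def faceted_search(items, selections):
    """Conjunctive faceted filtering: keep the items that match every
    selected (facet, value) pair, and recompute the value counts of
    the remaining facets for the interface to display."""
    hits = [it for it in items
            if all(it.get(f) == v for f, v in selections.items())]
    counts = {}
    for it in hits:
        for f, v in it.items():
            counts.setdefault(f, {})
            counts[f][v] = counts[f].get(v, 0) + 1
    return hits, counts
```

The returned counts are what let a faceted interface show, next to each facet value, how many results remain if it is selected.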
Parallel Materialisation of Datalog Programs in Centralised, Main-Memory RDF ... (DBOnto)
Abstract:
We present a novel approach to parallel materialisation (i.e., fixpoint computation) of datalog programs in centralised, main-memory, multi-core RDF systems. Our approach comprises an algorithm that evenly distributes the workload to cores, and an RDF indexing data structure that supports efficient, 'mostly' lock-free parallel updates. Our empirical evaluation shows that our approach parallelises computation very well: with 16 physical cores, materialisation can be up to 13.9 times faster than with just one core.
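The fixpoint being parallelised can be illustrated, on a single core, by semi-naive evaluation of one datalog rule; the data structures here are ordinary Python sets, not the paper's lock-free RDF index:

```python
def transitive_closure(edges):
    """Semi-naive fixpoint for the datalog rule
        path(x, z) :- path(x, y), edge(y, z).
    Only the facts derived in the previous round (the delta) are
    joined with edge/2 in each iteration, mirroring how
    materialisation proceeds until no new facts appear."""
    by_src = {}
    for y, z in edges:
        by_src.setdefault(y, set()).add(z)
    path = set(edges)
    delta = set(edges)
    while delta:
        new = {(x, z) for (x, y) in delta
                      for z in by_src.get(y, ())} - path
        path |= new
        delta = new
    return path
```

The paper's contribution is, roughly, distributing the work inside this loop evenly across cores while letting them update the shared fact store with minimal locking.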
Optique aims to provide a semantic end-to-end connection between users and data sources: it enables users to rapidly formulate intuitive queries using familiar vocabularies and conceptualisations, and returns timely answers from large-scale, heterogeneous data sources.
Parallel Datalog Reasoning in RDFox presentation (DBOnto)
Query Distributed RDF Graphs: The Effects of Partitioning paper (DBOnto)
Abstract: Web-scale RDF datasets are increasingly processed using distributed RDF data stores built on top of a cluster of shared-nothing servers. Such systems critically rely on their data partitioning scheme and query answering scheme, the goal of which is to facilitate correct and efficient query processing. Existing data partitioning schemes are commonly based on hashing or graph partitioning techniques. The latter techniques split a dataset in a way that minimises the number of connections between the resulting subsets, thus reducing the need for communication between servers; however, to facilitate efficient query answering, considerable duplication of data at the intersection between subsets is often needed. Building upon the known graph partitioning approaches, in this paper we present a novel data partitioning scheme that employs minimal duplication and keeps track of the connections between partition elements; moreover, we propose a query answering scheme that uses this additional information to correctly answer all queries. We show experimentally that, on certain well-known RDF benchmarks, our data partitioning scheme often allows more answers to be retrieved without distributed computation than the known schemes, and we show that our query answering scheme can efficiently answer many queries.
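For contrast with the paper's graph-partitioning scheme, here is a sketch of the simpler hash-based partitioning it builds on, using subject hashing so that subject-centred (star) joins stay local; the identifiers are invented:

```python
import zlib

def partition_by_subject(triples, num_servers):
    """Subject-hash partitioning: every triple is assigned to the
    server owning its subject, so all triples about one resource are
    co-located and star joins need no inter-server communication.
    zlib.crc32 gives a hash that is stable across runs."""
    servers = [[] for _ in range(num_servers)]
    for s, p, o in triples:
        servers[zlib.crc32(s.encode()) % num_servers].append((s, p, o))
    return servers
```

Queries whose join pattern crosses subjects are exactly the ones where such a scheme forces communication, which is the gap the paper's connection-tracking scheme addresses.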
Big Data LDN 2018: USING FAST-DATA TO MAKE SEMICONDUCTORS (Matt Stubbs)
Date: 14th November 2018
Location: Fast Data Theatre
Time: 14:30 - 15:00
Speaker: Neil Condon
Organisation: Edwards Vacuum
About: Semiconductor fabrication plants build devices with billions of components, each 1000x smaller than a human hair, with some features that are only a few atoms across. They are the most highly automated manufacturing environments in the world: in large fabs, time-series sensor data alone tops 50 TB/day, and combining that data with subject-matter expertise fast enough to keep the automated production equipment functioning is a major and growing challenge. The level of collaboration required to build high-performance real-time analytics, combined with the IP-sensitive nature of the data, results in a unique DataOps environment, where the use of Cloud resources serves to complicate rather than simplify the value equation. We'll explore some of the challenges, and discuss the attributes of a PaaS that could help the industry tackle its fast-data challenges.
An introductory but highly practical talk on starting a Data Science career and life. It touches upon all the main aspects of the path towards becoming a data scientist, seen also through a personal development perspective. Moreover, we talk about the role that a data scientist ultimately fulfills, as an individual or as a team, in the technology innovation life cycle and the product life cycle.
The talk presents measures and concepts that can be applied to improve non-functional performance metrics.
In particular, it discusses algorithms, data structures and scheduling methods that achieve a perceived speed-up of the app and its network requests.
We introduce the terms Worst-Case Loading Time (WCLT) and Average-Case Loading Time (ACLT) and show how the latter can be optimised without degrading the former.
Expert knowledge of the business domain and the users' behaviour is taken into consideration when designing data structures. This allows for a minimal number of network requests, while most of the processing is done offline.
All this has to be achieved with a reasonable overhead in terms of data volume.
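A toy numeric model (our own illustration, not from the talk) shows how prefetching separates the two metrics:

```python
def loading_times(request_ms, hit_rate, cache_ms=5.0):
    """Toy model of the two metrics: a prefetched/cached screen
    renders in cache_ms, a cache miss pays an average network
    request. Prefetching raises hit_rate and so improves the
    Average-Case Loading Time (ACLT), while the Worst-Case Loading
    Time (WCLT) stays bounded by the slowest request."""
    avg_request = sum(request_ms) / len(request_ms)
    aclt = hit_rate * cache_ms + (1 - hit_rate) * avg_request
    wclt = max(request_ms)
    return aclt, wclt
```

Doubling the hit rate roughly halves the ACLT contribution of network misses, without touching the WCLT, which matches the optimisation goal stated above.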
The GetYourGuide Tours and Activities App is presented as an example of how these concepts can be implemented and what gains can be expected. The results presented in this talk are part of a GetYourGuide Research Prototype.
Presentation given at PrefabAus 2014. http://www.prefabaus.org.au/conference/
The material has been sourced from a number of researchers, including: Nico Adams, Robert Zlot, Paul Flick, Alberto Elfes, Laurent Lefort, Sarah King, Peter King, Peter Kambourios, Craig James, Leila Alem, Swee Mak.
Course 8: How to start your big data project by Eric Rodriguez (Betacowork)
For more info about our Big Data courses, check out our website: https://www.betacowork.com/big-data/
---------
"Data is the new oil" - Many companies and professionals do not know how to use their data or are not aware of the added value they could gain from it.
It is in response to these problems that the project "Brussels: The Beating Heart of Big Data" was born.
This project, financed by the Region of Brussels Capital and organised by Betacowork, offers 3 training cycles of 10 courses on big data, at both beginner and advanced levels. These 3 cycles will be followed by a Hackathon weekend.
No prerequisites are required to start these courses. The aim of these courses is to familiarize participants with the principles of Big Data.
Data Science as a Service: Intersection of Cloud Computing and Data Science (Pouria Amirian)
Dr. Pouria Amirian explains data science and the steps in a data science workflow, and shows some experiments in AzureML. He also discusses big data issues in a data science project and solutions to them.
Data Science as a Service: Intersection of Cloud Computing and Data Science (Pouria Amirian)
Dr. Pouria Amirian from the University of Oxford explains Data Science and its relationship with Big Data and Cloud Computing. He then illustrates using AzureML to perform a simple data science analysis.
My presentation at The Richmond Data Science Community (Jan 2018). The slides are slightly different from what I presented last year at The Data Intelligence Conference.
Multi-Cloud Breaks IT Ops: Best Practices to De-Risk Your Cloud Strategy (ThousandEyes)
Organizations are using multiple IaaS and SaaS providers today, yet traditional ITOps processes and tools are straining to cope with a vast new scope of challenges and risks. Recent research by Enterprise Management Associates (EMA) shows that 74% of enterprise network teams had incumbent network monitoring tools failing to address cloud requirements. As IT business leaders responsible for delivering services in this new ecosystem, how do you equip yourself with the right visibility?
Shamus McGillicuddy, Research Director for EMA's network management practice, and Archana Kesavan, Director of Product Marketing at ThousandEyes, dive deep into the challenges of multi-cloud and how to rethink your monitoring strategy and operational delivery processes.
Uncover:
Five common IT operational challenges of multi-cloud identified in recent EMA research
The risks of not evolving ITOps for a managed cloud environment
Four monitoring best practices for a cloud-centric IT Operation
Big Data is an emerging technology in Information Management that holds promising returns on investment, as it can provide advanced analytics capabilities. It is well suited for large enterprises, and when used properly, it can lead to breakthroughs in analytics, deriving information from data that was previously not possible. However, a Big Data project cannot be approached using traditional IT system design and methods. Its success relies on teamwork and collaboration among petroleum engineering subject matter experts, senior IT professionals, and data scientists. To ensure that Big Data initiatives do not deliver poor results or disappoint, Big Data projects require significant preparation, which dramatically increases the chances of success. This presentation provides practical information about how to get started and what to consider in your plan, and it gives useful tips and examples for planning and executing a Big Data project. At the end of the presentation, attendees will know what Big Data is, what it offers, how to plan such projects, what the roles and responsibilities are for the key project members, and how these projects should be implemented to benefit their organization. Big Data analytics offers enterprises a chance to move beyond simply gathering data to analyzing, mining, and correlating results for insights that translate into business solutions.
RightScale Roadtrip Boston: Accelerate to Cloud - RightScale
The Accelerate to Cloud keynote will help you understand the current state of cloud adoption, identify the business value for your organization, and provide you a framework to plot your course to cloud adoption.
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
Search and Society: Reimagining Information Access for Radical Futures - Bhaskar Mitra
The field of information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse, explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
GraphRAG is All You Need? LLM & Knowledge Graph - Guy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
"Impact of front-end architecture on development cost", Viktor TurskyiFwdays
Â
I have heard many times that architecture is not important for the front-end. I have also often seen developers implement front-end features by simply following a framework's standard rules, assuming this is enough to launch the project successfully, and then the project fails. How can you prevent this, and which approach should you choose? I have launched dozens of complex projects, and during the talk we will analyze which approaches have worked for me and which have not.
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. A constant focus on speed to release software to market, along with traditionally slow and manual security checks, has caused gaps in continuous security, an important piece of the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with a passion for technology and making things work, along with a knack for helping others understand how things work. He brings around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations on CI/CD and application security integrated into the software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova... - Ramesh Iyer
In today's fast-changing business world, even companies that adapt and embrace new ideas can struggle to keep up with the competition. Fostering a culture of innovation takes real work: it takes vision, leadership, and a willingness to take risks in the right proportion. Sachin Dev Duggal, co-founder of Builder.ai, has perfected the art of this balance, creating a company culture where creativity and growth are nurtured at each stage.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on counties – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients' needs with what your agency offers, without pulling teeth or pulling your hair out. Practical tips and strategies for successful relationship building that leads to closing the deal.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024 - Tobias Schneck
As AI technology pushes into IT, I wondered, as an "infrastructure container Kubernetes guy", how this fancy AI technology gets managed from an infrastructure operations view. Is it possible to apply our beloved cloud-native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and guide you on a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and make it work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could benefit or limit your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I have already gotten working in practice.
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
DIADEM: domain-centric intelligent automated data extraction methodology Presentation
1. WELCOME
DIADEM: domain-centric intelligent automated data extraction methodology
Web data as you want it
2. TEAM
- Georg Gottlob, Professor, FRS: project lead, scientific director
- Tim Furche, Postdoc: technical director
- Giovanni Grasso, Postdoc: extraction infrastructure
- Giorgio Orsi, Postdoc: knowledge modelling
- Christian Schallhart, Postdoc: software engineering
- Xiaonan Guo, Postdoc: forms and interaction
3. TEAM
- Omer Gunes, D.Phil. student
- Jinsong Guo, D.Phil. student
- Andrew Sellers, Captain USAF, former D.Phil. student
- Andrey Kravchenko, D.Phil. student
- Stefano Ortona, D.Phil. student
- Cheng Wang, D.Phil. student
7.
50-80%
"Data scientists […] spend 50 to 80 percent of their time […] collecting and preparing […] digital data […] from sensors, documents, the web and conventional databases."
– STEVE LOHR, New York Times, Aug. 2014
8. INTRODUCTION
Data … is still a pain
- Data exists, but getting and using it is hard
  - for example, when you are making decisions
- Tipping point: tech leaders leverage data to striking effect
  - Amazon, Walmart, Google
- What about the rest of the world?
9.
collect & prepare data
"You can't do this manually, you're never going to find enough data scientists and analysts."
– SHARMILA SHAHANI-MULLIGAN, CEO Clearstory (New York Times, Aug 2014)
10. INTRODUCTION
… but there is a remedy
- We can get you the data you need in the form you need
  - from competitors
  - from open sources
  - from your intranet
- At any scale, covering popular as well as long tail sources
- Far more comprehensive than manual solutions
- Far cheaper even than a partial, manual solution
14.
"For many kinds of information one has to extract from thousands of sites in order to build a comprehensive database"
– NILESH DALVI, Yahoo!
15.
"No one really has done this successfully at scale yet"
– RAGHU RAMAKRISHNAN, Yahoo!
17. HOW: TECHNOLOGY & TEAM
Technology: Our Strength
- 10,493 sites from real-estate and used-car
- 92%: effective wrappers for more than 92% of sites on average
- 97%: precision of extracted primary attributes
- 20 days on a 45-node Amazon EC2 cluster
- 2.1 days (one expert) to adjust the system to a new domain
18. HOW: TECHNOLOGY & TEAM
Technology: Our Strength
[Chart: extraction time in seconds (0–2000) vs. number of records (0–1000).]
19. HOW: TECHNOLOGY & TEAM
Phenomenology: appearance of objects on the web
- reason for DIADEM's high accuracy
- easily adapted to new domains
Self-organising AI
- adjusts itself to observations on the pages
- different sequence of tasks for every site
- strong isolation of components
Rule-based AI
- declarative rules instead of heuristics
- uniform query of pages, phenomenology, …
- all domain-independent
21. HOW: TECHNOLOGY & TEAM
Data extraction isn't new …
- Manual: very common, scaling costly
- Supervised (human + algorithm): most commercial products
- Automatic (fully algorithmic): active research
- … + magic
22. HOW: TECHNOLOGY & TEAM
Competitors
Mozenda, Lixto, Connotate, BlackLocus, import.io, scrapinghub.com, promptcloud.com:
- massive human effort, applied continuously
- low scale: one or few sources
- low cost efficiency
DIADEM (domain-centric intelligent automated data extraction methodology):
- small human effort, applied once
- massive scale: thousands of sources
- high cost efficiency
23. HOW: TECHNOLOGY & TEAM
What about Google & Co.
- Verticals are becoming ever more relevant for search
  - the major change to Google's result page in the last decade
  - crucial for intelligent personal assistants (Siri, Google Now)
- Revived interest in large-scale extraction of structured data
  - as part of the knowledge graph
  - currently only good for common-sense facts
- Recent AI/deep learning acquisitions by Google, Facebook
24. HOW? INCUBATION PLAN
Data science – a huge market
$50 billion: data science market 2017 (according to Forbes, Wikibon forecast)
$25 billion: data collection & cleaning (according to the New York Times)
29. HOW? INCUBATION PLAN
DIADEM Vision
"Suggest the best smart watch for my preferences!"
"Suggest a great evening out!"
"Suggest cheap headphones with great bass!"
"Suggest a great hotel in an area with lots of bars and close to my conference!"
30. HOW: TECHNOLOGY & TEAM
WWW 2014: Fallacies in DE
– KEVIN C. CHANG, Co-Founder Cazoodle, move.com, UIUC
#1: Can not start with "given a set of result pages"
#2: Must not stop at 70% accuracy
#3: Must be scalable to more than thousands of sources
#4: Must leverage human feedback
DIADEM: ✓ ✓ ✓ ✓
31. DIADEM ANALYSIS
Table 3: Wrapper quality

wrapper              effective   wrong or missing data   no data
UK real estate       91%         7%                      2%
Oxford real estate   90%         6%                      4%
ViNTs                4%          5%                      91%
UK used cars         93%         4%                      3%
US real estate       90%         5%                      5%
32. DIADEM ANALYSIS
Competition?
[Bar charts: precision and recall (0%–100%) of MDR, DEPTA, ViNTs, and DIADEM for record extraction (Records, RE∧RND, UC∧RND); reported values range from 38% to 99%.]
CONCLUSION:
Do only a part of the job, and poorly
33. DIADEM ANALYSIS
Competition?
[Bar charts: precision and recall (0%–100%) of RoadRunner, DEPTA, and DIADEM for attribute extraction (RE∧RND, UC∧RND); reported values range from 42% to 97%.]
CONCLUSION:
Do only a part of the job, and poorly
34. DIADEM ANALYSIS
Competition?
[Chart: attribute quality (0%–25%) per attribute — price, location, postcode, unit, beds, period_baths, receptions, make, model, colour, body_type, fuel_type, transmission, registration, door_number, age, engine_size, mileage.]
CONCLUSION:
Do only a part of the job, and poorly

Table 3: Form labeling accuracy
ICQ dataset       HA [14]   ExQ [41]   StatParser [36]   DIADEM [17]
F1 for labeling   92%       96%        96%               98%
cars are more prominently placed on the site. There are about 3% of sites where no wrapper can be induced, typically as they contain no properties, all properties are on aggregators, or they contain no pivot attribute. For these sites, DIADEM correctly detects that there is no effective wrapper. The final case is that DIADEM fails to produce an effective wrapper, yet one exists. The most common reasons for these failures are dynamic forms (15%), result pages
35. DIADEM
DIADEM's Components
1 ROSeAnn (VLDB'14)
World-best entity extraction from text and structure
36. DIADEM
DIADEM's Components
1 ROSeAnn (VLDB'14)
World-best entity extraction from text and structure
2 OPAL (WWW'12, VLDBJ'13)
World-most-effective form understanding & filling
"The Ontological Key: Automatically Understanding and Integrating Forms"
1 TEMPLATE field_by_proper<C,A> { field<C>(N) ← N@A{d,e,p} }
2
3 TEMPLATE field_by_segment<C,A> { field<C>(N) ← N@A{e,p} }
4
5 TEMPLATE field_by_value<C,A> { field<C>(N) ← N@A{m},
6   ¬(A1 ≠ A, N@A1{d,e,p} ∨ N@A1{e,p}) }
7
8 TEMPLATE field_minmax<C,CM,A> {
9   field<CM>(N1) ← child(N1,G), child(N2,G), adjacent(N1,N2),
10    N1@A{e,d}, (field<C>(N2) ∨ N2@A{e,d})
11  field<C_range>(N2) ← child(N1,G), child(N2,G), next(N2,N1),
12    field<C>(N1), N2@range_connector{e,d}, ¬(A1 ∈ C, N2@A1{d})
13  field<CM>(N1) ← child(N1,G), child(N2,G), adjacent(N1,N2), … }
Range widget = two fields connected by "to" or another range connector, plus some clues in the annotations or classifications.
37. DIADEM
DIADEM's Components
1 ROSeAnn (VLDB'14)
World-best entity extraction from text and structure
2 OPAL (WWW'12, VLDBJ'13)
World-most-effective form understanding & filling
3 AMBER (TWeb'14)
World-most-accurate record identification for listing pages
[Figure: a data area on a listing page — repeated div subtrees, each containing PRICE and LOCATION (and BEDS) nodes under varying tags (a, p, span, b, em, strong, i).]
38. DIADEM
DIADEM's Components
1 ROSeAnn (VLDB'14)
World-best entity extraction from text and structure
2 OPAL (WWW'12, VLDBJ'13)
World-most-effective form understanding & filling
3 AMBER (TWeb'14)
World-most-accurate record identification for listing pages
4 OXPath (VLDB'11, VLDBJ'13)
World-most-efficient extraction language
"Bitemporal Complex Event Processing of Web Event Advertisements" – Tim Furche¹, Giovanni Grasso¹, Michael Huemer², Christian Schallhart¹, and Michael Schrefl²
¹ Department of Computer Science, Oxford University, Wolfson Building, Parks Road, Oxford OX1 3QD, firstname.lastname@cs.ox.ac.uk
² Department of Business Informatics – Data & Knowledge Engineering, Johannes Kepler University, Altenberger Str. 69, Linz, Austria, lastname@dke.uni-linz.ac.at
1 doc("http://www.scottfraser.co.uk/")//select[@id='search-type']/{1 /}
2 //input/{click /}/(//div[1]/table//td[4]/a/{click /})*{0,500}
3 //div[@class='property-wrapper']:<record>
4   [? .:<ORIGIN_URL=current-url()>]
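The OXPath expression above interleaves browser actions ({click /}, with a bounded Kleene repetition for pagination) with extraction markers (:&lt;record&gt;). A minimal Python sketch of the same control flow, using a toy `Page` class as a stand-in for the real OXPath engine and browser (all names here are illustrative, not the OXPath API):

```python
# Hedged sketch: emulates the click/extract loop an OXPath expression like
#   //input/{click /}/(//td[4]/a/{click /})*{0,500} ... :<record>
# encodes. Page is a toy page model, not the OXPath runtime.

class Page:
    def __init__(self, records, next_page=None):
        self.records = records      # listing records visible on this page
        self.next_page = next_page  # page reached by clicking the "next" link

    def click_next(self):
        return self.next_page

def extract_all(start_page, max_pages=500):
    """Follow the 'next' link up to max_pages times, collecting records."""
    results, page, hops = [], start_page, 0
    while page is not None and hops <= max_pages:
        results.extend({"record": r} for r in page.records)
        page, hops = page.click_next(), hops + 1
    return results

last = Page(["flat £500"])
first = Page(["house £900", "flat £860"], next_page=last)
print(extract_all(first))  # three records across two pages
```

The bounded loop mirrors OXPath's `*{0,500}` repetition: pagination stops either when no next page exists or when the bound is reached.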
39. DIADEM
DIADEM's Components
1 ROSeAnn (VLDB'14)
World-best entity extraction from text and structure
2 OPAL (WWW'12, VLDBJ'13)
World-most-effective form understanding & filling
3 AMBER (TWeb'14)
World-most-accurate record identification for listing pages
4 OXPath (VLDB'11, VLDBJ'13)
World-most-efficient extraction language
5 DIADEM (VLDB'14)
World-first accurate, automatic full-site extraction system
40. FORM PHENOMENOLOGY
Example 1: Form
- Task: classify and group form fields into semantic segments
- Problem: HTML structure is only an approximation
- Phenomenology: detect semantic segments, e.g.,
  - if there is a continuous list of option fields (e.g., radio buttons or checkboxes)
  - with the same type
  - and a parent that can't be classified
41. FORM PHENOMENOLOGY
Example 1: Form
segment<C>(X) :- html-child(N1,P), html-child(N2,P), N1 ≠ N2,
  ¬segment(P),                                   (parent can not be classified)
  option-field(N1), option-field(N2),            (both option fields)
  concept<C>(N1), concept<C>(N2),                (same type C)
  max-cont-list-of-fields-with-type<C>(N1,N2).   (end points of largest continuous list of type C)
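The segment rule groups the largest contiguous run of option fields that share a concept under a parent that could not itself be classified. A hedged Python sketch of that idea, with an illustrative flat field representation (the real rule operates on DOM nodes and annotations):

```python
# Hedged sketch of the segment<C> rule: a semantic segment is the largest
# contiguous run of option fields sharing concept C, found only when the
# parent element could not be classified. Field tuples are illustrative.

def detect_segments(fields, parent_classified):
    """fields: list of (kind, concept) tuples in document order.
    Returns (concept, run_length) for each run of 2+ same-concept options."""
    if parent_classified:           # rule requires an unclassified parent
        return []
    segments, run = [], []
    for kind, concept in fields:
        if kind == "option" and (not run or run[-1] == concept):
            run.append(concept)     # extend the continuous list
        else:
            if len(run) > 1:        # a single field is not a segment
                segments.append((run[0], len(run)))
            run = [concept] if kind == "option" else []
    if len(run) > 1:
        segments.append((run[0], len(run)))
    return segments

fields = [("option", "BEDROOMS"), ("option", "BEDROOMS"),
          ("option", "BEDROOMS"), ("text", "LOCATION")]
print(detect_segments(fields, parent_classified=False))
# → [('BEDROOMS', 3)]
```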
42. RESULT PAGE PHENOMENOLOGY
Example 2: Data areas
- Task: finding areas on a page that contain relevant data
- Idea: use the regularity resulting from the DB templates
- Problem: distinguishing regular noise, e.g., featured properties
- Solution: maximisation problem over pivot elements
  - occurrences of mandatory attributes such as price
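The maximisation over pivot elements can be pictured as keeping the largest cluster of pivot occurrences that sit at similar DOM depth and at regular distances, which filters out irregular noise such as featured properties. A hedged sketch under a simplified model (pivots as position/depth pairs; tolerances are illustrative):

```python
# Hedged sketch of data-area identification: cluster pivot-node occurrences
# (e.g. PRICE annotations) with similar depth and regular spacing, and keep
# the largest consistent cluster as the data area.

def largest_consistent_cluster(pivots, depth_tol=1, gap_tol=2):
    """pivots: list of (position, depth) in document order."""
    best = []
    for i in range(len(pivots)):
        cluster = [pivots[i]]
        for pos, depth in pivots[i + 1:]:
            last_pos, last_depth = cluster[-1]
            gap = pos - last_pos
            # the first accepted gap becomes the reference spacing
            ref_gap = cluster[1][0] - cluster[0][0] if len(cluster) > 1 else gap
            if abs(depth - last_depth) <= depth_tol and abs(gap - ref_gap) <= gap_tol:
                cluster.append((pos, depth))
        if len(cluster) > len(best):
            best = cluster
    return best

# A featured property (depth 2) amid four regular listings (depth 5):
pivots = [(0, 2), (10, 5), (20, 5), (30, 5), (40, 5)]
print(largest_consistent_cluster(pivots))  # keeps the four regular listings
```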
43. RESULT PAGE PHENOMENOLOGY
[Figure 3: Data area identification — pivot clusters D1, D2, D3 with members M1,1 to M1,4, and E.]
consistent_cluster_members(C, N1, N2, N3) :- pivot(N1), pivot(N2), ...,
  similar_depth(N1, N2), similar_depth(N2, N3), similar_depth(N1, N3),
  similar_tree_distance(N1, N2, N3).
cluster(C, N) :- ... continuous, lca, contains at least one of all mandatories
… its of order dominance: The pivot nodes in E are organized rather regularly, whereas the pivot nodes in D1 vary quite notably. However, the variation is small enough that M1,1 to M1,4 are depth- and
44. RESULT PAGE PHENOMENOLOGY
Example 2: Record alignment
[Figure 4: Record segmentation — a data area with img, a, div, and p children carrying prices £860, £900, £500, £900, £900.]
- set of uniform, non-overlapping records
- maximise regularity, minimise outliers
- pairwise edit distance with bias towards pivot nodes
Algorithm 2: Segmentation(DOM P, Data Area d)
1 L ← {n : child(f(d),n) ∈ P ∧ ∃n′ ∈ y(d) : desc-or-self(n,n′)};
2 sort L in document order;
3 foreach 1 ≤ k ≤ |L|−1 do Partition[k] ← {n : L[k] ⪯ n ≺ L[k+1]};
4 Len ← min{|Partition[i]| : |{j : |Partition[j]| = |Partition[i]|}| maximal};
5 while L[1] −sibl L[2] < Len do delete L[1];
6 while L[|L|−1] −sibl L[|L|] < Len do delete L[|L|];
7 while 1 < k < |L| do
8   if L[k] −sibl L[k+1] < Len then delete L[k+1] else k++;
9 StartCandidates ← {L} ∪ {{n : ∃l ∈ L : n −sibl l = i} : i ≤ Len};
10 OptimalSegmentation ← ∅; OptimalSim ← ∞;
11 foreach S ∈ StartCandidates do
12   sort S in document order;
13   foreach 1 ≤ k ≤ |L|−1 do
14     Segmentation[k] ← {n : n −sibl S[k] ≤ Len};
15   if ∀P ∈ Segmentation : |P| = Len then
16     if irregularity(Segmentation) < OptimalSim then
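The core idea of Algorithm 2 can be sketched in Python: take the dominant spacing between pivot nodes as the record length, then pick the start offset whose segmentation is most regular. This is a hedged simplification over flat label lists (the real algorithm works on DOM sibling distances and edit distance):

```python
# Hedged sketch of record segmentation: the most frequent child-count
# between consecutive pivots becomes the record length; among candidate
# start offsets, keep the segmentation with the least irregularity.
from collections import Counter

def segment_records(children, pivot_positions):
    """children: list of node labels; pivot_positions: pivot indices."""
    gaps = [b - a for a, b in zip(pivot_positions, pivot_positions[1:])]
    rec_len = Counter(gaps).most_common(1)[0][0]   # dominant record length

    def irregularity(start):
        recs = [children[i:i + rec_len]
                for i in range(start, len(children) - rec_len + 1, rec_len)]
        # count positions where consecutive records disagree
        return sum(a != b for r1, r2 in zip(recs, recs[1:])
                   for a, b in zip(r1, r2))

    best_start = min(range(rec_len), key=irregularity)
    return [children[i:i + rec_len]
            for i in range(best_start, len(children) - rec_len + 1, rec_len)]

children = ["img", "p", "span", "img", "p", "span", "img", "p", "span"]
print(segment_records(children, pivot_positions=[1, 4, 7]))
# → [['img', 'p', 'span'], ['img', 'p', 'span'], ['img', 'p', 'span']]
```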
all text nodes. With the exception of a's tag, all HTML tags are annotated by the type of step.
For the leftmost a and its i descendant in Figure 5, e.g., the tag path is a/first-child::p/first-child::span/next-sibl::i.
Based on the tag path, AMBER quantifies the fraction of records that support the assumption that a node n is an attribute of type A within record r with the support suppr(n,A).
DEFINITION 9. Let E be an extraction instance on DOM P, containing a node n within record r belonging to data area d, and A ∈ A an attribute type. Then suppr(n,A) denotes the support of n as attribute of type A within r, defined as the fraction of records r′ ≠ r in d that contain a node n′ with tag-pathr(n) = tag-pathr′(n′) that is annotated with A.
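Definition 9 amounts to a simple ratio, which a short sketch makes concrete. Here each record is modelled as a mapping from tag paths to annotations (an illustrative simplification of the DOM-based definition):

```python
# Hedged sketch of Definition 9: the support of node n as attribute A is
# the fraction of *other* records in the data area that contain a node
# with the same tag path annotated with A.

def support(records, record_idx, tag_path, attr):
    """records: list of {tag_path: annotation} dicts, one per record."""
    others = [r for i, r in enumerate(records) if i != record_idx]
    hits = sum(1 for r in others if r.get(tag_path) == attr)
    return hits / len(others)

records = [
    {"a/p/span": "PRICE", "a/p/b": "LOCATION"},
    {"a/p/span": "PRICE"},
    {"a/p/span": "PRICE", "a/p/b": "LOCATION"},
]
print(support(records, 0, "a/p/span", "PRICE"))   # → 1.0
print(support(records, 0, "a/p/b", "LOCATION"))   # → 0.5
```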
Consider a data area with 10 records, containing 1 PRICE-annotated
46. BLOCK PHENOMENOLOGY
Example 3: Pagination links
- Machine learning on top of derived features

Table 3: PLM: Pagination Link Model

   Description                                 Type   Predicate
Content
 1 Annotated as NEXT                           bool   plm::annotated_by<NEXT>
 2 Annotated as PAGINATION                     bool   plm::annotated_by<PAGINATION>
 3 Annotated as NUMBER                         bool   plm::annotated_by<NUMBER>
 4 Number of characters                        int    plm::char_num
Page position
 5 Relative position on page                   int2   plm::relative_position<css::page>
 6 Relative position in first screen           int2   plm::relative_position<std::first_screen>
 7 In first screen                             bool   plm::contained_in<std::first_screen>
 8 In last screen                              bool   plm::contained_in<std::last_screen>
Visual proximity
 9 Pagination annotation close to node         bool   plm::in_proximity<plm::annotated_by<PAGINATION>>
10 Number of close numeric nodes               int    plm::num_in_proximity<numeric>
11 Closest numeric node is a link              bool   plm::closest<std::left_proximity>_with<numeric>_is<non_link>
12 Closest numeric node has different style    bool   <numeric>_is<different_style>
13 Closest link annotated with NEXT            bool   <dom::clickable>_is<plm::annotated_by<NEXT>>
14 Ascending w. closest numeric left, right    bool   plm::ascending-numerics
Structural
15 Preceding numeric node is a link            bool   plm::closest<std::preceding>_with<numeric>_is<non_link>
16 Preceding numeric node has different style  bool   <numeric>_is<different_style>
17 Preceding link annotated with NEXT          bool   <dom::clickable>_is<plm::annotated_by<NEXT>>
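Features like these feed a standard classifier over candidate pagination links. A hedged sketch of assembling such a feature vector (the dict keys and the feature subset are illustrative, not the PLM's actual encoding):

```python
# Hedged sketch: turn a candidate pagination link into a feature vector
# for machine learning, mirroring a subset of the PLM features above.
# All attribute names here are illustrative stand-ins.

def plm_features(node):
    return [
        int(node.get("annotated_next", False)),        # 1: annotated as NEXT
        int(node.get("annotated_pagination", False)),  # 2: PAGINATION annotation
        int(node.get("annotated_number", False)),      # 3: NUMBER annotation
        node.get("char_num", 0),                       # 4: number of characters
        int(node.get("in_first_screen", False)),       # 7: in first screen
        node.get("close_numeric_nodes", 0),            # 10: nearby numeric nodes
    ]

next_link = {"annotated_next": True, "char_num": 4,
             "in_first_screen": True, "close_numeric_nodes": 5}
print(plm_features(next_link))  # → [1, 0, 0, 4, 1, 5]
```

The resulting vectors can be passed to any off-the-shelf classifier; the deck only states that machine learning runs on top of the derived features, not which model is used.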
47. BLOCK PHENOMENOLOGY
Example 3: Pagination links
- Datalog± rules for deriving features
- Lots of visual reasoning on the page
- Rich template language to avoid duplication

TEMPLATE annotated_by<Model,AType> {
2   <Model>::annotated_by<AType>(X) ← node_of_interest(X),
      gate::annotation(X, <AType>, _). }
4 TEMPLATE in_proximity<Model,Property(Close)> {
    <Model>::in_proximity<Property>(X) ← node_of_interest(X),
6     std::proximity(Y,X), <Property(Close)>. }
  TEMPLATE num_in_proximity<Model,Property(Close)> {
8   <Model>::in_proximity<Property>(X,Num) ← node_of_interest(X),
      std::proximity(Close,X), Num = #count(N: <Property(Close)>). }
10 TEMPLATE relative_position<Model,Within(Height,Width)> {
     <Model>::relative_position<Within>(X, (PosH, PosV)) ← node_of_interest(X),
12     css::box(X, LeftX, TopX, _, _), <Within(Height,Width)>,
       PosH = 100·LeftX / Width, PosV = 100·TopX / Height. }
14 TEMPLATE contained_in<Model,Container(Left,Top,Bottom,Right)> {
     <Model>::contained_in<Container>(X) ← node_of_interest(X),
16     css::box(X,LeftX,TopX,RightX,BottomX), <Container(Left,Top,Right,Bottom)>,
       Left < LeftX < RightX < Right, Top < TopX < BottomX < Bottom. }
18 TEMPLATE closest<Model,Relation(Closest,X),Property(Closest),Test(Closest)> {
     <Model>::closest<Relation>_with<Property>_is<Test>(X) ← node_of_interest(X),
20     <Relation(Closest,X)>, <Property(Closest)>, <Test(Closest)>,
       ¬(<Relation(Y,X)>, <Property(Y)>, <Relation(Y,Closest)>). }
Fig. 4: BERyL feature templates
In a similar way, the second template defines a boolean feature that holds for nodes