R is a programming language and environment commonly used in statistical computing, data analytics and scientific research.
It is one of the most popular languages used by statisticians, data analysts, researchers and marketers to retrieve, clean, analyze, visualize and present data.
Due to its expressive syntax and easy-to-use interface, it has grown in popularity in recent years.
DataStax Accelerate - "Extending Gremlin with Foundational Steps" - Discusses Gremlin, the graph traversal language of Apache TinkerPop, and its functional, data-flow style. Examines how Gremlin can be extended into a Domain Specific Language (DSL) to encapsulate foundational and domain-specific logic for better code organization and reusability.
Search engines (e.g., Google.com, Yahoo.com, and Bing.com) have become the dominant model of online search. Large and small e-commerce sites provide built-in search capabilities so that visitors can examine the products they offer. While most large businesses are able to hire the necessary skills to build advanced search engines, small online businesses still lack the ability to evaluate the results of their search engines, which means losing the opportunity to compete with larger businesses. The purpose of this paper is to build an open-source model that can measure the relevance of search results for online businesses, as well as the accuracy of their underlying algorithms. We used data from a Kaggle.com competition to show our model running on real data.
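Relevance of ranked results is conventionally measured with a graded metric such as NDCG; the sketch below is a generic illustration of that kind of measurement (function names are ours, and this is not necessarily the paper's exact model):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of a ranked list of graded relevance scores."""
    return sum(rel / math.log2(rank + 2)        # rank is 0-based
               for rank, rel in enumerate(relevances))

def ndcg(relevances):
    """Normalize DCG by the gain of the ideal (sorted) ranking."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Graded judgments for one query's top-4 results (3 = perfect ... 0 = irrelevant):
print(round(ndcg([3, 2, 0, 1]), 3))  # -> 0.985
```

A score of 1.0 means the engine returned results in the ideal relevance order; lower values quantify how far the ranking deviates from it.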
SIGMOD 2013 - Patricia's talk on "Value invention for Data Exchange" - Boris Glavic
The creation of values to represent incomplete information, often referred to as value invention, is central in data exchange. Within schema mappings, Skolem functions have long been used for value invention as they permit a precise representation of missing information. Recent work on a powerful mapping language called second-order tuple generating dependencies (SO tgds), has drawn attention to the fact that the use of arbitrary Skolem functions can have negative computational and programmatic properties in data exchange. In this paper, we present two techniques for understanding when the Skolem functions needed to represent the correct semantics of incomplete information are computationally well-behaved. Specifically, we consider when the Skolem functions in second-order (SO) mappings have a first-order (FO) semantics and are therefore programmatically and computationally more desirable for use in practice. Our first technique, linearization, significantly extends the Nash, Bernstein and Melnik unskolemization algorithm, by understanding when the sets of arguments of the Skolem functions in a mapping are related by set inclusion. We show that such a linear relationship leads to mappings that have FO semantics and are expressible in popular mapping languages including source-to-target tgds and nested tgds. Our second technique uses source semantics, specifically functional dependencies (including keys), to transform SO mappings into equivalent FO mappings. We show that our algorithms are applicable to a strictly larger class of mappings than previous approaches, but more importantly we present an extensive experimental evaluation that quantifies this difference (about 78% improvement) over an extensive schema mapping benchmark and illustrates the applicability of our results on real mappings.
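To make value invention concrete: when a mapping asserts that each source tuple implies a target tuple with an unknown value, that value can be represented by a Skolem term over the rule's variables. A toy illustration in Python (our own encoding, not the paper's algorithms; the relation and function names are invented):

```python
# Toy value invention for the tgd
#   Emp(name, dept) -> ∃A Office(name, A)
# where the unknown office A is invented as the Skolem term f(name).
def skolem(fn, *args):
    """Represent a Skolem term f(args) as a hashable tuple."""
    return (fn,) + args

source = [("Emp", "alice", "sales"), ("Emp", "bob", "it")]

target = [("Office", name, skolem("f", name))
          for (_, name, _dept) in source]

print(target)
# Tuples that agree on the Skolem function's arguments share the same
# invented value - the precision Skolem functions buy over plain labeled nulls.
```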
Hacktoberfest 2020 'Intro to Knowledge Graph' with Chris Woodward of ArangoDB and reKnowledge. Accompanying video is available here: https://youtu.be/ZZt6xBmltz4
Gremlin is a Turing-complete, graph-based programming language developed for key/value-pair multi-relational graphs called property graphs. Gremlin makes extensive use of XPath 1.0 to support complex graph traversals. Connectors exist to various graph databases and frameworks. This language has application in the areas of graph query, analysis, and manipulation.
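The data-flow style of a Gremlin traversal — a pipeline of steps, each transforming a stream of graph elements — can be sketched over a toy property graph (this is an illustration of the idea in plain Python, not the TinkerPop API):

```python
# Minimal property graph plus pipeline-style traversal steps,
# illustrating the data-flow idea behind Gremlin (NOT the real API).
class Graph:
    def __init__(self):
        self.vertices = {}    # id -> properties dict
        self.out_edges = {}   # id -> list of (label, target id)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props
        self.out_edges.setdefault(vid, [])

    def add_edge(self, src, label, dst):
        self.out_edges[src].append((label, dst))

def out(graph, ids, label):
    """Step: follow outgoing edges with the given label."""
    return [dst for vid in ids
                for (lbl, dst) in graph.out_edges[vid] if lbl == label]

def values(graph, ids, key):
    """Step: project a property value from each vertex."""
    return [graph.vertices[vid][key] for vid in ids]

g = Graph()
g.add_vertex(1, name="marko")
g.add_vertex(2, name="josh")
g.add_vertex(3, name="lop")
g.add_edge(1, "knows", 2)
g.add_edge(2, "created", 3)

# "What have marko's acquaintances created?" as a chain of steps:
result = values(g, out(g, out(g, [1], "knows"), "created"), "name")
print(result)  # -> ['lop']
```

Each step consumes the stream produced by the previous one, which is exactly the functional composition the abstract describes.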
Full version of http://www.slideshare.net/valexiev1/gvp-lodcidocshort. Same is available on http://vladimiralexiev.github.io/pres/20140905-CIDOC-GVP/index.html
CIDOC Congress, Dresden, Germany
2014-09-05: International Terminology Working Group: full version.
2014-09-09: Getty special session: short version
An increasing amount of valuable semi-structured data has become available online. In this talk, we overview the state of the art in entity ranking over structured data ("linked data").
Introduction to Graphs, Apache TinkerPop and Gremlin through DataStax Enterprise (DSE) Graph and DataStax Studio from the Apache Cassandra & DataStax DC Meetup.
No more struggles with Apache Spark workloads in production - Chetan Khatri
Paris Scala Group Event May 2019, No more struggles with Apache Spark workloads in production.
Apache Spark
Primary data structures (RDD, DataSet, Dataframe)
Pragmatic explanation: executors, cores, containers, stages, jobs, and tasks in Spark.
Parallel read from JDBC: Challenges and best practices.
Bulk Load API vs JDBC write
An optimization strategy for Joins: SortMergeJoin vs BroadcastHashJoin
Avoid unnecessary shuffle
Alternative to Spark's default sort
Why dropDuplicates() doesn't give consistent results, and what the alternative is
Optimize Spark stage generation plan
Predicate pushdown with partitioning and bucketing
Why not to use Scala's concurrent 'Future' explicitly
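The join-strategy bullet can be illustrated without a cluster: a broadcast hash join ships the small side to every task and probes a hash table, avoiding the shuffle-and-sort that a sort-merge join requires. A plain-Python sketch of the two strategies (illustrative only; in Spark the choice is driven by `spark.sql.autoBroadcastJoinThreshold` or an explicit `broadcast()` hint):

```python
def broadcast_hash_join(big, small, key_big, key_small):
    """Build a hash table on the small side, stream the big side once."""
    table = {}
    for row in small:
        table.setdefault(row[key_small], []).append(row)
    return [(b, s) for b in big for s in table.get(b[key_big], [])]

def sort_merge_join(big, small, key_big, key_small):
    """Sort both sides on the key, then merge - what happens after the shuffle."""
    left = sorted(big, key=lambda r: r[key_big])
    right = sorted(small, key=lambda r: r[key_small])
    out, j = [], 0
    for b in left:
        while j < len(right) and right[j][key_small] < b[key_big]:
            j += 1
        k = j
        while k < len(right) and right[k][key_small] == b[key_big]:
            out.append((b, right[k]))
            k += 1
    return out

orders = [{"cust": 1, "amt": 10}, {"cust": 2, "amt": 20}, {"cust": 1, "amt": 5}]
custs = [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]
a = broadcast_hash_join(orders, custs, "cust", "id")
b = sort_merge_join(orders, custs, "cust", "id")
assert sorted(map(str, a)) == sorted(map(str, b))  # same join, different cost
```

Both produce the same pairs; the difference in production is that the hash variant never shuffles the large side.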
From Michal Malohlava's talk at MLConf NYC 3/27/15:
Building Machine Learning Applications with Sparkling Water: Writing applications that process and analyze large amounts of data is still hard. It often requires designing and running Machine Learning experiments at small scale, then consolidating them into an application and running them at large scale. Several distributed machine learning platforms try to mitigate this effort. In this talk we will focus on Sparkling Water, which combines the benefits of two platforms: H2O, an open-source distributed math engine providing a tuned Machine Learning library, and Spark, an execution platform for processing large amounts of data. The talk will demonstrate Sparkling Water's features and show its benefits for building rich and robust Machine Learning applications.
Focused mainly on the object-oriented plotting interface of Matplotlib, this review is intended as a go-to reference guide for quick plotting.
#Matplotlib #Python #Pandas #Seaborn
Relaxing global-as-view in mediated data integration from linked data - Alessandro Adamou
Slides of my presentation at Semantic Big Data (SBD 2020), co-located with SIGMOD 2020. The actual presentation has my pre-recorded commentary and a superimposed picture, so these slides can be used as a reference.
Dynamic Factual Summaries for Entity Cards - Faegheh Hasibi
Slides for SIGIR 2017 paper "Dynamic Factual Summaries for Entity Cards"
Entity cards are being used frequently in modern web search engines to offer a concise overview of an entity directly on the results page. These cards are composed of various elements, one of them being the entity summary: a selection of facts describing the entity from an underlying knowledge base. These summaries, while presenting a synopsis of the entity, can also directly address users' information needs. In this paper, we make the first effort towards generating and evaluating such factual summaries. We introduce and address the novel problem of dynamic entity summarization for entity cards, and break it down into two specific subtasks: fact ranking and summary generation. We perform an extensive evaluation of our method using crowdsourcing. Our results show the effectiveness of our fact ranking approach and validate that users prefer dynamic summaries over static ones.
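The two subtasks compose naturally: score the facts, then fill the card's summary under a size budget. A minimal sketch (the paper learns fact scores from features; the facts and numbers below are made up for illustration):

```python
def summarize(scored_facts, k=3):
    """Fact ranking + summary generation: sort facts by score, cut at budget k."""
    ranked = sorted(scored_facts, key=lambda fs: fs[1], reverse=True)
    return [fact for fact, _score in ranked[:k]]

# Illustrative (fact, utility-score) pairs for one entity:
scored = [("genre: jazz", 0.9), ("born: 1926", 0.7),
          ("label: Columbia", 0.6), ("instrument: trumpet", 0.3)]
print(summarize(scored))
# -> ['genre: jazz', 'born: 1926', 'label: Columbia']
```

A dynamic summary re-scores the facts per query, so the same entity can surface different facts for different information needs.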
Interactive Sparkling Water session by Michal Malohlava
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Silicon Valley Cloud Computing Meetup
Mountain View, 2010-07-19
Examples of Hadoop Streaming, based on Python scripts running on the AWS Elastic MapReduce service, which show text mining on the "Enron Email Dataset" from Infochimps.com plus data visualization using R and Gephi
Source at: http://github.com/ceteri/ceteri-mapred
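The Streaming model used in the talk is simple to sketch: Hadoop pipes input splits through any mapper executable, sorts the emitted key-value lines, and pipes each key's group through a reducer. A minimal word-count pair in Python (the generic streaming pattern, not the talk's actual Enron scripts):

```python
from itertools import groupby

def mapper(lines):
    """Emit word<TAB>1 lines, the key-value format Hadoop Streaming expects."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"

def reducer(lines):
    """Hadoop delivers mapper output sorted by key; sum the counts per word."""
    pairs = (line.rsplit("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# In a real job, mapper and reducer are separate scripts reading sys.stdin;
# here we pipe them together locally (Hadoop's sort happens between them).
shuffled = sorted(mapper(["to be or not to be"]))
print(list(reducer(shuffled)))
# -> ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

Because the contract is just stdin/stdout lines, the same scripts run unchanged on AWS Elastic MapReduce.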
FlexClock, a Plastic Clock Written in Oz with the QTk toolkit - Jean Vanderdonckt
This paper focuses on the techniques involved in building an interactive application using a plastic user interface. These techniques take advantage of the QTk toolkit, a toolkit that features unusual but interesting concepts with respect to more classical object-oriented toolkits. These features are possible thanks to the underlying programming language used, Oz. In particular, it can support all facilities provided by symbolic records, like XML structures, and more. It also exhibits the capacity to wrap any language entities into higher-order data structures. This paper shows by a case study how the combination of QTk and Oz helps developers write plastic user interfaces very easily.
Metadata and Provenance for ML Pipelines with Hopsworks - Jim Dowling
This talk describes the scale-out, consistent metadata architecture of Hopsworks and how we use it to support custom metadata and provenance for ML Pipelines with Hopsworks Feature Store, NDB, and ePipe. The talk is here: https://www.youtube.com/watch?v=oPp8PJ9QBnU&feature=emb_logo
Query Rewriting and Optimization for Ontological Databases - Giorgio Orsi
Ontological queries are evaluated against a knowledge base consisting of an extensional database and an ontology (i.e., a set of logical assertions and constraints that derive new intensional knowledge from the extensional database), rather than directly on the extensional database. The evaluation and optimization of such queries is an intriguing new problem for database research. In this article, we discuss two important aspects of this problem: query rewriting and query optimization. Query rewriting consists of the compilation of an ontological query into an equivalent first-order query against the underlying extensional database. We present a novel query rewriting algorithm for rather general types of ontological constraints that is well suited for practical implementations. In particular, we show how a conjunctive query against a knowledge base, expressed using linear and sticky existential rules, that is, members of the recently introduced Datalog+/- family of ontology languages, can be compiled into a union of conjunctive queries (UCQ) against the underlying database. Ontological query optimization, in this context, attempts to improve this rewriting process so as to produce possibly small and cost-effective UCQ rewritings for an input query.
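The core rewriting step can be caricatured in miniature: a query atom that matches a rule head is replaced by the rule body, and repeating this to a fixpoint yields the union of conjunctive queries. A drastically simplified, predicate-only sketch (our own toy encoding; the real algorithm performs genuine term unification and handles existential and sticky rules):

```python
# Linear rules stored head -> possible bodies, i.e. the tgd body(X) -> head(X)
# read backwards. Arguments are elided: a real rewriter unifies terms.
rules = {"employee": ["manager"], "person": ["employee"]}

def rewrite(query):
    """Compute the UCQ: every query reachable by replacing an atom
    with the body of a rule whose head matches it."""
    ucq, frontier = {query}, [query]
    while frontier:
        q = frontier.pop()
        for i, pred in enumerate(q):
            for body_pred in rules.get(pred, []):
                new_q = q[:i] + (body_pred,) + q[i + 1:]
                if new_q not in ucq:
                    ucq.add(new_q)
                    frontier.append(new_q)
    return ucq

print(sorted(rewrite(("person",))))
# -> [('employee',), ('manager',), ('person',)]
```

Evaluating the resulting union directly on the extensional database gives the certain answers without materializing any inferred data.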
ROSeAnn: Reconciling Opinions of Semantic Annotators - VLDB 2014 - Giorgio Orsi
ROSeAnn - Reconciling Opinions of Semantic Annotators. VLDB 2014 Conference.
A growing number of resources are available for enriching documents with semantic annotations. While originally focused on a few standard classes of annotations, the ecosystem of annotators is now becoming increasingly diverse. Although annotators often have very different vocabularies, with both high-level and specialist concepts, they also have many semantic interconnections. We will show that both the overlap and the diversity in annotator vocabularies motivate the need for semantic annotation integration: middleware that produces a unified annotation on top of diverse semantic annotators. On the one hand, the diversity of vocabulary allows applications to benefit from the much richer vocabulary available in an integrated vocabulary. On the other hand, we present evidence that the most widely used annotators on the web suffer from serious accuracy deficiencies: the overlap in vocabularies from individual annotators allows an integrated annotator to boost accuracy by exploiting inter-annotator agreement and disagreement. The integration of semantic annotations leads to new challenges, both compared to usual data integration scenarios and to standard aggregation of machine learning tools. We overview an approach to these challenges that performs ontology-aware aggregation. We introduce an approach that requires no training data, making use of ideas from database repair. We experimentally compare this with a supervised approach, which adapts maximum entropy Markov models to the setting of ontology-based annotations. We further experimentally compare both these approaches with respect to ontology-unaware supervised approaches, and to individual annotators.
Heuristic Ranking in Tightly Coupled Probabilistic Description Logics - Giorgio Orsi
The Semantic Web effort has steadily been gaining traction in recent years. In particular, Web search companies are realizing that their products need to evolve towards richer semantic search capabilities. Description logics (DLs) have been adopted as the formal underpinnings for Semantic Web languages used in describing ontologies. Reasoning under uncertainty has recently taken a leading role in this arena, given the nature of data found on the Web. In this paper, we present a probabilistic extension of the DL EL++ (which underlies the OWL 2 EL profile) using Markov logic networks (MLNs) as probabilistic semantics. This extension is tightly coupled, meaning that probabilistic annotations in formulas can refer to objects in the ontology. We show that, even though the tightly coupled nature of our language means that many basic operations are data-intractable, we can leverage a sublanguage of MLNs that allows ranking the atomic consequences of an ontology relative to their probability values (called ranking queries) even when these values are not fully computed. We present an anytime algorithm to answer ranking queries, and provide an upper bound on the error that it incurs, as well as a criterion to decide when results are guaranteed to be correct.
Wrapper induction faces a dilemma: To reach web scale, it requires automatically generated examples, but to produce accurate results, these examples must have the quality of human annotations. We resolve this conflict with AMBER, a system for fully automated data extraction from result pages. In contrast to previous approaches, AMBER employs domain specific gazetteers to discern basic domain attributes on a page, and leverages repeated occurrences of similar attributes to group related attributes into records rather than relying on the noisy structure of the DOM. With this approach AMBER is able to identify records and their attributes with almost perfect accuracy (>98%) on a large sample of websites. To make such an approach feasible at scale, AMBER automatically learns domain gazetteers from a small seed set. In this demonstration, we show how AMBER uses the repeated structure of records on deep web result pages to learn such gazetteers. This is only possible with a highly accurate extraction system. Depending on its parametrization, this learning process runs either fully automatically or with human interaction. We show how AMBER bootstraps a gazetteer for UK locations in 4 iterations: From a small seed sample we achieve 94.4% accuracy in recognizing UK locations in the 4th iteration.
Search engines are the sinews of the web. These sinews have become strained, however: Where the web's function once was a mix of library and yellow pages, it has become the central marketplace for information of almost any kind. We search more and more for objects with specific characteristics, a car with a certain mileage, an affordable apartment close to a good school, or the latest accessory for our phones. Search engines all too often fail to provide reasonable answers, making us sift through dozens of websites with thousands of offers--never to be sure a better offer isn't just around the corner. What search engines are missing is understanding of the objects and their attributes published on websites.
Automatically identifying and extracting these objects is akin to alchemy: transforming unstructured web information into highly structured data with near perfect accuracy. With DIADEM we present a formula for this transformation, but at a price: DIADEM identifies and extracts data from a website with high accuracy. The price is that for this task we need to provide DIADEM with extensive knowledge about the ontology and phenomenology of the domain, i.e., about entities (and relations) and about the representation of these entities in the textual, structural, and visual language of a website of this domain. In this demonstration, we demonstrate with a first prototype of DIADEM that, in contrast to alchemists, DIADEM has developed a viable formula.
OPAL: a passe-partout for web forms - WWW 2012 (Demonstration) - Giorgio Orsi
Web forms are the interfaces of the deep web. Though modern web browsers provide facilities to assist in form filling, this assistance is limited to prior form fillings or keyword matching. Automatic form understanding enables a broad range of applications, including crawlers, meta-search engines, and usability and accessibility support for enhanced web browsing. In this demonstration, we use a novel form understanding approach, OPAL, to assist in form filling even for complex, previously unknown forms. OPAL associates form labels to fields by analyzing structural properties in the HTML encoding and visual features of the page rendering. OPAL interprets this labeling and classifies the fields according to a given domain ontology. The combination of these two properties, allows OPAL to deal effectively with many forms outside of the grasp of existing form filling techniques. In the UK real estate domain, OPAL achieves >99% accuracy in form understanding.
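OPAL's first step — associating a text label with the right form field — can be caricatured with geometry alone (a toy stand-in for OPAL's combined structural and visual analysis; all names and coordinates below are invented):

```python
def associate_labels(labels, fields):
    """Assign each label to the closest field by rendered (x, y) position.
    OPAL additionally uses HTML structure and design-pattern knowledge."""
    assignment = {}
    for text, (lx, ly) in labels.items():
        nearest = min(fields,
                      key=lambda f: (fields[f][0] - lx) ** 2 + (fields[f][1] - ly) ** 2)
        assignment[text] = nearest
    return assignment

labels = {"Min price": (10, 50), "Location": (10, 20)}
fields = {"price_input": (80, 50), "location_input": (80, 20)}
print(associate_labels(labels, fields))
# -> {'Min price': 'price_input', 'Location': 'location_input'}
```

The labeling is then interpreted against a domain ontology to classify each field, which is what lets the demo fill previously unseen forms.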
Querying UML Class Diagrams - FoSSaCS 2012 - Giorgio Orsi
UML Class Diagrams (UCDs) are the best known class-based formalism for conceptual modeling. They are used by software engineers to model the intensional structure of a system in terms of classes, attributes and operations, and to express constraints that must hold for every instance of the system. Reasoning over UCDs is of paramount importance in design, validation, maintenance and system analysis; however, for medium and large software projects, reasoning over UCDs may be impractical. Query answering, in particular, can be used to verify whether a (possibly incomplete) instance of the system modeled by the UCD, i.e., a snapshot, enjoys a certain property. In this work, we study the problem of querying UCD instances, and we relate it to query answering under guarded Datalog +/-, that is, a powerful Datalog-based language for ontological modeling. We present an expressive and meaningful class of UCDs, named UCDLog, under which conjunctive query answering is tractable in the size of the instances.
OPAL: automated form understanding for the deep web - WWW 2012 - Giorgio Orsi
Forms are our gates to the web. They enable us to access the deep content of web sites. Automatic form understanding unlocks this content for applications ranging from crawlers to meta-search engines and is essential for improving usability and accessibility of the web. Form understanding has received surprisingly little attention other than as component in specific applications such as crawlers. No comprehensive approach to form understanding exists and previous works disagree even in the definition of the problem. In this paper, we present OPAL, the first comprehensive approach to form understanding. We identify form labeling and form interpretation as the two main tasks involved in form understanding. On both problems OPAL pushes the state of the art: For form labeling, it combines signals from the text, structure, and visual rendering of a web page, yielding robust characterisations of common design patterns. In extensive experiments on the ICQ and TEL-8 benchmarks and a set of 200 modern web forms OPAL outperforms previous approaches by a significant margin. For form interpretation, we introduce a template language to describe frequent form patterns. These two parts of OPAL combined yield form understanding with near perfect accuracy (> 98%).
Nyaya: Semantic data markets: a flexible environment for knowledge management... - Giorgio Orsi
We present Nyaya, a flexible system for the management of Semantic-Web data which couples a general-purpose storage mechanism with efficient ontology reasoning and querying capabilities. Nyaya processes large Semantic-Web datasets, expressed in a variety of formalisms, by transforming them into a collection of Semantic Data Kiosks. Each kiosk exposes the native meta-data in a uniform fashion using Datalog±, a very general rule-based language for the representation of ontological constraints. The kiosks form a Semantic Data Market where the data in each kiosk can be uniformly accessed using conjunctive queries and where users can specify user-defined constraints over the data. Nyaya is easily extensible, robust to updates of both data and meta-data in the kiosks, and can readily adapt to different logical organizations of the persistent storage. The approach has been evaluated using well-known benchmarks, and compared to state-of-the-art research prototypes and commercial systems.
3. Web data extraction. Goal: make web contents accessible to electronic data processing. Wrappers select, extract, and annotate HTML pages from the web, turning their layout into structured data (XML, databases) for corporate EDP applications.
5. Foundations of the Lixto Visual Wrapper suite: MSO, monadic Datalog, and Elog, spanning logic, database theory, database programming, and application design.
6. Lixto Visual Developer (VD): navigation steps, the embedded Mozilla web browser, and the extraction configuration.
9. Need for Automatic Extraction Technology. Example: real estate in the UK, with 17,000 sites, many not covered by aggregators. We do have a list of all homepages (Yellow Pages UK). Manual or semi-automatic wrapping is too expensive: wrapper construction, testing, and keeping track of changes. No tool or method can do it fully automatically. Other domains: hospitals, restaurants, schools, travel agents, airlines, pharmaceutical companies, and retail companies such as supermarket chains.
10. Need for Automatic Extraction Technology (2). All search engine providers need it! Many work on it. Keywords: vertical search, object search, semantic search. Raghu Ramakrishnan, Yahoo!, March 2009: "no one really has done this successfully at scale yet". Alon Halevy, Google, Feb. 2009: "Current technologies are not good enough yet to provide what search engines really need. [...] any successful approach would probably need a combination of knowledge and learning."
11. The black box we want to construct: given a URL from an application domain with thousands of websites, the black box outputs application-relevant structured data (XML or RDF). To achieve this, we plan to combine a host of annotators with a new knowledge-based approach.
12. Relationship to SeCo & Webdam. Example query over real-estate and restaurant web services: find apartments in Milan whose prices are average in quarters where restaurant quality is above average.
16. Bottom-up (low-level) annotation: diagram relating low-level page elements (a monochromatic rectangle, a postcode input field, an active map with coordinates) via ISA and occurs-in links to higher-level concepts such as a geographic search facility, a price search facility, and a combined geo-price search box.
17. Top-down reasoning: a property search facility, a property list, a single property description, and a specially highlighted property, related by part-of constraints.
18. Bottom-up processing meets top-down reasoning: the low-level annotations of slide 16 are connected to the conceptual structures of slide 17 through the phenomenology of the domain.
19. Datalog for Web-Object Reasoning. If a table contains an area selection input field and a price selection field, both of which are not simultaneously contained in a smaller table, then this table is the property search mask (the implication arrows, lost in the slide export, are restored below; the negation in the third rule follows from the prose):

  table(T) & occurs_in(T,areaselection) & occurs_in(T,priceselection) → goodtable(T).
  goodtable(T) & child(Parent,T) → containsgoodtable(Parent).
  goodtable(T) & ¬containsgoodtable(T) → propertysearchmask(T).
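The rules on this slide can be run directly with a naive evaluator; a minimal sketch (facts invented for illustration; negation is handled by fully computing containsgoodtable before negating it, and we assume child is the direct-child relation, closing it over ancestors):

```python
def property_search_mask(tables, occurs_in, child):
    """Evaluate the slide's rules: the search mask is a table holding both
    selection fields that does not contain a smaller such table."""
    good = {t for t in tables
            if {"areaselection", "priceselection"} <= occurs_in.get(t, set())}
    contains_good = {p for (p, t) in child if t in good}
    # propagate upwards: an ancestor of a container also contains a good table
    changed = True
    while changed:
        new = {p for (p, t) in child if t in contains_good} - contains_good
        contains_good |= new
        changed = bool(new)
    return good - contains_good

tables = {"outer", "inner"}
occurs_in = {"outer": {"areaselection", "priceselection"},
             "inner": {"areaselection", "priceselection"}}
child = {("outer", "inner")}
print(property_search_mask(tables, occurs_in, child))  # -> {'inner'}
```

The smallest table containing both fields wins, exactly as the prose on the slide demands.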
27. Object creation in Datalog± (the existential quantifier, lost in the slide export, is restored below):

  table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).

Deduction with arbitrary TGDs is undecidable.
28. Object creation in Datalog± (continued): Datalog± requires guardedness of rule bodies, which restores decidability with linear-time data complexity.
29-30. Datalog±: a family of languages incorporating ontological reasoning (beyond DL-Lite). Further research is needed to extend it into an ideal language for web objects. Transitivity, for example, is unguarded:

  containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3)   (unguarded!)
45. Result Extraction <XQ>: For each atomic result A, let

  price       = A/.../.../text()
  description = A/.../.../../text()
  type        = A/.../.../../text()
  ........

Return:

  <rental area="Oxford">
    <price>1,200</price>
    <bedrooms>3</bedrooms>
    <bathrooms>1</bathrooms>
    <type>Flat</type>
    <location>George Street,OX1</location>
    <description>...</description>
    <otherInfo>Furnished; Long let - more than six months</otherInfo>
    ...
  </rental>
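The return clause above assembles one flat XML record per atomic result; an equivalent assembly with Python's standard library, using field values that mirror the slide's example, would be:

```python
import xml.etree.ElementTree as ET

def build_rental(area, fields):
    """Build one flat <rental> record per atomic result, one child per field."""
    rental = ET.Element("rental", area=area)
    for name, value in fields.items():
        ET.SubElement(rental, name).text = value
    return rental

record = build_rental("Oxford", {
    "price": "1,200", "bedrooms": "3", "bathrooms": "1",
    "type": "Flat", "location": "George Street,OX1",
})
print(ET.tostring(record, encoding="unicode"))
```

Each record is deliberately flat (no nesting), matching the prescribed per-page schema the editor's notes describe.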
Editor's Notes
The XQuery statement (a FLWOR expression) outputs a single XML document per result page. It produces a flat (i.e., no hierarchy) XML document per some prescribed schema. Key intuition: the notion of an atomic result, regardless of presentation (list, table, etc.). Atomic results are analogous to RDBMS query returns (attributes form tuples); field inputs are preserved as element attribute values.