Data is incomplete when it contains missing/unknown information, or more generally when it is only partially available, e.g. because of restrictions on data access.
Incompleteness is receiving renewed interest as it is naturally generated in data interoperation, a very common framework for today's data-centric applications. In this setting, data is decentralized, needs to be integrated from several sources, and is exchanged between different applications. Incompleteness arises from the semantic and syntactic heterogeneity of the different data sources.
Querying incomplete data is usually an expensive task. In this talk we survey the state of the art and recent developments in the tractability of querying incomplete data, under different possible interpretations of incompleteness.
The document describes decision tree induction and classification. It provides examples of datasets that can be used to build decision trees, including examples on loan approvals, playing sailboats, and shapes. It explains the basic TDIDT (Top Down Induction of Decision Trees) algorithm for building decision trees from datasets in a recursive partitioning manner. Key aspects of decision tree learning covered include attribute selection criteria, information gain, entropy, and residual information.
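To make the attribute-selection step concrete, here is a minimal sketch (not taken from the document) of how entropy and information gain are typically computed when TDIDT chooses the attribute to split on; the toy "income" data is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting `rows` on `attribute`.

    `rows` is a list of dicts, `labels` the matching class labels; the second
    term is the residual information after the split.
    """
    base = entropy(labels)
    by_value = {}
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attribute], []).append(label)
    residual = sum(len(subset) / len(labels) * entropy(subset)
                   for subset in by_value.values())
    return base - residual

# Toy "loan approval" style data (hypothetical, for illustration only).
rows = [{"income": "high"}, {"income": "high"}, {"income": "low"}, {"income": "low"}]
labels = ["approve", "approve", "reject", "approve"]
print(information_gain(rows, labels, "income"))   # ~0.31 bits
```

TDIDT would compute this gain for every candidate attribute and split on the one with the highest value, then recurse on each partition.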
Microsoft PROSE SDK: A Framework for Inductive Program Synthesis (Alex Polozov)
Presented at SPLASH (OOPSLA) 2015.
Inductive synthesis, or programming-by-examples (PBE), is gaining prominence with disruptive applications for automating repetitive tasks in end-user programming. However, designing, developing, and maintaining an effective industrial-quality inductive synthesizer is an intellectual and engineering challenge, requiring 1-2 man-years of effort.
Our novel observation is that many PBE algorithms are a natural fall-out of one generic meta-algorithm and the domain-specific properties of the operators in the underlying domain-specific language (DSL). The meta-algorithm propagates example-based constraints on an expression to its subexpressions by leveraging associated witness functions, which essentially capture the inverse semantics of the underlying operator. This observation enables a novel program synthesis methodology called _data-driven domain-specific deduction_ (D<sup>4</sup>), where domain-specific insight, provided by the DSL designer, is separated from the synthesis algorithm.
Our **FlashMeta** framework implements this methodology, allowing synthesizer developers to generate an efficient synthesizer from the mere DSL definition (if properties of the DSL operators have been modeled). In our case studies, we found that 10+ existing industrial-quality mass-market applications based on PBE can be cast as instances of D<sup>4</sup>. Our evaluation includes reimplementation of some prior works, which in FlashMeta become more efficient, maintainable, and extensible. As a result, FlashMeta-based PBE tools are deployed in several industrial products, including Microsoft PowerShell 3.0 for Windows 10, Azure Operational Management Suite, and Microsoft Cortana digital assistant.
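As a rough illustration of the witness-function idea described above, the following toy sketch (a hypothetical mini-DSL, not the PROSE API) shows how the inverse semantics of a Concat operator propagates a single output example down to example-based constraints on its subexpressions:

```python
def witness_concat(output):
    """Inverse semantics of Concat(left, right): all ways to split the output."""
    return [(output[:i], output[i:]) for i in range(1, len(output))]

def synthesize(inputs, output):
    """Return programs (as nested tuples) consistent with one input/output example."""
    programs = [("Const", output)]                       # a constant string always works
    programs += [("Input", i) for i, v in enumerate(inputs) if v == output]
    # Inductive case: push the example through Concat via its witness function,
    # then solve the two induced subproblems recursively.
    for left_out, right_out in witness_concat(output):
        for left in synthesize(inputs, left_out):
            for right in synthesize(inputs, right_out):
                programs.append(("Concat", left, right))
    return programs

print(len(synthesize(["John", "Doe"], "John Doe")))      # many consistent programs
```

The generic meta-algorithm only ever calls the operator-specific witness functions; the DSL designer supplies those, which is the separation of concerns that D<sup>4</sup> advocates.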
This document discusses active learning. It begins with an overview and definition of active learning, contrasting it with passive learning. It then discusses applications of active learning in areas like speech recognition and medical diagnosis. Examples of active learning techniques like pool-based active learning are provided. The document examines whether active learning makes a difference compared to passive learning and discusses related human experiments. It explores active learning from multiple oracles, dealing with weak and strong oracles, and applications to crowdsourcing. The document concludes by discussing future directions and providing references.
The document describes an active learning strategy called D-Confidence that aims to efficiently identify small classes with a low labeling cost from incomplete specifications. It does this by combining low-confidence exploitation to improve accuracy with high-distance exploration to improve representativeness of identified cases. The strategy is evaluated on several datasets and is shown to identify minority classes faster, particularly for imbalanced data, outperforming alternative active learning methods.
This document provides an overview and introduction to the course CIS 674 Introduction to Data Mining. It defines data mining, outlines basic data mining tasks such as classification, clustering, and association rule mining. It also discusses the relationship between data mining and knowledge discovery in databases (KDD), and highlights some issues in data mining such as handling large datasets, high dimensionality, and interpretation of results.
This document provides an overview and introduction to the course CIS 674 Introduction to Data Mining. It defines data mining, outlines the basic data mining tasks of classification, regression, clustering, summarization and link analysis. It discusses the relationship between data mining and knowledge discovery in databases (KDD) and describes some common issues in data mining such as handling large datasets, high dimensionality, interpretation and visualization of results.
A Network-Aware Approach for Searching As-You-Type in Social Media (INRIA-OAK)
We present a novel approach for as-you-type top-k keyword search over social media. We adopt a natural "network-aware" interpretation of information relevance, by which information produced by users who are closer to the seeker is considered more relevant. In practice, this query model poses new challenges for effectiveness and efficiency in online search, even when a complete query is given as input in one keystroke. This is mainly because it requires a joint exploration of the social space and classic IR indexes such as inverted lists. We describe a memory-efficient and incremental prefix-based retrieval algorithm, which also exhibits anytime behavior, allowing it to output the most likely answer within any chosen running-time limit. We evaluate it through extensive experiments for several applications and search scenarios, including searching for posts in micro-blogging (Twitter and Tumblr), as well as searching for businesses based on reviews in Yelp. These experiments show that our solution is effective in answering real-time as-you-type searches over social media.
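For intuition only, the following small sketch (hypothetical data and scoring weights, not the system from the talk) illustrates the network-aware relevance model: candidate posts matching the typed prefix are ranked by combining textual match with the author's social proximity to the seeker.

```python
import heapq

posts = [                       # (author, text) pairs, hypothetical data
    ("alice", "coffee shop downtown"),
    ("bob",   "coffee brewing tips"),
    ("carol", "concert tickets for sale"),
]
proximity = {"alice": 1.0, "bob": 0.4, "carol": 0.1}   # closeness to the seeker

def as_you_type(prefix, k=2, alpha=0.5):
    """Top-k posts for a typed prefix, mixing textual match and social proximity."""
    scored = []
    for author, text in posts:
        words = text.split()
        matches = sum(w.startswith(prefix) for w in words)
        if matches:
            text_score = matches / len(words)
            score = alpha * text_score + (1 - alpha) * proximity[author]
            scored.append((score, author, text))
    return heapq.nlargest(k, scored)

for score, author, text in as_you_type("cof"):
    print(f"{score:.2f}  {author}: {text}")
```

The actual algorithm avoids this exhaustive scan by exploring prefix-organized inverted lists and the social graph jointly and incrementally, keystroke by keystroke.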
Speeding up information extraction programs: a holistic optimizer and a learn... (INRIA-OAK)
A wealth of information produced by individuals and organizations is expressed in natural language text. Text lacks the explicit structure that is necessary to support rich querying and analysis. Information extraction systems are sophisticated software tools to discover structured information in natural language text. Unfortunately, information extraction is a challenging and time-consuming task.
In this talk, I will first present our proposal to optimize information extraction programs. It consists of a holistic approach that focuses on: (i) optimizing all key aspects of the information extraction process collectively and in a coordinated manner, rather than focusing on individual subtasks in isolation; (ii) accurately predicting the execution time, recall, and precision for each information extraction execution plan; and (iii) using these predictions to choose the best execution plan to execute a given information extraction program.
Then, I will briefly present a principled, learning-based approach for ranking documents according to their potential usefulness for an extraction task. Our online learning-to-rank methods exploit the information collected during extraction, as we process new documents and the fine-grained characteristics of the useful documents are revealed. Then, these methods decide when the ranking model should be updated, hence significantly improving the document ranking quality over time.
This is joint work with Gonçalo Simões, INESC-ID and IST/University of Lisbon, and Pablo Barrio and Luis Gravano from Columbia University, NY.
Slides covered during Analytics Boot Camp conducted with the help of IBM, Venturesity. Special credits to Kumar Rishabh (Google) and Srinivas Nv Gannavarapu (IBM)
The document provides an overview of key driver analysis techniques. It discusses the objectives of key driver analysis as determining the relative importance of predictor variables in predicting an outcome variable. It then reviews techniques like Shapley regression and Relative Importance Analysis. The document outlines 13 assumptions that should be checked when performing key driver analysis, such as whether the predictor and outcome variables are numeric. It provides options and recommendations for addressing violations of assumptions, like using generalized linear models for non-numeric variables. Two case studies applying the techniques to brand and cola preference data are presented.
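As a rough illustration of the Shapley-regression idea mentioned above, the sketch below (synthetic data, brute-force enumeration over predictor subsets, so only practical for a handful of drivers) computes each predictor's average marginal contribution to the model R²:

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
y = 2 * X[:, 0] + 1 * X[:, 1] + rng.normal(size=n)      # driver 2 is pure noise

def r2(cols):
    """R^2 of an OLS fit of y on the given predictor columns (with intercept)."""
    if not cols:
        return 0.0
    design = np.column_stack([np.ones(n), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return 1 - resid.var() / y.var()

p = X.shape[1]
shapley = np.zeros(p)
for j in range(p):
    others = [i for i in range(p) if i != j]
    for size in range(p):
        for subset in itertools.combinations(others, size):
            weight = 1 / (p * math.comb(p - 1, size))
            shapley[j] += weight * (r2(subset + (j,)) - r2(subset))

print(np.round(shapley, 3))   # importances sum (approximately) to the full-model R^2
```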
DoWhy: An end-to-end library for causal inference (Amit Sharma)
In addition to efficient statistical estimators of a treatment's effect, successful application of causal inference requires specifying assumptions about the mechanisms underlying observed data and testing whether they are valid, and to what extent. However, most libraries for causal inference focus only on the task of providing powerful statistical estimators. We describe DoWhy, an open-source Python library that is built with causal assumptions as its first-class citizens, based on the formal framework of causal graphs to specify and test causal assumptions. DoWhy presents an API for the four steps common to any causal analysis---1) modeling the data using a causal graph and structural assumptions, 2) identifying whether the desired effect is estimable under the causal model, 3) estimating the effect using statistical estimators, and finally 4) refuting the obtained estimate through robustness checks and sensitivity analyses. In particular, DoWhy implements a number of robustness checks including placebo tests, bootstrap tests, and tests for unobserved confounding. DoWhy is an extensible library that supports interoperability with other implementations, such as EconML and CausalML for the estimation step.
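A minimal sketch of the four-step flow on synthetic data is shown below; the column names, chosen estimator, and refuter are illustrative, and exact method names and arguments can vary across DoWhy versions.

```python
import numpy as np
import pandas as pd
from dowhy import CausalModel

rng = np.random.default_rng(1)
n = 1000
w = rng.normal(size=n)                        # common cause
t = (w + rng.normal(size=n) > 0).astype(int)  # treatment influenced by w
y = 2 * t + w + rng.normal(size=n)            # outcome influenced by t and w
df = pd.DataFrame({"w": w, "t": t, "y": y})

# 1) Model the data: here via common causes (a full causal graph string can be
#    passed through the `graph` argument instead).
model = CausalModel(data=df, treatment="t", outcome="y", common_causes=["w"])

# 2) Identify whether the desired effect is estimable under the causal model.
estimand = model.identify_effect()

# 3) Estimate the effect with a statistical estimator.
estimate = model.estimate_effect(estimand, method_name="backdoor.linear_regression")

# 4) Refute the estimate through robustness checks (here, a placebo treatment).
refutation = model.refute_estimate(estimand, estimate,
                                   method_name="placebo_treatment_refuter")
print(estimate.value)
print(refutation)
```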
This document provides an overview of key concepts in data analytics, including:
1. It distinguishes analytics, which uses analysis to make recommendations, from analysis itself.
2. Common purposes of data analysis are to confirm hypotheses or explore data through confirmatory or exploratory analysis.
3. The typical data analytics workflow involves 8 steps: identifying the issue, data collection/preparation, cleansing, transformation, analysis, validation, presentation, and making recommendations.
4. Important data preparation concepts covered include storage options, access and privacy considerations, representation formats, and data scales. Cleansing, transformation, and feature engineering techniques are also summarized.
5. Common analysis methods, validation approaches, and
This document summarizes a talk on practical machine learning issues. It discusses identifying the right machine learning scenario for a given task, such as classification, regression, clustering, or reinforcement learning. It also addresses common reasons why machine learning models may fail, such as using the wrong evaluation metrics, not having enough labeled training data, or not performing proper feature engineering. The document emphasizes the importance of choosing the appropriate machine learning model, having sufficient high-quality data, and selecting useful features.
A lot of people talk about Data Mining, Machine Learning and Big Data. It clearly must be important, right?
A lot of people are also trying to sell you snake oil - sometimes half-arsed and overpriced products or solutions promising a world of insight into your customers or users if you hand over your data to them. Instead, trying to understand your own data and what you could do with it should be the first thing you’d be looking at.
In this talk, we’ll introduce some basic terminology about Data and Text Mining as well as Machine Learning, and will have a look at what you can do on your own to understand more about your data and discover patterns in it.
The slides from my talk at the FOSDEM HPC, Big Data and Data Science devroom, which has general tips from various sources about putting your first machine learning model in production.
The video is available from the FOSDEM website: https://fosdem.org/2017/schedule/event/machine_learning_zoo/
Professor Steve Roberts; The Bayesian Crowd: scalable information combinati... (Ian Morgan)
Professor Steve Roberts, Machine learning research group and Oxford-Man Institute + Alan Turing Institute. Steve gave this talk on the 24th January at the London Bayes Nets meetup.
Statistics in the age of data science, issues you can not ignore (Turi, Inc.)
This document discusses issues in statistics that data scientists can and cannot ignore when working with large datasets. It begins by outlining the talk and defining key terms in data science. It then explains that model assessment, such as estimating model performance on new data, becomes easier with more data as statistical adjustments are not needed. However, more data and variables are not always better, as noise, collinearity, and overfitting can still occur. Several examples are given where common machine learning algorithms can be fooled into achieving high accuracy on training data even when the target variable is random. The conclusion emphasizes that data science, statistics, and domain expertise each provide unique perspectives, and effective teams need to understand all views.
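The "fooled by training accuracy" point can be reproduced in a few lines; in this hypothetical sketch the target is pure noise, yet a flexible model scores near-perfectly on the training data while cross-validation reveals chance-level performance.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 500))          # many noisy predictors
y = rng.integers(0, 2, size=200)         # target is pure noise

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print("training accuracy:", clf.score(X, y))                                  # close to 1.0
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())   # around 0.5
```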
Whether you are faced with performing quantitative or qualitative research, managing and analyzing data is always going to be a major part of the work. Discover a wide variety of statistical software programs that you can use, and learn the advantages and disadvantages of each depending on the type of data you will be using. This is a presentation developed through the Graduate Resource Center at the University of New Mexico.
With product reliability demonstration test planning and execution weighing heavily on cost, availability and schedule factors, Bayesian methods offer an intelligent way of incorporating engineering knowledge based on historical information into data analysis and interpretation, resulting in an overall more precise and less resource-intensive failure rate estimation. This talk consists of three parts (a minimal sketch of the underlying Bayesian update follows the outline):
1. Introduction to Bayesian vs Frequentist statistical approaches
2. Bayesian formalism for reliability estimation
3. Product/component case studies and examples
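As a minimal sketch of the Bayesian update behind part 2 (hypothetical numbers, assuming a Gamma prior on the failure rate and a Poisson model for failure counts over accumulated test time), prior engineering knowledge and demonstration-test data combine in closed form:

```python
from scipy import stats

# Prior: roughly "2 failures' worth" of historical knowledge over 4000 hours.
a_prior, b_prior = 2.0, 4000.0
# New demonstration test: 1 failure observed over 3000 hours.
failures, test_hours = 1, 3000.0

# Gamma prior + Poisson likelihood => Gamma posterior (conjugate update).
a_post = a_prior + failures
b_post = b_prior + test_hours

posterior = stats.gamma(a=a_post, scale=1.0 / b_post)
print("posterior mean failure rate:", a_post / b_post, "per hour")
print("90% credible upper bound:", posterior.ppf(0.90))
```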
In this talk we saw a few query recommendation techniques for long-tail, unseen, or poorly formulated queries, as well as the synthetic generation of web queries.
This document provides an agenda for a meetup on data science topics. The meetup will be held once a month, with the next one on June 14th. It aims to provide the best networking and learning platform in Bangalore for areas like data science, big data, machine learning. The agenda includes introductions, an overview of the model building lifecycle, data exploration and feature engineering techniques, and modeling techniques like logistic regression, decision trees, random forests, and SVM. Teams will be formed to predict whether bids are from humans or robots using these techniques. Resources for implementing the techniques in Python and R are also provided.
This document provides an overview of machine learning, including:
- The types of machine learning such as supervised learning, unsupervised learning, and reinforcement learning. Supervised learning involves predicting outputs from labeled inputs using techniques like regression and classification, while unsupervised learning finds patterns in unlabeled data using clustering and dimensionality reduction.
- Common machine learning applications including speech recognition, machine translation, strategic gaming, computer vision, autonomous vehicles, manufacturing, and healthcare.
- Effective machine learning involves reducing programming time through existing tools, customizing and scaling products as needed, and using statistics rather than logic to make decisions from real-world data.
This document provides an overview of key aspects of data preparation and processing for data mining. It discusses the importance of domain expertise in understanding data. The goals of data preparation are identified as cleaning missing, noisy, and inconsistent data; integrating data from multiple sources; transforming data into appropriate formats; and reducing data through feature selection, sampling, and discretization. Common techniques for each step are outlined at a high level, such as binning, clustering, and regression for handling noisy data. The document emphasizes that data preparation is crucial and can require 70-80% of the effort for effective real-world data mining.
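For illustration, the small pandas sketch below (hypothetical data) shows two of the preparation steps mentioned above: imputing missing values and smoothing a noisy numeric attribute by binning.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age":    [25, np.nan, 47, 51, 62, np.nan, 33],
    "income": [1800, 2100, 400000, 2500, 2700, 3000, 2200],  # one obvious outlier
})

# Handle missing data: impute the median (one of several reasonable strategies).
df["age"] = df["age"].fillna(df["age"].median())

# Handle noisy data by equal-frequency binning, then replace values by bin means.
df["income_bin"] = pd.qcut(df["income"], q=3)
df["income_smoothed"] = (df.groupby("income_bin", observed=True)["income"]
                           .transform("mean"))

print(df)
```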
Twdatasci cjlin-big data analytics - challenges and opportunities (Aravindharamanan S)
The document discusses the challenges and opportunities of big data analytics. It outlines some key differences between traditional data mining and big data, such as the size of datasets exceeding the capacity of single computers. This requires distributed data mining or machine learning across multiple machines. However, distributing computations introduces technical challenges around communication, synchronization, and developing algorithms that can handle iterative access to large datasets. While opportunities exist to parallelize existing algorithms or design new distributed algorithms, it will take time to develop integrated tools that make big data analytics as easy as traditional data mining. Lessons can be drawn from how linear algebra routines were optimized to handle memory hierarchies efficiently. Success will depend on understanding both algorithms and distributed systems.
This document discusses lifelong learning and approaches to address it. It begins with reminders for project deadlines and the plan for the lecture, which is to discuss the lifelong learning problem statement, basic approaches, and how to potentially improve upon them. It then reviews problem statements for online learning, multi-task learning, and meta-learning, noting that real-world settings involve sequential learning over time. Common approaches such as fine-tuning on new data can hurt performance on past tasks, while storing all data is infeasible. Meta-learning aims to efficiently learn new tasks from a non-stationary distribution of tasks.
Graph theory could potentially make a big impact on how we conduct businesses. Imagine the case where you wish to maximize the reach of your promotion via leveraging your customers' influence, to advocate your products and bring their friends on board. The same logic of harnessing one's networks can be applied to purchase recommendation, customer behavior, and fraud detection.
Running analyses on large graphs was not trivial for many companies - until recently. The field has made significant steps in the last five years and scalable graph computations are now the norm. You can now run graph computations out-of-core (no memory constraints) and in parallel (multiple machines), especially in Spark which is spreading like wildfire.
A lot of people are familiar with GraphX, a pretty solid implementation of scalable graphs in Spark. GraphX is pretty interesting but the project seems to be orphaned. The good news is, there is now an alternative: Graphframes. They are a new data structure that takes the best parts of dataframes and graphs.
In this talk, I will be explaining how to use Graphframes from Python, a new data structure in Spark 2.0 that takes the best parts of dataframes and graphs, with an example using personalized pagerank for recommendations.
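A minimal sketch of that usage is shown below; the vertex and edge data are hypothetical, the graphframes package has to be supplied to Spark separately (e.g. via --packages), and argument names may differ slightly between GraphFrames releases.

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("graphframes-demo").getOrCreate()

vertices = spark.createDataFrame(
    [("a", "Alice"), ("b", "Bob"), ("c", "Carol")], ["id", "name"])
edges = spark.createDataFrame(
    [("a", "b", "follows"), ("b", "c", "follows"), ("c", "a", "follows")],
    ["src", "dst", "relationship"])

g = GraphFrame(vertices, edges)

# Personalized PageRank from vertex "a": scores reflect proximity to that source,
# which is what makes it useful for per-user recommendations.
results = g.pageRank(resetProbability=0.15, maxIter=10, sourceId="a")
results.vertices.select("id", "pagerank").show()
```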
Change Management in the Traditional and Semantic Web (INRIA-OAK)
Data has played such a crucial role in the development of modern society that relational databases, the most popular data management systems, have been called the "foundation of western civilization" for their massive adoption in business, government and education, which allowed the necessary productivity and standardization rates to be reached.
Data, however, in order to become knowledge and a real asset, needs to be organized in a way that enables the retrieval of relevant information and data analysis. Another fundamental aspect of effective data management is full support for changes at the data and metadata levels.
A failure to manage data changes usually results in a dramatic diminishment of the data's usefulness. Data does evolve, in its format and its content, due to changes in the modeled domain, error correction, and different required levels of granularity, in order to accommodate new information.
Change management has been covered extensively in the literature; we concentrate in particular on two enabling factors of the World Wide Web: XML (the W3C-endorsed and widely used markup language for defining semi-structured data) and ontologies (a major enabling factor of the Semantic Web vision), with their related metadata.
Concretely, we cover change management techniques for XML, and then concentrate on the additional problems arising when considering the evolution of semantically-equipped data (with the help of a use case featuring evolving ontologies and related mappings), which cannot be handled exclusively at the syntactic level, disregarding logical consequences.
In recent years, several important content providers such as Amazon, Musicbrainz, IMDb, Geonames, Google, and Twitter, have chosen to export their data through Web services. To unleash the potential of these sources for new intelligent applications, the data has to be combined across different APIs.
To this end, we have developed ANGIE, a framework that maps the knowledge provided by Web services dynamically into a local knowledge base. ANGIE represents Web services as views with binding patterns over the schema of the knowledge base. In this talk, I will focus on two problems related to our framework.
In the first part, the focus will be on the automatic integration of new Web services. I will present a novel algorithm for inferring the view definition of a given Web service in terms of the schema of the global knowledge base. The algorithm also generates a declarative script that can transform the call results into results of the view. Our experiments on real Web services show the viability of our approach.
The second part will address the evaluation of conjunctive queries under a budget of calls. Conjunctive queries may require an unbounded number of calls in order to compute the maximal answers. However, Web services typically allow only a fixed number of calls per session. Therefore, we have to prioritize query evaluation plans. We are working on distinguishing, among all plans that could return answers, those plans that actually will. Finally, I will show an application for this new notion of plans.
On building more human query answering systems (INRIA-OAK)
The underlying principle behind every query answering system is the existence of a query describing the information of interest. When this model is applied to non-expert users, two traditional issues become highly significant.
The first is that many queries are over-specified, leading to empty answers. We propose a principled optimization-based interactive query relaxation framework for such queries. The framework dynamically computes and suggests alternative queries with fewer conditions to help the user arrive at a query with a non-empty answer, or at a query for which it is clear that, independently of the relaxations, the answer will always be empty.
The second issue is the user's lack of expertise to accurately describe the requirements of the elements of interest. The user may, however, know examples of elements they would like to have in the results. We introduce a novel query paradigm in which queries are no longer specifications of what the user is searching for, but simply a sample of what the user knows to be of interest. We refer to this novel form of queries as Exemplar Queries.
Dynamically Optimizing Queries over Large Scale Data Platforms (INRIA-OAK)
Enterprises are adopting large-scale data processing platforms, such as Hadoop, to gain actionable insights from their "big data". Query optimization is still an open challenge in this environment due to the volume and heterogeneity of data, comprising both structured and un/semi-structured datasets. Moreover, it has become common practice to push business logic close to the data via user-defined functions (UDFs), which are usually opaque to the optimizer, further complicating cost-based optimization. As a result, classical relational query optimization techniques do not fit well in this setting, while at the same time, suboptimal query plans can be disastrous with large datasets. In this talk, I will present new techniques that take into account UDFs and correlations between relations for optimizing queries running on large scale clusters. We introduce "pilot runs", which execute part of the query over a sample of the data to estimate selectivities, and employ a cost-based optimizer that uses these selectivities to choose an initial query plan. Then, we follow a dynamic optimization approach, in which plans evolve as parts of the queries get executed. Our experimental results show that our techniques produce plans that are at least as good as, and up to 2x (4x) better for Jaql (Hive) than, the best hand-written left-deep query plans.
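For intuition only, the toy sketch below (hypothetical predicates standing in for opaque UDFs, not the system from the talk) illustrates the pilot-run idea: measure selectivities on a sample of the data, then order the filters accordingly before the full run.

```python
import random

def estimate_selectivity(rows, predicate, sample_size=1000):
    """Fraction of a random sample that satisfies the (possibly opaque) predicate."""
    sample = random.sample(rows, min(sample_size, len(rows)))
    return sum(predicate(r) for r in sample) / len(sample)

def choose_filter_order(rows, predicates):
    """Apply the most selective predicate first, as a cost-based optimizer would."""
    selectivities = {name: estimate_selectivity(rows, p)
                     for name, p in predicates.items()}
    return sorted(predicates, key=lambda name: selectivities[name])

rows = [{"len": random.randint(1, 100)} for _ in range(50_000)]
predicates = {
    "udf_is_long": lambda r: r["len"] > 90,     # stands in for an opaque UDF
    "udf_is_even": lambda r: r["len"] % 2 == 0,
}
print(choose_filter_order(rows, predicates))    # the rarer predicate comes first
```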
The document discusses the growth of RDF data volumes on the web. It notes that the Linked Open Data cloud currently contains over 325 datasets with more than 25 billion triples, and the size is doubling every year. It provides examples of popular datasets that are part of the Linked Open Data cloud.
This document summarizes the OAK project meeting in fall 2014. It discusses OAK's history and environment, members, current grants, main research themes, and software. Key areas of research include database techniques for the semantic web, efficient parallel processing of web data, crowdsourcing and graphs, parallel XML and provenance, and hybrid and redundant data stores. Upcoming events include PhD defenses in September and the INRIA activity report.
Nautilus is a SQL query development tool started in 2009 that is still under development. It helps users understand query results by answering questions about why tuples are or aren't received. It uses various algorithms like TedNaive, TedRevised, and NedExplain, which are open source code implemented in Java as an Eclipse plugin. The tool has over 28,000 lines of code and 230 classes, and relies on dependencies like PostgreSQL, ZQL, and various Java libraries. It is developed by a team that has included past contributors and is currently worked on by Katerina Tzompanaki and Alexandre Constantin.
WaRG is a framework that allows the composition and execution of analytical RDF queries against a given RDF dataset. It contains 4924 lines of Java code and 1091 lines of Q scripts. The framework includes a GUI for visualizing the analytical schema, RDF dataset, and query results. It also allows building and executing analytical queries composed of a classifier query, measure query, and aggregate function. WaRG has dependencies on Kdb and Prefuse, and future work includes Postgres integration and improving intermediate result handling.
The document describes ViP2P, a system for distributed data management and querying in peer-to-peer networks using materialized views. It enables peers to pose queries over distributed data by rewriting queries using available views. The querying peer looks up view definitions, rewrites the query into a logical plan based on the views, generates an optimized physical plan, and executes it to return results. The system uses distributed hash tables, third-party libraries, and supports defining, indexing, materializing and querying distributed views across the peer-to-peer network.
The S4 project is a social semantic search engine that searches Twitter tweets for keywords defined in RDF semantics from DBPedia and ranks the results based on proximity and position of keywords. The code is written in Python and stores tweet data either through serialization or a PostgreSQL database. It retrieves tweets and RDF semantics, stores the parsed tweet data, and runs a search algorithm to return top results to users.
This document describes an RDF saturation algorithm implemented in Java within the RDFVS project. The saturation algorithm is responsible for finding and storing implicit RDF triples by reasoning over input RDF data and RDFS schema. It uses PostgreSQL for data storage and supports updating the saturated triples when new data or schema information is received. The algorithm has been implemented as modules with over 6,000 lines of code by a single contributor.
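As a rough illustration of what such a saturation step computes (two RDFS rules only, in-memory rather than PostgreSQL-backed, hypothetical triples), consider:

```python
RDF_TYPE, SUBCLASS, SUBPROP = "rdf:type", "rdfs:subClassOf", "rdfs:subPropertyOf"

def saturate(triples):
    """Apply subClassOf and subPropertyOf entailment rules until a fixpoint."""
    triples = set(triples)
    while True:
        inferred = set()
        for s, p, o in triples:
            if p == RDF_TYPE:
                # (x type C), (C subClassOf D)  =>  (x type D)
                inferred |= {(s, RDF_TYPE, d) for c, q, d in triples
                             if q == SUBCLASS and c == o}
            else:
                # (x p y), (p subPropertyOf q)  =>  (x q y)
                inferred |= {(s, q2, o) for p1, q, q2 in triples
                             if q == SUBPROP and p1 == p}
        new = inferred - triples
        if not new:
            return triples
        triples |= new

data = {
    (":alice", RDF_TYPE, ":Student"),
    (":Student", SUBCLASS, ":Person"),
    (":advises", SUBPROP, ":knows"),
    (":bob", ":advises", ":alice"),
}
for t in sorted(saturate(data)):
    print(t)   # includes (:alice rdf:type :Person) and (:bob :knows :alice)
```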
The document describes an RDF data generator that can produce synthetic RDF data and schemas based on user-defined parameters. The generator consists of modules that control the generation of literals, data, schemas, distributions, and graphs. It allows configuring how subjects, predicates, and objects are generated by selecting sources and distributions. The generator codebase contains over 3,000 lines of Java code and relies on external Colt, Piccolo, and ViP2P libraries.
This document describes a project that provides methods for estimating the cardinality of conjunctive queries over RDF data. It discusses the key modules including an RDF loader, parser and importer to extract triples from RDF files and load them into a database. The database stores the triples and generates statistics tables. A cardinality estimator takes a conjunctive query and database statistics to output an estimate of the query's cardinality.
This document describes an RDF query reformulation algorithm. The algorithm takes as input a conjunctive RDF query and RDF Schema, and outputs a union of equivalent conjunctive queries. The code is written in Java and located on a source code repository. It contains several packages for reasoning, rules, signatures, and utilities. Libraries used include Jena and Xerces. Papers related to the algorithm are cited. Future work includes adding more robust tests of reformulated query content.
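For intuition, the small sketch below (simplified to direct subclass/subproperty relationships, with a hypothetical schema and query) shows the reformulation idea: each query atom is expanded against the RDFS schema, and the union of all combinations forms the equivalent conjunctive queries.

```python
from itertools import product

SUBCLASS, SUBPROP, RDF_TYPE = "rdfs:subClassOf", "rdfs:subPropertyOf", "rdf:type"

schema = {
    (":Student", SUBCLASS, ":Person"),
    (":advises", SUBPROP, ":knows"),
}

def alternatives(atom):
    """All atoms that could entail `atom` under the (direct) RDFS schema."""
    s, p, o = atom
    alts = [atom]
    if p == RDF_TYPE:
        alts += [(s, RDF_TYPE, sub) for sub, q, sup in schema
                 if q == SUBCLASS and sup == o]
    else:
        alts += [(s, sub, o) for sub, q, sup in schema
                 if q == SUBPROP and sup == p]
    return alts

def reformulate(query):
    """Return a union (list) of conjunctive queries equivalent to `query`."""
    return [list(combo) for combo in product(*(alternatives(a) for a in query))]

query = [("?x", RDF_TYPE, ":Person"), ("?x", ":knows", "?y")]
for cq in reformulate(query):
    print(cq)   # 4 conjunctive queries whose union matches the original
```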
This document describes an RDF loader that takes plain RDF triple files as input and loads them into a PostgreSQL database. There are two versions - an older Java version and a newer bash script version that is faster. The loader creates tables to store the triples and indexes, and can optionally create an encoded version of the triples table and statistics on the URIs in the triples. It works by extracting triples from the NT files and inserting them into the database tables. Key libraries used include PostgreSQL and Log4j. Potential next steps include handling longer URIs and rewriting the bash version in Java.
This document describes a reuse-based optimization framework for Pig Latin queries. The framework identifies common subexpressions across a workload of Pig Latin scripts, selects the most efficient ones to execute based on a cost model, and reuses their results to compute the original script outputs. It analyzes query graphs to merge equivalent subexpressions and finds an optimal plan using a binary integer linear programming solver. The framework codebase contains classes for the optimizer, normalization rules, and decomposition of join operators.
PAXQuery is an execution engine for the XQuery query language that runs on the Apache Flink platform. It translates XML queries to algebraic trees and maps the algebraic trees to PACT plans to parallelize the queries. The code is organized into several modules that have dependencies between them. It has external dependencies on Apache Flink and other libraries. Future work includes finishing the development of the paxquery-xparser module and developing a web client interface.
This document summarizes two projects: ConjunctiveQueries, which parses RDF conjunctive query files and represents them as Java objects, and SQLTranslator, which translates the ConjunctiveQuery objects into SQL queries. It provides information on the owners, code locations, volumes, contributors, users, structures, dependencies, and diagram of how the two projects work together.
CliqueSquare is a distributed RDF data management platform that works on top of Hadoop. It stores RDF data in a distributed file system and answers SPARQL queries by finding an optimal query execution plan and executing it using Hadoop. The code is organized into layers that include a parser, logical and physical optimizers, a job translator, and Hadoop integration. It uses various Oak projects and follows a multi-step process to take a query through different abstraction levels until it can be executed on Hadoop.
This document describes a MapReduce-based RDF loader called CliqueSquare that stores RDF data in HDFS using custom partitioning strategies. It takes RDF triples as input, filters duplicates, partitions the data across HDFS nodes based on subject, predicate, or object, and replicates the data with a factor of three. The goal is to reduce network shuffling. It is open source software with over 100 classes and 10,000 lines of code and depends on Hadoop.
Microbial interaction
Microorganisms interact with each other and can be physically associated with other organisms in a variety of ways.
One organism can be located on the surface of another organism as an ectobiont, or located within another organism as an endobiont.
Microbial interactions may be positive, such as mutualism, proto-cooperation, and commensalism, or negative, such as parasitism, predation, or competition.
Types of microbial interaction
Positive interaction: mutualism, proto-cooperation, commensalism
Negative interaction: amensalism (antagonism), parasitism, predation, competition
I. Mutualism:
It is defined as the relationship in which each organism in the interaction benefits from the association. It is an obligatory relationship in which the mutualist and the host are metabolically dependent on each other.
The mutualistic relationship is very specific: one member of the association cannot be replaced by another species.
Mutualism requires close physical contact between the interacting organisms.
The relationship of mutualism allows organisms to exist in habitats that could not be occupied by either species alone.
A mutualistic relationship between organisms allows them to act as a single organism.
Examples of mutualism:
i. Lichens:
Lichens are excellent example of mutualism.
They are the association of specific fungi and certain genus of algae. In lichen, fungal partner is called mycobiont and algal partner is called
II. Syntrophism:
It is an association in which the growth of one organism either depends on or improved by the substrate provided by another organism.
In syntrophism both organism in association gets benefits.
Compound A
Utilized by population 1
Compound B
Utilized by population 2
Compound C
utilized by both Population 1+2
Products
In this theoretical example of syntrophism, population 1 is able to utilize and metabolize compound A, forming compound B but cannot metabolize beyond compound B without co-operation of population 2. Population 2is unable to utilize compound A but it can metabolize compound B forming compound C. Then both population 1 and 2 are able to carry out metabolic reaction which leads to formation of end product that neither population could produce alone.
Examples of syntrophism:
i. Methanogenic ecosystem in sludge digester
Methane produced by methanogenic bacteria depends upon interspecies hydrogen transfer by other fermentative bacteria.
Anaerobic fermentative bacteria generate CO2 and H2 utilizing carbohydrates which is then utilized by methanogenic bacteria (Methanobacter) to produce methane.
ii. Lactobacillus arobinosus and Enterococcus faecalis:
In the minimal media, Lactobacillus arobinosus and Enterococcus faecalis are able to grow together but not alone.
The synergistic relationship between E. faecalis and L. arobinosus occurs in which E. faecalis require folic acid
2. Plan
➡ Introduction
➡ Incomplete data model
➡ Querying incomplete data
‣ the key to tractability: naïve evaluation
➡ What makes naïve evaluation work?
‣ a general framework
➡ Applicability of the framework
➡ Moving forward
3. Incomplete data
Data incompleteness: missing/unknown data values, partially available data, ...
Reasons:
‣ mistakes: wrong/missing entries
‣ restrictions on data access
‣ data heterogeneity: data exchange/integration
4. Querying incomplete data
Semantics of query answering
‣ How should the result of a query be defined in the presence of incompleteness?
Query evaluation
‣ How do we evaluate a query on an incomplete database?
‣ Can this be done efficiently?
Example Q: which employees are managers?
Incomplete database:
Employee: Green, ?, Brown
Manager: (Green, ?), (?, Brown), (Green, ?)
5. Incompleteness in theory and practice
Incompleteness in database systems
‣ Semantics of query answering: poorly designed
‣ Query evaluation: very efficient, optimized query engines
e.g.:
• In SQL the following are consistent statements for sets X, Y:
|X| > |Y| and X − Y = ∅
• This may occur if Y contains incomplete information (SQL nulls)
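This anomaly is easy to reproduce. The following minimal sketch (illustrative only; it uses Python's sqlite3 module, and the table/column names are not from the talk) writes the difference X − Y in the common NOT IN form: Y holds a SQL null, X has more rows than Y, yet the difference comes back empty because every comparison with the null evaluates to UNKNOWN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE X (v INTEGER)")
conn.execute("CREATE TABLE Y (v INTEGER)")
conn.executemany("INSERT INTO X VALUES (?)", [(1,), (2,), (3,)])
conn.executemany("INSERT INTO Y VALUES (?)", [(1,), (None,)])   # NULL: incomplete information

print(conn.execute("SELECT COUNT(*) FROM X").fetchone()[0])     # 3
print(conn.execute("SELECT COUNT(*) FROM Y").fetchone()[0])     # 2, so |X| > |Y|

# "X - Y" written with NOT IN: no tuple qualifies, since comparing with the
# NULL in Y yields UNKNOWN under SQL's three-valued logic.
print(conn.execute("SELECT v FROM X WHERE v NOT IN (SELECT v FROM Y)").fetchall())  # []
```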
6. Incompleteness in theory and practice
Incompleteness in database systems
‣ Semantics of query answering: poorly designed
‣ Query evaluation: very efficient, optimized query engines
Theoretical framework for incompleteness
[Imielinski-Lipski, Abiteboul-Kanellakis-Grahne, etc. 80’s]
‣ Semantics of query answering: clean framework, suitable semantics
‣ Query evaluation: hard
7. Incompleteness in theory and practice
Bridging the gap between theory and systems:
answer queries correctly, use classical query engines
not satisfactorily addressed even in the simplest data model
Incompleteness in database systems
‣ Semantics of query answering: poorly designed
‣ Query evaluation: very efficient, optimized query engines
Theoretical framework for incompleteness
[Imielinski-Lipski, Abiteboul-Kanellakis-Grahne, etc. 80’s]
‣ Semantics of query answering: clean framework, suitable semantics
‣ Query evaluation: hard
8. Plan
➡ Introduction
➡ Incomplete data model
➡ Querying incomplete data
‣ the key to tractability: naïve evaluation
➡ What makes naïve evaluation work?
‣ a general framework
➡ Applicability of the framework
➡ Moving forward
9. Incomplete relational data
Const : a countably infinite set of constants
Var : a countably infinite set of variables (nulls)
Complete instance: a database instance over Const
Incomplete database instance (naïve table) [Imielinski, Lipski ’84]:
a database instance over domain Const ∪ Var
dom(I) : the subset of Const ∪ Var occurring in I
Running example, the incomplete instance I:
Employee: Green, x1, Brown
Manager: (Green, x1), (x1, Brown), (Green, x2)
10. Semantics of incompleteness
Any incomplete database I represents a set ⟦I⟧ of complete databases (possible worlds).
Semantics of incompleteness: a function ⟦ ⟧ associating with each incomplete database a set of complete databases.
11. Semantics of incompleteness
Three well known relational semantics:
‣ OWA (Open World Assumption) [Imielinski-Lipski ’84]
‣ CWA (Closed World Assumption) [Reiter ’77, Imielinski-Lipski ’84]
‣ WCWA (Weak Closed World Assumption) [Reiter ’77]
Illustrated below on the instance I:
Employee: Green, x1, Brown
Manager: (Green, x1), (x1, Brown), (Green, x2)
12. Semantics of incompleteness
OWA:
⟦I⟧OWA = { D over Const | D ⊇ v(I) for some valuation v: Var → Const }
Interpretation of incompleteness: missing data values, missing tuples
Instance I:
Employee: Green, x1, Brown
Manager: (Green, x1), (x1, Brown), (Green, x2)
Some possible worlds in ⟦I⟧OWA:
with x1=White, x2=Black:
Employee: Green, White, Brown
Manager: (Green, White), (White, Brown), (Green, Black)
with x1=x2=Brown, plus an extra tuple:
Employee: Green, Brown
Manager: (Green, Brown), (Brown, Brown), (Black, Brown)
with x1=White, x2=Black, plus extra employees:
Employee: Green, White, Brown, Black, Smith
Manager: (Green, White), (White, Brown), (Green, Black)
...
13. Semantics of incompleteness
CWA:
⟦I⟧CWA = { D over Const | D = v(I) for some valuation v: Var → Const }
Interpretation of incompleteness: missing data values, no missing tuples
Instance I:
Employee: Green, x1, Brown
Manager: (Green, x1), (x1, Brown), (Green, x2)
Some possible worlds in ⟦I⟧CWA:
with x1=White, x2=Black:
Employee: Green, White, Brown
Manager: (Green, White), (White, Brown), (Green, Black)
with x1=x2=Brown:
Employee: Green, Brown
Manager: (Green, Brown), (Brown, Brown)
...
14. Semantics of incompleteness
WCWA:
⟦I⟧WCWA = { D over Const | D ⊇ v(I), dom(D) = dom(v(I)) for some valuation v: Var → Const }
Interpretation of incompleteness: missing data values, missing tuples, no missing domain elements
Instance I:
Employee: Green, x1, Brown
Manager: (Green, x1), (x1, Brown), (Green, x2)
Some possible worlds in ⟦I⟧WCWA:
with x1=White, x2=Black:
Employee: Green, White, Brown
Manager: (Green, White), (White, Brown), (Green, Black)
with x1=x2=Brown, plus extra tuples over the same domain:
Employee: Green, Brown
Manager: (Green, Brown), (Brown, Brown), (Green, Green)
with x1=White, x2=Black, plus extra tuples over the same domain:
Employee: Green, White, Brown, Black
Manager: (Green, White), (White, Brown), (Green, Black), (Green, Brown)
...
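The three definitions differ only in how a candidate complete instance D may relate to some image v(I). As a purely illustrative sketch (not from the talk; it reuses the running example's relation names and the nulls x1, x2), the following Python snippet tests whether D is a possible world of I under each semantics, searching valuations over the constants occurring in D, which suffices because every null of I must be mapped to a value that occurs in D.

```python
from itertools import product

def dom(inst):
    """All constants/nulls occurring in an instance (dict: relation name -> set of tuples)."""
    return {a for rel in inst.values() for tup in rel for a in tup}

def apply_valuation(inst, val):
    return {name: {tuple(val.get(a, a) for a in tup) for tup in rel}
            for name, rel in inst.items()}

def in_semantics(I, D, nulls, semantics):
    """Is the complete instance D a possible world of I under OWA / CWA / WCWA?"""
    consts = sorted(dom(D))
    for values in product(consts, repeat=len(nulls)):
        vI = apply_valuation(I, dict(zip(nulls, values)))
        contained = all(vI[r] <= D[r] for r in vI)          # v(I) ⊆ D
        if semantics == "OWA" and contained:
            return True
        if semantics == "CWA" and vI == D:                  # D = v(I)
            return True
        if semantics == "WCWA" and contained and dom(vI) == dom(D):
            return True
    return False

I = {"Employee": {("Green",), ("x1",), ("Brown",)},
     "Manager": {("Green", "x1"), ("x1", "Brown"), ("Green", "x2")}}
D = {"Employee": {("Green",), ("White",), ("Brown",)},
     "Manager": {("Green", "White"), ("White", "Brown"), ("Green", "Black")}}

print(in_semantics(I, D, ["x1", "x2"], "CWA"))   # True, with x1=White, x2=Black
```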
15. Plan
➡ Introduction
➡ Incomplete data model
➡ Querying incomplete data
‣ the key to tractability: naïve evaluation
➡ What makes naïve evaluation work?
‣ a general framework
➡ Applicability of the framework
➡ Moving forward
16. Querying incomplete databases
Semantics of query answering: certain answers
A Boolean query Q posed on an incomplete database I is answered with respect to its set of possible worlds ⟦I⟧.
17. Certain answers
Example Q : “There is a manager who has a manager”
Instance I:
Employee: Green, x1, Brown
Manager: (Green, x1), (x1, Brown), (Green, x2)
Q holds in every possible world of ⟦I⟧CWA / OWA / WCWA, for instance in
Employee: Green, White, Brown
Manager: (Green, White), (White, Brown), (Green, Black)
hence certQ(I) = true under CWA, OWA and WCWA alike.
18. Computing certain answers
Computing certain answers certQ(I) on I : usually hard
‣ from coNP-complete to undecidable for FO [Imielinski-Lipski ’84, Abiteboul et al ’91]
Need to use the available (incomplete) data
19. Naïve evaluation
Q(I) : Q evaluated directly on I, as if variables were new distinct constants
(xi ≠ xj for i ≠ j, and xi ≠ c for all c ∈ Const)
Naïve evaluation works for Q : Q(I) = certQ(I) for all I
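As a concrete, purely illustrative sketch, naïve evaluation of the conjunctive query “there is a manager who has a manager” on the running example takes only a few lines of Python: the nulls are ordinary strings and simply behave as fresh, pairwise distinct constants.

```python
from itertools import product

# Naïve tables of the running example; nulls are the strings "x1", "x2".
employee = {("Green",), ("x1",), ("Brown",)}
manager = {("Green", "x1"), ("x1", "Brown"), ("Green", "x2")}

def q_manager_of_manager(manager_rel):
    """Q: ∃a,b,c Manager(a,b) ∧ Manager(b,c), i.e. some manager has a manager."""
    return any(b1 == a2 for (a1, b1), (a2, b2) in product(manager_rel, repeat=2))

# Naïve evaluation: run Q on I directly, nulls behaving as distinct constants.
print(q_manager_of_manager(manager))   # True: (Green, x1) joins with (x1, Brown)
```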
20. Naïve evaluation
Example Q : “There is a manager who has a manager”
Instance I:
Employee: Green, x1, Brown
Manager: (Green, x1), (x1, Brown), (Green, x2)
Possible worlds in ⟦I⟧CWA / OWA / WCWA, e.g.:
Employee: Green, White, Brown
Manager: (Green, White), (White, Brown), (Green, Black)
...
certQ(I) = true and Q(I) = true, under CWA, OWA and WCWA
21. Naïve evaluation
Example Q : “There is a manager who has a manager”
Generalizing from the example: Q(I) = certQ(I) for all I, i.e. naïve evaluation works for Q under CWA, OWA and WCWA
22. Naïve evaluation in theory and practice
Certain answers: an entailment problem (checking that I entails Q), HARD
Naïve evaluation: a model-checking problem (checking I ⊨ Q), EFFICIENT
‣ PTIME in the size of the instance for FO queries
‣ based on classical query evaluation algorithms of database engines
‣ can benefit from query optimization techniques
When naïve evaluation works, we get the correct query answering semantics with classical query evaluation algorithms: entailment reduces to (straightforward) model-checking.
Clearly not always possible! (undecidable vs. PTIME)
23. Naïve evaluation does not always work
A concrete example Q : “All employees are managers”
Instance I:
Employee: Green, x1, Brown
Manager: (Green, x1), (x1, Brown), (Brown, x2)
Naïve evaluation: Q(I) = true
But consider the possible world D ∈ ⟦I⟧OWA/WCWA:
Employee: Green, White, Brown, Black
Manager: (Green, White), (White, Brown), (Brown, Black)
Q is false on D (Black is an employee but manages no one), hence certQ(I) = false under OWA and WCWA:
naïve evaluation does not work for Q under OWA and WCWA
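A purely illustrative sketch of this counterexample (same conventions as the earlier snippet): naïve evaluation of the universal query returns true on I, yet the query fails in the possible world D shown above, so the certain answer under OWA and WCWA is false.

```python
# Naïve tables of this slide's instance; nulls are "x1", "x2".
employee_I = {("Green",), ("x1",), ("Brown",)}
manager_I = {("Green", "x1"), ("x1", "Brown"), ("Brown", "x2")}

def q_all_employees_are_managers(employee_rel, manager_rel):
    """Q: ∀x Employee(x) → ∃y Manager(x, y)."""
    managers = {a for (a, _) in manager_rel}
    return all(e in managers for (e,) in employee_rel)

# Naïve evaluation on I: true (Green, x1 and Brown all manage someone).
print(q_all_employees_are_managers(employee_I, manager_I))   # True

# The possible world D from the slide (x1=White, x2=Black, plus employee Black):
employee_D = {("Green",), ("White",), ("Brown",), ("Black",)}
manager_D = {("Green", "White"), ("White", "Brown"), ("Brown", "Black")}
print(q_all_employees_are_managers(employee_D, manager_D))   # False, so certQ(I) = false
```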
24. Plan
➡ Introduction
➡ Incomplete data model
➡ Querying incomplete data
‣ the key to tractability: naïve evaluation
➡ What makes naïve evaluation work?
‣ a general framework
➡ Applicability of the framework
➡ Moving forward
25. Relating naïve evaluation and syntactic fragments
A unified framework for relating naïve evaluation and syntactic fragments for several possible semantics:
Naïve evaluation works for Q under ⟦ ⟧
⟺ Q is “monotone” w.r.t. ⟦ ⟧
⟺ Q is preserved under a class of homomorphisms
⟺ Q is expressible in a syntactic fragment (preservation theorems)
26. Naïve evaluation and monotonicity
Shown in a very general setting subsuming every data model / semantics of incompleteness (even beyond relational databases)
27. Naïve evaluation and monotonicity
Database domain: a quadruple 〈 D, C, ⟦ ⟧, ≈ 〉
‣ D : a set; the database objects, complete and incomplete (e.g. all naïve tables over a fixed schema σ)
‣ C : a subset of D; the complete database objects (e.g. all complete instances over σ)
‣ ⟦ ⟧ : D → 2^C; the semantics of incompleteness (e.g. ⟦ ⟧OWA, ⟦ ⟧CWA, etc.)
‣ ≈ : an equivalence relation on D; equivalence of objects w.r.t. queries (e.g. isomorphism of relational instances)
28. Naïve evaluation and monotonicity
Over 〈 D, C , ⟦ ⟧ , ≈ 〉
Boolean query: Q : D → {true, false}
generic : x ≈ y implies Q(x)=Q(y)
monotone w.r.t. ⟦ ⟧ : y ∈ ⟦x⟧ implies Q(x) ⇒ Q(y)
Certain answers for x ∈ D : cert(Q, x) = ⋀ c ∈ ⟦x⟧ Q(c)
Naïve evaluation works for Q : for all x ∈ D, Q(x) = cert(Q, x)
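To make the definition of cert(Q, x) concrete, here is a toy Python sketch (not the talk's formalism): under CWA it enumerates valuations of the nulls into a small, assumed-finite pool of constants and takes the conjunction of Q over the resulting worlds, whereas the actual definition ranges over the countably infinite Const.

```python
from itertools import product

CONSTS = ["Green", "Brown", "White", "Black"]   # a small stand-in for Const
NULLS = ["x1", "x2"]

def apply_valuation(rel, val):
    return {tuple(val.get(a, a) for a in tup) for tup in rel}

def cert_cwa(query, employee, manager, consts=CONSTS, nulls=NULLS):
    """cert(Q, I) under CWA: conjunction of Q over v(I), for all valuations v."""
    for values in product(consts, repeat=len(nulls)):
        v = dict(zip(nulls, values))
        if not query(apply_valuation(employee, v), apply_valuation(manager, v)):
            return False
    return True

# Example: Q = "there is a manager who has a manager" on the running instance.
employee_I = {("Green",), ("x1",), ("Brown",)}
manager_I = {("Green", "x1"), ("x1", "Brown"), ("Green", "x2")}
q = lambda emp, man: any(b1 == a2 for (a1, b1) in man for (a2, b2) in man)
print(cert_cwa(q, employee_I, manager_I))   # True, matching naïve evaluation
```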
29. Naïve evaluation and monotonicity
Saturation property for 〈 D, C, ⟦ ⟧, ≈ 〉:
for all x ∈ D there exists y ∈ ⟦x⟧ with y ≈ x
(saturation holds for the most common semantics)
Over a saturated database domain, if Q is a generic Boolean query:
naïve evaluation works for Q iff Q is monotone w.r.t. ⟦ ⟧
30. Monotonicity and preservation
(framework link under discussion: Q is “monotone” w.r.t. ⟦ ⟧ ⟺ Q is preserved under a class of homomorphisms)
31. Monotonicity and preservation
Homomorphism D → D’ : a mapping h: dom(D) → dom(D’) s.t. h(D) ⊆ D’
Example: D = {(a, b), (a, a)}, D’ = {(c, c), (d, d)}; the mapping a, b → c is a homomorphism D → D’
Many variants:
‣ onto homomorphism: dom(D’) = h( dom(D) )
‣ strong onto homomorphism: D’ = h(D)
Q preserved under homomorphism: D → D’ implies Q(D) ⇒ Q(D’), for all D, D’
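For intuition, a brute-force sketch (illustrative only, over a single binary relation) that enumerates all homomorphisms between two small instances and flags the onto and strong onto variants on the slide's example:

```python
from itertools import product

def dom(rel):
    return {a for tup in rel for a in tup}

def homomorphisms(rel_d, rel_dp):
    """All mappings h: dom(D) → dom(D') with h(D) ⊆ D'."""
    d, dp = sorted(dom(rel_d)), sorted(dom(rel_dp))
    for image in product(dp, repeat=len(d)):
        h = dict(zip(d, image))
        h_rel = {tuple(h[a] for a in tup) for tup in rel_d}
        if h_rel <= rel_dp:
            yield h, h_rel

# The slide's example: D = {(a, b), (a, a)}, D' = {(c, c), (d, d)}.
D, Dp = {("a", "b"), ("a", "a")}, {("c", "c"), ("d", "d")}
for h, h_rel in homomorphisms(D, Dp):
    onto = set(h.values()) == dom(Dp)    # onto: dom(D') = h(dom(D))
    strong_onto = h_rel == Dp            # strong onto: D' = h(D)
    print(h, "onto" if onto else "", "strong onto" if strong_onto else "")
```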
32. Monotonicity and preservation
The link is instantiated with monotonicity w.r.t. different semantics and preservation under different notions of homomorphism.
33. Preservation and syntactic fragments
Preservation theorems
‣ syntactic characterizations of preservation properties of queries in a given logic
‣ classical results in (finite) model theory
34. Preservation and syntactic fragments
Homomorphism Preservation Theorem: over arbitrary relational structures
an FO query Q is preserved under homomorphism iff Q is equivalent to a UCQ
‣ recently proved over finite structures [Rossman ’08]
Equally expressive:
‣ UCQ (Unions of Conjunctive Queries)
‣ SPJU algebra
‣ ∃Pos , i.e. {∃, ∧, ∨}-FO
35. Preservation and syntactic fragments
Lyndon Positivity Theorem [Lyndon ’59]: over arbitrary relational structures
an FO query Q is preserved under onto homomorphism iff Q is expressible in Pos
‣ fails in the finite [Ajtai-Gurevich 87, Rosen ’95, Stolboushkin ’95]
Pos: {∀,∃,∧, ∨}-FO
36. Preservation and syntactic fragments
Preservation theorems:
‣ the (syntax ⇒ preservation) direction holds in the finite as well
‣ this yields classes of queries where naïve evaluation works
37. Naïve evaluation and syntactic fragments
Three well known semantics as instances of our framework
Naïve evaluation works under:
‣ OWA: preservation under homomorphism; fragment ∃Pos [Rossman]
‣ WCWA: preservation under onto homomorphism; fragment Pos [Lyndon]
‣ CWA: preservation under strong onto homomorphism; fragment Pos+∀G [Gheerbrant, Libkin, S.]
38. • Preservation under strong onto homomorphisms:
‣ There is a preservation result in the infinite [Keisler ‘65]
‣ complex syntactic restrictions, one binary relation only, problematic to extend...
• A new sufficient condition for preservation, with a good syntax:
Pos+∀G : extends Pos with a limited form of negation (universal guards)
Positive fragment with Universal Guards:
φ := ⊤ | ⊥ | R(x̄) | x = y | φ ∧ φ | φ ∨ φ | ∃x φ | ∀x φ | ∀x̄ ( G(x̄) → φ )
with G : a relation or equality symbol, and x̄ : a tuple of distinct variables
Pos+∀G roughly corresponds to SPJU algebra + division
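For a rough feel of the extra division operator, here is a minimal, illustrative Python sketch (the example relations are made up, not from the talk):

```python
def divide(r, s):
    """Relational division R(x, y) ÷ S(y): the x's that R pairs with every y in S."""
    xs = {x for (x, _) in r}
    return {x for x in xs if all((x, y) in r for (y,) in s)}

# Example: who manages every employee?
manager = {("Green", "White"), ("Green", "Brown"), ("White", "Brown")}
employee = {("White",), ("Brown",)}
print(divide(manager, employee))   # {'Green'}
```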
39. Examples revisited
Q : “There is a manager who has a manager”
Q is in ∃Pos (and hence in Pos and Pos+∀G): naïve evaluation works for Q under CWA, OWA and WCWA
40. Examples revisited
Q : “All employees are managers”
Q is in Pos+∀G but not in ∃Pos (nor in Pos): naïve evaluation works for Q under CWA
(recall: not true under OWA, nor under WCWA)
Naïve evaluation works well beyond ∃Pos under semantics other than OWA
41. Plan
➡ Introduction
➡ Incomplete data model
➡ Querying incomplete data
‣ the key to tractability: naïve evaluation
➡ What makes naïve evaluation work?
‣ a general framework
➡ Applicability of the framework
➡ Moving forward
42. Beyond OWA, CWA and WCWA
Semantics of incompleteness have been considered in several contexts:
‣ programming semantics, logic programming, data exchange,...
[Minker’82, Ohori’90, Rounds’91, Libkin’95, Hernich’11]
➡ powerset semantics
➡ minimal semantics
43. Beyond OWA, CWA and WCWA
Naïve evaluation works under:
‣ Powerset semantics: preservation under unions of strong onto homomorphisms; fragment ∃Pos+∀Gbool
‣ Minimal semantics: preservation under unions of minimal homomorphisms over cores; fragment ∃Pos+∀Gbool
44. Beyond the relational data model
Incomplete XML based on a form of tree patterns [Barcelo-Libkin-Poggi-S. ’10]
(Figure: a tree pattern with a library root and descendant author nodes annotated [name=x], [title=y], [name=x]; the wildcards _ and * stand for unknown labels and structure.)
Incompleteness covers:
‣ missing data values
‣ missing nodes
‣ missing structural information
✓ labels
✓ parent-child, next-sibling relationships
✓ etc.
Tree-pattern-queries: the analog of UCQs on trees (e.g. selecting author nodes with [name=“Abiteboul”])
45. Beyond the relational data model
The analog of naïve evaluation works for tree-pattern-queries under OWA on rigid tree patterns [Barcelo-Libkin-Poggi-S. ’10]
‣ rigidity: essentially avoids structural incompleteness
Our framework explains this result:
‣ database domain:
✓ the set of complete/incomplete trees,
✓ OWA semantics: homomorphism-based
‣ tree pattern queries are preserved under homomorphisms of trees
‣ rigidity ensures the saturation property
46. Moving forward
Naïve evaluation on combinations of data models/semantics, e.g.
➡ tree-structured/CWA
➡ graph-structured data
Query languages beyond FO
➡ fixed-point logics, fragments of SO, etc.
Naïve evaluation over restricted instances
➡ Applications: data integration/exchange
Beyond naïve evaluation
➡ rewriting of the query/instance
(classical in ontology-based query answering)