Predicting Multiple Metrics for Queries: Better Decision Enabled by Machine Learning (Soheila Dehghanzadeh)
This document proposes using machine learning to simultaneously predict multiple performance metrics (running time, resource usage, etc.) for queries prior to execution. It describes building models based on training data from past query executions that map query features to performance features. Specifically, it uses KCCA to find dimensions of maximal correlation between query and performance features to define similarity. The models predict by weighting the performance metrics of similar past queries. Experiments show the approach can accurately predict time across different query types and databases.
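The prediction step described here (weighting the performance metrics of similar past queries) behaves like a distance-weighted nearest-neighbour regression. Below is a minimal sketch under that reading, assuming queries have already been projected into the KCCA space; the function and variable names are illustrative, not the paper's code:

```python
import numpy as np

def predict_metrics(query_vec, past_vecs, past_metrics, k=3):
    """Distance-weighted average of the k most similar past queries.

    query_vec:    projection of the new query into the KCCA space (1D array)
    past_vecs:    projections of past queries, shape (n, d)
    past_metrics: observed metrics per past query, shape (n, m),
                  e.g. columns = [runtime, records_used, io_count]
    """
    dists = np.linalg.norm(past_vecs - query_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + 1e-9)   # closer queries weigh more
    weights /= weights.sum()
    return weights @ past_metrics[nearest]    # one prediction per metric
```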
J. Park, H. Shim, AAAI 2022, MLILAB, KAIST AI
Graph Transplant is a method for augmenting graph-structured data using a technique analogous to Mixup for images. It extracts salient subgraphs from graphs based on node importance, then transplants one subgraph to replace a subgraph in another graph. This preserves local structure while mixing graphs. It also adaptively assigns labels to the mixed graphs based on the saliency of the constituent subgraphs. Experiments show Graph Transplant improves graph classification performance, model robustness, and calibration compared to other augmentation methods.
G. Park, J.-Y. Yang, et al., NeurIPS 2020, MLILAB, KAIST AI
This document discusses how network compression techniques can cause attribution maps to become deformed, compromising the reliability and trustworthiness of compressed models. It proposes matching attribution maps between original and compressed networks to address this. Specifically, it generates attribution maps by collapsing channels and employs losses to keep compressed network maps close to the original. Experiments show this attribution preservation framework can effectively maintain attribution across compression methods like knowledge distillation and pruning, improving predictive performance.
1) Saliency Grafting is a new data augmentation technique that uses saliency maps to stochastically sample patches from images to mix, generating diverse yet meaningful augmented data.
2) It introduces calibrated label mixing, where the label mixing ratio is determined by the relative importance of images based on saliency maps.
3) Experiments show Saliency Grafting outperforms other mixup-based augmentation methods, improving performance even under data scarcity conditions by maintaining high sample diversity.
Efficiency gains in inversion-based interpretation through computer (Dustin Dewett)
By leveraging self-organized maps and seismic inversion products, it is possible to:
1) More rapidly localize interpretation in a dataset while utilizing specialized geological knowledge.
2) Potentially make substantial gains in both efficiency and interpretation quality.
3) Increase the engagement of non-specialists.
The method classifies inversion results with an unbiased algorithm to find anomalous areas for specialists to focus on, helping maximize their time. However, specialists should still understand the physics behind any anomalies found.
This document discusses automatic localization techniques for data assimilation in ocean modeling using the OpenDA framework. It presents the OpenDA framework, which defines interfaces for data assimilation components. It also discusses ensemble Kalman filtering methods and the need for localization to address spurious correlations. The document describes automatic localization techniques proposed by Anderson and Zhang and Oliver to determine localization weights without requiring user specification. It outlines experiments using these techniques with the NEMO ocean model, assimilating sea surface height observations into the model. The results demonstrate that localization improves the model results, though limitations remain due to the number of observations available.
Quality Control Study on 3-Pole MCCB, MBA SIP Report (Akshay Nair)
This document summarizes a quality control study on the production of 3-pole MCCBs at Havells India Ltd. The study found that the current process has low Cp and Cpk values, indicating high variation and rejection rates. The objectives were to analyze the production process, identify causes for low Cp and Cpk, and make improvements. Data was collected and analyzed using tools like control charts, fishbone diagrams, and Pareto charts. Improvements like fixing tooling issues and developing new springs increased Cp values, but further work is needed to meet Cpk specifications. Recommendations include improving incoming material quality and using more in-house parts. The internship provided hands-on experience with quality control management systems.
An empirical evaluation of cost-based federated SPARQL query processing engines (Umair Qudus)
Finding a good query plan is key to the optimization of query runtime. This holds in particular for cost-based federation engines, which make use of cardinality estimations to achieve this goal. A number of studies compare SPARQL federation engines across different performance metrics, including query runtime, result set completeness and correctness, number of sources selected and number of requests sent. Albeit informative, these metrics are generic and unable to quantify and evaluate the accuracy of the cardinality estimators of cost-based federation engines. To thoroughly evaluate cost-based federation engines, the effect of estimated cardinality errors on the overall query runtime performance must be measured. In this paper, we address this challenge by presenting novel evaluation metrics targeted at a fine-grained benchmarking of cost-based federated SPARQL query engines. We evaluate five cost-based federated SPARQL query engines using existing as well as novel evaluation metrics by using LargeRDFBench queries. Our results provide a detailed analysis of the experimental outcomes that reveal novel insights, useful for the development of future cost-based federated SPARQL query processing engines.
This document describes a dataflow implementation of Curran's approximation algorithm for pricing Asian options. The implementation computes the value at risk of a portfolio containing Asian options. It supports an arbitrary number of averaging points and achieves high precision using fixed-point arithmetic. Experimental results show that a single Maxeler dataflow engine can price portfolios over 10 times faster than a 48-core CPU. Further optimization including multi-DFE processing and porting to newer FPGA hardware could improve energy efficiency and performance.
What is a real-time recommendation engine? Our Senior Software Engineer, David Lippa, and our CTO, Jason Vertrees, break down the background, method, and results.
The document is a resume for Yulong Deng. It summarizes his education, including a Master of Science in Chemical Engineering from Carnegie Mellon University and a Bachelor of Engineering in Chemical Engineering from Dalian University of Technology. It also outlines his work experience, including his current role as a Process Engineer developing MPC software, and an internship modeling processes in Aspen Plus. Finally, it lists several projects undertaken in school related to process modeling, optimization, and machine learning.
Sanket V. Butoliya analyzed a customer churn dataset using WEKA 3.8 to predict whether customers will churn and identify factors causing churn. The presentation included an introduction, problem statement, dataset overview, data analysis using ZeroR, decision trees, neural networks, naive Bayes, and KNN algorithms. Accuracy was evaluated using different train-test splits, and key variables were visualized in Tableau and factors compared in Excel. The analysis aimed to help a cellular provider take preventive actions to reduce customer churn.
The document provides instructions for a task to perform classification on a US census dataset using KNIME to predict income. It instructs the user to experiment with decision trees, naive Bayes, and neural networks, and submit a report describing the experiments, results, and screenshots. The conclusion states that decision trees achieved the best accuracy of 83.2% on this dataset, compared to 76.5% for artificial neural networks and 76.4% for naive Bayes.
Energy Wasting Rate as a Metric for Green Computing and Static Analysis (Jérôme Rocheteau)
These slides define a Green Computing metric called the Energy Wasting Rate, which is the normalized sum of the energy-consumption differences between the sub-components of a given component and behaviorally equivalent but more energy-efficient components. They detail how to compute the metric and sketch how it can be useful and relevant for static analysis focused on software energy consumption.
Building useful models for imbalanced datasets (without resampling) (Greg Landrum)
1) Building machine learning models on imbalanced datasets, where there are many more inactive compounds than active ones, can lead to models with high accuracy but low ability to predict actives.
2) Shifting the decision threshold from 0.5 to a lower value, such as 0.2, for classifiers like random forests can significantly improve the models' ability to predict actives, as measured by Cohen's kappa, without retraining the models (see the sketch after this list).
3) Across a variety of bioactivity prediction datasets, this threshold-shifting approach generally performed better than alternative methods like balanced random forests at improving predictions of active compounds.
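A minimal sketch of the threshold shift on a synthetic imbalanced dataset, using scikit-learn; the 0.2 cutoff and the dataset shape are illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import train_test_split

# Imbalanced toy data: ~5% actives (class 1), as in bioactivity datasets.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]        # P(active) per test compound

for threshold in (0.5, 0.2):                 # default cutoff vs. shifted cutoff
    preds = (proba >= threshold).astype(int) # no retraining needed
    print(threshold, cohen_kappa_score(y_te, preds))
```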
Mining Assumptions for Software Components using Machine Learning (Lionel Briand)
EPIcuRus is an approach to automatically generate assumptions for software components in cyber-physical systems modeled in Simulink. It uses machine learning techniques to mine assumptions from test case results. The approach includes generating test cases using an important-feature boundary test generation method, model checking candidate assumptions, and selecting the most informative safe assumptions. An evaluation on industrial case studies found it can learn non-vacuous assumptions for most requirements within a practical time limit, with the important-feature boundary test generation performing best.
Canopy clustering is an unsupervised pre-clustering algorithm used to speed up K-means and hierarchical clustering on large datasets. It works by first selecting random points as canopy centers and assigning other points within a threshold distance to canopies. It then removes points within a smaller threshold to prevent them from being new centers, repeating until no points remain. This helps reduce the dataset size before the main clustering algorithm is applied.
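A minimal sketch of that canopy pass, with illustrative thresholds (t1 > t2):

```python
import random
import numpy as np

def canopies(points, t1, t2):
    """One canopy pass: t1 = loose assignment radius, t2 = tight removal radius."""
    points = [np.asarray(p, dtype=float) for p in points]
    remaining = list(range(len(points)))
    result = []
    while remaining:
        center = points[random.choice(remaining)]   # random canopy center
        dists = {i: np.linalg.norm(points[i] - center) for i in remaining}
        canopy = [i for i, d in dists.items() if d < t1]  # loose membership
        result.append((center, canopy))
        # points within the tight radius (including the center) can no
        # longer become centers, so the loop always terminates
        remaining = [i for i in remaining if dists[i] >= t2]
    return result
```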
Machine learning algorithms can adapt and learn from experience. The three main machine learning methods are supervised learning (using labeled training data), unsupervised learning (using unlabeled data), and semi-supervised learning (using some labeled and some unlabeled data). Supervised learning includes classification and regression tasks, while unsupervised learning includes cluster analysis.
The document proposes a compound scaling method to scale neural networks efficiently along their depth, width, and resolution dimensions simultaneously. It introduces EfficientNets, a new family of models created by applying this compound scaling technique to a baseline architecture found using neural architecture search. Evaluation shows that EfficientNets outperform previous state-of-the-art convolutional neural networks like ResNet and MobileNet in terms of accuracy and efficiency.
This document discusses time series forecasting methods and the AWS Forecast service. It provides an overview of traditional statistical versus modern machine learning approaches for time series. It then focuses on the DeepAR algorithm within AWS Forecast, explaining that it is a multi-step, multivariate approach that shares information across time series to model non-linearities and interactions. Best practices for using DeepAR are outlined, and there is a reference to a demo of DeepAR on an electricity dataset.
The document proposes a relational Gaussian process (RGP) model for learning from relational data. RGP extends Gaussian processes to incorporate relational information represented as a graph. It provides a data-dependent covariance function for supervised learning tasks like classification. The model was applied to semi-supervised learning problems and outperformed other methods on real-world datasets with few labeled examples needed. Potential extensions of RGP include modeling directed/asymmetric relations, multiple relation types, and weighted graphs.
Measuring the Combinatorial Coverage of Software in Real Time (Zachary Ratliff)
This document introduces a new real-time combinatorial coverage measurement tool called CCM Command Line. It summarizes the key limitations of an earlier tool, CCM, and describes new capabilities of the command line tool, including the ability to measure coverage incrementally and from various sources in real time. The document also discusses applications of the new tool and acknowledges those involved in its development.
Alexandra Johnson, Software Engineer, SigOpt, at MLconf ATL 2017 (MLconf)
Best Practices for Hyperparameter Optimization:
All machine learning and artificial intelligence pipelines – from reinforcement agents to deep neural nets – have tunable hyperparameters. Optimizing these hyperparameters provides tremendous performance gains, but only if the optimization is done correctly. This presentation will discuss topics including selecting performance criteria, why you should always use cross validation, and choosing between state of the art optimization methods.
This document provides an overview of evaluation measures for information retrieval systems. It discusses why evaluation is important for improving systems and measuring user satisfaction. Key points include:
- Common set-based measures include recall, precision, and F-measure. Ranked retrieval measures include average precision (AP), normalized discounted cumulative gain (nDCG), expected reciprocal rank (ERR), and Q-measure for graded relevance (a short nDCG sketch follows this list).
- Measures for diversified search aim to balance relevance and diversity across different user intents. Examples given include α-nDCG, ERR-IA, D#-nDCG, and U-IA.
- Statistical significance testing allows determining whether differences between systems are likely real or due to chance.
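To make the ranked measures concrete, here is a minimal nDCG sketch over toy graded-relevance labels; it uses the gain/log2(rank+1) formulation (some variants use 2^gain - 1 as the gain):

```python
import numpy as np

def dcg(gains):
    # Discounted cumulative gain: gain / log2(rank + 1), ranks starting at 1
    return sum(g / np.log2(r + 1) for r, g in enumerate(gains, start=1))

def ndcg(ranked_gains, k=None):
    gains = ranked_gains[:k]
    ideal = sorted(ranked_gains, reverse=True)[:k]   # best possible ordering
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Graded relevance of the documents a system returned, in ranked order.
print(ndcg([3, 2, 3, 0, 1, 2], k=5))
```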
The document discusses addressing the time/quality trade-off in view maintenance when querying linked data. It proposes optimizing maintenance to satisfy either quality constraints within the lowest response time or time constraints with the highest response quality. It describes summarizing a dataset to estimate query freshness, and the challenges of building individual summaries for each maintenance plan. The conclusion notes that the next steps are designing a more realistic dataset and comparing the histogram and predicate-multiplication approaches.
Approximate Continuous Query Answering Over Streams and Dynamic Linked Data Sets (Soheila Dehghanzadeh)
To perform complex tasks, RDF Stream Processing Web applications evaluate continuous queries over streams and quasi-static (background) data. While the former are pushed into the application, the latter are continuously retrieved from the sources. As soon as the background data increase in volume and become distributed over the Web, the cost to retrieve them increases and applications become unresponsive.
In this paper, we address the problem of optimizing the evaluation of these queries by leveraging local views on background data. Local views enhance performance, but require maintenance processes, because changes in the background data sources are not automatically reflected in the application.
We propose a two-step query-driven maintenance process to maintain the local view: it exploits information from the query (e.g., the sliding window definition and the current window content) to maintain the local view based on user-defined Quality of Service constraints.
Experimental evaluation shows the effectiveness of the approach.
Revisiting the Calibration of Modern Neural Networks (Sungchul Kim)
The document discusses calibration in modern neural networks. It finds that some recent model families, like MLP-Mixers and Vision Transformers (ViTs), are both highly accurate and well-calibrated on in-distribution and out-of-distribution image datasets. Temperature scaling can improve calibration in older models but the newest architectures remain better calibrated. Model size and pretraining amount do not fully explain differences in calibration between families. Architecture appears to be a major factor, with non-convolutional models tending to have better calibration properties.
This document outlines a Six Sigma project to optimize an article library. The project aims to improve article trustworthiness by 80%, discard 60% of duplicate and out-of-date articles, decrease article retrieval time to 8 minutes or less, decrease costs by 20%, and increase customer satisfaction by 35%. Baseline data found the average article retrieval time was 11.6 minutes with a process sigma level of 3.195. Analysis identified duplicate articles, out-of-date articles, and untrustworthy articles as causes of long retrieval times. Improvement strategies included digitizing articles, adding expiration dates, and verifying article trustworthiness to comply with specifications.
This document discusses feature engineering and machine learning approaches for predicting customer behavior. It begins with an overview of feature engineering, including how it is used for image recognition, text mining, and generating new variables from existing data. The document then discusses challenges with artificial intelligence and machine learning models, particularly around explainability. It concludes that for smaller datasets, feature engineering can improve predictive performance more than complex machine learning models, while large datasets are better suited to machine learning approaches. Testing on a small travel acquisition dataset confirmed that traditional models with feature engineering outperformed neural networks.
Bridging the Gap: Machine Learning for Ubiquitous Computing -- Evaluation (Thomas Ploetz)
Tutorial @Ubicomp 2015: Bridging the Gap -- Machine Learning for Ubiquitous Computing (evaluation session).
A tutorial on promises and pitfalls of Machine Learning for Ubicomp (and Human Computer Interaction). From Practitioners for Practitioners.
Presenter: Nils Hammerla <n.hammerla@gmail.com>
Video recording of the talks as they were held at Ubicomp:
https://youtu.be/LgnnlqOIXJc?list=PLh96aGaacSgXw0MyktFqmgijLHN-aQvdq
Optimal query access plans are essential for good data server performance and it is the DB2 for Linux, UNIX and Windows query optimizer's job to choose the best access plan. However, occasionally queries that were performing well suddenly degrade, due to an unexpected access plan change. This presentation will cover a number of best practices to ensure that access plans don't unexpectedly change for the worse. All access plans can be made more stable with accurate DB statistics and proper DB configuration. DB2 9.7 provides a new feature to stabilize access plans for static SQL across binds and rebinds, which is particularly important for applications using SQL Procedural Language. When all else fails, optimization profiles can be used to force the desired access plan. This presentation will show you how to develop and implement a strategy to ensure your access plans are rock-solid.
[pdf presentation with notes]
This presentation discusses the following topics:
Introduction to Query Processing
Need for Query Processing
Architecture of Query Processing
Query Processing Steps
Phases in a Typical Query Processing
Represented in Relational Structures
Translating SQL Queries into Relational Algebra
Query Optimization
Importance of Query Optimization
Actions of Query Optimization
The document describes CHCDB, a clinical annotation database, and its web interface CHCDBWEB. It discusses how clinical annotation data was previously stored in disparate Excel files, which caused data redundancy and inconsistencies. A relational database using an Entity-Attribute-Value structure was created to centrally store the data and impose constraints to promote consistent data entry. The database is accessed through a web interface to allow multi-user access without software installation. Metadata tables were also added to store information about variable data types lost by the EAV structure.
Benchmarking Automated Machine Learning for Clustering (biagiolicari7)
This document discusses benchmarking four automated machine learning (AutoML) frameworks for clustering: AutoML4Clust, cSmartML, Autocluster, and ML2DAC. It describes the benchmark design, evaluation criteria of clustering quality, scalability, and consistency. The results show that ML2DAC emerged as the top performer based on clustering validity indices and Bayesian analysis, though it was not consistently the best. Room remains for improving AutoML frameworks' performance and transparency for clustering tasks.
The metrics that matter: using scalability metrics for project planning of a d... (Mary Chan)
Have you expanded your organization across multiple locations, or are you a client that utilizes external partners that provide outsourcing services? Both have their "cost savings" challenge where cost savings analysis is often a topic well scrutinized. However, in the grand scheme of your organization, is it a metric that really matters? See actual analytics on multiple game projects and why cost savings isn't as important a metric when making informed decisions about project planning for scalable and distributed development. It's all about the Metrics that Matter.
Business Process Monitoring and Mining (Marlon Dumas)
Lecture delivered at the Second Latin-American Summer School in Business Process Management, Bogota, Colombia, 28 June 2017 - http://ii-las-bpm.uniandes.edu.co/
Six Sigma is a data-driven methodology for process improvement originally developed by Motorola. It involves defining a project goal, measuring key aspects of the current process, analyzing data to determine root causes of defects, improving the process by addressing causes, and controlling future process variation. The document provides an overview of Six Sigma and its development, then gives an example project summary involving improving calcium levels in a product. The project uses Six Sigma tools like process mapping, measurement systems analysis, data analysis, design of experiments, and risk analysis to select and validate factors influencing calcium and develop improvements.
This document discusses dimensionality reduction using principal component analysis (PCA). It explains that PCA is used to reduce the number of variables in a dataset while retaining the variation present in the original data. The document outlines the PCA algorithm, which transforms the original variables into new uncorrelated variables called principal components. It provides an example of applying PCA to reduce data from 2D to 1D. The document also discusses key PCA concepts like covariance matrices, eigenvalues, eigenvectors, and transforming data to the principal component coordinate system. Finally, it presents an assignment applying PCA and classification to a handwritten digits dataset.
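A minimal numpy sketch of the PCA steps outlined above (center, covariance, eigendecomposition, projection), reducing toy 2-D data to 1-D:

```python
import numpy as np

def pca(X, n_components):
    Xc = X - X.mean(axis=0)                 # center each variable
    cov = np.cov(Xc, rowvar=False)          # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]       # sort by explained variance
    components = eigvecs[:, order[:n_components]]
    return Xc @ components                  # project onto principal axes

X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca(X, 1))                            # 2-D data reduced to 1-D
```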
A missing link in the ML infrastructure stack? (Chester Chen)
Talk at SF Big Analytics
Machine learning is quickly becoming a product engineering discipline. Although several new categories of infrastructure and tools have emerged to help teams turn their models into production systems, doing so is still extremely challenging for most companies. In this talk, we survey the tooling landscape and point out several parts of the machine learning lifecycle that are still underserved. We propose a new category of tool that could help alleviate these challenges and connect the fragmented production ML tooling ecosystem. We conclude by discussing similarities and differences between our proposed system and those of a few top companies.
Bio: Josh Tobin is the founder and CEO of a stealth machine learning startup. Previously, Josh worked as a deep learning & robotics researcher at OpenAI and as a management consultant at McKinsey. He is also the creator of Full Stack Deep Learning (fullstackdeeplearning.com), the first course focused on the emerging engineering discipline of production machine learning. Josh did his PhD in Computer Science at UC Berkeley advised by Pieter Abbeel.
The document describes the process of performance tuning for a database application. It presents a case study where a GUI screen was taking too long to load due to an inefficient query. The case study outlines identifying the problematic query, investigating solutions, replacing the query with a tuned version, and measuring the results. The document also provides an overview of database performance concepts like response time, wait time, and throughput. It describes different components involved in SQL processing like the parser, optimizer, row source generator, and execution plan.
Testing and Verification of VLSI Design: Verification (Usha Mehta, 2019)
This document provides an introduction to verification of VLSI designs and functional verification. It discusses sources of errors in specifications and implementations, ways to reduce human errors through automation and mistake-proofing techniques. It also covers the reconvergence model of verification, different verification methods like simulation, formal verification and techniques like equivalence checking and model checking. The document then discusses verification flows, test benches, different types of test cases and limitations of functional verification.
A sensitivity analysis of contribution-based cooperative co-evolutionary algorithms (Borhan Kazimipour)
Cooperative Co-evolutionary (CC) techniques have demonstrated promising performance in dealing with large-scale optimization problems. However, in many applications their performance may drop due to imbalanced contributions to the objective function value from different subsets of decision variables. To remedy this drawback, Contribution-Based Cooperative Co-evolutionary (CBCC) algorithms have been proposed.
They have shown significant improvements over traditional CC techniques when the decomposition is accurate and the imbalance level is very high. However, in real-world scenarios we might not know the ideal decomposition and the actual imbalance level of the problem to be solved. This study therefore analyses the performance of existing CBCC techniques in more realistic settings, i.e., when the decomposition error is unavoidable and the imbalance level is low or moderate.
Our in-depth analysis reveals that even in these situations, CBCC algorithms are superior alternatives to traditional CC techniques. We also observe that variations of CBCC techniques may lead to significantly different performance. We thus recommend that practitioners carefully choose a competent CBCC variant that best suits their particular application.
Optimizing SPARQL Query Processing On Dynamic and Static Data Based on Query Time/Freshness Requirements Using Materialization
1. JIST 2014
Optimizing SPARQL Query Processing On Dynamic and Static Data Based on Query Time/Freshness Requirements Using Materialization
Soheila Dehghanzadeh, Marcel Karnstedt, Stefan Decker, Josiane Xavier Parreira, Juergen Umbrich and Manfred Hauswirth
2. Outline
• Introduction
• Terminology
• Problem definition
• Proposed solution
• Experimental results
• Conclusion
3. Introduction: Query Processing On Linked Data
• Report changes to the local store (maintenance):
• Sources pro-actively report changes or their existence (pushing).
• The query processor discovers new sources and changes by crawling (pulling).
• Fast maintenance leads to high quality but slow responses, and vice versa.
• Problem: on-demand maintenance according to response quality requirements.
• Why is it important? It eliminates unnecessary maintenance and leads to faster responses and better scalability.
[Diagram: a query processor answers queries from a local store populated by off-line materialization (replication in databases, caching on the Web); new sources and source updates must be propagated to the local store.]
4. Terminology
• Quality requirements:
• Freshness = B/(A+B)
• Completeness = B/(B+C)
(B: tuples in both the local-store response and the actual response; A: out-of-date tuples returned by the query processor; C: valid tuples missed due to maintenance delay.)
• Maintenance plan:
• Each set of views chosen for maintenance is called a maintenance plan.
• With n views, the number of maintenance plans is 2^n.
• Each maintenance plan leads to a different response quality.
[Figure: four views V1, V2, V3, V4 annotated with freshness 20%, 90%, 10%, 80%.]
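A minimal sketch of these two definitions over sets of result tuples; the set names mirror the A/B/C regions of the figure:

```python
def freshness_and_completeness(local_response, actual_response):
    """local_response / actual_response: sets of result tuples.
    B = tuples in both; A = stale tuples only in the local response;
    C = valid tuples only in the actual response."""
    B = len(local_response & actual_response)
    A = len(local_response - actual_response)
    C = len(actual_response - local_response)
    freshness = B / (A + B) if A + B else 1.0
    completeness = B / (B + C) if B + C else 1.0
    return freshness, completeness

print(freshness_and_completeness({1, 2, 3, 4}, {3, 4, 5}))  # (0.5, 0.666...)
```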
5. Freshness Example
Fresh tuples are labeled T, stale tuples F; a joined tuple is fresh only if both of its input tuples are fresh. Three maintenance plans over the same join of mapping 1 and mapping 2:

No maintenance (mapping 1: 60% fresh, mapping 2: 40% fresh, join result: 50% fresh):
Mapping 1: (a1,b1,T) (a2,b2,T) (a3,b3,F) (a4,b4,T) (a5,b5,F)
Mapping 2: (a1,c1,F) (a1,c2,F) (a1,c3,T) (a2,c4,T) (a6,c5,F)
Join: (a1,b1,c1,F) (a1,b1,c2,F) (a1,b1,c3,T) (a2,b2,c4,T)

Maintain mapping 1 (mapping 1: 100%, mapping 2: 40%, join result: still 50%):
Mapping 1: (a1,b1,T) (a2,b2,T) (a3,b3,T) (a4,b4,T) (a5,b5,T)
Mapping 2: (a1,c1,F) (a1,c2,F) (a1,c3,T) (a2,c4,T) (a6,c5,F)
Join: (a1,b1,c1,F) (a1,b1,c2,F) (a1,b1,c3,T) (a2,b2,c4,T)

Maintain mapping 2 (mapping 1: 60%, mapping 2: 100%, join result: 100%):
Mapping 1: (a1,b1,T) (a2,b2,T) (a3,b3,F) (a4,b4,T) (a5,b5,F)
Mapping 2: (a1,c1,T) (a1,c2,T) (a1,c3,T) (a2,c4,T) (a6,c5,T)
Join: (a1,b1,c1,T) (a1,b1,c2,T) (a1,b1,c3,T) (a2,b2,c4,T)
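A short sketch reproducing the three rows above; it assumes, as the tables show, that a joined tuple is fresh only when both inputs are fresh:

```python
M1 = [("a1", "b1", True), ("a2", "b2", True), ("a3", "b3", False),
      ("a4", "b4", True), ("a5", "b5", False)]
M2 = [("a1", "c1", False), ("a1", "c2", False), ("a1", "c3", True),
      ("a2", "c4", True), ("a6", "c5", False)]

def refresh(mapping):                    # maintaining a view makes it all fresh
    return [(x, y, True) for x, y, _ in mapping]

def join_freshness(m1, m2):              # join on the shared variable ?a
    joined = [(a, b, c, f1 and f2)       # fresh iff both inputs are fresh
              for a, b, f1 in m1 for a2, c, f2 in m2 if a == a2]
    return sum(f for *_, f in joined) / len(joined)

print(join_freshness(M1, M2))            # 0.5  (no maintenance)
print(join_freshness(refresh(M1), M2))   # 0.5  (maintain mapping 1)
print(join_freshness(M1, refresh(M2)))   # 1.0  (maintain mapping 2)
```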
6. Research questions
• What is the least costly maintenance plan that fulfils the response quality requirements?
• What is the quality of the response without maintenance?
• What is the quality of the response under each maintenance plan?
7. Experiment
• We use the BSBM benchmark to create a dataset and a query set.
• We label triples with true/false to specify their freshness status.
• We summarize the cache to estimate the quality of a query response without actually executing the query on the cache.
• To summarize the cache, we extended cardinality estimation techniques to the freshness estimation problem.

Example of labeled triples:
(Alice, Lives, Dublin, True)
(Bob, Lives, Berlin, False)
(Alice, Job, Teacher, True)
(Bob, Job, Developer, False)
8. Cardinality Estimation
• Capture the data distribution by splitting the data into buckets, keeping only each bucket's cardinality in the summary; for freshness estimation, each bucket additionally stores its number of fresh triples.

Labeled cache (subject, predicate, object | fresh):
Alice Job Teacher | True
Alice Lives Dublin | True
Alice Job PhD student | False
Alice Lives Athlon | False
Bob Job Manager | True
Bob Lives Berlin | True
Bob Lives Chicago | True
Bob Lives Munich | False
Bob Lives Belfast | False
Bob Lives Limerick | False
Bob Job CEO | False
Bob Job Consultant | False

Coarse summary (bucket | total | fresh):
* Job * | 5 | 2
* Lives * | 7 | 3

Fine-grained summary (bucket | total | fresh):
Alice Job * | 2 | 1
Bob Job * | 3 | 1
Alice Lives * | 2 | 1
Bob Lives * | 5 | 2

Queries:
Q1: ?a Job ?b
Q2: (?a Job ?b) ^ (?a Lives ?c)

Cardinality (estimated | actual):
Coarse summary: Q1: 5 | 5, Q2: 35 | 19
Fine-grained summary: Q1: 5 | 5, Q2: 19 | 19

Freshness (estimated | actual):
Coarse summary: Q1: 2/5 | 2/5, Q2: 6/35 | 3/19
Fine-grained summary: Q1: 2/5 | 2/5, Q2: 3/19 | 3/19

The coarse summary estimates the join by multiplying whole-predicate counts (5 x 7 = 35 tuples, 2 x 3 = 6 fresh), while the fine-grained summary multiplies per subject and recovers the exact values.
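A sketch of the extended summary: each bucket stores (total, fresh) counts, and the join estimate multiplies bucket counts per shared subject. On the slide's data it reproduces the fine-grained Q2 numbers (19 tuples, 3/19 fresh):

```python
from collections import defaultdict

triples = [  # (subject, predicate, object, fresh): the slide's labeled cache
    ("Alice", "Job", "Teacher", True), ("Alice", "Lives", "Dublin", True),
    ("Alice", "Job", "PhD student", False), ("Alice", "Lives", "Athlon", False),
    ("Bob", "Job", "Manager", True), ("Bob", "Lives", "Berlin", True),
    ("Bob", "Lives", "Chicago", True), ("Bob", "Lives", "Munich", False),
    ("Bob", "Lives", "Belfast", False), ("Bob", "Lives", "Limerick", False),
    ("Bob", "Job", "CEO", False), ("Bob", "Job", "Consultant", False),
]

# Fine-grained summary: one [total, fresh] bucket per (subject, predicate).
summary = defaultdict(lambda: [0, 0])
for s, p, o, fresh in triples:
    summary[(s, p)][0] += 1
    summary[(s, p)][1] += fresh

# Q2 = (?a Job ?b) ^ (?a Lives ?c): multiply bucket counts per shared subject.
subjects = {s for s, p in summary}
total = sum(summary[(s, "Job")][0] * summary[(s, "Lives")][0] for s in subjects)
fresh = sum(summary[(s, "Job")][1] * summary[(s, "Lives")][1] for s in subjects)
print(total, fresh / total)   # 19 tuples, freshness 3/19 = 0.158
```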
9. Cardinality Estimation Approaches
• System R assumptions for cardinality estimation:
• Data is uniformly distributed per attribute.
• Join predicates are independent.
• Indexing approaches make both assumptions.
• Histograms capture the distribution of attributes for more accurate estimation.
• Probabilistic graphical models capture dependencies among attributes.
10. Measure accuracy of the estimation approach
We measure the difference between the actual and estimated freshness of the queries in a query set using the root mean square error

$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{f}_i - f_i\right)^2}$

where n is the number of queries, $f_i$ is the actual freshness of query i, and $\hat{f}_i$ its estimate.
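The same computation in a short sketch, with illustrative per-query freshness values:

```python
import numpy as np

estimated = np.array([0.40, 0.17, 0.50])  # summary-based freshness estimates
actual = np.array([0.40, 0.16, 0.60])     # freshness from executing the queries
rmse = np.sqrt(np.mean((estimated - actual) ** 2))
print(rmse)
```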
12. Conclusion
• We proposed a new approach for on-demand view maintenance based on response quality requirements.
• We defined quality requirements based on freshness and completeness.
• We summarized a synthetic dataset to estimate the freshness of various queries, extending indexing and histogram techniques to our freshness estimation problem.
• Using probabilistic graphical models to summarize the dataset is future work; it is promising for reducing the estimation error.
13. Thanks a lot for your attention!
Any questions are welcome!
Editor's Notes
Hi All and thanks for coming to my presentation.
In this work, I'm going to talk about how to optimize SPARQL query processing on static and dynamic data based on the quality requirements of the query response, namely response time and freshness.
The outline of the talk is as follows:
First we will have a brief introduction to query processing and the approaches proposed to make it faster.
Second, we introduce the terminology of our work.
Third, we illustrate the targeted problem with an example.
Afterwards we define the problem.
The proposed solution and experimental results will then be presented.
At the end we will conclude the talk with some directions for future work.
To process queries on Linked Data, the naïve approach is that the query processor gets the query, fetches the relevant data from the original sources, combines it, and provides the response to the user. However, fetching data from the original sources takes a lot of time, and if the original sources become temporarily unavailable, the query processor cannot provide the full response.
Enter ------------
To get rid of the availability and latency problems, researchers came up with the idea of off-line materialization, called replication in the database context and caching in the Web context. They proposed to materialize as much data as possible in a local store and to answer queries using only that store. This gives very fast response times and does not suffer from availability issues.
Enter-----------
However, if the original sources are updated or new sources become available, the query processor cannot reflect these changes in its responses, and the responses it provides suffer from low quality.
Enter------------
To address this issue, maintenance mechanisms help the query processor compensate for the quality loss.
Enter------
However, highly frequent maintenance consumes all computational resources, and queries have to wait for them. Thus, a high-quality response can only be achieved at the cost of a long response time, and vice versa.
Enter ------
The problem that we are targeting here is to do on-demand maintenance based on the quality requirements of the query response.
Enter ------
The importance of this problem is that it eliminates unnecessary maintenance and leads to faster responses and better scalability.
Here we define our terminology.
<<<point to the first figure>>>
To specify the quality requirements of the response, suppose the shaded circle is the response provided from the local store and the transparent circle is the actual response. These two responses share a set of tuples, represented by "B". "A" represents the out-of-date tuples returned by the query processor. "C" represents the valid tuples that the query processor missed due to the maintenance delay. If "A" is empty, the query processor has provided a valid but possibly incomplete response; if "C" is empty, it has provided a complete but not necessarily fully fresh response. Therefore, freshness is defined as B/(A+B) and completeness as B/(B+C).
----------------------
In each maintenance round, the query processor decides to maintain a set of views, which we call a maintenance plan. Given n views, there exist 2^n maintenance plans. As we will show in the next slide, each maintenance plan leads to a different response quality. I should mention that in this work we have not touched response completeness, leaving it for future work; we therefore deal only with response freshness as the response quality requirement.
In this example we label fresh tuples with T and stale tuples with F. Suppose we have a join between two mappings. We want to show that different maintenance plans lead to different response freshness.
<<<point to first row>>> One maintenance plan is not to maintain anything, so we measure the freshness of the response with the current data in the local store. As the first row shows, mapping 1 with 60% freshness joins with mapping 2 with 40% freshness, and the result is 50% fresh.
<<<point to second row>>> The second row shows the next maintenance plan, which maintains mapping 1; however, the join result is still 50% fresh.
<<<point to third row>>> The third row shows another maintenance plan, which maintains mapping 2; this time the join result becomes 100% fresh.
Therefore, different maintenance plans lead to different response quality.
The problem that we are targeting is to find the least costly maintenance plan that fulfils the response quality requirements. This boils down to two sub-problems: first, estimating the quality of the response provided by the present cache without maintenance; second, estimating the quality of the response under each other maintenance plan.
In this work we targeted only the first sub-problem.
To estimate the quality of the response provided by the present cache, we use the BSBM benchmark generator to generate a dataset and a query set. We labeled the triples with true/false to specify their freshness status.
We summarize the cache to estimate the quality of a query response without actually executing the query on the cache.
To summarize the cache, we extended cardinality estimation techniques to the freshness estimation problem.
In the next slide we present how to extend the cardinality estimation methods for freshness estimation.
Cardinality estimation approaches try to capture the data distribution by splitting the data into buckets and keeping the bucket cardinalities in the summary. In our example, we summarize the whole dataset into an index that stores the cardinality of individual predicates. To test the summary, we run two queries, Q1 and Q2. This summary provides an accurate estimate for Q1, but to estimate the cardinality of Q2 it multiplies the cardinalities of its triple patterns, giving 35, while the actual response cardinality is 19. This summary has failed to provide a good estimate.
Enter------
However, a more granular summary can provide more accurate estimates: the second index gives accurate cardinality estimates for both Q1 and Q2.
Enter------
Now, to extend the cardinality estimation methods to freshness estimation, we extend the summaries with one more column that stores the number of fresh entries in addition to the total number of entries in each category.
Enter----------
As we can see, the first index provides an accurate freshness estimate for Q1 but fails to provide a good estimate for Q2.
Enter-----------
However, by storing more granular information in the second index, we can provide accurate freshness estimates for both Q1 and Q2.
To summarize the underlying data for cardinality estimation, the original System R made two simplifying assumptions: first, that data is uniformly distributed per attribute; second, that join predicates are independent. Indexing approaches make both assumptions to simplify the summarization and estimation process.
However, such assumptions barely hold in real datasets. Thus, researchers came up with the idea of histograms to address the uniform-distribution assumption.
Using probabilistic graphical models, we can build summaries that address the join-predicate independence assumption.
In this paper we extended the indexing and histogram cardinality estimation methods for freshness estimation according to the procedure explained in the previous slide.
In order to measure the accuracy of the estimation approach, we used the root mean square error: we sum the squared differences between the estimated and actual freshness over all queries and take the square root of the average, obtaining a single figure that characterizes the error of the method.
The results showed that the indexing approach can achieve a very low estimation error and low storage space simultaneously, whereas the histogram provides a lower estimation error only at a huge summary size.
We believe that the majority of the estimation error in our queries is caused by join dependencies, which histograms do not address. So we hope to further reduce the estimation error by using probabilistic graphical models in future work.