GTApprox: surrogate modeling for industrial design
Mikhail Belyaev¹, Evgeny Burnaev¹,², Ermek Kapushev¹,², Maxim Panov¹,², Pavel Prikhodko¹, Dmitry Vetrov², and Dmitry Yarotsky¹,²

¹ Kharkevich Institute for Information Transmission Problems, Bolshoy Karetny per. 19, build. 1, Moscow 127051, Russia
² Datadvance llc, Nauchny proezd 17, Moscow 117246, Russia
Abstract
We describe GTApprox, a new tool for medium-scale surrogate modeling in industrial design. Compared to existing software, GTApprox brings several innovations: a few novel approximation algorithms, several advanced methods of automated model selection, and novel options in the form of hints. We demonstrate the efficiency of GTApprox on a large collection of test problems. In addition, we describe several applications of GTApprox to real engineering problems.
Keywords: approximation, surrogate model, surrogate-based optimization
1. Introduction
Approximation problems (also known as regression problems) arise quite often in industrial design, and solutions of such problems are conventionally referred to as surrogate models [1]. The most common application of surrogate modeling in engineering is in connection with engineering optimization [2]. Indeed, on the one hand, design optimization plays a central role in the industrial design process; on the other hand, a single optimization step typically requires the optimizer to create or refresh a model of the response function whose optimum is sought, to be able to come up with a reasonable next design candidate. The surrogate models used in optimization range from simple local linear regression employed in basic gradient-based optimization [3] to complex global models employed in so-called Surrogate-Based Optimization (SBO) [4]. Aside from optimization, surrogate modeling is used in dimension reduction [5, 6], sensitivity analysis [7–10], and for visualization of response functions.
Mathematically, the approximation problem can generally be described as follows. We assume that we are given a finite sample of pairs $(x_n, y_n)_{n=1}^N$ (the "training data"), where $x_n \in \mathbb{R}^{d_{\mathrm{in}}}$ and $y_n \in \mathbb{R}^{d_{\mathrm{out}}}$. These pairs represent sampled inputs and outputs of an unknown response function $y = f(x)$. Our goal is to construct a function (a surrogate model) $\widehat{f} \colon \mathbb{R}^{d_{\mathrm{in}}} \to \mathbb{R}^{d_{\mathrm{out}}}$ which should be as close as possible to the true function $f$.
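To make the setup concrete, here is a minimal sketch of fitting and querying a surrogate on toy data, using a generic Gaussian process from the scikit-learn library mentioned below (an illustration of the problem statement, not of GTApprox itself):

```python
# Fit a surrogate f_hat: R^d_in -> R^d_out to a finite sample (x_n, y_n).
# Toy data and the Gaussian-process choice are assumptions for illustration.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

d_in, N = 3, 100
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(N, d_in))       # sampled inputs x_n
Y = np.stack([np.sin(X).sum(axis=1),             # outputs of the unknown
              np.cos(X).prod(axis=1)], axis=1)   # response f, here d_out = 2

surrogate = GaussianProcessRegressor().fit(X, Y) # construct f_hat from data
Y_new = surrogate.predict(rng.uniform(-1.0, 1.0, size=(5, d_in)))
```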
A great variety of surrogate modeling methods exist, with different assumptions on the underlying response functions, data sets, and model structure [11]. Bundled implementations of diverse surrogate modeling methods can be found in many software tools, for example in the excellent open-source general-purpose statistical project R [12] and the machine-learning Python library scikit-learn [13], as well as in several engineering-oriented frameworks [14–16]. Theoretically, any of these tools offers an engineer the necessary means to construct and use surrogate models covering a wide range of approximation scenarios. In practice, however, existing tools are often not very convenient for an engineer, for two main reasons.
1. Excessively technical user interface and its inconsistency across different surrogate modeling techniques. Predictive modeling tools containing a variety of different modeling algorithms often provide a common top-level interface for loading training data and constructing and applying surrogate models. However, the algorithms themselves usually remain isolated; in particular, they typically have widely different sets of user options and tunable parameters. This is not surprising, as there is a substantial conceptual difference in the logic of different modeling methods. For example, standard linear regression uses a small number of fixed basis functions and only linear operations; kriging uses a large number of basis functions specifically adjusted to the training data and involves nonlinear parameters; artificial neural networks may use a variable set of basis functions and some elements of user-controlled self-validation; etc.

Such isolation of algorithms requires the user to learn their properties in order to pick the right algorithm and to correctly set its options, and engineers rarely have time for that. An experienced researcher would know, for example, that artificial neural networks can produce quite accurate approximations for high-dimensional data, but when applied in 1D, the plotted results would almost invariably look very unconvincing (compared to, say, splines); kriging is a popular choice for moderately sized training sets, but will likely exhaust the on-board RAM if the training set has more than a few thousand elements; accurate approximations by neural networks may take several days to train; etc. In existing tools such expert knowledge is usually scattered in documentation, and users quite often resort to trial and error when choosing the algorithm.
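The isolation of algorithms is easy to see even in tools with a unified top-level interface. In the following sketch (scikit-learn estimators serve as stand-ins; GTApprox's techniques are different), fit and predict are shared, but the option sets are disjoint, and an identically named option such as alpha means different things in different techniques:

```python
# A shared fit/predict interface does not unify per-technique options.
from sklearn.linear_model import Ridge
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.neural_network import MLPRegressor

models = {
    "linear":  Ridge(alpha=1.0),                      # regularization strength
    "kriging": GaussianProcessRegressor(alpha=1e-10), # same name: noise level
    "network": MLPRegressor(hidden_layer_sizes=(64,),
                            early_stopping=True),     # self-validation knob
}
for name, model in models.items():
    print(name, sorted(model.get_params()))           # widely different sets
```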
2. Lack of attention to special features of engineering problems. The bias of the engineering domain is already seen in the very fact that regression problems in industrial design are much more common than classification problems (i.e., those where one predicts a discrete label rather than a continuous value $y \in \mathbb{R}^{d_{\mathrm{out}}}$), whereas quite the opposite seems to hold in the more general context of all commercial and scientific applications of predictive modeling.¹ Moreover, the response function $f(x)$ considered in an engineering problem usually represents some physical quantity and is expected to vary smoothly or at least continuously with $x$. At the same time, widely popular decision-tree-based methods such as random forests [19] and gradient boosting [20] produce discontinuous piecewise-constant surrogate models, completely inappropriate for, say, gradient-based optimization. This example is rather obvious and the issue can be solved by simply ignoring decision-tree-based methods, but, based on our experience of surrogate modeling at industrial enterprises [21–25], we can identify several more subtle elements of this engineering bias that require significant changes in the software architecture, in particular:
Data anisotropy. Training data can be very anisotropic with respect to different groups of variables. For example, a common source of data are experiments performed under different settings of parameters with some sort of detectors that have fixed positions (e.g., air pressure measured on a wing under different settings of Mach and angle of attack), and the surrogate model needs to predict the outcome of the experiment for a new setting of parameters and at a new detector position. It can easily be that the detectors are abundant and regularly distributed, while the number of experiments is scarce and their parameters are high-dimensional and irregularly scattered in the parameter space. If we only needed a surrogate model with respect to one of these two groups of input variables, we could easily point out an appropriate standard method (say, splines for detector position and regularized linear regression for the experiment parameters), but how to combine them into a single model? Such anisotropic scenarios, with different expected dependency properties, seem to be quite typical in the engineering domain [26]; a naive combination is sketched below.

¹ Note, for example, that classification problems form the majority of the 200+ Kaggle data mining contests [17] and the 300+ UCI machine learning repository data sets [18], well reflecting current trends in this area.
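For instance, the following minimal sketch, under assumed toy data, fits B-splines along the densely sampled detector axis and then regresses the per-experiment spline coefficients on the scarce experiment parameters with a regularized linear model. It illustrates one naive answer to the combination question, not GTApprox's actual anisotropic algorithm:

```python
# Hypothetical two-step treatment of anisotropic gridded data: a spline
# basis over the dense detector axis, ridge regression over parameters.
import numpy as np
from scipy.interpolate import BSpline
from sklearn.linear_model import Ridge

n_exp, n_det, d_param = 20, 200, 6
rng = np.random.default_rng(1)
P = rng.normal(size=(n_exp, d_param))          # experiment parameters (scarce)
s = np.linspace(0.0, 1.0, n_det)               # detector positions (dense grid)
Y = np.sin(6 * s)[None, :] * (1 + P[:, :1])    # toy measurements, (n_exp, n_det)

k, n_coef = 3, 12                              # cubic splines, 12 basis functions
t = np.r_[[0.0] * k, np.linspace(0, 1, n_coef - k + 1), [1.0] * k]  # knot vector
B = np.stack([BSpline(t, np.eye(n_coef)[j], k)(s)
              for j in range(n_coef)], axis=1) # (n_det, n_coef) basis matrix

C = np.linalg.lstsq(B, Y.T, rcond=None)[0].T   # spline coefficients per experiment
coef_model = Ridge().fit(P, C)                 # regress coefficients on parameters

def predict(p_new):
    """Surrogate on the full (parameters x detector position) product space."""
    return coef_model.predict(p_new) @ B.T     # shape (n_new, n_det)
```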
Model smoothness and availability of gradients. As mentioned above, surrogate models in engineering are (more often than in other domains) used for optimization and sensitivity analysis, and are usually expected to depend reasonably smoothly on the input variables. Moreover, there is some trade-off between model smoothness and accuracy, so it is helpful to be able to directly control the amount of smoothness in the model. If gradient-based optimization is to be applied to the model, it is beneficial to have the exact analytic gradient of the model, thus avoiding its expensive and often inaccurate numerical approximation (see the sketch below).
Local accuracy estimates. Surrogate-based optimization requires, in ad-
dition to the approximation of the response function, a model estimat-
ing local accuracy of this approximation [4, 27]. This model of local
accuracy is very rarely provided in existing software, and is usually re-
stricted to the method known in the engineering literature as kriging [28],
which has recently received much attention in the machine learning com-
munity under the name of Gaussian process regression [29].
Handling multidimensional output. In the literature, attention is mainly
focused on modeling functions with a single scalar output [1]. However,
in engineering practice the output is very often multidimensional, i.e.
the problem in question requires modeling several physical quantities
as functions of input parameters. Especially challenging are situations
when the outputs are highly correlated. An example is the modeling
of pressure distribution along the airfoil as a function of airfoil shape.
In such cases one expects the output components of a surrogate model
to be accordingly correlated with each other.
In this note we describe a surrogate modeling tool GTApprox (Generic
Tool for Approximation) designed with the goal of overcoming the above
shortcomings of existing software. First, the tool contains multiple novel
“meta-algorithms” providing the user with accessible means of controlling
the process of modeling in terms of easily understood options, in addition
to conventional method-specific parameters. Second, the tool has additional
modes and features addressing the specific engineering needs pointed out
above.
Some algorithmic novelties of the tool have already been described earlier
[30–37]; in the present paper we describe the tool as a whole, in particular fo-
cusing on the overall decision process and performance comparison that have
not been published before. The tool is a part of the MACROS library [38]. It
can be used as a standalone Python module or with a GUI within the pSeven
platform [39]. The trial version of the tool is available at [40]. A very de-
tailed exposition of the tool’s functionality can be found in its Sphinx-based
documentation [41].
The remainder of the paper is organized as follows. In Sections 2 and
3 we describe the tool’s structure and main algorithms. In particular, in
Section 2 we review individual approximation algorithms of the tool (such as
splines, RSM, etc.), with the emphasis on novel elements and special features.
In Section 3 we describe how the tool automatically chooses the appropriate
individual algorithm for a given problem. Next, in Section 4 we report results
of a comparison of the tool with alternative state-of-the-art surrogate modeling
methods on a collection of test problems. Finally, in Section 5 we describe a
few industrial applications of the tool.
2. Approximation algorithms and special features
2.1. Approximation algorithms
GTApprox is aimed at solving a wide range of approximation problems.
There is no universal approximation algorithm which can efficiently solve all
types of problems, so GTApprox contains many individual algorithms, which
we hereafter refer to as techniques, each providing the best approximation
quality in a particular domain. Some of these techniques are more or less
standard, while others are new or at least contain features rarely found in
other software. We briefly overview the main individual techniques, focusing
on their novelties useful for engineering design.
Response Surface Models (RSM). This is a generalized linear regression in-
cluding several approaches to estimation of regression coefficients. RSM can
be either linear or quadratic with respect to input variables. Also, RSM
supports categorical input variables. There are a number of ways to es-
timate the unknown coefficients of RSM, among which GTApprox implements
ridge regression [42], stepwise regression [43] and the elastic net [44].
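To make the RSM idea concrete, the following sketch fits a quadratic RSM with ridge-regularized coefficients; it uses scikit-learn as a stand-in and illustrates the general approach, not GTApprox's actual implementation.

```python
# A minimal sketch of a quadratic RSM fitted by ridge regression,
# using scikit-learn instead of GTApprox's own implementation.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 3))                 # 100 points, 3 inputs
y = 1 + X[:, 0] - 2 * X[:, 1] * X[:, 2] + 0.01 * rng.normal(size=100)

# Quadratic RSM: a degree-2 polynomial basis with L2-regularized coefficients.
rsm = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1e-3))
rsm.fit(X, y)
print(rsm.predict(X[:5]))
```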
Splines With Tension (SPLT). This is a one-dimensional spline-based tech-
nique intended to combine the robustness of linear splines with the smooth-
ness of cubic splines. A non-linear algorithm [45] is used for an adaptive
selection of the optimal weights on each interval between neighboring points
of DoE (Design of Experiment, i.e. the set of input vectors of the training
set).
Gaussian Processes (GP) and Sparse Gaussian Process (SGP). These are
flexible nonlinear techniques based on modeling training data as a realization
of an infinite-dimensional Gaussian distribution defined by a mean function
and a covariance function [29, 46]. GP allows us to construct approximations
that exactly agree with the provided training data. Also, this technique pro-
vides local accuracy estimates based on the a posteriori covariance of the con-
sidered Gaussian process. Thanks to this important property we can use GP
in surrogate-based optimization [27] and adaptive design of experiments [34].
In GTApprox, parameters of GP are optimized by a novel optimization algo-
rithm with adaptive regularization [36], improving the generalization ability
of the approximation (see also results on theoretical properties of parame-
ter estimates [47–49]). GP memory requirements scale quadratically with
the size of the training set, so this technique is not applicable to very large
training sets. SGP is a version of GP that lifts this limitation by using only a
suitably selected subset of training data and approximating the correspond-
ing covariance matrices [35].
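The following sketch illustrates the kind of local accuracy estimate a GP provides, using scikit-learn's GaussianProcessRegressor as a stand-in for GTApprox's GP technique: the posterior standard deviation grows away from the training points.

```python
# Local accuracy estimates from a GP: the posterior standard deviation
# plays the role of the accuracy estimate (illustrative, not GTApprox code).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 8).reshape(-1, 1)   # a small 1D training set
y = np.sin(X).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0))
gp.fit(X, y)                               # interpolates the training data

x_new = np.array([[2.5], [9.7]])
mean, std = gp.predict(x_new, return_std=True)
print(mean, std)                           # std is larger far from training points
```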
High Dimensional Approximation (HDA) and High Dimensional Approxi-
mation combined with Gaussian Processes (HDAGP). HDA is a nonlinear,
adaptive technique using decomposition over linear and nonlinear base func-
tions from a functional dictionary. This technique is related to artificial
neural networks and, more specifically, to the two-layer perceptron with a
nonlinear activation function [50]. However, neural networks are notorious
for overfitting [51] and for the need to adjust their parameters by trial-and-
error. HDA contains many tweaks and novel structural elements intended to
automate training and reduce overfitting while increasing the scope of the ap-
proach: Gaussian base functions in addition to standard sigmoids, adaptive
selection of the type and number of base functions [52], a new algorithm of
initialization of base functions’ parameters [37], adaptive regularization [52],
boosting used to construct ensembles for additional improvement of accuracy
and stability [33], post-processing of the results to remove redundant features.
HDAGP [36] extends GP by adding to it HDA-based non-stationary covari-
ance functions with the goal of improving GP’s ability to deal with spatially
inhomogeneous dependencies.
Tensor Products of Approximations (TA), incomplete Tensored Approxima-
tions (iTA), and Tensored Gaussian Processes (TGP).
TA [31] is not a single approximation method, but rather a general and
very flexible construction addressing the issue of anisotropy mentioned in the
introduction. In a nutshell, TA is about forming spatial products of different
approximation techniques, with each technique associated with its own subset
of input variables. The key condition under which TA is applicable is the
factorizability of the DoE: the DoE must be a Cartesian product of some
sets with respect to some partition of the whole collection of input variables
into sub-collections, see a two-factor example in Figure 1. If TA is enabled,
GTApprox automatically finds the most detailed factorization for the DoE
of the given training set. Once a factorization is found, to each factor one
can assign a suitable approximation technique, and then form the whole
approximation using a “product” of these techniques. This essentially means
that the overall approximation’s dictionary of basis functions is formed as
the product of the factors’ dictionaries. The coefficients of the expansion
over this dictionary can be found very efficiently [31, 53].
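As an illustration of this “product of dictionaries” construction, the toy sketch below (not GTApprox code) forms the joint basis of a two-factor model with a Kronecker product and fits the coefficients by a generic least-squares solve; the special structure exploited in [31, 53] makes the actual solve much faster than this generic approach.

```python
# Toy illustration of a tensor-product dictionary: with one basis per
# factor, the joint basis is the product of the factor bases.
import numpy as np

x1 = np.linspace(0, 1, 5)          # factor 1: 5 levels
x2 = np.linspace(0, 1, 7)          # factor 2: 7 levels
B1 = np.vander(x1, 3)              # quadratic polynomial basis for factor 1
B2 = np.vander(x2, 2)              # linear basis for factor 2

# Basis of the product approximation on the full-factorial 5x7 DoE:
B = np.kron(B1, B2)                # shape (35, 6) = (5*7, 3*2)

f = np.add.outer(np.sin(x1), x2).ravel()       # responses on the grid
coeffs, *_ = np.linalg.lstsq(B, f, rcond=None)
print(coeffs.shape)                # one coefficient per product basis function
```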
GTApprox offers a number of possible techniques that can be assigned
to a factor, including Linear Regression (LR), B-splines (BSPL), GP and
HDA, see example in Figure 2. It is natural, for example, to assign BSPL
to one-dimensional factors and LR, GP or HDA to multi-dimensional ones.
If not assigned by the user, GTApprox automatically assigns a technique to
each factor by a heuristic akin to the decision tree described later in
Section 3.
Figure 1: Example of factorization of a three-dimensional DoE consisting of 35 points: the
DoE is a Cartesian product of its two-dimensional projection to the x2x3-plane (of size
7) with its one-dimensional projection to the x1-axis (of size 5). The dotted lines connect
points with the same projection to the x2x3-plane.

Figure 2: Two TA approximations for the same training set with two one-dimensional
factors: (a) the default approximation using splines in both factors; (b) an approximation
with splines in the first factor and Linear Regression in the second factor. Note that
approximation (b) depends linearly on x1.

Factorizability of the DoE is not uncommon in engineering practice. Note,
in particular, that the usual full-factorial DoE is a special case of factorizable
DoE with one-dimensional factors. Also, factorization often occurs in various
scenarios where some input variables describe spatial or temporal location of
measurements within one experiment while other variables describe external
conditions or parameters of the experiment; in this case the two groups
of variables are typically varied independently. Moreover, there is often a
significant anisotropy of the DoE with respect to this partition: each exper-
iment can be expensive, but once the experiment is performed the values of
the monitored quantity can be read off of multiple locations relatively easily,
so the DoE factor associated with locations is much larger than the factor
associated with experiment’s parameters. The advantage of TA is that it
can overcome this anisotropy by assigning to each DoE factor a separate
technique, most appropriate for this particular factor.
Nevertheless, exact factorizability is, of course, a relatively restrictive
assumption. The incomplete Tensored Approximation (iTA) technique of
GTApprox relaxes this assumption: this technique is applicable if the DoE
is only a subset of a full-factorial DoE and all factors are one-dimensional.
This covers a number of important use cases: a full-factorial DoE where some
experiments are not finished or the solver failed to converge, or a union of
several full-factorial DoEs resulting from different series of experiments, or
a Latin Hypercube on a grid. Despite the lack of Cartesian structure, con-
struction of the approximation in this case reduces to a convex quadratic
programming problem leading to a fast and accurate solution [53]. An ex-
ample of iTA’s application to a pressure distribution on a wing is shown in
Figure 3. Also, see Section 5.3 for an industrial example.
Tensored Gaussian Processes (TGP) [30, 32] is yet another incarnation of
tensored approximations. TGP is fast and intended for factorized DoE like
the TA technique, but is equipped with local accuracy estimates like GP.
Mixture of Approximations (MoA). If the response function is very inhomo-
geneous, a single surrogate model may not efficiently cover the whole design
space (see [21] and Section 5.1 for an engineering example with critical buck-
ling modes in composite panels). One natural approach to overcome this
issue is to perform a preliminary space partitioning and then build a sepa-
rate model for each part. This is exactly what Mixture of Approximations
does. This technique falls into the family of Hierarchical Mixture Models
[54, 55]. A Gaussian mixture model is used to do the partitioning, and after
that other techniques are used to build local models. MoA is implemented
to automatically estimate the number of parts; it supports possibly overlap-
ping parts and preservation of model continuity across different parts [21].
A comparison of MoA with a standard technique (GP) is shown in Figure 4.
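The following toy sketch conveys the gist of the approach, using scikit-learn's GaussianMixture for the partitioning and simple linear local models; unlike the actual MoA technique, it does not handle overlapping parts or enforce continuity across parts.

```python
# The gist of MoA (illustrative only): partition the input space with a
# Gaussian mixture, then fit a separate local model in each part.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(300, 2))
y = np.where(X[:, 0] < 0, np.sin(4 * X[:, 1]), X[:, 1] ** 2)  # inhomogeneous

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
labels = gmm.predict(X)

local_models = [LinearRegression().fit(X[labels == k], y[labels == k])
                for k in range(gmm.n_components)]

def predict(X_new):
    # Route each point to the model of its most likely mixture component.
    k = gmm.predict(X_new)
    return np.array([local_models[ki].predict(x[None, :])[0]
                     for ki, x in zip(k, X_new)])

print(predict(X[:3]))
```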
Figure 3: Application of iTA to reconstruction of a pressure distribution on a wing. The
distribution was obtained by an aerodynamic simulation. (a) The 200 × 29 grid on which
the pressure distribution is defined. (b) iTA is applied to a training set of pressure values
at 290 randomly chosen points. (c) The resulting approximation is compared with the
true distribution.

Figure 4: Application of MoA to a spatially inhomogeneous response function. (a) The
true response function. (b) Approximations by MoA and GP.

Gradient Boosted Regression Trees (GBRT). This is a well-known technique
that uses decision trees as weak estimators and combines several weak esti-
mators into a single model, in a stage-wise fashion [20]. GBRT is suitable
for problems with large data sets and in cases when a smooth approximation
is not required.
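A minimal sketch of this scheme using scikit-learn's GradientBoostingRegressor, which implements the same family of methods [20] (this is not GTApprox's own GBRT):

```python
# Stage-wise boosting of regression trees, sketched with scikit-learn.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(5000, 4))
y = X[:, 0] + (X[:, 1] > 0.5) + rng.normal(scale=0.1, size=5000)

gbrt = GradientBoostingRegressor(n_estimators=100, max_depth=3)
gbrt.fit(X, y)
# Predictions are piece-wise constant: accurate, but not smooth.
print(gbrt.predict(X[:3]))
```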
Piece-wise Linear Approximation (PLA). This technique is based on the De-
launay triangulation of the training sample. It connects neighboring points
of the given training set into triangles (or tetrahedra) and builds a linear
model in each triangle. PLA is a simple and reliable interpolation technique.
It is suitable for low-dimensional problems (mostly 1D, 2D and 3D) where
the approximation is not required to be smooth. In higher dimensions the
construction becomes computationally intractable.
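SciPy's LinearNDInterpolator implements essentially the same construction (Delaunay triangulation with a linear model per simplex) and can serve as a reference illustration of PLA:

```python
# Piece-wise linear interpolation on a Delaunay triangulation, in the
# spirit of PLA, using SciPy (not GTApprox's own implementation).
import numpy as np
from scipy.interpolate import LinearNDInterpolator

rng = np.random.default_rng(3)
pts = rng.uniform(0, 1, size=(50, 2))        # 2D training DoE
vals = np.sin(pts[:, 0]) * pts[:, 1]

# LinearNDInterpolator triangulates the points (Delaunay) and fits a
# linear model on each triangle; outside the convex hull it returns NaN.
pla = LinearNDInterpolator(pts, vals)
print(pla(np.array([[0.5, 0.5], [0.2, 0.8]])))
```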
2.2. User options and additional features
GTApprox implements a number of user options and additional features
that directly address the issues raised in Section 1. The options are not
linked to specific approximation techniques described in the previous subsec-
tion; rather, the tool selects and tunes a technique according to the options
(see Section 3). The formulations of options and features avoid references to
algorithms’ details; rather, they are described by their overall effect on the
surrogate model. Below we list a few of these options and features.
Accelerator. Some techniques contain parameters significantly affecting the
training time of the surrogate model (e.g., the number of basic approximators
in HDA or the number of decision trees in GBRT). By default, GTApprox
favors accuracy over training time. The Accelerator option defines a number
of “levels”; each level assigns to each technique a set of parameters such that
the higher the level, the shorter the training time with this technique.
Gradient/Jacobian matrix. To serve optimization needs, each approximation
produced by GTApprox (except non-smooth models, i.e. GBRT and PLA)
is constructed simultaneously with its gradient (or Jacobian matrix in the
context of multi-component approximations).
Accuracy Evaluation (AE). Some GTApprox techniques of Bayesian nature
(mostly GP-based, i.e. GP, SGP, HDAGP, TGP) construct surrogate models
along with point-wise estimates of deviations of these models from true re-
sponse values [29]. These estimates can be used in SBO, to define an objective
function taking into account local uncertainty of the current approximation
(see an example in Section 3.2).
Smoothing. Each approximation constructed by GTApprox can be addition-
ally smoothed. The smoothing is done by additional regularization of the
model; the exact algorithm depends on the particular technique. Smooth-
ing affects the gradient of the model as well as the model itself. Smoothing
may be useful in tasks for which smooth derivatives are important, e.g., in
surrogate-based optimization.
Componentwise vs. joint approximation. If the response function has sev-
eral scalar components, approximation of all components one-by-one can be
lengthy and does not take into account relations that may connect different
components. Most of the GTApprox techniques have a special “joint” mode
where the most computationally intensive steps, like iterative optimization of
the basis functions in HDA or kernel optimization in GP, are performed only
once, simultaneously for all output components, and only the last step of
linear expansion over basis functions is performed separately for each out-
put [36]. This approach can significantly speed up training. For example,
training a GP model with m outputs on a training set of N points requires
O(mN³) arithmetic operations in the componentwise mode, while in the
joint mode it is just O(N³ + mN²); with m = 10 outputs and N = 1000
points this is roughly a tenfold reduction. Furthermore, because of the partly
shared approximation workflow, the joint mode better preserves similarities
between different components of the response function (whenever they exist).
Exact Fit. Some GTApprox techniques (like GP and splines) allow the con-
struction of approximations that pass exactly through the points of the training
set. Note, however, that this requirement may sometimes lead to overfitting
and is certainly not appropriate if the training set is noisy.
3. Automated choice of the technique
GTApprox implements two meta-algorithms automating the choice of the
approximation technique for the given problem: a simpler one, based on
hand-crafted rules and hereafter referred to as the “Decision Tree”, and a
more complex one, including problem-specific adaptation and branded as
“Smart Selection”. We outline these two meta-algorithms below.
3.1. “Decision tree”
The “decision tree” approach selects an appropriate technique using pre-
determined rules that involve size and dimensions of the data sets along with
user-specified features (requirements of model linearity, Exact Fit or Accu-
racy Evaluation, enabled Tensor Approximations), see Figure 5. The rules
are partly based on the factual capabilities and limitations of different tech-
niques and partly on extensive preliminary testing and practical experience.
The rules do not guarantee the optimal choice of a technique, but rather
select the most reasonable candidate.
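To convey the flavor of such rules, here is a hypothetical fragment; the actual rules of Figure 5 are more involved, and the thresholds and branches below are invented purely for illustration.

```python
# A hypothetical rule-based selector (NOT GTApprox's actual rules):
# technique is chosen from data size/dimension and user options.
def select_technique(n_points, dim, exact_fit=False, tensored_doe=False):
    if tensored_doe:
        return "TA"
    if n_points <= 10:
        return "RSM"       # tiny sample: low-order model
    if dim == 1:
        return "SPLT"      # 1D data: tension splines
    if n_points < 1000:
        return "GP"        # moderate sample: Gaussian process
    if n_points < 10000:
        return "HDA" if not exact_fit else "GP"
    return "SGP"           # large sample: sparse GP

print(select_technique(n_points=500, dim=4))   # -> "GP"
```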
3.2. “Smart selection”
The main drawbacks of the “decision tree” approach are that it only
takes into account the crudest properties of the data set (size, dimensions)
and cannot adjust parameters of the technique, which is often important.
To address both issues, “Smart selection” performs, for each training set,
a numerical optimization of the technique as well as its parameters [56, 57],
by minimizing the cross-validation error.
This is quite a complex optimization problem: the search space is tree-
structured, parameters can be continuous or categorical, and the objective
function is noisy and expensive to evaluate.
We first describe how we optimize the vector of parameters for a given
technique. To this end we use Surrogate Based Optimization (SBO) [27, 58].
Recall that SBO is an iterative algorithm that can be written as follows:
1. Pick somehow an initial candidate λ1 for the optimal vector of param-
eters.
2. Given the current candidate λk for the optimal vector of parameters,
find the value ck of the objective function (cross-validation error) on it.
3. Using all currently available pairs R = {(λi, ci), i = 1, …, k}, construct
an acquisition function a(λ; R) reflecting our preference for λ to be the
next candidate vector.
4. Choose the new candidate vector of parameters λk+1 by numerically
optimizing the acquisition function and return to step 2.
The acquisition function used in step 3 must be a reasonably lightweight
function involving not only the current estimate of the objective function
c(λ), but also the uncertainty of this estimate, in order to give the algorithm
an incentive to explore new regions of the parameter space. A standard
choice for the acquisition function that we use is the Expected Improvement
function

    a_EI(λ; R) = E((c* − c(λ))+),

where c* = min_{1≤i≤k} ci is the currently known minimum and (c* − c(λ))+ =
max(0, c* − c(λ)) is the objective function’s improvement resulting from con-
sidering a new vector λ. The expectation here can be approximately written,
under the assumption of a univariate normal distribution of the error, in terms
of the surrogate prediction ĉ(λ) of c(λ) and the estimated deviation σ(λ) of
c(λ) from ĉ(λ). The function ĉ(λ) is found as a GTApprox surrogate model
constructed from the data set R, and the accompanying uncertainty estimate
σ(λ) is found using the Accuracy Evaluation feature.

Figure 5: The “decision tree” technique selection in GTApprox. Rectangles show individual
techniques. Rhombuses show choices depending on properties of the training data and user
options. Pentagons show exceptional cases with conflicting or unfeasible requirements.
The described procedure allows us to choose optimal parameters for a
particular technique. In order to choose the technique we perform SBO for
each technique from some predefined set, and then select the technique with
the minimal error.
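Under the normality assumption, the expectation above has a closed form; a minimal sketch, with ĉ(λ) and σ(λ) passed in as precomputed arrays standing in for the surrogate prediction and its Accuracy Evaluation estimate:

```python
# Expected Improvement for minimization: EI = E[(c_best - c)_+]
# with c ~ N(c_hat, sigma^2); closed form via the normal cdf/pdf.
import numpy as np
from scipy.stats import norm

def expected_improvement(c_hat, sigma, c_best):
    sigma = np.maximum(sigma, 1e-12)      # guard against zero deviation
    z = (c_best - c_hat) / sigma
    return (c_best - c_hat) * norm.cdf(z) + sigma * norm.pdf(z)

# Surrogate predictions of the CV error for three candidate vectors:
c_hat = np.array([0.30, 0.25, 0.40])
sigma = np.array([0.01, 0.10, 0.20])
print(expected_improvement(c_hat, sigma, c_best=0.28))
```

Note how a candidate with a mediocre prediction but large uncertainty can still score well, which is exactly the exploration incentive mentioned above.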
The set of techniques is formed according to hints specified by the user.
The hints are a generalization of options towards less technical and more
intuitive description of data or of the required properties of the surrogate
model. In general, hints may be imprecise, e.g. “IsNoisy” or “ClusteredData”.
Hints may play the role of tags or keywords helping the users to
express their domain-specific knowledge and serving to limit the range of
techniques considered in the optimization.
The “smart selection” approach is time consuming, since each SBO it-
eration involves constructing a new auxiliary surrogate model. The process
can be sped up by a few hints. The “Accelerator” hint adjusts parameters
of the SBO procedure, making it faster but less accurate. The “Acceptable-
QualityLevel” hint allows the user to specify an acceptable level of model
accuracy for an early stopping of SBO.
There exist general-purpose Python frameworks for optimizing parame-
ters of approximation techniques, e.g. hyperopt [59]. We have considered us-
ing hyperopt (via HPOlib [60]) with GTApprox as an alternative to “smart
selection”, but found the results to be worse.
First, being a general framework, HPOlib/hyperopt does not take into ac-
count special properties of particular techniques. For example, GP-based
techniques have high computational complexity and cannot be applied in the
case of large training sets, but HPOlib/hyperopt would attempt to build a GP
model anyway. Second, the only termination criterion in HPOlib/hyperopt
is the maximum number of constructed models – a criterion not very flexible
given that different models can have very different training times. Finally,
we have observed HPOlib/hyperopt in some cases to repeatedly construct
models with the same parameters, which is again inefficient since training
times for some of our models are quite large.
4. Comparison with alternative algorithms on test problems
We perform a comparison of accuracy between GTApprox and some of the
most popular, state-of-the-art predictive modeling Python libraries: scikit-
learn [13], XGBoost [61], and GPy [62]. (The code and data for this benchmark
are available at https://github.com/yarotsky/gtapprox_benchmark; the library
versions used were GTApprox 6.8, scikit-learn 0.17.1, XGBoost 0.4, and GPy
1.0.9.) We emphasize that there are a few caveats to this comparison. First,
these libraries are aimed at a tech-
few caveats to this comparison. First, these libraries are aimed at a tech-
nically advanced audience of data analysts who are expected to themselves
select appropriate algorithms and tune their parameters. In particular, scikit-
learn does not provide a single entry point wrapping multiple techniques like
GTApprox does, as described in Section 3. We therefore compare GTApprox,
as a single algorithm, to multiple algorithms of scikit-learn. We also select a
couple of different modes in both XGBoost and GPy. Second, the scope of
scikit-learn and XGBoost is somewhat different from that of GTApprox: the
former are not focused on regression problems and their engineering appli-
cations and, in particular, their most powerful nonlinear regression methods
seem to be ensembles of trees (Random Forests and Gradient Boosted Trees)
that produce piece-wise constant approximations presumably not fully suit-
able for modeling continuous response functions. Keeping these points in
mind, our comparison should be otherwise reasonably fair and informative.
We now describe the specific techniques considered in the benchmark. All
techniques are used with default settings.
We consider a diverse set of scikit-learn methods for regression, both
linear and nonlinear: Ridge Regression with cross-validation (denoted by
SL RidgeCV in our tests), Support Vector Regression (SL SVR), Gaussian
Processes (SL GP), Kernel Ridge (SL KR), and Random Forest Regression
(SL RFR). Our preliminary experiments included more methods, in particular
common Linear Regression, LassoCV and Gradient Boosting, but we have
found their results to be very close to results of other linear or tree-based
methods.
We consider two modes of XGBoost: with the gbtree booster (default,
XGB) and with the gblinear booster (XGB lin).
We consider two modes of GPy: the GPRegression model (GPy) and,
since some of our test sets are relatively large, the SparseGPRegression
model (GPy sparse).
Finally, we consider two versions of GTApprox corresponding to the
two meta-algorithms described in Section 3: the basic tree-based algorithm
(gtapprox) and the “smart selection” algorithm (gta smart).
Our test suite contains 31 small- and medium-scale problems, of which
23 are given by explicit formulas and the remaining 8 represent real-world
data sets or results of complex simulations. The problems defined by formu-
las include a number of functions often used for testing optimization algo-
rithms [63], such as Ackley function, Rosenbrock function, etc. Additionally,
they include a number of non-smooth and noisy functions. The real-world
data sets and data of complex simulations are borrowed from the UCI repos-
itory [18] and the GdR Mascot-Num benchmark [64]. Detailed descriptions
or references for the test problems can be found in the GTApprox documen-
tation ([41, MACROS User Manual, section “Benchmarks and Tests”]).
Each problem gives rise to several tests by varying the size of the training
set and the way the training set is generated: for problems defined by explicit
functions we create the training set by evaluating the response function on
a random DoE or on a Latin Hypercube DoE of the given size; for problems
with already provided data sets we randomly choose a subset of the given
size. As a result, the size of the training set in our experiments ranges from
5 to 30000. Testing is performed on a holdout set. Some of the problems
have multi-dimensional outputs; in such cases each scalar output is handled
independently and is counted as a separate test. The total number of tests
obtained in this way is 430. Input dimensionality of the problems ranges
from 1 to 20.
In each test we compute the relative root-mean-squared prediction error as

    RRMS = [ Σ_{n=1..M} (f̂(xn) − f(xn))² / Σ_{n=1..M} (f(xn) − f̄)² ]^{1/2},

where {(xn, f(xn))}_{n=1..M} is the test set with true values of the response
function, f̂(xn) is the predicted value, and f̄ is the mean value of f on the
test set. Note that RRMS essentially compares the surrogate model f̂ with
the trivial constant prediction f̄, up to the fact that f̄ is computed on the
test set rather than the training set.
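In code, the RRMS of a vector of predictions against the true test values is a direct transcription of the formula above:

```python
# RRMS: relative root-mean-squared error against the constant baseline.
import numpy as np

def rrms(f_true, f_pred):
    num = np.sum((f_pred - f_true) ** 2)
    den = np.sum((f_true - f_true.mean()) ** 2)
    return np.sqrt(num / den)

f_true = np.array([1.0, 2.0, 3.0, 4.0])
print(rrms(f_true, f_true))                     # perfect model: 0.0
print(rrms(f_true, np.full(4, f_true.mean())))  # constant prediction: 1.0
```

As the second call shows, the trivial constant prediction has RRMS exactly 1, so values above 1 mean the model is worse than predicting the mean.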
Figure 6: Accuracy profiles of different approximation algorithms.
Each test is run in a separate OS process with available virtual memory
restricted to 6GB. Some of the techniques raise exceptions when training on
certain problems (e.g., out-of-memory errors). In such cases we set RRMS =
+∞.
For each surrogate modeling algorithm we construct its accuracy profile
as the function showing for any RRMS threshold the ratio of tests where the
RRMS error was below this threshold.
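Constructing such a profile from per-test RRMS values is straightforward; a small sketch with hypothetical error values:

```python
# Accuracy profile: for each RRMS threshold, the fraction of tests in
# which an algorithm's error fell below that threshold.
import numpy as np

def accuracy_profile(errors, thresholds):
    errors = np.asarray(errors)               # one RRMS value per test
    return [(errors < t).mean() for t in thresholds]

errors = [0.05, 0.2, 0.7, np.inf]             # inf marks a failed run
print(accuracy_profile(errors, thresholds=[0.1, 0.5, 1.0]))
# -> [0.25, 0.5, 0.75]
```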
The resulting profiles are shown in Figure 6. We find that, on the whole,
the default GTApprox is much more accurate than default implementations
of methods from other libraries, with the exception of highly noisy problems
where RRMS > 1: here gtapprox performs just a little worse than linear and
tree-based methods. As expected, gta smart yields even better results than
gtapprox.
We should, however, point out that this advantage in accuracy is achieved
at the cost of longer training. Possibly in contrast to other tools, GTApprox
favors accuracy over training time, assuming the user of the default algorithm
delegates to it the experiments needed to obtain an accurate model. In
Figure 7 we show profiles for training time. Whereas training of scikit-learn
and XGBoost algorithms with default settings typically takes a fraction of a
second, GTApprox may need a few minutes, especially the “smart selection”
version. Of course, if desired, training time can be reduced (possibly at the
cost of accuracy) by tuning various options of GTApprox.

Figure 7: Training time profiles of different approximation algorithms.
5. Applications
We briefly describe several industrial applications of GTApprox [21–24]
to illustrate how special features of GTApprox can help in solving real world
problems.
5.1. Surrogate models for reserve factors of composite stiffened panels
Aeronautical structures are mostly made of stiffened panels that consist of
thin shells (or skins) reinforced with stiffeners in two orthogonal directions (see
Figure 8). The stiffened panels are subject to highly nonlinear phenomena
such as buckling or collapse. Most strength conditions for the structure’s
reliability can be formulated using so-called reserve factors (RFs). In the
simplest case, a reserve factor is the ratio between an allowable stress (for
example, material strength) and the applied stress. The whole structure is
validated if all RFs of all the panels it consists of are greater than 1. RF
values are usually found using computationally expensive Finite Element
(FE) methods.
Figure 8: A stiffened panel.
During the sizing process, i.e. optimizing geometry of the structure with
respect to certain criteria (usually minimization of the weight of the struc-
ture), RF values are taken as optimization constraints that allow one to
conclude whether the considered geometry would be reliable. So for all basic structures the
RFs and their gradients have to be recomputed on every optimization step,
which becomes a very expensive operation in terms of time.
The goal of this application was to create a surrogate model that works
orders of magnitude faster than the FE method and at the same time has
good accuracy: the error should be less than 5% for at least 95% of points with
RF values close to 1, and the model should reliably tell if the RF is greater
or less than 1 for a particular design.
The complexity of the problem was exacerbated by several issues. First,
the RF values depend on 20 parameters (geometry and loads), all of which
significantly affect the output values. Second, some RFs depend on the
parameters discontinuously. Third, points with RFs close to 1 are scattered
across the input domain.
The Mixture of Approximations (MoA) technique of GTApprox was used
to create a surrogate model based on a training dataset of 200,000 points
that met the accuracy requirements and worked significantly faster than the
reference PS3 tool implementing the FE computation. The optimization
was further facilitated by the availability of the gradients of the GTApprox
model. The optimization results obtained by GTApprox and the PS3 tool are
shown in Figure 9. Details on the work can be found in [21, 22]. The obtained
GTApprox surrogate model was embedded into the pre-sizing optimization
process of Airbus A350XWB composite boxes.
Figure 9: RF values at the optimum. Comparison of the optimal result found using the
GTApprox surrogate model with that of PS3.

5.2. Surrogate models for helicopter loads estimation
In this work GTApprox was used to create surrogate models for maximum
loads acting on various structural elements of a helicopter. Knowledge of
maximum loads during the flight allows one to estimate fatigue and thus see
when repair is needed. Such data can only be collected in flight tests, as
one needs to install additional sensors on a helicopter to measure loads, and
these sensors are too expensive to be installed on every machine.
So the goal of the project was to take data already measured during flight
tests and create surrogate models that would allow one to estimate loads on
every flight as a function of flight parameters and external conditions. The
challenge of the project was that models for many different load types and
flight conditions (e.g. maneuver types) needed to be created: in total, one
needed to build 4152 surrogate models. The scale of the problem made it
impossible to tune each model “manually”; at the same time, different combi-
nations of loads and flight conditions could demonstrate very different behav-
ior and depend on different sets of input parameters. The input dimension
varied in the range from 8 to 10 and the sample size was from 1 to 108 points.
GTApprox’s capabilities for automatic technique selection and quality as-
sessment were used to create all 4152 models with the required accuracy
without manually tweaking their parameters in each case. In total, 2877
constant models, 777 RSM models, 440 GP models and 58 HDA models
were constructed. Only a few of the most complex cases had to be specifically
addressed in an individual manner. More details on the work can be found
in [24].

Figure 10: Design of Experiments in the aerodynamic test case.
5.3. Surrogate models for aerodynamic problems
In this application GTApprox was used to obtain surrogate models for
aerodynamic response functions of 3-dimensional flight configurations [23].
The training data were obtained either by Euler/RANS CFD simulations or
by wind tunnel tests; in either case experiments were costly and/or time-
consuming, so a surrogate model was required to cover the whole domain of
interest.
The training set’s DoE, shown in Figure 10, had two important peculiar-
ities. First, the DoE was a union of several (irregular) grids resulting from
different experiments. Second, the grids were highly anisotropic: variables
x1, x3 were sampled with much lower resolutions than variable x2.
As explained in Section 2, this complex structure is exactly what the iTA
technique is aimed at. The whole DoE can be considered as an incomplete
grid, so iTA is directly applicable to the whole training set without the need
to construct and then merge separate approximations for different parts of
the design space.
In Figure 11 we compare, on a 2D slice of the region of main interest, the
iTA surrogate model with a model obtained using the scikit-learn Gaussian
Process technique [13], which may be considered as a conventional approach
for this problem (since the full DoE is not factorizable). We observe phys-
ically unnatural “valleys” in the GP model. This degeneracy results from
the GP’s assumptions of uniformity and homogeneity of data [29] that do
not hold in this problem due to gaps in the DoE and a large gradient of the
response function in a part of the design space. Clearly, the iTA model does
not have these drawbacks. In addition, iTA is much faster to train on this
2026-point set: it took 10 seconds for the iTA model and 1800 seconds for
the GP model (the experiments were conducted on a PC with an Intel(R)
Core(TM) i7-2600 CPU @ 3.40GHz and 8GB RAM).

Figure 11: iTA and GP approximations in the aerodynamic problem. (a) iTA; (b) GP.
6. Conclusion
We have described GTApprox, a new tool for medium-scale surrogate
modeling in industrial design, and its novel features that make it convenient
for surrogate modeling, especially for applications in engineering and for
use by non-experts in data analysis. The tool contains some entirely new
approximation algorithms (e.g., Tensor Approximation with arbitrary factors
and incomplete Tensor Approximation) as well as novel model selection meta-
algorithms. In addition, GTApprox supports multiple novel “non-technical”
options and features allowing the user to more easily express the desired
properties of the model or some domain-specific properties of the data.
When compared to scikit-learn algorithms in the default mode on a collec-
tion of test problems, GTApprox shows superior accuracy. This is achieved
at the cost of longer training times that, nevertheless, remain moderate for
medium-scale problems.
We have also briefly described a few applications of GTApprox to real
engineering problems where a crucial role was played by the tool’s distinctive
elements (the new algorithms MoA and iTA, automated model selection,
built-in availability of gradients).
Acknowledgments
The scientific results of Sections 2 and 3.2 are based on the research
conducted at IITP RAS and supported by the Russian Science Foundation
(project 14-50-00150).
References
[1] A. Forrester, A. Sobester, A. Keane, Engineering design via surrogate
modelling: a practical guide, John Wiley & Sons, 2008.
[2] T. W. Simpson, V. Toropov, V. Balabanov, F. A. Viana, Design and
analysis of computer experiments in multidisciplinary design optimiza-
tion: a review of how far we have come or not, in: Proceedings of the 12th
AIAA/ISSMO Multidisciplinary Analysis and Optimization Conference,
pp. 1–12.
[3] Y. Nesterov, Introductory lectures on convex optimization: a basic
course, volume 87 of Applied optimization, Springer US, 2004.
[4] A. I. Forrester, A. J. Keane, Recent advances in surrogate-based opti-
mization, Progress in Aerospace Sciences 45 (2009) 50–79.
[5] Y. Xia, H. Tong, W. K. Li, L.-X. Zhu, An adaptive estimation of dimen-
sion reduction space, Journal of the Royal Statistical Society: Series B
(Statistical Methodology) 64 (2002) 363–410.
[6] Y. Zhu, P. Zeng, Fourier methods for estimating the central subspace
and the central mean subspace in regression, Journal of the American
Statistical Association 101 (2006) 1638–1651.
[7] J. E. Oakley, A. O’Hagan, Probabilistic sensitivity analysis of complex
models: a Bayesian approach, Journal of the Royal Statistical Society:
Series B (Statistical Methodology) 66 (2004) 751–769.
[8] B. Sudret, Global sensitivity analysis using polynomial chaos expan-
sions, Reliability Engineering & System Safety 93 (2008) 964–979.
[9] E. Burnaev, I. Panin, Adaptive design of experiments for sobol in-
dices estimation based on quadratic metamodel, in: A. Gammerman,
V. Vovk, H. Papadopoulos (Eds.), Statistical Learning and Data Sci-
ences: Third International Symposium, SLDS 2015, Egham, UK, April
20-23, 2015, Proceedings, volume 9047 of Lecture Notes in Artificial
Intelligence, Springer International Publishing, 2015, pp. 86–96.
[10] E. Burnaev, I. Panin, B. Sudret, Effective design for sobol indices es-
timation based on polynomial chaos expansions, in: A. Gammerman,
Z. Luo, J. Vega, V. Vovk (Eds.), 5th International Symposium, COPA
2016, Madrid, Spain, April 20–22, 2016, Proceedings, volume 9653 of Lec-
ture Notes in Artificial Intelligence, Springer International Publishing,
2016, pp. 165–184.
[11] T. Hastie, R. Tibshirani, J. Friedman, The elements of statistical learn-
ing: data mining, inference and prediction, Springer, 2 edition, 2009.
[12] R Core Team, R: A language and environment for statistical computing,
2012.
[13] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander-
plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay,
Scikit-learn: Machine learning in Python, Journal of Machine Learning
Research 12 (2011) 2825–2830.
[14] B. Adams, L. Bauman, W. Bohnhoff, K. Dalbey, M. Ebeida, J. Eddy,
M. Eldred, P. Hough, K. Hu, J. Jakeman, L. Swiler, D. Vigil, DAKOTA,
A Multilevel Parallel Object-Oriented Framework for Design Optimiza-
tion, Parameter Estimation, Uncertainty Quantification, and Sensitiv-
ity Analysis: Version 5.4 User’s Manual, Sandia National Laboratories,
2013.
[15] M. Hofmann, R. Klinkenberg, RapidMiner: Data Mining Use Cases and
Business Analytics Applications, Chapman & Hall/CRC, 2013.
[16] modeFRONTIER: The multi-objective optimization and design environ-
ment, http://www.esteco.com/modefrontier/, 2014. Accessed: 2016-
04-02.
[17] Kaggle, https://www.kaggle.com/, 2010. Accessed: 2016-04-02.
[18] M. Lichman, UCI machine learning repository, 2013.
[19] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[20] J. H. Friedman, Greedy function approximation: A gradient boosting
machine, Ann. Statist. 29 (2001) 1189–1232.
[21] M. Belyaev, E. Burnaev, S. Grihon, P. Prikhodko, Surrogate model-
ing of stability constraints for optimization of composite structures, in:
S. Koziel, L. Leifsson (Eds.), Surrogate-Based Modeling and Optimiza-
tion for Engineering applications, Springer, 2013, pp. 359–391.
[22] G. Sterling, P. Prikhodko, E. Burnaev, M. Belyaev, S. Grihon, On ap-
proximation of reserve factors dependency on loads for composite stiff-
ened panels, Advanced Materials Research 1016 (2014) 85–89.
[23] M. Belyaev, E. Burnaev, E. Kapushev, S. Alestra, M. Dormieux,
A. Cavailles, D. Chaillot, E. Ferreira, Building data fusion surrogate
models for spacecraft aerodynamic problems with incomplete factorial
design of experiments, Advanced Materials Research 1016 (2014) 405–
412.
[24] A. Struzik, E. Burnaev, P. Prikhodko, Surrogate models for helicopter
loads problems, in: Proceedings of the 5th European Conference for
Aeronautics and Space Sciences. Germany, Munich.
[25] Datadvance: pSeven/MACROS use cases, https://www.datadvance.
net/use-cases/, 2016. Accessed: 2016-04-02.
[26] D. Montgomery, Design and Analysis of Experiments, John Wiley &
Sons, 2006.
[27] D. R. Jones, M. Schonlau, W. J. Welch, Efficient global optimization of
expensive black-box functions, Journal of Global optimization 13 (1998)
455–492.
[28] D. G. Krige, A Statistical Approach to Some Basic Mine Valuation
Problems on the Witwatersrand, Journal of the Chemical, Metallurgical
and Mining Society of South Africa 52 (1951) 119–139.
[29] C. E. Rasmussen, C. K. I. Williams, Gaussian Processes for Machine
Learning (Adaptive Computation and Machine Learning), The MIT
Press, 2005.
[30] M. Belyaev, E. Burnaev, Y. Kapushev, Gaussian process regression for
structured data sets, in: A. Gammerman, V. Vovk, H. Papadopou-
los (Eds.), Statistical Learning and Data Sciences: Third International
Symposium, SLDS 2015, Egham, UK, April 20-23, 2015, Proceedings,
volume 9047 of Lecture Notes in Artificial Intelligence, Springer Inter-
national Publishing, 2015, pp. 106–115.
[31] M. G. Belyaev, Anisotropic smoothing splines in problems with factorial
design of experiments, Doklady Mathematics 91 (2015) 250–253.
[32] M. Belyaev, E. Burnaev, Y. Kapushev, Computationally efficient algo-
rithm for gaussian processes based regression in case of structured sam-
ples, Journal of Computational Mathematics and Mathematical Physics
56 (2016) 499–513.
[33] E. Burnaev, P. Prikhodko, On a method for constructing ensembles of
regression models, Automation & Remote Control 74 (2013) 1630–1644.
[34] E. Burnaev, M. Panov, Adaptive design of experiments based on gaus-
sian processes, in: A. Gammerman, V. Vovk, H. Papadopoulos (Eds.),
Statistical Learning and Data Sciences: Third International Symposium,
SLDS 2015, Egham, UK, April 20-23, 2015, Proceedings, volume 9047
of Lecture Notes in Artificial Intelligence, Springer International Pub-
lishing, 2015, pp. 116–126.
[35] E. Burnaev, A. Zaytsev, Surrogate modeling of multifidelity data for
large samples, Journal of Communications Technology & Electronics 60
(2015) 1348–1355.
[36] E. Burnaev, M. Panov, A. Zaytsev, Regression on the basis of non-
stationary gaussian processes with bayesian regularization, Journal of
Communications Technology and Electronics 61 (2016) 661–671.
[37] E. Burnaev, P. Erofeev, The influence of parameter initialization on the
training time and accuracy of a nonlinear regression model, Journal of
Communications Technology and Electronics 61 (2016) 646–660.
[38] E. Burnaev, F. Gubarev, S. Morozov, A. Prokhorov, D. Khominich,
PSE/MACROS: Software environment for process integration, data
mining and design optimization, The Internet Representation of Sci-
entific Editions of FSUE “VIMI” (2013) 41–50.
[39] A. Davydov, S. Morozov, N. Perestoronin, A. Prokhorov, The first
full-cloud design space exploration platform, in: European NAFEMS
Conferences: Simulation Process and Data Management (SPDM), 2–3
December 2015, Munich, Germany, pp. 90–95.
[40] pSeven/MACROS download page, https://www.datadvance.net/
download/, 2016. Accessed: 2016-04-02.
[41] Macros and GTApprox documentation, https://www.datadvance.
net/product/pseven-core/documentation.html, 2016. Accessed:
2016-04-02.
[42] A. N. Tikhonov, On the stability of inverse problems, Doklady Akademii
Nauk SSSR 39 (1943) 195 – 198.
[43] M. A. Efroymson, Multiple regression analysis, Mathematical Methods
for Digital Computers (1960).
[44] H. Zou, T. Hastie, Regularization and variable selection via the elas-
tic net, Journal of the Royal Statistical Society: Series B (Statistical
Methodology) 67 (2005) 301–320.
[45] R. J. Renka, Interpolatory tension splines with automatic selection of
tension factors, Society for Industrial and Applied Mathematics, Journal
for Scientific and Statistical Computing 8 (1987) 393–415.
[46] N. A. C. Cressie, Statistics for Spatial Data, Wiley, 1993.
[47] E. Burnaev, A. Zaytsev, V. Spokoiny, The Bernstein-von Mises theorem
for regression based on Gaussian processes, Russ. Math. Surv. 68 (2013)
954–956.
[48] E. Burnaev, A. Zaytsev, V. Spokoiny, Properties of the posterior distri-
bution of a regression model based on gaussian random fields, Automa-
tion and Remote Control 74 (2013) 1645–1655.
[49] E. Burnaev, A. Zaytsev, V. Spokoiny, Properties of the bayesian param-
eter estimation of a regression based on gaussian processes, Journal of
Mathematical Sciences 203 (2014) 789–798.
[50] S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice
Hall PTR, Upper Saddle River, NJ, USA, 2nd edition, 1998.
[51] R. Caruana, S. Lawrence, L. Giles, Overfitting in neural nets: Back-
propagation, conjugate gradient, and early stopping, in: Advances in
Neural Information Processing Systems 13: Proceedings of the 2000
Conference, volume 13, MIT Press, p. 402.
[52] M. G. Belyaev, E. V. Burnaev, Approximation of a multidimensional
dependency based on a linear expansion in a dictionary of parametric
functions, Informatics and its Applications 7 (2013) 114–125.
[53] M. G. Belyaev, Approximation of multidimensional dependences accord-
ing to structured samples, Scientific and Technical Information Process-
ing 42 (2015) 328–339.
[54] M. I. Jordan, Hierarchical mixtures of experts and the EM algorithm,
Neural Computation 6 (1994) 181–214.
[55] S. E. Yuksel, J. N. Wilson, P. D. Gader, Twenty years of mixture of
experts., IEEE Trans. Neural Netw. Learning Syst. 23 (2012) 1177–1193.
[56] J. Snoek, H. Larochelle, R. P. Adams, Practical Bayesian optimization
of machine learning algorithms, in: Advances in Neural Information
Processing Systems 25, 2012, pp. 2951–2959.
[57] J. S. Bergstra, R. Bardenet, Y. Bengio, B. Kégl, Algorithms for hyper-
parameter optimization, in: J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett,
F. Pereira, K. Q. Weinberger (Eds.), Advances in Neural Information
Processing Systems 24, Curran Associates, Inc., 2011, pp. 2546–2554.
[58] E. Brochu, V. M. Cora, N. De Freitas, A tutorial on Bayesian optimiza-
tion of expensive cost functions, with application to active user modeling
and hierarchical reinforcement learning, arXiv:1012.2599 (2010).
[59] J. Bergstra, D. Yamins, D. D. Cox, Making a science of model search:
Hyperparameter optimization in hundreds of dimensions for vision ar-
chitectures, ICML (1) 28 (2013) 115–123.
[60] HPOlib, HPOlib: A hyperparameter optimization library, https://
github.com/automl/HPOlib, since 2014.
[61] T. Chen, C. Guestrin, XGBoost: A Scalable Tree Boosting System
(2016). arXiv:1603.02754.
[62] GPy, GPy: A Gaussian process framework in Python, http://github.
com/SheffieldML/GPy, since 2012.
[63] Wikipedia: Test functions for optimization, https://en.wikipedia.
org/wiki/Test_functions_for_optimization, 2016. Accessed: 2016-
04-02.
[64] GdR Mascot-Num benchmark, http://www.gdr-mascotnum.fr/
benchmarks.html, 2016. Accessed: 2016-04-02.