Machine Learning, Key to Your Formulation Challenges
Marc Borowczak, PRC Consulting LLC (http://www.prcconsulting.net)
February 17, 2016
Formulation Challenges are Everywhere…
Step 1: Retrieve Existing Data
Step 2: Normalize Data
Step 3: Train and Test a Model
Step 4: Evaluate Model Performance
Step 5: Improving Model Performance with 5 Hidden Neurons
Step 6: Improving Model using Random Forest Algorithm
Step 7: Testing Further with Resampling
Step 8: Actual Display of a Random Forest Tree Solution
Conclusions
References
Formulation Challenges are Everywhere…
You develop pharmaceutical, cosmetic, food, industrial or civil engineered products, and you are often confronted with the challenge of blending and formulating to meet process or performance targets. Traditional Research and Development does approach the problem with experimentation, but it generally involves designed studies under time and resource constraints, and can be slow, expensive and often redundant; results are quickly forgotten or become obsolete.
Consider the alternative that Machine Learning tools offer today. We will show that this approach is quick and efficient, that it is ultimately how the Front End of Innovation should proceed, and that it is particularly well suited to formulation and classification.
Today, we will explain how Machine Learning can shed new light on this generic and very persistent formulation challenge. We will discuss the other important aspect, classification and clustering, often associated with these formulation challenges in a forthcoming communication.
Step 1: Retrieve Existing Data
To illustrate the approach, we selected a formulation dataset hosted on the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html): predicting how compressive strength (http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength) depends on the formulation ingredients. This is the well-known formulation composition-property relationship that scientists, engineers and business professionals must address daily, and any established R&D organization certainly holds similar, and sometimes hidden, knowledge in its archives…
We will use R to demonstrate the approach quickly on this dataset (http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls), and also to demonstrate how reproducibility of the analysis is enforced. The analysis tool and platform are documented, all libraries are clearly listed, and the data is retrieved programmatically and date-stamped from the repository.
# Load the libraries used throughout the analysis
library(xlsx)
library(stringr)
library(caret)
library(neuralnet)
library(devtools)
library(rpart)
library(rpart.plot)

# Download the dataset into a local ./data folder and date-stamp the retrieval
userdir <- getwd()
datadir <- "./data"
if (!file.exists("data")) {dir.create("data")}
fileUrl <- "http://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/Concrete_Data.xls?accessType=DOWNLOAD"
download.file(fileUrl, destfile = "./data/Concrete_Data.xls", method = "curl")
dateDownloaded <- date()

# Read the spreadsheet and inspect its structure
concrete <- read.xlsx("./data/Concrete_Data.xls", sheetName = "Sheet1")
str(concrete)
## 'data.frame': 1030 obs. of 9 variables:
## $ Cement..component.1..kg.in.a.m.3.mixture.            : num 540 540 332 332 199 ...
## $ Blast.Furnace.Slag..component.2..kg.in.a.m.3.mixture.: num 0 0 142 142 132 ...
## $ Fly.Ash..component.3..kg.in.a.m.3.mixture.            : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Water...component.4..kg.in.a.m.3.mixture.             : num 162 162 228 228 192 228 228 228 228 228 ...
## $ Superplasticizer..component.5..kg.in.a.m.3.mixture.   : num 2.5 2.5 0 0 0 0 0 0 0 0 ...
## $ Coarse.Aggregate...component.6..kg.in.a.m.3.mixture.  : num 1040 1055 932 932 978 ...
## $ Fine.Aggregate..component.7..kg.in.a.m.3.mixture.     : num 676 676 594 594 826 ...
## $ Age..day.                                             : num 28 28 270 365 360 90 365 28 28 28 ...
## $ Concrete.compressive.strength.MPa..megapascals..      : num 80 61.9 40.3 41.1 44.3 ...
Step 2: Normalize Data
The dataset information reveals 1030 observations of 9 variables: 8 inputs, of which 7 are ingredients and 1 is a process attribute (Age), plus 1 output, the strength property. There are no missing values in this set. We'll simply truncate the variable names and normalize the data, then display the normalized strength.
normalize <- function(x) {return((x - min(x)) / (max(x) - min(x)))}
# Shorten column names to their first word (e.g. "Cement", "Water") and rename the output
names(concrete) <- gsub(".", " ", names(concrete), fixed = TRUE)
names(concrete) <- word(names(concrete), 1)
names(concrete)[9] <- "Strength"
# Normalize every column; the summary call (assumed here) produces the output shown next
concrete_norm <- as.data.frame(lapply(concrete, normalize))
summary(concrete_norm$Strength)
## Min.   1st Qu. Median  Mean    3rd Qu. Max.
## 0.0000 0.2663  0.4000  0.4172  0.5457  1.0000
These transformations are typical of a generic formulation problem, where the ingredients and process variables are the independent (input) variables and the property is the dependent (output) variable.
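Because the model will be trained on min-max scaled data, its predictions come back on the [0, 1] scale. The small helper below is only a sketch of how to map such a value back to megapascals; the unnormalize name is hypothetical and it assumes the raw concrete$Strength column is still available.
# Hypothetical helper: invert the min-max scaling using the range of the raw Strength column
unnormalize <- function(x, orig) {return(x * (max(orig) - min(orig)) + min(orig))}
# Example: a normalized prediction of 0.40 expressed in MPa
unnormalize(0.40, concrete$Strength)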
Step 3: Train and Test a Model
The method we’ll follow now is a standard approach where we randomly split the data set in a train and test set.
The caret (https://cran.r-project.org/web/packages/caret/caret.pdf) package implements this task well. We’ll use
75% of the data to train and the remainder to test the model. To make the analysis reproducible, we’ll use the
set.seed() function.
set.seed(12121)
# Hold out 25% of the observations for testing
inTrain <- createDataPartition(y = concrete_norm$Strength, p = 0.75, list = FALSE)
concrete_train <- concrete_norm[inTrain, ]
concrete_test <- concrete_norm[-inTrain, ]
# Fit a neural network on the training set (default topology: one hidden node)
concrete_model <- neuralnet(formula = Strength ~ Cement + Blast + Fly + Water + Superplasticizer + Coarse + Fine + Age, data = concrete_train)
The network topology is then easily visualized; see the excellent NeuralNetTool page (https://beckmw.wordpress.com/tag/neural-network/) for details. Suffice it to say that even this simple first attempt highlights the main dependencies: the larger impacts are drawn with thicker links in the typical neural net representation. Here the I's represent inputs, O is the output, and H and B are hidden and bias nodes as defined in the theory. Note that a single bias node is added for each input and hidden layer to accommodate input features equal to 0.
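The original figure is shown without the call that produced it. Assuming the built-in plot method of the neuralnet package, a minimal way to reproduce a comparable view would be:
# Sketch only: neuralnet provides a plot method for fitted networks;
# rep = "best" draws the repetition with the lowest error
plot(concrete_model, rep = "best")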
Step 4: Evaluate Model Performance
We will now compute predictions, compare them to the actual strength, and examine the correlation between predicted and actual strength values.
model_results <- compute(concrete_model, concrete_test[1:8])
predicted_strength <- model_results$net.result
cor(predicted_strength, concrete_test$Strength)[1,1]
## [1] 0.8336352308
The default neural net exhibits a correlation of 0.8336352. We can certainly try to improve it by including a few hidden neurons…
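Correlation is one view of agreement; if an error measure on the normalized scale is also wanted, a root-mean-square error is easy to compute by hand. The rmse helper below is a sketch and is not part of the original analysis.
# Hypothetical helper: RMSE between predicted and observed normalized strength
rmse <- function(pred, obs) {sqrt(mean((pred - obs)^2))}
rmse(predicted_strength, concrete_test$Strength)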
Step 5: Improving Model Performance with 5 Hidden Neurons
concrete_model2 <- neuralnet(formula = Strength ~ Cement + Blast + Fly + Water + Superplasticizer + Coarse + Fine + Age, data = concrete_train, hidden = 5)
plot.nnet(concrete_model2, cex.val = 0.75)
Machine Learning, Key to Your Formulation Challenges file:///P:/MachineLearningExamples/Machine_Learning_Formulation.html
5 of 11 2/18/2016 5:07 PM
We observe that Age remains a key contributor, but the effects of Water, Superplasticizer, Cement and Blast are also visibly ranked. Four of the five hidden nodes contribute about evenly…
model_results2 <- compute(concrete_model2, concrete_test[1:8])
predicted_strength2 <- model_results2$net.result
cor(predicted_strength2, concrete_test$Strength)[1,1]
## [1] 0.9138705528
p <- plot(concrete_test$Strength,predicted_strength2)
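A 45-degree reference line makes agreement easier to judge on this scatter plot; the line is an assumed addition and does not appear in the original figure.
# Points on the dashed y = x line would be perfect predictions
abline(a = 0, b = 1, lty = 2)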
Step 6: Improving Model using Random Forest Algorithm
We will now rely on the Random Forest algorithm and attempt a model improvement.
model_result3 <- train(Strength ~ ., data = concrete_train,method='rf',prox=TRUE)
model_result3
## Random Forest
##
## 774 samples
## 8 predictor
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 774, 774, 774, 774, 774, 774, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared RMSE SD Rsquared SD
## 2 0.07833903729 0.8774837294 0.004515965257 0.01607292866
## 5 0.07159729020 0.8869237640 0.004510365293 0.01395064039
## 8 0.07365915696 0.8777421774 0.005197103383 0.01691711063
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
predicted_strength3 <- predict(model_result3,concrete_test)
cor(predicted_strength3, concrete_test$Strength)
## [1] 0.943384811
The default Random Forest algorithm helped improve our prediction and exhibits a correlation of 0.9433848. Again, we can certainly try to improve by introducing resampling… The caret package offers multiple methods to try out; we'll try just one to give an idea…
Step 7: Testing Further with Resampling
model_result4 <- train(Strength ~ ., method = 'rf', data = concrete_train, verbose = FALSE, trControl = trainControl(method = "cv"))
model_result4
## Random Forest
##
## 774 samples
## 8 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 696, 696, 695, 697, 698, 696, ...
## Resampling results across tuning parameters:
##
## mtry RMSE Rsquared RMSE SD Rsquared SD
## 2 0.06910599 0.9068129 0.01101706 0.04013275
## 5 0.06227778 0.9121046 0.01232558 0.03945448
## 8 0.06287753 0.9089120 0.01208088 0.03835530
##
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 5.
predicted_strength4 <- predict(model_result4,concrete_test)
cor(predicted_strength4, concrete_test$Strength)
## [1] 0.9428245
We observe that the prediction is practically unchanged in this case, with a correlation of 0.9428245. Still not bad for a quick analysis performed on existing data. Regardless of our property target, we have already derived key areas to investigate more deeply, and can clearly see some key ingredients (Cement, Blast, Fly, Superplasticizer, Water…) and the Age process variable as determining factors for strength performance. So naturally, one may want to display this model.
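One quick way to back up that ranking of ingredients is caret's variable importance summary for the fitted model. The call below is a sketch; its output for this particular fit is not shown in the original write-up.
# Variable importance for the cross-validated random forest
varImp(model_result4)
# plot(varImp(model_result4))  # optional dotplot of the same ranking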
Step 8: Actual Display of a Random Forest Tree Solution
It turns out that so-called black-box models are – well – meant to stay in their box! However, the rpart (https://cran.r-project.org/web/packages/rpart/rpart.pdf) and rpart.plot (https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf) packages make it easy to visualize even complex trees.
strength.tree <- rpart(Strength ~ ., data = concrete_train, control = rpart.control(minsplit = 20, cp = 0.002))
prp(strength.tree, compress = TRUE)
In the tree, the normalized strength is shown in the oval leaves, ranked from low to high across the branches from left to right.
Conclusions
We hope this typical example demonstrates that Machine Learning algorithms are well positioned to help resolve formulation challenges, offering a fast, efficient and economical alternative to tedious experimentation. It is easy to imagine how similar questions can be resolved in all types of R&D, in materials, cosmetics, food or any scientific area.
Rubber formulations to minimize rolling resistance and emissions, modern composites for renewable energy sources, lightweight transportation vehicles and next-generation public transit, as well as innovative UV-shield ointments and tasty snacks and drinks… all present similar challenges where only the nature of the inputs and outputs varies. Therefore, the method can and should be applied broadly!
Next time, we'll review how to address another common challenge: classification and clustering. Until then, we hope this approach has triggered your interest.
Why not try and implement Machine Learning in your scientific or technical area of expertise? Remember, PRC Consulting, LLC (http://www.prcconsulting.net) is dedicated to boosting innovation through improved analytics, one customer at a time!
References
The following sources are referenced as they provided significant help and information to develop this Machine Learning analysis applied to formulations:
1. UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets.html)
2. caret (https://cran.r-project.org/web/packages/caret/caret.pdf)
3. NeuralNetTool (https://beckmw.wordpress.com/tag/neural-network/)
4. rpart (https://cran.r-project.org/web/packages/rpart/rpart.pdf)
5. rpart.plot (https://cran.r-project.org/web/packages/rpart.plot/rpart.plot.pdf)
6. RStudio (https://www.rstudio.com)