1. The document describes analyzing a heart failure dataset using logistic regression and linear discriminant analysis (LDA) models to predict patient survival.
2. Both optimized models produced similar results, with the optimized logistic regression performing slightly better, at 83% accuracy versus 80% for LDA. However, the recall for predicting deceased patients was only 74% for both models.
3. It is recommended to obtain additional medical features and more data samples, particularly for deceased patients, which may help improve the models' ability to correctly predict outcomes and better assist treatment decisions.
Predict Backorder on supply chain data for an Organization — Piyush Srivastava
Performed data cleaning, identified the important variables, and built the best model using different classification techniques (Random Forest, Naïve Bayes, Decision Tree, KNN, Neural Network, Support Vector Machine) to predict backorders for an organization using the best modelling approach.
This report includes information about:
1. Pre-Processing Variables
a. Treating Missing Values
b. Treating correlated variables
2. Selection of Variables using random forest weights
3. Building a model to predict donors and the amount expected to be donated.
Introduction to Optimization with Genetic Algorithm (GA) — Ahmed Gad
Selecting the optimal parameters for machine learning tasks is challenging. Some results may be poor not because the data is noisy or the learning algorithm is weak, but because of a bad choice of parameter values. This article gives a brief introduction to evolutionary algorithms (EAs) and describes the genetic algorithm (GA), one of the simplest random-based EAs.
References:
Eiben, Agoston E., and James E. Smith. Introduction to Evolutionary Computing. Vol. 53. Heidelberg: Springer, 2003.
https://www.linkedin.com/pulse/introduction-optimization-genetic-algorithm-ahmed-gad
https://www.kdnuggets.com/2018/03/introduction-optimization-with-genetic-algorithm.html
Data Science - Part XIV - Genetic Algorithms — Derek Kane
This lecture provides an overview on biological evolution and genetic algorithms in a machine learning context. We will start off by going through a broad overview of the biological evolutionary process and then explore how genetic algorithms can be developed that mimic these processes. We will dive into the types of problems that can be solved with genetic algorithms and then we will conclude with a series of practical examples in R which highlights the techniques: The Knapsack Problem, Feature Selection and OLS regression, and constrained optimizations.
Data Science - Part III - EDA & Model Selection — Derek Kane
This lecture introduces the concept of EDA, understanding, and working with data for machine learning and predictive analysis. The lecture is designed for anyone who wants to understand how to work with data and does not get into the mathematics. We will discuss how to utilize summary statistics, diagnostic plots, data transformations, variable selection techniques including principal component analysis, and finally get into the concept of model selection.
Predicting Likely Donors and Donation Amounts — Michele Vincent
In this presentation, we first predict likely donors using classification models. Then, we predict how much likely donors will give using regression models. Finally, we validate the predictive models by measuring how effective they are.
Predictive Analytics, Predicting Likely Donors and Donation Amounts — Michele Vincent
In this presentation, we first predict likely donors using classification models. We also predict how much likely donors will give using regression models. Then, we validate the predictive models by measuring how effective they are.
The genetic algorithm is flexible enough to be applicable to a wide range of problems, such as placing N queens on an N-by-N chessboard so that no two queens can attack each other, known as the n-Queens problem (a toy sketch follows below).
A lack of information about the details of the problem leaves the genetic algorithm searching the problem's state space blindly.
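As a purely illustrative aside (this code is not taken from either of the referenced articles), a compact genetic algorithm for the n-Queens problem mentioned above might look like the following in Python; the population size, mutation rate, and truncation selection are arbitrary choices made only for this sketch:

    import random

    def fitness(perm):
        # Non-attacking pairs: a permutation already rules out row/column clashes,
        # so only diagonal attacks need to be counted.
        n = len(perm)
        clashes = sum(abs(perm[i] - perm[j]) == j - i
                      for i in range(n) for j in range(i + 1, n))
        return n * (n - 1) // 2 - clashes

    def crossover(a, b):
        # One-point crossover that keeps the child a valid permutation
        cut = random.randrange(len(a))
        return a[:cut] + [g for g in b if g not in a[:cut]]

    def mutate(perm, rate=0.2):
        # Occasionally swap two positions
        if random.random() < rate:
            i, j = random.sample(range(len(perm)), 2)
            perm[i], perm[j] = perm[j], perm[i]
        return perm

    def solve_n_queens(n=8, pop_size=100, generations=500):
        pop = [random.sample(range(n), n) for _ in range(pop_size)]
        target = n * (n - 1) // 2                 # all pairs non-attacking
        for _ in range(generations):
            pop.sort(key=fitness, reverse=True)
            if fitness(pop[0]) == target:
                break
            parents = pop[:pop_size // 2]         # simple truncation selection
            pop = parents + [mutate(crossover(*random.sample(parents, 2)))
                             for _ in range(pop_size - len(parents))]
        return max(pop, key=fitness)              # column of the queen in each row

    print(solve_n_queens(8))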
A Hybrid Immunological Search for the Weighted Feedback Vertex Set Problem — Mario Pavone
In this paper we present a hybrid immunological inspired algorithm (HYBRID-IA) for solving the Minimum Weighted Feedback Vertex Set (MWFVS) problem. MWFVS is one of the most interesting and challenging combinatorial optimization problems, with applications in many fields and many real-life tasks. The proposed algorithm is inspired by the clonal selection principle and therefore takes advantage of the main strengths of the operators of (i) cloning, (ii) hypermutation, and (iii) aging. Along with these operators, the algorithm uses a local search procedure, based on a deterministic approach, whose purpose is to refine the solutions found so far. In order to evaluate the efficiency and robustness of HYBRID-IA, several experiments were performed on different instances, and on each instance it was compared with three different algorithms: (1) a memetic algorithm based on a genetic algorithm (MA); (2) a tabu search metaheuristic (XTS); and (3) an iterative tabu search (ITS). The results show the efficiency and reliability of HYBRID-IA on all instances in terms of the best solutions found, with performance comparable to the compared algorithms, which represent the current state of the art for the MWFVS problem.
SVM Vs Naive Bays Algorithm (Jupyter Notebook) — Ravi Nakulan
SVM and Naive Bayes are two very popular base models. Each has its strengths and challenges when it comes to choosing the best fit for a problem. Here we got very high accuracy because it is not real-world data, but when the competition is close it is really hard to decide which one to use. I chose SVM because it drew a better boundary between cancer and non-cancer, and it is one of the popular models in the medical industry. SVM does not care about the independent variables or their interdependency. However, longer rows may slow the run time, and training can go from a day to a week or more.
Naive Bayes, on the other hand, is a good algorithm for developing a base model; it does not care about the number of rows and is very quick to run for getting initial insights.
ON FEATURE SELECTION ALGORITHMS AND FEATURE SELECTION STABILITY MEASURES: A C... — ijcsit
Data mining is indispensable for business organizations for extracting useful information from the huge volume of stored data, which can be used in managerial decision making to survive the competition. Due to day-to-day advancements in information and communication technology, the data collected from e-commerce and e-governance are mostly high dimensional. Data mining prefers small datasets to high-dimensional datasets, and feature selection is an important dimensionality reduction technique. The subsets selected in subsequent iterations of feature selection should be the same or similar even under small perturbations of the dataset; this property is called selection stability, and it has recently become an important topic for the research community. Selection stability has been quantified by various measures. This paper analyses the selection of a suitable search method and stability measure for feature selection algorithms, and also the influence of the characteristics of the dataset, as the choice of the best approach is highly problem dependent.
THE EFFECT OF SEGREGATION IN NONREPEATED PRISONER'S DILEMMA — ijcsit
This article consolidates the idea that non-random pairing can promote the evolution of cooperation in a non-repeated version of the prisoner's dilemma. The idea is taken from [1], which presents experiments using stochastic simulation. In the following it is shown how the results from [1] are reproducible by numerical analysis. It is also demonstrated that some unexplained findings in [1] are due to the methods used.
Extending A Trial’s Design: Case Studies Of Dealing With Study Design Issues — nQuery
About the webinar
As trials increase in complexity and scope, there is a requirement for trial designs to reflect this.
From dealing with non-proportional hazards in survival analysis to dealing with cluster randomization, we examine how to deal with study design issues of complex trials.
In this free webinar, you will learn about:
Dealing with study design issues
Practical worked examples of
Non-proportional Hazards
Cluster Randomization
Three Armed Trials
Non-proportional Hazards
Non-proportional hazards and complex survival curves have become of increasing interest, due to being commonly seen in immunotherapy development. This has led to interest in assessing the robustness of standard methods and alternative methods that better adapt to deviations.
In this webinar, we look at methods proposed for complex survival curves and the weighted log-rank test as a candidate model to deal with a delayed survival effect.
Cluster Randomization
Cluster-randomized designs are often adopted when there is a high risk of contamination if cluster members were randomized individually. Stepped-wedge designs are useful in cases where it is difficult to apply a particular treatment to half of the clusters at the same time.
In this webinar, we introduce cluster randomization and stepped-wedge designs to provide an insight into the requirements of more complex randomization schedules.
Three Armed Trials
Non-inferiority testing is a common hypothesis test in the development of generic medicines and medical devices. The most common design compares the proposed non-inferior treatment to the standard treatment alone, but this leaves it uncertain whether the treatment effect is the same as in previous studies. This “assay sensitivity” problem can be resolved by using a three-arm trial which includes a placebo alongside the new and reference treatments for direct comparison.
In this webinar we show a complete testing approach to this gold standard design and how to find the appropriate allocation and sample size for this study.
Duration - 60 minutes
Speaker: Ronan Fitzpatrick, Head of Statistics, Statsols
Predicting breast cancer — Adrián Vallés
Performed and compared predictive modelling approaches (classification tree, logistic regression and random forest) to predict benign vs malignant breast cancers using R for the Data mining class (BANA 4080)
A short introduction to sample size estimation, presented at a research methodology workshop at Dr. BVP RMC, Pravara Institute of Medical Sciences (DU), Loni, by Dr. Mandar Baviskar.
Explore the latest techniques and technologies used in classifying fetal health, from traditional methods to cutting-edge AI approaches. Understand the importance of accurate classification for prenatal care and fetal well-being. Join us to delve into this critical aspect of healthcare. visit https://bostoninstituteofanalytics.org/data-science-and-artificial-intelligence/ for more data science insights
Network analysis of cancer metabolism: A novel route to precision medicine — Varshit Dusad
Masters project presentation for MRes Systems and Synthetic Biology 2017-18 Imperial College London.
Study of cancer metabolism using constraint-based modeling and graph theory.
Sample size Calculation:
Objectives:
Calculate sample size according to particular type of research, and purpose.
Identify and select various software to calculate sample size according to particular type of research, and purpose.
Why to calculate sample size?
To show that under certain conditions, the hypothesis test has a good chance of showing a desired difference (if it exists)
To show to the IRB committee and funding agency that the study has a reasonable chance to obtain a conclusive result
To show that the necessary resources (human, monetary, time) will be minimized and well utilized.
Most Important: sample size calculation is an educated guess
It is more appropriate for studies involving hypothesis testing
There is no magic involved; only statistical and mathematical logic and some algebra
Researchers need to know something about what they are measuring and how it varies in the population of interest.
SAMPLE SIZE:
How many subjects are needed to assure a given probability of detecting a statistically significant effect of a given magnitude if one truly exists?
POWER:
If a limited pool of subjects is available, what is the likelihood of finding a statistically significant effect of a given magnitude if one truly exists?
Before We Can Determine Sample Size We Need To Answer The Following:
1. What is the primary objective of the study?
2. What is the main outcome measure?
Is it a continuous or dichotomous outcome?
3. How will the data be analyzed to detect a group difference?
4. How small a difference is clinically important to detect?
5. How much variability is in our target population?
6. What are the desired significance level (α) and power?
7. What is the anticipated drop out and non-response % ?
Where do we get this knowledge?
Previous published studies
Pilot studies
If information is lacking, there is no good way to calculate the sample size.
Type I error: Rejecting H0 when H0 is true
α: the type I error rate.
Type II error: Failing to reject H0 when H0 is false
β: the type II error rate.
Power (1 − β): probability of detecting a group difference given the size of the effect (Δ) and the sample size of the trial (N).
Estimation of Sample Size by Three ways:
By using
(1) Formulae (manual calculations)
(2) Sample size tables or Nomogram
(3) Software.
SAMPLE SIZE FOR ADEQUATE PRECISION:
In a descriptive study,
Summary statistics (mean, proportion)
Reliability (or) precision
By giving “confidence interval”
The wider the C.I., the less reliable the sample statistic; it may not give an accurate estimate of the true value of the population parameter.
Sample size calculation for cross sectional studies/surveys:
Cross-sectional studies or surveys are done to estimate a population parameter, such as the prevalence of a disease in a community, or to find the average value of some quantitative variable in a population.
The sample size formulas for a qualitative variable and a quantitative variable are different, as written out below.
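For reference, the standard formulas behind that statement (the usual ones for estimating a single proportion or a single mean to an absolute precision d; these are textbook formulas, not taken from the workshop slides themselves) are:

    \[
    n \;=\; \frac{Z_{1-\alpha/2}^{2}\; p\,(1-p)}{d^{2}}
    \qquad \text{(qualitative variable, e.g. an expected prevalence } p\text{)}
    \]
    \[
    n \;=\; \frac{Z_{1-\alpha/2}^{2}\; \sigma^{2}}{d^{2}}
    \qquad \text{(quantitative variable with standard deviation } \sigma\text{)}
    \]

where Z_{1-α/2} is the standard normal value for the chosen confidence level (1.96 for 95%) and d is the desired absolute precision; the result is then inflated to allow for the anticipated non-response rate.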
This is based on hypothetical scenarios and illustrative calculations to showcase the various steps of project management. It does not cover all 10 knowledge areas of the PMBOK; rather, it is limited to our hypothetical project requirements.
If the project is big, you may be required to include all the knowledge areas.
Count of Candies - Happy Halloween Day — Ravi Nakulan
Here we have a fairly small sample containing counts of candies from 2008-2018 for the city of Hamilton. It is interesting to see whether we can identify any independent variable that may impact the count of candies. Remember, Halloween is celebrated on October 31st, so it may fall on a weekday or a weekend.
I decided to look at "Weather" conditions to identify the impact of rain, cold, or severe wind. Wind seems to have more impact, because we can protect ourselves from rain and cold but not from the wind.
Hollywood movies are analyzed through visuals and graphs for the period 1991 to 2020. However, the analysis does not cover all movies produced in this period; it is limited to only 500 movies. I am also learning Tableau, and this is my first visualization assignment ever. It may help people who enjoy playing with colors and some creativity.
This is one of the fun activities I did purely for visualization practice. Finding a dataset on the web that has not already been analyzed and visualized is really hard and cumbersome, so I came up with the idea of downloading my own raw data from the Strava app and playing around with visualization tools. There could have been more visual slides, but I had limits on the number of charts and slides for my project submission. I hope this gives others some reference.
I had never used Power BI before (I started just 2-3 days ago).
The National Science & Technology Entrepreneurship Development Board (NSTEDB) has turned out to be a catalyst for individuals who want to learn best practices in entrepreneurship and take their first, and most important, step toward creating business value.
A risk plan is always helpful for an organization to address issues that have the potential to put the project, the organization, and its people at risk. We often neglect problems and try to avoid them, but addressing risks proactively helps a project succeed.
A communication plan with team huddles helps the team understand and communicate effectively. It is one of the best practices for keeping the team updated and conveying the message directly, without it being distorted by another person or medium.
Civil liability insurance: A Report on Eco-Cities — Ravi Nakulan
A report by Dr. Ioan CIUMANSU on eco-cities, prepared for the international master's students from UVSQ for their research work on "Railway Station and the City". The report describes what sustainability and eco-innovation mean to the project from different angles of innovation. The three proposed themes were: Multi-Innovation, Climate Change, and Technology.
Adjusting primitives for graph: SHORT REPORT / NOTES — Subhajit Sahu
Graph algorithms such as PageRank commonly use Compressed Sparse Row (CSR), an adjacency-list based graph representation.
Multiply with different modes (map)
1. Performance of sequential execution based vs OpenMP based vector multiply.
2. Comparing various launch configs for CUDA based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential execution based vs OpenMP based vector element sum.
2. Performance of memcpy vs in-place based CUDA based vector element sum.
3. Comparing various launch configs for CUDA based vector element sum (memcpy).
4. Comparing various launch configs for CUDA based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA based vector element sum (in-place).
Opendatabay - Open Data Marketplace.pptx — Opendatabay
Opendatabay.com unlocks the power of data for everyone. Open Data Marketplace fosters a collaborative hub for data enthusiasts to explore, share, and contribute to a vast collection of datasets.
First ever open hub for data enthusiasts to collaborate and innovate. A platform to explore, share, and contribute to a vast collection of datasets. Through robust quality control and innovative technologies like blockchain verification, opendatabay ensures the authenticity and reliability of datasets, empowering users to make data-driven decisions with confidence. Leverage cutting-edge AI technologies to enhance the data exploration, analysis, and discovery experience.
From intelligent search and recommendations to automated data productisation and quotation, Opendatabay's AI-driven features streamline the data workflow. Finding the data you need shouldn't be complex: Opendatabay simplifies the data acquisition process with an intuitive interface and robust search tools. Effortlessly explore, discover, and access the data you need, allowing you to focus on extracting valuable insights. Opendatabay also breaks new ground with dedicated, AI-generated synthetic datasets.
Leverage these privacy-preserving datasets for training and testing AI models without compromising sensitive information. Opendatabay prioritizes transparency by providing detailed metadata, provenance information, and usage guidelines for each dataset, ensuring users have a comprehensive understanding of the data they're working with. By leveraging a powerful combination of distributed ledger technology and rigorous third-party audits Opendatabay ensures the authenticity and reliability of every dataset. Security is at the core of Opendatabay. Marketplace implements stringent security measures, including encryption, access controls, and regular vulnerability assessments, to safeguard your data and protect your privacy.
Empowering the Data Analytics Ecosystem: A Laser Focus on Value
The data analytics ecosystem thrives when every component functions at its peak, unlocking the true potential of data. Here's a laser focus on key areas for an empowered ecosystem:
1. Democratize Access, Not Data:
Granular Access Controls: Provide users with self-service tools tailored to their specific needs, preventing data overload and misuse.
Data Catalogs: Implement robust data catalogs for easy discovery and understanding of available data sources.
2. Foster Collaboration with Clear Roles:
Data Mesh Architecture: Break down data silos by creating a distributed data ownership model with clear ownership and responsibilities.
Collaborative Workspaces: Utilize interactive platforms where data scientists, analysts, and domain experts can work seamlessly together.
3. Leverage Advanced Analytics Strategically:
AI-powered Automation: Automate repetitive tasks like data cleaning and feature engineering, freeing up data talent for higher-level analysis.
Right-Tool Selection: Strategically choose the most effective advanced analytics techniques (e.g., AI, ML) based on specific business problems.
4. Prioritize Data Quality with Automation:
Automated Data Validation: Implement automated data quality checks to identify and rectify errors at the source, minimizing downstream issues.
Data Lineage Tracking: Track the flow of data throughout the ecosystem, ensuring transparency and facilitating root cause analysis for errors.
5. Cultivate a Data-Driven Mindset:
Metrics-Driven Performance Management: Align KPIs and performance metrics with data-driven insights to ensure actionable decision making.
Data Storytelling Workshops: Equip stakeholders with the skills to translate complex data findings into compelling narratives that drive action.
Benefits of a Precise Ecosystem:
Sharpened Focus: Precise access and clear roles ensure everyone works with the most relevant data, maximizing efficiency.
Actionable Insights: Strategic analytics and automated quality checks lead to more reliable and actionable data insights.
Continuous Improvement: Data-driven performance management fosters a culture of learning and continuous improvement.
Sustainable Growth: Empowered by data, organizations can make informed decisions to drive sustainable growth and innovation.
By focusing on these precise actions, organizations can create an empowered data analytics ecosystem that delivers real value by driving data-driven decisions and maximizing the return on their data investment.
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2... — pchutichetpong
M Capital Group (“MCG”) expects demand to keep evolving alongside supply, driven by institutional investment rotating out of offices and into work-from-home (“WFH”) assets, while the need for data storage keeps expanding as global internet usage grows, with experts predicting 5.3 billion users by 2023. These market factors will be underpinned by technological changes, such as maturing cloud services and edge sites, allowing the industry to see strong expected annual growth of 13% over the next 4 years.
Whilst competitive headwinds remain, represented through the recent second bankruptcy filing of Sungard, which blames “COVID-19 and other macroeconomic trends including delayed customer spending decisions, insourcing and reductions in IT spending, energy inflation and reduction in demand for certain services”, the industry has seen key adjustments, where MCG believes that engineering cost management and technological innovation will be paramount to success.
MCG reports that the more favorable market conditions expected over the next few years, helped by the winding down of pandemic restrictions and a hybrid working environment will be driving market momentum forward. The continuous injection of capital by alternative investment firms, as well as the growing infrastructural investment from cloud service providers and social media companies, whose revenues are expected to grow over 3.6x larger by value in 2026, will likely help propel center provision and innovation. These factors paint a promising picture for the industry players that offset rising input costs and adapt to new technologies.
According to M Capital Group: “Specifically, the long-term cost-saving opportunities available from the rise of remote managing will likely aid value growth for the industry. Through margin optimization and further availability of capital for reinvestment, strong players will maintain their competitive foothold, while weaker players exit the market to balance supply and demand.”
1. Professor: ---____-----
Durham College
Student Name : Ravi Nakulan
ID Number: ------------
Due Date: ------------
Tool Used: Python (Jupyter Notebook)
Assignment No. # 3 – Discriminant Analysis (heartfailure.csv)
Ravi Nakulan 1
2. Ravi Nakulan 2
Top Rows and Features
Bottom Rows and Features
1. The dataset is the heart failure dataset; the goal is to predict whether a patient survived (alive) or deceased during the course of treatment.
2. The dataset has numeric and categorical features (a mixed dataset).
3. It is also an imbalanced dataset, as the output DEATH_EVENT has fewer samples in the deceased class than in the alive class.
4. 0 (zero) represents the alive class and 1 (one) the deceased class.
5. We need to normalize the numerical values because platelets has a large magnitude, as do creatinine_phosphokinase and serum_creatinine. These features also have different units.
6. We will build a base model and then create a data pipeline to optimize it, so that we can compare and pick the best algorithm with the help of a confusion matrix (a pipeline sketch follows this list).
7. We will also recommend other types of algorithms at the end if our models do not give a satisfactory result.
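A minimal sketch of the kind of scaling-plus-classifier pipeline described in point 6 could look like the following (scikit-learn is assumed, consistent with the Python/Jupyter setup stated on the title slide; the exact pipeline used in the assignment is not shown on the slides):

    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    # Bundle scaling with each classifier so both base models see identically scaled features
    logreg_pipe = Pipeline([("scale", StandardScaler()),
                            ("clf", LogisticRegression(max_iter=1000))])
    lda_pipe = Pipeline([("scale", StandardScaler()),
                         ("clf", LinearDiscriminantAnalysis())])

Wrapping the scaler and the estimator together also keeps any later hyperparameter search honest, because the scaling is re-fit inside each cross-validation fold.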
3. Ravi Nakulan 3
B. Analysis Statement: Logistic Regression vs Linear Discriminant Analysis (LDA)
Problem Statement:
We are using the dataset 'heartfailure.csv' for Mr. John Hughes to develop an LDA model and evaluate its efficiency against a Logistic Regression model for better decision making.
Analysis Statement:
Both models (Logistic Regression and LDA) are multivariate statistical methods that are used to evaluate the association between various covariates and a categorical outcome, and both are popular in the medical sciences.
Logistic regression is a classification algorithm traditionally considered limited to two-class classification problems, and it does a good job of identifying the classes.
LDA, on the other hand, is preferred when we have more than two non-ordinal response classes. LDA models the distribution of the predictors X separately within each response class and applies Bayes' theorem to estimate the probability of each response category given the value of X (the formula is spelled out after this slide).
There are a few reasons to perform Linear Discriminant Analysis (LDA):
1. When the classes of the response variable (y) are well separated, the coefficient estimates of the Logistic Regression model are quite unstable.
2. For small datasets LDA is considered more stable than Logistic Regression, and if the distribution of the independent variables is approximately normal within each class, LDA is again the more stable choice.
Bottom line: it is best practice to predict the probability of the outcome of interest with alternative algorithms so the results can be compared and the findings validated.
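For reference, the Bayes'-theorem view of LDA mentioned above can be written out explicitly. This is the standard textbook formulation (Gaussian class densities with a shared covariance matrix), not something taken from the assignment itself:

    \[
    \Pr(Y = k \mid X = x) \;=\; \frac{\pi_k\, f_k(x)}{\sum_{l=1}^{K} \pi_l\, f_l(x)},
    \qquad
    f_k(x) \;=\; \frac{1}{(2\pi)^{p/2}\,\lvert\Sigma\rvert^{1/2}}
    \exp\!\Big(-\tfrac{1}{2}(x-\mu_k)^{\top}\Sigma^{-1}(x-\mu_k)\Big),
    \]

where π_k is the prior probability of class k, μ_k is its mean vector, and Σ is the covariance matrix assumed common to all classes; an observation is assigned to the class with the largest posterior probability.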
4. Ravi Nakulan 4
C. Insights from the Pandas Profiling Report
The Pandas Profiling report is a quick way to perform exploratory data analysis in the form of a generated report from Python code (a generation sketch follows this slide).
1. We found no missing values and no duplicate rows.
2. Out of 13 variables there are 7 numerical and 6 categorical values (meaning it is a mixed dataset).
3. The correlation graph shows a positive correlation for variables such as 'age' and 'serum_creatinine', a negative correlation for 'time', 'ejection_fraction' and 'serum_sodium', while 'sex' and 'smoking' show no correlation.
4. Most importantly, the dataset has some large numeric values (platelets) and some small numeric values (age, serum_creatinine, creatinine_phosphokinase, etc.), so we need to perform feature scaling to compute on the numerical values efficiently and reach the optimum.
5. Most of the variables are skewed, and we need to perform standardization (mean 0 and standard deviation 1) to bring them onto a common, roughly Gaussian scale.
6. The categorical data is imbalanced (class counts are not equal), including the output variable (DEATH_EVENT), where alive (0) has 203 samples and deceased (1) has 96.
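A minimal sketch of how such a profiling report could be generated (assuming the ydata-profiling package, formerly pandas-profiling, and the heartfailure.csv file named on the title slide):

    import pandas as pd
    from ydata_profiling import ProfileReport  # formerly the pandas_profiling package

    df = pd.read_csv("heartfailure.csv")
    profile = ProfileReport(df, title="Heart Failure Profiling Report")
    profile.to_file("heartfailure_profile.html")   # open the HTML report in a browser

    # Quick check of the class imbalance noted in point 6 (0 = alive, 1 = deceased)
    print(df["DEATH_EVENT"].value_counts())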
5. Ravi Nakulan 5
D. Classification Report of Base LDA model
• Since the output class (DEATH_EVENT) is imbalanced, we used the SMOTE technique, which creates new synthetic instances based on the neighborhood of the minority (deceased) class (a code sketch follows this slide).
• After applying SMOTE we had 162 samples for each class; with the train-test (80-20) split we had 60 test samples.
• The accuracy of the model is 80%, which is good.
• The precision is 87% for the alive class (34/39 patients) and 67% for the deceased class (14/21 patients), indicating a big difference in prediction between the two classes. The model is good at identifying true positives for alive patients among all patients predicted alive (87%). However, only 67% of the patients predicted deceased are truly deceased: we got 7 wrong predictions out of 21 patients predicted deceased, versus 5 wrong out of 39 patients predicted alive.
• Recall measures how well the model identifies the relevant cases. Recall is slightly better overall: 83% for the alive class (34/41 patients) and 74% for the deceased class (14/19 patients). Recall, also known as sensitivity, is better than precision for the deceased class, but it still means we correctly identify only 74% of the patients who actually die. If a patient is genuinely at risk, we fail to flag them for treatment about a quarter of the time, and recall is the metric that matters most for getting the right treatment to patients threatened by the disease.
• The f1-score trades off precision and recall. We got 85% for the alive class and 70% for the deceased class. The f1-score is good overall but not satisfactory for the deceased class (70%) compared to the alive class (85%). It is now time to optimize the base model and hope for a better result.
• We will now perform optimization for LDA.
BASE LDA Model
Base LDA Model (With SMOTE)
Confusion Matrix of Base LDA Model
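A minimal sketch of the workflow described above (SMOTE from the imbalanced-learn package, a standardized LDA model, and the confusion matrix and classification report). The random seed is not stated on the slides and is an assumption here; SMOTE is applied to the training fold only, which is the usual practice and is consistent with the imbalanced test counts (41 alive / 19 deceased) reported on the later slides:

    import pandas as pd
    from imblearn.over_sampling import SMOTE
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import classification_report, confusion_matrix

    df = pd.read_csv("heartfailure.csv")
    X, y = df.drop(columns="DEATH_EVENT"), df["DEATH_EVENT"]

    # Hold out a test set, then oversample only the training data with SMOTE
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)   # assumed seed
    X_train, y_train = SMOTE(random_state=42).fit_resample(X_train, y_train)

    # Standardize because the features have very different magnitudes and units
    scaler = StandardScaler().fit(X_train)
    lda = LinearDiscriminantAnalysis().fit(scaler.transform(X_train), y_train)

    y_pred = lda.predict(scaler.transform(X_test))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred, target_names=["alive", "deceased"]))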
6. Ravi Nakulan 6
• The accuracy of the optimized model is 80%, the same as the base LDA model (an assumed tuning sketch follows this slide).
• There has been no change in the precision scores after optimization. The precision is 87% for the alive class (34/39 patients) and 67% for the deceased class (14/21 patients), again indicating a big difference in prediction between the two classes. The model remains good at identifying true positives for alive patients among all patients predicted alive (87%), but only 67% of the patients predicted deceased are truly deceased, which is poor. The alive-class precision is only 1% below the Logistic Regression figure (88%). We got 7 wrong predictions out of 21 patients predicted deceased, versus 5 wrong out of 39 patients predicted alive.
• The recall scores are also the same as the base LDA model after optimization. Recall measures how well the model identifies the relevant cases, and overall it is slightly better than precision: 83% for the alive class (34/41 patients) and 74% for the deceased class (14/19 patients). Recall, also known as sensitivity, is better than precision for the deceased class and exactly matches the Logistic Regression result there. It still means we are only 74% successful at providing treatment by identifying the patients who actually die; if a patient is truly at risk, the model is not effective enough at flagging them in time, and recall is the metric that matters most for getting the right treatment to patients threatened by the disease.
• Since precision and recall are unchanged from the base LDA model, the f1-score is unchanged as well: after trading off precision and recall we get 85% for the alive class and 70% for the deceased class. The f1-score is good overall but still not satisfactory for the deceased class (70%) compared to the alive class (85%).
• We will now move to the LDA confusion matrix and compare it with Logistic Regression.
Optimized LDA Model
Optimized LDA Model (With SMOTE)
D. Classification Report of Optimized LDA model
Confusion Matrix of Optimized LDA Model
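The slides do not show which hyperparameters were searched for the "optimized" LDA model, so the grid below is only a plausible sketch (solver and shrinkage are the main tunable settings of scikit-learn's LinearDiscriminantAnalysis); recall is used as the scoring target because the discussion above centers on it. It reuses lda_pipe and the SMOTE-resampled training data from the earlier sketches:

    from sklearn.model_selection import GridSearchCV

    # Shrinkage is only valid for the lsqr/eigen solvers, hence the two separate grid entries
    param_grid = [
        {"clf__solver": ["svd"]},
        {"clf__solver": ["lsqr", "eigen"], "clf__shrinkage": [None, "auto", 0.1, 0.5]},
    ]
    lda_search = GridSearchCV(lda_pipe, param_grid, scoring="recall", cv=5)
    lda_search.fit(X_train, y_train)
    print(lda_search.best_params_, lda_search.best_score_)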
7. Base LDA Model (With SMOTE)
Optimized LDA Model (With SMOTE)
D. Comparison between Base & Optimized model
• The base and optimized models produced the same scores on all metrics (precision, recall, f1-score and overall accuracy).
• Type-I errors (false positives): the model predicted 7 times that a patient was deceased when in reality they survived.
• Type-II errors (false negatives): our models predicted 5 times that a patient would not decease, but it did happen.
• The trade-off between type-I and type-II errors leads us to look for alternatives with lower (ideally near-zero) error to replace the optimized LDA model.
• The false positive and false negative distributions overlap each other.
• The ROC is a probability curve: if the AUC (Area Under the Curve) is 1, the model is able to distinguish the two classes perfectly.
• A higher AUC means better separation of the classes from each other.
• We got an AUC of 78%, meaning our model has a 78% chance of correctly ranking a deceased patient above an alive one. We should focus on improving this score through hyperparameter tuning or by trying a different algorithm (a plotting sketch follows this slide).
• The diagonal broken line is a random model with an AUC of 0.5, meaning no capacity to discriminate between the alive and deceased classes.
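A minimal sketch of how the ROC curve and AUC mentioned above could be produced with scikit-learn, continuing from the fitted LDA sketch earlier (the names scaler, lda, X_test and y_test are assumed to still be in scope from that sketch):

    import matplotlib.pyplot as plt
    from sklearn.metrics import roc_auc_score, RocCurveDisplay

    # Score of belonging to the deceased class (label 1)
    scores = lda.predict_proba(scaler.transform(X_test))[:, 1]
    print("AUC:", roc_auc_score(y_test, scores))

    RocCurveDisplay.from_predictions(y_test, scores)  # ROC curve; the chance line corresponds to AUC = 0.5
    plt.show()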
8. Ravi Nakulan 8
E. Comparison: Optimized LDA model Vs Optimized Logistic Regression
• The overall accuracy of the optimized Logistic Regression model is 83%, which is good (an assumed tuning sketch follows this slide).
• Precision is the ratio of true positives to all predicted positives. The precision is 88% for the alive class (36/41 patients) and 74% for the deceased class (14/19 patients), which is good but still short of medical-industry standards. In other words, 88% of the patients predicted alive were correctly identified as alive, while 74% of the patients predicted deceased were truly deceased. The prediction is therefore not the best, because we would like to raise it in both classes. We got 5 wrong predictions out of 19 patients predicted deceased, and again 5 wrong, but out of 41 patients predicted alive.
• Recall tells us how well the true positives are identified. The recall produced exactly the same figures as precision: 88% for the alive class (36/41 patients) and 74% for the deceased class (14/19 patients). Recall measures how many of the truly deceased patients we catch, so it again raises the concern that we are only 74% successful at providing treatment by identifying deceased patients. If a patient is genuinely at risk of dying, we are not effective at flagging them for treatment, because the recall is only 74%. Recall is highly important for getting the right treatment to a patient who is threatened by the disease.
• The f1-score is a trade-off between precision and recall; it is the harmonic mean of the two, and since our precision and recall scores are identical, the f1-score is identical as well (88% for the alive class and 74% for the deceased class). In some medical cases precision and recall are equally important, and then the f1-score is the best single number for judging the model's overall performance.
• We will now compare our optimized Logistic Regression model with the optimized LDA model.
Optimized Logistic Regression
• The optimized Logistic Regression with SMOTE is shown here so that it can be compared with the optimized LDA model.
Optimized Model (With SMOTE)
Confusion Matrix of Logistic Regression
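As with LDA, the slides do not state which hyperparameters were tuned for the optimized Logistic Regression, so the grid below (the regularization strength C with an L2 penalty) is only an assumed sketch, reusing logreg_pipe and the SMOTE-resampled training data from the earlier sketches:

    from sklearn.model_selection import GridSearchCV

    param_grid = {"clf__C": [0.01, 0.1, 1, 10, 100], "clf__penalty": ["l2"]}
    logreg_search = GridSearchCV(logreg_pipe, param_grid, scoring="recall", cv=5)
    logreg_search.fit(X_train, y_train)
    print(logreg_search.best_params_, logreg_search.best_score_)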
9. E. Comparison: Optimized LDA model Vs Optimized Logistic Regression
• The accuracy of the optimized Logistic Regression model is 83%, while the LDA model reaches only 80%. The accuracy produced by Logistic Regression is therefore 3 percentage points better, but we need to look at the other metrics as well.
• Precision is the ratio of true positives to all predicted positives. The precision of Logistic Regression for the alive class is 88%, slightly better than the 87% of LDA, though this difference is marginal. For the deceased class, the precision of Logistic Regression is 74%, while the LDA model reaches only 67%, which is not satisfactory. Logistic Regression has therefore done a clearly better job than LDA on the deceased class; for precision, Logistic Regression is the winner.
• Recall tells us how well the true positives are identified. For the alive class the recall is 88% (36/41 patients) for Logistic Regression and 83% for the LDA model, so Logistic Regression has done the better job of identifying true cases, raising the chance to 88% for the alive class.
• In the deceased class, the Logistic Regression recall matches its precision at 74%, and the LDA model also reaches 74%. Both models therefore share the same result (74%), which may be a concern for Mr. John Hughes, because in the medical field it is important to protect the patient from the adversity of the disease, and recall becomes the priority in such cases: recall is imperative for providing the right treatment to a patient who is threatened by the disease.
• Moving to the f1-score, which trades off precision and recall, the optimized Logistic Regression clearly does the better job, with 88% in the alive class and 74% in the deceased class, compared with the optimized LDA at 85% for the alive class and 70% for the deceased class. The f1-score, the harmonic mean of precision and recall, indicates that Logistic Regression predicts better overall.
• The differences between the two models are not large, but the precision for the deceased class differs considerably between them. Therefore, the optimized Logistic Regression is a better choice than the optimized LDA model.
Comparison of Optimized LDA & Optimized Logistic Regression models
Optimized Logistic Regression Classification Report
Optimized LDA Classification Report - Key insights are in the next slide -
with SMOTE
10. E. Comparison: Key Insights
• Key Insights
1. Since the optimized Logistic Regression produced better predictions than the optimized LDA model on every metric (precision, recall, f1-score and accuracy) in both classes (alive and deceased), it is better to stick with the Logistic Regression model. Only the recall for the deceased class is identical between the two optimized models.
2. Looking at the recall for the deceased class (74%) in both optimized models, neither of them improved it. Recall is one of the important parameters for identifying true-positive cases, i.e., patients flagged as deceased who really do die. Both of our models are only 74% correct in such cases, so a patient who is genuinely at risk of dying may not be identified in time to receive the appropriate medical treatment.
3. We can improve the model by identifying high multicollinearity between the independent variables (a VIF sketch follows this slide). Multicollinearity may not affect the accuracy of the model significantly, but there is a chance that we lose the ability to reliably interpret the effect of each individual feature in our model.
4. We can remove variables that have little or no correlation with the output (dependent variable) to avoid inflating the variance and to make the model faster to compute and predict.
5. The Logistic Regression model turned out to be a good model to compare against LDA; however, LDA would have been more interesting if we had a problem with more than two outcome classes.
Comparison of Optimized LDA & Optimized Logistic Regression models
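A minimal sketch of the multicollinearity check suggested in point 3, using variance inflation factors from statsmodels (the dataset is re-read here so the snippet stands on its own; the threshold in the comment is a common rule of thumb, not something stated in the slides):

    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools.tools import add_constant

    df = pd.read_csv("heartfailure.csv")
    X_num = add_constant(df.drop(columns="DEATH_EVENT"))

    vif = pd.Series(
        [variance_inflation_factor(X_num.values, i) for i in range(X_num.shape[1])],
        index=X_num.columns)
    print(vif.drop("const").sort_values(ascending=False))  # values above roughly 5-10 suggest multicollinearity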
11. Ravi Nakulan 11
• A few recommendations for Mr. John Hughes:
1. Both models (Logistic Regression and LDA) are multivariate statistical methods that can determine the association between several covariates and a categorical outcome (deceased or alive). Since both models produced nearly the same results (only small differences in the outcomes), it is recommended to obtain more independent features (x) for the dataset; additional features will certainly help determine their effect on the outcome. Notably, we had to use SMOTE, which highlights that we did not have enough samples of the deceased class and therefore relied on a sampling technique; the limited number of records is another issue with the model. Realistically we will always have fewer records for the deceased class, but if we could narrow the gap between the two classes, we could use other sampling techniques (such as plain oversampling) to evaluate both the optimized Logistic Regression and the optimized LDA model.
2. Since LDA does not give a better result than Logistic Regression, we can also try QDA (Quadratic Discriminant Analysis). QDA serves as a compromise between the non-parametric kNN method and the linear LDA and Logistic Regression approaches. As the name suggests, QDA assumes a quadratic decision boundary, so it can accurately model a wider range of problems than the linear methods can, because it allows each class its own covariance matrix. It is also useful and effective when the number of training observations is limited, which is exactly our situation for the deceased class, while still making some assumptions about the form of the decision boundary (a sketch follows at the end of this section).
3. Instead of linear algorithms we could use non-linear algorithms, because we found that the two classes overlap and we need better separation. We can look at Decision Trees, which can be made cost-sensitive and can therefore be effective on our imbalanced dataset. A Decision Tree works through a hierarchy of if/else questions from the root node down to the pure leaf nodes. We can use entropy (0 to 1) or Gini impurity (0 to 0.5) to calculate the information gain (IG), as both are measures of node impurity.
F. Recommendations for Mr. John Hughes
Reference from Class notes in Week 5: Professor Sam Plati
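A minimal sketch of the two alternatives named in recommendations 2 and 3 (QDA and a cost-sensitive decision tree), evaluated with cross-validated recall on the SMOTE-resampled training data from the earlier sketches; the tree settings are illustrative placeholders, not values used in the assignment:

    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    candidates = {
        "QDA": Pipeline([("scale", StandardScaler()),
                         ("clf", QuadraticDiscriminantAnalysis())]),
        "Decision tree": DecisionTreeClassifier(criterion="entropy",
                                                class_weight="balanced",
                                                max_depth=4, random_state=42),
    }
    for name, model in candidates.items():
        recall = cross_val_score(model, X_train, y_train, cv=5, scoring="recall")
        print(f"{name}: mean recall = {recall.mean():.2f}")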
12. Thank You
Ravi Nakulan
12
All the references are taken from our class lecture notes provided by the Professor of Statistical Prediction Modeling.