This document presents code for modeling mortality risk from patient data. It:
1) Loads and preprocesses a training dataset, including imputing missing values;
2) Fits two logistic regression models to predict mortality from age and BMI: a simple model with linear terms and a more complex model with quadratic terms;
3) Visualizes the models by plotting predictions against the data and applying classification thresholds;
4) Calculates the area under the receiver operating characteristic curve (AUROC) on the training data to evaluate model performance;
5) Compares training, test, and cross-validated AUROC across models of increasing polynomial degree to illustrate overfitting and the bias-variance trade-off.
The training AUROC is about 0.779 for both the simple and the complex model, indicating modest predictive ability.
GoLightly - a customisable virtual machine written in Go (Eleanor McHugh)
A brief overview of the Go programming language and how it might be used to build a simple customisable virtual machine. This is a reduced and updated version of my previous Go virtual machine talks with many code examples.
An introduction to functional programming with Go (Eleanor McHugh)
A crash course in functional programming concepts using Go. Heavy on code, light on theory.
You can find the examples at https://github.com/feyeleanor/intro_to_fp_in_go
Computer graphics lab report with code in C++ (Alamgir Hossain)
This is the lab report for a computer graphics course in the C++ language; the course is intended for computer science and engineering students.
Problem list:
1. Program for Bresenham line drawing.
2. Program for Digital Differential Analyzer (DDA) line drawing.
3. Program for midpoint circle drawing.
4. Program for midpoint ellipse drawing.
5. Program for translating an object.
6. Program for rotating an object.
7. Program for scaling an object.
All programs are coded in C++.
A short list of the most useful R commands
reference: http://www.personality-project.org/r/r.commands.html
Prepared for anyone who is interested in the R language or has just started learning it.
gptips1.0/concrete.mat
Concrete_Data:[1030x9 double array]
tr_ind:[1030x1 uint8 (logical) array]
te_ind:[1030x1 uint8 (logical) array]
tr_ind2:[773x1 uint8 (logical) array]
val_ind:[773x1 uint8 (logical) array]
gptips1.0/crossover.m
function [son,daughter]=crossover(mum,dad,gp)
%CROSSOVER GPTIPS function to crossover 2 GP expressions to produce 2 new
%GP expressions.
%
% [SON,DAUGHTER]=CROSSOVER(MUM,DAD,GP) uses standard subtree
% crossover on the expressions MUM and DAD to produce the offspring
% expressions SON and DAUGHTER.
%
% (c) Dominic Searson 2009
%
% v1.0
%
% See also MUTATE
% select random crossover nodes in mum and dad expressions
m_position=picknode(mum,0,gp);
d_position=picknode(dad,0,gp);
% extract main and subtree expressions
[m_main,m_sub]=extract(m_position,mum);
[d_main,d_sub]=extract(d_position,dad);
%combine to form 2 new GPtrees
daughter=strrep(m_main,'$',d_sub);
son=strrep(d_main,'$',m_sub);
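In the code above, extract() returns each parent as a "main" expression containing a '$' placeholder where a subtree was removed, and strrep() grafts the other parent's subtree into that hole. A minimal sketch of the same placeholder mechanics, written in R (the language of the modeling section later in this document) with made-up expression strings:
# Hypothetical parents after extract(): '$' marks where a subtree was cut out
m_main = "plus($,x3)";  m_sub = "times(x1,x2)"
d_main = "minus(x4,$)"; d_sub = "log(x5)"
# Swap the subtrees into the opposite parent's hole (MATLAB strrep ~ R sub with fixed = TRUE)
daughter = sub("$", d_sub, m_main, fixed = TRUE)  # "plus(log(x5),x3)"
son      = sub("$", m_sub, d_main, fixed = TRUE)  # "minus(x4,times(x1,x2))"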
gptips1.0/demo2data.mat
gptips1.0/displaystats.m
function displaystats(gp)
%DISPLAYSTATS GPTIPS function to display run stats periodically.
%
% DISPLAYSTATS(GP) updates the screen with run stats at the interval
% specified in GP.RUN.VERBOSE
%
% (c) Dominic Searson 2009
%
% v1.0
%
% See also: UPDATESTATS
%only display info if required
if ~gp.runcontrol.verbose || gp.runcontrol.quiet || mod(gp.state.count-1,gp.runcontrol.verbose)
return
end
gen=gp.state.count-1;
disp(['Generation ' num2str(gen)]);
disp(['Best fitness: ' num2str(gp.results.best.fitness)]);
disp(['Mean fitness: ' num2str(gp.state.meanfitness)]);
disp(['Best nodecount: ' num2str(gp.results.best.numnodes)]);
disp(' ');
gptips1.0/evalfitness.m
function [gp]=evalfitness(gp)
%EVALFITNESS GPTIPS function to call the user specified fitness function.
%
% [GP]=EVALFITNESS(GP) evaluates the fitnesses of individuals stored
% in the GP structure and updates various other fields of GP accordingly.
%
% (c) Dominic Searson 2009
%
% v1.0
%
% See also TREE2EVALSTR
% Loop through population and calculate fitnesses
for i=1:gp.runcontrol.pop_size
% update state to reflect the index of the individual that is about to
% be evaluated
gp.state.current_individual=i;
%First preprocess the cell array of string expressions into a form that
%Matlab can evaluate
evalstr=tree2evalstr(gp.pop{i},gp);
%store number of nodes (sum total for all genes)
gp.fitness.numnodes(i,1)=getnumnodes(gp.pop{i});
% Evaluate gp individual using fitness function
% (the try catch is to assign a poor fitness value
% to trees that violate Matlab's
% daft 'Nesting of {, [, and ( cannot exceed a depth of 32.' error.
try
[fitness,gp]=feval(gp.fitness.fitfun,evalstr,gp);
gp.fitness.values(i)=fitness;
catch
if ~strncmpi(lasterr,'Nesting of {',12);
error(lasterr);
...
error 2.pdf (SALU18)
In [ ]: %matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import sys
Error Definitions
The following is an example of the concepts of absolute error, relative error, and decimal precision:
We shall test the approximation to the common mathematical constant e. Compute the absolute and relative errors along with the decimal precision if we take the approximate value e = 2.718.
In [ ]: # We can use the formulas derived above to calculate the actual numbers
absolute_error = np.abs(np.exp(1) - 2.718)
relative_error = absolute_error/np.exp(1)
print("The absolute error is " + str(absolute_error))
print("The relative error is " + str(relative_error))
Machine epsilon is a very important concept in floating point error. The value, even though minuscule, can easily compound over time and cause huge problems.
Below we see a problem demonstrating how easily machine error can creep into a simple piece of code:
In [ ]: a = 4.0/3.0
b = a - 1.0
c = 3*b
eps = 1 - c
print('Value of a is ' + str(a))
print('Value of b is ' + str(b))
print('Value of c is ' + str(c))
print('Value of epsilon is ' + str(eps))
Ideally eps should be 0, but instead we see the machine epsilon; while the value is small, it can lead to issues.
In [ ]: print("The progression of error:")
for i in range(1,20):
    print(str(abs((10**i)*c - (10**i))))
The largest floating point number
The formula for obtaining the number is (2 - eps) * 2**1023; instead of calculating the value by hand, we can use the system library to find it.
In [ ]: maximum = (2.0-eps)*2.0**1023
print(sys.float_info.max)
print('Value of maximum is ' + str(maximum))
The smallest floating point number
The formula for obtaining the smallest (subnormal) number is eps * 2**(-1022). The system library gives the smallest normal number, sys.float_info.min; multiplying it by the machine epsilon gives the smallest subnormal number.
In [ ]: minimum = eps*2.0**(-1022)
print(sys.float_info.min)
print(sys.float_info.min*sys.float_info.epsilon)
print('Value of minimum is ' + str(minimum))
As we try to compute a number bigger than the aforementioned largest floating point number, the value overflows: the computer assigns it infinity.
In [ ]: overflow = maximum*10.0
print('Value of overflow is ' + str(overflow))
As we try to compute a number smaller than the aforementioned smallest floating point number, the computer assigns it the value 0. We actually lose precision in this case.
In [1]: underflow = minimum/2.0
print('Value of underflow is ' + str(underflow))
Truncation error is a very common form of error you will keep seeing in the area of Numerical Analysis/Computing.
Here we will look at the classic calculus example of the approximation near 0. We c ...
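The notebook is cut off at this point, so its specific example is not recoverable. As a hedged illustration of truncation error (assuming the familiar one-term Taylor truncation sin(x) ≈ x near 0), a short sketch in R:
# Illustrative only: the truncation error of the 1-term Taylor approximation
# sin(x) ~ x shrinks like |x|^3 / 6 as x -> 0
x = 10^-(1:6)
truncation_error = abs(sin(x) - x)
print(cbind(x, truncation_error, bound = x^3 / 6))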
C programs using recursion and iteration (problem list):
1. Find the factorial of a number using recursion as well as iteration.
2. Calculate the power of a number using recursion and iteration.
3. Count the digits of a number using recursion and iteration.
4. Find the sum of the first n natural numbers using recursion.
5. Print the sum of the digits of a given number using recursion.
6. Find the nth term of the Fibonacci series using recursion.
7. Find the GCD (greatest common divisor) of two numbers using recursion.
8. Find the first upper-case letter in a given string using recursion.
9. Calculate the length of a string using recursion.
10. Count the number of divisors of a given number using recursion.
11. Check whether a given number is prime or composite using recursion.
12. Display the integers 100 through 1 using recursion and iteration.
13. Convert a decimal number to binary using recursion.
Diagrams: recursion stack of factorial of 3; recursion stack of the 4th term of Fibonacci (a small recursion-versus-iteration sketch follows below).
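The lab programs themselves are in C (not reproduced here). Purely as an illustration of the recursion-versus-iteration contrast running through the list above, the factorial problem in R (the language of the modeling section below) looks like:
factorial_recursive = function(n) if (n <= 1) 1 else n * factorial_recursive(n - 1)
factorial_iterative = function(n) { result = 1; for (i in seq_len(n)) result = result * i; result }
factorial_recursive(3)  # 6; the call stack unwinds f(3) -> f(2) -> f(1), as in the diagram above
factorial_iterative(3)  # 6; a single loop, no call-stack growth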
Presented at Data Day Texas 2020, this talk attempts to show the tradeoffs between bigger data, better math, and better data. It uses Fashion MNIST as the use case, with a progression of better math from random forests to gradient-boosted trees to feedforward neural nets to convolutional neural nets.
Oh, and Cthulhu.
ggTimeSeries: ggplot2 extensions
This R package offers novel time series visualisations. It is based on ggplot2 and offers geoms and pre-packaged functions for easily creating any of the offered charts.
The package can be installed from GitHub by installing the devtools package and then running devtools::install_github('Ather-Energy/ggTimeSeries').
reference: https://github.com/Ather-Energy/ggTimeSeries
These notes review fitting GLMs to aggregate data. Binomial, Poisson, and negative binomial models are shown, along with a few others. I also cover how to implement Moran eigenvector filtering in a GLM. All data are mortality rates for the state of Texas from CDC WONDER.
MH Prediction Modeling and Validation - clean (Min-hyung Kim)
Overfitting
The Bias-Variance Trade-Off
Sequestered (unseen) test dataset
K-fold cross-validation (within the training dataset)
Performance measures
regression: mean squared error (MSE)
classification: sensitivity, specificity, AUROC, etc.
Visualize the overfitting and the bias-variance trade-off versus the model complexity
r for data science 2. grammar of graphics (ggplot2) clean - ref (Min-hyung Kim)
REFERENCES
#1. RStudio Official Documentations (Help & Cheat Sheet)
Free Webpage) https://www.rstudio.com/resources/cheatsheets/
#2. Wickham, H. and Grolemund, G., 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly.
Free Webpage) https://r4ds.had.co.nz/
Cf) Tidyverse syntax (www.tidyverse.org), rather than R Base syntax
Cf) Hadley Wickham: Chief Scientist at RStudio. Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University
r for data science 4. exploratory data analysis clean - rev - ref (Min-hyung Kim)
REFERENCES
#1. RStudio Official Documentations (Help & Cheat Sheet)
Free Webpage) https://www.rstudio.com/resources/cheatsheets/
#2. Wickham, H. and Grolemund, G., 2016. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. O'Reilly.
Free Webpage) https://r4ds.had.co.nz/
Cf) Tidyverse syntax (www.tidyverse.org), rather than R Base syntax
Cf) Hadley Wickham: Chief Scientist at RStudio. Adjunct Professor of Statistics at the University of Auckland, Stanford University, and Rice University
Levelwise PageRank with Loop-Based Dead End Handling Strategy: SHORT REPORT ... (Subhajit Sahu)
Abstract — Levelwise PageRank is an alternative method of PageRank computation which decomposes the input graph into a directed acyclic block-graph of strongly connected components and processes them in topological order, one level at a time. This enables ranks to be calculated in a distributed fashion without per-iteration communication, unlike the standard method where all vertices are processed in each iteration. It however comes with a precondition: the absence of dead ends in the input graph. Here, the native non-distributed performance of Levelwise PageRank was compared against Monolithic PageRank on a CPU as well as a GPU. To ensure a fair comparison, Monolithic PageRank was also performed on a graph where vertices were split by components. Results indicate that Levelwise PageRank is about as fast as Monolithic PageRank on the CPU, but quite a bit slower on the GPU. The slowdown on the GPU is likely caused by a large submission of small workloads, and is expected to be a non-issue when the computation is performed on massive graphs.
Adjusting primitives for graph: SHORT REPORT / NOTES (Subhajit Sahu)
Graph algorithms, like PageRank, ... Compressed Sparse Row (CSR) is an adjacency-list-based graph representation that is ...
Multiply with different modes (map)
1. Performance of sequential vs OpenMP-based vector multiply.
2. Comparing various launch configs for CUDA-based vector multiply.
Sum with different storage types (reduce)
1. Performance of vector element sum using float vs bfloat16 as the storage type.
Sum with different modes (reduce)
1. Performance of sequential vs OpenMP-based vector element sum.
2. Performance of memcpy-based vs in-place CUDA vector element sum.
3. Comparing various launch configs for CUDA-based vector element sum (memcpy).
4. Comparing various launch configs for CUDA-based vector element sum (in-place).
Sum with in-place strategies of CUDA mode (reduce)
1. Comparing various launch configs for CUDA-based vector element sum (in-place).
Chatty Kathy - UNC Bootcamp Final Project Presentation - Final Version - 5.23... (John Andrews)
SlideShare Description for "Chatty Kathy - UNC Bootcamp Final Project Presentation"
Title: Chatty Kathy: Enhancing Physical Activity Among Older Adults
Description:
Discover how Chatty Kathy, an innovative project developed at the UNC Bootcamp, aims to tackle the challenge of low physical activity among older adults. Our AI-driven solution uses peer interaction to boost and sustain exercise levels, significantly improving health outcomes. This presentation covers our problem statement, the rationale behind Chatty Kathy, synthetic data and persona creation, model performance metrics, a visual demonstration of the project, and potential future developments. Join us for an insightful Q&A session to explore the potential of this groundbreaking project.
Project Team: Jay Requarth, Jana Avery, John Andrews, Dr. Dick Davis II, Nee Buntoum, Nam Yeongjin & Mat Nicholas
Global Situational Awareness of A.I. and where it's headed (vikram sood)
You can see the future first in San Francisco.
Over the past year, the talk of the town has shifted from $10 billion compute clusters to $100 billion clusters to trillion-dollar clusters. Every six months another zero is added to the boardroom plans. Behind the scenes, there’s a fierce scramble to secure every power contract still available for the rest of the decade, every voltage transformer that can possibly be procured. American big business is gearing up to pour trillions of dollars into a long-unseen mobilization of American industrial might. By the end of the decade, American electricity production will have grown tens of percent; from the shale fields of Pennsylvania to the solar farms of Nevada, hundreds of millions of GPUs will hum.
The AGI race has begun. We are building machines that can think and reason. By 2025/26, these machines will outpace college graduates. By the end of the decade, they will be smarter than you or I; we will have superintelligence, in the true sense of the word. Along the way, national security forces not seen in half a century will be unleashed, and before long, The Project will be on. If we're lucky, we'll be in an all-out race with the CCP; if we're unlucky, an all-out war.
Everyone is now talking about AI, but few have the faintest glimmer of what is about to hit them. Nvidia analysts still think 2024 might be close to the peak. Mainstream pundits are stuck on the wilful blindness of “it’s just predicting the next word”. They see only hype and business-as-usual; at most they entertain another internet-scale technological change.
Before long, the world will wake up. But right now, there are perhaps a few hundred people, most of them in San Francisco and the AI labs, that have situational awareness. Through whatever peculiar forces of fate, I have found myself amongst them. A few years ago, these people were derided as crazy—but they trusted the trendlines, which allowed them to correctly predict the AI advances of the past few years. Whether these people are also right about the next few years remains to be seen. But these are very smart people—the smartest people I have ever met—and they are the ones building this technology. Perhaps they will be an odd footnote in history, or perhaps they will go down in history like Szilard and Oppenheimer and Teller. If they are seeing the future even close to correctly, we are in for a wild ride.
Let me tell you what we see.
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeWalaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today's world, where data privacy and compliance are a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) they are auto-generated from declarative data annotations; (2) they respect user-level consent and preferences; (3) they are context-aware, encoding a different set of transformations for different use cases; (4) they are portable: while the SQL logic is implemented in only one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
Adjusting OpenMP PageRank: SHORT REPORT / NOTES (Subhajit Sahu)
For massive graphs that fit in RAM, but not in GPU memory, it is possible to take advantage of a shared-memory system with multiple CPUs, each with multiple cores, to accelerate PageRank computation. If the NUMA architecture of the system is properly taken into account with good vertex partitioning, the speedup can be significant. To take steps in this direction, experiments were conducted to implement PageRank in OpenMP using two different approaches, uniform and hybrid. The uniform approach runs all primitives required for PageRank in OpenMP mode (with multiple threads). The hybrid approach, on the other hand, runs certain primitives (i.e., sumAt and multiply) in sequential mode.
MH prediction modeling and validation in r (2) classification 190709
1. Install required R software packages
for (packagename in c("tidyverse", "openxlsx")) {
  if (!require(packagename, character.only = TRUE)) {
    install.packages(packagename)
    require(packagename, character.only = TRUE)
  }
}
2. Imputation for the training dataset
#@ Load only the training data. Make sure the test data is not loaded before modeling is finished. ----
# If the test data is not sequestered, make a random split and save the pieces first. Then load only the training data, so that the test data remain unseen until modeling is finished. ----
dataset.train = readRDS(url("https://github.com/mkim0710/PH207x/blob/master/fhs.index100le10.rds?raw=true"))
# Here, we will use a single regression imputation for simplicity. However, multiple imputation is recommended (a sketch follows the output below).
imputation.model = glm(bmi1 ~ poly(age1, 2) + sex1, data = dataset.train)
dataset.train = dataset.train %>% mutate(
  bmi1.old = bmi1
  , bmi1.is_imputed = is.na(bmi1)
  # newdata = . makes predict() return one value per row; without it, predict() returns
  # fitted values only for the rows that glm() kept (those with observed bmi1)
  , bmi1 = ifelse(is.na(bmi1), predict(imputation.model, newdata = .), bmi1)
)
dataset.train %>% select(randid, death, age1, sex1, matches("bmi1")) %>% filter(bmi1.is_imputed)
# # A tibble: 2 x 7
# randid death age1 sex1 bmi1 bmi1.old bmi1.is_imputed
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
# 1 1600765 0 45 2 26.9 NA TRUE
# 2 6921140 1 64 1 26.3 NA TRUE
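The comment above recommends multiple imputation, but the original deck does not show it. A minimal sketch (an assumption, not the author's code) using the mice package on the un-imputed variable bmi1.old:
# Sketch only: multiple imputation with the mice package (not part of the original slides)
library(mice)
imp = mice(dataset.train %>% select(death, age1, sex1, bmi1.old), m = 5, seed = 1, printFlag = FALSE)
fit = with(imp, glm(death ~ age1 + bmi1.old, family = binomial))
summary(pool(fit))  # coefficients pooled across the 5 imputed datasets by Rubin's rules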
3. Visualize the training dataset with the labels
dataset.train %>% ggplot(aes(x = age1, y = bmi1, color = death)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_gradient(low = "#0000ff80", high = "#ff000080") +
theme_minimal()
5. Logistic model (1)
model1 = glm(death ~ poly(age1, 1, raw = T) + poly(bmi1, 1, raw = T), family = "binomial", data = dataset.train)
model1 %>% summary #----
# Call:
# glm(formula = death ~ poly(age1, 1, raw = T) + poly(bmi1, 1,
# raw = T), family = "binomial", data = dataset.train)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -1.8968 -0.7632 -0.4850 0.8773 2.5501
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) -8.51201 1.06490 -7.993 1.31e-15 ***
# poly(age1, 1, raw = T) 0.13326 0.01533 8.696 < 2e-16 ***
# poly(bmi1, 1, raw = T) 0.03576 0.02895 1.235 0.217
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 568.61 on 449 degrees of freedom
# Residual deviance: 465.63 on 447 degrees of freedom
# AIC: 471.63
#
# Number of Fisher Scoring iterations: 4
dataset.train = dataset.train %>%
mutate(death.model1.predict.prob = predict(model1, type = "response", newdata = .))
6. Visualize the logistic model (1) with the fitted probability
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.model1.predict = predict(model1, type = "response")) %>%
ggplot(aes(x = age1, y = bmi1, color = death.model1.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_gradient(low = "#0000ff80", high = "#ff000080") +
theme_minimal()
7. The estimated probability can be thresholded (dichotomized) for a binary classification, as sketched below.
https://github.com/kenhktsui/Visualizing-Logistic-Regression
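As a small sketch of this dichotomization (using the mean-outcome cutoff that the next slide also uses), the predicted classes and a 2x2 confusion table can be computed directly:
cutoff.value = dataset.train$death %>% mean
death.model1.class = as.integer(predict(model1, type = "response") > cutoff.value)
table(predicted = death.model1.class, observed = dataset.train$death)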
8. Visualize the logistic model (1) with a cutoff of mean
cutoff.value = dataset.train$death %>% mean
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.model1.predict = predict(model1, type = "response") > cutoff.value) %>%
ggplot(aes(x = age1, y = bmi1, color = death.model1.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_manual(values = c("#0000ff80", "#ff000080")) +
theme_minimal()
9. Visualize the logistic model (1) with a cutoff of 0.5
cutoff.value = 0.5
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.model1.predict = predict(model1, type = "response") > cutoff.value) %>%
ggplot(aes(x = age1, y = bmi1, color = death.model1.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_manual(values = c("#0000ff80", "#ff000080")) +
theme_minimal()
11. Logistic model (2)
model2 = glm(death ~ poly(age1, 2, raw = T) + poly(bmi1, 2, raw = T), family = "binomial", data = dataset.train)
model2 %>% summary #----
# Call:
# glm(formula = death ~ poly(age1, 2, raw = T) + poly(bmi1, 2,
# raw = T), family = "binomial", data = dataset.train)
#
# Deviance Residuals:
# Min 1Q Median 3Q Max
# -2.2106 -0.7227 -0.5243 0.8057 2.1919
#
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 3.806304 5.819063 0.654 0.5130
# poly(age1, 2, raw = T)1 -0.206718 0.198164 -1.043 0.2969
# poly(age1, 2, raw = T)2 0.003295 0.001924 1.713 0.0868 .
# poly(bmi1, 2, raw = T)1 -0.247094 0.251485 -0.983 0.3258
# poly(bmi1, 2, raw = T)2 0.005231 0.004555 1.148 0.2508
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# (Dispersion parameter for binomial family taken to be 1)
#
# Null deviance: 568.61 on 449 degrees of freedom
# Residual deviance: 461.12 on 445 degrees of freedom
# AIC: 471.12
#
# Number of Fisher Scoring iterations: 4
dataset.train = dataset.train %>%
mutate(death.model2.predict.prob = predict(model2, type = "response"))
12. Visualize the logistic model (2) with the fitted probability
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.model2.predict = predict(model2, type = "response")) %>%
ggplot(aes(x = age1, y = bmi1, color = death.model2.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_gradient(low = "#0000ff80", high = "#ff000080") +
theme_minimal()
13. Visualize the logistic model (2) with a cutoff of mean
cutoff.value = dataset.train$death %>% mean
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.model2.predict = predict(model2, type = "response") > cutoff.value) %>%
ggplot(aes(x = age1, y = bmi1, color = death.model2.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_manual(values = c("#0000ff80", "#ff000080")) +
theme_minimal()
14. Visualize the logistic model (2) with a cutoff of 0.5
cutoff.value = 0.5
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.model2.predict = predict(model2, type = "response") > cutoff.value) %>%
ggplot(aes(x = age1, y = bmi1, color = death.model2.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_manual(values = c("#0000ff80", "#ff000080")) +
theme_minimal()
16. Performance of a binary classification test: sensitivity, specificity, PPV, NPV, ... (a sketch follows below)
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
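These measures are not computed explicitly in the slides. A minimal sketch, reusing the thresholded predictions from the sketch on slide 7 (and assuming both classes occur among the predictions):
cm = table(predicted = death.model1.class, observed = dataset.train$death)
TP = cm["1", "1"]; TN = cm["0", "0"]; FP = cm["1", "0"]; FN = cm["0", "1"]
c(sensitivity = TP / (TP + FN)    # true positive rate
  , specificity = TN / (TN + FP)  # true negative rate
  , PPV = TP / (TP + FP)          # positive predictive value
  , NPV = TN / (TN + FN))         # negative predictive value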
17. Receiver operating characteristic (ROC)
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
• Let a continuous random variable X denote the probability estimated for the observation.
• Given a threshold parameter T,
• the observation is classified as "positive" if X > T, and "negative" otherwise;
• X follows a probability density f1(x) if the instance actually belongs to class "positive", and f0(x) otherwise.
• Therefore,
• the true positive rate is TPR(T) = \int_T^\infty f_1(x)\,dx,
• and the false positive rate is FPR(T) = \int_T^\infty f_0(x)\,dx.
• The ROC curve plots parametrically TPR(T) versus FPR(T).
18. Area under the ROC curve (AUROC)
https://en.wikipedia.org/wiki/Receiver_operating_characteristic
• The ROC curve plots parametrically TPR(T) versus FPR(T)
• Area under the ROC curve (AUROC)
• When using normalized units, the AUROC is equal to the probability that a classifier will
rank a randomly chosen positive instance higher than a randomly chosen negative one
(assuming 'positive' ranks higher than 'negative').
• TPR(T): T -> y(x)
• FPR(T): T -> x
• A large T corresponds to a lower value of x.
\mathrm{AUROC} = \int_{x=0}^{1} \mathrm{TPR}\left(\mathrm{FPR}^{-1}(x)\right) dx = \int_{\infty}^{-\infty} \mathrm{TPR}(T)\,\mathrm{FPR}'(T)\,dT = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} I(T' > T)\,f_1(T')\,f_0(T)\,dT'\,dT = P(X_1 > X_0),
where X1 is the estimated probability for a positive observation, X0 is the estimated probability for a negative observation, and X follows a probability density f1(x) if the instance actually belongs to class "positive", and f0(x) otherwise.
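This probabilistic interpretation gives a direct way to compute the AUROC without tracing the curve: average, over all positive-negative pairs, the indicator that the positive is ranked higher (counting ties as 1/2, i.e., the Mann-Whitney statistic). A minimal sketch in R:
auroc_by_pairs = function(actual, prob) {
  x1 = prob[actual == 1]  # estimated probabilities of the positive observations
  x0 = prob[actual == 0]  # estimated probabilities of the negative observations
  mean(outer(x1, x0, ">") + 0.5 * outer(x1, x0, "=="))  # P(X1 > X0) + P(X1 = X0)/2
}
auroc_by_pairs(dataset.train$death, predict(model1, type = "response"))
# should agree with the trainAUROC for i = 1 on slide 30 (~0.779)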
29. Fit multiple models using for-loop
#@ Fit multiple models using for-loop, and then save the models as an R list of objects. =====
model.list = list()
for (i in 1:10) {
  myformula = as.formula(paste0("death ~ poly(age1, ", i, ")", " + ", "poly(bmi1, ", i, ")"))
  model.list[[i]] = glm(myformula, data = dataset.train, family = "binomial")
}
30. Calculate the training AUROC & test AUROC for multiple models
# Make a table that shows the training AUROC and test AUROC for each model in the model.list. -----
df = data.frame(
  i = 1:length(model.list)
  , trainAUROC = model.list %>% map_dbl(function(model.object) {
    dataset.train %>% {
      function.vec_actual_prediction.threshold_roc(.$death, predict(model.object, type = "response", newdata = .))
    } %>% function.threshold_roc.auc
  })
  , testAUROC = model.list %>% map_dbl(function(model.object) {
    dataset.test %>% {
      function.vec_actual_prediction.threshold_roc(.$death, predict(model.object, type = "response", newdata = .))
    } %>% function.threshold_roc.auc
  })
)
df
# i trainAUROC testAUROC
# 1 1 0.7788554 0.7603255
# 2 2 0.7787656 0.7608442
# 3 3 0.7779349 0.7668525
# 4 4 0.7819762 0.7646480
# 5 5 0.7822680 0.7607577
# 6 6 0.7821109 0.7607577
# 7 7 0.7861296 0.7568242
# 8 8 0.7820884 0.7510752
# 9 9 0.7829640 0.7449156
# 10 10 0.8003413 0.7516156
#@ Remove the test dataset (before any additional modeling~!) -----
rm(dataset.test)
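The helper functions function.vec_actual_prediction.threshold_roc and function.threshold_roc.auc are the author's own and are not shown in these slides. A minimal sketch of what such helpers could look like (an assumption about their behavior, not the original code): sweep thresholds over the predicted probabilities to obtain (FPR, TPR) pairs, then integrate with the trapezoidal rule.
function.vec_actual_prediction.threshold_roc = function(vec_actual, vec_prediction) {
  # one threshold per distinct predicted probability, plus the two extremes
  thresholds = sort(unique(c(-Inf, vec_prediction, Inf)), decreasing = TRUE)
  data.frame(
    threshold = thresholds
    , FPR = sapply(thresholds, function(t) mean(vec_prediction[vec_actual == 0] > t))
    , TPR = sapply(thresholds, function(t) mean(vec_prediction[vec_actual == 1] > t))
  )
}
function.threshold_roc.auc = function(threshold_roc) {
  # trapezoidal rule over the (FPR, TPR) curve
  with(threshold_roc, sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2))
}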
31. Calculate the training AUROC & test AUROC for multiple models
df %>% gather(key, value, trainAUROC, testAUROC) %>% ggplot(aes(x = i, y = value, color = key)) + geom_point() + geom_line()
33. Visualize the training dataset with the labels
dataset.train %>% ggplot(aes(x = age1, y = bmi1, color = death)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_gradient(low = "#0000ff80", high = "#ff000080") +
theme_minimal()
34. Visualize the logistic model (10) with a cutoff of mean
i = 10
cutoff.value = dataset.train$death %>% mean
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.predict = predict(model.list[[i]], type = "response") > cutoff.value) %>%
ggplot(aes(x = age1, y = bmi1, color = death.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_manual(values = c("#0000ff80", "#ff000080")) + theme_minimal() +
labs(title = paste0("model.list[[", i, "]]", " with a cutoff of mean"))
35. Visualize the logistic model (9) with a cutoff of mean
i = 9
cutoff.value = dataset.train$death %>% mean
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.predict = predict(model.list[[i]], type = "response") > cutoff.value) %>%
ggplot(aes(x = age1, y = bmi1, color = death.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_manual(values = c("#0000ff80", "#ff000080")) + theme_minimal() +
labs(title = paste0("model.list[[", i, "]]", " with a cutoff of mean"))
36. Visualize the logistic model (8) with a cutoff of mean
i = 8
cutoff.value = dataset.train$death %>% mean
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.predict = predict(model.list[[i]], type = "response") > cutoff.value) %>%
ggplot(aes(x = age1, y = bmi1, color = death.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_manual(values = c("#0000ff80", "#ff000080")) + theme_minimal() +
labs(title = paste0("model.list[[", i, "]]", " with a cutoff of mean"))
37. Visualize the logistic model (5) with a cutoff of mean
i = 5
cutoff.value = dataset.train$death %>% mean
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.predict = predict(model.list[[i]], type = "response") > cutoff.value) %>%
ggplot(aes(x = age1, y = bmi1, color = death.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_manual(values = c("#0000ff80", "#ff000080")) + theme_minimal() +
labs(title = paste0("model.list[[", i, "]]", " with a cutoff of mean"))
38. Visualize the logistic model (3) with a cutoff of mean
i = 3
cutoff.value = dataset.train$death %>% mean
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.predict = predict(model.list[[i]], type = "response") > cutoff.value) %>%
ggplot(aes(x = age1, y = bmi1, color = death.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_manual(values = c("#0000ff80", "#ff000080")) + theme_minimal() +
labs(title = paste0("model.list[[", i, "]]", " with a cutoff of mean"))
39. Visualize the logistic model (2) with a cutoff of mean
i = 2
cutoff.value = dataset.train$death %>% mean
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.predict = predict(model.list[[i]], type = "response") > cutoff.value) %>%
ggplot(aes(x = age1, y = bmi1, color = death.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_manual(values = c("#0000ff80", "#ff000080")) + theme_minimal() +
labs(title = paste0("model.list[[", i, "]]", " with a cutoff of mean"))
40. Visualize the logistic model (1) with a cutoff of mean
i = 1
cutoff.value = dataset.train$death %>% mean
dataset.train %>% select(randid, death, age1, sex1, bmi1) %>%
mutate(death.predict = predict(model.list[[i]], type = "response") > cutoff.value) %>%
ggplot(aes(x = age1, y = bmi1, color = death.predict)) +
geom_point(data = dataset.train %>% filter(bmi1.is_imputed), color = "black") +
geom_point(size = 10, alpha = .2) + scale_color_manual(values = c("#0000ff80", "#ff000080")) + theme_minimal() +
labs(title = paste0("model.list[[", i, "]]", " with a cutoff of mean"))
46. K-fold "random" split of the training dataset
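The slides that create the fold.index column (slides 41-45) are not included above. A sketch of one common way to make the k-fold random split; k = 5 here is an assumption, as the slide that sets k is not shown:
#@ Sketch: assign each row to one of k folds at random (not the original slides' code) -----
k = 5
set.seed(1)  # hypothetical seed, for reproducibility of the split
dataset.train = dataset.train %>% mutate(fold.index = sample(rep(1:k, length.out = n())))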
## Visual check of the distribution of the folds ----
dataset.train %>% ggplot(aes(x = age1, y = bmi1, color = as.factor(fold.index))) + geom_point()
48. Fit multiple models in each cross-validation fold
#@ Nested for-loop: (1) Iteration of folds for cross-validation (2) Fit multiple models using for-loop =====
# Save the models as a "nested" list of objects to save the results from the "nested" for-loop.
max.polynomial = 10  # matches the 10 polynomial degrees in the summary table on slide 50
cv.model.list = list()
for (i.fold in sort(unique(dataset.train$fold.index))) {
  cv.model.list[[i.fold]] = list()
  dataset = dataset.train %>% filter(fold.index != i.fold) %>% as.data.frame
  for (i in 1:max.polynomial) {
    myformula = as.formula(paste0("death ~ poly(age1, ", i, ")", " + ", "poly(bmi1, ", i, ")"))
    cv.model.list[[i.fold]][[i]] = glm(myformula, data = dataset, family = "binomial")
  }
}
49. Calculate the training AUROC & validation AUROC for multiple models in each cross-validation fold
#@ Define the performance measure (optimization objective). -----
# Cf) You may define any function to avoid repetitive code; here the AUROC is computed with the ROC helper functions used on slide 30.
AUROC = function(actual, predicted.prob) {
  function.vec_actual_prediction.threshold_roc(actual, predicted.prob) %>% function.threshold_roc.auc
}
# Make a table that shows the training AUROC and validation AUROC for each cross-validation fold & each model in the "nested" model.list. -----
cv.df = data_frame(
  cv = rep(1:k, each = max.polynomial)
  , polynomial = rep(1:max.polynomial, k)
) %>% mutate(
  trainAUROC = map2_dbl(cv, polynomial, function(i.fold, i) {
    cv.model.list[[i.fold]][[i]] %>% { AUROC(.$y, predict(., type = "response")) }
  })
  , cvAUROC = map2_dbl(cv, polynomial, function(i.fold, i) {
    validation = dataset.train %>% filter(fold.index == i.fold)
    AUROC(validation$death, predict(cv.model.list[[i.fold]][[i]], newdata = validation, type = "response"))
  })
)
cv.df
50. Calculate the (aggregated) training AUROC & (aggregated) cv AUROC for multiple models
# Make a table that shows the (aggregated) training AUROC and cv AUROC for each model -----
cv.df.summarize = cv.df %>% select(-cv) %>% group_by(polynomial) %>% summarize_all(mean)
cv.df.summarize
# # A tibble: 10 x 3
# polynomial trainAUROC cvAUROC
# <int> <dbl> <dbl>
# 1 1 0.779 0.782
# 2 2 0.779 0.783
# 3 3 0.779 0.780
# 4 4 0.783 0.777
# 5 5 0.783 0.772
# 6 6 0.786 0.763
# 7 7 0.789 0.758
# 8 8 0.787 0.753
# 9 9 0.787 0.749
# 10 10 0.803 0.761
51. Visualize the (aggregated) training AUROC & cv AUROC for multiple models
cv.df.summarize %>% gather(key, value, trainAUROC, cvAUROC) %>% ggplot(aes(x = polynomial, y = value, color = key)) + geom_point() + geom_line()