For this project i used a fictional dataset. The dataset includes information on age in years, sex, height in centimeters, disease status, and three laboratory-measured blood biomarkers in mmol/L units. The number of observations 1598. 1271 subjects don’t have the disease and 327 have the disease.
Describing the data and data cleaning
First i investigate the data with figures and numerical summaries, check if there are obvious data errors and duplicates and clean the dataset accordingly.
data<-read.csv2("SimBiomarkerData.txt")
# Find duplicate rows based on the "ID" column
duplicate_rows <- data[duplicated(data$ID) | duplicated(data$ID, fromLast = TRUE), ]
# Print the duplicate rows
print(duplicate_rows,row.names = FALSE)
## ID sex age height biomarker1 biomarker2 biomarker3 disease_status laboratory
## ID20 2 54 163.3 4.94 4.9 64.17 0 L1
## ID20 2 54 163.3 4.94 4.9 64.17 0 L1
## ID21 1 49 184.71 4.29 4.26 34.62 0 L1
## ID21 1 49 184.71 4.29 4.26 34.62 0 L1
## ID22 2 66 162.31 5.21 4.37 60.92 0 L1
## ID22 2 66 162.31 5.21 4.37 60.92 0 L1
## ID23 2 66 159.29 5.21 5.9 70.15 0 L1
## ID23 2 66 159.29 5.21 5.9 70.15 0 L1
## ID24 2 51 169.6 4.41 4.52 61.51 0 L1
## ID24 2 51 169.6 4.41 4.52 61.51 0 L1
## ID25 1 54 178.63 4.16 6.26 48.05 0 L1
## ID25 1 54 178.63 4.16 6.26 48.05 0 L1
## ID26 2 54 176.6 5.86 4.17 53.43 0 L1
## ID26 2 54 176.6 5.86 4.17 53.43 0 L1
## ID27 2 72 169.33 5.42 5.71 66.47 0 L1
## ID27 2 72 169.33 5.42 5.71 66.47 0 L1
## ID28 2 65 158.92 6.2 5.6 68.37 0 L1
## ID28 2 65 158.92 6.2 5.6 68.37 0 L1
## ID29 1 42 178.73 3.89 13.79 52.6 0 L2
## ID29 1 42 178.73 3.89 13.79 52.6 0 L2
## ID30 1 55 156.16 4.73 6.07 46.66 0 L1
## ID30 1 55 156.16 4.73 6.07 46.66 0 L1
## ID31 2 42 176.86 3.88 14.16 59.72 0 L2
## ID31 2 42 176.86 3.88 14.16 59.72 0 L2
## ID32 1 56 185.41 5.63 5.25 58.63 0 L1
## ID32 1 56 185.41 5.63 5.25 58.63 0 L1
## ID33 2 54 162.81 5.14 4.92 64.44 0 L1
## ID33 2 54 162.81 5.14 4.92 64.44 0 L1
## ID34 2 65 167.83 4.86 4.26 71.17 0 L1
## ID34 2 65 167.83 4.86 4.26 71.17 0 L1
## ID35 1 58 177.76 6.19 5.07 49.08 1 L1
## ID35 1 58 177.76 6.19 5.07 49.08 1 L1
## ID36 2 53 175.56 5.39 6.45 58.03 1 L1
## ID36 2 53 175.56 5.39 6.45 58.03 1 L1
## ID37 1 64 171.44 5.38 6.28 46.05 0 L1
## ID37 1 64 171.44 5.38 6.28 46.05 0 L1
## ID38 2 64 167.33 6.01 5.84 66.38 0 L1
## ID38 2 64 167.33 6.01 5.84 66.38 0 L1
I found some duplicate rows, so we removed them. Each row has its own ID, so i can use the unique function in R to eliminate duplicate rows. After that i check if there are any missing values and see there are none.
## [1] 0
Next i set all variable to to the right variable type. And check their values.
# Convert multiple variables to numeric
data <- data %>%
mutate(across(c("height", "biomarker1", "biomarker2", "biomarker3"), as.numeric))
data <- data %>%
mutate(sex = as.factor(sex),
disease_status = as.factor(disease_status),
laboratory = as.factor(laboratory))
# Create a table of variable types
variable_types <- data.frame(
Type = sapply(data, class),
stringsAsFactors = FALSE
)
# Print the table
summary<-summary(data)
variable_types
## Type
## ID character
## sex factor
## age integer
## height numeric
## biomarker1 numeric
## biomarker2 numeric
## biomarker3 numeric
## disease_status factor
## laboratory factor
## ID sex age height biomarker1
## Length:1600 1:789 Min. :41.00 Min. : 145.8 Min. :2.730
## Class :character 2:811 1st Qu.:49.00 1st Qu.: 163.8 1st Qu.:4.490
## Mode :character Median :57.00 Median : 170.2 Median :5.180
## Mean :57.84 Mean : 192.0 Mean :5.268
## 3rd Qu.:66.00 3rd Qu.: 176.8 3rd Qu.:6.030
## Max. :79.00 Max. :17868.6 Max. :8.260
## biomarker2 biomarker3 disease_status laboratory
## Min. : 2.320 Min. :22.71 0:1273 L1:1340
## 1st Qu.: 4.610 1st Qu.:47.12 1: 327 L2: 260
## Median : 5.465 Median :54.26
## Mean : 6.851 Mean :54.01
## 3rd Qu.: 6.650 3rd Qu.:61.29
## Max. :18.100 Max. :85.07
Now i noticed that most variables have values in a normal range except for “height,” which has some unusually high values. To fix this, i remove rows where the height is greater than 245, as the tallest person alive is 245 cm tall.
# Set a threshold for extreme values (e.g., values above 245)
threshold <- 245
# Remove rows with values above the threshold in age
data <- data[data$height <= threshold, ]
summary(data)
## ID sex age height biomarker1
## Length:1598 1:788 Min. :41.00 Min. :145.8 Min. :2.730
## Class :character 2:810 1st Qu.:49.00 1st Qu.:163.8 1st Qu.:4.490
## Mode :character Median :57.00 Median :170.2 Median :5.180
## Mean :57.85 Mean :170.4 Mean :5.269
## 3rd Qu.:66.00 3rd Qu.:176.8 3rd Qu.:6.030
## Max. :79.00 Max. :196.2 Max. :8.260
## biomarker2 biomarker3 disease_status laboratory
## Min. : 2.320 Min. :22.71 0:1271 L1:1338
## 1st Qu.: 4.610 1st Qu.:47.12 1: 327 L2: 260
## Median : 5.470 Median :54.27
## Mean : 6.854 Mean :54.02
## 3rd Qu.: 6.650 3rd Qu.:61.29
## Max. :18.100 Max. :85.07
plot1<-ggplot(data, aes(sex)) + geom_bar(fill = "skyblue", alpha = 0.7)+
labs(title = "Sex Distribution ", x = "Sex", y = "Frequency")+
theme(plot.title = element_text(size = 10),) +
ggeasy::easy_center_title()
# Create a histogram
plot2<-ggplot(data, aes(x = age)) +
geom_histogram(binwidth = 2, fill = "skyblue", alpha = 0.7) +
labs(title = "Age Distribution", x = "Age", y = "Frequency")+
theme(plot.title = element_text(size = 10),) +
ggeasy::easy_center_title()
# Create a histogram
plot3<-ggplot(data, aes(x = height)) +
geom_histogram(binwidth = 2, fill = "skyblue", alpha = 0.7) +
labs(title = "Height Distribution", x = "Heigt", y = "Frequency")+
theme(plot.title = element_text(size = 10),) +
ggeasy::easy_center_title()
plot4<-ggplot(data, aes(x = biomarker1)) +
geom_histogram(binwidth = 0.15, fill = "skyblue", alpha = 0.7) +
labs(title = "Biomarker 1 Distribution", x = "Biomarker 1", y = "Frequency")+
theme(plot.title = element_text(size = 10),) +
ggeasy::easy_center_title()
plot5<-ggplot(data, aes(x = biomarker2)) +
geom_histogram(binwidth = 0.15, fill = "skyblue", alpha = 0.7) +
labs(title = "Biomarker 2 Distribution", x = "Biomarker 2", y = "Frequency")+
theme(plot.title = element_text(size = 10),) +
ggeasy::easy_center_title()
plot6<-ggplot(data, aes(x = biomarker3)) +
geom_histogram(binwidth = 0.25, fill = "skyblue", alpha = 0.7) +
labs(title = "Biomarker 3 Distribution", x = "Biomarker 3", y = "Frequency")+
theme(plot.title = element_text(size = 10),) +
ggeasy::easy_center_title()
grid.arrange(plot1, plot2, plot3, plot4, plot5 , plot6, ncol = 3)
For the sex variable we have 788 males and 810 females. The age ranges from 41 to 79 with mean 57.8 years old. Heigt of subjects ranges from 145.8 to 196.2 cm. Biomarker 1 ranges from 2.72 to 8.2 and has a mean value of 5.2 Biomarker 2 ranges from 2.3 to 18 and has a mean value of 6.8 Biomarker 3 ranges from 22.7 to 85 and has a mean value of 54. 1271 subjects dont have the disease and 327 have the disease. 1338 samples were taken from laboratory 1 and 260 from laboratory 2.
When i examine the distribution of our variables after cleaning the dataset, i observe that we have an almost equal number of males and females. The age variable exhibits the highest count around 40 years old. Both the height and biomarker1 variables display normal distributions. Biomarker 2 shows a bimodal distribution, while biomarker 3 exhibits a pattern resembling a normal distribution. We no longer observe any extreme values in any the variables.
The biomarkers
In this part i investigate the three blood biomarkers and their relationship with various variables.
Systematic difference between biomarker measurements between laboratories? :
Here i investigate if there is a systematic difference between the reported biomarker measurements of the two laboratories.
plot1<-ggplot(data, aes(x = laboratory, y = biomarker1)) +
geom_boxplot() +
labs(title = "Biomarker1 Measurements by Laboratory", x = "LabNumber", y = "Biomarker1") +
theme_minimal()
plot2<-ggplot(data, aes(x = laboratory, y = biomarker2)) +
geom_boxplot() +
labs(title = "Biomarker2 Measurements by Laboratory", x = "LabNumber", y = "Biomarker2") +
theme_minimal()
plot3<-ggplot(data, aes(x = laboratory, y = biomarker3)) +
geom_boxplot() +
labs(title = "Biomarker2 Measurements by Laboratory", x = "LabNumber", y = "Biomarker3") +
theme_minimal()
grid.arrange(plot1, plot2,plot3, ncol = 2)
plot5<-ggplot(data, aes(x = biomarker2, color = laboratory)) +
geom_histogram() +
labs(title = "biomarker 2 Distribution - Histogram", x = "biomarker 2", y = "Frequency")
It is apparent that biomarker1 does not really differ between laboratories. However, biomarkers 2 and 3 seem to differer between laboratories espessially biomarker2.
To investigate further if there is a systematic difference between the reported biomarker measurements of the two laboratories i will perform a independent samples t-test for each of the biomarkers.
# independent samples t-test's
t_test_result1 <- t.test(biomarker1 ~ laboratory, data = data)
t_test_result2 <- t.test(biomarker2 ~ laboratory, data = data)
t_test_result3 <- t.test(biomarker3 ~ laboratory, data = data)
print(t_test_result1)
##
## Welch Two Sample t-test
##
## data: biomarker1 by laboratory
## t = 0.96325, df = 366.68, p-value = 0.3361
## alternative hypothesis: true difference in means between group L1 and group L2 is not equal to 0
## 95 percent confidence interval:
## -0.07070457 0.20648185
## sample estimates:
## mean in group L1 mean in group L2
## 5.280389 5.212500
##
## Welch Two Sample t-test
##
## data: biomarker2 by laboratory
## t = -141.03, df = 378.25, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group L1 and group L2 is not equal to 0
## 95 percent confidence interval:
## -10.119191 -9.840897
## sample estimates:
## mean in group L1 mean in group L2
## 5.229918 15.209962
##
## Welch Two Sample t-test
##
## data: biomarker3 by laboratory
## t = -18.139, df = 372.36, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group L1 and group L2 is not equal to 0
## 95 percent confidence interval:
## -12.430616 -9.999122
## sample estimates:
## mean in group L1 mean in group L2
## 52.19525 63.41012
Results of the t-test show that there is a systematic difference between the reported biomarker measurements of the two laboratories, for biomarkers 2 and 3 but not 1. There could therefore be a difference in how the to two laboratories mesuread biomarkes 2 and 3.
Relationship between biomarkers:
Previous study reported that changes in biomarker 2 causes changes in biomarker 1, here i test if this is a reasonable claim.
To investigate this claim i begin by plotting the releationship between biomarker 2 and biomarker 1, and than calculate the correlation coefficient to quantify the strength and direction of the relationship.
ggplot(data, aes(x = biomarker2, y = biomarker1, color= laboratory)) +
geom_point() +
labs(title = "Scatter Plot of Biomarker1 vs Biomarker2", x = "Biomarker2", y = "Biomarker1") +
theme_minimal()
# Perform correlation analysis
correlation_coefficient <- cor(data$biomarker1, data$biomarker2)
correlation_coefficient
## [1] 0.1678844
I begin to investigate the releationship between biomarker 2 and biomarker 1. Like seen before biomarker 2 has a bimodal distribution. The plot suggest a positive relationship between biomarker2 and biomarker1 for both the groups in biomarker 2. The correlation is low or 0.168. However, because of the bimodal distribution in biomarker 2 a simple linear correlation may not fully capture the complexity of the relationship. As seen before this bimodal distribution is because values in biomarker2 are different between laborotaries. So to minimise the effect of the bimodal distribution on the reletationship i will include laboratirie in a regression.
# Fit linear regression model
regression_model <- lm(biomarker1 ~ biomarker2 , data = data)
regression_model2 <- lm(biomarker1 ~ biomarker2 + laboratory, data = data)
# Summary of the regression model
summary(regression_model)
##
## Call:
## lm(formula = biomarker1 ~ biomarker2, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.7020 -0.7432 -0.0878 0.7363 2.9931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.957759 0.052489 94.453 < 2e-16 ***
## biomarker2 0.045462 0.006682 6.804 1.44e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.025 on 1596 degrees of freedom
## Multiple R-squared: 0.02819, Adjusted R-squared: 0.02758
## F-statistic: 46.29 on 1 and 1596 DF, p-value: 1.439e-11
##
## Call:
## lm(formula = biomarker1 ~ biomarker2 + laboratory, data = data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.3664 -0.5312 -0.0079 0.4973 2.3761
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.84930 0.09483 19.50 <2e-16 ***
## biomarker2 0.65605 0.01769 37.09 <2e-16 ***
## laboratoryL2 -6.61530 0.18394 -35.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7621 on 1595 degrees of freedom
## Multiple R-squared: 0.4634, Adjusted R-squared: 0.4627
## F-statistic: 688.6 on 2 and 1595 DF, p-value: < 2.2e-16
It is apparent that there is a relationship between biomarker2 and biomarker1. On average, for each unit increase in biomarker 2, biomarker 1 increases by 0.65605 units when considering the effect of laboratory as well. However i can only say that we see a releationship an can not say there is a causation or that changes in biomarker 2 cause changes in biomarker 1.
Biomarker 3 relationship to age and sex:
To investigate the relation ship between Biomarker 3 and age and sex i fit two multple linear regression models. To test which model is more appropriate, i look at the residual plots.
lm1<-lm(biomarker3 ~ age + sex, data = data)
lm2<-lm(biomarker3 ~ age + sex + I(age^2), data = data)
# Create a data frame with age and residuals
residuals_df <- data.frame(
Age = data$age,
Residuals_lm1 = residuals(lm1),
Residuals_lm2 = residuals(lm2)
)
# Plot age against residuals for both models
library(ggplot2)
# Linear model (lm1)
plot1<-ggplot(residuals_df, aes(x = Age, y = Residuals_lm1)) +
geom_point() +
labs(title = "Age vs. Residuals (Model 1)", x = "Age", y = "Residuals_lm1") +
theme_minimal()
# Quadratic model (lm2)
plot2<-ggplot(residuals_df, aes(x = Age, y = Residuals_lm2)) +
geom_point() +
labs(title = "Age vs. Residuals (Model 2)", x = "Age", y = "Residuals_lm2") +
theme_minimal()
grid.arrange(plot1, plot2, ncol = 2)
We see that the residuals have some pattern in model 1. The Residuals get higher as the get closer to age 60. So model 1 performance is dependant on age. In Model 2 we dont see this patter, so we conclude that model 2 i better.
relationship between the residuals and the laboratory variable:
lm3<-lm(biomarker3 ~ age + sex + I(age^2)+ laboratory, data = data)
residuals_df <- data.frame(
Age = data$age,
Laboratory =data$laboratory,
Residuals_lm2 = residuals(lm2)
)
plot2<-ggplot(residuals_df, aes(x = Laboratory, y = Residuals_lm2)) +
geom_boxplot() +
labs(title = "Age vs. Residuals (Quadratic Model)", x = "laboratory", y = "Residuals_lm2") +
theme_minimal()
plot2
#residuals_df <- data.frame(
#Age = data$age,
#Laboratory =data$laboratory,
#Residuals_lm3 = residuals(lm3)
#)
#plot3<-ggplot(residuals_df, aes(x = Laboratory, y = Residuals_lm3)) +
#geom_point() +
#labs(title = "Age vs. Residuals (Quadratic Model)", x = "laboratory", y = "Residuals_lm2") +
#theme_minimal()
We see a systematic pattern in the residuals, the plot shows that residuals are higher in the L2 level of the laboratory variable. When we go back to task 3 we saw that biomarker 3 values were systematically higher in the L1 than the L2 laboratory. So there is a relationship between labratory and biomarker3. So the pattern in the residual plot might indicate that the laboratory variable is unaccounted-for in the model. Because there is clearly a relationship in laboratory and biomarker3, the laboratory variable should be included in the model.
Disease status
In this part i find models to predict disease status using available information in the dataset. In this part i will only include individuals with blood samples analysed in laboratory L1
Subset(laboratory L1):
Here i filter the dataset to only include individuals with blood samples analysed in laboratory L1, than i split the data in test and training datasets.
data2<-filter(data, laboratory == "L1")## Filter data to only include the individuals with blood samples that were analysed in laboratory L1.
## Begin by creating test and train data
set.seed(0401)
train_ind<-sample(seq_len(nrow(data2)),size = (nrow(data2)* 0.75))##trainig set is 75% from data2
train<-data2[train_ind,]
test<-data2[-train_ind,]
Logistic regression:
Here i fit a logistic regression model with disease status as the dependent variable and the variables sex, age, height, biomarker 1, biomarker 2 and biomarker 3 as the predictor variables.
lm1<-glm(disease_status ~ sex + age + height + biomarker1 + biomarker2 + biomarker3, data = train,family = "binomial")
summary(lm1)
##
## Call:
## glm(formula = disease_status ~ sex + age + height + biomarker1 +
## biomarker2 + biomarker3, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.20792 -0.49401 -0.24688 -0.08461 2.93624
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.575e+01 3.496e+00 -4.505 6.63e-06 ***
## sex2 1.775e+00 3.517e-01 5.047 4.50e-07 ***
## age -1.237e-01 2.203e-02 -5.616 1.95e-08 ***
## height 1.082e-02 1.716e-02 0.630 0.528
## biomarker1 2.293e+00 2.320e-01 9.884 < 2e-16 ***
## biomarker2 1.102e+00 1.502e-01 7.340 2.14e-13 ***
## biomarker3 5.486e-04 1.755e-02 0.031 0.975
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1015.89 on 1002 degrees of freedom
## Residual deviance: 635.55 on 996 degrees of freedom
## AIC: 649.55
##
## Number of Fisher Scoring iterations: 6
From the summary output, we see that all coefficients, except for height and biomarker3, are statistically significant.
Best subset:
Here i use the stepAIC function to find the best subset of predictor variables in the lm1 model
## Start: AIC=649.55
## disease_status ~ sex + age + height + biomarker1 + biomarker2 +
## biomarker3
##
## Df Deviance AIC
## - biomarker3 1 635.55 647.55
## - height 1 635.95 647.95
## <none> 635.55 649.55
## - sex 1 662.77 674.77
## - age 1 669.94 681.94
## - biomarker2 1 697.23 709.23
## - biomarker1 1 764.88 776.88
##
## Step: AIC=647.55
## disease_status ~ sex + age + height + biomarker1 + biomarker2
##
## Df Deviance AIC
## - height 1 635.95 645.95
## <none> 635.55 647.55
## + biomarker3 1 635.55 649.55
## - sex 1 671.67 681.67
## - age 1 673.42 683.42
## - biomarker2 1 697.26 707.26
## - biomarker1 1 765.10 775.10
##
## Step: AIC=645.95
## disease_status ~ sex + age + biomarker1 + biomarker2
##
## Df Deviance AIC
## <none> 635.95 645.95
## + height 1 635.55 647.55
## + biomarker3 1 635.95 647.95
## - age 1 678.01 686.01
## - sex 1 686.00 694.00
## - biomarker2 1 697.96 705.96
## - biomarker1 1 765.28 773.28
The final model removes the heigt and biomarker3 variables resulting in the model:
\[ \log\left(\frac{\text{Pr}(Y_i = 1)}{1 - \text{Pr}(Y_i = 1)}\right) = \beta_0 + \beta_1 \cdot \text{sex}_i + \beta_2 \cdot \text{age}_i + \beta_3 \cdot \text{biomarker1}_i + \beta_4 \cdot \text{biomarker2}_i + \epsilon_i \]
Here, the binary outcome variable (disease status), and Pr is the probability of having the disease.
best model interpretation:
##
## Call:
## glm(formula = disease_status ~ sex + age + biomarker1 + biomarker2,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.20649 -0.49298 -0.25040 -0.08423 2.89980
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -13.65355 0.93315 -14.632 < 2e-16 ***
## sex2 1.66306 0.24888 6.682 2.36e-11 ***
## age -0.12654 0.02041 -6.201 5.62e-10 ***
## biomarker1 2.29162 0.23191 9.881 < 2e-16 ***
## biomarker2 1.10285 0.14987 7.359 1.85e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1015.89 on 1002 degrees of freedom
## Residual deviance: 635.95 on 998 degrees of freedom
## AIC: 645.95
##
## Number of Fisher Scoring iterations: 6
The odds of having the disease for individuals with sex2 are approximately e^1.66306 = 5.27 times higher compared to sex1, with a 95% confidence interval of [4.387, 6.328].
For each one-unit increase in age, the odds of disease decrease by a factor of approximately 0.881, 95% confidence interval is [0.862, 0.901].
Biomarker1 is associated with a 9.88 times increase in odds for each one-unit change, while biomarker2 is linked to a 3.01 times increase, confidence interval is [7.228, 13.493].
biomarker2 is linked to a 3.01 times increase in odds, with a 95% confidence interval of [2.220, 4.087].
Test-error,best model:
##
## Call: glm(formula = disease_status ~ sex + age + biomarker1 + biomarker2,
## family = "binomial", data = train)
##
## Coefficients:
## (Intercept) sex2 age biomarker1 biomarker2
## -13.6536 1.6631 -0.1265 2.2916 1.1029
##
## Degrees of Freedom: 1002 Total (i.e. Null); 998 Residual
## Null Deviance: 1016
## Residual Deviance: 636 AIC: 646
test_p <- predict(lm1.step, newdata = test, type = "response")
test.pred<-rep(0,335) ## create a vector with all values= 0
test.pred[test_p>.5]<- 1 ## values in vector get 1 if test_p>.5
matrix<-table(test$disease_status,test.pred)## c confusion matrix(predicted vs actual)
matrix<-confusionMatrix(matrix)
matrix
## Confusion Matrix and Statistics
##
## test.pred
## 0 1
## 0 247 12
## 1 37 39
##
## Accuracy : 0.8537
## 95% CI : (0.8113, 0.8898)
## No Information Rate : 0.8478
## P-Value [Acc > NIR] : 0.4163478
##
## Kappa : 0.5282
##
## Mcnemar's Test P-Value : 0.0006068
##
## Sensitivity : 0.8697
## Specificity : 0.7647
## Pos Pred Value : 0.9537
## Neg Pred Value : 0.5132
## Prevalence : 0.8478
## Detection Rate : 0.7373
## Detection Prevalence : 0.7731
## Balanced Accuracy : 0.8172
##
## 'Positive' Class : 0
##
test-error,second best model:
By following the same steps as in task 11, compute and report the test error for the model with the second lowest AIC (see task 9).
glm2<-glm(disease_status ~ sex + age + height + biomarker1 + biomarker2, train, family = binomial)
test_p <- predict(glm2, newdata = test, type = "response")
test.pred<-rep(0,335) ## create a vector with all values= "0"
test.pred[test_p>.5]<- 1 ## values in vector get 1 if test_p>.5
matrix<-table(test$disease_status,test.pred)## confusion matrix(predicted vs actual)
matrix<-confusionMatrix(matrix)
matrix
## Confusion Matrix and Statistics
##
## test.pred
## 0 1
## 0 246 13
## 1 38 38
##
## Accuracy : 0.8478
## 95% CI : (0.8047, 0.8845)
## No Information Rate : 0.8478
## P-Value [Acc > NIR] : 0.5372758
##
## Kappa : 0.509
##
## Mcnemar's Test P-Value : 0.0007775
##
## Sensitivity : 0.8662
## Specificity : 0.7451
## Pos Pred Value : 0.9498
## Neg Pred Value : 0.5000
## Prevalence : 0.8478
## Detection Rate : 0.7343
## Detection Prevalence : 0.7731
## Balanced Accuracy : 0.8056
##
## 'Positive' Class : 0
##
test error comparison:
here i compair the test error for the best and the second best model according to AIC
# Create a dataframe
df <- data.frame(
Misclassification_Error = c(misclassification_error1, misclassification_error2 ),
Source = c("Best Model Test Data", "2.nd best test Data")
)
kable(df, format = "html") %>%
kable_styling(full_width = FALSE, bootstrap_options = "striped", position = "center")
Misclassification_Error | Source |
---|---|
0.1462687 | Best Model Test Data |
0.1522388 | 2.nd best test Data |