Biomarkers dataset

Describing the data and data cleaning
The biomarkers
Disease status

For this project i used a fictional dataset. The dataset includes information on age in years, sex, height in centimeters, disease status, and three laboratory-measured blood biomarkers in mmol/L units. The number of observations 1598. 1271 subjects don’t have the disease and 327 have the disease.

Describing the data and data cleaning

First i investigate the data with figures and numerical summaries, check if there are obvious data errors and duplicates and clean the dataset accordingly.

data<-read.csv2("SimBiomarkerData.txt")


# Find duplicate rows based on the "ID" column
duplicate_rows <- data[duplicated(data$ID) | duplicated(data$ID, fromLast = TRUE), ]

# Print the duplicate rows
print(duplicate_rows,row.names = FALSE)

##    ID sex age height biomarker1 biomarker2 biomarker3 disease_status laboratory
##  ID20   2  54  163.3       4.94        4.9      64.17              0         L1
##  ID20   2  54  163.3       4.94        4.9      64.17              0         L1
##  ID21   1  49 184.71       4.29       4.26      34.62              0         L1
##  ID21   1  49 184.71       4.29       4.26      34.62              0         L1
##  ID22   2  66 162.31       5.21       4.37      60.92              0         L1
##  ID22   2  66 162.31       5.21       4.37      60.92              0         L1
##  ID23   2  66 159.29       5.21        5.9      70.15              0         L1
##  ID23   2  66 159.29       5.21        5.9      70.15              0         L1
##  ID24   2  51  169.6       4.41       4.52      61.51              0         L1
##  ID24   2  51  169.6       4.41       4.52      61.51              0         L1
##  ID25   1  54 178.63       4.16       6.26      48.05              0         L1
##  ID25   1  54 178.63       4.16       6.26      48.05              0         L1
##  ID26   2  54  176.6       5.86       4.17      53.43              0         L1
##  ID26   2  54  176.6       5.86       4.17      53.43              0         L1
##  ID27   2  72 169.33       5.42       5.71      66.47              0         L1
##  ID27   2  72 169.33       5.42       5.71      66.47              0         L1
##  ID28   2  65 158.92        6.2        5.6      68.37              0         L1
##  ID28   2  65 158.92        6.2        5.6      68.37              0         L1
##  ID29   1  42 178.73       3.89      13.79       52.6              0         L2
##  ID29   1  42 178.73       3.89      13.79       52.6              0         L2
##  ID30   1  55 156.16       4.73       6.07      46.66              0         L1
##  ID30   1  55 156.16       4.73       6.07      46.66              0         L1
##  ID31   2  42 176.86       3.88      14.16      59.72              0         L2
##  ID31   2  42 176.86       3.88      14.16      59.72              0         L2
##  ID32   1  56 185.41       5.63       5.25      58.63              0         L1
##  ID32   1  56 185.41       5.63       5.25      58.63              0         L1
##  ID33   2  54 162.81       5.14       4.92      64.44              0         L1
##  ID33   2  54 162.81       5.14       4.92      64.44              0         L1
##  ID34   2  65 167.83       4.86       4.26      71.17              0         L1
##  ID34   2  65 167.83       4.86       4.26      71.17              0         L1
##  ID35   1  58 177.76       6.19       5.07      49.08              1         L1
##  ID35   1  58 177.76       6.19       5.07      49.08              1         L1
##  ID36   2  53 175.56       5.39       6.45      58.03              1         L1
##  ID36   2  53 175.56       5.39       6.45      58.03              1         L1
##  ID37   1  64 171.44       5.38       6.28      46.05              0         L1
##  ID37   1  64 171.44       5.38       6.28      46.05              0         L1
##  ID38   2  64 167.33       6.01       5.84      66.38              0         L1
##  ID38   2  64 167.33       6.01       5.84      66.38              0         L1

I found some duplicate rows, so we removed them. Each row has its own ID, so i can use the unique function in R to eliminate duplicate rows. After that i check if there are any missing values and see there are none.

data <- unique(data)
sum(is.na(data))

## [1] 0

Next i set all variable to to the right variable type. And check their values.

# Convert multiple variables to numeric
data <- data %>%
  mutate(across(c("height", "biomarker1", "biomarker2", "biomarker3"), as.numeric))

data <- data %>%
  mutate(sex = as.factor(sex),
         disease_status  = as.factor(disease_status),
         laboratory = as.factor(laboratory))


# Create a table of variable types
variable_types <- data.frame(
  Type = sapply(data, class),
  stringsAsFactors = FALSE
)

# Print the table


summary<-summary(data)


variable_types

##                     Type
## ID             character
## sex               factor
## age              integer
## height           numeric
## biomarker1       numeric
## biomarker2       numeric
## biomarker3       numeric
## disease_status    factor
## laboratory        factor

summary

##       ID            sex          age            height          biomarker1   
##  Length:1600        1:789   Min.   :41.00   Min.   :  145.8   Min.   :2.730  
##  Class :character   2:811   1st Qu.:49.00   1st Qu.:  163.8   1st Qu.:4.490  
##  Mode  :character           Median :57.00   Median :  170.2   Median :5.180  
##                             Mean   :57.84   Mean   :  192.0   Mean   :5.268  
##                             3rd Qu.:66.00   3rd Qu.:  176.8   3rd Qu.:6.030  
##                             Max.   :79.00   Max.   :17868.6   Max.   :8.260  
##    biomarker2       biomarker3    disease_status laboratory
##  Min.   : 2.320   Min.   :22.71   0:1273         L1:1340   
##  1st Qu.: 4.610   1st Qu.:47.12   1: 327         L2: 260   
##  Median : 5.465   Median :54.26                            
##  Mean   : 6.851   Mean   :54.01                            
##  3rd Qu.: 6.650   3rd Qu.:61.29                            
##  Max.   :18.100   Max.   :85.07

Now i noticed that most variables have values in a normal range except for “height,” which has some unusually high values. To fix this, i remove rows where the height is greater than 245, as the tallest person alive is 245 cm tall.

# Set a threshold for extreme values (e.g., values above 245)
threshold <- 245

# Remove rows with values above the threshold in age
data <- data[data$height <= threshold, ]

summary(data)

##       ID            sex          age            height        biomarker1   
##  Length:1598        1:788   Min.   :41.00   Min.   :145.8   Min.   :2.730  
##  Class :character   2:810   1st Qu.:49.00   1st Qu.:163.8   1st Qu.:4.490  
##  Mode  :character           Median :57.00   Median :170.2   Median :5.180  
##                             Mean   :57.85   Mean   :170.4   Mean   :5.269  
##                             3rd Qu.:66.00   3rd Qu.:176.8   3rd Qu.:6.030  
##                             Max.   :79.00   Max.   :196.2   Max.   :8.260  
##    biomarker2       biomarker3    disease_status laboratory
##  Min.   : 2.320   Min.   :22.71   0:1271         L1:1338   
##  1st Qu.: 4.610   1st Qu.:47.12   1: 327         L2: 260   
##  Median : 5.470   Median :54.27                            
##  Mean   : 6.854   Mean   :54.02                            
##  3rd Qu.: 6.650   3rd Qu.:61.29                            
##  Max.   :18.100   Max.   :85.07

plot1<-ggplot(data, aes(sex)) + geom_bar(fill = "skyblue", alpha = 0.7)+
  labs(title = "Sex Distribution ", x = "Sex", y = "Frequency")+ 
  theme(plot.title = element_text(size = 10),) +
  ggeasy::easy_center_title()

# Create a histogram
plot2<-ggplot(data, aes(x = age)) +
  geom_histogram(binwidth = 2, fill = "skyblue", alpha = 0.7) +
  labs(title = "Age Distribution", x = "Age", y = "Frequency")+ 
  theme(plot.title = element_text(size = 10),) +
  ggeasy::easy_center_title()

# Create a histogram
plot3<-ggplot(data, aes(x = height)) +
  geom_histogram(binwidth = 2, fill = "skyblue", alpha = 0.7) +
  labs(title = "Height Distribution", x = "Heigt", y = "Frequency")+ 
  theme(plot.title = element_text(size = 10),) +
  ggeasy::easy_center_title()

plot4<-ggplot(data, aes(x = biomarker1)) +
  geom_histogram(binwidth = 0.15, fill = "skyblue", alpha = 0.7) +
  labs(title = "Biomarker 1 Distribution", x = "Biomarker 1", y = "Frequency")+ 
  theme(plot.title = element_text(size = 10),) +
  ggeasy::easy_center_title()


plot5<-ggplot(data, aes(x = biomarker2)) +
  geom_histogram(binwidth = 0.15, fill = "skyblue", alpha = 0.7) +
  labs(title = "Biomarker 2 Distribution", x = "Biomarker 2", y = "Frequency")+ 
  theme(plot.title = element_text(size = 10),) +
  ggeasy::easy_center_title()


plot6<-ggplot(data, aes(x = biomarker3)) +
  geom_histogram(binwidth = 0.25, fill = "skyblue", alpha = 0.7) +
  labs(title = "Biomarker 3 Distribution", x = "Biomarker 3", y = "Frequency")+ 
  theme(plot.title = element_text(size = 10),) +
  ggeasy::easy_center_title()


grid.arrange(plot1, plot2, plot3, plot4, plot5 , plot6, ncol = 3)

For the sex variable we have 788 males and 810 females. The age ranges from 41 to 79 with mean 57.8 years old. Heigt of subjects ranges from 145.8 to 196.2 cm. Biomarker 1 ranges from 2.72 to 8.2 and has a mean value of 5.2 Biomarker 2 ranges from 2.3 to 18 and has a mean value of 6.8 Biomarker 3 ranges from 22.7 to 85 and has a mean value of 54. 1271 subjects dont have the disease and 327 have the disease. 1338 samples were taken from laboratory 1 and 260 from laboratory 2.

When i examine the distribution of our variables after cleaning the dataset, i observe that we have an almost equal number of males and females. The age variable exhibits the highest count around 40 years old. Both the height and biomarker1 variables display normal distributions. Biomarker 2 shows a bimodal distribution, while biomarker 3 exhibits a pattern resembling a normal distribution. We no longer observe any extreme values in any the variables.

The biomarkers

In this part i investigate the three blood biomarkers and their relationship with various variables.

Systematic difference between biomarker measurements between laboratories? :

Here i investigate if there is a systematic difference between the reported biomarker measurements of the two laboratories.

plot1<-ggplot(data, aes(x = laboratory, y = biomarker1)) +
  geom_boxplot() +
  labs(title = "Biomarker1 Measurements by Laboratory", x = "LabNumber", y = "Biomarker1") +
  theme_minimal()

plot2<-ggplot(data, aes(x = laboratory, y = biomarker2)) +
  geom_boxplot() +
  labs(title = "Biomarker2 Measurements by Laboratory", x = "LabNumber", y = "Biomarker2") +
  theme_minimal()

plot3<-ggplot(data, aes(x = laboratory, y = biomarker3)) +
  geom_boxplot() +
  labs(title = "Biomarker2 Measurements by Laboratory", x = "LabNumber", y = "Biomarker3") +
  theme_minimal()

grid.arrange(plot1, plot2,plot3, ncol = 2)

plot5<-ggplot(data, aes(x = biomarker2, color = laboratory)) +
  geom_histogram() +
  labs(title = "biomarker 2 Distribution - Histogram", x = "biomarker 2", y = "Frequency")

It is apparent that biomarker1 does not really differ between laboratories. However, biomarkers 2 and 3 seem to differer between laboratories espessially biomarker2.

To investigate further if there is a systematic difference between the reported biomarker measurements of the two laboratories i will perform a independent samples t-test for each of the biomarkers.

# independent samples t-test's
t_test_result1 <- t.test(biomarker1 ~ laboratory, data = data)
t_test_result2 <- t.test(biomarker2 ~ laboratory, data = data)
t_test_result3 <- t.test(biomarker3 ~ laboratory, data = data)

print(t_test_result1)

## 
##  Welch Two Sample t-test
## 
## data:  biomarker1 by laboratory
## t = 0.96325, df = 366.68, p-value = 0.3361
## alternative hypothesis: true difference in means between group L1 and group L2 is not equal to 0
## 95 percent confidence interval:
##  -0.07070457  0.20648185
## sample estimates:
## mean in group L1 mean in group L2 
##         5.280389         5.212500

print(t_test_result2)

## 
##  Welch Two Sample t-test
## 
## data:  biomarker2 by laboratory
## t = -141.03, df = 378.25, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group L1 and group L2 is not equal to 0
## 95 percent confidence interval:
##  -10.119191  -9.840897
## sample estimates:
## mean in group L1 mean in group L2 
##         5.229918        15.209962

print(t_test_result3)

## 
##  Welch Two Sample t-test
## 
## data:  biomarker3 by laboratory
## t = -18.139, df = 372.36, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group L1 and group L2 is not equal to 0
## 95 percent confidence interval:
##  -12.430616  -9.999122
## sample estimates:
## mean in group L1 mean in group L2 
##         52.19525         63.41012

Results of the t-test show that there is a systematic difference between the reported biomarker measurements of the two laboratories, for biomarkers 2 and 3 but not 1. There could therefore be a difference in how the to two laboratories mesuread biomarkes 2 and 3.

Relationship between biomarkers:

Previous study reported that changes in biomarker 2 causes changes in biomarker 1, here i test if this is a reasonable claim.

To investigate this claim i begin by plotting the releationship between biomarker 2 and biomarker 1, and than calculate the correlation coefficient to quantify the strength and direction of the relationship.

ggplot(data,aes(biomarker1,biomarker2)) + geom_point()

ggplot(data, aes(x = biomarker2, y = biomarker1, color= laboratory)) +
  geom_point() +
  labs(title = "Scatter Plot of Biomarker1 vs Biomarker2", x = "Biomarker2", y = "Biomarker1") +
  theme_minimal()

# Perform correlation analysis
correlation_coefficient <- cor(data$biomarker1, data$biomarker2)
correlation_coefficient

## [1] 0.1678844

I begin to investigate the releationship between biomarker 2 and biomarker 1. Like seen before biomarker 2 has a bimodal distribution. The plot suggest a positive relationship between biomarker2 and biomarker1 for both the groups in biomarker 2. The correlation is low or 0.168. However, because of the bimodal distribution in biomarker 2 a simple linear correlation may not fully capture the complexity of the relationship. As seen before this bimodal distribution is because values in biomarker2 are different between laborotaries. So to minimise the effect of the bimodal distribution on the reletationship i will include laboratirie in a regression.

# Fit linear regression model
regression_model <- lm(biomarker1 ~ biomarker2 , data = data)

regression_model2 <- lm(biomarker1 ~ biomarker2 + laboratory, data = data)

# Summary of the regression model
summary(regression_model)

## 
## Call:
## lm(formula = biomarker1 ~ biomarker2, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7020 -0.7432 -0.0878  0.7363  2.9931 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.957759   0.052489  94.453  < 2e-16 ***
## biomarker2  0.045462   0.006682   6.804 1.44e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.025 on 1596 degrees of freedom
## Multiple R-squared:  0.02819,    Adjusted R-squared:  0.02758 
## F-statistic: 46.29 on 1 and 1596 DF,  p-value: 1.439e-11

summary(regression_model2)

## 
## Call:
## lm(formula = biomarker1 ~ biomarker2 + laboratory, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.3664 -0.5312 -0.0079  0.4973  2.3761 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   1.84930    0.09483   19.50   <2e-16 ***
## biomarker2    0.65605    0.01769   37.09   <2e-16 ***
## laboratoryL2 -6.61530    0.18394  -35.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7621 on 1595 degrees of freedom
## Multiple R-squared:  0.4634, Adjusted R-squared:  0.4627 
## F-statistic: 688.6 on 2 and 1595 DF,  p-value: < 2.2e-16

It is apparent that there is a relationship between biomarker2 and biomarker1. On average, for each unit increase in biomarker 2, biomarker 1 increases by 0.65605 units when considering the effect of laboratory as well. However i can only say that we see a releationship an can not say there is a causation or that changes in biomarker 2 cause changes in biomarker 1.

Biomarker 3 relationship to age and sex:

To investigate the relation ship between Biomarker 3 and age and sex i fit two multple linear regression models. To test which model is more appropriate, i look at the residual plots.

lm1<-lm(biomarker3 ~ age + sex, data = data)
lm2<-lm(biomarker3 ~ age + sex + I(age^2), data = data)




# Create a data frame with age and residuals
residuals_df <- data.frame(
  Age = data$age,
  Residuals_lm1 = residuals(lm1),
  Residuals_lm2 = residuals(lm2)
)

# Plot age against residuals for both models
library(ggplot2)

# Linear model (lm1)
plot1<-ggplot(residuals_df, aes(x = Age, y = Residuals_lm1)) +
  geom_point() +
  labs(title = "Age vs. Residuals (Model 1)", x = "Age", y = "Residuals_lm1") +
  theme_minimal()

# Quadratic model (lm2)
plot2<-ggplot(residuals_df, aes(x = Age, y = Residuals_lm2)) +
  geom_point() +
  labs(title = "Age vs. Residuals (Model 2)", x = "Age", y = "Residuals_lm2") +
  theme_minimal()

grid.arrange(plot1, plot2, ncol = 2)

We see that the residuals have some pattern in model 1. The Residuals get higher as the get closer to age 60. So model 1 performance is dependant on age. In Model 2 we dont see this patter, so we conclude that model 2 i better.

relationship between the residuals and the laboratory variable:

lm3<-lm(biomarker3 ~ age + sex + I(age^2)+ laboratory, data = data)

residuals_df <- data.frame(
  Age = data$age,
  Laboratory =data$laboratory,
  Residuals_lm2 = residuals(lm2)
)


plot2<-ggplot(residuals_df, aes(x = Laboratory, y = Residuals_lm2)) +
  geom_boxplot() +
  labs(title = "Age vs. Residuals (Quadratic Model)", x = "laboratory", y = "Residuals_lm2") +
  theme_minimal()


plot2

#residuals_df <- data.frame(
  #Age = data$age,
  #Laboratory =data$laboratory,
  #Residuals_lm3 = residuals(lm3)
#)


#plot3<-ggplot(residuals_df, aes(x = Laboratory, y = Residuals_lm3)) +
  #geom_point() +
  #labs(title = "Age vs. Residuals (Quadratic Model)", x = "laboratory", y = "Residuals_lm2") +
  #theme_minimal()

We see a systematic pattern in the residuals, the plot shows that residuals are higher in the L2 level of the laboratory variable. When we go back to task 3 we saw that biomarker 3 values were systematically higher in the L1 than the L2 laboratory. So there is a relationship between labratory and biomarker3. So the pattern in the residual plot might indicate that the laboratory variable is unaccounted-for in the model. Because there is clearly a relationship in laboratory and biomarker3, the laboratory variable should be included in the model.

Disease status

In this part i find models to predict disease status using available information in the dataset. In this part i will only include individuals with blood samples analysed in laboratory L1

Subset(laboratory L1):

Here i filter the dataset to only include individuals with blood samples analysed in laboratory L1, than i split the data in test and training datasets.

data2<-filter(data, laboratory == "L1")## Filter data to only include the individuals with blood  samples that were analysed in laboratory L1.

## Begin by creating test and train data

set.seed(0401)

train_ind<-sample(seq_len(nrow(data2)),size = (nrow(data2)* 0.75))##trainig set is 75% from data2
train<-data2[train_ind,]
test<-data2[-train_ind,]

Logistic regression:

Here i fit a logistic regression model with disease status as the dependent variable and the variables sex, age, height, biomarker 1, biomarker 2 and biomarker 3 as the predictor variables.

lm1<-glm(disease_status ~ sex + age + height + biomarker1 + biomarker2 + biomarker3, data = train,family = "binomial")

summary(lm1)

## 
## Call:
## glm(formula = disease_status ~ sex + age + height + biomarker1 + 
##     biomarker2 + biomarker3, family = "binomial", data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.20792  -0.49401  -0.24688  -0.08461   2.93624  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.575e+01  3.496e+00  -4.505 6.63e-06 ***
## sex2         1.775e+00  3.517e-01   5.047 4.50e-07 ***
## age         -1.237e-01  2.203e-02  -5.616 1.95e-08 ***
## height       1.082e-02  1.716e-02   0.630    0.528    
## biomarker1   2.293e+00  2.320e-01   9.884  < 2e-16 ***
## biomarker2   1.102e+00  1.502e-01   7.340 2.14e-13 ***
## biomarker3   5.486e-04  1.755e-02   0.031    0.975    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1015.89  on 1002  degrees of freedom
## Residual deviance:  635.55  on  996  degrees of freedom
## AIC: 649.55
## 
## Number of Fisher Scoring iterations: 6

From the summary output, we see that all coefficients, except for height and biomarker3, are statistically significant.

Best subset:

Here i use the stepAIC function to find the best subset of predictor variables in the lm1 model

lm1.step <- stepAIC(lm1,direction = "both", trace = 1)

## Start:  AIC=649.55
## disease_status ~ sex + age + height + biomarker1 + biomarker2 + 
##     biomarker3
## 
##              Df Deviance    AIC
## - biomarker3  1   635.55 647.55
## - height      1   635.95 647.95
## <none>            635.55 649.55
## - sex         1   662.77 674.77
## - age         1   669.94 681.94
## - biomarker2  1   697.23 709.23
## - biomarker1  1   764.88 776.88
## 
## Step:  AIC=647.55
## disease_status ~ sex + age + height + biomarker1 + biomarker2
## 
##              Df Deviance    AIC
## - height      1   635.95 645.95
## <none>            635.55 647.55
## + biomarker3  1   635.55 649.55
## - sex         1   671.67 681.67
## - age         1   673.42 683.42
## - biomarker2  1   697.26 707.26
## - biomarker1  1   765.10 775.10
## 
## Step:  AIC=645.95
## disease_status ~ sex + age + biomarker1 + biomarker2
## 
##              Df Deviance    AIC
## <none>            635.95 645.95
## + height      1   635.55 647.55
## + biomarker3  1   635.95 647.95
## - age         1   678.01 686.01
## - sex         1   686.00 694.00
## - biomarker2  1   697.96 705.96
## - biomarker1  1   765.28 773.28

The final model removes the heigt and biomarker3 variables resulting in the model:

\[ \log\left(\frac{\text{Pr}(Y_i = 1)}{1 - \text{Pr}(Y_i = 1)}\right) = \beta_0 + \beta_1 \cdot \text{sex}_i + \beta_2 \cdot \text{age}_i + \beta_3 \cdot \text{biomarker1}_i + \beta_4 \cdot \text{biomarker2}_i + \epsilon_i \]

Here, the binary outcome variable (disease status), and Pr is the probability of having the disease.

best model interpretation:

summary(lm1.step)

## 
## Call:
## glm(formula = disease_status ~ sex + age + biomarker1 + biomarker2, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.20649  -0.49298  -0.25040  -0.08423   2.89980  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -13.65355    0.93315 -14.632  < 2e-16 ***
## sex2          1.66306    0.24888   6.682 2.36e-11 ***
## age          -0.12654    0.02041  -6.201 5.62e-10 ***
## biomarker1    2.29162    0.23191   9.881  < 2e-16 ***
## biomarker2    1.10285    0.14987   7.359 1.85e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1015.89  on 1002  degrees of freedom
## Residual deviance:  635.95  on  998  degrees of freedom
## AIC: 645.95
## 
## Number of Fisher Scoring iterations: 6

The odds of having the disease for individuals with sex2 are approximately e^1.66306 = 5.27 times higher compared to sex1, with a 95% confidence interval of [4.387, 6.328].

For each one-unit increase in age, the odds of disease decrease by a factor of approximately 0.881, 95% confidence interval is [0.862, 0.901].

Biomarker1 is associated with a 9.88 times increase in odds for each one-unit change, while biomarker2 is linked to a 3.01 times increase, confidence interval is [7.228, 13.493].

biomarker2 is linked to a 3.01 times increase in odds, with a 95% confidence interval of [2.220, 4.087].

Test-error,best model:

lm1.step

## 
## Call:  glm(formula = disease_status ~ sex + age + biomarker1 + biomarker2, 
##     family = "binomial", data = train)
## 
## Coefficients:
## (Intercept)         sex2          age   biomarker1   biomarker2  
##    -13.6536       1.6631      -0.1265       2.2916       1.1029  
## 
## Degrees of Freedom: 1002 Total (i.e. Null);  998 Residual
## Null Deviance:       1016 
## Residual Deviance: 636   AIC: 646

test_p <- predict(lm1.step, newdata = test, type = "response")

test.pred<-rep(0,335) ## create a vector with all values= 0
test.pred[test_p>.5]<- 1 ## values in vector get 1 if test_p>.5

matrix<-table(test$disease_status,test.pred)## c confusion matrix(predicted vs actual)

matrix<-confusionMatrix(matrix)

matrix

## Confusion Matrix and Statistics
## 
##    test.pred
##       0   1
##   0 247  12
##   1  37  39
##                                           
##                Accuracy : 0.8537          
##                  95% CI : (0.8113, 0.8898)
##     No Information Rate : 0.8478          
##     P-Value [Acc > NIR] : 0.4163478       
##                                           
##                   Kappa : 0.5282          
##                                           
##  Mcnemar's Test P-Value : 0.0006068       
##                                           
##             Sensitivity : 0.8697          
##             Specificity : 0.7647          
##          Pos Pred Value : 0.9537          
##          Neg Pred Value : 0.5132          
##              Prevalence : 0.8478          
##          Detection Rate : 0.7373          
##    Detection Prevalence : 0.7731          
##       Balanced Accuracy : 0.8172          
##                                           
##        'Positive' Class : 0               
##

misclassification_error1<- mean(test$disease_status != test.pred)

test-error,second best model:

By following the same steps as in task 11, compute and report the test error for the model with the second lowest AIC (see task 9).

glm2<-glm(disease_status ~ sex + age + height + biomarker1 + biomarker2, train, family = binomial)

test_p <- predict(glm2, newdata = test, type = "response")

test.pred<-rep(0,335) ## create a vector with all values= "0"
test.pred[test_p>.5]<- 1 ## values in vector get 1 if test_p>.5

matrix<-table(test$disease_status,test.pred)## confusion matrix(predicted vs actual)

matrix<-confusionMatrix(matrix)

matrix

## Confusion Matrix and Statistics
## 
##    test.pred
##       0   1
##   0 246  13
##   1  38  38
##                                           
##                Accuracy : 0.8478          
##                  95% CI : (0.8047, 0.8845)
##     No Information Rate : 0.8478          
##     P-Value [Acc > NIR] : 0.5372758       
##                                           
##                   Kappa : 0.509           
##                                           
##  Mcnemar's Test P-Value : 0.0007775       
##                                           
##             Sensitivity : 0.8662          
##             Specificity : 0.7451          
##          Pos Pred Value : 0.9498          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.8478          
##          Detection Rate : 0.7343          
##    Detection Prevalence : 0.7731          
##       Balanced Accuracy : 0.8056          
##                                           
##        'Positive' Class : 0               
##

misclassification_error2 <- mean(test$disease_status != test.pred)

test error comparison:

here i compair the test error for the best and the second best model according to AIC

# Create a dataframe
df <- data.frame(
  Misclassification_Error = c(misclassification_error1, misclassification_error2 ),
  Source = c("Best Model Test Data",  "2.nd best test Data")
)


kable(df, format = "html") %>%
  kable_styling(full_width = FALSE, bootstrap_options = "striped", position = "center")

Misclassification_Error	Source
0.1462687	Best Model Test Data
0.1522388	2.nd best test Data