

In this project, I conduct a time series analysis of the number of passengers passing through Keflavik Airport and attempt to fit a model that predicts the monthly number of passengers for data the model has not seen. The data for the number of passengers passing through Keflavik Airport was obtained from the Iceland Statistics Website: https://statice.is/statistics/business-sectors/tourism/passengers/.

SARIMA Model

1 Get and clean data

First, I obtained monthly data on the number of passengers passing through Keflavik Airport, from January 2018 to February 2024, from the Iceland Statistics website. We will use the period from January 2018 to December 2022 to train our models and then predict the number of passengers on unseen data from January 2023 to February 2024.

## load required packages
library(tidyverse)
library(lubridate)
library(forecast)
library(tseries)
library(knitr)
library(gt)
library(scales)
library(randomForest)

data <- read.csv("fardegar.csv", sep = ";")

## prepend "01" to each date string (e.g. "2018M01" -> "012018M01") and parse it as a Date
data$Dagsetning <- as.Date(paste0("01", data$Dagsetning), format = "%d%YM%m")


## Create test and train data
train<-filter(data, Dagsetning <= "2022-12-01")
test<-filter(data,Dagsetning > "2022-12-01")


## plot our test and train sets
data$set <- ifelse(data$Dagsetning <= "2022-12-01", "train", "test")

ggplot(data,aes(Dagsetning, index, color= set))+ geom_line()

The plot shows the period we used to train our models and the period we used for prediction, or testing our models.

2 Descriptive Statistics

# Find peak points
peak_points <- data[which(diff(sign(diff(data$index))) == -2) + 1, ]

ggplot(data, aes(x = Dagsetning, y = index)) +
  geom_point() +
  geom_line() +
  labs(x = "Date", y = "Count") +
  scale_y_continuous(labels = comma) +
  scale_x_date(date_labels = "%Y %b", breaks = "4 months") +
  theme(axis.text.x = element_text(angle = 70, hjust = 1))


From the picture, we see a clear seasonal pattern in the number of passengers at the airport. There are fewer passengers in the winter months, and the number increases in the summer, usually peaking around June, July, and August. We also observe a drop in the passenger count from March 2020 to July 2021, which can be attributed to the coronavirus pandemic. To further explore our data, we will look at some tables.

data %>%
  group_by(Year = year(Dagsetning)) %>%
  summarize(Count = sum(index),
            Mean = mean(index),
            SD = sd(index),
            MIN = min(index),
            MIN_Month = month(Dagsetning, label = TRUE)[which.min(index)],
            Max = max(index),
            MAX_Month = month(Dagsetning, label = TRUE)[which.max(index)]) %>%
  kable()
| Year | Count   | Mean      | SD       | MIN    | MIN_Month | Max    | MAX_Month |
|------|---------|-----------|----------|--------|-----------|--------|-----------|
| 2018 | 2984018 | 248668.17 | 59875.55 | 186636 | Jan       | 347475 | Aug       |
| 2019 | 2597536 | 216461.33 | 45871.03 | 168850 | Dec       | 300366 | Aug       |
| 2020 | 609337  | 50778.08  | 60937.53 | 1262   | Apr       | 167496 | Feb       |
| 2021 | 907058  | 75588.17  | 64259.29 | 5339   | Feb       | 172418 | Aug       |
| 2022 | 2283435 | 190286.25 | 69814.20 | 82864  | Jan       | 299473 | Jul       |
| 2023 | 2814780 | 234565.00 | 62085.68 | 162554 | Jan       | 345917 | Jul       |
| 2024 | 360809  | 180404.50 | 17988.09 | 167685 | Jan       | 193124 | Feb       |

From the table, we observe that the year 2018 had the highest traffic at Keflavík Airport. There is also a noticeable decline in passenger numbers during the COVID-19 period from 2020 to 2021. In 2023, the count approached pre-pandemic levels. Additionally, January typically has lower passenger numbers, while August tends to have higher numbers, although this is not always the case.

3 See if data is stationary

ts.plot(data$index)

ts.plot(diff(data$index))

To make the data stationary, we need to difference the index variable.
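Differencing here means taking the first difference of the series, i.e. replacing each observation with its change from the previous month:

$$\nabla y_t = y_t - y_{t-1}$$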

3.1 Test for stationarity

## test for stationarity
adf.test((data$index), alternative="stationary", k=0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  (data$index)
## Dickey-Fuller = -1.6452, Lag order = 0, p-value = 0.7201
## alternative hypothesis: stationary
adf.test((diff(data$index)), alternative="stationary", k=0)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  (diff(data$index))
## Dickey-Fuller = -5.513, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary

The ADF test confirms that the data must be differenced to become stationary.
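For reference, with k = 0 the ADF test reduces to the basic Dickey-Fuller regression (tseries::adf.test also includes a constant and a linear trend term):

$$\Delta y_t = \alpha + \gamma t + \beta\, y_{t-1} + \varepsilon_t,$$

where the null hypothesis of a unit root corresponds to $\beta = 0$ and the stationary alternative to $\beta < 0$.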

4 ACF and PACF plots

## Create a time-series object from the train data with a monthly frequency, starting January 2018
ts_data <- ts(train$index, frequency = 12, start = c(2018, 1))

acf(diff(ts_data))

pacf(diff(ts_data))

The ACF and PACF plots suggest an AR(1) and MA(0) structure for the differenced series.

5 Create the model

### use auto.arima to find optimal parameters according to AIC
auto.arima(ts_data)
## Series: ts_data 
## ARIMA(1,1,0)(0,1,1)[12] 
## 
## Coefficients:
##          ar1     sma1
##       0.3800  -0.8322
## s.e.  0.1374   0.6822
## 
## sigma^2 = 576747308:  log likelihood = -545.8
## AIC=1097.6   AICc=1098.16   BIC=1103.15
## save the best model
arima_model <- auto.arima(ts_data)

## save fitted values
fitt<-fitted(arima_model)
## add fitted values to the train-set
train$fitted<-fitt

We see that the auto.arima function sets the “d” parameter to 1, indicating that the data is differenced once to make it stationary, as suggested in section 3. The model includes both seasonal and non-seasonal components. The seasonal part has order (0,1,1) with a period of 12 months, denoted (0,1,1)[12]: a seasonal difference plus a seasonal moving-average term. The non-seasonal part has order (1,1,0): one autoregressive term after first differencing.
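In backshift notation (with $B\,y_t = y_{t-1}$), the fitted ARIMA(1,1,0)(0,1,1)[12] model can be written as

$$(1 - \phi_1 B)(1 - B)(1 - B^{12})\, y_t = (1 + \Theta_1 B^{12})\, \varepsilon_t,$$

where, from the output above, $\phi_1 \approx 0.38$ and $\Theta_1 \approx -0.83$.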

6 Forecast

Here we forecast the number of passengers passing through Keflavik Airport for future months that the model has not seen, or our test data.

## forecast values for the next 14 months (our test set)
forecast_values <- forecast(arima_model, h = 14)


## add the values to our test-set
test$pred<-forecast_values$mean

7 Model prediction errors.

We’ll use RMSE (Root Mean Squared Error) to see, on average, how much our predictions differ from the actual values.
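For actual values $y_t$ and predictions $\hat{y}_t$ over the $n$ test months, this is the standard formula:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}$$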

# Residuals
errors<-test$index-test$pred

# Square the residuals
squared_errors <- errors^2

# Calculate the mean squared error
mse <- mean(squared_errors)

# Step 4: Calculate RMSE
rmse <- sqrt(mse)

# Print RMSE
print(rmse)
## [1] 57412.8

On average, our predictions for the number of passengers passing through Keflavik Airport each month from January 2023 to February 2024 were off by approximately 57,412.8 passengers. To get a better idea of the magnitude of the error, we can compare the actual and predicted values for each month.

test$diff <- test$index - test$pred

## mean error (bias) of the predictions
mean(test$diff)
## [1] 51923.72
test %>% 
  rename(Date = Dagsetning,
         Actual = index,
         Difference = diff,
         Predicted = pred)%>%gt()
| Date       | Actual | Predicted | Difference |
|------------|--------|-----------|------------|
| 2023-01-01 | 162554 | 145608.5  | 16945.47   |
| 2023-02-01 | 176036 | 154055.7  | 21980.35   |
| 2023-03-01 | 200693 | 157168.2  | 43524.80   |
| 2023-04-01 | 198333 | 133841.9  | 64491.09   |
| 2023-05-01 | 219827 | 143603.7  | 76223.29   |
| 2023-06-01 | 288593 | 197948.3  | 90644.66   |
| 2023-07-01 | 345917 | 251477.6  | 94439.36   |
| 2023-08-01 | 326782 | 260896.0  | 65886.01   |
| 2023-09-01 | 268355 | 207090.1  | 61264.93   |
| 2023-10-01 | 260912 | 198662.3  | 62249.65   |
| 2023-11-01 | 193788 | 158215.7  | 35572.31   |
| 2023-12-01 | 172990 | 149414.0  | 23576.02   |
| 2024-01-01 | 167685 | 140624.2  | 27060.85   |
| 2024-02-01 | 193124 | 150050.8  | 43073.22   |

From the table, we observe that our highest error occurred in July 2023, where the actual number of passengers was 345,917, but our prediction was 251,477. Conversely, our best prediction was for January 2023, where the actual number of passengers was 162,554, and we predicted 145,608.

To better understand the errors of our predictions, we examine a plot comparing the actual and predicted values for the period from January 2023 to February 2024.

tidy_test <- gather(test, key = "variable", value = "value", pred, index)

plot1<-ggplot(tidy_test, aes(x = Dagsetning, y = value, color = variable)) +
  geom_line(data = subset(tidy_test, !(variable == "pred" & value == 0))) +
  labs(x="Date", y=" Number of passengers", title = "SARIMA", color= " ")

We can observe that the actual number of passengers passing through Keflavik Airport is consistently higher than our predicted number. Nevertheless, our model appears to effectively capture the seasonal variations in passenger numbers. However, from both the table and the plot, we observe that the difference between the actual and predicted values increases as the number of passengers passing through KEF airport increases. This suggests that our predictions become less accurate as the actual number of passengers rises.

8 Residuals

To check if our model meets the assumption of normally distributed residuals, we first visually inspect the residuals. This allows us to see if they resemble a typical bell-shaped curve. After the plot, we use the Shapiro-Wilk test to confirm if our residuals are indeed normally distributed.

# Extract residuals
residuals <- residuals(arima_model)

## Train rmse
train_rmse <- sqrt(mean(residuals^2))

# Create a data frame for plotting (convert the ts object to numeric to avoid a ggplot scale warning)
residuals_df <- data.frame(Residuals = as.numeric(residuals))

# Plot the residuals
ggplot(residuals_df, aes(x = Residuals)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Frequency")

qqnorm(residuals)
qqline(residuals)

The plots provide evidence suggesting that the residuals are not normally distributed. To further investigate, we apply the Shapiro-Wilk test.

shapiro.test1<-shapiro.test(residuals)
print(shapiro.test1)
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals
## W = 0.86182, p-value = 6.923e-06

The Shapiro-Wilk normality test results in a test statistic (W) of 0.86182 and a p-value of 6.923e-06. With such a small p-value, we reject the null hypothesis of normality, indicating that the residuals are not normally distributed.

SARIMAX Model

Accounting for COVID-19. Given that the assumption of normality was not met, our focus shifts to improving the model. Considering the significant disruption in air traffic caused by COVID-19, our first priority is to account for its effect in an attempt to further refine the model. We obtain the rolling 7-day average of COVID-19 deaths from “slóð”.

1 Get and clean the covid data

# Read the CSV file
covid <- read.csv("weekly_deaths.csv")

# Keep only the first two columns
covid <- covid[, c(1, 2)]

# Remove rows with missing values
covid <- na.omit(covid)

# Convert 'date' to Date format
covid$date <- as.Date(covid$date, format = "%Y-%m-%d")

# Extract month from 'date'
covid$Month <- format(covid$date, "%Y-%m")


# Calculate the sum of the rolling 7-day average of deaths for each month
covid<- covid %>%
  group_by(Month) %>%
  summarise(deaths = sum(World))

covid<-covid %>% rename(
  Dagsetning = Month
)

covid$Dagsetning <- as.Date(paste0(covid$Dagsetning, "-01"), format = "%Y-%m-%d")
data2<-merge(data,covid, by="Dagsetning", all.x = TRUE)
data2$deaths[is.na(data2$deaths)] <- 0

1.1 Create the test and train data

train2<-filter(data, Dagsetning <= "2022-12-01")
test2<-filter(data,Dagsetning > "2022-12-01")

test2<-merge(test2,covid,by= "Dagsetning", all.x=TRUE)
train2<-merge(train2,covid,by= "Dagsetning", all.x=TRUE)

## Set NA to 0
train2$deaths[is.na(train2$deaths)] <- 0
test2$deaths[is.na(test2$deaths)] <- 0

2 Number of passengers and covid deaths

We look at the correlation between the number of passengers and the monthly sum of the rolling 7-day average of deaths.

ggplot(data2,aes(deaths,index))+ geom_line()+scale_x_continuous(labels = comma)+scale_y_continuous(labels = comma)

correlation <- cor(data2$deaths, data2$index)
print(correlation)
## [1] -0.767186

We observe a correlation of about -0.77 between deaths and the number of passengers, indicating a relationship between these variables. Additionally, from the plot, it is evident that there are fewer passengers when deaths are high. Therefore, it is reasonable to include the number of COVID deaths as an exogenous variable in the model.
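The value reported by cor() is the Pearson correlation coefficient:

$$r = \frac{\sum_t (x_t - \bar{x})(y_t - \bar{y})}{\sqrt{\sum_t (x_t - \bar{x})^2}\,\sqrt{\sum_t (y_t - \bar{y})^2}},$$

where $x_t$ is the monthly death count and $y_t$ the passenger count.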

3 New model: SARIMAX

We will now fit a SARIMAX model with COVID deaths as an exogenous variable without lags. We will use the same parameters as in the first SARIMA model.

ts_data2 <- ts(train2$index, frequency = 12, start = c(2018, 1))
sarimax_model <- auto.arima(ts_data2, xreg = train2$deaths)

We observe that the best model from the auto.arima has the same parameters as our previous SARIMA model, indicating that we are essentially comparing the same model but with the addition of COVID deaths as an exogenous variable in our SARIMAX model.

4 Errors

Here, we forecast the number of passengers passing through Keflavik Airport for future months that the model has not seen, or our test data, and then calculate the test error.

## forecast values for the next 14 months (our test set); the horizon is implied by the length of xreg
forecast_values2 <- forecast(sarimax_model, xreg = test2$deaths)


## add the values to our test-set
test2$pred<-forecast_values2$mean
## model prediction errors.


# Residuals
errors<-test2$index-test2$pred

# Square the residuals
squared_errors <- errors^2

# Calculate the mean squared error
mse <- mean(squared_errors)

# Step 4: Calculate RMSE
rmse2 <- sqrt(mse)
tidy_test2 <- gather(test2, key = "variable", value = "value", pred, index)

plot2<-ggplot(tidy_test2, aes(x = Dagsetning, y = value, color = variable)) +
  geom_line() +
  labs(x="Date",y=" Number of passengers", title = "SARIMAX", color= " ")

The SARIMAX model exhibits a similar issue to the SARIMA model, where the discrepancy between the actual and predicted values increases with a higher number of passengers passing through KEF airport. This indicates that as the number of passengers increases, the accuracy of our predictions diminishes.

5 Residuals

To check whether our model meets the assumption of normally distributed residuals, we repeat the residual diagnostics.

# Extract residuals
residuals2 <- residuals(sarimax_model)

## Train rmse
train_rmse2 <- sqrt(mean(residuals2^2))

# Create a data frame for plotting
residuals_df2 <- data.frame(Residuals = residuals2)

# Plotting the residuals
residualplot2<-ggplot(residuals_df2, aes(x = Residuals)) +
  geom_histogram(bins = 20, fill = "skyblue", color = "black") +
  labs(title = "Histogram of Residuals",
       x = "Residuals",
       y = "Frequency")


shapiro.test2 <- shapiro.test(residuals2)

Comparing Models

comparison <- data.frame(
                   Model = c("SARIMA","SARIMAX"),
                   Test_errors = c(rmse, rmse2),
                   Train_errors = c(train_rmse,train_rmse2),
                   Normality_test = c(shapiro.test1$p.value,shapiro.test2$p.value))

kable(comparison)
| Model   | Test_errors | Train_errors | Normality_test |
|---------|-------------|--------------|----------------|
| SARIMA  | 57412.8     | 20798.09     | 6.9e-06        |
| SARIMAX | 56874.4     | 20542.96     | 6.9e-06        |
plot1

plot2

The table compares our models with regard to test and training error and the p-value from the Shapiro-Wilk Normality Test. We observe slightly better results from the SARIMAX model compared to the SARIMA model, as both our test and training errors have decreased. However, in the SARIMAX model, we still do not have normally distributed residuals.

Looking at the plots, we see that the predictions are very similar for SARIMA and SARIMAX.

Random Forest.

Given that the residuals are not normally distributed and the model’s accuracy diminishes for higher values of the y variable, we explore alternative methods for predicting the number of passengers. Here we fit a random forest model to forecast passenger traffic at Keflavik Airport from January 2023 to February 2024. To prepare the data, we enrich it with two new features derived from the date variable: month and year.

data<-read.csv("data2.csv")


data$Dagsetning<-as.Date(data$Dagsetning)
data$Month <- as.numeric(format(data$Dagsetning, "%m"))
data$Year <- as.numeric(format(data$Dagsetning, "%Y"))
# Split the data into training and testing sets

train <- data%>%filter(Dagsetning <= "2022-12-01")
test<- data%>%filter(Dagsetning > "2022-12-01")

Tune RF

Here we utilize the tuneRF function to fine-tune our second model, aiming to determine the optimal mtry value, which is the number of variables randomly sampled at each split in the decision trees. For the first model, rf1, we simply set mtry = 2, so that two predictors are sampled at each split.

## Model 2 where covid deaths are included


tuned_rf <- tuneRF(
  x = train[, -c(1,2)],  # Exclude the target variable
  y = train[, 2],   # Target variable
  improve = 0.05,
  trace = TRUE,
  plot = TRUE
)
## mtry = 1  OOB error = 1621646737 
## Searching left ...
## Searching right ...
## mtry = 2     OOB error = 1518971558 
## 0.06331538 0.05 
## mtry = 3     OOB error = 990554225 
## 0.3478784 0.05

We see that 3 is the best value for mtry in the second random forest model.

Build the models

rf_model <- randomForest(index ~ Month + Year +deaths,  data = train,mtry = 2, ntree = 500)
rf_model2 <- randomForest(index ~ Month + Year +deaths,  data = train,mtry = 3, ntree = 500)

Predictions

# Make predictions on the test data
predictions <- predict(rf_model, newdata = test)
predictions2 <- predict(rf_model2, newdata = test)
test$predicted<-predictions
test$predicted2<-predictions2

# Make predictions on the train data
predictions <- predict(rf_model, newdata = train)
predictions2 <- predict(rf_model2, newdata = train)
train$predicted<-predictions
train$predicted2<-predictions2
tidy_test <- gather(test, key = "variable", value = "value", predicted, index)
tidy_test2 <- gather(test, key = "variable", value = "value", predicted2, index)

rf1plot<-ggplot(tidy_test,aes(Dagsetning, value, color= variable))+ geom_line()+labs(x="Date",y=" Number of passengers", title = "RF1" , color= " ")
rf2plot<-ggplot(tidy_test2,aes(Dagsetning, value, color= variable))+ geom_line()+labs(x="Date",y=" Number of passengers", title = "RF2", color= " ")

Errors

## model prediction errors

# RMSE helper: root mean squared error between actual and predicted values
rmse_fun <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Test and train RMSE for the first random forest model
rftest_rmse  <- rmse_fun(test$index,  test$predicted)
rftrain_rmse <- rmse_fun(train$index, train$predicted)

# Test and train RMSE for the second random forest model
rftest_rmse2  <- rmse_fun(test$index,  test$predicted2)
rftrain_rmse2 <- rmse_fun(train$index, train$predicted2)

Comparing models

comparison2 <- data.frame(
                   Model = c("SARIMA","SARIMAX","RF1","RF2"),
                   Test_errors = c(rmse, rmse2,rftest_rmse,rftest_rmse2),
                   Train_errors = c(train_rmse,train_rmse2,rftrain_rmse,rftrain_rmse2))

kable(comparison2)
| Model   | Test_errors | Train_errors |
|---------|-------------|--------------|
| SARIMA  | 57412.80    | 20798.09     |
| SARIMAX | 56874.40    | 20542.96     |
| RF1     | 44769.83    | 19541.96     |
| RF2     | 38830.78    | 16153.25     |

We observe that both our Random Forest models outperform the time series models, SARIMA and SARIMAX. The Random Forest models exhibit lower test errors compared to both the time series models, indicating superior predictive performance for future data.

plot1
plot2
rf1plot
rf2plot