In this project, I conduct a time series analysis of the number
of passengers passing through Keflavik Airport and attempt to fit a
model that predicts the monthly number of passengers for data the model
has not seen. The data for the number of passengers passing through
Keflavik Airport was obtained from the Iceland Statistics Website: https://statice.is/statistics/business-sectors/tourism/passengers/.
First, I obtained the data for the number of passengers passing through Keflavik Airport each month from the year 2018 to February 2024 from the Iceland Statistics website. We will use the dates from January 2018 to January 2023 to train our models and then predict the number of passengers on unseen data for the dates from January 2023 to March 2024
data<-read.csv("fardegar.csv", sep = ";")
data$Dagsetning<- as.Date(paste0("01", data$Dagsetning), format = "%d%YM%m")## add 01 in front of each date, and make it a date variable.
## Create test and train data
train<-filter(data, Dagsetning <= "2022-12-01")
test<-filter(data,Dagsetning > "2022-12-01")
## plot our test and train sets
data$set <- ifelse(data$Dagsetning <= "2022-12-01", "train", "test")
ggplot(data,aes(Dagsetning, index, color= set))+ geom_line()
The plot shows the period we used to train our data and the period we used for prediction or testing our models.
# Find peak points
peak_points <- data[which(diff(sign(diff(data$index))) == -2) + 1, ]
ggplot(data, aes(x= Dagsetning, y = index)) + geom_point()+
geom_line()+
labs(x="Data", y = "Count")+
scale_y_continuous(labels = comma) +scale_x_date(date_labels = "%Y %b", breaks = "4 months")+ theme(axis.text.x = element_text(angle = 70, hjust = 1))
From the picture, we see a clear seasonal pattern in the number
of passengers at the airport. There are fewer passengers in the winter
months, and the number increases in the summer, usually peaking around
June, July, and August. We also observe a drop in the passenger count
from March 2020 to July 2021. This decrease can obviously be attributed
to the coronavirus pandemic. To further explore our data we will look at
some tables.
data%>%
group_by(Year=year(Dagsetning))%>%
summarize(Count = sum(index),
Mean = mean(index),
SD = sd(index),
MIN = min(index),
MIN_Month = month(Dagsetning, label = TRUE)[which.min(index)],
Max = max(index),
MAX_Month = month(Dagsetning, label = TRUE)[which.max(index)])%>%kable()
Year | Count | Mean | SD | MIN | MIN_Month | Max | MAX_Month |
---|---|---|---|---|---|---|---|
2018 | 2984018 | 248668.17 | 59875.55 | 186636 | Jan | 347475 | Aug |
2019 | 2597536 | 216461.33 | 45871.03 | 168850 | Dec | 300366 | Aug |
2020 | 609337 | 50778.08 | 60937.53 | 1262 | Apr | 167496 | Feb |
2021 | 907058 | 75588.17 | 64259.29 | 5339 | Feb | 172418 | Aug |
2022 | 2283435 | 190286.25 | 69814.20 | 82864 | Jan | 299473 | Jul |
2023 | 2814780 | 234565.00 | 62085.68 | 162554 | Jan | 345917 | Jul |
2024 | 360809 | 180404.50 | 17988.09 | 167685 | Jan | 193124 | Feb |
From the table, we observe that the year 2018 had the highest traffic at Keflavík Airport. There is also a noticeable decline in passenger numbers during the COVID-19 period from 2020 to 2021. In 2023, the count is approaching pre-pandemic levels. Additionally, January typically has lower passenger numbers, while August tends to have higher numbers, although this is not always the case
ts.plot(data$index)
ts.plot((diff(data$index)))
To make the data stationary, we need to differentiate the index variable.
## test for stationary
adf.test((data$index), alternative="stationary", k=0)
##
## Augmented Dickey-Fuller Test
##
## data: (data$index)
## Dickey-Fuller = -1.6452, Lag order = 0, p-value = 0.7201
## alternative hypothesis: stationary
adf.test((diff(data$index)), alternative="stationary", k=0)
##
## Augmented Dickey-Fuller Test
##
## data: (diff(data$index))
## Dickey-Fuller = -5.513, Lag order = 0, p-value = 0.01
## alternative hypothesis: stationary
The adf.test confirms that the data must be differenced to be stationary.
## Create a time-series object from the train-date with a monthly frequency, starting from the earliest date from the train-dataþ
ts_data <- ts(train$index, frequency = 12, start = min(data$Dagsetning))
acf(diff(ts_data))
pacf(diff(ts_data))
Our acf and pacf suggest using AR(1) and MA(0)
### use auto.arima to find optimal parameters according to AIC
auto.arima(ts_data)
## Series: ts_data
## ARIMA(1,1,0)(0,1,1)[12]
##
## Coefficients:
## ar1 sma1
## 0.3800 -0.8322
## s.e. 0.1374 0.6822
##
## sigma^2 = 576747308: log likelihood = -545.8
## AIC=1097.6 AICc=1098.16 BIC=1103.15
## save the best model
arima_model <- auto.arima(ts_data)
## save fitted values
fitt<-fitted(arima_model)
## add fitted values to the train-set
train$fitted<-fitt
We see that the auto.arima function finds that the “d” parameter is best set to 1, indicating that the data is differenced once to make it stationary, as suggested in chapter 2. The model includes both seasonal and non-seasonal components. For the seasonal aspect, it has a seasonal moving average component with order (0,1,1) and a seasonal period of 12 months, denoted as (0,1,1)[12]. In terms of the non-seasonal aspect, it features a non-seasonal autoregressive component with order (1,1,0).
Here we forecast the number of passengers passing through Keflavik Airport for future months that the model has not seen, or our test data.
## forcast values for the next 14 months(our test-set)
forecast_values <- forecast(arima_model, h = 14)
## add the values to our test-set
test$pred<-forecast_values$mean
We’ll use RMSE (Root Mean Squared Error) to see, on average, how much our predictions differ from the actual values
# Residuals
errors<-test$index-test$pred
# Square the residuals
squared_errors <- errors^2
# Calculate the mean squared error
mse <- mean(squared_errors)
# Step 4: Calculate RMSE
rmse <- sqrt(mse)
# Print RMSE
print(rmse)
## [1] 57412.8
On average, our predictions for the number of passengers passing through Keflavik Airport each month from January 2023 to April 2024 were off by approximately 57,412.8 passengers. To get a better idea of the magnitude of the error, we can compare the actual and predicted values for each month.
test$diff<-test$index-test$pred
bla<-test$diff^2
mean<-mean(bla)
bla2<-sqrt(mean)
mean(test$diff)
## [1] 51923.72
test %>%
rename(Date = Dagsetning,
Actual = index,
Difference = diff,
Predicted = pred)%>%gt()
Date | Actual | Predicted | Difference |
---|---|---|---|
2023-01-01 | 162554 | 145608.5 | 16945.47 |
2023-02-01 | 176036 | 154055.7 | 21980.35 |
2023-03-01 | 200693 | 157168.2 | 43524.80 |
2023-04-01 | 198333 | 133841.9 | 64491.09 |
2023-05-01 | 219827 | 143603.7 | 76223.29 |
2023-06-01 | 288593 | 197948.3 | 90644.66 |
2023-07-01 | 345917 | 251477.6 | 94439.36 |
2023-08-01 | 326782 | 260896.0 | 65886.01 |
2023-09-01 | 268355 | 207090.1 | 61264.93 |
2023-10-01 | 260912 | 198662.3 | 62249.65 |
2023-11-01 | 193788 | 158215.7 | 35572.31 |
2023-12-01 | 172990 | 149414.0 | 23576.02 |
2024-01-01 | 167685 | 140624.2 | 27060.85 |
2024-02-01 | 193124 | 150050.8 | 43073.22 |
From the table, we observe that our highest error occurred in July 2023, where the actual number of passengers was 345,917, but our prediction was 251,477. Conversely, our best prediction was for January 2023, where the actual number of passengers was 162,554, and we predicted 145,608.
To better understand the errors of our predictions, we examine a plot comparing the actual and predicted values for the period from January 2023 to April 2024
tidy_test <- gather(test, key = "variable", value = "value", pred, index)
plot1<-ggplot(tidy_test, aes(x = Dagsetning, y = value, color = variable)) +
geom_line(data = subset(tidy_test, !(variable == "pred" & value == 0))) +
labs(x="Date", y=" Number of passengers", title = "SARIMA", color= " ")
We can observe that the actual number of passengers passing through Keflavik Airport is consistently higher than our predicted number. Nevertheless, our model appears to effectively capture the seasonal variations in passenger numbers. However, from both the table and the plot, we observe that the difference between the actual and predicted values increases as the number of passengers passing through KEF airport increases. This suggests that our predictions become less accurate as the actual number of passengers rises
To check if our model meets the assumption of normally distributed residuals, we first visually inspect the residuals. This allows us to see if they resemble a typical bell-shaped curve. After the plot, we use the Shapiro-Wilk test to confirm if our residuals are indeed normally distributed
# Extract residuals
residuals <- residuals(arima_model)
## Train rmse
train_rmse <- sqrt(mean(residuals^2))
# Create a data frame for plotting
residuals_df <- data.frame(Residuals = residuals)
# Plotting the residuals
ggplot(residuals_df, aes(x = Residuals)) +
geom_histogram(bins = 20, fill = "skyblue", color = "black") +
labs(title = "Histogram of Residuals",
x = "Residuals",
y = "Frequency")
## Don't know how to automatically pick scale for object of type <ts>. Defaulting
## to continuous.
qqnorm(residuals)
qqline(residuals)
The plots provide evidence suggesting that the residuals are not normally distributed. To further investigate, we apply the Shapiro-Wilk test.
shapiro.test1<-shapiro.test(residuals)
print(shapiro.test1)
##
## Shapiro-Wilk normality test
##
## data: residuals
## W = 0.86182, p-value = 6.923e-06
The Shapiro-Wilk normality test results in a test statistic (W) of 0.86182 and a p-value of 6.923e-06. With such a small p-value, we reject the null hypothesis of normality, indicating that the residuals are not normally distributed.
Accounting for covid 19 Given that the assumptions of normality were not met, our focus shifts to improving the model. Considering the significant disruption in air traffic caused by COVID-19, our first priority is to account for its effect in an attempt to further refine the model. We get get the moving 7 day moving babbabab from “slóð”
# Read the CSV file
covid <- read.csv("weekly_deaths.csv")
# Keep only the first two columns
covid <- covid[, c(1, 2)]
# Remove rows with missing values
covid <- na.omit(covid)
# Convert 'date' to Date format
covid$date <- as.Date(covid$date, format = "%Y-%m-%d")
# Extract month from 'date'
covid$Month <- format(covid$date, "%Y-%m")
# Calculate the sum of rolling 7 day average of deaths for each month
covid<- covid %>%
group_by(Month) %>%
summarise(deaths = sum(World))
covid<-covid %>% rename(
Dagsetning = Month
)
covid$Dagsetning <- as.Date(paste0(covid$Dagsetning, "-01"), format = "%Y-%m-%d")
data2<-merge(data,covid, by="Dagsetning", all.x = TRUE)
data2$deaths[is.na(data2$deaths)] <- 0
train2<-filter(data, Dagsetning <= "2022-12-01")
test2<-filter(data,Dagsetning > "2022-12-01")
test2<-merge(test2,covid,by= "Dagsetning", all.x=TRUE)
train2<-merge(train2,covid,by= "Dagsetning", all.x=TRUE)
## Set NA to 0
train2$deaths[is.na(train2$deaths)] <- 0
test2$deaths[is.na(test$deaths)] <- 0
We look at the Correlation between number of passengers and sum of rolling 7 day average of deaths for each month
ggplot(data2,aes(deaths,index))+ geom_line()+scale_x_continuous(labels = comma)+scale_y_continuous(labels = comma)
corrilation<-cor(data2$deaths, data2$index)
print(corrilation)
## [1] -0.767186
We observe a correlation of -0.76 between deaths and the number of passengers, indicating a relationship between these variables. Additionally, from the plot, it is evident that there are fewer passengers when deaths are high. Therefore, it is reasonable to include the number of COVID deaths as an exogenous variable in the model.
We fil now fit a SARIMAX model with covid deaths as a exogenous variables without lags. We will use the same parameters as using in the first SARIMA model
deaths<-train2$deaths
ts_data2 <- ts(train2$index, frequency = 12, start = min(train2$Dagsetning))
sarimax_model<-auto.arima(ts_data2 , xreg = train2$deaths)
We observe that the best model from the auto.arima has the same parameters as our previous SARIMA model, indicating that we are essentially comparing the same model but with the addition of COVID deaths as an exogenous variable in our SARIMAX model.
Here, we forecast the number of passengers passing through Keflavik Airport for future months that the model has not seen, or our test data, and then calculate the test error.
## forcast values for the next 14 months(our test-set)
forecast_values2 <- forecast(sarimax_model, xreg = test2$deaths)
## add the values to our test-set
test2$pred<-forecast_values2$mean
## model prediction errors.
# Residuals
errors<-test2$index-test2$pred
# Square the residuals
squared_errors <- errors^2
# Calculate the mean squared error
mse <- mean(squared_errors)
# Step 4: Calculate RMSE
rmse2 <- sqrt(mse)
tidy_test2 <- gather(test2, key = "variable", value = "value", pred, index)
plot2<-ggplot(tidy_test2, aes(x = Dagsetning, y = value, color = variable)) +
geom_line() +
labs(x="Date",y=" Number of passengers", title = "SARIMAX", color= " ")
The SARIMAX model exhibits a similar issue to the SARIMA model, where the discrepancy between the actual and predicted values increases with a higher number of passengers passing through KEF airport. This indicates that as the number of passengers increases, the accuracy of our predictions diminishes.
To check if our model meets the assumption of normally distributed residuals
# Extract residuals
residuals2 <- residuals(sarimax_model)
## Train rmse
train_rmse2 <- sqrt(mean(residuals2^2))
# Create a data frame for plotting
residuals_df2 <- data.frame(Residuals = residuals2)
# Plotting the residuals
residualplot2<-ggplot(residuals_df2, aes(x = Residuals)) +
geom_histogram(bins = 20, fill = "skyblue", color = "black") +
labs(title = "Histogram of Residuals",
x = "Residuals",
y = "Frequency")
shapiro.test2<-shapiro.test(residuals)
comparison <- data.frame(
Model = c("SARIMA","SARIMAX"),
Test_errors = c(rmse, rmse2),
Train_errors = c(train_rmse,train_rmse2),
Normality_test = c(shapiro.test1$p.value,shapiro.test2$p.value))
kable(comparison)
Model | Test_errors | Train_errors | Normality_test |
---|---|---|---|
SARIMA | 57412.8 | 20798.09 | 6.9e-06 |
SARIMAX | 56874.4 | 20542.96 | 6.9e-06 |
plot1
plot2
The table compares our models with regard to test and training error and the p-value from the Shapiro-Wilk Normality Test. We observe slightly better results from the SARIMAX model compared to the SARIMA model, as both our test and training errors have decreased. However, in the SARIMAX model, we still do not have normally distributed residuals.
lookin at the plots se see that the predictions are very similar for SARIMA and SARIMAX.
Given that the residuals are not normally distributed and the model’s accuracy diminishes for higher values of the y variable, we explore alternative methods for predicting the number of passengers. Here we fit a random forest model to forecast passenger traffic at Keflavik Airport from January 2023 to March 2024. To prepare the data, we’re enriching it with features derived from the date variable. This includes adding two new variables based on the date: month and year.
data<-read.csv("data2.csv")
data$Dagsetning<-as.Date(data$Dagsetning)
data$Month <- as.numeric(format(data$Dagsetning, "%m"))
data$Year <- as.numeric(format(data$Dagsetning, "%Y"))
# Split the data into training and testing sets
train <- data%>%filter(Dagsetning <= "2022-12-01")
test<- data%>%filter(Dagsetning > "2022-12-01")
Here we utilize the tuneRF function to fine-tune our second model, aiming to determine the optimal mtry value, which represents the number of variables randomly sampled at each split in the decision trees. For the first model without the deaths variable, we decide to sample both variables month and year at each split, so we set m=2 in rf1
## Model 2 where covid deaths are included
tuned_rf <- tuneRF(
x = train[, -c(1,2)], # Exclude the target variable
y = train[, 2], # Target variable
improve = 0.05,
trace = TRUE,
plot = TRUE
)
## mtry = 1 OOB error = 1621646737
## Searching left ...
## Searching right ...
## mtry = 2 OOB error = 1518971558
## 0.06331538 0.05
## mtry = 3 OOB error = 990554225
## 0.3478784 0.05
We that 3 the is the best value for mtry in in the random forest 2 model.
rf_model <- randomForest(index ~ Month + Year +deaths, data = train,mtry = 2, ntree = 500)
rf_model2 <- randomForest(index ~ Month + Year +deaths, data = train,mtry = 3, ntree = 500)
# Make predictions on the test data
predictions <- predict(rf_model, newdata = test)
predictions2 <- predict(rf_model2, newdata = test)
test$predicted<-predictions
test$predicted2<-predictions2
# Make predictions on the train data
predictions <- predict(rf_model, newdata = train)
predictions2 <- predict(rf_model2, newdata = train)
train$predicted<-predictions
train$predicted2<-predictions2
tidy_test <- gather(test, key = "variable", value = "value", predicted, index)
tidy_test2 <- gather(test, key = "variable", value = "value", predicted2, index)
rf1plot<-ggplot(tidy_test,aes(Dagsetning, value, color= variable))+ geom_line()+labs(x="Date",y=" Number of passengers", title = "RF1" , color= " ")
rf2plot<-ggplot(tidy_test2,aes(Dagsetning, value, color= variable))+ geom_line()+labs(x="Date",y=" Number of passengers", title = "RF2", color= " ")
## model prediction errors.
# Residuals
test_errors<-test$index-test$predicted
# Square the residuals
test_squared_errors <- test_errors^2
# Calculate the mean squared error
test_mse <- mean(test_squared_errors)
# Step 4: Calculate RMSE
rftest_rmse <- sqrt(test_mse)
# Extract residuals
## Train rmse
# Residuals
train_errors<-train$index-train$predicted
# Square the residuals
train_squared_errors <- train_errors^2
# Calculate the mean squared error
train_mse <- mean(train_squared_errors)
# Step 4: Calculate RMSE
rftrain_rmse <- sqrt(train_mse)
## model prediction errors.
# Residuals
test_errors2<-test$index-test$predicted2
# Square the residuals
test_squared_errors2 <- test_errors2^2
# Calculate the mean squared error
test_mse2 <- mean(test_squared_errors2)
# Step 4: Calculate RMSE
rftest_rmse2 <- sqrt(test_mse2)
# Extract residuals
## Train rmse
# Residuals
train_errors2<-train$index-train$predicted2
# Square the residuals
train_squared_errors2 <- train_errors2^2
# Calculate the mean squared error
train_mse2 <- mean(train_squared_errors2)
# Step 4: Calculate RMSE
rftrain_rmse2 <- sqrt(train_mse2)
comparison2 <- data.frame(
Model = c("SARIMA","SARIMAX","RF1","RF2"),
Test_errors = c(rmse, rmse2,rftest_rmse,rftest_rmse2),
Train_errors = c(train_rmse,train_rmse2,rftrain_rmse,rftrain_rmse2))
kable(comparison2)
Model | Test_errors | Train_errors |
---|---|---|
SARIMA | 57412.80 | 20798.09 |
SARIMAX | 56874.40 | 20542.96 |
RF1 | 44769.83 | 19541.96 |
RF2 | 38830.78 | 16153.25 |
We observe that both our Random Forest models outperform the time series models, SARIMA and SARIMAX. The Random Forest models exhibit lower test errors compared to both the time series models, indicating superior predictive performance for future data.
plot1
plot2
rf1plot
rf2plot