10 Appendix

10.1 Appendix 1

In this first appendix, we show a function that could be used to extract the coordinates from the speed limit dataset. This is a somewhat challenging task, as the number of coordinates in the geom column varies across observations, mainly because some streets extend beyond a single road segment (two coordinates) and can be attributed to multiple segments. We use regex and string manipulation to accomplish this. As explained in the first chapter of our report, we decided not to use this method because it was computationally too expensive, and we opted for a simpler approach (importing the data as GeoJSON), which nonetheless gives good results.

# Please note that this function can take a while to run
# Define a function to split the geom column into coordinate pairs
coord_parse <- function(str) {
  coords <- str %>%
    str_extract("((-?\\d{1,3}(\\.\\d+)?[:blank:]-?\\d{1,3}(\\.\\d+)?)(,[:blank:])?)+") %>%
    str_split(", ") %>%
    unlist() %>%
    str_split(" ")
  n <- length(coords)
  coords %<>% unlist() %>%
    as.double() %>%
    matrix(ncol = n) %>% # one column per coordinate pair
    t()                  # transpose so each row is (longitude, latitude)
  colnames(coords) <- c("longitude", "latitude")
  return(as_tibble(coords))
}
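
As a quick illustration, this is how the parser behaves on a single, made-up geom string (the string below is purely illustrative; the exact formatting in the original file may differ slightly):

# Illustrative call on a made-up geom string (format assumed, not taken from the actual file);
# it should return a two-row tibble with columns longitude and latitude, one row per coordinate pair.
coord_parse("MULTILINESTRING ((-73.9566 40.7688, -73.9541 40.7701))")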

# This will create a nested tibble
speed_limits %<>% ungroup() %>%
  mutate(coordinates = map(the_geom, coord_parse)) %>%
  unnest(coordinates) # Note that by unnesting we get multiple rows for the same street, but that is fine: not only will we not work with this CSV, but as a nested tibble it is actually meaningful, since it could potentially be used to compute the distance between the two ends of a street.
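
As a sketch of that last point, the nested version (i.e., stopping before the unnest() step) could be used to approximate the length of each street as the great-circle distance between its first and last coordinate. The geosphere package and the length_m column name are our own illustrative choices here, not part of the pipeline described above:

library(geosphere) # for distHaversine(); assumed to be installed

street_lengths <- speed_limits %>%
  mutate(length_m = map_dbl(coordinates, function(tb) {
    distHaversine(
      c(tb$longitude[1], tb$latitude[1]),              # first end of the street
      c(tb$longitude[nrow(tb)], tb$latitude[nrow(tb)]) # last end of the street
    )
  }))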

10.2 Appendix 2

In this appendix, we look at the slower approach to joining the speed limit dataset with the collision points. This method could be used to split the street coordinates of the speed limits dataset so that the longitude and the latitude end up in two different columns. Despite the effort we put into writing and trying this function, our machines have limited processing power; in the trade-off between the time we would have to dedicate to this process and its result, compared with the result of the simpler and faster process, the latter wins.

However, we want to give the details of the first process for those who are interested.

Explanation of the approach: this method turns the accidents into spatial data points and the streets into polygons (lines), and then applies a k-nearest-neighbours join. Even though it is a very precise approach, we decided to discard it because of its very long computational time (it took 100 hours to complete), mainly due to the already mentioned limited processing power of the machines we own. However, we thought it was worth mentioning, as it is very interesting, possibly more precise, and we spent quite some time on it.

# Libraries needed for this chunk (if not already loaded in the setup chunk)
library(geojsonio) # geojson_read()
library(sf)        # st_as_sf(), st_set_crs(), st_transform(), st_join()
library(nngeo)     # st_nn(), the k-nearest-neighbours join predicate

# Import the cleaned collision_weather dataset as CSV

# Import the speed_limits as a GeoJSON file
speed_limits_geojson <- geojson_read(here::here("Data/VZV_Speed Limits.geojson"), what = "sp")

# Turn the collisions into points (example with the first 16 instances)
collision_points <- collision_with_weather %>%
  head(16) %>%
  st_as_sf(coords = c("LONGITUDE", "LATITUDE")) %>%
  st_set_crs(4269)

# Turn the speed limits into polygons (lines) and transform the geographical coordinates
speed_limit_polygons <- speed_limits_geojson %>%
  st_as_sf() %>%
  st_transform(4269)

# Use a k-nearest-neighbours join (st_nn) to match each collision to a street
joined_collision_sl <- st_join(collision_points, speed_limit_polygons,
                               join = st_nn, k = 1, maxdist = 500)

write.csv(x = joined_collision_sl, file = here::here("Data/collision_speedlimit.csv"))
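
A quick sanity check on the joined sample could look like the sketch below. We did not run this as part of the report, and the postvz_sl column name is an assumption about the speed-limit field in the GeoJSON file:

# Count how many of the 16 sample points found a speed-limit segment
# within the 500 m maxdist (unmatched points come back as NA).
joined_collision_sl %>%
  st_drop_geometry() %>%                      # drop the sf geometry column
  summarise(matched = sum(!is.na(postvz_sl)), # postvz_sl: assumed field name
            total = n())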

10.3 Appendix 3

As already done in the variable selection chapter, for our modelling we thought it was a good idea to run a Lasso regression to select the variables that are the most important contributing factors to the accidents, since, once we join all the datasets, the importance could vary quite a lot (due to correlations between the different variables, among other things). So, we started by creating the following function.

model <- function(x, y, z) {
  # Define the penalty factor: a 0 entry means the variable is always kept
  p.fac <- rep(1, ncol(x))
  p.fac[c(1:z)] <- 0 # the first z variables are forced into the model
  # Next we find a list of lambdas using cross-validation
  con_sel_lasso <-
    cv.glmnet(x,
              y,
              alpha = 1,
              family = "binomial",
              penalty.factor = p.fac) # Note that here we force-include the first z variables (everything other than the contributing factors)
  
  # Next we run the model with the 1-SE lambda
  con_sel_lasso_1se <-
    glmnet(
      x,
      y,
      alpha = 1,
      family = "binomial",
      lambda = con_sel_lasso$lambda.1se,
      penalty.factor = p.fac
    )
  
  # To graph it
  beta <- coef(con_sel_lasso_1se)
  
  # We can do the plot later if we wanted to
  
  # Confusion matrix
  probabilities <- con_sel_lasso_1se %>%
    predict(newx = x, type = "response") # type = "response" returns probabilities rather than the linear predictor
  predicted_inj_death <- ifelse(probabilities > 0.5, 1, 0)
  con_cm_1se <-
    confusionMatrix(data = as.factor(predicted_inj_death),
                    reference = as.factor(y))
  
  # List of the important ones
  c <- coef(con_sel_lasso_1se, exact = TRUE)
  inds <- which(c != 0)
  imp_variables <- row.names(c)[inds]
  imp_variables <- imp_variables[!(imp_variables %in% '(Intercept)')]
  imp_variables
}
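
For the plot alluded to in the function body, a possible sketch would be the following (it assumes beta is also returned by the function, which our version above does not do, so treat it purely as an illustration):

# Sketch only: turn the sparse coefficient vector into a tibble and plot
# the non-zero 1-SE coefficients. `beta` is assumed to be available here.
beta_df <- tibble(variable = rownames(beta),
                  coefficient = as.numeric(beta)) %>%
  filter(coefficient != 0, variable != "(Intercept)")

ggplot(beta_df, aes(x = reorder(variable, coefficient), y = coefficient)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "1-SE Lasso coefficient")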

The function above is quite similar to what we used in the variable selection section. At the next stage, we define our variables x, y and z in the chunk right below. For example, to answer the first research question, we ran the function on the aforementioned variables (z being the number of leading variables of the dataset that we want to “force” into the model because they are part of our research questions, such as the weather variables, the speed limits and the multiple-car variable).

# Turn the response and explanatory variables into matrices
x <- model.matrix(persons_killed ~ ., training.set)[, -1]
y <- as.matrix(training.set$persons_killed)
z <- 17 # number of leading columns we want to force into the model

set.seed(123)
question1 <- model(x, y, z)

# Selection of important variables
final_data <- training.set %>%
  select(persons_killed | which(colnames(.) %in% question1))
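
A simple check of whether the forcing behaved as intended could be the following sketch (the forced_vars name is ours and this check was not part of our final code):

# The first z columns of x are the ones we forced in; if the forcing worked,
# none of them should be missing from the selected variables.
forced_vars <- colnames(x)[1:z]
setdiff(forced_vars, question1) # should come back empty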

The problem is that using this function not only took about half an hour of computation for each call, but it also sometimes did not converge, or was unable to select any variable, and, even when it did, the selected variables sometimes did not include the ones we were “forcing” into the model.

Moreover, not being very familiar with this type of function and regression, we were not able to detect the problems and resolve them.

These are the main reasons why we decided to move our analysis towards a regression with which we were more familiar and that would have given us the results we needed to proceed with the rest of the modelling.