9 Conclusion

First of all, let’s have an overall look at which model we should use for each research question.

Model selection for each research question:

  • Question 1: model 1
  • Question 2: model 1
  • Question 3: model 2

Generally, one could argue that, since the difference in AIC between the two models was not large in any of our cases, the second model could be used throughout. The second model is the simpler construct: the weather variables enter only as binaries (e.g. rainy days). This means that no measurement of the weather conditions is needed; we simply record whether it rained, snowed or was windy on that day (although for wind a binary variable may not make sense at all). This could lower the cost for NYC in terms of both data gathering and computation time. The decision of which model to use therefore depends very much on the resources available to the city.
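To make the comparison concrete, the short Python sketch below puts the AIC values reported later in this chapter side by side. It only computes the absolute and relative differences and marks which model has the lower AIC. The dictionary layout and the helper `compare_aic` are our own illustrative names and are not part of the original R analysis.

```python
# AIC values as reported in this chapter: (model 1, model 2) per research question.
aic = {
    "Question 1": (8063.358, 8254.703),
    "Question 2": (269335.4, 269267.4),
    "Question 3": (288842.8, 289280.7),
}


def compare_aic(aic_model_1, aic_model_2):
    """Return (AIC2 - AIC1, relative difference in %, model with the lower AIC).

    Hypothetical helper for illustration; a lower AIC indicates the preferred model.
    """
    diff = aic_model_2 - aic_model_1
    rel = 100 * abs(diff) / min(aic_model_1, aic_model_2)
    lower = 1 if diff > 0 else 2
    return diff, rel, lower


summary = {question: compare_aic(*values) for question, values in aic.items()}

for question, (diff, rel, lower) in summary.items():
    print(
        f"{question}: AIC2 - AIC1 = {diff:+.1f} "
        f"({rel:.2f}% of the lower AIC), lower AIC: model {lower}"
    )
```

In relative terms the gaps are small in all three cases. AIC is not the only criterion used in this chapter, since the final choice also weighs the behaviour of the predictions and the simplicity of the model.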

We will now provide an overall answer to each of our research questions.

The three research questions this project aimed to answer were the following:

  1. Are fatalities caused by accidents lower on roads with a speed limit of 25 MPH relative to their 30 and 40 MPH counterparts?

    New York City believes that “Pedestrians struck by vehicles travelling at 25 MPH are half as likely to die as those struck at 30 MPH”. Although we do not have experimental data to test this statement under laboratory conditions, we can test whether a 25 MPH speed limit, compared with a limit of 30 MPH or higher, has helped to halve fatalities.

The AIC for the first model is 8063.358, while for the second one it is 8254.703; hence our suggestion is to go for the first one. In terms of prediction, however, both models perform very poorly. As we could see from the variable selection as well as from our final predictions, neither model developed in this project is predictive enough to estimate the fatalities from an accident given only the speed limit of the street where it happens. A deeper analysis is needed, with more data on the actual speed of the cars involved at the time of the crash and on the other values that are currently missing. However, we could confirm that the speed limit is significant in the prediction of a fatality in a car accident at the 0.01% level. The coefficient tells us that the probability of a fatality does not double, but it does increase by 50%.
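As a reading aid for the 50% figure, the sketch below shows how a coefficient is usually turned into a percentage effect. It assumes that the glm is a logistic regression (binomial family with a logit link), which this chapter does not state explicitly. In that case exp(coefficient) multiplies the odds rather than the probability. The value `beta_example` is hypothetical, chosen only so that the odds rise by 50%; it is not the fitted coefficient.

```python
import math


def odds_change_percent(beta):
    """Percent change in the odds for a one-unit increase in a predictor
    whose logit coefficient is beta."""
    return 100 * (math.exp(beta) - 1)


def probability_after(p_baseline, beta):
    """Probability obtained after multiplying the baseline odds by exp(beta)."""
    odds = p_baseline / (1 - p_baseline) * math.exp(beta)
    return odds / (1 + odds)


# Hypothetical coefficient (NOT taken from the fitted model): log(1.5),
# i.e. a 50% increase in the odds.
beta_example = math.log(1.5)

print(f"Change in odds: {odds_change_percent(beta_example):.1f}%")
for p0 in (0.01, 0.40):
    p1 = probability_after(p0, beta_example)
    print(f"baseline {p0:.2f} -> {p1:.4f} (probability ratio {p1 / p0:.2f})")
```

Under this assumption, a 50% increase in the odds is close to a 50% increase in the probability only when the baseline probability is small, as is typically the case for rare events such as fatalities. With a baseline of 40%, the same coefficient moves the probability to 50%, which is a relative increase of 25%.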


  2. Do weather conditions (e.g. heavy rain, snow and average wind speed) have a notable influence on road accidents, particularly for multiple-car and single-car accidents?

The AIC for the first model is 269335.4, while for the second model it is 269267.4. The difference here is minimal, so one could argue that going for the simpler model is also a viable option. The predictions are acceptable overall, although they show quite a lot of outliers and can still be improved.

What is interesting here is that the two methods we are using (the regression tree for the variable selection and the glm for the regression itself) do not give the same results in terms of the importance of the weather variables. The regression tree gives very low importance to the features we are interested in, so low that they do not make the cut to become any of the nodes. However, once we force them into the second part of the analysis and run the glm, they turn out to be significant, and their influence is always around 50%. More specifically, the “_light” level of the weather conditions is not important, with the exception of snow, which is the only variable whose _light level is significant. It has to be noted, though, that this level of snow covers only a few instances. On the other hand, both the “_moderate” and “_heavy” levels are significant for wind and rain. As binary variables, the weather variables follow the same pattern: snow is the least significant one (p-value of 0.01), while rain and wind are both very significant (p-value reported as 0). Interestingly, however, both rain and snow are negatively associated with the prediction of multiple-car accidents. This could be due to some correlation between the variables, or to the fact that when the weather is bad people prefer to stay inside rather than take the car and go out, which would reduce the number of accidents.

Regarding the wind variable, it should be noted that it takes almost only positive values: once we ignore the first two months, for which the values are missing, it appears that NYC was windy throughout 2019. In any case, the variable seems to play an important role, increasing the probability of a crash involving more than one car by 54%.

This was quite surprising for us, as the EDA did not reveal any correlation between these variables. In general, we can say that the weather conditions do indeed have an influence on multiple-car accidents. However, due to the inconsistency of the results, we suggest that a deeper and more detailed analysis should be carried out.


  3. Can we estimate the probability of an injury in an accident using all our variables: the weather, the reason for the accident provided by the police from previous cases (contributing factors) and, lastly, the speed limits?

Here we have an AIC of 288842.8 for the first model and 289280.7 for the second one. As already mentioned in the previous chapter, we nevertheless suggest the second model, mainly because of the distribution of its predictions: it performs better in forecasting positive values, which is the category we are interested in, since these are the cases in which an injury did happen. In other words, in our opinion it is better to be prepared for an injury that does not occur than to be unprepared for one that does. However, it has to be noted that the boxplot of the predictions shows a lot of outliers, meaning that adjustments are needed.

Among the most important variables in both cases we have loss_of_consciousness, followed by passing_too_closely and disregard_traffic_control. There were some surprising results, such as the fact that passing too closely shows a negative coefficient (we were expecting it to be positively correlated with injuries). One could argue that it is negative because drivers who were about to pass close to another car, or another road user, were not going very fast, so that when they hit the other vehicle they did not cause any injuries. Another surprising result is that the variable pedestrian_byciclist_other has high importance in the prediction of injuries, whereas the exploration of the data had shown quite a low correlation between this variable and the injuries. Also, the multiple-car variable we created lowers the probability of an injury in a car crash. We were expecting it to be positively correlated, as we thought that with more people involved there would be a higher chance of one of them getting injured. A deeper analysis has to be carried out in this respect. The unsafe_lane_changing variable also lowers the probability of an injury. The last observation we could make is that none of the weather variables was important enough to make the cut into the glm model.

The probability of an injury with all the other features equal to 0 is also quite surprising: it is more than 20% for the first model and more than 30% for the second one. This is rather worrying, as it means that when an accident happens there is at least a 20% probability of an injury.
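The baseline probability mentioned above is what one obtains by passing the model intercept through the inverse of the link function. The sketch below illustrates this under the assumption of a logit link, which is not stated explicitly in this chapter. It also works backwards from the 20% and 30% thresholds to the smallest intercept compatible with them. The fitted intercepts themselves are not reported here, so no actual model value is used.

```python
import math


def inv_logit(x):
    """Map a linear predictor (here: the intercept alone) to a probability."""
    return 1 / (1 + math.exp(-x))


def logit(p):
    """Inverse of inv_logit: the linear predictor that yields probability p."""
    return math.log(p / (1 - p))


# Lower bounds on the baseline probability quoted in the text.
baseline_lower_bounds = {"model 1": 0.20, "model 2": 0.30}

for name, p in baseline_lower_bounds.items():
    print(
        f"{name}: a baseline probability above {p:.0%} "
        f"corresponds to an intercept above {logit(p):.3f}"
    )
```

Note that this baseline describes an accident in which every predictor is 0, so it is only meaningful if that combination of values actually occurs in the data.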

What can be argued here is that there could be a problem of multicollinearity. Moreover, some variables still present in the model show no significance in the prediction of the dependent variable, even though we ran a selection beforehand. This could mean that the regression tree may not be accurate enough as a variable selection process.

In general, the contributing factors provided by the police from previous cases supplied some of the most important variables, as already mentioned. The weather variables were not important, while the speed limit appeared to positively influence the probability of an injury. Due to the high number of outliers, however, we suggest that more adjustments have to be made and that a deeper analysis has to be carried out, especially regarding the surprising negative coefficients we have found.


9.1 Last words

One of the main personal objectives of this research was to get more comfortable with large datasets and to be able to carry out an accurate analysis of them, while also learning to make better use of the R software and all of the features it has available (such as the creation of a booklet, the interactive maps and tables, the slides, etc.). We cannot say that it has not been challenging, but we did find it very rewarding, as we believe that we have developed our skills quite a lot on a personal level. Of course, given the limited processing capabilities of our machines, especially the 100 hours of computation time initially needed to join the datasets, and the limited amount of time we had to carry out this research, we had to make some decisions and, unfortunately, drop some of the more interesting processes (such as the lasso regression for the variable selection and the slower method for joining the datasets).

Finally, new challenges arose from the particular situation we are living in at the moment due to COVID-19 and the social distancing it demands, which forced us to work from home and prevented us from discussing the project as much as we would have liked. However, thanks to the efficiency of both the professor and the assistants, together with a good partnership among the teammates, it was possible to overcome almost all the issues we encountered throughout the analysis.

One last note: we find the final results less satisfying than we had initially hoped, mainly because the models showed many outliers in their predictions and, for one of the research questions, the results were not always coherent. We do think, however, that some interesting outcomes have been found and that they deserve a deeper analysis, possibly with more suitable technology.