1 Executive Summary

As part of a data science project, we analysed car crashes in New York City (NYC). This report addresses three questions: (1) whether crash fatalities are correlated with vehicle speed, and more specifically whether pedestrians hit by a car at 25 mph are half as likely to die as those hit at 30 mph; (2) whether weather conditions influence the involvement of multiple cars in a crash; and (3) which features are most important for predicting the probability of an injury in a car accident.

To illustrate the data we are dealing with, we created an interactive map (please see the introduction) that is easy to use and highly informative.

To answer these questions, we used three datasets: (i) one describing the weather in NYC in 2019, (ii) one describing the speed limits of NYC streets, and (iii) one describing the collisions that happened in NYC in 2019.

We started by exploring each dataset in detail. The weather data showed that NYC is quite a windy place, although no recorded wind speed reached a dangerous level on the Saffir-Simpson scale. We could also see that rainfall was not very high in 2019 and that it snowed only 14 times. For these three conditions (wind, rain and snow) we created factor variables based on the distribution of the data, as well as binary variables (taking value 1 when the weather condition occurred and 0 otherwise), which we used to define the two different models for each research question in the modelling part of our analysis.

For the second dataset we explored, the collisions data, we carried out a variable selection: we had converted all the contributing factors into dummy variables, which resulted in a large number of features. Because we wanted to analyse only the most influential ones in depth, we ran two different models to select them: a lasso regression and a random forest. This process led us to focus on three contributing factors: (a) pedestrian_bicylist_other, (b) following_too_closely and (c) unsafe_speed. The exploration of this dataset also gave us interesting insights, such as the fact that there were more accidents in December than in the other months, and more during the daytime (from 8 am to 5 pm).

Finally, for the speed limits, we found that the highest number of accidents happened on streets with a 25 mph limit, which is also the most widely used limit in the city since the Vision Zero initiative in 2014. Injuries occurred with the highest probability at the 15 mph limit, and fatalities at the 40 mph limit.

Regarding the three contributing factors mentioned above, the first seems to have little influence on injuries, as it has only a few positive observations; for the second, the proportion of positive instances increases as the speed limit increases; and for the third, there seems to be no pattern.
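As an illustration of the two weather encodings described above (a factor variable and a binary variable for each condition), here is a minimal sketch in Python. It is not the code used in the analysis: the function name, the rainfall values and the cutoffs are hypothetical, whereas the report derives its bins from the distribution of the data. The level names none, light, moderate and heavy are the ones used in the report's modelling step.

```python
from bisect import bisect_right

# Level names as used in the report's modelling step.
LEVELS = ["none", "light", "moderate", "heavy"]


def encode_weather(value, cutoffs):
    """Return (factor_level, binary_flag) for one daily measurement.

    cutoffs holds two ascending thresholds separating light, moderate
    and heavy. They are hypothetical here; a value equal to a cutoff
    falls into the higher level.
    """
    if value <= 0:
        return "none", 0
    # bisect_right counts how many cutoffs the value has reached (0, 1 or 2).
    level = LEVELS[1 + bisect_right(cutoffs, value)]
    return level, 1


# Hypothetical daily rainfall (mm) and hypothetical cutoffs.
rain_mm = [0.0, 1.2, 8.5, 30.0]
RAIN_CUTOFFS = (2.5, 10.0)

encoded = [encode_weather(v, RAIN_CUTOFFS) for v in rain_mm]
for value, (level, flag) in zip(rain_mm, encoded):
    print(value, level, flag)
```

The same function could be applied to wind and snow with their own cutoffs, producing one factor column and one binary column per condition.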

We then moved to the modelling part of our analysis, where we fitted two models: a regression tree to select the most important variables, and a generalized linear model for interpretation. We adjusted the dataset by balancing it for each research question according to the variable we wanted to explain, and then created a training set and a test set (with 75% and 25% of the data, respectively). For each question we ran both models on two different dataframes: one with the weather variables as factors with four levels each (none, light, moderate and heavy), and one with them as binary variables (taking value 1 if the condition was present and 0 otherwise).
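The balancing and 75%/25% split described above could look roughly like the following Python sketch. The report does not say how the balancing was done, so random undersampling of the majority class is an assumption made here for illustration; the function name, the column name injury and the toy data are also hypothetical.

```python
import random


def balance_and_split(rows, target, train_frac=0.75, seed=0):
    """Undersample the majority class of a binary target, then split.

    Undersampling is an assumed balancing method, not necessarily the
    one used in the report.
    """
    rng = random.Random(seed)
    positives = [r for r in rows if r[target] == 1]
    negatives = [r for r in rows if r[target] == 0]

    # Keep as many rows of each class as the smaller class contains.
    n = min(len(positives), len(negatives))
    balanced = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(balanced)

    # First 75% of the shuffled rows for training, the rest for testing.
    cut = int(len(balanced) * train_frac)
    return balanced[:cut], balanced[cut:]


# Hypothetical toy data: 20 crashes with an injury, 180 without.
rows = [{"id": i, "injury": int(i < 20)} for i in range(200)]
train, test = balance_and_split(rows, "injury")
print(len(train), len(test))
```

Because the target changes with each research question (fatality, multiple cars, injury), the balancing step would be repeated per question before splitting.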

The results were quite interesting. We could confirm that the speed limit had an effect on the probability of a fatality in an accident; however, this probability did not double when we compared the 25 mph speed limit with those above it. Moreover, the quality of both models appeared to be quite low, so a deeper analysis and some adjustments are suggested. Regarding the weather conditions, the results were not fully consistent: the initial variable selection did not retain the meteorological variables (especially when they were encoded as dummies), but once we forced them into the linear regression model, they had a significant influence on the prediction of multiple cars being involved in a crash. We think a more thorough analysis is needed. Finally, the most important variables for predicting an injury were the contributing factors loss of consciousness, passing too closely and disregard of traffic control; the speed limit was also significant, but less influential than these three, while the weather data was not significant at all.