2 Introduction

As part of the Data Science course taught by Professor Vatter at HEC Lausanne, the authors were given the opportunity to analyze several large datasets in depth and to build a project that answers three research questions.

The choice of topic comes from personal experience: one of the authors has worked as a bike courier and was unfortunately involved in two car accidents while on shift, whereas the second author has been commuting across the country on a weekly basis. We combined these interests by analyzing accident data in one of the most compelling cities in the world: New York City. With around 8 million inhabitants, the city has long had a serious problem with car crashes. From the famous NYC traffic scenes shown in countless Hollywood movies to the impatient taxi drivers waiting in Times Square, we would like to see whether features such as weather conditions, speed limits and other contributing factors recorded by the police at the time of the accident have any effect on the road users who travel these roads and may be harming one another.

The three research questions this project aims to answer are the following:

  1. Are fatalities caused by accidents on roads with a speed limit of 25 MPH lower relative to their 30 and 40 MPH counterparts? New York City states that “Pedestrians struck by vehicles traveling at 25 MPH are half as likely to die as those struck at 30 MPH”. Although we do not have experimental data to test this statement under laboratory conditions, we can test whether a 25 MPH speed limit, compared with a limit of 30 MPH or higher, has actually helped to halve fatalities.

  2. Do weather conditions (e.g. heavy rain, snow and average wind speed) have a notable influence on road accidents, particularly on multi-car and single-car accidents?

  3. Can we estimate the probability that an accident results in an injury using weather conditions, contributing factors (the reasons for an accident as recorded by the police) and speed limits?
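
To make the first question concrete, the "half as likely" claim can be read as a ratio of fatality rates between two groups of roads: a ratio close to 0.5 would be consistent with the claim. The snippet below is only a minimal sketch of that arithmetic in Python. The function names and the counts are hypothetical and are not taken from the NYC data or from the analysis in the later chapters.

```python
def fatality_rate(fatal_accidents: int, total_accidents: int) -> float:
    """Share of accidents that resulted in at least one death."""
    if total_accidents <= 0:
        raise ValueError("total_accidents must be positive")
    return fatal_accidents / total_accidents


def rate_ratio(fatal_low: int, total_low: int,
               fatal_high: int, total_high: int) -> float:
    """Fatality rate on lower-limit roads divided by the rate on higher-limit roads."""
    high_rate = fatality_rate(fatal_high, total_high)
    if high_rate == 0:
        raise ValueError("no fatal accidents in the comparison group")
    return fatality_rate(fatal_low, total_low) / high_rate


# Hypothetical counts, for illustration only (not real NYC figures).
ratio = rate_ratio(fatal_low=10, total_low=10_000,
                   fatal_high=20, total_high=10_000)
print(ratio)  # 0.5
```

A raw ratio like this ignores differences between roads other than the speed limit, so it is a starting point for the comparison rather than a test of the claim.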

To give an initial understanding of the data, we have constructed a density heat map of all the accidents in NYC from the processed data. The aim is to give an overview of what the coming chapters cover.

The legend in the top right corner of the map describes the density level of the accidents (high, average, low and very low). The zone with the warmest color has the highest concentration of crashes (and hence contains the streets where it is best to pay more attention when driving around NYC). Clicking on an accumulated number of accidents zooms in, down to a single observation. Clicking on that observation shows information about what happened, such as whether there was an injury or a fatality, whether more cars were involved, the speed limit of the street, the weather conditions (rainy, windy) and the day of the week on which the accident happened. Finally, the small map in the bottom right corner shows the area currently being explored from a more distant perspective.