1 Introduction

1.1 The data

For this project, we use COVID-19 data provided by Johns Hopkins University (updated daily), together with demographic data from the World Bank. More specifically, we use daily country-level records of the total (cumulative) number of confirmed infection cases and the total (cumulative) number of deaths, from 2020-01-22 to 2020-04-05. The dataset, downloaded on 2020-04-06, contains 12,975 observations and 13 variables (country, iso3c, date, confirmed, deaths, population, land_area_skm, pop_density, pop_largest_city, gdp_capita, life_expectancy, region, income).
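As a quick sanity check on these figures, the short Python sketch below counts the days in the observation window and divides the number of observations by it. It assumes a balanced panel (exactly one row per country per day), which the text does not state explicitly; under that assumption, the numbers are consistent with 173 countries.

```python
from datetime import date

# Figures quoted in the data description.
first_day = date(2020, 1, 22)
last_day = date(2020, 4, 5)
n_observations = 12_975

# Both endpoints are included in the window, hence the +1.
n_days = (last_day - first_day).days + 1

# Assumption: a balanced panel with exactly one row per country per day.
n_countries, remainder = divmod(n_observations, n_days)

print(n_days, n_countries, remainder)  # prints "75 173 0"
```

A remainder of zero only shows that the counts are compatible with a balanced panel; it does not prove that every country has a record for every day.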

Special credits: This project was developed as part of the dsfba course. Many thanks to Professor Thibault Vatter and his assistants for their contributions.

Furthermore, the repository for this project can be found here.

1.2 A note on Epidemiological Models

Most current epidemiological models are so-called SIR-like models (for details, see Martcheva 2015, 9–12). In this class of models, the population is divided into three groups:

  • (S)usceptible: people who might get infected;
  • (I)nfectious: people who carry the infection and can infect others;
  • (R)ecovered/(R)emoved: people who have already recovered from the disease and acquired immunity.
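
For reference, one common textbook formulation of the dynamics of these three groups is given below. The symbols are assumptions introduced here for illustration and are not notation used elsewhere in this project: \(\beta\) is a transmission rate, \(\gamma\) is a recovery rate, and \(N = S(t) + I(t) + R(t)\) is a constant population size.

\[ \begin{aligned} \frac{dS(t)}{dt} &= -\beta \, \frac{S(t)\, I(t)}{N}, \\ \frac{dI(t)}{dt} &= \beta \, \frac{S(t)\, I(t)}{N} - \gamma \, I(t), \\ \frac{dR(t)}{dt} &= \gamma \, I(t). \end{aligned} \]

The product \(S(t)\, I(t)\) is what makes the system nonlinear, and the three right-hand sides sum to zero, so the total population stays constant.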

The SIR model is a system of nonlinear ordinary differential equations. In this project, we focus instead on the following logistic model (see Batista 2020, 2; Martcheva 2015, 35–36):

\[ \frac{dC(t)}{dt} = r \, C(t) \cdot \left[1 - \frac{C(t)}{K}\right], \]

where \(C(t)\) is the accumulated number of cases at time \(t\), \(r\) is the growth rate (or infection rate), and \(K\) is the final size of the epidemic. Let \(C_0\) be the initial number of cases; in other words, assume that there were \(C_0\) accumulated cases at time \(t = 0\). The solution of the logistic model is

\[ C(t) = \frac{K\cdot C_0}{C_0 + (K-C_0) \, \exp(-r\,t)}, \]

which has the same functional form as a scaled version of the logistic function underlying the logit model in econometrics.
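
The closed-form solution can be checked numerically. The following Python sketch evaluates \(C(t)\), confirms that it starts at \(C_0\) and saturates at \(K\), and compares a finite-difference derivative with the right-hand side of the differential equation. The parameter values are hypothetical and chosen only for illustration; the function names are likewise not part of the project code. The helper `inflection_time` uses the fact that \(C(t) = K/2\) when \(\exp(-r\,t) = C_0 / (K - C_0)\), which follows directly from the solution above.

```python
import math


def logistic_cases(t, r, K, C0):
    """Closed-form solution C(t) of the logistic model."""
    return K * C0 / (C0 + (K - C0) * math.exp(-r * t))


def logistic_rhs(C, r, K):
    """Right-hand side of the logistic equation, r * C * (1 - C / K)."""
    return r * C * (1.0 - C / K)


def inflection_time(r, K, C0):
    """Time at which C(t) = K / 2, i.e. where growth is fastest."""
    return math.log((K - C0) / C0) / r


# Hypothetical parameter values, for illustration only.
r, K, C0 = 0.2, 20_000.0, 10.0

# Central finite difference of C(t) at an arbitrary time point.
t, h = 30.0, 1e-5
numeric_derivative = (
    logistic_cases(t + h, r, K, C0) - logistic_cases(t - h, r, K, C0)
) / (2 * h)
model_derivative = logistic_rhs(logistic_cases(t, r, K, C0), r, K)

print("C(0)           :", logistic_cases(0.0, r, K, C0))
print("C(large t)     :", logistic_cases(1000.0, r, K, C0))
print("dC/dt (numeric):", numeric_derivative)
print("dC/dt (model)  :", model_derivative)
print("inflection time:", inflection_time(r, K, C0))
```

The two derivative values should agree up to finite-difference error, which is a simple way to convince oneself that the stated solution satisfies the equation.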

1.3 This project

Because we only have access to reported confirmed cases, we use those figures as a proxy for the total number of cases, with the understanding that they almost surely underestimate the actual number of interest. In what follows, we begin with a preliminary exploration of the data. We then use the logistic model to analyze the spread of COVID-19 and try to predict the final number of accumulated confirmed cases for every country. More specifically, we

  • start by focusing on modelling the spread in Switzerland;
  • then apply the same approach to every country in the dataset.
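
To illustrate how \(r\) and \(K\) might be recovered from a series of cumulative counts, here is a minimal Python sketch. It is not necessarily the estimation method used in this project: it relies on a simple linearization rather than nonlinear least squares. From the closed-form solution, the reciprocal of the case count satisfies \(1/C(t+1) = e^{-r} / C(t) + (1 - e^{-r})/K\) for a unit time step, so an ordinary least-squares line through the points \((1/C(t),\, 1/C(t+1))\) yields \(r\) from the slope and \(K\) from the intercept. The data below are synthetic and noise-free, generated from hypothetical parameter values; the function names are assumptions made for this example.

```python
import math


def logistic_cases(t, r, K, C0):
    """Closed-form solution C(t) of the logistic model."""
    return K * C0 / (C0 + (K - C0) * math.exp(-r * t))


def fit_logistic_linearized(cases):
    """Estimate (r, K) from equally spaced, strictly positive cumulative counts.

    Fits the line 1/C(t+1) = a * 1/C(t) + b by ordinary least squares,
    then maps back with r = -log(a) and K = (1 - a) / b.
    Assumes a unit time step, 0 < a < 1, and b > 0.
    """
    x = [1.0 / c for c in cases[:-1]]
    y = [1.0 / c for c in cases[1:]]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return -math.log(slope), (1.0 - slope) / intercept


# Synthetic, noise-free daily series over 75 days (hypothetical parameters).
true_r, true_K, true_C0 = 0.2, 20_000.0, 10.0
synthetic_cases = [logistic_cases(day, true_r, true_K, true_C0) for day in range(75)]

r_hat, K_hat = fit_logistic_linearized(synthetic_cases)
print("estimated r:", r_hat)
print("estimated K:", K_hat)
```

On noise-free data the estimates match the generating parameters up to rounding. Real reported counts are noisy, and taking reciprocals amplifies errors in the early, small counts, so this linearization is best seen as a way to obtain rough starting values rather than final estimates.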

References

Batista, Milan. 2020. “Estimation of the Final Size of Coronavirus Epidemic by the Logistic Model,” March.
Martcheva, Maia. 2015. An Introduction to Mathematical Epidemiology. Springer, Boston, MA.