4 Modeling the spread of COVID-19 worldwide

In this section, we fit the logistic model to every country in the covid19_data_filtered dataset.

4.1 Fitting the logistic model to every country

Here we use nested data, list-columns and logistic_model to fit the logistic model to every country in the dataset. Because the optimization method might not converge for some countries, we wrap the fit in possibly(), so that a failed fit is recorded instead of stopping the whole computation. This lets us see for which countries the optimization fails and for which it succeeds.

First, let’s fit the logistic model to every country and have a look at which ones do not converge.
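As a minimal sketch of this pattern (not the report’s actual code), the example below nests a toy dataset by country and wraps a fitting function in purrr::possibly(). The data are made up, and fit_logistic() is a hypothetical stand-in for logistic_model that uses nls() with the self-starting SSlogis model; the real logistic_model is defined earlier in the report and may use a different optimizer and parametrization.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy data (made up): country "A" follows a noisy logistic curve,
# country "B" has too few points for a logistic fit.
t_a <- 0:29
toy <- bind_rows(
  tibble(country = "A", t = t_a,
         confirmed = 1000 / (1 + exp((15 - t_a) / 3)) + 5 * sin(t_a)),
  tibble(country = "B", t = 0:1, confirmed = c(1, 2))
)

# Hypothetical stand-in for logistic_model(): a self-starting logistic fit.
fit_logistic <- function(df) {
  nls(confirmed ~ SSlogis(t, Asym, xmid, scal), data = df)
}

# possibly() returns `otherwise` (here NULL) when the wrapped call errors.
safe_fit <- possibly(fit_logistic, otherwise = NULL)

fits <- toy %>%
  group_by(country) %>%
  nest() %>%                                  # one row per country, data in a list-column
  ungroup() %>%
  mutate(model  = map(data, safe_fit),        # list-column of fitted models (or NULL)
         failed = map_lgl(model, is.null))    # TRUE where the fit did not succeed

failed_countries <- fits %>% filter(failed) %>% pull(country)
```

Using NULL as the fallback keeps the failed countries in the nested table, so they can be listed first and filtered out before any later step that needs a fitted model.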

The countries for which the optimization fails are the following:

Denmark, Japan, Pakistan

We will also assess the goodness-of-fit of the logistic model in the various countries. Let’s plot the residuals of all countries together to have a look at their general trend.

Comment: The residuals are quite close to zero, especially towards the end of the timeline (end of March, beginning of April). At the beginning they lie further from zero, because the only country present in the dataset at that time is China, the only one already in an epidemic situation. As more countries enter the dataset, the accuracy of the model increases and the residuals move closer to zero.
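The residuals discussed above can be obtained per country along the following lines. This is only a sketch on made-up data: fit_logistic() is again a hypothetical stand-in for logistic_model, and modelr::add_residuals() is one convenient way to attach observed-minus-fitted values to each nested data frame.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(modelr)

# Toy data (made up): two countries with noisy logistic growth.
t_obs <- 0:29
toy <- bind_rows(
  tibble(country = "A", t = t_obs,
         confirmed = 1000 / (1 + exp((15 - t_obs) / 3)) + 5 * sin(t_obs)),
  tibble(country = "B", t = t_obs,
         confirmed = 3000 / (1 + exp((12 - t_obs) / 2.5)) + 10 * cos(t_obs))
)

# Hypothetical stand-in for logistic_model().
fit_logistic <- function(df) {
  nls(confirmed ~ SSlogis(t, Asym, xmid, scal), data = df)
}

residuals_long <- toy %>%
  group_by(country) %>%
  nest() %>%
  ungroup() %>%
  mutate(model      = map(data, fit_logistic),
         # add_residuals() appends a `resid` column (observed minus fitted)
         with_resid = map2(data, model, add_residuals)) %>%
  select(country, with_resid) %>%
  unnest(with_resid)   # long table: country, t, confirmed, resid
```

The resulting long table has one row per country and period, which is the shape needed for a residual plot with one panel per country.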

Let’s plot the residuals per country to have a better look, and then zoom in on the countries with the largest residuals (the ones for which the model predictions are least accurate).

Note: The axes of the graph “Residual of logistic_model per country” and of the one below have been adjusted for each country, so that the different distributions are easier to see.

Comment: The countries for which the model has the largest residuals are China, France, Germany, Iran, Italy, South Korea, Spain and the USA. This does not come as a surprise: in the EDA part of our analysis, we saw that these are the countries with the steepest exponential growth in the absolute number of confirmed cases (especially the US). For China, as already mentioned, the residuals are higher at the beginning of the timeline, when it is the only country in the dataset, since we consider only countries in an epidemic situation.

We would also like a single score that shows the goodness-of-fit of the model for each country. However, according to Burnham and Anderson (2002, 80), AIC cannot be used to compare models fitted to different datasets, and here every country has its own dataset with a different number of observations, as the table below shows.

Number of observations per country
country	observations
Australia	19
Austria	23
Belgium	24
Brazil	18
Canada	19
Chile	15
China	75
Czechia	18
Ecuador	15
Finland	15
France	31
Germany	31
Greece	15
Iceland	15
Indonesia	14
Iran	37
Ireland	18
Israel	16
Italy	39
Korea, South	43
Luxembourg	16
Malaysia	21
Netherlands	24
Norway	26
Poland	15
Portugal	18
Romania	14
Saudi Arabia	14
Spain	29
Sweden	25
Switzerland	26
Thailand	15
Turkey	16
USA	28
United Kingdom	24
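To make the dependence on the number of observations explicit, recall the standard definition of AIC. Writing \(k\) for the number of estimated parameters, \(n\) for the number of observations and \(\hat{L}\) for the maximized likelihood, and assuming independent observations (a simplifying assumption for this illustration):

\[
\mathrm{AIC} = 2k - 2\ln\hat{L},
\qquad
\ln\hat{L} = \sum_{i=1}^{n} \ln f\left(y_i \mid \hat{\theta}\right)
\]

The log-likelihood is a sum of \(n\) terms, so the magnitude of AIC grows with the size of the dataset. A value computed from the 15 observations of Chile and one computed from the 75 observations of China are therefore not on a common scale.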

4.2 Fitted parameters and long-term predictions

We then describe the fitted parameters (i.e., the final size and the infection rates), both on a per-country basis and through some aggregate numbers (e.g., the total size of the epidemic over all considered countries). Furthermore, we study the evolution (say for \(t\) from 0 to 50) of the number of confirmed cases predicted by our models. As discussed in the last sub-section of the exploratory data analysis, the number of confirmed cases per 100,000 inhabitants is also important for understanding how specific countries are managing the spread of the epidemic. Thus, we also predict the evolution of this number (i.e., by dividing our predictions for confirmed cases by the population size and scaling to 100,000 inhabitants) and discuss the results.

We will do the aforementioned using the following functions:

Format the fitted parameters using broom::tidy().
For the long-term predictions, we use data = data.frame(t = 0:50) in add_predictions().
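The two steps listed above can be sketched as follows on made-up data. fit_logistic() is a hypothetical stand-in for logistic_model; it uses the SSlogis parametrization, in which Asym plays the role of the final size K and 1/scal the role of the rate R. That correspondence is an assumption for this illustration, since the report’s own parametrization is defined elsewhere.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(modelr)

# Toy data (made up): two countries with noisy logistic growth.
t_obs <- 0:29
toy <- bind_rows(
  tibble(country = "A", t = t_obs,
         confirmed = 1000 / (1 + exp((15 - t_obs) / 3)) + 5 * sin(t_obs)),
  tibble(country = "B", t = t_obs,
         confirmed = 3000 / (1 + exp((12 - t_obs) / 2.5)) + 10 * cos(t_obs))
)

# Hypothetical stand-in for logistic_model().
fit_logistic <- function(df) {
  nls(confirmed ~ SSlogis(t, Asym, xmid, scal), data = df)
}

fits <- toy %>%
  group_by(country) %>%
  nest() %>%
  ungroup() %>%
  mutate(model       = map(data, fit_logistic),
         params      = map(model, tidy),   # one row per fitted parameter
         # predict on a fixed grid t = 0..50, beyond the observed range
         predictions = map(model, ~ add_predictions(data.frame(t = 0:50), .x)))

fitted_params <- fits %>% select(country, params) %>% unnest(params)
long_term     <- fits %>% select(country, predictions) %>% unnest(predictions)
```

Here fitted_params holds the term, estimate and standard error of every parameter per country, and long_term holds a pred column for each country and each t from 0 to 50.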

First, we look at the fitted parameters of the model for each country.

Fitted parameters per country
country	K	R
Australia	6490	0.229
Austria	13475	0.227
Belgium	30106	0.192
Brazil	24528	0.194
Canada	26125	0.209
Chile	8226	0.178
China	80822	0.263
Czechia	7156	0.159
Ecuador	10116	0.131
Finland	4143	0.105
France	159261	0.185
Germany	114953	0.222
Greece	4636	0.093
Iceland	3258	0.094
Indonesia	4284	0.141
Iran	63936	0.175
Ireland	8411	0.167
Israel	10943	0.242
Italy	131962	0.201
Korea, South	9220	0.265
Luxembourg	3488	0.179
Malaysia	4802	0.148
Netherlands	24000	0.184
Norway	7300	0.143
Poland	16456	0.141
Portugal	15335	0.219
Romania	7725	0.180
Saudi Arabia	3811	0.159
Spain	141690	0.259
Sweden	26528	0.109
Switzerland	22253	0.227
Thailand	3162	0.151
Turkey	36048	0.311
United Kingdom	90683	0.201
USA	417895	0.280

Note: In the table above, K is rounded to the nearest integer and R to 3 decimal places.

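Comment: Unsurprisingly, the fitted parameters vary considerably across countries: the predicted final size K ranges from 3162 (Thailand) to 417895 (USA), and the infection rate R from 0.093 (Greece) to 0.311 (Turkey).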

Now let’s look at the prediction per country.

Comment: We can see that the countries with the highest absolute values are the US, Spain, Germany, France and Italy, each with more than 100,000 predicted final confirmed cases. This is in line with what we found in the EDA part of our analysis.

To look at the aggregated sum, there are two approaches. The first is to fit the logistic model to the aggregated filtered data and estimate a single set of coefficients, as we did with the Swiss model but now applied to all countries together. The second, which makes more sense, is to use the fitted parameters of each country (nested) and sum the per-country predictions, which can also be done by region.

Comment: We can see that the aggregated model predicts well up to the 28th period. After that, we no longer have observations for most of the countries, so it does not make sense to compare the predictions with the observed values.

Note: The highest “t” belongs to China, with about 75 periods, so we decided to extend our model to include all of these dates. However, please keep in mind that around the 50th period the observations are mainly representative of China.

We can also compute the aggregated sum over all countries for the first 50 periods only (afterwards we have no observed data points for almost all countries, so the sum of the observations naturally goes down). This model is the better one because it takes into account the different coefficients of each country rather than assigning the same ones to all countries.

Note: In the graph below, please feel free to hover over a country to see which one contributes the most.

Comment: In terms of regions, we see the highest increase for North America, due to the predicted increase for the US. It is followed by Europe & Central Asia (Italy, Spain and France among many others) and lastly by East Asia & Pacific, where most of the predicted cases come from China.
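The per-country aggregation described above, where each country keeps its own fitted curve and the predictions are then summed overall and by region, can be sketched as follows. The countries, regions and parameter values are made up, and logistic_curve() with a midpoint t0 is an assumed parametrization used only for illustration.

```r
library(dplyr)
library(tidyr)

# Assumed logistic parametrization: K = final size, R = rate, t0 = midpoint.
logistic_curve <- function(t, K, R, t0) K / (1 + exp(-R * (t - t0)))

# Made-up parameters for three illustrative countries in two regions.
params <- tibble(
  country = c("A", "B", "C"),
  region  = c("Europe", "Europe", "Asia"),
  K       = c(1000, 5000, 2000),
  R       = c(0.20, 0.25, 0.15),
  t0      = c(20, 25, 30)
)

# One prediction per country and period, each with its own coefficients.
predictions <- expand_grid(params, t = 0:50) %>%
  mutate(pred = logistic_curve(t, K, R, t0))

# Aggregate by region, then over all countries.
by_region <- predictions %>%
  group_by(region, t) %>%
  summarise(pred = sum(pred), .groups = "drop")

total <- predictions %>%
  group_by(t) %>%
  summarise(pred = sum(pred), .groups = "drop")
```

Because every curve is bounded by its own K, the aggregated prediction stays below the sum of the per-country final sizes, which gives a quick sanity check on the total.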

Furthermore, referring back to the per-country predictions, we can also compute the predicted number of cases per 100,000 inhabitants, displayed in the interactive plot below.

Comment: We can see that Iceland (green line) and Luxembourg (blue line) will have the highest number of confirmed cases per 100,000 inhabitants, which matches what we saw previously in the exploratory data analysis. This is due to their small populations: relative to population, their infection numbers are far larger than those of the countries that follow, such as Spain, Switzerland and Belgium.
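The per-100,000 figures above amount to a simple rescaling of the predicted cases by the population size. The helper per_100k() and the numbers below are hypothetical and serve only to show the arithmetic.

```r
# Hypothetical helper: cases per 100,000 inhabitants.
per_100k <- function(cases, population) cases / population * 1e5

# Made-up example: 3500 predicted cases in a population of 350000.
rate <- per_100k(cases = 3500, population = 350000)
rate  # 1000 cases per 100,000 inhabitants
```

The same number of predicted cases in a population ten times larger would give a rate ten times smaller, which is why small countries can rank highest on this measure.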

References

Burnham, Kenneth P., and David R. Anderson, eds. 2002. “Information and Likelihood Theory: A Basis for Model Selection and Inference.” In Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 49–97. New York, NY: Springer New York. https://doi.org/10.1007/978-0-387-22456-5_2.