Exercise Set VI - HMD


In this final set, we look at a real actuarial example and perform a geographical analysis of mortality data from the Human Mortality Database (HMD). An extensive source of data on death counts, births and life tables is mortality.org. You need to create a free account and can then download all available data.

Note that the structure of this lab is slightly different: we provide a significant amount of actuarial context before asking you to compute something. The goal is to understand how data science (DS) can be applied end to end.

First, we download and import files on death counts, exposures, population and a life table for Switzerland.

Death probabilities

The probability of death within the next year for a person aged \(x\) is \[ q_x = \frac{l_x - l_{x+1}}{l_x} \,,\] where \(l_x\) is the number of people alive at age \(x\) and \(l_{x+1}\) is the number of those people who have survived until age \(x+1\). We can rewrite the formula as \[ l_{x+1} = (1-q_x) \cdot l_x \,.\] Additionally, we define the number of deaths \[D_x = l_x \cdot q_x \,.\]

A first idea for calculating \(q_x\) might be to divide the number of deaths by the number of people alive. However, the resulting numbers do not coincide with the values in the life table, because people migrate in and out of the country.

We now confirm that the central death rates \(m_x\) are calculated as the quotient of the death counts \(D_x\) and the exposure \(P_x\) (roughly the average number of people alive) for ages 0 to 79. From age 80 onwards, however, these values fluctuate too strongly, so a smoothing transformation has been applied.

We then use these central death rates to approximate the probability of death. Assuming that deaths and migration are spread evenly over the span of one year, the exposure \(P_x\) (roughly the average number of people alive) is approximately

\[P_x = \frac{l_x + l_{x+1}}{2} = \frac{l_x + l_x\cdot(1-q_x)}{2} = l_x \cdot \frac{2-q_x}{2}\]

Thus the central death rate \(m_x\) is \[ m_x = \frac{D_x}{P_x} = \frac{2 q_x}{2-q_x}\,.\]

Solving this equation for \(q_x\) gives \[ q_x = \frac{m_x}{1 + 0.5 m_x}\] and we confirm this result numerically. Note that it is slightly off for age 0, because the number of deaths is not uniformly distributed over the year for infants. Check column ax of the life table. This column gives the average length of survival of those who have died within that year.
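As a minimal numerical check of this conversion, the sketch below maps a few made-up death probabilities (not HMD values) to central death rates and back. The helper names central_rate and death_prob are hypothetical and only serve this illustration.

```python
import numpy as np

def central_rate(q):
    """m_x = 2 q_x / (2 - q_x), assuming deaths are spread evenly over the year."""
    return 2.0 * q / (2.0 - q)

def death_prob(m):
    """q_x = m_x / (1 + 0.5 m_x), the inverse of central_rate."""
    return m / (1.0 + 0.5 * m)

# Made-up death probabilities for illustration only (not HMD values)
q = np.array([0.001, 0.01, 0.1, 0.3])
m = central_rate(q)
q_back = death_prob(m)

print(np.allclose(q, q_back))  # the round trip recovers q
print(np.all(m >= q))          # the central rate is never below q
```

With real data, the same comparison can be made between the converted mx column and the qx column of the life table.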

Life expectancy

The remaining life expectancy is defined as \[ \begin{aligned} e_x &= \sum_{s=0}^\infty \mathbb{P}(\text{dying at age } x+s\, | \,\text{having survived until age } x) \cdot (s+a_{x+s})\\ &= \sum_{s=0}^\infty \Big(\prod_{r = 0}^{s-1} p_{x+r}\Big) \cdot q_{x+s} \cdot (s+a_{x+s})\,, \end{aligned}\] where \(p_x = 1 - q_x\) is the one-year survival probability and the empty product for \(s = 0\) equals 1. Note that \(a_{x+s}\) is the average time lived within that year by a person dying at age \(x+s\).

Exercise 1

Define a function lifetime that computes the life expectancy as a function of age x and calendar year t.

Implement the mathematical formula by filtering data for the specified year, extracting probability arrays, and using a loop with np.prod() for cumulative survival probabilities.
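One possible shape of such a function is sketched below on a tiny made-up table. It assumes the HMD column names Year, Age, qx and ax, and it assumes that Age has already been converted to integers (the raw column may hold strings such as "110+"). The sketch counts the year of age x itself as s = 0, so that deaths within the first year contribute with their average survival time.

```python
import numpy as np
import pandas as pd

def lifetime(x, t, table):
    """Period life expectancy at age x in calendar year t."""
    # Keep the rows of year t from age x upwards, ordered by age
    rows = table[(table["Year"] == t) & (table["Age"] >= x)].sort_values("Age")
    q = rows["qx"].to_numpy()
    a = rows["ax"].to_numpy()
    p = 1.0 - q
    e = 0.0
    for s in range(len(q)):
        # Probability of surviving the first s years; np.prod of an empty slice is 1
        survive = np.prod(p[:s])
        e += survive * q[s] * (s + a[s])
    return e

# Hypothetical three-age table with made-up values (q = 1 at the last age)
toy = pd.DataFrame({
    "Year": [2000, 2000, 2000],
    "Age": [0, 1, 2],
    "qx": [0.1, 0.5, 1.0],
    "ax": [0.3, 0.5, 0.5],
})

print(lifetime(0, 2000, toy))
print(lifetime(1, 2000, toy))
```

Recomputing the product in every iteration keeps the code close to the formula; a running product would be faster but is not needed for tables of this size.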

These values coincide with the last column of our data frame. Let’s plot the period life expectancy at birth for newborns from 1876 to today.

Exercise 2

Plot the period life expectancy of newborns as a function of the calendar year.

Let’s start by filtering the data for newborns (age “0”) and visualizing the trend over time.

Filter the data for newborns (age “0”) and use plt.plot() to show the trend over time. Adjust the provided example code accordingly.
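A minimal sketch of this step is shown below. The data frame life_table and its values are hypothetical stand-ins for the imported life table; the column names Year, Age and ex are assumed to follow the HMD layout, with Age stored as a string.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in for the life table; all values are made up
life_table = pd.DataFrame({
    "Year": [1900, 1900, 1950, 1950, 2000, 2000],
    "Age":  ["0", "1", "0", "1", "0", "1"],
    "ex":   [48.0, 55.0, 68.0, 69.5, 79.0, 78.5],
})

# Keep only the rows for newborns
newborns = life_table[life_table["Age"] == "0"]

fig, ax = plt.subplots()
ax.plot(newborns["Year"], newborns["ex"])
ax.set_xlabel("Calendar year")
ax.set_ylabel("Period life expectancy at birth")
```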

However, this is not the true life expectancy that people experienced. Someone born in 1876 did not have the same probability of dying at age 80 as someone who was already 80 in 1876: they benefited from all the medical advances that society made up to the year 1956. The true formula would look like this: \[ e_{x,t} = \sum_{s=0}^\infty \Big(\prod_{r = 0}^{s-1} p_{x+r, t+r}\Big) \cdot q_{x+s, t+s} \cdot (s+a_{x+s, t+s})\,, \] where the index \(t\) indicates the calendar year. For recent cohorts, this would involve forecasting future mortality improvements. Let’s just investigate the life expectancy for people born before 1914.
We could extend the plot all the way to 2022 by assuming that there will be no more improvements after 2022.
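The difference from the period calculation is that the probabilities are read along the diagonal (age x+s, year t+s) of the table. The sketch below illustrates this on made-up numbers; the function name cohort_lifetime and the toy table are hypothetical, and the column names Year, Age, qx and ax are assumed. The loop stops as soon as the diagonal leaves the table, so for cohorts that are not fully observed the result is truncated.

```python
import pandas as pd

def cohort_lifetime(x, t, table):
    """Follow the diagonal (age x+s, year t+s) for as long as the table has data."""
    rates = {(row.Year, row.Age): (row.qx, row.ax) for row in table.itertuples()}
    e, survive, s = 0.0, 1.0, 0
    while (t + s, x + s) in rates:
        q, a = rates[(t + s, x + s)]
        e += survive * q * (s + a)   # contribution of dying at age x+s
        survive *= 1.0 - q           # probability of reaching age x+s+1
        s += 1
    return e

# Made-up table: mortality improves from year to year, a = 0.5 throughout
years = [2000, 2001, 2002]
q_by_year = {2000: [0.10, 0.5, 1.0], 2001: [0.08, 0.4, 1.0], 2002: [0.06, 0.3, 1.0]}
toy = pd.DataFrame(
    [(y, age, q, 0.5) for y in years for age, q in enumerate(q_by_year[y])],
    columns=["Year", "Age", "qx", "ax"],
)

print(cohort_lifetime(0, 2000, toy))
```

In this toy example the cohort value is higher than the period value for the year 2000, because the cohort meets the lower death probabilities of later years.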

Now, let’s plot the period life expectancy of newborns as a function of the calendar year.

Comparing different countries

We download and import additional life tables for Switzerland’s neighbouring countries.

Exercise 3

Plot the development of their period life expectancies.

Use the provided example code structure to plot multiple countries simultaneously. Consider adding the current age to the life expectancy values for meaningful comparison.
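A possible structure is sketched below, assuming the life tables are held in a dictionary keyed by country code. The dictionary tables, the helper expected_age_at_death and all numbers are hypothetical. Adding the current age to the remaining life expectancy gives the expected age at death, which makes curves for different ages comparable.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical tables keyed by country code; all numbers are made up
tables = {
    "CHE": pd.DataFrame({"Year": [2000, 2000, 2010, 2010],
                         "Age": [0, 65, 0, 65],
                         "ex": [79.8, 18.9, 82.0, 20.5]}),
    "ITA": pd.DataFrame({"Year": [2000, 2000, 2010, 2010],
                         "Age": [0, 65, 0, 65],
                         "ex": [79.5, 18.5, 81.7, 20.0]}),
}

def expected_age_at_death(table, age):
    """Return the years and the values of age + remaining life expectancy."""
    rows = table[table["Age"] == age].sort_values("Year")
    return rows["Year"], rows["Age"] + rows["ex"]

fig, ax = plt.subplots()
for code, table in tables.items():
    years, total = expected_age_at_death(table, 65)
    ax.plot(years, total, label=code)
ax.set_xlabel("Calendar year")
ax.set_ylabel("Expected age at death for a 65-year-old")
ax.legend()
```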

Geographical plots

We would like to plot some of that information on a map. To do that, we need shapefiles that define the outlines of these countries. They are available, for example, from Eurostat: https://ec.europa.eu/eurostat/web/gisco/geodata/statistical-units/territorial-units-statistics.

Let’s download and import these files using the library geopandas. It is easy to plot these shapes, but unless the axis limits are set manually, the plot is heavily distorted by the French overseas territories.

Taking a closer look at this geopandas data frame, we see that there are different levels of regional granularity, each with its own NUTS IDs (Nomenclature des unités territoriales statistiques) and country codes.

Now, filter the data for Switzerland.

We can select all the countries we are interested in through the variable CNTR_CODE and choose the level of granularity through LEVL_CODE. By excluding the rows whose NUTS_ID contains the string FRY, we get rid of all French overseas territories.

Exercise 4

Select only those rows that contain information about the countries Austria, Germany, France, Italy and Switzerland. Only use regions with NUTS level 1 and exclude the French overseas territories (their NUTS_ID starts with “FRY”).
Use the indexer DataFrame.loc. You can combine conditions with the AND operator & and negate a logical expression by putting a ~ in front of it. Other helpful methods are Series.isin() and Series.str.contains(). You might need to look up how to use them.

Use Series.isin() for the country selection, combine multiple conditions with &, and apply the ~ operator for exclusion. Remember to wrap each condition in parentheses when combining them.
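The pattern can be tried out on a small table first. The data frame nuts below is a hypothetical miniature of the attribute table without geometries; only the column names NUTS_ID, CNTR_CODE and LEVL_CODE are taken from the text above.

```python
import pandas as pd

# Hypothetical miniature of the NUTS attribute table (no geometries)
nuts = pd.DataFrame({
    "NUTS_ID":   ["CH0", "CH01", "DE1", "FR1", "FRY", "ES1", "AT1", "ITC"],
    "CNTR_CODE": ["CH",  "CH",   "DE",  "FR",  "FR",  "ES",  "AT",  "IT"],
    "LEVL_CODE": [1,     2,      1,     1,     1,     1,     1,     1],
})

countries = ["AT", "DE", "FR", "IT", "CH"]
mask = (
    nuts["CNTR_CODE"].isin(countries)         # one of the five countries
    & (nuts["LEVL_CODE"] == 1)                # NUTS level 1 only
    & ~nuts["NUTS_ID"].str.contains("FRY")    # drop French overseas territories
)
selected = nuts.loc[mask]
print(selected["NUTS_ID"].tolist())
```

The comparison LEVL_CODE == 1 needs its own parentheses because & binds more tightly than == in Python.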

All we need to do to plot the life expectancy is to match the information from our life tables to that of the shapefile. We use the function pd.merge for that task.
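A minimal sketch of such a merge follows. Both data frames are hypothetical and contain made-up values, and the sketch assumes that the two tables share the key column CNTR_CODE; with real data, the country codes of the life tables may first have to be mapped to those of the shapefile.

```python
import pandas as pd

# Hypothetical region table (geometries omitted) and life expectancy table
regions = pd.DataFrame({
    "NUTS_ID": ["CH0", "DE1", "DE2"],
    "CNTR_CODE": ["CH", "DE", "DE"],
})
life = pd.DataFrame({
    "CNTR_CODE": ["CH", "DE"],
    "ex": [83.0, 81.0],   # made-up values
})

# A left join keeps every region and attaches the value of its country
merged = pd.merge(regions, life, on="CNTR_CODE", how="left")
print(merged)
```

With how="left", regions without a matching life table stay in the result with a missing value instead of being dropped, which makes gaps easy to spot on the map.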