Exercise Set V - Pipeline
Exercise 1
The goal of this exercise is to perform maximum likelihood estimation (MLE) using Python. Moreover, we aim to compare the MLE results with those produced by the method of moments (MoM). The file `IntroductiontoNumPyExercise3.csv` contains household income short-fall data from a particular region in Burkina Faso. The data were collected during the Enquête Multisectorielle Continue (EMC) 2014. Here, we are going to fit the beta distribution of the first kind (B1), with p.d.f.
\[ \begin{align} B 1(y ; x^{*}, 1, \alpha)=\frac{\alpha\left(x^{*} - y\right)^{\alpha - 1}}{{x^{*}}^{\alpha}} \quad \text { for } \quad 0<y<x^{*}. \end{align} \tag{1}\]
As mentioned previously, in this exercise we aim to use the maximum likelihood method and the method of moments to estimate the parameter \(\alpha\) of the model in Equation 1. Assume that \(y_{1}, y_{2}, \ldots, y_{n}\) is a random sample of income short-falls of size \(n\), and let \(M_{k}=\frac{1}{n}\sum^{n}_{i=1}y^{k}_{i}\) denote the \(k\)th sample moment. The maximum likelihood estimator (MLE) and the method of moments estimator (MME) for \(\alpha\) are then given by \[ \begin{align} \hat{\alpha}_{MLE} = \frac{n}{n \log{\left(x^{*}\right)}-\sum\limits_{i=1}^{n}\log{\left(x^{*}-y_{i}\right)}} \qquad \text{and} \qquad \hat{\alpha}_{MME} = \frac{x^{*}-M_{1}}{M_{1}}, \end{align} \tag{2}\]
respectively. We derive \(\hat{\alpha}_{MLE}\) by maximising the log-likelihood function
\[ \begin{align} \ell\left(\alpha\right)=\ell\left(\alpha|\mathbf{y}\right):=\log L\left(\alpha|\mathbf{y}\right)=n\cdot \left[\log\left(\alpha\right) - \alpha \cdot \log\left(x^{*}\right)\right] + \left(\alpha - 1\right)\cdot\sum\limits_{i=1}^{n}\log\left(x^{*}-y_{i}\right), \end{align} \tag{3}\]
where \(L\left(\alpha|\mathbf{y}\right) = \prod_{i=1}^{n}B1\left(y_{i};x^{*},1,\alpha\right)=\left(\frac{\alpha}{{x^{*}}^{\alpha}}\right)^{n}\prod_{i=1}^{n}\left(x^{*}-y_{i}\right)^{\alpha-1}\) is the likelihood function. Thus, we differentiate Equation 3 w.r.t. \(\alpha\), equate the derivative to zero, and solve for \(\alpha\) to obtain \(\hat{\alpha}_{MLE}\). On the other hand, \(\hat{\alpha}_{MME}\) is derived by equating the first sample moment (\(M_{1}\)) with the theoretical first moment of a \(B1\)-distributed random variable and then solving for \(\alpha\). Here, \(x^{*} = 153{,}530\) F CFA is the absolute income poverty line (le seuil absolu de pauvreté monétaire).
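For reference, the two estimators in Equation 2 can be recovered in a few lines. Setting the derivative of Equation 3 to zero gives
\[ \frac{d\ell}{d\alpha}=\frac{n}{\alpha}-n\log\left(x^{*}\right)+\sum\limits_{i=1}^{n}\log\left(x^{*}-y_{i}\right)=0, \]
whose solution is \(\hat{\alpha}_{MLE}\). It is a maximum because \(\frac{d^{2}\ell}{d\alpha^{2}}=-\frac{n}{\alpha^{2}}<0\). For the method of moments, the first moment under Equation 1 is
\[ E\left[Y\right]=\int_{0}^{x^{*}}y\,\frac{\alpha\left(x^{*}-y\right)^{\alpha-1}}{{x^{*}}^{\alpha}}\,dy=\frac{x^{*}}{\alpha+1}, \]
so equating \(M_{1}=\frac{x^{*}}{\alpha+1}\) and solving for \(\alpha\) gives \(\hat{\alpha}_{MME}=\frac{x^{*}-M_{1}}{M_{1}}\).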
Use `np.log()` for logarithmic calculations and `np.sum()` for summations in the MLE formula. For MoM, use `np.mean()` for the first sample moment. Create a custom function for the B1 density and use `plt.hist()` with `density=True` for comparison plots.
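The estimators in Equation 2 and the density in Equation 1 can be sketched as follows. This is a minimal illustration, not the official solution: the function names `b1_pdf`, `alpha_mle` and `alpha_mme` are hypothetical, and the sample is simulated by inverse-transform sampling because the layout of the CSV file is not shown here.

```python
import numpy as np

def b1_pdf(y, x_star, alpha):
    """Density of Equation 1; zero outside the support 0 < y < x_star."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    inside = (y > 0) & (y < x_star)
    pdf = np.zeros_like(y)
    # Evaluate the formula only on the support to avoid 0 ** negative powers.
    pdf[inside] = alpha * (x_star - y[inside]) ** (alpha - 1) / x_star ** alpha
    return pdf

def alpha_mle(y, x_star):
    """Maximum likelihood estimator from Equation 2."""
    y = np.asarray(y, dtype=float)
    n = y.size
    return n / (n * np.log(x_star) - np.sum(np.log(x_star - y)))

def alpha_mme(y, x_star):
    """Method of moments estimator from Equation 2."""
    m1 = np.mean(y)  # first sample moment M_1
    return (x_star - m1) / m1

# Simulated stand-in for the income short-fall data (assumption: alpha = 2).
rng = np.random.default_rng(0)
x_star = 153_530.0
true_alpha = 2.0
u = 1.0 - rng.uniform(size=5_000)             # uniform on (0, 1]
y = x_star * (1.0 - u ** (1.0 / true_alpha))  # inverse-transform sample from B1

print("MLE:", alpha_mle(y, x_star))
print("MME:", alpha_mme(y, x_star))
```

With the real data, `y` would instead be read from the CSV file. For the comparison plot, one option is `plt.hist(y, bins=30, density=True)` followed by `plt.plot(grid, b1_pdf(grid, x_star, alpha_hat))` for each estimate, where `grid` is an array of points between 0 and `x_star`.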
Exercise 2
Consider a sample of \(n = 1,340\) claims from automobile injury claims data from a single state in the United States of America (`IntroductiontoNumPyExercise4.csv`). The full data, collected in \(2002\) by the Insurance Research Council (IRC), contains more than \(70,000\) closed claims based on data from \(32\) insurers. Furthermore, it provides demographic information about the claimant, as well as information on lawyer involvement and economic loss (`LOSS`, in thousands), among other variables. Based on the claims data, answer the following questions:
(a) Calculate the summary statistics for the variable `LOSS` (e.g. mean, median, standard deviation, minimum and maximum).
(b) Analyse the histogram of the variable `LOSS`. Comment on the shape of the distribution.
(c) Divide the original data set into two sub-samples, one corresponding to claims involving lawyers (`LAWYER=1`) and one where no lawyer is involved (`LAWYER=2`). Moreover, for each subsample:
(i) Calculate the summary statistics for the variable `LOSS` (e.g. mean, median, standard deviation, minimum and maximum).
(ii) Analyse the histogram of the variable `LOSS`. Compare the two distributions. How does the involvement of lawyers influence claims losses?
Use `np.nanmean()`, `np.nanmedian()`, `np.nanstd()`, `np.nanmin()`, and `np.nanmax()` for summary statistics. Use `np.nanpercentile()` for quartiles. Create subsets with `data[data['column'] == value]` and visualize with `plt.hist()`.
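The hints above can be combined into one small helper. The sketch below is illustrative only: the `summarise` function is a hypothetical name, and the data frame is a made-up stand-in with the two columns named in the exercise, not the actual claims file.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for IntroductiontoNumPyExercise4.csv (values are not real).
data = pd.DataFrame({
    "LOSS":   [0.5, 1.2, 2.0, 3.5, 40.0, 0.3, 0.9, np.nan, 1.1, 7.8],
    "LAWYER": [1,   1,   1,   1,   1,    2,   2,   2,      2,   2],
})

def summarise(values):
    """NaN-aware summary statistics for a one-dimensional array."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.nanpercentile(values, [25, 75])
    return {
        "mean": np.nanmean(values),
        "median": np.nanmedian(values),
        # ddof=1 gives the sample standard deviation (np.nanstd defaults to ddof=0).
        "std": np.nanstd(values, ddof=1),
        "min": np.nanmin(values),
        "q1": q1,
        "q3": q3,
        "max": np.nanmax(values),
    }

overall = summarise(data["LOSS"])
with_lawyer = summarise(data[data["LAWYER"] == 1]["LOSS"])
without_lawyer = summarise(data[data["LAWYER"] == 2]["LOSS"])

for label, stats in [("all", overall), ("LAWYER=1", with_lawyer), ("LAWYER=2", without_lawyer)]:
    print(label, {name: round(float(value), 3) for name, value in stats.items()})
```

A mean well above the median in any of these summaries points to a right-skewed distribution, which the histograms from `plt.hist()` should confirm.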
(a)
(b)
The distribution of the variable `LOSS` is skewed to the right. Moreover, the mean (\(5.68\)) is higher than the median (\(2.0\)), as is typical of right-skewed distributions.
(c)
*(i)*
*(ii)*
The results above suggest that the losses associated with lawyer involvement (`LAWYER=1`) are higher than when a lawyer is not involved (`LAWYER=2`).
Exercise 3
Imagine you are an actuarial analyst working for a major actuarial consulting firm. One of the consulting firm’s clients is a large insurance company that provides automobile insurance coverage for private passengers. As an actuarial analyst, you have been selected to work on a project with this insurance company to help them understand their claims distribution. To do this, the insurance company shares with the consulting firm claims data for a recent year (`IntroductiontoNumPyExercise5.csv`), consisting of the following variables:
`STATE`: a code randomly assigned to an individual state (codes range from “01” to “17”).
`CLASS`: rating class of the driver, based on age, sex, marital status, and use of vehicle.
`GENDER`: driver’s sex.
`AGE`: driver’s age.
`PAID`: total amount paid to settle and close a claim.
Based on the claims data, answer the following questions:
(a) Write a simple statement in Python to count the number of claims incurred during the year.
(b) Has the claims experience of the insurance company been balanced between men and women?
(c) Provide the age range of the insurance company’s claimants. Do you think the target customer of the insurance company is young people? What is the average age of the claimants?
(d) Analyse the histogram of the amount `PAID` and comment on the shape of the distribution.
(e) Create a new variable, the natural logarithm of the `PAID` variable. Furthermore, analyse the histogram of this new variable. Has the shape of the distribution changed or has it remained the same? Explain.
Use `data.shape[0]` or `len(data)` for counting rows. Filter data with `data[data['GENDER'] == 'M']` for subsetting. Use `np.nanmin()`, `np.nanmax()`, and `np.nanmean()` for age analysis. Apply `np.log()` for transformations and `plt.hist()` for distribution comparison.
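A minimal sketch of these steps is given below. The six rows are invented for illustration, and the `'M'`/`'F'` coding of `GENDER` is an assumption taken from the hint above; the new column name `LOG_PAID` is also hypothetical.

```python
import numpy as np
import pandas as pd

# Invented rows standing in for IntroductiontoNumPyExercise5.csv.
data = pd.DataFrame({
    "GENDER": ["M", "F", "M", "M", "F", "M"],
    "AGE":    [52, 67, 71, 58, 80, 63],
    "PAID":   [120.0, 950.0, 15000.0, 430.0, 2100.0, 60.0],
})

# (a) Number of claims: one row per claim.
n_claims = data.shape[0]

# (b) Share of claims by gender, plus a count via boolean filtering.
gender_share = data["GENDER"].value_counts(normalize=True)
n_male = len(data[data["GENDER"] == "M"])

# (c) Age range and average age (NaN-aware in case of missing ages).
age = data["AGE"].to_numpy(dtype=float)
age_min, age_max, age_mean = np.nanmin(age), np.nanmax(age), np.nanmean(age)

# (e) Natural logarithm of PAID as a new column.
data["LOG_PAID"] = np.log(data["PAID"])

print(n_claims, n_male, age_min, age_max, round(age_mean, 2))
print(gender_share)
```

The histograms for parts (d) and (e) can then be drawn with `plt.hist(data["PAID"])` and `plt.hist(data["LOG_PAID"])`.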
(a)
(b)
The above results suggest that more claims were submitted by men (\(62\%\)) than by women (\(38\%\)).
(c)
It seems that the insurance company is targeting older people (or at least all claimants during the year of interest were \(50\) years old or older).
The average claimant is \(64\) years old.
(d)
The distribution of the variable `PAID` is skewed to the right.
(e)
The shape of the distribution of the variable `PAID` has changed. Indeed, the transformation symmetrises the distribution. Logarithmic transformations are widely used in applied statistics. One of their advantages is that they help to symmetrise distributions that are skewed.
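The symmetrising effect can also be checked numerically. The sketch below is only an illustration under stated assumptions: the data are simulated from a log-normal distribution (the real `PAID` values may behave differently), and `sample_skewness` is a hypothetical helper that computes the moment-based skewness coefficient.

```python
import numpy as np

def sample_skewness(x):
    """Moment-based skewness: third central moment over variance ** 1.5."""
    x = np.asarray(x, dtype=float)
    centred = x - np.mean(x)
    return np.mean(centred ** 3) / np.mean(centred ** 2) ** 1.5

# Simulated right-skewed amounts (assumption: log-normal, not the real data).
rng = np.random.default_rng(42)
paid = rng.lognormal(mean=7.0, sigma=1.0, size=10_000)

skew_raw = sample_skewness(paid)          # clearly positive: long right tail
skew_log = sample_skewness(np.log(paid))  # close to zero: roughly symmetric

print("skewness of PAID:     ", skew_raw)
print("skewness of log(PAID):", skew_log)
```

A skewness near zero after the transformation is consistent with a roughly symmetric histogram, while a large positive value before it matches the long right tail.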
Exercise 4
Consider the `IntroductiontoNumPyExercise6.csv` dataset, which contains information on thirteen different economic and demographic variables for \(185\) countries (see Table 1 for more information about these variables). Based on the data set, answer the following questions:
(a) Calculate summary statistics for all variables (e.g. mean, median, standard deviation, minimum and maximum). Did all countries provide information for each variable? If not, please also indicate the total number of missing values for each variable.
(b) Using Matplotlib, generate a scatter plot between the total fertility rate (births per woman) and the life expectancy at birth (in years): `FERTILITY` vs. `LIFE_EXPECTANCY`. What can you conclude about the relationship between these two variables? Explain.
No. | Variable | Description |
---|---|---|
1 | `BIRTH_ATTEND` | Births attended by skilled health personnel (%) |
2 | `FEMALE_BOSS` | % of women holding positions as legislators, senior officials and managers |
3 | `FERTILITY` | Total fertility rate (births per woman) |
4 | `GDP` | Gross Domestic Product (in billions of USD) |
5 | `HEALTH_EXPENDITURE` | Health expenditure per capita in \(2004\) (Purchasing Power Parity (PPP) in USD) |
6 | `ILLITERATE` | Adult illiteracy rate (% of illiterate persons aged \(15\) and over) |
7 | `PHYSICIAN` | Number of physicians (per \(100,000\) people) |
8 | `POPULATION` | Population in \(2005\) (in millions) |
9 | `PRIVATE_HEALTH` | Private expenditure on health in \(2004\) (% of GDP) |
10 | `PUBLIC_EDUCATION` | Public expenditure on education (% of GDP) |
11 | `RESEARCHERS` | Researchers working in Research and Development (R&D) (per million people) |
12 | `SMOKING` | Prevalence of smoking, male (% of adults) |
13 | `LIFE_EXPECTANCY` | Life expectancy at birth (in years) |
Table 1: The data source is the United Nations’ Human Development Report which is available at https://hdr.undp.org/.
Use `data.describe()` for summary statistics and `data.count()` to check for missing values. Calculate missing values with `data.shape[0] - data.count()`. Create scatter plots using `plt.scatter(x, y)` and add labels with `plt.xlabel()`, `plt.ylabel()`, and `plt.title()`.
(a)
We can easily compute summary statistics for all variables by using the `describe()` function from Pandas. Moreover, since `count()` returns the number of non-missing values per variable, the total number of missing values for each variable is the number of rows minus the result of `count()`.
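As a rough sketch of part (a), the snippet below uses a tiny invented data frame with three of the columns from Table 1; the values are made up and only serve to show the calls.

```python
import numpy as np
import pandas as pd

# Invented stand-in for IntroductiontoNumPyExercise6.csv (five fake countries).
data = pd.DataFrame({
    "FERTILITY":       [1.3, 2.1, 5.8, np.nan, 6.9],
    "LIFE_EXPECTANCY": [81.0, 76.5, 52.0, 70.2, np.nan],
    "SMOKING":         [np.nan, 28.0, np.nan, 35.0, 12.0],
})

# Summary statistics; the "count" row already excludes missing values.
summary = data.describe()

# Missing values per variable: number of rows minus non-missing count.
n_missing = data.shape[0] - data.count()

# Rows with both variables observed, ready for the scatter plot in part (b).
complete = data[["FERTILITY", "LIFE_EXPECTANCY"]].dropna()

print(summary)
print(n_missing)
```

The same counts are returned by `data.isna().sum()`. For part (b), `plt.scatter(complete["FERTILITY"], complete["LIFE_EXPECTANCY"])` together with `plt.xlabel()`, `plt.ylabel()` and `plt.title()` produces the plot.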
(b)
We can see that as life expectancy increases, fertility rates tend to decrease. Some thoughts about this behaviour are:
Countries with lower life expectancy (e.g. developing countries) tend to have higher fertility rates due, among other reasons, to higher infant mortality rates. In fact, having more children increases the likelihood that more will survive to adulthood. In these countries, children are sometimes considered an asset, as they can contribute to the household (e.g. working, caring for elderly relatives, etc.).
On the other hand, developed countries have longer life expectancy due to, among other reasons, better health care and better living conditions. The probability of surviving to adulthood is higher and therefore there is less pressure to have many children.
\(\ldots\)