```mermaid
flowchart LR
    %% Main linear flow
    Acquire --> Tidy
    Tidy --> Transform
    %% Circular connections between Transform, Visualize, and Model
    Transform <--> Visualize
    Visualize <--> Model
    Model <--> Transform
    %% Final connections to Communicate
    %% Visualize --> Communicate
    Model --> Communicate
    %% Styling for clean appearance
    classDef default fill:#ffffff,stroke:#000000,stroke-width:2px,color:#000000,font-family:Arial,font-size:14px
    class Acquire,Tidy,Transform,Visualize,Model,Communicate default
```
Data Science Pipeline
Cookbook & Steps to Follow
As already introduced in the overview of the course, our final goal is to carry out a data science project. Below, we provide a non-comprehensive list of the steps involved:
- First, we must acquire data, either externally or internally. The external route involves scraping, calling APIs, or downloading data from user interfaces. Sometimes, the institution interested in the analysis already provides the data, which allows us to jump straight to the next step using internal data. It is seldom straightforward to obtain good data. Remember the concept "garbage in, garbage out" (GIGO): if a project starts with "bad" data (e.g., data found online from an unknown source, simulated data, or real data of poor quality due to many missing values or mistakes in the data collection process), then the analysis, no matter how good it is, can be rendered irrelevant. This is the difference between "toy examples" and real data science projects.
- Once we have the data, we must tidy it to make it suitable for analysis. In this step, we focus on cleaning the data, which includes handling missing values, duplicates and outliers, and spotting any mistakes. The formatting can also pose challenges that we have to fix (converting numerical values to the correct format, correcting strings, etc.); a short sketch of typical cleaning operations is given after this list.
- Next, we do some initial visualizations to get a feel for the data and spot any issues that remain after the cleaning step. We start with univariate plots, which show the distribution of a single variable (e.g., density plots, histograms, box plots and violin plots). Then, we move on to multivariate plots, which show the relationship between two or more variables (e.g., scatter plots, pair plots and heatmaps).
- Sometimes, we find that this is not enough, and we must transform the data to perform mathematical and statistical operations on it. This step is optional, but some transformations can make the data more suitable for the analysis. For example, we can transform the data to make it closer to normally distributed or to make relationships between variables more linear.
- Finally, at one point or another, we must share our analysis (all the previous steps) with the interested stakeholders (teammates, supervisors, clients, etc.). Communication can be in the form of a report and/or a presentation.
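To make the tidying step more concrete, here is a minimal pandas sketch of typical cleaning operations; the file name and column names (`raw_data.csv`, `amount`, `name`) are placeholders, not part of any course dataset.

```python
import pandas as pd

# Load the raw data (placeholder file name)
df = pd.read_csv("raw_data.csv")

# Inspect missing values and duplicated rows
print(df.isna().sum())
print(df.duplicated().sum())

# Basic cleaning: drop duplicates and rows with missing values
df = df.drop_duplicates().dropna()

# Fix formatting issues, e.g. numeric columns stored as strings
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["name"] = df["name"].str.strip().str.lower()
```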
These steps are seldom linear, and often one must rinse and repeat the steps above to obtain a high-quality analysis.
Below, we provide some examples of how to perform these steps.
Example I: Actuarial DS in Health Insurance
Suppose you are the appointed actuary of a health insurer1 in the United States of America (USA). As an appointed actuary, you are responsible for providing an estimate for this year’s technical provisions of that insurer2.
Before making such an estimation, you are asked to carry out an exploratory data analysis (EDA) of the health insurer's data. Below is a brief description of the dataset (`healthinsurance.csv`) provided by the health insurer:
| No. | Variable | Description |
|---|---|---|
| 1 | `age` | Age of primary beneficiary |
| 2 | `sex` | Insurance contractor gender: female or male |
| 3 | `bmi` | Body mass index (BMI) |
| 4 | `children` | Number of children covered by health insurance or number of dependents |
| 5 | `smoker` | Smoker or non-smoker |
| 6 | `region` | The beneficiary's residential area in the USA: northeast, northwest, southeast or southwest |
| 7 | `charges` | Individual medical costs billed by health insurance |

Table 1: This dataset provides insurance expenses against the following attributes of the insured: `age`, `sex`, `bmi` (body mass index), `children` (number of children or dependents), `smoker` (smoker or non-smoker) and `region` (northeast, northwest, southeast or southwest).
You are asked to do this to better understand the data. In fact, an EDA can help you discover more about the data and you can use it to find patterns, relationships or anomalies. In particular, the health insurer is interested in answering the following questions:
(a) How many clients does the health insurer have (with and without children or dependents)?

(b) Can you write a simple statement in Python to obtain the data type of each column (or variable)? Are there missing values in the dataset?

(c) For columns with numeric values, calculate their summary statistics. What can you conclude from these statistics (e.g. are the values within acceptable ranges)?

(d) Draw a matrix of scatter plots to assess pairwise relationships in the dataset (you can use the `pandas.plotting.scatter_matrix` function, which creates a grid of axes such that each variable in our dataset appears on the y-axis across a single row and on the x-axis across a single column; the diagonal shows the univariate distribution of each variable). When looking at the matrix of scatter plots, which relationship attracts your attention the most? With which variable would you start your EDA?

(e) Using Matplotlib:

    (i) Generate a histogram of the variable `age`. Comment on the shape of the distribution.

    (ii) Create a scatter plot that shows the relationship between the variable `age` and the variable `charges`. What conclusions can you draw from looking at the scatter plot?

    (iii) Make a bar plot of the variable `age` vs. the variable `charges`. The idea is that each bar shows the mean `charges` for each `age` (see the documentation on the `pandas.DataFrame.groupby` function and the `pandas.DataFrame.plot.bar` function, both of which might be useful to create the requested bar plot). What can you say about the behaviour of the variable `charges` with respect to `age` by looking at the plot?

(f) What can you say about the number of children? Generate a graph showing the frequency (number of clients) within each category of reported number of children. Furthermore, generate a box-and-whisker plot that shows the distribution of the variable `charges` with respect to each category (you can use the `matplotlib.pyplot.boxplot` function).

(g) What is the proportion of women and men, respectively, in the dataset (you can also use a pie chart to support your results)? In addition, produce a violin plot to assess the relationship between the variable `sex` and the variable `charges` (you can use the `matplotlib.pyplot.violinplot` function).

(h) What is the proportion of smokers and non-smokers, respectively, in the dataset (you can also use a pie chart to support your results)? Produce a scatter plot with a grouping variable (that is, a basic scatter plot, but with the points coloured based on `smoker`) to assess how the variable `charges` depends on the variable `smoker` and the variable `age` (you may find `matplotlib.patches.Patch` helpful when generating the legend of a scatter plot with a grouping variable). Who has higher `charges`: young smokers, young non-smokers, old smokers or old non-smokers? Explain.

(i) Assess how the BMI influences the `charges`. That is, evaluate the influence of being underweight, in a healthy range, overweight, obese or severely obese3 on the variable `charges` (generate a scatter plot with a grouping variable as in Question (h): `age` vs. `charges` with `bmi` as the grouping variable, considering the five categories provided by the National Health Service (NHS) of Scotland).

(j) So far, as the appointed actuary, you have explored the relationship between the variables `smoker` and `charges` (Question (h)) and between the variables `bmi` and `charges` (Question (i)). Now, you are interested in the combined influence of being severely obese and being a smoker. First, you can create a new categorical variable (`bmi_smoker`) which flags clients who are smokers and suffer from severe obesity according to the National Health Service (NHS) of Scotland (`bmi` \(\geq 40\)). Then, you can generate a scatter plot with a grouping variable as in Question (h) and Question (i), but using the new categorical variable `bmi_smoker` as the grouping variable.
(a)
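The code behind this answer is not shown in this version of the notes; here is a minimal sketch of how the counts could be obtained, assuming the data is read from `healthinsurance.csv`.

```python
import pandas as pd

df = pd.read_csv("healthinsurance.csv")

# Total number of clients
print(len(df))

# Clients with and without children or dependents
print((df["children"] > 0).sum())   # with children/dependents
print((df["children"] == 0).sum())  # without children/dependents
```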
(b)
We can see that the data types of the variables are integers, floats and objects. Moreover, there are no missing values.
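A sketch of the simple statements behind this answer (the data is re-read so that the snippet is self-contained):

```python
import pandas as pd

df = pd.read_csv("healthinsurance.csv")

# Data type of each column
print(df.dtypes)

# Number of missing values per column
print(df.isna().sum())
```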
(c)
We can see that the columns with numeric values seem to be within acceptable ranges. For instance, `age` \(\in [18, 64]\), which makes sense, as this is a dataset of adult clients of a health insurer.
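A sketch of how the summary statistics can be computed:

```python
import pandas as pd

df = pd.read_csv("healthinsurance.csv")

# Summary statistics (count, mean, std, min, quartiles, max) of the numeric columns
print(df.describe())
```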
(d)
By looking at the matrix of scatter plots, we can see that the variable `age` seems to be a good choice to start the exploratory data analysis (EDA). Indeed, there seems to be a clear relationship between the variable `age` and the variable `charges`:

- `charges` increase with `age`.
- From the scatter plot of the variable `age` vs. the variable `charges`, we realise that it looks like there are \(3\) distinct groups of clients (e.g. low-cost, medium-cost and high-cost clients).
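A sketch of how the matrix of scatter plots can be generated with `pandas.plotting.scatter_matrix`:

```python
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

df = pd.read_csv("healthinsurance.csv")

# Grid of pairwise scatter plots; the diagonal shows each variable's distribution
scatter_matrix(df, figsize=(8, 8), diagonal="hist")
plt.show()
```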
(e)
(i)
We can see that the distribution of the variable `age` seems to be uniform if one excludes the high concentration of young clients (with `age = 18` and `age = 19`, which turn out to be a total of \(68\) and \(69\) clients, respectively). Furthermore, there are between \(20\) and \(30\) clients for each of the other ages.
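A sketch of the histogram of `age`; using one bin per distinct age is an assumption, not a requirement.

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("healthinsurance.csv")

# Histogram of age, using one bin per distinct age
plt.hist(df["age"], bins=df["age"].nunique(), edgecolor="black")
plt.xlabel("age")
plt.ylabel("number of clients")
plt.show()
```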
(ii)
As mentioned above, from the scatter plot of the variable `age` vs. the variable `charges`, we can conclude the following two points:

- `charges` increase with `age`.
- It seems that there are \(3\) distinct groups of clients (e.g. low-cost, medium-cost and high-cost clients).
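A sketch of the scatter plot of `age` vs. `charges`:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("healthinsurance.csv")

# Scatter plot of charges against age
plt.scatter(df["age"], df["charges"], alpha=0.5)
plt.xlabel("age")
plt.ylabel("charges")
plt.show()
```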
(iii)
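The answer here is a plot; a sketch of how the bar plot of the mean `charges` per `age` can be produced with `groupby` and `plot.bar`:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("healthinsurance.csv")

# Mean charges for each age, shown as a bar plot
df.groupby("age")["charges"].mean().plot.bar(figsize=(10, 4))
plt.xlabel("age")
plt.ylabel("mean charges")
plt.show()
```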
(f)
By looking at the box-and-whisker plot, we realise that clients who reported having \(2\) or \(3\) children experience slightly higher `charges` than those who reported having \(0\), \(1\), \(4\) or \(5\) children.
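A sketch of the frequency plot and the box-and-whisker plot by number of children:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("healthinsurance.csv")

# Number of clients within each category of reported number of children
df["children"].value_counts().sort_index().plot.bar()
plt.xlabel("number of children")
plt.ylabel("number of clients")
plt.show()

# Distribution of charges for each number of children
categories = sorted(df["children"].unique())
groups = [df.loc[df["children"] == c, "charges"] for c in categories]
plt.boxplot(groups, labels=categories)
plt.xlabel("number of children")
plt.ylabel("charges")
plt.show()
```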
(g)
We can see that the data set is gender-balanced. That is, there are about the same number of women and men in the dataset of the health insurer.
From the violin plot, we realise that, in general, men have a greater number of high-cost `charges` than women.
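A sketch of the pie chart and the violin plot, assuming `sex` is coded as `female`/`male` as in Table 1:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("healthinsurance.csv")

# Proportion of women and men
df["sex"].value_counts().plot.pie(autopct="%.1f%%")
plt.ylabel("")
plt.show()

# Violin plot of charges by sex
data = [df.loc[df["sex"] == s, "charges"] for s in ["female", "male"]]
plt.violinplot(data)
plt.xticks([1, 2], ["female", "male"])
plt.ylabel("charges")
plt.show()
```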
(h)
We can see that the data set is not balanced in terms of smokers and non-smokers. This means that there are more customers who stated that they are smokers than those who stated that they are non-smokers in the data set.
As expected, smokers (regardless of `age`) appear to have higher `charges` than non-smokers.
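A sketch of the pie chart and the grouped scatter plot; the coding `yes`/`no` for the `smoker` column is an assumption and should be checked against the actual data.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

df = pd.read_csv("healthinsurance.csv")

# Proportion of smokers and non-smokers
df["smoker"].value_counts().plot.pie(autopct="%.1f%%")
plt.ylabel("")
plt.show()

# Scatter plot of charges vs age, coloured by smoker status (assumed coded yes/no)
colours = df["smoker"].map({"yes": "red", "no": "blue"})
plt.scatter(df["age"], df["charges"], c=colours, alpha=0.5)
plt.xlabel("age")
plt.ylabel("charges")
plt.legend(handles=[Patch(color="red", label="smoker"),
                    Patch(color="blue", label="non-smoker")])
plt.show()
```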
(i)
Although it is not entirely clear, it appears that clients with a higher BMI (i.e. obesity and severe obesity) have higher `charges`.
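A sketch of the grouped scatter plot by BMI category; the category boundaries below (underweight below 18.5, healthy 18.5 to 25, overweight 25 to 30, obese 30 to 40, severely obese 40 and above) follow the commonly published BMI ranges and should be checked against the NHS Scotland definitions.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

df = pd.read_csv("healthinsurance.csv")

# BMI categories (boundaries assumed; check against the NHS Scotland definitions)
bins = [0, 18.5, 25, 30, 40, float("inf")]
labels = ["underweight", "healthy", "overweight", "obese", "severely obese"]
df["bmi_cat"] = pd.cut(df["bmi"], bins=bins, labels=labels, right=False)

colour_map = dict(zip(labels, ["blue", "green", "orange", "red", "purple"]))
colours = df["bmi_cat"].astype(str).map(colour_map)
plt.scatter(df["age"], df["charges"], c=colours, alpha=0.5)
plt.xlabel("age")
plt.ylabel("charges")
plt.legend(handles=[Patch(color=c, label=l) for l, c in colour_map.items()])
plt.show()
```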
(j)
From the scatter plot above, we can clearly see that being a smoker with a high BMI is highly correlated with having high `charges` (regardless of `age`).
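A sketch of the scatter plot grouped by the new `bmi_smoker` variable; again, the `yes` coding of the `smoker` column is an assumption.

```python
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Patch

df = pd.read_csv("healthinsurance.csv")

# New categorical variable: smoker AND severely obese (BMI >= 40)
df["bmi_smoker"] = (df["smoker"] == "yes") & (df["bmi"] >= 40)

colours = df["bmi_smoker"].map({True: "red", False: "grey"})
plt.scatter(df["age"], df["charges"], c=colours, alpha=0.5)
plt.xlabel("age")
plt.ylabel("charges")
plt.legend(handles=[Patch(color="red", label="severely obese smoker"),
                    Patch(color="grey", label="other clients")])
plt.show()
```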
Example II: Principal Component Analysis for Interest Rates
First, we download data on government-issued bonds from here, or more specifically here.
Your turn to tidy the data
Give the dataframe cleaner column names and keep only those rows that do not contain any NAs.
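A minimal sketch of this tidying step; the file name and the new column names are placeholders that depend on the downloaded data.

```python
import pandas as pd

# Placeholder file name; adapt to the downloaded bond data
df = pd.read_csv("government_bonds.csv")

# Cleaner column names, e.g. a date column followed by one column per maturity
df.columns = ["date", "1y", "2y", "3y", "5y", "7y", "10y", "20y", "30y"]

# Keep only those rows without any missing values
df = df.dropna()
```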
The Yield Curve
The yield curve illustrates how the "yields" (i.e. the returns) of government bonds vary depending on the maturity. Over the last few years, they looked like the following.
Many interest rate derivatives depend on the shape (i.e. the individual values) of the yield curve. However, these values are strongly correlated. Therefore, it can be beneficial to reduce the dimensionality of the yield curve, i.e. to describe each yield curve by a combination of only a few independent variables, or principal components.
Principal Component Analysis
As a first step, we need to normalize the entries of our matrix to ensure that every column of the matrix has an equal impact. Thus, we transform the entries \(x_{ij}\) in the \(i\)-th row and \(j\)-th column via \[ \frac{x_{ij} - E[x_j]}{\sqrt{\text{Var}(x_j)}}\,, \] where \(x_j\) is the \(j\)-th column and \(E\) is the expected value.
We now want to find \(K\) linear combinations \(P_k\) (the principal components) of our original observations \(x_i = (x_{i1}, \ldots, x_{i8})\), i.e. \[ P_k = w_{k1}x_{i1} + \ldots + w_{k8}x_{i8} = x_i \cdot w_k\,. \] We can use these to project our data on the subspace spanned by \(w_1, \ldots, w_K\) via \[ \hat x_i = \sum_{k=1}^K (x_i \cdot w_k) w_k = \sum_{k=1}^K P_k w_k\,.\] We use the convention that \(||w_k||_2 = 1\).
However, we want to make sure that the principal components cover most of the variation of the original data. More precisely, we choose the first vector \(w_1\) such that it maximizes the variance of \(P_1\) (subject to the convention \(||w_1||_2 = 1\)), i.e. \[ w_1 = \text{argmax}_{||w_1||_2 = 1} \text{Var}(P_1) = \text{argmax}_{||w_1||_2 = 1} w_1^TX^TXw_1 = \text{argmax}_{||w_1||_2 = 1} w_1^T\Sigma w_1 \,,\] where \(\Sigma\) is the covariance matrix of \(X\) (proportional to \(X^TX\), since the columns of \(X\) have been normalized to zero mean). This maximum is attained when \(w_1\) is the eigenvector of \(\Sigma\) with the largest eigenvalue.
Numerical Analysis
Your turn to normalize the data and compute the covariance matrix
Normalize \(X\) (keep this information in a dataframe `df_norm`) and then compute its covariance matrix (store it in the dataframe `cov_matrix`).
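A sketch, assuming `df` holds the tidied yields from above with a `date` column and one column per maturity:

```python
# Keep only the numeric yield columns
X = df.drop(columns=["date"])

# Normalize: subtract the column mean and divide by the column standard deviation
df_norm = (X - X.mean()) / X.std()

# Covariance matrix of the normalized data
cov_matrix = df_norm.cov()
```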
Now, we determine the eigenvectors corresponding to the \(k\) largest eigenvalues of the covariance matrix \(\Sigma\).
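A sketch of the eigendecomposition and of the share of variation explained, using the `cov_matrix` from the previous step:

```python
import numpy as np

# Eigendecomposition of the symmetric covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix.to_numpy())

# eigh returns eigenvalues in ascending order; sort them in descending order
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Cumulative proportion of the total variation explained by the components
print(np.cumsum(eigenvalues) / eigenvalues.sum())
```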
The first 3 eigenvalues explain more than 99% of the overall variation in the data.
Let’s take a look at the weights \(w_1\), \(w_2\), \(w_3\).
The first weight vector is almost constant. Thus, it shifts the interest rate curve up or down depending on the value of the first principal component \(P_1\).
The second weight has a linear downward trend. Hence, it describes the rotation of the interest rate curve. A positive value of \(P_2\) gives the yield curve a downward trend, while a negative value rotates it to an upward trend.
The last weight vector has a V/U-shape. It thus decides the convexity/concavity of the yield curve.
Your turn to plot the weights
Plot the three weights as a function of the maturity.
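A sketch of the plot, using the eigenvectors computed above (the columns of `eigenvectors` are the weight vectors):

```python
import matplotlib.pyplot as plt

# Maturities correspond to the columns of the normalized yield matrix
maturities = df_norm.columns

# First three weight vectors as functions of the maturity
for k in range(3):
    plt.plot(maturities, eigenvectors[:, k], marker="o", label=f"$w_{k + 1}$")
plt.xlabel("maturity")
plt.ylabel("weight")
plt.legend()
plt.show()
```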
Let us now look at the yield curves approximated by the first 1, 2 and 3 principal components. Note that we performed the PCA on the normalized version of \(X\). To return to the original interest rate levels, we thus have to reverse the normalization after computing \[ \hat x_i = \sum_{k=1}^K (x_i \cdot w_k) w_k = \sum_{k=1}^K P_k w_k\,.\]
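A sketch of the reconstruction with \(K\) components and the reversal of the normalization, using `X`, `df_norm` and `eigenvectors` from the previous steps:

```python
K = 3  # number of principal components to keep

W = eigenvectors[:, :K]          # weight vectors w_1, ..., w_K as columns
P = df_norm.to_numpy() @ W       # principal components P_k for every observation
X_hat_norm = P @ W.T             # reconstruction in normalized units

# Reverse the normalization to get back to interest rate levels
X_hat = X_hat_norm * X.std().to_numpy() + X.mean().to_numpy()
```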
You can use the principal components to more conveniently cluster data in a lower-dimensional setting.
As a comparison, we also show the original/true yield curves.
Applying pre-implemented functions
There are several pre-implemented functions in the `sklearn` package that can be applied here. Take a look at the following cells. These reproduce the results we computed manually before.
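Those cells are not reproduced in this version of the notes; a sketch of how scikit-learn can be used for the same computation (note that `StandardScaler` uses the population standard deviation, so the numbers can differ very slightly from the manual version):

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Normalize the yields and fit a PCA with 3 components
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=3)
P = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)  # share of variation explained per component
print(pca.components_)                # rows are the weight vectors w_1, w_2, w_3
```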
Footnotes
1. In Switzerland, these companies are known as "caisses-maladie" (in French) or "Krankenkassen" (in German).
2. The technical provisions are also called "provisions techniques" (in French) and "Rückstellung" (in German).
3. See the National Health Service (NHS) of Scotland for more information about the BMI.