MScAS 2025 - DSAS Lecture 3
2025-10-07
Core Benefits
From Arrays to DataFrames
NumPy arrays use positions (0, 1, 2…), while Pandas DataFrames use meaningful labels (PolicyID, date, zone). This makes actuarial data analysis more intuitive and robust!
The Power of Labeled Data
Pandas uses meaningful labels instead of positions. E.g., merge policy data + claims by PolicyID → automatic alignment!
❌ NumPy (Position): rows are identified only by integer positions (0, 1, 2…), so reordered or misaligned arrays silently produce wrong results.
✅ Pandas (Labels): rows are identified by meaningful labels (e.g., PolicyID), so operations align automatically by index.
What is the main advantage of Pandas indices over NumPy arrays?
What is a Series?
A Series is a 1D labeled array capable of holding any data type. Think of it as a column with a name for each row.
What is a DataFrame?
A DataFrame is a 2D labeled data structure with columns of potentially different types.
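A minimal sketch of both objects, using made-up policy data:

```python
import pandas as pd

# A Series: 1D labeled array (labels here are policy IDs)
payments = pd.Series([1200.0, 850.5, 0.0],
                     index=['P001', 'P002', 'P003'],
                     name='Payment')

# A DataFrame: 2D table whose columns may hold different types
df = pd.DataFrame({
    'Zone': [1, 2, 1],
    'Claims': [3, 0, 1],
    'Payment': [1200.0, 0.0, 850.5],
}, index=['P001', 'P002', 'P003'])

print(payments['P002'])  # access by label -> 850.5
print(df['Claims'])      # a single DataFrame column is a Series
```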
Key inspection methods
- `head(n)` / `tail(n)`: View first/last n rows
- `info()`: Column types, non-null counts, memory
- `describe()`: Summary statistics (mean, std, quartiles)
- `shape`: Dimensions (rows, columns)
- `dtypes`: Data type of each column
- `columns`: Column names
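A quick demonstration of these inspection methods on a small, made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({
    'Zone': [1, 2, 1, 3],
    'Claims': [3, 0, 1, 2],
    'Payment': [1200.0, 0.0, 850.5, 430.0],
})

print(df.head(2))      # first 2 rows
print(df.shape)        # (4, 3): rows, columns
print(df.dtypes)       # data type of each column
print(df.columns)      # column names
df.info()              # types, non-null counts, memory usage
print(df.describe())   # mean, std, quartiles for numeric columns
```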
Two Ways to Index: Key Differences
- `loc`: Label-based indexing (slice endpoint inclusive)
- `iloc`: Integer position-based indexing (slice endpoint exclusive, like Python slicing)
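A sketch contrasting the two indexers on a toy DataFrame (the policy labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'Claims': [3, 0, 1], 'Payment': [1200.0, 0.0, 850.5]},
                  index=['P001', 'P002', 'P003'])

# loc: label-based, slice endpoint INCLUSIVE
print(df.loc['P001':'P002', 'Claims'])   # rows P001 AND P002

# iloc: position-based, slice endpoint EXCLUSIVE (like Python lists)
print(df.iloc[0:2, 0])                   # rows 0 and 1 only
```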
What's the main difference between `loc` and `iloc`?
File Types
Pandas supports many file formats. We’ll focus on the most common ones in actuarial practice.
| Format | Function | Delimiter | Common Use |
|---|---|---|---|
| CSV | `pd.read_csv()` | Comma (`,`) | Claims data, policies |
| TSV/TXT | `pd.read_csv(sep='\t')` | Tab/Custom | Legacy systems |
| Excel | `pd.read_excel()` | Binary | Spreadsheet reports |
| JSON | `pd.read_json()` | Structured | Web APIs, modern apps |
File Paths
Use relative paths from your working directory (`'data/file.csv'`) or absolute paths (`'C:/Users/name/data/file.csv'`). Forward slashes (`/`) work on all operating systems.
Common parameters for `pd.read_csv()`:
```python
import pandas as pd

# Custom delimiter
df = pd.read_csv('file.txt', sep=';')  # semicolon-separated

# Skip rows (e.g., skip header comments)
df = pd.read_csv('file.csv', skiprows=3)

# Select specific columns only
df = pd.read_csv('file.csv', usecols=['Claims', 'Payment'])

# Specify data types
df = pd.read_csv('file.csv', dtype={'Zone': str, 'Claims': int})

# Handle missing values
df = pd.read_csv('file.csv', na_values=['NA', 'missing', '-'])

# Set index column
df = pd.read_csv('file.csv', index_col='ID')
```
Note
Let's load the Swedish Motor Insurance dataset from the NumPy lecture, a classic actuarial dataset from 1977. The dataset contains the following variables: Kilometres (distance driven), Zone (geographic zone), Bonus (no-claims bonus), Make (car model group), Insured (policy-years of exposure), Claims (number of claims), and Payment (total claim payments).
Essential First Steps
After loading data, always inspect its structure, data types, and quality. This helps identify issues early and understand your dataset.
What does `pd.read_csv('data.csv', usecols=['Age', 'Premium'])` do?
Idea
Use boolean conditions to filter rows based on criteria.
Logical Operators
Combine conditions (each condition must be in parentheses) using:
- `&` for AND
- `|` for OR
- `~` for NOT

Matching Against a List of Values
Use `isin()` to filter rows where column values match a list of values.
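A sketch of boolean filtering and `isin()` on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Zone': [1, 2, 3, 1, 4],
    'Claims': [3, 0, 1, 5, 2],
    'Payment': [1200.0, 0.0, 850.5, 2100.0, 430.0],
})

# Single condition
high = df[df['Claims'] > 2]

# Combined conditions: parentheses around each, joined with & / | / ~
risky = df[(df['Claims'] > 2) & (df['Zone'] == 1)]
not_zone1 = df[~(df['Zone'] == 1)]

# Match against a list of values
urban = df[df['Zone'].isin([1, 2])]
```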
How do you combine multiple conditions in Pandas boolean indexing?
Key Insights
Like NumPy, Pandas operations are vectorized (fast, no loops needed). Pandas adds two powerful features:
- Index alignment: labels are matched automatically during operations
- Missing data: unmatched labels become `NaN`

Common operators: `+` (`add()`), `-` (`subtract()`), `*` (`multiply()`), `/` (`divide()`), `**` (`pow()`)
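A sketch of index alignment; the `fill_value` parameter of the method forms (such as `add()`) avoids the `NaN`s:

```python
import pandas as pd

q1 = pd.Series([10, 20, 30], index=['P001', 'P002', 'P003'])
q2 = pd.Series([5, 15], index=['P002', 'P004'])

# Labels are aligned automatically; unmatched labels yield NaN
print(q1 + q2)
# P001     NaN
# P002    25.0
# P003     NaN
# P004     NaN

# Method form with fill_value: treat missing labels as 0
print(q1.add(q2, fill_value=0))
```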
What happens when you add two Series with different indices?
Note
GroupBy follows three steps: Split data into groups → Apply function → Combine results.
Grouping by Multiple Columns
Pass a list of column names to group by multiple dimensions simultaneously. The result has a MultiIndex; use `.unstack()` to pivot it into a readable table format.
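A sketch of single- and multi-column grouping on made-up data (the `Kilometres` values here are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'Zone': [1, 1, 2, 2, 1, 2],
    'Kilometres': [1, 2, 1, 2, 1, 2],
    'Claims': [3, 0, 1, 5, 2, 4],
})

# Split-apply-combine: one group per Zone, sum within each group
print(df.groupby('Zone')['Claims'].sum())

# Group by two dimensions -> result has a MultiIndex
multi = df.groupby(['Zone', 'Kilometres'])['Claims'].sum()

# Pivot the inner level into columns for a readable table
print(multi.unstack())
```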
Applying Multiple Functions at Once
Use `.agg()` with a list of functions for quick summary statistics, or a dictionary to apply different functions to different columns. Essential for creating comprehensive risk reports!
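A sketch of both `.agg()` forms on made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    'Zone': [1, 1, 2, 2],
    'Claims': [3, 0, 1, 5],
    'Payment': [1200.0, 0.0, 850.5, 2100.0],
})

# List of functions: each is applied to the selected column
print(df.groupby('Zone')['Claims'].agg(['sum', 'mean', 'max']))

# Dictionary: a different function per column
print(df.groupby('Zone').agg({'Claims': 'sum', 'Payment': 'mean'}))
```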
Two Powerful GroupBy Operations
- `.transform()`: Returns the same shape as the input; broadcasts group statistics back to the original rows (for adding group-level features).
- `.filter()`: Returns a subset of the original data; keeps or discards entire groups based on a condition.

Both operations are shown in the sketch below.
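A sketch of transform and filter on made-up data (the threshold in `filter()` is arbitrary):

```python
import pandas as pd

df = pd.DataFrame({
    'Zone': [1, 1, 2, 2, 3],
    'Claims': [3, 0, 1, 5, 2],
})

# transform: same shape as input -> broadcast group mean to each row
df['ZoneMeanClaims'] = df.groupby('Zone')['Claims'].transform('mean')

# filter: keep only groups whose total claims exceed 3 (Zone 2 here)
busy_zones = df.groupby('Zone').filter(lambda g: g['Claims'].sum() > 3)

print(df)
print(busy_zones)
```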
What does `df.groupby('Zone')['Claims'].sum()` return?
When to use `map()`
Use `map()` to replace values in a Series (like "find & replace"). Perfect for converting codes to names (e.g., Zone 1 → "Stockholm"). Works with dictionaries or functions.
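A sketch using a dictionary; only Zone 1 → "Stockholm" comes from the slides, the other city names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Zone': [1, 2, 3, 1]})

# Dictionary-based "find & replace"; unmapped values become NaN
zone_names = {1: 'Stockholm', 2: 'Gothenburg', 3: 'Malmö'}
df['ZoneName'] = df['Zone'].map(zone_names)
print(df)
```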
Tip
Use `apply()` for calculations with conditions: if/else logic, multiple columns, custom formulas. Works on a Series or a DataFrame (with `axis=1` for rows).
Note
Use `axis=1` when your calculation needs multiple columns from each row (e.g., Claims ÷ Insured).
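A sketch of a row-wise claim-frequency calculation (Claims ÷ Insured) with a zero-exposure guard; the data are made up:

```python
import pandas as pd

df = pd.DataFrame({
    'Claims': [3, 0, 1],
    'Insured': [120.0, 45.0, 0.0],
})

# axis=1: the function receives one row at a time, so it can
# combine several columns with if/else logic
def claim_frequency(row):
    if row['Insured'] == 0:          # guard against division by zero
        return 0.0
    return row['Claims'] / row['Insured']

df['Frequency'] = df.apply(claim_frequency, axis=1)
print(df)
```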
You want to replace Zone codes (1, 2, 3…) with city names. Which method should you use?
Handling Dates
Convert string dates to datetime objects using `pd.to_datetime()` for time-based operations.
Setting Datetime as Index
Setting datetime as the index enables powerful time-based slicing and resampling operations (e.g., analyzing claim trends over time).
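A sketch with made-up dates showing conversion, index setting, and partial-string slicing:

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['2025-01-03', '2025-01-17', '2025-02-05'],
    'Claims': [3, 1, 4],
})

# Strings -> datetime64, so date arithmetic and slicing work
df['Date'] = pd.to_datetime(df['Date'])

# A DatetimeIndex enables partial-string slicing
df = df.set_index('Date')
print(df.loc['2025-01'])   # all January rows
```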
Tip
Resampling changes the frequency of time series data (e.g., daily → weekly → monthly).
Common Resampling Frequencies
| Alias | Frequency | Example Use |
|---|---|---|
| `D` | Calendar day | Daily claims |
| `B` | Business day | Trading days |
| `W` | Weekly | Weekly reports |
| `M` | Month end | Monthly summaries |
| `Q` | Quarter end | Quarterly reports |
| `A` | Year end | Annual statistics |
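A sketch resampling synthetic daily counts (note that recent pandas versions rename some aliases, e.g. `'M'` → `'ME'`, though the old ones still work with a warning):

```python
import pandas as pd

# Synthetic daily claim counts over two weeks
idx = pd.date_range('2025-01-01', periods=14, freq='D')
daily = pd.Series(range(14), index=idx, name='Claims')

# Downsample: daily -> weekly totals
print(daily.resample('W').sum())

# Daily -> month-end totals (alias 'M' as in the table above)
print(daily.resample('M').sum())
```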
Why do we use `pd.to_datetime()` on date columns?
Missing Data Representation
Pandas uses `NaN` (NumPy) and `None` (Python) interchangeably for missing data. Both are treated as "missing" by Pandas operations. Common in actuarial datasets (e.g., incomplete claims, pending investigations).
Tip
Use `isnull()` and `notnull()` to create boolean masks for filtering missing data. Combine with `.any()` or `.all()` to check across rows/columns.
When to Drop Missing Data?
Use `dropna()` when missing values are truly invalid or represent data collection errors.
⚠️ Caution: Dropping can result in loss of valuable information! Consider the `subset` parameter to drop rows only when specific critical columns are missing.
Imputation Strategies
`fillna()` replaces missing values with meaningful substitutes. Common strategies: a constant (e.g., 0), the column mean or median, or forward/backward fill for time series.
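A sketch combining detection, dropping, and imputation on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Claims': [3, np.nan, 1, np.nan],
    'Payment': [1200.0, 850.5, np.nan, 0.0],
})

print(df.isnull())         # boolean mask of missing cells
print(df.isnull().any())   # which columns contain missing values?

# Drop rows only when the critical column 'Claims' is missing
print(df.dropna(subset=['Claims']))

# Impute: constant for Claims, column mean for Payment
filled = df.fillna({'Claims': 0, 'Payment': df['Payment'].mean()})
print(filled)
```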
What does `fillna(0)` do?
Why Concatenate?
`pd.concat()` combines multiple DataFrames along an axis, e.g., to combine data from different time periods, regions, or data sources (such as merging quarterly claim reports).
Note
`axis=0` (default) stacks rows vertically; `axis=1` stacks columns horizontally. Both directions appear in the sketch below.
Managing Duplicate Indices
By default, `concat()` preserves the original indices, which can create duplicates! Use `ignore_index=True` to create a clean sequential index (0, 1, 2…).
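A sketch of both axes and of `ignore_index=True` on made-up quarterly data:

```python
import pandas as pd

q1 = pd.DataFrame({'Claims': [3, 0]}, index=[0, 1])
q2 = pd.DataFrame({'Claims': [1, 5]}, index=[0, 1])

# axis=0 (default): stack rows; original indices are kept -> duplicates!
stacked = pd.concat([q1, q2])
print(stacked.index.tolist())        # [0, 1, 0, 1]

# ignore_index=True: clean sequential index 0, 1, 2, ...
clean = pd.concat([q1, q2], ignore_index=True)

# axis=1: place columns side by side, aligned on the index
extra = pd.DataFrame({'Payment': [1200.0, 0.0]}, index=[0, 1])
wide = pd.concat([q1, extra], axis=1)
```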
What does `pd.concat([df1, df2])` do by default?
Join Types: The “What” of the Join
Join Methods: The “How” of the Join
Warning
Many-to-Many creates all possible combinations. Result size can explode! Use cautiously.
Sample data and the different join types are illustrated in the sketch below.
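A sketch with made-up policy and claims tables, showing a one-to-many merge under different join methods:

```python
import pandas as pd

# Sample data: one row per policy vs. many claims per policy
policies = pd.DataFrame({
    'PolicyID': ['P001', 'P002', 'P003'],
    'Zone': [1, 2, 1],
})
claims = pd.DataFrame({
    'PolicyID': ['P001', 'P001', 'P003', 'P004'],
    'Payment': [1200.0, 300.0, 850.5, 99.0],
})

# Inner join: only PolicyIDs present in BOTH tables
inner = pd.merge(policies, claims, on='PolicyID', how='inner')

# Left join: keep every policy; unmatched claim columns become NaN
left = pd.merge(policies, claims, on='PolicyID', how='left')

# One-to-many: policy attributes (e.g., Zone) are duplicated
# for each matching claim row in the result
print(inner)
```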
In a one-to-many merge, what gets duplicated?
Essential concepts covered today:
- Pandas Objects: Series & DataFrames with labeled indices for flexible data handling
- Data I/O & Inspection: `read_csv()`, `head()`, `info()`, `describe()` for loading and exploring
- Selection & Filtering: Boolean indexing, `loc`/`iloc`, `isin()` for targeted subsetting
- Index Alignment: Automatic label matching during operations (`NaN` for missing indices)
- GroupBy Operations: Split-apply-combine for category-wise aggregations and risk segmentation
- Custom Functions: `apply()` and `map()` for actuarial calculations (risk scores, severity)
- Time Series: `to_datetime()`, DatetimeIndex, and `resample()` for temporal analysis
- Missing Data: `isnull()`, `dropna()`, `fillna()` for handling incomplete actuarial data
- Combining Data: `concat()` and `merge()` for joining policy, claims, and customer datasets
Topics NOT covered today (see lecture notes for details)
`stack()`, `unstack()`, `melt()`, and `pivot()` operations (the latter two will be added tomorrow)

📚 Resources: Pandas lecture notes for comprehensive coverage and more examples
DSAS 2025 | HEC Lausanne