Exercise Set III - Pandas
Exercise 1
We start by practicing basic Pandas operations with Series and DataFrame objects.
Part A: Series Fundamentals
Create a Pandas Series object using the following actuarial data representing claim amounts:
Answer the following questions:
(a) Retrieve the fourth claim amount from the Series.
(b) Replace all claim amounts greater than \(2000\) with NaN using appropriate indexing.
(c) Create a new Series using the same data but with a custom index representing policy IDs: ['POL001', 'POL002', 'POL003', 'POL004', 'POL005', 'POL006', 'POL007', 'POL008'].
(d) Access the claim amount for policy 'POL004' using the custom index.
(e) Compute the mean claim amount, ignoring NaN values.
(f) Create a Series from the following dictionary representing total claims by insurance zone and access the claim amount for 'Zone3':
Hints:
- Use integer-based indexing with square brackets [] for position-based access.
- Remember that pandas uses 0-based indexing (the first element is at index 0).
- Use boolean indexing to select values: series[condition] = np.nan.
- Import numpy as np to use np.nan for missing values.
- Create a Series with a custom index: pd.Series(data, index=index_list).
- Use label-based indexing with square brackets for custom indices: series['label'].
- The .mean() method automatically ignores NaN values.
- Create a Series from a dictionary with pd.Series(dictionary); the dictionary keys become the index.
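The original data cells did not survive extraction, so the sketch below invents eight illustrative claim amounts and zone totals; the variable names and numbers are assumptions, not the course's official data.

```python
import numpy as np
import pandas as pd

# Assumed sample data (the original data cell was lost)
claims = pd.Series([1200.0, 850.0, 2500.0, 1750.0, 3100.0, 950.0, 2200.0, 1400.0])

# (a) Fourth claim amount (0-based position 3)
fourth = claims.iloc[3]

# (b) Replace claims greater than 2000 with NaN via boolean indexing
claims[claims > 2000] = np.nan

# (c) Same data with an assumed policy-ID index
policy_ids = ['POL001', 'POL002', 'POL003', 'POL004',
              'POL005', 'POL006', 'POL007', 'POL008']
claims_by_policy = pd.Series([1200.0, 850.0, 2500.0, 1750.0,
                              3100.0, 950.0, 2200.0, 1400.0],
                             index=policy_ids)

# (d) Label-based access with the custom index
pol4_claim = claims_by_policy['POL004']

# (e) .mean() skips NaN by default
mean_claim = claims.mean()

# (f) Series from a dictionary: keys become the index
zone_claims = pd.Series({'Zone1': 15000, 'Zone2': 22000,
                         'Zone3': 18500, 'Zone4': 9800})
zone3_total = zone_claims['Zone3']
```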
Part B: DataFrame Creation and Basic Operations
Create a Pandas DataFrame using the following insurance policy data:
Answer the following questions:
(a) Display the first three rows of the DataFrame.
(b) Add a new column Annual_Claims with values [450, 0, 1200, 325, 0].
(c) Retrieve the Premium column as a Series using both dictionary-style and attribute-style access.
(d) Filter the DataFrame to show only policies where Customer_Age is greater than \(35\).
(e) Calculate the claim ratio (Annual_Claims / Premium) for each policy and add it as a new column.
(f) Access the underlying NumPy array of the DataFrame and transpose the DataFrame.
Hints:
- Use the .head(n) method to display the first n rows.
- Add new columns with dictionary-style assignment: df['new_column'] = list_of_values.
- Dictionary-style column access: df['column_name'].
- Attribute-style column access: df.column_name (only works for valid Python identifiers).
- Boolean indexing for filtering: df[df['column_name'] > value].
- Perform arithmetic operations on columns: df['new_col'] = df['col1'] / df['col2'].
- Access the underlying NumPy array with the .values attribute.
- Transpose the DataFrame with the .T attribute.
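Since the policy data cell was lost, here is one possible solution sketch over an assumed five-policy table (the column names beyond those the questions mention are placeholders):

```python
import pandas as pd

# Assumed sample policy data (the original data cell was lost)
df = pd.DataFrame({
    'Policy_ID': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'Customer_Age': [34, 45, 29, 52, 38],
    'Premium': [1200, 950, 1500, 1100, 1300],
})

# (a) First three rows
print(df.head(3))

# (b) New column from a list (length must match the row count)
df['Annual_Claims'] = [450, 0, 1200, 325, 0]

# (c) The same column via both access styles
prem_dict = df['Premium']
prem_attr = df.Premium

# (d) Boolean filtering on age
older = df[df['Customer_Age'] > 35]

# (e) Element-wise column arithmetic
df['Claim_Ratio'] = df['Annual_Claims'] / df['Premium']

# (f) Underlying NumPy array and transpose
arr = df.values
df_t = df.T
```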
Exercise 2
Now we focus on data loading, inspection, and selection operations in Pandas.
Part A: Data Loading and Inspection
Create a sample insurance claims dataset and practice inspection methods:
Answer the following questions:
(a) Display the first 5 rows and last 3 rows of the dataset.
(b) Use the info() method to get a summary of the DataFrame structure.
(c) Generate descriptive statistics for numerical columns using describe().
(d) Check the data types of all columns and the shape of the DataFrame.
(e) Create simple CSV content as a string and demonstrate reading it with pd.read_csv().
Hints:
- Use .head(n) for the first n rows and .tail(n) for the last n rows.
- Call the .info() method to see column names, data types, non-null counts, and memory usage.
- Use the .describe() method to get count, mean, std, min, quartiles, and max for numerical columns.
- Use the .dtypes attribute to see data types and the .shape attribute for dimensions (rows, columns).
- Create a multi-line string in CSV format, use io.StringIO() to convert it to a file-like object, then pass it to pd.read_csv().
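A minimal sketch of the inspection workflow, assuming a ten-row claims table (the dataset itself was lost in extraction, so all values here are invented):

```python
import io
import pandas as pd

# Assumed sample claims dataset (the original data cell was lost)
claims_df = pd.DataFrame({
    'Claim_ID': range(1, 11),
    'Policy_Type': ['Auto', 'Home', 'Life', 'Auto', 'Home',
                    'Auto', 'Life', 'Home', 'Auto', 'Life'],
    'Claim_Amount': [2500, 3200, 1800, 4100, 2900,
                     1500, 3600, 2750, 3050, 2100],
    'Customer_Age': [34, 45, 29, 52, 38, 41, 33, 47, 36, 55],
})

# (a) First 5 and last 3 rows
print(claims_df.head(5))
print(claims_df.tail(3))

# (b) Structure summary: dtypes, non-null counts, memory usage
claims_df.info()

# (c) Descriptive statistics for numerical columns
print(claims_df.describe())

# (d) Data types and dimensions
print(claims_df.dtypes)
print(claims_df.shape)

# (e) Read CSV content from a string via a file-like object
csv_text = "Claim_ID,Claim_Amount\n1,2500\n2,3200\n"
from_csv = pd.read_csv(io.StringIO(csv_text))
```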
Part B: Boolean Indexing and Filtering
Using the claims dataset from Part A, perform the following filtering operations:
(a) Filter claims where Claim_Amount is greater than \(3000\).
(b) Filter claims for customers aged between 30 and 50 (inclusive) using multiple conditions.
(c) Filter claims where Policy_Type is either ‘Auto’ or ‘Life’ using the isin() method.
(d) Filter claims where Claim_Amount is greater than \(2800\) AND Policy_Type is ‘Home’.
(e) Filter claims where Customer_Age is NOT between 35 and 45 (use the NOT operator ~).
Hints:
- Use boolean indexing to create boolean masks: df[df['column_name'] > value].
- Use the logical operators & (and) and | (or) for multiple conditions.
- Each condition must be in parentheses: (condition1) & (condition2).
- For “between”, use: (df['column'] >= min_val) & (df['column'] <= max_val).
- Use the .isin() method with a list: df[df['column'].isin(['value1', 'value2'])].
- Use the ~ operator to negate conditions: df[~(condition)].
- Combine numerical and categorical filtering as needed.
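The filters can be sketched as follows, reusing the same assumed claims table as Part A (invented values, since the original dataset was lost):

```python
import pandas as pd

# Same assumed claims dataset as Part A (the original data cell was lost)
claims_df = pd.DataFrame({
    'Policy_Type': ['Auto', 'Home', 'Life', 'Auto', 'Home',
                    'Auto', 'Life', 'Home', 'Auto', 'Life'],
    'Claim_Amount': [2500, 3200, 1800, 4100, 2900,
                     1500, 3600, 2750, 3050, 2100],
    'Customer_Age': [34, 45, 29, 52, 38, 41, 33, 47, 36, 55],
})

# (a) Single-condition boolean mask
large = claims_df[claims_df['Claim_Amount'] > 3000]

# (b) Between 30 and 50 inclusive: each condition parenthesized, joined with &
mid_age = claims_df[(claims_df['Customer_Age'] >= 30) &
                    (claims_df['Customer_Age'] <= 50)]

# (c) Membership test with isin()
auto_life = claims_df[claims_df['Policy_Type'].isin(['Auto', 'Life'])]

# (d) Numerical AND categorical condition
big_home = claims_df[(claims_df['Claim_Amount'] > 2800) &
                     (claims_df['Policy_Type'] == 'Home')]

# (e) Negate a "between" condition with ~
not_35_45 = claims_df[~((claims_df['Customer_Age'] >= 35) &
                        (claims_df['Customer_Age'] <= 45))]
```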
Exercise 3
We now explore operations on data, including index alignment and arithmetic operations.
Part A: Universal Functions and Index Preservation
Create two Series objects representing quarterly premiums for different insurance products:
Answer the following questions:
(a) Apply the natural logarithm to product_a_premiums and show that the index is preserved.
(b) Calculate the square root of both premium series.
(c) Add the two premium series together and show the result.
(d) Calculate the percentage increase of Product A premiums over Product B premiums.
Hints:
- Apply NumPy universal functions directly to Series: np.log(), np.sqrt().
- Check index preservation by comparing .index attributes using .equals().
- NumPy ufuncs work element-wise on pandas objects while preserving indices.
- Use the arithmetic operators (+, -, *, /) directly between Series; Pandas automatically aligns indices for operations.
- Calculate the percentage increase as ((A - B) / B) * 100.
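One possible sketch, with assumed quarterly premium figures (the original data cell was lost, so the numbers and the Q1–Q4 index are placeholders):

```python
import numpy as np
import pandas as pd

# Assumed quarterly premiums (the original data cell was lost)
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
product_a_premiums = pd.Series([120000, 135000, 128000, 142000], index=quarters)
product_b_premiums = pd.Series([100000, 110000, 125000, 130000], index=quarters)

# (a) A ufunc operates element-wise and preserves the index
log_a = np.log(product_a_premiums)
index_preserved = log_a.index.equals(product_a_premiums.index)

# (b) Square roots of both series
sqrt_a = np.sqrt(product_a_premiums)
sqrt_b = np.sqrt(product_b_premiums)

# (c) Addition aligns on the shared index
total = product_a_premiums + product_b_premiums

# (d) Percentage increase of A over B
pct_increase = ((product_a_premiums - product_b_premiums)
                / product_b_premiums) * 100
```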
Part B: Index Alignment and Missing Data
Create two Series with partially overlapping indices:
Answer the following questions:
(a) Add the two series and observe how Pandas handles non-matching indices.
(b) Subtract series_2023 from series_2024 with fill_value=0.
(c) Multiply the series with fill_value=1.
(d) Create two DataFrames with different row and column indices and perform subtraction to see alignment in both dimensions.
Hints:
- Use the arithmetic operators (+, -, *) directly between Series; Pandas aligns indices automatically.
- Non-matching indices result in NaN values during operations.
- Use the explicit methods (.subtract(), .multiply(), .add()) instead of operators when you need fill_value.
- Pass the fill_value parameter to fill missing values during operations (e.g., fill_value=0 or fill_value=1).
- Create DataFrames with different but partially overlapping indices and columns using np.random.randint().
- DataFrame operations align both row and column indices simultaneously.
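A sketch of index alignment with assumed product totals for 2023 and 2024 (the original series were lost; labels and values are invented):

```python
import numpy as np
import pandas as pd

# Assumed series with partially overlapping indices
series_2023 = pd.Series({'Auto': 5000, 'Home': 3000, 'Life': 2000})
series_2024 = pd.Series({'Auto': 5500, 'Home': 3200, 'Travel': 800})

# (a) Labels present in only one series produce NaN
summed = series_2023 + series_2024

# (b) fill_value=0 treats missing labels as zero before subtracting
diff = series_2024.subtract(series_2023, fill_value=0)

# (c) fill_value=1 leaves the other operand unchanged under multiplication
product = series_2023.multiply(series_2024, fill_value=1)

# (d) DataFrames align on rows AND columns; only the overlap is non-NaN
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.integers(0, 10, (2, 2)),
                   index=['North', 'South'], columns=['Auto', 'Home'])
df2 = pd.DataFrame(rng.integers(0, 10, (2, 2)),
                   index=['South', 'East'], columns=['Home', 'Life'])
aligned_diff = df1 - df2  # only ('South', 'Home') overlaps
```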
Part C: DataFrame and Series Broadcasting
Create a DataFrame and practice broadcasting operations:
Answer the following questions:
(a) Add the first row to all rows in the DataFrame (row-wise broadcasting).
(b) Add the ‘Auto’ column to each column in the DataFrame (column-wise broadcasting).
(c) Create a Series representing adjustment factors and multiply the DataFrame by these factors column-wise.
Hints:
- For row-wise broadcasting, use df + df.iloc[0] to add the first row to all rows.
- For column-wise broadcasting, use df.add(df['column_name'], axis=0) with the axis=0 parameter.
- Create a Series with the same index as the DataFrame’s columns for column-wise operations.
- Use the .multiply(series, axis=1) method to multiply a DataFrame by a Series column-wise.
- Broadcasting rules: operations align along the specified axis (0 for rows, 1 for columns).
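The three broadcasting patterns can be sketched over an assumed region-by-product premium table (the original DataFrame was lost; the index, columns, and factors are invented):

```python
import pandas as pd

# Assumed premium table: regions as rows, products as columns
df = pd.DataFrame({'Auto': [100, 200, 300],
                   'Home': [150, 250, 350],
                   'Life': [120, 220, 320]},
                  index=['North', 'South', 'East'])

# (a) Row-wise broadcasting: the first row is added to every row
row_shifted = df + df.iloc[0]

# (b) Column-wise broadcasting: axis=0 aligns the Series on the row index,
#     so the 'Auto' column is added to each column
col_shifted = df.add(df['Auto'], axis=0)

# (c) Multiply each column by its adjustment factor (Series indexed by column)
factors = pd.Series({'Auto': 1.1, 'Home': 1.05, 'Life': 0.95})
adjusted = df.multiply(factors, axis=1)
```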
Exercise 4
In this exercise, we explore time series data manipulation, function application, and resampling operations.
Part A: Time Series Data and Function Application
Create a time series dataset representing daily insurance claim amounts over a 3-month period:
Answer the following questions:
(a) Convert the index to a different format showing only ‘YYYY-MM’.
(b) Apply the .apply() method to categorize claim amounts as ‘Low’ (< 1500), ‘Medium’ (1500-3000), or ‘High’ (> 3000).
(c) Resample the daily data to weekly totals using .resample('W').sum().
(d) Create quarterly summaries showing mean, median, and total claims using .resample('Q').agg().
(e) Filter the time series to show only data from February 2023 using date string indexing.
Hints:
- Use .index.strftime('%Y-%m') to format the datetime index.
- Define a function for categorization: def categorize(x): return 'Low' if x < 1500 else ('Medium' if x <= 3000 else 'High').
- Resampling uses frequency strings: ‘D’ (daily), ‘W’ (weekly), ‘M’ (monthly), ‘Q’ (quarterly).
- Use .agg() with a list of functions: ['mean', 'median', 'sum'].
- Date string indexing: series['2023-02'] selects all February 2023 data.
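A sketch of the time-series workflow, generating assumed daily claims for January–March 2023 with a seeded random generator (the original data cell was lost):

```python
import numpy as np
import pandas as pd

# Assumed daily claim amounts over a 3-month period
dates = pd.date_range('2023-01-01', '2023-03-31', freq='D')
rng = np.random.default_rng(42)
daily_claims = pd.Series(rng.integers(500, 4500, len(dates)).astype(float),
                         index=dates)

# (a) Format the datetime index as 'YYYY-MM' strings
year_month = daily_claims.index.strftime('%Y-%m')

# (b) Categorize each amount with .apply()
def categorize(x):
    return 'Low' if x < 1500 else ('Medium' if x <= 3000 else 'High')

categories = daily_claims.apply(categorize)

# (c) Weekly totals
weekly = daily_claims.resample('W').sum()

# (d) Quarterly mean, median, and total
# (newer pandas prefers the alias 'QE'; 'Q' still works with a warning)
quarterly = daily_claims.resample('Q').agg(['mean', 'median', 'sum'])

# (e) Partial date-string indexing selects all of February 2023
february = daily_claims.loc['2023-02']
```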
Exercise 5
This exercise focuses on handling missing data using various Pandas methods.
Part A: Missing Data Detection and Handling
Create a DataFrame with missing values representing insurance customer data:
Answer the following questions:
(a) Count the missing values in each column.
(b) Drop all rows that contain any missing values.
(c) Drop the columns that have more than 2 missing values, using the parameter thresh=6 in the method you use.
(d) Fill missing values in the ‘Age’ column with the median age.
(e) Forward fill missing values in ‘Annual_Income’.
(f) Use interpolation to fill missing ‘Credit_Score’ values.
(g) Replace missing ‘Policy_Premium’ values with a calculated value based on income (e.g., 3% of annual income).
Hints:
- Chain methods: df.isnull().sum() counts missing values per column.
- .dropna() removes rows with any NaN by default; use axis=1 for columns.
- The thresh parameter in .dropna() specifies the minimum number of non-null values required.
- Use .median() to calculate the median: df['column'].fillna(df['column'].median()).
- Forward fill: df['column'].ffill() (the older df['column'].fillna(method='ffill') form is deprecated).
- .interpolate() uses linear interpolation by default.
- Use boolean indexing to conditionally fill: df.loc[df['col'].isnull(), 'col'] = calculated_value.
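A sketch of the full missing-data workflow over an assumed eight-customer table (the original data cell was lost; the NaN placement is chosen so that exactly one column exceeds 2 missing values, matching the thresh=6 question):

```python
import numpy as np
import pandas as pd

# Assumed customer data with missing values (the original data cell was lost)
df = pd.DataFrame({
    'Age': [34, np.nan, 29, 52, np.nan, 41, 33, 47],
    'Annual_Income': [55000, 62000, np.nan, 71000, 48000, np.nan, 59000, 66000],
    'Credit_Score': [720, np.nan, 680, np.nan, 700, 710, np.nan, 690],
    'Policy_Premium': [1650, 1860, np.nan, 2130, 1440, np.nan, 1770, 1980],
})

# (a) Missing values per column
missing_counts = df.isnull().sum()

# (b) Drop rows with any missing value
no_missing_rows = df.dropna()

# (c) Keep only columns with at least 6 non-null values
#     (with 8 rows, this drops columns that have more than 2 NaNs)
thinned = df.dropna(axis=1, thresh=6)

# (d) Fill Age with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())

# (e) Forward fill income
df['Annual_Income'] = df['Annual_Income'].ffill()

# (f) Linear interpolation for credit score
df['Credit_Score'] = df['Credit_Score'].interpolate()

# (g) Conditional fill: 3% of annual income where the premium is missing
mask = df['Policy_Premium'].isnull()
df.loc[mask, 'Policy_Premium'] = df.loc[mask, 'Annual_Income'] * 0.03
```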
Exercise 6
This exercise covers hierarchical indexing (MultiIndex), concatenation, merging, and joining operations.
Part A: Hierarchical Indexing (MultiIndex)
Create a DataFrame with hierarchical indexing representing insurance data by region and product type:
Answer the following questions:
(a) Access all data for the ‘North’ region.
(b) Calculate the total policies sold by region.
(c) Reset the index to convert the MultiIndex to regular columns.
(d) Create a new MultiIndex by setting ‘Region’ and ‘Product’ as the index of the reset DataFrame.
Hints:
- MultiIndex selection: df.loc['level1_value'] selects on the outer level.
- Sum by level: df.groupby(level='level_name').sum() (the older df.sum(level='level_name') form has been removed from recent pandas versions).
- Reset the index: df.reset_index() moves the index levels to columns.
- Set a MultiIndex: df.set_index(['col1', 'col2']).
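A sketch of the MultiIndex round trip, assuming policy counts by region and product (the original DataFrame was lost; regions, products, and counts are invented):

```python
import pandas as pd

# Assumed policy counts indexed by (Region, Product)
idx = pd.MultiIndex.from_product([['North', 'South'], ['Auto', 'Home', 'Life']],
                                 names=['Region', 'Product'])
df = pd.DataFrame({'Policies_Sold': [120, 80, 60, 150, 90, 70]}, index=idx)

# (a) All rows for the 'North' region (outer level)
north = df.loc['North']

# (b) Totals per region via groupby on the index level
by_region = df.groupby(level='Region').sum()

# (c) MultiIndex levels become ordinary columns
flat = df.reset_index()

# (d) Rebuild the MultiIndex from the flat frame
rebuilt = flat.set_index(['Region', 'Product'])
```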
Part B: Concatenation, Merging, and Joining
Create separate DataFrames for different insurance operations:
Answer the following questions:
(a) Concatenate two DataFrames vertically.
(b) Merge customers and policies using an inner join on ‘Customer_ID’.
(c) Perform a left join between policies and claims on ‘Policy_ID’.
(d) Join customers with policies (after setting Customer_ID as the index).
(e) Create a full outer join between customers and policies showing all records.
Hints:
- Vertical concatenation: pd.concat([df1, df2], axis=0).
- Inner merge: pd.merge(df1, df2, on='column_name', how='inner').
- Left join: pd.merge(df1, df2, on='column_name', how='left').
- Using .join(): first set the index, then df1.join(df2, how='inner').
- Outer join: pd.merge(df1, df2, on='column_name', how='outer').
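The join operations can be sketched with small assumed customer, policy, and claims tables (the original data cells were lost; the keys are chosen so each join behaves differently):

```python
import pandas as pd

# Assumed tables (the original data cells were lost)
customers = pd.DataFrame({'Customer_ID': [1, 2, 3],
                          'Name': ['Ann', 'Bo', 'Cy']})
policies = pd.DataFrame({'Policy_ID': [10, 11, 12],
                         'Customer_ID': [1, 2, 4],
                         'Premium': [1200, 950, 1500]})
claims = pd.DataFrame({'Policy_ID': [10, 13],
                       'Claim_Amount': [450, 300]})
extra = pd.DataFrame({'Customer_ID': [5], 'Name': ['Di']})

# (a) Stack rows vertically
all_customers = pd.concat([customers, extra], axis=0, ignore_index=True)

# (b) Inner join keeps only Customer_IDs present in both tables
inner = pd.merge(customers, policies, on='Customer_ID', how='inner')

# (c) Left join keeps every policy; unmatched claims become NaN
left = pd.merge(policies, claims, on='Policy_ID', how='left')

# (d) .join() matches on the index, so set Customer_ID as index first
joined = customers.set_index('Customer_ID').join(
    policies.set_index('Customer_ID'), how='inner')

# (e) Outer join keeps all records from both sides
outer = pd.merge(customers, policies, on='Customer_ID', how='outer')
```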
Summary
You have completed comprehensive Pandas exercises covering the fundamental concepts from the lecture. These exercises have introduced you to:
- Series and DataFrame fundamentals: creation, indexing, and basic operations
- Data loading and inspection: reading data and getting summary statistics
- Selection and filtering: boolean indexing and conditional filtering
- Operations on data: universal functions, index alignment, and broadcasting
- Time series and function application: date formatting, resampling, and custom functions
- Missing data handling: detection, removal, and imputation strategies
- Hierarchical indexing and data combination: MultiIndex operations, merging, and joining