Exercise Set III - Pandas
Exercise 1
We start by practicing basic Pandas operations with Series and DataFrame objects.
Part A: Series Fundamentals
Create a Pandas Series object using the following actuarial data representing claim amounts:
Answer the following questions:
(a) Retrieve the fourth claim amount from the Series.
(b) Replace all claim amounts greater than \(2000\) with NaN using appropriate indexing.
(c) Create a new Series using the same data but with a custom index representing policy IDs: ['POL001', 'POL002', 'POL003', 'POL004', 'POL005', 'POL006', 'POL007', 'POL008'].
(d) Access the claim amount for policy 'POL004' using the custom index.
(e) Compute the mean claim amount, ignoring NaN values.
(f) Create a Series from the following dictionary representing total claims by insurance zone and access the claim amount for 'Zone3':
Hints:
- Use integer-based indexing with square brackets [] for position-based access.
- Remember that pandas uses 0-based indexing (the first element is at index 0).
- Use boolean indexing to select values: series[condition] = np.nan.
- Import numpy as np to use np.nan for missing values.
- Create a Series with a custom index: pd.Series(data, index=index_list).
- Use label-based indexing with square brackets for custom indices: series['label'].
- The .mean() method automatically ignores NaN values.
- Create a Series from a dictionary with pd.Series(dictionary); the dictionary keys become the index.
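The original data cells did not survive extraction, so the sketch below invents eight illustrative claim amounts and zone totals; the variable names and numbers are assumptions, not the course's official data.

```python
import numpy as np
import pandas as pd

# Assumed sample data (the original data cell was lost)
claims = pd.Series([1200.0, 850.0, 2500.0, 1750.0, 3100.0, 950.0, 2200.0, 1400.0])

# (a) Fourth claim amount (0-based position 3)
fourth = claims.iloc[3]

# (b) Replace claims greater than 2000 with NaN via boolean indexing
claims[claims > 2000] = np.nan

# (c) Same data with an assumed policy-ID index
policy_ids = ['POL001', 'POL002', 'POL003', 'POL004',
              'POL005', 'POL006', 'POL007', 'POL008']
claims_by_policy = pd.Series([1200.0, 850.0, 2500.0, 1750.0,
                              3100.0, 950.0, 2200.0, 1400.0],
                             index=policy_ids)

# (d) Label-based access with the custom index
pol4_claim = claims_by_policy['POL004']

# (e) .mean() skips NaN by default
mean_claim = claims.mean()

# (f) Series from a dictionary: keys become the index
zone_claims = pd.Series({'Zone1': 15000, 'Zone2': 22000,
                         'Zone3': 18500, 'Zone4': 9800})
zone3_total = zone_claims['Zone3']
```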
Part B: DataFrame Creation and Basic Operations
Create a Pandas DataFrame using the following insurance policy data:
Answer the following questions:
(a) Display the first three rows of the DataFrame.
(b) Add a new column Annual_Claims with values [450, 0, 1200, 325, 0].
(c) Retrieve the Premium column as a Series using both dictionary-style and attribute-style access.
(d) Filter the DataFrame to show only policies where Customer_Age is greater than \(35\).
(e) Calculate the claim ratio (Annual_Claims / Premium) for each policy and add it as a new column.
(f) Access the underlying NumPy array of the DataFrame and transpose the DataFrame.
Hints:
- Use the .head(n) method to display the first n rows.
- Add new columns with dictionary-style assignment: df['new_column'] = list_of_values.
- Dictionary-style column access: df['column_name'].
- Attribute-style column access: df.column_name (only works for valid Python identifiers).
- Boolean indexing for filtering: df[df['column_name'] > value].
- Perform arithmetic operations on columns: df['new_col'] = df['col1'] / df['col2'].
- Access the underlying NumPy array with the .values attribute.
- Transpose the DataFrame with the .T attribute.
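Since the policy data cell was lost, here is one possible solution sketch over an assumed five-policy table (the column names beyond those the questions mention are placeholders):

```python
import pandas as pd

# Assumed sample policy data (the original data cell was lost)
df = pd.DataFrame({
    'Policy_ID': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'Customer_Age': [34, 45, 29, 52, 38],
    'Premium': [1200, 950, 1500, 1100, 1300],
})

# (a) First three rows
print(df.head(3))

# (b) New column from a list (length must match the row count)
df['Annual_Claims'] = [450, 0, 1200, 325, 0]

# (c) The same column via both access styles
prem_dict = df['Premium']
prem_attr = df.Premium

# (d) Boolean filtering on age
older = df[df['Customer_Age'] > 35]

# (e) Element-wise column arithmetic
df['Claim_Ratio'] = df['Annual_Claims'] / df['Premium']

# (f) Underlying NumPy array and transpose
arr = df.values
df_t = df.T
```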
Exercise 2
Now we focus on data loading, inspection, and selection operations in Pandas.
Part A: Data Loading and Inspection
Create a sample insurance claims dataset and practice inspection methods:
Answer the following questions:
(a) Display the first 5 rows and last 3 rows of the dataset.
(b) Use the info() method to get a summary of the DataFrame structure.
(c) Generate descriptive statistics for numerical columns using describe().
(d) Check the data types of all columns and the shape of the DataFrame.
(e) Create simple CSV content as a string and demonstrate reading it with pd.read_csv().
Hints:
- Use .head(n) for the first n rows and .tail(n) for the last n rows.
- Call the .info() method to see column names, data types, non-null counts, and memory usage.
- Use the .describe() method to get count, mean, std, min, quartiles, and max for numerical columns.
- Use the .dtypes attribute to see data types and the .shape attribute for dimensions (rows, columns).
- Create a multi-line string in CSV format, use io.StringIO() to convert it to a file-like object, then pass it to pd.read_csv().
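A minimal sketch of the inspection workflow, assuming a ten-row claims table (the dataset itself was lost in extraction, so all values here are invented):

```python
import io
import pandas as pd

# Assumed sample claims dataset (the original data cell was lost)
claims_df = pd.DataFrame({
    'Claim_ID': range(1, 11),
    'Policy_Type': ['Auto', 'Home', 'Life', 'Auto', 'Home',
                    'Auto', 'Life', 'Home', 'Auto', 'Life'],
    'Claim_Amount': [2500, 3200, 1800, 4100, 2900,
                     1500, 3600, 2750, 3050, 2100],
    'Customer_Age': [34, 45, 29, 52, 38, 41, 33, 47, 36, 55],
})

# (a) First 5 and last 3 rows
print(claims_df.head(5))
print(claims_df.tail(3))

# (b) Structure summary: dtypes, non-null counts, memory usage
claims_df.info()

# (c) Descriptive statistics for numerical columns
print(claims_df.describe())

# (d) Data types and dimensions
print(claims_df.dtypes)
print(claims_df.shape)

# (e) Read CSV content from a string via a file-like object
csv_text = "Claim_ID,Claim_Amount\n1,2500\n2,3200\n"
from_csv = pd.read_csv(io.StringIO(csv_text))
```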
Part B: Boolean Indexing and Filtering
Using the claims dataset from Part A, perform the following filtering operations:
(a) Filter claims where Claim_Amount is greater than \(3000\).
(b) Filter claims for customers aged between 30 and 50 (inclusive) using multiple conditions.
(c) Filter claims where Policy_Type is either ‘Auto’ or ‘Life’ using the isin() method.
(d) Filter claims where Claim_Amount is greater than \(2800\) AND Policy_Type is ‘Home’.
(e) Filter claims where Customer_Age is NOT between 35 and 45 (use the NOT operator ~).
Hints:
- Use boolean indexing to create boolean masks: df[df['column_name'] > value].
- Use the logical operators & (and) and | (or) for multiple conditions.
- Each condition must be in parentheses: (condition1) & (condition2).
- For “between”, use: (df['column'] >= min_val) & (df['column'] <= max_val).
- Use the .isin() method with a list: df[df['column'].isin(['value1', 'value2'])].
- Use the ~ operator to negate conditions: df[~(condition)].
- Combine numerical and categorical filtering as needed.
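The filters can be sketched as follows, reusing the same assumed claims table as Part A (invented values, since the original dataset was lost):

```python
import pandas as pd

# Same assumed claims dataset as Part A (the original data cell was lost)
claims_df = pd.DataFrame({
    'Policy_Type': ['Auto', 'Home', 'Life', 'Auto', 'Home',
                    'Auto', 'Life', 'Home', 'Auto', 'Life'],
    'Claim_Amount': [2500, 3200, 1800, 4100, 2900,
                     1500, 3600, 2750, 3050, 2100],
    'Customer_Age': [34, 45, 29, 52, 38, 41, 33, 47, 36, 55],
})

# (a) Single-condition boolean mask
large = claims_df[claims_df['Claim_Amount'] > 3000]

# (b) Between 30 and 50 inclusive: each condition parenthesized, joined with &
mid_age = claims_df[(claims_df['Customer_Age'] >= 30) &
                    (claims_df['Customer_Age'] <= 50)]

# (c) Membership test with isin()
auto_life = claims_df[claims_df['Policy_Type'].isin(['Auto', 'Life'])]

# (d) Numerical AND categorical condition
big_home = claims_df[(claims_df['Claim_Amount'] > 2800) &
                     (claims_df['Policy_Type'] == 'Home')]

# (e) Negate a "between" condition with ~
not_35_45 = claims_df[~((claims_df['Customer_Age'] >= 35) &
                        (claims_df['Customer_Age'] <= 45))]
```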
Exercise 3
We now explore operations on data, including index alignment and arithmetic operations.
Part A: Universal Functions and Index Preservation
Create two Series objects representing quarterly premiums for different insurance products:
Answer the following questions:
(a) Apply the natural logarithm to product_a_premiums and show that the index is preserved.
(b) Calculate the square root of both premium series.
(c) Add the two premium series together and show the result.
(d) Calculate the percentage increase of Product A premiums over Product B premiums.
Hints:
- Apply NumPy universal functions directly to Series: np.log(), np.sqrt().
- Check index preservation by comparing .index attributes using .equals().
- NumPy ufuncs work element-wise on pandas objects while preserving indices.
- Use the arithmetic operators (+, -, *, /) directly between Series; Pandas automatically aligns indices for operations.
- Calculate the percentage increase as ((A - B) / B) * 100.
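One possible sketch, with assumed quarterly premium figures (the original data cell was lost, so the numbers and the Q1–Q4 index are placeholders):

```python
import numpy as np
import pandas as pd

# Assumed quarterly premiums (the original data cell was lost)
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
product_a_premiums = pd.Series([120000, 135000, 128000, 142000], index=quarters)
product_b_premiums = pd.Series([100000, 110000, 125000, 130000], index=quarters)

# (a) A ufunc operates element-wise and preserves the index
log_a = np.log(product_a_premiums)
index_preserved = log_a.index.equals(product_a_premiums.index)

# (b) Square roots of both series
sqrt_a = np.sqrt(product_a_premiums)
sqrt_b = np.sqrt(product_b_premiums)

# (c) Addition aligns on the shared index
total = product_a_premiums + product_b_premiums

# (d) Percentage increase of A over B
pct_increase = ((product_a_premiums - product_b_premiums)
                / product_b_premiums) * 100
```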
Part B: Index Alignment and Missing Data
Create two Series with partially overlapping indices:
Answer the following questions:
(a) Add the two series and observe how Pandas handles non-matching indices.
(b) Subtract series_2023 from series_2024 with fill_value=0.
(c) Multiply the series with fill_value=1.
(d) Create two DataFrames with different row and column indices and perform subtraction to see alignment in both dimensions.
Hints:
- Use the arithmetic operators (+, -, *) directly between Series; Pandas aligns indices automatically.
- Non-matching indices result in NaN values during operations.
- Use the explicit methods (.subtract(), .multiply(), .add()) instead of operators when you need fill_value.
- Pass the fill_value parameter to fill missing values during operations (e.g., fill_value=0 or fill_value=1).
- Create DataFrames with different but partially overlapping indices and columns using np.random.randint().
- DataFrame operations align both row and column indices simultaneously.
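A sketch of index alignment with assumed product totals for 2023 and 2024 (the original series were lost; labels and values are invented):

```python
import numpy as np
import pandas as pd

# Assumed series with partially overlapping indices
series_2023 = pd.Series({'Auto': 5000, 'Home': 3000, 'Life': 2000})
series_2024 = pd.Series({'Auto': 5500, 'Home': 3200, 'Travel': 800})

# (a) Labels present in only one series produce NaN
summed = series_2023 + series_2024

# (b) fill_value=0 treats missing labels as zero before subtracting
diff = series_2024.subtract(series_2023, fill_value=0)

# (c) fill_value=1 leaves the other operand unchanged under multiplication
product = series_2023.multiply(series_2024, fill_value=1)

# (d) DataFrames align on rows AND columns; only the overlap is non-NaN
rng = np.random.default_rng(0)
df1 = pd.DataFrame(rng.integers(0, 10, (2, 2)),
                   index=['North', 'South'], columns=['Auto', 'Home'])
df2 = pd.DataFrame(rng.integers(0, 10, (2, 2)),
                   index=['South', 'East'], columns=['Home', 'Life'])
aligned_diff = df1 - df2  # only ('South', 'Home') overlaps
```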
Part C: DataFrame and Series Broadcasting
Create a DataFrame and practice broadcasting operations:
Answer the following questions:
(a) Add the first row to all rows in the DataFrame (row-wise broadcasting).
(b) Add the ‘Auto’ column to each column in the DataFrame (column-wise broadcasting).
(c) Create a Series representing adjustment factors and multiply the DataFrame by these factors column-wise.
Hints:
- For row-wise broadcasting, use df + df.iloc[0] to add the first row to all rows.
- For column-wise broadcasting, use df.add(df['column_name'], axis=0) with the axis=0 parameter.
- Create a Series with the same index as the DataFrame’s columns for column-wise operations.
- Use the .multiply(series, axis=1) method to multiply a DataFrame by a Series column-wise.
- Broadcasting rules: operations align along the specified axis (0 for rows, 1 for columns).
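The three broadcasting patterns can be sketched over an assumed region-by-product premium table (the original DataFrame was lost; the index, columns, and factors are invented):

```python
import pandas as pd

# Assumed premium table: regions as rows, products as columns
df = pd.DataFrame({'Auto': [100, 200, 300],
                   'Home': [150, 250, 350],
                   'Life': [120, 220, 320]},
                  index=['North', 'South', 'East'])

# (a) Row-wise broadcasting: the first row is added to every row
row_shifted = df + df.iloc[0]

# (b) Column-wise broadcasting: axis=0 aligns the Series on the row index,
#     so the 'Auto' column is added to each column
col_shifted = df.add(df['Auto'], axis=0)

# (c) Multiply each column by its adjustment factor (Series indexed by column)
factors = pd.Series({'Auto': 1.1, 'Home': 1.05, 'Life': 0.95})
adjusted = df.multiply(factors, axis=1)
```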
Exercise 4
In this exercise, we explore time series data manipulation, function application, and resampling operations.
Part A: Time Series Data and Function Application
Create a time series dataset representing daily insurance claim amounts over a 3-month period:
Answer the following questions:
(a) Convert the index to a different format showing only ‘YYYY-MM’.
(b) Apply the .apply() method to categorize claim amounts as ‘Low’ (< 1500), ‘Medium’ (1500-3000), or ‘High’ (> 3000).
(c) Resample the daily data to weekly totals using .resample('W').sum().
(d) Create quarterly summaries showing mean, median, and total claims using .resample('Q').agg().
(e) Filter the time series to show only data from February 2023 using date string indexing.
Hints:
- Use .index.strftime('%Y-%m') to format the datetime index.
- Define a function for categorization: def categorize(x): return 'Low' if x < 1500 else ('Medium' if x <= 3000 else 'High').
- Resampling uses frequency strings: ‘D’ (daily), ‘W’ (weekly), ‘M’ (monthly), ‘Q’ (quarterly).
- Use .agg() with a list of functions: ['mean', 'median', 'sum'].
- Date string indexing: series['2023-02'] selects all February 2023 data.
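A sketch of the time-series workflow, generating assumed daily claims for January–March 2023 with a seeded random generator (the original data cell was lost):

```python
import numpy as np
import pandas as pd

# Assumed daily claim amounts over a 3-month period
dates = pd.date_range('2023-01-01', '2023-03-31', freq='D')
rng = np.random.default_rng(42)
daily_claims = pd.Series(rng.integers(500, 4500, len(dates)).astype(float),
                         index=dates)

# (a) Format the datetime index as 'YYYY-MM' strings
year_month = daily_claims.index.strftime('%Y-%m')

# (b) Categorize each amount with .apply()
def categorize(x):
    return 'Low' if x < 1500 else ('Medium' if x <= 3000 else 'High')

categories = daily_claims.apply(categorize)

# (c) Weekly totals
weekly = daily_claims.resample('W').sum()

# (d) Quarterly mean, median, and total
# (newer pandas prefers the alias 'QE'; 'Q' still works with a warning)
quarterly = daily_claims.resample('Q').agg(['mean', 'median', 'sum'])

# (e) Partial date-string indexing selects all of February 2023
february = daily_claims.loc['2023-02']
```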
Exercise 5
This exercise focuses on handling missing data using various Pandas methods.
Part A: Missing Data Detection and Handling
Create a DataFrame with missing values representing insurance customer data:
Answer the following questions:
(a) Count the missing values in each column.
(b) Drop all rows that contain any missing values.
(c) Drop the columns that have more than 2 missing values, using the parameter thresh=6 in the method you use.
(d) Fill missing values in the ‘Age’ column with the median age.
(e) Forward fill missing values in ‘Annual_Income’.
(f) Use interpolation to fill missing ‘Credit_Score’ values.
(g) Replace missing ‘Policy_Premium’ values with a calculated value based on income (e.g., 3% of annual income).
Hints:
- Chain methods: df.isnull().sum() counts missing values per column.
- .dropna() removes rows with any NaN by default; use axis=1 for columns.
- The thresh parameter in .dropna() specifies the minimum number of non-null values required.
- Use .median() to calculate the median: df['column'].fillna(df['column'].median()).
- Forward fill: df['column'].ffill() (the older df['column'].fillna(method='ffill') form is deprecated).
- .interpolate() uses linear interpolation by default.
- Use boolean indexing to conditionally fill: df.loc[df['col'].isnull(), 'col'] = calculated_value.
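A sketch of the full missing-data workflow over an assumed eight-customer table (the original data cell was lost; the NaN placement is chosen so that exactly one column exceeds 2 missing values, matching the thresh=6 question):

```python
import numpy as np
import pandas as pd

# Assumed customer data with missing values (the original data cell was lost)
df = pd.DataFrame({
    'Age': [34, np.nan, 29, 52, np.nan, 41, 33, 47],
    'Annual_Income': [55000, 62000, np.nan, 71000, 48000, np.nan, 59000, 66000],
    'Credit_Score': [720, np.nan, 680, np.nan, 700, 710, np.nan, 690],
    'Policy_Premium': [1650, 1860, np.nan, 2130, 1440, np.nan, 1770, 1980],
})

# (a) Missing values per column
missing_counts = df.isnull().sum()

# (b) Drop rows with any missing value
no_missing_rows = df.dropna()

# (c) Keep only columns with at least 6 non-null values
#     (with 8 rows, this drops columns that have more than 2 NaNs)
thinned = df.dropna(axis=1, thresh=6)

# (d) Fill Age with the median age
df['Age'] = df['Age'].fillna(df['Age'].median())

# (e) Forward fill income
df['Annual_Income'] = df['Annual_Income'].ffill()

# (f) Linear interpolation for credit score
df['Credit_Score'] = df['Credit_Score'].interpolate()

# (g) Conditional fill: 3% of annual income where the premium is missing
mask = df['Policy_Premium'].isnull()
df.loc[mask, 'Policy_Premium'] = df.loc[mask, 'Annual_Income'] * 0.03
```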
Exercise 6
This exercise covers hierarchical indexing (MultiIndex), concatenation, merging, and joining operations.
Part A: Hierarchical Indexing (MultiIndex)
Create a DataFrame with hierarchical indexing representing insurance data by region and product type:
Answer the following questions:
(a) Access all data for the ‘North’ region.
(b) Calculate the total policies sold by region.
(c) Reset the index to convert the MultiIndex to regular columns.
(d) Create a new MultiIndex by setting ‘Region’ and ‘Product’ as the index of the reset DataFrame.
Hints:
- MultiIndex selection: df.loc['level1_value'] selects on the outer level.
- Sum by level: df.groupby(level='level_name').sum() (the older df.sum(level='level_name') form has been removed from recent pandas versions).
- Reset the index: df.reset_index() moves the index levels to columns.
- Set a MultiIndex: df.set_index(['col1', 'col2']).
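A sketch of the MultiIndex round trip, assuming policy counts by region and product (the original DataFrame was lost; regions, products, and counts are invented):

```python
import pandas as pd

# Assumed policy counts indexed by (Region, Product)
idx = pd.MultiIndex.from_product([['North', 'South'], ['Auto', 'Home', 'Life']],
                                 names=['Region', 'Product'])
df = pd.DataFrame({'Policies_Sold': [120, 80, 60, 150, 90, 70]}, index=idx)

# (a) All rows for the 'North' region (outer level)
north = df.loc['North']

# (b) Totals per region via groupby on the index level
by_region = df.groupby(level='Region').sum()

# (c) MultiIndex levels become ordinary columns
flat = df.reset_index()

# (d) Rebuild the MultiIndex from the flat frame
rebuilt = flat.set_index(['Region', 'Product'])
```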
Part B: Concatenation, Merging, and Joining
Create separate DataFrames for different insurance operations:
Answer the following questions:
(a) Concatenate two DataFrames vertically.
(b) Merge customers and policies using an inner join on ‘Customer_ID’.
(c) Perform a left join between policies and claims on ‘Policy_ID’.
(d) Join customers with policies (after setting Customer_ID as the index).
(e) Create a full outer join between customers and policies showing all records.
Hints:
- Vertical concatenation: pd.concat([df1, df2], axis=0).
- Inner merge: pd.merge(df1, df2, on='column_name', how='inner').
- Left join: pd.merge(df1, df2, on='column_name', how='left').
- Using .join(): first set the index, then df1.join(df2, how='inner').
- Outer join: pd.merge(df1, df2, on='column_name', how='outer').
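The join operations can be sketched with small assumed customer, policy, and claims tables (the original data cells were lost; the keys are chosen so each join behaves differently):

```python
import pandas as pd

# Assumed tables (the original data cells were lost)
customers = pd.DataFrame({'Customer_ID': [1, 2, 3],
                          'Name': ['Ann', 'Bo', 'Cy']})
policies = pd.DataFrame({'Policy_ID': [10, 11, 12],
                         'Customer_ID': [1, 2, 4],
                         'Premium': [1200, 950, 1500]})
claims = pd.DataFrame({'Policy_ID': [10, 13],
                       'Claim_Amount': [450, 300]})
extra = pd.DataFrame({'Customer_ID': [5], 'Name': ['Di']})

# (a) Stack rows vertically
all_customers = pd.concat([customers, extra], axis=0, ignore_index=True)

# (b) Inner join keeps only Customer_IDs present in both tables
inner = pd.merge(customers, policies, on='Customer_ID', how='inner')

# (c) Left join keeps every policy; unmatched claims become NaN
left = pd.merge(policies, claims, on='Policy_ID', how='left')

# (d) .join() matches on the index, so set Customer_ID as index first
joined = customers.set_index('Customer_ID').join(
    policies.set_index('Customer_ID'), how='inner')

# (e) Outer join keeps all records from both sides
outer = pd.merge(customers, policies, on='Customer_ID', how='outer')
```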
Summary
You have completed comprehensive Pandas exercises covering the fundamental concepts from the lecture. These exercises have introduced you to:
- Series and DataFrame fundamentals: creation, indexing, and basic operations
- Data loading and inspection: reading data and getting summary statistics
- Selection and filtering: boolean indexing and conditional filtering
- Operations on data: universal functions, index alignment, and broadcasting
- Time series and function application: date formatting, resampling, and custom functions
- Missing data handling: detection, removal, and imputation strategies
- Hierarchical indexing and data combination: MultiIndex operations, merging, and joining