Exercise Set III - Pandas


Exercise 1

We start by practicing basic Pandas operations with Series and DataFrame objects.

Part A: Series Fundamentals

Create a Pandas Series object using the following actuarial data representing claim amounts:

Answer the following questions:

  1. Retrieve the fourth claim amount from the Series.

  2. Replace all claim amounts greater than \(2000\) with NaN using appropriate indexing.

  3. Create a new Series using the same data but with a custom index representing policy IDs: ['POL001', 'POL002', 'POL003', 'POL004', 'POL005', 'POL006', 'POL007', 'POL008'].

  4. Access the claim amount for policy 'POL004' using the custom index.

  5. Compute the mean claim amount, ignoring NaN values.

  6. Create a Series from the following dictionary representing total claims by insurance zone and access the claim amount for 'Zone3':

  • Use position-based access with .iloc[] (plain square brackets also work on a default integer index)
  • Remember that pandas uses 0-based indexing (first element is at index 0)
  • Use boolean indexing to assign missing values: series[condition] = np.nan
  • Import numpy as np to use np.nan for missing values
  • Create Series with custom index: pd.Series(data, index=index_list)
  • Use label-based indexing with square brackets for custom indices: series['label']
  • The .mean() method automatically ignores NaN values
  • Create a Series from a dictionary with pd.Series(dictionary) - the keys become the index
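The data block for this part is not reproduced above, so the sketch below invents hypothetical claim amounts to illustrate one possible solution following the hints:

```python
import numpy as np
import pandas as pd

# Hypothetical claim amounts (the original data block is not shown here)
claims = pd.Series([1200.0, 850.0, 2300.0, 975.0, 1500.0, 3100.0, 640.0, 1800.0])

fourth = claims.iloc[3]            # 1. fourth claim amount (0-based position 3)
claims[claims > 2000] = np.nan     # 2. replace amounts > 2000 with NaN

policy_ids = ['POL001', 'POL002', 'POL003', 'POL004',
              'POL005', 'POL006', 'POL007', 'POL008']
claims_by_policy = pd.Series(
    [1200.0, 850.0, 2300.0, 975.0, 1500.0, 3100.0, 640.0, 1800.0],
    index=policy_ids)              # 3. custom index of policy IDs

pol4 = claims_by_policy['POL004']  # 4. label-based access
mean_claim = claims.mean()         # 5. .mean() skips NaN by default

zone_claims = pd.Series({'Zone1': 5400, 'Zone2': 7200,
                         'Zone3': 6100, 'Zone4': 4800})  # 6. dict -> Series
zone3 = zone_claims['Zone3']
```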


Part B: DataFrame Creation and Basic Operations

Create a Pandas DataFrame using the following insurance policy data:

Answer the following questions:

  1. Display the first three rows of the DataFrame.

  2. Add a new column Annual_Claims with values [450, 0, 1200, 325, 0].

  3. Retrieve the Premium column as a Series using both dictionary-style and attribute-style access.

  4. Filter the DataFrame to show only policies where Customer_Age is greater than \(35\).

  5. Calculate the claim ratio (Annual_Claims/Premium) for each policy and add it as a new column.

  6. Access the underlying NumPy array of the DataFrame and transpose the DataFrame.

  • Use .head(n) method to display the first n rows
  • Add new columns with dictionary-style assignment: df['new_column'] = list_of_values
  • Dictionary-style column access: df['column_name']
  • Attribute-style column access: df.column_name (only works for valid Python identifiers)
  • Boolean indexing for filtering: df[df['column_name'] > value]
  • Perform arithmetic operations on columns: df['new_col'] = df['col1'] / df['col2']
  • Access the underlying NumPy array with the .values attribute (or the equivalent .to_numpy() method)
  • Transpose DataFrame with .T attribute
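Since the policy data block is not shown, here is a sketch with a small hypothetical table that exercises each hint in turn:

```python
import pandas as pd

# Hypothetical policy data (original data block not shown)
df = pd.DataFrame({
    'Policy_ID': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'Premium': [1200, 900, 1500, 1100, 800],
    'Customer_Age': [34, 45, 29, 52, 41],
})

first_three = df.head(3)                      # 1. first three rows
df['Annual_Claims'] = [450, 0, 1200, 325, 0]  # 2. new column

prem_dict = df['Premium']                     # 3. dictionary-style access
prem_attr = df.Premium                        #    attribute-style access

older = df[df['Customer_Age'] > 35]           # 4. boolean filter

df['Claim_Ratio'] = df['Annual_Claims'] / df['Premium']  # 5. column arithmetic

arr = df.values                               # 6. underlying NumPy array
df_t = df.T                                   #    transpose
```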


Exercise 2

Now we focus on data loading, inspection, and selection operations in Pandas.

Part A: Data Loading and Inspection

Create a sample insurance claims dataset and practice inspection methods:

Answer the following questions:

  1. Display the first 5 rows and last 3 rows of the dataset.

  2. Use the info() method to get a summary of the DataFrame structure.

  3. Generate descriptive statistics for numerical columns using describe().

  4. Check the data types of all columns and the shape of the DataFrame.

  5. Create a simple CSV content as a string and demonstrate reading it with pd.read_csv().

  • Use .head(n) for the first n rows and .tail(n) for the last n rows
  • Call .info() method to see column names, data types, non-null counts, and memory usage
  • Use .describe() method to get count, mean, std, min, quartiles, and max for numerical columns
  • Use .dtypes attribute to see data types and .shape attribute for dimensions (rows, columns)
  • Create a multi-line string with CSV format, use io.StringIO() to convert to file-like object, then pass to pd.read_csv()
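One way to build the sample dataset and run the inspection methods, sketched with hypothetical values since no dataset is specified:

```python
import io
import numpy as np
import pandas as pd

# Hypothetical claims dataset (original data block not shown)
rng = np.random.default_rng(0)
claims = pd.DataFrame({
    'Claim_ID': range(1, 11),
    'Policy_Type': ['Auto', 'Home', 'Life', 'Auto', 'Home',
                    'Life', 'Auto', 'Home', 'Life', 'Auto'],
    'Claim_Amount': rng.integers(500, 5000, 10),
    'Customer_Age': rng.integers(20, 70, 10),
})

head5 = claims.head()       # 1. first 5 rows
tail3 = claims.tail(3)      #    last 3 rows
claims.info()               # 2. structure summary (prints to stdout)
stats = claims.describe()   # 3. statistics for numerical columns
dtypes, shape = claims.dtypes, claims.shape  # 4. types and dimensions

# 5. read CSV content from a string via a file-like buffer
csv_text = "Claim_ID,Amount\n1,450\n2,1200\n"
from_csv = pd.read_csv(io.StringIO(csv_text))
```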


Part B: Boolean Indexing and Filtering

Using the claims dataset from Part A, perform the following filtering operations:

  1. Filter claims where Claim_Amount is greater than \(3000\).

  2. Filter claims for customers aged between 30 and 50 (inclusive) using multiple conditions.

  3. Filter claims where Policy_Type is either ‘Auto’ or ‘Life’ using the isin() method.

  4. Filter claims where Claim_Amount is greater than \(2800\) AND Policy_Type is ‘Home’.

  5. Filter claims where Customer_Age is NOT between 35 and 45 (use the NOT operator ~).

  • Use boolean indexing: df[df['column_name'] > value] to create boolean masks
  • Use logical operators & (and), | (or) for multiple conditions
  • Each condition must be in parentheses: (condition1) & (condition2)
  • For “between” use: (df['column'] >= min_val) & (df['column'] <= max_val)
  • Use .isin() method with a list: df[df['column'].isin(['value1', 'value2'])]
  • Use ~ operator to negate conditions: df[~(condition)]
  • Combine numerical and categorical filtering as needed
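The Part A table is not reproduced here, so this sketch uses a small hypothetical claims table with the same columns to demonstrate each filter:

```python
import pandas as pd

# Hypothetical claims data with the columns used in the questions
claims = pd.DataFrame({
    'Policy_Type': ['Auto', 'Home', 'Life', 'Auto', 'Home', 'Life'],
    'Claim_Amount': [3500, 2900, 1200, 4100, 2600, 3050],
    'Customer_Age': [28, 41, 36, 55, 47, 33],
})

large = claims[claims['Claim_Amount'] > 3000]                     # 1.
mid_age = claims[(claims['Customer_Age'] >= 30) &
                 (claims['Customer_Age'] <= 50)]                  # 2.
auto_life = claims[claims['Policy_Type'].isin(['Auto', 'Life'])]  # 3.
big_home = claims[(claims['Claim_Amount'] > 2800) &
                  (claims['Policy_Type'] == 'Home')]              # 4.
not_35_45 = claims[~((claims['Customer_Age'] >= 35) &
                     (claims['Customer_Age'] <= 45))]             # 5.
```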


Exercise 3

We now explore operations on data, including index alignment and arithmetic operations.

Part A: Universal Functions and Index Preservation

Create two Series objects representing quarterly premiums for different insurance products:

Answer the following questions:

  1. Apply the natural logarithm to product_a_premiums and show that the index is preserved.

  2. Calculate the square root of both premium series.

  3. Add the two premium series together and show the result.

  4. Calculate the percentage increase of Product A premiums over Product B premiums.

  • Apply NumPy universal functions directly to Series: np.log(), np.sqrt()
  • Check index preservation by comparing .index attributes using .equals()
  • NumPy ufuncs work element-wise on pandas objects while preserving indices
  • Use arithmetic operators (+, -, *, /) directly between Series
  • Pandas automatically aligns indices for operations
  • Calculate percentage increase: ((A - B) / B) * 100
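With the quarterly premium values left unspecified above, a sketch using hypothetical numbers shows how ufuncs preserve the index and how aligned arithmetic works:

```python
import numpy as np
import pandas as pd

quarters = ['Q1', 'Q2', 'Q3', 'Q4']
# Hypothetical quarterly premiums (original data block not shown)
product_a_premiums = pd.Series([1000.0, 1100.0, 1210.0, 1331.0], index=quarters)
product_b_premiums = pd.Series([800.0, 900.0, 1000.0, 1100.0], index=quarters)

log_a = np.log(product_a_premiums)       # 1. ufunc applies element-wise
index_preserved = log_a.index.equals(product_a_premiums.index)

sqrt_a = np.sqrt(product_a_premiums)     # 2. square roots of both series
sqrt_b = np.sqrt(product_b_premiums)

total = product_a_premiums + product_b_premiums   # 3. index-aligned addition

pct_increase = ((product_a_premiums - product_b_premiums)
                / product_b_premiums) * 100       # 4. A relative to B, in %
```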


Part B: Index Alignment and Missing Data

Create two Series with partially overlapping indices:

Answer the following questions:

  1. Add the two series and observe how Pandas handles non-matching indices.

  2. Subtract series_2023 from series_2024 with fill_value=0.

  3. Multiply the series with fill_value=1.

  4. Create two DataFrames with different row and column indices and perform subtraction to see alignment in both dimensions.

  • Use arithmetic operators (+, -, *) directly between Series - Pandas aligns indices automatically
  • Non-matching indices result in NaN values during operations
  • Use explicit methods (.subtract(), .multiply(), .add()) instead of operators when you need fill_value
  • Pass fill_value parameter to fill missing values during operations (e.g., fill_value=0 or fill_value=1)
  • Create DataFrames with different but partially overlapping indices and columns using np.random.randint()
  • DataFrame operations align both row and column indices simultaneously
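A sketch with hypothetical partially overlapping series (the names series_2023 and series_2024 follow the question text) illustrating alignment and fill_value:

```python
import numpy as np
import pandas as pd

# Hypothetical series with partially overlapping indices
series_2023 = pd.Series({'Auto': 100, 'Home': 200, 'Life': 300})
series_2024 = pd.Series({'Home': 250, 'Life': 320, 'Travel': 80})

summed = series_2023 + series_2024                      # 1. NaN where no overlap
diff = series_2024.subtract(series_2023, fill_value=0)  # 2. missing treated as 0
prod = series_2023.multiply(series_2024, fill_value=1)  # 3. missing treated as 1

# 4. DataFrames align on both rows and columns at once
rng = np.random.default_rng(0)
df_a = pd.DataFrame(rng.integers(0, 10, (2, 2)),
                    index=['r1', 'r2'], columns=['c1', 'c2'])
df_b = pd.DataFrame(rng.integers(0, 10, (2, 2)),
                    index=['r2', 'r3'], columns=['c2', 'c3'])
aligned_diff = df_a - df_b   # NaN everywhere except the shared r2/c2 cell
```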


Part C: DataFrame and Series Broadcasting

Create a DataFrame and practice broadcasting operations:

Answer the following questions:

  1. Add the first row to all rows in the DataFrame (row-wise broadcasting).

  2. Add the ‘Auto’ column to each column in the DataFrame (column-wise broadcasting).

  3. Create a Series representing adjustment factors and multiply the DataFrame by these factors column-wise.

  • For row-wise broadcasting, use df + df.iloc[0] to add first row to all rows
  • For column-wise broadcasting, use df.add(df['column_name'], axis=0) with axis=0 parameter
  • Create a Series with same index as DataFrame columns for column-wise operations
  • Use .multiply(series, axis=1) method to multiply DataFrame by Series column-wise
  • Broadcasting rules: operations align along specified axis (0 for rows, 1 for columns)
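A broadcasting sketch, assuming a hypothetical premiums table with regions as rows and products as columns (the original data block is not shown):

```python
import pandas as pd

# Hypothetical premiums by region (rows) and product (columns)
df = pd.DataFrame({'Auto': [100.0, 110.0, 120.0],
                   'Home': [200.0, 210.0, 220.0],
                   'Life': [300.0, 310.0, 320.0]},
                  index=['North', 'South', 'East'])

row_broadcast = df + df.iloc[0]             # 1. first row added to every row
col_broadcast = df.add(df['Auto'], axis=0)  # 2. 'Auto' column added to each column

# 3. adjustment factors indexed like the DataFrame's columns
factors = pd.Series({'Auto': 1.05, 'Home': 1.10, 'Life': 0.95})
adjusted = df.multiply(factors, axis=1)
```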


Exercise 4

In this exercise, we explore time series data manipulation, function application, and resampling operations.

Part A: Time Series Data and Function Application

Create a time series dataset representing daily insurance claim amounts over a 3-month period:

Answer the following questions:

  1. Convert the index to a different format showing only ‘YYYY-MM’.

  2. Apply the .apply() method to categorize claim amounts as ‘Low’ (<1500), ‘Medium’ (1500-3000), or ‘High’ (>3000).

  3. Resample the daily data to weekly totals using .resample('W').sum().

  4. Create quarterly summaries showing mean, median, and total claims using .resample('Q').agg().

  5. Filter the time series to show only data from February 2023 using date string indexing.

  • Use .index.strftime('%Y-%m') to format datetime index
  • Define a function for categorization: def categorize(x): return 'Low' if x < 1500 else ('Medium' if x <= 3000 else 'High')
  • Resampling uses frequency strings: ‘D’ (daily), ‘W’ (weekly), ‘M’ (monthly), ‘Q’ (quarterly)
  • Use .agg() with a list of functions: ['mean', 'median', 'sum']
  • Date string indexing: series['2023-02'] selects all February 2023 data
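A time-series sketch with randomly generated daily claims over January-March 2023 (the dataset itself is hypothetical, and the 3-month window is assumed to be Q1 2023):

```python
import numpy as np
import pandas as pd

# Hypothetical daily claim amounts over a 3-month period
dates = pd.date_range('2023-01-01', '2023-03-31', freq='D')
rng = np.random.default_rng(42)
daily_claims = pd.Series(rng.integers(500, 4500, len(dates)).astype(float),
                         index=dates)

year_month = daily_claims.index.strftime('%Y-%m')  # 1. 'YYYY-MM' labels

def categorize(x):  # 2. bucket claim amounts per the stated thresholds
    return 'Low' if x < 1500 else ('Medium' if x <= 3000 else 'High')

categories = daily_claims.apply(categorize)

weekly_totals = daily_claims.resample('W').sum()   # 3. weekly totals
quarterly = daily_claims.resample('Q').agg(['mean', 'median', 'sum'])  # 4.
feb = daily_claims.loc['2023-02']   # 5. partial date-string indexing
```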


Exercise 5

This exercise focuses on handling missing data using various Pandas methods.

Part A: Missing Data Detection and Handling

Create a DataFrame with missing values representing insurance customer data:

Answer the following questions:

  1. Count the missing values in each column.

  2. Drop all rows that contain any missing values.

  3. Drop columns that have more than 2 missing values by passing thresh=6 to the method you use.

  4. Fill missing values in the ‘Age’ column with the median age.

  5. Forward fill missing values in ‘Annual_Income’.

  6. Use interpolation to fill missing ‘Credit_Score’ values.

  7. Replace missing ‘Policy_Premium’ values with a calculated value based on income (e.g., 3% of annual income).

  • Chain methods: df.isnull().sum() counts missing values per column
  • .dropna() removes rows with any NaN by default; use axis=1 for columns
  • thresh parameter in .dropna() specifies minimum number of non-null values required
  • Use .median() to calculate median: df['column'].fillna(df['column'].median())
  • Forward fill: df['column'].ffill() (the older df['column'].fillna(method='ffill') form is deprecated)
  • .interpolate() uses linear interpolation by default
  • Use boolean indexing to conditionally fill: df.loc[df['col'].isnull(), 'col'] = calculated_value
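A sketch with a hypothetical 8-row customer table (so that thresh=6 means "at most 2 missing values per column"), working through each imputation step in order:

```python
import numpy as np
import pandas as pd

# Hypothetical customer data with missing values (8 rows)
df = pd.DataFrame({
    'Customer_ID': [1, 2, 3, 4, 5, 6, 7, 8],
    'Age': [34, np.nan, 45, 29, np.nan, 52, 41, 38],
    'Annual_Income': [55000, 62000, np.nan, 48000, 51000, np.nan, 73000, 60000],
    'Credit_Score': [680, np.nan, 710, np.nan, 650, 700, np.nan, 690],
    'Policy_Premium': [1200, 1500, np.nan, 1100, np.nan, 1600, 1800, 1300],
})

missing_per_col = df.isnull().sum()       # 1. NaN count per column
no_missing_rows = df.dropna()             # 2. drop rows with any NaN
kept_cols = df.dropna(axis=1, thresh=6)   # 3. keep columns with >= 6 non-nulls

df['Age'] = df['Age'].fillna(df['Age'].median())       # 4. median imputation
df['Annual_Income'] = df['Annual_Income'].ffill()      # 5. forward fill
df['Credit_Score'] = df['Credit_Score'].interpolate()  # 6. linear interpolation

# 7. fill missing premiums with 3% of (already forward-filled) annual income
mask = df['Policy_Premium'].isnull()
df.loc[mask, 'Policy_Premium'] = 0.03 * df.loc[mask, 'Annual_Income']
```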


Exercise 6

This exercise covers hierarchical indexing (MultiIndex), concatenation, merging, and joining operations.

Part A: Hierarchical Indexing (MultiIndex)

Create a DataFrame with hierarchical indexing representing insurance data by region and product type:

Answer the following questions:

  1. Access all data for the ‘North’ region.

  2. Calculate the total policies sold by region.

  3. Reset the index to convert MultiIndex to regular columns.

  4. Create a new MultiIndex by setting ‘Region’ and ‘Product’ as index from the reset DataFrame.

  • MultiIndex selection: df.loc['level1_value'] for outer level
  • Sum by level: df.groupby(level='level_name').sum() (the df.sum(level=...) shortcut was removed in pandas 2.0)
  • Reset index: df.reset_index() moves index levels to columns
  • Set MultiIndex: df.set_index(['col1', 'col2'])
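A MultiIndex sketch using hypothetical policy counts by region and product (the original data block is not shown):

```python
import pandas as pd

# Hypothetical policies sold by region and product
idx = pd.MultiIndex.from_product([['North', 'South'], ['Auto', 'Home', 'Life']],
                                 names=['Region', 'Product'])
df = pd.DataFrame({'Policies_Sold': [120, 80, 60, 150, 90, 70]}, index=idx)

north = df.loc['North']                              # 1. all 'North' rows
by_region = df.groupby(level='Region').sum()         # 2. totals per region
flat = df.reset_index()                              # 3. MultiIndex -> columns
re_indexed = flat.set_index(['Region', 'Product'])   # 4. rebuild the MultiIndex
```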


Part B: Concatenation, Merging, and Joining

Create separate DataFrames for different insurance operations:

Answer the following questions:

  1. Concatenate two DataFrames vertically.

  2. Merge customers and policies using an inner join on ‘Customer_ID’.

  3. Perform a left join between policies and claims on ‘Policy_ID’.

  4. Join customers with policies (after setting Customer_ID as index).

  5. Create a full outer join between customers and policies showing all records.

  • Vertical concatenation: pd.concat([df1, df2], axis=0)
  • Inner merge: pd.merge(df1, df2, on='column_name', how='inner')
  • Left join: pd.merge(df1, df2, on='column_name', how='left')
  • Using .join(): first set index, then df1.join(df2, how='inner')
  • Outer join: pd.merge(df1, df2, on='column_name', how='outer')
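A sketch with small hypothetical customer, policy, and claim tables (the actual tables are not specified) showing each combination operation:

```python
import pandas as pd

# Hypothetical customer, policy, and claim tables
customers = pd.DataFrame({'Customer_ID': [1, 2, 3],
                          'Name': ['Ana', 'Ben', 'Cleo']})
more_customers = pd.DataFrame({'Customer_ID': [4, 5],
                               'Name': ['Dev', 'Eva']})
policies = pd.DataFrame({'Policy_ID': ['P1', 'P2', 'P3'],
                         'Customer_ID': [1, 2, 4],
                         'Premium': [1200, 900, 1500]})
claims = pd.DataFrame({'Policy_ID': ['P1', 'P3'],
                       'Claim_Amount': [450, 700]})

all_customers = pd.concat([customers, more_customers], axis=0)        # 1.
inner = pd.merge(customers, policies, on='Customer_ID', how='inner')  # 2.
left = pd.merge(policies, claims, on='Policy_ID', how='left')         # 3.
joined = customers.set_index('Customer_ID').join(
    policies.set_index('Customer_ID'), how='inner')                   # 4.
outer = pd.merge(customers, policies, on='Customer_ID', how='outer')  # 5.
```

Note how the outer join keeps Customer_ID 3 (no policy) and 4 (no matching customer record), filling the missing side with NaN.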



Summary

You have completed comprehensive Pandas exercises covering the fundamental concepts from the lecture. These exercises have introduced you to:

  1. Series and DataFrame fundamentals - Creation, indexing, and basic operations
  2. Data loading and inspection - Reading data and getting summary statistics
  3. Selection and filtering - Boolean indexing and conditional filtering
  4. Operations on data - Universal functions, index alignment, and broadcasting
  5. Time series and function application - Date formatting, resampling, and custom functions
  6. Missing data handling - Detection, removal, and imputation strategies
  7. Hierarchical indexing and data combination - MultiIndex operations, merging, and joining