flowchart LR
%% Main linear flow
Acquire --> Tidy
Tidy --> Transform
%% Circular connections between Transform, Visualize, and Model
Transform <--> Visualize
Visualize <--> Model
Model <--> Transform
%% Final connections to Communicate
%% Visualize --> Communicate
Model --> Communicate
%% Styling for clean appearance
classDef default fill:#ffffff,stroke:#000000,stroke-width:2px,color:#000000,font-family:Arial,font-size:14px
class Acquire,Tidy,Transform,Visualize,Model,Communicate default
| Stage | What it means | Actuarial example |
|---|---|---|
| Acquire | Get data from external/internal sources | Download claims data from database or regulator website |
| Tidy | Clean, format, handle missing values | Fix date formats, remove duplicates, handle NaN values |
| Transform | Create new variables, normalize | Calculate BMI categories, create age bands |
| Visualize | Explore patterns, spot issues | Histograms of claims, scatter plots of age vs. premium |
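The Tidy and Transform rows can be sketched in a few lines of pandas. This is a minimal illustration on a made-up claims extract: the column names, values, and age-band cut points are all assumptions, not part of the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw claims extract; every column name and value is invented.
raw = pd.DataFrame({
    "policy_id": [101, 102, 102, 103, 104],
    "claim_date": ["2023-01-15", "2023-02-03", "2023-02-03", "2023-03-20", "not recorded"],
    "age": [23, 41, 41, 67, 35],
    "claim_amount": [1200.0, 800.0, 800.0, np.nan, 1000.0],
})

# Tidy: remove exact duplicate rows.
tidy = raw.drop_duplicates().copy()

# Tidy: parse dates; unparseable entries become NaT instead of raising.
tidy["claim_date"] = pd.to_datetime(tidy["claim_date"], errors="coerce")

# Tidy: fill missing amounts with the median (one simple choice among many).
tidy["claim_amount"] = tidy["claim_amount"].fillna(tidy["claim_amount"].median())

# Transform: derive age bands (assumed cut points).
tidy["age_band"] = pd.cut(tidy["age"], bins=[0, 30, 50, 120],
                          labels=["<=30", "31-50", "51+"])

print(tidy)
```

Median imputation is only a placeholder here; in practice the right treatment of missing values depends on why they are missing.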
Interpretation: Short and long rates move in opposite directions
When PC2 score increases → short rates ↑ & long rates ↓ → curve flattens
Real example: Recession fears → long rates drop (expect rate cuts), short rates stay high
PC3: Curvature (~0.09% of variance)
Weights: Ends: positive | Middle (4-6Y): negative
Interpretation: Middle maturities move opposite to short/long ends
When PC3 score increases → middle bends down, ends bend up → U-shaped curve (a decrease gives a humped curve)
Real example: Expect rates to rise then fall → medium-term rates peak (a hump, i.e. PC3 score decreases under these weights)
In simple terms: PC1 score = “How high is the curve?”, PC2 score = “How steep?”, PC3 score = “How curved?”
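To make the level / slope / curvature reading concrete, here is a small sketch using stylized, hand-made weight vectors. They are assumptions chosen to mimic the sign patterns described above, not the fitted eigenvectors, and the standardization step is ignored for simplicity. The helper names (`curve`, `height`, `steepness`, `belly_minus_ends`) are hypothetical.

```python
import numpy as np

maturities = np.arange(1, 10)  # 1Y through 9Y

def unit(v):
    """Scale a vector to unit length, as eigenvectors are."""
    v = np.asarray(v, dtype=float)
    return v / np.linalg.norm(v)

# Stylized weights (assumptions, not estimated from data)
w_level = unit(np.ones(9))                    # constant across maturities
w_slope = unit(np.linspace(1.0, -1.0, 9))     # short end positive, long end negative
w_curv = unit(np.where((maturities >= 4) & (maturities <= 6), -1.0, 0.5))  # middle negative

mean_curve = np.linspace(2.0, 4.0, 9)         # hypothetical average curve, in %

def curve(pc1=0.0, pc2=0.0, pc3=0.0):
    """Yield curve implied by the three scores."""
    return mean_curve + pc1 * w_level + pc2 * w_slope + pc3 * w_curv

def height(c):            # "How high is the curve?"
    return float(c.mean())

def steepness(c):         # "How steep?" (long minus short)
    return float(c[-1] - c[0])

def belly_minus_ends(c):  # "How curved?" (4-6Y average minus average of 1Y and 9Y)
    return float(c[3:6].mean() - (c[0] + c[-1]) / 2)

for label, c in [("baseline", curve()), ("PC1 +1", curve(pc1=1.0)),
                 ("PC2 +1", curve(pc2=1.0)), ("PC3 +1", curve(pc3=1.0))]:
    print(f"{label:9s} height={height(c):.3f} "
          f"steepness={steepness(c):.3f} belly-ends={belly_minus_ends(c):.3f}")
```

With these assumed weights, raising the PC1 score lifts every maturity by the same amount, raising the PC2 score reduces long-minus-short steepness, and raising the PC3 score pushes the 4-6Y segment down relative to the ends.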
Transform & Reconstruct
Project the normalized data onto principal components, then reconstruct using only the top K components.
# Assumes df, df_yields, df_norm, mean, std and eigen_vectors from the earlier PCA steps
import numpy as np
import matplotlib.pyplot as plt

# Transform data to PC space
PC_scores = df_norm.dot(eigen_vectors)

# Select some random bond dates to track consistently
np.random.seed(1)
sample_indices = np.random.choice(len(df_norm), size=8, replace=False)
sample_indices = sorted(sample_indices)

# Use only the 9 yearly maturities that match our PCA dimensions
maturities = list(range(1, 10))  # 1Y through 9Y to match the 9 columns we used for PCA

# Create 3 panels: 1 Component, 2 Components, Original
fig, axes = plt.subplots(1, 3, figsize=(14, 4.5), sharey=True)
colors = plt.cm.tab10(np.linspace(0, 1, len(sample_indices)))

# Store all data for consistent y-axis scaling
all_data = []

# Panel 1: Reconstruction with 1 component
reconstructed_1 = PC_scores.iloc[:, :1].dot(eigen_vectors[:, :1].T)
reconstructed_1_denorm = reconstructed_1 * std.values + mean.values
all_data.extend(reconstructed_1_denorm.values.flatten())
for idx, sample_idx in enumerate(sample_indices):
    date_label = df['date'].iloc[sample_idx]
    axes[0].plot(maturities, reconstructed_1_denorm.iloc[sample_idx],
                 alpha=0.7, linewidth=2, color=colors[idx], label=date_label)
axes[0].set_title("Using 1 Component", fontsize=11, fontweight='bold')
axes[0].set_xlabel("Maturity (years)", fontsize=10)
axes[0].set_ylabel("Yield (%)", fontsize=10)
axes[0].grid(True, alpha=0.3)

# Panel 2: Reconstruction with 2 components
reconstructed_2 = PC_scores.iloc[:, :2].dot(eigen_vectors[:, :2].T)
reconstructed_2_denorm = reconstructed_2 * std.values + mean.values
all_data.extend(reconstructed_2_denorm.values.flatten())
for idx, sample_idx in enumerate(sample_indices):
    axes[1].plot(maturities, reconstructed_2_denorm.iloc[sample_idx],
                 alpha=0.7, linewidth=2, color=colors[idx])
axes[1].set_title("Using 2 Components", fontsize=11, fontweight='bold')
axes[1].set_xlabel("Maturity (years)", fontsize=10)
axes[1].grid(True, alpha=0.3)

# Panel 3: Original data
for idx, sample_idx in enumerate(sample_indices):
    original = df_yields.iloc[sample_idx].values
    all_data.extend(original)
    axes[2].plot(maturities, original,
                 alpha=0.7, linewidth=2, color=colors[idx])
axes[2].set_title("Original Data", fontsize=11, fontweight='bold')
axes[2].set_xlabel("Maturity (years)", fontsize=10)
axes[2].grid(True, alpha=0.3)

# Set consistent y-axis limits across all panels
y_min, y_max = min(all_data), max(all_data)
y_margin = (y_max - y_min) * 0.05
for ax in axes:
    ax.set_ylim(y_min - y_margin, y_max + y_margin)

# Add overall title
fig.suptitle('Yield Curve Reconstruction Quality', fontsize=14, fontweight='bold', y=0.98)

# Add legend to the first panel only, in 2 rows (8 dates, 4 columns)
axes[0].legend(title='Sample Dates', fontsize=7, loc='upper left', framealpha=0.9, ncol=4)

plt.tight_layout()
plt.show()
Result
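Just 3 numbers (the PC1, PC2, and PC3 scores) capture nearly all the information in the 9 yields used for the PCA! In the plot, 1 component is too simple (it reproduces only the overall level of each curve), 2 components add slope, and the remaining gap to the original curves is mostly the curvature that a third component would capture.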
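The same project-then-reconstruct idea can be checked numerically on synthetic data. The sketch below is self-contained and does not use the course dataset: the simulated curves, factor volatilities, and the helper `reconstruct` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_mat = 500, 9
maturities = np.arange(1, n_mat + 1)
x = (maturities - 5) / 4.0  # maturities rescaled to [-1, 1]

# Synthetic yields (%): level + slope + curvature factors plus small noise (all assumed)
level = rng.normal(0.0, 1.0, size=(n_days, 1))
slope = rng.normal(0.0, 0.3, size=(n_days, 1))
curvature = rng.normal(0.0, 0.1, size=(n_days, 1))
noise = rng.normal(0.0, 0.01, size=(n_days, n_mat))
yields = 3.0 + level + slope * x + curvature * (x**2 - np.mean(x**2)) + noise

# Normalize each maturity, then eigendecompose the covariance of the normalized data
mean, std = yields.mean(axis=0), yields.std(axis=0)
Z = (yields - mean) / std
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]            # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto all components
scores = Z @ eigvecs

def reconstruct(k):
    """Rebuild yields in original units from the top-k components only."""
    Z_hat = scores[:, :k] @ eigvecs[:, :k].T
    return Z_hat * std + mean

# Root-mean-square reconstruction error for several values of K
rmse = {k: float(np.sqrt(np.mean((reconstruct(k) - yields) ** 2)))
        for k in (1, 2, 3, n_mat)}
for k, err in rmse.items():
    print(f"K={k}: RMSE = {err:.6f}")
```

The error shrinks as K grows and vanishes (up to floating-point rounding) when all 9 components are kept, since the projection is then just a rotation of the data.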
🤔 Pop Quiz
If the weights of the first principal component of yield curves are “almost constant” across maturities, what does a positive PC1 score indicate?
The yield curve is inverted (downward sloping)
Short-term rates are higher than long-term rates
Interest rates are higher across ALL maturities (parallel shift up)
The curve has increased curvature
Case Study II: PCA Summary
Key Results
Dimensionality Reduction
9 correlated yields (out of 13 in the dataset) → 3 uncorrelated components
99.9% of variance retained
Minimal information loss
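For reference, "variance retained" and "uncorrelated components" can both be read off the eigendecomposition. The sketch below uses made-up correlated data (one common factor plus noise) as a stand-in for the yields, so the printed percentages will not match the figures quoted on this slide.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in: 9 correlated columns driven by one common factor
common = rng.normal(size=(300, 1))
X = common + 0.2 * rng.normal(size=(300, 9))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the covariance of the normalized data
eigvals, eigvecs = np.linalg.eigh(np.cov(Z, rowvar=False))
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # largest eigenvalue first

# Share of total variance per component, and the running total
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
print("Explained variance (%):", np.round(100 * explained[:3], 2))
print("Cumulative, top 3 (%): ", round(100 * cumulative[2], 2))

# Scores on the top 3 components are mutually uncorrelated
scores = Z @ eigvecs[:, :3]
score_corr = np.corrcoef(scores, rowvar=False)
print(np.round(score_corr, 6))
```

The off-diagonal entries of `score_corr` are zero up to rounding, which is what "uncorrelated components" means in practice.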
Three Components & 2D Plot Interpretation:
PC1 (Level): 98.44% of variance. Parallel shift of the entire curve.
Horizontal axis: Movement left/right = interest rates rising/falling overall