Data Visualization with Python

MScAS 2025 - DSAS Lecture 4

Ilia Azizi

2025-10-14

Why Visualization?

“A picture is worth a thousand numbers”

Data visualization transforms raw data into insights through visual patterns. Essential for:

  • Exploration: Discover patterns, outliers, trends
  • Communication: Tell stories with data
  • Decision-making: Support data-driven choices
  • Quality control: Spot data issues instantly

Anscombe’s Quartet (1973):

Four datasets with identical statistics (same mean, variance, correlation) but completely different patterns!

Identical statistics ≠ Identical data patterns

Four datasets with the same statistics but telling different stories!
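
A minimal sketch of that check, with the published quartet values hard-coded as NumPy arrays (the variable names `quartet` and `summary` are illustrative, not part of the lecture code):

```python
import numpy as np

# Anscombe's quartet: datasets I-III share the same x values
x123 = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
x4 = np.array([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8], dtype=float)
quartet = {
    "I": (x123, np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                          7.24, 4.26, 10.84, 4.82, 5.68])),
    "II": (x123, np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                           6.13, 3.10, 9.13, 7.26, 4.74])),
    "III": (x123, np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                            6.08, 5.39, 8.15, 6.42, 5.73])),
    "IV": (x4, np.array([6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
                         5.25, 12.50, 5.56, 7.91, 6.89])),
}

# Compute the same summary statistics for every dataset
summary = {}
for name, (x, y) in quartet.items():
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_y": y.var(ddof=1),             # sample variance
        "corr": np.corrcoef(x, y)[0, 1],    # Pearson correlation
    }
    print(name, {k: round(float(v), 2) for k, v in summary[name].items()})
```

The printed rows agree to two decimals; only a scatter plot of each (x, y) pair shows that one dataset is curved, one has a single outlier, and one has almost no variation in x.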

🤔 Pop Quiz

Four datasets have identical means, variances, and correlations. What’s the best way to find out how they differ?

  • There’s no way to tell without deep statistical tests
  • Visualize all four datasets
  • Trust the statistics you calculated
  • Use only the dataset with the most observations

Univariate vs Multivariate

Understanding Data Dimensionality

The number of variables determines the visualization approach:

Univariate (1 variable): Distribution, central tendency, spread

Multivariate (2 or more variables): Relationships, correlations, patterns

Univariate Analysis

  • Focus: Single variable characteristics
  • Questions: What’s the shape? Where’s the center? How spread out?
  • Tools: Histograms, box plots, violin plots
  • Example: Distribution of claim amounts

Multivariate Analysis

  • Focus: Variable relationships
  • Questions: How do they relate? Are they correlated? Any patterns?
  • Tools: Scatter plots, heatmaps, pair plots
  • Example: Claims vs. premium by age group

Examples of Univariate Plots

  • Histogram → Shows distribution shape (right-skewed vs left-skewed)
  • Box Plot → Highlights median, quartiles, and outliers clearly
  • Violin Plot → Combines density with summary statistics
Show code
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

# Set style for better-looking plots
sns.set_style("whitegrid")
plt.rcParams['font.size'] = 10
plt.rcParams['axes.labelsize'] = 11
plt.rcParams['axes.titlesize'] = 12
plt.rcParams['axes.titleweight'] = 'bold'

# Simulate claim amounts with realistic distribution
np.random.seed(42)
claims = np.random.lognormal(mean=8, sigma=1.2, size=500)

fig, axes = plt.subplots(1, 3, figsize=(11, 3.5))

# Histogram
axes[0].hist(claims, bins=30, edgecolor='black', alpha=0.75, color='steelblue')
axes[0].set_title("Histogram", fontsize=11, fontweight='bold')
axes[0].set_xlabel("Claim Amount ($)", fontsize=9)
axes[0].set_ylabel("Frequency", fontsize=9)
axes[0].grid(True, alpha=0.3, axis='y')

# Box plot
bp = axes[1].boxplot(claims, vert=True, patch_artist=True,
                      boxprops=dict(facecolor='lightblue', color='black'),
                      whiskerprops=dict(color='black'),
                      capprops=dict(color='black'),
                      medianprops=dict(color='red', linewidth=2))
axes[1].set_title("Box Plot", fontsize=11, fontweight='bold')
axes[1].set_ylabel("Claim Amount ($)", fontsize=9)
axes[1].set_xticklabels(['Claims'], fontsize=9)
axes[1].grid(True, alpha=0.3, axis='y')

# Violin plot
parts = axes[2].violinplot([claims], showmeans=True, showmedians=True)
for pc in parts['bodies']:
    pc.set_facecolor('lightcoral')
    pc.set_alpha(0.7)
parts['cmeans'].set_color('blue')
parts['cmedians'].set_color('red')
axes[2].set_title("Violin Plot", fontsize=11, fontweight='bold')
axes[2].set_ylabel("Claim Amount ($)", fontsize=9)
axes[2].set_xticks([1])  # violinplot does not fix the tick positions, so set them before labeling
axes[2].set_xticklabels(['Claims'], fontsize=9)
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()
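
The box plot above flags outliers with Matplotlib's default whisker rule (whis=1.5, i.e. points more than 1.5 × IQR beyond the quartiles). A small sketch of that computation follows; the generator, seed, and variable names are assumptions for illustration, so the numbers will not match the figure exactly:

```python
import numpy as np

# Freshly simulated right-skewed sample (not the same draws as the plot above)
rng = np.random.default_rng(42)
sample = rng.lognormal(mean=8, sigma=1.2, size=500)

# Quartiles and interquartile range
q1, median, q3 = np.percentile(sample, [25, 50, 75])
iqr = q3 - q1

# Fences used by the 1.5 x IQR rule
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn as individual outlier markers
outliers = sample[(sample < lower_fence) | (sample > upper_fence)]

print(f"Q1={q1:,.0f}  median={median:,.0f}  Q3={q3:,.0f}")
print(f"Upper fence={upper_fence:,.0f}  outliers flagged={len(outliers)}")
```

For a right-skewed sample like this one, essentially all flagged points sit above the upper fence, which is why the box plot shows a long trail of markers at the top.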

Actuarial insight

Right-skewed claim distributions are typical → most claims are small, but a few catastrophic claims drive total losses.
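
To make the "few large claims drive total losses" point concrete, here is a hedged sketch that measures how much of the total comes from the largest 10% of simulated claims. The sample size, seed, and the 10% cut-off are illustrative assumptions:

```python
import numpy as np

# Simulated right-skewed claims (same lognormal parameters as the plots above)
rng = np.random.default_rng(0)
claims_sim = rng.lognormal(mean=8, sigma=1.2, size=10_000)

# In a right-skewed distribution the mean is pulled above the median
mean_claim = claims_sim.mean()
median_claim = np.median(claims_sim)

# Share of total losses coming from the largest 10% of claims
threshold = np.quantile(claims_sim, 0.90)
top_share = claims_sim[claims_sim >= threshold].sum() / claims_sim.sum()

print(f"Mean claim:   {mean_claim:,.0f}")
print(f"Median claim: {median_claim:,.0f}")
print(f"Top 10% of claims account for {top_share:.0%} of total losses")
```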

Examples of Multivariate Plots

  • Scatter Plot → Visual relationship between two continuous variables (positive correlation between age and premium)
  • Correlation Heatmap → Quick overview of all pairwise relationships (red = positive, blue = negative)
Show code
# Simulate multivariate insurance data
np.random.seed(42)
n = 200
age = np.random.uniform(25, 70, n)
premium = 500 + age * 15 + np.random.normal(0, 200, n)
claims = 0.3 * premium + np.random.normal(0, 500, n)

df = pd.DataFrame({'Age': age, 'Premium': premium, 'Claims': claims})

fig, axes = plt.subplots(1, 2, figsize=(10, 3.5))

# Scatter plot with trend
axes[0].scatter(df['Age'], df['Premium'], alpha=0.6, s=40, color='steelblue', edgecolors='black', linewidth=0.5)
# Add trend line
z = np.polyfit(df['Age'], df['Premium'], 1)
p = np.poly1d(z)
axes[0].plot(df['Age'].sort_values(), p(df['Age'].sort_values()), "r--", linewidth=2, label='Trend')
axes[0].set_title("Age vs Premium", fontsize=11, fontweight='bold')
axes[0].set_xlabel("Age (years)", fontsize=9)
axes[0].set_ylabel("Annual Premium ($)", fontsize=9)
axes[0].grid(True, alpha=0.3)
axes[0].legend(fontsize=8)

# Correlation heatmap
correlation = df.corr()
im = axes[1].imshow(correlation, cmap="coolwarm", vmin=-1, vmax=1, aspect='auto')
axes[1].set_xticks(range(len(correlation.columns)))
axes[1].set_yticks(range(len(correlation.columns)))
axes[1].set_xticklabels(correlation.columns, fontsize=9)
axes[1].set_yticklabels(correlation.columns, fontsize=9)
axes[1].set_title("Correlation Matrix", fontsize=11, fontweight='bold')

# Add correlation values
for i in range(len(correlation.columns)):
    for j in range(len(correlation.columns)):
        text = axes[1].text(j, i, f'{correlation.iloc[i, j]:.2f}',
                           ha="center", va="center", color="black", fontsize=9)

# Add colorbar
cbar = plt.colorbar(im, ax=axes[1], shrink=0.8)
cbar.set_label('Correlation', fontsize=8)

plt.tight_layout()
plt.show()

Actuarial insight

Multivariate plots reveal risk factors → premiums rise with policyholder age, which shows up as a positive age-premium correlation.
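
The trend line in the scatter plot comes from np.polyfit, which returns coefficients from the highest degree down (slope first, then intercept). A minimal sketch, re-simulating data with the same formula but a different generator (so the values are an assumption, not the lecture's exact output):

```python
import numpy as np

# Same data-generating formula as above: premium = 500 + 15 * age + noise
rng = np.random.default_rng(42)
n = 200
age_sim = rng.uniform(25, 70, n)
premium_sim = 500 + 15 * age_sim + rng.normal(0, 200, n)

# Degree-1 least-squares fit: coefficients come back as [slope, intercept]
slope, intercept = np.polyfit(age_sim, premium_sim, 1)

# Pearson correlation between the two variables
r = np.corrcoef(age_sim, premium_sim)[0, 1]

print(f"Fitted trend: premium = {slope:.1f} * age + {intercept:.0f}")
print(f"Correlation:  r = {r:.2f}")
```

The fitted slope should land near the true value of 15 used in the simulation; the noise term keeps the correlation clearly below 1.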

🤔 Pop Quiz

Which visualization best shows how premium amounts relate to policyholder age across 500 customers?

  • Histogram showing age distribution
  • Pie chart showing age categories
  • Scatter plot with age on x-axis and premium on y-axis
  • Bar chart comparing average premiums by decade

Matplotlib: The Foundation

Understanding Matplotlib Structure

Every plot has a hierarchy: Figure (canvas) → Axes (plotting area) → Elements (lines, markers, etc.)
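
A short sketch of that hierarchy, showing that each level keeps references to its parent and children (the variable names are illustrative):

```python
import matplotlib.pyplot as plt

# Figure (canvas) and a single Axes (plotting area)
fig, ax = plt.subplots(figsize=(5, 3))

# Element: ax.plot returns a list of Line2D artists, here exactly one
(line,) = ax.plot([1, 2, 3], [2, 4, 6], marker="o")

# Navigate the hierarchy in both directions
print(ax in fig.axes)      # the Figure knows its Axes
print(line in ax.lines)    # the Axes knows its line elements
print(line.axes is ax)     # the element points back to its Axes
print(ax.figure is fig)    # the Axes points back to its Figure

# Each level has its own setters
fig.suptitle("Figure-level title")
ax.set_title("Axes-level title")
line.set_color("steelblue")
```

Call plt.show() (or fig.savefig) afterwards to display the result outside a notebook.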

When to Use Each Plot Type

  • Line Plot → Time series, trends over time (e.g., monthly claims)
  • Bar Chart → Comparing categories (e.g., claims by region)
  • Histogram → Distribution of a single variable (e.g., claim amounts)
  • Scatter Plot → Relationship between two continuous variables (e.g., age vs premium)
Show code
fig, axes = plt.subplots(2, 2, figsize=(11, 6.5))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
categories = ['A', 'B', 'C', 'D', 'E']

# Line plot
axes[0, 0].plot(x, y, marker='o', linewidth=2.5, markersize=8, color='steelblue')
axes[0, 0].set_title("Line Plot", fontsize=11, fontweight='bold')
axes[0, 0].set_xlabel("Time Period", fontsize=9)
axes[0, 0].set_ylabel("Values", fontsize=9)
axes[0, 0].grid(True, alpha=0.3)

# Bar chart
bars = axes[0, 1].bar(categories, y, color='coral', edgecolor='black', linewidth=1.5)
axes[0, 1].set_title("Bar Chart", fontsize=11, fontweight='bold')
axes[0, 1].set_xlabel("Categories", fontsize=9)
axes[0, 1].set_ylabel("Values", fontsize=9)
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Histogram
data = np.random.randn(1000)
axes[1, 0].hist(data, bins=30, edgecolor='black', alpha=0.75, color='mediumseagreen')
axes[1, 0].set_title("Histogram", fontsize=11, fontweight='bold')
axes[1, 0].set_xlabel("Values", fontsize=9)
axes[1, 0].set_ylabel("Frequency", fontsize=9)
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Scatter plot
x_scatter = np.random.rand(100) * 10
y_scatter = x_scatter * 2 + np.random.randn(100) * 2
axes[1, 1].scatter(x_scatter, y_scatter, alpha=0.6, s=40, color='purple', edgecolors='black', linewidth=0.5)
axes[1, 1].set_title("Scatter Plot", fontsize=11, fontweight='bold')
axes[1, 1].set_xlabel("X values", fontsize=9)
axes[1, 1].set_ylabel("Y values", fontsize=9)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Making Plots Informative

Customize colors, styles, markers, and labels for clarity and impact.

Basic plot:

Customized plot:
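
A minimal sketch of the basic-versus-customized contrast, drawn side by side. The monthly claim counts are hypothetical values chosen only for illustration:

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
claim_counts = [120, 135, 128, 150, 162, 158]  # hypothetical values

fig, (ax_basic, ax_custom) = plt.subplots(1, 2, figsize=(10, 3.5))

# Basic plot: all defaults, no labels or title
ax_basic.plot(months, claim_counts)

# Customized plot: color, marker, line style, labels, grid, and legend
ax_custom.plot(months, claim_counts, color="steelblue", marker="o",
               linestyle="--", linewidth=2, label="Claims")
ax_custom.set_title("Monthly Claims", fontweight="bold")
ax_custom.set_xlabel("Month")
ax_custom.set_ylabel("Number of Claims")
ax_custom.grid(True, alpha=0.3)
ax_custom.legend()

fig.tight_layout()
```

Call plt.show() afterwards when running this as a script.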

Using Subplots Effectively: Multiple views of the same data to tell a complete story

  • Line plot → Reveals trends for each region over time
  • Bar chart → Shows total volume per quarter at a glance
  • Stacked bar → Displays composition (which region contributes most)
Show code
# Simulate quarterly claims data
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
region_north = [120, 135, 150, 140]
region_south = [100, 115, 125, 135]
region_east = [90, 105, 120, 115]

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Subplot 1: Line plot comparison
axes[0].plot(quarters, region_north, marker='o', label='North', linewidth=2.5, markersize=8)
axes[0].plot(quarters, region_south, marker='s', label='South', linewidth=2.5, markersize=8)
axes[0].plot(quarters, region_east, marker='^', label='East', linewidth=2.5, markersize=8)
axes[0].set_title("Claims by Region", fontsize=11, fontweight='bold')
axes[0].set_ylabel("Number of Claims", fontsize=9)
axes[0].set_xlabel("Quarter", fontsize=9)
axes[0].legend(fontsize=8, loc='upper left')
axes[0].grid(True, alpha=0.3)

# Subplot 2: Bar chart
total_claims = [sum(x) for x in zip(region_north, region_south, region_east)]
bars = axes[1].bar(quarters, total_claims, color='steelblue', edgecolor='black', linewidth=1.5)
axes[1].set_title("Total Claims per Quarter", fontsize=11, fontweight='bold')
axes[1].set_ylabel("Total Claims", fontsize=9)
axes[1].set_xlabel("Quarter", fontsize=9)
axes[1].grid(True, alpha=0.3, axis='y')
# Add value labels
for bar in bars:
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}', ha='center', va='bottom', fontsize=8)

# Subplot 3: Stacked bar chart
width = 0.6
axes[2].bar(quarters, region_north, width, label='North', color='#1f77b4', edgecolor='black')
axes[2].bar(quarters, region_south, width, bottom=region_north, 
            label='South', color='#ff7f0e', edgecolor='black')
axes[2].bar(quarters, region_east, width, 
            bottom=[i+j for i,j in zip(region_north, region_south)], 
            label='East', color='#2ca02c', edgecolor='black')
axes[2].set_title("Stacked Claims by Region", fontsize=11, fontweight='bold')
axes[2].set_ylabel("Number of Claims", fontsize=9)
axes[2].set_xlabel("Quarter", fontsize=9)
axes[2].legend(fontsize=8)
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Pro tip

Use fig, axes = plt.subplots(rows, cols) to create subplot grids
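
A hedged sketch of how the returned grid can be used: with more than one row and column, axes is a 2-D NumPy array of Axes objects, and axes.flat iterates over it row by row. The region names and claim counts below are made up for illustration:

```python
import matplotlib.pyplot as plt
import numpy as np

regions = ["North", "South", "East", "West"]  # hypothetical regions
quarter_idx = np.arange(1, 5)
rng = np.random.default_rng(1)

# 2x2 grid with shared axes so the panels are directly comparable
fig, axes = plt.subplots(2, 2, figsize=(8, 5), sharex=True, sharey=True)
print(axes.shape)

# axes.flat visits the panels row by row
for ax, region in zip(axes.flat, regions):
    counts = rng.integers(80, 160, size=4)  # made-up claim counts
    ax.plot(quarter_idx, counts, marker="o")
    ax.set_title(region)

fig.tight_layout()
```

Note that plt.subplots(1, 3) returns a 1-D array and plt.subplots() returns a single Axes; passing squeeze=False always gives a 2-D array.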

🤔 Pop Quiz

To compare claims frequency across 4 regions over the same time period, which Matplotlib approach is most efficient?

  • Create 4 separate plots and combine them manually
  • Use one large plot with 4 different colors
  • Use fig, axes = plt.subplots(2, 2) to create a 2×2 grid
  • Make 4 plots and send them as separate files

Seaborn: Statistical Graphics

Built on Matplotlib, Elevated for Statistics

  • High-level interface: Less code for complex plots
  • Beautiful defaults: Professional-looking plots out-of-the-box
  • Statistical focus: Built-in functions for distributions, relationships
  • DataFrame integration: Works seamlessly with Pandas

Matplotlib (more code):

Seaborn (cleaner):

Tip

Result: Seaborn adds KDE, better styling, with less code!

Seaborn Automatically Adds Statistical Overlays:

  • histplot() with kde=True → Adds smooth density curve
  • boxplot() → Automatic grouping by categories with x= parameter
  • violinplot() → Combines density + quartiles in one plot
Show code
import pandas as pd
import seaborn as sns

# Simulate claim severity data for different product types
np.random.seed(42)
n = 300

data = pd.DataFrame({
    'Product': np.repeat(['Auto', 'Home', 'Life'], n),
    'Claim_Amount': np.concatenate([
        np.random.lognormal(7, 1, n),    # Auto
        np.random.lognormal(8, 1.2, n),  # Home
        np.random.lognormal(9, 0.8, n)   # Life
    ])
})

# Set seaborn style
sns.set_style("whitegrid")

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram with KDE
sns.histplot(data=data, x='Claim_Amount', kde=True, ax=axes[0], bins=30, 
             color='steelblue', edgecolor='black', linewidth=0.5)
axes[0].set_title("Histogram + KDE", fontsize=11, fontweight='bold')
axes[0].set_xlabel("Claim Amount ($)", fontsize=9)
axes[0].set_ylabel("Frequency", fontsize=9)

# Box plot by category
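# hue + legend=False avoids the palette-without-hue deprecation (seaborn >= 0.13)
sns.boxplot(data=data, x='Product', y='Claim_Amount', hue='Product',
            palette='Set2', legend=False, ax=axes[1])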
axes[1].set_title("Box Plot by Product", fontsize=11, fontweight='bold')
axes[1].set_ylabel("Claim Amount ($)", fontsize=9)
axes[1].set_xlabel("Product Type", fontsize=9)

# Violin plot by category
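# hue + legend=False avoids the palette-without-hue deprecation (seaborn >= 0.13)
sns.violinplot(data=data, x='Product', y='Claim_Amount', hue='Product',
               palette='muted', legend=False, ax=axes[2])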
axes[2].set_title("Violin Plot by Product", fontsize=11, fontweight='bold')
axes[2].set_ylabel("Claim Amount ($)", fontsize=9)
axes[2].set_xlabel("Product Type", fontsize=9)

plt.tight_layout()
plt.show()

Actuarial insight

Life insurance claims show a higher median but lower relative variability (a tighter spread on the log scale) than Auto/Home → reflects predictable payouts for death benefits vs. variable property/accident damages.

Seaborn for Relationships: Example Plots

  • scatterplot() → Supports hue= and size= for 3rd/4th dimensions
  • regplot() → Adds regression line + confidence interval automatically
  • heatmap() → Visualize correlation matrices with annotations
Show code
# Simulate insurance portfolio data
np.random.seed(42)
n = 200

portfolio = pd.DataFrame({
    'Age': np.random.uniform(25, 75, n),
    'Premium': np.random.uniform(500, 5000, n),
    'Claims_Count': np.random.poisson(0.5, n),
    'Risk_Score': np.random.uniform(0, 100, n)
})

# Add some correlation
portfolio['Premium'] = 300 + portfolio['Age'] * 40 + np.random.normal(0, 400, n)
portfolio['Claims_Count'] = (portfolio['Risk_Score'] / 50 + np.random.poisson(0.3, n)).astype(int)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))

# Scatter plot with regression line
scatter = sns.scatterplot(data=portfolio, x='Age', y='Premium', 
                hue='Claims_Count', size='Claims_Count',
                palette='viridis', ax=axes[0], alpha=0.7, sizes=(30, 200),
                edgecolor='black', linewidth=0.5)
sns.regplot(data=portfolio, x='Age', y='Premium', 
            scatter=False, color='red', ax=axes[0], line_kws={'linewidth': 2.5})
axes[0].set_title("Age vs Premium (sized by Claims)", fontsize=11, fontweight='bold')
axes[0].set_xlabel("Age (years)", fontsize=9)
axes[0].set_ylabel("Annual Premium ($)", fontsize=9)
axes[0].legend(title='Claims Count', fontsize=7, title_fontsize=8, loc='upper left')

# Correlation heatmap
correlation = portfolio[['Age', 'Premium', 'Claims_Count', 'Risk_Score']].corr()
sns.heatmap(correlation, annot=True, cmap="RdBu_r", ax=axes[1], 
            fmt=".2f", square=True, cbar_kws={'shrink': 0.8},
            vmin=-1, vmax=1, linewidths=1, linecolor='white',
            annot_kws={'fontsize': 9})
axes[1].set_title("Correlation Matrix", fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

Actuarial insight

Heatmap reveals a strong positive age-premium correlation (about 0.8 given the simulation parameters) → confirms experience-based pricing. Risk score drives claims count → validates underwriting criteria.

The Power of Pair Plots

sns.pairplot() → Visualize all pairwise relationships in one command!

Show code
# Use subset of portfolio data
portfolio_subset = portfolio[['Age', 'Premium', 'Risk_Score', 'Claims_Count']].sample(100, random_state=42)

# Create pair plot with smaller size
pair_grid = sns.pairplot(portfolio_subset, diag_kind='kde', 
                         plot_kws={'alpha': 0.6, 's': 30, 'edgecolor': 'black', 'linewidth': 0.5},
                         diag_kws={'linewidth': 2},
                         corner=True,
                         height=2)  # Controls size of each subplot
pair_grid.figure.suptitle("Portfolio Variables Pair Plot", y=0.99, fontweight='bold', fontsize=12)  # .figure replaces the deprecated .fig
plt.tight_layout()
plt.show()

Actuarial insight

Pair plots reveal non-obvious patterns → e.g., Risk_Score vs Claims_Count shows clustering, suggesting discrete risk categories in underwriting model.

Advanced Seaborn Features

  • hue= parameter → Split visualizations by category (automatic legend)
  • split=True in violinplot → Side-by-side comparison within groups
  • Palette control → 'Set2', 'muted', 'viridis' for consistent color schemes
Show code
# Create categorical data
np.random.seed(42)
df_claims = pd.DataFrame({
    'Region': np.repeat(['North', 'South', 'East', 'West'], 50),
    'Product': np.tile(np.repeat(['Auto', 'Home'], 25), 4),
    'Severity': np.random.lognormal(8, 1.5, 200)
})

fig, axes = plt.subplots(1, 2, figsize=(12, 3.8))

# Customized box plot
sns.boxplot(data=df_claims, x='Region', y='Severity', hue='Product',
            palette='Set2', ax=axes[0], linewidth=1.5)
axes[0].set_title("Claim Severity by Region and Product", 
                  fontsize=11, fontweight='bold')
axes[0].set_ylabel("Claim Severity ($)", fontsize=9)
axes[0].set_xlabel("Region", fontsize=9)
axes[0].legend(title='Product Type', fontsize=8, title_fontsize=9)
axes[0].grid(True, alpha=0.3, axis='y')

# Customized violin plot with split
sns.violinplot(data=df_claims, x='Region', y='Severity', hue='Product',
               split=True, palette='muted', ax=axes[1], linewidth=1.5)
axes[1].set_title("Split Violin Plot (Product Comparison)", 
                  fontsize=11, fontweight='bold')
axes[1].set_ylabel("Claim Severity ($)", fontsize=9)
axes[1].set_xlabel("Region", fontsize=9)
axes[1].legend(title='Product Type', fontsize=8, title_fontsize=9)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Actuarial insight

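Split violins make it easy to check whether Home claims consistently exceed Auto across regions → a persistent gap in real data would justify higher base rates for property coverage. (Here Severity is simulated from the same distribution for both products, so any visible gap is sampling noise.)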

🤔 Pop Quiz

Your manager needs claim distribution AND age relationship visualized in 10 minutes. What’s your best tool?

  • Matplotlib - full control and more code for statistical overlays
  • Seaborn - automatic statistical overlays, beautiful defaults, less code
  • Plotly - great for interactivity but overkill for static analysis
  • Base Python - calculating pixel coordinates when time is short

Matplotlib vs Seaborn vs Plotly

Feature           | Matplotlib                | Seaborn                   | Plotly
Level             | Low-level, fine control   | High-level, statistical   | High-level, interactive
Ease of Use       | ⭐⭐ (more code)           | ⭐⭐⭐⭐ (concise)         | ⭐⭐⭐⭐ (intuitive)
Default Style     | Basic                     | Professional              | Modern, polished
Statistical Plots | Manual                    | Built-in                  | Some built-in
Interactivity     | ❌ Static only            | ❌ Static only            | Interactive
Use Case          | Custom plots, fine-tuning | EDA, statistical analysis | Dashboards, presentations
Learning Curve    | Steeper                   | Moderate                  | Moderate
Performance       | Fast                      | Fast                      | Slower (web-based)

When to Use What?

  • Matplotlib: Maximum control, custom visualizations, publication-quality static plots
  • Seaborn: Statistical exploration, beautiful defaults, distribution/relationship plots
  • Plotly: Interactive dashboards, presentations, hover tooltips, zooming

The Challenge: Visualize Insurance Portfolio

Dataset: Age vs Premium for 100 policyholders
Goal: Show relationship + distribution

Let’s see how each library handles this!

# Create sample insurance portfolio data (used in all three examples)
np.random.seed(42)
n = 100

portfolio_demo = pd.DataFrame({
    'Age': np.random.uniform(25, 75, n),
    'Premium': np.random.uniform(500, 5000, n)
})

# Add correlation
portfolio_demo['Premium'] = 300 + portfolio_demo['Age'] * 40 + np.random.normal(0, 400, n)

Matplotlib Approach: Manual Control

Pros: Complete customization, pixel-perfect control
Cons: More code, need to manually add trendline and style

Show Matplotlib code
fig, ax = plt.subplots(figsize=(9, 4.5))

# Scatter plot
ax.scatter(portfolio_demo['Age'], portfolio_demo['Premium'], 
           alpha=0.6, s=60, color='steelblue', edgecolors='black', linewidth=0.8)

# Manually add trendline
z = np.polyfit(portfolio_demo['Age'], portfolio_demo['Premium'], 1)
p = np.poly1d(z)
ax.plot(portfolio_demo['Age'].sort_values(), p(portfolio_demo['Age'].sort_values()), 
        "r--", linewidth=2.5, label=f'Trend: y = {z[0]:.1f}x + {z[1]:.0f}')

# Styling
ax.set_title("Age vs Premium (Matplotlib)", fontsize=13, fontweight='bold', pad=15)
ax.set_xlabel("Age (years)", fontsize=11)
ax.set_ylabel("Annual Premium ($)", fontsize=11)
ax.legend(loc='upper left', fontsize=10, framealpha=0.9)
ax.grid(True, alpha=0.3, linestyle='--', linewidth=0.8)
ax.set_facecolor('#f8f9fa')

plt.tight_layout()
plt.show()

Lines of code: ~15 | Trendline: Manual | Styling: All manual

Seaborn Approach: Statistical Focus

Pros: Automatic trendline + confidence interval, beautiful defaults, less code
Cons: Less fine-grained control

Show Seaborn code
fig, ax = plt.subplots(figsize=(9, 4.5))

# Seaborn combines scatter + regression in one call!
sns.regplot(data=portfolio_demo, x='Age', y='Premium', 
            scatter_kws={'alpha': 0.6, 's': 60, 'edgecolor': 'black', 'linewidths': 0.8},
            line_kws={'color': 'red', 'linewidth': 2.5},
            ax=ax)

# Styling
ax.set_title("Age vs Premium (Seaborn)", fontsize=13, fontweight='bold', pad=15)
ax.set_xlabel("Age (years)", fontsize=11)
ax.set_ylabel("Annual Premium ($)", fontsize=11)
ax.grid(True, alpha=0.3, linestyle='--', linewidth=0.8)
ax.set_facecolor('#f8f9fa')

plt.tight_layout()
plt.show()

Lines of code: ~8 | Trendline: Automatic with confidence interval! | Styling: Beautiful defaults

Plotly Approach: Interactive

Pros: Hover tooltips, zoom/pan, toggle trendline, export options
Cons: Slower rendering, requires web browser

Show Plotly code
import plotly.express as px

# Create figure matching Matplotlib/Seaborn theme
fig = px.scatter(
    portfolio_demo, 
    x='Age', 
    y='Premium',
    trendline='ols',  # OLS trendline requires the statsmodels package
    title="Age vs Premium (Plotly - Interactive!)",
    labels={'Age': 'Age (years)', 'Premium': 'Annual Premium ($)'}
)

# Update scatter points to match Matplotlib style (steelblue with black edge)
fig.update_traces(
    marker=dict(
        size=8, 
        color='steelblue',
        opacity=0.6,
        line=dict(width=0.8, color='black')
    ),
    selector=dict(mode='markers')
);

# Update trendline to red dashed (matching Matplotlib)
fig.update_traces(
    line=dict(color='red', width=2.5, dash='dash'),
    selector=dict(mode='lines')
);

# Match Matplotlib/Seaborn styling
fig.update_layout(
    title=dict(
        text="Age vs Premium (Plotly - Interactive!)",
        font=dict(size=13, family='Arial', color='black'),
        x=0.5,
        xanchor='center'
    ),
    font=dict(size=11, family='Arial'),
    plot_bgcolor='#f8f9fa',
    paper_bgcolor='white',
    hovermode='closest',
    height=450,
    width=900,
    xaxis_title=dict(text="Age (years)", font=dict(size=11)),
    yaxis_title=dict(text="Annual Premium ($)", font=dict(size=11)),
    margin=dict(l=60, r=40, t=60, b=60)
);

# Grid matching Matplotlib (alpha=0.3, dashed)
fig.update_xaxes(
    showgrid=True, 
    gridwidth=0.8, 
    gridcolor='rgba(128,128,128,0.3)',
    griddash='dash'
);
fig.update_yaxes(
    showgrid=True, 
    gridwidth=0.8, 
    gridcolor='rgba(128,128,128,0.3)',
    griddash='dash'
);

# Display
fig

Lines of code: ~10 for the core px.scatter call (the styling above is optional) | Trendline: Automatic (multiple methods available!) | Interactivity: Hover, zoom, pan, export

Try it!

Hover over points, zoom in on a region, double-click to reset, click legend to toggle trendline!

Best Practice Workflow

  1. Start with Seaborn for quick EDA (fast, beautiful, statistical)
  2. Switch to Matplotlib when you need pixel-perfect control (publications, custom layouts)
  3. Use Plotly for final presentations/dashboards (stakeholder engagement, interactivity)

How to Convert Your Plots

Matplotlib → Seaborn: Easy! Seaborn uses Matplotlib under the hood
Matplotlib/Seaborn → Plotly: Use plotly.express (similar syntax), plotly.graph_objects (more control), or plotly.tools.mpl_to_plotly() (automatic conversion, though it may fail on complex figures or with recent Matplotlib versions)
Plotly → Static: Use fig.write_image('plot.png') (requires kaleido package)

Example: Matplotlib scatter → Seaborn regplot

# Matplotlib
plt.scatter(df['x'], df['y'])

# Seaborn (adds automatic regression + CI)
sns.regplot(data=df, x='x', y='y')

Example: Matplotlib → Plotly (automatic conversion)

import plotly.tools as tls

# Create Matplotlib figure
fig, ax = plt.subplots()
ax.scatter(df['Age'], df['Premium'])
ax.set_title("Age vs Premium")

# Convert to interactive Plotly figure automatically!
plotly_fig = tls.mpl_to_plotly(fig)
plotly_fig.show()

Example: Seaborn → Plotly

# Seaborn (assumes df has a categorical 'Region' column)
sns.scatterplot(data=df, x='Age', y='Premium', hue='Region')

# Plotly equivalent
px.scatter(df, x='Age', y='Premium', color='Region')

Example: Save Plotly as static image

fig = px.scatter(df, x='Age', y='Premium')
fig.write_image('plot.png')  # Requires: pip install kaleido
fig.write_html('plot.html')  # Interactive HTML

🤔 Pop Quiz

The CFO needs to zoom, hover, and toggle risk categories during a board meeting. Which library is best for plotting?

  • Matplotlib - perfect for board meetings
  • Seaborn - beautiful statistical plots
  • Plotly - interactive features like zoom, hover, and toggle
  • All three are equally suitable for this use case

Common Pitfalls & Best Practices

Learn from Bad Examples

Understanding what makes plots ineffective helps you create better visualizations.

❌ Bad: No labels, unclear message

✅ Good: Clear, labeled, informative

❌ Misleading Y-axis

Issue: Exaggerates differences
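
A small sketch of the effect, using two hypothetical loss ratios. The same bars are drawn once with a truncated y-axis and once with the axis starting at zero:

```python
import matplotlib.pyplot as plt

regions = ["North", "South"]
loss_ratio = [62, 65]  # hypothetical loss ratios in %
truncated_floor = 60   # where the misleading axis starts

fig, (ax_bad, ax_good) = plt.subplots(1, 2, figsize=(9, 3.5))

# Misleading: the axis starts just below the smallest value
ax_bad.bar(regions, loss_ratio, color="coral", edgecolor="black")
ax_bad.set_ylim(truncated_floor, 66)
ax_bad.set_title("Truncated y-axis (misleading)")

# Honest: the axis starts at zero, so bar heights are proportional
ax_good.bar(regions, loss_ratio, color="steelblue", edgecolor="black")
ax_good.set_ylim(0, 70)
ax_good.set_title("Y-axis from zero (honest)")

for ax in (ax_bad, ax_good):
    ax.set_ylabel("Loss ratio (%)")
fig.tight_layout()

# How much taller the South bar *looks* versus how much larger it *is*
visible_ratio_bad = (loss_ratio[1] - truncated_floor) / (loss_ratio[0] - truncated_floor)
true_ratio = loss_ratio[1] / loss_ratio[0]
print(f"Visible height ratio (truncated): {visible_ratio_bad:.2f}")
print(f"True ratio:                       {true_ratio:.2f}")
```

On the truncated axis the South bar appears 2.5 times as tall as North, although the underlying value is only about 5% higher.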

❌ Too many colors

Issue: Distracting, no meaning
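
One possible remedy, sketched with hypothetical claim counts: keep most bars in a neutral gray and reserve a single accent color for the bar that carries the message.

```python
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West", "Central"]
claims_count = [140, 135, 115, 180, 120]  # hypothetical values

# Neutral gray for context, one accent color for the key bar
highest = claims_count.index(max(claims_count))
colors = ["lightgray"] * len(regions)
colors[highest] = "firebrick"

fig, ax = plt.subplots(figsize=(6, 3.5))
ax.bar(regions, claims_count, color=colors, edgecolor="black")
ax.set_title(f"{regions[highest]} has the highest claim count", fontweight="bold")
ax.set_xlabel("Region")
ax.set_ylabel("Number of Claims")
fig.tight_layout()
```

Here color encodes exactly one thing (which region stands out), so it adds meaning instead of distraction.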

❌ Pie chart in 3D (worst, and actually hard to do!)

Issue: Hard to read percentages, compare slices

Show code
# Example: Well-designed comparison plot
np.random.seed(42)

quarters = ['Q1', 'Q2', 'Q3', 'Q4']  # both years are plotted per quarter, so labels omit the year
revenue_2023 = [450, 480, 520, 550]
revenue_2024 = [480, 530, 580, 620]

fig, ax = plt.subplots(figsize=(8, 3.5))

x = np.arange(len(quarters))
width = 0.35

bars1 = ax.bar(x - width/2, revenue_2023, width, label='2023', 
               color='#9ecae1', edgecolor='black', linewidth=1.5)
bars2 = ax.bar(x + width/2, revenue_2024, width, label='2024', 
               color='#3182bd', edgecolor='black', linewidth=1.5)

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 5,
                f'${height}M',
                ha='center', va='bottom', fontsize=9, fontweight='bold')

ax.set_title("Quarterly Revenue Comparison: 2023 vs 2024", 
             fontsize=13, fontweight='bold', pad=20)
ax.set_xlabel("Quarter", fontsize=11)
ax.set_ylabel("Revenue (Million $)", fontsize=11)
ax.set_xticks(x)
ax.set_xticklabels(quarters, fontsize=10)
ax.legend(loc='upper left', fontsize=10, framealpha=0.9)
ax.grid(True, axis='y', alpha=0.3, linestyle='--', linewidth=0.8)
_ = ax.set_ylim(0, max(revenue_2024) * 1.15)
ax.set_facecolor('#f8f9fa')

plt.tight_layout()
plt.show()

What Makes This Plot Excellent?

  • Clear title → Tells the story immediately
  • Labeled axes → Context for both variables
  • Y-axis starts at 0 → Honest representation (critical for bar charts!)
  • Value labels → No guessing, exact numbers visible
  • Legend → Explains grouping
  • Grid → Aids accurate reading
  • Professional appearance → Clean, polished, credible

The 5-second test: Can someone understand the main message in 5 seconds? ✅

10 Visualization Rules

  1. Always label axes and title → Context is king
  2. Start y-axis at zero (for bar charts) → Avoid misleading scales
  3. Choose appropriate plot type → Match plot to data type
  4. Use color purposefully → 2-3 colors max, meaningful encoding
  5. Add legends when needed → Explain what viewers see
  6. Keep it simple → Remove chart junk, focus on message
  7. Consider your audience → Adjust complexity accordingly
  8. Use consistent scales → When comparing plots
  9. Highlight key insights → Guide viewer attention
  10. Test readability → Can someone understand it in 5 seconds?

🤔 Pop Quiz

Your colleague’s plot shows impressive trends with perfect colors and styling. But management keeps asking “What is this plot representing?” What’s the most effective fix?

  • Switch from Matplotlib to Seaborn for better defaults
  • Add descriptive axis labels, title, and units
  • Increase the figure size from 8x6 to 12x8 inches
  • Change the color scheme to match company branding

Key Takeaways

Essential concepts covered today:

  1. Why Visualize? To tell a story and understand trends (identical statistics ≠ identical patterns)

  2. Plot Types: Univariate (histograms, box plots) vs Multivariate (scatter, heatmaps, pair plots)

  3. Matplotlib: Foundation library with complete control over every element (Figure → Axes → Elements)

  4. Seaborn: High-level statistical plots with beautiful defaults, DataFrame integration, automatic KDE/regression

  5. Library Comparison:

    • Matplotlib → Custom plots, publication-quality, fine-grained control
    • Seaborn → Statistical EDA, quick insights, less code
    • Plotly → Interactive dashboards, hover tooltips, presentations
  6. Best Practices: Label everything, choose appropriate plot type, start y-axis at zero (bar charts), use 2-3 colors max, keep it simple

  7. Common Pitfalls: Missing labels/titles, misleading scales, wrong plot type, too many colors, chart junk

Remember: Good Visualization = Data + Context + Clarity

Your plot should answer: What pattern is visible? (pattern) → What does it mean? (insight) → What should we do now? (action)

📚 Resources: Plotting lecture notes | Matplotlib docs | Seaborn gallery

Questions?