Data Visualization with Python

MScAS 2025 - DSAS Lecture 4

Ilia Azizi

2025-10-14

Why Visualization?

“A picture is worth a thousand numbers”

Data visualization transforms raw data into insights through visual patterns. Essential for:

Exploration: Discover patterns, outliers, trends
Communication: Tell stories with data
Decision-making: Support data-driven choices
Quality control: Spot data issues instantly

Anscombe’s Quartet (1973):

Four datasets with nearly identical summary statistics (means, variances, and correlation) but completely different patterns!

Identical statistics ≠ Identical data patterns

Four datasets with the same statistics but telling different stories!
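
The claim can be checked numerically. The sketch below hard-codes the quartet's values (typed in by hand here, so treat them as an assumption to verify against the 1973 paper) and prints the summary statistics of each dataset. The names `quartet` and `summary` are illustrative only.

```python
import numpy as np

# Anscombe's quartet; sets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),           # sample variance
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],  # Pearson correlation
    }
    print(name, {k: round(float(v), 2) for k, v in summary[name].items()})
```

The printed rows agree to about two decimals, which is exactly why the numbers alone cannot distinguish the four datasets and a scatter plot of each is needed.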

🤔 Pop Quiz

Four datasets have identical means, variances, and correlations. What’s the best way to check whether they actually differ?

There’s no way to tell without deep statistical tests
Visualize all four datasets
Trust the statistics you calculated
Use only the dataset with the most observations

Understanding Data Dimensionality

The number of variables determines visualization approach:

Univariate (1 variable): Distribution, central tendency, spread

Multivariate (2 or more variables): Relationships, correlations, patterns

Univariate Analysis

Focus: Single variable characteristics
Questions: What’s the shape? Where’s the center? How spread out?
Tools: Histograms, box plots, violin plots
Example: Distribution of claim amounts

Multivariate Analysis

Focus: Variable relationships
Questions: How do they relate? Are they correlated? Any patterns?
Tools: Scatter plots, heatmaps, pair plots
Example: Claims vs. premium by age group

Examples of Univariate Plots

Histogram → Shows distribution shape (right-skewed vs left-skewed)
Box Plot → Highlights median, quartiles, and outliers clearly
Violin Plot → Combines density with summary statistics

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
import pandas as pd

# Set style for better-looking plots
sns.set_style("whitegrid")
plt.rcParams['font.size'] = 10
plt.rcParams['axes.labelsize'] = 11
plt.rcParams['axes.titlesize'] = 12
plt.rcParams['axes.titleweight'] = 'bold'

# Simulate claim amounts with realistic distribution
np.random.seed(42)
claims = np.random.lognormal(mean=8, sigma=1.2, size=500)

fig, axes = plt.subplots(1, 3, figsize=(11, 3.5))

# Histogram
axes[0].hist(claims, bins=30, edgecolor='black', alpha=0.75, color='steelblue')
axes[0].set_title("Histogram", fontsize=11, fontweight='bold')
axes[0].set_xlabel("Claim Amount ($)", fontsize=9)
axes[0].set_ylabel("Frequency", fontsize=9)
axes[0].grid(True, alpha=0.3, axis='y')

# Box plot
bp = axes[1].boxplot(claims, vert=True, patch_artist=True,
                      boxprops=dict(facecolor='lightblue', color='black'),
                      whiskerprops=dict(color='black'),
                      capprops=dict(color='black'),
                      medianprops=dict(color='red', linewidth=2))
axes[1].set_title("Box Plot", fontsize=11, fontweight='bold')
axes[1].set_ylabel("Claim Amount ($)", fontsize=9)
axes[1].set_xticklabels(['Claims'], fontsize=9)
axes[1].grid(True, alpha=0.3, axis='y')

# Violin plot
parts = axes[2].violinplot([claims], showmeans=True, showmedians=True)
for pc in parts['bodies']:
    pc.set_facecolor('lightcoral')
    pc.set_alpha(0.7)
parts['cmeans'].set_color('blue')
parts['cmedians'].set_color('red')
axes[2].set_title("Violin Plot", fontsize=11, fontweight='bold')
axes[2].set_ylabel("Claim Amount ($)", fontsize=9)
axes[2].set_xticks([1])  # the violin is drawn at x=1; fix the tick there before labelling it
axes[2].set_xticklabels(['Claims'], fontsize=9)
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Actuarial insight

Right-skewed claim distributions are typical → most claims are small, but a few catastrophic claims drive total losses.
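
To put rough numbers on this insight, the sketch below simulates claims with the same lognormal parameters as the plots (using NumPy's `default_rng`, so the draws differ from the seeded example) and compares the mean with the median, then computes the share of total losses coming from the largest 5% of claims. The 5% cutoff and the variable names are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
claims = rng.lognormal(mean=8, sigma=1.2, size=500)  # simulated, not real claims

mean_claim = claims.mean()
median_claim = np.median(claims)

# Share of total losses that comes from the largest 5% of claims
threshold = np.quantile(claims, 0.95)
top_share = claims[claims >= threshold].sum() / claims.sum()

print(f"mean   = {mean_claim:,.0f}")
print(f"median = {median_claim:,.0f}")
print(f"top 5% of claims carry {top_share:.0%} of total losses")
```

A mean well above the median is the numerical signature of right skew: the few very large claims pull the average up while leaving the typical claim unchanged.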

Examples of Multivariate Plots

Scatter Plot → Visual relationship between two continuous variables (positive correlation between age and premium)
Correlation Heatmap → Quick overview of all pairwise relationships (red = positive, blue = negative)

# Simulate multivariate insurance data
np.random.seed(42)
n = 200
age = np.random.uniform(25, 70, n)
premium = 500 + age * 15 + np.random.normal(0, 200, n)
claims = 0.3 * premium + np.random.normal(0, 500, n)

df = pd.DataFrame({'Age': age, 'Premium': premium, 'Claims': claims})

fig, axes = plt.subplots(1, 2, figsize=(10, 3.5))

# Scatter plot with trend
axes[0].scatter(df['Age'], df['Premium'], alpha=0.6, s=40, color='steelblue', edgecolors='black', linewidth=0.5)
# Add trend line
z = np.polyfit(df['Age'], df['Premium'], 1)
p = np.poly1d(z)
axes[0].plot(df['Age'].sort_values(), p(df['Age'].sort_values()), "r--", linewidth=2, label='Trend')
axes[0].set_title("Age vs Premium", fontsize=11, fontweight='bold')
axes[0].set_xlabel("Age (years)", fontsize=9)
axes[0].set_ylabel("Annual Premium ($)", fontsize=9)
axes[0].grid(True, alpha=0.3)
axes[0].legend(fontsize=8)

# Correlation heatmap
correlation = df.corr()
im = axes[1].imshow(correlation, cmap="coolwarm", vmin=-1, vmax=1, aspect='auto')
axes[1].set_xticks(range(len(correlation.columns)))
axes[1].set_yticks(range(len(correlation.columns)))
axes[1].set_xticklabels(correlation.columns, fontsize=9)
axes[1].set_yticklabels(correlation.columns, fontsize=9)
axes[1].set_title("Correlation Matrix", fontsize=11, fontweight='bold')

# Add correlation values
for i in range(len(correlation.columns)):
    for j in range(len(correlation.columns)):
        text = axes[1].text(j, i, f'{correlation.iloc[i, j]:.2f}',
                           ha="center", va="center", color="black", fontsize=9)

# Add colorbar
cbar = plt.colorbar(im, ax=axes[1], shrink=0.8)
cbar.set_label('Correlation', fontsize=8)

plt.tight_layout()
plt.show()

Actuarial insight

Multivariate plots reveal risk factors → here, premium rises with age, so older policyholders pay higher premiums.

🤔 Pop Quiz

Which visualization best shows how premium amounts relate to policyholder age across 500 customers?

Histogram showing age distribution
Pie chart showing age categories
Scatter plot with age on x-axis and premium on y-axis
Bar chart comparing average premiums by decade

Matplotlib: The Foundation

Understanding Matplotlib Structure

Every plot has a hierarchy: Figure (canvas) → Axes (plotting area) → Elements (lines, markers, etc.)
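
The hierarchy can be inspected directly. The following minimal sketch builds one Figure, one Axes, and one line, then checks that each object keeps a reference to its parent. The data values are arbitrary.

```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(5, 3))   # Figure: the canvas
ax = fig.add_subplot(1, 1, 1)      # Axes: one plotting area on that canvas
(line,) = ax.plot([1, 2, 3], [2, 4, 6], marker="o")  # Element: a Line2D artist
ax.set_title("One Figure, one Axes, one line")

# Each level keeps a reference to its parent
print(type(fig).__name__, "->", type(ax).__name__, "->", type(line).__name__)
print(line.axes is ax, ax.figure is fig)
```

This is why the object-oriented calls used throughout these slides (`ax.set_title`, `ax.set_xlabel`, ...) act on a specific Axes rather than on "the current plot".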

When to Use Each Plot Type

Line Plot → Time series, trends over time (e.g., monthly claims)
Bar Chart → Comparing categories (e.g., claims by region)
Histogram → Distribution of a single variable (e.g., claim amounts)
Scatter Plot → Relationship between two continuous variables (e.g., age vs premium)

fig, axes = plt.subplots(2, 2, figsize=(11, 6.5))

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
categories = ['A', 'B', 'C', 'D', 'E']

# Line plot
axes[0, 0].plot(x, y, marker='o', linewidth=2.5, markersize=8, color='steelblue')
axes[0, 0].set_title("Line Plot", fontsize=11, fontweight='bold')
axes[0, 0].set_xlabel("Time Period", fontsize=9)
axes[0, 0].set_ylabel("Values", fontsize=9)
axes[0, 0].grid(True, alpha=0.3)

# Bar chart
bars = axes[0, 1].bar(categories, y, color='coral', edgecolor='black', linewidth=1.5)
axes[0, 1].set_title("Bar Chart", fontsize=11, fontweight='bold')
axes[0, 1].set_xlabel("Categories", fontsize=9)
axes[0, 1].set_ylabel("Values", fontsize=9)
axes[0, 1].grid(True, alpha=0.3, axis='y')

# Histogram
data = np.random.randn(1000)
axes[1, 0].hist(data, bins=30, edgecolor='black', alpha=0.75, color='mediumseagreen')
axes[1, 0].set_title("Histogram", fontsize=11, fontweight='bold')
axes[1, 0].set_xlabel("Values", fontsize=9)
axes[1, 0].set_ylabel("Frequency", fontsize=9)
axes[1, 0].grid(True, alpha=0.3, axis='y')

# Scatter plot
x_scatter = np.random.rand(100) * 10
y_scatter = x_scatter * 2 + np.random.randn(100) * 2
axes[1, 1].scatter(x_scatter, y_scatter, alpha=0.6, s=40, color='purple', edgecolors='black', linewidth=0.5)
axes[1, 1].set_title("Scatter Plot", fontsize=11, fontweight='bold')
axes[1, 1].set_xlabel("X values", fontsize=9)
axes[1, 1].set_ylabel("Y values", fontsize=9)
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

Making Plots Informative

Customize colors, styles, markers, and labels for clarity and impact.

Basic plot:

Customized plot:
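
As a rough illustration of the contrast, the sketch below draws the same made-up monthly claim counts twice: once with Matplotlib defaults and once with color, markers, labels, a title, and a grid. The data and names are invented for this example.

```python
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
claims_count = [120, 135, 128, 150, 162, 158]  # made-up monthly claim counts

fig, (ax_basic, ax_custom) = plt.subplots(1, 2, figsize=(10, 3.5))

# Basic: default style, no labels, so the reader has to guess the context
ax_basic.plot(months, claims_count)

# Customized: color, marker, labels, title, grid, legend
ax_custom.plot(months, claims_count, color="steelblue", marker="o",
               linewidth=2.5, markersize=7, label="Claims")
ax_custom.set_title("Monthly Claims, Jan-Jun", fontweight="bold")
ax_custom.set_xlabel("Month")
ax_custom.set_ylabel("Number of Claims")
ax_custom.grid(True, alpha=0.3)
ax_custom.legend(loc="upper left")

fig.tight_layout()
# Call plt.show() here when running interactively
```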

Using Subplots Effectively: Multiple views of the same data to tell a complete story

Line plot → Reveals trends for each region over time
Bar chart → Shows total volume per quarter at a glance
Stacked bar → Displays composition (which region contributes most)

# Simulate quarterly claims data
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
region_north = [120, 135, 150, 140]
region_south = [100, 115, 125, 135]
region_east = [90, 105, 120, 115]

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Subplot 1: Line plot comparison
axes[0].plot(quarters, region_north, marker='o', label='North', linewidth=2.5, markersize=8)
axes[0].plot(quarters, region_south, marker='s', label='South', linewidth=2.5, markersize=8)
axes[0].plot(quarters, region_east, marker='^', label='East', linewidth=2.5, markersize=8)
axes[0].set_title("Claims by Region", fontsize=11, fontweight='bold')
axes[0].set_ylabel("Number of Claims", fontsize=9)
axes[0].set_xlabel("Quarter", fontsize=9)
axes[0].legend(fontsize=8, loc='upper left')
axes[0].grid(True, alpha=0.3)

# Subplot 2: Bar chart
total_claims = [sum(x) for x in zip(region_north, region_south, region_east)]
bars = axes[1].bar(quarters, total_claims, color='steelblue', edgecolor='black', linewidth=1.5)
axes[1].set_title("Total Claims per Quarter", fontsize=11, fontweight='bold')
axes[1].set_ylabel("Total Claims", fontsize=9)
axes[1].set_xlabel("Quarter", fontsize=9)
axes[1].grid(True, alpha=0.3, axis='y')
# Add value labels
for bar in bars:
    height = bar.get_height()
    axes[1].text(bar.get_x() + bar.get_width()/2., height,
                f'{int(height)}', ha='center', va='bottom', fontsize=8)

# Subplot 3: Stacked bar chart
width = 0.6
axes[2].bar(quarters, region_north, width, label='North', color='#1f77b4', edgecolor='black')
axes[2].bar(quarters, region_south, width, bottom=region_north, 
            label='South', color='#ff7f0e', edgecolor='black')
axes[2].bar(quarters, region_east, width, 
            bottom=[i+j for i,j in zip(region_north, region_south)], 
            label='East', color='#2ca02c', edgecolor='black')
axes[2].set_title("Stacked Claims by Region", fontsize=11, fontweight='bold')
axes[2].set_ylabel("Number of Claims", fontsize=9)
axes[2].set_xlabel("Quarter", fontsize=9)
axes[2].legend(fontsize=8)
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Pro tip

Use fig, axes = plt.subplots(rows, cols) to create subplot grids
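
A minimal sketch of the tip, assuming four regions with made-up quarterly claim counts: `plt.subplots(2, 2)` returns a 2×2 array of Axes, and `axes.flat` lets one loop fill every panel. `sharey=True` is an optional extra that keeps the panels on a common scale.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
regions = ["North", "South", "East", "West"]
quarters = ["Q1", "Q2", "Q3", "Q4"]

# sharey=True keeps the y-scale identical across panels for a fair comparison
fig, axes = plt.subplots(2, 2, figsize=(8, 5), sharey=True)

# axes is a 2x2 NumPy array; axes.flat iterates over the panels in row order
for ax, region in zip(axes.flat, regions):
    counts = rng.integers(80, 160, size=len(quarters))  # made-up claim counts
    ax.bar(quarters, counts, color="steelblue", edgecolor="black")
    ax.set_title(region, fontweight="bold")

# Label the y-axis once per row (left column only)
for ax in axes[:, 0]:
    ax.set_ylabel("Number of Claims")

fig.tight_layout()
# Call plt.show() here when running interactively
```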

🤔 Pop Quiz

To compare claims frequency across 4 regions over the same time period, which Matplotlib approach is most efficient?

Create 4 separate plots and combine them manually
Use one large plot with 4 different colors
Use fig, axes = plt.subplots(2, 2) to create a 2×2 grid
Make 4 plots and send them as separate files

Seaborn: Statistical Graphics

Built on Matplotlib, Elevated for Statistics

High-level interface: Less code for complex plots
Beautiful defaults: Professional-looking plots out-of-the-box
Statistical focus: Built-in functions for distributions, relationships
DataFrame integration: Works seamlessly with Pandas

Matplotlib (more code):

Seaborn (cleaner):

Tip

Result: Seaborn adds KDE, better styling, with less code!

Seaborn Automatically Adds Statistical Overlays:

histplot() with kde=True → Adds smooth density curve
boxplot() → Automatic grouping by categories with x= parameter
violinplot() → Combines density + quartiles in one plot

import pandas as pd
import seaborn as sns

# Simulate claim severity data for different product types
np.random.seed(42)
n = 300

data = pd.DataFrame({
    'Product': np.repeat(['Auto', 'Home', 'Life'], n),
    'Claim_Amount': np.concatenate([
        np.random.lognormal(7, 1, n),    # Auto
        np.random.lognormal(8, 1.2, n),  # Home
        np.random.lognormal(9, 0.8, n)   # Life
    ])
})

# Set seaborn style
sns.set_style("whitegrid")

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram with KDE
sns.histplot(data=data, x='Claim_Amount', kde=True, ax=axes[0], bins=30, 
             color='steelblue', edgecolor='black', linewidth=0.5)
axes[0].set_title("Histogram + KDE", fontsize=11, fontweight='bold')
axes[0].set_xlabel("Claim Amount ($)", fontsize=9)
axes[0].set_ylabel("Frequency", fontsize=9)

# Box plot by category
sns.boxplot(data=data, x='Product', y='Claim_Amount', ax=axes[1], palette='Set2')
axes[1].set_title("Box Plot by Product", fontsize=11, fontweight='bold')
axes[1].set_ylabel("Claim Amount ($)", fontsize=9)
axes[1].set_xlabel("Product Type", fontsize=9)

# Violin plot by category
sns.violinplot(data=data, x='Product', y='Claim_Amount', ax=axes[2], palette='muted')
axes[2].set_title("Violin Plot by Product", fontsize=11, fontweight='bold')
axes[2].set_ylabel("Claim Amount ($)", fontsize=9)
axes[2].set_xlabel("Product Type", fontsize=9)

plt.tight_layout()
plt.show()

Actuarial insight

Life insurance claims show a higher median but lower relative variability (a smaller spread on the log scale) than Auto/Home → in this simulation, that mimics predictable death-benefit payouts vs. variable property/accident damages.
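
A statement like this can be checked against the numbers behind the plot. The sketch below regenerates data with the same lognormal parameters (via `default_rng`, so the draws differ from the plotted sample) and tabulates, per product, the median, the standard deviation, the standard deviation of log claims, and the coefficient of variation. The column names are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300
data = pd.DataFrame({
    "Product": np.repeat(["Auto", "Home", "Life"], n),
    "Claim_Amount": np.concatenate([
        rng.lognormal(7, 1.0, n),   # Auto
        rng.lognormal(8, 1.2, n),   # Home
        rng.lognormal(9, 0.8, n),   # Life
    ]),
})
data["Log_Amount"] = np.log(data["Claim_Amount"])

stats = data.groupby("Product").agg(
    median=("Claim_Amount", "median"),
    mean=("Claim_Amount", "mean"),
    std=("Claim_Amount", "std"),
    log_std=("Log_Amount", "std"),   # spread on the log scale
)
stats["cv"] = stats["std"] / stats["mean"]  # coefficient of variation
print(stats.round(2))
```

Comparing `std` with `log_std` and `cv` shows why it matters which notion of spread is meant: absolute spread grows with the level of the claims, while the log-scale spread reflects only the `sigma` parameter.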

Seaborn for Relationships: Example Plots

scatterplot() → Supports hue= and size= for 3rd/4th dimensions
regplot() → Adds regression line + confidence interval automatically
heatmap() → Visualize correlation matrices with annotations

# Simulate insurance portfolio data
np.random.seed(42)
n = 200

portfolio = pd.DataFrame({
    'Age': np.random.uniform(25, 75, n),
    'Premium': np.random.uniform(500, 5000, n),
    'Claims_Count': np.random.poisson(0.5, n),
    'Risk_Score': np.random.uniform(0, 100, n)
})

# Add some correlation
portfolio['Premium'] = 300 + portfolio['Age'] * 40 + np.random.normal(0, 400, n)
portfolio['Claims_Count'] = (portfolio['Risk_Score'] / 50 + np.random.poisson(0.3, n)).astype(int)

fig, axes = plt.subplots(1, 2, figsize=(11, 4))

# Scatter plot with regression line
scatter = sns.scatterplot(data=portfolio, x='Age', y='Premium', 
                hue='Claims_Count', size='Claims_Count',
                palette='viridis', ax=axes[0], alpha=0.7, sizes=(30, 200),
                edgecolor='black', linewidth=0.5)
sns.regplot(data=portfolio, x='Age', y='Premium', 
            scatter=False, color='red', ax=axes[0], line_kws={'linewidth': 2.5})
axes[0].set_title("Age vs Premium (sized by Claims)", fontsize=11, fontweight='bold')
axes[0].set_xlabel("Age (years)", fontsize=9)
axes[0].set_ylabel("Annual Premium ($)", fontsize=9)
axes[0].legend(title='Claims Count', fontsize=7, title_fontsize=8, loc='upper left')

# Correlation heatmap
correlation = portfolio[['Age', 'Premium', 'Claims_Count', 'Risk_Score']].corr()
sns.heatmap(correlation, annot=True, cmap="RdBu_r", ax=axes[1], 
            fmt=".2f", square=True, cbar_kws={'shrink': 0.8},
            vmin=-1, vmax=1, linewidths=1, linecolor='white',
            annot_kws={'fontsize': 9})
axes[1].set_title("Correlation Matrix", fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

Actuarial insight

Heatmap reveals a strong positive age-premium correlation (roughly 0.8 in this simulation) → consistent with age-based pricing. Claims count rises with risk score → the pattern you would look for when validating underwriting criteria.

The Power of Pair Plots

sns.pairplot() → Visualize all pairwise relationships in one command!

# Use subset of portfolio data
portfolio_subset = portfolio[['Age', 'Premium', 'Risk_Score', 'Claims_Count']].sample(100, random_state=42)

# Create pair plot with smaller size
pair_grid = sns.pairplot(portfolio_subset, diag_kind='kde', 
                         plot_kws={'alpha': 0.6, 's': 30, 'edgecolor': 'black', 'linewidth': 0.5},
                         diag_kws={'linewidth': 2},
                         corner=True,
                         height=2)  # Controls size of each subplot
pair_grid.fig.suptitle("Portfolio Variables Pair Plot", y=0.99, fontweight='bold', fontsize=12)
plt.tight_layout()
plt.show()

Actuarial insight

Pair plots reveal non-obvious patterns → e.g., Risk_Score vs Claims_Count shows horizontal bands because claim counts are integers, with higher counts concentrated at higher risk scores.

Advanced Seaborn Features

hue= parameter → Split visualizations by category (automatic legend)
split=True in violinplot → Side-by-side comparison within groups
Palette control → 'Set2', 'muted', 'viridis' for consistent color schemes

# Create categorical data
np.random.seed(42)
df_claims = pd.DataFrame({
    'Region': np.repeat(['North', 'South', 'East', 'West'], 50),
    'Product': np.tile(np.repeat(['Auto', 'Home'], 25), 4),
    'Severity': np.random.lognormal(8, 1.5, 200)
})

fig, axes = plt.subplots(1, 2, figsize=(12, 3.8))

# Customized box plot
sns.boxplot(data=df_claims, x='Region', y='Severity', hue='Product',
            palette='Set2', ax=axes[0], linewidth=1.5)
axes[0].set_title("Claim Severity by Region and Product", 
                  fontsize=11, fontweight='bold')
axes[0].set_ylabel("Claim Severity ($)", fontsize=9)
axes[0].set_xlabel("Region", fontsize=9)
axes[0].legend(title='Product Type', fontsize=8, title_fontsize=9)
axes[0].grid(True, alpha=0.3, axis='y')

# Customized violin plot with split
sns.violinplot(data=df_claims, x='Region', y='Severity', hue='Product',
               split=True, palette='muted', ax=axes[1], linewidth=1.5)
axes[1].set_title("Split Violin Plot (Product Comparison)", 
                  fontsize=11, fontweight='bold')
axes[1].set_ylabel("Claim Severity ($)", fontsize=9)
axes[1].set_xlabel("Region", fontsize=9)
axes[1].legend(title='Product Type', fontsize=8, title_fontsize=9)
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

Actuarial insight

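Split violins put Home and Auto side by side within each region. In this simulation, severity is drawn from the same distribution for every region and product, so any visible gap is sampling noise → with real data, a consistent Home-over-Auto gap across regions would support higher base rates for property coverage.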

🤔 Pop Quiz

Your manager needs claim distribution AND age relationship visualized in 10 minutes. What’s your best tool?

Matplotlib - full control and more code for statistical overlays
Seaborn - automatic statistical overlays, beautiful defaults, less code
Plotly - great for interactivity but overkill for static analysis
Base Python - calculating pixel coordinates when time is short

Matplotlib vs Seaborn vs Plotly

Feature	Matplotlib	Seaborn	Plotly
Level	Low-level, fine control	High-level, statistical	High-level, interactive
Ease of Use	⭐⭐ (more code)	⭐⭐⭐⭐ (concise)	⭐⭐⭐⭐ (intuitive)
Default Style	Basic	Professional	Modern, polished
Statistical Plots	Manual	Built-in	Some built-in
Interactivity	❌ Static only	❌ Static only	✅ Interactive
Use Case	Custom plots, fine-tuning	EDA, statistical analysis	Dashboards, presentations
Learning Curve	Steeper	Moderate	Moderate
Performance	Fast	Fast	Slower (web-based)

When to Use What?

Matplotlib: Maximum control, custom visualizations, publication-quality static plots
Seaborn: Statistical exploration, beautiful defaults, distribution/relationship plots
Plotly: Interactive dashboards, presentations, hover tooltips, zooming

The Challenge: Visualize Insurance Portfolio

Dataset: Age vs Premium for 100 policyholders
Goal: Show relationship + distribution

Let’s see how each library handles this!

# Create sample insurance portfolio data (used in all three examples)
np.random.seed(42)
n = 100

portfolio_demo = pd.DataFrame({
    'Age': np.random.uniform(25, 75, n),
    'Premium': np.random.uniform(500, 5000, n)
})

# Add correlation
portfolio_demo['Premium'] = 300 + portfolio_demo['Age'] * 40 + np.random.normal(0, 400, n)

Matplotlib Approach: Manual Control

Pros: Complete customization, pixel-perfect control
Cons: More code, need to manually add trendline and style

fig, ax = plt.subplots(figsize=(9, 4.5))

# Scatter plot
ax.scatter(portfolio_demo['Age'], portfolio_demo['Premium'], 
           alpha=0.6, s=60, color='steelblue', edgecolors='black', linewidth=0.8)

# Manually add trendline
z = np.polyfit(portfolio_demo['Age'], portfolio_demo['Premium'], 1)
p = np.poly1d(z)
ax.plot(portfolio_demo['Age'].sort_values(), p(portfolio_demo['Age'].sort_values()), 
        "r--", linewidth=2.5, label=f'Trend: y = {z[0]:.1f}x + {z[1]:.0f}')

# Styling
ax.set_title("Age vs Premium (Matplotlib)", fontsize=13, fontweight='bold', pad=15)
ax.set_xlabel("Age (years)", fontsize=11)
ax.set_ylabel("Annual Premium ($)", fontsize=11)
ax.legend(loc='upper left', fontsize=10, framealpha=0.9)
ax.grid(True, alpha=0.3, linestyle='--', linewidth=0.8)
ax.set_facecolor('#f8f9fa')

plt.tight_layout()
plt.show()

Lines of code: ~15 | Trendline: Manual | Styling: All manual

Seaborn Approach: Statistical Focus

Pros: Automatic trendline + confidence interval, beautiful defaults, less code
Cons: Less fine-grained control

fig, ax = plt.subplots(figsize=(9, 4.5))

# Seaborn combines scatter + regression in one call!
sns.regplot(data=portfolio_demo, x='Age', y='Premium', 
            scatter_kws={'alpha': 0.6, 's': 60, 'edgecolor': 'black', 'linewidths': 0.8},
            line_kws={'color': 'red', 'linewidth': 2.5},
            ax=ax)

# Styling
ax.set_title("Age vs Premium (Seaborn)", fontsize=13, fontweight='bold', pad=15)
ax.set_xlabel("Age (years)", fontsize=11)
ax.set_ylabel("Annual Premium ($)", fontsize=11)
ax.grid(True, alpha=0.3, linestyle='--', linewidth=0.8)
ax.set_facecolor('#f8f9fa')

plt.tight_layout()
plt.show()

Lines of code: ~8 | Trendline: Automatic with confidence interval! | Styling: Beautiful defaults

Plotly Approach: Interactive

Pros: Hover tooltips, zoom/pan, toggle trendline, export options
Cons: Slower rendering, requires web browser

import plotly.express as px

# Create figure matching Matplotlib/Seaborn theme
fig = px.scatter(
    portfolio_demo, 
    x='Age', 
    y='Premium',
    trendline='ols',  # ordinary least squares fit; requires the statsmodels package
    title="Age vs Premium (Plotly - Interactive!)",
    labels={'Age': 'Age (years)', 'Premium': 'Annual Premium ($)'}
)

# Update scatter points to match Matplotlib style (steelblue with black edge)
fig.update_traces(
    marker=dict(
        size=8, 
        color='steelblue',
        opacity=0.6,
        line=dict(width=0.8, color='black')
    ),
    selector=dict(mode='markers')
);

# Update trendline to red dashed (matching Matplotlib)
fig.update_traces(
    line=dict(color='red', width=2.5, dash='dash'),
    selector=dict(mode='lines')
);

# Match Matplotlib/Seaborn styling
fig.update_layout(
    title=dict(
        text="Age vs Premium (Plotly - Interactive!)",
        font=dict(size=13, family='Arial', color='black'),
        x=0.5,
        xanchor='center'
    ),
    font=dict(size=11, family='Arial'),
    plot_bgcolor='#f8f9fa',
    paper_bgcolor='white',
    hovermode='closest',
    height=450,
    width=900,
    xaxis_title=dict(text="Age (years)", font=dict(size=11)),
    yaxis_title=dict(text="Annual Premium ($)", font=dict(size=11)),
    margin=dict(l=60, r=40, t=60, b=60)
);

# Grid matching Matplotlib (alpha=0.3, dashed)
fig.update_xaxes(
    showgrid=True, 
    gridwidth=0.8, 
    gridcolor='rgba(128,128,128,0.3)',
    griddash='dash'
);
fig.update_yaxes(
    showgrid=True, 
    gridwidth=0.8, 
    gridcolor='rgba(128,128,128,0.3)',
    griddash='dash'
);

# Display
fig

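Lines of code: ~10 for the core px.scatter call (the extra styling above only matches the look of the other two plots) | Trendline: Automatic (multiple methods available!) | Interactivity: Hover, zoom, pan, export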

Try it!

Hover over points, zoom in on a region, double-click to reset! Legend toggling becomes available once traces have legend entries (e.g., when color= is set).

Best Practice Workflow

Start with Seaborn for quick EDA (fast, beautiful, statistical)
Switch to Matplotlib when you need pixel-perfect control (publications, custom layouts)
Use Plotly for final presentations/dashboards (stakeholder engagement, interactivity)

How to Convert Your Plots

Matplotlib → Seaborn: Easy! Seaborn uses Matplotlib under the hood
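Matplotlib/Seaborn → Plotly: Use plotly.express (similar syntax), plotly.graph_objects (more control), or plotly.tools.mpl_to_plotly() (automatic conversion, though it covers only a subset of Matplotlib features and may fail with recent Matplotlib versions)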
Plotly → Static: Use fig.write_image('plot.png') (requires kaleido package)

Example: Matplotlib scatter → Seaborn regplot

# Matplotlib
plt.scatter(df['x'], df['y'])

# Seaborn (adds automatic regression + CI)
sns.regplot(data=df, x='x', y='y')

Example: Matplotlib → Plotly (automatic conversion)

import plotly.tools as tls

# Create Matplotlib figure
fig, ax = plt.subplots()
ax.scatter(df['Age'], df['Premium'])
ax.set_title("Age vs Premium")

# Convert to interactive Plotly figure automatically!
plotly_fig = tls.mpl_to_plotly(fig)
plotly_fig.show()

Example: Seaborn → Plotly

# Seaborn
sns.scatterplot(data=df, x='Age', y='Premium', hue='Region')

# Plotly equivalent
px.scatter(df, x='Age', y='Premium', color='Region')

Example: Save Plotly as static image

fig = px.scatter(df, x='Age', y='Premium')
fig.write_image('plot.png')  # Requires: pip install kaleido
fig.write_html('plot.html')  # Interactive HTML

🤔 Pop Quiz

The CFO needs to zoom, hover, and toggle risk categories during a board meeting. Which library is best for plotting?

Matplotlib - perfect for board meetings
Seaborn - beautiful statistical plots
Plotly - interactive features like zoom, hover, and toggle
All three are equally suitable for this use case

Common Pitfalls & Best Practices

Learn from Bad Examples

Understanding what makes plots ineffective helps you create better visualizations.

❌ Bad: No labels, unclear message

✅ Good: Clear, labeled, informative

❌ Misleading Y-axis

Issue: Exaggerates differences

❌ Too many colors

Issue: Distracting, no meaning

❌ Pie chart in 3D (worst, and actually hard to do!)

Issue: Hard to read percentages, compare slices
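
The misleading y-axis pitfall is easy to reproduce. In the sketch below (made-up loss ratios, illustrative names), the same three values are drawn twice: once with the axis truncated just below the smallest bar and once starting at zero.

```python
import matplotlib.pyplot as plt

regions = ["North", "South", "East"]
loss_ratio = [62, 64, 63]  # made-up loss ratios in percent

fig, (ax_bad, ax_good) = plt.subplots(1, 2, figsize=(9, 3.5))

# Misleading: the axis starts at 61, so a 2-point gap looks enormous
ax_bad.bar(regions, loss_ratio, color="coral", edgecolor="black")
ax_bad.set_ylim(61, 64.5)
ax_bad.set_title("Truncated y-axis (misleading)")

# Honest: bars start at zero, so heights are proportional to the values
ax_good.bar(regions, loss_ratio, color="steelblue", edgecolor="black")
ax_good.set_ylim(0, 70)
ax_good.set_title("Y-axis from zero")

for ax in (ax_bad, ax_good):
    ax.set_ylabel("Loss ratio (%)")

fig.tight_layout()

# Visible bar-height ratio South/North under each axis choice
bad_ratio = (64 - 61) / (62 - 61)
good_ratio = 64 / 62
print(f"South looks {bad_ratio:.1f}x North when truncated, {good_ratio:.2f}x from zero")
```

With the truncated axis the South bar appears three times as tall as the North bar, although the underlying values differ by about 3%.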

# Example: Well-designed comparison plot
np.random.seed(42)

quarters = ['Q1', 'Q2', 'Q3', 'Q4']  # shared labels; the year is encoded by bar color
revenue_2023 = [450, 480, 520, 550]
revenue_2024 = [480, 530, 580, 620]

fig, ax = plt.subplots(figsize=(8, 3.5))

x = np.arange(len(quarters))
width = 0.35

bars1 = ax.bar(x - width/2, revenue_2023, width, label='2023', 
               color='#9ecae1', edgecolor='black', linewidth=1.5)
bars2 = ax.bar(x + width/2, revenue_2024, width, label='2024', 
               color='#3182bd', edgecolor='black', linewidth=1.5)

# Add value labels on bars
for bars in [bars1, bars2]:
    for bar in bars:
        height = bar.get_height()
        ax.text(bar.get_x() + bar.get_width()/2., height + 5,
                f'${height}M',
                ha='center', va='bottom', fontsize=9, fontweight='bold')

ax.set_title("Quarterly Revenue Comparison: 2023 vs 2024", 
             fontsize=13, fontweight='bold', pad=20)
ax.set_xlabel("Quarter", fontsize=11)
ax.set_ylabel("Revenue (Million $)", fontsize=11)
ax.set_xticks(x)
ax.set_xticklabels(quarters, fontsize=10)
ax.legend(loc='upper left', fontsize=10, framealpha=0.9)
ax.grid(True, axis='y', alpha=0.3, linestyle='--', linewidth=0.8)
_ = ax.set_ylim(0, max(revenue_2024) * 1.15)
ax.set_facecolor('#f8f9fa')

plt.tight_layout()
plt.show()

What Makes This Plot Excellent?

Clear title → Tells the story immediately
Labeled axes → Context for both variables
Y-axis starts at 0 → Honest representation (critical for bar charts!)
Value labels → No guessing, exact numbers visible
Legend → Explains grouping
Grid → Aids accurate reading
Professional appearance → Clean, polished, credible

The 5-second test: Can someone understand the main message in 5 seconds? ✅

10 Visualization Rules

Always label axes and title → Context is king
Start y-axis at zero (for bar charts) → Avoid misleading scales
Choose appropriate plot type → Match plot to data type
Use color purposefully → 2-3 colors max, meaningful encoding
Add legends when needed → Explain what viewers see
Keep it simple → Remove chart junk, focus on message
Consider your audience → Adjust complexity accordingly
Use consistent scales → When comparing plots
Highlight key insights → Guide viewer attention
Test readability → Can someone understand it in 5 seconds?
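
Several of these rules can be combined in a few lines. The following sketch, using made-up claim counts, applies purposeful color and highlighting: every bar is neutral gray except the one carrying the message, which also gets an annotation. Treat it as one possible approach, not a prescribed recipe.

```python
import matplotlib.pyplot as plt

regions = ["North", "South", "East", "West"]
claims = [410, 375, 520, 390]  # made-up annual claim counts

# Neutral gray everywhere, one accent color for the bar that matters
peak = claims.index(max(claims))
colors = ["lightgray"] * len(claims)
colors[peak] = "#3182bd"

x = list(range(len(regions)))
fig, ax = plt.subplots(figsize=(7, 3.5))
ax.bar(x, claims, color=colors, edgecolor="black")
ax.set_xticks(x)
ax.set_xticklabels(regions)
ax.set_ylim(0, max(claims) * 1.25)          # bar chart: y-axis starts at zero
ax.set_title(f"{regions[peak]} Has the Most Claims", fontweight="bold")
ax.set_xlabel("Region")
ax.set_ylabel("Number of Claims")

# Point the viewer at the key value
ax.annotate(f"{claims[peak]} claims",
            xy=(peak, claims[peak]), xytext=(peak, claims[peak] * 1.12),
            ha="center", arrowprops=dict(arrowstyle="->"))

fig.tight_layout()
# Call plt.show() here when running interactively
```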

🤔 Pop Quiz

Your colleague’s plot shows impressive trends with perfect colors and styling. But management keeps asking “What is this plot representing?” What’s the most effective fix?

Switch from Matplotlib to Seaborn for better defaults
Add descriptive axis labels, title, and units
Increase the figure size from 8x6 to 12x8 inches
Change the color scheme to match company branding

Key Takeaways

Essential concepts covered today:

Why Visualize? To tell a story and understand trends (identical statistics ≠ identical patterns)
Plot Types: Univariate (histograms, box plots) vs Multivariate (scatter, heatmaps, pair plots)
Matplotlib: Foundation library with complete control over every element (Figure → Axes → Elements)
Seaborn: High-level statistical plots with beautiful defaults, DataFrame integration, automatic KDE/regression
Library Comparison:
- Matplotlib → Custom plots, publication-quality, fine-grained control
- Seaborn → Statistical EDA, quick insights, less code
- Plotly → Interactive dashboards, hover tooltips, presentations
Best Practices: Label everything, choose appropriate plot type, start y-axis at zero (bar charts), use 2-3 colors max, keep it simple
Common Pitfalls: Missing labels/titles, misleading scales, wrong plot type, too many colors, chart junk

Remember: Good Visualization = Data + Context + Clarity

Your plot should answer: What pattern stands out? (pattern) → What does it mean? (insight) → What should we do next? (action)

📚 Resources: Plotting lecture notes | Matplotlib docs | Seaborn gallery

Questions?