Neural Networks and Function Approximation¶
This notebook introduces the fundamental concepts of using neural networks for function approximation, a core task in scientific machine learning (SciML). We will build up from the basic building block, the perceptron, to a single-layer network and demonstrate its capacity to approximate a simple function, connecting theory to practice using PyTorch.
The 1D Poisson Equation: Our Benchmark Problem¶
We begin with a familiar problem: the one-dimensional Poisson equation on the unit interval $[0, 1]$ with homogeneous Dirichlet boundary conditions:
$$-\frac{d^2u}{dx^2} = f(x), \quad x \in [0, 1]$$
subject to boundary conditions: $$u(0) = 0, \quad u(1) = 0$$
This equation models diverse physical phenomena: heat conduction in a rod, deflection of a loaded beam, or electrostatic potential in one dimension. The function $u(x)$ represents the unknown solution we seek, while $f(x)$ is the prescribed source term.
For our initial exploration, we choose a source term that gives a simple, known solution: $$f(x) = \pi^2 \sin(\pi x)$$
This choice yields the analytical solution: $$u(x) = \sin(\pi x)$$
We can verify this solution by direct substitution. The second derivative of $u(x) = \sin(\pi x)$ is $u''(x) = -\pi^2 \sin(\pi x)$, so: $$-\frac{d^2u}{dx^2} = -(-\pi^2 \sin(\pi x)) = \pi^2 \sin(\pi x) = f(x) \quad \checkmark$$
The boundary conditions are satisfied: $u(0) = \sin(0) = 0$ and $u(1) = \sin(\pi) = 0$ ✓
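This check can also be reproduced symbolically; a minimal sketch, assuming SymPy is available in the environment:

# Symbolic check of the manufactured solution (assumes SymPy is installed).
import sympy as sp

x = sp.symbols('x')
u = sp.sin(sp.pi * x)                    # candidate solution
f = sp.pi**2 * sp.sin(sp.pi * x)         # prescribed source term

residual = sp.simplify(-sp.diff(u, x, 2) - f)
print("PDE residual:", residual)                      # 0
print("u(0), u(1):", u.subs(x, 0), u.subs(x, 1))      # 0 0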
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
import random
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
# Set up plotting style for clarity
plt.style.use('default')
plt.rcParams.update({
'font.size': 12,
'figure.figsize': (12, 8),
'lines.linewidth': 2.5,
'axes.grid': True,
'grid.alpha': 0.3
})
# Define the domain
x_plot = np.linspace(0, 1, 1000)
# Define the analytical solution and source function
def analytical_solution(x):
"""Analytical solution: u(x) = sin(π*x)"""
return np.sin(np.pi * x)
def source_function(x):
"""Source function: f(x) = π²*sin(π*x)"""
return np.pi**2 * np.sin(np.pi * x)
# Compute solutions for plotting
u_analytical_plot = analytical_solution(x_plot)
f_source_plot = source_function(x_plot)
# Create visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Plot the analytical solution
ax1.plot(x_plot, u_analytical_plot, 'b-', linewidth=3, label=r'$u(x) = \sin(\pi x)$')
ax1.fill_between(x_plot, 0, u_analytical_plot, alpha=0.2, color='blue')
ax1.axhline(y=0, color='k', linestyle='-', alpha=0.3)
ax1.plot([0, 0], [0, 0], 'ro', markersize=8, label='Boundary: u(0)=0')
ax1.plot([1, 1], [0, 0], 'ro', markersize=8, label='Boundary: u(1)=0')
ax1.set_xlabel('Position x')
ax1.set_ylabel('Solution u(x)')
ax1.set_title('Analytical Solution of 1D Poisson Equation', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Plot the source function
ax2.plot(x_plot, f_source_plot, 'r-', linewidth=3, label=r'$f(x) = \pi^2 \sin(\pi x)$')
ax2.fill_between(x_plot, 0, f_source_plot, alpha=0.2, color='red')
ax2.axhline(y=0, color='k', linestyle='-', alpha=0.3)
ax2.set_xlabel('Position x')
ax2.set_ylabel('Source f(x)')
ax2.set_title('Source Function', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Solution satisfies boundary conditions: u(0) = {analytical_solution(0):.6f}, u(1) = {analytical_solution(1):.6f}")
Solution satisfies boundary conditions: u(0) = 0.000000, u(1) = 0.000000
The Function Approximation Challenge¶
Consider this question: Can we learn to approximate $u(x) = \sin(\pi x)$ by observing only sparse data points?
Let's explore this with a concrete example.
# Generate sparse training data
n_training_points = 15
x_train_np = np.linspace(0, 1, n_training_points)
u_train_np = analytical_solution(x_train_np)
# Add small amount of noise to make it more realistic
noise_level = 0.02
u_train_noisy_np = u_train_np + noise_level * np.random.randn(n_training_points)
# Convert to PyTorch tensors
x_train_tensor = torch.tensor(x_train_np.reshape(-1, 1), dtype=torch.float32)
u_train_tensor = torch.tensor(u_train_noisy_np.reshape(-1, 1), dtype=torch.float32)
# Create visualization
fig, ax = plt.subplots(1, 1, figsize=(12, 8))
# Plot the true function
x_fine = np.linspace(0, 1, 1000)
u_true_fine = analytical_solution(x_fine)
ax.plot(x_fine, u_true_fine, 'b-', linewidth=3, label=r'True Function: $u(x) = \sin(\pi x)$', alpha=0.7)
# Plot training data
ax.plot(x_train_np, u_train_noisy_np, 'ro', markersize=8, label=f'Noisy Training Data ({n_training_points} points, σ={noise_level})', markeredgecolor='darkred')
# Add some visual elements to emphasize the challenge
for i, xi in enumerate(x_train_np):
ax.axvline(x=xi, color='gray', linestyle='--', alpha=0.3, linewidth=1)
ax.set_xlabel('Position x', fontsize=14)
ax.set_ylabel('Function Value u(x)', fontsize=14)
ax.set_title('Function Approximation Challenge: Learning from Sparse Data', fontsize=16, fontweight='bold')
ax.legend(fontsize=12)
ax.grid(True, alpha=0.3)
ax.set_ylim(-0.2, 1.2)
# Add annotation
ax.annotate('How can we reconstruct\nthe smooth function\nfrom these few points?',
xy=(0.7, 0.8), xytext=(0.75, 0.4),
arrowprops=dict(arrowstyle='->', color='red', lw=2),
fontsize=12, ha='center',
bbox=dict(boxstyle='round,pad=0.3', facecolor='lightyellow', alpha=0.8))
plt.tight_layout()
plt.show()
print(f"Challenge: Approximate a continuous function using only {n_training_points} data points")
Challenge: Approximate a continuous function using only 15 data points
Traditional Methods: Finite Difference¶
Traditional numerical methods like the Finite Difference Method or Finite Element Method solve PDEs by discretizing the domain into a grid or mesh. They approximate the solution $u(x)$ by finding its values at these specific, discrete points.
For example, the Finite Difference method approximates the second derivative: $$\frac{d^2u}{dx^2} \approx \frac{u_{i+1} - 2u_i + u_{i-1}}{h^2}$$ This transforms the differential equation into a system of algebraic equations for the values $u_i$ at grid points $x_i$. The result is a discrete representation of the solution.
# Simple finite difference solution for comparison
def solve_poisson_fd(n_points=51):
"""Solve 1D Poisson equation using finite differences"""
# Create grid
x_fd = np.linspace(0, 1, n_points)
h = x_fd[1] - x_fd[0]
# Create coefficient matrix A for -u'' = f
# Central difference: u''_i ≈ (u_{i+1} - 2u_i + u_{i-1})/h²
# Boundary conditions u_0 = u_{N-1} = 0 are handled by reducing the system size
A = np.zeros((n_points-2, n_points-2))
np.fill_diagonal(A, -2.0 / h**2)
np.fill_diagonal(A[1:], 1.0 / h**2)
np.fill_diagonal(A[:, 1:], 1.0 / h**2)
# Right-hand side (source function at interior points)
f_rhs = source_function(x_fd[1:-1])
# Solve linear system A u_interior = -f_rhs
    u_interior = np.linalg.solve(A, -f_rhs)  # A discretizes u'', so -u'' = f becomes A u = -f
# Assemble full solution (including boundary conditions)
u_fd = np.zeros(n_points)
u_fd[1:-1] = u_interior
u_fd[0] = 0 # u(0) = 0
u_fd[-1] = 0 # u(1) = 0
return x_fd, u_fd
# Solve using finite differences with different resolutions
x_fd_coarse, u_fd_coarse = solve_poisson_fd(11) # Coarse grid
x_fd_fine, u_fd_fine = solve_poisson_fd(51) # Fine grid
# Create comparison plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
# Plot 1: Method comparison
x_exact = np.linspace(0, 1, 1000)
u_exact = analytical_solution(x_exact)
ax1.plot(x_exact, u_exact, 'k-', linewidth=3, label='Analytical Solution', alpha=0.8)
ax1.plot(x_fd_coarse, u_fd_coarse, 'ro-', linewidth=2, markersize=6, label='Finite Difference (Coarse)', alpha=0.8)
ax1.plot(x_fd_fine, u_fd_fine, 'b.-', linewidth=1, markersize=4, label='Finite Difference (Fine)', alpha=0.8)
ax1.scatter(x_train_np, u_train_noisy_np, color='green', s=80, label='Neural Network Training Data',
edgecolors='darkgreen', linewidth=2, zorder=5)
ax1.set_xlabel('Position x')
ax1.set_ylabel('Solution u(x)')
ax1.set_title('Comparison of Solution Methods', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Plot 2: Error analysis
error_coarse = np.abs(u_fd_coarse - analytical_solution(x_fd_coarse))
error_fine = np.abs(u_fd_fine - analytical_solution(x_fd_fine))
ax2.semilogy(x_fd_coarse, error_coarse, 'ro-', linewidth=2, markersize=6, label='Coarse Grid Error')
ax2.semilogy(x_fd_fine, error_fine, 'b.-', linewidth=1, markersize=4, label='Fine Grid Error')
ax2.set_xlabel('Position x')
ax2.set_ylabel('Absolute Error (log scale)')
ax2.set_title('Discretization Error Analysis', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Print error statistics
print(f"Finite Difference Error Statistics:")
print(f"Coarse grid (11 points): Max error = {np.max(error_coarse):.6f}")
print(f"Fine grid (51 points): Max error = {np.max(error_fine):.6f}")
print(f"Error reduction factor: {np.max(error_coarse)/np.max(error_fine):.2f}")
Finite Difference Error Statistics:
Coarse grid (11 points): Max error = 0.008265
Fine grid (51 points): Max error = 0.000329
Error reduction factor: 25.12
The Neural Network Approach: Function Approximation and Universal Approximation¶
Our goal is to train a neural network $u_{NN}(x; \theta)$ to approximate the continuous solution $u^*(x) = \sin(\pi x)$ over the interval $[0, 1]$. This is a function approximation problem.
A key theoretical result in neural networks is the Universal Approximation Theorem. In essence, it states:
Theorem (Cybenko, 1989; Hornik, 1991): A feedforward network with a single hidden layer, containing a finite number of neurons and using a non-constant, bounded, and monotonically increasing activation function (like Sigmoid or Tanh), can approximate any continuous function on a compact domain to arbitrary accuracy.
$$F(x) = \sum_{i=1}^{N} w_i \sigma(v_i x + b_i) + w_0$$
Mathematical statement: For any continuous $f: [0,1] \to \mathbb{R}$ and $\epsilon > 0$, there exists $N$ and parameters such that $|F(x) - f(x)| < \epsilon$ for all $x \in [0,1]$.
(Note: While the original theorem had specific activation requirements, it has been extended to other common activations like ReLU in practice).
The significance of this theorem is profound: it tells us that even a relatively simple network architecture (a single hidden layer) has the theoretical capacity to learn complex, non-linear functions like $\sin(\pi x)$, provided it has enough neurons and uses the right kind of non-linearity. We will experimentally demonstrate this capacity.
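To make the form of $F(x)$ concrete before any training, the sketch below assembles such a sum directly: it fixes random inner parameters $v_i, b_i$ and fits only the outer weights $w_i$ (and $w_0$) by linear least squares. The number of units and the random scales are illustrative choices, not part of the theorem.

# Hand-built instance of F(x) = sum_i w_i * tanh(v_i x + b_i) + w_0.
# Inner parameters are fixed at random; only the outer weights are fitted by least squares
# (an illustrative shortcut, not how the networks below are trained).
rng = np.random.default_rng(0)
N = 20                                        # number of hidden units (illustrative choice)
v = rng.normal(scale=5.0, size=N)             # random inner weights v_i
b = rng.uniform(-5.0, 5.0, size=N)            # random inner biases b_i

features = np.tanh(x_plot.reshape(-1, 1) * v + b)            # columns are sigma(v_i x + b_i)
design = np.hstack([features, np.ones((len(x_plot), 1))])    # extra column of ones for w_0

target = np.sin(np.pi * x_plot)
w, *_ = np.linalg.lstsq(design, target, rcond=None)          # fit outer weights by least squares

F = design @ w
print(f"Max |F(x) - sin(pi x)| with {N} tanh units: {np.max(np.abs(F - target)):.2e}")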
Traditional Numerical Method vs Neural Network: Discrete vs Continuous¶
In contrast, the Neural Network approach aims to learn a continuous function $u_{NN}(x; \theta)$ that approximates the true solution $u^*(x)$ over the entire domain $[0, 1]$.
This function is parameterized by the network's weights and biases $\theta$.
We train the network by showing it examples of the solution at sparse points $(x_i, u_i)$ and adjusting $\theta$ so the network's output $u_{NN}(x_i; \theta)$ matches $u_i$ as closely as possible.
The Basic Building Block: The Perceptron¶
The fundamental unit of a neural network is the perceptron, or artificial neuron. It takes an input vector $\boldsymbol{x}$, computes a weighted sum of its elements, adds a bias, and passes the result through an activation function $g$.
Mathematically, for an input vector $\boldsymbol{x} = [x_1, x_2, ..., x_n]$ and corresponding weights $\boldsymbol{w} = [w_1, w_2, ..., w_n]$, with a bias $b$, the operation is:
$$ \begin{align} z &= \mathrm{bias} + \mathrm{linear\ combination\ of\ inputs} \\ &= \boldsymbol{w}^T\boldsymbol{x} + b \\ &= \sum_{i=1}^n w_i x_i + b \end{align} $$
The output $\hat{y}$ is then computed by passing it through an activation function:
$$\hat{y} = g(z)$$
Here, $z$ is the pre-activation or 'logit' value, and $g$ is the activation function. The activation function $g$ introduces nonlinearity to allow the perceptron to learn complex mappings from inputs to outputs. Typical choices for $g$ are the sigmoid, tanh, or ReLU functions, though the original perceptron used a step function.
The perceptron can be trained via supervised learning, adjusting the weights and biases to minimize the loss between the predicted $\hat{y}$ and the true label $y^{\text{true}}$. Backpropagation combined with gradient descent can be used to iteratively update the weights to reduce the loss.
The key components of a perceptron are:
- Input vector $\boldsymbol{x}$
- Weight vector $\boldsymbol{w}$
- Bias $b$ and weighted sum $z = \boldsymbol{w}^T\boldsymbol{x} + b$
- Nonlinear activation $g$
- Output prediction $\hat{y}$
The perceptron provides a basic model of a neuron: it scales its inputs by the corresponding weights, sums them together with a bias, and applies a non-linearity (a step function in the original formulation) to produce an output. On its own it can only represent linear decision boundaries, but multilayer perceptrons built from many interconnected units have substantial representational power. In training, the weights are updated based on the difference between the predicted output and the actual label, classically via the perceptron learning algorithm and, in modern networks, via backpropagation with gradient descent.
Credits: Alexander Amini, MIT
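A minimal numerical sketch of these components, with arbitrary example values for the input, weights, and bias:

# One perceptron step for a 3-dimensional input (all values are arbitrary examples).
x_in = np.array([0.5, -1.0, 2.0])     # input vector x
w = np.array([0.2, 0.4, -0.1])        # weight vector w
b = 0.3                               # bias

z = np.dot(w, x_in) + b               # pre-activation: w^T x + b
y_hat = np.tanh(z)                    # output after activation g(z); tanh is one typical choice
print(f"z = {z:.3f}, y_hat = {y_hat:.3f}")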
The Critical Role of Nonlinearity¶
Why do we need the activation function $g$? What happens if we just use a linear function, like $g(z) = z$?
Consider a network with multiple layers, but no non-linear activation functions between them. The output of one layer is just a linear transformation of its input. If we stack these linear layers:
Let the first layer be $h_1 = W_1 x + b_1$. Let the second layer be $h_2 = W_2 h_1 + b_2$.
Substituting the first into the second: $$h_2 = W_2 (W_1 x + b_1) + b_2$$ $$h_2 = W_2 W_1 x + W_2 b_1 + b_2$$
This can be rewritten as: $$h_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$$
Let $W_{eq} = W_2 W_1$ and $b_{eq} = W_2 b_1 + b_2$. Then: $$h_2 = W_{eq} x + b_{eq}$$
This is just another linear transformation! No matter how many linear layers we stack, the entire network can only compute a single linear (affine) function of the input. In 1D that means it can only learn $y = mx + b$, a straight line, while $\sin(\pi x)$ is curved and therefore impossible to represent with linear transformations alone.
To approximate complex, non-linear functions like $\sin(\pi x)$, we must introduce non-linearity using activation functions between the layers.
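This collapse can be checked numerically before we train anything: composing two randomly initialized linear layers is exactly equivalent to a single linear layer built from $W_{eq}$ and $b_{eq}$ (the layer sizes here are arbitrary):

# Check numerically that two stacked linear layers equal one linear layer.
layer1 = nn.Linear(1, 8)
layer2 = nn.Linear(8, 1)

x_demo = torch.linspace(0, 1, 5).reshape(-1, 1)
stacked = layer2(layer1(x_demo))                       # two linear layers, no activation

W_eq = layer2.weight @ layer1.weight                   # W_eq = W2 W1
b_eq = layer2.weight @ layer1.bias + layer2.bias       # b_eq = W2 b1 + b2
collapsed = x_demo @ W_eq.T + b_eq                     # a single equivalent linear layer

print(torch.allclose(stacked, collapsed, atol=1e-6))   # True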
# Demonstrate Linear Network Failure
class LinearNetwork(nn.Module):
"""A simple network with only linear layers (no activation)"""
def __init__(self, width):
super().__init__()
self.layers = nn.Sequential(
nn.Linear(1, width),
nn.Linear(width, width), # Another linear layer
nn.Linear(width, 1) # Final linear layer
)
def forward(self, x):
return self.layers(x)
# Train the linear network
linear_model = LinearNetwork(width=10)
criterion = nn.MSELoss()
optimizer = optim.Adam(linear_model.parameters(), lr=0.01)
epochs = 3000
for epoch in range(epochs):
predictions = linear_model(x_train_tensor)
loss = criterion(predictions, u_train_tensor)
optimizer.zero_grad()
loss.backward()
optimizer.step()
# Visualize the linear network's prediction
x_test_tensor = torch.tensor(x_plot.reshape(-1, 1), dtype=torch.float32)
with torch.no_grad():
u_pred_linear = linear_model(x_test_tensor).numpy().flatten()
fig, ax = plt.subplots(1, 1, figsize=(10, 6))
ax.plot(x_plot, u_analytical_plot, 'b-', linewidth=3, label=r'True Function: $\sin(\pi x)$', alpha=0.7)
ax.plot(x_plot, u_pred_linear, 'r--', linewidth=2, label='Linear Network Prediction')
ax.scatter(x_train_np, u_train_noisy_np, color='k', s=40, alpha=0.7, label='Training Data', zorder=5)
ax.set_xlabel('Position x')
ax.set_ylabel('Function Value u(x)')
ax.set_title(r'Linear Network Failure to Approximate $\sin(\pi x)$', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(-0.2, 1.2)
plt.tight_layout()
plt.show()
print(f"Final Loss (Linear Network): {loss.item():.6f}")
Final Loss (Linear Network): 0.122357
Introducing Nonlinearity: Activation Functions¶
Activation functions are applied element-wise to the output of a linear transformation within a neuron or layer. They introduce the non-linearity required for neural networks to learn complex mappings.
Some common activation functions include:
Sigmoid Squashes input to (0, 1). Useful for binary classification output. Can suffer from vanishing gradients. $$ \sigma(x) = \frac{1}{1 + e^{-x}} $$
Tanh Squashes input to (-1, 1). Zero-centered, often preferred over Sigmoid for hidden layers. Can also suffer from vanishing gradients. $$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} $$
ReLU (Rectified Linear Unit) Outputs input directly if positive, zero otherwise. Computationally efficient, helps mitigate vanishing gradients for positive inputs. Can suffer from "dead neurons" if inputs are always negative. $$ f(x) = \max(0, x)$$
LeakyReLU Similar to ReLU but allows a small gradient for negative inputs, preventing dead neurons. $$f(x) = \max(\alpha x, x) \quad (\alpha \text{ is a small positive constant, e.g., 0.01})$$
import numpy as np
import matplotlib.pyplot as plt
# --- 1. Parameterized Activation Functions ---
def sigmoid(x, a=1.0):
"""
Parameterized Sigmoid activation function.
'a' controls the steepness.
"""
return 1 / (1 + np.exp(-a * x))
def tanh(x, a=1.0):
"""
Parameterized Hyperbolic Tangent activation function.
'a' controls the steepness.
"""
return np.tanh(a * x)
def relu(x):
"""
Rectified Linear Unit (ReLU) activation function.
It has no parameters.
"""
return np.maximum(0, x)
def leaky_relu(x, alpha=0.1):
"""
Parameterized Leaky ReLU activation function.
'alpha' is the slope for negative inputs.
"""
return np.maximum(alpha * x, x)
# --- 2. Setup for Plotting ---
# Input data range
x = np.linspace(-5, 5, 200)
# Create a 2x2 subplot grid
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Common Activation Functions (Parameterized)', fontsize=16, fontweight='bold')
# --- 3. Plotting each function on its subplot ---
# Sigmoid Plot (Top-Left)
axs[0, 0].plot(x, sigmoid(x, a=1), label='a=1 (Standard)')
axs[0, 0].plot(x, sigmoid(x, a=2), label='a=2 (Steeper)', linestyle='--')
axs[0, 0].plot(x, sigmoid(x, a=0.5), label='a=0.5 (Less Steep)', linestyle=':')
axs[0, 0].set_title('Sigmoid Function')
axs[0, 0].legend()
# Tanh Plot (Top-Right)
axs[0, 1].plot(x, tanh(x, a=1), label='a=1 (Standard)')
axs[0, 1].plot(x, tanh(x, a=2), label='a=2 (Steeper)', linestyle='--')
axs[0, 1].plot(x, tanh(x, a=0.5), label='a=0.5 (Less Steep)', linestyle=':')
axs[0, 1].set_title('Tanh Function')
axs[0, 1].legend()
# ReLU Plot (Bottom-Left)
axs[1, 0].plot(x, relu(x), label='ReLU')
axs[1, 0].set_title('ReLU Function')
axs[1, 0].legend()
# Leaky ReLU Plot (Bottom-Right)
axs[1, 1].plot(x, leaky_relu(x, alpha=0.1), label='α=0.1 (Standard)')
axs[1, 1].plot(x, leaky_relu(x, alpha=0.3), label='α=0.3', linestyle='--')
axs[1, 1].plot(x, leaky_relu(x, alpha=0.01), label='α=0.01', linestyle=':')
axs[1, 1].set_title('Leaky ReLU Function')
axs[1, 1].legend()
# --- 4. Final Touches for all subplots ---
# Apply common labels, grids, and axis lines to all subplots
for ax in axs.flat:
ax.set_xlabel('Input z')
ax.set_ylabel('Output g(z)')
ax.grid(True, which='both', linestyle='--', linewidth=0.5)
ax.axhline(y=0, color='k', linewidth=0.8)
ax.axvline(x=0, color='k', linewidth=0.8)
# Adjust layout to prevent titles and labels from overlapping
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
# Display the plot
plt.show()
Interactive Demo: Visualizing Nonlinear Transformation¶
This interactive demo illustrates how a combination of linear transformation and non-linearity can transform data in a way that linear transformations alone cannot. Observe how the data, initially not linearly separable in the input space (X), becomes separable after passing through a linear layer (Y) and then a non-linear activation (Z).
This provides intuition for why layers with non-linear activations are powerful: they can map data into a new space where complex patterns become simpler (potentially linearly separable), making them learnable by subsequent layers.
Building Capacity: The Single Hidden Layer Neural Network¶
A single perceptron is limited in the complexity of functions it can represent. To increase capacity, we combine multiple perceptrons into a layer. A single-layer feedforward neural network (also known as a shallow Multi-Layer Perceptron or MLP) consists of an input layer, one hidden layer of neurons, and an output layer.
For our 1D input $x$, a single-layer network with $N_h$ hidden neurons works as follows:
- Input Layer: Receives the input $x$.
- Hidden Layer: Each of the $N_h$ neurons in this layer performs a linear transformation on the input $x$ and applies a non-linear activation function $g$. The output of this layer is a vector $\boldsymbol{h}$ of size $N_h$.
- Pre-activation vector $\boldsymbol{z}^{(1)}$ (size $N_h$): $\boldsymbol{z}^{(1)} = W^{(1)}\boldsymbol{x} + \boldsymbol{b}^{(1)}$ (Here, $W^{(1)}$ is a $N_h \times 1$ weight matrix, $\boldsymbol{x}$ is treated as a $1 \times 1$ vector, and $\boldsymbol{b}^{(1)}$ is a $N_h \times 1$ bias vector).
- Activation vector $\boldsymbol{h}$ (size $N_h$): $\boldsymbol{h} = g(\boldsymbol{z}^{(1)})$ (where $g$ is applied element-wise).
- Output Layer: This layer takes the vector $\boldsymbol{h}$ from the hidden layer and performs another linear transformation to produce the final scalar output $\hat{y}$. For regression, the output layer typically has a linear activation (or no activation function explicitly applied after the linear transformation).
- Pre-activation scalar $z^{(2)}$: $z^{(2)} = W^{(2)}\boldsymbol{h} + b^{(2)}$ (Here, $W^{(2)}$ is a $1 \times N_h$ weight matrix, and $b^{(2)}$ is a scalar bias).
- Final output $\hat{y}$: $\hat{y} = z^{(2)}$
Credits: Alexander Amini, MIT
# Implement Single-Layer NN in PyTorch
class SingleLayerNN(nn.Module):
"""Single hidden layer neural network for 1D input/output"""
def __init__(self, hidden_size=10):
super(SingleLayerNN, self).__init__()
# Input layer (1D) to Hidden layer (hidden_size)
self.hidden = nn.Linear(1, hidden_size)
# Hidden layer (hidden_size) to Output layer (1D)
self.output = nn.Linear(hidden_size, 1)
# Choose activation function for the hidden layer
self.activation = nn.Tanh()
def forward(self, x):
# Pass through hidden layer and apply activation
x = self.hidden(x)
x = self.activation(x)
# Pass through output layer (linear output)
x = self.output(x)
return x
Training a neural network¶
Neural networks are trained using an optimization algorithm that iteratively updates the network's weights and biases to minimize a loss function. The loss function measures how far the network's predictions are from the true target outputs in the training data. It is a measure of the model's error.
We quantify this difference using a loss function. Some common loss functions include:
Mean squared error (MSE) - The average of the squared differences between the predicted and actual values. Measures the square of the error. Used for regression problems.
Cross-entropy loss - Measures the divergence between the predicted class probabilities and the true distribution. Used for classification problems. Penalizes confident incorrect predictions.
Hinge loss - Used for Support Vector Machines classifiers. Penalizes predictions that are on the wrong side of the decision boundary.
For our function approximation (regression) task, the Mean Squared Error (MSE) is a common choice:
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^N \left(u_{NN}(x_i; \theta) - u_i\right)^2$$
Minimizing this loss function with respect to the parameters $\theta$ is an optimization problem.
Loss optimization is the process of finding the network weights that achieve the lowest loss.
$$ \begin{align} \boldsymbol{w}^* &= \arg\min_{\boldsymbol{w}}\frac{1}{n}\sum_{i=1}^n \mathcal{L}\left(f(x^{(i)};\boldsymbol{w}),y^{(i)}\right)\\ &= \arg\min_{\boldsymbol{w}} J(\boldsymbol{w}) \end{align} $$
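Before walking through the training loop, it is worth confirming that the MSE formula above is exactly what PyTorch's `nn.MSELoss` computes; a quick check with a freshly initialized (untrained) network on the training tensors defined earlier:

# The MSE formula above versus nn.MSELoss, evaluated on an untrained network.
demo_model = SingleLayerNN(hidden_size=10)
with torch.no_grad():
    preds = demo_model(x_train_tensor)
    mse_manual = ((preds - u_train_tensor) ** 2).mean()    # (1/N) * sum of squared errors
    mse_module = nn.MSELoss()(preds, u_train_tensor)
print(f"manual MSE: {mse_manual.item():.6f}, nn.MSELoss: {mse_module.item():.6f}")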
The training process works like this:
Initialization: The weights and biases of the network are initialized, often with small random numbers.
Forward Pass: The input is passed through the network, layer by layer, applying the necessary transformations (e.g., linear combinations of weights and inputs followed by activation functions) until an output is obtained.
Calculate Loss: A loss function is used to quantify the difference between the predicted output and the actual target values.
Backward Pass (Backpropagation): The gradients of the loss with respect to the parameters (weights and biases) are computed using the chain rule for derivatives. This process is known as backpropagation.
Update Parameters: The gradients computed in the backward pass are used to update the parameters of the network, typically using optimization algorithms like stochastic gradient descent (SGD) or more sophisticated ones like Adam. The update is done in the direction that minimizes the loss.
Repeat: Steps 2-5 are repeated using the next batch of data until a stopping criterion is met, such as a set number of epochs (full passes through the training dataset) or convergence to a minimum loss value.
Validation: The model is evaluated on a separate validation set to assess its generalization to unseen data.
The goal of training is to find the optimal set of weights and biases $\theta^*$ for the network that minimize the difference between the network's output $u_{NN}(x; \theta)$ and the true training data $u_{train}$.
Computing gradients with Automatic Differentiation¶
The Core Insight: Functions Are Computational Graphs
Every computer program that evaluates a mathematical function can be viewed as a computational graph. Consider this simple function:
def f(x1, x2):
y = x1**2 + x2
return y
This creates a computational graph where each operation is a node. This decomposition is the key insight that makes automatic differentiation possible.
Forward Mode Automatic Differentiation¶
Forward mode AD computes derivatives by propagating derivative information forward through the computational graph, following the same path as the function evaluation.
Forward Mode: Computing $\frac{\partial y}{\partial x_1}$¶
Starting with our function $y = x_1^2 + x_2$, let's trace through the computation:
Seed the input: Set $\dot{x}_1 = 1$ and $\dot{x}_2 = 0$ (we're differentiating w.r.t. $x_1$)
Forward propagation:
- $v_1 = x_1^2$, so $\dot{v}_1 = 2x_1 \cdot \dot{x}_1 = 2x_1 \cdot 1 = 2x_1$
- $y = v_1 + x_2$, so $\dot{y} = \dot{v}_1 + \dot{x}_2 = 2x_1 + 0 = 2x_1$
Result: $\frac{\partial y}{\partial x_1} = 2x_1$
Forward Mode: Computing $\frac{\partial y}{\partial x_2}$¶
To get the derivative w.r.t. $x_2$, we seed differently:
Seed the input: Set $\dot{x}_1 = 0$ and $\dot{x}_2 = 1$
Forward propagation:
- $v_1 = x_1^2$, so $\dot{v}_1 = 2x_1 \cdot \dot{x}_1 = 2x_1 \cdot 0 = 0$
- $y = v_1 + x_2$, so $\dot{y} = \dot{v}_1 + \dot{x}_2 = 0 + 1 = 1$
Result: $\frac{\partial y}{\partial x_2} = 1$
Key insight: Forward mode requires one pass per input variable to compute all partial derivatives.
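PyTorch exposes forward-mode products through `torch.autograd.functional.jvp` (a Jacobian-vector product, available in recent PyTorch versions); the sketch below reproduces the two seeded passes for $y = x_1^2 + x_2$:

# Forward-mode AD via Jacobian-vector products: one seeded pass per input variable.
from torch.autograd.functional import jvp

def f(x1, x2):
    return x1**2 + x2

inputs = (torch.tensor(2.0), torch.tensor(3.0))

# Seed (x1_dot, x2_dot) = (1, 0) to get dy/dx1
_, dy_dx1 = jvp(f, inputs, (torch.tensor(1.0), torch.tensor(0.0)))
# Seed (x1_dot, x2_dot) = (0, 1) to get dy/dx2
_, dy_dx2 = jvp(f, inputs, (torch.tensor(0.0), torch.tensor(1.0)))

print(f"dy/dx1 = {dy_dx1.item()}, dy/dx2 = {dy_dx2.item()}")   # 4.0 and 1.0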
Reverse Mode Automatic Differentiation¶
Reverse mode AD (also called backpropagation) computes derivatives by propagating derivative information backward through the computational graph.
The Backward Pass Algorithm¶
- Forward pass: Compute function values and store intermediate results
- Seed the output: Set $\bar{y} = 1$ (derivative of output w.r.t. itself)
- Backward pass: Use the chain rule to propagate derivatives backward
Computing All Partial Derivatives in One Pass¶
The beauty of reverse mode is that it computes all partial derivatives in a single backward pass:
Forward pass: $y = x_1^2 + x_2$ (store intermediate values)
Backward pass with $\bar{y} = 1$:
- $\frac{\partial y}{\partial x_1} = \frac{\partial y}{\partial v_1} \cdot \frac{\partial v_1}{\partial x_1} = 1 \cdot 2x_1 = 2x_1$
- $\frac{\partial y}{\partial x_2} = \frac{\partial y}{\partial x_2} = 1$
Key insight: Reverse mode computes gradients w.r.t. all inputs in a single backward pass!
AD: The Mathematical Foundation¶
Automatic differentiation works because of a fundamental theorem:
Chain Rule: For composite functions $f(g(x))$: $$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$$
By systematically applying the chain rule to each operation in a computational graph, AD can compute exact derivatives for arbitrarily complex functions.
Automatic Differentiation in Practice: PyTorch¶
Let's see how automatic differentiation works in PyTorch:
import torch
# Define variables that require gradients
x1 = torch.tensor(2.0, requires_grad=True)
x2 = torch.tensor(3.0, requires_grad=True)
# Define the function
y = x1**2 + x2
# Compute gradients using reverse mode AD
y.backward()
# Access the computed gradients
print(f"dy/dx1: {x1.grad.item()}") # Should be 2*x1 = 4.0
print(f"dy/dx2: {x2.grad.item()}") # Should be 1.0
dy/dx1: 4.0
dy/dx2: 1.0
A More Complex Example: Neural Network¶
import torch
import torch.nn as nn
# Implement Single-Layer NN in PyTorch
class SingleLayerNN(nn.Module):
"""Single hidden layer neural network for 1D input/output"""
def __init__(self, hidden_size=10):
super(SingleLayerNN, self).__init__()
self.hidden = nn.Linear(1, hidden_size)
self.output = nn.Linear(hidden_size, 1)
self.activation = nn.Tanh()
def forward(self, x):
# Forward pass
x = self.hidden(x)
x = self.activation(x)
x = self.output(x)
return x
# Create network and data
model = SingleLayerNN(hidden_size=10)
# Define MSE loss
criterion = nn.MSELoss()
# Forward pass: compute predictions
predictions = model(x_train_tensor)
# Calculate loss (((output - target)**2).mean())
loss = criterion(predictions, u_train_tensor)
# Backward pass: compute gradients
loss.backward() # Compute gradients of the loss w.r.t. parameters
# Access gradients
for name, param in model.named_parameters():
print(f"{name}: gradient shape {param.grad.shape}")
hidden.weight: gradient shape torch.Size([10, 1])
hidden.bias: gradient shape torch.Size([10])
output.weight: gradient shape torch.Size([1, 10])
output.bias: gradient shape torch.Size([1])
When to Use Forward vs Reverse Mode¶
The choice depends on the structure of your problem:
- Forward Mode: Efficient when few inputs, many outputs (e.g., $f: \mathbb{R}^n \to \mathbb{R}^m$ with $n \ll m$)
- Reverse Mode: Efficient when many inputs, few outputs (e.g., $f: \mathbb{R}^n \to \mathbb{R}^m$ with $n \gg m$)
In machine learning, we typically have millions of parameters (inputs) and a single loss function (output), making reverse mode the natural choice.
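A rough sketch of this asymmetry using PyTorch's functional API (`jvp` for forward mode, `vjp` for reverse mode); the function and sizes here are illustrative only:

# Reverse mode needs one pass for the full gradient of a scalar loss;
# forward mode would need one pass per input direction.
from torch.autograd.functional import jvp, vjp

n_inputs = 1000
w_demo = torch.randn(n_inputs)

def scalar_fn(w):
    return (w**2).sum()                     # R^n -> R, loss-like function

# Reverse mode: a single vjp call returns the full gradient (here 2w).
_, grad_reverse = vjp(scalar_fn, w_demo, torch.tensor(1.0))

# Forward mode: one jvp call returns only one directional derivative,
# so recovering the full gradient would require n_inputs calls.
e0 = torch.zeros(n_inputs)
e0[0] = 1.0
_, dfdw0 = jvp(scalar_fn, w_demo, e0)

print(torch.allclose(grad_reverse, 2 * w_demo))     # True
print(torch.isclose(dfdw0, 2 * w_demo[0]))          # True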
Computational Considerations¶
Memory vs Computation Trade-offs¶
Forward Mode:
- Memory: O(1) additional storage beyond the function evaluation
- Computation: one forward pass per input variable, i.e. O(n) passes for n inputs
Reverse Mode:
- Memory: O(computation graph size), since intermediate values must be stored for the backward pass
- Computation: a single backward pass (a small constant multiple of one function evaluation), regardless of the number of inputs
Modern Optimizations¶
- Checkpointing: Trade computation for memory by recomputing intermediate values during the backward pass (sketched below)
- JIT compilation: Compile computational graphs for faster execution
- Parallelization: Distribute gradient computation across multiple devices
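As an illustration of checkpointing, a minimal sketch using `torch.utils.checkpoint` (this assumes a recent PyTorch where the `use_reentrant` argument is available; the block being checkpointed is arbitrary):

# Gradient checkpointing: do not store this block's activations in the forward pass;
# recompute them during backward instead (trading compute for memory).
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh())
head = nn.Linear(64, 1)

x_ckpt = torch.rand(32, 1, requires_grad=True)
h = checkpoint(block, x_ckpt, use_reentrant=False)   # activations inside `block` are not kept
loss_ckpt = head(h).pow(2).mean()
loss_ckpt.backward()                                 # `block` is re-run here to rebuild them
print(x_ckpt.grad.shape)                             # torch.Size([32, 1])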
Gradient Descent¶
Gradient Descent is a first-order iterative optimization algorithm used to find the minimum of a differentiable function. In the context of training a neural network, we are trying to minimize the loss function.
- Initialize Parameters:
Choose an initial point (i.e., initial values for the weights and biases) in the parameter space, and set a learning rate that determines the step size in each iteration.
- Compute the Gradient:
Calculate the gradient of the loss function with respect to the parameters at the current point. The gradient is a vector that points in the direction of the steepest increase of the function. It is obtained by taking the partial derivatives of the loss function with respect to each parameter.
- Update Parameters:
Move in the opposite direction of the gradient by a distance proportional to the learning rate. This is done by subtracting the gradient times the learning rate from the current parameters:
$$\boldsymbol{w} = \boldsymbol{w} - \eta \nabla J(\boldsymbol{w})$$
Here, $\boldsymbol{w}$ represents the parameters, $\eta$ is the learning rate, and $\nabla J (\boldsymbol{w})$ is the gradient of the loss function $J$ with respect to $\boldsymbol{w}$.
- Repeat:
Repeat steps 2 and 3 until the change in the loss function falls below a predefined threshold, or a maximum number of iterations is reached.
Algorithm:¶
- Initialize weights randomly $\sim \mathcal{N}(0, \sigma^2)$
- Loop until convergence
- Compute gradient, $\frac{\partial J(\boldsymbol{w})}{\partial \boldsymbol{w}}$
- Update weights, $\boldsymbol{w} \leftarrow \boldsymbol{w} - \eta \frac{\partial J(\boldsymbol{w})}{\partial \boldsymbol{w}}$
- Return weights
Assuming the loss function is the mean squared error (MSE), let's compute the gradient of the loss with respect to the weights.
The loss function is mean squared error:
$$\text{loss} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
Where $y_i$ are the true targets and $\hat{y}_i$ are the predicted values.
To minimize this loss, we need to compute the gradients with respect to the weights $\mathbf{w}$ and bias $b$:
Using the chain rule, the gradient of the loss with respect to the weights is: $$\frac{\partial \text{loss}}{\partial \mathbf{w}} = \frac{2}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)\,\frac{\partial \hat{y}_i}{\partial \mathbf{w}}$$
The per-sample factor is the gradient of the loss with respect to the prediction $\hat{y}_i$, which we call $\text{grad\_output}$: $$\text{grad\_output}_i = \frac{2}{n}(\hat{y}_i - y_i)$$
The derivative $\frac{\partial \hat{y}_i}{\partial \mathbf{w}}$ is the input $\mathbf{x}_i$ multiplied by the derivative of the activation. For simplicity, let's assume a linear activation, so this is just $\mathbf{x}_i$:
$$\therefore \frac{\partial \text{loss}}{\partial \mathbf{w}} = \mathbf{X}^T\,\text{grad\_output}$$
The gradient for the bias is simpler: $$\frac{\partial \text{loss}}{\partial b} = \sum_{i=1}^{n}\text{grad\_output}_i$$
Finally, we update the weights and bias by gradient descent:
$$\mathbf{w} = \mathbf{w} - \eta \frac{\partial \text{loss}}{\partial \mathbf{w}}$$
$$b = b - \eta \frac{\partial \text{loss}}{\partial b}$$
Where $\eta$ is the learning rate.
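Putting the derivation together, here is a minimal NumPy sketch of these updates for a linear model $\hat{y} = \mathbf{X}\mathbf{w} + b$ on synthetic data (the data, learning rate, and iteration count are arbitrary illustrative choices):

# Plain gradient descent for a linear model, using the gradients derived above.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))                 # n = 50 samples, 2 features
y = X @ np.array([1.5, -2.0]) + 0.7                  # synthetic targets with known weights/bias

w_gd, b_gd, eta = np.zeros(2), 0.0, 0.1
for _ in range(2000):
    y_hat = X @ w_gd + b_gd                          # predictions of the linear model
    grad_output = (2.0 / len(y)) * (y_hat - y)       # d loss / d y_hat, one entry per sample
    grad_w = X.T @ grad_output                       # d loss / d w = X^T grad_output
    grad_b = grad_output.sum()                       # d loss / d b = sum of grad_output
    w_gd -= eta * grad_w                             # gradient descent updates
    b_gd -= eta * grad_b

print("recovered weights:", w_gd, "recovered bias:", b_gd)   # should approach [1.5, -2.0] and 0.7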
Variants:¶
There are several variants of Gradient Descent that modify or enhance these basic steps, including:
Stochastic Gradient Descent (SGD): Instead of using the entire dataset to compute the gradient, SGD uses a single random data point (or small batch) at each iteration. This adds noise to the gradient but often speeds up convergence and can escape local minima.
Momentum: Momentum methods use a moving average of past gradients to dampen oscillations and accelerate convergence, especially in cases where the loss surface has steep valleys.
Adaptive Learning Rate Methods: Techniques like Adagrad, RMSprop, and Adam adjust the learning rate individually for each parameter, often leading to faster convergence.
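All three variants are available as drop-in PyTorch optimizers; a brief sketch of how they would be instantiated for our single-layer network (the hyperparameter values are illustrative only):

# The gradient-descent variants above as PyTorch optimizers (settings are illustrative).
variant_model = SingleLayerNN(hidden_size=10)

sgd_plain    = optim.SGD(variant_model.parameters(), lr=0.01)                # (mini-batch) SGD
sgd_momentum = optim.SGD(variant_model.parameters(), lr=0.01, momentum=0.9)  # SGD with momentum
rmsprop      = optim.RMSprop(variant_model.parameters(), lr=0.01)            # adaptive learning rates
adam         = optim.Adam(variant_model.parameters(), lr=0.01)               # adaptive + momentum

# Each one is used the same way inside a training loop:
#   optimizer.zero_grad(); loss.backward(); optimizer.step()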
Limitations:¶
- It may converge to a local minimum instead of a global minimum if the loss surface is not convex.
- Convergence can be slow if the learning rate is not properly tuned.
- Sensitive to the scaling of features; poorly scaled data can cause the gradient descent to take a long time to converge or even diverge.
Effect of learning rate¶
The learning rate in gradient descent is a critical hyperparameter that can significantly influence the model's training dynamics. Let us now look at how the learning rate affects local minima, overshooting, and convergence:
- Effect on Local Minima:
High Learning Rate: A large learning rate can help the model escape shallow local minima, leading to the discovery of deeper (potentially global) minima. However, it can also cause instability, making it hard to settle in a good solution.
Low Learning Rate: A small learning rate may cause the model to get stuck in local minima, especially in complex loss landscapes with many shallow valleys. The model can lack the "energy" to escape these regions.
- Effect on Overshooting:
High Learning Rate: If the learning rate is set too high, the updates may be so large that they overshoot the minimum and cause the algorithm to diverge, or oscillate back and forth across the valley without ever reaching the bottom. This oscillation can be detrimental to convergence.
Low Learning Rate: A very low learning rate will likely avoid overshooting but may lead to extremely slow convergence, as the updates to the parameters will be minimal. It might result in getting stuck in plateau regions where the gradient is small.
- Effect on Convergence:
High Learning Rate: While it can speed up convergence initially, a too-large learning rate risks instability and divergence, as mentioned above. The model may never converge to a satisfactory solution.
Low Learning Rate: A small learning rate ensures more stable and reliable convergence but can significantly slow down the process. If set too low, it may also lead to premature convergence to a suboptimal solution.
Finding the Right Balance:¶
Choosing the right learning rate is often a trial-and-error process, sometimes guided by techniques like learning rate schedules or adaptive learning rate algorithms like Adam. These approaches attempt to balance the trade-offs by adjusting the learning rate throughout training, often starting with larger values to escape local minima and avoid plateaus, then reducing it to stabilize convergence.
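One common way to realize such a schedule in PyTorch is a learning-rate scheduler stepped once per epoch; a minimal sketch using `StepLR` (the decay interval and factor are arbitrary choices):

# Learning-rate schedule: start at 0.01, decay by 10x every 2000 epochs.
sched_model = SingleLayerNN(hidden_size=10)
sched_optimizer = optim.Adam(sched_model.parameters(), lr=0.01)
scheduler = optim.lr_scheduler.StepLR(sched_optimizer, step_size=2000, gamma=0.1)

criterion = nn.MSELoss()
for epoch in range(6000):
    loss = criterion(sched_model(x_train_tensor), u_train_tensor)
    sched_optimizer.zero_grad()
    loss.backward()
    sched_optimizer.step()
    scheduler.step()                                  # advance the schedule once per epoch

print("final learning rate:", scheduler.get_last_lr())   # roughly [1e-05]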
# Implement PyTorch Training Loop Function
def train_network(model, x_train, u_train, epochs=5000, lr=0.01):
"""Train a neural network model using MSE loss and Adam optimizer"""
criterion = nn.MSELoss() # Mean Squared Error Loss
optimizer = optim.Adam(model.parameters(), lr=lr) # Adam optimizer
losses = []
for epoch in range(epochs):
# Forward pass: compute predictions
predictions = model(x_train)
# Calculate loss
loss = criterion(predictions, u_train)
# Backward pass: compute gradients
optimizer.zero_grad() # Clear previous gradients
loss.backward() # Compute gradients of the loss w.r.t. parameters
# Optimizer step: update parameters
optimizer.step() # Perform a single optimization step
losses.append(loss.item())
# Optional: Print loss periodically
# if (epoch + 1) % 1000 == 0:
# print(f'Epoch [{epoch+1}/{epochs}], Loss: {loss.item():.6f}')
return losses
# Initialize the single layer NN
hidden_size = 10
model = SingleLayerNN(hidden_size=hidden_size)
# Train the model
losses = train_network(model, x_train_tensor, u_train_tensor, epochs=8000, lr=0.01)
print(f"Final loss for {hidden_size} neurons: {losses[-1]:.6f}")
# Get predictions from the trained model
with torch.no_grad(): # Disable gradient calculation for inference
u_pred = model(x_test_tensor).numpy().flatten()
# Plotting
# Plot true function
plt.plot(x_plot, u_analytical_plot, 'k-', linewidth=3, label='True Function', alpha=0.8)
# Plot NN prediction
plt.plot(x_plot, u_pred, 'b-', linewidth=2.5,
label=f'NN ({hidden_size} neurons)')
# Plot training data
plt.scatter(x_train_np, u_train_noisy_np, color='red', s=40, alpha=0.7,
label='Training Data', zorder=5)
plt.xlabel('x')
plt.ylabel('u(x)')
plt.title(f'Neural Network Prediction', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.ylim(-0.2, 1.2)
Final loss for 10 neurons: 0.000166
(-0.2, 1.2)
Universal Approximation Theorem: The Theoretical Foundation¶
Universal Approximation Theorem (Cybenko, 1989):
A single hidden layer network with sufficient neurons can approximate any continuous function to arbitrary accuracy.
$F(x) = \sum_{i=1}^{N} w_i \sigma(v_i x + b_i) + w_0$
Mathematical statement: For any continuous $f: [0,1] \to \mathbb{R}$ and $\epsilon > 0$, there exists $N$ and parameters such that $|F(x) - f(x)| < \epsilon$ for all $x \in [0,1]$.
Key questions:
- How many neurons $N$ do we need?
- Is this practical?
- Can we verify this experimentally?
The Key Experiment: Width vs Approximation Quality¶
Hypothesis: More neurons → better approximation (Universal Approximation Theorem)
Test: Train networks with 5, 10, 20, 50 neurons and measure performance
# Train Single-Layer NNs with Varying Width (The Experiment)
# Define hidden layer sizes to experiment with
hidden_sizes = [5, 10, 20, 50]
# Choose an activation function for the hidden layer
activation_fn = 'tanh' # Tanh often works well for smooth functions
# Dictionaries to store trained models and their loss histories
single_layer_models = {}
single_layer_losses = {}
epochs = 8000 # Number of training epochs
lr = 0.01 # Learning rate
print(f"Training single-layer networks with {activation_fn} activation...")
# Loop through different hidden sizes and train a model for each
for hidden_size in hidden_sizes:
print(f"\nTraining network with {hidden_size} neurons...")
# Instantiate the model
model = SingleLayerNN(hidden_size=hidden_size)
# Train the model
losses = train_network(model, x_train_tensor, u_train_tensor, epochs=epochs, lr=lr)
# Store the trained model and loss history
single_layer_models[hidden_size] = model
single_layer_losses[hidden_size] = losses
print(f"Final loss for {hidden_size} neurons: {losses[-1]:.6f}")
print("\nTraining complete.")
Training single-layer networks with tanh activation...
Training network with 5 neurons...
Final loss for 5 neurons: 0.000169
Training network with 10 neurons...
Final loss for 10 neurons: 0.000174
Training network with 20 neurons...
Final loss for 20 neurons: 0.000165
Training network with 50 neurons...
Final loss for 50 neurons: 0.000259
Training complete.
# Visualize Results: Approximation Quality vs. Width
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
axes = axes.flatten()
x_test_tensor = torch.tensor(x_plot.reshape(-1, 1), dtype=torch.float32)
for i, hidden_size in enumerate(hidden_sizes):
ax = axes[i]
model = single_layer_models[hidden_size]
# Get predictions from the trained model
with torch.no_grad(): # Disable gradient calculation for inference
u_pred = model(x_test_tensor).numpy().flatten()
# Plot true function
ax.plot(x_plot, u_analytical_plot, 'k-', linewidth=3, label='True Function', alpha=0.8)
# Plot NN prediction
ax.plot(x_plot, u_pred, 'b-', linewidth=2.5,
label=f'NN ({hidden_size} neurons)')
# Plot training data
ax.scatter(x_train_np, u_train_noisy_np, color='red', s=40, alpha=0.7,
label='Training Data', zorder=5)
# Calculate and display error metrics
mse = np.mean((u_pred - u_analytical_plot)**2)
max_error = np.max(np.abs(u_pred - u_analytical_plot))
ax.text(0.05, 0.95, f'MSE: {mse:.6f}\nMax Error: {max_error:.4f}',
transform=ax.transAxes,
bbox=dict(boxstyle='round', facecolor='lightblue', alpha=0.8),
verticalalignment='top', fontsize=10)
ax.set_xlabel('x')
ax.set_ylabel('u(x)')
ax.set_title(f'Single Layer: {hidden_size} Neurons', fontweight='bold')
ax.legend()
ax.grid(True, alpha=0.3)
ax.set_ylim(-0.2, 1.2)
plt.tight_layout()
plt.show()
# Visualize Results: Training Convergence vs. Width
plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
for hidden_size in hidden_sizes:
losses = single_layer_losses[hidden_size]
plt.semilogy(losses, label=f'{hidden_size} neurons', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Training Loss (log scale)')
plt.title('Training Convergence', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
final_losses = [single_layer_losses[h][-1] for h in hidden_sizes]
plt.loglog(hidden_sizes, final_losses, 'o-', linewidth=2, markersize=8)
plt.xlabel('Number of Neurons')
plt.ylabel('Final Loss (log scale)')
plt.title('Final Loss vs Network Width', fontweight='bold')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nFinal Loss Summary:")
for hidden_size in hidden_sizes:
loss = single_layer_losses[hidden_size][-1]
print(f"{hidden_size:2d} neurons: {loss:.8f}")
print(f"\nImprovement from 5 to 50 neurons: {final_losses[0]/final_losses[-1]:.1f}x better")
Final Loss Summary:
 5 neurons: 0.00016865
10 neurons: 0.00017450
20 neurons: 0.00016481
50 neurons: 0.00025939
Improvement from 5 to 50 neurons: 0.7x better
Analysis: Universal Approximation, Width, and Practicalities¶
The visualizations from our experiment show that networks of every tested width fit $\sin(\pi x)$ well: the predictions closely track the true function, and the final training losses are all of order $10^{-4}$, comparable to the noise level in the data, with no dramatic differences between widths for this smooth, simple target.
This experimentally supports the statement of the Universal Approximation Theorem: a single hidden layer with a non-linear activation has the capacity to approximate continuous functions, and adding neurons increases that capacity. For a target as simple as $\sin(\pi x)$, even 5 neurons already provide enough capacity, so extra width brings little further benefit here.
However, the theorem guarantees existence, not practicality. Our experiment also hints at practical considerations:
- Number of Neurons Needed: While 50 neurons did a good job for $\sin(\pi x)$, approximating more complex functions might require a very large number of neurons in a single layer. This can be computationally expensive and require a lot of data.
- Training Difficulty: The theorem doesn't guarantee that gradient descent will successfully find the optimal parameters. Training can be challenging, especially for very wide networks or complex functions.
- Remaining Errors: Even with 50 neurons, there's still some error. For more complex functions or higher accuracy requirements, a single layer might struggle or need excessive width.
Practical Considerations: Overfitting and Hyperparameters¶
As network capacity increases (e.g., by adding more neurons), there's a risk of overfitting. This occurs when the model learns the training data (including noise) too well, capturing spurious patterns that don't generalize to unseen data, leading to poor performance outside the training set.
Example of under and overfitting the data
Overfitting can be detected by monitoring performance on a separate validation set during training. If the validation loss starts increasing while the training loss continues to decrease, it's a sign of overfitting.
Strategies to mitigate overfitting include using more training data, regularization techniques, early stopping (stopping training when validation performance degrades), or reducing model complexity.
Problem: High-capacity networks can memorize training data instead of learning the true function
Detection: Monitor validation loss - if it increases while training loss decreases, you're overfitting
Solutions (two of these are sketched in code after this list):
- More training data
- Regularization (L1/L2, dropout)
- Early stopping
- Simpler architectures
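Two of these mitigations, L2 regularization (via weight decay) and early stopping, can be bolted onto the earlier training loop in a few lines. The sketch below uses a small held-out validation split; the split size and patience value are arbitrary illustrative choices.

# L2 regularization via weight decay, plus early stopping on a small validation split.
perm = torch.randperm(len(x_train_tensor))
val_idx, tr_idx = perm[:3], perm[3:]                      # 3 validation points (arbitrary split)
x_tr, u_tr = x_train_tensor[tr_idx], u_train_tensor[tr_idx]
x_val, u_val = x_train_tensor[val_idx], u_train_tensor[val_idx]

reg_model = SingleLayerNN(hidden_size=50)
reg_optimizer = optim.Adam(reg_model.parameters(), lr=0.01, weight_decay=1e-4)  # L2 penalty
reg_criterion = nn.MSELoss()

best_val, patience, wait = float('inf'), 500, 0           # patience is an arbitrary choice
for epoch in range(8000):
    loss = reg_criterion(reg_model(x_tr), u_tr)
    reg_optimizer.zero_grad()
    loss.backward()
    reg_optimizer.step()

    with torch.no_grad():
        val_loss = reg_criterion(reg_model(x_val), u_val).item()
    if val_loss < best_val:
        best_val, wait = val_loss, 0
    else:
        wait += 1
        if wait >= patience:                              # validation loss stopped improving
            print(f"Early stop at epoch {epoch}, best validation loss {best_val:.6f}")
            break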
Hyperparameter Choices¶
Hyperparameters are settings chosen before training that significantly influence the learning process and the final model. Key hyperparameters we've encountered include:
- Learning Rate ($\eta$): Controls the step size in gradient descent. Too high can cause divergence; too low can lead to slow convergence or getting stuck in local minima.
- Number of Epochs: How many times the training data is passed through the network. Too few may result in underfitting; too many can cause overfitting.
- Hidden Layer Size ($N_h$): The number of neurons in the hidden layer. Impacts model capacity. Too small can underfit; too large can overfit.
- Choice of Activation Function: Impacts the network's ability to model specific shapes and the training dynamics (e.g., Tanh/Sigmoid for smooth functions but potential vanishing gradients, ReLU for efficiency but "dead neuron" issue). The best choice can be problem-dependent.
Finding the right balance of hyperparameters is crucial for successful training and generalization.
How to detect overfitting with validation dataset¶
In practice, the learning algorithm does not actually find the best function, but merely one that significantly reduces the training error. These additional limitations, such as the imperfection of the optimization algorithm, mean that the learning algorithm's effective capacity may be less than the representational capacity of the model family.
Our modern ideas about improving the generalization of machine learning models are refinements of thought dating back to philosophers at least as early as Ptolemy. Many early scholars invoke a principle of parsimony that is now most widely known as Occam's razor (c. 1287–1347). This principle states that among competing hypotheses that explain known observations equally well, we should choose the "simplest" one. This idea was formalized and made more precise in the twentieth century by the founders of statistical learning theory.
We must remember that while simpler functions are more likely to generalize (to have a small gap between training and test error), we must still choose a sufficiently complex hypothesis to achieve low training error. Typically, training error decreases until it asymptotes to the minimum possible error value as model capacity increases (assuming the error measure has a minimum value). Typically, generalization error has a U-shaped curve as a function of model capacity.
At the left end of the graph, training error and generalization error are both high. This is the underfitting regime. As we increase capacity, training error decreases, but the gap between training and generalization error increases. Eventually, the size of this gap outweighs the decrease in training error, and we enter the overfitting regime, where capacity is too large, above the optimal capacity.
Image credits: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT press.
# Demonstrate overfitting with very wide network
print("Demonstrating potential overfitting with a very wide network...")
# Train a very wide network
width = 500
wide_model = SingleLayerNN(width)
wide_losses = train_network(wide_model, x_train_tensor, u_train_tensor, epochs=5000, lr=0.01)
normal_model = SingleLayerNN(50)
normal_losses = train_network(normal_model, x_train_tensor, u_train_tensor, epochs=5000, lr=0.01)
# Compare to our best previous model
x_test = np.linspace(0, 1, 200)
x_test_tensor = torch.tensor(x_test.reshape(-1, 1), dtype=torch.float32)
u_true_test = analytical_solution(x_test)
with torch.no_grad():
wide_pred = wide_model(x_test_tensor).numpy().flatten()
normal_pred = normal_model(x_test_tensor).numpy().flatten()
plt.figure(figsize=(15, 5))
plt.subplot(1, 3, 1)
plt.plot(x_test, u_true_test, 'k-', linewidth=3, label='True Function')
plt.plot(x_test, normal_pred, 'b-', linewidth=3, label='50 Neurons')
plt.scatter(x_train_tensor, u_train_tensor, color='red', s=40, alpha=0.7, zorder=5)
plt.title('Normal Network (50 neurons)', fontweight='bold')
plt.xlabel('x')
plt.ylabel('u(x)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 2)
plt.plot(x_test, u_true_test, 'k-', linewidth=3, label='True Function')
plt.plot(x_test, wide_pred, 'r-', linewidth=3, label=f'{width:d} Neurons')
plt.scatter(x_train_tensor, u_train_tensor, color='red', s=40, alpha=0.7, zorder=5)
plt.title(f'Very Wide Network ({width:d} neurons)', fontweight='bold')
plt.xlabel('x')
plt.ylabel('u(x)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 3, 3)
plt.semilogy(normal_losses, 'b-', label='50 Neurons', linewidth=2)
plt.semilogy(wide_losses, 'r-', label=f'{width:d} Neurons', linewidth=2)
plt.xlabel('Epoch')
plt.ylabel('Training Loss (log scale)')
plt.title('Training Loss Comparison', fontweight='bold')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate generalization error (on test points not in training)
x_test_fine = np.linspace(0.05, 0.95, 100) # Points between training points
x_test_fine_tensor = torch.tensor(x_test_fine.reshape(-1, 1), dtype=torch.float32)
u_true_fine = analytical_solution(x_test_fine)
with torch.no_grad():
wide_pred_fine = wide_model(x_test_fine_tensor).numpy().flatten()
normal_pred_fine = normal_model(x_test_fine_tensor).numpy().flatten()
normal_mse = np.mean((normal_pred_fine - u_true_fine)**2)
wide_mse = np.mean((wide_pred_fine - u_true_fine)**2)
print(f"\nGeneralization Error (on unseen test points):")
print(f"50 neurons: {normal_mse:.6f}")
print(f"{width:d} neurons: {wide_mse:.6f}")
print(f"\nNote: Very wide networks can sometimes generalize worse despite lower training loss")
Demonstrating potential overfitting with a very wide network...
Generalization Error (on unseen test points):
50 neurons: 0.000121
500 neurons: 0.000376

Note: Very wide networks can sometimes generalize worse despite lower training loss
The story so far: Single-layer networks can approximate any function (Universal Approximation) but may need impractically many neurons. The question: Can depth be more efficient than width?
The Need for Depth - The XOR Problem: A Historical Turning Point¶
The XOR problem exposed fundamental limitations of true single-layer perceptrons, contributing to the first "AI winter" of the 1970s. This simple problem reveals why depth is essential.
XOR Truth Table:
x₁ x₂ │ y
────────┼────
0 0 │ 0
0 1 │ 1
1 0 │ 1
1 1 │ 0
The crisis: No single line can separate these classes!
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim
plt.rcParams.update({'font.size': 12, 'figure.figsize': (14, 8)})
torch.manual_seed(42)
np.random.seed(42)
# XOR dataset
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=np.float32)
y_xor = np.array([[0], [1], [1], [0]], dtype=np.float32)
X_xor_tensor = torch.tensor(X_xor)
y_xor_tensor = torch.tensor(y_xor)
print("XOR Dataset:")
for i in range(4):
print(f"({X_xor[i,0]}, {X_xor[i,1]}) → {y_xor[i,0]}")
# Visualize the impossibility
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
# Plot 1: XOR problem with failed linear attempts
colors = ['red', 'blue']
for i in range(2):
mask = y_xor.flatten() == i
ax1.scatter(X_xor[mask, 0], X_xor[mask, 1],
c=colors[i], s=200, alpha=0.8,
label=f'Class {i}', edgecolors='black', linewidth=2)
# Failed linear separation attempts
x_line = np.linspace(-0.5, 1.5, 100)
ax1.plot(x_line, 0.5 * np.ones_like(x_line), 'g--', linewidth=2, alpha=0.7, label='Failed Line 1')
ax1.plot(0.5 * np.ones_like(x_line), x_line, 'm--', linewidth=2, alpha=0.7, label='Failed Line 2')
ax1.plot(x_line, x_line, 'orange', linestyle='--', linewidth=2, alpha=0.7, label='Failed Line 3')
ax1.set_xlim(-0.3, 1.3)
ax1.set_ylim(-0.3, 1.3)
ax1.set_xlabel('x₁')
ax1.set_ylabel('x₂')
ax1.set_title('XOR: No Linear Separation Possible', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
# Plot 2: Required non-linear boundary
for i in range(2):
mask = y_xor.flatten() == i
ax2.scatter(X_xor[mask, 0], X_xor[mask, 1],
c=colors[i], s=200, alpha=0.8,
label=f'Class {i}', edgecolors='black', linewidth=2)
# Conceptual non-linear boundary
theta = np.linspace(0, 2*np.pi, 100)
x_circle1 = 0.25 + 0.15*np.cos(theta)
y_circle1 = 0.25 + 0.15*np.sin(theta)
x_circle2 = 0.75 + 0.15*np.cos(theta)
y_circle2 = 0.75 + 0.15*np.sin(theta)
ax2.plot(x_circle1, y_circle1, 'g-', linewidth=3, alpha=0.8, label='Required Boundary')
ax2.plot(x_circle2, y_circle2, 'g-', linewidth=3, alpha=0.8)
ax2.set_xlim(-0.3, 1.3)
ax2.set_ylim(-0.3, 1.3)
ax2.set_xlabel('x₁')
ax2.set_ylabel('x₂')
ax2.set_title('Required Non-Linear Boundary', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\nThe fundamental problem: XOR is NOT linearly separable!")
XOR Dataset:
(0.0, 0.0) → 0.0
(0.0, 1.0) → 1.0
(1.0, 0.0) → 1.0
(1.0, 1.0) → 0.0
The fundamental problem: XOR is NOT linearly separable!
The Critical Distinction: True Single-Layer vs Multi-Layer¶
Historical confusion: What Minsky & Papert analyzed was a TRUE single-layer perceptron (Input → Output directly). This is different from our "single-layer" networks that have hidden layers!
Architecture comparison:
- True Single-Layer: Input → Output (NO hidden layers)
- Multi-Layer: Input → Hidden → Output (1+ hidden layers)
# Define the architectures correctly
class TrueSingleLayerPerceptron(nn.Module):
"""TRUE single-layer perceptron: Input → Output (NO hidden layers)
This is what Minsky & Papert showed cannot solve XOR!
"""
def __init__(self):
super().__init__()
self.layer = nn.Linear(2, 1) # Direct: 2 inputs → 1 output
def forward(self, x):
return torch.sigmoid(self.layer(x))
class MultiLayerPerceptron(nn.Module):
"""Multi-layer perceptron: Input → Hidden → Output
This CAN solve XOR!
"""
def __init__(self, hidden_size=4):
super().__init__()
self.hidden = nn.Linear(2, hidden_size)
self.output = nn.Linear(hidden_size, 1)
def forward(self, x):
h = torch.sigmoid(self.hidden(x))
return torch.sigmoid(self.output(h))
def train_xor_model(model, X, y, epochs=3000, lr=10.0):
"""Train model on XOR problem"""
optimizer = optim.SGD(model.parameters(), lr=lr)
criterion = nn.BCELoss()
for epoch in range(epochs):
pred = model(X)
loss = criterion(pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (epoch + 1) % 1000 == 0:
accuracy = ((pred > 0.5).float() == y).float().mean()
print(f'Epoch {epoch+1}: Loss = {loss.item():.4f}, Accuracy = {accuracy:.4f}')
return loss.item()
print("Network architectures defined:")
print("1. TrueSingleLayerPerceptron: Input → Output (what fails)")
print("2. MultiLayerPerceptron: Input → Hidden → Output (what works)")
Network architectures defined:
1. TrueSingleLayerPerceptron: Input → Output (what fails)
2. MultiLayerPerceptron: Input → Hidden → Output (what works)
The Historical Failure: True Single-Layer on XOR¶
Prediction: The true single-layer perceptron will fail spectacularly at XOR.
# Demonstrate the historical failure
print("=" * 60)
print("TRAINING TRUE SINGLE-LAYER PERCEPTRON")
print("(This is what Minsky & Papert showed fails!)")
print("=" * 60)
true_single = TrueSingleLayerPerceptron()
true_single_loss = train_xor_model(true_single, X_xor_tensor, y_xor_tensor)
# Analyze the failure
with torch.no_grad():
pred = true_single(X_xor_tensor)
accuracy = ((pred > 0.5).float() == y_xor_tensor).float().mean()
print(f"\nFINAL RESULTS:")
print(f"Accuracy: {accuracy:.4f} (should be ~0.5 = random guessing)")
print(f"Final loss: {true_single_loss:.4f}")
print("\nPredictions vs Targets:")
for i in range(4):
print(f" ({X_xor[i,0]}, {X_xor[i,1]}) → {pred[i,0]:.4f} (target: {y_xor[i,0]})")
print("\n❌ FAILURE CONFIRMED: True single-layer cannot solve XOR!")
============================================================
TRAINING TRUE SINGLE-LAYER PERCEPTRON
(This is what Minsky & Papert showed fails!)
============================================================
Epoch 1000: Loss = 1.7931, Accuracy = 0.5000
Epoch 2000: Loss = 1.7931, Accuracy = 0.5000
Epoch 3000: Loss = 1.7931, Accuracy = 0.5000

FINAL RESULTS:
Accuracy: 0.5000 (should be ~0.5 = random guessing)
Final loss: 1.7931

Predictions vs Targets:
  (0.0, 0.0) → 0.9091 (target: 0.0)
  (0.0, 1.0) → 0.9708 (target: 1.0)
  (1.0, 0.0) → 0.9708 (target: 1.0)
  (1.0, 1.0) → 0.9910 (target: 0.0)

❌ FAILURE CONFIRMED: True single-layer cannot solve XOR!
The Solution: Adding Hidden Layers¶
Hypothesis: Adding just ONE hidden layer should solve XOR completely.
# Demonstrate the solution
print("=" * 60)
print("TRAINING MULTI-LAYER PERCEPTRON")
print("(Adding ONE hidden layer should solve XOR!)")
print("=" * 60)
multi_layer = MultiLayerPerceptron(4)
multi_layer_loss = train_xor_model(multi_layer, X_xor_tensor, y_xor_tensor)
with torch.no_grad():
pred = multi_layer(X_xor_tensor)
accuracy = ((pred > 0.5).float() == y_xor_tensor).float().mean()
print(f"\nFINAL RESULTS:")
print(f"Accuracy: {accuracy:.4f} (should be 1.0000!)")
print(f"Final loss: {multi_layer_loss:.4f}")
print("\nPredictions vs Targets:")
for i in range(4):
print(f" ({X_xor[i,0]}, {X_xor[i,1]}) → {pred[i,0]:.4f} (target: {y_xor[i,0]})")
print("\n✅ SUCCESS: Multi-layer network solves XOR perfectly!")
print("\nImprovement factor: Infinite (from failure to perfect solution)")
============================================================
TRAINING MULTI-LAYER PERCEPTRON
(Adding ONE hidden layer should solve XOR!)
============================================================
Epoch 1000: Loss = 0.0012, Accuracy = 1.0000
Epoch 2000: Loss = 0.0005, Accuracy = 1.0000
Epoch 3000: Loss = 0.0003, Accuracy = 1.0000

FINAL RESULTS:
Accuracy: 1.0000 (should be 1.0000!)
Final loss: 0.0003

Predictions vs Targets:
  (0.0, 0.0) → 0.0002 (target: 0.0)
  (0.0, 1.0) → 0.9997 (target: 1.0)
  (1.0, 0.0) → 0.9997 (target: 1.0)
  (1.0, 1.0) → 0.0005 (target: 0.0)

✅ SUCCESS: Multi-layer network solves XOR perfectly!

Improvement factor: Infinite (from failure to perfect solution)
Visualizing the Decision Boundaries¶
The geometric insight: Linear vs non-linear decision boundaries.
# Create decision boundary visualization
def plot_decision_boundary(model, title, ax):
h = 0.01
x_min, x_max = -0.5, 1.5
y_min, y_max = -0.5, 1.5
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
grid_points = torch.tensor(np.c_[xx.ravel(), yy.ravel()], dtype=torch.float32)
with torch.no_grad():
Z = model(grid_points).numpy()
Z = Z.reshape(xx.shape)
# Create contour plot
contour = ax.contourf(xx, yy, Z, levels=50, alpha=0.6, cmap='RdYlBu')
ax.contour(xx, yy, Z, levels=[0.5], colors='black', linewidths=3)
# Plot XOR points
colors = ['red', 'blue']
markers = ['o', 's']
for i in range(2):
mask = y_xor.flatten() == i
ax.scatter(X_xor[mask, 0], X_xor[mask, 1],
c=colors[i], s=300, alpha=1.0,
edgecolors='black', linewidth=3,
marker=markers[i], label=f'Class {i}')
ax.set_xlim(x_min, x_max)
ax.set_ylim(y_min, y_max)
ax.set_title(title, fontweight='bold', fontsize=14)
ax.set_xlabel('x₁')
ax.set_ylabel('x₂')
ax.legend()
ax.grid(True, alpha=0.3)
# Create comparison plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))
plot_decision_boundary(true_single, 'True Single-Layer\n(Linear - FAILS)', ax1)
plot_decision_boundary(multi_layer, 'Multi-Layer\n(Non-Linear - SUCCEEDS)', ax2)
plt.tight_layout()
plt.show()
print("KEY INSIGHT:")
print("• True single-layer: Can only create straight lines → FAILS")
print("• Multi-layer: Creates curved boundaries → SUCCEEDS")
print("• Hidden layers enable non-linear transformations!")
KEY INSIGHT:
• True single-layer: Can only create straight lines → FAILS
• Multi-layer: Creates curved boundaries → SUCCEEDS
• Hidden layers enable non-linear transformations!
Mathematical Explanation: Why Depth Solves XOR¶
True single-layer limitation: $$y = \sigma(w_1 x_1 + w_2 x_2 + b)$$ Decision boundary: $w_1 x_1 + w_2 x_2 + b = 0$ (always a straight line)
Multi-layer solution: Decompose XOR into simpler, individually linearly separable operations: $$h_1 = \sigma(w_{11} x_1 + w_{12} x_2 + b_1) \quad \text{(≈ OR gate)}$$ $$h_2 = \sigma(w_{21} x_1 + w_{22} x_2 + b_2) \quad \text{(≈ AND gate)}$$ $$y = \sigma(v_1 h_1 + v_2 h_2 + b_3) \quad \text{(≈ } h_1 \text{ AND NOT } h_2\text{, with } v_1 > 0,\ v_2 < 0\text{)}$$
Result: XOR = ($x_1$ OR $x_2$) AND NOT ($x_1$ AND $x_2$). Each building block is linearly separable on its own; it is the composition through a hidden layer that captures the non-separable XOR.
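To make this decomposition concrete, here is a minimal sketch that wires the two-hidden-neuron solution by hand, reusing the MultiLayerPerceptron class defined above. The specific weight values (±10, −5, −15) are illustrative choices, not the weights found by the training run above; any weights that saturate the sigmoids in the same OR/AND pattern would work.
# A hand-built two-hidden-neuron network realizing XOR = (x1 OR x2) AND NOT (x1 AND x2).
# Note: these weights are illustrative, not the result of the training run above.
xor_by_hand = MultiLayerPerceptron(hidden_size=2)
with torch.no_grad():
    xor_by_hand.hidden.weight.copy_(torch.tensor([[10.0, 10.0],     # h1 ≈ OR(x1, x2)
                                                   [10.0, 10.0]]))  # h2 ≈ AND(x1, x2)
    xor_by_hand.hidden.bias.copy_(torch.tensor([-5.0, -15.0]))
    xor_by_hand.output.weight.copy_(torch.tensor([[10.0, -10.0]]))  # y ≈ h1 AND (NOT h2)
    xor_by_hand.output.bias.copy_(torch.tensor([-5.0]))
    pred_by_hand = xor_by_hand(X_xor_tensor)
print("Hand-built network on XOR:")
for i in range(4):
    print(f"  ({X_xor[i,0]}, {X_xor[i,1]}) → {pred_by_hand[i,0]:.4f} (target: {y_xor[i,0]})")
Running this should classify all four XOR points correctly (outputs near 0, 1, 1, 0): the OR/AND decomposition written out directly as network weights.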
Beyond XOR: High-Frequency Functions¶
The deeper question: Does the depth advantage extend beyond simple classification?
Test case: High-frequency function $f(x) = \sin(\pi x) + 0.3\sin(10\pi x)$
# High-frequency function challenge
def high_freq_function(x):
return np.sin(np.pi * x) + 0.3 * np.sin(10 * np.pi * x)
# Generate data
x_hf = np.linspace(0, 1, 200)
y_hf_true = high_freq_function(x_hf)
# Sparse training data
x_hf_train = np.linspace(0, 1, 25)
y_hf_train = high_freq_function(x_hf_train) + 0.01 * np.random.randn(25)
# Convert to tensors
x_hf_train_t = torch.tensor(x_hf_train.reshape(-1, 1), dtype=torch.float32)
y_hf_train_t = torch.tensor(y_hf_train.reshape(-1, 1), dtype=torch.float32)
x_hf_test_t = torch.tensor(x_hf.reshape(-1, 1), dtype=torch.float32)
# Define architectures
class ShallowNetwork(nn.Module):
"""Single hidden layer with many neurons"""
def __init__(self, width=100):
super().__init__()
self.network = nn.Sequential(
nn.Linear(1, width),
nn.Tanh(),
nn.Linear(width, 1)
)
def forward(self, x):
return self.network(x)
class DeepNetwork(nn.Module):
"""Multiple hidden layers with fewer neurons each"""
def __init__(self, width=25, depth=4):
super().__init__()
layers = [nn.Linear(1, width), nn.Tanh()]
for _ in range(depth-1):
layers.extend([nn.Linear(width, width), nn.Tanh()])
layers.append(nn.Linear(width, 1))
self.network = nn.Sequential(*layers)
def forward(self, x):
return self.network(x)
def train_regressor(model, x_train, y_train, epochs=5000, lr=0.01):
"""Train regression model"""
optimizer = optim.Adam(model.parameters(), lr=lr)
criterion = nn.MSELoss()
for epoch in range(epochs):
pred = model(x_train)
loss = criterion(pred, y_train)
optimizer.zero_grad()
loss.backward()
optimizer.step()
if (epoch + 1) % 1000 == 0:
print(f'Epoch {epoch+1}: Loss = {loss.item():.6f}')
return loss.item()
# Train models
print("Training shallow network (1 layer, 100 neurons)...")
shallow_net = ShallowNetwork(100)
shallow_loss = train_regressor(shallow_net, x_hf_train_t, y_hf_train_t)
print("\nTraining deep network (4 layers, 25 neurons each)...")
deep_net = DeepNetwork(25, 4)
deep_loss = train_regressor(deep_net, x_hf_train_t, y_hf_train_t)
print(f"\nComparison:")
print(f"Shallow final loss: {shallow_loss:.6f}")
print(f"Deep final loss: {deep_loss:.6f}")
print(f"Improvement: {shallow_loss/deep_loss:.1f}x better")
Training shallow network (1 layer, 100 neurons)...
Epoch 1000: Loss = 0.042941
Epoch 2000: Loss = 0.042920
Epoch 3000: Loss = 0.042881
Epoch 4000: Loss = 0.042869
Epoch 5000: Loss = 0.042858

Training deep network (4 layers, 25 neurons each)...
Epoch 1000: Loss = 0.000016
Epoch 2000: Loss = 0.002732
Epoch 3000: Loss = 0.000000
Epoch 4000: Loss = 0.000001
Epoch 5000: Loss = 0.000251

Comparison:
Shallow final loss: 0.042858
Deep final loss: 0.000251
Improvement: 170.9x better
# Visualize high-frequency results
with torch.no_grad():
shallow_pred = shallow_net(x_hf_test_t).numpy().flatten()
deep_pred = deep_net(x_hf_test_t).numpy().flatten()
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))
# Shallow network
ax1.plot(x_hf, y_hf_true, 'k-', linewidth=3, label='True Function', alpha=0.8)
ax1.plot(x_hf, shallow_pred, 'r-', linewidth=2, label='Shallow Network (100 neurons)')
ax1.scatter(x_hf_train, y_hf_train, color='blue', s=30, alpha=0.7, zorder=5)
shallow_mse = np.mean((shallow_pred - y_hf_true)**2)
ax1.text(0.05, 0.95, f'MSE: {shallow_mse:.4f}', transform=ax1.transAxes,
bbox=dict(boxstyle='round', facecolor='white', alpha=0.8),
fontsize=12, fontweight='bold')
ax1.set_title('Shallow Network (1 Hidden Layer)', fontweight='bold')
ax1.legend()
ax1.grid(True, alpha=0.3)
ax1.set_xlabel('x')
ax1.set_ylabel('y')
# Deep network
ax2.plot(x_hf, y_hf_true, 'k-', linewidth=3, label='True Function', alpha=0.8)
ax2.plot(x_hf, deep_pred, 'g-', linewidth=2, label='Deep Network (4 layers)')
ax2.scatter(x_hf_train, y_hf_train, color='blue', s=30, alpha=0.7, zorder=5)
deep_mse = np.mean((deep_pred - y_hf_true)**2)
ax2.text(0.05, 0.95, f'MSE: {deep_mse:.4f}', transform=ax2.transAxes,
bbox=dict(boxstyle='round', facecolor='white', alpha=0.8),
fontsize=12, fontweight='bold')
ax2.set_title('Deep Network (4 Hidden Layers)', fontweight='bold')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_xlabel('x')
ax2.set_ylabel('y')
plt.suptitle('High-Frequency Function: Shallow vs Deep', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()
# Parameter comparison
shallow_params = sum(p.numel() for p in shallow_net.parameters())
deep_params = sum(p.numel() for p in deep_net.parameters())
print(f"\nParameter Efficiency:")
print(f"Shallow network: {shallow_params} parameters, MSE: {shallow_mse:.6f}")
print(f"Deep network: {deep_params} parameters, MSE: {deep_mse:.6f}")
print(f"\nDeep network: {shallow_mse/deep_mse:.1f}x better performance")
print(f" {shallow_params/deep_params:.1f}x more parameters")
print(f"\nConclusion: Deep networks are more parameter-efficient!")
Parameter comparison:
Shallow network: 301 parameters, MSE: 0.042098
Deep network: 2026 parameters, MSE: 0.000635

Deep network reaches 66.3x lower MSE
while using 6.7x as many parameters.

Conclusion: adding depth recovers the high-frequency detail that extra width alone does not.
Historical Timeline: From Crisis to Revolution¶
The XOR crisis and its resolution transformed AI:
| Year | Event | Impact |
|---|---|---|
| 1943 | McCulloch-Pitts neuron | Foundation laid |
| 1957 | Rosenblatt's Perceptron | First learning success |
| 1969 | Minsky & Papert: XOR problem | Showed true single-layer limits |
| 1970s-80s | "AI Winter" | Funding dried up |
| 1986 | Backpropagation algorithm | Enabled multi-layer training |
| 1989 | Universal Approximation Theorem | Theoretical foundation |
| 2006+ | Deep Learning Revolution | Depth proves essential |
The lesson: XOR taught us that depth is not a luxury; it is a necessity.
Why Depth Matters: The Four Key Insights¶
- Representation Efficiency
  - Shallow networks: May need exponentially many neurons
  - Deep networks: Hierarchical composition is exponentially more efficient
  - Example: XOR impossible with 1 layer, trivial with 2 layers
- Feature Hierarchy
  - Layer 1: Simple features (edges, basic patterns)
  - Layer 2: Feature combinations (corners, textures)
  - Layer 3+: Complex abstractions (objects, concepts)
  - Key insight: Real-world problems have hierarchical structure
- Geometric Transformation
  - Each layer performs a coordinate transformation
  - Deep networks "unfold" complex data manifolds
  - XOR example: Transform non-separable → separable
  - General principle: Depth enables progressive simplification
- Compositional Learning
  - Complex functions = composition of simple functions
  - Mathematical: $f(x) = f_L(f_{L-1}(\dots f_1(x)))$
  - Practical: Build complexity incrementally (see the sketch after this list)
  - Universal: Applies across domains (vision, language, science)
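To make the compositional view in the last point concrete, the sketch below writes a small network explicitly as $f_3(f_2(f_1(x)))$ and checks that nn.Sequential computes exactly the same composition. The layer sizes (1 → 8 → 8 → 1) are illustrative choices only and are not tied to any model trained above.
# Explicit composition vs. nn.Sequential: both compute f3(f2(f1(x))).
# The layer sizes here are illustrative, not taken from a trained model.
f1 = nn.Sequential(nn.Linear(1, 8), nn.Tanh())
f2 = nn.Sequential(nn.Linear(8, 8), nn.Tanh())
f3 = nn.Linear(8, 1)

x_demo = torch.linspace(0, 1, 5).reshape(-1, 1)
y_explicit = f3(f2(f1(x_demo)))           # write the composition out by hand
composed = nn.Sequential(f1, f2, f3)      # or let Sequential do the composing
print(torch.allclose(y_explicit, composed(x_demo)))  # expected: True (identical computation)
Every deep architecture in this notebook is, at bottom, such a composition; training only chooses which simple functions get composed.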