Kolmogorov Forward Equation (Fokker–Planck)¶
The Kolmogorov forward equation, also known as the Fokker–Planck equation, describes the time evolution of the probability density of a diffusion process. While the backward equation acts on initial conditions \((x_0, t_0)\), the forward equation acts on the current state \((x, t)\)—tracking how probability mass spreads over time.
Related Content
- Kolmogorov Backward Equation — the dual equation for expectations
- Forward–Backward Duality — the adjoint relationship
- Heat Equation — Fokker–Planck for Brownian motion
- Invariant Measures — stationary distributions
Setting¶
Consider the Itô diffusion:
\[dX_t = \mu(X_t, t)\,dt + \sigma(X_t, t)\,dW_t\]
Let \(p(x, t \mid x_0, t_0)\) denote the transition density:
\[\mathbb{P}(X_t \in A \mid X_{t_0} = x_0) = \int_A p(x, t \mid x_0, t_0)\,dx\]
Convention:
- \((x_0, t_0)\): initial state and time (fixed parameters)
- \((x, t)\): current state and time (variables the PDE acts on)
Notation Variants
Different texts use different notations:
- \(p(x, t \mid x_0, t_0)\) — conditional notation (used here)
- \(p(t; x_0, x)\) — semicolon separating time from space
- \(p_{t_0, x_0}(x, t)\) — subscript for initial condition
All describe the same object: the probability density of \(X_t = x\) given \(X_{t_0} = x_0\).
The Fokker–Planck Equation¶
Theorem (Fokker–Planck Equation)
The transition density satisfies:
\[\frac{\partial p}{\partial t} = \mathcal{L}^* p\]
where \(\mathcal{L}^*\) is the adjoint of the infinitesimal generator:
\[\mathcal{L}^* p = -\frac{\partial}{\partial x}\big[\mu(x,t)\,p\big] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\big[\sigma^2(x,t)\,p\big]\]
Initial condition: \(p(x, t_0 \mid x_0, t_0) = \delta(x - x_0)\)
Regularity conditions: The theorem requires:
- \(\mu(x,t)\) and \(\sigma(x,t)\) are sufficiently smooth (typically \(C^1\))
- \(\sigma(x,t) > 0\) (strict positivity; for uniform ellipticity, need \(\sigma(x,t) \geq c > 0\))
- Appropriate boundary conditions (natural, reflecting, or absorbing)
Two Formulations¶
Density Form (Strong)¶
The PDE for the transition density:
\[\frac{\partial p}{\partial t} = -\frac{\partial}{\partial x}\big[\mu(x,t)\,p\big] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\big[\sigma^2(x,t)\,p\big]\]
Expectation Form (Weak)¶
For any smooth test function \(f\) with suitable decay:
\[\frac{d}{dt}\,\mathbb{E}[f(X_t)] = \mathbb{E}[\mathcal{L} f(X_t)]\]
where the infinitesimal generator is:
\[\mathcal{L} f = \mu(x,t)\frac{\partial f}{\partial x} + \frac{1}{2}\sigma^2(x,t)\frac{\partial^2 f}{\partial x^2}\]
Equivalently:
\[\frac{d}{dt}\int f(x)\,p(x,t)\,dx = \int \mathcal{L}f(x)\,p(x,t)\,dx\]
This formulation is dual to the density PDE: the generator \(\mathcal{L}\) acts on test functions, while its adjoint \(\mathcal{L}^*\) acts on densities.
Derivation¶
Derivation via Integration by Parts
Step 1: Start with the expectation form. For any smooth test function \(f\) with compact support:
\[\frac{d}{dt}\,\mathbb{E}[f(X_t)] = \mathbb{E}[\mathcal{L} f(X_t)]\]
Step 2: Write expectations as integrals against the density:
\[\frac{d}{dt}\int f(x)\,p(x,t)\,dx = \int \left[\mu(x)\frac{\partial f}{\partial x} + \frac{1}{2}\sigma^2(x)\frac{\partial^2 f}{\partial x^2}\right] p(x,t)\,dx\]
Step 3: The left side equals:
\[\int f(x)\,\frac{\partial p}{\partial t}(x,t)\,dx\]
Step 4: Integrate by parts on the right side. For the drift term:
\[\int \mu(x)\frac{\partial f}{\partial x}\,p\,dx = -\int f(x)\,\frac{\partial}{\partial x}\big[\mu(x)\,p\big]\,dx\]
For the diffusion term (integrating by parts twice):
\[\int \frac{1}{2}\sigma^2(x)\frac{\partial^2 f}{\partial x^2}\,p\,dx = \int f(x)\,\frac{1}{2}\frac{\partial^2}{\partial x^2}\big[\sigma^2(x)\,p\big]\,dx\]
Boundary terms vanish because \(f\) has compact support.
Step 5: Since this holds for all test functions \(f\):
\[\frac{\partial p}{\partial t} = -\frac{\partial}{\partial x}\big[\mu(x)\,p\big] + \frac{1}{2}\frac{\partial^2}{\partial x^2}\big[\sigma^2(x)\,p\big]\]
Expanded Forms¶
Constant Coefficients¶
For constant coefficients (\(\mu, \sigma = \text{const}\)), the Fokker–Planck equation simplifies to:
\[\frac{\partial p}{\partial t} = -\mu\frac{\partial p}{\partial x} + \frac{\sigma^2}{2}\frac{\partial^2 p}{\partial x^2}\]
This is the advection–diffusion equation: probability advects with velocity \(\mu\) and diffuses with coefficient \(\sigma^2/2\).
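As a quick numerical sanity check (an illustrative sketch, not part of the text's development; the parameter values are arbitrary), one can verify with finite differences that the Gaussian transition density of \(dX_t = \mu\,dt + \sigma\,dW_t\) satisfies this advection–diffusion equation:

```python
import numpy as np

# Arbitrary illustrative parameters
mu, sigma = 0.3, 0.8
x = np.linspace(-4.0, 4.0, 2001)
dx = x[1] - x[0]
t, dt = 1.0, 1e-5

def gaussian(x, t):
    """Transition density of dX = mu dt + sigma dW started at x0 = 0."""
    v = sigma**2 * t
    return np.exp(-(x - mu * t)**2 / (2 * v)) / np.sqrt(2 * np.pi * v)

p = gaussian(x, t)
dp_dt = (gaussian(x, t + dt) - gaussian(x, t - dt)) / (2 * dt)  # time derivative
dp_dx = np.gradient(p, dx)
d2p_dx2 = np.gradient(dp_dx, dx)
rhs = -mu * dp_dx + 0.5 * sigma**2 * d2p_dx2                    # FPE right-hand side
resid = np.max(np.abs(dp_dt - rhs)[100:-100])                   # interior residual
print(resid)
```

The residual is at the level of the finite-difference truncation error rather than exactly zero.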
Variable Coefficients¶
Full Expansion for Variable Coefficients
Expanding the derivatives explicitly using the product rule:
Drift term:
\[-\frac{\partial}{\partial x}\big[\mu(x)\,p\big] = -\mu'(x)\,p - \mu(x)\frac{\partial p}{\partial x}\]
Diffusion term: Let \(D(x) = \sigma^2(x)\). Then:
\[\frac{1}{2}\frac{\partial^2}{\partial x^2}\big[D(x)\,p\big] = \frac{1}{2}D''(x)\,p + D'(x)\frac{\partial p}{\partial x} + \frac{1}{2}D(x)\frac{\partial^2 p}{\partial x^2}\]
Since \(D' = 2\sigma\sigma'\) and \(D'' = 2(\sigma')^2 + 2\sigma\sigma''\):
\[\frac{1}{2}\frac{\partial^2}{\partial x^2}\big[\sigma^2 p\big] = \big[(\sigma')^2 + \sigma\sigma''\big]p + 2\sigma\sigma'\frac{\partial p}{\partial x} + \frac{\sigma^2}{2}\frac{\partial^2 p}{\partial x^2}\]
Combined:
\[\frac{\partial p}{\partial t} = \big[(\sigma')^2 + \sigma\sigma'' - \mu'\big]p + \big[2\sigma\sigma' - \mu\big]\frac{\partial p}{\partial x} + \frac{\sigma^2}{2}\frac{\partial^2 p}{\partial x^2}\]
Continuity Form and Probability Current¶
The Fokker–Planck equation can be written as a continuity equation:
\[\frac{\partial p}{\partial t} + \frac{\partial J}{\partial x} = 0\]
where the probability current is:
\[J(x, t) = \mu(x)\,p - \frac{1}{2}\frac{\partial}{\partial x}\big[\sigma^2(x)\,p\big]\]
Physical Interpretation
Probability is conserved—it flows according to current \(J\). The current has two components:
- Drift current: \(\mu(x) p\) — deterministic flow
- Diffusive current: \(-\frac{1}{2}\frac{\partial}{\partial x}[\sigma^2(x)p]\) — spreading due to noise (Fick's law)
Why the Diffusive Current Has This Form
The diffusive current \(-\frac{1}{2}\partial_x[\sigma^2 p]\) rather than simply \(-\frac{\sigma^2}{2}\partial_x p\) arises because:
- State-dependent noise: When \(\sigma(x)\) varies, particles in high-volatility regions spread faster
- Spurious drift: The term \(-\frac{1}{2}\sigma\sigma' p\) creates an effective drift toward low-volatility regions
- Itô convention: This specific form corresponds to the Itô interpretation of the SDE
In the Stratonovich convention, the diffusive current would take a different form.
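The splitting of the diffusive current into a Fick's-law part and a spurious-drift part is just the product-rule identity \(-\tfrac{1}{2}\partial_x[\sigma^2 p] = -\tfrac{\sigma^2}{2}\partial_x p - \sigma\sigma' p\). A minimal numerical check, with an arbitrary smooth density and an assumed state-dependent \(\sigma(x)\):

```python
import numpy as np

x = np.linspace(-3.0, 3.0, 4001)
dx = x[1] - x[0]
p = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # any smooth density
sigma = 1.0 + 0.3 * x**2                    # assumed state-dependent noise
dsigma = 0.6 * x                            # its derivative

# Left: the diffusive current as written, -0.5 d/dx [sigma^2 p]
lhs = -0.5 * np.gradient(sigma**2 * p, dx)
# Right: Fick's-law part plus the spurious drift -sigma*sigma'*p
rhs = -0.5 * sigma**2 * np.gradient(p, dx) - sigma * dsigma * p
err = np.max(np.abs(lhs - rhs)[10:-10])
print(err)
```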
Examples with Analytical Solutions¶
Example 1: Brownian Motion¶
For \(dX_t = dW_t\) (i.e., \(\mu = 0\), \(\sigma = 1\)):
\[\frac{\partial p}{\partial t} = \frac{1}{2}\frac{\partial^2 p}{\partial x^2}\]
This is the heat equation.
Solution (heat kernel):
\[p(x, t \mid x_0, t_0) = \frac{1}{\sqrt{2\pi (t - t_0)}}\exp\left(-\frac{(x - x_0)^2}{2(t - t_0)}\right)\]
Example 2: Brownian Motion with Drift¶
For \(dX_t = \mu\,dt + \sigma\,dW_t\):
\[\frac{\partial p}{\partial t} = -\mu\frac{\partial p}{\partial x} + \frac{\sigma^2}{2}\frac{\partial^2 p}{\partial x^2}\]
Solution:
\[p(x, t \mid x_0, t_0) = \frac{1}{\sqrt{2\pi\sigma^2 (t - t_0)}}\exp\left(-\frac{\big(x - x_0 - \mu (t - t_0)\big)^2}{2\sigma^2 (t - t_0)}\right)\]
Example 3: Ornstein–Uhlenbeck Process¶
For \(dX_t = -\kappa X_t\,dt + \sigma\,dW_t\) (mean-reverting to zero):
\[\frac{\partial p}{\partial t} = \kappa\frac{\partial}{\partial x}(x\,p) + \frac{\sigma^2}{2}\frac{\partial^2 p}{\partial x^2}\]
Transient solution (starting from \(X_{t_0} = x_0\)):
\[p(x, t \mid x_0, t_0) = \frac{1}{\sqrt{2\pi v(\tau)}}\exp\left(-\frac{(x - m(\tau))^2}{2 v(\tau)}\right)\]
where \(\tau = t - t_0\) and:
- Mean: \(m(\tau) = x_0 e^{-\kappa \tau}\)
- Variance: \(v(\tau) = \frac{\sigma^2}{2\kappa}(1 - e^{-2\kappa \tau})\)
Stationary solution (as \(\tau \to \infty\)):
\[p_\infty(x) = \sqrt{\frac{\kappa}{\pi\sigma^2}}\,\exp\left(-\frac{\kappa x^2}{\sigma^2}\right)\]
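A small consistency check (a sketch with arbitrary parameters): the transient mean and variance above should satisfy the OU moment equations \(\dot m = -\kappa m\) and \(\dot v = -2\kappa v + \sigma^2\):

```python
import numpy as np

kappa, sigma, x0 = 1.3, 0.7, 2.0  # arbitrary illustrative values
tau = np.linspace(0.01, 3.0, 300)
m = x0 * np.exp(-kappa * tau)
v = sigma**2 / (2 * kappa) * (1 - np.exp(-2 * kappa * tau))

dm = np.gradient(m, tau)
dv = np.gradient(v, tau)
e1 = np.max(np.abs(dm + kappa * m)[2:-2])                 # dm/dt + kappa*m ~ 0
e2 = np.max(np.abs(dv + 2 * kappa * v - sigma**2)[2:-2])  # dv/dt + 2*kappa*v - sigma^2 ~ 0
print(e1, e2)
```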
Example 4: Geometric Brownian Motion¶
For \(dS_t = \mu S_t\,dt + \sigma S_t\,dW_t\) on \(S > 0\):
\[\frac{\partial p}{\partial t} = -\mu\frac{\partial}{\partial S}(S\,p) + \frac{\sigma^2}{2}\frac{\partial^2}{\partial S^2}(S^2 p)\]
Solution (log-normal density), with \(\tau = t - t_0\):
\[p(S, t \mid S_0, t_0) = \frac{1}{S\sqrt{2\pi\sigma^2 \tau}}\exp\left(-\frac{\big(\ln(S/S_0) - (\mu - \tfrac{1}{2}\sigma^2)\tau\big)^2}{2\sigma^2 \tau}\right)\]
Summary Table¶
| Process | SDE | Forward Equation | Transition Density |
|---|---|---|---|
| Brownian Motion | \(dX_t = dW_t\) | \(\partial_t p = \frac{1}{2}\partial_{xx} p\) | Gaussian, variance \(= t\) |
| BM with Drift | \(dX_t = \mu\,dt + \sigma\,dW_t\) | \(\partial_t p = -\mu\partial_x p + \frac{\sigma^2}{2}\partial_{xx} p\) | Gaussian, mean \(= \mu t\) |
| Ornstein–Uhlenbeck | \(dX_t = -\kappa X_t\,dt + \sigma\,dW_t\) | \(\partial_t p = \kappa\partial_x(xp) + \frac{\sigma^2}{2}\partial_{xx} p\) | Gaussian, mean-reverting |
| GBM | \(dS_t = \mu S_t\,dt + \sigma S_t\,dW_t\) | \(\partial_t p = -\mu\partial_S(Sp) + \frac{\sigma^2}{2}\partial_{SS}(S^2 p)\) | Log-normal |
Stationary (Invariant) Distributions¶
A density \(p_\infty(x)\) is stationary if \(\mathcal{L}^* p_\infty = 0\):
\[-\frac{d}{dx}\big[\mu(x)\,p_\infty\big] + \frac{1}{2}\frac{d^2}{dx^2}\big[\sigma^2(x)\,p_\infty\big] = 0\]
Zero-Current Condition¶
At stationarity with zero probability current (\(J = 0\)):
\[\mu(x)\,p_\infty = \frac{1}{2}\frac{d}{dx}\big[\sigma^2(x)\,p_\infty\big]\]
General Solution (One Dimension)¶
Solving the zero-current condition gives:
\[p_\infty(x) = \frac{C}{\sigma^2(x)}\exp\left(\int^x \frac{2\mu(y)}{\sigma^2(y)}\,dy\right)\]
where \(C\) is a normalization constant.
Existence Conditions
This formula only yields a valid density if:
- \(\sigma(x) > 0\) for all \(x\) in the domain
- The integral converges (normalizability)
- Boundary conditions are compatible (reflecting or natural)
- The process is positive recurrent (returns to compact sets in finite expected time)
For example, standard Brownian motion on \(\mathbb{R}\) has no stationary distribution (it is null recurrent).
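The general formula can also be evaluated numerically. The sketch below (assumed OU parameters) integrates \(2\mu/\sigma^2\) with the trapezoid rule and recovers the Gaussian stationary density \(N(0, \sigma^2/2\kappa)\):

```python
import numpy as np

kappa, sigma = 1.0, 0.5  # illustrative OU parameters
x = np.linspace(-3.0, 3.0, 4001)
dx = x[1] - x[0]

# Antiderivative of 2*mu/sigma^2 via a cumulative trapezoid rule
f = 2 * (-kappa * x) / sigma**2
phi = np.concatenate(([0.0], np.cumsum(0.5 * (f[1:] + f[:-1]) * dx)))

p = np.exp(phi - phi.max()) / sigma**2  # unnormalized stationary density
p /= p.sum() * dx                       # normalize on the grid

v = sigma**2 / (2 * kappa)
p_exact = np.exp(-x**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
diff = np.max(np.abs(p - p_exact))
print(diff)
```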
Physical Interpretation of Stationarity¶
At equilibrium, the drift and diffusive currents balance:
\[\mu(x)\,p_\infty = \frac{1}{2}\frac{d}{dx}\big[\sigma^2(x)\,p_\infty\big]\]
This is analogous to detailed balance in statistical mechanics.
Analytical Solution Methods¶
Fourier Transform Method¶
For constant coefficients on \(\mathbb{R}\), apply the Fourier transform in \(x\):
\[\hat{p}(k, t) = \int_{-\infty}^{\infty} p(x, t)\,e^{-ikx}\,dx\]
The PDE becomes an ODE in \(t\):
\[\frac{\partial \hat{p}}{\partial t} = \left(-ik\mu - \frac{\sigma^2 k^2}{2}\right)\hat{p}\]
Solution:
\[\hat{p}(k, t) = \hat{p}(k, t_0)\,\exp\left[\left(-ik\mu - \frac{\sigma^2 k^2}{2}\right)(t - t_0)\right]\]
For delta initial condition \(p(x, t_0) = \delta(x - x_0)\), we have \(\hat{p}(k, t_0) = e^{-ikx_0}\), and the inverse transform yields the Gaussian solution.
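The Fourier-space propagation can be checked with the FFT. In the sketch below (constants are illustrative), a narrow Gaussian stands in for the delta initial condition, since a Gaussian initial density of variance \(v_0\) evolves exactly to a Gaussian of variance \(v_0 + \sigma^2 t\):

```python
import numpy as np

mu, sigma, t = 0.4, 0.6, 1.0  # illustrative constants
N, L = 4096, 40.0
x = np.linspace(-L / 2, L / 2, N, endpoint=False)
dx = L / N

v0 = 1e-3                     # narrow Gaussian standing in for the delta
p0 = np.exp(-x**2 / (2 * v0)) / np.sqrt(2 * np.pi * v0)

k = 2 * np.pi * np.fft.fftfreq(N, d=dx)
# Multiply by exp[(-i*k*mu - sigma^2 k^2 / 2) t] in Fourier space
p_hat = np.fft.fft(p0) * np.exp((-1j * k * mu - 0.5 * sigma**2 * k**2) * t)
p = np.real(np.fft.ifft(p_hat))

vt = v0 + sigma**2 * t        # exact: Gaussian IC stays Gaussian
p_exact = np.exp(-(x - mu * t)**2 / (2 * vt)) / np.sqrt(2 * np.pi * vt)
err = np.max(np.abs(p - p_exact))
print(err)
```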
Similarity Solutions¶
For the heat equation, exploit scale invariance by seeking solutions of the form:
\[p(x, t) = \frac{1}{\sqrt{t}}\,F(\eta), \qquad \eta = \frac{x}{\sqrt{t}}\]
Substituting into the PDE yields an ODE for \(F(\eta)\):
\[\frac{1}{2}F'' + \frac{1}{2}(\eta F)' = 0\]
whose normalized solution is \(F(\eta) = \frac{1}{\sqrt{2\pi}}e^{-\eta^2/2}\).
Separation of Variables¶
On bounded domains \([a, b]\) with homogeneous boundary conditions, expand in eigenfunctions of \(\mathcal{L}^*\):
\[p(x, t) = \sum_{n=0}^{\infty} c_n\,\phi_n(x)\,e^{-\lambda_n (t - t_0)}\]
where \(\mathcal{L}^* \phi_n = -\lambda_n \phi_n\) with eigenvalues \(0 = \lambda_0 < \lambda_1 < \lambda_2 < \cdots\).
The eigenfunction \(\phi_0\) with \(\lambda_0 = 0\) is the stationary distribution (if it exists).
Numerical Methods¶
For general coefficients \(\mu(x)\) and \(\sigma(x)\), analytical solutions are often unavailable.
Finite Difference Methods¶
Discretize the PDE on a grid \((x_i, t_n)\). For constant coefficients, the explicit scheme reads:
\[\frac{p_i^{n+1} - p_i^n}{\Delta t} = -\mu\,\frac{p_{i+1}^n - p_{i-1}^n}{2\Delta x} + \frac{\sigma^2}{2}\,\frac{p_{i+1}^n - 2p_i^n + p_{i-1}^n}{\Delta x^2}\]
| Scheme | Accuracy | Stability | Complexity |
|---|---|---|---|
| Explicit | \(O(\Delta t, \Delta x^2)\) | Conditional (CFL) | Simple |
| Implicit | \(O(\Delta t, \Delta x^2)\) | Unconditional | Linear solve |
| Crank–Nicolson | \(O(\Delta t^2, \Delta x^2)\) | Unconditional | Linear solve |
Stability Considerations
The advection term \(-\mu \partial_x p\) requires care:
- Central differences can cause oscillations for advection-dominated problems
- Upwind schemes add numerical diffusion but ensure stability
- CFL condition for explicit schemes: \(\Delta t \leq \frac{\Delta x^2}{\sigma^2}\)
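The points above can be illustrated with a minimal explicit scheme in conservative (flux) form, with upwinded drift and reflecting boundaries, applied to the OU process (parameter values are illustrative, not canonical). It conserves mass to rounding error and relaxes toward the Gaussian stationary density:

```python
import numpy as np

kappa, sigma = 1.0, 0.5          # illustrative OU parameters
L, N = 3.0, 300
x = np.linspace(-L, L, N)
dx = x[1] - x[0]
dt = 0.2 * dx**2 / sigma**2      # well inside the CFL restriction

p = np.exp(-(x - 1.0)**2 / 0.02)  # narrow initial bump at x = 1
p /= p.sum() * dx

def step(p):
    """One explicit step of dp/dt = -dJ/dx in conservative form."""
    mu_face = -kappa * 0.5 * (x[:-1] + x[1:])      # drift at cell faces
    upwind = np.where(mu_face > 0, p[:-1], p[1:])  # upwind density value
    J = mu_face * upwind - 0.5 * sigma**2 * (p[1:] - p[:-1]) / dx
    J = np.concatenate(([0.0], J, [0.0]))          # reflecting walls: J = 0
    return p - dt * (J[1:] - J[:-1]) / dx

for _ in range(int(5.0 / dt)):   # integrate to t = 5 (several relaxation times)
    p = step(p)

mass = p.sum() * dx              # conserved up to rounding
v = sigma**2 / (2 * kappa)
p_inf = np.exp(-x**2 / (2 * v)) / np.sqrt(2 * np.pi * v)
err = np.max(np.abs(p - p_inf))
print(mass, err)
```

The remaining discrepancy from the exact Gaussian is dominated by the numerical diffusion the upwind flux introduces.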
Positivity Preservation¶
The exact solution satisfies \(p \geq 0\) for all \(t\). Numerical schemes should preserve this:
- Use flux-limiting schemes for advection
- Avoid Crank–Nicolson without modification (can produce negative values)
- Consider operator splitting for advection-diffusion
Monte Carlo Verification¶
Simulate paths of the SDE and compare the empirical histogram to the PDE solution:
```python
import numpy as np
import matplotlib.pyplot as plt

def simulate_sde(x0, t0, t_final, mu, sigma, n_paths=100000, n_steps=1000):
    """Simulate SDE paths using Euler-Maruyama."""
    dt = (t_final - t0) / n_steps
    X = np.zeros((n_paths, n_steps + 1))
    X[:, 0] = x0
    for i in range(n_steps):
        dW = np.sqrt(dt) * np.random.randn(n_paths)
        X[:, i+1] = X[:, i] + mu(X[:, i]) * dt + sigma(X[:, i]) * dW
    return X

# Example: Ornstein-Uhlenbeck process
kappa, sigma_val = 1.0, 0.5
mu_func = lambda x: -kappa * x
sigma_func = lambda x: sigma_val * np.ones_like(x)

X = simulate_sde(x0=1.0, t0=0.0, t_final=2.0, mu=mu_func, sigma=sigma_func)

# Compare histogram to analytical solution
tau = 2.0
m_tau = 1.0 * np.exp(-kappa * tau)
v_tau = (sigma_val**2 / (2 * kappa)) * (1 - np.exp(-2 * kappa * tau))

x_grid = np.linspace(-2, 2, 200)
p_analytical = np.exp(-(x_grid - m_tau)**2 / (2 * v_tau)) / np.sqrt(2 * np.pi * v_tau)

plt.hist(X[:, -1], bins=100, density=True, alpha=0.7, label='Monte Carlo')
plt.plot(x_grid, p_analytical, 'r-', lw=2, label='Analytical')
plt.xlabel('x')
plt.ylabel('p(x, t)')
plt.legend()
plt.title('Fokker-Planck vs Monte Carlo')
plt.show()
```
Multidimensional Fokker–Planck¶
For \(X_t \in \mathbb{R}^d\) with \(dX_t^i = \mu^i(X_t, t)\,dt + \sigma^{i\alpha}(X_t, t)\,dW_t^\alpha\):
\[\frac{\partial p}{\partial t} = -\sum_{i=1}^{d}\frac{\partial}{\partial x^i}\big[\mu^i p\big] + \frac{1}{2}\sum_{i,j=1}^{d}\frac{\partial^2}{\partial x^i \partial x^j}\big[a^{ij} p\big]\]
where the diffusion matrix is \(a^{ij} = \sum_{\alpha}\sigma^{i\alpha}\sigma^{j\alpha} = (\sigma\sigma^\top)^{ij}\).
In vector notation:
\[\frac{\partial p}{\partial t} = -\nabla\cdot(\mu\,p) + \frac{1}{2}\nabla\cdot\nabla\cdot(a\,p)\]
Ellipticity
For classical solutions, we require uniform ellipticity: there exists \(c > 0\) such that
\[\xi^\top a(x,t)\,\xi \geq c\,|\xi|^2 \quad \text{for all } \xi \in \mathbb{R}^d\]
This ensures the equation is genuinely parabolic and has smooth solutions.
Connection to Score Functions and Diffusion Models¶
Relevance to Generative Models
This section connects classical Fokker–Planck theory to modern machine learning. It can be skipped on first reading.
Given a marginal density \(p(x, t)\), the score function is:
\[s(x, t) = \nabla_x \log p(x, t)\]
This gradient of the log-density is fundamental to:
- Score matching in machine learning
- Reverse-time SDEs for diffusion generative models
- Denoising score matching in modern generative AI
Forward Process¶
In diffusion models, the forward SDE typically adds noise:
\[dX_t = f(X_t, t)\,dt + g(t)\,dW_t\]
The marginal density \(p(x, t)\) (starting from data distribution \(p_0\)) satisfies the Fokker–Planck equation.
Reverse Process¶
The remarkable result (Anderson, 1982) is that the reverse-time SDE:
\[dX_t = \big[f(X_t, t) - g^2(t)\,\nabla_x \log p(X_t, t)\big]\,dt + g(t)\,d\bar{W}_t\]
generates samples from \(p(x, t)\) running backward in time, where \(\bar{W}_t\) is a backward Brownian motion.
This is the theoretical foundation for:
- Denoising Diffusion Probabilistic Models (DDPM)
- Score-based Generative Models
- Stochastic Differential Equation approaches to generation
The key insight is that the score \(\nabla_x \log p\) can be learned from data via denoising score matching, enabling generation without knowing \(p\) explicitly.
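A toy sketch of this machinery (all parameter values assumed for illustration): for the constant-\(\beta\) VP-type forward SDE \(dX = -\tfrac{1}{2}\beta X\,dt + \sqrt{\beta}\,dW\) started from \(N(m_0, v_0)\), the marginals stay Gaussian, so the score is known in closed form; integrating the reverse-time SDE with this exact score should recover the original distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, m0, v0, T = 1.0, 2.0, 0.25, 4.0  # assumed toy parameters

def mean_var(t):
    # Marginal mean/variance of the forward VP SDE started from N(m0, v0)
    return m0 * np.exp(-0.5 * beta * t), v0 * np.exp(-beta * t) + 1.0 - np.exp(-beta * t)

def score(x, t):
    m, v = mean_var(t)
    return -(x - m) / v

n_samples, n_steps = 100000, 400
dt = T / n_steps
mT, vT = mean_var(T)
x = mT + np.sqrt(vT) * rng.standard_normal(n_samples)  # samples from p(., T)
t = T
for _ in range(n_steps):                               # Euler-Maruyama, backward in time
    drift = -0.5 * beta * x - beta * score(x, t)       # f - g^2 * score
    x = x - drift * dt + np.sqrt(beta * dt) * rng.standard_normal(n_samples)
    t -= dt

print(x.mean(), x.var())  # should approach m0 = 2.0 and v0 = 0.25
```

In a real diffusion model the exact score is unavailable and is replaced by a neural network trained with denoising score matching; everything else is the same.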
Physical Interpretation and Applications¶
The forward equation describes:
- Conservation: Probability mass is conserved (continuity equation form)
- Advection: Probability flows with the drift field \(\mu(x)\)
- Diffusion: Probability spreads due to the noise term
Applications:
| Field | Application |
|---|---|
| Statistical mechanics | Brownian particles in a potential |
| Plasma physics | Particle velocity distributions |
| Population dynamics | Density evolution in space |
| Chemical kinetics | Reaction-diffusion systems |
| Finance | Evolution of asset price distributions |
| Neuroscience | Neural population dynamics |
| Machine learning | Diffusion generative models |
Historical Note¶
- Adriaan Fokker (1914): Derived in the context of Brownian motion in a radiation field
- Max Planck (1917): Independent derivation for radiation equilibrium problems
- Andrey Kolmogorov (1931): Rigorous mathematical foundation establishing the connection to diffusion processes and proving existence/uniqueness
The equation is sometimes called the Smoluchowski equation in the physics literature, particularly for overdamped systems.
Summary¶
The Fokker–Planck equation describes how probability density evolves forward in time, dual to the backward equation which describes how expectations depend on initial conditions.
| Aspect | Description |
|---|---|
| Input | Initial point mass \(\delta(x - x_0)\) |
| Output | Spreading probability density \(p(x, t)\) |
| Operator | Adjoint generator \(\mathcal{L}^*\) |
| Physical meaning | Conservation of probability with drift and diffusion |
| Long-time behavior | Convergence to stationary distribution (if ergodic) |
See Also¶
- Kolmogorov Backward Equation — the dual equation for expectations
- Forward–Backward Duality — the adjoint relationship in detail
- Heat Equation Overview — fundamental solutions
- Heat Equation and Brownian Motion — probabilistic connection
- Invariant Measures — stationary distributions
- Time Reversal of Diffusions — connection to score functions
- Feynman–Kac Formula — discounted expectations
Exercises¶
Exercise 1. For Brownian motion with drift \(dX_t = \mu\,dt + \sigma\,dW_t\), verify that the Gaussian density
\[p(x, t) = \frac{1}{\sqrt{2\pi\sigma^2 t}}\exp\left(-\frac{(x - x_0 - \mu t)^2}{2\sigma^2 t}\right)\]
satisfies the Fokker-Planck equation \(\partial_t p = -\mu\partial_x p + \frac{\sigma^2}{2}\partial_{xx} p\) by computing both sides explicitly.
Solution to Exercise 1
The Fokker-Planck equation for constant coefficients is \(\partial_t p = -\mu\partial_x p + \frac{\sigma^2}{2}\partial_{xx}p\). Let
\[p(x, t) = \frac{1}{\sqrt{2\pi\sigma^2 t}}\exp\left(-\frac{(x - x_0 - \mu t)^2}{2\sigma^2 t}\right)\]
and define \(z = x - x_0 - \mu t\) for convenience. Compute:
\[\partial_t p = \left(-\frac{1}{2t} + \frac{\mu z}{\sigma^2 t} + \frac{z^2}{2\sigma^2 t^2}\right)p\]
(the three terms come from differentiating \((2\pi\sigma^2 t)^{-1/2}\) and the exponential, noting \(\partial_t z = -\mu\)).
Now compute the right-hand side. Since
\[\partial_x p = -\frac{z}{\sigma^2 t}\,p, \qquad \partial_{xx} p = \left(\frac{z^2}{\sigma^4 t^2} - \frac{1}{\sigma^2 t}\right)p,\]
we get
\[-\mu\,\partial_x p + \frac{\sigma^2}{2}\,\partial_{xx} p = \left(\frac{\mu z}{\sigma^2 t} + \frac{z^2}{2\sigma^2 t^2} - \frac{1}{2t}\right)p\]
Both sides are equal, confirming the Gaussian density satisfies the Fokker-Planck equation. \(\checkmark\)
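The computation can be double-checked symbolically; a sketch using sympy:

```python
import sympy as sp

# Symbols: t and sigma positive so sqrt simplifications apply
x, x0, mu = sp.symbols('x x_0 mu', real=True)
t, sigma = sp.symbols('t sigma', positive=True)

p = sp.exp(-(x - x0 - mu*t)**2 / (2*sigma**2*t)) / sp.sqrt(2*sp.pi*sigma**2*t)
lhs = sp.diff(p, t)
rhs = -mu*sp.diff(p, x) + sigma**2/2 * sp.diff(p, x, 2)
residual = sp.simplify(lhs - rhs)
print(residual)  # → 0
```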
Exercise 2. Write the Fokker-Planck equation in the continuity form \(\partial_t p + \partial_x J = 0\) and identify the probability current \(J\) for the Ornstein-Uhlenbeck process \(dX_t = -\kappa X_t\,dt + \sigma\,dW_t\). At stationarity (\(\partial_t p = 0\)), show that \(J = 0\) and use this to derive the stationary density.
Solution to Exercise 2
For the OU process \(dX_t = -\kappa X_t\,dt + \sigma\,dW_t\), we have \(\mu(x) = -\kappa x\) and \(\sigma(x) = \sigma\).
The probability current is:
\[J(x, t) = -\kappa x\,p - \frac{\sigma^2}{2}\frac{\partial p}{\partial x}\]
The continuity form of the Fokker-Planck equation is \(\partial_t p + \partial_x J = 0\).
At stationarity (\(\partial_t p = 0\)): \(\partial_x J = 0\), so \(J = \text{const}\). For a density on \(\mathbb{R}\) that decays at infinity, the current must vanish at \(\pm\infty\), so \(J = 0\) everywhere.
Setting \(J = 0\):
\[-\kappa x\,p_\infty = \frac{\sigma^2}{2}\frac{dp_\infty}{dx}\]
This is a separable ODE:
\[\frac{p_\infty'(x)}{p_\infty(x)} = -\frac{2\kappa x}{\sigma^2}\]
Integrating:
\[p_\infty(x) = A\,\exp\left(-\frac{\kappa x^2}{\sigma^2}\right)\]
Normalizing: this is a Gaussian with variance \(\sigma^2/(2\kappa)\), so \(A = \sqrt{\kappa/(\pi\sigma^2)}\) and:
\[p_\infty(x) = \sqrt{\frac{\kappa}{\pi\sigma^2}}\,\exp\left(-\frac{\kappa x^2}{\sigma^2}\right)\]
This is \(N(0, \sigma^2/(2\kappa))\).
Exercise 3. For constant coefficients, solve the Fokker-Planck equation using the Fourier transform. Starting from \(\partial_t \hat{p} = (-ik\mu - \frac{\sigma^2 k^2}{2})\hat{p}\) with initial condition \(\hat{p}(k, 0) = e^{-ikx_0}\), find \(\hat{p}(k, t)\) and verify that the inverse transform yields the Gaussian transition density.
Solution to Exercise 3
For constant coefficients, the Fokker-Planck equation in Fourier space is:
\[\frac{\partial \hat{p}}{\partial t} = \left(-ik\mu - \frac{\sigma^2 k^2}{2}\right)\hat{p}\]
This is a first-order linear ODE in \(t\) with solution:
\[\hat{p}(k, t) = \hat{p}(k, 0)\,\exp\left[\left(-ik\mu - \frac{\sigma^2 k^2}{2}\right)t\right]\]
With initial condition \(p(x, 0) = \delta(x - x_0)\), the Fourier transform gives \(\hat{p}(k, 0) = e^{-ikx_0}\). Therefore:
\[\hat{p}(k, t) = \exp\left[-ik(x_0 + \mu t) - \frac{\sigma^2 k^2 t}{2}\right]\]
This is the characteristic function of \(N(x_0 + \mu t, \sigma^2 t)\) (up to the sign convention \(\hat{p}(k) = \mathbb{E}[e^{-ikX_t}]\)). The inverse Fourier transform gives:
\[p(x, t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} \hat{p}(k, t)\,e^{ikx}\,dk\]
Completing the square in \(k\) and evaluating the Gaussian integral:
\[p(x, t) = \frac{1}{\sqrt{2\pi\sigma^2 t}}\exp\left(-\frac{(x - x_0 - \mu t)^2}{2\sigma^2 t}\right)\]
This is the Gaussian transition density. \(\checkmark\)
Exercise 4. For geometric Brownian motion \(dS_t = \mu S_t\,dt + \sigma S_t\,dW_t\), expand the Fokker-Planck equation
\[\frac{\partial p}{\partial t} = -\frac{\partial}{\partial S}\big[\mu S\,p\big] + \frac{1}{2}\frac{\partial^2}{\partial S^2}\big[\sigma^2 S^2 p\big]\]
using the product rule. Identify the effective drift and the "spurious drift" terms that arise from the state-dependent diffusion coefficient.
Solution to Exercise 4
For GBM, \(\mu(S) = \mu S\) and \(\sigma^2(S) = \sigma^2 S^2\). Expanding the Fokker-Planck equation:
\[\frac{\partial p}{\partial t} = -\frac{\partial}{\partial S}\big[\mu S\,p\big] + \frac{1}{2}\frac{\partial^2}{\partial S^2}\big[\sigma^2 S^2 p\big]\]
Drift term:
\[-\frac{\partial}{\partial S}\big[\mu S\,p\big] = -\mu\,p - \mu S\,\partial_S p\]
Diffusion term (using the product rule twice):
\[\frac{1}{2}\frac{\partial^2}{\partial S^2}\big[\sigma^2 S^2 p\big] = \sigma^2 p + 2\sigma^2 S\,\partial_S p + \frac{\sigma^2 S^2}{2}\,\partial_{SS} p\]
Combined:
\[\frac{\partial p}{\partial t} = (\sigma^2 - \mu)\,p + (2\sigma^2 - \mu)S\,\partial_S p + \frac{\sigma^2 S^2}{2}\,\partial_{SS} p\]
Identifying the terms:
- Standard diffusion: \(\frac{\sigma^2 S^2}{2}\partial_{SS}p\) — this is the usual second-order diffusion term.
- Effective drift on \(p\): \((2\sigma^2 - \mu)S\partial_S p\) — this combines the physical drift \(-\mu S\) with additional terms from differentiating \(\sigma^2 S^2\).
- Spurious drift terms: The \(\sigma^2 p\) and \(2\sigma^2 S\partial_S p\) contributions arise because \(\sigma(S) = \sigma S\) depends on \(S\). When the diffusion coefficient varies in space, differentiating \(\sigma^2(S)p\) produces extra terms beyond \(\frac{\sigma^2(S)}{2}\partial_{SS}p\). The term \(\sigma^2 p\) is analogous to a zeroth-order "source" or "sink" of probability, while \(2\sigma^2 S\partial_S p\) acts as an additional advective flux. These spurious drift effects cause probability mass to flow from high-volatility regions (large \(S\)) toward low-volatility regions.
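The algebra above can be spot-checked symbolically; an illustrative sketch using sympy, with \(p\) an arbitrary smooth function of \(S\):

```python
import sympy as sp

S, mu, sigma = sp.symbols('S mu sigma', positive=True)
p = sp.Function('p')(S)

# Adjoint form of the GBM Fokker-Planck operator
lhs = -sp.diff(mu*S*p, S) + sp.Rational(1, 2)*sp.diff(sigma**2*S**2*p, S, 2)
# Fully expanded form derived in the solution
rhs = ((sigma**2 - mu)*p
       + (2*sigma**2 - mu)*S*sp.diff(p, S)
       + sigma**2*S**2/2*sp.diff(p, S, 2))

residual = sp.expand(lhs - rhs)
print(residual)  # → 0
```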
Exercise 5. The general formula for the stationary density of a one-dimensional diffusion is
\[p_\infty(x) = \frac{C}{\sigma^2(x)}\exp\left(\int^x \frac{2\mu(y)}{\sigma^2(y)}\,dy\right)\]
Apply this formula to the CIR process \(dX_t = \kappa(\theta - X_t)\,dt + \xi\sqrt{X_t}\,dW_t\) and show that the stationary density is a Gamma distribution. What condition on the parameters ensures the density is normalizable?
Solution to Exercise 5
For the CIR process, \(\mu(x) = \kappa(\theta - x)\) and \(\sigma^2(x) = \xi^2 x\). The general stationary density formula gives:
\[p_\infty(x) = \frac{C}{\xi^2 x}\exp\left(\int^x \frac{2\kappa(\theta - y)}{\xi^2 y}\,dy\right)\]
Evaluate the integral:
\[\int^x \frac{2\kappa(\theta - y)}{\xi^2 y}\,dy = \frac{2\kappa\theta}{\xi^2}\ln x - \frac{2\kappa}{\xi^2}x + \text{const}\]
Therefore:
\[p_\infty(x) = C'\,x^{2\kappa\theta/\xi^2 - 1}\,e^{-2\kappa x/\xi^2}\]
This is the density of a Gamma distribution with shape parameter \(\alpha = 2\kappa\theta/\xi^2\) and rate parameter \(\beta = 2\kappa/\xi^2\):
\[p_\infty(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\,x^{\alpha - 1}e^{-\beta x}\]
Normalizability condition: The Gamma density is normalizable if and only if \(\alpha > 0\), i.e. \(2\kappa\theta/\xi^2 > 0\).
Since \(\kappa, \theta > 0\), this is always satisfied. The stronger Feller condition \(2\kappa\theta \geq \xi^2\) (i.e., \(\alpha \geq 1\)) ensures the density is bounded at the origin.
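As a spot-check with illustrative parameter values (\(\kappa = 2\), \(\theta = 1\), \(\xi = 1\), so \(\alpha = \beta = 4\)), the Gamma density should satisfy the stationary equation \(\mathcal{L}^* p_\infty = 0\); a sympy sketch:

```python
import sympy as sp

x = sp.symbols('x', positive=True)
kappa, theta, xi = sp.Integer(2), sp.Integer(1), sp.Integer(1)  # illustrative values
alpha = 2*kappa*theta/xi**2   # shape = 4
beta = 2*kappa/xi**2          # rate = 4

p = beta**alpha * x**(alpha - 1) * sp.exp(-beta*x) / sp.gamma(alpha)
# Stationary Fokker-Planck: -(mu p)' + (1/2)(sigma^2 p)'' should vanish
Lstar_p = -sp.diff(kappa*(theta - x)*p, x) + sp.Rational(1, 2)*sp.diff(xi**2*x*p, x, 2)
residual = sp.simplify(Lstar_p)
print(residual)  # → 0
```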
Exercise 6. On a bounded domain \([a, b]\) with homogeneous boundary conditions, the Fokker-Planck solution can be expanded in eigenfunctions: \(p(x, t) = \sum_n c_n \phi_n(x) e^{-\lambda_n t}\). For the heat equation on \([0, L]\) with absorbing boundaries, identify the eigenvalues \(\lambda_n\) and eigenfunctions \(\phi_n\). What is the long-time behavior of \(p(x, t)\)?
Solution to Exercise 6
On \([0, L]\) with absorbing (Dirichlet) boundaries \(p(0, t) = p(L, t) = 0\), the heat equation \(\partial_t p = \frac{1}{2}\partial_{xx}p\) is solved by separation of variables.
Write \(p(x, t) = X(x)T(t)\). Then \(T'/T = \frac{1}{2}X''/X = -\lambda\) (separation constant). This gives:
\[X'' + 2\lambda X = 0, \qquad X(0) = X(L) = 0\]
The eigenvalues and eigenfunctions are:
\[\lambda_n = \frac{n^2\pi^2}{2L^2}, \qquad \phi_n(x) = \sin\left(\frac{n\pi x}{L}\right), \qquad n = 1, 2, 3, \ldots\]
The time part gives \(T_n(t) = e^{-\lambda_n t}\). The general solution is:
\[p(x, t) = \sum_{n=1}^{\infty} c_n \sin\left(\frac{n\pi x}{L}\right) e^{-n^2\pi^2 t/(2L^2)}\]
where \(c_n\) are determined by the initial condition.
Long-time behavior: As \(t \to \infty\), the dominant term is \(n = 1\) (smallest eigenvalue):
\[p(x, t) \approx c_1 \sin\left(\frac{\pi x}{L}\right) e^{-\pi^2 t/(2L^2)}\]
The density decays exponentially to zero at rate \(\lambda_1 = \pi^2/(2L^2)\). This makes physical sense: with absorbing boundaries, all probability is eventually absorbed, so \(p \to 0\). There is no stationary distribution (the eigenvalue \(\lambda_0 = 0\) does not appear because \(\sin(0) = 0\) fails to satisfy the boundary conditions nontrivially).
Exercise 7. The score function is defined as \(s(x, t) = \nabla_x \log p(x, t)\). For a Gaussian density \(p(x, t) = \frac{1}{\sqrt{2\pi v(t)}}\exp(-(x - m(t))^2/(2v(t)))\), compute the score function explicitly. Explain why the reverse-time SDE uses the score to denoise: the drift \(-g^2(t)\nabla_x \log p\) points toward regions of higher probability density.
Solution to Exercise 7
For the Gaussian density \(p(x, t) = \frac{1}{\sqrt{2\pi v(t)}}\exp\left(-\frac{(x - m(t))^2}{2v(t)}\right)\):
\[\log p(x, t) = -\frac{1}{2}\log\big(2\pi v(t)\big) - \frac{(x - m(t))^2}{2 v(t)}\]
The score function is:
\[s(x, t) = \partial_x \log p(x, t) = -\frac{x - m(t)}{v(t)}\]
Interpretation: The score points toward the mean \(m(t)\):
- If \(x > m(t)\): \(s(x,t) < 0\), pointing left (toward the mean)
- If \(x < m(t)\): \(s(x,t) > 0\), pointing right (toward the mean)
- The magnitude \(|s| = |x - m(t)|/v(t)\) is larger when \(x\) is far from the mean or when the variance is small (sharp peak)
Why the reverse-time SDE uses the score for denoising: In the reverse-time SDE, the drift contains the term \(-g^2(t)\nabla_x\log p\). Since the score \(\nabla_x\log p\) always points in the direction of increasing probability density, the contribution \(-g^2(t)\nabla_x\log p\) acts as a drift that pushes samples toward regions of higher probability density. During the forward process, noise is added, spreading the distribution. The reverse process must undo this: the score-based drift counteracts diffusion by steering samples back toward the high-density regions of the data distribution, effectively denoising the signal.