Thinking in Distributions — An Introductory Course in Bayesian Statistics

Contents

01 · One formula: Bayes' rule
02 · From rule to inference: prior, likelihood, posterior
03 · Updating beliefs, one observation at a time
04 · When the math runs out: MCMC
05 · Applied modeling I: Bayesian regression
06 · Applied modeling II: an A/B test, end to end

Part 01 · Foundations

One formula: Bayes' rule

Everything in this course flows from one identity about conditional probability. Suppose you hold a hypothesis $H$ and you observe some data $D$. Bayes' rule tells you exactly how the data should change your belief in the hypothesis:

$$\underbrace{P(H \mid D)}_{\textcolor{#B23A5E}{\text{posterior}}} \;=\; \frac{\overbrace{P(D \mid H)}^{\textcolor{#C9881B}{\text{likelihood}}} \;\times\; \overbrace{P(H)}^{\textcolor{#3D7AB8}{\text{prior}}}}{\underbrace{P(D)}_{\text{evidence}}}$$

Each piece has a name, and the names are the vocabulary of the whole field. The prior $P(H)$ is what you believed before seeing the data. The likelihood $P(D \mid H)$ asks: if the hypothesis were true, how probable is the data I actually saw? Multiply them, normalize by the total probability of the data $P(D)$, and you get the posterior $P(H \mid D)$ — your updated belief. Those three colors mean the same three things in every chart on this page.

As a one-line recipe: posterior ∝ likelihood × prior. The denominator $P(D)$ is just whatever constant makes the result sum to one.

Why intuition fails without it

The classic demonstration is medical screening. A test can be highly accurate and still be wrong most of the time when it says "positive" — if the condition is rare. Intuition fixates on the likelihood (the test's accuracy) and ignores the prior (the base rate). Bayes' rule forces you to combine both. Try it:

Interactive · the base-rate trap

A patient from the general population tests positive. What is the probability they actually have the condition?

prevalence (prior) 1.0%

sensitivity — P(+ | sick) 95%

specificity — P(− | healthy) 95%

P(sick | positive) = 16.1%

With the default settings — a 95%-accurate test and a 1% base rate — a positive result means only about a one-in-six chance of disease. Out of 10,000 people, roughly 95 sick people test positive, but so do about 495 healthy ones. Most positives are false positives, because there are so many more healthy people for the test to be wrong about. Drag the prevalence up and watch the posterior climb: the same evidence means different things against different priors. That sentence is the entire Bayesian worldview in miniature.
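The arithmetic behind that 16.1% can be sketched in a few lines of Python. The function name `posterior_given_positive` is an illustrative choice rather than anything the page's widget exposes; the inputs are the three slider defaults above.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(sick | positive) via Bayes' rule for a binary screening test."""
    # Likelihood x prior for each hypothesis (sick, healthy).
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    # Evidence P(positive): total probability of a positive result.
    evidence = true_pos + false_pos
    return true_pos / evidence


p = posterior_given_positive(prevalence=0.01, sensitivity=0.95, specificity=0.95)
print(f"P(sick | positive) = {p:.1%}")  # 16.1%
```

Raising `prevalence` to 0.10 with the same test pushes the result well above one half, which is the "same evidence, different prior" effect described above.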

The frequentist contrast: in classical (frequentist) statistics, parameters are fixed unknowns and only the data are random, so you can't speak of "the probability the patient is sick." Bayesians instead treat uncertainty itself as probability, which lets you make direct statements like "given this test result, there's a 16% chance of disease." Neither view is wrong; they answer different questions. This course teaches the Bayesian one.

Part 02 · Inference

From rule to inference: prior, likelihood, posterior

Part 1 used Bayes' rule on a yes/no hypothesis. Real inference is usually about a continuous parameter — a conversion rate, an effect size, a regression slope. The rule is unchanged; the prior and posterior just become probability density functions over the parameter:

$$p(\theta \mid D) \;\propto\; p(D \mid \theta)\, p(\theta)$$

The running example for the rest of the course: estimating a probability $\theta$ — think of a coin's bias, or the click-through rate of a button — from a series of successes and failures.

The Beta distribution: a belief about a probability

We need a distribution over values between 0 and 1. The natural choice is the Beta distribution, $\text{Beta}(\alpha, \beta)$. Its two shape parameters have a lovely interpretation: $\alpha - 1$ imaginary successes and $\beta - 1$ imaginary failures already "seen" before any data arrives. $\text{Beta}(1,1)$ is perfectly flat — total ignorance. $\text{Beta}(20, 20)$ is a confident belief that $\theta$ is near 0.5.

Now the magic. If the prior is $\text{Beta}(\alpha, \beta)$ and you observe $k$ successes in $n$ trials (a Binomial likelihood), the posterior works out — exactly, by algebra — to:

$$\theta \mid D \;\sim\; \text{Beta}(\alpha + k,\;\; \beta + n - k)$$

Updating a belief is literally addition: add your successes to $\alpha$, your failures to $\beta$. Explore how the three curves interact:

Interactive · the Beta–Binomial machine

Curves: prior Beta(α, β) · likelihood (scaled) · posterior Beta(α+k, β+n−k)

Controls (defaults): prior α = 2 · prior β = 2 · trials n = 20 · successes k = 14

Readouts: posterior mean · 94% credible interval · MLE k/n

Three things to notice while dragging:

1. The posterior is a compromise. Its mean, $\frac{\alpha+k}{\alpha+\beta+n}$, sits between the prior mean and the data's raw frequency $k/n$, weighted by how much information each side carries. With $n=0$ the posterior is the prior; crank $n$ to 200 and the prior barely matters.

2. More data, narrower belief. The posterior tightens as $n$ grows — uncertainty shrinking is visible, not just a number.

3. The credible interval says what you want it to say. A 94% credible interval means "given the model and data, $\theta$ lies in this range with 94% probability" — the direct statement a frequentist confidence interval famously cannot make.
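As a rough sketch of the update rule and the "compromise" in point 1, the snippet below works through the widget's default values. The helper name `beta_binomial_update` and the variable names are illustrative assumptions.

```python
def beta_binomial_update(alpha, beta, k, n):
    """Posterior Beta parameters after k successes in n trials."""
    return alpha + k, beta + (n - k)


alpha0, beta0, k, n = 2, 2, 14, 20          # widget defaults
alpha_post, beta_post = beta_binomial_update(alpha0, beta0, k, n)
post_mean = alpha_post / (alpha_post + beta_post)

# The posterior mean is a weighted average of the prior mean and k/n.
prior_mean = alpha0 / (alpha0 + beta0)
mle = k / n
w_prior = (alpha0 + beta0) / (alpha0 + beta0 + n)   # share of the weight held by the prior
blend = w_prior * prior_mean + (1 - w_prior) * mle

print(f"posterior  = Beta({alpha_post}, {beta_post})")   # Beta(16, 8)
print(f"prior mean = {prior_mean:.3f}, MLE = {mle:.3f}, posterior mean = {post_mean:.3f}")
print(f"weight on prior = {w_prior:.3f}, blend = {blend:.3f}")
```

The blend and the posterior mean agree, and the weight on the prior shrinks toward zero as `n` grows, which is why the prior "barely matters" at large sample sizes.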

Conjugacy: a prior whose posterior stays in the same family is called conjugate to the likelihood. Beta–Binomial, Gamma–Poisson, and Normal–Normal are the classic pairs. Conjugacy makes the algebra closed-form, which is why courses start here, but it covers only a small island of models. Part 4 is about leaving the island.

Where do priors come from?

The prior is the part newcomers distrust most — it looks like an invitation to smuggle in opinion. In practice priors are chosen deliberately and reported openly: flat or weakly-informative priors let the data dominate while ruling out absurd values (a click-through rate of 99.9% deserves skepticism before any data); informative priors encode genuine prior knowledge, like last quarter's measurements; and with even moderate amounts of data, reasonable priors converge to nearly identical posteriors — which you just verified with the $n$ slider. The honest framing: every analysis has assumptions; Bayesian analysis makes one of them an explicit, criticizable object.

Part 03 · Learning

Updating beliefs, one observation at a time

Bayes' rule composes beautifully: today's posterior is tomorrow's prior. Observing data points one at a time, or all at once in a batch, gives exactly the same final posterior. Inference is just accumulation.
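For the Beta–Binomial case, the claim that one-at-a-time and batch updating coincide can be checked directly. This is a minimal sketch; the `flips` list is made-up illustrative data and `update` is a hypothetical helper name.

```python
def update(alpha, beta, successes, failures):
    """Conjugate Beta update: add successes to alpha, failures to beta."""
    return alpha + successes, beta + failures


flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]  # illustrative data: 1 = success, 0 = failure

# Sequential: each posterior becomes the prior for the next flip.
a, b = 1, 1
for flip in flips:
    a, b = update(a, b, successes=flip, failures=1 - flip)

# Batch: a single update with the totals.
heads = sum(flips)
a_batch, b_batch = update(1, 1, successes=heads, failures=len(flips) - heads)

print((a, b), (a_batch, b_batch))  # (8, 4) (8, 4)
```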

The simulator below flips a coin whose true bias is hidden from you (and from the model). Three different analysts start with three different priors — a flat skeptic, a believer in fairness, and someone with a wrong, opinionated hunch. Watch what the data does to each of them.

Interactive · three analysts, one coin

Analysts: flat, Beta(1,1) · fair-coin believer, Beta(25,25) · wrong hunch, Beta(2,12)

Counters (initial): flips = 0 · heads = 0

After a hundred flips the three analysts' posteriors largely overlap; after several hundred they are nearly indistinguishable. This is the standard Bayesian answer to "but the prior is subjective": data washes priors out, and the rate at which it does so is itself visible in the posterior. When the analysts still disagree after lots of data, that's informative too: it means the data genuinely can't settle the question.
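A small simulation makes the wash-out concrete. This is a sketch under stated assumptions: the true bias of 0.7 and the random seed are arbitrary choices for illustration (the page keeps its coin's bias hidden), and the three priors are the ones listed for the analysts above.

```python
import random

priors = {
    "flat": (1, 1),
    "fair-coin believer": (25, 25),
    "wrong hunch": (2, 12),
}
true_bias = 0.7            # assumed for illustration only
rng = random.Random(1)


def posterior_means(heads, flips):
    """Posterior mean of each analyst after `heads` heads in `flips` flips."""
    return {name: (a + heads) / (a + b + flips) for name, (a, b) in priors.items()}


heads = 0
spread_by_n = {}
for n in range(1, 2001):
    heads += int(rng.random() < true_bias)
    if n in (10, 100, 500, 2000):
        means = posterior_means(heads, n)
        spread_by_n[n] = max(means.values()) - min(means.values())
        rounded = {name: round(m, 3) for name, m in means.items()}
        print(n, rounded, f"spread = {spread_by_n[n]:.3f}")
```

The `spread` column is the gap between the most and least optimistic analyst; it shrinks as the flips accumulate, whatever the seed.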

Point estimates, if you must

The posterior is the full answer, but sometimes a single number is demanded. The three standard summaries each minimize a different penalty for being wrong: the posterior mean minimizes squared error, the posterior median minimizes absolute error, and the posterior mode (the "MAP" estimate), the single most probable value, minimizes an all-or-nothing (0-1) penalty that rewards only near-exact hits. Reporting any of them without an interval throws away the part of the answer that was hardest to earn: the uncertainty.
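The link between summaries and penalties can be checked numerically. The sketch below assumes an arbitrary skewed posterior, Beta(3, 9), chosen only so that mean, median, and mode differ visibly; it searches a grid of candidate estimates for the one with the lowest average loss over the draws.

```python
import random
import statistics

rng = random.Random(0)
# Draws from an illustrative skewed posterior, Beta(3, 9).
draws = [rng.betavariate(3, 9) for _ in range(5_000)]

candidates = [i / 200 for i in range(1, 200)]  # grid of possible point estimates


def expected_loss(estimate, loss):
    """Average penalty of reporting `estimate`, taken over the posterior draws."""
    return sum(loss(estimate - d) for d in draws) / len(draws)


best_sq = min(candidates, key=lambda c: expected_loss(c, lambda e: e * e))
best_abs = min(candidates, key=lambda c: expected_loss(c, abs))
post_mean = statistics.fmean(draws)
post_median = statistics.median(draws)

print(f"squared-error minimizer  {best_sq:.3f}  vs mean   {post_mean:.3f}")
print(f"absolute-error minimizer {best_abs:.3f}  vs median {post_median:.3f}")
print(f"mode of Beta(3, 9), closed form: {(3 - 1) / (3 + 9 - 2):.3f}")
```

Up to the grid resolution, the two minimizers land on the mean and the median respectively, while the mode sits lower still because the distribution is right-skewed.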

Part 04 · Computation

When the math runs out: MCMC

Conjugate pairs gave us posteriors by algebra. But a realistic model — a regression with several predictors, a hierarchy of groups, a custom likelihood — has a posterior with no closed form. The product $p(D\mid\theta)\,p(\theta)$ is easy to evaluate at any point, but the normalizing integral $p(D) = \int p(D\mid\theta)p(\theta)\,d\theta$ is intractable, and in many dimensions you can't even grid it out.
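In one dimension, "gridding it out" is perfectly workable, and seeing it helps explain why it fails in many dimensions. The following sketch reuses the Beta–Binomial defaults from Part 2 (prior Beta(2, 2), 14 successes in 20 trials) so that the grid answer can be compared with the exact one; the grid size `m` is an arbitrary choice.

```python
alpha, beta, k, n = 2, 2, 14, 20       # defaults from the Beta-Binomial widget in Part 2


def unnormalized_posterior(theta):
    """Likelihood x prior at one point, dropping constants that do not depend on theta."""
    likelihood = theta**k * (1 - theta) ** (n - k)
    prior = theta ** (alpha - 1) * (1 - theta) ** (beta - 1)
    return likelihood * prior


# Evaluate on a grid of midpoints and normalize numerically.
m = 1_000
grid = [(i + 0.5) / m for i in range(m)]
weights = [unnormalized_posterior(t) for t in grid]
total = sum(weights)                   # plays the role of the evidence p(D), up to a constant
grid_mean = sum(t * w for t, w in zip(grid, weights)) / total

exact_mean = (alpha + k) / (alpha + beta + n)
print(f"grid mean = {grid_mean:.4f}, exact mean = {exact_mean:.4f}")
print(f"a 10-parameter grid at this resolution needs {float(m) ** 10:.0e} evaluations")
```

The last line is the problem in a nutshell: the number of grid points grows exponentially with the number of parameters.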

The modern solution is to give up on computing the posterior as a formula and instead draw samples from it. If you can collect thousands of draws $\theta^{(1)}, \theta^{(2)}, \ldots$ distributed according to the posterior, then every question becomes counting: the posterior mean is the average of the draws, a credible interval is a pair of percentiles, $P(\theta > 0.5)$ is the fraction of draws above 0.5.
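The "every question becomes counting" idea can be sketched with draws from a posterior we happen to know, Beta(16, 8) (the Part 2 defaults). Here the draws come from Python's standard library rather than from a sampler, which is an assumption made purely to keep the example self-contained.

```python
import random

rng = random.Random(42)
# Stand-in posterior: Beta(16, 8), i.e. a Beta(2, 2) prior plus 14 successes in 20 trials.
# With a non-conjugate model, an MCMC sampler would supply these draws instead.
draws = sorted(rng.betavariate(16, 8) for _ in range(50_000))

post_mean = sum(draws) / len(draws)                       # average of the draws
lo = draws[int(0.03 * len(draws))]                        # 3rd percentile
hi = draws[int(0.97 * len(draws))]                        # 97th percentile
p_above_half = sum(d > 0.5 for d in draws) / len(draws)   # fraction of draws above 0.5

print(f"posterior mean   ~ {post_mean:.3f}")
print(f"94% interval     ~ ({lo:.3f}, {hi:.3f})")
print(f"P(theta > 0.5)   ~ {p_above_half:.3f}")
```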

The Metropolis algorithm

The oldest sampler, from 1953, is almost embarrassingly simple. It performs a random walk that is gently biased uphill — but not only uphill, which is the crucial part. From the current position $\theta$:

① Propose a nearby point $\theta'$ by adding symmetric random noise (for example, Gaussian).

② Compute the ratio $r = \frac{p(\theta'\mid D)}{p(\theta\mid D)}$. The unknown normalizer $p(D)$ cancels, so only likelihood × prior is needed.

③ If $r \ge 1$ (the proposal is more probable), move there. If $r < 1$, move there anyway with probability $r$; otherwise stay put and record the current point again.

Repeat.

The chain spends time in each region in proportion to its posterior probability — so the trail of visited points is a sample from the posterior. Watch it work on a deliberately nasty two-humped target:

Interactive · a Metropolis sampler, live

Controls (defaults): proposal step size = 0.6 · speed = 8 / frame

Readouts: samples · acceptance rate
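The three Metropolis steps translate almost line for line into code. What follows is a minimal sketch, not the page's own sampler: the two-humped target (two Gaussian bumps at ±2) and the function names are assumptions made for illustration.

```python
import math
import random


def unnormalized_target(theta):
    """Two-humped density, known only up to a constant (an illustrative stand-in)."""
    left = math.exp(-0.5 * ((theta + 2) / 0.7) ** 2)
    right = math.exp(-0.5 * ((theta - 2) / 0.7) ** 2)
    return left + right


def metropolis(n_samples, step_size, start=0.0, seed=0):
    """Random-walk Metropolis; returns the chain and its acceptance rate."""
    rng = random.Random(seed)
    theta, samples, accepted = start, [], 0
    for _ in range(n_samples):
        proposal = theta + rng.gauss(0.0, step_size)        # step 1: symmetric noise
        r = unnormalized_target(proposal) / unnormalized_target(theta)  # step 2: ratio
        if rng.random() < r:                                # step 3: accept with prob. min(1, r)
            theta, accepted = proposal, accepted + 1
        samples.append(theta)                               # a rejection records theta again
    return samples, accepted / n_samples


samples, acceptance = metropolis(n_samples=20_000, step_size=1.5)
right_share = sum(s > 0 for s in samples) / len(samples)
print(f"step 1.5: acceptance = {acceptance:.2f}, share in right hump = {right_share:.2f}")

# Step sizes that are too small or too large.
for step in (0.05, 25.0):
    _, acc = metropolis(n_samples=20_000, step_size=step)
    print(f"step {step}: acceptance = {acc:.2f}")
```

A tiny step accepts nearly everything but barely moves; a huge step rarely gets accepted. Neither acceptance rate on its own says the chain has explored both humps, which is why trace plots are inspected as well.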

The step-size slider is the whole craft of MCMC in one control. Too small and the chain inches along, taking ages to find the second mode; the trace plot looks like a slow wander. Too large and almost every proposal lands in a low-probability wasteland and gets rejected; the trace looks like a flat staircase. Practitioners watch exactly these diagnostics: trace plots, acceptance rates, and statistics like $\hat{R}$, which compares multiple independent chains (if the chains disagree, you cannot trust that any of them has converged).
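To make $\hat{R}$ less mysterious, here is a sketch of the classic Gelman-Rubin form: it compares the variance between chain means with the average variance within chains. Libraries typically report a more refined variant, so treat this as an illustration of the idea rather than a reference implementation; the fake "chains" below are plain Gaussian draws invented for the demo.

```python
import random
import statistics


def r_hat(chains):
    """Basic (non-split) Gelman-Rubin statistic for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])   # W
    between = n * statistics.variance(chain_means)                        # B
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


rng = random.Random(0)
# Four chains that all explore the same distribution.
mixed = [[rng.gauss(0, 1) for _ in range(1_000)] for _ in range(4)]
# Four chains stuck in different places (the offsets are illustrative).
stuck = [[rng.gauss(offset, 1) for _ in range(1_000)] for offset in (-2, -2, 2, 2)]

r_mixed, r_stuck = r_hat(mixed), r_hat(stuck)
print(f"R-hat, agreeing chains:    {r_mixed:.3f}")
print(f"R-hat, disagreeing chains: {r_stuck:.3f}")
```

Values close to 1 mean the chains are statistically indistinguishable; the stuck chains produce a value far above 1 even though each one looks fine in isolation.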

What modern samplers add: real tools rarely rely on plain Metropolis. Hamiltonian Monte Carlo (HMC) and its adaptive variant NUTS, the default in PyMC and Stan, use the gradient of the log-posterior to make long, informed moves instead of blind ones, scaling to hundreds of parameters. The mental model stays the same: a chain whose visit frequencies reproduce the posterior.

Part 05 · Applied I

Applied modeling I: Bayesian regression

Time to model something. Ordinary least squares gives you one line — the single best fit. Bayesian regression gives you a distribution over lines: every slope-and-intercept pair gets a posterior probability, and predictions inherit that uncertainty honestly. The model, written the way Bayesians write models:

$$\begin{aligned} y_i &\sim \mathcal{N}(\mu_i,\; \sigma) \\ \mu_i &= \alpha + \beta x_i \\ \alpha &\sim \mathcal{N}(0, 2), \quad \beta \sim \mathcal{N}(0, 2) \\ \sigma &\sim \text{HalfNormal}(1) \end{aligned}$$

Read it top-down as a story about how the data came to be: each observation is normal noise around a line, and before seeing data we hold mild beliefs about the line's parameters. Below, each pale crimson line is one draw from the posterior — one plausible "true line." The spray of lines is the uncertainty.

Interactive · a distribution over lines

Legend: observed data · posterior draws of the line · posterior mean line

Controls (defaults): observations n = 8 · noise σ = 0.8

Readouts: slope β posterior mean · 94% CI

Drag $n$ from 2 up to 120 and watch the fan of lines collapse toward a single one. With two points the model is candid: many lines explain what little it has seen. Notice also that the fan is narrowest in the middle of the data and widens at the edges — extrapolation is automatically less certain than interpolation. Nobody programmed that in; it falls out of the posterior.
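The fan of lines can be reproduced without MCMC if one simplification is accepted: treat the noise $\sigma$ as known, as the widget's slider does. The posterior over $(\alpha, \beta)$ is then a two-dimensional Normal with a closed form. The sketch below rests on that assumption; the synthetic data, the "true" line, and the helper `line_sd` are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: n = 8 points, noise sigma = 0.8 (the widget defaults).
n, sigma, tau = 8, 0.8, 2.0            # tau = prior sd of alpha and beta
x = rng.uniform(-2, 2, size=n)
y = 0.5 + 1.2 * x + rng.normal(0, sigma, size=n)   # assumed "true" line, demo only

# Conjugate Normal posterior over (alpha, beta) when sigma is treated as known.
X = np.column_stack([np.ones(n), x])
precision = X.T @ X / sigma**2 + np.eye(2) / tau**2
cov = np.linalg.inv(precision)
mean = cov @ X.T @ y / sigma**2

# Each draw is one plausible line (columns: alpha, beta).
lines = rng.multivariate_normal(mean, cov, size=2_000)


def line_sd(x0):
    """Posterior sd of the line's height alpha + beta * x0."""
    v = np.array([1.0, x0])
    return float(np.sqrt(v @ cov @ v))


slope_lo, slope_hi = np.percentile(lines[:, 1], [3, 97])
print(f"slope: mean {mean[1]:.2f}, 94% interval ({slope_lo:.2f}, {slope_hi:.2f})")
print(f"sd of line at centre of data: {line_sd(x.mean()):.3f}")
print(f"sd of line far outside data:  {line_sd(x.mean() + 5):.3f}")
```

The last two lines show the widening fan numerically: the line's height is pinned down best near the middle of the data and much less well far from it.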

The same model in PyMC

In practice you write the model nearly verbatim and let a NUTS sampler do the work. In a Marimo or Jupyter notebook:

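import numpy as np
import pymc as pm
import arviz as az

# Illustrative synthetic data (an assumption, not from the course):
# swap in your own 1-D arrays x and y.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=8)
y = 0.5 + 1.2 * x + rng.normal(0, 0.8, size=8)

with pm.Model() as linreg:
    # priors, one line each (sigma gets a HalfNormal prior so it stays positive)
    alpha = pm.Normal("alpha", mu=0, sigma=2)
    beta  = pm.Normal("beta",  mu=0, sigma=2)
    sigma = pm.HalfNormal("sigma", sigma=1)

    # likelihood: normal noise around the line
    mu = alpha + beta * x
    pm.Normal("y", mu=mu, sigma=sigma, observed=y)

    # draw 2000 posterior samples per chain with NUTS
    idata = pm.sample(2000, chains=4, random_seed=42)

az.summary(idata)            # means, 94% HDIs, effective sample sizes, r_hat
az.plot_trace(idata)         # the same trace plots as Part 4
az.plot_posterior(idata)     # posterior densities with 94% HDI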

Three habits make this a workflow rather than a ritual: prior predictive checks (simulate fake data from the priors alone: does it look remotely plausible?), convergence diagnostics ($\hat{R} \approx 1.00$, healthy trace plots, no divergences), and posterior predictive checks (simulate new data from the fitted model and compare it to the real data; a model that can't reproduce what you saw can't be trusted for what you didn't).
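A prior predictive check needs nothing more than random number generation. The sketch below simulates fake datasets from the regression priors by hand with NumPy; the predictor values in `x` are an illustrative assumption. PyMC offers `pm.sample_prior_predictive` for the same purpose inside a model context.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-2, 2, 8)                 # illustrative predictor values

n_sims = 1_000
alpha = rng.normal(0, 2, size=n_sims)
beta = rng.normal(0, 2, size=n_sims)
sigma = np.abs(rng.normal(0, 1, size=n_sims))   # HalfNormal(1) is |Normal(0, 1)|

# One fake dataset per prior draw: shape (n_sims, len(x)).
mu = alpha[:, None] + beta[:, None] * x
y_fake = rng.normal(mu, sigma[:, None])

lo, hi = np.percentile(y_fake, [3, 97])
print(f"94% of prior-simulated y values fall in ({lo:.1f}, {hi:.1f})")
```

If that range is wildly wider or narrower than anything the real outcome could plausibly take, the priors deserve another look before any data is fitted.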

Part 06 · Applied II

Applied modeling II: an A/B test, end to end

The capstone is the most common applied Bayesian analysis in industry. Two versions of a page; each visitor converts or doesn't. Model each conversion rate with its own Beta–Binomial machine from Part 2, then ask the posterior the question stakeholders actually care about: what is the probability that B beats A — and by how much?

Interactive · is B better than A?

Curves: posterior of θ_A · posterior of θ_B

Controls (defaults): A: 1000 visitors, 50 conversions · B: 1000 visitors, 63 conversions

Readouts: P(B > A) · expected lift (B−A) · 94% CI on lift

The lower chart is the posterior of the difference $\theta_B - \theta_A$, computed by Monte Carlo: draw thousands of $(\theta_A, \theta_B)$ pairs from the two posteriors and subtract. This is the Part 4 idea in action — once you have samples, any derived quantity is just arithmetic on the draws. Note what the answer looks like compared to a p-value: instead of "significant / not significant," you get a full distribution of plausible lifts, a direct probability that B wins, and an expected effect size — the raw ingredients of an actual business decision ("ship if P(B>A) > 95% and expected lift covers the switching cost").

The same analysis in Python, no MCMC needed thanks to conjugacy:

import numpy as np
rng = np.random.default_rng(42)

# Beta(1,1) priors + observed data  →  Beta posteriors
post_a = rng.beta(1 + 50, 1 + 1000 - 50, size=100_000)
post_b = rng.beta(1 + 63, 1 + 1000 - 63, size=100_000)

lift = post_b - post_a
print(f"P(B > A)      = {(lift > 0).mean():.1%}")
print(f"expected lift = {lift.mean():.4f}")
print(f"94% CI        = {np.percentile(lift, [3, 97])}")

Where to go next: hierarchical models

The single most important technique this course hasn't covered is the hierarchical (multilevel) model, arguably the killer app of Bayesian statistics. When data comes in groups (conversion rates per country, test scores per school, effects per hospital), you give each group its own parameter and a shared distribution those parameters are drawn from. Small groups get gently pulled toward the overall pattern ("partial pooling"), borrowing strength from the rest of the data instead of overfitting their own noise. In PyMC it takes only a few extra lines on top of the regression above. It is the natural Part 7, and the books below all treat it as the main event.

Resource	What it's best at
Statistical Rethinking — Richard McElreath	The course's spiritual sequel: intuition-first, code-driven, superb on hierarchical models and causal thinking.
Bayesian Methods for Hackers — Cameron Davidson-Pilon	Free online, PyMC throughout, learn-by-computing.
Bayes Rules! — Johnson, Ott & Dogucu	A gentle, modern undergraduate treatment, free online.
Bayesian Data Analysis — Gelman et al.	The comprehensive graduate reference, free PDF from the authors.
PyMC & ArviZ documentation	Worked example notebooks for nearly every model family.

The whole course in three sentences: beliefs are distributions. Data updates them by multiplication (posterior ∝ likelihood × prior), and when the algebra gets hard, sampling replaces it. Everything else (regression, A/B tests, hierarchies) is choosing a likelihood that tells an honest story about how your data was generated.