One formula: Bayes' rule
Everything in this course flows from one identity about conditional probability. Suppose you hold a hypothesis \(H\) and you observe some data \(D\). Bayes' rule tells you exactly how the data should change your belief in the hypothesis:

\[ P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)} \]
Each piece has a name, and the names are the vocabulary of the whole field. The prior \(P(H)\) is what you believed before seeing the data. The likelihood \(P(D \mid H)\) asks: if the hypothesis were true, how probable is the data I actually saw? Multiply them, normalize by the total probability of the data \(P(D)\), and you get the posterior \(P(H \mid D)\) — your updated belief. Those three colors mean the same three things in every chart on this page.
As a one-line recipe: posterior ∝ likelihood × prior. The denominator \(P(D)\) is just whatever constant makes the result sum to one.
Why intuition fails without it
The classic demonstration is medical screening. A test can be highly accurate and still be wrong most of the time when it says "positive" — if the condition is rare. Intuition fixates on the likelihood (the test's accuracy) and ignores the prior (the base rate). Bayes' rule forces you to combine both. Try it:
A patient from the general population tests positive. What is the probability they actually have the condition?
With the default settings (a 95%-accurate test, meaning it flags 95% of sick people and wrongly flags 5% of healthy ones, and a 1% base rate), a positive result means only about a one-in-six chance of disease. Out of 10,000 people, 100 are sick and roughly 95 of them test positive, but so do about 495 of the 9,900 healthy ones, so the posterior is 95 / 590 ≈ 16%. Most positives are false positives, because there are so many more healthy people for the test to be wrong about. Drag the prevalence up and watch the posterior climb: the same evidence means different things against different priors. That sentence is the entire Bayesian worldview in miniature.
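The arithmetic behind that one-in-six figure can be checked in a few lines. This is a minimal sketch that assumes "95%-accurate" means a 95% true-positive rate and a 5% false-positive rate, which is what the 95 and 495 counts imply; the function name `posterior_given_positive` is illustrative, not part of the course.

```python
def posterior_given_positive(prevalence, sensitivity, false_positive_rate):
    """P(condition | positive test) by Bayes' rule for a yes/no hypothesis."""
    true_pos = sensitivity * prevalence                  # likelihood x prior for "sick"
    false_pos = false_positive_rate * (1 - prevalence)   # likelihood x prior for "healthy"
    return true_pos / (true_pos + false_pos)             # divide by P(positive)


# Default settings from the text: 1% base rate, 95%-accurate test (assumed symmetric).
p = posterior_given_positive(prevalence=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(f"P(condition | positive) = {p:.3f}")

# Same test, higher base rates: the posterior climbs with the prior.
by_prevalence = {
    prev: posterior_given_positive(prev, 0.95, 0.05) for prev in (0.01, 0.05, 0.20)
}
for prev, post in by_prevalence.items():
    print(f"prevalence {prev:.0%}: posterior {post:.1%}")
```

The test's accuracy never changes between rows; only the prior does.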
From rule to inference: prior, likelihood, posterior
Part 1 used Bayes' rule on a yes/no hypothesis. Real inference is usually about a continuous parameter: a conversion rate, an effect size, a regression slope. The rule is unchanged; the prior and posterior just become probability density functions over the parameter:

\[ p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \]
The running example for the rest of the course: estimating a probability \(\theta\) — think of a coin's bias, or the click-through rate of a button — from a series of successes and failures.
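One way to see the continuous rule at work before any special distributions appear is a grid approximation: evaluate prior times likelihood at many values of \(\theta\) and normalize. The sketch below is illustrative only; the data (7 successes in 10 trials), the flat prior, and the function names are assumptions chosen for the example.

```python
def grid_posterior(k, n, prior, grid_size=1001):
    """Approximate the posterior over theta on an evenly spaced grid in [0, 1].

    prior: function theta -> unnormalized prior density.
    Returns (grid, weights), where the weights sum to 1.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    # Binomial likelihood up to a constant; the binomial coefficient cancels on normalizing.
    unnorm = [prior(t) * t**k * (1 - t) ** (n - k) for t in grid]
    total = sum(unnorm)  # plays the role of p(D)
    return grid, [u / total for u in unnorm]


def flat_prior(theta):
    return 1.0


grid, weights = grid_posterior(k=7, n=10, prior=flat_prior)
post_mean = sum(t * w for t, w in zip(grid, weights))
print(f"posterior mean on the grid = {post_mean:.4f}")
```

A grid works for one parameter; the number of grid points grows exponentially with the number of parameters, which is why other methods are needed for larger models.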
The Beta distribution: a belief about a probability
We need a distribution over values between 0 and 1. The natural choice is the Beta distribution, \(\text{Beta}(\alpha, \beta)\). Its two shape parameters have a lovely interpretation: \(\alpha - 1\) imaginary successes and \(\beta - 1\) imaginary failures already "seen" before any data arrives. \(\text{Beta}(1,1)\) is perfectly flat — total ignorance. \(\text{Beta}(20, 20)\) is a confident belief that \(\theta\) is near 0.5.
Now the magic. If the prior is \(\text{Beta}(\alpha, \beta)\) and you observe \(k\) successes in \(n\) trials (a Binomial likelihood), the posterior works out, exactly and by algebra, to:

\[ \theta \mid k, n \;\sim\; \text{Beta}(\alpha + k,\ \beta + n - k) \]
Updating a belief is literally addition: add your successes to \(\alpha\), your failures to \(\beta\). Explore how the three curves interact:
Three things to notice while dragging:
1. The posterior is a compromise. Its mean, \(\frac{\alpha+k}{\alpha+\beta+n}\), sits between the prior mean and the data's raw frequency \(k/n\), weighted by how much information each side carries. With \(n=0\) the posterior is the prior; crank \(n\) to 200 and the prior barely matters.
2. More data, narrower belief. The posterior tightens as \(n\) grows — uncertainty shrinking is visible, not just a number.
3. The credible interval says what you want it to say. A 94% credible interval means "given the model and data, \(\theta\) lies in this range with 94% probability" — the direct statement a frequentist confidence interval famously cannot make.
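The "compromise" in the first point can be made exact: the posterior mean is a weighted average of the prior mean and the raw frequency, with the prior's weight equal to \(\frac{\alpha+\beta}{\alpha+\beta+n}\). A small sketch using exact fractions follows; the prior \(\text{Beta}(4, 4)\) and the data (14 successes in 20 trials) are made-up values for illustration.

```python
from fractions import Fraction


def update_beta(alpha, beta, k, n):
    """Conjugate update: add successes to alpha and failures to beta."""
    return alpha + k, beta + (n - k)


def beta_mean(alpha, beta):
    return Fraction(alpha, alpha + beta)


alpha, beta = 4, 4   # hypothetical prior
k, n = 14, 20        # hypothetical data

post_alpha, post_beta = update_beta(alpha, beta, k, n)
prior_mean = beta_mean(alpha, beta)
data_freq = Fraction(k, n)
post_mean = beta_mean(post_alpha, post_beta)

# Weight on the prior shrinks as n grows.
w = Fraction(alpha + beta, alpha + beta + n)
blend = w * prior_mean + (1 - w) * data_freq

print(f"posterior: Beta({post_alpha}, {post_beta})")
print(f"prior mean {prior_mean}, data frequency {data_freq}, posterior mean {post_mean}")
print(f"weight on prior {w}, weighted average {blend}")
```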
Where do priors come from?
The prior is the part newcomers distrust most — it looks like an invitation to smuggle in opinion. In practice priors are chosen deliberately and reported openly: flat or weakly-informative priors let the data dominate while ruling out absurd values (a click-through rate of 99.9% deserves skepticism before any data); informative priors encode genuine prior knowledge, like last quarter's measurements; and with even moderate amounts of data, reasonable priors converge to nearly identical posteriors — which you just verified with the \(n\) slider. The honest framing: every analysis has assumptions; Bayesian analysis makes one of them an explicit, criticizable object.
Updating beliefs, one observation at a time
Bayes' rule composes beautifully: today's posterior is tomorrow's prior. Observing data points one at a time, or all at once in a batch, gives exactly the same final posterior. Inference is just accumulation.
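For the Beta-Binomial case this equivalence is easy to check directly. The sketch below updates on ten made-up coin flips one at a time, then all at once, and compares the results; the flips and the flat starting prior are assumptions for illustration.

```python
def update_beta(alpha, beta, k, n):
    """Conjugate update: Beta(alpha, beta) prior plus k successes in n trials."""
    return alpha + k, beta + (n - k)


flips = [1, 0, 1, 1, 0, 1, 1, 1, 0, 1]   # hypothetical data, 1 = success
prior = (1, 1)

# One observation at a time: each posterior becomes the next prior.
sequential = prior
for flip in flips:
    sequential = update_beta(*sequential, k=flip, n=1)

# All at once.
batch = update_beta(*prior, k=sum(flips), n=len(flips))

print("sequential:", sequential)
print("batch:     ", batch)
```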
The simulator below flips a coin whose true bias is hidden from you (and from the model). Three different analysts start with three different priors — a flat skeptic, a believer in fairness, and someone with a wrong, opinionated hunch. Watch what the data does to each of them.
After a hundred flips the three analysts barely disagree; after a few hundred they are indistinguishable. This is the standard Bayesian answer to "but the prior is subjective": data washes priors out, and the rate at which it does so is itself visible in the posterior. When the analysts still disagree after lots of data, that's informative too — it means the data genuinely can't settle the question.
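The washing-out can be reproduced without the simulator. In this sketch the three priors, the true bias of 0.3, and the seed are all assumptions standing in for the page's hidden settings; the point is only how the gap between the analysts' posterior means shrinks as flips accumulate.

```python
import random

# Hypothetical priors for the three analysts, as Beta(alpha, beta) shape pairs.
priors = {
    "flat skeptic": (1, 1),
    "believer in fairness": (20, 20),
    "opinionated hunch": (8, 2),
}


def posterior_means(priors, flips):
    k, n = sum(flips), len(flips)
    return {name: (a + k) / (a + b + n) for name, (a, b) in priors.items()}


def spread(means):
    """Largest disagreement between any two analysts."""
    return max(means.values()) - min(means.values())


rng = random.Random(1)
true_bias = 0.3  # hidden from the analysts; an assumed value
flips = [1 if rng.random() < true_bias else 0 for _ in range(500)]

spreads = {}
for n in (0, 10, 100, 500):
    means = posterior_means(priors, flips[:n])
    spreads[n] = spread(means)
    print(f"n={n:3d}  spread of posterior means = {spreads[n]:.3f}")
```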
Point estimates, if you must
The posterior is the full answer, but sometimes a single number is demanded. The three standard summaries each minimize a different penalty for being wrong: the posterior mean minimizes squared error, the posterior median minimizes absolute error, and the posterior mode (the "MAP" estimate), the single most probable value, minimizes an all-or-nothing penalty that treats every miss, near or far, as equally bad. Reporting any of them without an interval throws away the part of the answer that was hardest to earn: the uncertainty.
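Written as optimization problems, with \(a\) standing for the single number you report (a standard decision-theory formulation, given here as a sketch rather than a derivation):

\[
\begin{aligned}
\hat{\theta}_{\text{mean}}   &= \arg\min_{a}\ \mathbb{E}\big[(\theta - a)^2 \mid D\big] = \mathbb{E}[\theta \mid D] \\
\hat{\theta}_{\text{median}} &= \arg\min_{a}\ \mathbb{E}\big[\,|\theta - a|\ \big|\ D\big] \\
\hat{\theta}_{\text{MAP}}    &= \arg\max_{\theta}\ p(\theta \mid D)
\end{aligned}
\]

For a \(\text{Beta}(a, b)\) posterior with \(a, b > 1\), the mean is \(\frac{a}{a+b}\) and the mode is \(\frac{a-1}{a+b-2}\); the median has no simple closed form and is usually computed numerically.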
When the math runs out: MCMC
Conjugate pairs gave us posteriors by algebra. But a realistic model — a regression with several predictors, a hierarchy of groups, a custom likelihood — has a posterior with no closed form. The product \(p(D\mid\theta)\,p(\theta)\) is easy to evaluate at any point, but the normalizing integral \(p(D) = \int p(D\mid\theta)p(\theta)\,d\theta\) is intractable, and in many dimensions you can't even grid it out.
The modern solution is to give up on computing the posterior as a formula and instead draw samples from it. If you can collect thousands of draws \(\theta^{(1)}, \theta^{(2)}, \ldots\) distributed according to the posterior, then every question becomes counting: the posterior mean is the average of the draws, a credible interval is a pair of percentiles, \(P(\theta > 0.5)\) is the fraction of draws above 0.5.
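To make "every question becomes counting" concrete, the sketch below pretends a sampler has already produced draws. As a stand-in it draws from a \(\text{Beta}(16, 8)\) distribution using the standard library; that choice of posterior and the number of draws are assumptions for illustration.

```python
import random

rng = random.Random(0)

# Stand-in for sampler output: 20,000 draws from a hypothetical Beta(16, 8) posterior.
draws = [rng.betavariate(16, 8) for _ in range(20_000)]

# Posterior mean: the average of the draws.
post_mean = sum(draws) / len(draws)

# 94% equal-tailed credible interval: the 3rd and 97th percentiles of the draws.
ordered = sorted(draws)
lo = ordered[int(0.03 * len(ordered))]
hi = ordered[int(0.97 * len(ordered))]

# P(theta > 0.5): the fraction of draws above 0.5.
p_above_half = sum(d > 0.5 for d in draws) / len(draws)

print(f"mean = {post_mean:.3f}")
print(f"94% interval = ({lo:.3f}, {hi:.3f})")
print(f"P(theta > 0.5) = {p_above_half:.3f}")
```

None of these three lines knows or cares where the draws came from, which is why the same summaries work unchanged for models with no closed-form posterior.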
The Metropolis algorithm
The oldest sampler, from 1953, is almost embarrassingly simple. It performs a random walk that is gently biased uphill — but not only uphill, which is the crucial part. From the current position \(\theta\):
① Propose a nearby point \(\theta'\) by adding random noise.
② Compute the ratio \(r = \frac{p(\theta'\mid D)}{p(\theta\mid D)}\). Note that the unknown normalizer cancels!
③ If \(r \ge 1\) (the proposal is more probable), move there. If \(r < 1\), move there anyway with probability \(r\); otherwise stay put.

Repeat.
The chain spends time in each region in proportion to its posterior probability — so the trail of visited points is a sample from the posterior. Watch it work on a deliberately nasty two-humped target:
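The three steps fit in about a dozen lines. The sketch below is a minimal random-walk Metropolis sampler using only the standard library; the two-humped `target` (an unnormalized mixture of two unit-width bumps at -2 and +2), the step size, and the seed are assumptions, not the page's actual target.

```python
import math
import random


def target(theta):
    """Unnormalized two-humped density (hypothetical stand-in for the page's target)."""
    return math.exp(-0.5 * (theta + 2.0) ** 2) + math.exp(-0.5 * (theta - 2.0) ** 2)


def metropolis(n_steps, step_size, start=0.0, seed=0):
    rng = random.Random(seed)
    theta = start
    draws, accepted = [], 0
    for _ in range(n_steps):
        proposal = theta + rng.gauss(0.0, step_size)  # 1. propose a nearby point
        r = target(proposal) / target(theta)          # 2. ratio; the normalizer cancels
        if r >= 1 or rng.random() < r:                # 3. uphill always, downhill with prob. r
            theta = proposal
            accepted += 1
        draws.append(theta)  # a rejected proposal repeats the current value
    return draws, accepted / n_steps


draws, acceptance_rate = metropolis(n_steps=20_000, step_size=1.5)
print(f"acceptance rate = {acceptance_rate:.2f}")
print(f"range visited   = ({min(draws):.2f}, {max(draws):.2f})")
```

Note that `target` never needs to integrate to one: only ratios of its values are used, which is exactly why the intractable \(p(D)\) stops being a problem.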
The step-size slider is the whole craft of MCMC in one control. Too small and the chain inches along, taking ages to find the second mode; the trace plot looks like a slow wander. Too large and almost every proposal lands in a low-probability wasteland and gets rejected; the trace looks like a flat staircase. Practitioners watch exactly these diagnostics: trace plots, acceptance rates, and statistics like \(\hat{R}\), which compares multiple independent chains. If the chains disagree, at least one of them has not converged, so none of them can be trusted yet.
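The idea behind \(\hat{R}\) is to compare the variance between chains with the variance within them. The sketch below implements the classic, non-split Gelman-Rubin version for illustration; libraries such as ArviZ use a more robust rank-normalized variant, so treat this as a simplified stand-in. The "chains" here are just independent normal draws, an assumption made to keep the example short.

```python
import math
import random
import statistics


def r_hat(chains):
    """Classic (non-split) Gelman-Rubin statistic for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(c) for c in chains]
    within = statistics.fmean([statistics.variance(c) for c in chains])  # W
    between = n * statistics.variance(chain_means)                        # B
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


rng = random.Random(0)

# Four chains exploring the same distribution.
agreeing = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]

# Four chains where two are stuck somewhere else entirely.
disagreeing = [[rng.gauss(mu, 1) for _ in range(1000)] for mu in (0, 0, 3, 3)]

r_good = r_hat(agreeing)
r_bad = r_hat(disagreeing)
print(f"R-hat, agreeing chains:    {r_good:.3f}")
print(f"R-hat, disagreeing chains: {r_bad:.3f}")
```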
Applied modeling I: Bayesian regression
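Time to model something. Ordinary least squares gives you one line, the single best fit. Bayesian regression gives you a distribution over lines: every slope-and-intercept pair gets a posterior probability, and predictions inherit that uncertainty honestly. The model, written the way Bayesians write models:

\[
\begin{aligned}
y_i &\sim \text{Normal}(\alpha + \beta x_i,\ \sigma) \\
\alpha &\sim \text{Normal}(0,\ 2) \\
\beta &\sim \text{Normal}(0,\ 2) \\
\sigma &\sim \text{HalfNormal}(1)
\end{aligned}
\]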
Read it top-down as a story about how the data came to be: each observation is normal noise around a line, and before seeing data we hold mild beliefs about the line's parameters. Below, each pale crimson line is one draw from the posterior — one plausible "true line." The spray of lines is the uncertainty.
Drag \(n\) from 2 up to 120 and watch the fan of lines collapse toward a single one. With two points the model is candid: many lines explain what little it has seen. Notice also that the fan is narrowest in the middle of the data and widens at the edges — extrapolation is automatically less certain than interpolation. Nobody programmed that in; it falls out of the posterior.
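Both effects, the collapsing fan and the widening at the edges, can be checked with a simplified version of the model. The sketch below assumes the noise level \(\sigma\) is known (here 1.0), which makes the posterior over intercept and slope exactly bivariate normal, so no sampler is needed. The synthetic data, the known-\(\sigma\) shortcut, and the function names are all assumptions for illustration.

```python
import math
import random


def line_posterior(xs, ys, sigma=1.0, prior_sd=2.0):
    """Posterior over (alpha, beta) for y ~ Normal(alpha + beta*x, sigma), sigma known.

    With independent Normal(0, prior_sd) priors the posterior is bivariate normal.
    Returns (mean, cov) as a 2-tuple and a 2x2 nested tuple.
    """
    n, s2 = len(xs), sigma**2
    sx, sxx = sum(xs), sum(x * x for x in xs)
    sy, sxy = sum(ys), sum(x * y for x, y in zip(xs, ys))
    # Posterior precision = prior precision + X^T X / sigma^2
    a = 1 / prior_sd**2 + n / s2
    b = sx / s2
    d = 1 / prior_sd**2 + sxx / s2
    det = a * d - b * b
    cov = ((d / det, -b / det), (-b / det, a / det))
    # Posterior mean = cov @ (X^T y / sigma^2), since the prior mean is zero
    mean = (
        cov[0][0] * sy / s2 + cov[0][1] * sxy / s2,
        cov[1][0] * sy / s2 + cov[1][1] * sxy / s2,
    )
    return mean, cov


def line_sd(x_new, cov):
    """Posterior standard deviation of the line's height alpha + beta * x_new."""
    return math.sqrt(cov[0][0] + 2 * x_new * cov[0][1] + x_new**2 * cov[1][1])


def draw_line(mean, cov, rng):
    """One plausible (alpha, beta) pair, via a 2x2 Cholesky factor of cov."""
    l00 = math.sqrt(cov[0][0])
    l10 = cov[1][0] / l00
    l11 = math.sqrt(cov[1][1] - l10**2)
    z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
    return mean[0] + l00 * z0, mean[1] + l10 * z0 + l11 * z1


rng = random.Random(0)
xs = [rng.uniform(0, 10) for _ in range(120)]          # synthetic predictor
ys = [1.0 + 0.5 * x + rng.gauss(0, 1.0) for x in xs]   # synthetic response

mean_2, cov_2 = line_posterior(xs[:2], ys[:2])
mean_all, cov_all = line_posterior(xs, ys)
x_mid = sum(xs) / len(xs)

print("sd of line at x=5,  n=2:  ", round(line_sd(5.0, cov_2), 3))
print("sd of line at x=5,  n=120:", round(line_sd(5.0, cov_all), 3))
print("sd of line at x=20, n=120:", round(line_sd(20.0, cov_all), 3))
print("one posterior line (alpha, beta):", draw_line(mean_all, cov_all, rng))
```

The uncertainty in the line's height is a quadratic in `x_new` with its minimum near the center of the data, which is the algebraic reason the fan pinches in the middle.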
The same model in PyMC
In practice you write the model nearly verbatim and let a NUTS sampler do the work. In a Marimo or Jupyter notebook:
```python
import numpy as np
import pymc as pm
import arviz as az

# Placeholder data so the snippet runs on its own (an assumption, not course data);
# replace x and y with your own arrays.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 0.5 + 1.2 * x + rng.normal(0, 0.7, size=50)

with pm.Model() as linreg:
    # priors: one line each, matching the math above
    alpha = pm.Normal("alpha", mu=0, sigma=2)
    beta = pm.Normal("beta", mu=0, sigma=2)
    sigma = pm.HalfNormal("sigma", sigma=1)

    # likelihood
    mu = alpha + beta * x
    pm.Normal("y", mu=mu, sigma=sigma, observed=y)

    # draw 2000 posterior samples per chain with NUTS
    idata = pm.sample(2000, chains=4)

az.summary(idata)         # means, CIs, r_hat diagnostics
az.plot_trace(idata)      # the same trace plots as Part 4
az.plot_posterior(idata)  # posterior densities with 94% HDI
```
Three habits make this workflow rather than ritual: prior predictive checks (simulate fake data from the priors alone — does it look remotely plausible?), convergence diagnostics (\(\hat{R} \approx 1.00\), healthy trace plots, no divergences), and posterior predictive checks (simulate new data from the fitted model and compare it to the real data — a model that can't reproduce what you saw can't be trusted for what you didn't).
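The first habit needs no sampler at all, so it can be sketched in plain Python: draw parameters from the priors, simulate data, and ask whether the result looks remotely plausible. The priors below match the PyMC model; the predictor values, number of simulations, and function name are assumptions. (PyMC has its own helpers for this, `pm.sample_prior_predictive` and `pm.sample_posterior_predictive`; this sketch only illustrates the idea.)

```python
import random


def prior_predictive(xs, n_sims=200, seed=0):
    """Simulate fake datasets from the priors alone, before seeing any real data."""
    rng = random.Random(seed)
    sims = []
    for _ in range(n_sims):
        alpha = rng.gauss(0, 2)        # alpha ~ Normal(0, 2)
        beta = rng.gauss(0, 2)         # beta  ~ Normal(0, 2)
        sigma = abs(rng.gauss(0, 1))   # sigma ~ HalfNormal(1)
        sims.append([rng.gauss(alpha + beta * x, sigma) for x in xs])
    return sims


xs = [-2, -1, 0, 1, 2]   # hypothetical predictor values
sims = prior_predictive(xs)

# Where does the bulk of the simulated data fall?
pooled = sorted(y for sim in sims for y in sim)
lo = pooled[int(0.03 * len(pooled))]
hi = pooled[int(0.97 * len(pooled))]
print(f"94% of prior-predictive y values lie in ({lo:.1f}, {hi:.1f})")
```

If that range is wildly wider or narrower than anything the real measurements could produce, the priors deserve another look before fitting.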
Applied modeling II: an A/B test, end to end
The capstone is the most common applied Bayesian analysis in industry. Two versions of a page; each visitor converts or doesn't. Model each conversion rate with its own Beta–Binomial machine from Part 2, then ask the posterior the question stakeholders actually care about: what is the probability that B beats A — and by how much?
The lower chart is the posterior of the difference \(\theta_B - \theta_A\), computed by Monte Carlo: draw thousands of \((\theta_A, \theta_B)\) pairs from the two posteriors and subtract. This is the Part 4 idea in action — once you have samples, any derived quantity is just arithmetic on the draws. Note what the answer looks like compared to a p-value: instead of "significant / not significant," you get a full distribution of plausible lifts, a direct probability that B wins, and an expected effect size — the raw ingredients of an actual business decision ("ship if P(B>A) > 95% and expected lift covers the switching cost").
The same analysis in Python, no MCMC needed thanks to conjugacy:
```python
import numpy as np

rng = np.random.default_rng(42)

# Beta(1,1) priors + observed data → Beta posteriors
post_a = rng.beta(1 + 50, 1 + 1000 - 50, size=100_000)
post_b = rng.beta(1 + 63, 1 + 1000 - 63, size=100_000)

lift = post_b - post_a
print(f"P(B > A) = {(lift > 0).mean():.1%}")
print(f"expected lift = {lift.mean():.4f}")
print(f"94% CI = {np.percentile(lift, [3, 97])}")
```
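The decision rule quoted earlier ("ship if P(B>A) > 95% and expected lift covers the switching cost") can be wrapped around the same draws. This is a hedged sketch using only the standard library: the function name `ab_decision`, the `min_lift` threshold of 0.005 standing in for a switching cost, and the seed are assumptions; the conversion counts are the ones used above.

```python
import random


def ab_decision(conv_a, n_a, conv_b, n_b, min_prob=0.95, min_lift=0.0,
                draws=50_000, seed=42):
    """Monte Carlo A/B comparison with Beta(1, 1) priors and a simple ship rule."""
    rng = random.Random(seed)
    lifts = [
        rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        - rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        for _ in range(draws)
    ]
    p_b_wins = sum(lift > 0 for lift in lifts) / draws
    expected_lift = sum(lifts) / draws
    ship = p_b_wins > min_prob and expected_lift > min_lift
    return p_b_wins, expected_lift, ship


p_b_wins, expected_lift, ship = ab_decision(50, 1000, 63, 1000, min_lift=0.005)
print(f"P(B > A)      = {p_b_wins:.1%}")
print(f"expected lift = {expected_lift:.4f}")
print(f"ship?         = {ship}")
```

With these counts a normal approximation to the two posteriors puts P(B > A) near 0.89, so the rule says to keep collecting data rather than ship, even though B's observed rate is higher.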
Where to go next: hierarchical models
The single most important technique this course hasn't covered is the hierarchical (multilevel) model — arguably the killer app of Bayesian statistics. When data comes in groups (conversion rates per country, test scores per school, effects per hospital), you give each group its own parameter and a shared distribution those parameters are drawn from. Small groups get gently pulled toward the overall pattern ("partial pooling"), borrowing strength from the rest of the data instead of overfitting their own noise. In PyMC it's a four-line change to the regression above. It is the natural Part 7, and the books below all treat it as the main event.
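The pull toward the overall pattern can be previewed with the Beta-Binomial machinery from Part 2. The sketch below is a deliberate simplification: it fixes the shared distribution at \(\text{Beta}(10, 90)\) by hand, whereas a real hierarchical model learns it from the data. The group names and counts are made up for illustration.

```python
def pooled_estimates(groups, alpha, beta):
    """Shrink each group's raw rate toward a shared Beta(alpha, beta).

    groups: {name: (successes, trials)}
    Returns {name: (raw_rate, shrunk_rate)}.
    """
    estimates = {}
    for name, (k, n) in groups.items():
        raw = k / n
        shrunk = (alpha + k) / (alpha + beta + n)
        estimates[name] = (raw, shrunk)
    return estimates


# Hypothetical groups: a small one with a noisy high rate and a large one.
groups = {"small country": (3, 10), "large country": (150, 1000)}

# Shared distribution with mean 0.10, fixed by hand for this sketch.
estimates = pooled_estimates(groups, alpha=10, beta=90)

for name, (raw, shrunk) in estimates.items():
    print(f"{name}: raw {raw:.3f}, partially pooled {shrunk:.3f}")
```

The small group's estimate moves a long way toward the shared mean while the large group's barely changes, which is the "borrowing strength" behavior in miniature.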
| Resource | What it's best at |
|---|---|
| Statistical Rethinking — Richard McElreath | The course's spiritual sequel: intuition-first, code-driven, superb on hierarchical models and causal thinking. |
| Bayesian Methods for Hackers — Cameron Davidson-Pilon | Free online, PyMC throughout, learn-by-computing. |
| Bayes Rules! — Johnson, Ott & Dogucu | A gentle, modern undergraduate treatment, free online. |
| Bayesian Data Analysis — Gelman et al. | The comprehensive graduate reference, free PDF from the authors. |
| PyMC & ArviZ documentation | Worked example notebooks for nearly every model family. |