🧪 Hypothesis

A working Python developer's guide to property-based testing: stop hand-picking examples, describe what should always be true, and let Hypothesis hunt for the counterexample.

TL;DR

  • Hypothesis is property-based testing for Python. Instead of asserting f(2, 3) == 5, you assert a rule that should hold for every input, and the library tries to break it.
  • You describe inputs with strategies and decorate a test with @given. It runs your test on ~100 generated examples by default, probing edge cases a human would skip: empty lists, NaN, huge ints, weird Unicode.
  • When a property fails, shrinking reduces the failing input to its simplest form and saves it to a local database so the failure reproduces on every future run. You debug [100], not [5, 87, 142, 3, 91, 60].

Why property-based testing

An example-based test encodes one row of a truth table. You picked the input, you computed the expected output by hand, and you wrote them down. That is a fine way to pin behavior you already understand, but it only ever checks the cases you thought of, and the bugs that reach production are, almost by definition, in the cases you didn't.

Property-based testing inverts the relationship. You don't supply inputs; you describe the shape of all valid inputs and state a property that must hold across them. Hypothesis then generates a stream of examples, deliberately biased toward the awkward ones, and tries to find a single input that violates the property. If it can't after a hundred tries, the test passes. If it can, it hands you the smallest example that breaks.

[Figure: example-based tests cover only the cases you thought of, so a bug in an unsampled corner ships; property-based generators probe the whole space, so a generated example lands on the bug and it is caught.]
Same function, same input space. Example tests check a handful of points; a generator covers the edges where bugs hide.

This is more than fuzzing. Fuzzing throws noise at your code and watches for crashes. Property-based testing throws noise at your code and checks that the result is still correct: that a round-trip survived, an invariant held, the answer matched a reference. The hard part isn't generating data; it's articulating what "correct" means for all inputs at once. The rest of this page is about getting good at that.

Lineage

The idea comes from QuickCheck in Haskell (2000). Hypothesis adapts it to Python with one decisive difference: you never write shrinkers by hand. Because shrinking is built into every strategy, anything you can generate, Hypothesis can also minimize.

Mental model in 60 seconds

A Hypothesis test is a loop wrapped around your assertions. One strategy feeds it inputs; your function body is the property; the framework runs it many times and, on the first failure, switches from "generate" mode to "minimize" mode.

[Figure: strategy (st.lists(st.integers())) → draw an example (xs = [3, -1, 0]) → run property (your assert statements). On pass, repeat up to max_examples (100). On fail, shrink to a minimal case, then report and save the falsifying example to .hypothesis/.]
Generate → run → repeat on success; on the first failure, shrink to the minimal counterexample, then report and cache it.

Four facts that fall out of this model and explain most of Hypothesis's behavior:

  • It runs your test many times. The default is 100 examples. A passing property-based test is a much stronger statement than a passing example test, but also slower, so the count is tunable.
  • Generation is biased, not uniform. Strategies deliberately over-sample boundary values: 0, -1, empty collections, NaN, max-width integers, surrogate characters. The boring middle of the space is where bugs aren't.
  • Failures are minimized, then remembered. The falsifying example is cached on disk and replayed first next time, so a failure is reproducible even though generation is random.
  • It expects determinism. Given the same input your property must reach the same verdict. Non-determinism (clocks, network, global RNG) confuses both shrinking and replay; Hypothesis will flag it as flaky.
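The loop described above can be sketched in plain Python. This is a toy under loud assumptions: the names toy_list_strategy, toy_shrink, and toy_run are invented for illustration, and the shrinker works on the output value directly, whereas Hypothesis minimizes the underlying choices. It only shows the shape of generate, run, and shrink.

```python
import random

def toy_list_strategy(rng):
    """Draw a list of ints, over-sampling small sizes and boundary values."""
    size = rng.choice([0, 1, 2, rng.randint(0, 10)])
    boundary = [0, -1, 1, 100, -100]
    return [
        rng.choice(boundary) if rng.random() < 0.3 else rng.randint(-1000, 1000)
        for _ in range(size)
    ]

def toy_shrink(xs, fails):
    """Greedily simplify xs for as long as the property keeps failing."""
    improved = True
    while improved:
        improved = False
        # Pass 1: try dropping one element.
        for i in range(len(xs)):
            candidate = xs[:i] + xs[i + 1:]
            if fails(candidate):
                xs, improved = candidate, True
                break
        if improved:
            continue
        # Pass 2: try moving one element a single step toward zero.
        for i, x in enumerate(xs):
            if x != 0:
                step = x - 1 if x > 0 else x + 1
                candidate = xs[:i] + [step] + xs[i + 1:]
                if fails(candidate):
                    xs, improved = candidate, True
                    break
    return xs

def toy_run(prop, max_examples=100, seed=0):
    """Return None if prop held on every example, else a shrunk counterexample."""
    rng = random.Random(seed)

    def fails(xs):
        try:
            prop(xs)
        except AssertionError:
            return True
        return False

    for _ in range(max_examples):
        xs = toy_list_strategy(rng)
        if fails(xs):
            return toy_shrink(xs, fails)  # switch from "generate" to "minimize"
    return None

def all_below_100(xs):
    # Deliberately false property: some generated element will be >= 100.
    assert all(x < 100 for x in xs)

counterexample = toy_run(all_below_100)
print(counterexample)
```

Even this crude shrinker ends on a one-element list holding the smallest failing value, which is the behavior the four facts above describe.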

Your first property test

Install it with pip install hypothesis (or uv add --dev hypothesis). A property test is an ordinary test_ function with a @given decorator; pytest discovers and runs it like any other. The classic starter property is the round-trip: serialize a value, parse it back, and assert you got the same thing.

import json
from hypothesis import given, strategies as st

@given(st.dictionaries(st.text(), st.integers()))
def test_json_round_trips(value):
    assert json.loads(json.dumps(value)) == value

Run it with pytest. Hypothesis generates ~100 dictionaries (empty ones, ones with "" keys, ones with control characters, ones with huge integers) and checks that each survives the trip through JSON. There is no expected-output column to maintain; the property is the spec.

Now watch it find a bug. Here's a helper that returns the last character of a string, with an omission its author didn't notice:

def last_char(s):
    return s[-1]                 # forgets that s might be empty

@given(st.text())
def test_last_char_is_in_string(s):
    assert last_char(s) in s

pytest fails almost instantly, and the report is the whole point:

    def last_char(s):
>       return s[-1]                 # forgets that s might be empty
E       IndexError: string index out of range

----------------------------- Hypothesis ------------------------------
Falsifying example: test_last_char_is_in_string(
    s='',
)

Hypothesis didn't reach for a clever input; it reached for the simplest one. The empty string is among the first things it tries, and it's exactly the case the author forgot. Someone writing examples by hand tends to start from "hello" and never circle back to "". The falsifying example names the bug in one line: s=''.

Strategies: describing inputs

A strategy is a value generator that also knows how to shrink. The hypothesis.strategies module (conventionally imported as st) ships strategies for essentially every built-in type, and combinators to build the rest. You rarely write generation logic yourself; you compose.

Strategy                  Generates                 Common arguments
st.integers()             Python ints (unbounded)   min_value, max_value
st.floats()               floats incl. nan/inf      allow_nan, allow_infinity, min_value
st.text()                 Unicode strings           alphabet, min_size, max_size
st.booleans()             True / False              (none)
st.lists(elem)            lists of elem             min_size, max_size, unique
st.dictionaries(k, v)     dicts                     min_size, max_size
st.tuples(a, b, …)        fixed-shape tuples        positional strategies
st.sampled_from(seq)      one value from seq        great for enums
st.just(x) / st.none()    a constant / None         (none)
st.datetimes()            datetime objects          timezones, min_value
st.from_regex(p)          strings matching p        fullmatch=True
st.builds(C, …)           instances of class C      strategies per field
st.from_type(T)           values for a type hint    infers from T
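As a small illustration of several rows of the table working together, the sketch below builds instances of a dataclass with st.builds, st.from_regex, and st.sampled_from. The Order class and its fields are hypothetical, invented only for this example.

```python
from dataclasses import dataclass

from hypothesis import given, strategies as st

@dataclass
class Order:  # hypothetical type, not part of Hypothesis
    sku: str
    quantity: int
    currency: str

# One strategy per field; st.builds calls Order(...) with the drawn values.
orders = st.builds(
    Order,
    sku=st.from_regex(r"[A-Z]{3}-[0-9]{4}", fullmatch=True),
    quantity=st.integers(min_value=1, max_value=500),
    currency=st.sampled_from(["USD", "EUR", "GBP"]),
)

@given(orders)
def test_generated_orders_respect_constraints(order):
    assert len(order.sku) == 8
    assert 1 <= order.quantity <= 500
    assert order.currency in {"USD", "EUR", "GBP"}
```

The constraints live in the strategy, so every generated Order is valid by construction and the test body only has to state the property.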

Strategies are composable values. Pass them to other strategies, take a union with st.one_of(...) (or the | operator), and transform them with two methods you'll use constantly:

# .map(f): generate, then transform
even_ints = st.integers().map(lambda n: n * 2)
sorted_lists = st.lists(st.integers()).map(sorted)

# .filter(pred): keep only values that pass; use sparingly (see Pitfalls)
nonzero = st.integers().filter(lambda n: n != 0)

# a union of shapes, e.g. a JSON-ish scalar
scalars = st.none() | st.booleans() | st.integers() | st.text()

When later values depend on earlier ones (a list and a valid index into it, a start date and an end date after it), reach for @composite. It gives you a draw function so you can write generation as ordinary procedural code:

from hypothesis import given, strategies as st

@st.composite
def list_and_index(draw):
    xs = draw(st.lists(st.integers(), min_size=1))
    i = draw(st.integers(min_value=0, max_value=len(xs) - 1))
    return xs, i

@given(list_and_index())
def test_index_is_valid(pair):
    xs, i = pair
    assert xs[i] in xs

Because the index is drawn from the list's length, it is always valid: there is no wasted generation and nothing to filter. Constraints encoded in the strategy are also preserved during shrinking, which is the deeper reason to prefer constructive generation over filtering.

Shrinking to a minimal case

Shrinking is Hypothesis's signature feature and the reason its failures are actually useful. The first input that breaks a property is usually large and random: a 40-element list, a string full of control codes. Before reporting, Hypothesis repeatedly simplifies that input, re-running your test on each candidate, keeping any that still fail, until it can't get smaller. What you see is the essence of the bug with all the noise removed.

[Figure: [5, 87, 142, 3, 91, 60] ✗ → [142, 3, 91] ✗ → [142] ✗ → [100] ✗ (minimal). Drop elements, then shrink values toward 0.]
For the (deliberately false) property "every element is < 100", Hypothesis discards elements, then reduces the survivor, landing on [100], the simplest list that still violates it.

Two properties make this reliable. First, shrinking is built into the strategy, so you never write a shrinker: a list shrinks by dropping elements and shrinking each survivor, an integer shrinks toward zero, a string shrinks toward shorter and toward simpler characters. Second, and more subtly, Hypothesis uses integrated shrinking: it minimizes the underlying choices the generator made, not the output value. That is why a value built with .map(lambda n: n * 2) still shrinks to an even number, and why the index from the @composite example above stays in range no matter how far it's reduced. The constraints you generated under are the constraints it shrinks under.
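One way to watch shrinking without writing a failing test is hypothesis.find, which searches a strategy for a value satisfying a condition and returns the minimal one it can reach. The sketch below applies it to the two cases discussed above; the variable names are illustrative only.

```python
from hypothesis import find, strategies as st

# Smallest list that violates "every element is < 100".
minimal_list = find(
    st.lists(st.integers()),
    lambda xs: any(x >= 100 for x in xs),
)

# A mapped strategy shrinks through the map: the result stays even.
smallest_even = find(
    st.integers().map(lambda n: n * 2),
    lambda n: n >= 100,
)

print(minimal_list, smallest_even)
```

The first search should settle on a single-element list holding 100, and the second on an even number, because the shrinker reduces the integer that fed the map rather than the doubled output.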

Why it matters

A failing test you can't reduce is a debugging session. A failing test reduced to [100] or s='' is usually a glance. Shrinking is what turns "the fuzzer found something" into "here is the bug."

What properties should I test?

The skill in property-based testing is finding properties: assertions that are true for all inputs without restating the implementation. A handful of patterns cover most real cases. When you're stuck, run down this list and ask which ones fit.

Round-trip / inverse

If two functions are inverses, applying both should return the original. Encode/decode, serialize/parse, compress/decompress, save/load. The cheapest high-value property there is.

import gzip
@given(st.binary())
def test_gzip_round_trip(data):
    assert gzip.decompress(gzip.compress(data)) == data

Comparison against a test oracle

If you have a simple, obviously-correct (often slow) implementation, assert your optimized one agrees with it on every input. Great for refactors, caches, and performance rewrites.

@given(st.lists(st.integers()))
def test_fast_sort_matches_builtin(xs):
    assert my_fast_sort(xs) == sorted(xs)

Invariants

Some facts must survive an operation even if you can't predict the exact output. Sorting must preserve the multiset of elements; a transfer must conserve total money.

from collections import Counter
@given(st.lists(st.integers()))
def test_sort_preserves_elements(xs):
    assert Counter(sorted(xs)) == Counter(xs)

Idempotence

Many "normalize" or "clean up" operations should be stable under reapplication: doing them twice equals doing them once.

@given(st.text())
def test_normalize_is_idempotent(s):
    once = normalize(s)
    assert normalize(once) == once

Algebraic laws

Commutativity, associativity, identity elements. If order shouldn't matter, say so.

@given(st.sets(st.integers()), st.sets(st.integers()))
def test_union_is_commutative(a, b):
    assert a | b == b | a

"It never crashes"

The weakest property, but a real one, and a great starting point for messy, exception-prone code like parsers. Any unhandled exception fails the test.

@given(st.text())
def test_parser_does_not_crash(s):
    parse(s)   # raising anything unexpected = failure
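In practice a parser usually has one documented way of rejecting bad input, and the property becomes "it either returns a valid result or raises that exception, and nothing else". A minimal sketch, assuming a hypothetical parse_port helper that signals rejection with ValueError:

```python
from hypothesis import given, strategies as st

def parse_port(s):
    """Hypothetical parser: return a TCP port number or raise ValueError."""
    n = int(s)  # int() raises ValueError on non-numeric text
    if not 0 < n < 65536:
        raise ValueError(f"port out of range: {n}")
    return n

@given(st.text())
def test_parse_port_only_raises_value_error(s):
    try:
        port = parse_port(s)
    except ValueError:
        return  # the documented rejection path is fine
    # Any other exception propagates and fails the test.
    assert 0 < port < 65536
```

Catching only the expected exception keeps the property honest: a TypeError or IndexError leaking out of the parser still counts as a failure.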

A bug, end to end

Tie it together. Here's a plausible-looking function to merge two already-sorted lists, with a real bug, and the test oracle property that catches it.

def merge(a, b):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    return out + a[i:]           # bug: forgets b's leftover tail

@given(st.lists(st.integers()), st.lists(st.integers()))
def test_merge_matches_sorted(a, b):
    assert merge(sorted(a), sorted(b)) == sorted(a + b)

The body looks fine on a quick read, and hand-picked examples like merge([2, 4], [1, 3]) pass: whenever b runs out first, the leftover tail of a is appended and nothing is lost. But the loop stops the moment either list runs out, and the function only appends what's left of a. Anything still in b is silently dropped. Hypothesis finds the smallest input that exposes it:

Falsifying example: test_merge_matches_sorted(
    a=[],
    b=[0],
)
E   assert [] == [0]

The minimal trigger is an empty a and a one-element b. With a=[], b=[0], the loop body never runs, the function returns a[0:] (an empty list), and b's only element is dropped. An empty argument is exactly the kind of boundary a human skips and a generator tries almost immediately. Once you've fixed it, lock the case in permanently so it's checked first on every future run:

from hypothesis import example, given, strategies as st

@given(st.lists(st.integers()), st.lists(st.integers()))
@example(a=[], b=[0])      # regression: leftover tail of b
def test_merge_matches_sorted(a, b):
    assert merge(sorted(a), sorted(b)) == sorted(a + b)
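For completeness, one possible fix is sketched below: append the leftover tail of both lists, since at most one of them is non-empty when the loop exits. This is an assumed repair for the example above, not the only way to write a merge.

```python
def merge(a, b):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    # Exactly one of these tails can be non-empty; keep both.
    return out + a[i:] + b[j:]

print(merge([], [0]))
print(merge([1, 3], [2, 4]))
```

With this version the pinned @example(a=[], b=[0]) passes, and the oracle property has nothing left to find.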

Beyond @given

The decorator gets you most of the value, but a handful of other tools handle the cases real test suites run into.

@example: pin explicit cases

Mix specific, must-always-run inputs into a property test: known edge cases, past regressions, the example from the docstring. They run before generation and don't shrink.

@settings: tune the run

Control example count, time budget, and more. Two you'll touch most are max_examples and deadline.

from hypothesis import given, settings, strategies as st

@settings(max_examples=1000, deadline=None)
@given(st.lists(st.integers()))
def test_thoroughly(xs):
    ...

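For different environments, register named profiles once and select one at run time. The pytest plugin accepts --hypothesis-profile=ci; to pick the profile from an env var instead (HYPOTHESIS_PROFILE=ci pytest), load it explicitly, because registering a profile does not activate it:

import os
from hypothesis import settings

settings.register_profile("ci", max_examples=1000, deadline=None)
settings.register_profile("dev", max_examples=20)
settings.load_profile(os.getenv("HYPOTHESIS_PROFILE", "default"))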

assume(): discard bad inputs

When a generated example violates a precondition, assume() throws it out and asks for another, rather than failing. Reach for it only when you can't express the constraint in the strategy itself.

from hypothesis import assume, given, strategies as st

@given(st.integers(), st.integers())
def test_divmod_identity(a, b):
    assume(b != 0)
    q, r = divmod(a, b)
    assert q * b + r == a

Replay & the example database

Every failure is stored under .hypothesis/examples and tried first next time, so fixed bugs stay fixed and failures reproduce without a seed. For sharing a one-off repro, Hypothesis prints a @seed(...) or @reproduce_failure(...) decorator you can paste in temporarily, but a permanent @example is the better home for a case worth keeping.

target(): steer the search

For optimization-flavored properties, report a score and Hypothesis will hunt for inputs that maximize it. Targeted search shines on things like worst-case latency or queue depth.

from hypothesis import given, target, strategies as st

@given(st.lists(st.integers(), min_size=1))
def test_latency_under_budget(jobs):
    latency = simulate(jobs)
    target(latency)              # push toward worst cases
    assert latency < BUDGET_MS

Ghostwriter: generate the test

Not sure where to start? The ghostwriter inspects a function and prints a runnable starter test. hypothesis write gzip.compress emits a round-trip skeleton; pipe it to a file and refine.

Stateful testing

Single-input properties don't fit objects with internal state: a cache, a connection pool, a parser with a buffer. The bug usually isn't one bad call; it's an unlucky sequence of calls. Stateful testing lets Hypothesis generate those sequences for you.

You subclass RuleBasedStateMachine and decorate methods with @rule (the operations Hypothesis may call, with strategies for their arguments) and @invariant() (checks run after every step). A common pattern runs the real object beside a simple model and asserts they never disagree.

[Figure: deposit(50), balance → 50, invariant ✓; withdraw(30), balance → 20, invariant ✓; withdraw(40), balance → -20, invariant "balance ≥ 0" ✗. Hypothesis searches over sequences and shrinks the failing one to its shortest form.]
The bug needs a specific order of operations. Stateful testing generates and then minimizes the sequence, not just the arguments.

from hypothesis import strategies as st
from hypothesis.stateful import RuleBasedStateMachine, rule, invariant

class AccountMachine(RuleBasedStateMachine):
    def __init__(self):
        super().__init__()
        self.account = Account()   # system under test
        self.model = 0             # trusted shadow balance

    @rule(amount=st.integers(min_value=1, max_value=1000))
    def deposit(self, amount):
        self.account.deposit(amount)
        self.model += amount

    @rule(amount=st.integers(min_value=1, max_value=1000))
    def withdraw(self, amount):
        self.account.withdraw(amount)
        self.model -= amount

    @invariant()
    def matches_model(self):
        assert self.account.balance() == self.model

# expose to pytest; note the .TestCase class attribute
TestAccount = AccountMachine.TestCase

The last line is the bit people get wrong: the runnable test is the .TestCase class attribute on your machine, assigned to a module-level Test* name so pytest collects it. Hypothesis then drives random programs through your rules and, on failure, prints the shortest sequence of calls that breaks an invariant.
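The machine above never defines Account. To make it runnable, a hypothetical system under test might look like the sketch below. The class and its deliberately planted bug are assumptions for illustration: overdrafts are silently clamped to zero, so the real balance drifts away from the model.

```python
class Account:
    """Hypothetical system under test for AccountMachine."""

    def __init__(self):
        self._balance = 0

    def deposit(self, amount):
        self._balance += amount

    def withdraw(self, amount):
        # Planted bug: an overdraft is clamped to zero instead of being
        # recorded (or rejected), so the model and the account disagree.
        self._balance = max(0, self._balance - amount)

    def balance(self):
        return self._balance

acct = Account()
acct.withdraw(1)  # the model would now hold -1
print(acct.balance())
```

Against this class, a single withdraw on a fresh account is enough to break matches_model, so that one-step program is what the shrunk failure report should come down to.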

Pytest & CI

Hypothesis is framework-agnostic but pytest-native in practice: a @given function is collected and reported like any other test, failures show up as normal assertion errors, and the hypothesis plugin adds a statistics summary under pytest --hypothesis-show-statistics.

Fixtures: the one real gotcha

You can use pytest fixtures alongside @given (list the fixture parameters that @given doesn't fill), but a function-scoped fixture runs once for the whole test, not once per generated example. State built up in the fixture is shared across all examples, and recent Hypothesis versions report this combination through the function_scoped_fixture health check. For per-example setup, do the work inside the test body (or draw it), and keep fixtures for expensive, read-only resources.

import pytest
from hypothesis import given, strategies as st

@pytest.fixture
def client():
    return make_client()        # created ONCE, reused for every example

@given(name=st.text())
def test_round_trips(client, name):
    client.put(name)
    assert client.get() == name
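The per-example alternative is to build the state inside the test body, so each generated input starts from scratch. A minimal sketch, assuming a hypothetical in-memory FakeClient standing in for whatever make_client() returns:

```python
from hypothesis import given, strategies as st

class FakeClient:
    """Hypothetical in-memory stand-in for the client in the example above."""

    def __init__(self):
        self.items = []

    def put(self, name):
        self.items.append(name)

    def get(self):
        return self.items[-1]

@given(name=st.text())
def test_put_then_get_with_fresh_client(name):
    client = FakeClient()  # created once PER EXAMPLE, not once per test
    client.put(name)
    assert client.get() == name
    # Holds only because no state leaks in from earlier examples.
    assert len(client.items) == 1
```

The second assertion is the one that would break with a shared fixture, because the list would keep growing across examples.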

Determinism in CI

Generation is random across runs, which is usually what you want: more runs explore more space. For perfectly reproducible CI, set derandomize=True. For anything I/O-bound or variable in timing, set deadline=None so a slow example isn't reported as a failure. And commit nothing from .hypothesis/ except where you intentionally share a failure database.

Drawing mid-test

When you need a value that depends on what the code did so far, use st.data() and draw inside the body: @given(data=st.data()), then x = data.draw(st.integers()). It composes when @composite would be awkward, at the cost of slightly noisier failure output.
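A short sketch of that pattern, where the second draw depends on the value produced by the first. The test name and the property are illustrative; the label argument is optional and only makes the failure report easier to read.

```python
from hypothesis import given, strategies as st

@given(data=st.data())
def test_removing_an_element_shortens_the_list(data):
    xs = data.draw(st.lists(st.integers(), min_size=1), label="xs")
    # This draw depends on xs, so it has to happen inside the test body.
    victim = data.draw(st.sampled_from(xs), label="victim")
    before = len(xs)
    xs.remove(victim)
    assert len(xs) == before - 1
```

Because victim is sampled from xs itself, the call to remove can never raise, and no filtering or assume() is needed.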

Pitfalls & best practices

  • Don't over-filter. .filter() and assume() discard examples; if most are thrown away, generation slows to a crawl and a health check fires. Prefer constructing valid data directly (bound an st.integers(min_value=…), draw an index from a length) over generating broadly and filtering.
  • Keep tests deterministic. Reading the clock, the network, or the global random module makes a property pass and fail for the "same" input. Either inject those values through strategies or stub them; Hypothesis will otherwise flag the test as flaky.
  • Mind the deadline on slow code. The 200 ms-per-example budget catches accidental blow-ups, but legitimately slow tests should set deadline=None rather than fight it.
  • Avoid implicit assumptions. @given(st.integers()) really does include 0 and huge negatives; st.floats() really does include nan and inf. If your property only holds on a subset, encode that subset in the strategy; the failures it finds otherwise are real.
  • Make properties meaningful, not tautological. A property that reimplements the function it tests proves nothing. The best properties come from a different angle than the implementation: an inverse, an oracle, a conserved quantity.
  • Let it run more in CI than locally. A profile with max_examples=20 for a fast edit loop and 1000 for CI gives quick feedback without sacrificing coverage where it counts.
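To make the first pitfall concrete, the sketch below contrasts filtering with construction and rewrites the earlier divmod property so the precondition lives in the strategy instead of in assume(). The strategy names are illustrative.

```python
from hypothesis import given, strategies as st

# Filtering: draws any integer, then discards roughly half of them.
evens_filtered = st.integers().filter(lambda n: n % 2 == 0)

# Constructing: every draw is already valid, nothing is thrown away.
evens_built = st.integers().map(lambda n: n * 2)

# Precondition encoded in the strategy rather than assume(b != 0).
nonzero = st.integers(min_value=1) | st.integers(max_value=-1)

@given(evens_built, nonzero)
def test_divmod_identity_on_valid_inputs(a, b):
    assert a % 2 == 0
    q, r = divmod(a, b)
    assert q * b + r == a

@given(evens_filtered)
def test_filtered_values_are_even(n):
    assert n % 2 == 0
```

Both even-number strategies are correct; the filtered one is cheap here because half of all integers pass, but the cost grows quickly as the predicate gets stricter, which is when the health check starts to fire.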

Source & ecosystem

Hypothesis is open source, actively developed, and small enough to read. A few pointers if you want to go deeper than this page.

  • HypothesisWorks/hypothesis: the repo. The Python package lives under hypothesis-python/; src/hypothesis/strategies/ is where every built-in strategy is defined, and a great place to learn how generation and shrinking are implemented.
  • hypothesis.extra: first-party extensions for the libraries you actually test. hypothesis.extra.numpy generates arrays with controlled dtypes and shapes, hypothesis.extra.pandas generates DataFrames and typed columns, and there's support for Django models and dateutil timezones.
  • Ghostwriter & the CLI: hypothesis write <module-or-function> bootstraps tests; hypothesis codemod helps with API migrations.
  • HypoFuzz: a companion tool that runs your existing Hypothesis tests as a coverage-guided fuzzer for long, CI-style campaigns, reusing the same example database.

Further reading