Good Tests Are Hard to Vary

Not all passing tests are equally good.

Some tests protect a rule. Others protect an accident: a particular name, a particular call count, an over-specified fixture, or a message string nobody promised to keep stable.

David Deutsch gives a useful way to see the difference. In The Beginning of Infinity, he describes a good explanation as one that is hard to vary while still accounting for what it purports to account for.

His point is not merely that good explanations are testable. Bad explanations can make testable predictions too. In his 2009 TED talk, Deutsch uses the myth of Persephone as an example: in its simple form, it cannot account for opposite seasons in the two hemispheres. But the myth is easy to vary. The characters, motives, and rituals can be changed without damaging its structure. The axial-tilt explanation is different: vary the tilt, and the predictions change.

That is the distinction worth importing into tests. Deutsch is not writing about software tests, so the transfer is deliberately narrow. A test is not an explanation of reality, but it can encode a reasoned claim about why a behavior should hold. A test can run, make assertions, and even fail for a real reason while still explaining very little.

Applied to testing, the idea becomes practical:

A good test is hard to vary in the places that matter.

Its setup, input, and assertions are there because they help detect a specific kind of defect. Change the load-bearing parts and the test should either fail or stop testing the same thing. Change an irrelevant detail and nothing important should depend on it.

That makes “hard to vary” a useful test quality criterion. Not a metric. A judgment.

A test that explains too little

Take a Spring Boot service that decides whether a taxpayer qualifies for a relocation deduction.

The rule is simple:

- the taxpayer must have moved to another municipality
- the taxpayer must meet the income floor
- the service must record a successful evaluation

Here is a test that passes, but contains too much accidental detail:

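import org.spockframework.spring.SpringBean
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.boot.test.context.SpringBootTest
import spock.lang.Specification

@SpringBootTest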
class RelocationDeductionServiceSpec extends Specification {
  @SpringBean
  CitizenRegistryClient citizenRegistryClient = Stub()
  @SpringBean
  IncomeRepository incomeRepository = Mock()
  @SpringBean
  AuditLog auditLog = Mock()

  @Autowired
  RelocationDeductionService service

  def "approves relocation deduction"() {
    given:
    citizenRegistryClient.find("p-123") >> new CitizenProfile(
      cpr: "0000000000",
      firstName: "Anna",
      lastName: "Jensen",
      previousMunicipality: "0851",
      currentMunicipality: "0101"
    )

    when:
    def decision = service.evaluate("p-123", 2025)

    then:
    decision.approved
    decision.reason == "ELIGIBLE"
    decision.message == "Citizen Anna Jensen qualifies for relocation deduction"
    decision.evaluatedAt != null

    1 * incomeRepository.totalEarnedIncome("p-123", 2025) >> 350_000G
    1 * auditLog.record("p-123", "RELOCATION_DEDUCTION_EVALUATED", { it.success })
    0 * _
  }
}

The test checks real behavior, but it gives the rule and the surrounding scenery the same weight.

The CPR number, names, and exact message text belong here only if the business contract depends on them. Otherwise they pin the test to incidental facts. The repository call count has the same smell: it verifies how the service currently gets the answer, not the rule the service must enforce. The closing 0 * _ sharpens the problem: it rejects every interaction the test did not list, so any unrelated collaboration change fails the test.

The result is a test that is hard to vary in its incidental parts (names, message text, call counts) and easy to vary where the rule lives: nothing in it says which conditions make approval correct. For a rule-focused test, that is the wrong shape.
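
To make the call-count problem concrete, here is a minimal sketch in plain Groovy rather than Spock, so it runs without a Spring context. Everything in it is an assumption made for illustration: the hand-rolled counting repository, the two service variants, and the 300,000 income floor. The article does not show the real service.

```groovy
// Hypothetical stand-in for IncomeRepository that counts how often it is read.
class CountingIncomeRepository {
  int calls = 0

  BigInteger totalEarnedIncome(String personId, int year) {
    calls++
    return 350_000G
  }
}

// Assumed original implementation: reads income once.
class SingleReadService {
  static final BigInteger INCOME_FLOOR = 300_000G // assumed value
  CountingIncomeRepository incomes

  boolean approves(String personId, int year, boolean moved) {
    return moved && incomes.totalEarnedIncome(personId, year) >= INCOME_FLOOR
  }
}

// Assumed harmless refactor: reads income a second time as a consistency check.
class DoubleReadService {
  static final BigInteger INCOME_FLOOR = 300_000G // assumed value
  CountingIncomeRepository incomes

  boolean approves(String personId, int year, boolean moved) {
    def income = incomes.totalEarnedIncome(personId, year)
    def recheck = incomes.totalEarnedIncome(personId, year)
    return moved && income == recheck && income >= INCOME_FLOOR
  }
}

def repoBefore = new CountingIncomeRepository()
def repoAfter = new CountingIncomeRepository()

// The decision, which is what the rule is about, is the same in both versions.
assert new SingleReadService(incomes: repoBefore).approves("person-1", 2025, true)
assert new DoubleReadService(incomes: repoAfter).approves("person-1", 2025, true)

// Only the call count differs, so a "1 *" expectation on it would now fail.
assert repoBefore.calls == 1
assert repoAfter.calls == 2
```

The rule did not change between the two variants, yet a test that pins the count to one would go red. That is the sense in which the count is incidental.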

The same test, harder to vary

Now remove everything that does not explain approval:

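import org.spockframework.spring.SpringBean
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.boot.test.context.SpringBootTest
import spock.lang.Specification

@SpringBootTest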
class RelocationDeductionServiceSpec extends Specification {
  @SpringBean
  CitizenRegistryClient citizenRegistryClient = Stub()
  @SpringBean
  IncomeRepository incomeRepository = Stub()
  @SpringBean
  AuditLog auditLog = Mock()

  @Autowired
  RelocationDeductionService service

  def "approves deduction when taxpayer has moved to another municipality and meets the income floor"() {
    given:
    citizenRegistryClient.find("person-1") >> new CitizenProfile(
      previousMunicipality: "0851",
      currentMunicipality: "0101"
    )
    incomeRepository.totalEarnedIncome("person-1", 2025) >> 350_000G

    when:
    def decision = service.evaluate("person-1", 2025)

    then:
    decision.approved
    decision.reason == "ELIGIBLE"

    1 * auditLog.record("person-1", "RELOCATION_DEDUCTION_EVALUATED", { it.success })
  }
}

This version says less, but explains more. Even the test name carries more weight: it names the conditions that make approval correct instead of only saying that approval happens.

The municipality values matter because moving to another municipality is part of the rule. The income value matters because it places the taxpayer above the floor. The approval and reason matter because they are the observable decision. The audit interaction remains because, in this example, it is part of the service’s contract. A downstream process, compliance check, or operations trail may rely on that event being recorded. The message text is different: unless it is promised as a stable contract, it is presentation detail.

The repository is now a Stub(), not a Mock(), because this test does not care how many times the service asks for income. It only needs an income value to be available for the decision.

That is the point of the hard-to-vary lens: make the test strict about the rule and relaxed about everything else.
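
One way to check that a test has this shape is to vary each detail and watch the outcome. The sketch below does that against a simplified stand-in for the rule. RelocationRule, the map-based profile, and the 300,000 floor are hypothetical, introduced only so the example runs on its own in plain Groovy.

```groovy
// Hypothetical, simplified stand-in for the rule; the floor value is assumed.
class RelocationRule {
  static final BigInteger INCOME_FLOOR = 300_000G

  static boolean approves(Map profile, BigInteger income) {
    return profile.previousMunicipality != profile.currentMunicipality &&
      income >= INCOME_FLOOR
  }
}

def base = [previousMunicipality: "0851", currentMunicipality: "0101"]

// Baseline: both conditions hold.
assert RelocationRule.approves(base, 350_000G)

// Varying a load-bearing detail changes the outcome.
assert !RelocationRule.approves(base + [currentMunicipality: "0851"], 350_000G)
assert !RelocationRule.approves(base, 299_999G)

// Varying an incidental detail does not.
assert RelocationRule.approves(base + [firstName: "Anna"], 350_000G)
assert RelocationRule.approves(base + [firstName: "Bo"], 350_000G)
```

If changing a value never changes the outcome, that value is a candidate for removal from the test.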

One example is not always an explanation

The same idea applies to example-based tests.

This test may be correct:

expect:
lateFilingPenaltyPolicy.penalty(21) == 4_200G

But it only records one fact. It does not explain the shape of the rule.
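
A quick way to see how little one fact constrains is to write down several rules that all satisfy it. The three candidates below are hypothetical and written in plain Groovy. None of them is taken from the real policy, which the article does not show.

```groovy
// Three hypothetical candidate rules that a single example cannot tell apart.
class CandidateRules {
  // 200 per day, capped at 5,000.
  static BigInteger capped(int days) { return (days * 200G).min(5_000G) }

  // 200 per day with no cap.
  static BigInteger uncapped(int days) { return days * 200G }

  // A flat fee that ignores the number of days.
  static BigInteger flat(int days) { return 4_200G }
}

// All three agree on the single recorded fact.
assert CandidateRules.capped(21) == 4_200G
assert CandidateRules.uncapped(21) == 4_200G
assert CandidateRules.flat(21) == 4_200G

// They disagree as soon as another input is tried.
assert CandidateRules.capped(30) == 5_000G
assert CandidateRules.uncapped(30) == 6_000G
assert CandidateRules.flat(30) == 4_200G
```

A test that passes for all three candidates cannot say which one the business intended.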

If the rule is “200 per day until the penalty reaches 5,000,” a better test captures the relationship:

import spock.lang.Specification

class LateFilingPenaltyPolicySpec extends Specification {
  def policy = new LateFilingPenaltyPolicy(
    dailyPenalty: 200G,
    cap: 5_000G
  )

  def "penalty increases by the daily rate until the cap"() {
    expect:
    policy.penalty(daysLate) == expectedPenalty

    where:
    daysLate || expectedPenalty
    0        || 0G
    1        || 200G
    21       || 4_200G
    24       || 4_800G
    25       || 5_000G
    26       || 5_000G
    60       || 5_000G
  }
}

This test is harder to vary because the examples are not random. They mark the structure of the rule: before the cap, at the cap, and after the cap.

The point is not to add more rows. The point is to choose rows that explain the rule. A table of arbitrary examples is still arbitrary. The value comes from the reason each example is there.
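
As a sketch of what "the reason each example is there" can look like, the plain Groovy script below attaches a reason to every row and then states the same structure as a relationship between consecutive days. The implementation of LateFilingPenaltyPolicy is an assumption: the article shows only how the policy is constructed and called, not how it computes the penalty.

```groovy
// Assumed implementation: daily rate times days late, limited by the cap.
class LateFilingPenaltyPolicy {
  BigInteger dailyPenalty
  BigInteger cap

  BigInteger penalty(int daysLate) {
    return (dailyPenalty * daysLate).min(cap)
  }
}

def policy = new LateFilingPenaltyPolicy(dailyPenalty: 200G, cap: 5_000G)

// Each row carries the reason it earns its place in the table.
def rows = [
  [0,  0G,     "no delay means no penalty"],
  [1,  200G,   "first day costs one daily rate"],
  [21, 4_200G, "typical case below the cap"],
  [24, 4_800G, "last day strictly below the cap"],
  [25, 5_000G, "exactly reaches the cap"],
  [26, 5_000G, "first day the cap holds the penalty down"],
  [60, 5_000G, "far beyond the cap"]
]
rows.each { days, expected, reason ->
  assert policy.penalty(days) == expected : reason
}

// The same structure as a relationship: steps of 200 until day 25, then none.
(0..<60).each { day ->
  def step = policy.penalty(day + 1) - policy.penalty(day)
  assert step == (day < 25 ? 200G : 0G)
}
```

A row with no reason to give is a candidate for deletion. A missing reason, such as "exactly reaches the cap", points to a row worth adding.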

Three questions for every test

When reviewing a test, ask:

1. What is load-bearing?

Which details are necessary for the behavior under test? If a specific value appears in the given: block, it should have a visible reason in the then: block or on the path to the outcome. Keep those details. Remove, default, or soften the rest.

2. What defect would this test catch?

If the answer is vague, the test probably does not explain its own purpose clearly enough.

3. What should be allowed to change?

A good test should fail when the rule breaks. It should not fail merely because a harmless implementation detail changed.
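
For the first question, one common way to default the details that are not load-bearing is a small test-data helper. The sketch below is an illustration under assumptions: CitizenProfiles.aCitizen is a hypothetical helper, and it returns a plain map instead of the article's CitizenProfile class so that it runs on its own in plain Groovy.

```groovy
// Hypothetical test-data helper: every field gets a harmless default,
// so a test only spells out the values its rule depends on.
class CitizenProfiles {
  static Map aCitizen(Map overrides = [:]) {
    def defaults = [
      cpr: "0000000000",
      firstName: "Any",
      lastName: "Person",
      previousMunicipality: "0851",
      currentMunicipality: "0851"
    ]
    return defaults + overrides
  }
}

// A test about relocation names only the municipalities.
def mover = CitizenProfiles.aCitizen(previousMunicipality: "0851", currentMunicipality: "0101")
assert mover.previousMunicipality != mover.currentMunicipality

// Incidental fields are still present, but no test has to mention them.
assert mover.firstName == "Any"

// The default profile has not moved, which suits a rejection test.
def stayer = CitizenProfiles.aCitizen()
assert stayer.previousMunicipality == stayer.currentMunicipality
```

With a helper like this, any value that still appears in a test body is there because the test chose to mention it, which makes it easier to ask what is load-bearing.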

That is where Deutsch’s idea earns its place in day-to-day testing. “Hard to vary” is not a slogan about writing more assertions. It is a discipline for choosing the right assertions.

A test becomes stronger when every strict detail has a reason, every irrelevant detail is absent, and the remaining code explains the behavior it protects.