Five incubators, fifty runs, one requirement — a textbook one-factor-at-a-time experiment that points straight at the wrong fix, unless you check what it was actually measuring.
An incubator must hold its payload at 35 ± 1 °C. The acceptance test: pre-heat to steady state, drop in a cold, well-plate-sized aluminum block instrumented with ten thermocouples around the rim, and the plate — the thing the biology actually sits on — must be inside the 34–36 °C band within 60 seconds of insertion.
Bench feel said the control loop was sluggish, so we ran a clean OFAT on the controller's proportional gain P — five gain levels, two replicates, across five production units (50 runs total, to separate part-to-part from run-to-run). The going-in hypothesis: crank P and we pass. Every unit failed. Why — and what do we actually change?
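The shape of that campaign can be written down before any data exists. Here is a minimal sketch of the run index, assuming hypothetical unit labels U1 to U5 and evenly spaced gain levels (this write-up only gives the sweep's endpoints, 2 and 12). The shuffled run order is a common precaution added here, not something the original campaign is stated to have done.

```python
import random
from itertools import product

# Assumed for illustration: evenly spaced gain levels and unit labels U1..U5.
GAINS = [2.0, 4.5, 7.0, 9.5, 12.0]
UNITS = ["U1", "U2", "U3", "U4", "U5"]
REPLICATES = 2


def build_run_index(gains, units, replicates, seed=0):
    """Full crossing of unit x gain x replicate, returned in shuffled run order."""
    runs = [
        {"unit": unit, "p_gain": gain, "replicate": rep}
        for unit, gain, rep in product(units, gains, range(1, replicates + 1))
    ]
    # Shuffle so slow drift (ambient, warm-up) is not confounded with gain level.
    random.Random(seed).shuffle(runs)
    for run_id, run in enumerate(runs, start=1):
        run["run_id"] = run_id
    return runs


run_index = build_run_index(GAINS, UNITS, REPLICATES)
print(len(run_index), "runs")
```

Every (unit, gain) cell appears twice, which is what lets the analysis separate part-to-part variation from run-to-run variation.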
Hand Claude the raw thermocouple logs and the run index. It joins them, computes time-to-spec for every run, plots the response against the factor, and renders the pass/fail call — the mechanical core of a designed experiment, done in minutes. The part it can't do for you is the part that matters: deciding whether the experiment was valid in the first place.
The panel below runs the actual test rig in your browser — a PI control loop driving a fast air node coupled to a slow, heavy plate — and generates the full 50-run campaign with realistic part-to-part variation and measurement noise. Download the raw data and the experiment index, hand them to your own Claude, and ask for the OFAT analysis and a verdict. Then audit it with the ladder in §4.
Two things jump out. First, the plate's time-to-spec sits stubbornly around 100 seconds at every gain — the OFAT curve is flat. Drag the P-gain slider in the what-if and watch the plate curve barely twitch. Second, the on-board RTD recovers into the band in ~10 seconds and holds; if you graded on the RTD you would have shipped. The requirement is about the plate. We were watching the wrong sensor and turning the wrong knob.
A solver announces a bug by leaking energy. A spend report announces one when the totals don't reconcile. A designed experiment has its own oracle: validity. A sound experiment doesn't just hand you a number — it tells you when you measured the wrong thing or varied the wrong factor. These four rungs are how this OFAT confesses.
The deepest point: this OFAT was executed flawlessly — balanced, replicated, low-noise — and it was still about to send us to retune a gain that does nothing. Clean execution is not validity.
The on-board RTD is right there in the data and it tells a happy story. The requirement is on the plate. Before any analysis, ask: which column corresponds to the quantity the spec is written against? Compute time-to-spec on that — and on the coldest thermocouple, since the spec must hold everywhere on the plate, not on average.
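One way to make that check concrete is sketched below. It assumes time-to-spec means the earliest time after which every graded channel stays inside 34–36 °C for the rest of the log, which is one reasonable reading of the requirement rather than the rig's confirmed definition. The traces are synthetic exponentials with made-up time constants, chosen only to mimic a slow plate and a fast RTD.

```python
import math


def time_to_spec(times, channels, lo=34.0, hi=36.0):
    """Earliest time after which ALL channels stay inside [lo, hi] to the end of the log.

    Returns None if the final sample is out of band.
    """
    entry = None
    for i, t in enumerate(times):
        if all(lo <= ch[i] <= hi for ch in channels):
            if entry is None:
                entry = t      # start of a candidate in-band stretch
        else:
            entry = None       # left the band, so discard the candidate
    return entry


# Synthetic data (assumed): 1 Hz log, plate starts at 20 C, RTD dips to 32 C.
times = [float(t) for t in range(0, 201)]
plate_tcs = [
    [35.0 - 15.0 * math.exp(-t / (30.0 + 0.75 * k)) for t in times]
    for k in range(10)  # ten rim thermocouples with slightly different lags
]
rtd = [35.0 - 3.0 * math.exp(-t / 9.0) for t in times]

plate_tts = time_to_spec(times, plate_tcs)
rtd_tts = time_to_spec(times, [rtd])

for name, tts in [("plate (all 10 TCs)", plate_tts), ("on-board RTD", rtd_tts)]:
    verdict = "PASS" if tts is not None and tts <= 60.0 else "FAIL"
    print(f"{name}: time-to-spec = {tts} s -> {verdict}")
```

Requiring every channel to be in band means the slowest (coldest) thermocouple sets the plate's result during heat-up, while the same function applied to the RTD alone returns the flattering number.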
A flat response curve is information, not a disappointment. It says the variance in your outcome lives somewhere other than the factor you swept. When sweeping P from 2 to 12 doesn't move time-to-spec, stop tuning P and go hunting for what the plate's heat-up actually depends on.
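A flat curve can be stated as numbers instead of judged by eye. The sketch below compares the spread of the gain-level means against the pooled within-level scatter and fits a least-squares slope. The 50 data points are synthetic and deterministic (hypothetical unit and replicate offsets around 100 s, with a deliberately tiny gain effect); they are not the campaign's data.

```python
from collections import defaultdict
from statistics import mean


def gain_effect(runs):
    """runs: iterable of (p_gain, time_to_spec_s). Returns a summary dict."""
    by_gain = defaultdict(list)
    for gain, tts in runs:
        by_gain[gain].append(tts)
    level_means = {g: mean(v) for g, v in sorted(by_gain.items())}

    # How far the swept factor moves the response, level mean to level mean.
    effect_range = max(level_means.values()) - min(level_means.values())

    # Pooled within-level SD: scatter that the gain does NOT explain.
    ss = sum((x - level_means[g]) ** 2 for g, v in by_gain.items() for x in v)
    dof = sum(len(v) - 1 for v in by_gain.values())
    pooled_sd = (ss / dof) ** 0.5

    # Ordinary least-squares slope of time-to-spec against gain.
    g_bar = mean(g for g, _ in runs)
    y_bar = mean(y for _, y in runs)
    slope = (sum((g - g_bar) * (y - y_bar) for g, y in runs)
             / sum((g - g_bar) ** 2 for g, _ in runs))
    return {"level_means": level_means, "effect_range": effect_range,
            "pooled_sd": pooled_sd, "slope": slope}


# Synthetic, deterministic stand-in for the 50 runs (all values assumed).
gains = [2.0, 4.5, 7.0, 9.5, 12.0]
unit_offsets = [-4.0, -2.0, 0.0, 2.0, 4.0]   # part-to-part
rep_offsets = [-0.5, 0.5]                    # run-to-run
runs = [(g, 100.0 + u + r - 0.1 * (g - 7.0))
        for g in gains for u in unit_offsets for r in rep_offsets]

summary = gain_effect(runs)
print(summary)
```

When the effect range is smaller than the within-level scatter and the slope implies seconds of improvement where tens are needed, the sweep has ruled the factor out.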
OFAT only ever tells you about the factors you chose to vary. The plate's heat-up is governed by the driving temperature and the air-to-plate coupling — neither of which is the controller gain. The setpoint was never in the experiment, so the experiment could never find it. When the swept factor is flat, the answer is usually a factor you didn't include.
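A back-of-envelope model shows why the gain drops out. Suppose, as an idealization, that the control loop holds the air at setpoint and the plate warms as a first-order lag with time constant tau. The time to reach the lower limit is then tau · ln((T_air − T_start) / (T_air − T_lower)), and no controller gain appears in it. The values below (tau = 37 s, plate starting at 20 °C) are assumptions picked to land near the ~100 s observed above, not measured rig parameters.

```python
import math


def time_to_lower_limit(tau_s, plate_start_c, air_temp_c, lower_limit_c=34.0):
    """Seconds for a first-order plate to warm from plate_start_c to lower_limit_c
    while the surrounding air is held at air_temp_c (idealized, lossless)."""
    if plate_start_c >= lower_limit_c:
        return 0.0
    if air_temp_c <= lower_limit_c:
        return math.inf  # the plate can never reach the limit
    return tau_s * math.log(
        (air_temp_c - plate_start_c) / (air_temp_c - lower_limit_c)
    )


TAU_S = 37.0          # assumed plate time constant (plate heat capacity / coupling)
PLATE_START_C = 20.0  # assumed temperature of the cold block at insertion

for air_c in (35.0, 36.0, 38.0):
    t = time_to_lower_limit(TAU_S, PLATE_START_C, air_c)
    print(f"air held at {air_c:.0f} C -> plate reaches 34 C after {t:.1f} s")
```

Under these assumptions the plate needs roughly 100 s with the air at 35 °C and about 56 s with the air at 38 °C. Only the driving temperature and the time constant appear in the formula, which is consistent with a flat P sweep.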
Two free levers worth setting. Turn the reasoning effort up for the hard part with Claude Code's /effort (see “Feed it documents” for the model and effort controls); a transcription wants it low, an analysis like this one wants it high. And end your prompt with an explicit self-check, such as “before you finish, confirm the verdict reads the spec'd gauge and holds across all units”, which is exactly why the prompt above asks Claude to verify itself. Naming the oracle is the highest-value line in the prompt. And keep the expensive model's context light: route transcription and formatting to cheaper tools (see “Spend tokens well”).
Hit “apply boost profile” in the panel. Instead of retuning P, we feed forward on the known disturbance: when the cold plate goes in, briefly raise the setpoint to ~38 °C to pour heat into the plate, then settle back to 35 °C once it's up. The coldest thermocouple now clears 34 °C inside the 60-second window and the plate settles mid-band, passing both the speed and the steady-state halves of the spec, which no value of P could do.
The honest footnote: a fixed setpoint boost overshoots the upper limit (slide the setpoint up and watch the warmest TC leave the band), which is why the real answer is a setpoint profile or plate-referenced control — not a single magic number. The OFAT's job was never to hand us that answer; it was to prove P wasn't it, and point us at the setpoint. It did.
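The trade-off between a fixed boost and a profile can be sketched with the same first-order idealization: the air is assumed to track the setpoint instantly and each thermocouple location is a simple lag. The time constants (37 s for the coldest spot, 30 s for the warmest) and the 58-second boost duration are hypothetical values for illustration, not the panel's actual profile.

```python
def simulate_plate(tau_s, setpoint_at, start_c=20.0, t_end=200.0, dt=0.1):
    """Forward-Euler integration of dT/dt = (setpoint(t) - T) / tau."""
    times, temps = [], []
    temp = start_c
    for i in range(int(round(t_end / dt)) + 1):
        t = i * dt
        times.append(t)
        temps.append(temp)
        temp += dt * (setpoint_at(t) - temp) / tau_s
    return times, temps


def first_time_at_or_above(times, temps, level):
    """First logged time at which the trace is at or above level, else None."""
    for t, temp in zip(times, temps):
        if temp >= level:
            return t
    return None


def no_boost(t):
    return 35.0


def fixed_boost(t):
    return 38.0


def boost_profile(t, boost_seconds=58.0):
    # Assumed profile: hold 38 C for boost_seconds, then return to 35 C.
    return 38.0 if t < boost_seconds else 35.0


TAU_COLDEST_S, TAU_WARMEST_S = 37.0, 30.0  # assumed spread across the plate

results = {}
for name, schedule in [("no boost", no_boost),
                       ("fixed boost", fixed_boost),
                       ("boost profile", boost_profile)]:
    t_cold, temp_cold = simulate_plate(TAU_COLDEST_S, schedule)
    _, temp_warm = simulate_plate(TAU_WARMEST_S, schedule)
    results[name] = {
        "coldest_reaches_34_s": first_time_at_or_above(t_cold, temp_cold, 34.0),
        "warmest_peak_c": max(temp_warm),
        "coldest_final_c": temp_cold[-1],
    }

for name, r in results.items():
    print(name, r)
```

In this sketch the unboosted plate misses the 60-second window, the fixed 38 °C setpoint gets there in time but drives the warmest spot past 36 °C, and the timed profile does both jobs. The workable boost duration depends on how far apart the fastest and slowest spots are, which is one reason plate-referenced control is the sturdier answer.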
Once you've drawn the conclusion, the same two moves apply; see “explore in HTML, deliver in PDF” for the one-time setup. For this campaign, the two prompts: