← AI-pair numerics

Decision point: the OFAT that fingered the wrong knob

Five incubators, fifty runs, one requirement — a textbook one-factor-at-a-time experiment that points straight at the wrong fix, unless you check what it was actually measuring.

1. Setup

An incubator must hold its payload at 35 ± 1 °C. The acceptance test: pre-heat to steady state, drop in a cold, well-plate-sized aluminum block instrumented with ten thermocouples around the rim, and the plate — the thing the biology actually sits on — must be inside the 34–36 °C band within 60 seconds of insertion.

Bench feel said the control loop was sluggish, so we ran a clean OFAT on the controller's proportional gain P — five gain levels, two replicates, across five production units (50 runs total, to separate part-to-part from run-to-run). The going-in hypothesis: crank P and we pass. Every unit failed. Why — and what do we actually change?

2. Why this is twenty minutes, not a week

Hand Claude the raw thermocouple logs and the run index. It joins them, computes time-to-spec for every run, plots the response against the factor, and renders the pass/fail call — the mechanical core of a designed experiment, done in minutes. The part it can't do for you is the part that matters: deciding whether the experiment was valid in the first place.

3. Build it

The panel below runs the actual test rig in your browser — a PI control loop driving a fast air node coupled to a slow, heavy plate — and generates the full 50-run campaign with realistic part-to-part variation and measurement noise. Download the raw data and the experiment index, hand them to your own Claude, and ask for the OFAT analysis and a verdict. Then audit it with the ladder in §4.

generating 50-run campaign… 5 units × 5 gains × 2 reps
The OFAT result — time to enter the 34–36 °C band
What-if — re-run one unit with knobs you choose
Loading…

Two things jump out. First, the plate's time-to-spec sits stubbornly around 100 seconds at every gain — the OFAT curve is flat. Drag the P-gain slider in the what-if and watch the plate curve barely twitch. Second, the on-board RTD recovers into the band in ~10 seconds and holds; if you graded on the RTD you would have shipped. The requirement is about the plate. We were watching the wrong sensor and turning the wrong knob.

4. Verification — experimental validity is the oracle

A solver announces a bug by leaking energy. A spend report announces one when the totals don't reconcile. A designed experiment has its own oracle: validity. A sound experiment doesn't just hand you a number — it tells you when you measured the wrong thing or varied the wrong factor. These four rungs are how this OFAT confesses.

Rung 1 — gauge relevance. Does your measurement correspond to the requirement? The spec is on the plate; the RTD reads the air near the heater.
Pass: the gauge measures the spec'd quantity.
Fail here: RTD passes while the plate fails — the sensor you trusted does not measure the thing under requirement. Every downstream conclusion is suspect until this is fixed.
Rung 2 — factor causality. Did the factor you varied actually move the response?
Pass: the response slopes with the factor — you're on a real lever.
Fail here: the plate's time-to-spec is flat across all five P gains. A knob that doesn't move the needle is not the knob. The flat curve is the data telling you to look elsewhere.
Rung 3 — consistency across units. Is the result systemic or a fluke part? This is why we ran five incubators, not one.
Pass: all five units fail the same way — it's the design, not a bad build. (If one unit fails alone, that's a manufacturing escape, a different action entirely.)
Rung 4 — find the real lever. Search the actual design space, not the one you walked in assuming.
Pass: the setpoint moves the plate where P could not. Raise it and the plate clears 34 °C well inside 60 s — though a constant boost overshoots the upper limit at steady state, which points at a setpoint profile (or controlling on plate temperature directly) as the production fix.

The deepest point: this OFAT was executed flawlessly — balanced, replicated, low-noise — and it was still about to send us to retune a gain that does nothing. Clean execution is not validity.

5. Hints

Hint 1 — grade on the requirement, not the convenient signal

The on-board RTD is right there in the data and it tells a happy story. The requirement is on the plate. Before any analysis, ask: which column corresponds to the quantity the spec is written against? Compute time-to-spec on that — and on the coldest thermocouple, since the spec must hold everywhere on the plate, not on average.

Hint 2 — read the shape of the OFAT, not just the pass/fail

A flat response curve is information, not a disappointment. It says the variance in your outcome lives somewhere other than the factor you swept. When P from 2 to 12 doesn't move time-to-spec, stop tuning P and go hunting for what the plate's heat-up actually depends on.

Hint 3 — the gotcha (a perfect experiment, wrong scope)

OFAT only ever tells you about the factors you chose to vary. The plate's heat-up is governed by the driving temperature and the air-to-plate coupling — neither of which is the controller gain. The setpoint was never in the experiment, so the experiment could never find it. When the swept factor is flat, the answer is usually a factor you didn't include.

Hint 4 — what to ask for
PromptJoin raw_data.csv to experiment_index.csv on run_id. For each run compute time for the COLDEST thermocouple to enter and stay in [34,36] C, and separately for the RTD. Plot both against P gain, one series per incubator. Tell me: (a) does P move the plate time? (b) does the RTD verdict agree with the plate verdict? (c) is the failure consistent across all five units?
Hint — tune the collaborator

Two free levers worth setting. Turn the reasoning effort up for the hard part — Claude Code's /effort (see Feed it documents for the model and effort controls); a transcription wants it low, an analysis like this one wants it high. And end your prompt with an explicit self-check — “before you finish, confirm the verdict reads the spec'd gauge and holds across all units” — which is exactly why the prompt above asks Claude to verify itself. Naming the oracle is the highest-value line in the prompt. And keep the expensive model's context light — route transcription and formatting to cheaper tools (see Spend tokens well).

6. Where to draw the line

Let Claude do the joining, the per-run time-to-spec extraction, the plotting, the variance bookkeeping — that's exactly the tireless, error-prone clerical work it's good at, and the reconciliation-style checks scale across all 50 runs. But the validity calls are yours: is this gauge the right gauge, is this factor causal, is the swept range even where the physics lives. Those are judgment, and they are precisely where a confidently-wrong analysis — clean data, tidy plot, decisive conclusion — will walk you off a cliff if you outsource them.

7. One worked solution

What good looks like

Hit apply boost profile in the panel. Instead of retuning P, we feed-forward on the known disturbance: when the cold plate goes in, briefly raise the setpoint to ~38 °C to pour heat into the plate, then settle back to 35 once it's up. The coldest thermocouple now clears 34 °C inside the 60-second window and the plate settles mid-band — passing both the speed and the steady-state halves of the spec, which no value of P could do.

The honest footnote: a fixed setpoint boost overshoots the upper limit (slide the setpoint up and watch the warmest TC leave the band), which is why the real answer is a setpoint profile or plate-referenced control — not a single magic number. The OFAT's job was never to hand us that answer; it was to prove P wasn't it, and point us at the setpoint. It did.

8. Going further

Ship it

Once you've drawn the conclusion, the same two moves — see explore in HTML, deliver in PDF for the one-time setup. For this campaign, the two prompts:

ExploreBuild a single self-contained HTML file (data inlined, no CDN links) to slice the 50 runs by incubator and P gain, overlay plate vs RTD time-to-spec, and toggle the 60-second deadline line.
DeliverDraft a Markdown test report — requirement, method, the OFAT result, the validity findings, and a pass/fail verdict with the recommended fix — and give me the pandoc command for a PDF via the Typst engine.