← AI-pair numerics

Spend report from a messy export

Someone drops a CSV of purchase orders on your desk and needs a spend summary by Monday. The skill here isn't the chart — it's knowing whether to trust it.

1. Setup

You have an export of Q1 purchase orders — one row per line item, with vendor, category, quantity, unit price, and line total. It came out of someone's accounting system and it is, like every real export, a little broken. You want: total spend, spend by category, top vendors, and one or two honest findings — a short report a manager can act on.

This is the daily-grind task Claude is quietly best at: parse a file, clean it, aggregate it, chart it, summarize it. No physics here. The twist is that the analysis has no closed-form answer to check against — so we need a different oracle.

2. Why this is twenty minutes, not an afternoon

Hand Claude the CSV and ask for a spend report. It parses, groups, totals, charts, and drafts the prose in one pass — the work that used to be an afternoon of pivot-table wrestling. Your job moves up a level: not making the report, but auditing it. A spreadsheet that adds up is not a correct analysis.

3. Build it

Download the dataset, hand it to Claude, and ask for: total spend, spend by category and by vendor, the month-over-month trend, and the three biggest line items. Then — this is the actual skill — demand that it reconcile the numbers before you believe them.

4. Verification — reconciliation is the oracle

In the physics problems on this site, a wrong solver announces itself: energy isn't conserved, the closed form doesn't match. Data analysis has the same property if you set it up right. The oracle here is reconciliation — numbers that must agree, agreeing. The panel below runs the live dataset. It starts raw and broken; toggle the cleaning steps and watch the checks resolve.

Loading dataset… ↓ download the CSV
Loading…

The instructive one is the last check. The largest line in the raw data — ten RF connectors at $4,500 each — is a 10× price typo (they're $450). It passes the arithmetic check perfectly: 10 × $4,500 does equal $45,000. Every obvious test is green. But it single-handedly makes Electronics look like your top spend category, and the moment you ask "does that finding survive removing the one biggest line?" — the robustness rung — the answer flips to Machined Parts. That's the analysis equivalent of a solver that conserves energy and is still wrong.

Rung 1 — cross-foot. Spend totalled by category must equal spend totalled by vendor must equal the raw line-item sum. They're the same dollars sliced two ways.
Pass: all three breakdowns hit the same grand total.
Fail diagnosis: a grouping silently dropped rows (here, the blank-category lines vanish from the category pivot).
Rung 2 — row integrity. Row count equals unique key count. If they differ, you have duplicates inflating every total.
Pass: every PO ID appears once.
Fail diagnosis: a duplicated line is double-counting (here, PO-1042 appears twice).
Rung 3 — internal consistency. Derived columns must agree: line_total should equal qty × unit_price.
Pass / flagged: mismatches surfaced for human review — you flag, you don't silently overwrite financial data.
Rung 4 — robustness. Does the headline finding survive removing the single largest contributor? If one row changes your conclusion, that row had better be real.
Pass: top category is stable to dropping the biggest line.
Fail diagnosis: a single outlier (often a typo) is driving the result.

5. Hints

Hint 1 — clean before you count

The order matters: dedupe, normalize text keys (vendor names with inconsistent spelling won't group), handle blanks explicitly (decide whether they're dropped or bucketed as "Uncategorized" — don't let the tool decide silently). Only then aggregate.

Hint 2 — demand the cross-foot

Ask Claude to total the same money two independent ways (by category and by vendor) and confirm they match the raw sum. This one check catches the most common and most invisible bug: a group-by that quietly excludes nulls. If the two breakdowns disagree, the gap is the dropped rows.

Hint 3 — the gotcha (consistent ≠ correct)

The arithmetic check (line_total = qty × price) catches sloppy data entry, but it cannot catch a wrong number that's internally consistent — a price with an extra zero still multiplies correctly. That class of error only falls to the robustness check: perturb the input (drop the biggest row, the newest month, the top vendor) and see whether the conclusion holds.

Hint 4 — what to ask for
PromptLoad this CSV. Before reporting anything: 1. Report row count vs unique PO count. 2. Total spend by category AND by vendor; confirm both equal the raw line-item sum. 3. List any rows where line_total != qty*unit_price. 4. Show me the top category with, and without, the single largest line item. Then write the summary.
Hint — tune the collaborator

Two free levers worth setting. Turn the reasoning effort up for the hard part — Claude Code's /effort (see Feed it documents for the model and effort controls); a transcription wants it low, an analysis like this one wants it high. And end your prompt with an explicit self-check — “before you finish, reconcile every total against an independent count” — which is exactly why the prompt above asks Claude to verify itself. Naming the oracle is the highest-value line in the prompt. And keep the expensive model's context light — route transcription and formatting to cheaper tools (see Spend tokens well).

6. Where to draw the line

You will not hand-audit 10,000 rows, and you shouldn't try. Build the reconciliation checks once — cross-foot, row count, internal consistency, robustness — and let them run on every report. That's the leverage: the checks scale, your attention doesn't. But do spend your scarce human attention on the robustness flag, because that's the one that catches the confident, internally-consistent, completely-wrong answer. The machine reconciles; you adjudicate the outlier.

7. One worked solution

What good looks like

The panel above is the worked solution: it parses the same CSV, runs all four reconciliation rungs, and regenerates the findings text live as you clean. With every toggle on, the report is honest: Machined Parts leads (driven by genuine CNC work), not Electronics (an artifact of one typo). Note that Rung 3 stays amber even when everything else is green — that's correct. You flag the two line-total mismatches for a human; you don't silently rewrite accounting records.

Everything here is plain browser JavaScript — no library, no backend. The point isn't the code; it's the discipline of making the numbers prove themselves before you put them in front of a manager.

8. Going further

Ship it

You have an analysis you trust. Now you either explore it or deliver it — see explore in HTML, deliver in PDF for the one-time setup. For this dataset, the two prompts:

ExploreBuild a single self-contained HTML file (data inlined, no CDN links) that lets me filter the purchase orders by vendor and category, sort the table, and see spend-by-category as a live bar chart.
DeliverDraft a one-page Markdown brief — two-sentence summary, findings table, embedded SVG bar chart — and give me the pandoc command to render it to PDF with the Typst engine.