A datasheet PDF dragged into every turn of a conversation is read, and paid for, every turn. Transcribe it once into clean text — Markdown for prose, YAML for structured data — and work from the lean copy. Cheaper, faster, and easier to check.
Transcribe each document once into data/, then point Claude at
that. Run the transcription in a second terminal on the cheaper
Sonnet model, so the heavy read never touches your main session.
A PDF is a layout format, not a text format. Pulling one into context drags along page furniture, repeated headers, OCR noise, and — for a scanned drawing — image data. A 40-page datasheet can be tens of thousands of tokens, and if it sits in the conversation it's re-read on every single turn. You usually need about thirty numbers off it. Paying for the other 39 pages, repeatedly, is the waste.
The cleanest way to keep that heavy read out of your main session isn't a
clever prompt — it's a second window. Open another
terminal, start Claude there on the cheaper Sonnet model,
and give it one job: read the document and write a faithful transcription
into data/. Your engineering session — on Opus —
never sees the raw PDF.
This is where Claude Code's slash commands and model selection earn their keep. Launch the worker straight onto Sonnet:
claude --model sonnet
(in a second terminal)
— or switch a session that's already open with the /model command:
/model sonnet
(in Claude Code)
Then turn the reasoning effort down. Transcription is a fidelity task, not a thinking task — you want the model copying digits, not pondering them, and cranking up effort just spends tokens and time for no extra accuracy:
/effort low
(in the Sonnet worker)
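Two wins, and they're the whole point: the heavy read runs on the cheaper model, and the raw PDF never enters your main session's context.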
You don't need a programmatic multi-agent system for this — two terminals is the technique, and it's the right amount of machinery for a one-off read. Hand the worker its instructions:
Transcribe data/bearing-6004.pdf into data/bearing-6004.yaml. Capture every dimension, load rating, and limiting speed as structured key/value fields with units. Transcribe faithfully (do not summarize or round) and flag anything illegible rather than guessing.
Once you've done this a couple of times, save the instructions as a project
skill — a custom slash command — so the worker
pins the right model and effort for you. Drop a file at
.claude/skills/transcribe-pdf/SKILL.md:
---
name: transcribe-pdf
description: Transcribe a PDF into faithful Markdown or YAML
model: sonnet
effort: low
disable-model-invocation: true
allowed-tools: Read, Write
---
Transcribe the document at $ARGUMENTS into data/.
Use YAML for specs and tables (named fields + units), Markdown for prose.
Transcribe faithfully; do not summarize or round.
Carry a `source:` reference (file + page).
Flag anything illegible rather than guessing.
Now the whole job is one line in the Sonnet window — model and effort already set, every time:
/transcribe-pdf data/bearing-6004.pdf
(in the Sonnet worker)
Pick the target format by what the document is:
| | Markdown (.md) | YAML (.yaml) |
|---|---|---|
| Best for | Prose: standards, manuals, procedures, narrative reports. | Structure: datasheet specs, BOMs, parameter tables, key/value data. |
| Keeps | Headings, paragraphs, lists, the occasional table. | Named fields, units, nesting — machine-queryable. |
| You then | Read it, quote it, cite a section. | Load it, compute on it, diff it across revisions. |
A datasheet becomes a handful of fields you can compute against:
part: 6004
type: deep_groove_ball_bearing
bore_mm: 20
outer_dia_mm: 42
width_mm: 12
dynamic_load_C_kN: 9.36
static_load_C0_kN: 5.00
limiting_speed_rpm: 30000
source: { doc: bearing-6004.pdf, page: 1 }
# file: data/bearing-6004.yaml
Now “compute L10 life for this bearing” reads three fields
instead of re-parsing a PDF — and the source line tells
you exactly where to go verify it. A spec manual, by contrast, stays prose:
# Acceptance Test Procedure - CW-2 Skid
## 3.2 Vibration limit
Overall velocity shall not exceed **4.5 mm/s RMS** measured at the bearing housing per ISO 10816-3 for rigid mounting...
(file: data/atp-section3.md)
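Returning to the bearing record, here is a minimal sketch of what a life calculation against the lean copy could look like. It uses the standard basic rating life form, L10 = (C/P)^p in millions of revolutions, with p = 3 for ball bearings. The applied load and shaft speed are hypothetical values chosen for illustration, and the small parser is a stand-in that only handles flat `key: value` lines; a real project would more likely load the file with a YAML library.

```python
# Minimal sketch: basic rating life from the transcribed fields.
# Assumptions (not from the source document): the applied load and shaft
# speed are hypothetical, and the parser handles flat "key: value" lines only.

BEARING_YAML = """\
part: 6004
type: deep_groove_ball_bearing
dynamic_load_C_kN: 9.36
static_load_C0_kN: 5.00
limiting_speed_rpm: 30000
"""


def parse_flat_yaml(text):
    """Parse flat 'key: value' lines; numbers become floats, the rest stay strings."""
    fields = {}
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        try:
            fields[key.strip()] = float(value)
        except ValueError:
            fields[key.strip()] = value
    return fields


def l10_million_revs(c_kn, p_kn, exponent=3.0):
    """Basic rating life L10 = (C / P)^p, in millions of revolutions."""
    return (c_kn / p_kn) ** exponent


def l10_hours(c_kn, p_kn, speed_rpm, exponent=3.0):
    """Convert L10 from millions of revolutions to operating hours."""
    return l10_million_revs(c_kn, p_kn, exponent) * 1e6 / (60.0 * speed_rpm)


bearing = parse_flat_yaml(BEARING_YAML)

# Ball bearings use p = 3; roller bearings use p = 10/3.
exponent = 3.0 if "ball" in bearing["type"] else 10.0 / 3.0

applied_load_kn = 1.0   # hypothetical equivalent dynamic load P
shaft_speed_rpm = 3000  # hypothetical operating speed

life_mrev = l10_million_revs(bearing["dynamic_load_C_kN"], applied_load_kn, exponent)
life_h = l10_hours(bearing["dynamic_load_C_kN"], applied_load_kn, shaft_speed_rpm, exponent)
print(f"L10 = {life_mrev:.1f} million rev, {life_h:.0f} h at {shaft_speed_rpm} rpm")
```

Every input traces to a named field or to a labeled assumption, so a suspicious result can be checked against the `source:` page instead of a re-parsed PDF.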
Keep the original PDF in data/, carry a source reference in the
transcription, and spot-check any number you're betting the design on
against the page it came from. Same habit as the rest of this site: the
document is the oracle; the transcription is a convenience.
Transcribing is half the job; feeding the lean copy well is the other half. Three habits keep a long session sharp:
Keep a NOTES.md in the
notes/ folder with the numbers
and decisions pulled so far. The next session reads the note instead of
re-reading the source — persistent memory for a few hundred tokens.
And the effort knob cuts both ways: transcription wants
/effort low, but the analysis it feeds usually wants it
high. Turn it up when the reasoning — not the reading —
is the hard part.
Rough arithmetic: a datasheet page is ~500–800 tokens of usable
text, more if scanned. Forty pages re-read across a dozen turns is
hundreds of thousands of token-reads. The YAML of the thirty numbers you
actually use is a few hundred tokens, read once and cached in a tiny file.
You're not just saving money — a lean, named dataset is something you
can git diff when the vendor issues a revision, which a PDF
blob never lets you do.
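The rough arithmetic above can be written out as a short sketch. The page count, turn count, and per-page range come from the text; the 300-token figure for the YAML is an assumption standing in for "a few hundred". The YAML is charged on every turn too, which is the conservative case.

```python
# Back-of-envelope token arithmetic using the figures quoted in the text.
# Assumption: "a few hundred tokens" for the YAML is taken as 300.

PAGES = 40
TURNS = 12
TOKENS_PER_PAGE = (500, 800)  # low and high estimate per page
YAML_TOKENS = 300             # assumed size of the lean transcription


def reread_cost(tokens_per_read, turns):
    """Total tokens read when the same content sits in context on every turn."""
    return tokens_per_read * turns


pdf_low, pdf_high = (reread_cost(PAGES * t, TURNS) for t in TOKENS_PER_PAGE)
yaml_total = reread_cost(YAML_TOKENS, TURNS)

print(f"PDF in context:  {pdf_low:,} to {pdf_high:,} token-reads")
print(f"YAML in context: {yaml_total:,} token-reads")
print(f"Ratio: roughly {pdf_low // yaml_total}x to {pdf_high // yaml_total}x")
```

Under these assumptions the PDF costs 240,000 to 384,000 token-reads against 3,600 for the YAML, and scanned pages would widen the gap further.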
This is the third rung of the trunk: after you've installed Claude and set up a project with git, getting your source documents into clean, diffable text is what makes the problem branches efficient. The bearing and tolerance-stack problems run entirely on numbers that, in real life, you'd transcribe off a datasheet or a drawing exactly this way.