Document parsing

Dated content — June 2026

This page names specific tools, models, and prices, which rotate quarterly. The selection logic is durable; the names are a snapshot. Cross-check the Model snapshot for current model names and pricing.

In one line: The unglamorous, quality-defining prerequisite to every RAG system. Bad parsing → bad chunks → bad retrieval → bad answers. Spend more time here than you think you should.

In plain English

A PDF is not text. It's a layout description — fonts, positions, vectors, sometimes raster images. Pulling clean text and the structure (headings, tables, lists, page numbers) out of one is genuinely hard. Document parsing is the pipeline that turns "blob.pdf" into "here is a list of clean text chunks, each with its heading and source page." Get this wrong and every other RAG decision is downstream of broken inputs.

The major options (2026)

Tool	Type	Strengths	$ shape	Best for
unstructured (OSS + API)	Hybrid	Broadest format support	OSS / $0.01 per page hosted	General-purpose default
LlamaParse	LLM-based hosted	Excellent tables, layout	$0.003–$0.03/page	Complex PDFs, financials
Docling (IBM)	OSS	Strong PDF + tables, on-prem	free	Compliance, on-prem
Reducto	Hosted	Tables, forms, accuracy SOTA	$0.005–$0.02/page	Spreadsheets, dense docs
Omni	Hosted	LLM-vision based	$0.005+/page	Mixed layouts
Mistral OCR	Hosted	Multimodal LLM OCR	$1 / 1000 pages	Scanned docs, multilingual
AWS Textract	Cloud	Forms, tables, signatures	$0.0015–$0.05/page	AWS-native pipelines
GCP Document AI	Cloud	Tuned processors per doc type	tiered	GCP-native; insurance/legal
Azure Document Intelligence	Cloud	Pre-built models, forms	tiered	Azure-native; receipts/invoices
pypdf / pdfplumber / pdfium	OSS libs	Lightweight text extract	free	Simple, mostly-text PDFs
Marker	OSS	PDF → Markdown, fast	free	Markdown-target pipelines
Vision LLM (Claude / GPT / Gemini)	Hosted LLM	Tables, charts, hand-written	LLM $$	Edge cases nothing else handles

Default pick for most teams

unstructured (open-source library) for the common case, with a fallback to a vision LLM for the pages that look garbled. It handles PDF / DOCX / HTML / EPUB / images / emails / PPTX out of one API and is well-supported by LlamaIndex and LangChain.

When you hit complex tables or financial PDFs: LlamaParse or Reducto. Both reliably beat unstructured on heavily-tabled documents, and the price is fine at modest scale.

When to deviate

Mostly clean text PDFs (blog exports, books, papers): pypdf or pdfplumber — fast, free, sufficient.
Heavily tabular (10-K filings, lab reports, scientific tables): LlamaParse, Reducto, or a vision LLM.
Scanned / hand-written / multilingual OCR: Mistral OCR, Google Document AI, or Claude / Gemini vision.
Compliance / on-prem / air-gapped: Docling or self-hosted unstructured.
Already in AWS / GCP / Azure: the native service usually wins on integration.
You want Markdown out the other side for an LLM: Marker or LlamaParse in Markdown mode.

Minimum integration

unstructured — the workhorse:

from unstructured.partition.auto import partition

elements = partition(filename="report.pdf")
for el in elements:
    print(el.category, "|", el.text[:80])
# Title | Q3 2025 Earnings Report
# NarrativeText | We exceeded expectations...
# Table | <table HTML preserved>
# ...

LlamaParse — when tables matter:

from llama_parse import LlamaParse

parser = LlamaParse(api_key=os.environ["LLAMA_CLOUD_KEY"], result_type="markdown")
documents = parser.load_data("financial-report.pdf")
print(documents[0].text)  # Markdown with tables, headings preserved

Vision LLM — the last-resort hammer:

import base64
img = base64.b64encode(Path("page-12.png").read_bytes()).decode()
r = client.messages.create(
    model="claude-sonnet-4-6",
    messages=[{"role": "user", "content": [
        {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": img}},
        {"type": "text", "text": "Extract all tables on this page as JSON arrays."},
    ]}],
)

Patterns worth knowing

Vision-based extraction is rapidly winning on hard pages. Giving the page image directly to Claude Sonnet or Gemini 2.5 Pro often beats traditional OCR + parsing on tables and forms — especially in 2026 with native vision.
Layout-aware chunking (keep table rows together, keep headings with their sections, never split mid-sentence) beats naive fixed-token chunking by a lot.
Hierarchical chunks (small chunks for embedding match, larger parent chunks returned to the LLM) is the 2026 best-practice default.
Store the original page reference alongside chunks so you can cite "page 12 of report.pdf" — a non-negotiable UX feature.
Parse once, store the structured output. Don't re-parse on every query.
Run a small "is this garbage?" check after parsing (length > threshold, percent of non-ASCII, etc.). Bad pages happen; catch them at ingest, not at query.

Pricing & cost notes

Document parsing is sneakily expensive at scale. A 1,000-page corpus:

unstructured OSS: free + compute (~$0–$5 on a small VM).
unstructured API: ~$10.
LlamaParse: ~$3–$30 depending on mode.
Reducto: ~$5–$20.
AWS Textract (tables/forms): ~$15–$50.
Vision LLM (one model call per page): $20–$100+ depending on model.

For a one-million-page enterprise corpus, parsing alone is often the biggest line item — bigger than embeddings, bigger than the LLM. Pick deliberately.

Pitfalls

Treating PDF as text. It isn't. A "blank-looking" PDF often has 6 invisible columns, footnotes embedded in body text, and tables stored as graphics.
Naive fixed-size chunking. Splitting "Total revenue was" | "$1.2B" across two chunks means the retriever finds half the answer.
Ignoring tables until they break things. Tables are where dollars, dates, and SKUs live. If your domain has tables, evaluate parsers on your tables.
No page-level provenance. Users want citations. If you didn't store the page number, you can't show it.
Re-parsing on every retrieval. Parse at ingest, store the result, never re-parse at query time.
Trusting one parser for every file type. A PDF parser does not parse Excel well; an HTML parser does not parse PowerPoint. Route by mime type.
No quality canary. Build a small labeled set of "we know what should come out of these 10 docs" and run it on every parser upgrade — catches silent regressions.

🤔 Quick checkQuick check

→ Next: Synthetic data tools

The major options (2026)​

Default pick for most teams​

When to deviate​

Minimum integration​

Patterns worth knowing​

Pricing & cost notes​

Pitfalls​