Where State of the Art Fails

mrkiouak@gmail.com · 2026-05-15
Where State of the Art Fails: Automating data extraction and analysis of PDFs in 2026.

It's 2026. Nvidia, semiconductor, and DRAM producer stocks have risen by hundreds of percent, and I naïvely assumed that state-of-the-art frontier LLMs from the major providers could accurately read a PDF. That assumption was wrong for every model I tried: Gemini, Claude, and ChatGPT each failed to approach 70% accuracy on a ground-truth eval and frequently got details wrong across a wide sample of PDFs.

Vermont town reports — what the document is

Vermont publishes its municipal finance in the open. Each town prints an annual town report ahead of Town Meeting Day — held the first Tuesday in March — and residents vote the operating budget, plus a slate of special articles, on the floor of that meeting. The town report is the document the vote is informed by. A typical 60–140-page report contains:

Layout varies — vector PDFs from Word or InDesign for roughly 80% of reports, scans for the remainder.

The first attempt: ask Gemini for the whole thing

I started with the obvious approach. Bind a Pydantic schema describing the entire budget document — funds, line items, tax rates, warrant articles, narrative reports — to a single Gemini 3.1 Pro call over the native PDF, set the thinking budget and output cap to Vertex's maximums, and ask the model to fill it in.
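To give a sense of what a whole-document schema looks like, here is a minimal sketch. It uses stdlib dataclasses as a stand-in for the Pydantic models so it runs without dependencies. Every class name, field name, and number below is an illustrative assumption, not the schema actually used, and tax rates and narrative reports are omitted for brevity.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LineItem:
    label: str            # row label as printed in the budget table
    fiscal_year: int
    amount: float         # dollars


@dataclass
class Fund:
    name: str
    printed_total: Optional[float] = None   # total as printed in the report
    line_items: List[LineItem] = field(default_factory=list)


@dataclass
class WarrantArticle:
    number: int
    text: str
    amount: Optional[float] = None


@dataclass
class TownReport:
    town: str
    fiscal_year: int
    funds: List[Fund] = field(default_factory=list)
    warrant_articles: List[WarrantArticle] = field(default_factory=list)


# Made-up values, only to show the nesting the model is asked to fill in.
report = TownReport(town="Warren", fiscal_year=2026)
report.funds.append(
    Fund(
        name="Example Fund",
        printed_total=150.0,
        line_items=[
            LineItem("Example item A", 2026, 100.0),
            LineItem("Example item B", 2026, 50.0),
        ],
    )
)
payload = asdict(report)  # nested dicts and lists, ready for json.dumps
```

The depth of the nesting is the point: one call has to emit every fund, every line item under it, and every article, all inside a single output budget.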

On Warren FY2026 — 78 pages, 28 funds, 248 line items in a hand-curated ground truth — Gemini returned six funds and 80 line items. Recall against ground truth: 31.1%. Output looked plausible at first glance: real fund names, real dollar amounts, the right schema. Roughly three-quarters of the document was missing from the output, silently.

Claude Opus 4.7 with the 1M-context beta failed the same task differently: it returned 33 fund records and then hit Anthropic's 32k output-token cap before producing any line items. GPT-5 wasn't tested.

The second attempt: locate sections, then extract them

The second attempt split the work in two. One Gemini Pro call classified each page — budget table, warrant article, departmental narrative, tax-rate disclosure — and emitted a section index. A second batch of per-section Pro calls then extracted line items inside each budget section.
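The step from per-page labels to a section index can be sketched as below. This is a rough illustration that assumes the classifier returns one label per page in page order; the label strings and the function name `build_section_index` are hypothetical.

```python
from itertools import groupby
from typing import Dict, List, Union

Section = Dict[str, Union[str, int]]


def build_section_index(page_labels: List[str]) -> List[Section]:
    """Collapse per-page labels into contiguous sections with 1-based page ranges."""
    sections: List[Section] = []
    next_page = 1
    for label, run in groupby(page_labels):
        run_length = len(list(run))
        sections.append(
            {
                "label": label,
                "first_page": next_page,
                "last_page": next_page + run_length - 1,
            }
        )
        next_page += run_length
    return sections


# Hypothetical classifier output for a five-page excerpt.
labels = [
    "narrative",
    "budget_table",
    "budget_table",
    "warrant_article",
    "budget_table",
]
index = build_section_index(labels)

# Only budget sections go on to the per-section extraction calls.
budget_sections = [s for s in index if s["label"] == "budget_table"]
```

Each entry in `budget_sections` would then drive one per-section extraction call over its page range.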

I ran this across a wide range of Vermont town reports. Surface output often looked correct on inspection, which delayed discovering how flawed the underlying extraction actually was. On Warren, scored against the same 830-fact ground truth: 28 funds, 248 line items in the JSON, 66.9% recall, and a 50% reconciliation pass rate — half the funds had a sum of extracted line items disagreeing with the fund's own printed total by more than 2%. End-to-end cost ran $30–50 per town for a full Pro-with-thinking sweep, and debug iterations on a single problem town added another $50–100.
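The reconciliation check is simple arithmetic. Here is a minimal sketch, assuming each fund is reduced to a list of extracted line-item amounts plus its printed total; the fund names and dollar figures are made up.

```python
import math
from typing import Dict, List, Tuple


def reconciles(line_items: List[float], printed_total: float,
               tolerance: float = 0.02) -> bool:
    """True when the line items sum to the printed total within +/- tolerance."""
    difference = abs(math.fsum(line_items) - printed_total)
    return difference <= tolerance * abs(printed_total)


def reconciliation_pass_rate(funds: Dict[str, Tuple[List[float], float]]) -> float:
    """Share of funds whose extracted line items reconcile to the printed total."""
    if not funds:
        return 0.0
    passed = sum(reconciles(items, total) for items, total in funds.values())
    return passed / len(funds)


# Made-up funds: name -> (extracted line items, total printed in the report).
funds = {
    "Fund A": ([100.0, 50.0], 150.0),  # sums exactly, so it passes
    "Fund B": ([40.0, 40.0], 100.0),   # a 20-dollar line is missing, so it fails
}
rate = reconciliation_pass_rate(funds)
```

A fund can score well on recall and still fail this check: one dropped or misread line is enough to push the sum outside the 2% band.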

The recall was usable. The dollars were not.

Pause: read what the literature said

By this point I had spent enough on Vertex credits to justify checking whether anyone who actually studies document understanding had a better idea.[1] The 2025–2026 literature converges on three findings:

The 2026 recommendation for mixed digital + scanned PDFs is therefore: per-page routing (embedded text layer for digital pages, rasterize + VLM for scans); a layout-aware parser emitting structured Markdown or HTML with preserved tables; structure-preserving chunking; and ColPali-style vision retrieval for figure-heavy corpora.
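The per-page routing rule can be sketched as below. It assumes the embedded text layer of each page has already been pulled out as a string by whatever PDF library is in use (that step is not shown). The 50-character threshold and the route names are assumptions for illustration, not values from the literature.

```python
from typing import List

MIN_TEXT_CHARS = 50  # assumed cutoff; tune against real digital and scanned pages


def route_page(text_layer: str, min_chars: int = MIN_TEXT_CHARS) -> str:
    """Pick a path for one page based on how much embedded text it carries."""
    visible_chars = sum(1 for ch in text_layer if not ch.isspace())
    return "text_layer" if visible_chars >= min_chars else "rasterize_vlm"


def route_document(page_texts: List[str]) -> List[str]:
    """Route every page of a document independently."""
    return [route_page(text) for text in page_texts]


# Hypothetical extracted text layers: one digital page and two scanned pages.
page_texts = [
    "General Fund  Budget FY2026 ..." * 3,  # digital page with real text
    "",                                     # scan with no text layer at all
    "  \n ",                                # scan with only whitespace debris
]
routes = route_document(page_texts)
```

Routing per page rather than per document matters for reports that mix digital pages with scanned inserts.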

Reproducing the hybrid

The literature's recipe — specialized parser for tables, frontier VLM for figures and forms, retrieval separated — maps to a three-stage pipeline for this use case:

The architecture is precisely the decoupled-VLM pattern the literature names — layout detection (the Pro recipe call) followed by per-region recognition (the per-fund classifier) — with an arithmetic-reconciliation loop bolted on. Result on Warren FY2026, scored against the 830-fact ground truth:

| Pipeline | Recall | Amount accuracy | Reconciliation pass | Lines | Cost per run | Wall |
|---|---|---|---|---|---|---|
| e2e-pro (attempt 1: single Gemini call) | 31.1% | | 15.6% | 80 | $1.01 | 115s |
| orig (attempt 2: section locator + per-section reads) | 66.9% | 65.9% | 50.0% | 248 | $30–50 | 30–90 min |
| mid-docai-layout (DocAI Layout Parser → one Pro call) | 84.8% | | 62.5% | 308 | $0.40 | 532s |
| mid-docai-gemini (DocAI Gemini-3 Layout Parser → one Pro call) | 91.1% | | 71.9% | 310 | $2.74 | 1072s |
| hyb-llamaparse-pro-flash (LlamaParse cells → hybrid downstream) | 0% [7] | | 0% | 0 | $1.23 | 105s |
| hyb-mistral-pro-flash (Mistral OCR + Pro recipe + per-fund classifier) | 96.9% | 96.0% | 78.1% (25/32) | 292 | $1.16 | 494s |
| hyb-reducto-pro-flash (Reducto OCR + Pro recipe + per-fund classifier) | 95.5% | 94.7% | 84.4% (27/32) | 385 [8] | $2.62 | 1509s |

Eval code and the Warren fixture are bundled at https://github.com/Rkiouak/mixed-pdf-extraction-eval. With GOOGLE_CLOUD_PROJECT and MISTRAL_API_KEY set, python -m eval.run_eval --towns warren --architectures hyb-mistral-pro-flash reproduces the headline row against the same 830-fact ground truth. Signup links for every other architecture's API are in the repo's README.

Takeaways

2026 definitely won't be the year of general artificial intelligence. As a software developer, I found it frankly shocking that the tool that can convert my natural-language instructions into Pulumi-orchestrated AWS and GCP infrastructure hosting largely LLM-generated code can't also read a budget table from a PDF. There are cases where LLMs have been very, very good at extracting narrative or bulleted, semi-structured natural-language and numeric data for me, but with these financial documents I found LLMs to be an utter failure, and the current academic literature backs this up.

Miscellanea

Two further results worth flagging. Google Document AI's Layout Parser feeding a single Pro call (mid-docai-layout) reaches 84.8% recall at $0.40 per run — cheap enough to justify on cost-sensitive backfills. The Gemini-3-backed DocAI parser (mid-docai-gemini) climbs to 91.1% but at $2.74 per run. The LlamaParse-fronted hybrid (hyb-llamaparse-pro-flash) emitted zero line items: LlamaParse's cell output composed poorly with the downstream Pro structuring call, the integration-boundary failure Applied AI's paper specifically warns about when single-vendor pipelines exceed their accuracy budget.

The recall numbers track the literature's prediction with no qualitative surprises. The headline gap — 96.9% on the SOTA-pattern hybrid versus 31.1% on a naive single-call versus 66.9% on a section-locator-plus-extractor pipeline — is the gap the 2026 surveys predicted for any single-vendor pipeline operating without a specialized table parser ahead of the VLM.

Metric definitions. Recall — atomic facts matched ÷ 830 ground-truth facts. Amount accuracy — matched facts where the candidate value is within ±$1 of ground truth ÷ 830. Reconciliation pass — % of funds where the sum of expenditure-flow line items at the target fiscal year equals the printed fund total within ±2%. Cost — actual API spend per Warren run, computed from each provider's billed usage rather than estimated. Wall — end-to-end seconds.
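The first two metrics can be sketched as below. The sketch assumes each atomic fact is keyed by (fund, line label, fiscal year) and counts as matched when the same key appears in the candidate output. The real matcher in the eval repo may be fuzzier, and all names and numbers here are made up.

```python
from typing import Dict, Tuple

FactKey = Tuple[str, str, int]  # (fund, line label, fiscal year): assumed key shape


def score(truth: Dict[FactKey, float], candidate: Dict[FactKey, float],
          dollar_tolerance: float = 1.0) -> Dict[str, float]:
    """Recall and amount accuracy, both divided by the ground-truth fact count."""
    if not truth:
        raise ValueError("ground truth must contain at least one fact")
    matched = [key for key in truth if key in candidate]
    within = [key for key in matched
              if abs(candidate[key] - truth[key]) <= dollar_tolerance]
    total = len(truth)
    return {
        "recall": len(matched) / total,
        "amount_accuracy": len(within) / total,
    }


# Made-up ground truth and extraction output.
truth = {
    ("Fund A", "Salaries", 2026): 1000.0,
    ("Fund A", "Supplies", 2026): 250.0,
    ("Fund B", "Gravel", 2026): 500.0,
    ("Fund B", "Fuel", 2026): 300.0,
}
candidate = {
    ("Fund A", "Salaries", 2026): 1000.4,  # matched, within one dollar
    ("Fund A", "Supplies", 2026): 2500.0,  # matched, but the amount is misread
    ("Fund B", "Gravel", 2026): 500.0,     # matched, exact
}
result = score(truth, candidate)
```

Because both metrics share the same denominator, amount accuracy can never exceed recall; the gap between the two is the share of facts that were found but carry the wrong dollar value.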


Footnotes

  1. The State of Document Understanding for Mixed PDFs in 2026 — a 2025–2026 survey covering benchmarks, end-to-end VLMs, modular pipeline systems, commercial APIs, and the architecture patterns now considered SOTA across RAG, semantic search, Q&A, and structured information extraction.

  2. OmniDocBench v1.5 (CVPR 2025; 1,355 pages, 9 document types): https://arxiv.org/abs/2412.07626, https://github.com/opendatalab/OmniDocBench. Composite scores: GLM-OCR (Zhipu/Z.ai, 0.9B, MIT) https://huggingface.co/zai-org/GLM-OCR = 94.62; PaddleOCR-VL-1.5 (0.9B, Apache-2.0) https://arxiv.org/abs/2601.21957 = 94.5; FireRed-OCR ≈ 92.94; MinerU 2.5 (1.2B, Apache-2.0) https://arxiv.org/abs/2509.22186 = top quartile; dots.ocr (1.7B, Apache-2.0) https://github.com/rednote-hilab/dots.ocr = 87.5 EN / 84.0 ZH; Gemini 3 Pro ≈ 90.33; GPT-5.2 ≈ 85.4.

  3. LlamaIndex, Feb 2026: OmniDocBench is saturated — what's next for OCR benchmarks? https://www.llamaindex.ai/blog/omnidocbench-is-saturated-what-s-next-for-ocr-benchmarks. olmOCR-Bench (Poznanski et al. 2025): https://github.com/allenai/olmocr/tree/main/olmocr/bench, paper https://arxiv.org/abs/2510.19817.

  4. MonkeyOCR SRR triplet: https://arxiv.org/abs/2506.05218. MinerU 2.5: https://arxiv.org/abs/2509.22186. PaddleOCR-VL: https://arxiv.org/abs/2601.21957. Chandra OCR 2 (Datalab): https://github.com/datalab-to/chandra.

  5. Reducto RD-TableBench dataset: https://huggingface.co/datasets/reducto/rd-tablebench. Comparison numbers per Reducto's published methodology https://reducto.ai/. TableFormer (TEDS 98.5 simple / 95.0 complex on PubTabNet): https://arxiv.org/abs/2203.01017.

  6. Applied AI, PDF Parsing Benchmark, June 2026: https://www.applied-ai.com/briefings/pdf-parsing-benchmark/. 17 parsers tested across 800+ real-world PDFs; Gemini 3 Pro topped the field at 88% edit similarity; no parser exceeded 88%; LlamaParse rated best price/quality at $0.003/page; parser accuracy varied 55+ points by domain.

  7. The same Pro+CompactDoc structuring call works correctly when fed by Mistral OCR 3 or Reducto, so the failure sits at the LlamaParse-to-Pro integration boundary rather than in either component alone.

  8. Reducto's higher line count includes ~5 funds detected on the library and PTO operating pages that the v1 Warren ground truth explicitly excludes from scope. Precision on the in-scope subset is comparable to Mistral.
