Green decoration

Extract Data from Documents

Document data extraction for Swiss back office, productized. PDFs, forms and scans become typed JSON against the schema your ERP, PIM or case-management already expects — with field-level confidence and role-gated HITL on the fields that need it.

Document Data Extraction Suite

Typed schema, not free-form prompts

Extraction is bound to a typed Category, SubCategory, FileBlock and Field schema. Each field has a type, a required-or-optional flag, and a downstream owner. The LLM is forced to fill that shape — not to write prose about the document. The ERP, PIM or case-management on the other side gets data it can validate before it stores it.

OpenAI JSON-mode against the schema

The extractor runs in OpenAI JSON mode and is constrained by the field definitions on disk. Missing fields stay null, typed fields stay typed, and the model is not allowed to invent keys. The same pattern is in production at Weita; the Insurance AI POC adds a config UI so operations can extend the schema without a developer release.

Two-pass OCR for difficult scans

Before extraction, scanned pages and image PDFs run through Mistral OCR in two passes — raw text first, then a structured pass that preserves tables and columns. Multi-column German invoices, handwritten Belege and noisy faxes feed the same downstream extractor as digital PDFs. No separate pipeline per format.

Field-level confidence and routing

Confidence is tracked per field, not per document. A delivery note with a clean date and an ambiguous total only sends the total to the reviewer — the rest writes through. The reviewer corrects the single value, the document closes, and the correction is captured against the prompt version that produced it.

Staged extraction pipeline with audit

Extraction is a staged Laravel job queue with explicit success and failure states for each stage — PrepareFile, RecognizeImages, ExtractMetadata, ExtractDetails, SaveFile, Notification. Every stage logs the input, the prompt version, the model, the output and the operator. Re-runs after a prompt change are deterministic against the same source file.

New families and fields as configuration

Adding a new document family or a new field on an existing one is admin work, not engineering work. Operations defines the FileBlock and its fields, ties the family to a downstream role and target system, and the next inbound document of that shape flows through. The Insurance AI POC ships this surface for the customer's own administrators.

How we deliver it

We start with a representative sample of your real documents and the target system that will store the extracted data. From that we draft the typed schema — categories, file blocks, fields, required vs optional, downstream owner per block. The schema is the contract for the rest of the project.

One family goes end-to-end first: two-pass OCR, JSON-mode extraction against the schema, field-level confidence, role-gated review and downstream write-out. We measure accuracy on your real backlog, not on a synthetic test set, and tune prompts against documents you have already paid people to process.

We tune the review surface to your roles and your risk appetite — which fields auto-write at which confidence, which fields always require sign-off, where escalation goes. Reviewer corrections feed the prompt registry, so the same mistake gets less common at each prompt version.

On hand-off we transfer the typed schema, the prompt registry, the role definitions and the runbook. Operations can add fields and document families without us; most customers keep us on a light retainer for prompt drift and new categories, others run it entirely in-house from month four.

Why this engine, not generic data extraction

Schema is the contract, prompts are the implementation

Free-form prompt extraction asks a model to describe a document and hopes the output stays parseable. We invert that: your typed schema is the contract on disk, JSON-mode binds the model to it, and the prompt registry holds the wording of how to ask. When extraction drifts, prompts change as data — not as a code release — and the schema stays stable for downstream consumers.

Confidence at the field, not the document

Document-level confidence forces an all-or-nothing review choice. We track confidence per field, so the reviewer sees only the values that actually need a human and the rest of the record writes through. That is what makes high-volume extraction sustainable for Swiss back-office teams.

Audit-grade by construction

Every extracted value carries its source page, its prompt version, its model identifier and the reviewer if a human touched it. The trail is on by default, not bolted on for a compliance review — and the provenance of each field is what makes regulated Swiss back-office automation something operations actually trusts.

Frequently Asked Questions

  • OCR converts pixels into characters. Classification decides what the document is. Extraction lifts the specific typed values out of it — invoice total, due date, claim reference, supplier ID — and writes them into the field on the target system. All three usually sit in the same pipeline, but extraction is the step that creates the data your ERP or case-management actually stores.

About SAPIENTROQdecoration

ai avatar

Hey there! I’m your AI assistant developed by SAPIENTROQ. I am a language model connected to a RAG database that contains information about the company. If you need insights on AI solutions, real use cases, or how AI can boost your business, please feel free to ask in any language you prefer.

Choose an option

Hey! I am AI agent developed by SAPIENTROQ 🤖
Decoration
Decoration

Interested in a solution?

We are glad to show you various options without any obligation.

Roland Kurmann

Roland Kurmann

CEO, SAPIENTROQ

Book a call

Decoration