Question 1

How is document data extraction different from OCR or classification?

Accepted Answer

OCR converts pixels into characters. Classification decides what the document is. Extraction lifts the specific typed values out of it (invoice total, due date, claim reference, supplier ID) and writes each one into the matching field on the target system. All three usually sit in the same pipeline, but extraction is the step that creates the data your ERP or case-management system actually stores.

Question 2

Why typed-schema extraction and JSON mode instead of free-form prompts?

Accepted Answer

Free-form prompts force you to parse the model's prose back into fields, and the prose changes shape between runs. A typed schema plus JSON mode pins the output to your field definitions: missing fields stay null, typed fields stay typed, no invented keys. Downstream validation runs against a stable contract instead of a regex over prose.
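
To make the "stable contract" concrete, here is a minimal Python sketch of validating a JSON-mode response against a typed schema. The schema, the field names and the `parse_model_output` helper are illustrative assumptions, not the production code.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
SCHEMA = {
    "invoice_total": (int, float),
    "due_date": str,
    "supplier_id": str,
}


def parse_model_output(raw: str) -> dict:
    """Validate a JSON-mode response against SCHEMA and return a full record."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    # Reject keys the schema does not define (no invented keys).
    unknown = sorted(set(data) - set(SCHEMA))
    if unknown:
        raise ValueError(f"keys not in schema: {unknown}")

    record = {}
    for name, expected in SCHEMA.items():
        value = data.get(name)  # absent fields stay None (null)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if value is not None and (
            isinstance(value, bool) or not isinstance(value, expected)
        ):
            raise ValueError(f"{name}: unexpected type {type(value).__name__}")
        record[name] = value
    return record
```

A response that omits `due_date` yields a record with `due_date` set to `None`, while a response with a string in `invoice_total` or an extra key raises before anything reaches the target system.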

Question 3

What does field-level confidence mean in practice?

Accepted Answer

Each field gets its own confidence signal. High-confidence fields write through to the target system automatically. Low-confidence fields surface in a role-gated review queue — the reviewer sees the cropped page region and the proposed value, edits if needed, and approves. The rest of the record is already saved by the time they touch it.
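
The routing described above can be sketched in a few lines of Python. The threshold value, the `ExtractedField` type and the `route_fields` helper are assumptions for illustration; in practice the threshold is tuned per customer.

```python
from dataclasses import dataclass
from typing import List, Tuple

REVIEW_THRESHOLD = 0.85  # assumed value, not a recommendation


@dataclass
class ExtractedField:
    name: str
    value: str
    confidence: float  # 0.0 to 1.0


def route_fields(
    extracted: List[ExtractedField], threshold: float = REVIEW_THRESHOLD
) -> Tuple[List[ExtractedField], List[ExtractedField]]:
    """Split one record into auto-written fields and fields needing review."""
    auto_write, review_queue = [], []
    for item in extracted:
        if item.confidence >= threshold:
            auto_write.append(item)
        else:
            review_queue.append(item)
    return auto_write, review_queue
```

The split happens per field, so one uncertain value does not hold back the rest of the record.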

Question 4

Can a new document family or field be added without a developer?

Accepted Answer

Yes. Category, SubCategory, FileBlock and Field are admin-defined models. Operations adds the new family or field, ties it to a downstream role and target system, and the next inbound document of that shape flows through. The Insurance AI POC ships exactly this admin surface; Weita exposes a similar surface for prompt and category tuning.
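
As a rough illustration, the four models could be represented like this in Python. Only the model names come from the answer above; the nesting order, the attribute names and the `field_names` helper are assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Field:
    name: str
    value_type: str  # assumed vocabulary, e.g. "string", "date", "decimal"


@dataclass
class FileBlock:
    name: str
    fields: List[Field] = field(default_factory=list)


@dataclass
class SubCategory:
    name: str
    reviewer_role: str  # downstream role that reviews this family
    target_system: str  # system of record that receives the values
    file_blocks: List[FileBlock] = field(default_factory=list)


@dataclass
class Category:
    name: str
    sub_categories: List[SubCategory] = field(default_factory=list)


def field_names(sub: SubCategory) -> List[str]:
    """Flatten a document family into the field list the extractor fills."""
    return [f.name for block in sub.file_blocks for f in block.fields]
```

Because the extractor reads its field list from these records at run time, appending a `Field` to a `FileBlock` changes what the next document of that family is asked for, with no code change.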

Question 5

How does two-pass OCR feed extraction, and when do you need the structured pass?

Accepted Answer

Digital PDFs often need only the raw pass. Scanned pages, image PDFs, multi-column invoices and handwritten receipts (Belege) benefit from the structured Mistral pass that preserves tables and column boundaries. The extractor consumes the cleaner of the two representations per page, so a single pipeline handles both digital and scanned inputs without a separate code path.
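
One way to picture the per-page choice is the Python sketch below. The quality heuristic, the 0.8 cut-off and both function names are assumptions made for illustration; the answer above does not specify how "cleaner" is measured.

```python
from typing import Callable


def text_quality(text: str) -> float:
    """Crude score: share of characters that are letters, digits or whitespace."""
    if not text.strip():
        return 0.0
    ok = sum(ch.isalnum() or ch.isspace() for ch in text)
    return ok / len(text)


def choose_page_text(
    raw_text: str,
    structured_pass: Callable[[], str],
    min_quality: float = 0.8,  # assumed cut-off
) -> str:
    """Use the raw pass when it looks clean, else run the structured pass."""
    if text_quality(raw_text) >= min_quality:
        return raw_text
    # Called lazily, so clean digital pages never pay for the second pass.
    return structured_pass()
```

Passing the structured pass as a callable keeps both input types on one code path while running the more expensive pass only for pages that need it.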

Question 6

How is extraction audited for regulated Swiss back-office workflows?

Accepted Answer

Every extracted value carries its source page, the prompt version, the model identifier and, if a human touched it, the reviewer and the prior value. Failure states are explicit, not silent. For FINMA-supervised, MDR- or IVDR-sensitive flows we deploy on Swiss-resident hosting or on-premises; the dedicated vertical pages cover certification posture.
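
A per-value audit record along those lines might look like the following Python sketch. The attribute names and the `apply_review` helper are hypothetical; only the list of tracked facts comes from the answer above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class AuditedValue:
    field_name: str
    value: str
    source_page: int
    prompt_version: str
    model_id: str
    reviewer: Optional[str] = None     # set only when a human edited the value
    prior_value: Optional[str] = None  # value before the human edit


def apply_review(entry: AuditedValue, reviewer: str, new_value: str) -> AuditedValue:
    """Return a new record that keeps the reviewer and the overwritten value."""
    return replace(entry, value=new_value, reviewer=reviewer, prior_value=entry.value)
```

The record is frozen, so a review produces a new entry instead of mutating the original, and the machine-extracted value stays recoverable.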

Question 7

How does the engine plug into an existing ERP, PIM or case-management system?

Accepted Answer

We model extraction as a staged Laravel job queue ending in a write-out stage that targets your system of record. The contract is the typed schema — the downstream system receives the same field shape every time. For Weita that target is a Laravel modular monolith PIM; for the Insurance AI POC it is a case-management database; the pattern carries over to most ERP and PIM systems.
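
The staged shape can be sketched as follows. This is a language-neutral illustration written in Python, not the Laravel queue code itself; the stage names, the stub stages and the in-memory sink are assumptions.

```python
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]


def run_pipeline(document: Dict, stages: List[Stage]) -> Dict:
    """Run stages in order; record an explicit failure state if one raises."""
    state = dict(document)
    for stage in stages:
        try:
            state = stage(state)
        except Exception as exc:
            state["status"] = f"failed:{stage.__name__}"
            state["error"] = str(exc)
            return state
    state["status"] = "written"
    return state


# Stub stages standing in for the real OCR, classification and extraction.
def ocr(state: Dict) -> Dict:
    return {**state, "text": "Total 1250.50"}


def classify(state: Dict) -> Dict:
    return {**state, "family": "invoice"}


def extract(state: Dict) -> Dict:
    return {**state, "fields": {"invoice_total": 1250.50}}


def make_write_out(sink: List[Dict]) -> Stage:
    """Build the final stage; `sink` stands in for the system of record."""
    def write_out(state: Dict) -> Dict:
        sink.append(state["fields"])
        return state
    return write_out
```

Swapping the target system means swapping only the last stage, since every write-out receives the same `fields` shape.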

Question 8

Do you cite a specific accuracy or throughput number?

Accepted Answer

Not on the website. Accuracy and throughput depend on document quality, schema strictness and the HITL threshold the customer wants. We measure both on your real backlog during the pilot on one document family and quote against those numbers. Generic accuracy claims do not survive contact with a real document mix.

Extract Data from Documents

Document Data Extraction Suite

How we deliver it

Schema discovery against real documents

Pilot extraction on one family

Field-level HITL tuning

Hand-off with prompt registry

Selected engagements

Why this engine, not generic data extraction

Schema is the contract, prompts are the implementation

Confidence at the field, not the document

Audit-grade by construction

Frequently Asked Questions