PDF Form Annotation Pipeline
This tool converts a raw PDF form into a clean Schema V3 definition that can drive a digital form experience. The schema contains only visible form structure: headings, printed text, fields, groups, tables, repeat groups, and widget ownership.
Schema V3 intentionally avoids presentation rules. You do not choose roles, layouts, table semantics, spans, or field presentations. The frontend maps the clean schema directly to the editor and preview.
1. Widget Extraction
Stage 1 shows all interactive PDF fields: text boxes, checkboxes, radio buttons, select controls, and signature fields. These source fields are called widgets.
Widget table
| Action | What it does |
|---|---|
| Locate | Highlights the widget on the PDF canvas. |
| History | Shows the edit history for this widget. |
| Delete | Removes the widget from the source extraction. |
Drawing a new widget
Press D to enter draw mode, then drag on the PDF canvas to define the widget box. A popup lets you choose the widget type.
2. Widget Linking
Stage 2 labels source widgets and helps identify widgets that still need review. Stage 3 uses these widgets as the allowed source IDs when building Schema V3.
| Panel | Meaning |
|---|---|
| Linked Widgets | Widgets with a reviewed label. |
| Missed Widgets | Widgets that still need a label or connection. |
3. Form Builder
Stage 3 edits Schema V3 directly. The left panel shows pages and ordered page items. Select any item to edit it in the inspector. Changes auto-save as a draft; use Update DB when the schema is ready to become the pipeline output.
Schema V3 Structure
A Schema V3 document is a clean form definition. Extraction status, provider, model, usage, and page diagnostics live outside the schema in the extraction envelope.
Page items are flat and ordered. A section is only a marker heading. Groups, tables, and repeat groups are the containers for nested items.
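The container rules above can be sketched as a small traversal. This is a minimal illustration, not the pipeline's own code: the helper names `iter_items` and `iter_fields` are hypothetical, and the sketch assumes the shapes shown in this document (a group holds `items`, a repeat group holds labeled entries that each hold `items`, and a table holds `rows` of `cells`).

```python
from typing import Any, Dict, Iterator, List

Item = Dict[str, Any]


def iter_items(items: List[Item]) -> Iterator[Item]:
    """Yield every item in document order, descending into containers."""
    for item in items:
        yield item
        kind = item.get("kind")
        if kind == "group":
            yield from iter_items(item.get("items", []))
        elif kind == "repeat_group":
            # Each repeat entry is a labeled record that holds its own items.
            for entry in item.get("items", []):
                yield from iter_items(entry.get("items", []))
        elif kind == "table":
            # Each row owns its own cells, so rows may differ in length.
            for row in item.get("rows", []):
                yield from iter_items(row.get("cells", []))
        # A section is only a marker heading, so it has nothing to descend into.


def iter_fields(schema: Dict[str, Any]) -> Iterator[Item]:
    """Yield every field item on every page of a Schema V3 document."""
    for page in schema.get("pages", []):
        for item in iter_items(page.get("items", [])):
            if item.get("kind") == "field":
                yield item
```

Because page items are flat, only the three container kinds need recursion; everything else is yielded as-is.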
Item Types
| Kind | Use it for |
|---|---|
| section | Printed section headings or major dividers. |
| text | Any printed prose that is not an answer field: instructions, legal text, warnings, notes, or paragraphs. |
| field | One logical answer field with one widget carrier. |
| group | A labeled collection of related nested items. |
| table | Rows and cells, including irregular tables where each row has its own cell count. |
| repeat_group | Repeated records such as providers, contacts, dependents, jobs, or services. |
What the JSON Looks Like
Envelope
{
  "schema": {
    "schema_version": 3,
    "title": "Employment Form",
    "page_count": 3,
    "pages": [
      { "page": 1, "items": [] }
    ]
  },
  "extraction": {
    "file_id": "64a1f3...",
    "cached": false,
    "version": 0,
    "provider": "openrouter",
    "model": "openai/gpt-4o-mini",
    "usage": {},
    "api_error": false,
    "parse_error": false,
    "processing": false,
    "page_status": []
  }
}
Simple field
{
  "kind": "field",
  "label": "First Name",
  "input_type": "text",
  "widgets": [
    { "widget_id": "page1-widget6" }
  ]
}
Split field
{
  "kind": "field",
  "label": "Date of Birth",
  "input_type": "date",
  "parts": [
    { "label": "Month", "widget_id": "page1-widget1" },
    { "label": "Day", "widget_id": "page1-widget2" },
    { "label": "Year", "widget_id": "page1-widget3" }
  ]
}
Choice field
{
  "kind": "field",
  "label": "Please check all that apply.",
  "input_type": "checkbox",
  "choices": [
    { "label": "Absent Parent", "widget_id": "page1-widget13" },
    { "label": "Child Care", "widget_id": "page1-widget15" },
    {
      "label": "Other",
      "widget_id": "page1-widget35",
      "details": {
        "label": "Other",
        "input_type": "text",
        "widgets": [
          { "widget_id": "page1-widget32" }
        ]
      }
    }
  ]
}
Group and repeat group
{
  "kind": "group",
  "label": "Applicant Name",
  "items": []
}
{
  "kind": "repeat_group",
  "label": "Providers",
  "item_label": "Provider",
  "items": [
    { "label": "Provider 1", "items": [] }
  ]
}
Field Types and Widget Carriers
Every field has an input_type and exactly one widget carrier.
| Input type | Use it for |
|---|---|
| text | Names, addresses, IDs, free-form answers. |
| date | Dates, including split month/day/year widgets. |
| number | Amounts, counts, years, scores. |
| checkbox | Single checkbox or select-all-that-apply groups. |
| radio | Mutually exclusive options. |
| select | Dropdown or list selection widgets. |
| signature | Signature or initials fields. |
| Carrier | Use it when |
|---|---|
| widgets[] | The logical field is backed by one normal widget, or a simple list of equivalent widgets. |
| parts[] | One answer is split across multiple widgets, such as date parts or SSN boxes. |
| choices[] | Each option has its own widget, such as radio buttons or checkbox lists. A choice may use details for an attached Other/Specify input. |
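As a rough sketch of how the three carriers hold their widget references, the following helper gathers every `widget_id` a field owns. The function names `carrier_widget_ids` and `field_widget_ids` are hypothetical; the key names (`widgets`, `parts`, `choices`, `details`, `widget_id`) are the ones used in the JSON samples in this document.

```python
from typing import Any, Dict, List

Field = Dict[str, Any]


def carrier_widget_ids(holder: Dict[str, Any]) -> List[str]:
    """Widget IDs held in widgets[] or parts[] on a field or on a choice's details."""
    ids = [w["widget_id"] for w in holder.get("widgets", []) if w.get("widget_id")]
    ids += [p["widget_id"] for p in holder.get("parts", []) if p.get("widget_id")]
    return ids


def field_widget_ids(field: Field) -> List[str]:
    """All widget IDs owned by one field, in carrier order."""
    ids = carrier_widget_ids(field)
    for choice in field.get("choices", []):
        # Each choice carries its own widget...
        if choice.get("widget_id"):
            ids.append(choice["widget_id"])
        # ...and may attach an Other/Specify input through details.
        details = choice.get("details")
        if details:
            ids.extend(carrier_widget_ids(details))
    return ids
```

Applied to the choice field sample shown earlier, this would collect the three checkbox widgets plus the text widget nested under the Other choice's details.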
Irregular Tables
Schema V3 tables do not need row spans, column spans, or a regular/irregular setting. Each row owns its own cells. A row may have two cells, the next may have three, and another may include an empty cell.
{
  "kind": "table",
  "label": "Assessment",
  "rows": [
    {
      "cells": [
        { "kind": "text", "text": "Question" },
        { "kind": "text", "text": "Answer" },
        { "kind": "text", "text": "Comments" }
      ]
    },
    {
      "cells": [
        { "kind": "text", "text": "Pain level" },
        {
          "kind": "field",
          "label": "Pain level",
          "input_type": "number",
          "widgets": [{ "widget_id": "page2-widget1" }]
        }
      ]
    },
    {
      "cells": [
        { "kind": "text", "text": "Mobility notes" },
        {
          "kind": "field",
          "label": "Mobility notes",
          "input_type": "text",
          "widgets": [{ "widget_id": "page2-widget2" }]
        },
        { "kind": "empty" }
      ]
    }
  ]
}
Decision Guide
Which item kind?
Which carrier?
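The carrier choice can be restated in code from the carrier table in Field Types and Widget Carriers. This is only a sketch: `suggest_carrier` and its two flags are hypothetical names, and the order in which the conditions are tested is an assumption rather than a documented rule.

```python
def suggest_carrier(*, one_widget_per_option: bool = False,
                    split_across_widgets: bool = False) -> str:
    """Suggest a carrier name from two yes/no observations about the printed form."""
    if one_widget_per_option:
        # Radio buttons or checkbox lists where each option has its own widget.
        return "choices"
    if split_across_widgets:
        # One answer spread over several widgets, such as date parts or SSN boxes.
        return "parts"
    # One normal widget, or a simple list of equivalent widgets.
    return "widgets"
```

For example, a month/day/year date would map to `suggest_carrier(split_across_widgets=True)`, and a plain text box to `suggest_carrier()`.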
PDF Widget Linking
A widget reference connects a Schema V3 field to a physical PDF widget. The reference is always a widget_id inside widgets[], parts[], choices[], or a choice's optional details.
- Select a field in the Stage 3 tree.
- Choose the carrier mode in the field inspector.
- Select the target slot, part, or choice.
- Click the matching PDF widget in link mode.
An attached Other/Specify input is linked through choices[].details. Missing widgets are shown as unresolved work.
Validation and Review
Click Checks or press Ctrl+K to run validation. Update DB is blocked while Schema V3 is malformed or PDF widget coverage is incomplete.
| Check | Meaning |
|---|---|
| schema_version === 3 | V2 and malformed drafts are rejected. |
| Supported item kinds | Only section, text, field, group, table, and repeat_group are allowed. |
| One field carrier | A field must use exactly one of widgets, parts, or choices. Choice details must use exactly one of widgets or parts. |
| Duplicate widget IDs | The same widget cannot be owned by two fields. |
| Unknown widget IDs | References must exist in the source PDF widgets when those are available. |
| Missing widgets | Warnings while editing, blocking errors during Update DB. |
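The checks in this table can be approximated with a short validator. It is a sketch under stated assumptions, not the tool's actual implementation: the function names, the message wording, treating an empty carrier list as unused, and treating `{"kind": "empty"}` as a table-cell placeholder rather than an item kind are all assumptions. The duplicate check here also counts a widget ID repeated inside a single field.

```python
from collections import Counter

ALLOWED_KINDS = {"section", "text", "field", "group", "table", "repeat_group"}


def walk(items):
    """Yield items in order, descending into groups, repeat groups, and tables."""
    for item in items:
        yield item
        kind = item.get("kind")
        if kind == "group":
            yield from walk(item.get("items", []))
        elif kind == "repeat_group":
            for entry in item.get("items", []):
                yield from walk(entry.get("items", []))
        elif kind == "table":
            for row in item.get("rows", []):
                # Assumption: empty cells are placeholders, so they are skipped.
                cells = [c for c in row.get("cells", []) if c.get("kind") != "empty"]
                yield from walk(cells)


def used_carriers(holder, names):
    """Carrier names that are present and non-empty on a field or on details."""
    return [name for name in names if holder.get(name)]


def owned_widget_ids(field):
    """Every widget_id a field references, including choice details."""
    ids = []
    for name in ("widgets", "parts", "choices"):
        for ref in field.get(name, []):
            if ref.get("widget_id"):
                ids.append(ref["widget_id"])
            details = ref.get("details") or {}
            for sub in ("widgets", "parts"):
                ids += [r["widget_id"] for r in details.get(sub, []) if r.get("widget_id")]
    return ids


def check_schema(schema, source_widget_ids=None):
    """Return a list of problem messages; an empty list means every check passed."""
    if schema.get("schema_version") != 3:
        return ["schema_version must be 3"]
    errors = []
    owners = Counter()
    for page in schema.get("pages", []):
        for item in walk(page.get("items", [])):
            kind = item.get("kind")
            if kind not in ALLOWED_KINDS:
                errors.append(f"Unsupported item kind: {kind!r}")
                continue
            if kind != "field":
                continue
            label = item.get("label", "")
            if len(used_carriers(item, ("widgets", "parts", "choices"))) != 1:
                errors.append(f"Field must use exactly one widget carrier: {label}")
            for choice in item.get("choices", []):
                details = choice.get("details")
                if details is not None and len(used_carriers(details, ("widgets", "parts"))) != 1:
                    errors.append(f"Choice details must use exactly one of widgets or parts: {label}")
            owners.update(owned_widget_ids(item))
    errors += [f"Duplicate widget ID: {wid}" for wid, n in sorted(owners.items()) if n > 1]
    if source_widget_ids is not None:
        # Unknown and missing checks only run when the source widget list is available.
        known = set(source_widget_ids)
        errors += [f"Unknown widget ID: {wid}" for wid in sorted(set(owners) - known)]
        errors += [f"Missing extracted widget: {wid}" for wid in sorted(known - set(owners))]
    return errors
```

Passing `source_widget_ids=None` mirrors the "when those are available" condition in the table: coverage checks are skipped and only the structural checks run.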
Keyboard Shortcuts
| Shortcut | Action | Context |
|---|---|---|
| Ctrl+Z | Undo | All stages |
| Ctrl+Y / Ctrl+Shift+Z | Redo | All stages |
| Ctrl+S | Save draft | All stages |
| Alt+Left | Previous PDF | All stages |
| Alt+Right | Next PDF | All stages |
| Left / Right | Previous / next page | Stages 1-3 |
| Ctrl+K | Open checks | Stage 3 |
| D | Toggle draw mode | Stage 1 |
| Esc | Close modal or cancel active mode | All stages |
Common Errors
"Schema V2 structured JSON is not supported"
The draft or cache uses the old V2 shape. Re-run Stage 3 extraction so the backend produces Schema V3.
"Field must use exactly one widget carrier"
Open the field inspector and choose one mode: widgets, parts, or choices. Remove the other carriers.
"Duplicate widget ID"
The same PDF widget is referenced by more than one field. Keep the first correct owner and unlink the duplicate reference.
"Missing extracted widget"
A source PDF widget is not owned by any field. Link it to the right field, or keep it in the Unresolved fields group until you decide where it belongs.
"Unknown widget ID"
The schema references a widget ID that does not exist in the current PDF widget list. Remove the bad reference or relink the field to a valid widget.
