Parse-Only Workflow

Convert any document (PDF, DOCX, images) to structured markdown without defining extraction fields. This is useful when you want the parsed content to feed into your own pipeline, or when you need to preview how anyformat sees a document before setting up extraction.

When to Use This

  • Document preview — See the markdown before writing extraction fields
  • Custom pipelines — Feed parsed markdown into your own LLM, search index, or RAG system
  • Debugging — Understand how a document is parsed (blocks, tables, reading order)
  • Lightweight integration — You only need the text, not structured extraction

Create a Parse-Only Workflow

A parse-only workflow needs no real extraction fields. The API still requires at least one field when a workflow is created, so pass a single placeholder field; it won’t be used.
curl -X POST 'https://api.anyformat.ai/v2/workflows/' \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -d '{
    "name": "Document Parser",
    "description": "Parse documents to markdown without extraction",
    "fields": [
      {"name": "_placeholder", "description": "unused", "data_type": "string"}
    ]
  }'
Save the workflow_id — you’ll reuse it for every parse request.
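
For callers working from Python rather than the shell, the same request can be assembled with the standard library. This is a minimal sketch that mirrors the curl call above. The function name build_create_workflow_request is hypothetical, and the request is only constructed here, not sent.

```python
import json
import urllib.request

API_BASE = "https://api.anyformat.ai/v2"  # base URL taken from the curl example


def build_create_workflow_request(
    api_key: str, name: str = "Document Parser"
) -> urllib.request.Request:
    """Build (but do not send) the POST request that creates a parse-only workflow."""
    payload = {
        "name": name,
        "description": "Parse documents to markdown without extraction",
        # The API requires at least one field; this placeholder is never used.
        "fields": [
            {"name": "_placeholder", "description": "unused", "data_type": "string"}
        ],
    }
    return urllib.request.Request(
        f"{API_BASE}/workflows/",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. The shape of the response body is not shown on this page, so read the workflow_id from whatever the API actually returns.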

Remove the Extract Node

By default, new workflows include both a parse and an extract node. To skip extraction entirely, update the workflow graph in the anyformat platform to remove the extract node, leaving only the parse node.
Removing the extract node means the workflow will only convert documents to markdown. No structured data extraction will run, which makes processing faster and cheaper.

Submit a Document

curl -X POST 'https://api.anyformat.ai/v2/workflows/YOUR_WORKFLOW_ID/run/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'file=@document.pdf'

Retrieve the Parsed Markdown

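Poll the extraction endpoint until the file is processed, then fetch the results. In the loop below, HTTP 200 ends the wait, 412 is treated as "not ready yet", and any other status aborts with an error. The results endpoint returns a unified JSON response with the parsed markdown for each file.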
# Poll until processed. Assumes run_id is already set in the shell.
while true; do
  STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
    -H 'Authorization: Bearer YOUR_API_KEY' \
    "https://api.anyformat.ai/v2/files/${run_id}/extraction/")
  # 200: ready, stop polling
  [ "$STATUS" = "200" ] && break
  # 412: keep waiting; any other status is an error
  if [ "$STATUS" != "412" ]; then
    echo "Error: $STATUS" >&2
    exit 1
  fi
  sleep 5
done

# Get results. Assumes workflow_id is already set in the shell.
curl -s -H 'Authorization: Bearer YOUR_API_KEY' \
  "https://api.anyformat.ai/v2/workflows/${workflow_id}/results/"
The response is a JSON object keyed by filename. Each file contains a results object with the output from each workflow node. Pass file_id to get results for a single file.
{
  "document.pdf": {
    "results": {
      "parse": {
        "markdown": "<DOCUMENT id=\"1\" page=\"1\">..."
      }
    }
  }
}
For workflows with extraction fields, the response also includes an extraction key:
{
  "invoice.pdf": {
    "results": {
      "parse": {
        "markdown": "<DOCUMENT ...>..."
      },
      "extraction": {
        "total": {"value": "1,234.56", "confidence": 95.2},
        "date": {"value": "2026-01-15", "confidence": 88.0}
      }
    }
  }
}
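
Both response shapes above nest the markdown under results.parse.markdown, so pulling it out takes a short loop. The sketch below assumes the response has already been decoded from JSON into a dictionary. The helper name collect_markdown is hypothetical, and skipping files that lack parse output is a defensive assumption, not documented behavior.

```python
from typing import Any, Dict


def collect_markdown(response: Dict[str, Any]) -> Dict[str, str]:
    """Map each filename in a results response to its parsed markdown."""
    markdown_by_file: Dict[str, str] = {}
    for filename, entry in response.items():
        parse_output = entry.get("results", {}).get("parse", {})
        # Assumption: a file without parse output is skipped rather than raising.
        if "markdown" in parse_output:
            markdown_by_file[filename] = parse_output["markdown"]
    return markdown_by_file


if __name__ == "__main__":
    sample = {
        "document.pdf": {
            "results": {"parse": {"markdown": "<DOCUMENT id=\"1\" page=\"1\">..."}}
        }
    }
    for name, markdown in collect_markdown(sample).items():
        print(name, len(markdown))
```

The same function works unchanged for workflows that also return an extraction key, because it only reads the parse entry.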

Example Output

The parsed markdown preserves document structure with semantic blocks:
<DOCUMENT id="1" page="1">
<section id="p1_b1" data-type="title" data-bbox="x0:0.034,y0:0.037,x1:0.436,y1:0.053">

# ACME CORPORATION

</section>

<section id="p1_b2" data-type="text" data-bbox="x0:0.031,y0:0.055,x1:0.304,y1:0.140">

123 Business Ave, Suite 100
New York, NY 10001

</section>

<section id="p1_b3" data-type="table" data-bbox="x0:0.025,y0:0.219,x1:0.976,y1:0.807">

<table>
<thead>
<tr>
<th data-cell-id="r0c0">Item</th>
<th data-cell-id="r0c1">Quantity</th>
<th data-cell-id="r0c2">Price</th>
</tr>
</thead>
<tbody>
<tr>
<td data-cell-id="r1c0">Widget A</td>
<td data-cell-id="r1c1">10</td>
<td data-cell-id="r1c2">$25.00</td>
</tr>
</tbody>
</table>

</section>
</DOCUMENT>
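Each <section> carries the attributes below. The exception is data-cell-id, which appears on the <th> and <td> cells inside table sections rather than on the <section> tag itself: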
  • id — Block identifier (page and block number)
  • data-type — Semantic type: title, text, table, other (figures/images)
  • data-bbox — Bounding box coordinates (normalized 0-1)
  • data-cell-id — Table cell identifiers for precise cell referencing
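
These attributes can be read back out with a small amount of parsing. The following is a rough sketch, not an official parser: it assumes the attributes always appear in the order shown in the example (id, data-type, data-bbox) and that sections are not nested. The names list_sections, parse_bbox, and to_pixels are hypothetical.

```python
import re
from typing import Dict, List

# Assumes the attribute order seen in the example output above.
SECTION_RE = re.compile(
    r'<section id="(?P<id>[^"]+)" data-type="(?P<type>[^"]+)" '
    r'data-bbox="(?P<bbox>[^"]+)">(?P<body>.*?)</section>',
    re.DOTALL,
)


def parse_bbox(raw: str) -> Dict[str, float]:
    """Turn 'x0:0.034,y0:0.037,...' into {'x0': 0.034, 'y0': 0.037, ...}."""
    pairs = (item.split(":") for item in raw.split(","))
    return {key: float(value) for key, value in pairs}


def to_pixels(bbox: Dict[str, float], width: int, height: int) -> Dict[str, int]:
    """Scale a normalized (0-1) bounding box to pixel coordinates."""
    return {
        key: round(value * (width if key.startswith("x") else height))
        for key, value in bbox.items()
    }


def list_sections(markdown: str) -> List[dict]:
    """Return one dict per <section> block with id, type, bbox, and body text."""
    return [
        {
            "id": match["id"],
            "type": match["type"],
            "bbox": parse_bbox(match["bbox"]),
            "body": match["body"].strip(),
        }
        for match in SECTION_RE.finditer(markdown)
    ]
```

Filtering the result on type == "table" is one way to isolate tables for separate handling. For anything beyond quick inspection, a real HTML parser would be more robust than a regular expression.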

Markdown Content

The parse.markdown field contains parsed markdown with section tags, table structure, and embedded images. Figures and charts are included as base64-encoded <img> tags, which makes the payload larger for image-heavy documents.
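
When the markdown is headed for a text-only consumer such as a search index, the embedded images can be dropped client-side to shrink the payload. This is a minimal sketch, assuming the images appear as <img> tags whose src is a base64 data URI. The exact attribute layout is an assumption, and strip_inline_images is a hypothetical helper.

```python
import re

# Matches <img> tags whose src is a base64 data URI (assumed attribute layout).
DATA_IMG_RE = re.compile(
    r'<img\b[^>]*\bsrc="data:image/[^"]*;base64,[^"]*"[^>]*>',
    re.IGNORECASE,
)


def strip_inline_images(markdown: str, placeholder: str = "[image omitted]") -> str:
    """Replace base64-embedded <img> tags with a short text placeholder."""
    return DATA_IMG_RE.sub(placeholder, markdown)
```

Images referenced by an ordinary URL are left untouched, since only data URIs inflate the payload.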

Parse Node Configuration

You can configure the parse node in the anyformat platform by clicking on the parse node in the workflow graph. Available options:
| Setting | Description |
| --- | --- |
| Engine | Fast for quick analysis, Performant for higher accuracy |
| Figure Enhancement | When enabled, uses an LLM to extract structured data from charts and images (e.g., axis labels, data points). Off by default. |
| Prompt Hint | Optional text to guide the parser; useful for domain-specific documents (e.g., “This is a medical lab report, preserve all numeric values exactly”) |
Figure Enhancement adds an extra LLM call per figure block, which increases processing time and cost. Only enable it if you need structured descriptions of charts and images.

Tips

Parse-only workflows skip the extraction LLM call entirely, making them faster and cheaper than full extraction workflows.
  • Reuse one workflow — Create a single parse-only workflow and submit all documents to it. No need for separate workflows per document type.
  • Tables are preserved — The parser detects tables and outputs them as HTML <table> elements with cell IDs for precise referencing.
  • Multi-page handling — Each page gets its own <DOCUMENT> block with page number. All pages are processed automatically.
  • Use visual for images — The raw variant strips images. If your documents contain figures, charts, or logos you need to preserve, use variant=visual.
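
Because each page arrives in its own <DOCUMENT> block, the markdown can be split per page before chunking or indexing. The sketch below assumes every block carries a page attribute as in the example output. The helper name split_pages is hypothetical.

```python
import re
from typing import Dict

# Assumes each page is wrapped as <DOCUMENT ... page="N"> ... </DOCUMENT>.
DOCUMENT_RE = re.compile(
    r'<DOCUMENT\b[^>]*\bpage="(?P<page>\d+)"[^>]*>(?P<body>.*?)</DOCUMENT>',
    re.DOTALL,
)


def split_pages(markdown: str) -> Dict[int, str]:
    """Map each page number to the markdown inside its <DOCUMENT> block."""
    return {
        int(match["page"]): match["body"].strip()
        for match in DOCUMENT_RE.finditer(markdown)
    }
```

Keeping the page number as the key makes it straightforward to cite the source page when a downstream system surfaces a passage.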

Next Steps

Complete Workflow Guide

Add extraction fields to get structured data from your documents

Invoice Processing

See a full extraction example with nested fields and line items