Invoice Processing

Extract structured data from PDF invoices including vendor details, dates, totals, and individual line items. This recipe uses the object field type with nested_fields to capture tabular line item data.

Workflow Fields

We recommend creating this workflow in the anyformat platform, where you can test with sample invoices and iterate on field descriptions. Then copy the workflow ID to use with the API.

| Field | Type | Description |
| --- | --- | --- |
| invoice_number | string | The unique invoice identifier |
| vendor_name | string | Name of the company issuing the invoice |
| issue_date | date | Date the invoice was issued |
| due_date | date | Payment due date |
| subtotal | float | Amount before tax |
| tax_amount | float | Total tax applied |
| total_amount | float | Final amount due including tax |
| currency | enum | Currency code (USD, EUR, GBP, CAD, AUD) |
| line_items | object | Individual items on the invoice |

Field Configuration

{
  "fields": [
    {"name": "invoice_number", "description": "The unique invoice identifier or number", "data_type": "string"},
    {"name": "vendor_name", "description": "Name of the company that issued the invoice", "data_type": "string"},
    {"name": "issue_date", "description": "Date the invoice was issued", "data_type": "date"},
    {"name": "due_date", "description": "Date by which payment is due", "data_type": "date"},
    {"name": "subtotal", "description": "Amount before tax", "data_type": "float"},
    {"name": "tax_amount", "description": "Total tax amount", "data_type": "float"},
    {"name": "total_amount", "description": "Final total amount due including tax", "data_type": "float"},
    {
      "name": "currency",
      "description": "Currency of the invoice amounts",
      "data_type": "enum",
      "enum_options": [
        {"name": "USD", "description": "US Dollar"},
        {"name": "EUR", "description": "Euro"},
        {"name": "GBP", "description": "British Pound"},
        {"name": "CAD", "description": "Canadian Dollar"},
        {"name": "AUD", "description": "Australian Dollar"}
      ]
    },
    {
      "name": "line_items",
      "description": "Individual line items listed on the invoice",
      "data_type": "object",
      "nested_fields": [
        {"name": "description", "description": "Description of the item or service", "data_type": "string"},
        {"name": "quantity", "description": "Number of units", "data_type": "integer"},
        {"name": "unit_price", "description": "Price per unit", "data_type": "float"},
        {"name": "amount", "description": "Total amount for this line item", "data_type": "float"}
      ]
    }
  ]
}
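
Before creating the workflow, it can help to sanity-check a configuration like the one above locally. The following is a minimal sketch, not part of the anyformat API: `validate_fields` and `KNOWN_TYPES` are hypothetical names, and the rules they enforce (every field has a name, description, and data_type; enum fields carry enum_options; object fields carry nested_fields) are inferred only from this example rather than from an official schema.

```python
# Data types that appear in this recipe; the platform may support more.
KNOWN_TYPES = {"string", "date", "float", "integer", "enum", "object"}


def validate_fields(fields, path=""):
    """Return a list of human-readable problems found in a field list.

    Hypothetical local check: the rules are inferred from the example
    configuration, not taken from an official schema.
    """
    problems = []
    seen = set()
    for index, field in enumerate(fields):
        name = field.get("name")
        label = path + (name or f"[{index}]")
        for key in ("name", "description", "data_type"):
            if not field.get(key):
                problems.append(f"{label}: missing '{key}'")
        if name is not None and name in seen:
            problems.append(f"{label}: duplicate field name")
        seen.add(name)
        data_type = field.get("data_type")
        if data_type and data_type not in KNOWN_TYPES:
            problems.append(f"{label}: unrecognized data_type '{data_type}'")
        if data_type == "enum" and not field.get("enum_options"):
            problems.append(f"{label}: enum field needs 'enum_options'")
        if data_type == "object":
            nested = field.get("nested_fields")
            if not nested:
                problems.append(f"{label}: object field needs 'nested_fields'")
            else:
                # Recurse so nested fields get the same checks.
                problems.extend(validate_fields(nested, path=f"{label}."))
    return problems


config = {
    "fields": [
        {"name": "total_amount", "description": "Final total amount due including tax", "data_type": "float"},
        {"name": "line_items", "description": "Individual line items listed on the invoice", "data_type": "object"},
    ]
}
for problem in validate_fields(config["fields"]):
    print(problem)  # line_items: object field needs 'nested_fields'
```

An empty list means the configuration passed these local checks; the platform remains the authority on what is actually accepted.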

Process a Document

curl -X POST 'https://api.anyformat.ai/v2/workflows/YOUR_WORKFLOW_ID/run/' \
  -H 'Authorization: Bearer YOUR_API_KEY' \
  -F 'file=@invoice.pdf'
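
The same request can be sketched in Python with the third-party `requests` library. The helper names (`run_url`, `auth_headers`, `submit_invoice`) and the 60-second timeout are assumptions made for illustration. This recipe does not describe the JSON returned by the run endpoint, so the sketch returns it unmodified and leaves locating the file identifier to the caller.

```python
from pathlib import Path

API_BASE = "https://api.anyformat.ai/v2"


def run_url(workflow_id):
    """Build the run endpoint URL used by the curl command above."""
    return f"{API_BASE}/workflows/{workflow_id}/run/"


def auth_headers(api_key):
    """Bearer-token header, matching the curl -H flag."""
    return {"Authorization": f"Bearer {api_key}"}


def submit_invoice(workflow_id, api_key, pdf_path):
    """Upload one PDF as multipart form data and return the parsed JSON body.

    The shape of the returned JSON is not documented in this recipe, so
    inspect it to find the identifier needed for fetching results.
    """
    import requests  # third-party; imported here so the helpers above work without it

    pdf = Path(pdf_path)
    with pdf.open("rb") as handle:
        response = requests.post(
            run_url(workflow_id),
            headers=auth_headers(api_key),
            # Equivalent of curl's -F 'file=@invoice.pdf'
            files={"file": (pdf.name, handle, "application/pdf")},
            timeout=60,  # assumed value; tune for large files
        )
    response.raise_for_status()
    return response.json()
```

A call would look like `submit_invoice("YOUR_WORKFLOW_ID", "YOUR_API_KEY", "invoice.pdf")`.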

Get Results

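import time

import requests

API_KEY = "YOUR_API_KEY"
file_id = "YOUR_FILE_ID"  # placeholder: ID of the file submitted in the previous step
headers = {"Authorization": f"Bearer {API_KEY}"}

# Poll for results with capped exponential backoff
max_attempts = 60
base_delay = 5  # seconds

for attempt in range(max_attempts):
    response = requests.get(
        f"https://api.anyformat.ai/v2/files/{file_id}/extraction/",
        headers=headers,
        timeout=30,
    )

    if response.status_code == 200:
        # Extraction finished
        results = response.json()
        break
    elif response.status_code == 412:
        # Treated as "still processing": wait, then retry (delay capped at 30 s)
        delay = min(base_delay * (1.5 ** min(attempt, 5)), 30)
        time.sleep(delay)
    else:
        # Any other status is an error; include the raw body for debugging
        raise RuntimeError(f"Error {response.status_code}: {response.text}")
else:
    # The loop ran out of attempts without receiving a 200 response
    raise TimeoutError("Processing timed out")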

# Use extracted data
print(f"Invoice #{results['invoice_number']['value']}")
print(f"Vendor: {results['vendor_name']['value']}")
print(f"Total: {results['currency']['value']} {results['total_amount']['value']}")

for item in results["line_items"]:
    print(f"  - {item['description']['value']}: {item['amount']['value']}")
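
To see how long the polling loop above can wait in the worst case, the delay formula can be evaluated on its own. This sketch only reproduces the arithmetic from the loop; `backoff_delay` is a name introduced here for illustration, and time spent on the HTTP requests themselves is not counted.

```python
def backoff_delay(attempt, base_delay=5, factor=1.5, cap=30):
    """Delay in seconds before the next poll, as computed in the loop above."""
    return min(base_delay * (factor ** min(attempt, 5)), cap)


delays = [backoff_delay(attempt) for attempt in range(60)]
print(delays[:7])   # [5.0, 7.5, 11.25, 16.875, 25.3125, 30, 30]
print(sum(delays))  # 1715.9375 seconds, roughly 28.6 minutes
```

The delay reaches the 30-second cap on the sixth attempt, so with 60 attempts the loop gives up after a little under half an hour of waiting. Lower `max_attempts` if that is too long for your pipeline.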

Example Response

{
  "invoice_number": {"value": "INV-2024-0847", "confidence": 97},
  "vendor_name": {"value": "Acme Consulting LLC", "confidence": 95},
  "issue_date": {"value": "2024-03-15", "confidence": 93},
  "due_date": {"value": "2024-04-14", "confidence": 91},
  "subtotal": {"value": 3750.00, "confidence": 94},
  "tax_amount": {"value": 337.50, "confidence": 92},
  "total_amount": {"value": 4087.50, "confidence": 96},
  "currency": {"value": "USD", "confidence": 98},
  "line_items": [
    {
      "description": {"value": "Strategy consulting - March", "confidence": 90},
      "quantity": {"value": 40, "confidence": 88},
      "unit_price": {"value": 75.00, "confidence": 91},
      "amount": {"value": 3000.00, "confidence": 93}
    },
    {
      "description": {"value": "Travel expenses", "confidence": 92},
      "quantity": {"value": 1, "confidence": 95},
      "unit_price": {"value": 750.00, "confidence": 89},
      "amount": {"value": 750.00, "confidence": 94}
    }
  ]
}
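
Each extracted value carries a confidence score, which makes it possible to route uncertain fields to manual review. The sketch below assumes the response shape shown above and that confidence is on a 0 to 100 scale, as the example suggests. The function name `low_confidence_fields` and the threshold of 90 are illustrative choices, not platform defaults.

```python
def low_confidence_fields(results, threshold=90, prefix=""):
    """Yield (path, value, confidence) for every field scored below threshold."""
    for name, field in results.items():
        path = f"{prefix}{name}"
        if isinstance(field, list):
            # Object fields arrive as a list of rows, each a dict of fields.
            for index, row in enumerate(field):
                yield from low_confidence_fields(row, threshold, f"{path}[{index}].")
        elif field["confidence"] < threshold:
            yield path, field["value"], field["confidence"]


# Trimmed copy of the example response above
results = {
    "total_amount": {"value": 4087.50, "confidence": 96},
    "line_items": [
        {"quantity": {"value": 40, "confidence": 88},
         "amount": {"value": 3000.00, "confidence": 93}},
        {"quantity": {"value": 1, "confidence": 95},
         "unit_price": {"value": 750.00, "confidence": 89}},
    ],
}

for path, value, confidence in low_confidence_fields(results):
    print(f"{path} = {value!r} (confidence {confidence})")
# line_items[0].quantity = 40 (confidence 88)
# line_items[1].unit_price = 750.0 (confidence 89)
```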

Tips

  • Use float for all monetary amounts, not string. This gives you numeric values you can sum and compare without parsing.
  • The object type with nested_fields captures repeating tabular data like line items. Each row becomes an object in the array.
  • Write specific field descriptions. “Total amount due including tax” extracts better than just “total”.
  • If invoices span multiple currencies, add the enum_options for all currencies you expect.
  • For multi-page invoices, processing automatically handles all pages.
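
Because monetary fields come back as floats, an extracted invoice can be checked for internal consistency with plain arithmetic. The following is a minimal sketch that assumes the response shape from the example response and that every value is present. The function names and the one-cent tolerance are illustrative, and invoices with discounts, shipping, or per-line tax would need additional rules.

```python
import math


def value(field):
    """Unwrap the {"value": ..., "confidence": ...} envelope."""
    return field["value"]


def reconcile(results, tolerance=0.01):
    """Return a list of arithmetic mismatches found in an extracted invoice."""
    issues = []
    line_total = 0.0
    for index, item in enumerate(results["line_items"]):
        expected = value(item["quantity"]) * value(item["unit_price"])
        if not math.isclose(expected, value(item["amount"]), abs_tol=tolerance):
            issues.append(f"line {index}: quantity x unit_price != amount")
        line_total += value(item["amount"])
    if not math.isclose(line_total, value(results["subtotal"]), abs_tol=tolerance):
        issues.append("line item amounts do not sum to subtotal")
    grand_total = value(results["subtotal"]) + value(results["tax_amount"])
    if not math.isclose(grand_total, value(results["total_amount"]), abs_tol=tolerance):
        issues.append("subtotal + tax_amount != total_amount")
    return issues


# Values taken from the example response
results = {
    "subtotal": {"value": 3750.00, "confidence": 94},
    "tax_amount": {"value": 337.50, "confidence": 92},
    "total_amount": {"value": 4087.50, "confidence": 96},
    "line_items": [
        {"quantity": {"value": 40, "confidence": 88},
         "unit_price": {"value": 75.00, "confidence": 91},
         "amount": {"value": 3000.00, "confidence": 93}},
        {"quantity": {"value": 1, "confidence": 95},
         "unit_price": {"value": 750.00, "confidence": 89},
         "amount": {"value": 750.00, "confidence": 94}},
    ],
}
print(reconcile(results))  # []
```

Comparing with `math.isclose` and an absolute tolerance avoids false alarms from floating-point rounding. A non-empty result is a useful signal that a field was misread and the document deserves a second look.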

Next Steps

Field Types

Learn about object, enum, and other complex field types

Response Formats

Export results as CSV or JSONL for downstream systems