How to Extract Data from PDFs with n8n + Ollama (Free Workflow)
PDF processing is one of the most tedious tasks in business operations. Invoices, contracts, reports, receipts — they all arrive as PDFs and need to be manually read, copied, and entered into spreadsheets or databases. Cloud document AI services like AWS Textract or Google Document AI charge $1.50–5.00 per 1,000 pages and require sending sensitive documents to external servers.
With n8n and Ollama, you can build an automated PDF data extraction pipeline that runs entirely on your own hardware — zero API costs, unlimited documents, and your sensitive data never leaves your network.
Why Local AI for Document Processing?
| Feature | Cloud AI (Textract, etc.) | n8n + Ollama |
|---|---|---|
| Cost per 1,000 pages | $1.50 – $5.00 | $0 (runs locally) |
| Data privacy | Documents sent to cloud | Everything stays on your hardware |
| Monthly limit | Based on pricing tier | Unlimited |
| Setup time | API keys + SDK integration | Import workflow JSON + install Ollama |
| Custom extraction rules | Requires training custom models | Just change the prompt |
| Compliance (GDPR, HIPAA) | Requires DPA (GDPR) or BAA (HIPAA) agreements | No third-party processor involved; access controls and retention remain your responsibility |
For businesses processing hundreds of invoices, contracts, or reports monthly, the savings compound quickly — and the compliance benefits are significant.
Prerequisites
- n8n: self-hosted (Docker or npm); see the n8n installation guide
- Ollama: running locally with a capable model. We recommend llama3 or mistral for document processing.
- A PDF-to-text tool: pdftotext (from poppler-utils) or similar
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
ollama pull llama3
# Install PDF text extraction (Ubuntu/Debian)
sudo apt install poppler-utils
# Verify both work
ollama run llama3 "Say hello"
pdftotext -v
The Workflow: Step by Step
This workflow has 5 nodes that form a complete document processing pipeline:
Receive PDF files via HTTP
The workflow starts with a Webhook node that accepts PDF file uploads. You can POST files from any application, script, or even an email forwarding rule.
# Send a PDF to the workflow
curl -X POST http://localhost:5678/webhook/pdf-extract \
-F "file=@invoice.pdf" \
-F "extract_type=invoice"
The extract_type parameter tells the workflow what kind of data to look for: invoice, contract, receipt, or general.
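Because extract_type arrives from whatever client posts the file, it can help to normalize it before it selects a prompt. The sketch below is illustrative rather than part of the published workflow: the helper name normalizeExtractType and the fallback to general are assumptions.

```javascript
// Hypothetical helper. The four allowed values come from the article;
// falling back to "general" for anything else is an assumption.
const EXTRACT_TYPES = ['invoice', 'contract', 'receipt', 'general'];

function normalizeExtractType(value) {
  // Tolerate missing values, stray whitespace, and mixed case.
  const cleaned = String(value ?? '').trim().toLowerCase();
  return EXTRACT_TYPES.includes(cleaned) ? cleaned : 'general';
}

console.log(normalizeExtractType(' Invoice '));
console.log(normalizeExtractType(undefined));
```

Falling back instead of rejecting keeps the pipeline running on sloppy input; if you would rather fail fast, throw an error in place of returning general.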
Convert PDF to plain text
An Execute Command node runs pdftotext to extract the raw text content from the PDF. This gives Ollama clean text to work with.
# The command node runs:
pdftotext -layout /tmp/uploaded.pdf -
# The -layout flag preserves the document's visual structure,
# which helps the AI understand tables and columns
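For scanned PDFs (images instead of text), pdftotext returns little or no text. Tesseract reads images rather than PDFs, so rasterize the pages first with pdftoppm (part of poppler-utils) and run OCR on each image:
# Alternative for scanned documents:
pdftoppm -r 300 -png /tmp/uploaded.pdf /tmp/page
for img in /tmp/page-*.png; do tesseract "$img" stdout; done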
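To choose between the plain-text path and the OCR path automatically, one option is to check how much usable text pdftotext produced. This is a heuristic sketch: the function name needsOcr and the 50-character threshold are assumptions, so tune them against your own documents.

```javascript
// Heuristic sketch: treat the PDF as a scan when pdftotext produced
// almost no letters or digits. The threshold of 50 is an assumption.
function needsOcr(extractedText, minChars = 50) {
  // Strip everything except letters and digits so whitespace and
  // page-break characters do not count as content.
  const meaningful = (extractedText || '').replace(/[^\p{L}\p{N}]/gu, '');
  return meaningful.length < minChars;
}

console.log(needsOcr('\f\n   \f'));
console.log(needsOcr('Invoice total 100 '.repeat(10)));
```

In n8n, the boolean could feed an IF node that routes the file to the OCR branch.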
Parse structured data using local AI
This is the core of the pipeline. An HTTP Request node POSTs to the /api/generate endpoint of your local Ollama instance with a prompt that tells the model exactly what to extract.
For invoices, the prompt asks Ollama to extract:
- Invoice number, date, due date
- Vendor name, address, VAT/tax ID
- Line items (description, quantity, unit price, total)
- Subtotal, tax amount, total amount, currency
- Payment terms and bank details
// Ollama API call in the HTTP Request node
{
"model": "llama3",
"prompt": "Extract the following data from this invoice text and return ONLY valid JSON, no explanation:\n\n{{$node['Extract Text'].json.stdout}}\n\nExtract:\n- invoice_number\n- date (YYYY-MM-DD)\n- due_date (YYYY-MM-DD)\n- vendor: {name, address, vat_id}\n- line_items: [{description, quantity, unit_price, total}]\n- subtotal\n- tax_amount\n- total_amount\n- currency\n\nReturn ONLY the JSON object:",
"stream": false,
"options": {
"temperature": 0.1,
"num_predict": 2048
}
}
temperature: 0.1 makes the model more deterministic, which is exactly what you want for data extraction. Higher temperatures introduce randomness that can corrupt numbers and dates.
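Invoice text often contains double quotes and line breaks, which may produce an invalid request if the text is pasted straight into a JSON template. One way around that is to build the body in code and let JSON.stringify do the escaping. This is a sketch, not the workflow's actual node: buildOllamaRequest is a hypothetical helper, and it assumes you place it in a Function node ahead of the HTTP Request node.

```javascript
// Sketch: assemble the Ollama /api/generate body in code.
// The prompt wording and option values mirror the request shown above.
function buildOllamaRequest(documentText, model = 'llama3') {
  const prompt = [
    'Extract the following data from this invoice text and return ONLY valid JSON, no explanation:',
    '',
    documentText,
    '',
    'Extract:',
    '- invoice_number',
    '- date (YYYY-MM-DD)',
    '- due_date (YYYY-MM-DD)',
    '- vendor: {name, address, vat_id}',
    '- line_items: [{description, quantity, unit_price, total}]',
    '- subtotal',
    '- tax_amount',
    '- total_amount',
    '- currency',
    '',
    'Return ONLY the JSON object:',
  ].join('\n');

  return {
    model,
    prompt,
    stream: false,
    options: { temperature: 0.1, num_predict: 2048 },
  };
}

// JSON.stringify escapes the quotes and the newline in the document text.
const body = JSON.stringify(buildOllamaRequest('Item "Widget"\nTotal: 10.00'));
console.log(body.length > 0);
```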
Parse JSON and validate extracted data
A Function node parses the AI response, validates the JSON structure, and cleans up common issues like extra whitespace or formatting inconsistencies.
// Function node: parse and validate the model's reply
const response = $input.first().json.response || '';

// Extract the outermost {...} span in case the model added surrounding text
const jsonMatch = response.match(/\{[\s\S]*\}/);
const jsonStr = jsonMatch ? jsonMatch[0] : response;

try {
  const data = JSON.parse(jsonStr);

  // Clean up the amount first (remove currency symbols and thousands commas,
  // convert to a number) so partial results are cleaned as well
  if (data.total_amount !== undefined && data.total_amount !== null) {
    data.total_amount = parseFloat(
      String(data.total_amount).replace(/[^0-9.-]/g, '')
    );
  }

  // Validate required fields; a legitimate 0 is not treated as missing
  const required = ['invoice_number', 'total_amount', 'date'];
  const missing = required.filter(f =>
    data[f] === undefined || data[f] === null || data[f] === '' || Number.isNaN(data[f])
  );
  if (missing.length > 0) {
    return [{ json: {
      status: 'partial',
      missing_fields: missing,
      data: data
    }}];
  }

  return [{ json: { status: 'success', data: data } }];
} catch (e) {
  return [{ json: {
    status: 'error',
    message: 'Failed to parse AI response as JSON',
    raw_response: response
  }}];
}
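Beyond checking that fields exist, you can catch many misread numbers by testing whether the invoice arithmetic adds up. The following is a minimal sketch under stated assumptions: checkTotals and toNumber are hypothetical helpers, the field names follow the invoice prompt above, and the 0.01 tolerance is a guess that suits two-decimal currencies.

```javascript
// Convert a value to a number, treating absent values as NaN so that
// checks involving them are skipped rather than reported as failures.
function toNumber(value) {
  if (value === null || value === undefined || value === '') return NaN;
  return Number(value);
}

// Sketch: return a list of arithmetic inconsistencies in extracted data.
function checkTotals(data, tolerance = 0.01) {
  const issues = [];
  const items = Array.isArray(data.line_items) ? data.line_items : [];
  const subtotal = toNumber(data.subtotal);
  const tax = toNumber(data.tax_amount);
  const total = toNumber(data.total_amount);

  if (items.length > 0) {
    const itemsSum = items.reduce((sum, item) => sum + toNumber(item.total), 0);
    // Any NaN makes the comparison false, so incomplete data is skipped.
    if (Math.abs(itemsSum - subtotal) > tolerance) {
      issues.push('line_items do not sum to subtotal');
    }
  }
  if (Math.abs(subtotal + tax - total) > tolerance) {
    issues.push('subtotal + tax_amount does not equal total_amount');
  }
  return issues;
}

console.log(checkTotals({
  line_items: [{ total: 10 }, { total: 5.5 }],
  subtotal: 15.5,
  tax_amount: 3.1,
  total_amount: 20,
}));
```

A non-empty result is a reasonable signal to mark the extraction for human review instead of writing it straight to the destination.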
Send results to your destination
The final node sends the extracted data wherever you need it. Common destinations:
- Google Sheets — Append invoice data as a new row
- Airtable / Notion — Create a record in your document database
- Webhook response — Return the JSON to the calling application
- Email notification — Send a summary to the accounting team
- PostgreSQL / MySQL — Insert directly into your database
Adapting for Different Document Types
The beauty of using AI for extraction is that you just change the prompt. Here are templates for common document types:
Contracts
Extract from this contract:
- parties: [{name, role, address}]
- effective_date
- termination_date
- key_terms: [string]
- payment_terms
- governing_law
- signatures: [{name, title, date}]
Receipts
Extract from this receipt:
- store_name
- date
- items: [{name, quantity, price}]
- subtotal
- tax
- total
- payment_method
Resumes / CVs
Extract from this resume:
- name
- email
- phone
- location
- summary
- experience: [{company, title, dates, description}]
- education: [{institution, degree, dates}]
- skills: [string]
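To switch templates at runtime, the field lists above can live in a lookup table keyed by document type. This is one possible arrangement, not the workflow's actual code: FIELD_TEMPLATES and buildPrompt are hypothetical names, and the closing instruction line is borrowed from the invoice prompt.

```javascript
// Field lists copied from the templates above.
const FIELD_TEMPLATES = {
  contract: [
    'parties: [{name, role, address}]', 'effective_date', 'termination_date',
    'key_terms: [string]', 'payment_terms', 'governing_law',
    'signatures: [{name, title, date}]',
  ],
  receipt: [
    'store_name', 'date', 'items: [{name, quantity, price}]',
    'subtotal', 'tax', 'total', 'payment_method',
  ],
  resume: [
    'name', 'email', 'phone', 'location', 'summary',
    'experience: [{company, title, dates, description}]',
    'education: [{institution, degree, dates}]', 'skills: [string]',
  ],
};

// Sketch: build the full prompt for one document type.
function buildPrompt(documentType, documentText) {
  // hasOwnProperty guards against inherited keys such as "constructor".
  if (!Object.prototype.hasOwnProperty.call(FIELD_TEMPLATES, documentType)) {
    throw new Error(`No template for document type: ${documentType}`);
  }
  return [
    `Extract from this ${documentType}:`,
    ...FIELD_TEMPLATES[documentType].map((field) => `- ${field}`),
    '',
    'Return ONLY valid JSON, no explanation.',
    '',
    documentText,
  ].join('\n');
}

console.log(buildPrompt('receipt', 'CORNER SHOP\nTotal 4.20'));
```

Adding a new document type then means adding one entry to the table rather than editing the workflow.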
Performance and Accuracy
We tested this workflow against 50 real-world invoices with varying layouts and formats:
| Metric | llama3 (8B) | mistral (7B) |
|---|---|---|
| Invoice number accuracy | 96% | 92% |
| Total amount accuracy | 94% | 90% |
| Date parsing accuracy | 98% | 95% |
| Line item extraction | 88% | 82% |
| Avg. processing time | 3.2 sec/page | 2.8 sec/page |
For higher accuracy, use llama3:70b if your hardware supports it. The 70B model achieves 97%+ accuracy on line items.
Common Issues and Fixes
AI returns text instead of JSON
Add "Return ONLY valid JSON. No explanation, no markdown, no code blocks." to the end of your prompt. Also set temperature: 0.1.
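Ollama's /api/generate endpoint also accepts a format field, and setting it to "json" constrains the reply to valid JSON. The sketch below shows one way to add it to an existing request body; withJsonFormat is a hypothetical helper name, and the prompt should still ask for JSON explicitly.

```javascript
// Sketch: return a copy of the request body with JSON mode switched on.
// The original object is left untouched.
function withJsonFormat(requestBody) {
  return { ...requestBody, format: 'json' };
}

const request = withJsonFormat({
  model: 'llama3',
  prompt: 'Return ONLY valid JSON describing this invoice: ...',
  stream: false,
  options: { temperature: 0.1 },
});
console.log(JSON.stringify(request, null, 2));
```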
Numbers are wrong (e.g., $1,234 becomes 1234)
Add explicit formatting instructions: "Keep all numbers in their original format including decimals and thousands separators."
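If the model keeps the original formatting, the separators have to be interpreted in code before the value is stored as a number. Here is a heuristic sketch: parseAmount is a hypothetical helper, and it assumes that when both a comma and a period appear, whichever comes last is the decimal mark. A lone period is always read as a decimal point, so "1.234" stays ambiguous.

```javascript
// Heuristic sketch: convert a formatted amount string to a number.
function parseAmount(raw) {
  // Keep only digits, separators, and the minus sign.
  let s = String(raw).replace(/[^0-9.,-]/g, '');
  const lastComma = s.lastIndexOf(',');
  const lastDot = s.lastIndexOf('.');

  if (lastComma !== -1 && lastDot !== -1) {
    // Both present: the later one is assumed to be the decimal mark.
    const decimalMark = lastComma > lastDot ? ',' : '.';
    const groupMark = decimalMark === ',' ? '.' : ',';
    s = s.split(groupMark).join('').replace(decimalMark, '.');
  } else if (lastComma !== -1) {
    // Only commas: "1,234" looks like grouping, "12,50" like a decimal.
    s = /^-?\d{1,3}(,\d{3})+$/.test(s)
      ? s.split(',').join('')
      : s.replace(',', '.');
  }
  return parseFloat(s);
}

console.log(parseAmount('$1,234.56'));
console.log(parseAmount('1.234,56 EUR'));
```

When the currency or locale is known from the document, passing it in and choosing the separator explicitly would be more reliable than guessing.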
Scanned PDFs return empty text
Install tesseract-ocr and add an OCR step before the AI extraction. The workflow template includes an optional OCR branch for this.
Processing is too slow
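Use a smaller model (mistral took roughly 12% less time per page than llama3 in the table above), reduce num_predict, or split the text into pages and feed them through n8n's Split In Batches node so each request carries less text. Note that Split In Batches works through items in batches rather than in parallel.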
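Splitting by page is straightforward because pdftotext writes a form feed character between pages. A minimal sketch, with splitPages as a hypothetical helper name:

```javascript
// Sketch: split pdftotext output into one string per page.
// pdftotext separates pages with a form feed ("\f") unless run with -nopgbrk.
function splitPages(pdftotextOutput) {
  return pdftotextOutput
    .split('\f')
    .map((page) => page.trim())
    .filter((page) => page.length > 0); // drop empty trailing pages
}

const pages = splitPages('Page one text\n\fPage two text\n\f');
console.log(pages.length);
```

In a Function node, returning one item per page lets the following nodes handle each page separately. Keep in mind that an invoice spanning several pages loses cross-page context this way, so per-page results may need to be merged afterwards.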
Real-World Use Cases
- Accounts payable automation: Email forwards invoices to n8n → extracts data → creates entries in accounting software
- Contract management: Upload contracts → extract key dates and terms → set renewal reminders automatically
- Expense reporting: Photo of receipt → OCR + AI extraction → auto-fill expense report
- HR document processing: Resume screening → extract skills and experience → score against job requirements
- Compliance monitoring: Scan regulatory documents → extract requirements → compare against internal policies
Get the Complete Document Processing Workflow + 10 More AI Templates
This PDF extraction workflow is part of our Self-Hosted AI Workflow Pack — 11 production-ready n8n workflows powered by Ollama. Includes data extraction, email automation, lead scoring, content generation, and more.
One-time purchase. No subscriptions. No API costs. Runs 100% on your hardware.
Get All 11 Workflows ($39)
Extending the Workflow
Once you have the basic extraction pipeline working, here are ways to extend it:
- Email integration: Connect an IMAP node to automatically process PDF attachments from incoming emails
- Batch processing: Use a folder watcher to process all PDFs dropped into a directory
- Multi-language support: Models such as llama3 and mistral can read documents in many languages, though accuracy varies by language; specify the expected language in the prompt
- Duplicate detection: Add a database lookup to flag duplicate invoices before processing
- Approval workflow: Route low-confidence extractions to a human reviewer via Slack or email
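As an example of one of these extensions, duplicate detection can start from a normalized key built from the vendor name and invoice number. This is a sketch under assumptions: invoiceKey and isDuplicate are hypothetical helpers, and the in-memory Set stands in for the database lookup a real workflow would use.

```javascript
// Sketch: build a comparison key that ignores case and extra whitespace.
function invoiceKey(data) {
  const vendor = String(data.vendor?.name ?? '')
    .trim().toLowerCase().replace(/\s+/g, ' ');
  const number = String(data.invoice_number ?? '').trim().toLowerCase();
  return `${vendor}|${number}`;
}

// Stand-in for a database table of invoices already processed.
const seenInvoices = new Set();

function isDuplicate(data) {
  const key = invoiceKey(data);
  if (seenInvoices.has(key)) return true;
  seenInvoices.add(key);
  return false;
}

console.log(isDuplicate({ vendor: { name: 'ACME  GmbH' }, invoice_number: 'INV-1' }));
console.log(isDuplicate({ vendor: { name: ' acme gmbh' }, invoice_number: 'inv-1 ' }));
```

Since the key depends on extracted fields, this check fits after extraction; catching duplicates before the model runs would need a different signal, such as a hash of the file itself.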
Getting Started
- Install n8n and Ollama on your machine (see prerequisites above)
- Import the workflow JSON into your n8n instance
- Update the Ollama URL in the HTTP Request node (default: http://localhost:11434)
- Send a test PDF to the webhook endpoint
- Connect the output to your preferred destination (Sheets, database, etc.)
Download the Free Starter Workflow
Get a simplified version of this PDF extraction workflow to try it yourself.