Automate Data Enrichment with n8n + Ollama (AI-Powered Web Scraping Pipeline)
Data enrichment tools like Clearbit, ZoomInfo, and Apollo charge $100–500/month for a simple premise: take a company name or domain, look up public information, and return structured data. For sales teams doing outreach, this is table stakes. But paying per-record fees for data that is already publicly available on company websites, LinkedIn profiles, and press releases feels increasingly unnecessary.
With n8n and Ollama, you can build a self-hosted data enrichment pipeline that scrapes public web pages, feeds raw HTML through a local AI model to extract structured information, and pushes clean, enriched records into Google Sheets, Airtable, or your CRM. No API keys for third-party enrichment services, no per-record costs, and no sending your prospect lists to external servers.
In this guide, you will build a complete n8n data enrichment workflow that:
- Accepts a list of company domains or contact names via webhook or spreadsheet
- Scrapes public web pages (company sites, about pages, blog posts)
- Passes raw page content through Ollama to extract structured fields
- Validates and normalizes the extracted data
- Writes enriched records to Google Sheets or Airtable
What Is AI-Powered Data Enrichment?
Traditional data enrichment relies on pre-built databases. Clearbit maintains a database of company profiles and matches your input against it. The problem: their data goes stale, coverage gaps exist for smaller companies, and you pay whether the match is good or not.
AI-powered enrichment takes a different approach. Instead of looking up records in a database, it reads the actual source material — company websites, public profiles, press releases — and extracts structured information in real time. The AI model acts as an intelligent parser that understands context, handles varied page layouts, and returns consistent structured output regardless of how the source data is formatted.
This matters for three reasons:
- Freshness: You are reading the live website, not a cached record from months ago
- Coverage: Any company with a website can be enriched, not just those in a vendor's database
- Flexibility: You define the fields you want extracted — change the prompt, change the output
| | Traditional Enrichment (Clearbit, ZoomInfo) | AI Enrichment (n8n + Ollama) |
|---|---|---|
| Cost | $100–500/month + per-record fees | $0 (self-hosted) |
| Data freshness | Days to weeks old | Real-time (scraped live) |
| Coverage | Limited to vendor's database | Any public website |
| Custom fields | Fixed schema | Define any fields via prompt |
| Data privacy | Prospect list sent to vendor | Stays on your server |
| Rate limits | Strict API quotas | Limited only by your hardware |
When to use which: AI enrichment works best for enriching 50–5,000 records where you need custom fields or operate in a niche market. For enriching 100,000+ records with standard firmographic data, a traditional vendor may still be faster. The two approaches also work well together — use Clearbit for basic fields, then run AI enrichment to fill gaps.
The Architecture
The pipeline has five stages: trigger, scrape, extract, validate, and output. n8n orchestrates the entire flow, Ollama handles the intelligent extraction, and the HTTP Request node does the web scraping.
[Webhook / Google Sheets Trigger]
↓
[Loop Over Records]
↓
[HTTP Request — Scrape Website]
↓
[HTML to Text Conversion]
↓
[Ollama — Extract Structured Data]
↓
[JSON Parse & Validate]
↓
[Google Sheets / Airtable — Write Enriched Data]
Each record flows through independently, so failures on one URL do not block the rest. The workflow includes error handling to skip unreachable websites and flag records that could not be enriched.
Step 1: Set Up the Input Trigger
You need a way to feed company domains into the workflow. Two common approaches:
Option A: Webhook Trigger (for on-demand enrichment)
Add a Webhook node that accepts a JSON array of domains:
// Webhook configuration
// Method: POST
// Path: /enrich-companies
// Authentication: Header Auth (recommended)
//
// Expected payload:
{
"companies": [
{"domain": "stripe.com", "name": "Stripe"},
{"domain": "linear.app", "name": "Linear"},
{"domain": "posthog.com", "name": "PostHog"}
]
}
Option B: Google Sheets Trigger (for batch enrichment)
Use a Google Sheets Trigger node that watches for new rows in a spreadsheet. Set up a sheet with columns: Domain, Company Name, Status. When you paste new domains into the sheet, the workflow triggers automatically.
// Google Sheets Trigger configuration
// Spreadsheet: "Lead Enrichment Pipeline"
// Sheet: "Input"
// Trigger on: "Row Added"
// Poll interval: Every 1 minute
//
// After the trigger, add a Filter node:
// Condition: {{ $json.Status }} is empty
// (Skip rows that have already been processed)
After the trigger, add a Split In Batches node to process records one at a time. This prevents overloading your Ollama instance with parallel requests and keeps web scraping at a polite pace.
// Split In Batches configuration
// Batch Size: 1
// Options → Reset: false
//
// Add a Wait node after each batch iteration:
// Wait time: 2 seconds
// (Be polite to target websites — do not hammer them)
Step 2: Scrape the Company Website
Use the HTTP Request node to fetch the company's homepage and about page. These two pages contain 80% of the information you need for lead enrichment.
// HTTP Request Node — Fetch Homepage
// Method: GET
// URL: https://{{ $json.domain }}
// Options:
// Timeout: 10000 (10 seconds)
// Follow Redirects: true
// Full Response: true (to capture status code)
// Response Format: String (raw HTML)
//
// On Error: Continue (do not stop the workflow)
// Add header: User-Agent: Mozilla/5.0 (compatible; DataBot/1.0)
After fetching the homepage, add a second HTTP Request node for the about page:
// HTTP Request Node — Fetch About Page
// Method: GET
// URL: https://{{ $json.domain }}/about
// Same options as above
// On Error: Continue (about page may not exist)
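Because both requests use On Error: Continue, downstream nodes may receive an error payload instead of HTML. A minimal Function-node sketch that flags failed fetches before they reach the cleaning step (written here as a plain function for clarity; in n8n you would read the item from $input, and the statusCode/body field names assume the Full Response option above):

```javascript
// Sketch: classify a scrape result before passing it downstream
// (statusCode/body field names are assumptions based on Full Response mode)
function checkScrape(item) {
  const status = item.statusCode || 0;

  // Treat 4xx/5xx responses and empty bodies as failed scrapes;
  // 403/429 usually means the site is actively blocking bots
  if (status >= 400 || !item.body) {
    return {
      domain: item.domain,
      enrichment_status: status === 403 || status === 429 ? 'blocked' : 'failed',
      http_status: status
    };
  }

  // Pass the raw HTML through for the cleaning step
  return { ...item, html: item.body };
}
```

Records marked blocked or failed can be routed past the Ollama node with an IF node so they still land in the output sheet with a status, rather than silently disappearing.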
HTML to Text Conversion
Raw HTML is full of tags, scripts, and CSS that waste Ollama's context window. Use a Function node to strip it down to plain text:
// Function Node: Clean HTML to Text
const html = $json.data || $json.body || '';
// Remove script and style blocks entirely
let text = html
.replace(/<script[\s\S]*?<\/script>/gi, '')
.replace(/<style[\s\S]*?<\/style>/gi, '')
.replace(/<nav[\s\S]*?<\/nav>/gi, '')
.replace(/<footer[\s\S]*?<\/footer>/gi, '')
.replace(/<header[\s\S]*?<\/header>/gi, '');
// Replace HTML tags with spaces
text = text.replace(/<[^>]+>/g, ' ');
// Decode common HTML entities (decode &amp; last to avoid double-decoding)
text = text
.replace(/&nbsp;/g, ' ')
.replace(/&lt;/g, '<')
.replace(/&gt;/g, '>')
.replace(/&quot;/g, '"')
.replace(/&amp;/g, '&');
// Collapse whitespace
text = text.replace(/\s+/g, ' ').trim();
// Truncate to ~4000 chars to fit Ollama's context window efficiently
text = text.substring(0, 4000);
return [{ json: { ...($input.first().json), pageText: text } }];
Why strip HTML aggressively? An average company homepage is 50–150KB of raw HTML, but only 2–5KB of actual useful text. Sending raw HTML to Ollama wastes tokens on navigation links, JavaScript, and CSS classes. The cleaning function above reduces input size by 90%+ and improves extraction accuracy because the model focuses on actual content.
Step 3: Extract Structured Data with Ollama
This is where the AI does the heavy lifting. Connect an Ollama node (or an HTTP Request to the Ollama API) and give it a carefully crafted extraction prompt.
// Ollama HTTP API Request
// Method: POST
// URL: http://localhost:11434/api/generate
// Body (JSON):
{
"model": "llama3.1:8b",
"prompt": "You are a data extraction assistant. Analyze the following company webpage text and extract structured information. Return ONLY a valid JSON object with these fields. If a field cannot be determined from the text, use null.\n\nFields to extract:\n- company_name: Official company name\n- tagline: Company tagline or one-line description\n- industry: Primary industry (e.g., SaaS, Fintech, Healthcare, E-commerce)\n- business_model: B2B, B2C, or B2B2C\n- employee_count_range: Estimated range (e.g., 1-10, 11-50, 51-200, 201-500, 500+)\n- founding_year: Year founded\n- headquarters: City, Country\n- key_products: Array of main products or services (max 3)\n- tech_stack_signals: Array of technologies mentioned on the page (max 5)\n- pricing_model: free, freemium, paid, enterprise, or null\n- target_customer: Who the product is for (1 sentence)\n- recent_news: Any recent announcements, funding rounds, or milestones\n\nWebpage text:\n---\n{{ $json.pageText }}\n---\n\nReturn ONLY the JSON object, no explanations or markdown.",
"stream": false,
"options": {
"temperature": 0.1,
"num_predict": 1000
}
}
The low temperature (0.1) is critical for extraction tasks. You want deterministic, factual output — not creative interpretation. The num_predict limit prevents the model from rambling past the JSON.
The Extraction Prompt Explained
The prompt above is designed for reliability:
- Role assignment ("data extraction assistant") primes the model for structured output
- Explicit field definitions with examples reduce ambiguity
- Null handling ("use null") prevents hallucination — the model won't invent a founding year if it's not on the page
- "Return ONLY the JSON" suppresses the model's tendency to add explanations
Model recommendation: Use llama3.1:8b for this task. It handles JSON output reliably and processes pages in 3–8 seconds on a GPU. If you need faster throughput and can accept slightly lower accuracy, mistral:7b is about 30% faster. Avoid models smaller than 7B parameters for extraction — they struggle with consistent JSON formatting.
Step 4: Parse and Validate the Output
Ollama's response comes as a text string. You need to parse it into a proper JSON object and handle cases where the model returns malformed output.
// Function Node: Parse Ollama Response
const response = $json.response || '';
// Try to extract JSON from the response
let enrichedData = null;
try {
// First attempt: direct parse
enrichedData = JSON.parse(response);
} catch (e) {
// Second attempt: find JSON block in response
const jsonMatch = response.match(/\{[\s\S]*\}/);
if (jsonMatch) {
try {
enrichedData = JSON.parse(jsonMatch[0]);
} catch (e2) {
// Extraction failed
enrichedData = null;
}
}
}
if (!enrichedData) {
return [{
json: {
domain: $('Split In Batches').item.json.domain,
enrichment_status: 'failed',
error: 'Could not parse Ollama response as JSON'
}
}];
}
// Validate key fields
const validated = {
domain: $('Split In Batches').item.json.domain,
company_name: enrichedData.company_name || null,
tagline: enrichedData.tagline || null,
industry: enrichedData.industry || null,
business_model: enrichedData.business_model || null,
employee_count_range: enrichedData.employee_count_range || null,
founding_year: enrichedData.founding_year || null,
headquarters: enrichedData.headquarters || null,
key_products: Array.isArray(enrichedData.key_products)
? enrichedData.key_products.join(', ') : (enrichedData.key_products || null),
tech_stack_signals: Array.isArray(enrichedData.tech_stack_signals)
? enrichedData.tech_stack_signals.join(', ') : (enrichedData.tech_stack_signals || null),
pricing_model: enrichedData.pricing_model || null,
target_customer: enrichedData.target_customer || null,
recent_news: enrichedData.recent_news || null,
enrichment_status: 'success',
enriched_at: new Date().toISOString()
};
return [{ json: validated }];
The dual-parse strategy (try direct parse, then regex extraction) handles the common case where Ollama wraps the JSON in markdown code fences or adds a brief preamble. The validation step ensures arrays are flattened to comma-separated strings for spreadsheet compatibility.
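The same dual-parse strategy, written as a standalone function, is handy for testing extraction prompts outside n8n before wiring them into the workflow:

```javascript
// Sketch: dual-parse strategy for model output —
// try a direct JSON.parse, then fall back to grabbing the first {...} block
function parseModelJson(response) {
  try {
    // First attempt: the model returned bare JSON
    return JSON.parse(response);
  } catch (e) {
    // Second attempt: JSON wrapped in markdown fences or a preamble
    const match = response.match(/\{[\s\S]*\}/);
    if (match) {
      try {
        return JSON.parse(match[0]);
      } catch (e2) {
        // fall through to null
      }
    }
  }
  return null;
}
```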
Step 5: Write to Google Sheets or Airtable
The final step writes enriched records to your output destination.
Google Sheets Output
// Google Sheets Node Configuration
// Operation: Append Row
// Spreadsheet: "Lead Enrichment Pipeline"
// Sheet: "Enriched"
// Mapping Mode: Map Each Column
//
// Column mappings:
// Domain → {{ $json.domain }}
// Company Name → {{ $json.company_name }}
// Tagline → {{ $json.tagline }}
// Industry → {{ $json.industry }}
// Business Model → {{ $json.business_model }}
// Employee Range → {{ $json.employee_count_range }}
// Founded → {{ $json.founding_year }}
// Headquarters → {{ $json.headquarters }}
// Products → {{ $json.key_products }}
// Tech Stack → {{ $json.tech_stack_signals }}
// Pricing → {{ $json.pricing_model }}
// Target Customer→ {{ $json.target_customer }}
// Recent News → {{ $json.recent_news }}
// Status → {{ $json.enrichment_status }}
// Enriched At → {{ $json.enriched_at }}
Airtable Output
// Airtable Node Configuration
// Operation: Create Record
// Base: "Sales Pipeline"
// Table: "Enriched Companies"
//
// Same field mappings as above.
// Airtable advantage: you can use Single Select fields
// for Industry and Business Model, which auto-create
// filter options in the Airtable UI.
Practical Use Cases
Lead Scoring and Qualification
Feed a list of inbound leads (from a form, webinar registration, or trial signup) through the enrichment pipeline. Use the extracted fields to score leads automatically:
// Function Node: Simple Lead Scoring
let score = 0;
const data = $json;
// Company size scoring
const sizeScores = { '1-10': 1, '11-50': 2, '51-200': 3, '201-500': 4, '500+': 5 };
score += sizeScores[data.employee_count_range] || 0;
// Business model fit (assuming you sell B2B)
if (data.business_model === 'B2B') score += 3;
if (data.business_model === 'B2B2C') score += 2;
// Industry fit
const targetIndustries = ['SaaS', 'Fintech', 'E-commerce'];
if (targetIndustries.includes(data.industry)) score += 3;
// Pricing signals
if (data.pricing_model === 'enterprise') score += 2;
return [{ json: { ...data, lead_score: score, qualified: score >= 7 } }];
Competitor Analysis
Modify the extraction prompt to pull competitive intelligence:
- What features do they highlight on their homepage?
- What pricing tier names do they use?
- Which integrations do they promote?
- What customer testimonials are featured?
Run this weekly on a list of competitor domains and track changes over time in a spreadsheet.
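One way to adapt the Step 3 prompt for this: swap the field list while keeping the same guardrails (JSON-only output, null for unknowns). The field names below are illustrative, not part of the original workflow; adjust them to whatever you want to track.

```javascript
// Sketch: build a competitor-intelligence variant of the extraction prompt
// (field names are illustrative — change them to match what you track)
function buildCompetitorPrompt(pageText) {
  return [
    'You are a competitive intelligence assistant. Analyze the following',
    'competitor webpage text and return ONLY a valid JSON object with these',
    'fields. If a field cannot be determined from the text, use null.',
    '- headline_features: Array of features highlighted on the page (max 5)',
    '- pricing_tiers: Array of pricing tier names (max 4)',
    '- integrations: Array of integrations promoted on the page (max 5)',
    '- testimonial_customers: Array of customer names quoted on the page',
    '',
    'Webpage text:',
    '---',
    pageText,
    '---',
    'Return ONLY the JSON object, no explanations or markdown.'
  ].join('\n');
}
```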
Market Research
Enrich a list of companies in a target market segment. Pull industry classification, size estimates, and technology signals to build a market map. The tech_stack_signals field is especially useful — if 60% of companies in your target market mention "Kubernetes" or "AWS," that tells you something about their infrastructure maturity and buying patterns.
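Percentages like that can be computed directly from the enriched output. A minimal sketch, assuming tech_stack_signals is the comma-separated string produced by the Step 4 validation code:

```javascript
// Sketch: percentage of enriched records mentioning each technology
// Assumes tech_stack_signals is the comma-separated string from Step 4
function techFrequency(records) {
  const counts = {};
  for (const rec of records) {
    const signals = (rec.tech_stack_signals || '')
      .split(',')
      .map(s => s.trim())
      .filter(Boolean);
    // Use a Set so a technology counts once per record
    for (const tech of new Set(signals)) {
      counts[tech] = (counts[tech] || 0) + 1;
    }
  }
  const total = records.length || 1;
  return Object.fromEntries(
    Object.entries(counts).map(([tech, n]) => [tech, Math.round((n / total) * 100)])
  );
}
```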
Handling Edge Cases
Websites That Block Scraping
Some sites return 403 errors or CAPTCHAs. Handle this gracefully:
- Check the HTTP status code before processing
- If 403 or 429, mark the record as enrichment_status: "blocked"
- Add a 2–5 second delay between requests to avoid rate limits
- Use a realistic User-Agent header
Non-English Websites
llama3.1:8b handles multilingual content reasonably well. The extracted fields will typically be returned in English even if the source is in another language. For better multilingual support, use qwen2.5:7b, which was trained on more diverse language data.
Single-Page JavaScript Apps
The HTTP Request node fetches raw HTML. Sites built entirely in React or Vue may return an empty shell. For these, you have two options:
- Use n8n's built-in Puppeteer or Playwright integration (community node) to render JavaScript
- Scrape /sitemap.xml or /robots.txt to find statically-rendered pages
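Before falling back to a headless browser, you can detect the empty-shell case cheaply from the cleaned page text. A sketch, where the 200-character threshold is an assumption to tune for your targets:

```javascript
// Sketch: flag pages whose cleaned text is too short to be real content
// (the 200-character threshold is a heuristic — tune it for your targets)
function looksLikeJsShell(pageText) {
  const text = (pageText || '').trim();
  // JS-only shells typically leave almost no text after cleaning,
  // or just a noscript warning asking the visitor to enable JavaScript
  return text.length < 200 || /enable javascript/i.test(text);
}
```

Records flagged this way can be routed to a separate branch (headless-browser scrape, or simply enrichment_status: "js_rendered") instead of wasting an Ollama call on an empty page.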
Advantages of Self-Hosted Enrichment
Running this pipeline on your own infrastructure has concrete benefits beyond cost savings:
- Data privacy: Your prospect list never leaves your server. This matters if you are enriching leads before they have opted in to your marketing, or if you operate under GDPR where sharing email lists with third-party vendors requires explicit consent.
- No per-request API costs: Clearbit charges $0.05–0.50 per enrichment. Enriching 5,000 leads costs $250–2,500. With Ollama, the cost is the electricity to run inference — roughly $0.002 per record on consumer hardware.
- Custom extraction schema: Need to extract "sustainability certifications" or "open-source contributions"? Change the prompt. No waiting for a vendor to add a new field to their API.
- No rate limits: Process records as fast as your hardware allows. A single RTX 3060 can handle roughly 10–15 enrichments per minute with llama3.1:8b.
Complete Workflow JSON
Import this into n8n to get the data enrichment pipeline running. You will need to configure your Google Sheets credentials and ensure Ollama is running locally.
{
"name": "AI Data Enrichment (Ollama + Web Scraping)",
"nodes": [
{
"parameters": {
"httpMethod": "POST",
"path": "enrich-companies",
"options": {}
},
"id": "webhook-trigger",
"name": "Webhook Trigger",
"type": "n8n-nodes-base.webhook",
"typeVersion": 2,
"position": [240, 300],
"webhookId": "enrich-companies"
},
{
"parameters": {
"batchSize": 1,
"options": {}
},
"id": "split-batches",
"name": "Split In Batches",
"type": "n8n-nodes-base.splitInBatches",
"typeVersion": 3,
"position": [460, 300]
},
{
"parameters": {
"url": "=https://{{ $json.domain }}",
"options": {
"timeout": 10000,
"redirect": { "followRedirects": true },
"response": { "response": { "fullResponse": true, "responseFormat": "text" } }
},
"onError": "continueRegularOutput"
},
"id": "http-scrape",
"name": "Scrape Website",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [680, 300]
},
{
"parameters": {
"functionCode": "const html = $json.data || $json.body || '';\nlet text = html.replace(/