Automate Data Enrichment with n8n + Ollama (AI-Powered Web Scraping Pipeline)
Data enrichment tools like Clearbit, ZoomInfo, and Apollo charge $100–500/month for a simple premise: take a company name or domain, look up public information, and return structured data. For sales teams doing outreach, this is table stakes. But paying per-record fees for data that is already publicly available on company websites, LinkedIn profiles, and press releases feels increasingly unnecessary.
With n8n and Ollama, you can build a self-hosted data enrichment pipeline that scrapes public web pages, feeds raw HTML through a local AI model to extract structured information, and pushes clean, enriched records into Google Sheets, Airtable, or your CRM. No API keys for third-party enrichment services, no per-record costs, and no sending your prospect lists to external servers.
In this guide, you will build a complete n8n data enrichment workflow that:
- Accepts a list of company domains or contact names via webhook or spreadsheet
- Scrapes public web pages (company sites, about pages, blog posts)
- Passes raw page content through Ollama to extract structured fields
- Validates and normalizes the extracted data
- Writes enriched records to Google Sheets or Airtable
What Is AI-Powered Data Enrichment?
Traditional data enrichment relies on pre-built databases. Clearbit maintains a database of company profiles and matches your input against it. The problem: their data goes stale, coverage gaps exist for smaller companies, and you pay whether the match is good or not.
AI-powered enrichment takes a different approach. Instead of looking up records in a database, it reads the actual source material — company websites, public profiles, press releases — and extracts structured information in real time. The AI model acts as an intelligent parser that understands context, handles varied page layouts, and returns consistent structured output regardless of how the source data is formatted.
This matters for three reasons:
- Freshness: You are reading the live website, not a cached record from months ago
- Coverage: Any company with a website can be enriched, not just those in a vendor's database
- Flexibility: You define the fields you want extracted — change the prompt, change the output
| | Traditional Enrichment (Clearbit, ZoomInfo) | AI Enrichment (n8n + Ollama) |
|---|---|---|
| Cost | $100–500/month + per-record fees | $0 (self-hosted) |
| Data freshness | Days to weeks old | Real-time (scraped live) |
| Coverage | Limited to vendor's database | Any public website |
| Custom fields | Fixed schema | Define any fields via prompt |
| Data privacy | Prospect list sent to vendor | Stays on your server |
| Rate limits | Strict API quotas | Limited only by your hardware |
When to use which: AI enrichment works best for enriching 50–5,000 records where you need custom fields or operate in a niche market. For enriching 100,000+ records with standard firmographic data, a traditional vendor may still be faster. The two approaches also work well together — use Clearbit for basic fields, then run AI enrichment to fill gaps.
The Architecture
The pipeline has five stages: trigger, scrape, extract, validate, and output. n8n orchestrates the entire flow, Ollama handles the intelligent extraction, and the HTTP Request node does the web scraping.
[Webhook / Google Sheets Trigger]
↓
[Loop Over Records]
↓
[HTTP Request — Scrape Website]
↓
[HTML to Text Conversion]
↓
[Ollama — Extract Structured Data]
↓
[JSON Parse & Validate]
↓
[Google Sheets / Airtable — Write Enriched Data]
Each record flows through independently, so failures on one URL do not block the rest. The workflow includes error handling to skip unreachable websites and flag records that could not be enriched.
Step 1: Set Up the Input Trigger
You need a way to feed company domains into the workflow. Two common approaches:
Option A: Webhook Trigger (for on-demand enrichment)
Add a Webhook node that accepts a JSON array of domains:
// Webhook configuration
// Method: POST
// Path: /enrich-companies
// Authentication: Header Auth (recommended)
//
// Expected payload:
{
"companies": [
{"domain": "stripe.com", "name": "Stripe"},
{"domain": "linear.app", "name": "Linear"},
{"domain": "posthog.com", "name": "PostHog"}
]
}
Option B: Google Sheets Trigger (for batch enrichment)
Use a Google Sheets Trigger node that watches for new rows in a spreadsheet. Set up a sheet with columns: Domain, Company Name, Status. When you paste new domains into the sheet, the workflow triggers automatically.
// Google Sheets Trigger configuration
// Spreadsheet: "Lead Enrichment Pipeline"
// Sheet: "Input"
// Trigger on: "Row Added"
// Poll interval: Every 1 minute
//
// After the trigger, add a Filter node:
// Condition: {{ $json.Status }} is empty
// (Skip rows that have already been processed)
After the trigger, add a Split In Batches node to process records one at a time. This prevents overloading your Ollama instance with parallel requests and keeps web scraping at a polite pace.
// Split In Batches configuration
// Batch Size: 1
// Options → Reset: false
//
// Add a Wait node after each batch iteration:
// Wait time: 2 seconds
// (Be polite to target websites — do not hammer them)
Step 2: Scrape the Company Website
Use the HTTP Request node to fetch the company's homepage and about page. These two pages contain 80% of the information you need for lead enrichment.
// HTTP Request Node — Fetch Homepage
// Method: GET
// URL: https://{{ $json.domain }}
// Options:
// Timeout: 10000 (10 seconds)
// Follow Redirects: true
// Full Response: true (to capture status code)
// Response Format: String (raw HTML)
//
// On Error: Continue (do not stop the workflow)
// Add header: User-Agent: Mozilla/5.0 (compatible; DataBot/1.0)
After fetching the homepage, add a second HTTP Request node for the about page:
// HTTP Request Node — Fetch About Page
// Method: GET
// URL: https://{{ $json.domain }}/about
// Same options as above
// On Error: Continue (about page may not exist)
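Because both requests use On Error: Continue, downstream nodes may receive an error payload instead of HTML. A minimal Function-node sketch that flags failed fetches before they reach the cleaning step (written here as a plain function for clarity; in n8n you would read the item from $input, and the statusCode/body field names assume the Full Response option above):

```javascript
// Sketch: classify a scrape result before passing it downstream
// (statusCode/body field names are assumptions based on Full Response mode)
function checkScrape(item) {
  const status = item.statusCode || 0;

  // Treat 4xx/5xx responses and empty bodies as failed scrapes;
  // 403/429 usually means the site is actively blocking bots
  if (status >= 400 || !item.body) {
    return {
      domain: item.domain,
      enrichment_status: status === 403 || status === 429 ? 'blocked' : 'failed',
      http_status: status
    };
  }

  // Pass the raw HTML through for the cleaning step
  return { ...item, html: item.body };
}
```

Records marked blocked or failed can be routed past the Ollama node with an IF node so they still land in the output sheet with a status, rather than silently disappearing.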
HTML to Text Conversion
Raw HTML is full of tags, scripts, and CSS that waste Ollama's context window. Use a Function node to strip it down to plain text:
// Function Node: Clean HTML to Text
const html = $json.data || $json.body || '';
// Remove script and style blocks entirely
let text = html
.replace(/<script[\s\S]*?<\/script>/gi, '')
.replace(/<style[\s\S]*?<\/style>/gi, '')
.replace(/<nav[\s\S]*?<\/nav>/gi, '')
.replace(/<footer[\s\S]*?<\/footer>/gi, '')
.replace(/<header[\s\S]*?<\/header>/gi, '');
// Replace HTML tags with spaces
text = text.replace(/<[^>]+>/g, ' ');
// Decode common HTML entities (decode &amp; last to avoid double-decoding)
text = text
.replace(/&nbsp;/g, ' ')
.replace(/&lt;/g, '<')
.replace(/&gt;/g, '>')
.replace(/&quot;/g, '"')
.replace(/&amp;/g, '&');
// Collapse whitespace
text = text.replace(/\s+/g, ' ').trim();
// Truncate to ~4000 chars to fit Ollama's context window efficiently
text = text.substring(0, 4000);
return [{ json: { ...($input.first().json), pageText: text } }];
Why strip HTML aggressively? An average company homepage is 50–150KB of raw HTML, but only 2–5KB of actual useful text. Sending raw HTML to Ollama wastes tokens on navigation links, JavaScript, and CSS classes. The cleaning function above reduces input size by 90%+ and improves extraction accuracy because the model focuses on actual content.
Step 3: Extract Structured Data with Ollama
This is where the AI does the heavy lifting. Connect an Ollama node (or an HTTP Request to the Ollama API) and give it a carefully crafted extraction prompt.
// Ollama HTTP API Request
// Method: POST
// URL: http://localhost:11434/api/generate
// Body (JSON):
{
"model": "llama3.1:8b",
"prompt": "You are a data extraction assistant. Analyze the following company webpage text and extract structured information. Return ONLY a valid JSON object with these fields. If a field cannot be determined from the text, use null.\n\nFields to extract:\n- company_name: Official company name\n- tagline: Company tagline or one-line description\n- industry: Primary industry (e.g., SaaS, Fintech, Healthcare, E-commerce)\n- business_model: B2B, B2C, or B2B2C\n- employee_count_range: Estimated range (e.g., 1-10, 11-50, 51-200, 201-500, 500+)\n- founding_year: Year founded\n- headquarters: City, Country\n- key_products: Array of main products or services (max 3)\n- tech_stack_signals: Array of technologies mentioned on the page (max 5)\n- pricing_model: free, freemium, paid, enterprise, or null\n- target_customer: Who the product is for (1 sentence)\n- recent_news: Any recent announcements, funding rounds, or milestones\n\nWebpage text:\n---\n{{ $json.pageText }}\n---\n\nReturn ONLY the JSON object, no explanations or markdown.",
"stream": false,
"options": {
"temperature": 0.1,
"num_predict": 1000
}
}
The low temperature (0.1) is critical for extraction tasks. You want deterministic, factual output — not creative interpretation. The num_predict limit prevents the model from rambling past the JSON.
The Extraction Prompt Explained
The prompt above is designed for reliability:
- Role assignment ("data extraction assistant") primes the model for structured output
- Explicit field definitions with examples reduce ambiguity
- Null handling ("use null") prevents hallucination — the model won't invent a founding year if it's not on the page
- "Return ONLY the JSON" suppresses the model's tendency to add explanations
Model recommendation: Use llama3.1:8b for this task. It handles JSON output reliably and processes pages in 3–8 seconds on a GPU. If you need faster throughput and can accept slightly lower accuracy, mistral:7b is about 30% faster. Avoid models smaller than 7B parameters for extraction — they struggle with consistent JSON formatting.
Step 4: Parse and Validate the Output
Ollama's response comes as a text string. You need to parse it into a proper JSON object and handle cases where the model returns malformed output.
// Function Node: Parse Ollama Response
const response = $json.response || '';
// Try to extract JSON from the response
let enrichedData = null;
try {
// First attempt: direct parse
enrichedData = JSON.parse(response);
} catch (e) {
// Second attempt: find JSON block in response
const jsonMatch = response.match(/\{[\s\S]*\}/);
if (jsonMatch) {
try {
enrichedData = JSON.parse(jsonMatch[0]);
} catch (e2) {
// Extraction failed
enrichedData = null;
}
}
}
if (!enrichedData) {
return [{
json: {
domain: $('Split In Batches').item.json.domain,
enrichment_status: 'failed',
error: 'Could not parse Ollama response as JSON'
}
}];
}
// Validate key fields
const validated = {
domain: $('Split In Batches').item.json.domain,
company_name: enrichedData.company_name || null,
tagline: enrichedData.tagline || null,
industry: enrichedData.industry || null,
business_model: enrichedData.business_model || null,
employee_count_range: enrichedData.employee_count_range || null,
founding_year: enrichedData.founding_year || null,
headquarters: enrichedData.headquarters || null,
key_products: Array.isArray(enrichedData.key_products)
? enrichedData.key_products.join(', ') : (enrichedData.key_products || null),
tech_stack_signals: Array.isArray(enrichedData.tech_stack_signals)
? enrichedData.tech_stack_signals.join(', ') : (enrichedData.tech_stack_signals || null),
pricing_model: enrichedData.pricing_model || null,
target_customer: enrichedData.target_customer || null,
recent_news: enrichedData.recent_news || null,
enrichment_status: 'success',
enriched_at: new Date().toISOString()
};
return [{ json: validated }];
The dual-parse strategy (try direct parse, then regex extraction) handles the common case where Ollama wraps the JSON in markdown code fences or adds a brief preamble. The validation step ensures arrays are flattened to comma-separated strings for spreadsheet compatibility.
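The same dual-parse strategy, written as a standalone function, is handy for testing extraction prompts outside n8n before wiring them into the workflow:

```javascript
// Sketch: dual-parse strategy for model output —
// try a direct JSON.parse, then fall back to grabbing the first {...} block
function parseModelJson(response) {
  try {
    // First attempt: the model returned bare JSON
    return JSON.parse(response);
  } catch (e) {
    // Second attempt: JSON wrapped in markdown fences or a preamble
    const match = response.match(/\{[\s\S]*\}/);
    if (match) {
      try {
        return JSON.parse(match[0]);
      } catch (e2) {
        // fall through to null
      }
    }
  }
  return null;
}
```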
Step 5: Write to Google Sheets or Airtable
The final step writes enriched records to your output destination.
Google Sheets Output
// Google Sheets Node Configuration
// Operation: Append Row
// Spreadsheet: "Lead Enrichment Pipeline"
// Sheet: "Enriched"
// Mapping Mode: Map Each Column
//
// Column mappings:
// Domain → {{ $json.domain }}
// Company Name → {{ $json.company_name }}
// Tagline → {{ $json.tagline }}
// Industry → {{ $json.industry }}
// Business Model → {{ $json.business_model }}
// Employee Range → {{ $json.employee_count_range }}
// Founded → {{ $json.founding_year }}
// Headquarters → {{ $json.headquarters }}
// Products → {{ $json.key_products }}
// Tech Stack → {{ $json.tech_stack_signals }}
// Pricing → {{ $json.pricing_model }}
// Target Customer→ {{ $json.target_customer }}
// Recent News → {{ $json.recent_news }}
// Status → {{ $json.enrichment_status }}
// Enriched At → {{ $json.enriched_at }}
Airtable Output
// Airtable Node Configuration
// Operation: Create Record
// Base: "Sales Pipeline"
// Table: "Enriched Companies"
//
// Same field mappings as above.
// Airtable advantage: you can use Single Select fields
// for Industry and Business Model, which auto-create
// filter options in the Airtable UI.
Practical Use Cases
Lead Scoring and Qualification
Feed a list of inbound leads (from a form, webinar registration, or trial signup) through the enrichment pipeline. Use the extracted fields to score leads automatically:
// Function Node: Simple Lead Scoring
let score = 0;
const data = $json;
// Company size scoring
const sizeScores = { '1-10': 1, '11-50': 2, '51-200': 3, '201-500': 4, '500+': 5 };
score += sizeScores[data.employee_count_range] || 0;
// Business model fit (assuming you sell B2B)
if (data.business_model === 'B2B') score += 3;
if (data.business_model === 'B2B2C') score += 2;
// Industry fit
const targetIndustries = ['SaaS', 'Fintech', 'E-commerce'];
if (targetIndustries.includes(data.industry)) score += 3;
// Pricing signals
if (data.pricing_model === 'enterprise') score += 2;
return [{ json: { ...data, lead_score: score, qualified: score >= 7 } }];
Competitor Analysis
Modify the extraction prompt to pull competitive intelligence:
- What features do they highlight on their homepage?
- What pricing tier names do they use?
- Which integrations do they promote?
- What customer testimonials are featured?
Run this weekly on a list of competitor domains and track changes over time in a spreadsheet.
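One way to adapt the Step 3 prompt for this: swap the field list while keeping the same guardrails (JSON-only output, null for unknowns). The field names below are illustrative, not part of the original workflow; adjust them to whatever you want to track.

```javascript
// Sketch: build a competitor-intelligence variant of the extraction prompt
// (field names are illustrative — change them to match what you track)
function buildCompetitorPrompt(pageText) {
  return [
    'You are a competitive intelligence assistant. Analyze the following',
    'competitor webpage text and return ONLY a valid JSON object with these',
    'fields. If a field cannot be determined from the text, use null.',
    '- headline_features: Array of features highlighted on the page (max 5)',
    '- pricing_tiers: Array of pricing tier names (max 4)',
    '- integrations: Array of integrations promoted on the page (max 5)',
    '- testimonial_customers: Array of customer names quoted on the page',
    '',
    'Webpage text:',
    '---',
    pageText,
    '---',
    'Return ONLY the JSON object, no explanations or markdown.'
  ].join('\n');
}
```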
Market Research
Enrich a list of companies in a target market segment. Pull industry classification, size estimates, and technology signals to build a market map. The tech_stack_signals field is especially useful — if 60% of companies in your target market mention "Kubernetes" or "AWS," that tells you something about their infrastructure maturity and buying patterns.
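Percentages like that can be computed directly from the enriched output. A minimal sketch, assuming tech_stack_signals is the comma-separated string produced by the Step 4 validation code:

```javascript
// Sketch: percentage of enriched records mentioning each technology
// Assumes tech_stack_signals is the comma-separated string from Step 4
function techFrequency(records) {
  const counts = {};
  for (const rec of records) {
    const signals = (rec.tech_stack_signals || '')
      .split(',')
      .map(s => s.trim())
      .filter(Boolean);
    // Use a Set so a technology counts once per record
    for (const tech of new Set(signals)) {
      counts[tech] = (counts[tech] || 0) + 1;
    }
  }
  const total = records.length || 1;
  return Object.fromEntries(
    Object.entries(counts).map(([tech, n]) => [tech, Math.round((n / total) * 100)])
  );
}
```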
Handling Edge Cases
Websites That Block Scraping
Some sites return 403 errors or CAPTCHAs. Handle this gracefully:
- Check the HTTP status code before processing
- If 403 or 429, mark the record as enrichment_status: "blocked"
- Add a 2–5 second delay between requests to avoid rate limits
- Use a realistic User-Agent header
Non-English Websites
llama3.1:8b handles multilingual content reasonably well. The extracted fields will typically be returned in English even if the source is in another language. For better multilingual support, use qwen2.5:7b, which was trained on more diverse language data.
Single-Page JavaScript Apps
The HTTP Request node fetches raw HTML. Sites built entirely in React or Vue may return an empty shell. For these, you have two options:
- Use n8n's built-in Puppeteer or Playwright integration (community node) to render JavaScript
- Scrape /sitemap.xml or /robots.txt to find statically-rendered pages
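Before falling back to a headless browser, you can detect the empty-shell case cheaply from the cleaned page text. A sketch, where the 200-character threshold is an assumption to tune for your targets:

```javascript
// Sketch: flag pages whose cleaned text is too short to be real content
// (the 200-character threshold is a heuristic — tune it for your targets)
function looksLikeJsShell(pageText) {
  const text = (pageText || '').trim();
  // JS-only shells typically leave almost no text after cleaning,
  // or just a noscript warning asking the visitor to enable JavaScript
  return text.length < 200 || /enable javascript/i.test(text);
}
```

Records flagged this way can be routed to a separate branch (headless-browser scrape, or simply enrichment_status: "js_rendered") instead of wasting an Ollama call on an empty page.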
Advantages of Self-Hosted Enrichment
Running this pipeline on your own infrastructure has concrete benefits beyond cost savings:
- Data privacy: Your prospect list never leaves your server. This matters if you are enriching leads before they have opted in to your marketing, or if you operate under GDPR where sharing email lists with third-party vendors requires explicit consent.
- No per-request API costs: Clearbit charges $0.05–0.50 per enrichment. Enriching 5,000 leads costs $250–2,500. With Ollama, the cost is the electricity to run inference — roughly $0.002 per record on consumer hardware.
- Custom extraction schema: Need to extract "sustainability certifications" or "open-source contributions"? Change the prompt. No waiting for a vendor to add a new field to their API.
- No rate limits: Process records as fast as your hardware allows. A single RTX 3060 can handle roughly 10–15 enrichments per minute with llama3.1:8b.
Complete Workflow JSON
Import this into n8n to get the data enrichment pipeline running. You will need to configure your Google Sheets credentials and ensure Ollama is running locally.
{
"name": "AI Data Enrichment (Ollama + Web Scraping)",
"nodes": [
{
"parameters": {
"httpMethod": "POST",
"path": "enrich-companies",
"options": {}
},
"id": "webhook-trigger",
"name": "Webhook Trigger",
"type": "n8n-nodes-base.webhook",
"typeVersion": 2,
"position": [240, 300],
"webhookId": "enrich-companies"
},
{
"parameters": {
"batchSize": 1,
"options": {}
},
"id": "split-batches",
"name": "Split In Batches",
"type": "n8n-nodes-base.splitInBatches",
"typeVersion": 3,
"position": [460, 300]
},
{
"parameters": {
"url": "=https://{{ $json.domain }}",
"options": {
"timeout": 10000,
"redirect": { "followRedirects": true },
"response": { "response": { "fullResponse": true, "responseFormat": "text" } }
},
"onError": "continueRegularOutput"
},
"id": "http-scrape",
"name": "Scrape Website",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [680, 300]
},
{
"parameters": {
"functionCode": "const html = $json.data || $json.body || '';\nlet text = html.replace(/