How to Build a Private RAG Knowledge Base with n8n + Ollama (No Cloud APIs)
Retrieval-Augmented Generation (RAG) is the pattern behind every "chat with your documents" product you have seen. Feed your PDFs, internal docs, and knowledge base articles into a vector database, then let an LLM answer questions grounded in your actual data — far fewer hallucinations, far less guessing.
The problem? Most RAG tutorials assume you will send your proprietary documents to OpenAI for embedding and querying. That means your internal SOPs, customer data, legal contracts, and engineering docs all pass through third-party servers. For many teams, that is a non-starter.
This guide shows you how to build a fully private RAG knowledge base using n8n, Ollama, and Qdrant — three open-source tools running entirely on your own hardware. No API keys, no per-token billing, no data leaving your network.
Why Private, Self-Hosted RAG Matters
Before we build, here is why you should care about keeping your RAG pipeline local:
- Data privacy — Legal contracts, HR policies, customer records, and proprietary research never leave your infrastructure. No third-party data processing agreements needed.
- Zero API costs — OpenAI's embedding API charges $0.10 per million tokens. That sounds cheap until you are indexing thousands of documents and running hundreds of queries per day. Locally, every embedding and every query costs $0.
- Regulatory compliance — GDPR, HIPAA, SOC 2, and industry-specific regulations often restrict where data can be processed. A self-hosted RAG pipeline simplifies compliance because your data never crosses network boundaries.
- No rate limits — Embed and query as fast as your hardware allows. No throttling, no 429 errors, no waiting for quota resets.
- Offline capability — Your knowledge base keeps working during internet outages, cloud provider incidents, or in air-gapped environments.
Architecture Overview
The stack has three components, each handling a specific job:
Ingestion path:

  Documents (PDF, TXT, MD)
            |
            v
  [n8n: Ingestion Workflow]
            |
            v
  [Ollama: nomic-embed-text] --generates embeddings--> [Qdrant: Vector DB]

Query path:

  [User Question] --> [n8n: Query Workflow] --> [Ollama: nomic-embed-text]
                                                          |
                                                   embed question
                                                          |
                                                          v
                                          [Qdrant: similarity search]
                                                          |
                                                    top-k chunks
                                                          |
                                                          v
                                [Ollama: llama3] --generates answer--> [Final Answer]
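Outside n8n, the same query path can be sketched as three composable steps. This is a minimal illustration with stubbed embed/search/generate functions standing in for the Ollama and Qdrant HTTP calls; the function names here (answerQuestion and the stubs) are hypothetical, not part of any tool's API:

```javascript
// Minimal sketch of the RAG query path with stubbed services.
// In the real workflow, embed() and search() are HTTP calls to
// Ollama (/api/embeddings) and Qdrant (/points/search).

async function answerQuestion(question, { embed, search, generate, topK = 5 }) {
  const vector = await embed(question);      // question -> embedding vector
  const hits = await search(vector, topK);   // top-k similar chunks from the vector DB
  const context = hits
    .map((h, i) => `[Source ${i + 1}: ${h.payload.source}]\n${h.payload.text}`)
    .join('\n\n---\n\n');
  const prompt = `Answer using ONLY this context:\n${context}\n\nQUESTION: ${question}`;
  return { answer: await generate(prompt), sources: hits.map(h => h.payload.source) };
}

// Stub services for demonstration only
const stubs = {
  embed: async () => [0.1, 0.2, 0.3],
  search: async () => [{ score: 0.9, payload: { source: 'doc.pdf', text: 'Refunds within 60 days.' } }],
  generate: async prompt => `Grounded answer based on ${prompt.length} chars of prompt`,
};

answerQuestion('What is the refund policy?', stubs)
  .then(r => console.log(r.sources)); // logs the sources array
```

Each stub maps one-to-one onto a node in the query workflow below, which is what makes the n8n canvas easy to reason about.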
| Component | Role | Why This One |
|---|---|---|
| n8n | Workflow orchestration | Visual builder, 400+ integrations, handles file I/O and HTTP calls |
| Ollama | Embeddings + LLM inference | Runs models locally, simple HTTP API, supports embedding and chat models |
| Qdrant | Vector storage and search | Purpose-built vector DB, fast similarity search, simple REST API, runs in Docker |
Prerequisites
You need three services running. Here is the fastest path to get all of them up:
# Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
# Pull the embedding model (this is what converts text to vectors)
ollama pull nomic-embed-text
# Pull the chat model (this answers questions)
ollama pull llama3:8b
# Verify both are available
curl http://localhost:11434/api/tags
nomic-embed-text produces 768-dimensional vectors and runs on just 2 GB of RAM. It is specifically designed for embedding tasks and outperforms general-purpose models at retrieval. llama3:8b handles the question-answering step.
# Run Qdrant in Docker
docker run -d --name qdrant \
-p 6333:6333 \
-p 6334:6334 \
-v qdrant_data:/qdrant/storage \
qdrant/qdrant
# Verify it is running
curl http://localhost:6333/collections
Qdrant exposes a REST API on port 6333 and a gRPC interface on 6334. We will use the REST API from n8n. The volume mount ensures your vectors survive container restarts.
# Run n8n in Docker
docker run -d --name n8n \
-p 5678:5678 \
-v n8n_data:/home/node/.n8n \
--add-host=host.docker.internal:host-gateway \
n8nio/n8n
The --add-host flag lets n8n reach Ollama and Qdrant on your host machine. Open http://localhost:5678 to access the n8n editor.
Alternatively, if you run all three services in Docker, create a shared network with docker network create rag-net and add --network rag-net to each container. Then use container names as hostnames (e.g., http://qdrant:6333) instead of host.docker.internal.
Step-by-Step: Document Ingestion Workflow
This workflow takes documents, splits them into chunks, generates embeddings with Ollama, and stores everything in Qdrant. You run it once per document (or on a schedule to ingest new files automatically).
Before ingesting documents, create a collection in Qdrant that matches the embedding dimensions. Add an HTTP Request node:
// HTTP Request node configuration
URL: http://localhost:6333/collections/knowledge_base
Method: PUT
Body (JSON):
{
"vectors": {
"size": 768,
"distance": "Cosine"
}
}
This creates a collection called knowledge_base with 768 dimensions (matching nomic-embed-text output) using cosine similarity. You only need to run this once.
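Cosine similarity compares the angle between two vectors, ignoring their magnitude. As a rough sketch of the math Qdrant computes for each search (Qdrant's real implementation is heavily optimized; this is only the formula):

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|)
// Both vectors must have the same length -- which is why the
// collection's "size" must match the embedding model's output (768).
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineSimilarity([1, 0, 0], [1, 0, 0]); // identical direction -> 1
cosineSimilarity([1, 0, 0], [0, 1, 0]); // orthogonal -> 0
```

The dimension-mismatch check is the sketch's version of what Qdrant enforces at the collection level: a vector of the wrong size is rejected at upsert time.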
Add a Read Binary File node (or a webhook that accepts file uploads) followed by a Code node that splits the text into overlapping chunks:
// Code node: chunk the document
const text = $input.first().json.data; // or .text depending on source
const CHUNK_SIZE = 500; // characters per chunk
const CHUNK_OVERLAP = 50; // overlap between chunks
const chunks = [];
let start = 0;
while (start < text.length) {
const end = Math.min(start + CHUNK_SIZE, text.length);
const chunk = text.substring(start, end).trim();
if (chunk.length > 50) { // skip tiny fragments
chunks.push({
json: {
text: chunk,
chunk_index: chunks.length,
source: $input.first().json.filename || 'document',
char_start: start,
char_end: end
}
});
}
start += CHUNK_SIZE - CHUNK_OVERLAP;
}
return chunks;
This produces one output item per chunk. The overlap ensures that sentences split across chunk boundaries are still partially captured in both chunks, improving retrieval accuracy.
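With CHUNK_SIZE 500 and CHUNK_OVERLAP 50, the loop advances in strides of 450 characters, so consecutive chunks share 50 characters at their boundary. A quick check of the start offsets it produces:

```javascript
// Compute the chunk start offsets produced by the chunking loop above.
function chunkOffsets(textLength, chunkSize = 500, overlap = 50) {
  const starts = [];
  for (let start = 0; start < textLength; start += chunkSize - overlap) {
    starts.push(start);
  }
  return starts;
}

chunkOffsets(1000); // [0, 450, 900] -- each chunk starts 50 chars before the previous one ends
```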
For each chunk, call the Ollama embedding endpoint. Add an HTTP Request node inside a loop (n8n handles this automatically when the previous node outputs multiple items):
// HTTP Request node configuration
URL: http://localhost:11434/api/embeddings
Method: POST
Body (JSON):
{
"model": "nomic-embed-text",
"prompt": "{{ $json.text }}"
}
Options:
Timeout: 30000
Batch size: 1 (process chunks sequentially to avoid overloading Ollama)
The response contains an embedding array of 768 floating-point numbers — the vector representation of that chunk.
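Before upserting, it is worth checking that each embedding's length matches the collection's configured size, since a swapped model silently produces vectors that Qdrant will reject. A small guard you could drop into a Code node (the validateEmbedding helper is an assumption; 768 matches nomic-embed-text):

```javascript
// Guard: reject embeddings whose dimension doesn't match the collection.
const COLLECTION_DIM = 768; // must equal the "size" used when creating the collection

function validateEmbedding(embedding, dim = COLLECTION_DIM) {
  if (!Array.isArray(embedding) || embedding.length !== dim) {
    const got = Array.isArray(embedding) ? embedding.length : typeof embedding;
    throw new Error(`expected ${dim}-dim embedding, got ${got}`);
  }
  if (!embedding.every(Number.isFinite)) {
    throw new Error('embedding contains non-numeric values');
  }
  return embedding;
}

validateEmbedding(new Array(768).fill(0.01)); // passes
// validateEmbedding([1, 2, 3]);              // throws: expected 768-dim embedding
```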
Send each chunk's embedding and metadata to Qdrant. Add another HTTP Request node:
// HTTP Request node configuration
URL: http://localhost:6333/collections/knowledge_base/points
Method: PUT
Body (JSON):
{
"points": [
{
"id": {{ $runIndex }},
"vector": {{ JSON.stringify($json.embedding) }},
"payload": {
"text": "{{ $('Code').item.json.text }}",
"source": "{{ $('Code').item.json.source }}",
"chunk_index": {{ $('Code').item.json.chunk_index }}
}
}
]
}
Each point stores the vector for similarity search and the original text plus metadata in the payload. The payload is returned with search results, so you can pass the actual text to the LLM later. One caveat: point IDs must be unique across the collection, and $runIndex restarts at 0 on every execution, so ingesting a second document would overwrite the first document's points. For multi-document ingestion, generate a UUID per chunk instead (Qdrant accepts unsigned integers or UUID strings as IDs).
Step-by-Step: Query Workflow
Now that your documents are indexed, build a second workflow that accepts a question and returns a grounded answer. This is the "chat with your docs" experience.
Add a Webhook node that accepts POST requests with a JSON body containing {"question": "your question here"}. Set the response mode to "Response Node" so you can return the answer.
// Webhook configuration
HTTP Method: POST
Path: /ask-knowledge-base
Response Mode: Using 'Respond to Webhook' Node
The question needs to be converted to a vector in the same embedding space as your documents. Use the same Ollama model:
// HTTP Request to Ollama
URL: http://localhost:11434/api/embeddings
Method: POST
Body:
{
"model": "nomic-embed-text",
"prompt": "{{ $json.body.question }}"
}
Send the question's embedding to Qdrant to find the most similar document chunks:
// HTTP Request to Qdrant
URL: http://localhost:6333/collections/knowledge_base/points/search
Method: POST
Body:
{
"vector": {{ JSON.stringify($json.embedding) }},
"limit": 5,
"with_payload": true
}
This returns the top 5 most semantically similar chunks from your knowledge base, along with their text and metadata. The limit: 5 parameter controls how many chunks feed into the LLM's context — more chunks give broader context but increase prompt size and processing time.
Combine the retrieved chunks into a context string, then ask Ollama to answer the question using only that context:
// Code node: build context from search results
const results = $input.first().json.result;
const context = results
.map((r, i) => `[Source ${i+1}: ${r.payload.source}]\n${r.payload.text}`)
.join('\n\n---\n\n');
const question = $('Receive Question').first().json.body.question;
return [{
json: {
context: context,
question: question,
prompt: `You are a helpful assistant that answers questions based on the provided context documents. Use ONLY the information from the context below. If the context does not contain enough information to answer the question, say "I don't have enough information in the knowledge base to answer this."
CONTEXT:
${context}
QUESTION: ${question}
Provide a clear, concise answer. Cite which source(s) you used.`
}
}];
// HTTP Request to Ollama
URL: http://localhost:11434/api/generate
Method: POST
Body:
{
"model": "llama3:8b",
"prompt": "{{ $json.prompt }}",
"stream": false,
"options": {
"temperature": 0.3,
"num_predict": 1000
}
}
Low temperature (0.3) keeps the answer factual and grounded. The model should rely on the retrieved context rather than its training data. Finally, use a Respond to Webhook node to return the answer as JSON.
Test the Query Workflow
curl -X POST http://localhost:5678/webhook/ask-knowledge-base \
-H "Content-Type: application/json" \
-d '{"question": "What is our refund policy for enterprise customers?"}'
Expected response:
{
"answer": "Based on the knowledge base, enterprise customers are eligible for a full refund within 60 days of purchase. After 60 days, refunds are prorated based on the remaining contract term. [Source 2: refund-policy.pdf]",
"sources": ["refund-policy.pdf", "enterprise-terms.pdf"],
"chunks_used": 3
}
The Workflow JSON
Here is a complete, importable n8n workflow for the query pipeline. Copy and paste this into n8n via Workflow → Import from JSON.
{
"name": "RAG Knowledge Base Query (Ollama + Qdrant)",
"nodes": [
{
"parameters": {
"httpMethod": "POST",
"path": "ask-knowledge-base",
"responseMode": "responseNode",
"options": {}
},
"id": "webhook-question",
"name": "Receive Question",
"type": "n8n-nodes-base.webhook",
"typeVersion": 2,
"position": [240, 300],
"webhookId": "ask-knowledge-base"
},
{
"parameters": {
"url": "http://localhost:11434/api/embeddings",
"sendBody": true,
"specifyBody": "json",
"jsonBody": "={{ JSON.stringify({ model: 'nomic-embed-text', prompt: $json.body.question }) }}",
"options": { "timeout": 30000 }
},
"id": "embed-question",
"name": "Embed Question (Ollama)",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [460, 300]
},
{
"parameters": {
"url": "http://localhost:6333/collections/knowledge_base/points/search",
"sendBody": true,
"specifyBody": "json",
"jsonBody": "={{ JSON.stringify({ vector: $json.embedding, limit: 5, with_payload: true }) }}",
"options": { "timeout": 10000 }
},
"id": "search-qdrant",
"name": "Search Qdrant",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [680, 300]
},
{
"parameters": {
"jsCode": "const results = $input.first().json.result || [];\nconst context = results.map((r, i) => `[Source ${i+1}: ${r.payload.source}]\\n${r.payload.text}`).join('\\n\\n---\\n\\n');\nconst question = $('Receive Question').first().json.body.question;\nreturn [{ json: { context, question, sources: results.map(r => r.payload.source), prompt: `You are a helpful assistant. Answer the question using ONLY the context below. If the context does not contain enough information, say so.\\n\\nCONTEXT:\\n${context}\\n\\nQUESTION: ${question}\\n\\nProvide a clear, concise answer and cite your sources.` } }];"
},
"id": "build-prompt",
"name": "Build RAG Prompt",
"type": "n8n-nodes-base.code",
"typeVersion": 2,
"position": [900, 300]
},
{
"parameters": {
"url": "http://localhost:11434/api/generate",
"sendBody": true,
"specifyBody": "json",
"jsonBody": "={{ JSON.stringify({ model: 'llama3:8b', prompt: $json.prompt, stream: false, options: { temperature: 0.3, num_predict: 1000 } }) }}",
"options": { "timeout": 120000 }
},
"id": "generate-answer",
"name": "Generate Answer (Ollama)",
"type": "n8n-nodes-base.httpRequest",
"typeVersion": 4.2,
"position": [1120, 300]
},
{
"parameters": {
"respondWith": "json",
"responseBody": "={{ JSON.stringify({ answer: $json.response, sources: $('Build RAG Prompt').first().json.sources, question: $('Build RAG Prompt').first().json.question }) }}",
"options": {}
},
"id": "respond",
"name": "Return Answer",
"type": "n8n-nodes-base.respondToWebhook",
"typeVersion": 1.1,
"position": [1340, 300]
}
],
"connections": {
"Receive Question": {
"main": [[{ "node": "Embed Question (Ollama)", "type": "main", "index": 0 }]]
},
"Embed Question (Ollama)": {
"main": [[{ "node": "Search Qdrant", "type": "main", "index": 0 }]]
},
"Search Qdrant": {
"main": [[{ "node": "Build RAG Prompt", "type": "main", "index": 0 }]]
},
"Build RAG Prompt": {
"main": [[{ "node": "Generate Answer (Ollama)", "type": "main", "index": 0 }]]
},
"Generate Answer (Ollama)": {
"main": [[{ "node": "Return Answer", "type": "main", "index": 0 }]]
}
},
"settings": { "executionOrder": "v1" },
"tags": [{ "name": "AI" }, { "name": "RAG" }, { "name": "Ollama" }, { "name": "Qdrant" }]
}
Tips for Production
The workflow above gets you a working RAG system. Here is how to make it production-grade:
Chunk Sizing Strategy
The 500-character default works well for general documents. Tune it based on your content:
- Technical documentation — Use 300–400 characters with 100-char overlap. Shorter chunks give more precise retrieval for specific API details or config values.
- Long-form reports and policies — Use 800–1200 characters with 150-char overlap. Longer chunks preserve paragraph-level context that the LLM needs for nuanced answers.
- FAQ/Q&A content — Chunk by question-answer pair rather than by character count. Each Q&A should be a single chunk.
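For the FAQ case, chunking by question-answer pair can be as simple as splitting on the marker your documents already use. A sketch assuming each pair starts with a line beginning "Q:" (adjust the delimiter to your own format):

```javascript
// Split FAQ text into one chunk per question/answer pair.
// Assumes each pair starts with a line beginning "Q:".
function chunkFaq(text) {
  return text
    .split(/\n(?=Q:)/)            // break before each new question
    .map(pair => pair.trim())
    .filter(pair => pair.length > 0)
    .map((pair, i) => ({ text: pair, chunk_index: i }));
}

const faq = `Q: What is the refund window?\nA: 60 days for enterprise customers.\nQ: Do you offer SSO?\nA: Yes, on all plans.`;
chunkFaq(faq); // -> 2 chunks, one per Q&A pair
```

Because each chunk is a complete question plus its answer, a user's query tends to land on exactly one chunk with high similarity, which is the behavior you want for FAQ retrieval.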
Model Selection
| Task | Recommended Model | RAM | Notes |
|---|---|---|---|
| Embeddings | nomic-embed-text | 2 GB | Best quality-to-size ratio for retrieval. 768 dimensions. |
| Embeddings (faster) | all-minilm | 1 GB | 384 dimensions. Faster but lower retrieval accuracy. |
| Answer generation | llama3:8b | 8 GB | Strong instruction-following, good at citing sources. |
| Answer generation (better) | llama3:70b | 48 GB | Near cloud-API quality. Requires serious hardware. |
Memory and Performance
- Set HTTP timeouts to 120 seconds — The answer generation step can take 30–90 seconds depending on context size and hardware. Default n8n timeouts will kill the request prematurely.
- Process ingestion in batches — When embedding hundreds of chunks, add a Wait node (200 ms) between Ollama calls to prevent memory spikes. Ollama queues requests, but rapid-fire calls can exhaust RAM.
- Use Qdrant's payload indexing — If you query by source file or date, create payload indexes: PUT /collections/knowledge_base/index with {"field_name": "source", "field_schema": "keyword"}. This speeds up filtered searches dramatically.
- Monitor Qdrant collection size — Check with GET /collections/knowledge_base. Qdrant handles millions of vectors, but keep an eye on disk usage. Each 768-dim vector uses roughly 3 KB of storage.
Improving Answer Quality
- Increase top-k for broad questions — If answers feel incomplete, raise the Qdrant limit from 5 to 8 or 10. More context chunks give the LLM more to work with.
- Add a relevance threshold — Filter Qdrant results by score. If the best match has a score below 0.6, the knowledge base probably does not contain relevant information. Return "I don't have enough information" instead of forcing a bad answer.
- Re-rank results — For advanced setups, add a second Ollama call that re-ranks the top-k chunks by relevance to the specific question before feeding them into the answer prompt.
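A relevance threshold fits naturally into the query workflow's Code node: drop low-scoring hits before building the context, and short-circuit to a fallback message if nothing survives. A sketch, with 0.6 as an assumed cutoff you should tune on your own data:

```javascript
// Filter Qdrant hits by similarity score before building the prompt.
const MIN_SCORE = 0.6; // tune per corpus; cosine scores fall roughly in 0..1

function filterRelevant(results, minScore = MIN_SCORE) {
  const relevant = results.filter(r => r.score >= minScore);
  if (relevant.length === 0) {
    return {
      relevant: [],
      fallback: "I don't have enough information in the knowledge base to answer this.",
    };
  }
  return { relevant, fallback: null };
}

filterRelevant([{ score: 0.82 }, { score: 0.41 }]); // keeps only the 0.82 hit
filterRelevant([{ score: 0.3 }]);                   // returns the fallback message
```

When the fallback fires, skip the LLM call entirely and return the message from the Respond to Webhook node, which also saves a 30-to-90-second generation step on unanswerable questions.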
Want Production-Ready RAG + 10 More Workflows?
The Self-Hosted AI Workflow Pack includes a polished RAG pipeline with multi-format document ingestion, batch processing, error handling, and 10 more AI workflows for content generation, email automation, lead scoring, and document processing — all running locally with Ollama.
Get All 11 Workflows — $39
One-time purchase. No subscriptions. No API costs. 30-day money-back guarantee.
Free Samples
Try before you buy — grab free, open-source n8n + Ollama workflow templates: