I Built a RAG Chatbot for My Project Docs — Here's the Full Stack

A step-by-step guide to building a document Q&A bot using Supabase pgvector, Claude API, and Next.js. Actual code, actual costs, actual gotchas.

Last month, a client asked me to add a "chat with your docs" feature to their internal knowledge base. You know the type — 300+ pages of technical documentation that nobody reads, and a support team drowning in the same questions every week.

I had 4 days. Here's exactly what I built and how.

Why RAG Instead of Fine-Tuning

Before I write a single line of code on a project like this, the client inevitably asks the same question: "Can't we just train the AI on our docs?"

Short answer: no. Fine-tuning is expensive, slow, and terrible for factual accuracy. What you actually want is RAG — Retrieval-Augmented Generation. Instead of baking knowledge into the model, you search your documents for relevant chunks and pass them as context alongside the user's question.

The result: accurate answers grounded in your actual documentation, with sources you can verify. And you can update the docs without retraining anything.

The Architecture

Here's what the stack looks like:

  • Next.js 14 (App Router) — frontend and API routes
  • Supabase — PostgreSQL with pgvector extension for vector storage
  • Claude API (claude-sonnet-4-5-20250514) — for generating answers
  • Voyage AI — for text embeddings (more on why not OpenAI later)

The flow is straightforward:

  1. Upload documents → chunk them → generate embeddings → store in Supabase
  2. User asks a question → embed the question → vector search for relevant chunks → send chunks + question to Claude → return answer

Nothing revolutionary. The devil is in the chunking strategy and prompt engineering.

Setting Up Supabase with pgvector

First, create a new project on Supabase. The free tier gives you 500MB of database storage, which is plenty for most documentation sets.

Enable the pgvector extension in the SQL editor:

create extension if not exists vector;

Now create the table for document chunks:

create table document_chunks (
  id bigserial primary key,
  content text not null,
  metadata jsonb default '{}',
  embedding vector(1024),
  created_at timestamp with time zone default now()
);

create index on document_chunks
  using ivfflat (embedding vector_cosine_ops)
  with (lists = 100);

And a function for similarity search:

create or replace function match_documents (
  query_embedding vector(1024),
  match_threshold float default 0.7,
  match_count int default 5
)
returns table (
  id bigint,
  content text,
  metadata jsonb,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    document_chunks.id,
    document_chunks.content,
    document_chunks.metadata,
    1 - (document_chunks.embedding <=> query_embedding) as similarity
  from document_chunks
  where 1 - (document_chunks.embedding <=> query_embedding) > match_threshold
  order by document_chunks.embedding <=> query_embedding
  limit match_count;
end;
$$;

One thing I learned the hard way: start with ivfflat indexing, not hnsw. HNSW is faster for reads but uses significantly more memory. For anything under 100K vectors, ivfflat is fine and won't blow up your Supabase instance. One caveat: build the ivfflat index after ingesting your data, since it derives its list centroids from the rows that exist at creation time.

The Chunking Strategy That Actually Works

This is where most tutorials lose the plot. They'll tell you to split by character count or use LangChain's recursive splitter. Both are mediocre.

Here's what I do instead:

interface Chunk {
  content: string;
  metadata: {
    source: string;
    heading: string;
    chunkIndex: number;
  };
}

function chunkDocument(markdown: string, source: string): Chunk[] {
  const chunks: Chunk[] = [];
  const sections = markdown.split(/(?=^#{1,3}\s)/m);

  for (const section of sections) {
    const headingMatch = section.match(/^(#{1,3})\s+(.+)/);
    const heading = headingMatch ? headingMatch[2].trim() : 'Introduction';

    if (estimateTokens(section) < 800) {
      chunks.push({
        content: section.trim(),
        metadata: { source, heading, chunkIndex: chunks.length },
      });
      continue;
    }

    const paragraphs = section.split(/\n\n+/);
    let currentChunk = '';

    for (const para of paragraphs) {
      if (currentChunk && estimateTokens(currentChunk + para) > 800) {
        chunks.push({
          content: `## ${heading}\n\n${currentChunk.trim()}`,
          metadata: { source, heading, chunkIndex: chunks.length },
        });
        // Paragraph overlap: seed the next chunk with the last paragraph
        // of the previous one so context carries across the boundary
        const lastPara = currentChunk.trim().split(/\n\n+/).pop() ?? '';
        currentChunk = lastPara + '\n\n' + para + '\n\n';
      } else {
        currentChunk += para + '\n\n';
      }
    }

    if (currentChunk.trim()) {
      chunks.push({
        content: `## ${heading}\n\n${currentChunk.trim()}`,
        metadata: { source, heading, chunkIndex: chunks.length },
      });
    }
  }

  return chunks;
}

function estimateTokens(text: string): number {
  // Rough heuristic: English technical text averages ~3.5 characters per token
  return Math.ceil(text.length / 3.5);
}

The key insights:

  1. Split by headings first. Documentation has structure — use it.
  2. 800 tokens per chunk, not 500 or 1000. I tested this extensively. 500 loses too much context. 1000 wastes embedding space on filler.
  3. Overlap by paragraph, not by character count. Character overlap creates broken sentences. Paragraph overlap preserves meaning.
  4. Prepend the heading to every chunk. When the chunk lands in the LLM context, it needs to know what section it came from.
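That heading split does quiet but important work, so it's worth sanity-checking the regex in isolation. A quick demo (the sample text is mine):

```typescript
// The zero-width lookahead splits *before* each heading line, so every
// section keeps its own heading attached instead of losing it to the
// previous section. ^ only matches at line starts thanks to the m flag.
const sample = '# Intro\n\nHello.\n\n## Setup\n\nInstall things.\n\n### Notes\n\nFine print.';
const sections = sample.split(/(?=^#{1,3}\s)/m);
// sections[0] starts with '# Intro', sections[1] with '## Setup', and so on
```

Because the lookahead consumes nothing, no text is lost in the split, which matters when the heading is the most information-dense line in the chunk.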

Generating Embeddings

I use Voyage AI (voyage-3-large, 1024 dimensions) instead of OpenAI's embedding models. Better retrieval accuracy on technical content in my benchmarks, and the pricing is competitive.

import { createClient } from '@supabase/supabase-js';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

async function generateEmbedding(text: string): Promise<number[]> {
  const response = await fetch('https://api.voyageai.com/v1/embeddings', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      Authorization: `Bearer ${process.env.VOYAGE_API_KEY}`,
    },
    body: JSON.stringify({
      input: text,
      model: 'voyage-3-large',
    }),
  });

  if (!response.ok) {
    throw new Error(`Voyage API error: ${response.status}`);
  }

  const data = await response.json();
  return data.data[0].embedding;
}

async function ingestChunks(chunks: Chunk[]) {
  for (let i = 0; i < chunks.length; i += 20) {
    const batch = chunks.slice(i, i + 20);
    const embeddings = await Promise.all(
      batch.map((chunk) => generateEmbedding(chunk.content))
    );
    const rows = batch.map((chunk, idx) => ({
      content: chunk.content,
      metadata: chunk.metadata,
      embedding: embeddings[idx],
    }));
    const { error } = await supabase.from('document_chunks').insert(rows);
    if (error) throw error;
    console.log(`Ingested ${i + batch.length}/${chunks.length} chunks`);
    await new Promise((r) => setTimeout(r, 500));
  }
}

For the client's 300-page documentation set, ingestion took about 12 minutes and produced ~2,400 chunks. Total embedding cost: around $1.80.

The Chat API Route

Here's the Next.js API route that ties it all together:

// app/api/chat/route.ts
import Anthropic from '@anthropic-ai/sdk';
import { createClient } from '@supabase/supabase-js';
import { NextRequest } from 'next/server';
// generateEmbedding is the Voyage helper from the ingestion step; adjust the
// import path to wherever you keep it in your project
import { generateEmbedding } from '@/lib/embeddings';

const anthropic = new Anthropic();
const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_KEY!
);

export async function POST(req: NextRequest) {
  const { question, conversationHistory = [] } = await req.json();

  const questionEmbedding = await generateEmbedding(question);

  const { data: chunks, error } = await supabase.rpc('match_documents', {
    query_embedding: questionEmbedding,
    match_threshold: 0.65,
    match_count: 6,
  });

  if (error) {
    return Response.json({ error: 'Search failed' }, { status: 500 });
  }

  const context = chunks
    .map((c) => `[Source: ${c.metadata.source} > ${c.metadata.heading}]\n${c.content}`)
    .join('\n\n---\n\n');

  const systemPrompt = `You are a helpful documentation assistant. Answer questions based ONLY on the provided context. If the context doesn't contain enough information, say so clearly.

Context from documentation:
${context}`;

  const messages = [
    ...conversationHistory.slice(-6),
    { role: 'user', content: question },
  ];

  const response = await anthropic.messages.create({
    model: 'claude-sonnet-4-5-20250514',
    max_tokens: 1024,
    system: systemPrompt,
    messages,
  });

  const answer = response.content[0].type === 'text' ? response.content[0].text : '';

  return Response.json({
    answer,
    sources: chunks.map((c) => ({
      source: c.metadata.source,
      heading: c.metadata.heading,
      similarity: c.similarity,
    })),
  });
}

A few things worth noting:

  • Threshold of 0.65, not 0.7 or 0.8. Lower thresholds catch more relevant results. I'd rather give Claude slightly noisy context than miss the answer entirely.
  • 6 chunks, not 3. Technical questions often need information scattered across multiple sections.
  • Conversation history is capped. Sending the entire chat history gets expensive fast. The last 3 exchanges are enough for follow-up questions.
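If you want the exchange cap to be explicit rather than a magic -6, a tiny helper makes the intent readable (the helper name is mine, not from the production code):

```typescript
type Msg = { role: 'user' | 'assistant'; content: string };

// One exchange is a user message plus an assistant reply, so capping at
// N exchanges means keeping the last 2 * N messages.
function capHistory(history: Msg[], maxExchanges = 3): Msg[] {
  return history.slice(-2 * maxExchanges);
}
```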

What It Costs to Run

After a month in production with ~200 queries per day, here's the real cost breakdown:

  • Supabase (free tier): $0/month for the vector database
  • Voyage AI embeddings: ~$2/month for query embeddings
  • Claude API (Sonnet): ~$15/month at 200 queries/day with 6 chunks per query
  • Vercel hosting: $0 on hobby plan, $20/month on Pro

Total: roughly $17/month for a documentation chatbot that handles 6,000+ queries. Compare that to any SaaS "AI knowledge base" product charging $200+/month.

If you want to cut costs further, deploy the Next.js app on Railway for about $5/month instead of Vercel Pro. Railway gives you more control over the runtime and doesn't charge per-function-invocation like Vercel does at scale.

Gotchas I Hit (So You Don't Have To)

1. Supabase connection pooling matters. The default connection mode works fine for development, but in production you need to use the connection pooler URL (port 6543, not 5432). Otherwise you'll exhaust connections under load.

2. Don't embed markdown formatting. Strip out excessive markdown syntax before embedding. Bold markers, link URLs, and table formatting add noise. Keep headings (they're semantic), strip the rest.
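Here's roughly what my cleanup pass looks like; treat it as a sketch, since the exact rules depend on your docs (and `stripForEmbedding` is just an illustrative name):

```typescript
// Pre-embedding cleanup: keep headings (they're semantic), drop formatting
// noise. Underscore emphasis is deliberately left alone so snake_case
// identifiers in technical docs don't get mangled.
function stripForEmbedding(markdown: string): string {
  return markdown
    .replace(/\[([^\]]+)\]\([^)]*\)/g, '$1') // [text](url) → keep just the text
    .replace(/\*\*([^*]+)\*\*/g, '$1')       // **bold** markers
    .replace(/\*([^*]+)\*/g, '$1')           // *italic* markers
    .replace(/^\|[-\s|:]+\|$/gm, '')         // table separator rows like |---|---|
    .replace(/`([^`]+)`/g, '$1');            // inline code ticks
}
```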

3. Claude sometimes ignores the "only use provided context" instruction. Add a fallback check: if the answer doesn't reference any of the provided chunks, append a disclaimer.
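My fallback check is crude but cheap: count how many content words from the answer actually appear in the retrieved chunks. A sketch, where the 0.3 cutoff is an assumption to tune against your own docs, not a tested constant:

```typescript
// Crude grounding heuristic: if too few content words (5+ letters) from the
// answer show up in any retrieved chunk, flag the answer as possibly
// ungrounded and append a disclaimer before returning it.
function seemsGrounded(answer: string, chunks: { content: string }[]): boolean {
  const words = answer.toLowerCase().match(/[a-z]{5,}/g) ?? [];
  if (words.length === 0) return true;
  const corpus = chunks.map((c) => c.content.toLowerCase()).join(' ');
  const hits = words.filter((w) => corpus.includes(w)).length;
  return hits / words.length > 0.3;
}
```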

4. Re-index when docs change. I set up a simple n8n workflow that watches the docs repo for changes and triggers re-ingestion. Takes 10 minutes to set up and saves hours of debugging.

5. Chunk deduplication. If your docs have repeated boilerplate, you'll get duplicate chunks polluting your results. Deduplicate by content hash before inserting.
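The dedupe itself is a few lines with Node's built-in crypto module; normalizing before hashing also catches near-identical boilerplate that differs only in whitespace or case:

```typescript
import { createHash } from 'node:crypto';

// Deduplicate chunks by a hash of their normalized content, keeping the
// first occurrence of each. Run this right before the insert step.
function dedupeChunks<T extends { content: string }>(chunks: T[]): T[] {
  const seen = new Set<string>();
  return chunks.filter((chunk) => {
    const hash = createHash('sha256')
      .update(chunk.content.trim().toLowerCase())
      .digest('hex');
    if (seen.has(hash)) return false;
    seen.add(hash);
    return true;
  });
}
```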

What I'd Do Differently Next Time

If I had more time, I'd add:

  • Hybrid search — combine vector similarity with BM25 keyword search. Supabase supports full-text search natively, so you can run both queries and merge results.
  • Reranking — use a cross-encoder to rerank the top 20 vector results down to the best 6. Adds ~100ms latency but significantly improves answer quality.
  • Streaming responses — Claude supports streaming. For a chat interface, showing tokens as they arrive feels much faster than waiting for the complete response.
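For the hybrid-search idea, the merge step doesn't need a library; reciprocal rank fusion is a few lines. A sketch, assuming you already have the two ranked lists of chunk ids (k = 60 is the conventional default for RRF):

```typescript
// Reciprocal rank fusion: merge a vector-ranked id list and a keyword-ranked
// id list into one ranking. Ids appearing high in both lists win.
function rrfMerge(vectorIds: number[], keywordIds: number[], k = 60): number[] {
  const scores = new Map<number, number>();
  for (const [rank, id] of vectorIds.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  for (const [rank, id] of keywordIds.entries()) {
    scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```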

Wrapping Up

RAG isn't magic, but it's the most practical way to make AI useful for domain-specific questions right now. The stack I described here — Supabase for vectors, Claude for generation, Next.js on Vercel or Railway for hosting — runs for under $20/month and took me 3 days to build from scratch.

The code examples in this post are simplified but functional. I've shipped variations of this exact architecture for three different clients now, and the pattern holds up well.

If you're building something similar and get stuck, reach out — I'm always happy to talk RAG pipelines.


Some links in this article are affiliate links. If you sign up through them, I may earn a small commission at no extra cost to you. I only recommend tools I actually use in production.