AI-Ready Web Applications: Building Apps with LLM Integration

Introduction: The Shift to AI-Native Web Applications

Building a web application that truly works with AI is no longer about gluing a chatbot widget onto an existing page. The shift toward AI-ready web applications means designing your entire stack — API layer, data storage, orchestration, and frontend — with LLM integration as a first-class concern. Applications that consume, process, and generate content through large language models require fundamentally different architectural decisions than traditional CRUD apps.

Over the past 18 months, I have worked with several teams transitioning their web applications from "add AI later" to "AI-first" architecture. The patterns that emerge consistently fall into five areas: API design optimised for LLM consumption, vector database integration for semantic grounding, prompt management infrastructure, streaming response handling, and production architecture patterns like RAG, agentic workflows, and MCP connectivity.

This guide covers each of these areas in depth — with code examples, comparison tables, architecture decisions, and real-world case studies. Whether you are building a new AI-native application or retrofitting an existing one, the patterns here will save you months of trial and error.

API Design for LLM Consumption

The first and most impactful change when building an AI-ready web application is how you design your API. Traditional REST APIs are optimised for human-readable responses and paginated list endpoints. LLMs need something different: structured, self-describing endpoints with clear semantics, consistent error formats, and predictable response shapes.

LLM-Optimised vs Traditional REST

The core difference lies in how the API communicates its capabilities. A traditional API expects the client to know what endpoints exist and how to call them. An LLM-optimised API exposes a machine-readable contract that the model can discover and use autonomously:

Characteristic	Traditional REST	LLM-Optimised API
Documentation	Human-readable docs (Swagger UI)	Machine-readable OpenAPI 3.1 with LLM-friendly descriptions
Response shape	Fixed, often nested	Flat, consistent, with explicit null fields
Error format	Varies by endpoint	Uniform with code + message + traceback URL
Pagination	Page/limit query params	Cursor-based with next/prev URLs embedded
Idempotency	Rarely explicit	Idempotency keys on all mutation endpoints
Rate limiting	Returns 429	Returns 429 with Retry-After and quota info
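
To make the error-format and rate-limiting rows concrete, here is a minimal sketch of a single error envelope shared by every endpoint. The field names (code, message, traceUrl, retryAfterSeconds), the helper names, and the traces.example.com URL are illustrative assumptions, not part of any standard; only the Retry-After header is standard HTTP.

```typescript
// Hypothetical uniform error envelope; all field names are assumptions.
interface ApiError {
  code: string;                     // stable, machine-readable identifier
  message: string;                  // short explanation the model can act on
  traceUrl: string | null;          // explicit null rather than an omitted field
  retryAfterSeconds: number | null; // only set for rate-limit errors
}

interface ErrorResponse {
  status: number;
  headers: Record<string, string>;
  body: { error: ApiError };
}

function buildError(status: number, code: string, message: string, traceId?: string): ErrorResponse {
  return {
    status,
    headers: { "Content-Type": "application/json" },
    body: {
      error: {
        code,
        message,
        traceUrl: traceId ? `https://traces.example.com/${traceId}` : null,
        retryAfterSeconds: null,
      },
    },
  };
}

// 429 responses carry the wait time both as a header and in the body.
function buildRateLimitError(retryAfterSeconds: number): ErrorResponse {
  const res = buildError(429, "rate_limited", `Rate limit exceeded. Retry in ${retryAfterSeconds} seconds.`);
  res.headers["Retry-After"] = String(retryAfterSeconds);
  res.body.error.retryAfterSeconds = retryAfterSeconds;
  return res;
}
```

Because every failure has the same shape, a model (or the code wrapping it) needs only one parsing path to decide whether to retry, rephrase the request, or report the problem to the user.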

OpenAPI 3.1 with LLM-Friendly Descriptions

OpenAPI 3.1 (which aligns with JSON Schema 2020-12) is the standard for LLM-friendly API documentation. The key is writing descriptions that an LLM can parse — explicit about parameter semantics, return types, error conditions, and side effects:

OpenAPI 3.1 — LLM-Optimised Endpoint Description

"/api/search": {
  "post": {
    "summary": "Search products semantically",
    "description": "Performs semantic search across the product catalog using vector embeddings. Returns results ranked by relevance score (cosine similarity). Use this endpoint when the user asks comparative, qualitative, or feature-based questions about products — not for exact ID lookups.",
    "operationId": "semanticSearch",
    "parameters": [
      {
        "name": "X-Idempotency-Key",
        "in": "header",
        "schema": { "type": "string", "format": "uuid" },
        "required": false,
        "description": "Optional idempotency key. If provided, identical requests within 5 minutes return the cached result."
      }
    ],
    "requestBody": {
      "content": {
        "application/json": {
          "schema": {
            "type": "object",
            "properties": {
              "query": { "type": "string", "description": "Natural language search query" },
              "limit": { "type": "integer", "default": 10, "maximum": 50 },
              "cursor": { "type": "string", "description": "Pagination cursor from previous response" }
            },
            "required": ["query"]
          }
        }
      }
    }
  }
}

Notice the description field on the endpoint: it tells the LLM when to use this endpoint ("comparative, qualitative, or feature-based questions") and when not to (exact ID lookups). Explicit routing guidance of this kind reduces the number of calls the model sends to the wrong endpoint.

Cursor-Based Pagination

LLMs struggle with page/offset pagination because "page 3" has no semantic meaning. Cursor-based pagination gives the LLM a concrete reference point ("after this product ID") that it can pass directly to the next call. Every list response includes next_cursor and has_more fields.
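
A minimal in-memory sketch of the idea follows. The Product type, the helper names, and the cursor encoding (URL-encoded JSON holding the last seen ID) are assumptions for illustration; a real endpoint would run the equivalent "id greater than cursor" filter in SQL.

```typescript
interface Product { id: number; name: string; }
interface Page<T> { items: T[]; next_cursor: string | null; has_more: boolean; }

// The cursor records the last ID seen; encoding keeps it opaque to the caller.
function encodeCursor(lastId: number): string {
  return encodeURIComponent(JSON.stringify({ after: lastId }));
}

function decodeCursor(cursor: string): number {
  const parsed = JSON.parse(decodeURIComponent(cursor)) as { after?: unknown };
  if (typeof parsed.after !== "number") throw new Error("Invalid cursor");
  return parsed.after;
}

// `sorted` must be ordered by ascending id.
function paginate(sorted: Product[], limit: number, cursor?: string): Page<Product> {
  const pageSize = Math.max(1, limit);
  const after = cursor ? decodeCursor(cursor) : -Infinity;
  const remaining = sorted.filter(p => p.id > after);
  const items = remaining.slice(0, pageSize);
  const has_more = remaining.length > items.length;
  return {
    items,
    next_cursor: has_more ? encodeCursor(items[items.length - 1].id) : null,
    has_more,
  };
}
```

The caller never computes anything: it copies next_cursor from one response into the cursor field of the next request and stops when has_more is false.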

Vector Database Integration

Vector databases are the backbone of any AI-ready web application that needs to ground LLM responses in real data. They store embeddings — numerical representations of text meaning — and enable semantic search across your content, products, or knowledge base.
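
To show what "semantic search over embeddings" reduces to, here is a small sketch that ranks documents by cosine similarity in plain TypeScript. The three-dimensional vectors and document IDs are toy values invented for illustration; real embeddings have hundreds or thousands of dimensions, and a vector database does this ranking with an index instead of a full scan.

```typescript
// Cosine similarity: 1 means same direction, 0 means unrelated (orthogonal).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface EmbeddedDoc { id: string; embedding: number[]; }

// Brute-force nearest neighbours: score every document, keep the best k.
function topK(query: number[], docs: EmbeddedDoc[], k: number): { id: string; similarity: number }[] {
  return docs
    .map(d => ({ id: d.id, similarity: cosineSimilarity(query, d.embedding) }))
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, k);
}
```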

pgvector: The Pragmatic Choice

For most web applications, pgvector — the PostgreSQL vector extension — is the right starting point. It eliminates the operational complexity of running a separate vector database by embedding vector search into your existing PostgreSQL instance:

SQL — Setting Up pgvector for Semantic Search

-- Enable the extension
CREATE EXTENSION vector;

-- Create a table with vector embeddings
CREATE TABLE product_embeddings (
  id BIGSERIAL PRIMARY KEY,
  product_id INTEGER NOT NULL REFERENCES products(id) ON DELETE CASCADE,
  content_type VARCHAR(50) NOT NULL, -- 'title', 'description', 'specification'
  content_text TEXT,
  embedding vector(1536), -- OpenAI ada-002 dimension
  created_at TIMESTAMPTZ DEFAULT NOW()
);

-- Create HNSW index for approximate nearest neighbor search
CREATE INDEX idx_product_embeddings_hnsw
  ON product_embeddings
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 200);

-- Semantic search function
CREATE OR REPLACE FUNCTION semantic_search(
  query_embedding vector(1536),
  match_limit INT DEFAULT 10,
  similarity_threshold FLOAT DEFAULT 0.7
) RETURNS TABLE (
  product_id INTEGER,
  content_text TEXT,
  similarity FLOAT
) LANGUAGE plpgsql AS $$
BEGIN
  RETURN QUERY
  SELECT
    pe.product_id,
    pe.content_text,
    1 - (pe.embedding <=> query_embedding) AS similarity
  FROM product_embeddings pe
  WHERE 1 - (pe.embedding <=> query_embedding) > similarity_threshold
  ORDER BY pe.embedding <=> query_embedding
  LIMIT match_limit;
END;
$$;

Indexing Strategy

pgvector supports two index types for approximate nearest neighbor (ANN) search:

IVFFlat: Fast to build, good for up to ~1M vectors. Partition-based. Lower recall but quicker index creation.
HNSW: Slower to build, higher recall, scales to tens of millions. Navigable small-world graph.

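For production applications with more than 100K vectors, HNSW is the recommended choice. The m parameter (maximum connections per node, default 16) and ef_construction (size of the dynamic candidate list during index build, default 64; the example above sets 200) control the trade-off between build time, index size, and recall. At query time, hnsw.ef_search (default 40) trades speed for recall.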

Vector Database Landscape

Beyond pgvector, several specialised vector databases offer different trade-offs:

Pinecone: Fully managed, serverless. Best when you want zero ops overhead. Supports hybrid (sparse + dense) search. Starts at ~$70/month for production.
Qdrant: Open-source with managed cloud. Best for filtering-heavy workloads (geo-spatial + semantic filters). Excellent Rust-based performance.
Chroma: Lightweight, embedded. Best for prototyping and small-scale applications. Not production-ready for high-throughput scenarios.

Prompt Management Systems

As your application grows from one LLM call to dozens of prompts across different features, you need a prompt management layer. Without it, prompts live in application code, version control history, and team members' heads — making iteration slow and dangerous.

Key Capabilities

A production prompt management system should handle:

Versioning: Every prompt change creates a new version. Rollback is a single operation.
Testing: Run prompts against test cases before deploying to production.
Monitoring: Track token usage, latency, and output quality per prompt template.
A/B Testing: Run two prompt variants simultaneously and compare results.
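
For the A/B testing capability, one common approach is deterministic bucketing: hash the user ID together with the experiment name so that the same user always sees the same prompt variant. The sketch below is an assumption about how this could be done, not a description of any of the tools discussed here; the function names and the choice of an FNV-1a hash are illustrative.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units: small, deterministic, no dependencies.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

type Variant = "A" | "B";

// shareB is the fraction of users (0 to 1) routed to variant B.
function assignVariant(userId: string, experiment: string, shareB = 0.5): Variant {
  const bucket = fnv1a(`${experiment}:${userId}`) / 0x100000000; // in [0, 1)
  return bucket < shareB ? "B" : "A";
}
```

Including the experiment name in the hash input keeps assignments independent across experiments, so a user who lands in variant B of one test is not automatically in variant B of the next.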

Tool Comparison

Tool	Best For	Pricing	Self-Hosted
LangSmith	Full LLM lifecycle (prompts, traces, evals)	Free tier + $25/user/mo	No
LangFuse	Open-source tracing and prompt management	Free tier + $59/mo cloud	Yes
Vercel AI SDK	Prompt management + streaming in Next.js	Free (open-source)	N/A (library)
Agenta	Prompt collaboration and evaluation	Free tier + $29/user/mo	Yes

Building a Simple Prompt Management Layer

For teams that prefer to own their infrastructure, a database-backed prompt system is straightforward to build:

TypeScript — Prompt Template Manager

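// Shape of a stored test case (assumed; adjust to your own evaluation format)
interface TestCase {
  variables: Record<string, string>;
  expectedSubstring: string;
}

interface PromptTemplate {
  id: string;
  name: string;
  version: number;
  template: string;              // "Answer the user's question about {{topic}}"
  variables: string[];           // ["topic", "context", "tone"]
  model: string;                 // "gpt-4o" | "claude-sonnet-4"
  parameters: {
    temperature: number;
    max_tokens: number;
  };
  tests: TestCase[];
}

// Assumed to exist elsewhere in the application: a SQL client that returns
// rows, and a thin wrapper around your LLM provider.
declare const db: {
  query<T>(sql: string, params: unknown[]): Promise<{ rows: T[] }>;
};
declare function callLLM(
  prompt: string,
  model: string,
  parameters: PromptTemplate["parameters"]
): Promise<string>;

class PromptManager {
  // Returns the requested version, or the latest one when no version is given.
  // A cache (for example Redis) could sit in front of this query.
  async getTemplate(name: string, version?: number): Promise<PromptTemplate> {
    const result = await db.query<PromptTemplate>(
      `SELECT * FROM prompt_templates
       WHERE name = $1 AND (version = $2 OR $2 IS NULL)
       ORDER BY version DESC LIMIT 1`,
      [name, version ?? null]
    );
    const template = result.rows[0];
    if (!template) throw new Error(`Prompt template not found: ${name}`);
    return template;
  }

  // Pure substitution step, kept separate so it can be tested without a database
  renderTemplate(template: PromptTemplate, variables: Record<string, string>): string {
    let rendered = template.template;
    for (const [key, value] of Object.entries(variables)) {
      // split/join replaces every occurrence and treats the value literally
      rendered = rendered.split(`{{${key}}}`).join(value);
    }
    return rendered;
  }

  async render(name: string, variables: Record<string, string>): Promise<string> {
    return this.renderTemplate(await this.getTemplate(name), variables);
  }

  async execute(name: string, variables: Record<string, string>): Promise<string> {
    // Fetch once so the rendered prompt and the model settings come from the same version
    const template = await this.getTemplate(name);
    const prompt = this.renderTemplate(template, variables);
    return callLLM(prompt, template.model, template.parameters);
  }
}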

Streaming Responses

LLMs generate text token by token — and waiting for the full response before showing it to the user creates a terrible experience. Streaming is non-negotiable for AI-ready web applications. The industry standard is Server-Sent Events (SSE).

SSE: The Server Side

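SSE is simpler than WebSockets for one-directional streaming from server to client. The browser's built-in EventSource API only issues GET requests, so a POST endpoint like the one below is read on the client with fetch and a stream reader instead; the wire format is the same: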

TypeScript — Streaming LLM Response via SSE

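// Server endpoint (Next.js App Router example)
export async function POST(req: Request) {
  const { messages } = await req.json();

  const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages,
      stream: true,
    }),
    signal: req.signal, // cancel the upstream call if the client disconnects
  });

  // Fail with a regular JSON error before any streaming has started
  if (!upstream.ok || !upstream.body) {
    return new Response(
      JSON.stringify({ error: { code: "upstream_error", message: `LLM API returned ${upstream.status}` } }),
      { status: 502, headers: { "Content-Type": "application/json" } }
    );
  }

  const reader = upstream.body.getReader();
  const decoder = new TextDecoder();
  const encoder = new TextEncoder();

  const stream = new ReadableStream<Uint8Array>({
    async start(controller) {
      let buffer = ""; // holds a trailing partial line between network chunks
      try {
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;

          // stream: true keeps multi-byte characters intact across chunk boundaries
          buffer += decoder.decode(value, { stream: true });
          const lines = buffer.split("\n");
          buffer = lines.pop() ?? "";

          for (const line of lines) {
            if (!line.startsWith("data: ")) continue;
            const data = line.slice(6);
            controller.enqueue(encoder.encode(`data: ${data}\n\n`));
            if (data === "[DONE]") {
              controller.close();
              return;
            }
          }
        }
        controller.close();
      } catch (err) {
        controller.error(err); // surface mid-stream failures to the client
      }
    },
    cancel() {
      return reader.cancel(); // stop reading when the consumer goes away
    },
  });

  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      "Connection": "keep-alive",
    },
  });
}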
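
On the client, the same framing has to be decoded again, and network chunks do not respect line boundaries. The following is a minimal sketch of an incremental parser for "data: " lines; createSseParser is a hypothetical helper name, and the sketch ignores other SSE fields such as event: and id:.

```typescript
// Incremental parser for "data: ..." SSE lines that may be split across chunks.
// Returns a function: feed it decoded text, get back the complete data payloads.
function createSseParser(): (chunk: string) => string[] {
  let buffer = "";
  return (chunk: string): string[] => {
    buffer += chunk;
    const lines = buffer.split("\n");
    buffer = lines.pop() ?? ""; // last element is an incomplete line (or "")
    return lines
      .map(line => line.replace(/\r$/, "")) // tolerate CRLF line endings
      .filter(line => line.startsWith("data: "))
      .map(line => line.slice(6));
  };
}
```

In a browser, each chunk would typically come from response.body.getReader() passed through a TextDecoder with the stream option enabled, and the loop would stop when a payload equals "[DONE]".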

Using the Vercel AI SDK

The Vercel AI SDK abstracts this complexity into a clean, framework-agnostic API. On the frontend, the useChat hook handles streaming state, error recovery, and reconnection:

TypeScript — useChat Hook for Streaming

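import { useState } from "react";
// Import path as used in this article; depending on the SDK version,
// the hook may instead be exported from "@ai-sdk/react".
import { useChat } from "ai/react";

export function Chat() {
  // Local state for a user-facing error message
  const [errorMessage, setErrorMessage] = useState<string | null>(null);

  const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat({
    api: "/api/chat",
    onError: (error) => {
      console.error("Stream error:", error);
      // Show a friendlier error message to the user
      setErrorMessage("Connection lost. Please try again.");
    },
  });

  return (
    <div className="chat-container">
      {messages.map(m => (
        <div key={m.id} className={`message ${m.role}`}>
          {m.content}
        </div>
      ))}
      {isLoading && <div className="typing-indicator">Thinking...</div>}
      {errorMessage && <div className="error-message" role="alert">{errorMessage}</div>}
      <form
        onSubmit={(e) => {
          setErrorMessage(null); // clear the previous error on a new attempt
          handleSubmit(e);
        }}
      >
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask anything..."
          disabled={isLoading}
        />
      </form>
    </div>
  );
}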

Streaming Best Practices

AbortController: Always pass an AbortSignal to the fetch call. When the user navigates away, the stream is properly cancelled.
Partial rendering: Render markdown and code blocks as they stream in — don't wait for the full block to close.
Error recovery: LLM API calls can fail mid-stream. Implement exponential backoff for retries and degrade gracefully (show partial response + retry button).
Connection limits: Each streaming connection holds server resources for its whole duration. Set reasonable limits on concurrent streams per user.
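
The cancellation and retry points above can be combined in one small helper. This is a sketch under assumptions: backoffDelayMs and withRetry are hypothetical names, the delays (500 ms base, 8 s cap) are arbitrary, and the sleep function is injectable only so the helper can be tested without real timers.

```typescript
// Exponential backoff: 500 ms, 1 s, 2 s, 4 s, then capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 8000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

interface RetryOptions {
  retries?: number;                      // retries after the first attempt
  signal?: AbortSignal;                  // stops retrying once aborted
  sleep?: (ms: number) => Promise<void>; // injectable for tests
}

async function withRetry<T>(
  task: (signal?: AbortSignal) => Promise<T>,
  options: RetryOptions = {},
): Promise<T> {
  const {
    retries = 3,
    signal,
    sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms)),
  } = options;
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    if (signal?.aborted) throw new Error("Aborted");
    try {
      return await task(signal); // pass the signal through, e.g. to fetch
    } catch (err) {
      lastError = err;
      if (attempt < retries) await sleep(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

A typical call would wrap the request to /api/chat, passing the same signal to fetch, and call abort() on the controller when the component unmounts or the user navigates away.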

AI Application Architecture Patterns

Different use cases demand different architecture patterns. Three dominant patterns have emerged for AI-ready web applications:

RAG — Retrieval-Augmented Generation

Best for Q&A over your own data. User asks a question → retrieve relevant documents from vector DB → LLM generates answer grounded in those documents. Simple, reliable, and easy to debug.
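
The retrieve-then-generate flow can be sketched in a few lines. Everything named here (RetrievedDoc, Retriever, Generator, buildRagPrompt, answerWithRag) is hypothetical; the retriever and the LLM call are injected as functions so the sketch stays independent of any particular vector database or model provider.

```typescript
interface RetrievedDoc { id: string; text: string; score: number; }
type Retriever = (query: string, limit: number) => Promise<RetrievedDoc[]>;
type Generator = (prompt: string) => Promise<string>;

// Build a prompt that restricts the model to the retrieved sources.
function buildRagPrompt(question: string, docs: RetrievedDoc[]): string {
  const context = docs.map((d, i) => `[${i + 1}] (${d.id}) ${d.text}`).join("\n");
  return [
    "Answer the question using only the numbered sources below.",
    "If the sources do not contain the answer, say so.",
    "",
    "Sources:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

async function answerWithRag(
  question: string,
  retrieve: Retriever,
  generate: Generator,
  limit = 3,
): Promise<{ answer: string; sources: string[] }> {
  const docs = await retrieve(question, limit);
  if (docs.length === 0) {
    // Skip the LLM call when there is nothing to ground the answer in
    return { answer: "No relevant documents found.", sources: [] };
  }
  const answer = await generate(buildRagPrompt(question, docs));
  return { answer, sources: docs.map(d => d.id) };
}
```

Returning the source IDs alongside the answer is what makes the pattern easy to debug: a wrong answer can be traced to either bad retrieval or bad generation.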

Agentic Architecture

Best for autonomous multi-step tasks. The LLM decides which tools to call, in what order, and when to return the final result. Requires careful guardrails, tool definitions, and error handling.
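
A minimal sketch of such a loop is shown below. It is not a specific framework's API: the step format, the Planner type (a stand-in for the LLM deciding the next action), and runAgent are assumptions. The two guardrails it illustrates are a hard step limit and feeding tool errors back to the planner instead of crashing.

```typescript
type ToolFn = (args: Record<string, unknown>) => Promise<string>;

type AgentStep =
  | { type: "tool_call"; tool: string; args: Record<string, unknown> }
  | { type: "final"; answer: string };

// Stand-in for the LLM: given the transcript so far, decide the next step.
type Planner = (transcript: string[]) => Promise<AgentStep>;

async function runAgent(
  goal: string,
  plan: Planner,
  tools: Map<string, ToolFn>,
  maxSteps = 5,
): Promise<string> {
  const transcript: string[] = [`goal: ${goal}`];
  for (let step = 0; step < maxSteps; step++) {
    const next = await plan(transcript);
    if (next.type === "final") return next.answer;

    const tool = tools.get(next.tool);
    if (!tool) {
      // Report the mistake to the planner so it can pick a valid tool
      transcript.push(`error: unknown tool "${next.tool}"`);
      continue;
    }
    try {
      transcript.push(`${next.tool} -> ${await tool(next.args)}`);
    } catch (err) {
      transcript.push(`error: ${next.tool} failed: ${String(err)}`);
    }
  }
  // Guardrail: never loop forever
  throw new Error(`Agent did not finish within ${maxSteps} steps`);
}
```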

MCP — Model Context Protocol

Best for connecting your application to external LLM ecosystems. A standardized protocol for exposing tools and data sources to any MCP-compatible client. Think of it as a universal adapter layer.
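
To give a feel for the shape of the protocol, here is a deliberately simplified sketch of a handler for MCP-style JSON-RPC 2.0 requests covering only tool listing and tool calls. It is an illustration, not the official SDK: it omits the initialization handshake, transports, and capabilities, and the ToolDefinition type and handleRequest function are hypothetical. A real server would normally be built on an official MCP SDK.

```typescript
interface JsonRpcRequest {
  jsonrpc: "2.0";
  id: number | string;
  method: string;
  params?: Record<string, unknown>;
}

interface JsonRpcResponse {
  jsonrpc: "2.0";
  id: number | string;
  result?: unknown;
  error?: { code: number; message: string };
}

interface ToolDefinition {
  name: string;
  description: string;
  inputSchema: Record<string, unknown>; // JSON Schema for the arguments
  handler: (args: Record<string, unknown>) => string;
}

function handleRequest(req: JsonRpcRequest, tools: ToolDefinition[]): JsonRpcResponse {
  const base = { jsonrpc: "2.0" as const, id: req.id };

  if (req.method === "tools/list") {
    // Advertise tools without exposing the handler functions
    const listed = tools.map(({ name, description, inputSchema }) => ({ name, description, inputSchema }));
    return { ...base, result: { tools: listed } };
  }

  if (req.method === "tools/call") {
    const tool = tools.find(t => t.name === req.params?.name);
    if (!tool) return { ...base, error: { code: -32602, message: "Unknown tool" } };
    const args = (req.params?.arguments ?? {}) as Record<string, unknown>;
    return { ...base, result: { content: [{ type: "text", text: tool.handler(args) }] } };
  }

  return { ...base, error: { code: -32601, message: "Method not found" } };
}
```

The point of the adapter analogy is visible here: the client discovers what exists through the list call and invokes it through one generic call method, so adding a tool does not require changing the client.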

When to Use Which

The three patterns are not mutually exclusive. A typical production AI application uses all three: RAG for knowledge retrieval, Agentic workflows for complex user requests, and MCP for exposing capabilities to external AI agents. The key is designing your architecture so each layer is independently deployable and testable.

For more on MCP and agent-ready architecture, see my guide on WebMCP: Making Websites Agent-Ready for AI Agents. For a broader look at how AI code assistants are changing development workflows, read AI Code Assistants and Modern Web Development.

Belarus Market Considerations

Building AI-ready web applications from Belarus presents specific challenges around regulation, payment infrastructure, hosting, and LLM access. Here is what I have found works in practice:

Concern	Challenge	Practical Solution
AI Regulation	Law No. 470-3 (effective July 2026) requires AI transparency, disclaimers for AI-generated content, and user notification when interacting with AI	Add visible AI labels, maintain audit logs of AI decisions, include disclaimers in generated content. The regulation is designed for consumer protection and is manageable with standard compliance practices
Payment Integration	International API payments are restricted. Standard credit card billing for OpenAI/Anthropic is unavailable from Belarus	Use bePaid for local payment processing on your site. For international LLM API access, route through OpenRouter (works with USDT or EU legal entities). Telegram Stars (Telegram's in-app currency) is a viable option for chatbot-based services
Hosting	Data residency requirements and potential latency to Western cloud providers	Use Hoster.by for Russian-language frontends and EU data. For global performance, Hetzner (Nuremberg/Falkenstein) offers excellent price-performance at €4-40/month. Cloudflare handles CDN and DDoS protection for any hosting backend
LLM Access	Direct API access to OpenAI, Anthropic, and Google AI is restricted or very expensive from Belarusian IPs	OpenRouter is the primary gateway — unified API for 200+ models, accepts crypto payments, works from Belarus. For local inference, Ollama running on Hetzner servers provides a self-hosted fallback using open-weight models like Llama 3, Mistral, and Qwen

Case Studies

1. E-Commerce Semantic Search (Belarusian Marketplace)

A mid-sized Belarusian e-commerce marketplace replaced their keyword search with pgvector-powered semantic search. The implementation took 3 weeks: the team embedded all product descriptions using an open-source embedding model (BAAI/bge-base-en-v1.5), stored the vectors in pgvector, and served results via a Next.js API route. Result: 34% increase in search-to-purchase conversion and a 52% reduction in "no results" queries.

Stack: PostgreSQL + pgvector, Next.js, BAAI/bge-base-en-v1.5, Vercel AI SDK for the chat interface

2. Customer Support RAG (SaaS Platform)

A B2B SaaS company built a RAG-based customer support assistant over their 2,000-page documentation site. They chunked documents into 512-token segments, embedded them with OpenAI ada-002, and built a retrieval pipeline with Pinecone. The LLM (GPT-4o) generates answers with source citations. Support ticket volume dropped 47% in 3 months.

Stack: Pinecone, GPT-4o, LangChain, Next.js, SSE streaming
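
The chunking step in a pipeline like this can be sketched as follows. Two assumptions to note: whitespace-separated words stand in for model tokens (a real pipeline would count tokens with the embedding model's tokenizer), and the overlap between neighbouring chunks is a common practice added here for illustration; the case study does not say whether overlap was used.

```typescript
// Split text into windows of `chunkSize` tokens, each sharing `overlap`
// tokens with the previous window. Tokens are approximated by words.
function chunkTokens(text: string, chunkSize = 512, overlap = 64): string[] {
  if (chunkSize <= 0) throw new Error("chunkSize must be positive");
  if (overlap < 0 || overlap >= chunkSize) throw new Error("overlap must be in [0, chunkSize)");

  const tokens = text.split(/\s+/).filter(t => t.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap;

  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + chunkSize).join(" "));
    if (start + chunkSize >= tokens.length) break; // last window reached the end
  }
  return chunks;
}
```

Overlap means a sentence that straddles a boundary appears whole in at least one chunk, at the cost of storing and embedding some text twice.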

3. AI-Powered Contract Review (Legal Tech)

A legal tech startup in Belarus built an agentic workflow that reviews uploaded contracts clause by clause. Each clause is analysed by a specialised prompt (jurisdiction, liability, termination), and the agent decides whether to flag issues, request clarification, or approve. The system runs on self-hosted Llama 3 via Ollama on Hetzner, with a Next.js frontend that streams results as each clause is processed.

Stack: Ollama (Llama 3 70B), Hetzner Cloud, Next.js, PostgreSQL + pgvector, custom agent framework

6-Week Implementation Roadmap

Based on the patterns above, here is a realistic timeline for building an AI-ready web application from scratch:

Week 1 — Foundation: Set up PostgreSQL with pgvector, define API contracts (OpenAPI 3.1), configure hosting (Hetzner + Cloudflare), set up CI/CD pipeline
Week 2 — Core API & LLM Integration: Build REST/GraphQL endpoints with LLM-friendly error formats, integrate OpenRouter or direct API, implement basic chat endpoint
Week 3 — Vector Database & RAG: Build embedding pipeline, create vector indexes, implement semantic search endpoint, set up RAG retrieval with source citations
Week 4 — Prompt Management & Streaming: Set up prompt template storage with versioning, implement SSE streaming, add AbortController and error recovery, build prompt monitoring dashboard
Week 5 — Frontend & UX: Build chat interface with useChat, implement streaming markdown rendering, add typing indicators and retry logic, polish error states
Week 6 — Testing & Launch: Load test streaming endpoints, security review (prompt injection, rate limiting), set up monitoring (token usage, latency), deploy to production

This timeline assumes a small team (1-2 developers) working full-time. Adding agentic workflows or MCP connectivity typically adds 2-3 weeks. RAG-only applications (without agentic features) can ship in 4 weeks.

FAQ

What makes a web application AI-ready?

An AI-ready web application is designed from the ground up to integrate with large language models. Key characteristics include: RESTful or GraphQL API endpoints optimized for structured LLM consumption, vector database integration for semantic search and retrieval-augmented generation, a prompt management layer for versioning and monitoring LLM calls, streaming response support for real-time AI output, and an architecture that supports RAG, agentic, or MCP patterns for AI interaction.

What is the best vector database for an AI-ready web app?

For most web applications, pgvector (PostgreSQL extension) is the best starting point — it eliminates the operational cost of a separate vector database by running inside your existing PostgreSQL instance. It supports exact and approximate nearest neighbor search (IVFFlat, HNSW indexes), handles up to millions of vectors effectively, and works with any PostgreSQL provider. For specialized needs at scale, Pinecone or Qdrant offer managed vector database services with higher performance on billion-scale datasets.

What is RAG and why do I need it?

Retrieval-Augmented Generation (RAG) is an architecture pattern where an LLM retrieves relevant information from your own data before generating a response. Instead of asking the LLM to answer from its training data (which is often outdated), RAG lets you feed it current, specific information from your vector database. This reduces hallucinations, keeps responses grounded in your actual data, and lets users query your knowledge base conversationally. RAG is the most common pattern for production AI applications.

How do I handle streaming responses from LLMs in my web app?

Server-Sent Events (SSE) are the standard approach for streaming LLM responses to the browser. The Vercel AI SDK provides a polished abstraction over this with its useChat hook, handling both the streaming transport and the chat state management. At the framework level, you set up a POST endpoint that streams tokens back via ReadableStream. The key considerations are: proper backpressure handling, connection management (abort on unmount), error recovery for partial streams, and user experience patterns like typing indicators and streaming markdown rendering.

What LLM options are available for web apps in Belarus?

Developers in Belarus have several options for LLM integration. The primary path is through OpenRouter, which provides unified API access to OpenAI, Claude, Gemini, and other models without requiring individual accounts. For privacy-sensitive applications, local LLMs via Ollama running on Hetzner or other EU-based servers work well. YandexGPT is available but has Russian jurisdiction implications. The key consideration is payment — Belarusian developers typically use USDT/crypto for international API services or route through EU-based legal entities for invoicing.

What is the difference between RAG, Agentic, and MCP architectures?

RAG (Retrieval-Augmented Generation) is a query-answer pattern: the user asks a question, the system retrieves relevant documents from a vector database, and the LLM generates an answer grounded in those documents. Agentic architecture gives the LLM autonomy to perform multi-step tasks — it can call APIs, query databases, and take actions to achieve a goal. MCP (Model Context Protocol) is a standard for connecting LLMs to external tools and data sources, acting as a universal adapter between your application and the LLM. Each serves a different purpose: RAG for Q&A, Agentic for autonomous workflows, MCP for standardized tool connectivity.

How long does it take to build an AI-ready web application?

A reasonable timeframe for building a production-ready AI web application is 6 weeks for a minimum viable product. Week 1 focuses on API design and infrastructure setup (PostgreSQL with pgvector, hosting). Week 2 covers the core API endpoints and LLM integration layer. Week 3 adds the vector database and RAG pipeline. Week 4 implements prompt management and streaming responses. Week 5 handles the frontend integration and user experience. Week 6 is for testing, security review, and deployment. More complex applications with agentic workflows or custom tool integrations typically take 8-12 weeks.
