From API design optimised for AI agents to vector databases, prompt management, and streaming responses — a practical guide to making your web app truly AI-ready.
Building a web application that truly works with AI is no longer about gluing a chatbot widget onto an existing page. The shift toward AI-ready web applications means designing your entire stack — API layer, data storage, orchestration, and frontend — with LLM integration as a first-class concern. Applications that consume, process, and generate content through large language models require fundamentally different architectural decisions than traditional CRUD apps.
Over the past 18 months, I have worked with several teams transitioning their web applications from "add AI later" to "AI-first" architecture. The patterns that emerge consistently fall into five areas: API design optimised for LLM consumption, vector database integration for semantic grounding, prompt management infrastructure, streaming response handling, and production architecture patterns like RAG, agentic workflows, and MCP connectivity.
This guide covers each of these areas in depth — with code examples, comparison tables, architecture decisions, and real-world case studies. Whether you are building a new AI-native application or retrofitting an existing one, the patterns here will save you months of trial and error.
The first and most impactful change when building an AI-ready web application is how you design your API. Traditional REST APIs are optimised for human-readable responses and paginated list endpoints. LLMs need something different: structured, self-describing endpoints with clear semantics, consistent error formats, and predictable response shapes.
The core difference lies in how the API communicates its capabilities. A traditional API expects the client to know what endpoints exist and how to call them. An LLM-optimised API exposes a machine-readable contract that the model can discover and use autonomously:
| Characteristic | Traditional REST | LLM-Optimised API |
|---|---|---|
| Documentation | Human-readable docs (Swagger UI) | Machine-readable OpenAPI 3.1 with LLM-friendly descriptions |
| Response shape | Fixed, often nested | Flat, consistent, with explicit null fields |
| Error format | Varies by endpoint | Uniform with code + message + traceback URL |
| Pagination | Page/limit query params | Cursor-based with next/prev URLs embedded |
| Idempotency | Rarely explicit | Idempotency keys on all mutation endpoints |
| Rate limiting | Returns 429 | Returns 429 with Retry-After and quota info |
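The "uniform error format" and "429 with Retry-After and quota info" rows can be sketched as one envelope type shared by every endpoint. This is a minimal illustration, not a standard: the field names (`code`, `message`, `traceUrl`, `retryAfterSeconds`, `quotaRemaining`) and the `https://example.com/traces/` prefix are assumptions made for the example.

```typescript
// Hypothetical uniform error envelope: every endpoint returns this shape on failure.
interface ApiError {
  code: string;                     // stable, machine-readable identifier
  message: string;                  // short explanation the model can act on
  traceUrl: string;                 // where a human can inspect the failed request
  retryAfterSeconds: number | null; // explicit null rather than an omitted field
  quotaRemaining: number | null;
}

interface ErrorResponse {
  status: number;
  headers: Record<string, string>;
  body: { error: ApiError };
}

// Build the envelope. traceBase is an assumed URL prefix for request traces.
function errorResponse(
  status: number,
  code: string,
  message: string,
  requestId: string,
  traceBase: string = "https://example.com/traces/"
): ErrorResponse {
  return {
    status,
    headers: { "Content-Type": "application/json" },
    body: {
      error: {
        code,
        message,
        traceUrl: traceBase + requestId,
        retryAfterSeconds: null,
        quotaRemaining: null,
      },
    },
  };
}

// 429 variant: same envelope, plus a Retry-After header and quota details.
function rateLimited(
  requestId: string,
  retryAfterSeconds: number,
  quotaRemaining: number
): ErrorResponse {
  const res = errorResponse(
    429,
    "rate_limited",
    "Too many requests; retry after the indicated delay.",
    requestId
  );
  res.headers["Retry-After"] = String(retryAfterSeconds);
  res.body.error.retryAfterSeconds = retryAfterSeconds;
  res.body.error.quotaRemaining = quotaRemaining;
  return res;
}
```

Because the shape never changes, an agent can branch on `error.code` and wait `retryAfterSeconds` without parsing free-form text.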
OpenAPI 3.1 (which aligns with JSON Schema 2020-12) is the standard for LLM-friendly API documentation. The key is writing descriptions an LLM can parse, with explicit parameter semantics, return types, error conditions, and side effects:
"/api/search": {
  "post": {
    "summary": "Search products semantically",
    "description": "Performs semantic search across the product catalog using vector embeddings. Returns results ranked by relevance score (cosine similarity). Use this endpoint when the user asks comparative, qualitative, or feature-based questions about products — not for exact ID lookups.",
    "operationId": "semanticSearch",
    "parameters": [
      {
        "name": "X-Idempotency-Key",
        "in": "header",
        "schema": { "type": "string", "format": "uuid" },
        "required": false,
        "description": "Optional idempotency key. If provided, identical requests within 5 minutes return the cached result."
      }
    ],
    "requestBody": {
      "content": {
        "application/json": {
          "schema": {
            "type": "object",
            "properties": {
              "query": { "type": "string", "description": "Natural language search query" },
              "limit": { "type": "integer", "default": 10, "maximum": 50 },
              "cursor": { "type": "string", "description": "Pagination cursor from previous response" }
            },
            "required": ["query"]
          }
        }
      }
    }
  }
}
Notice the description field on the endpoint: it tells the LLM when to use this endpoint ("comparative, qualitative, or feature-based questions") and when not to (exact ID lookups). Explicit when-to-use guidance like this cuts down on calls routed to the wrong endpoint.
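To show where that description ends up, here is a rough sketch of turning an OpenAPI operation into a tool definition an agent can be handed. The `ToolDefinition` shape and the `operationToTool` helper are hypothetical; real LLM SDKs use similar fields but their exact formats differ.

```typescript
// Minimal subset of an OpenAPI operation: only the fields used below.
interface JsonSchema {
  type: string;
  properties?: Record<string, unknown>;
  required?: string[];
}

interface OpenApiOperation {
  operationId: string;
  summary?: string;
  description?: string;
  requestBody?: {
    content: { "application/json"?: { schema: JsonSchema } };
  };
}

// Hypothetical tool shape (an assumption, not any vendor's exact format).
interface ToolDefinition {
  name: string;
  description: string;
  parameters: JsonSchema;
}

function operationToTool(op: OpenApiOperation): ToolDefinition {
  const schema = op.requestBody?.content["application/json"]?.schema;
  return {
    name: op.operationId,
    // Prefer the long description: it carries the when-to-use guidance.
    description: op.description ?? op.summary ?? "",
    parameters: schema ?? { type: "object", properties: {} },
  };
}

// Usage with a trimmed version of the search operation.
const searchTool = operationToTool({
  operationId: "semanticSearch",
  summary: "Search products semantically",
  description:
    "Performs semantic search across the product catalog. Not for exact ID lookups.",
  requestBody: {
    content: {
      "application/json": {
        schema: {
          type: "object",
          properties: { query: { type: "string" }, limit: { type: "integer" } },
          required: ["query"],
        },
      },
    },
  },
});
```

The model only ever sees `name`, `description`, and `parameters`, which is why vague descriptions translate directly into wrong tool choices.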
LLMs struggle with page/offset pagination because "page 3" has no semantic meaning.
Cursor-based pagination gives the LLM a concrete reference point ("after this product ID")
that it can pass directly to the next call. Every list response includes next_cursor
and has_more fields.
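A minimal in-memory sketch of that response shape follows. It assumes products are sorted by ascending `id` and that the cursor is simply the last returned `id` as a string; a real implementation would run the same comparison in SQL and might use an opaque encoded cursor instead.

```typescript
interface Product {
  id: number;
  name: string;
}

interface Page<T> {
  items: T[];
  next_cursor: string | null;
  has_more: boolean;
}

// Return up to `limit` products whose id comes after the cursor.
// Assumes `products` is sorted by ascending id.
function listProducts(
  products: Product[],
  limit: number,
  cursor: string | null
): Page<Product> {
  const afterId = cursor === null ? -Infinity : Number(cursor);
  const remaining = products.filter((p) => p.id > afterId);
  const items = remaining.slice(0, limit);
  const has_more = remaining.length > items.length;
  return {
    items,
    // The cursor is the id of the last item on this page.
    next_cursor:
      has_more && items.length > 0 ? String(items[items.length - 1].id) : null,
    has_more,
  };
}

// Usage: walk a five-item catalog two items at a time.
const catalog: Product[] = [1, 2, 3, 4, 5].map((id) => ({ id, name: `Product ${id}` }));
const firstPage = listProducts(catalog, 2, null);
const secondPage = listProducts(catalog, 2, firstPage.next_cursor);
const thirdPage = listProducts(catalog, 2, secondPage.next_cursor);
```

The model never has to compute an offset: it copies `next_cursor` into the next request and stops when `has_more` is false.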
Vector databases are the backbone of any AI-ready web application that needs to ground LLM responses in real data. They store embeddings — numerical representations of text meaning — and enable semantic search across your content, products, or knowledge base.
For most web applications, pgvector — the PostgreSQL vector extension — is the right starting point. It eliminates the operational complexity of running a separate vector database by embedding vector search into your existing PostgreSQL instance:
-- Enable the extension (safe to re-run)
CREATE EXTENSION IF NOT EXISTS vector;
-- Create a table with vector embeddings
CREATE TABLE product_embeddings (
  id BIGSERIAL PRIMARY KEY,
  product_id INTEGER NOT NULL REFERENCES products(id) ON DELETE CASCADE,
  content_type VARCHAR(50) NOT NULL, -- 'title', 'description', 'specification'
  content_text TEXT,
  embedding vector(1536), -- OpenAI text-embedding-ada-002 dimension
  created_at TIMESTAMPTZ DEFAULT NOW()
);
-- Create HNSW index for approximate nearest neighbor search
CREATE INDEX idx_product_embeddings_hnsw
  ON product_embeddings
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 200);
-- Semantic search function
-- <=> is pgvector's cosine distance operator, so 1 - distance is cosine similarity
CREATE OR REPLACE FUNCTION semantic_search(
  query_embedding vector(1536),
  match_limit INT DEFAULT 10,
  similarity_threshold FLOAT DEFAULT 0.7
) RETURNS TABLE (
  product_id INTEGER,
  content_text TEXT,
  similarity FLOAT
) LANGUAGE plpgsql AS $$
BEGIN
  RETURN QUERY
  SELECT
    pe.product_id,
    pe.content_text,
    1 - (pe.embedding <=> query_embedding) AS similarity
  FROM product_embeddings pe
  WHERE 1 - (pe.embedding <=> query_embedding) > similarity_threshold
  ORDER BY pe.embedding <=> query_embedding
  LIMIT match_limit;
END;
$$;
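To make the scoring concrete: `<=>` is pgvector's cosine distance operator, so `1 - distance` is cosine similarity. The sketch below reproduces the same threshold-and-rank logic in memory with a brute-force scan. It is illustrative only (exact rather than approximate, with toy two-dimensional vectors), and the `Row`, `Match`, and `semanticSearchInMemory` names are invented for the example.

```typescript
// Cosine similarity: dot product divided by the product of the vector norms.
// Note: a zero vector yields NaN, which the threshold filter below discards.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Row {
  product_id: number;
  content_text: string;
  embedding: number[];
}

interface Match {
  product_id: number;
  content_text: string;
  similarity: number;
}

// Brute-force equivalent of the SQL function: score, filter, sort, limit.
function semanticSearchInMemory(
  rows: Row[],
  queryEmbedding: number[],
  matchLimit: number = 10,
  similarityThreshold: number = 0.7
): Match[] {
  return rows
    .map((r) => ({
      product_id: r.product_id,
      content_text: r.content_text,
      similarity: cosineSimilarity(r.embedding, queryEmbedding),
    }))
    .filter((m) => m.similarity > similarityThreshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, matchLimit);
}

// Usage with toy 2-dimensional embeddings.
const sampleRows: Row[] = [
  { product_id: 1, content_text: "same direction", embedding: [2, 0] },
  { product_id: 2, content_text: "orthogonal", embedding: [0, 1] },
  { product_id: 3, content_text: "45 degrees off", embedding: [1, 1] },
];
const sampleMatches = semanticSearchInMemory(sampleRows, [1, 0]);
```

With the default 0.7 threshold, the orthogonal row is filtered out and the remaining two come back ordered by similarity, which mirrors what the SQL `WHERE` and `ORDER BY` clauses do.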
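pgvector supports two index types for approximate nearest neighbor (ANN) search: IVFFlat, which builds faster and uses less memory, and HNSW, which builds more slowly but gives a better speed-recall trade-off at query time.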
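For production applications with more than 100K vectors, HNSW is the recommended choice. The m parameter (maximum connections per node, default 16) and ef_construction (size of the dynamic candidate list used while building the index, default 64; the example above raises it to 200) control the trade-off between build time, memory, and recall. At query time, hnsw.ef_search (default 40) trades speed for recall.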
Beyond pgvector, several specialised vector databases, such as Pinecone (used in the support-assistant case study later in this guide), offer different trade-offs.
As your application grows from one LLM call to dozens of prompts across different features, you need a prompt management layer. Without it, prompts live in application code, version control history, and team members' heads — making iteration slow and dangerous.
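A production prompt management system should handle versioning, variable templating, per-prompt model parameters, and test cases. Several existing tools cover parts of this: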
| Tool | Best For | Pricing | Self-Hosted |
|---|---|---|---|
| LangSmith | Full LLM lifecycle (prompts, traces, evals) | Free tier + $25/user/mo | No |
| LangFuse | Open-source tracing and prompt management | Free tier + $59/mo cloud | Yes |
| Vercel AI SDK | Prompt management + streaming in Next.js | Free (open-source) | N/A (library) |
| Agenta | Prompt collaboration and evaluation | Free tier + $29/user/mo | Yes |
For teams that prefer to own their infrastructure, a database-backed prompt system is straightforward to build:
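// Assumed helpers (not shown): a pg-style SQL client and a thin LLM wrapper.
declare const db: {
  query<T>(sql: string, params: unknown[]): Promise<{ rows: T[] }>;
};
declare function callLLM(
  prompt: string,
  model: string,
  parameters: PromptTemplate["parameters"]
): Promise<string>;
// Assumed shape of a regression test attached to a prompt.
interface TestCase {
  variables: Record<string, string>;
  expected: string;
}
interface PromptTemplate {
  id: string;
  name: string;
  version: number;
  template: string;    // "Answer the user's question about {{topic}}"
  variables: string[]; // ["topic", "context", "tone"]
  model: string;       // "gpt-4o" | "claude-sonnet-4"
  parameters: {
    temperature: number;
    max_tokens: number;
  };
  tests: TestCase[];
}
class PromptManager {
  // Fetch the requested version, or the latest one when version is omitted.
  // (A Redis cache in front of this query is a natural next step.)
  async getTemplate(name: string, version?: number): Promise<PromptTemplate> {
    const { rows } = await db.query<PromptTemplate>(
      `SELECT * FROM prompt_templates
       WHERE name = $1 AND ($2::int IS NULL OR version = $2)
       ORDER BY version DESC LIMIT 1`,
      [name, version ?? null]
    );
    if (rows.length === 0) {
      throw new Error(`Prompt template not found: ${name}`);
    }
    return rows[0];
  }
  // Substitute every occurrence of each {{variable}}, not just the first one.
  renderTemplate(template: PromptTemplate, variables: Record<string, string>): string {
    let rendered = template.template;
    for (const [key, value] of Object.entries(variables)) {
      rendered = rendered.split(`{{${key}}}`).join(value);
    }
    return rendered;
  }
  async render(name: string, variables: Record<string, string>): Promise<string> {
    return this.renderTemplate(await this.getTemplate(name), variables);
  }
  async execute(name: string, variables: Record<string, string>): Promise<string> {
    // Load once so the prompt text and model settings come from the same version.
    const template = await this.getTemplate(name);
    const prompt = this.renderTemplate(template, variables);
    return callLLM(prompt, template.model, template.parameters);
  }
}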
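One failure mode worth guarding against in any template renderer is a placeholder that silently stays unfilled and reaches the model as literal `{{topic}}`. The following is a small sketch of a stricter render step; `extractVariables` and `renderStrict` are hypothetical helper names, and the `{{name}}` syntax is taken from the template comment above.

```typescript
// Collect the distinct {{name}} placeholders used in a template string.
function extractVariables(template: string): string[] {
  const pattern = /\{\{\s*(\w+)\s*\}\}/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(template)) !== null) {
    if (names.indexOf(match[1]) === -1) {
      names.push(match[1]);
    }
  }
  return names;
}

// Render a template, failing loudly if any placeholder has no value.
function renderStrict(template: string, variables: Record<string, string>): string {
  const missing = extractVariables(template).filter(
    (name) => !Object.prototype.hasOwnProperty.call(variables, name)
  );
  if (missing.length > 0) {
    throw new Error(`Missing prompt variables: ${missing.join(", ")}`);
  }
  // A replacer function fills every occurrence and avoids special
  // treatment of "$" sequences in the substituted values.
  return template.replace(
    /\{\{\s*(\w+)\s*\}\}/g,
    (_whole: string, name: string) => variables[name]
  );
}

// Usage: the same variable may appear more than once.
const renderedPrompt = renderStrict(
  "Explain {{topic}} in a {{tone}} tone. Again: {{topic}}.",
  { topic: "RAG", tone: "friendly" }
);
```

The same `extractVariables` helper can also validate a template at save time, by comparing its output with the declared `variables` array.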
LLMs generate text token by token — and waiting for the full response before showing it to the user creates a terrible experience. Streaming is non-negotiable for AI-ready web applications. The industry standard is Server-Sent Events (SSE).
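SSE is simpler than WebSockets for one-directional data streaming from server to client. Browsers support it natively through the EventSource API, but EventSource only issues GET requests, so a POST chat endpoint like the one below is usually consumed with fetch and a stream reader instead: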
// Server endpoint (Next.js App Router example)
export async function POST(req: Request) {
  const { messages } = await req.json();
  const upstream = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages,
      stream: true,
    }),
  });
  // Fail fast instead of streaming an upstream error page to the client
  if (!upstream.ok || !upstream.body) {
    return new Response("Upstream LLM request failed", { status: 502 });
  }
  const reader = upstream.body.getReader();
  const decoder = new TextDecoder();
  const encoder = new TextEncoder();
  const stream = new ReadableStream({
    async start(controller) {
      let buffer = ""; // holds a partial line when a chunk ends mid-event
      try {
        while (true) {
          const { done, value } = await reader.read();
          if (done) break;
          // stream: true keeps multi-byte characters intact across chunks
          buffer += decoder.decode(value, { stream: true });
          const lines = buffer.split("\n");
          buffer = lines.pop() ?? "";
          for (const line of lines) {
            if (!line.startsWith("data: ")) continue;
            const data = line.slice(6).trim();
            controller.enqueue(encoder.encode(`data: ${data}\n\n`));
            if (data === "[DONE]") {
              controller.close();
              return;
            }
          }
        }
        controller.close();
      } catch (err) {
        controller.error(err);
      }
    },
    cancel() {
      // Client disconnected: stop reading from the upstream API
      return reader.cancel();
    },
  });
  return new Response(stream, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
      "Connection": "keep-alive",
    },
  });
}
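The subtle part of relaying a stream is that network chunks do not line up with SSE event boundaries: one event can arrive split across two reads. Here is a minimal sketch of an incremental parser that buffers the incomplete tail. `createSseParser` is a hypothetical helper, and it is deliberately simplified: it treats each `data:` line as one payload and ignores other SSE fields such as `event:` and `id:`.

```typescript
// Incremental parser: feed it raw text chunks, get back complete data payloads.
function createSseParser(): { push(chunk: string): string[] } {
  let buffer = "";
  return {
    push(chunk: string): string[] {
      buffer += chunk;
      const lines = buffer.split("\n");
      // The last element is an unfinished line (or "" if the chunk ended on "\n").
      buffer = lines.pop() ?? "";
      const payloads: string[] = [];
      for (const rawLine of lines) {
        const line = rawLine.replace(/\r$/, ""); // tolerate CRLF line endings
        if (line.startsWith("data: ")) {
          payloads.push(line.slice(6));
        }
      }
      return payloads;
    },
  };
}

// Usage: one JSON event and the terminator, split at awkward positions.
const sseParser = createSseParser();
const fromChunk1 = sseParser.push('data: {"a"');
const fromChunk2 = sseParser.push(':1}\n\ndata: [DO');
const fromChunk3 = sseParser.push("NE]\n\n");
```

The first chunk yields nothing because its line is incomplete; the payload only surfaces once the newline arrives in the second chunk. Splitting each chunk on newlines without a buffer would drop or corrupt that event.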
The Vercel AI SDK abstracts this complexity. On the frontend, its useChat hook manages the message list, input state, streaming updates, and error state:
import { useState } from "react";
// The import path and hook shape vary between AI SDK major versions;
// check the version you have installed.
import { useChat } from "ai/react";
export function Chat() {
  // Local state for the user-facing error message
  const [errorMessage, setErrorMessage] = useState<string | null>(null);
  const { messages, input, handleInputChange, handleSubmit, isLoading } = useChat({
    api: "/api/chat",
    onError: (error) => {
      console.error("Stream error:", error);
      // Show a friendlier error message to the user
      setErrorMessage("Connection lost. Please try again.");
    },
  });
  return (
    <div className="chat-container">
      {messages.map(m => (
        <div key={m.id} className={`message ${m.role}`}>
          {m.content}
        </div>
      ))}
      {isLoading && <div className="typing-indicator">Thinking...</div>}
      {errorMessage && <div className="error" role="alert">{errorMessage}</div>}
      <form onSubmit={handleSubmit}>
        <input
          value={input}
          onChange={handleInputChange}
          placeholder="Ask anything..."
          disabled={isLoading}
        />
      </form>
    </div>
  );
}
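For intuition about the "streaming state" a chat hook has to manage, here is a rough model written as a plain reducer. It is not the Vercel AI SDK's implementation; the event names and the `chatReducer` function are assumptions made for illustration.

```typescript
interface ChatMessage {
  id: string;
  role: "user" | "assistant";
  content: string;
}

// Hypothetical events emitted while a response streams in.
type ChatEvent =
  | { type: "user_message"; id: string; content: string }
  | { type: "token"; id: string; delta: string } // one streamed fragment
  | { type: "done" };

interface ChatState {
  messages: ChatMessage[];
  isLoading: boolean;
}

function chatReducer(state: ChatState, event: ChatEvent): ChatState {
  switch (event.type) {
    case "user_message":
      return {
        messages: [
          ...state.messages,
          { id: event.id, role: "user", content: event.content },
        ],
        isLoading: true,
      };
    case "token": {
      const last = state.messages[state.messages.length - 1];
      // Append to the in-progress assistant message if it already exists...
      if (last && last.role === "assistant" && last.id === event.id) {
        const updated: ChatMessage = { ...last, content: last.content + event.delta };
        return { ...state, messages: [...state.messages.slice(0, -1), updated] };
      }
      // ...otherwise the first token creates it.
      return {
        ...state,
        messages: [
          ...state.messages,
          { id: event.id, role: "assistant", content: event.delta },
        ],
      };
    }
    case "done":
      return { ...state, isLoading: false };
  }
}

// Usage: replay a short stream of events.
const chatEvents: ChatEvent[] = [
  { type: "user_message", id: "u1", content: "Hi" },
  { type: "token", id: "a1", delta: "Hel" },
  { type: "token", id: "a1", delta: "lo" },
  { type: "done" },
];
const finalChatState = chatEvents.reduce(chatReducer, { messages: [], isLoading: false });
```

Keeping the update logic pure like this makes partial-stream edge cases (no tokens, an error mid-message) straightforward to unit test without a browser.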
Different use cases demand different architecture patterns. Three dominant patterns have emerged for AI-ready web applications:
RAG (Retrieval-Augmented Generation): Best for Q&A over your own data. User asks a question → retrieve relevant documents from vector DB → LLM generates answer grounded in those documents. Simple, reliable, and easy to debug.
Agentic workflows: Best for autonomous multi-step tasks. The LLM decides which tools to call, in what order, and when to return the final result. Requires careful guardrails, tool definitions, and error handling.
MCP (Model Context Protocol): Best for connecting your application to external LLM ecosystems. A standardised protocol for exposing tools and data sources to any MCP-compatible client. Think of it as a universal adapter layer.
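To make the retrieval-then-generate flow of the first pattern concrete, here is a minimal sketch of the step between the vector search and the LLM call: assembling a grounded prompt from retrieved documents. The `RetrievedDoc` shape, the `buildRagPrompt` name, and the instruction wording are all assumptions for illustration.

```typescript
interface RetrievedDoc {
  id: string;
  text: string;
  similarity: number; // e.g. cosine similarity from the vector search
}

// Build a prompt that restricts the model to the retrieved sources.
function buildRagPrompt(
  question: string,
  docs: RetrievedDoc[],
  maxDocs: number = 3
): string {
  // Keep only the most relevant documents to stay within the context budget.
  const top = docs
    .slice()
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, maxDocs);
  const context = top
    .map((d, i) => `[${i + 1}] (source: ${d.id}) ${d.text}`)
    .join("\n");
  return [
    "Answer the question using only the sources below.",
    "Cite sources by their bracketed number. If the sources do not contain the answer, say so.",
    "",
    "Sources:",
    context,
    "",
    `Question: ${question}`,
  ].join("\n");
}

// Usage with two toy documents.
const ragPrompt = buildRagPrompt("What is the return window?", [
  { id: "faq-12", text: "Shipping takes 3 to 5 days.", similarity: 0.71 },
  { id: "policy-3", text: "Returns are accepted within 30 days.", similarity: 0.92 },
]);
```

Because the prompt is an ordinary string, it can be logged alongside the answer, which is a large part of why this pattern is easy to debug.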
The three patterns are not mutually exclusive. A typical production AI application uses all three: RAG for knowledge retrieval, Agentic workflows for complex user requests, and MCP for exposing capabilities to external AI agents. The key is designing your architecture so each layer is independently deployable and testable.
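The agentic pattern is the hardest of the three to picture, so here is a deliberately small sketch of the control loop with two guardrails: a step limit and contained tool errors. The model is stubbed as a plain function so the loop can be tested without an LLM; `runAgent`, `ModelDecision`, and the `lookupPrice` tool are hypothetical names.

```typescript
type Tool = (args: Record<string, string>) => string;

type ModelDecision =
  | { kind: "call_tool"; tool: string; args: Record<string, string> }
  | { kind: "final"; answer: string };

// The "model" is any function that reads the transcript and picks the next step.
type Model = (transcript: string[]) => ModelDecision;

function runAgent(
  model: Model,
  tools: Record<string, Tool>,
  userRequest: string,
  maxSteps: number = 5
): string {
  const transcript: string[] = [`user: ${userRequest}`];
  for (let step = 0; step < maxSteps; step++) {
    const decision = model(transcript);
    if (decision.kind === "final") {
      return decision.answer;
    }
    // Guardrail: only tools registered explicitly can be called.
    if (!Object.prototype.hasOwnProperty.call(tools, decision.tool)) {
      transcript.push(`error: unknown tool ${decision.tool}`);
      continue;
    }
    try {
      const result = tools[decision.tool](decision.args);
      transcript.push(`tool ${decision.tool}: ${result}`);
    } catch (err) {
      // Guardrail: a failing tool is reported back instead of crashing the loop.
      transcript.push(`error: ${decision.tool} failed: ${String(err)}`);
    }
  }
  // Guardrail: never loop forever.
  return "Stopped: step limit reached without a final answer.";
}

// Usage with a scripted stand-in for the LLM.
const agentTools: Record<string, Tool> = {
  lookupPrice: (args) => (args.sku === "A1" ? "19.99" : "unknown"),
};
const scriptedModel: Model = (transcript) => {
  const prefix = "tool lookupPrice: ";
  const toolLine = transcript.filter((l) => l.startsWith(prefix))[0];
  return toolLine
    ? { kind: "final", answer: `Price: ${toolLine.slice(prefix.length)}` }
    : { kind: "call_tool", tool: "lookupPrice", args: { sku: "A1" } };
};
const agentAnswer = runAgent(scriptedModel, agentTools, "How much is A1?");
```

Swapping the scripted model for a real LLM call changes only the `Model` function, which is one way to keep the loop independently testable.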
For more on MCP and agent-ready architecture, see my guide on WebMCP: Making Websites Agent-Ready for AI Agents. For a broader look at how AI code assistants are changing development workflows, read AI Code Assistants and Modern Web Development.
Building AI-ready web applications from Belarus presents specific challenges around regulation, payment infrastructure, hosting, and LLM access. Here is what I have found works in practice:
| Concern | Challenge | Practical Solution |
|---|---|---|
| AI Regulation | Law No. 470-3 (effective July 2026) requires AI transparency, disclaimers for AI-generated content, and user notification when interacting with AI | Add visible AI labels, maintain audit logs of AI decisions, include disclaimers in generated content. The regulation is designed for consumer protection and is manageable with standard compliance practices |
| Payment Integration | International API payments are restricted. Standard credit card billing for OpenAI/Anthropic is unavailable from Belarus | Use bePaid for local payment processing on your site. For international LLM API access, route through OpenRouter (works with USDT or EU legal entities). Telegram Stars (Telegram's in-app currency) is a viable option for chatbot-based services |
| Hosting | Data residency requirements and potential latency to Western cloud providers | Use Hoster.by for Russian-language frontends and EU data. For global performance, Hetzner (Nuremberg/Falkenstein) offers excellent price-performance at €4-40/month. Cloudflare handles CDN and DDoS protection for any hosting backend |
| LLM Access | Direct API access to OpenAI, Anthropic, and Google AI is restricted or very expensive from Belarusian IPs | OpenRouter is the primary gateway — unified API for 200+ models, accepts crypto payments, works from Belarus. For local inference, Ollama running on Hetzner servers provides a self-hosted fallback using open-weight models like Llama 3, Mistral, and Qwen |
A mid-sized Belarusian e-commerce marketplace replaced their keyword search with pgvector-powered semantic search. The implementation took 3 weeks: embed all product descriptions using an open-source embedding model (BAAI/bge-base-en-v1.5), store vectors in pgvector, and serve via a Next.js API route. Result: 34% increase in search-to-purchase conversion and a 52% reduction in "no results" queries.
Stack: PostgreSQL + pgvector, Next.js, BAAI/bge-base-en-v1.5, Vercel AI SDK for the chat interface
A B2B SaaS company built a RAG-based customer support assistant over their 2,000-page documentation site. They chunked documents into 512-token segments, embedded them with OpenAI ada-002, and built a retrieval pipeline with Pinecone. The LLM (GPT-4o) generates answers with source citations. Support ticket volume dropped 47% in 3 months.
Stack: Pinecone, GPT-4o, LangChain, Next.js, SSE streaming
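The chunking step in a pipeline like this one can be sketched as follows. Two things here are assumptions rather than details from the case study: tokens are approximated by whitespace-separated words (a real pipeline would count tokens with the embedding model's tokenizer), and the overlap between neighbouring chunks is an added convention, not something the team is known to have used.

```typescript
// Split text into fixed-size chunks that share `overlap` words with their neighbour.
// "Tokens" are approximated by whitespace-separated words.
function chunkText(
  text: string,
  chunkSize: number = 512,
  overlap: number = 64
): string[] {
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be non-negative and smaller than chunkSize");
  }
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkSize).join(" "));
    // Stop once a chunk has reached the end of the text.
    if (start + chunkSize >= words.length) {
      break;
    }
  }
  return chunks;
}

// Usage: ten words, four per chunk, one word of overlap.
const demoChunks = chunkText("w0 w1 w2 w3 w4 w5 w6 w7 w8 w9", 4, 1);
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing slightly more vectors.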
A legal tech startup in Belarus built an agentic workflow that reviews uploaded contracts clause by clause. Each clause is analysed by a specialised prompt (jurisdiction, liability, termination), and the agent decides whether to flag issues, request clarification, or approve. The system runs on self-hosted Llama 3 via Ollama on Hetzner, with a Next.js frontend that streams results as each clause is processed.
Stack: Ollama (Llama 3 70B), Hetzner Cloud, Next.js, PostgreSQL + pgvector, custom agent framework
Based on the patterns above, here is a realistic timeline for building an AI-ready web application from scratch:
This timeline assumes a small team (1-2 developers) working full-time. Adding agentic workflows or MCP connectivity typically adds 2-3 weeks. RAG-only applications (without agentic features) can ship in 4 weeks.
useChat hook, handling both the streaming transport and the chat state management. At the framework level, you set up a POST endpoint that streams tokens back via ReadableStream. The key considerations are: proper backpressure handling, connection management (abort on unmount), error recovery for partial streams, and user experience patterns like typing indicators and streaming markdown rendering.
Whether you're building a RAG-powered knowledge base, an agentic workflow, or retrofitting an existing app for LLM integration, I can help architect and build it. Based in Minsk, working globally.