· 16 min read

11 Tricks to Make Your AI Application Fast

Practical techniques to reduce latency and improve perceived performance in AI applications — from model selection to UI patterns.

Jan Schulte

Founder, Betalyra

Tags: ai · performance · latency · llm · ux

tl;dr: Most AI latency comes from model choice, output token count, and tool round-trips. Fix those first — then optimise caching, thinking budgets, and context size. The gains compound.

Key Findings

  • Model selection + output limits have the highest impact-to-effort ratio
  • Stream everything — responses, tool results, structured data — users should never stare at a blank screen
  • Every tool call interrupts generation and risks a cache miss — design use-case tools, not CRUD wrappers
  • Prompt structure determines cache hit rate — static at top, dynamic at bottom, 80–95% cache hits achievable
  • Compact aggressively, fail fast — summarise history every turn, validate structured output mid-stream, push all post-processing to background queues

Over the past two years we’ve been building conversational AI interfaces for our customers — and recently introduced Cuttlekit, our vision of the future of generative UI. Making AI feel fast has been a top priority. Here’s what we’ve learned.


1. Choose a Fast Model

Model choice is still the biggest latency lever: start with the fastest model that reliably does the job, then route up only when quality requires it.

Frontier models often give the best results, but they’re usually also the slowest. Smaller models are much cheaper, much faster, and often good enough for formatting, extraction, classification, short answers, and many UI-driving tasks. Our current go-to is Gemini Flash — strong quality at a fraction of the latency. Here’s an overview of what’s out there.

Small models

Foundation model variants

Every major provider offers smaller, faster variants of their flagship models. These trade some reasoning depth for dramatically lower latency:

  • Gemini Flash / Flash Lite — excellent quality-to-speed ratio, one of the best models we’ve tested for high-performance applications
  • OpenAI Mini & Nano models — lightweight models optimised for speed and cost, strong at structured outputs
  • Anthropic Haiku — fastest model in the Claude family, well-suited for high-throughput pipelines

Fast OSS models

Open source models are often a strong fit for latency-sensitive applications. You can self-host them for minimal latency, use the APIs offered by the model creators themselves, or run them through specialised inference providers (see below). Especially the Chinese models frequently score high in benchmarks:

  • Deepseek — strong reasoning models, with distilled variants at small parameter counts
  • Qwen — Alibaba’s model family, competitive across code and multilingual tasks
  • Hermes by NousResearch — fine-tuned for instruction following and function calling
  • Minimax — fast multimodal models
  • GLM by Zhipu AI — multilingual models with strong Chinese language support
  • Mistral — European open-weight models, excellent for EU-hosted deployments

Performance-focused providers

If you don’t want to self-host, these providers specialise in low-latency open source model inference:

  • Groq — custom silicon (LPUs) purpose-built for inference; offers Llama, Qwen, and Kimi models at exceptional TPS
  • TogetherAI — broad OSS model catalogue with competitive throughput and serverless endpoints
  • Fireworks.ai — optimised inference platform with low-latency serving and fine-tuning support
  • OpenRouter — meta-router for comparing latency and cost across providers without changing code

Text diffusion models

A new class of models that generates the whole output in parallel through iterative denoising instead of token-by-token decoding, which can make them substantially faster than autoregressive transformers. Still early stage, but worth keeping an eye on.

Metric: Tokens per second (TPS) — how fast the model produces output. Compare across models and providers to find your speed/quality sweet spot.

Model Routing

Don’t commit to a single model. Route requests by task complexity: use a fast, cheap model for simple queries and reserve the frontier model for tasks that actually need deep reasoning. This cuts average latency significantly without sacrificing quality where it counts.

Example: “What's the weather?” classifies as SIMPLE and routes to the fast model (~50ms TTFT); complex requests route to the frontier model (~800ms TTFT).

There are several ways to implement routing, ranging from simple to sophisticated:

Simple: LLM-based classification

Use a small, fast model with a classification prompt to decide which model should handle the request:

System: Classify the user message as either SIMPLE or COMPLEX.
SIMPLE: greetings, factual lookups, short answers, formatting.
COMPLEX: multi-step reasoning, code generation, analysis, creative writing.
Respond with a single word: SIMPLE or COMPLEX.

This adds one fast LLM call before the actual request — but that call uses a tiny model with minimal output tokens, so the overhead is low.
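The routing logic can be sketched in a few lines. The model names are placeholders, and `classify` stands in for a real API call to a small, fast model with the classification prompt above:

```python
FAST_MODEL = "small-fast-model"        # placeholder names, not real model IDs
FRONTIER_MODEL = "frontier-model"

CLASSIFY_PROMPT = (
    "Classify the user message as either SIMPLE or COMPLEX.\n"
    "SIMPLE: greetings, factual lookups, short answers, formatting.\n"
    "COMPLEX: multi-step reasoning, code generation, analysis, creative writing.\n"
    "Respond with a single word: SIMPLE or COMPLEX."
)

def route(message: str, classify) -> str:
    """Pick a model by first asking a tiny classifier model."""
    label = classify(CLASSIFY_PROMPT, message).strip().upper()
    # Fail safe: anything unexpected goes to the frontier model.
    return FAST_MODEL if label == "SIMPLE" else FRONTIER_MODEL

# Usage with a stub classifier standing in for the LLM call:
stub = lambda system, msg: "SIMPLE" if len(msg.split()) < 8 else "COMPLEX"
assert route("What's the weather?", stub) == FAST_MODEL
```

Defaulting unknown labels to the frontier model means a misbehaving classifier degrades to slower-but-correct, never to wrong answers.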

Smarter: Logprobs-based classification

Some providers (notably OpenAI) expose logprobs — the log-probability the model assigns to each output token. Instead of generating a full response, request a single token with max_tokens: 1 and logprobs: true, then compare the probabilities of your classification tokens directly:

// Response with logprobs
{
  "token": "SIMPLE",
  "logprob": -0.12
}

Logprobs are always natural logarithms (base e) of probabilities. To convert back: e^(-0.12) ≈ 0.89, so the model is 89% confident the answer is SIMPLE. If that’s above your threshold, route to the fast model. Otherwise, use the frontier model.

This is faster and cheaper than generating a full classification response because you only produce a single token and make the decision from the probability distribution.
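A minimal sketch of the threshold decision, using the log-to-probability conversion described above (model names are again placeholders):

```python
import math

def route_by_logprob(token: str, logprob: float, threshold: float = 0.8) -> str:
    """Convert the natural-log probability back to a probability and route."""
    confidence = math.exp(logprob)  # e.g. e^(-0.12) ≈ 0.89
    if token == "SIMPLE" and confidence >= threshold:
        return "fast-model"
    # Low confidence or a COMPLEX label: play it safe with the frontier model.
    return "frontier-model"

assert route_by_logprob("SIMPLE", -0.12) == "fast-model"     # ~89% confident
assert route_by_logprob("SIMPLE", -1.5) == "frontier-model"  # only ~22% confident
```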

Advanced: Embedding-based classification

Use an encoder model to calculate the embedding of the input prompt and compare it using cosine similarity to pre-calculated embeddings of your classification categories. No generative LLM call needed at inference time — only an embedding pass plus a fast vector comparison.

  1. Pre-compute embeddings for representative prompts of each category
  2. At runtime, embed the incoming prompt
  3. Compare against your reference embeddings using cosine similarity
  4. Route based on the closest match

This is extremely fast because embedding models are much cheaper and faster than generative models, and the cosine similarity comparison itself is essentially free.

Providers like Jina AI already offer this as a turnkey solution — zero-shot and few-shot classification without training a model.

Most sophisticated: Fine-tuned classifier

For maximum accuracy, fine-tune a small classification model by stacking a dense layer on top of a pre-trained encoder model. You train on labelled examples of your actual traffic, so the classifier learns your specific routing patterns.

Small BERT-style encoder models are a common choice of base for this.

This gives you a sub-millisecond classifier that runs on CPU with near-perfect accuracy on your specific workload — but requires labelled training data and a training step upfront.


2. Minimise Output Tokens

Output tokens are the most computationally expensive part of inference. That’s why most providers charge 2–4x more for output tokens than input tokens. Every token the model doesn’t generate is time saved, money saved, and latency reduced.

Structured output: optimise your schema

For structured output (JSON, function calls), your output schema directly determines token count. Shorter attribute names and compact values mean fewer tokens. Use the Tokenizer Playground to check how your schema tokenises — the difference can be significant:

Token comparison (GPT-4 tokenizer):

Verbose schema (33 tokens):
{"task_title":"Review quarterly report","assigned_to":"marketing","due_date":"2026-04-15","is_completed":false,"priority_level":"high"}

Compact schema (28 tokens):
{"t":"Review quarterly report","to":"marketing","due":"2026-04-15","done":false,"p":"h"}

Saved: 5 tokens (−15%) per item — 5,000 tokens across 1,000 items.

15% fewer tokens per item — and that compounds fast when you’re streaming hundreds of results. You can always map the compact LLM output to your full domain schema in a post-processing step. The model doesn’t need human-readable keys — your code does. Every token not generated is a good token.
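The post-processing step is a simple key mapping. A sketch based on the schema above (the priority abbreviations are an assumption for illustration):

```python
# Map compact LLM output keys back to the full domain schema.
KEY_MAP = {"t": "task_title", "to": "assigned_to", "due": "due_date",
           "done": "is_completed", "p": "priority_level"}
PRIORITY = {"h": "high", "m": "medium", "l": "low"}  # assumed value mapping

def expand(item: dict) -> dict:
    """Translate compact keys/values into the human-readable domain schema."""
    out = {KEY_MAP[k]: v for k, v in item.items()}
    out["priority_level"] = PRIORITY[out["priority_level"]]
    return out

compact = {"t": "Review quarterly report", "to": "marketing",
           "due": "2026-04-15", "done": False, "p": "h"}
assert expand(compact)["priority_level"] == "high"
assert expand(compact)["task_title"] == "Review quarterly report"
```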

Text generation: instruct for brevity

For text responses, be explicit in your system prompt:

Answer in 1–2 sentences max. No preamble, no filler, no "Sure! Here's...".
Skip summaries and recaps. Lead with the answer.
If a list, use short bullet points, not paragraphs.

Metric: Number of output tokens (NOT) — total tokens generated per response. Track this across your prompts and optimise the worst offenders first.


3. Stream All the Things

Streaming JSONL renders each item the moment it's generated; blocking JSON shows “Generating full list...” until the response completes. Same total time, worse UX.

Streaming is industry standard — users expect it. It doesn’t reduce total generation time, but dramatically improves perceived speed because users start reading while the model generates. TTFT of 200–400ms feels instant; a 3-second blank wait feels broken.

For structured output, use JSONL — one item per line instead of a full JSON array. Todo items, table rows, search results — each renders as soon as it’s generated, no waiting for the closing bracket.
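Because stream chunks arrive with arbitrary boundaries, the client buffers until a newline completes an item. A minimal sketch of that incremental JSONL parser:

```python
import json

def iter_jsonl(chunks):
    """Yield one parsed item per completed line, as stream chunks arrive."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)  # render this item immediately
    if buffer.strip():
        yield json.loads(buffer)        # final line may lack a trailing newline

# Chunk boundaries fall mid-item, but items still come out one per line:
chunks = ['{"id": 1}\n{"id"', ': 2}\n', '{"id": 3}']
assert [item["id"] for item in iter_jsonl(chunks)] == [1, 2, 3]
```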

Stream tool call outputs too. When your agent invokes another LLM — a summariser, a sub-agent — that inner call should stream back to the user. Otherwise: fast initial stream, dead pause during tool execution, then more text. Stream the entire chain, not just the outer model.

Metric: Time to first token (TTFT) — how long before the user sees the first output. Streaming doesn’t change total time, but drops TTFT to near-zero.


4. Keep Your Tool Surface Small

Every tool call interrupts the generation — the AI SDK handles the call, executes the tool, and feeds the result back into the next LLM call. Each round-trip adds latency, and if you hit a prompt cache miss (different machine, expired cache), the full context gets recomputed. Minimise the number of tool calls.

Design tools for use cases, not CRUD

Don’t expose your REST API as-is to the model. Design tools around what the AI needs to accomplish, not your data structure:

Tool Design: CRUD vs Use-Case
3 CRUD-style calls — 3 round-trips700ms + 3× cache recompute
getUser(id: 42)200ms
getOrders(userId: 42)350ms
getPreferences(userId: 42)150ms
1 use-case call — 1 round-trip180ms
getUserDashboard(id: 42)180ms
Returns only what the AI needs — pre-formatted, paginated
Overhead saved2 fewer round-trips + 520ms less tool execution

Tool output rules:

  • Return only the data the AI needs for the current task — never raw API dumps
  • Paginate results — don’t return 500 items when the AI needs 5
  • Pre-format data in the tool — money amounts, dates, translations — so the AI doesn’t waste tokens converting them

MCP: watch your token budget

MCP (Model Context Protocol) can be a massive token sink. The cost adds up from multiple angles:

  • Schema size — every tool definition is injected into the prompt
  • Tool count — a single MCP connection loads all its tools, not just the ones you need
  • Serialisation overhead — MCP adds protocol-level wrapping around every call and result
  • Result payload size — MCP tools often return more data than the model needs

Multiple MCP providers can easily add tens of thousands of tokens to every request. For production, prefer direct tool integrations where you control exactly which tools are exposed and how much data they return. Use MCP for development and prototyping.

Code execution in sandboxes

The industry is shifting from tool-heavy architectures towards letting the LLM write and execute code in a sandboxed environment. Instead of 10 specialised tools, give the model a code sandbox and a few SDKs — it writes the integration itself. Anthropic pioneered this approach and demonstrated up to 98.7% token reduction compared to equivalent MCP tool chains.

SaaS sandbox providers:

  • E2B — Firecracker microVMs with ~150ms boot, Python/JS SDKs
  • Daytona — Docker-based sandboxes with ~90ms startup
  • Vercel Sandbox — ephemeral microVMs integrated with the Vercel AI SDK
  • Cloudflare Workers — V8 isolate-based sandboxing, extremely fast cold starts

Self-hosted / OSS:

  • Boxlite — local-first micro-VM sandbox, no daemon required
  • Microsandbox — self-hosted microVM sandbox built in Rust, sub-200ms startup

Stick to Python or TypeScript — LLMs know them well, and they don’t require compile steps that would slow down the interaction loop.

Metric: Tool call overhead — number of round-trips and total tokens consumed by tool definitions and results per request. Fewer tools, leaner results, faster responses.


5. Optimise for Prompt Caching

Prompt caching is one of the highest-leverage latency optimisations. Providers cache the KV state (the internal calculations) of your prompt prefix. On the next request, if the prefix matches, those calculations are skipped entirely. Cached tokens are processed at a fraction of the cost and latency of fresh tokens.

Example prefix, most static first: system prompt (“You are a financial analyst. Answer concisely...”), documents (“Q3 Report: Revenue €4.2M, EBITDA €1.1M...”), conversation history, then the new user message (“Break down by region”). With a warm cache, everything up to the user message is served from KV cache and only the final tokens are computed fresh; without it, all tokens are computed from scratch.

Provider support

Not all providers handle caching the same way:

  • OpenAI — automatic, no opt-in needed. Minimum 1,024-token prefix.
  • Anthropic — opt-in via cache_control breakpoints in the API request. Minimum 1,024 tokens (Opus/Sonnet), 2,048 (Haiku).
  • Google Gemini — opt-in via explicit “cached content” resources. Higher threshold: 32,768 tokens minimum.
  • Groq — automatic, applies to repeated prefixes. Good intro docs.

How to maximise cache hit rate

Structure your prompt from most static to most dynamic. The cache matches from the top — the moment something changes, everything after it is a cache miss.

  1. System prompt (static) — instructions, output format, tool definitions. This never changes between requests and should always be at the top.
  2. Documents (semi-dynamic) — knowledge base, reference data, retrieved context. Changes occasionally.
  3. History (semi-dynamic) — conversation turns. Grows but the prefix stays stable.
  4. User message (dynamic) — the current request. Always at the bottom.

Avoid putting dynamic content into your system prompt. A common mistake: injecting a timestamp like "Current time: 2026-04-01T19:45:00Z" into the system message. This invalidates the entire cache on every request because the prefix changes. If you need the current time, put it in the user message at the end.
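A sketch of cache-friendly message assembly: the static system prompt stays byte-identical at the top, and the only dynamic value (the current time) is appended to the user message at the bottom. The system prompt text is illustrative:

```python
from datetime import datetime, timezone

SYSTEM_PROMPT = "You are a financial analyst. Answer concisely."  # static, cacheable

def build_messages(history, user_message):
    """Static prefix first; anything dynamic goes at the very end."""
    now = datetime.now(timezone.utc).isoformat(timespec="seconds")
    return [
        {"role": "system", "content": SYSTEM_PROMPT},  # never changes → cache hit
        *history,                                      # stable, growing prefix
        {"role": "user", "content": f"{user_message}\n(Current time: {now})"},
    ]

m1 = build_messages([], "Show Q3 results")
m2 = build_messages([], "Break down by region")
# The cacheable prefix is identical across requests:
assert m1[0] == m2[0]
```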

Preheat the cache — send a priming request at session start (e.g. with an empty user message) so the first real user interaction already hits a warm cache.

Metric: Cache hit rate — percentage of input tokens served from cache. Monitor this in your provider dashboard. A well-structured prompt can achieve 80–95% cache hit rates in conversational applications.


6. Disable (or Reduce) Thinking

Extended thinking can greatly increase output quality — but at the cost of latency. Many major foundation models now enable thinking by default — if you haven’t configured it, you’re paying this latency tax on every request.

Example: at 100 tokens/s, a 2,048-token thinking budget adds ~20.5 seconds of latency before the first output token.

Disable thinking for straightforward tasks, or set a tight token budget for tasks that benefit from some reasoning. Note that some models (e.g. Gemini 2.5 Pro) don’t allow disabling thinking completely — you can only reduce the budget. Test where quality drops below acceptable for your use case. You can also combine this with the model routing technique from section 1 — use a classifier to dynamically adjust the thinking budget per request, routing simple queries with thinking disabled and complex ones with a higher budget.

Provider configuration

  • OpenAI — reasoning_effort parameter: low, medium, high. Set to low or disable for simple tasks. Also supports max_completion_tokens to cap total output including reasoning.
  • Anthropic — thinking parameter with configurable budget_tokens. Set type: "disabled" to turn it off, or set a low budget (e.g. 1024 tokens) for light reasoning.
  • Google Gemini — thinking_config with thinking_budget. Set to 0 to disable, or cap the token count.
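Combined with the routing classifier from section 1, the per-request budget decision might look like this. The parameter shapes follow the provider options listed above, but verify the exact field names against the current API docs before relying on them:

```python
def thinking_config(provider: str, complexity: str) -> dict:
    """Pick a thinking budget based on the routing classifier's label.
    Field names follow provider docs; treat them as assumptions to verify."""
    budget = 0 if complexity == "SIMPLE" else 2048
    if provider == "openai":
        return {"reasoning_effort": "low" if budget == 0 else "medium"}
    if provider == "anthropic":
        if budget == 0:
            return {"thinking": {"type": "disabled"}}
        return {"thinking": {"type": "enabled", "budget_tokens": budget}}
    if provider == "gemini":
        return {"thinking_config": {"thinking_budget": budget}}
    raise ValueError(f"unknown provider: {provider}")

assert thinking_config("anthropic", "SIMPLE") == {"thinking": {"type": "disabled"}}
assert thinking_config("gemini", "COMPLEX")["thinking_config"]["thinking_budget"] == 2048
```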

Metric: Number of thinking tokens (NTT) — tokens spent on reasoning before the first output token. Monitor this — it’s often the single biggest contributor to TTFT in modern models.


7. Aggressive Memory Compaction

You don’t need to store the exact conversation history — a compact summary is often sufficient. LLMs tend to generate long responses, and those responses become part of the context for every subsequent request. A 6-turn conversation can easily reach 1,000+ tokens of history, most of which is redundant.

Example: three short user messages (38–52 tokens each) and three long AI responses (280–410 tokens each) already add up to 1,145 tokens of context.

Compact the conversation after every interaction — summarise the exchange and replace the full history. Run compaction asynchronously after the response so it doesn’t block the user, using a fast, cheap model. Tool call results are often the worst offenders — truncate those aggressively.

What must survive compaction: user goals, active constraints, referenced entities, unresolved tasks, and preferences. Lossy summarisation that drops these will silently degrade quality. Your compaction prompt should explicitly preserve them.
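A sketch of the compaction step, with a prompt that explicitly lists what must survive. The `summarise` callable stands in for a call to a fast, cheap model, and the prompt wording is an illustrative assumption:

```python
COMPACTION_PROMPT = (
    "Summarise the conversation below in under 150 tokens. "
    "Preserve: user goals, active constraints, referenced entities, "
    "unresolved tasks, and stated preferences. Drop pleasantries and "
    "redundant detail. Truncate tool results to one line each."
)

def compact(history, summarise):
    """Replace the full history with a single summary turn."""
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    summary = summarise(COMPACTION_PROMPT, transcript)
    return [{"role": "system", "content": f"Conversation so far: {summary}"}]

history = [
    {"role": "user", "content": "What were Q3 results?"},
    {"role": "assistant", "content": "Revenue grew 12% YoY driven by..."},
]
stub = lambda prompt, text: "User asked for Q3 results; revenue grew 12% YoY."
compacted = compact(history, stub)
assert len(compacted) == 1 and "12%" in compacted[0]["content"]
```

In production this runs asynchronously after the response is sent, so the user never waits on it.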

Some providers support this out of the box — see Anthropic’s compaction API and their session memory cookbook.

This may reduce prompt cache hit rates, but processing fewer tokens usually outweighs the cache miss — especially when changing conversation history would cause a miss anyway.

Metric: Average context length (ACL) — average token count of your full prompt across requests. If it grows over time, compact more aggressively.


8. Be a Caveman

Fewer input tokens = less processing = faster response. This applies to your system prompt (where most bloat lives) and the user-facing parts of the prompt. Remove filler words, flowery language, excessive examples, and duplicated instructions. LLMs are surprisingly good at understanding compressed prompts.

Original (46 tokens): “You are a helpful assistant that specializes in analyzing financial data. When the user asks you a question, you should provide a clear, concise, and well-structured response. Always make sure to include relevant numbers and percentages where applicable.”

A caveman version along the lines of “Financial data analyst. Answer concise, structured. Include relevant numbers and percentages.” lands around 16 tokens — roughly a 65% saving.

Caveman compression is a practical framework for this. The core idea: strip articles, connectives, and redundant qualifiers — keep only the words that carry meaning. “You are a helpful assistant that should always provide” becomes “Provide”. Ugly, effective.

Don’t do this blindly: aggressive compression works best for repetitive instruction scaffolding, but be careful with safety rules, edge-case handling, and anything where wording precision matters.

Verify with embeddings

How do you know your compressed prompt still conveys the same information? Use an embedding model to compare semantic similarity between the original and compressed versions. Embed both prompts (or individual sentences/paragraphs) and compute cosine similarity — if it’s above 0.95, you haven’t lost meaning. This lets you compress aggressively with confidence.
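The check itself is a one-liner once you have the two embeddings. A sketch with toy vectors standing in for real embedding-model output:

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / ((sum(x * x for x in a) ** 0.5) * (sum(x * x for x in b) ** 0.5))

def compression_is_safe(original_emb, compressed_emb, threshold=0.95) -> bool:
    """Accept the compressed prompt only if it stays semantically close."""
    return cosine(original_emb, compressed_emb) >= threshold

# Toy vectors: a near-duplicate passes, an unrelated one fails.
assert compression_is_safe([1.0, 0.2, 0.1], [0.98, 0.22, 0.12]) is True
assert compression_is_safe([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]) is False
```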

Metric: Number of input tokens (NIT) — total input tokens per request. Audit your system prompt regularly — it tends to grow over time as instructions accumulate.


9. Push Expensive Work to Background Queues

Keep the main interaction loop short. Only the response itself should block the user. Long-running tool calls and any post-processing should be pushed out of the main loop into a background queue.

Main loop: generate the response — the user sees it immediately. Background queue: compact memory, update embeddings, log analytics — all queued and run after the response is sent.

If a tool performs a side-effectful operation (DB write, API call), make it async — acknowledge immediately, execute in the background. Post-processing like memory compaction, embedding updates, analytics, and logging all go to the queue. Fire independent tasks in parallel, not sequentially.
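A minimal worker-queue sketch of this pattern using the standard library (the task names mirror the examples above; a production system would use a real job queue with retries and persistence):

```python
import queue
import threading

tasks: "queue.Queue" = queue.Queue()
done = []

def worker():
    """Drain the queue in the background, one job at a time."""
    while True:
        job = tasks.get()
        if job is None:
            break
        job()  # e.g. compact memory, update embeddings, log analytics
        tasks.task_done()

threading.Thread(target=worker, daemon=True).start()

def respond(user_message: str) -> str:
    reply = f"Answer to: {user_message}"       # only this blocks the user
    tasks.put(lambda: done.append("compact"))  # everything else is queued
    tasks.put(lambda: done.append("log"))
    return reply

assert respond("What were Q3 results?").startswith("Answer")
tasks.join()  # waiting here is for the test only; production never blocks on it
assert done == ["compact", "log"]
```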


10. Fail Fast and Retry

When generating structured output to drive a UI — conversational, agentic, or generative — the LLM can produce invalid data that breaks things. Don’t wait until the full response is generated to validate. Validate every structured chunk as it arrives in the stream. The moment something invalid comes through, stop the stream immediately and retry. There’s no point generating hundreds more output tokens from a broken response.

Each chunk of the first attempt is validated as it arrives; the moment one fails, the stream is aborted and the retry starts — no waiting for the rest of the broken response.

In practice: parse each JSONL line or structured chunk as it arrives and validate against your schema. The moment a chunk fails, abort the stream, append the error to the prompt (so the model learns from the mistake), and retry. This is faster than generating a full broken response and regenerating from scratch.
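A sketch of the validate-while-streaming loop. The required keys borrow the compact schema from section 2 as an illustrative assumption; a real implementation would validate against your full schema (e.g. with a library like pydantic):

```python
import json

REQUIRED_KEYS = {"t", "to", "due"}  # assumed schema keys, per the earlier example

class InvalidChunk(Exception):
    """Raised to abort the stream on the first bad chunk."""

def consume(stream):
    """Validate each JSONL line as it arrives; fail fast on the first bad one."""
    items = []
    for line in stream:
        try:
            item = json.loads(line)  # JSONDecodeError subclasses ValueError
            if not REQUIRED_KEYS <= item.keys():
                raise ValueError(f"missing keys: {REQUIRED_KEYS - item.keys()}")
        except ValueError as err:
            # Abort immediately; the caller appends `err` to the prompt and retries.
            raise InvalidChunk(str(err)) from err
        items.append(item)
    return items

good = ['{"t": "a", "to": "ops", "due": "2026-01-01"}']
assert len(consume(good)) == 1
try:
    consume(['{"t": "a"}'])  # missing keys → abort before any more tokens
    raise AssertionError("should have failed fast")
except InvalidChunk:
    pass
```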


11. Buffer Rapid Requests

This is most relevant for IDE copilots, real-time editors, and rapidly changing UIs where users fire multiple inputs in quick succession. Less applicable to standard chat or transactional flows.

Buffer incoming requests in a queue and batch them into a single LLM call rather than processing each sequentially.
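The batching step can be sketched as draining whatever has queued up into one combined prompt (the joining strategy is an illustrative assumption; real systems might deduplicate or summarise the batch instead):

```python
import queue

def drain(q: "queue.Queue") -> list:
    """Collect everything queued so far into one batch, without blocking."""
    batch = []
    while True:
        try:
            batch.append(q.get_nowait())
        except queue.Empty:
            return batch

def next_prompt(q: "queue.Queue"):
    """Merge rapid-fire inputs into a single LLM call instead of one each."""
    batch = drain(q)
    if not batch:
        return None  # nothing pending; the LLM stays idle
    return "\n".join(batch)

q = queue.Queue()
for msg in ["fix the typo", "also rename the variable", "and add a test"]:
    q.put(msg)
assert next_prompt(q) == "fix the typo\nalso rename the variable\nand add a test"
assert next_prompt(q) is None  # queue drained
```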


If a new request supersedes the current one, consider cancelling the in-flight generation — it may not be worth finishing an outdated response. Advanced: use tool call interruptions as checkpoints — each tool call pauses generation, giving you a natural moment to check the queue for new messages. This is the pattern Cursor uses.


All eleven of these are applicable today, and most can be adopted incrementally. Don’t apply them all at once: identify the slowest part of your interaction loop and fix that first.

If you’re building AI interfaces and want to talk through any of these tradeoffs — get in touch.

Want to learn more about how AI can transform your business?

Book a call