Latency

Quick Answer: The time delay between sending a request to an AI model and receiving the response.
In LLM applications, latency covers both time-to-first-token (TTFT) and total generation time. Lower latency means a faster, more responsive user experience.

Example

A chatbot with a 200 ms TTFT feels instant; one with a 3-second TTFT feels sluggish. Latency depends on model size, prompt length, server load, and geographic distance. Streaming responses (showing tokens as they generate) reduce perceived latency.

Why It Matters

Latency directly impacts user satisfaction and adoption. Studies show users abandon AI features when response time exceeds 5 seconds. Prompt engineers must balance output quality against speed by choosing appropriate models and prompt lengths.

How It Works

In AI systems, latency measures the time from sending a request to receiving the first (or complete) response. For language models, there are two key metrics: time-to-first-token (TTFT, how long until the first word appears) and end-to-end latency (total time to generate the complete response).
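Both metrics can be captured with two timestamps around a token stream. A minimal sketch, using a simulated token generator in place of a real model API (the delays and function names here are illustrative assumptions, not a specific provider's interface):

```python
import time

def fake_stream(n_tokens=20, first_delay=0.05, per_token=0.01):
    """Stand-in for a streaming model API: a prefill delay, then steady decoding."""
    time.sleep(first_delay)
    for i in range(n_tokens):
        if i:
            time.sleep(per_token)
        yield f"tok{i} "

def measure_latency(stream):
    """Return (time-to-first-token, end-to-end latency) in seconds."""
    start = time.monotonic()
    ttft = None
    for _ in stream:
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
    total = time.monotonic() - start         # full response finished
    return ttft, total

ttft, total = measure_latency(fake_stream())
print(f"TTFT: {ttft * 1000:.0f} ms, end-to-end: {total * 1000:.0f} ms")
```

The same wrapper works against any real streaming client: TTFT is the gap before the first chunk, end-to-end is the gap before the stream closes.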

Latency depends on multiple factors: model size (larger models are slower), input length (longer prompts take longer to process), output length (more tokens to generate means more time), GPU hardware (A100 vs H100 vs inference-optimized chips), and serving infrastructure (batch size, queue depth, geographic distance).
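These factors combine into a useful back-of-envelope estimate: end-to-end latency is roughly TTFT (dominated by prompt processing) plus output tokens divided by decode speed. The numbers below are illustrative assumptions, not benchmarks of any particular model:

```python
def estimate_latency(ttft_s, output_tokens, tokens_per_sec):
    """Rough end-to-end latency: prefill time (TTFT) plus steady-state decode time."""
    return ttft_s + output_tokens / tokens_per_sec

# Hypothetical figures: 300 ms TTFT, 500 output tokens, 50 tokens/s decode rate.
print(f"{estimate_latency(0.3, 500, 50):.1f} s")  # 10.3 s
```

The estimate makes the trade-off concrete: halving output length cuts generation time roughly in half, while trimming the prompt mainly shrinks TTFT.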

For user-facing applications, latency directly impacts user experience. Research suggests users notice delays once TTFT exceeds roughly 200 ms, and they expect streaming output to keep pace with reading speed (about 15-20 tokens per second). Batch processing applications care less about latency and more about throughput.

Common Mistakes

Common mistake: Optimizing for average latency instead of p95/p99 latency

Average latency hides outliers. One request taking 30 seconds while 99 take 200ms still means 1% of users have a terrible experience. Track and optimize percentile latencies.
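The effect is easy to demonstrate with the stdlib. Using a hypothetical sample of 99 fast requests and one 30-second outlier, the average looks tolerable while the p99 reveals the problem:

```python
import statistics

# Hypothetical per-request latencies in ms: 99 requests at 200 ms, one 30 s outlier.
latencies = [200] * 99 + [30_000]

avg = statistics.mean(latencies)
cuts = statistics.quantiles(latencies, n=100)  # 99 percentile cut points
p95, p99 = cuts[94], cuts[98]
print(f"avg={avg:.0f} ms  p95={p95:.0f} ms  p99={p99:.0f} ms")
```

Here the average (~500 ms) suggests a mildly slow service, while the p99 (close to 30 s) shows that one in a hundred users is waiting half a minute.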

Common mistake: Not using streaming for user-facing applications

Streaming responses dramatically improves perceived latency. Users start reading immediately instead of waiting for the full response. Most model APIs support streaming with minimal additional complexity.
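The difference between the two modes is just where the loop sits. A minimal sketch with a simulated stream standing in for a real streaming API (function names and delays are illustrative assumptions):

```python
import time

def generate_stream(tokens, per_token_s=0.02):
    """Stand-in for a streaming model API: yields tokens as they are 'generated'."""
    for tok in tokens:
        time.sleep(per_token_s)
        yield tok

def respond_streaming(tokens):
    """Show each token as it arrives: perceived wait is one token's latency."""
    out = []
    for tok in generate_stream(tokens):
        print(tok, end="", flush=True)  # user sees output immediately
        out.append(tok)
    print()
    return "".join(out)

def respond_blocking(tokens):
    """Wait for the full response before showing anything: perceived wait is the total."""
    return "".join(generate_stream(tokens))

text = respond_streaming(["Hello", ", ", "world", "!"])
```

Both versions take the same wall-clock time to finish; streaming only changes when the user first sees output, which is what perceived latency measures.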

Career Relevance

Latency optimization is a core skill for MLOps engineers and backend developers working with AI systems. Understanding latency trade-offs helps product teams make informed decisions about model selection, architecture, and user experience design.
