Latency
Why It Matters
Latency directly impacts user satisfaction and adoption. User research suggests people tend to abandon AI features when responses take longer than about 5 seconds. Prompt engineers must balance output quality against speed by choosing appropriate models and prompt lengths.
How It Works
In AI systems, latency measures the time from sending a request to receiving the first (or complete) response. For language models, there are two key metrics: time-to-first-token (TTFT, how long until the first word appears) and end-to-end latency (total time to generate the complete response).
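A minimal sketch of how these two metrics are measured against a streaming response. The `stream_tokens` generator here is a hypothetical stand-in for a real streaming model API; the sleep durations are illustrative, not benchmarks.

```python
import time

def stream_tokens():
    """Hypothetical streaming model API: a prefill delay, then per-token decode."""
    time.sleep(0.05)            # simulated prefill before the first token
    for token in ["Hello", ",", " world", "!"]:
        yield token
        time.sleep(0.01)        # simulated per-token decode time

def measure_latency(token_iter):
    """Return (time_to_first_token, end_to_end_latency) in seconds."""
    start = time.perf_counter()
    ttft = None
    for _ in token_iter:
        if ttft is None:
            ttft = time.perf_counter() - start   # first token arrived
    total = time.perf_counter() - start          # full response generated
    return ttft, total

ttft, total = measure_latency(stream_tokens())
print(f"TTFT: {ttft:.3f}s  end-to-end: {total:.3f}s")
```

The same wrapper works for any token iterator, which is why TTFT and end-to-end latency are usually instrumented at the client rather than reported by the server.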
Latency depends on multiple factors: model size (larger models are slower), input length (longer prompts take longer to process), output length (more tokens to generate means more time), GPU hardware (A100 vs H100 vs inference-optimized chips), and serving infrastructure (batch size, queue depth, geographic distance).
For user-facing applications, latency directly impacts user experience. Research on interface responsiveness suggests users begin to notice delays once TTFT exceeds roughly 200ms, and they expect streamed output to keep pace with reading speed (about 15-20 tokens per second). Batch processing applications care less about latency and more about throughput.
Common Mistakes
Common mistake: Optimizing for average latency instead of p95/p99 latency
Average latency hides outliers. One request taking 30 seconds while 99 take 200ms still means 1% of users have a terrible experience. Track and optimize percentile latencies.
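A toy illustration of how averages hide the tail, using a simulated sample where 5% of requests hit a slow path. The nearest-rank percentile function below is a simple sketch; production systems typically get these numbers from their metrics backend.

```python
import math

# Simulated request latencies (ms): 95 fast requests, 5 that hit a slow path.
latencies_ms = [200] * 95 + [5000] * 5

def percentile(values, pct):
    """Nearest-rank percentile: the value below which pct% of samples fall."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

avg = sum(latencies_ms) / len(latencies_ms)
print(f"average: {avg:.0f} ms")                    # 440 ms: looks acceptable
print(f"p50: {percentile(latencies_ms, 50)} ms")   # 200 ms
print(f"p99: {percentile(latencies_ms, 99)} ms")   # 5000 ms: the tail users actually hit
```

The average (440 ms) sits comfortably under any alert threshold while one user in twenty waits 5 seconds, which is exactly why SLOs are written against p95/p99 rather than the mean.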
Common mistake: Not using streaming for user-facing applications
Streaming responses dramatically improves perceived latency. Users start reading immediately instead of waiting for the full response. Most model APIs support streaming with minimal additional complexity.
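A sketch of the perceived-latency difference, again using a hypothetical token generator in place of a real model API. The decode speed is an illustrative assumption; the point is only that first paint arrives after one token, not after all of them.

```python
import time

def generate_tokens(n=30, per_token=0.02):
    """Hypothetical model that decodes one token every 20 ms."""
    for i in range(n):
        time.sleep(per_token)
        yield f"tok{i} "

# Non-streaming: the user sees nothing until the full response is assembled.
start = time.perf_counter()
full_text = "".join(generate_tokens())
blocking_wait = time.perf_counter() - start        # ~0.6 s before anything renders

# Streaming: the user starts reading as soon as the first token lands.
start = time.perf_counter()
first_paint = None
for chunk in generate_tokens():
    if first_paint is None:
        first_paint = time.perf_counter() - start  # roughly one token's worth of delay
print(f"blocking: {blocking_wait:.2f}s  streaming first paint: {first_paint:.3f}s")
```

End-to-end latency is identical in both cases; only the perceived latency changes, which is why streaming is usually the cheapest optimization available for chat-style interfaces.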
Career Relevance
Latency optimization is a core skill for MLOps engineers and backend developers working with AI systems. Understanding latency trade-offs helps product teams make informed decisions about model selection, architecture, and user experience design.