5 Strategies for Improving Latency in AI Applications

 

Your team was so excited to implement a new AI product feature. Finally, a chance for a real-world application of OpenAI, Anthropic, and other AI providers. Your engineers dive right in and write up a prompt. Maybe they implement RAG. Then you all stare in awe at answers that are right most of the time… But something still isn’t right.

“No one is going to wait this long for a response,” you think to yourself.

I’ve spent over a decade building AI systems and developing teams at companies like Google and LinkedIn as well as startups, and I now apply that expertise for several clients. In virtually every AI project I consult on, latency emerges as a persistent concern that needs addressing, even if it’s not always the first priority.

Here’s the playbook on latency my clients wish they had from day one.

Understanding What to Measure

Before diving into optimizations, you need to establish what metrics matter for your specific application.

Key Metrics to Track

  • Time to First Token (TTFT): How quickly your application starts responding. Critical for streaming interfaces.
  • Output Tokens Per Second (OTPS): How rapidly tokens are generated once started. Should match or exceed human reading speed (roughly 15-20 tokens/second).
  • Time to Complete Response (TTCR): Total time from request to complete response. Crucial for API integrations and post-processing workflows.
  • Time to First Record: For structured outputs, how quickly the first complete unit of useful information appears. Often overlooked but critical for perceived performance.

One client building a recommendation system was originally overly focused on time to complete response; but because they were fundamentally generating a list of recommendations, time to first record made far more sense as the primary metric.

Before beginning any efforts to reduce latency, ensure you’re measuring what matters most for your specific use case and have proper instrumentation in place to track improvements.
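
If you don’t already have this instrumentation, a minimal sketch is below. It assumes a generic streaming call that yields text chunks (most provider SDKs expose one) and records TTFT, TTCR, and an approximate OTPS; chunk counts only approximate token counts, so use your provider’s usage data if you need exact numbers.

import time
from typing import Iterable


def measure_stream(token_stream: Iterable[str]) -> dict:
    """Consume a stream of text chunks and record basic latency metrics."""
    start = time.time()
    first_token_time = None
    chunks = []

    for chunk in token_stream:
        if first_token_time is None:
            first_token_time = time.time()  # marks Time to First Token (TTFT)
        chunks.append(chunk)

    end = time.time()
    ttft = first_token_time - start if first_token_time is not None else None
    ttcr = end - start  # Time to Complete Response (TTCR)
    generation_time = end - first_token_time if first_token_time is not None else None
    # Approximate OTPS: chunks generated after the first one, per second of generation
    otps = (len(chunks) - 1) / generation_time if generation_time else None

    return {"ttft_s": ttft, "ttcr_s": ttcr, "approx_otps": otps, "text": "".join(chunks)}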

Strategy 1: Determine if You Actually Need to Lower Latency

The specialization of roles in tech has led many organizations to lose creativity here: you see slowness, and your AI engineers instantly jump into the internals of the AI system. But maybe you just need to be more thoughtful about the UX.

Before optimizing, ask: Is latency actually the problem, or is it perceived performance?

Latency Expectations Vary by Context

Different interactions have different latency tolerances and expectations. In one case, we reduced latency in a recommendation system by 70% and user engagement dropped significantly. After user interviews, we found that the system was too fast for them to believe the results were of high quality! Don’t optimize what you don’t need to.

Improve Perceived Performance

In the AI age, users are more willing to wait, provided you give them enough value and visibility into what’s happening. Here are effective loading patterns that significantly improve perceived performance:

  1. Animated typing indicators: Simple animations that show the AI is working
  2. Progress step visualization: Show which stage of processing the AI is in
  3. Content skeleton loaders: Preview the structure of the coming content
  4. Thinking visualizations: Creative animations that make waiting engaging

The key is to match the loading experience to your brand and use case. For analytical applications, showing the system “working through” steps builds confidence. For creative applications, showing an animation of ideas forming maintains engagement.
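
Most of these patterns live in the frontend, but the backend has to emit the right signals for them. As a rough, framework-agnostic sketch, a pipeline can yield named progress events before the final answer so the UI can render a step indicator; retrieve_documents and generate_answer below are hypothetical stand-ins for your own retrieval and LLM calls.

import asyncio
from typing import AsyncIterator


# Hypothetical stand-ins for your real retrieval and LLM calls
async def retrieve_documents(query: str) -> list[str]:
    await asyncio.sleep(0.5)
    return ["doc1", "doc2"]


async def generate_answer(query: str, docs: list[str]) -> str:
    await asyncio.sleep(1.0)
    return f"Answer to: {query}"


async def answer_with_progress(query: str) -> AsyncIterator[dict]:
    """Yield progress events a UI can render as steps, then the final answer."""
    yield {"type": "progress", "step": "Searching relevant documents"}
    docs = await retrieve_documents(query)

    yield {"type": "progress", "step": "Drafting an answer"}
    answer = await generate_answer(query, docs)

    yield {"type": "answer", "content": answer}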

Strategy 2: Lower Time to First Value with Streaming

The fastest way to improve perceived performance is through UX-level changes like streaming, which can make your application feel responsive even before you reduce any actual latency.

Implement Streaming

Streaming displays partial responses as they are generated. Most people think of this in terms of streaming text onto the screen, but you can also add controls over what shows up when. One of my favorite techniques, when a list of things is being generated (like recommendations), is to stream each record as it completes. For example, we can use mirascope to enable easy streaming of records:

import time

from mirascope import llm
from pydantic import BaseModel


class Book(BaseModel):
    title: str
    author: str

# Non-streaming
@llm.call(provider="openai", model="gpt-4o-mini", response_model=list[Book])
def recommend_books(genre: str) -> str:
    return f"Recommend many {genre} books"

print(recommend_books('mystery'))
# 2.22 seconds to recommend 10 books

# streaming book by book
@llm.call(provider="openai", model="gpt-4o-mini", response_model=list[Book], stream=True)
def recommend_books_stream(genre: str) -> str:
    return f"Recommend many {genre} books"


def stream_per_book(resp_stream):
    last_record_sent = -1
    first_record_time = None
    start_time = time.time()
    
    for partial_response in resp_stream:
        # Once the partial response has two records beyond the last one we sent,
        # the record right after the last sent one is complete (the following record has started)
        if len(partial_response or []) > last_record_sent + 2:
            if first_record_time is None:
                first_record_time = time.time()
                print(f"Time to first recommendation: {first_record_time - start_time:.2f}s")
            
            yield Book(**partial_response[last_record_sent + 1].model_dump())
            last_record_sent += 1
            
    # Lastly, we need to yield the last record
    yield partial_response[last_record_sent + 1]
    end_time = time.time()
    print(f"Total time for all recommendations: {end_time - start_time:.2f}s")

for book in stream_per_book(recommend_books_stream("mystery")):
    print(book)

# 0.65 seconds for 1st recommendation
# 2.30 seconds for 10 recommendations

Streaming can transform the user experience from “waiting for a response” to “watching a response unfold,” significantly improving perceived responsiveness even when the total completion time remains unchanged. But capturing this win requires that you structure your output format such that streaming makes UX sense. Is there a clever way to think of your problem that enables streaming UX?

Strategy 3: Optimize Call Patterns

Once you’ve improved perceived responsiveness, focus on reducing actual latency through optimized call patterns.

Reduce the Number of Calls

Where possible, combine multiple calls into one:

# Separate calls
@llm.call(provider="openai", model="gpt-4o-mini")
def get_summary(article: str) -> str:
    return f"Summarize this article into 3 sentences: {article}"

@llm.call(provider="openai", model="gpt-4o-mini")
def get_key_points(article: str) -> str:
    return f"List 5 key points from this article: {article}"

@llm.call(provider="openai", model="gpt-4o-mini")
def get_action_items(article: str) -> str:
    return f"What 3 action items can be derived from this article: {article}"

summary = get_summary(article)
key_points = get_key_points(article)
action_items = get_action_items(article)
# 13.9 seconds


@llm.call(provider="openai", model="gpt-4o-mini")
def get_combined_response(article: str) -> str:
    return f"""
Process this article and provide:
1. A brief summary (3 sentences)
2. 5 key points as bullet points
3. 3 suggested action items
---
{article}
"""

combined_response = get_combined_response(article)
# 3.4 seconds

This approach can work well, but you will also find plenty of advice to break hard tasks up into smaller ones. That’s a great idea for quality on hard tasks, but it will often make your latency higher (more calls => higher latency). If you do split work into multiple tasks, use the rest of the tips below, such as parallelizing independent calls or using smaller models.

Parallelize Independent Calls

When multiple calls are necessary, run them in parallel if you can:

import asyncio

# after updating the above example to use async (see the sketch after this snippet)
summary, key_points, action_items = await asyncio.gather(
    get_summary(article),
    get_key_points(article), 
    get_action_items(article)
)
# 4.10 seconds
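
For reference, here’s a rough sketch of what “updating to use async” means with mirascope; decorating an async def function should give you an awaitable call (shown for one of the three functions, assuming the same prompt as above):

from mirascope import llm


@llm.call(provider="openai", model="gpt-4o-mini")
async def get_summary(article: str) -> str:
    return f"Summarize this article into 3 sentences: {article}"

# get_key_points and get_action_items are converted the same way;
# asyncio.gather then runs all three calls concurrently.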

Implement Speculative Execution

Even when calls must be sequential, you can often leverage speculative execution to reduce perceived latency. Here’s how it works:

  1. Predict the most likely next steps in your workflow
  2. Start those calls preemptively while still processing earlier steps
  3. Use the results if needed, discard if not

For example, in a customer support workflow:

@llm.call(provider="openai", model="gpt-4o-mini", response_model=Literal['expert', 'novice'])
def detect_user_expertise(query: str) -> str:
    return f"""
    Analyze this user query and determine if the user is a technical expert or novice:
    Query: {query}
    Return only "expert" or "novice".
    """

@llm.call(provider="openai", model="gpt-4o-mini")
async def generate_technical_response(query: str) -> str:
    return f"""
    Generate a detailed technical response to this query, using appropriate terminology:
    Query: {query}
    """

@llm.call(provider="openai", model="gpt-4o-mini")
async def generate_simple_response(query: str) -> str:
    return f"""
    Generate a simple, easy-to-understand response to this query, avoiding technical jargon:
    Query: {query}
    """

async def regular_response_generation(query: str):
    expertise = await detect_user_expertise(query)
    if expertise == 'expert':
        return await generate_technical_response(query)
    return await generate_simple_response(query)

await regular_response_generation("How does transformer architecture handle attention mechanisms?")
# 14.62 seconds

async def speculative_response_generation(query: str):
    # Start expertise detection
    expertise_future = asyncio.create_task(detect_user_expertise(query))
    
    # Speculatively start BOTH response types in parallel
    technical_future = asyncio.create_task(generate_technical_response(query))
    simple_future = asyncio.create_task(generate_simple_response(query))

    # Wait for expertise detection to complete
    expertise = await expertise_future

    if expertise == 'expert':
        return await technical_future
    return await simple_future

await speculative_response_generation("How does transformer architecture handle attention mechanisms?")
# 9.73 seconds

With this approach, I’ve seen clients reduce end-to-end latency by ~30% for common query paths, since the speculative calls run in parallel with the main processing. However, it also means you will waste some compute. Generally, speculative execution is only a good idea when the compute is cheap or when one path is at least an order of magnitude more common than the others.
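
One refinement to consider, reusing the functions from the example above: cancel the speculative task you end up not needing. A rough sketch is below. Whether cancellation actually stops generation (and billing) on the provider side depends on the provider and on whether the request streams, so treat this as client-side cleanup rather than a guaranteed cost saving.

import asyncio


async def speculative_response_with_cleanup(query: str) -> str:
    expertise_future = asyncio.create_task(detect_user_expertise(query))
    technical_future = asyncio.create_task(generate_technical_response(query))
    simple_future = asyncio.create_task(generate_simple_response(query))

    expertise = await expertise_future

    # Keep the branch we need and cancel the other to free client-side resources
    if expertise == 'expert':
        winner, loser = technical_future, simple_future
    else:
        winner, loser = simple_future, technical_future
    loser.cancel()
    return await winner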

Implement Smart Caching

Caching frequently used responses can eliminate latency entirely for common queries. For semantic caching (matching similar but not identical queries), you can use embeddings:


@llm.call(provider='openai', model='gpt-4o-mini')
async def answer_query(query: str) -> str:
    return f"Please answer this question: {query}"


async def cached_response_generation(query: str):
    """Generate a response with semantic caching."""
    # Try to find in cache first
    cached_response = find_in_cache(query)
    if cached_response:
        print("Cache hit! Using cached response.")
        return cached_response
    
    print("Cache miss. Generating new response...")
    # If not in cache, generate response normally
    response = await answer_query(query)
    
    # Add to cache for future use
    add_to_cache(query, response)
    
    return response

# Example usage
await cached_response_generation("How does transformer architecture handle attention mechanisms?")
# 6.71 seconds (cache miss)

await cached_response_generation("Explain how transformers implement attention mechanisms")
# 0.70 seconds (cache hit)

response = await cached_response_generation("What are the advantages of convolutional neural networks?")
# 4.81 seconds (cache miss)
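
The example above leaves find_in_cache and add_to_cache undefined. Here is one minimal, in-memory way they might look, built on OpenAI embeddings; the embedding model, the 0.9 threshold, and the brute-force search are all assumptions you’d tune (and you’d swap in a vector store at scale):

import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str, str]] = []  # (embedding, cached query, cached response)
SIMILARITY_THRESHOLD = 0.9  # assumed starting point; tune using the logs described below


def _embed(text: str) -> np.ndarray:
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    vec = np.array(emb.data[0].embedding)
    return vec / np.linalg.norm(vec)


def find_in_cache(query: str) -> str | None:
    """Return a cached response if a semantically similar query exists."""
    if not _cache:
        return None
    q = _embed(query)
    # Brute-force cosine similarity over cached queries (vectors are normalized)
    scored = [(float(q @ vec), cached_query, response) for vec, cached_query, response in _cache]
    best_score, best_query, best_response = max(scored, key=lambda s: s[0])
    # Log the query, the matched query, and the score so you can tune the threshold later
    print(f"Best match: {best_query!r} (similarity={best_score:.3f})")
    return best_response if best_score >= SIMILARITY_THRESHOLD else None


def add_to_cache(query: str, response: str) -> None:
    _cache.append((_embed(query), query, response))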

For one client, I found that ~35% of queries could be answered from cache, reducing average response time by roughly 30%. However, semantic caching requires a lot of care: spurious matches mean you serve a bad response! If you implement it, make sure you log the following so you can later analyze and tune your similarity thresholds:

  • query
  • which cached query, if cache hit
  • top k matching queries with similarity scores

Strategy 4: Use Smaller Models

One of the most direct ways to improve latency is to use the smallest model that meets your quality requirements:

@llm.call(provider='openai', model='gpt-4o-mini')
async def answer_query(query: str) -> str:
    return f"Please answer this question: {query}"

await answer_query("How does transformer architecture handle attention mechanisms?")
# 8.29 seconds

@llm.call(provider='openai', model='gpt-4o')
async def answer_query(query: str) -> str:
    return f"Please answer this question: {query}"

await answer_query("How does transformer architecture handle attention mechanisms?")
# 10.56 seconds

Whether you can do this depends on the complexity of your problem, but you should definitely experiment to see if it’s possible to use a smaller model. I have found many use cases that were served well enough by smaller models, especially after breaking the task down and defining it well.

Use Optimized Inference Profiles

Many providers now offer optimized inference profiles that significantly improve performance; AWS, for example, has published benchmarks showing meaningful latency reductions when these profiles are enabled.

These optimizations are available in selected regions and for specific models, so check your provider’s documentation for availability. For more information see Optimized Inference in AWS Bedrock.
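
As a concrete example, at the time of writing AWS Bedrock exposes latency-optimized inference through a performance configuration on the Converse API. The sketch below shows the general shape, but treat the parameter name, model ID, and region as assumptions to verify against the current Bedrock documentation:

import boto3

# Latency-optimized inference is only offered for certain models and regions;
# confirm current support in the Bedrock docs before relying on it.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-2")

response = bedrock.converse(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # example model ID; verify availability
    messages=[{"role": "user", "content": [{"text": "Summarize the key ideas of attention."}]}],
    performanceConfig={"latency": "optimized"},  # request the latency-optimized profile
)
print(response["output"]["message"]["content"][0]["text"])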

Use Quantized Models

Quantization reduces model precision to improve inference speed while maintaining most capabilities. A recent client needed to answer questions about internal operations and SOPs with a self-hosted Llama 3 8B; here’s how different quantization levels compared:

Model        Format                  Memory Usage  Inference Time      Human-Rated Accuracy
Llama 3 8B   Full precision (FP16)   16GB          1.0x (baseline)     86.3%
Llama 3 8B   GGUF Q4_K_M             5.5GB         0.68x (32% faster)  81.4%
Llama 3 8B   GGUF Q3_K_S             4.2GB         0.58x (42% faster)  77.8%

For our use case, the Q4_K_M quantization offered the best balance of performance and quality. The accuracy degradation was acceptable for faster responses in this setting.
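
If you want to try this yourself, GGUF quants can be served with llama.cpp or its Python bindings. A minimal sketch using llama-cpp-python is below; the model path and parameters are placeholders for whichever quant you download:

from llama_cpp import Llama

# Load a Q4_K_M quant of Llama 3 8B Instruct (path is a placeholder for your local GGUF file)
quantized_model = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

output = quantized_model.create_chat_completion(
    messages=[{"role": "user", "content": "What is our SOP for onboarding a new vendor?"}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])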

Strategy 5: Optimize Prompts, Parameters, and Outputs

Finally, optimize the actual content being sent and received.

Reduce Output Tokens

A critical insight: cutting 50% of your output tokens can reduce latency by nearly 50%. This nearly linear relationship makes output length one of your highest-leverage optimization targets:

@llm.call(provider='openai', model='gpt-4o-mini')
async def generate_long_output(topic: str) -> str:
    return f"""
    Write a comprehensive explanation about {topic}. Include background information,
    key concepts, technical details, and practical applications.
    """

await generate_long_output('transformer neural networks')
# 12.66 seconds for 6147 characters

@llm.call(provider='openai', model='gpt-4o-mini')
async def generate_short_output(topic: str) -> str:
    return f"""
    Explain {topic} in 3 concise sentences, focusing only on the most essential information.
    """

await generate_short_output('transformer neural networks')
# 1.02 seconds for 525 characters

Optimize Input Prompts

While not as impactful as output optimization, there are still gains to be made:

# We cap output to ensure both calls output same number of tokens
@llm.call(provider='openai', model='gpt-4o-mini', call_params={'max_tokens': 100})
async def generate_with_verbose_prompt(topic: str) -> str:
    return f"""
    Please analyze the following topic and provide a summary. The summary should capture the main points 
    and key details. It should be comprehensive but concise. The topic is about {topic} and
    its applications in modern technology. Please ensure your response is approximately 100 words in length.
    Make sure to cover the fundamental concepts and practical implications. Your response should be
    informative yet accessible to a general audience with basic technical knowledge.
    """

await generate_with_verbose_prompt('quantum computing')
# 1.88 seconds for 507 characters of prompt

@llm.call(provider='openai', model='gpt-4o-mini', call_params={'max_tokens': 100})
async def generate_with_efficient_prompt(topic: str) -> str:
    return f"Summarize {topic} and its applications in 100 words:"

await generate_with_efficient_prompt('quantum computing')
# 1.58 seconds for 62 characters of prompt

Generally, you should only shorten your inputs for latency reasons when you can remove large portions, because input length is not as strongly related to overall latency as output length is.

Strategy Integration: A Decision Framework

The most successful implementations combine multiple strategies based on specific needs. Here’s a decision framework I use with clients:

  1. Immediate wins:
    • Implement streaming
    • Optimize response formats
    • Add loading indicators
  2. Medium-term optimizations:
    • Test smaller models
    • Implement basic caching
    • Reduce output token count
    • Optimize prompt structure
    • Use optimized inference profiles
  3. Advanced solutions:
    • Implement semantic caching
    • Develop speculative execution
    • Create parallel workflows
    • Fine-tune smaller models
    • Deploy quantized models

Most of my clients achieve 30-50% latency improvements with just the immediate wins and a few medium-term optimizations. The advanced solutions are typically only necessary for high-scale applications or extremely latency-sensitive use cases.

Measuring the Impact

To demonstrate the cumulative effect of these optimizations, here’s a before/after comparison from a recent client project:

Metric                     Before Optimization  After Optimization  Improvement
Time to First Record       12.3s                1.7s                -86%
Time to Complete Response  12.3s                5.5s                -55%

We achieved these improvements through:

  1. Implementing streaming with response format optimization (primarily affecting Time to First Token and Time to First Record)
  2. Reducing output token count by ~10%
  3. Switching from GPT-4o to GPT-4o-mini
  4. Implementing semantic caching for common queries

Conclusion: Your Next Steps

Latency optimization isn’t just a technical nicety—it’s essential for AI application success. Here’s what you should do starting today:

  1. Measure your current performance: Set up proper instrumentation to track TTFT, OTPS, and TTCR
  2. Implement streaming: This single change often provides the biggest perceived performance boost
  3. Format for early value: Restructure your prompts to deliver useful information early in the response
  4. Test smaller models: You might be surprised how well they perform for your specific use case
  5. Reduce output tokens: This is your highest leverage technical optimization

Remember that both perceived responsiveness and total completion time matter. Your metrics should guide which optimizations will deliver the most value for your specific application.

If you’re building AI applications and struggling with latency issues, I’d love to hear about your specific challenges. To discuss tailored strategies for improving your AI system latency, book a free consult.


Skylar Payne is a machine learning tech lead with experience at Google and LinkedIn. He specializes in bridging the gap between data science and engineering to build high-performance AI systems that users love.