5 Strategies for Improving Latency in AI Applications

 

Your team was so excited to implement a new AI product feature. Finally, a chance for a real-world application of OpenAI, Anthropic, and other AI providers. Your engineers dive right in and write up a prompt. Maybe they implement RAG. Then you all stare in awe at answers that are right most of the time… But something still isn’t right.

“No one is going to wait this long for a response,” you think to yourself.

I’ve spent over a decade building AI systems and developing teams at companies like Google and LinkedIn as well as startups, and I now apply that expertise for several clients. In virtually every AI project I consult on, latency emerges as a persistent concern that needs addressing, even if it’s not always the first priority.

Here’s the playbook on latency my clients wish they had from day one.

Understanding What to Measure

Before diving into optimizations, you need to establish what metrics matter for your specific application.

Key Metrics to Track

  • Time to First Token (TTFT): How quickly your application starts responding. Critical for streaming interfaces.
  • Output Tokens Per Second (OTPS): How rapidly tokens are generated once started. Should match or exceed human reading speed (roughly 15-20 tokens/second).
  • Time to Complete Response (TTCR): Total time from request to complete response. Crucial for API integrations and post-processing workflows.
  • Time to First Record: For structured outputs, how quickly the first complete unit of useful information appears. Often overlooked but critical for perceived performance.

One client building a recommendation system was originally overly focused on time to complete response; but because they were fundamentally generating a list of recommendations, time to first record made far more sense as the primary metric.

Before beginning any efforts to reduce latency, ensure you’re measuring what matters most for your specific use case and have proper instrumentation in place to track improvements.
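
If you don’t already have this instrumentation, a minimal sketch is below. It assumes a generic streaming call that yields text chunks (most provider SDKs expose one) and records TTFT, TTCR, and an approximate OTPS; chunk counts only approximate token counts, so use your provider’s usage data if you need exact numbers.

import time
from typing import Iterable


def measure_stream(token_stream: Iterable[str]) -> dict:
    """Consume a stream of text chunks and record basic latency metrics."""
    start = time.time()
    first_token_time = None
    chunks = []

    for chunk in token_stream:
        if first_token_time is None:
            first_token_time = time.time()  # marks Time to First Token (TTFT)
        chunks.append(chunk)

    end = time.time()
    ttft = first_token_time - start if first_token_time is not None else None
    ttcr = end - start  # Time to Complete Response (TTCR)
    generation_time = end - first_token_time if first_token_time is not None else None
    # Approximate OTPS: chunks generated after the first one, per second of generation
    otps = (len(chunks) - 1) / generation_time if generation_time else None

    return {"ttft_s": ttft, "ttcr_s": ttcr, "approx_otps": otps, "text": "".join(chunks)}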

Strategy 1: Determine if You Actually Need to Lower Latency

The specialization of roles in tech has led many organizations to lose creativity here: you see slowness, and your AI engineers instantly jump into the internals of the AI system. But maybe you just need to be more thoughtful about the UX.

Before optimizing, ask: Is latency actually the problem, or is it perceived performance?

Latency Expectations Vary by Context

Different interactions have different latency tolerances and expectations. In one case, we reduced latency in a recommendation system by 70% and user engagement dropped significantly. After user interviews, we found that the system was too fast for them to believe the results were of high quality! Don’t optimize what you don’t need to.

Improve Perceived Performance

In the AI age, users are more willing to wait, provided you give them enough value and visibility into what’s happening. Here are effective loading patterns that significantly improve perceived performance:

  1. Animated typing indicators: Simple animations that show the AI is working
  2. Progress step visualization: Show which stage of processing the AI is in
  3. Content skeleton loaders: Preview the structure of the coming content
  4. Thinking visualizations: Creative animations that make waiting engaging

The key is to match the loading experience to your brand and use case. For analytical applications, showing the system “working through” steps builds confidence. For creative applications, showing an animation of ideas forming maintains engagement.
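
Most of these patterns live in the frontend, but the backend has to emit the right signals for them. As a rough, framework-agnostic sketch, a pipeline can yield named progress events before the final answer so the UI can render a step indicator; retrieve_documents and generate_answer below are hypothetical stand-ins for your own retrieval and LLM calls.

import asyncio
from typing import AsyncIterator


# Hypothetical stand-ins for your real retrieval and LLM calls
async def retrieve_documents(query: str) -> list[str]:
    await asyncio.sleep(0.5)
    return ["doc1", "doc2"]


async def generate_answer(query: str, docs: list[str]) -> str:
    await asyncio.sleep(1.0)
    return f"Answer to: {query}"


async def answer_with_progress(query: str) -> AsyncIterator[dict]:
    """Yield progress events a UI can render as steps, then the final answer."""
    yield {"type": "progress", "step": "Searching relevant documents"}
    docs = await retrieve_documents(query)

    yield {"type": "progress", "step": "Drafting an answer"}
    answer = await generate_answer(query, docs)

    yield {"type": "answer", "content": answer}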

Strategy 2: Lower Time to First Value with Streaming

The fastest way to improve perceived performance is through UX-level changes like streaming, which can make your application feel responsive even before you reduce any actual latency.

Implement Streaming

Streaming displays partial responses as they are generated. Most people think of this in terms of streaming text onto the screen, but you can also add controls over what shows up when. One of my favorite techniques, when a list of things is being generated (like recommendations), is to stream each record as it completes. For example, we can use mirascope to enable easy streaming of records:

import time

from mirascope import llm
from pydantic import BaseModel


class Book(BaseModel):
    title: str
    author: str

# Non-streaming
@llm.call(provider="openai", model="gpt-4o-mini", response_model=list[Book])
def recommend_books(genre: str) -> str:
    return f"Recommend many {genre} books"

print(recommend_books('mystery'))
# 2.22 seconds to recommend 10 books

# streaming book by book
@llm.call(provider="openai", model="gpt-4o-mini", response_model=list[Book], stream=True)
def recommend_books_stream(genre: str) -> str:
    return f"Recommend many {genre} books"


def stream_per_book(resp_stream):
    last_record_sent = -1
    first_record_time = None
    start_time = time.time()
    
    for partial_response in resp_stream:
        # Once the partial response has two records beyond the last one we sent,
        # the record right after the last sent one is complete (the following record has started)
        if len(partial_response or []) > last_record_sent + 2:
            if first_record_time is None:
                first_record_time = time.time()
                print(f"Time to first recommendation: {first_record_time - start_time:.2f}s")
            
            yield Book(**partial_response[last_record_sent + 1].model_dump())
            last_record_sent += 1
            
    # Lastly, we need to yield the last record
    yield partial_response[last_record_sent + 1]
    end_time = time.time()
    print(f"Total time for all recommendations: {end_time - start_time:.2f}s")

for book in stream_per_book(recommend_books_stream("mystery")):
    print(book)

# 0.65 seconds for 1st recommendation
# 2.30 seconds for 10 recommendations

Streaming can transform the user experience from “waiting for a response” to “watching a response unfold,” significantly improving perceived responsiveness even when the total completion time remains unchanged. But capturing this win requires that you structure your output format such that streaming makes UX sense. Is there a clever way to think of your problem that enables streaming UX?

Strategy 3: Optimize Call Patterns

Once you’ve improved perceived responsiveness, focus on reducing actual latency through optimized call patterns.

Reduce the Number of Calls

Where possible, combine multiple calls into one:

# Separate calls
@llm.call(provider="openai", model="gpt-4o-mini")
def get_summary(article: str) -> str:
    return f"Summarize this article into 3 sentences: {article}"

@llm.call(provider="openai", model="gpt-4o-mini")
def get_key_points(article: str) -> str:
    return f"List 5 key points from this article: {article}"

@llm.call(provider="openai", model="gpt-4o-mini")
def get_action_items(article: str) -> str:
    return f"What 3 action items can be derived from this article: {article}"

summary = get_summary(article)
key_points = get_key_points(article)
action_items = get_action_items(article)
# 13.9 seconds


@llm.call(provider="openai", model="gpt-4o-mini")
def get_combined_response(article: str) -> str:
    return f"""
Process this article and provide:
1. A brief summary (3 sentences)
2. 5 key points as bullet points
3. 3 suggested action items
---
{article}
"""

combined_response = get_combined_response(article)
# 3.4 seconds

This approach can work well, but you will also find plenty of advice to break hard tasks up into smaller ones. That’s a great idea for quality on hard tasks, but it will often make your latency higher (more calls => higher latency). If you do split work into multiple tasks, use the rest of the tips below, such as parallelizing independent calls or using smaller models.

Parallelize Independent Calls

When multiple calls are necessary, run them in parallel if you can:

import asyncio

# after updating the above example to use async (see the sketch after this snippet)
summary, key_points, action_items = await asyncio.gather(
    get_summary(article),
    get_key_points(article), 
    get_action_items(article)
)
# 4.10 seconds
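
For reference, here’s a rough sketch of what “updating to use async” means with mirascope; decorating an async def function should give you an awaitable call (shown for one of the three functions, assuming the same prompt as above):

from mirascope import llm


@llm.call(provider="openai", model="gpt-4o-mini")
async def get_summary(article: str) -> str:
    return f"Summarize this article into 3 sentences: {article}"

# get_key_points and get_action_items are converted the same way;
# asyncio.gather then runs all three calls concurrently.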

Implement Speculative Execution

Even when calls must be sequential, you can often leverage speculative execution to reduce perceived latency. Here’s how it works:

  1. Predict the most likely next steps in your workflow
  2. Start those calls preemptively while still processing earlier steps
  3. Use the results if needed, discard if not

For example, in a customer support workflow:

@llm.call(provider="openai", model="gpt-4o-mini", response_model=Literal['expert', 'novice'])
def detect_user_expertise(query: str) -> str:
    return f"""
    Analyze this user query and determine if the user is a technical expert or novice:
    Query: {query}
    Return only "expert" or "novice".
    """

@llm.call(provider="openai", model="gpt-4o-mini")
async def generate_technical_response(query: str) -> str:
    return f"""
    Generate a detailed technical response to this query, using appropriate terminology:
    Query: {query}
    """

@llm.call(provider="openai", model="gpt-4o-mini")
async def generate_simple_response(query: str) -> str:
    return f"""
    Generate a simple, easy-to-understand response to this query, avoiding technical jargon:
    Query: {query}
    """

async def regular_response_generation(query: str):
    expertise = await detect_user_expertise(query)
    if expertise == 'expert':
        return await generate_technical_response(query)
    return await generate_simple_response(query)

await regular_response_generation("How does transformer architecture handle attention mechanisms?")
# 14.62 seconds

async def speculative_response_generation(query: str):
    # Start expertise detection
    expertise_future = asyncio.create_task(detect_user_expertise(query))
    
    # Speculatively start BOTH response types in parallel
    technical_future = asyncio.create_task(generate_technical_response(query))
    simple_future = asyncio.create_task(generate_simple_response(query))

    # Wait for expertise detection to complete
    expertise = await expertise_future

    if expertise == 'expert':
        return await technical_future
    return await simple_future

await speculative_response_generation("How does transformer architecture handle attention mechanisms?")
# 9.73 seconds

With this approach, I’ve seen clients reduce end-to-end latency by ~30% for common query paths, since the speculative calls run in parallel with the main processing. However, it also means you will waste some compute. Generally, speculative execution is only a good idea when the compute is cheap or when one path is at least an order of magnitude more common than the others.
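
One refinement to consider, reusing the functions from the example above: cancel the speculative task you end up not needing. A rough sketch is below. Whether cancellation actually stops generation (and billing) on the provider side depends on the provider and on whether the request streams, so treat this as client-side cleanup rather than a guaranteed cost saving.

import asyncio


async def speculative_response_with_cleanup(query: str) -> str:
    expertise_future = asyncio.create_task(detect_user_expertise(query))
    technical_future = asyncio.create_task(generate_technical_response(query))
    simple_future = asyncio.create_task(generate_simple_response(query))

    expertise = await expertise_future

    # Keep the branch we need and cancel the other to free client-side resources
    if expertise == 'expert':
        winner, loser = technical_future, simple_future
    else:
        winner, loser = simple_future, technical_future
    loser.cancel()
    return await winner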

Implement Smart Caching

Caching frequently used responses can eliminate latency entirely for common queries. For semantic caching (matching similar but not identical queries), you can use embeddings:


@llm.call(provider='openai', model='gpt-4o-mini')
async def answer_query(query: str) -> str:
    return f"Please answer this question: {query}"


async def cached_response_generation(query: str):
    """Generate a response with semantic caching."""
    # Try to find in cache first
    cached_response = find_in_cache(query)
    if cached_response:
        print("Cache hit! Using cached response.")
        return cached_response
    
    print("Cache miss. Generating new response...")
    # If not in cache, generate response normally
    response = await answer_query(query)
    
    # Add to cache for future use
    add_to_cache(query, response)
    
    return response

# Example usage
await cached_response_generation("How does transformer architecture handle attention mechanisms?")
# 6.71 seconds (cache miss)

await cached_response_generation("Explain how transformers implement attention mechanisms")
# 0.70 seconds (cache hit)

response = await cached_response_generation("What are the advantages of convolutional neural networks?")
# 4.81 seconds (cache miss)
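
The example above leaves find_in_cache and add_to_cache undefined. Here is one minimal, in-memory way they might look, built on OpenAI embeddings; the embedding model, the 0.9 threshold, and the brute-force search are all assumptions you’d tune (and you’d swap in a vector store at scale):

import numpy as np
from openai import OpenAI

client = OpenAI()
_cache: list[tuple[np.ndarray, str, str]] = []  # (embedding, cached query, cached response)
SIMILARITY_THRESHOLD = 0.9  # assumed starting point; tune using the logs described below


def _embed(text: str) -> np.ndarray:
    emb = client.embeddings.create(model="text-embedding-3-small", input=text)
    vec = np.array(emb.data[0].embedding)
    return vec / np.linalg.norm(vec)


def find_in_cache(query: str) -> str | None:
    """Return a cached response if a semantically similar query exists."""
    if not _cache:
        return None
    q = _embed(query)
    # Brute-force cosine similarity over cached queries (vectors are normalized)
    scored = [(float(q @ vec), cached_query, response) for vec, cached_query, response in _cache]
    best_score, best_query, best_response = max(scored, key=lambda s: s[0])
    # Log the query, the matched query, and the score so you can tune the threshold later
    print(f"Best match: {best_query!r} (similarity={best_score:.3f})")
    return best_response if best_score >= SIMILARITY_THRESHOLD else None


def add_to_cache(query: str, response: str) -> None:
    _cache.append((_embed(query), query, response))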

For one client, I found that ~35% of queries could be answered from cache, reducing average response time by roughly 30%. However, semantic caching requires a lot of care: spurious matches mean you serve a bad response! If you implement it, make sure you log the following so you can later analyze and tune your similarity thresholds:

  • query
  • which cached query, if cache hit
  • top k matching queries with similarity scores

Strategy 4: Use Smaller Models

One of the most direct ways to improve latency is to use the smallest model that meets your quality requirements:

@llm.call(provider='openai', model='gpt-4o-mini')
async def answer_query(query: str) -> str:
    return f"Please answer this question: {query}"

await answer_query("How does transformer architecture handle attention mechanisms?")
# 8.29 seconds

@llm.call(provider='openai', model='gpt-4o')
async def answer_query(query: str) -> str:
    return f"Please answer this question: {query}"

await answer_query("How does transformer architecture handle attention mechanisms?")
# 10.56 seconds

Whether you can do this depends on the complexity of your problem, but you should definitely experiment to see if it’s possible to use a smaller model. I have found many use cases that were served well enough by smaller models, especially after breaking the task down and defining it well.

Use Optimized Inference Profiles

Many providers now offer optimized inference profiles that significantly improve performance; AWS, for example, has published benchmarks showing meaningful latency reductions when these profiles are enabled.

These optimizations are available in selected regions and for specific models, so check your provider’s documentation for availability. For more information see Optimized Inference in AWS Bedrock.
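
As a concrete example, at the time of writing AWS Bedrock exposes latency-optimized inference through a performance configuration on the Converse API. The sketch below shows the general shape, but treat the parameter name, model ID, and region as assumptions to verify against the current Bedrock documentation:

import boto3

# Latency-optimized inference is only offered for certain models and regions;
# confirm current support in the Bedrock docs before relying on it.
bedrock = boto3.client("bedrock-runtime", region_name="us-east-2")

response = bedrock.converse(
    modelId="us.anthropic.claude-3-5-haiku-20241022-v1:0",  # example model ID; verify availability
    messages=[{"role": "user", "content": [{"text": "Summarize the key ideas of attention."}]}],
    performanceConfig={"latency": "optimized"},  # request the latency-optimized profile
)
print(response["output"]["message"]["content"][0]["text"])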

Use Quantized Models

Quantization reduces model precision to improve inference speed while maintaining most capabilities. A recent client needed to answer questions about internal operations and SOPs with a self-hosted Llama 3 8B; here’s how different quantization levels compared:

Model        Format                  Memory Usage  Inference Time      Human-Rated Accuracy
Llama 3 8B   Full precision (FP16)   16GB          1.0x (baseline)     86.3%
Llama 3 8B   GGUF Q4_K_M             5.5GB         0.68x (32% faster)  81.4%
Llama 3 8B   GGUF Q3_K_S             4.2GB         0.58x (42% faster)  77.8%

For our use case, the Q4_K_M quantization offered the best balance of performance and quality. The accuracy degradation was acceptable for faster responses in this setting.
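
If you want to try this yourself, GGUF quants can be served with llama.cpp or its Python bindings. A minimal sketch using llama-cpp-python is below; the model path and parameters are placeholders for whichever quant you download:

from llama_cpp import Llama

# Load a Q4_K_M quant of Llama 3 8B Instruct (path is a placeholder for your local GGUF file)
quantized_model = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if one is available
)

output = quantized_model.create_chat_completion(
    messages=[{"role": "user", "content": "What is our SOP for onboarding a new vendor?"}],
    max_tokens=256,
)
print(output["choices"][0]["message"]["content"])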

Strategy 5: Optimize Prompts, Parameters, and Outputs

Finally, optimize the actual content being sent and received.

Reduce Output Tokens

A critical insight: cutting 50% of your output tokens can reduce latency by nearly 50%. This nearly linear relationship makes output length one of your highest-leverage optimization targets:

@llm.call(provider='openai', model='gpt-4o-mini')
async def generate_long_output(topic: str) -> str:
    return f"""
    Write a comprehensive explanation about {topic}. Include background information,
    key concepts, technical details, and practical applications.
    """

await generate_long_output('transformer neural networks')
# 12.66 seconds for 6147 characters

@llm.call(provider='openai', model='gpt-4o-mini')
async def generate_short_output(topic: str) -> str:
    return f"""
    Explain {topic} in 3 concise sentences, focusing only on the most essential information.
    """

await generate_short_output('transformer neural networks')
# 1.02 seconds for 525 characters

Optimize Input Prompts

While not as impactful as output optimization, there are still gains to be made:

# We cap output to ensure both calls output same number of tokens
@llm.call(provider='openai', model='gpt-4o-mini', call_params={'max_tokens': 100})
async def generate_with_verbose_prompt(topic: str) -> str:
    return f"""
    Please analyze the following topic and provide a summary. The summary should capture the main points 
    and key details. It should be comprehensive but concise. The topic is about {topic} and
    its applications in modern technology. Please ensure your response is approximately 100 words in length.
    Make sure to cover the fundamental concepts and practical implications. Your response should be
    informative yet accessible to a general audience with basic technical knowledge.
    """

await generate_with_verbose_prompt('quantum computing')
# 1.88 seconds for 507 characters of prompt

@llm.call(provider='openai', model='gpt-4o-mini', call_params={'max_tokens': 100})
async def generate_with_efficient_prompt(topic: str) -> str:
    return f"Summarize {topic} and its applications in 100 words:"

await generate_with_efficient_prompt('quantum computing')
# 1.58 seconds for 62 characters of prompt

Generally, you should only shorten your inputs for latency reasons when you can remove large portions, because input length is not as strongly related to overall latency as output length is.

Strategy Integration: A Decision Framework

The most successful implementations combine multiple strategies based on specific needs. Here’s a decision framework I use with clients:

  1. Immediate wins:
    • Implement streaming
    • Optimize response formats
    • Add loading indicators
  2. Medium-term optimizations:
    • Test smaller models
    • Implement basic caching
    • Reduce output token count
    • Optimize prompt structure
    • Use optimized inference profiles
  3. Advanced solutions:
    • Implement semantic caching
    • Develop speculative execution
    • Create parallel workflows
    • Fine-tune smaller models
    • Deploy quantized models

Most of my clients achieve 30-50% latency improvements with just the immediate wins and a few medium-term optimizations. The advanced solutions are typically only necessary for high-scale applications or extremely latency-sensitive use cases.

Measuring the Impact

To demonstrate the cumulative effect of these optimizations, here’s a before/after comparison from a recent client project:

Metric                     Before Optimization  After Optimization  Improvement
Time to First Record       12.3s                1.7s                -86%
Time to Complete Response  12.3s                5.5s                -55%

We achieved these improvements through:

  1. Implementing streaming with response format optimization (primarily affecting Time to First Token and Time to First Record)
  2. Reducing output token count by ~10%
  3. Switching from GPT-4o to GPT-4o-mini
  4. Implementing semantic caching for common queries

Conclusion: Your Next Steps

Latency optimization isn’t just a technical nicety—it’s essential for AI application success. Here’s what you should do starting today:

  1. Measure your current performance: Set up proper instrumentation to track TTFT, OTPS, and TTCR
  2. Implement streaming: This single change often provides the biggest perceived performance boost
  3. Format for early value: Restructure your prompts to deliver useful information early in the response
  4. Test smaller models: You might be surprised how well they perform for your specific use case
  5. Reduce output tokens: This is your highest leverage technical optimization

Remember that both perceived responsiveness and total completion time matter. Your metrics should guide which optimizations will deliver the most value for your specific application.

If you’re building AI applications and struggling with latency issues, I’d love to hear about your specific challenges. To discuss tailored strategies for improving your AI system latency, book a free consult.


Skylar Payne is a machine learning tech lead with experience at Google and LinkedIn. He specializes in bridging the gap between data science and engineering to build high-performance AI systems that users love.