Evaluating Multi-Turn AI Agents

A Systematic Approach to Building Reliable Agent Systems

Skylar Payne

AI is Fundamentally Empirical

We can’t predict what will work.

  • LLMs are highly data-dependent
  • Performance varies wildly across use cases
  • We have priors (educated guesses)…
  • …but we must experiment to find out

The Challenge: Understanding

Long-running agents are hard to evaluate:

  • Cost: Running full eval suites gets expensive fast ($10-50 per test run)
  • Latency: Multi-turn conversations take 30s-5min to execute
  • Context Overload: Which of the 47 steps actually broke?

Why This Matters

If we can’t evaluate agents systematically:

  • We can’t understand when/why they fail
  • We can’t improve them reliably
  • We can’t maintain them as requirements change
  • We can’t ship them with confidence

The solution: Make evaluation straightforward through systematic decomposition.

What We’ll Cover Today

  1. The Framework: Evaluation pyramid and decomposition strategies
  2. Implementation: Code design, tracing, and system evaluation
  3. In Practice: Real results and research validation

Part 1: The Framework

Evaluation Pyramid & Decomposition

The Evaluation Pyramid

Not all evals are created equal:

Principle: Test narrow/fast at the bottom, broad/slow at the top
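
A rough text sketch of the pyramid, using the costs and latencies from the slides that follow:

           /  Multi-turn  \     few tests: $1-5, 30s-5min each
          /   Turn-based   \    some tests: ~5-10s each
         /    Step-based    \   many tests: <1s, <$0.01 each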

Step-Based Evals: The Foundation

Test individual components in isolation:

  • Tool calls: “Did it call the right function?”
  • Parameter extraction: “Did it parse user input correctly?”
  • Domain logic: “Did it apply business rules?”
  • Validation: “Did it catch invalid inputs?”

Why: Fast feedback (<1s), easy to debug, cheap (<$0.01)

Step-Based Eval: Code Example

def test_extract_contact_info():
    """Test that we correctly extract email from user message"""
    message = "Please email me at sky@example.com"

    result = extract_contact_info(message)

    assert result.email == "sky@example.com"
    assert result.has_contact_method

Fast, deterministic, easy to debug

Turn-Based Evals: The Middle Layer

Test single-turn interactions:

  • User message → Agent response
  • Check partial ordering of tool calls
    • “Read must happen before write”
    • “Verify identity before granting access”
  • Validate tool parameters and response quality

Why: Catches integration issues, still relatively fast (~5-10s per test)

Turn-Based Eval: Code Example

def test_handle_refund_request():
    """Test agent correctly handles refund request"""
    conversation = Conversation()

    response = agent.handle_turn(
        conversation,
        user_msg="I need a refund for order #12345"
    )

    # Check it called the right tool
    assert response.tool_calls[0].name == "lookup_order"
    assert response.tool_calls[0].args["order_id"] == "12345"

    # Check response quality
    assert "refund" in response.message.lower()
    assert response.needs_user_confirmation
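
The partial-ordering checks from the previous slide can be a small helper. A minimal sketch, following the shape of the example above (`assert_order` and the `update_order` tool name are hypothetical, not part of any framework):

def assert_order(tool_calls, before: str, after: str):
    """Partial ordering: `before` must be called before `after`."""
    names = [call.name for call in tool_calls]
    assert before in names, f"{before} was never called"
    assert after in names, f"{after} was never called"
    assert names.index(before) < names.index(after), \
        f"{before} must happen before {after}"

def test_update_reads_before_writing():
    conversation = Conversation()
    response = agent.handle_turn(
        conversation,
        user_msg="Change the shipping address on order #12345"
    )

    # "Read must happen before write"
    assert_order(response.tool_calls,
                 before="lookup_order", after="update_order")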

Multi-Turn Evals: The Challenge

Test complete conversations:

  • Full user journey from start to finish
  • Simulate the user - respond to agent naturally
  • Multiple turns of interaction
  • Context accumulation across turns
  • Did the agent achieve the user’s goal?

Problem: Expensive ($1-5 per test), slow (30s-5min), hard to debug

Solution: Break into milestones

Milestones: Decomposing Multi-Turn

Instead of: Did the conversation succeed? ✅/❌

Ask: What are the critical checkpoints for this user flow?

  • Intent captured?
  • Necessary info gathered?
  • Tool executed successfully?
  • Policy compliance checked?
  • Response delivered clearly?

Key insight: Milestones are user-flow specific - different flows need different checkpoints

Milestone Example: Password Reset

Flow: User can’t log in → Agent resets password

  1. Intent: Understood password issue
  2. Identity: Confirmed user identity
  3. Diagnosis: Forgot vs locked account
  4. Remedy: Choose reset link vs unlock
  5. Execute: Send reset successfully
  6. Confirm: User can proceed

Each milestone is a gradable checkpoint

[Figure: Password reset journey with milestones]

Milestone-Based Evaluation: Code

def evaluate_password_reset(trace):
    """Analyze trace to check milestones"""
    milestones = {
        "intent": check_intent_type(trace, expected="password_reset"),
        "identity": check_verified(trace, "verify_user_identity"),
        "diagnosis": check_tool_result(trace, "check_account_status",
                                       has_field="account_locked"),
        "remedy": check_tool_called(trace, "send_reset_link",
                                    with_param="user_id"),
        "execute": check_attribute(trace, "reset.sent", equals=True),
        "confirm": check_final_state(trace, "user_can_login")
    }

    return MilestoneReport(milestones=milestones,
                          first_failure=find_first_failure(milestones))

Failure Funnels: Visualizing Drop-Off

Track milestone completion rates across many runs:

Biggest drop = highest leverage fix
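
A minimal sketch of computing the funnel, assuming a list of MilestoneReport results (as on the previous slide) whose milestone checks return booleans:

def failure_funnel(reports, milestone_order):
    """Completion rate per milestone, plus the biggest drop between stages."""
    total = len(reports)
    rates = [
        (name, sum(1 for r in reports if r.milestones[name]) / total)
        for name in milestone_order
    ]

    # Biggest drop between consecutive stages = highest-leverage fix
    drops = [
        (rates[i][0], rates[i - 1][1] - rates[i][1])
        for i in range(1, len(rates))
    ]
    worst_stage, worst_drop = max(drops, key=lambda d: d[1])
    return rates, worst_stage, worst_drop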

Failure Funnel: Real Example

Password Reset Flow - Initial deployment:

The funnel showed a 28% drop at remedy selection

  • Problem: Agent chose “unlock account” when it should have sent a reset link
  • Root cause: Ambiguous phrasing in account status tool response
  • Fix: Clarified tool output format + added examples
  • Result: 42% → 70% overall success rate

Don’t fix everything - fix the biggest leak first

Part 2: Implementation

Code Design, Tracing, & System Evaluation

Code Design for Testability

Functional Core, Imperative Shell

Functional Core

  • Pure functions
  • No I/O or LLM calls
  • Deterministic
  • Easy to test

Imperative Shell

  • API calls
  • Database writes
  • LLM invocations
  • Thin layer

Result: 80% of your code is fast to test with step-based evals

Before: Tangled Code

async def handle_refund(user_message: str):
    """Everything tangled together - hard to test"""
    # LLM call embedded in business logic
    response = await llm.complete(
        f"Extract order ID from: {user_message}"
    )
    order_id = response.text.strip()

    # Database call embedded
    order = await db.get_order(order_id)

    # Business logic mixed with I/O
    if order.amount > 100:
        await db.create_refund(order_id, order.amount)
        return "Refund processed"
    else:
        return "No refund needed"

Problem: Can’t test business logic without LLM/DB

After: Functional Core + Imperative Shell

# CORE: Pure function, easy to test
def should_process_refund(order: Order) -> RefundDecision:
    """Business logic - no I/O"""
    if order.amount > 100:
        return RefundDecision(
            approve=True,
            amount=order.amount,
            reason="Amount exceeds threshold"
        )
    return RefundDecision(approve=False, reason="Below threshold")

# SHELL: Thin layer for I/O
async def handle_refund(user_message: str):
    """Orchestration - delegates to core"""
    order_id = await extract_order_id(user_message)  # LLM
    order = await db.get_order(order_id)  # DB

    decision = should_process_refund(order)  # CORE - pure!

    if decision.approve:
        await db.create_refund(order_id, decision.amount)  # DB
    return format_response(decision)  # Pure formatting

[Figure: Functional core and imperative shell architecture]

Testing the Functional Core

def test_refund_decision_above_threshold():
    """Fast, deterministic, no mocks"""
    order = Order(id="123", amount=150)

    decision = should_process_refund(order)

    assert decision.approve
    assert decision.amount == 150
    assert "threshold" in decision.reason

def test_refund_decision_below_threshold():
    order = Order(id="456", amount=50)

    decision = should_process_refund(order)

    assert not decision.approve

Runs in milliseconds. No LLM. No database. No mocks.

Tracing, Not Just Logging

Don’t log tokens. Trace domain events.

  • Traces capture structure of execution
  • Domain events are stable across model changes
  • Traces enable milestone evaluation
  • Traces support distributed debugging

Bad: Token Logging

async def classify_intent(message: str):
    response = await llm.complete(
        f"Classify intent: {message}"
    )

    # Logging tokens - unstable!
    logger.info(f"LLM returned: {response.text}")

    return response.text

Problem:

  • Changes every time you update the model
  • Hard to search/query
  • No structure for evaluation

Good: Domain Event Tracing

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def classify_intent(request: UserRequest) -> Intent:
    with tracer.start_as_current_span("classify_intent") as span:
        # Trace input (domain model)
        span.set_attribute("request.user_id", request.user_id)
        span.set_attribute("request.session_id", request.session_id)

        # Make LLM call
        response = await llm.complete(f"Classify: {request.message}")

        # Parse to domain model
        intent = parse_intent(response.text)

        # Trace output (domain model)
        span.set_attribute("intent.type", intent.type)
        span.set_attribute("intent.confidence", intent.confidence)
        span.set_attribute("intent.entities", intent.entities)

        return intent

[Figure: Trace structure with domain events vs raw tokens]

Trace-Based Milestone Evaluation

def evaluate_milestone_from_trace(trace):
    """Check if intent recognition milestone was achieved"""
    intent_span = trace.get_span_by_name("classify_intent")

    if not intent_span:
        return False  # Milestone not reached

    # Check domain attributes
    intent_type = intent_span.get_attribute("intent.type")
    confidence = intent_span.get_attribute("intent.confidence")

    # Milestone achieved if intent was classified with high confidence
    return intent_type is not None and confidence > 0.7

Traces make milestones queryable
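
The `get_span_by_name` / `get_attribute` helpers above are not OpenTelemetry APIs; a minimal sketch of how they might wrap exported span data (assuming each span is a dict with "name" and "attributes" keys, e.g. from an OTLP export):

class Span:
    """Read-only view over one exported span dict."""
    def __init__(self, data: dict):
        self._data = data

    def get_attribute(self, key: str):
        return self._data.get("attributes", {}).get(key)

class Trace:
    """Thin wrapper over a list of exported spans for milestone checks."""
    def __init__(self, spans: list[dict]):
        self._spans = spans

    def get_span_by_name(self, name: str):
        # First span with a matching name, or None if the step never ran
        return next(
            (Span(s) for s in self._spans if s.get("name") == name), None
        )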

Evaluate the System, Not the API

Prompt playgrounds miss the mark

They test:

  • “Did the LLM generate good text?”

You need to test:

  • “Did the SYSTEM solve the user’s problem?”

The system includes: prompts + tools + business logic + error handling + retry logic + validation
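
A minimal sketch of a system-level eval: an LLM-simulated user drives the full agent loop, and the resulting trace is graded with the milestone checks from Part 1 (`simulate_user`, `conversation_complete`, and `conversation.trace` are assumptions, not a specific framework):

def run_system_eval(opening_msg: str, user_goal: str, max_turns: int = 10):
    """Drive the WHOLE system with a simulated user, then grade the trace."""
    conversation = Conversation()
    user_msg = opening_msg  # e.g. "I can't log into my account"

    for _ in range(max_turns):
        response = agent.handle_turn(conversation, user_msg=user_msg)
        if response.conversation_complete:
            break
        # An LLM plays the user: it pursues user_goal and replies naturally
        user_msg = simulate_user(user_goal, conversation)

    # Grade the system end to end via milestones, not raw LLM text
    return evaluate_password_reset(conversation.trace)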

Part 3: In Practice

Real Results & Research Validation

Research Backing This Approach

Recent work supports layered evaluation:

  • AgentBoard (2024): Fine-grained progress metrics, step-level analysis
  • MT-Eval (2024): Multi-turn interaction patterns (recollection, expansion, refinement, follow-up)
  • Survey on LLM Agent Evaluation (2025): 250 sources, emphasizes complementary granular + holistic metrics
  • LangSmith Multi-turn Evals (2024): Industry validation of milestone-based approach

The field is converging on multi-level evaluation strategies

Summary: The Complete System

  1. Embrace experimentation - AI is empirical, iterate fast
  2. Test at multiple levels - Step → Turn → Multi-turn pyramid
  3. Break into milestones - Make multi-turn conversations gradable
  4. Use failure funnels - Find and fix the biggest leaks
  5. Design for testability - Functional core, imperative shell
  6. Trace domain events - Stable foundation for evals
  7. Evaluate the system - Not just the LLM

Start small: Pick one capability, build one pyramid

Resources

Blog post:

  • https://skylarbpayne.com/posts/multi-turn-eval/

Research Papers:

  • AgentBoard (2024)
  • MT-Eval (2024)
  • Survey on LLM Agent Evaluation (2025)

Questions?

Contact me: me@skylarbpayne.com