Building an Automated Email Reply Agent with Pydantic AI: A Story of Simplification

 

After years of leading machine learning teams at companies like Google and LinkedIn, I’ve re-learned the same lesson over and over: start simple, add complexity slowly, and always prioritize observability. Recently, I embarked on an experiment with Pydantic AI to build an automated email reply agent. While the end goal was simple - automatically reply to emails - the journey taught me valuable lessons about iterative development and system design. Here’s what I learned.

The Initial Approach: An All-In-One Agent

I started ambitiously, trying to create a single agent that could handle everything. Here’s what that looked like:

from dataclasses import dataclass

import logfire
from pydantic import BaseModel
from pydantic_ai import Agent, RunContext

# GmailTool is my own thin wrapper around the Gmail API (defined elsewhere).

@dataclass
class Deps:
    """Dependencies container for Gmail agent."""
    gmail_tool: GmailTool

class EmailQuery(BaseModel):
    """Model for email search parameters."""
    max_results: int = 5
    query: str | None = None

gmail_agent = Agent(
    'openai:gpt-4o-mini',
    system_prompt="""
    You are an email assistant that helps analyze and respond to emails.
    You first identify which emails require a response.
    Emails will require a response for any number of reasons such as:
    - they are from a client or customer
    - the client / customer has not responded in a while
    - the email expresses urgency

    In all cases, spam and marketing emails should be ignored.
    """,
    deps_type=Deps,
    result_type=list[_EmailMessageRequiringResponse]
)

@gmail_agent.tool
async def fetch_recent_emails(ctx: RunContext[Deps], params: EmailQuery) -> list[dict]:
    """Fetch recent emails from Gmail inbox."""
    with logfire.span('fetching emails', params=params.model_dump()) as span:
        emails = ctx.deps.gmail_tool.fetch_emails(
            max_results=params.max_results,
            query=params.query
        )
        # ... processing logic ...

Why This Didn’t Work

The problems came quickly. The most significant issue emerged when dealing with multiple tool calls in a single response. Here’s a minimal example that reproduces the core issue:

from pydantic import BaseModel, Field
from pydantic_ai.agent import Agent
# Message classes from the pydantic_ai version I was using at the time.
from pydantic_ai.messages import ArgsJson, ModelStructuredResponse, ToolCall, UserPrompt

class Location(BaseModel):
    city: str = Field(description="The city")
    country: str = Field(description="The country")

# This will consistently fail with validation errors
message_history = [
    UserPrompt(content="Generate a search query"),
    ModelStructuredResponse(
        calls=[
            ToolCall(
                tool_name='final_result',
                args=ArgsJson(args_json='{"city:": "test city", "country": "test country"}'),  # invalid: the "city:" key has a stray colon, so the required "city" field is missing
                tool_id='call_1'
            ),
            ToolCall(
                tool_name='final_result',
                args=ArgsJson(args_json='{"city": "test city 2", "country": "test country 2"}'),
                tool_id='call_2'
            ),
        ],
        role='model-structured-response'
    )
]

This would often fail for various reasons. When it “worked”, it didn’t do a great job. So much for AI taking my job!

One of the failure modes was an OpenAI 400 error:

openai.BadRequestError: Error code: 400 - {'error': {'message': "An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'..."}}

described in this GitHub issue (note: while it was marked closed, I have still found cases where it happens, but it is much harder to reproduce, so I have yet to file a good follow-up issue).
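
For context, the Chat Completions API requires that an assistant message carrying tool_calls be immediately followed by one tool message per tool_call_id. The sketch below uses the plain openai client (not Pydantic AI) purely to show the shape the API expects; dropping either of the tool entries reproduces the 400 above.

from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "user", "content": "Generate a search query"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {"id": "call_1", "type": "function",
             "function": {"name": "final_result", "arguments": "{}"}},
            {"id": "call_2", "type": "function",
             "function": {"name": "final_result", "arguments": "{}"}},
        ],
    },
    # Every tool_call_id issued above must get a matching "tool" message.
    # Omitting either one (e.g. because one call's arguments failed validation)
    # triggers the BadRequestError shown earlier.
    {"role": "tool", "tool_call_id": "call_1", "content": "ok"},
    {"role": "tool", "tool_call_id": "call_2", "content": "ok"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)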

Because the agent framework is somewhat of a black box, it is difficult to pinpoint what the issue is. This is where the Logfire integration would probably help, but I didn’t want to learn two things at once, so I handled my logging with a simple SQLite setup.

The Path to Simplification

Rather than wrestling with a complex agent architecture, I broke things down into discrete components. Here’s the evolved design:

async def create_agents():
    """Create and instrument all agents."""
    query_agent = await instrument_agent(Agent(
        MODEL,
        system_prompt="""
        You are an email assistant that helps construct Gmail search queries.
        Your goal is to find important emails that might need responses.
        
        Generate Gmail search queries that will find:
        - Emails from clients or customers
        - Recent conversations that have gone quiet
        - Urgent messages
        """,
        result_type=EmailQueryParams,
    ), "gmail_query_agent", "0.0.1", run_logger)
    
    # Additional specialized agents for analysis and response
    analysis_agent = await instrument_agent(...)
    response_agent = await instrument_agent(...)

    return query_agent, analysis_agent, response_agent
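
For reference, the structured result for the query stage can be tiny. Here is a minimal sketch of what a model like EmailQueryParams might look like (illustrative, not my exact fields):

from pydantic import BaseModel, Field

class EmailQueryParams(BaseModel):
    """Structured output of the query agent: a Gmail search string plus a cap."""
    query: str = Field(
        description="Gmail search syntax, e.g. 'in:inbox -category:promotions newer_than:7d'"
    )
    max_results: int = Field(default=10, ge=1, le=50)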

The key was adding instrumentation and breaking the workflow into clear stages. Here’s how the main processing flow looks:

async def process_inbox(parent_run_id: str | None = None):
    """Main workflow for processing Gmail inbox."""
    query_agent, analysis_agent, response_agent = await create_agents()
    
    try:
        # Generate search query
        query_result = await query_agent.run(
            "Generate a query to find important emails that might need responses"
        )
        
        # Fetch matching emails
        emails = await gmail_api.query(query_result.data)
        
        # Analyze emails
        analysis_result = await analysis_agent.run(
            f"Analyze these emails and identify which need responses:\n{emails}",
            parent_run_id=parent_run_id
        )
        
        # Generate responses
        responses = []
        for email in analysis_result.data:
            resp = await response_agent.run(...)
            responses.append(resp.data)
            
    except Exception:
        logger.exception("Error processing inbox")
        raise

By doing this, I was able to walk through an experimentation workflow where I could identify potential issues in the prompts or data and incrementally improve the system. In fact, I went from a recall of ~30% to 90% (on identifying emails that needed responses) over a few cycles of experimentation. Some of the identified issues were:

  • The response model originally required re-outputting the input, which the model would sometimes do incorrectly (see the sketch after this list).
  • The prompt confused gpt-4o-mini into thinking multiple extractions were required, rather than one extraction with multiple criteria.
  • Some emails had missing body content, which caused the analysis agent to fail.
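
To illustrate the first point, the fix was to have the analysis agent return a reference to each email instead of reproducing it. A minimal sketch of the before/after (these model names are illustrative, not my exact code):

from pydantic import BaseModel, Field

# Before (illustrative): the model had to echo the whole email back, and it
# would occasionally mangle the subject or body while doing so.
class NeedsResponseVerbose(BaseModel):
    subject: str
    sender: str
    body: str
    reason: str

# After (illustrative): the model only returns an identifier plus its judgment,
# and the original email is looked up by id in regular code.
class NeedsResponse(BaseModel):
    email_id: str = Field(description="Gmail message id of the email")
    reason: str = Field(description="Why this email needs a response")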

What Works Well in Pydantic AI

Despite these challenges, Pydantic AI has several strong points:

  1. Simple, Flexible Agent Workflows: The framework lets you start small with structured extraction and gradually expand to more sophisticated orchestrations.

  2. Tight Pydantic Integration: The synergy between LLM outputs and Pydantic models is a major productivity booster. When the model returns incomplete fields or wrong data types, you catch it immediately through validation (see the snippet after this list).

  3. Clear Building Blocks: You’re not forced into an “agent or bust” pattern. I could choose between pure structured output or incorporating tools without rewriting everything.
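
As a quick example of point 2, the malformed final_result arguments from earlier are rejected by plain Pydantic validation before they can propagate:

from pydantic import BaseModel, Field, ValidationError

class Location(BaseModel):
    city: str = Field(description="The city")
    country: str = Field(description="The country")

try:
    # The misspelled "city:" key means the required "city" field is missing,
    # so validation fails immediately instead of producing a bad Location.
    Location.model_validate_json('{"city:": "test city", "country": "test country"}')
except ValidationError as exc:
    print(exc)  # 1 validation error for Location: city -> Field required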

What I Want to See in Pydantic AI

  1. Deeper AI Expertise: Overall, it’s not clear there are strong AI experts on the team, especially considering some of the disparagement of LLMs on the website (e.g. describing LLMs as a terrible database). I have some concerns that the framework is built to be “easy to use” but not necessarily with real-world use cases driving the design.

  2. Instrumentation Configurability: The framework is tightly coupled with Logfire. Because I didn’t want to learn two things at once, I implemented my own simple SQLite-based solution:
     storage = SQLModelStorage(SQLModelConfig(
         database_url="sqlite+aiosqlite:///gmail_agent_runs.db",
         echo=False
     ))
     run_logger = SQLModelRunLogger(storage)
    

    I would love to see more hook-like integrations for observability so we can “bring our own” observability tools (a rough sketch of what I mean follows this list).

  3. Composability: The boundaries between agents, tools, and responses sometimes feel a bit unnecessary. Everything feels like it could be a function, but the way you call things varies. I think there is a way to improve composability so that tools, agents, and other pieces can be easily swapped and composed.
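
To make point 2 concrete, here is a rough sketch of the kind of “bring your own observability” hook I mean. It is illustrative only (my real instrument_agent differs), and the run_logger.log_run method is a hypothetical interface backed by the SQLite storage above:

import time
from dataclasses import dataclass
from typing import Any

from pydantic_ai import Agent

@dataclass
class InstrumentedAgent:
    """Wraps an agent so every run is recorded via a user-supplied logger."""
    agent: Agent
    name: str
    version: str
    run_logger: Any  # e.g. the SQLModelRunLogger from above

    async def run(self, prompt: str, *, parent_run_id: str | None = None, **kwargs):
        start = time.monotonic()
        result = await self.agent.run(prompt, **kwargs)
        await self.run_logger.log_run(  # hypothetical logger API
            agent_name=self.name,
            agent_version=self.version,
            parent_run_id=parent_run_id,
            duration_s=time.monotonic() - start,
            output=result.data,
        )
        return result

async def instrument_agent(agent: Agent, name: str, version: str, run_logger) -> InstrumentedAgent:
    return InstrumentedAgent(agent, name, version, run_logger)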

Effective Patterns I Discovered

  1. Use vcrpy for Stepwise Testing: Record the chain of prompts so you can replay and debug individual steps. This is crucial for isolating issues in multi-step workflows: you can incrementally improve one piece at a time without the statistical uncertainty from upstream steps (a short example follows this list).

  2. Careful Container Field Handling: When working with nested containers, always add constraints:
     from typing import Annotated, List

     from pydantic import BaseModel, Field

     # EmailMessage is defined elsewhere; note the Pydantic v2 constraint names
     # (min_length/max_length) rather than the deprecated min_items/max_items.
     class EmailBatch(BaseModel):
         messages: Annotated[List[EmailMessage], Field(min_length=1, max_length=10)]
    
  3. Maintain an Experiment Log: I treated my “explog” like a lab notebook in Git. Every time I changed a prompt or updated code, I wrote a 1-2 line entry. Weeks later, it’s a lifesaver to see why or when I introduced changes.

  4. Start with Observability: While Pydantic AI pushes you toward Logfire, there are many options for logging and observability. I particularly recommend OTEL-based Observability 2.0 approaches. Wide, structured log events are a game changer for operations and reliability, even for AI.

  5. Break Down Complex Workflows: Instead of one agent trying to do everything, create specialized agents with clear responsibilities:
    • Query Generation
    • Email Analysis
    • Response Generation

    Slowly layer in complexity over time rather than trying to bite off everything at once!

  6. Handle Validation Carefully: Pydantic AI’s validation is powerful but can be tricky with complex nested structures. Always validate at each step rather than trying to validate everything at once.
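
To make the vcrpy pattern from point 1 concrete: record the HTTP traffic for one agent call into a cassette, then replay it deterministically while you iterate on the prompt or the downstream step. This sketch assumes a vcrpy version with httpx support and uses illustrative cassette and test names:

import vcr

# Record on the first run, replay from the cassette afterwards, so a single
# step of the workflow can be debugged without re-hitting the API.
my_vcr = vcr.VCR(
    cassette_library_dir="cassettes",
    record_mode="once",
    filter_headers=["authorization"],  # keep the API key out of the cassette
)

@my_vcr.use_cassette("query_agent_step.yaml")
def test_query_agent_step():
    # query_agent is the Pydantic AI agent from create_agents() above.
    result = query_agent.run_sync(
        "Generate a query to find important emails that might need responses"
    )
    assert "in:" in result.data.query  # illustrative assertion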

Looking Ahead

While this implementation works, there’s room for improvement. I’m particularly interested in exploring Mirascope as an alternative. Its focus on composability could provide a cleaner way to handle the orchestration of multiple agents.

The core lesson? When building AI systems, start simple and add complexity only when you have the observability to understand what’s happening. This principle has served me well across many ML projects, from Google to LinkedIn, and it proved invaluable here too!

Need insights on how to build your own AI agents? Schedule a call with me!