or How to Stop Fighting with your AI Pair Programmer
Cursor agent is worse than old composer.

Might go back to previous version

@ codebase

— Riley Brown (@rileybrown_ai) December 3, 2024
As someone who’s led multiple engineering teams building products from inception to international launch, I’ve seen this reaction countless times. Engineers get excited about AI coding assistants, try them out, and then… frustration. The tool seems to fight against them at every turn.
But here’s the truth I’ve learned after months of using Cursor daily while leading engineering teams and shipping production ML systems: the problem isn’t the AI. It’s how we work with it.
The Real Problem: You’re Working Solo in a Pair Programming World
Nick Dobos recently tweeted about Cursor’s agent mode:
Yoo @cursor_ai agent is CRAZY

The new ai agent mode can call arbitrary python functions in your code base!!

This means you can tell cursor agent to build its own tools!!!

“Build be a python function that can search the web”. “Now call that function to search for …” 🤯 https://t.co/gQV7qOilXu

— Nick Dobos (@NickADobos) November 27, 2024
He’s right - it is powerful. But that power is precisely why so many developers get frustrated. When you give an eager junior developer access to your entire codebase without proper guidance, what happens? The same thing that happens with Cursor’s agent mode.
As Cole Murray insightfully noted:
The people who are complaining AI sucks and doesn’t generate what they want

🤝

The people who are complaining their direct reports don’t do what they want.

Maybe you’re just bad at communicating what you want.

— cole murray (@_colemurray) December 11, 2024
In this guide, I’ll share the concrete workflow I’ve developed that transformed Cursor from a source of frustration into a genuine productivity multiplier. You’ll learn:
- A structured approach to working with Cursor’s agent
- How to handle both greenfield development and existing codebases
- Complete configuration files and templates you can use today
- Real examples from production systems
Most importantly, you’ll learn how to stop fighting with your AI pair programmer and start collaborating effectively.
The Core Problem: Communication
When using Cursor, many developers (myself included, initially) are still in a solo development mindset while technically pair programming with AI. We give one-line instructions and expect perfect implementation. Would you do that with a human teammate, particularly a junior engineer?
The solution? Treat Cursor as a teammate, not a god.
Working “Alone” Together
The irony of AI programming tools is that while they’re meant to enhance collaboration, most developers use them as if they’re still working alone. Have you ever found yourself:
# Dropping a massive chunk of code into Cursor with just:
"Fix this code"
# Or giving vague, one-line requirements like:
"Write me a function to process user data"
# Or worse, expecting it to understand complex context without explanation:
"Update this to handle the new requirements"
This approach is particularly problematic with Cursor’s agent mode, which can make sweeping changes across your codebase if not properly guided. It’s like having an eager intern with access to your entire codebase but no understanding of your team’s conventions or system architecture.
The One-Line Instruction Anti-Pattern
Let’s look at a typical interaction that leads to frustration:
Developer: "Add error handling to this function"
Cursor: *Proceeds to rewrite entire function, adding try-catch blocks everywhere,
changing variable names for "clarity", and introducing new helper functions*
Developer: "No, that's not what I wanted!"
Cursor: *Suggests another complete rewrite with different error handling patterns*
Developer: "F**k you Cursor!"
This pattern is especially dangerous when working with existing codebases. Cursor’s agent eagerly tries to “improve” code it sees, often breaking subtle but important patterns that aren’t immediately obvious.
Why This Fails
The failure modes become even more apparent when we compare this to human collaboration:
With a Junior Developer
You: "Here's a task. Add error handling to this function."
Junior Dev: "Okay, what kind of errors should we handle?"
You: "Good question. We need to catch network timeouts and invalid data formats."
Junior Dev: "Got it. Should we use our standard error types?"
You: "Yes, and make sure to log the errors with our logging framework."
With Cursor (Common Approach)
You: "Add error handling to this function"
Cursor: *Makes massive changes*
You: *Frustration intensifies*
The problem isn’t that Cursor is incapable - it’s that we’re not providing the same level of context and guidance we would to a human teammate.
The Extra Challenge of Existing Codebases
This problem compounds when working with existing code. Consider these common scenarios:
- Hidden Dependencies

# Cursor sees:
def process_data(data):
    return transform(data)

# What it doesn't see:
# - transform() has side effects
# - Other services depend on the exact output format
# - Performance requirements that aren't obvious

- Implicit Patterns

# Your codebase consistently uses:
try:
    result = risky_operation()
except SpecificError as e:
    logger.error(f"Operation failed: {e}")
    raise OperationalError(e)

# Cursor might introduce:
try:
    result = risky_operation()
except Exception as e:
    print(f"Error: {e}")  # Breaking logging patterns
    return None  # Breaking error handling patterns

- Testing Implications

# Cursor changes a function signature:
def process(data, config=None):  # Added optional parameter
    ...

# But doesn't realize this breaks:
# - Hundreds of tests expecting specific behavior
# - API contracts with other services
# - Serialization/deserialization logic
The Root Cause
The fundamental issue is treating AI as a magic code generator rather than a collaborative team member. When we:
- Skip proper requirement discussions
- Don’t provide necessary context
- Fail to review changes incrementally
- Don’t explain our codebase’s patterns and constraints
We’re setting ourselves up for frustration. Just as you wouldn’t expect a new team member to write perfect code without proper onboarding, you shouldn’t expect Cursor to do so without proper guidance.
In the next section, we’ll look at a better way: a structured workflow that treats Cursor as a proper development partner rather than a god.
A Better Way: Structured Collaboration
The Four-Phase Workflow
After months of trial and error leading engineering teams through AI-assisted development, I’ve distilled our process into four key phases. This isn’t just another software development lifecycle - it’s specifically designed for effective AI collaboration while maintaining engineering rigor.
- Problem Statement & Requirements
- Start with open discussion of the problem space
- Work through expectations until you reach shared understanding
- Document key constraints and acceptance criteria
- Goal: Clear alignment before any code is written
- Design Documentation
- Create a structured design document (I use a standard template; a skeleton is sketched right after this list)
- Review and iterate on the design with Cursor
- Break implementation into clear phases
- Goal: Roadmap that both you and Cursor understand
- Phased Implementation
- Implement solution in planned phases
- Always reference back to the design doc
- Regular validation against requirements
- Goal: Manageable, reviewable chunks of work
- Continuous Testing
- Follow testing strategy for each phase
- Mix automated tests with manual validation
- Ensure backwards compatibility
- Goal: Confidence in changes
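For reference, here is the rough skeleton that the design doc later in this post follows. My actual @context/design_doc.md template isn’t reproduced in this article, so treat this as a starting point you can adapt rather than the exact file:

```markdown
# <Feature Name>

## Problem Statement
## Requirements
### Functional Requirements
### Non-Functional Requirements
## Design Decisions
## Technical Design
## Implementation Plan
## Testing Strategy
## Observability
## Future Considerations
## Dependencies
## Security Considerations
## Rollout Strategy
## References
```

The Implementation Plan section is where I break the work into the phases referenced throughout the rest of this post.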
Key Principles
What makes this workflow effective?
- Small, Verifiable Steps

# Instead of:
"Add error handling to the entire codebase"

# Do:
"Let's first identify the critical paths that need error handling"
"For the data processing function, what specific errors should we catch?"
"How should we integrate with our existing logging system?"

- Clear Context Boundaries

# Instead of:
"Make this code better"

# Do:
"We need to improve error handling in process_data() to:
1. Catch network timeouts (currently unhandled)
2. Log errors using our standard format
3. Maintain the existing retry behavior
4. Keep the current function signature"

- Iterative Refinement

# Instead of fighting multiple changes:
"No, not like that... no, that's not right either..."

# Have a dialogue:
"Good start. Let's adjust the error types to match our standards."
"Now let's add the logging integration."
"Finally, let's add the retry mechanism."
Real-World Example: Version Checksums
In the next sections, I’ll walk through a complete example of using this workflow to add version checksums to an email reply agent system. You’ll see how we:
- Started with a clear problem statement
- Developed a comprehensive design document
- Implemented the solution in phases
- Added robust testing
This wasn’t a theoretical exercise - it was a real feature we needed to prevent configuration drift in our multi-agent system. By following this structured approach, we were able to implement it correctly the first time, with no production incidents.
Initial Problem Statement
Let’s see how this workflow plays out in practice. We’ll start with discussing requirements - notice how each response builds on the previous one, creating a shared understanding.
Me: Currently, there's no way to detect if an agent's configuration has changed without a corresponding version bump. This could lead to inconsistent behavior where the same version number refers to different agent configurations. How can we detect this? Do not write code, just discuss the problem.
Cursor: You're right! There is no way to detect if an agent's configuration has changed without a corresponding version bump. We can use a checksum of the agent's configuration to detect changes. Would you like me to implement this?
Me: Let's discuss some requirements first to make sure we have the same understanding. Do not write code, just discuss the problem. First, when trying to create a new agent version which already exists, but is somehow different, we should raise an error. We don't need to get fancy and "auto update" the version number right now. The checksum should be deterministic and consistent, obviously, to do this check. The check should not be performance critical; but considering most of the time is spent in the LLM inference, it is okay for it to take a little bit of time (e.g. time for a database transaction). We want to maintain backward compatibility with existing agent versions where we did not store the checksum. When there is a mismatch, we should have a clear error message. Anything else you think we should consider?
Cursor: 1. Checksum Scope
What exactly should be included in the checksum calculation?
Should we include runtime dependencies versions?
Should we include tool implementations or just tool names?
How do we handle dynamic components (like system prompts that might depend on runtime state)?
2. Storage Considerations
Since we're using SQLModel, should the checksum be stored or always computed?
If stored, how do we handle migrations for existing records?
If computed, do we need indexing for efficient lookups?
3. Error Handling
Should we provide details about what changed in the error message?
Do we need different error types for different kinds of mismatches?
How do we handle validation failures during bulk operations?
4. Validation Timing
When exactly should we perform the validation?
Should we validate on every agent instantiation or just during version creation?
How do we handle validation in async contexts?
5. Debugging Support
Should we provide tools to inspect/compare checksums?
How do we help users understand why a checksum changed?
Should we log checksum computations for audit purposes?
Me: We should include runtime dependencies. Tool names are sufficient for now. No dynamic components for now, let's keep it simple. I think the checksum should be computed; because it is a derived attribute, that makes it easier to change later, and also makes backwards compatibility simple. The data for now will be small; no indexing required. No need for now to determine what changed. In most cases, the right thing to do will be to just bump the version number. One error type should suffice. There are currently no bulk operations on agent_versions. The validation should be performed on every agent version creation. Use transactions to ensure correctness in async contexts. No need to create tools to inspect/compare checksums for now. Please summarize your understanding of the problem and requirements before we continue.
Cursor: Problem Statement:
We need to prevent silent configuration drift in agent versions. Currently, an agent's configuration could be modified without changing its version number, leading to inconsistent behavior. We need a way to detect these changes and enforce version updates.
Core Requirements:
Detect configuration changes by computing a deterministic checksum of the agent's configuration
Raise an error if trying to create an agent version that exists but has different configuration
Users should manually bump version numbers when configurations change
Technical Decisions:
Checksum includes: static configuration + tool names + runtime dependencies
Implemented as a computed property (not stored in database)
Single error type with clear message
Validation happens during agent version creation
Use database transactions for async safety
Keep implementation simple (no inspection tools, bulk operations, or change detection)
Non-Requirements (for now):
No automatic version bumping
No dynamic components in checksum
No tool implementation checksums
No inspection/comparison tools
No indexing for checksums
No detailed change detection
While this might seem like a lot of detail upfront – it’s all important stuff to make sure Cursor is going to do the right thing! Now let’s make a design doc!
Design Document
Me: Let's create a design document given the problem statement and requirements we discussed. Please write a markdown file following the @context/design_doc.md template.
And Cursor drops this gold on us:
# Agent Version Checksum Feature
## Problem Statement
We need to prevent silent configuration drift in agent versions. When an agent's
configuration is modified without changing its version number, it can lead to:
- Inconsistent behavior between environments
- Hard to debug issues
- Loss of reproducibility
- Untracked configuration changes
## Requirements
### Functional Requirements
- Compute deterministic checksum of agent configuration
- Validate configuration matches when creating agent versions
- Raise error if version exists with different configuration
- Support existing agent versions (backward compatibility)
### Non-Functional Requirements
- Performance: Can tolerate database transaction overhead
- Clear error messages directing users to bump version
- Simple implementation (single error type)
- Safe operation in async contexts
## Design Decisions
### 1. Checksum Storage
Implemented as computed field because:
- Easier to modify checksum logic later
- Simplifies backward compatibility
- Keeps data model clean
- Reflects derived nature of the field
### 2. Checksum Algorithm
Uses SHA-256 because:
- Cryptographically secure (prevents accidental collisions)
- Standard library implementation available
- Fast computation
- Widely used and trusted
### 3. Configuration Serialization
Uses deterministic JSON serialization because:
- Handles nested configurations naturally
- Human-readable for debugging
- Consistent across Python versions
- Supports all needed data types
### 4. Checksum Scope
Includes in computation:
- Static configuration
- Tool names (not implementations)
- Runtime dependencies
Excludes:
- Dynamic components
- Tool implementations
- Runtime state
## Technical Design
### 1. Core Components
```python
class AgentVersion:
"""Stores and validates agent version information"""
version: str
checksum: str
def validate_checksum(self, config: AgentConfig) -> bool:
"""Validates configuration matches stored checksum"""
return self.checksum == compute_checksum(config)
def compute_checksum(config: AgentConfig) -> str:
"""Generates deterministic checksum from configuration"""
serialized = json.dumps(config.dict(), sort_keys=True)
return hashlib.sha256(serialized.encode()).hexdigest()
```
### 2. Integration Points
- AgentVersion integrates with InstrumentedAgent creation
- Validation occurs during agent initialization
- Checksum computation hooks into configuration serialization
- Error handling integrates with agent creation workflow
## Implementation Plan
1. Phase 1: Core Implementation (Completed)
- Added checksum field to AgentVersion
- Implemented checksum computation
- Added validation logic
- Created error types
2. Phase 2: Migration (Completed)
- Updated existing agent configurations
- Added validation to agent creation
- Updated tests
- Added documentation
## Testing Strategy
### Unit Tests
- Checksum computation consistency
- Version validation logic
- Configuration change detection
- Error case handling
### Integration Tests
- Agent creation workflow
- Configuration updates
- Version management
- Migration scenarios
## Observability
### Logging
- Configuration validation results
- Checksum computation details
- Version mismatch events
- Migration progress
### Metrics
- Validation failures
- Configuration changes
- Version updates
- Migration status
## Future Considerations
### Potential Enhancements
- Version comparison operations
- Configuration drift detection
- Automated version bumping
- Checksum verification tools
### Known Limitations
- Only detects changes in serializable fields
- Requires manual version management
- No partial configuration updates
- Limited to JSON-serializable types
## Dependencies
### Runtime Dependencies
- Python standard library (json, hashlib)
- Existing agent configuration system
- Version management system
### Development Dependencies
- Testing framework
- JSON schema validation
- Development documentation
## Security Considerations
- Checksum algorithm is cryptographically secure
- No sensitive data in checksums
- Version records are immutable
- Validation prevents configuration tampering
## Rollout Strategy
1. Development phase (Completed)
- Core implementation
- Test coverage
- Documentation
2. Testing phase (Completed)
- Unit tests
- Integration tests
- Migration testing
3. Production deployment (Completed)
- Gradual rollout
- Migration support
- Monitoring
## References
- Original design document
- Agent configuration documentation
- Version management system documentation
- SHA-256 specification
This design document gives us clear boundaries for implementation while remaining flexible enough for iteration. Note how it includes concrete interfaces and clear phase definitions. I can hear you already: “BuT cUrSoR hAlLuCiNaTeS!” Totally! And that’s why we need to review and verify the work. We are FORCING Cursor to output enough information so that we can verify. How closely you pay attention to certain details is mostly a function of your risk tolerance. Everything you do not carefully verify is something that could be wrong – just like with a human teammate you are mentoring! Not meant as a slight to the human teammate, they are learning; everyone has to start somewhere.
I like this design, except that it isn’t properly using SQLModel as it should. Let’s remind Cursor of that for the interfaces:
Me: Let's amend this design document. Keep in mind we should be using SQLModel for storage of the fields. I think we should use `@hybrid_property` from SQLAlchemy to make it storable in the database.
And it updated the interface to this (looks good to me):
class AgentVersion(SQLModel, table=True):
"""Stores and validates agent version information"""
# Primary key fields
agent_id: str = Field(primary_key=True)
version_id: str = Field(primary_key=True)
# Metadata
created_at: datetime = Field(default_factory=datetime.now)
# Configuration fields (stored as JSON)
prompt_template: list[str] = Field(default=[], sa_column=Column(JSON))
tools: list[str] = Field(default=[], sa_column=Column(JSON))
runtime_info: dict[str, str] = Field(default={}, sa_column=Column(JSON))
@hybrid_property
def checksum(self) -> str:
"""Compute checksum from static fields."""
static_fields = {
"prompt_template": self.prompt_template,
"tools": sorted(self.tools),
"runtime_info": self.runtime_info,
}
serialized = json.dumps(static_fields, sort_keys=True)
return hashlib.sha256(serialized.encode()).hexdigest()
One thing I’ll admit here: this wasn’t entirely generated by Cursor. I made one hand edit. And that’s okay! I added the sorted(self.tools) to ensure the order of tools is consistent. This was easier to do myself than explain to the AI.
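To see why that one-line edit matters: json.dumps(..., sort_keys=True) normalizes dictionary keys, but not the order of items inside a list, so two otherwise identical agents that register their tools in a different order would get different checksums. Here’s a quick standalone sketch (hypothetical tool names, same hashing approach as the design doc):

```python
import hashlib
import json


def checksum(fields: dict) -> str:
    # Same approach as the design doc: deterministic JSON + SHA-256.
    return hashlib.sha256(json.dumps(fields, sort_keys=True).encode()).hexdigest()


tools_a = ["web_search", "send_email"]
tools_b = ["send_email", "web_search"]  # same tools, registered in a different order

# sort_keys only sorts dict keys, not list contents:
print(checksum({"tools": tools_a}) == checksum({"tools": tools_b}))                  # False
print(checksum({"tools": sorted(tools_a)}) == checksum({"tools": sorted(tools_b)}))  # True
```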
Now we have a good looking design doc, let’s move on to the implementation!
Implementation and Testing
Now, we are all familiar with the cyclic and iterative nature of implementation and testing. You try something, it doesn’t work, you try something else. Rinse and repeat. It’s no different here! Let’s start with the first phase of implementation and testing. Remember: we ALWAYS refer back to the previous step to “ground” Cursor.
Me: Please implement the first phase of @design/agent_version_checksum.md.
Cursor then starts by creating a new branch:
git checkout -b feature/agent-version-checksum-phase-1
And then Cursor excitedly starts writing code. It’s challenging to show all the code that was written because some of it came as edits, but suffice it to say: Cursor was pretty damn close to the design document. The key implementation in phase 1 was this:
class AgentVersion(SQLModel, table=True):
"""Captures the complete configuration of an agent at a point in time."""
__table__: ClassVar[Table]
version_id: str = Field(primary_key=True)
agent_id: str = Field(index=True)
created_at: datetime = Field(default_factory=datetime.now)
prompt_template: list[str] = Field(default=[], sa_column=Column(JSON))
tools: list[str] = Field(default=[], sa_column=Column(JSON))
runtime_info: dict[str, str] = Field(default={}, sa_column=Column(JSON))
@hybrid_property
def checksum(self) -> str:
"""Compute checksum of static configuration fields."""
static_fields = {
"prompt_template": self.prompt_template,
"tools": sorted(self.tools),
}
serialized = json.dumps(static_fields, sort_keys=True)
return hashlib.sha256(serialized.encode()).hexdigest()
def validate_configuration(self, agent_config: "AgentConfig") -> None:
"""Validate that an agent config matches this version's configuration.
Args:
agent_config: The configuration to validate
Raises:
ConfigurationError: If the configuration doesn't match this version
"""
# Convert agent config to comparable format
config_prompt = [agent_config.system_prompt]
config_tools = sorted(getattr(agent_config, 'tools', []))
check_fields = {
"prompt_template": config_prompt,
"tools": config_tools,
}
check_sum = hashlib.sha256(
json.dumps(check_fields, sort_keys=True).encode()
).hexdigest()
if check_sum != self.checksum:
raise ConfigurationError(
f"Configuration mismatch for version {self.version_id}. "
f"Expected checksum {self.checksum}, got {check_sum}. "
"Did you forget to bump the version number?"
)
I generally like to start with a sort of manual e2e test when I am working on smaller projects.
Me: Please run the agent to ensure nothing is broken.
Cursor suggests running the following: uv run -m email_reply_agent.gmail_agent
And we quickly find that the agent is broken!
pydantic.errors.PydanticUserError: A non-annotated attribute was detected: `checksum = <sqlalchemy.ext.hybrid.hybrid_property object at 0x1058058d0>`. All model fields require a type annotation; if `checksum` is not meant to be a field, you may be able to resolve this error by annotating it as a `ClassVar` or updating `model_config['ignored_types']`.
It appears that the agent is broken, and it has something to do with the new checksum attribute. As most people do, we pipe that error back to Cursor, Cursor makes another edit, and then we run the agent again… Same error! Now here is where we usually get frustrated: Cursor is obviously stuck and does not know what to do.
Cursor is a nervous intern! They don’t want to admit they don’t know; we need to help by providing context.
A quick Google search from me shows that @hybrid_property is not the way to use a computed field in SQLModel (oopsie, my bad); instead, we should use @computed_field! I provide that as context, using Hamel’s awesome Chrome extension to convert a webpage to markdown. I also told Cursor I didn’t like having a “validate” method in the class and would prefer inlining the comparison, and Cursor made the change:
class AgentVersion(SQLModel, table=True):
"""Captures the complete configuration of an agent at a point in time."""
# Use SQLAlchemy Table for proper table configuration
__table__: ClassVar[Table]
agent_id: str = Field(primary_key=True)
version_id: str = Field(primary_key=True)
created_at: datetime = Field(default_factory=datetime.now)
prompt_template: list[str] = Field(default=[], sa_column=Column(JSON))
tools: list[str] = Field(default=[], sa_column=Column(JSON))
runtime_info: dict[str, str] = Field(default={}, sa_column=Column(JSON))
@computed_field
@property
def checksum(self) -> str:
"""Compute checksum from static fields."""
static_fields = {
"prompt_template": self.prompt_template,
"tools": sorted(self.tools),
"runtime_info": self.runtime_info,
}
serialized = json.dumps(static_fields, sort_keys=True)
return hashlib.sha256(serialized.encode()).hexdigest()
class Config: # type: ignore # noqa: PGH003
table = True
arbitrary_types_allowed = True
# Set table name after class definition
__tablename__ = "agent_versions" # type: ignore # noqa: PGH003
...
async def store_version(self, version: AgentVersion) -> None:
"""Store an agent version if it doesn't exist.
Args:
version: The version to store
Raises:
ConfigurationError: If version exists with different configuration
"""
async with self.db.session() as session:
# Check if version exists
stmt = select(AgentVersion).where(
and_(
AgentVersion.agent_id == version.agent_id,
AgentVersion.version_id == version.version_id,
)
)
existing = await session.execute(stmt)
existing_version = existing.scalar_one_or_none()
if existing_version is not None:
if existing_version.checksum != version.checksum:
raise ConfigurationError( # noqa: TRY003
f"Configuration mismatch for agent {version.agent_id} "
f"version {version.version_id}. "
"The configuration has changed without a version bump."
)
return # Version exists and matches
session.add(version)
await session.commit()
And this code worked! It ran successfully the first time, and when I changed the config without bumping the version, it failed as expected!
Then we ask Cursor to write a few unit tests to ensure the checksum is computed correctly:
Me: Okay, the manual run is working. Now let's write a few unit tests to ensure the checksum is computed correctly.
Cursor writes a few unit tests, and we run them to ensure they pass:
"""Tests for agent version checksum functionality."""
import pytest
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker
from sqlmodel import SQLModel
from email_reply_agent.agent_instrumentation.db.models import AgentVersion
from email_reply_agent.agent_instrumentation.db.storage import SQLModelStorage, ConfigurationError
@pytest.fixture
async def db_session():
"""Create a test database session."""
engine = create_async_engine(
"sqlite+aiosqlite:///test.db",
echo=False,
)
async with engine.begin() as conn:
await conn.run_sync(SQLModel.metadata.create_all)
async_session = sessionmaker(
engine, class_=AsyncSession, expire_on_commit=False
)
async with async_session() as session:
yield session
await session.rollback()
await engine.dispose()
@pytest.fixture
def storage(db_session):
"""Create a test storage instance."""
return SQLModelStorage(db_session)
@pytest.mark.asyncio
async def test_checksum_computation():
"""Test that checksum is computed correctly."""
version = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Test prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9"}
)
# Same config should have same checksum
version2 = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Test prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9"}
)
assert version.checksum == version2.checksum
# Different config should have different checksum
version3 = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Different prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9"}
)
assert version.checksum != version3.checksum
@pytest.mark.asyncio
async def test_version_storage_validation(storage):
"""Test that version storage validates checksums."""
version = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Test prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9"}
)
# First store should succeed
await storage.store_version(version)
# Same version should succeed
await storage.store_version(version)
# Different config with same version should fail
version2 = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Different prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9"}
)
with pytest.raises(ConfigurationError):
await storage.store_version(version2)
@pytest.mark.asyncio
async def test_version_order_independence():
"""Test that checksum is independent of field order."""
version1 = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Test prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9", "os": "linux"}
)
version2 = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Test prompt"],
tools=["tool2", "tool1"], # Different order
runtime_info={"os": "linux", "python": "3.9"} # Different order
)
assert version1.checksum == version2.checksum
uv run pytest tests/test_agent_version_checksum.py
But we find a failure:
================================================================================= short test summary info =================================================================================
ERROR tests/test_agent_version_checksum.py::test_version_storage_validation - AttributeError: 'async_generator' object has no attribute 'database_url'
========================================================================= 2 passed, 2 warnings, 1 error in 0.13s ==========================================================================
So we tell Cursor to fix the test. Key insight here: don’t let Cursor just jump into fixing the test; instead, have it think about and discuss the problem with you first!
Me: One test failed with: ERROR tests/test_agent_version_checksum.py::test_version_storage_validation - AttributeError: 'async_generator' object has no attribute 'database_url'. Please discuss the problem and how you think we can fix it. Do not write code.
Cursor: Ah, I see the issue. The problem is with our test fixture setup. The storage fixture is receiving the session directly, but SQLModelStorage expects a database configuration object with a session factory method. Would you like me to fix the test?
Me: Yes, that sounds right, please proceed! Make a separate SQLModelConfig fixture that we can use to configure and setup the database.
Then Cursor makes a fix:
"""Tests for agent version checksum functionality."""
import pytest
from sqlalchemy.ext.asyncio import AsyncSession, create_async_engine
from sqlalchemy.orm import sessionmaker
from sqlmodel import SQLModel
from email_reply_agent.agent_instrumentation.db.models import AgentVersion
from email_reply_agent.agent_instrumentation.db.storage import ConfigurationError, SQLModelConfig, SQLModelStorage
@pytest.fixture
def storage_config():
return SQLModelConfig(
database_url="sqlite+aiosqlite:///test.db",
echo=False
)
async def setup_database(storage_config):
"""Set up and tear down the test database."""
print("Setting up database:", storage_config.database_url)
engine = create_async_engine(
storage_config.database_url,
echo=False
)
async with engine.begin() as conn:
print("Creating all tables:", SQLModel.metadata.tables)
await conn.run_sync(SQLModel.metadata.create_all)
@pytest.fixture
def storage(storage_config):
"""Create a test storage instance."""
return SQLModelStorage(storage_config)
@pytest.mark.asyncio
async def test_checksum_computation():
"""Test that checksum is computed correctly."""
version = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Test prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9"}
)
# Same config should have same checksum
version2 = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Test prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9"}
)
assert version.checksum == version2.checksum
# Different config should have different checksum
version3 = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Different prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9"}
)
assert version.checksum != version3.checksum
@pytest.mark.asyncio
async def test_version_storage_validation(storage, storage_config):
"""Test that version storage validates checksums."""
await setup_database(storage_config)
version = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Test prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9"}
)
# First store should succeed
await storage.store_version(version)
# Same version should succeed
await storage.store_version(version)
# Different config with same version should fail
version2 = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Different prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9"}
)
with pytest.raises(ConfigurationError):
await storage.store_version(version2)
@pytest.mark.asyncio
async def test_version_order_independence():
"""Test that checksum is independent of field order."""
version1 = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Test prompt"],
tools=["tool1", "tool2"],
runtime_info={"python": "3.9", "os": "linux"}
)
version2 = AgentVersion(
agent_id="test_agent",
version_id="0.0.1",
prompt_template=["Test prompt"],
tools=["tool2", "tool1"], # Different order
runtime_info={"os": "linux", "python": "3.9"} # Different order
)
assert version1.checksum == version2.checksum
Now I am pretty happy with the implementation and the tests pass. One thing that bothers me is that the test.db is not cleaned up here:
Me: Please make sure in @test_agent_version_checksum.py that we clean up the database in between test runs.
Cursor makes a fix and now the tests pass and the database is cleaned up! I’m pretty happy:
class TestDatabase:
"""Context manager for test database setup and cleanup."""
def __init__(self, config: SQLModelConfig):
self.config = config
self.engine = create_async_engine(config.database_url, echo=False)
async def __aenter__(self):
"""Create database tables."""
async with self.engine.begin() as conn:
await conn.run_sync(SQLModel.metadata.create_all)
async def __aexit__(self, exc_type, exc_val, exc_tb):
"""Drop database tables and cleanup."""
async with self.engine.begin() as conn:
await conn.run_sync(SQLModel.metadata.drop_all)
await self.engine.dispose()
# ... then in tests
async with TestDatabase(storage_config):
...
Me: Please make a PR for the first phase of @design/agent_version_checksum.md.
Cursor runs the following commands:
git add .
git commit -m "First phase of agent version checksum"
gh pr create --title "First phase of agent version checksum" --body "First phase of agent version checksum"
And I can then go review the PR, and merge it!
Best Practices
Now that we have gone through a full cycle of this workflow, I want to extract out what I have found to be best practices.
- Enforce an opinionated development workflow on Cursor: the specific workflow I went through is not necessarily the best practice in itself; the high-leverage move is that I enforced a workflow at all. Adapt it to your own setting as needed!
- Focus on collaboration and communication before code: Cursor (and sometimes we) are so eager to jump into code, but doing so before we really understand the problem is a mistake! Give explicit “Do not write code” directives.
- Always force Cursor to use the context of the previous step: this helps ground Cursor in the most recent artifact you agreed on. It’s important to “attach” the resource to the context of your conversation too, because you may have made edits Cursor didn’t see!
- Limit the Size of Each Step: don’t give Cursor too much to chew on at once - that’s a great way to get stuck in a loop. Instead, give it smaller chunks (this is why I split up the design doc into phases).
- Guide the Discussion: without guardrails, Cursor will… well go off the rails. Make sure to keep Cursor on track. Don’t be afraid to “cancel” Cursor and reset.
- Return to the Previous Step if Needed: sometimes Cursor will seem hopelessly stuck in a loop. Don’t keep going forward; go backwards. It is clear that Cursor is missing (or misunderstanding) some critical context; discuss what that is before proceeding (and make according edits to artifacts like design docs)!
- Document Everything: Force Cursor (and yourself) to leave breadcrumbs. This will help you pick up where you left off and be critical for you, Cursor, and possibly other teammates to understand what’s going on!
Tips for Applying in Teams
Engineering leaders constantly tell me: “oh I can use Cursor because I know what I’m doing, but the junior engineers really can’t use it effectively”. I think this is bullshit. The reality is that this is a tool like any other, and it is the responsibility of senior engineers / leaders to teach juniors how to use their tools effectively. AI does not let senior engineers off the hook for mentoring their juniors. That said, we should also be cautious about giving junior engineers more than they can chew.
Here is what I found worked best:
- Give smaller chunks: instead of giving a junior the entire project from the start, do the first few steps with them. Then you can hand off a design doc and let them take a stab at implementing whatever “phase 1” is.
- Review intermediate phases: if you give a junior a design doc, you should review their implementation of the first phase. This will help you understand their level of understanding and give you a chance to guide them. But you could also give them the entire project upfront and review each step – e.g. review the problem statement / requirements upfront. This can be an incredible way to help them grow!
- Give more complex tasks as they grow: as the junior grows, you can increase the complexity of what you give them (and of what you let them do).
- Regularly pair with them: I have never found a good substitute for sitting down and pairing with them; you get a lot of insight into how they think and what may unlock their growth.
From Fighting to Flow: Making AI Work For You
When I started working with AI coding tools, I experienced the same frustrations you might be feeling now. But after leading multiple teams to successfully integrate these tools into their workflows, I’ve seen firsthand how transformative they can be when used effectively.
The key shift? Stop treating AI as a code generation oracle and start treating it as a collaborative team member. The workflow I’ve shared isn’t just theory – it’s battle-tested across multiple international product launches and production ML systems.
Remember:
- A structured workflow beats raw capability
- Clear communication prevents most problems
- Small, focused steps lead to better outcomes
- Teams grow stronger when given the right framework
Ready to Transform Your AI Development?
If you want to dive deeper:
- Check out my Cursor setup guide for technical configuration
- Connect with me on LinkedIn to discuss AI development practices
- Follow me on X for daily insights on effective AI collaboration
- Schedule a consultation if you need help building an AI-powered development workflow for your team
Most importantly, start small. Pick one project, apply this workflow, and adjust it to your needs. You’ll be surprised how quickly fighting with your AI assistant turns into flowing with a capable teammate.
Share your experiences with me on X or LinkedIn – I’d love to hear how this workflow works for you and what you’ve learned along the way.