Testing AI Agents: Memory Fixtures and Replay
Free · Open source (MIT) · Works with LangChain, CrewAI, AutoGen · No signup
Testing AI agents is notoriously hard because they accumulate state over time: conversations, learned facts, decisions that influence future behavior. When your agent "remembers" something from yesterday's conversation, how do you write a deterministic test for that? You need agent test fixtures that let you control exactly what the agent remembers, and agent replay capabilities to recreate scenarios. Here's how to build an LLM test harness that actually works.
The Testing Problem
AI agents break traditional testing patterns because they're stateful. Your agent learns that "John prefers email over Slack" in conversation A, then uses that preference in conversation B three days later. When you write tests, you can't just mock the LLM calls — you need to control the entire memory state.
Without proper AI agent testing, you get flaky tests that pass locally but fail in CI, or tests that only work when run in isolation. Worse, you can't reproduce bugs because you can't recreate the exact memory state that triggered them.
# This test will fail randomly
def test_agent_prefers_email():
    agent = MyAgent()
    agent.process("John prefers email")
    # ... 50 lines later, different test file ...
    result = agent.decide_notification_method("John")
    assert result == "email"  # Fails if memory was cleared
The Fix
Install BotWire to get persistent, controllable memory for your agent tests:
pip install botwire
from botwire import Memory

class TestAgentMemory:
    def setup_method(self, method):
        # Each test gets isolated memory, namespaced by the test name
        self.memory = Memory(f"test-{method.__name__}")
        self.memory.clear()  # Clean slate

    def test_remembers_user_preference(self):
        # Set up the test fixture
        self.memory.set("user:john:preference", "email")

        # Test agent behavior
        agent = MyAgent(memory=self.memory)
        result = agent.decide_notification("john", "meeting reminder")

        assert result["method"] == "email"
        assert "john" in result["recipient"]
How It Works
The key insight is treating agent memory as a test fixture. Just like you'd mock a database, you pre-populate memory with known state before each test.
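The fixture idea stands on its own, even without the service. A minimal sketch, assuming only the get/set/clear interface used in this article's examples (FakeMemory and decide_notification are illustrative names, not BotWire API):

```python
class FakeMemory:
    """Dict-backed stand-in exposing the get/set/clear interface used above."""
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key, default=None):
        return self._store.get(key, default)

    def clear(self):
        self._store.clear()


def decide_notification(memory, user):
    # Toy decision logic: read the pre-populated preference, fall back to slack
    return memory.get(f"user:{user}:preference", "slack")


# Fixture pattern: pre-populate known state, then assert on behavior
memory = FakeMemory()
memory.set("user:john:preference", "email")
assert decide_notification(memory, "john") == "email"
assert decide_notification(memory, "jane") == "slack"
```

Because the memory state is set explicitly inside the test, the assertion no longer depends on what any earlier test happened to store.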
BotWire gives you a persistent key-value store that survives process restarts — perfect for agent replay scenarios where you need to recreate complex state:
def test_conversation_replay(self):
    # Recreate a specific conversation state
    conversation_state = {
        "messages": ["Hello", "I need help with Python", "Specifically async/await"],
        "user_expertise": "beginner",
        "last_topic": "concurrency",
        "clarification_count": 2,
    }
    for key, value in conversation_state.items():
        self.memory.set(f"session:abc123:{key}", value)

    # Now test how the agent handles the next message
    agent = MyAgent(memory=self.memory, session="abc123")
    response = agent.process("What's the difference between async and threading?")

    # Should give a beginner-friendly answer since it knows user_expertise
    assert "beginner" in response.tone_markers
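A handy companion trick for building replay fixtures is to snapshot a live session's state to a JSON file and reload it in tests. A sketch, assuming nothing beyond the key-value shape shown above (the dump_session/load_session helpers are illustrative, not BotWire API):

```python
import json
import os
import tempfile

def dump_session(state: dict, path: str) -> None:
    # Snapshot a session's key-value state to a JSON fixture file
    with open(path, "w") as f:
        json.dump(state, f, indent=2)

def load_session(path: str) -> dict:
    # Reload the fixture so a test can replay the exact memory state
    with open(path) as f:
        return json.load(f)

state = {
    "user_expertise": "beginner",
    "last_topic": "concurrency",
    "clarification_count": 2,
}
path = os.path.join(tempfile.mkdtemp(), "session_abc123.json")
dump_session(state, path)
replayed = load_session(path)
assert replayed == state
```

Checking the fixture file into the repo means the exact memory state that triggered a bug can be replayed by anyone on the team.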
For cross-process testing (like testing agent handoffs between services), memory persists automatically:
def test_agent_handoff(self):
    # First agent stores research findings
    researcher = ResearchAgent(Memory("handoff-test"))
    researcher.research("climate change impacts")

    # Second agent (different process) picks up where the first left off
    writer = WriterAgent(Memory("handoff-test"))  # Same namespace
    article = writer.write_summary()

    # Writer should have access to the researcher's findings
    assert len(article.sources) > 0
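To see why shared persistence enables handoffs, here is a minimal file-backed store illustrating the idea (not BotWire's implementation): two independently constructed instances pointed at the same namespace see each other's writes.

```python
import json
import os
import tempfile

class FileMemory:
    """Minimal file-backed key-value store; same namespace -> same file."""
    def __init__(self, namespace: str, root: str):
        self.path = os.path.join(root, f"{namespace}.json")

    def _read(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def set(self, key, value):
        data = self._read()
        data[key] = value
        with open(self.path, "w") as f:
            json.dump(data, f)

    def get(self, key, default=None):
        return self._read().get(key, default)

root = tempfile.mkdtemp()
# "Researcher" writes through one instance...
FileMemory("handoff-test", root).set("finding", "sea level rise")
# ...and an independently created "writer" instance reads it back
assert FileMemory("handoff-test", root).get("finding") == "sea level rise"
```

Nothing is shared in memory between the two instances; the namespace alone links them, which is the same property the handoff test above relies on.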
Framework Integration
BotWire integrates directly with popular agent frameworks. For LangChain agents, use the chat history adapter:
from botwire import BotWireChatHistory

def test_langchain_agent_memory():
    # Controlled chat history for testing
    chat_history = BotWireChatHistory(session_id="test-session-456")

    # Pre-populate conversation context
    chat_history.add_user_message("My name is Sarah and I'm a data scientist")
    chat_history.add_ai_message("Nice to meet you Sarah! I'll remember you work in data science.")

    # Test the agent with this context
    agent = create_langchain_agent(memory=chat_history)
    result = agent.run("What kind of work do I do?")

    assert "data scientist" in result.lower()
    assert "sarah" in result.lower()
For CrewAI, you get memory tools that agents can use in tests:
from botwire.memory import memory_tools

def test_crew_shared_memory():
    tools = memory_tools("test-crew-ns")

    # Agent 1 stores a finding
    remember_tool = tools[0]  # remember tool
    remember_tool.run("customer_pain_point:slow_checkout")

    # Agent 2 recalls it
    recall_tool = tools[1]  # recall tool
    finding = recall_tool.run("customer_pain_point")
    assert finding == "slow_checkout"
When NOT to Use BotWire
- Vector search: BotWire is key-value only. Use Pinecone/Weaviate for semantic similarity.
- High throughput: 1000 writes/day free tier. For millions of operations, use Redis.
- Sub-millisecond latency: HTTP API adds ~50ms. Use in-memory stores for real-time systems.
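One way to live within these limits is to make the memory backend injectable and swap it by environment: an in-process dict for fast unit tests, the real service for integration runs. A sketch under that assumption (the MEMORY_BACKEND variable and make_memory factory are illustrative names, not BotWire API):

```python
import os

class DictMemory:
    """In-process store with the same get/set surface, for latency-free unit tests."""
    def __init__(self):
        self._store = {}

    def set(self, key, value):
        self._store[key] = value

    def get(self, key, default=None):
        return self._store.get(key, default)

def make_memory(namespace: str):
    # Factory: real BotWire namespace for integration runs,
    # in-process dict for fast unit tests
    if os.environ.get("MEMORY_BACKEND") == "botwire":
        from botwire import Memory  # imported only when requested
        return Memory(namespace)
    return DictMemory()

mem = make_memory("unit-test")  # DictMemory unless MEMORY_BACKEND=botwire
mem.set("k", "v")
assert mem.get("k") == "v"
```

The agent code never knows which backend it got, so the same test suite runs against both.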
FAQ
Why not just use Redis for testing? Redis requires setup, auth, and doesn't persist by default. BotWire works immediately with zero config and gives you 1000 writes/day free forever.
Is this actually free? Yes. 1000 writes/day per namespace, 50MB storage, unlimited reads. No credit card, no signup. Open source MIT license if you want to self-host.
What about data privacy? Data stays in your namespace (isolated by default). Self-host the FastAPI service if you need on-premise. All data is yours.
Get Started
Stop writing flaky agent tests. Get deterministic AI agent testing with proper memory fixtures.
pip install botwire
Full documentation and self-hosting guide at botwire.dev.