Evaluating AI Agents Over Time: Memory as Ground Truth
Free · Open source (MIT) · Works with LangChain, CrewAI, AutoGen · No signup
Benchmarking AI agents across versions is broken without persistent memory. Your agent evaluation pipeline can't measure how well memory-dependent behaviors improve over time because everything resets between test runs. This is fixable with the right memory backend.
The Memory Gap in Agent Evaluation
Traditional agent benchmarking treats each test run in isolation. You test your agent on a conversation, measure metrics like accuracy or task completion, then restart fresh for the next test. But real agents accumulate knowledge — they remember user preferences, learn from mistakes, and build context over multiple interactions.
This creates a massive blind spot in agent eval frameworks. When you're comparing Agent v1.2 to v1.3, you want to know: "Does the new version better utilize accumulated knowledge?" Without persistent memory, you're only measuring cold-start performance, not the agent's ability to improve with experience.
The result? You ship agents that test well in isolation but perform poorly in production where memory matters. Your evaluation metrics don't correlate with real-world success because you're missing the memory layer entirely.
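To see the failure concretely, here is a minimal sketch of what a stateless harness does; run_agent and scenarios are hypothetical placeholders:

# Typical stateless harness: memory is an in-process dict,
# so anything the agent "learned" vanishes when the run ends.
def run_eval(agent_version, scenarios):
    memory = {}  # recreated from scratch on every invocation
    return [run_agent(s, memory) for s in scenarios]

run_eval("v1.2", scenarios)
run_eval("v1.3", scenarios)  # sees the same empty memory: cold start only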
The Fix: Memory-Backed Agent Evaluation
Install BotWire for persistent agent memory that survives process restarts:
pip install botwire
Here's how to benchmark agents with persistent memory:
from botwire import Memory
import time

def evaluate_agent_with_memory(agent_version, test_scenarios):
    # Each agent version gets its own memory namespace
    memory = Memory(f"agent-eval-{agent_version}")
    results = []

    for scenario in test_scenarios:
        # Agent can read from previous interactions
        context = memory.get("user_preferences") or {}
        learned_patterns = memory.get("conversation_patterns") or []

        # Run your agent with persistent context
        response = run_agent(scenario, context, learned_patterns)

        # Agent writes back learnings
        memory.set("last_interaction", {
            "scenario": scenario["id"],
            "response": response,
            "timestamp": time.time()
        })

        # Update accumulated knowledge
        updated_patterns = learned_patterns + [response.pattern]
        memory.set("conversation_patterns", updated_patterns)

        results.append(evaluate_response(response, scenario))

    return results
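Since each version's namespace persists across process restarts, re-running the suite measures warm performance rather than cold starts. A usage sketch, where load_test_scenarios is a hypothetical loader and run_agent and evaluate_response are the placeholders from above:

scenarios = load_test_scenarios()  # hypothetical: your scenario fixtures
v12_results = evaluate_agent_with_memory("1.2", scenarios)
v13_results = evaluate_agent_with_memory("1.3", scenarios)

# Run the same script again tomorrow: the namespaces survive the
# restart, so score deltas now reflect accumulated knowledge.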
Understanding Memory-Based Evaluation
The key insight is treating memory as ground truth for agent improvement. Each test run builds on previous interactions, letting you measure how well your agent learns and adapts.
Here's a more complete evaluation framework:
from botwire import Memory
from datetime import datetime

class AgentEvaluator:
    def __init__(self, agent_version):
        self.memory = Memory(f"eval-{agent_version}")
        self.metrics = []

    def run_benchmark_suite(self, scenarios, days_to_simulate=30):
        """Simulate agent behavior over time with memory persistence."""
        for day in range(days_to_simulate):
            daily_scenarios = scenarios[day * 10:(day + 1) * 10]  # 10 per day

            for scenario in daily_scenarios:
                # Check if we've seen this user before
                user_id = scenario["user_id"]
                user_history = self.memory.get(f"user:{user_id}") or {"interactions": 0, "preferences": {}}

                # Agent response informed by memory
                # (implement generate_response with your agent's call)
                response = self.generate_response(scenario, user_history)

                # Update user memory
                user_history["interactions"] += 1
                user_history["last_seen"] = datetime.now().isoformat()
                if "preference_update" in response:
                    user_history["preferences"].update(response["preference_update"])
                self.memory.set(f"user:{user_id}", user_history)

                # Track metrics over time
                # (implement calculate_personalization with your scoring logic)
                self.metrics.append({
                    "day": day,
                    "scenario_id": scenario["id"],
                    "accuracy": response["accuracy"],
                    "personalization_score": self.calculate_personalization(user_history),
                    "memory_utilization": len(str(user_history))
                })

    def get_memory_keys(self):
        """Inspect what the agent has learned."""
        # Note: BotWire doesn't expose key listing in the current API.
        # Track keys manually or use a known key pattern.
        return ["Key listing not available in current API"]
The beauty of this approach is that your benchmarks become stateful. You can measure not just "does the agent answer correctly" but "does the agent get better at answering as it learns more about users?"
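One concrete way to answer that question, using only the metrics the evaluator above already records: average accuracy per simulated day, where a rising curve indicates the agent is actually exploiting its memory.

from collections import defaultdict

def learning_curve(metrics):
    """Mean accuracy per simulated day from AgentEvaluator.metrics."""
    by_day = defaultdict(list)
    for m in metrics:
        by_day[m["day"]].append(m["accuracy"])
    return {day: sum(vals) / len(vals) for day, vals in sorted(by_day.items())}

curve = learning_curve(evaluator.metrics)  # evaluator from the class above
print(curve)  # e.g. {0: 0.61, 1: 0.64, ...}; a flat curve means memory isn't helping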
Integration with Agent Frameworks
For LangChain agents, use the chat history adapter:
from botwire import BotWireChatHistory
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI

def evaluate_langchain_agent(agent_prompt, test_conversations):
    results = []
    for conversation in test_conversations:
        # Persistent chat history across evaluations
        chat_history = BotWireChatHistory(session_id=f"eval-{conversation['id']}")

        # Create agent with persistent memory; agent_prompt must include
        # "chat_history" and "agent_scratchpad" placeholders
        llm = ChatOpenAI(model="gpt-4")
        agent = create_openai_functions_agent(llm, tools=[], prompt=agent_prompt)
        executor = AgentExecutor(agent=agent, tools=[])  # add your tools in both places

        # Run conversation with memory
        for turn in conversation["turns"]:
            response = executor.invoke({
                "input": turn["user_message"],
                "chat_history": chat_history.messages
            })

            # History automatically persists via BotWire
            chat_history.add_user_message(turn["user_message"])
            chat_history.add_ai_message(response["output"])

            results.append(evaluate_turn(response, turn))
    return results
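Because the history is keyed by session_id, a later evaluation run can resume where a previous one left off. A quick check, assuming the session was evaluated before:

# Reopening the same session_id resumes the stored transcript,
# so a second evaluation pass starts warm rather than cold.
history = BotWireChatHistory(session_id="eval-42")  # id from a prior run
print(len(history.messages))  # non-zero if this conversation ran before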
When NOT to Use BotWire
- Vector/semantic search: BotWire is key-value storage, not a vector database. Use Pinecone/Weaviate for embedding-based retrieval.
- High-frequency writes: Free tier caps at 1000 writes/day per namespace. For high-throughput evaluation, consider self-hosting.
- Sub-millisecond latency: HTTP calls add ~10-50ms overhead. For ultra-low latency, swap in an in-memory stand-in during evaluation (see the sketch after this list).
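For that last case, a dict-backed shim that mirrors the get/set calls used throughout this post keeps the rest of your harness unchanged. This is a hypothetical stand-in, not part of BotWire:

class InMemoryStore:
    """Dict-backed stand-in matching BotWire's get/set calls."""

    def __init__(self, namespace):
        self.namespace = namespace  # kept for parity with Memory(namespace)
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

# Swap at construction time; nothing else in the harness changes.
memory = InMemoryStore("agent-eval-1.3")  # instead of Memory("agent-eval-1.3")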
FAQ
Why not Redis for agent evaluation? Redis needs a server, configuration, and authentication, and durable append-only persistence is off by default. BotWire works immediately with zero config and built-in persistence.
Is this actually free? Yes, 1000 writes/day per namespace forever. Unlimited reads. No credit card, no API keys, no gotchas.
What about data privacy? All data is yours. Open source (MIT license) means you can self-host the entire stack. No vendor lock-in.
Start Benchmarking Agent Memory Today
Memory-based evaluation reveals how your agents actually improve over time, not just cold-start performance.
pip install botwire
Get started at https://botwire.dev — no signup required.