Evaluating AI Agents Over Time: Memory as Ground Truth
Free · Open source (MIT) · Works with LangChain, CrewAI, AutoGen · No signup
Benchmarking AI agents across versions is broken without persistent memory. Your agent evaluation pipeline can't measure how well memory-dependent behaviors improve over time because everything resets between test runs. This is fixable with the right memory backend.
The Memory Gap in Agent Evaluation
Traditional agent benchmarking treats each test run in isolation. You test your agent on a conversation, measure metrics like accuracy or task completion, then restart fresh for the next test. But real agents accumulate knowledge — they remember user preferences, learn from mistakes, and build context over multiple interactions.
This creates a massive blind spot in agent eval frameworks. When you're comparing Agent v1.2 to v1.3, you want to know: "Does the new version better utilize accumulated knowledge?" Without persistent memory, you're only measuring cold-start performance, not the agent's ability to improve with experience.
The result? You ship agents that test well in isolation but perform poorly in production where memory matters. Your evaluation metrics don't correlate with real-world success because you're missing the memory layer entirely.
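To see the failure concretely, here is a minimal sketch of what a stateless harness does; run_agent and scenarios are hypothetical placeholders:

# Typical stateless harness: memory is an in-process dict,
# so anything the agent "learned" vanishes when the run ends.
def run_eval(agent_version, scenarios):
    memory = {}  # recreated from scratch on every invocation
    return [run_agent(s, memory) for s in scenarios]

run_eval("v1.2", scenarios)
run_eval("v1.3", scenarios)  # sees the same empty memory: cold start only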
The Fix: Memory-Backed Agent Evaluation
Install BotWire for persistent agent memory that survives process restarts:
pip install botwire
Here's how to benchmark agents with persistent memory:
from botwire import Memory
import time

def evaluate_agent_with_memory(agent_version, test_scenarios):
    # Each agent version gets its own memory namespace
    memory = Memory(f"agent-eval-{agent_version}")
    results = []

    for scenario in test_scenarios:
        # Agent can read from previous interactions
        context = memory.get("user_preferences") or {}
        learned_patterns = memory.get("conversation_patterns") or []

        # Run your agent with persistent context
        response = run_agent(scenario, context, learned_patterns)

        # Agent writes back learnings
        memory.set("last_interaction", {
            "scenario": scenario["id"],
            "response": response,
            "timestamp": time.time()
        })

        # Update accumulated knowledge
        updated_patterns = learned_patterns + [response.pattern]
        memory.set("conversation_patterns", updated_patterns)

        results.append(evaluate_response(response, scenario))

    return results
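Since each version's namespace persists across process restarts, re-running the suite measures warm performance rather than cold starts. A usage sketch, where load_test_scenarios is a hypothetical loader and run_agent and evaluate_response are the placeholders from above:

scenarios = load_test_scenarios()  # hypothetical: your scenario fixtures
v12_results = evaluate_agent_with_memory("1.2", scenarios)
v13_results = evaluate_agent_with_memory("1.3", scenarios)

# Run the same script again tomorrow: the namespaces survive the
# restart, so score deltas now reflect accumulated knowledge.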
Understanding Memory-Based Evaluation
The key insight is treating memory as ground truth for agent improvement. Each test run builds on previous interactions, letting you measure how well your agent learns and adapts.
Here's a more complete evaluation framework:
from botwire import Memory
from datetime import datetime

class AgentEvaluator:
    def __init__(self, agent_version):
        self.memory = Memory(f"eval-{agent_version}")
        self.metrics = []

    def run_benchmark_suite(self, scenarios, days_to_simulate=30):
        """Simulate agent behavior over time with memory persistence."""
        for day in range(days_to_simulate):
            daily_scenarios = scenarios[day * 10:(day + 1) * 10]  # 10 per day

            for scenario in daily_scenarios:
                # Check if we've seen this user before
                user_id = scenario["user_id"]
                user_history = self.memory.get(f"user:{user_id}") or {"interactions": 0, "preferences": {}}

                # Agent response informed by memory
                # (implement generate_response with your agent's call)
                response = self.generate_response(scenario, user_history)

                # Update user memory
                user_history["interactions"] += 1
                user_history["last_seen"] = datetime.now().isoformat()
                if "preference_update" in response:
                    user_history["preferences"].update(response["preference_update"])
                self.memory.set(f"user:{user_id}", user_history)

                # Track metrics over time
                # (implement calculate_personalization with your scoring logic)
                self.metrics.append({
                    "day": day,
                    "scenario_id": scenario["id"],
                    "accuracy": response["accuracy"],
                    "personalization_score": self.calculate_personalization(user_history),
                    "memory_utilization": len(str(user_history))
                })

    def get_memory_keys(self):
        """Inspect what the agent has learned."""
        # Note: BotWire doesn't expose key listing in the current API.
        # Track keys manually or use a known key pattern.
        return ["Key listing not available in current API"]
The beauty of this approach is that your benchmarks become stateful. You can measure not just "does the agent answer correctly" but "does the agent get better at answering as it learns more about users?"
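One concrete way to answer that question, using only the metrics the evaluator above already records: average accuracy per simulated day, where a rising curve indicates the agent is actually exploiting its memory.

from collections import defaultdict

def learning_curve(metrics):
    """Mean accuracy per simulated day from AgentEvaluator.metrics."""
    by_day = defaultdict(list)
    for m in metrics:
        by_day[m["day"]].append(m["accuracy"])
    return {day: sum(vals) / len(vals) for day, vals in sorted(by_day.items())}

curve = learning_curve(evaluator.metrics)  # evaluator from the class above
print(curve)  # e.g. {0: 0.61, 1: 0.64, ...}; a flat curve means memory isn't helping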
Integration with Agent Frameworks
For LangChain agents, use the chat history adapter:
from botwire import BotWireChatHistory
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_openai import ChatOpenAI

def evaluate_langchain_agent(agent_prompt, test_conversations):
    results = []
    for conversation in test_conversations:
        # Persistent chat history across evaluations
        chat_history = BotWireChatHistory(session_id=f"eval-{conversation['id']}")

        # Create agent with persistent memory; agent_prompt must include
        # "chat_history" and "agent_scratchpad" placeholders
        llm = ChatOpenAI(model="gpt-4")
        agent = create_openai_functions_agent(llm, tools=[], prompt=agent_prompt)
        executor = AgentExecutor(agent=agent, tools=[])  # add your tools in both places

        # Run conversation with memory
        for turn in conversation["turns"]:
            response = executor.invoke({
                "input": turn["user_message"],
                "chat_history": chat_history.messages
            })

            # History automatically persists via BotWire
            chat_history.add_user_message(turn["user_message"])
            chat_history.add_ai_message(response["output"])

            results.append(evaluate_turn(response, turn))
    return results
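Because the history is keyed by session_id, a later evaluation run can resume where a previous one left off. A quick check, assuming the session was evaluated before:

# Reopening the same session_id resumes the stored transcript,
# so a second evaluation pass starts warm rather than cold.
history = BotWireChatHistory(session_id="eval-42")  # id from a prior run
print(len(history.messages))  # non-zero if this conversation ran before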
When NOT to Use BotWire
- Vector/semantic search: BotWire is key-value storage, not a vector database. Use Pinecone/Weaviate for embedding-based retrieval.
- High-frequency writes: Free tier caps at 1000 writes/day per namespace. For high-throughput evaluation, consider self-hosting.
- Sub-millisecond latency: HTTP calls add ~10-50ms overhead. For ultra-low latency, swap in an in-memory stand-in during evaluation (see the sketch after this list).
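For that last case, a dict-backed shim that mirrors the get/set calls used throughout this post keeps the rest of your harness unchanged. This is a hypothetical stand-in, not part of BotWire:

class InMemoryStore:
    """Dict-backed stand-in matching BotWire's get/set calls."""

    def __init__(self, namespace):
        self.namespace = namespace  # kept for parity with Memory(namespace)
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value

# Swap at construction time; nothing else in the harness changes.
memory = InMemoryStore("agent-eval-1.3")  # instead of Memory("agent-eval-1.3")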
FAQ
Why not Redis for agent evaluation? Redis needs a server, configuration, and authentication, and durable append-only persistence is off by default. BotWire works immediately with zero config and built-in persistence.
Is this actually free? Yes, 1000 writes/day per namespace forever. Unlimited reads. No credit card, no API keys, no gotchas.
What about data privacy? All data is yours. Open source (MIT license) means you can self-host the entire stack. No vendor lock-in.
Start Benchmarking Agent Memory Today
Memory-based evaluation reveals how your agents actually improve over time, not just cold-start performance.
pip install botwire
Get started at https://botwire.dev — no signup required.