Braintrust Evals for Agents with Persistent State
Free · Open source (MIT) · Works with LangChain, CrewAI, AutoGen · No signup
When evaluating agents that maintain persistent state across conversations, standard Braintrust evals fall short because they can't track accumulated memory between test runs. Your agent might pass individual prompts but fail spectacularly when its memory state gets corrupted or inconsistent over multiple interactions.
The Stateful Agent Problem
Most agent evaluation frameworks treat each test case as isolated — perfect for stateless functions, broken for real agents. Your customer service bot remembers user preferences. Your coding assistant tracks project context. Your research agent builds knowledge over time.
But when you run evals, each test starts with a blank slate. You can't verify that your agent correctly remembers a user's dietary restrictions from message 1 when handling their restaurant query in message 15. You can't test if your coding assistant maintains project state across multiple debugging sessions.
The result? Agents that pass all your evals but fail in production when memory inconsistencies compound. A chatbot that forgets VIP customer status. A research agent that re-asks the same clarifying questions. State corruption that only surfaces after dozens of interactions.
This happens because agent memory typically lives in ephemeral session storage or gets reset between eval runs, making it impossible to test realistic multi-turn scenarios with persistent context.
The Fix: Persistent Memory for Evals
Use BotWire Memory to give your agents persistent state that survives across eval runs, processes, and machines. This lets you write Braintrust evals that test real stateful behavior.
pip install botwire

from botwire import Memory
import braintrust

# Initialize persistent memory for your agent
memory = Memory("agent-eval-v1")

def stateful_agent(query, user_id):
    # Get user context from persistent memory
    context = memory.get(f"user:{user_id}") or {"preferences": [], "history": []}

    # Your agent logic here (generate_response is your own model call)
    response = generate_response(query, context)

    # Update memory with new state
    context["history"].append({"query": query, "response": response})
    memory.set(f"user:{user_id}", context)
    return response
# Braintrust eval with persistent state: the task runs two turns against
# the same user_id, and the scorer checks that turn two recalls turn one
def memory_retention_task(user_id):
    # First interaction - agent learns a preference
    stateful_agent("I'm vegetarian", user_id=user_id)
    # Later interaction - should remember the preference
    return stateful_agent("Suggest a restaurant", user_id=user_id)

def remembers_preference(input, output, expected=None, **kwargs):
    return 1 if "vegetarian" in output.lower() else 0

braintrust.Eval(
    "stateful-agent-memory",
    data=lambda: [{"input": "test-user-1"}],
    task=memory_retention_task,
    scores=[remembers_preference],
)
How It Works
BotWire Memory provides a simple key-value store that persists across your entire eval suite. When you call memory.set(key, value), the data is stored on the hosted backend at https://botwire.dev until you delete it. No signup and no API keys: just persistent state.
The key insight: use consistent namespaces and keys across your eval runs. The namespace ("agent-eval-v1") isolates your test data from production, while user-specific keys (f"user:{user_id}") let you test multiple user scenarios in parallel.
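If you want to exercise this namespacing convention without hitting the hosted backend, a dict-backed stand-in is enough. The LocalMemory class below is a hypothetical local stub (not part of the botwire package) that mirrors the get/set/delete/keys interface used above, so you can see how namespaces keep identical keys isolated:

```python
class LocalMemory:
    """Hypothetical dict-backed stand-in for botwire.Memory.

    Mirrors the get/set/delete/keys interface used above; the real
    client persists data over HTTP instead of in-process dicts."""

    _stores = {}  # namespace -> {key: value}

    def __init__(self, namespace):
        self.store = LocalMemory._stores.setdefault(namespace, {})

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value):
        self.store[key] = value

    def delete(self, key):
        self.store.pop(key, None)

    def keys(self):
        return list(self.store)


# The namespace isolates eval data, even when keys collide
eval_mem = LocalMemory("agent-eval-v1")
prod_mem = LocalMemory("agent-prod")

eval_mem.set("user:alice", {"preferences": ["vegetarian"]})

assert prod_mem.get("user:alice") is None   # isolated by namespace
assert eval_mem.keys() == ["user:alice"]    # per-user keys are listable
```

Swapping the stub for the real Memory client changes where the data lives, not how the keys behave.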
# Advanced stateful eval pattern: a multi-turn conversation spread across
# simulated days, run against the shared "agent-eval-v1" namespace
def test_conversation_continuity():
    user_id = "user-123"

    # Simulate a multi-day conversation
    day_1 = stateful_agent("Book me a flight to Tokyo", user_id)
    day_2 = stateful_agent("Actually, change that to Osaka", user_id)
    day_3 = stateful_agent("What's my current booking?", user_id)

    # Agent should remember the change from day 2
    assert "osaka" in day_3.lower()

    # Clean up test data via the same Memory instance stateful_agent writes to
    memory.delete(f"user:{user_id}")
For cross-process testing, BotWire Memory needs no extra wiring: your eval runner can seed initial state, then exercise agents running in separate containers or on other machines. Because the backend is HTTP-based, every process that reads the same namespace sees the same data.
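The shared-backend property can be sketched locally with a file-backed stub. FileMemory below is a hypothetical stand-in (not part of the botwire package): two handles opened on the same path see each other's writes, just as two processes pointed at the same BotWire namespace share state through the HTTP backend:

```python
import json
import os
import tempfile

class FileMemory:
    """Hypothetical file-backed stand-in for botwire.Memory.

    Any handle opened on the same path sees the same state, which is
    the property the shared HTTP backend gives you across processes."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def get(self, key):
        return self._load().get(key)

    def set(self, key, value):
        data = self._load()
        data[key] = value
        with open(self.path, "w") as f:
            json.dump(data, f)


state_path = os.path.join(tempfile.mkdtemp(), "agent-eval-state.json")

runner = FileMemory(state_path)  # the eval runner process seeds state...
agent = FileMemory(state_path)   # ...a separate handle reads it back

runner.set("user:test-user-1", {"preferences": ["vegetarian"]})
assert agent.get("user:test-user-1") == {"preferences": ["vegetarian"]}
```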
To debug state issues, list everything stored in a namespace with memory.keys(). For temporary test data, set a TTL: memory.set(key, value, ttl_seconds=3600) expires the entry automatically after an hour.
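The TTL semantics described above (expired entries read back as empty and drop out of key listings) can be illustrated with a small in-process stub. TTLMemory is a hypothetical stand-in, not the real client:

```python
import time

class TTLMemory:
    """Hypothetical stand-in showing ttl_seconds semantics: expired
    entries read back as None and disappear from keys()."""

    def __init__(self):
        self.store = {}  # key -> (value, expires_at or None)

    def set(self, key, value, ttl_seconds=None):
        expires = (time.monotonic() + ttl_seconds
                   if ttl_seconds is not None else None)
        self.store[key] = (value, expires)

    def get(self, key):
        value, expires = self.store.get(key, (None, None))
        if expires is not None and time.monotonic() >= expires:
            del self.store[key]  # lazily evict on read
            return None
        return value

    def keys(self):
        return [k for k in list(self.store) if self.get(k) is not None]


mem = TTLMemory()
mem.set("user:perm", {"vip": True})                # no TTL: stays put
mem.set("scratch", {"step": 1}, ttl_seconds=0.05)  # short TTL for the demo

assert mem.get("scratch") == {"step": 1}
time.sleep(0.1)
assert mem.get("scratch") is None                  # auto-expired
assert mem.keys() == ["user:perm"]
```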
LangChain Integration
If you're using LangChain agents, BotWire provides drop-in chat history that persists between eval runs:
from botwire import BotWireChatHistory
import braintrust

def langchain_memory_task(session_id):
    # Persistent chat history that survives across eval runs
    history = BotWireChatHistory(session_id=session_id)
    agent = create_your_langchain_agent(memory=history)

    # First conversation
    agent.invoke({"input": "My name is Sarah"})

    # Later conversation - the agent should remember
    response = agent.invoke({"input": "What's my name?"})
    return response["output"]

def remembers_name(input, output, expected=None, **kwargs):
    return 1 if "sarah" in output.lower() else 0

braintrust.Eval(
    "langchain-agent-memory",
    data=lambda: [{"input": "eval-session-1"}],
    task=langchain_memory_task,
    scores=[remembers_name],
)
When NOT to Use BotWire
• Vector/semantic search: BotWire is key-value only. Use Pinecone, Weaviate, or Chroma for embeddings and similarity search.
• High-throughput production: Free tier caps at 1000 writes/day per namespace. Consider self-hosting or Redis for production workloads.
• Sub-millisecond latency: HTTP round-trips add ~50-200ms. For ultra-low latency, use in-memory stores like Redis.
FAQ
Q: Why not just use Redis for eval memory? A: Redis requires setup, authentication, and infrastructure. BotWire works immediately with zero config, which makes it a good fit for evals, where standing up Redis would be overkill.
Q: Is this actually free? A: Yes, 1000 writes/day per namespace forever. Unlimited reads. No credit card, no trials. Revenue comes from enterprise self-hosting and higher-tier usage.
Q: What about data privacy? A: Your eval data hits our servers. For sensitive testing, self-host the open-source version (MIT license, single FastAPI + SQLite service).
Start Testing Stateful Agents
Stop letting memory bugs slip through your evals. Install BotWire and start testing agents with real persistent state.
pip install botwire
Full docs and self-hosting instructions at https://botwire.dev.