Persistent Memory for LlamaIndex Agents

Free · Open source (MIT) · Works with LangChain, CrewAI, AutoGen · No signup

Your LlamaIndex agent forgets everything between calls. Chat context disappears, personalization resets, and multi-turn conversations break. This happens because QueryEngine and Agent instances are stateless by default — they only remember what you pass in each request. Here's how to add persistent memory that survives restarts and process crashes.

Why LlamaIndex Agents Lose Memory

LlamaIndex agents are designed to be stateless for simplicity and scalability. Each query() or chat() call is independent, which works great for one-shot questions but breaks conversational flows.

Consider this typical pattern:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool

# Load your data
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Create agent (from_tools expects a list of tools, not a raw query engine)
query_tool = QueryEngineTool.from_defaults(
    index.as_query_engine(),
    name="company_docs",
    description="Answers questions about the company documents",
)
agent = ReActAgent.from_tools([query_tool])

# First conversation works
response1 = agent.chat("What's our Q1 revenue?")
print(response1)  # "Q1 revenue was $2.1M"

# But context is lost in follow-ups
response2 = agent.chat("What about Q2?")  
print(response2)  # Agent has no idea what "Q2" refers to

The agent answers the first question but loses context for "What about Q2?" because it doesn't persist the conversation history or learned context between calls.

The Fix: Add Persistent Memory

Install BotWire to add persistent key-value memory that survives across processes:

pip install botwire

Here's the same agent with persistent memory:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.agent import ReActAgent
from llama_index.core.tools import QueryEngineTool
from botwire import Memory

# Initialize persistent memory with a namespace
memory = Memory("user-session-123")

# Load your data and create agent
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_tool = QueryEngineTool.from_defaults(
    index.as_query_engine(),
    name="company_docs",
    description="Answers questions about the company documents",
)
agent = ReActAgent.from_tools([query_tool])

# Store conversation context
def chat_with_memory(user_input):
    # Retrieve previous context
    history = memory.get("chat_history", [])
    
    # Add context to the query
    context = "\n".join([f"User: {h['user']}\nAssistant: {h['assistant']}" 
                        for h in history[-3:]])  # Last 3 exchanges
    full_query = f"Previous context:\n{context}\n\nCurrent question: {user_input}"
    
    # Get response
    response = agent.chat(full_query if context else user_input)
    
    # Store this exchange
    history.append({"user": user_input, "assistant": str(response)})
    memory.set("chat_history", history)
    
    return response

# Now context persists
response1 = chat_with_memory("What's our Q1 revenue?")
response2 = chat_with_memory("What about Q2?")  # Agent remembers Q1 context

How It Works

The Memory class provides persistent key-value storage that survives process restarts. Each namespace is isolated, so you can have separate memory spaces for different users or conversations.

Key operations:

from botwire import Memory

memory = Memory("my-namespace")

# Store any JSON-serializable data
memory.set("user_preferences", {"theme": "dark", "language": "en"})
memory.set("conversation_count", 42)
memory.set("last_topic", "revenue analysis")

# Retrieve with optional defaults
prefs = memory.get("user_preferences", {})
count = memory.get("conversation_count", 0)

# List all keys in this namespace
all_keys = memory.list_keys()

# Delete specific keys
memory.delete("old_data")
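The isolation guarantee can be sketched with a toy in-memory stand-in (`FakeMemory` below is illustrative only, not BotWire's implementation): two callers using the same key in different namespaces never collide.

```python
class FakeMemory:
    """Toy stand-in for botwire.Memory: one dict per namespace."""

    _stores = {}  # class-level registry, keyed by namespace

    def __init__(self, namespace):
        # Instances sharing a namespace share the same backing dict
        self._data = FakeMemory._stores.setdefault(namespace, {})

    def set(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


# Two users, same key, no collision
alice = FakeMemory("user-alice")
bob = FakeMemory("user-bob")
alice.set("chat_history", [{"user": "hi", "assistant": "hello"}])
bob.set("chat_history", [])

print(alice.get("chat_history"))  # [{'user': 'hi', 'assistant': 'hello'}]
print(bob.get("chat_history"))    # []
```

The real client behaves the same way from the caller's perspective, except the backing store lives outside the process.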

For production use, add error handling and memory management:

import time

def chat_with_managed_memory(user_input, user_id):
    memory = Memory(f"user-{user_id}")
    
    try:
        # Limit history to prevent memory bloat
        history = memory.get("chat_history", [])
        if len(history) > 50:  # Keep last 50 exchanges
            history = history[-50:]
            memory.set("chat_history", history)
        
        # Build the prompt from recent history (build_context formats
        # exchanges the same way chat_with_memory does above)
        context = build_context(history)
        response = agent.chat(f"{context}\n{user_input}" if context else user_input)
        
        # Store with metadata
        history.append({
            "user": user_input,
            "assistant": str(response),
            "timestamp": time.time()
        })
        memory.set("chat_history", history)
        
        return response
        
    except Exception as e:
        print(f"Memory error: {e}")
        # Fallback to stateless mode
        return agent.chat(user_input)

Memory persists across processes and machines. Stop your Python script, restart it, and the conversation history is still there.
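That contract is easy to picture with a minimal file-backed stand-in (`FileMemory` here is illustrative, not the BotWire client): a fresh instance created later, as if in a new process, reads back what an earlier one wrote.

```python
import json
import tempfile
from pathlib import Path


class FileMemory:
    """Illustrative stand-in: persists each namespace to a JSON file."""

    def __init__(self, namespace, root):
        self._path = Path(root) / f"{namespace}.json"

    def _load(self):
        if self._path.exists():
            return json.loads(self._path.read_text())
        return {}

    def get(self, key, default=None):
        return self._load().get(key, default)

    def set(self, key, value):
        data = self._load()
        data[key] = value
        self._path.write_text(json.dumps(data))


root = tempfile.mkdtemp()

# First "process": write the conversation history
FileMemory("user-session-123", root).set(
    "chat_history", [{"user": "Q1 revenue?", "assistant": "$2.1M"}]
)

# Second "process": a brand-new instance reads the same data back
restored = FileMemory("user-session-123", root).get("chat_history")
print(restored)  # [{'user': 'Q1 revenue?', 'assistant': '$2.1M'}]
```

The hosted service swaps the JSON file for an HTTP-backed store, which is what lets the same history survive across machines as well as restarts.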

LlamaIndex Chat Engine Integration

For LlamaIndex's built-in chat engines, you can persist the entire chat history:

from llama_index.core.chat_engine import SimpleChatEngine
from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.llms.openai import OpenAI
from botwire import Memory

class PersistentChatMemory(ChatMemoryBuffer):
    def __init__(self, namespace, token_limit=3000):
        super().__init__(token_limit=token_limit)
        # ChatMemoryBuffer is a Pydantic model, so bypass field validation
        # for the extra attribute
        object.__setattr__(self, "_store", Memory(namespace))
        
        # Load existing messages
        for msg_data in self._store.get("chat_messages", []):
            super().put(ChatMessage(role=msg_data["role"], content=msg_data["content"]))
    
    def put(self, message):
        # put() receives a ChatMessage, not a separate role and content
        super().put(message)
        # Persist to BotWire
        messages = [{"role": msg.role.value, "content": msg.content}
                    for msg in self.get_all()]
        self._store.set("chat_messages", messages)

# Use with any LlamaIndex chat engine
llm = OpenAI()
persistent_memory = PersistentChatMemory("user-456")
chat_engine = SimpleChatEngine.from_defaults(
    llm=llm, 
    memory=persistent_memory
)

# Conversation state persists across restarts
response = chat_engine.chat("Remember that I prefer technical details")
# Later, in a new process, rebuild the engine with the same namespace
# ("user-456") and it resumes where it left off
response = chat_engine.chat("Explain the architecture")  # Remembers preference

When NOT to Use BotWire

BotWire isn't the right choice if you need:

Vector similarity search — Use Pinecone, Weaviate, or Qdrant for semantic search over embeddings

High-throughput applications — The HTTP API adds ~50-100ms latency; use Redis for sub-millisecond access

Complex queries — This is key-value storage, not a database; use PostgreSQL for relational data

FAQ

Why not just use Redis? Redis requires setup, authentication, and infrastructure management. BotWire works immediately with no configuration, and the free tier covers most development needs.

Is this actually free? Yes — 1000 writes per day per namespace, 50MB storage, unlimited reads. No credit card required. You can also self-host the open-source version.

What about data privacy? Data is stored on BotWire servers by default. For sensitive applications, self-host using the MIT-licensed code at github.com/pmestre-Forge/signal-api — it's a single FastAPI service with SQLite.
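The storage model behind such a service is simple to picture: a namespaced key-value table. The schema below is a sketch for illustration only, an assumption rather than the actual signal-api schema.

```python
import json
import sqlite3

# Illustrative schema: one row per (namespace, key), value stored as JSON text
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE kv (namespace TEXT, key TEXT, value TEXT, "
    "PRIMARY KEY (namespace, key))"
)

def kv_set(namespace, key, value):
    # Upsert: overwrite the value if the (namespace, key) pair exists
    conn.execute(
        "INSERT INTO kv VALUES (?, ?, ?) "
        "ON CONFLICT(namespace, key) DO UPDATE SET value = excluded.value",
        (namespace, key, json.dumps(value)),
    )

def kv_get(namespace, key, default=None):
    row = conn.execute(
        "SELECT value FROM kv WHERE namespace = ? AND key = ?",
        (namespace, key),
    ).fetchone()
    return json.loads(row[0]) if row else default

kv_set("user-123", "chat_history", [{"user": "hi", "assistant": "hello"}])
print(kv_get("user-123", "chat_history"))  # [{'user': 'hi', 'assistant': 'hello'}]
```

A self-hosted deployment wraps a table like this in HTTP endpoints; the composite primary key is what gives each namespace its isolated key space.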

Get Started

Add persistent memory to your LlamaIndex agents in a few lines of code. Install with pip install botwire and check the docs at https://botwire.dev.

Install in one command:

pip install botwire

Start free at botwire.dev