Introduction
Welcome back, aspiring AI architect! In our previous chapters, we’ve explored the fascinating world of AI memory systems, understanding different types like working, short-term, long-term, episodic, and semantic memory, and how vector memory plays a crucial role in enabling AI agents to access vast external knowledge. Now, it’s time to bring these concepts to life by building something truly practical: a simple Retrieval Augmented Generation (RAG) agent with integrated memory.
This chapter will guide you, step-by-step, through creating an agent that doesn’t just rely on an LLM’s inherent knowledge but can intelligently fetch relevant information from an external knowledge base and remember past interactions. This combination allows for more accurate, up-to-date, and personalized responses, overcoming the limitations of an LLM’s static training data and its limited context window.
By the end of this chapter, you’ll have a working RAG agent that can answer questions based on custom documents and maintain a coherent conversation. Get ready to put theory into practice and see the power of memory in action!
Understanding RAG with Memory
Before we dive into code, let’s solidify our understanding of what a RAG agent is and how memory enhances it.
What is Retrieval Augmented Generation (RAG)?
Imagine you’re taking an open-book exam. Instead of relying solely on what you’ve memorized, you can consult your notes and textbooks to find the most accurate information. RAG works similarly for Large Language Models (LLMs).
An LLM, by itself, is like a brilliant student who has read many books but can sometimes “hallucinate” (make up facts) or provide outdated information because its knowledge is limited to its training data. RAG addresses this by:
- Retrieval: When a user asks a question, the RAG agent first retrieves relevant documents or snippets from an external knowledge base (like a database of articles, FAQs, or internal company documents).
- Augmentation: It then augments the user’s query with this retrieved information, essentially saying to the LLM, “Here’s the user’s question, and here’s some highly relevant context from our knowledge base. Please use this to answer.”
- Generation: The LLM then generates a response, now informed by the fresh, accurate, and specific data provided by the retrieval step.
This process significantly reduces hallucinations and allows LLMs to interact with real-time or proprietary information.
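The retrieve → augment → generate pipeline can be sketched in a few lines of framework-free Python. Everything here is a stand-in: the keyword-overlap "retrieval" replaces a real vector search, and `generate` merely echoes the prompt where a real LLM call would go.

```python
# A framework-free sketch of the RAG pipeline. The knowledge base, the
# keyword-overlap "retrieval", and the template "generation" are all
# stand-ins for a real vector store and a real LLM.
KNOWLEDGE_BASE = [
    "The capital of France is Paris.",
    "The Amazon rainforest is the largest tropical rainforest.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Rank documents by naive word overlap with the query
    # (a real system compares embedding vectors instead).
    words = set(query.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda doc: len(words & set(doc.lower().split())),
        reverse=True,
    )
    return scored[:k]

def augment(query: str, docs: list[str]) -> str:
    # Prepend the retrieved context to the user's question.
    return f"Context: {' '.join(docs)}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    # A real LLM call would go here; we just echo the prompt.
    return f"[LLM answer based on] {prompt}"

query = "What is the capital of France?"
answer = generate(augment(query, retrieve(query)))
```

The three functions map directly onto the three RAG steps; swapping the stubs for an embedding search and an LLM call yields the real thing.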
Why Add Memory to a RAG Agent?
While RAG is powerful for answering single, factual questions, real-world interactions are rarely one-off. Users ask follow-up questions, refer to previous statements, and expect a degree of continuity. This is where memory becomes indispensable for a RAG agent.
Without memory, each RAG interaction would be stateless. If you asked, “What’s the capital of France?” and then immediately, “What’s its population?”, the agent wouldn’t know “its” refers to France. By incorporating working memory (specifically, chat history), our RAG agent can:
- Maintain Context: Understand the flow of a conversation and answer follow-up questions accurately.
- Personalize Interactions: Potentially remember user preferences or past behaviors (though for this chapter, we’ll focus on conversational context).
- Overcome Context Window Limits: While RAG helps by retrieving relevant external docs, chat history also needs to be managed to fit within the LLM’s context window.
For our simple RAG agent, we’ll primarily use working memory in the form of conversational history to ensure the LLM understands the ongoing dialogue. The external knowledge base, powered by vector memory, serves as our long-term, factual memory.
How Our Simple RAG Agent Works
Let’s visualize the flow of information in our agent:
- User Query: The user asks a question.
- Check Chat History: The agent checks if there’s any previous conversation.
- Combine Query + History: If history exists, the current query is combined with relevant past turns to provide full context. If not, just the raw query is used.
- Retrieve Relevant Documents: This combined context (or just the query) is used to search the Vector Store (our external knowledge base). The vector store, using embeddings, finds documents semantically similar to the query.
- Vector Store (Knowledge Base): This is where our pre-processed documents (converted into numerical embeddings) are stored and indexed, allowing for efficient similarity search.
- Augment Prompt for LLM: The original query, the chat history, and the newly retrieved documents are all packaged into a single, comprehensive prompt for the LLM.
- LLM Generates Response: The LLM processes this augmented prompt and generates a coherent and informed answer.
- Store Query/Response in Chat History: The latest user query and the agent’s response are added to the agent’s working memory (chat history) for future interactions.
- Return Response to User: The answer is presented to the user.
This loop ensures that the agent is always learning from its external knowledge and its ongoing conversation.
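Stripped of any framework, one turn of that loop is only a handful of operations. The sketch below uses stub `search_fn`/`llm_fn` callables (our own names, standing in for the vector store and the LLM) to show where each numbered step happens:

```python
# Illustrative single turn of the agent loop described above.
# 'search_fn' and 'llm_fn' are stubs for the vector store and the LLM.
def run_turn(query: str, chat_history: list[tuple[str, str]],
             search_fn, llm_fn) -> str:
    # Steps 2-3: fold prior turns into the retrieval context.
    history_text = " ".join(q + " " + a for q, a in chat_history)
    search_context = (history_text + " " + query).strip()
    # Step 4: retrieve from the knowledge base.
    docs = search_fn(search_context)
    # Step 6: augment the prompt with history and documents.
    prompt = f"History: {history_text}\nContext: {' '.join(docs)}\nQuestion: {query}"
    # Step 7: generate a response.
    answer = llm_fn(prompt)
    # Step 8: store the turn in working memory for the next iteration.
    chat_history.append((query, answer))
    return answer

history: list[tuple[str, str]] = []
fake_search = lambda q: ["The capital of France is Paris."]
fake_llm = lambda p: "Paris"
run_turn("What is the capital of France?", history, fake_search, fake_llm)
```

After the call, `history` holds the turn, so a follow-up question would be retrieved and answered with that context in scope.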
Step-by-Step Implementation
We’ll use Python and the langchain library to build our agent, pinning langchain to version 0.1.16, langchain-openai to 0.1.1, and faiss-cpu to 1.7.4 so the examples run as written. langchain provides excellent abstractions for working with LLMs, retrievers, and memory components.
Setup Your Environment
First things first, let’s get our workspace ready.
Create a New Project Directory:

```shell
mkdir simple_rag_agent
cd simple_rag_agent
```

Set Up a Virtual Environment: It’s always a good practice to use virtual environments to manage dependencies.

```shell
python3 -m venv venv
source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
```

Install Necessary Libraries:

```shell
pip install langchain==0.1.16 langchain-openai==0.1.1 faiss-cpu==1.7.4 python-dotenv==1.0.1
```

- `langchain`: The core framework for building LLM applications.
- `langchain-openai`: Integrates OpenAI’s LLMs and embeddings with LangChain.
- `faiss-cpu`: A library for efficient similarity search and clustering of dense vectors. We’ll use it for our local vector store.
- `python-dotenv`: Securely loads API keys from a `.env` file.

Get Your OpenAI API Key: You’ll need an OpenAI API key to use their LLMs and embedding models. If you don’t have one, sign up at platform.openai.com and create a new secret key.

Create a `.env` File: In your `simple_rag_agent` directory, create a file named `.env` and add your OpenAI API key:

```shell
OPENAI_API_KEY="your_openai_api_key_here"
```

Important: Never commit your `.env` file to version control (like Git)! Add it to your `.gitignore` file.
Step 1: Initialize LLM and Embeddings
Let’s start by initializing our LLM and embedding model. The embedding model will convert our text documents into numerical vectors (embeddings), and the LLM will generate responses.
Create a file named agent.py:
# agent.py
import os
from dotenv import load_dotenv
# Load environment variables from .env file
load_dotenv()
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
print("Initializing LLM and Embeddings...")
# Initialize the ChatOpenAI model
# We're using gpt-3.5-turbo for cost-effectiveness and good performance.
# You can try gpt-4 if you have access and need higher quality.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)
print(f"LLM initialized: {llm.model_name}")
# Initialize the OpenAI Embeddings model
# This model converts text into numerical vectors (embeddings)
embeddings = OpenAIEmbeddings()
print("Embeddings model initialized.")
print("Setup complete. Ready to build the RAG agent components.")
Explanation:
- `load_dotenv()`: This function from `python-dotenv` loads key-value pairs from your `.env` file into environment variables, making `os.environ["OPENAI_API_KEY"]` accessible.
- `ChatOpenAI`: This class from `langchain_openai` provides an interface to OpenAI’s chat models.
  - `model="gpt-3.5-turbo"`: Specifies the LLM we want to use. You can change this to `gpt-4` for more advanced capabilities if available.
  - `temperature=0.7`: Controls the randomness of the LLM’s output. Higher values (closer to 1.0) make the output more creative; lower values (closer to 0.0) make it more deterministic.
- `OpenAIEmbeddings`: This class converts text into numerical vector representations. These embeddings are crucial for finding semantically similar documents in our vector store.
Run this file to ensure your setup is correct:
python agent.py
You should see output confirming the initialization without errors.
Step 2: Prepare a Simple Knowledge Base
Now, let’s create some data for our RAG agent to retrieve from. We’ll use a few simple text documents and store them in a local FAISS vector store.
Add the following code to agent.py after the embeddings initialization:
# ... (previous code) ...
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import FAISS
print("\nPreparing knowledge base...")
# Define some sample text documents for our knowledge base
raw_documents = [
    "The capital of France is Paris. It is known for its Eiffel Tower and Louvre Museum.",
    "The Amazon rainforest is the largest tropical rainforest in the world.",
    "Python is a high-level, interpreted programming language known for its readability.",
    "Memory in AI agents is crucial for maintaining context and learning from past interactions.",
    "Retrieval Augmented Generation (RAG) combines LLMs with external knowledge bases for more accurate answers."
]
# Create temporary files to simulate loading from actual documents
doc_paths = []
for i, doc_content in enumerate(raw_documents):
    file_name = f"doc_{i}.txt"
    with open(file_name, "w") as f:
        f.write(doc_content)
    doc_paths.append(file_name)
# Load documents from the temporary files
documents = []
for path in doc_paths:
    loader = TextLoader(path)
    documents.extend(loader.load())
# Split documents into smaller chunks
# This is important because LLMs have context window limits.
# Smaller, focused chunks lead to more precise retrieval.
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)
print(f"Split {len(raw_documents)} raw documents into {len(texts)} chunks.")
# Create a FAISS vector store from the document chunks
# FAISS (Facebook AI Similarity Search) is an efficient library for vector search.
# It uses our OpenAIEmbeddings model to convert text chunks into vectors.
vectorstore = FAISS.from_documents(texts, embeddings)
print("FAISS vector store created and populated.")
# Clean up temporary files
for path in doc_paths:
    os.remove(path)
print("Temporary document files cleaned up.")
Explanation:
- `raw_documents`: A list of strings representing our raw knowledge. In a real application, these would be loaded from files, databases, or APIs.
- `TextLoader`: A LangChain utility to load text from files. We’re creating temporary files to demonstrate this.
- `CharacterTextSplitter`: This is a crucial step for RAG. Large documents are split into smaller, manageable chunks.
  - `chunk_size`: The maximum size of each chunk (in characters).
  - `chunk_overlap`: A small overlap between chunks helps maintain context when information spans chunk boundaries.
  - Why split? LLMs have limited context windows. We want to retrieve only the most relevant information, not entire long documents; splitting enables more granular and precise retrieval.
- `FAISS.from_documents(texts, embeddings)`: This is where the magic of vector memory happens!
  - Each text chunk is converted into an embedding (a numerical vector) using our `OpenAIEmbeddings` model.
  - These embeddings are then stored and indexed in the FAISS vector store, making them searchable by similarity. When we query later, FAISS finds the document embeddings closest to our query embedding.
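Under the hood, "searchable by similarity" just means nearest-neighbor search over vectors. A toy version with hand-made 3-dimensional "embeddings" (real OpenAI embeddings have around 1,500 dimensions) shows the idea FAISS implements at scale:

```python
import math

# Toy 3-d "embeddings" for two documents. Real embedding models produce
# vectors with ~1,500 dimensions; the geometry is the same.
doc_vectors = {
    "Paris is the capital of France.": [0.9, 0.1, 0.0],
    "Python is a programming language.": [0.0, 0.2, 0.9],
}

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_document(query_vector: list[float]) -> str:
    # This is conceptually what FAISS does (much faster, over millions
    # of vectors): return the document whose vector is closest to the query's.
    return max(doc_vectors,
               key=lambda d: cosine_similarity(doc_vectors[d], query_vector))

# A query vector pointing "towards" the France document:
best = nearest_document([0.8, 0.2, 0.1])
```

FAISS adds indexing structures so this nearest-neighbor lookup stays fast even with millions of stored vectors.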
Step 3: Implement the Retriever
With our vector store ready, we need a way to retrieve relevant documents from it. LangChain’s as_retriever() method makes this incredibly simple.
Add this line to agent.py after the vectorstore creation:
# ... (previous code) ...
# Convert the vectorstore into a retriever
# The retriever's job is to take a query and return relevant documents from the vector store.
retriever = vectorstore.as_retriever()
print("Retriever initialized from vector store.")
Explanation:
- `vectorstore.as_retriever()`: This method turns our `FAISS` vector store into a `Retriever` object. When we pass a query to this retriever, it uses the `embeddings` model internally to convert the query into a vector, searches the FAISS index for similar document vectors, and returns the text content of the top-k (by default, 4) most similar documents. You can adjust this with `vectorstore.as_retriever(search_kwargs={"k": 5})`.
Step 4: Integrate Working Memory (Chat History)
To make our agent conversational, we’ll use ConversationBufferMemory from langchain. This memory type stores the entire conversation history.
Add the following to agent.py:
# ... (previous code) ...
from langchain.memory import ConversationBufferMemory
print("\nInitializing conversational memory...")
# Initialize ConversationBufferMemory to store chat history
# 'memory_key' is the key under which the history will be stored in the chain's input.
# 'return_messages=True' ensures the history is returned as a list of message objects.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
print("ConversationBufferMemory initialized.")
Explanation:
- `ConversationBufferMemory`: This is a simple form of working memory that stores all previous messages (user input and AI output) in a buffer.
- `memory_key="chat_history"`: This tells the LangChain chain where to find the chat history in its input dictionary.
- `return_messages=True`: This setting ensures the history is returned as a list of message objects (e.g., `HumanMessage`, `AIMessage`), which is the preferred format for modern chat LLMs.
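What `ConversationBufferMemory` does can be approximated in a few lines: a buffer of (role, content) messages exposed through the same save/load shape LangChain uses. The `BufferMemory` class below is our own illustrative stand-in, not LangChain code:

```python
# Minimal stand-in for ConversationBufferMemory: record every turn and
# return the full history under a configurable key.
class BufferMemory:
    def __init__(self, memory_key: str = "chat_history"):
        self.memory_key = memory_key
        self.messages: list[tuple[str, str]] = []  # (role, content) pairs

    def save_context(self, inputs: dict, outputs: dict) -> None:
        # Mirror LangChain's API shape: record the human turn, then the AI turn.
        self.messages.append(("human", inputs["input"]))
        self.messages.append(("ai", outputs["output"]))

    def load_memory_variables(self, _: dict) -> dict:
        # Return the buffered history under the configured key.
        return {self.memory_key: list(self.messages)}

memory_demo = BufferMemory()
memory_demo.save_context({"input": "What is the capital of France?"},
                         {"output": "Paris."})
history = memory_demo.load_memory_variables({})["chat_history"]
```

The real class additionally wraps each entry in `HumanMessage`/`AIMessage` objects, but the buffering behavior is exactly this: append on save, return everything on load.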
Step 5: Build the RAG Chain
Now, let’s connect all the pieces: the LLM, the retriever, and the memory, into a coherent RAG chain using LangChain’s expression language.
Add the following to agent.py:
# ... (previous code) ...
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
print("\nBuilding the RAG chain...")
# 1. Define the RAG prompt template
# This prompt guides the LLM on how to use the provided context and chat history.
# - 'context': This is where the retrieved documents will be placed.
# - 'chat_history': This is where the conversation memory will be placed.
# - 'input': This is the user's current query.
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant. Answer the user's questions based on the provided context and chat history. If you don't know the answer, state that you don't know, rather than making up an answer."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("user", "Context: {context}\n\nQuestion: {input}")
])
print("RAG prompt template created.")
# 2. Create a chain to combine documents and generate an answer
# This chain takes the retrieved documents and the user's question, and formats them
# according to the prompt, then passes it to the LLM.
document_combiner = create_stuff_documents_chain(llm, prompt)
print("Document combiner chain created.")
# 3. Create the full retrieval chain
# This chain orchestrates the entire RAG process:
# - It first retrieves documents using our 'retriever'.
# - It then passes the documents, input, and chat history to the 'document_combiner'.
rag_chain = create_retrieval_chain(retriever, document_combiner)
print("Full RAG chain created.")
Explanation:
- `ChatPromptTemplate.from_messages(...)`: This defines the structure of the prompt sent to the LLM.
  - `("system", ...)`: Provides instructions to the LLM about its role and how to behave.
  - `MessagesPlaceholder(variable_name="chat_history")`: This is where our `ConversationBufferMemory` will inject the entire chat history, so the LLM sees past turns.
  - `("user", "Context: {context}\n\nQuestion: {input}")`: This is the current user’s input, augmented with the retrieved documents in `{context}`.
  - Crucially, the LLM receives the system instruction, the full chat history, the retrieved documents, and the current question, allowing it to generate a highly informed response.
- `create_stuff_documents_chain(llm, prompt)`: This chain takes a list of `Document` objects (our retrieved context) and “stuffs” them into the prompt, along with the `input` and `chat_history`, before passing everything to the `llm` for generation.
- `create_retrieval_chain(retriever, document_combiner)`: This is the top-level chain. It coordinates:
  - Taking the user `input`.
  - Using the `retriever` to get relevant `documents`.
  - Passing these `documents`, the original `input`, and `chat_history` to the `document_combiner` for final prompt assembly and LLM generation.
Step 6: Test the Agent
Let’s put our RAG agent to the test! We’ll run a loop where you can ask questions and see how it responds, leveraging both its external knowledge and conversational memory.
Add the following to agent.py at the very end:
# ... (previous code) ...
print("\nStarting RAG agent conversation. Type 'exit' to quit.")
while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        print("Exiting conversation. Goodbye!")
        break

    # Invoke the RAG chain with the current input;
    # the chat history is loaded explicitly from the 'memory' object.
    response = rag_chain.invoke({
        "input": user_input,
        "chat_history": memory.load_memory_variables({})["chat_history"],
    })

    # The response contains the answer, retrieved documents, and the original input
    agent_answer = response["answer"]
    print(f"Agent: {agent_answer}")

    # Store the current interaction in memory for the next turn
    memory.save_context({"input": user_input}, {"output": agent_answer})
print("RAG agent demonstration complete.")
Explanation:
- `while True:` loop: This creates an interactive command-line interface for our agent.
- `rag_chain.invoke(...)`: This is the core call to our RAG agent.
  - `"input": user_input`: The current question from the user.
  - `"chat_history": memory.load_memory_variables({})["chat_history"]`: We explicitly pass the current chat history from our `memory` object; the `MessagesPlaceholder` in our prompt will use this.
- `agent_answer = response["answer"]`: `create_retrieval_chain` returns a dictionary, and the final generated answer is under the `"answer"` key.
- `memory.save_context(...)`: After the agent generates a response, we save both the user’s input and the agent’s output into our `ConversationBufferMemory`. This updates the `chat_history` for the next turn, enabling the agent to remember what was just discussed.
Now, run the complete agent.py file:
python agent.py
Try these questions:
- “What is the capital of France?” (Should retrieve from documents)
- “What is it known for?” (Should use chat history to infer “it” means France and retrieve details)
- “Tell me about Python.” (Should retrieve from documents)
- “What is RAG?” (Should retrieve from documents)
- “Why is memory important in AI agents?” (Should retrieve from documents)
- “What is the largest rainforest?” (Should retrieve from documents)
Observe how the agent uses the retrieved context and remembers the topic of discussion.
Mini-Challenge: Dynamic Knowledge Update
Our current knowledge base is static once the agent.py script starts. What if new information becomes available?
Challenge: Modify the agent.py script to allow you to add a new document to the vector store after the agent has started. Then, ask a question that can only be answered by this new document.
Hint:
- Look for a method on the `FAISS` vector store object that allows adding new documents. You’ll need to wrap the new text in `Document` objects and then split them, just like we did initially.
- You might need to re-create the retriever from `vectorstore.as_retriever()`, or ensure it points to the updated vector store. In practice, LangChain’s retriever references the underlying vector store, so adding documents to the `vectorstore` directly should work.
What to Observe/Learn:
- How easily can an agent’s knowledge base be updated in real-time?
- Does the agent immediately utilize the new information?
- What are the implications for agents that need to learn constantly?
A possible solution to the Mini-Challenge:
# ... (previous agent.py code up to the while loop) ...
from langchain_core.documents import Document  # needed to wrap raw text

# Function to add new documents to the vector store
def add_new_knowledge(new_text_content: str):
    print(f"\nAdding new knowledge: '{new_text_content}'")
    new_docs = [Document(page_content=new_text_content)]
    new_texts = text_splitter.split_documents(new_docs)  # reuse the same text_splitter
    vectorstore.add_documents(new_texts)
    print("New knowledge added to vector store.")

print("\nStarting RAG agent conversation. Type 'exit' to quit. Type 'add_doc:' followed by text to add new knowledge.")
while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        print("Exiting conversation. Goodbye!")
        break
    elif user_input.lower().startswith('add_doc:'):
        new_doc_content = user_input[len('add_doc:'):].strip()
        add_new_knowledge(new_doc_content)
        continue  # No need to save context for 'add_doc' commands

    response = rag_chain.invoke({
        "input": user_input,
        "chat_history": memory.load_memory_variables({})["chat_history"],
    })
    agent_answer = response["answer"]
    print(f"Agent: {agent_answer}")
    memory.save_context({"input": user_input}, {"output": agent_answer})

print("RAG agent demonstration complete.")
How to test the solution:
- Run the modified `agent.py`.
- Ask: “What is the fastest land animal?” (The agent should say it doesn’t know, as this isn’t in the original docs.)
- Type: `add_doc: The cheetah is the fastest land animal, capable of running up to 120 km/h over short distances.`
- Ask again: “What is the fastest land animal?” (The agent should now answer correctly!)
Common Pitfalls & Troubleshooting
Building RAG agents can be tricky. Here are some common issues and how to address them:
“Agent doesn’t know the answer, even though the info is in the documents!”
- Problem: This often means your retriever isn’t finding the correct documents.
- Troubleshooting:
  - Chunk size: Are your document chunks too large or too small? If too large, irrelevant information might dilute the relevant part. If too small, critical context might be split across multiple chunks, making retrieval harder. Experiment with `chunk_size` and `chunk_overlap`.
  - Embeddings quality: While OpenAI’s embeddings are generally good, if your domain is very niche, you might consider fine-tuning an embedding model or using a domain-specific one.
  - Retrieval `k`: The retriever fetches a fixed number of top-k similar documents. If `k` is too low, it might miss relevant ones; if too high, it might introduce too much irrelevant noise into the LLM’s context. Try adjusting `k` (e.g., `retriever = vectorstore.as_retriever(search_kwargs={"k": 5})`).
  - Query formulation: Sometimes the way the user’s query (or the query combined with chat history) is embedded doesn’t perfectly match the document embeddings. This is harder to fix but can be improved with better prompt engineering.
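To get a feel for how `chunk_size` and `chunk_overlap` interact, here is a bare-bones fixed-size splitter. Note that LangChain's `CharacterTextSplitter` additionally prefers to cut on separators like blank lines, so its chunk boundaries will differ from this naive version:

```python
# Bare-bones fixed-size splitter with overlap. LangChain's
# CharacterTextSplitter also respects separators (e.g. "\n\n"),
# so real chunk boundaries land differently.
def split_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    step = chunk_size - chunk_overlap  # how far the window advances each time
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = split_text("abcdefghij", chunk_size=4, chunk_overlap=2)
```

Each chunk shares its last `chunk_overlap` characters with the start of the next one, which is exactly what keeps information intact when a fact straddles a boundary.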
“Agent forgets context or gets confused in long conversations.”
- Problem: The LLM’s context window is finite. If your `chat_history` grows too long, the oldest messages might be truncated, or the sheer volume of tokens might overwhelm the LLM.
- Troubleshooting:
  - Summarization memory: Instead of `ConversationBufferMemory`, use `ConversationSummaryBufferMemory` or `ConversationSummaryMemory` (from `langchain.memory`). These automatically summarize older turns, keeping the context concise.
  - Context window management: For very long conversations, you might need more advanced strategies, such as retrieving only the most relevant past chat turns rather than the entire history.
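A crude form of such context management is to keep only the most recent messages that fit a size budget, dropping the oldest first. The sketch below is ours, for illustration; summary memories like `ConversationSummaryBufferMemory` go further and condense evicted turns with an LLM instead of discarding them:

```python
# Keep the newest messages that fit within a rough character budget,
# dropping the oldest first. Summary memories would instead replace the
# evicted turns with an LLM-generated summary.
def trim_history(messages: list[str], max_chars: int) -> list[str]:
    kept: list[str] = []
    used = 0
    for msg in reversed(messages):  # walk from newest to oldest
        if used + len(msg) > max_chars:
            break  # everything older than this is dropped
        kept.append(msg)
        used += len(msg)
    return list(reversed(kept))  # restore chronological order

history = ["turn one " * 10, "turn two", "turn three"]
trimmed = trim_history(history, max_chars=30)
```

In a real agent you would measure tokens rather than characters, but the eviction policy (newest-first retention under a budget) is the same.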
API Key or Environment Variable Issues:
- Problem: `AuthenticationError` or similar messages from OpenAI.
- Troubleshooting:
  - Double-check your `.env` file for typos.
  - Ensure `load_dotenv()` is called before initializing `ChatOpenAI` or `OpenAIEmbeddings`.
  - Verify your OpenAI API key is active and has sufficient credits.
Slow Performance for Large Knowledge Bases:
- Problem: As your `FAISS` vector store grows, retrieval can become slower.
- Troubleshooting:
  - For local development, `faiss-cpu` is fine. For production, consider `faiss-gpu` or a dedicated vector database (such as Pinecone, Weaviate, Qdrant, or Azure Cosmos DB for NoSQL with vector search), which are optimized for scale and speed.
Summary
Phew! You’ve just built a fully functional RAG agent that uses both external knowledge (via vector memory) and conversational memory to provide intelligent, context-aware responses. This is a huge leap forward in creating more capable and engaging AI agents!
Here are the key takeaways from this chapter:
- RAG (Retrieval Augmented Generation) extends LLMs by allowing them to retrieve and incorporate information from external knowledge bases, reducing hallucinations and enabling access to up-to-date or proprietary data.
- Vector Memory (implemented with `FAISS` and `OpenAIEmbeddings`) is fundamental to RAG, allowing for efficient semantic search of documents.
- Working Memory (Chat History), using `ConversationBufferMemory`, is crucial for RAG agents to maintain conversational context, understand follow-up questions, and provide coherent dialogue.
- LangChain provides powerful abstractions (`ChatOpenAI`, `OpenAIEmbeddings`, `FAISS`, `ConversationBufferMemory`, `create_retrieval_chain`) to simplify the construction of complex AI agents.
- Document Chunking is a critical preprocessing step for RAG to ensure efficient and precise retrieval within LLM context window limits.
- Building effective RAG agents involves careful consideration of prompt engineering, retriever configuration, and memory management to optimize performance and relevance.
You’ve taken a significant step in understanding and implementing advanced AI agent capabilities. In the next chapter, we’ll delve deeper into more sophisticated memory patterns and how to manage them for even more intelligent and persistent agents.
References
- LangChain Documentation: https://python.langchain.com/docs/
- Microsoft AI Agents for Beginners - Agent Memory: https://github.com/microsoft/ai-agents-for-beginners/blob/main/13-agent-memory/README.md
- OpenAI Cookbook - Context Personalization for Agents: https://github.com/openai/openai-cookbook/blob/main/examples/agents_sdk/context_personalization.ipynb
- Azure Cosmos DB for NoSQL - Agentic Memories: https://learn.microsoft.com/en-us/azure/cosmos-db/gen-ai/agentic-memories
- FAISS GitHub Repository: https://github.com/facebookresearch/faiss