Introduction

Welcome back, aspiring AI architect! In our previous chapters, we’ve explored the fascinating world of AI memory systems, understanding different types like working, short-term, long-term, episodic, and semantic memory, and how vector memory plays a crucial role in enabling AI agents to access vast external knowledge. Now, it’s time to bring these concepts to life by building something truly practical: a simple Retrieval Augmented Generation (RAG) agent with integrated memory.

This chapter will guide you, step-by-step, through creating an agent that doesn’t just rely on an LLM’s inherent knowledge but can intelligently fetch relevant information from an external knowledge base and remember past interactions. This combination allows for more accurate, up-to-date, and personalized responses, overcoming the limitations of an LLM’s static training data and its limited context window.

By the end of this chapter, you’ll have a working RAG agent that can answer questions based on custom documents and maintain a coherent conversation. Get ready to put theory into practice and see the power of memory in action!

Understanding RAG with Memory

Before we dive into code, let’s solidify our understanding of what a RAG agent is and how memory enhances it.

What is Retrieval Augmented Generation (RAG)?

Imagine you’re taking an open-book exam. Instead of relying solely on what you’ve memorized, you can consult your notes and textbooks to find the most accurate information. RAG works similarly for Large Language Models (LLMs).

An LLM, by itself, is like a brilliant student who has read many books but can sometimes “hallucinate” (make up facts) or provide outdated information because its knowledge is limited to its training data. RAG addresses this by:

  1. Retrieval: When a user asks a question, the RAG agent first retrieves relevant documents or snippets from an external knowledge base (like a database of articles, FAQs, or internal company documents).
  2. Augmentation: It then augments the user’s query with this retrieved information, essentially saying to the LLM, “Here’s the user’s question, and here’s some highly relevant context from our knowledge base. Please use this to answer.”
  3. Generation: The LLM then generates a response, now informed by the fresh, accurate, and specific data provided by the retrieval step.

This process significantly reduces hallucinations and allows LLMs to interact with real-time or proprietary information.
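To make the three steps concrete, here is a framework-free sketch of the retrieve, augment, generate loop. The keyword-overlap "retriever" and the stub "LLM" are deliberately naive stand-ins for the real components we build later in this chapter:

```python
# A minimal, framework-free sketch of the three RAG steps.
# The "retriever" is naive keyword overlap and the "LLM" is a stub;
# both stand in for the real embedding-based components built below.

KNOWLEDGE_BASE = [
    "The capital of France is Paris.",
    "Python is a high-level programming language.",
]

def retrieve(query: str, docs: list[str]) -> str:
    # Step 1: pick the document sharing the most words with the query.
    def overlap(doc: str) -> int:
        return len(set(query.lower().split()) & set(doc.lower().split()))
    return max(docs, key=overlap)

def augment(query: str, context: str) -> str:
    # Step 2: prepend the retrieved context to the user's question.
    return f"Context: {context}\n\nQuestion: {query}"

def generate(prompt: str) -> str:
    # Step 3: a real system would call an LLM here; we just echo the prompt.
    return f"[LLM answer based on]\n{prompt}"

query = "What is the capital of France?"
print(generate(augment(query, retrieve(query, KNOWLEDGE_BASE))))
```

The real agent replaces keyword overlap with embedding similarity and the stub with an actual LLM call, but the data flow is exactly this.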

Why Add Memory to a RAG Agent?

While RAG is powerful for answering single, factual questions, real-world interactions are rarely one-off. Users ask follow-up questions, refer to previous statements, and expect a degree of continuity. This is where memory becomes indispensable for a RAG agent.

Without memory, each RAG interaction would be stateless. If you asked, “What’s the capital of France?” and then immediately, “What’s its population?”, the agent wouldn’t know “its” refers to France. By incorporating working memory (specifically, chat history), our RAG agent can:

  • Maintain Context: Understand the flow of a conversation and answer follow-up questions accurately.
  • Personalize Interactions: Potentially remember user preferences or past behaviors (though for this chapter, we’ll focus on conversational context).
  • Overcome Context Window Limits: While RAG helps by retrieving relevant external docs, chat history also needs to be managed to fit within the LLM’s context window.

For our simple RAG agent, we’ll primarily use working memory in the form of conversational history to ensure the LLM understands the ongoing dialogue. The external knowledge base, powered by vector memory, serves as our long-term, factual memory.

How Our Simple RAG Agent Works

Let’s visualize the flow of information in our agent:

flowchart TD
    User_Query[User Query] --> A[1. Agent Receives Query]
    A --> B{2. Has Chat History?}
    B -->|Yes| C[3. Combine Query + History]
    B -->|No| D[3. Use Raw Query]
    C --> E[4. Retrieve Relevant Documents]
    D --> E
    E --> F[5. Vector Store]
    F --> E
    E --> G[6. Augment Prompt for LLM]
    G --> H[7. LLM Generates Response]
    H --> I[8. Store Query/Response in Chat History]
    I --> J[9. Return Response to User]
    J --> User_Query
  1. User Query: The user asks a question.
  2. Check Chat History: The agent checks if there’s any previous conversation.
  3. Combine Query + History: If history exists, the current query is combined with relevant past turns to provide full context. If not, just the raw query is used.
  4. Retrieve Relevant Documents: This combined context (or just the query) is used to search the Vector Store (our external knowledge base). The vector store, using embeddings, finds documents semantically similar to the query.
  5. Vector Store (Knowledge Base): This is where our pre-processed documents (converted into numerical embeddings) are stored and indexed, allowing for efficient similarity search.
  6. Augment Prompt for LLM: The original query, the chat history, and the newly retrieved documents are all packaged into a single, comprehensive prompt for the LLM.
  7. LLM Generates Response: The LLM processes this augmented prompt and generates a coherent and informed answer.
  8. Store Query/Response in Chat History: The latest user query and the agent’s response are added to the agent’s working memory (chat history) for future interactions.
  9. Return Response to User: The answer is presented to the user.

This loop ensures that the agent always draws on both its external knowledge and its ongoing conversation.

Step-by-Step Implementation

We’ll use Python and the langchain library to build our agent (this chapter pins langchain 0.1.16, langchain-openai 0.1.1, and faiss-cpu 1.7.4 so the examples run as written). langchain provides excellent abstractions for working with LLMs, retrievers, and memory components.

Setup Your Environment

First things first, let’s get our workspace ready.

  1. Create a New Project Directory:

    mkdir simple_rag_agent
    cd simple_rag_agent
    
  2. Set Up a Virtual Environment: It’s always a good practice to use virtual environments to manage dependencies.

    python3 -m venv venv
    source venv/bin/activate # On Windows, use `venv\Scripts\activate`
    
  3. Install Necessary Libraries:

    pip install langchain==0.1.16 langchain-openai==0.1.1 faiss-cpu==1.7.4 python-dotenv==1.0.1
    
    • langchain: The core framework for building LLM applications.
    • langchain-openai: Integrates OpenAI’s LLMs and embeddings with LangChain.
    • faiss-cpu: A library for efficient similarity search and clustering of dense vectors. We’ll use it for our local vector store.
    • python-dotenv: To securely load API keys from a .env file.
  4. Get Your OpenAI API Key: You’ll need an OpenAI API key to use their LLMs and embedding models. If you don’t have one, sign up at platform.openai.com and create a new secret key.

  5. Create a .env File: In your simple_rag_agent directory, create a file named .env and add your OpenAI API key:

    OPENAI_API_KEY="your_openai_api_key_here"
    

    Important: Never commit your .env file to version control (like Git)! Add it to your .gitignore file.

Step 1: Initialize LLM and Embeddings

Let’s start by initializing our LLM and embedding model. The embedding model will convert our text documents into numerical vectors (embeddings), and the LLM will generate responses.

Create a file named agent.py:

# agent.py
import os
from dotenv import load_dotenv

# Load environment variables from .env file
load_dotenv()

from langchain_openai import ChatOpenAI, OpenAIEmbeddings

print("Initializing LLM and Embeddings...")

# Initialize the ChatOpenAI model
# We're using gpt-3.5-turbo for cost-effectiveness and good performance.
# You can try gpt-4 if you have access and need higher quality.
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.7)
print(f"LLM initialized: {llm.model_name}")

# Initialize the OpenAI Embeddings model
# This model converts text into numerical vectors (embeddings)
embeddings = OpenAIEmbeddings()
print("Embeddings model initialized.")

print("Setup complete. Ready to build the RAG agent components.")

Explanation:

  • load_dotenv(): This function from python-dotenv loads key-value pairs from your .env file into environment variables, making os.environ["OPENAI_API_KEY"] accessible.
  • ChatOpenAI: This class from langchain_openai provides an interface to OpenAI’s chat models.
    • model="gpt-3.5-turbo": Specifies the LLM we want to use. You can change this to gpt-4 for more advanced capabilities if available.
    • temperature=0.7: Controls the randomness of the LLM’s output. Higher values (closer to 1.0) make the output more creative; lower values (closer to 0.0) make it more deterministic.
  • OpenAIEmbeddings: This class converts text into numerical vector representations. These embeddings are crucial for finding semantically similar documents in our vector store.

Run this file to ensure your setup is correct:

python agent.py

You should see output confirming the initialization without errors.

Step 2: Prepare a Simple Knowledge Base

Now, let’s create some data for our RAG agent to retrieve from. We’ll use a few simple text documents and store them in a local FAISS vector store.

Add the following code to agent.py after the embeddings initialization:

# ... (previous code) ...

from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.vectorstores import FAISS

print("\nPreparing knowledge base...")

# Define some sample text documents for our knowledge base
raw_documents = [
    "The capital of France is Paris. It is known for its Eiffel Tower and Louvre Museum.",
    "The Amazon rainforest is the largest tropical rainforest in the world.",
    "Python is a high-level, interpreted programming language known for its readability.",
    "Memory in AI agents is crucial for maintaining context and learning from past interactions.",
    "Retrieval Augmented Generation (RAG) combines LLMs with external knowledge bases for more accurate answers."
]

# Create temporary files to simulate loading from actual documents
doc_paths = []
for i, doc_content in enumerate(raw_documents):
    file_name = f"doc_{i}.txt"
    with open(file_name, "w") as f:
        f.write(doc_content)
    doc_paths.append(file_name)

# Load documents from the temporary files
documents = []
for path in doc_paths:
    loader = TextLoader(path)
    documents.extend(loader.load())

# Split documents into smaller chunks
# This is important because LLMs have context window limits.
# Smaller, focused chunks lead to more precise retrieval.
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
texts = text_splitter.split_documents(documents)
print(f"Split {len(raw_documents)} raw documents into {len(texts)} chunks.")

# Create a FAISS vector store from the document chunks
# FAISS (Facebook AI Similarity Search) is an efficient library for vector search.
# It uses our OpenAIEmbeddings model to convert text chunks into vectors.
vectorstore = FAISS.from_documents(texts, embeddings)
print("FAISS vector store created and populated.")

# Clean up temporary files
for path in doc_paths:
    os.remove(path)
print("Temporary document files cleaned up.")

Explanation:

  • raw_documents: A list of strings representing our raw knowledge. In a real application, these would be loaded from files, databases, or APIs.
  • TextLoader: A LangChain utility to load text from files. We’re creating temporary files to demonstrate this.
  • CharacterTextSplitter: This is a crucial step for RAG. Large documents are split into smaller, manageable chunks.
    • chunk_size: The maximum size of each chunk (in characters).
    • chunk_overlap: A small overlap between chunks helps maintain context if information spans across chunk boundaries.
    • Why split? LLMs have limited context windows. We want to retrieve only the most relevant information, not entire long documents. Splitting helps achieve more granular and precise retrieval.
  • FAISS.from_documents(texts, embeddings): This is where the magic of vector memory happens!
    • Each text chunk is converted into an embedding (a numerical vector) using our OpenAIEmbeddings model.
    • These embeddings are then stored and indexed in the FAISS vector store, making them searchable by similarity. When we query later, FAISS will find the document embeddings closest to our query embedding.
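The effect of chunk_size and chunk_overlap is easiest to see on a tiny string. This is a simplified fixed-width chunker, not LangChain's actual algorithm (CharacterTextSplitter splits on separators), but it shows how overlapping chunks share characters across their boundaries:

```python
# Simplified illustration of chunking with overlap (NOT LangChain's
# real splitter, which respects separators): each chunk repeats the
# last `overlap` characters of the previous chunk.
def chunk(text: str, size: int, overlap: int) -> list[str]:
    step = size - overlap  # how far the window advances each time
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

print(chunk("abcdefghij", size=4, overlap=2))
# -> ['abcd', 'cdef', 'efgh', 'ghij']
```

Note how a fact straddling position 4 ("d...e") still appears whole inside the second chunk; that is exactly what chunk_overlap buys you at retrieval time.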

Step 3: Implement the Retriever

With our vector store ready, we need a way to retrieve relevant documents from it. LangChain’s as_retriever() method makes this incredibly simple.

Add this line to agent.py after the vectorstore creation:

# ... (previous code) ...

# Convert the vectorstore into a retriever
# The retriever's job is to take a query and return relevant documents from the vector store.
retriever = vectorstore.as_retriever()
print("Retriever initialized from vector store.")

Explanation:

  • vectorstore.as_retriever(): This method turns our FAISS vector store into a Retriever object. When we pass a query to this retriever, it will use the embeddings model internally to convert the query into a vector, search the FAISS index for similar document vectors, and return the original text content of the top-k (default usually 4) most similar documents.
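What the retriever does internally can be sketched with toy two-dimensional "embeddings" (real OpenAI embeddings have ~1,500 dimensions, and FAISS uses an optimized index rather than a linear scan, but the ranking logic is the same):

```python
# The retriever's core idea with toy 2-D vectors: score every document
# embedding against the query embedding by cosine similarity and return
# the top-k. FAISS does this at scale with an optimized index.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

doc_vectors = {
    "Paris is the capital of France.": (1.0, 0.1),
    "Python is a programming language.": (0.1, 1.0),
    "The Eiffel Tower is in Paris.": (0.9, 0.2),
}

def top_k(query_vec, k=2):
    ranked = sorted(doc_vectors, key=lambda d: cosine(query_vec, doc_vectors[d]),
                    reverse=True)
    return ranked[:k]

print(top_k((1.0, 0.0), k=2))  # the two France/Paris documents rank first
```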

Step 4: Integrate Working Memory (Chat History)

To make our agent conversational, we’ll use ConversationBufferMemory from langchain. This memory type stores the entire conversation history.

Add the following to agent.py:

# ... (previous code) ...

from langchain.memory import ConversationBufferMemory

print("\nInitializing conversational memory...")

# Initialize ConversationBufferMemory to store chat history
# 'memory_key' is the key under which the history will be stored in the chain's input.
# 'return_messages=True' ensures the history is returned as a list of message objects.
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
print("ConversationBufferMemory initialized.")

Explanation:

  • ConversationBufferMemory: This is a simple form of working memory that stores all previous messages (user input and AI output) in a buffer.
    • memory_key="chat_history": This tells the LangChain chain where to find the chat history in its input dictionary.
    • return_messages=True: This setting ensures the history is returned as a list of Message objects (e.g., HumanMessage, AIMessage), which is the preferred format for modern chat LLMs.
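To demystify what ConversationBufferMemory does with its buffer, here is a stdlib-only mimic of its two key methods (a simplification of the real class, which stores typed HumanMessage/AIMessage objects rather than plain tuples):

```python
# A stdlib-only mimic of ConversationBufferMemory's two key methods.
# The real class stores HumanMessage/AIMessage objects; tuples suffice
# to show the save/load round trip.
class BufferMemory:
    def __init__(self, memory_key="chat_history"):
        self.memory_key = memory_key
        self.buffer = []  # grows by two entries per conversational turn

    def save_context(self, inputs, outputs):
        self.buffer.append(("human", inputs["input"]))
        self.buffer.append(("ai", outputs["output"]))

    def load_memory_variables(self, _):
        # Returned under memory_key, matching what the prompt expects.
        return {self.memory_key: list(self.buffer)}

mem = BufferMemory()
mem.save_context({"input": "Hi"}, {"output": "Hello!"})
print(mem.load_memory_variables({})["chat_history"])
# -> [('human', 'Hi'), ('ai', 'Hello!')]
```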

Step 5: Build the RAG Chain

Now, let’s connect all the pieces: the LLM, the retriever, and the memory, into a coherent RAG chain using LangChain’s expression language.

Add the following to agent.py:

# ... (previous code) ...

from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain

print("\nBuilding the RAG chain...")

# 1. Define the RAG prompt template
# This prompt guides the LLM on how to use the provided context and chat history.
# - 'context': This is where the retrieved documents will be placed.
# - 'chat_history': This is where the conversation memory will be placed.
# - 'input': This is the user's current query.
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant. Answer the user's questions based on the provided context and chat history. If you don't know the answer, state that you don't know, rather than making up an answer."),
    MessagesPlaceholder(variable_name="chat_history"),
    ("user", "Context: {context}\n\nQuestion: {input}")
])
print("RAG prompt template created.")

# 2. Create a chain to combine documents and generate an answer
# This chain takes the retrieved documents and the user's question, and formats them
# according to the prompt, then passes it to the LLM.
document_combiner = create_stuff_documents_chain(llm, prompt)
print("Document combiner chain created.")

# 3. Create the full retrieval chain
# This chain orchestrates the entire RAG process:
# - It first retrieves documents using our 'retriever'.
# - It then passes the documents, input, and chat history to the 'document_combiner'.
rag_chain = create_retrieval_chain(retriever, document_combiner)
print("Full RAG chain created.")

Explanation:

  1. ChatPromptTemplate.from_messages(...): This defines the structure of the prompt sent to the LLM.
    • ("system", ...): Provides instructions to the LLM about its role and how to behave.
    • MessagesPlaceholder(variable_name="chat_history"): This is where our ConversationBufferMemory will inject the entire chat history. The LLM will see past turns.
    • ("user", "Context: {context}\n\nQuestion: {input}"): This is the current user’s input, augmented with the {context} (retrieved documents).
    • Crucially: The LLM receives the system instruction, the full chat history, the retrieved documents, and the current question, allowing it to generate a highly informed response.
  2. create_stuff_documents_chain(llm, prompt): This chain takes a list of Document objects (our retrieved context) and “stuffs” them into the prompt, along with the input and chat_history, before passing everything to the llm for generation.
  3. create_retrieval_chain(retriever, document_combiner): This is the top-level chain. It coordinates:
    • Taking the user input.
    • Using the retriever to get relevant documents.
    • Passing these documents, the original input, and chat_history to the document_combiner for final prompt assembly and LLM generation.

Step 6: Test the Agent

Let’s put our RAG agent to the test! We’ll run a loop where you can ask questions and see how it responds, leveraging both its external knowledge and conversational memory.

Add the following to agent.py at the very end:

# ... (previous code) ...

print("\nStarting RAG agent conversation. Type 'exit' to quit.")

while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        print("Exiting conversation. Goodbye!")
        break

    # Invoke the RAG chain with the current input and memory
    # The 'memory' object automatically handles adding/retrieving chat_history.
    response = rag_chain.invoke({"input": user_input, "chat_history": memory.load_memory_variables({})["chat_history"]})

    # The response contains the answer, retrieved documents, and the original input
    agent_answer = response["answer"]
    print(f"Agent: {agent_answer}")

    # Store the current interaction in memory for the next turn
    memory.save_context({"input": user_input}, {"output": agent_answer})

print("RAG agent demonstration complete.")

Explanation:

  • while True: loop: This creates an interactive command-line interface for our agent.
  • rag_chain.invoke(...): This is the core call to our RAG agent.
    • "input": user_input: The current question from the user.
    • "chat_history": memory.load_memory_variables({})["chat_history"]: We explicitly pass the current chat history from our memory object. The MessagesPlaceholder in our prompt will use this.
  • agent_answer = response["answer"]: The create_retrieval_chain returns a dictionary, and the final generated answer is under the "answer" key.
  • memory.save_context(...): After the agent generates a response, we save both the user’s input and the agent’s output into our ConversationBufferMemory. This updates the chat_history for the next turn, enabling the agent to remember what was just discussed.

Now, run the complete agent.py file:

python agent.py

Try these questions:

  • “What is the capital of France?” (Should retrieve from documents)
  • “What is it known for?” (Should use chat history to infer “it” means France and retrieve details)
  • “Tell me about Python.” (Should retrieve from documents)
  • “What is RAG?” (Should retrieve from documents)
  • “Why is memory important in AI agents?” (Should retrieve from documents)
  • “What is the largest rainforest?” (Should retrieve from documents)

Observe how the agent uses the retrieved context and remembers the topic of discussion.

Mini-Challenge: Dynamic Knowledge Update

Our current knowledge base is static once the agent.py script starts. What if new information becomes available?

Challenge: Modify the agent.py script to allow you to add a new document to the vector store after the agent has started. Then, ask a question that can only be answered by this new document.

Hint:

  • Look for a method on the FAISS vector store object that allows adding new documents. You’ll need to convert the new text into Document objects and then split them, just like we did initially.
  • You might need to re-create the retriever if you’re using a simple vectorstore.as_retriever() or ensure it points to the updated vector store. LangChain’s retriever usually references the underlying vector store, so adding documents to the vectorstore directly should work.

What to Observe/Learn:

  • How easily can an agent’s knowledge base be updated in real-time?
  • Does the agent immediately utilize the new information?
  • What are the implications for agents that need to learn constantly?
Click for a possible solution to the Mini-Challenge!
# ... (previous agent.py code up to the while loop) ...

# Function to add new documents to the vector store
from langchain_core.documents import Document  # Document wraps raw text for the splitter

def add_new_knowledge(new_text_content: str):
    print(f"\nAdding new knowledge: '{new_text_content}'")
    new_docs = [Document(page_content=new_text_content)]
    new_texts = text_splitter.split_documents(new_docs) # Use the same text_splitter
    vectorstore.add_documents(new_texts)
    print("New knowledge added to vector store.")

print("\nStarting RAG agent conversation. Type 'exit' to quit. Type 'add_doc:' followed by text to add new knowledge.")

while True:
    user_input = input("You: ")
    if user_input.lower() == 'exit':
        print("Exiting conversation. Goodbye!")
        break
    elif user_input.lower().startswith('add_doc:'):
        new_doc_content = user_input[len('add_doc:'):].strip()
        add_new_knowledge(new_doc_content)
        # No need to save context for 'add_doc' command
        continue # Skip to next loop iteration

    response = rag_chain.invoke({"input": user_input, "chat_history": memory.load_memory_variables({})["chat_history"]})
    agent_answer = response["answer"]
    print(f"Agent: {agent_answer}")
    memory.save_context({"input": user_input}, {"output": agent_answer})

print("RAG agent demonstration complete.")

How to test the solution:

  1. Run the modified agent.py.
  2. Ask: “What is the fastest land animal?” (Agent should say it doesn’t know, as it’s not in the original docs).
  3. Type: add_doc: The cheetah is the fastest land animal, capable of running up to 120 km/h over short distances.
  4. Ask again: “What is the fastest land animal?” (Agent should now answer correctly!)

Common Pitfalls & Troubleshooting

Building RAG agents can be tricky. Here are some common issues and how to address them:

  1. “Agent doesn’t know the answer, even though the info is in the documents!”

    • Problem: This often means your retriever isn’t finding the correct documents.
    • Troubleshooting:
      • Chunk Size: Are your document chunks too large or too small? If too large, irrelevant information might dilute the relevant part. If too small, critical context might be split across multiple chunks, making retrieval harder. Experiment with chunk_size and chunk_overlap.
      • Embeddings Quality: While OpenAI’s embeddings are generally good, if your domain is very niche, you might consider fine-tuning an embedding model or using a domain-specific one.
      • Retrieval k: The retriever typically fetches a certain number of top-k similar documents. If k is too low, it might miss relevant ones. If k is too high, it might introduce too much irrelevant noise into the LLM’s context. Try adjusting k via search_kwargs (e.g., retriever = vectorstore.as_retriever(search_kwargs={"k": 5})).
      • Query Formulation: Sometimes, the way the user’s query (or the query combined with chat history) is embedded doesn’t perfectly match the document embeddings. This is harder to fix but can be improved with better prompt engineering.
  2. “Agent forgets context or gets confused in long conversations.”

    • Problem: The LLM’s context window is finite. If your chat_history grows too long, the oldest messages might be truncated, or the sheer volume of tokens might overwhelm the LLM.
    • Troubleshooting:
      • Summarization Memory: Instead of ConversationBufferMemory, use ConversationSummaryBufferMemory or ConversationSummaryMemory (from langchain.memory). These automatically summarize old conversations, keeping the context concise.
      • Context Window Management: For very long conversations, you might need more advanced strategies, such as retrieving only the most relevant past chat turns, not the entire history.
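The second strategy above can be sketched without LangChain: keep the last few turns verbatim and fold older ones into a running summary. Here the "summary" is a crude topic list; a real system would produce it with an LLM call (as ConversationSummaryBufferMemory does):

```python
# Sketch of summary-plus-recency memory: recent turns stay verbatim,
# older turns collapse into a short summary string. A real system would
# generate the summary with an LLM rather than joining the questions.
def trim_history(history, max_turns=3):
    if len(history) <= max_turns:
        return None, history
    older, recent = history[:-max_turns], history[-max_turns:]
    summary = "Earlier topics: " + "; ".join(q for q, _ in older)
    return summary, recent

history = [
    ("capital of France?", "Paris"),
    ("its population?", "about 2 million"),
    ("largest rainforest?", "the Amazon"),
    ("what is RAG?", "Retrieval Augmented Generation"),
]
summary, recent = trim_history(history, max_turns=2)
print(summary)   # compact stand-in for the older turns
print(recent)    # the last two turns, kept verbatim
```

The prompt then carries summary + recent instead of the full buffer, keeping token usage roughly constant as the conversation grows.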
  3. API Key or Environment Variable Issues:

    • Problem: AuthenticationError or similar messages from OpenAI.
    • Troubleshooting:
      • Double-check your .env file for typos.
      • Ensure load_dotenv() is called before initializing ChatOpenAI or OpenAIEmbeddings.
      • Verify your OpenAI API key is active and has sufficient credits.
  4. Slow Performance for Large Knowledge Bases:

    • Problem: As your FAISS vector store grows, retrieval can become slower.
    • Troubleshooting:
      • For local development, faiss-cpu is fine. For production, consider faiss-gpu or dedicated vector databases (like Pinecone, Weaviate, Qdrant, Azure Cosmos DB for NoSQL with vector search) which are optimized for scale and speed.

Summary

Phew! You’ve just built a fully functional RAG agent that uses both external knowledge (via vector memory) and conversational memory to provide intelligent, context-aware responses. This is a huge leap forward in creating more capable and engaging AI agents!

Here are the key takeaways from this chapter:

  • RAG (Retrieval Augmented Generation) extends LLMs by allowing them to retrieve and incorporate information from external knowledge bases, reducing hallucinations and enabling access to up-to-date or proprietary data.
  • Vector Memory (implemented with FAISS and OpenAIEmbeddings) is fundamental to RAG, allowing for efficient semantic search of documents.
  • Working Memory (Chat History), using ConversationBufferMemory, is crucial for RAG agents to maintain conversational context, understand follow-up questions, and provide coherent dialogue.
  • LangChain provides powerful abstractions (ChatOpenAI, OpenAIEmbeddings, FAISS, ConversationBufferMemory, create_retrieval_chain) to simplify the construction of complex AI agents.
  • Document Chunking is a critical preprocessing step for RAG to ensure efficient and precise retrieval within LLM context window limits.
  • Building effective RAG agents involves careful consideration of prompt engineering, retriever configuration, and memory management to optimize performance and relevance.

You’ve taken a significant step in understanding and implementing advanced AI agent capabilities. In the next chapter, we’ll delve deeper into more sophisticated memory patterns and how to manage them for even more intelligent and persistent agents.
