Welcome to an exciting hands-on project that brings together several concepts we’ve explored: embeddings, natural language processing, and practical application! In this chapter, you’ll learn how to build a semantic search engine from the ground up. Unlike traditional keyword-based search that relies on exact word matches, semantic search understands the meaning and context of your query, providing far more relevant results.
This project is not just about writing code; it’s about understanding the “why” behind each step. You’ll prepare data, generate powerful numerical representations (embeddings) for text, create an efficient index for these embeddings, and finally, perform intelligent searches. This is a fundamental skill for anyone looking to work with modern information retrieval systems, recommendation engines, or large language model (LLM) applications.
Before we dive in, ensure you’re comfortable with Python programming, the basics of machine learning, and have a foundational understanding of what embeddings are and how they represent meaning in vector space. If you need a refresher on embeddings, revisit previous chapters on neural networks and natural language processing. Let’s get started and build something truly smart!
Core Concepts: Understanding Semantic Search
Before we touch any code, let’s solidify our understanding of what semantic search is and the key components that make it tick.
What is Semantic Search? Beyond Keywords
Imagine you’re searching for “how to fix a leaky faucet.” A traditional keyword search might only show results that contain those exact words. But what if a great article talks about “repairing a dripping tap”? A keyword search would likely miss it.
Semantic search solves this by understanding the meaning of your query. It doesn’t just look for keyword overlap; it looks for conceptual similarity. This is crucial for user experience and finding truly relevant information in today’s vast data landscape.
Why is this important for an AI/ML Engineer? Modern AI applications, from chatbots to recommendation systems, rely heavily on understanding context and meaning. Semantic search is a core building block for these intelligent systems.
The Power of Embeddings: Meaning in Numbers
At the heart of semantic search are embeddings. As we’ve learned, embeddings are dense vector representations of text (words, sentences, paragraphs) where semantically similar items are mapped to nearby points in a high-dimensional space. Think of it like a sophisticated coordinate system for meaning.
- What: A list of numbers (a vector) that captures the contextual meaning of a piece of text.
- Why: Allows computers to “understand” text by converting it into a numerical format they can process.
- How: Trained neural networks (often large language models or specialized embedding models) learn to generate these vectors by processing vast amounts of text data.
When you perform a semantic search, you convert your query into an embedding, and then you find other text embeddings that are “close” to your query’s embedding in this vector space.
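To make "close in vector space" concrete, here is a minimal sketch using made-up 3-dimensional vectors (real embeddings have hundreds of dimensions) and cosine similarity, a common closeness measure:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: near 1.0 = similar direction, near 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" -- purely illustrative values.
query = np.array([0.9, 0.1, 0.0])   # e.g. "leaky faucet"
doc_a = np.array([0.8, 0.2, 0.1])   # e.g. "dripping tap"  (similar meaning)
doc_b = np.array([0.0, 0.1, 0.9])   # e.g. "stock market"  (unrelated)

print(cosine_similarity(query, doc_a))  # close to 1.0
print(cosine_similarity(query, doc_b))  # close to 0.0
```

The actual numbers here are invented; the point is the mechanics: similar meanings produce vectors pointing in similar directions, and that is what the search exploits.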
Vector Databases (or Indexes): Finding Neighbors Fast
Once you have thousands or millions of document embeddings, how do you efficiently find the ones closest to your query embedding? You can’t just calculate the distance to every single one – that would be incredibly slow!
This is where vector databases or specialized vector indexing libraries come in. They are optimized for performing “nearest neighbor” searches in high-dimensional spaces. They use clever algorithms to quickly narrow down the search space and find the most similar vectors.
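To see what these libraries optimize away, here is a sketch of the brute-force alternative: computing the distance from the query to every stored vector. It is exact but costs O(N × d) per query, which is precisely what indexing structures avoid at scale (the data below is random, just to exercise the function):

```python
import numpy as np

def brute_force_search(query, vectors, k=3):
    """Exact nearest-neighbor search: compute the L2 distance to EVERY stored vector.
    Fine for thousands of vectors, painfully slow for millions."""
    distances = np.linalg.norm(vectors - query, axis=1)  # one distance per stored vector
    nearest = np.argsort(distances)[:k]                  # indices of the k smallest distances
    return nearest, distances[nearest]

rng = np.random.default_rng(0)
vectors = rng.standard_normal((1000, 384)).astype("float32")  # 1,000 fake 384-d embeddings
query = vectors[42] + 0.01                                    # a query very near vector #42

indices, dists = brute_force_search(query, vectors, k=3)
print(indices[0])  # 42 -- the closest stored vector
```

FAISS's simplest index does essentially this (exactly), while its approximate indexes trade a little accuracy for dramatically fewer comparisons.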
For this project, we’ll use FAISS (Facebook AI Similarity Search), a highly efficient library for similarity search. While not a full-fledged vector database (which would typically add persistence, distributed deployment, and similar features), FAISS provides the core indexing and search capabilities that many vector databases build upon.
Semantic Search Architecture: A Bird’s Eye View
Let’s walk through the entire process. It’s a pipeline with a few distinct stages.
Explanation of the Flow:
- Raw Documents: Your collection of text data (e.g., articles, product descriptions, FAQs).
- Text Preprocessing: Cleaning the text (e.g., lowercasing, removing noise).
- Embedding Model: A pre-trained model (like `SentenceTransformer`) converts text into numerical embeddings.
- Document Embeddings: The vector representations of your documents.
- Vector Index (FAISS): These embeddings are stored and indexed for fast retrieval.
- User Query: The text the user types into the search bar.
- Query Preprocessing: Similar cleaning for the query.
- Embedding Model: The same model converts the query into an embedding. Consistency is key!
- Query Embedding: The vector representation of the user’s search intent.
- Search Index: The query embedding is passed to the FAISS index to find the `k` most similar document embeddings.
- Retrieve Original Documents: The indices of the top `k` embeddings are used to fetch the actual text documents.
- Search Results: The original documents are returned to the user.
This architecture forms the backbone of many modern AI-powered search experiences.
Step-by-Step Implementation: Building Our Engine
Alright, enough theory! Let’s get our hands dirty and build our semantic search engine.
Step 0: Setting Up Your Environment
First, let’s create a dedicated Python environment and install the necessary libraries. This ensures our project dependencies are isolated and managed effectively.
1. Create a Virtual Environment: Open your terminal or command prompt and run:

   ```bash
   python3 -m venv semantic_search_env
   ```

   This creates a new directory named `semantic_search_env` containing a clean Python installation.

2. Activate the Environment:

   - On macOS/Linux: `source semantic_search_env/bin/activate`
   - On Windows (Command Prompt): `semantic_search_env\Scripts\activate.bat`
   - On Windows (PowerShell): `semantic_search_env\Scripts\Activate.ps1`

   You should see `(semantic_search_env)` at the beginning of your prompt, indicating the environment is active.

3. Install Libraries: We’ll need `sentence-transformers` for generating embeddings and `faiss-cpu` for efficient similarity search.

   ```bash
   pip install sentence-transformers faiss-cpu==1.7.4 scikit-learn==1.3.2
   ```

   - `sentence-transformers`: A robust and widely used library for state-of-the-art sentence, paragraph, and image embeddings. It simplifies access to many pre-trained models.
   - `faiss-cpu`: We specify `faiss-cpu` to avoid GPU dependencies for this tutorial. Version `1.7.4` is a stable, commonly used release. If you have a GPU and want to leverage it, you could install `faiss-gpu` instead.
   - `scikit-learn`: We’ll use this briefly for a dataset example. Version `1.3.2` is a stable release.

   Why specific versions? While `pip install package_name` usually gets the latest release, pinning versions (`==X.Y.Z`) ensures reproducibility and avoids potential breaking changes in future releases, which matters for a learning guide. Always check the official documentation for the latest stable versions if you encounter issues.

4. Verify Installations: You can quickly check that the libraries import cleanly:

   ```bash
   python -c "import sentence_transformers; import faiss; print('Libraries installed successfully!')"
   ```

   If you see “Libraries installed successfully!”, you’re good to go!
Step 1: Data Preparation
We need some text data to search through. For simplicity, let’s start with a small, custom list of sentences. Later, we’ll suggest how to use a real-world dataset.
Create a new Python file named semantic_search.py.
```python
# semantic_search.py

# Step 1: Data Preparation
documents = [
    "The quick brown fox jumps over the lazy dog.",
    "A dog is a man's best friend.",
    "Cats are known for their agility and grace.",
    "Machine learning is a fascinating field of artificial intelligence.",
    "Deep learning is a subset of machine learning that uses neural networks.",
    "Artificial intelligence is rapidly transforming industries worldwide.",
    "Python is a popular programming language for data science and AI.",
    "Natural Language Processing (NLP) deals with the interaction between computers and human language.",
    "The sun rises in the east and sets in the west.",
    "Water boils at 100 degrees Celsius at standard atmospheric pressure."
]

print(f"Loaded {len(documents)} documents.")
print("First 3 documents:")
for i, doc in enumerate(documents[:3]):
    print(f"  {i+1}. {doc}")
```
Explanation:
- We define a simple list of strings, `documents`, which will be our searchable content. In a real application, this would come from a database, text files, or an API.
- We print a confirmation and a few examples to ensure our data is loaded correctly.
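As a hedged sketch of the "real application" case, here is one way to load documents from text files instead of hard-coding them (the `docs/` folder name is hypothetical, not part of this project):

```python
from pathlib import Path

def load_documents(folder):
    """Read every .txt file in `folder` and return its contents as a list of strings."""
    paths = sorted(Path(folder).glob("*.txt"))  # sorted for a stable, reproducible order
    return [p.read_text(encoding="utf-8").strip() for p in paths]

# Hypothetical usage -- assumes a docs/ folder containing .txt files:
# documents = load_documents("docs")
```

The rest of the pipeline is unchanged: as long as `documents` is a list of strings, the embedding and indexing steps below work the same way.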
Step 2: Generating Embeddings
Now, let’s turn our text documents into numerical embeddings using a pre-trained SentenceTransformer model.
Add the following code to semantic_search.py after the documents list:
```python
# semantic_search.py
from sentence_transformers import SentenceTransformer
import numpy as np  # We'll need this for array manipulation

# ... (previous code for documents) ...

# Step 2: Generating Embeddings
print("\nLoading SentenceTransformer model...")

# We use 'all-MiniLM-L6-v2' - a good balance of performance and speed for general-purpose tasks.
# It's a Sentence-BERT model fine-tuned on a large dataset for semantic similarity.
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded.")

print(f"Generating embeddings for {len(documents)} documents...")
document_embeddings = model.encode(documents, convert_to_tensor=True, show_progress_bar=True)
print("Embeddings generated.")

print(f"Shape of document embeddings: {document_embeddings.shape}")
print(f"First embedding (first 5 dimensions): {document_embeddings[0][:5].tolist()}")
```
Explanation:
- `from sentence_transformers import SentenceTransformer`: Imports the necessary class.
- `model = SentenceTransformer('all-MiniLM-L6-v2')`: Downloads and loads a pre-trained model. `all-MiniLM-L6-v2` is a popular choice because it’s relatively small and fast, yet provides excellent semantic representations for a wide range of English texts. It’s a good starting point for many applications, and it produces fixed-size embeddings (384 dimensions for this specific model) for sentences.
- `document_embeddings = model.encode(...)`: This is where the magic happens! The `encode` method takes our list of documents and converts each one into its corresponding embedding vector.
  - `convert_to_tensor=True`: Returns embeddings as PyTorch tensors, which is often more efficient for subsequent operations, though `numpy` arrays are also an option.
  - `show_progress_bar=True`: Gives you visual feedback during the encoding process.
- We print the shape of the resulting `document_embeddings` (e.g., `[10, 384]` for 10 documents, each with 384 dimensions) and a snippet of the first embedding.
Step 3: Building a Vector Index with FAISS
Now that we have our embeddings, we need to store them in a way that allows for fast similarity searches. This is where FAISS comes in.
Add the following code to semantic_search.py:
```python
# semantic_search.py
# ... (previous code for imports, documents, and embedding generation) ...
import faiss

# Step 3: Building a Vector Index with FAISS
print("\nBuilding FAISS index...")
embedding_dimension = document_embeddings.shape[1]  # Dimension of our embeddings (e.g., 384)

# We'll use a simple IndexFlatL2 index.
# 'Flat' means it stores all vectors directly without compression.
# 'L2' means it uses Euclidean distance (L2 norm) for similarity calculation.
# For larger datasets, more advanced indexes like IndexIVFFlat or HNSW might be used.
index = faiss.IndexFlatL2(embedding_dimension)

# Before adding, ensure embeddings are on CPU and are numpy arrays with float32 dtype.
# FAISS typically works with numpy arrays and float32.
embeddings_np = document_embeddings.cpu().numpy().astype('float32')
index.add(embeddings_np)

print(f"FAISS index built. Total vectors in index: {index.ntotal}")
```
Explanation:
- `import faiss`: Imports the FAISS library.
- `embedding_dimension = document_embeddings.shape[1]`: Extracts the dimensionality of our embeddings. This is crucial, as FAISS needs to know the vector size.
- `index = faiss.IndexFlatL2(embedding_dimension)`: Initializes a FAISS index. `IndexFlatL2` is the simplest FAISS index: it stores all vectors directly and uses Euclidean distance (L2 norm) to find neighbors. It’s exact but can be slower for extremely large datasets than approximate nearest neighbor (ANN) indexes. It’s perfect for our small example.
- `embeddings_np = document_embeddings.cpu().numpy().astype('float32')`: FAISS expects NumPy arrays, specifically with the `float32` data type. Since our embeddings are PyTorch tensors (possibly on a GPU), we move them to CPU and convert them.
- `index.add(embeddings_np)`: Adds all our document embeddings to the FAISS index.
- `index.ntotal`: Confirms how many vectors are now stored in the index.
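A useful aside on the choice of `IndexFlatL2`: for unit-length (normalized) vectors, ranking by L2 distance and ranking by cosine similarity give the same order, because ||a − b||² = 2 − 2·cos(a, b). The quick numeric check below illustrates the identity with random normalized vectors. (Whether `encode` returns normalized vectors depends on the model; you can force normalization with `normalize_embeddings=True`.)

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(384)
b = rng.standard_normal(384)
a /= np.linalg.norm(a)  # normalize to unit length
b /= np.linalg.norm(b)

l2_squared = np.sum((a - b) ** 2)
cosine = np.dot(a, b)

# For unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
print(l2_squared, 2 - 2 * cosine)  # the two values agree
```

This is why an L2 index works fine here even though "semantic similarity" is usually described in cosine terms.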
Step 4: Performing Semantic Search
With our index built, we can now take a user query, convert it into an embedding, and search our FAISS index for the most similar documents.
Add the final piece of code to semantic_search.py:
```python
# semantic_search.py
# ... (previous code for imports, documents, embedding generation, and FAISS index) ...

# Step 4: Performing Semantic Search
def perform_semantic_search(query_text, k=3):
    """
    Performs a semantic search for the given query and returns the top k relevant documents.
    """
    print(f"\nSearching for: '{query_text}'")

    # Encode the query text into an embedding
    query_embedding = model.encode([query_text], convert_to_tensor=True, show_progress_bar=False)
    query_embedding_np = query_embedding.cpu().numpy().astype('float32')

    # Perform the search in the FAISS index
    # distances: L2 distances, indices: positions of the nearest neighbors
    distances, indices = index.search(query_embedding_np, k)

    print(f"Top {k} results for '{query_text}':")
    results = []
    for i, idx in enumerate(indices[0]):
        # The distances are L2 distances, so smaller values mean closer (more similar).
        # We convert L2 distance to a similarity score for easier interpretation (optional).
        similarity_score = 1 / (1 + distances[0][i])  # Simple inverse for interpretation
        results.append({
            "rank": i + 1,
            "document": documents[idx],
            "distance": distances[0][i],
            "similarity_score": similarity_score
        })
        print(f"  Rank {i+1} (Score: {similarity_score:.4f}, L2 Distance: {distances[0][i]:.4f}): {documents[idx]}")
    return results

# Let's try some queries!
if __name__ == "__main__":
    perform_semantic_search("What is AI?", k=2)
    perform_semantic_search("Tell me about animals", k=3)
    perform_semantic_search("Programming languages for AI", k=1)
    perform_semantic_search("Water temperature for boiling", k=1)
```
Explanation:
- `perform_semantic_search(query_text, k=3)`: This function encapsulates our search logic.
- `query_embedding = model.encode([query_text], ...)`: The same `SentenceTransformer` model encodes our user query into an embedding. It’s crucial to use the same model for both documents and queries so their embeddings live in the same vector space.
- `query_embedding_np = ...`: Converts the query embedding to the `float32` NumPy format required by FAISS.
- `distances, indices = index.search(query_embedding_np, k)`: The core FAISS search operation.
  - `query_embedding_np`: The embedding of our search query.
  - `k`: The number of nearest neighbors (most similar documents) we want to retrieve.
  - It returns `distances` (the L2 distances to the nearest neighbors) and `indices` (the original positions of those documents in our `documents` list).
- The loop iterates through the results, retrieves the original document text using each `idx` from the `documents` list, and prints the rank, distance, and the document itself.
- Similarity Score: We introduce a simple `1 / (1 + distance)` formula to convert L2 distance into a more intuitive "similarity score" where higher is better (a 0-to-1 range that gives a good relative sense). Remember, for raw L2 distance, smaller values mean more similar.
- `if __name__ == "__main__":`: This block ensures our example queries only run when the script is executed directly.
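The `1 / (1 + distance)` mapping is just one convenient monotone transform: it sends distance 0 to score 1.0 and shrinks toward 0 as distance grows, so it preserves the ranking. A quick sanity check:

```python
def l2_to_score(distance):
    """Map an L2 distance (0 = identical) to a (0, 1] score (1.0 = identical)."""
    return 1.0 / (1.0 + distance)

for d in [0.0, 0.5, 1.0, 4.0]:
    print(f"distance {d:.1f} -> score {l2_to_score(d):.3f}")
# distance 0.0 -> score 1.000
# distance 0.5 -> score 0.667
# distance 1.0 -> score 0.500
# distance 4.0 -> score 0.200
```

Any strictly decreasing function of distance would work equally well for ranking; this one is simply easy to read.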
Step 5: Run Your Semantic Search Engine!
Save your semantic_search.py file. Make sure your virtual environment is active, and then run the script from your terminal:
python semantic_search.py
Observe the output! You should see the loaded documents, embedding generation progress, FAISS index creation, and then the results for each of your test queries. Pay attention to how the results are semantically relevant even if they don’t contain exact keywords.
Mini-Challenge: Enhance Your Search
Now it’s your turn to experiment and build on what you’ve learned!
Challenge:
- Use a Larger Dataset: Instead of our custom `documents` list, integrate a real-world text dataset. A great option is the 20 Newsgroups dataset, which is readily available through `scikit-learn`.
  - Hint: You can load it like this:

    ```python
    from sklearn.datasets import fetch_20newsgroups

    newsgroups_data = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))
    documents = newsgroups_data.data
    ```

  - Important: The 20 Newsgroups dataset is larger. You might want to sample a subset (e.g., `documents = documents[:1000]`) if you’re running on a CPU without much RAM, as generating embeddings for tens of thousands of documents can be memory-intensive.
- Experiment with Different Embedding Models: Replace `'all-MiniLM-L6-v2'` with another `SentenceTransformer` model.
  - Hint: Explore models on the Hugging Face Models Hub. Look for models like `paraphrase-MiniLM-L6-v2`, `all-mpnet-base-v2`, or `BAAI/bge-small-en-v1.5`. Be aware that larger models will be slower to load and encode.
- Refine Search Results:
  - Can you add a minimum similarity score threshold to filter out less relevant results?
  - How would you display the source category for 20 Newsgroups results? (The `newsgroups_data.target_names` and `newsgroups_data.target` attributes will be useful.)
What to Observe/Learn:
- How does the size of the dataset impact embedding generation time and memory usage?
- Do different embedding models yield better or worse semantic search results for certain types of queries? Why might that be?
- How does the `k` parameter (number of results) affect the output?
- What challenges arise when working with real-world, potentially noisy text data?
Common Pitfalls & Troubleshooting
Even experienced developers run into issues. Here are some common problems you might encounter and how to tackle them:
- `faiss.FaissError: Error: ...` or Dimension Mismatch:
  - Issue: This often means the dimension of the embeddings you’re trying to add to FAISS doesn’t match the dimension the FAISS index was initialized with.
  - Fix: Double-check `embedding_dimension = document_embeddings.shape[1]` and ensure this value is correctly passed to `faiss.IndexFlatL2()`. Also confirm `model.encode` is consistently producing the expected dimensions.
  - Another cause: Trying to add `float64` (double-precision) NumPy arrays to FAISS, which typically expects `float32`. Use `.astype('float32')` when converting your embeddings to NumPy arrays.
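When you hit a dimension or dtype error, a small sanity check before `index.add` saves debugging time. Here is a hedged sketch (the random array stands in for your real embeddings; the helper name is illustrative):

```python
import numpy as np

def check_embeddings(embeddings, expected_dim):
    """Validate shape and dtype before handing embeddings to FAISS."""
    assert embeddings.ndim == 2, f"expected a 2-D array, got {embeddings.ndim}-D"
    assert embeddings.shape[1] == expected_dim, (
        f"dimension mismatch: index expects {expected_dim}, got {embeddings.shape[1]}"
    )
    assert embeddings.dtype == np.float32, f"FAISS wants float32, got {embeddings.dtype}"

emb = np.random.rand(10, 384).astype("float32")  # stand-in for real embeddings
check_embeddings(emb, expected_dim=384)          # passes silently
```

Calling this right before `index.add(embeddings_np)` turns a cryptic FAISS error into a readable message.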
- Slow Embedding Generation for Large Datasets:
  - Issue: If `model.encode()` is taking a very long time, you’re simply processing a lot of text.
  - Fix:
    - Batching: `SentenceTransformer` handles batching automatically, but you can explicitly set `batch_size` in `model.encode()` if you have specific memory constraints.
    - GPU Acceleration: If you have a compatible GPU, ensure PyTorch (which `SentenceTransformer` uses) is installed with GPU support (a CUDA- or ROCm-enabled build). `faiss-gpu` is also an option for faster indexing.
    - Smaller Models: As suggested in the challenge, use a smaller pre-trained model if absolute top-tier semantic accuracy isn’t critical.
    - Sampling: For initial development, work with a smaller subset of your data.
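Under the hood, batching just means encoding the corpus in fixed-size chunks instead of all at once; `model.encode(documents, batch_size=32)` does this for you. The idea itself is simple, as this dependency-free chunking sketch shows:

```python
def batched(items, batch_size):
    """Yield successive fixed-size chunks of a list (the last chunk may be smaller)."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

docs = [f"document {i}" for i in range(10)]
batches = list(batched(docs, batch_size=4))
print([len(b) for b in batches])  # [4, 4, 2]
```

Smaller batches lower peak memory at the cost of a bit of throughput; larger batches do the opposite.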
- Irrelevant Search Results:
  - Issue: Your search results don’t seem semantically relevant to your query.
  - Fix:
    - Embedding Model Choice: The most common reason. `all-MiniLM-L6-v2` is good, but not perfect for all domains. Try `all-mpnet-base-v2` or specialized models (e.g., medical embeddings for medical text).
    - Data Quality: Is your `documents` list clean and representative? Garbage in, garbage out!
    - Preprocessing: While `SentenceTransformer` models are robust to some noise, excessive special characters or very short, uninformative documents can degrade quality. Basic cleaning might be needed (though it’s often less critical than with classical NLP).
    - Query Formulation: Sometimes the query itself is ambiguous. Encourage users to be more specific.
Remember, debugging is a skill! Read error messages carefully, print intermediate variable shapes and types, and isolate the problematic step.
Summary
Congratulations! You’ve successfully built a foundational semantic search engine. Let’s recap what you’ve accomplished and learned:
- Understood Semantic Search: You now grasp the core difference between keyword and semantic search and why the latter is crucial for modern AI applications.
- Leveraged Embeddings: You used `SentenceTransformer` to convert raw text into dense, meaningful numerical vectors (embeddings), which are the backbone of semantic understanding.
- Implemented Efficient Indexing: You utilized FAISS to create a fast and scalable index for your document embeddings, enabling rapid nearest-neighbor searches.
- Performed Semantic Queries: You learned how to take a user query, embed it, and use the FAISS index to retrieve semantically similar documents.
- Gained Practical Experience: This hands-on project provided a tangible application of deep learning concepts in information retrieval.
This project is a stepping stone. From here, you can explore:
- More Advanced FAISS Indexes: For larger datasets, investigate `IndexIVFFlat`, HNSW-based indexes, and other approximate nearest neighbor (ANN) algorithms in FAISS for better performance.
- Full-Fledged Vector Databases: Explore production-ready vector databases like Pinecone, Weaviate, Milvus, or Chroma, which offer persistence, filtering, and cloud-native features.
- Fine-tuning Embedding Models: For highly specialized domains, you might fine-tune a pre-trained `SentenceTransformer` model on your specific dataset to get even more relevant embeddings.
- Integrating with LLMs: Semantic search is often used as the retrieval step in Retrieval-Augmented Generation (RAG) systems, allowing Large Language Models to answer questions based on up-to-date, domain-specific information.
Keep experimenting, keep building, and keep pushing the boundaries of what you can create with AI!
References
- Sentence-Transformers Documentation: https://www.sbert.net/
- FAISS (Facebook AI Similarity Search) GitHub: https://github.com/facebookresearch/faiss
- Hugging Face Models Hub (Sentence Similarity): https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=downloads
- Scikit-learn `fetch_20newsgroups`: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_20newsgroups.html