Introduction to USearch: Core Concepts & Installation

Welcome to Chapter 2! In the previous chapter, we explored the fascinating world of vector embeddings and how they allow us to represent complex data like text or images as numerical vectors. Now, it’s time to learn how to efficiently search through these vectors to find similar items. This is where USearch comes in!

This chapter will be your friendly guide to USearch, an incredibly fast and lightweight library for Approximate Nearest Neighbor (ANN) search. We’ll demystify its core concepts, walk through the straightforward installation process, and get our hands dirty with our very first vector search using Python. By the end, you’ll have a solid foundation for using USearch, paving the way for its powerful integration with ScyllaDB. Ready to dive in? Let’s go!

Prerequisites

Before we jump in, make sure you have:

  • A basic understanding of vector embeddings (covered in Chapter 1).
  • Python 3.8+ installed on your system.
  • pip, Python’s package installer, updated to its latest version.

Core Concepts of USearch

Imagine you have millions of books, and you want to find all books “similar” in content to the one you just read. Reading every book would take forever! Vector search, powered by libraries like USearch, helps you do this almost instantly.

What is USearch?

USearch is an extremely fast and memory-efficient open-source library developed by Unum for performing Approximate Nearest Neighbor (ANN) search. It’s built on a highly optimized C++ core, with bindings available for various languages, including Python, Rust, and Java. Its primary goal is to find vectors that are “closest” to a given query vector in a high-dimensional space.

Why “Approximate”? Great question! When dealing with millions or even billions of vectors, finding the absolute closest vector (Exact Nearest Neighbor) requires comparing the query against every stored vector, which is computationally expensive and slow. ANN algorithms, like those in USearch, sacrifice a tiny bit of accuracy for massive gains in speed and scalability. This means USearch finds vectors that are very likely the closest — often indistinguishable from the exact closest in practical applications — but much, much faster.
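To make the exact-vs-approximate tradeoff concrete, here is what exact nearest-neighbor search looks like as a brute-force NumPy scan. Every query touches every stored vector — exactly the linear cost that ANN indexes avoid (the `exact_nearest` helper is ours, purely for illustration):

```python
import numpy as np

def exact_nearest(query, vectors, k=2):
    """Brute-force exact k-NN: compare the query against every stored vector."""
    dists = np.sum((vectors - query) ** 2, axis=1)  # squared Euclidean distance to each vector
    return np.argsort(dists)[:k]                    # indices of the k closest, nearest first

vectors = np.array([[1.0, 1.0], [5.0, 5.0], [1.1, 0.9]], dtype=np.float32)
query = np.array([1.0, 1.0], dtype=np.float32)
print(exact_nearest(query, vectors))  # IDs of the two nearest vectors
```

An HNSW index answers the same question while inspecting only a small fraction of the stored vectors — that is where the speedup comes from.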

The Magic Behind the Speed: HNSW

USearch primarily leverages an algorithm called Hierarchical Navigable Small World (HNSW). Don’t worry about memorizing the name, but here’s the core idea:

Think of HNSW as building a multi-layered graph where each vector is a “node.”

  • Bottom Layer: Contains all vectors, connected to their immediate neighbors.
  • Upper Layers: Contain fewer vectors, acting as “expressways” to quickly jump across the graph.

When you perform a search, USearch starts at an upper layer, quickly navigating to the general area where your query vector’s neighbors might be. Then, it drops down to lower layers, refining the search until it finds the closest approximate neighbors. This hierarchical structure allows for incredibly fast lookups, even in massive datasets.
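The descend-and-refine idea above can be sketched in a few lines of Python. This toy uses 1-D points and two hand-built layers — real HNSW builds many layers probabilistically and keeps much richer neighbor lists — so treat it only as an illustration of greedy layered search:

```python
import numpy as np

# Toy dataset: six 1-D points
points = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]], dtype=np.float32)

# Upper "expressway" layer: a sparse subset of nodes with a long-range link
upper_links = {0: [4], 4: [0]}
# Bottom layer: every node linked to its immediate neighbors
lower_links = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(points)]
               for i in range(len(points))}

def greedy_step(layer, start, query):
    """Walk a layer greedily, moving to any neighbor closer to the query."""
    current = start
    while True:
        best = min(layer[current], key=lambda n: abs(points[n, 0] - query))
        if abs(points[best, 0] - query) < abs(points[current, 0] - query):
            current = best  # a neighbor is closer: keep walking
        else:
            return current  # local optimum on this layer

query = 4.7
entry = greedy_step(upper_links, 0, query)       # coarse hop on the sparse layer
result = greedy_step(lower_links, entry, query)  # refine on the dense layer
print(entry, result)
```

The upper layer jumps most of the way across the dataset in one hop; the bottom layer then fine-tunes to the truly nearest point.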

Let’s visualize this process at a high level:

flowchart TD
    A[Raw Data: Text, Image, Audio] --> B(Embedding Model)
    B --> C[Vector Embeddings]
    C --> D{USearch Index Creation}
    E[Query Vector] --> F{USearch Index Search}
    D --> F
    F --> G[Similar Vectors Found]
    G --> H[Retrieve Original Data]
    subgraph USearch_Core["USearch Core Operations"]
        D
        F
    end

Key Features of USearch

  • Blazing Fast: Designed for high-performance, low-latency search.
  • Memory Efficient: Optimized to handle large datasets within reasonable memory constraints.
  • Scalable: Can handle millions to billions of vectors.
  • Flexible: Supports various distance metrics (like Cosine, Euclidean, etc.) and data types.
  • Embeddable: Its lightweight nature makes it easy to integrate into applications.
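To ground the “Flexible” bullet: the two distance metrics you’ll meet most often are easy to compute by hand with NumPy. The helper names below are ours, not USearch API, but they match the math the library uses:

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for identical direction, 2 for opposite."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def l2_squared(a, b):
    """Squared Euclidean distance: 0 for identical points."""
    return float(np.sum((a - b) ** 2))

a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])       # same direction as a, different magnitude
print(cosine_distance(a, b))   # 0.0 — cosine ignores magnitude
print(l2_squared(a, b))        # 1.0 — Euclidean does not
```

This difference matters when choosing a metric: cosine compares only direction (common for text embeddings), while squared Euclidean also accounts for vector length.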

USearch and ScyllaDB

You might be wondering, “How does this relate to ScyllaDB?” ScyllaDB, a real-time big data database, has integrated vector search capabilities directly into its core, leveraging libraries like USearch under the hood. This means you can store your vectors in ScyllaDB and perform lightning-fast similarity searches within the database itself, combining the power of a NoSQL database with cutting-edge vector search. We’ll explore this integration in depth in later chapters!

Step-by-Step Installation

Let’s get USearch installed on your system. We’ll be using its Python bindings for our examples.

Step 1: Prepare Your Python Environment

It’s always a good practice to use a virtual environment to manage your project’s dependencies.

Open your terminal or command prompt:

# Create a new virtual environment (if you don't have one)
python3 -m venv usearch_env

# Activate the virtual environment
# On macOS/Linux:
source usearch_env/bin/activate
# On Windows:
# usearch_env\Scripts\activate

You should see (usearch_env) at the beginning of your prompt, indicating the virtual environment is active.

Step 2: Install USearch

Now, let’s install the usearch Python package. USearch is under active development, so we’ll simply grab the latest stable release available via pip.

pip install usearch

This command downloads and installs the usearch package and its dependencies. It will compile the C++ core if a pre-built wheel isn’t available for your system, which might take a moment.

Step 3: Verify Installation

To ensure everything is installed correctly, let’s open a Python interpreter and try importing the library.

python

Once in the Python prompt (>>>), type:

import usearch
print(usearch.__version__)

You should see a version number printed (a 2.x release at the time of writing — whatever the latest stable version pip installed). If you see an error, double-check your installation steps and internet connection.

Type exit() to leave the Python interpreter.

Your First Vector Search

Now that USearch is installed, let’s write a small Python script to perform our very first vector search! We’ll create a simple index, add some dummy vectors, and then query it.

Create a new Python file named first_search.py.

Step 1: Import USearch

At the top of your first_search.py file, we’ll import the necessary components.

# first_search.py
import numpy as np
from usearch.index import Index, MetricKind
  • numpy is a standard Python library for numerical operations, which is excellent for handling vectors.
  • usearch.index provides the Index class and the MetricKind enum — the heart of our vector search library!

Step 2: Define Our Vectors

Let’s create some simple 3-dimensional vectors. In a real-world scenario, these would be generated by an embedding model.

# first_search.py

# ... (previous imports) ...

# Our sample vectors
vectors = np.array([
    [1.0, 1.0, 1.0],  # Vector 0 (ID 0)
    [2.0, 2.0, 2.0],  # Vector 1 (ID 1)
    [0.9, 0.8, 1.1],  # Vector 2 (ID 2) - very similar to Vector 0
    [5.0, 5.0, 5.0],  # Vector 3 (ID 3)
    [0.1, 0.2, 0.3]   # Vector 4 (ID 4)
], dtype=np.float32) # USearch often prefers float32 for performance
  • We’re using np.array to create a NumPy array of our vectors.
  • dtype=np.float32 is specified because float32 is commonly used for embeddings and is often more memory-efficient and faster for USearch.
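The memory difference is easy to verify: for 1,000 vectors of 128 dimensions, float32 halves the footprint relative to NumPy’s default float64. Nothing below is USearch-specific — it’s plain NumPy:

```python
import numpy as np

# NumPy's default dtype for random floats is float64 (8 bytes per value)
vecs64 = np.random.rand(1000, 128)
# Converting to float32 halves the storage (4 bytes per value)
vecs32 = vecs64.astype(np.float32)

print(vecs64.nbytes)  # 1_024_000 bytes
print(vecs32.nbytes)  # 512_000 bytes
```

At millions of vectors, that factor of two often decides whether an index fits in RAM.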

Step 3: Create a USearch Index

Now, let’s initialize our USearch index. We need to tell it the dimensionality of our vectors and the distance metric to use.

# first_search.py

# ... (previous code) ...

# Define the dimensionality of our vectors
dimensions = vectors.shape[1] # This will be 3 for our example

# Create a USearch index
# We specify the dimensions and the distance metric (e.g., 'cosine' or 'l2_squared' for Euclidean)
index = Index(
    ndim=dimensions,        # dimensionality of every vector we'll store
    metric=MetricKind.Cos,  # Cosine similarity is common for embeddings
    dtype='f32'             # store as float32, matching our vectors
)

print(f"USearch index created with {dimensions} dimensions using Cosine similarity.")
  • ndim: This crucial parameter tells USearch the number of features (dimensions) in each of your vectors. It must match your data.
  • metric: Specifies how similarity is calculated.
    • MetricKind.Cos: Cosine distance (1 minus cosine similarity), popular for text embeddings; it compares the angle between vectors and ignores their magnitude.
    • MetricKind.L2sq: Squared Euclidean distance, which measures straight-line distance. Smaller values mean more similar.
  • dtype: The scalar type stored in the index; 'f32' corresponds to np.float32, the usual choice for embeddings.

Step 4: Add Vectors to the Index

We’ll add our vectors to the index. Each vector needs a unique integer ID.

# first_search.py

# ... (previous code) ...

# Add vectors to the index
# We'll use their array indices as their unique IDs
for i, vec in enumerate(vectors):
    index.add(i, vec)  # key first, then the vector
    print(f"Added vector with ID {i} to the index.")

print(f"Index now contains {len(index)} vectors.")
  • index.add(key, vector): This method inserts a vector into the index.
    • key: A unique integer identifier for the vector. This is how you’ll retrieve the original data associated with the vector later.
    • vector: The actual NumPy array representing your vector.

Step 5: Search the Index

Now for the exciting part: querying the index to find similar vectors!

# first_search.py

# ... (previous code) ...

# Define a query vector
# Let's try to find vectors similar to Vector 0 (ID 0)
query_vector = np.array([1.0, 1.0, 0.9], dtype=np.float32) # Slightly different from Vector 0

# Perform a search
# We want the top 2 most similar neighbors
matches = index.search(query_vector, count=2)

print(f"\nSearching for top 2 neighbors of query vector: {query_vector}")
print("Found matches:")
for i, (key, distance) in enumerate(zip(matches.keys, matches.distances)):
    # USearch's Cosine metric reports distance = 1 - cosine similarity,
    # so 0 means identical direction and 2 means opposite direction.
    similarity = 1 - distance
    print(f"  Match {i+1}: ID={key}, Distance={distance:.4f}, Similarity={similarity:.4f}, Original Vector: {vectors[key]}")
  • index.search(query_vector, count): This is the core search method — pass your query vector first, then count, the number of nearest neighbors to retrieve.
  • The matches object returned contains keys (the IDs of the matched vectors) and distances (how “far” each is from the query vector).
  • Important Note on Cosine Distance: USearch’s MetricKind.Cos reports a distance equal to 1 minus the cosine similarity, so it ranges from 0 (identical direction) to 2 (opposite direction). A traditional cosine similarity ranges from 1 (identical) to -1 (opposite); the conversion similarity = 1 - distance recovers it.

Step 6: Run Your Script

Save first_search.py and run it from your terminal:

python first_search.py

You should see output similar to this:

USearch index created with 3 dimensions using Cosine similarity.
Added vector with ID 0 to the index.
Added vector with ID 1 to the index.
Added vector with ID 2 to the index.
Added vector with ID 3 to the index.
Added vector with ID 4 to the index.
Index now contains 5 vectors.

Searching for top 2 neighbors of query vector: [1.  1.  0.9]
Found matches:
  Match 1: ID=0, Distance=0.0012, Similarity=0.9988, Original Vector: [1. 1. 1.]
  Match 2: ID=1, Distance=0.0012, Similarity=0.9988, Original Vector: [2. 2. 2.]

There’s a subtlety here: cosine similarity ignores magnitude, so [2.0, 2.0, 2.0] and [5.0, 5.0, 5.0] point in exactly the same direction as [1.0, 1.0, 1.0]. IDs 0, 1, and 3 therefore tie for the smallest distance (about 0.0012), and USearch may return any two of them, in any order — your output may differ. Our “very similar” Vector 2 ([0.9, 0.8, 1.1]) actually comes next, at a distance of about 0.0161. Switch the metric to MetricKind.L2sq and magnitude matters again: IDs 0 and 2 become the clear winners.
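Whenever an ANN result surprises you, a brute-force check on small data settles it. Computing the exact cosine distances with NumPy shows that [2.0, 2.0, 2.0] and [5.0, 5.0, 5.0] lie in the same direction as [1.0, 1.0, 1.0] and tie with it for the smallest distance:

```python
import numpy as np

vectors = np.array([
    [1.0, 1.0, 1.0],
    [2.0, 2.0, 2.0],
    [0.9, 0.8, 1.1],
    [5.0, 5.0, 5.0],
    [0.1, 0.2, 0.3],
], dtype=np.float32)
query = np.array([1.0, 1.0, 0.9], dtype=np.float32)

# Cosine distance = 1 - cosine similarity, computed against every vector
sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
dists = 1.0 - sims
for i, d in enumerate(dists):
    print(f"ID {i}: cosine distance = {d:.4f}")
```

Running this prints identical distances for IDs 0, 1, and 3, with ID 2 close behind and ID 4 furthest — a useful reminder to pick a metric that matches what “similar” means for your data.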

Mini-Challenge: Explore More Neighbors!

You’ve successfully performed your first vector search! Now, let’s try a small modification to solidify your understanding.

Challenge: Modify the first_search.py script to:

  1. Change the query_vector to one pointing in a genuinely different direction from our current vectors, for example, [-1.0, -2.0, -3.0].
  2. Increase the count parameter in the index.search() call to 3.
  3. Observe the results. Do the top 3 matches make sense given your new query vector?

Hint: Think about how the new query_vector relates to the existing vectors array. Remember that the Cosine metric compares direction only and ignores magnitude — which existing vectors point in a direction closest to your query?

What to Observe/Learn:

  • How changing the query vector affects the search results.
  • How the count parameter determines the number of neighbors returned.
  • The relationship between distance and similarity scores.

Common Pitfalls & Troubleshooting

Even with simple examples, it’s easy to stumble. Here are a few common issues and how to tackle them:

  1. Dimension Mismatch:

    • Pitfall: Creating an index with ndim=3 but then trying to add a 4-dimensional vector or query with a 2-dimensional vector.
    • Troubleshooting: Always ensure index.ndim exactly matches the dimensionality of all vectors you add and query. USearch will raise an error if dimensions don’t match.
  2. Incorrect dtype:

    • Pitfall: Passing Python lists or NumPy arrays with dtype=np.float64 (double precision) when the index was created with dtype=np.float32.
    • Troubleshooting: Explicitly set dtype=np.float32 for your vectors and ensure index is initialized with the same dtype. While float64 usually works, float32 is generally preferred for performance and memory in vector search.
  3. No Results or Unexpected Results:

    • Pitfall: The index is empty, or the metric chosen doesn’t suit your data.
    • Troubleshooting:
      • Check len(index) to ensure vectors were added.
      • Verify the metric (e.g., Cosine for text embeddings, L2sq for geometric distance) is appropriate for your use case.
      • For Cosine distance, remember the similarity = 1 - distance conversion for intuitive similarity scores.
  4. Persistence (Forgetting to Save/Load):

    • Pitfall: You build a large index, close your program, and then realize the index is gone.
    • Troubleshooting: USearch indexes are in-memory by default. To persist them, you need to explicitly index.save('path/to/index.usearch') and index.load('path/to/index.usearch'). We’ll cover this in more detail in a future chapter, but it’s good to be aware of now!
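A cheap guard against pitfalls 1 and 2 is to validate vectors before they ever reach the index. Here is a small helper of our own (not part of USearch) that coerces to float32 and checks dimensionality:

```python
import numpy as np

def validate_vector(vec, ndim: int) -> np.ndarray:
    """Coerce input to a float32 vector and verify its dimensionality."""
    arr = np.asarray(vec, dtype=np.float32)
    if arr.ndim != 1 or arr.shape[0] != ndim:
        raise ValueError(f"expected a 1-D vector of length {ndim}, got shape {arr.shape}")
    return arr

print(validate_vector([1.0, 2.0, 3.0], ndim=3))  # accepted, returned as float32
try:
    validate_vector([1.0, 2.0], ndim=3)           # dimension mismatch
except ValueError as e:
    print(f"Rejected: {e}")
```

Failing fast with a clear message at the boundary of your code is much easier to debug than an error raised from deep inside the index.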

Summary

Phew! You’ve just taken a significant step into the world of vector search. Here’s a quick recap of what we covered:

  • USearch Fundamentals: Learned that USearch is a lightning-fast, memory-efficient open-source library for Approximate Nearest Neighbor (ANN) search.
  • HNSW Algorithm: Got a high-level understanding of how USearch uses hierarchical graphs to achieve its incredible speed.
  • Installation: Successfully installed the usearch Python package.
  • First Search: Wrote and executed your first Python script to create a USearch index, add vectors, and perform a similarity search.
  • Core Parameters: Understood the importance of ndim, metric (especially Cosine and L2sq), and dtype.
  • Troubleshooting: Identified common pitfalls like dimension mismatches and dtype issues.

You’re now equipped with the foundational knowledge and practical skills to start experimenting with USearch! In the next chapter, we’ll delve deeper into advanced indexing techniques and explore how to handle larger datasets more efficiently.
