Introduction: The Art of Measuring Closeness

Welcome to Chapter 8! In our journey with USearch and ScyllaDB, we’ve learned how to transform data into numerical vectors and store them for lightning-fast searches. But what exactly does “search for similar vectors” truly mean? How do we define “similarity” in a world of numbers?

The answer lies in vector distance metrics. Just like you might measure the distance between two cities on a map, we need a way to quantify how “far apart” or “close together” two vectors are in their multi-dimensional space. The choice of metric is paramount, as it directly impacts the relevance and accuracy of your search results. A “similar” item according to one metric might be quite different according to another!

In this chapter, you’ll learn:

  • What vector distance metrics are and why they’re so crucial for effective vector search.
  • The most common metrics: Euclidean Distance, Cosine Similarity, and Dot Product, including their mathematical intuition and practical applications.
  • How to select the appropriate metric for your specific data and use case.
  • How USearch allows you to easily specify these metrics, and how ScyllaDB leverages them for its integrated vector search.

This chapter builds on your understanding of vector embeddings and basic USearch operations. Get ready to refine your search capabilities by mastering the art of measuring closeness!

Core Concepts: Defining Similarity in Vector Space

At its heart, vector search is about finding vectors that are “close” to a query vector. But “close” isn’t a universal term. It depends entirely on the mathematical function we use to calculate the distance or similarity between two vectors. Let’s dive into the core concepts.

What are Vector Distance Metrics?

Imagine you have two friends, Alice and Bob, and you know their GPS coordinates. How would you measure “how close” they are?

  • You could draw a straight line between their current positions and measure its length. This is like Euclidean Distance.
  • You could consider the angle formed by drawing lines from a central point (like the Earth’s center) to each of them. This is akin to Cosine Similarity, focusing on direction.

In vector search, a distance metric is a function that takes two vectors as input and returns a single numerical value representing their dissimilarity. Generally, a smaller distance implies greater similarity. Conversely, a similarity metric returns a value where a larger value implies greater similarity. It’s important to keep this distinction in mind as we explore.

The choice of distance metric directly influences which results your vector search returns as “most similar.” Different metrics emphasize different aspects of your data:

  • Do you care about the absolute difference in values across all dimensions (e.g., for physical properties or raw sensor readings)?
  • Or do you care more about the overall direction or topic represented by the vector, regardless of its “strength” or magnitude (e.g., for text or image embeddings)?

Selecting the right metric ensures your search results are truly relevant to your application’s definition of similarity.

Common Vector Distance Metrics

Let’s explore the most widely used metrics:

1. Euclidean Distance (L2 Distance)

  • What it is: Often called L2 distance, this is the most intuitive measure of distance between two points in space. It’s the length of the straight line segment connecting the two points. Think of it as the “as the crow flies” distance.
  • Intuition: It’s calculated by taking the square root of the sum of the squared differences between corresponding components of the two vectors.
  • When to use it:
    • When the magnitude of the vector components is meaningful and contributes to similarity.
    • For data where absolute differences are important, such as geographic coordinates, sensor readings, or features where a larger value genuinely means “more” of something.
    • When vectors are not normalized and their length (magnitude) carries information.
  • USearch Context: USearch typically offers MetricKind.L2sq (Squared Euclidean Distance). The square root operation is computationally expensive and doesn’t change the relative ranking of distances, so using the squared version is a common optimization for speed. A smaller L2sq value indicates higher similarity.
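The arithmetic is easy to verify by hand. Here is a quick NumPy sketch, using two of the chapter's sample vectors, that computes both the true Euclidean distance and the squared variant that L2sq uses:

```python
import numpy as np

# Two of the chapter's sample 3-D vectors
apple = np.array([0.1, 0.9, 0.2], dtype=np.float32)
orange = np.array([0.2, 0.8, 0.3], dtype=np.float32)

# Squared Euclidean (L2sq): sum of squared component-wise differences
l2_squared = np.sum((apple - orange) ** 2)

# True Euclidean (L2): the square root of the squared form
l2 = np.sqrt(l2_squared)

print(f"L2 squared: {l2_squared:.4f}")
print(f"L2:         {l2:.4f}")
```

Because the square root is monotonic, ranking neighbors by l2_squared gives exactly the same order as ranking by l2, which is why skipping the root is a safe optimization.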

2. Cosine Similarity (Angular Distance)

  • What it is: Cosine similarity measures the cosine of the angle between two vectors. It focuses purely on the direction of the vectors, ignoring their magnitude (length).
  • Intuition: If two vectors point in exactly the same direction, the angle between them is 0 degrees, and its cosine is 1 (perfect similarity). If they point in opposite directions, the angle is 180 degrees, and its cosine is -1 (perfect dissimilarity). If they are orthogonal (perpendicular), the angle is 90 degrees, and its cosine is 0 (no similarity).
  • When to use it:
    • Extremely popular for text embeddings (like those from BERT or OpenAI models), image features, and recommendation systems.
    • When you want to find items that are conceptually similar, regardless of how “strong” or “long” their vector representation is. For example, a short, well-written product review can be just as on-topic as a long, rambling one, even if the long review yields a vector with a larger magnitude.
    • Often used with normalized vectors (vectors with a length of 1).
  • USearch Context: USearch uses MetricKind.Cos. It returns 1 - cosine_similarity as a distance. So, a smaller distance value (closer to 0) means a higher cosine_similarity (closer to 1), indicating greater similarity.
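To make the magnitude-invariance concrete, here is a small NumPy sketch that computes cosine similarity directly, along with the 1 - cosine_similarity form reported as a distance. Note that scaling a vector by 10 leaves the result unchanged:

```python
import numpy as np

apple = np.array([0.1, 0.9, 0.2], dtype=np.float32)
orange = np.array([0.2, 0.8, 0.3], dtype=np.float32)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Dot product divided by the product of the two magnitudes
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

cos_sim = cosine_similarity(apple, orange)
cos_distance = 1.0 - cos_sim  # the "distance" form of this metric

# Scaling changes magnitude but not direction, so the cosine is unaffected
cos_sim_scaled = cosine_similarity(10 * apple, orange)

print(f"cosine similarity:    {cos_sim:.4f}")
print(f"cosine 'distance':    {cos_distance:.4f}")
print(f"after scaling by 10:  {cos_sim_scaled:.4f}")
```

This invariance is exactly why cosine works well for embeddings whose lengths carry no meaning.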

3. Dot Product (Inner Product)

  • What it is: The dot product of two vectors is the sum of the products of their corresponding components. It measures how much one vector “goes in the direction” of another, taking into account both direction and magnitude.
  • Intuition:
    • If vectors are normalized (unit length), the dot product is exactly equal to the cosine similarity.
    • If vectors are not normalized, a larger dot product means the vectors are more aligned and have larger magnitudes.
  • When to use it:
    • In recommendation systems where a user’s preference vector (magnitude reflecting strength of preference) combined with item vectors (direction reflecting item characteristics) is important.
    • When both the direction and the magnitude of the vectors are meaningful for your definition of similarity.
    • Be cautious: if vectors are unbounded and not normalized, a query vector might be “similar” to a very long, irrelevant vector just because of its magnitude.
  • USearch Context: USearch uses MetricKind.IP. For this metric, USearch typically returns -dot_product as the distance. Therefore, a smaller (more negative) distance value implies a larger dot product, indicating higher similarity.
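The magnitude caveat above is easy to demonstrate. In this NumPy sketch (the vectors are illustrative, not from any real embedding model), a large off-topic vector out-scores a directionally similar one on raw dot product:

```python
import numpy as np

query = np.array([0.1, 0.9, 0.2], dtype=np.float32)      # our query vector
on_topic = np.array([0.2, 0.8, 0.3], dtype=np.float32)   # similar direction, small magnitude
off_topic = np.array([5.0, 5.0, 5.0], dtype=np.float32)  # different direction, huge magnitude

ip_on_topic = float(np.dot(query, on_topic))
ip_off_topic = float(np.dot(query, off_topic))

# The off-topic vector "wins" on raw dot product purely through magnitude
print(f"dot(query, on_topic)  = {ip_on_topic:.2f}")
print(f"dot(query, off_topic) = {ip_off_topic:.2f}")
# With a -dot_product distance, off_topic gets the smaller (more
# negative) value and would therefore rank first
```

If this behavior is not what your application wants, normalize your vectors first or switch to cosine similarity.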

Choosing the Right Metric: A Decision Flow

Selecting the optimal metric isn’t always straightforward, but this general flow can guide your decision:

flowchart TD
    Start[Start: Choose Metric] --> A{Are vector magnitudes important for similarity?}
    A -->|Yes| B{Are vectors typically normalized or bounded?}
    B -->|Yes| C[Consider Dot Product]
    B -->|No| D[Consider Euclidean Distance]
    A -->|No| E[Consider Cosine Similarity]
    C --> F[Test and Evaluate Your Data]
    D --> F
    E --> F
    F --> G[End: Optimal Metric Chosen]

Explanation:

  • If the strength or scale of your vector components matters (e.g., a vector [10, 20] is “stronger” than [1, 2]), then magnitude is important.
    • If these magnitudes are also bounded or normalized (e.g., all vectors have a maximum length or are scaled to unit length), Dot Product might be a good fit.
    • If magnitudes are arbitrary and you care about the absolute difference across dimensions, Euclidean Distance is often better.
  • If only the direction or topic matters, and the length of the vector is irrelevant or even misleading (common for many learned embeddings), then Cosine Similarity is usually the best choice.

How USearch and ScyllaDB Use Them

USearch provides direct control over the distance metric. When you create an Index object, you explicitly pass a MetricKind enum. This tells USearch how to compute distances during indexing and searching, allowing it to apply specific optimizations for each metric.

ScyllaDB’s Vector Search integrates these concepts seamlessly. While the underlying implementation may leverage USearch or similar high-performance libraries, ScyllaDB abstracts this complexity. When you perform a vector search with an ANN OF query in CQL, you can specify the desired distance metric directly. This ensures that the search performed by ScyllaDB aligns with how your vectors were generated and how you define similarity for your application. It’s crucial that the metric used during embedding generation matches the metric you select in your ScyllaDB ANN OF query; otherwise, results will be inconsistent.

Step-by-Step Implementation: USearch in Action with Different Metrics

Let’s get hands-on and see how different metrics affect search results using USearch. We’ll use a simple set of 3-dimensional vectors to represent conceptual items.

Prerequisites

  • Python 3.8+
  • The usearch library installed.
    • You can install it via pip: pip install usearch (check the USearch GitHub releases for the latest stable version, and pin a specific version in production).
  • The numpy library: pip install numpy.

Step 1: Prepare Your Environment and Sample Vectors

First, let’s import the necessary libraries and define some sample vectors. These vectors are purely illustrative; in a real application, they would come from an embedding model.

import numpy as np
from importlib.metadata import version
from usearch.index import Index, MetricKind

print(f"USearch version: {version('usearch')}")  # Confirm the installed version

# Define some sample 3-dimensional vectors
# These vectors are intentionally chosen to highlight differences
# 'apple' and 'orange' are somewhat close in all dimensions
# 'banana' is quite different
# 'fruit_bowl' is somewhat in between 'apple'/'orange' and 'banana'
vectors = {
    "apple": np.array([0.1, 0.9, 0.2], dtype=np.float32),
    "orange": np.array([0.2, 0.8, 0.3], dtype=np.float32),
    "banana": np.array([0.7, 0.1, 0.8], dtype=np.float32),
    "fruit_bowl": np.array([0.3, 0.7, 0.4], dtype=np.float32)
}

print("Sample vectors defined.")

Explanation:

  • We import numpy for efficient array handling and Index, MetricKind from usearch.index.
  • importlib.metadata.version("usearch") reports which USearch release is installed. Checking this matters because the Python API has changed between major versions.
  • Our vectors dictionary stores conceptual names mapped to their 3D float32 representations.

Step 2: Create an Index with Euclidean Distance (L2)

Now, let’s create a USearch index configured to use squared Euclidean distance (L2sq).

# Create an index for 3-dimensional vectors using Euclidean (L2) distance
# MetricKind.L2sq is used for performance, as sqrt doesn't change relative ranking
index_l2 = Index(ndim=3, metric=MetricKind.L2sq)
print("\nL2 index created with the L2sq metric.")

Explanation:

  • Index(ndim=3, metric=MetricKind.L2sq) initializes our index. ndim specifies the dimensionality of our vectors.
  • MetricKind.L2sq tells USearch to calculate similarity based on the squared Euclidean distance. Remember, a smaller L2sq value means higher similarity.

Step 3: Add Vectors to the L2 Index

Let’s populate our L2 index with the sample vectors.

# Add vectors to the L2 index
for key, vec in vectors.items():
    # USearch identifies vectors by unsigned integer keys, and Python's
    # hash() can be negative, so mask the hash down to 63 bits
    index_l2.add(hash(key) & 0x7FFFFFFFFFFFFFFF, vec)
print(f"Added {len(vectors)} vectors to L2 index.")

Explanation:

  • We iterate through our vectors dictionary.
  • index_l2.add(...) takes an integer key and the vector itself. We derive a stable non-negative key from each string name, because USearch stores keys as unsigned integers and Python’s hash() may return negative values.

Step 4: Perform a Search with L2 Distance

Now, let’s query our L2 index using the “apple” vector and see what comes up as most similar.

query_vector_l2 = vectors["apple"]
# Search for the 2 most similar vectors to 'apple'
matches_l2 = index_l2.search(query_vector_l2, count=2)

print("\n--- Search results (L2 Distance) for 'apple' ---")
for found_key, distance in zip(matches_l2.keys, matches_l2.distances):
    # Recover the original name by re-deriving each name's integer key
    original_key = next(name for name in vectors
                        if hash(name) & 0x7FFFFFFFFFFFFFFF == found_key)
    print(f"Vector: '{original_key}', L2 Squared Distance: {distance:.4f}")

Explanation:

  • query_vector_l2 = vectors["apple"] sets our query.
  • index_l2.search(query_vector_l2, count=2) retrieves the 2 closest vectors; the result exposes the matching keys and their distances.
  • The output shows the original vector name and its squared Euclidean distance from “apple”. A smaller distance means it’s considered more similar by this metric. You’ll likely see ‘orange’ as the closest match due to its similar component values.

Step 5: Create an Index with Cosine Similarity

Next, let’s create a new index, this time configured for Cosine Similarity.

# Create an index for 3-dimensional vectors using Cosine Similarity
index_cos = Index(ndim=3, metric=MetricKind.Cos)
print("\nCosine index created with the Cos metric.")

Explanation:

  • We create another Index instance, but this time with metric=MetricKind.Cos.
  • Remember, for MetricKind.Cos, USearch returns 1 - cosine_similarity as the distance. So, a smaller distance (closer to 0) means a higher cosine similarity (closer to 1), indicating greater similarity.

Step 6: Add Vectors to the Cosine Index

Populate our Cosine index with the same vectors.

# Add vectors to the Cosine index
for key, vec in vectors.items():
    # Mask hash() to a non-negative 63-bit key (USearch keys are unsigned)
    index_cos.add(hash(key) & 0x7FFFFFFFFFFFFFFF, vec)
print(f"Added {len(vectors)} vectors to Cosine index.")

Explanation:

  • The process of adding vectors is identical, as the vectors themselves haven’t changed, only the underlying distance calculation logic of the index.

Step 7: Perform a Search with Cosine Similarity

Let’s query the Cosine index with the “apple” vector and compare the results to the L2 search.

query_vector_cos = vectors["apple"]
# Search for the 2 most similar vectors to 'apple' using Cosine Similarity
matches_cos = index_cos.search(query_vector_cos, count=2)

print("\n--- Search results (Cosine Similarity) for 'apple' ---")
for found_key, distance in zip(matches_cos.keys, matches_cos.distances):
    original_key = next(name for name in vectors
                        if hash(name) & 0x7FFFFFFFFFFFFFFF == found_key)
    # Convert USearch's distance (1 - similarity) back to actual cosine similarity
    similarity = 1 - distance
    print(f"Vector: '{original_key}', Cosine Similarity: {similarity:.4f}")

Explanation:

  • We perform the search similarly to the L2 index.
  • The key difference in the output is how we interpret distance. Since MetricKind.Cos returns 1 - cosine_similarity, we convert it back to similarity = 1 - distance for a more intuitive understanding of similarity (where 1 is perfect similarity).
  • Observe how the ranking or the absolute similarity values might differ from the Euclidean search, even with these simple vectors. This highlights the impact of metric choice.

Mini-Challenge: Experiment with Dot Product

It’s your turn to explore!

Challenge: Create a new USearch index using MetricKind.IP (Dot Product). Add the same vectors from the example and perform a search for “banana”. Compare the results to the L2 and Cosine searches you’ve already performed.

Hint: Remember that for MetricKind.IP, USearch typically returns -dot_product as the “distance.” This means a smaller (more negative) distance value actually indicates a larger dot product, and thus higher similarity. You might want to display the dot_product as -distance for clarity.

What to observe/learn:

  • How the ranking of “most similar” vectors to “banana” changes when using Dot Product compared to Euclidean or Cosine.
  • Pay close attention to the raw distance values returned by USearch and how you need to interpret them correctly for the Dot Product metric to understand similarity.
  • Think about why ‘banana’ might have different nearest neighbors under Dot Product.

Take your time, try it out, and observe the fascinating differences!

Common Pitfalls & Troubleshooting

Even with a solid understanding of metrics, pitfalls can emerge. Here are a few common ones:

  1. Choosing the Wrong Metric for Your Data:

    • Pitfall: Using Euclidean distance when vector magnitudes are arbitrary (e.g., in text embeddings, where a longer text might just have a larger magnitude vector but not be more “similar” in topic), leading to irrelevant search results. Or, conversely, using Cosine when magnitude is crucial (e.g., for user preference strength).
    • Troubleshooting: Always start by understanding your data and how your embedding model works.
      • Ask: Does the length (magnitude) of my vectors carry meaningful information?
      • Ask: Are my vectors normalized? If so, Cosine or Dot Product are often good.
      • Ask: Am I looking for conceptual similarity (topic, theme) or absolute value similarity (exact feature match)?
    • Best Practice: Experiment with different metrics on a small, representative dataset and evaluate the relevance of the top results.
  2. Misunderstanding Normalized vs. Unnormalized Vectors:

    • Pitfall: Using Cosine Similarity with unnormalized vectors, or Dot Product when vectors should be normalized but aren’t. This can lead to the magnitude unfairly influencing results.
    • Troubleshooting:
      • If your embedding model outputs normalized vectors (unit length), Cosine Similarity and Dot Product produce identical rankings, because the dot product of unit vectors is the cosine similarity.
      • If your vectors are not normalized, and you want to ignore magnitude, explicitly normalize them to unit length before indexing (vector / np.linalg.norm(vector) in NumPy) and then use Cosine Similarity.
      • If magnitudes are important, ensure your chosen metric (like L2 or raw Dot Product) handles them appropriately.
  3. Misinterpreting USearch’s “Distance” Output:

    • Pitfall: Assuming a smaller numerical distance value returned by USearch always means “more similar” in the intuitive sense for all metrics without considering the metric’s specific interpretation.
    • Troubleshooting:
      • MetricKind.L2sq: Smaller distance means more similar. (Distance is squared Euclidean).
      • MetricKind.Cos: Smaller distance (closer to 0) means higher similarity. (Distance is 1 - cosine_similarity).
      • MetricKind.IP: Smaller (more negative) distance means higher similarity. (Distance is typically -dot_product).
    • Best Practice: Always refer to the USearch official documentation or the MetricKind enum definitions for the precise interpretation of the distance value for each metric.
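To guard against pitfall 2, normalizing before indexing is a one-liner. This NumPy sketch (the normalize helper is our own, not a USearch API) shows that on unit-length vectors the dot product equals the cosine similarity:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length so magnitude stops influencing results."""
    return v / np.linalg.norm(v)

a = np.array([0.1, 0.9, 0.2], dtype=np.float32)
b = np.array([0.7, 0.1, 0.8], dtype=np.float32)

# On unit-length vectors, the dot product IS the cosine similarity
dot_unit = float(np.dot(normalize(a), normalize(b)))
cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"dot of unit vectors: {dot_unit:.6f}")
print(f"cosine similarity:   {cos_sim:.6f}")
```

If you normalize at indexing time, remember to normalize query vectors the same way, or the two sides of the comparison will live on different scales.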

Summary

Fantastic work! You’ve navigated the crucial world of vector distance metrics, which are the unsung heroes behind accurate and relevant vector searches.

Here are the key takeaways from this chapter:

  • Vector distance metrics define how “similarity” is calculated between vectors, directly impacting search results.
  • Euclidean Distance (L2) measures the straight-line distance, sensitive to absolute differences and vector magnitudes. USearch uses L2sq for performance.
  • Cosine Similarity focuses on the angle between vectors, ideal for conceptual similarity where direction matters more than magnitude (common for text embeddings). USearch returns 1 - cosine_similarity.
  • Dot Product (IP) considers both magnitude and direction. For normalized vectors, it’s equivalent to Cosine Similarity. USearch typically returns -dot_product.
  • Choosing the right metric is critical and depends on your data’s characteristics and your application’s definition of similarity.
  • USearch provides explicit MetricKind options for easy configuration.
  • ScyllaDB’s Vector Search integrates these metrics, allowing you to specify them directly in your ANN OF CQL queries for powerful, distributed similarity search.
  • Common pitfalls include selecting the wrong metric, mishandling normalized/unnormalized vectors, and misinterpreting USearch’s distance outputs.

You now have a deeper understanding of how “similarity” is quantified in vector space, empowering you to make informed decisions for your USearch and ScyllaDB implementations.

What’s Next?

In the next chapter, we’ll shift our focus to advanced indexing strategies within USearch. We’ll explore techniques that go beyond basic indexing to optimize performance, handle massive datasets, and fine-tune the trade-off between search speed and accuracy. Get ready to scale your vector search capabilities!
