Introduction
Welcome back, future vector search expert! In previous chapters, we explored the standalone power of USearch, learned how to create and query vector indexes, and understood the fundamental concepts behind vector embeddings. Now, it’s time to bring that power directly into your database.
This chapter is all about integrating vector search capabilities directly into ScyllaDB, a high-performance, real-time NoSQL database. ScyllaDB has embraced the growing need for AI-native applications by offering native vector search, leveraging USearch under the hood for its efficient Approximate Nearest Neighbor (ANN) indexing. This means you can store your data and its associated vector embeddings together and perform similarity queries without needing a separate vector database or complex synchronization. Pretty neat, right?
By the end of this chapter, you’ll understand how ScyllaDB’s vector search works, how to set it up, and how to perform blazing-fast similarity searches using simple CQL (Cassandra Query Language) commands. We’ll focus on practical, hands-on steps, ensuring you build a solid understanding.
To get the most out of this chapter, you should have a running ScyllaDB instance with vector search available and a basic grasp of CQL. This chapter assumes version 5.2 or newer; because vector search reached general availability only in January 2026, confirm the minimum supported release for your deployment in the official documentation, which also covers setting up ScyllaDB.
Core Concepts: ScyllaDB’s Approach to Vector Search
ScyllaDB’s native vector search feature is a game-changer for real-time AI applications. Instead of exporting your data to a separate vector database, you can keep everything in one place, simplifying your architecture and reducing latency. Let’s break down the key components.
The vector Data Type
At the heart of ScyllaDB’s vector search is a new native data type: vector. This type allows you to store high-dimensional numerical vectors directly within your tables.
Think of it like this: just as you have int for whole numbers or text for strings, vector is specifically designed for numerical arrays that represent embeddings.
What it is: A vector<float, N> type stores a fixed-size array of floating-point numbers, where N is the dimension of your vectors.
Why it’s important: It provides a native, optimized way to store embeddings, ensuring data integrity and efficient access.
How it functions: When you define a column as vector<float, 1536> (a common dimension for many embedding models), ScyllaDB knows exactly how to handle that data type for storage and indexing.
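To make the fixed-size idea concrete, here is a small Python sketch (not ScyllaDB code) that models a vector<float, N> value as exactly N 32-bit floats. The helper name pack_vector is hypothetical, and the big-endian, 4-bytes-per-component layout is an assumption used for illustration, not a statement about ScyllaDB's storage format.

```python
import struct

def pack_vector(values, dim):
    """Pack `values` as `dim` big-endian 32-bit floats (illustrative layout only)."""
    if len(values) != dim:
        raise ValueError(f"expected {dim} components, got {len(values)}")
    return struct.pack(f">{dim}f", *values)

# A 3-dimensional vector occupies 3 components * 4 bytes each.
payload = pack_vector([0.1, 0.2, 0.9], dim=3)
print(len(payload))

# Raw size of a single 1536-dimensional embedding under the same assumption.
print(len(pack_vector([0.0] * 1536, dim=1536)))

# 32-bit floats carry about 7 significant digits, so values round-trip approximately.
restored = struct.unpack(">3f", payload)
print(restored)
```

The takeaway is that the dimension is part of the type: a value with the wrong number of components is rejected rather than padded or truncated.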
Vector Indexing with CREATE CUSTOM INDEX
Storing vectors is one thing; searching them efficiently is another. ScyllaDB integrates USearch to provide Approximate Nearest Neighbor (ANN) indexing directly on your vector columns. This is achieved using the CREATE CUSTOM INDEX statement.
What it is: A custom index built on a vector column that enables fast similarity searches. Behind the scenes, ScyllaDB uses the USearch library to construct and manage this index.
Why it’s important: Without an index, ScyllaDB would have to scan every single vector in your table to find similar ones, which is incredibly slow for large datasets. The index allows for rapid lookups, even across millions or billions of vectors.
How it functions: When you create a vector index, ScyllaDB builds an ANN index (like HNSW, which USearch excels at) on that column. This index organizes your vectors in a way that allows ScyllaDB to quickly narrow down the search space to find approximate nearest neighbors.
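Because the index returns approximate neighbors, it helps to have a way to say how approximate. One common measure is recall@k: the share of the true top-k neighbors (from a full scan) that the index also returned. The sketch below is plain Python; the function name recall_at_k and the sample IDs are hypothetical, and nothing here calls ScyllaDB or USearch.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbours that the approximate result also returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    return len(exact & set(approx_ids)) / len(exact)

exact_top3 = ["matrix", "blade_runner", "interstellar"]   # what a full scan would return
approx_top3 = ["matrix", "blade_runner", "dune"]          # hypothetical index output

# Two of the three true neighbours were found.
print(recall_at_k(approx_top3, exact_top3))
```

A recall of 1.0 means the index matched the exact search for that query; lower values are the price paid for speed.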
You can configure several parameters for your vector index:
- similarity_function: Determines how “similarity” is measured. Common options include COSINE (cosine similarity), EUCLIDEAN (L2, or Euclidean, distance), and DOT_PRODUCT (inner product).
- Index algorithm: Currently, ScyllaDB primarily supports HNSW (Hierarchical Navigable Small World), which is known for its excellent balance of speed and accuracy.
- quantization: An optional optimization that reduces memory footprint and improves performance by storing vectors in a compressed format (e.g., INT8, BINARY), at some cost in accuracy.
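The three similarity measures are easy to compute by hand, which is a good way to build intuition before relying on the index. The following is a minimal Python sketch of the math only; the function names are hypothetical, and it says nothing about how ScyllaDB or USearch implement these measures internally.

```python
import math

def dot(a, b):
    """Inner product: larger means more similar (sensitive to magnitude)."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1.0 means same direction."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def euclidean_distance(a, b):
    """Straight-line (L2) distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [0.1, 0.2, 0.9]   # points mostly along the third axis
b = [0.8, 0.1, 0.2]   # points mostly along the first axis

print("inner product:", dot(a, b))
print("cosine similarity:", cosine_similarity(a, b))
print("euclidean distance:", euclidean_distance(a, b))
```

Note the direction of "better" differs: for cosine and inner product a higher score is closer, while for Euclidean distance a lower value is closer.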
Performing Similarity Search with ANN OF
Once you have a vector column and an index, querying for similar items is straightforward: use the ANN OF operator in the ORDER BY clause of a SELECT, together with a LIMIT.
What it is: The ANN OF operator is ScyllaDB’s syntax for triggering an Approximate Nearest Neighbor search on an indexed vector column.
Why it’s important: This is the magic keyword that tells ScyllaDB to use its vector index to find the most similar vectors to your query vector.
How it functions: You provide a query vector, and ScyllaDB, using the underlying USearch index, returns the k (specified by LIMIT) closest vectors from your table, ordered by similarity.
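In outline, the flow looks like this: your application sends content to an embedding service and stores the returned vector in ScyllaDB next to the rest of the row. At query time, it embeds the search input the same way and passes that vector to ANN OF, and ScyllaDB returns the closest rows.
Note: The embedding service is typically external to ScyllaDB, generating the vectors you then store and query.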
Step-by-Step Implementation
Let’s get our hands dirty and implement vector search in ScyllaDB. For this example, we’ll imagine a simple movie recommendation system where we store movie titles and their vector embeddings.
Prerequisites: Ensure your ScyllaDB instance (version 5.2.0 or newer is recommended for vector search GA) is running. You can connect to it using cqlsh, ScyllaDB’s command-line shell.
Connect to ScyllaDB
Open your terminal and connect to your ScyllaDB instance. If it’s running locally, the default is:
    cqlsh

You should see the cqlsh> prompt.

Create a Keyspace
A keyspace in ScyllaDB is like a schema or database. We’ll create one for our movie data.
    -- Create a keyspace named 'movie_recommendations'
    -- with a replication factor of 1 (for a single-node setup).
    CREATE KEYSPACE movie_recommendations
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': '1'};

Explanation:
- CREATE KEYSPACE: The command to create a new keyspace.
- movie_recommendations: The name of our keyspace.
- WITH replication: Specifies the replication strategy. SimpleStrategy is good for single-datacenter deployments, and replication_factor: '1' means one copy of the data (suitable for a local dev setup).
Now, let’s switch to our new keyspace:
    USE movie_recommendations;

Create a Table with a Vector Column
Next, we’ll create a table to store our movie data. This table will include a movie_vector column of type vector<float, 3>. We’re using a small dimension (3) for simplicity in this example, but in a real-world scenario, you’d likely use a dimension like 768 or 1536 from an embedding model.

    -- Create a table to store movie information, including its vector embedding.
    CREATE TABLE movies (
        movie_id UUID PRIMARY KEY,
        title TEXT,
        genre TEXT,
        movie_vector VECTOR<FLOAT, 3>
    );

Explanation:
- CREATE TABLE movies: Creates a table named movies.
- movie_id UUID PRIMARY KEY: A unique identifier for each movie, serving as the primary key. UUID stands for Universally Unique Identifier.
- title TEXT, genre TEXT: Standard text columns for movie metadata.
- movie_vector VECTOR<FLOAT, 3>: Our vector column! It will store 3-dimensional float vectors.
Insert Data with Vectors
Now, let’s add some movie data along with their “dummy” vector embeddings. In a real application, these vectors would come from an embedding model (e.g., OpenAI’s text-embedding-3-small).

    -- Insert movie data with example 3-dimensional vectors.
    INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'The Matrix', 'Sci-Fi', [0.1, 0.2, 0.9]);
    INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'Blade Runner 2049', 'Sci-Fi', [0.15, 0.25, 0.85]);
    INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'Dune', 'Sci-Fi', [0.05, 0.1, 0.95]);
    INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'Interstellar', 'Sci-Fi', [0.2, 0.3, 0.8]);
    INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'Forrest Gump', 'Drama', [0.8, 0.1, 0.2]);
    INSERT INTO movies (movie_id, title, genre, movie_vector) VALUES (uuid(), 'Titanic', 'Romance', [0.75, 0.05, 0.1]);

Explanation:
- INSERT INTO movies ... VALUES (...): Standard CQL insert statement.
- uuid(): Generates a new UUID for movie_id.
- [0.1, 0.2, 0.9]: This is how you represent a vector literal in CQL. It’s a list of float values.
You can verify the data was inserted:
    SELECT * FROM movies;

Create a Vector Index
This is a crucial step! We’ll create a custom index on our movie_vector column to enable efficient similarity searches. We’ll use the COSINE similarity function, as it’s very common for embeddings.

    -- Create a custom vector index on the 'movie_vector' column.
    CREATE CUSTOM INDEX movie_vector_idx ON movies (movie_vector)
      USING 'vector_index'
      WITH OPTIONS = { 'similarity_function': 'COSINE' };

Explanation:
- CREATE CUSTOM INDEX movie_vector_idx: Initiates the creation of a custom index named movie_vector_idx.
- ON movies (movie_vector): Specifies that the index is on the movie_vector column of the movies table.
- USING 'vector_index': Selects ScyllaDB’s vector index implementation, which is backed by USearch.
- WITH OPTIONS = {...}: Here’s where we configure the vector index. 'similarity_function': 'COSINE' configures the index to use cosine similarity for vector comparisons. The underlying structure is an HNSW (Hierarchical Navigable Small World) graph. The full set of supported options depends on your release, so check the vector search documentation before adding more.
Index creation might take a moment, especially with larger datasets. ScyllaDB will build the USearch index in the background.
Perform a Similarity Search
Now for the fun part! Let’s find movies similar to a given query vector. Imagine we have a new movie idea, and we want to find existing movies that are conceptually similar.
Let’s use a query vector [0.12, 0.22, 0.88], which is somewhat similar to our Sci-Fi movies.

    -- Search for movies similar to our query vector, limiting to the top 2 results.
    SELECT title, genre, movie_vector
    FROM movies
    ORDER BY movie_vector ANN OF [0.12, 0.22, 0.88]
    LIMIT 2;

Explanation:
- SELECT title, genre, movie_vector: We want to retrieve the title, genre, and the vector itself for the results.
- FROM movies: Querying our movies table.
- ORDER BY movie_vector ANN OF [0.12, 0.22, 0.88]: This is the core of the vector search. It tells ScyllaDB to order rows by how close movie_vector is to [0.12, 0.22, 0.88], using the Approximate Nearest Neighbor index.
- LIMIT 2: Restricts the results to the top 2 most similar movies. Always use LIMIT with ANN OF queries to control the number of results and prevent excessive resource usage.
You should see results similar to ‘The Matrix’ and ‘Blade Runner 2049’, as their vectors are numerically closest to our query vector in this example.
What if we queried with a vector similar to our Drama/Romance movies, say [0.7, 0.08, 0.15]?

    -- Find movies similar to a drama/romance-like vector.
    SELECT title, genre, movie_vector
    FROM movies
    ORDER BY movie_vector ANN OF [0.7, 0.08, 0.15]
    LIMIT 2;

You would likely get ‘Forrest Gump’ and ‘Titanic’ as results. This demonstrates how ScyllaDB, powered by USearch, can effectively find semantically similar items based on their vector embeddings!
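With only six rows, you can check what the two searches should return by doing an exact, brute-force scan outside the database. The Python sketch below reuses the example vectors from the INSERT statements and ranks them by cosine similarity; the names MOVIES and exact_top_k are hypothetical, and this is the full scan that an ANN index is designed to avoid on large tables.

```python
import heapq
import math

# The same six example rows inserted earlier in this chapter.
MOVIES = {
    "The Matrix": [0.1, 0.2, 0.9],
    "Blade Runner 2049": [0.15, 0.25, 0.85],
    "Dune": [0.05, 0.1, 0.95],
    "Interstellar": [0.2, 0.3, 0.8],
    "Forrest Gump": [0.8, 0.1, 0.2],
    "Titanic": [0.75, 0.05, 0.1],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def exact_top_k(query, k):
    """Full scan: score every row and keep the k titles with the highest cosine similarity."""
    return heapq.nlargest(
        k, MOVIES, key=lambda title: cosine_similarity(query, MOVIES[title])
    )

sci_fi = exact_top_k([0.12, 0.22, 0.88], k=2)
drama = exact_top_k([0.7, 0.08, 0.15], k=2)
print(sci_fi)
print(drama)
```

If the database returns different titles on a table this small, the query or the index definition is the first place to look.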
Mini-Challenge: Explore Different Similarity Functions
You’ve successfully performed your first vector search in ScyllaDB! Now, let’s try a small modification to deepen your understanding.
Challenge:
- Drop the existing movie_vector_idx index.
- Create a new vector index on the movie_vector column, but this time use similarity_function = EUCLIDEAN (L2, or Euclidean, distance) instead of COSINE.
- Re-run the similarity search with the query vector [0.12, 0.22, 0.88] and LIMIT 2.
- Observe if the results change. Why might they be different (or similar)?
Hint:
- To drop an index, use DROP INDEX movie_vector_idx;
- Remember that COSINE measures the angle between vectors (direction), while L2 (Euclidean distance) measures the straight-line distance between their endpoints (magnitude and direction). This can lead to different “nearest” neighbors, especially if your vectors are not normalized.
What to observe/learn: Pay attention to how the choice of similarity function can influence the ranking of results. For vectors normalized to unit length, cosine similarity and L2 distance produce the same exact ranking, because the squared L2 distance equals 2 minus twice the cosine similarity. For unnormalized vectors, they can diverge significantly. This helps you understand the importance of choosing the right metric for your specific embedding model and use case.
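The difference between the two metrics is easiest to see on a tiny hand-made example. The Python sketch below uses two hypothetical 2-dimensional candidates (not the movie data): one points in exactly the query's direction but is much longer, the other is close by but slightly off-direction. All names are illustrative.

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    n = norm(v)
    return [x / n for x in v]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

query = [1.0, 0.0]
candidates = {
    "a": [10.0, 0.0],  # same direction as the query, but far away
    "b": [1.0, 0.5],   # nearby, but pointing in a slightly different direction
}

# On raw vectors the two metrics disagree about the nearest neighbour.
best_by_cosine = max(candidates, key=lambda k: cosine(query, candidates[k]))
best_by_l2 = min(candidates, key=lambda k: l2(query, candidates[k]))
print(best_by_cosine, best_by_l2)

# After normalizing to unit length, L2 agrees with cosine.
unit = {k: normalize(v) for k, v in candidates.items()}
best_by_l2_unit = min(unit, key=lambda k: l2(query, unit[k]))
print(best_by_l2_unit)

# For unit vectors: squared L2 distance == 2 - 2 * cosine similarity.
u, v = normalize([0.1, 0.2, 0.9]), normalize([0.15, 0.25, 0.85])
lhs, rhs = l2(u, v) ** 2, 2 - 2 * cosine(u, v)
print(abs(lhs - rhs) < 1e-9)
```

In the movie example both metrics happen to pick the same top two, which is why the challenge asks you to think about when they would not.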
Common Pitfalls & Troubleshooting
Working with vector search, especially when integrated into a database, can sometimes present challenges. Here are a few common pitfalls and how to troubleshoot them:
Incorrect Vector Dimension (N Mismatch):
- Pitfall: Defining a VECTOR<FLOAT, N> column with a dimension N that doesn’t match the actual dimension of the vectors you’re trying to insert or query.
- Troubleshooting: ScyllaDB will return an error about dimension mismatch. Always ensure the N in your table schema, your INSERT statements, and your ANN OF queries are consistent with the output dimension of your embedding model. This is critical.
Missing or Incorrect Vector Index:
- Pitfall: Attempting an ANN OF query on a vector column that either doesn’t have a custom vector index, or whose index was created with incorrect options (e.g., a misspelled similarity_function).
- Troubleshooting: ScyllaDB will usually return an error indicating that an ANN index is required. Double-check your CREATE CUSTOM INDEX statement for typos, especially in the USING clause and the similarity_function option. Verify the index exists using DESCRIBE TABLE movies; (or your table name) and looking for index details.
Performance Issues with Large Result Sets (LIMIT):
- Pitfall: Performing an ANN OF query without a LIMIT clause, or with an excessively large LIMIT on a massive dataset.
- Troubleshooting: While ScyllaDB and USearch are highly optimized, retrieving a very large number of approximate nearest neighbors still requires significant processing and data transfer. For real-time applications, always use a reasonable LIMIT (e.g., 10, 50, or 100) to fetch only the most relevant results. If you need more results, consider pagination or re-evaluating your application’s needs.
No Results Found:
- Pitfall: Your similarity search returns no rows, even when you expect some.
- Troubleshooting: This can happen if your query vector is truly very far from all vectors in your database, or if your dataset is very small and sparse.
  - Check Data: Ensure you have enough data inserted and that the vectors are varied.
  - Query Vector: Double-check your query vector. Is it representative of the data you expect to find?
  - Similarity Function: As explored in the mini-challenge, the similarity_function can significantly impact results. Ensure it’s appropriate for your embeddings. For example, cosine similarity suits embeddings where direction matters more than magnitude.
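Two of the pitfalls above, the dimension mismatch and the oversized LIMIT, can be caught in application code before a query is ever sent. Here is a minimal Python sketch of such a guard; EXPECTED_DIM, MAX_LIMIT, and check_query are hypothetical names, and the cap of 100 is an arbitrary application-level choice rather than a ScyllaDB setting.

```python
import math

EXPECTED_DIM = 3   # must match VECTOR<FLOAT, 3> in the movies table
MAX_LIMIT = 100    # hypothetical application-level cap on results

def check_query(vector, limit):
    """Validate ANN query inputs; return the cleaned vector and a capped limit."""
    if len(vector) != EXPECTED_DIM:
        raise ValueError(
            f"query vector has {len(vector)} components, expected {EXPECTED_DIM}"
        )
    if not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
        raise ValueError("vector components must be finite numbers")
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return [float(x) for x in vector], min(limit, MAX_LIMIT)

vector, limit = check_query([0.12, 0.22, 0.88], limit=500)
print(vector, limit)   # the limit is capped at MAX_LIMIT
```

Failing fast on the client gives a clearer error message than a rejected query, and it keeps an accidental large LIMIT from reaching the cluster.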
Summary
Congratulations! You’ve successfully integrated and utilized ScyllaDB’s native vector search capabilities. This is a powerful step towards building modern, AI-driven applications that require real-time similarity search at scale.
Here are the key takeaways from this chapter:
- ScyllaDB’s Native Vector Support: ScyllaDB (version 5.2.0+) now natively supports storing and querying high-dimensional vectors using the VECTOR<FLOAT, N> data type.
- USearch Under the Hood: ScyllaDB leverages the efficient open-source USearch library to power its Approximate Nearest Neighbor (ANN) indexing.
- Creating Vector Indexes: You create vector indexes using CREATE CUSTOM INDEX ... USING 'vector_index' WITH OPTIONS = {'similarity_function': 'COSINE'}.
- Performing Similarity Searches: The ANN OF operator in your ORDER BY clause allows you to query for similar vectors directly in CQL: SELECT ... ORDER BY vector_column ANN OF [query_vector] LIMIT k;.
- Importance of LIMIT: Always use LIMIT with ANN OF queries for efficient, real-time results.
- Choosing Similarity Function: The similarity_function (e.g., COSINE, EUCLIDEAN) impacts how similarity is calculated and should match your embedding strategy.
You now have a robust foundation for building applications that can perform semantic search, recommendation systems, anomaly detection, and more, all within a highly scalable and performant database.
What’s Next?
In the next chapter, we’ll explore how to interact with ScyllaDB’s vector search from client applications using popular programming languages like Python. We’ll also delve into more advanced indexing strategies and performance tuning considerations for production deployments. Stay curious!
References
- ScyllaDB Documentation: Vector Search Overview. https://docs.scylladb.com/manual/master/features/vector-search.html
- ScyllaDB Press Release: ScyllaDB Brings Massive-Scale Vector Search to Real-Time AI. https://www.scylladb.com/press-release/scylladb-brings-massive-scale-vector-search-to-real-time-ai/
- ScyllaDB GitHub: Vector Search Examples. https://github.com/scylladb/vector-search-examples
- USearch GitHub Repository. https://github.com/unum-cloud/USearch