Sparse Vector Search
In the past, hybrid search combined two search methods: traditional keyword-based search and vector-based similarity search. However, sparse vectors can act as a substitute for keyword search, unlocking the full potential of the data pipeline with pure vector search.
TIP
For more information about keyword-based hybrid search, please refer to Hybrid Search.
This post explores how to use BGE-M3 and pgvecto.rs to unlock the potential of hybrid search with both sparse and dense vectors.
Let's dive in.
An Overview of Sparse and Dense Vector Embeddings
Sparse vectors are high-dimensional vectors that contain few non-zero values. They are suitable for traditional information retrieval use cases. For example, a vector with 32,000 dimensions but only 2 non-zero elements is a typical sparse vector.
Dense vectors are embeddings produced by neural networks, such as text embedding models, and most or all of their elements are non-zero. They have far fewer dimensions than sparse vectors, typically a few hundred to a few thousand (for example, 256 or 1536).
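To make the difference concrete, here is a toy sketch (with made-up numbers, not output from any real model) of how the two kinds of vectors are typically stored:
# A toy illustration with made-up numbers: the same text represented two ways.
# A sparse vector only stores its non-zero (index, value) pairs plus the total dimension,
# while a dense vector stores a value for every dimension.
sparse_vector = {"dim": 32000, "indices": [105, 20774], "values": [0.27, 0.81]}
dense_vector = [0.013, -0.208, 0.091, 0.054]  # real dense embeddings have 256-1536 dimensions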
Dense vectors and sparse vectors both represent features of a text chunk, but they differ significantly. A single dimension of a dense embedding carries no interpretable meaning on its own; only all the dimensions taken together encode the semantic meaning of the input text.
Sparse vectors have as many dimensions as the vocabulary, with each element representing the weight of a word in the text.
Generally, words with higher weights are more important to the text. This makes sparse vectors easy to analyze and to visualize in terms of the relationship between a text and the vocabulary. To visualize dense vectors, an additional dimensionality-reduction algorithm is needed (see the sketch after this list), such as:
- Principal component analysis (PCA)
- T-distributed stochastic neighbor embedding (T-SNE)
- Uniform Manifold Approximation and Projection (UMAP)
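For example, a minimal dimensionality-reduction sketch with PCA might look like this (assuming scikit-learn is installed; the random matrix stands in for real dense embeddings):
# pip install -U scikit-learn
# A minimal sketch: project dense embeddings down to 2D for plotting.
# The random matrix below is a placeholder for real dense embeddings.
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.rand(100, 1024)  # (num_texts, embedding_dim)
points_2d = PCA(n_components=2).fit_transform(embeddings)
print(points_2d.shape)  # (100, 2), ready for a scatter plot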
| | Dense Vector | Sparse Vector |
|---|---|---|
| Dimension | Hundreds to thousands | Tens of thousands |
| Element Meaning | Nothing | Word Weight |
| Interpretability | Hard | Easy |
| Usage | Vector Distance | Vector Distance / Visualize |
| Semantic sensitivity | High | Low |
Compared to dense vectors, sparse vectors are generally less effective at capturing the relationships between words, particularly synonyms and related terms. For instance, both rocket and SpaceX are related to space exploration, but dense vectors are more likely to reveal that similarity and assign a closer vector distance to the corresponding texts.
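As a quick illustration, you can compare the dense-embedding similarity of a few terms with the BGE-M3 model introduced in the next section; related terms should score higher than unrelated ones, although the exact numbers depend on the model:
# A minimal sketch: related terms tend to have more similar dense embeddings.
# Uses the BGE-M3 model introduced in the next section.
import numpy as np
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
vecs = model.encode(["rocket", "SpaceX", "banana bread"], return_dense=True)["dense_vecs"]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(vecs[0], vecs[1]))  # "rocket" vs "SpaceX": expected to be higher
print(cosine(vecs[0], vecs[2]))  # "rocket" vs "banana bread": expected to be lower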
How to create a vector embedding?
A vector embedding is the internal representation of input data in a deep learning model, also known as an embedding model. Most embedding models, such as text-embedding-3-small, only output dense embeddings. SPLADE and BGE-M3 can also generate sparse embeddings, sometimes called lexical weights. In this section, we will use the BGE-M3 model to generate both dense and sparse vectors.
BGE-M3 is a multi-functionality model: it can simultaneously perform the three common retrieval functionalities of an embedding model, namely dense retrieval, multi-vector retrieval, and sparse retrieval. Let's see how to generate dense and sparse embeddings with it.
# pip install -U FlagEmbedding
from FlagEmbedding import BGEM3FlagModel
model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
sentences = ["What is BM25?", "Definition of BM25"]
output = model.encode(sentences, return_dense=True, return_sparse=True)
# The dense embedding of the two texts above:
dense_embedding = output['dense_vecs']
# [[-0.02501617 -0.04525582 -0.01972856 ... -0.0099566 0.01677509
# 0.01292699]
# [-0.0106499 -0.04731942 -0.01477693 ... -0.0149255 0.01040097
# 0.00965083]]
sparse_weight_id = output['lexical_weights']
# You can see the weight for each token:
# print(model.convert_id_to_token(sparse_weight_id))
#
# [{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523},
# {'Definition': 0.14298248, 'of': 0.076763615, 'BM': 0.2577639, '25': 0.33806682}]
For all sentences and words, we can build a text-word matrix:
| TEXT | What | is | BM | 25 | ? | Definition | of | ... |
|---|---|---|---|---|---|---|---|---|
| What is BM25? | 0.08 | 0.08 | 0.13 | 0.25 | 0.04 | 0 | 0 | ... |
| Definition of BM25 | 0 | 0 | 0.25 | 0.33 | 0 | 0.14 | 0.06 | ... |
The matrix is highly sparse, especially when the whole vocabulary is included: most words appear in only a few sentences, so the vast majority of entries are zero.
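As a sanity check, a short sketch (reusing model and sparse_weight_id from above) can expand the lexical weights into the full text-by-vocabulary matrix and count its non-zero entries:
# A minimal sketch: expand the lexical weights from above into a full
# text-by-vocabulary matrix and see how sparse it is.
import numpy as np

vocab_size = len(model.tokenizer)  # 250002 for BGE-M3
matrix = np.zeros((len(sparse_weight_id), vocab_size), dtype=np.float32)
for row, weights in enumerate(sparse_weight_id):
    for token_id, weight in weights.items():
        matrix[row, int(token_id)] = float(weight)

print(np.count_nonzero(matrix), "non-zero entries out of", matrix.size)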
By feeding text into a sparse embedding model like BGE-M3, we can easily obtain such sparse embeddings.
# pip install -U numpy
import numpy as np
# Sparse vector embedding of text
sparse_indices = [[int(token_id) for token_id in text] for text in sparse_weight_id]
# [[4865, 83, 90017, 2588, 32], [155455, 111, 90017, 2588]]
sparse_values = [[float(text[token_id]) for token_id in text] for text in sparse_weight_id]
# [[0.10430284, 0.10090457, 0.2635918, 0.3382988, 0.052101523], [0.14298248, 0.076763615, 0.2577639, 0.33806682]]
# pgvecto.rs requires the sparse indexes of each vector to be in ascending order
for i in range(len(sparse_weight_id)):
    sparse_values[i] = [x for _, x in sorted(zip(sparse_indices[i], sparse_values[i]))]
    sparse_indices[i] = sorted(sparse_indices[i])
Now we have both the sparse and dense embeddings. It's time to insert them into the vector database.
Storing vectors in pgvecto.rs
First, a running pgvecto.rs instance is required for vector storage and retrieval. It can be started from the official pgvecto-rs Docker image:
docker run \
--name pgvecto-rs-demo \
-e POSTGRES_PASSWORD=mysecretpassword \
-p 5432:5432 \
-d tensorchord/pgvecto-rs:pg16-v0.3.0
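Depending on the image version, you may also need to enable the vectors extension in the target database before creating vector columns. A minimal sketch with psycopg (harmless to run if the extension is already enabled):
# A minimal sketch: enable the pgvecto.rs extension if your image does not
# do this automatically. Safe to run even if it is already enabled.
# pip install -U psycopg
import psycopg

URL = "postgresql://postgres:mysecretpassword@localhost:5432/postgres"
with psycopg.connect(URL, autocommit=True) as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vectors;")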
Now the only necessary dependency is running. Let's create a table to store the vectors.
pgvecto.rs supports dense and sparse vectors with different data types:
- vector for dense vectors
- svector for sparse vectors
The BGE-M3 model has a dense embedding dimension of 1024. The sparse embedding dimension is 250002, which matches the size of the tokenizer vocabulary:
print(len(model.tokenizer))
# 250002
In this example, we create a table named documents with four columns: an ID column (id), a text column (text), and two vector columns (sparse for the sparse vector and dense for the dense vector).
# pip install -U psycopg
import psycopg
URL = "postgresql://postgres:mysecretpassword@localhost:5432/postgres"
with psycopg.connect(URL) as conn:
    conn.execute("DROP TABLE IF EXISTS documents;")
    conn.execute(f"""
        CREATE TABLE documents (
            id SERIAL PRIMARY KEY,
            text TEXT NOT NULL,
            dense vector(1024) NOT NULL,
            sparse svector({len(model.tokenizer)}) NOT NULL);
        """
    )
With the table created, we can now insert embeddings into the table and create indexes for vector columns.
# pip install -U pgvecto_rs
import psycopg
from pgvecto_rs.psycopg import register_vector
URL = "postgresql://postgres:mysecretpassword@localhost:5432/postgres"
with psycopg.connect(URL) as conn:
    register_vector(conn)
    for text, dense, sparse_ind, sparse_val in zip(sentences, dense_embedding, sparse_indices, sparse_values):
        conn.execute(
            "INSERT INTO documents (text, dense, sparse) VALUES (%s, %s, to_svector(%s, %s::real[]::int[], %s::real[]));",
            (text, dense, len(model.tokenizer), sparse_ind, sparse_val),
        )
    conn.execute("""
        CREATE INDEX ON documents
        USING vectors (sparse svector_dot_ops)
        WITH (options = "[indexing.hnsw]");
    """)
    conn.execute("""
        CREATE INDEX ON documents
        USING vectors (dense vector_l2_ops)
        WITH (options = "[indexing.hnsw]");
    """)
For dense vectors, vector_l2_ops calculates the squared Euclidean distance, which is the most commonly used distance metric.
For sparse vectors, svector_dot_ops uses the (negative) dot product, which is efficient to compute on sparse data.
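To make the two metrics concrete, here is a small NumPy sketch of what they compute on a pair of vectors:
# A minimal sketch of the two distance metrics in plain NumPy.
import numpy as np

a = np.array([0.1, 0.2, 0.3])
b = np.array([0.3, 0.1, 0.0])

squared_l2 = float(np.sum((a - b) ** 2))  # what vector_l2_ops measures (smaller is closer)
negative_dot = float(-np.dot(a, b))       # what svector_dot_ops measures (smaller is closer)
print(squared_l2, negative_dot)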
For more information about index types in pgvecto.rs, please check out our documentation.
Retrieving and Mixing Results
With the vectors stored in the database, we can efficiently query them through the pgvecto.rs extension. To find the nearest vectors to a given input text, we first need to encode that text with the same model. The process is identical to the one described in the previous section:
def embedding(sentence: str) -> tuple[np.ndarray, list[int], list[float]]:
    """
    Create dense and sparse embeddings from a single text
    """
    output = model.encode(sentence, return_dense=True, return_sparse=True, return_colbert_vecs=False)
    dense_embedding = output['dense_vecs']
    sparse_weight_id = output['lexical_weights']
    sparse_indices = [int(token_id) for token_id in sparse_weight_id]
    sparse_values = [float(sparse_weight_id[token_id]) for token_id in sparse_weight_id]
    # pgvecto.rs requires the sparse indexes to be in ascending order
    sparse_values = [x for _, x in sorted(zip(sparse_indices, sparse_values))]
    sparse_indices = sorted(sparse_indices)
    return dense_embedding, sparse_indices, sparse_values
With pgvecto.rs, a plain SELECT statement ordered by the distance operators retrieves the nearest texts for both the sparse and the dense vector:
dense_embedding, sparse_indices, sparse_values = embedding("The text you want to search...")
with psycopg.connect(URL) as conn:
    register_vector(conn)
    cur = conn.execute(
        "SELECT text FROM documents ORDER BY sparse <#> to_svector(%s, %s::real[]::int[], %s::real[]) LIMIT 50;",
        (len(model.tokenizer), sparse_indices, sparse_values),
    )
    sparse_result = cur.fetchall()
    cur = conn.execute(
        "SELECT text FROM documents ORDER BY dense <-> %s LIMIT 50;",
        (dense_embedding,),
    )
    dense_result = cur.fetchall()
# Merge sparse and dense vector search results
mix_text = set([r[0] for r in dense_result]).union([r[0] for r in sparse_result])
To rerank the merged results, you can use a popular fusion method such as Reciprocal Rank Fusion (RRF). However, to make better use of the BGE family of models, let's introduce a reranker model. Unlike the embedding model, a reranker takes a pair of texts as input and directly outputs their similarity score. We can initialize a reranker model and feed it the mixed candidate texts; reranking then becomes as simple as sorting by the reranker's scores (an RRF sketch follows the reranker example below).
from FlagEmbedding import FlagReranker

reranker = FlagReranker('BAAI/bge-reranker-large', use_fp16=True)
# Obtain relevance scores (higher scores indicate greater relevance)
scores = reranker.compute_score([["The text you want to search...", candidate] for candidate in mix_text])
# Sort candidates by score, most relevant first
reranked_text = [t for _, t in sorted(zip(scores, mix_text), reverse=True)]
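If you prefer a model-free fusion method instead, a minimal Reciprocal Rank Fusion sketch over the two ranked result lists could look like this (k = 60 is a commonly used smoothing constant, not something required by pgvecto.rs):
# A minimal Reciprocal Rank Fusion (RRF) sketch over the two ranked lists
# returned by the sparse and dense queries above.
from collections import defaultdict

def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, text in enumerate(results):
            scores[text] += 1.0 / (k + rank + 1)
    # A higher fused score means the text ranked well in more lists
    return [text for text, _ in sorted(scores.items(), key=lambda kv: kv[1], reverse=True)]

fused_text = rrf([[r[0] for r in dense_result], [r[0] for r in sparse_result]])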
Congratulations! We have now completed our first step into the world of sparse vector search.