Generative Question-Answering
Vector search can also be used to provide context to generative models such as OpenAI's GPT series, improving the quality of the outputs.
This post will explore the use of pgvecto.rs
in building and retrieving from a vector knowledge base to bolster the performance of generative models in question-answering tasks. A knowledge base serves as "long-term memory" for generative models, providing them persistent, curated, and accurate information as context to draw from when generating an answer to a question. This results in answers that are more accurate, reduces the possibility of the model producing "hallucinations" (statements with no basis in reality), and increases user trust in the outputs. This technique is known as Retrieval-Augmented Generation, or RAG: see this post or the related paper for more information.
Overview of the Question-Answering task
The generative model question-answering (QA) task is a method of Information Retrieval. The model is presented with a question, and generates an answer based on some context: typically, documents or other information sources the model can look at to retrieve the information necessary to answer the question.
This use case will involve giving the OpenAI gpt-3.5-turbo-instruct
model context to answer questions about movies from the vishnupriyavr/wiki-movie-plots-with-summaries-faiss-embeddings Huggingface dataset.
The overall workflow will look something like this:
Question-Answering without Context
First, let's try having the model answer some questions without providing it context.
We begin with defining a function to streamline getting a response for a given prompt:
# pip install openai
import openai
openai.api_key = "YOUR_API_KEY_HERE"
def complete(prompt):
result = openai.Completion.create(
engine='gpt-3.5-turbo-instruct',
prompt=prompt,
temperature=0,
max_tokens=400,
top_p=1,
frequency_penalty=0,
presence_penalty=0,
stop=None
)
return result['choices'][0]['text'].strip()
INFO
The api_key
requires an OpenAI account. Once logged in, sign up for an API key from the top-right drop-down menu, and paste it in place of YOUR_API_KEY_HERE
.
Let's start with asking about a relatively popular movie:
base_prompt = "What is the release year and overall plot of Skyfall?"
print(complete(base_prompt))
Output:
The release year of Skyfall is 2012. The overall plot follows James Bond as he investigates an attack on MI6 and uncovers a sinister plot by a former agent seeking revenge on M and the entire agency. Bond must confront his past and protect MI6 from destruction.
Accurate, if somewhat vague in the plot summary. However, what about a movie that's much less known?
base_prompt_2 = "What is the release year and overall plot of Cat Napping?"
print(complete(base_prompt_2))
Output:
The release year of Cat Napping is unknown. The overall plot of Cat Napping is about a group of cats who go on a mission to rescue their friend who has been kidnapped by a group of evil dogs. Along the way, they encounter various challenges and obstacles, but with their cleverness and teamwork, they are able to save their friend and defeat the dogs.
Correct answer:
Cat Napping is actually a Tom and Jerry movie from 1951, where the two fight over a hammock.
We have found the limits of what the model "knows": it is unable to produce the release year, and invents a potential plot based on the title. We will remedy this by creating a knowledge base for the model to draw on.
There are two options for allowing the model to better answer our question:
- Fine-tune the model on text data in the domain of interest (movies in this case).
- Employ retrieval augmented generation, which adds an information retrieval component to the process. This will allow us to retrieve and feed relevant information to the model as a secondary source of information.
We will be taking the second option in this post.
Building the Knowledge Base
Let's build a knowledge base of movies to retrieve relevant information from, using the vishnupriyavr/wiki-movie-plots-with-summaries-faiss-embeddings Huggingface dataset. This dataset contains titles, release years, casts, Wikipedia pages, plot summaries/lengths, and vector embeddings of this information for 33,155 movies. Each movie also has text
data, which encompasses title, release year, cast, and plot summary in one. We will be using the text
and embeddings
data for this application.
Start by loading the data:
# pip install -U datasets
from datasets import load_dataset
dataset = load_dataset("vishnupriyavr/wiki-movie-plots-with-summaries-faiss-embeddings", split='train')
texts = dataset['train']['text']
embeddings = dataset['train']['embeddings']
# Optional:
# release_years = dataset['train']['Release Year']
# titles = dataset['train']['Title']
# casts = dataset['train']['Cast']
# wiki_pages = dataset['train']['Wiki Page']
# plots = dataset['train']['Plot']
# plot_lens = dataset['train']['plot_length']
INFO
This particular dataset comes with embeddings. In other cases, you will have to create embeddings yourself from the data. The embeddings in this dataset are dense vectors. For creating sparse vectors, you can see this page.
A deployed pgvecto.rs
instance is required for vector storage and retrieval. We can use the official pgvecto-rs
Docker image:
docker run \
--name pgvecto-rs-demo \
-e POSTGRES_PASSWORD=mysecretpassword \
-p 5432:5432 \
-d tensorchord/pgvecto-rs:pg16-v0.3.0
Install the required psycopg
library to interact with Postgres, and ensure the extension is running:
#pip install "psycopg[binary,pool]"
import psycopg
URL = "postgresql://postgres:mysecretpassword@localhost:5432/postgres"
#Create the table
with psycopg.connect(URL) as conn:
conn.execute("DROP EXTENSION IF EXISTS vectors;")
conn.execute("CREATE EXTENSION vectors;")
Now, we create a table movies
with a few columns: an ID column(id
), a text column(texts
), and a vector column(embeddings
).
import psycopg
URL = "postgresql://postgres:mysecretpassword@localhost:5432/postgres"
#Create the table
with psycopg.connect(URL) as conn:
conn.execute("DROP TABLE IF EXISTS movies;")
conn.execute(f"""
CREATE TABLE movies (
id SERIAL PRIMARY KEY,
text TEXT NOT NULL,
embedding vector(768) NOT NULL);
"""
)
INFO
We use type vector(n)
, as we are working with dense vector embeddings, where n
is the dimension of each embedding (768 in this case).
With the table created, we can now insert the embeddings from the dataset and create indexes for vector columns:
# Optional install, but recommended:
# pip install tqdm
import psycopg
from tqdm import tqdm
URL = "postgresql://postgres:mysecretpassword@localhost:5432/postgres"
#Insert into table
with psycopg.connect(URL) as conn:
with conn.cursor() as cursor:
for text, embedding in tqdm(zip(texts, embeddings), total=len(texts)):
cursor.execute(
"""INSERT INTO movies (text, embedding)
VALUES (%s, %s::real[]);""",
(text, embedding),
)
conn.commit()
WARNING
This may take a while, depending on how much data is to be inserted.
import psycopg
URL = "postgresql://postgres:mysecretpassword@localhost:5432/postgres"
#Creating indexes for embedding column
with psycopg.connect(URL) as conn:
cursor = conn.cursor()
cursor.execute("""
CREATE INDEX ON movies
USING vectors (embedding vector_l2_ops)
WITH (options = \"[indexing.hnsw]\");
""")
conn.commit()
INFO
Recall that for dense vectors, vector_l2_ops
calculates the Squared Euclidean distance, which is the most commonly used distance metric.
Question-Answering with Context
To make use of our knowledge base, queries must first be encoded to determine what texts to retrieve:
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer
#some options: 'paraphrase-mpnet-base-v2', 'all-mpnet-base-v2'
encoder = SentenceTransformer('all-mpnet-base-v2')
def encode_query(query): #Query is a string
#Preprocess the query
query = query.lower()
query = query.split()
query = [word for word in query]
query = ' '.join(query)
#then encode the query
vector = encoder.encode(query)
return vector
Now, we define a function to streamline adding the retrieved context to the initial prompt:
import psycopg
URL = "postgresql://postgres:mysecretpassword@localhost:5432/postgres"
def add_context(prompt):
search_vector = encode_query(prompt).tolist()
# Construct prompt with the retrieved contexts included
prompt_start = (
"Answer the question based on the following context:\n\n"
)
prompt_end = (
f"\n\nQuestion: {prompt}\nAnswer:"
)
with psycopg.connect(URL) as conn:
cursor = conn.cursor()
#Tune limit in accordance with how movie texts you want to give to the model
cursor.execute("SELECT text FROM movies ORDER BY embedding <-> %s::real[] LIMIT 3;",
(search_vector,),)
result = cursor.fetchall()
context = ""
for r in result:
context += r[0] + "\n\n"
return prompt_start + context + prompt_end
INFO
Recall the options for distance operators:
<->
for squared Euclidean distance<#>
for negative dot product<=>
for cosine distance
For definitions, please see the overview.
Now, let's try the prompt again, with added context:
base_prompt_2 = "What is the release year and overall plot of Cat Napping?"
context_prompt_2 = add_context(base_prompt_2)
context_response_2 = complete(context_prompt_2)
print(context_response_2)
Output:
The release year of Cat Napping is 1951 and the overall plot involves Tom and Jerry fighting over a hammock, with Jerry ultimately getting revenge on Tom.
Now the model has given us a more accurate answer! We can even ask further questions:
base_prompt_3 = "In Cat Napping, what does Tom do to Jerry?"
print(complete(add_context(base_prompt_3)))
Output:
Tom tries to send Jerry sliding into a nearby pond by unhooking the hammock he is sleeping in. He also tries to whack the hammock to send Jerry flying into the air, but Jerry lands in a bird's nest. Tom then picks Jerry up on a spatula and places him onto a walking army of ants, causing Jerry to wake up as he bumps his head on a sprinkler. Tom also tries to chase Jerry with a lawn mower and a baseball bat.
As we saw earlier in this post, LLMs do work very well by themselves, but can struggle with niche or specific questions, creating "hallucinations" that may not be noticed. The use of the knowledge base improves factuality and subsequently user trust in the generated outputs.
Further Applications
This method is not only for asking about movies, of course. Other applications could include extracting summaries of scientific papers (try this dataset!), or answering questions about company policies provided a database of company literature.