Dense vector search with Solr and sentence transformers

Jun 29, 2023

I’m really appreciating the benefits of having my own kube cluster at home. Using some DNS tricks, I have been successful in making it easy to access local services in my home LAN while also making it available securely in the internet using Tailscale. I now have a fair number of services that I’ve been using on a day to day basis but few are more important than others. To name some I now consider critical for myself:

a custom webdav server to host my books that I can read from any device at home.
dokuwiki to track anything I learn.
a custom openssh server based on linuxserver images that hosts my dev environment. It has emacs (using doom emacs configs) and also includes a Jupyter lab server.

OpenAI embeddings and searching PDF’s

I’ve been wanting to get into some of the ML craze these days and trying to learn some of the basics. I’ve also been seeing a lot of mentions of OpenAI embeddings and decided to start with the concept of embeddings. Based on OpenAI’s documentation -

OpenAI’s text embeddings measure the relatedness of text strings. Embeddings are commonly used for:

Search (where results are ranked by relevance to a query string)
Clustering (where text strings are grouped by similarity)
Recommendations (where items with related text strings are recommended)
Anomaly detection (where outliers with little relatedness are identified)
Diversity measurement (where similarity distributions are analyzed)
Classification (where text strings are classified by their most similar label)

With that said, I wanted to learn how I could use it to improve search on my PDF collections and so my plan was to:

Get the text out of the PDFs
Get their text embeddings
Somehow make it possible to query those embeddings for similarity to my search string.

I recently wrote a tool with PDFBOX to convert a PDF to sqlite. Decided to go with that for #1 above. For #2, we don’t really need OpenAI’s embeddings. They cost money (even if they potentially produce better embeddings) and I didn’t really feel like there was any justification for that in my activity. I decided to go with the Huggingface’s sentence transformers. For the last one, we need to be able to index the embeddings and query over it. Since I already have Solr installation, and it supports dense vector search, I decided to go with it.

Learning with Jupyter

Turns out Jupyter allows exporting your notebook to markdown which I have included below with some additional comments.

Requirements and imports

Getting some of the requirements and the imports out of the way so we can get to coding. I decided to use pysbd to do sentence boundary detection.

!pip install pysbd pysolr
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
!pip3 install sentence-transformers

import sqlite3
from contextlib import closing
import pysbd
from sentence_transformers import SentenceTransformer
import requests
import urllib.parse
import pysolr
import json

Getting the pdf text from SQLITE

I developed a tool to convert PDF’s to sqlite databases with PDFBOX. I wrote it to do some exploratory analysis on PDF’s but the tool needs a bit more work before I call it finished. It does extract the text pretty well and that serves the purpose for me in this exercise.

files = ["/config/notebooks/large.pdf.db", "/config/notebooks/2305.06344.pdf.db"]
db = sqlite3.connect(files[0])

def get_page_numbers():
    with closing(db.cursor()) as cursor:
        res = cursor.execute("SELECT distinct page_num from text_line")
        pages = [int(row[0]) for row in res.fetchall()]
        return pages
    
def get_page_text(page_num = 1):
    with closing(db.cursor()) as cursor:
        res = cursor.execute("SELECT text from text_line where page_num=?", [page_num]);
        content = res.fetchall()
        results = [str(row[0]) for row in content]
        return results

Using the sentence transformers models to get the necessary embeddings

Getting the embeddings involves just a few function calls. Though, you need to be careful selecting the embedding parameter size. Solr supports a max of 1000 params (IIRC) and the higher the params count, the more expensive the call is going to be.

seg = pysbd.Segmenter(language="en", clean=True)
t_models = [
    'sentence-transformers/all-mpnet-base-v1', # 768 params 
    'sentence-transformers/all-MiniLM-L6-v2' # 384 params
]

def segment_texts(texts=[]):
    sentences = seg.segment(''.join(texts))
    return sentences

def embed_sentences(sentences, model_name = t_models[1]):
    model = SentenceTransformer(model_name)
    embeddings = model.encode(sentences)
    return embeddings

Getting the solr collection ready for indexing

The following are just some of the helper methods to prepare your solr collection, add the right vector fields and upload documents.

auth = ('user', 'password')
collection = None
default_solr = None

def set_collection(name):
    global collection, default_solr
    collection = name
    default_solr = pysolr.Solr(solr_endpoint(collection), always_commit=True, timeout=10, auth=auth)

def solr_endpoint(collection):
    return f'https://solr.sharmaso.com/solr/{collection}'

def clean_collection(deep_clean = True):
    if deep_clean:
        delete_vector_field()
        delete_field_type()
    default_solr.delete(q='*:*')
    
def update_page_embedding(page_num, content, embeddings, auth=auth):
    response = {
        'page_num': page_num,
        'page_vector': embeddings,
        'page_content': content
    }
    default_solr.add([response])

def upload_docs():
    for page_num in get_page_numbers():
        content = get_page_text(page_num)
        merged_txt = "".join(content)
        embeddings = embed_sentences(merged_txt)
        update_page_embedding(page_num, merged_txt, embeddings.tolist()) 
        
def replace_field_type(data = {}):
    payload = {
        'replace-field-type': {
            'name': 'knn_vector',
            'class': 'solr.DenseVectorField',
            'vectorDimension': '384',
            'similarityFunction': 'dot_product'
        }
    }
    payload.update(data)
    return update_schema(data = payload)

def delete_field_type(data = {}):
    payload = {
        'delete-field-type': {
            'name': 'knn_vector',
        }
    }
    payload.update(data)
    return update_schema(data = payload)

def add_field_type(data = {}):
    payload = {
        'add-field-type': {
            'name': 'knn_vector',
            'class': 'solr.DenseVectorField',
            'vectorDimension': '384',
            'similarityFunction': 'dot_product'
        }
    }
    payload.update(data)
    return update_schema(data = payload)

def delete_vector_field(data = {}):
    payload = {
        'delete-field': {
            'name':'page_vector'
        }
    }
    payload.update(data)
    return update_schema(data = payload)
    

def add_vector_field(data = {}):
    payload = {
        'add-field': {
            'stored':'true',
            'indexed':'true',
            'uninvertible':'false',
            'name':'page_vector',
            'type':'knn_vector'
        }
    }
    payload.update(data)
    return update_schema(data = payload)

def update_schema(content_type = "application/json", data = {}):
    url = f"{solr_endpoint(collection)}/schema?&wt=json"
    headers = {'Content-Type': content_type }
    response = requests.post(url, data=json.dumps(data),auth=auth)
    if response.status_code / 100 != 2:
        response_text = response.text
        if not "is not present in this schema, and so cannot be deleted" in response_text:
            print(f'Status-code: {response.status_code} {response.text}')
        else:
            # ignore these errors
            pass
        
def search(query, field="page_vector", topK = 5 ):
    query_vect = embed_sentences(query).tolist()
    squery = "{" + f'!knn f={field} topK={topK}' + "}"
    return default_solr.search(squery + str(query_vect), **{
        'hl': 'true',
        'hl.fragsize': 10,
        'hl.q': query,
        'hl.fl': 'page_content'
    })

The end is the beginning of the end

And this is where it all begins and ends. We prepare the solr collection to add the necessary fields, upload the documents. We then do a quick search to get the results from the solr collection.

set_collection("stuff")
deep_clean = False
if deep_clean:
    clean_collection()
    add_field_type()
    add_vector_field()
    upload_docs()
results = search('lists vs sets')
ignored_keys = ['page_vector']

clean_results = [{key: value for key, value in d.items() if key not in ignored_keys} for d in results.docs]
{
    'results': clean_results,
    'highlighting': results.highlighting
}

{'results': [{'page_num': [22],
   'page_content': ['SECTION 1.A Rn and Cn 5 Lists Before defining Rn and Cn, we look at two important examples. 1.7 Example R2 and R3 \x02 The set R2, which you can think of as a plane, is the set of all ordered pairs of real numbers: R2 D f.x; y/ W x; y 2 Rg: \x02 The set R3, which you can think of as ordinary space, is the set of all ordered triples of real numbers: R3 D f.x; y; z/ W x; y; z 2 Rg: To generalize R2 and R3 to higher dimensions, we first need to discuss the concept of lists. 1.8 Definition list, length Suppose n is a nonnegative integer. A list of length n is an ordered collection of n elements (which might be numbers, other lists, or more abstract entities) separated by commas and surrounded by parentheses. A list of length n looks like this: .x1; : : : ; xn/: Two lists are equal if and only if they have the same length and the same elements in the same order. Thus a list of length 2 is an ordered Many mathematicians call a list of pair, and a list of length 3 is an ordered length n an n-tuple. triple. Sometimes we will use the word list without specifying its length. Re- member, however, that by definition each list has a finite length that is a nonnegative integer. Thus an object that looks like .x1; x2; : : : /; which might be said to have infinite length, is not a list. A list of length 0 looks like this: . /. We consider such an object to be a list so that some of our theorems will not have trivial exceptions. Lists differ from sets in two ways: in lists, order matters and repetitions have meaning; in sets, order and repetitions are irrelevant. '],
   'id': '3c2f9dd6-1b86-4e6d-a0fa-1cad51ef49df',
   '_version_': 1765902685270179840},
  {'page_num': [23],
   'page_content': ['6 CHAPTER 1 Vector Spaces 1.9 Example lists versus sets \x02 The lists .3; 5/ and .5; 3/ are not equal, but the sets f3; 5g and f5; 3g are equal. \x02 The lists .4; 4/ and .4; 4; 4/ are not equal (they do not have the same length), although the sets f4; 4g and f4; 4; 4g both equal the set f4g. Fn To define the higher-dimensional analogues of R2 and R3, we will simply replace R with F (which equals R or C) and replace theFana 2 or 3 with an arbitrary positive integer. Specifically, fix a positive integer n for the rest of this section. 1.10 Definition Fn Fn is the set of all lists of length n of elements of F: Fn D f.x1; : : : ; xn/ W xj 2 F for j D 1; : : : ; ng: For .x1; : : : ; x n th n/ 2 F and j 2 f1; : : : ; ng, we say that xj is the j coordinate of .x1; : : : ; xn/. If F D R and n equals 2 or 3, then this definition of Fn agrees with our previous notions of R2 and R3. 1.11 Example C4 is the set of all lists of four complex numbers: C4 D f.z1; z2; z3; z4/ W z1; z2; z3; z4 2 Cg: n For an amusing account of how If n \x05 4, we cannot visualize R R3 would be perceived by crea- as a physical object. Similarly, C1 can tures living in R2, read Flatland: be thought of as a plane, but for n \x05 2, A Romance of Many Dimensions, the human brain cannot provide a full by Edwin A. Abbott. This novel, image of Cn. However, even if n is published in 1884, may help you large, we can perform algebraic manip- imagine a physical space of four or n more dimensions. ulations in F as easily as in R2 or R3. For example, addition in Fn is defined as follows: '],
   'id': 'ce92f90f-bdaa-4a8a-9fe8-b961b0d5b521',
   '_version_': 1765902685707436032},
  {'page_num': [45],
   'page_content': ['28 CHAPTER 2 Finite-Dimensional Vector Spaces 2.A Span and Linear Independence We have been writing lists of numbers surrounded by parentheses, and we will continue to do so for elements of Fn; for example, .2;\x037; 8/ 2 F3. However, now we need to consider lists of vectors (which may be elements of Fn or of other vector spaces). To avoid confusion, we will usually write lists of vectors without surrounding parentheses. For example, .4; 1; 6/; .9; 5; 7/ is a list of length 2 of vectors in R3. 2.2 Notation list of vectors We will usually write lists of vectors without surrounding parentheses. Linear Combinations and Span Adding up scalar multiples of vectors in a list gives what is called a linear combination of the list. Here is the formal definition: 2.3 Definition linear combination A linear combination of a list v1; : : : ; vm of vectors in V is a vector of the form a1v1 C \x04 \x04 \x04 C amvm; where a1; : : : ; am 2 F. 2.4 Example In F3, \x02 .17;\x034; 2/ is a linear combination of .2; 1;\x033/; .1;\x032; 4/ because .17;\x034; 2/ D 6.2; 1;\x033/C 5.1;\x032; 4/: \x02 .17;\x034; 5/ is not a linear combination of .2; 1;\x033/; .1;\x032; 4/ because there do not exist numbers a1; a2 2 F such that .17;\x034; 5/ D a1.2; 1;\x033/C a2.1;\x032; 4/: In other words, the system of equations 17 D 2a1 C a2 \x034 D a1 \x03 2a2 5 D \x033a1 C 4a2 has no solutions (as you should verify). '],
   'id': '5cc5a5b9-8253-4b24-9b8b-2a1241402f38',
   '_version_': 1765902695190757376},
  {'page_num': [56],
   'page_content': ['SECTION 2.B Bases 39 2.B Bases In the last section, we discussed linearly independent lists and spanning lists. Now we bring these concepts together. 2.27 Definition basis A basis of V is a list of vectors in V that is linearly independent and spans V. 2.28 Example bases (a) The list .1; 0; : : : ; 0/; .0; 1; 0; : : : ; 0/; : : : ; .0; : : : ; 0; 1/ is a basis of Fn, called the standard basis of Fn. (b) The list .1; 2/; .3; 5/ is a basis of F2. (c) The list .1; 2;\x034/; .7;\x035; 6/ is linearly independent in F3 but is not a basis of F3 because it does not span F3. (d) The list .1; 2/; .3; 5/; .4; 13/ spans F2 but is not a basis of F2 because it is not linearly independent. (e) The list .1; 1; 0/; .0; 0; 1/ is a basis of f.x; x; y/ 2 F3 W x; y 2 Fg. (f) The list .1;\x031; 0/; .1; 0;\x031/ is a basis of f.x; y; z/ 2 F3 W x C y C z D 0g: (g) The list 1; z; : : : ; zm is a basis of Pm.F/. In addition to the standard basis, Fn has many other bases. For example, .7; 5/; .\x034; 9/ and .1; 2/; .3; 5/ are both bases of F2. The next result helps explain why bases are useful. Recall that “uniquely” means “in only one way”. 2.29 Criterion for basis A list v1; : : : ; vn of vectors in V is a basis of V if and only if every v 2 V can be written uniquely in the form 2.30 v D a1v1 C \x04 \x04 \x04 C anvn; where a1; : : : ; an 2 F. '],
   'id': '9a3a1ade-eba0-4a5d-9bd8-09d4a43f6fd2',
   '_version_': 1765902699907252224},
  {'page_num': [46],
   'page_content': ['SECTION 2.A Span and Linear Independence 29 2.5 Definition span The set of all linear combinations of a list of vectors v1; : : : ; vm in V is called the span of v1; : : : ; vm, denoted span.v1; : : : ; vm/. In other words, span.v1; : : : ; vm/ D fa1v1 C \x04 \x04 \x04 C amvm W a1; : : : ; am 2 Fg: The span of the empty list . / is defined to be f0g. 2.6 Example The pr\x02evious example show\x03s that in F3, \x02 .17;\x034; 2/ 2 span\x02.2; 1;\x033/; .1;\x032; 4/\x03;\x02 .17;\x034; 5/ … span .2; 1;\x033/; .1;\x032; 4/ . Some mathematicians use the term linear span, which means the same as span. 2.7 Span is the smallest containing subspace The span of a list of vectors in V is the smallest subspace of V containing all the vectors in the list. Proof Suppose v1; : : : ; vm is a list of vectors in V. First we show that span.v1; : : : ; vm/ is a subspace of V. The additive identity is in span.v1; : : : ; vm/, because 0 D 0v1 C \x04 \x04 \x04 C 0vm: Also, span.v1; : : : ; vm/ is closed under addition, because .a1v1C\x04 \x04 \x04Camvm/C.c1v1C\x04 \x04 \x04Ccmvm/ D .a1Cc1/v1C\x04 \x04 \x04C.amCcm/vm: Furthermore, span.v1; : : : ; vm/ is closed under scalar multiplication, because \x02.a1v1 C \x04 \x04 \x04 C amvm/ D \x02a1v1 C \x04 \x04 \x04 C \x02amvm: Thus span.v1; : : : ; vm/ is a subspace of V (by 1.34). Each vj is a linear combination of v1; : : : ; vm (to show this, set aj D 1 and let the other a’s in 2.3 equal 0). Thus span.v1; : : : ; vm/ contains each vj . Conversely, because subspaces are closed under scalar multiplication and addition, every subspace of V containing each vj contains span.v1; : : : ; vm/. Thus span.v1; : : : ; vm/ is the smallest subspace of V containing all the vectors v1; : : : ; vm. '],
   'id': 'c3a319bb-3ee7-4a74-b281-1b4afcda0467',
   '_version_': 1765902695623819264}],
 'highlighting': {'3c2f9dd6-1b86-4e6d-a0fa-1cad51ef49df': {'page_content': ['We consider such an object to be a list so that some of our theorems will not have trivial exceptions. <em>Lists</em> differ from <em>sets</em> in two ways: in <em>lists</em>, order matters and repetitions have meaning; in <em>sets</em>, order and repetitions are irrelevant. ']},
  'ce92f90f-bdaa-4a8a-9fe8-b961b0d5b521': {'page_content': ['6 CHAPTER 1 Vector Spaces 1.9 Example <em>lists</em> versus <em>sets</em> \x02 The <em>lists</em> .3; 5/ and .5; 3/ are not equal, but the <em>sets</em> f3; 5g and f5; 3g are equal. \x02 The <em>lists</em> .4; 4/ and .4; 4; 4/ are not equal (they do not have the same length), although the <em>sets</em> f4; 4g and f4; 4; 4g both equal the set f4g. ']},
  '5cc5a5b9-8253-4b24-9b8b-2a1241402f38': {'page_content': ['To avoid confusion, we will usually write <em>lists</em> of vectors without surrounding parentheses. ']},
  '9a3a1ade-eba0-4a5d-9bd8-09d4a43f6fd2': {'page_content': ['SECTION 2.B Bases 39 2.B Bases In the last section, we discussed linearly independent <em>lists</em> and spanning <em>lists</em>. ']},
  'c3a319bb-3ee7-4a74-b281-1b4afcda0467': {}}}