
Persist VectorStore data in lmdb #138

Closed
wants to merge 10 commits

Conversation

naiquevin
Contributor

No description provided.

@tinkn
Contributor

tinkn commented Oct 7, 2024

Option (1). Also, I am in favor of CBOR serialization for the following reasons:

  • Standardized: CBOR is defined by RFC 7049, which means it has a well-documented specification.
  • Self-describing: CBOR data includes type information, making it more flexible and easier to decode without prior knowledge of the data structure.
  • Widely supported: Many programming languages have CBOR libraries available.
  • Extensible: CBOR allows for custom data types and extensions.
  • Compact: Generally produces smaller output compared to text-based formats like JSON.
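To make the compactness point concrete, here's a minimal, self-contained sketch (illustrative only, not production code) that hand-encodes a tiny map following RFC 7049's major-type rules and compares its size against the equivalent JSON text:

```rust
// Minimal, illustrative CBOR encoding (RFC 7049) for a map of short
// text keys to small unsigned ints. Keys are < 24 bytes and values
// < 24, so every header fits in a single byte:
// header = (major_type << 5) | value.
fn encode_small_map(pairs: &[(&str, u8)]) -> Vec<u8> {
    assert!(pairs.len() < 24);
    let mut out = Vec::new();
    out.push(0xa0 | pairs.len() as u8); // major type 5: map
    for (k, v) in pairs {
        assert!(k.len() < 24 && *v < 24);
        out.push(0x60 | k.len() as u8); // major type 3: text string
        out.extend_from_slice(k.as_bytes());
        out.push(*v); // major type 0: unsigned int
    }
    out
}

fn main() {
    let cbor = encode_small_map(&[("dim", 8), ("m", 4)]);
    let json = r#"{"dim":8,"m":4}"#;
    // CBOR: 1 (map header) + (1 + 3 + 1) + (1 + 1 + 1) = 9 bytes,
    // vs 15 bytes of JSON text for the same data.
    assert_eq!(cbor.len(), 9);
    assert_eq!(cbor[0], 0xa2); // map with 2 entries
    println!("cbor: {} bytes, json: {} bytes", cbor.len(), json.len());
}
```

In practice a serde-based CBOR library would do this; the point here is only that the binary headers are smaller than JSON's quoting and punctuation.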

===
Let's use u8s for the below:

Product with product quantization data
no. of centroids: u16
centroid values: u16
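As an aside, a minimal sketch of what quantizing f32 centroid values down to u8 could look like (illustrative only; the function name is hypothetical and this is not the actual cosdata code):

```rust
// Illustrative sketch: quantize f32 centroid values to u8 by linearly
// mapping [min, max] -> [0, 255]. The (min, max) pair must be kept
// alongside the codes so values can be dequantized later.
fn quantize_to_u8(values: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let range = if max > min { max - min } else { 1.0 };
    let codes = values
        .iter()
        .map(|v| (((v - min) / range) * 255.0).round() as u8)
        .collect();
    (codes, min, max)
}

fn main() {
    let (codes, min, max) = quantize_to_u8(&[0.0, 0.5, 1.0]);
    assert_eq!(codes, vec![0, 128, 255]);
    assert_eq!((min, max), (0.0, 1.0));
}
```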

===

Should current open transaction be persisted?

No. If the DB crashes, consider that transaction irrecoverable/aborted; the user must reattempt it.

@tinkn
Contributor

tinkn commented Oct 7, 2024

Data stored in embeddings, versions and branches lmdb tables

The data stored in these tables is not namespaced by collection. Would that be a problem? Key collisions between data belonging to different collections are possible:
    e.g. the key in the embeddings db is VectorId, which can be the same for two vectors in different collections.
    The key in the branches db is also just the branch id (u64).

Either we create a new "database" for each collection's logical items so they are exclusive, or, if for some reason that proves difficult (I don't see why it would), we can go for the approach below: use a common "database" for all collections, but prefix each key with the collection name hashed with SipHash24, i.e. <collection_hash>: as the key prefix.
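A sketch of that prefixed-key scheme (illustrative; std's DefaultHasher is used here as a stand-in, since a real SipHash-2-4 implementation would come from a crate such as siphasher):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of the `<collection_hash>:<key>` scheme for a shared lmdb
// database. Note: DefaultHasher currently uses SipHash-1-3 internally;
// it stands in here for SipHash-2-4 purely for illustration.
fn prefixed_key(collection: &str, key: &[u8]) -> Vec<u8> {
    let mut hasher = DefaultHasher::new();
    collection.hash(&mut hasher);
    let hash: u64 = hasher.finish();
    let mut out = Vec::with_capacity(8 + 1 + key.len());
    out.extend_from_slice(&hash.to_be_bytes()); // 8-byte collection hash
    out.push(b':');
    out.extend_from_slice(key); // the original key, e.g. a VectorId
    out
}

fn main() {
    let k = prefixed_key("my_collection", b"vector-42");
    assert_eq!(k.len(), 8 + 1 + 9);
    assert_eq!(k[8], b':');
    assert!(k.ends_with(b"vector-42"));
    // Same collection -> same prefix; the same raw vector id under a
    // different collection can no longer collide.
    assert_eq!(prefixed_key("my_collection", b"x")[..8], k[..8]);
}
```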

@naiquevin
Contributor Author

Either we create a new "database" for each collection's logical items so they are exclusive,
If for some reason this is proving difficult (I don't see why it will be)

The documentation on configuring the max db limit warns about having too many databases - http://www.lmdb.tech/doc/group__mdb.html#gaa2fc2f1f37cb1115e733b62cab2fcdbc.

Currently a moderate number of slots are cheap but a huge number gets expensive: 7-120 words per transaction, and every mdb_dbi_open() does a linear search of the opened slots.

It doesn't say what exactly counts as a "huge" number. But assuming that an instance of cosdata will support an unbounded no. of collections, option 1 doesn't seem like a good idea. What do you think?

@tinkn
Contributor

tinkn commented Oct 9, 2024

We don't expect more than a few collections in the single-pod instance that we are developing for the MVP, so let's create an exclusive database per collection. Later, when we need to support a multitenant/cloud-hosted serverless solution, this can be addressed differently.

@naiquevin
Contributor Author

@tinkn I've pushed a commit 4909bf0 to persist VectorStore data to lmdb.

I've created a new db named collections, with the SipHash24 of the collection name as the key and the CBOR-serialized object as the value. Considering that lmdb could also contain other collection-agnostic dbs such as metadata, this felt like a cleaner way to fetch the list of all collections than listing all db names and filtering. Even with a separate collections db, we can still use the exclusive db approach.

I was also thinking about the exclusive db approach and have the following points:

Implications on existing code and data

It will affect the existing dbs embeddings, versions and branches in lmdb. If we're going with the exclusive db approach, they will be moved inside the exclusive db, and hence the keys will need to be prefixed with some string/char to indicate the type of key, e.g. all version keys can be v:<u32hash>, embeddings can be e:<vector-id>, branches can be b:<branch-id>.

Instead, if we continue with the current approach, we'd need to prefix the keys in the embeddings, versions and branches dbs with the SipHash of the collection name.

The refactoring effort required for both approaches looks the same to me. But either way, the change will not be backward compatible with existing data, so everyone will need to delete their existing mdb dir.

Max dbs config for lmdb

The configuration for max_dbs is currently hard-coded when the lmdb Environment is initialized. What max value should we use here? Is 10 or 20 a good number?

    let env = Environment::new()
        .set_max_dbs(5)
        .set_map_size(10485760) // Set the maximum size of the database to 10MB
        .open(&path)
        .map_err(|e| WaCustomError::DatabaseError(e.to_string()))?;

We can make this configurable (with some upper limit to set expectations).
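One way to sketch that (constants and the function name here are hypothetical, not existing cosdata code): resolve the configured value against a default and clamp it to an upper limit before passing it to set_max_dbs:

```rust
// Hypothetical sketch: resolve a user-configured max_dbs value,
// falling back to a default and clamping to an upper limit so the
// lmdb slot-count warning is respected.
const DEFAULT_MAX_DBS: u32 = 10;
const MAX_DBS_UPPER_LIMIT: u32 = 64;

fn resolve_max_dbs(configured: Option<u32>) -> u32 {
    configured
        .map(|n| n.min(MAX_DBS_UPPER_LIMIT))
        .unwrap_or(DEFAULT_MAX_DBS)
}

fn main() {
    assert_eq!(resolve_max_dbs(None), DEFAULT_MAX_DBS);
    assert_eq!(resolve_max_dbs(Some(1000)), MAX_DBS_UPPER_LIMIT); // clamped
    assert_eq!(resolve_max_dbs(Some(20)), 20);
    println!("max_dbs = {}", resolve_max_dbs(Some(20)));
}
```

The resolved value would then replace the hard-coded `set_max_dbs(5)` above.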

Please let me know your thoughts.

The VectorStore instances created upon incoming requests to the
'create-collection' endpoint were only getting added to the in-memory
DashMap and not persisted to disk.

This commit fixes that by writing the vector store data to a new lmdb
database. At startup, the data from lmdb for all collections (vector
stores) is also loaded into the DashMap. The 'VectorStoreMap' type has
been refactored to provide this abstraction.

Note that a few things are pending such as properly loading the root
node in 'VectorStoreMap' for each collection. Besides, a few more
things have been marked with TODO comments, which will be addressed in
later commits.
The error is converted to a 5xx (Internal server error) HTTP response.

So that it's stored as app state and available to the actix handler
fns. This commit also modifies all code to use the env from the app
context instead of calling the function 'get_app_env'.

There are 2 reasons for doing this change:

1. As part of initializing the 'ain_env', we're loading the vector
store map from lmdb. While doing so, we also want to eagerly load the
root vec for all vector stores. This change allows us to use the
'node_registry' (cache) in app context so that the cache is also
warmed. Note that the eager loading of root vec is pending and will be
implemented in a later commit.

2. Earlier the env was initialized lazily in the request handlers,
upon the first call to the 'get_app_env' function. But if env
initialization fails, any request will fail anyway. After this change,
the env is initialized at startup and if it fails, the process will
fail to start (code will panic). This is much better from a debugging
standpoint.
A new field 'file_index' is added to 'VectorStoreData' which is
serialized as CBOR and persisted to the 'collections' db of lmdb. This
will be used for locating the root vector node of the collection.
@naiquevin naiquevin changed the title [DRAFT] Persist VectorStore data in lmdb Persist VectorStore data in lmdb Oct 11, 2024
The load_cache call is currently failing. It will not be required,
once the cache is loaded as part of initializing the vector_store_map,
which will be implemented in subsequent commits.
We no longer need to store the app env in a global var. Hence we can
get rid of the 'OnceLock' and have the 'get_app_env' function return
the value directly.
Earlier, the 'new' method panicked if the env could not be
initialized. This commit modifies it to return a Result with
WaCustomError. This way, the calling code may decide to panic or
handle it in some other way.
@naiquevin naiquevin marked this pull request as ready for review October 11, 2024 09:51
@naiquevin
Contributor Author

naiquevin commented Oct 11, 2024

@tinkn @a-rustacean The PR includes the following changes:

  1. Store VectorStore data in lmdb (in a new db named collections)
  2. Load VectorStore data from lmdb at startup. As part of this, the root node for all collections is loaded into memory via the cache (NodeRegistry).
  3. As the cache is loaded at startup, the call to load_cache function in main has been removed

Pending items from my list:

1. The data stored in the embeddings, versions and branches tables in lmdb is not namespaced by the vector store/collection id. Please check this comment as well - Persist VectorStore data in lmdb #138 (comment)
  2. We need separate prop.data and *.vec_raw files per collection. I've created a separate ticket to track this - Separate prop.data, *.vec_raw and *.index files per collection #141
  3. Making the HNSW (max_cache_level) parameter dynamic

@a-rustacean
Member

It will affect the existing dbs embeddings, versions and branches in lmdb. If we're going with the exclusive db approach, they will be moved inside the exclusive db, and hence the keys will need to be prefixed with some string/char to indicate the type of key, e.g. all version keys can be v:<u32hash>, embeddings can be e:<vector-id>, branches can be b:<branch-id>.

Since the key is ultimately converted to/from [u8], we can prefix the key with a single byte, so all the version keys can be [0, ...], embeddings can be [1, ...], and branches can be [2, ...].
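A sketch of that single-byte tagging (the constants are illustrative, not names from the codebase):

```rust
// Illustrative single-byte type tags for keys sharing one lmdb db.
const KEY_VERSION: u8 = 0;
const KEY_EMBEDDING: u8 = 1;
const KEY_BRANCH: u8 = 2;

// Prepend the type byte to the raw key bytes.
fn typed_key(kind: u8, raw: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + raw.len());
    out.push(kind);
    out.extend_from_slice(raw);
    out
}

fn main() {
    let branch_id: u64 = 7;
    let key = typed_key(KEY_BRANCH, &branch_id.to_be_bytes());
    assert_eq!(key, vec![2, 0, 0, 0, 0, 0, 0, 0, 7]);
    // The same u64 under a different type byte no longer collides:
    assert_ne!(typed_key(KEY_VERSION, &branch_id.to_be_bytes()), key);
    let _ = KEY_EMBEDDING; // unused in this tiny demo
}
```

This costs one byte per key versus the two-or-more bytes of a string prefix like "v:", and comparisons stay purely byte-wise.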

@a-rustacean a-rustacean mentioned this pull request Oct 11, 2024
@a-rustacean
Member

Continued in #146
