
Persist VectorStore data in lmdb #138

Closed
wants to merge 10 commits

Conversation

naiquevin
Contributor

No description provided.

@tinkn
Contributor

tinkn commented Oct 7, 2024

Option (1). Also, I am in favor of CBOR serialization for the following reasons:

  • Standardized: CBOR is defined by RFC 7049, which means it has a well-documented specification.
  • Self-describing: CBOR data includes type information, making it more flexible and easier to decode without prior knowledge of the data structure.
  • Widely supported: Many programming languages have CBOR libraries available.
  • Extensible: CBOR allows for custom data types and extensions.
  • Compact: Generally produces smaller output compared to text-based formats like JSON.
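To make the compactness point concrete, here's a minimal, self-contained sketch (illustrative only, not production code) that hand-encodes a tiny map following RFC 7049's major-type rules and compares its size against the equivalent JSON text:

```rust
// Minimal, illustrative CBOR encoding (RFC 7049) for a map of short
// text keys to small unsigned ints. Keys are < 24 bytes and values
// < 24, so every header fits in a single byte:
// header = (major_type << 5) | value.
fn encode_small_map(pairs: &[(&str, u8)]) -> Vec<u8> {
    assert!(pairs.len() < 24);
    let mut out = Vec::new();
    out.push(0xa0 | pairs.len() as u8); // major type 5: map
    for (k, v) in pairs {
        assert!(k.len() < 24 && *v < 24);
        out.push(0x60 | k.len() as u8); // major type 3: text string
        out.extend_from_slice(k.as_bytes());
        out.push(*v); // major type 0: unsigned int
    }
    out
}

fn main() {
    let cbor = encode_small_map(&[("dim", 8), ("m", 4)]);
    let json = r#"{"dim":8,"m":4}"#;
    // CBOR: 1 (map header) + (1 + 3 + 1) + (1 + 1 + 1) = 9 bytes,
    // vs 15 bytes of JSON text for the same data.
    assert_eq!(cbor.len(), 9);
    assert_eq!(cbor[0], 0xa2); // map with 2 entries
    println!("cbor: {} bytes, json: {} bytes", cbor.len(), json.len());
}
```

In practice a serde-based CBOR library would do this; the point here is only that the binary headers are smaller than JSON's quoting and punctuation.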

===
Let's use u8s for the below:

Product with product quantization data
no. of centroids: u16
centroid values: u16
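As an aside, a minimal sketch of what quantizing f32 centroid values down to u8 could look like (illustrative only; the function name is hypothetical and this is not the actual cosdata code):

```rust
// Illustrative sketch: quantize f32 centroid values to u8 by linearly
// mapping [min, max] -> [0, 255]. The (min, max) pair must be kept
// alongside the codes so values can be dequantized later.
fn quantize_to_u8(values: &[f32]) -> (Vec<u8>, f32, f32) {
    let min = values.iter().cloned().fold(f32::INFINITY, f32::min);
    let max = values.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let range = if max > min { max - min } else { 1.0 };
    let codes = values
        .iter()
        .map(|v| (((v - min) / range) * 255.0).round() as u8)
        .collect();
    (codes, min, max)
}

fn main() {
    let (codes, min, max) = quantize_to_u8(&[0.0, 0.5, 1.0]);
    assert_eq!(codes, vec![0, 128, 255]);
    assert_eq!((min, max), (0.0, 1.0));
}
```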

===

Should current open transaction be persisted?

No. If the DB crashes, consider that transaction irrecoverable/aborted; the user must reattempt it.

@tinkn
Contributor

tinkn commented Oct 7, 2024

Data stored in embeddings, versions and branches lmdb tables

The data stored in these tables is not namespaced by collection. Would that be a problem? Key collisions between data belonging to different collections are possible:
    e.g. the key in the embeddings db is VectorId, which can be the same for two vectors in different collections.
    The key in the branches db is also just the branch id (u64).

Either we create a new "database" for each collection's logical items so they are exclusive, or, if for some reason that proves difficult (I don't see why it would), we can go for the approach below: use a common "database" for all collections, but prefix each key with the collection name hashed with SipHash24, i.e. <collection_hash>: as the key prefix.
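A sketch of that prefixed-key scheme (illustrative; std's DefaultHasher is used here as a stand-in, since a real SipHash-2-4 implementation would come from a crate such as siphasher):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Sketch of the `<collection_hash>:<key>` scheme for a shared lmdb
// database. Note: DefaultHasher currently uses SipHash-1-3 internally;
// it stands in here for SipHash-2-4 purely for illustration.
fn prefixed_key(collection: &str, key: &[u8]) -> Vec<u8> {
    let mut hasher = DefaultHasher::new();
    collection.hash(&mut hasher);
    let hash: u64 = hasher.finish();
    let mut out = Vec::with_capacity(8 + 1 + key.len());
    out.extend_from_slice(&hash.to_be_bytes()); // 8-byte collection hash
    out.push(b':');
    out.extend_from_slice(key); // the original key, e.g. a VectorId
    out
}

fn main() {
    let k = prefixed_key("my_collection", b"vector-42");
    assert_eq!(k.len(), 8 + 1 + 9);
    assert_eq!(k[8], b':');
    assert!(k.ends_with(b"vector-42"));
    // Same collection -> same prefix; the same raw vector id under a
    // different collection can no longer collide.
    assert_eq!(prefixed_key("my_collection", b"x")[..8], k[..8]);
}
```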

@naiquevin
Contributor Author

Either we create a new "database" for each collection's logical items so they are exclusive,
If for some reason this is proving difficult (I don't see why it will be)

The documentation on configuring the max db limit warns about having too many databases - http://www.lmdb.tech/doc/group__mdb.html#gaa2fc2f1f37cb1115e733b62cab2fcdbc.

Currently a moderate number of slots are cheap but a huge number gets expensive: 7-120 words per transaction, and every mdb_dbi_open() does a linear search of the opened slots.

It doesn't say what exactly counts as a "huge" number. But assuming that an instance of cosdata will support an unbounded no. of collections, option 1 doesn't seem like a good idea. What do you think?

@tinkn
Contributor

tinkn commented Oct 9, 2024

We don't expect more than a few collections in the single-pod instance that we are developing for the MVP, so let's create an exclusive database per collection. Later, when we need to support a multitenant/cloud-hosted serverless solution, this can be addressed differently.

@naiquevin
Contributor Author

@tinkn I've pushed a commit 4909bf0 to persist VectorStore data to lmdb.

I've created a new db named collections, with the SipHash24 of the collection name as the key and the CBOR-serialized object as the value. Considering that lmdb could also contain other collection-agnostic dbs such as metadata, this felt like a cleaner way to fetch the list of all collections than listing all db names and filtering. Even with a separate collections db, we can still use the exclusive db approach.

I was also thinking about the exclusive db approach and have the following points:

Implications on existing code and data

It will affect the existing dbs embeddings, versions and branches in lmdb. If we're going with the exclusive db approach, they will be moved inside the exclusive db, and hence the keys will need to be prefixed with some string/char to indicate the type of key, e.g. all version keys can be v:<u32hash>, embeddings can be e:<vector-id>, branches can be b:<branch-id>.

Instead, if we continue with the current approach, we'd need to prefix the keys in the embeddings, versions and branches dbs with the SipHash of the collection name.

The refactoring effort required for both approaches looks the same to me. But either way, the change will not be backward compatible with existing data, so everyone will need to delete their existing mdb dir.

Max dbs config for lmdb

The configuration for max_dbs is currently hard-coded when the lmdb Environment is initialized. What max value should we use here? Is 10 or 20 a good number?

    let env = Environment::new()
        .set_max_dbs(5)
        .set_map_size(10485760) // Set the maximum size of the database to 10MB
        .open(&path)
        .map_err(|e| WaCustomError::DatabaseError(e.to_string()))?;

We can make this configurable (with some upper limit to set expectations).
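One way to sketch that (constants and the function name here are hypothetical, not existing cosdata code): resolve the configured value against a default and clamp it to an upper limit before passing it to set_max_dbs:

```rust
// Hypothetical sketch: resolve a user-configured max_dbs value,
// falling back to a default and clamping to an upper limit so the
// lmdb slot-count warning is respected.
const DEFAULT_MAX_DBS: u32 = 10;
const MAX_DBS_UPPER_LIMIT: u32 = 64;

fn resolve_max_dbs(configured: Option<u32>) -> u32 {
    configured
        .map(|n| n.min(MAX_DBS_UPPER_LIMIT))
        .unwrap_or(DEFAULT_MAX_DBS)
}

fn main() {
    assert_eq!(resolve_max_dbs(None), DEFAULT_MAX_DBS);
    assert_eq!(resolve_max_dbs(Some(1000)), MAX_DBS_UPPER_LIMIT); // clamped
    assert_eq!(resolve_max_dbs(Some(20)), 20);
    println!("max_dbs = {}", resolve_max_dbs(Some(20)));
}
```

The resolved value would then replace the hard-coded `set_max_dbs(5)` above.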

Please let me know your thoughts.

The VectorStore instances created upon incoming requests to the
'create-collection' endpoint were only getting added to the in-memory
DashMap and not persisted to disk.

This commit fixes that by writing the vector store data to a new lmdb
database. At startup, the data from lmdb for all collections (vector
stores) is also loaded into the DashMap. The 'VectorStoreMap' type has
been refactored to provide this abstraction.

Note that a few things are pending such as properly loading the root
node in 'VectorStoreMap' for each collection. Besides, a few more
things have been marked with TODO comments, which will be addressed in
later commits.
The error is converted to a 5xx (Internal server error) HTTP response.

So that it's stored as app state and available to the actix handler
fns. This commit also modifies all code to use the env from the app
context instead of calling the function 'get_app_env'.

There are 2 reasons for doing this change:

1. As part of initializing the 'ain_env', we're loading the vector
store map from lmdb. While doing so, we also want to eagerly load the
root vec for all vector stores. This change allows us to use the
'node_registry' (cache) in app context so that the cache is also
warmed. Note that the eager loading of root vec is pending and will be
implemented in a later commit.

2. Earlier the env was initialized lazily in the request handlers,
upon the first call to the 'get_app_env' function. But if env
initialization fails, any request will fail anyway. After this change,
the env is initialized at startup and if it fails, the process will
fail to start (code will panic). This is much better from a debugging
standpoint.
A new field 'file_index' is added to 'VectorStoreData' which is
serialized as CBOR and persisted to the 'collections' db of lmdb. This
will be used for locating the root vector node of the collection.
@naiquevin naiquevin changed the title [DRAFT] Persist VectorStore data in lmdb Persist VectorStore data in lmdb Oct 11, 2024
The load_cache call is currently failing. It will not be required,
once the cache is loaded as part of initializing the vector_store_map,
which will be implemented in subsequent commits.
We no longer need to store the app env in a global var. Hence we can
get rid of the 'OnceLock' and have the 'get_app_env' function return
the value directly.
Earlier, the 'new' method panicked if the env could not be
initialized. This commit modifies it to return a Result with
WaCustomError. This way, the calling code may decide to panic or
handle it in some other way.
@naiquevin naiquevin marked this pull request as ready for review October 11, 2024 09:51
@naiquevin
Contributor Author

naiquevin commented Oct 11, 2024

@tinkn @a-rustacean The PR includes the following changes:

  1. Store VectorStore data in lmdb (in a new db named collections)
  2. Load VectorStore data from lmdb at startup. As part of this, the root node for all collections is loaded into memory via the cache (NodeRegistry).
  3. As the cache is loaded at startup, the call to load_cache function in main has been removed

Pending items from my list:

1. The data stored in the embeddings, versions and branches tables in lmdb is not namespaced by the vector store/collection id. Please check this comment as well - Persist VectorStore data in lmdb #138 (comment)
  2. We need separate prop.data and *.vec_raw files per collection. I've created a separate ticket to track this - Separate prop.data, *.vec_raw and *.index files per collection #141
  3. Making the HNSW (max_cache_level) parameter dynamic

@a-rustacean
Member

It will affect the existing dbs embeddings, versions and branches in lmdb. If we're going with the exclusive db approach, they will be moved inside the exclusive db, and hence the keys will need to be prefixed with some string/char to indicate the type of key, e.g. all version keys can be v:<u32hash>, embeddings can be e:<vector-id>, branches can be b:<branch-id>.

Since the key is ultimately converted to/from [u8], we can prefix the key with a single byte, so all the version keys can be [0, ...], embeddings can be [1, ...], and branches can be [2, ...].
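A sketch of that single-byte tagging (the constants are illustrative, not names from the codebase):

```rust
// Illustrative single-byte type tags for keys sharing one lmdb db.
const KEY_VERSION: u8 = 0;
const KEY_EMBEDDING: u8 = 1;
const KEY_BRANCH: u8 = 2;

// Prepend the type byte to the raw key bytes.
fn typed_key(kind: u8, raw: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(1 + raw.len());
    out.push(kind);
    out.extend_from_slice(raw);
    out
}

fn main() {
    let branch_id: u64 = 7;
    let key = typed_key(KEY_BRANCH, &branch_id.to_be_bytes());
    assert_eq!(key, vec![2, 0, 0, 0, 0, 0, 0, 0, 7]);
    // The same u64 under a different type byte no longer collides:
    assert_ne!(typed_key(KEY_VERSION, &branch_id.to_be_bytes()), key);
    let _ = KEY_EMBEDDING; // unused in this tiny demo
}
```

This costs one byte per key versus the two-or-more bytes of a string prefix like "v:", and comparisons stay purely byte-wise.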

@a-rustacean a-rustacean mentioned this pull request Oct 11, 2024
@a-rustacean
Member

Continued in #146
