-
Notifications
You must be signed in to change notification settings - Fork 20
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Persist VectorStore data in lmdb #138
Conversation
Option (1) , Also I am in favor of CBOR serialization for the following reasons,
===
===
No, if the DB crashes, consider that transaction as irrecoverable/aborted. User must reattempt that transaction. |
Either we create a new "database" for each collection's logical items so they are exclusive, |
The documentation on configuring max db limit kinda warns about having too many databases - http://www.lmdb.tech/doc/group__mdb.html#gaa2fc2f1f37cb1115e733b62cab2fcdbc.
It doesn't say what exactly is a "huge" number. But assuming that an instance of cosdata will be supporting unbounded no. of collections, option 1 doesn't seem like a good idea. What do you think? |
We dont expect more than a few collections in the single pod instance that we are developing for the MVP, so lets create own exclusive database for the same. Later when we need to support multitenancy/could hosted serverless solution this can be addressed differently. |
cb5ee38
to
6db7bd0
Compare
@tinkn I've pushed a commit 4909bf0 to persist Vectore Store data to lmdb. I've created a new db named I was also thinking about the the exclusive db approach and have following points: Implications on existing code and dataIt will affect the existing dbs Instead, if we continue with current approach, then we'd need to prefix the keys in The refactoring effort required for both approaches looks the same to me But either way, the change will not be backward compatible with existing data. So every one will need to delete their existing mdb dir. Max dbs config for lmdbThe configuration for max_dbs is currently hard coded when the lmdb Environment is intialized. What max value should we use here? Is 10 or 20 a good number? let env = Environment::new()
.set_max_dbs(5)
.set_map_size(10485760) // Set the maximum size of the database to 10MB
.open(&path)
.map_err(|e| WaCustomError::DatabaseError(e.to_string()))?; We can make this configurable (with some upper limit to set expectations). Please let me know your thoughts. |
The VectoreStore instances created upon incoming requests to the 'create-collection' endpoint were only getting added to the in-memory DashMap and not persisted to disk. This commit fixes that by writing the vector store data to a new lmdb database. At startup, the data from lmdb for all collections (vector stores) is also loaded into the DashMap. The 'VectorStoreMap' type has been refactored to provide this abstraction. Note that a few things are pending such as properly loading the root node in 'VectorStoreMap' for each collection. Besides, a few more things have been marked with TODO comments, which will be addressed in later commits.
The error is converted to 5xx (Internal server error) HTTP response.
So that it's stored as app state and available to the actix handler fns. This commit also modifies all code to use the env from the app context instead of calling the function 'get_app_env'. There are 2 reasons for doing this change: 1. As part of initializing the 'ain_env', we're loading the vector store map from lmdb. While doing so, we also want to eagerly load the root vec for all vector stores. This change allows us to use the 'node_registry' (cache) in app context so that the cache is also warmed. Note that the eager loading of root vec is pending and will be implemented in a later commit. 2. Earlier the env was initialized lazily in the request handlers, upon the first call to the 'get_app_env' function. But if env initialization fails, any request will fail anwway. After this change, the env is initialized at startup and if it fails, the process will fail to start (code will panic). This is much better from a debugging standpoint.
A new field 'file_index' is added to 'VectorStoreData' which is serialized as CBOR and persisted to the 'collections' db of lmdb. This will be used for locating the root vector node of the collection.
6db7bd0
to
c60aee7
Compare
The load_cache call is currently failing. It will not be required, once the cache is loaded as part of initializing the vector_store_map, which will be implemented in subsequent commits.
We no longer need to store the app env in a global var. Hence we can get rid of the 'OnceLock' and have the 'get_app_env' function return the value directly.
Earlier, the 'new' method panicked if the env could not be initialized. This commit modifies it to return a Result with WaCustomError. This way, the calling code may decide to panic or handle it in some other way.
c60aee7
to
1cfde5c
Compare
@tinkn @a-rustacean The PR includes following changes:
Pending items from my list:
|
since the key is ultimately converted to/from |
Continued in #146 |
No description provided.