Replies: 2 comments
-
Wild guess: your OS is swapping because it is getting close to the memory limit? I'm definitely not an expert on our text functionalities, but I don't think there is anything particularly clever about it.
-
I finally worked out what the problem is: if the strings being indexed are too long, the sparse index cannot be built in RAM (it is too large) and will instead be built on disk, which is obviously quite slow. Reducing the max string size smooths out the problem, but not entirely, unless you're OK with a max string length of, say... 3 characters. #joy
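A minimal sketch of that workaround, assuming the strings are simply truncated before vectorization (MAX_STRING_LEN and the char n-gram settings are illustrative placeholders, not values from the original code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cap on string length; tune to trade recall against build speed.
MAX_STRING_LEN = 30

def build_tfidf_matrix(strings):
    # Truncate each string so the resulting sparse matrix stays small
    # enough to be built entirely in RAM.
    truncated = [s[:MAX_STRING_LEN] for s in strings]
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
    return vectorizer, vectorizer.fit_transform(truncated)
```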
-
Hello!
Over at MetaBrainz we're trying to build a fuzzy lookup system at scale using nmslib and scikit-learn. The search space can easily be partitioned into small indexes by creating a search index for each artist's tracks; this avoids having one giant index that would be slow to search. Our plan is to search for artists in one large artist index and then to find tracks in the track index built for each artist, roughly as sketched below. All indexes will be resident in RAM on a machine with large gobs of RAM.
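To make that concrete, here is a rough sketch of what a per-artist index build looks like (the char n-gram analyzer and the HNSW-over-sparse-cosine parameters are illustrative, not the exact settings in fuzzy_index.py):

```python
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer

def build_artist_index(track_names):
    # Vectorize this artist's track names into a sparse TF-IDF matrix.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
    matrix = vectorizer.fit_transform(track_names)

    # Build a small HNSW index over the sparse vectors for this artist.
    index = nmslib.init(method="hnsw", space="cosinesimil_sparse",
                        data_type=nmslib.DataType.SPARSE_VECTOR)
    index.addDataPointBatch(matrix)
    index.createIndex()
    return vectorizer, index
```

Queries then reuse the fitted vectorizer, e.g. index.knnQueryBatch(vectorizer.transform([name]), k=10).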
So far, everything is working well, except... we're getting a strange slowdown when building all the individual track indexes. Index building gets progressively slower, and heavy disk write I/O can be observed during TfidfVectorizer.fit_transform(). We would expect only reads (fetching the data from our DB), but we're seeing a massive amount of disk writes, and index building keeps getting slower; we have not yet seen the build process finish.
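For reference, the writes can be measured around the call itself; a small, hypothetical harness using psutil (not part of our actual code) looks like this:

```python
import time
import psutil

def measure_fit(vectorizer, docs):
    # Snapshot system-wide disk write counters around fit_transform()
    # to attribute the write traffic to the vectorization step.
    before = psutil.disk_io_counters().write_bytes
    start = time.monotonic()
    matrix = vectorizer.fit_transform(docs)
    elapsed = time.monotonic() - start
    written = psutil.disk_io_counters().write_bytes - before
    print(f"fit_transform: {elapsed:.2f}s, {written / 1e6:.1f} MB written")
    return matrix
```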
Our entire code can be found here:
https://github.com/mayhem/fast-fuzzy/blob/main/fuzzy_index.py
The function that builds the index and causes disk writes is:
https://github.com/mayhem/fast-fuzzy/blob/main/fuzzy_index.py#L54
I've set the working memory like this, in the main thread:
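Something along these lines, via scikit-learn's set_config (the value shown is a placeholder, not our exact setting):

```python
import sklearn

# Raise scikit-learn's working memory budget (in MiB) for chunked operations.
sklearn.set_config(working_memory=4096)
```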
Our code is wrapped in the following with statement to allow more threads:
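Roughly this shape, using joblib's parallel_backend context manager (the backend choice, the n_jobs value, and the build_all_track_indexes() driver are placeholders):

```python
from joblib import parallel_backend

# Allow scikit-learn's internals to use more threads within this block.
with parallel_backend("threading", n_jobs=16):
    build_all_track_indexes()  # hypothetical driver function
```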
Neither of these settings appears to impact these mysterious disk writes.
When we run this, the chunks build in a very odd pattern: some finish in under a second (as expected), while others take 30-40 seconds and generate heavy disk writes, when we expect essentially no writes at all. It is true that some artists have many more tracks than others, but only a handful of artists are much larger than the average, so it's not a data size issue.
It seems as though fit_transform() attempts to limit RAM usage by spilling to disk, but I am not sure of that. If that is the case, how can I stop it from doing so? Could there be other problems with our approach?
Any input on this would be much appreciated, thanks!