Replies: 2 comments
-
Wild guess: your OS is swapping because it is getting close to the memory limit? I'm definitely not an expert on our text functionalities, but I don't think there is anything particularly clever about it.
-
I finally worked out what the problem is: if the strings being indexed are too long, the sparse index cannot be built in RAM (it is too large) and will instead be built on disk, which is obviously quite slow. Reducing the max string size smooths out the problem, but not entirely, unless you're OK with a max string length of, say... 3 characters. #joy
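A minimal sketch of that workaround, assuming the strings are simply truncated before vectorization (MAX_STRING_LEN and the char n-gram settings are illustrative placeholders, not values from the original code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical cap on string length; tune to trade recall against build speed.
MAX_STRING_LEN = 30

def build_tfidf_matrix(strings):
    # Truncate each string so the resulting sparse matrix stays small
    # enough to be built entirely in RAM.
    truncated = [s[:MAX_STRING_LEN] for s in strings]
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
    return vectorizer, vectorizer.fit_transform(truncated)
```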
-
Hello!
Over at MetaBrainz we're trying to build a fuzzy lookup system at scale using nmslib and scikit-learn. The search space can easily be partitioned into small indexes by creating a search index for each artist's tracks; this avoids having one giant index that would be slow to search. Our plan is to search for artists in one large artist index and then to find tracks in the track index built for each artist, roughly as sketched below. All indexes will be resident in RAM on a machine with large gobs of RAM.
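To make that concrete, here is a rough sketch of what a per-artist index build looks like (the char n-gram analyzer and the HNSW-over-sparse-cosine parameters are illustrative, not the exact settings in fuzzy_index.py):

```python
import nmslib
from sklearn.feature_extraction.text import TfidfVectorizer

def build_artist_index(track_names):
    # Vectorize this artist's track names into a sparse TF-IDF matrix.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
    matrix = vectorizer.fit_transform(track_names)

    # Build a small HNSW index over the sparse vectors for this artist.
    index = nmslib.init(method="hnsw", space="cosinesimil_sparse",
                        data_type=nmslib.DataType.SPARSE_VECTOR)
    index.addDataPointBatch(matrix)
    index.createIndex()
    return vectorizer, index
```

Queries then reuse the fitted vectorizer, e.g. index.knnQueryBatch(vectorizer.transform([name]), k=10).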
So far, everything is working well, except... we're getting a strange slowdown when building all the individual track indexes. Index building gets progressively slower, and heavy disk write I/O can be observed during TfidfVectorizer.fit_transform(). We would expect only reads (fetching the data from our DB), but we're seeing a massive amount of disk writes, and index building keeps getting slower; we have not yet seen the build process finish.
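For reference, the writes can be measured around the call itself; a small, hypothetical harness using psutil (not part of our actual code) looks like this:

```python
import time
import psutil

def measure_fit(vectorizer, docs):
    # Snapshot system-wide disk write counters around fit_transform()
    # to attribute the write traffic to the vectorization step.
    before = psutil.disk_io_counters().write_bytes
    start = time.monotonic()
    matrix = vectorizer.fit_transform(docs)
    elapsed = time.monotonic() - start
    written = psutil.disk_io_counters().write_bytes - before
    print(f"fit_transform: {elapsed:.2f}s, {written / 1e6:.1f} MB written")
    return matrix
```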
Our entire code can be found here:
https://github.com/mayhem/fast-fuzzy/blob/main/fuzzy_index.py
The function that builds the index and causes disk writes is:
https://github.com/mayhem/fast-fuzzy/blob/main/fuzzy_index.py#L54
I've set the working memory like this, in the main thread:
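Something along these lines, via scikit-learn's set_config (the value shown is a placeholder, not our exact setting):

```python
import sklearn

# Raise scikit-learn's working memory budget (in MiB) for chunked operations.
sklearn.set_config(working_memory=4096)
```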
Our code is wrapped in the following with statement to allow more threads:
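Roughly this shape, using joblib's parallel_backend context manager (the backend choice, the n_jobs value, and the build_all_track_indexes() driver are placeholders):

```python
from joblib import parallel_backend

# Allow scikit-learn's internals to use more threads within this block.
with parallel_backend("threading", n_jobs=16):
    build_all_track_indexes()  # hypothetical driver function
```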
Neither of these settings appears to impact these mysterious disk writes.
When we run this, the chunks build in a very odd pattern: some finish in under a second (as expected), while others take 30-40 seconds and generate heavy disk writes, when we expect essentially no writes at all. It is true that some artists have many more tracks than others, but only a handful of artists are much larger than the average, so it's not a data size issue.
It seems as though fit_transform() attempts to limit RAM usage by spilling to disk, but I am not sure of that. If that is the case, how can I stop it from doing so? Could there be other problems with our approach?
Any input on this would be much appreciated, thanks!