Add "doesn't match" evaluation to KeyedVectors #2765
Conversation
Thank you for your interest in gensim and your effort. In its current form, I don't think the contribution is a good fit for gensim, for the following reasons:

1. Overall, this stuff seems more like application logic than library functionality.
2. The code is not reusable. You output results to standard output, but gensim users aren't likely to look there for results.
3. You expect the input to live as a file on the local storage, in a specific format.

@piskvorky What do you think?
I think the functionality is useful, but maybe too specific. This would be a better fit as a stand-alone extension (Python package), not in core gensim. Also for maintenance reasons.
I am @n8stringham's advisor for this project, so I think I can help clarify the point of this contribution and address your concerns, @mpenkov and @piskvorky.

Concern 1: "Overall, this stuff seems more like application logic than library functionality." I think this isn't the case. Word analogies are the current gold standard for evaluating word embeddings, but they only work with large amounts of training data. @n8stringham's contribution is designed for the low-resource setting. In particular, these measures start showing improvements before the analogy metric starts showing improvement. The low-resource setting is widely applicable, and that is why we believe the measures should be included directly in gensim. @n8stringham is currently writing up a paper demonstrating the wide applicability of these measures, so maybe from your perspective it would make more sense to include these functions after the paper has been published, when the documentation can link to the paper?

Concern 2: "The code is not reusable. You output results to standard output, but gensim users aren't likely to look there for results." The code does not output to stdout, but returns the results. It's true that there is a …

Concern 3: "You expect the input to live as a file on the local storage, in a specific format." This exactly follows the pattern of the existing evaluate_word_analogies() function.

Concern 4: maintenance reasons. I would definitely understand if this would impose too much of a maintenance burden, and so you don't want to include it for that reason. But our hope is that the wide applicability of the methods would make the maintenance burden worth the cost.
Thanks. I do see value in better evaluation functions. My main worry is that we have several already, with various parameters, and it's chaos for users. So to me this is a question of discovery + documentation for users: "when would I use this?", "why would I use this and not something else?", plus maintenance going forward. Unless the use-cases are clear and attract a convincing user base, it will be yet another algorithm we include to bit-rot in Gensim. Having a thorough analysis + paper to refer to definitely helps, as does anything that will communicate to users that this is "general and robust enough" to solve their problem.
Another thing I've noticed is that the added functionality doesn't need to be part of the class it's being added to. The new functionality consists of two methods, but neither of those methods accesses `self`. From a maintainer's point of view, if we were to keep this, it'd be better to move these out of the class. They could live pretty much anywhere (same module, different module, different package, or outside of gensim altogether).
Agreed - and except for accrued tradition/practice, this same reasoning could apply to the other `evaluate_*()` methods already on the class.
Ping @n8stringham: are you able to complete this PR?
@mpenkov Sorry for the delay. The paper I was working on has been published (https://aclanthology.org/2020.eval4nlp-1.17/). In addition to describing these evaluation functions, we also developed a method to automatically generate test sets for them in any language supported by Wikidata. I ended up putting together a small PyPI package which includes the evaluation functions as well as functions to generate multilingual test sets. The code currently lives at https://github.com/n8stringham/gensim-evaluations. I'd be happy to add the functions to gensim if it still seems worthwhile.
If not, do you still want someone to work on this?
Adding more evaluation options would be a plus. Each new evaluation function has potential intrinsic value, perhaps better capturing how well word-vectors work for specific downstream uses. But also, having a variety could better communicate to users the idea that the traditional 'analogies' evaluation isn't the end-all/be-all of word-vector quality for all downstream tasks. (Sometimes sets of word-vectors that do better on analogies do worse when used in tasks like classification.) And, refactoring such that … I could see such functions potentially as either: …
Background
The inspiration for this contribution arose during my research for my senior thesis project in mathematics at Pomona College, where I have been investigating the generation of word embeddings for Medieval Latin using the tools offered by gensim. To determine the accuracy of these embeddings, I have relied primarily on three types of tasks: analogies, odd-one-out, and top-k similarity.
Testing on analogies has been quite straightforward thanks to the already-implemented wv.evaluate_word_analogies() function. This method makes large-scale experimentation and testing of word embeddings easy because it lets test cases be generated from a custom file in the same style as Mikolov et al.'s analogy set.
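For reference, the existing analogy evaluation is typically driven like this (a minimal sketch; the vector file path is an illustrative assumption, while `questions-words.txt` is the analogy set bundled with gensim's test data):

```python
from gensim.models import KeyedVectors
from gensim.test.utils import datapath

# Load pretrained vectors; 'vectors.bin' is an illustrative path.
wv = KeyedVectors.load_word2vec_format('vectors.bin', binary=True)

# Run the Mikolov-style analogy evaluation; it returns an overall
# accuracy plus a per-section breakdown of correct/incorrect analogies.
score, sections = wv.evaluate_word_analogies(datapath('questions-words.txt'))
print(f"analogy accuracy: {score:.3f}")
```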
However, in addition to analogy testing, I also wanted to evaluate my word embeddings on both the odd-one-out task and top-k similarity. Though functions to perform these tasks already exist in the form of wv.doesnt_match() and wv.most_similar(), there was no similarly convenient way to apply these methods to a large test set. For my own purposes, I set out to implement functions that provide the flexible, scaled evaluation capabilities of wv.evaluate_word_analogies() for the odd-one-out and top-k similarity tasks.
In this PR I add two functions: evaluate_doesnt_match() and evaluate_top_k_similarity(). Both seek to emulate the style of the evaluate_word_analogies() function in the parameters they take and the format of the test-set .txt file. Details are provided below.
Doesn't Match Evaluation on a File
This function expands the functionality of model.wv.doesnt_match() by allowing the user
to perform this evaluation task at scale on a custom .txt file of categories. It does this by
creating all possible "odd-one-out" groupings from the categories file. The groups are composed
of k_in words from one category and 1 word from a different category.
The function expects the .txt file to follow the formatting conventions of the
evaluate_word_analogies() function, where each category has a heading line (identified
with a colon) followed on the next line by a list of space-separated words that belong to that category.
e.g.
:fruits
apple banana pear raspberry
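For concreteness, such a file could be parsed along these lines; `parse_categories` is a hypothetical helper used only for illustration here, not part of this PR:

```python
def parse_categories(path):
    """Parse a file where each ':name' heading line is followed by
    one line of space-separated words belonging to that category."""
    categories = {}
    name = None
    with open(path, encoding='utf8') as fin:
        for line in fin:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            if line.startswith(':'):
                name = line[1:].strip()
                categories.setdefault(name, [])
            elif name is not None:
                categories[name].extend(line.split())
    return categories
```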
Note that each category in the txt file must have at least k_in words; otherwise, comparison groups can't be created.
In the event that categories contain the same word, the function could produce comparison groups that contain duplicate words.
For example, consider the following txt file.
:food
apple hamburger hotdog soup
:fruit
apple pear banana grape
Say k_in=3; then some comparisons would contain duplicate words:
[apple, hamburger, hotdog, apple]
[apple, hamburger, soup, apple]
[apple, hotdog, soup, apple]
By default, this function ignores these comparisons; set eval_dupes=True to evaluate them as well.
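To make the grouping logic concrete, here is a rough sketch of how the comparisons could be built and scored. This is not the PR's actual implementation; it assumes the hypothetical `parse_categories` helper above, the existing `KeyedVectors.doesnt_match()` API, and the `k_in`/`eval_dupes` parameters described in this section (out-of-vocabulary handling is omitted for brevity):

```python
from itertools import combinations

def evaluate_doesnt_match_sketch(wv, categories, k_in=3, eval_dupes=False):
    """Score odd-one-out groups built from a {name: [words]} dict:
    k_in words from one category plus one word from another."""
    correct = total = 0
    for cat, words in categories.items():
        if len(words) < k_in:
            continue  # too few words to form an in-group
        odd_candidates = [w for other, ws in categories.items()
                          if other != cat for w in ws]
        for in_group in combinations(words, k_in):
            for odd in odd_candidates:
                group = list(in_group) + [odd]
                if not eval_dupes and len(set(group)) < len(group):
                    continue  # skip comparisons containing duplicate words
                total += 1
                if wv.doesnt_match(group) == odd:
                    correct += 1
    return correct / total if total else 0.0
```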
Top-k Similarity Evaluation on a File
This function evaluates the accuracy of a word2vec model using the top-k similarity metric.
The user provides the function with a .txt file of words divided into categories. The file is expected to follow the formatting conventions of the
evaluate_word_analogies() method, where each category has a heading line (identified
with a colon) followed on the next line by a list of space-separated words that belong to that category.
e.g.
:fruits
apple banana pear raspberry
For each word in the file, the function generates a top-k similarity list. This list
is compared to the other entries in the category of the word that generated it,
in order to find matches between the two. The number of matches is then used to compute
one of two accuracy measures: topk_in_cat or cat_in_topk.
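The exact definitions of the two measures are not spelled out above, so the following sketch reflects one plausible reading (an assumption on my part): topk_in_cat as the share of a word's top-k neighbours that fall inside its category, and cat_in_topk as the share of the category recovered among those neighbours. It reuses the hypothetical `parse_categories` output and assumes the gensim 4.x `KeyedVectors` API:

```python
def evaluate_top_k_similarity_sketch(wv, categories, topk=10,
                                     measure='topk_in_cat'):
    """Average per-word overlap between a word's top-k most-similar
    list and the remaining members of its own category."""
    scores = []
    for cat, words in categories.items():
        for word in words:
            rest = set(words) - {word}
            if word not in wv.key_to_index or not rest:
                continue  # skip OOV words and singleton categories
            neighbours = {w for w, _ in wv.most_similar(word, topn=topk)}
            matches = len(neighbours & rest)
            if measure == 'topk_in_cat':
                scores.append(matches / topk)       # share of top-k in category
            else:                                   # 'cat_in_topk'
                scores.append(matches / len(rest))  # share of category in top-k
    return sum(scores) / len(scores) if scores else 0.0
```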
Summary
These two additional functions have been useful in my own experimentation with creating good word embeddings, allowing me to evaluate their performance on the odd-one-out and similarity tasks with the same level of robustness as the analogy task. I believe this is an important tool because measuring the goodness of embeddings can often be ambiguous. Access to multiple evaluation methods can help bring clarity to the task of assessment.
Since I needed to implement these functions for my own work, it seemed fitting to offer them back as a contribution to the project. I hope you find them useful.
Thanks!
Nate Stringham