Fix computation of Word2Vec loss & add loss value to logging string #2135
base: develop
Conversation
This commit re-writes the computation of the loss for both CBOW and SG Word2Vec. The loss that is computed and reported is the running average NCE loss within the epoch. This means that for each new epoch, the counters are reset to 0, and the new average is computed. This was not the case before: the loss was accumulated over the whole training, which is not very informative, besides being incorrect in the implementation (see below).

The computation of the word2vec loss was flawed in many ways:

- race condition on the running_training_loss parameter (updated concurrently in a GIL-free portion of the code)
- incorrect dividing factor for the average in the case of SG. The averaging factor in the case of SG should not be the effective words, but the effective samples (a new variable I introduce), because the loss is incremented as many times as there are positive examples sampled for an effective word.

Additionally, I add the logging of the current value of the loss in the progress logger when compute_loss is set to True, and I add a parameter to the word2vec_standalone script to trigger the reporting of the loss.
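The per-epoch running-average behaviour described above can be sketched as follows. This is a hypothetical illustration, not gensim's actual classes; the name `EpochLossTracker` and its methods are invented for this sketch:

```python
class EpochLossTracker:
    """Toy sketch of per-epoch running-average loss tracking."""

    def __init__(self):
        self.running_loss = 0.0
        self.effective_samples = 0

    def start_epoch(self):
        # Reset counters so the reported average is per-epoch,
        # not cumulative over the whole training run.
        self.running_loss = 0.0
        self.effective_samples = 0

    def accumulate(self, batch_loss, batch_samples):
        # Called once per trained batch, outside any GIL-free section,
        # which avoids the race condition on a shared counter.
        self.running_loss += batch_loss
        self.effective_samples += batch_samples

    def average(self):
        # Divide by effective samples, not effective words.
        return self.running_loss / max(self.effective_samples, 1)


tracker = EpochLossTracker()
tracker.start_epoch()
tracker.accumulate(10.0, 4)
tracker.accumulate(6.0, 4)
print(tracker.average())  # 2.0
```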
Thanks @alreadytaikeune. The training loss is a recent addition and may contain errors. What is your use-case for using it? How did you find out about this issue? Re. your PR: what are the performance implications of your changes? Can you post a before/after benchmark?
Thanks for your reply. I started digging into it because I needed to compare the evolution and values of the loss between two word2vec implementations. I therefore needed to understand exactly how the loss was computed in gensim, which led me to uncover some issues and propose these changes. Are the issues I pointed out clear to you? I don't know if I expressed them well enough. As for the benchmark, I haven't really done any, since the changes are very minor and should really not impact performance, but I'll do one if needed. Do you have some standard tools for this purpose that I can use, or do I need to write my own? Finally, it seems the documentation has a hard time building. From what I understand, it is the lines in the
@piskvorky Everything seems to be fixed now.
Thanks! Can you run the benchmark too? Simply train a model on the text9 corpus with/without this PR, using 4 parallel workers and computing / not computing the loss. Then post the four logs (old+loss, old-loss, new+loss, new-loss) to gist and add the links here in a comment. There'll be some minor variance in the timings, even with the same parameters; that's expected. But we're interested in a sanity check here (<<10% difference).
I have run the tests you asked for in the following environment:
Using the script below:
The results of the run can be found here: I've only run them once, because it takes a bit of time. If we need more solid proof, we should probably settle on a smaller corpus and average more runs. Anyway, it seems that the cost incurred by the computation of the loss amounts to around 3% of the total runtime, while the runtime without loss computation is left unaffected (ignoring small variations between runs), as expected. I haven't run the benchmark on the base code with the loss computation, because if we agree that the previous computation is flawed, I don't really see the relevance. Maybe we should discuss this point further to double-check my reasoning and implementation.
Thanks a lot @alreadytaikeune! That sounds good to me. Let's wait for @menshikh-iv's final verdict & merge, once he gets back from holiday.
Wow, nice catch @alreadytaikeune 👍 please make needed fixes and I think we can merge it :)
gensim/models/base_any2vec.py
Outdated
@@ -124,7 +124,17 @@ def _clear_post_train(self):
        raise NotImplementedError()

    def _do_train_job(self, data_iterable, job_parameters, thread_private_mem):
        """Train a single batch. Return 2-tuple `(effective word count, total word count)`."""
        """Train a single batch. Return 3-tuple
Please follow numpy-style for docstrings, more links available here
gensim/models/base_any2vec.py
Outdated
    for callback in self.callbacks:
        callback.on_batch_end(self)

    progress_queue.put((len(data_iterable), tally, raw_tally))  # report back progress
    # report back progress
    progress_queue.put(
No need to break the line (we use a 120-char limit for gensim), here and everywhere.
gensim/models/base_any2vec.py
Outdated
@@ -260,6 +281,7 @@ def _log_train_end(self, raw_word_count, trained_word_count, total_elapsed, job_

    def _log_epoch_progress(self, progress_queue, job_queue, cur_epoch=0, total_examples=None, total_words=None,
                            report_delay=1.0):
please revert
still here
gensim/models/word2vec_inner.pyx
Outdated
@@ -491,8 +491,8 @@ def train_batch_sg(model, sentences, alpha, _work, compute_loss):
    cdef int negative = model.negative
    cdef int sample = (model.vocabulary.sample != 0)

    cdef int _compute_loss = (1 if compute_loss else 0)
    cdef REAL_t _running_training_loss = model.running_training_loss
    cdef int _compute_loss = (1 if compute_loss is True else 0)
Why? The old variant works correctly (same for CBOW).
    model.running_training_loss = _running_training_loss
    return effective_words
    model.running_training_loss += _running_training_loss
    return effective_words, effective_words
It is worth writing a comment on why the same value is returned twice.
I added the reason for that in the new docstring.
    tally += train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
    return tally, self._raw_word_count(sentences)
    (tally, effective_samples) = train_batch_cbow(self, sentences, alpha, work, neu1, self.compute_loss)
    return tally, self._raw_word_count(sentences), effective_samples
You need to update the docstrings everywhere when you change the return type.
@alreadytaikeune Still not done, please check
gensim/models/base_any2vec.py
Outdated
@@ -966,6 +989,9 @@ def train(self, sentences=None, input_streams=None, total_examples=None, total_w
        total_words=total_words, epochs=epochs, start_alpha=start_alpha, end_alpha=end_alpha, word_count=word_count,
        queue_factor=queue_factor, report_delay=report_delay, compute_loss=compute_loss, callbacks=callbacks)

    def get_latest_training_loss(self):
        return 0
Should raise NotImplementedError (the loss feature works only for w2v now, not for d2v/fasttext/etc.), or maybe return -1?
I changed it to raise an exception. I think -1 would be confusing, since it is not the value that will be displayed: the value that is displayed is the running training loss divided by the number of samples. That is why I chose to return 0 rather than -1; -1 would look like a loss that decreases as the number of processed words increases.
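The behaviour settled on above can be sketched as follows. This is a hypothetical illustration of the design choice, not gensim's actual classes; the names `BaseAny2VecSketch` and `Word2VecSketch` are invented:

```python
class BaseAny2VecSketch:
    def get_latest_training_loss(self):
        # Raise instead of returning a sentinel value that could be
        # mistaken for a real (and apparently shrinking) average loss.
        raise NotImplementedError(
            "loss computation is currently only implemented for Word2Vec")


class Word2VecSketch(BaseAny2VecSketch):
    def __init__(self):
        self.running_training_loss = 0.0

    def get_latest_training_loss(self):
        return self.running_training_loss


try:
    BaseAny2VecSketch().get_latest_training_loss()
except NotImplementedError as e:
    print("raised:", e)
print(Word2VecSketch().get_latest_training_loss())  # 0.0
```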
gensim/models/base_any2vec.py
Outdated
    else:
        # Model doesn't implement the samples tallying. We assume
        # that the number of samples is the effective words tally. This
        # gives coherent outputs with previous implementaitons
Please add a TODO here: the if/else should be removed once compute_loss is implemented for all models.
gensim/models/base_any2vec.py
Outdated
        cur_epoch + 1, 100.0 * example_count / total_examples, trained_word_count / elapsed,
        utils.qsize(job_queue), utils.qsize(progress_queue)
    )
    if self.compute_loss:
Can you fully refactor this function please? Make it clearer & shorter, not if { if { } else { } } else { if { } else { } }.
Hint:
- generate the logging template first
- collect the needed parameters into a list/tuple
- use *my_parameters
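The refactor hinted at above can be sketched like this. This is a hypothetical simplification, not the actual `_log_epoch_progress`; `format_progress` and its parameters are invented for illustration (real code would pass `msg, *params` to `logger.info` for lazy formatting):

```python
def format_progress(cur_epoch, percent_done, words_per_sec,
                    in_qsize, out_qsize, loss=None):
    # Build the template first, collect the parameters into a list,
    # then extend both only when the optional loss is present.
    msg = "EPOCH %i - PROGRESS: at %.2f%%, %.0f words/s, in_qsize %i, out_qsize %i"
    params = [cur_epoch + 1, percent_done, words_per_sec, in_qsize, out_qsize]
    if loss is not None:
        msg += ", loss %.4f"
        params.append(loss)
    return msg % tuple(params)


print(format_progress(0, 12.5, 25000.0, 2, 0, loss=5.54))
# EPOCH 1 - PROGRESS: at 12.50%, 25000 words/s, in_qsize 2, out_qsize 0, loss 5.5400
```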
I have made the changes you requested, however I am a bit surprised by the outcomes of the tests. They all run smoothly on my machine and most of them pass in travis, except the py35-win one, for some reason that doesn't seem related at all to my changes.
Good work @alreadytaikeune 👍
I think we need to merge #2127 first (and after that, the current PR with an additional fix for the new "mode" of training).
BTW, I checked py35-win: it's not your fault, this is a failure of a non-deterministic test; I re-ran the build.
gensim/models/base_any2vec.py
Outdated
@@ -260,6 +281,7 @@ def _log_train_end(self, raw_word_count, trained_word_count, total_elapsed, job_

    def _log_epoch_progress(self, progress_queue, job_queue, cur_epoch=0, total_examples=None, total_words=None,
                            report_delay=1.0):
still here
gensim/models/base_any2vec.py
Outdated
    )
    div = total_words

    msg = "EPOCH %i - PROGRESS: at %.2f%% examples, %.0f words/s, in_qsize %i, out_qsize %i"
This can be PROGRESS: at %.2f%% words (not only examples).
You are right. Good catch. I'll fix it.
gensim/models/word2vec.py
Outdated
    if self.sg:
        tally += train_batch_sg(self, sentences, alpha, work, self.compute_loss)
        (tally, effective_samples) = train_batch_sg(self, sentences, alpha, work, self.compute_loss)
The () are not needed here (and same below).
@alreadytaikeune thanks, looks good! Next steps:
@alreadytaikeune I'll ping you when we're done with #2127.
Hi @alreadytaikeune, we finally merged & released #2127, please do the steps mentioned in #2135 (comment) and I'll merge your PR too 🌟
In word2vec_inner.pyx, the functions now use the new config object while still returning the number of samples. In base_any2vec, logging includes the new loss values (the addition of this branch).
Hi, I've completed the steps you mentioned, but when running the tests it seems to me that one is hanging, in the doc2vec test set. I don't really have a clue why; it wasn't the case before the merge.
@alreadytaikeune I see no commits after my comment & there's a merge conflict in the PR, please make the needed changes (or maybe you forgot to push your changes?).
Hey, sorry for the delay, I was on holiday. Yes, I hadn't pushed my changes; I wanted to test them locally first. But here they are.
Ah it seems the CI server experiences the same stalling issues as me. I am not sure when I will have time to investigate that though... |
@alreadytaikeune CI issues fixed, hang reproduced in Travis, do you have time to fix & finalize? |
Don't worry about the conflicts. I'll take care of them during the merge: they are in autogenerated code, so it's easy to resolve. Regarding tests: I think the actual loss output may be difficult to test (and not worth it) because it's being output by logs. What is worth unit testing is the new behavior you added to the training functions (counting and returning the effective number of samples). Can you add tests to cover that new behavior? For example:
You can use the same testing logic and data for Python/Cython (they should be 100% compatible).
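The kind of unit test suggested above could look like the following sketch. It checks that a skip-gram batch function reports effective samples (one per positive (center, context) pair) separately from effective words. `count_sg_pairs` is a toy stand-in invented for this sketch, not gensim's `train_batch_sg`:

```python
def count_sg_pairs(sentence, window):
    """Toy stand-in: count effective words and effective samples
    for a skip-gram pass over one sentence."""
    effective_words = len(sentence)
    effective_samples = 0
    for pos, _ in enumerate(sentence):
        start = max(0, pos - window)
        end = min(len(sentence), pos + window + 1)
        # One positive sample per (center, context) pair; the center
        # itself is excluded from its own context window.
        effective_samples += (end - start) - 1
    return effective_words, effective_samples


words, samples = count_sg_pairs(["a", "b", "c", "d"], window=1)
print(words, samples)  # 4 6
```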
@alreadytaikeune What is the status of this PR? Are you able to finish it? |
@alreadytaikeune What do you think of the current status of this PR? |
Hello @mpenkov and @tridelt. Sorry for not being more involved in this PR. I've had plenty on my plate in my job, and in my personal life as well, and didn't feel like spending free/family time working on tests. Although I'm not comfortable with leaving this as is either. I'll do my best in the coming days to address your comments. Sorry for not handling this earlier. Best.
(force-pushed from c279f5b to aaf9ed9)
Hello, @mpenkov. I found a bit of time to throw myself back into the PR. I thought I would rebase my develop branch on the current one and work on the tests from there, but it was not a good idea... Anyway, I've restored my branch as it was before. Just one thing: when rebasing, it seemed to me that all the non-cython implementations of word2vec were removed. Is that correct? Regarding your last comment about checks, I imagine I should now only care about writing tests for the cython functions. Sorry about the mess...
@alreadytaikeune Yes, the non-cython variants of many of the algorithms have been eliminated to reduce duplication/maintenance effort, so work going forward need only concern itself with the cython versions!
Alright, thank you @gojomo. I also see that
AFAIK,
@gojomo is right. I'm not aware of any changes to the CLI version of word2vec in Gensim. IIRC, the command line version was created to match the CLI of the original C tool by Google, including the CLI parameter names and defaults. I don't think it's been updated since. I didn't remember what

So my preference would be to get rid of
Thank you for the clarification. So, going forward, I'll just write some tests for the cython-only implementations (which is a bit confusing, since I'm working on a version that still has the two implementations, hence why I wanted to rebase in the first place), and leave it to you to clarify what's to be kept and dropped between
I was wondering about the status of this PR. Will it be fixed in gensim 4.0.0? Does the erroneous loss computation happen only in Word2Vec, or in other classes too (e.g., FastText)? Thanks in advance.
Yes, ideally we want to clean this up for 4.0 too. @alreadytaikeune will you be able to finish this PR? Anyone else helping is welcome.
Thank you very much @piskvorky! Does the wrong loss computation affect other classes too (e.g., FastText)?
I don't remember, but I think the loss tallying was idiosyncratic: different for each class, missing from some models.
Thanks @piskvorky, that's very helpful. On an unrelated topic, do you guys plan (or find use) in supporting different initialization schemes for embeddings? Right now you are only supporting uniform initialization, I think between [-0.5/embedding_size, 0.5/embedding_size]. I have created other types of initializations as well ('glorot_uniform', 'lecun_uniform', 'he_uniform', 'glorot_normal', 'lecun_normal', 'he_normal'). If you think it might add something to the tooling, I can give some ideas in a different PR.
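For reference, two of the schemes mentioned can be sketched as below, for a (vocab_size, embedding_size) matrix. These are standard textbook formulas, not gensim's API; the function names are invented, and the fan-in/fan-out choice for an embedding matrix is an assumption:

```python
import numpy as np


def word2vec_uniform(vocab_size, dim, rng):
    # word2vec-style init: uniform in [-0.5/dim, 0.5/dim].
    return rng.uniform(-0.5 / dim, 0.5 / dim, size=(vocab_size, dim))


def glorot_uniform(vocab_size, dim, rng):
    # Glorot/Xavier uniform: limit = sqrt(6 / (fan_in + fan_out)),
    # here taking fan_in = vocab_size and fan_out = dim.
    limit = np.sqrt(6.0 / (vocab_size + dim))
    return rng.uniform(-limit, limit, size=(vocab_size, dim))


rng = np.random.default_rng(0)
W = word2vec_uniform(1000, 100, rng)
print(W.shape, float(np.abs(W).max()) <= 0.5 / 100)  # (1000, 100) True
```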
Possibly; @Witiko did some work on weight initialization. That's off-topic for this ticket though.
FYI #2617 is a sort of 'umbrella issue' mentioning/referencing all the tangled loss issues. Ideally we'd want our measure/reporting to be similar to the output in Facebook's FastText logging. (Right now Gensim's

Regarding alternate weight initializations, it shouldn't be hard to externally re-initialize (or include an option to not initialize, for when a user is planning to do this themselves) between vocabulary-discovery and training. We could add some standard options if there's a strong case for how they improve things. (Do they noticeably hasten convergence?)
Regarding this, how are you updating the embedding weights on every iteration if you don't compute the loss? Do you use some kind of ready-made formula for the gradient that's directly applied to update the embedding weights?
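On the question above: with negative sampling, the gradient of the loss has a closed form (label minus sigmoid of the score), so the weights can be updated on every iteration without ever materializing the scalar loss value. A minimal numpy sketch of one such update follows; this is an illustration of the standard formula, not gensim's cython code, and `sgns_update` is an invented name:

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgns_update(center, context, label, alpha):
    # The derivative of the negative-sampling loss wrt the score
    # center @ context is sigmoid(score) - label, so the update applies
    # (label - sigmoid) directly; the loss itself is never computed.
    g = alpha * (label - sigmoid(center @ context))
    d_center = g * context        # gradient wrt center (uses old context)
    context += g * center         # update context (uses old center)
    center += d_center


rng = np.random.default_rng(42)
w, c = rng.normal(size=8), rng.normal(size=8)
before = sigmoid(w @ c)
sgns_update(w, c, label=1, alpha=0.1)
print(sigmoid(w @ c) > before)  # True: the positive pair moved closer
```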
Not sure if they improve or change convergence in a meaningful way. I just mentioned it because

Also, your initialization, which is more like
@piskvorky @geopapa11 I evaluated different initializations of the positional model of Mikolov et al. (2018) and Grave et al. (2018), since no reference implementation existed and it's unclear how the weights should be initialized; see pages 59 through 62 in the RASLAN 2020 proceedings.
I was wondering about the status of this PR. Is there any news about it? @mpenkov
Not that I'm aware of. I think we're waiting for the original contributor to finish the PR: #2135 (comment)
The current computation of word2vec losses has flaws. I address them in this PR.
This PR re-writes the computation of the loss for both CBOW and SG Word2Vec. The loss that is computed and reported is the running average NCE loss within the epoch. This means that for each new epoch, the counters are reset to 0, and the new average is computed. This was not the case before: the loss was accumulated over the whole training (but the total_examples used to average it was only incremented during one epoch), which is not very informative, besides being incorrect in the implementation. Moreover, reporting the average loss over an epoch seems more in the spirit of what was being attempted before.
The computation of the word2vec loss was flawed in many ways:

- race condition on the running_training_loss parameter (updated concurrently in a GIL-free portion of the code)
- incorrect dividing factor for the average in the case of SG: the factor should be the effective samples, not the effective words
Additionally, I add the logging of the current value of the loss in the progress logger when compute_loss is set to True, and I add a parameter to the word2vec_standalone script to trigger the reporting of the loss.
As a hint that the implementation is now correct, one can look at the first reported values of the loss, when the word embeddings are still relatively uniformly distributed. In this situation, the expected value of the NCE loss (for Skip-Gram) should be -(N+1)\log(\sigma(0)), which is 5.545 for N=7 and 14.556 for N=20; this matches the loss values reported by the current implementation.
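That sanity check is easy to reproduce numerically: \sigma(0) = 0.5, so with N negative samples the expected per-sample loss at the start of training is (N+1)\log 2.

```python
import math


def expected_initial_loss(num_negative):
    # With near-uniform embeddings all dot products are ~0, so every
    # sigmoid evaluates to sigmoid(0) = 0.5 and the expected SG
    # negative-sampling loss per positive sample is -(N+1) * log(0.5).
    sigmoid_zero = 0.5
    return -(num_negative + 1) * math.log(sigmoid_zero)


print(round(expected_initial_loss(7), 3))   # 5.545
print(round(expected_initial_loss(20), 3))  # 14.556
```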