Difference in embedding weight initialization for randomly initialized T5 model #32854

dhruvbird · 2024-08-16T17:54:29Z

System Info

transformers

Who can help?

@ArthurZucker

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

The problem is technical, so I will describe it here rather than provide a script. I believe the intent is for T5 models initialized from scratch to use the same weight initialization in PyTorch and TF. However, the embedding initialization differs between the two.

In https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L821 the embedding weights are initialized with a standard deviation of `factor * 1.0`, i.e. a variance of 1 when `factor` is 1.0. In TF, however, the embedding initializer defaults to `tf.random_normal_initializer()`, which uses a standard deviation of 0.05: https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L1635

https://www.tensorflow.org/api_docs/python/tf/random_normal_initializer

According to the docs, the initializer's default arguments are:

tf.random_normal_initializer(
    mean=0.0, stddev=0.05, seed=None
)

PyTorch initialization:

            # Mesh TensorFlow embeddings initialization
            # See https://github.com/tensorflow/mesh/blob/fa19d69eafc9a482aff0b59ddd96b025c0cb207d/mesh_tensorflow/layers.py#L1624
            module.shared.weight.data.normal_(mean=0.0, std=factor * 1.0)

TF initialization:

def embedding_weights(mesh,
                      vocab_dim,
                      output_dim,
                      variable_dtype,
                      name="embedding",
                      ensemble_dim=None,
                      initializer=None):
  """Embedding weights."""
  shape = mtf.Shape(
      [ensemble_dim] if ensemble_dim else []) + [vocab_dim, output_dim]
  if initializer is None:
    initializer = tf.random_normal_initializer()
  ret = mtf.get_variable(
      mesh, name, shape, dtype=variable_dtype, initializer=initializer)
  return ret

This is already mentioned in #16749, but since #16749 covers two issues, this first one seems to have gone unnoticed, so I am opening a separate issue for it.
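To make the size of the gap concrete, here is a minimal sketch that samples two small embedding tables, one with the PyTorch-side standard deviation and one with the TF default. It uses only the Python standard library instead of torch or TensorFlow, and it assumes `factor` is 1.0. The helper names (`sample_embedding`, `empirical_std`) and the table size are made up for illustration and are not part of either library.

```python
import random
import statistics


def sample_embedding(vocab_size, d_model, std, seed=0):
    """Draw a vocab_size x d_model table from N(0, std^2) with the stdlib RNG."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(d_model)]
            for _ in range(vocab_size)]


def empirical_std(table):
    """Population standard deviation over every entry of the table."""
    return statistics.pstdev([x for row in table for x in row])


# Assumption: factor == 1.0, so the PyTorch-side std is factor * 1.0 = 1.0.
PT_STD = 1.0 * 1.0
# Default stddev of tf.random_normal_initializer(), as quoted above.
TF_STD = 0.05

# Hypothetical small table size, chosen only to keep the example fast.
pt_table = sample_embedding(200, 64, PT_STD)
tf_table = sample_embedding(200, 64, TF_STD)

pt_std = empirical_std(pt_table)
tf_std = empirical_std(tf_table)

print(f"PyTorch-style empirical std: {pt_std:.3f}")
print(f"TF-style empirical std:      {tf_std:.3f}")
print(f"ratio:                       {pt_std / tf_std:.1f}")
```

The ratio should come out close to 20, meaning the PyTorch embeddings start out roughly 20 times larger in scale (400 times larger in variance) than the TF ones.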

Expected behavior

I expect the initialization to be the same across TF and PyTorch T5 models.

github-actions · 2024-09-16T08:03:55Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

dhruvbird added the bug label Aug 16, 2024

github-actions bot closed this as completed Sep 24, 2024
