feat: DeepSeekMoE #32862
base: main
Conversation
Force-pushed from 5cee995 to 30ea5b1
Force-pushed from 30ea5b1 to c16ff06
Hey! feel free to ping me once this is ready for review!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hey @ArthurZucker, should be ready for review now, thanks!
friendly ping
Thanks for the PR! I am not 100% sure I caught what the architectural differences are with, say, QwenMoe, which also has the shared experts piece!
Also let's try to match the architecture of the released models, removing unnecessary code paths!
### Description

DeepSeekMoE 16B is a Mixture-of-Experts (MoE) language model with 16.4B parameters. It employs an innovative MoE architecture, which involves two principal strategies: fine-grained expert segmentation and shared experts isolation. It is trained from scratch on 2T English and Chinese tokens, and exhibits comparable performance with DeepSeek 7B and LLaMA2 7B, with only about 40% of computations. For research purposes, we release the model checkpoints of DeepSeekMoE 16B Base and DeepSeekMoE 16B Chat to the public, which can be deployed on a single GPU with 40GB of memory without the need for quantization.
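To make the two strategies concrete, here is a minimal sketch of the idea (illustrative only; parameter names such as `n_routed_experts`, `n_shared_experts`, and `top_k` follow the released config, but the module below is simplified and is not the PR's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyExpert(nn.Module):
    """A small FFN expert; fine-grained segmentation means its hidden size is a fraction of a dense FFN's."""

    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))


class MoESketch(nn.Module):
    """Sketch only: top-k routed fine-grained experts plus always-on shared experts."""

    def __init__(self, hidden_size=64, intermediate_size=16,
                 n_routed_experts=8, n_shared_experts=2, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, n_routed_experts, bias=False)
        self.experts = nn.ModuleList(
            TinyExpert(hidden_size, intermediate_size) for _ in range(n_routed_experts)
        )
        # Shared experts see every token and bypass the router entirely (shared experts isolation).
        self.shared_experts = TinyExpert(hidden_size, intermediate_size * n_shared_experts)

    def forward(self, hidden_states):
        tokens = hidden_states.reshape(-1, hidden_states.shape[-1])
        scores = self.gate(tokens).softmax(dim=-1)
        topk_weight, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(tokens)
        # Naive per-expert loop, fine for a sketch; the real code batches tokens per expert.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                out[token_ids] += topk_weight[token_ids, slot, None] * expert(tokens[token_ids])
        out = out + self.shared_experts(tokens)
        return out.reshape(hidden_states.shape)


if __name__ == "__main__":
    x = torch.randn(2, 5, 64)  # (batch, seq_len, hidden)
    print(MoESketch()(x).shape)  # torch.Size([2, 5, 64])
```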
maybe missing one line about "this model was contributed by" with your HF username!
logger = logging.get_logger(__name__)

DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
Suggested change: remove `DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}`.
moe_layer_freq (`int`, *optional*, defaults to 1):
    The frequency of the MoE layer: one expert layer for every `moe_layer_freq - 1` dense layers.
first_k_dense_replace (`int`, *optional*, defaults to 0):
    Number of dense layers in the shallow layers (embed -> dense -> dense -> ... -> dense -> moe -> moe -> ... -> lm_head).
                                                          \--k dense layers--/
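To make the interplay of these two values concrete, a tiny sketch (hypothetical non-default values, using the same layer-selection rule as the decoder layer further down):

```python
# Hypothetical 8-layer model: first layer stays dense, then every 2nd layer is MoE
first_k_dense_replace = 1
moe_layer_freq = 2

layers = [
    "moe" if layer_idx >= first_k_dense_replace and layer_idx % moe_layer_freq == 0 else "dense"
    for layer_idx in range(8)
]
print(layers)  # ['dense', 'dense', 'moe', 'dense', 'moe', 'dense', 'moe', 'dense']
```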
would be nice to have standardization with the other MoE models in the library! we usually call this the sparse_step:

decoder_sparse_step (`int`, *optional*, defaults to 1):
    The frequency of the MoE layer.
pretraining_tp (`int`, *optional*, defaults to 1):
    Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
    document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
    necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
    issue](https://github.com/pytorch/pytorch/issues/76232).
should not be used!
def _rope_scaling_validation(self):
    """
    Validate the `rope_scaling` configuration.
    """
    if self.rope_scaling is None:
        return

    if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
        raise ValueError(
            "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
            f"got {self.rope_scaling}"
        )
    rope_scaling_type = self.rope_scaling.get("type", None)
    rope_scaling_factor = self.rope_scaling.get("factor", None)
    if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
        raise ValueError(
            f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
        )
    if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
        raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
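For reference, a value that passes this validation looks like the following (values are illustrative only):

```python
# Accepted: exactly two fields, a known type, and a float factor > 1
rope_scaling = {"type": "linear", "factor": 2.0}

# Rejected: unknown type, or an integer factor
# rope_scaling = {"type": "yarn", "factor": 2}
```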
Suggested change: remove the `_rope_scaling_validation` method quoted above.
@torch.no_grad()
def moe_infer(self, x, flat_expert_indices, flat_expert_weights):
    expert_cache = torch.zeros_like(x)
    # Sort the flattened token-expert assignments by expert id so each expert's tokens are contiguous.
    idxs = flat_expert_indices.argsort()
    tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0)
    # Each token appears `num_experts_per_tok` times in the flattened assignment, so map back to token rows.
    token_idxs = idxs // self.num_experts_per_tok
    for i, end_idx in enumerate(tokens_per_expert):
        start_idx = 0 if i == 0 else tokens_per_expert[i - 1]
        if start_idx == end_idx:
            continue
        expert = self.experts[i]
        exp_token_idx = token_idxs[start_idx:end_idx]
        expert_tokens = x[exp_token_idx]
        expert_out = expert(expert_tokens)
        expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]])
        # Accumulate the weighted expert outputs back into the corresponding token rows.
        expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce="sum")
    return expert_cache
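For context, the two flattened tensors this helper consumes come straight out of the top-k routing step; roughly how it would be called (a sketch only; `gate`, `num_experts_per_tok`, and `moe_module` stand in for the module's own attributes):

```python
# hidden_states: (num_tokens, hidden_dim) after flattening batch and sequence dims
scores = gate(hidden_states).softmax(dim=-1)                      # (num_tokens, n_experts)
topk_weight, topk_idx = scores.topk(num_experts_per_tok, dim=-1)  # (num_tokens, top_k) each

flat_expert_indices = topk_idx.view(-1)        # (num_tokens * top_k,)
flat_expert_weights = topk_weight.view(-1, 1)  # (num_tokens * top_k, 1)
out = moe_module.moe_infer(hidden_states, flat_expert_indices, flat_expert_weights)
```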
I don't really mind having 2 forwards, one for inference and one for training, though this is not super common
    y = AddAuxiliaryLoss.apply(y, aux_loss)
else:
    y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape)
if self.config.n_shared_experts is not None:
let's avoid the extra code path: if the model uses shared experts, let's add them unconditionally, otherwise let's remove this!
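For reference, in the released `modeling_deepseek.py` the shared-experts branch boils down to adding an always-on FFN output on top of the routed result; if the released checkpoints always set `n_shared_experts`, the guard could simply go away (a sketch, where `identity` is the pre-MoE hidden states):

```python
# Current (guarded) path in the remote code:
if self.config.n_shared_experts is not None:
    y = y + self.shared_experts(identity)

# Suggested shape if shared experts are always present:
y = y + self.shared_experts(identity)
```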
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


# Copied from transformers.models.llama.modeling_llama.LlamaAttention with Llama->Deepseek
we should rather copy from Gemma, which removes the need for `pretraining_tp`!
if position_embeddings is None:
    logger.warning_once(
        "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
        "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
        "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.45 `position_ids` will be "
        "removed and `position_embeddings` will be mandatory."
    )
    cos, sin = self.rotary_emb(value_states, position_ids)
else:
as this is a new model, it will not support legacy behaviours out of the box
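Concretely, dropping the legacy path would mean the warning branch above goes away and the layers always receive the externally computed embeddings, along the lines of the pattern newer models in the library use (a sketch, not the PR's exact code):

```python
# In the model forward: compute cos/sin once per forward pass and pass the tuple to every layer
position_embeddings = self.rotary_emb(hidden_states, position_ids)  # tuple of (cos, sin)

# In each attention forward: no fallback on position_ids, just unpack the tuple
cos, sin = position_embeddings
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
```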
self.mlp = (
    DeepseekMoE(config)
    if (
        config.n_routed_experts is not None
        and layer_idx >= config.first_k_dense_replace
        and layer_idx % config.moe_layer_freq == 0
    )
    else DeepseekMLP(config)
)
let's split this into 2 lines
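Something along these lines would be easier to read (a sketch of the suggested split, same condition as above):

```python
is_moe_layer = (
    config.n_routed_experts is not None
    and layer_idx >= config.first_k_dense_replace
    and layer_idx % config.moe_layer_freq == 0
)
self.mlp = DeepseekMoE(config) if is_moe_layer else DeepseekMLP(config)
```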
What does this PR do?
Upstream custom code from https://huggingface.co/deepseek-ai/deepseek-moe-16b-base/blob/main/modeling_deepseek.py to huggingface/transformers. This is not DeepSeek V2. The newly released DeepSeek-Prover-V1.5 runs on this architecture for example (though without MoE layers, so it is actually just Llama).
https://huggingface.co/models?other=deepseek
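Once this is merged, the released checkpoints could be loaded without `trust_remote_code`, e.g. (assuming the weights stay under the same repo id):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-moe-16b-base")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-moe-16b-base")

inputs = tokenizer("DeepSeekMoE uses fine-grained experts and", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```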
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of who to tag.
Please tag fewer than 3 people.