feat: DeepSeekMoE #32862
base: main
Conversation
Force-pushed from 5cee995 to 30ea5b1
Force-pushed from 30ea5b1 to c16ff06
Hey! feel free to ping me once this is ready for review!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Hey @ArthurZucker, should be ready for review now, thanks!
friendly ping
Thanks for the PR! I am not 100% sure I caught what the architectural differences are with, say, QwenMoe, which also has the shared experts piece!
Also let's try to match the architecture of the released models, removing unnecessary code paths!
### Description

DeepSeekMoE 16B is a Mixture-of-Experts (MoE) language model with 16.4B parameters. It employs an innovative MoE architecture, which involves two principal strategies: fine-grained expert segmentation and shared experts isolation. It is trained from scratch on 2T English and Chinese tokens, and exhibits comparable performance with DeepSeek 7B and LLaMA2 7B, with only about 40% of computations. For research purposes, we release the model checkpoints of DeepSeekMoE 16B Base and DeepSeekMoE 16B Chat to the public, which can be deployed on a single GPU with 40GB of memory without the need for quantization.
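To make the two strategies concrete, here is a minimal sketch of the idea (illustrative only; parameter names such as `n_routed_experts`, `n_shared_experts`, and `top_k` follow the released config, but the module below is simplified and is not the PR's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyExpert(nn.Module):
    """A small FFN expert; fine-grained segmentation means its hidden size is a fraction of a dense FFN's."""

    def __init__(self, hidden_size, intermediate_size):
        super().__init__()
        self.up = nn.Linear(hidden_size, intermediate_size, bias=False)
        self.down = nn.Linear(intermediate_size, hidden_size, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))


class MoESketch(nn.Module):
    """Sketch only: top-k routed fine-grained experts plus always-on shared experts."""

    def __init__(self, hidden_size=64, intermediate_size=16,
                 n_routed_experts=8, n_shared_experts=2, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_size, n_routed_experts, bias=False)
        self.experts = nn.ModuleList(
            TinyExpert(hidden_size, intermediate_size) for _ in range(n_routed_experts)
        )
        # Shared experts see every token and bypass the router entirely (shared experts isolation).
        self.shared_experts = TinyExpert(hidden_size, intermediate_size * n_shared_experts)

    def forward(self, hidden_states):
        tokens = hidden_states.reshape(-1, hidden_states.shape[-1])
        scores = self.gate(tokens).softmax(dim=-1)
        topk_weight, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(tokens)
        # Naive per-expert loop, fine for a sketch; the real code batches tokens per expert.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel():
                out[token_ids] += topk_weight[token_ids, slot, None] * expert(tokens[token_ids])
        out = out + self.shared_experts(tokens)
        return out.reshape(hidden_states.shape)


if __name__ == "__main__":
    x = torch.randn(2, 5, 64)  # (batch, seq_len, hidden)
    print(MoESketch()(x).shape)  # torch.Size([2, 5, 64])
```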
maybe missing one line about "this model was contributed by" with your HF username!
logger = logging.get_logger(__name__)

DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
Suggested change: remove `DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}`.
moe_layer_freq (`int`, *optional*, defaults to 1):
    The frequency of the MoE layer: one expert layer for every `moe_layer_freq - 1` dense layers.
first_k_dense_replace (`int`, *optional*, defaults to 0):
    Number of dense layers in the shallow layers (embed -> dense -> dense -> ... -> dense -> moe -> moe -> ... -> lm_head).
                                                          \--k dense layers--/
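To make the interplay of these two values concrete, a tiny sketch (hypothetical non-default values, using the same layer-selection rule as the decoder layer further down):

```python
# Hypothetical 8-layer model: first layer stays dense, then every 2nd layer is MoE
first_k_dense_replace = 1
moe_layer_freq = 2

layers = [
    "moe" if layer_idx >= first_k_dense_replace and layer_idx % moe_layer_freq == 0 else "dense"
    for layer_idx in range(8)
]
print(layers)  # ['dense', 'dense', 'moe', 'dense', 'moe', 'dense', 'moe', 'dense']
```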
would be nice to have standardization with the other MoE models in the library! we usually call this the sparse_step:

decoder_sparse_step (`int`, *optional*, defaults to 1):
    The frequency of the MoE layer.
pretraining_tp (`int`, *optional*, defaults to 1):
    Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
    document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
    necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
    issue](https://github.com/pytorch/pytorch/issues/76232).
should not be used!
def _rope_scaling_validation(self):
    """
    Validate the `rope_scaling` configuration.
    """
    if self.rope_scaling is None:
        return

    if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
        raise ValueError(
            "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
            f"got {self.rope_scaling}"
        )
    rope_scaling_type = self.rope_scaling.get("type", None)
    rope_scaling_factor = self.rope_scaling.get("factor", None)
    if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
        raise ValueError(
            f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
        )
    if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor <= 1.0:
        raise ValueError(f"`rope_scaling`'s factor field must be a float > 1, got {rope_scaling_factor}")
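For reference, a value that passes this validation looks like the following (values are illustrative only):

```python
# Accepted: exactly two fields, a known type, and a float factor > 1
rope_scaling = {"type": "linear", "factor": 2.0}

# Rejected: unknown type, or an integer factor
# rope_scaling = {"type": "yarn", "factor": 2}
```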
Suggested change: remove the `_rope_scaling_validation` method quoted above.
@torch.no_grad()
def moe_infer(self, x, flat_expert_indices, flat_expert_weights):
    expert_cache = torch.zeros_like(x)
    # Sort the flattened token-expert assignments by expert id so each expert's tokens are contiguous.
    idxs = flat_expert_indices.argsort()
    tokens_per_expert = flat_expert_indices.bincount().cpu().numpy().cumsum(0)
    # Each token appears `num_experts_per_tok` times in the flattened assignment, so map back to token rows.
    token_idxs = idxs // self.num_experts_per_tok
    for i, end_idx in enumerate(tokens_per_expert):
        start_idx = 0 if i == 0 else tokens_per_expert[i - 1]
        if start_idx == end_idx:
            continue
        expert = self.experts[i]
        exp_token_idx = token_idxs[start_idx:end_idx]
        expert_tokens = x[exp_token_idx]
        expert_out = expert(expert_tokens)
        expert_out.mul_(flat_expert_weights[idxs[start_idx:end_idx]])
        # Accumulate the weighted expert outputs back into the corresponding token rows.
        expert_cache.scatter_reduce_(0, exp_token_idx.view(-1, 1).repeat(1, x.shape[-1]), expert_out, reduce="sum")
    return expert_cache
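For context, the two flattened tensors this helper consumes come straight out of the top-k routing step; roughly how it would be called (a sketch only; `gate`, `num_experts_per_tok`, and `moe_module` stand in for the module's own attributes):

```python
# hidden_states: (num_tokens, hidden_dim) after flattening batch and sequence dims
scores = gate(hidden_states).softmax(dim=-1)                      # (num_tokens, n_experts)
topk_weight, topk_idx = scores.topk(num_experts_per_tok, dim=-1)  # (num_tokens, top_k) each

flat_expert_indices = topk_idx.view(-1)        # (num_tokens * top_k,)
flat_expert_weights = topk_weight.view(-1, 1)  # (num_tokens * top_k, 1)
out = moe_module.moe_infer(hidden_states, flat_expert_indices, flat_expert_weights)
```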
I don't really mind having 2 forwards, one for inference and one for training, though this is not super common
    y = AddAuxiliaryLoss.apply(y, aux_loss)
else:
    y = self.moe_infer(hidden_states, flat_topk_idx, topk_weight.view(-1, 1)).view(*orig_shape)
if self.config.n_shared_experts is not None:
let's avoid the extra code path: if the model uses shared experts, let's add them unconditionally, otherwise let's remove this!
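For reference, in the released `modeling_deepseek.py` the shared-experts branch boils down to adding an always-on FFN output on top of the routed result; if the released checkpoints always set `n_shared_experts`, the guard could simply go away (a sketch, where `identity` is the pre-MoE hidden states):

```python
# Current (guarded) path in the remote code:
if self.config.n_shared_experts is not None:
    y = y + self.shared_experts(identity)

# Suggested shape if shared experts are always present:
y = y + self.shared_experts(identity)
```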
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


# Copied from transformers.models.llama.modeling_llama.LlamaAttention with Llama->Deepseek
we should rather copy from Gemma, which removes the need for `pretraining_tp`!
if position_embeddings is None:
    logger.warning_once(
        "The attention layers in this model are transitioning from computing the RoPE embeddings internally "
        "through `position_ids` (2D tensor with the indexes of the tokens), to using externally computed "
        "`position_embeddings` (Tuple of tensors, containing cos and sin). In v4.45 `position_ids` will be "
        "removed and `position_embeddings` will be mandatory."
    )
    cos, sin = self.rotary_emb(value_states, position_ids)
else:
as this is a new model, it will not support legacy behaviours out of the box
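Concretely, dropping the legacy path would mean the warning branch above goes away and the layers always receive the externally computed embeddings, along the lines of the pattern newer models in the library use (a sketch, not the PR's exact code):

```python
# In the model forward: compute cos/sin once per forward pass and pass the tuple to every layer
position_embeddings = self.rotary_emb(hidden_states, position_ids)  # tuple of (cos, sin)

# In each attention forward: no fallback on position_ids, just unpack the tuple
cos, sin = position_embeddings
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
```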
self.mlp = (
    DeepseekMoE(config)
    if (
        config.n_routed_experts is not None
        and layer_idx >= config.first_k_dense_replace
        and layer_idx % config.moe_layer_freq == 0
    )
    else DeepseekMLP(config)
)
let's split this into 2 lines
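Something along these lines would be easier to read (a sketch of the suggested split, same condition as above):

```python
is_moe_layer = (
    config.n_routed_experts is not None
    and layer_idx >= config.first_k_dense_replace
    and layer_idx % config.moe_layer_freq == 0
)
self.mlp = DeepseekMoE(config) if is_moe_layer else DeepseekMLP(config)
```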
What does this PR do?
Upstream custom code from https://huggingface.co/deepseek-ai/deepseek-moe-16b-base/blob/main/modeling_deepseek.py to huggingface/transformers. This is not DeepSeek V2. The newly released DeepSeek-Prover-V1.5 runs on this architecture for example (though without MoE layers, so it is actually just Llama).
https://huggingface.co/models?other=deepseek
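Once this is merged, the released checkpoints could be loaded without `trust_remote_code`, e.g. (assuming the weights stay under the same repo id):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/deepseek-moe-16b-base")
model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-moe-16b-base")

inputs = tokenizer("DeepSeekMoE uses fine-grained experts and", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```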
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
If you know how to use git blame, that is the easiest way, otherwise, here is a rough guide of who to tag.
Please tag fewer than 3 people.