DeepSeek-V3-lite naming conventions?

#76 · by AlphaGaO

Hello, I am currently working on a pruned version of DeepSeek-V3.

The methodology involves layer-wise routed expert pruning and distillation, followed by post-training on the full model.
I already tested the pipeline on DeepSeek-V2-Lite, bringing 64@6 experts down to 16@4 experts, and it seems to give correct results.
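
For anyone curious, here is roughly what the layer-wise pruning step looks like. This is only a minimal sketch with placeholder names (the MoE layer attributes `.gate` and `.experts` are assumptions, not the actual code from my pipeline):

```python
# Hypothetical sketch of layer-wise routed expert pruning: score experts on a
# calibration batch, keep the most frequently used ones, and shrink the router
# to match. Layer attribute names (.gate, .experts) are placeholders.
import torch

@torch.no_grad()
def prune_layer_experts(layer, calib_hidden_states, keep: int, top_k: int):
    """Keep the `keep` most frequently routed experts of one MoE layer."""
    # Router logits over all experts for every calibration token.
    logits = layer.gate(calib_hidden_states)           # [tokens, n_experts]
    chosen = logits.topk(top_k, dim=-1).indices        # experts picked per token
    counts = torch.bincount(chosen.flatten(), minlength=logits.shape[-1])

    # Experts to retain, kept in their original order.
    keep_ids = counts.topk(keep).indices.sort().values
    layer.experts = torch.nn.ModuleList(layer.experts[i] for i in keep_ids.tolist())
    # Shrink the router's output projection to the surviving experts.
    layer.gate.weight = torch.nn.Parameter(layer.gate.weight[keep_ids].clone())
    return keep_ids
```

After each layer is pruned this way, its output is distilled against the original layer's output on the same calibration data.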

I just started running the same method on DeepSeek-V3 with the following pruned targets (a config sketch follows the list):
Base Model: 256@8 => DeepSeek-V3-671B@37B-full
22@6 => DeepSeek-V3-Lite-72B@31B-large
16@4 => DeepSeek-V3-Lite-57B@26B-medium
8@2 => DeepSeek-V3-Lite-36B@21B-small
4@1 => DeepSeek-V3-Lite-26B@19B-nano
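
To make the N@K notation explicit (N routed experts in total, K active per token), here is a hypothetical mapping of the variants above to config overrides; the field names are assumed to follow the DeepSeek-style config (`n_routed_experts`, `num_experts_per_tok`):

```python
# Assumed config overrides for the pruned variants; N@K above reads as
# {N routed experts in total} @ {K experts active per token}.
PRUNED_VARIANTS = {
    "DeepSeek-V3-Lite-72B@31B-large":  {"n_routed_experts": 22, "num_experts_per_tok": 6},
    "DeepSeek-V3-Lite-57B@26B-medium": {"n_routed_experts": 16, "num_experts_per_tok": 4},
    "DeepSeek-V3-Lite-36B@21B-small":  {"n_routed_experts": 8,  "num_experts_per_tok": 2},
    "DeepSeek-V3-Lite-26B@19B-nano":   {"n_routed_experts": 4,  "num_experts_per_tok": 1},
}
```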

I'll upload them to Hugging Face when the pipeline finishes running (it should take about 3 days on my 2x3090 rig).

Do you authorize me to use the naming convention above for the uploads?

If the methodology gives good results, I'll apply it to R1 and R1-Zero as well.

Update: distillation is faster than expected; the first stage of the pipeline has processed 37 of 61 layers.

Early insights:
A phenomenon I already observed while pruning V2-Lite is happening again: the deeper the layer, the higher the reconstruction loss becomes (the loss at layer 40 is around 10x higher than at layer 10). This may hint that MoE is especially useful in deeper layers, so better pruning efficiency might be achievable by varying the total number of experts with depth.
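
One way this could be exploited (not implemented yet, just a sketch): allocate the per-layer expert budget roughly in proportion to the measured reconstruction loss, for example:

```python
# Hypothetical depth-adaptive expert budget: give layers with a higher
# reconstruction loss a larger share of the total expert count.
def allocate_expert_budget(recon_losses, total_experts, min_per_layer=4):
    """recon_losses: per-layer reconstruction losses measured during distillation."""
    total_loss = sum(recon_losses)
    return [max(min_per_layer, round(total_experts * loss / total_loss))
            for loss in recon_losses]

# Example: deeper layers (higher loss) receive more experts.
print(allocate_expert_budget([1.0, 2.0, 5.0, 10.0], total_experts=64))
# -> [4, 7, 18, 36]; rounding makes the sum 65, so a final adjustment pass
#    would be needed to hit the exact total.
```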

Furthermore, I am only using a small calibration dataset (4,096 samples) for the distillation stage, as it is very resource-intensive, so there is still room for improvement.

Next experiments for v0.2+:

  • Scaling up the calibration dataset.
  • Adaptive number of experts per layer.
  • Iterative expert pruning (removing experts progressively instead of all at once).
  • Expert fusion (using SLERP to merge experts with highly correlated activations / high output similarity; see the sketch after this list).
  • Shared expert training (i.e. sharing weights among several MoE distillation targets, which would make distillation more efficient and allow optimizing the order of experts).
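
For the expert-fusion idea, here is a minimal sketch of SLERP merging on flattened expert weights (placeholder names, assuming both experts share the same architecture; this is not code from the pipeline):

```python
# Hypothetical sketch of expert fusion via SLERP: merge two experts whose
# activations are highly correlated into a single expert by interpolating
# their weights on the unit sphere.
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float = 0.5, eps: float = 1e-7):
    """Spherical linear interpolation between two weight tensors of the same shape."""
    a, b = w_a.flatten(), w_b.flatten()
    a_n, b_n = a / a.norm(), b / b.norm()
    # Clamp the cosine so nearly colinear weights do not produce sin(omega) = 0.
    omega = torch.arccos(torch.clamp(a_n @ b_n, -1 + eps, 1 - eps))
    so = torch.sin(omega)
    merged = (torch.sin((1 - t) * omega) / so) * a + (torch.sin(t * omega) / so) * b
    return merged.view_as(w_a)

def fuse_experts(expert_a, expert_b, t: float = 0.5):
    """Merge expert_b into expert_a parameter-wise (identical architectures assumed)."""
    with torch.no_grad():
        for p_a, p_b in zip(expert_a.parameters(), expert_b.parameters()):
            p_a.copy_(slerp(p_a, p_b, t))
    return expert_a
```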

The first repair run just started, and the loss is going down!

(The loss is a bit higher than expected, but considering that the distillation pipeline was run on very few samples due to hardware limitations, it is already something.)

Also, the repair is running with a context of only 64 tokens, so there is not much room for optimization; I just spun up a cloud A100 to run the same thing with a larger context.
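
For reference, the repair stage is essentially a logit-distillation loop against the original model; a minimal sketch (model, loader, and optimizer names are placeholders, the real pipeline is in the repo):

```python
# Hypothetical single repair step: distill the pruned student against the
# original teacher with a KL loss on short calibration sequences.
import torch
import torch.nn.functional as F

def repair_step(student, teacher, input_ids, optimizer, temperature: float = 1.0):
    with torch.no_grad():
        teacher_logits = teacher(input_ids).logits     # frozen original model
    student_logits = student(input_ids).logits          # pruned model being repaired

    # Standard knowledge-distillation KL between softened distributions.
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```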

I just pushed the methodology and code to GitHub; any feedback is appreciated ;)
https://github.com/gabrielolympie/moe-pruner
