WAIDWML - What Am I Doing With My Life?

(8 Phi-4s in a trenchcoat)

Rationale

So there I was, finding some inspiration to tune stuff but lacking the disposable funds to do anything with the larger models. Enter Phi-4, a model designed for productivity... Initially it was just going to be a sequential series of finetunes, starting from the baseline Phi-4 and gradually adding more datasets until I either got bored or it got good, but then I had an idea: what if I just MoE'd it?

Yeah.

As a proof of concept, this wasn't too bad. The end result is... interesting, to say the least.

Training

As mentioned above, this was done in "phases", each with a separate dataset. Most were done with a max_seq_length of 32k; a few were dropped to 16k to make sure they fit on the hardware.

The learning rate was all over the place but generally landed between 1e-5 and 4e-6. These were all separate LoRAs using r=64 and alpha=32 with rsLoRA enabled (a rough config sketch follows the phase list). Epochs were 2 or 3 for everything except c2, as that would have taken far too long.

  • p1: Private RP dataset (RPT-Varied-Small)
  • p2: TheDrummer/AmoralQA-v2
  • p3: AIRRC/Eudaimonic
  • p4: Two private RP datasets (cc-gpt4-sfw-sharegpt & cc-gpt4-nsfw-sharegpt)
  • p5: A random subset of the infamous "c2"-logs dataset, cleaned and deduped (approx. 30%)
  • p6: Private RP dataset (RPT-Varied-Small_v1.5)
  • p7: NewEden/PIPPA-Mega-Filtered
  • p8: Squish42/bluemoon-fandom-1-1-rp-cleaned

(Note: the RPT-Varied-Small and RPT-Varied-Small_v1.5 datasets are due to be released after I manually verify their fitness.)
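
For concreteness, here is roughly what each phase's adapter config looks like expressed with PEFT. Only the r/alpha/rsLoRA values and the hyperparameter ranges above come from the actual runs; the target modules and the rest of the trainer wiring are assumptions.

```python
from peft import LoraConfig

# Per-phase adapter config matching the numbers above. target_modules is an
# assumption; the card doesn't say which projections were targeted.
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    use_rslora=True,              # rank-stabilized LoRA scaling
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

# Per-phase training knobs (values varied between phases):
#   learning_rate:    1e-5 down to 4e-6
#   max_seq_length:   32768 (16384 for a few phases, to fit on the hardware)
#   num_train_epochs: 2-3 (less for the c2 subset)
```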

Once all LoRAs were trained, I separately merged each one into the base model, then used mergekit (config) to "merge" the resulting models into a MoE. I chose to initialize the router randomly since I was going to train that part later. After that, I trained the routing layers for 8 epochs with lr = 1e-6 on grimulkan/LimaRP-augmented. It took roughly 8.5 hours on a 6xA40 instance on RunPod.
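
A minimal sketch of the two post-merge steps, assuming PEFT-style adapters and hypothetical local paths. The merge is done once per expert before handing the eight dense models to mergekit:

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Fold one phase's adapter back into the dense base model (repeat for p1..p8).
base = AutoModelForCausalLM.from_pretrained("microsoft/phi-4", torch_dtype=torch.bfloat16)
expert = PeftModel.from_pretrained(base, "loras/p1").merge_and_unload()
expert.save_pretrained("experts/p1")
```

For the router run, freezing everything except the routing layers looks roughly like this; the name match on "gate"/"router" is an assumption, since the exact parameter names depend on the architecture mergekit emits.

```python
moe = AutoModelForCausalLM.from_pretrained("WAIDWML-Phi4-8x14B-bf16", torch_dtype=torch.bfloat16)
for name, param in moe.named_parameters():
    param.requires_grad = ("gate" in name) or ("router" in name)
# ...then train as usual for 8 epochs at lr = 1e-6 on grimulkan/LimaRP-augmented.
```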

Recommended Settings

Phi-4 format. What I used for my tests (a quick generation sketch follows the list):

  • Temp 1
  • minP 0.05
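
If you're running it through transformers directly, those settings map onto generate() roughly like this (paths and prompt are placeholders; min_p sampling needs a reasonably recent transformers release):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("WAIDWML-Phi4-8x14B-bf16")
model = AutoModelForCausalLM.from_pretrained(
    "WAIDWML-Phi4-8x14B-bf16", torch_dtype=torch.bfloat16, device_map="auto"
)

# The Phi-4 chat template is baked into the tokenizer, so apply_chat_template handles the format.
prompt = tok.apply_chat_template(
    [{"role": "user", "content": "Write an opening scene."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, do_sample=True, temperature=1.0, min_p=0.05, max_new_tokens=512)
print(tok.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```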

FAQ

Q: Why not do anything constructive, like GRPO-tune a model of usable size?
A: Where's the fun in that?

Q: Are you, like, okay?
A: Objectively? Probably not. Subjectively? Never better.

Q: You know this still sucks for RP, right?
A: Yup. Should have pivoted to reasoning and code once R1 hit, but sunk cost and all kept me on this trajectory.