|
--- |
|
license: apache-2.0 |
|
tags: |
|
- RWKV |
|
base_model: |
|
- microsoft/phi-4 |
|
--- |
|
# PRWKV-7-Phi-4-Instruct-Preview-v0.1 Model Card |
|
|
|
 |
|
|
|
## Model Overview |
|
|
|
PRWKV-7-Phi-4-Instruct is a 16.3-billion-parameter large language model built on the RNN-based RWKV-x070 architecture. Its distinctive feature is that it replaces the attention mechanism of Microsoft's Transformer-based Phi-4 14B with RWKV's recurrent time-mixing approach.
|
|
|
## Technical Specifications |
|
|
|
- **Architecture**: RWKV-x070 "Goose" (RNN-based), https://github.com/BlinkDL/RWKV-LM

- **Parameters**: 16.3 billion (L40 D5120 RWKV TimeMix + D17920 SwiGLU MLP, i.e. 40 layers with hidden size 5120 and a 17920-dimensional SwiGLU MLP)

- **Training Context Window**: 12288 (Stage 1 = 2560, Stage 2 = 8192, Stage 3 = 12288)

- **Base Model**: Derived from Microsoft Phi-4 14B, https://huggingface.co/microsoft/phi-4
|
- **Development Stage**: Experimental preview (no performance guarantees) |
|
- **License**: Apache 2.0 |
|
|
|
## Key Innovations |
|
|
|
This model builds upon and refines the attention-replacement approaches pioneered by several notable projects, including:

- Qwerky (Qwen 2.5 72B + QRWKV7 architecture)

- QRWKV (Qwen 2.5 32B + QRWKV6 architecture)

- ARWKV (Qwen 2.5 1.5B-7B + RWKV v7 architecture)
|
|
|
The primary advantage of the RWKV architecture is that it eliminates the KV cache: the recurrent state has a fixed size, so generation can continue over arbitrarily long contexts while VRAM consumption stays constant.
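
As a rough illustration of why no KV cache is needed, the sketch below contrasts the two decoding styles. It is a toy stand-in only: the dimensions and the linear state update are placeholders, not the actual RWKV-7 WKV kernel.

```python
# Toy illustration only: a stand-in linear recurrence, not the real RWKV-7
# WKV kernel. It shows why decoding needs no growing KV cache.
import torch

d = 64  # toy hidden size (the real model uses 5120)

# RWKV-style decoding keeps one fixed-size state per layer and updates it
# in place, so VRAM stays constant no matter how long generation runs.
state = torch.zeros(d, d)

def recurrent_step(state: torch.Tensor, x_t: torch.Tensor):
    state = 0.99 * state + torch.outer(x_t, x_t)  # decay old state, mix in new token
    y_t = state @ x_t                             # read out for this token
    return state, y_t

for _ in range(10_000):                           # memory footprint never changes
    state, y_t = recurrent_step(state, torch.randn(d))

# A Transformer decoder would instead append to a cache every step, e.g.
# kv_cache.append((k_t, v_t)), so memory grows linearly with context length.
```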
|
|
|
## Training Methodology |
|
|
|
The training process consisted of three distinct stages: |
|
|
|
### Stage 1: Attention Alignment (Based on the RWKVInside repository)
|
- The TimeMix component of RWKV was calibrated to produce outputs equivalent to those of the Transformer's attention layers

- Seven different loss calculation approaches were employed to capture the differences between Attention and TimeMix, including the following (a simplified example appears after this list):

  - Norm-based methods

  - Singular Value Decomposition (SVD)

  - Cosine similarity

  - Multi-resolution bias similarity

  - Temporal vector similarity

  - And others
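
The exact combination of these terms lives in the RWKVInside repository; the fragment below is only a minimal sketch, assuming per-layer activations of shape (batch, seq_len, hidden), of how two of the listed terms (a norm-based term and a cosine-similarity term) might be combined into a single alignment loss.

```python
import torch
import torch.nn.functional as F

def alignment_loss(attn_out: torch.Tensor, timemix_out: torch.Tensor) -> torch.Tensor:
    """attn_out, timemix_out: (batch, seq_len, hidden) activations of one layer."""
    # Norm-based term: mean squared error between the two representations.
    l2_term = F.mse_loss(timemix_out, attn_out)
    # Cosine term: 1 - cosine similarity, averaged over all token positions.
    cos_term = (1.0 - F.cosine_similarity(timemix_out, attn_out, dim=-1)).mean()
    return l2_term + cos_term

# Dummy activations: the frozen attention output is treated as a fixed target.
attn = torch.randn(2, 16, 5120)
tmix = torch.randn(2, 16, 5120, requires_grad=True)
alignment_loss(attn, tmix).backward()
```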
|
|
|
### Stage 2: Knowledge Distillation (Based on the RWKVInside repository)
|
- Teacher model: the original Phi-4 (output logits from its head)

- Student model: Phi-4 with attention replaced by RWKV TimeMix

- Only the RWKV replacement for attention was trained; all other components (MLP layers, embeddings, heads) were frozen (a sketch of the distillation loss follows this list)
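
A minimal sketch of the temperature-scaled distillation loss this stage implies (the temperatures 1.0 and 2.0 appear under Training Infrastructure below; the exact loss used in the RWKVInside code may differ) could look like this:

```python
# Sketch only: temperature-scaled KD from frozen teacher logits to student
# logits. Shapes and the vocabulary size are illustrative assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) on temperature-softened distributions.
    Both logits tensors have shape (batch, seq_len, vocab)."""
    t = temperature
    vocab = student_logits.size(-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    p_teacher = F.softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    # Rescale by t^2 so gradient magnitude stays comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Dummy example (tiny vocabulary, just for the demo)
student = torch.randn(2, 8, 1000, requires_grad=True)
teacher = torch.randn(2, 8, 1000)
kd_loss(student, teacher, temperature=1.0).backward()
```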
|
|
|
### Stage 3: Supervised Fine-Tuning (Using RWKV-LM-RLHF) |
|
- Utilized a distillation dataset of 900K samples (Chinese, Japanese, English)

- Smoothed Loss for faster convergence

- Implemented Variable Rank PEFT to enhance training efficiency (see the sketch after this list)

- Bone (Block Affine Transformation), r = 512+
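
Bone's block-affine update is defined in the RWKV-LM-RLHF repository; the sketch below illustrates only the "variable rank" idea, using a generic LoRA-style adapter whose rank is chosen per module. The module names and rank table are hypothetical.

```python
# Illustration of per-module ("variable") adapter ranks. This is a generic
# LoRA-style adapter, not Bone's block-affine update, and the rank table is
# hypothetical.
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # the base weight stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

# Hypothetical rank assignment: higher rank where more capacity is needed.
rank_table = {"att.receptance": 512, "att.key": 512, "att.value": 512, "ffn.key": 128}

proj = nn.Linear(5120, 5120)
adapted = LowRankAdapter(proj, rank=rank_table["att.key"])
```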
|
|
|
## How to Use |
|
- PC requirements: an NVIDIA GPU with 16GB+ VRAM (ROCm GPUs also work, but only in fp16)

- OS: Windows (WSL2 with CUDA) or Linux

- Install RWKV-Infer (see its installation instructions): https://github.com/OpenMOSE/RWKV-Infer

- Create a "models" folder and place PRWKV-7-Phi-4-Instruct-Preview-v0.1.pth in it

- Load the model with loadmodel, choosing fp16, fp6, or fp5 (do not choose fp8); see the curl example below

- VRAM usage: about 34GB in fp16, 14GB in fp5

- Enjoy text chats via Open WebUI or SillyTavern :)
|
``` |
|
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/PRWKV-7-Phi-4-Instruct-Preview-v0.1.pth","model_viewname":"PRWKV7-Phi-4 Preview 0.1","model_strategy":"fp5","template":"phi4"}' |
|
|
|
``` |
|
- You can then use this model through the OpenAI-compatible API at http://127.0.0.1:9000/v1, with the model name "PRWKV7-Phi-4 Preview 0.1".
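
For example, a minimal chat request from Python with the official openai client pointed at the local server (this assumes RWKV-Infer exposes the standard /v1/chat/completions route; the API key is a placeholder because the client requires one):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="PRWKV7-Phi-4 Preview 0.1",
    messages=[{"role": "user", "content": "Explain the RWKV architecture in two sentences."}],
)
print(response.choices[0].message.content)
```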
|
|
|
## Training Infrastructure |
|
- Hardware: Single AMD MI300X GPU |
|
- Training Duration: 4 days (Stages 1 and 2)

- Stage 1: 180M tokens (LR 1e-4)

- Stage 2: 160M tokens (KD temperature 1.0, LR 5e-6)

- Stage 2.5: 120M tokens (KD temperature 2.0, LR 3e-5)

- Stage 3: 100M tokens
|
|
|
## Acknowledgements |
|
|
|
This work was made possible through the contributions of: |
|
- SmerkyG |
|
- RecursalAI |
|
- RWKV-Red-Team |
|
- BlinkDL (RWKV v7 architecture)
|
|
|
- https://github.com/OpenMOSE/RWKVInside |
|
- https://github.com/OpenMOSE/RWKV-LM-RLHF |
|
- https://github.com/OpenMOSE/RWKV-Infer |
|
- https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1 |
|
- https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1 |
|
- https://huggingface.co/featherless-ai/Qwerky-72B-Preview |
|
|
|
## Limitations |
|
|
|
This checkpoint is from an early epoch of Stage 3 training.
|
This model is currently in a testing phase and does not guarantee any specific level of performance. Users should consider it experimental technology. |
|
|
|
## MyStories (Generated by PRWKV)
|
I've faced an incredibly long and challenging journey with the stability of Stage 2 knowledge-distillation learning. NaN (Not a Number) errors have become an all too familiar sight during this process. The training would often diverge unexpectedly, leaving me to debug complex numerical issues that appeared without warning. Day after day, I adjusted hyperparameters, modified architecture components, and scrutinized every aspect of the data pipeline, only to be greeted by those "three dreaded letters" in my training logs. What should have been a straightforward implementation became a months-long battle against numerical instability, requiring persistence through countless failed experiments and late nights analyzing loss curves that suddenly spiked into oblivion.
|
|
|
## License |
|
|
|
Released under the Apache 2.0 license. |
|
|
|
2025 OpenMOSE |