---
license: apache-2.0
tags:
- RWKV
base_model:
- microsoft/phi-4
---
# PRWKV-7-Phi-4-Instruct-Preview-v0.1 Model Card

![prwkv](./prwkv.png)

## Model Overview

PRWKV-7-Phi-4-Instruct is a 16.3-billion-parameter large language model built on the RNN-based RWKV-x070 architecture. Its distinctive feature is that the attention mechanism of Microsoft's Transformer-based Phi-4 14B has been replaced with RWKV's recurrent time-mixing approach.

## Technical Specifications

- **Architecture**: RWKV-x070 "Goose" (RNN-based), https://github.com/BlinkDL/RWKV-LM
- **Parameters**: 16.3 billion (40-layer, 5120-dim RWKV TimeMix + 17920-dim SwiGLU MLP)
- **Training Context Window**: 12288 (Stage 1 = 2560, Stage 2 = 8192, Stage 3 = 12288)
- **Base Model**: Derived from Microsoft Phi-4 14B, https://huggingface.co/microsoft/phi-4
- **Development Stage**: Experimental preview (no performance guarantees)
- **License**: Apache 2.0

## Key Innovations

This model builds upon and refines the attention replacement approaches pioneered by several notable projects, including:
- Qwerky (Qwen 2.5 72B + QRWKV7 Arch)
- QRWKV (Qwen 2.5 32B + QRWKV6 Arch) 
- ARWKV (Qwen 2.5 1.5B-7B + RWKV v7 Arch)

The primary advantage of the RWKV architecture is that it eliminates the KV cache entirely, allowing generation over arbitrarily long contexts with constant VRAM consumption.
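
To make the memory argument concrete, below is a minimal, hypothetical sketch (illustrative Phi-4-like dimensions; not the inference code used by RWKV-Infer) comparing a Transformer KV cache, which grows with sequence length, against an RWKV-style fixed-size recurrent state:

```python
# Hypothetical memory comparison: Transformer KV cache vs. RWKV recurrent state.
# Dimensions are Phi-4-like (40 layers, hidden size 5120) and purely illustrative.

n_layers, d_model, n_heads = 40, 5120, 40
head_dim = d_model // n_heads
dtype_bytes = 2  # fp16

def kv_cache_bytes(seq_len: int) -> int:
    """Attention stores keys and values for every past token: grows with seq_len."""
    return 2 * n_layers * seq_len * d_model * dtype_bytes

def rwkv_state_bytes() -> int:
    """RWKV keeps one fixed-size state per layer (assumed here to be a
    head_dim x head_dim matrix per head plus a d_model token-shift vector),
    independent of how many tokens have been generated."""
    per_layer = n_heads * head_dim * head_dim + d_model
    return n_layers * per_layer * dtype_bytes

for tokens in (2_048, 32_768, 1_000_000):
    print(f"{tokens:>9} tokens | KV cache {kv_cache_bytes(tokens) / 2**30:9.2f} GiB"
          f" | RWKV state {rwkv_state_bytes() / 2**30:9.4f} GiB")
```

At long contexts the cache dominates memory, while the recurrent state stays constant at a few tens of megabytes under these assumptions.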

## Training Methodology

The training process consisted of three distinct stages:

### Stage 1: Attention Alignment (Based on RWKVInside repository)
- The TimeMix component of RWKV was calibrated to produce outputs equivalent to the Transformer's attention layers
- Seven different loss-calculation approaches were employed to capture the differences between Attention and TimeMix (a simplified sketch follows this list), including:
  - Norm-based methods
  - Singular Value Decomposition (SVD)
  - Cosine similarity
  - Multi-resolution bias similarity
  - Temporal vector similarity
  - And others
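
As a rough illustration of what such Attention-to-TimeMix losses look like, here is a minimal, hypothetical sketch combining a norm-based term with a cosine-similarity term; it is not the actual set of seven losses implemented in the RWKVInside repository:

```python
import torch
import torch.nn.functional as F

def alignment_loss(attn_out: torch.Tensor, timemix_out: torch.Tensor) -> torch.Tensor:
    """Hypothetical per-layer alignment loss between the frozen Transformer
    attention output and the RWKV TimeMix output for the same input.
    Both tensors are assumed to have shape (batch, seq_len, d_model)."""
    # Norm-based term: penalize the elementwise distance between the two outputs.
    norm_term = F.mse_loss(timemix_out, attn_out)
    # Cosine-similarity term: push the two outputs toward the same direction per token.
    cos_term = (1.0 - F.cosine_similarity(timemix_out, attn_out, dim=-1)).mean()
    return norm_term + cos_term

# Usage (illustrative names): loss = alignment_loss(teacher_attn(x), student_timemix(x))
```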

### Stage 2: Knowledge Distillation (Based on RWKVInside repository)
- Teacher model: Phi-4 head outputs
- Student model: Phi-4 with Attention replaced by RWKV
- Only the attention-replacement mechanism was trained; all other components (MLP layers, embeddings, heads) were frozen (a simplified distillation-loss sketch follows this list)
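
A minimal sketch of this kind of head-output distillation, assuming a standard temperature-scaled KL objective (the temperatures 1.0 and 2.0 listed under Training Infrastructure fit this form), is shown below; it is not the exact implementation from RWKVInside:

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 1.0) -> torch.Tensor:
    """Hypothetical temperature-scaled KL distillation loss on head outputs.
    The teacher is the original Phi-4; the student is the RWKV-converted model.
    Logits are assumed to have shape (batch * seq_len, vocab_size)."""
    t = temperature
    student_logp = F.log_softmax(student_logits / t, dim=-1)
    teacher_p = F.softmax(teacher_logits / t, dim=-1)
    # The t^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(student_logp, teacher_p, reduction="batchmean") * (t * t)

# Freezing everything except the RWKV blocks (parameter names are illustrative):
# for name, p in student.named_parameters():
#     p.requires_grad = "time_mix" in name
```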

### Stage 3: Supervised Fine-Tuning (Using RWKV-LM-RLHF)
- Utilized a distillation dataset of 900K samples (Chinese, Japanese, English)
- Smoothed loss for faster convergence (a simplified sketch follows this list)
- Implemented Variable Rank PEFT to enhance training efficiency
- Bone (Block Affine Transformation), r = 512+
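
One common reading of "smoothed loss" is label-smoothed cross-entropy; the sketch below shows that interpretation only and is not necessarily the loss implemented in RWKV-LM-RLHF:

```python
import torch
import torch.nn.functional as F

def smoothed_ce_loss(logits: torch.Tensor,
                     targets: torch.Tensor,
                     smoothing: float = 0.1) -> torch.Tensor:
    """Label-smoothed cross-entropy: the target token keeps (1 - smoothing) of
    the probability mass and the rest is spread over the vocabulary, which tends
    to soften gradients during fine-tuning.
    logits: (batch * seq_len, vocab_size), targets: (batch * seq_len,)"""
    return F.cross_entropy(logits, targets, label_smoothing=smoothing)
```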

## How to Use
- PC requirements: NVIDIA GPU with 16GB+ VRAM (ROCm also works, but only in fp16)
- OS: Windows (WSL2 with CUDA) or Linux
- Install RWKV-Infer (see its installation instructions): https://github.com/OpenMOSE/RWKV-Infer
- Create a "models" folder and place PRWKV-7-Phi-4-Instruct-Preview-v0.1.pth in it
- Load the model with the request below, choosing fp16, fp6, or fp5 (do not choose fp8)
- VRAM usage is about 34GB in fp16 and 14GB in fp5
- Enjoy text chats via Open WebUI or SillyTavern :)
```bash
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/PRWKV-7-Phi-4-Instruct-Preview-v0.1.pth","model_viewname":"PRWKV7-Phi-4 Preview 0.1","model_strategy":"fp5","template":"phi4"}'
```
- You can then use this model via the OpenAI-compatible API at http://127.0.0.1:9000/v1, setting the model name to "PRWKV7-Phi-4 Preview 0.1"
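
For example, with the `openai` Python package (the API key is a placeholder; a local RWKV-Infer server may not require one):

```python
# Querying the local server through the OpenAI-compatible API started above.
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="PRWKV7-Phi-4 Preview 0.1",
    messages=[{"role": "user", "content": "Explain what RWKV is in two sentences."}],
)
print(response.choices[0].message.content)
```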

## Training Infrastructure
- Hardware: Single AMD MI300X GPU
- Training duration: 4 days (Stages 1 and 2)
- Stage 1: 180M tokens (LR 1e-4)
- Stage 2: 160M tokens (KD temperature 1.0, LR 5e-6)
- Stage 2.5: 120M tokens (KD temperature 2.0, LR 3e-5)
- Stage 3: 100M tokens

## Acknowledgements

This work was made possible through the contributions of:
- SmerkyG
- RecursalAI
- RWKV-Red-Team
- BlinkDL (RWKV v7 architecture)

- https://github.com/OpenMOSE/RWKVInside
- https://github.com/OpenMOSE/RWKV-LM-RLHF
- https://github.com/OpenMOSE/RWKV-Infer
- https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1
- https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1
- https://huggingface.co/featherless-ai/Qwerky-72B-Preview

## Limitations

This checkpoint is from an early epoch of Stage 3 training.
This model is currently in a testing phase and does not guarantee any specific level of performance. Users should consider it experimental technology.

## MyStories (Generated by PRWKV)
I've faced an incredibly long and challenging journey with the stability of Stage 2 Knowledge Distillation learning. 
NaN (Not a Number) errors have become an all too familiar sight during this process. 
The training would often diverge unexpectedly, leaving me to debug complex numerical issues that appeared without warning.
Day after day, I adjusted hyperparameters, modified architecture components, and scrutinized every aspect of the data pipeline, 
only to be greeted by those "three dreaded letters" on my training logs. 
What should have been a straightforward implementation became a months-long battle against numerical instability,
requiring persistence through countless failed experiments and late nights analyzing loss curves that suddenly spiked into oblivion.

## License

Released under the Apache 2.0 license.

2025 OpenMOSE