|
--- |
|
license: apache-2.0 |
|
tags: |
|
- RWKV |
|
base_model: |
|
- microsoft/phi-4 |
|
--- |
|
# PRWKV-7-Phi-4-Instruct-Preview-v0.1 Model Card |
|
|
|
 |
|
|
|
## Model Overview |
|
|
|
PRWKV-7-Phi-4-Instruct is a 16.3-billion-parameter large language model built on the RNN-based RWKV-x070 architecture. Its distinctive feature is that it replaces the attention mechanism of Microsoft's Transformer-based Phi-4 14B with RWKV's recurrent time-mixing approach.
|
|
|
## Technical Specifications |
|
|
|
- **Architecture**: RWKV-x070 "Goose" (RNN-based), https://github.com/BlinkDL/RWKV-LM

- **Parameters**: 16.3 billion (L40 D5120 RWKV TimeMix + D17920 SwiGLU MLP, i.e. 40 layers with hidden size 5120 and a 17920-dimensional SwiGLU MLP)

- **Training Context Window**: 12288 (Stage 1 = 2560, Stage 2 = 8192, Stage 3 = 12288)

- **Base Model**: Derived from Microsoft Phi-4 14B, https://huggingface.co/microsoft/phi-4
|
- **Development Stage**: Experimental preview (no performance guarantees) |
|
- **License**: Apache 2.0 |
|
|
|
## Key Innovations |
|
|
|
This model builds upon and refines the attention-replacement approaches pioneered by several notable projects, including:

- Qwerky (Qwen 2.5 72B + QRWKV7 architecture)

- QRWKV (Qwen 2.5 32B + QRWKV6 architecture)

- ARWKV (Qwen 2.5 1.5B-7B + RWKV v7 architecture)
|
|
|
The primary advantage of the RWKV architecture is that it eliminates the KV cache: the recurrent state has a fixed size, so generation can continue over arbitrarily long contexts while VRAM consumption stays constant.
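
As a rough illustration of why no KV cache is needed, the sketch below contrasts the two decoding styles. It is a toy stand-in only: the dimensions and the linear state update are placeholders, not the actual RWKV-7 WKV kernel.

```python
# Toy illustration only: a stand-in linear recurrence, not the real RWKV-7
# WKV kernel. It shows why decoding needs no growing KV cache.
import torch

d = 64  # toy hidden size (the real model uses 5120)

# RWKV-style decoding keeps one fixed-size state per layer and updates it
# in place, so VRAM stays constant no matter how long generation runs.
state = torch.zeros(d, d)

def recurrent_step(state: torch.Tensor, x_t: torch.Tensor):
    state = 0.99 * state + torch.outer(x_t, x_t)  # decay old state, mix in new token
    y_t = state @ x_t                             # read out for this token
    return state, y_t

for _ in range(10_000):                           # memory footprint never changes
    state, y_t = recurrent_step(state, torch.randn(d))

# A Transformer decoder would instead append to a cache every step, e.g.
# kv_cache.append((k_t, v_t)), so memory grows linearly with context length.
```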
|
|
|
## Training Methodology |
|
|
|
The training process consisted of three distinct stages: |
|
|
|
### Stage 1: Attention Alignment (Based on the RWKVInside repository)
|
- The TimeMix component of RWKV was calibrated to produce outputs equivalent to those of the Transformer's attention layers

- Seven different loss calculation approaches were employed to capture the differences between Attention and TimeMix, including the following (a simplified example appears after this list):

  - Norm-based methods

  - Singular Value Decomposition (SVD)

  - Cosine similarity

  - Multi-resolution bias similarity

  - Temporal vector similarity

  - And others
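
The exact combination of these terms lives in the RWKVInside repository; the fragment below is only a minimal sketch, assuming per-layer activations of shape (batch, seq_len, hidden), of how two of the listed terms (a norm-based term and a cosine-similarity term) might be combined into a single alignment loss.

```python
import torch
import torch.nn.functional as F

def alignment_loss(attn_out: torch.Tensor, timemix_out: torch.Tensor) -> torch.Tensor:
    """attn_out, timemix_out: (batch, seq_len, hidden) activations of one layer."""
    # Norm-based term: mean squared error between the two representations.
    l2_term = F.mse_loss(timemix_out, attn_out)
    # Cosine term: 1 - cosine similarity, averaged over all token positions.
    cos_term = (1.0 - F.cosine_similarity(timemix_out, attn_out, dim=-1)).mean()
    return l2_term + cos_term

# Dummy activations: the frozen attention output is treated as a fixed target.
attn = torch.randn(2, 16, 5120)
tmix = torch.randn(2, 16, 5120, requires_grad=True)
alignment_loss(attn, tmix).backward()
```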
|
|
|
### Stage 2: Knowledge Distillation (Based on the RWKVInside repository)
|
- Teacher model: the original Phi-4 (output logits from its head)

- Student model: Phi-4 with attention replaced by RWKV TimeMix

- Only the RWKV replacement for attention was trained; all other components (MLP layers, embeddings, heads) were frozen (a sketch of the distillation loss follows this list)
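
A minimal sketch of the temperature-scaled distillation loss this stage implies (the temperatures 1.0 and 2.0 appear under Training Infrastructure below; the exact loss used in the RWKVInside code may differ) could look like this:

```python
# Sketch only: temperature-scaled KD from frozen teacher logits to student
# logits. Shapes and the vocabulary size are illustrative assumptions.
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor,
            teacher_logits: torch.Tensor,
            temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) on temperature-softened distributions.
    Both logits tensors have shape (batch, seq_len, vocab)."""
    t = temperature
    vocab = student_logits.size(-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1).reshape(-1, vocab)
    p_teacher = F.softmax(teacher_logits / t, dim=-1).reshape(-1, vocab)
    # Rescale by t^2 so gradient magnitude stays comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Dummy example (tiny vocabulary, just for the demo)
student = torch.randn(2, 8, 1000, requires_grad=True)
teacher = torch.randn(2, 8, 1000)
kd_loss(student, teacher, temperature=1.0).backward()
```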
|
|
|
### Stage 3: Supervised Fine-Tuning (Using RWKV-LM-RLHF) |
|
- Utilized a distillation dataset of 900K samples (Chinese, Japanese, English)

- Smoothed Loss for faster convergence

- Implemented Variable Rank PEFT to enhance training efficiency (see the sketch after this list)

- Bone (Block Affine Transformation), r = 512+
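
Bone's block-affine update is defined in the RWKV-LM-RLHF repository; the sketch below illustrates only the "variable rank" idea, using a generic LoRA-style adapter whose rank is chosen per module. The module names and rank table are hypothetical.

```python
# Illustration of per-module ("variable") adapter ranks. This is a generic
# LoRA-style adapter, not Bone's block-affine update, and the rank table is
# hypothetical.
import torch.nn as nn

class LowRankAdapter(nn.Module):
    def __init__(self, base: nn.Linear, rank: int):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # the base weight stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

# Hypothetical rank assignment: higher rank where more capacity is needed.
rank_table = {"att.receptance": 512, "att.key": 512, "att.value": 512, "ffn.key": 128}

proj = nn.Linear(5120, 5120)
adapted = LowRankAdapter(proj, rank=rank_table["att.key"])
```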
|
|
|
## How to Use |
|
- PC requirements: an NVIDIA GPU with 16GB+ VRAM (ROCm GPUs also work, but only in fp16)

- OS: Windows (WSL2 with CUDA) or Linux

- Install RWKV-Infer (see its installation instructions): https://github.com/OpenMOSE/RWKV-Infer

- Create a "models" folder and place PRWKV-7-Phi-4-Instruct-Preview-v0.1.pth in it

- Load the model with loadmodel, choosing fp16, fp6, or fp5 (do not choose fp8); see the curl example below

- VRAM usage: about 34GB in fp16, 14GB in fp5

- Enjoy text chats via Open WebUI or SillyTavern :)
|
``` |
|
curl http://127.0.0.1:9000/loadmodel -X POST -H "Content-Type: application/json" -d '{"model_filename":"models/PRWKV-7-Phi-4-Instruct-Preview-v0.1.pth","model_viewname":"PRWKV7-Phi-4 Preview 0.1","model_strategy":"fp5","template":"phi4"}' |
|
|
|
``` |
|
- You can then use this model through the OpenAI-compatible API at http://127.0.0.1:9000/v1, with the model name "PRWKV7-Phi-4 Preview 0.1".
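
For example, a minimal chat request from Python with the official openai client pointed at the local server (this assumes RWKV-Infer exposes the standard /v1/chat/completions route; the API key is a placeholder because the client requires one):

```python
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:9000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="PRWKV7-Phi-4 Preview 0.1",
    messages=[{"role": "user", "content": "Explain the RWKV architecture in two sentences."}],
)
print(response.choices[0].message.content)
```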
|
|
|
## Training Infrastructure |
|
- Hardware: Single AMD MI300X GPU |
|
- Training Duration: 4 days (Stages 1 and 2)

- Stage 1: 180M tokens (LR 1e-4)

- Stage 2: 160M tokens (KD temperature 1.0, LR 5e-6)

- Stage 2.5: 120M tokens (KD temperature 2.0, LR 3e-5)

- Stage 3: 100M tokens
|
|
|
## Acknowledgements |
|
|
|
This work was made possible through the contributions of: |
|
- SmerkyG |
|
- RecursalAI |
|
- RWKV-Red-Team |
|
- BlinkDL (RWKV v7 architecture)
|
|
|
- https://github.com/OpenMOSE/RWKVInside |
|
- https://github.com/OpenMOSE/RWKV-LM-RLHF |
|
- https://github.com/OpenMOSE/RWKV-Infer |
|
- https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1 |
|
- https://huggingface.co/recursal/QRWKV6-32B-Instruct-Preview-v0.1 |
|
- https://huggingface.co/featherless-ai/Qwerky-72B-Preview |
|
|
|
## Limitations |
|
|
|
This checkpoint is from an early epoch of Stage 3 training.
|
This model is currently in a testing phase and does not guarantee any specific level of performance. Users should consider it experimental technology. |
|
|
|
## MyStories (Generated by PRWKV)
|
I've faced an incredibly long and challenging journey with the stability of Stage 2 knowledge-distillation learning. NaN (Not a Number) errors have become an all too familiar sight during this process. The training would often diverge unexpectedly, leaving me to debug complex numerical issues that appeared without warning. Day after day, I adjusted hyperparameters, modified architecture components, and scrutinized every aspect of the data pipeline, only to be greeted by those "three dreaded letters" in my training logs. What should have been a straightforward implementation became a months-long battle against numerical instability, requiring persistence through countless failed experiments and late nights analyzing loss curves that suddenly spiked into oblivion.
|
|
|
## License |
|
|
|
Released under the Apache 2.0 license. |
|
|
|
2025 OpenMOSE |