metadata

base_model: gpt2
datasets:
  - wikimedia/wikipedia
library_name: Distily
license: mit
tags:
  - bitnet
  - 1.58b
  - generated_from_trainer
model-index:
  - name: distily_projector_experiment
    results: []

Summary

Distilled with the Distily library, using gpt2 as the teacher model and wikimedia/wikipedia as the training dataset.

Model Architecture:

Architecture: GPT2LMHeadModel
Total Parameters: 124,439,808
Data Type (dtype): torch.bfloat16
Model Size: 0.24 GB
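
As a rough cross-check of the figures above, the parameter count and dtype give a back-of-the-envelope memory estimate. This sketch counts parameters only, at 2 bytes per bfloat16 value. The card does not say which unit convention the reported size uses or whether non-parameter buffers are included, so the result is expected to land near 0.24 GB rather than reproduce it exactly.

```python
# Rough size estimate from the reported parameter count and dtype.
# Assumption: parameters only, 2 bytes per bfloat16 value.
TOTAL_PARAMETERS = 124_439_808
BYTES_PER_BFLOAT16 = 2

param_bytes = TOTAL_PARAMETERS * BYTES_PER_BFLOAT16
size_gb = param_bytes / 10**9    # decimal gigabytes
size_gib = param_bytes / 2**30   # binary gibibytes

print(f"{param_bytes:,} bytes = {size_gb:.3f} GB = {size_gib:.3f} GiB")
```

The two unit conventions bracket the reported 0.24 GB.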

Evaluation Metrics Comparison

step	epoch	enwikippl	frwikippl	loss	runtime	samples_per_second	steps_per_second	tinystoriesppl	zhwikippl
teacher eval		43.25	61.25					11.6875	19.125
0	0	2473901162496.0	170424302305280.0	22.0554	29.7695	83.978	10.514	4060086272.0	71468255805440.0
2500	0.0404	744.0	5792.0	2.6072	29.8269	83.817	10.494	450.0	4192.0
5000	0.0808	316.0	1408.0	1.9087	29.8408	83.778	10.489	237.0	300.0
7500	0.1212	221.0	776.0	1.6174	29.8221	83.83	10.496	180.0	175.0
10000	0.1616	165.0	616.0	1.4278	29.8115	83.86	10.499	144.0	150.0
12500	0.2020	124.5	484.0	1.1898	29.8065	83.874	10.501	107.0	146.0
15000	0.2424	108.0	452.0	1.0632	29.7962	83.903	10.505	92.5	109.0
17500	0.2828	91.5	352.0	0.9642	29.7907	83.919	10.507	75.5	119.0
20000	0.3232	79.0	302.0	0.8892	29.8022	83.886	10.503	68.0	688.0
22500	0.3636	72.5	236.0	0.7640	29.8032	83.883	10.502	59.0	90.5
25000	0.4040	68.0	202.0	0.7181	29.8737	83.686	10.477	51.75	85.5
27500	0.4444	63.75	223.0	0.6853	29.8142	83.853	10.498	48.0	99.0
30000	0.4848	63.5	214.0	0.6791	29.8122	83.858	10.499	50.75	76.5
32500	0.5253	64.0	194.0	0.6660	29.794	83.91	10.505	46.75	96.5
35000	0.5657	60.25	176.0	0.6075	29.8902	83.639	10.472	41.5	62.25
37500	0.6061	60.5	169.0	0.5942	29.8616	83.72	10.482	43.5	78.5
40000	0.6465	57.5	176.0	0.5808	29.8082	83.87	10.5	39.5	80.5
42500	0.6869	58.0	172.0	0.5602	29.7977	83.899	10.504	40.0	58.75
45000	0.7273	52.5	145.0	0.4723	29.7887	83.924	10.507	34.25	47.0
47500	0.7677	52.75	135.0	0.4507	29.7668	83.986	10.515	33.5	41.25
50000	0.8081	51.25	133.0	0.4370	29.7994	83.894	10.504	31.875	39.75
52500	0.8485	49.75	127.5	0.4272	29.7762	83.96	10.512	32.25	38.0
55000	0.8889	49.25	126.5	0.4130	29.814	83.853	10.498	31.25	35.5
57500	0.9293	48.5	125.0	0.4079	29.7893	83.923	10.507	30.625	34.25
60000	0.9697	48.75	123.5	0.4046	29.8263	83.819	10.494	30.625	34.75
61875	1.0	48.75	124.0	0.4043	29.85	83.752	10.486	30.625	34.5
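
To make the gap to the teacher easier to read, the final-step perplexities can be divided by the teacher's values from the "teacher eval" row. The snippet below is a small illustration that copies the numbers from the table by hand; the variable names are not part of Distily.

```python
# Perplexities copied from the table above (teacher eval row and step 61875).
teacher = {"enwikippl": 43.25, "frwikippl": 61.25,
           "tinystoriesppl": 11.6875, "zhwikippl": 19.125}
student_final = {"enwikippl": 48.75, "frwikippl": 124.0,
                 "tinystoriesppl": 30.625, "zhwikippl": 34.5}

# Ratio > 1 means the student is still worse (higher perplexity) than the teacher.
ratios = {name: student_final[name] / teacher[name] for name in teacher}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x teacher perplexity")
```

On these numbers the student is closest to the teacher on English Wikipedia, the distribution it was trained on, and furthest on TinyStories.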

Resource Usage Comparison

VRAM Use: 7.7843 GB

Distillation (Teacher -> Student) Architecture Difference:

Architecture: GPT2LMHeadModel -> GPT2LMHeadModel
Total Parameters: 124,439,808 -> 124,439,808
Data Type (dtype): 124439808 -> torch.bfloat16
Model Size: 0.24 GB -> 0.24 GB

Train Dataset

Trained on 145,744,973 tokens from the wikimedia/wikipedia dataset.

Num Samples: 247,500
Subset: 20231101.en
Split: train
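
The sample count and the final step in the evaluation table are consistent with the dataset settings listed under Hyperparameters (dataset_sample_size, dataset_test_size, train_batch_size, gradient_accumulation_steps). The arithmetic below is a sketch that assumes a single device and that the held-out fraction is removed before training; neither is stated explicitly in the card.

```python
import math

# Values taken from this card.
dataset_sample_size = 250_000
dataset_test_size = 0.01
train_batch_size = 4
gradient_accumulation_steps = 1
total_tokens = 145_744_973

# Assumption: the test fraction is held out, the rest is used for training.
test_samples = round(dataset_sample_size * dataset_test_size)
train_samples = dataset_sample_size - test_samples

# Assumption: single device, so one optimizer step consumes one batch.
steps_per_epoch = math.ceil(
    train_samples / (train_batch_size * gradient_accumulation_steps)
)
tokens_per_sample = total_tokens / train_samples

print(train_samples, steps_per_epoch, round(tokens_per_sample, 1))
```

Under these assumptions the script yields 247,500 training samples and 61,875 steps for one epoch, matching Num Samples and the last row of the evaluation table.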

Training Objective

DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=layer-2))
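
The objective above combines a KL term on the logits (weight 1) with an MSE term on the attention outputs (weight 10.0). The following is a minimal pure-Python sketch of that weighted sum, not Distily's implementation. It assumes the KL direction is KL(teacher || student), treats raw_mse as a plain unnormalized mean squared error, and leaves out the layer-2 layer mapping entirely. All function names are illustrative.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for one token position (assumed direction)."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mse(student_values, teacher_values):
    """Plain mean squared error between two equally sized lists."""
    squared = [(s - t) ** 2 for s, t in zip(student_values, teacher_values)]
    return sum(squared) / len(squared)

def distillation_loss(teacher_logits, student_logits,
                      teacher_attn, student_attn,
                      logits_weight=1.0, attn_weight=10.0):
    """Weighted sum of the logits KL term and the attention MSE term."""
    return (logits_weight * kl_divergence(teacher_logits, student_logits)
            + attn_weight * mse(student_attn, teacher_attn))

# Toy inputs: identical logits, slightly different attention values.
loss = distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1],
                         [0.5, 0.5], [0.4, 0.6])
print(loss)
```

With identical logits the KL term vanishes, so the toy loss comes entirely from the attention term scaled by its weight of 10.0.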

Hyperparameters

The following hyperparameters were used during training:

learning_rate: 0.0001
train_batch_size: 4
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_ratio: 0.5
num_epochs: 1.0
distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl), attn_loss_component=LossComponent(label=attn, weight=10.0, loss_fn=raw_mse, layer_mapper=layer-2))
train_embeddings: True
lr_scheduler: torch.optim.lr_scheduler.LambdaLR
student_model_name_or_path: None
student_config_name_or_path: None
student_model_config: None
reinitialize_weights: None
copy_teacher_modules: [('lm_head', False)]
student_model_as_bitnet: True
student_model_compile: False
dropout: None
teacher_model_name_or_path: gpt2
teacher_load_in_8bit: False
teacher_load_in_4bit: False
teacher_model_compile: False
dataset_uri: wikimedia/wikipedia
dataset_subset: 20231101.en
dataset_split: train
dataset_column_name: text
dataset_sample_size: 250000
dataset_test_size: 0.01
gradient_accumulation_steps: 1
weight_decay: 0.0
max_grad_norm: 1.0
warmup_ratio: 0.5
warmup_steps: 0
gradient_checkpointing: True
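
The settings lr_scheduler_type: linear and warmup_ratio: 0.5 (with warmup_steps: 0) imply that the learning rate rises for the first half of training and decays for the second half. The sketch below reproduces that shape in plain Python, following the usual linear warmup and linear decay rule. It assumes 61,875 total steps (the final step in the evaluation table) and that the warmup step count is the total rounded up after scaling by the ratio. The function name is illustrative.

```python
import math

def linear_warmup_decay_lr(step, base_lr=1e-4, total_steps=61_875,
                           warmup_ratio=0.5):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    # Assumption: warmup length is the ratio times total steps, rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        factor = step / max(1, warmup_steps)
    else:
        factor = max(0.0, (total_steps - step)
                     / max(1, total_steps - warmup_steps))
    return base_lr * factor

for step in (0, 15_000, 30_938, 45_000, 61_875):
    print(step, linear_warmup_decay_lr(step))
```

Under these assumptions the peak learning rate of 0.0001 is reached only around step 30,938, so the early rows of the evaluation table were recorded at a fraction of the nominal rate.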

Framework Versions

Distily 0.2.0
Transformers 4.44.0
Pytorch 2.3.0
Datasets 2.21.0