---
base_model: gpt2
library_name: distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_gpt2_linear_objectives
    results: []
---

# distily_bench_gpt2_optim

This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2) on an unspecified dataset.

The Distily library was used for this distillation.
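
As a quick usage reference, the sketch below loads the student model with the Transformers library and samples a short continuation. The repository id is an assumption inferred from the model-index name in the metadata; substitute the actual repo path if it differs.

```python
# Minimal usage sketch. Assumption: the model is hosted under the
# model-index name from the metadata above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lapp0/distily_bench_gpt2_linear_objectives"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```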

It achieves the following results on the evaluation set (a sketch of how these perplexities can be computed follows the list):

- eval_enwikippl: 527.0228
- eval_frwikippl: 3796.0032
- eval_zhwikippl: 4795.4683
- eval_loss: 2376.6721
- eval_runtime: 21.817 (s)
- eval_samples_per_second: 45.836
- eval_steps_per_second: 11.459
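
The `*ppl` metrics are perplexities on English, French, and Chinese Wikipedia text respectively. Below is a minimal sketch of computing perplexity with a causal LM, assuming the standard definition exp(mean token cross-entropy); the exact evaluation corpus and batching Distily uses are not specified in this card.

```python
import torch

def perplexity(model, tokenizer, text: str) -> float:
    """Perplexity of a causal LM on one text sample (standard definition assumed)."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # next-token cross-entropy over the sequence.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return torch.exp(loss).item()

# e.g. perplexity(model, tokenizer, "<held-out enwiki passage>")
```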

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: LinearObjective(logits_weight=1, logits_loss_fn=kl_divergence_loss, activations_weight=1, activations_loss_fn=kl_divergence_loss, attentions_weight=0, attentions_loss_fn=mse_loss) (a sketch of this objective follows the list)
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
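
The sketch below illustrates the shape of the `LinearObjective` above: a weighted sum of a KL-divergence term on logits and a KL-divergence term on hidden-state activations, with the attention term dropped because its weight is 0. This is an assumption-laden illustration, not Distily's implementation; details such as temperature, layer matching, and normalization may differ, and both forward passes are assumed to run with `output_hidden_states=True`.

```python
import torch
import torch.nn.functional as F

def kl_divergence_loss(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
    # KL(teacher || student), treating the last dimension as a distribution.
    return F.kl_div(
        F.log_softmax(student, dim=-1),
        F.softmax(teacher, dim=-1),
        reduction="batchmean",  # summed per distribution, averaged over dim 0
    )

def linear_objective(student_out, teacher_out,
                     logits_weight=1.0, activations_weight=1.0):
    # Logit-matching term.
    loss = logits_weight * kl_divergence_loss(student_out.logits, teacher_out.logits)
    # Activation-matching term, pairing hidden states layer by layer
    # (assumption: student and teacher expose the same number of hidden states).
    for s_h, t_h in zip(student_out.hidden_states, teacher_out.hidden_states):
        loss = loss + activations_weight * kl_divergence_loss(s_h, t_h)
    # attentions_weight=0 in this run, so the MSE attention term is omitted.
    return loss
```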

### Resource Usage

Peak GPU Memory: 4.5067 GB

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples/s | steps/s | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 30.2385 | 57.2728 | | | | | 18.1772 |
| 0 | 0 | 55339.3672 | 57682.5742 | 31197.1836 | 21.7082 | 46.065 | 11.516 | 57080.2930 |
| 500 | 0.0808 | 1509.5735 | 7497.0439 | 3194.9919 | 21.4587 | 46.601 | 11.65 | 50589.3438 |
| 1000 | 0.1616 | 1083.2607 | 5620.3037 | 2923.7439 | 21.5879 | 46.322 | 11.581 | 29616.2285 |
| 1500 | 0.2424 | 906.6083 | 4937.0078 | 2796.2080 | 21.6636 | 46.16 | 11.54 | 21403.5996 |
| 2000 | 0.3232 | 813.4678 | 4877.3267 | 2706.0481 | 21.5303 | 46.446 | 11.612 | 20010.4863 |
| 2500 | 0.4040 | 750.0352 | 4512.8765 | 2636.6079 | 21.6059 | 46.284 | 11.571 | 16546.3457 |
| 3000 | 0.4848 | 704.7218 | 4373.6377 | 2583.7920 | 21.6069 | 46.281 | 11.57 | 14758.0859 |
| 3500 | 0.5657 | 667.2821 | 4153.7866 | 2537.5520 | 21.59 | 46.318 | 11.579 | 14131.2881 |
| 4000 | 0.6465 | 635.3494 | 4060.9749 | 2505.6001 | 21.554 | 46.395 | 11.599 | 13081.5996 |
| 4500 | 0.7273 | 605.6495 | 4037.2766 | 2468.9121 | 21.795 | 45.882 | 11.471 | 11453.9658 |
| 5000 | 0.8081 | 573.4954 | 3881.2524 | 2437.7439 | 21.6801 | 46.125 | 11.531 | 8931.2441 |
| 5500 | 0.8889 | 557.2740 | 3918.3730 | 2413.4880 | 21.5054 | 46.5 | 11.625 | 6643.0454 |
| 6000 | 0.9697 | 549.7523 | 4035.1443 | 2392.2400 | 21.6194 | 46.255 | 11.564 | 5330.4404 |
| 6187 | 0.9999 | 527.0228 | 3796.0032 | 2376.6721 | 21.817 | 45.836 | 11.459 | 4795.4683 |

### Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- Pytorch 2.3.0
- Datasets 2.20.0