---
base_model: gpt2
library_name: Distily
license: mit
tags:
  - generated_from_trainer
model-index:
  - name: distily_bench_obj_cross_v2.10_gpt2
    results: []
---

# distily_bench_obj_cross_v2.10_gpt2

This student model was distilled from the teacher model gpt2; the distillation dataset is unspecified in this card.

The Distily library was used for this distillation.
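
The checkpoint loads like any other transformers causal LM. A minimal usage sketch, assuming the model is published under the repo id lapp0/distily_bench_obj_cross_v2.10_gpt2 (the exact Hub path is an assumption):

```python
# Minimal usage sketch. The repo id below is an assumption based on the
# model name; substitute the actual Hub path if it differs.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_bench_obj_cross_v2.10_gpt2"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Distillation compresses a teacher model", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```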

It achieves the following results on the evaluation set:

- eval_enwikippl: 452.9807
- eval_frwikippl: 741.6703
- eval_zhwikippl: 169.7969
- eval_tinystoriesppl: 694.5760
- eval_loss: 1.2502
- eval_runtime: 21.1964
- eval_samples_per_second: 47.178
- eval_steps_per_second: 11.794
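
The exact evaluation script is not included in this card. As an illustration of how perplexity metrics such as eval_enwikippl are conventionally computed (exp of the mean token-level cross-entropy), here is a minimal sketch; the repo id and the per-text truncation are assumptions:

```python
# Hypothetical perplexity evaluation: exp of the mean negative log-likelihood
# over a text corpus. The dataset handling is an assumption; the card does
# not specify the exact evaluation procedure.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "lapp0/distily_bench_obj_cross_v2.10_gpt2"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id).eval()

@torch.no_grad()
def perplexity(texts, max_length=1024):
    nll_sum, n_tokens = 0.0, 0
    for text in texts:
        ids = tokenizer(text, return_tensors="pt",
                        truncation=True, max_length=max_length)["input_ids"]
        # With labels=input_ids the model shifts internally and returns the
        # mean cross-entropy over the (length - 1) predicted positions.
        loss = model(input_ids=ids, labels=ids).loss
        n = ids.size(1) - 1
        nll_sum += loss.item() * n
        n_tokens += n
    return math.exp(nll_sum / n_tokens)

print(perplexity(["Some held-out evaluation text."]))
```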

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:

- distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None)) (only the logits KL term has nonzero weight; see the sketch after this list)
- train_embeddings: True
- learning_rate: 1e-05
- train_batch_size: 1
- eval_batch_size: 4
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 1.0
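
Per the distillation_objective above, only the logits component is active (weight 1, KL loss); the hidden-state and attention components have weight 0. Below is a minimal sketch of such a forward-KL logits loss; Distily's actual implementation (e.g. any temperature scaling or masking) may differ:

```python
# Sketch of the active loss term: KL(teacher || student) over the vocabulary,
# averaged across token positions. Illustrative reimplementation, not
# Distily's internal code.
import torch
import torch.nn.functional as F

def logits_kl_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    # Flatten (batch, seq, vocab) -> (batch*seq, vocab) so "batchmean"
    # averages the summed KL over all token positions.
    vocab = student_logits.size(-1)
    s = F.log_softmax(student_logits.reshape(-1, vocab), dim=-1)
    t = F.log_softmax(teacher_logits.reshape(-1, vocab), dim=-1)
    # log_target=True: the target is also given as log-probabilities.
    return F.kl_div(s, t, log_target=True, reduction="batchmean")

# Toy shapes: (batch=2, seq=8, vocab=50257), as for GPT-2.
loss = logits_kl_loss(torch.randn(2, 8, 50257), torch.randn(2, 8, 50257))
```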

## Resource Usage

Peak GPU Memory: 3.9285 GB
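
Peak-memory figures like this are typically read from PyTorch's CUDA allocator statistics; a minimal sketch (whether Distily reports allocated or reserved bytes, and GB vs. GiB, is an assumption):

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run the distillation / evaluation loop here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3  # peak allocated, in GiB
print(f"Peak GPU Memory: {peak_gb:.4f} GB")
```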

## Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| teacher eval | | 270.2348 | 76.8142 | | | | | 671.1238 | 22.8030 |
| 0 | 0 | 120078.375 | 1867851235328.0 | 18.7920 | 21.1643 | 47.249 | 11.812 | 72.8770 | 4013754155008.0 |
| 5000 | 0.0505 | 399.8896 | 1364.9200 | 1.5750 | 21.223 | 47.119 | 11.78 | 430.3431 | 486.9932 |
| 10000 | 0.1010 | 366.0540 | 968.1008 | 1.4975 | 21.2542 | 47.05 | 11.762 | 410.0440 | 300.9413 |
| 15000 | 0.1515 | 382.6534 | 990.8644 | 1.4377 | 21.1883 | 47.196 | 11.799 | 455.3961 | 243.5069 |
| 20000 | 0.2020 | 372.0864 | 985.5745 | 1.4590 | 21.2537 | 47.051 | 11.763 | 430.2186 | 317.8063 |
| 25000 | 0.2525 | 459.8662 | 802.9102 | 1.3109 | 21.2174 | 47.131 | 11.783 | 674.2657 | 183.8540 |
| 30000 | 0.3030 | 452.4371 | 822.7448 | 1.2777 | 21.2291 | 47.105 | 11.776 | 674.3492 | 162.7067 |
| 35000 | 0.3535 | 476.7241 | 805.2602 | 1.2741 | 21.2169 | 47.132 | 11.783 | 736.0758 | 174.6150 |
| 40000 | 0.4040 | 453.2438 | 770.2305 | 1.2733 | 21.1947 | 47.181 | 11.795 | 679.9471 | 163.0870 |
| 45000 | 0.4545 | 460.5169 | 781.4591 | 1.2687 | 21.2116 | 47.144 | 11.786 | 700.2546 | 183.2052 |
| 50000 | 0.5051 | 479.0564 | 794.0530 | 1.2632 | 21.229 | 47.105 | 11.776 | 743.4755 | 181.4419 |
| 55000 | 0.5556 | 471.3993 | 748.4656 | 1.2630 | 21.215 | 47.137 | 11.784 | 731.375 | 172.6117 |
| 60000 | 0.6061 | 446.4142 | 775.7834 | 1.2687 | 21.1528 | 47.275 | 11.819 | 669.1851 | 164.7928 |
| 65000 | 0.6566 | 455.8672 | 744.0773 | 1.2538 | 21.2207 | 47.124 | 11.781 | 698.6068 | 164.4469 |
| 70000 | 0.7071 | 453.5074 | 740.2094 | 1.2513 | 21.3501 | 46.838 | 11.71 | 697.8277 | 168.6457 |
| 75000 | 0.7576 | 450.4874 | 723.2042 | 1.2535 | 21.2028 | 47.164 | 11.791 | 685.8463 | 167.9272 |
| 80000 | 0.8081 | 455.6377 | 745.9662 | 1.2523 | 21.2178 | 47.13 | 11.783 | 701.7324 | 170.4892 |
| 85000 | 0.8586 | 447.3922 | 746.4918 | 1.2509 | 21.2165 | 47.133 | 11.783 | 681.8325 | 168.7976 |
| 90000 | 0.9091 | 453.0859 | 740.9397 | 1.2505 | 21.1987 | 47.173 | 11.793 | 696.0992 | 169.7290 |
| 95000 | 0.9596 | 451.3083 | 741.0439 | 1.2504 | 21.5668 | 46.368 | 11.592 | 690.2544 | 169.7969 |
| 99000 | 1.0 | 452.9807 | 741.6703 | 1.2502 | 21.1964 | 47.178 | 11.794 | 694.5760 | 169.7969 |

## Framework versions

- Distily 0.2.0
- Transformers 4.44.0
- PyTorch 2.3.0
- Datasets 2.21.0