	---
	base_model: gpt2
	library_name: distily
	license: mit
	tags:
	- generated_from_trainer
	model-index:
	- name: distily_bench_gpt2_linear_objectives
	results: []
	---

	# distily_bench_gpt2_linear_objectives

	This student model was distilled from the teacher model [gpt2](https://huggingface.co/gpt2). The training dataset is unspecified.

	The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

	It achieves the following results on the evaluation set:
	- eval_enwikippl: 524.7870
	- eval_frwikippl: 3705.5625
	- eval_zhwikippl: 6035.2861
	- eval_loss: 2370.7361
	- eval_runtime: 21.6322
	- eval_samples_per_second: 46.227
	- eval_steps_per_second: 11.557
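
The `*ppl` metric names suggest perplexities measured on English, French, and Chinese Wikipedia text, although the card does not define them. As a minimal sketch of the standard definition (not necessarily how Distily computes these values), perplexity is the exponential of the mean negative log-likelihood per token. The function name and the sample log-probabilities below are hypothetical.

```python
import math


def perplexity(token_log_probs):
    """Return exp(mean negative log-likelihood) for natural-log token probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities a model assigned to four observed tokens.
example_log_probs = [math.log(0.25), math.log(0.5), math.log(0.125), math.log(0.25)]
example_ppl = perplexity(example_log_probs)
print(example_ppl)
```

Under this definition, a model that assigns uniform probability over `V` candidate tokens has perplexity `V`, so lower values mean the model is less "surprised" by the evaluation text.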

	<!-- This model card has been generated automatically according to the information the Trainer had access to. You
	should probably proofread and complete it, then remove this comment.

	## Model description

	More information needed

	## Intended uses & limitations

	More information needed

	## Training and evaluation data

	More information needed
	-->

	## Training procedure

	### Training hyperparameters

	The following hyperparameters were used during training:
	- distillation_objective: `LinearObjective(logits_weight=1, logits_loss_fn=<function kl_divergence_loss at 0x7f57c4b07910>, activations_weight=10, activations_loss_fn=<function kl_divergence_loss at 0x7f57c4b07910>, attentions_weight=0, attentions_loss_fn=<function mse_loss at 0x7f57c4b07880>)`
	- train_embeddings: True
	- learning_rate: 4e-05
	- train_batch_size: 4
	- eval_batch_size: 4
	- seed: 42
	- gradient_accumulation_steps: 4
	- total_train_batch_size: 16
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: constant
	- num_epochs: 1.0
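
The `distillation_objective` entry above reads like a weighted sum of per-feature losses: KL divergence on logits (weight 1), KL divergence on activations (weight 10), and MSE on attentions (weight 0, so effectively disabled). The following is a plain-Python sketch of that idea under stated assumptions. It is not Distily's implementation: the helper names, the teacher-to-student direction of the KL term, the softmax normalization of activations, and the toy vectors are all hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(teacher_values, student_values):
    """KL(teacher || student) after softmax normalization (assumed direction)."""
    p = softmax(teacher_values)
    q = softmax(student_values)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mse(teacher_values, student_values):
    """Mean squared error between two equally sized vectors."""
    pairs = list(zip(teacher_values, student_values))
    return sum((t - s) ** 2 for t, s in pairs) / len(pairs)


def linear_objective(components):
    """Sum weight * loss_fn(teacher, student); zero-weight terms are skipped."""
    total = 0.0
    for weight, loss_fn, teacher, student in components:
        if weight == 0:
            continue
        total += weight * loss_fn(teacher, student)
    return total


# Toy teacher/student features, invented purely for illustration.
teacher_logits, student_logits = [1.0, 2.0, 3.0], [1.5, 1.5, 3.5]
teacher_acts, student_acts = [0.2, -0.1, 0.4], [0.1, 0.0, 0.5]
teacher_attn, student_attn = [0.7, 0.3], [0.6, 0.4]

toy_loss = linear_objective([
    (1, kl_divergence, teacher_logits, student_logits),   # logits_weight=1
    (10, kl_divergence, teacher_acts, student_acts),      # activations_weight=10
    (0, mse, teacher_attn, student_attn),                 # attentions_weight=0
])
print(toy_loss)
```

With these weights the activation term dominates the total whenever the two KL terms are of similar magnitude, and the attention term contributes nothing regardless of how far the student's attention maps drift from the teacher's.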

	### Resource Usage
	Peak GPU Memory: 4.5067 GB

	### Eval-Phase Metrics
	| step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | zhwikippl |
	| --- | --- | --- | --- | --- | --- | --- | --- | --- |
	| teacher eval | | 30.2385 | 57.2728 | | | | | 18.1772 |
	| 0 | 0 | 55339.3672 | 57682.5742 | 31197.1836 | 21.4398 | 46.642 | 11.661 | 57080.2930 |
	| 500 | 0.0808 | 1545.6934 | 7685.4297 | 3209.9360 | 21.4847 | 46.545 | 11.636 | 63830.4023 |
	| 1000 | 0.1616 | 1108.6847 | 5659.8701 | 2933.1360 | 21.4559 | 46.607 | 11.652 | 31166.1797 |
	| 1500 | 0.2424 | 913.3565 | 4893.8623 | 2798.0161 | 21.5956 | 46.306 | 11.576 | 23215.4258 |
	| 2000 | 0.3232 | 813.5310 | 4763.6436 | 2700.0161 | 21.635 | 46.221 | 11.555 | 22568.9238 |
	| 2500 | 0.4040 | 747.3608 | 4565.6851 | 2631.0720 | 21.5442 | 46.416 | 11.604 | 18090.1602 |
	| 3000 | 0.4848 | 711.6094 | 4255.0127 | 2579.2639 | 21.7116 | 46.058 | 11.515 | 16199.8096 |
	| 3500 | 0.5657 | 666.4665 | 4117.3369 | 2530.9441 | 21.5886 | 46.321 | 11.58 | 16435.1426 |
	| 4000 | 0.6465 | 638.0192 | 4058.8262 | 2500.0801 | 21.4712 | 46.574 | 11.643 | 16069.4648 |
	| 4500 | 0.7273 | 597.0923 | 4013.0125 | 2459.4241 | 21.7093 | 46.063 | 11.516 | 12965.0762 |
	| 5000 | 0.8081 | 567.6912 | 3822.9963 | 2424.4800 | 21.5309 | 46.445 | 11.611 | 10275.5850 |
	| 5500 | 0.8889 | 548.5159 | 3864.8674 | 2399.5359 | 21.6408 | 46.209 | 11.552 | 8114.6914 |
	| 6000 | 0.9697 | 539.3817 | 3793.8606 | 2379.3601 | 21.5636 | 46.374 | 11.594 | 6467.9736 |
	| 6187 | 0.9999 | 524.7870 | 3705.5625 | 2370.7361 | 21.6322 | 46.227 | 11.557 | 6035.2861 |
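
One way to read the table is to compare the final student row (step 6187) against the `teacher eval` row. The sketch below divides each student perplexity by the teacher's, using only values copied from the table; the dictionary and function names are hypothetical, and the ratio is just one possible summary, not a metric reported by Distily.

```python
# Perplexities copied from the "teacher eval" row and the final row (step 6187).
teacher = {"enwikippl": 30.2385, "frwikippl": 57.2728, "zhwikippl": 18.1772}
student_final = {"enwikippl": 524.7870, "frwikippl": 3705.5625, "zhwikippl": 6035.2861}


def perplexity_ratios(student, reference):
    """Return student / reference perplexity for every metric in the reference."""
    return {name: student[name] / reference[name] for name in reference}


ratios = perplexity_ratios(student_final, teacher)
for name, ratio in sorted(ratios.items()):
    print(f"{name}: student perplexity is {ratio:.1f}x the teacher's")
```

The ratios show that after one epoch the student is closest to the teacher on English text and furthest on Chinese text, even though `zhwikippl` is still falling steeply in the last few evaluation steps.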

	### Framework versions
	- Distily 0.2.0
	- Transformers 4.44.0
	- PyTorch 2.3.0
	- Datasets 2.20.0