---
base_model: gpt2
library_name: distily
license: mit
tags:
- generated_from_trainer
model-index:
- name: distily_bench_gpt2_linear_objectives
  results: []
---

# distily_bench_gpt2_linear_objectives

This student model is distilled from the teacher model [gpt2](https://huggingface.co/gpt2); the distillation dataset is unspecified.

The [Distily](https://github.com/lapp0/distily) library was used for this distillation.
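
Since the student retains the GPT-2 architecture, it loads like any `transformers` causal language model. A minimal usage sketch; the repo id below is a placeholder (an assumption, not the confirmed Hub path of this checkpoint):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repo id (assumption); substitute the actual Hub path of this checkpoint.
repo_id = "distily_bench_gpt2_linear_objectives"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(repo_id)

inputs = tokenizer("Knowledge distillation is", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```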

It achieves the following results on the evaluation set:
- eval_enwikippl: 527.0228
- eval_frwikippl: 3796.0032
- eval_zhwikippl: 4795.4683
- eval_loss: 2376.6721
- eval_runtime: 21.817
- eval_samples_per_second: 45.836
- eval_steps_per_second: 11.459
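
The three `*wikippl` metrics are perplexities on English, French, and Chinese Wikipedia text. For intuition, perplexity is the exponential of the mean token-level cross-entropy; a rough illustration follows, not Distily's actual evaluation code, which may differ in windowing and aggregation:

```python
import torch

def perplexity(model, tokenizer, text: str) -> float:
    """exp(mean cross-entropy) of `text` under a causal LM."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # For causal LMs, passing labels=input_ids computes the
        # shifted next-token cross-entropy loss internally.
        out = model(**enc, labels=enc["input_ids"])
    return torch.exp(out.loss).item()
```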

## Model description

A distilled student of [gpt2](https://huggingface.co/gpt2), retaining the GPT-2 architecture and trained with Distily's `LinearObjective` (see the training hyperparameters below).

## Intended uses & limitations

More information needed

## Training and evaluation data

The distillation dataset is unspecified. Evaluation reports perplexity on English, French, and Chinese Wikipedia text (`enwikippl`, `frwikippl`, `zhwikippl`).

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- distillation_objective: LinearObjective(logits_weight=1, logits_loss_fn=kl_divergence_loss, activations_weight=1, activations_loss_fn=kl_divergence_loss, attentions_weight=0, attentions_loss_fn=mse_loss) (the KL term is sketched after this list)
- train_embeddings: True
- learning_rate: 4e-05
- train_batch_size: 4
- eval_batch_size: 4
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 16
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: constant
- num_epochs: 1.0
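
The `LinearObjective` is a weighted sum of per-component losses: KL divergence on logits and on hidden-state activations (weight 1 each), with the attention MSE term disabled at weight 0. A minimal PyTorch sketch of the KL term, assuming temperature 1 and batch-mean reduction; Distily's actual `kl_divergence_loss` may differ in these details:

```python
import torch
import torch.nn.functional as F

def kl_divergence_loss(student_logits: torch.Tensor,
                       teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary.

    Expected shapes: (batch, seq_len, vocab_size).
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    # 'batchmean' sums the KL over sequence positions and vocabulary,
    # then divides by the batch size.
    return F.kl_div(student_logp, teacher_probs, reduction="batchmean")
```

Under the weights above, the overall objective is this term applied to the logits plus the same term applied to the activations; the attention term drops out at weight 0.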

### Resource Usage

Peak GPU Memory: 4.5067 GB

### Eval-Phase Metrics

| step | epoch | enwikippl | frwikippl | loss | runtime (s) | samples/s | steps/s | zhwikippl |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| **teacher eval** | | 30.2385 | 57.2728 | | | | | 18.1772 |
| 0 | 0 | 55339.3672 | 57682.5742 | 31197.1836 | 21.7082 | 46.065 | 11.516 | 57080.2930 |
| 500 | 0.0808 | 1509.5735 | 7497.0439 | 3194.9919 | 21.4587 | 46.601 | 11.65 | 50589.3438 |
| 1000 | 0.1616 | 1083.2607 | 5620.3037 | 2923.7439 | 21.5879 | 46.322 | 11.581 | 29616.2285 |
| 1500 | 0.2424 | 906.6083 | 4937.0078 | 2796.2080 | 21.6636 | 46.16 | 11.54 | 21403.5996 |
| 2000 | 0.3232 | 813.4678 | 4877.3267 | 2706.0481 | 21.5303 | 46.446 | 11.612 | 20010.4863 |
| 2500 | 0.4040 | 750.0352 | 4512.8765 | 2636.6079 | 21.6059 | 46.284 | 11.571 | 16546.3457 |
| 3000 | 0.4848 | 704.7218 | 4373.6377 | 2583.7920 | 21.6069 | 46.281 | 11.57 | 14758.0859 |
| 3500 | 0.5657 | 667.2821 | 4153.7866 | 2537.5520 | 21.59 | 46.318 | 11.579 | 14131.2881 |
| 4000 | 0.6465 | 635.3494 | 4060.9749 | 2505.6001 | 21.554 | 46.395 | 11.599 | 13081.5996 |
| 4500 | 0.7273 | 605.6495 | 4037.2766 | 2468.9121 | 21.795 | 45.882 | 11.471 | 11453.9658 |
| 5000 | 0.8081 | 573.4954 | 3881.2524 | 2437.7439 | 21.6801 | 46.125 | 11.531 | 8931.2441 |
| 5500 | 0.8889 | 557.2740 | 3918.3730 | 2413.4880 | 21.5054 | 46.5 | 11.625 | 6643.0454 |
| 6000 | 0.9697 | 549.7523 | 4035.1443 | 2392.2400 | 21.6194 | 46.255 | 11.564 | 5330.4404 |
| 6187 | 0.9999 | 527.0228 | 3796.0032 | 2376.6721 | 21.817 | 45.836 | 11.459 | 4795.4683 |

### Framework versions
- Distily 0.2.0
- Transformers 4.44.0
- PyTorch 2.3.0
- Datasets 2.20.0