lapp0 committed · verified
Commit 2a90c5d · 1 Parent(s): fce8172

End of training

README.md CHANGED
@@ -15,14 +15,14 @@ This student model is distilled from the teacher model [roneneldan/TinyStories-3
  The [Distily](https://github.com/lapp0/distily) library was used for this distillation.

  It achieves the following results on the evaluation set:
- - eval_enwikippl: 7258.4541
- - eval_frwikippl: 30053.4531
- - eval_zhwikippl: 65613.7344
- - eval_tinystoriesppl: 3538.7600
- - eval_loss: 4.9922
- - eval_runtime: 6.4758
- - eval_samples_per_second: 77.211
- - eval_steps_per_second: 9.729
+ - eval_enwikippl: 172.3267
+ - eval_frwikippl: 37035.1875
+ - eval_zhwikippl: 194088.4531
+ - eval_tinystoriesppl: 10.7883
+ - eval_loss: 1.3421
+ - eval_runtime: 6.5001
+ - eval_samples_per_second: 76.922
+ - eval_steps_per_second: 9.692

  <!-- This model card has been generated automatically according to the information the Trainer had access to. You
  should probably proofread and complete it, then remove this comment.
@@ -47,7 +47,7 @@ More information needed
  The following hyperparameters were used during training:
  - distillation_objective: DistillationObjective(logits_loss_component=LossComponent(label=logits, weight=1, loss_fn=kl, layer_mapper=None, projector=None), hs_loss_component=LossComponent(label=hs, weight=0, loss_fn=None, layer_mapper=None, projector=None), attn_loss_component=LossComponent(label=attn, weight=0, loss_fn=None, layer_mapper=None, projector=None))
  - train_embeddings: True
- - learning_rate: 0.0004
+ - learning_rate: 0.004
  - train_batch_size: 8
  - eval_batch_size: 8
  - seed: 42
@@ -56,26 +56,26 @@ The following hyperparameters were used during training:
  - num_epochs: 1.0

  ### Resource Usage
- Peak GPU Memory: 8.0557 GB
+ Peak GPU Memory: 8.0568 GB

  ### Eval-Phase Metrics
  | step | epoch | enwikippl | frwikippl | loss | runtime | samples_per_second | steps_per_second | tinystoriesppl | zhwikippl |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | **teacher eval** | | 169.9865 | 47377.9414 | | | | | 3.9789 | 4998.1294 |
- | 0 | 0 | 42399.5039 | 59058.4727 | 6.4725 | 6.5032 | 76.885 | 9.687 | 34123.3594 | 70986.5156 |
- | 500 | 0.0808 | 7437.1694 | 30087.3496 | 4.9938 | 6.495 | 76.982 | 9.7 | 3646.8608 | 65964.7734 |
- | 1000 | 0.1616 | 7280.9790 | 30053.4531 | 4.9917 | 6.4866 | 77.082 | 9.712 | 3550.4834 | 65473.8438 |
- | 1500 | 0.2424 | 7267.4517 | 30053.4531 | 4.9922 | 6.4837 | 77.116 | 9.717 | 3538.7600 | 65613.7344 |
- | 2000 | 0.3232 | 7267.4517 | 30053.4531 | 4.9917 | 6.4853 | 77.097 | 9.714 | 3539.9314 | 65613.7344 |
- | 2500 | 0.4040 | 7267.4517 | 30053.4531 | 4.9917 | 6.5615 | 76.202 | 9.601 | 3538.7600 | 65613.7344 |
- | 3000 | 0.4848 | 7267.4517 | 30053.4531 | 4.9922 | 6.4931 | 77.005 | 9.703 | 3538.7600 | 65613.7344 |
- | 3500 | 0.5656 | 7262.9478 | 30053.4531 | 4.9922 | 6.4915 | 77.024 | 9.705 | 3538.7600 | 65613.7344 |
- | 4000 | 0.6464 | 7258.4541 | 30053.4531 | 4.9922 | 6.4888 | 77.056 | 9.709 | 3538.7600 | 65613.7344 |
- | 4500 | 0.7272 | 7258.4541 | 30053.4531 | 4.9922 | 6.499 | 76.935 | 9.694 | 3538.7600 | 65613.7344 |
- | 5000 | 0.8080 | 7258.4541 | 30053.4531 | 4.9922 | 6.4977 | 76.95 | 9.696 | 3538.7600 | 65613.7344 |
- | 5500 | 0.8888 | 7258.4541 | 30053.4531 | 4.9922 | 6.4889 | 77.055 | 9.709 | 3538.7600 | 65613.7344 |
- | 6000 | 0.9696 | 7258.4541 | 30053.4531 | 4.9922 | 6.4904 | 77.036 | 9.707 | 3538.7600 | 65613.7344 |
- | 6188 | 1.0 | 7258.4541 | 30053.4531 | 4.9922 | 6.4758 | 77.211 | 9.729 | 3538.7600 | 65613.7344 |
+ | 0 | 0 | 21321.3555 | 56774.5312 | 6.6010 | 6.4987 | 76.938 | 9.694 | 11289.9248 | 60744.7383 |
+ | 500 | 0.0808 | 209.4913 | 62706.3438 | 1.4389 | 6.5968 | 75.794 | 9.55 | 11.3247 | 298068.5312 |
+ | 1000 | 0.1616 | 182.2395 | 44100.0312 | 1.3516 | 6.4978 | 76.95 | 9.696 | 10.7194 | 255063.875 |
+ | 1500 | 0.2424 | 174.5435 | 39099.0508 | 1.3434 | 6.5516 | 76.317 | 9.616 | 10.7310 | 199230.0 |
+ | 2000 | 0.3232 | 173.0893 | 37756.8164 | 1.3422 | 6.4982 | 76.945 | 9.695 | 10.7545 | 194918.7188 |
+ | 2500 | 0.4040 | 171.9666 | 36889.3945 | 1.3422 | 6.5098 | 76.807 | 9.678 | 10.7906 | 195543.75 |
+ | 3000 | 0.4848 | 171.2156 | 36931.0 | 1.3418 | 6.4894 | 77.049 | 9.708 | 10.7314 | 190904.3438 |
+ | 3500 | 0.5656 | 172.5805 | 37171.0625 | 1.3418 | 6.4818 | 77.14 | 9.72 | 10.8124 | 193984.8281 |
+ | 4000 | 0.6464 | 171.9800 | 37035.1875 | 1.3417 | 6.4879 | 77.067 | 9.71 | 10.7732 | 191414.4375 |
+ | 4500 | 0.7272 | 172.1532 | 37056.0664 | 1.3423 | 6.5945 | 75.821 | 9.553 | 10.7879 | 193984.8281 |
+ | 5000 | 0.8080 | 172.3400 | 37035.1875 | 1.3422 | 6.5862 | 75.916 | 9.565 | 10.7968 | 196799.8281 |
+ | 5500 | 0.8888 | 172.2065 | 37035.1875 | 1.3422 | 6.4984 | 76.943 | 9.695 | 10.7714 | 193984.8281 |
+ | 6000 | 0.9696 | 172.3267 | 37035.1875 | 1.3419 | 6.4894 | 77.048 | 9.708 | 10.7910 | 193984.8281 |
+ | 6188 | 1.0 | 172.3267 | 37035.1875 | 1.3421 | 6.5001 | 76.922 | 9.692 | 10.7883 | 194088.4531 |

  ### Framework versions
  - Distily 0.2.0
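For context on the `distillation_objective` above: the logits component carries weight 1 with `loss_fn=kl`, while the hidden-state and attention components carry weight 0, so the entire training signal in this run is a KL-divergence loss between the teacher's and the student's output logits. A minimal PyTorch sketch of such a logits-level KL loss, purely illustrative and not Distily's actual implementation, might look like:

```python
import torch
import torch.nn.functional as F

def kl_logits_loss(student_logits: torch.Tensor,
                   teacher_logits: torch.Tensor) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged over tokens.

    Both inputs have shape (num_tokens, vocab_size); flatten
    (batch, seq_len, vocab_size) logits before calling.
    """
    return F.kl_div(
        F.log_softmax(student_logits, dim=-1),  # student log-probabilities
        F.log_softmax(teacher_logits, dim=-1),  # teacher log-probabilities
        log_target=True,        # target is supplied as log-probabilities
        reduction="batchmean",  # sum over vocab, average over tokens
    )
```

With weight 1 on this component and weight 0 on the others, the total distillation loss reduces to this single term.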
logs/dropout=0, learning_rate=0.004, weight_decay=0.1/events.out.tfevents.1723868909.5f530b1cf724 ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:aaf17158ef6c7062bc9081d7ea62edb6d020090ae4fc43e4bb059b392bdbe84d
+ size 307