nenad1002 committed on
Commit
5276d64
·
verified ·
1 Parent(s): 6790fcd

Update README.md

Browse files
Files changed (1)
  1. README.md +21 -5
README.md CHANGED
@@ -77,14 +77,30 @@ The dataset was generated by crawling the https://quantum-journal.org/ site, and
77
 
78
  ### Training Procedure
79
 
80
- Many training procedures were tried alongside with multiple models.
81
 
82
- Over the course of time multiple models and fine tuning approaches have been tried as the base model. The best performace was achieved with Lllama 3.1 70B Instruct and qLORA, but the model was very long to train, and finding the best hyperparameter would be too challenging.
83
 
84
- The other two base models that were tries were the mistral 7B v0.1 base model, meta-llama/Llama-2-7b-chat-hf, and the base model of this model.
85
 
86
- I've performed the grid search with several optimization techniques such as [LORA](https://arxiv.org/abs/2106.09685), [DORA](https://arxiv.org/abs/2402.09353), [LORA+](https://arxiv.org/abs/2402.12354), [REFT](https://arxiv.org/abs/2404.03592), and [qLORA](https://arxiv.org/abs/2305.14314)
87
- After exensive grid search, supervised fine tuning of Llama 3.1-8B with LORA+ resulted in the best training and evaluation cross entropy.
88
 
89
  #### Preprocessing [optional]
90
 
 
77
 
78
  ### Training Procedure
79
 
80
+ Various training procedures were explored alongside multiple models.
81
 
82
+ Over time, several models and fine-tuning approaches were tested as the base model. The best performance was achieved with Llama 3.1 70B Instruct and qLoRA, but the training duration was extensive, and optimizing hyperparameters proved to be highly challenging.
83
 
84
+ Other base models were also tested: the Mistral 7B v0.1 base model, meta-llama/Llama-2-7b-chat-hf, and the base model of this model.
85
 
86
+ I performed a grid search over several parameter-efficient fine-tuning techniques: [LoRA](https://arxiv.org/abs/2106.09685), [DoRA](https://arxiv.org/abs/2402.09353), [LoRA+](https://arxiv.org/abs/2402.12354), [ReFT (LoReFT)](https://arxiv.org/abs/2404.03592), and [qLoRA](https://arxiv.org/abs/2305.14314).
87
+ With LoRA, LoRA+, and DoRA, I found that a rank of 8 (with the paper-recommended alpha of 16, i.e. double the rank) achieved the best performance, particularly since my dataset was on the smaller side and higher ranks would have led to overfitting. LoRA dropout rates between 10% and 20% were tested, but at higher rates the model began to jump over better local minima in every fine-tuning approach, so I stuck with 10%.
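
A minimal sketch of the adapter configuration described above, assuming the Hugging Face `peft` library; the checkpoint name is illustrative, and the commented `use_dora=True` flag shows how the same config would become a DoRA variant:

```python
# Illustrative adapter config only: rank 8, alpha 16 (2x rank), 10% dropout, as described above.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # assumed base checkpoint

adapter_cfg = LoraConfig(
    r=8,               # rank 8 worked best on this smaller dataset (higher ranks overfit)
    lora_alpha=16,     # double the rank, as recommended in the LoRA paper
    lora_dropout=0.1,  # 10%; higher dropout made training skip better local minima
    task_type="CAUSAL_LM",
    # use_dora=True,   # flipping this turns the same config into a DoRA variant
)

model = get_peft_model(base, adapter_cfg)
model.print_trainable_parameters()
```
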
88
+ After applying the linear scaling rule, I settled on a batch size of 8 and found that a starting learning rate of 1e-4 yielded the best results. There was no significant difference between using cosine or linear decay for the learning rate when employing the AdamW optimizer.
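
As a rough sketch only (the exact training script is not included here), these choices map onto `transformers` `TrainingArguments` roughly as follows; the output directory, logging cadence, and evaluation cadence are illustrative assumptions:

```python
from transformers import TrainingArguments

# Sketch of the optimizer and schedule described above; output_dir is a placeholder.
training_args = TrainingArguments(
    output_dir="llama3.1-8b-quantum-lora",
    per_device_train_batch_size=8,   # batch size chosen via the linear scaling rule
    learning_rate=1e-4,              # starting learning rate that worked best
    lr_scheduler_type="cosine",      # cosine vs. linear decay made no significant difference
    optim="adamw_torch",             # AdamW optimizer
    num_train_epochs=3,              # overfitting set in after 3-4 epochs
    logging_steps=10,
    eval_strategy="epoch",
)
```
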
96
+
97
+ Regarding the target modules, adapting only the attention projections performed very poorly on both training and evaluation data. The results improved slightly with the addition of the MLP projections, but none of the models or fine-tuning approaches achieved an evaluation cross-entropy below 0.5. However, once the embedding layer was included, despite the significant increase in the number of trainable parameters, the model began to generalize well. I assume this is because the dataset introduces new terminology, which requires the model to adjust its embeddings slightly. I did not adapt the LM head, as no significant performance improvements were observed.
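
Assuming the standard Llama module names used by `transformers`, the module groupings described above can be sketched roughly as follows (the exact lists used are not recorded in this card):

```python
from peft import LoraConfig

# Hypothetical module groupings (standard Llama module names).
attention_only = ["q_proj", "k_proj", "v_proj", "o_proj"]               # trained and evaluated poorly
attention_mlp = attention_only + ["gate_proj", "up_proj", "down_proj"]  # eval cross-entropy still > 0.5
with_embeddings = attention_mlp + ["embed_tokens"]                      # generalized well

adapter_cfg = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=with_embeddings,  # lm_head deliberately left untouched
    task_type="CAUSAL_LM",
)
```
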
98
+
99
+ For ReFT, only the last 8 layers were adapted, with the aim of retaining the model's general knowledge while incorporating more specific domain knowledge about quantum research. Although the results were close to those obtained with LoRA, they were consistently slightly worse.
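
A rough sketch of such a ReFT setup, assuming the `pyreft` library, `block_output` interventions, and a low-rank dimension of 4 (all illustrative choices, not confirmed by this card):

```python
import pyreft
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")  # assumed base checkpoint

# Illustrative: LoReFT interventions on the last 8 layers (24-31 of the model's 32 layers).
reft_config = pyreft.ReftConfig(representations=[
    {
        "layer": layer,
        "component": "block_output",
        "low_rank_dimension": 4,
        "intervention": pyreft.LoreftIntervention(
            embed_dim=model.config.hidden_size, low_rank_dimension=4
        ),
    }
    for layer in range(24, 32)
])

reft_model = pyreft.get_reft_model(model, reft_config)
reft_model.print_trainable_parameters()
```
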
100
+
101
+ After 3 to 4 epochs, the model began to overfit regardless of the strategies employed. Increasing both batch size and the number of epochs resulted in higher final training and evaluation cross-entropy.
102
+
103
+ Following an extensive grid search, supervised fine-tuning of Llama 3.1-8B with LoRA+ and the parameters mentioned above yielded the best training and evaluation cross-entropy.
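
Since LoRA+ only changes how learning rates are assigned to the adapter's A and B matrices, it can be sketched as plain AdamW parameter groups; the `lr_ratio` of 16 below is the paper's suggested ballpark, not necessarily the value used here:

```python
import torch

def loraplus_param_groups(model, base_lr=1e-4, lr_ratio=16):
    """Give LoRA B matrices a larger learning rate than A matrices (the core LoRA+ idea)."""
    groups = {"lora_a": [], "lora_b": [], "other": []}
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        if "lora_A" in name:
            groups["lora_a"].append(param)
        elif "lora_B" in name:
            groups["lora_b"].append(param)
        else:
            groups["other"].append(param)
    candidate_groups = [
        {"params": groups["lora_a"], "lr": base_lr},
        {"params": groups["lora_b"], "lr": base_lr * lr_ratio},
        {"params": groups["other"], "lr": base_lr},
    ]
    return [g for g in candidate_groups if g["params"]]  # drop empty groups

# `model` here is the peft-wrapped model from the sketches above.
optimizer = torch.optim.AdamW(loraplus_param_groups(model))
```
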
104
 
105
  #### Preprocessing [optional]
106