secemp9 committed
Commit 62388d1 · verified · 1 Parent(s): a0e37ea

Update README.md

Files changed (1)
  1. README.md +14 -108
README.md CHANGED
@@ -11,117 +11,23 @@ model-index:
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- [<img src="https://raw.githubusercontent.com/axolotl-ai-cloud/axolotl/main/image/axolotl-badge-web.png" alt="Built with Axolotl" width="200" height="32"/>](https://github.com/axolotl-ai-cloud/axolotl)
- <details><summary>See axolotl config</summary>

- axolotl version: `0.7.0`
- ```yaml
- # Base model configuration
- base_model: unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit
- load_in_4bit: true
-
- # Dataset configuration
- datasets:
-   - path: instruction_solution_to_thought_dataset.jsonl
-     type: chat_template
-
- # Chat template
- chat_template: chatml
-
- # LoRA adapter configuration
- adapter: lora
- lora_r: 16
- lora_alpha: 16
- lora_dropout: 0
- lora_target_modules:
-   - q_proj
-   - k_proj
-   - v_proj
-   - o_proj
-   - gate_proj
-   - up_proj
-   - down_proj
-
- # Training hyperparameters
- max_seq_length: 128000
- micro_batch_size: 2
- gradient_accumulation_steps: 8
- learning_rate: 3e-5
- num_epochs: 2
- warmup_steps: 100
- optimizer: adamw_8bit
- weight_decay: 0.01
- lr_scheduler_type: cosine
- max_grad_norm: 1.0
- output_dir: ./outputs_solution_to_thought
- seed: 3407
- merge_lora: true
- hf_upload: true
- hf_repo: secemp9/TraceBack-12b
- xformers_attention:
- flash_attention: True
- #lora_mlp_kernel: true
- #lora_qkv_kernel: true
- #lora_o_kernel: true
- #fp16: true
- #load_in_8bit: true # Enable 8-bit loading for LoRA finetuning
- bf16: true # Enable BF16 mixed precision
- # Multi-GPU training with DeepSpeed
- deepspeed: deepspeed_configs/zero2.json
-
- # Optional: Enable gradient checkpointing
- gradient_checkpointing: true
-
- ```
-
- </details><br>
-
- # outputs_solution_to_thought
-
- This model is a fine-tuned version of [unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit](https://huggingface.co/unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit) on the instruction_solution_to_thought_dataset.jsonl dataset.
-
- ## Model description
-
- More information needed
-
- ## Intended uses & limitations
-
- More information needed
-
- ## Training and evaluation data
-
- More information needed
-
- ## Training procedure
-
- ### Training hyperparameters
-
- The following hyperparameters were used during training:
- - learning_rate: 3e-05
- - train_batch_size: 2
- - eval_batch_size: 2
- - seed: 3407
- - distributed_type: multi-GPU
- - num_devices: 8
- - gradient_accumulation_steps: 8
- - total_train_batch_size: 128
- - total_eval_batch_size: 16
- - optimizer: Use adamw_8bit with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 100
- - num_epochs: 2.0
-
- ### Training results
-
-
- ### Framework versions
-
- - PEFT 0.14.0
- - Transformers 4.48.3
- - Pytorch 2.5.1+cu124
- - Datasets 3.2.0
- - Tokenizers 0.21.0

+ # TraceBack 12b Release

+ TraceBack is what I came up with when I asked myself, “how can we scale reasoning trace data generation effectively?”

+ Turns out you do not need to depend on reasoning models (r1, o1, o3, etc.) alone to create reasoning traces!

+ It has several goals in mind, but mainly:
+ - enabling faster synthetic reasoning dataset generation: it uses a small model (much smaller than r1, etc.), so inference is faster and easier to scale
+ - controlling the style of reasoning (system 2 thinking, etc.)
+ - converting the outputs/datasets of any non-reasoning model into a synthetic reasoning dataset when used as input

+ So far, the current proof of concept checks the boxes for goals 1 and 3, and I plan on scaling it further, since:
+ - it only uses Mistral Nemo 12b as the base model
+ - it was only trained for 2 epochs
+ - only 200k samples were used for finetuning (QLoRA)
+ So there is still much room for improvement.

+ It was trained using both the instruction and the solution as input, with the output being a plausible/matching reasoning trace for that pair.

+ I believe this is the future of reasoning data generation. Stay tuned for an eval release.
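
For anyone who wants to try the workflow the new card describes (instruction + solution in, reasoning trace out), here is a minimal inference sketch in Python. It assumes the merged weights at secemp9/TraceBack-12b load with the standard transformers API and that the uploaded tokenizer carries the ChatML chat template from the training config; the exact way instruction and solution were concatenated into the prompt is not documented in this card, so that formatting, the sampling settings, and the example pair are illustrative guesses.

```python
# Minimal sketch: generate a synthetic reasoning trace for an (instruction, solution) pair.
# Assumptions (not confirmed by the card): the tokenizer ships a ChatML chat template, and
# instruction + solution are simply concatenated in a single user turn.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "secemp9/TraceBack-12b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical example pair; in practice these would come from a non-reasoning dataset.
instruction = "Write a function that checks whether a number is prime."
solution = (
    "def is_prime(n):\n"
    "    if n < 2:\n"
    "        return False\n"
    "    return all(n % i for i in range(2, int(n**0.5) + 1))"
)

messages = [
    {"role": "user", "content": f"Instruction:\n{instruction}\n\nSolution:\n{solution}"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sample a plausible reasoning trace leading from the instruction to the given solution.
output_ids = model.generate(input_ids, max_new_tokens=1024, do_sample=True, temperature=0.7)
trace = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(trace)
```

The generated trace, together with the original instruction and solution, can then be written back out as one record of a synthetic reasoning dataset, which is the conversion use case (goal 3) described above.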