Upload 3 files

- README.md +54 -10
- requirements.txt +7 -0
- train.py +133 -0

README.md
CHANGED
@@ -1,10 +1,54 @@
# Fine-tuning

## SmolLM2 Instruct

We build the SmolLM2 Instruct family by fine-tuning the base 1.7B model on [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) and the base 360M and 135M models on [Smol-smoltalk](https://huggingface.co/datasets/HuggingFaceTB/smol-smoltalk) using `TRL` and the alignment handbook, and then doing DPO on [UltraFeedback](https://huggingface.co/datasets/openbmb/UltraFeedback). You can find the scripts and instructions for doing this here: https://github.com/huggingface/alignment-handbook/tree/main/recipes/smollm2#instructions-to-train-smollm2-17b-instruct
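
As an illustrative sketch only (the actual SmolLM2 recipes live in the alignment handbook linked above), a DPO stage with TRL might look roughly like this, assuming a preference dataset that already provides `prompt`/`chosen`/`rejected` columns; the model, dataset name, and hyperparameters below are placeholders:

```python
# Illustrative sketch, not the exact alignment-handbook recipe.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # placeholder: the SFT checkpoint to align
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Placeholder: a preference dataset with "prompt"/"chosen"/"rejected" columns.
preference_data = load_dataset("your-org/your-preference-dataset", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="smollm2-dpo", beta=0.1, max_steps=1000),
    train_dataset=preference_data,
    processing_class=tokenizer,
)
trainer.train()
```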

## Custom script

Here, we provide a simple script for fine-tuning SmolLM2. In this case, we fine-tune the base 1.7B model on Python data.

### Setup

Install `pytorch` ([see the documentation](https://pytorch.org/)), and then install the requirements:
```bash
pip install -r requirements.txt
```

Before you run any of the scripts, make sure you are logged in to `wandb` and the Hugging Face Hub so you can push the checkpoints, and that you have `accelerate` configured:
```bash
wandb login
huggingface-cli login
accelerate config
```
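
If you prefer non-interactive authentication (for example on a remote machine), a minimal sketch using environment variables instead of the interactive logins, assuming you have already created the tokens (the values below are placeholders; `HF_TOKEN` is also read directly by `train.py`):
```bash
# Placeholders: substitute your own tokens.
export HF_TOKEN=<your-huggingface-token>    # picked up by the Hub libraries and by train.py
export WANDB_API_KEY=<your-wandb-api-key>   # picked up by wandb instead of `wandb login`
```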

Now that everything is set up, you can clone the repository and change into the corresponding directory:

```bash
git clone https://github.com/huggingface/smollm
cd smollm/finetune
```

### Training

To fine-tune efficiently at low cost, we use the [PEFT](https://github.com/huggingface/peft) library for Low-Rank Adaptation (LoRA) training. We also use the `SFTTrainer` from [TRL](https://github.com/huggingface/trl).

For this example, we will fine-tune SmolLM2-1.7B on the `Python` subset of [the-stack-smol](https://huggingface.co/datasets/bigcode/the-stack-smol). This is just for illustration purposes.
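
The LoRA hyperparameters are hard-coded in `train.py` rather than exposed as CLI flags; the adapter configuration defined there is:

```python
from peft import LoraConfig

# Adapter configuration as set in train.py: rank-16 LoRA on the attention
# query and value projections.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    bias="none",
    task_type="CAUSAL_LM",
)
```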

To launch the training:

```bash
accelerate launch train.py \
    --model_id "HuggingFaceTB/SmolLM2-1.7B" \
    --dataset_name "bigcode/the-stack-smol" \
    --subset "data/python" \
    --dataset_text_field "content" \
    --split "train" \
    --max_seq_length 2048 \
    --max_steps 5000 \
    --micro_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --learning_rate 3e-4 \
    --warmup_steps 100 \
    --num_proc "$(nproc)"
```

If you want to fine-tune on other text datasets, you need to change the `dataset_text_field` argument to the name of the column containing the code/text you want to train on.
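
For example, a hypothetical run on a generic plain-text dataset (the dataset name and the `text` column below are placeholders) could look like the following; note that `--subset` defaults to `data/python` and is passed to `load_dataset` as `data_dir`, so it may also need adjusting to match your dataset's layout:

```bash
accelerate launch train.py \
    --model_id "HuggingFaceTB/SmolLM2-1.7B" \
    --dataset_name "your-org/your-text-dataset" \
    --dataset_text_field "text" \
    --split "train" \
    --max_seq_length 2048 \
    --max_steps 5000
```
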
requirements.txt
ADDED
@@ -0,0 +1,7 @@
transformers
trl>=0.15
peft
accelerate
datasets
wandb
bitsandbytes

train.py
ADDED
@@ -0,0 +1,133 @@
# Code adapted from https://github.com/huggingface/trl/blob/main/examples/research_projects/stack_llama/scripts/supervised_finetuning.py
# and https://huggingface.co/blog/gemma-peft
import argparse
import multiprocessing
import os

import torch
import transformers
from accelerate import PartialState
from datasets import load_dataset
from peft import AutoPeftModelForCausalLM, LoraConfig
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    is_torch_npu_available,
    is_torch_xpu_available,
    logging,
    set_seed,
)
from trl import SFTConfig, SFTTrainer


def get_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", type=str, default="HuggingFaceTB/SmolLM2-1.7B")
    parser.add_argument("--tokenizer_id", type=str, default="")
    parser.add_argument("--dataset_name", type=str, default="bigcode/the-stack-smol")
    parser.add_argument("--subset", type=str, default="data/python")
    parser.add_argument("--split", type=str, default="train")
    parser.add_argument("--streaming", type=bool, default=False)
    parser.add_argument("--dataset_text_field", type=str, default="content")

    parser.add_argument("--max_seq_length", type=int, default=2048)
    parser.add_argument("--max_steps", type=int, default=1000)
    parser.add_argument("--micro_batch_size", type=int, default=1)
    parser.add_argument("--gradient_accumulation_steps", type=int, default=4)
    parser.add_argument("--weight_decay", type=float, default=0.01)
    parser.add_argument("--bf16", type=bool, default=True)

    parser.add_argument("--use_bnb", type=bool, default=False)
    parser.add_argument("--attention_dropout", type=float, default=0.1)
    parser.add_argument("--learning_rate", type=float, default=2e-4)
    parser.add_argument("--lr_scheduler_type", type=str, default="cosine")
    parser.add_argument("--warmup_steps", type=int, default=100)
    parser.add_argument("--seed", type=int, default=0)
    parser.add_argument("--output_dir", type=str, default="finetune_smollm2_python")
    parser.add_argument("--num_proc", type=int, default=None)
    parser.add_argument("--push_to_hub", type=bool, default=True)
    parser.add_argument("--repo_id", type=str, default="SmolLM2-1.7B-finetune")
    return parser.parse_args()


def main(args):
    # config
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
    bnb_config = None
    if args.use_bnb:
        bnb_config = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_compute_dtype=torch.bfloat16,
        )
    # load model and dataset
    token = os.environ.get("HF_TOKEN", None)
    model = AutoModelForCausalLM.from_pretrained(
        args.model_id,
        quantization_config=bnb_config,
        device_map={"": PartialState().process_index},
        attention_dropout=args.attention_dropout,
    )
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_id or args.model_id)

    data = load_dataset(
        args.dataset_name,
        data_dir=args.subset,
        split=args.split,
        token=token,
        num_proc=args.num_proc if args.num_proc or args.streaming else multiprocessing.cpu_count(),
        streaming=args.streaming,
    )

    # setup the trainer
    trainer = SFTTrainer(
        model=model,
        processing_class=tokenizer,
        train_dataset=data,
        args=SFTConfig(
            dataset_text_field=args.dataset_text_field,
            dataset_num_proc=args.num_proc,
            max_seq_length=args.max_seq_length,
            per_device_train_batch_size=args.micro_batch_size,
            gradient_accumulation_steps=args.gradient_accumulation_steps,
            warmup_steps=args.warmup_steps,
            max_steps=args.max_steps,
            learning_rate=args.learning_rate,
            lr_scheduler_type=args.lr_scheduler_type,
            weight_decay=args.weight_decay,
            bf16=args.bf16,
            logging_strategy="steps",
            logging_steps=10,
            output_dir=args.output_dir,
            optim="paged_adamw_8bit",
            seed=args.seed,
            run_name=f"train-{args.model_id.split('/')[-1]}",
            report_to="wandb",
            push_to_hub=args.push_to_hub,
            hub_model_id=args.repo_id,
        ),
        peft_config=lora_config,
    )

    # launch
    print("Training...")
    trainer.train()
    print("Training Done! 💥")


if __name__ == "__main__":
    args = get_args()
    set_seed(args.seed)
    os.makedirs(args.output_dir, exist_ok=True)

    logging.set_verbosity_error()

    main(args)
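
Note that `train.py` imports `AutoPeftModelForCausalLM` but never uses it; if you want to merge the trained LoRA adapter into the base weights after training, a minimal sketch (the checkpoint path below is a placeholder) might look like:

```python
from peft import AutoPeftModelForCausalLM

# Placeholder path: point this at a checkpoint directory produced by train.py.
model = AutoPeftModelForCausalLM.from_pretrained("finetune_smollm2_python/checkpoint-5000")

# Fold the LoRA weights into the base model and save a standalone copy.
merged = model.merge_and_unload()
merged.save_pretrained("finetune_smollm2_python/merged")
```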