Instruction-Tuning with a Tiny Fraction of Data: Introducing SCAR

Community Article Published May 31, 2025

Zhuang Li, Yuncheng Hua, Thuy-Trang Vu, Haolan Zhan, Lizhen Qu, Gholamreza Haffari
Accepted to ACL 2025 main track
Repository • Models • Paper


Why this matters

Instruction-tuned large language models (LLMs) are often trained on hundreds of thousands or even millions of instruction–response pairs. Collecting and filtering datasets of this scale (320,000 pairs for OLMo-7B, several million examples in LLaMA-style mixtures) and then fine-tuning on them requires significant human effort and GPU resources.

SCAR (Style Consistency-Aware Response Ranking) shows that selecting only the responses with a consistent style not only reduces cost but also improves performance. In many cases, a model fine-tuned on a small, stylistically consistent subset can outperform one fine-tuned on the full, noisy dataset, while training more efficiently.

On OLMo-7B, fine-tuning on just 0.7 %–3 % of the original training set matches or surpasses the full-data baseline. On StarCoder code-generation tasks, a 5 k-pair subset (38 % of the data) also improves performance.


The idea in one paragraph

Responses that share similar characteristics, such as concise wording, consistent formatting, and a steady tone, help LLMs learn more efficiently by reducing conflicting signals during training. SCAR uses a lightweight ranker to score instruction–response pairs based on two style-related cues in the responses:

  • Linguistic form (surface features such as sentence length, list usage, functional words)
  • Instructional surprisal (how predictable a response is given its instruction)

Pairs with the highest combined score form the new fine-tuning set.
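
To make the two cues concrete, here is a toy sketch of how one might approximate them by hand. This is not SCAR's actual scoring function (the released ranker is a trained neural model, shown in the Quick start below); the surface-feature heuristics, the GPT-2 surprisal proxy, and the 0.5 weighting are all illustrative assumptions.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Toy approximation only: not the released SCAR ranker.
lm_tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def form_score(response: str) -> float:
    # Crude linguistic-form features: average sentence length and list usage.
    sentences = [s for s in response.split(".") if s.strip()]
    avg_len = sum(len(s.split()) for s in sentences) / max(len(sentences), 1)
    uses_list = any(line.lstrip().startswith(("-", "*", "1."))
                    for line in response.splitlines())
    return 1.0 / (1.0 + avg_len / 20.0) + (0.5 if uses_list else 0.0)

def instructional_surprisal(instruction: str, response: str) -> float:
    # Mean negative log-likelihood of the response tokens given the instruction,
    # using GPT-2 as a stand-in language model.
    prompt_ids = lm_tok(instruction + "\n", return_tensors="pt").input_ids
    full_ids = lm_tok(instruction + "\n" + response, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # ignore the instruction tokens
    with torch.no_grad():
        return lm(input_ids=full_ids, labels=labels).loss.item()

def combined_style_score(instruction: str, response: str) -> float:
    # Higher = cleaner form and lower surprisal; the 0.5 weight is arbitrary.
    return form_score(response) - 0.5 * instructional_surprisal(instruction, response)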


Key Results

| Model family | Baseline checkpoint | Full-set size | Metric ↑ | Baseline | SCAR subset → score |
|---|---|---|---|---|---|
| OLMo-7B | allenai/OLMo-7B-SFT | 320k pairs | AlpacaEval L.C. WinRate | 3.86 | 5k (1.5%) → 5.64<br>10k (3%) → 5.37<br>2.5k (0.7%) → 4.08 |
| StarCoder-15B | bigcode/octocoder | 13k pairs | HumanEval + MultiPL-E (Pass@1 + Pass@10 avg) | 37.9 | 5k (38%) → 40.1 |

Complete tables and ablations are in the paper and README.

How to read these numbers

  • Baseline checkpoint — the official supervised-fine-tuned model from the original authors (OLMo-7B-SFT or Octocoder-15.5B).
  • SCAR subset — we start from the same base model family (OLMo-7B or StarCoder-15B) but fine-tune only on the SCAR-selected pairs.
  • Gains — with 5 k OLMo pairs (1.5 % of the original data) we gain +1.78 WinRate points over the full-data OLMo-7B-SFT; even 2.5 k pairs (0.7 %) still edge past the baseline (4.08 vs 3.86). For StarCoder, filtering the 13 k commit dataset down to 5 k pairs yields +2.2 points on the combined Pass@1/Pass@10 metric.

These results underline two facts:

  1. Style consistency matters: low-quality or style-divergent responses in large mixtures actively hurt SFT quality.
  2. Data efficiency: a small, coherent subset can outperform a much larger style-inconsistent set, saving tokens and GPU hours.

Quick start

pip install scar-tool

import torch
from transformers import AutoTokenizer
from style_ranker.ranker.model import StyleRanker

# Load the pretrained SCAR ranker and its tokenizer
model_path = "lizhuang144/scar-gte-base"
ranker = StyleRanker.from_pretrained(model_path).eval()
tok = AutoTokenizer.from_pretrained(model_path)

instructions = ["Write a poem about spring",
                "Explain quantum computing"]

answers = ["I am sorry. Who are you? Why should I tell you...",
           "Quantum computing is a type of computation..."]

# Tokenize instructions and answers separately
ins = tok(instructions, padding=True, truncation=True, return_tensors="pt")
ans = tok(answers, padding=True, truncation=True, return_tensors="pt")

# Score each instruction–answer pair (higher = more style-consistent)
with torch.no_grad():
    scores = ranker(ins.input_ids, ins.attention_mask,
                    ans.input_ids, ans.attention_mask)

for s, i, a in zip(scores, instructions, answers):
    print(f"{s.item():5.2f}  {i[:35]} | {a[:35]}")

For large datasets:

from style_ranker.rank import rank_and_filter
pairs = rank_and_filter(model_path,
                        instructions,
                        answers,
                        ratio=0.02,      # keep top 2 %
                        device="cuda")
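
If you want to fine-tune on the retained pairs, they can be written straight to a JSONL file. The snippet below assumes rank_and_filter returns (instruction, answer, score) triples; check the repository README for the exact return format.

import json

# Assumption: each element of `pairs` is an (instruction, answer, score) triple.
with open("scar_subset.jsonl", "w", encoding="utf-8") as f:
    for instruction, answer, score in pairs:
        f.write(json.dumps({"instruction": instruction,
                            "output": answer,
                            "scar_score": float(score)}) + "\n")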

Models on the Hub

| Name | Base encoder |
|---|---|
| lizhuang144/scar-gte-base | Alibaba-NLP/gte-base-en-v1.5 |
| lizhuang144/scar-gte-large | Alibaba-NLP/gte-large-en-v1.5 |
| lizhuang144/scar-roberta-base | FacebookAI/roberta-base |
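
Any of these checkpoints can be dropped into the Quick start by changing model_path; the same loading code should work for each of them.

# Larger encoder, same loading pattern as in the Quick start
ranker = StyleRanker.from_pretrained("lizhuang144/scar-gte-large").eval()
tok = AutoTokenizer.from_pretrained("lizhuang144/scar-gte-large")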

Where SCAR helps

  • Filter noisy Self-Instruct or ChatGPT data before SFT.
  • Score candidate responses for RLHF reward-model training (see the sketch after this list).
  • Build compact domain-specific instruction sets when GPU budgets are tight.
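
As a sketch of the second use case, the same ranker from the Quick start can score several candidate answers to one instruction, e.g. to pick chosen/rejected pairs for reward-model data. The candidate texts below are made up for illustration, and the snippet reuses ranker, tok, and torch from the Quick start.

instruction = "Write a poem about spring"
candidates = [
    "Soft rain wakes the sleeping buds of spring...",
    "idk lol, spring is fine I guess",
]

ins = tok([instruction] * len(candidates), padding=True, truncation=True, return_tensors="pt")
cand = tok(candidates, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    cand_scores = ranker(ins.input_ids, ins.attention_mask,
                         cand.input_ids, cand.attention_mask)

# Highest-scoring candidate first; the top/bottom pair can serve as (chosen, rejected).
ranked = sorted(zip(candidates, cand_scores.squeeze(-1).tolist()),
                key=lambda x: x[1], reverse=True)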

Limitations

  • Benefits are largest when the raw dataset contains style-inconsistent responses; for single-LLM outputs (minimal stylistic variation) gains are modest.
  • Ranker is English-only.
  • Duplicate instructions are not removed automatically; deduplicate first (a minimal sketch follows).
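
A minimal exact-match deduplication pass, run before ranking, might look like this (whitespace/case normalization only; anything fuzzier is up to you):

# Drop repeated instructions before calling the ranker.
seen = set()
dedup_instructions, dedup_answers = [], []
for ins_text, ans_text in zip(instructions, answers):
    key = " ".join(ins_text.lower().split())   # normalize case and whitespace
    if key not in seen:
        seen.add(key)
        dedup_instructions.append(ins_text)
        dedup_answers.append(ans_text)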

Citation

@article{li2024scar,
  title={SCAR: Data Selection via Style Consistency-Aware Response Ranking for Efficient Instruction-Tuning of Large Language Models},
  author={Li, Zhuang and Hua, Yuncheng and Vu, Thuy-Trang and Zhan, Haolan and Qu, Lizhen and Haffari, Gholamreza},
  journal={arXiv preprint arXiv:2406.10882},
  year={2025}
}

Links

Questions or feedback? Feel free to open an issue on GitHub or contact me directly at [email protected] .
