
Model Card for MA-RLHF

ICLR 2025 | GitHub

This repository contains the official checkpoint for Reinforcement Learning From Human Feedback with Macro Actions (MA-RLHF).

Model Description

MA-RLHF is a novel framework that integrates macro actions into conventional RLHF. Macro actions are sequences of tokens or higher-level language constructs, which can be formed under different termination conditions, such as n-gram-based, perplexity-based, or parsing-based criteria. By introducing macro actions into RLHF, we reduce the number of decision points and shorten decision trajectories, which alleviates the credit assignment problem caused by long temporal distances.
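
For intuition, the simplest termination condition is a fixed n-gram length, where the response is segmented into macro actions of a constant number of tokens (e.g., 5 or 10 tokens, as suggested by the Fixed5/Fixed10 suffixes of the checkpoints listed below). The snippet below is a minimal, hypothetical sketch of that segmentation, assuming token IDs as input; the function name and the handling of the trailing partial chunk are illustrative choices, not the repository's implementation.

from typing import List

# Hypothetical illustration: cut a token-ID sequence into fixed-length
# macro actions (the fixed n-gram termination condition). Not the official
# implementation, only a sketch of the idea.
def segment_fixed_ngram(token_ids: List[int], n: int = 5) -> List[List[int]]:
    return [token_ids[i:i + n] for i in range(0, len(token_ids), n)]

# A 12-token response becomes 3 macro actions (5 + 5 + 2 tokens), so credit
# is assigned over 3 decision points instead of 12.
print(segment_fixed_ngram(list(range(12)), n=5))
# [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]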

| Model | Checkpoint | Base Model | Dataset |
| --- | --- | --- | --- |
| TLDR-Gemma-2B-MA-PPO-Fixed5 | 🤗 HF Link | google/gemma-2b | openai/summarize_from_feedback |
| TLDR-Gemma-7B-MA-PPO-Fixed5 | 🤗 HF Link | google/gemma-7b | openai/summarize_from_feedback |
| TLDR-Gemma-2-27B-MA-PPO-Fixed5 | 🤗 HF Link | google/gemma-2-27b | openai/summarize_from_feedback |
| HH-RLHF-Gemma-2B-MA-PPO-Fixed5 | 🤗 HF Link | google/gemma-2b | Dahoas/full-hh-rlhf |
| HH-RLHF-Gemma-7B-MA-PPO-Fixed5 | 🤗 HF Link | google/gemma-7b | Dahoas/full-hh-rlhf |
| APPS-Gemma-2B-MA-PPO-Fixed10 | 🤗 HF Link | google/codegemma-2b | codeparrot/apps |
| APPS-Gemma-7B-MA-PPO-Fixed10 | 🤗 HF Link | google/codegemma-7b-it | codeparrot/apps |

Model Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "baidu/APPS-Gemma-7B-MA-PPO-Fixed10"

# Load the tokenizer and the MA-PPO fine-tuned checkpoint.
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto", torch_dtype='auto', trust_remote_code=True)

input_text = """
An accordion is a string (yes, in the real world accordions are musical instruments, but let's forget about it for a while) which can be represented as a concatenation of: an opening bracket (ASCII code $091$), a colon (ASCII code $058$), some (possibly zero) vertical line characters (ASCII code $124$), another colon, and a closing bracket (ASCII code $093$). The length of the accordion is the number of characters in it. For example, [::], [:||:] and [:|||:] are accordions having length $4$, $6$ and $7$. (:|:), {:||:}, [:], ]:||:[ are not accordions. You are given a string $s$. You want to transform it into an accordion by removing some (possibly zero) characters from it. Note that you may not insert new characters or reorder existing ones. Is it possible to obtain an accordion by removing characters from $s$, and if so, what is the maximum possible length of the result? -----Input----- The only line contains one string $s$ ($1 \le |s| \le 500000$). It consists of lowercase Latin letters and characters [, ], : and |. -----Output----- If it is not possible to obtain an accordion by removing some characters from $s$, print $-1$. Otherwise print maximum possible length of the resulting accordion. -----Examples----- Input |[a:b:|] Output 4 Input |]:[|:] Output -1
"""

# Tokenize the prompt, generate a continuation, and decode it back to text.
input_ids = tokenizer(input_text, return_tensors='pt').to(model.device)
output_ids = model.generate(**input_ids, max_new_tokens=20)
response = tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(response)
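
Note that for decoder-only models such as Gemma, the decoded response above includes the prompt, since the generated sequence starts with the prompt tokens. If you only want the model's completion, you can slice off the prompt tokens before decoding; a minimal variant of the snippet above:

# Decode only the newly generated tokens (everything after the prompt).
prompt_length = input_ids["input_ids"].shape[1]
completion = tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
print(completion)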

Citation

@inproceedings{
  chai2025marlhf,
  title={{MA}-{RLHF}: Reinforcement Learning from Human Feedback with Macro Actions},
  author={Yekun Chai and Haoran Sun and Huang Fang and Shuohuan Wang and Yu Sun and Hua Wu},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=WWXjMYZxfH}
}