Update README.md
This model was converted to GGUF format from [`Satori-reasoning/Satori-7B-Round2`](https://huggingface.co/Satori-reasoning/Satori-7B-Round2) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
Refer to the [original model card](https://huggingface.co/Satori-reasoning/Satori-7B-Round2) for more details on the model.
---
Satori-7B-Round2 is a 7B LLM built on an open-source model (Qwen-2.5-Math-7B) and trained on open-source data (OpenMathInstruct-2 and NuminaMath). Satori-7B-Round2 is capable of autoregressive search, i.e., self-reflection and self-exploration without external guidance.

This is achieved through our proposed Chain-of-Action-Thought (COAT) reasoning and a two-stage post-training paradigm.
## Our Approach
We formulate LLM reasoning as a sequential decision-making problem, where reasoning is a process of constructing and refining an answer step by step. Specifically, the LLM (the agent's policy) starts with an input context (initial state), generates a reasoning step (action), and updates the context (next state). The LLM repeats this process until it reaches a final answer, and then receives a reward that evaluates whether the final answer matches the ground truth. With this formulation, we can train the LLM to reason using RL, aiming to generate a sequence of reasoning steps that maximizes the expected reward.
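In symbols, a minimal sketch of this objective (the notation here is ours, not taken from the Satori paper) is:

$$
s_0 = x, \qquad a_t \sim \pi_\theta(\cdot \mid s_t), \qquad s_{t+1} = [s_t; a_t],
$$

$$
\max_\theta \ \mathbb{E}_{a_{1:T} \sim \pi_\theta} \left[ r(s_T, y^{*}) \right],
$$

where `x` is the input problem, `a_t` is the reasoning step generated at step `t`, `s_T` is the final context containing the answer, and `r(s_T, y*)` is the terminal reward that checks the final answer against the ground truth `y*`.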
## Chain-of-Action-Thought Reasoning (COAT)
The key challenge of achieving autoregressive search is enabling the LLM to determine when to reflect, continue, or explore alternative solutions without external intervention. To enable this, we introduce several special meta-action tokens that guide the LLM's reasoning process:
- Continue Reasoning (`<|continue|>`): encourages the LLM to build upon its current reasoning trajectory by generating the next intermediate step.
- Reflect (`<|reflect|>`): prompts the model to pause and verify the correctness of prior reasoning steps.
- Explore Alternative Solution (`<|explore|>`): signals the model to identify critical flaws in its reasoning and explore a new solution.
We refer to this formulation as Chain-of-Action-Thought (COAT) reasoning. Each COAT reasoning step is a sequence of tokens, starting with one of the meta-action tokens.
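To give a sense of the format, here is a purely hypothetical COAT trajectory (illustrative only, not actual model output), with each step opening on one of the meta-action tokens:

```text
<|continue|> Set up the equation: 3x + 5 = 20, so 3x = 25.
<|reflect|> Check the previous step: 20 - 5 = 15, not 25, so that step is incorrect.
<|explore|> Redo it with the corrected value: 3x = 15, hence x = 5.
<|continue|> Final answer: x = 5.
```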
---
## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux)
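For example, after installing you can run the model straight from the Hugging Face Hub (a minimal sketch; the repo id and `.gguf` filename below are placeholders, substitute the actual quantized file shipped in this repo):

```bash
# Install llama.cpp (macOS and Linux)
brew install llama.cpp

# Run the model directly from the Hub.
# <this-repo> and the .gguf filename are placeholders; use the repo id
# and the quantized file actually listed in this repository.
llama-cli --hf-repo <this-repo> \
  --hf-file satori-7b-round2-q4_k_m.gguf \
  -p "Solve: if 3x + 5 = 20, what is x?"
```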