Update README.md
This model was converted to GGUF format from [`Satori-reasoning/Satori-7B-Round2`](https://huggingface.co/Satori-reasoning/Satori-7B-Round2) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
Refer to the [original model card](https://huggingface.co/Satori-reasoning/Satori-7B-Round2) for more details on the model.
---
Satori-7B-Round2 is a 7B LLM built on an open-source model (Qwen-2.5-Math-7B) and trained on open-source data (OpenMathInstruct-2 and NuminaMath). Satori-7B-Round2 is capable of autoregressive search, i.e., self-reflection and self-exploration without external guidance.

This is achieved through our proposed Chain-of-Action-Thought (COAT) reasoning and a two-stage post-training paradigm.
## Our Approach
We formulate LLM reasoning as a sequential decision-making problem, where reasoning is a process of constructing and refining an answer step by step. Specifically, the LLM (the agent's policy) starts with an input context (initial state), generates a reasoning step (action), and updates the context (next state). The LLM repeats this process until it reaches a final answer, and then receives a reward that evaluates whether the final answer matches the ground truth. With this formulation, we can train the LLM to reason using RL, aiming to generate a sequence of reasoning steps that maximizes the expected reward.
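In symbols, a minimal sketch of this objective (the notation here is ours, not taken from the Satori paper) is:

$$
s_0 = x, \qquad a_t \sim \pi_\theta(\cdot \mid s_t), \qquad s_{t+1} = [s_t; a_t],
$$

$$
\max_\theta \ \mathbb{E}_{a_{1:T} \sim \pi_\theta} \left[ r(s_T, y^{*}) \right],
$$

where `x` is the input problem, `a_t` is the reasoning step generated at step `t`, `s_T` is the final context containing the answer, and `r(s_T, y*)` is the terminal reward that checks the final answer against the ground truth `y*`.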
## Chain-of-Action-Thought Reasoning (COAT)
The key challenge of achieving autoregressive search is enabling the LLM to determine when to reflect, continue, or explore alternative solutions without external intervention. To enable this, we introduce several special meta-action tokens that guide the LLM's reasoning process:
- Continue Reasoning (`<|continue|>`): encourages the LLM to build upon its current reasoning trajectory by generating the next intermediate step.
- Reflect (`<|reflect|>`): prompts the model to pause and verify the correctness of prior reasoning steps.
- Explore Alternative Solution (`<|explore|>`): signals the model to identify critical flaws in its reasoning and explore a new solution.
We refer to this formulation as Chain-of-Action-Thought (COAT) reasoning. Each COAT reasoning step is a sequence of tokens, starting with one of the meta-action tokens.
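To give a sense of the format, here is a purely hypothetical COAT trajectory (illustrative only, not actual model output), with each step opening on one of the meta-action tokens:

```text
<|continue|> Set up the equation: 3x + 5 = 20, so 3x = 25.
<|reflect|> Check the previous step: 20 - 5 = 15, not 25, so that step is incorrect.
<|explore|> Redo it with the corrected value: 3x = 15, hence x = 5.
<|continue|> Final answer: x = 5.
```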
---
## Use with llama.cpp
Install llama.cpp through brew (works on Mac and Linux)
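For example, after installing you can run the model straight from the Hugging Face Hub (a minimal sketch; the repo id and `.gguf` filename below are placeholders, substitute the actual quantized file shipped in this repo):

```bash
# Install llama.cpp (macOS and Linux)
brew install llama.cpp

# Run the model directly from the Hub.
# <this-repo> and the .gguf filename are placeholders; use the repo id
# and the quantized file actually listed in this repository.
llama-cli --hf-repo <this-repo> \
  --hf-file satori-7b-round2-q4_k_m.gguf \
  -p "Solve: if 3x + 5 = 20, what is x?"
```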