GGUF
llama-cpp
gguf-my-repo
Inference Endpoints
conversational
Triangle104 committed
Commit 86a7bd3 · verified
1 Parent(s): 1ec3b6c

Update README.md

Files changed (1): README.md (+55 -0)
README.md CHANGED
@@ -13,6 +13,61 @@ tags:
  This model was converted to GGUF format from [`Satori-reasoning/Satori-7B-Round2`](https://huggingface.co/Satori-reasoning/Satori-7B-Round2) using llama.cpp via ggml.ai's [GGUF-my-repo](https://huggingface.co/spaces/ggml-org/gguf-my-repo) space.
  Refer to the [original model card](https://huggingface.co/Satori-reasoning/Satori-7B-Round2) for more details on the model.

+ ---
+ Satori-7B-Round2 is a 7B LLM trained on an open-source model (Qwen-2.5-Math-7B) and open-source data (OpenMathInstruct-2 and NuminaMath). Satori-7B-Round2 is capable of autoregressive search, i.e., self-reflection and self-exploration without external guidance. This is achieved through our proposed Chain-of-Action-Thought (COAT) reasoning and a two-stage post-training paradigm.
+
+ Our Approach
+
+ We formulate LLM reasoning as a sequential decision-making problem, where reasoning is a process of constructing and refining an answer step by step. Specifically, the LLM (the agent's policy) starts with an input context (initial state), generates a reasoning step (action), and updates the context (next state). The LLM repeats this process until it reaches a final answer, and then receives a reward that evaluates whether the final answer matches the ground truth. With this formulation, we can train the LLM to reason using RL, aiming to generate a sequence of reasoning steps that maximizes the expected reward.
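
To make the formulation concrete, here is a minimal sketch of that decision loop in Python. Everything in it is illustrative: `generate_step`, `is_final`, and `reward` are hypothetical stand-ins for the policy's step generator, a termination check, and the ground-truth answer check; none of them come from the Satori codebase.

```python
# Illustrative sketch of reasoning as sequential decision-making.
# All function names below are hypothetical placeholders, not Satori APIs.
from typing import Callable, List, Tuple


def rollout(
    question: str,
    generate_step: Callable[[str], str],   # policy: context -> next reasoning step (action)
    is_final: Callable[[str], bool],       # does this step contain a final answer?
    reward: Callable[[str], float],        # 1.0 if the final answer matches ground truth, else 0.0
    max_steps: int = 32,
) -> Tuple[List[str], float]:
    """Run one reasoning trajectory and return (steps, terminal reward)."""
    context = question                      # initial state
    steps: List[str] = []
    for _ in range(max_steps):
        step = generate_step(context)       # action sampled from the policy (the LLM)
        steps.append(step)
        context = context + "\n" + step     # state transition: append the step to the context
        if is_final(step):
            return steps, reward(step)      # terminal reward from the answer check
    return steps, 0.0                       # step budget exhausted: no credit


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end.
    canned = iter(["Let x = 2 + 2.", "So the answer is 4."])
    steps, r = rollout(
        "What is 2 + 2?",
        generate_step=lambda ctx: next(canned),
        is_final=lambda s: "answer is" in s,
        reward=lambda s: 1.0 if "4" in s else 0.0,
    )
    print(steps, r)
```

An RL trainer would generate many such trajectories and update the policy toward those that earn the reward.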
+
+ Chain-of-Action-Thought reasoning (COAT)
+
+ The key challenge of achieving autoregressive search is enabling the LLM to determine when to reflect, continue, or explore alternative solutions without external intervention. To enable this, we introduce several special meta-action tokens that guide the LLM's reasoning process:
+
+ - Continue Reasoning (<|continue|>): encourages the LLM to build upon its current reasoning trajectory by generating the next intermediate step.
+ - Reflect (<|reflect|>): prompts the model to pause and verify the correctness of prior reasoning steps.
+ - Explore Alternative Solution (<|explore|>): signals the model to identify critical flaws in its reasoning and explore a new solution.
+
+ We refer to this formulation as Chain-of-Action-Thought (COAT) reasoning. Each COAT reasoning step is a sequence of tokens, starting with one of the meta-action tokens.
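
As a small illustration of that structure, the sketch below parses a COAT-style output into (meta-action, step) pairs by splitting on the three meta-action tokens. Only the token names come from the card; the demo trajectory string and the helper itself are invented for demonstration.

```python
import re
from typing import List, Tuple

# The three meta-action tokens named in the model card.
META_ACTIONS = ["<|continue|>", "<|reflect|>", "<|explore|>"]


def split_coat_steps(text: str) -> List[Tuple[str, str]]:
    """Split a COAT-style generation into (meta-action token, step text) pairs.

    Each COAT reasoning step starts with one of the meta-action tokens, so we
    split on those tokens and pair each token with the text that follows it.
    """
    pattern = "(" + "|".join(re.escape(tok) for tok in META_ACTIONS) + ")"
    parts = re.split(pattern, text)          # [prefix, token, chunk, token, chunk, ...]
    return [(parts[i], parts[i + 1].strip()) for i in range(1, len(parts), 2)]


if __name__ == "__main__":
    # Invented trajectory, purely to show the expected shape of a COAT output.
    demo = (
        "<|continue|>Compute 12 * 13 = 156."
        "<|reflect|>Check: 12 * 10 + 12 * 3 = 120 + 36 = 156. Correct."
        "<|continue|>So the answer is 156."
    )
    for action, step in split_coat_steps(demo):
        print(action, "->", step)
```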
+
+ ---
  ## Use with llama.cpp
  Install llama.cpp through brew (works on Mac and Linux)
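
The CLI steps continue in the full README. As an alternative to the CLI, here is a minimal sketch using the llama-cpp-python bindings (requires `pip install llama-cpp-python huggingface_hub`); the repo id and GGUF filename below are placeholders, not values taken from this card.

```python
# Minimal sketch: load a GGUF build of the model via llama-cpp-python.
# The repo_id and filename are placeholders; substitute this repo's actual GGUF file.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="your-username/Satori-7B-Round2-GGUF",  # placeholder repo id
    filename="*q4_k_m.gguf",                        # placeholder quant; glob patterns are supported
    n_ctx=4096,                                     # context window
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Solve: 12 * 13 = ?"}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```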