Update README.md
README.md CHANGED
@@ -20,23 +20,36 @@ Our ongoing projects include:

## Models and Performance

We have developed several models, including AstroSage-LLaMA-3.1-70B ([de Haan et al. 2025b](https://arxiv.org/abs/2505.17592)), AstroSage-LLaMA-3.1-8B ([de Haan et al. 2025a](https://arxiv.org/abs/2411.09012)), AstroLLaMA-2-70B ([Pan et al. 2024](https://arxiv.org/abs/2409.19750)), and AstroLLaMA-3-8B ([Pan et al. 2024](https://arxiv.org/abs/2409.19750)). Our AstroSage models have demonstrated strong performance in astronomy Q&A tasks ([Ting et al. 2024](https://arxiv.org/abs/2407.11194)):

| Model | Score (%) |
|-------|-----------|
| **AstroSage-LLaMA-3.1-70B (AstroMLab)** | **86.2** |
| Claude-4-Opus | 86.3 |
| o3 | 85.4 |
| Claude-4-Sonnet | 85.0 |
| Gemini-2.5-Pro | 84.8 |
| GPT-4.1 | 84.7 |
| o4-Mini | 84.7 |
| Deepseek-R1 | 84.4 |
| Qwen-3-235B | 84.0 |
| LLaMA-4-Maverick | 83.4 |
| Deepseek-v3-2503 | 82.9 |
| Gemini-2.5-Flash-0520 | 82.3 |
| LLaMA-4-Scout | 82.2 |
| Mistral-Medium-v3 | 81.8 |
| Grok-3 | 81.7 |
| **AstroSage-LLaMA-3.1-8B (AstroMLab)** | **80.9** |
| Mistral-Large-v2 | 80.8 |
| Qwen-3-32B | 79.7 |
| Mistral-Small-v3.1 | 78.6 |
| Gemini-2-Flash-Lite | 78.4 |
| GPT-4.1-Nano | 78.0 |
| Gemma-3-27B | 76.9 |
| Qwen-3-14B | 76.4 |
| AstroLLaMA-2-7B | 44.3 |

As of this writing in May 2025, AstroSage-LLaMA-3.1-70B ([de Haan et al. 2025b](https://arxiv.org/abs/2505.17592)) achieves one of the highest scores on AstroBench ([Ting et al. 2024](https://arxiv.org/abs/2407.11194)), effectively tying with Claude-4-Opus (86.2 vs. 86.3) and outperforming other leading models, including GPT-4.1, o3, Gemini-2.5-Pro, and Claude-4-Sonnet.
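
Our models can be run with the standard `transformers` stack. Below is a minimal sketch of querying the 8B model; the repo id `AstroMLab/AstroSage-8B` and the use of the LLaMA-3.1 chat template are assumptions here, so check the model cards on our Hugging Face page for the exact ids and prompt format:

```python
# Minimal sketch of querying AstroSage with Hugging Face transformers.
# The repo id below is an assumption; see the AstroMLab Hugging Face
# page for the published model ids.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "AstroMLab/AstroSage-8B"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # roughly 16 GB of GPU memory for an 8B model
    device_map="auto",
)

# Assuming the model follows the LLaMA-3.1 chat format, the tokenizer's
# chat template can build the prompt.
messages = [{"role": "user", "content": "What is the Tully-Fisher relation?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```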