tingyuansen committed · Commit 04c7a6a · verified · 1 Parent(s): de745c3

Update README.md

Files changed (1): README.md (+26 −13)
README.md CHANGED
@@ -20,23 +20,36 @@ Our ongoing projects include:
 
 ## Models and Performance
 
-We have developed several models, including AstroSage-LLaMA-3.1-8B ([de Haan et al. 2024](https://arxiv.org/abs/2411.09012)), AstroLLaMA-2-70B ([Pan et al. 2024](https://arxiv.org/abs/2409.19750)), and AstroLLaMA-3-8B ([Pan et al. 2024](https://arxiv.org/abs/2409.19750)). Our AstroSage-LLaMA-3.1-8B model has demonstrated strong performance in astronomy Q&A tasks ([Ting et al. 2024](https://arxiv.org/abs/2407.11194)):
+We have developed several models, including AstroSage-LLaMA-3.1-70B ([de Haan et al. 2025b](https://arxiv.org/abs/2505.17592)), AstroSage-LLaMA-3.1-8B ([de Haan et al. 2025a](https://arxiv.org/abs/2411.09012)), AstroLLaMA-2-70B ([Pan et al. 2024](https://arxiv.org/abs/2409.19750)), and AstroLLaMA-3-8B ([Pan et al. 2024](https://arxiv.org/abs/2409.19750)). Our AstroSage models have demonstrated strong performance in astronomy Q&A tasks ([Ting et al. 2024](https://arxiv.org/abs/2407.11194)):
 
 | Model | Score (%) |
 |-------|-----------|
+| Claude-4-Opus | 86.3 |
+| **AstroSage-LLaMA-3.1-70B (AstroMLab)** | **86.2** |
+| o3 | 85.4 |
+| Claude-4-Sonnet | 85.0 |
+| Gemini-2.5-Pro | 84.8 |
+| GPT-4.1 | 84.7 |
+| o4-Mini | 84.7 |
+| Deepseek-R1 | 84.4 |
+| Qwen-3-235B | 84.0 |
+| LLaMA-4-Maverick | 83.4 |
+| Deepseek-v3-2503 | 82.9 |
+| Gemini-2.5-Flash-0520 | 82.3 |
+| LLaMA-4-Scout | 82.2 |
+| Mistral-Medium-v3 | 81.8 |
+| Grok-3 | 81.7 |
 | **AstroSage-LLaMA-3.1-8B (AstroMLab)** | **80.9** |
-| LLaMA-3.1-8B | 73.7 |
-| Phi-3.5-4B | 72.8 |
-| Gemma-2-9B | 71.5 |
-| LLaMA-2-70B | 70.7 |
-| Qwen-2.5-7B | 70.4 |
-| Yi-1.5-9B | 68.4 |
-| InternLM-2.5-7B | 64.5 |
-| Mistral-7B-v0.3 | 63.9 |
-| ChatGLM3-6B | 50.4 |
-| AstroLLaMA-2-7B (UniverseTBD) | 44.3 |
+| Mistral-Large-v2 | 80.8 |
+| Qwen-3-32B | 79.7 |
+| Mistral-Small-v3.1 | 78.6 |
+| Gemini-2-Flash-Lite | 78.4 |
+| GPT-4.1-Nano | 78.0 |
+| Gemma-3-27B | 76.9 |
+| Qwen-3-14B | 76.4 |
+| AstroLLaMA-2-7B (UniverseTBD) | 44.3 |
 
-AstroSage-LLaMA-3.1-8B ([de Haan et al. 2024](https://arxiv.org/abs/2411.09012)), our lightweight model, currently achieves the highest score among the ~8B parameter models in its astronomy knowledge recall ability.
+As of this writing in May 2025, AstroSage-LLaMA-3.1-70B ([de Haan et al. 2025b](https://arxiv.org/abs/2505.17592)) achieves among the highest scores on AstroBench ([Ting et al. 2024](https://arxiv.org/abs/2407.11194)), effectively tying with Claude-4-Opus (86.2 vs. 86.3) and outperforming other leading models, including GPT-4.1, o3, Gemini-2.5-Pro, and Claude-4-Sonnet.
 
 ![Cost and performance trade-off in astronomical Q&A](https://cdn-uploads.huggingface.co/production/uploads/643f1ddce2ea47d170103537/ip0Bk-LZRrCArimets4H7.png)