AstroSage-Llama-3.1-8B

https://arxiv.org/abs/2411.09012

AstroSage-Llama-3.1-8B is a domain-specialized natural-language AI assistant tailored for research in astronomy, astrophysics, and cosmology. Trained on the complete collection of astronomy-related arXiv papers from 2007-2024, along with millions of synthetically generated question-answer pairs and other astronomical literature, AstroSage-Llama-3.1-8B demonstrates strong proficiency across a wide range of astronomy questions. This achievement demonstrates the potential of domain specialization in AI, suggesting that focused training can yield capabilities exceeding those of much larger, general-purpose models.

Model Details

  • Base Architecture: Llama 3.1
  • Base Model: Meta-Llama-3.1-8B
  • Parameters: 8 billion
  • Training Focus: Astronomy, Astrophysics, Cosmology, and Astronomical Instrumentation
  • License: Llama 3.1 Community License
  • Development Process:
    1. Continued Pre-training (CPT) on astronomical literature
    2. Supervised Fine-tuning (SFT) on QA pairs and instruction sets
    3. Model merging with Meta-Llama-3.1-8B-Instruct (75% CPT+SFT / 25% Meta-Instruct)
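
The 75/25 merge in step 3 corresponds to a linear (weighted-average) combination of model weights. The sketch below illustrates such a merge; the local path for the intermediate CPT+SFT checkpoint is a hypothetical placeholder (that checkpoint is not released separately), and the actual merge may have been performed with dedicated merging tooling.

import torch
from transformers import AutoModelForCausalLM

# "path/to/astrosage-cpt-sft" is a hypothetical placeholder for the CPT+SFT checkpoint
cpt_sft = AutoModelForCausalLM.from_pretrained(
    "path/to/astrosage-cpt-sft", torch_dtype=torch.bfloat16
)
instruct = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B-Instruct", torch_dtype=torch.bfloat16
)

# Linear merge: 75% CPT+SFT weights + 25% Meta-Instruct weights, parameter by parameter
instruct_state = instruct.state_dict()
merged_state = {
    name: 0.75 * param + 0.25 * instruct_state[name]
    for name, param in cpt_sft.state_dict().items()
}
cpt_sft.load_state_dict(merged_state)
cpt_sft.save_pretrained("astrosage-merged")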

Using the Model

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer (weights are distributed in bfloat16)
model = AutoModelForCausalLM.from_pretrained(
    "AstroMLab/AstroSage-8B",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("AstroMLab/AstroSage-8B")

# Function to generate a response
def generate_response(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=128,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
    response = outputs[0][inputs['input_ids'].shape[-1]:]
    decoded = tokenizer.decode(response, skip_special_tokens=True)

    return decoded

# Example usage
prompt = """
You are an expert in general astrophysics. Your task is to answer the following question:
What are the main components of a galaxy?
"""
response = generate_response(prompt)
print(response)
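
Because the final model is merged with Meta-Llama-3.1-8B-Instruct, it should also work with the standard Llama 3.1 chat template. The following sketch reuses the model and tokenizer loaded above; the sampling settings are illustrative rather than recommended values.

# Build a chat-formatted prompt with the tokenizer's chat template
messages = [
    {"role": "system", "content": "You are an expert in general astrophysics."},
    {"role": "user", "content": "What are the main components of a galaxy?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))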

Model Improvements and Performance

AstroSage-Llama-3.1-8B achieves strong results on astronomy multiple-choice question benchmarking:

Model                      Score (%)
AstroSage-Llama-3.1-8B     80.9
GPT-4o                     80.4
LLaMA-3.1-8B               73.7
Gemma-2-9B                 71.5
Qwen-2.5-7B                70.4
Yi-1.5-9B                  68.4
InternLM-2.5-7B            64.5
Mistral-7B-v0.3            63.9
ChatGLM3-6B                50.4

The model demonstrates:

  • Higher scores than all of the other ~7-9B-parameter models tested
  • Performance comparable to GPT-4o (80.4%)
  • Roughly 1000x greater cost-effectiveness than proprietary models
  • An improvement of about 7 percentage points over the base Llama-3.1-8B model

Training Data

  • Continued Pre-training:
    • ~250,000 arXiv preprints (2007-2024) from astro-ph and gr-qc
    • Astronomy-related Wikipedia articles
    • Selected astronomy textbooks
    • Total: 3.3 billion tokens, 19.9 GB plaintext
  • Supervised Fine-tuning:
    • 8.8 million curated QA pairs
    • Filtered Infinity-Instruct-7M dataset
    • Paper summaries and metadata
    • Total: 2.0 billion tokens, 9.8 GB plaintext

Intended Use

  • Curiosity-driven question answering
  • Brainstorming new ideas
  • Astronomical research assistance
  • Educational support in astronomy
  • Literature review and summarization
  • Scientific explanation of concepts (example prompts below)
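
As an illustration of the explanation and summarization use cases, the generate_response helper defined above can be called with prompts along these lines (the prompt wording is illustrative, not prescribed by the authors):

# Scientific explanation of a concept
explanation_prompt = (
    "You are an expert in cosmology. Explain the significance of baryon "
    "acoustic oscillations for measuring the expansion history of the Universe."
)
print(generate_response(explanation_prompt))

# Literature summarization: paste the passage of interest into the prompt
abstract_text = "..."  # replace with the abstract or passage to summarize
summary_prompt = "Summarize the following abstract in two sentences:\n" + abstract_text
print(generate_response(summary_prompt))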

Limitations

  • Training data cutoff: January 2024
  • As with all LLMs, hallucinations are possible
  • Limited by 8B parameter size for complex reasoning
  • Paper metadata not perfectly memorized
  • Performance primarily validated on multiple-choice questions
  • Primarily trained for use in English

Technical Specifications

  • Architecture: Based on Meta-Llama 3.1
  • Training Infrastructure: ORNL OLCF Frontier
  • Hosting: Hugging Face Hub (AstroMLab/AstroSage-8B)

Ethical Considerations

While this model is designed for scientific use:

  • It should not be used as the sole source for critical research decisions
  • Its output should be verified against primary sources
  • It may reflect biases present in the astronomical literature

Citation and Contact

  • Corresponding author: Tijmen de Haan (tijmen dot dehaan at gmail dot com)
  • AstroMLab: astromachinelearninglab at gmail dot com
  • Please cite the AstroMLab 3 paper when referencing this model:
@misc{dehaan2024astromlab3,
      title={AstroMLab 3: Achieving GPT-4o Level Performance in Astronomy with a Specialized 8B-Parameter Large Language Model}, 
      author={Tijmen de Haan and Yuan-Sen Ting and Tirthankar Ghosal and Tuan Dung Nguyen and Alberto Accomazzi and Azton Wells and Nesar Ramachandra and Rui Pan and Zechang Sun},
      year={2024},
      eprint={2411.09012},
      archivePrefix={arXiv},
      primaryClass={astro-ph.IM},
      url={https://arxiv.org/abs/2411.09012}, 
}