EnvGPT: Leveraging a Large Language Model for Environmental Science

---
license: mit
datasets:
- SustcZhangYX/ChatEnv
language:
- en
tags:
- Environmental Science
---
<div align="center">
<img src="LOGO.PNG" width="450px">
<h1 align="center"><font face="Arial">EnvGPT: Leveraging a Large Language Model for Environmental Science</font></h1>

</div>

**EnvGPT** is the first domain-specific large language model tailored for environmental science tasks.

Environmental science presents unique challenges for LLMs due to its interdisciplinary nature. EnvGPT was developed to address these challenges by leveraging a domain-specific environmental science instruction dataset and benchmark.

*The model was fine-tuned on the environmental science instruction dataset [ChatEnv](https://huggingface.co/datasets/SustcZhangYX/ChatEnv) through Supervised Fine-Tuning (SFT). The dataset contains **107,197,329** tokens in total, reflecting its breadth of coverage for environmental science tasks.*


## 🚀 Getting Started

### Download the model

The model weights are hosted on Hugging Face: [EnvGPT](https://huggingface.co/SustcZhangYX/EnvGPT). Clone the repository with Git LFS:

```shell
git lfs install
git clone https://huggingface.co/SustcZhangYX/EnvGPT
```
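
As an alternative to `git clone`, the repository can also be fetched from Python. The sketch below assumes the `huggingface_hub` package is installed; the helper name `download_envgpt` and the default target directory `EnvGPT` are illustrative choices, not part of the project.

```python
from pathlib import Path

# Repository ID taken from the model link above
REPO_ID = "SustcZhangYX/EnvGPT"


def download_envgpt(local_dir: str = "EnvGPT") -> str:
    """Download a snapshot of the model repository and return its local path.

    `local_dir` is a hypothetical default; point it wherever you keep models.
    """
    # Imported lazily so this helper can be defined without the package installed
    from huggingface_hub import snapshot_download

    target = Path(local_dir)
    return snapshot_download(repo_id=REPO_ID, local_dir=str(target))
```

Calling `download_envgpt()` should return a directory path that can be used as `model_path` in the usage example.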

### Model Usage

Here is a Python code snippet that demonstrates how to load the tokenizer and model and generate text using EnvGPT.

```python
import transformers
import torch

# Set the path to your local model
model_path = "YOUR_LOCAL_MODEL_PATH"

pipeline = transformers.pipeline(
    "text-generation",
    model=model_path,  # Use local model path
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are an expert assistant in environmental science, EnvGPT. You are a helpful assistant."},
    {"role": "user", "content": "What is the definition of environmental science?"},
]

# Sampling parameters are passed directly in the pipeline call
outputs = pipeline(
    messages,
    max_new_tokens=4096,
    do_sample=True,  # Enable sampling so top_p and temperature take effect
    top_p=0.7,  # Nucleus sampling threshold
    temperature=0.9,  # Sampling temperature
)

print(outputs[0]["generated_text"])
```

This code demonstrates how to load the tokenizer and model from your local path, define environmental science-specific prompts, and generate responses using sampling techniques like top-p and temperature.
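
When the pipeline receives a list of chat messages, recent versions of `transformers` return the whole conversation in `generated_text`, with the model's reply as the last message. The sketch below shows one way to pull out only the reply; the helper `extract_reply` and the `example_outputs` data are hypothetical and merely mirror the shape of the pipeline result, so no model is needed to run it.

```python
from typing import Any


def extract_reply(outputs: list[dict[str, Any]]) -> str:
    """Return the assistant's text from a text-generation pipeline result."""
    generated = outputs[0]["generated_text"]
    # Chat input: generated_text is the full message list, reply last
    if isinstance(generated, list):
        return generated[-1]["content"]
    # Plain-string input: generated_text is already a string
    return generated


# Hypothetical result with the same shape as a chat-style pipeline output
example_outputs = [
    {
        "generated_text": [
            {"role": "system", "content": "You are an expert assistant in environmental science, EnvGPT."},
            {"role": "user", "content": "What is the definition of environmental science?"},
            {"role": "assistant", "content": "(model reply would appear here)"},
        ]
    }
]

print(extract_reply(example_outputs))
```

With the real pipeline, `extract_reply(outputs)` would take the place of `outputs[0]["generated_text"]` in the `print` call.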

## 🌏 Acknowledgement

EnvGPT is fine-tuned from the open-source [LLaMA](https://huggingface.co/meta-llama) models. We thank Meta AI for their contributions to the community.

## ❗ Disclaimer

This project is intended solely for academic research and exploration. Please note that, like all large language models, this model may exhibit limitations, including potential inaccuracies or hallucinations in generated outputs.

## Limitations

- The model may produce hallucinated outputs or inaccuracies, which are inherent to large language models.
- The model's self-identification has not been specifically optimized, so it may generate content that resembles outputs from other LLaMA-based models or similar architectures.
- Generated outputs can vary between attempts due to sensitivity to prompt phrasing and token context.

## 🚩 Citation

If you use EnvGPT in your research or applications, please cite this work as follows:

```Markdown
[Placeholder for Citation]  
Please refer to the forthcoming publication for details about EnvGPT. 
This section will be updated with the citation once the paper is officially published. 
```