---
tags:
- PretrainModel
- TCM
- transformer
- herberta
- text-embedding
license: apache-2.0
language:
- zh
- en
metrics:
- accuracy
base_model:
- hfl/chinese-roberta-wwm-ext-large
new_version: XiaoEnn/herberta_seq_512_V2
inference: true
library_name: transformers
---
# Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks
## Introduction
Herberta is a pretrained model developed by the Angelpro Team to advance representation learning and modeling for Traditional Chinese Medicine (TCM). Built on **chinese-roberta-wwm-ext-large**, Herberta is further pretrained with a masked language modeling (MLM) objective on **700 ancient TCM books (538.95M)** and **48 modern Chinese medicine textbooks (54M)**, yielding a robust model for embedding generation and TCM-specific downstream tasks.
The name "Herberta" combines "Herb" and "RoBERTa" to signify its purpose in herbal medicine research. Herberta is well suited to applications such as:
- **Encoder for Herbal Formulas**: Generating meaningful embeddings for TCM formulations.
- **Domain-Specific Word Embedding**: Serving the Chinese medicine text domain.
- **Support for TCM Downstream Tasks**: Including classification, labeling, and more.
---
## Pretraining Experiments
### Dataset
| Data Type | Quantity | Data Size |
|------------------------|-------------|------------------|
| **Ancient TCM Books** | 700 books | ~538.95M |
| **Modern TCM Textbooks** | 48 books | ~54M |
| **Mixed-Type Dataset** | Combined dataset | ~637.8M |
### Pretraining Results
| Model | Eval Accuracy | Validation Loss | Validation Perplexity |
|-------------------------|---------------|-----------------|-----------------------|
| **herberta_seq_512_v2** | 0.9841 | 0.04367 | 1.083 |
| **herberta_seq_128_v2** | 0.9406 | 0.2877 | 1.333 |
| **herberta_seq_512_v3** | 0.755 | 1.100 | 3.010 |
#### Metrics Comparison
![Accuracy](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/RDgI-0Ro2kMiwV853Wkgx.png)
![Loss](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/BJ7enbRg13IYAZuxwraPP.png)
![Perplexity](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/lOohRMIctPJZKM5yEEcQ2.png)
### Pretraining Configuration
#### Modern Textbooks Version
- Pretraining strategy: dynamic masking + warmup + linear decay
- Sequence length: 512
- Batch size: 16
- Learning rate: 1e-5 peak, warmed up over the first 10% of steps, then linearly decayed
- Tokenization: continuous 512-token chunks without sentence segmentation (see the sketch below)
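The exact training script is not published in this card; the following is a minimal sketch of the configuration described above, assuming the standard Hugging Face `Trainer` API. `train_ds` is a hypothetical placeholder for a dataset already tokenized into continuous 512-token chunks.
```python
# Minimal MLM pretraining sketch (assumptions: Hugging Face Trainer API;
# `train_ds` is a placeholder for a dataset of continuous 512-token chunks).
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Dynamic masking: the collator re-samples masked positions every time a batch is built.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="herberta-mlm",
    per_device_train_batch_size=16,   # batch size 16
    learning_rate=1e-5,               # 1e-5 initial (peak) rate
    warmup_ratio=0.1,                 # warmup over the first 10% of steps
    lr_scheduler_type="linear",       # linear decay after warmup
    num_train_epochs=3,               # illustrative value, not taken from this card
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,           # placeholder: pre-tokenized 512-token chunks
    data_collator=collator,
)
trainer.train()
```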
---
## Downstream Task: TCM Pattern Classification
### Task Definition
Using **321 pattern descriptions** extracted from TCM internal medicine textbooks, we evaluated the classification performance of four models:
1. **Herberta_seq_512_v2**: Pretrained on 700 ancient TCM books.
2. **Herberta_seq_512_v3**: Pretrained on 48 modern TCM textbooks.
3. **Herberta_seq_128_v2**: Pretrained on 700 ancient TCM books (128-length sequences).
4. **Roberta**: Baseline model without TCM-specific pretraining.
### Training Configuration
- Max Sequence Length: 512
- Batch Size: 16
- Epochs: 30
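The fine-tuning code is not included in this card; under the configuration above, a sketch using a standard `AutoModelForSequenceClassification` setup could look as follows. `num_patterns`, `train_ds`, and `eval_ds` are hypothetical placeholders, not artifacts of the original experiments.
```python
# Illustrative fine-tuning sketch for TCM pattern classification
# (max length 512, batch size 16, 30 epochs).
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_name = "XiaoEnn/herberta"
num_patterns = 2  # placeholder: set to the actual number of TCM pattern labels

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_patterns)

args = TrainingArguments(
    output_dir="herberta-pattern-cls",
    per_device_train_batch_size=16,   # batch size 16
    num_train_epochs=30,              # 30 epochs
)

# train_ds / eval_ds: datasets tokenized to max length 512 with a "labels" column (placeholders)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```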
### Results
| Model Name | Eval Accuracy | Eval F1 | Eval Precision | Eval Recall |
|--------------------------|---------------|-----------|----------------|-------------|
| **Herberta_seq_512_v2** | **0.9454** | **0.9293** | **0.9221** | **0.9454** |
| **Herberta_seq_512_v3** | 0.8989 | 0.8704 | 0.8583 | 0.8989 |
| **Herberta_seq_128_v2** | 0.8716 | 0.8443 | 0.8351 | 0.8716 |
| **Roberta** | 0.8743 | 0.8425 | 0.8311 | 0.8743 |
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/1yG96YdzXuxQlTfjOmXqg.png)
#### Summary
The **Herberta_seq_512_v2** model, pretrained on 700 ancient TCM books, exhibited superior performance across all evaluation metrics. This highlights the significance of domain-specific pretraining on larger and historically richer datasets for TCM applications.
---
## Quickstart
### Use Hugging Face
```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```
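Note that the mean pooling above also averages over padding positions. An attention-mask-weighted mean (a common alternative, not prescribed by this card) can be computed from the same `inputs` and `outputs`:
```python
# Attention-mask-aware mean pooling: padded positions are excluded from the average.
# Continues from the Quickstart snippet above (`inputs`, `outputs`).
mask = inputs["attention_mask"].unsqueeze(-1).float()        # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)       # sum over real tokens only
masked_embedding = summed / mask.sum(dim=1).clamp(min=1e-9)  # divide by real token count
print("Masked-mean embedding shape:", masked_embedding.shape)
```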
If you find our work helpful, please consider citing us:
```bibtex
@misc{herberta-embedding,
  title  = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  url    = {https://github.com/15392778677/herberta},
  author = {Yehan Yang and Xinhan Zheng},
  month  = {December},
  year   = {2024}
}

@article{herberta-technical-report,
  title       = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  author      = {Yehan Yang and Xinhan Zheng},
  institution = {Beijing Angelpro Technology Co., Ltd.},
  year        = {2024},
  note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}
```