|
--- |
|
tags: |
|
- PretrainModel |
|
- TCM |
|
- transformer |
|
- herberta |
|
- text-embedding |
|
license: apache-2.0 |
|
language: |
|
- zh |
|
- en |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- hfl/chinese-roberta-wwm-ext-large |
|
new_version: XiaoEnn/herberta_seq_512_V2 |
|
inference: true |
|
library_name: transformers |
|
--- |
|
|
|
|
|
# Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks |
|
|
|
|
|
## Introduction |
|
|
|
Herberta is a pretrained model developed by the Angelpro Team to advance representation learning and modeling for Traditional Chinese Medicine (TCM). Built on **chinese-roberta-wwm-ext-large**, Herberta is further pretrained with a masked language modeling (MLM) objective on **700 ancient TCM books (538.95M)** and **48 modern Chinese medicine textbooks (54M)**, yielding a robust model for embedding generation and TCM-specific downstream tasks.
|
|
|
We named the model "Herberta" by combining "Herb" and "RoBERTa" to reflect its focus on herbal medicine research. Herberta is well suited to applications such as:
|
|
|
- **Encoder for Herbal Formulas**: Generating meaningful embeddings for TCM formulations. |
|
- **Domain-Specific Word Embeddings**: Providing representations tailored to the Chinese medicine text domain.
|
- **Support for TCM Downstream Tasks**: Including classification, labeling, and more. |
|
|
|
--- |
|
|
|
## Pretraining Experiments |
|
|
|
### Dataset |
|
|
|
| Data Type | Quantity | Data Size |
|--------------------------|------------------|-----------|
| **Ancient TCM Books** | 700 books | ~538.95M |
| **Modern TCM Textbooks** | 48 books | ~54M |
| **Mixed-Type Dataset** | Combined dataset | ~637.8M |
|
|
|
### Pretraining Results
|
|
|
|
|
| Model | Eval Accuracy | Validation Loss | Validation Perplexity |
|--------------------------|---------------|-----------------|-----------------------|
| **herberta_seq_512_v2** | 0.9841 | 0.04367 | 1.083 |
| **herberta_seq_128_v2** | 0.9406 | 0.2877 | 1.333 |
| **herberta_seq_512_v3** | 0.755 | 1.100 | 3.010 |
|
|
|
#### Metrics Comparison |
|
|
|
 |
|
 |
|
 |
|
|
|
|
|
### Pretraining Configuration |
|
|
|
#### Modern Textbooks Version |
|
- Pretraining Strategy: Dynamic masking + warmup + linear decay
- Sequence Length: 512
- Batch Size: 16
- Learning Rate: Warmup over the first 10% of steps, then linear decay from an initial rate of 1e-5
- Tokenization: Continuous tokenization into 512-token blocks without sentence segmentation
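For reference, the sketch below shows how this configuration might be reproduced with the Hugging Face `Trainer`. The corpus file name, masking probability, and epoch count are illustrative assumptions, not the released training assets.

```python
# Minimal MLM pretraining sketch for the configuration above (assumptions noted inline).
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Assumed corpus file: one TCM passage per line.
raw = load_dataset("text", data_files={"train": "tcm_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], return_special_tokens_mask=True)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

block_size = 512

def group_texts(examples):
    # Continuous tokenization: concatenate everything, then split into
    # 512-token blocks without sentence segmentation.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total = (len(concatenated["input_ids"]) // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total, block_size)]
        for k, t in concatenated.items()
    }

blocks = tokenized.map(group_texts, batched=True)

# Dynamic masking: a new random mask is drawn each time a batch is collated.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="herberta_mlm",
    per_device_train_batch_size=16,
    learning_rate=1e-5,
    warmup_ratio=0.1,            # warmup over the first 10% of steps
    lr_scheduler_type="linear",  # linear decay after warmup
    num_train_epochs=3,          # illustrative; the card does not specify the epoch count
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=blocks["train"],
    data_collator=collator,
)
trainer.train()
```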
|
|
|
|
|
--- |
|
|
|
## Downstream Task: TCM Pattern Classification |
|
|
|
### Task Definition |
|
Using **321 pattern descriptions** extracted from TCM internal medicine textbooks, we evaluated the classification performance of four models:
|
|
|
1. **Herberta_seq_512_v2**: Pretrained on 700 ancient TCM books. |
|
2. **Herberta_seq_512_v3**: Pretrained on 48 modern TCM textbooks. |
|
3. **Herberta_seq_128_v2**: Pretrained on 700 ancient TCM books (128-length sequences). |
|
4. **Roberta**: Baseline model without TCM-specific pretraining. |
|
|
|
### Training Configuration |
|
- Max Sequence Length: 512 |
|
- Batch Size: 16 |
|
- Epochs: 30 |
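A minimal fine-tuning sketch under these settings is shown below, assuming the pattern descriptions are available as a CSV with `text` and integer-encoded `label` columns; the file name and the train/eval split are assumptions for illustration.

```python
# Illustrative fine-tuning sketch for the pattern-classification task.
# "patterns.csv" (columns: text, label with integer-encoded pattern IDs) is assumed,
# not a released asset.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "XiaoEnn/herberta"
tokenizer = AutoTokenizer.from_pretrained(model_name)

data = load_dataset("csv", data_files={"train": "patterns.csv"})["train"]
data = data.train_test_split(test_size=0.2, seed=42)
num_labels = len(set(data["train"]["label"]))

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

data = data.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="herberta_pattern_cls",
    per_device_train_batch_size=16,
    num_train_epochs=30,
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # enables dynamic padding via DataCollatorWithPadding
)
trainer.train()
```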
|
|
|
### Results |
|
|
|
| Model Name | Eval Accuracy | Eval F1 | Eval Precision | Eval Recall |
|--------------------------|---------------|------------|----------------|-------------|
| **Herberta_seq_512_v2** | **0.9454** | **0.9293** | **0.9221** | **0.9454** |
| **Herberta_seq_512_v3** | 0.8989 | 0.8704 | 0.8583 | 0.8989 |
| **Herberta_seq_128_v2** | 0.8716 | 0.8443 | 0.8351 | 0.8716 |
| **Roberta** | 0.8743 | 0.8425 | 0.8311 | 0.8743 |
|
|
|
 |
|
|
|
|
|
#### Summary |
|
The **Herberta_seq_512_v2** model, pretrained on 700 ancient TCM books, achieved the best scores on all evaluation metrics, underscoring the value of domain-specific pretraining on larger, historically richer corpora for TCM applications.
|
|
|
--- |
|
|
|
## Quickstart |
|
|
|
### Using Hugging Face Transformers
|
|
|
```python |
|
import torch
from transformers import AutoTokenizer, AutoModel
|
|
|
model_name = "XiaoEnn/herberta" |
|
|
|
# Load tokenizer and model |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
model = AutoModel.from_pretrained(model_name) |
|
|
|
# Input text |
|
text = "中医理论是我国传统文化的瑰宝。" |
|
|
|
# Tokenize and prepare input |
|
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128) |
|
|
|
# Get the model's outputs |
|
with torch.no_grad():
    outputs = model(**inputs)
|
|
|
# Get the embedding (sentence-level average pooling) |
|
sentence_embedding = outputs.last_hidden_state.mean(dim=1) |
|
|
|
print("Embedding shape:", sentence_embedding.shape) |
|
print("Embedding vector:", sentence_embedding) |
|
|
|
``` |
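The mean-pooled vectors can also be compared directly, for example to gauge how similar two formula or pattern descriptions are. The sketch below reuses the `tokenizer` and `model` loaded above; the two input texts and the cosine-similarity choice are only illustrative. For variable-length batches, masking out padding tokens before averaging is a common refinement.

```python
import torch
import torch.nn.functional as F

def embed(texts):
    # Mean-pool the last hidden state, as in the example above.
    inputs = tokenizer(texts, return_tensors="pt", truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1)

# Illustrative formula descriptions (Gui Zhi Tang vs. Ma Huang Tang).
embeddings = embed([
    "桂枝汤：桂枝、芍药、生姜、大枣、甘草。",
    "麻黄汤：麻黄、桂枝、杏仁、甘草。",
])
similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print("Cosine similarity:", similarity.item())
```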
|
|
|
If you find our work helpful, please consider citing us:
|
|
|
```bibtex
@misc{herberta-embedding,
  title  = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  url    = {https://github.com/15392778677/herberta},
  author = {Yehan Yang and Xinhan Zheng},
  month  = {December},
  year   = {2024}
}

@article{herberta-technical-report,
  title       = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  author      = {Yehan Yang and Xinhan Zheng},
  institution = {Beijing Angelpro Technology Co., Ltd.},
  year        = {2024},
  note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}
```