---
tags:
- PretrainModel
- TCM
- transformer
- herberta
- text-embedding
license: apache-2.0
language:
- zh
- en
metrics:
- accuracy
base_model:
- hfl/chinese-roberta-wwm-ext-large
new_version: XiaoEnn/herberta_seq_512_V2
inference: true
library_name: transformers
---
# Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks
## Introduction
Herberta is a pretrained model developed by the Angelpro Team to advance representation learning and modeling for Traditional Chinese Medicine (TCM). Built on **chinese-roberta-wwm-ext-large**, Herberta is further pretrained with a masked language modeling (MLM) objective on **700 ancient TCM books (538.95M)** and **48 modern Chinese medicine textbooks (54M)**, yielding a robust model for embedding generation and TCM-specific downstream tasks.
The name "Herberta" combines "Herb" and "RoBERTa" to signify its purpose in herbal medicine research. Herberta is well suited to applications such as:
- **Encoder for Herbal Formulas**: Generating meaningful embeddings for TCM formulations.
- **Domain-Specific Word Embedding**: Serving the Chinese medicine text domain.
- **Support for TCM Downstream Tasks**: Including classification, labeling, and more.
---
## Pretraining Experiments
### Dataset
| Data Type | Quantity | Data Size |
|------------------------|-------------|------------------|
| **Ancient TCM Books** | 700 books | ~538.95M |
| **Modern TCM Textbooks** | 48 books | ~54M |
| **Mixed-Type Dataset** | Combined dataset | ~637.8M |
### Pretraining Results
| Model | Eval Accuracy | Validation Loss | Validation Perplexity |
|-------------------------|---------------|-----------------|-----------------------|
| **herberta_seq_512_v2** | 0.9841 | 0.04367 | 1.083 |
| **herberta_seq_128_v2** | 0.9406 | 0.2877 | 1.333 |
| **herberta_seq_512_v3** | 0.755 | 1.100 | 3.010 |
#### Metrics Comparison
![Accuracy](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/RDgI-0Ro2kMiwV853Wkgx.png)
![Loss](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/BJ7enbRg13IYAZuxwraPP.png)
![Perplexity](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/lOohRMIctPJZKM5yEEcQ2.png)
### Pretraining Configuration
#### Modern Textbooks Version
- Pretraining strategy: dynamic masking + warmup + linear decay
- Sequence length: 512
- Batch size: 16
- Learning rate: 1e-5 peak, warmed up over the first 10% of steps, then linearly decayed
- Tokenization: continuous 512-token chunks without sentence segmentation (see the sketch below)
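The exact training script is not published in this card; the following is a minimal sketch of the configuration described above, assuming the standard Hugging Face `Trainer` API. `train_ds` is a hypothetical placeholder for a dataset already tokenized into continuous 512-token chunks.
```python
# Minimal MLM pretraining sketch (assumptions: Hugging Face Trainer API;
# `train_ds` is a placeholder for a dataset of continuous 512-token chunks).
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "hfl/chinese-roberta-wwm-ext-large"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForMaskedLM.from_pretrained(base)

# Dynamic masking: the collator re-samples masked positions every time a batch is built.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="herberta-mlm",
    per_device_train_batch_size=16,   # batch size 16
    learning_rate=1e-5,               # 1e-5 initial (peak) rate
    warmup_ratio=0.1,                 # warmup over the first 10% of steps
    lr_scheduler_type="linear",       # linear decay after warmup
    num_train_epochs=3,               # illustrative value, not taken from this card
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,           # placeholder: pre-tokenized 512-token chunks
    data_collator=collator,
)
trainer.train()
```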
---
## Downstream Task: TCM Pattern Classification
### Task Definition
Using **321 pattern descriptions** extracted from TCM internal medicine textbooks, we evaluated the classification performance of four models:
1. **Herberta_seq_512_v2**: Pretrained on 700 ancient TCM books.
2. **Herberta_seq_512_v3**: Pretrained on 48 modern TCM textbooks.
3. **Herberta_seq_128_v2**: Pretrained on 700 ancient TCM books (128-length sequences).
4. **Roberta**: Baseline model without TCM-specific pretraining.
### Training Configuration
- Max Sequence Length: 512
- Batch Size: 16
- Epochs: 30
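The fine-tuning code is not included in this card; under the configuration above, a sketch using a standard `AutoModelForSequenceClassification` setup could look as follows. `num_patterns`, `train_ds`, and `eval_ds` are hypothetical placeholders, not artifacts of the original experiments.
```python
# Illustrative fine-tuning sketch for TCM pattern classification
# (max length 512, batch size 16, 30 epochs).
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

model_name = "XiaoEnn/herberta"
num_patterns = 2  # placeholder: set to the actual number of TCM pattern labels

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_patterns)

args = TrainingArguments(
    output_dir="herberta-pattern-cls",
    per_device_train_batch_size=16,   # batch size 16
    num_train_epochs=30,              # 30 epochs
)

# train_ds / eval_ds: datasets tokenized to max length 512 with a "labels" column (placeholders)
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()
```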
### Results
| Model Name | Eval Accuracy | Eval F1 | Eval Precision | Eval Recall |
|--------------------------|---------------|-----------|----------------|-------------|
| **Herberta_seq_512_v2** | **0.9454** | **0.9293** | **0.9221** | **0.9454** |
| **Herberta_seq_512_v3** | 0.8989 | 0.8704 | 0.8583 | 0.8989 |
| **Herberta_seq_128_v2** | 0.8716 | 0.8443 | 0.8351 | 0.8716 |
| **Roberta** | 0.8743 | 0.8425 | 0.8311 | 0.8743 |
![image/png](https://cdn-uploads.huggingface.co/production/uploads/6564baaa393bae9c194fc32e/1yG96YdzXuxQlTfjOmXqg.png)
#### Summary
The **Herberta_seq_512_v2** model, pretrained on 700 ancient TCM books, exhibited superior performance across all evaluation metrics. This highlights the significance of domain-specific pretraining on larger and historically richer datasets for TCM applications.
---
## Quickstart
### Use Hugging Face
```python
import torch
from transformers import AutoTokenizer, AutoModel

model_name = "XiaoEnn/herberta"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Input text
text = "中医理论是我国传统文化的瑰宝。"

# Tokenize and prepare input
inputs = tokenizer(text, return_tensors="pt", truncation=True, padding="max_length", max_length=128)

# Get the model's outputs
with torch.no_grad():
    outputs = model(**inputs)

# Get the embedding (sentence-level average pooling)
sentence_embedding = outputs.last_hidden_state.mean(dim=1)

print("Embedding shape:", sentence_embedding.shape)
print("Embedding vector:", sentence_embedding)
```
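Note that the mean pooling above also averages over padding positions. An attention-mask-weighted mean (a common alternative, not prescribed by this card) can be computed from the same `inputs` and `outputs`:
```python
# Attention-mask-aware mean pooling: padded positions are excluded from the average.
# Continues from the Quickstart snippet above (`inputs`, `outputs`).
mask = inputs["attention_mask"].unsqueeze(-1).float()        # (batch, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)       # sum over real tokens only
masked_embedding = summed / mask.sum(dim=1).clamp(min=1e-9)  # divide by real token count
print("Masked-mean embedding shape:", masked_embedding.shape)
```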
If you find our work helpful, please consider citing us:
```bibtex
@misc{herberta-embedding,
  title  = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  url    = {https://github.com/15392778677/herberta},
  author = {Yehan Yang and Xinhan Zheng},
  month  = {December},
  year   = {2024}
}

@article{herberta-technical-report,
  title       = {Herberta: A Pretrained Model for TCM Herbal Medicine and Downstream Tasks as Text Embedding Generation},
  author      = {Yehan Yang and Xinhan Zheng},
  institution = {Beijing Angelpro Technology Co., Ltd.},
  year        = {2024},
  note        = {Presented at the 2024 Machine Learning Applications Conference (MLAC)}
}
```