---
library_name: transformers
tags: []
---

# 🧠 GLiClass Gender Classifier: DeBERTaV3 Uni-Encoder (3-Class)

This model performs **text classification** on clinical narratives to determine a patient's **sex or gender**. It was fine-tuned with a **uni-encoder architecture** based on [`microsoft/deberta-v3-small`](https://huggingface.co/microsoft/deberta-v3-small) and outputs one of three labels:

- `male`
- `female`
- `sex undetermined`

---

## 🧪 Task

This is a **multi-class text classification** task over **clinical free text**. The model predicts the gender of a patient from discharge summaries, case descriptions, or medical notes.

> ⚠️ **Keep the labels and the input text in the same language** (e.g., both in Spanish or both in English). Mixing languages may reduce accuracy.

---

## 🧩 Model Architecture

- Base: `microsoft/deberta-v3-small`
- Architecture: `DebertaV2ForSequenceClassification`
- Fine-tuned with a **uni-encoder** setup
- 3 output labels

---

## 🔍 Input Format

Each input sample must be a JSON object like this:

```json
{
  "text": "Paciente de 63 años que refería déficit de agudeza visual (AV)...",
  "all_labels": ["male", "female", "sex undetermined"],
  "true_labels": ["sex undetermined"]
}
```

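Before running inference, it can help to check that every sample has the three fields described in this section. The following is a minimal sketch, not part of the model's tooling: `validate_sample` and `REQUIRED_KEYS` are hypothetical names introduced here, and the checks assume that `true_labels` should always be a subset of `all_labels`.

```python
# Hypothetical helper: checks one sample against the input format described here.
REQUIRED_KEYS = ("text", "all_labels", "true_labels")


def validate_sample(sample):
    """Return a list of problems found in one sample (empty if it looks valid)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in sample]
    if problems:
        return problems

    if not isinstance(sample["text"], str) or not sample["text"].strip():
        problems.append("'text' must be a non-empty string")

    for key in ("all_labels", "true_labels"):
        value = sample[key]
        if not isinstance(value, list) or not all(isinstance(v, str) for v in value):
            problems.append(f"'{key}' must be a list of strings")

    # Assumption: every gold label must also appear among the candidate labels.
    if not problems:
        unknown = set(sample["true_labels"]) - set(sample["all_labels"])
        if unknown:
            problems.append(f"true_labels not in all_labels: {sorted(unknown)}")
    return problems


sample = {
    "text": "Paciente de 63 años que refería déficit de agudeza visual (AV)...",
    "all_labels": ["male", "female", "sex undetermined"],
    "true_labels": ["sex undetermined"],
}
print(validate_sample(sample))  # an empty list means no problems were found
```

Returning a list of messages rather than raising on the first error makes it easier to report every malformed sample in a large file in one pass.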
## Usage example

```python
import json

import torch
from transformers import AutoTokenizer
from gliclass import GLiClassModel, ZeroShotClassificationPipeline

# Use the first GPU when available, otherwise fall back to the CPU
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model_path = "BSC-NLP4BIA/GLiClass-gender-classifier"
classification_type = "single-label"  # or "multi-label"
test_path = "path/to/your/test_data.json"

print(f"🔄 Loading model from {model_path}...")
model = GLiClassModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.to(device)

pipeline = ZeroShotClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    classification_type=classification_type,
    device=device,
)

# The test file is a JSON list of samples in the input format described earlier
with open(test_path, "r", encoding="utf-8") as f:
    test_data = json.load(f)

# 🔍 Automatically infer candidate labels from the dataset
all_labels = set()
for sample in test_data:
    all_labels.update(sample["true_labels"])
candidate_labels = sorted(all_labels)

print(f"🧾 Candidate labels inferred: {candidate_labels}")

results = []

for sample in test_data:
    true_labels = sample["true_labels"]
    output = pipeline(sample["text"], candidate_labels)
    top_results = output[0]  # predictions for the single input text

    # Keep only the highest-scoring label as the prediction
    predicted_labels = [max(top_results, key=lambda x: x["score"])["label"]]
    score_dict = {d["label"]: d["score"] for d in top_results}

    entry = {
        "text": sample["text"],
        "true_labels": true_labels,
        "predicted_labels": predicted_labels,
    }
    # Add scores for each candidate label (0.0 if the pipeline did not return one)
    for label in candidate_labels:
        entry[f"score_{label}"] = score_dict.get(label, 0.0)

    results.append(entry)
```
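The usage example collects a `results` list but does not score it. One possible way to summarize such a list is sketched below; `accuracy` and `confusion_counts` are hypothetical helpers, and the entries in `toy_results` are made up for illustration, not real model output. The sketch assumes each entry carries exactly one true label and one predicted label, as in the single-label setup.

```python
from collections import Counter


def accuracy(results):
    """Fraction of entries whose predicted label set equals the true label set."""
    if not results:
        return 0.0
    correct = sum(
        set(r["predicted_labels"]) == set(r["true_labels"]) for r in results
    )
    return correct / len(results)


def confusion_counts(results):
    """Count (true, predicted) label pairs, assuming one label per entry."""
    return Counter(
        (r["true_labels"][0], r["predicted_labels"][0]) for r in results
    )


# Made-up entries for illustration only; they mirror the keys used in `results`.
toy_results = [
    {"true_labels": ["male"], "predicted_labels": ["male"]},
    {"true_labels": ["female"], "predicted_labels": ["female"]},
    {"true_labels": ["sex undetermined"], "predicted_labels": ["male"]},
    {"true_labels": ["female"], "predicted_labels": ["female"]},
]

print(accuracy(toy_results))  # 0.75
print(confusion_counts(toy_results)[("sex undetermined", "male")])  # 1
```

The pair counts show which classes get confused with each other, which a single accuracy figure hides when one class dominates the test set.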