---
library_name: transformers
tags: []
---
# 🧠 GLiClass Gender Classifier — DeBERTaV3 Uni-Encoder (3-Class)
This model is designed for **text classification** in clinical narratives, specifically for determining a patient's **sex or gender**. It was fine-tuned using a **uni-encoder architecture** based on [`microsoft/deberta-v3-small`](https://huggingface.co/microsoft/deberta-v3-small), and outputs one of three labels:
- `male`
- `female`
- `sex undetermined`
---
## 🧪 Task
This is a **multi-class text classification** task over **clinical free-text**. The model predicts the gender of a patient from discharge summaries, case descriptions, or medical notes.
> ⚠️ **It is strongly recommended to keep the labels and the input text in the same language** (e.g., both in Spanish or both in English) to ensure optimal model performance. Mixing languages may reduce accuracy.
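For example, if the note is written in Spanish, you would typically pass Spanish label strings as well. The sketch below is only illustrative: the Spanish labels are ad-hoc translations (not necessarily the label strings used during fine-tuning), and `pipeline` refers to the classification pipeline built in the Usage Example below.

```python
# Spanish clinical note paired with Spanish candidate labels (illustrative translations;
# the English labels listed above are the ones documented for this model).
texto = "Paciente de 63 años que refería déficit de agudeza visual..."
etiquetas = ["hombre", "mujer", "sexo indeterminado"]  # hypothetical Spanish label strings
print(pipeline(texto, etiquetas)[0])  # `pipeline` as constructed in the Usage Example below
```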
---
## 🧩 Model Architecture
- Base: `microsoft/deberta-v3-small`
- Architecture: `DebertaV2ForSequenceClassification`
- Fine-tuned with a **uni-encoder** setup
- 3 output labels
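As a quick sanity check that the checkpoint loads as described, a minimal sketch (using the same `GLiClassModel` loader shown in the Usage Example below) can report the size of the DeBERTa-v3-small backbone plus the GLiClass head. The exact parameter count is not documented here, so treat the printout as informational only.

```python
from gliclass import GLiClassModel

# Load the fine-tuned checkpoint and print its parameter count (informational only).
model = GLiClassModel.from_pretrained("BSC-NLP4BIA/GLiClass-gender-classifier")
n_params = sum(p.numel() for p in model.parameters())
print(f"Loaded {type(model).__name__} with {n_params / 1e6:.1f}M parameters")
```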
---
## 🔍 Input Format
Each input sample must be a JSON object like this:
```json
{
  "text": "Paciente de 63 años que refería déficit de agudeza visual (AV)...",
  "all_labels": ["male", "female", "sex undetermined"],
  "true_labels": ["sex undetermined"]
}
```
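If your raw data is just a list of clinical notes, a small helper like the one below (file name and texts are placeholders) can wrap them into this structure. Note that the evaluation loop in the Usage Example expects `true_labels` to be filled in.

```python
import json

# Placeholder notes; replace with your own clinical texts.
texts = [
    "Paciente de 63 años que refería déficit de agudeza visual (AV)...",
]

samples = [
    {
        "text": t,
        "all_labels": ["male", "female", "sex undetermined"],
        "true_labels": [],  # fill in the gold label(s) for evaluation
    }
    for t in texts
]

# "test_data.json" is an example path; point test_path in the Usage Example at it.
with open("test_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```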
---
## 🚀 Usage Example
```python
import json
import torch
from transformers import AutoTokenizer
from gliclass import GLiClassModel, ZeroShotClassificationPipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_path = "BSC-NLP4BIA/GLiClass-gender-classifier"
classification_type = "single-label"  # or "multi-label"
test_path = "path/to/your/test_data.json"

print(f"🔄 Loading model from {model_path}...")
model = GLiClassModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.to(device)

pipeline = ZeroShotClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    classification_type=classification_type,
    device=device,
)

with open(test_path, "r", encoding="utf-8") as f:
    test_data = json.load(f)

# 🔍 Automatically infer candidate labels from the dataset
all_labels = set()
for sample in test_data:
    all_labels.update(sample["true_labels"])
candidate_labels = sorted(all_labels)
print(f"🧾 Candidate labels inferred: {candidate_labels}")

results = []
for sample in test_data:
    true_labels = sample["true_labels"]
    output = pipeline(sample["text"], candidate_labels)
    top_results = output[0]

    # Single-label prediction: keep the highest-scoring candidate
    predicted_labels = [max(top_results, key=lambda x: x["score"])["label"]]
    score_dict = {d["label"]: d["score"] for d in top_results}

    entry = {
        "text": sample["text"],
        "true_labels": true_labels,
        "predicted_labels": predicted_labels,
    }
    # Add scores for each candidate label
    for label in candidate_labels:
        entry[f"score_{label}"] = score_dict.get(label, 0.0)
    results.append(entry)
```
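As an optional follow-up to the loop above, you can compute a rough single-label accuracy and write the per-sample predictions to disk (the output file name is just an example):

```python
# Simple accuracy: the top prediction must appear among the gold labels.
correct = sum(1 for r in results if r["predicted_labels"][0] in r["true_labels"])
print(f"Accuracy: {correct / len(results):.3f}")

# Persist predictions and per-label scores for later inspection.
with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```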