---
library_name: transformers
tags: []
---
# 🧠 GLiClass Gender Classifier — DeBERTaV3 Uni-Encoder (3-Class)
This model is designed for **text classification** in clinical narratives, specifically for determining a patient's **sex or gender**. It was fine-tuned using a **uni-encoder architecture** based on [`microsoft/deberta-v3-small`](https://huggingface.co/microsoft/deberta-v3-small), and outputs one of three labels:
- `male`
- `female`
- `sex undetermined`
---
## 🧪 Task
This is a **multi-class text classification** task over **clinical free-text**. The model predicts the gender of a patient from discharge summaries, case descriptions, or medical notes.
> ⚠️ **It is strongly recommended to keep the labels and the input text in the same language** (e.g., both in Spanish or both in English) to ensure optimal model performance. Mixing languages may reduce accuracy.
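For example, if the note is written in Spanish, you would typically pass Spanish label strings as well. The sketch below is only illustrative: the Spanish labels are ad-hoc translations (not necessarily the label strings used during fine-tuning), and `pipeline` refers to the classification pipeline built in the Usage Example below.

```python
# Spanish clinical note paired with Spanish candidate labels (illustrative translations;
# the English labels listed above are the ones documented for this model).
texto = "Paciente de 63 años que refería déficit de agudeza visual..."
etiquetas = ["hombre", "mujer", "sexo indeterminado"]  # hypothetical Spanish label strings
print(pipeline(texto, etiquetas)[0])  # `pipeline` as constructed in the Usage Example below
```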
---
## 🧩 Model Architecture
- Base: `microsoft/deberta-v3-small`
- Architecture: `DebertaV2ForSequenceClassification`
- Fine-tuned with a **uni-encoder** setup
- 3 output labels
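As a quick sanity check that the checkpoint loads as described, a minimal sketch (using the same `GLiClassModel` loader shown in the Usage Example below) can report the size of the DeBERTa-v3-small backbone plus the GLiClass head. The exact parameter count is not documented here, so treat the printout as informational only.

```python
from gliclass import GLiClassModel

# Load the fine-tuned checkpoint and print its parameter count (informational only).
model = GLiClassModel.from_pretrained("BSC-NLP4BIA/GLiClass-gender-classifier")
n_params = sum(p.numel() for p in model.parameters())
print(f"Loaded {type(model).__name__} with {n_params / 1e6:.1f}M parameters")
```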
---
## 🔍 Input Format
Each input sample must be a JSON object like this:
```json
{
  "text": "Paciente de 63 años que refería déficit de agudeza visual (AV)...",
  "all_labels": ["male", "female", "sex undetermined"],
  "true_labels": ["sex undetermined"]
}
```
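If your raw data is just a list of clinical notes, a small helper like the one below (file name and texts are placeholders) can wrap them into this structure. Note that the evaluation loop in the Usage Example expects `true_labels` to be filled in.

```python
import json

# Placeholder notes; replace with your own clinical texts.
texts = [
    "Paciente de 63 años que refería déficit de agudeza visual (AV)...",
]

samples = [
    {
        "text": t,
        "all_labels": ["male", "female", "sex undetermined"],
        "true_labels": [],  # fill in the gold label(s) for evaluation
    }
    for t in texts
]

# "test_data.json" is an example path; point test_path in the Usage Example at it.
with open("test_data.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```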
---
## 🚀 Usage Example
```python
import json
import torch
from transformers import AutoTokenizer
from gliclass import GLiClassModel, ZeroShotClassificationPipeline

device = "cuda:0" if torch.cuda.is_available() else "cpu"

model_path = "BSC-NLP4BIA/GLiClass-gender-classifier"
classification_type = "single-label"  # or "multi-label"
test_path = "path/to/your/test_data.json"

print(f"🔄 Loading model from {model_path}...")
model = GLiClassModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model.to(device)

pipeline = ZeroShotClassificationPipeline(
    model=model,
    tokenizer=tokenizer,
    classification_type=classification_type,
    device=device,
)

with open(test_path, "r", encoding="utf-8") as f:
    test_data = json.load(f)

# 🔍 Automatically infer candidate labels from the dataset
all_labels = set()
for sample in test_data:
    all_labels.update(sample["true_labels"])
candidate_labels = sorted(all_labels)
print(f"🧾 Candidate labels inferred: {candidate_labels}")

results = []
for sample in test_data:
    true_labels = sample["true_labels"]
    output = pipeline(sample["text"], candidate_labels)
    top_results = output[0]

    # Single-label prediction: keep the highest-scoring candidate
    predicted_labels = [max(top_results, key=lambda x: x["score"])["label"]]
    score_dict = {d["label"]: d["score"] for d in top_results}

    entry = {
        "text": sample["text"],
        "true_labels": true_labels,
        "predicted_labels": predicted_labels,
    }
    # Add scores for each candidate label
    for label in candidate_labels:
        entry[f"score_{label}"] = score_dict.get(label, 0.0)
    results.append(entry)
```
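As an optional follow-up to the loop above, you can compute a rough single-label accuracy and write the per-sample predictions to disk (the output file name is just an example):

```python
# Simple accuracy: the top prediction must appear among the gold labels.
correct = sum(1 for r in results if r["predicted_labels"][0] in r["true_labels"])
print(f"Accuracy: {correct / len(results):.3f}")

# Persist predictions and per-label scores for later inspection.
with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(results, f, ensure_ascii=False, indent=2)
```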