XGBClassifier-text-filter

XGBClassifier-text-filter is a text-quality filter for Portuguese text, built on top of the xgboost library. It takes the sentence embeddings produced by sentence-transformers/LaBSE as its feature vector.
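
As a rough illustration of how per-token embeddings become one fixed-size feature vector, the sketch below reproduces masked mean pooling followed by L2 normalization, the same two steps used in the usage example further down. It uses NumPy and made-up toy values instead of LaBSE, so the array contents and the helper name `l2_normalize` are assumptions for illustration only.

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings while ignoring padded positions."""
    # token_embeddings: (batch, seq_len, dim); attention_mask: (batch, seq_len)
    mask = attention_mask[..., np.newaxis].astype(np.float64)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against division by zero for fully masked rows
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts

def l2_normalize(vectors):
    """Scale each row to unit Euclidean length (hypothetical helper)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

# Toy batch: one text, three token slots (the last one is padding), dimension 2
token_embeddings = np.array([[[3.0, 0.0], [3.0, 8.0], [100.0, 100.0]]])
attention_mask = np.array([[1, 1, 0]])

pooled = mean_pooling(token_embeddings, attention_mask)  # padding row is ignored
features = l2_normalize(pooled)                          # one unit-length row per text
print(pooled, features)
```

The padded position carries large values on purpose: because its mask entry is 0, it does not affect the pooled vector.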

This repository has the source code used to train this model.

Usage

Here's an example of how to use the XGBClassifier-text-filter:

from transformers import AutoTokenizer, AutoModel
from xgboost import XGBClassifier
import torch.nn.functional as F
import torch

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, using the attention mask to ignore padding tokens.
    token_embeddings = model_output[0]  # first element holds the per-token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/LaBSE")
embedding_model = AutoModel.from_pretrained("sentence-transformers/LaBSE")
device = ("cuda" if torch.cuda.is_available() else "cpu")
embedding_model.to(device)

# XGBClassifier takes keyword arguments only; the `device` parameter requires xgboost >= 2.0.
bst = XGBClassifier(device=device)
bst.load_model('/path/to/XGBClassifier-text-classifier.json')

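def score_text(text, model):
    # Tokenize the input and move the tensors to the same device as the embedding model.
    encoded_input = tokenizer(text, padding=True, truncation=True, return_tensors='pt').to(device)

    # Compute token embeddings without tracking gradients.
    with torch.no_grad():
        model_output = embedding_model(**encoded_input)

    # Pool the token embeddings into a single vector per text.
    sentence_embedding = mean_pooling(model_output, encoded_input['attention_mask'])

    # L2-normalize, then move to the CPU before converting to NumPy (required on CUDA).
    embedding = F.normalize(sentence_embedding, p=2, dim=1).cpu().numpy()
    # Predicted class label for the single input text.
    score = model.predict(embedding)[0]

    return score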

score_text("Os tucanos são aves que correspondem à família Ramphastidae, vivem nas florestas tropicais da América Central e América do Sul. A família inclui cinco gêneros e mais de quarenta espécies diferentes. Possuem bicos notavelmente grandes e coloridos, que possuem a função de termorregulação para as muitas espécies que passam muito tempo na copa da floresta exposta ao sol tropical quente.", bst)
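
A quality filter like this is usually applied over a whole corpus rather than to one string. The sketch below shows one way that loop might look. The names `filter_by_quality` and `toy_scorer`, the threshold value, and the assumption that a higher score means better text are all hypothetical and not taken from the original repository; `toy_scorer` is a stand-in so the example runs without downloading any model.

```python
from typing import Callable, Iterable, List

def filter_by_quality(
    texts: Iterable[str],
    scorer: Callable[[str], float],
    threshold: float = 0.5,
) -> List[str]:
    """Keep only the texts whose score is at or above the threshold."""
    return [text for text in texts if scorer(text) >= threshold]

def toy_scorer(text: str) -> float:
    # Stand-in for the real classifier: longer texts score higher.
    # Purely illustrative; it says nothing about actual text quality.
    return min(len(text.split()) / 10.0, 1.0)

corpus = [
    "clique aqui",
    "Os tucanos são aves que vivem nas florestas tropicais da América Central e do Sul.",
]

kept = filter_by_quality(corpus, toy_scorer, threshold=0.5)
print(kept)
```

With the real model, the scorer could be something like `lambda t: score_text(t, bst)`, assuming the predicted label 1 marks high-quality text. If a continuous score is preferred over a hard label, `XGBClassifier.predict_proba` can be used on the embedding instead of `predict`.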

Cite as 🤗

@misc{correa2024tucanoadvancingneuraltext,
      title={{Tucano: Advancing Neural Text Generation for Portuguese}}, 
      author={Corr{\^e}a, Nicholas Kluge and Sen, Aniket and Falk, Sophia and Fatimah, Shiza},
      year={2024},
      eprint={2411.07854},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.07854}, 
}

Acknowledgments

We gratefully acknowledge access to the Marvin cluster hosted by the University of Bonn, along with the support provided by its High Performance Computing & Analytics Lab.

License

XGBClassifier-text-filter is licensed under the Apache License, Version 2.0. For more details, see the LICENSE file.
