	---
	license: mit
	language:
	- ar
	base_model:
	- aubmindlab/bert-base-arabertv02
	pipeline_tag: token-classification
	---

	# SWEET MADAR CODA Model

	## Model Description
	`CAMeL-Lab/text-editing-coda` is a text editing model tailored for grammatical error correction (GEC) in dialectal Arabic (DA).
	The model is based on [AraBERTv02](https://huggingface.co/aubmindlab/bert-base-arabertv02), which we fine-tuned using the [MADAR CODA](https://camel.abudhabi.nyu.edu/madar-coda-corpus/) corpus.
	This model was introduced in our ACL 2025 paper, [Enhancing Text Editing for Grammatical Error Correction: Arabic as a Case Study](https://arxiv.org/abs/2503.00985), where we refer to it as SWEET (Subword Edit Error Tagger).
	It achieved state-of-the-art (SOTA) performance on the MADAR CODA dataset. Details about the training procedure, data preprocessing, and hyperparameters are available in the paper.
	The fine-tuning code and associated resources are publicly available on our GitHub repository: https://github.com/CAMeL-Lab/text-editing.



	## Intended uses
	To use the `CAMeL-Lab/text-editing-coda` model, you must clone our text editing [GitHub repository](https://github.com/CAMeL-Lab/text-editing) and follow the installation requirements.
	We used this `SWEET` model to report results on the MADAR CODA dev and test sets in our [paper](https://arxiv.org/abs/2503.00985).

	## How to use
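	Clone our text editing [GitHub repository](https://github.com/CAMeL-Lab/text-editing) and follow the installation requirements. The example imports `rewrite` from `gec.tag`, so run it from an environment where that code is importable.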

	```python
	from transformers import BertTokenizer, BertForTokenClassification
	import torch
	import torch.nn.functional as F
	from gec.tag import rewrite

	tokenizer = BertTokenizer.from_pretrained('CAMeL-Lab/text-editing-coda')
	model = BertForTokenClassification.from_pretrained('CAMeL-Lab/text-editing-coda')

	text = 'أنا بعطيك رقم تلفونو و عنوانو'.split()

	tokenized_text = tokenizer(text, return_tensors="pt", is_split_into_words=True)

	# Run inference without tracking gradients
	with torch.no_grad():
	    logits = model(**tokenized_text).logits
	    preds = F.softmax(logits.squeeze(), dim=-1)
	    preds = torch.argmax(preds, dim=-1).cpu().numpy()
	    # Drop the first and last positions ([CLS] and [SEP]) and map label ids to edit tags
	    edits = [model.config.id2label[p] for p in preds[1:-1]]
	    assert len(edits) == len(tokenized_text['input_ids'][0][1:-1])

	print(edits) # ['R_[ا]K', 'KI_[ا]K', 'K', 'K', 'K', 'K', 'KR_[ه]', 'K', 'MK*', 'R_[ه]']
	subwords = tokenizer.convert_ids_to_tokens(tokenized_text['input_ids'][0][1:-1])
	output_sent = rewrite(subwords=[subwords], edits=[edits])[0][0]
	print(output_sent) # انا باعطيك رقم تلفونه وعنوانه
	```
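
The model predicts one edit tag per subword, so it can help to see which tags belong to which input word before calling `rewrite`. The following is a minimal sketch, not part of the repository: `group_by_word` is a hypothetical helper, and it assumes the tokenizer marks word-internal pieces with the WordPiece `##` prefix. The toy subwords and tags are made up for illustration and are not real model output.

```python
from typing import List, Tuple


def group_by_word(subwords: List[str], edits: List[str]) -> List[Tuple[str, List[str]]]:
    """Pair each word with the edit tags predicted for its subwords.

    Hypothetical helper: assumes continuation pieces start with '##'
    (the WordPiece convention).
    """
    if len(subwords) != len(edits):
        raise ValueError("subwords and edits must have the same length")

    grouped: List[Tuple[str, List[str]]] = []
    for piece, edit in zip(subwords, edits):
        if piece.startswith("##") and grouped:
            # Continuation piece: append it to the word currently being built
            word, word_edits = grouped[-1]
            grouped[-1] = (word + piece[2:], word_edits + [edit])
        else:
            # A piece without the '##' prefix starts a new word
            grouped.append((piece, [edit]))
    return grouped


# Toy input (illustrative only, not real model output)
toy_subwords = ["play", "##ing", "now"]
toy_edits = ["K", "K", "K"]

for word, word_edits in group_by_word(toy_subwords, toy_edits):
    print(word, word_edits)
```

With the snippet above, the same helper could be called as `group_by_word(subwords, edits)` to inspect the predictions word by word.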



	## Citation
	```bibtex
	@misc{alhafni-habash-2025-enhancing,
	title={Enhancing Text Editing for Grammatical Error Correction: Arabic as a Case Study},
	author={Bashar Alhafni and Nizar Habash},
	year={2025},
	eprint={2503.00985},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2503.00985},
	}
	```