DistilBERT Token Classification Model for Unit Conversion
Model Overview
This model is a fine-tuned version of distilbert/distilbert-base-uncased for token classification on unit-conversion text. It recognizes unit values and conversion entities (source and target units) so that unit-related data can be extracted automatically.
Dataset
The model is trained on the maliknaik/natural_unit_conversion dataset, which contains:
- Training set: 583,863 examples
- Validation set: 100,091 examples
- Test set: 150,137 examples
Each example consists of:
- text: The input sentence containing unit-related phrases.
- entities: The labeled entities specifying unit values and types.
Dataset URL: https://huggingface.co/datasets/maliknaik/natural_unit_conversion
Labels
The model classifies tokens into the following categories:
- B-FROM_UNIT: Beginning of the source unit
- I-FROM_UNIT: Inside the source unit
- B-TO_UNIT: Beginning of the target unit
- I-TO_UNIT: Inside the target unit
- B-FEET_VALUE: Beginning of feet value
- I-FEET_VALUE: Inside feet value
- B-INCH_VALUE: Beginning of inch value
- I-INCH_VALUE: Inside inch value
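The B-/I- prefixes follow the usual BIO tagging convention, so a multi-token unit such as "square meters" is recovered by merging a B- tag with the I- tags that follow it. The sketch below illustrates that merge in plain Python. The helper name `group_bio_tags`, the example sentence, and the `O` tag for tokens outside any entity are assumptions made for illustration; they are not taken from the model's configuration.

```python
from typing import List, Tuple


def group_bio_tags(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge B-/I- tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    for token, tag in zip(tokens, tags):
        # "B-FROM_UNIT" -> ("B", "-", "FROM_UNIT"); "O" -> ("O", "", "")
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B":
            # A new entity starts; close the one that was open, if any.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = entity_type, [token]
        elif prefix == "I" and entity_type == current_type:
            # Continuation of the entity that is currently open.
            current_tokens.append(token)
        else:
            # "O" (assumed outside tag) or a stray I- tag: close any open
            # entity and drop the current token.
            if current_type is not None:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

    if current_type is not None:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical, hand-written tags for illustration only.
tokens = ["convert", "5", "square", "meters", "to", "square", "feet"]
tags = ["O", "O", "B-FROM_UNIT", "I-FROM_UNIT", "O", "B-TO_UNIT", "I-TO_UNIT"]
print(group_bio_tags(tokens, tags))
```

In practice the Transformers pipeline can do this grouping itself through its aggregation option; the sketch only makes the convention explicit.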
Training Details
- Base Model: distilbert/distilbert-base-uncased
- Tokenization: AutoTokenizer from Hugging Face Transformers
- Training Framework: Hugging Face Trainer
- Data Collator: DataCollatorForTokenClassification
- Loss Function: CrossEntropyLoss
- Batch Size: 64
- Epochs: 10
- GPU: 1x NVIDIA Tesla P4 (8GB GDDR5)
- CPU: 56 vCPUs
- RAM: 283GB
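With a subword tokenizer, word-level labels have to be mapped onto sub-tokens before the collator and loss see them. `DataCollatorForTokenClassification` pads label sequences with -100 by default, which is also the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so positions marked -100 do not contribute to the loss. The sketch below shows one common alignment strategy (label only the first sub-token of each word). Whether this model was trained that way is not stated here, so treat the strategy, the helper name `align_labels`, and the example ids as assumptions.

```python
from typing import List, Optional

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(word_ids: List[Optional[int]], word_labels: List[int]) -> List[int]:
    """Expand word-level label ids to sub-token level.

    word_ids has the shape returned by a fast tokenizer's
    BatchEncoding.word_ids(): None for special tokens, otherwise the
    index of the word each sub-token belongs to.
    """
    aligned: List[int] = []
    previous = None
    for word_id in word_ids:
        if word_id is None:
            # Special tokens such as [CLS] and [SEP] are ignored by the loss.
            aligned.append(IGNORE_INDEX)
        elif word_id != previous:
            # First sub-token of a word carries the word's label.
            aligned.append(word_labels[word_id])
        else:
            # Later sub-tokens of the same word are ignored (assumed strategy).
            aligned.append(IGNORE_INDEX)
        previous = word_id
    return aligned


# Hypothetical example: three words, the second split into two sub-tokens.
word_ids = [None, 0, 1, 1, 2, None]
word_labels = [0, 1, 0]  # illustrative label ids, not the model's real mapping
print(align_labels(word_ids, word_labels))  # [-100, 0, 1, -100, 0, -100]
```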
Usage
To use this model for inference:
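from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load the fine-tuned model and the base model's tokenizer.
model_name = 'maliknaik/distilbert-natural-unit-conversion'
model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

text = 'How many miles are there in 50 kilometers?'

# aggregation_strategy='simple' merges B-/I- tokens into whole entities,
# which yields the 'entity_group' keys shown in the output below.
unit_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')
print(unit_pipeline(text))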
Output:
[{'entity_group': 'TO_UNIT',
'score': np.float32(0.9999982),
'word': 'miles',
'start': 9,
'end': 14},
{'entity_group': 'FROM_UNIT',
'score': np.float32(0.9999473),
'word': 'kilometers',
'start': 31,
'end': 41}]
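The pipeline output is a flat list of entity dictionaries, so a small amount of post-processing is needed to turn it into a conversion request. The following sketch picks out the source and target units by their `entity_group` and character offsets. The helper name `summarize_conversion` is hypothetical, and the numeric value is pulled with a plain regular expression, which is an assumption of this sketch rather than something the model predicts.

```python
import re
from typing import Dict, List, Optional


def summarize_conversion(text: str, entities: List[dict]) -> Dict[str, Optional[str]]:
    """Collect the first source unit, target unit, and number found in text."""
    result: Dict[str, Optional[str]] = {"from_unit": None, "to_unit": None, "value": None}

    for entity in entities:
        # Slice the original text so casing and spacing are preserved.
        span = text[entity["start"]:entity["end"]]
        if entity["entity_group"] == "FROM_UNIT" and result["from_unit"] is None:
            result["from_unit"] = span
        elif entity["entity_group"] == "TO_UNIT" and result["to_unit"] is None:
            result["to_unit"] = span

    # Assumption: the quantity is the first integer or decimal in the sentence.
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match:
        result["value"] = match.group()
    return result


text = "How many miles are there in 50 kilometers?"
# Entities copied from the sample output, with scores as plain floats.
entities = [
    {"entity_group": "TO_UNIT", "score": 0.9999982, "word": "miles", "start": 9, "end": 14},
    {"entity_group": "FROM_UNIT", "score": 0.9999473, "word": "kilometers", "start": 31, "end": 41},
]
print(summarize_conversion(text, entities))
# {'from_unit': 'kilometers', 'to_unit': 'miles', 'value': '50'}
```

A real application would likely need a more careful rule for the quantity, for example choosing the number closest to the source unit.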
Performance
The model achieves a high F1 score in identifying unit values and conversion entities. Exact F1 figures for the validation and test sets are not reported here, and further optimization is expected.
Dataset Usage
The maliknaik/natural_unit_conversion dataset can also be used to train other named entity recognition (NER) models, especially for tasks related to unit conversion and natural language understanding.
License
This model is available under the CC0-1.0 license. It is free to use for any purpose without any restrictions.
Contributions
Developed by Malik N. Mohammed, leveraging DistilBERT for efficient NLP token classification.
Citation
If you use this model in your work, please cite it as follows:
@misc{unit-conversion-dataset,
author = {Malik N. Mohammed},
title = {Natural Language Unit Conversion Model for Named-Entity Recognition},
year = {2025},
publisher = {HuggingFace},
journal = {HuggingFace repository},
howpublished = {\url{https://huggingface.co/maliknaik/distilbert-natural-unit-conversion/}}
}