DistilBERT Token Classification Model for Unit Conversion

Model Overview

This model is a fine-tuned version of distilbert/distilbert-base-uncased for token classification on unit conversion-related text. It is designed to recognize unit values and conversion entities, facilitating automatic extraction of unit-related data.

Dataset

The model is trained on the maliknaik/natural_unit_conversion dataset, which contains:

Training set: 583,863 examples
Validation set: 100,091 examples
Test set: 150,137 examples
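
As a quick sanity check on the split sizes, the short calculation below derives each split's share of the total. It uses only the counts listed above; the variable names are illustrative.

```python
# Split sizes as listed above.
splits = {"train": 583_863, "validation": 100_091, "test": 150_137}

total = sum(splits.values())
# Percentage share of each split, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in splits.items()}

print(total)
print(shares)
```

The counts correspond to roughly a 70/12/18 train/validation/test split.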

Each example consists of:

text: The input sentence containing unit-related phrases.
entities: The labeled entities specifying unit values and types.

Dataset URL: https://huggingface.co/datasets/maliknaik/natural_unit_conversion

Labels

The model classifies tokens into the following categories:

B-FROM_UNIT: Beginning of the source unit
I-FROM_UNIT: Inside the source unit
B-TO_UNIT: Beginning of the target unit
I-TO_UNIT: Inside the target unit
B-FEET_VALUE: Beginning of feet value
I-FEET_VALUE: Inside feet value
B-INCH_VALUE: Beginning of inch value
I-INCH_VALUE: Inside inch value
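
As a rough illustration of how the B-/I- prefixes apply, the sketch below builds a label map and tags a whitespace-tokenized sentence. The "O" label for tokens outside any entity, the label ordering, the example sentence, and the (start_token, end_token, type) span format are all assumptions made for illustration; the model's actual id2label mapping is defined in its config, and the dataset's entities field may be structured differently.

```python
# Entity types taken from the label list above; "O" (outside) is assumed.
ENTITY_TYPES = ["FROM_UNIT", "TO_UNIT", "FEET_VALUE", "INCH_VALUE"]
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]

# Hypothetical id mappings; the real ones come from the model config.
label2id = {label: i for i, label in enumerate(LABELS)}
id2label = {i: label for label, i in label2id.items()}


def bio_tags(tokens, spans):
    """Assign BIO tags to tokens.

    spans: hypothetical (start_token, end_token_exclusive, entity_type) triples.
    """
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # remaining tokens of the entity
    return tags


tokens = "convert 3 square meters to square feet".split()
spans = [(2, 4, "FROM_UNIT"), (5, 7, "TO_UNIT")]
tags = bio_tags(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token:8} {tag}")
```

A multi-word unit such as "square meters" gets one B- tag followed by I- tags, which is what lets adjacent entities be separated when predictions are decoded.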

Training Details

Base Model: distilbert/distilbert-base-uncased
Tokenization: AutoTokenizer from Hugging Face Transformers
Training Framework: Hugging Face Trainer
Data Collator: DataCollatorForTokenClassification
Loss Function: CrossEntropyLoss
Batch Size: 64
Epochs: 10
GPU: 1x NVIDIA Tesla P4 (8GB GDDR5)
CPU: 56 vCPUs
RAM: 283GB
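
For a sense of scale, the figures above imply roughly the following number of optimizer steps. This is a back-of-the-envelope sketch that assumes the batch size of 64 is the effective batch size on the single GPU (no gradient accumulation) and that the final partial batch of each epoch is kept; neither is stated above.

```python
import math

train_examples = 583_863  # training set size from the Dataset section
batch_size = 64           # assumed to be the effective batch size
epochs = 10

# Round up so the final partial batch counts as a step.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```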

Usage

To use this model for inference:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_name = 'maliknaik/distilbert-natural-unit-conversion'

model = AutoModelForTokenClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained('distilbert/distilbert-base-uncased')

text = 'How many miles are there in 50 kilometers?'

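# aggregation_strategy='simple' merges word pieces and B-/I- tags into
# entity groups, which produces the 'entity_group' keys shown below.
unit_pipeline = pipeline('ner', model=model, tokenizer=tokenizer, aggregation_strategy='simple')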
print(unit_pipeline(text))

Output:

[{'entity_group': 'TO_UNIT',
  'score': np.float32(0.9999982),
  'word': 'miles',
  'start': 9,
  'end': 14},
 {'entity_group': 'FROM_UNIT',
  'score': np.float32(0.9999473),
  'word': 'kilometers',
  'start': 31,
  'end': 41}]
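
Output in this shape can be reduced to a simple mapping from entity group to the words tagged with it. The sketch below works on a copy of the output shown above, with the scores written as plain floats; the helper name extract_units is hypothetical and not part of the model or of Transformers.

```python
def extract_units(entities):
    """Map each entity group (e.g. FROM_UNIT, TO_UNIT) to the words tagged with it."""
    units = {}
    for ent in entities:
        units.setdefault(ent["entity_group"], []).append(ent["word"])
    return units


text = "How many miles are there in 50 kilometers?"
entities = [
    {"entity_group": "TO_UNIT", "score": 0.9999982, "word": "miles", "start": 9, "end": 14},
    {"entity_group": "FROM_UNIT", "score": 0.9999473, "word": "kilometers", "start": 31, "end": 41},
]

units = extract_units(entities)
print(units)

# The start/end offsets index into the original input text.
for ent in entities:
    assert text[ent["start"]:ent["end"]] == ent["word"]
```

Collecting words into lists rather than single values keeps the helper usable for inputs that mention more than one source or target unit.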

Performance

The model achieves a high F1 score in identifying unit values and conversion entities. Exact F1 scores for the validation and test sets are not reported here, and further optimization is expected.

Intended Use

This model can be used for named entity recognition (NER), especially for tasks related to unit conversion and natural language understanding.

License

This model is available under the CC0-1.0 license. It is free to use for any purpose without any restrictions.

Contributions

Developed by Malik N. Mohammed, leveraging DistilBERT for efficient NLP token classification.

Citation

If you use this model in your work, please cite it as follows:

@misc{unit-conversion-dataset,
  author = {Malik N. Mohammed},
  title = {Natural Language Unit Conversion Model for Named-Entity Recognition},
  year = {2025},
  publisher = {HuggingFace},
  journal = {HuggingFace repository},
  howpublished = {\url{https://huggingface.co/maliknaik/distilbert-natural-unit-conversion/}}
}