	---
	library_name: transformers
	license: gpl-3.0
	datasets:
	- phunc20/nj_biergarten_captcha_v2
	base_model:
	- microsoft/trocr-base-handwritten
	---

	# Model Card for trocr-base-handwritten_nj_biergarten_captcha_v2

	This is a model for CAPTCHA OCR.



	## Model Details

	### Model Description

	This model is fine-tuned from `microsoft/trocr-base-handwritten` on
	`phunc20/nj_biergarten_captcha_v2`, a dataset I created.

	## Uses

	<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->


	### Direct Use

	<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->

	[More Information Needed]


	## Bias, Risks, and Limitations

	Although the model performs well on `phunc20/nj_biergarten_captcha_v2`, it does not
	perform nearly as well on CAPTCHA images from other sources. In this respect, the
	model is worse than a human reader.

	### Recommendations

	<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

	Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

	## How to Get Started with the Model

	Use the code below to get started with the model.

	[More Information Needed]
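A minimal inference sketch, not an official recipe. It assumes `transformers`, `torch`, and `Pillow` are installed. It also assumes that the processor of the base checkpoint `microsoft/trocr-base-handwritten` is the right one to pair with the fine-tuned weights. The helper name `read_captcha` is made up for this example.

```python
MODEL_ID = "phunc20/trocr-base-handwritten_nj_biergarten_captcha_v2"
# ASSUMPTION: the fine-tuned model reuses the base checkpoint's processor.
PROCESSOR_ID = "microsoft/trocr-base-handwritten"


def read_captcha(image_path, model_id=MODEL_ID, processor_id=PROCESSOR_ID):
    """Return the text predicted for a single CAPTCHA image (hypothetical helper)."""
    # Heavy dependencies are imported lazily, so defining this function
    # does not require them to be installed.
    from PIL import Image
    from transformers import TrOCRProcessor, VisionEncoderDecoderModel

    processor = TrOCRProcessor.from_pretrained(processor_id)
    model = VisionEncoderDecoderModel.from_pretrained(model_id)

    # TrOCR expects RGB input; CAPTCHA files are often grayscale or RGBA.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    generated_ids = model.generate(pixel_values)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `read_captcha("some_captcha.png")` with a path to a local image should return the predicted string. For many images, it would be better to load the processor and model once and reuse them instead of reloading on every call.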

	## Training Details

	### Training Data

	As mentioned above, I trained this model on `phunc20/nj_biergarten_captcha_v2`.
	In particular, I trained on the `train` split and evaluated on the `validation` split,
	without touching the `test` split.
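The split usage described above can be sketched as follows, assuming the `datasets` library is installed. The helper `load_split` and the `SPLIT_ROLES` mapping are illustrative names, not part of the training notebook, and the column layout of the dataset is not assumed here.

```python
DATASET_ID = "phunc20/nj_biergarten_captcha_v2"

# Role of each split, as described in this model card.
SPLIT_ROLES = {
    "train": "used for fine-tuning",
    "validation": "used for evaluation during training",
    "test": "held out until final evaluation",
}


def load_split(split):
    """Load one split of the dataset from the Hugging Face Hub (hypothetical helper)."""
    if split not in SPLIT_ROLES:
        raise ValueError(f"unknown split {split!r}; expected one of {list(SPLIT_ROLES)}")
    # Imported lazily so the check above works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset(DATASET_ID, split=split)
```

For example, `load_split("train")` and `load_split("validation")` would be the only two calls needed during training, which keeps the `test` split untouched.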

	### Training Procedure

	Please refer to
	<https://gitlab.com/phunc20/captchew/-/blob/main/colab_notebooks/train_from_pretrained_Seq2SeqTrainer_torchDataset.ipynb?ref_type=heads>
	which is adapted from
	<https://github.com/NielsRogge/Transformers-Tutorials/blob/master/TrOCR/Fine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb>

	## Evaluation

	### Testing Data, Factors & Metrics

	#### Testing Data

	1. The `test` split of `phunc20/nj_biergarten_captcha_v2`
	2. The Kaggle dataset <https://www.kaggle.com/datasets/fournierp/captcha-version-2-images/data>,
	   referred to as `kaggle_test_set` in this model card

	#### Factors

	<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

	[More Information Needed]

	#### Metrics

	CER (character error rate), exact match, and average length difference. The first two are
	documented by Hugging Face. The last is a metric I care a little about. It is easy to
	understand and, if need be, an explanation can be found in the source code:
	<https://gitlab.com/phunc20/captchew/-/blob/v0.1/average_length_difference.py>
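For readers who want a feel for the three metrics, here is a dependency-free sketch. It is not the code used to produce the results below. CER is computed as total character edits divided by total reference characters. The average length difference is assumed to be the mean absolute difference between predicted and reference lengths; the linked source file is the authoritative definition.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def cer(predictions, references):
    """Total character edits divided by total reference characters."""
    edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    return edits / sum(len(r) for r in references)


def exact_match_count(predictions, references):
    """Number of predictions identical to their reference."""
    return sum(p == r for p, r in zip(predictions, references))


def average_length_difference(predictions, references):
    # ASSUMPTION: mean absolute length difference; see the linked source
    # file for the definition actually used in this model card.
    diffs = [abs(len(p) - len(r)) for p, r in zip(predictions, references)]
    return sum(diffs) / len(diffs)


preds = ["abc12", "xy9", "hello7"]
refs = ["abc12", "xyz9", "hallo"]
print(cer(preds, refs), exact_match_count(preds, refs),
      average_length_difference(preds, refs))
```

Because insertions count as errors, CER can exceed 1 when predictions are much longer than the references. For instance, `cer(["abcdef"], ["ab"])` is 2.0.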

	### Results
	On the `test` split of `phunc20/nj_biergarten_captcha_v2`

	| Model | CER | Exact match | Avg len diff |
	| --------------------------------------------------------- | -------- | ----------- | ------------ |
	| `phunc20/trocr-base-handwritten_nj_biergarten_captcha_v2` | 0.001333 | 496/500 | 1/500 |
	| `microsoft/trocr-base-handwritten` | 0.9 | 5/500 | 2.4 |

	On `kaggle_test_set`

	| Model | CER | Exact match | Avg len diff |
	| --------------------------------------------------------- | -------- | ----------- | ------------ |
	| `phunc20/trocr-base-handwritten_nj_biergarten_captcha_v2` | 0.4381 | 69/1070 | 0.1289 |
	| `microsoft/trocr-base-handwritten` | 1.0112 | 17/1070 | 2.4439 |
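To compare the two test sets on one scale, the exact-match counts can be turned into rates. The short sketch below uses only the counts reported in the tables; the labels `fine-tuned` and `base` are shorthand for the two models, not names used elsewhere.

```python
# (correct, total) exact-match counts copied from the results tables.
EXACT_MATCH = {
    ("fine-tuned", "nj_biergarten test"): (496, 500),
    ("base", "nj_biergarten test"): (5, 500),
    ("fine-tuned", "kaggle_test_set"): (69, 1070),
    ("base", "kaggle_test_set"): (17, 1070),
}


def exact_match_rate(correct, total):
    """Fraction of images whose prediction matches the label exactly."""
    return correct / total


for (model, dataset), (correct, total) in EXACT_MATCH.items():
    rate = exact_match_rate(correct, total)
    print(f"{model:10s} {dataset:20s} {rate:.2%}")
```

On these counts, the fine-tuned model matches about 99% of the in-domain test images but only about 6% of `kaggle_test_set`, which quantifies the limitation noted earlier.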


	## Environmental Impact

	Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

	- Hardware Type: [More Information Needed]
	- Hours used: [More Information Needed]
	- Cloud Provider: [More Information Needed]
	- Compute Region: [More Information Needed]
	- Carbon Emitted: [More Information Needed]