awettig committed dd2fabf (verified) · 1 parent: 0689480

Update README.md

Files changed (1): README.md (+3 -3)
README.md CHANGED
@@ -8,14 +8,14 @@ base_model:
 ---
 # WebOrganizer/FormatClassifier
 
-[[Paper](ARXIV_TBD)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]
+[[Paper](https://arxiv.org/abs/2502.10341)] [[Website](https://weborganizer.allenai.org)] [[GitHub](https://github.com/CodeCreator/WebOrganizer)]
 
 The FormatClassifier organizes web content into 24 categories based on the URL and text contents of web pages.
 The model is a [gte-base-en-v1.5](https://huggingface.co/Alibaba-NLP/gte-base-en-v1.5) with 140M parameters fine-tuned on the following training data:
 1. [WebOrganizer/FormatAnnotations-Llama-3.1-8B](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-8B): 1M documents annotated by Llama-3.1-8B (first-stage training)
 2. [WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8](https://huggingface.co/datasets/WebOrganizer/FormatAnnotations-Llama-3.1-405B-FP8): 100K documents annotated by Llama-3.1-405B-FP8 (second-stage training)
 
-##### All Domain Classifiers
+#### All Domain Classifiers
 - [WebOrganizer/FormatClassifier](https://huggingface.co/WebOrganizer/FormatClassifier) *← you are here!*
 - [WebOrganizer/FormatClassifier-NoURL](https://huggingface.co/WebOrganizer/FormatClassifier-NoURL)
 - [WebOrganizer/TopicClassifier](https://huggingface.co/WebOrganizer/TopicClassifier)
@@ -80,7 +80,7 @@ You can convert the `logits` of the model with a softmax to obtain a probability
 
 The full definitions of the categories can be found in the [taxonomy config](https://github.com/CodeCreator/WebOrganizer/blob/main/define_domains/taxonomies/formats.yaml).
 
-##### Efficient Inference
+#### Efficient Inference
 We recommend that you use the efficient gte-base-en-v1.5 implementation by enabling unpadding and memory efficient attention. This __requires installing `xformers`__ (see more [here](https://huggingface.co/Alibaba-NLP/new-impl#recommendation-enable-unpadding-and-acceleration-with-xformers)) and loading the model like:
 ```python
 AutoModelForSequenceClassification.from_pretrained(
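
The second hunk ends at the diff boundary, mid-snippet. A hedged completion of that loading call, assuming the `unpad_inputs` and `use_memory_efficient_attention` flags documented for the linked gte implementation carry over to this checkpoint:

```python
# Hedged completion of the truncated snippet above; the flag names follow
# the gte new-impl recommendation linked in the card and are assumptions here.
# Requires `pip install xformers` and a CUDA device.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "WebOrganizer/FormatClassifier",
    trust_remote_code=True,               # gte-base-en-v1.5 ships custom modeling code
    unpad_inputs=True,                    # skip computation on padding tokens
    use_memory_efficient_attention=True,  # route attention through xformers kernels
    torch_dtype=torch.bfloat16,           # half precision pairs well with those kernels
).cuda()
```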
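The second hunk's header also quotes the card's note that the model's `logits` can be converted with a softmax into a probability distribution over the 24 format categories. A minimal scoring sketch continuing from the snippet above; the `"{url}\n\n{text}"` input convention and the example page are assumptions, not taken from the diff:

```python
import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("WebOrganizer/FormatClassifier")

# Hypothetical input: URL and page text joined by a blank line.
page = "http://www.example.com/blog/post\n\nA step-by-step guide to baking sourdough..."
inputs = tokenizer([page], return_tensors="pt", truncation=True).to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
probs = logits.softmax(dim=-1)      # shape (1, 24): one probability per format category
print(probs.argmax(dim=-1).item())  # index of the most likely category
```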