numb3r3 committed on
Commit d13a333 · verified · 1 Parent(s): 3a87eca

Update README.md

Files changed (1)
  1. README.md +49 -42
README.md CHANGED
@@ -1,61 +1,68 @@
 ---
- license: apache-2.0
- base_model: jinaai/qwen2-1.5b-reader
- tags:
- - sft
- - reader-lm
- - transformers
- - generated_from_trainer
- model-index:
- - name: qwen2-1.5b-final-v1
-   results: []
 ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # qwen2-1.5b-final-v1

- This model is a fine-tuned version of [jinaai/qwen2-1.5b-reader](https://huggingface.co/jinaai/qwen2-1.5b-reader) on an unknown dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 3e-05
- - train_batch_size: 1
- - eval_batch_size: 8
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 4
- - gradient_accumulation_steps: 2
- - total_train_batch_size: 8
- - total_eval_batch_size: 32
- - optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_steps: 1000
- - training_steps: 25000

- ### Training results

- ### Framework versions

- - Transformers 4.43.3
- - Pytorch 2.1.0+cu121-with-pypi-cudnn
- - Datasets 2.21.0
- - Tokenizers 0.19.0
  ---
+ pipeline_tag: text-generation
+ language:
+ - multilingual
+ inference: false
+ license: cc-by-nc-4.0
+ library_name: transformers
 ---

+ <br><br>

+ <p align="center">
+ <img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
+ </p>

+ <p align="center">
+ <b>Trained by <a href="https://jina.ai/">Jina AI</a>.</b>
+ </p>

+ # Intro

+ Jina Reader-LM is a series of models that convert HTML content into Markdown content, which is useful for content conversion tasks. The models are trained on a curated collection of HTML content paired with its corresponding Markdown content.

+ # Models

+ | Name           | Context Length | Download                                                        |
+ |----------------|----------------|-----------------------------------------------------------------|
+ | reader-lm-0.5b | 256K           | [🤗 Hugging Face](https://huggingface.co/jinaai/reader-lm-0.5b) |
+ | reader-lm-1.5b | 256K           | [🤗 Hugging Face](https://huggingface.co/jinaai/reader-lm-1.5b) |

+ # Evaluation

+ TBD

+ # Quick Start

+ To use this model, you need to install `transformers`:

+ ```bash
+ pip install "transformers<=4.43.4"
+ ```

+ Then, you can use the model as follows:

+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+ 
+ checkpoint = "jinaai/qwen2-1.5b-reader"
+ device = "cuda"  # or "cpu" for CPU-only inference
+ 
+ tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+ model = AutoModelForCausalLM.from_pretrained(checkpoint).to(device)
+ 
+ # example HTML content
+ html_content = "<html><body><h1>Hello, world!</h1></body></html>"
+ 
+ # wrap the raw HTML in the chat template and append the assistant turn marker
+ messages = [{"role": "user", "content": html_content}]
+ input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ 
+ print(input_text)
+ 
+ inputs = tokenizer.encode(input_text, return_tensors="pt").to(device)
+ # greedy decoding; repetition_penalty discourages verbatim copy loops on long HTML
+ outputs = model.generate(inputs, max_new_tokens=1024, do_sample=False, repetition_penalty=1.08)
+ 
+ print(tokenizer.decode(outputs[0]))
+ ```
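
Note that `tokenizer.decode(outputs[0])` returns the full sequence, including the chat-template prompt and special tokens. A minimal follow-on sketch (continuing the snippet above, and assuming the standard `generate` layout where the prompt tokens come first) for keeping only the generated Markdown:

```python
# the prompt occupies the first inputs.shape[1] tokens of the output;
# slice it off and drop special tokens to get clean Markdown
prompt_len = inputs.shape[1]
markdown = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)
print(markdown)
```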