| Truncation                           | Padding                           | Instruction                                                                                 |
|--------------------------------------|-----------------------------------|---------------------------------------------------------------------------------------------|
| no truncation                        | no padding                        | `tokenizer(batch_sentences)`                                                                |
|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True)` or                                              |
|                                      |                                   | `tokenizer(batch_sentences, padding='longest')`                                            |
|                                      | padding to max model input length | `tokenizer(batch_sentences, padding='max_length')`                                         |
|                                      | padding to specific length        | `tokenizer(batch_sentences, padding='max_length', max_length=42)`                          |
|                                      | padding to a multiple of a value  | `tokenizer(batch_sentences, padding=True, pad_to_multiple_of=8)`                           |
| truncation to max model input length | no padding                        | `tokenizer(batch_sentences, truncation=True)` or                                           |
|                                      |                                   | `tokenizer(batch_sentences, truncation=STRATEGY)`                                          |
|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True, truncation=True)` or                             |
|                                      |                                   | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY)`                            |
|                                      | padding to max model input length | `tokenizer(batch_sentences, padding='max_length', truncation=True)` or                     |
|                                      |                                   | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY)`                    |
|                                      | padding to specific length        | Not possible                                                                                |
| truncation to specific length        | no padding                        | `tokenizer(batch_sentences, truncation=True, max_length=42)` or                            |
|                                      |                                   | `tokenizer(batch_sentences, truncation=STRATEGY, max_length=42)`                           |
|                                      | padding to max sequence in batch  | `tokenizer(batch_sentences, padding=True, truncation=True, max_length=42)` or              |
|                                      |                                   | `tokenizer(batch_sentences, padding=True, truncation=STRATEGY, max_length=42)`             |
|                                      | padding to max model input length | Not possible                                                                                |
|                                      | padding to specific length        | `tokenizer(batch_sentences, padding='max_length', truncation=True, max_length=42)` or      |
|                                      |                                   | `tokenizer(batch_sentences, padding='max_length', truncation=STRATEGY, max_length=42)`     |
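As a quick illustration of two common rows from the table, the sketch below contrasts dynamic padding to the longest sequence in the batch with padding and truncating to a fixed length. The checkpoint `bert-base-uncased` and the example sentences are assumptions for the sake of the demo; any checkpoint with a tokenizer on the Hub behaves the same way.

```py
from transformers import AutoTokenizer

# Assumed checkpoint, chosen only for illustration.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch_sentences = [
    "A short sentence.",
    "A somewhat longer sentence that needs more tokens than the first one.",
]

# Pad to the longest sequence in the batch and truncate to the model's
# maximum input length (512 for BERT-style models).
dynamic = tokenizer(batch_sentences, padding=True, truncation=True)

# Pad and truncate every sequence to a fixed length of 42 tokens.
fixed = tokenizer(
    batch_sentences, padding="max_length", truncation=True, max_length=42
)

# Both sequences in `dynamic` share the length of the longest one in the batch;
# both sequences in `fixed` are exactly 42 tokens long.
print(len(dynamic["input_ids"][0]), len(dynamic["input_ids"][1]))
print(len(fixed["input_ids"][0]), len(fixed["input_ids"][1]))
```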