How does this model count tokens?

#10
by WeiZhenKun - opened

How does this model count tokens?
Is there a fixed proportional relationship between the number of tokens and the number of characters?

This model is based on the BERT tokenizer. As an approximate rule of thumb, there are roughly 0.75 words per token in English text. For a precise count, please load the tokenizer and run it on your data of interest.
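A minimal sketch of how that precise count could be obtained with the `transformers` library; the checkpoint name below is only an illustrative placeholder and should be replaced with the actual model being discussed here:

```python
from transformers import AutoTokenizer

# Illustrative checkpoint name; substitute the model you are actually using.
tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")

text = "Example sentence to measure the exact token count."

# encode() returns token IDs including special tokens such as [CLS] and [SEP].
token_ids = tokenizer.encode(text)
print(f"Token count (with special tokens): {len(token_ids)}")

# tokenize() returns the subword strings without special tokens.
tokens = tokenizer.tokenize(text)
print(f"Token count (without special tokens): {len(tokens)}")
```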

intfloat changed discussion status to closed