max_seq_length

#3 opened by yjoonjang

What is the max_seq_length of this model?
https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0#using-huggingface-transformers
The large model's example code says max_length=512,
https://huggingface.co/Snowflake/snowflake-arctic-embed-m-v2.0#using-huggingface-transformers
but the medium model's example code says max_length=8192.

Is it right that their max_seq_lengths are different?
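
For reference, the two README snippets differ only in the max_length passed to the tokenizer. A minimal sketch of the discrepancy (the query string is just a placeholder):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-embed-l-v2.0")
# Large model README: truncate at 512 tokens
tokens = tokenizer("example query", padding=True, truncation=True, max_length=512, return_tensors="pt")
# Medium model README: truncate at 8192 tokens
tokens = tokenizer("example query", padding=True, truncation=True, max_length=8192, return_tensors="pt")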

I believe this is correct; it's based on the maximum sequence lengths of the respective base models.

You mean 512?

Yes, the large model should have a maximum sequence length of 512 tokens, and the medium model 8192 tokens. Folks from Snowflake should be able to confirm.

But when I run the following code:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")
# The maximum sequence length as configured for Sentence Transformers:
print(model.max_seq_length)

I get 8192.

Oh, you're right. That's due to https://huggingface.co/Snowflake/snowflake-arctic-embed-l-v2.0/blob/main/tokenizer_config.json#L50
cc @spacemanidol @pxyu: I'm pretty sure it's not possible for an XLM-RoBERTa finetune to exceed 512 tokens unless you've updated the positional embedding matrix.
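
One quick way to check is to inspect the positional embedding matrix directly. A minimal sketch, assuming the checkpoint loads as a standard XLM-RoBERTa encoder:

from transformers import AutoModel

model = AutoModel.from_pretrained("Snowflake/snowflake-arctic-embed-l-v2.0")
# An XLM-RoBERTa-style encoder cannot embed positions beyond this matrix, so its
# first dimension (minus the RoBERTa padding offset) bounds the usable sequence length.
print(model.embeddings.position_embeddings.weight.shape)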

Never mind, it looks like the model can actually process ~6k tokens. This is the shape of the token embeddings for 2 queries: torch.Size([2, 6005, 1024]). Perhaps the maximum sequence length really is 8192. Apologies for the confusion; I'll let the Snowflake team answer.
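
A quick way to reproduce that check: encode an input that tokenizes well past 512 tokens and see whether it goes through. A minimal sketch (the repeated word is just filler text):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l-v2.0")
long_text = "word " * 6000  # tokenizes to roughly 6000 tokens, far beyond 512
print(len(model.tokenizer(long_text)["input_ids"]))
embedding = model.encode(long_text)  # would crash on a 512-position embedding matrix
print(embedding.shape)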

Both models handle 8192 tokens. We use the adjusted version of XLM-R provided by the BGE team (BAAI/bge-m3-retromae), which has been extended for 8k context support, so the normal XLM-R rules don't apply, haha. Let me get a fix in for the erroneous large model example code!
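
To confirm from the configs rather than the READMEs, both checkpoints should report an 8k position budget. A minimal sketch (trust_remote_code=True is passed in case a checkpoint ships custom modeling code; it is ignored otherwise):

from transformers import AutoConfig

for name in (
    "Snowflake/snowflake-arctic-embed-l-v2.0",
    "Snowflake/snowflake-arctic-embed-m-v2.0",
):
    config = AutoConfig.from_pretrained(name, trust_remote_code=True)
    print(name, config.max_position_embeddings)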

Snowflake org

Updated in the README, so closing.

spacemanidol changed discussion status to closed