How to integrate this model with Sentence Transformers?

#31

by Nelson365487 - opened Feb 27, 2024

Feb 27, 2024

According to the model card, this model uses last-token pooling. I want to use it through Sentence Transformers, so I load the model with "sentence_transformers.models.Transformer" and add a pooling layer created with "sentence_transformers.models.Pooling(...,pooling_mode_mean_tokens=False,pooling_mode_lasttoken=True)".
Finally, I create the model with "model = SentenceTransformer(modules=[word_embedding_model, pooling_model])".
However, the embeddings from this custom model are very different from the ones I get by following the code in the model card.
Did I misunderstand something while integrating this model with Sentence Transformers? For example, is the pooling layer implemented differently, which would lead to different embeddings?

intfloat

Owner Feb 28, 2024

Can you provide a minimal code snippet that can reproduce your results?

One issue with integrating with SentenceTransformers is that the tokenizer has to add an EOS token to the end of each input. I believe SentenceTransformers does not handle this automatically.
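To illustrate why a missing EOS token matters with last-token pooling, here is a minimal sketch in plain Python. It is not the model card's code: the token vectors are made-up toy values and `toy_encode` is a hypothetical stand-in for the model. It only shows that last-token pooling returns whatever vector sits at the final non-padded position, so leaving out the EOS token changes which position gets pooled.

```python
def last_token_pool(hidden_states, attention_mask):
    """Pick, for each sequence, the vector at the last position whose mask is 1."""
    pooled = []
    for vectors, mask in zip(hidden_states, attention_mask):
        # Index of the last real (non-padding) token; works for left or right padding.
        last = max(i for i, m in enumerate(mask) if m == 1)
        pooled.append(vectors[last])
    return pooled


# Toy per-token vectors (assumed values, for illustration only).
EOS = "</s>"
TOY_VECTORS = {"hello": [1.0, 0.0], "world": [0.0, 1.0], EOS: [0.5, 0.5]}


def toy_encode(tokens):
    """Hypothetical stand-in for a model: look up one vector per token."""
    return [TOY_VECTORS[t] for t in tokens]


without_eos = ["hello", "world"]
with_eos = without_eos + [EOS]

pooled_without = last_token_pool([toy_encode(without_eos)], [[1, 1]])[0]
pooled_with = last_token_pool([toy_encode(with_eos)], [[1, 1, 1]])[0]

print(pooled_without)  # vector of "world"
print(pooled_with)     # vector of the EOS position
```

In a real model, the hidden state at the EOS position is the one the model was trained to produce as the embedding, so pooling the last word token instead can be expected to give noticeably different results.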

Jonathan0528

Feb 28, 2024

sentence-transformers should already have added support for the EOS token.
See https://huggingface.co/Salesforce/SFR-Embedding-Mistral/discussions/1.

I have tried the merged configs in Salesforce/SFR-Embedding-Mistral and they should work.
Hope to see it in intfloat/e5-mistral-7b-instruct!

Nelson365487

Feb 29, 2024

Thanks for your replies @intfloat @Jonathan0528 . I checked add_eos_token in the tokenizer after loading the model with SentenceTransformers and, just as @intfloat said, the tokenizer does not add the EOS token automatically. The discrepancy with what @Jonathan0528 said might come from my SentenceTransformers version: I have 2.2.2 installed, which I think is quite old. After setting add_eos_token=True and rerunning the example, everything works. Thanks again @intfloat @Jonathan0528 .
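A quick way to run the check described above is to tokenize a short string and look at the final id. The helper below is a sketch, not part of any library: `appends_eos` is a hypothetical name, and it assumes a Hugging Face style tokenizer that is callable on a string, returns a mapping with `input_ids`, and exposes `eos_token_id`.

```python
def appends_eos(tokenizer, text="hello world"):
    """Return True if the tokenizer's output for `text` ends with its EOS id.

    Assumes a Hugging Face style tokenizer: callable on a string, returning a
    mapping with "input_ids", and exposing `eos_token_id`.
    """
    input_ids = tokenizer(text)["input_ids"]
    return len(input_ids) > 0 and input_ids[-1] == tokenizer.eos_token_id
```

With a loaded SentenceTransformer, this could be called as `appends_eos(model[0].tokenizer)`, assuming the first module is the `sentence_transformers.models.Transformer` wrapper.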

Nelson365487 changed discussion status to closed Feb 29, 2024

woofadu

Apr 3, 2024

edited Apr 3, 2024

@Nelson365487 Where did you set add_eos_token=True for this, if @Jonathan0528's solution did not work?

Nelson365487

Apr 8, 2024

@woofadu , maybe you can try passing the argument via tokenizer_args when initializing sentence_transformers.models.Transformer, or try modifying the tokenizer after the initialization.
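The two options can be sketched roughly as follows. This is an untested outline rather than a verified recipe: `build_model` and its `set_after_init` flag are hypothetical names, it assumes the installed sentence-transformers version accepts `pooling_mode_lasttoken` in `Pooling`, and it assumes the underlying tokenizer honors `add_eos_token`. The import is done inside the function so that nothing is downloaded until the function is actually called.

```python
MODEL_NAME = "intfloat/e5-mistral-7b-instruct"
TOKENIZER_ARGS = {"add_eos_token": True}


def build_model(model_name=MODEL_NAME, set_after_init=False):
    """Build a SentenceTransformer with last-token pooling and EOS appended."""
    # Lazy import: loading the library and a 7B model is expensive.
    from sentence_transformers import SentenceTransformer, models

    if set_after_init:
        # Option 2: load first, then modify the tokenizer afterwards.
        word_embedding_model = models.Transformer(model_name)
        word_embedding_model.tokenizer.add_eos_token = True
    else:
        # Option 1: pass the flag through tokenizer_args at initialization.
        word_embedding_model = models.Transformer(
            model_name, tokenizer_args=TOKENIZER_ARGS
        )

    pooling_model = models.Pooling(
        word_embedding_model.get_word_embedding_dimension(),
        pooling_mode_mean_tokens=False,
        pooling_mode_lasttoken=True,
    )
    return SentenceTransformer(modules=[word_embedding_model, pooling_model])
```

Calling `build_model(set_after_init=True)` corresponds to modifying the tokenizer after initialization. Either way, it is worth confirming that the encoded inputs really end with the EOS id before comparing embeddings against the model card's code.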

woofadu

May 3, 2024

@Nelson365487 modifying after initialization worked. Thank you
