radoslavralev committed
Commit 91abacf · verified · 1 Parent(s): ae9eb84

Add new SentenceTransformer model

2_Dense/model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:fe28270e57d9e8e90e8666418d902aa8ecb6254c77adda1949f6cbd4bdddb8c0
+ oid sha256:9997181ec203c76a0e08ecba57c47a10999519c2736241efc55aadbd8d389584
  size 2362528
3_Dense/model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:9a621984b96056acc53643d57acdce2f420d7dfe7a155ea8fcfd949064f4ff1f
+ oid sha256:db470fd6a6c46fd748b3e0d97974cb3788a47741d1005aca5aff6ccc250b737c
  size 2362528
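
The `model.safetensors` entries above are Git LFS pointer files rather than the weights themselves: a three-line `version` / `oid` / `size` record, where `oid` is the SHA-256 of the actual blob. A minimal sketch of reading that format (the `parse_lfs_pointer` helper is hypothetical, not part of git-lfs):

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields

# The new 2_Dense pointer from the diff above.
pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:9997181ec203c76a0e08ecba57c47a10999519c2736241efc55aadbd8d389584
size 2362528"""

info = parse_lfs_pointer(pointer)
print(info["oid"])        # sha256:9997181e...
print(int(info["size"]))  # 2362528
```

Comparing `info["oid"]` against `shasum -a 256` of the downloaded file is one way to verify an LFS checkout.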
README.md CHANGED
@@ -12,53 +12,48 @@ tags:
  - retrieval
  - reranking
  - generated_from_trainer
- - dataset_size:13675
+ - dataset_size:1460771
  - loss:ArcFaceInBatchLoss
  base_model: Alibaba-NLP/gte-modernbert-base
  widget:
- - source_sentence: Bathurst Street has been the heart of the Jewish community of Toronto
-     for decades .
+ - source_sentence: '"How much would I need to narrate a ""Let''s Play"" video in order
+     to make money from it on YouTube?"'
    sentences:
- - Baron portrayed actress Violet Carson who played Ena Sharples in the soap .
- - Bathurst Street has been the heart of the Jewish community of Toronto for many
-     decades .
- - It stretches approximately 20 miles from Manasquan Inlet in Point Pleasant Beach
-     in the north to Island Beach State Park in the south .
- - source_sentence: All tracks produced by Zack Shada , Jeremy Shada , Logan Charles
-     , John Spicer and Seth Renken . All tracks are written by Zack Odom and Kenneth
-     Mount .
+ - How much money do people make from YouTube videos with 1 million views?
+ - '"How much would I need to narrate a ""Let''s Play"" video in order to make money
+     from it on YouTube?"'
+ - '"Does the sentence, ""I expect to be disappointed,"" make sense?"'
+ - source_sentence: '"I appreciate that.'
    sentences:
- - All tracks produced by Zack Shada , Jeremy Shada , Logan Charles , John Spicer
-     and Seth Renken . All tracks are written by Zack Odom and Kenneth Mount .
- - All tracks by Zack Shada , Jeremy Shada , John Spicer , Logan Charles and Seth
-     Renken are produced by Zack Odom and Kenneth Mount .
- - Jimmy Connors defeated Eddie Dibbs 7 -- 5 , 7 -- 5
- - source_sentence: Arque Municipality is situated in the eastern part of the province
-     and Tacopaya Municipality is located in the west .
+ - '"How is the Mariner rewarded in ""The Rime of the Ancient Mariner"" by Samuel
+     Taylor Coleridge?"'
+ - '"I appreciate that.'
+ - I can appreciate that.
+ - source_sentence: '"""It is very easy to defeat someone, but too hard to win some
+     one"". What does the previous sentence mean?"'
    sentences:
- - Arque Municipality is situated in the eastern part of the province and Tacopaya
-     Municipality is located in the west .
- - Bangkok International Preparatory and Secondary School , or Bangkok Prep , is
-     an independent international school located on the National Curriculum of England
-     based in Bangkok , Thailand .
- - The municipality of Tacopaya is situated in the eastern part of the province and
-     municipality of Arque located in the west .
- - source_sentence: Browning is identified as married , but no wife or child is captured
-     .
+ - '"How can you use the word ""visceral"" in a sentence?"'
+ - '"""It is very easy to defeat someone, but too hard to win some one"". What does
+     the previous sentence mean?"'
+ - '"What does ""The loudest one in the room is the weakest one in the room."" Mean?"'
+ - source_sentence: '" We condemn this raid which is in our view illegal and morally
+     and politically unjustifiable , " London-based NCRI official Ali Safavi told Reuters
+     by telephone .'
    sentences:
- - Alexander Alexander is the grandson of the Sarawak - leader Tun Jugah Barieng
-     and the son of former politician Tan Sri Datuk Amar Leonard Linggi .
- - Browning is identified as married , but no wife or child is recorded .
- - It was formerly known also as ' Crotto ' .
- - source_sentence: Actor Charlie Chan , who portrayed Warner Oland when `` The Black
-     Camel `` was filmed in Hawaii , he met .
+ - 'London-based NCRI official Ali Safavi told Reuters : " We condemn this raid ,
+     which is in our view illegal and morally and politically unjustifiable . "'
+ - The social awkwardness is complicated by the fact that Marianne is a white girl
+     living with a black family .
+ - art's cause, this in my opinion
+ - source_sentence: '"If you click ""like"" on an old post that someone made on your
+     wall yet you''re no longer Facebook friends, will they still receive a notification?"'
    sentences:
- - Chang met actor Warner Oland , who portrayed Charlie Chan , when `` The Black
-     Camel `` was filmed in Hawaii .
- - As an actor , he joined the Royal Shakespeare Company of Peter Hall , working
-     with Peggy Ashcroft and Dame Edith Evans .
- - Actor Charlie Chan , who portrayed Warner Oland when `` The Black Camel `` was
-     filmed in Hawaii , he met .
+ - '"Is there is any two wheeler having a gear box which has the feature ""automatic
+     neutral"" when the engine is off while it is in gear?"'
+ - '"If you click ""like"" on an old post that someone made on your wall yet you''re
+     no longer Facebook friends, will they still receive a notification?"'
+ - '"If your teenage son posted ""La commedia e finita"" on his Facebook wall, would
+     you be concerned?"'
  datasets:
  - redis/langcache-sentencepairs-v2
  pipeline_tag: sentence-similarity
@@ -160,9 +155,9 @@ from sentence_transformers import SentenceTransformer
  model = SentenceTransformer("redis/langcache-embed-v3")
  # Run inference
  sentences = [
-     'Actor Charlie Chan , who portrayed Warner Oland when `` The Black Camel `` was filmed in Hawaii , he met .',
-     'Actor Charlie Chan , who portrayed Warner Oland when `` The Black Camel `` was filmed in Hawaii , he met .',
-     'Chang met actor Warner Oland , who portrayed Charlie Chan , when `` The Black Camel `` was filmed in Hawaii .',
+     '"If you click ""like"" on an old post that someone made on your wall yet you\'re no longer Facebook friends, will they still receive a notification?"',
+     '"If you click ""like"" on an old post that someone made on your wall yet you\'re no longer Facebook friends, will they still receive a notification?"',
+     '"If your teenage son posted ""La commedia e finita"" on his Facebook wall, would you be concerned?"',
  ]
  embeddings = model.encode(sentences)
  print(embeddings.shape)
@@ -171,9 +166,9 @@ print(embeddings.shape)
  # Get the similarity scores for the embeddings
  similarities = model.similarity(embeddings, embeddings)
  print(similarities)
- # tensor([[0.9998, 0.9998, 0.5864],
- #         [0.9998, 0.9998, 0.5864],
- #         [0.5864, 0.5864, 1.0000]])
+ # tensor([[1.0000, 1.0000, 0.2617],
+ #         [1.0000, 1.0000, 0.2617],
+ #         [0.2617, 0.2617, 1.0000]])
  ```

  <!--
@@ -239,19 +234,19 @@ You can finetune this model on your own dataset.
  #### LangCache Sentence Pairs (all)

  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
- * Size: 6,786 training samples
+ * Size: 132,354 training samples
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
  * Approximate statistics based on the first 1000 samples:
- | | anchor | positive | negative |
- |:--------|:---------|:---------|:---------|
- | type | string | string | string |
- | details | <ul><li>min: 9 tokens</li><li>mean: 27.96 tokens</li><li>max: 50 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.98 tokens</li><li>max: 51 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.56 tokens</li><li>max: 49 tokens</li></ul> |
+ | | anchor | positive | negative |
+ |:--------|:---------|:---------|:---------|
+ | type | string | string | string |
+ | details | <ul><li>min: 4 tokens</li><li>mean: 25.33 tokens</li><li>max: 100 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 24.98 tokens</li><li>max: 100 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 19.06 tokens</li><li>max: 68 tokens</li></ul> |
  * Samples:
- | anchor | positive | negative |
- |:---------|:---------|:---------|
- | <code>( 1 ) Lakers vs. ( 2 ) San Antonio Spurs : `` Los Angeles Lakers Win 4-0</code> | <code>( 1 ) Lakers vs. ( 2 ) San Antonio Spurs : `` Los Angeles Lakers win series 4-0 ``</code> | <code>( 1 ) Los Angeles Lakers vs. ( 2 ) San Antonio Spurs : `` Lakers win series 4-0 ``</code> |
- | <code>( 1 ) Lakers vs. ( 2 ) San Antonio Spurs : `` Los Angeles Lakers win series 4-0 ``</code> | <code>( 1 ) Lakers vs. ( 2 ) San Antonio Spurs : `` Los Angeles Lakers Win 4-0</code> | <code>The study included 752 universities in Pennsylvania , including public schools , public charter schools and traditional public magnet schools .</code> |
- | <code>( 1 ) Los Angeles Lakers vs. ( 2 ) San Antonio Spurs : `` Lakers win series 4-0 ``</code> | <code>( 1 ) Los Angeles Lakers vs. ( 2 ) San Antonio Spurs : `` Lakers win series 4-0 ``</code> | <code>( 1 ) Lakers vs. ( 2 ) San Antonio Spurs : `` Los Angeles Lakers Win 4-0</code> |
+ | anchor | positive | negative |
+ |:---------|:---------|:---------|
+ | <code> What high potential jobs are there other than computer science?</code> | <code> What high potential jobs are there other than computer science?</code> | <code>Why IT or Computer Science jobs are being over rated than other Engineering jobs?</code> |
+ | <code> Would India ever be able to develop a missile system like S300 or S400 missile?</code> | <code> Would India ever be able to develop a missile system like S300 or S400 missile?</code> | <code>Should India buy the Russian S400 air defence missile system?</code> |
+ | <code> water from the faucet is being drunk by a yellow dog</code> | <code>A yellow dog is drinking water from the faucet</code> | <code>Childlessness is low in Eastern European countries.</code> |
  * Loss: <code>losses.ArcFaceInBatchLoss</code> with these parameters:
  ```json
  {
@@ -266,19 +261,19 @@ You can finetune this model on your own dataset.
  #### LangCache Sentence Pairs (all)

  * Dataset: [LangCache Sentence Pairs (all)](https://huggingface.co/datasets/redis/langcache-sentencepairs-v2)
- * Size: 6,786 evaluation samples
+ * Size: 132,354 evaluation samples
  * Columns: <code>anchor</code>, <code>positive</code>, and <code>negative</code>
  * Approximate statistics based on the first 1000 samples:
- | | anchor | positive | negative |
- |:--------|:---------|:---------|:---------|
- | type | string | string | string |
- | details | <ul><li>min: 9 tokens</li><li>mean: 27.96 tokens</li><li>max: 50 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.98 tokens</li><li>max: 51 tokens</li></ul> | <ul><li>min: 9 tokens</li><li>mean: 27.56 tokens</li><li>max: 49 tokens</li></ul> |
+ | | anchor | positive | negative |
+ |:--------|:---------|:---------|:---------|
+ | type | string | string | string |
+ | details | <ul><li>min: 4 tokens</li><li>mean: 25.33 tokens</li><li>max: 100 tokens</li></ul> | <ul><li>min: 4 tokens</li><li>mean: 24.98 tokens</li><li>max: 100 tokens</li></ul> | <ul><li>min: 5 tokens</li><li>mean: 19.06 tokens</li><li>max: 68 tokens</li></ul> |
  * Samples:
- | anchor | positive | negative |
- |:---------|:---------|:---------|
- | <code>( 1 ) Lakers vs. ( 2 ) San Antonio Spurs : `` Los Angeles Lakers Win 4-0</code> | <code>( 1 ) Lakers vs. ( 2 ) San Antonio Spurs : `` Los Angeles Lakers win series 4-0 ``</code> | <code>( 1 ) Los Angeles Lakers vs. ( 2 ) San Antonio Spurs : `` Lakers win series 4-0 ``</code> |
- | <code>( 1 ) Lakers vs. ( 2 ) San Antonio Spurs : `` Los Angeles Lakers win series 4-0 ``</code> | <code>( 1 ) Lakers vs. ( 2 ) San Antonio Spurs : `` Los Angeles Lakers Win 4-0</code> | <code>The study included 752 universities in Pennsylvania , including public schools , public charter schools and traditional public magnet schools .</code> |
- | <code>( 1 ) Los Angeles Lakers vs. ( 2 ) San Antonio Spurs : `` Lakers win series 4-0 ``</code> | <code>( 1 ) Los Angeles Lakers vs. ( 2 ) San Antonio Spurs : `` Lakers win series 4-0 ``</code> | <code>( 1 ) Lakers vs. ( 2 ) San Antonio Spurs : `` Los Angeles Lakers Win 4-0</code> |
+ | anchor | positive | negative |
+ |:---------|:---------|:---------|
+ | <code> What high potential jobs are there other than computer science?</code> | <code> What high potential jobs are there other than computer science?</code> | <code>Why IT or Computer Science jobs are being over rated than other Engineering jobs?</code> |
+ | <code> Would India ever be able to develop a missile system like S300 or S400 missile?</code> | <code> Would India ever be able to develop a missile system like S300 or S400 missile?</code> | <code>Should India buy the Russian S400 air defence missile system?</code> |
+ | <code> water from the faucet is being drunk by a yellow dog</code> | <code>A yellow dog is drinking water from the faucet</code> | <code>Childlessness is low in Eastern European countries.</code> |
  * Loss: <code>losses.ArcFaceInBatchLoss</code> with these parameters:
  ```json
  {
@@ -292,8 +287,8 @@ You can finetune this model on your own dataset.
  #### Non-Default Hyperparameters

  - `eval_strategy`: steps
- - `per_device_train_batch_size`: 4096
- - `per_device_eval_batch_size`: 4096
+ - `per_device_train_batch_size`: 8192
+ - `per_device_eval_batch_size`: 8192
  - `gradient_accumulation_steps`: 2
  - `weight_decay`: 0.001
  - `adam_beta2`: 0.98
@@ -319,8 +314,8 @@ You can finetune this model on your own dataset.
  - `do_predict`: False
  - `eval_strategy`: steps
  - `prediction_loss_only`: True
- - `per_device_train_batch_size`: 4096
- - `per_device_eval_batch_size`: 4096
+ - `per_device_train_batch_size`: 8192
+ - `per_device_eval_batch_size`: 8192
  - `per_gpu_train_batch_size`: None
  - `per_gpu_eval_batch_size`: None
  - `gradient_accumulation_steps`: 2
@@ -439,7 +434,7 @@ You can finetune this model on your own dataset.
  ### Training Logs
  | Epoch | Step | Validation Loss | test_cosine_ndcg@10 |
  |:-----:|:----:|:---------------:|:-------------------:|
- | 0     | 0    | 1.4689          | 0.7718              |
+ | 0     | 0    | 2.9916          | 0.7718              |


  ### Framework Versions
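
The updated README example reports a similarity matrix where the two duplicate sentences score 1.0000 against each other and 0.2617 against the unrelated one. The structure of that matrix can be sketched with plain cosine similarity on toy vectors (assuming, as is the sentence-transformers default, that `model.similarity` computes cosine similarity), without downloading the model:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": two identical vectors and one orthogonal vector, mirroring
# the README's two duplicate sentences and one unrelated sentence.
emb = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sim = [[cosine(a, b) for b in emb] for a in emb]
print(sim)  # [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

As in the README output, identical inputs score 1.0 both on and off the diagonal, and the matrix is symmetric.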
config.json CHANGED
@@ -12,7 +12,7 @@
  "cls_token_id": 50281,
  "decoder_bias": true,
  "deterministic_flash_attn": false,
- "dtype": "bfloat16",
+ "dtype": "float32",
  "embedding_dropout": 0.0,
  "eos_token_id": 50282,
  "global_attn_every_n_layers": 3,
model.safetensors CHANGED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:95d02211c4cca89113f9f3e93ed91f5176bf50170faa2cb835f7bfea15bb9dd2
- size 298041696
+ oid sha256:04aa7437b7f98ed3f652e300c1d767d07c1864c10b3055ea63831997faefa8d6
+ size 596070136
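
The `config.json` change from `bfloat16` to `float32` is consistent with the checkpoint roughly doubling in size (298,041,696 → 596,070,136 bytes): fp32 stores four bytes per parameter versus two for bf16. A quick sanity check of that arithmetic (the small deviation from an exact 2x is plausibly safetensors header/metadata overhead, an assumption on my part):

```python
# Bytes per parameter for the two dtypes involved in this commit.
BYTES_BF16 = 2
BYTES_FP32 = 4

# Checkpoint sizes from the model.safetensors diff above.
old_size = 298_041_696  # bfloat16 checkpoint
new_size = 596_070_136  # float32 checkpoint

size_ratio = new_size / old_size
dtype_ratio = BYTES_FP32 / BYTES_BF16
print(f"size ratio:  {size_ratio:.5f}")  # very close to 2.0
print(f"dtype ratio: {dtype_ratio}")     # 2.0
```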