dataera2013 committed
Commit c0abc78 · verified · 1 Parent(s): a2a669c

Add new SentenceTransformer model

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+   "word_embedding_dimension": 1024,
+   "pooling_mode_cls_token": true,
+   "pooling_mode_mean_tokens": false,
+   "pooling_mode_max_tokens": false,
+   "pooling_mode_mean_sqrt_len_tokens": false,
+   "pooling_mode_weightedmean_tokens": false,
+   "pooling_mode_lasttoken": false,
+   "include_prompt": true
+ }
README.md ADDED
@@ -0,0 +1,695 @@
+ ---
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ - generated_from_trainer
+ - dataset_size:164
+ - loss:MatryoshkaLoss
+ - loss:MultipleNegativesRankingLoss
+ base_model: Snowflake/snowflake-arctic-embed-l
+ widget:
+ - source_sentence: 'QUESTION #1\n'
+   sentences:
+   - 'An interesting point of comparison here could be the way railways rolled out
+     around the world in the 1800s. Constructing these required enormous investments
+     and had a massive environmental impact, and many of the lines that were built
+     turned out to be unnecessary—sometimes multiple lines from different companies
+     serving the exact same routes!
+
+     The resulting bubbles contributed to several financial crashes, see Wikipedia
+     for Panic of 1873, Panic of 1893, Panic of 1901 and the UK’s Railway Mania. They
+     left us with a lot of useful infrastructure and a great deal of bankruptcies and
+     environmental damage.
+
+     The year of slop'
+   - 'This remains astonishing to me. I thought a model with the capabilities and output
+     quality of GPT-4 needed a datacenter class server with one or more $40,000+ GPUs.
+
+     These models take up enough of my 64GB of RAM that I don’t run them often—they
+     don’t leave much room for anything else.
+
+     The fact that they run at all is a testament to the incredible training and inference
+     performance gains that we’ve figured out over the past year. It turns out there
+     was a lot of low-hanging fruit to be harvested in terms of model efficiency. I
+     expect there’s still more to come.'
+   - 'Things we learned about LLMs in 2024
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+     Simon Willison’s Weblog
+
+     Subscribe
+
+
+
+
+
+
+
+
+     Things we learned about LLMs in 2024
+
+     31st December 2024
+
+     A lot has happened in the world of Large Language Models over the course of 2024.
+     Here’s a review of things we figured out about the field in the past twelve months,
+     plus my attempt at identifying key themes and pivotal moments.
+
+     This is a sequel to my review of 2023.
+
+     In this article:'
+ - source_sentence: 'QUESTION #2\n...\n\nContext:\nJust this week, the New York Times
+     launched a landmark lawsuit against OpenAI and Microsoft over this issue. The
+     69 page PDF is genuinely worth reading—especially the first few pages, which lay
+     out the issues in a way that’s surprisingly easy to follow. The rest of the document
+     includes some of the clearest explanations of what LLMs are, how they work and
+     how they are built that I’ve read anywhere.\nThe legal arguments here are complex.
+     I’m not a lawyer, but I don’t think this one will be easily decided. Whichever
+     way it goes, I expect this case to have a profound impact on how this technology
+     develops in the future.\n'', additional_kwargs={}, response_metadata={})]'
+   sentences:
+   - 'A lot of people are excited about AI agents—an infuriatingly vague term that
+     seems to be converging on “AI systems that can go away and act on your behalf”.
+     We’ve been talking about them all year, but I’ve seen few if any examples of them
+     running in production, despite lots of exciting prototypes.
+
+     I think this is because of gullibility.
+
+     Can we solve this? Honestly, I’m beginning to suspect that you can’t fully solve
+     gullibility without achieving AGI. So it may be quite a while before those agent
+     dreams can really start to come true!
+
+     Code may be the best application
+
+     Over the course of the year, it’s become increasingly clear that writing code
+     is one of the things LLMs are most capable of.'
+   - 'Just this week, the New York Times launched a landmark lawsuit against OpenAI
+     and Microsoft over this issue. The 69 page PDF is genuinely worth reading—especially
+     the first few pages, which lay out the issues in a way that’s surprisingly easy
+     to follow. The rest of the document includes some of the clearest explanations
+     of what LLMs are, how they work and how they are built that I’ve read anywhere.
+
+     The legal arguments here are complex. I’m not a lawyer, but I don’t think this
+     one will be easily decided. Whichever way it goes, I expect this case to have
+     a profound impact on how this technology develops in the future.'
+   - 'Then there’s the rest. If you browse the Chatbot Arena leaderboard today—still
+     the most useful single place to get a vibes-based evaluation of models—you’ll
+     see that GPT-4-0314 has fallen to around 70th place. The 18 organizations with
+     higher scoring models are Google, OpenAI, Alibaba, Anthropic, Meta, Reka AI, 01
+     AI, Amazon, Cohere, DeepSeek, Nvidia, Mistral, NexusFlow, Zhipu AI, xAI, AI21
+     Labs, Princeton and Tencent.
+
+     Training a GPT-4 beating model was a huge deal in 2023. In 2024 it’s an achievement
+     that isn’t even particularly notable, though I personally still celebrate any
+     time a new organization joins that list.
+
+     Some of those GPT-4 models run on my laptop'
+ - source_sentence: 'QUESTION #1\n'
+   sentences:
+   - 'The biggest innovation here is that it opens up a new way to scale a model: instead
+     of improving model performance purely through additional compute at training time,
+     models can now take on harder problems by spending more compute on inference.
+
+     The sequel to o1, o3 (they skipped “o2” for European trademark reasons) was announced
+     on 20th December with an impressive result against the ARC-AGI benchmark, albeit
+     one that likely involved more than $1,000,000 of compute time expense!
+
+     o3 is expected to ship in January. I doubt many people have real-world problems
+     that would benefit from that level of compute expenditure—I certainly don’t!—but
+     it appears to be a genuine next step in LLM architecture for taking on much harder
+     problems.'
+   - 'Those US export regulations on GPUs to China seem to have inspired some very
+     effective training optimizations!
+
+     The environmental impact got better
+
+     A welcome result of the increased efficiency of the models—both the hosted ones
+     and the ones I can run locally—is that the energy usage and environmental impact
+     of running a prompt has dropped enormously over the past couple of years.
+
+     OpenAI themselves are charging 100x less for a prompt compared to the GPT-3 days.
+     I have it on good authority that neither Google Gemini nor Amazon Nova (two of
+     the least expensive model providers) are running prompts at a loss.'
+   - 'OpenAI made GPT-4o free for all users in May, and Claude 3.5 Sonnet was freely
+     available from its launch in June. This was a momentus change, because for the
+     previous year free users had mostly been restricted to GPT-3.5 level models, meaning
+     new users got a very inaccurate mental model of what a capable LLM could actually
+     do.
+
+     That era appears to have ended, likely permanently, with OpenAI’s launch of ChatGPT
+     Pro. This $200/month subscription service is the only way to access their most
+     capable model, o1 Pro.
+
+     Since the trick behind the o1 series (and the future models it will undoubtedly
+     inspire) is to expend more compute time to get better results, I don’t think those
+     days of free access to the best available models are likely to return.'
+ - source_sentence: 'QUESTION #1\n'
+   sentences:
+   - 'The May 13th announcement of GPT-4o included a demo of a brand new voice mode,
+     where the true multi-modal GPT-4o (the o is for “omni”) model could accept audio
+     input and output incredibly realistic sounding speech without needing separate
+     TTS or STT models.
+
+     The demo also sounded conspicuously similar to Scarlett Johansson... and after
+     she complained the voice from the demo, Skye, never made it to a production product.
+
+     The delay in releasing the new voice mode after the initial demo caused quite
+     a lot of confusion. I wrote about that in ChatGPT in “4o” mode is not running
+     the new features yet.'
+   - 'Against this photo of butterflies at the California Academy of Sciences:
+
+
+
+     A shallow dish, likely a hummingbird or butterfly feeder, is red. Pieces of orange
+     slices of fruit are visible inside the dish.
+
+     Two butterflies are positioned in the feeder, one is a dark brown/black butterfly
+     with white/cream-colored markings. The other is a large, brown butterfly with
+     patterns of lighter brown, beige, and black markings, including prominent eye
+     spots. The larger brown butterfly appears to be feeding on the fruit.'
+   - 'The year of slop
+
+     Synthetic training data works great
+
+     LLMs somehow got even harder to use
+
+     Knowledge is incredibly unevenly distributed
+
+     LLMs need better criticism
+
+     Everything tagged “llms” on my blog in 2024'
+ - source_sentence: 'QUESTION #1\n'
+   sentences:
+   - 'Terminology aside, I remain skeptical as to their utility based, once again,
+     on the challenge of gullibility. LLMs believe anything you tell them. Any systems
+     that attempts to make meaningful decisions on your behalf will run into the same
+     roadblock: how good is a travel agent, or a digital assistant, or even a research
+     tool if it can’t distinguish truth from fiction?
+
+     Just the other day Google Search was caught serving up an entirely fake description
+     of the non-existant movie “Encanto 2”. It turned out to be summarizing an imagined
+     movie listing from a fan fiction wiki.'
+   - 'Your browser does not support the audio element.
+
+
+     OpenAI aren’t the only group with a multi-modal audio model. Google’s Gemini also
+     accepts audio input, and the Google Gemini apps can speak in a similar way to
+     ChatGPT now. Amazon also pre-announced voice mode for Amazon Nova, but that’s
+     meant to roll out in Q1 of 2025.
+
+     Google’s NotebookLM, released in September, took audio output to a new level by
+     producing spookily realistic conversations between two “podcast hosts” about anything
+     you fed into their tool. They later added custom instructions, so naturally I
+     turned them into pelicans:
+
+
+
+     Your browser does not support the audio element.'
+   - 'Then in February, Meta released Llama. And a few weeks later in March, Georgi
+     Gerganov released code that got it working on a MacBook.
+
+     I wrote about how Large language models are having their Stable Diffusion moment,
+     and with hindsight that was a very good call!
+
+     This unleashed a whirlwind of innovation, which was accelerated further in July
+     when Meta released Llama 2—an improved version which, crucially, included permission
+     for commercial use.
+
+     Today there are literally thousands of LLMs that can be run locally, on all manner
+     of different devices.'
+ pipeline_tag: sentence-similarity
+ library_name: sentence-transformers
+ metrics:
+ - cosine_accuracy@1
+ - cosine_accuracy@3
+ - cosine_accuracy@5
+ - cosine_accuracy@10
+ - cosine_precision@1
+ - cosine_precision@3
+ - cosine_precision@5
+ - cosine_precision@10
+ - cosine_recall@1
+ - cosine_recall@3
+ - cosine_recall@5
+ - cosine_recall@10
+ - cosine_ndcg@10
+ - cosine_mrr@10
+ - cosine_map@100
+ model-index:
+ - name: SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
+   results:
+   - task:
+       type: information-retrieval
+       name: Information Retrieval
+     dataset:
+       name: Unknown
+       type: unknown
+     metrics:
+     - type: cosine_accuracy@1
+       value: 0.56
+       name: Cosine Accuracy@1
+     - type: cosine_accuracy@3
+       value: 0.64
+       name: Cosine Accuracy@3
+     - type: cosine_accuracy@5
+       value: 0.72
+       name: Cosine Accuracy@5
+     - type: cosine_accuracy@10
+       value: 0.92
+       name: Cosine Accuracy@10
+     - type: cosine_precision@1
+       value: 0.56
+       name: Cosine Precision@1
+     - type: cosine_precision@3
+       value: 0.21333333333333332
+       name: Cosine Precision@3
+     - type: cosine_precision@5
+       value: 0.14400000000000002
+       name: Cosine Precision@5
+     - type: cosine_precision@10
+       value: 0.09200000000000001
+       name: Cosine Precision@10
+     - type: cosine_recall@1
+       value: 0.56
+       name: Cosine Recall@1
+     - type: cosine_recall@3
+       value: 0.64
+       name: Cosine Recall@3
+     - type: cosine_recall@5
+       value: 0.72
+       name: Cosine Recall@5
+     - type: cosine_recall@10
+       value: 0.92
+       name: Cosine Recall@10
+     - type: cosine_ndcg@10
+       value: 0.7017423735235339
+       name: Cosine Ndcg@10
+     - type: cosine_mrr@10
+       value: 0.63715873015873
+       name: Cosine Mrr@10
+     - type: cosine_map@100
+       value: 0.6441284271284272
+       name: Cosine Map@100
+ ---
+
+ # SentenceTransformer based on Snowflake/snowflake-arctic-embed-l
+
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
+
+ ## Model Details
+
+ ### Model Description
+ - **Model Type:** Sentence Transformer
+ - **Base model:** [Snowflake/snowflake-arctic-embed-l](https://huggingface.co/Snowflake/snowflake-arctic-embed-l) <!-- at revision d8fb21ca8d905d2832ee8b96c894d3298964346b -->
+ - **Maximum Sequence Length:** 512 tokens
+ - **Output Dimensionality:** 1024 dimensions
+ - **Similarity Function:** Cosine Similarity
+ <!-- - **Training Dataset:** Unknown -->
+ <!-- - **Language:** Unknown -->
+ <!-- - **License:** Unknown -->
+
+ ### Model Sources
+
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
+
+ ### Full Model Architecture
+
+ ```
+ SentenceTransformer(
+   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
+   (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
+   (2): Normalize()
+ )
+ ```
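
The `Pooling` module here is configured for CLS-token pooling (mirroring `1_Pooling/config.json` above): the embedding of the leading `[CLS]` token is taken as the sentence vector and then L2-normalized by the `Normalize()` module. A rough sketch of the equivalent computation with plain `transformers`, for illustration only (Sentence Transformers performs this internally):

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustration of CLS pooling + L2 normalization, assuming the base encoder.
tokenizer = AutoTokenizer.from_pretrained("Snowflake/snowflake-arctic-embed-l")
encoder = AutoModel.from_pretrained("Snowflake/snowflake-arctic-embed-l")

batch = tokenizer(["An example sentence"], padding=True, truncation=True,
                  max_length=512, return_tensors="pt")
with torch.no_grad():
    token_embeddings = encoder(**batch).last_hidden_state   # (batch, seq_len, 1024)

sentence_embedding = token_embeddings[:, 0]                  # [CLS] token -> (batch, 1024)
sentence_embedding = torch.nn.functional.normalize(sentence_embedding, p=2, dim=1)
```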
+
+ ## Usage
+
+ ### Direct Usage (Sentence Transformers)
+
+ First install the Sentence Transformers library:
+
+ ```bash
+ pip install -U sentence-transformers
+ ```
+
+ Then you can load this model and run inference.
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Download from the 🤗 Hub
+ model = SentenceTransformer("dataera2013/legal-ft-2")
+ # Run inference
+ sentences = [
+ 'QUESTION #1\\n',
+ 'Your browser does not support the audio element.\n\nOpenAI aren’t the only group with a multi-modal audio model. Google’s Gemini also accepts audio input, and the Google Gemini apps can speak in a similar way to ChatGPT now. Amazon also pre-announced voice mode for Amazon Nova, but that’s meant to roll out in Q1 of 2025.\nGoogle’s NotebookLM, released in September, took audio output to a new level by producing spookily realistic conversations between two “podcast hosts” about anything you fed into their tool. They later added custom instructions, so naturally I turned them into pelicans:\n\n\nYour browser does not support the audio element.',
+ 'Then in February, Meta released Llama. And a few weeks later in March, Georgi Gerganov released code that got it working on a MacBook.\nI wrote about how Large language models are having their Stable Diffusion moment, and with hindsight that was a very good call!\nThis unleashed a whirlwind of innovation, which was accelerated further in July when Meta released Llama 2—an improved version which, crucially, included permission for commercial use.\nToday there are literally thousands of LLMs that can be run locally, on all manner of different devices.',
+ ]
+ embeddings = model.encode(sentences)
+ print(embeddings.shape)
+ # [3, 1024]
+
+ # Get the similarity scores for the embeddings
+ similarities = model.similarity(embeddings, embeddings)
+ print(similarities.shape)
+ # [3, 3]
+ ```
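
For retrieval-style use, `config_sentence_transformers.json` in this same commit registers a `query` prompt ("Represent this sentence for searching relevant passages: "), the usual arctic-embed query prefix. A sketch of applying it via the `prompt_name` argument of `encode`; the query and passage strings below are made-up examples:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("dataera2013/legal-ft-2")

# The registered "query" prompt is prepended to queries only.
query_embeddings = model.encode(
    ["What changed for free access to top models in 2024?"],
    prompt_name="query",
)
passage_embeddings = model.encode(
    ["OpenAI made GPT-4o free for all users in May, and Claude 3.5 Sonnet was freely available from its launch in June."],
)

print(model.similarity(query_embeddings, passage_embeddings))
```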
+
+ <!--
+ ### Direct Usage (Transformers)
+
+ <details><summary>Click to see the direct usage in Transformers</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Downstream Usage (Sentence Transformers)
+
+ You can finetune this model on your own dataset.
+
+ <details><summary>Click to expand</summary>
+
+ </details>
+ -->
+
+ <!--
+ ### Out-of-Scope Use
+
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
+ -->
+
+ ## Evaluation
+
+ ### Metrics
+
+ #### Information Retrieval
+
+ * Evaluated with [<code>InformationRetrievalEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.InformationRetrievalEvaluator)
+
+ | Metric | Value |
+ |:--------------------|:-----------|
+ | cosine_accuracy@1 | 0.56 |
+ | cosine_accuracy@3 | 0.64 |
+ | cosine_accuracy@5 | 0.72 |
+ | cosine_accuracy@10 | 0.92 |
+ | cosine_precision@1 | 0.56 |
+ | cosine_precision@3 | 0.2133 |
+ | cosine_precision@5 | 0.144 |
+ | cosine_precision@10 | 0.092 |
+ | cosine_recall@1 | 0.56 |
+ | cosine_recall@3 | 0.64 |
+ | cosine_recall@5 | 0.72 |
+ | cosine_recall@10 | 0.92 |
+ | **cosine_ndcg@10** | **0.7017** |
+ | cosine_mrr@10 | 0.6372 |
+ | cosine_map@100 | 0.6441 |
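
These numbers come from the Sentence Transformers `InformationRetrievalEvaluator` linked above. A sketch of how a comparable evaluation could be run; the queries, corpus and relevance judgments below are invented placeholders, since the actual evaluation split is not published with this model:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("dataera2013/legal-ft-2")

# Placeholder evaluation data: id -> text, plus the relevant corpus ids per query.
queries = {"q1": "What is the biggest unsolved problem for AI agents?"}
corpus = {
    "d1": "Gullibility is the biggest unsolved problem for AI agents.",
    "d2": "An unrelated passage about railways in the 1800s.",
}
relevant_docs = {"q1": {"d1"}}

evaluator = InformationRetrievalEvaluator(queries, corpus, relevant_docs, name="demo")
print(evaluator(model))  # reports accuracy@k, precision@k, recall@k, NDCG@10, MRR@10, MAP@100
```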
+
+ <!--
+ ## Bias, Risks and Limitations
+
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
+ -->
+
+ <!--
+ ### Recommendations
+
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
+ -->
+
+ ## Training Details
+
+ ### Training Dataset
+
+ #### Unnamed Dataset
+
+ * Size: 164 training samples
+ * Columns: <code>sentence_0</code> and <code>sentence_1</code>
+ * Approximate statistics based on the first 164 samples:
+ | | sentence_0 | sentence_1 |
+ |:--------|:-----------|:-----------|
+ | type | string | string |
+ | details | <ul><li>min: 4 tokens</li><li>mean: 72.05 tokens</li><li>max: 228 tokens</li></ul> | <ul><li>min: 43 tokens</li><li>mean: 135.85 tokens</li><li>max: 214 tokens</li></ul> |
+ * Samples:
+ | sentence_0 | sentence_1 |
+ |:-----------|:-----------|
+ | <code>QUESTION #1\n</code> | <code>Stuff we figured out about AI in 2023<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>Simon Willison’s Weblog<br>Subscribe<br><br><br><br><br><br><br>Stuff we figured out about AI in 2023<br>31st December 2023<br>2023 was the breakthrough year for Large Language Models (LLMs). I think it’s OK to call these AI—they’re the latest and (currently) most interesting development in the academic field of Artificial Intelligence that dates back to the 1950s.<br>Here’s my attempt to round up the highlights in one place!</code> |
+ | <code>QUESTION #2\n...\n\nContext:\nStuff we figured out about AI in 2023\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSimon Willison’s Weblog\nSubscribe\n\n\n\n\n\n\nStuff we figured out about AI in 2023\n31st December 2023\n2023 was the breakthrough year for Large Language Models (LLMs). I think it’s OK to call these AI—they’re the latest and (currently) most interesting development in the academic field of Artificial Intelligence that dates back to the 1950s.\nHere’s my attempt to round up the highlights in one place!\n', additional_kwargs={}, response_metadata={})]</code> | <code>Stuff we figured out about AI in 2023<br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br><br>Simon Willison’s Weblog<br>Subscribe<br><br><br><br><br><br><br>Stuff we figured out about AI in 2023<br>31st December 2023<br>2023 was the breakthrough year for Large Language Models (LLMs). I think it’s OK to call these AI—they’re the latest and (currently) most interesting development in the academic field of Artificial Intelligence that dates back to the 1950s.<br>Here’s my attempt to round up the highlights in one place!</code> |
+ | <code>QUESTION #1\n</code> | <code>Large Language Models<br>They’re actually quite easy to build<br>You can run LLMs on your own devices<br>Hobbyists can build their own fine-tuned models<br>We don’t yet know how to build GPT-4<br>Vibes Based Development<br>LLMs are really smart, and also really, really dumb<br>Gullibility is the biggest unsolved problem<br>Code may be the best application<br>The ethics of this space remain diabolically complex<br>My blog in 2023</code> |
+ * Loss: [<code>MatryoshkaLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#matryoshkaloss) with these parameters:
+   ```json
+   {
+       "loss": "MultipleNegativesRankingLoss",
+       "matryoshka_dims": [
+           768,
+           512,
+           256,
+           128,
+           64
+       ],
+       "matryoshka_weights": [
+           1,
+           1,
+           1,
+           1,
+           1
+       ],
+       "n_dims_per_step": -1
+   }
+   ```
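
Concretely, the in-batch-negatives ranking loss is computed not only on the full embedding but also on its truncations at each listed size, so embeddings stay usable when shortened to 256 or even 64 dimensions. A minimal sketch of how this loss pairing is typically set up in Sentence Transformers (the model variable is assumed to be the base encoder being fine-tuned):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# Ranking loss over (sentence_0, sentence_1) pairs using in-batch negatives,
# wrapped so it is also applied to truncated embeddings.
base_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, base_loss, matryoshka_dims=[768, 512, 256, 128, 64])
```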
+
+ ### Training Hyperparameters
+ #### Non-Default Hyperparameters
+
+ - `eval_strategy`: steps
+ - `per_device_train_batch_size`: 10
+ - `per_device_eval_batch_size`: 10
+ - `num_train_epochs`: 10
+ - `multi_dataset_batch_sampler`: round_robin
+
+ #### All Hyperparameters
+ <details><summary>Click to expand</summary>
+
+ - `overwrite_output_dir`: False
+ - `do_predict`: False
+ - `eval_strategy`: steps
+ - `prediction_loss_only`: True
+ - `per_device_train_batch_size`: 10
+ - `per_device_eval_batch_size`: 10
+ - `per_gpu_train_batch_size`: None
+ - `per_gpu_eval_batch_size`: None
+ - `gradient_accumulation_steps`: 1
+ - `eval_accumulation_steps`: None
+ - `torch_empty_cache_steps`: None
+ - `learning_rate`: 5e-05
+ - `weight_decay`: 0.0
+ - `adam_beta1`: 0.9
+ - `adam_beta2`: 0.999
+ - `adam_epsilon`: 1e-08
+ - `max_grad_norm`: 1
+ - `num_train_epochs`: 10
+ - `max_steps`: -1
+ - `lr_scheduler_type`: linear
+ - `lr_scheduler_kwargs`: {}
+ - `warmup_ratio`: 0.0
+ - `warmup_steps`: 0
+ - `log_level`: passive
+ - `log_level_replica`: warning
+ - `log_on_each_node`: True
+ - `logging_nan_inf_filter`: True
+ - `save_safetensors`: True
+ - `save_on_each_node`: False
+ - `save_only_model`: False
+ - `restore_callback_states_from_checkpoint`: False
+ - `no_cuda`: False
+ - `use_cpu`: False
+ - `use_mps_device`: False
+ - `seed`: 42
+ - `data_seed`: None
+ - `jit_mode_eval`: False
+ - `use_ipex`: False
+ - `bf16`: False
+ - `fp16`: False
+ - `fp16_opt_level`: O1
+ - `half_precision_backend`: auto
+ - `bf16_full_eval`: False
+ - `fp16_full_eval`: False
+ - `tf32`: None
+ - `local_rank`: 0
+ - `ddp_backend`: None
+ - `tpu_num_cores`: None
+ - `tpu_metrics_debug`: False
+ - `debug`: []
+ - `dataloader_drop_last`: False
+ - `dataloader_num_workers`: 0
+ - `dataloader_prefetch_factor`: None
+ - `past_index`: -1
+ - `disable_tqdm`: False
+ - `remove_unused_columns`: True
+ - `label_names`: None
+ - `load_best_model_at_end`: False
+ - `ignore_data_skip`: False
+ - `fsdp`: []
+ - `fsdp_min_num_params`: 0
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
+ - `fsdp_transformer_layer_cls_to_wrap`: None
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
+ - `deepspeed`: None
+ - `label_smoothing_factor`: 0.0
+ - `optim`: adamw_torch
+ - `optim_args`: None
+ - `adafactor`: False
+ - `group_by_length`: False
+ - `length_column_name`: length
+ - `ddp_find_unused_parameters`: None
+ - `ddp_bucket_cap_mb`: None
+ - `ddp_broadcast_buffers`: False
+ - `dataloader_pin_memory`: True
+ - `dataloader_persistent_workers`: False
+ - `skip_memory_metrics`: True
+ - `use_legacy_prediction_loop`: False
+ - `push_to_hub`: False
+ - `resume_from_checkpoint`: None
+ - `hub_model_id`: None
+ - `hub_strategy`: every_save
+ - `hub_private_repo`: None
+ - `hub_always_push`: False
+ - `gradient_checkpointing`: False
+ - `gradient_checkpointing_kwargs`: None
+ - `include_inputs_for_metrics`: False
+ - `include_for_metrics`: []
+ - `eval_do_concat_batches`: True
+ - `fp16_backend`: auto
+ - `push_to_hub_model_id`: None
+ - `push_to_hub_organization`: None
+ - `mp_parameters`:
+ - `auto_find_batch_size`: False
+ - `full_determinism`: False
+ - `torchdynamo`: None
+ - `ray_scope`: last
+ - `ddp_timeout`: 1800
+ - `torch_compile`: False
+ - `torch_compile_backend`: None
+ - `torch_compile_mode`: None
+ - `dispatch_batches`: None
+ - `split_batches`: None
+ - `include_tokens_per_second`: False
+ - `include_num_input_tokens_seen`: False
+ - `neftune_noise_alpha`: None
+ - `optim_target_modules`: None
+ - `batch_eval_metrics`: False
+ - `eval_on_start`: False
+ - `use_liger_kernel`: False
+ - `eval_use_gather_object`: False
+ - `average_tokens_across_devices`: False
+ - `prompts`: None
+ - `batch_sampler`: batch_sampler
+ - `multi_dataset_batch_sampler`: round_robin
+
+ </details>
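
Putting the pieces together, the hyperparameters above correspond to a standard Sentence Transformers 3.x training run. A sketch of how such a run could be reconstructed; the training pairs below are placeholders, as the 164-row dataset itself is not part of this repository:

```python
from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("Snowflake/snowflake-arctic-embed-l")

# Placeholder (sentence_0, sentence_1) pairs; the real run used 164 of them.
train_dataset = Dataset.from_dict({
    "sentence_0": ["QUESTION #1\n"],
    "sentence_1": ["A lot has happened in the world of Large Language Models over 2024."],
})

loss = MatryoshkaLoss(
    model,
    MultipleNegativesRankingLoss(model),
    matryoshka_dims=[768, 512, 256, 128, 64],
)

args = SentenceTransformerTrainingArguments(
    output_dir="legal-ft-2",
    num_train_epochs=10,
    per_device_train_batch_size=10,
    per_device_eval_batch_size=10,
)

trainer = SentenceTransformerTrainer(model=model, args=args, train_dataset=train_dataset, loss=loss)
trainer.train()
```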
+
+ ### Training Logs
+ | Epoch | Step | cosine_ndcg@10 |
+ |:------:|:----:|:--------------:|
+ | 1.0 | 17 | 0.7017 |
+ | 2.0 | 34 | 0.7017 |
+ | 2.9412 | 50 | 0.7017 |
+ | 3.0 | 51 | 0.7017 |
+ | 4.0 | 68 | 0.7017 |
+ | 5.0 | 85 | 0.7017 |
+ | 5.8824 | 100 | 0.7017 |
+ | 6.0 | 102 | 0.7017 |
+ | 7.0 | 119 | 0.7017 |
+ | 8.0 | 136 | 0.7017 |
+ | 8.8235 | 150 | 0.7017 |
+ | 9.0 | 153 | 0.7017 |
+ | 10.0 | 170 | 0.7017 |
+
+
+ ### Framework Versions
+ - Python: 3.13.1
+ - Sentence Transformers: 3.4.1
+ - Transformers: 4.48.3
+ - PyTorch: 2.6.0+cu124
+ - Accelerate: 1.3.0
+ - Datasets: 3.2.0
+ - Tokenizers: 0.21.0
+
+ ## Citation
+
+ ### BibTeX
+
+ #### Sentence Transformers
+ ```bibtex
+ @inproceedings{reimers-2019-sentence-bert,
+     title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
+     author = "Reimers, Nils and Gurevych, Iryna",
+     booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
+     month = "11",
+     year = "2019",
+     publisher = "Association for Computational Linguistics",
+     url = "https://arxiv.org/abs/1908.10084",
+ }
+ ```
+
+ #### MatryoshkaLoss
+ ```bibtex
+ @misc{kusupati2024matryoshka,
+     title={Matryoshka Representation Learning},
+     author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
+     year={2024},
+     eprint={2205.13147},
+     archivePrefix={arXiv},
+     primaryClass={cs.LG}
+ }
+ ```
+
+ #### MultipleNegativesRankingLoss
+ ```bibtex
+ @misc{henderson2017efficient,
+     title={Efficient Natural Language Response Suggestion for Smart Reply},
+     author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
+     year={2017},
+     eprint={1705.00652},
+     archivePrefix={arXiv},
+     primaryClass={cs.CL}
+ }
+ ```
+
+ <!--
+ ## Glossary
+
+ *Clearly define terms in order to be accessible across audiences.*
+ -->
+
+ <!--
+ ## Model Card Authors
+
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
+ -->
+
+ <!--
+ ## Model Card Contact
+
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
+ -->
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_name_or_path": "Snowflake/snowflake-arctic-embed-l",
+   "architectures": [
+     "BertModel"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 1024,
+   "initializer_range": 0.02,
+   "intermediate_size": 4096,
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 16,
+   "num_hidden_layers": 24,
+   "pad_token_id": 0,
+   "position_embedding_type": "absolute",
+   "torch_dtype": "float32",
+   "transformers_version": "4.48.3",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,12 @@
+ {
+   "__version__": {
+     "sentence_transformers": "3.4.1",
+     "transformers": "4.48.3",
+     "pytorch": "2.6.0+cu124"
+   },
+   "prompts": {
+     "query": "Represent this sentence for searching relevant passages: "
+   },
+   "default_prompt_name": null,
+   "similarity_fn_name": "cosine"
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1ac0d0ede29db1e9386d36a3fa534950ab52a7d0aad8602d2e7662e8b448e0e6
+ size 1336413848
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+   {
+     "idx": 0,
+     "name": "0",
+     "path": "",
+     "type": "sentence_transformers.models.Transformer"
+   },
+   {
+     "idx": 1,
+     "name": "1",
+     "path": "1_Pooling",
+     "type": "sentence_transformers.models.Pooling"
+   },
+   {
+     "idx": 2,
+     "name": "2",
+     "path": "2_Normalize",
+     "type": "sentence_transformers.models.Normalize"
+   }
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+   "max_seq_length": 512,
+   "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,63 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": true,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_length": 512,
+   "model_max_length": 512,
+   "pad_to_multiple_of": null,
+   "pad_token": "[PAD]",
+   "pad_token_type_id": 0,
+   "padding_side": "right",
+   "sep_token": "[SEP]",
+   "stride": 0,
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "truncation_side": "right",
+   "truncation_strategy": "longest_first",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff