bhavnicksm committed on
Commit 71c6f39 · 1 Parent(s): 2a143d3

Add model files

Files changed (6)
  1. README.md +74 -0
  2. assets/beetle_logo.png +0 -0
  3. config.json +12 -0
  4. model.safetensors +3 -0
  5. modules.json +8 -0
  6. tokenizer.json +0 -0
README.md CHANGED

---
base_model: mixedbread-ai/mxbai-embed-2d-large-v1
language:
- en
library_name: model2vec
license: mit
model_name: red-beetle-base-v1.1
tags:
- embeddings
- static-embeddings
- sentence-transformers
---
# 🪲 red-beetle-base-v1.1 Model Card

<div align="center">
<img width="75%" alt="Beetle logo" src="./assets/beetle_logo.png">
</div>

> [!TIP]
> Beetles are some of the most diverse and interesting creatures on Earth. They are found in every environment, from the deepest oceans to the highest mountains, and are known for adapting to a wide range of habitats and lifestyles. They are small, fast and powerful!

The beetle series of models is intended both as a set of good starting points for static embedding training (via TokenLearn or fine-tuning) and as a set of decent static embedding models in their own right. Each beetle model is built to improve on the original **M2V_base_output** model in some way, and that is the threshold set for each model (except the brown beetle series, which is the original model).

This model was distilled from `mixedbread-ai/mxbai-embed-2d-large-v1`, with PCA at 1024 dimensions and Zipf and SIF re-weighting, learned from a subset of the FineWeb-Edu sample-10BT dataset. It outperforms the original M2V_base_output model on all tasks.
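SIF (smooth inverse frequency) re-weighting down-weights very frequent tokens so that common words dominate averaged embeddings less. A rough illustrative sketch, not the model's actual training code — the token counts and smoothing constant below are made up:

```python
# Hypothetical token counts; real SIF weights come from corpus statistics.
counts = {"the": 50_000, "beetle": 12, "embedding": 30}
total = sum(counts.values())

a = 1e-3  # a common choice of SIF smoothing constant
sif_weight = {tok: a / (a + c / total) for tok, c in counts.items()}

# Frequent tokens ("the") receive much smaller weights than rare ones.
```

The weight `a / (a + p(w))` shrinks toward zero as a token's corpus probability `p(w)` grows, which is what suppresses stop-word-like tokens.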
## Version Information

- **red-beetle-base-v0**: The original model, without PCA or Zipf re-weighting. The lack of PCA and Zipf also makes this a decent model for further training.
- **red-beetle-base-v1**: The original model, with PCA at 1024 dimensions and (Zipf)^3 re-weighting.
- **red-beetle-small-v1**: A smaller version of the original model, with PCA at 384 dimensions and (Zipf)^3 re-weighting.
- **red-beetle-base-v1.1**: The original model, with PCA at 1024 dimensions and Zipf and SIF re-weighting, learned from a subset of the FineWeb-Edu sample-10BT dataset.
- **red-beetle-small-v1.1**: A smaller version of the original model, with PCA at 384 dimensions and Zipf and SIF re-weighting, learned from a subset of the FineWeb-Edu sample-10BT dataset.
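The PCA step in these variants projects the distilled token-embedding matrix down to a fixed dimensionality (1024 for the base models, 384 for the small ones). A minimal sketch of that projection via SVD, using a random matrix as a stand-in for the real token embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 2048))  # stand-in: 1000 tokens, 2048 dims

# Centre the matrix, then project onto the top principal components.
centered = emb - emb.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
reduced = centered @ vt[:384].T  # keep the first 384 components

# reduced.shape == (1000, 384)
```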
## Installation

Install model2vec using pip:

```bash
pip install model2vec
```

## Usage

Load this model using the `from_pretrained` method:

```python
from model2vec import StaticModel

# Load a pretrained Model2Vec model
model = StaticModel.from_pretrained("bhavnicksm/red-beetle-base-v1.1")

# Compute text embeddings
embeddings = model.encode(["Example sentence"])
```

Read more about the Model2Vec library [here](https://github.com/MinishLab/model2vec).
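Once embeddings are computed, they can be compared with cosine similarity for search or clustering. A small self-contained sketch (the vectors below are stand-ins, not real model output):

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-in embeddings; in practice these would come from model.encode([...]).
emb_a = np.array([0.1, 0.3, 0.5])
emb_b = np.array([0.2, 0.1, 0.4])
sim = cosine_similarity(emb_a, emb_b)
```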

## Comparison with other models

Coming soon...

## Acknowledgements

This model is made using the [Model2Vec](https://github.com/MinishLab/model2vec) library. Credit goes to the [Minish Lab](https://github.com/MinishLab) team for developing this library.

## Citation

Please cite the [Model2Vec repository](https://github.com/MinishLab/model2vec) if you use this model in your work.
```bibtex
@software{minishlab2024model2vec,
  author = {Stephan Tulkens and Thomas van Dongen},
  title = {Model2Vec: Turn any Sentence Transformer into a Small Fast Model},
  year = {2024},
  url = {https://github.com/MinishLab/model2vec},
}
```
assets/beetle_logo.png ADDED
config.json ADDED

```json
{
  "model_type": "model2vec",
  "architectures": [
    "StaticModel"
  ],
  "tokenizer_name": "mixedbread-ai/mxbai-embed-2d-large-v1",
  "apply_pca": null,
  "apply_zipf": false,
  "hidden_dim": 1024,
  "seq_length": 1000000,
  "normalize": false
}
```
model.safetensors ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:762c14708e0040223c5cbb4d7b8f474166a0fabf4fa5d7fbd82d3007902969ca
size 120946776
```
modules.json ADDED

```json
[
  {
    "idx": 0,
    "name": "0",
    "path": ".",
    "type": "sentence_transformers.models.StaticEmbedding"
  }
]
```
tokenizer.json ADDED
The diff for this file is too large to render.