zpn committed
Commit f326ddf · verified · 1 Parent(s): d0e6b4c

Update README.md

Files changed (1)
  1. README.md +109 -116
README.md CHANGED
@@ -1,141 +1,134 @@
- ---
- library_name: sentence-transformers
- pipeline_tag: sentence-similarity
- tags:
- - sentence-transformers
- - sentence-similarity
- - feature-extraction
- ---
-
- # SentenceTransformer
-
- This is a [sentence-transformers](https://www.SBERT.net) model trained. It maps sentences & paragraphs to a 3584-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
-
- ## Model Details
-
- ### Model Description
- - **Model Type:** Sentence Transformer
- <!-- - **Base model:** [Unknown](https://huggingface.co/unknown) -->
- - **Maximum Sequence Length:** 32768 tokens
- - **Output Dimensionality:** 3584 dimensions
- - **Similarity Function:** Cosine Similarity
- <!-- - **Training Dataset:** Unknown -->
- <!-- - **Language:** Unknown -->
- <!-- - **License:** Unknown -->
-
- ### Model Sources
-
- - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
- - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
- - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
-
- ### Full Model Architecture
-
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 32768, 'do_lower_case': False}) with Transformer model: Qwen2Model
-   (1): Pooling({'word_embedding_dimension': 3584, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': True, 'include_prompt': True})
-   (2): Normalize()
- )
- ```
-
- ## Usage
-
- ### Direct Usage (Sentence Transformers)
-
- First install the Sentence Transformers library:

  ```bash
- pip install -U sentence-transformers
  ```

- Then you can load this model and run inference.
- ```python
- from sentence_transformers import SentenceTransformer
-
- # Download from the 🤗 Hub
- model = SentenceTransformer("nomic-ai/qwen7b-coder-instruct-5M-last-token-causal-8e6-eos-token")
- # Run inference
- sentences = [
-     'The weather is lovely today.',
-     "It's so sunny outside!",
-     'He drove to the stadium.',
- ]
- embeddings = model.encode(sentences)
  print(embeddings.shape)
- # [3, 3584]
-
- # Get the similarity scores for the embeddings
- similarities = model.similarity(embeddings, embeddings)
- print(similarities.shape)
- # [3, 3]
  ```
 
- <!--
- ### Direct Usage (Transformers)
-
- <details><summary>Click to see the direct usage in Transformers</summary>
-
- </details>
- -->
-
- <!--
- ### Downstream Usage (Sentence Transformers)
-
- You can finetune this model on your own dataset.
-
- <details><summary>Click to expand</summary>
-
- </details>
- -->
-
- <!--
- ### Out-of-Scope Use
-
- *List how the model may foreseeably be misused and address what users ought not to do with the model.*
- -->
-
- <!--
- ## Bias, Risks and Limitations
-
- *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
- -->
-
- <!--
- ### Recommendations
-
- *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
- -->
-
- ## Training Details
-
- ### Framework Versions
- - Python: 3.10.12
- - Sentence Transformers: 3.3.0
- - Transformers: 4.45.2
- - PyTorch: 2.4.1+cu121
- - Accelerate: 1.4.0
- - Datasets: 2.21.0
- - Tokenizers: 0.20.0
-
- ## Citation
-
- ### BibTeX
-
- <!--
- ## Glossary
-
- *Clearly define terms in order to be accessible across audiences.*
- -->
-
- <!--
- ## Model Card Authors
-
- *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
- -->
-
- <!--
- ## Model Card Contact
-
- *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
- -->
+ ---
+ library_name: sentence-transformers
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - sentence-similarity
+ - feature-extraction
+ license: apache-2.0
+ datasets:
+ - nomic-ai/cornstack-python-v1
+ - nomic-ai/cornstack-javascript-v1
+ - nomic-ai/cornstack-java-v1
+ - nomic-ai/cornstack-go-v1
+ - nomic-ai/cornstack-php-v1
+ - nomic-ai/cornstack-ruby-v1
+ base_model:
+ - Qwen/Qwen2.5-Coder-7B-Instruct
+ ---
+
+ # Nomic Embed Code: A State-of-the-Art Code Retriever
+
+ `nomic-embed-code` is a state-of-the-art code embedding model that excels at code retrieval tasks:
+
+ - **High Performance**: Outperforms Voyage Code 3 and OpenAI Embed 3 Large on CodeSearchNet
+ - **Multilingual Code Support**: Trained for multiple programming languages (Python, Java, Ruby, PHP, JavaScript, Go)
+ - **Advanced Architecture**: 7B parameter code embedding model
+ - **Fully Open-Source**: Model weights, training data, and [evaluation code](https://github.com/gangiswag/cornstack/) released
+
+ | Model | Python | Java | Ruby | PHP | JavaScript | Go |
+ |-------|--------|------|------|-----|------------|-----|
+ | **Nomic Embed Code** | **81.7** | **80.5** | 81.8 | **72.3** | 77.1 | **93.8** |
+ | Voyage Code 3 | 80.8 | **80.5** | **84.6** | 71.7 | **79.2** | 93.2 |
+ | OpenAI Embed 3 Large | 70.8 | 72.9 | 75.3 | 59.6 | 68.1 | 87.6 |
+ | Nomic CodeRankEmbed-137M | 78.4 | 76.9 | 79.3 | 68.8 | 71.4 | 92.7 |
+ | CodeSage Large v2 (1B) | 74.2 | 72.3 | 76.7 | 65.2 | 72.5 | 84.6 |
+ | CodeSage Large (1B) | 70.8 | 70.2 | 71.9 | 61.3 | 69.5 | 83.7 |
+ | Qodo Embed 1 7B | 59.9 | 61.6 | 68.4 | 48.5 | 57.0 | 81.4 |
+
+ ## Model Architecture
+
+ - **Total Parameters**: 7B
+ - **Training Approach**: Trained on the CoRNStack dataset with dual-consistency filtering and progressive hard negative mining
+ - **Supported Languages**: Python, Java, Ruby, PHP, JavaScript, and Go
+
+ ## Usage Guide
+
+ ### Installation
+
+ You can install the necessary dependencies with:

  ```bash
+ pip install transformers sentence-transformers torch
  ```

+ ### Transformers
+
+ ```python
+ import torch
+ import torch.nn.functional as F
+ from transformers import AutoTokenizer, AutoModel
+
+ tokenizer = AutoTokenizer.from_pretrained("nomic-ai/nomic-embed-code")
+ model = AutoModel.from_pretrained("nomic-ai/nomic-embed-code")
+
+ def last_token_pooling(model_output, attention_mask):
+     # Use the hidden state of each sequence's last non-padding token as its embedding.
+     hidden_states = model_output.last_hidden_state
+     sequence_lengths = attention_mask.sum(-1) - 1
+     return hidden_states[torch.arange(hidden_states.shape[0]), sequence_lengths]
+
+ queries = ['Represent this query for searching relevant code: Calculate the n-th factorial']
+ codes = ['def fact(n):\n if n < 0:\n raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']
+ code_snippets = queries + codes
+
+ encoded_input = tokenizer(code_snippets, padding=True, truncation=True, return_tensors='pt')
+ model.eval()
+ with torch.no_grad():
+     model_output = model(**encoded_input)
+ embeddings = last_token_pooling(model_output, encoded_input['attention_mask'])
+ embeddings = F.normalize(embeddings, p=2, dim=1)
  print(embeddings.shape)

+ similarity = F.cosine_similarity(embeddings[0], embeddings[1], dim=0)
+ print(similarity)
  ```
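+
+ The printed value is the cosine similarity between the normalized query and code embeddings; values closer to 1 indicate a closer match.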

+ ### SentenceTransformers
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ queries = ['Represent this query for searching relevant code: Calculate the n-th factorial']
+ codes = ['def fact(n):\n if n < 0:\n raise ValueError\n return 1 if n == 0 else n * fact(n - 1)']
+
+ model = SentenceTransformer("nomic-ai/nomic-embed-code")
+ query_emb = model.encode(queries, prompt_name="query")
+ code_emb = model.encode(codes)
+
+ similarity = model.similarity(query_emb, code_emb)
+ print(similarity)
+ ```

+ ### CoRNStack Dataset Curation
+
+ Starting with the de-duplicated Stack v2, we create text-code pairs from function docstrings and their respective code. We filter out low-quality pairs where the docstring is not in English, is too short, or contains URLs, HTML tags, or invalid characters. We additionally keep docstrings with text lengths of 256 tokens or longer to help the model learn long-range dependencies.
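+
+ The concrete checks are not spelled out beyond what is listed above, but a minimal sketch of this kind of heuristic docstring filter (the regexes, the language flag, and the minimum-length cut-off below are illustrative assumptions) could look like:
+
+ ```python
+ import re
+
+ URL_RE = re.compile(r"https?://\S+")
+ HTML_RE = re.compile(r"<[^>]+>")
+
+ def keep_pair(docstring: str, n_tokens: int, is_english: bool, min_tokens: int = 16) -> bool:
+     """Heuristic docstring filter sketched from the description above (thresholds are assumptions)."""
+     if not is_english or n_tokens < min_tokens:
+         return False                                 # drop non-English or very short docstrings
+     if URL_RE.search(docstring) or HTML_RE.search(docstring):
+         return False                                 # drop docstrings containing URLs or HTML tags
+     return "\ufffd" not in docstring                 # crude stand-in for the invalid-character check
+ ```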
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/rb-J54KLgg21f59Ba1jWp.png)
+
+ After the initial filtering, we use dual-consistency filtering to remove potentially noisy examples: we embed each docstring and code pair, compute the similarity between each docstring and every code example, and remove a pair from the dataset if its code example is not among the top-2 most similar code examples for its docstring.
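+
+ As a rough illustration of this consistency check (batching, the embedding model, and the exact similarity computation are simplified assumptions here):
+
+ ```python
+ import numpy as np
+
+ def dual_consistency_filter(doc_embs: np.ndarray, code_embs: np.ndarray, top_k: int = 2) -> list[int]:
+     """Keep pair i only if code i is among the top_k codes most similar to docstring i.
+
+     doc_embs and code_embs are L2-normalized arrays of shape (n, d); row i of each is one pair.
+     """
+     sims = doc_embs @ code_embs.T                    # cosine similarity of every docstring to every code
+     keep = []
+     for i, row in enumerate(sims):
+         top = np.argpartition(-row, top_k)[:top_k]   # indices of the top_k most similar codes (unordered)
+         if i in top:
+             keep.append(i)
+     return keep
+ ```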
+
+ During training, we employ a novel curriculum-based hard negative mining strategy to ensure the model learns from challenging examples, using softmax-based sampling to draw progressively harder negatives as training proceeds.
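+
+ The exact sampling distribution and curriculum schedule are not given here; one minimal sketch of softmax-based negative sampling whose difficulty can be ramped up over training (the temperature schedule is an assumption) is:
+
+ ```python
+ import numpy as np
+
+ def sample_hard_negatives(sims: np.ndarray, pos_idx: int, n_neg: int,
+                           temperature: float, rng: np.random.Generator) -> np.ndarray:
+     """Sample negative indices for one query with probability softmax(similarity / temperature).
+
+     sims holds the query's similarity to every candidate code; pos_idx is its true match.
+     Lowering the temperature as training progresses concentrates probability mass on the
+     most similar (hardest) negatives, giving a curriculum of increasing difficulty.
+     """
+     logits = sims / temperature
+     logits[pos_idx] = -np.inf                 # never sample the positive as a negative
+     probs = np.exp(logits - np.max(logits))
+     probs /= probs.sum()
+     return rng.choice(len(sims), size=n_neg, replace=False, p=probs)
+ ```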
 
+
+ ## Join the Nomic Community
+
+ - Website: [https://nomic.ai](https://nomic.ai)
+ - Twitter: [https://twitter.com/nomic_ai](https://twitter.com/nomic_ai)
+ - Discord: [https://discord.gg/myY5YDR8z8](https://discord.gg/myY5YDR8z8)
+
+ ## Citation
+
+ If you find the model, dataset, or training code useful, please cite our work:
+
+ ```bibtex
+ @misc{suresh2025cornstackhighqualitycontrastivedata,
+   title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
+   author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
+   year={2025},
+   eprint={2412.01007},
+   archivePrefix={arXiv},
+   primaryClass={cs.CL},
+   url={https://arxiv.org/abs/2412.01007},
+ }
+ ```