tarsur909 committed on
Commit 3c4b608 · verified · 1 Parent(s): b84c9f7

add citation

Files changed (1)
  1. README.md +17 -1
README.md CHANGED
@@ -50,4 +50,20 @@ print(code_embeddings)
 
 
  ## Training
- We use a bi-encoder architecture for `CodeRankEmbed`, with weights shared between the text and code encoder. The retriever is contrastively fine-tuned with InfoNCE loss on a 21 million example high-quality dataset we curated called [CoRNStack](https://gangiswag.github.io/cornstack/). Our encoder is initialized with [Arctic-Embed-M-Long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long), a 137M parameter text encoder supporting an extended context length of 8,192 tokens.
+ We use a bi-encoder architecture for `CodeRankEmbed`, with weights shared between the text and code encoder. The retriever is contrastively fine-tuned with InfoNCE loss on a 21 million example high-quality dataset we curated called [CoRNStack](https://gangiswag.github.io/cornstack/). Our encoder is initialized with [Arctic-Embed-M-Long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long), a 137M parameter text encoder supporting an extended context length of 8,192 tokens.
+
+ # Citation
+
+ If you find the model, dataset, or training code useful, please cite our work:
+
+ ```bibtex
+ @misc{suresh2025cornstackhighqualitycontrastivedata,
+       title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
+       author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
+       year={2025},
+       eprint={2412.01007},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2412.01007},
+ }
+ ```
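The Training paragraph shown as context in the diff describes the core recipe: a bi-encoder with weights shared between the text and code sides, contrastively fine-tuned with an in-batch InfoNCE loss. The sketch below illustrates that setup; it is not the authors' released training code, and the pooling strategy, temperature value, and `trust_remote_code` flag are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Initialization checkpoint named in the README; everything else here is illustrative.
ENCODER_NAME = "Snowflake/snowflake-arctic-embed-m-long"

tokenizer = AutoTokenizer.from_pretrained(ENCODER_NAME)
# A single encoder instance encodes both queries and code, i.e. shared weights.
encoder = AutoModel.from_pretrained(ENCODER_NAME, trust_remote_code=True)


def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool token states into one L2-normalized vector per input (pooling choice is an assumption)."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=8192, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # average over non-padding tokens
    return F.normalize(pooled, dim=-1)


def infonce_loss(queries: list[str], code_snippets: list[str], temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: each query's paired snippet is its positive; other snippets in the batch are negatives."""
    q = embed(queries)                # (B, H)
    c = embed(code_snippets)          # (B, H)
    logits = q @ c.T / temperature    # cosine similarities scaled by temperature
    labels = torch.arange(q.size(0))  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```

In an actual run, this loss would be minimized over (query, code) pairs drawn from CoRNStack, back-propagating through the single encoder so the text and code sides remain tied.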