add citation
README.md
CHANGED
@@ -50,4 +50,20 @@ print(code_embeddings)
## Training
We use a bi-encoder architecture for `CodeRankEmbed`, with weights shared between the text and code encoders. The retriever is contrastively fine-tuned with the InfoNCE loss on [CoRNStack](https://gangiswag.github.io/cornstack/), a high-quality dataset of 21 million examples that we curated. The encoder is initialized from [Arctic-Embed-M-Long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long), a 137M-parameter text encoder that supports an extended context length of 8,192 tokens.
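
For readers unfamiliar with InfoNCE, the following is a minimal sketch of the loss in plain Python, not the actual training code. It assumes in-batch negatives (each query's paired code snippet is the positive, and the other snippets in the batch are negatives) and cosine similarity. The temperature of `0.05`, the helper names `cosine` and `info_nce_loss`, and the toy 2-D embeddings are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(query_embs, code_embs, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (hypothetical helper).

    query_embs[i] and code_embs[i] form the positive pair; every other
    code embedding in the batch acts as a negative for query i.
    The temperature default is an assumed value, not taken from the paper.
    """
    losses = []
    for i, q in enumerate(query_embs):
        logits = [cosine(q, c) / temperature for c in code_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of the positive under a softmax over the batch.
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (illustrative values, not model outputs).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives close to their queries
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives far from their queries

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
```

With these toy inputs, `loss_aligned` comes out lower than `loss_swapped`, which is the behavior contrastive fine-tuning relies on: the loss falls as each query moves closer to its paired snippet than to the other snippets in the batch.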
# Citation
If you find the model, dataset, or training code useful, please cite our work:
```bibtex
@misc{suresh2025cornstackhighqualitycontrastivedata,
title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
year={2025},
eprint={2412.01007},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2412.01007},
}
```