tarsur909 committed on
Commit 3c4b608 · verified · 1 Parent(s): b84c9f7

add citation

Files changed (1)
  1. README.md +17 -1
README.md CHANGED
@@ -50,4 +50,20 @@ print(code_embeddings)
 
 
  ## Training
- We use a bi-encoder architecture for `CodeRankEmbed`, with weights shared between the text and code encoder. The retriever is contrastively fine-tuned with InfoNCE loss on a 21 million example high-quality dataset we curated called [CoRNStack](https://gangiswag.github.io/cornstack/). Our encoder is initialized with [Arctic-Embed-M-Long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long), a 137M parameter text encoder supporting an extended context length of 8,192 tokens.
+ We use a bi-encoder architecture for `CodeRankEmbed`, with weights shared between the text and code encoder. The retriever is contrastively fine-tuned with InfoNCE loss on a 21 million example high-quality dataset we curated called [CoRNStack](https://gangiswag.github.io/cornstack/). Our encoder is initialized with [Arctic-Embed-M-Long](https://huggingface.co/Snowflake/snowflake-arctic-embed-m-long), a 137M parameter text encoder supporting an extended context length of 8,192 tokens.
+
+ # Citation
+
+ If you find the model, dataset, or training code useful, please cite our work:
+
+ ```bibtex
+ @misc{suresh2025cornstackhighqualitycontrastivedata,
+       title={CoRNStack: High-Quality Contrastive Data for Better Code Retrieval and Reranking},
+       author={Tarun Suresh and Revanth Gangi Reddy and Yifei Xu and Zach Nussbaum and Andriy Mulyar and Brandon Duderstadt and Heng Ji},
+       year={2025},
+       eprint={2412.01007},
+       archivePrefix={arXiv},
+       primaryClass={cs.CL},
+       url={https://arxiv.org/abs/2412.01007},
+ }
+ ```
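The Training paragraph shown as context in the diff describes the core recipe: a bi-encoder with weights shared between the text and code sides, contrastively fine-tuned with an in-batch InfoNCE loss. The sketch below illustrates that setup; it is not the authors' released training code, and the pooling strategy, temperature value, and `trust_remote_code` flag are assumptions made only for illustration.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Initialization checkpoint named in the README; everything else here is illustrative.
ENCODER_NAME = "Snowflake/snowflake-arctic-embed-m-long"

tokenizer = AutoTokenizer.from_pretrained(ENCODER_NAME)
# A single encoder instance encodes both queries and code, i.e. shared weights.
encoder = AutoModel.from_pretrained(ENCODER_NAME, trust_remote_code=True)


def embed(texts: list[str]) -> torch.Tensor:
    """Mean-pool token states into one L2-normalized vector per input (pooling choice is an assumption)."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=8192, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # average over non-padding tokens
    return F.normalize(pooled, dim=-1)


def infonce_loss(queries: list[str], code_snippets: list[str], temperature: float = 0.05) -> torch.Tensor:
    """In-batch InfoNCE: each query's paired snippet is its positive; other snippets in the batch are negatives."""
    q = embed(queries)                # (B, H)
    c = embed(code_snippets)          # (B, H)
    logits = q @ c.T / temperature    # cosine similarities scaled by temperature
    labels = torch.arange(q.size(0))  # positives lie on the diagonal
    return F.cross_entropy(logits, labels)
```

In an actual run, this loss would be minimized over (query, code) pairs drawn from CoRNStack, back-propagating through the single encoder so the text and code sides remain tied.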