codesage/codesage-base

codesage committed on Dec 28, 2024

Commit 7612de6 (verified) · 1 Parent(s): f437ca9

Update README.md

Files changed (1): README.md +18 -7

@@ -9,6 +9,11 @@ language:
 ## CodeSage-Base
 ### Model description
 CodeSage is a new family of open code embedding models with an encoder architecture that support a wide range of source code understanding tasks. It is introduced in the paper:
@@ -21,25 +26,31 @@ This checkpoint is trained on the Stack data (https://huggingface.co/datasets/bi
 ### Training procedure
 This checkpoint is first trained on code data via masked language modeling (MLM) and then on bimodal text-code pair data. Please refer to the paper for more details.
-### How to use
-This checkpoint consists of an encoder (356M model), which can be used to extract code embeddings of 1024 dimension. It can be easily loaded using the AutoModel functionality and employs the Starcoder tokenizer (https://arxiv.org/pdf/2305.06161.pdf).
 ```
 from transformers import AutoModel, AutoTokenizer
 checkpoint = "codesage/codesage-base"
-device = "cuda"  # for GPU usage or "cpu" for CPU usage
-# Note: CodeSage requires adding eos token at the end of
-# each tokenized sequence to ensure good performance
 tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
 model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
 inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
 embedding = model(inputs)[0]
-print(f'Dimension of the embedding: {embedding[0].size()}')
-# Dimension of the embedding: torch.Size([14, 1024])
 ```
 ### BibTeX entry and citation info

 ## CodeSage-Base
+### Updates
+* [12/2024] <span style="color:blue">We are excited to announce the release of the CodeSage V2 model family with largely improved performance and flexible embedding dimensions!</span> Please check out our [models](https://huggingface.co/codesage) and [blogpost](https://code-representation-learning.github.io/codesage-v2.html) for more details.
+* [11/2024] You can now access CodeSage models through SentenceTransformer.
 ### Model description
 CodeSage is a new family of open code embedding models with an encoder architecture that support a wide range of source code understanding tasks. It is introduced in the paper:
 ### Training procedure
 This checkpoint is first trained on code data via masked language modeling (MLM) and then on bimodal text-code pair data. Please refer to the paper for more details.
+### How to Use
+This checkpoint consists of an encoder (356M model), which can be used to extract code embeddings of 1024 dimension.
+1. Accessing CodeSage via HuggingFace: it can be easily loaded using the AutoModel functionality and employs the [Starcoder Tokenizer](https://arxiv.org/pdf/2305.06161.pdf).
 ```
 from transformers import AutoModel, AutoTokenizer
 checkpoint = "codesage/codesage-base"
+device = "cuda"  # "cpu" for CPU usage
+# Note: CodeSage requires adding eos token at the end of each tokenized sequence
 tokenizer = AutoTokenizer.from_pretrained(checkpoint, trust_remote_code=True, add_eos_token=True)
 model = AutoModel.from_pretrained(checkpoint, trust_remote_code=True).to(device)
 inputs = tokenizer.encode("def print_hello_world():\tprint('Hello World!')", return_tensors="pt").to(device)
 embedding = model(inputs)[0]
+```
+2. Accessing CodeSage via SentenceTransformer
+```
+from sentence_transformers import SentenceTransformer
+model = SentenceTransformer("codesage/codesage-base", trust_remote_code=True)
 ```
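
Per the earlier README revision, `model(inputs)[0]` holds one 1024-dimensional vector per token (`torch.Size([14, 1024])` for the example input), so comparing two snippets usually needs a pooling step followed by a similarity score. Below is a minimal sketch of that step in plain Python, so it runs without downloading the checkpoint. Mean pooling and cosine similarity are assumptions made here for illustration; the README does not state which pooling CodeSage recommends. The helpers `mean_pool` and `cosine_similarity` and the toy vectors are hypothetical stand-ins for real model outputs.

```python
import math

def mean_pool(token_embeddings):
    """Average a list of per-token vectors into one vector (assumed pooling, not confirmed by the README)."""
    num_tokens = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[d] for tok in token_embeddings) / num_tokens for d in range(dim)]

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for model outputs: 2 tokens x 2 dims instead of N tokens x 1024 dims.
snippet_a = [[1.0, 2.0], [3.0, 4.0]]
snippet_b = [[2.0, 4.0], [6.0, 8.0]]

vec_a = mean_pool(snippet_a)  # [2.0, 3.0]
vec_b = mean_pool(snippet_b)  # [4.0, 6.0]
score = cosine_similarity(vec_a, vec_b)  # parallel vectors, so the score is close to 1.0
print(vec_a, vec_b, round(score, 6))
```

With real outputs, the per-token rows of `model(inputs)[0]` would take the place of the toy lists. In practice the same arithmetic is usually done directly on tensors rather than Python lists.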
 ### BibTeX entry and citation info