---
license: cc-by-nc-nd-4.0
tags:
- biology
- single_cell
library_name: transformers
pipeline_tag: feature-extraction
---
# Converge-SC for Embeddings: How to Use
## Task Description
Single-cell embeddings are vector representations of cells that capture their biological characteristics in a high-dimensional space. These embeddings encapsulate gene expression patterns, allowing for efficient computational analysis, visualization, and comparison of cells.
The task is to generate embeddings for single-cell RNA-seq data using the pre-trained Converge-SC model. These embeddings can be used for downstream analysis tasks such as clustering, visualization, integration, and more.
## Basic Usage
The `examples` folder, found under the `Files and versions` tab, contains the example notebook and the gene-mapping JSON file.
See the `examples/get_embeddings.ipynb` notebook for a walkthrough of generating embeddings for your single-cell data.
## Pipeline Description
The pipeline uses the pre-trained Converge-SC model to generate embeddings for each cell in your dataset. The workflow involves:
1. Loading your single-cell data (as an AnnData object)
2. Preprocessing and normalizing the data
3. Loading the pre-trained Converge-SC model and tokenizer
4. Generating embeddings for each cell
5. Storing the embeddings for downstream tasks
## Input Data Requirements
Your data should be an AnnData object (`.h5ad` file) with:
1. Expression Data: Gene expression measurements in `adata.X`
2. Gene Information: Gene identifiers in `adata.var_names`
## Preprocessing Steps
Before generating embeddings, you should preprocess your data:
1. Normalization: Normalize your data to a common scale
```python
import scanpy as sc
# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata) # Log-transform the data
```
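To make the arithmetic concrete, here is a minimal pure-Python sketch of what these two calls compute for a single cell: the counts are rescaled so the cell sums to `target_sum`, then `log(1 + x)` is applied. The helper `normalize_and_log` is a hypothetical name used for illustration and is not part of scanpy or Converge-SC; in practice, use the scanpy calls above.

```python
import math

def normalize_and_log(counts, target_sum=1e4):
    """Illustrative per-cell equivalent of normalize_total followed by log1p."""
    total = sum(counts)
    if total == 0:
        # A cell with no counts cannot be rescaled; it stays all zeros.
        return [0.0] * len(counts)
    scale = target_sum / total
    return [math.log1p(c * scale) for c in counts]

# Toy cell with 10 counts spread over four genes (made-up numbers).
cell = [1, 3, 0, 6]
normalized = normalize_and_log(cell)
```

Undoing the log with `math.expm1` and summing the values gives back roughly 10,000 for this cell, which is a quick way to sanity-check that normalization was applied.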
2. Gene Name Mapping: Converge-SC's vocabulary uses gene symbols, not ENSEMBL IDs, so if your data is indexed by ENSEMBL IDs, map them to gene symbols first
```python
import json
# Load the mapping file
with open('examples/ensembl_to_gene_symbol.json', 'r') as file:
    ensg_to_symbol = json.load(file)
# Map gene names
adata.var_names = adata.var_names.map(lambda col: ensg_to_symbol.get(col, col))
```
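After mapping, it can help to check how many IDs were left unmapped and whether several IDs collapsed onto the same symbol, since the lambda above silently keeps any ID that is missing from the dictionary. The sketch below works on plain lists so it runs without any data. `summarize_mapping` and the toy dictionary are hypothetical, and treating a name that still starts with `ENSG` as an unmapped ENSEMBL ID is an assumption.

```python
from collections import Counter

def summarize_mapping(original_ids, ensg_to_symbol):
    """Apply the same .get(id, id) mapping and report leftovers and duplicates."""
    mapped = [ensg_to_symbol.get(gene_id, gene_id) for gene_id in original_ids]
    # Assumption: names still starting with "ENSG" are unmapped ENSEMBL IDs.
    unmapped = [name for name in mapped if name.startswith("ENSG")]
    counts = Counter(mapped)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    return mapped, unmapped, duplicated

# Toy inputs for illustration only, not taken from the real mapping file.
toy_mapping = {"ENSG_A": "GENE1", "ENSG_B": "GENE1", "ENSG_C": "GENE2"}
toy_ids = ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_D", "GENE3"]
mapped, unmapped, duplicated = summarize_mapping(toy_ids, toy_mapping)
```

With a real dataset you would pass `list(adata.var_names)` before overwriting it. If duplicates appear, `adata.var_names_make_unique()` is one option, though whether the renamed genes would still match the model's vocabulary is not stated here.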
## Generating Embeddings
### Load model and tokenizer
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)
```
### Compute Embeddings
```python
# gene_names: gene symbols for a cell; gene_values: the matching preprocessed expression values
tokenized_cell = tokenizer(gene_names, expression_values=gene_values)
embedding = model(**tokenized_cell)
```
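The snippet above does not show where `gene_names` and `gene_values` come from. One possible way to build them for a single cell is sketched below, assuming the tokenizer takes parallel lists of gene symbols and expression values. The helper `cell_inputs` and its `drop_zeros` option are hypothetical, and the `examples/get_embeddings.ipynb` notebook remains the reference for the exact input format.

```python
def cell_inputs(var_names, expression_row, drop_zeros=False):
    """Pair gene symbols with one cell's expression values as parallel lists."""
    if len(var_names) != len(expression_row):
        raise ValueError("var_names and expression_row must have the same length")
    pairs = [
        (str(name), float(value))
        for name, value in zip(var_names, expression_row)
        # Assumption: dropping unexpressed genes is optional, so it is off by default.
        if not (drop_zeros and value == 0)
    ]
    gene_names = [name for name, _ in pairs]
    gene_values = [value for _, value in pairs]
    return gene_names, gene_values

# Toy row standing in for one dense row of adata.X (made-up values).
gene_names, gene_values = cell_inputs(["GENE1", "GENE2", "GENE3"], [0.0, 2.5, 1.0])
expressed_names, expressed_values = cell_inputs(
    ["GENE1", "GENE2", "GENE3"], [0.0, 2.5, 1.0], drop_zeros=True
)
```

The resulting lists could then be passed as `tokenizer(gene_names, expression_values=gene_values)`, one cell at a time.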