---
license: cc-by-nc-nd-4.0
tags:
- biology
- single_cell
library_name: transformers
pipeline_tag: feature-extraction
---

# Converge-SC for Embeddings: How to Use

## Task Description

Single-cell embeddings are vector representations of cells that capture their biological characteristics in a high-dimensional space. They encode gene expression patterns, which makes cells efficient to analyze, visualize, and compare computationally.

The task is to generate embeddings for single-cell RNA-seq data using the pre-trained Converge-SC model. These embeddings can be used for downstream analysis tasks such as clustering, visualization, and integration.

## Basic Usage

The `examples` folder, under the `Files and versions` tab, contains both the notebook and the gene-mapping JSON file.

Open the `examples/get_embeddings.ipynb` notebook to see how to generate embeddings for your single-cell data.

## Pipeline Description

The pipeline uses the pre-trained Converge-SC model to generate embeddings for each cell in your dataset. The workflow involves:

1. Loading your single-cell data (as an AnnData object)
2. Preprocessing and normalizing the data
3. Loading the pre-trained Converge-SC model and tokenizer
4. Generating embeddings for each cell
5. Storing the embeddings for downstream tasks

## Input Data Requirements

Your data should be in the form of an AnnData object (`.h5ad` file) with:

1. Expression Data: Gene expression measurements in `adata.X`
2. Gene Information: Gene identifiers in `adata.var_names`

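The two requirements above can be checked before running the model. The following is a minimal sketch, not part of the Converge-SC package: `check_embedding_input` is a hypothetical helper, the `ENSG` prefix test assumes human ENSEMBL gene IDs, and the toy object is only a stand-in so the example runs without `anndata` installed.

```python
from types import SimpleNamespace


def check_embedding_input(adata):
    """Return a list of problems found; an empty list means the basic checks passed."""
    problems = []
    n_cells, n_genes = adata.X.shape  # rows are cells, columns are genes
    names = [str(name) for name in adata.var_names]

    if n_cells == 0 or n_genes == 0:
        problems.append("adata.X is empty")
    if len(set(names)) != len(names):
        problems.append("adata.var_names contains duplicates")
    # Assumption: human ENSEMBL gene IDs, which start with "ENSG".
    n_ensembl = sum(name.startswith("ENSG") for name in names)
    if n_ensembl:
        problems.append(f"{n_ensembl} gene name(s) look like ENSEMBL IDs; map them to symbols")
    return problems


# Stand-in for a real AnnData object (hypothetical toy data).
toy_adata = SimpleNamespace(
    X=SimpleNamespace(shape=(2, 3)),
    var_names=["CD3E", "ENSG00000177455", "CD3E"],
)
for problem in check_embedding_input(toy_adata):
    print(problem)
```

With real data, the same function can be called on the object returned by `anndata.read_h5ad`, since it only reads `adata.X.shape` and `adata.var_names`.
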
## Preprocessing Steps

Before generating embeddings, you should preprocess your data:

1. Normalization: Normalize your data to a common scale

```python
import scanpy as sc

# Normalize to 10,000 counts per cell
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)  # Log-transform the data
```

2. Gene Name Mapping: Converge-SC's vocabulary uses gene symbols, not ENSEMBL IDs, so if your data uses ENSEMBL IDs you need to map them to gene symbols

```python
import json

# Load the mapping file
with open('examples/ensembl_to_gene_symbol.json', 'r') as file:
    ensg_to_symbol = json.load(file)

# Map gene names; IDs missing from the mapping are left unchanged
adata.var_names = adata.var_names.map(lambda gene_id: ensg_to_symbol.get(gene_id, gene_id))
```

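To make the two preprocessing steps concrete, here is a small sketch on toy data using only the Python standard library. It mirrors the arithmetic of `normalize_total` followed by `log1p` for a single cell, and the same "keep the name if it is not in the mapping" rule as the snippet above. The function names, IDs, and symbols are made up for illustration and are not part of Converge-SC or scanpy.

```python
import math


def normalize_log1p(counts, target_sum=1e4):
    """Scale one cell's counts so they sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]


def map_gene_names(names, id_to_symbol):
    """Map IDs to symbols, keeping any name that has no entry in the mapping."""
    return [id_to_symbol.get(name, name) for name in names]


# Hypothetical toy data: these IDs and symbols are placeholders.
toy_mapping = {"ENSG_A": "GENE1", "ENSG_B": "GENE2"}
toy_names = ["ENSG_A", "ENSG_B", "ENSG_C"]
toy_counts = [1, 3, 6]

mapped = map_gene_names(toy_names, toy_mapping)
unmapped = [name for name in toy_names if name not in toy_mapping]
normalized = normalize_log1p(toy_counts)

print(mapped)    # ['GENE1', 'GENE2', 'ENSG_C']
print(unmapped)  # ['ENSG_C']
print(normalized)
```

Listing the unmapped names is a cheap way to see how much of a dataset falls outside the mapping file. If several IDs map to the same symbol, `adata.var_names` will contain duplicates after mapping; AnnData's `var_names_make_unique()` is one way to resolve them, though whether that suits your analysis is a judgment call.
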
## Generating Embeddings

### Load model and tokenizer

```python
from transformers import AutoModel, AutoTokenizer

# Replace 'your_token_here' with your Hugging Face access token
model = AutoModel.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)
```

### Compute Embeddings

```python
# gene_names: gene symbols for one cell; gene_values: the matching expression values
# (see examples/get_embeddings.ipynb for how they are built)
tokenized_cell = tokenizer(gene_names, expression_values=gene_values)
embedding = model(**tokenized_cell)
```

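The snippet above takes `gene_names` and `gene_values` for a single cell. One plausible way to build them from a row of the expression matrix is sketched below. `cell_inputs` is a hypothetical helper, and passing only genes with non-zero expression is an assumption rather than something this card specifies; the notebook in `examples` is the authoritative reference.

```python
def cell_inputs(row, var_names):
    """Return (gene_names, gene_values) for the genes with non-zero expression in one cell."""
    gene_names, gene_values = [], []
    for name, value in zip(var_names, row):
        # Assumption: genes with zero expression are not passed to the tokenizer.
        if value > 0:
            gene_names.append(str(name))
            gene_values.append(float(value))
    return gene_names, gene_values


# Toy cell: three genes, one of them not expressed.
toy_var_names = ["CD3E", "CD19", "MS4A1"]
toy_row = [1.2, 0.0, 3.4]

gene_names, gene_values = cell_inputs(toy_row, toy_var_names)
print(gene_names)   # ['CD3E', 'MS4A1']
print(gene_values)  # [1.2, 3.4]
```

The helper works on any row that iterates as numbers, such as a list or a dense one-dimensional NumPy array. If `adata.X` is a SciPy sparse matrix, a row can be densified first with `adata.X[i].toarray().ravel()`.
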