	---
	license: cc-by-nc-nd-4.0
	tags:
	- biology
	- single_cell
	library_name: transformers
	pipeline_tag: feature-extraction
	---
	# Converge-SC for Embeddings: How to Use
	## Task Description
	Single-cell embeddings are vector representations of cells that capture their biological characteristics in a high-dimensional space. These embeddings encapsulate gene expression patterns, allowing for efficient computational analysis, visualization, and comparison of cells.
	The task is to generate embeddings for single-cell RNA-seq data using the pre-trained Converge-SC model. These embeddings can be used for downstream analysis tasks such as clustering, visualization, integration, and more.
	## Basic Usage
	The `examples` folder under the `Files and versions` tab contains the example notebook and the gene-mapping JSON file.

	See the `examples/get_embeddings.ipynb` notebook for a walkthrough of generating embeddings for your single-cell data.

	## Pipeline Description
	The pipeline uses the pre-trained Converge-SC model to generate embeddings for each cell in your dataset. The workflow involves:
	1. Loading your single-cell data (as an AnnData object)
	2. Preprocessing and normalizing the data
	3. Loading the pre-trained Converge-SC model and tokenizer
	4. Generating embeddings for each cell
	5. Storing the embeddings for downstream tasks

	## Input Data Requirements
	Your data should be an AnnData object (`.h5ad` file) with:
	1. Expression Data: Gene expression measurements in `adata.X`
	2. Gene Information: Gene identifiers in `adata.var_names`
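
Whether the gene-name mapping described under Preprocessing Steps is needed depends on what `adata.var_names` contains. As a rough check, the sketch below measures what fraction of identifiers look like Ensembl gene IDs. The helper name `ensembl_fraction` and the regular expression are assumptions made for this example, not part of Converge-SC; the pattern is meant to cover IDs such as `ENSG00000141510` (optionally with a version suffix) and species-prefixed variants such as `ENSMUSG...`.

```python
import re

# Assumed pattern for Ensembl gene IDs: "ENS", optional species letters, "G",
# 11 digits, and an optional ".version" suffix (e.g. ENSG00000012048.23).
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d{11}(\.\d+)?$")


def ensembl_fraction(var_names):
    """Return the fraction of identifiers that look like Ensembl gene IDs."""
    names = list(var_names)
    if not names:
        return 0.0
    hits = sum(1 for name in names if ENSEMBL_GENE_ID.match(str(name)))
    return hits / len(names)


# In practice this would be called as ensembl_fraction(adata.var_names).
example_ids = ["ENSG00000141510", "ENSG00000012048.23", "TP53", "BRCA1"]
print(f"{ensembl_fraction(example_ids):.0%} of identifiers look like Ensembl IDs")
```

A fraction close to 1 suggests the identifiers need to be mapped to gene symbols before tokenization; a fraction close to 0 suggests they are probably symbols already.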

	## Preprocessing Steps
	Before generating embeddings, you should preprocess your data:
	1. Normalization: Normalize each cell to the same total count, then log-transform
	```python
	import scanpy as sc

	# Normalize to 10,000 counts per cell
	sc.pp.normalize_total(adata, target_sum=1e4)
	sc.pp.log1p(adata) # Log-transform the data
	```

	2. Gene Name Mapping: Converge-SC's vocabulary uses gene symbols, not Ensembl IDs. If `adata.var_names` contains Ensembl IDs, map them to gene symbols first
	```python
	import json

	# Load the mapping file
	with open('examples/ensembl_to_gene_symbol.json', 'r') as file:
	    ensg_to_symbol = json.load(file)

	# Map gene names
	adata.var_names = adata.var_names.map(lambda col: ensg_to_symbol.get(col, col))
	```
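
To make the two preprocessing steps concrete, here is a small pure-Python sketch of what they do to a single cell. It mirrors the arithmetic of `normalize_total(adata, target_sum=1e4)` followed by `log1p` (natural logarithm), and the fallback behavior of `ensg_to_symbol.get(col, col)`. The toy counts, the two-entry mapping, the placeholder ID `ENSG00000000000`, and the helper names are all made up for illustration; on real data, use the scanpy calls and the mapping file shown above.

```python
import math


def normalize_and_log(counts, target_sum=1e4):
    """Scale one cell's counts to sum to target_sum, then apply log(1 + x)."""
    total = sum(counts)
    if total == 0:
        # Sketch-only choice: a cell with no counts stays all zeros.
        return [0.0 for _ in counts]
    return [math.log1p(c / total * target_sum) for c in counts]


def map_gene_ids(gene_ids, ensg_to_symbol):
    """Map IDs to symbols, keeping the original ID when no mapping exists."""
    mapped = [ensg_to_symbol.get(g, g) for g in gene_ids]
    unmapped = [g for g in gene_ids if g not in ensg_to_symbol]
    return mapped, unmapped


# Toy inputs (hypothetical): three genes, one of them absent from the mapping.
toy_mapping = {"ENSG00000141510": "TP53", "ENSG00000012048": "BRCA1"}
gene_ids = ["ENSG00000141510", "ENSG00000012048", "ENSG00000000000"]
raw_counts = [3, 1, 0]

values = normalize_and_log(raw_counts)
symbols, unmapped = map_gene_ids(gene_ids, toy_mapping)

print(symbols)                        # mapped names, unmapped IDs left as-is
print(unmapped)                       # IDs that had no entry in the mapping
print([round(v, 3) for v in values])  # normalized, log-transformed values
```

Identifiers that stay unmapped keep their original name, so it can be worth inspecting the `unmapped` list before generating embeddings.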

	## Generating Embeddings
	### Load model and tokenizer
	```python
	from transformers import AutoModel, AutoTokenizer
	# Replace 'your_token_here' with your Hugging Face access token
	model = AutoModel.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)
	tokenizer = AutoTokenizer.from_pretrained('ConvergeBio/ConvergeSC-embeddings', token='your_token_here', trust_remote_code=True)
	```
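
Rather than hard-coding the access token in a notebook, one option is to read it from an environment variable. The sketch below uses `HF_TOKEN`, the variable name that the Hugging Face libraries also recognize; the helper name `read_hf_token` is an assumption made for this example.

```python
import os


def read_hf_token(env=None):
    """Return the Hugging Face token from the environment, or raise if unset."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError("Set the HF_TOKEN environment variable to your access token.")
    return token


# Usage (commented out here because it downloads the model):
# token = read_hf_token()
# model = AutoModel.from_pretrained('ConvergeBio/ConvergeSC-embeddings',
#                                   token=token, trust_remote_code=True)
```

This keeps the token out of saved notebooks and version control.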

	### Compute Embeddings
	```python
	# gene_names and gene_values are assumed to hold one cell's gene symbols and expression values
	tokenized_cell = tokenizer(gene_names, expression_values=gene_values)
	embedding = model(**tokenized_cell)
	```
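
The snippet above embeds a single cell. To cover steps 4 and 5 of the pipeline for a whole dataset, a loop along the following lines could be used. This is a sketch under assumptions: `embed_one` is a hypothetical callable that wraps the tokenizer and model calls, and the stand-in `fake_embed_one` only returns two numbers so that the code runs without the model. This README does not describe the structure of the model output, so extracting the embedding vector from it is left to `embed_one`; the example notebook is the authoritative reference.

```python
def embed_cells(gene_names, expression_rows, embed_one):
    """Call embed_one(gene_names, row) for every cell and collect the results."""
    gene_names = list(gene_names)
    embeddings = []
    for i, row in enumerate(expression_rows):
        row = list(row)
        if len(row) != len(gene_names):
            raise ValueError(f"cell {i}: {len(row)} values for {len(gene_names)} genes")
        embeddings.append(embed_one(gene_names, row))
    return embeddings


def fake_embed_one(gene_names, gene_values):
    """Stand-in for the real model call; returns a 2-number 'embedding'."""
    return [sum(gene_values), float(len(gene_names))]


# With the real model, embed_one might look roughly like:
#   lambda names, values: model(**tokenizer(names, expression_values=values))
genes = ["TP53", "BRCA1", "GAPDH"]
rows = [[0.0, 1.2, 3.4], [2.0, 0.0, 1.0]]  # one row of expression values per cell
cell_embeddings = embed_cells(genes, rows, fake_embed_one)
print(len(cell_embeddings))
```

The resulting per-cell vectors can then be stacked into an array and kept alongside the data, for example in `adata.obsm`, the usual AnnData slot for per-cell matrices.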