M Christenson committed on
Commit fa5b9e2 · verified · 1 Parent(s): aa07cc6

Create README.md

  1. README.md +178 -0
README.md ADDED
@@ -0,0 +1,178 @@
---
license: mit
language:
- en
tags:
- single-cell
- foundation model
- transcriptomics
- transformers
---
# Universal Cell Embedding (UCE) Model

## Model Details

- **Model Name**: Universal Cell Embedding (UCE)
- **Version**: 1.0 [deeplife version]
- **Type**: Foundation model for single-cell biology
- **Original Paper**: [Universal Cell Embeddings: A Foundation Model for Cell Biology](https://www.biorxiv.org/content/10.1101/2023.11.28.568918v1)
- **Original Implementation**: [UCE GitHub Repository](https://github.com/snap-stanford/uce)

## Model Description

UCE is a foundation model for single-cell gene expression that learns a universal representation of cell biology. It can generate representations of new single-cell gene expression datasets with no model fine-tuning or retraining while remaining robust to dataset- and batch-specific artifacts.

### Key Features

- Zero-shot embedding capabilities for new datasets and species
- No requirement for cell type annotation or input dataset preprocessing
- Applicable to any set of genes from any species, even if they are not homologs of genes seen during training
- Learns a universal representation of cell biology that is intrinsically meaningful

## Intended Use

UCE is designed for researchers working with single-cell RNA sequencing (scRNA-seq) data. It can be used for:

- Analyzing and integrating diverse scRNA-seq datasets
- Mapping new data into a universal embedding space
- Identifying novel cell types and their functions
- Cross-dataset discoveries and comparisons

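
As one illustration of comparing cells across datasets in a shared embedding space, the sketch below transfers cell type labels from a reference set to query cells using cosine-similarity nearest neighbors. The vectors, labels, and function names (`cosine_similarity`, `transfer_labels`) are invented for illustration. Real UCE embeddings are much higher-dimensional, and this is not presented as a method from the UCE paper or the DeepLife package.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def transfer_labels(ref_embeddings, ref_labels, query_embeddings):
    """Give each query cell the label of its most similar reference cell."""
    predicted = []
    for query in query_embeddings:
        sims = [cosine_similarity(query, ref) for ref in ref_embeddings]
        best = max(range(len(sims)), key=sims.__getitem__)
        predicted.append(ref_labels[best])
    return predicted


# Toy 3-dimensional "embeddings" (hypothetical values, not real UCE output)
ref_embeddings = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [0.0, 1.0, 0.2]]
ref_labels = ["T cell", "T cell", "B cell"]
query_embeddings = [[0.8, 0.1, 0.0], [0.1, 0.9, 0.3]]

print(transfer_labels(ref_embeddings, ref_labels, query_embeddings))
# prints ['T cell', 'B cell']
```

A single nearest neighbor keeps the sketch short. In practice, a k-nearest-neighbor vote over several reference cells is likely to be more robust to noisy annotations.
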
## Training Data

The model was trained on a corpus of cell atlas data from human and other species, including:

- Over 36 million cells
- More than 1,000 uniquely named cell types
- Hundreds of experiments
- Dozens of tissues
- Eight species (human, mouse, mouse lemur, zebrafish, pig, rhesus macaque, crab-eating macaque, western clawed frog)

## Performance

According to the original paper, UCE outperforms other self-supervised transformer-based methods on zero-shot embedding tasks. It has shown the ability to:

- Accurately embed and cluster cell types from new, unseen datasets
- Align datasets from novel species without additional training
- Capture meaningful biological variation despite the presence of experimental noise

## Limitations

- Performance may vary for extremely rare or specialized cell types that are not well represented in the training data
- While UCE can handle data from new species, its performance may be worse for species very distantly related to those in the training set
- The model does not account for information contained in raw RNA transcripts, such as genetic variation and RNA splicing

## Ethical Considerations

Users should be aware that while the data used to train UCE is anonymized, it represents human tissue samples and should be treated with appropriate respect and consideration. Researchers using this model should adhere to ethical guidelines for human subjects research.

## Usage

To use the UCE model within the DeepLife ML Infra:

1. Install the package:
```
pip install deeplife-mlinfra
```

2. Import and use the model:
```python
import anndata as ad
import torch
from huggingface_hub import hf_hub_download
from dl_models.models.uce.model import UCEmbedModel
from dl_models.models.uce.processor import UCEProcessor

# Load the pretrained model and its matching preprocessor
model = UCEmbedModel.from_pretrained("deeplife/uce_model")
preprocessor = UCEProcessor.from_pretrained("deeplife/uce_model")
model.eval()  # switch to inference mode

# Load your data (here, a small sample dataset from the Hub)
filepath = hf_hub_download(
    repo_id="deeplife/h5ad_samples",
    filename="GSE136831small.h5ad",
    repo_type="dataset",
)
adata = ad.read_h5ad(filepath)

# Preprocess and create a dataloader
dataloader = preprocessor.transform_to_dataloader(adata, batch_size=256)

# Get embeddings for the first batch only
with torch.no_grad():  # gradients are not needed for inference
    for batch in dataloader:
        embed = model.get_cell_embeddings(batch)
        break

# You can now use these embeddings for downstream tasks
```

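
The loop above stops after the first batch. To embed a whole dataset, one option is to run every batch and collect the per-cell rows. The helper below is a minimal sketch: `embed_all`, `embed_fn`, and `to_rows` are hypothetical names, and a stand-in embedding function replaces the real model so that the snippet runs without any dependencies. With the real model, `embed_fn` would presumably be `model.get_cell_embeddings` and `to_rows` something like `lambda t: t.detach().cpu().tolist()`, ideally called inside `torch.no_grad()`.

```python
def embed_all(batches, embed_fn, to_rows=list):
    """Embed every batch and return one flat list of per-cell embeddings.

    batches  -- any iterable of batches (for example, a dataloader)
    embed_fn -- maps one batch to its embeddings
    to_rows  -- converts one batch of embeddings to a list of rows
    """
    rows = []
    for batch in batches:
        rows.extend(to_rows(embed_fn(batch)))
    return rows


# Stand-in for the model: "embeds" each cell as (sum, max) of its values.
def fake_embed(batch):
    return [(sum(cell), max(cell)) for cell in batch]


# Two toy batches holding 2 cells and 1 cell (hypothetical expression values)
toy_batches = [[[1, 2, 3], [0, 5, 1]], [[4, 4, 0]]]

all_embeddings = embed_all(toy_batches, fake_embed)
print(len(all_embeddings))  # prints 3: one embedding per cell
```

Collecting rows in a plain list keeps the sketch simple. For large datasets, concatenating tensors or arrays batch by batch is likely to use less memory.
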
For visualization of the embeddings, you can use techniques like PCA or UMAP:

```python
import matplotlib.pyplot as plt
import umap  # provided by the umap-learn package
from sklearn.decomposition import PCA

# Convert the embeddings (a torch tensor) to a NumPy array
embed_np = embed.detach().cpu().numpy()

# Reduce to 2 dimensions with PCA
pca = PCA(n_components=2)
embed_pca = pca.fit_transform(embed_np)

# Reduce to 2 dimensions with UMAP
umap_reducer = umap.UMAP(n_components=2, random_state=42)
embed_umap = umap_reducer.fit_transform(embed_np)

# Plot the two projections side by side
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 8))

# PCA plot
ax1.scatter(embed_pca[:, 0], embed_pca[:, 1], alpha=0.7)
ax1.set_title('UCE Embeddings - PCA')
ax1.set_xlabel('PC1')
ax1.set_ylabel('PC2')

# UMAP plot
ax2.scatter(embed_umap[:, 0], embed_umap[:, 1], alpha=0.7)
ax2.set_title('UCE Embeddings - UMAP')
ax2.set_xlabel('UMAP1')
ax2.set_ylabel('UMAP2')

# No colorbar is drawn because the points carry no per-cell values.
# To color them, pass c=<per-cell values> to scatter() and then call
# plt.colorbar() on the returned object.

plt.tight_layout()
plt.show()
```

For more detailed usage instructions, please refer to the [documentation](https://github.com/deeplifeai/deeplife-mlinfra).

## Citation

If you use this model in your research, please cite both the original UCE paper and the DeepLife ML Infra package:

```
@article{rosen2023universal,
  title={Universal Cell Embeddings: A Foundation Model for Cell Biology},
  author={Rosen, Yanay and Roohani, Yusuf and Agrawal, Ayush and Samotorcan, Leon and Consortium, Tabula Sapiens and Quake, Stephen R and Leskovec, Jure},
  journal={bioRxiv},
  pages={2023--11},
  year={2023},
  publisher={Cold Spring Harbor Laboratory}
}

@software{deeplife_mlinfra,
  title={DeepLife ML Infra: Infrastructure for Biological Deep Learning Models},
  author={DeepLife AI Team},
  year={2023},
  url={https://github.com/deeplifeai/deeplife-mlinfra},
  version={1.0.0}
}
```

174
+ ## Contact
175
+
176
+ For questions or issues related to this model implementation in DeepLife ML Infra, please open an issue in the [repository](https://github.com/deeplifeai/deeplife-mlinfra).
177
+
178
+ For questions about the original UCE model, please contact the authors of the paper.