---
license: cc-by-4.0
---
# Pancancer tissue classifier
This model classifies among 32 cancers from TCGA. It was trained by Jakub Kaczmarzyk using CLAM.
Output classes: ACC, BLCA, BRCA, CESC, CHOL, COAD, DLBC, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, MESO, OV, PAAD, PCPG, PRAD, READ, SARC, SKCM, STAD, TGCT, THCA, THYM, UCEC, UCS, UVM.
Please see the [TCGA study abbreviations](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations) to map these class names to the TCGA study names.
## Data
Diagnostic slides in TCGA (i.e., `DX` slides) were used to train the model. The whole slide images were tiled into 128 x 128 um patches, and each patch was encoded using CTransPath, which produces one 768-dimensional embedding per patch.
Train, validation, and test splits were stratified by TCGA study, and patients did not cross split boundaries.
Sample sizes:
- Train: 9,257 slides (7,633 patients)
- Validation: 1,186 slides (955 patients)
- Test: 1,163 slides (955 patients)
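The splitting scheme described above (stratified by study, with each patient confined to one split) can be sketched as follows. This is a minimal illustration, not the code used to build the actual splits: the function names, the 80/10/10 fractions, and the seed are all hypothetical. The one fact it relies on is that the first 12 characters of a TCGA barcode identify the patient.

```python
import random
from collections import defaultdict


def patient_id(slide_name):
    """Return the patient part of a TCGA barcode (its first 12 characters)."""
    return slide_name[:12]


def split_patients(patient_to_study, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign each patient to train/val/test, stratified by TCGA study.

    The fractions are illustrative assumptions, not the values used for
    this model. Splitting patients (not slides) keeps all slides from one
    patient in the same split.
    """
    rng = random.Random(seed)
    by_study = defaultdict(list)
    for patient, study in sorted(patient_to_study.items()):
        by_study[study].append(patient)

    assignment = {}
    for study, patients in by_study.items():
        rng.shuffle(patients)
        n_train = round(fractions[0] * len(patients))
        n_val = round(fractions[1] * len(patients))
        for i, patient in enumerate(patients):
            if i < n_train:
                assignment[patient] = "train"
            elif i < n_train + n_val:
                assignment[patient] = "val"
            else:
                assignment[patient] = "test"
    return assignment


# Hypothetical patients: 10 in each of two studies.
toy_patients = {f"TCGA-AA-{i:04d}": "KIRC" for i in range(10)}
toy_patients.update({f"TCGA-BB-{i:04d}": "KIRP" for i in range(10)})
toy_split = split_patients(toy_patients)
```

A slide is then placed in whichever split `toy_split[patient_id(slide_name)]` names, so two slides from the same patient can never land in different splits.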
## Reusing this model
To use this model on the command line, see [WSInfer-MIL](https://github.com/kaczmarj/wsinfer-mil).
Alternatively, you may use PyTorch or ONNX to run the model. First, embed 128 x 128 um patches using CTransPath. Then pass the bag of embeddings to the model. The example below uses ONNX Runtime.
```python
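import numpy as np
import onnxruntime as ort

# Placeholder bag: 1,000 patch embeddings of 768 dimensions each (the CTransPath output size).
embedding = np.ones((1_000, 768), dtype="float32")
ort_sess = ort.InferenceSession("model.onnx")
# Request both named outputs: the slide-level logits and the attention weights.
logits, attention = ort_sess.run(["logits", "attention"], {"input": embedding})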
```
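To turn the logits into a readable prediction, one option is to apply a softmax and look up the class names. The sketch below rests on two assumptions that should be checked against the model's own metadata: that output index `i` corresponds to the `i`-th entry of the output class list above, and that the logits are meant to go through a softmax, as is usual for a multiclass classifier. `top_predictions` is a hypothetical helper, and the example logits are stand-ins, not real model output.

```python
import numpy as np

# Assumption: this order matches the model's output indices.
CLASS_NAMES = [
    "ACC", "BLCA", "BRCA", "CESC", "CHOL", "COAD", "DLBC", "ESCA",
    "GBM", "HNSC", "KICH", "KIRC", "KIRP", "LGG", "LIHC", "LUAD",
    "LUSC", "MESO", "OV", "PAAD", "PCPG", "PRAD", "READ", "SARC",
    "SKCM", "STAD", "TGCT", "THCA", "THYM", "UCEC", "UCS", "UVM",
]


def top_predictions(logits, k=3):
    """Return the k most probable (class name, probability) pairs."""
    # Flatten so that both (32,) and (1, 32) shaped logits are accepted.
    z = np.asarray(logits, dtype="float64").reshape(-1)
    if z.size != len(CLASS_NAMES):
        raise ValueError(f"expected {len(CLASS_NAMES)} logits, got {z.size}")
    z = z - z.max()  # subtract the max for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    order = np.argsort(probs)[::-1][:k]
    return [(CLASS_NAMES[i], float(probs[i])) for i in order]


# Stand-in logits for illustration only.
example_logits = np.zeros((1, 32))
example_logits[0, 2] = 5.0
print(top_predictions(example_logits, k=3))
```

With real model output, pass the `logits` array returned by `ort_sess.run` in place of `example_logits`.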
## Model performance
The model achieved a weighted average AUROC of 0.99 (one-vs-rest).
Here are the one-vs-rest AUROC values for each TCGA study.
- ACC: 0.9993
- BLCA: 0.9814
- BRCA: 0.9908
- CESC: 0.9868
- CHOL: 0.9972
- COAD: 0.9927
- DLBC: 0.9996
- ESCA: 0.9571
- GBM: 0.9984
- HNSC: 0.9974
- KICH: 0.9998
- KIRC: 0.9993
- KIRP: 0.9952
- LGG: 0.9984
- LIHC: 0.9988
- LUAD: 0.9879
- LUSC: 0.9868
- MESO: 0.9961
- OV: 0.9900
- PAAD: 0.9897
- PCPG: 0.9944
- PRAD: 1.0000
- READ: 0.9752
- SARC: 0.9946
- SKCM: 0.9957
- STAD: 0.9932
- TGCT: 0.9957
- THCA: 1.0000
- THYM: 0.9991
- UCEC: 0.9971
- UCS: 0.9863
- UVM: 0.9997
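For readers who want to compute the same kind of metric on their own predictions, here is a minimal sketch of one-vs-rest AUROC per class and its weighted average. It assumes that "weighted" means weighting each class by its number of true samples, which is a common convention but is not stated in this card. The function names and the toy data are hypothetical.

```python
import numpy as np


def binary_auroc(is_positive, scores):
    """AUROC as the probability that a positive outscores a negative.

    Ties count as half a win. The pairwise comparison is quadratic in
    the number of samples, which is fine for a few thousand slides.
    """
    pos = scores[is_positive]
    neg = scores[~is_positive]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((wins + 0.5 * ties) / (len(pos) * len(neg)))


def weighted_ovr_auroc(labels, probs):
    """Return ({class index: AUROC}, support-weighted average AUROC)."""
    labels = np.asarray(labels)
    probs = np.asarray(probs, dtype="float64")
    per_class, support = {}, {}
    for c in range(probs.shape[1]):
        is_pos = labels == c
        if is_pos.all() or not is_pos.any():
            continue  # AUROC is undefined without both positives and negatives
        per_class[c] = binary_auroc(is_pos, probs[:, c])
        support[c] = int(is_pos.sum())
    total = sum(support.values())
    weighted = sum(per_class[c] * support[c] for c in per_class) / total
    return per_class, weighted


# Toy two-class data, for illustration only.
toy_labels = [0, 0, 1, 1]
toy_probs = [[0.9, 0.1], [0.4, 0.6], [0.6, 0.4], [0.1, 0.9]]
toy_per_class, toy_weighted = weighted_ovr_auroc(toy_labels, toy_probs)
```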
### Renal cell carcinoma (RCC) subtyping
RCC subtyping is a relatively common benchmark task for slide-level classification. We evaluate this model on RCC subtyping.
When tested on a set of 52 KIRC slides and 28 KIRP slides (from the overall test set), the model achieved a balanced accuracy of 0.88.
### Non-small cell lung cancer (NSCLC) subtyping
NSCLC subtyping is a relatively common benchmark task for slide-level classification. We evaluate this model on NSCLC subtyping.
When tested on a set of 55 LUAD slides and 58 LUSC slides (from the overall test set), the model achieved a balanced accuracy of 0.76.
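Balanced accuracy, the metric reported for both subtyping tasks above, is the mean of the per-class recalls, so each subtype counts equally regardless of how many slides it has. The sketch below shows the calculation on made-up labels. It does not reproduce the evaluation itself, and this card does not say how the 32-class output was reduced to the two subtypes of each task.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)


# Made-up labels for illustration only.
toy_true = ["LUAD", "LUAD", "LUAD", "LUAD", "LUSC", "LUSC"]
toy_pred = ["LUAD", "LUAD", "LUAD", "LUSC", "LUSC", "LUAD"]
toy_score = balanced_accuracy(toy_true, toy_pred)
```

In this toy case LUAD recall is 3/4 and LUSC recall is 1/2, so the balanced accuracy is 0.625, while plain accuracy would be 4/6.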
## Intended uses
This model is ONLY intended for research purposes.
**This model may not be used for clinical purposes.** This model is distributed without warranties, either express or implied.