kaczmarj
/

pancancer-tissue-classifier.tcga


	---
	license: cc-by-4.0
	---
	# Pancancer tissue classifier

	This model classifies whole slide images into one of 32 cancer types from TCGA. It was trained by Jakub Kaczmarzyk using CLAM.

	Output classes: ACC, BLCA, BRCA, CESC, CHOL, COAD, DLBC, ESCA, GBM, HNSC, KICH, KIRC, KIRP, LGG, LIHC, LUAD, LUSC, MESO, OV, PAAD, PCPG, PRAD, READ, SARC, SKCM, STAD, TGCT, THCA, THYM, UCEC, UCS, UVM.

	Please see the [TCGA study abbreviations](https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations) to map these class names to the TCGA study names.

	## Data

	Diagnostic slides in TCGA (e.g., `DX`) were used to train the model. The whole slide images were tiled into 128um x 128um patches, and each patch was encoded using CTransPath, which produces 768-dimensional embeddings.
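
Because the patch size is given in micrometers, the size in pixels depends on the resolution of each slide. The following is a minimal sketch of that conversion, not the code used to build this model; the function name `patch_size_pixels` and the microns-per-pixel (`mpp`) values in the usage note are illustrative assumptions.

```python
def patch_size_pixels(patch_um: float, mpp: float) -> int:
    """Convert a physical patch width (micrometers) to pixels.

    `mpp` is the slide resolution in microns per pixel. Both the function
    name and the idea of rounding to the nearest pixel are assumptions
    made for illustration.
    """
    if patch_um <= 0 or mpp <= 0:
        raise ValueError("patch_um and mpp must be positive")
    return round(patch_um / mpp)
```

For instance, a 128um patch on a slide scanned at 0.5 microns per pixel would span `patch_size_pixels(128, 0.5)` pixels per side, and twice as many pixels at 0.25 microns per pixel.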

	Train, validation, and test splits were stratified by TCGA study, and patients did not cross split boundaries.
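
A split with these two properties (stratified by study, no patient shared across splits) can be sketched as follows. This is an illustration only and not the procedure used for this model: the function names, the 80/10/10 fractions, and the random seed are all assumptions.

```python
import random
from collections import defaultdict


def split_patients(patient_to_study, fractions=(0.8, 0.1, 0.1), seed=0):
    """Assign each patient to 'train', 'val', or 'test', stratified by study.

    The fractions and seed are hypothetical; the model card does not state them.
    """
    rng = random.Random(seed)
    by_study = defaultdict(list)
    for patient, study in patient_to_study.items():
        by_study[study].append(patient)

    assignment = {}
    for study, patients in by_study.items():
        patients = sorted(patients)  # deterministic order before shuffling
        rng.shuffle(patients)
        n_train = round(fractions[0] * len(patients))
        n_val = round(fractions[1] * len(patients))
        for i, patient in enumerate(patients):
            if i < n_train:
                assignment[patient] = "train"
            elif i < n_train + n_val:
                assignment[patient] = "val"
            else:
                assignment[patient] = "test"
    return assignment


def split_slides(slide_to_patient, patient_split):
    """Each slide inherits the split of its patient, so patients never cross splits."""
    return {slide: patient_split[p] for slide, p in slide_to_patient.items()}
```

Splitting at the patient level and then mapping slides to their patient's split is what keeps multiple slides from one patient in the same partition.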

	Sample sizes:
	- Train: 9,257 slides (7,633 patients)
	- Validation: 1,186 slides (955 patients)
	- Test: 1,163 slides (955 patients)

	## Reusing this model

	To use this model on the command line, see [WSInfer-MIL](https://github.com/kaczmarj/wsinfer-mil).

	Alternatively, you may use PyTorch or ONNX to run the model. First, embed 128um x 128um patches using CTransPath. Then pass the bag of embeddings to the model.

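	```python
	import numpy as np
	import onnxruntime as ort

	# Placeholder bag: 1,000 patch embeddings with 768 dimensions each (the CTransPath embedding size).
	embedding = np.ones((1_000, 768), dtype="float32")
	ort_sess = ort.InferenceSession("model.onnx")
	# Request both named outputs: the class logits and the attention weights.
	logits, attention = ort_sess.run(["logits", "attention"], {"input": embedding})
	```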
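
The outputs can then be turned into a predicted class and a list of the most attended patches. The sketch below makes two assumptions that this model card does not confirm: that the logits follow the order of the output classes listed above, and that the attention output holds one weight per patch. The helper names are hypothetical, and the functions take plain Python lists.

```python
import math

# ASSUMPTION: logit order matches the order of the output classes listed in this card.
CLASS_NAMES = [
    "ACC", "BLCA", "BRCA", "CESC", "CHOL", "COAD", "DLBC", "ESCA",
    "GBM", "HNSC", "KICH", "KIRC", "KIRP", "LGG", "LIHC", "LUAD",
    "LUSC", "MESO", "OV", "PAAD", "PCPG", "PRAD", "READ", "SARC",
    "SKCM", "STAD", "TGCT", "THCA", "THYM", "UCEC", "UCS", "UVM",
]


def softmax(logits):
    """Numerically stable softmax over a flat sequence of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_class(logits, class_names=CLASS_NAMES):
    """Return (class name, probability) for the largest logit."""
    if len(logits) != len(class_names):
        raise ValueError("expected one logit per class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return class_names[best], probs[best]


def top_attention_patches(attention, k=5):
    """Return the indices of the k patches with the largest attention weights."""
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return order[:k]
```

With the ONNX Runtime outputs, something like `predict_class(logits.ravel().tolist())` should work if the logits flatten to 32 values, but check the output shapes of the exported model first.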

	## Model performance

	The model achieved a weighted average AUROC of 0.99 (one-vs-rest).

	Here are the one-vs-rest AUROC values for each TCGA study.

	- ACC: 0.9993
	- BLCA: 0.9814
	- BRCA: 0.9908
	- CESC: 0.9868
	- CHOL: 0.9972
	- COAD: 0.9927
	- DLBC: 0.9996
	- ESCA: 0.9571
	- GBM: 0.9984
	- HNSC: 0.9974
	- KICH: 0.9998
	- KIRC: 0.9993
	- KIRP: 0.9952
	- LGG: 0.9984
	- LIHC: 0.9988
	- LUAD: 0.9879
	- LUSC: 0.9868
	- MESO: 0.9961
	- OV: 0.9900
	- PAAD: 0.9897
	- PCPG: 0.9944
	- PRAD: 1.0000
	- READ: 0.9752
	- SARC: 0.9946
	- SKCM: 0.9957
	- STAD: 0.9932
	- TGCT: 0.9957
	- THCA: 1.0000
	- THYM: 0.9991
	- UCEC: 0.9971
	- UCS: 0.9863
	- UVM: 0.9997
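
As a quick sanity check on the list above, the per-study values can be averaged directly. Note that this sketch computes the unweighted (macro) mean, which is not the same as the weighted average reported above: weighting requires the number of test slides per study, which this card does not list. The variable and function names are illustrative.

```python
# Per-study one-vs-rest AUROC values, copied from the list above.
AUROC = {
    "ACC": 0.9993, "BLCA": 0.9814, "BRCA": 0.9908, "CESC": 0.9868,
    "CHOL": 0.9972, "COAD": 0.9927, "DLBC": 0.9996, "ESCA": 0.9571,
    "GBM": 0.9984, "HNSC": 0.9974, "KICH": 0.9998, "KIRC": 0.9993,
    "KIRP": 0.9952, "LGG": 0.9984, "LIHC": 0.9988, "LUAD": 0.9879,
    "LUSC": 0.9868, "MESO": 0.9961, "OV": 0.9900, "PAAD": 0.9897,
    "PCPG": 0.9944, "PRAD": 1.0000, "READ": 0.9752, "SARC": 0.9946,
    "SKCM": 0.9957, "STAD": 0.9932, "TGCT": 0.9957, "THCA": 1.0000,
    "THYM": 0.9991, "UCEC": 0.9971, "UCS": 0.9863, "UVM": 0.9997,
}


def macro_mean(values):
    """Unweighted mean across studies (each study counts equally)."""
    values = list(values)
    return sum(values) / len(values)


def lowest(scores, n=3):
    """Return the n studies with the lowest AUROC, lowest first."""
    return sorted(scores.items(), key=lambda item: item[1])[:n]


macro_auroc = macro_mean(AUROC.values())
hardest = lowest(AUROC)
```

Sorting this way makes it easy to see which studies are hardest to separate from the rest; ESCA has the lowest value in the list.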

	### Renal cell carcinoma (RCC) subtyping

	RCC subtyping is a relatively common benchmark task for slide-level classification. We evaluate this model on RCC subtyping.

	When tested on a set of 52 KIRC slides and 28 KIRP slides (from the overall test set), the model achieved a balanced accuracy of 0.88.
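
Balanced accuracy is the mean of the per-class recalls, so the larger class (here, 52 KIRC slides versus 28 KIRP slides) does not dominate the score. A minimal sketch of the metric follows; the function name and the toy labels in the usage note are illustrative and are not results from this model.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, computed over the classes present in y_true.

    A prediction of any class other than the true one counts as a miss,
    including predictions that fall outside the classes in y_true.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)
```

As a toy illustration, with true labels `["KIRC"] * 4 + ["KIRP"] * 2` and predictions `["KIRC", "KIRC", "KIRC", "KIRP", "KIRP", "KIRC"]`, the per-class recalls are 3/4 and 1/2, so the balanced accuracy is their mean rather than the plain accuracy of 4/6.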

	### Non-small cell lung cancer (NSCLC) subtyping

	NSCLC subtyping is a relatively common benchmark task for slide-level classification. We evaluate this model on NSCLC subtyping.

	When tested on a set of 55 LUAD slides and 58 LUSC slides (from the overall test set), the model achieved a balanced accuracy of 0.76.


	## Intended uses

	This model is ONLY intended for research purposes.

	This model may not be used for clinical purposes. This model is distributed without warranties, either express or implied.