---
license: apache-2.0
language:
- en
tags:
- mechanistic interpretability
- sparse autoencoder
- llama
- llama-3
---

## Model Information

An SAE (Sparse Autoencoder) for [meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct).

It was trained specifically on layer 9 of Llama 3.2 1B and reached a final L0 of 63 during training, i.e. on average about 63 features are active per token.

It is used to decompose Llama's activations into interpretable features.
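
For intuition, a minimal sketch of a standard ReLU sparse autoencoder is shown below. The hidden size of 2048 matches Llama 3.2 1B; the feature count and architecture details are illustrative assumptions, not the confirmed configuration of this release.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal ReLU SAE sketch: encode activations into an overcomplete
    sparse code, then decode back to the original space."""

    def __init__(self, d_model: int = 2048, d_features: int = 16384):
        # d_features (the expansion factor) is an assumption; the released
        # checkpoint defines the real sizes.
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # Sparse feature activations: most entries are zero after the ReLU
        # (an L0 of 63 means ~63 features fire per token on average).
        features = torch.relu(self.encoder(x))
        # Reconstruction of the original layer-9 activation.
        x_hat = self.decoder(features)
        return features, x_hat
```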

The SAE weights are released under the Apache 2.0 license; note, however, that Llama 3.2 1B itself is to be used under Meta's Llama 3.2 license.

## How to use

A Jupyter notebook is provided to test the model:

<a target="_blank" href="https://colab.research.google.com/github/qrsch/SAE/blob/main/SAE.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab" width="200px"/>
</a>
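
Alternatively, the sketch below shows one way to capture layer 9 activations from Llama 3.2 1B with `transformers` and run them through the SAE. The `SparseAutoencoder` class is the illustrative one from the sketch above, and the checkpoint filename `sae_layer9.pt` is a placeholder, not a confirmed file in this repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.2-1B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
model.eval()

# Capture the output of decoder layer 9 with a forward hook.
captured = {}

def hook(module, inputs, output):
    # Decoder layers typically return a tuple; hidden states come first.
    captured["h"] = output[0] if isinstance(output, tuple) else output

handle = model.model.layers[9].register_forward_hook(hook)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    model(**inputs)
handle.remove()

# Decompose the captured activations into sparse features.
sae = SparseAutoencoder(d_model=model.config.hidden_size)
sae.load_state_dict(torch.load("sae_layer9.pt"))  # placeholder filename
features, _ = sae(captured["h"].float())
print(features.shape)  # (batch, seq_len, d_features)
```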

## Training

Our SAE was trained on the [LMSYS-Chat-1M dataset](https://arxiv.org/pdf/2309.11998) on a single RTX 3090. The training script will be provided soon in the following repository: https://github.com/qrsch/SAE/tree/main
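
This card does not specify the training objective; a common choice, shown as a sketch below, is mean-squared reconstruction error plus an L1 sparsity penalty on the feature activations. The coefficient is an illustrative assumption.

```python
import torch

def sae_loss(x: torch.Tensor, sae, l1_coeff: float = 3e-4) -> torch.Tensor:
    """Standard SAE objective sketch: reconstruct the activation while
    keeping the feature code sparse. Hyperparameters are illustrative."""
    features, x_hat = sae(x)
    recon = (x_hat - x).pow(2).sum(dim=-1).mean()  # reconstruction error
    sparsity = features.abs().sum(dim=-1).mean()   # L1 pushes features to zero
    return recon + l1_coeff * sparsity
```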