---
license: mit
datasets:
- ankitagr01/dynamic_topic_modeling_arxiv_abstracts
- knkarthick/topicsum
- nuvocare/MSD_manual_topics_user_base
language:
- en
metrics:
- mse
base_model:
- thesephist/contra-bottleneck-t5-large-wikipedia
pipeline_tag: summarization
tags:
- topic-extraction
- topic-summarization
- dynamic-topic-modeling
---
# Contra-Topic-bottleneck-t5-large: Linear Topic Extraction using Bottleneck T5

A lightweight approach to topic extraction leveraging the Bottleneck T5 autoencoder architecture with learned transformation matrices. This project provides three specialized transformation matrices for mapping content embeddings to topic embeddings across different domains.

[**Check out the blog**](https://amanpriyanshu.github.io/blogs/posts/2024/contra-topic/)

**TL;DR:** Transform content embeddings into topic embeddings using domain-specific 1024×1024 transformation matrices, trained on three distinct datasets. Built on top of the Bottleneck T5 architecture for efficient topic extraction without fine-tuning the base model.

## Motivation
Large Language Models (LLMs) have become the go-to solution for many NLP tasks, including topic extraction and classification. However, they come with significant overhead:
- High computational requirements
- Large memory footprint
- Considerable inference latency
- Complex deployment needs
- Limited to pre-specified classes

This project offers a lightweight alternative specifically for topic extraction by leveraging the semantic structure of the Bottleneck T5's latent space. Instead of training a new model or fine-tuning existing ones, we learn a simple linear transformation between content and topic embeddings, providing:
- Fast inference (milliseconds)
- Minimal memory footprint (single 1024×1024 matrix per domain)
- Simple deployment (basic matrix multiplication)
- No need for GPU at inference time
- Generative in nature: produces free-form topic text rather than choosing from pre-specified classes

## Architecture Overview

### Base Model
- Uses Bottleneck T5 Large ([thesephist/contra-bottleneck-t5-large-wikipedia](https://huggingface.co/thesephist/contra-bottleneck-t5-large-wikipedia))
- Fixed embedding dimension: 1024
- Pre-trained on Wikipedia data
- Autoencoder architecture with attention pooling

### Transformation Layers
- Three domain-specific transformation matrices (1024×1024 each)
- Linear mapping from content to topic space
- Learned with a simple mean-squared-error objective (a fitting sketch is shown below)
- Additional parameters: ~1M per domain (~3M total across the three domains)
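
The released matrices are downloaded pre-fit in the Implementation section below, but the fitting step itself is easy to reproduce. A minimal sketch, assuming you already have paired content/topic embeddings from the Bottleneck T5 encoder (the tensors below are random placeholders, and the released matrices may have been fit by gradient descent on the same MSE objective rather than in closed form):

```python
import torch

# Placeholder paired embeddings from the Bottleneck T5 encoder:
# each row of content_embs is a content embedding, and the matching row of
# topic_embs is the embedding of its annotated topic (both 1024-dimensional).
content_embs = torch.randn(1000, 1024)
topic_embs = torch.randn(1000, 1024)

# Closed-form least-squares fit of W minimizing ||content_embs @ W - topic_embs||^2,
# i.e. the MSE objective described above.
W = torch.linalg.lstsq(content_embs, topic_embs).solution  # shape (1024, 1024)

# Training error of the fitted transformation.
train_mse = torch.mean((content_embs @ W - topic_embs) ** 2)
print(W.shape, train_mse.item())
```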

## Datasets and Performance Metrics

### 1. ArXiv Abstracts Dataset ([ankitagr01/dynamic_topic_modeling_arxiv_abstracts](https://huggingface.co/datasets/ankitagr01/dynamic_topic_modeling_arxiv_abstracts))
Scientific paper abstracts paired with their research topics, providing a test bed for academic content classification.

**Performance Metrics:**
- Training MSE: 0.00225 (error on samples used to learn transformation)
- Testing MSE: 0.00268 (error on held-out validation set)
- Inter-topic MSE: 0.00620 (minimum squared distance between different topic embeddings)

**Use Cases:**
- Automated paper categorization
- Research trend analysis
- Academic content recommendation

### 2. TopicSUM Dataset ([knkarthick/topicsum](https://huggingface.co/datasets/knkarthick/topicsum))
241,171 dialogue samples with human-annotated topic labels, ideal for conversational content analysis.

**Performance Metrics:**
- Training MSE: 0.00252
- Testing MSE: 0.00255
- Inter-topic MSE: 0.00737

**Use Cases:**
- Meeting summarization
- Customer service dialogue categorization
- Chat log analysis

### 3. MSD Manual Topics ([nuvocare/MSD_manual_topics_user_base](https://huggingface.co/datasets/nuvocare/MSD_manual_topics_user_base))
Medical content from Merck's Manual, featuring both professional and patient-oriented content.

**Performance Metrics:**
- Training MSE: 0.00174
- Testing MSE: 0.00197
- Inter-topic MSE: 0.00566

**Use Cases:**
- Medical document classification
- Healthcare content organization
- Patient information routing

## Understanding the Metrics

### Computational Requirements
| Resource | Requirement | Notes |
|----------|-------------|--------|
| Storage | ~9MB per matrix | 1024×1024 matrix per domain, as stored in the `.pt` checkpoint |
| Memory | ~27MB total | All three domain matrices |
| Inference Time | ~10ms | On CPU, per text sample |
| Training Hardware | P100 GPU | Free tier on Kaggle |
| Training Time | ~4 hours total | Mostly embedding generation |
| Base Model | ~770M parameters | Loaded only during embedding creation |

### Performance Metrics Explained

1. **Training MSE (Mean Squared Error)**
   - Measures how well the transformation matrix maps content to topic embeddings
   - Calculated on the 80% training split
   - Lower values indicate better alignment between transformed content and actual topic embeddings

2. **Testing MSE**
   - Same metric but on 20% held-out test set
   - Indicates generalization capability
   - Values close to the training MSE suggest good generalization; a testing MSE slightly above the training MSE is expected and healthy

3. **Inter-topic MSE**
   - Minimum squared distance between any pair of topic embeddings
   - Higher values indicate better topic separation
   - Critical for preventing topic confusion
   - Example: MSD's 0.00566 means medical topics maintain distinct representations
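
A minimal sketch of how these three quantities can be computed, assuming you already have transformed content embeddings and reference topic embeddings as tensors (all names below are random placeholders; the exact normalization used for the reported figures may differ):

```python
import torch

def mse(a: torch.Tensor, b: torch.Tensor) -> float:
    """Mean squared error between two batches of embeddings."""
    return torch.mean((a - b) ** 2).item()

# Placeholder tensors: rows are 1024-dim embeddings. "pred" is content @ W,
# "true" is the embedding of the reference topic for the same sample.
pred_train, true_train = torch.randn(800, 1024), torch.randn(800, 1024)
pred_test, true_test = torch.randn(200, 1024), torch.randn(200, 1024)

train_mse = mse(pred_train, true_train)   # fit quality on the 80% split
test_mse = mse(pred_test, true_test)      # generalization on the 20% split

# Inter-topic MSE: smallest per-dimension mean squared distance between any
# two distinct reference topic embeddings (higher = better topic separation).
topics = true_train
sq_dists = torch.cdist(topics, topics, p=2) ** 2 / topics.shape[1]
sq_dists.fill_diagonal_(float('inf'))     # ignore self-distances
inter_topic_mse = sq_dists.min().item()

print(train_mse, test_mse, inter_topic_mse)
```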

### Comparative Analysis
- MSD dataset shows best training performance (0.00174 MSE)
  - Likely due to well-structured medical vocabulary
  - Clear topic boundaries in medical domain
- TopicSUM has highest inter-topic MSE (0.00737)
  - Reflects diverse nature of conversational topics
  - Important for distinguishing between varied dialogue contexts
- ArXiv results balance between the two
  - Scientific content has natural overlap between fields
  - Still maintains good topic separation (0.00620 inter-topic MSE)

## Implementation

**Try it out here:** [Colab notebook](https://colab.research.google.com/drive/1_SuTiL3QS-PUYjSrugqqD5mQlMv8Hbfc?usp=sharing)

### 1. Base Model Wrapper

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM

class BottleneckT5Autoencoder:
    def __init__(self, model_path: str, device='cpu'):
        self.device = device
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, model_max_length=512)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path, 
            trust_remote_code=True
        ).to(device)
        self.model.eval()

    @torch.no_grad()
    def embed(self, text: str) -> torch.FloatTensor:
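        # Encode the input text into the model's fixed 1024-dimensional latent
        # vector (encode_only returns the pooled latent instead of decoder logits).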
        inputs = self.tokenizer(text, return_tensors='pt').to(self.device)
        decoder_inputs = self.tokenizer('', return_tensors='pt').to(self.device)
        return self.model(
            **inputs,
            decoder_input_ids=decoder_inputs['input_ids'],
            encode_only=True,
        )[0]

    @torch.no_grad()
    def generate_from_latent(self, latent: torch.FloatTensor, max_length=512, temperature=1.0) -> str:
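        # Decode a latent back to text: generation is steered by perturbing the
        # latent of a dummy input ('.') toward the target latent.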
        dummy_text = '.'
        dummy = self.embed(dummy_text)
        perturb_vector = latent - dummy
        self.model.perturb_vector = perturb_vector
        input_ids = self.tokenizer(dummy_text, return_tensors='pt').to(self.device).input_ids
        output = self.model.generate(
            input_ids=input_ids,
            max_length=max_length,
            do_sample=True,
            temperature=temperature,
            top_p=0.9,
            num_return_sequences=1,
        )
        return self.tokenizer.decode(output[0], skip_special_tokens=True)
```
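
A quick sanity check of the wrapper, assuming the base model weights download successfully (the example sentence is arbitrary, and the round-trip reconstruction is approximate rather than verbatim):

```python
autoencoder = BottleneckT5Autoencoder('thesephist/contra-bottleneck-t5-large-wikipedia', device='cpu')

latent = autoencoder.embed("Transformer models use attention to weigh the relevance of input tokens.")
print(latent.shape)  # a 1024-dimensional latent vector

print(autoencoder.generate_from_latent(latent))  # approximate reconstruction of the input
```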

### 2. Topic Mapper

**Transformations Available:**

1. https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_topicsum.pt
2. https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_arxiv.pt
3. https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_msd.pt

```python
import requests
import torch

# Download one of the domain-specific transformation matrices (ArXiv here).
url = 'https://huggingface.co/AmanPriyanshu/Contra-Topic-bottleneck-t5-large/resolve/main/transformation_matrix_arxiv.pt'
file_path = 'transformation_matrix.pt'
with open(file_path, 'wb') as f:
    f.write(requests.get(url).content)
transformation_matrix = torch.load(file_path, weights_only=False).float()
print(transformation_matrix.shape, type(transformation_matrix))  # expected: torch.Size([1024, 1024])
```

### 3. Final Conversion

```python
device = 'cpu'  # no GPU needed at inference time
model_path = 'thesephist/contra-bottleneck-t5-large-wikipedia'
content = "Your input document or abstract goes here."

autoencoder = BottleneckT5Autoencoder(model_path=model_path, device=device)
content_embedding = autoencoder.embed(content)
topic_embedding = content_embedding @ transformation_matrix
topic = autoencoder.generate_from_latent(topic_embedding)
print(topic)
```

## Limitations and Future Work

1. **Representation Quality**
   - System inherits Bottleneck T5's encoding limitations
   - Performance depends on input text fitting model's training distribution

2. **Domain Specificity**
   - Each matrix is domain-optimized
   - Cross-domain performance not guaranteed
   - Future work: Investigate domain adaptation techniques

3. **Fixed Dimensionality**
   - Locked to Bottleneck T5's 1024D space
   - Potential future work: Dimension reduction studies

4. **Linear Transformation Limitations**
   - Assumes linear relationship between content and topic spaces
   - Future work: Explore non-linear transformations

## Memory and Computation Requirements
- Transformation Matrix: 1024 × 1024 entries, ~9MB per domain as stored
- Inference Time: ~10ms on CPU (matrix multiplication)
- Total Model Size: ~27MB (all three domains)
- Base Model: ~770M parameters (loaded only during embedding creation)

## Acknowledgments

Special thanks to:
- Linus Lee (@thesephist) for the Bottleneck T5 model
- The T5 team at Google Research
- Dataset providers:
  - @ankitagr01 for the ArXiv abstracts dataset
  - @knkarthick for the TopicSUM dataset
  - @nuvocare for the MSD Manual topics dataset
- Kaggle for providing free P100 GPU resources

## License
MIT

## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.