|
--- |
|
datasets: |
|
- teknium/openhermes |
|
language: |
|
- en |
|
--- |
|
# GZIP Embeddings with Normalized Text |
|
|
|
It's so funny that the Hugging Face Hub lets you do this.
|
|
|
|
|
| model | parameters | embedding dimensions | |
|
| --- | --- | --- | |
|
| meta-llama/Llama-2-70b-hf | 70b | 8192 | |
|
| crumb/gzip-openhermes | 1* | 242,831 | |
|
|
|
*The Hugging Face pretrained-model saving API requires at least one parameter, which in this model is a single dummy value set to `1`.
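
If you're curious how a model with no real weights gets saved at all: a single dummy tensor does the trick. A minimal sketch of the pattern (class names here are illustrative, not the actual remote code on the Hub):

```python
import torch
import torch.nn as nn
from transformers import PreTrainedModel, PretrainedConfig

class GzipEmbeddingConfig(PretrainedConfig):
    model_type = "gzip-embedding"  # hypothetical model_type

class GzipEmbeddingModel(PreTrainedModel):
    config_class = GzipEmbeddingConfig

    def __init__(self, config):
        super().__init__(config)
        # One dummy parameter so save_pretrained/from_pretrained
        # have at least one tensor to serialize; gzip does the
        # actual work, so nothing here is ever trained.
        self.dummy = nn.Parameter(torch.ones(1))
```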
|
|
|
Multiprocessing is super weird, so make sure you don't define the names `p` or `calculate_ncd_row` anywhere in your own code.
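
For context: judging by the `calculate_ncd_row` name and the corpus-sized dimensionality in the table above, each embedding dimension is the normalized compression distance (NCD) between your input and one corpus document. A rough sketch of that idea using the standard NCD formula; the helper names here are mine, not the model's:

```python
import gzip

def c(text: str) -> int:
    # Compressed length of the UTF-8 bytes, i.e. C(x).
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = c(x), c(y)
    return (c(x + " " + y) - min(cx, cy)) / max(cx, cy)

def embed(text: str, corpus: list[str]) -> list[float]:
    # One NCD per corpus document, so embedding_dim == len(corpus),
    # which is why the table above lists 242,831 dimensions.
    return [ncd(text, doc) for doc in corpus]
```

This is also why pruning the corpus (see Usage below) directly shrinks the embedding dimension.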
|
|
|
### Usage |
|
|
|
|
|
```python |
|
# Requirements |
|
%pip install -qq transformers |
|
|
|
# Download Model |
|
from transformers import AutoModel |
|
model = AutoModel.from_pretrained("crumb/gzip-openhermes", trust_remote_code=True) |
|
|
|
# Prune model: each corpus document is one embedding dimension,
# so truncating the corpus shrinks the embedding
|
model.config.update({ |
|
"corpus": model.config.corpus[:1024] |
|
}) |
|
model.dimensionality() # 1024 |
|
|
|
# Inference |
|
model(["this is a test sequence"], num_procs=16).shape # [1, 1024] |
|
|
|
# Finetuning |
|
from tqdm.auto import tqdm |
|
|
|
new_data = ["i love GZIP! it is my favorite!", "i HATE transformers!"] |
|
normalized_data = [ |
|
model.normalize(i) for i in tqdm(new_data) |
|
] |
|
print(f"Input: '{new_data[0]}'\nTransformed: '{normalized_data[0]}'") |
|
|
|
model.config.update({ |
|
"corpus": model.config.corpus + normalized_data |
|
}) |
|
model.dimensionality() # 1026 (1024 pruned + 2 new documents)
|
model.save_pretrained("my-finetuned-gzip-model") |
|
``` |
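
Once you have embeddings they behave like any other dense vectors. A quick sketch of similarity scoring, assuming the model output is array-like (the `.shape` call above suggests it is); `cosine` is just a helper I made up:

```python
import numpy as np

docs = ["gzip is a compression algorithm", "i had soup for lunch"]
query = "how does gzip compress data?"

doc_embs = np.asarray(model(docs, num_procs=16), dtype=float)
query_emb = np.asarray(model([query], num_procs=16), dtype=float)[0]

# Cosine similarity against each document; NCD rows of related
# texts tend to correlate, so treat the scores as a rough ranking.
def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

for doc, emb in zip(docs, doc_embs):
    print(f"{cosine(query_emb, emb):.3f}  {doc}")
```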
|
|
|
### Config

```python
|
normalize = True, |
|
normalized_corpus = True, |
|
reduction = False, |
|
reduced_dimension = 0, |
|
remove_stop_words = True, |
|
stop_words = stopwords.words('english'), |
|
corpus = [], # OpenHermes instructions + outputs. I think [instructions, outputs, instructions+outputs] would be better, but it's literally 3x slower and I don't care
|
``` |
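
Given `remove_stop_words = True` and `stop_words = stopwords.words('english')`, the normalize step plausibly does something like the following. This is a sketch under those assumptions, not the model's exact code:

```python
import re
from nltk.corpus import stopwords  # one-time: nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def normalize(text: str) -> str:
    # Lowercase, keep word-ish tokens, drop English stop words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(normalize("i love GZIP! it is my favorite!"))
# -> something like "love gzip favorite"
```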