GZIP Embeddings with Normalized Text

It's so funny that the Hugging Face Hub lets you do this

| model | parameters | embedding dimensions |
| --- | --- | --- |
| meta-llama/Llama-2-70b-hf | 70B | 8192 |
| crumb/gzip-openhermes | 1\* | 242,831 |

\*The Hugging Face pretrained-model saving API requires at least one parameter, so this model stores a single dummy parameter set to `1`.

Multiprocessing is quite finicky here, so make sure the names `p` and `calculate_ncd_row` are not already defined anywhere in your code.
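For context on what a name like `calculate_ncd_row` is doing: each embedding dimension is presumably the normalized compression distance (NCD) between the input text and one corpus entry, computed with gzip. A minimal sketch of that idea, assuming the standard NCD formula (the `ncd` and `embed` helpers here are illustrative, not the model's actual code):

```python
import gzip

def ncd(x: str, y: str) -> float:
    # Normalized compression distance: how much better do x and y
    # compress together than apart? Close to 0 = similar, near 1 = unrelated.
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

def embed(text: str, corpus: list[str]) -> list[float]:
    # One dimension per corpus entry, hence dimensionality == len(corpus).
    return [ncd(text, doc) for doc in corpus]
```

This is also where the 242,831-dimensional output comes from: one dimension per corpus entry.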

Usage

```python
# Requirements
%pip install -qq transformers
```

```python
# Download Model
from transformers import AutoModel

model = AutoModel.from_pretrained("crumb/gzip-openhermes", trust_remote_code=True)
```

```python
# Prune model: the embedding dimensionality is just the corpus length,
# so truncating the corpus shrinks the embeddings
model.config.update({
    "corpus": model.config.corpus[:1024]
})
model.dimensionality() # 1024
```

```python
# Inference
model(["this is a test sequence"], num_procs=16).shape # [1, 1024]
```
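The output is an ordinary tensor of distances, so any off-the-shelf similarity measure can be applied to it downstream. A plain-Python sketch of cosine similarity between two embedding rows (nothing here is model-specific):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two equal-length vectors,
    # e.g. two rows of the model's output.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```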

```python
# Finetuning
from tqdm.auto import tqdm

new_data = ["i love GZIP! it is my favorite!", "i HATE transformers!"]
normalized_data = [
    model.normalize(i) for i in tqdm(new_data)
]
print(f"Input: '{new_data[0]}'\nTransformed: '{normalized_data[0]}'")
```
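To get a feel for what `model.normalize` does without downloading the model: judging from the config section below, it removes NLTK English stop words (`stop_words = stopwords.words('english')`); the lowercasing and punctuation stripping in this stand-in are assumptions, and the tiny `STOP_WORDS` set is only a placeholder for the full NLTK list:

```python
import string

# Tiny stand-in for nltk's stopwords.words('english'); the real model
# uses the full NLTK list (see the config section of this card).
STOP_WORDS = {"i", "it", "is", "my", "the", "a", "an"}

def normalize(text: str) -> str:
    # Assumed steps: lowercase, strip punctuation, drop stop words.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in cleaned.split() if word not in STOP_WORDS)
```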

```python
# Appending normalized text to the corpus adds one embedding
# dimension per new entry
model.config.update({
    "corpus": model.config.corpus + normalized_data
})
model.dimensionality()
model.save_pretrained("my-finetuned-gzip-model")
```

config:

```python
normalize = True,
normalized_corpus = True,
reduction = False,
reduced_dimension = 0,
remove_stop_words = True,
stop_words = stopwords.words('english'),
corpus = [], # OpenHermes instructions + outputs; [instructions, outputs, instructions+outputs] would likely be better, but it's literally 3x slower
```