|
--- |
|
datasets: |
|
- teknium/openhermes |
|
language: |
|
- en |
|
--- |
|
# GZIP Embeddings with Normalized Text |
|
|
|
It's so funny that the Hugging Face Hub lets you do this.
|
|
|
|
|
| model | parameters | embedding dimensions | |
|
| --- | --- | --- | |
|
| meta-llama/Llama-2-70b-hf | 70b | 8192 | |
|
| crumb/gzip-openhermes | 1* | 242,831 | |
|
|
|
*The Hugging Face pretrained-model saving API requires at least one parameter, which in this model is a single dummy value set to `1`.
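
If you're curious how a model with no real weights gets saved at all: a single dummy tensor does the trick. A minimal sketch of the pattern (class names here are illustrative, not the actual remote code on the Hub):

```python
import torch
import torch.nn as nn
from transformers import PreTrainedModel, PretrainedConfig

class GzipEmbeddingConfig(PretrainedConfig):
    model_type = "gzip-embedding"  # hypothetical model_type

class GzipEmbeddingModel(PreTrainedModel):
    config_class = GzipEmbeddingConfig

    def __init__(self, config):
        super().__init__(config)
        # One dummy parameter so save_pretrained/from_pretrained
        # have at least one tensor to serialize; gzip does the
        # actual work, so nothing here is ever trained.
        self.dummy = nn.Parameter(torch.ones(1))
```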
|
|
|
Multiprocessing is super weird, so make sure you don't define the names `p` or `calculate_ncd_row` anywhere in your own code.
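
For context: judging by the `calculate_ncd_row` name and the corpus-sized dimensionality in the table above, each embedding dimension is the normalized compression distance (NCD) between your input and one corpus document. A rough sketch of that idea using the standard NCD formula; the helper names here are mine, not the model's:

```python
import gzip

def c(text: str) -> int:
    # Compressed length of the UTF-8 bytes, i.e. C(x).
    return len(gzip.compress(text.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    cx, cy = c(x), c(y)
    return (c(x + " " + y) - min(cx, cy)) / max(cx, cy)

def embed(text: str, corpus: list[str]) -> list[float]:
    # One NCD per corpus document, so embedding_dim == len(corpus),
    # which is why the table above lists 242,831 dimensions.
    return [ncd(text, doc) for doc in corpus]
```

This is also why pruning the corpus (see Usage below) directly shrinks the embedding dimension.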
|
|
|
### Usage |
|
|
|
|
|
```python |
|
# Requirements |
|
%pip install -qq transformers |
|
|
|
# Download Model |
|
from transformers import AutoModel |
|
model = AutoModel.from_pretrained("crumb/gzip-openhermes", trust_remote_code=True) |
|
|
|
# Prune model: each corpus document is one embedding dimension,
# so truncating the corpus shrinks the embedding
|
model.config.update({ |
|
"corpus": model.config.corpus[:1024] |
|
}) |
|
model.dimensionality() # 1024 |
|
|
|
# Inference |
|
model(["this is a test sequence"], num_procs=16).shape # [1, 1024] |
|
|
|
# Finetuning |
|
from tqdm.auto import tqdm |
|
|
|
new_data = ["i love GZIP! it is my favorite!", "i HATE transformers!"] |
|
normalized_data = [ |
|
model.normalize(i) for i in tqdm(new_data) |
|
] |
|
print(f"Input: '{new_data[0]}'\nTransformed: '{normalized_data[0]}'") |
|
|
|
model.config.update({ |
|
"corpus": model.config.corpus + normalized_data |
|
}) |
|
model.dimensionality() # 1026 (1024 pruned + 2 new documents)
|
model.save_pretrained("my-finetuned-gzip-model") |
|
``` |
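
Once you have embeddings they behave like any other dense vectors. A quick sketch of similarity scoring, assuming the model output is array-like (the `.shape` call above suggests it is); `cosine` is just a helper I made up:

```python
import numpy as np

docs = ["gzip is a compression algorithm", "i had soup for lunch"]
query = "how does gzip compress data?"

doc_embs = np.asarray(model(docs, num_procs=16), dtype=float)
query_emb = np.asarray(model([query], num_procs=16), dtype=float)[0]

# Cosine similarity against each document; NCD rows of related
# texts tend to correlate, so treat the scores as a rough ranking.
def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

for doc, emb in zip(docs, doc_embs):
    print(f"{cosine(query_emb, emb):.3f}  {doc}")
```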
|
|
|
### Config

```python
|
normalize = True, |
|
normalized_corpus = True, |
|
reduction = False, |
|
reduced_dimension = 0, |
|
remove_stop_words = True, |
|
stop_words = stopwords.words('english'), |
|
corpus = [], # OpenHermes instructions + outputs. I think [instructions, outputs, instructions+outputs] would be better, but it's literally 3x slower and I don't care
|
``` |
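
Given `remove_stop_words = True` and `stop_words = stopwords.words('english')`, the normalize step plausibly does something like the following. This is a sketch under those assumptions, not the model's exact code:

```python
import re
from nltk.corpus import stopwords  # one-time: nltk.download("stopwords")

STOP_WORDS = set(stopwords.words("english"))

def normalize(text: str) -> str:
    # Lowercase, keep word-ish tokens, drop English stop words.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(normalize("i love GZIP! it is my favorite!"))
# -> something like "love gzip favorite"
```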