---
datasets:
- wikimedia/wikipedia
- yhavinga/mc4_nl_cleaned
language:
- nl
base_model:
- ibm-granite/granite-3.0-2b-instruct
pipeline_tag: text-generation
library_name: transformers
tags:
- granite
- granite 3.0
- schaapje
inference: false
license: apache-2.0
---

<p align="center">
  <img src="sheep.png" alt="Schaapje logo" width="750"/>
</p>

# Schaapje-2B-Pretrained

## Model description

This model was continually pretrained on roughly 2.4 billion tokens of Dutch text from Wikipedia and MC4.

The primary objective of the continual pretraining on Dutch data was to make the model more fluent in Dutch. It will also have gained some additional Dutch knowledge.

The IBM Granite 3.0 2B Instruct model was used as the base model.

See [ibm-granite/granite-3.0-2b-instruct](https://huggingface.co/ibm-granite/granite-3.0-2b-instruct) for all information about the IBM Granite foundation model.

## Model usage

Below is a basic example of how to use this continually pretrained model.

!! IMPORTANT NOTE !!

As this is an instruct model that was continually pretrained on Dutch data, there is some degradation in its instruction-following performance. This pretrained model should therefore be further finetuned with SFT, with the embedding and lm_head layers also being trained. Given a proper Dutch SFT dataset, this will restore the instruction-following and EOS-token behaviour.

See the SFT training notebook for Schaapje for one way to do this; a minimal sketch of such an SFT setup is also shown after the usage example below.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = 'cuda'
model_name = 'robinsmits/Schaapje-2B-Pretrained'

# Load the model in bfloat16 and place it on the available GPU(s)
model = AutoModelForCausalLM.from_pretrained(model_name,
                                             device_map = "auto",
                                             torch_dtype = torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

messages = [{"role": "user", "content": "Hoi hoe gaat het ermee?"}]

# Format the conversation with the model's chat template
chat = tokenizer.apply_chat_template(messages,
                                     tokenize = False,
                                     add_generation_prompt = True)

input_tokens = tokenizer(chat, return_tensors = "pt").to(device)

# Generate a response
output = model.generate(**input_tokens,
                        max_new_tokens = 512,
                        do_sample = True)

output = tokenizer.decode(output[0], skip_special_tokens = False)
print(output)
```

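As mentioned in the note above, instruction following is best restored with an SFT run that also trains the embedding and lm_head layers. The sketch below shows one possible setup using TRL's `SFTTrainer` with a PEFT `LoraConfig` that adds `embed_tokens` and `lm_head` to `modules_to_save`; the dataset name and hyperparameters are placeholders, and this is not the exact recipe from the Schaapje SFT notebook.

```python
# Minimal SFT sketch (illustrative only): LoRA adapters on the attention
# projections plus full training of the embedding and lm_head layers via
# `modules_to_save`. Dataset name and hyperparameters are placeholders.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model_name = 'robinsmits/Schaapje-2B-Pretrained'
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype = torch.bfloat16)

# Any Dutch instruction dataset in chat ("messages") format; placeholder name.
train_dataset = load_dataset('some-user/dutch-sft-dataset', split = 'train')

peft_config = LoraConfig(task_type = 'CAUSAL_LM',
                         r = 16,
                         lora_alpha = 32,
                         target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj'],
                         # Also train these layers fully so that instruction
                         # following and EOS-token behaviour are restored.
                         modules_to_save = ['embed_tokens', 'lm_head'])

trainer = SFTTrainer(model = model,
                     args = SFTConfig(output_dir = 'schaapje-sft', bf16 = True),
                     train_dataset = train_dataset,
                     peft_config = peft_config)
trainer.train()
```
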
## Intended uses & limitations

As with all LLMs, this model can exhibit bias and hallucinations. Regardless of how you use this model, always perform the necessary testing and validation.

## Datasets and Licenses

The datasets used for the continual pretraining have different licenses:

- [wikimedia/wikipedia](https://huggingface.co/datasets/wikimedia/wikipedia): cc-by-sa-3.0
- [yhavinga/mc4_nl_cleaned](https://huggingface.co/datasets/yhavinga/mc4_nl_cleaned): ODC-BY

## Model Training

The continual pretraining notebook is available at the following link: [Schaapje_2B_Pretrained](https://github.com/RobinSmits/Schaapje/blob/main/Schaapje_2B_Pretrained.ipynb)

Training was performed with Google Colab PRO on an A100 40GB GPU in multiple sessions. As the amount of data was more than would fit within the maximum 24-hour session that Google Colab PRO allows, I split the dataset into 5 roughly equal parts. Training for each part lasted around 18 to 24 hours, and `resume_from_checkpoint` was used to properly continue pretraining from one session to the next (a sketch of this pattern is shown below).
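
The following is an illustrative sketch of that checkpoint-and-resume pattern with the Hugging Face `Trainer`; the hyperparameters, dataset path and checkpoint directory are placeholders and not the settings from the linked notebook.

```python
# Illustrative sketch of resuming continual pretraining across Colab sessions;
# hyperparameters, paths and dataset names are placeholders (see the linked
# notebook for the actual Schaapje settings).
import os
import torch
from datasets import load_from_disk
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained('ibm-granite/granite-3.0-2b-instruct',
                                             torch_dtype = torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained('ibm-granite/granite-3.0-2b-instruct')

# One of the roughly equal pre-tokenized dataset parts; placeholder path.
train_dataset = load_from_disk('pretraining_dataset_part_1')

training_args = TrainingArguments(
    # Persist checkpoints outside the Colab VM, e.g. on Google Drive.
    output_dir = '/content/drive/MyDrive/Schaapje-2B-Pretrained',
    per_device_train_batch_size = 4,
    gradient_accumulation_steps = 8,
    bf16 = True,
    save_steps = 500,
    logging_steps = 50)

trainer = Trainer(model = model,
                  args = training_args,
                  train_dataset = train_dataset,
                  data_collator = DataCollatorForLanguageModeling(tokenizer, mlm = False))

# In the first session this starts from scratch; in later sessions it picks up
# from the latest checkpoint found in `output_dir`.
has_checkpoint = os.path.isdir(training_args.output_dir) and any(
    d.startswith('checkpoint-') for d in os.listdir(training_args.output_dir))
trainer.train(resume_from_checkpoint = True if has_checkpoint else None)
```
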
The continual pretraining dataset was created with the script: [prepare_pretraining_datasets](https://github.com/RobinSmits/Schaapje/blob/main/prepare_pretraining_datasets.py)
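
For reference, the sketch below outlines the general shape of such a preparation step: load the Dutch Wikipedia and mC4 sources, tokenize, and pack into fixed-length blocks. The dataset configuration names, block size and output path are assumptions; see the linked script for the actual implementation.

```python
# Rough sketch of preparing a Dutch pretraining dataset (illustrative only;
# configuration names, block size and output path are assumptions).
from datasets import concatenate_datasets, load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('ibm-granite/granite-3.0-2b-instruct')
block_size = 2048  # placeholder sequence length

# Dutch Wikipedia and cleaned Dutch mC4; the config names are examples.
wikipedia = load_dataset('wikimedia/wikipedia', '20231101.nl', split = 'train')
mc4 = load_dataset('yhavinga/mc4_nl_cleaned', 'tiny', split = 'train')
raw = concatenate_datasets([wikipedia.select_columns(['text']),
                            mc4.select_columns(['text'])])

def tokenize(batch):
    return tokenizer(batch['text'])

def group_texts(batch):
    # Concatenate all token ids and cut them into fixed-size blocks;
    # for causal LM pretraining the labels are simply the input ids.
    concatenated = sum(batch['input_ids'], [])
    total_length = (len(concatenated) // block_size) * block_size
    blocks = [concatenated[i:i + block_size] for i in range(0, total_length, block_size)]
    return {'input_ids': blocks, 'labels': [list(b) for b in blocks]}

tokenized = raw.map(tokenize, batched = True, remove_columns = raw.column_names)
packed = tokenized.map(group_texts, batched = True, remove_columns = tokenized.column_names)
packed.save_to_disk('pretraining_dataset_part_1')  # placeholder path
```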