|
--- |
|
datasets: |
|
- PrompTart/PTT_advanced_en_ko |
|
language: |
|
- en |
|
- ko |
|
base_model: |
|
- beomi/Llama-3-KoEn-8B-Instruct-preview |
|
- meta-llama/Meta-Llama-3-8B |
|
library_name: transformers |
|
--- |
|
|
|
# Llama-3-KoEn-8B-Instruct-preview Fine-Tuned on Parenthetical Terminology Translation (PTT) Dataset |
|
|
|
## Model Overview |
|
|
|
This is a **Llama-3-KoEn-8B-Instruct-preview** model fine-tuned on the [**Parenthetical Terminology Translation (PTT)**](https://arxiv.org/abs/2410.00683) dataset. [The PTT dataset](https://huggingface.co/datasets/PrompTart/PTT_advanced_en_ko) focuses on translating technical terms accurately by placing the original English term in parentheses alongside its Korean translation, enhancing clarity and precision in specialized fields. This fine-tuned model is optimized for handling technical terminology in the **Artificial Intelligence (AI)** domain. |
|
|
|
|
|
## Example Usage |
|
|
|
Here's how to use this fine-tuned model with the Hugging Face `transformers` library:
|
|
|
```python |
|
import transformers |
|
from transformers import AutoTokenizer, AutoModelForCausalLM |
|
|
|
# Load Model and Tokenizer |
|
model_name = "PrompTartLAB/Llama3ko_8B_inst_PTT_enko" |
|
model = AutoModelForCausalLM.from_pretrained( |
|
model_name, |
|
torch_dtype="auto", |
|
device_map="auto", |
|
) |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
|
# Example sentence |
|
text = "The model was fine-tuned using knowledge distillation techniques. The training dataset was created using a collaborative multi-agent framework powered by large language models." |
|
prompt = f"Translate input sentence to Korean \n### Input: {text} \n### Translated:" |
|
|
|
# Tokenize and generate translation |
|
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
outputs = model.generate(**input_ids, max_new_tokens=1024) |
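
# Decode only the newly generated tokens, skipping the prompt portion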
|
out_message = tokenizer.decode(outputs[0][len(input_ids["input_ids"][0]):], skip_special_tokens=True) |
|
|
|
# " μ΄ λͺ¨λΈμ μ§μ μ¦λ₯ κΈ°λ²(knowledge distillation techniques)μ μ¬μ©νμ¬ λ―ΈμΈ μ‘°μ λμμ΅λλ€. νλ ¨ λ°μ΄ν°μ
μ λν μΈμ΄ λͺ¨λΈ(large language models)λ‘ κ΅¬λλλ νλ ₯μ λ€μ€ μμ΄μ νΈ νλ μμν¬(collaborative multi-agent framework)λ₯Ό μ¬μ©νμ¬ μμ±λμμ΅λλ€." |
|
|
|
``` |
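
If you prefer a higher-level interface, the same prompt format also works with the `transformers` text-generation pipeline. The snippet below is a minimal sketch; the example sentence and generation settings are illustrative, not a recommended configuration:

```python
from transformers import pipeline

# Build a text-generation pipeline around the same checkpoint
pipe = pipeline(
    "text-generation",
    model="PrompTartLAB/Llama3ko_8B_inst_PTT_enko",
    torch_dtype="auto",
    device_map="auto",
)

text = "The encoder relies on multi-head attention and positional encoding."
prompt = f"Translate input sentence to Korean \n### Input: {text} \n### Translated:"

# return_full_text=False keeps only the generated translation, not the prompt
result = pipe(prompt, max_new_tokens=256, return_full_text=False)
print(result[0]["generated_text"])
```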
|
|
|
## Limitations |
|
|
|
- **Out-of-Domain Accuracy**: While the model generalizes to some extent, accuracy may vary in domains that were not part of the training set. |
|
- **Incomplete Parenthetical Annotation**: Not all technical terms are consistently annotated in parentheses; in some cases, terms may be omitted or not annotated as expected. A lightweight output check is sketched below.
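
If consistent annotation matters for your application, a simple post-processing heuristic (not part of the released model or dataset) can flag translations whose expected parenthetical terms are missing:

```python
import re

def parenthetical_terms(translation: str) -> set[str]:
    """Extract the English terms that appear in parentheses in a PTT-style translation."""
    return set(re.findall(r"\(([A-Za-z][A-Za-z0-9 \-/]*)\)", translation))

# Hypothetical check: compare the terms found in the output against the terms you expect
output = "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다."
expected = {"knowledge distillation techniques", "fine-tuning"}
missing = expected - parenthetical_terms(output)
if missing:
    print(f"Missing parenthetical annotations: {missing}")
```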
|
|
|
## Citation |
|
|
|
If you use this model in your research, please cite the original dataset and paper: |
|
|
|
```tex |
|
@misc{myung2024efficienttechnicaltermtranslation, |
|
title={Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation}, |
|
author={Jiyoon Myung and Jihyeon Park and Jungki Son and Kyungro Lee and Joohyung Han}, |
|
year={2024}, |
|
eprint={2410.00683}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2410.00683}, |
|
} |
|
``` |
|
|
|
## Contact |
|
|
|
For questions or feedback, please contact [[email protected]](mailto:[email protected]). |