---
datasets:
- PrompTart/PTT_advanced_en_ko
language:
- en
- ko
base_model:
- beomi/Llama-3-KoEn-8B-Instruct-preview
- meta-llama/Meta-Llama-3-8B
library_name: transformers
---
# Llama-3-KoEn-8B-Instruct-preview Fine-Tuned on Parenthetical Terminology Translation (PTT) Dataset
## Model Overview
This is a **Llama-3-KoEn-8B-Instruct-preview** model fine-tuned on the [**Parenthetical Terminology Translation (PTT)**](https://arxiv.org/abs/2410.00683) dataset. [The PTT dataset](https://huggingface.co/datasets/PrompTart/PTT_advanced_en_ko) focuses on translating technical terms accurately by placing the original English term in parentheses alongside its Korean translation, enhancing clarity and precision in specialized fields. This fine-tuned model is optimized for handling technical terminology in the **Artificial Intelligence (AI)** domain.
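If you want to inspect the parenthetical format the model was trained on, you can browse the dataset directly. Below is a minimal sketch using the Hugging Face `datasets` library; it assumes the dataset exposes a `train` split, so check the printed split names if they differ:
```python
from datasets import load_dataset

# Download the PTT English-Korean dataset used for fine-tuning
ptt = load_dataset("PrompTart/PTT_advanced_en_ko")

# Print the available splits and columns, then one raw example
# (assumes a "train" split exists; adjust if the printout shows otherwise)
print(ptt)
print(ptt["train"][0])
```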
## Example Usage
Here’s how to use this fine-tuned model with the Hugging Face `transformers` library:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_name = "PrompTartLAB/Llama3ko_8B_inst_PTT_enko"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example sentence
text = "The model was fine-tuned using knowledge distillation techniques. The training dataset was created using a collaborative multi-agent framework powered by large language models."
prompt = f"Translate input sentence to Korean \n### Input: {text} \n### Translated:"

# Tokenize the prompt and generate a translation
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens (everything after the prompt)
out_message = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(out_message)
# "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다. 훈련 데이터셋은 대형 언어 모델(large language models)로 구동되는 협력적 다중 에이전트 프레임워크(collaborative multi-agent framework)를 사용하여 생성되었습니다."
```
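To translate several sentences, the call above can be wrapped in a small helper. This is a sketch rather than part of the released API: it reuses the `model` and `tokenizer` loaded above and mirrors the same prompt template, while the `translate` name and loop are our own illustration:
```python
def translate(sentences, max_new_tokens=1024):
    """Translate a list of English sentences to Korean, one prompt at a time."""
    translations = []
    for text in sentences:
        prompt = f"Translate input sentence to Korean \n### Input: {text} \n### Translated:"
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
        # Keep only the tokens generated after the prompt
        new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
        translations.append(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
    return translations

print(translate(["Gradient clipping stabilizes training."])[0])
```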
## Limitations
- **Out-of-Domain Accuracy**: While the model generalizes to some extent, accuracy may vary in domains that were not part of the training set.
- **Incomplete Parenthetical Annotation**: Not all technical terms are consistently placed in parentheses; in some cases, terms may be omitted or annotated inconsistently (a simple output check is sketched below).
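Since parenthetical annotation can occasionally be missing, a lightweight sanity check on outputs can help. The heuristic below is our own illustration, not an official tool: it simply flags translations that contain no parenthesized Latin-script span:
```python
import re

# Heuristic: a PTT-style translation should contain at least one parenthesized
# span starting with a Latin letter, e.g. "(knowledge distillation)".
PAREN_TERM = re.compile(r"\([A-Za-z][^()]*\)")

def has_parenthetical_terms(translation: str) -> bool:
    """Return True if the output keeps at least one English term in parentheses."""
    return bool(PAREN_TERM.search(translation))

print(has_parenthetical_terms("지식 증류 기법(knowledge distillation)을 사용했습니다."))  # True
print(has_parenthetical_terms("지식 증류 기법을 사용했습니다."))  # False
```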
## Citation
If you use this model in your research, please cite the original dataset and paper:
```tex
@misc{myung2024efficienttechnicaltermtranslation,
      title={Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation},
      author={Jiyoon Myung and Jihyeon Park and Jungki Son and Kyungro Lee and Joohyung Han},
      year={2024},
      eprint={2410.00683},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.00683},
}
```
## Contact
For questions or feedback, please contact [[email protected]](mailto:[email protected]).