# Use tokenizers from πŸ€— Tokenizers[[use-tokenizers-from-tokenizers]]

[`PreTrainedTokenizerFast`] depends on the [πŸ€— Tokenizers](https://huggingface.co/docs/tokenizers) library. Tokenizers obtained from the πŸ€— Tokenizers library can be loaded very simply into πŸ€— Transformers.

Before getting into the specifics, let's start by creating a dummy tokenizer in a few lines of code:

```python
>>> from tokenizers import Tokenizer
>>> from tokenizers.models import BPE
>>> from tokenizers.trainers import BpeTrainer
>>> from tokenizers.pre_tokenizers import Whitespace

>>> tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
>>> trainer = BpeTrainer(special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

>>> tokenizer.pre_tokenizer = Whitespace()
>>> files = [...]  # paths to the text files to train on
>>> tokenizer.train(files, trainer)
```

We now have a tokenizer trained on the files we defined. We can either keep using it in this runtime, or save it to a JSON file for later re-use.

## Loading directly from the tokenizer object[[loading-directly-from-the-tokenizer-object]]

Let's see how to leverage this tokenizer object in the πŸ€— Transformers library. The [`PreTrainedTokenizerFast`] class allows for easy instantiation by accepting the instantiated *tokenizer* object as an argument:

```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tokenizer)
```

The `fast_tokenizer` object can now be used with all the methods shared by the πŸ€— Transformers tokenizers! Head to [the tokenizer page](main_classes/tokenizer) for more information.

## Loading from a JSON file[[loading-from-a-JSON-file]]

To load a tokenizer from a JSON file, let's first save our tokenizer:

```python
>>> tokenizer.save("tokenizer.json")
```

The path to which we saved this file can be passed to the [`PreTrainedTokenizerFast`] initialization method using the `tokenizer_file` parameter:

```python
>>> from transformers import PreTrainedTokenizerFast

>>> fast_tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer.json")
```

This `fast_tokenizer` object can likewise be used with all the methods shared by the πŸ€— Transformers tokenizers! Head to [the tokenizer page](main_classes/tokenizer) for more information.
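
As a quick check that the wrapper behaves like any other πŸ€— Transformers tokenizer, here is a minimal usage sketch; the sample sentence is arbitrary, and the exact token ids depend entirely on the files the BPE model was trained on:

```python
>>> # Calling the tokenizer returns a BatchEncoding, like any πŸ€— Transformers tokenizer
>>> encoding = fast_tokenizer("Hello, how are you?")
>>> encoding.input_ids  # token ids from the trained BPE vocabulary
>>> fast_tokenizer.decode(encoding.input_ids)  # round-trips back to a string
```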
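
To re-use the wrapped tokenizer in a later session without repeating either loading step, it can also be saved in the πŸ€— Transformers format with `save_pretrained` and reloaded with `AutoTokenizer`. A minimal sketch, where the directory name `my-tokenizer` is only a placeholder:

```python
>>> from transformers import AutoTokenizer

>>> # Writes tokenizer.json alongside the πŸ€— Transformers tokenizer config files
>>> fast_tokenizer.save_pretrained("my-tokenizer")
>>> # Later, reload without touching πŸ€— Tokenizers directly
>>> reloaded_tokenizer = AutoTokenizer.from_pretrained("my-tokenizer")
```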