metadata
license: mit
datasets:
  - TTA-DQA/hate_sentence
language:
  - ko
metrics:
  - accuracy
base_model:
  - beomi/KcELECTRA-base-v2022
tags:
  - Text-Classification
  - Hate-Detection
  - Hate-Sentence-Detection

Model Details (readme.md - English version)

1. Overview

This model was trained to detect whether a Korean sentence contains harmful (hateful) expressions.
It performs binary classification, deciding whether a sentence contains harmful expressions or is an ordinary sentence.
The AI task is text-classification. The dataset used is TTA-DQA/hate_sentence.

The classes are as follows.

  • "0": "no_hate"
  • "1": "hate"
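The class list above can be expressed as a small lookup table. The following is a minimal sketch, assuming the classifier returns one score per class in the index order shown above; the names `ID2LABEL`, `LABEL2ID`, and `predicted_label` are illustrative and not part of the published model.

```python
from typing import Sequence

# Mapping taken from the class list in this card.
ID2LABEL = {0: "no_hate", 1: "hate"}
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def predicted_label(scores: Sequence[float]) -> str:
    """Return the label with the highest score (argmax over the two classes)."""
    if len(scores) != len(ID2LABEL):
        raise ValueError(f"expected {len(ID2LABEL)} scores, got {len(scores)}")
    best = max(range(len(scores)), key=lambda i: scores[i])
    return ID2LABEL[best]
```

For example, `predicted_label([0.1, 0.9])` selects index 1 and therefore returns `"hate"`.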

2. Training Information

  • Base Model: KcElectra (a pre-trained Korean language model based on Electra)
  • Source: beomi/KcELECTRA-base-v2022 (https://huggingface.co/beomi/KcELECTRA-base-v2022)
  • Model Type: ELECTRA-based encoder model, fine-tuned for text classification
  • Pre-training (Korean): approximately 17GB (over 180 million sentences)
  • Fine-tuning (hate dataset): approximately 22.3MB (TTA-DQA/hate_sentence)
  • Learning Rate: 5e-6
  • Weight Decay: 0.01
  • Epochs: 20
  • Batch Size: 16
  • Data Loader Workers: 2
  • Tokenizer: BertWordPieceTokenizer
  • Model Size: Approximately 512MB
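The hyperparameters above could be passed to the Hugging Face `Trainer` roughly as follows. This is only a sketch: the card does not say which training loop was used, so the use of `TrainingArguments`, the reading of "Batch Size" as a per-device value, and the `output_dir` path are all assumptions.

```python
from typing import Any, Dict

# Values copied from the list above; keys are `transformers.TrainingArguments` parameter names.
TRAINING_KWARGS: Dict[str, Any] = {
    "learning_rate": 5e-6,
    "weight_decay": 0.01,
    "num_train_epochs": 20,
    "per_device_train_batch_size": 16,  # assumption: the listed batch size is per device
    "dataloader_num_workers": 2,
}


def build_training_arguments(output_dir: str = "./hate-detection-kcelectra"):
    """Create TrainingArguments from the listed values (output_dir is a hypothetical path)."""
    # Imported lazily so the dictionary above can be inspected without transformers installed.
    from transformers import TrainingArguments

    return TrainingArguments(output_dir=output_dir, **TRAINING_KWARGS)
```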

3. Requirements

  • pytorch ~= 1.8.0
  • transformers ~= 4.11.3
  • emoji ~= 0.6.0
  • soynlp ~= 0.0.493
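The `~=` operator is the "compatible release" specifier: `~= 1.8.0` accepts any `1.8.x` release at or above `1.8.0`. A rough way to check an environment against the list above is sketched below. It is a simplified version comparison (pre-release tags are ignored), the helper names are illustrative, and it assumes PyTorch is installed under its PyPI name `torch`.

```python
from importlib import metadata
from typing import Dict, Tuple

REQUIREMENTS = {
    "torch": "1.8.0",  # listed as "pytorch" above; the PyPI package is named "torch"
    "transformers": "4.11.3",
    "emoji": "0.6.0",
    "soynlp": "0.0.493",
}


def parse(version: str) -> Tuple[int, ...]:
    """Turn '1.8.1+cu111' into (1, 8, 1), dropping local suffixes and non-numeric parts."""
    core = version.split("+")[0]
    return tuple(int(part) for part in core.split(".") if part.isdigit())


def is_compatible(installed: str, required: str) -> bool:
    """Simplified '~=' check: same leading components and installed >= required."""
    need = parse(required)
    have = parse(installed)
    have = have + (0,) * (len(need) - len(have))  # pad '1.8' to (1, 8, 0)
    prefix = len(need) - 1
    return have[:prefix] == need[:prefix] and have >= need


def check_environment() -> Dict[str, bool]:
    """Report, per package, whether the installed version satisfies the requirement."""
    report = {}
    for name, required in REQUIREMENTS.items():
        try:
            report[name] = is_compatible(metadata.version(name), required)
        except metadata.PackageNotFoundError:
            report[name] = False  # package is not installed
    return report
```

Calling `check_environment()` returns a dictionary such as `{"torch": True, ...}` that shows which pins the current environment meets.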

4. Quick Start

  • python
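from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "TTA-DQA/HateDetection-KcElectra-FineTuning"
# Load the tokenizer and the fine-tuned classifier.
# AutoModelForSequenceClassification keeps the classification head; plain AutoModel would drop it.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)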
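To go from loading the model to an actual prediction, something like the following may work. It is a sketch rather than the authors' inference code: it assumes the checkpoint carries a two-class sequence-classification head whose output order matches the class list in this card, and that inputs can be truncated to 512 tokens. The function names are illustrative.

```python
import math
from typing import Dict, List

MODEL_ID = "TTA-DQA/HateDetection-KcElectra-FineTuning"
ID2LABEL = {0: "no_hate", 1: "hate"}  # assumed to match the model's output order


def softmax(logits: List[float]) -> List[float]:
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(text: str) -> Dict[str, float]:
    """Return one probability per label for a single sentence."""
    # Imported lazily so `softmax` is usable without torch/transformers installed.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
    model.eval()

    # max_length=512 is an assumed limit for an ELECTRA-base encoder.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return {ID2LABEL[i]: p for i, p in enumerate(softmax(logits))}
```

`classify` reloads the model on every call to keep the sketch short; for repeated use, load the tokenizer and model once and reuse them.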

5. Citation

  • This model was built as part of the Hyperscale AI Training Data Quality Verification project (2024 Hyperscale AI Training Data Quality Verification).

6. Limitations, Risks, and Bias

  • The model was not trained on class-biased data, but linguistic and interpretive factors mean that some labels may be open to disagreement.
  • What counts as a harmful expression is partly subjective and depends on language, culture, application domain, and personal views, so the results may be biased or contested.
  • The results therefore cannot serve as an absolute standard for harmful expressions in Korean.

Model Performance Results

  • Classification type : binary classification (text-classification)
  • f1-score : 0.9928
  • accuracy : 0.9928
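For readers who want to compute the same kind of metrics on their own labelled data, here is a minimal sketch. The card does not state how the F1 score was averaged, so this example assumes binary F1 with "hate" (label 1) as the positive class; the sample labels in the usage note are made up for illustration.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the reference labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Binary F1 = 2*TP / (2*TP + FP + FN) for the chosen positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

With `y_true = [1, 0, 1, 1, 0, 0]` and `y_pred = [1, 0, 0, 1, 0, 1]`, four of six predictions are correct, and there are two true positives, one false positive, and one false negative, so both functions return 2/3.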