	# [Text] SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-Generation

	## Getting Started
	1. Clone the repository:
	```bash
	git clone https://github.com/Tokyo-Techies/prj-nict-ai-content-detection
	```

	2. Set up the environment using a virtual environment:
	```bash
	python -m venv .venv
	source .venv/bin/activate
	```

	3. Install dependencies:
	```bash
	pip install -r requirements.txt
	```


	4. API keys (optional):
	Obtain API keys for the models you plan to use and insert them into the `SimLLM.py` file:
	- ChatGPT: [OpenAI API](https://openai.com/index/openai-api/)
	- Gemini: [Google Gemini API](https://ai.google.dev/gemini-api/docs/api-key)
	- Other LLMs: [Together API](https://api.together.ai/)
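
Hard-coding keys in `SimLLM.py` works for a local run, but it makes it easy to commit a secret by accident. One alternative is to read the keys from environment variables. The following is a minimal sketch, not the repository's code: the variable names `OPENAI_API_KEY`, `GEMINI_API_KEY`, and `TOGETHER_API_KEY` and the helper `load_api_keys` are assumptions chosen for illustration.

```python
import os

# Hypothetical mapping from provider to environment variable name.
# These names are assumptions; SimLLM.py may expect the keys elsewhere.
KEY_ENV_VARS = {
    "ChatGPT": "OPENAI_API_KEY",
    "Gemini": "GEMINI_API_KEY",
    "Together": "TOGETHER_API_KEY",
}


def load_api_keys(env=os.environ):
    """Return the keys that are set; providers without a key are omitted."""
    keys = {}
    for provider, var in KEY_ENV_VARS.items():
        value = env.get(var, "").strip()
        if value:
            keys[provider] = value
    return keys


keys = load_api_keys()
print("Providers with a key configured:", sorted(keys))
```

Because the keys are optional, the helper skips providers whose variable is unset or blank instead of raising an error.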


	5. Run the project:
	```bash
	python SimLLM.py
	```

	### Parameters

	- `LLMs`: List of large language models to use. Available models include 'ChatGPT', 'Yi', 'OpenChat', 'Gemini', 'LLaMa', 'Phi', 'Mixtral', 'QWen', 'OLMO', 'WizardLM', and 'Vicuna'. Default is `['ChatGPT', 'Yi', 'OpenChat']`.
	- `train_indexes`: List of indexes (into `LLMs`) of the models used for training. Default is `[0, 1, 2]`.
	- `test_indexes`: List of indexes (into `LLMs`) of the models used for testing. Default is `[0]`.
	- `num_samples`: Number of samples. Default is 5000.
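
The parameters above map naturally onto command-line flags. The following is a minimal sketch, assuming `SimLLM.py` parses its arguments with `argparse`; the parser and the helper `resolve_models` are illustrative and not taken from the repository. It also assumes that `train_indexes` and `test_indexes` are positions in the `LLMs` list, which the defaults suggest but this README does not state outright.

```python
import argparse

# Model names listed in this README.
AVAILABLE_LLMS = ["ChatGPT", "Yi", "OpenChat", "Gemini", "LLaMa", "Phi",
                  "Mixtral", "QWen", "OLMO", "WizardLM", "Vicuna"]


def build_parser():
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="SimLLM options (sketch)")
    parser.add_argument("--LLMs", nargs="+", choices=AVAILABLE_LLMS,
                        default=["ChatGPT", "Yi", "OpenChat"])
    parser.add_argument("--train_indexes", nargs="+", type=int,
                        default=[0, 1, 2])
    parser.add_argument("--test_indexes", nargs="+", type=int, default=[0])
    parser.add_argument("--num_samples", type=int, default=5000)
    return parser


def resolve_models(args):
    """Map index lists to model names (assumes indexes refer to args.LLMs)."""
    for i in args.train_indexes + args.test_indexes:
        if not 0 <= i < len(args.LLMs):
            raise ValueError(f"index {i} is out of range for {args.LLMs}")
    train = [args.LLMs[i] for i in args.train_indexes]
    test = [args.LLMs[i] for i in args.test_indexes]
    return train, test


# Parse an explicit argument list so the sketch runs without a real CLI call.
args = build_parser().parse_args(
    ["--LLMs", "ChatGPT", "--train_indexes", "0", "--test_indexes", "0"])
print(resolve_models(args), args.num_samples)
```

Under this reading, passing a single model with `--LLMs` while leaving `--train_indexes` at its default of `[0, 1, 2]` would point at models that are not in the list, which is why the sketch validates the indexes before using them.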

	### Examples

	- Running with default parameters:
	`python SimLLM.py`

	- Running with customized parameters:
	`python SimLLM.py --LLMs ChatGPT --train_indexes 0 --test_indexes 0`

	## Dataset

	The `dataset.csv` file contains human-written texts and texts generated by 12 large language models:
	ChatGPT, GPT-4o, Yi, OpenChat, Gemini, LLaMa, Phi, Mixtral, QWen, OLMO, WizardLM, and Vicuna.
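
The column layout of `dataset.csv` is not described here, so a reasonable first step is to inspect the file before using it. The sketch below reads the header and counts the data rows with the standard library only; the helper name `summarize_csv` is hypothetical, and no column names are assumed.

```python
import csv
import os


def summarize_csv(path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])          # first row holds the column names
        row_count = sum(1 for _ in reader)  # remaining rows are data
    return header, row_count


# Only runs when the dataset is present in the working directory.
if os.path.exists("dataset.csv"):
    columns, n_rows = summarize_csv("dataset.csv")
    print("columns:", columns)
    print("rows:", n_rows)
```

Opening the file with `newline=""` lets the `csv` module handle quoted fields that contain line breaks, which is common for text datasets.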

	## Citation

	```bibtex
	@inproceedings{nguyen2024SimLLM,
	title={SimLLM: Detecting Sentences Generated by Large Language Models Using Similarity between the Generation and its Re-generation},
	author={Nguyen-Son, Hoang-Quoc and Dao, Minh-Son and Zettsu, Koji},
	booktitle={The Conference on Empirical Methods in Natural Language Processing},
	year={2024}
	}
	```

	## Acknowledgements

	- BARTScore: [BARTScore GitHub Repository](https://github.com/neulab/BARTScore)