Spaces: jatinmehra/Plagiarism-detector-using-smolLM


jatinmehra committed on Nov 8, 2024

Commit f181555

1 Parent(s): d044061

Update README.md


README.md +121 -1


@@ -6,4 +6,124 @@ emoji: 👀
 colorFrom: green
 colorTo: red
 pinned: false
----

+---
+# Plagiarism Detection App Using a Fine-Tuned Language Model (LLM)
+This repository contains a Streamlit-based web application that uses a fine-tuned LLM to detect plagiarism between two documents. The application takes two uploaded PDF files, extracts their text, and classifies the pair as plagiarized or non-plagiarized.
+## Overview
+The app uses a **custom fine-tuned version of SmolLM** (135M parameters) trained on the [MIT Plagiarism Detection Dataset](https://www.kaggle.com/datasets/ruvelpereira/mit-plagairism-detection-dataset) to identify textual similarity. The model produces a binary classification output indicating whether a pair of documents is plagiarized or original.
+## Features
+-   **Upload PDF Files**: Upload two PDF files that the app will analyze for similarity.
+-   **Text Extraction**: Extracts raw text from the uploaded PDFs using PyMuPDF.
+-   **Model-Based Detection**: Compares the content of the PDFs and classifies them as plagiarized or non-plagiarized using the fine-tuned language model.
+-   **User-Friendly Interface**: Built with Streamlit for an intuitive and interactive experience.
+## Model Information
+-   **Base Model**: `HuggingFaceTB/SmolLM2-135M-Instruct`
+-   **Fine-tuned Model Name**: `jatinmehra/smolLM-fine-tuned-for-plagiarism-detection`
+-   **Language**: English
+-   **Task**: Text Classification (Binary)
+-   **Performance Metrics**: Accuracy, F1 Score, Recall
+-   **License**: MIT
+## Dataset
+The fine-tuning dataset, the MIT Plagiarism Detection Dataset, provides labeled sentence pairs, each marked as plagiarized or non-plagiarized. These binary labels serve as the classification targets, which makes the dataset well suited to detecting sentence-level similarity.
+## Training and Model Details
+-   **Architecture**: The model was modified for sequence classification with two labels.
+-   **Optimizer**: AdamW with a learning rate of 2e-5.
+-   **Loss Function**: Cross-Entropy Loss.
+-   **Batch Size**: 16
+-   **Epochs**: 3
+-   **Padding**: Custom padding token to align with SmolLM requirements.
+The model achieved **99.66% accuracy** on the training dataset, highlighting its effectiveness in identifying plagiarized content.
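The settings listed above could be wired together roughly as follows. This is only a sketch, not the author's training script: the names `HYPERPARAMS` and `build_classifier` are hypothetical, and falling back to the EOS token as the padding token is an assumption, since the README only says that a custom padding token was used.

```python
# Hyperparameters as listed in this README.
HYPERPARAMS = {
    "base_model": "HuggingFaceTB/SmolLM2-135M-Instruct",
    "num_labels": 2,
    "learning_rate": 2e-5,
    "batch_size": 16,
    "epochs": 3,
}


def build_classifier(params=HYPERPARAMS):
    """Return (tokenizer, model, optimizer) for binary sequence classification."""
    # Imported lazily so the constants above can be inspected without the heavy deps.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(params["base_model"])
    # Assumption: reuse the EOS token when the tokenizer defines no pad token.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    model = AutoModelForSequenceClassification.from_pretrained(
        params["base_model"], num_labels=params["num_labels"]
    )
    # The classification head needs to know which token id marks padding.
    model.config.pad_token_id = tokenizer.pad_token_id

    optimizer = torch.optim.AdamW(model.parameters(), lr=params["learning_rate"])
    return tokenizer, model, optimizer
```

When integer labels are passed to a two-label sequence classification model in Transformers, the forward pass computes a cross-entropy loss, which matches the loss function listed above.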
+## Application Workflow
+1.  **Load and Initialize**: The application loads the fine-tuned model and tokenizer from local files.
+2.  **PDF Upload**: Users upload two PDF documents they want to compare.
+3.  **Text Extraction**: Text is extracted from each PDF using the PyMuPDF library.
+4.  **Preprocessing**: The extracted text is tokenized and preprocessed for model compatibility.
+5.  **Classification**: The model processes the inputs and returns a prediction of `1` (plagiarized) or `0` (non-plagiarized).
+6.  **Output**: The result is displayed on the Streamlit interface.
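Steps 3 to 5 above might look something like the sketch below. It is illustrative rather than the app's actual code: the function names, the choice to pass the two texts to the tokenizer as a pair, and the `max_length=512` truncation limit are all assumptions. The `tokenizer` and `model` arguments are expected to be an already loaded Transformers tokenizer and sequence classification model.

```python
# Class ids as described in the workflow: 1 = plagiarized, 0 = non-plagiarized.
LABELS = {0: "non-plagiarized", 1: "plagiarized"}


def extract_text(pdf_path):
    """Concatenate the plain text of every page in a PDF using PyMuPDF."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return "\n".join(page.get_text() for page in doc)


def predicted_label(logits):
    """Return the index of the largest class score (0 or 1 for two classes)."""
    return max(range(len(logits)), key=lambda i: logits[i])


def classify_pair(text_a, text_b, tokenizer, model, max_length=512):
    """Tokenize two texts as one pair and return the predicted class id."""
    import torch

    # Assumption: both documents are encoded together and truncated to max_length.
    inputs = tokenizer(
        text_a, text_b, truncation=True, max_length=max_length, return_tensors="pt"
    )
    with torch.no_grad():
        logits = model(**inputs).logits[0].tolist()
    return predicted_label(logits)
```

A caller could then display `LABELS[classify_pair(text_a, text_b, tokenizer, model)]` in the Streamlit interface.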
+## How to Run the Application
+### Prerequisites
+-   **Streamlit** for running the web application interface.
+-   **Transformers** from Hugging Face for loading the model and tokenizer.
+-   **PyMuPDF** (`fitz`) for PDF text extraction.
+-   **PyTorch** (`torch`) for model inference on CPU or GPU.
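One quick way to confirm that these prerequisites are available is to check whether each package can be found by Python. The helper below is a hypothetical convenience script, not part of the repository; it uses the import names of the packages (PyMuPDF is imported as `fitz`).

```python
import importlib.util

# Import names of the prerequisites listed above.
REQUIRED_MODULES = ["streamlit", "transformers", "fitz", "torch"]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```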
+### Installation
+1.  Clone the repository:
+    `git clone https://github.com/YourUsername/Plagiarism-Detection-App.git`
+    `cd Plagiarism-Detection-App`
+2.  Install the required dependencies:
+    `pip install -r requirements.txt`
+3.  Download the fine-tuned model files and place them in the `model/` directory.
+### Running the App
+Run the Streamlit app from the terminal:
+`streamlit run app.py`
+### Usage
+1.  Open the application in your browser (default at `http://localhost:8501`).
+2.  Upload two PDF files you wish to compare for plagiarism.
+3.  View the text from each document and the resulting plagiarism detection output.
+## Evaluation
+The model was evaluated on both training and test data, showing robust results:
+-   **Training Set Accuracy**: **99.66%**
+-   **Test Set Accuracy**: **100%**
+-   **F1 Score**: **1.0**
+-   **Recall**: **1.0**
+On this dataset, these metrics indicate that the model separates plagiarized from non-plagiarized pairs almost perfectly.
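For reference, accuracy, recall, and F1 for a binary classifier can be computed from true and predicted labels as in the sketch below. The function name `binary_metrics` is hypothetical and the labels in the usage note are toy values, not results from this model.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1, treating label 1 as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

For example, `binary_metrics([1, 0, 1, 1], [1, 0, 0, 1])` gives an accuracy of 0.75, a recall of 2/3, and an F1 score of 0.8, while a perfect set of predictions gives 1.0 for every metric.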
+## Model and Tokenizer
+The model and tokenizer are saved locally, but they can also be loaded directly from the Hugging Face Hub as `jatinmehra/smolLM-fine-tuned-for-plagiarism-detection`, which makes them easy to reuse in custom applications or for further fine-tuning.
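A minimal sketch of both loading paths is shown below. It assumes the local files live in the `model/` directory mentioned in the installation steps and that a saved model is recognizable by its `config.json`; the helper names `resolve_model_source` and `load_model_and_tokenizer` are hypothetical.

```python
from pathlib import Path

# Fine-tuned model name as given in the Model Information section.
HUB_MODEL_ID = "jatinmehra/smolLM-fine-tuned-for-plagiarism-detection"


def resolve_model_source(local_dir="model"):
    """Prefer a local directory containing config.json; otherwise use the Hub id."""
    local = Path(local_dir)
    if (local / "config.json").is_file():
        return str(local)
    return HUB_MODEL_ID


def load_model_and_tokenizer(local_dir="model"):
    """Load the tokenizer and classification model from disk or from the Hub."""
    # Imported lazily so resolve_model_source works without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    source = resolve_model_source(local_dir)
    tokenizer = AutoTokenizer.from_pretrained(source)
    model = AutoModelForSequenceClassification.from_pretrained(source)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model
```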
+## License
+This project is licensed under the MIT License, making it free for both personal and commercial use.
+## Connect with Me
+I appreciate your interest!
+[GitHub](https://github.com/Jatin-Mehra119) | [LinkedIn](https://www.linkedin.com/in/jatin-mehra119/) | [Portfolio](https://jatin-mehra119.github.io/Profile/)