jatinmehra committed on
Commit
f181555
·
1 Parent(s): d044061

Update README.md

  1. README.md +121 -1
README.md CHANGED
@@ -6,4 +6,124 @@ emoji: 👀
  colorFrom: green
  colorTo: red
  pinned: false
- ---
+ ---
+
+
+ # Plagiarism Detection App Using a Fine-Tuned Language Model (LLM)
+
+ This repository contains a Streamlit-based web application that uses a fine-tuned language model to detect plagiarism between two documents. The application takes two uploaded PDF files, extracts their text, and classifies the pair as either plagiarized or non-plagiarized.
+
+ ## Overview
+
+ The app uses a **custom fine-tuned version of SmolLM2** (135M parameters) that was trained on the [MIT Plagiarism Detection Dataset](https://www.kaggle.com/datasets/ruvelpereira/mit-plagairism-detection-dataset) to identify textual similarity. The model produces a binary classification output that indicates whether two given documents are plagiarized or original.
+
+ ## Features
+
+ - **Upload PDF Files**: Upload two PDF files that the app will analyze for similarity.
+ - **Text Extraction**: Extracts raw text from the uploaded PDFs using PyMuPDF.
+ - **Model-Based Detection**: Compares the content of the PDFs and classifies them as plagiarized or non-plagiarized using the fine-tuned language model.
+ - **User-Friendly Interface**: Built with Streamlit for an intuitive and interactive experience.
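
The text-extraction feature above could look roughly like the following. This is a minimal sketch, not the app's actual code: the function names and the whitespace normalization step are assumptions made for illustration. PyMuPDF is imported lazily inside the function, so the pure-Python helper can be used without it.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace (assumed cleanup step, not from the README)."""
    return re.sub(r"\s+", " ", text).strip()


def extract_text_from_pdf(pdf_bytes: bytes) -> str:
    """Return the text of every page of a PDF supplied as raw bytes.

    Hypothetical helper: the real app may extract text differently.
    """
    import fitz  # PyMuPDF, imported lazily so the module loads without it

    # Open the PDF from memory, e.g. from the bytes of an uploaded file.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        pages = [page.get_text() for page in doc]
    return normalize_whitespace("\n".join(pages))
```

In a Streamlit app, the bytes would typically come from calling `.read()` on the object returned by `st.file_uploader`.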
+
+ ## Model Information
+
+ - **Base Model**: `HuggingFaceTB/SmolLM2-135M-Instruct`
+ - **Fine-tuned Model Name**: `jatinmehra/smolLM-fine-tuned-for-plagiarism-detection`
+ - **Language**: English
+ - **Task**: Text Classification (Binary)
+ - **Performance Metrics**: Accuracy, F1 Score, Recall
+ - **License**: MIT
+
+ ## Dataset
+
+ The fine-tuning dataset, the MIT Plagiarism Detection Dataset, provides labeled sentence pairs in which each pair is marked as plagiarized or non-plagiarized. These labels serve as binary classification targets, which makes the dataset well suited to detecting sentence-level similarity.
+
+ ## Training and Model Details
+
+ - **Architecture**: The model was modified for sequence classification with two labels.
+ - **Optimizer**: AdamW with a learning rate of 2e-5.
+ - **Loss Function**: Cross-Entropy Loss.
+ - **Batch Size**: 16
+ - **Epochs**: 3
+ - **Padding**: Custom padding token to align with SmolLM requirements.
+
+ The model reached **99.66% accuracy** on the training set; results on the test set are listed under Evaluation.
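
The hyperparameters listed above can be gathered into a small configuration sketch. The class and function names here are hypothetical, the `"[PAD]"` token string is an assumption (the README only says a custom padding token was used), and the step count assumes no gradient accumulation.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values taken from the README; the class itself is illustrative."""
    base_model: str = "HuggingFaceTB/SmolLM2-135M-Instruct"
    num_labels: int = 2
    learning_rate: float = 2e-5
    batch_size: int = 16
    epochs: int = 3


def total_optimizer_steps(num_examples: int, cfg: TrainConfig) -> int:
    """Optimizer steps for a full run, assuming one step per batch."""
    return math.ceil(num_examples / cfg.batch_size) * cfg.epochs


def build_model_and_optimizer(cfg: TrainConfig):
    """Create tokenizer, two-label classifier, and AdamW optimizer (sketch)."""
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(cfg.base_model)
    model = AutoModelForSequenceClassification.from_pretrained(
        cfg.base_model, num_labels=cfg.num_labels
    )
    # Assumption: add a pad token only if the tokenizer does not define one.
    if tokenizer.pad_token is None:
        tokenizer.add_special_tokens({"pad_token": "[PAD]"})
        model.resize_token_embeddings(len(tokenizer))
    # The classification head needs the pad id to handle padded batches.
    model.config.pad_token_id = tokenizer.pad_token_id
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg.learning_rate)
    return tokenizer, model, optimizer
```

Cross-entropy loss does not need to be set up separately in this sketch: the sequence-classification model computes it when integer `labels` are passed to its forward call with two output labels.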
+
+ ## Application Workflow
+
+ 1. **Load and Initialize**: The application loads the fine-tuned model and tokenizer from local files.
+ 2. **PDF Upload**: Users upload the two PDF documents they want to compare.
+ 3. **Text Extraction**: Text is extracted from each PDF using the PyMuPDF library.
+ 4. **Preprocessing**: The extracted text is tokenized and preprocessed for model compatibility.
+ 5. **Classification**: The model processes the inputs and returns a prediction of `1` (plagiarized) or `0` (non-plagiarized).
+ 6. **Output**: The result is displayed on the Streamlit interface.
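
Steps 4 and 5 of the workflow above might be sketched as follows. How the app actually combines the two documents before tokenization is not stated, so encoding them as a text pair and truncating to 512 tokens are both assumptions; the function names are hypothetical.

```python
LABELS = {0: "non-plagiarized", 1: "plagiarized"}  # label ids from the workflow


def label_from_logits(logits):
    """Map a list of two logits to its label by taking the argmax."""
    predicted = max(range(len(logits)), key=lambda i: logits[i])
    return LABELS[predicted]


def classify_pair(text_a, text_b, model, tokenizer, max_length=512):
    """Tokenize two texts as a pair and classify them (illustrative sketch)."""
    import torch  # imported lazily so label_from_logits works without it

    # Assumption: the two documents are encoded together as one text pair.
    inputs = tokenizer(
        text_a,
        text_b,
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    )
    with torch.no_grad():  # inference only, no gradients needed
        logits = model(**inputs).logits
    return label_from_logits(logits[0].tolist())
```

Because of the truncation, long PDFs would only be compared on their opening tokens in this sketch; chunking the text is one alternative.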
+
+ ## How to Run the Application
+
+ ### Prerequisites
+
+ - **Streamlit** for running the web application interface.
+ - **Transformers** from Hugging Face for loading the model and tokenizer.
+ - **PyMuPDF** (`fitz`) for PDF text extraction.
+ - **Torch** for model inference on CPU or GPU.
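
One way to confirm that the prerequisites above are importable before launching the app is a small check like this. It is an optional sketch, not part of the repository; the helper name is hypothetical, and the module names are the usual import names of the listed packages.

```python
from importlib.util import find_spec

# Import names for Streamlit, Transformers, PyMuPDF, and Torch.
REQUIRED_MODULES = ["streamlit", "transformers", "fitz", "torch"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```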
+
+ ### Installation
+
+ 1. Clone the repository:
+
+    ```bash
+    git clone https://github.com/YourUsername/Plagiarism-Detection-App.git
+    cd Plagiarism-Detection-App
+    ```
+
+ 2. Install the required dependencies:
+
+    ```bash
+    pip install -r requirements.txt
+    ```
+
+ 3. Download the fine-tuned model files and place them in the `model/` directory.
+
+ ### Running the App
+
+ Run the Streamlit app from the terminal:
+
+ ```bash
+ streamlit run app.py
+ ```
+
+ ### Usage
+
+ 1. Open the application in your browser (by default at `http://localhost:8501`).
+ 2. Upload the two PDF files you wish to compare for plagiarism.
+ 3. View the text from each document and the resulting plagiarism detection output.
+
+ ## Evaluation
+
+ The model was evaluated on both the training and test sets:
+
+ - **Training Set Accuracy**: **99.66%**
+ - **Test Set Accuracy**: **100%**
+ - **F1 Score**: **1.0**
+ - **Recall**: **1.0**
+
+ These metrics indicate that the model is highly effective at detecting plagiarism on this dataset.
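
For reference, the metrics named above can be computed from true and predicted labels as in this pure-Python sketch. The function name is hypothetical and the README does not say which tooling produced the reported numbers; label `1` is treated as the positive (plagiarized) class.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for 0/1 labels (1 = plagiarized)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

A test accuracy of 100% together with an F1 score and recall of 1.0 corresponds to the case where every prediction matches its label.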
+
+ ## Model and Tokenizer
+
+ The model and tokenizer are saved locally, but they can also be loaded directly from Hugging Face. Either way, they can be reused in custom applications or fine-tuned further.
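
A sketch of the two loading paths described above: prefer a local `model/` directory and fall back to the Hub repository named under Model Information. The function names are hypothetical, and checking for `config.json` is an assumption about how the local files were saved.

```python
from pathlib import Path

HUB_ID = "jatinmehra/smolLM-fine-tuned-for-plagiarism-detection"


def resolve_model_source(local_dir="model", hub_id=HUB_ID):
    """Return the local directory if it looks like a saved model, else the Hub id."""
    path = Path(local_dir)
    # Assumption: a saved model directory contains a config.json file.
    if path.is_dir() and (path / "config.json").is_file():
        return str(path)
    return hub_id


def load_model_and_tokenizer(source):
    """Load the classifier and tokenizer from a local path or a Hub id."""
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(source)
    model = AutoModelForSequenceClassification.from_pretrained(source)
    model.eval()  # disable dropout for inference
    return model, tokenizer
```

Calling `load_model_and_tokenizer(resolve_model_source())` would then use the local copy when present and download from Hugging Face otherwise.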
+
+ ## License
+
+ This project is licensed under the MIT License, making it free for both personal and commercial use.
+
+ ## Connect with Me
+
+ I appreciate your interest!
+ [GitHub](https://github.com/Jatin-Mehra119) | [email protected] | [LinkedIn](https://www.linkedin.com/in/jatin-mehra119/) | [Portfolio](https://jatin-mehra119.github.io/Profile/)