WebashalarForML committed on
Commit
cf9329c
·
verified ·
1 Parent(s): adac9c9

Update README2.md

Files changed (1)
  1. README2.md +149 -149
README2.md CHANGED
@@ -1,150 +1,150 @@
1
- \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
2
- \\----------- **Resume Parser** ----------\\
3
- \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
4
-
5
- # Overview:
6
- This project is a comprehensive Resume Parsing tool built using Python,
7
- integrating the Mistral-Nemo-Instruct-2407 model for primary parsing.
8
- If Mistral fails or encounters issues,
9
- the system falls back to a custom-trained spaCy model to ensure continued functionality.
10
- The tool is wrapped with a Flask API and has a user interface built using HTML and CSS.
11
-
12
-
13
- # Installation Guide:
14
-
15
- 1. Create and Activate a Virtual Environment
16
- python -m venv venv
17
- source venv/bin/activate # For Linux/Mac
18
- # or
19
- venv\Scripts\activate # For Windows
20
-
21
- # NOTE: If the virtual environment (venv) is already created, you can skip the creation step and just activate.
22
- - For Linux/Mac:
23
- source venv/bin/activate
24
- - For Windows:
25
- venv\Scripts\activate
26
-
27
- 2. Install Required Libraries
28
- pip install -r requirements.txt
29
-
30
- # Ensure the following dependencies are included:
31
- - Flask
32
- - spaCy
33
- - huggingface_hub
34
- - PyMuPDF
35
- - python-docx
36
- - Tesseract-OCR (for image-based parsing)
37
-
38
- ; NOTE : If any model or library is not installed, you can install it using:
39
- pip install <model_name>
40
- _Replace <model_name> with the specific model or library you need to install_
41
-
42
- 3. Set up Hugging Face Token
43
- - Add your Hugging Face token to the .env file as:
44
- HF_TOKEN=<your_huggingface_token>
45
-
46
-
47
- # File Structure Overview:
48
- Mistral_With_Spacy/
49
- │
50
- ├── Spacy_Models/
51
- │   └── ner_model_05_3 # Pretrained spaCy model directory for resume parsing
52
- │
53
- ├── templates/
54
- │   ├── index.html # UI for file upload
55
- │   └── result.html # Display parsed results in structured JSON
56
- │
57
- ├── uploads/ # Directory for uploaded resume files
58
- │
59
- ├── utils/
60
- │   ├── mistral.py # Code for calling Mistral API and handling responses
61
- │   ├── spacy.py # spaCy fallback model for parsing resumes
62
- │   ├── error.py # Error handling utilities
63
- │   └── fileTotext.py # Functions to extract text from different file formats (PDF, DOCX, etc.)
64
- │
65
- ├── venv/ # Virtual environment
66
- │
67
- ├── .env # Environment variables file (contains Hugging Face token)
68
- │
69
- ├── main.py # Flask app handling API routes for uploading and processing resumes
70
- │
71
- └── requirements.txt # Dependencies required for the project
72
-
73
-
74
- # Program Overview:
75
-
76
- # Mistral Integration (utils/mistral.py)
77
- - Mistral API Calls: Uses Hugging Face's Mistral-Nemo-Instruct-2407 model to parse resumes.
78
- - Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
79
- - Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.
80
-
81
- # SpaCy Integration (utils/spacy.py)
82
- - Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
83
- - Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
84
- - Validation: Includes validation for extracted emails and contacts.
85
-
86
- # File Conversion (utils/fileTotext.py)
87
- - Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images like PNG, JPG, JPEG) and extracts text for further processing.
88
- - PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
89
- - DOCX Files: Uses `python-docx` to extract structured text from Word documents.
90
- - ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
91
- - RTF Files: Reads plain text from RTF files.
92
- - Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
93
- Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
94
- - Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.
95
-
96
-
97
- # Error Handling (utils/error.py)
98
- - Manages API response errors, file format issues, and ensures smooth fallbacks without crashing the app.
99
-
100
- # Flask API (main.py)
101
- Endpoints:
102
- - /upload for uploading resumes.
103
- - Displays parsed results in JSON format on the results page.
104
- - UI: Simple interface for uploading resumes and viewing the parsing results.
105
-
106
-
107
- # Tree map of your program:
108
-
109
- main.py
110
- ├── Handles API side
111
- ├── File upload/remove
112
- ├── Process resumes
113
- └── Show result
114
- utils
115
- ├── fileTotext.py
116
- │   └── Converts files to text
117
- │       ├── PDF
118
- │       ├── DOCX
119
- │       ├── RTF
120
- │       ├── ODT
121
- │       ├── PNG
122
- │       ├── JPG
123
- │       └── JPEG
124
- ├── mistral.py
125
- │   ├── Mistral API Calls
126
- │   │   └── Uses Mistral-Nemo-Instruct-2407 model
127
- │   ├── Personal and Professional Extraction
128
- │   │   ├── Extracts personal information
129
- │   │   └── Extracts professional information
130
- │   └── Fallback Mechanism
131
- │       └── Uses spaCy NER model if Mistral fails
132
- └── spacy.py
133
-     ├── Custom Trained Model
134
-     │   └── Uses spaCy model (ner_model_05_3)
135
-     ├── Named Entity Recognition
136
-     │   └── Extracts key information (Name, Email, Contact, etc.)
137
-     └── Validation
138
-         └── Validates emails and contacts
139
-
140
-
141
- # References:
142
-
143
- - [Flask Documentation](https://flask.palletsprojects.com/)
144
- - [spaCy Documentation](https://spacy.io/usage)
145
- - [Mistral Documentation](https://docs.mistral.ai/)
146
- - [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
147
- - [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
148
- - [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
149
- - [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
150
  - [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)
 
1
+ \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
2
+ \\----------- **Resume Parser** ----------\\
3
+ \\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
4
+
5
+ # Overview:
6
+ This project is a comprehensive Resume Parsing tool built using Python,
7
+ integrating the Mistral-Nemo-Instruct-2407 model for primary parsing.
8
+ If Mistral fails or encounters issues,
9
+ the system falls back to a custom-trained spaCy model to ensure continued functionality.
10
+ The tool is wrapped with a Flask API and has a user interface built using HTML and CSS.
11
+
12
+
13
+ # Installation Guide:
14
+
15
+ 1. Create and Activate a Virtual Environment
16
+ python -m venv venv
17
+ source venv/bin/activate # For Linux/Mac
18
+ # or
19
+ venv\Scripts\activate # For Windows
20
+
21
+ # NOTE: If the virtual environment (venv) is already created, you can skip the creation step and just activate.
22
+ - For Linux/Mac:
23
+ source venv/bin/activate
24
+ - For Windows:
25
+ venv\Scripts\activate
26
+
27
+ 2. Install Required Libraries
28
+ pip install -r requirements.txt
29
+
30
+ # Ensure the following dependencies are included:
31
+ - Flask
32
+ - spaCy
33
+ - huggingface_hub
34
+ - PyMuPDF
35
+ - python-docx
36
+ - Tesseract-OCR (for image-based parsing)
37
+
38
+ ; NOTE : If any model or library is not installed, you can install it using:
39
+ pip install <model_name>
40
+ _Replace <model_name> with the specific model or library you need to install_
41
+
42
+ 3. Set up Hugging Face Token
43
+ - Add your Hugging Face token to the .env file as:
44
+ HF_TOKEN=<your_huggingface_token>
45
+
46
+
47
+ # File Structure Overview:
48
+ Mistral_With_Spacy/
49
+ │
50
+ ├── Spacy_Models/
51
+ │   └── ner_model_05_3 # Pretrained spaCy model directory for resume parsing
52
+ │
53
+ ├── templates/
54
+ │   ├── index.html # UI for file upload
55
+ │   └── result.html # Display parsed results in structured JSON
56
+ │
57
+ ├── uploads/ # Directory for uploaded resume files
58
+ │
59
+ ├── utils/
60
+ │   ├── mistral.py # Code for calling Mistral API and handling responses
61
+ │   ├── spacy.py # spaCy fallback model for parsing resumes
62
+ │   ├── error.py # Error handling utilities
63
+ │   └── fileTotext.py # Functions to extract text from different file formats (PDF, DOCX, etc.)
64
+ │
65
+ ├── venv/ # Virtual environment
66
+ │
67
+ ├── .env # Environment variables file (contains Hugging Face token)
68
+ │
69
+ ├── main.py # Flask app handling API routes for uploading and processing resumes
70
+ │
71
+ └── requirements.txt # Dependencies required for the project
72
+
73
+
74
+ # Program Overview:
75
+
76
+ # Mistral Integration (utils/mistral.py)
77
+ - Mistral API Calls: Uses Hugging Face's Mistral-Nemo-Instruct-2407 model to parse resumes.
78
+ - Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
79
+ - Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.
80
+
81
+ # SpaCy Integration (utils/spacy.py)
82
+ - Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
83
+ - Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
84
+ - Validation: Includes validation for extracted emails and contacts.
85
+
86
+ # File Conversion (utils/fileTotext.py)
87
+ - Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images like PNG, JPG, JPEG) and extracts text for further processing.
88
+ - PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
89
+ - DOCX Files: Uses `python-docx` to extract structured text from Word documents.
90
+ - ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
91
+ - RTF Files: Reads plain text from RTF files.
92
+ - Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
93
+ Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
94
+ - Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.
95
+
96
+
97
+ # Error Handling (utils/error.py)
98
+ - Manages API response errors, file format issues, and ensures smooth fallbacks without crashing the app.
99
+
100
+ # Flask API (main.py)
101
+ Endpoints:
102
+ - /upload for uploading resumes.
103
+ - Displays parsed results in JSON format on the results page.
104
+ - UI: Simple interface for uploading resumes and viewing the parsing results.
105
+
106
+
107
+ # Tree map of program:
108
+
109
+ main.py
110
+ ├── Handles API side
111
+ ├── File upload/remove
112
+ ├── Process resumes
113
+ └── Show result
114
+ utils
115
+ ├── fileTotext.py
116
+ │   └── Converts files to text
117
+ │       ├── PDF
118
+ │       ├── DOCX
119
+ │       ├── RTF
120
+ │       ├── ODT
121
+ │       ├── PNG
122
+ │       ├── JPG
123
+ │       └── JPEG
124
+ ├── mistral.py
125
+ │   ├── Mistral API Calls
126
+ │   │   └── Uses Mistral-Nemo-Instruct-2407 model
127
+ │   ├── Personal and Professional Extraction
128
+ │   │   ├── Extracts personal information
129
+ │   │   └── Extracts professional information
130
+ │   └── Fallback Mechanism
131
+ │       └── Uses spaCy NER model if Mistral fails
132
+ └── spacy.py
133
+     ├── Custom Trained Model
134
+     │   └── Uses spaCy model (ner_model_05_3)
135
+     ├── Named Entity Recognition
136
+     │   └── Extracts key information (Name, Email, Contact, etc.)
137
+     └── Validation
138
+         └── Validates emails and contacts
139
+
140
+
141
+ # References:
142
+
143
+ - [Flask Documentation](https://flask.palletsprojects.com/)
144
+ - [spaCy Documentation](https://spacy.io/usage)
145
+ - [Mistral Documentation](https://docs.mistral.ai/)
146
+ - [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
147
+ - [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
148
+ - [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
149
+ - [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
150
  - [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)
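
Note on the fallback flow described in the README above (Mistral first, spaCy if Mistral fails): the pattern could be sketched roughly as below. This is only an illustration under stated assumptions, not the code in `utils/mistral.py` or `utils/spacy.py`. The names `parse_with_fallback`, `failing_llm_parser`, and `simple_ner_parser` are hypothetical stand-ins, and the email regex is a placeholder for whatever the real spaCy model and validation do.

```python
import re
from typing import Callable, Dict

# A parser takes resume text and returns a dict of extracted fields.
Parser = Callable[[str], Dict[str, str]]


def parse_with_fallback(text: str, primary: Parser, fallback: Parser) -> Dict[str, object]:
    """Try the primary parser; if it raises, use the fallback parser instead."""
    try:
        data = primary(text)
        source = "primary"
    except Exception:
        # Broad catch on purpose: any primary failure should trigger the fallback.
        data = fallback(text)
        source = "fallback"
    return {"source": source, "data": data}


def failing_llm_parser(text: str) -> Dict[str, str]:
    """Hypothetical stand-in for a remote model call that fails."""
    raise RuntimeError("simulated API failure")


def simple_ner_parser(text: str) -> Dict[str, str]:
    """Hypothetical stand-in for a local NER model; only finds an email address."""
    match = re.search(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)
    return {"email": match.group(0) if match else ""}


outcome = parse_with_fallback(
    "Jane Doe, jane.doe@example.com", failing_llm_parser, simple_ner_parser
)
print(outcome["source"], outcome["data"])
```

Recording which parser produced the result (`source`) is a design choice of this sketch; it makes it easy for a results page to show whether the fallback was used.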