Update README2.md
README2.md CHANGED (+149 -149)
@@ -1,150 +1,150 @@
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
\\----------- **Resume Parser** ----------\\
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\

# Overview:
This project is a resume parsing tool built in Python.
It uses the Mistral-Nemo-Instruct-2407 model for primary parsing.
If the Mistral call fails or returns an error,
the system falls back to a custom-trained spaCy model so that parsing can continue.
The tool is exposed through a Flask API and has a user interface built with HTML and CSS.

# Installation Guide:

1. Create and Activate a Virtual Environment

        python -m venv venv
        source venv/bin/activate   # For Linux/Mac
        # or
        venv\Scripts\activate      # For Windows

   NOTE: If the virtual environment (venv) already exists, skip the creation step and just activate it.
   - For Linux/Mac:

            source venv/bin/activate

   - For Windows:

            venv\Scripts\activate

2. Install Required Libraries

        pip install -r requirements.txt

   Ensure the following dependencies are included:
   - Flask
   - spaCy
   - huggingface_hub
   - PyMuPDF
   - python-docx
   - Tesseract-OCR (for image-based parsing; installed locally, see the note under File Conversion)

   NOTE: If any library is not installed, you can install it using:

        pip install <package_name>

   _Replace <package_name> with the specific library you need to install._

3. Set up Hugging Face Token
   - Add your Hugging Face token to the .env file as:

            HF_TOKEN=<your_huggingface_token>

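The README does not show how the token is read at runtime. As a minimal sketch, assuming the `.env` file holds plain `KEY=VALUE` lines, the token could be loaded with the standard library alone. The helper names `load_env_file` and `get_hf_token` are hypothetical and not taken from the project, which may use a dedicated package for this instead.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and anything that is not KEY=VALUE.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path=".env"):
    """Prefer HF_TOKEN from the process environment, then fall back to the .env file."""
    token = os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to the .env file")
    return token
```

Checking the process environment first lets a deployment override the file without editing it.
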
# File Structure Overview:

    Mistral_With_Spacy/
    │
    ├── Spacy_Models/
    │   └── ner_model_05_3         # Pretrained spaCy model directory for resume parsing
    │
    ├── templates/
    │   ├── index.html             # UI for file upload
    │   └── result.html            # Display parsed results in structured JSON
    │
    ├── uploads/                   # Directory for uploaded resume files
    │
    ├── utils/
    │   ├── mistral.py             # Code for calling Mistral API and handling responses
    │   ├── spacy.py               # spaCy fallback model for parsing resumes
    │   ├── error.py               # Error handling utilities
    │   └── fileTotext.py          # Functions to extract text from different file formats (PDF, DOCX, etc.)
    │
    ├── venv/                      # Virtual environment
    │
    ├── .env                       # Environment variables file (contains Hugging Face token)
    │
    ├── main.py                    # Flask app handling API routes for uploading and processing resumes
    │
    └── requirements.txt           # Dependencies required for the project

# Program Overview:

# Mistral Integration (utils/mistral.py)
- Mistral API Calls: Uses the Mistral-Nemo-Instruct-2407 model, accessed through Hugging Face, to parse resumes.
- Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
- Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.

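The fallback mechanism described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: `parse_with_fallback` and its `primary` and `fallback` arguments are hypothetical names standing in for the Mistral call and the spaCy parser.

```python
import logging

logger = logging.getLogger(__name__)


def parse_with_fallback(text, primary, fallback):
    """Try the primary parser; on an exception or an empty result, use the fallback.

    `primary` and `fallback` are any callables that take resume text and
    return parsed data (hypothetical stand-ins for the Mistral and spaCy parsers).
    """
    try:
        result = primary(text)
        if result:
            return {"source": "primary", "data": result}
        logger.warning("Primary parser returned no data; using fallback")
    except Exception as exc:  # broad on purpose: any primary failure triggers the fallback
        logger.warning("Primary parser failed (%s); using fallback", exc)
    return {"source": "fallback", "data": fallback(text)}
```

Recording which parser produced the result makes it easier to see how often the fallback path is taken.
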
# spaCy Integration (utils/spacy.py)
- Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
- Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
- Validation: Includes validation for extracted emails and contacts.

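The README does not say how emails and contacts are validated. One common approach is a pair of regular-expression checks, sketched below. The function names, the patterns, and the 7 to 15 digit range are assumptions chosen for illustration, not the rules used in `utils/spacy.py`.

```python
import re

# Deliberately simple pattern: local part, "@", dotted domain, alphabetic TLD.
EMAIL_RE = re.compile(
    r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}$"
)


def is_valid_email(value):
    """Return True if the extracted string looks like an email address."""
    return bool(EMAIL_RE.match(value.strip()))


def is_valid_contact(value, min_digits=7, max_digits=15):
    """Return True if the string looks like a phone number.

    Allows an optional leading "+", digits, spaces, dots, hyphens, and
    parentheses, then checks that the digit count falls in the given range.
    """
    stripped = value.strip()
    if not re.fullmatch(r"\+?[\d\s().-]+", stripped):
        return False
    digits = re.sub(r"\D", "", stripped)
    return min_digits <= len(digits) <= max_digits
```

Checks like these are useful after NER because a model can label a span as an email or phone number even when the text is malformed.
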
# File Conversion (utils/fileTotext.py)
- Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images like PNG, JPG, JPEG) and extracts text for further processing.
- PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
- DOCX Files: Uses `python-docx` to extract structured text from Word documents.
- ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
- RTF Files: Reads plain text from RTF files.
- Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
  Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
- Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.

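A file converter of this kind usually starts by routing each upload to a format-specific extractor. The sketch below shows that dispatch step only, using the standard library. `SUPPORTED_EXTENSIONS`, `detect_format`, and `extract_text` are hypothetical names, the `.rtf` entry follows the tree map in this README, and the actual extractors (PyMuPDF, python-docx, odfpy, Tesseract-OCR) are passed in by the caller rather than called here.

```python
from pathlib import Path

# Map file extensions to a format label (assumed set, based on the list above).
SUPPORTED_EXTENSIONS = {
    ".pdf": "pdf",
    ".docx": "docx",
    ".odt": "odt",
    ".rtf": "rtf",
    ".png": "image",
    ".jpg": "image",
    ".jpeg": "image",
}


def detect_format(filename):
    """Return the format label for a file name, or raise ValueError if unsupported."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file type: {ext or '(none)'}")
    return SUPPORTED_EXTENSIONS[ext]


def extract_text(filename, extractors):
    """Look up the extractor for the file's format and call it.

    `extractors` maps a format label (e.g. "pdf") to a callable that takes
    a path and returns text; the callables themselves are supplied by the caller.
    """
    kind = detect_format(filename)
    return extractors[kind](filename)
```

Rejecting unknown extensions before any parsing starts keeps unsupported uploads from reaching the extraction libraries.
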
# Error Handling (utils/error.py)
- Handles API response errors and file format issues, and ensures that fallbacks run without crashing the app.

# Flask API (main.py)
- Endpoints: /upload accepts resume uploads.
- Results: Displays parsed results in JSON format on the results page.
- UI: Simple interface for uploading resumes and viewing the parsing results.

# Tree map of program:

    main.py
    ├── Handles API side
    ├── File upload/remove
    ├── Process resumes
    └── Show result
    utils
    ├── fileTotext.py
    │   └── Converts files to text
    │       ├── PDF
    │       ├── DOCX
    │       ├── RTF
    │       ├── ODT
    │       ├── PNG
    │       ├── JPG
    │       └── JPEG
    ├── mistral.py
    │   ├── Mistral API Calls
    │   │   └── Uses Mistral-Nemo-Instruct-2407 model
    │   ├── Personal and Professional Extraction
    │   │   ├── Extracts personal information
    │   │   └── Extracts professional information
    │   └── Fallback Mechanism
    │       └── Uses spaCy NER model if Mistral fails
    └── spacy.py
        ├── Custom Trained Model
        │   └── Uses spaCy model (ner_model_05_3)
        ├── Named Entity Recognition
        │   └── Extracts key information (Name, Email, Contact, etc.)
        └── Validation
            └── Validates emails and contacts

# References:

- [Flask Documentation](https://flask.palletsprojects.com/)
- [spaCy Documentation](https://spacy.io/usage)
- [Mistral Documentation](https://docs.mistral.ai/)
- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
- [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
- [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)