Update README2.md
README2.md +149 -149
README2.md
CHANGED
|
@@ -1,150 +1,150 @@
|
|
| 107 |
-
# Tree map of
|
| 1 |
+
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
|
| 2 |
+
\\----------- **Resume Parser** ----------\\
|
| 3 |
+
\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
|
| 4 |
+
|
| 5 |
+
# Overview:
|
| 6 |
+
This project is a comprehensive Resume Parsing tool built using Python,
|
| 7 |
+
integrating the Mistral-Nemo-Instruct-2407 model for primary parsing.
|
| 8 |
+
If Mistral fails or encounters issues,
|
| 9 |
+
the system falls back to a custom-trained spaCy model to ensure continued functionality.
|
| 10 |
+
The tool is wrapped with a Flask API and has a user interface built using HTML and CSS.
|
| 11 |
+
|
| 12 |
+
|
| 13 |
+
# Installation Guide:
|
| 14 |
+
|
| 15 |
+
1. Create and Activate a Virtual Environment
|
| 16 |
+
python -m venv venv
|
| 17 |
+
source venv/bin/activate # For Linux/Mac
|
| 18 |
+
# or
|
| 19 |
+
venv\Scripts\activate # For Windows
|
| 20 |
+
|
| 21 |
+
# NOTE: If the virtual environment (venv) is already created, you can skip the creation step and just activate.
|
| 22 |
+
- For Linux/Mac:
|
| 23 |
+
source venv/bin/activate
|
| 24 |
+
- For Windows:
|
| 25 |
+
venv\Scripts\activate
|
| 26 |
+
|
| 27 |
+
2. Install Required Libraries
|
| 28 |
+
pip install -r requirements.txt
|
| 29 |
+
|
| 30 |
+
# Ensure the following dependencies are included:
|
| 31 |
+
- Flask
|
| 32 |
+
- spaCy
|
| 33 |
+
- huggingface_hub
|
| 34 |
+
- PyMuPDF
|
| 35 |
+
- python-docx
|
| 36 |
+
- Tesseract-OCR (for image-based parsing)
|
| 37 |
+
|
| 38 |
+
# NOTE: If any model or library is not installed, you can install it using:
|
| 39 |
+
pip install <model_name>
|
| 40 |
+
_Replace <model_name> with the specific model or library you need to install_
|
| 41 |
+
|
| 42 |
+
3. Set up Hugging Face Token
|
| 43 |
+
- Add your Hugging Face token to the .env file as:
|
| 44 |
+
HF_TOKEN=<your_huggingface_token>
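
A minimal sketch of how the token could be read at startup, assuming a plain `KEY=VALUE` `.env` file in the project root. The helper name `load_env_file` is hypothetical and not taken from this project; a package such as python-dotenv provides similar behaviour if a library is preferred.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Read KEY=VALUE pairs from a .env file into os.environ (hypothetical helper)."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    loaded = {}
    for raw_line in env_path.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip("'\"")
        loaded[key] = value
        # Variables already set in the real environment take precedence.
        os.environ.setdefault(key, value)
    return loaded
```

Calling `load_env_file()` once before the first model request makes `os.environ.get("HF_TOKEN")` available to the rest of the app.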
|
| 45 |
+
|
| 46 |
+
|
| 47 |
+
# File Structure Overview:
|
| 48 |
+
Mistral_With_Spacy/
|
| 49 |
+
│
|
| 50 |
+
├── Spacy_Models/
|
| 51 |
+
│   └── ner_model_05_3        # Pretrained spaCy model directory for resume parsing
|
| 52 |
+
│
|
| 53 |
+
├── templates/
|
| 54 |
+
│   ├── index.html            # UI for file upload
|
| 55 |
+
│   └── result.html           # Display parsed results in structured JSON
|
| 56 |
+
│
|
| 57 |
+
├── uploads/                  # Directory for uploaded resume files
|
| 58 |
+
│
|
| 59 |
+
├── utils/
|
| 60 |
+
│   ├── mistral.py            # Code for calling Mistral API and handling responses
|
| 61 |
+
│   ├── spacy.py              # spaCy fallback model for parsing resumes
|
| 62 |
+
│   ├── error.py              # Error handling utilities
|
| 63 |
+
│   └── fileTotext.py         # Functions to extract text from different file formats (PDF, DOCX, etc.)
|
| 64 |
+
│
|
| 65 |
+
├── venv/                     # Virtual environment
|
| 66 |
+
│
|
| 67 |
+
├── .env                      # Environment variables file (contains Hugging Face token)
|
| 68 |
+
│
|
| 69 |
+
├── main.py                   # Flask app handling API routes for uploading and processing resumes
|
| 70 |
+
│
|
| 71 |
+
└── requirements.txt          # Dependencies required for the project
|
| 72 |
+
|
| 73 |
+
|
| 74 |
+
# Program Overview:
|
| 75 |
+
|
| 76 |
+
# Mistral Integration (utils/mistral.py)
|
| 77 |
+
- Mistral API Calls: Uses the Mistral-Nemo-Instruct-2407 model on Hugging Face to parse resumes.
|
| 78 |
+
- Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
|
| 79 |
+
- Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.
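
The fallback behaviour described above can be sketched roughly as follows. This is an illustration under assumptions, not the code in `utils/mistral.py`: the function names are hypothetical, and the primary and fallback parsers are passed in as plain callables so the pattern can be shown without a network call.

```python
import json


def parse_resume(text, primary, fallback):
    """Try the primary parser first; use the fallback if it raises or returns nothing."""
    try:
        result = primary(text)
        if result:
            return {"source": "primary", "data": result}
    except Exception as exc:  # e.g. network errors, malformed JSON, rate limits
        print(f"Primary parser failed: {exc!r}")
    return {"source": "fallback", "data": fallback(text)}


def parse_model_reply(reply):
    """Decode the JSON object spanning the first '{' to the last '}' in a model reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in model reply")
    return json.loads(reply[start:end + 1])


# Stand-ins for the Mistral call and the spaCy parser (both hypothetical).
def broken_primary(text):
    raise TimeoutError("model endpoint did not respond")


def simple_fallback(text):
    return {"name": text.split("\n")[0]}


if __name__ == "__main__":
    print(parse_resume("Jane Doe\nPython developer", broken_primary, simple_fallback))
```

Treating an empty result the same as an exception means a reply that decodes to `{}` also triggers the fallback, which may or may not match what the project does.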
|
| 80 |
+
|
| 81 |
+
# SpaCy Integration (utils/spacy.py)
|
| 82 |
+
- Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
|
| 83 |
+
- Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
|
| 84 |
+
- Validation: Includes validation for extracted emails and contacts.
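
As an illustration of the validation step, the sketch below filters extracted entities with simple regular expressions. The entity labels `EMAIL` and `CONTACT`, the digit limits, and the function names are assumptions made for this example; the rules in `utils/spacy.py` may differ.

```python
import re

# Deliberately simple pattern: local part, "@", domain, dot, 2+ letter TLD.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def is_valid_email(value):
    return bool(EMAIL_RE.match(value.strip()))


def is_valid_contact(value, min_digits=7, max_digits=15):
    """Accept phone numbers with an optional +, spaces, dashes, dots, and parentheses."""
    cleaned = value.strip()
    if not re.fullmatch(r"\+?[\d\s().-]+", cleaned):
        return False
    digits = re.sub(r"\D", "", cleaned)
    return min_digits <= len(digits) <= max_digits


def clean_entities(entities):
    """Drop EMAIL/CONTACT entities that fail validation; keep everything else."""
    checks = {"EMAIL": is_valid_email, "CONTACT": is_valid_contact}
    return [
        (label, text)
        for label, text in entities
        if checks.get(label, lambda _: True)(text)
    ]
```

With a loaded spaCy pipeline, the input list could be built as `[(ent.label_, ent.text) for ent in doc.ents]`.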
|
| 85 |
+
|
| 86 |
+
# File Conversion (utils/fileTotext.py)
|
| 87 |
+
- Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images like PNG, JPG, JPEG) and extracts text for further processing.
|
| 88 |
+
- PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
|
| 89 |
+
- DOCX Files: Uses `python-docx` to extract structured text from Word documents.
|
| 90 |
+
- ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
|
| 91 |
+
- RTF Files: Reads plain text from RTF files.
|
| 92 |
+
- Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
|
| 93 |
+
Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
|
| 94 |
+
- Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.
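
One way to organise the per-format handling is a lookup table keyed by file extension, sketched below. This is not the project's `fileTotext.py`: the function names are hypothetical, ODT and RTF are left out for brevity, and the use of `pytesseract` with Pillow as the Python wrapper around Tesseract-OCR is an assumption. Imports are kept inside each extractor so a missing optional library only matters for the format that needs it.

```python
from pathlib import Path


def extract_pdf(path):
    import fitz  # PyMuPDF
    with fitz.open(path) as doc:
        return "\n".join(page.get_text() for page in doc)


def extract_docx(path):
    import docx  # python-docx
    return "\n".join(p.text for p in docx.Document(path).paragraphs)


def extract_image(path):
    # Assumption: pytesseract and Pillow wrap the local Tesseract-OCR install.
    import pytesseract
    from PIL import Image
    return pytesseract.image_to_string(Image.open(path))


EXTRACTORS = {
    ".pdf": extract_pdf,
    ".docx": extract_docx,
    ".png": extract_image,
    ".jpg": extract_image,
    ".jpeg": extract_image,
}


def pick_extractor(filename):
    """Return the extractor for a file name, based on its lower-cased extension."""
    suffix = Path(filename).suffix.lower()
    try:
        return EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file type: {suffix or filename!r}") from None


def file_to_text(path):
    return pick_extractor(path)(path)
```

Supporting another format then only requires one more function and one more entry in `EXTRACTORS`.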
|
| 95 |
+
|
| 96 |
+
|
| 97 |
+
# Error Handling (utils/error.py)
|
| 98 |
+
- Handles API response errors and file format issues, and ensures the fallback runs smoothly without crashing the app.
|
| 99 |
+
|
| 100 |
+
# Flask API (main.py)
|
| 101 |
+
Endpoints:
|
| 102 |
+
- /upload for uploading resumes.
|
| 103 |
+
- Displays parsed results in JSON format on the results page.
|
| 104 |
+
- UI: Simple interface for uploading resumes and viewing the parsing results.
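
A minimal sketch of what an `/upload` route might look like, assuming Flask's standard file upload handling. The form field name `resume`, the allowed extension set, and the JSON response are assumptions for illustration; the actual `main.py` renders its results through `result.html` instead.

```python
import os

from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = "uploads"
ALLOWED = {".pdf", ".docx", ".odt", ".rtf", ".png", ".jpg", ".jpeg"}


@app.route("/upload", methods=["POST"])
def upload():
    file = request.files.get("resume")  # form field name is an assumption
    if file is None or file.filename == "":
        return jsonify(error="no file provided"), 400
    # Strip path components and unsafe characters before touching the disk.
    filename = secure_filename(file.filename)
    if os.path.splitext(filename)[1].lower() not in ALLOWED:
        return jsonify(error="unsupported file type"), 400
    os.makedirs(app.config["UPLOAD_FOLDER"], exist_ok=True)
    file.save(os.path.join(app.config["UPLOAD_FOLDER"], filename))
    # The real app would run text extraction and parsing at this point.
    return jsonify(filename=filename, status="uploaded")
```

For local testing, the app can be started with `app.run(debug=True)` and exercised with a multipart POST to `/upload`.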
|
| 105 |
+
|
| 106 |
+
|
| 107 |
+
# Tree map of program:
|
| 108 |
+
|
| 109 |
+
main.py
|
| 110 |
+
├── Handles API side
|
| 111 |
+
├── File upload/remove
|
| 112 |
+
├── Process resumes
|
| 113 |
+
└── Show result
|
| 114 |
+
utils
|
| 115 |
+
├── fileTotext.py
|
| 116 |
+
│   └── Converts files to text
|
| 117 |
+
│       ├── PDF
|
| 118 |
+
│       ├── DOCX
|
| 119 |
+
│       ├── RTF
|
| 120 |
+
│       ├── ODT
|
| 121 |
+
│       ├── PNG
|
| 122 |
+
│       ├── JPG
|
| 123 |
+
│       └── JPEG
|
| 124 |
+
├── mistral.py
|
| 125 |
+
│   ├── Mistral API Calls
|
| 126 |
+
│   │   └── Uses Mistral-Nemo-Instruct-2407 model
|
| 127 |
+
│   ├── Personal and Professional Extraction
|
| 128 |
+
│   │   ├── Extracts personal information
|
| 129 |
+
│   │   └── Extracts professional information
|
| 130 |
+
│   └── Fallback Mechanism
|
| 131 |
+
│       └── Uses spaCy NER model if Mistral fails
|
| 132 |
+
└── spacy.py
|
| 133 |
+
    ├── Custom Trained Model
|
| 134 |
+
    │   └── Uses spaCy model (ner_model_05_3)
|
| 135 |
+
    ├── Named Entity Recognition
|
| 136 |
+
    │   └── Extracts key information (Name, Email, Contact, etc.)
|
| 137 |
+
    └── Validation
|
| 138 |
+
        └── Validates emails and contacts
|
| 139 |
+
|
| 140 |
+
|
| 141 |
+
# References:
|
| 142 |
+
|
| 143 |
+
- [Flask Documentation](https://flask.palletsprojects.com/)
|
| 144 |
+
- [spaCy Documentation](https://spacy.io/usage)
|
| 145 |
+
- [Mistral Documentation](https://docs.mistral.ai/)
|
| 146 |
+
- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
|
| 147 |
+
- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
|
| 148 |
+
- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
|
| 149 |
+
- [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
|
| 150 |
- [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)
|