<div style="display: flex; align-items: center; justify-content: center;">
<div style="margin-right: 20px;">
<img src="https://cdn-lfs-us-1.hf.co/repos/de/fb/defb007867acd8852f4a283e9b06a933778826b18ed58ade01da945f5903795d/8b7831230df7d554c74f5e249e23be57165d143fea0ea7b5dde56dde5c13c95b?response-content-disposition=inline%3B+filename*%3DUTF-8%27%27turing-test.gif%3B+filename%3D%22turing-test.gif%22%3B&response-content-type=image%2Fgif&Expires=1730008247&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTczMDAwODI0N319LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmhmLmNvL3JlcG9zL2RlL2ZiL2RlZmIwMDc4NjdhY2Q4ODUyZjRhMjgzZTliMDZhOTMzNzc4ODI2YjE4ZWQ1OGFkZTAxZGE5NDVmNTkwMzc5NWQvOGI3ODMxMjMwZGY3ZDU1NGM3NGY1ZTI0OWUyM2JlNTcxNjVkMTQzZmVhMGVhN2I1ZGRlNTZkZGU1YzEzYzk1Yj9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=GBUn-4z3PMBTqT0NdT3H-NyZxNMGcN4zDNzK8ql%7ESwLF8pXzkH783GSCZQQYWwE-v1g90JTulsOt7z5szigK49ApFju6bkS2zwUAYxNttcl3c-VYrxGuFWYnkHpTQ73qbs3ELF2-5LzDy1ARpj3BOlSEXtH9ShwCRm-R0llQJ6EDx2eOyBIDg-Pgrx%7EKIxrdAZCNln9tJk74TrSN5survdIvcSZrSIGXc3tpFLm-BwpY6qtID3ltrPEHYWDrQ5ALV8lXqKmpVlFSq3lOEFlSa-opFJwe%7E8FIIwP5mJgtCZzlQQylRhsVLxDQ2cJYpTbZSvEVkfjyTxOP4dc%7EDz1tVQ__&Key-Pair-Id=K24J24Z295AEI9"
alt="AI App Icon" width="100" height="50"
style="border-radius: 20px; border: 2px solid #333;">
</div>
<div>
<p style="font-size: 50px; font-weight: bold; text-align: center; margin: 0;">
Spacy Model Creator
</p>
</div>
</div>
<hr>
# Overview:
This project is a resume parsing tool built in Python.
It uses the Mistral-Nemo-Instruct-2407 model for primary parsing.
If the Mistral call fails,
the system falls back to a custom-trained spaCy model so that parsing can continue.
The tool is exposed through a Flask API and has a user interface built with HTML and CSS.
# Installation Guide:
1. Create and Activate a Virtual Environment
python -m venv venv
source venv/bin/activate # For Linux/Mac
# or
venv\Scripts\activate # For Windows
NOTE: If the virtual environment (venv) already exists, skip the creation step and just activate it.
- For Linux/Mac:
source venv/bin/activate
- For Windows:
venv\Scripts\activate
2. Install Required Libraries
pip install -r requirements.txt
Ensure the following dependencies are included:
- Flask
- spaCy
- huggingface_hub
- PyMuPDF
- python-docx
- odfpy (for ODT files)
- Tesseract-OCR (for image-based parsing)
NOTE: If any model or library is not installed, you can install it using:
pip install <model_name>
_Replace <model_name> with the specific model or library you need to install._
3. Set up Hugging Face Token
- Add your Hugging Face token to the .env file as:
HF_TOKEN=<your_huggingface_token>
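A minimal sketch of how the token could be read at startup, assuming a plain `KEY=VALUE` `.env` file in the project root. The helper names `load_env_file` and `get_hf_token` are hypothetical and are not taken from the project code. The project may equally rely on a package such as python-dotenv, which is not listed in the dependencies above.

```python
import os
from pathlib import Path


def load_env_file(path=".env"):
    """Parse KEY=VALUE lines from a .env file into a dict (hypothetical helper)."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        # Skip blank lines, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def get_hf_token(path=".env"):
    """Return HF_TOKEN from the process environment, else from the .env file."""
    token = os.environ.get("HF_TOKEN") or load_env_file(path).get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; add it to the .env file.")
    return token
```

Failing early with a clear message when the token is missing tends to be easier to debug than an authentication error raised later by the model call.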
# File Structure Overview:
Mistral_With_Spacy/
│
├── Spacy_Models/
│   └── ner_model_05_3     # Pretrained spaCy model directory for resume parsing
│
├── templates/
│   ├── index.html         # UI for file upload
│   └── result.html        # Display parsed results in structured JSON
│
├── uploads/               # Directory for uploaded resume files
│
├── utils/
│   ├── mistral.py         # Code for calling Mistral API and handling responses
│   ├── spacy.py           # spaCy fallback model for parsing resumes
│   ├── error.py           # Error handling utilities
│   └── fileTotext.py      # Functions to extract text from different file formats (PDF, DOCX, etc.)
│
├── venv/                  # Virtual environment
│
├── .env                   # Environment variables file (contains Hugging Face token)
│
├── main.py                # Flask app handling API routes for uploading and processing resumes
│
└── requirements.txt       # Dependencies required for the project
# Program Overview:
# Mistral Integration (utils/mistral.py)
- Mistral API Calls: Uses Hugging Face's Mistral-Nemo-Instruct-2407 model to parse resumes.
- Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
- Fallback Mechanism: If Mistral fails, spaCy's NER model is used as a fallback.
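The fallback behaviour described above can be sketched as follows. This is an illustration, not the code in `utils/mistral.py`: the function name `parse_resume` and the `source`/`data` result keys are assumptions, and the two parsers are passed in as plain callables so the sketch does not depend on the actual Mistral or spaCy code.

```python
import logging

logger = logging.getLogger(__name__)


def parse_resume(text, primary_parser, fallback_parser):
    """Try the primary parser; on an exception or an empty result, use the fallback.

    Both parsers are callables that take the resume text and return a dict.
    """
    try:
        result = primary_parser(text)
        if result:
            return {"source": "primary", "data": result}
        logger.warning("Primary parser returned no data; using fallback.")
    except Exception as exc:  # broad on purpose: any primary failure triggers the fallback
        logger.warning("Primary parser failed (%s); using fallback.", exc)
    return {"source": "fallback", "data": fallback_parser(text)}
```

Recording which parser produced the result makes it possible to see, per resume, how often the fallback path is actually taken.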
# spaCy Integration (utils/spacy.py)
- Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
- Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
- Validation: Includes validation for extracted emails and contacts.
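As an illustration of the validation step, the sketch below checks extracted emails and contact numbers with regular expressions. The patterns, the digit limits, and the function names are assumptions chosen for the example. The rules used in `utils/spacy.py` may differ.

```python
import re

# Deliberately simple pattern: local part, "@", domain, and a 2+ letter TLD.
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")


def is_valid_email(value):
    """Return True if the string looks like an email address."""
    return bool(EMAIL_RE.match(value.strip()))


def is_valid_contact(value, min_digits=7, max_digits=15):
    """Accept digits with an optional leading '+', spaces, dots, dashes, parentheses."""
    cleaned = value.strip()
    if not re.fullmatch(r"\+?[\d\s().-]+", cleaned):
        return False
    digits = re.sub(r"\D", "", cleaned)
    return min_digits <= len(digits) <= max_digits


def filter_valid(emails, contacts):
    """Keep only the entities that pass validation."""
    return {
        "emails": [e for e in emails if is_valid_email(e)],
        "contacts": [c for c in contacts if is_valid_contact(c)],
    }
```

NER output can include partial or mislabeled spans, so a cheap post-check like this helps keep obviously malformed values out of the final JSON.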
# File Conversion (utils/fileTotext.py)
- Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RTF, and images such as PNG, JPG, and JPEG) and extracts text for further processing.
  - PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
  - DOCX Files: Uses `python-docx` to extract structured text from Word documents.
  - ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
  - RTF Files: Reads plain text from RTF files.
  - Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
    Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
- Hyperlink Extraction: Extracts embedded URLs from PDF files during parsing.
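One way to organise this kind of extraction is a dispatch table keyed by file extension, sketched below. The function names and the `(text, links)` return shape are assumptions, not the contents of `utils/fileTotext.py`. The image reader assumes the `pytesseract` and Pillow packages, which are not named in this README, and ODT handling is left out of the sketch. Third-party imports are placed inside each reader so that only the libraries for the formats actually used need to be installed.

```python
from pathlib import Path


def _read_pdf(path):
    import fitz  # PyMuPDF

    text_parts, links = [], []
    with fitz.open(str(path)) as doc:
        for page in doc:
            text_parts.append(page.get_text())
            # URI links carry a "uri" key; internal links do not.
            links.extend(link["uri"] for link in page.get_links() if link.get("uri"))
    return "\n".join(text_parts), links


def _read_docx(path):
    from docx import Document  # python-docx

    return "\n".join(p.text for p in Document(str(path)).paragraphs), []


def _read_image(path):
    import pytesseract  # assumption: Python wrapper around a local Tesseract install
    from PIL import Image

    return pytesseract.image_to_string(Image.open(str(path))), []


def _read_plain(path):
    return Path(path).read_text(encoding="utf-8", errors="ignore"), []


READERS = {
    ".pdf": _read_pdf,
    ".docx": _read_docx,
    ".rtf": _read_plain,  # read as plain text, as described above
    ".png": _read_image,
    ".jpg": _read_image,
    ".jpeg": _read_image,
}


def extract_text(path):
    """Return (text, links) for a supported file, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    reader = READERS.get(suffix)
    if reader is None:
        raise ValueError(f"Unsupported file type: {suffix or 'none'}")
    return reader(path)
```

Adding a new format then only requires writing one reader function and registering its extension in `READERS`.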
# Error Handling (utils/error.py)
- Manages API response errors and file format issues, and ensures smooth fallbacks without crashing the app.
# Flask API (main.py)
- Endpoints: /upload accepts resume uploads; the parsed results are displayed in JSON format on the results page.
- UI: Simple interface for uploading resumes and viewing the parsing results.
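A minimal sketch of what an `/upload` route could look like in Flask. It is a simplification under stated assumptions: the form field name `file`, the allowed-extension set, and the JSON response are illustrative, whereas the actual `main.py` renders the results page from `templates/result.html` after parsing.

```python
import os

from flask import Flask, jsonify, request
from werkzeug.utils import secure_filename

# Assumed set of accepted extensions, based on the formats listed in this README.
ALLOWED_EXTENSIONS = {"pdf", "docx", "rtf", "odt", "png", "jpg", "jpeg"}

app = Flask(__name__)
app.config["UPLOAD_FOLDER"] = "uploads"


def allowed_file(filename):
    """Return True if the filename has an accepted extension."""
    return "." in filename and filename.rsplit(".", 1)[1].lower() in ALLOWED_EXTENSIONS


@app.route("/upload", methods=["POST"])
def upload():
    file = request.files.get("file")  # "file" is an assumed form field name
    if file is None or file.filename == "":
        return jsonify({"error": "No file provided"}), 400
    if not allowed_file(file.filename):
        return jsonify({"error": "Unsupported file type"}), 400
    os.makedirs(app.config["UPLOAD_FOLDER"], exist_ok=True)
    # secure_filename strips path separators and other unsafe characters.
    filename = secure_filename(file.filename)
    file.save(os.path.join(app.config["UPLOAD_FOLDER"], filename))
    return jsonify({"filename": filename, "status": "uploaded"})
```

In the real application, the saved file would next be passed to the text extraction and parsing steps before the result is shown to the user.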
# Tree map of program:
main.py
├── Handles API side
├── File upload/remove
├── Process resumes
└── Show result
utils
├── fileTotext.py
│   └── Converts files to text
│       ├── PDF
│       ├── DOCX
│       ├── RTF
│       ├── ODT
│       ├── PNG
│       ├── JPG
│       └── JPEG
├── mistral.py
│   ├── Mistral API Calls
│   │   └── Uses Mistral-Nemo-Instruct-2407 model
│   ├── Personal and Professional Extraction
│   │   ├── Extracts personal information
│   │   └── Extracts professional information
│   └── Fallback Mechanism
│       └── Uses spaCy NER model if Mistral fails
└── spacy.py
    ├── Custom Trained Model
    │   └── Uses spaCy model (ner_model_05_3)
    ├── Named Entity Recognition
    │   └── Extracts key information (Name, Email, Contact, etc.)
    └── Validation
        └── Validates emails and contacts
# References:
- [Flask Documentation](https://flask.palletsprojects.com/)
- [spaCy Documentation](https://spacy.io/usage)
- [Mistral Documentation](https://docs.mistral.ai/)
- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
- [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
- [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)