Spaces:

WebashalarForML
/

SpacyModelCreator

Sleeping

App Files Files Community

WebashalarForML commited on Sep 30, 2024

Commit

cf9329c

verified ·

1 Parent(s): adac9c9

Update README2.md

Browse files

Files changed (1) hide show

README2.md +149 -149

README2.md CHANGED Viewed

@@ -1,150 +1,150 @@
-\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
-\\----------- **Resume Parser** ----------\\
-\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
-# Overview:
-This project is a comprehensive Resume Parsing tool built using Python,
-integrating the Mistral-Nemo-Instruct-2407 model for primary parsing.
-If Mistral fails or encounters issues,
-the system falls back to a custom-trained spaCy model to ensure continued functionality.
-The tool is wrapped with a Flask API and has a user interface built using HTML and CSS.
-# Installation Guide:
-1. Create and Activate a Virtual Environment
-    python -m venv venv
-    source venv/bin/activate  # For Linux/Mac
-    # or
-    venv\Scripts\activate  # For Windows
-    # NOTE: If the virtual environment (venv) is already created, you can skip the creation step and just activate.
-        - For Linux/Mac:
-            source venv/bin/activate
-        - For Windows:
-            venv\Scripts\activate
-2. Install Required Libraries
-    pip install -r requirements.txt
-    # Ensure the following dependencies are included:
-    - Flask
-    - spaCy
-    - huggingface_hub
-    - PyMuPDF
-    - python-docx
-    - Tesseract-OCR (for image-based parsing)
-; NOTE : If any model or library is not installed, you can install it using:
-    pip install <model_name>
-    _Replace <model_name> with the specific model or library you need to install_
-3. Set up Hugging Face Token
-    - Add your Hugging Face token to the .env file as:
-    HF_TOKEN=<your_huggingface_token>
-# File Structure Overview:
-    Mistral_With_Spacy/
-    │
-    ├── Spacy_Models/
-    │   └── ner_model_05_3  # Pretrained spaCy model directory for resume parsing
-    │
-    ├── templates/
-    │   ├── index.html  # UI for file upload
-    │   └── result.html  # Display parsed results in structured JSON
-    │
-    ├── uploads/  # Directory for uploaded resume files
-    │
-    ├── utils/
-    │   ├── mistral.py  # Code for calling Mistral API and handling responses
-    │   ├── spacy.py  # spaCy fallback model for parsing resumes
-    │   ├── error.py  # Error handling utilities
-    │   └── fileTotext.py  # Functions to extract text from different file formats (PDF, DOCX, etc.)
-    │
-    ├── venv/  # Virtual environment
-    │
-    ├── .env  # Environment variables file (contains Hugging Face token)
-    │
-    ├── main.py  # Flask app handling API routes for uploading and processing resumes
-    │
-    └── requirements.txt  # Dependencies required for the project
-# Program Overview:
-    # Mistral Integration (utils/mistral.py)
-        - Mistral API Calls: Uses Hugging Faces Mistral-Nemo-Instruct-2407 model to parse resumes.
-        - Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
-        - Fallback Mechanism: If Mistral fails, spaCys NER model is used as a fallback.
-    # SpaCy Integration (utils/spacy.py)
-        - Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
-        - Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
-        - Validation: Includes validation for extracted emails and contacts.
-    # File Conversion (utils/fileTotext.py)
-       - Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RSF, and images like PNG, JPG, JPEG) and extracts text for further processing.
-          - PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
-          - DOCX Files: Uses `python-docx` to extract structured text from Word documents.
-          - ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
-          - RSF Files: Reads plain text from RSF files.
-          - Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
-            Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
-          - Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.
-    # Error Handling (utils/error.py)
-        - Manages API response errors, file format issues, and ensures smooth fallbacks without crashing the app.
-    # Flask API (main.py)
-        Endpoints:
-        - /upload for uploading resumes.
-        - Displays parsed results in JSON format on the results page.
-        - UI: Simple interface for uploading resumes and viewing the parsing results.
-# Tree map of your program:
-    main.py
-    ├── Handles API side
-    ├── File upload/remove
-    ├── Process resumes
-    └── Show result
-    utils
-    ├── fileTotext.py
-    │   └── Converts files to text
-    │       ├── PDF
-    │       ├── DOCX
-    │       ├── RTF
-    │       ├── ODT
-    │       ├── PNG
-    │       ├── JPG
-    │       └── JPEG
-    ├── mistral.py
-    │   ├── Mistral API Calls
-    │   │   └── Uses Mistral-Nemo-Instruct-2407 model
-    │   ├── Personal and Professional Extraction
-    │   │   ├── Extracts personal information
-    │   │   └── Extracts professional information
-    │   └── Fallback Mechanism
-    │       └── Uses spaCy NER model if Mistral fails
-    └── spacy.py
-        ├── Custom Trained Model
-        │   └── Uses spaCy model (ner_model_05_3)
-        ├── Named Entity Recognition
-        │   └── Extracts key information (Name, Email, Contact, etc.)
-        └── Validation
-            └── Validates emails and contacts
-# References:
-- [Flask Documentation](https://flask.palletsprojects.com/)
-- [spaCy Documentation](https://spacy.io/usage)
-- [Mistral Documentation](https://docs.mistral.ai/)
-- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
-- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
-- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
-- [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
 - [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)

+\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
+\\----------- **Resume Parser** ----------\\
+\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\\
+# Overview:
+This project is a comprehensive Resume Parsing tool built using Python,
+integrating the Mistral-Nemo-Instruct-2407 model for primary parsing.
+If Mistral fails or encounters issues,
+the system falls back to a custom-trained spaCy model to ensure continued functionality.
+The tool is wrapped with a Flask API and has a user interface built using HTML and CSS.
+# Installation Guide:
+1. Create and Activate a Virtual Environment
+    python -m venv venv
+    source venv/bin/activate  # For Linux/Mac
+    # or
+    venv\Scripts\activate  # For Windows
+    # NOTE: If the virtual environment (venv) is already created, you can skip the creation step and just activate.
+        - For Linux/Mac:
+            source venv/bin/activate
+        - For Windows:
+            venv\Scripts\activate
+2. Install Required Libraries
+    pip install -r requirements.txt
+    # Ensure the following dependencies are included:
+    - Flask
+    - spaCy
+    - huggingface_hub
+    - PyMuPDF
+    - python-docx
+    - Tesseract-OCR (for image-based parsing)
+; NOTE : If any model or library is not installed, you can install it using:
+    pip install <model_name>
+    _Replace <model_name> with the specific model or library you need to install_
+3. Set up Hugging Face Token
+    - Add your Hugging Face token to the .env file as:
+    HF_TOKEN=<your_huggingface_token>
+# File Structure Overview:
+    Mistral_With_Spacy/
+    │
+    ├── Spacy_Models/
+    │   └── ner_model_05_3  # Pretrained spaCy model directory for resume parsing
+    │
+    ├── templates/
+    │   ├── index.html  # UI for file upload
+    │   └── result.html  # Display parsed results in structured JSON
+    │
+    ├── uploads/  # Directory for uploaded resume files
+    │
+    ├── utils/
+    │   ├── mistral.py  # Code for calling Mistral API and handling responses
+    │   ├── spacy.py  # spaCy fallback model for parsing resumes
+    │   ├── error.py  # Error handling utilities
+    │   └── fileTotext.py  # Functions to extract text from different file formats (PDF, DOCX, etc.)
+    │
+    ├── venv/  # Virtual environment
+    │
+    ├── .env  # Environment variables file (contains Hugging Face token)
+    │
+    ├── main.py  # Flask app handling API routes for uploading and processing resumes
+    │
+    └── requirements.txt  # Dependencies required for the project
+# Program Overview:
+    # Mistral Integration (utils/mistral.py)
+        - Mistral API Calls: Uses Hugging Faces Mistral-Nemo-Instruct-2407 model to parse resumes.
+        - Personal and Professional Extraction: Two functions extract personal and professional information in structured JSON format.
+        - Fallback Mechanism: If Mistral fails, spaCys NER model is used as a fallback.
+    # SpaCy Integration (utils/spacy.py)
+        - Custom Trained Model: Uses a spaCy model (ner_model_05_3) trained specifically for resume parsing.
+        - Named Entity Recognition: Extracts key information like Name, Email, Contact, Location, Skills, Experience, etc., from resumes.
+        - Validation: Includes validation for extracted emails and contacts.
+    # File Conversion (utils/fileTotext.py)
+       - Text Extraction: Handles different resume formats (PDF, DOCX, ODT, RSF, and images like PNG, JPG, JPEG) and extracts text for further processing.
+          - PDF Files: Uses PyMuPDF to extract text and, if necessary, Tesseract-OCR for image-based PDF content.
+          - DOCX Files: Uses `python-docx` to extract structured text from Word documents.
+          - ODT Files: Uses `odfpy` to extract text from ODT (OpenDocument) files.
+          - RSF Files: Reads plain text from RSF files.
+          - Images (PNG, JPG, JPEG): Uses Tesseract-OCR to extract text from image-based resumes.
+            Note: For Tesseract-OCR, install it locally by following the [installation guide](https://github.com/UB-Mannheim/tesseract/wiki).
+          - Hyperlink Extraction: Extracts hyperlinks from PDF files, capturing any embedded URLs during the parsing process.
+    # Error Handling (utils/error.py)
+        - Manages API response errors, file format issues, and ensures smooth fallbacks without crashing the app.
+    # Flask API (main.py)
+        Endpoints:
+        - /upload for uploading resumes.
+        - Displays parsed results in JSON format on the results page.
+        - UI: Simple interface for uploading resumes and viewing the parsing results.
+# Tree map of program:
+    main.py
+    ├── Handles API side
+    ├── File upload/remove
+    ├── Process resumes
+    └── Show result
+    utils
+    ├── fileTotext.py
+    │   └── Converts files to text
+    │       ├── PDF
+    │       ├── DOCX
+    │       ├── RTF
+    │       ├── ODT
+    │       ├── PNG
+    │       ├── JPG
+    │       └── JPEG
+    ├── mistral.py
+    │   ├── Mistral API Calls
+    │   │   └── Uses Mistral-Nemo-Instruct-2407 model
+    │   ├── Personal and Professional Extraction
+    │   │   ├── Extracts personal information
+    │   │   └── Extracts professional information
+    │   └── Fallback Mechanism
+    │       └── Uses spaCy NER model if Mistral fails
+    └── spacy.py
+        ├── Custom Trained Model
+        │   └── Uses spaCy model (ner_model_05_3)
+        ├── Named Entity Recognition
+        │   └── Extracts key information (Name, Email, Contact, etc.)
+        └── Validation
+            └── Validates emails and contacts
+# References:
+- [Flask Documentation](https://flask.palletsprojects.com/)
+- [spaCy Documentation](https://spacy.io/usage)
+- [Mistral Documentation](https://docs.mistral.ai/)
+- [Hugging Face Hub API](https://huggingface.co/docs/huggingface_hub/index)
+- [PyMuPDF (MuPDF) Documentation](https://pymupdf.readthedocs.io/en/latest/)
+- [python-docx Documentation](https://python-docx.readthedocs.io/en/latest/)
+- [Tesseract OCR Documentation](https://github.com/UB-Mannheim/tesseract/wiki)
 - [Virtual Environments in Python](https://docs.python.org/3/tutorial/venv.html)