forestav committed
Commit 3e80ffb · 1 Parent(s): 39ce19f

small readme update

Files changed (1):
  1. README.md +14 -16
README.md CHANGED
@@ -13,7 +13,7 @@ pinned: false
 
 This repository contains the final project for the course **ID2223 Scalable Machine Learning and Deep Learning** at KTH.
 
-The project culminates in an AI-powered job matching platform, **JobsAI**, designed to help users find job listings tailored to their resumes. The application is hosted on Gradio using HuggingFace Community Cloud and can be accessed here:
+The project culminates in an AI-powered job matching platform, **JobsAI**, designed to help users find job listings tailored to their resumes. The application is hosted on HuggingFace Spaces and can be accessed here:
 [**JobsAI**](https://huggingface.co/spaces/forestav/jobsai)
 
 ---
@@ -22,7 +22,7 @@ The project culminates in an AI-powered job matching platform, **JobsAI**, desig
 
 ### Project Pitch
 
-Finding the right job can be overwhelming, especially with over 40,000 listings available on Arbetsförmedlingen. **JobsAI** streamlines this process by using **vector embeddings** and **similarity search** to match users’ resumes with the most relevant job postings. Say goodbye to endless scrolling and let AI do the heavy lifting!
+Finding the right job can be overwhelming, especially with over 40,000 listings available on Arbetsförmedlingen at the time of writing. **JobsAI** streamlines this process by using **vector embeddings** and **similarity search** to match users’ resumes with the most relevant job postings. Say goodbye to endless scrolling and let AI do the heavy lifting!
 
 ---
 
@@ -49,7 +49,7 @@ The platform uses two primary data sources:
 ### Tool Selection
 
 - **Vector Database**: After evaluating several options, we chose **Pinecone** for its ease of use and targeted support for vector embeddings.
-- **Embedding Model**: The base model is [**sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2**](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), a pre-trained transformer model that encodes sentences and paragraphs into a 384-dimensional dense vector space.
+- **Embedding Model**: The base model is [**sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2**](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), a lightweight pre-trained transformer model that encodes sentences and paragraphs into a 384-dimensional dense vector space.
 - **Finetuned Model**: The base model is finetuned on user-provided data every seven days and stored on HuggingFace. It can be found [**here!**](https://huggingface.co/forestav/job_matching_sentence_transformer)
 - **Backend Updates**: GitHub Actions is used to automate daily updates to the vector database.
 - **Feature Store**: To store user-provided data, we use **Hopsworks**, as it allows for easy feature interaction and lets us save older models to evaluate performance over time.
@@ -57,7 +57,7 @@ The platform uses two primary data sources:
 ### Workflow
 
 1. **Flowchart of JobsAI**
-![JobsAI flowchart structure](https://i.imghippo.com/files/CZk3216mnA.png)
+![JobsAI flowchart structure](https://i.imghippo.com/files/CZk3216mnA.png)
 
 2. **Data Retrieval**:
 
@@ -65,14 +65,14 @@ The platform uses two primary data sources:
 - Metadata such as job title, description, location, and contact details is extracted.
 
 3. **Similarity Search**:
+
 - User-uploaded resumes are vectorized using the same sentence transformer model.
 - Pinecone is queried for the top-k most similar job embeddings, which are then displayed to the user alongside their similarity scores.
 
 4. **Feature Uploading**:
-- If a user chooses to leave feedback, by either clicking *Relevant* or *Not Relevant*, the users CV is uploaded to Hopsworks together with the specific ad data, and the selected choice.
-
+- If a user chooses to leave feedback by clicking either _Relevant_ or _Not Relevant_, the user's CV is uploaded to Hopsworks together with the specific ad data and the selected choice.
 5. **Model Training**:
-- Once every seven days, a chrone job on *Github Actions* runs, where the base model is finetuned on the total data stored in the feature store.
+- Once every seven days, a cron job on _GitHub Actions_ runs, where the base model is finetuned on the total data stored in the feature store.
 
 ---
 
@@ -80,22 +80,20 @@ The platform uses two primary data sources:
 
 ### First-Time Setup
 
-1. Run `bootstrap.py` to:
+1. If you want to have your own Pinecone vector database, run `bootstrap.py` to:
 - Retrieve all job listings using the JobStream API’s snapshot endpoint.
 - Vectorize the listings and insert them into the Pinecone database.
-2. Embeddings and metadata are generated using helper functions:
-- `_create_embedding`: Combines job title, occupation, and description for encoding into a dense vector.
-- `_prepare_metadata`: Extracts additional details like email, location, and timestamps for storage alongside embeddings.
+2. To run the app locally, navigate to the project folder and run `python app.py`.
 
 ### Daily Updates
 
 - **Automated Workflow**: A GitHub Actions workflow runs `main.py` daily at midnight.
-- **Incremental Updates**: The `keep_updated.py` function fetches job listings updated since the last recorded timestamp, ensuring the vector database remains current.
+- **Incremental Updates**: `main.py` fetches job listings updated since the last recorded timestamp, ensuring the vector database remains current.
 
 ### Weekly Updates
 
 - **Automated Workflow**: A GitHub Actions workflow runs `training_pipeline.ipynb` every Sunday at midnight.
-- **Model Training**: Features are downloaded from Hopsworks, and the base LLM is finetuned on the total dataset with both negative and positive examples.
+- **Model Training**: Features are downloaded from Hopsworks, and the base Sentence Transformer is finetuned on the total dataset with both positive and negative examples.
 
 ### Querying for Matches
 
@@ -113,7 +111,7 @@ The platform uses two primary data sources:
 2. A [Pinecone](https://www.pinecone.io/) account and API key.
 3. Arbetsförmedlingen JobStream API access (free).
 4. [Hopsworks](https://www.hopsworks.ai/) account and API key.
-5. [Huggingface](https://huggingface.co/) Account and API key.
+5. [Huggingface](https://huggingface.co/) account and API key/access token.
 
 ### Steps
 
@@ -134,7 +132,7 @@ The platform uses two primary data sources:
 ```
 4. Run the application locally:
 ```bash
-gradio run app.py
+python app.py
 ```
 5. Open the Gradio app in your browser to upload resumes and view job recommendations.
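To make the pipeline in the updated README concrete, the sketches below walk through its main steps in Python. First, the bootstrap step: embed a job ad with the named base model and upsert it into Pinecone. The index name `jobs`, the `combine_fields` helper, and the sample ad are illustrative assumptions, not the project's actual code.

```python
# Hedged sketch of the bootstrap flow: embed a job ad and upsert it into
# Pinecone. The index name "jobs" and combine_fields are assumptions.
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("jobs")  # assumed index name

def combine_fields(ad: dict) -> str:
    # The old README's _create_embedding combined title, occupation, and
    # description; this mirrors that idea with hypothetical field names.
    return " ".join((ad["headline"], ad["occupation"], ad["description"]))

ad = {
    "id": "12345",
    "headline": "Data Engineer",
    "occupation": "IT",
    "description": "Build and maintain data pipelines.",
}
vector = model.encode(combine_fields(ad)).tolist()  # 384-dimensional for this model

index.upsert(vectors=[{
    "id": ad["id"],
    "values": vector,
    "metadata": {"headline": ad["headline"], "occupation": ad["occupation"]},
}])
```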
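The similarity search step embeds the user's resume with the same model and asks Pinecone for the top-k nearest job vectors. A minimal sketch, assuming the same `jobs` index and a `headline` metadata field:

```python
# Hedged sketch of the similarity search: embed the resume text and fetch
# the top-k nearest job vectors with their similarity scores.
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
index = Pinecone(api_key="YOUR_PINECONE_API_KEY").Index("jobs")  # assumed name

resume_text = "Five years of experience with Python, ETL pipelines, and ML."
query_vector = model.encode(resume_text).tolist()

results = index.query(vector=query_vector, top_k=5, include_metadata=True)
for match in results["matches"]:
    # match["score"] is the similarity shown to the user (cosine similarity,
    # if the index was created with that metric).
    print(round(match["score"], 3), match["metadata"].get("headline"))
```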
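The feature uploading step stores the user's CV, the ad data, and the Relevant/Not Relevant choice in Hopsworks. A sketch assuming a hypothetical feature group named `job_match_feedback`; the schema is illustrative:

```python
# Hedged sketch of the feedback upload; the feature group name and schema
# are assumptions, not the project's actual definitions.
import hopsworks
import pandas as pd

project = hopsworks.login()  # reads the Hopsworks API key from the environment
fs = project.get_feature_store()

feedback_fg = fs.get_or_create_feature_group(
    name="job_match_feedback",  # assumed name
    version=1,
    primary_key=["feedback_id"],
    description="User CV, ad data, and Relevant/Not Relevant label",
)
feedback_fg.insert(pd.DataFrame([{
    "feedback_id": "2024-01-01-0001",
    "cv_text": "…resume text…",
    "ad_text": "…job ad text…",
    "label": 1,  # 1 = Relevant, 0 = Not Relevant
}]))
```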
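For the weekly model training, feedback rows can be turned into labeled pairs and used to finetune the base Sentence Transformer. `CosineSimilarityLoss` is one common recipe for positive/negative pairs and is an assumption here, not necessarily what `training_pipeline.ipynb` does:

```python
# Hedged sketch of finetuning on feedback pairs with sentence-transformers.
# The loss choice and toy data are assumptions about the training recipe.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Each feedback row becomes a labeled pair: 1.0 = Relevant, 0.0 = Not Relevant.
train_examples = [
    InputExample(texts=["resume text A", "matching ad text"], label=1.0),
    InputExample(texts=["resume text B", "irrelevant ad text"], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("finetuned_job_matcher")  # could then be uploaded to the HuggingFace Hub
```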
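The daily incremental update fetches listings changed since the last recorded timestamp. A sketch against JobStream's stream endpoint; the URL, the `date` parameter, and the `removed` flag follow JobTech's public documentation and should be verified against the actual `main.py`:

```python
# Hedged sketch of an incremental update from the JobStream API; endpoint
# details follow JobTech's public docs and should be verified.
from datetime import datetime, timedelta, timezone

import requests

last_run = datetime.now(timezone.utc) - timedelta(days=1)  # e.g. read from disk
resp = requests.get(
    "https://jobstream.api.jobtechdev.se/stream",
    params={"date": last_run.strftime("%Y-%m-%dT%H:%M:%S")},
    headers={"Accept": "application/json"},
    timeout=60,
)
resp.raise_for_status()

for ad in resp.json():
    if ad.get("removed"):
        pass  # delete ad["id"] from the Pinecone index
    else:
        pass  # re-embed the ad and upsert it, as in the bootstrap sketch
```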
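Finally, since the app is started with `python app.py` and served through Gradio, the entry point plausibly resembles the following; the interface layout and the `match_jobs` stub are assumptions:

```python
# Hedged sketch of a Gradio entry point started with `python app.py`.
# The interface layout and the match_jobs stub are assumptions.
import gradio as gr

def match_jobs(resume_text: str) -> str:
    # Real app: embed the resume, query Pinecone, format the top matches.
    return "Top job matches with similarity scores would appear here."

demo = gr.Interface(
    fn=match_jobs,
    inputs=gr.Textbox(label="Paste your resume"),
    outputs=gr.Textbox(label="Matches"),
    title="JobsAI (sketch)",
)

if __name__ == "__main__":
    demo.launch()  # serves on http://127.0.0.1:7860 by default
```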