This repository contains the final project for the course **ID2223 Scalable Machine Learning and Deep Learning** at KTH.

The project culminates in an AI-powered job matching platform, **JobsAI**, designed to help users find job listings tailored to their resumes. The application is hosted on HuggingFace Spaces and can be accessed here:
[**JobsAI**](https://huggingface.co/spaces/forestav/jobsai)

---

### Project Pitch

Finding the right job can be overwhelming, especially with over 40,000 listings available on Arbetsförmedlingen at the time of writing. **JobsAI** streamlines this process by using **vector embeddings** and **similarity search** to match users’ resumes with the most relevant job postings. Say goodbye to endless scrolling and let AI do the heavy lifting!

---

### Tool Selection

- **Vector Database**: After evaluating several options, we chose **Pinecone** for its ease of use and targeted support for vector embeddings.
- **Embedding Model**: The base model is [**sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2**](https://huggingface.co/sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2), a lightweight pre-trained transformer model that encodes sentences and paragraphs into a 384-dimensional dense vector space (see the sketch after this list).
- **Fine-tuned Model**: The base model is fine-tuned on user-provided data every 7 days and stored on HuggingFace. It can be found [**here!**](https://huggingface.co/forestav/job_matching_sentence_transformer)
- **Backend Updates**: GitHub Actions is used to automate daily updates to the vector database.
- **Feature Store**: We use **Hopsworks** to store user-provided data: it makes feature interaction easy and lets us keep older models so we can evaluate performance over time.
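
As a quick illustration of what the embedding model does, the snippet below encodes a short job-ad text into a 384-dimensional vector; the text and variable names are made up for the example:

```python
from sentence_transformers import SentenceTransformer

# Load the base multilingual model used by JobsAI.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Encode an example job-ad text (illustrative, not from the real data).
text = "Data Engineer. Build and maintain scalable data pipelines in Python."
embedding = model.encode(text)

print(embedding.shape)  # (384,)
```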

### Workflow

1. **Flowchart of JobsAI**

   

2. **Data Retrieval**:
   - Metadata such as job title, description, location, and contact details is extracted.

3. **Similarity Search**:
   - User-uploaded resumes are vectorized using the same sentence transformer model.
   - Pinecone is queried for the top-k most similar job embeddings, which are then displayed to the user alongside their similarity scores (see the query sketch after this list).

4. **Feature Uploading**:
   - If a user chooses to leave feedback by clicking either _Relevant_ or _Not Relevant_, the user's CV is uploaded to Hopsworks together with the specific ad data and the selected choice.

5. **Model Training**:
   - Once every seven days, a cron job on _GitHub Actions_ runs, fine-tuning the base model on all data stored in the feature store.
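
To make step 3 concrete, here is a minimal sketch of the resume-to-matches query, assuming the current Pinecone Python client, an index called `jobs`, and a `title` metadata field; these names are illustrative rather than the repository's exact code:

```python
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("jobs")  # hypothetical index name

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# Vectorize the resume with the same model used for the job ads.
resume_vector = model.encode("...resume text...").tolist()

# Fetch the top-k most similar job embeddings, with their metadata.
results = index.query(vector=resume_vector, top_k=5, include_metadata=True)
for match in results.matches:
    print(match.id, round(match.score, 3), match.metadata.get("title"))
```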

---

### First-Time Setup

1. If you want your own Pinecone vector database, run `bootstrap.py` (sketched below) to:
   - Retrieve all job listings using the JobStream API’s snapshot endpoint.
   - Vectorize the listings and insert them into the Pinecone database.
2. To run the app locally, navigate to the folder and run `python app.py`.
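
For orientation, a hedged sketch of what the bootstrap step amounts to, assuming JobTech's public snapshot endpoint and ad fields named `id`, `headline`, and `description.text`; the repository's actual `bootstrap.py` may differ:

```python
import requests
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

SNAPSHOT_URL = "https://jobstream.api.jobtechdev.se/snapshot"  # assumed endpoint

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")
index = pc.Index("jobs")  # hypothetical index name

ads = requests.get(SNAPSHOT_URL, headers={"Accept": "application/json"}).json()

vectors = []
for ad in ads:
    # Combine headline and description text for embedding.
    text = f"{ad.get('headline', '')} {(ad.get('description') or {}).get('text', '')}"
    vectors.append({
        "id": str(ad["id"]),
        "values": model.encode(text).tolist(),
        "metadata": {"title": ad.get("headline", "")},
    })

# Upsert in batches to stay within Pinecone request-size limits.
for i in range(0, len(vectors), 100):
    index.upsert(vectors=vectors[i:i + 100])
```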

### Daily Updates

- **Automated Workflow**: A GitHub Actions workflow runs `main.py` daily at midnight.
- **Incremental Updates**: `main.py` fetches only the job listings updated since the last recorded timestamp, keeping the vector database current (see the sketch below).
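
A hedged sketch of such an incremental update, assuming JobStream's stream endpoint accepts a `date` parameter for "changed since" and flags removed ads with a `removed` field; the real `main.py` may differ:

```python
import requests

STREAM_URL = "https://jobstream.api.jobtechdev.se/stream"  # assumed endpoint

# In practice the timestamp of the last successful run is persisted somewhere.
last_run = "2025-01-01T00:00:00"

changed = requests.get(
    STREAM_URL,
    params={"date": last_run},  # assumed "changed since" parameter
    headers={"Accept": "application/json"},
).json()

for ad in changed:
    if ad.get("removed"):
        ...  # delete the corresponding vector from Pinecone
    else:
        ...  # re-embed the ad and upsert it, as in the bootstrap sketch
```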

### Weekly Updates

- **Automated Workflow**: A GitHub Actions workflow runs `training_pipeline.ipynb` every Sunday at midnight.
- **Model Training**: Features are downloaded from Hopsworks, and the base Sentence Transformer is fine-tuned on the full dataset of both positive and negative examples (sketched below).
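
As a sketch of what such a fine-tuning pass can look like with the sentence-transformers training API (the loss, batch size, and epochs here are illustrative, not the notebook's actual settings):

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

# label 1.0 = the user marked the ad Relevant, 0.0 = Not Relevant.
examples = [
    InputExample(texts=["...resume text...", "...relevant ad text..."], label=1.0),
    InputExample(texts=["...resume text...", "...irrelevant ad text..."], label=0.0),
]

loader = DataLoader(examples, shuffle=True, batch_size=16)
loss = losses.CosineSimilarityLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
model.save("finetuned_job_matcher")  # afterwards pushed to HuggingFace
```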

### Querying for Matches

2. A [Pinecone](https://www.pinecone.io/) account and API key.
3. Arbetsförmedlingen JobStream API access (free).
4. [Hopsworks](https://www.hopsworks.ai/) account and API key.
5. [Huggingface](https://huggingface.co/) account and API key/access token.
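
These credentials are typically passed to the pipelines as environment variables (locally via a `.env` file or as GitHub Actions secrets); the variable names below are illustrative, not necessarily the ones the repository reads:

```python
import os

# Illustrative names; check the repository for the exact variables it uses.
pinecone_key = os.environ["PINECONE_API_KEY"]
hopsworks_key = os.environ["HOPSWORKS_API_KEY"]
hf_token = os.environ["HF_TOKEN"]
```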

### Steps

4. Run the application locally:
   ```bash
   python app.py
   ```
5. Open the Gradio app in your browser to upload resumes and view job recommendations.