Spaces: forestav/jobsai
forestav committed on Jan 5

Commit da21d3b

1 Parent(s): f0678ae

use lightweight multilingual model

Files changed (4)

README.md CHANGED

@@ -61,5 +61,4 @@ Querying from the Pinecone vector database is simple and fast thanks to the Pine
 1. The [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) truncates input text longer than 256 word pieces. To capture all the semantics from job listings, we probably need a sentence transformer which can embed longer inputs texts.
 2. The [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2) is not optimized for multilingual text. Many people in Sweden have their resumes in Swedish, so better performance would probably achieved with a multilingual model.
-3. We currently truncate the job descriptions after 1000 characters. To capture the full context, we should not truncate the job descriptions from the listings. This requires more data storage but would give better performance.
-4. Users should be able to filter on municipality or location, because the current app ignores where the person wants to work (often not explicitly mentioned in their resume), making many job listings not relevant anyway.
+3. Users should be able to filter on municipality or location, because the current app ignores where the person wants to work (often not explicitly mentioned in their resume), making many job listings not relevant anyway.
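
Reviewer note: the remaining item about filtering on municipality could be approached with Pinecone metadata filters. The following is a minimal sketch, not the repository's implementation. It assumes each vector was upserted with a hypothetical `municipality` metadata field, and the helper name `build_location_filter` is invented here for illustration.

```python
from typing import Any, Dict, List, Optional


def build_location_filter(municipalities: List[str]) -> Optional[Dict[str, Any]]:
    """Build a Pinecone-style metadata filter for a list of municipalities.

    Assumption: vectors carry a hypothetical 'municipality' metadata field.
    Returns None when no usable location is given, meaning "do not filter".
    """
    # Drop empty entries and surrounding whitespace from user input.
    cleaned = [m.strip() for m in municipalities if m.strip()]
    if not cleaned:
        return None
    # "$in" matches vectors whose field equals any of the listed values.
    return {"municipality": {"$in": cleaned}}
```

The returned dictionary could then be passed as the `filter` argument of the index's `query` call, alongside `vector`, `top_k`, and `include_metadata=True`. Returning `None` for empty input keeps the current unfiltered behavior for users who do not state a location.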

pinecone_handler.py CHANGED

@@ -40,20 +40,22 @@ class PineconeHandler:
             log.info(f"Creating new index '{PINECONE_INDEX_NAME}'")
             spec = ServerlessSpec(
                 cloud="aws",
-                region="us-west-2"
+                region="us-east-1"
             )
             self.pc.create_index(
                 name=PINECONE_INDEX_NAME,
-                dimension=512,
+                dimension=384,
                 metric="cosine",
                 spec=spec
             )
             self.index = self.pc.Index(PINECONE_INDEX_NAME)
         #self.model = SentenceTransformer('all-MiniLM-L6-v2')
+        #self.model = SentenceTransformer('intfloat/multilingual-e5-large')
+        self.model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2')
         #512 token max length, embedding dim 768
-        self.model = SentenceTransformer('sentence-transformers/allenai-specter')
+        #self.model = SentenceTransformer('sentence-transformers/allenai-specter')
         log.info(f"Initialized connection to Pinecone index '{PINECONE_INDEX_NAME}'")
     def _create_embedding(self, ad: Dict[str, Any]) -> List[float]:
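
Reviewer note: this hunk changes the embedding model and the hard-coded index `dimension` together, and the two must agree because a Pinecone index's dimension is fixed when the index is created. The following is a minimal sketch of a guard that could catch a mismatch early. The helper name `check_index_dimension` is hypothetical and not part of the repository; it relies only on the `get_sentence_embedding_dimension()` method that `SentenceTransformer` models expose.

```python
def check_index_dimension(model, index_dimension: int) -> int:
    """Raise if the model's embedding size differs from the index dimension.

    `model` is any object with a get_sentence_embedding_dimension() method,
    such as a sentence_transformers.SentenceTransformer instance.
    """
    model_dim = model.get_sentence_embedding_dimension()
    if model_dim != index_dimension:
        raise ValueError(
            f"Model produces {model_dim}-dimensional embeddings, "
            f"but the index expects {index_dimension}."
        )
    return model_dim
```

Called with the `paraphrase-multilingual-MiniLM-L12-v2` model and `dimension=384` as in this commit, the check would pass, since that model produces 384-dimensional embeddings. An alternative design would be to pass the model's reported dimension directly to `create_index` instead of a literal.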

settings.py CHANGED

@@ -1,7 +1,8 @@
 import logging
 PINECONE_ENVIRONMENT = "gcp-starter"
-PINECONE_INDEX_NAME = "jobads-index"
+#PINECONE_INDEX_NAME = "jobads-index"
+PINECONE_INDEX_NAME = "jobsai-multilingual-small"
 DB_TABLE_NAME = 'jobads'
 DB_FILE_NAME = 'jobads_database_20220127.db'

timestamp2.txt CHANGED

@@ -1 +1 @@
-2025-01-05T02:42:09
+2025-01-05T22:38:10