huyydangg
/

DEk21_hcmute_embedding

@@ -425,19 +425,23 @@ model-index:
 # bkai-fine-tuned-legal
-This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [bkai-foundation-models/vietnamese-bi-encoder](https://huggingface.co/bkai-foundation-models/vietnamese-bi-encoder) on the json dataset. It maps sentences & paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
 ## Model Details
 ### Model Description
 - **Model Type:** Sentence Transformer
 - **Base model:** [bkai-foundation-models/vietnamese-bi-encoder](https://huggingface.co/bkai-foundation-models/vietnamese-bi-encoder) <!-- at revision 84f9d9ada0d1a3c37557398b9ae9fcedcdf40be0 -->
-- **Maximum Sequence Length:** 256 tokens
 - **Output Dimensionality:** 768 dimensions
 - **Similarity Function:** Cosine Similarity
-- **Training Dataset:**
-    - json
-- **Language:** vi
 - **License:** apache-2.0
 ### Model Sources
@@ -468,48 +472,37 @@ pip install -U sentence-transformers
 Then you can load this model and run inference.
 ```python
 from sentence_transformers import SentenceTransformer
 # Download from the 🤗 Hub
-model = SentenceTransformer("sentence_transformers_model_id")
-# Run inference
-sentences = [
-    'Điều 29 Nghị định 46/2015 NĐ-CP quy định về thí nghiệm đối chứng, kiểm định chất lượng, thí nghiệm khả năng chịu lực của kết cấu công trình trong quá trình thi công xây dựng. Tôi xin hỏi, trong dự toán công trình giao thông có chi phí kiểm định tạm tính, chủ đầu tư có quyền lập đề cương, dự toán rồi giao cho phòng thẩm định kết quả có giá trị, sau đó thực hiện thuê đơn vị tư vấn có chức năng thực hiện công tác kiểm định được không?Bộ Xây dựng trả lời vấn đề này như sau:Trường hợp kiểm định theo quy định tại Điểm a, Điểm b, Điểm c, Khoản 2, Điều 29 (thí nghiệm đối chứng, kiểm định chất lượng, thí nghiệm khả năng chịu lực của kết cấu công trình trong quá trình thi công xây dựng) Nghị định46/2015/NĐ-CPngày 12/5/2015 của Chính phủ về quản lý chất lượng và bảo trì công trình xây dựng thì việc lập đề cương, dự toán kiểm định do tổ chức đáp ứng điều kiện năng lực theo quy định của pháp luật thực hiện.Đối với trường hợp kiểm định theo quy định tại Điểm đ, Khoản 2, Điều 29 Nghị định46/2015/NĐ-CPthì thực hiện theo quy định tại Điều 18 Thông tư26/2016/TT-BXDngày 26/10/2016 của Bộ Xây dựng quy định chi tiết một số nội dung về quản lý chất lượng và bảo trì công trình xây dựng.',
-    'Có thể thuê kiểm định chất lượng công trình?',
-    'Quy định về trợ cấp với cán bộ xã già yếu nghỉ việc',
 ]
-embeddings = model.encode(sentences)
-print(embeddings.shape)
-# [3, 768]
-# Get the similarity scores for the embeddings
-similarities = model.similarity(embeddings, embeddings)
-print(similarities.shape)
-# [3, 3]
-```
-<!--
-### Direct Usage (Transformers)
-<details><summary>Click to see the direct usage in Transformers</summary>
-</details>
--->
-<!--
-### Downstream Usage (Sentence Transformers)
-You can finetune this model on your own dataset.
-<details><summary>Click to expand</summary>
-</details>
--->
-<!--
-### Out-of-Scope Use
-*List how the model may foreseeably be misused and address what users ought not to do with the model.*
--->
 ## Evaluation

 # bkai-fine-tuned-legal
+LEGAL-EMBEDDING is a Vietnamese text embedding model focused on RAG and production efficiency:
+📚 **Training Dataset**:
+The model was trained on an in-house dataset consisting of approximately **50,000 examples** of legal questions and their related contexts.
+🪆 **Efficiency**:
+Trained with a **Matryoshka loss**, so embeddings can be truncated to fewer dimensions with minimal performance loss. Smaller embeddings are cheaper to store and faster to compare, which makes the model efficient for real-world production use.
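
A minimal sketch of what truncating a Matryoshka-style embedding involves: keep the leading components and re-normalize before comparing. The vectors below are random stand-ins, not outputs of this model, and the target sizes 256 and 64 are illustrative assumptions, since the card does not list the dimensions used during training. The helper names are hypothetical.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def truncate_embedding(vec, dim):
    """Keep the first `dim` components, then re-normalize to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError(f"dim must be in 1..{len(vec)}, got {dim}")
    return l2_normalize(vec[:dim])


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Synthetic 768-dimensional vectors standing in for real model outputs.
rng = random.Random(0)
query_vec = [rng.gauss(0.0, 1.0) for _ in range(768)]
doc_vec = [rng.gauss(0.0, 1.0) for _ in range(768)]

# Hypothetical truncation sizes; the trained Matryoshka dimensions are not stated.
for dim in (768, 256, 64):
    q = truncate_embedding(query_vec, dim)
    d = truncate_embedding(doc_vec, dim)
    print(f"dim={dim:4d}  cosine={cosine_similarity(q, d):.4f}")
```

With real embeddings, the same truncation would be applied to the output of `model.encode`. Recent sentence-transformers releases also accept a `truncate_dim` argument when constructing `SentenceTransformer`, which may be more convenient if the installed version supports it.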
 ## Model Details
 ### Model Description
 - **Model Type:** Sentence Transformer
 - **Base model:** [bkai-foundation-models/vietnamese-bi-encoder](https://huggingface.co/bkai-foundation-models/vietnamese-bi-encoder) <!-- at revision 84f9d9ada0d1a3c37557398b9ae9fcedcdf40be0 -->
+- **Maximum Sequence Length:** 512 tokens
 - **Output Dimensionality:** 768 dimensions
 - **Similarity Function:** Cosine Similarity
+- **Language:** Vietnamese
 - **License:** apache-2.0
 ### Model Sources
 Then you can load this model and run inference.
 ```python
 from sentence_transformers import SentenceTransformer
+import torch
 # Download from the 🤗 Hub
+model = SentenceTransformer("quanghuy123/LEGAL_EMBEDDING")
+# Define the query (a legal question) and the docs (legal provisions)
+query = "Điều kiện để kết hôn hợp pháp là gì?"
+docs = [
+    "Điều 8 Bộ luật Dân sự 2015 quy định về quyền và nghĩa vụ của công dân trong quan hệ gia đình.",
+    "Điều 18 Luật Hôn nhân và gia đình 2014 quy định về độ tuổi kết hôn của nam và nữ.",
+    "Điều 14 Bộ luật Dân sự 2015 quy định về quyền và nghĩa vụ của cá nhân khi tham gia hợp đồng.",
+    "Điều 27 Luật Hôn nhân và gia đình 2014 quy định về các trường hợp không được kết hôn.",
+    "Điều 51 Luật Hôn nhân và gia đình 2014 quy định về việc kết hôn giữa công dân Việt Nam và người nước ngoài."
 ]
+# Encode query and documents
+query_embedding = model.encode([query])
+doc_embeddings = model.encode(docs)
+similarities = torch.nn.functional.cosine_similarity(
+    torch.tensor(query_embedding), torch.tensor(doc_embeddings)
+).flatten()
+# Sort documents by cosine similarity
+sorted_indices = torch.argsort(similarities, descending=True)
+sorted_docs = [docs[idx] for idx in sorted_indices]
+sorted_scores = [similarities[idx].item() for idx in sorted_indices]
+# Print sorted documents with their cosine scores
+for doc, score in zip(sorted_docs, sorted_scores):
+    print(f"Document: {doc} - Cosine Similarity: {score:.4f}")
+```
 ## Evaluation