Update README.md
# Mixed-BGE-M3-Email

This is a fine-tuned version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for email content retrieval. The model was trained on a mixed-language (English/Korean) email dataset to improve retrieval performance for various email-related queries.

## Model Description

- **Model Type:** Embedding model (encoder-only)
- **Base Model:** BAAI/bge-m3
- **Languages:** English, Korean
- **Domain:** Email content, business communication
- **Training Data:** Mixed-language email dataset with various types of queries (metadata, long-form, short-form, yes/no questions)

## Quickstart

```python
# Import paths for recent LangChain releases
# (packages: langchain-huggingface, langchain-community, faiss-cpu)
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="doubleyyh/mixed-bge-m3-email",
    model_kwargs={'device': 'cuda'},  # use 'cpu' if no GPU is available
    encode_kwargs={'normalize_embeddings': True}
)

# Example emails
emails = [
    {
        # Subject: "Notice of meeting schedule change"
        "subject": "회의 일정 변경 안내",
        "from": [["김철수", "[email protected]"]],
        "to": [["이영희", "[email protected]"]],
        "cc": [["박지원", "[email protected]"]],
        "date": "2024-03-26T10:00:00",
        # Body: "Hello, I would like to move tomorrow's project meeting to 2 PM."
        "text_body": "안녕하세요, 내일 예정된 프로젝트 미팅을 오후 2시로 변경하고자 합니다."
    },
    {
        "subject": "Project Timeline Update",
        "from": [["John Smith", "[email protected]"]],
        "to": [["Team", "[email protected]"]],
        "cc": [],
        "date": "2024-03-26T11:30:00",
        "text_body": "Hi team, I'm writing to update you on the Q2 project milestones."
    }
]

# Format emails into documents
docs = []
for email in emails:
    # Join every field as "key: value", one field per line
    content = "\n".join([f"{k}: {v}" for k, v in email.items()])
    docs.append(Document(page_content=content))

# Create FAISS index
db = FAISS.from_documents(docs, embeddings)

# Query examples (supports both Korean and English)
queries = [
    "회의 시간이 언제로 변경되었나요?",  # "When was the meeting time changed to?"
    "When is the meeting rescheduled?",
    "프로젝트 일정",  # "Project schedule"
    "Q2 milestones"
]

# Perform similarity search and show the top hit for each query
for query in queries:
    print(f"\nQuery: {query}")
    results = db.similarity_search(query, k=1)
    print(f"Most relevant email:\n{results[0].page_content[:200]}...")
```
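
The Quickstart joins each field as `key: value`, so the address fields end up in the text as raw Python lists. If cleaner input text is preferred, a small formatter along the following lines could flatten them first. This is only a sketch: `format_addresses` and `format_email` are hypothetical helpers, the addresses are placeholders, and the layout is an assumption rather than the format used during fine-tuning, so its effect on retrieval quality is untested.

```python
def format_addresses(pairs):
    """Turn [[name, address], ...] into 'Name <address>, ...'."""
    return ", ".join(f"{name} <{address}>" for name, address in pairs)


def format_email(email):
    """Render an email dict (same keys as the Quickstart) as plain text.

    Hypothetical helper: the layout is an assumption, not the
    format the model was trained on.
    """
    lines = []
    for key, value in email.items():
        if key in ("from", "to", "cc"):
            # Flatten the nested [name, address] pairs into one string
            value = format_addresses(value)
        lines.append(f"{key}: {value}")
    return "\n".join(lines)


example = {
    "subject": "Project Timeline Update",
    "from": [["John Smith", "john@example.com"]],  # placeholder address
    "to": [["Team", "team@example.com"]],          # placeholder address
    "cc": [],
    "date": "2024-03-26T11:30:00",
    "text_body": "Hi team, I'm writing to update you on the Q2 project milestones.",
}

print(format_email(example))
```

The result can be passed to `Document(page_content=...)` in place of the joined string from the Quickstart.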

## Intended Use & Limitations

### Intended Use
- Email content retrieval
- Similar document search in email corpora
- Question answering over email content
- Multi-language email search systems
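
All of these uses come down to ranking emails by embedding similarity. Because the Quickstart sets `normalize_embeddings` to `True`, every embedding has unit length, so the dot product of two embeddings equals their cosine similarity, and ranking by L2 distance produces the same order. The sketch below illustrates this with made-up 3-dimensional vectors standing in for real embeddings; all names and values are hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy vectors, not real model output
query_vec = normalize([1.0, 0.2, 0.0])
email_vecs = {
    "meeting_rescheduled": normalize([0.9, 0.3, 0.1]),
    "q2_milestones": normalize([0.1, 0.2, 0.9]),
    "lunch_menu": normalize([0.0, 1.0, 0.2]),
}

# Higher cosine similarity means more relevant
by_cosine = sorted(email_vecs, key=lambda n: dot(query_vec, email_vecs[n]), reverse=True)
# Lower L2 distance means more relevant
by_l2 = sorted(email_vecs, key=lambda n: l2_distance(query_vec, email_vecs[n]))

print(by_cosine)
print(by_l2)  # same order as by_cosine for unit-length vectors
```

For unit vectors the squared L2 distance is `2 - 2 * cosine`, which is why the two rankings agree.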

### Limitations
- Performance may vary for domains outside of email content
- Best suited to business communication contexts
- Although the model supports both English and Korean, performance may differ between the two languages
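
Since results may differ between English and Korean, it can be useful to measure retrieval quality per language on your own labeled queries. Below is a minimal sketch of hit rate at k grouped by language. The function name, the input layout, and the sample records are hypothetical and are not measurements of this model; in practice the `retrieved_ids` lists would come from a real search such as `db.similarity_search` in the Quickstart.

```python
from collections import defaultdict


def hit_rate_by_language(results, k=1):
    """Fraction of queries per language whose relevant email is in the top k.

    results: iterable of dicts with keys 'lang', 'relevant_id', 'retrieved_ids'
    (retrieved_ids ordered from most to least similar).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        totals[r["lang"]] += 1
        if r["relevant_id"] in r["retrieved_ids"][:k]:
            hits[r["lang"]] += 1
    return {lang: hits[lang] / totals[lang] for lang in totals}


# Made-up records for illustration only
sample = [
    {"lang": "en", "relevant_id": "mail-2", "retrieved_ids": ["mail-2", "mail-1"]},
    {"lang": "en", "relevant_id": "mail-1", "retrieved_ids": ["mail-2", "mail-1"]},
    {"lang": "ko", "relevant_id": "mail-1", "retrieved_ids": ["mail-1", "mail-2"]},
    {"lang": "ko", "relevant_id": "mail-2", "retrieved_ids": ["mail-2", "mail-1"]},
]

print(hit_rate_by_language(sample, k=1))  # {'en': 0.5, 'ko': 1.0}
print(hit_rate_by_language(sample, k=2))  # {'en': 1.0, 'ko': 1.0}
```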

## Citation

```bibtex
@misc{mixed-bge-m3-email,
  author = {doubleyyh},
  title = {Mixed-BGE-M3-Email: Fine-tuned Embedding Model for Email Content},
  year = {2024},
  publisher = {HuggingFace}
}
```

## License

This model follows the same license as the base model (bge-m3).

## Contact

For questions or feedback, please open an issue in the GitHub repository or reach out through Hugging Face.