doubleyyh committed · commit c6d0683 (verified) · 1 parent: 877da5d

Update README.md

Files changed (1):
  1. README.md +102 -17
README.md CHANGED
@@ -1,18 +1,103 @@
-
- # Mixed BGE-M3 Email Model
-
- This model is a mixture of BAAI/bge-m3 and a fine-tuned version optimized for email understanding.
-
- ## Model Details
- - Base Model: BAAI/bge-m3
- - Fine-tuning: Custom email dataset
- - Mixing Weights: 0.5 base, 0.5 fine-tuned
-
- ## Usage
- ```python
- from transformers import AutoModel, AutoTokenizer
-
- model = AutoModel.from_pretrained("your-username/mixed-bge-m3-email")
- tokenizer = AutoTokenizer.from_pretrained("your-username/mixed-bge-m3-email")
- ```
-
+ # Mixed-BGE-M3-Email

+ This is a fine-tuned version of [BAAI/bge-m3](https://huggingface.co/BAAI/bge-m3) optimized for email content retrieval. The model was trained on a mixed-language (English/Korean) email dataset to improve retrieval performance for various email-related queries.
+
+ ## Model Description
+
+ - **Model Type:** Embedding model (encoder-only)
+ - **Base Model:** BAAI/bge-m3
+ - **Languages:** English, Korean
+ - **Domain:** Email content, business communication
+ - **Training Data:** Mixed-language email dataset with various types of queries (metadata, long-form, short-form, yes/no questions)
+
+ ## Quickstart
+
+ ```python
+ from langchain.embeddings import HuggingFaceEmbeddings  # on newer LangChain: langchain_community.embeddings
+ from langchain.vectorstores import FAISS                 # on newer LangChain: langchain_community.vectorstores
+ from langchain.docstore.document import Document
+
+ # Initialize the embedding model (normalized embeddings -> cosine similarity)
+ embeddings = HuggingFaceEmbeddings(
+     model_name="doubleyyh/mixed-bge-m3-email",
+     model_kwargs={'device': 'cuda'},
+     encode_kwargs={'normalize_embeddings': True}
+ )
+
+ # Example emails (one Korean, one English)
+ emails = [
+     {
+         "subject": "회의 일정 변경 안내",  # "Notice of meeting schedule change"
+         "from": [["김철수", "[email protected]"]],
+         "to": [["이영희", "[email protected]"]],
+         "cc": [["박지원", "[email protected]"]],
+         "date": "2024-03-26T10:00:00",
+         # "Hello, we would like to move tomorrow's scheduled project meeting to 2 PM."
+         "text_body": "안녕하세요, 내일 예정된 프로젝트 미팅을 오후 2시로 변경하고자 합니다."
+     },
+     {
+         "subject": "Project Timeline Update",
+         "from": [["John Smith", "[email protected]"]],
+         "to": [["Team", "[email protected]"]],
+         "cc": [],
+         "date": "2024-03-26T11:30:00",
+         "text_body": "Hi team, I'm writing to update you on the Q2 project milestones."
+     }
+ ]
+
+ # Format each email into a single "key: value" document
+ docs = []
+ for email in emails:
+     content = "\n".join([f"{k}: {v}" for k, v in email.items()])
+     docs.append(Document(page_content=content))
+
+ # Create FAISS index
+ db = FAISS.from_documents(docs, embeddings)
+
+ # Query examples (supports both Korean and English)
+ queries = [
+     "회의 시간이 언제로 변경되었나요?",  # "When was the meeting time changed?"
+     "When is the meeting rescheduled?",
+     "프로젝트 일정",  # "Project schedule"
+     "Q2 milestones"
+ ]
+
+ # Perform similarity search
+ for query in queries:
+     print(f"\nQuery: {query}")
+     results = db.similarity_search(query, k=1)
+     print(f"Most relevant email:\n{results[0].page_content[:200]}...")
+ ```
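+
+ The model can also be used without LangChain by loading it directly with `sentence-transformers` (a minimal sketch, assuming the repository keeps the standard Sentence-Transformers configuration inherited from bge-m3):
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ # Load the checkpoint as a Sentence-Transformers model
+ model = SentenceTransformer("doubleyyh/mixed-bge-m3-email")
+
+ corpus = [
+     "Subject: Project Timeline Update\nHi team, I'm writing to update you on the Q2 project milestones.",
+     "Subject: 회의 일정 변경 안내\n내일 예정된 프로젝트 미팅을 오후 2시로 변경하고자 합니다.",  # meeting moved to 2 PM
+ ]
+ queries = ["When is the meeting rescheduled?", "Q2 milestones"]
+
+ # With normalized embeddings, the dot product equals cosine similarity
+ doc_emb = model.encode(corpus, normalize_embeddings=True)
+ query_emb = model.encode(queries, normalize_embeddings=True)
+
+ scores = query_emb @ doc_emb.T  # shape: (num_queries, num_docs)
+ for query, row in zip(queries, scores):
+     best = int(row.argmax())
+     print(f"{query!r} -> corpus[{best}] (score={row[best]:.3f})")
+ ```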
+
+ ## Intended Use & Limitations
+
+ ### Intended Use
+ - Email content retrieval
+ - Similar document search in email corpora
+ - Question answering over email content (see the retriever sketch after this list)
+ - Multi-language email search systems
+
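+ For question answering or RAG-style pipelines, the FAISS index from the Quickstart can be exposed as a retriever and the returned emails passed as context to a downstream reader or LLM (a minimal sketch continuing the Quickstart example):
+
+ ```python
+ # Continuing from the Quickstart: wrap the FAISS index as a retriever
+ retriever = db.as_retriever(search_kwargs={"k": 2})
+
+ question = "When is the meeting rescheduled?"
+ context_docs = retriever.get_relevant_documents(question)
+
+ # Concatenate the retrieved emails into a context block for any downstream QA model
+ context = "\n\n".join(doc.page_content for doc in context_docs)
+ print(context)
+ ```
+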
+ ### Limitations
+ - Performance may vary on domains outside email content
+ - Best suited for business-communication contexts
+ - Although both English and Korean are supported, performance may differ between the two languages
+
+ ## Citation
+
+ ```bibtex
+ @misc{mixed-bge-m3-email,
+   author    = {doubleyyh},
+   title     = {Mixed-BGE-M3-Email: Fine-tuned Embedding Model for Email Content},
+   year      = {2024},
+   publisher = {HuggingFace}
+ }
+ ```
+
+ ## License
+
+ This model follows the same license as the base model (bge-m3).
+
+ ## Contact
+
+ For questions or feedback, please open an issue in the GitHub repository or reach out through HuggingFace.