parth parekh committed · Commit 1d54b01 · Parent(s): 46fd641

added readme and raw training files

Browse files:
- README.md +118 -110
- raw/contact_sharing_epoch_1.pth +3 -0
- raw/tester.py +103 -0
- raw/trainer.py +173 -0
- raw/uploader.py +84 -0
README.md CHANGED

(The previous revision was the auto-generated 🤗 model card template, consisting of placeholder headings and "[More Information Needed]" stubs; it is replaced by the content below.)
---
library_name: transformers
tags:
- text-classification
- contact-information-detection
- privacy
---

# Model Card for ContactShieldAI

ContactShieldAI is a text classification model designed to detect whether users are sharing contact information on freelancing websites. It helps maintain privacy and adherence to platform guidelines by identifying instances where users may be attempting to circumvent communication policies.

### Model Description

ContactShieldAI is based on an enhanced CNN-LSTM architecture, combining the strengths of convolutional and recurrent neural networks for effective text classification.

- **Developed by:** xxparthparekhxx
- **Model type:** Text Classification
- **Language(s):** English
- **License:** Apache 2.0
- **Finetuned from model:** Trained from scratch, initialized with GloVe embeddings

### Model Sources [optional]

- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

### Uses

ContactShieldAI is designed for:

- Detecting contact information sharing in text on freelancing platforms
- Enhancing privacy protection in online marketplaces
- Assisting moderators in identifying policy violations

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

ContactShieldAI can be fine-tuned or integrated into:

- Content moderation systems for social media platforms
- Customer support chatbots to protect user privacy
- Email filtering systems to detect potential policy violations

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

ContactShieldAI should not be used for:

- Censoring legitimate communication that doesn't violate platform policies
- Invading user privacy by scanning personal conversations without consent
- Making decisions about user accounts without human review
- Detecting contact information in languages other than English (current version)

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

- The model is trained on synthetic data and may not capture all real-world variations
- It is specifically tailored to English-language text
- Performance may vary on very short or heavily obfuscated text

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

While this model is designed to enhance privacy and policy compliance, users should be aware of potential biases in the training data. It should be used as a tool to assist human moderators rather than as the sole decision-maker in content moderation.

## How to Get Started with the Model

You can use the model directly with the Hugging Face Transformers library:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("xxparthparekhxx/ContactShieldAI")
model = AutoModelForSequenceClassification.from_pretrained("xxparthparekhxx/ContactShieldAI")

text = "Please contact me at [email protected] or call 555-1234."
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()

print("Contains contact info" if prediction == 1 else "No contact info")
```
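Note: the underlying network is a custom CNN-LSTM rather than a stock Transformers architecture, so the Auto classes may not resolve it out of the box. If loading this way fails, the direct PyTorch path shown in raw/tester.py below (rebuild the classifier, then load the checkpoint with `load_state_dict`) is the fallback.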
### Training Data

The model was trained on a synthetically generated dataset:

- 200,000 examples created using LLaMA 3.1 70B
- Balanced dataset of positive (containing contact info) and negative examples
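The exact record layout is not documented here, but `load_data()` in raw/trainer.py reads `item['text']` and `item['label']` from a JSON array, which implies records like the following (field values are illustrative):

```python
# Assumed shape of contacts_data.json after json.load(): a list of dicts,
# where label 1 marks contact-information sharing and 0 marks clean text.
example_records = [
    {"text": "DM me on the bird app, handle at_john_doe", "label": 1},
    {"text": "The deliverables will be ready by Friday.", "label": 0},
]
```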
### Training Procedure

The training procedure for ContactShieldAI follows these steps:

1. Data Preparation:
   - Load the dataset using the `load_data()` function
   - Create a vocabulary from the dataset using `build_vocab_from_iterator()`
   - Initialize GloVe embeddings for the vocabulary

2. Model Initialization:
   - Create an instance of `EnhancedContactSharingModel`
   - Initialize the embedding layer with pretrained GloVe embeddings

3. Dataset and DataLoader Creation:
   - Create a `ContactSharingDataset` instance
   - Use `DataLoader` with a custom `collate_batch` function for efficient batching

4. Training Loop:
   - Implement k-fold cross-validation (k=5) using `KFold` from sklearn
   - For each fold:
     - Reset model parameters (except embeddings, which are restored from GloVe)
     - Create train and validation data loaders
     - Initialize the Adam optimizer and ReduceLROnPlateau scheduler
     - Train for a specified number of epochs (default: 4)
     - In each epoch:
       - Iterate through batches, compute loss, and update model parameters
       - Evaluate on the validation set and update the learning rate if needed
     - Save the best model based on validation loss

5. Evaluation:
   - Implement an `evaluate()` function to compute loss on a given dataset

6. Prediction:
   - Implement a `predict()` function for making predictions on new text inputs

The training process uses learning rate scheduling, best-checkpoint selection on validation loss, and k-fold cross-validation to ensure robust performance and generalization; the full implementation is included verbatim in raw/trainer.py below.
<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

#### Preprocessing [optional]

- Text tokenization using spaCy
- Vocabulary built from the training data
- Texts padded to uniform length
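A minimal sketch of this pipeline, reusing the torchtext/spaCy tooling from raw/trainer.py (the explicit `vocab` argument is a small adaptation; the original uses a module-level vocabulary):

```python
from torchtext.data.utils import get_tokenizer

# spaCy-backed tokenizer (requires the en_core_web_sm model to be installed).
tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

def text_pipeline(text, vocab):
    # Tokenize, then map tokens to indices; unseen tokens resolve to
    # vocab["<unk>"] via the vocabulary's default index.
    return [vocab[token] for token in tokenizer(text)]

# Padding to a uniform length happens per batch in collate_batch(), using
# nn.utils.rnn.pad_sequence(..., padding_value=vocab["<pad>"]).
```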
#### Training Hyperparameters

- Optimizer: Adam (lr=0.0001, weight_decay=1e-5)
- Loss Function: Cross Entropy Loss
- Learning Rate Scheduler: ReduceLROnPlateau (factor=0.1, patience=3)
- Batch Size: 128
- Epochs: 15 (the best checkpoint is kept based on validation loss)
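These settings correspond directly to the setup in raw/trainer.py; a condensed sketch (the `model` here is a stand-in so the snippet runs on its own):

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 2)  # stand-in; the real model is EnhancedContactSharingModel

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
criterion = nn.CrossEntropyLoss()
# LR is multiplied by 0.1 when validation loss plateaus for 3 epochs;
# stepped once per epoch as scheduler.step(val_loss).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=3
)
```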
### Results

The model reached a validation loss of 0.0211, indicating high accuracy in detecting contact-information sharing.

#### Summary

ContactShieldAI is a model designed to detect contact-information sharing in text. Key points:

- Trained on a large, balanced dataset of 200,000 examples
- Uses an architecture combining a bidirectional LSTM with multi-scale CNNs
- Reaches a validation loss of 0.0211
- Usable through the Hugging Face Transformers library
- Suitable for applications requiring privacy protection and data security

The architecture and training procedure are tuned for efficient, accurate detection of contact information, making the model a practical tool for safeguarding user privacy in text-based applications.
## Technical Specifications [optional]

### Model Architecture and Objective

ContactShieldAI uses the following architecture:

1. **Embedding Layer:** Initialized with GloVe 6B 300d embeddings, expanded to 600d
2. **Bidirectional LSTM:** Processes the embedded sequence
3. **Multi-scale CNN:** Multiple convolutional layers with different filter sizes (3 to 10)
4. **Max Pooling:** Applied after each convolutional layer
5. **Fully Connected Layers:** Two FC layers with ReLU activation and dropout
6. **Output Layer:** 2-dimensional output for binary classification

Key Parameters:

- Vocabulary Size: 225,817
- Embedding Dimension: 600
- Number of Filters: 600
- Filter Sizes: [3, 4, 5, 6, 7, 8, 9, 10]
- LSTM Hidden Dimension: 768
- Dropout: 0.5
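A quick sanity check of the layer dimensions these parameters imply, matching the layer definitions in raw/trainer.py:

```python
# Dimensions implied by the key parameters above.
filter_sizes = [3, 4, 5, 6, 7, 8, 9, 10]
num_filters = 600
lstm_hidden_dim = 768

conv_in = 2 * lstm_hidden_dim             # BiLSTM output channels: 1536
fc1_in = len(filter_sizes) * num_filters  # one pooled vector per filter size: 4800
fc1_out = fc1_in // 2                     # 2400

print(conv_in, fc1_in, fc1_out)  # 1536 4800 2400
```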
## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and BibTeX information for that should go in this section. -->

@misc{ContactShieldAI,
  author       = {xxparthparekhxx},
  title        = {ContactShieldAI: A Model for Detecting Contact Information Sharing},
  year         = {2023},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/xxparthparekhxx/ContactShieldAI}}
}
raw/contact_sharing_epoch_1.pth ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:bdb70e711c212856ce3df95b82afbae57b8fc34243b3f541ecd65963fa81fd92
size 813497259
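(This is a Git LFS pointer file: the checkpoint itself, about 813 MB, lives in LFS storage and is fetched by its SHA-256 hash when the repository is cloned with git-lfs installed.)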
raw/tester.py ADDED

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchtext.data.utils import get_tokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class ContactSharingClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_filters, filter_sizes, lstm_hidden_dim, output_dim, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden_dim, bidirectional=True, batch_first=True)
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=lstm_hidden_dim*2, out_channels=num_filters, kernel_size=fs)
            for fs in filter_sizes
        ])
        self.fc1 = nn.Linear(len(filter_sizes) * num_filters, len(filter_sizes) * num_filters // 2)
        self.fc2 = nn.Linear(len(filter_sizes) * num_filters // 2, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(len(filter_sizes) * num_filters)

    def forward(self, text):
        embedded = self.embedding(text)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = lstm_out.permute(0, 2, 1)
        conved = [F.relu(conv(lstm_out)) for conv in self.convs]
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        cat = self.dropout(torch.cat(pooled, dim=1))
        cat = self.layer_norm(cat)
        x = F.relu(self.fc1(cat))
        x = self.dropout(x)
        return self.fc2(x)

# Initialize tokenizer and vocabulary
tokenizer = get_tokenizer("spacy", language="en_core_web_sm")
vocab = torch.load('vocab.pth')  # Assuming you've saved the vocabulary

# Define text pipeline
def text_pipeline(x):
    return [vocab[token] for token in tokenizer(x)]

# Model parameters
VOCAB_SIZE = len(vocab)
EMBED_DIM = 600
NUM_FILTERS = 600
FILTER_SIZES = [3, 4, 5, 6, 7, 8, 9, 10]
LSTM_HIDDEN_DIM = 768
OUTPUT_DIM = 2
DROPOUT = 0.5
PAD_IDX = vocab["<pad>"]

# Load the model
model = ContactSharingClassifier(VOCAB_SIZE, EMBED_DIM, NUM_FILTERS, FILTER_SIZES, LSTM_HIDDEN_DIM, OUTPUT_DIM, DROPOUT, PAD_IDX)
model.load_state_dict(torch.load('contact_sharing_epoch_1.pth', map_location=device))
model.to(device)
model.eval()

# Test sentences
test_sentences = [
    "You can reach me at my electronic mail address, it's my first name dot last name at that popular search engine company's mail service.",
    "Call me on my cellular device, the digits are the same as the year the Declaration of Independence was signed, followed by my birth year, twice.",
    "Visit my online presence at triple w dot my full name without spaces or punctuation dot com.",
    "Send a message to username 'not_my_real_name' on that instant messaging platform that starts with 'disc' and ends with 'ord'.",
    "My contact info is hidden in this sentence: Eight Six Seven Five Three Oh Nine.",
    "Find me on the professional networking site, just search for my name plus 'software engineer in San Francisco'.",
    "My handle on the bird-themed social media platform is at symbol followed by 'definitely_not_my_email_address'.",
    "You know that video sharing site? My channel is there, just add 'cool_coder_' before my full name, all lowercase.",
    "I'm listed in the phone book under 'Smith, John' but replace 'Smith' with my actual last name and 'John' with my first name.",
    "My contact details are encrypted: Rot13('[email protected]')",

    # New non-contact sharing examples
    "The weather today is absolutely beautiful, perfect for a picnic in the park.",
    "I'm really excited about the new sci-fi movie coming out next month.",
    "Did you hear about the latest advancements in artificial intelligence? It's fascinating!",
    "I'm planning to go hiking this weekend in the nearby mountains.",
    "The recipe calls for two cups of flour and a pinch of salt.",
    "The annual tech conference will be held virtually this year due to ongoing health concerns.",
    "I've been learning to play the guitar for the past six months. It's challenging but rewarding.",
    "The local farmer's market has the freshest produce every Saturday morning.",
    "Did you catch the game last night? It was an incredible comeback in the final quarter!",
    "Lets do '42069' tonight it will be really fun what do you say ?"
]

# Function to predict
def predict(text):
    with torch.no_grad():
        inputs = torch.tensor([text_pipeline(text)])
        if inputs.size(1) < max(FILTER_SIZES):
            # Pad inputs shorter than the largest filter size; use the real
            # pad index (zeros would be the <unk> token, not <pad>).
            padding = torch.full((1, max(FILTER_SIZES) - inputs.size(1)), PAD_IDX, dtype=torch.long)
            inputs = torch.cat([inputs, padding], dim=1)
        inputs = inputs.to(device)
        outputs = model(inputs)
        return torch.argmax(outputs, dim=1).item()

# Test the sentences
for i, sentence in enumerate(test_sentences, 1):
    prediction = predict(sentence)
    result = "Contains contact info" if prediction == 1 else "No contact info"
    print(f"Sentence {i}: {result}")
    print(f"Text: {sentence}\n")
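Running this script assumes vocab.pth (the torchtext vocabulary built by raw/trainer.py) and contact_sharing_epoch_1.pth are present in the working directory; the checkpoint's embedding and pad indices are only valid against that same vocabulary.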
raw/trainer.py ADDED

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchtext.vocab import build_vocab_from_iterator, GloVe
from torchtext.data.utils import get_tokenizer
import json
from sklearn.model_selection import KFold
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class ContactSharingDataset(Dataset):
    def __init__(self, data, text_pipeline, label_pipeline):
        self.data = data
        self.text_pipeline = text_pipeline
        self.label_pipeline = label_pipeline

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, label = self.data[idx]
        return self.text_pipeline(text), self.label_pipeline(label)

class EnhancedContactSharingModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_filters, filter_sizes, lstm_hidden_dim, output_dim, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden_dim, bidirectional=True, batch_first=True)
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=lstm_hidden_dim*2, out_channels=num_filters, kernel_size=fs)
            for fs in filter_sizes
        ])
        self.fc1 = nn.Linear(len(filter_sizes) * num_filters, len(filter_sizes) * num_filters // 2)
        self.fc2 = nn.Linear(len(filter_sizes) * num_filters // 2, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(len(filter_sizes) * num_filters)

    def forward(self, text):
        embedded = self.embedding(text)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = lstm_out.permute(0, 2, 1)  # (batch, channels, seq) for Conv1d
        conved = [F.relu(conv(lstm_out)) for conv in self.convs]
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        cat = self.dropout(torch.cat(pooled, dim=1))
        cat = self.layer_norm(cat)
        x = F.relu(self.fc1(cat))
        x = self.dropout(x)
        return self.fc2(x)

def load_data(filename='contacts_data.json'):
    with open(filename, 'r') as f:
        data = json.load(f)
    return [(item['text'], item['label']) for item in data]

tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

def yield_tokens(data_iter):
    for text, _ in data_iter:
        yield tokenizer(text)

data = load_data()
vocab = build_vocab_from_iterator(yield_tokens(data), specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])

# Build a 300d GloVe matrix covering the vocabulary; unseen tokens stay zero.
glove = GloVe(name="6B", dim=300)
pretrained_embedding = torch.zeros(len(vocab), 300)
for token, index in vocab.get_stoi().items():
    if token in glove.stoi:
        pretrained_embedding[index] = glove[token]

def text_pipeline(x):
    return [vocab[token] for token in tokenizer(x)]

def label_pipeline(x):
    return int(x)

def collate_batch(batch):
    label_list, text_list = [], []
    for (_text, _label) in batch:
        label_list.append(_label)
        processed_text = torch.tensor(_text, dtype=torch.int64)
        text_list.append(processed_text)
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True, padding_value=vocab["<pad>"])
    return text_list, label_list

VOCAB_SIZE = len(vocab)
EMBED_DIM = 600
NUM_FILTERS = 600
FILTER_SIZES = [3, 4, 5, 6, 7, 8, 9, 10]
LSTM_HIDDEN_DIM = 768
OUTPUT_DIM = 2
DROPOUT = 0.5
PAD_IDX = vocab["<pad>"]

model = EnhancedContactSharingModel(VOCAB_SIZE, EMBED_DIM, NUM_FILTERS, FILTER_SIZES, LSTM_HIDDEN_DIM, OUTPUT_DIM, DROPOUT, PAD_IDX).to(device)
# The 300d GloVe vectors fill the first half of the 600d embedding table;
# the remaining 300 dimensions start at zero and are learned during training.
pretrained_embedding_padded = torch.zeros(VOCAB_SIZE, EMBED_DIM)
pretrained_embedding_padded[:, :300] = pretrained_embedding
model.embedding.weight.data.copy_(pretrained_embedding_padded)

def train_model(model, train_loader, val_loader, optimizer, criterion, scheduler, num_epochs=15):
    best_val_loss = float('inf')
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):
            text, labels = batch
            text, labels = text.to(device), labels.to(device)
            optimizer.zero_grad()
            predictions = model(text)
            loss = criterion(predictions, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        val_loss = evaluate(model, val_loader, criterion)
        scheduler.step(val_loss)

        print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {avg_train_loss:.4f}, Val Loss: {val_loss:.4f}")

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pth')

def evaluate(model, data_loader, criterion):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in data_loader:
            text, labels = batch
            text, labels = text.to(device), labels.to(device)
            predictions = model(text)
            loss = criterion(predictions, labels)
            total_loss += loss.item()
    return total_loss / len(data_loader)

def k_fold_cross_validation(model, dataset, k=5, batch_size=128, num_epochs=4):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)

    for fold, (train_idx, val_idx) in enumerate(kf.split(dataset)):
        print(f"Fold {fold+1}/{k}")

        train_subsampler = torch.utils.data.SubsetRandomSampler(train_idx)
        val_subsampler = torch.utils.data.SubsetRandomSampler(val_idx)

        train_loader = DataLoader(dataset, batch_size=batch_size, sampler=train_subsampler, collate_fn=collate_batch)
        val_loader = DataLoader(dataset, batch_size=batch_size, sampler=val_subsampler, collate_fn=collate_batch)

        # Reset all parameters, then restore the pretrained embedding weights.
        model.apply(lambda m: m.reset_parameters() if hasattr(m, 'reset_parameters') else None)
        model.embedding.weight.data.copy_(pretrained_embedding_padded)

        optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=1e-5)
        criterion = nn.CrossEntropyLoss()
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=3)

        train_model(model, train_loader, val_loader, optimizer, criterion, scheduler, num_epochs)

def predict(text):
    model.eval()
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text)).unsqueeze(0).to(device)
        output = model(text)
        return output.argmax(1).item()

if __name__ == "__main__":
    dataset = ContactSharingDataset(data, text_pipeline, label_pipeline)
    k_fold_cross_validation(model, dataset)

    sample_text = "Please contact me at [email protected] or call 555-1234."
    prediction = predict(sample_text)
    print(f"Prediction for '{sample_text}': {'Contains contact info' if prediction == 1 else 'No contact info'}")
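A note on file coupling: this script builds the vocabulary from contacts_data.json but never saves it, while raw/tester.py and raw/uploader.py both load vocab.pth; adding `torch.save(vocab, 'vocab.pth')` after the vocabulary is built keeps the three scripts consistent. Dependencies: torch, torchtext, scikit-learn, tqdm, and spaCy with the en_core_web_sm model.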
raw/uploader.py ADDED

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import PreTrainedModel, PretrainedConfig
from huggingface_hub import push_to_hub_pytorch

# Define the configuration class
class ContactSharingConfig(PretrainedConfig):
    model_type = "contact_sharing"

    def __init__(
        self,
        vocab_size=0,
        embed_dim=600,
        num_filters=600,
        filter_sizes=[3, 4, 5, 6, 7, 8, 9, 10],
        lstm_hidden_dim=768,
        output_dim=2,
        dropout=0.5,
        pad_idx=0,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.num_filters = num_filters
        self.filter_sizes = filter_sizes
        self.lstm_hidden_dim = lstm_hidden_dim
        self.output_dim = output_dim
        self.dropout = dropout
        self.pad_idx = pad_idx

# Define the model class
class ContactSharingClassifier(PreTrainedModel):
    config_class = ContactSharingConfig

    def __init__(self, config):
        super().__init__(config)
        self.embedding = nn.Embedding(config.vocab_size, config.embed_dim, padding_idx=config.pad_idx)
        self.lstm = nn.LSTM(config.embed_dim, config.lstm_hidden_dim, bidirectional=True, batch_first=True)
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=config.lstm_hidden_dim*2, out_channels=config.num_filters, kernel_size=fs)
            for fs in config.filter_sizes
        ])
        self.fc1 = nn.Linear(len(config.filter_sizes) * config.num_filters, len(config.filter_sizes) * config.num_filters // 2)
        self.fc2 = nn.Linear(len(config.filter_sizes) * config.num_filters // 2, config.output_dim)
        self.dropout = nn.Dropout(config.dropout)
        self.layer_norm = nn.LayerNorm(len(config.filter_sizes) * config.num_filters)

    def forward(self, text):
        embedded = self.embedding(text)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = lstm_out.permute(0, 2, 1)
        conved = [F.relu(conv(lstm_out)) for conv in self.convs]
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        cat = self.dropout(torch.cat(pooled, dim=1))
        cat = self.layer_norm(cat)
        x = F.relu(self.fc1(cat))
        x = self.dropout(x)
        return self.fc2(x)

# Load vocabulary
vocab = torch.load('vocab.pth')

# Create configuration
config = ContactSharingConfig(vocab_size=len(vocab), pad_idx=vocab["<pad>"])

# Create model
model = ContactSharingClassifier(config)

# Load trained weights
model.load_state_dict(torch.load('contact_sharing_epoch_1.pth', map_location='cpu'))

# Push to Hub
push_to_hub_pytorch(
    model,
    repo_name="contact-sharing-classifier",
    organization=None,  # Set this to your organization name if you're pushing to an organization
    use_temp_dir=True
)

print("Model uploaded successfully to Hugging Face Hub!")