parth parekh committed
Commit 1d54b01 · 1 Parent(s): 46fd641

added readme and raw training files

Files changed (5)
  1. README.md +118 -110
  2. raw/contact_sharing_epoch_1.pth +3 -0
  3. raw/tester.py +103 -0
  4. raw/trainer.py +173 -0
  5. raw/uploader.py +84 -0
README.md CHANGED
@@ -1,29 +1,25 @@
---
library_name: transformers
- tags: []
---

- # Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->

-
- ## Model Details
-
### Model Description

- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.

- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]

### Model Sources [optional]

@@ -33,167 +29,179 @@ This is the model card of a 🤗 transformers model that has been pushed on the
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use

- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

- [More Information Needed]

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

- [More Information Needed]

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

- [More Information Needed]

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.

## How to Get Started with the Model

- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [More Information Needed]

- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

- #### Preprocessing [optional]

- [More Information Needed]

- #### Training Hyperparameters

- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->

- #### Speeds, Sizes, Times [optional]

- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->

- [More Information Needed]

- ## Evaluation

- <!-- This section describes the evaluation protocols and provides the results. -->

- ### Testing Data, Factors & Metrics

- #### Testing Data

- <!-- This should link to a Dataset Card if possible. -->

- [More Information Needed]

- #### Factors

- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->

- [More Information Needed]

- #### Metrics

- <!-- These are the evaluation metrics being used, ideally with a description of why. -->

- [More Information Needed]

### Results

- [More Information Needed]

#### Summary

- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->

- [More Information Needed]

- ## Environmental Impact

- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->

- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]

- #### Hardware

- [More Information Needed]

- #### Software

- [More Information Needed]

- ## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact

- [More Information Needed]

---
library_name: transformers
+ tags:
+ - text-classification
+ - contact-information-detection
+ - privacy
---

+ # Model Card for ContactShieldAI

+ ContactShieldAI is a text classification model that detects whether users are sharing contact information on freelancing websites. It helps maintain privacy and enforce platform guidelines by flagging attempts to circumvent communication policies.

### Model Description

+ ContactShieldAI is based on an enhanced CNN-LSTM architecture, combining the strengths of convolutional and recurrent neural networks for text classification.

+ - **Developed by:** xxparthparekhxx
+ - **Model type:** Text Classification
+ - **Language(s):** English
+ - **License:** Apache 2.0
+ - **Finetuned from model:** None; trained from scratch with GloVe-initialized embeddings

### Model Sources [optional]

- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

+ ## Uses

+ ContactShieldAI is designed for:
+ - Detecting contact-information sharing in text on freelancing platforms
+ - Enhancing privacy protection in online marketplaces
+ - Assisting moderators in identifying policy violations

### Downstream Use [optional]

<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->

+ ContactShieldAI can be fine-tuned for or integrated into:
+ - Content moderation systems for social media platforms
+ - Customer support chatbots to protect user privacy
+ - Email filtering systems to detect potential policy violations

### Out-of-Scope Use

<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->

+ ContactShieldAI should not be used for:
+ - Censoring legitimate communication that doesn't violate platform policies
+ - Invading user privacy by scanning personal conversations without consent
+ - Making decisions about user accounts without human review
+ - Detecting contact information in languages other than English (current version)

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

+ - The model is trained on synthetic data and may not capture all real-world variations
+ - It is tailored specifically to English-language text
+ - Performance may vary on very short or heavily obfuscated text

### Recommendations

<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->

+ While this model is designed to enhance privacy and policy compliance, users should be aware of potential biases in the training data. It should assist human moderators rather than act as the sole decision-maker in content moderation.

## How to Get Started with the Model

+ You can use the model directly with the Hugging Face Transformers library:

+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ tokenizer = AutoTokenizer.from_pretrained("xxparthparekhxx/ContactShieldAI")
+ model = AutoModelForSequenceClassification.from_pretrained("xxparthparekhxx/ContactShieldAI")
+
+ text = "Please contact me at [email protected] or call 555-1234."
+ inputs = tokenizer(text, return_tensors="pt")
+ outputs = model(**inputs)
+ prediction = outputs.logits.argmax(-1).item()
+
+ print("Contains contact info" if prediction == 1 else "No contact info")
+ ```

+ ### Training Data

+ The model was trained on a synthetically generated dataset:
+ - 200,000 examples created using LLaMA 3.1 70B
+ - Balanced between positive (containing contact info) and negative examples
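The card does not show the exact record layout, but `load_data()` in `raw/trainer.py` (below) reads a JSON array and takes `item["text"]` and `item["label"]` from each entry, so a record presumably looks like the following sketch (the example texts here are invented, with label 1 meaning the text contains contact info):

```python
# Hypothetical records for contacts_data.json, matching the fields that
# load_data() in raw/trainer.py reads; the texts themselves are illustrative.
records = [
    {"text": "DM me on telegram, handle is my first name underscore dev", "label": 1},  # shares contact info
    {"text": "The deliverable is due next Friday after the design review.", "label": 0},  # benign
]
```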
 
+ ### Training Procedure

+ The training procedure for ContactShieldAI follows these steps:

+ 1. Data Preparation:
+    - Load the dataset using the `load_data()` function
+    - Create a vocabulary from the dataset using `build_vocab_from_iterator()`
+    - Initialize GloVe embeddings for the vocabulary

+ 2. Model Initialization:
+    - Create an instance of `EnhancedContactSharingModel`
+    - Initialize the embedding layer with pretrained GloVe embeddings

+ 3. Dataset and DataLoader Creation:
+    - Create a `ContactSharingDataset` instance
+    - Use `DataLoader` with a custom `collate_batch` function for efficient batching

+ 4. Training Loop:
+    - Implement k-fold cross-validation (k=5) using `KFold` from sklearn
+    - For each fold:
+      - Reset model parameters (except embeddings)
+      - Create train and validation data loaders
+      - Initialize the Adam optimizer and ReduceLROnPlateau scheduler
+      - Train for a specified number of epochs (default: 4)
+      - In each epoch:
+        - Iterate through batches, compute loss, and update model parameters
+        - Evaluate on the validation set and update the learning rate if needed
+        - Save the best model based on validation loss

+ 5. Evaluation:
+    - Implement an `evaluate()` function to compute loss on a given dataset

+ 6. Prediction:
+    - Implement a `predict()` function for making predictions on new text inputs

+ The training process combines learning rate scheduling, best-checkpoint selection by validation loss, and k-fold cross-validation for robust performance and generalization.
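For reference, the core of this loop reduces to the sketch below. It is condensed from `raw/trainer.py` in this commit and assumes `dataset`, `collate_batch`, `evaluate`, and `device` are defined as in that script:

```python
import torch
import torch.nn as nn
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, SubsetRandomSampler

def cross_validate(model, dataset, k=5, batch_size=128, num_epochs=4):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    criterion = nn.CrossEntropyLoss()
    for fold, (train_idx, val_idx) in enumerate(kf.split(dataset)):
        train_loader = DataLoader(dataset, batch_size=batch_size,
                                  sampler=SubsetRandomSampler(train_idx), collate_fn=collate_batch)
        val_loader = DataLoader(dataset, batch_size=batch_size,
                                sampler=SubsetRandomSampler(val_idx), collate_fn=collate_batch)
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-5)
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                               factor=0.1, patience=3)
        best_val_loss = float('inf')
        for epoch in range(num_epochs):
            model.train()
            for text, labels in train_loader:
                text, labels = text.to(device), labels.to(device)
                optimizer.zero_grad()
                loss = criterion(model(text), labels)
                loss.backward()
                optimizer.step()
            val_loss = evaluate(model, val_loader, criterion)
            scheduler.step(val_loss)           # lower the LR when validation loss plateaus
            if val_loss < best_val_loss:       # keep the best checkpoint seen so far
                best_val_loss = val_loss
                torch.save(model.state_dict(), 'best_model.pth')
```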
 
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

+ #### Preprocessing [optional]

+ - Text tokenization using spaCy (`en_core_web_sm`)
+ - Vocabulary built from the training data, with `<unk>` and `<pad>` specials
+ - Texts padded to uniform length within each batch
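A minimal sketch of this pipeline, matching the torchtext calls in `raw/trainer.py` (the two example strings are stand-ins for the real training texts):

```python
import torch
import torch.nn as nn
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator

tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

texts = ["Call me at 555-1234.", "The weather today is beautiful."]  # toy stand-in data
vocab = build_vocab_from_iterator((tokenizer(t) for t in texts), specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])

# Numericalize each text, then pad the batch to a uniform length
batch = [torch.tensor([vocab[tok] for tok in tokenizer(t)], dtype=torch.int64) for t in texts]
padded = nn.utils.rnn.pad_sequence(batch, batch_first=True, padding_value=vocab["<pad>"])
print(padded.shape)  # (2, longest_sequence_length)
```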
 
+ #### Training Hyperparameters

+ - Optimizer: Adam (lr=0.0001, weight_decay=1e-5)
+ - Loss Function: Cross-Entropy Loss
+ - Learning Rate Scheduler: ReduceLROnPlateau (factor=0.1, patience=3)
+ - Batch Size: 128
+ - Epochs: up to 15 (4 per fold during cross-validation), keeping the checkpoint with the lowest validation loss

### Results

+ The model reached a best validation (cross-entropy) loss of 0.0211 on held-out folds of the synthetic dataset.

#### Summary

+ ContactShieldAI detects contact-information sharing in text. Key features:

+ - Trained on a large, balanced dataset of 200,000 examples
+ - Combines a bidirectional LSTM with multi-scale CNNs
+ - Reaches a validation loss of 0.0211
+ - Easy to use with the Hugging Face Transformers library
+ - Suitable for applications requiring privacy protection and data security

+ The architecture and training procedure are tuned for efficient, accurate detection of contact information, making the model a practical tool for safeguarding user privacy in text-based applications.

## Technical Specifications [optional]

+ ### Model Architecture and Objective

+ ContactShieldAI uses the following architecture:

+ 1. **Embedding Layer:** Initialized with GloVe 6B 300d embeddings, zero-padded to 600 dimensions
+ 2. **Bidirectional LSTM:** Processes the embedded sequence
+ 3. **Multi-scale CNN:** Parallel convolutional layers with filter sizes from 3 to 10
+ 4. **Max Pooling:** Applied over time after each convolutional layer
+ 5. **Fully Connected Layers:** Two FC layers with ReLU activation and dropout
+ 6. **Output Layer:** 2-dimensional output for binary classification

+ Key Parameters:
+ - Vocabulary Size: 225,817
+ - Embedding Dimension: 600
+ - Number of Filters: 600
+ - Filter Sizes: [3, 4, 5, 6, 7, 8, 9, 10]
+ - LSTM Hidden Dimension: 768
+ - Dropout: 0.5
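As a shape sanity check, here is a dummy forward pass with these parameters. It assumes the `ContactSharingClassifier` class from `raw/tester.py` (below) is in scope, and `pad_idx=1` is an assumption reflecting the `["<unk>", "<pad>"]` specials order used in training:

```python
import torch

# ContactSharingClassifier is defined in raw/tester.py, included in this commit.
model = ContactSharingClassifier(
    vocab_size=225_817, embed_dim=600, num_filters=600,
    filter_sizes=[3, 4, 5, 6, 7, 8, 9, 10],
    lstm_hidden_dim=768, output_dim=2, dropout=0.5, pad_idx=1,  # pad_idx assumed
)

tokens = torch.randint(0, 225_817, (4, 32))  # batch of 4 sequences, 32 token ids each
logits = model(tokens)                       # embedding -> BiLSTM -> convs -> pooling -> FC
print(logits.shape)                          # torch.Size([4, 2]); argmax gives the class
```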
 
+ ## Citation [optional]

<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->

+ **BibTeX:**

+ @misc{ContactShieldAI,
+   author = {xxparthparekhxx},
+   title = {ContactShieldAI: A Model for Detecting Contact Information Sharing},
+   year = {2023},
+   publisher = {Hugging Face},
+   howpublished = {\url{https://huggingface.co/xxparthparekhxx/ContactShieldAI}}
+ }

raw/contact_sharing_epoch_1.pth ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:bdb70e711c212856ce3df95b82afbae57b8fc34243b3f541ecd65963fa81fd92
size 813497259

raw/tester.py ADDED
@@ -0,0 +1,103 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchtext.data.utils import get_tokenizer

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class ContactSharingClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_filters, filter_sizes, lstm_hidden_dim, output_dim, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden_dim, bidirectional=True, batch_first=True)
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=lstm_hidden_dim*2, out_channels=num_filters, kernel_size=fs)
            for fs in filter_sizes
        ])
        self.fc1 = nn.Linear(len(filter_sizes) * num_filters, len(filter_sizes) * num_filters // 2)
        self.fc2 = nn.Linear(len(filter_sizes) * num_filters // 2, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(len(filter_sizes) * num_filters)

    def forward(self, text):
        embedded = self.embedding(text)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = lstm_out.permute(0, 2, 1)  # (batch, channels, seq_len) for Conv1d
        conved = [F.relu(conv(lstm_out)) for conv in self.convs]
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]  # max over time
        cat = self.dropout(torch.cat(pooled, dim=1))
        cat = self.layer_norm(cat)
        x = F.relu(self.fc1(cat))
        x = self.dropout(x)
        return self.fc2(x)

# Initialize tokenizer and vocabulary
tokenizer = get_tokenizer("spacy", language="en_core_web_sm")
vocab = torch.load('vocab.pth')  # assumes the vocabulary was saved during training

# Define text pipeline
def text_pipeline(x):
    return [vocab[token] for token in tokenizer(x)]

# Model parameters
VOCAB_SIZE = len(vocab)
EMBED_DIM = 600
NUM_FILTERS = 600
FILTER_SIZES = [3, 4, 5, 6, 7, 8, 9, 10]
LSTM_HIDDEN_DIM = 768
OUTPUT_DIM = 2
DROPOUT = 0.5
PAD_IDX = vocab["<pad>"]

# Load the model
model = ContactSharingClassifier(VOCAB_SIZE, EMBED_DIM, NUM_FILTERS, FILTER_SIZES, LSTM_HIDDEN_DIM, OUTPUT_DIM, DROPOUT, PAD_IDX)
model.load_state_dict(torch.load('contact_sharing_epoch_1.pth', map_location=device))
model.to(device)
model.eval()

# Test sentences
test_sentences = [
    "You can reach me at my electronic mail address, it's my first name dot last name at that popular search engine company's mail service.",
    "Call me on my cellular device, the digits are the same as the year the Declaration of Independence was signed, followed by my birth year, twice.",
    "Visit my online presence at triple w dot my full name without spaces or punctuation dot com.",
    "Send a message to username 'not_my_real_name' on that instant messaging platform that starts with 'disc' and ends with 'ord'.",
    "My contact info is hidden in this sentence: Eight Six Seven Five Three Oh Nine.",
    "Find me on the professional networking site, just search for my name plus 'software engineer in San Francisco'.",
    "My handle on the bird-themed social media platform is at symbol followed by 'definitely_not_my_email_address'.",
    "You know that video sharing site? My channel is there, just add 'cool_coder_' before my full name, all lowercase.",
    "I'm listed in the phone book under 'Smith, John' but replace 'Smith' with my actual last name and 'John' with my first name.",
    "My contact details are encrypted: Rot13('[email protected]')",

    # Non-contact-sharing examples
    "The weather today is absolutely beautiful, perfect for a picnic in the park.",
    "I'm really excited about the new sci-fi movie coming out next month.",
    "Did you hear about the latest advancements in artificial intelligence? It's fascinating!",
    "I'm planning to go hiking this weekend in the nearby mountains.",
    "The recipe calls for two cups of flour and a pinch of salt.",
    "The annual tech conference will be held virtually this year due to ongoing health concerns.",
    "I've been learning to play the guitar for the past six months. It's challenging but rewarding.",
    "The local farmer's market has the freshest produce every Saturday morning.",
    "Did you catch the game last night? It was an incredible comeback in the final quarter!",
    "Lets do '42069' tonight it will be really fun what do you say ?"
]

# Function to predict
def predict(text):
    with torch.no_grad():
        inputs = torch.tensor([text_pipeline(text)])
        if inputs.size(1) < max(FILTER_SIZES):
            # Pad with the <pad> index if the input is shorter than the largest filter size
            padding = torch.full((1, max(FILTER_SIZES) - inputs.size(1)), PAD_IDX, dtype=torch.long)
            inputs = torch.cat([inputs, padding], dim=1)
        inputs = inputs.to(device)
        outputs = model(inputs)
        return torch.argmax(outputs, dim=1).item()

# Test the sentences
for i, sentence in enumerate(test_sentences, 1):
    prediction = predict(sentence)
    result = "Contains contact info" if prediction == 1 else "No contact info"
    print(f"Sentence {i}: {result}")
    print(f"Text: {sentence}\n")

raw/trainer.py ADDED
@@ -0,0 +1,173 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchtext.vocab import build_vocab_from_iterator, GloVe
from torchtext.data.utils import get_tokenizer
import json
from sklearn.model_selection import KFold
from torch.utils.data import Dataset, DataLoader
from tqdm import tqdm

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

class ContactSharingDataset(Dataset):
    def __init__(self, data, text_pipeline, label_pipeline):
        self.data = data
        self.text_pipeline = text_pipeline
        self.label_pipeline = label_pipeline

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        text, label = self.data[idx]
        return self.text_pipeline(text), self.label_pipeline(label)

class EnhancedContactSharingModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_filters, filter_sizes, lstm_hidden_dim, output_dim, dropout, pad_idx):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_idx)
        self.lstm = nn.LSTM(embed_dim, lstm_hidden_dim, bidirectional=True, batch_first=True)
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=lstm_hidden_dim*2, out_channels=num_filters, kernel_size=fs)
            for fs in filter_sizes
        ])
        self.fc1 = nn.Linear(len(filter_sizes) * num_filters, len(filter_sizes) * num_filters // 2)
        self.fc2 = nn.Linear(len(filter_sizes) * num_filters // 2, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.layer_norm = nn.LayerNorm(len(filter_sizes) * num_filters)

    def forward(self, text):
        embedded = self.embedding(text)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = lstm_out.permute(0, 2, 1)  # (batch, channels, seq_len) for Conv1d
        conved = [F.relu(conv(lstm_out)) for conv in self.convs]
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        cat = self.dropout(torch.cat(pooled, dim=1))
        cat = self.layer_norm(cat)
        x = F.relu(self.fc1(cat))
        x = self.dropout(x)
        return self.fc2(x)

def load_data(filename='contacts_data.json'):
    with open(filename, 'r') as f:
        data = json.load(f)
    return [(item['text'], item['label']) for item in data]

tokenizer = get_tokenizer("spacy", language="en_core_web_sm")

def yield_tokens(data_iter):
    for text, _ in data_iter:
        yield tokenizer(text)

data = load_data()
vocab = build_vocab_from_iterator(yield_tokens(data), specials=["<unk>", "<pad>"])
vocab.set_default_index(vocab["<unk>"])

# Copy GloVe vectors into the first 300 dimensions for every in-vocabulary token
glove = GloVe(name="6B", dim=300)
pretrained_embedding = torch.zeros(len(vocab), 300)
for token, index in vocab.get_stoi().items():
    if token in glove.stoi:
        pretrained_embedding[index] = glove[token]

def text_pipeline(x):
    return [vocab[token] for token in tokenizer(x)]

def label_pipeline(x):
    return int(x)

def collate_batch(batch):
    label_list, text_list = [], []
    for (_text, _label) in batch:
        label_list.append(_label)
        text_list.append(torch.tensor(_text, dtype=torch.int64))
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = nn.utils.rnn.pad_sequence(text_list, batch_first=True, padding_value=vocab["<pad>"])
    return text_list, label_list

VOCAB_SIZE = len(vocab)
EMBED_DIM = 600
NUM_FILTERS = 600
FILTER_SIZES = [3, 4, 5, 6, 7, 8, 9, 10]
LSTM_HIDDEN_DIM = 768
OUTPUT_DIM = 2
DROPOUT = 0.5
PAD_IDX = vocab["<pad>"]

model = EnhancedContactSharingModel(VOCAB_SIZE, EMBED_DIM, NUM_FILTERS, FILTER_SIZES, LSTM_HIDDEN_DIM, OUTPUT_DIM, DROPOUT, PAD_IDX).to(device)
pretrained_embedding_padded = torch.zeros(VOCAB_SIZE, EMBED_DIM)
pretrained_embedding_padded[:, :300] = pretrained_embedding  # zero-pad GloVe 300d up to 600d
model.embedding.weight.data.copy_(pretrained_embedding_padded)

def train_model(model, train_loader, val_loader, optimizer, criterion, scheduler, num_epochs=15):
    best_val_loss = float('inf')
    for epoch in range(num_epochs):
        model.train()
        total_loss = 0
        for batch in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):
            text, labels = batch
            text, labels = text.to(device), labels.to(device)
            optimizer.zero_grad()
            predictions = model(text)
            loss = criterion(predictions, labels)
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        avg_train_loss = total_loss / len(train_loader)
        val_loss = evaluate(model, val_loader, criterion)
        scheduler.step(val_loss)

        print(f"Epoch {epoch+1}/{num_epochs}, Train Loss: {avg_train_loss:.4f}, Val Loss: {val_loss:.4f}")

        if val_loss < best_val_loss:
            best_val_loss = val_loss
            torch.save(model.state_dict(), 'best_model.pth')

def evaluate(model, data_loader, criterion):
    model.eval()
    total_loss = 0
    with torch.no_grad():
        for batch in data_loader:
            text, labels = batch
            text, labels = text.to(device), labels.to(device)
            predictions = model(text)
            loss = criterion(predictions, labels)
            total_loss += loss.item()
    return total_loss / len(data_loader)

def k_fold_cross_validation(model, dataset, k=5, batch_size=128, num_epochs=4):
    kf = KFold(n_splits=k, shuffle=True, random_state=42)

    for fold, (train_idx, val_idx) in enumerate(kf.split(dataset)):
        print(f"Fold {fold+1}/{k}")

        train_subsampler = torch.utils.data.SubsetRandomSampler(train_idx)
        val_subsampler = torch.utils.data.SubsetRandomSampler(val_idx)

        train_loader = DataLoader(dataset, batch_size=batch_size, sampler=train_subsampler, collate_fn=collate_batch)
        val_loader = DataLoader(dataset, batch_size=batch_size, sampler=val_subsampler, collate_fn=collate_batch)

        # Re-initialize all parameters for each fold, then restore the GloVe embeddings
        model.apply(lambda m: m.reset_parameters() if hasattr(m, 'reset_parameters') else None)
        model.embedding.weight.data.copy_(pretrained_embedding_padded)

        optimizer = torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=1e-5)
        criterion = nn.CrossEntropyLoss()
        scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=3)

        train_model(model, train_loader, val_loader, optimizer, criterion, scheduler, num_epochs)

def predict(text):
    model.eval()
    with torch.no_grad():
        ids = torch.tensor(text_pipeline(text)).unsqueeze(0)
        if ids.size(1) < max(FILTER_SIZES):
            # Pad with <pad> so the widest convolution kernel always fits
            pad = torch.full((1, max(FILTER_SIZES) - ids.size(1)), PAD_IDX, dtype=torch.long)
            ids = torch.cat([ids, pad], dim=1)
        output = model(ids.to(device))
        return output.argmax(1).item()

if __name__ == "__main__":
    dataset = ContactSharingDataset(data, text_pipeline, label_pipeline)
    k_fold_cross_validation(model, dataset)

    # Restore the best checkpoint saved during cross-validation before predicting
    model.load_state_dict(torch.load('best_model.pth', map_location=device))

    sample_text = "Please contact me at [email protected] or call 555-1234."
    prediction = predict(sample_text)
    print(f"Prediction for '{sample_text}': {'Contains contact info' if prediction == 1 else 'No contact info'}")

raw/uploader.py ADDED
@@ -0,0 +1,84 @@
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import PreTrainedModel, PretrainedConfig

# Define the configuration class
class ContactSharingConfig(PretrainedConfig):
    model_type = "contact_sharing"

    def __init__(
        self,
        vocab_size=0,
        embed_dim=600,
        num_filters=600,
        filter_sizes=[3, 4, 5, 6, 7, 8, 9, 10],
        lstm_hidden_dim=768,
        output_dim=2,
        dropout=0.5,
        pad_idx=0,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.embed_dim = embed_dim
        self.num_filters = num_filters
        self.filter_sizes = filter_sizes
        self.lstm_hidden_dim = lstm_hidden_dim
        self.output_dim = output_dim
        self.dropout = dropout
        self.pad_idx = pad_idx

# Define the model class
class ContactSharingClassifier(PreTrainedModel):
    config_class = ContactSharingConfig

    def __init__(self, config):
        super().__init__(config)
        self.embedding = nn.Embedding(config.vocab_size, config.embed_dim, padding_idx=config.pad_idx)
        self.lstm = nn.LSTM(config.embed_dim, config.lstm_hidden_dim, bidirectional=True, batch_first=True)
        self.convs = nn.ModuleList([
            nn.Conv1d(in_channels=config.lstm_hidden_dim*2, out_channels=config.num_filters, kernel_size=fs)
            for fs in config.filter_sizes
        ])
        self.fc1 = nn.Linear(len(config.filter_sizes) * config.num_filters, len(config.filter_sizes) * config.num_filters // 2)
        self.fc2 = nn.Linear(len(config.filter_sizes) * config.num_filters // 2, config.output_dim)
        self.dropout = nn.Dropout(config.dropout)
        self.layer_norm = nn.LayerNorm(len(config.filter_sizes) * config.num_filters)

    def forward(self, text):
        embedded = self.embedding(text)
        lstm_out, _ = self.lstm(embedded)
        lstm_out = lstm_out.permute(0, 2, 1)
        conved = [F.relu(conv(lstm_out)) for conv in self.convs]
        pooled = [F.max_pool1d(conv, conv.shape[2]).squeeze(2) for conv in conved]
        cat = self.dropout(torch.cat(pooled, dim=1))
        cat = self.layer_norm(cat)
        x = F.relu(self.fc1(cat))
        x = self.dropout(x)
        return self.fc2(x)

# Load vocabulary
vocab = torch.load('vocab.pth')

# Create configuration
config = ContactSharingConfig(vocab_size=len(vocab), pad_idx=vocab["<pad>"])

# Create model
model = ContactSharingClassifier(config)

# Load trained weights
model.load_state_dict(torch.load('contact_sharing_epoch_1.pth', map_location='cpu'))

# Push to the Hub via the PreTrainedModel API (huggingface_hub provides no
# push_to_hub_pytorch helper). To push under an organization, prefix the repo id,
# e.g. "my-org/contact-sharing-classifier".
model.push_to_hub("contact-sharing-classifier")

print("Model uploaded successfully to Hugging Face Hub!")