imryanxu committed · Commit a528044 · verified · 1 Parent(s): dbbe91c

Create README.md

Files changed (1)
  1. README.md +44 -0
README.md ADDED
@@ -0,0 +1,44 @@
+ ---
+ license: apache-2.0
+ language:
+ - zh
+ pipeline_tag: text-classification
+ library_name: transformers
+ ---
+
+ # risk-model-zh-v0.1
+ ## Introduction
+ This is a BERT model fine-tuned on a high-quality Chinese financial dataset. It outputs a security risk score for each input text, which can be used to identify and remove risky samples from financial datasets, reducing the proportion of illegal or undesirable data.
+ ## Quickstart
+ The following snippet computes security risk scores for a JSONL dataset with this model.
+ ```python
+ import torch
+ from datasets import load_dataset
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ model_name = "risk-model-zh-v0.1"  # local path or Hub ID of this model
+ dataset_file = "your_dataset.jsonl"
+ text_column = "text"  # column that holds the text to score
+ output_file = "your_output.jsonl"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype=torch.bfloat16)
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ model.to(device)
+
+ dataset = load_dataset('json', data_files=dataset_file, cache_dir="cache/", split='train', num_proc=12)
+
+
+ def compute_scores(batch):
+     inputs = tokenizer(batch[text_column], return_tensors="pt", padding="longest", truncation=True).to(device)
+     with torch.no_grad():  # inference only, no gradients needed
+         outputs = model(**inputs)
+         logits = outputs.logits.squeeze(-1).float().cpu().numpy()  # bfloat16 -> float32 before NumPy
+
+     batch["risk_score"] = logits.tolist()
+     return batch
+
+
+ dataset = dataset.map(compute_scores, batched=True, batch_size=512)
+ dataset.to_json(output_file)
+ ```
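
The README describes using the score to remove risky data but stops at writing the scores out. A minimal sketch of that filtering step follows, assuming that a higher `risk_score` means riskier text and that a fixed cutoff is adequate. Neither assumption is stated in the README, and `filter_low_risk`, `RISK_THRESHOLD`, and the sample records are hypothetical names and values chosen for illustration.

```python
import json
from typing import Iterable, Iterator


def filter_low_risk(lines: Iterable[str], threshold: float) -> Iterator[dict]:
    """Yield parsed JSONL records whose risk_score is below the threshold."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # ASSUMPTION: a higher risk_score means riskier text.
        if record["risk_score"] < threshold:
            yield record


# Hypothetical scored records, one JSON object per line with a risk_score field.
sample_lines = [
    '{"text": "example A", "risk_score": 0.2}',
    '{"text": "example B", "risk_score": 3.7}',
    '{"text": "example C", "risk_score": 1.1}',
]

RISK_THRESHOLD = 1.5  # hypothetical cutoff; calibrate it on labeled samples
kept = list(filter_low_risk(sample_lines, RISK_THRESHOLD))
print([record["text"] for record in kept])
```

In practice the lines would come from the scored output file (for example `open("your_output.jsonl", encoding="utf-8")`), and the cutoff would be chosen by inspecting the score distribution on data with known labels.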