HIT-TMG/yizhao-risk-zh-scorer


imryanxu committed on Dec 11, 2024

Commit a528044 (verified) · 1 Parent(s): dbbe91c

Create README.md

Files changed (1)

README.md +44 -0

README.md ADDED

	@@ -0,0 +1,44 @@

+---
+license: apache-2.0
+language:
+- zh
+pipeline_tag: text-classification
+library_name: transformers
+---
+# risk-model-zh-v0.1
+## Introduction
+This is a BERT model fine-tuned on a high-quality Chinese financial dataset. It outputs a security risk score for each input text. The score can be used to identify and remove risky samples from financial datasets, reducing the proportion of illegal or undesirable data.
+## Quickstart
+The following snippet scores every record in a JSONL file with this model and writes the records, together with their risk scores, to a new JSONL file.
+```python
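+import torch
+from datasets import load_dataset
+from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+# Model path, input/output files, and the name of the column that holds the text.
+model_name = "risk-model-zh-v0.1"
+dataset_file = "your_dataset.jsonl"
+text_column = "text"
+output_file = "your_output.jsonl"
+
+# Load the tokenizer and the classifier (in bfloat16), then move the model to the GPU if one is available.
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+model = AutoModelForSequenceClassification.from_pretrained(model_name, torch_dtype=torch.bfloat16)
+device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+model.to(device)
+
+# Read the JSONL file as a single "train" split.
+dataset = load_dataset('json', data_files=dataset_file, cache_dir="cache/", split='train', num_proc=12)
+
+def compute_scores(batch):
+    # Tokenize the batch, padding to its longest sequence and truncating over-long inputs.
+    inputs = tokenizer(batch[text_column], return_tensors="pt", padding="longest", truncation=True).to(device)
+    with torch.no_grad():
+        outputs = model(**inputs)
+        # squeeze(-1) drops the last dimension when the head emits a single logit;
+        # cast to float32 first because NumPy has no bfloat16 dtype.
+        logits = outputs.logits.squeeze(-1).float().cpu().numpy()
+    batch["risk_score"] = logits.tolist()
+    return batch
+
+# Score the dataset in batches of 512 and write the result as JSON Lines.
+dataset = dataset.map(compute_scores, batched=True, batch_size=512)
+dataset.to_json(output_file)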
+```
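
The introduction says the score is meant for removing risky data, but the quickstart stops at writing the scores. A minimal sketch of that filtering step could look like the following. It rests on two assumptions that the README does not confirm: that a higher `risk_score` means riskier text, and that `0.5` is a sensible cutoff. The names `RISK_THRESHOLD`, `filter_low_risk`, and `filter_jsonl` are hypothetical helpers, not part of the model or of any library.

```python
import json
from typing import Iterable, Iterator

# Assumption: a higher "risk_score" means riskier text. The README does not
# state the score's direction or range, so check this on labeled samples first.
RISK_THRESHOLD = 0.5  # hypothetical cutoff, not taken from the README


def filter_low_risk(records: Iterable[dict], threshold: float = RISK_THRESHOLD) -> Iterator[dict]:
    """Yield only the records whose risk_score is strictly below the threshold."""
    for record in records:
        if record["risk_score"] < threshold:
            yield record


def filter_jsonl(input_path: str, output_path: str, threshold: float = RISK_THRESHOLD) -> int:
    """Copy low-risk records from one JSONL file to another and return how many were kept."""
    kept = 0
    with open(input_path, encoding="utf-8") as src, open(output_path, "w", encoding="utf-8") as dst:
        # Skip blank lines and parse each remaining line as one JSON record.
        parsed = (json.loads(line) for line in src if line.strip())
        for record in filter_low_risk(parsed, threshold):
            # ensure_ascii=False keeps Chinese text readable in the output file.
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")
            kept += 1
    return kept
```

With the quickstart's file names, `filter_jsonl("your_output.jsonl", "your_filtered.jsonl")` would keep only the records scored below the cutoff. In practice the threshold is probably best chosen by inspecting the score distribution on a sample of the data.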