ptrdvn committed on
Commit 20e7d98 · verified · 1 Parent(s): bfa3116

Update README.md

Files changed (1)
  1. README.md +107 -3
README.md CHANGED
@@ -1,3 +1,107 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ ---
4
+
5
+ # Shitsu
6
+
7
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/64b63f8ad57e02621dc93c8b/Lkw-M8a-AAfJiC81DobNl.jpeg" alt="A logo of a Shih Tzu reading a book" width="600"/>
8
+
9
+ A text scorer that rates text by how much useful, textbook-like information it contains.
10
+ It outputs a score that generally falls between 0 and 1, but because the model is a regressor, the score can fall outside either bound.
11
+
12
+ Our model is based on fasttext embeddings, so it can score large amounts of data quickly with limited compute.
13
+
14
+ This scorer can be used to filter useful information from large text corpora in many languages.
15
+
16
+ # How to install
17
+
+ ```bash
+ pip install git+https://github.com/lightblue-tech/shitsu.git
+ ```
21
+
22
+ # How to use
23
+
24
+ With our scorer package
25
+
+ ```python
+ from shitsu import ShitsuScorer
+
+ text_list = [
+     "Photosynthesis is a system of biological processes by which photosynthetic organisms, such as most plants, algae, and cyanobacteria, convert light energy, typically from sunlight, into the chemical energy necessary to fuel their metabolism.",
+     "Congratulations! You have all been selected to receive a free gift card worth $1000. Click on this link [Link] to claim your reward now. Limited time offer, so act fast! Don't miss out on this amazing opportunity.",
+ ]
+
+ # Choose a language from one of: 'am', 'ar', 'bg', 'bn', 'cs', 'da', 'de', 'el', 'en', 'es', 'fa', 'fi', 'fr', 'gu', 'ha', 'hi', 'hu', 'id', 'it', 'ja', 'jv', 'kn', 'ko', 'lt', 'mr', 'nl', 'no', 'yo', 'zh'
+ language_code = "en"
+ scorer = ShitsuScorer(language_code)
+ scores = scorer.score(text_list)
+ scores
+ # array([ 0.9897383 , -0.08109612], dtype=float32)
+ ```
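The scores above can be used to filter a corpus. A minimal sketch, assuming a cutoff of 0.5 (an arbitrary illustrative value, not one recommended by the authors) and two hypothetical helpers, `clip_score` and `filter_by_score`. Scores are clipped to [0, 1] first because the regressor can overshoot either bound:

```python
def clip_score(score, low=0.0, high=1.0):
    """Clamp a raw regressor output into [low, high]."""
    return min(max(score, low), high)


def filter_by_score(texts, scores, threshold=0.5):
    """Keep the texts whose clipped score is at least `threshold`.

    `threshold=0.5` is an illustrative assumption, not a recommended value.
    """
    return [
        text
        for text, score in zip(texts, scores)
        if clip_score(float(score)) >= threshold
    ]


# Stand-in values mirroring the example output above.
example_texts = ["photosynthesis passage", "gift card spam"]
example_scores = [0.9897383, -0.08109612]

kept = filter_by_score(example_texts, example_scores)
print(kept)  # ['photosynthesis passage']
```

In practice the cutoff would likely need tuning per language and per corpus, since the score distribution may differ between them.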
40
+
41
+ Without our scorer package (i.e. without pip install)
42
+
+ ```python
+ from safetensors.torch import load_model
+ import fasttext
+ from huggingface_hub import hf_hub_download
+ from tqdm.auto import tqdm
+ import numpy as np
+ import torch
+ import torch.nn as nn
+
+ class FasttextEmbedRegressor(nn.Module):
+     """Small MLP that maps a fasttext sentence vector to a single score."""
+
+     def __init__(self, input_size=300):
+         super(FasttextEmbedRegressor, self).__init__()
+         layer_1_size = 64
+         layer_2_size = 32
+         self.fc1 = nn.Linear(input_size, layer_1_size)
+         self.fc2 = nn.Linear(layer_1_size, layer_2_size)
+         self.fc3 = nn.Linear(layer_2_size, 1)
+
+     def forward(self, x):
+         x = torch.relu(self.fc1(x))
+         x = torch.relu(self.fc2(x))
+         x = self.fc3(x)
+         return x
+
+ class ShitsuScorer:
+     def __init__(self, lang_code):
+         # Download and load the fasttext vectors for the chosen language
+         fasttext_model_path = hf_hub_download(repo_id=f"facebook/fasttext-{lang_code}-vectors", filename="model.bin")
+         self.fasttext_model = fasttext.load_model(fasttext_model_path)
+         # Load the language-specific regressor weights
+         self.regressor_model = FasttextEmbedRegressor().eval()
+         regressor_model_path = hf_hub_download(repo_id="lightblue/shitsu_text_scorer", filename=f"{lang_code}.safetensors")
+         load_model(self.regressor_model, regressor_model_path)
+
+     def score(self, text_list):
+         # fasttext processes one line at a time, so newlines are replaced with spaces
+         embeddings = np.stack([self.fasttext_model.get_sentence_vector(x.replace("\n", " ")) for x in tqdm(text_list)])
+         return self.regressor_model(torch.Tensor(embeddings)).detach().numpy().flatten()
+
+ text_list = [
+     "Photosynthesis is a system of biological processes by which photosynthetic organisms, such as most plants, algae, and cyanobacteria, convert light energy, typically from sunlight, into the chemical energy necessary to fuel their metabolism.",
+     "Congratulations! You have all been selected to receive a free gift card worth $1000. Click on this link [Link] to claim your reward now. Limited time offer, so act fast! Don't miss out on this amazing opportunity.",
+ ]
+
+ scorer = ShitsuScorer("en")
+ scores = scorer.score(text_list)
+ scores
+ # array([ 0.9897383 , -0.08109612], dtype=float32)
+ ```
90
+
91
+ # How we made the training data
92
+
93
+ We provided a sample of tens of thousands of texts from [MADLAD-400](https://huggingface.co/datasets/allenai/MADLAD-400) in various languages to a popular state-of-the-art LLM with the following system prompt:
94
+
+ ```python
+ system_message = """You are a text filtering AI model.
+ Your input is a piece of text.
+ Your output is a score of how likely the text is to appear in a useful {language} textbook, encyclopedia, or any other important document.
+
+ Output your score on a scale of 0-100, with 0 meaning that the text contains no useful {language} information and 100 meaning that the text is very useful and is exceedingly likely to appear in a {language} textbook, encyclopedia, or any other important document. If the text is not mostly fluent, natural {language}, output 0.
+
+ Your output should be only an integer from 0-100."""
+ ```
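To illustrate how a prompt like this might be used, the sketch below fills in the `{language}` placeholder and converts a reply into a training target. The helpers `build_system_message` and `parse_score`, the shortened stand-in template, and the division by 100 (chosen to match the scorer's roughly 0 to 1 output range) are assumptions for illustration, not the authors' pipeline:

```python
import re

# Shortened stand-in for the system prompt above; only the placeholder matters here.
template = (
    "Your output is a score of how likely the text is to appear "
    "in a useful {language} textbook."
)


def build_system_message(language):
    """Fill the {language} placeholder in the prompt template."""
    return template.format(language=language)


def parse_score(reply):
    """Parse an integer 0-100 reply into a float in [0, 1].

    Returns None when the reply is not a bare integer in range.
    Dividing by 100 is an assumption, not a documented step.
    """
    match = re.fullmatch(r"\s*(\d{1,3})\s*", reply)
    if match is None:
        return None
    value = int(match.group(1))
    if value > 100:
        return None
    return value / 100


print(build_system_message("Japanese"))
print(parse_score("87"))   # 0.87
print(parse_score("n/a"))  # None
```

Returning `None` for malformed replies lets a caller drop those samples instead of silently training on a wrong label.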
104
+
105
+ We then trained a small neural network on top of fasttext's embeddings to predict these scores.
106
+
107
+ We chose the languages in this dataset by taking the union of the 30 most popular languages on Earth, according to [Ethnologue 2024](https://www.ethnologue.com/insights/ethnologue200/), and the 30 most popular languages within MADLAD-400.