Commit · ae73b2e
1 Parent(s): 8496d9e
Add third NER-tag: medical technology
- README.md +22 -19
- config.json +12 -8
- model.safetensors +2 -2
- tokenizer.json +1 -6
README.md
CHANGED
@@ -14,11 +14,12 @@ language:
 
 # Model
 
-NER-Model for disease/treatment entity recognition. The purpose of the model/data use is educational.
+NER-Model for disease/treatment/technology entity recognition. The purpose of the model/data use is educational.
 
 The original dataset tags have been augmented with "inside"-Tags in order to handle sub-tokens produced by the WordPiece tokenizer. Following NER-tags are used:
-* `B-
-* `B-
+* `B-DISEASE`, `I-DISEASE`: begin and inside tags for disease
+* `B-TREATMENT`, `I-TREATMENT`: begin and inside tags for treatment
+* `B-TECHNOLOGY`, `I-TECHNOLOGY`: begin and inside tags for technology
 * `O` - outside entities (irrelevant)
 
 ```
@@ -26,32 +27,34 @@ The original dataset tags have been augmented with "inside"-Tags in order to han
 Acute obstructive hydrocephalus complicating bacterial meningitis in childhood
 
 # Real:
-Acute ->
-obstructive ->
-hydrocephalus ->
-bacterial ->
-meningitis ->
+Acute -> DISEASE
+obstructive -> DISEASE
+hydrocephalus -> DISEASE
+bacterial -> DISEASE
+meningitis -> DISEASE
 
 # Predictions:
-o##bs##truct##ive -> B-
-h##ydro##ce##pha##lus -> B-
-bacterial -> B-
-men##ing##itis -> B-
+o##bs##truct##ive -> B-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE
+h##ydro##ce##pha##lus -> B-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE + I-DISEASE
+bacterial -> B-DISEASE
+men##ing##itis -> B-DISEASE + I-DISEASE + I-DISEASE
 ```
 
 # Sources
 
 This pipeline is based on the [dmis-lab/biobert-base-cased-v1.2](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2) pretrained model,
 fine-tuned using the relatively small [BeHealthy Medical Entity](https://www.kaggle.com/datasets/arunagirirajan/medical-entity-recognition-ner)
-dataset (1.550 training samples).
+dataset (1.550 training samples). The initial version of this model was then used
+to augment the medical technology [dataset](https://github.com/VictoriaDimanova/Robust-medical-NER/tree/main/Textcorpus). Both datasets were then used to train
+this model.
 
 # Performance
 
 The model has not been extensively tuned. The quality of the dataset is not clear, due to unknown origin of the data / annotation process.
 
-|Metric
-
-Precision | 0.
-Recall | 0.
-F1 | 0.
-Accuracy | 0.
+| Metric | Score |
+|-----------|----------|
+| Precision | 0.836892 |
+| Recall | 0.766610 |
+| F1 | 0.800211 |
+| Accuracy | 0.935253 |
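
The F1 value in the new Performance table can be cross-checked against the precision and recall next to it. The sketch below assumes F1 is the usual harmonic mean of the two; the commit does not say which evaluation library produced the numbers, and the function name `f1_from` is only illustrative.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values copied from the "+" rows of the Performance table.
precision, recall, reported_f1 = 0.836892, 0.766610, 0.800211

computed = f1_from(precision, recall)
print(f"computed F1 = {computed:.6f}, reported F1 = {reported_f1:.6f}")
```

If the two printed values differ by more than rounding, the table rows were probably taken from different evaluation runs.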
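
The README text in this diff says the word-level dataset tags were expanded with "inside" tags to cover WordPiece sub-tokens. A minimal sketch of that expansion follows, assuming the first sub-token of a word gets the `B-` tag and the remaining sub-tokens get `I-`, which matches the prediction lines shown. The helper name `expand_tags` is hypothetical, and the sub-token splits are copied from the README example rather than produced by a real tokenizer.

```python
from typing import List, Tuple


def expand_tags(words: List[Tuple[List[str], str]]) -> List[Tuple[str, str]]:
    """Map word-level labels onto sub-tokens.

    Each input item is (sub-tokens of one word, word-level label).
    The first sub-token receives B-<label>, the rest I-<label>;
    words labelled "O" stay "O" on every sub-token.
    """
    tagged = []
    for pieces, label in words:
        for i, piece in enumerate(pieces):
            if label == "O":
                tagged.append((piece, "O"))
            else:
                prefix = "B-" if i == 0 else "I-"
                tagged.append((piece, prefix + label))
    return tagged


# Sub-token splits as printed in the README example (not re-tokenized here).
# The "O" label for "in" is an assumption: the example lists no tag for it.
example = [
    (["o", "##bs", "##truct", "##ive"], "DISEASE"),
    (["bacterial"], "DISEASE"),
    (["men", "##ing", "##itis"], "DISEASE"),
    (["in"], "O"),
]

for piece, tag in expand_tags(example):
    print(f"{piece:10s} {tag}")
```

Keeping the expansion separate from tokenization makes it easy to swap in a different scheme, for example labelling only the first sub-token and ignoring the rest during training.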
config.json
CHANGED
@@ -10,18 +10,22 @@
   "hidden_size": 768,
   "id2label": {
     "0": "O",
-    "1": "B-
-    "2": "I-
-    "3": "B-
-    "4": "I-
+    "1": "B-DISEASE",
+    "2": "I-DISEASE",
+    "3": "B-TREATMENT",
+    "4": "I-TREATMENT",
+    "5": "B-TECHNOLOGY",
+    "6": "I-TECHNOLOGY"
   },
   "initializer_range": 0.02,
   "intermediate_size": 3072,
   "label2id": {
-    "B-
-    "B-
-    "
-    "I-
+    "B-DISEASE": 1,
+    "B-TECHNOLOGY": 5,
+    "B-TREATMENT": 3,
+    "I-DISEASE": 2,
+    "I-TECHNOLOGY": 6,
+    "I-TREATMENT": 4,
     "O": 0
   },
   "layer_norm_eps": 1e-12,
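
The `id2label` and `label2id` maps in the new `config.json` need to remain exact inverses of each other once the two technology tags are added. The sketch below checks that, with the values copied from the "+" lines of the diff. The check is an illustration and not part of the commit.

```python
# Label maps as they appear on the new side of the config.json diff.
id2label = {
    "0": "O",
    "1": "B-DISEASE",
    "2": "I-DISEASE",
    "3": "B-TREATMENT",
    "4": "I-TREATMENT",
    "5": "B-TECHNOLOGY",
    "6": "I-TECHNOLOGY",
}
label2id = {
    "B-DISEASE": 1,
    "B-TECHNOLOGY": 5,
    "B-TREATMENT": 3,
    "I-DISEASE": 2,
    "I-TECHNOLOGY": 6,
    "I-TREATMENT": 4,
    "O": 0,
}

# Invert id2label (JSON keys are strings, ids are ints) and compare.
inverted = {label: int(idx) for idx, label in id2label.items()}
assert inverted == label2id, "id2label and label2id disagree"

# Ids should be contiguous from 0 so each one maps to a classifier output.
assert sorted(inverted.values()) == list(range(len(id2label)))
print(f"{len(id2label)} labels, maps are consistent")
```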
model.safetensors
CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:92ded4be8adf990a1d90dba8b6961b9130a24f6ed8e9a12ac7aaf00e49968575
+size 430923588
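
`model.safetensors` is stored through Git LFS, so the diff shows only the pointer file: a spec version, a SHA-256 object id, and a size in bytes. A minimal sketch for reading such a pointer follows, assuming the three-line `key value` layout shown in the diff. `parse_lfs_pointer` is a hypothetical helper name.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Split each non-empty 'key value' line of a Git LFS pointer."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is given in bytes
    return fields


# Pointer contents copied from the new side of the diff.
pointer = """\
version https://git-lfs.github.com/spec/v1
oid sha256:92ded4be8adf990a1d90dba8b6961b9130a24f6ed8e9a12ac7aaf00e49968575
size 430923588
"""

info = parse_lfs_pointer(pointer)
print(info["oid"], f"{info['size'] / 1e6:.1f} MB")
```

Comparing the parsed `oid` with a locally computed SHA-256 of the downloaded weights is one way to confirm that the file matches this commit.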
tokenizer.json
CHANGED
@@ -1,11 +1,6 @@
 {
   "version": "1.0",
-  "truncation":
-    "direction": "Right",
-    "max_length": 512,
-    "strategy": "LongestFirst",
-    "stride": 0
-  },
+  "truncation": null,
   "padding": null,
   "added_tokens": [
     {