---
license: mit
pipeline_tag: token-classification
tags:
- BERT
- bioBERT
- NER
- medical
metrics:
- f1
language:
- en
---

# Model

NER model for recognizing disease and treatment entities in English medical text. The model and the data are intended for educational use.

The original dataset tags have been augmented with "inside" tags to handle the sub-tokens produced by the WordPiece tokenizer. The following NER tags are used:
* `B-D`, `I-D`: begin and inside tags for disease
* `B-T`, `I-T`: begin and inside tags for treatment
* `O`: outside any entity (irrelevant)
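
The augmentation described above can be sketched as follows. This is a minimal illustration, not the author's preprocessing code: the function name `expand_word_tag` is hypothetical, the word is assumed to be already split into WordPiece sub-tokens, and the rule that every entity word starts with a `B-` tag is inferred from the example in this card.

```python
from typing import List


def expand_word_tag(pieces: List[str], word_tag: str) -> List[str]:
    """Spread one word-level tag ("D", "T" or "O") over its WordPiece sub-tokens.

    Assumption: the first sub-token gets the B- tag and all remaining
    sub-tokens get the I- tag; "O" is copied unchanged to every sub-token.
    """
    if not pieces:
        return []
    if word_tag == "O":
        return ["O"] * len(pieces)
    return [f"B-{word_tag}"] + [f"I-{word_tag}"] * (len(pieces) - 1)


# "meningitis" split into three sub-tokens, tagged as disease ("D")
print(expand_word_tag(["men", "##ing", "##itis"], "D"))
```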

```
# Text:
Acute obstructive hydrocephalus complicating bacterial meningitis in childhood

# Real (word-level labels):
Acute           -> D
obstructive     -> D
hydrocephalus   -> D
bacterial       -> D
meningitis      -> D

# Predictions (sub-token-level tags):
o##bs##truct##ive     -> B-D + I-D + I-D + I-D
h##ydro##ce##pha##lus -> B-D + I-D + I-D + I-D + I-D
bacterial             -> B-D
men##ing##itis        -> B-D + I-D + I-D
```
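
To compare sub-token predictions like the ones above with word-level labels, the pieces have to be merged back into words. The sketch below shows one possible way to do that; `merge_wordpieces` is a hypothetical helper, and taking the entity type from the first sub-token of each word is an assumption, not something this card specifies.

```python
from typing import List, Tuple


def merge_wordpieces(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge WordPiece sub-tokens into words with one entity type per word.

    Assumption: continuation pieces start with "##", and the word's entity
    type is taken from the tag of its first sub-token.
    """
    words: List[Tuple[str, str]] = []
    for token, tag in zip(tokens, tags):
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word, keep its type
            word, entity = words[-1]
            words[-1] = (word + token[2:], entity)
        else:
            # New word: strip the "B-" / "I-" prefix to get the entity type
            entity = tag.split("-", 1)[1] if "-" in tag else tag
            words.append((token, entity))
    return words


tokens = ["bacterial", "men", "##ing", "##itis"]
tags = ["B-D", "B-D", "I-D", "I-D"]
print(merge_wordpieces(tokens, tags))
```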

# Sources

This pipeline is based on the pretrained [dmis-lab/biobert-base-cased-v1.2](https://huggingface.co/dmis-lab/biobert-base-cased-v1.2) model,
fine-tuned on the relatively small [BeHealthy Medical Entity](https://www.kaggle.com/datasets/arunagirirajan/medical-entity-recognition-ner)
dataset (1,550 training samples).

# Performance

The model has not been extensively tuned. The quality of the dataset is unclear because the origin of the data and the annotation process are unknown.

|Metric   |Score   |
|---------|--------|
|Precision|0.854523|
|Recall   |0.859779|
|F1       |0.857143|
|Accuracy |0.919590|
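
As a quick consistency check, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The sketch below does only that arithmetic. It does not reproduce the evaluation itself, and whether the scores were computed per token or per entity is not stated in this card.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table
f1 = f1_score(0.854523, 0.859779)
print(round(f1, 6))
```

The result should agree with the reported F1 up to rounding.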