# Data

We use the following dataset for fine-tuning:

- The [dataset](https://zenodo.org/record/7695390) from a [recent study](https://www.biorxiv.org/content/10.1101/2023.04.10.536208v1), containing titles and labels of PubMed articles.

It contains 20 million articles but only their titles (no abstracts; those can be [downloaded](https://www.nlm.nih.gov/databases/download/pubmed_medline.html) separately by article PMID). Fine-tuning a model on that much data takes considerable time and compute (approximate costs are [given in the paper](https://www.biorxiv.org/content/10.1101/2023.04.10.536208v1)), so below we use a reduced dataset and train on article titles only.

# Models

As the base model we use a BERT model pretrained on biomedical data (from PubMed):

- [BiomedNLP-PubMedBERT](https://huggingface.co/microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract)

---

# Imports

In [1]:
import torch
import transformers
import numpy as np
import pandas as pd
from tqdm import tqdm

from datasets import Dataset, ClassLabel
from transformers import AutoTokenizer, AutoModelForMaskedLM, AutoModelForSequenceClassification
from transformers import TrainingArguments, Trainer
from transformers import pipeline
import evaluate

# Load data

Let's load the data for fine-tuning. Specifically, we need the article titles and labels (this dataset has no abstracts).

In [2]:
df = pd.read_csv("pubmed_landscape_data.csv")

In [62]:
df = df[df.Labels != "unlabeled"]
df = df[~df.Title.isnull()]

In [63]:
print(df.shape)
df.head(5)

(7123406, 10)


| Unnamed: 0 | Title | Journal | PMID | Year | x | y | Labels | Colors | text | label |
|---|---|---|---|---|---|---|---|---|---|---|
| 18 | Determination of some in vitro growth requirem... | Journal of general microbiology | 1133574 | 1975.0 | -140.83 | 26.596 | microbiology | #B79762 | Determination of some in vitro growth requirem... | microbiology |
| 19 | Degradation of agar by a gram-negative bacterium. | Journal of general microbiology | 1133575 | 1975.0 | -72.913 | -4.436 | microbiology | #B79762 | Degradation of agar by a gram-negative bacterium. | microbiology |
| 20 | Choroid plexus isografts in rats. | Journal of neuropathology and experimental neu... | 1133586 | 1975.0 | -46.561 | 96.421 | neurology | #009271 | Choroid plexus isografts in rats. | neurology |
| 29 | Preliminary report on a mass screening program... | The Journal of pediatrics | 1133648 | 1975.0 | 45.033 | 39.256 | pediatric | #004D43 | Preliminary report on a mass screening program... | pediatric |
| 30 | Hepatic changes in young infants with cystic f... | The Journal of pediatrics | 1133649 | 1975.0 | 118.38 | 61.87 | pediatric | #004D43 | Hepatic changes in young infants with cystic f... | pediatric |


In [76]:
df.columns = ['text', 'journal', 'pmid', 'year', 'x', 'y', 'label', 'color']  # no abstract in this dataset

In [None]:
# Use subset of the data for faster training
df = df.head(1_000_000)

## Labels

We will use the annotated labels of the articles:

In [72]:
categories = np.unique(df['label'])
num_labels = len(categories)
print(f"Total: {num_labels} labels such as {categories[0]}, {categories[1]}, ..., {categories[-1]}")

Total: 38 labels such as anesthesiology, biochemistry, ..., virology


# Model

In [11]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

Tokenizer (title -> tokens; this dataset has no abstracts):

In [12]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

The model itself; `AutoModelForSequenceClassification` replaces the pretraining head with a classification head:

In [13]:
model = AutoModelForSequenceClassification.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract", num_labels=num_labels).to(device)

Some weights of the model checkpoint at microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract were not used when initializing BertForSequenceClassification: ['cls.seq_relationship.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.bias', 'cls.seq_relationship.weight', 'cls.predictions.decoder.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForSeque
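The head that `AutoModelForSequenceClassification` adds to a BERT model is a dropout layer followed by a single linear layer over the pooled `[CLS]` representation. Below is a minimal NumPy sketch of that linear layer, assuming the encoder width of 768 (visible in the `print(model)` output) and the 38 labels found in the data. The weights and inputs are random placeholders, not the real ones.

```python
import numpy as np

hidden_size = 768   # encoder width of the base model
num_labels = 38     # number of categories found in the data

rng = np.random.default_rng(0)
# Randomly initialised head: this is the part that fine-tuning has to learn.
W = rng.normal(scale=0.02, size=(num_labels, hidden_size))
b = np.zeros(num_labels)

# Placeholder for the pooled [CLS] vectors of a batch of 4 titles.
pooled = rng.normal(size=(4, hidden_size))

logits = pooled @ W.T + b        # one logit per label: shape (batch, num_labels)
head_params = W.size + b.size    # 768 * 38 weights + 38 biases

print(logits.shape, head_params)
```

The head is tiny compared to the encoder, which is why its weights are reported as newly initialised while the encoder weights are loaded from the checkpoint.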

In [14]:
print(model)

BertForSequenceClassification(
  (bert): BertModel(
    (embeddings): BertEmbeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (token_type_embeddings): Embedding(2, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0-11): 12 x BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-12,

# Training

## Data Loaders

When working with `transformers`, it is often more convenient to handle the data with the `datasets` library.

Let's create a (Hugging Face) [dataset](https://huggingface.co/docs/datasets/tabular_load#pandas-dataframes):

In [84]:
np.random.seed(42)
is_train = np.random.binomial(1, .9, size=len(df))
train_indices = np.arange(len(df))[is_train.astype(bool)]
test_indices = np.arange(len(df))[(1 - is_train).astype(bool)]

In [85]:
train_df = df.loc[:,["text", "label"]].iloc[train_indices]
test_df = df.loc[:,["text", "label"]].iloc[test_indices]

train_ds = Dataset.from_pandas(train_df, split="train")
test_ds = Dataset.from_pandas(test_df, split="test")
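The Bernoulli mask above assigns each row to the training set with probability 0.9, so the split is only approximately 90/10 and rare labels may be thinly represented in the test set. If that matters, a per-label split is one alternative. The helper below is a hypothetical sketch on a toy label list (the name `stratified_split` does not come from the notebook):

```python
import numpy as np

def stratified_split(labels, train_frac=0.9, seed=42):
    """Return sorted (train_idx, test_idx), splitting each label separately."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train_idx, test_idx = [], []
    for value in np.unique(labels):
        idx = np.flatnonzero(labels == value)
        rng.shuffle(idx)
        n_train = int(round(train_frac * len(idx)))
        if len(idx) > 1:
            # Keep at least one example of every label on the test side.
            n_train = min(n_train, len(idx) - 1)
        train_idx.extend(idx[:n_train])
        test_idx.extend(idx[n_train:])
    return np.sort(train_idx), np.sort(test_idx)

toy_labels = ["virology"] * 20 + ["neurology"] * 10 + ["pediatric"] * 5
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))
```

The resulting index arrays can be used with `.iloc` in the same way as `train_indices` and `test_indices`.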

In [86]:
def tokenize_text(row):
    return tokenizer(
        row["text"],
        max_length=512,
        truncation=True,
        padding='max_length',
    )

train_ds = train_ds.map(tokenize_text, batched=True)
test_ds = test_ds.map(tokenize_text, batched=True)

Map:   0%|          | 0/63085 [00:00<?, ? examples/s]

Map:   0%|          | 0/6915 [00:00<?, ? examples/s]

(Even this step can take about an hour on this much data...)
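Part of the cost comes from `padding='max_length'`: every title is padded to 512 tokens, although titles are typically much shorter. Padding each batch only to its longest sequence keeps the tensors far smaller; in `transformers` this is what `DataCollatorWithPadding` does at batch time when tokenization is run without padding. The pure-Python sketch below compares the two strategies on made-up token ids. `pad_batch` is an illustrative helper, not a library function.

```python
def pad_batch(sequences, pad_id=0, max_length=None):
    """Pad token-id lists to max_length, or to the longest sequence if None."""
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]

# Made-up token ids standing in for three tokenized titles.
batch = [
    [2, 1001, 1002, 3],
    [2, 1003, 1004, 1005, 1006, 1007, 3],
    [2, 1008, 3],
]

fixed = pad_batch(batch, max_length=512)   # equivalent of padding='max_length'
dynamic = pad_batch(batch)                 # pad to the longest title in the batch

fixed_tokens = sum(len(s) for s in fixed)
dynamic_tokens = sum(len(s) for s in dynamic)
print(fixed_tokens, dynamic_tokens)
```

In this toy batch the fixed strategy stores 1536 token positions and the dynamic one 21, which gives a rough idea of how much of the work is spent on padding.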

In [87]:
labels_map = ClassLabel(num_classes=num_labels, names=list(categories))

def transform_labels(row):
    # default name for a label (label or label_ids)
    return {"label": labels_map.str2int(row["label"])}

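# OR (plain pandas mapping instead of ClassLabel):
# 
# labels_map = pd.Series(
#     np.arange(num_labels),
#     index=categories,
# )
# 
# def transform_labels(row):
#     # the column is named "label"; with batched=True, row["label"] is a list
#     return {"label": labels_map[row["label"]].tolist()}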

train_ds = train_ds.map(transform_labels, batched=True)
test_ds = test_ds.map(transform_labels, batched=True)

train_ds = train_ds.cast_column('label', labels_map)
test_ds = test_ds.cast_column('label', labels_map)

Map:   0%|          | 0/63085 [00:00<?, ? examples/s]

Map:   0%|          | 0/6915 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/63085 [00:00<?, ? examples/s]

Casting the dataset:   0%|          | 0/6915 [00:00<?, ? examples/s]
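Conceptually, `ClassLabel` is a two-way mapping between label names and integer ids, with ids assigned in the order of `names`. Because `np.unique` returns sorted values, the ids here follow alphabetical order. A plain-Python sketch with a handful of the labels (the short list is for illustration only):

```python
# A few of the 38 labels, in the sorted order np.unique would return them.
names = ["anesthesiology", "biochemistry", "microbiology",
         "neurology", "pediatric", "virology"]

str2int = {name: i for i, name in enumerate(names)}
int2str = dict(enumerate(names))

# A batch as seen by a function passed to Dataset.map(..., batched=True).
batch = {"label": ["neurology", "pediatric", "microbiology"]}
encoded = {"label": [str2int[name] for name in batch["label"]]}
decoded = [int2str[i] for i in encoded["label"]]

print(encoded)   # integer ids for the batch
print(decoded)   # round trip back to names
```

The same ordering is what the `id2label` and `label2id` dictionaries passed to the model rely on, so predictions can be mapped back to label names.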

## Prepare training

In [88]:
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract", 
    num_labels=num_labels,
    id2label={i:labels_map.names[i] for i in range(len(categories))},
    label2id={labels_map.names[i]:i for i in range(len(categories))},
).to(device)

In [89]:
tokenizer = AutoTokenizer.from_pretrained("microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract")

We will compute accuracy:

In [90]:
metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)
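For intuition, here is what `compute_metrics` does on a toy batch: the predicted class is the index of the largest logit in each row, and accuracy is the fraction of rows where it matches the reference label. The logits are invented, and plain NumPy stands in for the `evaluate` metric.

```python
import numpy as np

# Invented logits for 4 examples and 3 classes.
logits = np.array([
    [2.0, 0.1, -1.0],   # argmax -> 0
    [0.3, 1.5,  0.2],   # argmax -> 1
    [0.0, 0.2,  3.1],   # argmax -> 2
    [1.2, 0.9,  0.1],   # argmax -> 0
])
labels = np.array([0, 1, 2, 1])

predictions = np.argmax(logits, axis=-1)
accuracy = float((predictions == labels).mean())
print(predictions, accuracy)   # three of four rows match, so accuracy is 0.75
```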

Training parameters:

In [92]:
training_args = TrainingArguments(
    output_dir="bert-paper-classifier", 
    evaluation_strategy="epoch",
    per_device_train_batch_size=64,
    num_train_epochs=3,
    logging_steps=100,
)
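These settings determine how many optimizer steps the run takes. A back-of-the-envelope calculation, assuming a single device, no gradient accumulation, and the 63,085 training examples reported by the `Map` progress bar for `train_ds`:

```python
import math

num_train_examples = 63_085   # size of train_ds according to the Map output
batch_size = 64               # per_device_train_batch_size
num_epochs = 3                # num_train_epochs
logging_steps = 100

# The last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(num_train_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
log_events = total_steps // logging_steps

print(steps_per_epoch, total_steps, log_events)
```

Under these assumptions the run takes 986 steps per epoch and 2958 steps in total, with the training loss logged 29 times. With several GPUs the effective batch size grows and the step count shrinks accordingly.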

## Training

In [93]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=test_ds,
    compute_metrics=compute_metrics,
)

In [None]:
trainer.train()
# Convert to a python file and run training:
#! jupyter nbconvert finetuning-pubmed.ipynb --to python

# Save and share

In [96]:
trainer.args.hub_model_id = "bert-paper-classifier"

In [145]:
tokenizer.save_pretrained("bert-paper-classifier")

('bert-paper-classifier/tokenizer_config.json',
 'bert-paper-classifier/special_tokens_map.json',
 'bert-paper-classifier/vocab.txt',
 'bert-paper-classifier/added_tokens.json',
 'bert-paper-classifier/tokenizer.json')

In [146]:
trainer.save_model("bert-paper-classifier")

Let's push the model to the HF Hub:

In [148]:
trainer.push_to_hub()

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)

/g/stegle/bredikhi/projects/ml2/transformers/bert-paper-classifier is already a clone of https://huggingface.co/oracat/bert-paper-classifier. Make sure you pull the latest changes with `repo.git_pull()`.


To https://huggingface.co/oracat/bert-paper-classifier
   862abb7..b95fd36  main -> main



KeyboardInterrupt: 

# Inference

Now let's try loading the model from the HF Hub:

In [2]:
inference_tokenizer = AutoTokenizer.from_pretrained("oracat/bert-paper-classifier")
inference_model = AutoModelForSequenceClassification.from_pretrained("oracat/bert-paper-classifier")

Downloading pytorch_model.bin:   0%|          | 0.00/438M [00:00<?, ?B/s]

In [3]:
pipe = pipeline("text-classification", model=inference_model, tokenizer=inference_tokenizer, top_k=None)
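With `top_k=None` the pipeline returns a score for every label rather than only the best one. For a single-label classifier such as this one, the scores come from a softmax over the model's logits, assuming the pipeline's default post-processing. The sketch below reproduces that step on invented logits for three of the labels; `softmax_scores` is an illustrative helper, not part of `transformers`.

```python
import math

def softmax_scores(logits, id2label):
    """Turn raw logits into pipeline-style [{'label': ..., 'score': ...}] entries."""
    shift = max(logits)   # subtract the maximum for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [{"label": id2label[i], "score": e / total} for i, e in enumerate(exps)]

# Invented logits for three of the labels.
id2label = {0: "neurology", 1: "pediatric", 2: "psychiatry"}
scores = softmax_scores([1.0, -0.5, 4.0], id2label)

for item in scores:
    print(item["label"], round(item["score"], 3))
```

Because the scores sum to 1 across labels, accumulating them from the top down until a threshold is reached is a meaningful way to pick a short list of candidate labels.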

In [4]:
def top_pct(preds, threshold=.95):
    preds = sorted(preds, key=lambda x: -x["score"])
    
    cum_score = 0
    for i, item in enumerate(preds):
        cum_score += item["score"]
        if cum_score >= threshold:
            break

    preds = preds[:(i+1)]
    
    return preds

In [5]:
def format_predictions(preds) -> str:
    """
    Prepare predictions and their scores for printing to the user
    """
    out = ""
    for i, item in enumerate(preds):
        out += f"{i+1}. {item['label']} (score {item['score']:.2f})\n"
    return out
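To see how the two helpers behave, the toy example below restates them in compact form and applies them to invented scores. With the 0.95 threshold, labels are kept in descending score order until their cumulative score reaches 0.95.

```python
def top_pct(preds, threshold=0.95):
    """Keep the highest-scoring predictions until their scores sum to threshold."""
    kept, cum_score = [], 0.0
    for item in sorted(preds, key=lambda x: -x["score"]):
        kept.append(item)
        cum_score += item["score"]
        if cum_score >= threshold:
            break
    return kept

def format_predictions(preds) -> str:
    """Prepare predictions and their scores for printing to the user."""
    return "".join(
        f"{i + 1}. {item['label']} (score {item['score']:.2f})\n"
        for i, item in enumerate(preds)
    )

# Invented scores, deliberately unsorted.
toy = [
    {"label": "neurology", "score": 0.30},
    {"label": "psychiatry", "score": 0.60},
    {"label": "virology", "score": 0.02},
    {"label": "pediatric", "score": 0.08},
]

selected = top_pct(toy)
print(format_predictions(selected))
```

Here the first two labels cover only 0.90 of the probability mass, so a third one is included and the last one is dropped.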

Let's take an [article](https://www.nature.com/articles/515180a) as an example:

In [6]:
print(
    format_predictions(
        top_pct(
            pipe("""
Mental health: A world of depression
Depression is a major human blight. Globally, it is responsible for more ‘years lost’ to disability than any other condition. This is largely because so many people suffer from it — some 350 million, according to the World Health Organization — and the fact that it lasts for many years. (When ranked by disability and death combined, depression comes ninth behind prolific killers such as heart disease, stroke and HIV.) Yet depression is widely undiagnosed and untreated because of stigma, lack of effective therapies and inadequate mental-health resources. Almost half of the world’s population lives in a country with only two psychiatrists per 100,000 people.
"""
            )[0]
        )
    )
)

1. psychiatry (score 0.97)

