---
library_name: transformers
license: cc-by-sa-4.0
datasets:
  - classla/ParlaSpeech-RS
  - classla/ParlaSpeech-HR
  - classla/Mici_Princ
language:
  - sl
  - hr
  - sr
metrics:
  - accuracy
base_model:
  - facebook/w2v-bert-2.0
---

# Model Card

This model annotates primary stress in words at the resolution of 20 ms audio frames.

## Model Details

### Model Description

- Developed by: Peter Rupnik, Nikola Ljubešić, Ivan Porupski, Nejc Robida
- Model type: Audio frame classifier
- Language(s): Croatian, Slovenian, Serbian, the Chakavian variant of Croatian
- License: Creative Commons Attribution-ShareAlike 4.0 (cc-by-sa-4.0)

### Model Sources

- Paper: Coming soon

## Direct Use

The model is intended for data-driven analyses of primary stress position. So far, it has been shown to work on 4 datasets in 3 languages.

### Example use

```python
from itertools import pairwise

import numpy as np
import pandas as pd
import torch
from datasets import Audio, Dataset
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

if torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")

model_name = "5roop/Wav2Vec2BertPrimaryStressAudioFrameClassifier"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)
# Path to the audio file containing the word to be annotated:
f = "wavs/word.wav"


def frames_to_intervals(frames: list[int]) -> list[tuple[float]]:
    # Convert per-frame labels (one label per 20 ms frame) into second-based
    # intervals and keep only the longest contiguous stressed region.
    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    for si, ei in pairwise(indices_of_change):
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            pass
        else:
            results.append(
                (round(ndf.loc[si, "time_s"], 3), round(ndf.loc[ei - 1, "time_s"], 3))
            )
    if results == []:
        return None
    # Post-processing: if multiple regions were returned, only the longest should be taken:
    if len(results) > 1:
        results = sorted(results, key=lambda t: t[1]-t[0], reverse=True)
    return results[0:1]


def evaluator(chunks):
    # Batched inference: extract features, run the frame classifier, and
    # convert the per-frame predictions into primary-stress intervals.
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    y_pred_raw = np.array(logits.cpu())
    y_pred = y_pred_raw.argmax(axis=-1)
    primary_stress = [frames_to_intervals(i) for i in y_pred]
    return {
        "y_pred": y_pred,
        "y_pred_logits": y_pred_raw,
        "primary_stress": primary_stress,
    }

# Create a dataset with a single instance and map our evaluator function on it:
ds = Dataset.from_dict({"audio": [f]}).cast_column("audio", Audio(16000, mono=True))
ds = ds.map(evaluator, batched=True, batch_size=1) # Adjust batch size according to your hardware specs
print(ds["y_pred"][0])
# Outputs: [0, 0, 1, 1, 1, 1, 1, ...]
print(ds["y_pred_logits"][0])
# Outputs:
# [[ 0.89419061, -0.77746612],
#  [ 0.44213724, -0.34862748],
#  [-0.08605709,  0.13012762],
# ....
print(ds["primary_stress"][0])
# Outputs: [0.34, 0.4]
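
To annotate more than one recording, the same pattern scales to a dataset built from a list of word-level audio files. Below is a minimal sketch that reuses the `feature_extractor`, `model`, and `evaluator` defined above; the `wavs/` directory and its contents are placeholders for your own data.

```python
from pathlib import Path

from datasets import Audio, Dataset

# Hypothetical directory with one word per .wav file; adjust to your layout.
wav_paths = sorted(str(p) for p in Path("wavs").glob("*.wav"))

ds = Dataset.from_dict({"audio": wav_paths}).cast_column("audio", Audio(16000, mono=True))
# batch_size=1 mirrors the example above; larger batches may require padding
# in the feature extractor call inside evaluator.
ds = ds.map(evaluator, batched=True, batch_size=1)

for path, interval in zip(wav_paths, ds["primary_stress"]):
    # Each entry is the predicted primary-stress interval in seconds
    # (or None if no stressed region was detected).
    print(path, interval)
```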

## Training Details

### Training Data

10443 manually annotated multisyllabic words from ParlaSpeech-HR.

### Training Procedure

#### Training Hyperparameters

- Learning rate: 1e-5
- Batch size: 32
- Number of epochs: 20
- Weight decay: 0.01
- Gradient accumulation steps: 1
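
These values map directly onto `transformers.TrainingArguments`. The sketch below is illustrative only and is not the original training script: the output path is a placeholder, `num_labels=2` follows from the two-column logits shown above, and dataset preparation (frame-level label alignment) is omitted.

```python
from transformers import TrainingArguments, Wav2Vec2BertForAudioFrameClassification

# Two frame-level classes: 0 = unstressed frame, 1 = primary-stress frame.
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(
    "facebook/w2v-bert-2.0", num_labels=2
)

# The hyperparameters listed above, expressed as TrainingArguments.
training_args = TrainingArguments(
    output_dir="w2v-bert-primary-stress",  # placeholder output directory
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=20,
    weight_decay=0.01,
    gradient_accumulation_steps=1,
)
```

A `Trainer` built with these arguments and a dataset of frame-labelled words would then complete the loop; that part is not reproduced here.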

## Evaluation

### Testing Data, Factors & Metrics

### Summary

## Citation

Coming soon