Sámi Wav2vec2-Base ASR

GetmanY1/wav2vec2-base-sami-cont-pt-22k fine-tuned on 20 hours of 16 kHz sampled speech audio from the Sámi Parliament sessions.

When using the model, make sure that your speech input is also sampled at 16 kHz.
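If your recordings use a different sampling rate, resample them to 16 kHz before running the model. Below is a minimal sketch using the datasets library; the audiofolder loader and the data_dir path are placeholders for your own data:

from datasets import load_dataset, Audio

# placeholder: point data_dir at a folder with your own Sámi recordings
ds = load_dataset("audiofolder", data_dir="path/to/audio")["train"]

# casting the column makes datasets resample each clip to 16 kHz on access
ds = ds.cast_column("audio", Audio(sampling_rate=16000))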

Model description

The Sámi Wav2Vec2 Base model has the same architecture and uses the same training objective as the English model described in the original wav2vec 2.0 paper.

GetmanY1/wav2vec2-base-sami-cont-pt-22k is a large-scale, 95-million-parameter monolingual model pre-trained on 22.4k hours of unlabeled Sámi speech from KAVI radio and television archive materials. You can read more about the pre-trained model in this paper.

The model was evaluated on 1 hour of out-of-domain read-aloud and spontaneous speech of varying audio quality.

Intended uses

You can use this model for Sámi ASR (speech-to-text).

How to use

To transcribe audio files, the model can be used as a standalone acoustic model as follows:

from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")

# load a dataset with a 16 kHz "audio" column (placeholder: use your own data)
ds = load_dataset("audiofolder", data_dir="path/to/audio")["train"]

# tokenize
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode (greedy CTC decoding)
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
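Alternatively, for quick experiments, the automatic-speech-recognition pipeline bundles feature extraction, inference, and greedy CTC decoding into a single call (a minimal sketch; the file name is a placeholder):

from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")

# placeholder path to a 16 kHz recording
print(asr("audio.wav")["text"])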

Prefix Beam Search

In our experiments (see paper), we observed a slight improvement in terms of Character Error Rate (CER) when using prefix beam search compared to greedy decoding, primarily due to a reduction in deletions. Below is our adapted version of corticph/prefix-beam-search for use with wav2vec 2.0 in Hugging Face Transformers. Note that an external language model (LM) is not required, as the function defaults to a uniform probability when none is provided.

import re
import numpy as np
from collections import defaultdict, Counter

def prefix_beam_search(ctc, lm=None, k=25, alpha=0.30, beta=5, prune=0.001):
    """
    Performs prefix beam search on the output of a CTC network.

    Args:
        ctc (np.ndarray): The CTC output. Should be a 2D array (timesteps x alphabet_size)
        lm (func): Language model function. Should take as input a string and output a probability.
        k (int): The beam width. Will keep the 'k' most likely candidates at each timestep.
        alpha (float): The language model weight. Should usually be between 0 and 1.
        beta (float): The language model compensation term. The higher the 'alpha', the higher the 'beta'.
        prune (float): Only extend prefixes with chars with an emission probability higher than 'prune'.

    Returns:
        string: The decoded CTC output.
    """

    lm = (lambda l: 1) if lm is None else lm # if no LM is provided, just set to function returning 1
    W = lambda l: re.findall(r'\w+[\s|>]', l)
    alphabet = list({k: v for k, v in sorted(processor.tokenizer.vocab.items(), key=lambda item: item[1])})
    alphabet = list(map(lambda x: x.replace(processor.tokenizer.special_tokens_map['eos_token'], '>') \
                        .replace(processor.tokenizer.special_tokens_map['pad_token'], '%') \
                        .replace('|', ' '), alphabet))


    F = ctc.shape[1]
    ctc = np.vstack((np.zeros(F), ctc)) # add an imaginary zeroth step (makes indexing more intuitive)
    T = ctc.shape[0]

    # STEP 1: Initialization
    O = ''
    Pb, Pnb = defaultdict(Counter), defaultdict(Counter)
    Pb[0][O] = 1
    Pnb[0][O] = 0
    A_prev = [O]
    # END: STEP 1

    # STEP 2: Iterations and pruning
    for t in range(1, T):
        pruned_alphabet = [alphabet[i] for i in np.where(ctc[t] > prune)[0]]
        for l in A_prev:
            if len(l) > 0 and l.endswith('>'):
                Pb[t][l] = Pb[t - 1][l]
                Pnb[t][l] = Pnb[t - 1][l]
                continue  
            for c in pruned_alphabet:
                c_ix = alphabet.index(c)
                # END: STEP 2
                
                # STEP 3: "Extending" with a blank
                if c == '%':
                    Pb[t][l] += ctc[t][0] * (Pb[t - 1][l] + Pnb[t - 1][l]) 
                # END: STEP 3
                
                # STEP 4: Extending with the end character
                else:
                    l_plus = l + c
                    if len(l) > 0 and l.endswith(c):
                        Pnb[t][l_plus] += ctc[t][c_ix] * Pb[t - 1][l]
                        Pnb[t][l] += ctc[t][c_ix] * Pnb[t - 1][l]
                # END: STEP 4

                    # STEP 5: Extending with any other non-blank character and LM constraints
                    elif len(l.replace(' ', '')) > 0 and c in (' ', '>'):
                        lm_prob = lm(l_plus.strip(' >')) ** alpha
                        Pnb[t][l_plus] += lm_prob * ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
                    else:
                        Pnb[t][l_plus] += ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
                    # END: STEP 5

                    # STEP 6: Make use of discarded prefixes
                    if l_plus not in A_prev:
                        Pb[t][l_plus] += ctc[t][0] * (Pb[t - 1][l_plus] + Pnb[t - 1][l_plus])
                        Pnb[t][l_plus] += ctc[t][c_ix] * Pnb[t - 1][l_plus]
                    # END: STEP 6

        # STEP 7: Select most probable prefixes
        A_next = Pb[t] + Pnb[t]
        sorter = lambda l: A_next[l] * (len(W(l)) + 1) ** beta
        A_prev = sorted(A_next, key=sorter, reverse=True)[:k]
        # END: STEP 7

    return A_prev[0].strip('>')

def map_to_pred_prefix_beam_search(batch):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)  # make sure the model and the inputs live on the same device
    input_values = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    probs = torch.softmax(logits, dim=-1)
    transcription = [prefix_beam_search(probs[0].cpu().numpy(), lm=None)]
    batch["transcription"] = transcription
    return batch

# assumes each example has a "speech" column holding a 16 kHz waveform array
result = ds.map(map_to_pred_prefix_beam_search, batched=True, batch_size=1, remove_columns=["speech"])
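To measure how much prefix beam search helps on your own data, you can compute the CER against reference transcripts with the evaluate library (a minimal sketch; the "text" reference column is an assumption about your dataset):

import evaluate

cer = evaluate.load("cer")
score = cer.compute(
    predictions=result["transcription"],
    references=result["text"],  # hypothetical column with reference transcripts
)
print(f"CER: {score:.3f}")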

Team Members

Feel free to contact us for more details 🤗
