Sámi Wav2vec2-Base ASR
GetmanY1/wav2vec2-base-sami-cont-pt-22k fine-tuned on 20 hours of 16 kHz sampled speech audio from the Sámi Parliament sessions.
When using the model, make sure that your speech input is also sampled at 16 kHz.
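If your recordings are stored at a different sampling rate, they can be resampled on the fly, for example with the datasets library. The sketch below is illustrative only; the audiofolder loader and data_dir path are placeholders for your own data.
from datasets import load_dataset, Audio
# hypothetical example: load your own audio files and cast the audio column to 16 kHz
ds = load_dataset("audiofolder", data_dir="path/to/your/sami_audio", split="train")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # decoded arrays are now resampled to 16 kHz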
Model description
The Sámi Wav2Vec2 Base model has the same architecture and uses the same training objective as the original English model described in the wav2vec 2.0 paper.
GetmanY1/wav2vec2-base-sami-cont-pt-22k is a large-scale, 95-million-parameter monolingual model pre-trained on 22.4k hours of unlabeled Sámi speech from KAVI radio and television archive materials. You can read more about the pre-trained model in this paper.
The model was evaluated on 1 hour of out-of-domain read-aloud and spontaneous speech of varying audio quality.
Intended uses
You can use this model for Sámi ASR (speech-to-text).
How to use
To transcribe audio files, the model can be used as a standalone acoustic model as follows:
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
from datasets import load_dataset
import torch

# load model and processor
processor = Wav2Vec2Processor.from_pretrained("GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")
model = Wav2Vec2ForCTC.from_pretrained("GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")

# load a dataset of 16 kHz speech with load_dataset(...);
# the example below assumes a datasets-style "audio" column
# ds = load_dataset(...)

# tokenize
input_values = processor(ds[0]["audio"]["array"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values  # Batch size 1

# retrieve logits
logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
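Alternatively, for quick experiments, the high-level pipeline API should also work with this checkpoint. The sketch below is illustrative; the audio file path is a placeholder.
from transformers import pipeline

# load the fine-tuned model behind the ASR pipeline
asr = pipeline("automatic-speech-recognition", model="GetmanY1/wav2vec2-base-sami-cont-pt-22k-finetuned")

# the pipeline decodes and resamples audio files before transcribing them
print(asr("path/to/sami_recording.wav")["text"])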
Prefix Beam Search
In our experiments (see paper), we observed a slight improvement in terms of Character Error Rate (CER) when using prefix beam search compared to greedy decoding, primarily due to a reduction in deletions. Below is our adapted version of corticph/prefix-beam-search for use with wav2vec 2.0 in Hugging Face Transformers. Note that an external language model (LM) is not required, as the function defaults to a uniform probability when none is provided.
import re
import numpy as np
from collections import defaultdict, Counter

def prefix_beam_search(ctc, lm=None, k=25, alpha=0.30, beta=5, prune=0.001):
    """
    Performs prefix beam search on the output of a CTC network.

    Args:
        ctc (np.ndarray): The CTC output. Should be a 2D array (timesteps x alphabet_size)
        lm (func): Language model function. Should take as input a string and output a probability.
        k (int): The beam width. Will keep the 'k' most likely candidates at each timestep.
        alpha (float): The language model weight. Should usually be between 0 and 1.
        beta (float): The language model compensation term. The higher the 'alpha', the higher the 'beta'.
        prune (float): Only extend prefixes with chars with an emission probability higher than 'prune'.

    Returns:
        string: The decoded CTC output.
    """
    lm = (lambda l: 1) if lm is None else lm  # if no LM is provided, just set to function returning 1
    W = lambda l: re.findall(r'\w+[\s|>]', l)
    alphabet = list({k: v for k, v in sorted(processor.tokenizer.vocab.items(), key=lambda item: item[1])})
    alphabet = list(map(lambda x: x.replace(processor.tokenizer.special_tokens_map['eos_token'], '>')
                                   .replace(processor.tokenizer.special_tokens_map['pad_token'], '%')
                                   .replace('|', ' '), alphabet))
    F = ctc.shape[1]
    ctc = np.vstack((np.zeros(F), ctc))  # just add an imaginary zero'th step (will make indexing more intuitive)
    T = ctc.shape[0]

    # STEP 1: Initialization
    O = ''
    Pb, Pnb = defaultdict(Counter), defaultdict(Counter)
    Pb[0][O] = 1
    Pnb[0][O] = 0
    A_prev = [O]
    # END: STEP 1

    # STEP 2: Iterations and pruning
    for t in range(1, T):
        pruned_alphabet = [alphabet[i] for i in np.where(ctc[t] > prune)[0]]
        for l in A_prev:

            if len(l) > 0 and l.endswith('>'):
                Pb[t][l] = Pb[t - 1][l]
                Pnb[t][l] = Pnb[t - 1][l]
                continue

            for c in pruned_alphabet:
                c_ix = alphabet.index(c)
                # END: STEP 2

                # STEP 3: "Extending" with a blank
                if c == '%':
                    Pb[t][l] += ctc[t][0] * (Pb[t - 1][l] + Pnb[t - 1][l])
                # END: STEP 3

                # STEP 4: Extending with the end character
                else:
                    l_plus = l + c
                    if len(l) > 0 and l.endswith(c):
                        Pnb[t][l_plus] += ctc[t][c_ix] * Pb[t - 1][l]
                        Pnb[t][l] += ctc[t][c_ix] * Pnb[t - 1][l]
                    # END: STEP 4

                    # STEP 5: Extending with any other non-blank character and LM constraints
                    elif len(l.replace(' ', '')) > 0 and c in (' ', '>'):
                        lm_prob = lm(l_plus.strip(' >')) ** alpha
                        Pnb[t][l_plus] += lm_prob * ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
                    else:
                        Pnb[t][l_plus] += ctc[t][c_ix] * (Pb[t - 1][l] + Pnb[t - 1][l])
                    # END: STEP 5

                    # STEP 6: Make use of discarded prefixes
                    if l_plus not in A_prev:
                        Pb[t][l_plus] += ctc[t][0] * (Pb[t - 1][l_plus] + Pnb[t - 1][l_plus])
                        Pnb[t][l_plus] += ctc[t][c_ix] * Pnb[t - 1][l_plus]
                    # END: STEP 6

        # STEP 7: Select most probable prefixes
        A_next = Pb[t] + Pnb[t]
        sorter = lambda l: A_next[l] * (len(W(l)) + 1) ** beta
        A_prev = sorted(A_next, key=sorter, reverse=True)[:k]
        # END: STEP 7

    return A_prev[0].strip('>')
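# Illustrative note (not part of the original recipe): 'lm' can be any callable that maps a
# text string to a probability. A crude word-level unigram model estimated from a hypothetical
# list of training transcripts 'train_sentences' could, for instance, be plugged in like this:
#
#   unigram_counts = Counter(w for s in train_sentences for w in s.split())
#   total = sum(unigram_counts.values())
#   unigram_lm = lambda sent: float(np.prod(
#       [(unigram_counts[w] + 1) / (total + len(unigram_counts)) for w in sent.split()]))
#   transcription = prefix_beam_search(ctc_probs, lm=unigram_lm, alpha=0.30, beta=5)
#
# where 'ctc_probs' is the (timesteps x vocab_size) softmax output of the model.
# With lm=None (as below), the search falls back to a uniform probability of 1.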
def map_to_pred_prefix_beam_search(batch):
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model.to(device)  # make sure the model sits on the same device as the inputs
    # expects a "speech" column holding 16 kHz waveform arrays
    input_values = processor(batch["speech"], sampling_rate=16000, return_tensors="pt", padding="longest").input_values
    with torch.no_grad():
        logits = model(input_values.to(device)).logits
    probs = torch.softmax(logits, dim=-1)
    transcription = [prefix_beam_search(probs[0].cpu().numpy(), lm=None)]
    batch["transcription"] = transcription
    return batch

result = ds.map(map_to_pred_prefix_beam_search, batched=True, batch_size=1, remove_columns=["speech"])
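To score the resulting transcriptions against reference texts, a package such as jiwer can be used. The sketch below is illustrative and assumes the dataset also carries a "text" column with reference transcripts.
import jiwer

# hypothetical: "text" is assumed to be the reference-transcript column of your dataset
wer = jiwer.wer(result["text"], result["transcription"])
cer = jiwer.cer(result["text"], result["transcription"])
print(f"WER: {wer:.2%}, CER: {cer:.2%}")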
Team Members
- Yaroslav Getman, Hugging Face profile, LinkedIn profile
- Tamas Grosz, Hugging Face profile, LinkedIn profile
Feel free to contact us for more details 🤗
Evaluation results
- Test WER on Sami-1h-test (self-reported): 43.04
- Test CER on Sami-1h-test (self-reported): 15.76