Hugging Face Spaces arXiv

Model Details

This is a lightweight audio captioning model, with an Efficient-B2 encoder and a two-layer Transformer decoder. The model is trained on AudioCaps and unlabeled AudioSet.

Dependencies

Install corresponding dependencies to run the model:

pip install numpy torch torchaudio einops transformers efficientnet_pytorch

Usage

import torch
from transformers import AutoModel, PreTrainedTokenizerFast
import torchaudio


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# use the model trained on AudioCaps
model = AutoModel.from_pretrained(
    "wsntxxn/effb2-trm-audiocaps-captioning",
    trust_remote_code=True
).to(device)
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "wsntxxn/audiocaps-simple-tokenizer"
)

# inference on a single audio clip
wav, sr = torchaudio.load("/path/to/file.wav")
wav = torchaudio.functional.resample(wav, sr, model.config.sample_rate)
if wav.size(0) > 1:
    wav = wav.mean(0).unsqueeze(0)
with torch.no_grad():
    word_idxs = model(
        audio=wav,
        audio_length=[wav.size(1)],
    )
caption = tokenizer.decode(word_idxs[0], skip_special_tokens=True)
print(caption)

# inference on a batch
wav1, sr1 = torchaudio.load("/path/to/file1.wav")
wav1 = torchaudio.functional.resample(wav1, sr1, model.config.sample_rate)
wav1 = wav1.mean(0) if wav1.size(0) > 1 else wav1[0]

wav2, sr2 = torchaudio.load("/path/to/file2.wav")
wav2 = torchaudio.functional.resample(wav2, sr2, model.config.sample_rate)
wav2 = wav2.mean(0) if wav2.size(0) > 1 else wav2[0]

wav_batch = torch.nn.utils.rnn.pad_sequence([wav1, wav2], batch_first=True)

with torch.no_grad():
    word_idxs = model(
        audio=wav_batch,
        audio_length=[wav1.size(0), wav2.size(0)],
    )
captions = tokenizer.batch_decode(word_idxs, skip_special_tokens=True)
print(captions)
Downloads last month
94
Safetensors
Model size
13.8M params
Tensor type
F32
Β·
Inference Providers NEW
This model is not currently available via any of the supported Inference Providers.
The model cannot be deployed to the HF Inference API: The HF Inference API does not support model that require custom code execution.

Dataset used to train wsntxxn/effb2-trm-audiocaps-captioning

Spaces using wsntxxn/effb2-trm-audiocaps-captioning 2