
Use Cases

Virtual Speech Assistants

Many edge devices embed a virtual assistant to interact with end users more naturally. These assistants rely on ASR models to recognize voice commands and perform various tasks. For instance, you can ask your phone to dial a number, answer a general question, or schedule a meeting.

Caption Generation

A caption generation model takes audio as input and generates automatic captions through transcription for live-streamed or recorded videos. This helps with content accessibility: for example, an audience watching a video in a non-native language can rely on captions to follow the content. Captions can also help with information retention in online classes, making it easier to absorb the material while reading along and taking notes.

Task Variants

Multilingual ASR

Multilingual ASR models can convert audio in multiple languages into transcripts. Some multilingual ASR models include language identification blocks to improve performance.

Multilingual ASR has become popular because maintaining a single model for all languages simplifies the production pipeline. Take a look at Whisper to see how one model can handle 100+ languages.
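
For example, with the transformers pipeline you can point a multilingual Whisper checkpoint at an audio file and either let it detect the language or pin the language and task explicitly. This is a minimal sketch: the file name is a placeholder and the language/task values are illustrative.

from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# let Whisper detect the spoken language automatically
pipe("sample.flac")

# or pin the language and task (here: transcribe French audio)
pipe("sample.flac", generate_kwargs={"language": "french", "task": "transcribe"})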

Inference

The Hub contains over 9,000 ASR models that you can use right away by trying out the widgets directly in the browser or calling the models as a service using the Inference API. Here is a simple code snippet to do exactly this:

import json
import requests

API_URL = "https://api-inference.huggingface.co/models/openai/whisper-large-v3"
# API_TOKEN is your Hugging Face access token (from https://huggingface.co/settings/tokens)
headers = {"Authorization": f"Bearer {API_TOKEN}"}

def query(filename):
    # send the raw audio bytes to the Inference API
    with open(filename, "rb") as f:
        data = f.read()
    response = requests.post(API_URL, headers=headers, data=data)
    return json.loads(response.content.decode("utf-8"))

# the response is a JSON object, typically of the form {"text": "..."}
data = query("sample1.flac")

You can also use libraries such as transformers, speechbrain, NeMo, and espnet if you want to run inference yourself with just a few lines of code.

from transformers import pipeline

# the pipeline handles audio loading, feature extraction, and decoding
pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v2")

# pass a path to an audio file (raw bytes or a NumPy array also work)
pipe("sample.flac")
# {'text': "GOING ALONG SLUSHY COUNTRY ROADS AND SPEAKING TO DAMP AUDIENCES IN DRAUGHTY SCHOOL ROOMS DAY AFTER DAY FOR A FORTNIGHT HE'LL HAVE TO PUT IN AN APPEARANCE AT SOME PLACE OF WORSHIP ON SUNDAY MORNING AND HE CAN COME TO US IMMEDIATELY AFTERWARDS"}
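
For comparison, here is roughly what the same transcription looks like with speechbrain, assuming a recent release (the import path was speechbrain.pretrained in older versions) and using one of its pretrained LibriSpeech checkpoints from the Hub:

from speechbrain.inference.ASR import EncoderDecoderASR

# download the pretrained model from the Hub and cache it locally
asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
asr_model.transcribe_file("sample.flac")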

You can also use huggingface.js to transcribe audio with JavaScript, using models from the Hugging Face Hub.

import { HfInference } from "@huggingface/inference";

const inference = new HfInference(HF_ACCESS_TOKEN);
await inference.automaticSpeechRecognition({
    data: await (await fetch("sample.flac")).blob(),
    model: "openai/whisper-large-v2",
});

Solving ASR for your own data

We have some great news! You can fine-tune (transfer learning) a foundational speech model on a specific language without tons of data. Pretrained models such as Whisper, Wav2Vec2-MMS, and HuBERT exist. OpenAI's Whisper is a large multilingual model trained on 100+ languages and 4 million hours of speech.

The following blog post shows in detail how to fine-tune a pre-trained Whisper checkpoint on labeled data for ASR. With the right data and training strategy, you can fine-tune a high-performing model on a free Google Colab instance. We suggest reading the blog post for more info!
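
As a rough illustration of that workflow, the sketch below loads a labeled speech dataset and prepares it for Whisper fine-tuning. The dataset, language, and checkpoint size are example choices for illustration, not requirements; the actual training loop (data collator, Seq2SeqTrainer, evaluation) is covered in the blog post.

from datasets import Audio, load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

# example: Hindi subset of Common Voice, resampled to Whisper's expected 16 kHz
common_voice = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train")
common_voice = common_voice.cast_column("audio", Audio(sampling_rate=16_000))

processor = WhisperProcessor.from_pretrained("openai/whisper-small", language="Hindi", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

def prepare(batch):
    audio = batch["audio"]
    # compute log-Mel input features and tokenize the target transcription
    batch["input_features"] = processor.feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_features[0]
    batch["labels"] = processor.tokenizer(batch["sentence"]).input_ids
    return batch

common_voice = common_voice.map(prepare, remove_columns=common_voice.column_names)
# from here, a Seq2SeqTrainer with a padding data collator handles the training loop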

Hugging Face Whisper Event

In December 2022, over 450 participants collaborated to fine-tune and share 600+ Whisper ASR models in 100+ different languages. You can compare these models on the event's speech recognition leaderboard.

These events help democratize ASR for all languages, including low-resource ones. Beyond the trained models, they also help build practical, collaborative knowledge.

Useful Resources