Best strategy for inference on multiple GPUs

#124
by symdec - opened

Hello,
A question regarding serving this model for a near-real-time, multi-user use case.

I'm using this model on a server behind a FastAPI/uvicorn webserver. Right now it is working with the model running on 1 GPU.
I want to increase the serving throughput by using multiple GPUs, with one instance of Whisper on each.
Do you know which technologies I can use to queue HTTP requests and route them to the different instances/GPUs (with some load balancing) in order to maximize throughput and minimize latency?

Thanks in advance!

Ray Serve :)

Set the number of Ray Serve replicas to the number of GPUs you have available and set the actor options to num_gpus=1.

This gives each replica its own visible GPU, and you can instantiate a Whisper model on each replica.
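A minimal sketch of what that could look like, assuming Ray Serve 2.x and the open-source openai-whisper package; the replica count, model size, and temp-file handling are illustrative placeholders to adapt to your setup:

```python
# Sketch: one Whisper replica per GPU behind Ray Serve's HTTP proxy.
import tempfile

import whisper
from ray import serve
from starlette.requests import Request


@serve.deployment(
    num_replicas=4,                      # assumption: 4 GPUs -> 4 replicas
    ray_actor_options={"num_gpus": 1},   # pin one GPU per replica
)
class WhisperTranscriber:
    def __init__(self):
        # Ray sets CUDA_VISIBLE_DEVICES per replica, so "cuda" resolves
        # to the single GPU assigned to this replica.
        self.model = whisper.load_model("large-v2", device="cuda")

    async def __call__(self, request: Request) -> dict:
        # Serve's HTTP proxy queues incoming requests and balances them
        # across replicas, so no separate load balancer is needed.
        audio_bytes = await request.body()
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            tmp.write(audio_bytes)
            tmp.flush()
            result = self.model.transcribe(tmp.name)
        return {"text": result["text"]}


# Deploy with:  serve.run(WhisperTranscriber.bind())
# then POST raw audio bytes to http://<host>:8000/ (the default route prefix).
```

You can keep your FastAPI endpoint as a thin front end and forward requests to the Serve deployment, or let Serve expose the HTTP endpoint directly as above.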
