Best strategy for inference on multiple GPUs

#124
by symdec - opened

Hello,
A question regarding serving this model for a near-real-time, multi-user use case.

I'm using this model on a server behind a FastAPI/uvicorn webserver. Right now it is working with the model running on 1 GPU.
I want to increase the serving throughput by using multiple GPUs, with one instance of Whisper on each.
Do you know which technologies I can use to queue HTTP requests and route them to the different instances/GPUs (with some load balancing) in order to maximize throughput and minimize latency?

Thanks in advance!

Ray Serve :)

Set the number of Ray Serve replicas to the number of GPUs you have available and set the actor options to num_gpus=1.

This gives each replica its own visible GPU, and you can instantiate a Whisper model on each replica.
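A minimal sketch of what that could look like, assuming Ray Serve 2.x and the open-source openai-whisper package; the replica count, model size, and temp-file handling are illustrative placeholders to adapt to your setup:

```python
# Sketch: one Whisper replica per GPU behind Ray Serve's HTTP proxy.
import tempfile

import whisper
from ray import serve
from starlette.requests import Request


@serve.deployment(
    num_replicas=4,                      # assumption: 4 GPUs -> 4 replicas
    ray_actor_options={"num_gpus": 1},   # pin one GPU per replica
)
class WhisperTranscriber:
    def __init__(self):
        # Ray sets CUDA_VISIBLE_DEVICES per replica, so "cuda" resolves
        # to the single GPU assigned to this replica.
        self.model = whisper.load_model("large-v2", device="cuda")

    async def __call__(self, request: Request) -> dict:
        # Serve's HTTP proxy queues incoming requests and balances them
        # across replicas, so no separate load balancer is needed.
        audio_bytes = await request.body()
        with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
            tmp.write(audio_bytes)
            tmp.flush()
            result = self.model.transcribe(tmp.name)
        return {"text": result["text"]}


# Deploy with:  serve.run(WhisperTranscriber.bind())
# then POST raw audio bytes to http://<host>:8000/ (the default route prefix).
```

You can keep your FastAPI endpoint as a thin front end and forward requests to the Serve deployment, or let Serve expose the HTTP endpoint directly as above.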
