---
title: Polyhedron
emoji: π
colorFrom: yellow
colorTo: yellow
sdk: docker
pinned: false
license: apache-2.0
app_port: 8080
---
# Polyhedron
Polyhedron is a voice chat application designed to enable real-time transcription and translation for training across language barriers.
## Overview
The app allows a trainer to conduct lessons in their native language, while trainees can receive instructions translated into their own languages.
Key features:
- Real-time voice transcription of the trainer's speech using Amazon Transcribe
- Translation of the speech into each trainee's language using Amazon Translate
- Real-time display of the translated text to trainees
- A transcript view that lets the trainer see the transcription and repeat unclear sections
- Support for training in multilingual organizations
Polyhedron uses WebSockets to stream audio and text between clients. The frontend is built with React and Vite. The backend is written in Rust using the Poem web framework with WebSocket support, and it interfaces with AWS services for transcription, translation, and text-to-speech.
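As a rough illustration of the WebSocket layer, a minimal Poem handler might look like the sketch below: it upgrades the connection and echoes text frames back. The route path and the echo behavior are placeholders rather than Polyhedron's actual protocol, and Poem's WebSocket support requires the crate's `websocket` feature.

```rust
use futures_util::{SinkExt, StreamExt};
use poem::{
    get, handler,
    listener::TcpListener,
    web::websocket::{Message, WebSocket},
    IntoResponse, Route, Server,
};

// Minimal WebSocket endpoint: upgrade the connection, then echo text frames.
// In Polyhedron, audio and text frames would be streamed here instead.
#[handler]
fn ws(ws: WebSocket) -> impl IntoResponse {
    ws.on_upgrade(|mut socket| async move {
        while let Some(Ok(Message::Text(text))) = socket.next().await {
            if socket.send(Message::Text(text)).await.is_err() {
                break;
            }
        }
    })
}

#[tokio::main]
async fn main() -> Result<(), std::io::Error> {
    let app = Route::new().at("/ws", get(ws)); // "/ws" is an assumed path
    Server::new(TcpListener::bind("0.0.0.0:8080")).run(app).await
}
```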
Configuration such as AWS credentials and model settings is specified in `config.yaml`.
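For illustration, configuration loading might look like the following serde-based sketch. The field names here are assumptions; the actual schema of `config.yaml` is defined in `config.rs`.

```rust
use serde::Deserialize;

// Hypothetical config schema; the real fields live in config.rs.
#[derive(Debug, Deserialize)]
struct Config {
    aws_region: String,     // region for Transcribe/Translate calls (assumed)
    app_port: u16,          // port the server listens on (assumed)
    languages: Vec<String>, // target languages offered to trainees (assumed)
}

fn load_config(path: &str) -> anyhow::Result<Config> {
    let text = std::fs::read_to_string(path)?;
    Ok(serde_yaml::from_str(&text)?)
}

fn main() -> anyhow::Result<()> {
    let cfg = load_config("config.yaml")?;
    println!("serving {} languages on port {}", cfg.languages.len(), cfg.app_port);
    Ok(())
}
```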
## Getting Started
To run Polyhedron locally:
- Configure your AWS credentials, as described in https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html
- Clone the repository and run `docker compose up`
- Open http://localhost:8080 in your browser
## Architecture
Polyhedron uses a broadcast model to share transcription, translation, and speech-synthesis work between clients:

- A single transcript is generated for the speaker: automatic speech recognition (ASR) runs only once, and the result is broadcast to all clients.
- The transcript is translated once per target language and shared with all clients of that language.
- Text-to-speech (TTS) synthesis runs once per voice and is shared with all clients that selected that voice.

By sharing these intermediate outputs, the system avoids duplicating work across clients, which keeps costs down and lets it serve many users efficiently. The components communicate using WebSockets and channels to distribute the shared outputs, as the sketch below illustrates.
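The following sketch shows the broadcast idea using Tokio channels. It is an illustration of the fan-out pattern, not Polyhedron's actual code: one transcript stream feeds one translation task per language, and every client of a language subscribes to that language's channel.

```rust
use std::collections::HashMap;
use tokio::sync::broadcast;

#[tokio::main]
async fn main() {
    // ASR produces one transcript stream for the speaker.
    let (transcript_tx, _) = broadcast::channel::<String>(16);

    // One translated stream per target language, shared by all its clients.
    let mut translated: HashMap<&str, broadcast::Sender<String>> = HashMap::new();
    for lang in ["fr", "de"] {
        let (tx, _) = broadcast::channel::<String>(16);
        translated.insert(lang, tx.clone());

        // One translation task per language, regardless of client count.
        let mut rx = transcript_tx.subscribe();
        tokio::spawn(async move {
            while let Ok(line) = rx.recv().await {
                // Placeholder for the Amazon Translate call.
                let _ = tx.send(format!("[{lang}] {line}"));
            }
        });
    }

    // Two clients of the same language share one translation stream.
    let mut client_a = translated["fr"].subscribe();
    let mut client_b = translated["fr"].subscribe();

    transcript_tx.send("Hello, trainees".to_string()).unwrap();
    println!("{}", client_a.recv().await.unwrap());
    println!("{}", client_b.recv().await.unwrap());
}
```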
The system architecture with a single listener can be summarized as:

- Speaker voice input ->
- ASR transcription (in the speaker's language, e.g. English) ->
- Translation to the listener's language ->
- TTS synthesis in the listener's language ->
- Voice output in the listener's language

The speaker's voice is transcribed to text using ASR in the speaker's language. The transcript is then translated into the listener's language, and text-to-speech synthesis converts the translated text into audio, which is played back to the listener. The result is a linear pipeline from speaker voice input to listener voice output, with transcription, translation, and synthesis steps in between; a stub-based sketch follows.
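To make the pipeline concrete, here is a self-contained sketch in which each stage is a stub standing in for the corresponding service call. The function names are illustrative assumptions, not Polyhedron's real API.

```rust
// Each stage below is a placeholder for a real service integration.

fn transcribe(_audio: &[i16]) -> String {
    // Amazon Transcribe would stream audio and return transcripts here.
    "Hello, trainees".to_string()
}

fn translate(text: &str, target_lang: &str) -> String {
    // Amazon Translate would translate `text` into `target_lang` here.
    format!("[{target_lang}] {text}")
}

fn synthesize(text: &str) -> Vec<u8> {
    // A TTS service would render `text` as audio bytes here.
    text.as_bytes().to_vec()
}

fn main() {
    let audio: Vec<i16> = vec![0; 16_000]; // one second of silent 16 kHz PCM
    let transcript = transcribe(&audio);
    let translated = translate(&transcript, "fr");
    let speech = synthesize(&translated);
    println!("pipeline produced {} bytes of audio", speech.len());
}
```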
## Directory Structure

- `src/`: Main Rust backend source code
  - `main.rs`: Entry point and server definition
  - `config.rs`: Configuration loading
  - `lesson.rs`: Lesson management and audio streaming
  - `whisper.rs`: Whisper ASR integration
  - `group.rs`: Group management
- `static/`: Frontend JavaScript and assets
  - `index.html`: Main HTML page
  - `index.js`: React frontend code
  - `recorderWorkletProcessor.js`: Audio recorder WebWorker
- `models/`: Whisper speech recognition models
- `config.yaml`: Server configuration
- `Cargo.toml`: Rust crate dependencies
## Contributing
Contributions welcome! Please open an issue or PR.