SimulSeamless


Code for the paper: "SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation" published at IWSLT 2024.

📎 Requirements

🤖 Inference using your environment

Set --source and --target as described in the Fairseq Simultaneous Translation repository: ${LIST_OF_AUDIO} is the list of audio paths, and ${TGT_FILE} contains the segment-wise references in the target language.
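As a sketch (the paths and sentences below are placeholders, not part of the release), the two files look like this: the source list holds one audio path per line, and the target file holds the matching reference for each segment, in the same order:

```shell
# Hypothetical example files -- paths and sentences are placeholders.
# The source list has one audio path per line; the target file has the
# matching reference for each segment, in the same order.
cat > audio_list.txt << 'EOF'
/data/iwslt/ted_0001_seg1.wav
/data/iwslt/ted_0001_seg2.wav
EOF

cat > references.txt << 'EOF'
Erste Referenzübersetzung für das erste Segment.
Zweite Referenzübersetzung für das zweite Segment.
EOF

export LIST_OF_AUDIO=$(pwd)/audio_list.txt
export TGT_FILE=$(pwd)/references.txt
```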

Set ${TGT_LANG} to the three-character target language code. The list of supported language codes is available here. For the source language, no language code needs to be specified.

Depending on the target language, set ${LATENCY_UNIT} to either word (e.g., for German) or char (e.g., for Japanese), and ${BLEU_TOKENIZER} to either 13a (i.e., the standard sacreBLEU tokenizer used, for example, to evaluate German) or char (e.g., to evaluate character-level languages such as Chinese or Japanese).

The simultaneous inference of SimulSeamless is based on AlignAtt, thus the f parameter (${FRAME}) and the layer from which to extract the attention scores (${LAYER}) have to be set accordingly.

Instructions to replicate the IWSLT 2024 results ↙️

To replicate the results obtained to achieve 2 seconds of latency (measured by AL) on the test sets used by the IWSLT 2024 Simultaneous track, use the following values:

  • en-de: ${TGT_LANG}=deu, ${FRAME}=6, ${LAYER}=3, ${SEG_SIZE}=1000
  • en-ja: ${TGT_LANG}=jpn, ${FRAME}=1, ${LAYER}=0, ${SEG_SIZE}=400
  • en-zh: ${TGT_LANG}=cmn, ${FRAME}=1, ${LAYER}=3, ${SEG_SIZE}=800
  • cs-en: ${TGT_LANG}=eng, ${FRAME}=9, ${LAYER}=3, ${SEG_SIZE}=1000

❗️Please note that ${FRAME} can be adjusted to achieve lower or higher latency.
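For example, the en-de configuration above can be exported as environment variables before launching the inference; ${LATENCY_UNIT} and ${BLEU_TOKENIZER} follow the guidance given in the previous section:

```shell
# en-de setup targeting ~2 seconds of latency (values from the list above).
export TGT_LANG=deu
export FRAME=6             # lower for lower latency, higher for better quality
export LAYER=3
export SEG_SIZE=1000
export LATENCY_UNIT=word   # German is evaluated at the word level
export BLEU_TOKENIZER=13a  # standard sacreBLEU tokenizer
```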

SimulSeamless can then be run with:

```shell
simuleval \
    --agent-class examples.speech_to_text.simultaneous_translation.agents.v1_1.simul_alignatt_seamlessm4t.AlignAttSeamlessS2T \
    --source ${LIST_OF_AUDIO} \
    --target ${TGT_FILE} \
    --data-bin ${DATA_ROOT} \
    --model-size medium --target-language ${TGT_LANG} \
    --extract-attn-from-layer ${LAYER} --num-beams 5 \
    --frame-num ${FRAME} \
    --source-segment-size ${SEG_SIZE} \
    --quality-metrics BLEU --latency-metrics LAAL AL ATD --computation-aware \
    --eval-latency-unit ${LATENCY_UNIT} --sacrebleu-tokenizer ${BLEU_TOKENIZER} \
    --output ${OUT_DIR} \
    --device cuda:0
```

If not already stored in your system, the SeamlessM4T model will be downloaded automatically when running the script. The output will be saved in ${OUT_DIR}.

We suggest running the inference on a GPU to speed up the process, but the system can be run on any device (e.g., CPU) supported by SimulEval and HuggingFace.

💬 Inference using docker

To run SimulSeamless using docker, follow the steps below:

  1. Download the docker file by cloning this repository
  2. Load the docker image:

```shell
docker load -i simulseamless.tar
```

  3. Start the SimulEval standalone with GPU enabled:

```shell
docker run -e TGTLANG=${TGT_LANG} -e FRAME=${FRAME} -e LAYER=${LAYER} \
    -e BLEU_TOKENIZER=${BLEU_TOKENIZER} -e LATENCY_UNIT=${LATENCY_UNIT} \
    -e DEV=cuda:0 --gpus all --shm-size 32G \
    -p 2024:2024 simulseamless:latest
```

  4. Start the remote evaluation with:

```shell
simuleval \
    --remote-eval --remote-port 2024 \
    --source ${LIST_OF_AUDIO} --target ${TGT_FILE} \
    --source-type speech --target-type text \
    --source-segment-size ${SEG_SIZE} \
    --eval-latency-unit ${LATENCY_UNIT} --sacrebleu-tokenizer ${BLEU_TOKENIZER} \
    --output ${OUT_DIR}
```
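If the remote evaluation fails to connect, a quick way to check whether the container's server is already reachable is to probe the mapped port. This sketch uses bash's built-in `/dev/tcp` (the port 2024 comes from the `-p 2024:2024` mapping in the `docker run` command above):

```shell
# Probe port 2024 (mapped by `docker run -p 2024:2024`) to see whether the
# evaluation server inside the container is ready. Uses bash's /dev/tcp.
if (exec 3<>/dev/tcp/localhost/2024) 2>/dev/null; then
    echo "SimulSeamless server is reachable on port 2024"
else
    echo "Server not ready yet - wait for the model to finish loading"
fi
```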

To set ${TGT_LANG}, ${FRAME}, ${LAYER}, ${BLEU_TOKENIZER}, ${LATENCY_UNIT}, ${LIST_OF_AUDIO}, ${TGT_FILE}, ${SEG_SIZE}, and ${OUT_DIR}, refer to 🤖 Inference using your environment.

📍Citation

```bibtex
@inproceedings{papi-etal-2024-simulseamless,
    title = "{S}imul{S}eamless: {FBK} at {IWSLT} 2024 Simultaneous Speech Translation",
    author = "Papi, Sara  and
      Gaido, Marco  and
      Negri, Matteo  and
      Bentivogli, Luisa",
    editor = "Salesky, Elizabeth  and
      Federico, Marcello  and
      Carpuat, Marine",
    booktitle = "Proceedings of the 21st International Conference on Spoken Language Translation (IWSLT 2024)",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand (in-person and online)",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.iwslt-1.11",
    pages = "72--79",
}
```