---
datasets:
- openslr/librispeech_asr
language:
- en
metrics:
- wer
tags:
- audio
- automatic-speech-recognition
- hf-asr-leaderboard
widget:
- example_title: Librispeech sample 1
  src: https://cdn-media.huggingface.co/speech_samples/sample1.flac
- example_title: Librispeech sample 2
  src: https://cdn-media.huggingface.co/speech_samples/sample2.flac
pipeline_tag: automatic-speech-recognition
---

Internal model alias name: `v6-relPosAttDef-noBias-aedLoss-bhv20-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_2-lrlin1e_5_295k-featBN-speedpertV2-spm10k-bpeSample001`

Last epoch (subepoch 500), greedy decoding (without LM) on Librispeech, WERs:
`{"dev-clean": 2.38, "dev-other": 5.67, "test-clean": 2.63, "test-other": 5.93}`

(Note: together with a good LM trained on the Librispeech LM text data, the WERs improve; see `output/ctc_recog_ext/ctc+lm/opt-beam128-fp128-lm_n32-d1024-labelprior/recog-1stpass-res.txt`:
`{"dev-clean": 2.04, "dev-other": 4.06, "test-clean": 2.08, "test-other": 4.36}`)

From https://github.com/rwth-i6/i6_experiments/blob/main/users/zeyer/experiments/exp2024_04_23_baselines/ctc.py.

Usage example: https://github.com/rwth-i6/i6_experiments/blob/main/users/zeyer/experiments/exp2024_04_23_baselines/standalone/model_2024_ctc_spm10k.py

Example:

```shell
pip install torch
pip install returnn
wget https://raw.githubusercontent.com/rwth-i6/i6_experiments/refs/heads/main/users/zeyer/experiments/exp2024_04_23_baselines/standalone/model_2024_ctc_spm10k.py
wget https://huggingface.co/rwth-i6/2024-zeyer-ctc-librispeech-spm10k/resolve/main/data/epoch.500.pt
wget https://huggingface.co/rwth-i6/2024-zeyer-ctc-librispeech-spm10k/resolve/main/deps/spm.vocab
python model_2024_ctc_spm10k.py example_audio.ogg
```

This Sisyphus config code snippet was used to set up the Sisyphus training job:
```python
# v6-relPosAttDef-noBias-aedLoss-bhv20-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_2-lrlin1e_5_295k-featBN-speedpertV2-spm10k-bpeSample001
# noBias. (Baseline: 5.77)
train_exp(  # 5.65 (!!!)
    "v6-relPosAttDef-noBias-aedLoss-bhv20-11gb-f32-bs15k-accgrad1-mgpu4-pavg100-wd1e_2"
    "-lrlin1e_5_295k-featBN-speedpertV2-spm10k-bpeSample001",
    config_11gb_v6_f32_accgrad1_mgpu4_pavg100_wd1e_4,
    model_config={
        "enc_conformer_layer": rf.build_dict(
            rf.encoder.conformer.ConformerEncoderLayer,
            ff=rf.build_dict(
                rf.encoder.conformer.ConformerPositionwiseFeedForward,
                activation=rf.build_dict(rf.relu_square),
                with_bias=False,
            ),
            num_heads=8,
        ),
        "feature_batch_norm": True,
    },
    config_updates={
        **_get_cfg_lrlin_oclr_by_bs_nep(15_000, 500),
        "optimizer.weight_decay": 1e-2,
        "__train_audio_preprocess": speed_pert_librosa_config,
        "speed_pert_discrete_values": [0.7, 0.8, 0.9, 1.0, 1.1],
        "aux_attention_decoder": rf.build_dict(TransformerDecoder, num_layers=6),  # purely used for training
    },
    vocab="spm10k",
    train_vocab_opts={"other_opts": {"class": "SamplingBytePairEncoding", "breadth_prob": 0.01}},
)
```
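The `speed_pert_discrete_values` entry in the config above selects a random speed factor per training utterance. The actual preprocessing is `speed_pert_librosa_config` from the i6_experiments repo (librosa-based); the following is only a hypothetical, purely illustrative sketch of what discrete speed perturbation does, using naive linear-interpolation resampling (the function name and implementation are my own, not from the repo):

```python
import random

def speed_perturb(samples, factors=(0.7, 0.8, 0.9, 1.0, 1.1), rng=random):
    """Illustrative sketch only: resample `samples` (list of floats) by a
    randomly chosen speed factor. A factor of 1.1 plays the audio 10% faster,
    i.e. produces fewer output samples; 0.7 plays it slower. Real pipelines
    use proper resampling (e.g. librosa), not linear interpolation.
    Returns (new_samples, chosen_factor)."""
    factor = rng.choice(factors)
    out_len = max(1, int(len(samples) / factor))
    out = []
    for i in range(out_len):
        pos = i * factor                    # fractional read position
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)  # clamp at the last sample
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out, factor
```

With factor 1.0 the input is returned unchanged; with the other factors the sequence length changes accordingly, which is the data-augmentation effect the config relies on.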
I uploaded the `info` and `output` files from the Sisyphus RETURNN training job to `trainjob`, except for the model checkpoint, which I uploaded to `data`. From the train job's `info` file, I checked the dependencies, specifically the SPM vocab, and uploaded those to `deps`.
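For reference, the WER numbers reported above follow the standard definition: word-level Levenshtein distance (substitutions, insertions, deletions) divided by the number of reference words. A minimal self-contained sketch, not the evaluation code actually used for this model:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # One-row dynamic-programming table for Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` is 1/6: one deleted word out of six reference words.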