Commit 99cc645
Parent(s): c30668f
add alignment

This view is limited to 50 files because it contains too many changes; see the raw diff for the full change set.
- orator/src/orator/__pycache__/__init__.cpython-311.pyc +0 -0
- orator/src/orator/__pycache__/tts.cpython-311.pyc +0 -0
- orator/src/orator/models/bigvgan/__pycache__/activations.cpython-311.pyc +0 -0
- orator/src/orator/models/bigvgan/__pycache__/bigvgan.cpython-311.pyc +0 -0
- orator/src/orator/models/bigvgan/activations.py +120 -0
- orator/src/orator/models/bigvgan/alias_free_torch/__init__.py +6 -0
- orator/src/orator/models/bigvgan/alias_free_torch/__pycache__/__init__.cpython-311.pyc +0 -0
- orator/src/orator/models/bigvgan/alias_free_torch/__pycache__/act.cpython-311.pyc +0 -0
- orator/src/orator/models/bigvgan/alias_free_torch/__pycache__/filter.cpython-311.pyc +0 -0
- orator/src/orator/models/bigvgan/alias_free_torch/__pycache__/resample.cpython-311.pyc +0 -0
- orator/src/orator/models/bigvgan/alias_free_torch/act.py +28 -0
- orator/src/orator/models/bigvgan/alias_free_torch/filter.py +95 -0
- orator/src/orator/models/bigvgan/alias_free_torch/resample.py +55 -0
- orator/src/orator/models/bigvgan/bigvgan.py +212 -0
- orator/src/orator/models/s3gen/__pycache__/__init__.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/__pycache__/const.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/__pycache__/decoder.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/__pycache__/f0_predictor.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/__pycache__/flow.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/__pycache__/flow_matching.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/__pycache__/hifigan.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/__pycache__/s3gen.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/__pycache__/xvector.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/matcha/__pycache__/decoder.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/matcha/__pycache__/flow_matching.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/matcha/__pycache__/transformer.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/transformer/__pycache__/__init__.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/transformer/__pycache__/activation.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/transformer/__pycache__/attention.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/transformer/__pycache__/convolution.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/transformer/__pycache__/embedding.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/transformer/__pycache__/encoder_layer.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/transformer/__pycache__/positionwise_feed_forward.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/transformer/__pycache__/subsampling.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/transformer/__pycache__/upsample_encoder.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/utils/__pycache__/class_utils.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/utils/__pycache__/mask.cpython-311.pyc +0 -0
- orator/src/orator/models/s3gen/utils/__pycache__/mel.cpython-311.pyc +0 -0
- orator/src/orator/models/s3tokenizer/__pycache__/__init__.cpython-311.pyc +0 -0
- orator/src/orator/models/s3tokenizer/__pycache__/s3tokenizer.cpython-311.pyc +0 -0
- orator/src/orator/models/t3/__pycache__/__init__.cpython-311.pyc +0 -0
- orator/src/orator/models/t3/__pycache__/llama_configs.cpython-311.pyc +0 -0
- orator/src/orator/models/t3/__pycache__/t3.cpython-311.pyc +0 -0
- orator/src/orator/models/t3/inference/__pycache__/t3_hf_backend.cpython-311.pyc +0 -0
- orator/src/orator/models/t3/inference/alignment_stream_analyzer.py +154 -0
- orator/src/orator/models/t3/inference/t3_hf_backend.py +6 -6
- orator/src/orator/models/t3/modules/__pycache__/cond_enc.cpython-311.pyc +0 -0
- orator/src/orator/models/t3/modules/__pycache__/learned_pos_emb.cpython-311.pyc +0 -0
- orator/src/orator/models/t3/modules/__pycache__/perceiver.cpython-311.pyc +0 -0
- orator/src/orator/models/t3/modules/__pycache__/t3_config.cpython-311.pyc +0 -0
orator/src/orator/__pycache__/__init__.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/__pycache__/__init__.cpython-311.pyc and b/orator/src/orator/__pycache__/__init__.cpython-311.pyc differ
orator/src/orator/__pycache__/tts.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/__pycache__/tts.cpython-311.pyc and b/orator/src/orator/__pycache__/tts.cpython-311.pyc differ
orator/src/orator/models/bigvgan/__pycache__/activations.cpython-311.pyc
ADDED
Binary file (6.09 kB)
orator/src/orator/models/bigvgan/__pycache__/bigvgan.cpython-311.pyc
ADDED
Binary file (13.3 kB)
orator/src/orator/models/bigvgan/activations.py
ADDED
@@ -0,0 +1,120 @@
# Implementation adapted from https://github.com/EdwardDixon/snake under the MIT license.
# LICENSE is in incl_licenses directory.

import torch
from torch import nn, sin, pow
from torch.nn import Parameter


class Snake(nn.Module):
    '''
    Implementation of a sine-based periodic activation function
    Shape:
        - Input: (B, C, T)
        - Output: (B, C, T), same shape as the input
    Parameters:
        - alpha - trainable parameter
    References:
        - This activation function is from this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
        https://arxiv.org/abs/2006.08195
    Examples:
        >>> a1 = Snake(256)
        >>> x = torch.randn(256)
        >>> x = a1(x)
    '''
    def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
        '''
        Initialization.
        INPUT:
            - in_features: shape of the input
            - alpha: trainable parameter
            alpha is initialized to 1 by default, higher values = higher-frequency.
            alpha will be trained along with the rest of your model.
        '''
        super(Snake, self).__init__()
        self.in_features = in_features

        # initialize alpha
        self.alpha_logscale = alpha_logscale
        if self.alpha_logscale:  # log scale alphas initialized to zeros
            self.alpha = Parameter(torch.zeros(in_features) * alpha)
        else:  # linear scale alphas initialized to ones
            self.alpha = Parameter(torch.ones(in_features) * alpha)

        self.alpha.requires_grad = alpha_trainable

        self.no_div_by_zero = 0.000000001

    def forward(self, x):
        '''
        Forward pass of the function.
        Applies the function to the input elementwise.
        Snake := x + 1/a * sin^2(ax)
        '''
        alpha = self.alpha.unsqueeze(0).unsqueeze(-1)  # line up with x to [B, C, T]
        if self.alpha_logscale:
            alpha = torch.exp(alpha)
        x = x + (1.0 / (alpha + self.no_div_by_zero)) * pow(sin(x * alpha), 2)

        return x


class SnakeBeta(nn.Module):
    '''
    A modified Snake function which uses separate parameters for the magnitude of the periodic components
    Shape:
        - Input: (B, C, T)
        - Output: (B, C, T), same shape as the input
    Parameters:
        - alpha - trainable parameter that controls frequency
        - beta - trainable parameter that controls magnitude
    References:
        - This activation function is a modified version based on this paper by Liu Ziyin, Tilman Hartwig, Masahito Ueda:
        https://arxiv.org/abs/2006.08195
    Examples:
        >>> a1 = SnakeBeta(256)
        >>> x = torch.randn(256)
        >>> x = a1(x)
    '''
    def __init__(self, in_features, alpha=1.0, alpha_trainable=True, alpha_logscale=False):
        '''
        Initialization.
        INPUT:
            - in_features: shape of the input
            - alpha - trainable parameter that controls frequency
            - beta - trainable parameter that controls magnitude
            alpha is initialized to 1 by default, higher values = higher-frequency.
            beta is initialized to 1 by default, higher values = higher-magnitude.
            alpha will be trained along with the rest of your model.
        '''
        super(SnakeBeta, self).__init__()
        self.in_features = in_features

        # initialize alpha
        self.alpha_logscale = alpha_logscale
        if self.alpha_logscale:  # log scale alphas initialized to zeros
            self.alpha = Parameter(torch.zeros(in_features) * alpha)
            self.beta = Parameter(torch.zeros(in_features) * alpha)
        else:  # linear scale alphas initialized to ones
            self.alpha = Parameter(torch.ones(in_features) * alpha)
            self.beta = Parameter(torch.ones(in_features) * alpha)

        self.alpha.requires_grad = alpha_trainable
        self.beta.requires_grad = alpha_trainable

        self.no_div_by_zero = 0.000000001

    def forward(self, x):
        '''
        Forward pass of the function.
        Applies the function to the input elementwise.
        SnakeBeta := x + 1/b * sin^2(ax)
        '''
        alpha = self.alpha.unsqueeze(0).unsqueeze(-1)  # line up with x to [B, C, T]
        beta = self.beta.unsqueeze(0).unsqueeze(-1)
        if self.alpha_logscale:
            alpha = torch.exp(alpha)
            beta = torch.exp(beta)
        x = x + (1.0 / (beta + self.no_div_by_zero)) * pow(sin(x * alpha), 2)

        return x
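For reference, a minimal usage sketch of the two activations. The `orator.models.bigvgan.activations` import path is inferred from this commit's file layout and is an assumption; both modules are elementwise over [B, C, T] with one learned alpha (and beta, for SnakeBeta) per channel.

import torch
from orator.models.bigvgan.activations import Snake, SnakeBeta  # path assumed from this commit's layout

x = torch.randn(2, 80, 100)                 # [B, C=80, T=100]
snake = Snake(80)
snakebeta = SnakeBeta(80, alpha_logscale=True)

assert snake(x).shape == x.shape            # elementwise, shape-preserving
assert snakebeta(x).shape == x.shape
# with the default alpha = 1, Snake reduces (up to the 1e-9 epsilon) to x + sin(x)^2
assert torch.allclose(snake(x), x + torch.sin(x) ** 2, atol=1e-6)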
orator/src/orator/models/bigvgan/alias_free_torch/__init__.py
ADDED
@@ -0,0 +1,6 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

from .filter import *
from .resample import *
from .act import *
orator/src/orator/models/bigvgan/alias_free_torch/__pycache__/__init__.cpython-311.pyc
ADDED
Binary file (281 Bytes)
orator/src/orator/models/bigvgan/alias_free_torch/__pycache__/act.cpython-311.pyc
ADDED
Binary file (1.67 kB)
orator/src/orator/models/bigvgan/alias_free_torch/__pycache__/filter.cpython-311.pyc
ADDED
Binary file (4.51 kB)
orator/src/orator/models/bigvgan/alias_free_torch/__pycache__/resample.cpython-311.pyc
ADDED
Binary file (3.43 kB)
orator/src/orator/models/bigvgan/alias_free_torch/act.py
ADDED
@@ -0,0 +1,28 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

import torch.nn as nn

from .resample import UpSample1d, DownSample1d


class Activation1d(nn.Module):
    def __init__(self,
                 activation,
                 up_ratio: int = 2,
                 down_ratio: int = 2,
                 up_kernel_size: int = 12,
                 down_kernel_size: int = 12):
        super().__init__()
        self.up_ratio = up_ratio
        self.down_ratio = down_ratio
        self.act = activation
        self.upsample = UpSample1d(up_ratio, up_kernel_size)
        self.downsample = DownSample1d(down_ratio, down_kernel_size)

    # x: [B, C, T]
    def forward(self, x):
        x = self.upsample(x)
        x = self.act(x)
        x = self.downsample(x)
        return x
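A minimal sketch of the wrapper in action (import paths assumed from this commit's layout): the nonlinearity runs at 2x the sample rate, then the result is band-limited back down, suppressing the aliasing that the periodic sine terms would otherwise introduce.

import torch
from orator.models.bigvgan.activations import SnakeBeta            # path assumed
from orator.models.bigvgan.alias_free_torch.act import Activation1d  # path assumed

act = Activation1d(activation=SnakeBeta(8, alpha_logscale=True))
x = torch.randn(1, 8, 64)             # [B, C, T]
y = act(x)
assert y.shape == x.shape             # upsample 2x -> activation -> downsample 2x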
orator/src/orator/models/bigvgan/alias_free_torch/filter.py
ADDED
@@ -0,0 +1,95 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F


if 'sinc' in dir(torch):
    sinc = torch.sinc
else:
    # This code is adopted from adefossez's julius.core.sinc under the MIT License
    # https://adefossez.github.io/julius/julius/core.html
    # LICENSE is in incl_licenses directory.
    def sinc(x: torch.Tensor):
        """
        Implementation of sinc, i.e. sin(pi * x) / (pi * x)
        __Warning__: Different to julius.sinc, the input is multiplied by `pi`!
        """
        return torch.where(x == 0,
                           torch.tensor(1., device=x.device, dtype=x.dtype),
                           torch.sin(math.pi * x) / math.pi / x)


# This code is adopted from adefossez's julius.lowpass.LowPassFilters under the MIT License
# https://adefossez.github.io/julius/julius/lowpass.html
# LICENSE is in incl_licenses directory.
def kaiser_sinc_filter1d(cutoff, half_width, kernel_size):  # return filter [1,1,kernel_size]
    even = (kernel_size % 2 == 0)
    half_size = kernel_size // 2

    # For kaiser window
    delta_f = 4 * half_width
    A = 2.285 * (half_size - 1) * math.pi * delta_f + 7.95
    if A > 50.:
        beta = 0.1102 * (A - 8.7)
    elif A >= 21.:
        beta = 0.5842 * (A - 21)**0.4 + 0.07886 * (A - 21.)
    else:
        beta = 0.
    window = torch.kaiser_window(kernel_size, beta=beta, periodic=False)

    # ratio = 0.5/cutoff -> 2 * cutoff = 1 / ratio
    if even:
        time = (torch.arange(-half_size, half_size) + 0.5)
    else:
        time = torch.arange(kernel_size) - half_size
    if cutoff == 0:
        filter_ = torch.zeros_like(time)
    else:
        filter_ = 2 * cutoff * window * sinc(2 * cutoff * time)
        # Normalize filter to have sum = 1, otherwise we will have a small leakage
        # of the constant component in the input signal.
        filter_ /= filter_.sum()
    filter = filter_.view(1, 1, kernel_size)

    return filter


class LowPassFilter1d(nn.Module):
    def __init__(self,
                 cutoff=0.5,
                 half_width=0.6,
                 stride: int = 1,
                 padding: bool = True,
                 padding_mode: str = 'replicate',
                 kernel_size: int = 12):
        # kernel_size should be even number for stylegan3 setup,
        # in this implementation, odd number is also possible.
        super().__init__()
        if cutoff < -0.:
            raise ValueError("Minimum cutoff must be larger than zero.")
        if cutoff > 0.5:
            raise ValueError("A cutoff above 0.5 does not make sense.")
        self.kernel_size = kernel_size
        self.even = (kernel_size % 2 == 0)
        self.pad_left = kernel_size // 2 - int(self.even)
        self.pad_right = kernel_size // 2
        self.stride = stride
        self.padding = padding
        self.padding_mode = padding_mode
        filter = kaiser_sinc_filter1d(cutoff, half_width, kernel_size)
        self.register_buffer("filter", filter)

    # input [B, C, T]
    def forward(self, x):
        _, C, _ = x.shape

        if self.padding:
            x = F.pad(x, (self.pad_left, self.pad_right), mode=self.padding_mode)
        out = F.conv1d(x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C)

        return out
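A quick check of the filter prototype above (import path assumed from this commit's layout): a well-formed windowed-sinc lowpass has unit DC gain, i.e. its taps sum to 1 after normalization, and the depthwise convolution with stride 2 halves the time axis.

import torch
from orator.models.bigvgan.alias_free_torch.filter import kaiser_sinc_filter1d, LowPassFilter1d  # path assumed

filt = kaiser_sinc_filter1d(cutoff=0.25, half_width=0.3, kernel_size=12)
assert filt.shape == (1, 1, 12)
assert torch.isclose(filt.sum(), torch.tensor(1.0), atol=1e-6)   # unit DC gain

lp = LowPassFilter1d(cutoff=0.25, half_width=0.3, stride=2, kernel_size=12)
x = torch.randn(1, 4, 64)             # [B, C, T]
assert lp(x).shape == (1, 4, 32)      # per-channel lowpass, then stride-2 decimation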
orator/src/orator/models/bigvgan/alias_free_torch/resample.py
ADDED
@@ -0,0 +1,55 @@
# Adapted from https://github.com/junjun3518/alias-free-torch under the Apache License 2.0
# LICENSE is in incl_licenses directory.

import torch.nn as nn
from torch.nn import functional as F

from .filter import LowPassFilter1d
from .filter import kaiser_sinc_filter1d


class UpSample1d(nn.Module):
    def __init__(self, ratio=2, kernel_size=None):
        super().__init__()
        self.ratio = ratio
        self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
        self.stride = ratio
        self.pad = self.kernel_size // ratio - 1
        self.pad_left = self.pad * self.stride + (self.kernel_size - self.stride) // 2
        self.pad_right = self.pad * self.stride + (self.kernel_size - self.stride + 1) // 2
        filter = kaiser_sinc_filter1d(
            cutoff=0.5 / ratio,
            half_width=0.6 / ratio,
            kernel_size=self.kernel_size
        )
        self.register_buffer("filter", filter)

    # x: [B, C, T]
    def forward(self, x):
        _, C, _ = x.shape

        x = F.pad(x, (self.pad, self.pad), mode='replicate')
        x = self.ratio * F.conv_transpose1d(
            x, self.filter.expand(C, -1, -1), stride=self.stride, groups=C
        )
        x = x[..., self.pad_left:-self.pad_right]

        return x


class DownSample1d(nn.Module):
    def __init__(self, ratio=2, kernel_size=None):
        super().__init__()
        self.ratio = ratio
        self.kernel_size = int(6 * ratio // 2) * 2 if kernel_size is None else kernel_size
        self.lowpass = LowPassFilter1d(
            cutoff=0.5 / ratio,
            half_width=0.6 / ratio,
            stride=ratio,
            kernel_size=self.kernel_size
        )

    def forward(self, x):
        xx = self.lowpass(x)

        return xx
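A shape sanity check for the resamplers above (import path assumed from this commit's layout): UpSample1d multiplies T by `ratio` exactly thanks to its padding/cropping arithmetic, and DownSample1d divides it.

import torch
from orator.models.bigvgan.alias_free_torch.resample import UpSample1d, DownSample1d  # path assumed

x = torch.randn(1, 4, 100)            # [B, C, T]
up = UpSample1d(ratio=2)
down = DownSample1d(ratio=2)

assert up(x).shape == (1, 4, 200)     # 2x sinc interpolation
assert down(x).shape == (1, 4, 50)    # lowpass, then stride-2 decimation
assert down(up(x)).shape == x.shape   # round trip preserves length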
orator/src/orator/models/bigvgan/bigvgan.py
ADDED
@@ -0,0 +1,212 @@
# Copyright (c) 2022 NVIDIA CORPORATION.
# Licensed under the MIT license.
# Adapted from https://github.com/jik876/hifi-gan under the MIT license.
# LICENSE is in incl_licenses directory.

import logging

import torch          # previously pulled in implicitly via the star-import below
import torch.nn as nn
from torch.nn import Conv1d, ConvTranspose1d
from torch.nn.utils import weight_norm, remove_weight_norm
from torch.nn.utils.weight_norm import WeightNorm

from .activations import SnakeBeta
from .alias_free_torch import *


LRELU_SLOPE = 0.1

logger = logging.getLogger(__name__)


def get_padding(kernel_size, dilation=1):
    return int((kernel_size*dilation - dilation)/2)


def init_weights(m, mean=0.0, std=0.01):
    classname = m.__class__.__name__
    if classname.find("Conv") != -1:
        m.weight.data.normal_(mean, std)


class AMPBlock1(torch.nn.Module):
    def __init__(self, channels, kernel_size=3, dilation=(1, 3, 5)):
        super(AMPBlock1, self).__init__()

        self.convs1 = nn.ModuleList([
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[0],
                               padding=get_padding(kernel_size, dilation[0]))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[1],
                               padding=get_padding(kernel_size, dilation[1]))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=dilation[2],
                               padding=get_padding(kernel_size, dilation[2])))
        ])
        self.convs1.apply(init_weights)

        self.convs2 = nn.ModuleList([
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1, padding=get_padding(kernel_size, 1))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1, padding=get_padding(kernel_size, 1))),
            weight_norm(Conv1d(channels, channels, kernel_size, 1, dilation=1, padding=get_padding(kernel_size, 1)))
        ])
        self.convs2.apply(init_weights)

        self.num_layers = len(self.convs1) + len(self.convs2)  # total number of conv layers

        self.activations = nn.ModuleList([
            Activation1d(activation=SnakeBeta(channels, alpha_logscale=True))
            for _ in range(self.num_layers)
        ])

    def forward(self, x):
        acts1, acts2 = self.activations[::2], self.activations[1::2]
        for c1, c2, a1, a2 in zip(self.convs1, self.convs2, acts1, acts2):
            xt = a1(x)
            xt = c1(xt)
            xt = a2(xt)
            xt = c2(xt)
            x = xt + x

        return x

    def set_weight_norm(self, enabled: bool):
        weight_norm_fn = weight_norm if enabled else remove_weight_norm
        for l in self.convs1:
            weight_norm_fn(l)
        for l in self.convs2:
            weight_norm_fn(l)


class BigVGAN(nn.Module):
    # this is our main BigVGAN model. Applies anti-aliased periodic activation for resblocks.

    # We've got a model in prod that has the wrong hparams for this. It's simpler to add this check than to
    # redistribute the model.
    ignore_state_dict_unexpected = ("cond_layer.*",)

    def __init__(self):
        super().__init__()

        input_dims = 80

        upsample_rates = [10, 8, 4, 2]
        upsample_kernel_sizes = [x * 2 for x in upsample_rates]
        upsample_initial_channel = 1024

        resblock_kernel_sizes = [3, 7, 11]
        resblock_dilation_sizes = [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
        self.num_kernels = len(resblock_kernel_sizes)
        self.num_upsamples = len(upsample_rates)

        # pre conv
        self.conv_pre = weight_norm(Conv1d(input_dims, upsample_initial_channel, 7, 1, padding=3))
        self.cond_layer = None

        # transposed conv-based upsamplers. does not apply anti-aliasing
        self.ups = nn.ModuleList()
        for i, (u, k) in enumerate(zip(upsample_rates, upsample_kernel_sizes)):
            self.ups.append(nn.ModuleList([
                weight_norm(ConvTranspose1d(upsample_initial_channel // (2 ** i),
                                            upsample_initial_channel // (2 ** (i + 1)),
                                            k, u, padding=(k - u) // 2))
            ]))

        # residual blocks using anti-aliased multi-periodicity composition modules (AMP)
        self.resblocks = nn.ModuleList()
        for i in range(len(self.ups)):
            ch = upsample_initial_channel // (2 ** (i + 1))
            for j, (k, d) in enumerate(zip(resblock_kernel_sizes, resblock_dilation_sizes)):
                self.resblocks.append(AMPBlock1(ch, k, d))

        # post conv
        activation_post = SnakeBeta(ch, alpha_logscale=True)
        self.activation_post = Activation1d(activation=activation_post)
        self.conv_post = weight_norm(Conv1d(ch, 1, 7, 1, padding=3))

        # weight initialization
        for i in range(len(self.ups)):
            self.ups[i].apply(init_weights)
        self.conv_post.apply(init_weights)

    def forward(self, x) -> torch.Tensor:
        """
        Args
        ----
        x: torch.Tensor of shape [B, C, T] (conv_pre consumes C=80 mel bands on dim 1)
        """
        with torch.inference_mode():

            x = self.conv_pre(x)

            for i in range(self.num_upsamples):
                # upsampling
                for i_up in range(len(self.ups[i])):
                    x = self.ups[i][i_up](x)
                # AMP blocks
                xs = None
                for j in range(self.num_kernels):
                    if xs is None:
                        xs = self.resblocks[i * self.num_kernels + j](x)
                    else:
                        xs += self.resblocks[i * self.num_kernels + j](x)
                x = xs / self.num_kernels

            # post conv
            x = self.activation_post(x)
            x = self.conv_post(x)

            # Bound the output to [-1, 1]
            x = torch.tanh(x)

        return x

    @property
    def weight_norm_enabled(self) -> bool:
        return any(
            isinstance(hook, WeightNorm) and hook.name == "weight"
            for k, hook in self.conv_pre._forward_pre_hooks.items()
        )

    def set_weight_norm(self, enabled: bool):
        """
        N.B.: weight norm modifies the state dict, causing incompatibilities. Conventions:
        - BigVGAN runs with weight norm for training, without for inference (done automatically by instantiate())
        - All checkpoints are saved with weight norm (allows resuming training)
        """
        if enabled != self.weight_norm_enabled:
            weight_norm_fn = weight_norm if enabled else remove_weight_norm
            logger.debug(f"{'Applying' if enabled else 'Removing'} weight norm...")

            for l in self.ups:
                for l_i in l:
                    weight_norm_fn(l_i)
            for l in self.resblocks:
                l.set_weight_norm(enabled)
            weight_norm_fn(self.conv_pre)
            weight_norm_fn(self.conv_post)

    def train_mode(self):
        self.train()
        self.set_weight_norm(enabled=True)

    def inference_mode(self):
        self.eval()
        self.set_weight_norm(enabled=False)


if __name__ == '__main__':
    import soundfile as sf
    model = BigVGAN()

    state_dict = torch.load("bigvgan32k.pt")
    msg = model.load_state_dict(state_dict)
    model.eval()
    model.set_weight_norm(enabled=False)

    print(msg)
    mels = torch.load("mels.pt")
    with torch.inference_mode():
        y = model(mels.cpu())

    for i, wav in enumerate(y):
        wav = wav.view(-1).detach().numpy()
        sf.write(f"bigvgan_test{i}.flac", wav, samplerate=32_000, format="FLAC")
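A random-weights smoke test of the shapes (the import path is inferred from this commit's layout, and no checkpoint is needed): the upsample rates [10, 8, 4, 2] multiply to 640 output samples per mel frame, and the tanh bounds the waveform.

import torch
from orator.models.bigvgan.bigvgan import BigVGAN  # path assumed from this commit's layout

model = BigVGAN()
model.inference_mode()                 # eval + strip weight norm

mels = torch.randn(1, 80, 50)          # [B, C=80 mel bands, T=50 frames]
wav = model(mels)

# total upsampling is 10 * 8 * 4 * 2 = 640 samples per mel frame
assert wav.shape == (1, 1, 50 * 640)
assert wav.abs().max() <= 1.0          # tanh-bounded output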
orator/src/orator/models/s3gen/__pycache__/__init__.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/__pycache__/__init__.cpython-311.pyc and b/orator/src/orator/models/s3gen/__pycache__/__init__.cpython-311.pyc differ
orator/src/orator/models/s3gen/__pycache__/const.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/__pycache__/const.cpython-311.pyc and b/orator/src/orator/models/s3gen/__pycache__/const.cpython-311.pyc differ
orator/src/orator/models/s3gen/__pycache__/decoder.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/__pycache__/decoder.cpython-311.pyc and b/orator/src/orator/models/s3gen/__pycache__/decoder.cpython-311.pyc differ
orator/src/orator/models/s3gen/__pycache__/f0_predictor.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/__pycache__/f0_predictor.cpython-311.pyc and b/orator/src/orator/models/s3gen/__pycache__/f0_predictor.cpython-311.pyc differ
orator/src/orator/models/s3gen/__pycache__/flow.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/__pycache__/flow.cpython-311.pyc and b/orator/src/orator/models/s3gen/__pycache__/flow.cpython-311.pyc differ
orator/src/orator/models/s3gen/__pycache__/flow_matching.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/__pycache__/flow_matching.cpython-311.pyc and b/orator/src/orator/models/s3gen/__pycache__/flow_matching.cpython-311.pyc differ
orator/src/orator/models/s3gen/__pycache__/hifigan.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/__pycache__/hifigan.cpython-311.pyc and b/orator/src/orator/models/s3gen/__pycache__/hifigan.cpython-311.pyc differ
orator/src/orator/models/s3gen/__pycache__/s3gen.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/__pycache__/s3gen.cpython-311.pyc and b/orator/src/orator/models/s3gen/__pycache__/s3gen.cpython-311.pyc differ
orator/src/orator/models/s3gen/__pycache__/xvector.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/__pycache__/xvector.cpython-311.pyc and b/orator/src/orator/models/s3gen/__pycache__/xvector.cpython-311.pyc differ
orator/src/orator/models/s3gen/matcha/__pycache__/decoder.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/matcha/__pycache__/decoder.cpython-311.pyc and b/orator/src/orator/models/s3gen/matcha/__pycache__/decoder.cpython-311.pyc differ
orator/src/orator/models/s3gen/matcha/__pycache__/flow_matching.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/matcha/__pycache__/flow_matching.cpython-311.pyc and b/orator/src/orator/models/s3gen/matcha/__pycache__/flow_matching.cpython-311.pyc differ
orator/src/orator/models/s3gen/matcha/__pycache__/transformer.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/matcha/__pycache__/transformer.cpython-311.pyc and b/orator/src/orator/models/s3gen/matcha/__pycache__/transformer.cpython-311.pyc differ
orator/src/orator/models/s3gen/transformer/__pycache__/__init__.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/transformer/__pycache__/__init__.cpython-311.pyc and b/orator/src/orator/models/s3gen/transformer/__pycache__/__init__.cpython-311.pyc differ
orator/src/orator/models/s3gen/transformer/__pycache__/activation.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/transformer/__pycache__/activation.cpython-311.pyc and b/orator/src/orator/models/s3gen/transformer/__pycache__/activation.cpython-311.pyc differ
orator/src/orator/models/s3gen/transformer/__pycache__/attention.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/transformer/__pycache__/attention.cpython-311.pyc and b/orator/src/orator/models/s3gen/transformer/__pycache__/attention.cpython-311.pyc differ
orator/src/orator/models/s3gen/transformer/__pycache__/convolution.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/transformer/__pycache__/convolution.cpython-311.pyc and b/orator/src/orator/models/s3gen/transformer/__pycache__/convolution.cpython-311.pyc differ
orator/src/orator/models/s3gen/transformer/__pycache__/embedding.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/transformer/__pycache__/embedding.cpython-311.pyc and b/orator/src/orator/models/s3gen/transformer/__pycache__/embedding.cpython-311.pyc differ
orator/src/orator/models/s3gen/transformer/__pycache__/encoder_layer.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/transformer/__pycache__/encoder_layer.cpython-311.pyc and b/orator/src/orator/models/s3gen/transformer/__pycache__/encoder_layer.cpython-311.pyc differ
orator/src/orator/models/s3gen/transformer/__pycache__/positionwise_feed_forward.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/transformer/__pycache__/positionwise_feed_forward.cpython-311.pyc and b/orator/src/orator/models/s3gen/transformer/__pycache__/positionwise_feed_forward.cpython-311.pyc differ
orator/src/orator/models/s3gen/transformer/__pycache__/subsampling.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/transformer/__pycache__/subsampling.cpython-311.pyc and b/orator/src/orator/models/s3gen/transformer/__pycache__/subsampling.cpython-311.pyc differ
orator/src/orator/models/s3gen/transformer/__pycache__/upsample_encoder.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/transformer/__pycache__/upsample_encoder.cpython-311.pyc and b/orator/src/orator/models/s3gen/transformer/__pycache__/upsample_encoder.cpython-311.pyc differ
orator/src/orator/models/s3gen/utils/__pycache__/class_utils.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/utils/__pycache__/class_utils.cpython-311.pyc and b/orator/src/orator/models/s3gen/utils/__pycache__/class_utils.cpython-311.pyc differ
orator/src/orator/models/s3gen/utils/__pycache__/mask.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/utils/__pycache__/mask.cpython-311.pyc and b/orator/src/orator/models/s3gen/utils/__pycache__/mask.cpython-311.pyc differ
orator/src/orator/models/s3gen/utils/__pycache__/mel.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3gen/utils/__pycache__/mel.cpython-311.pyc and b/orator/src/orator/models/s3gen/utils/__pycache__/mel.cpython-311.pyc differ
orator/src/orator/models/s3tokenizer/__pycache__/__init__.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3tokenizer/__pycache__/__init__.cpython-311.pyc and b/orator/src/orator/models/s3tokenizer/__pycache__/__init__.cpython-311.pyc differ
orator/src/orator/models/s3tokenizer/__pycache__/s3tokenizer.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/s3tokenizer/__pycache__/s3tokenizer.cpython-311.pyc and b/orator/src/orator/models/s3tokenizer/__pycache__/s3tokenizer.cpython-311.pyc differ
orator/src/orator/models/t3/__pycache__/__init__.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/t3/__pycache__/__init__.cpython-311.pyc and b/orator/src/orator/models/t3/__pycache__/__init__.cpython-311.pyc differ
orator/src/orator/models/t3/__pycache__/llama_configs.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/t3/__pycache__/llama_configs.cpython-311.pyc and b/orator/src/orator/models/t3/__pycache__/llama_configs.cpython-311.pyc differ
orator/src/orator/models/t3/__pycache__/t3.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/t3/__pycache__/t3.cpython-311.pyc and b/orator/src/orator/models/t3/__pycache__/t3.cpython-311.pyc differ
orator/src/orator/models/t3/inference/__pycache__/t3_hf_backend.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/t3/inference/__pycache__/t3_hf_backend.cpython-311.pyc and b/orator/src/orator/models/t3/inference/__pycache__/t3_hf_backend.cpython-311.pyc differ
orator/src/orator/models/t3/inference/alignment_stream_analyzer.py
ADDED
@@ -0,0 +1,154 @@
# Copyright (c) 2025 Resemble AI
# Author: John Meade, Jeremy Hsu
# MIT License
import logging
import torch
from dataclasses import dataclass
from types import MethodType


logger = logging.getLogger(__name__)


@dataclass
class AlignmentAnalysisResult:
    # was this frame detected as being part of a noisy beginning chunk with potential hallucinations?
    false_start: bool
    # was this frame detected as being part of a long tail with potential hallucinations?
    long_tail: bool
    # was this frame detected as repeating existing text content?
    repetition: bool
    # was the alignment position of this frame too far from the previous frame?
    discontinuity: bool
    # has inference reached the end of the text tokens? eg, this remains false if inference stops early
    complete: bool
    # approximate position in the text token sequence. Can be used for generating online timestamps.
    position: int


class AlignmentStreamAnalyzer:
    def __init__(self, tfmr, queue, text_tokens_slice, alignment_layer_idx=9, eos_idx=0):
        """
        Some transformer TTS models implicitly solve text-speech alignment in one or more of their self-attention
        activation maps. This module exploits this to perform online integrity checks while streaming.
        A hook is injected into the specified attention layer, and heuristics are used to determine alignment
        position, repetition, etc.

        NOTE: currently requires no queues.
        """
        # self.queue = queue
        self.text_tokens_slice = (i, j) = text_tokens_slice
        self.eos_idx = eos_idx
        self.alignment = torch.zeros(0, j-i)
        # self.alignment_bin = torch.zeros(0, j-i)
        self.curr_frame_pos = 0
        self.text_position = 0

        self.started = False
        self.started_at = None

        self.complete = False
        self.completed_at = None

        # Using `output_attentions=True` is incompatible with optimized attention kernels, so
        # using it for all layers slows things down too much. We can apply it to just one layer
        # by intercepting the kwargs and adding a forward hook (credit: jrm)
        self.last_aligned_attn = None
        self._add_attention_spy(tfmr, alignment_layer_idx)

    def _add_attention_spy(self, tfmr, alignment_layer_idx):
        """
        Adds a forward hook to a specific attention layer to collect outputs.
        Using `output_attentions=True` is incompatible with optimized attention kernels, so
        using it for all layers slows things down too much.
        (credit: jrm)
        """

        def attention_forward_hook(module, input, output):
            """
            See `LlamaAttention.forward`; the output is a 3-tuple: `attn_output, attn_weights, past_key_value`.
            NOTE:
            - When `output_attentions=True`, `LlamaSdpaAttention.forward` calls `LlamaAttention.forward`.
            - `attn_weights` has shape [B, H, T0, T0] for the 0th entry, and [B, H, 1, T0+i] for the i-th.
            """
            step_attention = output[1].cpu()  # (B, 16, N, N)
            self.last_aligned_attn = step_attention[0].mean(0)  # (N, N)

        target_layer = tfmr.layers[alignment_layer_idx].self_attn
        hook_handle = target_layer.register_forward_hook(attention_forward_hook)

        # Backup original forward
        original_forward = target_layer.forward
        def patched_forward(self, *args, **kwargs):
            kwargs['output_attentions'] = True
            return original_forward(*args, **kwargs)

        # TODO: how to unpatch it?
        target_layer.forward = MethodType(patched_forward, target_layer)

    def step(self, logits):
        """
        Emits an AlignmentAnalysisResult into the output queue, and potentially modifies the logits to force an EOS.
        """
        # extract approximate alignment matrix chunk (1 frame at a time after the first chunk)
        aligned_attn = self.last_aligned_attn  # (N, N)
        i, j = self.text_tokens_slice
        if self.curr_frame_pos == 0:
            # first chunk has conditioning info, text tokens, and BOS token
            A_chunk = aligned_attn[j:, i:j].clone().cpu()  # (T, S)
        else:
            # subsequent chunks have 1 frame due to KV-caching
            A_chunk = aligned_attn[:, i:j].clone().cpu()  # (1, S)

        # TODO: monotonic masking; could have issue b/c spaces are often skipped.
        A_chunk[:, self.curr_frame_pos + 1:] = 0

        self.alignment = torch.cat((self.alignment, A_chunk), dim=0)

        A = self.alignment
        T, S = A.shape

        # update position
        cur_text_posn = A_chunk[-1].argmax()
        discontinuity = not (-4 < cur_text_posn - self.text_position < 7)  # NOTE: very lenient!
        if not discontinuity:
            self.text_position = cur_text_posn

        # Hallucinations at the start of speech show up as activations at the bottom of the attention maps!
        # To mitigate this, we just wait until there are no activations far off-diagonal in the last 2 tokens,
        # and there are some strong activations in the first few tokens.
        false_start = (not self.started) and (A[-2:, -2:].max() > 0.1 or A[:, :4].max() < 0.5)
        self.started = not false_start
        if self.started and self.started_at is None:
            self.started_at = T

        # Is generation likely complete?
        self.complete = self.complete or self.text_position >= S - 3
        if self.complete and self.completed_at is None:
            self.completed_at = T

        # NOTE: EOS rarely assigned activations, and second-last token is often punctuation, so use last 3 tokens.
        # NOTE: due to the false-start behaviour, we need to make sure we skip activations for the first few tokens.
        last_text_token_duration = A[15:, -3:].sum()

        # Activations for the final token that last too long are likely hallucinations.
        long_tail = self.complete and (A[self.completed_at:, -3:].sum(dim=0).max() >= 10)  # 400ms

        # If there are activations in previous tokens after generation has completed, assume this is a repetition error.
        repetition = self.complete and (A[self.completed_at:, :-5].max(dim=1).values.sum() > 5)

        # If a bad ending is detected, force emit EOS by modifying logits
        # NOTE: this means logits may be inconsistent with latents!
        if long_tail or repetition:
            logger.warning(f"forcing EOS token, {long_tail=}, {repetition=}")
            # (±2**15 is safe for all dtypes >= 16bit)
            logits = -(2**15) * torch.ones_like(logits)
            logits[..., self.eos_idx] = 2**15

        # Suppress EOS to prevent early termination
        if cur_text_posn < S - 3:  # FIXME: arbitrary
            logits[..., self.eos_idx] = -2**15

        self.curr_frame_pos += 1
        return logits
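The layer-spy trick in `_add_attention_spy` is worth isolating. The toy sketch below (where `Inner` and `Outer` are stand-ins, not classes from this repo) shows the same pattern: monkey-patch one submodule's forward to force `output_attentions=True`, and read the attention map off a forward hook, leaving every other layer on the fast kernel path.

import torch
import torch.nn as nn
from types import MethodType

class Inner(nn.Module):
    def forward(self, x, output_attentions=False):
        attn = torch.softmax(x @ x.transpose(-1, -2), dim=-1)
        return x, (attn if output_attentions else None)

class Outer(nn.Module):
    def __init__(self):
        super().__init__()
        self.inner = Inner()
    def forward(self, x):
        return self.inner(x)[0]

outer = Outer()
captured = {}

def spy_hook(module, inputs, output):
    captured["attn"] = output[1]          # second element of Inner's output tuple

outer.inner.register_forward_hook(spy_hook)

original_forward = outer.inner.forward    # grab the bound method before patching
def patched_forward(self, *args, **kwargs):
    kwargs["output_attentions"] = True    # force attentions on, for this layer only
    return original_forward(*args, **kwargs)
outer.inner.forward = MethodType(patched_forward, outer.inner)

_ = outer(torch.randn(2, 4, 8))
assert captured["attn"] is not None and captured["attn"].shape == (2, 4, 4)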
orator/src/orator/models/t3/inference/t3_hf_backend.py
CHANGED
@@ -23,14 +23,14 @@ class T3HuggingfaceBackend(LlamaPreTrainedModel, GenerationMixin):
         speech_head,
         latents_queue=None,
         logits_queue=None,
+        alignment_stream_analyzer: 'AlignmentStreamAnalyzer'=None,
     ):
         super().__init__(config)
         self.model = llama
         self.speech_enc = speech_enc
         self.speech_head = speech_head
-        self.latents_queue = latents_queue
-        self.logits_queue = logits_queue
         self._added_cond = False
+        self.alignment_stream_analyzer = alignment_stream_analyzer

     @torch.inference_mode()
     def prepare_inputs_for_generation(
@@ -101,12 +101,12 @@ class T3HuggingfaceBackend(LlamaPreTrainedModel, GenerationMixin):
             return_dict=True,
         )
         hidden_states = tfmr_out.hidden_states[-1]  # (B, seq, dim)
-        if self.latents_queue is not None:
-            self.latents_queue.put(hidden_states)

         logits = self.speech_head(hidden_states)
-
-
+        assert inputs_embeds.size(0) == 1
+
+        # NOTE: hallucination handler may modify logits to force emit an EOS token
+        logits = self.alignment_stream_analyzer.step(logits)

         return CausalLMOutputWithCrossAttentions(
             logits=logits,
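The analyzer's intervention on the logits wired in above is plain masking. A standalone illustration of the two moves (force EOS, suppress EOS), with a toy vocabulary size of 8:

import torch

eos_idx = 0
logits = torch.randn(1, 1, 8)                  # (B, T=1, vocab) toy logits

# Force EOS: flatten all logits to a large negative value, then raise EOS.
forced = -(2**15) * torch.ones_like(logits)
forced[..., eos_idx] = 2**15
assert forced.argmax(dim=-1).item() == eos_idx

# Suppress EOS: push only the EOS logit down so generation cannot stop yet.
suppressed = logits.clone()
suppressed[..., eos_idx] = -2**15
assert suppressed.argmax(dim=-1).item() != eos_idx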
orator/src/orator/models/t3/modules/__pycache__/cond_enc.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/t3/modules/__pycache__/cond_enc.cpython-311.pyc and b/orator/src/orator/models/t3/modules/__pycache__/cond_enc.cpython-311.pyc differ
orator/src/orator/models/t3/modules/__pycache__/learned_pos_emb.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/t3/modules/__pycache__/learned_pos_emb.cpython-311.pyc and b/orator/src/orator/models/t3/modules/__pycache__/learned_pos_emb.cpython-311.pyc differ
orator/src/orator/models/t3/modules/__pycache__/perceiver.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/t3/modules/__pycache__/perceiver.cpython-311.pyc and b/orator/src/orator/models/t3/modules/__pycache__/perceiver.cpython-311.pyc differ
orator/src/orator/models/t3/modules/__pycache__/t3_config.cpython-311.pyc
CHANGED
Binary files a/orator/src/orator/models/t3/modules/__pycache__/t3_config.cpython-311.pyc and b/orator/src/orator/models/t3/modules/__pycache__/t3_config.cpython-311.pyc differ