Block Distillation for Discrete Speech Generation

I did a quick (and very straightforward) exploration of how autoregressive speech synthesis could benefit from applying block causality to the latest discrete text-to-speech models. Even though these models produce low-confidence outputs, the approach speeds up inference by up to about 2-3x over the original in a model-native way (without any further optimization), with less degradation in speech quality and zero-shot capability than naive supervised fine-tuning. Existing models with a similar mechanism, such as hybrid architectures, could be accelerated the same way, in a data-free manner that does not require real labels.

Results

Checkpoint

ssonpull519/neutts-air-dllm-bd8

Inference Speed

Measured with PyTorch SDPA at the same precision, without further optimization. Units are tokens/sec.

{
  "data": [
    {
      "x": ["Baseline (AR)", "Ours (Block, th=0.015)", "Ours (Block, th=0.01)"],
      "y": [53.25, 120.52, 148.62],
      "type": "bar",
      "name": "Inference Speed (tokens/sec)"
    }
  ],
  "layout": {
    "title": "Inference Speed",
    "yaxis": {
      "title": "tokens/sec"
    }
  }
}
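For reference, the speedup over the AR baseline follows directly from the throughput figures above. The snippet below is only arithmetic on those three numbers; the variable names are my own shorthand.

```python
# Throughput figures (tokens/sec) copied from the chart above.
throughput = {
    "Baseline (AR)": 53.25,
    "Ours (Block, th=0.015)": 120.52,
    "Ours (Block, th=0.01)": 148.62,
}

baseline = throughput["Baseline (AR)"]

# Speedup is the ratio of each configuration's throughput to the baseline's.
speedups = {name: round(tps / baseline, 2) for name, tps in throughput.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```

This gives roughly 2.26x at a threshold of 0.015 and 2.79x at 0.01.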

Zero-shot Voice Cloning

All block-wise outputs are generated with a threshold of 0.015, and the other sampling hyperparameters are kept the same as in the autoregressive (AR) baseline.

Input audio 1
AR output 1
AR output 2
Block output 1
Block output 2
Input audio 2
AR output 1
AR output 2
Block output 1
Block output 2

Background

I recently worked on building foundation text-to-speech (TTS) models from scratch, aiming for the best quality in our local language. But that is not the end: we have to make them work and run fast in production. There are many ways to engineer and optimize them, but one natural question that came to mind was: how much faster can we make a model in a model-native way? Can we give it the ability to generate faster on its own than it could before?

Many recent TTS models are built on language modeling, and many of them (especially the relatively large-scale ones) are trained to generate discrete tokens for vector-quantization codecs, as these provide decent reconstruction quality. To generate faster, the model should learn to output multiple tokens in parallel, and the masked diffusion framework is one way to do that.

Methods

The training recipe is borrowed from Block Diffusion, A2D-VL, and Fast-dLLM and uses them as its starting point; refer to them for details. Many related works have lately improved on or scaled up these ideas, so here I only leave some key points as basic preliminaries for those who are not familiar with them:
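One of those preliminaries, block causality, can be illustrated with an attention mask. The sketch below is a simplified view of my own, not the exact training-time mask from the referenced works: positions attend bidirectionally inside their own block and causally to all earlier blocks. The function names `block_causal_mask` and `render` are hypothetical.

```python
def block_causal_mask(seq_len: int, block_size: int) -> list[list[bool]]:
    """mask[i][j] is True when query position i may attend to key position j."""
    mask = []
    for i in range(seq_len):
        block_i = i // block_size
        # Position i sees every position whose block index is <= its own:
        # all earlier blocks (causal) plus its whole current block (bidirectional).
        mask.append([(j // block_size) <= block_i for j in range(seq_len)])
    return mask


def render(mask: list[list[bool]]) -> str:
    """Draw the mask with 'x' for allowed and '.' for blocked attention."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


print(render(block_causal_mask(seq_len=6, block_size=2)))
```

With a block size of 1 this reduces to the ordinary causal mask of the AR model, which is what makes annealing from AR to block-wise generation possible.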

Applying to Speech Generation

There has been some research on applying discrete diffusion frameworks to speech, for example InstructTTS and TASTE. However, these were done at limited scale and build inductive biases into their architectures to adapt them to the speech synthesis task.

For a quick and simple exploration at a more practical scale, I chose the NeuTTS model from Neuphonic as the baseline. It uses the simplest kind of LM backbone, based on Qwen and LLaMA, and produces discrete codes.

Problems

Unlike typical large language models, which have relatively high confidence in their logits, speech generation has many more possible choices even when it is conditioned: for example, one text sequence can be matched with many different speech sequences. Given this, I noticed two practical problems that limit the extension to speech synthesis:

  1. Fine-tuning on real datasets severely limits and degrades the output quality, even when the data was included in the training set of the AR model.

  2. Confidence-based decoding rarely decodes tokens in parallel during inference.

Data-free Block Distillation

Supervising with real data degrades quality too much, and it also requires speech tokens encoded offline from real audio. Instead, we can combine AR-to-block annealed training with knowledge distillation (KD) to preserve and fully exploit the autoregressive teacher’s capability. The typical KD objective can be written as

\[\mathcal{L}_{KD} = \mathop{\mathbb{E}}_{(x, y) \sim (X, Y)}[\mathcal{D}(p_T || p_S^\theta)(y|x)],\]

where $(X, Y)$ are the input-output pairs from the supervising data, $p_T$ and $p_S^\theta$ are the token-level distributions of the teacher and the student, and $\mathcal{D}$ is a divergence. $\mathcal{D}$ is often chosen to be the KL divergence; to keep its values bounded in a safe range, it is practical to use the generalized Jensen-Shannon divergence, defined as

\[\mathcal{D}_{JSD}(P || Q) = \beta \mathcal{D}_{KL}(P || \beta P + (1-\beta) Q) + (1-\beta) \mathcal{D}_{KL}(Q || \beta P + (1-\beta) Q),\]

where $\beta$ is a hyperparameter that balances between forward and reverse KL behavior, which are known to be mean-seeking (mode-covering) and mode-seeking, respectively.
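To make the boundedness concrete, here is a minimal pure-Python sketch of the generalized Jensen-Shannon divergence as defined above, for a single pair of token-level distributions. The function names and the toy `teacher` / `student` distributions are illustrative assumptions; a real training loop would compute this on batched logits.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    # Terms with p_i == 0 contribute nothing (the usual 0 * log 0 = 0 convention).
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def generalized_jsd(p, q, beta=0.5):
    """beta * KL(p || m) + (1 - beta) * KL(q || m), with m = beta * p + (1 - beta) * q."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    # The mixture m is positive wherever p or q is, so both KL terms stay finite.
    m = [beta * pi + (1.0 - beta) * qi for pi, qi in zip(p, q)]
    return beta * kl(p, m) + (1.0 - beta) * kl(q, m)


# Toy distributions over a 4-token vocabulary (made up for illustration).
teacher = [0.70, 0.20, 0.10, 0.00]
student = [0.00, 0.10, 0.30, 0.60]

print(generalized_jsd(teacher, student, beta=0.5))
print(generalized_jsd(teacher, teacher, beta=0.5))  # identical distributions give 0
```

The plain forward KL between these two toy distributions would be infinite, because the student assigns zero probability to a token the teacher favors. The mixture in the second argument keeps the generalized JSD finite; at $\beta = 0.5$ it never exceeds $\log 2$.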

When $Y$ is generated by the teacher, the student is supervised with teacher predictions during distillation, which makes the procedure data-free: it only requires a set of text input conditions. Distilling on teacher outputs preserved audio quality much better than supervision from real speech tokens. One can also use student predictions for self-distillation, but mixing them in did not help much in our case. $Y$ can also consist of multiple labels, such as the top-k teacher outputs, which can help by learning only a few high-confidence paths.

In many cases full fine-tuning is not necessary; in our case, LoRA with a small rank was sufficient during distillation, which reduces the resource requirements.
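As a rough illustration of why a small rank cuts the resource requirement, the snippet below counts the trainable parameters LoRA adds to a single linear layer. The layer shape and the rank are hypothetical values picked for the example, not the configuration used here.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix.

    LoRA learns two low-rank factors: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


# Hypothetical layer shape and rank, chosen only for illustration.
d_in, d_out, rank = 1024, 1024, 8

full = d_in * d_out
lora = lora_param_count(d_in, d_out, rank)
print(f"full: {full}, LoRA: {lora}, ratio: {lora / full:.4%}")
```

For this made-up layer, the adapter trains about 1.6% of the parameters that full fine-tuning would update.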

The maximum workable block size was 8, the same as in A2D-VL: sizes larger than 8 consistently failed to preserve the original AR quality. During training, the block size is annealed from the smallest to the target size, along with the masking positions and timesteps. The schedules should be chosen carefully to achieve strong preservation.
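The exact schedule is not spelled out here, so the following is only one possible shape for the block-size annealing: a doubling staircase from 1 up to the target size, with training split evenly across the stages. The function name `block_size_at` and the even split are assumptions for illustration.

```python
def block_size_at(step: int, total_steps: int, target_block_size: int = 8) -> int:
    """Block size at a training step under a hypothetical doubling staircase."""
    if total_steps <= 0 or not 0 <= step < total_steps:
        raise ValueError("step must satisfy 0 <= step < total_steps")
    if target_block_size < 1 or target_block_size & (target_block_size - 1):
        raise ValueError("target_block_size must be a power of two")
    # Stages: 1, 2, 4, ... up to the target size.
    sizes = []
    size = 1
    while size <= target_block_size:
        sizes.append(size)
        size *= 2
    # Split training evenly across the stages (an assumption, not the post's schedule).
    stage = min(step * len(sizes) // total_steps, len(sizes) - 1)
    return sizes[stage]


print([block_size_at(s, total_steps=8) for s in range(8)])
```

Starting at a block size of 1 means the student begins in the teacher's own autoregressive regime and only gradually takes on the harder parallel prediction task.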

Low-Confidence Decoding

During inference, tokens are sampled with the same hyperparameters as in autoregressive decoding, such as top-p, top-k, and temperature. The KV cache is stored block-wise to keep generation fast.

The high threshold values commonly used for confidence-based parallel decoding are not very useful for speech synthesis, since its logits often have low confidence. This is also why greedy decoding in autoregressive TTS gives much worse outputs than it does in text generation, and why a temperature such as 1.0 is a common choice to provide the minimal stochasticity required. Threshold values between 0.01 and 0.02 gave a good trade-off between speed and quality for speech, which is very small compared to dLLMs, which often use values in the range of 0.7 to 0.9.
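The effect of the threshold can be seen in a toy version of confidence-based parallel decoding within one block. This is a sketch under my own assumptions, not the released inference code: `toy_model` and `TOY_DISTRIBUTION` stand in for the real network, the confidence of a position is taken to be the probability of the token sampled for it, and when nothing clears the threshold the single most confident candidate is committed so that decoding always makes progress.

```python
import random

MASK = None  # marks a position that has not been decoded yet

# Made-up flat-ish distribution: no token ever reaches a 0.9 confidence.
TOY_DISTRIBUTION = [0.40, 0.25, 0.15, 0.10, 0.05, 0.03, 0.015, 0.005]


def toy_model(block):
    """Hypothetical stand-in for the network: one distribution per position.

    A real model would condition on the prompt, the cached previous blocks,
    and the tokens already filled into `block`; this toy ignores all of that.
    """
    return [TOY_DISTRIBUTION for _ in block]


def decode_block(model, block_size, threshold, rng):
    """Fill one block; returns (tokens, number of forward passes)."""
    block = [MASK] * block_size
    steps = 0
    while any(token is MASK for token in block):
        probs = model(block)
        steps += 1
        # Sample a candidate for every masked position and record its probability.
        candidates = {}
        for pos, token in enumerate(block):
            if token is MASK:
                sampled = rng.choices(range(len(probs[pos])), weights=probs[pos])[0]
                candidates[pos] = (sampled, probs[pos][sampled])
        # Commit every candidate whose confidence clears the threshold.
        accepted = [pos for pos, (_, conf) in candidates.items() if conf >= threshold]
        # Otherwise commit only the most confident one, so each pass makes progress.
        if not accepted:
            accepted = [max(candidates, key=lambda pos: candidates[pos][1])]
        for pos in accepted:
            block[pos] = candidates[pos][0]
    return block, steps


rng = random.Random(0)
for th in (0.9, 0.015):
    _, steps = decode_block(toy_model, block_size=8, threshold=th, rng=rng)
    print(f"threshold={th}: {steps} forward passes for 8 tokens")
```

With the dLLM-style threshold of 0.9, no candidate is ever accepted in parallel, so the block costs one forward pass per token, the same as AR decoding. A threshold around 0.015 lets most sampled tokens through in a single pass.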

Future Works

This recipe can be applied to any model built on an LM backbone that produces discrete tokens: applying it to more models such as Qwen3-TTS and Chatterbox would help accelerate them and reveal more model-agnostic behaviors and problems.

One limitation is that the threshold for parallel decoding is too sensitive in some cases: if it is too high, generation slows down; if it is too low, the model starts producing low-quality speech or pauses more frequently. A sampling method that is more robust to low confidence could improve performance.

The approach could also be pushed to more difficult scenarios such as low-step distillation. Some recent works explore this direction, so it would be worth investigating whether further reducing the NFE is beneficial in practice.

Another direction is low-level optimization, such as using flexible-mask flash attention and more aggressive caching, pushing both the baseline and the modified model to their best-optimized states, which would give a true comparison of the upper bound in practice.