Audio Texture Manipulation
by Exemplar-Based Analogy

EECS, UC Berkeley

Manipulation by Analogy

We manipulate input speech based on an exemplar pair, where the pair demonstrates the desired transformation, such as adding, removing, or replacing specific sound elements.

Abstract

Audio texture manipulation involves modifying the perceptual characteristics of a sound to achieve specific transformations, such as adding, removing, or replacing auditory elements. In this paper, we propose an exemplar-based analogy model for audio texture manipulation. Instead of conditioning on text-based instructions, our method uses paired speech examples, where one clip represents the original sound and the other illustrates the desired transformation. The model learns to apply the same transformation to a new input, enabling the manipulation of sound textures. We construct a quadruplet dataset representing a variety of editing tasks, and train a latent diffusion model in a self-supervised manner. Through quantitative evaluations and perceptual studies, we show that our model outperforms text-conditioned baselines and generalizes well to real-world, out-of-distribution, and non-speech scenarios.
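
To make the data setup concrete, below is a minimal sketch of how one such training quadruplet could be organized. The field names and tensor layout are assumptions for exposition, not the paper's actual data format: each example pairs an exemplar before/after clip with a new input and its edited target.

from dataclasses import dataclass
import torch

@dataclass
class EditQuadruplet:
    # Exemplar pair demonstrating the edit (e.g., removing a background sound).
    exemplar_input: torch.Tensor   # spectrogram before the edit
    exemplar_output: torch.Tensor  # spectrogram after the edit
    # New clip to which the same edit should be applied.
    input: torch.Tensor            # input spectrogram
    target: torch.Tensor           # edited target; used only during training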

Addition Results with In-Domain Examples

Removal Results with In-Domain Examples

Replacement Results with In-Domain Examples

Generalization to Out-Of-Distribution Data

Model Architecture

Given the input audio and an exemplar pair, our goal is to transform the input to match the texture transformation demonstrated by the exemplar pair. We employ a pre-trained VAE encoder to encode both the input and target spectrograms into the latent space, and feed them into a latent diffusion model together with the exemplar pair embedding and a positional encoding. Finally, we use a pre-trained VAE decoder and a HiFi-GAN vocoder to reconstruct the waveform from the latent representation. Note that the VAE encoder for the target spectrogram is not used at test time.
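
For illustration, here is a minimal PyTorch sketch of the inference path described above. The component names and interfaces (vae_encoder, exemplar_encoder, diffusion.sample, vae_decoder, vocoder) are assumptions made for this sketch, not the paper's actual API; the components themselves are supplied as pre-trained modules.

import torch
import torch.nn as nn

class AnalogyInferencePipeline(nn.Module):
    # All components are assumed pre-trained and frozen. The target-side
    # VAE encoder used during training is absent here, since it is not
    # needed at test time.
    def __init__(self, vae_encoder, exemplar_encoder, diffusion, vae_decoder, vocoder):
        super().__init__()
        self.vae_encoder = vae_encoder            # spectrogram -> latent
        self.exemplar_encoder = exemplar_encoder  # exemplar pair -> conditioning embedding
        self.diffusion = diffusion                # latent diffusion model
        self.vae_decoder = vae_decoder            # latent -> spectrogram
        self.vocoder = vocoder                    # HiFi-GAN: spectrogram -> waveform

    @torch.no_grad()
    def forward(self, input_spec, exemplar_in, exemplar_out):
        # Encode the input spectrogram into the latent space.
        z_in = self.vae_encoder(input_spec)
        # Embed the exemplar pair that demonstrates the transformation.
        cond = self.exemplar_encoder(exemplar_in, exemplar_out)
        # Sample the transformed latent, conditioned on the input latent and
        # the exemplar embedding (positional encoding is assumed to be
        # handled inside the diffusion model in this sketch).
        z_out = self.diffusion.sample(z_in, cond)
        # Decode to a spectrogram, then vocode back to a waveform.
        spec = self.vae_decoder(z_out)
        return self.vocoder(spec)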

Related Work

Analogy-based methods have been studied in various deep learning fields.

Analogical Modeling is a formal theory of exemplar-based analogical reasoning.

Image Analogies applies learned image filters to new images, enabling effects like texture synthesis, super-resolution, texture transfer, and artistic styles based on example image pairs.

Visual Prompting via Image Inpainting explores visual prompting: adapting pre-trained visual models to new tasks, without finetuning or modification, by posing image-to-image tasks such as segmentation and object detection as image inpainting on curated academic figures.

Sequential Modeling Enables Scalable Learning for Large Vision Models introduces a sequential modeling approach for training a Large Vision Model (LVM) without linguistic data, using "visual sentences" to represent diverse visual inputs.

Conditional Generation of Audio from Video via Foley Analogies proposes a conditional Foley model that generates sound effects for silent videos given a user-supplied example. It also introduces a pretext task for predicting sound based on conditional audio-visual clips.

Self-Supervised Audio-Visual Soundscape Stylization introduces a self-supervised model that manipulates speech to match the sound properties of a different scene, using audio-visual examples and leveraging natural video data for training.

BibTeX

@inproceedings{cheng2025audio,
  author    = {Cheng, Kan Jen and Li, Tingle and Anumanchipalli, Gopala},
  title     = {Audio Texture Manipulation by Exemplar-Based Analogy},
  booktitle = {2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year      = {2025}
}