Audio Style Transfer: Explained

Mukul Pathak
7 min read · Jul 3, 2021

What do style transfer algorithms do?

Style transfer algorithms are deep learning techniques that manipulate digital media such as images, audio and video so that one media file adopts the style of another. Style transfer is commonly used to recreate digital images for entertainment and artistic purposes. Many mobile apps use style transfer algorithms to restyle users' pictures by merging them with famous paintings.

Overview of the Working of Audio Style Transfer

For audio style transfer to work, we need two audio files. The first, which supplies content such as tune, lyrics and pitch, is called the content audio file. The other is the style audio file, from which we extract the voice. After successfully training and applying the style transfer algorithm, we get an audio file that has the main content of the content file and the voice of the style file.

To explain this with an example, suppose we have “Hotline Bling” by Drake as our content audio and a voice recording of Morgan Freeman as our style audio file. The style transfer will give us the song “Hotline Bling” in Morgan Freeman’s voice.

Audio style transfer is loosely based on the initial concepts of image style transfer. The major challenge in treating audio like an image is that audio is a signal that changes over time. An image is already a 2-dimensional representation that is easy to operate on, whereas a raw audio waveform is a 1-dimensional sequence of samples, so the first challenge was to convert the audio files into a 2-dimensional representation.

Getting Technical

This challenge was overcome by converting the audio files into spectrograms. A spectrogram is a 2-dimensional time-frequency representation of an audio waveform: one axis is time, the other is frequency, and each value is the strength of that frequency at that moment. Spectrograms can be created by applying the Short-Time Fourier Transform (STFT) to short, overlapping windows of the waveform. The spectrograms are then rescaled to make them usable for some of our tasks: applying a log function to each value of the spectrogram compresses its dynamic range. A mel filter bank is also applied to these values (in practice usually before the log is taken), and we get mel-spectrograms, which are widely used for…
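The steps described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not the pipeline of any particular style transfer model: the function names, the frame length of 64, the hop of 32, the 8 mel bands and the 1000 Hz test tone are all assumptions chosen to keep the example small. It also applies the mel filter bank before the log, which is the more common order in practice. Real projects would normally rely on an audio library rather than a hand-written DFT.

```python
import cmath
import math


def hann(n):
    """Hann window of length n, used to taper each frame."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def stft_magnitude(signal, frame_len=64, hop=32):
    """Magnitude spectrogram: one list of frame_len // 2 + 1 bins per frame."""
    window = hann(frame_len)
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = [signal[start + i] * window[i] for i in range(frame_len)]
        # Naive DFT of the windowed frame (O(n^2), fine for a demo).
        spectrum = [
            abs(sum(chunk[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                    for t in range(frame_len)))
            for k in range(n_bins)
        ]
        frames.append(spectrum)
    return frames


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_mels, frame_len, sample_rate):
    """Triangular filters whose centres are evenly spaced on the mel scale."""
    n_bins = frame_len // 2 + 1
    top = hz_to_mel(sample_rate / 2)
    # n_mels + 2 edge frequencies (in Hz) from 0 up to the Nyquist frequency.
    edges = [mel_to_hz(top * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [k * sample_rate / frame_len for k in range(n_bins)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank


def log_mel_spectrogram(signal, sample_rate, frame_len=64, hop=32, n_mels=8):
    """Apply the mel filter bank to each frame, then take the log."""
    mag = stft_magnitude(signal, frame_len, hop)
    bank = mel_filter_bank(n_mels, frame_len, sample_rate)
    eps = 1e-10  # avoids log(0) for silent bands
    return [
        [math.log(sum(w * v for w, v in zip(row, frame)) + eps) for row in bank]
        for frame in mag
    ]


# Hypothetical input: a 1000 Hz sine tone sampled at 8000 Hz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(1024)]

spec = stft_magnitude(tone)                    # frames x frequency bins
log_mel = log_mel_spectrogram(tone, sample_rate)  # frames x mel bands
```

With these settings each frequency bin is 8000 / 64 = 125 Hz wide, so the 1000 Hz tone shows up as a peak in bin 8 of every frame. Both `spec` and `log_mel` are 2-dimensional grids (time by frequency), which is what lets image-style techniques be applied to audio.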
