
Recent advances in deep learning have opened new pathways for embedding neural models in real-time audio applications, particularly in the realm of musical creativity. The primary goal of this research is to develop a new kind of musical instrument that harnesses the power and versatility of deep learning for sound generation. This approach addresses the fundamental question of how cutting-edge technology can be transformed into a creative tool that not only enhances but also empowers the artistic process.

To achieve this, the research first examines the notion of creativity, its implications, and the creative process, in order to highlight how these new technologies can foster an environment where artists can explore and expand their creative potential. Our main hypothesis centers on the crucial importance of the medium (the instrument) and its associated controls, as well as on the real-time nature of the sound generation.

We then propose a new model designed to enhance the control and expressiveness of audio generation, focusing on the manipulation of timbre through perceptual features. This model integrates techniques from Fader Networks and the recent RAVE model to allow for continuous descriptor-based control in real-time deep audio synthesis. By orthogonalizing the continuous time-varying attributes from the latent representation, our approach provides independent priors which enable separate operations such as timbre transfer and attribute transfer, opening up new avenues for creative exploration. This allows users to select a set of descriptors to condition the generation process, thereby creating a vast array of sounds.
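As a rough illustration of the data flow behind descriptor-based conditioning, the sketch below computes a time-varying descriptor from one signal and attaches it, frame by frame, to latent frames taken from elsewhere, which is the shape of an attribute transfer. It is not the proposed model: the latent frames are hard-coded placeholders standing in for an encoder output, RMS loudness is an assumed example of a descriptor, and the function names `rms_per_frame`, `rescale`, and `condition` are hypothetical.

```python
import math


def rms_per_frame(signal, frame_size):
    """Return one RMS value per non-overlapping frame of `signal`."""
    n_frames = len(signal) // frame_size
    curve = []
    for k in range(n_frames):
        frame = signal[k * frame_size:(k + 1) * frame_size]
        curve.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    return curve


def rescale(curve):
    """Map a descriptor curve linearly onto [-1, 1]; a flat curve maps to zeros."""
    lo, hi = min(curve), max(curve)
    if hi == lo:
        return [0.0] * len(curve)
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in curve]


def condition(latents, descriptors):
    """Append the per-frame descriptor values to each latent frame.

    latents:     list of frames, each a list of latent dimensions
    descriptors: dict mapping a descriptor name to a curve with one value per frame
    """
    names = sorted(descriptors)  # fixed order so channels are reproducible
    for name in names:
        if len(descriptors[name]) != len(latents):
            raise ValueError(f"descriptor {name!r} has the wrong number of frames")
    return [
        list(z) + [descriptors[name][t] for name in names]
        for t, z in enumerate(latents)
    ]


# Placeholder latent frames standing in for an encoder output (4 frames, 2 dims).
latents = [[0.1, -0.3], [0.2, -0.1], [0.0, 0.4], [-0.2, 0.5]]

# Attribute transfer: keep the latents, take the loudness curve from another signal
# (here a sine wave whose amplitude grows over time; assumed for illustration).
other = [math.sin(2 * math.pi * n / 8) * (n / 32) for n in range(32)]
loudness = rescale(rms_per_frame(other, frame_size=8))

# What a conditioned decoder would receive: latent dims plus one descriptor channel.
decoder_input = condition(latents, {"loudness": loudness})
```

Swapping the latent frames while keeping the descriptor curves fixed would correspond to the complementary operation, timbre transfer. The two remain separable only to the extent that the latent representation carries no information about the descriptors.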

A significant achievement of this research is the development of the ‘Neurorack’, a pioneering stand-alone synthesizer in the Eurorack format whose audio generation is driven by deep learning. The Neurorack is not just a theoretical construct but a tangible, functional instrument that demonstrates the practical application of the discussed concepts. It offers musicians a novel tool for live performance, equipped with real-time control over sound characteristics that were previously hard to manipulate, such as timbral texture and dynamic response. This instrument is designed to be intuitive, allowing seamless integration into existing musical setups and providing artists with immediate feedback and control, thereby enriching the live music creation experience.

Furthermore, the research speculates on the future of creative processes influenced by rapid technological advancements. We consider how emerging technologies might not only redefine traditional practices but also pave the way for novel forms of artistic expression. As deep learning continues to evolve, its integration into creative tools could lead to significant transformations in how music is composed, performed, and experienced. However, this evolution carries inherent risks: these technologies could come to dominate and standardize creative outputs. This could stifle individual creativity and homogenize art forms, overshadowing traditional techniques and skills that have defined artistic expression for centuries. Moreover, the increasing reliance on language models raises significant challenges in ensuring that these tools contribute creatively in a meaningful way. The challenge lies not just in their technical development but also in defining and implementing frameworks that encourage genuine innovation without diluting the creative process, so that these models enhance rather than replace the nuanced human touch in artistic creation.


