Ars Technica AI November 25, 2024

How Nvidia Fugatto Turns AI Audio Into a Sound-Mixing Tool

Nvidia’s Fugatto model is built to transform music, voices, and sound effects from text and audio prompts. Its key idea is control: users can combine and tune traits such as emotion, accent, instruments, rhythm, reverb, and unusual sound descriptions.

Nvidia’s Fugatto is a new AI audio model designed to do more than generate speech or melodic music from a prompt. The company describes it as a system that can transform mixes of music, voices, and sounds, including combinations that do not already exist in the real world.

The model is not available for public testing yet. But Nvidia has shown examples that point to a broader direction for generative audio: tools that let creators describe, adjust, and blend sound qualities with far more control than a simple text-to-audio request.

What Fugatto Is Built To Do

Fugatto is presented as a model for flexible audio transformation. Instead of focusing only on one task, such as speech synthesis or music generation, it is meant to work across voices, instruments, environmental sound, and effects.

Examples shown on the project’s webpage include saxophones barking, people speaking underwater, and ambulance sirens arranged like a choir. Nvidia has called Fugatto “a Swiss Army knife for sound,” a phrase that fits the range of audio jobs the model is designed to attempt.

The central promise is not just that Fugatto can generate sound. It is that the model can treat audio descriptions as ingredients that can be mixed, strengthened, weakened, or combined in unfamiliar ways.

Why The Training Data Matters

According to an explanatory research paper from over a dozen Nvidia researchers, one of the hard problems is teaching a model useful relationships between sound and language. Text models can learn many instructions from text alone, but audio needs more explicit description before a model can reliably connect words to audible traits.

To build that bridge, the researchers used an LLM to generate a Python script for creating many template-based and free-form instructions. These instructions described audio “personas,” including examples such as “standard, young-crowd, thirty-somethings, professional.”

The team then generated both absolute and relative instructions. An absolute instruction could ask the model to “synthesize a happy voice.” A relative one could ask it to “increase the happiness of this voice.”
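The split between absolute and relative instructions can be sketched with a few string templates. This is only an illustration: the trait vocabulary, template wording, and the `build_instructions` helper are assumptions made for this example, not the script the researchers generated.

```python
from itertools import product

# Hypothetical trait vocabulary: adjective -> matching noun.
# The article does not list the paper's actual templates or traits.
TRAITS = {"happy": "happiness", "sad": "sadness"}
TARGETS = ["voice"]

ABSOLUTE_TEMPLATE = "synthesize a {adjective} {target}"
RELATIVE_TEMPLATE = "{direction} the {noun} of this {target}"


def build_instructions(traits=TRAITS, targets=TARGETS):
    """Return (absolute, relative) instruction lists built from the templates."""
    absolute, relative = [], []
    for (adjective, noun), target in product(traits.items(), targets):
        # Absolute: describe the desired result outright.
        absolute.append(ABSOLUTE_TEMPLATE.format(adjective=adjective, target=target))
        # Relative: describe a change to an existing clip, in either direction.
        for direction in ("increase", "decrease"):
            relative.append(
                RELATIVE_TEMPLATE.format(direction=direction, noun=noun, target=target)
            )
    return absolute, relative


absolute, relative = build_instructions()
print(absolute[0])  # synthesize a happy voice
print(relative[0])  # increase the happiness of this voice
```

Each trait yields one absolute instruction and two relative ones, which is one simple way a small vocabulary can expand into a large instruction set.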

That distinction matters because audio editing often depends on degree. A creator may not want a completely different voice or instrument. They may want a version that is slightly more sorrowful, more accented, less reverberant, or closer to another sound.

The open source audio datasets used for Fugatto generally did not come with labels for those traits or their degree. Nvidia’s researchers therefore used existing audio understanding models to create “synthetic captions” for clips, describing traits such as gender, emotion, and speech quality. Audio processing tools also helped describe acoustic qualities, including “fundamental frequency variance” and “reverb.”

For comparisons between related sounds, the researchers used datasets where one element stays fixed while another changes. Examples include different emotional readings of the same text or different instruments playing the same notes. Those comparisons help the model learn what changes when speech becomes “happier,” or how a saxophone differs from a flute.
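One way to picture the fixed-versus-changing setup is to group clips by the element that stays constant and pair up the ones that differ in the other. The sketch below assumes a made-up record format (`id`, `text`, `emotion`) and a hypothetical `relative_pairs` helper. It is not drawn from the paper.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical clip records: the same text read with different emotions.
clips = [
    {"id": "a1", "text": "hello there", "emotion": "neutral"},
    {"id": "a2", "text": "hello there", "emotion": "happy"},
    {"id": "b1", "text": "good night", "emotion": "sad"},
]


def relative_pairs(records, fixed="text", varied="emotion"):
    """Pair clips that share the `fixed` field but differ in the `varied` field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[fixed]].append(record)

    pairs = []
    for group in groups.values():
        # Ordered pairs, so each comparison appears in both directions.
        for src, dst in permutations(group, 2):
            if src[varied] != dst[varied]:
                instruction = f"change {varied} from {src[varied]} to {dst[varied]}"
                pairs.append((src["id"], dst["id"], instruction))
    return pairs


pairs = relative_pairs(clips)
for src_id, dst_id, instruction in pairs:
    print(src_id, "->", dst_id, ":", instruction)
```

The lone "good night" clip produces no pair, because there is nothing to compare it against while the text is held constant.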

The Scale Behind The Model

After processing a variety of open source audio collections, the researchers produced a heavily annotated dataset of 20 million separate samples representing at least 50,000 hours of audio. That dataset became the basis for a model with 2.5 billion parameters.
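As a rough sense of scale, the two figures above imply an average clip length. The calculation below uses only the numbers quoted in this article. Because the hours figure is a lower bound, the result is a lower bound as well.

```python
# Figures quoted above: 20 million samples, at least 50,000 hours of audio.
samples = 20_000_000
hours = 50_000

average_seconds = hours * 3600 / samples
print(average_seconds)  # 9.0
```

That works out to at least nine seconds of audio per sample on average.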

The model was trained on 32 Nvidia H100 Tensor Core GPUs and began to show reliable scores on a variety of audio quality tests. Nvidia does not present Fugatto as a finished public product, but as a research step with a working set of examples and measurable audio performance.

The important point is that Fugatto’s flexibility depends on the annotations as much as the raw sound. The system needs to know not only what a clip contains, but how that clip can be described, compared, and modified.

ComposableART And Unheard Combinations

Beyond the dataset, Nvidia is emphasizing a system called ComposableART, where ART is short for “Audio Representation Transformation.” It can take text and/or audio prompts and use “conditional guidance” to control and generate combinations of instructions and tasks.

In practical terms, ComposableART is meant to combine traits learned from different examples. That can produce “highly customizable audio outputs outside the training distribution,” including sounds that were not directly present in the training material.

The examples are deliberately unusual. Nvidia has shown combinations such as a violin that “sounds like a laughing baby or a banjo that’s playing in front of gentle rainfall” and “factory machinery that screams in metallic agony.” Some results are described as more convincing than others, but the broader achievement is that the model can attempt such distant blends at all.

One of Fugatto’s more important ideas is that traits are treated as continuums. A sound can sit between an acoustic guitar and running water, with the result changing depending on how heavily each side is weighted. Nvidia also mentions changing how heavy a French accent sounds or adjusting the “degree of sorrow” in a spoken clip.
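The weighting idea can be illustrated with a generic guidance-style blend, in which each condition pulls an unconditional prediction toward itself in proportion to its weight. This is a toy sketch under stated assumptions: the article only says ComposableART uses “conditional guidance,” so the formula, the `compose_guidance` function, and the two-number “predictions” here are illustrative stand-ins, not Nvidia’s implementation.

```python
def compose_guidance(unconditional, conditionals, weights):
    """Blend conditional predictions around an unconditional one.

    Computes unconditional + sum_i weight_i * (conditional_i - unconditional),
    a common pattern in guided generative models (assumed here, not confirmed
    as Fugatto's method).
    """
    if len(conditionals) != len(weights):
        raise ValueError("need exactly one weight per conditional prediction")

    combined = list(unconditional)
    for conditional, weight in zip(conditionals, weights):
        for i, (c, u) in enumerate(zip(conditional, unconditional)):
            combined[i] += weight * (c - u)
    return combined


# Toy two-dimensional "predictions" standing in for real model outputs.
no_prompt = [0.0, 0.0]
acoustic_guitar = [1.0, 0.0]
running_water = [0.0, 1.0]

blend = compose_guidance(no_prompt, [acoustic_guitar, running_water], [0.75, 0.25])
print(blend)  # [0.75, 0.25]
```

Raising one weight and lowering the other moves the result along the continuum between the two sounds, which is the kind of control the article describes for accent strength or the “degree of sorrow.”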

Where This Could Fit In Audio Work

Fugatto also handles tasks associated with earlier audio models. It can change emotion in spoken text, isolate a vocal track from music, detect individual notes in MIDI music and replace them with vocal performances, and detect a beat so effects can be added in rhythm.

Nvidia mentions possible use cases including song prototyping, dynamically changing video game scores, and international ad targeting. These examples all depend on a similar need: fast, controllable changes to audio without rebuilding every element from scratch.

Nvidia’s researchers describe Fugatto as a first step “towards a future where unsupervised multitask learning emerges from data and model scale.” Nvidia also frames models like Fugatto as tools for audio artists rather than replacements for creative work.

“The history of music is also a history of technology,” Nvidia Inception participant and producer/songwriter Ido Zmishlany said in Nvidia’s blog post. “The electric guitar gave the world rock and roll. When the sampler showed up, hip-hop was born. With AI, we’re writing the next chapter of music. We have a new instrument, a new tool for making music—and that’s super exciting.”

For now, Fugatto is best understood as a demonstration of where AI audio research is heading. The model shows how future tools may let creators move beyond generating a sound and toward shaping sound as a set of adjustable, mixable properties.