Every day, billions of people around the world listen to music. Yet very few know how to create it: learning the craft is hard, time-consuming and expensive, and so are the tools needed to take part.
Recent advances in artificial intelligence, particularly in language modelling, offer a unique opportunity to lower the barriers to entry for beginners and expose a wider audience to the joys of music creation. Building on these advances, we present a set of best-in-class innovations that bring us one step closer to fulfilling our mission of empowering the next billion people to create personalised soundtracks with AI.
OmniCodec
Neural audio codecs have demonstrated a remarkable ability to compress audio orders of magnitude more efficiently than traditional codecs while keeping fidelity high. However, the acoustic codes they create often lead to poor language modelling when generating musical sequences auto-regressively. Others have proposed adding a semantic component to improve performance on downstream language-modelling tasks, at the expense of audio fidelity¹.
We introduce OmniCodec, a neural codec that produces codebooks that are easy to learn in a language-modelling context, lead to better audio reconstructions than acoustic codecs of similar or larger bandwidths, and reduce the training budget of language-modelling tasks by compressing audio sequences more efficiently. To provide comparisons² to openly available state-of-the-art neural audio codecs³, we use VISQOL⁴, a commonly used perceptual audio quality metric.
| VISQOL metric | OmniCodec 2.06 kbps | DAC 2.6 kbps | Encodec 2.2 kbps |
|---|---|---|---|
| Average | 4.01 | 3.95 | 3.75 |
| Standard deviation | 0.22 | 0.23 | 0.35 |
| Percentage of wins for OmniCodec⁵ | – | 97.83% | 100% |
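For readers who want to reproduce this style of comparison, the summary statistics above can be derived from per-snippet VISQOL scores as sketched below. The `summarise_visqol` helper and the score arrays are hypothetical; the actual evaluation used the 50-snippet corpus described in the references.

```python
import numpy as np

def summarise_visqol(omnicodec_scores, baseline_scores):
    """Summarise per-snippet VISQOL scores for OmniCodec against one baseline codec.

    Both arguments are arrays with one VISQOL score per test snippet (higher is
    better). Returns the statistics reported in the table: mean, standard
    deviation, and the percentage of snippets where OmniCodec's reconstruction
    scored higher than the baseline's.
    """
    omni = np.asarray(omnicodec_scores, dtype=float)
    base = np.asarray(baseline_scores, dtype=float)
    wins = float(np.mean(omni > base)) * 100.0  # "percentage of wins"
    return {
        "omnicodec_mean": omni.mean(),
        "omnicodec_std": omni.std(),
        "baseline_mean": base.mean(),
        "baseline_std": base.std(),
        "win_rate_pct": wins,
    }

# Hypothetical usage with per-snippet scores over a 50-clip test corpus:
# stats = summarise_visqol(omni_scores, dac_scores)
# print(f"OmniCodec {stats['omnicodec_mean']:.2f} vs DAC {stats['baseline_mean']:.2f}, "
#       f"wins {stats['win_rate_pct']:.2f}%")
```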
Lyra
We introduce Lyra, a foundation model that generates instrumental-only music highly personalised to a user's tastes, trained with an efficient compute budget of only 44k GPU⁶ hours (~$110k).
Lyra is based on a transformer architecture customised to model the tokenised sequences produced by OmniCodec. It integrates multi-stage modelling with parallel token processing, which enables efficient and stable sampling at inference time. Lyra supports natural-language prompting and can be conditioned to generate complete compositions anywhere between 30 seconds and 10 minutes in duration.
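To give a sense of why token efficiency and parallel token processing matter for both training and sampling cost, here is a back-of-the-envelope sketch. The frame rate and codebook count are illustrative assumptions, not published OmniCodec parameters.

```python
# Back-of-the-envelope token budget for autoregressive modelling of codec
# tokens. The frame rate and codebook count below are illustrative
# assumptions, not published OmniCodec parameters.

def tokens_per_track(duration_s: float, frame_rate_hz: float, n_codebooks: int,
                     parallel_codebooks: bool) -> int:
    """Number of autoregressive steps needed to model one track.

    With parallel token processing, all codebooks at a given frame are
    predicted in one step, so the sequence length is just the number of
    frames; otherwise every (frame, codebook) pair is a separate step.
    """
    frames = int(duration_s * frame_rate_hz)
    return frames if parallel_codebooks else frames * n_codebooks

# Example: a 3-minute track at an assumed 25 frames/s with 4 codebooks.
print(tokens_per_track(180, 25, 4, parallel_codebooks=False))  # 18000 steps
print(tokens_per_track(180, 25, 4, parallel_codebooks=True))   #  4500 steps
```

A codec that packs the same audio into fewer frames, combined with parallel prediction of the codebooks at each frame, shortens the sequences the transformer must learn and sample, which is where the training-budget and inference-speed gains come from.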
Prompt: “Melodic techno”
Prompt: “Epic orchestral”
Prompt: “Lo-Fi”
Prompt: “Classical Piano”
Prompt: “Classical, in the style of Richard Strauss”
Prompt: “Big Room EDM”
Prompt: “A song for acoustic guitar”
Prompt: “Orchestral cinematic piece”
Prompt: “Jazz”
ORAQL
In the language domain, the ability to measure a model's performance with objective benchmarks and to use large frontier models for self-evaluation unlocks many possibilities for optimising pre- and post-training regimes. In music, evaluating quality⁷ is a highly subjective task for which there previously existed no good substitute for human listening.
We trained ORAQL, a music quality assessment model that closely aligns with the subjective preferences of human listeners. We observe a strong correlation (R = 0.78) between predicted and ground-truth quality scores on large-scale internal benchmarks. ORAQL is utilised:
- As a reward model during post-training, to perform Reinforcement Learning with AI feedback, which increases the proportion of generated samples deemed to be of human-level quality.
- As a quality-scaling technique at inference time, to select the best of n songs in a batch (sketched below). We find that doubling inference-time compute also doubles the likelihood that the sample(s) will be liked by a human listener, and is effective at suppressing below-average-quality samples entirely.
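To make the quality-scaling step concrete, here is a minimal sketch of best-of-n selection. The `score_fn` callable stands in for ORAQL, whose actual API is not described in this post, and the threshold value in the usage example is purely illustrative.

```python
# Minimal sketch of ORAQL-based best-of-n selection at inference time.
# `score_fn` stands in for the ORAQL quality model (its real API is not
# described in the post); candidates are generated songs (waveforms or tokens).

def best_of_n(candidates, score_fn, keep: int = 1, min_score: float | None = None):
    """Rank a batch of generated samples by predicted quality.

    Doubling n (i.e. doubling inference-time compute) gives the selector more
    chances to surface a sample a listener would like, and a `min_score`
    threshold can be used to suppress below-average generations entirely.
    """
    scored = sorted(((score_fn(c), c) for c in candidates),
                    key=lambda pair: pair[0], reverse=True)
    if min_score is not None:
        scored = [(s, c) for s, c in scored if s >= min_score]
    return scored[:keep]

# Hypothetical usage (threshold chosen for illustration only):
# top = best_of_n(generated_songs, oraql.score, keep=1, min_score=3.5)
```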
To train Lyra, we developed scaling laws that help us estimate optimal combinations of model size, data mix and compute budget given a unique distribution of OmniCodec tokens. For future versions, we plan to use ORAQL to predict final model performance on subjective listening benchmarks before training our models. In parallel, we are developing tools that utilise ORAQL for additional use cases, such as distinguishing between AI-generated and human-composed music and evaluating the recording quality or fidelity of audio files more broadly.
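As an illustration of how such scaling laws can be fitted, the sketch below fits a Chinchilla-style parametric form L(N, D) = E + A·N^(-α) + B·D^(-β) to measurements from small training runs. The functional form, initial guesses and helper names are assumptions for illustration; the post does not specify which family of curves was fitted for Lyra.

```python
# Illustrative sketch of fitting a Chinchilla-style scaling law
#   L(N, D) = E + A * N**(-alpha) + B * D**(-beta)
# to (model size, token count, loss) measurements. The functional form and the
# helpers below are assumptions, not the exact procedure used for Lyra.

import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, alpha, B, beta):
    n_params, n_tokens = x
    return E + A * n_params ** (-alpha) + B * n_tokens ** (-beta)

def fit_scaling_law(n_params, n_tokens, losses):
    """Fit the parametric loss surface to small-scale runs so that loss at
    larger (model size, token count) combinations can be extrapolated."""
    p0 = [1.0, 100.0, 0.3, 100.0, 0.3]  # rough initial guess
    popt, _ = curve_fit(scaling_law,
                        (np.asarray(n_params, dtype=float), np.asarray(n_tokens, dtype=float)),
                        np.asarray(losses, dtype=float), p0=p0, maxfev=20000)
    return popt

# Hypothetical usage: extrapolate loss for a 1B-parameter model on 50B tokens.
# popt = fit_scaling_law(run_sizes, run_tokens, run_losses)
# predicted_loss = scaling_law((1e9, 50e9), *popt)
```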
Influence
Unlike modalities such as language and images, where text is a powerful way to describe one's creative intent, music is much more abstract. Through our past experience working with customers to compose tunes personalised to their tastes, we found that the most powerful way for them to describe the kind of music they wanted was to provide a reference song that matched their expectations.
Influence is a post-training variant of Lyra that unlocks the popular use case of uploading a reference audio file and generating a completely original piece of music that shares core musical characteristics with the reference.
We find that Influence outperforms all models, open and closed source, at producing highly personalised songs that are unique, high quality and yet faithful to the intent of the user's input.
Influence: Johann Strauss
Influence: Ludwig van Beethoven
Influence: Antonio Vivaldi
Key features & capabilities
Although AI-generated vocals make for an impressive demo, we prefer to optimise our models for use cases that empower artists to create exceptional music, supporting their creative journey rather than replacing it.
Our real excitement lies in developing derivative models on top of Lyra, which open up a wide range of innovative applications, such as:
- Turning a melody into an arrangement: in our conversations with composers, we've learned that while they love crafting the melodies and themes of a new piece, arranging it can often feel more mechanical, especially when meeting a client's specific requirements. We're working on a derivative model that will act as an arrangement co-pilot by generating songs stem by stem, based on a melody, another stem, or text or audio references.
- Interfacing between composers and their customers through temp tracks: when a composer is hired to write music for visual media (film, video game, etc.), they are often given a temp track by the director to set the musical expectations their music should fulfil. With Influence, artists can quickly explore the idea space with the director before committing to a specific creative vision, and ultimately maximise the satisfaction of their customer.
- Teaching through play: a big part of engaging the next generation of music creators is to teach them through play. By breaking down the creative process into a series of well-defined steps that a generative model can assist with, AI can be used as a powerful tool for music education.
Building on Lyra, our future work will focus on developing an ecosystem of derivative models that address a variety of use cases conducive to expanding human creativity.
Sign up for the private beta
The samples we shared above are only the beginning; some of our models are still training, and our team is adding the finishing touches. Once we're ready, we will onboard select users to participate in a private beta test. If you're interested, you can register to take part here.
References
- A corpus of 50 out-of-distribution 30-second snippets from various musical styles was used for the score calculation. X-Codec¹ was also considered, but a direct comparison to our high-bandwidth codec would be unfair because X-Codec uses a lower audio sampling rate of 16 kHz.
- Percentage of samples in the test set where OmniCodec's reconstruction quality was higher than that of a third-party codec.
- NVIDIA H100 and B200.
- Here, we define quality as an umbrella term encompassing audio fidelity; musical sophistication; harmonic, structural and rhythmic coherence; and any other subjective features a human may be sensitive to when rating individual pieces of music by preference.