[Cover image: vintage microphone on stage with warm lights and smoke. Generated using AI.]
Science at the service of creativity

Can AI compose music now?

with Geoffroy Peeters, Professor of Data Science at Télécom Paris (IP Paris)
On February 12th, 2025 | 6 min reading time
Geoffroy Peeters
Professor of Data Science at Télécom Paris (IP Paris)
Key takeaways
  • Today, algorithms for classifying, indexing and analysing music have enough data to operate autonomously.
  • With advances in deep learning, music can now be analysed as a set of distinct elements (vocals, drums, bass, etc.).
  • This ability to extract the elements that make up music has made it possible to recontextualize, modify or even clone them in other content.
  • It is now possible for certain models to generate their own music, although this remains a major technical challenge.
  • One of the challenges of these practices is to enable these models to generate genuinely new content, and not simply reproduce what they have already learned.

In 1957, a computer wrote a musical score for the first time. ILLIAC I – designed by Lejaren Hiller and Leonard Isaacson at the University of Illinois – composed a string quartet1. From then on, the promise of a computer program capable of generating music was rooted in reality. After all, music is all about structures, rules and mathematics. Nothing unknown to a computer program… except for one detail: creativity.

The fascinating thing about this suite is that it was composed by a computer, following a probabilistic model surprisingly similar to those used today2. However, it was created according to rules established by a human composer, revised and then performed by an orchestra. The result: a rigid application of the rules, leaving little room for artistic innovation.

Today, technology has evolved radically: anyone can play at being a composer from their computer. And thanks to deep learning algorithms and the rise of generative AI, musical AI has taken an interesting turn. For a machine to truly produce a musical work from scratch, it first has to understand music, not imitate it. Therein lies the challenge of a scientific quest begun over twenty years ago: not to make machines compose, but to teach them how to listen. Recognising a style, classifying a work, analysing a musical structure…

Long before the explosion of AI-assisted music generation, researchers were already trying to get machines to hear music. Among them was Geoffroy Peeters, a professor at Télécom Paris and previously Research Director at IRCAM. His work on the subject could help us answer the question: can a machine truly understand music, even before it claims to create it?

Understanding music

“In the early 2000s, the international standardisation of the .mp3 format (MPEG-1 Audio Layer III) led to the digitisation of music libraries (today’s streaming platforms), giving users access to a vast catalogue of music, and hence the need to classify and index each piece of music in it,” explains Geoffroy Peeters.

This gave rise to a new field of research: how do you develop a music search engine? “These music analysis technologies are based on audio analysis and signal processing and were initially ‘human-driven’. Learning was based on human-input rules,” he adds. Music is not simply a series of random sounds, but a structure organised according to a rigorous grammar – sometimes as strong as, or even stronger than, that of language. A style of music is determined by a certain type of chord, a certain tempo, a harmonic structure, and so on: “teaching these different rules to a machine didn’t seem all that complicated”.

“What defines Blues music, for example, is the repetition of a 12-bar grid based on the sequence of three specific chords,” elaborates the professor. “These rules, which we know very well, will be encoded in a computer, so that it can classify the music according to genre.” That said, music is not only defined by its genre, but it can also convey a mood or be more suited to a particular context – be it for sports, or for meditation. In short, there are many elements whose rules are more diffuse than those determining genre.
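To give a concrete idea of this rule-based, “human-driven” approach, here is a minimal Python sketch – an illustration, not taken from the article or from any production system – that hard-codes one common variant of the 12-bar blues grid as a sequence of I, IV and V chords and checks whether a chord progression matches it; the chord names and helper functions are hypothetical.

```python
# A "human-driven" rule: one common variant of the 12-bar blues grid.
# A real rule-based classifier would encode many such patterns per genre.

TWELVE_BAR_BLUES = ["I", "I", "I", "I",
                    "IV", "IV", "I", "I",
                    "V", "IV", "I", "I"]

NOTE_TO_SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
MAJOR_SCALE_DEGREES = {0: "I", 5: "IV", 7: "V"}   # semitones above the tonic


def to_degrees(chords, key="C"):
    """Convert chord roots (e.g. 'G7' -> 'G') into degrees relative to the key."""
    tonic = NOTE_TO_SEMITONE[key]
    degrees = []
    for chord in chords:
        root = NOTE_TO_SEMITONE[chord[0]]
        degrees.append(MAJOR_SCALE_DEGREES.get((root - tonic) % 12, "?"))
    return degrees


def looks_like_blues(chords, key="C"):
    """Rule-based check: does a 12-bar chord sequence follow the blues grid?"""
    return len(chords) == 12 and to_degrees(chords, key) == TWELVE_BAR_BLUES


# A 12-bar blues in C: four bars of C7, two of F7, two of C7, then G7, F7, C7, C7.
progression = ["C7", "C7", "C7", "C7", "F7", "F7", "C7", "C7", "G7", "F7", "C7", "C7"]
print(looks_like_blues(progression, key="C"))  # True
```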

“In an attempt to address this complexity, Pandora Music, the largest music streaming platform in the U.S., created the ‘Music Genome’ project, asking human beings to annotate over 1 million tracks based on 200 different criteria.” This colossal task accumulated enough data to enable the development of so-called data-driven approaches (in which knowledge is learned by the machine from the analysis of data). Among machine learning techniques, deep learning algorithms quickly emerged as the most powerful, and in the 2010s they enabled dazzling advances. “Rather than making human-driven models with complex mathematics, like signal processing, and manual decision rules, we can now learn everything completely automatically from data,” adds Geoffroy Peeters.
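By way of contrast with the rule-based sketch above, here is a minimal, hypothetical PyTorch sketch of the data-driven approach: a small convolutional network that learns to map mel-spectrograms to genre labels directly from annotated examples, with no hand-written rules. The dataset, label count and hyperparameters below are placeholders, not those of any real system.

```python
import torch
import torch.nn as nn
import torchaudio


class GenreClassifier(nn.Module):
    """Tiny CNN that maps a mel-spectrogram to genre logits."""

    def __init__(self, n_genres=10, sample_rate=22050):
        super().__init__()
        # Turn a waveform into a mel-spectrogram "image" the network can learn from.
        self.mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=64)
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_genres)

    def forward(self, waveform):
        spec = self.mel(waveform).unsqueeze(1)      # (batch, 1, n_mels, time)
        feats = self.features(spec).flatten(1)      # (batch, 32)
        return self.classifier(feats)               # (batch, n_genres)


# One training step: the "rules" are learned from (audio, genre-label) pairs.
model = GenreClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

waveforms = torch.randn(8, 22050 * 5)               # stand-in for 5-second audio clips
genres = torch.randint(0, 10, (8,))                 # stand-in for human annotations
loss = loss_fn(model(waveforms), genres)
loss.backward()
optimizer.step()
```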

Over time, these trained models have enabled the development of classification and recommendation algorithms for online music platforms such as Deezer and Spotify.

Learning to listen

Deep learning also brought about a paradigm shift. Whereas music used to be considered as a whole, it can now be analysed as a composite of elements. “Until 2010, we were unable to separate vocals, drums and bass from a mix in a clean – usable – way,” he points out. But if the voice could be extracted, the sung melody could be precisely recognised, characterised and more finely analysed. “Deep learning made this possible by training systems that take a ‘mixed song’ as input, with all the sources mixed together (vocals, drums, bass, etc.), and then output the various sources demixed, or separated.” To train such a system, however, you need data – lots of it. In the early days, some training could be carried out with access, often limited, to demixed recordings from record companies. That was the case until Spotify, with its huge catalogue of data, came up with a convincing source separation algorithm. This was followed by a host of new models, each more impressive than the last, including the French Spleeter model from Deezer, which is open source3, and Demucs from Meta AI in Paris.
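As a practical illustration, the short sketch below uses Deezer’s open-source Spleeter package mentioned above, with its documented pretrained “4stems” model; the file paths are placeholders, and the snippet follows Spleeter’s published usage rather than anything specific to this article.

```python
# Illustrative use of Deezer's open-source Spleeter (https://github.com/deezer/spleeter).
# The pretrained '4stems' model splits a mix into vocals, drums, bass and other.
from spleeter.separator import Separator

separator = Separator("spleeter:4stems")

# Input file and output directory are placeholders.
separator.separate_to_file("mixed_song.mp3", "separated/")
# After this call, 'separated/mixed_song/' contains vocals.wav, drums.wav,
# bass.wav and other.wav, which can then be analysed or re-used individually.
```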

This individual analysis of each element that makes up a piece of music has turned AI training on its head. “All this has opened the door to many things, including the generative AI being developed in music today. For example, with the ability to separate the voice and analyse it in detail, it becomes entirely possible to re-contextualise it (reinserting Edith Piaf’s voice in the film ‘La Môme’, or John Lennon’s in the Beatles’ ‘Now and Then’), to modify it (pitch correction is widely used), to recreate it (the voice of General de Gaulle pronouncing the call of June 18th), but also to clone it. Recent events show just how far the latter use can go, with concerns in the world of film dubbing, the fear of ‘deepfakes’, but also a previously unreleased track featuring Drake and The Weeknd, which was nevertheless not sung by them.”

Becoming a composer

Early research in musical AI had well-defined objectives: to classify, analyse and segment music and, why not, to assist the composer in his or her creation. But with the emergence of generative models, this work became the basis for a whole new approach: the generation of a piece of music (and therefore its audio signal) from nothing, or from just a textual “prompt”. “The first player to position itself in music generation from scratch was OpenAI, with Jukebox,” notes Geoffroy Peeters. “In a way, they recycled what they were doing for ChatGPT: using a Large Language Model (LLM), a so-called autoregressive model, trained to predict the next word based on the previous ones.”

Transposing this principle to music is a major technical challenge. Unlike text, audio is not made up of distinct words that the AI can treat as tokens. “We had to translate the audio signal into a form that the model could understand,” he explains. “This is possible with quantised auto-encoders, which learn to project the signal into a quantised space – the space of ‘tokens’ – and to reconstruct the audio signal from these ‘tokens’. All that remains is to model the temporal sequence of tokens in a piece of music, which is done using an LLM. The LLM is then used again to generate a new sequence of ‘tokens’ (the most likely sequence), which are then converted into audio by the quantised auto-encoder’s decoder.”
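To make this two-stage recipe more concrete, here is a deliberately tiny PyTorch sketch, assuming toy sizes and made-up module names; it illustrates the principle rather than OpenAI’s (or anyone’s) actual implementation. A toy vector-quantised auto-encoder turns audio frames into discrete token indices, a small causal Transformer learns to predict the next token, and generation samples tokens one by one before decoding them back to audio frames.

```python
import torch
import torch.nn as nn

FRAME = 256      # audio samples per frame (toy value)
CODEBOOK = 512   # number of discrete "tokens"
DIM = 64         # latent dimension


class ToyVQAutoEncoder(nn.Module):
    """Encode audio frames to latents, snap them to a codebook, decode back."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(FRAME, DIM)
        self.decoder = nn.Linear(DIM, FRAME)
        self.codebook = nn.Embedding(CODEBOOK, DIM)

    def tokenize(self, frames):                        # frames: (batch, n_frames, FRAME)
        z = self.encoder(frames)                       # (batch, n_frames, DIM)
        # Nearest codebook entry for each latent = the discrete "token" index.
        dists = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        return dists.argmin(dim=-1)                    # (batch, n_frames) integer tokens

    def detokenize(self, tokens):                      # tokens: (batch, n_frames)
        return self.decoder(self.codebook(tokens))     # back to (batch, n_frames, FRAME)


class ToyTokenLM(nn.Module):
    """Autoregressive model over token indices: predict token t from tokens < t."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(CODEBOOK, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, CODEBOOK)

    def forward(self, tokens):                         # tokens: (batch, seq)
        seq = tokens.shape[1]
        causal = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        h = self.transformer(self.embed(tokens), mask=causal)
        return self.head(h)                            # next-token logits


# Training sketch: next-token prediction on tokenised audio.
vqae, lm = ToyVQAutoEncoder(), ToyTokenLM()
audio = torch.randn(2, 32, FRAME)                      # 2 fake clips of 32 frames
tokens = vqae.tokenize(audio)                          # discrete sequence per clip
logits = lm(tokens[:, :-1])                            # predict each next token
loss = nn.functional.cross_entropy(
    logits.reshape(-1, CODEBOOK), tokens[:, 1:].reshape(-1))
loss.backward()

# Generation sketch: sample tokens one by one, then decode them back to audio.
generated = tokens[:, :1]                              # seed with one token
for _ in range(31):
    next_logits = lm(generated)[:, -1]                 # logits for the next token
    next_token = next_logits.softmax(-1).multinomial(1)
    generated = torch.cat([generated, next_token], dim=1)
waveform_frames = vqae.detokenize(generated)           # (2, 32, FRAME) audio frames
```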

Models with even more impressive results followed, such as Stable Audio from Stability AI. This type of model uses the principle of diffusion (popularised for the generation of very high-quality images, as in Midjourney or Stable Diffusion), but the idea remains the same: to transform the audio signal into quantised data readable by their diffusion model.
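For readers curious about the diffusion principle itself, the following generic sketch – not Stable Audio’s architecture; the denoiser, latent shape and noise schedule are placeholder choices – shows a single denoising-diffusion training step on a latent representation of audio: noise is mixed into the latents according to a schedule, and the network is trained to predict that noise.

```python
import torch
import torch.nn as nn

LATENT_DIM = 64
T = 1000                                              # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)                 # simple linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(                             # toy noise-prediction network
    nn.Linear(LATENT_DIM + 1, 128), nn.ReLU(), nn.Linear(128, LATENT_DIM))

latents = torch.randn(16, LATENT_DIM)                 # stand-in for encoded audio
t = torch.randint(0, T, (16,))                        # random diffusion step per item
noise = torch.randn_like(latents)

# Forward process: mix clean latents with noise according to the schedule.
a = alphas_cumprod[t].unsqueeze(-1)
noisy = a.sqrt() * latents + (1 - a).sqrt() * noise

# The model learns to predict the injected noise from the noisy latent.
t_input = (t.float() / T).unsqueeze(-1)               # crude timestep conditioning
predicted_noise = denoiser(torch.cat([noisy, t_input], dim=-1))
loss = nn.functional.mse_loss(predicted_noise, noise)
loss.backward()
```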

Providing a minimum of control over the resulting music requires “conditioning” the generative models on text; this text is either a description of the audio signal (its genre, mood, instrumentation) or its lyrics. To achieve this, the training of the models also takes into account a text corresponding to a given music input. This is why the Suno model can be “prompted” with text. However, this is where the limits of their creative capacity and questions of intellectual property come into play. “These models suffer a lot from memorisation,” warns Geoffroy Peeters. “For example, asking Suno in a prompt to make music accompanied by the lyrics of ‘Bohemian Rhapsody’ ended up generating music very close to the original. This poses copyright problems for the newly created music, because the rights to it belong to the human behind the prompt, while the rights to the music used to train the model were never theirs.” [Editor’s note: today, Suno refuses this type of generation, as it no longer complies with its terms of use.]
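One common way to implement this kind of text conditioning in an autoregressive token model – a generic sketch, not Suno’s or any specific system’s method – is to embed the text prompt and prepend it to the audio-token sequence, so that every audio token is predicted given both the text and the preceding audio tokens; the vocabulary sizes and tokenisation below are placeholders.

```python
import torch
import torch.nn as nn

DIM, CODEBOOK, TEXT_VOCAB = 64, 512, 1000   # toy sizes; real systems are far larger


class TextConditionedTokenLM(nn.Module):
    """Prepend embedded text tokens to audio tokens, then model autoregressively."""

    def __init__(self):
        super().__init__()
        self.text_embed = nn.Embedding(TEXT_VOCAB, DIM)    # e.g. "calm piano ballad"
        self.audio_embed = nn.Embedding(CODEBOOK, DIM)     # discrete audio tokens
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, CODEBOOK)

    def forward(self, text_tokens, audio_tokens):
        # Text goes first, so causal attention lets every audio position "see" it.
        x = torch.cat([self.text_embed(text_tokens),
                       self.audio_embed(audio_tokens)], dim=1)
        seq = x.shape[1]
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        h = self.transformer(x, mask=mask)
        # Keep only the hidden states that predict the next *audio* token.
        return self.head(h[:, text_tokens.shape[1] - 1 : -1])


model = TextConditionedTokenLM()
text = torch.randint(0, TEXT_VOCAB, (2, 8))      # stand-in for a tokenised prompt
audio = torch.randint(0, CODEBOOK, (2, 32))      # stand-in for tokenised music
logits = model(text, audio)                      # (2, 32, CODEBOOK)
loss = nn.functional.cross_entropy(logits.reshape(-1, CODEBOOK), audio.reshape(-1))
loss.backward()
```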

“So, there’s a real need to turn these tools into models that generate new content, not just reproduce what they’ve learned,” concludes the professor. “Today’s models generate music, but do they create new music? Unlike audio synthesisers (which made it possible to create new sounds), music is an organisation of sounds (notes or otherwise) based on rules. Models are undoubtedly capable of understanding these rules, but are they capable of inventing new ones? Are they still at the stage of ‘stochastic parrots’, as is often said?”

Pablo Andres
1. Illiac Suite — Hiller, L., & Isaacson, L. (1959). Experimental Music: Composition with an Electronic Computer. McGraw-Hill.
2. Timeline of the use of AI in musical composition — IRCAM (2023). Une brève chronologie subjective de l’usage de l’intelligence artificielle en composition musicale. – Agon, C. (1998). Analyse de l’utilisation de l’IA en musique.
3. WIPO report on AI and musical intellectual property — World Intellectual Property Organization (WIPO) (2021). Artificial Intelligence and Intellectual Property: A Literature Review.
