A journalist at The Atlantic has identified four datasets containing millions of music tracks that have been used to train artificial intelligence models, and has made those datasets fully searchable by the public — exposing a wide range of artists whose work appears in AI training pipelines without their explicit consent for commercial use.
Atlantic reporter Alex Reisner uncovered the four datasets and built a searchable tool on The Atlantic's AI Watchdog site, where anyone can look up whether songs, books, or other media appear in the training data.
Two of the datasets are substantial in scale, containing 12 million and 9 million tracks respectively. The remaining two are considerably smaller but still represent meaningful volumes of training material, each containing more than 100,000 songs.
The datasets have been downloaded thousands of times, according to Reisner. Although it is impossible to determine precisely who has used them, Google and Stability AI have both acknowledged using them in published research papers.
The range of artists whose work appears in the data is broad. Pop acts including Lady Gaga and Fred Again.. sit alongside rock and hip-hop acts such as Radiohead, Bruce Springsteen, and Wu-Tang Clan. Electronic artist Aphex Twin and experimental composer Hainbach also appear in the datasets.
Some of the source material, including tracks from the Free Music Archive dataset, is freely available for personal streaming but requires a commercial license for other uses — a distinction that becomes legally significant when the music is used to train AI models.
The technical process of obtaining the audio is not straightforward. As Reisner explains, three of the four datasets are distributed as lists of links pointing to songs hosted on YouTube or Spotify, rather than as audio files themselves. AI developers then use automated tools to download the actual audio, and some of those tools are capable of bypassing platform logins, advertisements, and other mechanisms designed to generate revenue for creators. Using such tools violates the terms of service of those platforms.
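To make the "list of links" format concrete, the sketch below shows how such a manifest might be inspected without downloading any audio. It is purely illustrative: the URLs are placeholders rather than entries from any real dataset, and the names `MANIFEST`, `PLATFORM_HOSTS`, `platform_of`, and `count_by_platform` are hypothetical, not taken from Reisner's tool or from the datasets themselves.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical manifest: placeholder URLs, not entries from any real dataset.
MANIFEST = [
    "https://www.youtube.com/watch?v=EXAMPLE_ID_1",
    "https://open.spotify.com/track/EXAMPLE_ID_2",
    "https://youtu.be/EXAMPLE_ID_3",
    "https://example.org/audio/track.mp3",
]

# Assumed mapping from hostnames to platform labels (illustrative, not exhaustive).
PLATFORM_HOSTS = {
    "youtube.com": "YouTube",
    "youtu.be": "YouTube",
    "open.spotify.com": "Spotify",
}


def platform_of(url: str) -> str:
    """Return the platform label for a URL, or "other" if the host is unknown."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return PLATFORM_HOSTS.get(host, "other")


def count_by_platform(urls):
    """Count how many links in the manifest point at each platform."""
    return Counter(platform_of(u) for u in urls)


if __name__ == "__main__":
    for platform, n in count_by_platform(MANIFEST).most_common():
        print(f"{platform}: {n}")
```

A manifest like this contains only pointers, so the dataset file itself holds no music; the audio comes into play only when a separate tool fetches each link.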
The disclosure arrives amid a broader industry debate over the legality and ethics of training AI on copyrighted content without licensing agreements or compensation to rights holders. Music publishers and record labels have pursued litigation against several AI music generation companies over similar concerns.
The searchable database lowers the barrier for artists and rights holders to determine whether their work is included, potentially informing future legal challenges or licensing demands directed at AI developers who have relied on these datasets.