
Photo Credit: Kevin Horvat
Hackers say they’ve scraped Spotify’s entire music library – compiling the metadata behind 256 million tracks tied to over 15.4 million artist profiles – and intend to make a massive amount of music available via torrent. Meanwhile, Spotify has acknowledged the breach and confirmed that the culprits accessed “some of the platform’s audio files.”
Update (12/22): After this piece was published, a Spotify spokesperson reached out with a statement confirming that the responsible user accounts had been “identified and disabled,” with “new safeguards” implemented as well.
“Spotify has identified and disabled the nefarious user accounts that engaged in unlawful scraping,” the spokesperson said. “We’ve implemented new safeguards for these types of anti-copyright attacks and are actively monitoring for suspicious behavior. Since day one, we have stood with the artist community against piracy, and we are actively working with our industry partners to protect creators and defend their rights.”
Below is our original coverage.
The allegedly responsible hackers, part of a self-described “non-profit project” called Anna’s Archive, themselves disclosed the data heist in a blog post. And that lengthy post, drawing from the metadata, covers hard stats concerning duration, stream volume, popularity, genre, release date, and more.
As for the audio itself, Anna’s Archive indicated that it had “archived around 86 million music files, representing around 99.6% of listens” and clocking in at “a little under 300TB in total size.”
“A while ago, we discovered a way to scrape Spotify at scale… For now this is a torrents-only archive aimed at preservation, but if there is enough interest, we could add downloading of individual files to Anna’s Archive,” the hackers wrote.
Unsurprisingly, Spotify and especially rightsholders have plenty to say about those plans. As noted by Third Chair (YC X25) head Yoav Zimmerman, however, whatever takedowns and legal actions follow, “the damage is already done.”
(Technically, Anna’s Archive claims that it doesn’t “host any copyrighted materials,” instead purportedly indexing “metadata that is already publicly available.” Direct hosting or not, some of the project’s supporters are lamenting the Spotify circumvention – and the possibility that it’ll “ruin the actual important literary archive” by encouraging aggressive litigation.)
“The data is circulating on P2P networks, and there is no putting this back in Pandora’s box,” Zimmerman wrote. “Anyone can now, in theory, create their own personal free version of Spotify (all music up to 2025) with enough storage and a personal media streaming server like Plex. The only real barriers are copyright law and fear of enforcement.”
Perhaps more pressingly in the AI age, the massive collection of audio could theoretically be used to train generative models and fuel additional unauthorized soundalike outputs – a particularly significant issue if the platforms involved are based in countries with inadequate IP protections.
“It is well understood that LLMs thrive on high-quality data,” one section of the Anna’s Archive site reads. “We have the largest collection of books, papers, magazines, etc in the world, which are some of the highest quality text sources.”
According to the same site, Anna’s Archive promptly published the metadata, with the roughly 300 terabytes’ worth of audio files “releasing in order of popularity.”
In other words, the full extent of the episode’s fallout remains to be seen. And as initially mentioned, Spotify confirmed the “unauthorized access” (but not where things go from here) in a detail-light statement.
“An investigation into unauthorized access identified that a third party scraped public metadata and used illicit tactics to circumvent DRM to access some of the platform’s audio files. We are actively investigating the incident,” the Spotify spokesperson said.