Anna’s Archive Leaks 300TB of Spotify Music Data, Raising Concerns About AI Training on Pirated Content

MediaNama Take: In India, the Spotify scrape lands amid an unresolved legal tension between data protection and copyright enforcement. Under the Digital Personal Data Protection Act (DPDPA), publicly available personal data is exempt from several consent requirements and can be processed, including for AI training, with limited safeguards. While this carve-out may cover music metadata, it does not extend to copyrighted audio files, exposing a gap that large-scale scraping exploits.

The issue is already before the Indian courts. In ANI’s copyright lawsuit against OpenAI in the Delhi High Court, the news agency has challenged the use of copyrighted content for AI training without authorisation, signalling that Indian courts may take a stricter view of “lawful access” than data protection law alone suggests. Against this backdrop, a recently proposed DPIIT committee framework recommends mandatory licensing and statutory royalties for AI training, with no opt-out for creators. If adopted, it would sharply limit the legal usability of datasets derived from unauthorised scraping, such as the Spotify archive, regardless of their scale or technical availability.

What’s the News

A piracy-linked activist group, Anna’s Archive, has leaked Spotify’s music data online after scraping and backing up large parts of the streaming platform’s library, including metadata for around 256 million tracks and audio files linked to nearly 99.6% of all listens on Spotify.

https://annas-archive.org/?

Spotify confirmed that unauthorised access took place and said it has taken action against the accounts involved. In a statement issued on December 22, the company said: “Spotify has identified and disabled the nefarious user accounts that engaged in unlawful scraping. We’ve implemented new safeguards for these types of anti-copyright attacks and are actively monitoring for suspicious behavior. Since day one, we have stood with the artist community against piracy, and we are actively working with our industry partners to protect creators and defend their rights.”

Anna’s Archive claims large-scale backup via torrents

In a blog post published on December 20, Anna’s Archive said it had backed up large parts of Spotify’s catalogue and distributed the data through bulk torrents.

According to the group’s own claims, the release includes metadata for around 256 million tracks and audio files for roughly 86 million songs. The total size of the data is close to 300 terabytes. The group is releasing the files in stages, starting with metadata and then moving on to music files ranked by popularity.

Spotify said that the unauthorised actors accessed some audio files but primarily scraped publicly available metadata and used ‘illicit tactics’ to bypass protections on certain files. The company said it is continuing to investigate the incident.

The leaked material includes track-level information such as song titles, artist names, album details, popularity scores, International Standard Recording Codes (ISRC), market availability, and audio features generated by Spotify. The dataset also contains information about playlists, including playlist names, follower counts, and track listings. In addition, the group claims to have stored audio files in compressed formats, along with embedded metadata and technical identifiers.

Anna’s Archive calls the release a ‘preservation archive’ and states that it intends it for long-term storage rather than everyday use. In its blog post, the group argues that existing music archiving efforts are fragmented or focus mainly on popular content. However, the operation involved scraping Spotify at scale and distributing copyrighted material without authorisation, which constitutes piracy under copyright law.

Why this matters beyond piracy

The scale of the alleged Spotify scrape raises concerns that go beyond conventional music piracy and copyright enforcement. If large portions of Spotify’s catalogue, metadata and audio files alike, are circulating through torrents, the practical barrier falls for AI companies or independent developers looking to train music-related AI models on pirated content.

The sheer volume of data involved also underscores the industrial scale of the leak. At nearly 300 terabytes, the dataset represents one of the largest known unauthorised disclosures in the content and media industry. To put this in perspective, storing 300 TB of data would require more than 600 laptops with 500 GB of storage each, or around 2,400 smartphones with 128 GB of storage each. This goes well beyond individual piracy or casual scraping, pointing instead to infrastructure-level data extraction with implications for copyright enforcement, platform security, and downstream reuse at scale.
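The storage comparison above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function name `devices_needed` is hypothetical, and the article does not say whether it counts 1 TB as 1,024 GB or as 1,000 GB. The default here assumes 1,024 GB, which appears to match the published figures, and the decimal convention is shown alongside it for comparison.

```python
import math

# Assumption: 1 TB = 1,024 GB. The article does not state its convention;
# pass gb_per_tb=1000 to use the decimal convention instead.
GB_PER_TB = 1024


def devices_needed(total_tb: float, device_gb: float, gb_per_tb: int = GB_PER_TB) -> int:
    """Return how many devices of `device_gb` capacity are needed to hold `total_tb`."""
    return math.ceil(total_tb * gb_per_tb / device_gb)


if __name__ == "__main__":
    for label, gb_per_tb in (("1 TB = 1,024 GB", 1024), ("1 TB = 1,000 GB", 1000)):
        laptops = devices_needed(300, 500, gb_per_tb)
        phones = devices_needed(300, 128, gb_per_tb)
        print(f"{label}: {laptops} laptops (500 GB), {phones} smartphones (128 GB)")
```

Under either convention the result is on the order of 600 laptops; the smartphone count is 2,400 with binary units and slightly lower with decimal units, so the article's rounded figures hold as approximations.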
