Meta’s Llama 3.1 is much more likely to reproduce copyrighted material from the popular Harry Potter series of fantasy novels than some of its rival AI models, according to new research.
The study was published by computer scientists and legal scholars from Stanford, Cornell, and West Virginia University. It evaluated five popular open-weight models to determine which were most likely to reproduce text from Books3, an AI training dataset comprising collections of copyrighted books.
Meta’s 70-billion-parameter large language model (LLM) has memorised over 42 per cent of Harry Potter and the Philosopher’s Stone well enough to reproduce 50-token excerpts from the book at least half of the time, as per the study. It also found that darker lines of the book were easier for the LLM to reproduce.
The new research comes at a time when AI companies, including Meta, are facing a wave of lawsuits accusing them of violating the law by using copyrighted material to train their models without permission.
It offers new insights into the pivotal question of how easily AI models can reproduce excerpts from copyrighted material verbatim. Companies such as OpenAI have previously argued that memorisation of text by AI models is a fringe phenomenon. The study’s findings suggest otherwise.
“There are really striking differences among models in terms of how much verbatim text they have memorized,” James Grimmelmann, one of the co-authors of the paper, was quoted as saying by Ars Technica.
“It’s clear that you can in fact extract substantial parts of Harry Potter and various other books from the model. That suggests to me that probably for some of those books, there’s something the law would call a copy of part of the book in the model itself,” said Mark Lemley, another co-author of the paper.
“The fair use analysis you’ve gotta do is not just ‘is the training set fair use,’ but ‘is the incorporation in the model fair use?’ That complicates the defendants’ story,” he added.
Llama 3.1 memorised more than any other model
As part of the study, the researchers divided 36 books into passages of 100 tokens each. They used the first 50 tokens of each passage as a prompt and calculated the probability that the model’s next 50 tokens would match the original passage.
The study defines ‘memorised’ as a greater than 50 per cent chance that an AI model will reproduce the original text word for word. The research was limited to open-weight models because the researchers had access to technical information, such as token probability values, that allowed them to calculate the probabilities of token sequences efficiently.
This would be more difficult to do in the case of closed models like those developed by OpenAI, Google, and Anthropic.
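The probability calculation described above can be sketched as follows. This is an illustrative example, not the researchers’ actual code: the function names and the sample log-probability values are hypothetical, but the criterion matches the study’s definition — a passage counts as memorised if the model’s chance of reproducing the exact continuation exceeds 50 per cent, where the chance of a sequence is the product of the model’s per-token probabilities.

```python
import math

def sequence_probability(token_logprobs):
    """Probability of reproducing an exact token sequence:
    the product of per-token probabilities, computed in log space
    (summing log-probabilities, then exponentiating)."""
    return math.exp(sum(token_logprobs))

def is_memorised(token_logprobs, threshold=0.5):
    """The study's criterion: a continuation counts as memorised if the
    model would reproduce it verbatim with probability above 50 per cent."""
    return sequence_probability(token_logprobs) > threshold

# Hypothetical per-token log-probabilities for a 5-token continuation
# (the study itself scored 50-token continuations).
confident = [-0.01, -0.02, -0.01, -0.03, -0.02]  # model is near-certain of each token
uncertain = [-1.5, -0.8, -2.1, -0.9, -1.2]       # model's predictions are diffuse

print(is_memorised(confident))  # True  (probability ~0.91)
print(is_memorised(uncertain))  # False (probability ~0.0015)
```

This is also why the approach needs open-weight models: the per-token probability values required for the calculation are exposed by the model itself, rather than estimated by repeatedly sampling a closed API.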
The study found that Llama 3.1 70B memorised more than any other model tested, including Meta’s earlier Llama 1 65B and models from Microsoft and EleutherAI. In contrast, Llama 1 was found to have memorised only 4.4 per cent of Harry Potter and the Philosopher’s Stone.
Limitations of the study
As per the study, Llama 3.1 was more likely to reproduce popular books such as The Hobbit and George Orwell’s 1984 than obscure ones like Sandman Slim, a 2009 novel by author Richard Kadrey. This variation could undermine plaintiffs’ efforts to file a unified lawsuit, while also making it harder for individual authors to take legal action against AI companies on their own.
While the findings could serve as evidence that several portions of the Harry Potter book were copied into the training data and weights used to develop Llama 3.1, they do not reveal how exactly this happened.
At the start of the year, legal documents showed that Meta CEO Mark Zuckerberg had personally cleared the use of a dataset comprising pirated e-books and articles for AI training. The new study lines up with these filings, which further indicate that Meta reportedly cut corners in gathering data for AI training.