A hot potato: Artificial intelligence researchers used to work in peace. However, now that companies like OpenAI, Microsoft, Google, and others are commercializing generative AI, the use of copyrighted training material has come under fire. Lawmakers in the UK are asking for information on the issue, and OpenAI recently responded.

OpenAI recently told members of the House of Lords that it is "impossible" to train large language models (LLMs) without using copyrighted material. The claim was in response to the UK's Communications and Digital Select Committee, which is looking into the legal issues involving current AI systems.

Consumer applications like ChatGPT and Dall-E are built on OpenAI's GPT-series language models and related image models. Since 2018, OpenAI has trained these models on billions of samples of writing, art, and photographs, mostly scraped from the internet. The text dataset behind GPT-3 alone measured about 570GB, and in March 2023 the company released the larger GPT-4. The training material includes websites and books, which are without question protected works. However, copyright law goes far beyond books and websites.

"Because copyright today covers virtually every sort of human expression – including blogposts, photographs, forum posts, scraps of software code, and government documents – it would be impossible to train today's leading AI models without using copyrighted materials," OpenAI's submission to the House of Lords reads.

Indeed, under current copyright law, a work does not even have to be registered to be protected. A work is automatically copyrighted the moment its creator fixes it in a tangible medium. It does not matter if it's a digital file, video, book, blog post, or a forum comment. All copyright protections apply.

This issue wasn't much of a problem in years past because machine learning research was strictly academic. Training was largely considered fair use and nobody bothered researchers. However, now that LLMs are going commercial, they have entered a gray area of the fair use doctrine.

On rare occasions, ChatGPT "regurgitates" copyrighted snippets, which is a cut-and-dried infringement and a problem that OpenAI is working hard to eliminate. However, that issue is not directly related to what happens when researchers train an LLM with protected material. Instead, the system uses the works, copyrighted or otherwise, to learn how language is structured and used so that it may create original content that humans can understand.
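To see why regurgitation is easy to flag in principle, consider a minimal sketch of verbatim-overlap detection: compare word n-grams in a model's output against known source text. This is purely illustrative, not OpenAI's actual method; the eight-word threshold and function names are arbitrary assumptions for the example.

```python
def ngrams(words, n):
    """All consecutive n-word windows in a word list, as a set of tuples."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def regurgitates(output: str, source: str, threshold: int = 8) -> bool:
    """True if `output` repeats `threshold` or more consecutive words from `source`.

    A shared n-gram of length `threshold` means at least that many words
    appear verbatim, in order, in both texts. (Hypothetical heuristic.)
    """
    return bool(ngrams(output.split(), threshold) & ngrams(source.split(), threshold))

# A copied phrase of 8+ consecutive words trips the check; a paraphrase does not.
source = "the quick brown fox jumps over the lazy dog near the river bank"
print(regurgitates("he wrote that the quick brown fox jumps over the lazy dog", source))
print(regurgitates("a fast brown fox leaped over a sleepy dog", source))
```

Real memorization audits are far more involved (tokenization, fuzzy matching, enormous corpora), but the core idea is the same: verbatim reproduction is detectable, while learning statistical structure from a text is not reproduction at all.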

Unfortunately, AI training is a new frontier, and copyright law contains no provisions that address it. So, allegedly infringed parties have begun bringing cases to court. Companies like OpenAI and Microsoft are saying, "No. Training falls under fair use like it always has."

"Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents," OpenAI wrote in a blog post this week. "We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness."

Despite believing that the fair use doctrine covers LLM training, OpenAI provides a simple opt-out process, which The New York Times used in August last year. OpenAI's tools can no longer access the NYT website, yet the newspaper filed a lawsuit in December.
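The opt-out works at the crawler level: OpenAI publishes the user-agent string of its GPTBot web crawler, so site owners can refuse it in their robots.txt file. A publisher wanting to block all future crawling would add, for example:

```
User-agent: GPTBot
Disallow: /
```

Note that this only stops future crawls; it does not remove material already collected, which is part of why publishers are pursuing the question in court anyway.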

"We support journalism, partner with news organizations, [but] believe The New York Times lawsuit is without merit," it said.

OpenAI faces similar lawsuits from several published authors, including high-profile comedian Sarah Silverman. It's an issue that the courts cannot handle alone. The US Copyright Office, along with lawmakers, needs to clearly define the role AI training plays in copyright rules.

As long as "regurgitation" is eliminated, should training LLMs with copyrighted material fall under fair use?