Wednesday, September 6, 2023

LLMs and IP


Here's a comment Bob Carpenter left on last week's AI post.

It's not clear, at least as of current court rulings, whether ChatGPT and its ilk have violated any copyright laws. There are a dozen current cases that amount to three basic claims: (1) training infringes copyright, (2) outputs are derivative works, and (3) systems strip out copyright marks. There may be cases about privacy and things like deep fakes coming. There's then the question about who is infringing in cases (2) and (3)---the AI company who built the model or the user who prompts it and then distributes the results? The infringement arguments failed for video recorders because there were non-infringing uses, but succeeded against Napster and music file sharing. The AI companies are most worried about (1). The countervailing consideration they present is that they don't care about the expression of the language (which is what is copyrightable), but only the data or concepts (which are not copyrightable). There's an absolutely fantastic hour-long talk on copyright and AI by Pamela Samuelson (where I took the above info), an IP law professor at Berkeley, here: Large language models meet copyright law. She explains the form vs. expression distinction and why it means code is treated differently than other forms of writing (because it's relatively function heavy).

And here (slightly cleaned up and augmented) is my reply.

Lots to talk about here (keep watching this space), but here are a couple of important points about VCRs (though I believe it was audio cassettes that set the more important precedents). The law allows for personal use, but selling tapes of copyrighted material is still against the law, and those rules are well enforced. (Illegal taping did give us one of the all-time great Seinfeld episodes, but that's not important right now.) If we were limiting ourselves to personal use, I doubt people would be getting this upset (or investing all those billions).

It's also worth noting that with video (though not so much with audio), the systems being sold to consumers when these precedents were set were next to worthless for large-scale IP theft. Copying tapes was a slow process, and second-generation VHS was almost unwatchable. Until S-VHS was introduced (years after these cases were decided), you needed a very expensive professional editing suite if you wanted to make money off violating that FBI warning.

Second, Memorex was providing a system where it was the user who did the inputting and the outputting, so there was no ambiguity about who was responsible for any piracy. When I use ChatGPT, all of the inputting has been done before I sit down. I had nothing to do with the decision to download the archives of the NYT or the novels of Stephen King. Most users don't even know about it.

More broadly speaking, laws are written with an eye to what's possible, and when that changes, the law should too. There are a lot of ways to steal IP without technically infringing on copyright, but that doesn't make it right, and when new tech makes those types of theft easier, those laws should be revisited.

Even pre-LLM, there were lots of gray areas and lots of aggressive lawyers who stretched IP protection to absurd limits, be it clawing a work back after decades in the public domain (It's a Wonderful Life) or suing a singer for sounding like himself (John Fogerty).

Finally, OpenAI, like Uber, Airbnb, and Tesla, has gained a competitive advantage by bending or breaking the spirit and often the letter of the law even when it doesn't have to just because it can. There is a huge amount of text that is in the public domain and could be used without infringing on anyone's rights or privacy. Of course, it would cost a little more, but a company valued at $27-29 billion could afford it, and you wouldn't be able to get a column written in the style of Maureen Dowd, but I consider that a social good.
