Give credit where credit is due.
That’s a bit of sage wisdom that you perhaps were raised to firmly believe in. Indeed, one supposes or imagines that we might all somewhat reasonably agree that this is a fair and sensible rule of thumb in life. When someone does something that merits acknowledgment, make sure they get their deserved recognition.
The contrarian viewpoint would seem a lot less compelling.
If someone walked around insisting that credit should not be recognized when credit is due, well, you might assert that such a belief is impolite and possibly underhanded. We often find ourselves vociferously disturbed when someone who has accomplished something notable is cheated of credit. I dare say that we especially disfavor it when people falsely take credit for the work of others. That’s an unsettling double-whammy. The person who should have gotten the credit is denied their moment in the sun. In addition, the trickster is relishing the spotlight while wrongly fooling us into misplacing our favorable affections.
Why all this discourse about garnering credit in the right ways and averting the wrong and contemptible ways?
Because we seem to be facing a similar predicament when it comes to the latest in Artificial Intelligence (AI).
Yes, the claim is that this is demonstrably happening via a type of AI known as Generative AI. There is a lot of handwringing that Generative AI, the hottest AI in the news these days, has already taken credit that it does not deserve. And this is likely to worsen as generative AI gets increasingly expanded and utilized. More and more credit accrues to the generative AI, while sadly those that richly deserve the true credit are left in the dust.
My proffered way to crisply denote this purported phenomenon is via two snazzy catchphrases:
- 1) Plagiarism at scale
- 2) Copyright Infringement at scale
I assume that you might be aware of generative AI due to a widely popular AI app known as ChatGPT that was released in November 2022 by OpenAI. I will be saying more about generative AI and ChatGPT momentarily. Hang in there.
Let’s get right away to the crux of what is getting people’s goats, as it were.
Some have been ardently complaining that generative AI is potentially ripping off humans that have created content. You see, most generative AI apps are data trained by examining data found on the Internet. Based on that data, the algorithms can hone a vast internal pattern-matching network within the AI app that can subsequently produce seemingly new content that amazingly looks as though it was devised by human hand rather than by a piece of automation.
This remarkable feat is to a great extent due to making use of Internet-scanned content. Without the volume and richness of Internet content as a source for data training, the generative AI would pretty much be empty and be of little or no interest for being used. By having the AI examine millions upon millions of online documents and text, along with all manner of associated content, the pattern-matching is gradually derived to try and mimic human-produced content.
The more content examined, the better the odds that the pattern matching will be more finely honed and get even better at the mimicry, all else being equal.
Here then is the zillion-dollar question:
- Big Question: If you or others have content on the Internet that some generative AI app was trained upon, doing so presumably without your direct permission and perhaps entirely without your awareness at all, should you be entitled to a piece of the pie as to whatever value arises from that generative AI data training?
Some vehemently argue that the only proper answer is Yes, notably that those human content creators indeed deserve their cut of the action. The thing is, you would be hard-pressed to find anyone that has gotten their fair share, and worse still, almost no one has gotten any share whatsoever. The Internet content creators that involuntarily and unknowingly contributed are essentially being denied their rightful credit.
This might be characterized as atrocious and outrageous. We just went through the unpacking of the sage wisdom that credit should be given where credit is due. In the case of generative AI, apparently not so. The longstanding and virtuous rule of thumb about credit seems to be callously violated.
Whoa, the retort goes, you are completely overstating and misstating the situation. Sure, the generative AI did examine content on the Internet. Sure, this abundantly was helpful as a part of the data training of the generative AI. Admittedly, the impressive generative AI apps today wouldn’t be as impressive without this considered approach. But you have gone a bridge too far when saying that the content creators should be allotted any particular semblance of credit.
The logic is as follows. Humans go out to the Internet and learn stuff from the Internet, doing so routinely and without any fuss per se. A person that reads blogs about plumbing and then binge-watches freely available plumbing-fixing videos might the next day go out and get work as a plumber. Do they need to give a portion of their plumbing-related remittance to the blogger that wrote about how to plumb a sink? Do they need to give a fee over to the vlogger that made the video showcasing the steps to fix a leaky bathtub?
Almost certainly not.
The data training of the generative AI is merely a means of developing patterns. As long as the outputs from generative AI are not mere regurgitation of precisely what was examined, you could persuasively argue that they have “learned” and therefore are not subject to granting any specific credit to any specific source. Unless you can catch the generative AI in performing an exact regurgitation, the indications are that the AI has generalized beyond any particular source.
No credit is due to anyone. Or, one supposes, you could say that credit goes to everyone. The collective text and other content of humankind that is found on the Internet gets the credit. We all get the credit. Trying to pinpoint credit to a particular source is senseless. Be joyous that AI is being advanced and that humanity all told will benefit. Those postings on the Internet ought to feel honored that they contributed to a future of advances in AI and how this will aid humankind for eternity.
I’ll have more to say about both of those contrasting views.
Meanwhile, do you lean toward the camp that says credit is due and belatedly overdue for those that have websites on the Internet, or do you find that the opposing side that says Internet content creators are decidedly not getting ripped off is a more cogent posture?
An enigma and a riddle all jammed together.
Let’s unpack this.
In today’s column, I will be addressing these expressed worries that generative AI is essentially plagiarizing or possibly infringing on the copyrights of content that has been posted on the Internet (considered an Intellectual Property right or IP issue). We will look at the basis for these qualms. I will be occasionally referring to ChatGPT during this discussion since it is the 800-pound gorilla of generative AI, though do keep in mind that there are plenty of other generative AI apps and they generally are based on the same overall principles.
Meanwhile, you might be wondering what in fact generative AI is.
Let’s first cover the fundamentals of generative AI and then we can take a close look at the pressing matter at hand.
Into all of this comes a slew of AI Ethics and AI Law considerations.
Please be aware that there are ongoing efforts to imbue Ethical AI principles into the development and fielding of AI apps. A growing contingent of concerned and earnest AI ethicists are trying to ensure that efforts to devise and adopt AI take into account a view of doing AI For Good and averting AI For Bad. Likewise, there are proposed new AI laws that are being bandied around as potential solutions to keep AI endeavors from going amok on human rights and the like. For my ongoing and extensive coverage of AI Ethics and AI Law, see the link here and the link here, just to name a few.
The development and promulgation of Ethical AI precepts are being pursued to hopefully prevent society from falling into a myriad of AI-induced traps. For my coverage of the UN AI Ethics principles as devised and supported by nearly 200 countries via the efforts of UNESCO, see the link here. In a similar vein, new AI laws are being explored to try and keep AI on an even keel. One of the latest takes consists of a proposed AI Bill of Rights that the U.S. White House recently released to identify human rights in an age of AI, see the link here. It takes a village to keep AI and AI developers on a rightful path and deter the purposeful or accidental underhanded efforts that might undercut society.
I’ll be interweaving AI Ethics and AI Law related considerations into this discussion.
Fundamentals Of Generative AI
The most widely known instance of generative AI is represented by an AI app named ChatGPT. ChatGPT sprang into the public consciousness back in November 2022 when it was released by the AI research firm OpenAI. Ever since, ChatGPT has garnered outsized headlines and astonishingly exceeded its allotted fifteen minutes of fame.
I’m guessing you’ve probably heard of ChatGPT or maybe even know someone that has used it.
ChatGPT is considered a generative AI application because it takes as input some text from a user and then generates or produces an output that consists of an essay. The AI is a text-to-text generator, though I describe the AI as being a text-to-essay generator since that more readily clarifies what it is commonly used for. You can use generative AI to compose lengthy compositions or you can get it to proffer rather short pithy comments. It’s all at your bidding.
All you need to do is enter a prompt and the AI app will generate for you an essay that attempts to respond to your prompt. The composed text will seem as though the essay was written by the human hand and mind. If you were to enter a prompt that said “Tell me about Abraham Lincoln” the generative AI will provide you with an essay about Lincoln. There are other modes of generative AI, such as text-to-art and text-to-video. I’ll be focusing herein on the text-to-text variation.
Your first thought might be that this generative capability does not seem like such a big deal in terms of producing essays. You can easily do an online search of the Internet and readily find tons and tons of essays about President Lincoln. The kicker in the case of generative AI is that the generated essay is relatively unique and provides an original composition rather than a copycat. If you were to try and find the AI-produced essay online someplace, you would be unlikely to discover it.
Generative AI is pre-trained and makes use of a complex mathematical and computational formulation that has been set up by examining patterns in written words and stories across the web. As a result of examining thousands and millions of written passages, the AI can spew out new essays and stories that are a mishmash of what was found. By adding in various probabilistic functionality, the resulting text is pretty much unique in comparison to what has been used in the training set.
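To make the idea of pattern-matching plus probabilistic word selection a bit more tangible, here is a minimal sketch in Python. It is a toy, not how ChatGPT or any real generative AI app is built: it assumes a hypothetical three-passage "training set", records which word tends to follow which, and then picks each next word at random from those observed patterns. All names in it (train_bigrams, generate, passages) are illustrative assumptions.

```python
import random
from collections import defaultdict


def train_bigrams(passages):
    """Record, for each word, the words observed to follow it."""
    followers = defaultdict(list)
    for passage in passages:
        words = passage.lower().split()
        for current, nxt in zip(words, words[1:]):
            followers[current].append(nxt)
    return followers


def generate(followers, start, max_words=12, seed=None):
    """Walk the learned patterns, picking each next word at random."""
    rng = random.Random(seed)
    output = [start]
    # Stop at the length limit or when the last word has no known follower.
    while len(output) < max_words and followers.get(output[-1]):
        output.append(rng.choice(followers[output[-1]]))
    return " ".join(output)


# Hypothetical, deliberately tiny "training set" (an assumption for illustration)
passages = [
    "lincoln was the sixteenth president of the united states",
    "lincoln delivered the gettysburg address in 1863",
    "the president delivered a speech about the union",
]

model = train_bigrams(passages)
print(generate(model, "lincoln", seed=1))
```

Even in this tiny toy, the output can stitch together fragments from different passages into a sentence that appears in none of them, which is roughly the "mishmash" effect described above, at a vastly smaller scale.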
There are numerous concerns about generative AI.
One crucial downside is that the essays produced by a generative-based AI app can have various falsehoods embedded, including manifestly untrue facts, facts that are misleadingly portrayed, and apparent facts that are entirely fabricated. Those fabricated aspects are often referred to as a form of AI hallucinations, a catchphrase that I disfavor but lamentably seems to be gaining popular traction anyway (for my detailed explanation about why this is lousy and unsuitable terminology, see my coverage at the link here).
Another concern is that humans can readily take credit for a generative AI-produced essay, despite not having composed the essay themselves. You might have heard that teachers and schools are quite concerned about the emergence of generative AI apps. Students can potentially use generative AI to write their assigned essays. If a student claims that an essay was written by their own hand, there is little chance of the teacher being able to discern whether it was instead forged by generative AI. For my analysis of this student and teacher confounding facet, see my coverage at the link here and the link here.
There have been some zany outsized claims on social media about Generative AI asserting that this latest version of AI is in fact sentient AI (nope, they are wrong!). Those in AI Ethics and AI Law are notably worried about this burgeoning trend of outstretched claims. You might politely say that some people are overstating what today’s AI can actually do. They assume that AI has capabilities that we haven’t yet been able to achieve. That’s unfortunate. Worse still, they can allow themselves and others to get into dire situations because of an assumption that the AI will be sentient or human-like in being able to take action.
Do not anthropomorphize AI.
Doing so will get you caught in a sticky and dour reliance trap of expecting the AI to do things it is unable to perform. With that being said, the latest in generative AI is relatively impressive for what it can do. Be aware though that there are significant limitations that you ought to continually keep in mind when using any generative AI app.
One final forewarning for now.
Whatever you see or read in a generative AI response that seems to be conveyed as purely factual (dates, places, people, etc.), make sure to remain skeptical and be willing to double-check what you see.
Yes, dates can be concocted, places can be made up, and elements that we usually expect to be above reproach are all subject to suspicion. Do not believe what you read and keep a skeptical eye when examining any generative AI essays or outputs. If a generative AI app tells you that Abraham Lincoln flew around the country in his private jet, you would undoubtedly know that this is malarkey. Unfortunately, some people might not realize that jets weren’t around in his day, or they might know but fail to notice that the essay makes this brazen and outrageously false claim.
A strong dose of healthy skepticism and a persistent mindset of disbelief will be your best asset when using generative AI.
We are ready to move into the next stage of this elucidation.
The Internet And Generative AI Are In This Together
Now that you have a semblance of what generative AI is, we can explore the vexing question of whether generative AI is fairly or unfairly “leveraging”, or some would say blatantly exploiting, Internet content.
Here are my four vital topics pertinent to this matter:
- 1) Double Trouble: Plagiarism And Copyright Infringement
- 2) Trying To Prove Plagiarism Or Copyright Infringement Will Be Trying
- 3) Making The Case For Plagiarism Or Copyright Infringement
- 4) Legal Landmines Await
I will cover each of these important topics and proffer insightful considerations that we all ought to be mindfully mulling over. Each of these topics is an integral part of a larger puzzle. You can’t look at just one piece. Nor can you look at any piece in isolation from the other pieces.
This is an intricate mosaic and the whole puzzle has to be given proper harmonious consideration.
Double Trouble: Plagiarism And Copyright Infringement
The double trouble facing those that make and field generative AI is that their wares might be doing two bad things:
- 1) Plagiarism. The generative AI could be construed as plagiarizing content that exists on the Internet as per the Internet scanning that took place during data training of the AI.
- 2) Copyright Infringement. The generative AI could be claimed as undertaking copyright infringement associated with the Internet content that was scanned during data training.
To clarify, there is a lot more content on the Internet than is typically scanned for the data training of generative AI. Only a tiny fraction of the Internet is usually employed. Thus, we can presumably conclude that the owners of any content that wasn’t scanned during data training have no particular beef with generative AI.
This is somewhat debatable though since you could potentially draw a line that connects other content that was scanned with the content that wasn’t scanned. Also, another important proviso is that even if there is content that wasn’t scanned, it could still be argued as being plagiarized and/or copyright infringed if the outputs of the generative AI perchance land on the same verbiage. My point is that there is a lot of squishiness in all of this.
Bottom line: Generative AI is rife with potential AI Ethics and AI Law conundrums when it comes to the plagiarism and copyright infringement questions underpinning the prevailing data training practices.
So far, AI makers and AI researchers have skated through this pretty much scot-free, despite the looming and precariously dangling sword that hangs above them. Only a few lawsuits have been launched to date against these practices. You might have heard or seen news articles about such legal actions. One, for example, targets the text-to-image firms Midjourney and Stability AI for allegedly infringing on artistic content posted on the Internet. Another one entails a text-to-code infringement claim against GitHub, Microsoft, and OpenAI over the Copilot code-generating software. Getty Images has also been aiming to go after Stability AI for text-to-image infringement.
You can anticipate that more such lawsuits are going to be filed.
Right now, it is a bit chancy to launch those lawsuits since the outcome is relatively unknown. Will the court side with the AI makers or will those that believe their content was unfairly exploited be the victors? A costly legal battle is always a serious matter. Expending the large-scale legal costs has to be weighed against the chances of winning or losing.
The AI makers would seem to have almost no choice but to put up a fight. If they were to cave in, even a little bit, the odds are that a torrent of additional lawsuits would result (essentially, opening the door to heightened chances of others prevailing too). Once there is legal blood in the water, the remaining legal sharks will scurry to the considered “easy score” and a thrashing and battering monetary bloodbath would surely occur.
Some believe that we should pass new AI laws that would protect the AI makers. The protection might even be retroactive. The basis for this is that if we want to see generative AI advancements, we have to give the AI makers some safe zone runway. Once lawsuits start to score victories against the AI makers, if that occurs (we don’t know yet), the worry is that generative AI will evaporate as no one will be willing to put any backing to the AI firms.
As ably pointed out in a recent Bloomberg Law piece entitled “ChatGPT: IP, Cybersecurity & Other Legal Risks of Generative AI” by Dr. Ilia Kolochenko and Gordon Platt, Bloomberg Law, February 2023, here are two vital excerpts echoing these viewpoints:
- “A heated debate now rages among US legal scholars and IP law professors about whether the unauthorized scraping and subsequent usage of copyrighted data amount to a copyright infringement. If the view of legal practitioners who see copyright violations in such practice prevails, users of such AI systems may also be liable for secondary infringement and potentially face legal ramifications.”
- “To comprehensively address the challenge, lawmakers should consider not just modernizing the existing copyright legislation, but also implementing a set of AI-specific laws and regulations.”
Recall that as a society we did put in place legal protections for the expansion of the Internet, as witnessed now by the Supreme Court reviewing the famous or infamous Section 230. Thus, it seems within reason and precedent that we might be willing to devise akin protections for the advancement of generative AI. Perhaps the protections could be set up temporarily, expiring after generative AI has reached some pre-determined level of proficiency. Other safeguard provisions could be devised.
I’ll soon be posting my analysis of how the Supreme Court assessment and ultimate ruling on Section 230 might impact the advent of generative AI. Be on the lookout for that upcoming posting!
Back to the stridently voiced opinion that we ought to give leeway for the societal awe-inspiring technological innovation known as generative AI. Some would say that even if the claimed copyright infringement has occurred or is occurring, society as a whole ought to be willing to allow this for the specific purposes of advancing generative AI.
The hope is that new AI laws would be carefully crafted and tuned to the particulars associated with data training for generative AI.
There are plenty of counterarguments to this notion of devising new AI laws for this purpose. One concern is that any such new AI law will open the floodgates for all manner of copyright infringement. We will rue the day that we allowed such new AI laws to land on the books. No matter how hard you try to confine this to just AI data training, others will sneakily or cleverly find loopholes that will amount to unfettered and rampant copyright infringement.
Round and round the arguments go.
One argument that doesn’t particularly hold water has to do with trying to sue the AI itself. Notice that I have been referring to the AI maker or the AI researchers as the culpable stakeholders. These are people and companies. Some suggest that we should target AI as the party to be sued. I’ve discussed at length in my column that we do not as yet attribute legal personhood to AI, see the link here for example, and thus such lawsuits aimed at AI per se would be considered senseless right now.
As an addendum to the question of who or what should be sued, this brings up another juicy topic.
Assume that a particular generative AI app is devised by some AI maker that we’ll call the Widget Company. Widget Company is relatively small in size and doesn’t have much revenue, nor much in the way of assets. Suing them is not likely to garner the grand riches that one might be seeking. At most, you would merely have the satisfaction of righting what you perceive as wrong.
You want to go after the big fish.
Here’s how that is going to arise. An AI maker opts to make their generative AI available to Big Time Company, a major conglomerate with tons of dough and tons of assets. A lawsuit naming the Widget Company would now have a better target in view, namely by also naming Big Time Company. This is a David and Goliath fight that lawyers would relish. Of course, the Big Time Company will undoubtedly try to wiggle off of the fishing hook. Whether they can do so is once again a legal question that is uncertain, and they might get hopelessly mired in the muck.
Before we get much further on this, I’d like to get something crucial on the table about the contended encroachments of generative AI due to data training. I’m sure you intuitively realize that plagiarism and copyright infringement are two somewhat different beasts. They have much in common, though they also significantly differ.
Here’s a handily succinct description from Duke University that explains the two:
- “Plagiarism is best defined as the unacknowledged use of another person’s work. It is an ethical issue involving a claim of credit for work that the claimant did not create. One can plagiarize someone else’s work regardless of the copyright status of that work. For example, it is nonetheless plagiarism to copy from a book or article that is too old to still be under copyright. It is also plagiarism to use data taken from an unacknowledged source, even though factual material like data may not be protected by copyright. Plagiarism, however, is easily cured – proper citation to the original source of the material.”
- “Copyright infringement, on the other hand, is the unauthorized use of another’s work. This is a legal issue that depends on whether or not the work is protected by copyright in the first place, as well as on specifics like how much is used and the purpose of the use. If one copies too much of a protected work, or copies for an unauthorized purpose, simply acknowledging the original source will not solve the problem. Only by seeking prior permission from the copyright holder does one avoid the risk of an infringement charge.”
I point out the importance of these two concerns so that you’ll realize that remedies can differ accordingly. Also, they are both enmeshed in considerations permeating AI Ethics and AI Law, making them equally worthwhile to examine.
Let’s explore a claimed remedy or solution. You’ll see that it might aid one of the double trouble issues, but not the other.
Some have insisted that all the AI makers have to do is cite their sources. When generative AI produces an essay, merely include specific citations for whatever is stated in the essay. Give various URLs and other indications of which Internet content was used. This would seem to get them free of qualms about plagiarism. The outputted essay would presumably clearly identify what sources were used for the wording being produced.
There are some quibbles with that claimed solution, but at a 30,000-foot level let’s say that it does serve as a semi-satisfactory cure for the plagiarism dilemma. As stated above in the explanation of copyright infringement, though, the citing of source material does not necessarily get you out of the doghouse. Assuming that the content was copyrighted, and depending upon other factors such as how much of the material was used, the awaiting sword of copyright infringement can swing down sharply and with finality.
Double trouble is the watchword here.
Trying To Prove Plagiarism Or Copyright Infringement Will Be Trying
“Prove it” is the well-worn refrain that we all have heard at various times in our lives.
You know how it goes. You might claim that something is happening or has happened. You might know in your heart of hearts that this has taken place. But when push comes to shove, you have to have the proof.
In today’s parlance, you need to show the receipts, as they say.
My question for you is this: How are we going to demonstrably prove that generative AI has inappropriately exploited Internet content?
One supposes that the answer should be easy. You ask or tell the generative AI to produce an outputted essay. You then take the essay and compare it to what can be found on the Internet. If you find the essay, bam, you’ve got the generative AI nailed to the proverbial wall.
Life seems never to be quite so easy.
Envision that we get generative AI to produce an essay that contains about 100 words. We go around and try to reach all nooks and corners of the Internet, searching for those 100 words. If we find the 100 words, shown in the same exact order and an identical fashion, we seem to have caught ourselves a hot one.
Suppose though that we find on the Internet a seemingly “comparable” essay though it only matches 80 of the 100 words. This seems still sufficient, perhaps. But imagine that we find only an instance of 10 words of the 100 that match. Is that enough to clamor that either plagiarism has occurred or that copyright infringement has occurred?
Text is funny that way.
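The word-matching comparison described above can be sketched in a few lines of Python. This is only an illustrative assumption of how one might count matches, using the standard library’s difflib.SequenceMatcher; the function name matched_word_fraction and the min_run cutoff are hypothetical choices, and nothing here is a legal test for plagiarism or infringement.

```python
from difflib import SequenceMatcher


def matched_word_fraction(generated, candidate, min_run=3):
    """Fraction of the generated words that sit inside a run of at least
    `min_run` consecutive words that also appears in the candidate text."""
    gen_words = generated.lower().split()
    cand_words = candidate.lower().split()
    if not gen_words:
        return 0.0
    matcher = SequenceMatcher(None, gen_words, cand_words, autojunk=False)
    # Ignore short runs so that stray shared words do not count as a match.
    matched = sum(
        block.size
        for block in matcher.get_matching_blocks()
        if block.size >= min_run
    )
    return matched / len(gen_words)


# Hypothetical texts, invented purely for illustration
generated = "the quick brown fox jumps over the lazy dog near the river bank"
candidate = "a quick brown fox jumps over a sleeping cat near the river bank today"

print(f"{matched_word_fraction(generated, candidate):.0%} of the words matched")
```

The hard part is not computing the number. It is deciding what number, if any, is enough, and that question is exactly where the squishiness arises.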
Compare this to the text-to-image or text-to-art circumstances. When generative AI provides a text-to-image or text-to-art capability, you enter a text prompt and the AI app produces an image based somewhat on the prompt that you provided. The image might be unlike any image that has ever been seen on this or any other planet.
On the other hand, the image might be reminiscent of other images that do exist. We can look at the generative AI-produced image and somewhat by gut instinct say that it sure looks like some other image that we have seen before. Generally, the visual aspects of compare and contrast are a bit more readily undertaken. That being said, please know that huge legal debates ensue over what constitutes the overlap or replication of one image from another.
Another similar situation exists with music. There are generative AI apps that allow you to enter a text prompt and the output produced by the AI is audio music. These text-to-audio or text-to-music AI capabilities are just now starting to emerge. One thing you can bet your top dollar on is that the music produced by generative AI is going to get highly scrutinized for infringement. We seem to know when we hear musical infringement, though again this is a complex legal issue that isn’t just based on how we feel about the perceived replication.
Allow me one more example.
Text-to-code generative AI provides you the ability to enter a text prompt and the AI will produce programming code for you. You can then use this code for preparing a computer program. You might use the code exactly as generated, or you might opt to edit and adjust the code to suit your needs. There is also a need to make sure that the code is apt and workable since it is possible that errors and falsehoods can arise in the generated code.
Your first assumption might be that programming code is no different than text. It is just text. Sure, it is a text that provides a particular purpose, but it is still text.
Well, not exactly. Most programming languages have a strict format and structure to the nature of the coding statements of that language. This in a sense is much narrower than free-flowing natural language. You are somewhat boxed in as to how the coding statements are formulated. Likewise, the sequence and way in which the statements are utilized and arrayed are somewhat boxed in.
All in all, showing that programming code was plagiarized or infringed is arguably easier than doing so for natural language. Thus, when a generative AI scans programming code on the Internet and later generates programming code, the argument that the code was blatantly replicated is going to be relatively more convincing. It is not a slam dunk, so expect bitter battles to be waged on this.
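To illustrate why the strict structure of code can make replication easier to argue, here is a minimal sketch that uses Python’s standard tokenize module. The approach is my own illustrative assumption, not a method used in any of the lawsuits: it strips away names, comments, and literal values so that only the structural skeleton of the code remains, and then compares skeletons. The helper name structural_tokens and both code samples are hypothetical.

```python
import io
import keyword
import tokenize

# Token types that carry no structural information for this rough comparison
SKIPPED = {
    tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
    tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
}


def structural_tokens(source):
    """Reduce Python source to its skeleton: keywords and operators are
    kept, while names and literals are replaced by placeholders."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            tokens.append("ID")    # any variable or function name
        elif tok.type == tokenize.NUMBER:
            tokens.append("NUM")   # any numeric literal
        elif tok.type == tokenize.STRING:
            tokens.append("STR")   # any string literal
        else:
            tokens.append(tok.string)
    return tokens


# Two hypothetical snippets: same logic, different names and comments
posted_online = (
    "def total(prices):\n"
    "    result = 0\n"
    "    for p in prices:\n"
    "        result += p\n"
    "    return result\n"
)
generated_later = (
    "def add_up(values):\n"
    "    acc = 0  # running sum\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)

print(structural_tokens(posted_online) == structural_tokens(generated_later))
```

Renaming the variables does nothing to hide the match here, whereas swapping in synonyms in a natural language essay would readily defeat a word-for-word comparison. Of course, a short and commonplace routine like this one could easily be written independently by many programmers, which is part of why this is not a slam dunk.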
My overarching point is that we are going to have the same AI Ethics and AI Law issues confronting all modes of generative AI.
Plagiarism and copyright infringement will be problematic for:
- Text-to-text or text-to-essay
- Text-to-image or text-to-art
- Text-to-audio or text-to-music
- Text-to-code
They are all subject to the same concerns. Some might be a bit easier to “prove” than others. All of them are going to have their own variety of AI Ethics and AI Law nightmares.
Making The Case For Plagiarism Or Copyright Infringement
For discussion purposes, let’s focus on text-to-text or text-to-essay generative AI. I do so partially because of the tremendous popularity of ChatGPT, which is a text-to-text type of generative AI. There are a lot of people using ChatGPT, along with many others using various similar text-to-text generative AI apps.
Do those people that are using generative AI apps know that they are potentially relying upon plagiarism or copyright infringement?
It seems doubtful that they do.
I would dare say that the prevailing assumption is that if the generative AI app is available for use, the AI maker or the company that has fielded the AI must know or be confident that there is nothing untoward about the wares they are proffering for use. If you can use it, it must be aboveboard.
Let’s revisit my earlier comment about how we are going to try and prove that a particular generative AI is working on a wrongful basis as to the data training.
I might also add that if we can catch one generative AI doing so, the chances of nabbing the others are likely to be enhanced. I am not saying that all generative AI apps would be in the same boat. But they are going to find themselves in rather harsh seas once one of them is pinned to the wall.
That is also why it will be immensely worthwhile to keep an eye on the existing lawsuits. The first one that wins on the claimed infringement, if this occurs, will possibly spell doom and gloom for the other generative AI apps, unless some narrowness in the ruling sidesteps the broader issues at hand. A lawsuit that loses on the claimed infringement does not necessarily mean that the generative AI apps can ring bells and celebrate. It could be that the loss is attributed to factors that aren’t as relevant to the other generative AI apps, and so on.
I had mentioned that if we take a 100-word essay and try to find those exact words in the exact same sequence on the Internet, we might have a relatively solid case for plagiarism or copyright infringement, all else being equal. But if the number of words that matched is low, we would seem to be on thin ice.
I’d like to dig deeper into that.
An obvious aspect of making a comparison consists of the exact same words in the exact same sequence. This might occur for entire passages. This would be convenient to spot, almost like being handed to us on a silver platter.
We might also be suspicious if only a snippet of words matched. The idea would be to see if they are crucial words or maybe filler words that we can readily remove or ignore. We also don’t want to be tricked by the use of words in their past or future tense, or other tomfoolery. Those variations in words should also be considered.
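To make those first two levels of comparison a bit more tangible, here is a minimal Python sketch. It is only an illustration under my own assumptions, not how any commercial plagiarism checker works: the filler-word list, the crude suffix stripping for tense variations, and the function names are all made up for this example. It drops filler words, trims common word endings, and then reports what fraction of an essay’s three-word sequences also appear in a comparison source.

```python
import re

# Hypothetical filler-word list, for illustration only.
FILLER_WORDS = {"a", "an", "the", "of", "to", "and", "that", "is", "was", "will"}


def normalize(word):
    """Crudely strip common endings so 'copied' and 'copies' compare alike."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def content_words(text):
    """Lowercase the text, drop filler words, and normalize what remains."""
    words = re.findall(r"[a-z']+", text.lower())
    return [normalize(w) for w in words if w not in FILLER_WORDS]


def ngrams(words, n):
    """Return the set of n-word sequences in a list of words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_sequence_ratio(candidate, source, n=3):
    """Fraction of the candidate's n-word sequences that also appear in the source."""
    cand = ngrams(content_words(candidate), n)
    if not cand:
        return 0.0
    src = ngrams(content_words(source), n)
    return len(cand & src) / len(cand)
```

A ratio near 1.0 suggests the same words in the same sequence, while a low ratio puts us on the thin ice mentioned earlier. The suffix stripping is deliberately simplistic and will miss many word variations, so any score from a sketch like this deserves the same grain of salt as any other automated check.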
Another level of comparison would be when the words are not the same words to any great extent, yet even in their varied state they still seem to be making the same points. For example, a summary will often avoid the exact wording of an original source, yet we can discern that the summary is predicated on that source.
The hardest level of comparison would be based on concepts or ideas. Suppose that we see an essay that doesn’t have the same or similar words as a comparison base, but the essence or ideas are the same. We are admittedly edging into rough territory. If we readily were to say that ideas are closely protected, we would put a lid on almost all forms of knowledge and knowledge enlargement.
We can once again refer to a handy explanation from Duke University:
- “Copyright does not protect ideas, only the specific expression of an idea. For example, a court decided that Dan Brown did not infringe the copyright of an earlier book when he wrote The Da Vinci Code because all he borrowed from the earlier work were the basic ideas, not the specifics of plot or dialogue. Since copyright is intended to encourage creative production, using someone else’s ideas to craft a new and original work upholds the purpose of copyright, it does not violate it. Only if one copies another’s expression without permission is copyright potentially infringed.”
- “To avoid plagiarism, on the other hand, one must acknowledge the source even of ideas that are borrowed from someone else, regardless of whether the expression of those ideas is borrowed with them. Thus, a paraphrase requires citation, even though it seldom raises any copyright problem.”
Please note, as identified earlier, the differences between these two facets of the double trouble.
Now then, putting the comparison approaches into practice is something that has been taking place for many years. Think of it this way. Students that write essays for their schoolwork might be tempted to grab content from the Internet and pretend that they authored the A-grade Pulitzer Prize-winning words.
Teachers have been using plagiarism-checking programs for a long time to deal with this. A teacher takes a student’s essay and feeds it into the plagiarism checker. In some cases, an entire school will license the use of a plagiarism-checking program. Whenever students turn in an essay, they first have to send the essay to the plagiarism-checking program. The teacher is informed as to what the program reports.
Unfortunately, you have to be extremely cautious about what these plagiarism-checking programs have to say. It is important to mindfully assess whether the reported indications are valid. As already mentioned, the capability of ascertaining whether a work was copied can be hazy. If you thoughtlessly accept the outcome of the checking program, you can falsely accuse a student of copying when they did not do so. This can be soul-crushing.
Moving on, we can try to use plagiarism-checking programs in the realm of testing generative AI outputs. Treat the outputted essays from a generative AI app as though they were written by a student. We then gauge what the plagiarism checker says. This is done with a grain of salt.
There is a recent research study that attempted to operationalize these types of comparisons in the context of generative AI in this very fashion. I’d like to go over some interesting findings with you.
First, some added background is required. Generative AI is sometimes referred to as LLMs (large language models) or simply LMs (language models). Second, ChatGPT is based on a version of another OpenAI generative AI package called GPT-3.5. Before GPT-3.5, there was GPT-3, and before that was GPT-2. Nowadays, GPT-2 is considered rather primitive in comparison to the later series, and we are all eagerly awaiting the upcoming unveiling of GPT-4.
The research study that I want to briefly explore consisted of examining GPT-2. That’s important to realize since we are now well beyond the capabilities of GPT-2. Do not draw any rash conclusions from the results of this analysis of GPT-2. Nonetheless, we can learn a great deal from the assessment of GPT-2. The study is entitled “Do Language Models Plagiarize?” by Jooyoung Lee, Thai Le, Jinghui Chen, and Dongwon Lee, appearing in the ACM WWW ’23, May 1–5, 2023, Austin, TX, USA.
This is their main research question:
- “To what extent (not limited to memorization) do LMs exploit phrases or sentences from their training samples?”
They used these three levels or categories of potential plagiarism:
- “Verbatim plagiarism: Exact copies of words or phrases without transformation.”
- “Paraphrase plagiarism: Synonymous substitution, word reordering, and/or back translation.”
- “Idea plagiarism: Representation of core content in an elongated form.”
GPT-2 was indeed trained on Internet data and thus a suitable candidate for this type of analysis:
- “GPT-2 is pre-trained on WebText, containing over 8 million documents retrieved from 45 million Reddit links. Since OpenAI has not publicly released WebText, we use OpenWebText which is an open-source recreation of the WebText corpus. It has been reliably used by prior literature.”
Selective key findings as excerpted from the study consist of:
- “We discovered that pre-trained GPT-2 families do plagiarize from the OpenWebText.”
- “Our findings show that fine-tuning significantly reduces verbatim plagiarism cases from OpenWebText.”
- “Consistent with Carlini et al. and Carlini et al., we find that larger GPT-2 models (large and xl) generally generate plagiarized sequences more frequently than smaller ones.”
- “However, different LMs may demonstrate different patterns of plagiarism, and thus our results may not directly generalize to other LMs, including more recent LMs such as GPT-3 or BLOOM.”
- “In addition, automatic plagiarism detectors are known to have many failure modes (both in false negatives and false positives).”
- “Given that a majority of LMs’ training data is scraped from the Web without informing content owners, their reiteration of words, phrases, and even core ideas from training sets into generated texts has ethical implications.”
We definitely need a lot more studies of this kind.
If you are curious about how GPT-2 compares to GPT-3 concerning data training, there is quite a marked contrast.
According to reported indications, the data training for GPT-3 was much more extensive:
- “The model was trained using text databases from the internet. This included a whopping 570GB of data obtained from books, web texts, Wikipedia, articles, and other pieces of writing on the internet. To be even more exact, 300 billion words were fed into the system” (BBC Science Focus magazine, “ChatGPT: Everything you need to know about OpenAI’s GPT-3 tool” by Alex Hughes, February 2023).
For those of you interested in more in-depth descriptions of the data training for GPT-3, here’s an excerpt from the official GPT-3 Model Card posted on GitHub (last updated date listed as September 2020):
- “The GPT-3 training dataset is composed of text posted to the internet, or of text uploaded to the internet (e.g., books). The internet data that it has been trained on and evaluated against to date includes: (1) a version of the CommonCrawl dataset, filtered based on similarity to high-quality reference corpora, (2) an expanded version of the Webtext dataset, (3) two internet-based book corpora, and (4) English-language Wikipedia.”
- “Given its training data, GPT-3’s outputs and performance are more representative of internet-connected populations than those steeped in verbal, non-digital culture. The internet-connected population is more representative of developed countries, wealthy, younger, and male views, and is mostly U.S.-centric. Wealthier nations and populations in developed countries show higher internet penetration. The digital gender divide also shows fewer women represented online worldwide. Additionally, because different parts of the world have different levels of internet penetration and access, the dataset underrepresents less connected communities.”
One takeaway from the above indication about GPT-3 is a rule of thumb amongst those that make generative AI: the more Internet data you can scan, the better the odds of improving or advancing the generative AI.
You can look at this in either of two ways.
- 1) Improved AI. We are going to have generative AI that crawls across as much of the Internet as possible. The exciting outcome is that the generative AI will be better than it already is. That’s something to be looking forward to.
- 2) Copying Potential Galore. This widening of scanning the Internet is obnoxiously making the plagiarism and copyright infringement problem potentially bigger and bigger. Whereas before there weren’t as many content creators impacted, the number is going to blossom. If you are a lawyer on the side of the content creators, this brings tears to your eyes (maybe tears of dismay, or tears of joy at what prospects this brings in terms of lawsuits).
Is the glass half-full or half-empty?
Legal Landmines Await
A question that you might be mulling over is whether your posted Internet content is considered fair game for being scanned. If your content is behind a paywall, presumably it is not a target for being scanned because it cannot be readily reached, depending upon the strength of the paywall.
I would guess that most everyday people do not have their content tucked away behind a paywall. They want their content to be publicly available. They assume that people will take a look at it.
Does having your content publicly available also axiomatically mean that you are approving it to be scanned for use by generative AI that is being data trained?
Maybe yes, maybe no.
It is one of those roll-your-eyes legal matters.
Returning to the earlier cited Bloomberg Law article, the authors mention the importance of the Terms and Conditions (T&C) associated with many websites:
- “The legal landmine—vastly ignored by unwitting AI companies that operate online bots for data scraping—is hidden in Terms and Conditions commonly available on public websites of all types. In contrast to the currently unsettled IP law and the copyright infringement dilemma, a website’s Terms and Conditions are backed by well-established contract law and usually can be enforced in court relying on sufficient number of precedents.”
They indicate that assuming your website has a licensing-related page, the chances are that if you used a standardized modern-day template, it might contain a crucial clause:
- “Consequently, most boilerplate Terms and Conditions for websites—abundantly available in free access—contain a clause prohibiting automated data scraping. Ironically, such freely available templates have possibly been used for ChatGPT training. Therefore, content owners may wish to review their Terms and Conditions and insert a separate clause flatly prohibiting all usage of any content from the websites for AI training or any related purposes, whether collected manually or automatically, without a prior written permission of the website owner.”
An added kicker is included in their analysis of potential actions for content creators to take about their websites:
- “Therefore, inserting an enforceable liquidated damages provision for each violation of the no-scraping clause, enhanced with an injunction-without-bond provision, can be a tenable solution for those authors of creative content who are not keen to provide the fruits of their intellectual labor for AI training purposes without being paid for it or, at least, given a proper credit for their work.”
You might want to consult your attorney about this.
Some say that this is a vital way to try and tell the AI makers that content creators are profusely serious about protecting their content. Making sure your licensing has the proper wording would seem to put the AI makers on notice.
Others though are a bit downbeat. They dejectedly say that you can proceed to put the harshest and most lethal of legal language on your website, but in the end, the AI makers are going to scan it. You will not know they did so. You will have a devil of a time proving that they did. You are unlikely to discover that their outputs reflect your content. It is an uphill battle that you aren’t going to win.
The counterargument is that you are surrendering the battle before it is even waged. If you don’t at least have sufficient legal language and you ever do catch them, they will wiggle and weasel their way out of any responsibility. All because you didn’t post the right kind of legal lingo.
Meanwhile, another approach that is seeking to gain traction would consist of marking your website with something that says the site is not to be scanned by generative AI. The idea is that a standardized marker would be devised. Websites could presumably add the marker to their site. AI makers would be told that they should alter their data scanning to skip over the marked websites.
Can a marker approach be successful? Concerns include the costs to obtain and post the markers, along with whether the AI makers will abide by the markers and ensure that they avoid scanning the marked sites. Another perspective is that even if the AI makers don’t go along with the markings, this provides another telltale clue for going to court and arguing that the content creator went the extra mile to try and warn against the AI scanning.
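For a rough sense of what honoring such a marker could look like on the AI maker’s side, consider this small Python sketch. Please note the assumptions: no standardized generative AI marker is being described here, so I am using the long-standing robots.txt convention as a stand-in, and the crawler name ExampleAIBot, the rules, and the may_scan helper are all hypothetical. The parsing itself relies on the RobotFileParser class from Python’s standard library.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler name and rules, used only for illustration.
AI_CRAWLER = "ExampleAIBot"
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def may_scan(robots_txt, user_agent, url):
    """Return True if the given rules permit this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# The marked site turns away the hypothetical AI crawler but not other visitors.
print(may_scan(ROBOTS_TXT, AI_CRAWLER, "https://example.com/essay.html"))
print(may_scan(ROBOTS_TXT, "SomeOtherBot", "https://example.com/essay.html"))
```

The crux is that a check like this runs inside the scanner, entirely at the discretion of whoever operates it. That is precisely why the question of whether AI makers will abide by the markers looms so large.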
Yikes, it all makes your head spin.
A few final remarks on this thorny topic.
Are you ready for a mind-bending perspective on this whole AI as a plagiarizer and copyright infringer dilemma?
Much of the assumption about “catching” generative AI in the act of plagiarism or copyright infringement hinges on discovering outputs that highly resemble prior works such as the content on the Internet that was potentially scanned during data training.
Suppose though that a divide-and-conquer ploy is at play here.
Here’s what I mean.
If the generative AI borrows a tiny bit from here and a teensy bit from there, ultimately mixing them together into producing any particular output, the chances of being able to have a gotcha moment are tremendously lessened. Any output will not seemingly rise to a sufficient threshold that you could say for certain that it was copped from one particular source item. The resultant essay or other modes of output will only fractionally be matchable. And by the usual approach of trying to argue that plagiarism or copyright infringement has occurred, you usually have to showcase more than some teeny tiny bit is at play, especially if the morsel is not a standout and can be found widely across the Internet (undercutting any adequate burden of proof of misappropriation).
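A toy Python sketch can show why the gotcha moment gets harder. To be clear, this is not a claim about how any generative AI actually composes its outputs; the mosaic_report function and the sample sentences are invented purely for illustration. It measures what share of an output’s three-word sequences can be traced to the single best-matching source versus to all of the sources taken together.

```python
def word_ngrams(text, n=3):
    """Return the set of n-word sequences in a piece of text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def mosaic_report(output, sources, n=3):
    """Return (largest single-source share, combined share) of the output's sequences."""
    out = word_ngrams(output, n)
    if not out:
        return 0.0, 0.0
    matches = [out & word_ngrams(source, n) for source in sources]
    largest_single = max((len(m) for m in matches), default=0) / len(out)
    combined = len(set().union(*matches)) / len(out)
    return largest_single, combined


# Invented example: an output stitched together from three short sources.
sources = ["the quick brown fox", "jumps over the lazy", "dog near the river"]
output = "the quick brown fox jumps over the lazy dog near the river"
print(mosaic_report(output, sources))
```

In this invented example, no single source accounts for more than a fifth of the output’s word sequences, yet the three sources together account for well over half. Spread that across thousands or millions of sources and each individual contribution shrinks toward the immaterial.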
Can you still persuasively declare that the data training by generative AI has ripped off websites and content creators even if the suggested proof is an ostensibly immaterial proportion?
Think about that.
If we are potentially facing plagiarism at scale and copyright infringement at scale, we might need to alter our approach to defining what constitutes plagiarism and/or copyright infringement. Perhaps there is a case to be made for plagiarism or copyright infringement in the main or at large. A mosaic consisting of thousands or millions of minuscule snippets could be construed as committing such violations. The apparent trouble though is that this can make all manner of content suddenly come under an umbrella of breaches. This could be a slippery slope.
Speaking of hefty thoughts, Leo Tolstoy, the legendary writer, famously stated: “The sole meaning of life is to serve humanity.”
If your website and the websites of others are being scanned for the betterment of AI, and even though you aren’t getting a single penny for it, might you have solemn solace in the ardent belief that you are contributing to the future of humanity? It seems a small price to pay.
Well, unless AI turns out to be the dreaded existential risk that wipes all humans from existence. You ought to not take credit for that. I assume you would just as soon not be contributing to that dire outcome. Putting aside that calamitous prediction, you might be thinking that if the AI makers are making money from their generative AI, and they seem to be relishing the profiteering, you should be getting a piece of the pie too. Share and share alike. The AI makers should ask for permission to scan any website and then also negotiate a price to be paid for having been allowed to undertake the scan.
Give credit where credit is due.
Let’s give Sir Walter Scott the last word for now: “Oh, what a tangled web we weave, when first we practice to deceive.”
This may apply if you believe that deception is afoot, or perhaps doesn’t apply if you think that all is well and perfectly forthright and legitimate. Please do generously give yourself credit for thinking this over. You deserve it.