What if the content you publish on your website isn’t really your content?
A strange question, given that you wrote it and, under almost any country’s copyright laws, own that content from the moment you create it. Yet we live in a world that is quickly changing as a result of generative artificial intelligence (AI), and what we thought we knew about our content – and our ownership of it – is changing, too.
At least, that’s the belief of Microsoft’s AI CEO Mustafa Suleyman, who made some remarkable claims about web-published content in a June 2024 interview. His comments are worth exploring, not least because they may affect how you view and treat your content.
Suleyman’s Claims That Web Content Is Freeware
Suleyman’s comments came during an interview with CNBC at the Aspen Ideas Festival. He was quizzed about the difference between copyrighted materials that are explicitly protected by publishers – such as books – and web content that can be accessed more openly.
His response offers some insight into how the human minds behind these ever-developing AI tools think:
“With respect to content that is already on the open web, the social contract of that content since the ’90s has been that it is fair use. Anyone can copy it, recreate with it, reproduce with it. That has been freeware, if you like. That’s been the understanding.”
An interesting take.
One with which we don’t necessarily agree.
Later in his answer, Suleyman expanded on this claim by discussing how some web content falls into what he calls “a gray area.” Specifically, any content publisher that has explicitly told others they cannot scrape or crawl its content for purposes beyond search engine indexing may fall outside his claims of free use. That, he says, is an issue for the courts to settle.
What This Means from a Copyright Perspective
The worrying issue for you as either a web content creator or somebody who has unique copy on their website is that Suleyman is right, to an extent. When it comes to web copy and to what is protected and what isn’t, we are in a “gray area.”
Focusing on the United States, for instance, American law automatically protects creative works under copyright as long as those works are independently created and haven’t been copied from others. On the surface, that seems like it should apply to your web content creation. You wrote it. It relates to you, your business, or merely something that interests you. And functionally, there’s little difference between writing a blog post and writing a chapter in a book – both require you to sit down and flex your creative muscles.
However, copyright isn’t automatically applied to websites in the United States. The U.S. Copyright Office says as much, noting that “the original authorship appearing on a website may be protected by copyright.”
“May” is the keyword here.
Your website actually doesn’t benefit from the automatic protection granted to other creative works, meaning you have to go out of your way to file a copyright application to gain that protection.
Again, the U.S. Copyright Office makes this clear in supporting documentation for the application process. It says that a website, as a whole, “is not explicitly recognized as a type of copyrightable subject matter under the Copyright Act.” When you apply to copyright content published on your website – at least, in the U.S. – you can’t copyright it as website content. Instead, you classify a written piece as a “literary work,” and so on for images and other assets you’ve created.
The Social Contract Element
The worrying thing about all of this is that Suleyman is right to say that a social contract has existed for web content almost since the internet’s inception. You’re seeing this social contract in action as you read this article – we have made it available for you to consume without payment. You can quote, pull from, or use this article as inspiration for your own work. Worryingly for us – and for you with your own content – you could even pull portions of it wholesale and place them on your website without offering attribution.
However, we would have recourse if you did, thanks to the Digital Millennium Copyright Act (DMCA).
Enacted in 1998 to balance the interests of web-based copyright owners against those who wish to make fair use of the content those owners publish, the act gives website owners access to “takedown” tools. For instance, if you copy our content, we could send a DMCA request asking you to take the copied webpage down. The problem is that the DMCA is U.S.-only – you could get around it by taking our content and placing it on a website hosted outside of the U.S.
As Suleyman rightly says, it’s a “gray area.”
Coming back to the social contract aspect of freely accessible content, it’s also understood that web creators, including website content creation services, can use snippets of other people’s content under “Fair Use” rules. For instance, you can use a portion of somebody else’s content to provide commentary or criticism on it, as well as for research and teaching purposes. Again, you can see examples in this article – we have quoted multiple references already and are protected by Fair Use rules in doing so.
All of which brings us back to AI and how it uses web-based content.
Fair Use allows for the limited use of somebody else’s content, meaning you can’t just take it wholesale, do nothing with it, and pass it off as your own. That’s theft. And it’s here where we start to see Suleyman’s comments fall down.
A generative AI platform – such as ChatGPT – has a Large Language Model (LLM) behind it. This LLM is trained by being fed billions upon billions of words, most scraped from online sources, allowing it to get better at predicting what words it should write based on the context of a user’s request. The more articles fed into the LLM, the more “intelligent” it becomes. But at the most basic level, an AI company is essentially taking content from others and using it to create tools.
That action goes far beyond what Fair Use rules are intended to cover, particularly because the AI tool creators are using this content – yours likely included – to profit. You could argue that this is an abuse of the social contract Suleyman mentions, assuming you believe such a contract even exists. Microsoft’s AI head believes that his massive multinational company can just use others’ work without providing compensation of any sort to create a tool that will make his company more money.
The Legal Wranglings to Come
“Wrangling” is the appropriate word here, as the rise of AI has cast current copyright laws into doubt. At the very least, the frameworks available now – in which web content is typically unprotected unless a website owner applies for copyright – don’t account for what generative AI companies are doing.
That’s why so many lawsuits are starting to crop up.
In January 2023, a collective of visual artists launched a lawsuit against several developers of generative AI tools that have used billions of images as training material. Major AI players, including Midjourney and Stability AI, are the subject of this lawsuit, which was filed in San Francisco’s federal court.
It followed a similar lawsuit launched in November 2022 that targeted OpenAI and GitHub for scraping copyrighted source code and using it to train AI models.
A similar lawsuit was filed in July 2023, this time involving a pair of authors – Mona Awad and Paul Tremblay – who accuse OpenAI of “unlawfully ingesting” their books for use in its ChatGPT generative AI tool. Both believe that the hyper-accurate summaries of their books that ChatGPT can provide could only come from the tool having been trained on their writing directly.
Even major publishers are getting in on the act – The New York Times launched a lawsuit against both OpenAI and Microsoft over the use of its content to train their models, again claiming copyright infringement.
Many of these lawsuits are being spearheaded by Matthew Butterick, whom Wired called the “unlikely driving force” behind several class-action lawsuits targeting AI companies in November 2023. These lawsuits typically see Butterick team up with creatives – authors, artists, and content creators – with the goal of helping them gain more control over how their work is used.
For instance, he is the man behind the November 2022 lawsuit against OpenAI and GitHub, as he claims that both have violated open-source licensing agreements to essentially steal code that can then be replicated by generative AI tools.
Butterick is even representing American comedian Sarah Silverman, one of many artists uncomfortable with the idea of AI tool creators using the work she’s created to train their tools.
The issue here is that AI technology has upended what we’ve known about copyright law for decades.
For now, AI companies aren’t breaking the law, at least as the law is currently written. That makes Suleyman’s claim – that Microsoft can freely use any content published on the open web for the company’s AI tools – technically accurate.
These lawsuits may change that assessment in the future.
The problem for you, as a website owner, is the “in the future” qualifier. These lawsuits will likely trigger years of legal debate, during which time the makers of AI tools can continue to freely scrape the content on your website to train their tools.
The ideal scenario for a content creator is that the lawsuits lead to a modernization of copyright laws to prevent what many see as digital content theft, though there’s no guarantee that will happen. Worse yet, this is all U.S.-centric – even if America changes its copyright laws, the rest of the world will have to follow to truly quell the scraping of content for AI tool training purposes.
What Can You Do to Protect Your Content Today?
With the outcomes of these early lawsuits still to be determined (and not guaranteed to play out in your favor), the obvious question is this: what can you do right now to guard against AI scraping?
The main answer is ChatGPT-centric.
OpenAI uses a tool called “GPTBot” to crawl websites for content. In practice, this is similar to how search engines work, as they also deploy tools to crawl website content. However, those crawlers are designed to analyze content so search engines can figure out where to rank it based on a user’s query – the content isn’t used to generate new material. GPTBot differs because it essentially bounces from website to website collecting data that OpenAI can use to train its generative AI models.
Thankfully, you can block GPTBot in several ways.
Method One – Disallow GPTBot in Your Site’s robots.txt File
Nearly every website has a robots.txt file, which you use to tell search engine crawlers which pages they can access and which they can’t. Most site owners – and their digital marketing teams – rarely make substantial changes to this file because they want search engines to crawl their pages. Happily, robots.txt can be your saving grace when it comes to preventing OpenAI from using your content, thanks to two lines of code:
User-agent: GPTBot
Disallow: /
There are other “disallow” commands you can use – which are detailed on the OpenAI website – but this method is the most targeted for stopping its crawlers in their tracks.
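If you want to cast the net a little wider, the same two-line pattern can be repeated for other AI-related crawlers that announce themselves with their own user-agent tokens. As a sketch, the tokens below – ChatGPT-User (OpenAI’s browsing agent) and CCBot (Common Crawl, whose archives feed many AI training sets) – are ones their operators have documented, but verify any token against the operator’s current documentation before relying on it:

# OpenAI's browsing agent, documented separately from GPTBot
User-agent: ChatGPT-User
Disallow: /

# Common Crawl's crawler, whose archives are widely used for AI training
User-agent: CCBot
Disallow: /

One caveat worth remembering: robots.txt is a voluntary standard. Well-behaved crawlers respect it, but it doesn’t physically block access, which is part of why the next two methods exist.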
Method Two – Using CAPTCHA and reCAPTCHA Tools
The only major problem with Method One is that it only applies to GPTBot. What about the many other AI tools that send crawlers out to websites? While there are similar “disallow” commands for Google’s and Microsoft’s AI bots, both come with the risk of accidentally telling those search engines to stop showing your website in their results.
Not ideal if you want to use your content for marketing.
CAPTCHA and reCAPTCHA tools offer a somewhat inelegant solution to these problems. You’ve likely come across these tools during your time on the web, from the basic versions that ask you to type a blurry character string to the more complex reCAPTCHAs that ask you to identify certain types of images from a collection presented to you. All serve the same purpose – preventing malicious bots from accessing your content.
You can find these tools via several service providers, such as hCaptcha and Google’s reCAPTCHA, and they should currently block AI crawler activity. Better yet, CAPTCHA and reCAPTCHA tools typically allow search engine crawlers through, as legitimate crawlers identify themselves via their user agent strings. Ask your web developer for implementation tips if needed.
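If you’re curious what an implementation roughly looks like, here’s a minimal sketch of Google’s reCAPTCHA v2 checkbox added to a simple contact form. The form action and the site key are placeholders – you receive a real site key (and a secret key for server-side verification) when you register your domain with the service:

<!-- Minimal sketch: reCAPTCHA v2 checkbox on a form; /contact and YOUR_SITE_KEY are placeholders -->
<form action="/contact" method="POST">
  <input type="text" name="email" placeholder="Your email">
  <!-- The checkbox widget renders inside this element -->
  <div class="g-recaptcha" data-sitekey="YOUR_SITE_KEY"></div>
  <button type="submit">Send</button>
</form>
<!-- Loads the reCAPTCHA script; your server still needs to verify the
     g-recaptcha-response token submitted with the form -->
<script src="https://www.google.com/recaptcha/api.js" async defer></script>

Note that a form-level CAPTCHA protects the form rather than the page copy itself; gating whole pages typically requires a site-wide challenge from your CAPTCHA or CDN provider.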
The only downside here is that you create a barrier to entry to your content. For instance, one study found that a web form can achieve up to a 64% conversion rate without a CAPTCHA. Putting one in place drops that rate to 48% – a quarter of those conversions lost.
There are also accessibility issues with CAPTCHA tools. Visitors with vision impairments may find themselves locked out of your website because CAPTCHA tools wrongly flag the screen readers they rely on as malicious bots.
Method Three – Use the “noindex” Meta Tag
The final method is the scorched earth approach to protecting your content:
Prevent any and all crawlers from accessing your web content.
This is fairly easy to do as long as you have a basic understanding of HTML code.
Open the HTML file for the webpage you want to block and locate the <head> tag. Create a new line under that tag – but above the </head> tag – and enter the following:
<meta name="robots" content="noindex">
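In context, the placement looks something like the minimal sketch below – the title and body are placeholders, and only the meta line matters here:

<!DOCTYPE html>
<html>
  <head>
    <title>Example page title</title>
    <!-- Tells compliant crawlers not to index this page -->
    <meta name="robots" content="noindex">
  </head>
  <body>
    <!-- Your page content -->
  </body>
</html>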
Repeat this across every individual web page you wish to block and you’ll tell every compliant crawler – AI or otherwise – to leave that content alone. But there’s a problem. This is the scorched earth approach precisely because it applies to search engine crawlers too, meaning your website will drop out of search engine rankings.
So, we recommend only using this method if you’re unconcerned with your content ranking in search engines – for example, if you primarily rely on paid ads and social media ads to get hits.
The Future of Your Content and AI
Perhaps the most worrying aspect of Mustafa Suleyman’s remarks is that they are technically correct – there is currently no legal framework in place to stop Microsoft from using your website’s content to train its AI models. Whether that will be the case in the future is likely going to depend on the outcomes of the first wave of lawsuits being launched against the makers of AI tools. Either those lawsuits will lead to a complete rewriting of copyright law or they’ll be decided based on the existing and antiquated laws drafted before AI tools became such a problem.
Whatever the outcome may be, we encourage you to continue creating great content, either yourself or with the help of a content creation agency. Even with generative AI becoming so widespread, search engines tend to prefer content that is unique and helpful to visitors – the exact opposite of the often generic and factually inaccurate content that even the most advanced AI tools produce.
Create your content, protect it as best you can with any of the three methods detailed above, and wait. With a little luck, regulation will be introduced into the AI sector that prevents the wide-scale use of your writing, images, and other content – and puts Mustafa Suleyman’s claims to rest.