Blogging platform Medium would like organizations to not scrape its articles without permission to train up AI models, though it admitted this policy will be difficult to enforce.
CEO Tony Stubblebine on Thursday explained how Medium intends to curb the harvesting of people's written work by developers seeking to build training data sets for neural networks. He said, above all, devs should ask for consent - and offer credit and compensation to writers - for training large language models on people's prose.
Those AI models can end up aping the writers they were trained on, which feels to some like a double injustice: the scribes weren't compensated in the first place, and now models are threatening to take their place and income derived from their work.
"To give a blunt summary of the status quo: AI companies have leached value from writers in order to spam internet readers," he wrote in a blog post. "Medium is changing our policy on AI training. The default answer is now: No."
Medium has thus updated its website's robots.txt file to ask OpenAI's web crawler bot GPTBot to not copy content from its pages. Other publishers - such as CNN, Reuters, the Chicago Tribune, and the New York Times - have already done this.
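For illustration, here is a minimal sketch of how such a robots.txt request works from a well-behaved crawler's point of view, using Python's standard urllib.robotparser module. The two-line rule is the generic form of a GPTBot block and is an assumption, not a copy of Medium's actual file; the article URL and the SomeOtherBot name are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed rule set: the generic form of a GPTBot block.
# This is NOT copied from Medium's real robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a crawler that honors robots.txt may fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://medium.com/some-article"  # hypothetical URL
    print("GPTBot allowed:", is_allowed("GPTBot", page))
    print("SomeOtherBot allowed:", is_allowed("SomeOtherBot", page))
```

The check only matters if the crawler bothers to run it: a bot with no matching rule is allowed by default, and one that never reads robots.txt is not stopped at all.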
Stubblebine called this a "soft block" on AI: it relies on OpenAI's GPTBot heeding the request in robots.txt to not access Medium's pages and lift the content. But other crawlers can and may ignore it. Medium could wait for those crawlers to provide a way to block them via robots.txt, and update its file accordingly, but that's not a situation guaranteed to happen.
Blocking web crawlers at a level lower than robots.txt, such as by IP address or user agent string, will work - until the bots get new IP addresses or alter their user agent strings. It's a game of whack-a-mole that may be too tedious to play.
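To see why this turns into whack-a-mole, consider a rough sketch of a server-side filter. Everything in it is hypothetical: the IP range is a reserved documentation block, "examplebot" is a made-up crawler name, and nothing here reflects how Medium actually filters traffic.

```python
import ipaddress

# Hypothetical blocklists, for illustration only.
BLOCKED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]  # documentation range
BLOCKED_AGENT_SUBSTRINGS = ["examplebot"]  # made-up crawler name


def should_block(ip: str, user_agent: str) -> bool:
    """Return True if the request matches a blocked IP range or user agent."""
    address = ipaddress.ip_address(ip)
    if any(address in network for network in BLOCKED_NETWORKS):
        return True
    agent = user_agent.lower()
    return any(token in agent for token in BLOCKED_AGENT_SUBSTRINGS)


if __name__ == "__main__":
    # Caught by IP range, regardless of user agent.
    print(should_block("192.0.2.10", "Mozilla/5.0"))
    # Caught by user agent string from an unlisted address.
    print(should_block("198.51.100.7", "ExampleBot/1.0"))
    # Same crawler after moving address and changing its user agent.
    print(should_block("198.51.100.7", "Mozilla/5.0"))
```

The last call shows the weakness: once a crawler moves to an unlisted address and presents a browser-like user agent, the filter no longer recognizes it, and the blocklists have to be updated by hand.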
"Unfortunately, the robots.txt block is limited in major ways," Stubblebine admitted. "As far as we can tell, OpenAI is the only company providing a way to block the spider they use to find content to train on. We don't think we can block companies other than OpenAI perfectly."
By that he means that at least OpenAI has promised to observe robots.txt. Other orgs collecting data for machine-learning training might just ignore it.
That all said, Medium has promised to send cease and desist letters to those who crawl its pages without permission to gather articles for training models. So, effectively: Medium has asked OpenAI's crawler to leave it alone, and the website will take other data-set crawlers to task via legal threats if they don't back off. The website's terms-of-service were updated to forbid the use of spiders and other crawlers to scrape articles without Medium's consent, we're told.
Stubblebine also warned writers on the platform that it's not clear whether copyright law can protect them from companies training models on their work and using those models to produce similar or almost identical material. Multiple ongoing lawsuits are testing that very question.
The CEO also reminded Medium users that no one can resell copies of their work on the site without permission. "In the default license on Medium stories, you retain exclusive right to sell your work," Stubblebine wrote.
He went on to say that some AI developers may have done just that: bought or obtained copies of articles and other works scraped off Medium and other parts of the internet by third-party resellers, to then train networks on that content. He dubbed that laundering of people's copyrighted material "an act of incredible audacity."
Stubblebine advised companies looking to crawl web data from Medium to contact the site to discuss credit and compensation among other sticking points. "I'm saying this because our end goal isn't to block the development of AI. We are opting all of Medium out of AI training sets for now. But we fully expect to opt back in when these protocols are established," he added.
Medium proposed that if an AI maker were to offer compensation for scraped text, the blogging biz would give 100 percent of this to its writers.
In July, it also confirmed that although AI-generated posts aren't completely banned, it would not be recommending any text completely written by machines.
"Medium is not a place for fully AI-generated stories, and 100 percent AI-generated stories will not be eligible for distribution beyond the writer's personal network," it stated. ®