Medium asks AI bot crawlers: Please, please don't scrape bloggers' musings

Blogging platform Medium would like organizations not to scrape its articles without permission to train AI models, though it admitted this policy will be difficult to enforce.

CEO Tony Stubblebine on Thursday explained how Medium intends to curb the harvesting of people's written work by developers seeking to build training data sets for neural networks. He said, above all, devs should ask for consent - and offer credit and compensation to writers - for training large language models on people's prose.

Those AI models can end up aping the writers they were trained on, which feels to some like a double injustice: the scribes weren't compensated in the first place, and now models are threatening to take their place and the income derived from their work.

"To give a blunt summary of the status quo: AI companies have leached value from writers in order to spam internet readers," he wrote in a blog post. "Medium is changing our policy on AI training. The default answer is now: No."

Medium has thus updated its website's robots.txt file to ask OpenAI's web crawler bot GPTBot not to copy content from its pages. Other publishers - such as CNN, Reuters, the Chicago Tribune, and the New York Times - have already done this.
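For reference, OpenAI documents a robots.txt rule for excluding GPTBot; a site-wide version looks like the following, though Medium's actual file may scope its rules differently:

```
User-agent: GPTBot
Disallow: /
```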

Stubblebine called this a "soft block" on AI: it relies on OpenAI's GPTBot heeding the request in robots.txt to not access Medium's pages and lift the content. But other crawlers can and may ignore it. Medium could wait for those crawlers to provide a way to block them via robots.txt, and update its file accordingly, but that's not a situation guaranteed to happen.

Blocking web crawlers at a level lower than robots.txt, such as by IP address or user agent string, will work - until the bots get new IP addresses or alter their user agent strings. It's a game of whack-a-mole that may be too tedious to play.
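As a rough illustration of the user-agent approach, here is a minimal sketch - not Medium's actual setup - of a Flask app that refuses requests from named crawlers; the bot list and route are purely hypothetical:

```python
# Minimal sketch of user-agent-based crawler blocking (illustrative only).
# The bot names below are examples, not an authoritative block list.
from flask import Flask, request, abort

app = Flask(__name__)

BLOCKED_AGENT_SUBSTRINGS = ("GPTBot", "CCBot")  # hypothetical block list

@app.before_request
def refuse_known_ai_crawlers():
    # Reject any request whose User-Agent header mentions a blocked bot.
    ua = request.headers.get("User-Agent", "")
    if any(bot in ua for bot in BLOCKED_AGENT_SUBSTRINGS):
        abort(403)

@app.route("/")
def index():
    return "Hello, human readers."
```

The obvious weakness, as noted above, is that a crawler that simply changes its user agent string sails straight through.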

"Unfortunately, the robots.txt block is limited in major ways," Stubblebine admitted. "As far as we can tell, OpenAI is the only company providing a way to block the spider they use to find content to train on. We don't think we can block companies other than OpenAI perfectly."

By that he means that at least OpenAI has promised to observe robots.txt. Other orgs collecting data for machine-learning training might just ignore it.

That all said, Medium has promised to send cease-and-desist letters to those crawling its pages without permission for articles to train models. So, effectively: Medium has asked OpenAI's crawler to leave it alone, and the website will take other data-set crawlers to task via legal threats if they don't back off. The website's terms of service were updated to forbid the use of spiders and other crawlers to scrape articles without Medium's consent, we're told.

Stubblebine also warned writers on the platform that it's not clear whether copyright law can protect them from companies training models on their work and using those models to produce similar or almost identical material, amid multiple ongoing lawsuits into that whole thing.

The CEO also reminded Medium users that no one can resell copies of their work on the site without permission. "In the default license on Medium stories, you retain exclusive right to sell your work," Stubblebine wrote.

He went on to say that some AI developers may have done just that: bought or obtained copies of articles and other works scraped off Medium and other parts of the internet by third-party resellers, to then train networks on that content. He dubbed that laundering of people's copyrighted material "an act of incredible audacity."

Stubblebine advised companies looking to crawl web data from Medium to contact the site to discuss credit and compensation among other sticking points. "I'm saying this because our end goal isn't to block the development of AI. We are opting all of Medium out of AI training sets for now. But we fully expect to opt back in when these protocols are established," he added.

Medium proposed that if an AI maker were to offer compensation for scraped text, the blogging biz would give 100 percent of this to its writers.

In July, it also confirmed that although AI-generated posts aren't completely banned, it would not be recommending any text completely written by machines.

"Medium is not a place for fully AI-generated stories, and 100 percent AI-generated stories will not be eligible for distribution beyond the writer's personal network," it stated. ®
