Bots are overwhelming websites with their hunger for AI data

Bots harvesting content for AI companies have proliferated to the point that they're threatening digital collections of arts and culture.

Galleries, Libraries, Archives, and Museums (GLAMs) say they're being overwhelmed by AI bots - web crawling scripts that visit websites and download data to be used for training AI models - according to a report issued on Tuesday by the GLAM-E Lab, which studies issues affecting GLAMs.

GLAM-E Lab is a joint initiative between the Centre for Science, Culture and the Law at the University of Exeter and the Engelberg Center on Innovation Law & Policy at NYU Law.

Based on an anonymized survey of 43 organizations, the report indicates that cultural institutions are alarmed by the aggressive harvesting of their content, which shows no regard for the burden that data-harvesting places on websites.

"Bots are widespread, although not universal," the report says. "Of 43 respondents, 39 had experienced a recent increase in traffic. Twenty-seven of the 39 respondents experiencing an increase in traffic attributed it to AI training data bots, with an additional seven believing that bots could be contributing to the traffic."

The surge in bots that gather data for AI training, the report says, often went unnoticed until it became so bad that it knocked online collections offline.

"Respondents worry that swarms of AI training data bots will create an environment of unsustainably escalating costs for providing online access to collections," the report says.

The institutions commenting on these concerns have differing views about when the bot surge began. Some report noticing it as far back as 2021, while others only began noticing web scraper traffic this year.

Some of the bots identify themselves, but some don't. Either way, the respondents say that robots.txt directives - voluntary behavior guidelines that web publishers post for web crawlers - are not currently effective at controlling bot swarms.
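For readers unfamiliar with how robots.txt is meant to work, here is a minimal sketch of the check a well-behaved crawler would run before fetching a page, using Python's standard urllib.robotparser module. The robots.txt contents, the bot names (ExampleAIBot, ExampleSearchBot), and the example.org URLs are hypothetical and do not come from the report. Note that nothing enforces the result: a crawler that skips this check faces no technical barrier, which is the weakness respondents describe.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that a collection site might publish.
# It asks one AI crawler to stay away entirely and asks all other
# crawlers to slow down and avoid the admin area.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /admin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
item_url = "https://example.org/collection/item/1"

# A compliant crawler asks before fetching; compliance is voluntary.
ai_allowed = parser.can_fetch("ExampleAIBot", item_url)
search_allowed = parser.can_fetch("ExampleSearchBot", item_url)
admin_allowed = parser.can_fetch("ExampleSearchBot", "https://example.org/admin/")
delay = parser.crawl_delay("ExampleSearchBot")

print(ai_allowed, search_allowed, admin_allowed, delay)
```

A bot that does not identify itself, or that reports a browser-like user agent, would simply fall under the wildcard rules here, assuming it consults the file at all.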

Bot defenses offered by the likes of AWS and Cloudflare do appear to help, but GLAM-E Lab acknowledges that the problem is complex. Placing content behind a login may not be effective if an institution's goal is to provide public access to digital assets. And there may be a reason to want some degree of bot traffic, such as bots that index sites for search engines.

The GLAM-E Lab survey echoes the findings of a similar report issued earlier this month by the Confederation of Open Access Repositories (COAR) based on the responses of 66 open access repositories run by libraries, universities, and other institutions.

The COAR report says: "Over 90 percent of survey respondents indicated their repository is encountering aggressive bots, usually more than once a week, and often leading to slowdowns and service outages. While there is no way to be 100 percent certain of the purpose of these bots, the assumption in the community is that they are AI bots gathering data for generative AI training."

The GLAM-E Lab survey also recalls complaints about abusive bots raised by The Wikimedia Foundation, Sourcehut, Diaspora developer Dennis Schubert, repair site iFixit, and documentation project ReadTheDocs.

Ultimately, the GLAM-E report argues that AI providers need to develop more responsible ways to interact with other websites.

"The cultural institutions that host online collections are not resourced to continue adding more servers, deploying more sophisticated firewalls, and hiring more operations engineers in perpetuity," the report says. "That means it is in the long-term interest of the entities swarming them with bots to find a sustainable way to access the data they are so hungry for."
