Metaphor of the Month! Data Scraping

As I prepare for my Fall class, “Writing With and About AI,” as well as a book proposal on AI in the writing classroom, I keep encountering neologisms like this month’s metaphor.

What do “scrapers” do? They can, according to a firm that employs them, “browse sites based on your keyword inputs or connections to your website or social media accounts. They can also skim through online reviews, product descriptions, and other categories.” That sounds benign enough, as sites like this one lie behind no paywalls (there’s a neologism to which I’ll return in some future post). The practice, according to the Wikipedia entry, appears to date to the 1980s, before we had The Web or household Internet.
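For readers curious about the mechanics, here is a minimal sketch in Python of the kind of keyword-based “skimming” the firm describes. It is an illustration under stated assumptions, not how any particular firm’s scraper works: the names KeywordScraper and scrape, and the sample page, are hypothetical, and a real scraper would first download pages from the Web rather than read a string typed into the program.

```python
from html.parser import HTMLParser


class KeywordScraper(HTMLParser):
    """Collect chunks of visible page text that mention a keyword (hypothetical helper)."""

    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword.lower()
        self.matches = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Collapse runs of whitespace, then keep only visible text with the keyword.
        text = " ".join(data.split())
        if text and self._skip_depth == 0 and self.keyword in text.lower():
            self.matches.append(text)


def scrape(html, keyword):
    """Return the text chunks in `html` that contain `keyword`, ignoring case."""
    parser = KeywordScraper(keyword)
    parser.feed(html)
    parser.close()
    return parser.matches


# A made-up page standing in for a site with product reviews.
SAMPLE_PAGE = """
<html><body>
  <h1>Product reviews</h1>
  <p>This kettle boils quickly.</p>
  <p>The Kettle handle gets hot.</p>
  <p>Great toaster.</p>
  <script>var kettle = 1;</script>
</body></html>
"""

print(scrape(SAMPLE_PAGE, "kettle"))
```

Note what the sketch leaves out: it records the words it finds but nothing about who wrote them or where they came from.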

Why scrape data to train AI? According to the firm quoted in the previous paragraph, data scrapers assist in “automating outreach, [and] they can also help during the early company development and research phases. Even later on, you can use them to monitor online chatter and brand perception.” As I tell my students constantly, they need to learn how to use these AI-based tools, even if they dislike them. Getting a job will depend upon AI-fluency.

And yet as I write this, the BBC has threatened to take the AI firm Perplexity to court for unauthorized scraping of its data and “reproducing BBC content ‘verbatim’ without its permission.” Although BBC content is free to read, this use of it poses a new problem for me, a self-professed “Copy Leftist” who has long opposed copyright save for creative work.

Open-access scholarship, my own syllabi, and more in The Creative Commons are there to be scraped. The problem for me involves my and other creators’ words being used without permission or attribution; this use violates the ethos of the Creative Commons. Twenty years ago, I wrote to a Hong Kong firm that had used our online handbook pages, verbatim, without acknowledgement. I told them I’d be contacting every e-list I knew to show that they had done this. They relented and gave our creators credit. I gave them my blessing to use our content under that one condition.

I’ve long advocated having everything save classified government information and creative work given away, free. That was one promise of the original Internet. Just cite it if you scrape it. I dislike copyright for other materials intensely.

Now I’m thinking that Web-crawlers and other bots that scrape data pose an even larger problem than copyright laws and paywalls. We may need to revise copyright laws to require attribution even for Creative-Commons work, or to watermark all AI-scraped content.

Scrape the barrel for new words and metaphors, then send them to me at jessid -at- richmond -dot- edu or leave a comment below.

See all of our Metaphors of the Month here and Words of the Week here.

Creative-Commons image courtesy of lab.howie.tw