Metaphor of the Month! Data Scraping

As I prepare for my Fall class, “Writing With and About AI,” as well as a book proposal on AI in the writing classroom, I keep encountering neologisms like this month’s metaphor.

What do “scrapers” do? They can, according to a firm that employs them, “browse sites based on your keyword inputs or connections to your website or social media accounts. They can also skim through online reviews, product descriptions, and other categories.” That sounds benign enough, as sites like this one sit behind no paywall (there’s a neologism to which I’ll return in some future post). The practice, according to the Wikipedia entry, appears to date to the 1980s, before we had The Web or household Internet.
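For the curious, here is a toy sketch of what “skimming through online reviews” looks like in code. It uses only Python’s standard library, and the page markup and class names are invented for illustration; a real scraper would fetch live pages and handle far messier HTML.

```python
from html.parser import HTMLParser

class ReviewScraper(HTMLParser):
    """Collects the text of every <p class="review"> element it encounters."""
    def __init__(self):
        super().__init__()
        self.in_review = False
        self.reviews = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == "p" and ("class", "review") in attrs:
            self.in_review = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_review = False

    def handle_data(self, data):
        if self.in_review:
            self.reviews.append(data.strip())

# A stand-in for a fetched product page (hypothetical markup)
page = """
<html><body>
  <p class="review">Great product, fast shipping.</p>
  <p>Unrelated navigation text.</p>
  <p class="review">Would buy again.</p>
</body></html>
"""

scraper = ReviewScraper()
scraper.feed(page)
print(scraper.reviews)
```

The same pattern, pointed at thousands of URLs by a crawler, is how review text and product descriptions end up in training data.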

Why scrape data to train AI? According to the firm quoted in the previous paragraph, data scrapers assist in “automating outreach, [and] they can also help during the early company development and research phases. Even later on, you can use them to monitor online chatter and brand perception.” As I tell my students constantly, they need to learn how to use these AI-based tools, even if they dislike them. Getting a job will depend upon AI fluency.

And yet as I write this, the BBC has threatened to take the AI firm Perplexity to court for unauthorized scraping of its data and “reproducing BBC content ‘verbatim’ without its permission.” This use of BBC content, though free, poses a new problem for me, a self-professed “Copy Leftist” who has long opposed copyright save for creative work.

Open-access scholarship, my own syllabi, and more in The Creative Commons are there to be scraped. The problem for me involves my and other creators’ words being used without any asking or attribution; this use violates the ethos of the Creative Commons. Twenty years ago, I wrote to a Hong Kong firm that had used our online handbook pages, verbatim, without acknowledgement. I told them I’d be contacting every e-list I knew to show what they had done. They relented and gave our creators credit. I gave them my blessing to use our content under that one condition.

I’ve long advocated giving everything away free, save classified government information and creative work. That was one promise of the original Internet. Just cite it if you scrape it. I intensely dislike copyright for other materials.

Now I’m thinking that Web crawlers and other bots that scrape data pose an even larger problem than copyright laws and paywalls do. We may need to revise copyright laws to require attribution even for Creative-Commons work, or to watermark all AI-scraped content.

Scrape the barrel for new words and metaphors, then send them to me at jessid -at- richmond -dot- edu or leave a comment below.

See all of our Metaphors of the Month here and Words of the Week here.

Creative-Commons image courtesy of lab.howie.tw

Google Sites: Page-Level Permissions

Google What?

I do not often read Google’s blog about their documents features, but recently I was looking for answers to a few questions about Google Sites, the tool that I now use for all of my course syllabi. Unlike traditional website builders, Google Sites is collaborative, a trait it shares with wikis, website software long popular in K-12 education but rarer in higher education.

In doing my reading at Google’s blog, I found a game-changer for writing teachers. Sites has quickly become my favorite tool for a few reasons:

  • It’s free
  • It offers a navigational sidebar like the one I liked in PBworks’ wiki
  • It lacks obtrusive advertisements
  • It has the ease of use that Wikispaces offers, but appears even more familiar to MS-Office users

To my knowledge, however, none of Google’s smaller competitors, and certainly nothing from the desktop-centric Microsoft empire, offer a creator the ability to grant permissions, by page, to those sharing a site. Google explains the reasons for this feature here.

Course-Management Software vs. Sites

For years, I’ve refused to use Blackboard because it has made guest access so hard. In my field, writing & composition, faculty routinely share lesson plans and syllabi, so Blackboard never met my needs. Our Eng. 383 syllabus has become a model for many other schools’ training programs precisely because colleagues outside the class can find it with a Web search and view the content.

That said, I’m pleased that Blackboard, seeing what the competition offers for free, has given faculty a “public” option for Bb sites. But I’ve argued elsewhere that Blackboard is an overpriced “transition” technology in the age of social media and Web 2.0 shared applications. Blackboard only recently added such technology to its product.

For now, Sites lacks the sort of testing features that Blackboard has, but I don’t use quizzes that way. It would be possible, however, to link to an online gradebook created with Google Docs. You can see the results (but not students’ grades!) in the latest iteration of my Eng. 383 syllabus, used for training Writing Consultants at the University of Richmond.

How the Collaboration Works

The process of granting permissions for a Google Site is a little tedious at first. I had to invite users to the site with “view” permissions, and each must have a Gmail account. To my knowledge, an account the University grants will not work, either, as my site resides on Google’s public servers. Had I known this, I might have set up the site under UR’s rubric, but that change of service providers had not occurred when I first set up my Google Site.

The nature of collaboration and the presence of multimedia in modern writing classrooms make something like Google Sites, with page permissions enabled, essential to how I teach. That said, Google still needs to add a few features:

  • The ability to archive the site locally
  • A somewhat more streamlined process for adding users

Overall, however, this free tool is phenomenal, and I plan to recommend it to colleagues.

Image source: pre-Sites days in Eng. 103 classroom, late 1990s.