Getting Started & Thinking About Counting

Conversation Time!

You were to have examined some DH projects in the wild, and to have read:

Catherine D’Ignazio and Lauren F. Klein, “What Gets Counted Counts” and “The Numbers Don’t Speak for Themselves,” Data Feminism (2020), https://data-feminism.mitpress.mit.edu/.
Ryan Cordell, ‘Q i-jtb the Raven’: Taking Dirty OCR Seriously’ Book History 20 (2017), 188-225 (see also link jstor link)

D’Ignazio & Klein raise issues around control of data collection, and the impacts of categorization. But before we get to that: is history data?

In what ways does the identity of the designer of systems matter?

Is there (are there?) an ethics of data?

Which projects did you look at in Current Research in Digital History? (Aside: another great place to look for digital history as it is currently practiced is in Reviews in DH.)

How might you critique some of the projects you explored, in the light of D’Ignazio and Klein? (Do you detect any ‘big dick data’ energy, and if so, where, why, and consequences?)

Decisions about categorization extend also to our processes. How does Cordell’s piece intersect with D’Ignazio and Klein’s?

How can we trust a digital archive? What questions or perspectives do we need to keep handy?

Activity

Go to ‘Chronicling America’ or Canadiana.ca.

Search for, and download to your computer, a pdf of interest.

Open the pdf; select some text, ctrl+c to copy it, and paste into a text editor (ctrl+v). (nb: not word, not text pad, not notepad. Use something like Sublime Text).

How good is the OCR? What would you miss, if you were doing a simple keyword search? (For more on this, and its impact in a specifically Canadian context, see Ian Milligan, “Illusionary Order: Online Databases and the Changing Foundation of Canadian Historiography, 1997-2010,” Canadian Historical Review, Vol. 94, No. 4 (December 2013): 540-569).

With regard to materials from Chronicling America, if you click on the ‘About [newspaper in question]’ you will be brought to a metadata page about the publication. At the bottom right, where it says ‘provided by:’ you can click through to see what other publications were digitized by this awardee. Keep clicking. What can you learn about the context of this digital object versus the original physical object?

Cordell goes on to talk about how he obtained a different digital object of the same physical object (a TIFF image file versus a PDF file - do you know the difference and what that implies?) and used a tool called EXIF to extract the embedded metadata for both the image and the pdf. You can get the EXIFtool here. Download, install, and let’s give it a try on the pdf you downloaded.

A question to investigate: how many levels of remediation exist between the digital object on your computer, and the original paper thing?

Cordell:

In other words, knowing the scanner helps us understand the decisions made during the digitization process which led to the errorful OCR on which our searches and computational analyses rest. These decisions are understandable: with limited funding, time, and manpower, scanning from microfilm allowed dramatically more newspapers to be digitized than could have been from paper. But those decisions should also be understood by scholars using these resources, as they provide essential context for any work done in mass digitized archives.

One last thing: Jeff Blackadar, former MA student here, built this finding aid for digitized copies of a local West Quebec newspaper. Given what we now know… pretty neat, eh? Explore! What does the metadata tell you? Why does it matter?

Write

Let’s take a few moments to jot thoughts & observations down in your lab notebooks.

Do

This feeds into next week; there are several different ways to achieve what we’re about to do. This time, we’re using a template written to use node.js; next week, it’ll be python.

get a github.com account. You don’t have to tie it to your real name, but there are reasons why you might want to. I’ll tell you mine. But also: I’m a white guy on the internet.
‘fork’ (‘make a copy of’) this ‘digital garden’
upload some markdown files - from your notebook, or made for this exercise - into the ‘notes’ folder (I’ll show you how this is done)
click on the ‘wiki’ button, then the ‘getting started’; hit the blue ‘deploy to netlify’ button
follow the prompts.

Shortly, you will have a live website powered by your notes. This template is smart enough to turn your internal note links into hyperlinks.

Given what you’ve seen so far, why might having a live site of your process be desirable?

Netlify is a service for developing websites in ‘containers’; essentially, it spins up a virtual computer to do the heavy lifting of turning your text files and the templates into a working website, which it then puts on a server for you. Provided nobody visits your sites in extremely huge numbers, it doesn’t cost anything. In several years of using this service, the only time that ever happened to me was when an AI bot started scraping my website over and over and over and over again. It gave me a warning this was happening, and I disconnected the site. No bill!

Scheduling

Call dibs on your desired week to lead us through a Programming Historian tutorial.