DH and Computational Tools


Data mining and quantitative textual analysis provide new ways to read and analyze text. Underwood mentions Google’s NGrams as a great example of new ways to ‘read’ text. NGrams essentially measure how frequently a word appears over time in the Google Books archive. I could spend hours looking up word prevalence over time. More importantly, without modern technology and OCR, NGram studies could never have been done. Any attempt to analyze rhetoric and vocabulary over time is limited by how much a person can read, record, and calculate by hand; at the scale of hundreds of thousands of books, the task would be infeasible without computational tools. New tools are necessary for the digital humanities to continue to break old assumptions and explore new ground.

One drawback mentioned in the Underwood article is that Google is not transparent about the sourcing of its material and database. Are they using the entire collection of books? How accurate are their scans? Is their sample of literature representative of what was published at the time? In the grand scheme of things, these are fixable problems. The big picture is that Google’s NGrams are just one example of how being able to read and record hundreds of thousands of books opens the door to new forms of academic research and intellectual exploration.
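To make the idea of "word frequency over time" concrete, here is a minimal sketch in Python. It is not Google's actual method, which is not documented in the articles discussed here. The function name, the tokenizing rule, and the tiny corpus are all my own assumptions, invented purely for illustration.

```python
import re
from collections import Counter


def relative_frequency_by_year(corpus, word):
    """Return {year: share of that year's tokens that equal `word`}.

    `corpus` is a list of (year, text) pairs. This is a hypothetical
    helper for illustration, not part of any real Ngram tool.
    """
    hits = Counter()    # occurrences of `word` per year
    totals = Counter()  # total tokens per year
    target = word.lower()
    for year, text in corpus:
        # Crude tokenizer (an assumption): lowercase runs of letters/apostrophes.
        tokens = re.findall(r"[a-z']+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    # Skip years with no tokens to avoid dividing by zero.
    return {y: hits[y] / totals[y] for y in sorted(totals) if totals[y]}


# Invented toy corpus: (publication year, text)
toy_corpus = [
    (1850, "the telegraph carried the news"),
    (1850, "a letter arrived by post"),
    (1950, "the telephone rang and the telephone rang again"),
]

freqs = relative_frequency_by_year(toy_corpus, "telephone")
for year, share in freqs.items():
    print(year, share)
```

Even this toy version shows why the sourcing questions matter: the result depends entirely on which texts end up in the corpus and how cleanly they were turned into words.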

The “Do Digital Humanists Need to Understand Algorithms?” article also relates closely to the NGrams example. Complex algorithms and datasets run the risk of becoming black boxes. High levels of abstraction and complexity can lead users to misuse the tools or misinterpret the results of their study. Other fields like economics and psychology have been struggling with this same problem. Regression analysis is an extremely powerful tool that can easily be misused to produce misleading answers. Economists and psychologists are not mathematicians, and one result is poor reproducibility in both fields. If Underwood had not been informed enough to ask critical questions about how Google’s NGrams work, he might have drawn the wrong conclusions or placed too much confidence in his findings. Therefore, it is important to understand how the tools work and what they achieve, even if you are not a trained coder or statistician.
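A small sketch of how a black box can mislead: suppose a tool reports a word's trend but does not say whether it is showing raw counts or the word's share of all words scanned. The numbers below are invented, and `trend` is a hypothetical helper of my own, but they show the two readings pointing in opposite directions.

```python
# Invented yearly figures: (year, occurrences of the word, total words scanned)
yearly = [
    (1800, 50, 100_000),
    (1850, 120, 400_000),
    (1900, 300, 2_000_000),
]

raw_counts = [hits for _, hits, _ in yearly]
shares = [hits / total for _, hits, total in yearly]


def trend(values):
    """Label a sequence as strictly 'rising', strictly 'falling', or 'mixed'."""
    pairs = list(zip(values, values[1:]))
    if all(a < b for a, b in pairs):
        return "rising"
    if all(a > b for a, b in pairs):
        return "falling"
    return "mixed"


print("raw counts:", trend(raw_counts))
print("share of all words:", trend(shares))
```

In this made-up case the raw count climbs only because far more text was scanned for later years, while the word's share of the total actually shrinks. Someone who did not know which measure the tool reports could confidently draw the wrong conclusion.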

Underwood: https://tedunderwood.com/2012/08/14/where-to-start-with-text-mining/

Schmidt: http://dhdebates.gc.cuny.edu/debates/text/99