Web Archives as Big Data: experimenting with the internet as a historical source
by Marta Musso
On the 3rd of December, the Institute for Historical Research hosted a conference on the challenges and opportunities that the digital world offers to researchers in the humanities. Living in the middle of the digital revolution, we cannot yet fully perceive the massive changes that the switch to digital is bringing about. Over the past 30 years, however, more and more human activity has been conducted through digital tools (from MS-DOS computers all the way to smartphones) and, especially in the past 15 years, the web has become an exponentially more crowded place of action and interaction. Precisely because web content is so ephemeral (a tweet is published and buried within seconds), the problem of preserving online data for future studies is now an integral part of research in the humanities.
In order to investigate the problems of historical research specifically related to web resources, the British Library, together with the University of London, the Oxford Internet Institute and the Institute of Historical Research, has set up a project called “Big UK Domain Data for the Arts and Humanities”. The aims of the project include highlighting “the value of web archives as a source for arts and humanities researchers” and developing “a theoretical and methodological framework for the analysis of web archives”.
As part of the project, twelve bursaries were given to researchers in different fields of the humanities – from social anthropology to military history to literature – in order to conduct research using the web as a source. In particular, researchers were given the opportunity to try the search engine currently being developed by the British Library for its growing web collection.
The results were presented at the December conference. As one of the bursary holders, I presented a paper on how British companies reacted to the spread of the internet between 1996 and 1999: which enterprises were the first to open a website or webpage; the type of content they used; and the domain and internet service provider they relied on (a choice that marked whether to rely on US infrastructure or to build a “national” web space).
One very interesting aspect of historical research through web sources is that content cannot be disentangled from methodology. I had no previous experience of the internet as a historical source, and this was a unique opportunity to plunge into a new type of archive, one destined to become the most important source for studies of the 21st century. It was a valuable experience, especially because I realised that the methodology required for internet sources is remarkably different from that used with paper archives.
First of all, the very lack of physical folders can be extremely confusing. How do you trace the creating body? Which year is the document from? Where are the other documents related to the one you are reading? There is no folder to begin with! Moreover, especially with older documents, most of the page is often lost. A webpage is composed of many elements: text, pictures, videos, links to other pages; in most cases, especially in the early period I considered (the mid-1990s), only the text was still accessible.

The Internet Archive, the first body to archive the web as a digital source, uses software that crawls and downloads all publicly accessible World Wide Web pages. The pages are stored on hard drives at the Internet Archive’s headquarters in California and are accessible online through what is called the “Wayback Machine” (https://archive.org/web/web.php). It is an immense source of data, crawling everything that has been on the internet worldwide from 1996 onwards, from the New York Times to teenage blogs. However, content is disclosed only if you provide a specific internet address: http://www.gov.uk, http://www.guardian.co.uk, http://www.doinghistoryinpublic.org (yep, this blog has already been archived!), etc. This is a severe limitation, because it means the archive can only be searched one website at a time.

This is why the British Library, which has acquired a copy of the entire .uk domain, is creating a user interface that allows queries filtered by keyword, by generic domain and by year – though note that the year does not refer to the year the document was created, as in paper archives, but to the year in which the webpage was crawled. So if something was crawled in March 1996, we know it already existed by March 1996, but we cannot know whether it was put online earlier than that.
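To make the URL-plus-crawl-date lookup concrete: a Wayback Machine snapshot is addressed by the site’s URL and a timestamp in the archive’s YYYYMMDDhhmmss format (shorter prefixes, such as just a year, redirect to the closest available snapshot). The replay-URL pattern below is the real one; the helper function itself is my own illustration, not part of any official client library.

```python
def wayback_url(site: str, timestamp: str) -> str:
    """Build a Wayback Machine replay URL for `site` at `timestamp`.

    Note: the timestamp is the date the page was CRAWLED, not the
    date it was created or first put online.
    """
    return f"https://web.archive.org/web/{timestamp}/{site}"

# Example: the earliest crawls discussed here date from 1996.
print(wayback_url("http://www.gov.uk", "19960301"))
# → https://web.archive.org/web/19960301/http://www.gov.uk
```

Opening such an address in a browser shows the snapshot closest to the requested date – which is exactly why, without the British Library’s keyword interface, a researcher must already know which website to ask for.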
The most important issue, however, as the conference title suggests, is that the quantity of data that web archives make available to historians is often not manageable by humans. Every minute, 72 hours of content are uploaded to YouTube, 347 posts are published on WordPress, and 278,000 tweets are sent.
We tend to perceive Big Data as a methodological problem faced only by researchers working on quantitative data; yet when I searched for the word “company” on the BL search engine, it returned more than 1.9 billion results! Clearly, in this case “sampling” becomes the keyword. And while many issues related to web archiving, both technical and theoretical, are still to be solved, the amount of information that the web space provides will be invaluable to historians. Experimenting now in order to pave the way for future research is incredibly exciting: http://www.bl.uk/aboutus/stratpolprog/digi/webarch/
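When “sampling” becomes the keyword, one standard technique for drawing a manageable, uniform random sample from a result set far too large to hold in memory is reservoir sampling. The sketch below is a generic illustration of that technique (Algorithm R), not a description of the British Library’s interface:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable of
    unknown (possibly huge) length, keeping only k items in memory."""
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)        # item survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# e.g. keep 5 "results" out of a simulated stream of a million
picked = reservoir_sample(range(1_000_000), 5, seed=42)
print(picked)
```

Because the sample is built in a single pass, the same approach would work while iterating over paged search results, without ever knowing the total count in advance.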