Open Insights: An Interview with Jane Winters

Posted by James Smith on 30 April 2018

Open Access and the Archived Web

An Open Insights interview with Jane Winters

Jane Winters is Professor of Digital Humanities at the School of Advanced Study, University of London. She has led or co-directed a range of digital projects, including most recently Big UK Domain Data for the Arts and Humanities; Digging into Linked Parliamentary Metadata; Traces through Time: Prosopography in Practice across Big Data; the Thesaurus of British and Irish History as SKOS; and Born Digital Big Data and Approaches for History and the Humanities.

Jane's research interests include digital history, web archives, big data for humanities research, peer review in the digital environment, text editing, the use of social media in an academic context, e-repositories, and open access publishing.


OLH: Hello Jane, thanks for talking to us! To start, what scholarly route led you to the world of the archived web?

JW: That’s a great question, and one I haven’t really thought through properly until now! The seeds were actually sown as long ago as 2005, when I was a Co-Applicant on a project funded by the Arts and Humanities Research Council to investigate the “Peer review and evaluation of digital resources for the arts and humanities”.[1] It was a joint initiative of the Institute of Historical Research, University of London, where I was working at the time, and the Royal Historical Society, and we looked at a range of issues around the peer review of digital scholarly outputs (a lot of which are still with us today). The one thing that really stuck with me was the finding that most historians were hiding their use of online sources by referencing the analogue version of something they had in fact consulted online. There were a number of explanations for this, but one very practical reason for their reluctance fully to engage with digital resources was a concern that these would not persist.

I didn’t pursue it further at the time, but remained interested enough to go to a conference at the British Library three years later on “Missing links: the enduring web”. It was organised by Jisc, the Digital Preservation Coalition and the now-defunct UK Web Archiving Consortium and provided a fascinating insight into the work going on in national memory institutions to archive the web so that it would remain available for historians and other researchers in the future. One of the key challenges for those archiving institutions, however, was the difficulty of engaging researchers with a new and very complex primary source.[2] It was hard to demonstrate the value of web archiving, and to justify the resources devoted to it, if historians and others were unaware of and apparently unwilling to use web archives in their research. That was the starting point for two research projects to explore the challenges and potential of web archives for humanities researchers, both of them in collaboration with the British Library, and I haven’t looked back from there!

OLH: What issues arise for scholarly open access when using web archives as primary sources?

JW: One of the main issues is that, in most countries, web archives are not particularly open, or indeed accessible. If people are familiar at all with web archiving, it will generally be because they have come across the work of the Internet Archive (IA) and/or used the Wayback Machine. The IA has been operating for more than 20 years, so has had time to build this brand awareness, but key to its success is the fact that it can be used by anyone with an internet connection anywhere in the world. There are other open web archives (the Portuguese, Croatian and Icelandic web archives, for example), but in many countries where sustained web archiving takes place at a national library, access is pretty restricted.

In the UK, for example, the material collected as part of an annual crawl of the .uk domain is only accessible on-site in the reading rooms of one of the six legal deposit libraries. There is no remote access to the archived version of what would previously have been a publicly accessible site on the live web. This hinders use of the UK web archive, but perhaps even more importantly it makes it very difficult to persuade people of its value. It is not something that you are likely to use speculatively, because the barriers to access are so high. There is some amazing material openly available from the British Library (small collections which have been published on a permissions basis), but the full richness of the archive remains relatively hidden. We are, however, fortunate that the UK Government Web Archive, collected under the Public Records Act, is openly available through The National Archives’ website.

Another problem, and one which is not confined to web archives, is uncertainty around the republication of images, and particularly web screenshots. This affects the live as well as the archived web, but is a real hindrance to research in this area. Why spend a couple of pages describing the front page of a website when you could include a screenshot which would convey the information much more effectively? The text on web pages is, of course, important, but so too are the images, the videos and the sound clips. The text can be easily reproduced, but the others cannot.

OLH: What opportunities do legal deposit web archives offer to those advocating for openness, and what are the potential risks?

JW: The extension of legal deposit in many countries to cover born-digital materials, including the web, is the cornerstone of national web archiving. Without electronic legal deposit, for example, institutions like the British Library, the Bibliothèque nationale de France and the Royal Danish Library would simply not be able to undertake the annual domain crawls which underpin current scholarship and will enable future research. Legal deposit allows for a degree of comprehensiveness and sustainability in web archiving that would otherwise be difficult, if not impossible, to achieve.

That’s a rather long-winded way of saying that I am very glad we have legal deposit legislation! But current legal frameworks are not without problems, particularly in relation to openness. In the UK we have a national web archive thanks to legal deposit, but it is also thanks to legal deposit that it is very hard to access and use it. I think researchers who use web archives, or who think that they might do so in the future, have a duty to advocate for increased access, for a greater degree of openness. In Denmark, for example, university-based researchers may request remote access to their national archive if they can demonstrate that their project requires it. This is at least a start to opening up these wonderful collections, and it is to be hoped that we will eventually see something like this in the UK, as legislation begins to catch up with technology.

Perhaps one of the most immediately frustrating issues arising from legal deposit is the fact that a web page is effectively treated like a printed book, in that two people are not allowed to access the same archived web page simultaneously (this is true for ebooks as well). This seems to me to be an unsustainable position, and at the very least it severely limits the potential for using web archives in teaching.

OLH: What are the legal, technological, and political limitations on the free access of content from the archived web?

JW: There are so many that it is hard to know where to start! I’ve already mentioned some of the legal limitations that researchers and others face if they want to work with web archives, but the technological difficulties can be overwhelming. One of the major problems is that a web archive can seem like the archetypal black box.

The vagaries of the crawl processes can lead to failure to capture the complete content of a website, to multiple captures of the same web page while another is overlooked, to temporal incoherence as a rapidly changing site is archived—but none of this is readily apparent to the researcher. Openness about the processes of harvesting, alongside openness of content, would make a huge difference to our ability to engage critically with web archives.

OLH: The level of open access to web archives varies a great deal depending on national framework. What historical factors have influenced their formation?

JW: I think one of the key factors influencing national differences has been whether or not legal deposit legislation exists. I’ve mentioned the problems associated with legal deposit, but it has often been key to instituting some kind of national web archiving programme. There are variations within this, of course. In some countries, legal deposit has existed for a very long time, and has required adaptation; in others, it is relatively new and so has been able to accommodate digital materials more readily. Perhaps rather counterintuitively, in the UK it is Crown Copyright and the Public Records Act (1958 and subsequent revisions) which have provided the most effective legislative framework for open web archiving.

The UK Government Web Archive is openly available, and the bulk of the material within it may be reused and republished under an Open Government Licence. Web archives which are not required to abide by national legal frameworks have more freedom to operate, for example remaining open but operating under a take-down policy, but conversely they do not have a statutory obligation to archive a nation’s web for posterity.

OLH: Can we trust that what we see in a web archive is what the internet “really” looked like on a given day? Is this possible?

JW: The short answer is no! How we deal with this is part of emerging web historiography, and I think we are still at the stage of identifying the problems more than providing solutions. An archived web page with a notional date of, for example, 1 January 2006, may in fact have been pieced together from elements archived at different intervals. An embedded video may have been missed by the web crawler on its first pass, but collected a couple of months later and subsequently patched in to the archived page viewed by a researcher. Niels Brügger has identified the problem of archiving particularly large and dynamic websites such as online newspapers. A news website with a single date of capture may, for example, include contradictory scores for a football match on its home and sport pages, as a goal is scored while the capture is in progress and the online information updated in real time.

Web archivists are also very conscious of the fact that they are not capturing any one individual’s experience of looking at a particular web page. The web archive does not reflect personalised advertising, for example, nor capture whether someone was viewing a web page on a tablet or phone rather than on a desktop PC. The version of a page that ends up in the web archive may be one that was viewed by no single individual on the live web.
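As an editorial illustration of the temporal incoherence described above, the following minimal sketch measures how far the elements of a single archived page drift from its notional capture date. Everything in it is an assumption made for illustration: the element names, the capture times, the 14-digit `YYYYMMDDhhmmss` timestamp format and the function names are hypothetical and do not come from any particular web archive's tools or data.

```python
from datetime import datetime

# Hypothetical example data: the notional date of an archived page and the
# moment each of its elements was actually captured. Timestamps are assumed
# to be 14-digit strings in the form YYYYMMDDhhmmss.
NOMINAL = "20060101120000"
ELEMENTS = {
    "index.html": "20060101120000",
    "style.css": "20060101120004",
    "banner.jpg": "20051228093000",
    "clip.flv": "20060305181500",  # e.g. a video patched in months later
}


def parse(timestamp: str) -> datetime:
    """Convert a 14-digit timestamp string into a datetime."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def capture_drift(nominal: str, elements: dict) -> dict:
    """Return each element's offset from the nominal page date, in days.

    Negative values mean the element was captured before the nominal date.
    """
    base = parse(nominal)
    return {
        name: (parse(ts) - base).total_seconds() / 86400
        for name, ts in elements.items()
    }


def temporal_spread_days(elements: dict) -> float:
    """Return the gap, in days, between the earliest and latest capture."""
    times = [parse(ts) for ts in elements.values()]
    return (max(times) - min(times)).total_seconds() / 86400


if __name__ == "__main__":
    for name, days in capture_drift(NOMINAL, ELEMENTS).items():
        print(f"{name}: {days:+.2f} days from nominal date")
    print(f"total spread: {temporal_spread_days(ELEMENTS):.2f} days")
```

A large spread does not make an archived page unusable, but it is the kind of information a researcher would need to see before treating the page as a snapshot of a single day.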

OLH: What mistakes are scholarly content creators making today that could come back to haunt them in the future?

JW: I don’t think many of those responsible for creating scholarly content online are yet thinking about the need to build a web resource which can easily be archived. There is, rightly, a lot of discussion about the long-term preservation of data, but much less about the preservation of entire functioning websites. There are some useful tools emerging to help content creators test how well their website will archive, for example http://archiveready.com/ (accessed 29 April 2018), but they are not very widely known.

OLH: In a culture where the dominant corporate desire is to “preserve everything”, what do you think of the role of selectivity in the present? How good are we at selecting what to save and what to let go?

JW: I think I would argue that the dominant corporate desire is to “preserve everything” so long as it is likely to deliver commercial value. Once digital data becomes more trouble than it’s worth to keep—for example if the cost of storage and preservation outstrips its potential to be monetised—there is a huge risk that what we tend to refer to as “archives” will simply be lost. That’s one reason for the huge value of the work done in national libraries and archives, which are not susceptible to the same commercial pressures (even if they have other resourcing issues to face). That’s a slight tangent to your question, but we do need to have public conversations about the impossibility of keeping everything, even if that were desirable.

In some ways, volume poses more of a problem than scarcity. We have only ever had very partial access to the historical record, making do with what has survived largely by chance, and I don’t think we should really expect any different with digital data. It is, however, currently much more difficult to make informed decisions about selection for digital than paper archives. The first annual domain crawl in the UK took place in 2014 and resulted in the collection of 2.5 billion web pages and other assets (56TB of data). Each year since then the size of the archived web snapshot has increased. It is hard to do more than collect the data and hope that technology will develop in such a way that effective archival and selection processes can be implemented in the future.

We worry too much about what we are losing rather than celebrating what we are managing to keep. Having said that, it makes me happy that the 2014 crawl includes 4.75GB of viruses, which can also be studied by future historians.

OLH: At the risk of being provocative, is it fair to say that most historians aren't yet interested in digital resources such as the web archive, or tech-savvy enough to use them? If so, what do they think of the digital work that you do?

JW: I think it’s only a little unfair! I’ve found that people still tend not to be very aware of web archives, but when you begin to explain the potential they quickly see the importance of work to preserve what I have already described as a new kind of primary source. The problem then becomes that it’s not possible to answer the kinds of research questions they would like to ask, either because we don’t yet have the right tools or because access to the data is too restricted. There is no doubt that this is difficult data to work with, particularly if your research requires a quantitative approach.

There is, however, a small but growing community of researchers now engaging critically with web and internet histories, and the first books and journals are beginning to be published.[3] One of the best things about working in this area has been the openness and collegiality of both researchers and practitioners who deal with web archives.

OLH: What will it take to normalise web archives as accessible scholarly source material?

JW: That’s a very easy one to answer: open access to national legal deposit collections. Very few people are going to become web archive specialists, but they will increasingly want to include web archives among the many other sources they work with. Anyone using newspapers to explore the history of the UK in the late 20th century would not want to use solely print materials—and if they did they would get a rather distorted view of how people were receiving information.

I want to be able to access the archived web from my desk, as I’m reading a book, consulting some digitised papers or building a bibliography. And I want to be able to download and work with data derived from the archived web, not just look at static archived web pages. Web archives need to be available and accessible where people are going to want to use them, as part of the researcher’s toolkit.

OLH: Our thanks for your time, Jane!

Join us again soon for more #EmpowOA Open Insights.


[1] The full project report, published in 2006, is available at http://www.history.ac.uk/sites/history.ac.uk/files/Peer_review_report2006.pdf (accessed 29 April 2018).

[2] See P. Webster and J. Winters, ‘Report on the “Missing links: the enduring web” conference’ (August 2009) http://ihr-history.blogspot.co.uk/2009/08/report-on-missing-links-enduring-web.html (accessed 29 April 2018).

[3] See, for example, the journal Internet Histories; Web 25: Histories from the First 25 Years of the World Wide Web, ed. N. Brügger (New York: Peter Lang, 2017); The Web as History: Using Web Archives to Understand the Past and Present, ed. N. Brügger and R. Schroeder (London: UCL Press, 2017) [available OA]; The SAGE Handbook of Web History, ed. N. Brügger and I. Milligan (forthcoming, 2018).