The Web as History: Using Web Archives to Understand the Past and the Present

The World Wide Web has now been in use for more than 20 years. From early browsers to today’s principal source of information, entertainment and much else, the Web is an integral part of our daily lives, to the extent that some people believe ‘if it’s not online, it doesn’t exist.’ While this statement is not entirely true, it is becoming increasingly accurate, and reflects the Web’s role as an indispensable treasure trove. It is curious, therefore, that historians and social scientists have thus far made little use of the Web to investigate historical patterns of culture and society, despite making good use of letters, novels, newspapers, radio and television programmes, and other pre-digital artefacts.

This volume argues that now is the time to question what we have learnt from the Web so far. The 12 chapters explore this topic from a number of interdisciplinary angles – through histories of national web spaces and case studies of different government and media domains – as well as an introduction that provides an overview of this exciting new area of research.

Praise for The Web as History

'Essential reading both for understanding how historians would like to use the web archives we have been assembling and for hinting at how archival theory and practice can engage with a much richer conception of what archiving the Web means.'
The American Archivist

'The Web as History is a timely and topical collection jam-packed with interesting research and creative methodological discussions. I am convinced many humanities and social sciences researchers working in similar areas and historians venturing into this field, but also students on different levels – interested in the history of the Web or issues of method – will greatly benefit from reading this volume.'
Nordicom Review

'This book is definitely useful for anyone who wants to analyze site content, or who thinks about how the content of the Internet can be archived at all... [of interest to] anyone who is interested in the Internet as a social phenomenon'
Journal Czech Society (Review translated from Czech)

‘[The Web as History] has shared the first fruit of research and moved on from discussing the impediments to working with web archives. It is a starting point and a fascinating indication of what the enormous richness of the archived web has to offer.’
Internet Histories

‘No other work as cohesively, clearly, forcefully and successfully argues for the Web’s centrality in contemporary society and social science. While scholars of new media tend to turn their attention to the newest and latest new media phenomena, the Web is and will continue to be crucial to understanding online phenomena generally and, just as critically, providing a record of online discourse and events.’ Steve Jones, UIC Distinguished Professor of Communication, University of Illinois at Chicago


Product Details

ISBN-13: 9781911307587
Publisher: U C L Press, Limited
Publication date: 03/06/2017
Sold by: Barnes & Noble
Format: eBook
Pages: 296
File size: 9 MB

About the Author

Niels Brügger is Professor and Head of the Centre for Internet Studies and of the internet research infrastructure NetLab, Aarhus University. He is co-founder and Managing Editor of the international journal, Internet Histories: Digital Technology, Culture and Society. Recent publications include Histories of Public Service Broadcasters on the Web (edited with Burns, 2012), and Web25, a themed issue of New Media & Society.

Ralph Schroeder is a Professor at the Oxford Internet Institute. Before coming to Oxford University, he was Professor at Chalmers University in Gothenburg. His recent books are Rethinking Science, Technology and Social Change (2007) and, co-authored with Eric T. Meyer, Knowledge Machines: Digital Transformations of the Sciences and Humanities (2015).

Read an Excerpt

The Web as History

Using Web Archives to Understand the Past and the Present


By Niels Brügger

UCL Press

Copyright © 2017 Contributors
All rights reserved.
ISBN: 978-1-911307-58-7



CHAPTER 1

Analysing the UK web domain and exploring 15 years of UK universities on the web

Eric T. Meyer, Taha Yasseri, Scott A. Hale, Josh Cowls, Ralph Schroeder and Helen Margetts


Introduction

The World Wide Web is enormous and in constant flux, with more web content lost to time than is currently accessible via the live web. The growing body of archived web material available to researchers is potentially immensely valuable as a record of important aspects of modern society, but there have previously been few tools available to facilitate research using archived web materials (Dougherty and Meyer, 2014). Furthermore, based on the many talks we have given over the years to a variety of audiences, some researchers are not even aware of the existence of web archives or their possible uses. However, with the development of new tools and techniques such as those used in this chapter and others in this volume, the use of web archives to understand the history of the web itself and shed light on broader changes in society is emerging as a promising research area (Dougherty et al., 2010). The web is likely to provide insight into social changes just as other historical artefacts, such as newspapers and books, have done for scholars interested in the pre-digital world. As the web becomes increasingly embedded in all spheres of everyday life and the number of web pages continues to grow, there is a compelling case to be made for examining changes in both the structure and content of the web. However, while interfaces such as the Wayback Machine allow access to individual web pages one at a time, there have been relatively few attempts to work with large collections of web archive data using computational approaches across the corpus.

The research presented in this chapter used hyperlink data extracted from the Jisc UK Web Domain Dataset (Jisc, n.d.-a) covering the period from 1996 to 2010 to undertake a longitudinal analysis of the United Kingdom (UK) national web domain, .uk, focusing on the four largest second-level domains: .co.uk, .org.uk, .gov.uk, and .ac.uk. We explore the growth of these domains, and examine the link density within and between them. Next we look in more detail at the academic second-level domain, .ac.uk, to understand the relationship between link density among UK academic institutions and measures of affiliation, status, performance and geographic distance. Overall, these results are used both to understand the growth and structure of the .uk domain and to demonstrate the benefits and challenges of this type of analysis more generally.


Background

Archiving national web domains

National web domains represent one approach to web archive analysis for researchers seeking an overview of a single country's web presence (Brügger, 2011). Any particular national web domain offers the potential of both diversity and completeness in its coverage (Baeza-Yates et al., 2007), although there are limitations in terms of generalizability beyond the country in question and frequently in terms of the completeness of the analysis based on technical factors (see section on the UK web domain below). At the same time, limiting the focus to a single country reduces the number of contextual differences (such as multiple dominant languages, different internet and broadband penetration rates, different degrees of political openness and so forth), and thus is a sound strategy for demonstrating the potential of this new type of analysis.

Research in this area is at an early stage, and there are conceptual challenges associated with analysing national web domains. The content and structure of country-code top-level domains (ccTLDs), such as .uk for the UK and .fr for France, are governed more by tradition than rules (Masanès, 2006), complicating efforts to reach a comprehensive definition of what they represent. Brügger (2014) discusses the difficulty, for example, of deciding how national presences should be delimited. In the case presented here, the domain name .uk is used, but this does not cover all the web pages originating in the UK, as it is possible for UK companies, organizations and individuals to use generic top-level domains (.com, .org, etc.) or those assigned elsewhere. Moreover, addresses ending in .uk are also used for websites which arguably belong to a different country, as when multinational companies headquartered outside the UK have affiliates within the UK with a .uk address. Finally, it might be contended that not only web pages with a .uk address should be examined, but also those that link to and from these web pages. However, for the purposes of this research, these limitations can mostly be noted for future research and do not seriously limit the ability to understand the broad patterns within the UK national web presence. Furthermore, when we focus on UK universities, as we do in the later part of this chapter, we avoid both false positives and false negatives, as the academic domain (.ac.uk) is stable and predictable in a way that the commercial domains are not. Essentially, all universities in the United Kingdom have a main address in the .ac.uk domain, and almost all addresses in the .ac.uk domain are universities (with a few exceptions for academic-affiliated organizations that are not themselves universities).

Another issue that must be decided when undertaking analysis of web domains is the appropriate level of detail. This includes the temporal resolution to use for analysis (since, while the web is constantly changing, the number of snapshots available in Internet Archive data varies over time based on the crawl settings in place when the data were gathered). In addition, the level of detail to be extracted from web pages must be determined (i.e. the appropriate level of resolution of page content, link information, page metadata, and so forth). Previous research on the .uk ccTLD has examined monthly snapshots over a one-year period, finding that page-level hyperlinks change frequently month to month (Bordino et al., 2008). As Brügger (2013) notes, there are several reasons why archived websites are different from other archived material with respect to these details: choices must be made about what to capture, and there are also technical issues about what can be archived and how the archiving process itself shapes the later availability of the archived materials.

Previous research using national web archives

While there have been a number of papers describing the practices of constructing national web archives (see for instance Masanès, 2005; Gomes et al., 2006; Baeza-Yates et al., 2007; Zabicka and Matějka, 2007; Aubry, 2010; Hockx-Yu, 2011; Rogers et al., 2013), few report analysing national web archives with large-scale (or even medium-scale) computational methods.

Thelwall and Vaughan (2004) used data from the Internet Archive to assess international bias in the coverage of the archive's collection. At the time of their study, however, it was not possible to access the data in the archive via automated means, so they were limited to relatively small samples of between 94 and 143 websites for each of four countries (total N = 382), accessed via the public Wayback Machine interface. They determined with these methods that there was an unbalanced representation of different countries in the archive, partially explained by technical factors rather than by biased policies.

The Analytical Access to the Domain Dark Archive (AADDA) project and then later the Big Data: Demonstrating the Value of the UK Web Domain Dataset for Social Science Research project and the Big UK Domain Data for the Arts and Humanities project enabled researchers to use UK Web Archive data for analytical study. These projects also demonstrate one of the legal issues of working with web archive data: the UK web archive data held by the British Library can be made available to researchers for use, but full-text content is only available via systems at the British Library. The raw data in the ARC/WARC files cannot be moved outside the Library's computer systems. As a result, many of the demonstrator projects that came out of these bigger projects focused on more qualitative, close analysis (see for instance Gorsky, 2015; Huc-Hepher, 2015). Although this work was enabled by computational methods involving search, indexing and ontologies created by the project developers, the researchers themselves largely used the extracted results in non-computational ways (see Chapter 11). It is important to note, however, that derivative datasets such as the list of web pages in the archive and the list of hyperlinks can be distributed more widely, which enables some large-scale approaches such as the one we take in this chapter.


Another European project on Longitudinal Analytics of Web Archive Data published a number of technical reports and papers that demonstrate computational approaches to working with web archive data but, as far as we are able to determine, there have not been the same sort of domain investigations as those done using the tools we report here.

The lack of studies using web archives in general, and using large-scale computational approaches in particular, has been documented in earlier work by members of this team (Dougherty et al., 2010; Thomas et al., 2010; Meyer et al., 2011; Dougherty and Meyer, 2014). In those papers and reports, we found that there remains a disconnect between the relatively active community engaged in archiving the web, and the relative lack of any community forming around large-scale analysis of web archives. This study is in part an attempt to fill that very clear gap.


The UK web domain

The .uk country-code top-level domain is managed by the internet registrar Nominet. Below the .uk top-level domain are several second-level domains (SLDs), the largest of which are .co.uk (commercial enterprises), .org.uk (non-commercial organizations), .gov.uk (government bodies), and .ac.uk (academic establishments). This chapter examines third-level domain data such as nominet.org.uk (Nominet), fco.gov.uk (the Foreign and Commonwealth Office of the UK government), or ox.ac.uk (the University of Oxford).
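The naming hierarchy described above can be illustrated with a short Python sketch. It is not taken from the chapter: the function name `third_level_domain` and the choice to return `None` for anything outside the four named SLDs are assumptions made for illustration only.

```python
from urllib.parse import urlsplit

# The four second-level domains (SLDs) named in the chapter.
SLDS = ("co.uk", "org.uk", "gov.uk", "ac.uk")


def third_level_domain(url):
    """Return (third_level_domain, sld) for a URL under one of SLDS, else None.

    Hypothetical helper: the chapter does not describe how hostnames
    were parsed, so this is only one plausible way to do it.
    """
    host = urlsplit(url).hostname  # None when the URL has no scheme/netloc
    if host is None:
        return None
    labels = host.lower().rstrip(".").split(".")
    if len(labels) < 3:
        return None  # e.g. "example.com" has no third-level label
    sld = ".".join(labels[-2:])
    if sld not in SLDS:
        return None  # e.g. "www.bl.uk" sits directly under .uk
    return ".".join(labels[-3:]), sld
```

For instance, `third_level_domain("http://www.ox.ac.uk/admissions")` yields `("ox.ac.uk", "ac.uk")`, so every page under any Oxford subdomain collapses onto the single node ox.ac.uk.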

In the case of web archives (or indeed of other archived material which takes the approach of archiving all that can be archived, without a particular topic in mind), it is not scholarly interest in any particular topic that has set the data collection agenda. Instead it has been the goal of the archiving institution to accumulate material for the sake of preservation, leaving the question of the eventual uses of the archive data to later researchers. This means that the scope of the archived material and the level of detail available, as with other historical materials, is a function of the archiving processes used to gather and store the data. Thus, unlike web archive research done on the live web using researcher-implemented data collection mechanisms (e.g. Escher et al., 2006; Foot and Schneider, 2006), for the purpose of this study the dataset itself should be seen as a given. However, it can be mentioned that the Internet Archive's data comprise the most comprehensive archive of the web available (Ainsworth et al., 2011).

It is important to note that while the Internet Archive (IA) is the most comprehensive archive of the web available, this does not mean that the IA crawls represent a fully comprehensive record of the web. The data for the 15-year period we are examining were collected using a variety of methodologies and at varying levels of granularity. Data from the earliest years came from Alexa with 'no visibility into how this data is crawled', and the IA obeys robots.txt restrictions set by site owners (Jisc, n.d.-b), which can result in some websites missing pages or even being excluded completely from the archive (see Chapter 2 by Hale et al.). The time between crawls is variable for any given page, resulting in some pages having more captures over time than others. Furthermore, the Internet Archive does not use the zone file from Nominet, which forms a complete list of all domains within .uk. Instead the Internet Archive relies on discovering websites through hyperlinks and other methods.


Data

Data preparation

The data for this study originally come from the Internet Archive, which began archiving pages from all domains in 1996 (Kahle, 1997). For the .uk domain that will be examined here, the data are sourced from copies of the approximately 30 terabytes of compressed archive data relating to the UK domain (the .uk ccTLD). Archive files were provided to the British Library by the Internet Archive with the specific purpose of creating the basis of a national archive of the web in the UK. These data form the 'Jisc UK Web Domain Dataset' (Jisc, n.d.-a). The data provided to the research team by the British Library do not include the full text of all the pages crawled due to legal restrictions on use outside the British Library, but do include the link data and other metadata extracted from the full archive.

The data were cleaned by removing error pages (e.g. 404 Not Found pages) as well as pages not within the .uk ccTLD. This resulted in a plain-text list of all page Uniform Resource Locators (URLs) remaining in the collection and the date and times they were crawled, and an additional plain-text list of all outgoing hyperlinks starting from pages within the dataset.
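The cleaning step can be sketched in Python as follows. The record layout `(url, crawl_time, http_status)` and the rule that any status of 400 or above counts as an error page are assumptions for illustration; the chapter gives 404 pages as an example but does not specify the dataset's actual format or the exact filter used.

```python
from urllib.parse import urlsplit


def is_uk_host(url):
    """True when the URL's hostname falls within the .uk ccTLD."""
    host = (urlsplit(url).hostname or "").lower().rstrip(".")
    return host == "uk" or host.endswith(".uk")


def clean_captures(captures):
    """Drop error pages and pages outside .uk.

    captures: iterable of (url, crawl_time, http_status) tuples.
    This layout is hypothetical, not the dataset's real schema.
    Returns a list of (url, crawl_time) pairs.
    """
    return [
        (url, crawl_time)
        for url, crawl_time, status in captures
        # Assumption: 4xx and 5xx responses are treated as error pages.
        if status < 400 and is_uk_host(url)
    ]
```

Matching on the `.uk` suffix with a leading dot avoids accepting unrelated hosts that merely end in the letters "uk".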

For this study, we started with this list of hyperlinks and filtered it to only include links between different third-level domains. We further grouped pages crawled at similar times (within 1,000 seconds) together and assigned the hyperlink pair a weight based on the number of hyperlinks between the two third-level domains in that time period. For each year, if there are multiple crawls within the dataset we take the crawl with the largest number of captured hyperlinks between any two domains. We also formed one list of all third-level domains present in the dataset each year and the number of pages crawled within each third-level domain. These data were loaded into Apache Hive for the analysis that we present here.
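A rough Python sketch of this aggregation is given below. The authors ran their analysis in Apache Hive, so this is not their code. It rests on three assumptions: each input row is a hypothetical `(source_domain, target_domain, unix_time)` tuple; "within 1,000 seconds" is approximated by fixed 1,000-second buckets; and the yearly rule is read as keeping, for each pair of domains, the largest weight seen in any one crawl window that year.

```python
from collections import Counter
from datetime import datetime, timezone

WINDOW_SECONDS = 1000  # grouping tolerance stated in the chapter


def yearly_link_weights(links):
    """Aggregate hyperlinks into yearly weighted domain-to-domain edges.

    links: iterable of (source_domain, target_domain, unix_time) tuples,
    where both domains are already reduced to third-level domains.
    Returns {(year, source, target): weight}.
    """
    per_window = Counter()
    for source, target, unix_time in links:
        if source == target:
            continue  # keep only links between different domains
        # Assumption: fixed buckets stand in for "crawled at similar times".
        window = int(unix_time // WINDOW_SECONDS)
        per_window[(source, target, window)] += 1

    yearly = {}
    for (source, target, window), weight in per_window.items():
        year = datetime.fromtimestamp(
            window * WINDOW_SECONDS, tz=timezone.utc
        ).year
        key = (year, source, target)
        # Assumption: keep the largest crawl-window weight per pair per year.
        yearly[key] = max(yearly.get(key, 0), weight)
    return yearly
```

Fixed buckets are simple but can split two captures that are only seconds apart across a bucket boundary; a gap-based grouping on sorted timestamps would avoid that at the cost of a little more code.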


Data analysis

In what follows, we undertake a longitudinal network analysis, charting the .uk domain and its core second-level domains over time. As Brügger (2013) points out, this type of analysis is not concerned with who produced what, nor with how the web content was used, but rather with what was created and thus 'the web which is' – or was – 'actually available to users'.

First, we present an overall longitudinal view of the second-level domains within the .uk domain. We investigate the growth of the entire domain between 1996 and 2010, broken down into its four largest constituent parts, .co.uk, .org.uk, .gov.uk, and .ac.uk. Analysis of these SLDs allows us to investigate the role of different sectors of UK society in the growth of the UK web presence.

The second section looks at the link density within and between second-level domains. We examine the internal link density of each SLD, and analyse how they interact with each other: whether, for example, there are more links between certain subdomains, and whether linking is reciprocal between domains or whether it is unbalanced.
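One way to tabulate links within and between SLDs is sketched below in Python. The excerpt does not give the formula used for link density, so the code only sums raw link weights per ordered pair of SLDs. The helper names and the `imbalance` measure are illustrative assumptions, not the authors' definitions.

```python
from collections import Counter


def sld_of(domain):
    """Reduce a third-level domain such as 'ox.ac.uk' to its SLD 'ac.uk'."""
    return ".".join(domain.split(".")[-2:])


def sld_link_matrix(edges):
    """Sum link weights for each ordered (source SLD, target SLD) pair.

    edges: iterable of (source_domain, target_domain, weight) tuples.
    Entries where both SLDs match are links within an SLD.
    """
    matrix = Counter()
    for source, target, weight in edges:
        matrix[(sld_of(source), sld_of(target))] += weight
    return matrix


def imbalance(matrix, a, b):
    """Share of all a<->b link weight that flows from a to b.

    A value of 0.5 means linking is reciprocal in volume; None means
    no links in either direction. Hypothetical measure for illustration.
    """
    forward, backward = matrix[(a, b)], matrix[(b, a)]
    total = forward + backward
    return forward / total if total else None
```

Turning these counts into a density would additionally require the number of nodes in each SLD, for example by dividing each entry by the number of possible source-target pairs.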

The third and final section of the findings takes a closer look at the academic second-level domain .ac.uk. This research builds on earlier longitudinal analyses of academic web pages, which have investigated, for example, the stability of outlinks (Thelwall et al., 2003; Payne and Thelwall, 2007). Our findings update earlier studies by extending the period of analysis to the end of 2010 and assessing the effect of new variables, including institutional affiliation, league table ranking and geographic location on link practices between different universities.


Results

Overview of growth in the .uk web domain

Figure 1.1 displays the overall growth of the .uk ccTLD, showing the total number of nodes (on a logarithmic scale) within each of the four main SLDs we analysed over the period from 1996 to 2010. The inset in the figure shows the size of the entire .uk domain (on a linear scale). There is a clear change in the growth trend around 2001 for .co.uk and .org.uk: both domains continue to increase in size, but at a slower rate. Furthermore, .ac.uk and .gov.uk seem to almost stabilize in size at around the same time.

Figure 1.2 shows the relative size of the second-level domains .co.uk, .org.uk, .ac.uk, and .gov.uk across the 15-year period, standardized as each SLD's proportion of the total nodes (i.e. domains/websites, not web pages) in the collection in each year. While these are not the only second-level domains in use within the .uk domain, they are the four largest in terms of number of nodes across the whole period.

As Figure 1.2 shows, .co.uk is the predominant second-level domain throughout the entire period, with .co.uk sites never accounting for less than 85% of the total. However, also apparent is the large proportion of governmental and, especially, academic sites in the early recorded history of the UK web. This is consistent with the role that universities played in the early establishment, adoption and development of the web (Leiner et al., 2009). Over time, however, this early presence was greatly overshadowed in terms of absolute numbers of nodes when compared to the continued growth of the .co.uk and .org.uk domains.
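The standardization behind Figure 1.2 amounts to dividing each SLD's node count by the yearly total. A minimal Python sketch follows, assuming a hypothetical input of the form `{year: {sld: node_count}}`; the function name is illustrative and no figures from the chapter are reproduced.

```python
def sld_shares(node_counts):
    """Convert yearly node counts per SLD into proportions of the yearly total.

    node_counts: {year: {sld: number_of_nodes}} (hypothetical layout).
    Returns {year: {sld: share}}, where the shares in a year sum to 1.
    """
    shares = {}
    for year, counts in node_counts.items():
        total = sum(counts.values())
        # Guard against years with no captured nodes at all.
        shares[year] = (
            {sld: n / total for sld, n in counts.items()} if total else {}
        )
    return shares
```

If the total should cover every node in the collection rather than only the four largest SLDs, the remaining SLDs need to be included in the input, for instance under an "other" key.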


(Continues...)

Excerpted from The Web as History by Niels Brügger. Copyright © 2017 Contributors. Excerpted by permission of UCL Press.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Table of Contents

Introduction: The Web as History
Ralph Schroeder and Niels Brügger

PART ONE THE SIZE AND SHAPE OF WEB DOMAINS

1. Analysing the UK web domain and exploring 15 years of UK universities on the web
Eric T. Meyer, Taha Yasseri, Scott A. Hale, Josh Cowls, Ralph Schroeder and Helen Margetts

2. Live versus archive: Comparing a web archive to a population of web pages
Scott A. Hale, Grant Blank and Victoria D. Alexander

3. Exploring the domain names of the Danish web
Niels Brügger, Ditte Laursen and Janne Nielsen

PART TWO MEDIA AND GOVERNMENT

4. The tumultuous history of news on the web
Matthew S. Weber

5. International hyperlinks in online news media
Josh Cowls and Jonathan Bright

6. From far away to a click away: The French state and public services in the 1990s
Valérie Schafer

PART THREE CULTURAL AND POLITICAL HISTORIES

7. Welcome to the web: The online community of GeoCities during the early years of the World Wide Web
Ian Milligan

8. Using the web to examine the evolution of the abortion debate in Australia, 2005–2015
Robert Ackland and Ann Evans

9. Religious discourse in the archived web: Rowan Williams, Archbishop of Canterbury, and the sharia law controversy of 2008
Peter Webster

10. ‘Taqwacore is Dead. Long Live Taqwacore’ or punk’s not dead?: Studying the online evolution of the Islamic punk scene
Meghan Dougherty

11. Cultures of the UK web
Josh Cowls

12. Coda: Web archives for humanities research – some reflections
Jane Winters
