This is a series on web archiving at the UBC Library using the Internet Archive’s Archive-It service. For all posts about web archiving, please see https://digitize.library.ubc.ca/tag/web-archiving/
Proposal
Web archiving projects can be proposed by anyone: community members, students, researchers, librarians, etc. We have also completed larger collaborative projects proposed by a group of libraries, such as the 2017 B.C. Wildfires collection. Each proposal is reviewed by Digitization Centre librarians as well as by subject-matter experts whom we identify as collaborative or consulting partners. For example, the Site C Dam is a hydroelectric project managed by B.C. Hydro, a Crown corporation, and has a significant impact on the province’s First Nations groups. For this reason, the proposal for our B.C. Hydro Site C Dam web archiving collection was evaluated and later developed with the UBC Library’s Government Publications Librarian, Susan Paterson, and Aboriginal Engagement Librarian, Sarah Dupont.
Evaluation
To be selected as a candidate for web archiving, websites are evaluated on their risk of disappearance, their originality, their availability in other web archiving collections, and copyright considerations, among other factors. While we are currently updating our collection policy for web archiving, potential collections and websites are evaluated on their intrinsic value and their significance to researchers, students, and the broader community. We aim to capture content relevant to the wide range of subject areas taught and researched at UBC, as well as content that contributes to the institutional memory of the university.
Resources
Web archiving projects are resource-intensive. Websites are assessed for how extensive we anticipate the captured content will be, and therefore how much of our subscription’s data storage it will consume. We also consider how much time the project will take and who is available to undertake the work within the required time frame. Some projects need to respond to events unfolding in real time, such as a political rally or a catastrophic event like an earthquake, with resources required to identify the content and set up the crawls immediately. While we have been fortunate to have a number of iSchool students interested in working on our web archiving projects, they may already be committed to other work.
Technical considerations
Websites are constructed in many different ways, with a range of elements that dictate how a site behaves. Before starting a project we consider archive-friendliness: how easily the content, structure, functionality, and presentation of a site can be captured with Archive-It. Dynamic content, meaning anything that relies on human interaction to function or that draws on a database, can be problematic, and web crawlers often have trouble capturing elements of pages built with JavaScript or Flash. While crawls can be customized in certain cases, a working configuration can take time to construct and success is not guaranteed.
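To see why JavaScript-heavy pages are difficult, note that a crawler essentially receives only the raw HTML the server returns; anything a script renders after the page loads is invisible to it. The sketch below is a hypothetical pre-crawl check, not part of our actual workflow, and the URL and phrase are placeholders: it fetches a page and tests whether text you can see in the browser actually appears in the raw HTML.

```python
import requests

# Hypothetical archive-friendliness check: does the content a reader
# sees in the browser exist in the raw HTML a crawler would receive?
SEED_URL = "https://example.org/some-page"  # placeholder seed URL
VISIBLE_PHRASE = "Site C Dam"               # text visible in the browser

response = requests.get(SEED_URL, timeout=30)
response.raise_for_status()

if VISIBLE_PHRASE in response.text:
    print("Phrase found in raw HTML: likely crawler-friendly.")
else:
    print("Phrase missing from raw HTML: content is probably rendered")
    print("by JavaScript, so the crawl may need special configuration.")
```

If the phrase is missing, the content is being assembled in the browser rather than delivered by the server, which is exactly the situation where extra crawl customization, and extra time, is needed.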
Metadata
Aside from capturing the web content itself, we create metadata for each collection and for each website (or “seed”) that we capture; this often includes a description of the collection and of the seed, as well as the creator of the seed and subject terms for the content, such as “Environmental protection” for a seed in the Site C Dam collection.
This metadata provides context for why the seed was included in the collection, and helps users discover content relevant to their interests when searching in Archive-It.
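For illustration, a seed-level record along these lines might look like the following sketch. The field names mirror the Dublin Core-style elements Archive-It supports, but the URL and values here are invented for this example, not taken from the actual collection.

```python
# Hypothetical seed-level metadata record; values are illustrative only.
seed_metadata = {
    "url": "https://example.org/sitec-news",  # placeholder seed URL
    "title": "Site C Dam news coverage",
    "creator": "Example News Organization",
    "subject": ["Environmental protection", "Hydroelectric power plants"],
    "description": (
        "News coverage of the Site C Dam project, included to document "
        "public debate around the development."
    ),
    "collection": "B.C. Hydro Site C Dam",
}

for field, value in seed_metadata.items():
    print(f"{field}: {value}")
```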
By Larissa Ringham, Digital Projects Librarian