The following is part of a series on web archiving at the UBC Library. This post was written by Sarah Gallagher, a graduate student in the Master of Library and Information Science program at UBC’s iSchool. For all posts about web archiving, please see https://digitize.library.ubc.ca/tag/web-archiving/
From January through April 2021, I worked with the Digitization Centre as a Professional Experience student to expand the COVID-19 Pandemic and the University of British Columbia Web Archive. This post is an account of my experience, including the new skills I gained and the challenges I encountered during this project.
As the COVID-19 pandemic began spreading throughout British Columbia and the world in March of 2020, its impacts on the UBC community were felt immediately. Buildings across campus shut down, student activities came to a halt, and courses moved online. The University and its faculties needed to be able to alert students and staff quickly, and many new webpages were created to keep the university's communities informed. The team at the UBC Library Digitization Centre and the Humanities and Social Sciences Library quickly realized that this was something that needed to be archived and set to work compiling a collection of webpages related to COVID-19 and its impact on the UBC community.
When I joined this project in January of 2021, the collection already included a range of webpages that had been crawled and archived, and my goal was to expand on it. Web archiving was something I hadn’t done before, so before getting started I needed to learn how to use the tools. The UBC Digitization Centre uses a tool called Archive-It to crawl webpages for preservation. I trained myself on this tool and along the way learned a lot about how web archiving works. During this time, I also searched for webpages that hadn’t yet been archived into the collection. I gained a working knowledge of Archive-It and proposed over 30 new webpages to be added to the collection as seeds (the web archiving term for the URLs a crawl starts from). My next step was to capture the content of these pages and add them to the collection.
This step in the process required a lot of trial and error. Capturing webpages can be tricky, as you need to set a proper scope to capture just the right amount of information. If there’s a problem with the scoping, the crawl may capture either far too little or far too much. Crawling webpages can also use up a lot of data storage, so I had to be mindful of this when performing my test crawls. Quality assurance checks took up a large portion of my time on this project, as I had to examine each test crawl, adjust the scope if necessary, re-crawl the seed, and repeat the process. Although this took a lot of time, it was needed to ensure the collection would be captured and archived correctly.
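Archive-It manages scoping through its web interface rather than through code, but the underlying idea can be sketched in a few lines. The Python below is a simplified illustration of how a crawler might decide whether a discovered URL belongs in a crawl; the seed URL and scoping rules shown are hypothetical examples, not the actual settings used for this collection.

```python
from urllib.parse import urlparse

# Hypothetical seed and scoping rules, for illustration only; these are
# not the actual settings used for the UBC collection.
SEED = "https://covid19.ubc.ca/"
ALLOWED_HOSTS = {"covid19.ubc.ca"}
BLOCKED_PATTERNS = ["/calendar/", "?share="]  # common crawler traps

def in_scope(url: str) -> bool:
    """Decide whether a discovered URL should be captured by the crawl."""
    if urlparse(url).netloc not in ALLOWED_HOSTS:
        return False  # off-site links are out of scope
    if any(pattern in url for pattern in BLOCKED_PATTERNS):
        return False  # skip pages that would balloon the crawl
    return url.startswith(SEED)  # otherwise, stay under the seed's path

print(in_scope(SEED + "updates/page-1"))    # True: within the seed
print(in_scope("https://twitter.com/ubc"))  # False: different host
print(in_scope(SEED + "events/calendar/"))  # False: blocked pattern
```

Too narrow a rule set misses content the page depends on; too broad a rule set pulls in the whole web, which is why each adjustment has to be followed by a fresh test crawl.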
After performing my test crawls, I began creating the metadata for the entire collection, following the standards already set by the Digitization Centre. The metadata helps users identify key aspects of each seed, such as who created it, its official title, and the genre it fits into. At the conclusion of the project, I performed one last quality assurance check to make sure every URL included in the metadata matched the URL of its seed, and then performed a mass upload of metadata for the entire collection.
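That final URL-matching pass is simple enough to sketch. The Python below shows a minimal version of the idea, with hypothetical file, field, and seed names; it flags any metadata row whose URL does not correspond to a seed in the collection.

```python
import csv

# Hypothetical seed list and metadata file name, for illustration only;
# these URLs are examples, not the collection's actual contents.
collection_seeds = {
    "https://covid19.ubc.ca/",
    "https://students.ubc.ca/covid19",
}

with open("seed_metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        url = row["URL"].strip()  # assumes a column named "URL"
        if url not in collection_seeds:
            print(f"No matching seed for metadata URL: {url}")
```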
When I started working on this project, my supervisors and I established a goal of beginning to capture social media content for the collection. As much as UBC has been updating students through its websites, social media platforms such as Twitter can be a very quick and effective way to reach students, and many faculties were using them. I identified over 20 Twitter keyword and hashtag searches to capture for the collection, but the process of capturing them proved difficult. I ran several test crawls, but they either failed to capture all the information on the page or took up far too much of the data storage budget. In addition to Archive-It, I also tried a tool called Conifer to capture the Twitter feeds, but this was ultimately unsuccessful as well. By the end of my project, I hadn’t managed to capture Twitter content for our collection, but I’m hopeful that the list of potential Twitter seeds I created will allow a future student working on this collection to continue the attempt.
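For reference, a Twitter keyword or hashtag search can be expressed as a URL and proposed as a seed like any other webpage. The sketch below builds such URLs from example terms, similar in spirit to the seed list I compiled; the terms shown are illustrative, not my actual list.

```python
from urllib.parse import quote_plus

# Hypothetical example search terms, for illustration only.
search_terms = ["#UBCcovid", "UBC COVID-19 update"]

# Build Twitter search URLs that could be proposed as seeds.
for term in search_terms:
    print(f"https://twitter.com/search?q={quote_plus(term)}")
```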
Although I wasn’t successful in capturing Twitter content, I was able to capture YouTube pages for the collection. Numerous UBC faculties have used their YouTube pages to post informational videos about the spread of COVID-19, provide updates to their students and staff, and even upload talks given by visiting lecturers on the possible long-term effects of the pandemic. Capturing these pages was another case of trial and error, as video content was something I hadn’t dealt with in my previous captures. However, I was able to successfully capture the YouTube pages and add them to the collection. Despite the difficulties I encountered with Twitter, I am satisfied that capturing YouTube brought social media content into the collection.
Overall, I really enjoyed my time working on this project. I believe I have added value to the COVID-19 Pandemic and the University of British Columbia collection while picking up new skills that I can take with me into my future career. This project was a great experience, and I’m excited to see what future Professional Experience students add to it!