Estimates put the current size of the internet at around 1.2 million terabytes and 150 billion pages. Sites go up, sites come down, pages are removed, and content changes continuously. An increasing amount of this information is available only online. You might not care if you can no longer access the comments about watching someone’s grass grow, but you may be concerned one day to find that a political candidate has removed from their campaign website a statement opposing an important local issue – a statement that you desperately need for your research.
Fortunately, much of the content on the web today is captured by the Internet Archive, which harvests web content and makes it available through its Wayback Machine. Sites are crawled for archiving on an irregular schedule; depending on a variety of factors, such as how heavily a site is linked, the Internet Archive web crawlers may crawl a site several times a day – or only once every few months. Web content can change so frequently that unless you can specify exactly when the content on a specific site is captured, there is a chance that information will be lost forever. The Wayback Machine does what it can, but it has billions of web pages to try to crawl.
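You can check for yourself whether the Wayback Machine holds a snapshot of a given page using its public availability API, which returns JSON describing the closest archived capture. The sketch below builds the query URL and parses a response; the helper function names are ours, not part of any library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine availability endpoint.
WAYBACK_API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=None):
    """Build the availability-API query URL for a page.

    An optional timestamp (YYYYMMDDhhmmss) asks for the snapshot
    closest to that moment rather than the most recent one.
    """
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp
    return f"{WAYBACK_API}?{urlencode(params)}"

def closest_snapshot(response_json):
    """Return the URL of the closest archived snapshot, or None."""
    snap = response_json.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

if __name__ == "__main__":
    # Live lookup: was example.com captured around September 2013?
    with urlopen(availability_query("example.com", "20130901")) as resp:
        data = json.load(resp)
    print(closest_snapshot(data))
```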
Enter Archive-it. Archive-it is a subscription web archiving service that the Internet Archive created in order to give organizations like the UBC Library the ability to harvest, build, and preserve collections of digital content on demand. This service gives us control over what we crawl and how often, and allows us to apply the metadata that will permit users to find our archived web content more easily. Information can also be pulled out of our collections for analysis using Archive-it’s API. The sites we harvest are available on our institution’s Archive-it home page, and are added to the Wayback Machine’s own site crawls so that our information is full-text searchable and freely available to anyone in the world at any time.
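As one illustration of pulling data out of a collection programmatically, the sketch below targets the WASAPI data-transfer interface that Archive-it supports for listing downloadable WARC files. This is a minimal sketch under stated assumptions: the endpoint path, the `collection` and `crawl-start-after` parameters, and the response fields follow the WASAPI specification as we understand it, and authentication (HTTP Basic with partner credentials) is omitted.

```python
from urllib.parse import urlencode

# Assumed Archive-it WASAPI endpoint; verify against current
# Archive-it documentation before relying on it.
WASAPI_ENDPOINT = "https://partner.archive-it.org/wasapi/v1/webdata"

def webdata_query(collection_id, crawl_start_after=None):
    """Build a WASAPI query URL listing WARC files in one collection.

    crawl_start_after, if given, is an ISO 8601 date used to limit
    results to crawls started after that date.
    """
    params = {"collection": collection_id}
    if crawl_start_after:
        params["crawl-start-after"] = crawl_start_after
    return f"{WASAPI_ENDPOINT}?{urlencode(params)}"

def warc_locations(response_json):
    """Flatten a WASAPI response into a list of WARC download URLs."""
    return [
        location
        for record in response_json.get("files", [])
        for location in record.get("locations", [])
    ]
```

Fetching each listed location (with credentials) yields the raw WARC files, which tools such as web-archive analysis libraries can then process.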
We started web archiving in 2013, when a group of university libraries – including UBC – began collaboratively crawling Canadian federal government websites in order to capture content important to Canadians that was scheduled to be removed from the web. Since then, we have created nine collections of archived web content, with three more under active development. These collections are representative of the research interests of UBC and its community, and include such topics as the BC Hydro Site C Dam project and First Nations and Indigenous Communities websites, as well as the University of British Columbia websites themselves.
Over the next few weeks we will be exploring some aspects of our web archiving work at UBC, and will hear from some of our library partners and past students who have done work in the area. Stay tuned for posts on developing web archiving projects, archiving government web content, and the technical limitations of web archiving.
See all posts related to web archiving: https://digitize.library.ubc.ca/tag/web-archiving/.
By Larissa Ringham, Digital Projects Librarian