The Process
The BC Bibliography metadata schema was drafted using the Dublin Core standard, which is natively supported by our content management system, CONTENTdm. The following table shows the collection field names and the corresponding DC and MARC values we used to map the data.
| Displayed Field Name | DC Mapping | MARC | Notes |
|---|---|---|---|
| Title | dc.title | 245 | |
| Main Author | dc.creator | 100 | Last name, first name |
| Corporate Author | dc.creator | 110 (710?) | |
| Contributors | dc.contributor | 7xx | Any additional names, authors, etc. |
| Language(s) | dc.language | – | Use natural language |
| Publication date | dc.coverage | 260c | |
| Sort date | dc.date | – | Hidden field (use YYYY) |
| Publisher | dc.publisher | 260b | |
| Place of Publication | | 260a | |
| Subject | dc.subject | 650, 651, 610, 600 | |
| Subject – Geographic | dc.coverage | | Place names only |
| Physical Description | | 300 | |
| Description | dc.description | 500, 246 | Include scope note from the bibliography |
| Genre | dc.type | | Use controlled vocab in CDM |
| Original from | dc.source | | Contributing institution name (no abbrev.) |
| Locate a Print Version | Identifier.print | | WorldCat record URL (containing OCLC Accession #); UBC catalogue record URL if no WorldCat record exists |
| Digital Image format | dc.format | | Use IANA MIME Media Types vocabulary |
| Digital content type | dc.type | | Use DCMI Type Vocabulary |
| Collection website | Relation.isPartOf | | URL for BC Bib |
| Rights | dc.rights | | Images provided for research and reference use only. For permission to publish, copy, or otherwise distribute these images please contact digital.initiatives@ubc.ca |
| Edition | | | |
| Forms Part of | Relation.isPartOf | | For serial runs |
| BC Bibliography id | | | From BC Bib Book Index (e.g. "I-141") |
| Transcript | | | |
| Permissions | | | |
| OCLC Number | | | New OCLC Accession # assigned to the digital version (NOT the OCLC # for the print version) |
| Directory | | | |
Collecting Data
The process we created involved several steps, as data from different sources had to be converted to a common format before it could be combined and normalized. Much of the metadata began in MARC format, including the catalogue data from contributing institutions and WorldCat. We also scanned and OCR'd the original entries from the print bibliographies and had them fielded and mapped to MARC. This proved fortunate because although MARC's numeric codes are largely incomprehensible to most human beings, they are easily read by machines, especially when talented developers have already done the heavy lifting and are good enough to share their work. Using the MARC/Perl library, we wrote a Perl script to extract the fields we wanted in the required order. The script also breaks the Library of Congress subject headings into their constituent terms, removes duplicate terms, and returns the results in a semicolon-separated format suitable for the intended faceted browsing interface. For ease of use, the Perl script was embedded in a right-click Service in OS X that any user could apply to a selected file.
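To give a sense of the approach, here is a minimal sketch (not our production script) using the MARC::Batch and MARC::Record modules from MARC/Perl. The input file name, chosen fields, and subfield codes are illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;

# Read a file of binary MARC records (file name is illustrative).
my $batch = MARC::Batch->new('USMARC', 'records.mrc');

while (my $record = $batch->next()) {
    # title() and author() are convenience methods on MARC::Record.
    my $title  = $record->title()  || '';
    my $author = $record->author() || '';

    # Break the LC subject headings (6xx fields) into their
    # constituent terms and drop duplicates, preserving order.
    my (%seen, @subjects);
    for my $field ($record->field('6..')) {
        for my $subfield ($field->subfields()) {
            my ($code, $term) = @$subfield;
            next unless $code =~ /^[avxyz]$/;  # terms and subdivisions (illustrative choice)
            $term =~ s/[.,]\s*$//;             # trim trailing punctuation
            push @subjects, $term unless $seen{$term}++;
        }
    }

    # One tab-delimited row per record, with subjects semicolon-separated.
    print join("\t", $title, $author, join('; ', @subjects)), "\n";
}
```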
When metadata was unavailable in MARC format, it was sometimes necessary to harvest data from catalogue websites. To avoid unnecessary and mind-numbing copying and pasting, we used the iMacros plugin for Firefox to create simple web scraping scripts that output the required data in an order and format consistent with the output of our Perl script.
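As a rough sketch of what these macros looked like (the URL and page attributes below are placeholders, not our actual targets), an iMacros script to pull a couple of fields from a record page might be:

```
' Hypothetical example: extract two fields from a catalogue record
' page and append them to an extract file. The URL and ATTR targets
' are placeholders for the real catalogue pages.
SET !EXTRACT_TEST_POPUP NO
URL GOTO=http://catalogue.example.edu/record/12345
TAG POS=1 TYPE=TD ATTR=CLASS:bibTitle EXTRACT=TXT
TAG POS=1 TYPE=TD ATTR=CLASS:bibAuthor EXTRACT=TXT
SAVEAS TYPE=EXTRACT FOLDER=* FILE=catalogue-data.csv
```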
Combining Data
Ultimately this resulted in a tab-delimited text file from each source (print bibliographies, WorldCat, and institutional catalogue records) that contained the metadata we wanted in the appropriate fields and order, but the files still needed to be combined into a single record for each item. We accomplished this using a Microsoft Excel template that allowed us to create a master metadata sheet from all three source files. We chose Excel over other solutions because it allowed us to specify exactly how each field was combined, and it gave us a chance to manually review and edit the output from the MARC/Perl script and the web scraping. The template includes a tab for each source file and a "final" tab for the combined values.
To avoid duplicates, data from a single source was preferred for most fields, with the alternative sources used only when the "preferred" source had no data in the field. The order of preference varied by field, since some sources had consistently better data in certain fields, but for the most part it was print bibliographies, then WorldCat, then institutional catalogues. However, to provide the broadest range of access points for faceted browsing, the Subject fields from all three sources were included in the "final" tab. This had the downside of introducing numerous duplicate entries into the field, but we were able to eliminate these during the normalization process.
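As a sketch of how a "final" tab cell can implement this fall-through (sheet and column references here are hypothetical), a nested IF takes the preferred source first and falls back when it is empty, while the Subject column simply concatenates all three sources:

```
=IF(Bibliography!B2<>"", Bibliography!B2, IF(WorldCat!B2<>"", WorldCat!B2, Catalogue!B2))
=Bibliography!G2 & "; " & WorldCat!G2 & "; " & Catalogue!G2
```

The concatenated Subject column deliberately leaves the duplicates in place for the normalization pass described below.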
Cleanup and Normalization
Once all the data was combined into a single Excel spreadsheet, it was normalized, and the remaining duplicate field entries were located and removed, using Google Refine. This niche application allowed us to split the duplicate-cluttered Subject field on its semicolon separators, group the terms, and remove the duplicates. It also provided the tools to efficiently clean and normalize the rest of our dataset. FreeYourMetadata.org has a good guide on cleaning up your data, which we used to create our own workflow. In addition to their steps, we used Google Refine's clustering tools to identify and eliminate inconsistencies across our data entries (e.g. "Burns, W.E." vs. "Burns, William Ernest"). The Main Author, Corporate Author, Publisher, Place of Publication, Subject, and Subject – Geographic fields were all cleaned and normalized using these methods.
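For example, splitting and de-duplicating the Subject field can be done in a single Google Refine cell transform (a GREL sketch, assuming the terms are semicolon-separated as in our export):

```
value.split(/;\s*/).uniques().join('; ')
```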
From Google Refine the data was transferred back to Excel for proofreading and for adding the directory data needed for input into CONTENTdm. The data was exported in tab-delimited text format and uploaded along with the images into the CDM Project Client, where the remaining Genre fields were applied to each entry using a custom controlled vocabulary saved in CONTENTdm. Finally, the BC Bibliography viewer pulls data from our CONTENTdm collection and splits it into facets for easy browsing and viewing alongside the collection images.
If you or your organization are interested in more details or have any questions or suggestions on how we might improve our workflow, please get in touch!