The Process
The BC Bibliography metadata schema was drafted using the Dublin Core standard, which is natively supported by our content management system, CONTENTdm. The following table shows the collection field names and the corresponding DC and MARC values we used to map the data.
| Displayed Field Name | DC Mapping | MARC | Notes |
|---|---|---|---|
| Title | dc.title | 245 | |
| Main Author | dc.creator | 100 | Last name, first name |
| Corporate Author | dc.creator | 110 (710?) | |
| Contributors | dc.contributor | 7xx | Any additional names, authors, etc. |
| Language(s) | dc.language | – | Use natural language |
| Publication date | dc.coverage | 260c | |
| Sort date | dc.date | – | Hidden field (use YYYY) |
| Publisher | dc.publisher | 260b | |
| Place of Publication | | 260a | |
| Subject | dc.subject | 650, 651, 610, 600 | |
| Subject – Geographic | dc.coverage | | Place names only |
| Physical Description | | 300 | |
| Description | dc.description | 500, 246 | Include scope note from the bibliography |
| Genre | dc.type | | Use controlled vocab in CDM |
| Original from | dc.source | | Contributing institution name (no abbrev.) |
| Locate a Print Version | Identifier.print | | WorldCat record URL (containing OCLC Accession #); UBC catalogue record URL if no WorldCat record exists |
| Digital Image format | dc.format | | Use IANA MIME Media Types vocabulary |
| Digital content type | dc.type | | Use DCMI Type Vocabulary |
| Collection website | Relation.isPartOf | | URL for BC Bib |
| Rights | dc.rights | | Images provided for research and reference use only. For permission to publish, copy, or otherwise distribute these images please contact digital.initiatives@ubc.ca |
| Edition | | | |
| Forms Part of | Relation.isPartOf | | For serial runs |
| BC Bibliography id | | | From BC Bib Book Index (e.g. "I-141") |
| Transcript | | | |
| Permissions | | | |
| OCLC Number | | | New OCLC Accession # assigned to the digital version (NOT the OCLC # for the print version) |
| Directory | | | |
Collecting Data
The process we created involved several steps, as data from different sources had to be converted to a common format before it could be combined and normalized. Much of the metadata began in MARC format, including the catalogue data from contributing institutions and WorldCat. We also scanned and OCR'd the original entries from the print bibliographies and had them fielded and mapped to MARC. This proved fortunate because although MARC's numeric codes are largely incomprehensible to most human beings, they are easily read by machines, especially when talented developers have already done the heavy lifting and are good enough to share their work. Using the MARC/Perl library, we wrote a Perl script to extract the fields we wanted in the required order. The script also breaks the Library of Congress subject headings into their constituent terms, removes duplicate terms, and returns the results in a semicolon-separated format suitable for the intended faceted browsing interface. For ease of use, the Perl script was embedded in a right-click Service in OS X that any user could apply to a selected file.
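To give a sense of the approach, here is a minimal sketch (not our production script) using the MARC::Batch and MARC::Record modules from MARC/Perl. The input file name, chosen fields, and subfield codes are illustrative:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use MARC::Batch;

# Read a file of binary MARC records (file name is illustrative).
my $batch = MARC::Batch->new('USMARC', 'records.mrc');

while (my $record = $batch->next()) {
    # title() and author() are convenience methods on MARC::Record.
    my $title  = $record->title()  || '';
    my $author = $record->author() || '';

    # Break the LC subject headings (6xx fields) into their
    # constituent terms and drop duplicates, preserving order.
    my (%seen, @subjects);
    for my $field ($record->field('6..')) {
        for my $subfield ($field->subfields()) {
            my ($code, $term) = @$subfield;
            next unless $code =~ /^[avxyz]$/;  # terms and subdivisions (illustrative choice)
            $term =~ s/[.,]\s*$//;             # trim trailing punctuation
            push @subjects, $term unless $seen{$term}++;
        }
    }

    # One tab-delimited row per record, with subjects semicolon-separated.
    print join("\t", $title, $author, join('; ', @subjects)), "\n";
}
```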
When metadata was unavailable in MARC format, it was sometimes necessary to harvest data from catalogue websites. To avoid unnecessary and mind-numbing copying and pasting, we used the iMacros plugin for Firefox to create simple web scraping scripts that output the required data in an order and format consistent with the output of our Perl script.
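As a rough sketch of what these macros looked like (the URL and page attributes below are placeholders, not our actual targets), an iMacros script to pull a couple of fields from a record page might be:

```
' Hypothetical example: extract two fields from a catalogue record
' page and append them to an extract file. The URL and ATTR targets
' are placeholders for the real catalogue pages.
SET !EXTRACT_TEST_POPUP NO
URL GOTO=http://catalogue.example.edu/record/12345
TAG POS=1 TYPE=TD ATTR=CLASS:bibTitle EXTRACT=TXT
TAG POS=1 TYPE=TD ATTR=CLASS:bibAuthor EXTRACT=TXT
SAVEAS TYPE=EXTRACT FOLDER=* FILE=catalogue-data.csv
```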
Combining Data
Ultimately this resulted in a tab-delimited text file from each source (print bibliographies, WorldCat, and institutional catalogue records) that contained the metadata we wanted in the appropriate fields and order, but the files still needed to be combined into a single record for each item. We accomplished this using a Microsoft Excel template that allowed us to create a master metadata sheet from all three source files. We chose Excel over other solutions because it allowed us to specify exactly how each field was combined, and it gave us a chance to manually review and edit the output from the MARC/Perl script and the web scraping. The template includes a tab for each source file and a "final" tab for the combined values.
To avoid duplicates, data from a single source was preferred for most fields, with the alternative sources used only when the "preferred" source had no data in the field. The order of preference varied by field, since some sources had consistently better data in certain fields, but for the most part it was print bibliographies, then WorldCat, then institutional catalogues. However, to provide the broadest range of access points for faceted browsing, the Subject fields from all three sources were included in the "final" tab. This had the downside of introducing numerous duplicate entries into the field, but we were able to eliminate these during the normalization process.
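As a sketch of how a "final" tab cell can implement this fall-through (sheet and column references here are hypothetical), a nested IF takes the preferred source first and falls back when it is empty, while the Subject column simply concatenates all three sources:

```
=IF(Bibliography!B2<>"", Bibliography!B2, IF(WorldCat!B2<>"", WorldCat!B2, Catalogue!B2))
=Bibliography!G2 & "; " & WorldCat!G2 & "; " & Catalogue!G2
```

The concatenated Subject column deliberately leaves the duplicates in place for the normalization pass described below.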
Cleanup and Normalization
Once all the data was combined into a single Excel spreadsheet, it was normalized, and the remaining duplicate field entries were located and removed, using Google Refine. This niche application allowed us to split the duplicate-cluttered Subject field on its semicolon separators, group the terms, and remove the duplicates. It also provided the tools to efficiently clean and normalize the rest of our dataset. FreeYourMetadata.org has a good guide on cleaning up your data, which we used to create our own workflow. In addition to their steps, we used Google Refine's clustering tools to identify and eliminate inconsistencies across our data entries (e.g. "Burns, W.E." vs. "Burns, William Ernest"). The Main Author, Corporate Author, Publisher, Place of Publication, Subject, and Subject – Geographic fields were all cleaned and normalized using these methods.
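For example, splitting and de-duplicating the Subject field can be done in a single Google Refine cell transform (a GREL sketch, assuming the terms are semicolon-separated as in our export):

```
value.split(/;\s*/).uniques().join('; ')
```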
From Google Refine the data was transferred back to Excel for proofreading and for adding the directory data needed for input into CONTENTdm. The data was exported in tab-delimited text format and uploaded along with the images into the CDM Project Client, where the remaining Genre fields were applied to each entry using a custom controlled vocabulary saved in CONTENTdm. Finally, the BC Bibliography viewer pulls data from our CONTENTdm collection and splits it into facets for easy browsing and viewing alongside the collection images.
If you or your organization are interested in more details or have any questions or suggestions on how we might improve our workflow, please get in touch!