Dec 30, 2010 7:40am
How Montana State Library Uploaded Batches of Digital Objects to the Internet Archive
by Chris Stockwell for Montana State Library, 12/29/2010
The Montana State Library (MSL) last year moved a copy of its collection of 3000 born digital state publications to the Internet Archive (IA). Since MSL will be continuing to upload and integrate born digital publications to the Internet Archive, we encourage constructive comment. Also, MSL would be happy to answer questions about what we did. Contact the Library Information Services division.
It was a natural progression for MSL to upload and integrate its born digital state publications to the Internet Archive. The Internet Archive is already digitizing Montana’s print state publications under contract. After the items are digitized, IA provides public access to them through its free digital library branded with an MSL logo. IA is officially recognized as a library by California. Also, IA’s Archive-It team archives Montana state agency websites under contract. Montana State Library considers IA to be its institutional repository for its primary state publications collection.
How MSL Uploaded Batches of Born Digital State Publications to the Internet Archive
IA has a web-based application for uploading single digital objects to a details page in collections at IA. It does not yet have an application for uploading batches. However, IA does have utilities to support batch upload. They provide an FTP server, and they provide scriptable URL commands for integrating items uploaded to the FTP server. IA also has utilities for checking the progress of batch integration and fixing issues that may arise. And they assigned an excellent software engineer to help with what was to us a major project.
We did have to write scripts to move born digital items and metadata into the folder/file structure required by the FTP server and kick off IA’s integration process. The folder/file structure required by IA includes four payload files:
a. The digital object to be displayed;
b. files.xml file. This is a file that contains only a short XML stub. IA uses this file to record the names of and MD5 hash values for the multiple files it creates during integration;
c. marc.xml file;
d. meta.xml file.
The folder and files are named with what becomes their unique identifier at IA. For unique identifiers, MSL used GUIDs. For example, the folder name would be F517573C-A7F6-40C1-A32F-74E6768DA4FA. The files.xml would be F517573C-A7F6-40C1-A32F-74E6768DA4FA_files.xml. The beauty of GUIDs is that MSL doesn’t have to keep track of the ones already issued to ensure that each new one is unique.
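The naming scheme above can be illustrated with a minimal Python sketch. It is not MSL's actual script. The function names are hypothetical, the content of the files.xml stub is a placeholder (the text only says it is "a short xml stub"), and the marc.xml and meta.xml names are inferred by analogy with the files.xml example.

```python
import uuid
from pathlib import Path

# Assumption: placeholder content. The write-up only says files.xml holds
# "a short xml stub"; the exact stub IA expects is not given there.
FILES_XML_STUB = "<files/>\n"


def new_identifier():
    """Return an upper-case GUID such as F517573C-A7F6-40C1-A32F-74E6768DA4FA."""
    return str(uuid.uuid4()).upper()


def expected_payload(identifier, object_ext=".pdf"):
    """List the four payload file names for one item folder.

    The _marc.xml and _meta.xml names are assumed to follow the same
    pattern as the _files.xml example in the text.
    """
    return [
        f"{identifier}{object_ext}",   # the digital object to be displayed
        f"{identifier}_files.xml",     # short XML stub
        f"{identifier}_marc.xml",      # MARC record as XML
        f"{identifier}_meta.xml",      # metadata for the IA details page
    ]


def create_item_folder(batch_dir, identifier):
    """Create <batch_dir>/<identifier>/ and write the files.xml stub into it."""
    folder = Path(batch_dir) / identifier
    folder.mkdir(parents=True)
    (folder / f"{identifier}_files.xml").write_text(FILES_XML_STUB, encoding="utf-8")
    return folder
```

Because each GUID is generated independently, the sketch never consults a registry of previously issued identifiers.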
Using the IA required folder structure as the foundation, MSL divided the upload and integration process into seven steps. This made the steps simple enough that issues could be more easily identified and corrected. Because we did not know what issues we would run into, we made manual QA checks instead of automatic checks.
Step 1 - Script 1:
The born digital collection was stored as a structured dataset with the metadata in one big file, including paths to the digital objects and OCLC numbers for monographs and serials. Script 1 retrieved the OCLC number from the metadata and moved the related born digital item to its single digital object folder in the overall batch upload folder. For each batch, 100 born digital object folders were created in the upload folder. To record the upload, the OCLC number and other key metadata for the digital object were stored in a database. A GUID was generated. The folder was temporarily named oclcnumber_GUID. The digital object was named oclcnumber_GUID.pdf.
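A rough sketch of the staging logic in Step 1, assuming Python. The function names are hypothetical, and the sketch copies rather than moves the source file so the source dataset stays intact, which is a choice of this example rather than something the text specifies. Recording the item in a database is left out.

```python
import shutil
import uuid
from itertools import islice
from pathlib import Path

BATCH_SIZE = 100  # from the text: 100 digital object folders per batch


def batches(records, size=BATCH_SIZE):
    """Yield successive lists of at most `size` records."""
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def temporary_name(oclc_number, guid=None):
    """Build the temporary oclcnumber_GUID name used until Step 4."""
    guid = guid or str(uuid.uuid4()).upper()
    return f"{oclc_number}_{guid}"


def stage_item(source_path, oclc_number, batch_dir):
    """Create the item folder and place the digital object inside it."""
    source_path = Path(source_path)
    name = temporary_name(oclc_number)
    folder = Path(batch_dir) / name
    folder.mkdir(parents=True)
    # Copy (not move) so the original dataset is untouched: a sketch-level choice.
    shutil.copy2(source_path, folder / f"{name}{source_path.suffix.lower()}")
    return folder
```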
Step 2 - Script 2:
In each digital object folder, Script 2 placed a marc.xml file. To do this, the script grabbed the OCLC number from the folder name. Then the script used the OCLC number as a Z39.50 search parameter and pulled the related MARC record as an .mrc file from MSL’s SirsiDynix catalog. Script 2 then used a MarcEdit function to convert the .mrc to the marc.xml file. All hail to MarcEdit’s Terry Reese. At the completion of Script 2’s run, each folder held both the digital object and related marc.xml.
Note: Our state publications collection contains many serials, for example, annual financial reports of various state agencies. For serials display, IA has at least two native serial structures. Here’s an example of one structure. The other structure treats each serial as a separate subcollection. We elected NOT to use these structures because individual volume numbers would not display. Only the earliest issue’s publication date would display. So, MSL uploaded each digital object as a single item.
We then provided patrons with workarounds to pull the serials together for display. Serial items show up together as serials in title searches and from links in MSL’s catalog that query by OCLC number. After the patron follows the link, such query URLs pull the serial together on an IA search results page. Note, however, that the items do not sort fully by volume number, although each item has the correct volume number. To include the correct volume numbers, we entered them manually while Script 3 ran, as explained in item d of Step 3 - Script 3 below.
Step 3 - Script 3:
Script 3 crosswalked the marc.xml to the meta.xml file required to provide metadata to the IA display page and to support search. This required collaboration between the MSL programmer and cataloger and the IA software engineer. Efforts required to set up the meta.xml file included:
a. The MSL cataloger defined the meta.xml elements needed for display at IA.
b. The Internet Archive’s meta.xml call number field is a little confusing because it may contain several different metadata elements, including call number or various identifiers used by libraries, like OCLC numbers or barcodes. MSL finally put the call number there and added a custom field for the OCLC numbers.
c. Initially, we dutifully crosswalked title and description fields to meta.xml. But we found these and other metadata elements were overwritten by the Internet Archive during integration. To overwrite the meta.xml, IA used the marc.xml that was part of the four files uploaded to IA. The two-step process allows MSL to customize the meta.xml, while IA maintains its standards for basic fields in the meta.xml.
By default, IA’s marc-to-meta crosswalk includes most 5XX fields from a bibliographic record. 5XX fields are note fields, and some of them contain local information that applies only to a particular library’s copy of a title. As a result, some bibliographic data useful only to libraries in our local shared catalog was making its way into the IA meta.xml file. MSL learned to remove these fields from the marc.xml before uploading it to IA so that IA would not write irrelevant data into the meta.xml.
d. The script opened each PDF and its marc.xml as it looped through the batch, so we could read and manually input the specific volume number for each serial issue to be recorded in the meta.xml. This was necessary because the MARC record did not provide volume information for individual issues in a way that could be reliably matched to the item. This manual input made it possible for meta.xml, details pages and search results to include the correct volume number.
Also, since folder names still included OCLC numbers, PURLs were generated in this step. The script was already using OCLC numbers to pull marc.xml via a Z39.50 connection to the catalog, so we made a second use of OCLC numbers as unique identifiers in our master PURL table.
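The clean-up of local note fields described under item c could look something like the following Python sketch. It is an illustration under assumptions, not MSL's script: it assumes the marc.xml uses the standard MARCXML namespace, and the set of tags treated as local (590-599 here) is a placeholder, since which 5XX fields carry local information is a cataloging decision.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"  # standard MARCXML namespace

# Assumption: the tags treated as "local" notes. The write-up does not list
# which 5XX fields MSL removed; 590-599 is used here only as an example.
LOCAL_NOTE_TAGS = {str(tag) for tag in range(590, 600)}


def strip_local_notes(marcxml, local_tags=LOCAL_NOTE_TAGS):
    """Return the MARCXML text with datafields in `local_tags` removed."""
    ET.register_namespace("", MARC_NS)  # keep the default namespace on output
    root = ET.fromstring(marcxml)
    # Snapshot the elements first so removal does not disturb the traversal.
    for parent in list(root.iter()):
        for field in parent.findall(f"{{{MARC_NS}}}datafield"):
            if field.get("tag") in local_tags:
                parent.remove(field)
    return ET.tostring(root, encoding="unicode")
```

Running a filter like this on each marc.xml before upload keeps local notes out of whatever IA derives from the record, while leaving general 5XX notes in place.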
Step 4 - Script 4:
Script 4 removed the OCLC number from each folder and file name. So, folder and file names were shortened to the GUID, which becomes the permanent digital object identifier at IA.
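As a hedged illustration of Step 4, the Python sketch below strips the temporary OCLC prefix from a folder and the files inside it. The function names are hypothetical, and it assumes every temporary name has the form oclcnumber_GUID with the OCLC number ending at the first underscore.

```python
from pathlib import Path


def split_temporary_name(name):
    """Split 'oclcnumber_GUID' into (oclc_number, guid)."""
    oclc_number, separator, guid = name.partition("_")
    if not separator or not oclc_number or not guid:
        raise ValueError(f"not an oclcnumber_GUID name: {name!r}")
    return oclc_number, guid


def strip_oclc_prefix(folder):
    """Rename the folder and its files so only the GUID remains."""
    folder = Path(folder)
    oclc_number, guid = split_temporary_name(folder.name)
    prefix = f"{oclc_number}_"
    # Rename the files first, while the folder path is still valid.
    for path in list(folder.iterdir()):
        if path.name.startswith(prefix):
            path.rename(path.with_name(path.name[len(prefix):]))
    new_folder = folder.with_name(guid)
    folder.rename(new_folder)
    return new_folder
```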
Step 5 - FTP:
MSL FTP’d the files to the Internet Archive upload server. FTP took more time than it should have. Each batch had 400 files: 100 digital objects, each with its three accompanying files. The FTP server frequently reported too many connections and denied upload until the connections cleared. The denied items stacked up and had to be run 3-4 times to complete the FTP. The FTP took 2-3 hours per batch. Fortunately, the operator had other things to do while waiting out the delays. We scoured information sources and reconfigured FileZilla, but did not come up with a solution to the problem. However, FileZilla did an excellent job of keeping track of the delayed and repeated uploads.
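MSL used FileZilla for this step, not a script. For readers who would rather script the upload, the following Python sketch shows the same idea of re-running denied items in repeated passes. Everything in it is hypothetical: the function name, the number of passes, and the wait between passes are illustrative, and the actual upload is supplied by the caller.

```python
import ftplib
import time


def upload_with_retries(paths, upload, max_passes=4, wait_seconds=60.0, sleep=time.sleep):
    """Upload every path, re-queuing failures for later passes.

    `upload` is a caller-supplied function that sends one file and raises
    an ftplib or OS error if the server refuses it. Returns the list of
    paths that still failed after the last pass (empty on full success).
    """
    pending = list(paths)
    for attempt in range(1, max_passes + 1):
        failed = []
        for path in pending:
            try:
                upload(path)
            except ftplib.all_errors:
                failed.append(path)  # e.g. a "too many connections" refusal
        if not failed:
            return []
        pending = failed
        if attempt < max_passes:
            sleep(wait_seconds)  # give the server time to clear connections
    return pending
```

In practice the `upload` function could wrap `ftplib.FTP.storbinary`. The returned list plays the role of FileZilla's queue of failed transfers.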
Step 6 - Script 5:
Once the batch was FTP’d to the Internet Archive, Script 5 grabbed, in turn, the identifier from each digital object folder in the batch and used the identifier in a URL command that caused IA to integrate the folder and its contents.
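The loop in this step can be sketched in Python as follows. The URL template is a deliberate placeholder: the actual URL command was provided by IA and is not reproduced in this write-up, so the example.org address and the function name are hypothetical. The sketch only builds the URLs and does not send any requests.

```python
from pathlib import Path
from urllib.parse import quote

# Hypothetical placeholder. IA's real integration URL command is not given
# in the text and must be substituted here.
INTEGRATE_URL_TEMPLATE = "https://example.org/integrate?identifier={identifier}"


def integration_urls(batch_dir, template=INTEGRATE_URL_TEMPLATE):
    """Build one integration URL per item folder in the batch directory."""
    folders = sorted(p for p in Path(batch_dir).iterdir() if p.is_dir())
    # After Step 4 each folder name is the item's GUID identifier.
    return [template.format(identifier=quote(folder.name)) for folder in folders]
```

Each resulting URL could then be requested in turn, for example with `urllib.request.urlopen`, with the responses logged for the manual QA checks.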
Step 7 - Script 6:
The last step was to upload the PURLs generated in Script 3 to OCLC’s purl.org server.
Putting the bow on the batch
Once the PURLs were created on the PURL server, OCLC numbers and PURLs were sent to OCLC. OCLC placed the PURLs in 856 fields of our WorldCat MARC records. Then, the MSL cataloger laid the WorldCat records over our shared catalog records.
The project took 15 months of one-third time effort. This was longer than we had expected, mostly because there was considerable QA to do:
1. The source dataset contained some items that were already at IA. These had to be manually removed from the dataset.
2. The dataset contained compound objects that needed to be manually reconstituted in their original structure as independent items. 158 compound objects became 519 uploads to IA.
3. There were 182 ARC files in the dataset. These had been crawled from the web by MSL using the Web Archive Workbench. Ultimately, these ARC files were ingested to Archive Montana by Archive-It and made accessible by full text search. They contained 123,000 URLs.
4. There was a small but regular group of state publications whose OCLC number did not pull MARC records from the catalog. These had to be fixed by the cataloger.
5. And while most items in the dataset were PDFs, we had to learn to upload video and audio digital objects.
So, the duration of the effort was definitely caused by the variation in our dataset, not by IA’s upload infrastructure. However, 30 batches later, MSL was done uploading a copy of its born digital collection to the Internet Archive.
This post was modified by MSL Staff on 2010-12-29 23:04:35
This post was modified by MSL Staff on 2010-12-30 15:40:48