Overview of Formats Objects and Migration for ETDs
This is a placeholder workspace for draft and final documents related to this project deliverable.
This is Version 2; Version 1 is located here.
About this Document
Inevitably, and for a variety of reasons, ETD collections will need to be moved, updated, or otherwise modified in order to accommodate technological changes, to eradicate errors discovered belatedly (e.g., after ingest), or to enhance the collection with new content (e.g., to improve or extend metadata). Potentially, over time, a multiplicity of “versions” will have been created on different storage media, with different file formats, and with deliberate or inadvertent variations in content. Among the many challenges of digital preservation will be: to decide what to preserve, how to proactively manage any changes to ETD collections, and how to adequately document these changes so that there will always be a continuous record of the entire lifecycle of the digital content. One reason to be hopeful about the tractability of these daunting challenges is the ongoing active collaboration among members of the digital preservation community to achieve sustainable systems, based on best practices and standards. This document has been prepared in that spirit.
A final working outline for this Guidance document is available to all project steering committee authors for review and comment at Google Docs.
Data wrangling: organizing digital content
The future usability of any preserved digital content will depend in part on how well organized the body of content was when it was first ingested. So, it is vitally important at the outset to establish and follow a logical set of principles and conventions that inform the organization of the content and will be readily understood in the future. In some cases, this may require remediation of flawed legacy practices.1
The guidelines and examples that follow are not prescriptive but are meant to exemplify a thought process that might lead to an optimum set of practices that should then be codified in the policies and the procedures that undergird any mature ETD program.
File/Folder naming principles and conventions
File/folder names should be unique and follow documented conventions to ensure consistency and ease of use. File names do not take the place of metadata and should be simple and straightforward.
- Use lowercase letters of the English alphabet and the numerals 0 through 9.
- Avoid punctuation marks other than underscores or hyphens.
- Do not use spaces.
- Limit file/folder names to 31 characters, including the 3-character file extension.
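Conventions like these can be checked mechanically before ingest. The following is a minimal sketch in Python, assuming the rules exactly as listed above and counting the dot and the extension toward the 31-character limit; the function name `check_name` and the wording of the messages are hypothetical, and a local policy may well differ.

```python
import re

MAX_LENGTH = 31  # total length, including the dot and the extension

# Lowercase letters, digits, underscores and hyphens, optionally followed by
# one dot and an alphanumeric extension (folder names have no extension).
NAME_PATTERN = re.compile(r"[a-z0-9_-]+(\.[a-z0-9]+)?")


def check_name(name):
    """Return a list of the conventions that `name` violates (empty if none)."""
    problems = []
    if len(name) > MAX_LENGTH:
        problems.append("longer than %d characters" % MAX_LENGTH)
    if " " in name:
        problems.append("contains spaces")
    if name != name.lower():
        problems.append("contains uppercase letters")
    # Spaces and case are reported above, so ignore them for this last check.
    if not NAME_PATTERN.fullmatch(name.lower().replace(" ", "")):
        problems.append("contains disallowed characters")
    return problems
```

For example, `check_name("smith_2011_thesis.pdf")` returns an empty list, whereas `check_name("My Thesis.pdf")` reports both the space and the uppercase letters.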
While many institutions mandate what format is to be used for an ETD, there may remain some degree of ambiguity. For example, even when a PDF file format is required: both the PDF and PDF/A formats may be considered acceptable; there may be no explicit requirement as to which version of PDF is used; and the byte order (big- versus little-endian) may not be a consideration. For any files that are included as supplementary files to the ETD itself, even fewer restrictions may apply. For these files, merely requiring that non-proprietary file formats be used may be inadequate; given the wide variety of non-proprietary formats, one or more media-specific open file formats (see table below) should be required. For legacy files, it may be worthwhile to normalize, i.e., to convert files from their original proprietary formats to open formats, which are much preferred for archival purposes.
Media-specific file formats:
|Media||Authoring software||Original format(s)||Open format(s)||Comments|
|text||Microsoft Word, LaTeX||.doc, .docx, .txt||PDF (Version 1.7)||PDF version 1.7 is an ISO standard, thereby making it effectively non-proprietary. See Appendix A below. For archival purposes, a PDF file should have no security features enabled, and should have all of its original fonts embedded (as a subset).|
|images||Adobe Photoshop||.psd||.tif, .jp2||A small amount of compression applied by JPEG2000 can be “visually lossless”, and may be acceptable for archival purposes.2|
|audio||???||.mp3, .aif, .m4a||.wav||Include a reference here|
|video||???||???||.mp4, .avi||Further details at: http://www.archives.gov/records-mgmt/initiatives/dav-faq.html|
|spreadsheet||Microsoft Excel||.xls||.ods||See ODF (open document format): http://www.opendocumentformat.org/|
Complex content objects
Each ETD, as an entity, is composed of components that can either be embedded or be carried along in some fashion, e.g., as a supplementary file. The “packaging” of an ETD and all of its constituent parts (metadata, fonts, data,…) must be accomplished in such a way that access to the ETD in the future will encompass all of its intellectual content and functionality, as follows:
An ETD-specific metadata schema has been developed by members of the Networked Digital Library of Theses and Dissertations (NDLTD); it can be found at: http://www.ndltd.org/standards/metadata/etd-ms-v1.1.html. Example of ETD-MS metadata: http://dcollections.bc.edu/webclient/MetadataManager?pid=139660&descriptive_only=true. This ETD-MS standard schema, based on Dublin Core, could benefit from future refinements. For example, elements pertaining to lifecycle management (such as those found in PREMIS) are needed to record the status of content at the time of ingest, e.g., fixity checksums, as well as any subsequent actions undertaken on behalf of preservation, e.g., format migration. Another improvement would be to extend the dc.rights element to allow for the use of various Creative Commons licensing options. Additional metadata elements are needed to record relationships among groups of files that constitute a complex content object; in effect, metadata can provide the “glue” that binds these files together.
Whether fonts are selected arbitrarily or very deliberately by an ETD author, they are part of the ETD and as such should be preserved. Currently, the embedding of all fonts can be easily accomplished when converting, for example, from Microsoft Word to Adobe Acrobat PDF. In fact, having all fonts embedded is sometimes an explicit requirement (of ProQuest). However, many ETD authors succeed in embedding only some fonts, but not all. Recently, it has become possible to “fix up” these ETDs with Adobe Acrobat Pro’s “preflight…fixups…embed fonts” function, which embeds any missing fonts as long as those fonts are available among the local computer’s system fonts. Ideally, fonts should be embedded while using the same operating system (better yet, the same computer) as was used when authoring the ETD, to ensure that the exact same fonts (or subsets thereof) are embedded within the PDF.3
More and more frequently, hyperlinks are included within ETDs and they will remain active within some versions of PDF (depending on the conversion-to-PDF settings). Even if these links eventually “break” (aka “link rot”), arguably their inclusion provides some potentially valuable information4, especially if tools become available for repairing broken links.
For the ETDs themselves, handles may be used so that a permanent link to the ETD can be maintained. For further details, see:
The use of multimedia (audio, video clips,...) in ETDs has been gradually increasing, as seen in some recent award-winning ETDs (http://www.ndltd.org/events_and_awards/awards/ndltd-etd-awards-2011-winners). In some cases, multimedia in the form of a derivative file (such as a JPEG image) is embedded within the PDF itself. Alternatively or additionally, multimedia in an archival file format (such as a TIFF image) could be embedded in the PDF or could be included as a supplementary file. An obvious challenge will be to migrate these multimedia components as the ETDs themselves are being migrated to newer file formats. The use of non-proprietary file formats for both the entire ETD and for its embedded or supplementary multimedia components will be critical to ensure usability of the content in the future.
For some ETDs, there may be additional information that, although not part of the ETD itself, is worth preserving along with the ETD. For example, the research data upon which an ETD is based (e.g., from surveys or from laboratory measurements) might well be considered preservation-worthy. Optionally, one might preserve such data as part of an “ETD package” of files that are stored in a preservation network, or one might archive the data in a separate data repository or possibly a dark archive that can be referenced within the ETD package. Data preservation is currently an area of active investigation.
Example: At one university, lab notebooks and measured spectra are being converted to PDF files that will go into a data repository; links to the data repository from the ETD metadata will enable researchers to access these research data for free in perpetuity.
Rather than having dispersed files that could potentially get separated in the future, having an ETD and all of its associated information be in one self-contained bundle might prove to be more amenable to preservation. This is the premise of an approach being tested at Virginia Tech, using HTML5, which “allows a single file to encode multiple media types and support linking among those.”
For transferring content between computers over the network, a potentially useful technology is “BagIt”. Its built-in inventory checking, based on a manifest that lists each file together with its checksum, allows the recipient to verify that the transfer completed successfully.
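The inventory idea can be illustrated with a short sketch in Python. It is a simplified illustration only and not an implementation of the BagIt specification, which has further requirements (for example, a bag declaration file and a dedicated payload directory). The function names and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="sha256"):
    """Checksum a file in chunks so that large files do not fill memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(payload_dir, manifest_path):
    """Record one 'checksum  relative/path' line per file under payload_dir."""
    payload_dir = Path(payload_dir)
    lines = []
    for path in sorted(p for p in payload_dir.rglob("*") if p.is_file()):
        relative = path.relative_to(payload_dir).as_posix()
        lines.append("%s  %s" % (file_digest(path), relative))
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")


def verify_manifest(payload_dir, manifest_path):
    """Return the paths that are missing or whose checksum no longer matches."""
    payload_dir = Path(payload_dir)
    failures = []
    for line in Path(manifest_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue  # ignore blank lines
        expected, relative = line.split(None, 1)
        target = payload_dir / relative
        if not target.is_file() or file_digest(target) != expected:
            failures.append(relative)
    return failures
```

The sender would call `write_manifest` before the transfer and ship the manifest along with the files; the recipient would call `verify_manifest` afterwards and treat an empty result as a successful transfer.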
As media storage technologies evolve, it will be prudent to transfer digital collections to newer storage media. To ensure that all of the digital content is transferred without error, fixity checks will need to be done to compare the original files with the newly transferred files. Absent any discrepancies, it would then be safe to decommission the original storage media. NARA has recently made available a tool, “FileAnalyzer”, that facilitates batch processing of checksums to enable such a comparison; it is available at: http://blogs.archives.gov/online-public-access/?p=62705
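The comparison described here can be sketched in a few lines of Python. This is not how FileAnalyzer works internally; it is a minimal illustration under the assumption that both the original and the new storage are mounted as ordinary directory trees. The function names and the default algorithm are hypothetical choices.

```python
import hashlib
from pathlib import Path


def checksums(root, algorithm="sha256"):
    """Map each file's path (relative to root) to its hex checksum."""
    root = Path(root)
    result = {}
    for path in root.rglob("*"):
        if path.is_file():
            digest = hashlib.new(algorithm)
            with open(path, "rb") as handle:
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            result[path.relative_to(root).as_posix()] = digest.hexdigest()
    return result


def compare_storage(original_root, transferred_root):
    """Report files missing from, or altered in, the transferred copy."""
    before = checksums(original_root)
    after = checksums(transferred_root)
    missing = sorted(set(before) - set(after))
    altered = sorted(p for p in before if p in after and before[p] != after[p])
    return {"missing": missing, "altered": altered}
```

Only when both lists come back empty would it be reasonable to consider decommissioning the original media. A file whose line endings were changed in transit would appear under "altered", since even a one-byte difference changes the checksum.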
Emulation6 and migration7 are not mutually exclusive strategies. In fact, one possible “hybrid” strategy would be to save both the original files and the most recently migrated versions of the files, but none of the intermediate files (see Versioning below).
As successive file format migrations are undertaken, fixity checking will need to be repeated in order to record the checksums associated with each “generation” of files. Assuming one starts with multiple copies (replications stored at distributed locations) of files all of whose checksums agree, one should end up with multiple copies of the newly-migrated files whose checksums also agree with each other but not, presumably, with those of the original files.
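The bookkeeping implied here can be sketched as follows: within a generation, every replica must report the same checksum, and each generation's agreed checksum is recorded in turn. This is a minimal Python sketch; the function names, the location labels, and the shape of the history record are all hypothetical.

```python
def replicas_agree(checksums_by_location):
    """True if every replica location reports the same checksum."""
    return len(set(checksums_by_location.values())) == 1


def record_generation(history, label, checksums_by_location):
    """Append a generation to `history` only if all of its replicas agree."""
    if not replicas_agree(checksums_by_location):
        raise ValueError("replicas disagree for generation %r" % label)
    checksum = next(iter(checksums_by_location.values()))
    history.append({"generation": label, "checksum": checksum})
    return history


# Usage with made-up checksum strings for one file held at two sites.
history = []
record_generation(history, "original", {"site_a": "aa11", "site_b": "aa11"})
record_generation(history, "migrated", {"site_a": "bb22", "site_b": "bb22"})
```

After these two calls, `history` holds one entry per generation. The two recorded checksums differ from each other, as expected after a format migration, while the replicas within each generation agree.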
When to migrate to a newer file format
Here, discuss batch “preemptive” format conversion (guessing in advance what the next archival format is going to be) versus on-demand migration (waiting until the file whose format is outdated is going to be accessed).
Here, include a summary of the final version of the “Versioning Brief” that is being developed by the MetaArchive Content and Preservation committees
Versioning is the process of storing multiple versions of a file in order to save its change history. This enables a content producer to know that changes made to a preserved file, both intentional and unintentional, will be saved in parallel within a preservation system such that any/all versions of that file may be retrieved by the producer in the event of a content restoration....
Here, mention the cost implications of saving all intermediate versions ($1 per GB per year). Given the potential for inflated costs, a purging/de-accessioning mechanism may be essential.
Migration from one repository to another
As with file formats and storage technologies, repository technologies may also become obsolete, warranting migration from an older to a newer or better repository platform. However, there is currently enough variation in architecture among different repositories to raise the concern that moving content from one to another may entail some loss of information. Efforts such as “Towards Interoperable Preservation Repositories” (TIPR)8 and “Repository Exchange Package” (RXP)9 have begun to explore this issue. In addition, an NEH-funded research effort is currently underway to develop tools for transferring content among various repositories.
Here, mention SWORD protocol: http://swordapp.org/about/a-brief-history/ ?
1. Avoiding the Calf-Path: Digital Preservation Readiness for Growing Collections and Distributed Preservation Networks, by Martin Halbert, Katherine Skinner and Gail McMillan, Society for Imaging Science and Technology: Archiving 2008 Final Program and Proceedings, 86-91.
2. JPEG 2000 - a Practical Digital Preservation Standard?, by Robert Buckley, Digital Preservation Coalition Technology Watch Series Report 08-01, February 2008, www.dpconline.org/docs/reports/dpctw08-01.pdf
3. Arguably, there is a copyright concern to consider when embedding fonts, a concern presumably mitigated by embedding only the subset of fonts that are actually used within the ETD. (Not intended as legal advice.)
5. If you use an FTP application to transfer files from a DOS-based system to a Unix-based system, be aware that you may have to transfer XML files (for example) as binary files; otherwise the MD5 checksums may not agree (CRLF in DOS can change to LF in Unix).
6. “combines software and hardware to reproduce in all essential characteristics the performance of another computer of a different design, allowing programs or media designed for a particular environment to operate in a different, usually newer environment”, from http://www.dpworkshop.org/dpm-eng/terminology/strategies.html
7. “to copy data, or convert data, from one technology to another, whether hardware or software, preserving the essential characteristics of the data. This simple definition, by Peter Graham”, from http://www.dpworkshop.org/dpm-eng/terminology/strategies.html
Appendix A: More Details About Various File Formats
The PDF file format was invented by Adobe Systems Incorporated in the 1990s. Since then, as PDF became the de facto standard for documents on the web, nine versions of the PDF specification have appeared. In 2008, PDF version 1.7 was published as an ISO standard (ISO 32000-1:2008). It includes all of the functionality in previous versions (1.0-1.6) of the PDF file format. NOTE: metadata can be embedded in a PDF document.
PDF for Archive (PDF/A) is a constrained version of the “full” PDF format: it omits features that are ill-suited to long-term archiving and requires others, such as font embedding. It currently consists of three versions. PDF/A-1, which is based on PDF 1.4, is an ISO standard (ISO 19005-1:2005) for long-term preservation of electronic documents.
Draft PDF/A-3 is currently under review by its working group. Its intent is to allow files in any arbitrary format to be contained in a PDF/A file, thereby extending the principle of long-term preservation of electronic documents. Because a PDF/A document must embed all fonts and other information for displaying the document, its file size will be larger than that of PDF documents without such embedded information.
PDF/A has been adopted as the standard for long-term government archives in many jurisdictions, including the U.S. federal courts, Switzerland, Austria, Germany, and the European Commission.
- Library of Congress. PDF/A-1, PDF for Long-term Preservation, Use of PDF 1.4 http://www.digitalpreservation.gov/formats/fdd/fdd000125.shtml
- AIIM. Frequently Asked Questions (FAQs) ISO 19005-1:2005 PDF/A-1: July 10, 2006. http://www.aiim.org/documents/standards/PDF-A/19005-1_FAQ.pdf
- PDF/A - the standard for long-term archiving http://www.pdf-tools.com/public/downloads/whitepapers/whitepaper-pdfa.pdf
- PDF/A - A Look at the Technical Side http://www.pdfa.org/2011/08/pdfa-%E2%80%93-a-look-at-the-technical-side/