Format Recognition Tools Documentation for ETDs

From IMLS

This is a placeholder page for instructional documentation on the ways in which ETD programs can apply format recognition services such as DROID, JHOVE/2, FITS, and the Unix file command in the context of some standard workflows that lead to an ETD deposit into an institutional repository. The dissemination and assessment of these format recognition micro-services are scheduled to take place in the project in Spring 2014, per the Revised Project Workplan.

Rationale

Format recognition is increasingly important in the ETD curation lifecycle as ETD submissions expand to encompass supplemental digital works that include multimedia and, in some cases, datasets. Surfacing and recording the digital signatures of these (occasionally proprietary) formats is essential to ensuring that they can be preserved for access, both immediately and into the future. These formats require specific software to render their contents, and migrating them to accessible formats may require specific converters. Knowing up front which file formats an ETD submission contains helps ETD program curators and managers set policy to support preservation and access.

Tools

DROID

DROID is a GUI-based and command-line tool used for identifying a file's format. Note that DROID does not determine whether or not a file is valid; it only tries to identify the file's format.

Links

Use Case for ETDs

Digital Record Object Identification (DROID) is one of the more easily adoptable format recognition tools in circulation, primarily because of its graphical user interface (GUI) and its user-friendly reporting outputs. A tool like DROID can be installed and used in the environment where an author deposits ETDs for approval by ETD program administrators. Graduate school staff do not need technical expertise to run DROID on these ETD submissions. DROID offers several report output formats that fit later workflows and suit ETD program stakeholders who would like to record this format recognition information in a data management system or in the preservation metadata schemas used to manage the ETD files.

This project will make the case that a tool like DROID should be used on ETDs at the point of first submission so that ETD program managers/curators can inform ETD authors that a supplemental file may not in fact be supported by the program's overall preservation policies. The project will provide helpful usage documentation and provide examples of how report outputs can be provided to ETD program stakeholders at later stages in the overall set of curation workflows.

Basic Usage

Download and Install

Note: The included "Running DROID.txt" file contains more detail than this document.

  1. Ensure that Java 6 (not 7) is available. See the Project Homepage for more details.
  2. Download the current version from the Project Homepage (above) and extract the contents somewhere.
  3. If using Linux or OSX:
    1. Open a Terminal and navigate to the directory where you extracted the files.
    2. chmod +x droid.sh

Using the Graphical Application

To start up the graphical interface for DROID:

  1. If using Windows, navigate to the folder where you extracted DROID and double-click the icon for "droid.bat".
  2. If using Linux or OSX:
    1. Open a Terminal and navigate to the directory where you extracted the files.
    2. ./droid.sh
  3. At this point, the interface should appear and perhaps offer to download new updates.

Let's run DROID on some data using the GUI:

  1. Click the Add button on the toolbar.
  2. Navigate to a file or directory you wish to analyze, click on its name, and click OK.
    1. If you're selecting an entire directory, the "Include sub-folders" checkbox will determine whether or not the contents of subdirectories are included in the analysis as well.
  3. Repeat the previous step for any other files or directories you wish to include in the analysis.
  4. Click the Start button on the toolbar to begin the analysis. When the analysis is finished, the Start and Pause toolbar buttons will be disabled and the Report button will be enabled.

At this point, you should see some of the analysis results displayed in the user interface. There are three main tasks you can do with the results now:

  • Export the results:
    1. Click the Export button on the toolbar.
    2. Check the checkbox in the dialog that appears.
    3. Click Export profiles...
  • Generate a statistical report of the results:
    1. Click the Report button on the toolbar.
    2. Check the checkbox in the dialog that appears.
    3. Select from the drop-down list of available report types.
    4. Click Report on profiles...
  • Filter the view of results in the interface:
    1. Click the Filter button on the toolbar.

Using the Command Line Application

  1. If using Linux or OSX:
    1. Open a Terminal and navigate to the directory where you extracted the files.
    2. ./droid.sh --help
    3. You should now see a (long) list of options and their explanations.

Running DROID takes place in two parts:

  • Use -a to specify files and directories to analyze, along with -p to choose a filename for this profile.
  • Use -p to specify a previously-created profile, along with other options for performing post-analysis tasks (such as exporting the results in a variety of formats, or generating specialized reports).
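
The two-part pattern lends itself to scripting. The following is a minimal Python sketch, not part of DROID, that builds both command lines using only the -a, -p, and -e options used on this page. The function names and the location of droid.sh are assumptions for illustration.

```python
import subprocess

DROID = "./droid.sh"  # assumed launcher location; adjust for your install


def profile_command(inputs, profile):
    """Part one: analyze files/directories (-a) and save a profile (-p)."""
    return [DROID, "-a", *inputs, "-p", profile]


def export_command(profile, csv_path):
    """Part two: load a saved profile (-p) and export it as CSV (-e)."""
    return [DROID, "-p", profile, "-e", csv_path]


def profile_and_export(inputs, profile, csv_path):
    """Run both parts in order; check=True raises on a non-zero exit code."""
    subprocess.run(profile_command(inputs, profile), check=True)
    subprocess.run(export_command(profile, csv_path), check=True)
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems when ETD file names contain spaces.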

Let's run DROID on some data using the command line:

./droid.sh -a "/data/Libraries.pdf" "/data/ETDs/" -p results.droid
2013-07-10 12:00:32,733  INFO Creating profile: 1373500832732
2013-07-10 12:00:32,764  INFO Attempting state change [INITIALISING] to [VIRGIN]
2013-07-10 12:00:32,765  INFO Starting profile: 1373500832732
2013-07-10 12:00:37,086  INFO Attempting state change [VIRGIN] to [RUNNING]
2013-07-10 12:00:37,595  INFO Attempting state change [RUNNING] to [FINISHED]
2013-07-10 12:00:37,596  INFO Saving profile: 1373500832732 to results.droid
2013-07-10 12:00:38,593  INFO Attempting state change [FINISHED] to [SAVING]
2013-07-10 12:00:38,595  INFO Saving profile [/home/user/.droid6/profiles/1373500832732] to [results.droid]
2013-07-10 12:00:39,042  INFO Attempting state change [SAVING] to [FINISHED]
2013-07-10 12:00:39,042  INFO Closing profile: 1373500832732

Now we have the results of this profile (covering a file called "Libraries.pdf" and a directory called "ETDs") stored in a file called "results.droid" in the current directory. Note that this "results.droid" file cannot be opened or read by applications other than DROID itself. To generate something in a usable format, we can use DROID to "export" the results:

./droid.sh -p results.droid -e results.csv
2013-07-10 19:07:10,904  INFO Loading profile from: results.droid
2013-07-10 19:07:15,296  INFO Exporting profiles to: [results.csv]
2013-07-10 19:07:15,645  INFO Time for export [1373501230930]: 344 ms
2013-07-10 19:07:15,645  INFO Closing export file: results.csv
2013-07-10 19:07:15,645  INFO Closing profile: 1373501230930

Now we have a file called "results.csv" which can be opened in Microsoft Excel or any text editing software.
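
For later workflow stages, the exported CSV can also be summarized with a short script. The Python sketch below counts files per identified format. The column names TYPE, PUID, and FORMAT_NAME, and the "Folder" value, are assumptions about the export layout; check them against the header row of your own results.csv. The sample rows are hypothetical.

```python
import csv
import io
from collections import Counter


def count_formats(csv_text):
    """Count files per (PUID, format name) in DROID CSV export text."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Assumed: folder rows carry no format identification, so skip them.
        if row.get("TYPE") == "Folder":
            continue
        puid = row.get("PUID") or "unidentified"
        name = row.get("FORMAT_NAME") or ""
        counts[(puid, name)] += 1
    return counts


# Hypothetical export, used only to illustrate the expected shape.
SAMPLE = (
    '"NAME","TYPE","PUID","FORMAT_NAME"\n'
    '"ETDs","Folder","",""\n'
    '"Libraries.pdf","File","fmt/19","Acrobat PDF 1.5 - Portable Document Format"\n'
)

if __name__ == "__main__":
    for (puid, name), n in count_formats(SAMPLE).most_common():
        print(f"{n}\t{puid}\t{name}")
```

To use it on a real export, read the file with open("results.csv", newline="") and pass its contents to count_formats.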

file Command

file is a command-line tool included with most open source Unix-like operating systems, such as Linux and BSD distributions. Because it is a native command on these platforms, it is easy to begin using, either as part of a scripted workflow or for casual use from a command-line shell.

It accepts many command-line options, making it flexible enough for a wide range of use cases.

Links

Use Case for ETDs

todo

Basic Usage

Download and Install

file is pre-installed on most OSX and Linux distributions. If the command isn't already available on yours, you will need to install it using your operating system's package manager (the package should be named just file).

Using the Command Line Application

  1. Open a Terminal and navigate to the data you want to analyze.
  2. Let's assume we want to analyze a file named Libraries.pdf.
  3. file Libraries.pdf
  4. The command should return a line of output like this: Libraries.pdf: PDF document, version 1.5

file offers a handful of arguments that can change the format of the output string. Use file --help or man file to learn more about the arguments and options for the version of file you have installed. Here are some examples that can come in handy in an integration situation:

$ file --brief Libraries.pdf
PDF document, version 1.5
$ file Libraries.pdf -i
Libraries.pdf: application/pdf; charset=binary
$ file Libraries.pdf --brief -i
application/pdf; charset=binary
$ file Libraries.pdf --brief --mime-type
application/pdf
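
In an integration situation, these output strings are usually consumed by a script. The following Python sketch shows one way to do that: parse_mime_line splits the -i style output shown above, and mime_type calls file with --brief --mime-type. Both function names are hypothetical, and mime_type assumes the file command is on the PATH.

```python
import subprocess


def parse_mime_line(line):
    """Split 'name: type/subtype; charset=x' into (name, mime, charset)."""
    name, _, rest = line.strip().rpartition(": ")
    mime, _, params = rest.partition(";")
    params = params.strip()
    charset = params[len("charset="):] if params.startswith("charset=") else None
    return name, mime.strip(), charset


def mime_type(path):
    """Return only the MIME type, as 'file --brief --mime-type' prints it."""
    result = subprocess.run(
        ["file", "--brief", "--mime-type", "--", path],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


print(parse_mime_line("Libraries.pdf: application/pdf; charset=binary"))
# ('Libraries.pdf', 'application/pdf', 'binary')
```

Using --brief in mime_type avoids having to strip the file name at all, which is the simpler choice when the script already knows which path it passed in.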

JHOVE/2

JHOVE has been succeeded by JHOVE2, which, despite what the name suggests, is a separate project with a completely new codebase. Though JHOVE2 is, for all intents and purposes, an improvement over JHOVE, the two tools sometimes produce different results for certain files, so there may be value in capturing results from both tools.

Links

Use Case for ETDs

todo

Basic Usage (JHOVE)

Download and Install

  1. Download the current version from the Project Homepage on SourceForge (above) and extract the contents somewhere.

Using the Command Line Application

JHOVE offers a thorough tutorial on its web site: http://jhove.sourceforge.net/using.html

Using the Java Library

JHOVE offers a programmatic interface for advanced interactions with other software systems. The basic flow for using these interfaces is:

  1. Include the JHOVE JAR files in your Java class path (jhove.jar, jhove-handler.jar, and jhove-module.jar)
  2. Import and instantiate the JhoveBase class (edu.harvard.hul.ois.jhove.JhoveBase)
  3. Call various setter methods to provide configuration details (setEncoding, setTempDirectory, setBufferSize, setChecksumFlag, setSignatureFlag, setShowRawFlag)
  4. Call the "dispatch" method to begin the operation

This level of integration is advanced, and beyond the scope of what most workflows are likely trying to accomplish. If you only need to obtain the kind of output obtainable from the command-line interface, then you should probably just invoke the command-line tool from your code.
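
As a rough illustration of invoking the command-line tool from your code, here is a minimal Python sketch. The helper name run_tool and the timeout value are assumptions, and a stand-in command is used so the sketch runs without JHOVE installed. In practice argv would be your JHOVE launcher followed by a file path.

```python
import subprocess
import sys


def run_tool(argv, timeout=300):
    """Run a command-line tool and return its standard output as text.

    Raises RuntimeError carrying the tool's stderr on a non-zero exit code.
    """
    result = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        raise RuntimeError(
            f"{argv[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return result.stdout


# Stand-in command (the Python interpreter itself), not a JHOVE invocation.
print(run_tool([sys.executable, "-c", "print('ok')"]), end="")
```

Capturing stderr separately keeps warning messages out of the output that later steps parse.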

Basic Usage (JHOVE2)

Download and Install

  1. Download the current version from the Project Homepage (above) and extract the contents somewhere. (Choose the .tar.gz download if using Linux or OSX.)

Using the Command Line Application

Note: JHOVE2 works with both Java 6 and Java 7, but produces many warning messages under Java 7. If you have a Java 6 JRE available on your system, you may want to use it for running JHOVE2. One way of doing this is to run the following command before using JHOVE2: export JAVA_HOME=/path/to/jre1.6.0_xx/, replacing the shown path as appropriate. The effects of this command are not permanent, so you will need to run it each time you open a new terminal session (or in each shell script you wish to use with JHOVE2).

  1. If using Linux or OSX:
    1. Open a Terminal and navigate to the directory where you extracted the files.
    2. ./jhove2.sh --help
    3. You should now see a (long) list of options and their explanations. Some helpful ones to know are:
      • -d JSON|Text|XML|CMD|CDX : This selects the format of JHOVE2's output. "Text" is used by default, but others can be helpful depending on your needs.
      • -o <outfile> : Saves the program's output to a file instead of writing it to standard output (your Terminal display).

Beyond that, there really aren't too many other options. JHOVE2 produces a lot of information by default, and it's up to you to sift through the output to get the parts you're interested in.
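
When JHOVE2 is called from a script, the -d and -o options and the JAVA_HOME setting described in the note above can be supplied per invocation. The Python sketch below is one way to do this. The function names are hypothetical, the launcher path is assumed to be ./jhove2.sh, and the JRE path in the usage comment is the same placeholder used in the note.

```python
import os


def jhove2_command(path, display="Text", outfile=None):
    """Build a jhove2.sh argument list using the -d and -o options."""
    argv = ["./jhove2.sh", "-d", display]
    if outfile is not None:
        argv += ["-o", outfile]
    return argv + [path]


def java_env(java_home):
    """Copy the current environment with JAVA_HOME set for one child process."""
    env = dict(os.environ)
    env["JAVA_HOME"] = java_home
    return env


# Usage (requires JHOVE2; the JRE path is a placeholder):
#   import subprocess
#   subprocess.run(jhove2_command("Libraries.pdf", "JSON", "out.json"),
#                  env=java_env("/path/to/jre1.6.0_xx/"), check=True)
```

Setting JAVA_HOME on a copy of the environment affects only the JHOVE2 process, so other tools in the same workflow keep their default Java version.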

Let's run JHOVE2 on some data using the command line (full output not shown here due to size):

$ ./jhove2.sh Libraries.pdf
FileSource:
 StartingOffset (byte): 0
 EndingOffset (byte): 145494
 Size (byte): 145495
 FileSystemProperties:
  Path: /home/user/Libraries.pdf
  LastModified: 2012-10-31T16:50:19-05:00
 PresumptiveFormats:
...
$ ./jhove2.sh -d XML Libraries.pdf
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<j2:jhove2 xmlns:j2="http://jhove2.org/xsd/1.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<j2:feature name="FileSource" fid="http://jhove2.org/terms/reportable/org/jhove2/core/source/FileSource" fidns="JHOVE2">
 <j2:features>
 <j2:feature name="StartingOffset" fid="http://jhove2.org/terms/property/org/jhove2/core/source/MeasurableSource/StartingOffset" fidns="JHOVE2" funit="byte">
  <j2:value>0</j2:value>
 </j2:feature>
 <j2:feature name="EndingOffset" fid="http://jhove2.org/terms/property/org/jhove2/core/source/MeasurableSource/EndingOffset" fidns="JHOVE2" funit="byte">
  <j2:value>145494</j2:value>
...
$ ./jhove2.sh -d JSON Libraries.pdf
{
 "FileSource": {
  "StartingOffset": {
    "unit": "byte"
   ,"value": 0
  }
 ,"EndingOffset": {
    "unit": "byte"
   ,"value": 145494
  }
...
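
The JSON output is the easiest of the three to consume from a script. As a small sketch, the Python below reads the two offsets shown above and derives the file size from them. The sample document is a hypothetical, closed-off version of the excerpt (real output contains many more features), and source_size is an illustrative name.

```python
import json

# Hypothetical, closed-off version of the excerpt above.
SAMPLE = """
{
 "FileSource": {
  "StartingOffset": {
    "unit": "byte"
   ,"value": 0
  }
 ,"EndingOffset": {
    "unit": "byte"
   ,"value": 145494
  }
 }
}
"""


def source_size(report):
    """Size in bytes implied by the inclusive start and end offsets."""
    src = report["FileSource"]
    return src["EndingOffset"]["value"] - src["StartingOffset"]["value"] + 1


print(source_size(json.loads(SAMPLE)))  # 145495
```

The result agrees with the "Size (byte): 145495" line in the Text output, because both offsets are inclusive.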

Using the Java Library

todo


Suites

FITS

Description

Resources

Use Case for ETDs

todo

Basic Usage

Download and Install

  1. Download the current version from the FITS Downloads page (above) and extract the contents somewhere.
  2. If using Linux or OSX:
    1. Open a Terminal and navigate to the directory where you extracted the files.
    2. chmod +x fits.sh

Using the Command Line Application

  1. If using Linux or OSX:
    1. Open a Terminal and navigate to the directory where you extracted the files.
    2. ./fits.sh -h
    3. You should now see a list of options and their explanations. The basic usage for a single input file is: ./fits.sh -i <file>

Let's run FITS on some data using the command line:

$ ./fits.sh -i Libraries.pdf 
<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd" version="0.6.2" timestamp="5/8/13 2:45 PM">
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf" toolname="FITS" toolversion="0.6.2">
...
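
To record the identification in another system, the XML can be parsed with a namespace-aware query. The Python sketch below pulls the format and mimetype attributes from each identity element. The sample is a hypothetical, closed-off version of the excerpt above, and the function name identities is illustrative.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute in the FITS output shown above.
NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Hypothetical, closed-off version of the excerpt above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" version="0.6.2">
  <identification>
    <identity format="Portable Document Format" mimetype="application/pdf"
              toolname="FITS" toolversion="0.6.2"/>
  </identification>
</fits>
"""


def identities(xml_text):
    """Return (format, mimetype) pairs for every identity element."""
    root = ET.fromstring(xml_text)
    return [
        (el.get("format"), el.get("mimetype"))
        for el in root.findall("fits:identification/fits:identity", NS)
    ]


print(identities(SAMPLE))
# [('Portable Document Format', 'application/pdf')]
```

Returning a list rather than a single pair leaves room for outputs in which more than one identity element appears.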



DAITSS Format Description Service

The DAITSS Format Description Service is an open-source web application that uses DROID and JHOVE to identify and validate an input file, producing output in the form of a PREMIS XML file.

Resources


Use Case for ETDs

todo
