The National Library developed the Metadata Extraction Tool in 2003 to programmatically extract preservation metadata from the headers of a range of file formats, including PDF documents, image files, sound files and Microsoft Word documents.

Download the Metadata Extraction Tool

The Metadata Extraction Tool was redeveloped in 2007. Version 3 of the tool is available as open-source software. It can be downloaded from the SourceForge website.

Metadata Extraction Tool – SourceForge website

The tool is designed for use by the wider digital preservation community, and any future development will be informed by that community.

Purpose of the Metadata Extraction Tool

The Metadata Extraction Tool is based on the Library's work on digital preservation, including development of a logical preservation metadata schema.

The Metadata Extraction Tool:

  • automatically extracts preservation-related metadata from digital files
  • outputs that metadata in a standard format (XML) for uploading into a preservation metadata repository.
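
As a rough illustration of the second point, the sketch below serialises a set of extracted name/value pairs to XML using the standard Java XML APIs. The class name MetadataXmlSketch and the element and attribute names (file-metadata, field, name) are assumptions made for this example; they are not the tool's actual output schema.

```java
import java.io.StringWriter;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: element names are illustrative, not the tool's real schema.
final class MetadataXmlSketch {

    // Builds a small DOM tree from the metadata map and serialises it to a string.
    static String toXml(Map<String, String> metadata) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element root = doc.createElement("file-metadata");
        doc.appendChild(root);

        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            Element field = doc.createElement("field");
            field.setAttribute("name", entry.getKey());
            // setTextContent leaves escaping of characters such as '&' to the serialiser.
            field.setTextContent(entry.getValue());
            root.appendChild(field);
        }

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }
}
```

Building the document through a DOM rather than by string concatenation means special characters in file names or metadata values are escaped automatically, which matters when the output is loaded into a repository.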

Supported file formats

The tool is built with a combination of Java and XML. It consists of a generic application and a number of 'adapters', each developed to extract metadata from a specific file type.

Adapters have been written for MS Word 2, MS Word 6, WordPerfect, OpenOffice, MS Works, MS Excel, MS PowerPoint, TIFF, JPEG, WAV, MP3, HTML, PDF, GIF, and BMP.

If a file type is unknown, the tool applies a generic adapter, which extracts the data that the host system 'knows' about any given file (such as size, filename, and date created).
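
The adapter arrangement described above can be sketched roughly as follows. This is a minimal illustration, not the tool's real code: the names AdapterSketch, Adapter, GifAdapter and GenericAdapter are hypothetical, only one format-specific adapter is shown, and recognising a file by its leading signature bytes is an assumption about how an adapter might identify its format.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names come from the tool's actual code base.
final class AdapterSketch {

    interface Adapter {
        boolean accepts(byte[] header);
        Map<String, String> extract(Path file, byte[] header) throws IOException;
    }

    // Format-specific adapter: recognises GIF files by their 6-byte signature.
    static final class GifAdapter implements Adapter {
        @Override
        public boolean accepts(byte[] header) {
            if (header.length < 10) return false;
            String signature = new String(header, 0, 6, StandardCharsets.US_ASCII);
            return signature.equals("GIF87a") || signature.equals("GIF89a");
        }

        @Override
        public Map<String, String> extract(Path file, byte[] header) {
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("adapter", "gif");
            // Width and height follow the signature as little-endian 16-bit values.
            metadata.put("width", Integer.toString((header[6] & 0xFF) | ((header[7] & 0xFF) << 8)));
            metadata.put("height", Integer.toString((header[8] & 0xFF) | ((header[9] & 0xFF) << 8)));
            return metadata;
        }
    }

    // Fallback adapter: reports only what the host file system knows about the file.
    static final class GenericAdapter implements Adapter {
        @Override
        public boolean accepts(byte[] header) {
            return true;
        }

        @Override
        public Map<String, String> extract(Path file, byte[] header) throws IOException {
            BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put("adapter", "generic");
            metadata.put("filename", file.getFileName().toString());
            metadata.put("size", Long.toString(attrs.size()));
            metadata.put("created", attrs.creationTime().toString());
            return metadata;
        }
    }

    private static final List<Adapter> SPECIFIC_ADAPTERS = List.of(new GifAdapter());
    private static final Adapter FALLBACK = new GenericAdapter();

    // Tries each format-specific adapter in turn, then falls back to the generic one.
    static Map<String, String> extract(Path file) throws IOException {
        byte[] header;
        try (InputStream in = Files.newInputStream(file)) {
            header = in.readNBytes(10);
        }
        for (Adapter adapter : SPECIFIC_ADAPTERS) {
            if (adapter.accepts(header)) {
                return adapter.extract(file, header);
            }
        }
        return FALLBACK.extract(file, header);
    }
}
```

In this arrangement, supporting a new format means adding one more adapter to the list; files that no adapter recognises still yield the basic file-system metadata.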

Capabilities

The tool has both a Microsoft Windows interface and a UNIX command line interface. Files can therefore be processed automatically in batches or individually, as required.

The application opens all files as read-only, ensuring the integrity of the original files. The tool reads only header information, so the extraction process is quick.
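
The read-only, header-only approach can be illustrated with a short Java sketch. The class and method names (HeaderReadSketch, readHeader) are hypothetical, and how many bytes count as a 'header' depends on the file format; the caller chooses that limit here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of read-only header access; not taken from the tool's source.
final class HeaderReadSketch {

    // Returns at most maxBytes from the start of the file, opened for reading only.
    static byte[] readHeader(Path file, int maxBytes) throws IOException {
        // Opening with READ alone means any attempt to write through this channel fails.
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buffer = ByteBuffer.allocate(maxBytes);
            // Keep reading until the buffer is full or the end of the file is reached.
            while (buffer.hasRemaining() && channel.read(buffer) != -1) {
                // No body needed: channel.read advances the buffer position.
            }
            buffer.flip();
            byte[] header = new byte[buffer.remaining()];
            buffer.get(header);
            return header;
        }
    }
}
```

Because only the first few bytes are read, the cost of each extraction is largely independent of file size, which is consistent with the speed noted above.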
