Mellon Foundation grant will support BitCurator expansion to improve analysis and accessibility of born-digital collections

October 17, 2016

The University of North Carolina at Chapel Hill has received a grant for $750,000 from the Andrew W. Mellon Foundation to support BitCurator NLP, a project that will develop software and protocols for the application of natural language processing (NLP) methods to born-digital library, archives, and museum (LAM) collections. The new tools created by the two-year project will enable professionals at LAMs to more effectively and efficiently curate digital materials, and ultimately make collections more accessible to individuals searching for information or documents.

Cal Lee, SILS Professor and PI for BitCurator NLP

“We have repeatedly heard that LAMs need tools to help identify and explore information on specific entities such as people, places, organizations, and events that are of interest to curators and researchers,” said UNC School of Information and Library Science (SILS) Professor Christopher (Cal) Lee, principal investigator of the BitCurator NLP project. “This is particularly important for digital collections that contain thousands or hundreds of thousands of files, when it is impossible to manually inspect materials to determine which of the files are relevant for preservation.”

BitCurator NLP will build on the successes of the BitCurator and BitCurator Access projects, which developed and distributed tools to help LAMs manage the rapidly growing body of digital materials with cultural value. BitCurator produced an open-source software environment that facilitates the relocation of materials from portable media, such as floppy disks, flash drives, and hard drives, to more sustainable environments. Users can create disk images, analyze files and file systems, extract data and metadata, and identify and redact sensitive information, among other tasks.

BitCurator Access further advanced these activities by producing BCA Webtools, which allows users to dynamically navigate the file systems of disk images and to search the content of many common file types. BitCurator Access also developed tools for redacting sensitive information and experimented with emulation as an access mechanism for disk image content. The BitCurator and BitCurator Access products and associated communities are being sustained by the independent, member-driven BitCurator Consortium.

BitCurator NLP will produce open-source software that institutions can use to extract, analyze, and produce reports about relevant features found in the open text of digital materials in their collections. The software will also enable LAMs to improve or implement NLP capabilities to read files from their digital collections and produce reports for end users on demand.

“While there are several existing and powerful open-source software NLP libraries and toolkits, no environment has been developed to deal with disk images or their content,” Lee said. “Disk images are complex, often containing a variety of data and document types, and need considerable pre-processing to extract the content that can be interpreted and organized by NLP tools. This requires the type of underlying software that is already available through BitCurator and BCA Webtools. LAMs will be able to run BitCurator NLP independently, or within existing software environments.”

Kam Woods, Research Scientist at SILS, will be the co-principal investigator and technical lead for BitCurator NLP. The project will also employ Sunitha Misra as a full-time software developer and SILS doctoral student Jacob Hill as project manager. An advisory group of external partners with significant relevant experience will provide guidance and expertise.