A $300,000 grant from the National Science Foundation (NSF) will support a new project at the UNC School of Information and Library Science (SILS) to develop software that can identify and redact sensitive information within research-related datasets, documents, and communications.
SILS Professor Christopher “Cal” Lee will serve as principal investigator, and Research Scientist Kam Woods will be technical lead, for the one-year (July 2021–June 2022) Computer-Assisted Redaction and Anonymization of Scholarly Communications and Products (CARASCAP) project. Antoine de Torcy will serve as the project’s software engineer.
Most modern redaction software is built on the same set of core technologies: a combination of document parsers, optical character recognition (OCR), and natural language processing (NLP) that identifies common types of private and individually identifying information. These products improve slowly as developers increase document format coverage, expand pattern libraries, or adopt enhanced NLP models.
“CARASCAP will introduce a new approach that adds explainability to the process,” said Woods. “This will allow archivists and other users to validate the software behaviors themselves by comparing those behaviors to actions performed by people redacting manually. Users can then create models tuned to specific redaction behaviors for collections of similar documents.”
Researchers and the institutions where they work face many data privacy and sensitivity issues with their scholarly products, according to Lee. Backlogs and staffing limitations can result in materials remaining inaccessible to the public indefinitely, or being released while they still contain sensitive data.
CARASCAP aims to help groups and individuals open and share more of their work with others, while attending to a variety of data sensitivity concerns. The project team also hopes to impact the workflows used by the community of institutions producing, preserving, and providing access to scholarly communications and products.
CARASCAP will build on the successes of previous projects, including BitCurator, BitCurator Access, BitCurator NLP, and RATOM, which developed and distributed open-source tools to help libraries, archives, and museums manage a diverse and rapidly growing body of digital materials.
The $300,000 award is part of the NSF’s Early-Concept Grants for Exploratory Research (EAGER) program. As the NSF’s website explains, EAGER funding supports exploratory work “on untested, but potentially transformative, research ideas or approaches.”