SILS and NC archives partner on RATOM project with $1.1M in support from Mellon Foundation

Only time will tell if the U.S. State Department report released in October marks the end of official investigations into “her emails,” but the protracted controversy underscores the importance of email in modern communication, and by extension its value as a public record and historical artifact.

“Traditionally, in archive collections that are physical, correspondence has always been one of the most important groups of records,” said Glynn Edwards, Assistant Director of Special Collections and Archives at Stanford University. “Emails are what I see as modern day correspondence. It’s actually much more complex and all-encompassing than that because people use it for all kinds of stuff. It can reveal your social networks, your purchases and politics. It can contain documents, photos, and moving images.”

Unfortunately, the dynamic nature of the format also makes it more challenging to process and archive.

“Despite 47 years of email creation and the vital role of email as documentation of activities across all sectors of society, the professional curation of email is still relatively immature,” said UNC School of Information and Library Science (SILS) Professor Christopher “Cal” Lee.

In early 2019, SILS and the State Archives of North Carolina launched a partnership aimed at advancing the technology and workflows used to archive emails. The Review, Appraisal, and Triage of Mail (RATOM) project received a $1.1 million grant from the Andrew W. Mellon Foundation.

RATOM is both an acronym and a tribute to Ray Tomlinson, who developed the messaging system that evolved into modern-day email. The two-year project is building on the successes of BitCurator, BitCurator Access, and BitCurator NLP, projects led by Lee, as well as the Transforming Online Mail with Embedded Semantics (TOMES) project, which Camille Tyndall Watson, Head of the Digital Service Section at the State Archives, led from 2016 to 2017.

Lee is Principal Investigator for RATOM, with Tyndall Watson as co-PI and Research Scientist Kam Woods as Technical Lead. In addition to assembling a staff, RATOM has convened an advisory board of international experts, including Edwards, who directs the Email: Process, Appraise, Discover, Deliver (ePADD) project at Stanford.

RATOM hosted "ml4arc – Machine Learning, Deep Learning, and Natural Language Processing Applications in Archives" on July 26 at Wilson Library on the University of North Carolina at Chapel Hill campus. Check out the agenda and speakers and read a blog post with highlights from the event, written by Emily Higgs, former NC State University Libraries Fellow and current Digital Archivist for the Swarthmore College Peace Collection and Friends Historical Library.

Preserving what is important, protecting what is private

The sheer volume of emails can be a significant obstacle for archivists. Like their federal counterparts at NARA, archivists at the State Archives have adopted a capstone approach, preserving emails from North Carolina governors, treasurers, and other members of the state’s executive branch from their time in office.

“Even though we don’t handle a huge number of email accounts, any given account has thousands and thousands of messages,” said Tyndall Watson. “It’s a really big challenge for processing and access.”

The need to redact personal information disclosed or discussed in email exchanges further complicates the work. Depending on state law and organizational policies, archivists must redact information ranging from social security numbers to personal health information to decisions regarding employment and beyond. This type of sensitive information can be prevalent in emails, as former Florida Governor Jeb Bush demonstrated in 2015 when he published hundreds of thousands of emails from his years in office, exposing social security numbers, home addresses, phone numbers, and other private information his constituents had shared.

Given the vast number of emails a single person’s account usually contains, it is impractical bordering on impossible for a person to read each message and make the necessary decisions regarding its content.

Joanne Kaczmarek, a RATOM advisory board member and Director of Records and Information Management Services at the University of Illinois Library, has been working with the Illinois State Archives to process the email of the recent governor and his administration using commercially available software often employed by law firms.

To teach the software what kind of information to tag in the emails, Kaczmarek and her team spent time reviewing individual messages. Based on this, Kaczmarek determined that it would take a person working 40 hours per week approximately 27 years to process the 5.4 million emails her group received from the state archives. By employing the software, the project took about four months.
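The article does not say how the 27-year figure was derived, but the per-message pace it implies can be checked with simple arithmetic. The sketch below assumes a 52-week working year with no time off, which is an assumption for illustration and not a detail from Kaczmarek's team.

```python
# Back-of-the-envelope check of the manual-review estimate quoted above.
# Assumption (not from the article): a 52-week working year with no time off.
EMAILS = 5_400_000
YEARS = 27
HOURS_PER_WEEK = 40
WEEKS_PER_YEAR = 52  # assumed


def implied_review_rate(emails, years, hours_per_week, weeks_per_year):
    """Return total review hours, emails per hour, and seconds per email."""
    total_hours = years * weeks_per_year * hours_per_week
    emails_per_hour = emails / total_hours
    seconds_per_email = 3600 / emails_per_hour
    return total_hours, emails_per_hour, seconds_per_email


total_hours, per_hour, secs = implied_review_rate(
    EMAILS, YEARS, HOURS_PER_WEEK, WEEKS_PER_YEAR
)
print(f"{total_hours:,} hours, about {per_hour:.0f} emails/hour, "
      f"about {secs:.0f} seconds each")
# prints: 56,160 hours, about 96 emails/hour, about 37 seconds each
```

Under that assumption, the estimate works out to a reviewer spending a little over half a minute on each message, every working hour, for 27 years.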

“The value of this kind of work is that we’re really trying to be accountable to the citizenry of the country, honoring the idea that the public does have a right to know and understand how the government is working – or not,” Kaczmarek said. “But we also have to be thoughtful about it, to balance people’s right to know with the right to privacy. Using these new techniques, we won’t have to embargo information for 10 to 20 years because we don’t have the time or the staff to review it. We can make it available to people while it’s still relevant to their lives.”

Using NLP to help LAMs

TOMES produced a natural language processor to help state archivists identify which emails should be saved and which should be discarded, and whether the preserved emails contained personally identifiable information. Although the results of this process could be made available to the public, Tyndall Watson said the size and format of the file produced would make it difficult for people to use.
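To give a sense of what automated flagging of personally identifiable information can look like at its simplest, here is a minimal sketch that uses plain pattern matching. It is not the TOMES processor or any RATOM code, and it is far cruder than natural language processing. The pattern, the function name `flag_messages`, and the sample messages are all hypothetical and made up for illustration.

```python
import re

# Hypothetical pattern: nine digits grouped 3-2-4 with hyphens, a common
# written form of a U.S. Social Security number. It misses other formats
# and can match unrelated numbers, so hits are candidates for human review.
SSN_LIKE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def flag_messages(messages):
    """Return (message_id, matches) for each message with an SSN-like string."""
    flagged = []
    for message_id, body in messages.items():
        matches = SSN_LIKE.findall(body)
        if matches:
            flagged.append((message_id, matches))
    return flagged


# Made-up sample data for illustration only.
sample = {
    "msg-001": "Please update my record, my SSN is 123-45-6789.",
    "msg-002": "The meeting moved to 3 p.m. on Friday.",
}
print(flag_messages(sample))
# prints: [('msg-001', ['123-45-6789'])]
```

A filter like this only narrows the pile. Deciding what to redact still depends on state law and organizational policy, and much sensitive content, such as health information or employment decisions, follows no fixed pattern.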

RATOM is developing new open-source software that will extend the natural language processing capabilities of TOMES and the BitCurator environment to help identify topics of interest within email collections, so messages can be tagged for easier retrieval. Tyndall Watson foresees using the software to process emails based on records requests until entire databases have been sorted and organized into relevant categories.

The outcomes of the RATOM project will benefit libraries, archives, and museums (LAMs) handling almost any collection with born-digital items.

“Email is often present in acquisitions that include other types of materials,” Lee said. “LAMs are increasingly looking for tools and methods to identify and document both the records and contextual relationships between them.”

Tyndall Watson said she hopes the project produces standardized processes that can be adopted by organizations, even if they don’t have digital specialists on staff, so that “e-mail archiving isn’t something that seems out of reach.”
