Data-intensive research is becoming the new model in STEM, but the majority of graduate students do not receive formal training in data science best practices, instead learning by trial and error. Arcot Rajasekar, Frances McColl Distinguished Term Professor at the UNC School of Information and Library Science (SILS), is leading an effort to correct this deficiency. He has been awarded a grant from the National Science Foundation (NSF) worth nearly $500,000 to host workshops and develop a curriculum that provides cutting-edge data management training for graduate students in STEM disciplines. The project, titled “Cyber Carpentry: Data Life-Cycle Training Using the Datanet Federation Consortium Platform,” officially launched November 1.
“At a time when thousands of scientists and engineers are creating and using large numbers of distributed datasets to explore an increasingly diverse mix of phenomena, the need for training in the areas of data life-cycle management and data-intensive computation becomes very important,” Rajasekar said. “This project will develop training approaches that draw from real-world data and processes, enabling students to learn how to ensure that data is properly organized, managed, and preserved, both for their own research and for future researchers who may want to mine or expand their datasets.”
The Cyber Carpentry project will use the Datanet Federation Consortium (DFC), an NSF-funded project that has implemented a data-centered cyber platform with integrated tools for end-to-end data life-cycle management and data-intensive high-performance computation. In addition to his appointment at SILS, Rajasekar is a Chief Scientist at the Renaissance Computing Institute (RENCI), which hosts and administers the DFC’s hub.
“By basing the training on a common platform, the core part of the practices will be applicable to all disciplines, but our workshops will include specialized methods and tools that are taken from specific STEM areas,” said Rajasekar. “Our training workshops will be multi-disciplinary: earth system sciences, biological sciences, social and information sciences, marine sciences, and engineering.”
The project’s short-term goal is to offer brief but intensive workshops that can lead to data science certification for STEM graduate students. In the long term, the project will develop and openly share a sequence of courses that can be adapted by different STEM disciplines. The software developed by the DFC is already open source.