Big data is only getting bigger, and that can cause big problems for researchers who need to store and share their data. Twenty doctoral students and postdoctoral associates from across the country learned the tools and techniques to solve these problems at the inaugural Cyber Carpentry Workshop at the University of North Carolina at Chapel Hill. Sponsored by the National Science Foundation (NSF) and hosted by the UNC School of Information and Library Science (SILS), the two-week workshop in July introduced students to a variety of applications, platforms, and processes for data life-cycle management and data-intensive computation.
“Previously, you had maybe a thousand files, maybe ten thousand,” said Arcot Rajasekar, SILS professor and director of the Cyber Carpentry Workshop. “Now, you’re talking about 100 million files and doing simulations and emulations that can create petabytes of data. Managing that just by human interaction is not going to be effective; you need some automation there. In addition to the volume of data, you have to consider the velocity of data coming in and the multiple varieties of data you’re collecting. This is not easily done without a good level of management.”
The workshop familiarized participants with the concepts of virtualization, automation, and federation as defined through the DataNet Federation Consortium (DFC), an NSF-funded project that promotes sharing within and across science and engineering disciplines. Instructors introduced specific DFC web portals, including CyVerse, Dataverse, DataONE, and HydroShare, as well as relevant software, metadata management strategies, and large-scale workflows. Many students also arrived early each day for “breakfast carpentry,” an open-topic discussion with CyVerse Software Engineer Julian Pistorius.
“We've covered everything from search engines to neural networks to containers,” said Will Sutherland-Keller (MSIS ’17), a SILS graduate who participated and helped evaluate the workshop. “It breaks the ice on getting involved with using those technologies because it sort of forces you to just start using them.”
The hands-on approach was designed to give participants “a more realistic learning experience,” said instructor Nirav Merchant, director of the Data Science Institute at the University of Arizona and cyberinfrastructure lead for CyVerse. Merchant said the workshop offered exposure to topics and technologies that are rarely addressed in traditional academic coursework, making the opportunity especially valuable.
Though not affiliated with Software Carpentry or Data Carpentry, Cyber Carpentry organizers drew inspiration from those projects. The workshop at Carolina brought together data professionals, educators, and researchers from SILS, the Odum Institute, RENCI, iRODS, the University of Arizona (CyVerse), Indiana University (Jetstream), the University of Virginia (HydroShare), Drexel University, and Amazon (AWS) to teach this intensive two-week course. In addition, an assessment team composed of SILS faculty members and doctoral students observed the activities and interviewed instructors and participants in order to make recommendations for improvements to next year’s workshop.
On the final day of the workshop, teams delivered presentations on how they had used what they learned to reproduce the work of other researchers. Reproducibility was the key theme that instructor Bakinam Essawy wanted to convey.
“You need to make sure that who is coming after you could run your work, could build off your work or it's going to be a dark model or your data is going to be just buried, no one’s going to know about it,” said Essawy, a research associate in the Civil and Environmental Engineering Department at the University of Virginia. “This is not the purpose of the research. Research has to be continuous.”
The workshop drew students from across the country, with NSF funding providing travel and accommodation support. Anuja Majmundar, a doctoral student at the University of Southern California, said the Cyber Carpentry Workshop offered a great opportunity for her to learn tools and procedures that could make data science more reproducible and scalable, especially for the diverse data streams she encounters in her research on health behaviors. She also enjoyed making connections with peers in other disciplines.
“I found the group project particularly engaging,” she said. “Our group was interdisciplinary and keen to learn the best practices for data management. We helped each other out and delivered on advanced project goals.”
Jocelyn Colella, a PhD candidate in evolutionary genomics at the University of New Mexico, said gaining experience with containers – programs that can virtualize entire scientific workflows, including software, libraries, and data – was one of the highlights of her experience, and that the introduction to the Jetstream and CyVerse virtual environments had significant implications for her research.
“Coming from a smaller lab, it has been incredibly expensive to build the computing resources and data archival infrastructure necessary to deal with terabytes of genomic data,” she said. “Learning about the free computational and storage resources available through NSF-funded projects has revolutionized how I conceptualize my own workflows and will alter how I apply for grants going into the future.”