NSF BIG DATA grant awarded to SILS professor

October 3, 2012

NSF invests nearly $15 million in new big data research projects, and the start of an idea-generating challenge

CHAPEL HILL - The National Science Foundation (NSF), with support from the National Institutes of Health (NIH), today announced nearly $15 million in new Big Data fundamental research projects.

Dr. Arcot Rajasekar, professor at the School of Information and Library Science (SILS) at the University of North Carolina at Chapel Hill and chief scientist at the Renaissance Computing Institute (RENCI), is the principal investigator of one of the eight Big Data projects receiving awards, which aim to develop new tools and methods to extract and use knowledge from collections of large data sets to accelerate progress in science and engineering research and innovation. 

The project, “DataBridge – A Sociometric System for Long-Tail Science Data Collections,” will use socio-metric networks, similar to LinkedIn or Facebook but on a larger scale, to enable scientists to find data and like-minded researchers. It will improve the discovery of relevant scientific data across large, distributed and diverse collections. The funds provided by NSF for the DataBridge project total $1.5 million.

“This grant is an excellent fit for campus and SILS’ priorities and the research and curriculum initiatives already underway,” said Dr. Gary Marchionini, dean and Cary C. Boshamer Distinguished Professor. “We are excited to have an opportunity to extend our national leadership in big data tools and services.”

“The DataBridge project builds on a long history of national excellence in data management and archiving enjoyed by the Odum Institute for Research in Social Science,” said Thomas M. Carsey, co-principal investigator of the project, Odum’s director and Pearsall Distinguished Professor of political science. “Odum is always conducting research and development with a focus on delivering research data to scholars efficiently and effectively. The DataBridge project represents the next significant step in that ongoing mission.”

The “DataBridge” project is a collaboration with Drs. Gary King, Albert J. Weatherhead III University Professor, Institute for Quantitative Social Science, Faculty of Arts and Sciences, Harvard University; Merce Crosas, director of product development, Harvard-MIT Data Center, Harvard University; and Justin Zhan, director of iLab, Department of Computer Science, North Carolina Agricultural and Technical State University. Co-PIs from UNC include Thomas M. Carsey, distinguished professor and director, H.W. Odum Institute; Hye-Chung Kum, research associate professor, School of Social Work, and adjunct professor in the Department of Computer Science; Howard Lander, senior research software developer; and Sharlini Sankaran, executive director, REACH NC/RENCI. Jonathan Crabtree, assistant director of Computing and Archiving, H.W. Odum Institute, and a doctoral student at SILS, is senior personnel on the project.

The big data grants awarded today were made in response to a joint NSF-NIH call for proposals issued in conjunction with the March 2012 Big Data Research and Development Initiative launch (see “NSF Leads Federal Efforts in Big Data”).

“I am delighted to provide such a positive progress report just six months after fellow federal agency heads joined the White House in launching the Big Data Initiative,” said NSF Director Subra Suresh. “By funding the fundamental research to enable new types of collaborations--multi-disciplinary teams and communities--and with the start of an exciting competition, today we are realizing plans to advance the foundational science and engineering of Big Data, fortifying U.S. competitiveness for decades to come.”

“To get the most value from the massive biological data sets we are now able to collect, we need better ways of managing and analyzing the information they contain,” said NIH Director Francis S. Collins. “The new awards that NIH is funding will help address these technological challenges--and ultimately help accelerate research to improve health--by developing methods for extracting important, biomedically relevant information from large amounts of complex data.”

The eight projects announced today run the gamut of scientific techniques for big data management, new data analytic approaches, and e-science collaboration environments with possible future applications in a variety of fields, such as physics, economics and medicine.

“Data represents a transformative new currency for science, engineering and education,” said Farnam Jahanian, assistant director for NSF's Directorate for Computer and Information Science and Engineering.  “By advancing the techniques and technologies for data management and knowledge extraction, these new research awards help to realize the enormous opportunity to capitalize on the transformative potential of data.”

NSF, along with NASA and the Department of Energy, also announced the start of an idea-generating challenge series, opening additional avenues for innovation in seizing the opportunities afforded by big data science and engineering. The competition will be run by the NASA Tournament Lab (NTL), a collaboration between Harvard University and TopCoder, a competitive community of digital creators.

The NTL platform and process allow U.S. government agencies to conduct high risk/high reward challenges in an open and transparent environment with predictable cost, measurable outcomes-based results and the potential to move quickly into unanticipated directions and new areas of software technology. Registration is open through Oct. 13, 2012 for the first of four idea generation competitions in the series. Full competition details and registration information are available at the Ideation Challenge Phase Web site.

“Big Data is characterized not only by the enormous volume or the velocity of its generation, but also by the heterogeneity, diversity and complexity of the data,” said Suzi Iacono, co-chair of the interagency Big Data Senior Steering Group, a part of the Networking and Information Technology Research and Development program, and senior science advisor at NSF. “There are enormous opportunities to extract knowledge from these large-scale, diverse data sets, and to provide powerful new approaches to drive discovery and decision-making, and to make increasingly accurate predictions. We’re excited about the awards we are making today and to see what the idea generation competition will yield.”

Today at a TechAmerica event on Capitol Hill, Iacono announced the award recipients. They are listed below.

BIG DATA AWARDS

BIGDATA: Mid-Scale: ESCE: DCM: Collaborative Research: DataBridge - A Sociometric System for Long-Tail Science Data Collections  - databridge.web.unc.edu

University of North Carolina at Chapel Hill, Arcot Rajasekar
Harvard University, Gary King
North Carolina Agricultural and Technical State University, Justin Zhan

The sheer volume and diversity of data present a new set of challenges in locating all of the data relevant to a particular line of scientific research. Taking full advantage of the unique data in the "long-tail of science" requires new tools specifically created to assist scientists in their search for relevant data sets. DataBridge supports advances in science and engineering by directly enabling and improving discovery of relevant scientific data across large, distributed and diverse collections using socio-metric networks. The system will also provide an easy means of publishing data through the DataBridge and incentivize data producers to do so by enhancing collaboration and data-oriented networking.

BIGDATA: Mid-Scale: DCM: Collaborative Research: Eliminating the Data Ingestion Bottleneck in Big-Data Applications

Rutgers University, Martin Farach-Colton
Stony Brook University, Michael Bender

Big-data practice suggests that there is a tradeoff between the speed of data ingestion, the ability to answer queries quickly (e.g., via indexing), and the freshness of data. This tradeoff has manifestations in the design of all types of storage systems. In this project the PIs show that this is not a fundamental tradeoff, but rather a tradeoff imposed by the choice of data structure. They depart from the use of traditional indexing methodologies to build storage systems that maintain indexing 200 times faster in databases with billions of entries.

BIGDATA: Mid-Scale: DCM: A Formal Foundation for Big Data Management

University of Washington, Dan Suciu

This project explores the foundations of big data management with the ultimate goal of significantly improving the productivity in big data analytics by accelerating data exploration. It will develop open source software to express and optimize ad hoc data analytics.  The results of this project will make it easier for domain experts to conduct complex data analysis on big data and on large computer clusters.

BIGDATA: Mid-Scale: DA: Analytical Approaches to Massive Data Computation with Applications to Genomics

Brown University, Eli Upfal

The goal of this project is to design and test mathematically well-founded algorithmic and statistical techniques for analyzing large-scale, heterogeneous and so-called noisy data. This project is motivated by the challenges in analyzing molecular biology data. The work will be tested on extensive cancer genome data, contributing to better health and new health information technologies, areas of national priority.

BIGDATA: Mid-Scale: DA: Distribution-based machine learning for high dimensional datasets

Carnegie Mellon University, Aarti Singh

The project aims to develop new statistical and algorithmic approaches to natural generalizations of a class of standard machine learning problems.  The resulting novel machine learning approaches are expected to benefit other scientific fields in which data points can be naturally modeled by sets of distributions, such as physics, psychology, economics, epidemiology, medicine, and social network-analysis.

BIGDATA: Mid-Scale: DA: Collaborative Research: Genomes Galore - Core Techniques, Libraries, and Domain Specific Languages for High-Throughput DNA Sequencing

Iowa State University, Srinivas Aluru
Stanford University, Oyekunle Olukotun
Virginia Polytechnic Institute and State University, Wu-chun Feng

The goal of the project is to develop core techniques and software libraries to enable scalable, efficient, high performance computing solutions for high-throughput DNA sequencing, also known as next-generation sequencing. The research will be conducted in the context of challenging problems in human genetics and metagenomics, in collaboration with domain specialists.

BIGDATA: Mid-Scale: DA: Collaborative Research: Big Tensor Mining: Theory, Scalable Algorithms and Applications

Carnegie Mellon University, Christos Faloutsos
University of Minnesota, Twin Cities, Nikolaos Sidiropoulos

The objective of this project is to develop theory and algorithms to tackle the complexity of language processing, and to develop methods that approximate how the human brain works in processing language.  The research also promises better algorithms for search engines, new approaches to understanding brain activity, and better recommendation systems for retailers.

BIGDATA: Mid-Scale: ESCE: Collaborative Research: Discovery and Social Analytics for Large-Scale Scientific Literature

Rutgers University, Paul Kantor
Cornell University, Thorsten Joachims
Princeton University, David Blei

This project will focus on the problem of bringing massive amounts of data down to the human scale by investigating the individual and social patterns that relate to how text repositories are actually accessed and used.  It will improve the accuracy and relevance of complex scientific literature searches. 
 

Related Web sites
TechAmerica Big Data Commission Web site: http://www.techamericafoundation.org/bigdata
TechAmerica's Demystifying Big Data: A Practical Guide to Transforming the Business of Government: http://www.techamerica.org/Docs/techAmerica-BigDataReport-FINAL.pdf
 
The National Science Foundation (NSF) is an independent federal agency that supports fundamental research and education across all fields of science and engineering. In fiscal year (FY) 2012, its budget is $7.0 billion. NSF funds reach all 50 states through grants to nearly 2,000 colleges, universities and other institutions. Each year, NSF receives over 50,000 competitive requests for funding, and makes about 11,000 new funding awards. NSF also awards nearly $420 million in professional and service contracts yearly.

Useful NSF Web Sites:
NSF Home Page: http://www.nsf.gov
NSF News: http://www.nsf.gov/news/
For the News Media: http://www.nsf.gov/news/newsroom.jsp
Science and Engineering Statistics: http://www.nsf.gov/statistics/
Awards Searches: http://www.nsf.gov/awardsearch/
 

Media Contacts at UNC at Chapel Hill: 

Wanda Monroe, School of Information and Library Science, wmonroe@unc.edu, 919.843.8337

Karen Green, RENCI, kgreen@renci.org, 919.445.9648