Can Machine Learning Improve Air Quality?
Improving the Research Experience for EPA Scientists
When searching online for information, how often do you go beyond the first page of results? Search engines often return hundreds, if not thousands, of web pages that might provide the perfect apple pie recipe, the best instructions for installing hardwood floors, or the most amazing restaurant in the town you’re about to visit. What if the best information available was on page 65 of your search results? Would you miss it?
Now imagine that instead of a home cook looking for pie recipes, you’re an EPA scientist authoring a report that will guide public policies on air pollution. You want to ensure that you’ve reviewed all the relevant research that studied how air pollution might affect public health. The stakes are much higher.
When scientists at the Environmental Protection Agency’s (EPA) Center for Public Health and Environmental Assessment (CPHEA) draft scientific assessments to inform policy decisions such as setting standards for air pollutants, they are often faced with tens or even hundreds of thousands of research articles on a topic. For example, they considered over 170,000 search results on the topic of ozone alone in 2020. Fewer than 1% of those provided scientific evidence relevant to air quality policymaking and were therefore cited in the final assessment. With a limited number of scientists contributing to each of these scientific documents and a mandate to cast the broadest possible net, it would be humanly impossible for them to read all search results to find the few relevant ones.
The EPA is creating software that uses machine learning to help scientists find relevant articles without reading every search result, but such approaches require testing. For that testing, an EPA representative reached out to Yue “Ray” Wang in the UNC School of Information and Library Science (SILS).
Wang is a researcher in the areas of text data mining, machine learning, and information retrieval. Machine learning is a subfield of artificial intelligence. In basic terms, Wang uses advanced algorithms and statistics to teach computers how to find useful information in large amounts of data by progressively learning what a user needs.
Wang’s work with the EPA began in 2021. Working with two then-SILS students, Jingwen Hou and Xiaochen Wang, he designed experiments to evaluate several literature screening algorithms and see which one could most quickly reach an extremely high recall, that is, the percentage of all relevant articles that are retrieved. The trio demonstrated that combining multiple approaches to ranking content was extremely effective. They published a paper on their work at the 31st Association for Computing Machinery International Conference on Information and Knowledge Management.
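To make recall and the idea of combining rankings concrete, here is a minimal Python sketch. The article does not say how the team combined ranking approaches, so the fusion method below (reciprocal rank fusion, a common general-purpose technique) is an assumption chosen only for illustration. The function names, the paper IDs, and the toy rankings are likewise hypothetical.

```python
def recall_at(ranking, relevant, k):
    """Fraction of all relevant papers found among the first k screened."""
    found = sum(1 for doc in ranking[:k] if doc in relevant)
    return found / len(relevant)


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several rankings into one (illustrative, not the paper's method).

    Each paper earns 1 / (k + position) from every ranking it appears in,
    and papers are sorted by their total score, highest first.
    """
    scores = {}
    for ranking in rankings:
        for position, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + position)
    # Ties are broken by paper ID so the output is deterministic.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


# Hypothetical toy data: two rankers that each get one relevant paper right.
ranker_a = ["d1", "d4", "d2", "d3", "d5"]
ranker_b = ["d2", "d5", "d1", "d4", "d3"]
relevant = {"d1", "d2"}

fused = reciprocal_rank_fusion([ranker_a, ranker_b])
print(recall_at(ranker_a, relevant, 2))
print(recall_at(ranker_b, relevant, 2))
print(recall_at(fused, relevant, 2))
```

In this toy case each ranker alone finds only one of the two relevant papers in its top two positions, while the fused ranking places both at the top. Real screening data would be far larger and noisier, so this only illustrates the mechanics.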
Wang has also been able to use the new publicly available datasets from this work in the data mining course he teaches. Students are challenged to build an algorithm that ranks research papers so that relevant papers appear at the top. Their success is measured by how much effort a scientist would save in reaching 95% recall by screening papers in the order given by their algorithm rather than in a random order. This hands-on work lets students practice addressing real-world challenges.
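The effort-saved measure described above can be sketched in a few lines of Python. The exact formula used in the course is not given in the article. This sketch assumes a "work saved over sampling" style metric, in which a random order is expected to require screening roughly 95% of the papers to reach 95% recall. The function names and the toy data are hypothetical.

```python
import math


def papers_needed(ranking, relevant, target_recall=0.95):
    """Number of papers screened, in ranked order, to hit the target recall."""
    if not relevant:
        raise ValueError("at least one relevant paper is required")
    needed = math.ceil(target_recall * len(relevant))
    found = 0
    for screened, doc in enumerate(ranking, start=1):
        if doc in relevant:
            found += 1
            if found >= needed:
                return screened
    raise ValueError("ranking does not contain enough relevant papers")


def work_saved(ranking, relevant, target_recall=0.95):
    """Fraction of the collection a scientist avoids reading versus random order.

    Assumption: random screening needs about target_recall * N papers
    to reach the target recall.
    """
    total = len(ranking)
    random_baseline = target_recall * total
    return (random_baseline - papers_needed(ranking, relevant, target_recall)) / total


# Hypothetical toy data: 100 papers, 4 relevant, ranked p0, p1, ..., p99.
ranking = [f"p{i}" for i in range(100)]
relevant = {"p0", "p1", "p2", "p7"}

print(papers_needed(ranking, relevant))
print(work_saved(ranking, relevant))
```

With four relevant papers, 95% recall requires finding all four, and the last one sits at position 8. The scientist screens 8 papers instead of the roughly 95 expected under a random order, a saving of about 87% of the collection.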
Recently, Wang initiated a new memorandum of understanding between UNC SILS and the EPA’s CPHEA to explore how the software could work with the continuous stream of new literature in real time. Instead of waiting to review tens of thousands of articles in one batch, they want to determine whether the software can be effectively trained to assess articles as they’re published. Wang will provide a critical scientific evaluation of this next stage of the software. No doubt he’ll also find more ways to incorporate this project into his courses, meaning that our environment, the EPA, and future information science professionals will continue to benefit from this research collaboration.
Related Research Areas: Information Interaction and Retrieval