Technical Reports

TR 2016-01

Kumar, Manish; Gotz, David

System Design Barriers to HIS Data Use in Low and Middle-income Countries: A Literature Review

The ability of low and middle-income countries (LMICs) to monitor and measure their progress toward sustainable development health goals depends on the design, implementation and use of a robust national health information system (HIS). Improved health system performance is directly linked with the use and quality of routine data in a country’s HIS. However, studies have reported several types of barriers that hamper data quality and data use in the national health systems of LMICs. To better understand these barriers, we conducted a systematic review of the scientific literature. The objective of this literature review was to synthesize and summarize the system design-related research and implementation gaps affecting data quality and data use in decision-making. The review sought to answer two key questions: (i) What does the published literature tell us about HIS design barriers to data use in LMICs? (ii) What, if any, are the main research and implementation gaps?

TR 2014-01

Wildemuth, Barbara M.; Freund, Luanne; Toms, Elaine G.

Studies of Search Task Complexity or Difficulty

The studies listed in this bibliography were the basis of the analysis reported in: Wildemuth, B.M., Freund, L., & Toms, E.G. (2014). Untangling search task complexity and difficulty in the context of interactive information retrieval studies. Journal of Documentation, in press. They were identified from searches, conducted in 2013, of the Repository of Assigned Search Tasks (RepAST). The 239 items retrieved were further evaluated in terms of their relevance to the focus of the planned study. Only the 106 studies that discussed search task complexity or difficulty or assigned search tasks of varying complexity or difficulty were analyzed further. They are listed here.


Poole, Alex H.; Lee, Christopher A.; Barnes, Heather L.; Murillo, Angela P. 

Digital Curation Preparation: A Survey of Contributors to International Professional, Educational, and Research Venues 

This paper characterizes the types of research environments in which individuals engaged in digital curation research are embedded or have recently been educated. It reports data from an international survey of individuals who have presented their work in professional venues (conferences and journals) that address digital curation. We address a fundamental research question: What are the contexts in which digital curation research is being conducted? More specifically: (a) In what disciplinary and institutional contexts are stakeholders conducting digital curation education and research? (b) How do digital curation researchers characterize the field and their own work? (c) How do digital curation researchers describe their current and projected research environments? (d) What are the primary venues of engagement with the digital curation research network?

Findings indicated that few respondents were students at the time of their contribution. Most respondents occupied senior-level roles or were faculty members, but few job titles included the term “curation.” Respondents had a range of skill sets and had come to digital curation activities from varied career paths and undergraduate and graduate degrees. More than four-fifths of respondents had earned master’s degrees and 13.5% had earned two master’s degrees. While nearly 40% of respondents with master’s degrees had earned them in Library Science and/or Information Science, only a quarter had earned them in Computer Science, Engineering, or Applied Mathematics and nearly 29% had earned them in Arts or Humanities. Nearly 43% of respondents had earned doctoral degrees in a range of fields, most commonly in Library and Information Science-related disciplines (nearly 29%) or in Computer Science, Engineering, or Applied Mathematics (nearly 30%). Respondents connected with the digital curation research network through a wide range of conferences, journals, and associations that relate predominantly to the information professions (libraries, archives, and information science). Few respondents, however, described their research discipline as “digital curation.” Finally, nearly three-quarters of respondents reported that they were currently engaged in research and more than half of respondents reported that they were currently mentoring students. Future research could explore which institutions are embarking upon doctoral-level digital curation education initiatives and their strategies for doing so, what might be appropriate models for funding and resources for doctoral student education in digital curation, what training and apprenticeship roles and responsibilities in digital curation doctoral education are currently available, and what other venues might be established to further nurture a digital curation doctoral community.




Kelly, Diane

A preliminary investigation of search self-efficacy.

The goal of this work is to identify a measure of search expertise that can be used by researchers to better understand searchers' information seeking behaviors and interactions with information systems. This technical report describes initial investigations into one approach to measuring search expertise based on Bandura's concept of self-efficacy (1977), Compeau and Higgins' (1995) use of this concept in their Computer Self-Efficacy scale, and Debowski, Wood and Bandura's (2001) use of it in their Search Self-Efficacy scale. This report also describes a modified version of Debowski et al.'s Search Self-Efficacy scale and results of a preliminary investigation of how 23 undergraduates used the scale to characterize their search self-efficacy.




Marchionini, Gary; Tibbo, Helen; Lee, Cal A.; Jones, Paul; Capra, Robert; Geisler, Gary; Russell, Terrell; Shah, Chirag; Sheble, Laura*; Jorda, Sarah; Song, Yaxiao; Howard, Dawne E.; Clemens, Rachael; and Hill, Brenn.

*Thank you, Laura Sheble for crafting this final report from four years of VidArch Project work.

"VidArch: Preserving Video Objects and Context Final Report"

Video is becoming increasingly important to digital libraries and archives, both as a primary content type and as context for other collection objects. Videos included in collections may be works unto themselves; documentary evidence of people, places, and events core to a collection mission; or documentary evidence for primary objects in a collection. Recognizing these roles of video in popular and scholarly culture, the Library of Congress included the VidArch project in its NDIIPP research portfolio. The VidArch project contributed to the further development of policies and tools to facilitate the preservation of digital video from the WWW through an examination of video not as isolated information objects, but as information-rich multi-sensory elements embedded in an equally information-rich use environment. We explored the meaning of context as it relates to video from internal, external, and life cycle-based perspectives. On the one hand, we performed video content analysis and conceptualized a multi-faceted understanding of context based on the life cycle of video production, delivery, and use. On the other, we explored relationships between video and other elements of the networked online environment. On an implementation level, we evaluated the use of finding aids and documentation of contextual information for controlled video collections; explored the use of a collaborative online environment as an extension of the concept of the finding aid; developed tools to mine contextual elements for online video from the WWW; and implemented the robust preservation-compatible iRODS framework at the collection level for primary objects and related contextual information entities. The VidArch Project was supported by a grant from the National Science Foundation (#IIS 0455970 DigArch Program) as one of the NDIIPP research projects, and by a follow-up contract from the Library of Congress as a part of the National Digital Information Infrastructure and Preservation Program.
While based at the University of North Carolina, Chapel Hill School of Information and Library Science, the Project was enriched by valuable contributions from our partners: the Association for Computing Machinery (ACM); iBiblio; the Internet Archive; the National Aeronautics and Space Administration (NASA); and the San Diego Supercomputer Center Data Intensive Cyber Environments (DICE) team.


By the members of the Spring 2008 Public Libraries Seminar

"American Public Library Topics - an Annotated Bibliography"


During the spring 2008 semester at the School of Information and Library Science at the University of North Carolina at Chapel Hill, the members of the Public Libraries Seminar considered the state of the American Public Library from several aspects.

After pondering the philosophical, political, professional, and ecological contexts in which the public library exists, each of the members guided the seminar through a topic area that held special meaning for them. The result of these guided tours is the annotated subject bibliography contained in this report.

While the bibliographies are probably a full and fairly complete resource for anyone else interested in the topics discussed, the goal was not to create a dry academic resource. Rather the objective in creating the bibliographies was that the students list those resources that held particular meaning for them, and that their comments about the resources be personal, sincere, and tied to their individual concerns.

This is the fourth iteration of a public library bibliography and supplements those created by the members of the spring 2005 through spring 2007 Public Libraries Seminars. The four together form a solid foundation for subsequent public library seminars to modify, add to, and enhance.


Weimao Ke, Cassidy R. Sugimoto, and Javed Mostafa
Laboratory of Applied Informatics Research

"Dynamicity vs. Effectiveness: A User Study of a Clustering Algorithm for Scatter/Gather"

We proposed and implemented a novel clustering algorithm called LAIR2, which has linear worst-case time complexity and constant running time average for on-the-fly Scatter/Gather browsing [4]. Our previous experiments showed that when running on a single processor, the LAIR2 on-line clustering algorithm was several hundred times faster than the parallel Buckshot algorithm running on multiple processors [11]. This paper reports on a study that examined the effectiveness of the LAIR2 algorithm in terms of clustering quality and its impact on retrieval performance. We conducted a user study on 24 subjects to evaluate on-the-fly LAIR2 clustering in Scatter/Gather search tasks by comparing its performance to the Buckshot algorithm, a classic method for Scatter/Gather browsing [4]. Results showed significant differences in terms of subjective perceptions of clustering quality. Subjects perceived that the LAIR2 algorithm produced significantly better quality clusters than the Buckshot method did. Subjects felt that it took less effort to complete the tasks with the LAIR2 system, which was more effective in helping them in the tasks. Interesting patterns also emerged from the subjects’ comments in the final open-ended questionnaire. We discuss the implications and future research.


Losee, Robert

"Vocabulary Conversion: Performance with Controlled and Uncontrolled Terms and Tags"

Controlled and uncontrolled indexing terminology and metadata may be converted from one to another. Decision criteria are developed that can be used to determine which terms should be assigned when converting vocabularies. Methods are developed for computing the parameters of these systems, as well as means for estimating the parameters when given limited information. These conversion techniques may be applied to thesaurus terminology, gene ontologies, topic maps, uncontrolled natural language terms, folksonomies, tags and labels on web pages, the presence or absence of a specific hyperlink, as well as to metadata. Rules are provided suggesting circumstances when controlled vocabularies are always superior to using uncontrolled vocabularies.


Kelly, Diane; Shah, Chirag; Sugimoto, Cassidy R.; Bailey, Earl W.; Clemens, Rachel A.; Irvine, Ann K.; Johnson, Nicholas A.; Ke, Weimao; Oh, Sanghee; Poljakova, Anezka; Rodriguez, Marcos A.; van Noord, Megan G.; and Zhang, Yan

"Method Bias? The Effects of Performance Feedback on Users’ Evaluations of an Interactive IR System"

In this study, we seek to understand how providing feedback to users about their performances with an interactive information retrieval (IIR) system impacts their evaluations of that system. Sixty subjects completed three recall-based searching tasks with an experimental IIR system and were asked to evaluate the system after each task and after finishing all three tasks. Before completing the final evaluation, three-fourths of the subjects were provided with feedback about their performances. Subjects were assigned randomly to one of four feedback conditions: a baseline condition where no feedback was provided; an actual feedback condition where subjects were provided with their real performances; and two conditions where subjects were deceived and told that they performed very well or very poorly. Results show that the type of feedback provided significantly affected subjects' system evaluations; most importantly there was a significant difference in subjects' satisfaction ratings before and after feedback was provided in the actual feedback condition. These results suggest that researchers should provide users with feedback about their performances when this information is available in order to elicit the most valid evaluation data.


Baldwin, Tim; Christodoulou, Alexandros; Gillenwater, Cary; Johnson, Nicholas; Kumar, Amit; Marchionini, Gary; Moynihan, Brian; Polczer, Gyorgy; Rodriguez, Derek; Purvis, Joshua; and VanDrimmelen, Jeff.

"Click/Talk/Touch/Look/Think Here: User Interface with Virtual Space"


The most critical bottlenecks in information flow are human input and output (I/O). These bottlenecks are due to a combination of physiology, cognition, and technological prosthetics and are strongly exacerbated when the information flows are mediated by or with information technology. As people interact with each other or with information systems, the actions taken and the resulting information flows are outputs from the initiator's perspective and inputs from the receiver's perspective. This paper provides an overview of the I/O problem space by examining different theoretical models that are or have been considered in Human Computer Interaction (HCI) as well as summarizing different kinds of techniques and devices that are in use or in development to facilitate human information interaction in cyberspace. People sense the natural world and listen, read, and view information in the built world at differential rates ranging from a few bits per second to millions of bits per second depending on the perceptual organ. People move, talk, and write at relatively slow rates but we have created tools to change the rates. This paper presents an overview of different input devices organized by the degree to which people consciously control the devices (explicit vs implicit), considers some of the advantages and limitations of these devices and trends toward using multiple devices to facilitate natural human-computer interaction.


Capra, Robert; and Marchionini, Gary

"Visualizing Science and Engineering Indicators:
Transitioning from Print to a Hybrid World"


This report summarizes work on the “Visualizing Science and Engineering Indicators: Transitioning from Print to a Hybrid World” project between the National Science Foundation (NSF) SEI and UNC SILS.


Liu, Yong; Mostafa, Javed; and Ke, Weimao

"A Fast Online Clustering Algorithm for Scatter/Gather Browsing"


We present a fast online clustering algorithm which has linear worst-case time complexity and constant running time average for the well-known online visually oriented browsing model called Scatter/Gather browsing (Cutting, Karger, Pedersen, and Tukey 1992). Our experiment shows that, when running on a single processor, this fast online clustering algorithm is a few hundred times faster than the parallel Buckshot algorithm running on multiple processors.
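
As a rough illustration of how a single-pass (online) clustering procedure can stay linear in the number of documents, the sketch below assigns each incoming vector to its most similar cluster or opens a new one. It is a generic leader-style method, not the algorithm described in the report; the function names, the cosine threshold, and the cap on the number of clusters are all assumptions made for this example.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def online_cluster(vectors, threshold=0.8, max_clusters=8):
    """Single-pass clustering (hypothetical sketch, not the report's method).

    Each vector is compared with at most `max_clusters` centroids, so the
    total number of comparisons grows linearly with the number of vectors.
    Returns a list of clusters, each a list of input indices.
    """
    centroids = []  # running sum of member vectors for each cluster
    members = []    # indices of the vectors assigned to each cluster
    for i, vec in enumerate(vectors):
        best, best_sim = None, -1.0
        for c, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = c, sim
        # Join the closest cluster if it is similar enough, or if the
        # cluster cap has been reached; otherwise start a new cluster.
        if best is not None and (best_sim >= threshold or len(centroids) >= max_clusters):
            centroids[best] = [x + y for x, y in zip(centroids[best], vec)]
            members[best].append(i)
        else:
            centroids.append(list(vec))
            members.append([i])
    return members


# Toy term-weight vectors (made up for the example).
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(online_cluster(docs))  # [[0, 1], [2, 3]]
```

Capping the number of clusters is what bounds the per-document work in this sketch; a batch method such as Buckshot instead clusters a sample of the collection before assigning the remaining documents.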


By the members of the Spring 2007 Public Libraries Seminar

"American Public Library Topics - an Annotated Bibliography"


During the spring 2007 semester at the School of Information and Library Science at the University of North Carolina at Chapel Hill, the members of the Public Libraries Seminar considered the state of the American Public Library from several aspects.

After pondering the philosophical, political, professional, and ecological contexts in which the public library exists, each of the members guided the seminar through a topic area that held special meaning for them. The result of these guided tours is the annotated subject bibliography contained in this report.

While the bibliographies are probably a full and fairly complete resource for anyone else interested in the topics discussed, the goal was not to create a dry academic resource. Rather the objective in creating the bibliographies was that the students list those resources that held particular meaning for them, and that their comments about the resources be personal, sincere, and tied to their individual concerns.

This is the third iteration of a public library bibliography and supplements those created by the members of the spring 2005 and spring 2006 Public Libraries Seminars. The three together form a solid foundation for subsequent public library seminars to modify, add to, and enhance.


Lee, Christopher A.

"Taking Context Seriously: A Framework for Contextual Information in Digital Collections"


Future users of digital objects will likely have numerous tools for discovering preserved digital objects relevant to their interests, but making meaningful use and sense of the digital objects will also require contextual information. This paper provides an analysis of context, distinguishing three main ways in which that term has been used within the scholarly literature. I then discuss contextual information within digital collections. I present a framework for contextual information that is based on nine classes of contextual entities: object, agent, occurrence, purpose, time, place, form of expression, concept/abstraction, and relationship. The paper then discusses existing standards and guidance documents for encoding information related to the nine classes of contextual entities, and it concludes with a discussion of potential implications for descriptive practices through the lifecycle of digital objects.



Luo, Lili

"Reference Evolution under the Influence of New Technologies"


This report presents a historical view of library reference evolution under the influence of new information technologies. Two directions of evolution were identified through a comprehensive literature review: the increasing availability and accessibility of electronic resources, and the expansion of the media through which reference services are provided.

Placing reference progression in a historical context, this article will strengthen the understanding of library reference work, and hence, lead to a more coherent development of the reference profession.



Kelly, Diane; Fu, Xin; Shah, Chirag

"Effects of Rank and Precision of Search Results on Users’ Evaluations of System Performance"

Previous research has demonstrated that system performance does not always correlate positively with user performance, and that users often assign positive evaluation scores to systems even when they are unable to complete tasks successfully. This paper investigates the relationship between actual system performance and users’ perceptions of system performance by manipulating the level of performance experienced by users and measuring users’ evaluations of system performance. Eighty-one subjects participated in one of three laboratory studies. The first two studies investigated the impact of the location (or rank order) of five relevant and five non-relevant documents in a search results list containing ten results. The third study investigated the impact of varying levels of precision (.30, .40, .50 and .60) of a search results list containing ten results. Results demonstrate statistically significant relationships between precision and subjects’ evaluations of system performance, and ranking and subjects’ evaluations of system performance. Of the two, precision explained more variance in subjects’ evaluation ratings and was a stronger predictor of subjects’ ratings. Finally, the number of documents subjects examined significantly influenced their evaluations, even when the difference was a single document.
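
The two manipulations in this abstract can be made concrete with a small calculation. The sketch below is a minimal illustration, assuming binary relevance judgments; the helper names are hypothetical and not taken from the report. It shows that reordering five relevant and five non-relevant documents leaves precision over the ten results unchanged, while changing the number of relevant documents produces the four precision levels listed.

```python
def precision_at_k(relevance, k=10):
    """Fraction of the top-k results judged relevant (1) rather than not (0)."""
    top = relevance[:k]
    return sum(top) / len(top)


def results_list(num_relevant, size=10):
    """Hypothetical results list: relevant items first, then non-relevant."""
    return [1] * num_relevant + [0] * (size - num_relevant)


# Rank manipulation: same five relevant documents, different positions.
relevant_first = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
relevant_last = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(precision_at_k(relevant_first), precision_at_k(relevant_last))  # 0.5 0.5

# Precision manipulation: three to six relevant documents out of ten.
levels = {n: precision_at_k(results_list(n)) for n in (3, 4, 5, 6)}
print(levels)  # {3: 0.3, 4: 0.4, 5: 0.5, 6: 0.6}
```

Because set-based precision ignores order, rank position and precision can be varied independently of one another, as the separate studies described above do.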



By the members of the Spring 2006 Public Libraries Seminar

"American Public Library Topics An Annotated Bibliography"

During the spring 2006 semester at the School of Information and Library Science at the University of North Carolina at Chapel Hill, the members of the Public Libraries Seminar considered the state of the American Public Library from several aspects.

After pondering the philosophical, political, professional, and ecological contexts in which the public library exists, each of the members guided the seminar through a topic area that held special meaning for them. The result of these guided tours is the annotated subject bibliography contained in this report.

While the bibliographies are probably a full and fairly complete resource for anyone else interested in the topics discussed, the goal was not to create a dry academic resource. Rather the objective in creating the bibliographies was that the students list those resources that held particular meaning for them, and that their comments about the resources be personal, sincere, and tied to their individual concerns.

This is the second iteration of a public library bibliography and supplements the one created by the members of the spring 2005 Public Libraries Seminar. The two together form a solid foundation for subsequent public library seminars to modify, add to, and enhance.



Hemminger, Bradley M.; Long, Trisha; and Saelim, Billy

"Comparison of Visualization Techniques for Displaying Medication History to Older Adults"

This study aims to understand how older adults currently manage their medication information, and determine their preferences and their performance when using three different interaction techniques for viewing it online: a weekly calendar, a list, and a bar chart.
Thirty subjects aged 55 or older, who had taken five or more prescription medications in the past two years, and were able to use a computer, participated in the study.
Qualitative surveys and guided interviews provided information about participants’ ability to remember details about medications, how they currently manage medication information, and how and with whom they share it. Quantitative measures were made of participants’ speed and accuracy in using one of the three techniques to which they were assigned in completing basic information-seeking tasks, and their average ranking of all three techniques, as well as their own current method of information management.
Participants usually share medication information with doctors and families, and tend to share details about their prescriptions (such as dose or purpose) or side effects they personally experienced. The three electronic management techniques all outperformed and were favored over participants’ own manual methods. The list and bar chart methods were the overall favorites and top performers; however, the best technique to use depended on the type of task.
Electronic medication information management techniques show promise for helping adults remember and share key details of their medication histories.



Carter, Tyson; Durbin, Dayna; and McCraw, Jenny

"Library Websites for Elementary-aged Children: A Comparative Analysis"


What we set out to do with our project was to create a set of criteria for evaluating library websites for elementary-aged children (interpreted in this investigation as children between the ages of five and eleven), and then to apply those criteria to a small sample of sites. Thus, we started the project with three questions. First, what makes a library site for children effective and appealing, in terms of both content and design? Second, how does a select sample of library sites measure up? Third, how do three different types of library sites (public library sites for children, elementary school library media center sites, and digital or virtual libraries for children) compare? These different types of libraries serve different functions, and are often tied to different activities, and so differences in content seemed likely. These were the major concerns of the project. We set out to explore the literature on design and content selection of websites for children, and to use this literature to inform our criteria for evaluating the sites in our sample.



Boekelheide, Kristin; Brown, E. Ashley Rogers; Fu, Xin; Marchionini, Gary; Oh, Sanghee; Rogers, Gershom; Saelim, Billy; Song, Yaxiao; and Stutzman, Fred.

"Audio Surrogation for Digital Video: A Design Framework"


This paper provides a framework to guide audio surrogation research and development. It is meant to help system designers identify which kinds of audio surrogates are most appropriate for a specific system, and to help researchers develop research methodologies. After a brief review of the roles that surrogates play in retrieval and sense making, and of some characteristics of audio data, five types of audio surrogates are defined, potential applications are illustrated, and implementation issues are discussed. The paper concludes with a discussion of the implementation issues related to multiple kinds of surrogates in practical video retrieval systems.



Wildemuth, Barbara M.; Russell, Terrell; Ward, T. J.; Marchionini, Gary; & Oh, Sanghee.

"The Influence of Context and Interactivity on Video Browsing"


The goal of this study was to investigate the effects of providing context and interactivity in a retrieval system, supporting the browsing of search result sets. Thus, three systems were developed: (1) a basic system, modeled on the current results list provided by Google video searching (runs UNC-BAS-1 and UNC-BAS-2); (2) a similar system, with the context of each shot provided by showing keyframes from the shots appearing just before and after the retrieved shot (runs UNC-CON-1 and UNC-CON-2); and (3) a system that builds on the previous system by offering several mechanisms of interactivity (runs UNC-INT-1 and UNC-INT-2). In terms of both performance and user perceptions, the Context+Interactive system was superior. While there were no differences in precision, recall was improved with this system, and users preferred it (based on several measures of user perceptions). The effects of context on browsing search results were negligible, but should be explored further through re-examination of the definition and operationalization of the concept of context. Interactivity, in combination with context, had positive effects on browsing effectiveness; it was considered easy to use, even though it introduced more complexity into the interface.



Marchionini, Gary; Elsas, Jon; Zhang, Junliang; Efron, Miles; and Haas, Stephanie.

"Clustering Techniques, Tools, and Results for BLS Website Maintenance and Usability" October 15, 2005


This project was a BLS-focused adjunct to a National Science Foundation Digital Government grant to define a statistical knowledge network and user interfaces that will help citizens easily find and understand government statistical information. The BLS effort focused on discovering ways to automatically categorize BLS webpages and use these new categorizations in dynamic user interfaces under development in the larger project. The overall aim was to create alternative organizations for the BLS website that people could use to explore and find data more easily and effectively.



MacMullen, W. John.

"Annotation as Process, Thing, and Knowledge: Multi-domain studies of structured data annotation" May 20, 2005


Following Buckland’s (1991) work on the nature of information, this paper characterizes the multi-faceted concept of ‘annotation’ as process, thing, and knowledge. This typology is then used to enumerate general research questions for the exploration of annotation in arbitrary domains. Our research team’s investigation of annotation of structured data in specific domains and user groups is described, including library catalogers, musicians, historical geographers, web users, statistical analysts, and biomedical researchers.



Bergquist, Ron.

"American Public Library Topics an Annotated Bibiliography" May 10, 2005.

By the members of the Spring 2005 Public Libraries Seminar at the School of Information and Library Science, University of North Carolina at Chapel Hill.


During the spring 2005 semester at the School of Information and Library Science at the University of North Carolina at Chapel Hill, the fifteen members of the Public Libraries Seminar considered the state of the American Public Library from several aspects.

After pondering the philosophical, political, professional, and ecological contexts in which the public library exists, each of the members guided the seminar through a topic area that held special meaning for them. The result of this guided tour is the annotated subject bibliography contained in this report.



Pomerantz, J., & Stutzman, F.

"Lyceum: A Blogsphere for Library Reference"


In this paper we discuss the use of blogs in libraries, and specifically the potential of blogs for use in library reference services. We describe Lyceum, an open source software project designed by, which is a facilitator of blogspheres and a tool for intelligent automatic information management within blogspheres. We discuss ways in which Lyceum and blogs in general may facilitate library reference services.



Zhang, Junliang; Marchionini, Gary; Shear, Tim; Su, Chang.

"Relational Browser: A Fast and Contextualized Searching and Browsing Tool" January 31, 2004.


The Relation Browser is a user interface for searching and browsing that supports visual exploration of relationships in datasets. This report describes the latest version of this interface, named RB++. It discusses improvements over previous versions and outlines a user study to test its effectiveness. RB++ uses an improved database scheme, supports arbitrary n-wise exploration within collection facets, closely couples collection overviews with results sets, and adds string search within results sets that are also coupled to the overviews. The system is illustrated with data from the UNC film collection and webpages from the U.S. Energy Information Administration website.



Wildemuth, Barbara; Yang, Meng; Hughes, Anthony; Gruss, Rich; Geisler, Gary; Marchionini, Gary.

Access via Features versus Access via Transcripts: User Performance and Satisfaction. November 2003.


The Open Video Project is specifically concerned with the surrogates that can represent the objects in a digital video collection and the mechanisms through which people can manipulate those surrogates. In TREC VID 2003, we compared the effectiveness of a transcript-only search system, a features-only search system and a search system combining transcript and feature searching. We also presented several different views for users to browse the results pages: a horizontal view, a vertical view, a “before & after” view, and an extra-keyframe view. A within-subjects research design was used, so that each of the 36 participants was exposed to all three search systems. Each participant searched half (12) of the assigned topics. The user satisfaction measures recommended by NIST were augmented by measurements of participants' perceived usefulness, perceived ease of use, and flow. Results indicated that, with the transcript-only system and the combined system, users were able to achieve higher recall in less time per search. The results from the measures of satisfaction indicate that the users found the transcript-only and combined systems to be more useful and easier to use, and their use resulted in stronger perceptions of enjoyment and concentration than the features-only system. It is concluded that, as users gain experience with features searching, it will be a welcome supplement to transcript searching.



Barreau, Deborah. The New Informational Professional: Vision and Practice. August 2003.


Budget pressures and the proliferation of accessible information on the World Wide Web are among the reasons why several organizations have closed their libraries. Some visionaries suggest that to be viable under these conditions, information professionals should be more integrated with the work of organizations, becoming members of functional teams and providing both traditional and specialized services to these teams. This report describes a case study of four news organizations, two that have adopted this new model for the information professional, and two that have not. Data from newspaper articles and responses to surveys are examined for evidence that the new model influences how services are provided and valued. Although few differences are observed, findings demonstrate the benefits of this model for organizations.



MacMullen, W. John. Requirements Definition and Design Criteria for Test Corpora in Information Science. April 2003.


This paper argues that structured collections of data and information ("corpora") are needed for research in information science, and to measure the validity, accuracy, and effectiveness of tools, methods, and systems. It examines the needs and uses of corpora, and describes some specific examples from a variety of domains. The paper explores the relationship of scientific methods to corpora design, and then enumerates and discusses a variety of design criteria, primarily from the corpus linguistics literature.



Yang, Wildemuth, Marchionini, Wilkens, Geisler, Hughes, Gruss and Webster. Measures of User Performance in Video Retrieval Research. June 2003.

Abstract: Browsing and searching for digital videos online is not as easy as it is with text documents. To address this problem, researchers have begun to create video surrogates to represent video objects. The purpose of this paper is to describe and provide preliminary data regarding six measures that can be used to evaluate the effectiveness of people's interactions with video surrogates. The six types of performance to be measured are object recognition (with text stimuli), object recognition (with graphical stimuli), action recognition, gist determination (free text), gist determination (multiple choice), and visual gist determination. While some additional development of the measures is needed, their initial field testing indicates that they are practical and can differentiate multiple levels of performance with video surrogates. These measures will continue to be refined in studies conducted by the Open Video project; we also encourage others to employ them in video retrieval research.



Dominick, Hughes, Marchionini, Shearer, Su and Zhang. Portal Help: Helping People Help Themselves Through Animated Demos. February 2003.


This paper describes a rationale for animated demos to help people understand how to complete specific tasks in a WWW environment. A set of animated help demos was created to assist people in adding, deleting, moving, and rearranging a portlet as well as checking library records in the UNC MyPortal application. The process of creating the animated demos is described and pointers to the online animations are given.



Osborne, Caroline and Rinalducci, Jennifer. Evaluation of Web-Based Resources within the Art History Discipline. December 2002.



Efron, Miles. Amended Parallel Analysis for Optimal Dimensionality Reduction in Latent Semantic Indexing, December 2002.


This study describes amended parallel analysis (APA), a novel method for dimensionality estimation in unsupervised learning problems such as information retrieval (IR). At issue is the selection of k, the number of dimensions retained under latent semantic indexing (LSI). APA is an elaboration of Horn's parallel analysis, which advocates retaining eigenvalues larger than the values we would expect under term independence. APA operates by deriving confidence intervals on these “null eigenvalues.” The technique amounts to a series of non-parametric hypothesis tests on the correlation matrix eigenvalues. In the study, APA is tested along with five previous dimensionality estimators on four standard IR test collections. These estimates are evaluated with regard to two standard IR performance metrics. APA appears to perform well, predicting the best values of k on three of eight observations, and never offering the worst estimate of optimal dimensionality.
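For readers unfamiliar with the baseline that APA elaborates, the following is a minimal sketch of Horn-style parallel analysis, not of APA itself. It assumes that "null eigenvalues" are estimated by shuffling each column independently to break correlations, and it compares observed eigenvalues against the mean null eigenvalue rather than against the confidence intervals APA derives. All function names are hypothetical, and the Jacobi eigenvalue routine is included only to keep the sketch free of external libraries.

```python
import math
import random


def correlation_matrix(columns):
    """Pearson correlation matrix for equal-length, non-constant columns."""
    unit = []
    for col in columns:
        mean = sum(col) / len(col)
        dev = [x - mean for x in col]
        norm = math.sqrt(sum(d * d for d in dev))
        unit.append([d / norm for d in dev])
    return [[sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


def symmetric_eigenvalues(matrix, sweeps=50, tol=1e-12):
    """Eigenvalues of a symmetric matrix by Jacobi rotations, largest first."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                theta = 0.5 * math.atan2(2 * a[p][q], a[q][q] - a[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)


def horn_parallel_analysis(columns, n_null=50, seed=0):
    """Return (k, observed, null_mean): k counts leading observed eigenvalues
    that exceed the mean eigenvalue obtained from column-shuffled data."""
    rng = random.Random(seed)
    observed = symmetric_eigenvalues(correlation_matrix(columns))
    p = len(columns)
    null_sums = [0.0] * p
    for _ in range(n_null):
        # Shuffling each column separately simulates independence (assumption).
        shuffled = [rng.sample(col, len(col)) for col in columns]
        eig = symmetric_eigenvalues(correlation_matrix(shuffled))
        null_sums = [s + e for s, e in zip(null_sums, eig)]
    null_mean = [s / n_null for s in null_sums]
    k = 0
    while k < p and observed[k] > null_mean[k]:
        k += 1
    return k, observed, null_mean
```

In an LSI setting the columns would correspond to terms, and k would be the suggested number of retained dimensions; replacing the mean with an upper confidence bound on each null eigenvalue moves the sketch in the direction the abstract describes.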



Wildemuth and Carter. The Perceived Affordances of Web Search Engines: A Comparative Analysis, December 2002.


One way to evaluate the interfaces of search engines is to analyze the perceived affordances offered by each. In this context, the perceived affordances of a search engine are those aspects of the interface that are perceived by its users as allowing particular functions to be invoked. For example, if a search engine provides one text box that is only 10 characters long, users may perceive that it affords the searching of terms that are 10 characters or less. This study analyzed and compared the perceived affordances of nine of the most popular Web search engines (AltaVista, Ask Jeeves, Excite, Google, Hotbot, LookSmart, Lycos/Open Directory, Northern Light, and Yahoo) in September 2001. The criteria for analysis included characteristics of the text box for entering terms, characteristics of the search button, search syntax, the availability and placement of help for entering search terms, methods for limiting the search results, support for modifying a query, features of the directory structure, characteristics of results displays, and methods for setting user preferences. The analysis was conducted by directly examining the interface of each search engine for each feature or characteristic. In general, some aspects of Web search engine interfaces are becoming more standardized and other aspects vary widely across the search engines. All search engines provide a textbox and some type of accompanying button for entering a query. Almost all of the search engines provide assistance in specifying a query, but only two provide examples of queries on the search page itself. All of the search engines use the same basic syntax for specifying a query, but there is quite a bit of variation in the type and amount of assistance provided via drop-down menus or checkboxes. Direct support for modifying a query was available in only a few of the search engines. 
The results of searches are reported in fairly standard ways: brief summaries provided, usually 10 per page, in relevance order. The implications of these findings for the design of search engines are discussed.



Mu & Marchionini. Interactive Shared Educational Environment (ISEE): Design, Architecture, and User Interface, April 2002.


The Interactive Shared Educational Environment (ISEE) is an advanced real-time multimedia application that supports highly interactive collaboration and distance learning activities within a heterogeneous network context. The ISEE not only takes full advantage of fast LAN campus networks or Internet2 wide area networks by providing peer-to-peer multicast support, but it also can be used in the less advanced settings of home users. Media (e.g., digital video) are integrated into a desktop style interface in the ISEE. The ISEE allows users to interact with live multicasts, a shared web browser, shared video/audio with thumbnails for quick navigation, and text chat. Collaborative work is supported via a shared time line across the multiple ISEE tools. For example, each comment in the chat text panel is associated with a timestamp, which indicates at which point during the video viewing (i.e., in what context) the comment was made. When another user clicks that timestamp, that user's video player jumps to the same point in the video. One challenge for a collaborative distance learning (CDL) system is to support a high degree of interaction between users and the video player despite the delays associated with re-buffering the video. With pre-buffering and a novel collaboration protocol, ISEE not only supports dynamic user-media interactions in real time, but also guarantees synchronization across participants. The system interface and architecture are discussed.
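The timestamp-linked chat described above can be illustrated with a small sketch. It is only a toy model under stated assumptions: the class and method names are hypothetical, and it ignores the networking, pre-buffering, and collaboration protocol that the report itself covers.

```python
from dataclasses import dataclass, field


@dataclass
class VideoPlayer:
    position: float = 0.0  # current playback position in seconds

    def seek(self, seconds: float) -> None:
        self.position = seconds


@dataclass
class Participant:
    name: str
    player: VideoPlayer = field(default_factory=VideoPlayer)


@dataclass
class ChatComment:
    author: str
    text: str
    video_time: float  # playback position of the author when the comment was made


class SharedChat:
    """Chat panel whose comments carry the video context in which they were made."""

    def __init__(self) -> None:
        self.comments: list[ChatComment] = []

    def post(self, participant: Participant, text: str) -> ChatComment:
        comment = ChatComment(participant.name, text, participant.player.position)
        self.comments.append(comment)
        return comment

    def click_timestamp(self, participant: Participant, comment: ChatComment) -> None:
        # Jump the clicking user's own player to the moment the comment refers to.
        participant.player.seek(comment.video_time)
```

Storing the playback position on the comment, rather than a wall-clock time, is what lets a later reader recover the context of the remark regardless of when they join.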



Hara, Solomon, Sonnenwald, & Kim. An Emerging View of Scientific Collaboration: Scientists' Perspectives on Collaboration and Factors that Impact Collaboration, December 2001.


Collaboration is often a critical component in scientific research, which is dominated by complex problems, rapidly changing technology, dynamic growth of knowledge, and highly specialized areas of expertise. An individual scientist can seldom provide all of the expertise and resources necessary to address complex research problems. This paper describes collaboration among a group of scientists, and considers how their experiences are socially shaped. The scientists were members of a newly formed distributed, multi-disciplinary academic research center that was organized into four multi-disciplinary research groups. Each group had 14 to 34 members, including faculty, postdoctoral fellows and students, at four geographically dispersed universities. To investigate challenges that emerge in establishing scientific collaboration, data were collected about members' previous and current collaborative experiences, perceptions regarding collaboration, and work practices during the center's first year of operation. The data for the study include interviews with members of one research group, observations of videoconferences and meetings, and a center-wide sociometric survey. Data analysis has led to the development of a framework that identifies forms of collaboration that emerged among scientists (e.g., complementary and integrative collaboration) and associated factors that influenced collaboration, including personal compatibility, work connections, incentives and infrastructure. These results may inform social and organizational practices needed to establish collaboration in distributed, multi-disciplinary research centers.



Wildemuth, Sonnenwald, Bollenbacher, Byrd & Harmon. Mentoring Future Biologists via the Internet: Results from the “Electronic Mentoring for Tomorrow's Scientists” Program, September 2001.


The E-Mentoring program provided biology students from two historically minority universities in North Carolina with opportunities to interact and develop relationships with corporate scientists, to expand their learning horizons, and to use technology in a meaningful way. To provide a meaningful context for electronic mentoring for students, the project was integrated with undergraduate and graduate biology courses at rural and urban universities in lower socio-economic areas. To learn from this experience, an intensive evaluation was conducted. Each participant filled out a detailed questionnaire and was interviewed, both before and after their participation in the E-Mentoring program. In addition, messages between students and mentors were archived. These data are analyzed and discussed in this report.



Brunk & Marchionini. Toward an Agile Views WWW Sitemap Kit: The Generalized Relation Browser, January 2001.


This paper describes the design, development, and testing of one component for creating data-driven sitemap tools used to enhance information seeking on the web. Such tools are set up by a website administrator to provide alternative browsing and navigation aids. The Generalized Relation Browser (GRB) illustrates the look-ahead strategy for web navigation within the Agile Views design framework. Agile views define control mechanisms and interfaces for overviews, previews, reviews, peripheral views, and shared views to help people make better decisions while browsing and exploring. The GRB is a follow-up to the Federal Statistics Relation Browser prototype, and usability and field test results are summarized. The architecture for using GRB as a general purpose tool is described.



Jackson-Sanborn, Odess-Harnish, & Warren. Website Accessibility: A Study of ADA Compliance, June 2001.


As larger portions of the population access the Internet, websites must take into consideration the needs of users with various disabilities. Given that about one in five Americans has some form of disability (Census), it is not surprising that much attention is given to applying the Americans with Disabilities Act (ADA) to Internet site designs. This project used the Bobby analysis tool to examine 550 websites in six categories for ADA compliance. One hundred websites in each of the categories most popular, international, jobs, college, and government, and 50 websites in the category clothes, were selected using the What'sHot web site. The sites were examined in the Spring of 2001. Only one-third of all the sites were found to be compliant at priority 1 with no user check errors required. Government sites were the most compliant, with 60% of the sites passing. The other categories were compliant at the following levels: college (43%), clothes (40%), international (29%), jobs (19%), and most popular (15%). See the report for other analyses at other levels of compliance.
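As a quick consistency check, the per-category figures above can be combined to confirm the overall "one-third" figure. The sketch below assumes that each reported percentage corresponds exactly to a whole number of sites in a category of the stated size; the site counts it derives are therefore inferred, not taken from the report.

```python
# (number of sites sampled, percent compliant at priority 1) per category,
# as stated in the abstract.
CATEGORIES = {
    "government": (100, 60),
    "college": (100, 43),
    "clothes": (50, 40),
    "international": (100, 29),
    "jobs": (100, 19),
    "most popular": (100, 15),
}

total_sites = sum(n for n, _ in CATEGORIES.values())
# Derived counts: assumes each percentage maps to a whole number of sites.
compliant_sites = sum(n * pct // 100 for n, pct in CATEGORIES.values())
overall_rate = compliant_sites / total_sites

print(f"{compliant_sites} of {total_sites} sites compliant ({overall_rate:.1%})")
```

Under that assumption the categories sum to 550 sites, of which 186 pass, or roughly 34%, which agrees with the one-third reported for the sample as a whole.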



Gilchrest & Long. An Analytical Study of Browsing Strategies in a Content-Based Image Retrieval System, June 2001.


Image retrieval systems are available on the WWW but there are few studies of how people actually search for images using content-based image retrieval systems. This study applied Kwasnik's functional components of browsing to create a model of browsing for images. This model was then examined by analyzing transaction logs from a WWW CBIR service that used query-by-example entry. Almost one-quarter of the users abandoned the site before executing a search, a quarter of the users opted for a random starting image, 15% typed an image number (known item search), and the remainder picked one of the sample images to begin browsing. In addition, the authors took turns conducting the same searches, observing search strategies and behaviors. Both sets of data suggest that Kwasnik's browsing model applies to image browsing.



Sonnenwald, Marchionini, Wildemuth, Dempsey, Viles, Tibbo, Smith. Collaboration Services in a Participatory Digital Library: an Emerging Design, February 2001.

Abstract: Digital libraries need to provide and extend traditional library services in the digital environment. This paper presents a project that will provide and extend library services through the development of a sharium--a workspace with rich content and powerful tools where people can collaborate with others or work independently to explore information resources, learn, and solve their information problems. A sharium is a learning environment that combines the features of a collaboratory, where people collectively engage in research by sharing rich information resources, and a local library, where people come to meet, find information resources, and discuss common interests. To achieve this, collaboration services that build on synchronous and asynchronous communication technology should be integrated with other digital library services, including searching, browsing, and information management and authoring services. This paper presents our motivation for providing collaboration services and describes the types of collaboration services that will be included in the digital library.



Sonnenwald, Bolliger, Solomon, Hara, Cox. Collaboration in the Large: Using Video Conferencing to Facilitate Large Group Interaction, January 2001.


Large group collaboration is a strategic component of many research and development (R&D) centers today. Centers may have 50 to 100 or more participating principal investigators, undergraduate and graduate students, postdoctoral fellows and industry members. Because center members are geographically distributed and may not have interacted with each other previously, it can be difficult to establish and maintain collaboration among members. To address this challenge in the NSF Science and Technology Center for Environmentally Responsible Solvents and Processes, we are applying an action research approach that considers social/organizational and technical aspects of large group collaboration when establishing mechanisms to facilitate collaboration among group members. This paper describes the social, organizational and technical infrastructure and best practices that have emerged using large group video conferencing technology to support collaboration in the large. Social and organizational practices that have evolved include: facilitation before, during and after video conference meetings; the adoption of visual aids to match video conference technology constraints; and the adaptation of participant etiquette. Technical practices that have evolved include: upgrades to video conference equipment; the use of separate networks for broadcasting camera views, presentation slides, and voice; and implementing new technical operations practices to support dynamic interaction among participants at each location.



Sonnenwald, Wildemuth. Investigating Information Seeking Behavior Using the Concept of Information Horizons, January 2001.


As research questions and topics in information studies evolve, there is a continual need to seek out innovative research methods to help us investigate and address these questions. This paper presents an emerging research method, the creation and analysis of information horizon maps, and discusses the use of such maps in an ongoing research study. Sonnenwald's (1999) framework for human information behavior provides a theoretical foundation for this method. This theoretical framework suggests that within a context and situation is an 'information horizon' in which we can act. Study participants are asked to describe several recent information seeking situations for a particular context, and to draw a map of their information horizon in this context, graphically representing the information resources (including people) they typically access and their preferences for these resources. The resulting graphical representations of their information horizons are analyzed in conjunction with the interview data using a variety of techniques derived from social network analysis and content analysis. In this paper these techniques are described and illustrated using examples from an ongoing study of the information seeking behavior of lower socio-economic students. These techniques are compared to other techniques that could be used to gather data about people's information seeking behavior.



Webster, Brassell, Sonnenwald, Wildemuth, Harmon, Byrd, Bollenbacher. E-Mentoring Handbook, September 2000.


We describe lessons learned from two pilots of an electronic mentoring program to connect undergraduate and graduate science students in lower socio-economic areas with corporate scientists. All program activities and materials developed and used during the pilots are described. These include brochures, web-based e-mentoring software, training materials and evaluation materials.



Geisler, Gary. Enriched Links: A Framework For Improving Web Navigation Using Pop-Up Views, February 2000.


We describe a conceptual framework for enriching Web links by displaying small, information-rich visualizations (pop-up views) that provide the user with information about linked pages that can be used to evaluate the appropriateness of the pages before making a commitment to select the link and wait for the page to load. Examples of how the enriched links framework could be applied in several contexts, such as e-commerce catalog pages, search results for a video repository, and desktop icons, are also presented.



Marchionini, Gary; Geisler, Gary; Brunk, Ben. Agileviews: A Human-Centered Framework for Interfaces to Information Spaces, January 2000.


A framework for interface design that provides people with flexible control over different views for an information space is presented. The agileviews framework defines overviews, previews, reviews, peripheral views, and shared views that help people make decisions about where they should focus attention during information seeking. In addition to the views themselves, control mechanisms that facilitate low-effort actions and strategies for coordinating the views are discussed. Agileviews are particularly useful when specific partitions of large information spaces such as the WWW have been identified. Examples of these views are provided from several different projects and suggestions for additional research and development are made.



Dempsey, Bert J.; Weiss, Debra; Jones, Paul; Greenberg, Jane. A Quantitative Profile of a Community of Open Source Linux Developers, October 6, 1999.


Open source software, or free software, has generated much interest and debate in the wake of a number of high-impact applications and systems produced under open source models for development and distribution. Despite the high degree of interest, little hard data exists to date on the membership of collaborative open source communities and the evolutionary process of their repositories. This paper contributes a baseline quantitative study of one of the oldest continuous repositories for the Linux open source project (the UNC MetaLab Linux Archives), including demographic information on its broad community of developers. Our methodology is a close examination of collection statistics, including custom monitoring scripts on the server, as well as an analysis of the contents of user-generated metadata embedded within the Archives. User-generated metadata files in a format known as the Linux Software Map (LSM) are required when submitting open source software for inclusion in non-mirrored portions of the MetaLab Linux Archives. The more than 4500 LSMs in the Archives then provide a demographic profile of contributors of LSM-accompanied software as well as other information on this broad subset of the Linux community. To explore repository evolution directly, an instrumented Linux Archives mirror was developed, and aggregate statistics on content changes seen over a month-long period are reported. In sum, our results quantify aspects of the global Linux development effort in dimensions that have not been documented before now, as well as providing a guide for more detailed future studies.



Viles, Charles L. Content Locality in Time-Ordered Document Collections, September 13, 1999.


Using newswire data sources from the TREC corpus, we show that the distribution of relevant documents with respect to time can be decidedly non-uniform. Many TREC topics show time-based clustering of relevant documents. We denote this clustering content locality and provide a simple metric for its measurement in time-ordered document collections. There is a marked positive correlation between content locality measurements from two time-synchronized data sources. Given this correlation, we show that knowledge of the distribution of content locality in one document source can provide modest improvement in retrieval results in a companion, time-synchronized document source. While this data is preliminary, it illustrates the potential of using time as an additional feature in retrieval.



Brunk, Benjamin D. Overview and Preview Tools for Navigating the World-Wide Web, July 31, 1999.


This paper examines the problems inherent in navigating the World-Wide Web. It discusses the work done by others in crafting techniques, software products, and research prototypes that attempt to improve the browsing experience through the application of information visualization in the form of sitemaps. This paper also describes an animated technique to generate previews and overviews of a web site in order to get a better understanding of its contents. The final section includes a technical description of an early prototype tool that uses this animated technique, with preliminary findings from an informal feasibility study involving 19 subjects.



French, James C.; Viles, Charles L. Personalized Information Environments: An Architecture for Customizable Access to Distributed Digital Libraries, February 8, 1999.


We describe the conceptual architecture of a Personalized Information Environment (PIE). A PIE allows unified, highly customizable access to distributed information resources by providing users the tools to compose personalized collections from a palette of information resources. The architecture also provides for the efficient “exchange” of inter-resource meta-information like collection statistics in order to maximize retrieval effectiveness. This paper includes the enunciation of the user-centered PIE vision, an architectural requirements specification, and an architectural description that meets the specification and supports the vision. We also describe our current implementation and research efforts conducted within the PIE framework.



Dempsey, Bert J.; Weiss, Debra. Towards an Efficient, Scalable Replication Mechanism for the I2-DSI Project, April 30, 1999.


This paper presents the development of new functionality for the open-source rsync utility aimed at producing an efficient, scalable solution for multiple-site file synchronization. The context of our work is the Internet2 Distributed Storage Infrastructure (I2-DSI) project, which is developing a reliable, scalable, high performance storage service infrastructure for advanced applications in research and education. Specifically, the I2-DSI project is working on middleware software to enable the replication of applications across a set of geographically distributed hosts. This paper presents a new mechanism for replicating filesystems, rsync+, which is a modification of an open-source rsync file synchronization utility. Using rsync+ for file updates, a flexible, powerful replication mechanism can be developed for publishing source objects into the I2-DSI replication service, and the approach enables scalable network distribution through multicast-based solutions. The paper presents the technical details behind the rsync+ tool, its use as a replication solution within I2-DSI, and performance results from a large-scale (multi-gigabyte) WWW mirroring experiment using rsync+. The mirroring experiment demonstrates correct operation of the rsync+ code and its efficiency gains when used on actual data from an active WWW document archive.