This paper evaluated user interaction with the graphical user interface of WRESTORE, an environmental decision support system (EDSS) for watershed planning that utilizes user ratings of design alternatives to guide optimization. The following usability metrics were collected for stakeholders as well as surrogates (who are often used for DSS development due to time or cost constraints): task times across sequential sessions, percentages of time spent and of mouse clicks in different areas of interest, and trends in self-reported user confidence levels. Task times conformed to theoretical models of learning curves. Stakeholders, however, spent 15% more time and made 14% more mouse clicks in information gathering areas than surrogates did. Confidence levels increased over time for 67% of stakeholders, but for only 29% of surrogates. Relationships between time spent or mouse clicking events and confidence level trends indicated that confidence ratings increased over time for users who conducted more information gathering. This study highlights the importance of increasing user interaction within the information gathering areas of an interface to improve user-guided search, and suggests that developers should exercise caution when using surrogates as proxies for stakeholders. It also demonstrates a quantitative way to evaluate EDSS that could assist others developing similar tools.
Decision support systems (DSSs) are commonly used to support the design of watershed plans for sustainable management of water resources and landscapes (Jakeman et al. 2008; Lavoie et al. 2015). In recent years, stakeholders' participation in the planning and design process has been recommended as an approach to improve stakeholder adoption of watershed plans, and to support successful restoration of watershed ecosystems impacted by land-use changes, climate change, etc. (Gregory 2000; Jakeman et al. 2008; Voinov & Bousquet 2010; Babbar-Sebens et al. 2015). The goals of these watershed restoration plans are to improve the hydrological and ecological functions of the land, without deterioration of the existing agricultural and recreational services provided by the watershed (Kelly & Merritt 2010). Diverse best management practices (BMPs), such as wetlands, filter strips, grassed waterways, cover crops, and no-till practices, have been proposed as solutions to prevent or reduce pollutant loads in water bodies, and to mitigate flood events through runoff control and peak flow reduction (Ice 2004; Arabi et al. 2007; Artita et al. 2008).
Achieving an optimal distribution and selection of BMPs that is also acceptable to the stakeholder community is an inherently complex process. Multiple researchers have investigated coupled simulation models and optimization algorithms to identify the optimal distribution of BMPs in a watershed (e.g., Arabi et al. 2006; Kelly & Merritt 2010; Lethbridge et al. 2010; Tilak et al. 2011; Kaini et al. 2012). In these studies, simulation models have been used to estimate the landscape responses and evaluate related design objectives, whereas optimization algorithms have been used to find design alternatives in large decision spaces that enhance one or more objectives in the watershed. While such analytical techniques have the benefit of generating optimized scenarios of design alternatives with respect to quantifiable goals, they are limited in their ability to incorporate diverse subjective, unquantifiable or unquantified criteria (such as personal or social values, beliefs, interests, biases, local knowledge, etc.) and preferences of stakeholders. This inability has likely contributed to the poor adoption of ‘optimal’ design alternatives, and has led some to criticize the use of optimization algorithms in watershed planning (Mendoza & Martins 2006).
In an attempt to find more effective ways to include stakeholders in the design process and potentially increase adoption of BMPs, researchers have begun to explore approaches, including those that use information technology, to provide platforms where the affected stakeholders can express their unique socio-economic and subjective constraints during the design process. For example, analytical approaches such as systems dynamics models (Metcalf et al. 2010), Bayesian networks (Castelletti & Soncini-Sessa 2007; Zorrilla et al. 2010), fuzzy sets and cognitive maps (de Kok et al. 2000), agent-based models (Barreteau & Abrami 2007), and human-guided/interactive optimization (Babbar-Sebens & Minsker 2011) combined with graphical user interfaces (GUIs) have all been used to create participatory decision support platforms aimed at improving the engagement of stakeholders. These semi-structured and interactive DSS elicit data, knowledge, and feedback from stakeholders, and then use that information to generate design alternatives. Recent compilations by Jakeman et al. (2008) and McIntosh et al. (2011) identified relevant aspects that should be considered when developing environmental DSS (EDSS) for watershed planning and management. McIntosh et al. (2011) emphasized the importance of the evaluation of human factors such as needs, goals, fatigue, learning, etc., in the design of interfaces used for EDSS. Similarly, a growing number of studies in water management have indicated interest in improving EDSS usability to enhance the user's understanding of the issues, as well as the design alternatives proposed for solving the problem (Johnson 1986; Power & Sharda 2007; Kirchoff et al. 2013; Babbar-Sebens et al. 2015).
ISO Standard 9241 defines usability as the ‘extent to which a product can be used by the specified users to achieve specified goals with effectiveness, efficiency, and satisfaction in a specified context of use’. Estimating usability attributes for different EDSS can potentially help developers to characterize users' interactions with their system, and gain a better understanding of their behaviors, preferences, and needs. Some researchers, such as Haklay & Tobon (2003), Slocum et al. (2003), Jankowski et al. (2006), Nyerges et al. (2006), Boisvert & Brodaric (2012), Lavoie et al. (2015), and Sadler et al. (2016), have examined the factors that influence the development and/or usability of EDSS. However, appropriate measurements and evaluations of EDSS usability, as well as standardized techniques, have not been extensively studied, applied, or developed for most EDSS.
For example, in Slocum et al. (2003), the authors used a combination of cognitive walkthrough methods, think aloud protocols, and pluralistic inspection while developing software intended to visualize the uncertainties of global water balance. The cognitive walkthrough methods allow developers and human factors specialists to evaluate the steps of task scenarios. The think aloud protocol permits users to verbalize their thought processes as they perform the tasks, while pluralistic inspection allows users with different expertise and backgrounds to contribute towards the improvement of the system. However, such approaches do not produce quantitative results, but rather qualitative comments centered on the expertise of each of the tested groups. Similarly, Lavoie et al. (2015) used think aloud protocols and recorded interviews to determine the usability of their ATES tool (Aménagement du territoire et eau souterraine, or land planning and groundwater). The usability tests were performed with three different users, all of whom were associated with the land planning field.
Other studies by Jankowski et al. (2006) and Nyerges et al. (2006) evaluated the usability of a collaborative spatial EDSS that sought to create a consensus on solutions/options for surface water allocation and groundwater pumping rates, using qualitative and quantitative methods. The qualitative methods involved non-standard usability questionnaire data, while the quantitative methods provided information on task times for different activities, but they did not consider any measurement of the GUI's usability. Jankowski et al. (2006) and Nyerges et al. (2006) reported communication results for two different groups (of ten stakeholders each) that collaborated to propose scenarios of regulation laws for water allocation and groundwater pumping rates. However, their study examined the analysis and evaluation of collaborative interaction rather than the evaluation of EDSS based on individual stakeholders' interactions and behavior. Despite these advances, there are still gaps in the use of quantitative evaluation methods to determine the usability of such participatory EDSS.
Recently, Tullis & Albert (2013) outlined quantitative measures such as task time, number of mouse clicks, and percentage of time spent in specific areas of interest (AOIs) of a web interface, as potential approaches that researchers can use for quantifying the usability of software and webpages. In this study, we used these metrics to evaluate a new, web-based participatory EDSS called WRESTORE (watershed restoration using spatio-temporal optimization of resources; http://wrestore.iupui.edu/). WRESTORE has been recently developed with the goal of allowing stakeholders, policy-makers, and planners to participate in the design of a spatially distributed system of BMPs in a watershed. WRESTORE uses a modification of the IGAMII (interactive genetic algorithm with mixed initiative interactions – Babbar-Sebens & Minsker 2011) algorithm to engage users in a human-guided search (HS) process that is targeted towards identification of user-preferred alternatives. In WRESTORE, users are shown multiple design alternatives and are asked to provide a qualitative rating of the candidate designs based on their preferences and subjective criteria. This user rating is then used as an additional objective function in the search algorithm to find similar or better alternatives that have design features liked by the user. In summary, WRESTORE optimizes five objective functions: four objective functions are related to the physical performance of the BMPs within the watershed (specifically, maximize peak flow reduction, maximize reduction in sediment load, maximize reduction in nitrate load, and minimize economic cost), and one objective function (maximize user ratings) is related to the user's qualitative rating of the newly found design alternatives. Babbar-Sebens et al. (2015) have extensively explained how the user ratings for design alternatives are obtained via the GUI.
Because these incoming user ratings are used in real time by the underlying optimization algorithm to search for new solutions/design alternatives that agree with the user's personal qualitative preferences (such as land management, biases towards practices and their implementation, individual local constraints and needs, etc.), evaluating the usability of the GUI is critical for interactive optimization algorithm performance. In addition to the user ratings, the interface is also used to collect confidence levels from the users. The self-reported confidence levels indicate how confident a participant was regarding his/her own rating of a candidate design alternative (Babbar-Sebens & Minsker 2011). The confidence levels, along with the user's interaction, offer potential insights into the quality of the user's input. This information is also valuable for assessing the reliability of a user-guided search.
Usability testing was conducted for WRESTORE, developed by Babbar-Sebens et al. (2015), and this article summarizes the findings on the nature of user behavior observed for two different types of users (stakeholders and surrogates) who interacted with the search tool's GUI. In the section immediately below, we present the objectives and research hypotheses investigated in this work. The next section includes a description of the methodology and the experimental design, followed by two sections that contain the analysis and discussion of the results. Finally, we present some concluding thoughts, future work and recommendations that may address the usability testing of WRESTORE and similar EDSS.
The aim of the study was to use an observational approach to determine how participants used the GUI of the WRESTORE EDSS, as they gathered information and made decisions in order to rate different design alternatives. This kind of system, which includes direct participation of stakeholders in the optimization process, is relatively new in the field of watershed planning and management, and also in the field of EDSS. As mentioned earlier, since the GUI is the primary mechanism to collect the user ratings from end users, it is important to determine if it can support the necessary functions for a user to be able to easily: (a) gather information about the candidate design alternatives, (b) conduct comparisons between candidate design alternatives, and (c) provide meaningful feedback with improved confidence in his/her own evaluation of candidate design alternatives. Analyses of the user's data will help to better characterize his/her interaction with the tool, enabling improvements in the efficiency of the underlying search algorithm used by WRESTORE (i.e., IGAMII). We tested the tool's performance with two groups: stakeholders (watershed end users) and surrogates (non-stakeholder volunteers), because surrogate users' data are often used for DSS interface prototype development due to time and cost constraints. Surrogate testing is an essential aspect of DSS development, and allows designers to resolve major usability problems prior to potentially wasting stakeholders' time and/or losing them in the participatory process. However, as a result, it becomes critical to know and understand the extent to which such volunteers can be used as proxies for future end users.
The study examined the following research hypotheses:
H1. Over time, users should become more efficient in using WRESTORE's GUI. As users learn how to navigate and use the GUI's features, overall task times should decrease across repeated sessions. This decrease should follow the theoretical learning curves (Yelle 1979; Newell & Rosenbloom 1981; Estes 1994), specifically, DeJong's learning formula (Jaber 2011).
H2. As stakeholders are directly affected by the issues and implementation decisions related to the watershed, we expect that they will use the tool more effectively, and spend a greater percentage of their time utilizing the information gathering areas of the interface than surrogate users will.
H3. Similarly, we expect that stakeholders will focus their attention on the more informative areas of the interface and therefore have a higher percentage of mouse clicks than surrogates in the information gathering areas of the web-interface.
H4. As users gain experience by interacting with the tool and develop a better understanding of the performances of different design alternatives, we expect that their overall self-reported confidence levels will increase over time, resulting in a positive trend.
H5. A comparison among confidence level trends, time spent, and mouse clicking events should show that when users spend more time and make more mouse clicks in information gathering areas, their confidence levels increase over time.
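The learning-curve model referenced in H1 can be illustrated with a short sketch of DeJong's formula. The parameter values below are purely illustrative, not fitted to the study's data.

```python
import numpy as np

def dejong_task_time(n, t1, M, b):
    """DeJong's learning formula: predicted task time for the n-th repetition.

    t1 : task time for the first session
    M  : incompressibility factor (0 <= M <= 1), the fraction of the task
         time that practice cannot reduce
    b  : learning exponent (b > 0); larger values mean faster learning
    """
    return t1 * (M + (1.0 - M) * n ** (-b))

# Illustrative parameters: a 30-minute first session, 20% incompressible
# time, and a moderate learning rate. Predicted times decrease toward t1 * M.
sessions = np.arange(1, 8)
times = dejong_task_time(sessions, t1=30.0, M=0.2, b=0.5)
```

Under H1, fitted curves of this form should show task times decaying across repeated sessions toward the incompressible floor t1 * M.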
A final goal was to set a basis for evaluating the interactions between human factors and GUIs in participatory decision support tools. Such protocols will allow tool developers to test the usability of these systems for their user community, determine what improvements should be made based on specific user populations, and learn how to facilitate the participation of stakeholders in similar watershed design tools.
Case study site
The web-based WRESTORE tool was developed by Babbar-Sebens et al. (2015) to enable watershed communities to engage in participatory design efforts and influence the spatial design of conservation practices (specifically, filter strips and cover crops in the experiments reported for this paper) on their landscape. WRESTORE software is currently being tested at the study site of Eagle Creek Watershed, Indiana, where multiple researchers (Tedesco et al. 2005; Babbar-Sebens et al. 2013; Piemonti et al. 2013) have conducted watershed investigations. A calibrated watershed model of the study site, based on the Soil and Water Assessment Tool (SWAT) (Arnold et al. 2001, 2005; Neitsch et al. 2005), was used to simulate the impact of a candidate plan of practices on the watershed and calculate values for objective functions for the period 2005–2008. The watershed was divided into 130 sub-basins, out of which 108 were suitable for the proposed practices. Values of design variables indicated whether and how a practice was implemented in one of the suitable sub-basins. Based on the values of the design variables found collaboratively by the user and the micro-IGA, SWAT model parameters were modified to represent the design decisions.
Web-tool WRESTORE evaluation
Twenty-three participants volunteered for the study. We divided them into two groups based on their affiliation with the watershed: stakeholders and surrogates. Group membership was treated as a predictor variable. The stakeholders group contained eight stakeholders from the Eagle Creek watershed who work at federal and state agencies and non-governmental organizations in programs that support the implementation of conservation practices in the watershed. The surrogates group contained 15 non-stakeholder volunteers with science and engineering backgrounds, who were not directly involved in the watershed.
Three participants' data sets were excluded from the analysis. One participant (a stakeholder) quit the study before finishing, leaving an incomplete set of answers. The other two participants (a stakeholder and a surrogate user) failed to follow the instructions. Therefore, 20 participants' data sets were analyzed (six stakeholders – five males, one female; 14 surrogates – six males, eight females).
We also want to note that for seven participants in the surrogates group, the tool used a different underlying hydrology model (used by the optimization algorithm to calculate the four physical objective functions). However, since the GUI features were identical for all participants, it can be reasonably assumed that the underlying differences in the physics of the hydrological model did not affect the user–interface interaction and experience.
While the study was designed to test user interaction with the tool and its ability to generate designs that agreed with the user's subjective criteria, this article only reports the findings on the user–tool interaction. Babbar-Sebens et al. (2015) provide a detailed explanation of the WRESTORE tool and its performance in multiple user experiments.
Within each session, the GUI (Figure 1) allowed participants to:
View the spatial distribution of the conservation practice(s) using the alternative maps (component (iii) in Figure 1). These maps indicate the locations where the practice(s) should be implemented.
Identify which conservation practice and its distribution were shown on the map (using the legend – component (ii) in Figure 1).
Compare the effectiveness of the designs via the four different goals estimated using the underlying cost and hydrologic models (component (v) in Figure 1).
Provide a user rating for each design via a Likert-type scale with three options (I like it, Neutral, and I do not like it) – component (iv) in Figure 1.
State the level of confidence (using the scale bar in component (iv) in Figure 1) in their personal ratings of each design.
Record their answers and move to the next pages to view additional design alternatives generated by the tool (component (vii) in Figure 1).
Submit their ratings of the design alternatives to the underlying interactive optimization algorithm to generate a new set of design alternatives for the next session (component (vi) in Figure 1).
To measure interface interaction, we tracked the time spent and number of mouse clicks in three main AOIs. The first AOI was associated with information gathering (AOI1 = Info) and included user interactions with the legend (component (ii)), the alternative maps (component (iii)) and/or the drop-down (sub-drop in component (v)) menu. The drop-down menu allowed users to compare the performance of two design alternatives at watershed scale or at a local sub-basin scale (component (v)). The second AOI was associated with the evaluation process (AOI2 = Eval), and included user interactions with the Likert-scale user ratings and the confidence sliders (component (iv)). The final AOI (AOI3 = Other) included all other user activity outside the areas covered by AOI1 and AOI2, i.e., if the participant clicked in an area outside of the solid (Info) and dashed (Eval) boxes shown in Figure 1.
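As a concrete illustration, logged click events can be binned into these three AOIs by the interface component they touched. The component identifiers below are hypothetical, not WRESTORE's actual event names.

```python
# Hypothetical mapping from logged GUI component names to AOIs; the
# component names are illustrative, not WRESTORE's actual layout.
AOI_COMPONENTS = {
    "Info": {"legend", "alternative_map", "dropdown_menu"},   # AOI1
    "Eval": {"rating_scale", "confidence_slider"},            # AOI2
}

def classify_click(component):
    """Map a logged component name to its AOI; anything else is 'Other' (AOI3)."""
    for aoi, components in AOI_COMPONENTS.items():
        if component in components:
            return aoi
    return "Other"

# Tally a short, made-up click log.
clicks = ["legend", "rating_scale", "next_page", "alternative_map"]
counts = {}
for c in clicks:
    aoi = classify_click(c)
    counts[aoi] = counts.get(aoi, 0) + 1
# counts -> {'Info': 2, 'Eval': 1, 'Other': 1}
```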
Users' interactions with these AOIs were tracked across all WRESTORE sessions. Each session had, on average, 20 different designs that participants were required to examine and evaluate. The progress bar at the top of the webpage (Figure 1, component (i)) showed the participant's progress and how many sessions were left to complete. Participants were asked to complete all of the sessions through Introspection 3 (I3), although they had the option to continue beyond I3 if they desired.
The participants went through two types of sessions: Introspection (I) sessions and HS sessions. Introspection session 1 (I1) was the starting point for all participants. This session showed 20 designs selected from a pool of non-dominated design alternatives found by a multi-objective genetic algorithm that used only the four physical objective functions to guide the search process (Babbar-Sebens et al. 2015). In this session, all participants saw the same 20 designs. After I1, participants transitioned into HS sessions (i.e., six HS sessions – HS1_1 to HS6_1) that had one-to-one correspondence with the six iterations (or generations) of a micro-interactive genetic algorithm (micro-IGA; Babbar-Sebens & Minsker 2011; Babbar-Sebens et al. 2015). In each of these HS sessions, 20 new designs (the population of the micro-IGA), generated by the micro-IGA, were presented to the user. After the micro-IGA was over (i.e., at the end of HS6_1), designs that had high user ratings and/or were on the best non-dominated front were saved into a case-based memory (CBM). Next, the participant revisited 20 random designs from the CBM in Introspection session 2 (I2), where he/she was provided the chance to modify his/her previous user ratings and confidence levels based on any newly formed preferences and perceptions. This revisiting technique aimed to improve the participants' skills in evaluating the performance of the design alternatives using the information gathered in the HS sessions. If the user modified his/her evaluation of the revisited design in the Introspection session, the changes were updated in the CBM. At the end of I2, a subset of designs from the CBM was again inserted in the initial population of the next micro-IGA, and a new set of six HS1_2 to HS6_2 sessions ensued. It is worth noting here that the WRESTORE interactive search tool was originally developed with an additional set of alternating I and HS sessions and an Automated Search session, all occurring after I3.
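The alternating sequence of I and HS sessions described above can be sketched as a small generator, following the paper's naming convention (e.g., HS1_1 to HS6_1); this is an illustration of the session ordering, not WRESTORE's actual scheduling code.

```python
def session_schedule(n_cycles=2, hs_per_cycle=6):
    """Yield the session sequence described above: I1, six HS sessions,
    I2, six more HS sessions, and so on up to the final Introspection.

    Session names follow the paper's convention: HS<generation>_<cycle>.
    """
    for cycle in range(1, n_cycles + 1):
        yield f"I{cycle}"                       # Introspection session
        for gen in range(1, hs_per_cycle + 1):  # one HS session per micro-IGA generation
            yield f"HS{gen}_{cycle}"
    yield f"I{n_cycles + 1}"                    # final Introspection (I3 here)

schedule = list(session_schedule())
# schedule -> ['I1', 'HS1_1', ..., 'HS6_1', 'I2', 'HS1_2', ..., 'HS6_2', 'I3']
```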
The Automated Search session uses a machine learning model to learn from the participant's user ratings collected in the earlier HS sessions, and then conducts an exhaustive search on behalf of the participant by using the machine learning model as a simulated user. Interested readers are encouraged to refer to Babbar-Sebens et al. (2015) for details on these sessions. In order to limit human fatigue and keep the workload the same for all users, researchers requested that participants complete all alternating HS and I sessions, at least until the end of I3. Several participants in the surrogates group, however, did choose to complete sessions beyond I3. Nevertheless, these additional sessions were not considered as part of the primary analyses conducted to examine the hypotheses above.
After the participant had submitted his/her user ratings for design alternatives, 10–15 minutes of waiting time was needed for the underlying optimization algorithm and hydrologic model to generate and evaluate new design alternatives. To allow participants to complete the experiment at their own pace, an email notification system was used to send them an automated reminder whenever the next session was available for user interaction. The automated e-mail included a log-in link to the WRESTORE webpage, and when users clicked on the link to log into their account, a set of instructions on how to conduct the ongoing experiment was redisplayed on the screen (Babbar-Sebens et al. 2015).
MEASURES OF USER INTERACTION WITH THE INTERFACE
This section introduces the different usability metrics, or response variables, that were employed to evaluate users' interaction with, and the usability of, the WRESTORE tool.
Overall task times
We assessed the participants' ability to navigate and learn the tool by evaluating the mean task times across successive ‘I’ sessions and also across successive ‘HS' sessions, during which design alternatives were presented to the user via the GUI. Task times for each session were recorded and used to infer how quickly participants learned to use the tool interface efficiently. When the participants were not using the tool (e.g., taking a break), they were instructed to press the ‘Save all’ button, so that we could treat these off-task time intervals as outliers and exclude them from the task time analyses. These events were saved in a database as ‘Save all maps’, and removed during the post-processing analysis, along with ‘Quit’ events. However, there were some occasions when some participants failed to click the ‘Quit’ or ‘Save all’ buttons, resulting in excessively long task times. To remove these outliers, for each participant we excluded task time values that were more than two standard deviations from his/her mean task time across all considered sessions.
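The two-standard-deviation screening rule can be sketched as follows; the task times (in minutes) are illustrative, not the study's data.

```python
import numpy as np

def remove_outlier_task_times(times):
    """Drop task times more than two standard deviations from a participant's
    mean across all considered sessions (a sketch of the rule described above)."""
    times = np.asarray(times, dtype=float)
    mean, sd = times.mean(), times.std(ddof=1)  # sample standard deviation
    mask = np.abs(times - mean) <= 2.0 * sd
    return times[mask]

# A long idle interval (e.g., a forgotten 'Save all' click) gets excluded.
raw = [12.0, 10.5, 11.2, 95.0, 9.8, 10.1]
clean = remove_outlier_task_times(raw)
```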
Mean percentage of time spent in different AOIs
For each AOI, the group-mean percentage of time spent was computed as:

mean MPTSi = (1/N) × Σ (j = 1 to N) MPTSi,j

where MPTSi,j is the percentage of time spent in AOIi by participant j, and N is the total number of participants in the group.
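A minimal sketch of this group-mean calculation, using hypothetical per-participant percentages (not the study's data):

```python
import numpy as np

# Hypothetical percentages of time spent by three participants in each AOI
# (rows: participants; columns: Info, Eval, Other); each row sums to 100.
pct_time = np.array([
    [55.0, 30.0, 15.0],
    [60.0, 25.0, 15.0],
    [45.0, 40.0, 15.0],
])

# Group mean per AOI: average of MPTS_{i,j} over the N participants.
mean_mpts = pct_time.mean(axis=0)
# The per-AOI means also sum to 100 because each participant's row does.
```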
Mean percentage of mouse clicking events by AOI
The same AOIs described in the section ‘Web-tool WRESTORE evaluation’ and sub-section ‘Participants’ (Info, Eval, and Other) were used to track the mouse clicking events, and monitor user interactions with different areas of the interface. As the total number of clicking events varied between participants, the percentage of clicking events in each session was calculated for each participant, and for each group (surrogates and stakeholders).
Confidence level indicates how confident the participant felt about his/her own user rating (I like it, Neutral, or I do not like it). Users' confidence in their ratings was indicated via the confidence level slider bar (component (iv) in Figure 1), which ranged from 0 to 100 and could be modified by the user during the session. However, changes in the confidence levels for the same designs that appeared multiple times during the experiment could not be tracked, as new changed values replaced previous values in the archive database.
We classified participants by confidence level trends in the following manner. First, we conducted a nonparametric Mann–Kendall hypothesis test (Helsel & Hirsch 2002) via a Matlab script (Burkey 2006) to assess whether the trends in average values of confidence levels (estimated from data on confidence levels in each session) were monotonically increasing or decreasing. This test indicated whether or not participants presented a trend, at a significance level alpha of 0.1, and the Sen's slope (S value) determined if the trend was positive or negative. Since the main focus of this test was to minimize the risk of not detecting an existing trend (i.e., Type II error), a larger alpha value was chosen. Participants were thus separated into three confidence level trend groups (positive, negative, and no trend). Finally, we attempted to identify similarities and differences between participants who showed positive, negative, and no trend in confidence levels, and relate them to their interface behavior (i.e., time spent and mouse clicks) concerned with information gathering.
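A minimal Python sketch of this trend classification follows (the study used a Matlab script; the variance formula below assumes no tied values, and the example series is made up):

```python
import math
import numpy as np

def mann_kendall(x, alpha=0.1):
    """Minimal Mann-Kendall trend test with Sen's slope, classifying a series
    of per-session mean confidence levels as 'positive', 'negative', or 'no trend'."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # S statistic: sum of signs over all pairs (j > i)
    s = sum(np.sign(x[j] - x[i]) for i in range(n - 1) for j in range(i + 1, n))
    var_s = n * (n - 1) * (2 * n + 5) / 18.0    # variance, assuming no ties
    if s > 0:
        z = (s - 1) / math.sqrt(var_s)
    elif s < 0:
        z = (s + 1) / math.sqrt(var_s)
    else:
        z = 0.0
    p = math.erfc(abs(z) / math.sqrt(2))        # two-sided p-value
    # Sen's slope: median of all pairwise slopes
    slope = float(np.median([(x[j] - x[i]) / (j - i)
                             for i in range(n - 1) for j in range(i + 1, n)]))
    if p < alpha:
        return ("positive" if slope > 0 else "negative"), slope
    return "no trend", slope

confidences = [52, 55, 60, 58, 66, 70, 74, 78]  # mean confidence per session
trend, slope = mann_kendall(confidences)
```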
Relationships between confidence levels, time spent, and mouse clicking events
We fitted trend curves to usability data (i.e., time spent and number of mouse clicks) in order to determine any underlying patterns and relationships for participants within each confidence level trend (positive, negative, or no trend). We also compared the differences in these trends for Info and Eval type events. To select the best trend curve model, we used a combination of the coefficient of determination (R2) and three versions of the Akaike information criterion (AIC).
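Model selection of this kind can be sketched by comparing AIC values for linear and power fits. The data below are synthetic (generated from a power curve), and only the basic least-squares AIC variant is shown, not all three versions used in the study.

```python
import numpy as np

def aic(y, y_hat, k):
    """AIC for a least-squares fit: n*ln(RSS/n) + 2k, for k fitted parameters."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

# Synthetic session data roughly following a power decay, with small noise.
x = np.arange(1, 9, dtype=float)
y = 30.0 * x ** -0.5 + np.array([0.3, -0.2, 0.1, -0.1, 0.2, -0.3, 0.1, 0.0])

# Linear model: y ~ a*x + b (two parameters)
a, b = np.polyfit(x, y, 1)
aic_linear = aic(y, a * x + b, k=2)

# Power model via log-log regression: y ~ c * x**m (two parameters)
m, log_c = np.polyfit(np.log(x), np.log(y), 1)
aic_power = aic(y, np.exp(log_c) * x ** m, k=2)

# Lower AIC indicates the preferred model; for power-curve data, the
# power fit should win.
```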
In this section, we present the results from the data analysis for the task time, percentage of time spent in each AOI, percentage of mouse clicks in each AOI, and confidence level trends, as well as the relationships between these variables.
Overall task times
We first assessed participants' mean task times for each session. The mean for each group (surrogates and stakeholders) was calculated and the results were separated according to the type of sessions (i.e., I or HS). Results for surrogates showed that in earlier sessions (for both I and HS) the mean task time and the standard errors were greater than in later sessions. Similar results were found for stakeholders in the I sessions. However, stakeholders' mean task times for HS sessions were somewhat more variable.
Mean percentage of time spent in different AOIs
Table 1 summarizes the data in Figure 4, and presents the overall mean percentages of time spent (averaged across sessions) and 95% confidence intervals (CIs) for surrogates and stakeholders in each AOI.
Mean percentage of clicking events in different AOIs
We calculated 95% CIs for overall mean mouse clicking events, similar to the analyses in Table 1 for time spent. Table 2 presents the overall mean percentages of mouse clicks (averaged across sessions) for surrogates and stakeholders in each AOI.
For each participant, and for each of the rating classes (i.e., I like it, Neutral, and I do not like it), a Mann–Kendall trend test was performed to identify if there were monotonic trends in the mean confidence levels across consecutive sessions. The results were separated into positive, negative, or no trend over time. A positive trend indicated that users were becoming more confident over time about the ratings they provided for the designs; conversely, a negative trend indicated that they were becoming less confident over time. In summary, 40% of all of the participants showed a positive trend in at least one of the rating classes. For the stakeholders group, 67% of the participants showed a positive trend, while just 29% of the participants in the surrogates group showed a positive trend. It is, however, important to mention here that these trends were calculated for the sessions that lasted until the end of I3. It is possible that if participants had continued beyond I3 their trends could have changed, especially if the additional engagement during the interactive search process improved their reasoning process and led to a change in their confidence levels.
Relationships between confidence levels, time spent and mouse clicking events
We also associated the mean percentage of mouse clicks for information gathering and the mean percentage of time spent for information gathering (Table 3) with each of the trends in mean confidence levels reported in the previous sub-section. Results showed that participants with a positive trend in mean confidence levels also had 13% more mouse clicks than participants with a negative trend, and 9% more mouse clicks than participants with no trend in AOIs related to information gathering. Participants with a positive trend in mean confidence levels spent 1% more time than participants with a negative trend, and 12% more time than participants with no trend in AOIs related to information gathering.
|Trend|% of total participants|Mean % of clicking events in Info AOI|Mean % of time spent in Info AOI|
Table 4 compares the R2 values and three different AIC values obtained for four approximations, generated from the sessions completed by all participants. These R2 and AIC values indicate that a power approximation is the best-fit model for all of the trends, except for the negative trend in ‘All available sessions’, where a linear function fits better.
The shaded boxes indicate the preferred model for a specific model selection criterion.
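The comparison behind Table 4 can be reproduced in outline by fitting linear and power curves to task times and comparing least-squares AIC values. A minimal sketch, using hypothetical session times (not the study's data) and a basic AIC form for least-squares fits:

```python
import math

def ols(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def rss(ys, preds):
    """Residual sum of squares."""
    return sum((y, p) and (y - p) ** 2 for y, p in zip(ys, preds))

def aic(rss_val, n, k):
    """Akaike information criterion for a least-squares fit, k parameters."""
    return n * math.log(rss_val / n) + 2 * k

# Hypothetical mean task times (minutes) across 8 consecutive sessions
sessions = [1, 2, 3, 4, 5, 6, 7, 8]
times = [20.5, 14.8, 12.3, 10.9, 10.1, 9.4, 8.9, 8.6]
n = len(sessions)

# Linear model: T = a + b*session
a_lin, b_lin = ols(sessions, times)
rss_lin = rss(times, [a_lin + b_lin * s for s in sessions])

# Power model: T = c * session**(-d), fitted in log-log space
a_log, b_log = ols([math.log(s) for s in sessions],
                   [math.log(t) for t in times])
c, d = math.exp(a_log), -b_log
rss_pow = rss(times, [c * s ** (-d) for s in sessions])

print("linear AIC:", aic(rss_lin, n, 2))
print("power  AIC:", aic(rss_pow, n, 2))  # lower AIC -> preferred model
```

With these hypothetical times the power model yields a much lower AIC, mirroring the pattern the paper reports for most trends.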
In this paper, we analyzed usability metrics for a novel, web-based tool, WRESTORE, which supports decision-makers in the search and design of alternatives for allocating conservation practices in agricultural watersheds. This kind of tool is relatively new in the field of watershed planning, necessitating protocols that will enable developers to assess what types of improvements should be made, based on quantitative user feedback. Furthermore, because surrogates are often a necessary component of prototype development due to time and cost constraints, it is critical to know to what extent their data can be generalized to the stakeholder population of interest.
Four main usability metrics were considered for the analysis: task time evaluation, percentage of time spent in different AOIs, percentage of mouse clicking events in different AOIs, and trends in mean confidence levels.
Overall task times
We evaluated overall task times in order to assess participants' efficiency in using the tool interface. Overall task times decreased across repeated sessions following a power learning curve. These results are consistent with hypothesis H1, which stated that users should become more efficient with the interface over time as they learn to navigate and use its features. I sessions followed a power learning curve for both surrogates and stakeholders, but stakeholders showed a greater decrease in overall task time for I sessions than surrogates.
HS sessions also followed a power learning curve for the surrogates. However, the mean task times for stakeholders were much more variable, resulting in a poor fit to the expected learning curve. The differences in mean task times across stakeholders may be due to the following potential reasons: (1) some abnormally large task times that were still within two standard deviations from the mean and hence were not excluded as outliers, possibly related to a re-learning process due to the lag time between sessions, (2) small sample size of stakeholders, or (3) existence of two learning curves instead of one overall learning curve as seen by an apparent increase in the time spent again at the start of the second block of HS sessions (i.e., HS1_2 in Figure 3).
Overall task times can provide developers with estimates of the tool's efficiency. Information on how quickly users learn via the GUI can assist tool designers in re-evaluating users' interactions during the sessions, and the tool's ability to use participation in guiding the search process. Further, understanding the time needed to complete goals and provide useful feedback is also important for eliminating extra sessions, which can increase user fatigue and introduce noise without meaningful benefit to the search algorithm.
Mean percentage of time spent in different AOIs
The analysis of these data provided us with insightful evidence about where users were focusing their attention, as measured by the percentage of time each group spent in each of the AOIs. The groups showed fairly consistent behavior across sessions. However, stakeholders expended more time in the Info AOI, while surrogates expended more time in the Eval and Other AOIs. This supports hypothesis H2, where we predicted that because stakeholders are directly affected by issues and actions in the watershed, they would use the tool more effectively, and their percentage of time spent in information gathering areas of the interface would be greater than that of the surrogates.
Mean percentage of clicking events in different AOIs
These results help explain how the different groups used the interface to perform the tasks. The higher percentage of mouse clicking events for stakeholders vs. surrogates in Info AOIs supports hypothesis H3 and is consistent with the results reported above for mean percentage of time spent. Surrogates and stakeholders behaved differently with respect to information gathering. Stakeholders tended to make more mouse clicks to gather information from the user interface in order to make their decisions. Surrogates, on the other hand, did not explore the information gathering areas of the interface as much, making fewer mouse clicks in these regions. This could be a consequence of a lack of interest in the task, or a lack of information about the tool's goals.
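Both AOI metrics (percentage of time spent and percentage of clicks per AOI) can be derived from a simple interaction log. A minimal sketch, with a hypothetical event log and AOI labels borrowed from the paper's categories (Info, Eval, Other); the log format and attribution rule are assumptions, not the study's instrumentation:

```python
from collections import defaultdict

# Hypothetical interaction log: (timestamp in seconds, AOI, event type)
log = [
    (0,   "Info",  "click"),
    (12,  "Info",  "click"),
    (30,  "Eval",  "click"),
    (45,  "Eval",  "click"),
    (60,  "Other", "click"),
    (70,  "Info",  "click"),
    (100, "Info",  "end"),   # session end marker
]

time_in = defaultdict(float)
clicks_in = defaultdict(int)
for (t0, aoi, ev), (t1, _, _) in zip(log, log[1:]):
    # The interval between consecutive events is attributed to the AOI
    # of the earlier event; clicks are tallied per AOI.
    time_in[aoi] += t1 - t0
    if ev == "click":
        clicks_in[aoi] += 1

total_time = sum(time_in.values())
total_clicks = sum(clicks_in.values())
pct_time = {a: 100 * t / total_time for a, t in time_in.items()}
pct_clicks = {a: 100 * c / total_clicks for a, c in clicks_in.items()}
print(pct_time)
print(pct_clicks)
```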
Previous work using confidence levels showed a positive trend as experimental time progressed (Babbar-Sebens & Minsker 2011), i.e., mean confidence levels increased over time as users gained experience with the tool. However, we did not find that mean confidence levels increased for all of the users. Therefore, hypothesis H4 was only partially supported.
Our results showed that just 40% of participants exhibited a positive trend in mean confidence levels over time. Nevertheless, there was a clear distinction between surrogates and stakeholders in relation to these trends. Approximately 67% of the total stakeholder participants presented a positive trend, while just 29% of the total surrogate participants presented a positive trend.
Relationships between confidence levels, time spent, and mouse clicking events
This section examines the results on the relationships between trends in mean confidence levels, time spent, and mouse clicking events. The results in Table 3 help to support hypothesis H5, which states ‘when users spend more time and make more mouse clicks in information gathering areas, their mean confidence levels increase over time’. An analysis of the relationship between time spent, clicking events, and mean confidence level trends indicates that participants with a positive trend had a larger number of clicking events per unit time spent in information gathering areas, compared to participants with a negative trend. This may indicate that if a participant interacts with the interface to gather more information on the design alternatives within the same amount of interaction clock time (i.e., the time spent), then they are likely either to improve their self-confidence over time or to maintain a steady value. In addition to the slope, the fitted value of the exponent in the power curves was also lower for the negative trend than for the positive trend in mean confidence levels. In the Eval AOI, no clear difference across confidence level trends was observed.
CONCLUSIONS AND FUTURE WORK
This research adapted and applied usability techniques to a set of data collected with the WRESTORE tool to evaluate its performance and usability. Quantitative usability evaluations are not typically conducted for these types of EDSS. However, such an approach can offer valuable insights into the ways these tools will be used. Overall, this work provided two substantial contributions: (1) determination and validation of usability metrics for participatory design tools based on information technologies and (2) evaluation of differences between how surrogates (volunteers) and stakeholders (end users) use such interactive design tools.
WRESTORE was developed for the Eagle Creek Watershed, in Indianapolis, IN, with the goal of providing a more democratic venue for stakeholders to engage with the watershed community in the design of alternatives for spatial allocation of conservation practices. The tool was initially tested by surrogates who were not intimately involved with the issues and concerns in the watershed. Their feedback was recorded and saved for later analysis. Then, the tool was tested with stakeholders (i.e., potential end users) to determine whether the findings from surrogates held true for the actual end users.
As the majority of usability tests are performed by students or volunteers, we wanted to track possible differences in responses between the tested group (surrogates) and the end users (stakeholders), and analyze to what extent the results from surrogates would be reflected in the behaviors of stakeholders. From the overall task time analysis, we concluded that the participants of both groups became more efficient as they learned how to navigate and use the tool's features. Generally, overall task times decreased across repeated sessions. Therefore, surrogates can potentially be used as proxies for stakeholders for overall task time analysis and improvements. However, they differ in other potentially important regards, as discussed below.
As for how users focused their attention in terms of time spent across the different AOIs, results showed that stakeholders expended more time than surrogates in gathering information about the performance of the different design alternatives. This could be a result of the motivation of stakeholders in creating a designed distribution of BMPs that better suits their interests. Surrogates that were not involved with the watershed may have lacked this motivation. Similarly, as we predicted, a higher percentage of mouse clicks were made on the information gathering areas by stakeholders vs. surrogates.
We also noticed that the majority of the stakeholders showed an increase in their mean confidence levels over time, while the surrogates did not. A comparison among trends in mean confidence levels showed that positive and no trends were associated with more information gathering activity. Surrogates who did less information gathering were more likely to show a decrease in their confidence levels. As observed for one of the participants (Participant 2), extensive information gathering over repeated sessions can also lead to a later positive change in the trend of mean confidence levels (see Figure 6).
A potential limitation of this initial investigation of the tool's usability is that it involved only a specific group of stakeholders, who work at federal and state agencies and non-governmental organizations in programs that support the implementation of conservation practices in the watershed. Future plans include experimentally testing the tool on a wider range and greater number of stakeholders, including farmers. Additionally, we specifically chose quantitative usability metrics because this type of analysis had previously been neglected in the literature; in future work, we also plan to include follow-up questionnaires to further investigate user reasoning and to evaluate user self-reports alongside the quantitative data.
This quantitative analysis of usability metrics has led to the following suggestions for improving the WRESTORE tool, which may also be of interest to other researchers developing EDSS:
Decrease the time users must spend providing feedback, particularly by reducing the number of sessions they need to complete, in order to avoid user over-fatigue (Butler & Winne 1995) once it is apparent that the user has become efficient at using the tool; overall task times can be a useful indicator for this.
Encourage use of the AOIs related to information gathering, to focus users' attention on the more informative aspects of the interface and to increase users' confidence levels. This could be achieved through interface designs that emphasize exploration of areas where users can gather more information by clicking through menus, graphs, maps, and improved data visualizations that allow better comparisons among design alternatives.
Provide a final ‘summary’ session that recapitulates the findings and designs of desirable alternatives found by the users.
We would like to acknowledge the funding agency – National Science Foundation (Award ID #1014693 and #1332385). We would also like to thank all of our collaborators from the different agencies and institutions: Empower Results LLC team for facilitating user testing, Dr Snehasis Mukhopadhyay and Mr Vidya B. Singh for computational support, and all workshop participants.