Web-based distributed volunteer computing enables scientists to build platforms for computational tasks using potentially millions of computers connected to the internet. It is a widely used approach for many scientific projects, including the analysis of radio signals for signs of extraterrestrial intelligence and determining the mechanisms of protein folding. User adoption and clients' dependence on desktop software present challenges in volunteer computing projects. This study presents a web-based volunteer computing framework for hydrological applications that requires only a web browser to participate in distributed computing projects. The framework provides distribution and scaling capabilities for projects with user bases of thousands of volunteers. As a case study, we tested and evaluated the proposed framework with a large-scale hydrological flood forecasting model.
With recent advances in high-resolution sensing and monitoring capabilities, large-scale spatial and temporal modeling has become computationally challenging and mostly beyond the capabilities of a single ordinary computer (Nakada et al. 1999). Surpassing this limitation is possible only with high-performance computing (HPC) clusters or a distributed system of many computers working in parallel. Distributed systems and computing rely on physically separate participant computers that make up a larger system (Lamport & Lynch 1989). Since computing power is no longer concentrated only in computing centers (Anderson 2004), distributed systems present a clear opportunity for scientific computing. Blockchain-based systems, ARPANET (the ancestor of the internet), and the internet itself are common examples of distributed systems (Stone & Bokhari 1978; Coulouris et al. 2005). In parallel computing systems, physically separated computers, or computing instances, carry out computations for a task as directed by a central system (Culler et al. 1999). Similarly, in distributed computing, instances communicate through a messaging system and work asynchronously, with scheduling support from a central system (Almasi & Gottlieb 1989).
Forming the base for distributed computing, distributed systems are widely used in both academia and industry (Attiya & Welch 2004). As an efficient way of computation (Korpela 2012), distributed computing depends on the computational power of connected instances to solve the assigned computational tasks (Coulouris et al. 2005). Consequently, a distributed system is able to achieve tasks that are beyond the capabilities of a single computing instance. Such a distributed computing system is considered a volunteer system when it comprises volunteers whose capable devices contribute to completing the computational tasks needed to solve the problem (Anderson & Fedak 2006). In such volunteer computing systems, volunteers with idle computational power in their desktops and portable devices (smartphones and tablets) take part in a scientific problem-solving process.
In a regular centralized system, all computation is carried out by a dedicated HPC instance owned by the computing project's operator. The HPC instance is, in most cases, expensive to both set up and maintain. Similarly, in grid computing or distributed computing without volunteer participation, the dominant cost is financing the infrastructure. In contrast, in volunteer computing, a project operator only needs to maintain a web server that forms a central authority over the computation medium, while the computation is done by connected third-party computers. This makes distributed volunteer computing a suitable option for CPU-intensive tasks (Anderson & Fedak 2006).
Citizen science, as a term, describes scenarios in which people who are not scientists participate in or conduct research. Citizen volunteers and researchers can cooperate to address and answer scientific questions (Cohn 2008). For instance, volunteer test subjects can contribute to research by providing data (Bonney et al. 2009; Wiggins & Crowston 2011; Fienen & Lowry 2012; Ferster & Coops 2013; Jonoski et al. 2013; Horita et al. 2015). Volunteers can even serve as field assistants in scientific efforts (Cohn 2008). The goal of this study is to enable people to participate in scientific studies through a web browser by letting researchers use the idle computational power of their computers. A crowdsourcing-oriented platform provides a channel for this ‘volunteer computing’ (Granell et al. 2016). Thus, a volunteer in volunteer computing is a user with a computationally capable machine that is connected to the distributed computation system so that the system can use its resources for various computing tasks. Occasionally, the volunteer is also referred to as the ‘client’, borrowing the term from web terminology.
Volunteer computing has been used in many large-scale scientific projects for more than two decades. One of the earliest examples of a distributed volunteer computing project is the Great Internet Mersenne Prime Search (Durrani & Shamsi 2014), which aimed to find Mersenne prime numbers using hardware provided by volunteers. In the early 2000s, the few web-based volunteer computing efforts met with limited success because of the slow performance of web systems and the lack of support for multicore processing and background processes. One volunteer computing project, Bayanihan (Sarmenta & Hirano 1999), used Java applets running in web browsers to create a distributed computation system. In the same year, a project called Charlotte (Baratloo et al. 1999) used web browsers to create a parallel distributed computing framework.
SETI@Home is an early example of distributed computing that uses software to access a computer's resources to work on assigned tasks only when the computer is idle (Sullivan et al. 1997). The software was designed as a screensaver (Shirts & Pande 2000) that used volunteer computers to search for signs of extraterrestrial intelligence by analyzing radio signals reaching Earth, as part of the broader SETI effort. In 2000, the Pande Lab at Stanford University introduced the Folding@Home project, which aims to carry out protein folding simulations to enhance disease research. Similar to SETI@Home, Folding@Home also began as a screensaver (Shirts & Pande 2000).
In 2004, researchers released BOINC (Berkeley Open Infrastructure for Network Computing), an open source generic framework for volunteer computing (Anderson 2004). BOINC consists of two separate entities: server software for centrally managed task distribution and client software that uses volunteer computing resources. The client software enables volunteers to choose a project from several options on the BOINC framework, download assigned tasks from the server, and send the computed results back to the server after computation. After BOINC was released, distributed computing projects and studies involving volunteer participants gained momentum. Many volunteer computing projects started to take advantage of the BOINC infrastructure, including SETI@Home, MilkyWay@Home (Cole et al. 2010), Rosetta@Home (Das et al. 2007), and Einstein@Home (Einstein@Home 2005). Several studies also extended BOINC at different implementation levels, running its middleware on mobile devices (Black & Edgar 2009; Theodoropoulos et al. 2016) or enhancing its ability to handle and distribute data (Costa et al. 2008).
This paper starts with a review of the literature on distributed computing, volunteer computing, and citizen science. The Methods section explains the framework for the server-side as well as the client-side, with corresponding components and technologies. The Results and Discussion section includes a case study on hydrological modeling for the evaluation of the framework with benchmarks and results that demonstrate the performance and capabilities of the framework. The paper concludes with a summary of our objectives and findings, as well as future perspectives about the distributed computing framework.
In this section, we present the web-based distributed volunteer computing framework and its technical aspects, explaining the stack and technologies that power the framework and how these technologies use both server-side and client-side resources.
The framework's infrastructure consists of two major parts (Figure 1): one that deals with task scheduling and task distribution on the server-side and another that handles computations and communication with the server on the client-side. ‘User interface’ describes the panel volunteers (or users) use to start participating in scientific computing tasks. The framework asks users to give their consent to join a selected scientific effort and contribute computer time for the project. The volunteer can always revoke this permission later. Also, users can pause a computation when they need the computational power for a personal task, resuming their participation at any time. The framework can be embedded on any website with a single line of code. This allows researchers to easily enable this framework on websites with a heavy user base (e.g., organization or university websites).
Each job contains a number of tasks; jobs are acquired by the client, and their results are sent back when the computation is done, so jobs do not depend on each other. A volunteer instance that is set to contribute by taking 10 jobs with 1,000 tasks each runs the computation function with the given parameters 10,000 times (10 × 1,000). In other words, if 10 different jobs, each with 1,000 separate tasks, are conveyed to a client one by one, the client computer performs 10,000 separate function runs and generates results.
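The job and task accounting above can be sketched in client-side JavaScript as follows; the function names are illustrative, not part of the framework's API.

```javascript
// Hypothetical sketch: a client that accepts `jobCount` jobs of
// `tasksPerJob` tasks each runs the computation function once per task.
function totalFunctionRuns(jobCount, tasksPerJob) {
  return jobCount * tasksPerJob;
}

// Simulate conveying jobs to a client one by one and counting the
// individual function runs it performs.
function runClient(jobCount, tasksPerJob, computeFn) {
  let runs = 0;
  for (let job = 0; job < jobCount; job++) {
    for (let task = 0; task < tasksPerJob; task++) {
      computeFn(job, task); // one function run per task
      runs++;
    }
  }
  return runs;
}
```

For the example in the text, 10 jobs of 1,000 tasks yield 10,000 function runs.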
One important aspect that affects data loading into the framework is task scheduling. The scheduling system in the framework needs a priority scheme for the data tables. Providing the scheme is optional and depends on the computation task itself. If a priority scheme is provided, it must be included in the input data table as an integer column named ‘priority’ starting from the value ‘0’. This value represents the level of the data entry in a dependency table. The lower the value of the ‘priority’ column, the higher the priority given to the data entry. Thus, data entries with a priority value of ‘0’ are the first to be sent to the volunteers, entries with the value ‘1’ are second, and so forth. Besides the ‘priority’ column, each data entry should also have a Boolean column named ‘is_completed’ to keep track of completed tasks and an array column named ‘connections’.
The connection between upstream and downstream nodes in the priority list is established by the ‘connections’ column. This column holds an array of indices, which are the IDs of future computation tasks. Through this connections array, each task can be linked to any number of future tasks, and this information enables the framework to use the outputs of previous tasks as inputs to future tasks. Saving outputs to the output table and inserting them into the input table for future computations are handled by the results mechanism. The results mechanism needs to be defined using the database connection interfaces that the framework provides. The result engine, described in the following subsection, uses this interface to make the project database store the computation outputs. If the output of a task will be used as an input for a later computation task, the interface must specify the input-table column in which the individual outputs will be saved.
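The priority and connections scheme described above can be sketched as follows. The row fields mirror the columns named in the text (‘priority’, ‘is_completed’, ‘connections’); the `inputs` field and both function names are hypothetical additions for illustration only.

```javascript
// Select the next batch of ready tasks: only the lowest pending
// priority level is eligible, so downstream entries wait until their
// upstream dependencies finish.
function nextBatch(rows, batchSize) {
  const pending = rows.filter(r => !r.is_completed);
  if (pending.length === 0) return [];
  const level = Math.min(...pending.map(r => r.priority));
  return pending.filter(r => r.priority === level).slice(0, batchSize);
}

// When a task completes, propagate its output to every connected
// downstream task's inputs and mark the task as completed.
function completeTask(rows, id, output) {
  const row = rows.find(r => r.id === id);
  row.is_completed = true;
  for (const nextId of row.connections) {
    rows.find(r => r.id === nextId).inputs.push(output);
  }
}
```

In this sketch, entries with priority ‘0’ are dispatched first, and their outputs become inputs of the priority ‘1’ entries listed in their connections arrays.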
After the communication medium between the server-side of the framework and the data source is established via the database interfaces, volunteers can load the website and start contributing to the project.
Server-side task handling
The stack powering the framework consists of the Linux operating system, the PostgreSQL database management system, the PHP server-side scripting engine, and the Apache web server. PHP is one of the most common web programming languages for server-side development. PostgreSQL is an open source database management system that handles tasks, input datasets, and output from the computation. The framework runs on Red Hat Enterprise Linux with Apache 2. Apache is an open source project that enables developers to serve their web applications. The framework depends entirely on open source and free systems to facilitate reproduction and adaptation by other research groups. Input data and modeling codes should be stored in a PostgreSQL database. Once the data are stored in the designated database and the aforementioned connections are made, they are available to be shared with connected volunteer computing instances.
Task handling on the server relies on three components of the framework: the task engine, the result engine, and the web engine (Figure 1). The web engine handles interactions when a user visits the user interface and triggers the computation. It registers the new volunteer device and makes the arrangements that enable the task engine to transfer the data and the computation code to the user. After the web engine initiates this process, the task engine gathers tasks from the project database based on their dependencies and availability, and it withholds from assignment any task with unmet dependencies. After the selection is done, the task engine bundles the assigned tasks into a job package and transfers the data and the code to the client. The data consist of the parameters for each task, drawn from the corresponding columns of the connected database. The task engine thus manages all task and data distribution to volunteer computing instances. After the client has processed the data, it sends the results of the analysis, as a set of task results, to the result engine. The result engine saves the computed results to the database and marks the assigned tasks as completed using the ‘is_completed’ column.
All server-side task handling within the framework is done on a first-come-first-served basis. A client computer is placed in a queue when it requests tasks. After all clients that requested tasks earlier have been served, the computation data the client needs to contribute to a project are conveyed to it. If fewer tasks remain in the database than the requested amount, all remaining tasks are conveyed to the volunteer computer, and subsequent requests receive a ‘No tasks available’ message from the server-side until new data become available.
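The first-come-first-served distribution above can be sketched as follows; the function name, the response shape, and the message string are illustrative assumptions, not the framework's actual API.

```javascript
// Hypothetical sketch of first-come-first-served distribution: queued
// client requests are answered in arrival order, each with up to
// `jobSize` of the remaining tasks; once the pool is empty, later
// clients receive a 'No tasks available' message.
function serveQueue(queue, remainingTasks, jobSize) {
  const responses = [];
  for (const clientId of queue) {
    if (remainingTasks.length === 0) {
      responses.push({ clientId, message: 'No tasks available' });
    } else {
      // splice removes the served tasks from the shared pool, so a
      // short final batch simply conveys whatever is left.
      responses.push({ clientId, tasks: remainingTasks.splice(0, jobSize) });
    }
  }
  return responses;
}
```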
Client-side task handling
The data sent from the server to the client by the task engine must be stored on the client-side for an organized and orderly computation process. The framework uses IndexedDB, a transactional database that handles large-scale data in the web browser. After the task engine pushes the data that will be used in distributed tasks, the Web Worker script stores them locally in an IndexedDB database. After the data are stored, the tasks within a job are randomly distributed to the workers, the computation begins, and the workers draw all data from this local IndexedDB database. Computed results are also saved to the same storage to be pushed back to the server, i.e., the result engine. This structure allows local organization and scheduling of the tasks and handles interruption and resumption of the computing session on the client-side. After all tasks in the IndexedDB are completed, the client-side of the framework can either ask for a new job or terminate the connection with the server-side.
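The random distribution of a job's tasks to the workers can be sketched as follows. In the browser the bins would be Web Workers reading from IndexedDB; here plain arrays stand in for both, and the function name and shuffle method are illustrative choices.

```javascript
// Hypothetical sketch of client-side dispatch: shuffle the stored
// tasks and deal them round-robin to the available workers.
function assignToWorkers(tasks, workerCount) {
  // Simple random shuffle for illustration (not a uniform shuffle).
  const shuffled = [...tasks].sort(() => Math.random() - 0.5);
  const bins = Array.from({ length: workerCount }, () => []);
  shuffled.forEach((task, i) => bins[i % workerCount].push(task));
  return bins; // one task list per worker
}
```

Round-robin dealing keeps the per-worker load balanced to within one task regardless of the shuffle outcome.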
Interrupted and corrupted task handling
The above-mentioned distribution, computation, and communication pipeline assumes that all parts of the framework do their defined jobs properly. However, since the client-side cannot be completely managed by the framework itself, the framework includes recovery mechanisms for cases in which the client-side fails to convey the expected output. When a volunteer computer disconnects from the server-side of the framework, for instance by closing the browser window, the server-side still waits some time for a response from the client. If the response that includes the computation results is not received within a time interval defined by the project admin, the assigned tasks are not marked as complete in the database. Incomplete tasks are later reassigned to new volunteer instances through the regular scheduling system.
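The timeout-based recovery described above can be sketched as follows; the assignment record shape and function name are hypothetical, and the timeout value stands in for the admin-defined interval.

```javascript
// Hypothetical sketch of the recovery mechanism: assignments that have
// not returned results within the admin-defined timeout are released
// back to the pool so the regular scheduler can reassign them.
function reclaimExpired(assignments, now, timeoutMs) {
  const reclaimed = [];
  for (const a of assignments) {
    if (!a.completed && now - a.assignedAt > timeoutMs) {
      a.clientId = null; // no longer bound to the lost client
      reclaimed.push(a.taskId);
    }
  }
  return reclaimed; // task IDs to feed back into regular scheduling
}
```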
Another problem that may occur on the client-side is that malevolent attempts can be made through the regular API routes of the framework using client computers. For instance, a client can fabricate responses that appear to the server-side as valid results without carrying out the actual computation. This potential problem can be mitigated by validating the results, i.e., distributing the same tasks to multiple random volunteer instances; however, each extra validation task increases the overall computation time. Consequently, the advantages of using a distributed approach may be lost if validation is overdone. Nevertheless, the framework presented in this paper enables researchers to define a validation threshold for distributed tasks via the scheduling system. It can be zero or a non-zero value, and this decision is left to the project administrator.
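One common way to realize such replication-based validation is majority voting over the returned copies; the following sketch assumes that interpretation (the source does not specify the framework's exact validation rule), and the function name and threshold semantics are illustrative.

```javascript
// Hypothetical sketch of result validation by replication: each task
// is replicated to `threshold` extra volunteers, and a result is
// accepted only when a strict majority of the returned copies agree.
// A threshold of zero means every result is trusted as-is.
function validateResults(copies, threshold) {
  if (threshold === 0) return copies[0];
  const counts = new Map();
  for (const copy of copies) {
    const key = JSON.stringify(copy);
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  for (const [key, n] of counts) {
    if (n > copies.length / 2) return JSON.parse(key);
  }
  return null; // no majority: reschedule the task
}
```

This illustrates the trade-off discussed above: each extra replica improves confidence in the result but multiplies the computation spent per task.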
Case study on hydrological modeling
The Iowa Flood Information System (IFIS) was developed at the Iowa Flood Center (Demir & Krajewski 2013; Krajewski et al. 2017) using a generalized water cyberinfrastructure. The IFIS is based on an integrated and extensible architecture that allows researchers, instructors, and students to create custom cyberinfrastructures for their own research projects and curriculum (Jones et al. 2018; Sermet & Demir 2018; Weber et al. 2018), enabling reproducible scientific workflows (Duffy et al. 2012; Gil et al. 2016). The IFIS features many data management, scientific visualization, and communication tools, as well as data resources to facilitate flood risk management and research at the watershed scale (Demir & Beck 2009; Demir et al. 2009, 2015, 2018).
The Iowa Flood Center operates a real-time, HPC-based flood forecasting model (Cunha et al. 2012; Small et al. 2013). The model transforms rainfall from climate projections into quantitative stage and discharge forecasts and provides a 10-day flood risk outlook in the IFIS for more than 1,500 locations in Iowa (e.g., communities and stream gauges). It is a distributed model running on HPC (Quintero et al. 2016; Krajewski et al. 2017) that builds on the decomposition of the landscape into hillslopes and channels (Mantilla & Gupta 2005). The mass transport equations for each hillslope in the network are defined by a power law relation describing flow velocity as a function of discharge and drainage area (Ayalew & Krajewski 2017). Details about the model equations, configuration, and numerical solver are provided in Quintero et al. (2016) and Krajewski et al. (2017). The model runs for more than 600,000 hillslopes in Iowa (Figure 2). The hillslopes are organized into 10 dependency levels, and scheduling follows this order of dependency (Table 1); each level requires all previous levels to be computed before its tasks are sent to the volunteer instances. Each independent task focuses on a single hillslope and needs a minimal data transfer of 5–10 kilobytes. Running such a comprehensive model in an operational setting requires extensive resources for data processing, computation, and communication of results. Operational systems like the IFIS can benefit from the proposed framework by distributing the computations of the operational flood forecasting model to a large user base. We used a simplified version of this hydrological model for the case study to understand the scalability and performance of the volunteer computing framework. The model is simplified to reduce the data needs without affecting the computational evaluation of distributed computing. Simplification is applied only to the system parameters to minimize the effort of data preparation for the case study.
Average values are used at the hillslope level for these parameters instead of unique values. The simplifications are implemented so as not to affect the computation and data transfer complexity, allowing an objective evaluation of the performance of the framework. This framework will allow volunteers to contribute to hydrological research and applications from their homes, as in the case study on flood forecasting.
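For context, the power law velocity relation referenced above is commonly written in the cited literature in a form such as

```latex
v = v_r \left( \frac{Q}{Q_r} \right)^{\lambda_1} \left( \frac{A}{A_r} \right)^{\lambda_2}
```

where $v$ is flow velocity, $Q$ is discharge, $A$ is upstream drainage area, $v_r$, $Q_r$, and $A_r$ are reference values, and $\lambda_1$, $\lambda_2$ are exponents. This form is a sketch of the general relation only; the exact formulation and parameter values used by the model are given in Quintero et al. (2016) and Krajewski et al. (2017).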
RESULTS AND DISCUSSION
Since browsers do not allow a webpage to run a script in the background when the tab is closed on the client computer, the computation process could be interrupted suddenly without notice. This would interrupt the volunteer computing project. The framework can handle such interruptions and continue where it left off when the user returns to the page. One solution uses browser extensions to fully eliminate interruption and allow clients to continue running even after the browser is closed. The extension can easily be installed in the client browser when the user visits the project website.
Another challenge we encountered when running the model and testing the framework concerned the data and task dependencies of the specific model used. Some models have a prioritized hierarchy for the computation of certain tasks, so the next steps of the computation need the results of prior computations. When the model runs short of ready tasks, volunteers must wait until new data become available for computation. This is common in hydrological applications, in which computations from upstream nodes are required by downstream nodes. Prioritizing tasks from upstream nodes when scheduling can reduce this dependency bottleneck and improve the time-efficiency of the computations.
Because distributed computing with volunteer instances depends on data transfers, a reliable data transfer process is crucial. Even though the internet provides a proper medium for such interactions, data size creates a limitation in a system that depends heavily on these transfer operations. Thus, in this study, we selected a use case whose data transfer processes could be optimized within these limitations. Considering this trade-off and its effect on total run time, web-based volunteer computing systems are more suitable for computationally intensive tasks and studies with limited or low data needs.
Benchmarks with the model and framework
We conducted various tests to evaluate the server and client communication in terms of performance and timing and to observe the behavior of the framework and the model. We used a moderate client system for tests with Intel(R) Core(TM) i5-4258U CPU @ 2.40 GHz. Selected tests and their goals can be listed as follows: (a) tests with a single client and different numbers of workers and job size combinations to understand the performance changes; (b) tests to measure average and total elapsed time by each step of the computation; (c) performance on the utilization of separate browser windows versus the utilization of workers on one browser window; (d) tests with only job size changes on the same client to observe the linearity of the performance gains when job size is increased; (e) tests with the heterogeneity of clients to evaluate the performance of both the client and the server; and (f) server performance when multiple clients are connected to the server to evaluate the task retrieving durations as the number of clients increased.
In the case study, we defined each task as solving the rainfall–runoff model equations for a hillslope (watershed) for discharge predictions. Test results for different numbers of jobs and job size combinations (a) and elapsed times for each step of the process (b) can be seen in Figure 3. These tests were conducted using a client computer with a two-core CPU (Intel(R) Core(TM) i5-4258U CPU @ 2.40 GHz). Our benchmarks showed that increasing the worker count did not dramatically decrease the elapsed computation time; nevertheless, there is a clear improvement as the number of workers increases. This improvement continues until the CPU can no longer handle the number of workers selected.
The server's ability to handle users concurrently is the only factor that depends on the server specifications and the framework code itself. Tests (f) with multiple clients simultaneously requesting tasks showed no significant overhead on the server or loss of efficacy in the task distribution process. Besides the server's scaling abilities, we also tested the scalability of the framework with an open source load testing tool (Locust 2011), using both concurrent and swarmed users. A Locust file was created containing possible scenarios within the framework with different job sizes (e.g., 100, 200, 500, and 1,000), numbers of jobs (e.g., 10, 50, and 100), and worker counts (e.g., 1, 2, and 4). Different numbers of users (e.g., 100, 500, 1,000, and 2,000) were used to test the framework's server-side abilities, and the framework handled various swarming rates successfully. The success rate of requests to the framework starts to decrease at a swarming rate of 200 users per second.
Finally, we made a test run in operational settings for the entire system with seven computers of various specifications utilizing various numbers of workers (Table 2). For simplicity, the job size for every computer was kept fixed at 1,000 tasks at a time, and when fewer than 1,000 tasks remained on the server-side, all remaining tasks were conveyed to the requesting party without trying to fulfill the requested job size. No number of jobs was specified, and all machines ran until no tasks remained in the database. It should also be noted that the validation mechanism was not used for this experiment, and all computers were assumed to be trusted volunteers.
The time required for the overall process (task retrieval, computations on the client, and sending results to the server) varies by computer. The model used in the framework runs with data for a total of 620,172 hillslopes in the State of Iowa with 10 levels of dependencies (Table 1). Task distribution and computation started with the first level of dependency (Level 0), and no subsequent level was initiated while unfinished tasks remained at a previous level. For this case study, which required computing 620,172 tasks with a user base of seven volunteer computers of various hardware, software, and network capabilities, the model was solved in approximately 11 min (exactly 661.02 s). During the computation, the number of tasks left in the database decreased near-linearly.
To understand both the performance constraints and the scalability of the framework, we also ran separate simulations with 20, 30, 50, 70, and 100 volunteer computers identical to Computer 1 in Table 2, each contributing two workers. These simulations took 280.2, 187.1, 115.8, 80.1, and 62.3 s, respectively, to finish all hillslope computations. It should be noted that the simulation may not completely reflect real-world settings, but considering that the framework is able to handle up to 200 requests per second, it nevertheless provides useful insight into the performance of the framework.
It warrants mention that the use case we selected for test purposes in this study relies on time-sensitive tasks, which need to be completed almost simultaneously. By its very nature, distributed volunteer computing depends on the number of participants, and this may be a drawback. Nevertheless, we expect to have more participants precisely when flood prediction models are needed most, during extreme weather events, when it is important to finalize all computations. Even so, volunteer computing systems are used most efficiently when the intended computations are not bound by strict time constraints. Although a limited number of volunteer participants might seem critical to the success of such a distributed volunteer computing task, as mentioned earlier, the required number of volunteers for this study can easily be met through the user base of the IFIS (over 300,000 unique users as of December 2018).
The main goal of this study is to develop a web-based voluntary scientific computing framework and platform capable of using multiple cores and taking advantage of low-level technologies to enhance computation speeds. Many computational approaches, such as grid computing and cloud computing, achieve similar goals by distributing the computational load across multiple nodes. These approaches can provide a more stable environment than distributed volunteer computing with regard to the accessibility of computational power. However, the relatively low cost of distributed volunteer computing makes it a favorable choice for non-time-sensitive scientific computing tasks. Even time-sensitive tasks, such as running a flood forecasting model on recent measurements, can still be carried out with distributed volunteer computing via operational web systems with large user bases. The web-based nature of the proposed framework makes it easy for any researcher to adapt the framework for their project and utilize the existing user base of their labs or institutes with limited effort.
While the framework is specialized to work efficiently with hydrologic computational tasks, it is also designed for use by scientists from other disciplines. The model chosen for the study allows easy parallelization of the tasks and is large enough to serve as an example for a distributed computing project. The browser-based approach provides easy adoption and maintenance for both project managers and volunteers. The results of the performance tests suggest that the framework is capable of handling scientific computation tasks and scales efficiently with multicore systems and multiple users. The hierarchical dependency structure of the models or data can be handled at the server level during scheduling. The framework supports the use of low-level computing and programming technologies, which provide a significant improvement in computational performance. Since the overhead caused by large numbers of volunteers was empirically shown to be insignificant, this work suggests that web browsers and the many computationally capable devices around the world can facilitate the participation of volunteers in scientific efforts, including data processing and modeling.