Abstract
In water quality monitoring, the complexity and abstraction of water environment data make it difficult for staff to monitor the data efficiently and intuitively. Visualization of water quality data is an important part of water quality monitoring and analysis. Because water quality data have geographic features, they can be visualized on maps, which not only provides an intuitive display but also reflects the relationship between water quality and geographical position. In this study, the heat map provided by Google Maps was used to visualize water quality data. However, as the amount of data increases, the computational efficiency of traditional development models cannot keep pace with the computing tasks, and the effective storage, extraction and analysis of large water data sets become problems needing urgent solution. Hadoop is an open-source software framework running on computer clusters that can store and process large data sets efficiently, and it was used in this study to store and process water quality data. Analysis and experiments indicate that this combination can provide an efficient and convenient information platform for water quality monitoring.
INTRODUCTION
With economic development, water shortages and water pollution are becoming more and more serious, and are a primary bottleneck limiting continued growth. Long-term monitoring, used in many projects, is an indispensable means of studying the impacts of water pollution on the environment, including the effects of sewage on soil and groundwater, and the relationship between trace elements in water and crops. Common data representations, such as data sheets, make it difficult to monitor water quality intuitively and efficiently. Because the data are large in volume and have spatial and temporal characteristics, huge amounts are generated during long-term monitoring, posing significant difficulties in practical applications. Realizing effective long-term monitoring and easy management of these data is therefore an urgent need.
When the information is presented only in numeric form, it is difficult for staff to judge water quality intuitively and efficiently; data visualization is thus of great significance for water quality analysis. The use of graphics to transmit and communicate information clearly and effectively is usually called data visualization. Many methods have been used to achieve visualization, including scatter diagrams, histograms, timelines and tree graphs. Keim (2002) summarizes several data visualization methods, such as geometrically-transformed displays, iconic displays, dense pixel displays and stacked displays. These methods are not well suited to data with geographical characteristics, however, because water environment information includes depth, trace element concentration, location, etc. The Google Maps service makes it possible to display geographical data on a map. Google Maps is a new type of Web Geographic Information System (WebGIS) that uses mature Web and Geographic Information System (GIS) technologies (Luan & Zhu 2007). Launched in 2005 to provide online mapping applications, Google Maps can be embedded into sites without a separate map server, and offers programmers an extensive Application Programming Interface (API) that can be used with JavaScript and other scripting languages. Akanbi & Agunbiade (2013) reported a system that integrates a city's GIS data with the Google Maps API. It has also been reported (Lifa 2013) that, with API V3, there is no need to register an API key before use. The new version of Google Maps supports many web browsers, including Internet Explorer and Firefox, as well as browsers on mobile devices such as the Apple iPad and iPhone. For these reasons, the Google Maps JavaScript API is the most commonly used API for online mapping, and map-based visualization has been used to show traffic congestion and the real-time locations of buses. Many other applications are being developed with Google Maps, too (Boulos 2005; Hu & Dai 2013).
Google Maps API V3 provides several layer types for visualizing data, including heat maps, and traffic and bicycle layers (Lifa 2013). The heat map, one of the available visualization types, is commonly used to characterize data intensity at individual geographic locations. The colors in a two-dimensional plane on the heat map can be adjusted to ‘show’ the value of an event (Fisher 2007). By default, regions with higher intensity are displayed in red and those with lower intensity in green, so the geographic environment, e.g., traffic or climate, is shown in vivid color. Heat maps have been studied widely. Fisher (2007) pointed out that they can be used to track user interactions and behavior by visualizing the location and frequency of mouse clicks on a site. Atterer & Lorenzi (2008) proposed using heat maps to show which parts of a web page were viewed for extended periods. Moumtzidou et al. (2013) used them to search for environmental resources. Thus, using heat maps to show the trace element content of water can depict a reservoir environment clearly, while the evolution of water quality can be observed over time.
Large-scale computing models are proposed to improve computing speed and visualization efficiency. Cloud computing is very powerful in handling massive amounts of data, and offers efficiency and convenience advantages in large-scale data storage, transmission and preprocessing (Armbrust et al. 2010). Apache Hadoop, an open-source implementation of Google's cloud computing solution, is a software framework for the distributed storage and processing of very large data sets on computer clusters (Agarwal 2016). The Hadoop Distributed File System (HDFS) and MapReduce are two important parts of Hadoop. HDFS stores files across the different nodes of a Hadoop cluster; if a node fails, HDFS can guarantee data recovery, ensuring that Hadoop continues to run normally (Shvachko et al. 2010; Ghazi & Gangodkar 2015). MapReduce is a programming model for processing the data. Related studies are reported by Patel et al. (2012).
HBase is a distributed database (Apache HBase 2017) that provides Bigtable-like capabilities on top of Hadoop. It is a scalable, large distributed database that uses MapReduce to handle massive amounts of data (Carstoiu et al. 2010; Vora 2011). Among its many advantages, HBase combines well with the Hadoop platform because the data blocks it stores are distributed automatically by HDFS (Vora 2011). HBase focuses on interactive reading and writing of single data points, in line with its low-latency characteristics. It can also be scaled horizontally easily, and its natural combination with MapReduce greatly facilitates secondary development by programmers (Dittrich & Quiané-Ruiz 2012). Facebook uses Hadoop and HBase to solve data computing and storage issues (Borthakur et al. 2011; Aiyer et al. 2012; Harter et al. 2014). Using Hadoop and HBase to store and process water quality data is thus of great help in improving water quality monitoring efficiency.
As shown in Figure 1, Hadoop is a framework comprising multiple subprojects. The core comprises HDFS, MapReduce and YARN, the last being a resource management system responsible for the unified management and scheduling of cluster resources (Vavilapalli et al. 2013); Zookeeper is a distributed coordination service for data management problems in distributed environments, such as unified naming and configuration synchronization (Hunt et al. 2010). In this study, HDFS, YARN, MapReduce, HBase and Zookeeper were used to store and process water quality data.
In this study, heat maps from Google Maps were used to visualize water quality information, while Hadoop and HBase were used to process and store the water quality data. This combination was shown to improve the user experience in terms of both visualization and data processing speed.
METHODS
System architecture
Figure 2 shows the platform design. The system architecture is divided into three layers, the first representing the basic environment in which the system runs, the second processing the data, and the top layer displaying it.
First layer
In the first layer, the Linux operating system, Hadoop, HBase, Zookeeper and other software are installed. With those in place, the network should be configured, including static IP addresses and firewall settings, and the configuration files edited as needed.
Second layer
In the second layer, the data are processed into the format required by the Google Maps API. Because the raw data are stored in text files, they must first be imported into the database.
First, the HBase data table is set up, then the data files are read and the data inserted into the HBase database. HBase stores data in tabular form, i.e., with rows and columns, and the columns are grouped into column families. Each column in the table belongs to a column family, so a data table can be designed as a number of column families, each containing several columns. For example, the location information comprises latitude, longitude and depth, so the columns holding these data are placed in one column family. After the table has been created, the data files are read line by line with Java and each column's data inserted into the table, as sketched below.
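The following sketch uses the HBase 1.2 Java client; the table, column family and column names (water_quality, location, elements) and the whitespace-delimited file format are assumptions for illustration, not the authors' exact schema.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class WaterQualityImporter {
    public static void main(String[] args) throws Exception {
        String file = args[0];   // one day's data file
        String date = args[1];   // monitoring date, used in the row key
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
            TableName name = TableName.valueOf("water_quality");
            try (Admin admin = connection.getAdmin()) {
                if (!admin.tableExists(name)) {
                    HTableDescriptor desc = new HTableDescriptor(name);
                    desc.addFamily(new HColumnDescriptor("location")); // latitude, longitude, depth
                    desc.addFamily(new HColumnDescriptor("elements")); // nine trace element columns
                    admin.createTable(desc);
                }
            }
            try (Table table = connection.getTable(name);
                 BufferedReader reader = new BufferedReader(new FileReader(file))) {
                String line;
                long rowId = 0;
                while ((line = reader.readLine()) != null) {
                    String[] f = line.split("\\s+"); // assuming whitespace-delimited columns
                    Put put = new Put(Bytes.toBytes(date + "_" + rowId++)); // date + sequence
                    put.addColumn(Bytes.toBytes("location"), Bytes.toBytes("latitude"), Bytes.toBytes(f[0]));
                    put.addColumn(Bytes.toBytes("location"), Bytes.toBytes("longitude"), Bytes.toBytes(f[1]));
                    put.addColumn(Bytes.toBytes("location"), Bytes.toBytes("depth"), Bytes.toBytes(f[2]));
                    for (int i = 3; i < 12; i++) { // columns 4 to 12: trace element concentrations
                        put.addColumn(Bytes.toBytes("elements"), Bytes.toBytes("e" + (i - 2)), Bytes.toBytes(f[i]));
                    }
                    table.put(put);
                }
            }
        }
    }
}
```

Prefixing the row key with the date keeps one day's rows contiguous in HBase, which the filtering step below can exploit.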
Second, the data are filtered according to conditions such as water depth level and trace element concentration. The time, water depth level and trace elements selected by users on the web page become the filter criteria, and are applied as the data are retrieved through the HBase interfaces; a sketch of this filtering follows.
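The following sketch, under the same assumed schema (a date-prefixed row key and a location:depth column), shows one way to express such criteria with the HBase filter API:

```java
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.CompareFilter;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.PrefixFilter;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class WaterQualityQuery {
    // Returns a scanner over one day's rows at the selected depth level;
    // the caller is responsible for closing the scanner and the table.
    public static ResultScanner scanByDateAndDepth(Connection connection,
                                                   String date, String depthLevel)
            throws java.io.IOException {
        Scan scan = new Scan();
        FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
        // Row keys begin with the date (see the import sketch), so a prefix
        // filter restricts the scan to the selected day.
        filters.addFilter(new PrefixFilter(Bytes.toBytes(date)));
        // Keep only rows whose depth column matches the selected level.
        filters.addFilter(new SingleColumnValueFilter(
                Bytes.toBytes("location"), Bytes.toBytes("depth"),
                CompareFilter.CompareOp.EQUAL, Bytes.toBytes(depthLevel)));
        scan.setFilter(filters);
        Table table = connection.getTable(TableName.valueOf("water_quality"));
        return table.getScanner(scan);
    }
}
```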
Finally, the filtered data are formatted, through the MapReduce interfaces, into the form required by the Google Maps API, as sketched below.
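The authors do not show their code; one plausible shape for this step is a map-only job over the HBase table that emits one JSON point per row for the heat map layer, again using the assumed column names:

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;

// Formats each HBase row as a JSON point for the web page, e.g.
// {"lat":36.01,"lng":114.52,"weight":0.73}.
public class HeatmapFormatMapper extends TableMapper<NullWritable, Text> {
    private final Text out = new Text();

    @Override
    protected void map(ImmutableBytesWritable row, Result value, Context context)
            throws IOException, InterruptedException {
        String lat = Bytes.toString(value.getValue(Bytes.toBytes("location"), Bytes.toBytes("latitude")));
        String lng = Bytes.toString(value.getValue(Bytes.toBytes("location"), Bytes.toBytes("longitude")));
        String w   = Bytes.toString(value.getValue(Bytes.toBytes("elements"), Bytes.toBytes("e1"))); // selected element
        out.set(String.format("{\"lat\":%s,\"lng\":%s,\"weight\":%s}", lat, lng, w));
        context.write(NullWritable.get(), out);
    }
}
```

Such a job would be wired up with TableMapReduceUtil.initTableMapperJob, passing the filtered Scan from the previous step so that only matching rows reach the mapper.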
Third layer
In the third layer, the map is obtained from the Google Maps server, and the relevant parameters, such as zoom level and map style, are set on the map in the web page using the Google Maps API. The processed data are then taken from the second layer, and data visualization is achieved using the Google Maps API.
First, the map is embedded in the web page, centered on the geographic coordinates of the observation area, using JavaScript and the Google Maps API. The zoom level is then adjusted to suit the size of the observation area and the page layout. After that, the formatted data are obtained in the web page by an asynchronous Ajax request and, finally, data visualization is achieved using the Google Maps API, as sketched below.
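A minimal JavaScript sketch of this sequence, assuming the API is loaded with libraries=visualization, a hypothetical server endpoint /waterQuality returning the JSON points produced in the second layer, and illustrative center coordinates:

```javascript
// Embed the map, centered on the observation area (coordinates are illustrative).
var map = new google.maps.Map(document.getElementById('map'), {
  center: {lat: 36.0, lng: 114.5},
  zoom: 13  // adjusted to the size of the observation area
});

// Fetch the formatted points asynchronously (Ajax), then build the heat map layer.
var request = new XMLHttpRequest();
request.open('GET', '/waterQuality?date=20160501&depth=2&element=e1');  // hypothetical endpoint
request.onload = function () {
  var points = JSON.parse(request.responseText);
  var data = points.map(function (p) {
    // Weighted locations let the color intensity reflect the concentration.
    return {location: new google.maps.LatLng(p.lat, p.lng), weight: p.weight};
  });
  new google.maps.visualization.HeatmapLayer({data: data, map: map});
};
request.send();
```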
Experimental description
In order to reflect the high speed of computing with Hadoop, two development methods were used to compare data processing efficiency.
Experimental environment
1. Experimental environment with Hadoop: pseudo-distributed mode was used, running on a single node. Table 1 shows the software versions and operating environment.
2. Experimental environment without Hadoop: MySQL was used instead for the comparative experiment. Table 2 shows the software versions and operating environment.
Table 1 | Software versions and operating environment with Hadoop

| Name | Version |
| --- | --- |
| Operating system | CentOS 6.4 64-bit |
| Hadoop | Hadoop 2.2.0 |
| HBase | HBase 1.2.3 |
| Zookeeper | Zookeeper 3.4.5 |
| JDK | JDK 7.79 Linux 64-bit |
| Tomcat | Tomcat 7.0 |
Table 2 | Software versions and operating environment of the comparative experiment

| Name | Version |
| --- | --- |
| Operating system | CentOS 6.4 64-bit |
| MySQL | mysql-5.5.55-linux2.6-x86_64 |
| JDK | JDK 7.79 Linux 64-bit |
| Tomcat | Tomcat 7.0 |
Data and database description
The experimental data come from water quality monitoring of the Zhuyi River, China, and were stored in text files, each file containing one day's data and each line containing 12 columns. The first three columns hold the latitude, longitude and water depth level, while the remaining nine hold the trace element concentrations. Corresponding data tables were created in the databases.
In MySQL, the table was divided into 13 columns, the first 12 for the water quality data and the last for the monitoring time.
Data acquisition and processing
1. Data acquisition and processing with Hadoop: the data were imported into HBase and processed with MapReduce, as described in the Methods section.
2. Data acquisition and processing without Hadoop: the data were imported into MySQL and processed with a conventional Java program; a minimal sketch of this path follows.
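A minimal sketch of the non-Hadoop path, assuming the MySQL Connector/J driver on the classpath and a hypothetical table water_quality with the 13 columns described above (the study's actual table and column names are not given):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class MySqlImporter {
    public static void main(String[] args) throws Exception {
        String file = args[0];   // one day's data file
        String date = args[1];   // monitoring time for the last column
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost:3306/water", "user", "password"); // hypothetical credentials
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO water_quality VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?)");
             BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] f = line.split("\\s+"); // assuming whitespace-delimited columns
                for (int i = 0; i < 12; i++) {   // 12 data columns per line
                    ps.setString(i + 1, f[i]);
                }
                ps.setString(13, date);          // 13th column: monitoring time
                ps.addBatch();
            }
            ps.executeBatch();                   // insert the whole file in one batch
        }
    }
}
```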
Visualization
RESULTS AND DISCUSSION
Data processing results
Table 3 | Data processing running times (A: without Hadoop; B: with Hadoop; C = A − B; D = C/A)

| Data items | A (ms) | B (ms) | C (ms) | D |
| --- | --- | --- | --- | --- |
| 1,000 | 6,007 | 16,276 | −10,269 | −1.70951 |
| 5,000 | 17,270 | 32,273 | −15,003 | −0.86873 |
| 10,000 | 51,088 | 68,676 | −17,588 | −0.34427 |
| 30,000 | 326,680 | 379,905 | −53,225 | −0.16293 |
| 50,000 | 911,891 | 946,476 | −34,585 | −0.03793 |
| 60,000 | 1,369,878 | 1,327,547 | 42,331 | 0.030901 |
| 80,000 | 2,628,492 | 2,300,094 | 328,398 | 0.124938 |
| 100,000 | 4,424,418 | 3,610,955 | 813,463 | 0.183858 |
| 120,000 | 7,033,537 | 5,295,657 | 1,737,880 | 0.247085 |
| 140,000 | 10,772,433 | 7,770,032 | 3,002,401 | 0.278712 |
| 160,000 | 15,067,339 | 10,962,455 | 4,104,884 | 0.272436 |
| 180,000 | 20,907,527 | 15,080,526 | 5,827,001 | 0.278704 |
Figure 3 shows the running time comparison. As the amount of data increases, so does the processing time for both methods. However, once the number of data items exceeds 60,000, the processing time grows more slowly with Hadoop than without.
Figure 4 shows the difference in processing time between the two methods. When there are fewer than 60,000 data items, processing without Hadoop is faster, and the time difference widens until the amount of data reaches 40,000 items. Beyond that point the difference narrows gradually, and from 60,000 data items onwards processing with Hadoop is faster.
This is because Hadoop divides each data file into blocks, with a default size of 64 MB. If there are large numbers of small files (say, 2 to 3 MB each), far below Hadoop's block size, each file is still treated as a separate block. This has two consequences: (1) storing many small files occupies storage inefficiently and makes retrieval slower than for large files; and (2) small files waste computing power when MapReduce operations are performed, because Hadoop allocates one Map task per block by default.
The experimental results show that, as the amount of data increases, Hadoop's advantage becomes progressively more obvious, and data processing efficiency can be expected to improve greatly when the amount of data reaches terabyte levels.
Visualization results
Figure 5 shows the final visualization: a multi-colored layer covers the observation area. Areas where the water's trace element concentrations are highest are shown in brighter colors, e.g., red or yellow, while less bright coloring represents locations where the concentrations are lower.
As Figure 5 shows, the method is very intuitive, making rapid assessment of water quality status possible, and the visual effect is pleasing.
CONCLUSION
The Hadoop programming framework and HBase distributed storage were used to store and process water quality information. This combination improved program efficiency greatly, while the use of Google Maps for visualization brought great convenience for water quality monitoring. Combining the two technologies provides good support for establishing a water quality monitoring platform.
FUTURE WORK
The data were not cleaned for this study, which may have introduced anomalies. In future, the data will be cleaned before being imported into the database to avoid such anomalies. In addition, the programs were run on a single node, which does not fully reflect Hadoop's data processing advantages, so a cluster environment needs to be built to improve data processing capability.
ACKNOWLEDGEMENTS
We would like to thank all the members of this project. Many people helped during the writing, and our supervisor provided expert guidance, direction and encouragement. The work was supported by the National Natural Science Foundation of China under grant No. 51509066, Hebei Province's Natural Science Foundation Project under grant No. F2015402077, Hebei University of Engineering's Innovation Fund Project under grant No. sj150054, Hebei Province's High School Science and Technology Research Outstanding Youth Fund Project under grant No. YQ2014014, and Hebei Province's Natural Science Youth Fund Project under grant No. F2015402119.
AUTHOR CONTRIBUTIONS
For this paper, Weijian Huang, Yuanbin Han and Wei Do provided the idea; Xinfei Zhao and Yuanbin Han designed the experiments; Yao Cheng provided and analyzed the data; Xinfei Zhao performed the experiments.