Abstract

In water quality monitoring, the complexity and abstraction of water environment data make it difficult for staff to monitor the data efficiently and intuitively. Visualization of water quality data is therefore an important part of monitoring and analysis. Because water quality data have geographic features, they can be visualized on maps, which not only provides intuitive display but also reflects the relationship between water quality and geographical position. In this study, the heat map provided by Google Maps was used to visualize water quality data. As the amount of data increases, however, the computational efficiency of traditional development models cannot keep pace with the computing tasks, and effective storage, extraction and analysis of large water data sets become problems needing urgent solutions. Hadoop, an open source software framework running on computer clusters that can store and process large data sets efficiently, was used in this study to store and process the water quality data. Through analysis and experiment, an efficient and convenient information platform can be provided for water quality monitoring.

INTRODUCTION

With economic development, water shortages and water pollution are becoming more and more serious, and are now a primary bottleneck limiting continued growth. Long-term monitoring, used in many projects, is an indispensable means of studying the impacts of water pollution on the environment, including the effects of sewage on soil and groundwater and the relationship between trace elements in water and crops. Common data representations, such as data sheets, make it difficult to monitor water quality intuitively and efficiently. Because the data have mass, space and time characteristics, a huge amount is generated in long-term monitoring, posing significant difficulties in practical applications. Realizing effective long-term monitoring and easy management of such data is therefore an urgent need.

As the information is presented in digital form, it is not easy for staff to judge water quality intuitively and efficiently, so data visualization is of great significance for water quality analysis. The use of graphics to transmit and communicate information clearly and effectively is usually called data visualization. Many methods have been used to achieve it, including scatter diagrams, histograms, timelines and tree graphs. Keim (2002) summarizes several data visualization methods, such as geometrically-transformed displays, iconic displays, dense pixel displays and stacked displays. These methods are not well suited to data with geographical characteristics, however, because water environment information such as depth, trace element concentration and location must also be carried. The Google Maps service makes it possible to display geographical information data on a map. Google Maps is a form of Web Geographic Information System (WebGIS) that uses mature Web and Geographic Information System (GIS) technologies (Luan & Zhu 2007). There is no need to create a separate map server, because Google Maps can be embedded into sites; launched in 2005 to provide online mapping services, it exposes an extensive Application Programming Interface (API) for programmers, usable from JavaScript and other scripting languages. Akanbi & Agunbiade (2013) reported a system that integrates a city's GIS data with the Google Maps API. It has also been reported (Lifa 2013) that, with API V3, there is no need to register an API key before use. The new version of Google Maps supports many web browsers, including Internet Explorer and Firefox, as well as browsers on mobile devices such as the Apple iPad and iPhone.
Because of all this, the Google Maps JavaScript API is the most commonly used API for online mapping, and map-based visualization has been used to show traffic congestion and the real-time locations of buses. Many other applications are being developed with Google Maps, too (Boulos 2005; Hu & Dai 2013).

Google Maps API V3 provides several ways of using layers to visualize data, including heat maps, and traffic and bicycle layers (Lifa 2013). The heat map, one of the available visualization types, is commonly used to characterize data intensity at individual geographic locations: the colors of a two-dimensional plane on the heat map are adjusted to ‘show’ the value of an event (Fisher 2007). By default, regions with higher intensity are displayed in red and those with lower intensity in green, so that the geographic environment (e.g., traffic or climate) is shown in vivid color. Heat maps have been studied widely. Fisher (2007) pointed out that they can be used to track user interactions and behavior by visualizing the location and frequency of mouse clicks on a site. Atterer & Lorenzi (2008) proposed using heat maps to show which parts of a web page were viewed for extended periods. Moumtzidou et al. (2013) used them to search for environmental resources. Thus, using heat maps to show the trace element content of water can depict a reservoir environment clearly, while the evolution of water quality can be observed over time.

Large-scale computing models have been proposed to improve computing speed and visualization efficiency. Cloud computing is very powerful in handling massive amounts of data, and offers efficiency and convenience advantages in large-scale data storage, transmission and preprocessing (Armbrust et al. 2010). Apache Hadoop, an open-source implementation of Google's cloud computing solution, is a software framework for distributed storage and processing of very large data sets on computer clusters (Agarwal 2016). The Hadoop Distributed File System (HDFS) and MapReduce are two important parts of Hadoop. HDFS stores files across the different nodes of a Hadoop cluster; if a node fails, HDFS can guarantee data recovery, ensuring that Hadoop continues to run normally (Shvachko et al. 2010; Ghazi & Gangodkar 2015). MapReduce is a programming model for processing the data. Related studies are reported by Patel et al. (2012).

HBase is a distributed database (Apache HBase 2017) that provides an ability similar to Bigtable on top of Hadoop. HBase is a scalable, distributed big-data database for Hadoop that can use MapReduce to handle massive amounts of data (Carstoiu et al. 2010; Vora 2011). Among its many advantages, it combines well with the Hadoop platform because the data blocks it stores are distributed automatically by HDFS (Vora 2011). HBase focuses on low-latency interactive reading and writing of single data points. It also scales horizontally easily, and its natural combination with MapReduce greatly facilitates secondary development by programmers (Dittrich & Quiané-Ruiz 2012). Facebook uses Hadoop and HBase to solve its data computing and storage issues (Borthakur et al. 2011; Aiyer et al. 2012; Harter et al. 2014). Using Hadoop and HBase to store and process water quality data is thus of great help in improving water quality monitoring efficiency.

As shown in Figure 1, Hadoop is a framework comprising multiple subprojects. The core comprises HDFS, MapReduce and YARN, the latter being a resource management system responsible for the unified management and scheduling of cluster resources (Vavilapali et al. 2013); Zookeeper is a distributed coordination service for data management problems in distributed environments, such as unified naming and configuration synchronization (Hunt et al. 2010). In this study HDFS, YARN, MapReduce, HBase and Zookeeper were used to store and process water quality data.

Figure 1

Hadoop Stack.


In this study, heat maps from Google Maps were used to visualize water quality information, while Hadoop and HBase were used to process and store the water quality data. It was shown that this combination improved user experience in terms of visual and data processing speed.

METHODS

System architecture

Figure 2 shows the platform design. The system architecture is divided into three layers, the first representing the basic environment in which the system runs, the second processing the data, and the top layer displaying it.

Figure 2

Framework.


First layer

In the first layer, the Linux operating system, Hadoop, HBase, Zookeeper and other software are installed. With those in place, the network should be configured, including static IP addresses and firewall rules, and the configuration files edited as needed.

Second layer

In the second layer, the data are processed into the correct format for Google Maps API. The data must be imported into the database before processing because they are stored in text files.

First, the HBase data table is set up, then the data files are read and the data inserted into the HBase database. HBase stores data in tabular form, i.e., with rows and columns, and the columns are divided into column families. Each column in the table belongs to a column family, so the data table can be designed as a number of column families, each containing several columns. For example, the location information comprises latitude, longitude and depth, so the columns holding these data are put into one column family. After the table has been created, the data files are read line by line with Java and each column's data inserted into the table.

Second, the data are filtered on conditions such as water depth level and trace element concentration. The time, water depth level and trace elements selected by users on the web page become the filter criteria, applied as the data are read through the HBase interfaces.

Finally, using Google Maps API, the filtered data are formatted through the MapReduce interfaces.

Third layer

In the third layer, the map is obtained from the Google Maps server and the relevant parameters, such as zoom level and map style, are set on the map in the web page using the Google Maps API. The processed data are then taken from the second layer and visualized using the Google Maps API.

First, Google Maps should appear on the web page in relation to the geographic coordinates of the center of the observation area, using JavaScript and Google Maps API. Then the map zoom level should be adjusted to suit the size of the observation area and the page layout. After that, formatted data are obtained by Ajax asynchronous request in the web page, and, finally, data visualization is achieved using Google Maps API.

Experimental description

To illustrate the high computing speed achievable with Hadoop, two development methods were used to compare data processing efficiency.

Experimental environment

1. Experimental Environment with Hadoop

Pseudo-distributed mode was used, running on a single node. Table 1 shows the software versions and operating environment.

2. Experimental Environment without Hadoop

Table 2 shows the software versions and operating environment without Hadoop.

Table 1

Software versions and operating environment with Hadoop

Name Version 
Operating system CentOS 6.4 64-bit 
Hadoop Hadoop 2.2.0 
HBase HBase 1.2.3 
Zookeeper Zookeeper 3.4.5 
Jdk Jdk 7.79 linux 64-bit 
Tomcat Tomcat 7.0 
Table 2

Software versions and operating environment of comparative experiment

Name Version 
Operating system CentOS 6.4 64-bit 
MySQL mysql-5.5.55-linux2.6-x86_64 
Jdk Jdk 7.79 linux 64-bit 
Tomcat Tomcat 7.0 

Data and database description

The experimental data come from water quality monitoring of the Zhuyi River, China, and were stored in text files, each file containing one day's data and each line 12 columns. The first three columns hold the latitude, longitude and water depth level, while the other nine hold the trace element concentrations. Data tables had to be created in both databases.

On HBase, the data table was designed as four column families, comprising respectively position information, water depth level, element concentrations and time information. The table creation statement is as follows:  
formula
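The creation statement itself is not reproduced above. As a sketch, an equivalent HBase shell statement with the four column families described (position information, water depth level, element concentrations, time information) might look like this; the table and column-family names are illustrative, not taken from the original code:

```
create 'water_quality', 'position', 'depth', 'elements', 'time'
```

In the HBase shell, `create` takes the table name followed by one argument per column family; individual columns (e.g. position:latitude) are then addressed at write time without further schema changes.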

On MySQL, the table was divided into 13 columns, the first 12 for water quality data, and the last for the monitoring time.
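A sketch of an equivalent MySQL definition follows; the paper specifies only the column count and roles, so the table and column names here are illustrative:

```sql
-- 12 data columns (latitude, longitude, water depth level, nine trace
-- element concentrations) plus the monitoring time. Names are illustrative.
CREATE TABLE water_quality (
  latitude     DOUBLE,
  longitude    DOUBLE,
  depth        DOUBLE,
  elem1 DOUBLE, elem2 DOUBLE, elem3 DOUBLE,
  elem4 DOUBLE, elem5 DOUBLE, elem6 DOUBLE,
  elem7 DOUBLE, elem8 DOUBLE, elem9 DOUBLE,
  monitor_time DATETIME
);
```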

Data acquisition and processing

The format defined by Google Maps API is shown below:  
formula
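The exact format string is not reproduced above. As a sketch, the heat map layer consumes weighted points, which the server can deliver as plain latitude/longitude/weight objects; the field and element names here are illustrative, not from the original listing:

```javascript
// Convert filtered water-quality rows into weighted points for the heat map.
// Field names (lat, lng) and the element key ('fe') are illustrative.
function toHeatmapPoints(rows, element) {
  return rows.map(function (row) {
    return { lat: row.lat, lng: row.lng, weight: row[element] };
  });
}

var sample = [
  { lat: 36.10, lng: 114.50, fe: 0.8 },
  { lat: 36.20, lng: 114.60, fe: 0.3 }
];
var points = toHeatmapPoints(sample, 'fe');
console.log(JSON.stringify(points));
```

On the client, each object is then wrapped as {location: new google.maps.LatLng(lat, lng), weight: weight} before being handed to the heat map layer.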
In order to format the data, the latitude, longitude and element concentrations are needed. The methods are:
1. Data Acquisition and Processing with Hadoop

On Hadoop, the data are obtained and processed using the MapReduce programming framework, e.g., as follows:  
formula
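The MapReduce listing itself is not reproduced above. A minimal stand-alone sketch of the Map-side logic, run here outside Hadoop, is shown below; the column order follows the data description, and the depth filter value is illustrative:

```javascript
// Map-side logic sketch: parse one input line (lat, lng, depth, then trace
// element concentrations), keep it if it matches the requested water depth
// level, and emit a formatted point. In the real job this body would sit
// inside a Mapper's map() method, with Hadoop splitting the input by block.
function mapLine(line, depthWanted, elementIndex) {
  var cols = line.trim().split(/\s+/).map(Number);
  if (cols[2] !== depthWanted) return null;  // depth filter
  return { lat: cols[0], lng: cols[1], weight: cols[elementIndex] };
}

var lines = [
  '36.10 114.50 2 0.8 0.1',
  '36.20 114.60 5 0.3 0.2'
];
var emitted = lines.map(function (l) { return mapLine(l, 2, 3); })
                   .filter(function (p) { return p !== null; });
console.log(emitted.length);  // 1
```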
2. Data Acquisition and Processing without Hadoop

The main processing code in the comparative experiment is:  
formula

Visualization

The map is displayed in the page by the Google Maps JavaScript API using:  
formula
The data formatting is:  
formula
The heat map is shown by:  
formula
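The listings are not reproduced above. A browser-side sketch of the three steps follows, assuming the Maps JavaScript API and its visualization library are loaded via a script tag and that `points` holds the formatted lat/lng/weight objects fetched by Ajax; the element id and centre coordinates are illustrative:

```javascript
// 1. Show the map, centred on the observation area, with a suitable zoom.
var map = new google.maps.Map(document.getElementById('map'), {
  center: { lat: 36.1, lng: 114.5 },  // illustrative centre coordinates
  zoom: 13
});

// 2. Wrap each formatted data point for the heat map layer.
var heatmapData = points.map(function (p) {
  return { location: new google.maps.LatLng(p.lat, p.lng), weight: p.weight };
});

// 3. Display the heat map.
var heatmap = new google.maps.visualization.HeatmapLayer({ data: heatmapData });
heatmap.setMap(map);
```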

RESULTS AND DISCUSSION

Data processing results

In total, 180,000 data items were run in each environment, and the running times were recorded at a series of data-item counts. Table 3 shows the running times and time differences of the two methods. The time difference and proportional difference were obtained using Equations (1) and (2):  
C = A − B
(1)
 
D = C / A
(2)
where ‘C’ represents the time difference, ‘A’ the running time without Hadoop, ‘B’ the running time with Hadoop, and ‘D’ the proportional difference between them.
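As a quick sanity check, Equations (1) and (2) can be verified against the first row of Table 3 (A = 6,007 ms, B = 16,276 ms):

```javascript
// Time difference and proportional difference for the 1,000-item row.
var A = 6007, B = 16276;
var C = A - B;   // Equation (1): time difference
var D = C / A;   // Equation (2): proportional difference
console.log(C, D.toFixed(5));  // -10269 '-1.70951', matching Table 3
```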
Table 3

Data processing running times

Data items A (ms) B (ms) C (ms) D 
1,000 6,007 16,276 −10,269 −1.70951 
5,000 17,270 32,273 −15,003 −0.86873 
10,000 51,088 68,676 −17,588 −0.34427 
30,000 326,680 379,905 −53,225 −0.16293 
50,000 911,891 946,476 −34,585 −0.03793 
60,000 1,369,878 1,327,547 42,331 0.030901 
80,000 2,628,492 2,300,094 328,398 0.124938 
100,000 4,424,418 3,610,955 813,463 0.183858 
120,000 7,033,537 5,295,657 1,737,880 0.247085 
140,000 10,772,433 7,770,032 3,002,401 0.278712 
160,000 15,067,339 10,962,455 4,104,884 0.272436 
180,000 20,907,527 15,080,526 5,827,001 0.278704 

Figure 3 shows the running time comparison. As the amount of data increases, so does the processing time for both methods. However, the data processing time-growth rate with Hadoop is less than that without Hadoop, once the number of data items exceeds 60,000.

Figure 3

Running time comparison.


Figure 4 shows the data processing time difference between the two methods. It can be seen, again, that when there are fewer than 60,000 data items, the processing time without Hadoop is less than with Hadoop, and that the absolute time difference increases until the amount of data reaches about 40,000 items. Beyond that, the gap narrows gradually, and at 60,000 data items processing with Hadoop becomes faster than without.

Figure 4

Time difference.


This is because Hadoop divides each data file into multiple data blocks, each with a default size of 64 MB. If there is a large number of small data files (say, 2 to 3 MB each), far below Hadoop's block size, each file is still treated as a data block. This has two consequences: (1) storing a large number of small files occupies storage, reducing storage efficiency and making retrieval slower than for large files; (2) small files consume computing power when MapReduce operations are performed, because Hadoop allocates Map tasks per block by default.

Experimental results show that, as the amount of data increases, the advantage of Hadoop gradually becomes obvious; when the amount of data reaches TB scale, data processing efficiency should improve greatly.

Visualization results

Figure 5 shows the final visualization results: a multi-colored layer covers the observation area. Areas where the water's trace element concentrations are highest are shown in brighter colors, e.g. red or yellow, while less bright coloring represents locations where the concentrations are lower.

Figure 5

Visualization Results.


As shown in Figure 5, the method is very intuitive, making determination of water quality status rapid, and the effect is very pleasing.

CONCLUSION

The Hadoop programming framework and HBase distributed storage were used to store and process water quality information. This combination improved program efficiency greatly, while the use of Google Maps for visualization brought great convenience for water quality monitoring. Combining the two technologies provides good support for establishing a water quality monitoring platform.

FUTURE WORK

The data were not cleaned for this study, which may have caused anomalies. In future, data will be cleaned before being imported into the database, to avoid such anomalies. Equally, the program ran on a single node, which does not fully reflect Hadoop's data processing advantages, so a cluster environment must be built to improve processing capacity.

ACKNOWLEDGEMENTS

We would like to thank all the members of this project. Many people helped during writing, while our supervisor provided expert guidance, direction and encouragement. The work was supported by the National Natural Science Foundation of China under grant No. 51509066, Hebei Province's Natural Science Foundation Project under grant No. F2015402077 and Hebei University of Engineering's Innovation Fund Project grant No. sj150054, Hebei Province's High School Science and Technology Research of Outstanding Youth Fund Project grant No. YQ2014014 and Hebei Province's Natural Science Youth Fund Project grant No. F2015402119.

AUTHOR CONTRIBUTIONS

For this paper, Weijian Huang, Yuanbin Han and Wei Do provided the idea; Xinfei Zhao and Yuanbin Han designed the experiments; Yao Cheng provided and analyzed the data; Xinfei Zhao performed the experiments.

REFERENCES

Agarwal, A. 2016 Apache Hadoop.

Aiyer, A., Bautin, M., Chen, G. J., Damania, P., Khemani, P., Muthukkaruppan, K., Ranganathan, K., Spiegelberg, N., Tang, L. & Vaidya, M. 2012 Storage infrastructure behind Facebook messages: using HBase at scale. IEEE Data Eng. Bull. 35(2), 4–13.

Akanbi, A. K. & Agunbiade, O. Y. 2013 Integration of a city GIS data with Google Map API and Google Earth API for a web based 3D Geospatial Application. Computer Science 31(5), 1450–1452.

Apache HBase 2017 Apache HBase Reference Guide. http://hbase.apache.org/book.html (accessed 12 June 2017).

Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R., Konwinski, A., Lee, G., Patterson, D., Rabkin, A., Stoica, I. & Zaharia, M. 2010 A view of cloud computing. Communications of the ACM 53(4), 50–58.

Atterer, R. & Lorenzi, P. 2008 A heatmap-based visualization for navigation within large web pages. In: Proceedings of the 5th Nordic Conference on Human-Computer Interaction: Building Bridges, 20–22 October 2008, ACM, Lund, Sweden, pp. 407–410.

Borthakur, D., Sarma, J. S., Gray, J., Muthukkaruppan, K., Spiegelberg, N., Kuang, H., Ranganathan, K., Molkov, D., Menon, A., Rash, S., Schmidt, R. & Aiyer, A. 2011 Apache Hadoop goes realtime at Facebook. In: Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data, ACM, 12 June 2011, Athens, Greece, pp. 1071–1080.

Carstoiu, D., Lepadatu, E. & Gaspar, M. 2010 HBase – non SQL database, performances evaluation. International Journal of Advancements in Computing Technology 2(5), 42–52.

Dittrich, J. & Quiané-Ruiz, J. A. 2012 Efficient big data processing in Hadoop MapReduce. Proceedings of the VLDB Endowment 5(12), 2014–2015.

Fisher, D. 2007 Hotmap: looking at geographic attention. IEEE Transactions on Visualization and Computer Graphics 13(6), 1184–1191.

Ghazi, M. R. & Gangodkar, D. 2015 Hadoop, MapReduce and HDFS: a developers perspective. Procedia Computer Science 48, 45–50.

Harter, T., Borthakur, D., Dong, S., Aiyer, A., Tang, L., Arpaci-Dusseau, A. C. & Arpaci-Dusseau, R. H. 2014 Analysis of HDFS under HBase: a Facebook messages case study. In: USENIX Conference on File and Storage Technologies, USENIX Association, 17–20 February 2014, Santa Clara, CA, USA, pp. 199–212.

Hu, S. & Dai, T. 2013 Online map application development using Google Maps API, SQL database, and ASP.NET. International Journal of Information and Communication Technology Research 3(3), 102–110.

Hunt, P., Konar, M., Junqueira, F. P. & Reed, B. 2010 ZooKeeper: wait-free coordination for internet-scale systems. In: USENIX Annual Technical Conference, 23–25 June 2010, Boston, MA, USA, p. 9.

Keim, D. A. 2002 Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics 8(1), 1–8.

Lifa, Y. 2013 Geographic data capturing technique based on Google Maps API V3. Remote Sensing Technology and Application 28(5), 791–798.

Luan, S. & Zhu, C. 2007 Research on Ajax applications in WebGIS. Science of Surveying and Mapping 32(5), 158–160.

Moumtzidou, A., Vrochidis, S., Chatzilari, E. & Kompatsaris, I. 2013 Discovery of environmental resources based on heatmap recognition. In: Image Processing (ICIP), 2013 20th IEEE International Conference on, IEEE, 15–18 September 2013, Melbourne, Australia, pp. 1486–1490.

Patel, A. B., Birla, M. & Nair, U. 2012 Addressing big data problem using Hadoop and Map Reduce. In: Engineering (NUiCONE), 2012 Nirma University International Conference on, IEEE, 6–8 December 2012, Ahmedabad, India, pp. 1–5.

Shvachko, K., Kuang, H., Radia, S. & Chansler, R. 2010 The Hadoop distributed file system. In: Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, IEEE, 3–7 May 2010, Nevada, USA, pp. 1–10.

Vavilapali, V. K., Murthy, A. C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O'Malley, O., Radia, S., Reed, B. & Baldeschwieler, E. 2013 Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing, ACM, 1–3 October 2013, Santa Clara, CA, USA, p. 5.

Vora, M. N. 2011 Hadoop-HBase for large-scale data. In: Computer Science and Network Technology (ICCSNT), 2011 International Conference on, IEEE, 24–26 December 2011, Harbin, China, pp. 601–605.