Vision-based analysis of waterbodies can provide important information required for monitoring, analyzing, and managing water resource systems, such as visual flood detection, delineation, and mapping. Water, however, is a challenging object for image processing, as it appears in many different forms and colors in nature. This makes the detection, classification, and tracking of water in images and videos difficult for computer vision models. Nevertheless, visual differences arising from water texture and the inherent optical properties of different waterbodies can be recognized and extracted to help computer models better analyze water images. This study utilizes a set of early, mid-level, and high-level vision techniques, including Gabor kernels, local binary patterns (LBPs), and deep learning (DL) models, to extract and analyze the texture and color of different waterbodies in digital images. For this purpose, ATLANTIS TeXture (ATeX), an image dataset for waterbody classification and texture analysis, was used. Models were trained for the task of classification on ATeX, and the performance of each model in extracting texture features was then evaluated and compared. Results show that the classification accuracies achieved by the Gabor magnitude tensor, LBP, and the DL model (ShuffleNet V2 × 1.0) are 29, 35, and 92%, respectively; thus, the DL model outperforms traditional vision-based techniques. Moreover, the classification results on raw images represented in different color spaces (e.g., RGB, HSV) emphasize the importance of color information for digital image processing of water. Analyzing representative visual features and properties of different water types and waterbodies can facilitate designing customized Convolutional Neural Networks (CNNs) for water scenes, as CNNs recognize objects through the analysis of both texture and shape cues and their relationship across the entire field of view.

  • Texture and color information play an important role in the analysis of water images by providing water feature properties.

  • A large-scale water dataset makes it possible for researchers to develop data-driven models for water resource problems.

  • Investigating computer vision techniques for water can provide useful information for developing vision-based models customized for water-related extreme events.

Digital images have great potential to become a new source of data due to recent advances in Artificial Intelligence (AI) and Computer Vision (CV) (LeCun et al. 2015; Schmidhuber 2015). Through calibration and estimation of camera geometry and parameters, digital images containing water scenes can replace or complement traditional water measurement methods (Eltner et al. 2018, 2021). This requires employing Machine Learning (ML) models to analyze and better understand the water components present in digital images. However, training ML models to detect water in digital images is still a challenging task. The main problem which makes image processing of aquatic (eco)systems difficult is the nature of water. Water has dynamic and inconsistent visual properties: it is inherently transparent, shapeless, and colorless. However, environmental factors such as light reflection, flow regime, and suspended sediments and particles can change the appearance of water in nature significantly (Figure 1).
Figure 1

Water is reflective: (a) shows water reflection in an urban setting, while (b) shows the reflection effect in a natural environment. Turbulence and turbidity are two other important factors: (c) shows rapids (turbulent water) in a river, (d) visually compares the clear and muddy appearance of water on two different sides of an earth dam, and (e) shows an estuary where a freshwater river or stream meets the ocean.


Different constituents of the water column and their inherent optical properties have direct effects on the light absorption and scattering of different water types. These particles influence the color of light that is detected by the camera sensor. For example, water rich in colored dissolved organic matter (CDOM) tends to look black because CDOM absorbs light strongly and reflects little light back toward the sensor. Moreover, tidal forces cause re-suspension of sediments and non-algal particles (NAPs), making the water appear light brown. External factors such as light, temperature, and nutrients drive phytoplankton productivity, which gives water a green appearance, likely due to chlorophyll concentrations (Palacios 2020). These constituents, which vary from one waterbody to another, have a direct effect on the texture and color of the water. Texture and color are both important within the digital image processing domain: texture appears to be a very strong cue for object classification, and color is a powerful descriptor that often simplifies object identification and extraction from a scene (Gonzalez 2009). In this study, we address challenges associated with the visual sensing of water, using vision-based models and techniques customized for texture and color analysis in digital images.

In computer science, texture analysis has been widely applied for different purposes, such as facial recognition and expression analysis (Liu et al. 2014). Facial expression analysis refers to developing models for automatically analyzing and recognizing facial motions and feature changes from visual information (Tian et al. 2005). These features can be extracted either by hand-designed filters (Zhang et al. 1998; Zhao & Pietikainen 2007) or by trained ML models (Meng et al. 2017). So far, many studies have attempted to improve feature extraction techniques to provide better facial recognition (Tong et al. 2007). Another application is texture mapping, in which a two-dimensional (2D) surface, called a texture map, is wrapped around a three-dimensional (3D) object or 2D mask (Park et al. 2019). Moreover, texture analysis can be used in the image super-resolution task (i.e., photo enhancement), which aims to recover a high-resolution image from a low-resolution one (Wang et al. 2018); a generative adversarial network (GAN) is normally used for recovering realistic texture.

Large-scale datasets provide the opportunity for researchers to develop and train ML-based models, for either temporal or spatial analysis, for real-world applications (Lin et al. 2014; Mottaghi et al. 2014; Zhou et al. 2019). For instance, the CAMELS dataset (Addor et al. 2017; Hao et al. 2021), containing hydrometeorological time series and catchment attributes for 671 watersheds in the U.S., led to the development and training of long short-term memory (LSTM) models to improve streamflow predictions for ungauged basins (Kratzert et al. 2018, 2019). Other publicly available datasets, such as multispectral Landsat analysis ready data (ARD) imagery, synthetic aperture radar (SAR) data, and LiDAR-derived coastal digital elevation models (DEMs), have been used together to develop Convolutional Neural Network (CNN) models for multi-class land cover classification (Feng et al. 2019; Xu et al. 2019; Muñoz et al. 2021).

Recently, the ATLANTIS and ATLANTIS TeXture (ATeX) datasets have been introduced for developing vision-based ML models exclusively in the field of water resources. ATLANTIS focuses on semantic segmentation of waterbody images (Erfani et al. 2022), while ATeX is specifically developed for texture analysis and classification tasks. ATeX offers a unique opportunity for studying the texture of water in different physical states. This dataset covers a wide range of image patches for different waterbodies, such as sea, lake, river, swamp, and snow, as shown in Figure 2. It includes 12,503 patches (32 × 32 pixels), which is large enough for training, validating, and testing different ML and DL models (Erfani & Goharian 2022).
Figure 2

Samples of ATeX patches.


This study aimed to investigate vision-based texture and color analysis techniques on water to find visual features which can distinguish water in different waterbodies. Texture information is relevant to configuring the architecture of CNN-based models, as the recognition strategy of CNNs proceeds from local to global features across the layers of the forward pass. In other words, objects are recognized through the analysis of texture- and shape-based cues – local and global representations – and their relationship in the entire field of view. In section 2, three different approaches, including two conventional methods (based on hand-crafted features) and a Deep Learning (DL) model, were built to extract texture features of water. The quality of the extracted features was then evaluated using K-Nearest Neighbors (KNNs) in section 3. The role of color and color space in the task of classification is discussed in section 4. Finally, in the last part of section 4, the role of ATeX in improving the performance of DL-based semantic segmentation models in water sciences is discussed.

Texture representations

The facial recognition and expression task is in principle similar to water detection and classification. In face recognition, texture representations and the spatial patterns of face components (eyes, nose, lips, etc.) provide valuable information for recognition. For water, however, spatial information provides no significant information, as pattern subelements (textons) (Julesz 1981) resulting from filter-based texture representations do not follow any specific spatial coordinates. Thus, in this case, the texture of the water is the only source of information for the accurate classification of waterbodies. In this section, Gabor filters and local binary pattern (LBP) descriptors are used to extract texture information. The quality of the resulting texture representations is then evaluated and compared using the KNN classification method in the following section.

Gabor filters

Gabor wavelets have been commonly used for the task of face recognition (Shen & Bai 2006; Vinay et al. 2015). The frequency and orientation representations of Gabor filters are claimed to be very similar to those of the human visual system (Olshausen & Field 1996), and they have been found to be particularly appropriate for texture representation and discrimination. In the spatial domain, a 2D Gabor filter is a Gaussian kernel function modulated by a sinusoidal plane wave. The Gabor wavelet representation facilitates recognition without correspondence (hence, no need for manual annotations), as it captures the local structure corresponding to spatial frequency (scale), spatial localization, and orientation selectivity. As a result, the Gabor wavelet representation is robust to illumination changes and subtle local variations (Liu & Wechsler 2002). The texture representations resulting from Gabor filters on ATeX waterbody patches are shown in Figure 3(a).

Gabor wavelets (kernels, filters) in this paper are defined based on Liu & Wechsler (2002) as follows:

\psi_{\mu,\nu}(z) = \frac{\|k_{\mu,\nu}\|^2}{\sigma^2} \exp\!\left(-\frac{\|k_{\mu,\nu}\|^2 \|z\|^2}{2\sigma^2}\right) \left[\exp(i\, k_{\mu,\nu} \cdot z) - \exp\!\left(-\frac{\sigma^2}{2}\right)\right] \quad (1)

where \mu and \nu define the orientation and scale of the Gabor kernels, z = (x, y), and \|\cdot\| denotes the norm operator. The wave vector k_{\mu,\nu} is defined as follows:

k_{\mu,\nu} = k_\nu e^{i\phi_\mu} \quad (2)

where k_\nu = k_{max}/f^{\nu} and \phi_\mu = \pi\mu/8. The maximum frequency is k_{max}, and f is the spacing factor between kernels in the frequency domain. Five different scales, \nu \in \{0, \ldots, 4\}, and eight orientations, \mu \in \{0, \ldots, 7\}, are applied with the following parameters: \sigma = 2\pi, k_{max} = \pi/2, and f = \sqrt{2}. The kernels exhibit desirable characteristics of spatial frequency, spatial locality, and orientation selectivity. Figure 3(b) shows different combinations of frequency and orientation of the filters applied in this study.
Figure 3

(a) The visual results of Gabor filters at four different scales and orientations on three waterbody patches. (b) The real part of the Gabor kernels at the five scales and eight orientations used in this study.

In this study, the spatial support of the kernels is truncated to decrease their size and alleviate computational complexity; because 2D Gabor filters are Gaussian-based, the values of a Gaussian function at a distance larger than 3σ from the mean are small enough to be ignored (Gonzalez 2009). The experiments have been run separately for the RGB, grayscale, and HSV color spaces of the ATeX patches to compare texture representations within each color space. The following steps describe the experimental procedure for grayscale images, as shown in Figure 4:
  • 1.

    First, ATeX patches are imported as gray values (N × 32 × 32), where N represents the number of patches.

  • 2.

    Multi-scale and multi-orientation Gabor filters are applied, and the corresponding Gabor magnitude responses are obtained. All responses are then concatenated to build an augmented feature tensor for each patch (N × 32 × 32 × 40).

  • 3.

    Each augmented feature tensor is downsized in space from 32 × 32 to 16 × 16 using the 'MaxPooling' operator, which results in a size of N × 16 × 16 × 40.

  • 4.

    Finally, all 10,240 features (resulting from 16 × 16 × 40) for each patch are reduced to the top 500 dimensions with the highest variance using Principal Component Analysis (PCA).

Figure 4

Augmented feature tensor pipeline for Gabor Kernels.


The same procedure is repeated for the RGB and HSV color spaces; only the first step differs, to represent the color channels of the images. In the HSV (RGB) experiments, input patches have an additional dimension, N × 32 × 32 × 3, which represents the Hue, Saturation, and Value (Red, Green, and Blue) channels of each patch. Accordingly, the Gabor filters are constructed to be dimensionally compatible, but the Gabor magnitude responses and the resulting augmented feature tensor have the same dimensions as described for the grayscale patches.
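The following is a minimal sketch of this pipeline for the grayscale case. The kernel construction follows Equations (1) and (2); the kernel size (11 × 11), the SciPy/scikit-image/scikit-learn helpers, and the function names are assumptions made for illustration, not the original implementation.

```python
# Gabor magnitude pipeline sketch (grayscale): 40 kernels -> magnitude tensor -> max-pool -> PCA.
import numpy as np
from scipy.signal import fftconvolve
from skimage.measure import block_reduce
from sklearn.decomposition import PCA

def gabor_kernel(mu, nu, sigma=2 * np.pi, k_max=np.pi / 2, f=np.sqrt(2), size=11):
    """Complex Gabor kernel at orientation mu (0..7) and scale nu (0..4), per Eqs (1)-(2)."""
    k = (k_max / f ** nu) * np.exp(1j * np.pi * mu / 8)          # wave vector k_{mu,nu}
    xs, ys = np.meshgrid(np.arange(size) - size // 2, np.arange(size) - size // 2)
    z2 = xs ** 2 + ys ** 2
    k2 = np.abs(k) ** 2
    envelope = (k2 / sigma ** 2) * np.exp(-k2 * z2 / (2 * sigma ** 2))
    carrier = np.exp(1j * (k.real * xs + k.imag * ys)) - np.exp(-sigma ** 2 / 2)
    return envelope * carrier

def gabor_features(patches, n_components=500):
    """patches: (N, 32, 32) grayscale array -> (N, n_components) PCA-reduced features."""
    kernels = [gabor_kernel(mu, nu) for nu in range(5) for mu in range(8)]   # 40 kernels
    responses = []
    for ker in kernels:
        # Gabor magnitude response for every patch (same spatial size as the input).
        mag = np.abs(np.stack([fftconvolve(p, ker, mode='same') for p in patches]))
        responses.append(mag)
    feats = np.stack(responses, axis=-1)                         # (N, 32, 32, 40)
    feats = block_reduce(feats, (1, 2, 2, 1), np.max)            # max-pool -> (N, 16, 16, 40)
    feats = feats.reshape(len(patches), -1)                      # (N, 10240)
    return PCA(n_components=n_components).fit_transform(feats)   # (N, 500)
```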

Local binary patterns

LBP is a type of visual descriptor used for classification in CV. The original LBP operator was introduced by Ojala et al. (1996). The operator labels the pixels of an image by comparing the 3 × 3 surrounding neighborhood of each pixel with the center value using a binary system: the corresponding position in the binary code is set to 1 if the value of the surrounding pixel is greater than or equal to that of the center pixel, and to 0 otherwise. Then, the histogram of the labels can be used as a texture descriptor. Figure 5(a) shows the basic LBP operator (Ahonen et al. 2004).

The LBP operator can be extended to include more neighbor pixels (Ojala et al. 2002). A circular neighborhood and bilinear interpolation of pixel values allow any radius and any number of sampling points. In this approach, we use the notation (P, R), which means P sampling points on a circle of radius R. Figure 5(b) shows different sampling points for different radii.

Another extension to the original LBP operator considers so-called ‘uniform’ patterns (Ojala et al. 2002). An LBP is called uniform if it contains at most two bitwise transitions from 0 to 1 or vice versa when the binary string is considered circular. For example, 00000000, 00011110 and 10000011 are ‘uniform’ patterns.
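As a small illustration of this definition, the helper below (an illustrative assumption, not code from the original study) counts the circular 0/1 transitions of an 8-bit LBP code:

```python
# An 8-bit LBP code is 'uniform' if its circular bit string has at most two 0/1 transitions.
def is_uniform(code, bits=8):
    pattern = [(code >> i) & 1 for i in range(bits)]
    transitions = sum(pattern[i] != pattern[(i + 1) % bits] for i in range(bits))
    return transitions <= 2

assert is_uniform(0b00000000) and is_uniform(0b00011110) and is_uniform(0b10000011)
assert not is_uniform(0b01010101)   # alternating bits -> 8 transitions -> non-uniform
```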
Figure 5

(a) The basic LBP operator (Ahonen et al. 2004) and (b) circularly symmetric neighbor sets for different (P, R) (Ojala et al. 2002).

When the surrounding pixels are all black or all white, the image region is flat. Groups of consecutive black or white pixels are considered 'uniform' patterns, which can be interpreted as corners or edges. If the pixels switch back and forth between black and white, the pattern is considered 'non-uniform'. The following notation is used for the LBP operator in this study: LBP_{P,R}^{u2}. The subscript represents using the operator in a (P, R) neighborhood, and the superscript u2 stands for using only uniform patterns and labeling all remaining patterns with a single label. Figure 6 shows an example of LBP_{8,1}^{u2} on a river delta patch from ATeX, where the flat, edge-like, and corner-like regions of the image are highlighted on the images and histograms.
Figure 6

Different patterns are highlighted on both image and histogram resulting from an LBP response.

The histogram of the labeled image f_l(x, y) is defined as

H_i = \sum_{x,y} I\{f_l(x, y) = i\}, \quad i = 0, \ldots, n-1 \quad (3)

where n is the total number of different labels produced by the LBP operator and

I\{A\} = \begin{cases} 1, & A \text{ is true} \\ 0, & A \text{ is false} \end{cases} \quad (4)

Different parameters, i.e., the number of sampling points P, the radius R, and the uniform or non-uniform pattern, result in different histograms, so it is important to adjust the parameters based on the problem. In this study, the sampling points, radius, and pattern are set to 8, 1, and 'uniform', i.e., LBP_{8,1}^{u2}, respectively. Another consideration is that histograms cannot preserve spatial information across the image. For the face recognition problem, it is therefore suggested to implement regional LBP for different parts of the face to preserve the spatial information (Ahonen et al. 2004). In the case of water, however, due to the irregularity of water features in spatial coordinates, regional LBP would not be effective. Moreover, several possible dissimilarity measures have been proposed for histograms. Log-likelihood, χ2, and Kullback–Leibler Divergence (KLD) are used in this study according to the following equations (a short code sketch follows the list):

  • Log-likelihood statistic:
    L(S, M) = -\sum_i S_i \log M_i \quad (5)
  • χ2 statistic:
    \chi^2(S, M) = \sum_i \frac{(S_i - M_i)^2}{S_i + M_i} \quad (6)
  • KLD:
    D_{KL}(P \,\|\, Q) = \sum_i P_i \log \frac{P_i}{Q_i} \quad (7)
    where P and Q are two probability distributions.
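The sketch below shows one way these quantities could be computed for ATeX patches; the use of scikit-image's LBP implementation and the small epsilon guarding against log(0) and division by zero are assumptions for illustration.

```python
# LBP_{8,1}^{u2} histogram and the histogram dissimilarity measures of Eqs (5)-(7).
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray_patch, P=8, R=1):
    codes = local_binary_pattern(gray_patch, P, R, method='uniform')   # labels 0..P+1
    hist, _ = np.histogram(codes, bins=np.arange(P + 3), density=True)
    return hist

def log_likelihood(S, M, eps=1e-10):
    return -np.sum(S * np.log(M + eps))                                # Eq. (5)

def chi_square(S, M, eps=1e-10):
    return np.sum((S - M) ** 2 / (S + M + eps))                        # Eq. (6)

def kld(P, Q, eps=1e-10):
    return np.sum(P * np.log((P + eps) / (Q + eps)))                   # Eq. (7)
```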

DL-based representations

Preliminary classification results with the low-level vision-based methods (section 2.1) showed the important role of texture information in improving the performance of KNNs in classifying different waterbodies. In this section, we investigate the performance of a DL-based model in feature extraction. DL models are capable of automatically learning patterns from raw data through multiple layers of processing (LeCun et al. 2015). This is because, in these models, each layer transforms the input representation into a higher-level representation, which lets the deeper layers learn the more important aspects of the raw data and discard irrelevant variations (Eltner et al. 2021).

We train ShuffleNet V2 × 1.0 (Ma et al. 2018), a DL-based classification model designed for mobile devices with very limited computing power, on the ATeX dataset. The PyTorch pre-trained ShuffleNet V2 is fine-tuned on 32 × 32 patches for 30 training epochs. We use the SGD optimizer with a momentum of 0.9 and a weight decay of 0.0001, and the learning rate and batch size are set to 1.0 × 10−2 and 64, respectively.
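A minimal fine-tuning sketch matching the settings above is given below (PyTorch/torchvision). The data-loading details and the number of output classes are assumptions for illustration, not the original training script.

```python
# Fine-tuning a pre-trained ShuffleNet V2 x1.0 on ATeX patches with the stated hyper-parameters.
import torch
import torch.nn as nn
from torchvision import models

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = models.shufflenet_v2_x1_0(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 15)    # replace the ImageNet head (15 ATeX classes assumed)
model = model.to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9, weight_decay=1e-4)

def train(train_loader, epochs=30):
    model.train()
    for epoch in range(epochs):
        for images, labels in train_loader:        # 32 x 32 RGB patches, batch size 64
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```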

ShuffleNet V2 is an efficient CNN architecture inspired by ShuffleNet (Zhang et al. 2018), a network architecture widely adopted in low-end devices such as mobile phones. In the ShuffleNet architecture, the 'bottleneck' building block (He et al. 2016) is modified by two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy. Pointwise group convolution is introduced to reduce the computational complexity of the 1 × 1 (bottleneck) convolutions, and the channel shuffle operation is provided to overcome the side effects brought about by group convolutions (Figure 7(a)).

In ShuffleNet V2, the ShuffleNet unit is modified and a simple operator called 'channel split' is introduced at the beginning of each unit (Figure 7(b)). The two 1 × 1 convolutions are no longer group-wise (unlike Zhang et al. 2018), and the same 'channel shuffle' operation as in Zhang et al. (2018) is used to enable information communication between the two branches.
Figure 7

(a) Channel shuffle with two stacked group convolutions. GConv stands for group convolution (Zhang et al. 2018). (b) ShuffleNet Units. (1) bottleneck unit (He et al. 2016); (2) ShuffleNet unit with depthwise convolution (DWConv) (Chollet 2017; Howard et al. 2017), pointwise group convolution (GConv) and channel shuffle (Zhang et al. 2018); (3) ShuffleNet V2 unit (Ma et al. 2018) using channel split operator.

Figure 8

(a) Neural network architecture of Autoencoder. (b) The loss function decay rate of the linear Autoencoder on ATeX training set over 300 epochs.

The feature layer of the pre-trained ShuffleNet V2 × 1.0 is used to extract features. The extracted features are plotted to provide a better understanding of the DL-based model's performance in feature representation. To do this, the features are first fed into t-Distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten & Hinton 2008; Van Der Maaten 2014) and a linear Autoencoder (AE) for dimensionality reduction. Both methods are particularly well-suited for the visualization of high-dimensional datasets. The final results are then plotted on 2D graphs (Figure 9).
Figure 9

Visualization of the reduced 2D representations resulting from t-SNE (left) and AE (right) using features extracted from ShuffleNet V2 × 1.0.


t-Distributed Stochastic Neighbor Embedding

t-SNE is a statistical method for visualizing high-dimensional data that is originally based on Stochastic Neighbor Embedding (Hinton & Roweis 2002); Van der Maaten & Hinton (2008) proposed the t-distributed variant. It is a 'nonlinear dimensionality reduction' technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. The t-SNE algorithm comprises two main stages. First, t-SNE constructs a conditional probability p_{j|i} under a Gaussian centered at x_i over pairs of high-dimensional objects to calculate the similarity of datapoint x_j to datapoint x_i:

p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)} \quad (8)

where similar objects are assigned a higher probability while dissimilar points are assigned a lower probability. The variance \sigma_i in this equation depends on the Gaussian and the number of objects surrounding x_i. A binary search is performed for the value of \sigma_i that produces a probability distribution P_i with a fixed perplexity specified by the user, using the following equation:

\mathrm{Perp}(P_i) = 2^{-\sum_j p_{j|i} \log_2 p_{j|i}} \quad (9)

Second, t-SNE defines a similar probability distribution q_{ij} over the points y_i in the low-dimensional map using a Student t-distribution as follows:

q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|y_k - y_l\|^2)^{-1}} \quad (10)

t-SNE minimizes the KLD between the two distributions with respect to the locations of the points in the map. While the original algorithm uses the Euclidean distance between objects as the base of its similarity metric, this can be changed as appropriate.

Perplexity, learning rate, and the number of iterations are set to 20, 200, and 1,000, respectively. The gradient calculation algorithm uses the 'Barnes-Hut' approximation, which is the fastest t-SNE implementation. The input (32 × 32 × 3) holds the raw pixel values of the image in RGB color space. Each image is passed through the feature layer of ShuffleNet V2 × 1.0, resulting in 1,024 extracted features. The extracted features are then fed into t-SNE for dimensionality reduction into a low-dimensional space of two dimensions.
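A minimal sketch of this step is shown below, assuming scikit-learn's t-SNE and the torchvision ShuffleNet V2 module layout; the feature-extraction helper is an illustrative assumption rather than the original code.

```python
# Extract 1,024-D ShuffleNet features and embed them in 2-D with Barnes-Hut t-SNE.
import torch
from sklearn.manifold import TSNE

def extract_features(model, loader, device='cpu'):
    """Pass ATeX patches through all ShuffleNet V2 layers except the classifier head."""
    feats, labels = [], []
    model.eval()
    with torch.no_grad():
        for images, targets in loader:
            x = model.conv1(images.to(device))
            x = model.maxpool(x)
            x = model.stage2(x); x = model.stage3(x); x = model.stage4(x)
            x = model.conv5(x)
            feats.append(x.mean([2, 3]).cpu())     # global average pool -> (batch, 1024)
            labels.append(targets)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

# features: (N, 1024) array returned by extract_features
# embedding = TSNE(n_components=2, perplexity=20, learning_rate=200,
#                  n_iter=1000, method='barnes_hut').fit_transform(features)
```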

Autoencoder

AEs are unsupervised learning techniques commonly used to learn efficient data encodings. Basically, autoencoders learn to map input data to output data while learning how to encode the data; the hidden (latent) code is a compressed representation of the input. In fact, the main objective of training an AE neural network here is dimensionality reduction. In other words, 'autoencoding' is a data compression algorithm where the compression and decompression functions are (i) data-specific, (ii) lossy, and (iii) learned automatically from examples rather than engineered by a human. The principles behind AEs are as follows:

  • 1.

    The encoder takes the input x and encodes it, h = f(x).

  • 2.

    Between the encoder and the decoder, there is an internal hidden layer, h, which holds the coding of the input defined by the encoder, h = f(x).

  • 3.

    The decoder function tries to reconstruct the input data from the hidden layer coding. Considering the decoder function as g, the reconstruction r can be defined as r = g(h), i.e., r = g(f(x)).

Figure 8(a) shows the architecture of the AE used in this study. The encoder unit takes the 1,024 features extracted by ShuffleNet V2 × 1.0 and includes three hidden layers consisting of 128, 64, and 12 nodes, respectively. Over the encoding path, the features decrease from 1,024 to 2D representations in the latent space. The same (mirrored) architecture is adopted for the decoder unit. Mean Squared Error (MSE) and 'Adam' (Kingma & Ba 2014) are used as the loss function and optimizer, respectively. The network is trained for 300 epochs. The initial learning rate is set to 1.00 × 10−3, and a multiplicative learning rate decay factor of 0.1 is applied every 90 epochs. The loss function decay on the ATeX training set is shown in Figure 8(b).
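A minimal PyTorch sketch of this linear AE is given below; the exact layer wiring (1,024 → 128 → 64 → 12 → 2 and its mirror) follows the description above, and the training loop details are assumptions for illustration.

```python
# Linear autoencoder used to compress 1,024-D ShuffleNet features to a 2-D latent space.
import torch
import torch.nn as nn

class LinearAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(1024, 128), nn.Linear(128, 64), nn.Linear(64, 12), nn.Linear(12, 2))
        self.decoder = nn.Sequential(
            nn.Linear(2, 12), nn.Linear(12, 64), nn.Linear(64, 128), nn.Linear(128, 1024))

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z), z

model = LinearAE()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=90, gamma=0.1)

def train(feature_loader, epochs=300):
    for epoch in range(epochs):
        for feats in feature_loader:               # batches of (B, 1024) ShuffleNet features
            recon, _ = model(feats)
            loss = criterion(recon, feats)
            optimizer.zero_grad(); loss.backward(); optimizer.step()
        scheduler.step()
```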

Figure 9 (left) shows the t-SNE results on the features extracted from ShuffleNet V2 × 1.0 on the ATeX training set. Figure 9 (right) shows the 2D representations resulting from feeding the ShuffleNet features into the encoder unit of the trained AE.

Features extracted by the different methods are fed into customized KNNs using different dissimilarity metrics (Euclidean [L2N], χ2, Log-likelihood [LLV], and KLD) to estimate the classification accuracy on the validation set. For raw images in different color spaces, and for texture features captured from the Gabor magnitude responses, L2N is used as the distance metric. In the case of LBP, the dissimilarity of the histograms of 'patterns' resulting from the LBP operation is compared using three different measures: χ2, LLV, and KLD. In the case of the DL model, and in order to achieve fair results, images of the validation set are first passed into the feature layer of ShuffleNet V2 × 1.0 pre-trained on the ATeX training set, and the extracted features are then fed into the KNNs.

Although KNN is considered an ML tool, it is a simple data-driven method that merely compares each validation patch with all training patches and reports a dissimilarity vector for each validation input. The final results therefore mainly reflect the performance of the feature extractors, and the classification method itself does not have any significant effect on the reported performance.
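The sketch below shows how such an evaluation could be set up with scikit-learn; the helper name and the use of a callable distance metric are assumptions for illustration.

```python
# KNN evaluation with either the default Euclidean (L2N) metric or a custom histogram measure.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def chi_square(s, m, eps=1e-10):
    return np.sum((s - m) ** 2 / (s + m + eps))          # Eq. (6), used for LBP histograms

def knn_accuracy(train_X, train_y, val_X, val_y, k=5, metric='euclidean'):
    knn = KNeighborsClassifier(n_neighbors=k, metric=metric)   # metric may be a callable
    knn.fit(train_X, train_y)
    return knn.score(val_X, val_y)

# e.g. knn_accuracy(lbp_train, y_train, lbp_val, y_val, k=70, metric=chi_square)
```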

The accuracy results are reported in Table 1. The results are categorized based on image color space, feature extraction methods, model parameters, and metrics for dissimilarity evaluation. The number of neighbors (K) for the KNNs ranges from 1 to 500. The results show the highest performance for the DL model features, ranging from 88 to 92% depending on the number of neighbors. After the DL model, raw images in HSV color space and the LBP method (independent of the dissimilarity measure) offer the best performance, with 50 and 35% accuracy, respectively. On the other hand, raw images in RGB color space provide the lowest performance, with 6% accuracy. For grayscale images, the Gabor magnitude responses reach 29% accuracy, while the highest performance for raw grayscale images does not exceed 24% accuracy.

Table 1

The experimental results on the ATeX validation set (classification accuracy, %)

            Raw image            Gabor wavelets        LBP
K           RGB   G     HSV      RGB   G     HSV       χ2    KLD   LLV    DL
–           –     20    50       –     23    18        24    24    24     92
–           –     20    48       –     25    20        26    27    27     92
–           –     20    49       –     27    21        28    29    29     92
–           –     21    49       –     28    22        31    31    30     92
15          –     19    47       –     28    22        33    32    33     91
50          –     21    44       –     29    22        35    34    34     91
70          11    20    43       10    29    22        35    35    35     90
100         11    21    43       10    27    22        34    34    33     90
200         11    22    40       10    27    23        32    33    32     89
300         11    22    38       10    27    23        32    32    32     89
500         11    24    36       –     27    23        30    30    30     88

Note: Raw images, Gabor responses, and features extracted from ShuffleNet are evaluated based on the L2N measurement (numbers are in percent).

It is worth mentioning that, considering the high computational complexity and processing time required by the convolution operation, the Gabor approach is much more expensive than the LBP operation.

Color significance

The KNN results on raw images in Table 1 show the significance of color and color space in distinguishing different waterbodies.

Figure 10 illustrates a vivid discrepancy among the descriptive statistics of the different labels. This figure is plotted using a sample of 50 pixels randomly selected from all RGB channels of each patch in the ATeX training set.
Figure 10

Violin plot of (randomly selected) samples of raw pixel intensity values.


According to Figure 10, the samples collected for each label have entirely different statistics (e.g., data distribution, minimum, maximum, median, and interquartile range). This indicates that color plays an important role and can provide additional information for each waterbody.
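A minimal sketch of the sampling behind Figure 10 is given below; the helper name, the seeding, and the use of NumPy/matplotlib are assumptions for illustration.

```python
# Draw 50 random pixel intensity values from the RGB channels of every patch of a class.
import numpy as np
import matplotlib.pyplot as plt

def sample_pixels(patches, n=50, seed=0):
    """patches: (N, 32, 32, 3) uint8 array -> (N * n,) random pixel intensities."""
    rng = np.random.default_rng(seed)
    flat = patches.reshape(len(patches), -1)                     # (N, 32*32*3)
    idx = rng.integers(0, flat.shape[1], size=(len(patches), n))
    return np.take_along_axis(flat, idx, axis=1).ravel()

# samples_per_class = {label: sample_pixels(patches_of(label)) for label in class_names}
# plt.violinplot(list(samples_per_class.values()), showmedians=True)
```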

We believe the reason the features extracted by ShuffleNet are more informative than those of the other methods in this study, and of other types of CNN-based models (Erfani & Goharian 2022), is that the ShuffleNet units can directly take advantage of the color information in each channel through the 'channel split' operator.

Color space significance

Color space information is also useful for within-class analysis, to perceive discrepancies existing in the sub-clusters of a specific label. For example, in Figure 11, the 2D representations of the 'glacier' class resulting from the dimensionality reduction models (t-SNE and linear AE) are fed into two different clustering methods (K-means and DBSCAN) to determine the sub-clusters existing in this class based on two different approaches (and metrics) and to compare the color space information of the sub-clusters with each other.
Figure 11

Clustering results (DBSCAN and K-means) on 2D representations of glacier images in the ATeX training set. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/hydro.2023.146.


In Figure 11, the first and second rows are, respectively, the results of DBSCAN and K-means on the 'glacier' 2D representations obtained from the t-SNE analysis. The third and fourth rows show the same analysis on the glacier class using the representations obtained from the linear AE analysis. Figure 11(a) plots the DBSCAN clusters on the 844 glacier images in the training set. The DBSCAN algorithm views clusters as areas of high density separated by areas of low density. Due to this rather generic view, clusters found by DBSCAN can be of any shape, as opposed to K-means, which assumes that clusters are convex shaped. The central component of DBSCAN is the concept of core samples, which are samples that lie in areas of high density (Pedregosa et al. 2011).

There are two hyper-parameters in DBSCAN, min_samples and eps, which are set to 5 and 0.3, respectively. Figure 11(b) shows the Silhouette scores for the various clusters. The Silhouette score is used because the ground truth labels are not known (all images belong to the glacier class); it is an evaluation performed using the model itself, where a higher Silhouette coefficient relates to a model with better-defined clusters. Figure 11(c) compares the statistics of the sub-clusters by (violin) plotting the corresponding 50 randomly sampled pixels described earlier, and Figure 11(d) compares the pixel intensity values of the samples existing in each sub-cluster. Three consistent color codes (gray, cyan, green) used in Figure 11 represent the same sub-cluster in different subplots. Moreover, in the DBSCAN method, the outliers are indicated by the 'Red' color in both the scatter and violin plots.
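A minimal scikit-learn sketch of this sub-cluster analysis is shown below; the number of K-means clusters (3, matching the three color-coded sub-clusters in the figure) and the exclusion of DBSCAN outliers from the Silhouette computation are assumptions for illustration.

```python
# Cluster the 2-D t-SNE/AE representations of one class and score the result with the Silhouette.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

def cluster_2d(embedding_2d):
    """embedding_2d: (N, 2) points of a single class (e.g., glacier)."""
    db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(embedding_2d)
    km_labels = KMeans(n_clusters=3, random_state=0).fit_predict(embedding_2d)
    core = db_labels != -1                                  # label -1 marks DBSCAN outliers
    db_sil = silhouette_score(embedding_2d[core], db_labels[core])
    km_sil = silhouette_score(embedding_2d, km_labels)
    return db_labels, km_labels, db_sil, km_sil
```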

The right column (Figure 11(d), 11(h), 11(l) and 11(p)) shows the pixel intensity values (randomly sampled from all three RGB channels) existing in each sub-cluster. As is evident, there is a noticeable difference between the pixel intensity values of the sub-clusters for both the t-SNE and AE 2D representations. This difference is more obvious in the sub-clusters determined by the DBSCAN method (Figure 11(d) and 11(l)).

In order to better investigate the role of color space in the sub-clusters, Figure 11(d) is re-evaluated using gray values and the individual color codes of the RGB and HSV color spaces. Figure 12 therefore shows the pixel intensity values based on seven individual color channels (Gray, Red, Green, Blue, Hue, Saturation, and Value). Accordingly, the Hue color code (Figure 12(f)) provides the most distinctive information for each sub-cluster, and the Red color code (Figure 12(b)) provides better discriminative information compared to the other channels. Gray (Figure 12(a)), Green (Figure 12(c)), and Value (Figure 12(h)) provide almost the same information for each sub-cluster. Although Blue (Figure 12(d)) is, supposedly and intuitively, the most dominant color in waterbodies, it shows the least distinctive information for each sub-cluster.
Figure 12

Investigation of the information provided by different color codes for the glacier sub-clusters resulting from DBSCAN. Please refer to the online version of this paper to see this figure in color: http://dx.doi.org/10.2166/hydro.2023.146.


ATeX for transfer learning

In this section, we pre-trained the backbone of a CNN-based model for semantic segmentation of waterbodies on ATeX to investigate the performance of ATeX for transfer learning. For this purpose, we pre-trained ResNet-101 (He et al. 2016) on ATeX as the backbone of PSPNet (Zhao et al. 2017), a semantic segmentation network. Then, we trained PSPNet on ATLANTIS, a benchmark for semantic segmentation of waterbody images (Erfani et al. 2022). The results of this experiment are compared with those of PSPNet using pre-trained ResNet-101 on ImageNet and reported in Table 2.

Table 2

The per-category results on the ATLANTIS test set by PSPNet-ATeX and PSPNet-ImageNet (values in percent)

Label        PSPNet-ATeX    PSPNet-ImageNet
canal        54             56
ditch        23             29
fjord        43             48
flood        40             35
glaciers     51             52
spring       58             54
lake         27             31
puddle       55             55
rapids       47             41
reservoir    26             26
river        31             29
delta        64             54
sea          61             61
snow         52             49
pool         47             45
waterfall    54             58
wetland      52             48
A-mIoU       46             45
A-acc        61             64
mIoU         41             41
acc          73             73

Note: The best results of PSPNet-ATeX are highlighted with bold font.

The PSPNet is implemented using PyTorch. During training, the base learning rate is set to 2.5 × 10−4 and is decayed following the poly policy (Zhao et al. 2017). The network is optimized using SGD with a momentum of 0.9 and a weight decay of 0.0001. In total, we train the network for 30 epochs, around 80K iterations, with a batch size of 2. The training data are augmented with random horizontal flipping, random scaling ranging from 0.5 to 2.0, and random cropping with a size of 640 × 640. The network is trained in a fully supervised fashion constrained by the cross-entropy loss function on both the final prediction P (main loss) and the intermediate feature produced by the fourth block of the ResNet-101 (auxiliary loss). Following Zhao et al. (2017), the weights of the main and the auxiliary loss are set to 1 and 0.4, respectively.
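A minimal sketch of the 'poly' learning-rate policy and the weighted main/auxiliary loss described above is given below; the power value of 0.9 is the common choice in Zhao et al. (2017) and is an assumption here, as are the function names.

```python
# Poly learning-rate decay and the PSPNet main + 0.4 * auxiliary cross-entropy loss.
import torch.nn.functional as F

BASE_LR, MAX_ITER, POWER = 2.5e-4, 80_000, 0.9

def poly_lr(optimizer, cur_iter):
    lr = BASE_LR * (1 - cur_iter / MAX_ITER) ** POWER
    for group in optimizer.param_groups:
        group['lr'] = lr
    return lr

def pspnet_loss(main_logits, aux_logits, target):
    return F.cross_entropy(main_logits, target) + 0.4 * F.cross_entropy(aux_logits, target)
```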

For performance evaluation, we take the mean of the class-wise intersection over union (mIoU) and the per-pixel accuracy (acc) as the main evaluation metrics. To further evaluate the performance on the waterbodies, we calculate the mean IoU and the accuracy just for the aquatic categories – A-mIoU and A-acc, respectively. The aquatic categories include 17 labels covering water content in different forms and bodies, e.g., sea, river, lake, etc. In Table 2, PSPNet equipped with a ResNet backbone pre-trained on ATeX is called PSPNet-ATeX, while PSPNet using a ResNet backbone pre-trained on ImageNet is called PSPNet-ImageNet. According to the results, and considering mIoU as the more informative metric for semantic segmentation, PSPNet-ATeX provided better performance on 10 out of the 17 aquatic labels. PSPNet-ATeX achieved 40.40% IoU on flood, the most important label in ATLANTIS, 5.0% more than what was achieved by PSPNet-ImageNet. Overall, PSPNet-ATeX outperformed PSPNet-ImageNet over the aquatic labels by 0.81%. Considering accuracy, however, PSPNet-ImageNet performed better, achieving 63.74% accuracy over the aquatic labels, 2.49% better than PSPNet-ATeX.
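The sketch below shows one plausible way to compute these metrics from a confusion matrix; the exact definition of A-acc (accuracy over ground-truth pixels of the aquatic classes) is our reading of the text and should be treated as an assumption.

```python
# Per-class IoU, mIoU, and per-pixel accuracy from a confusion matrix (rows = ground truth).
import numpy as np

def confusion_matrix(pred, gt, num_classes):
    mask = (gt >= 0) & (gt < num_classes)
    return np.bincount(num_classes * gt[mask] + pred[mask],
                       minlength=num_classes ** 2).reshape(num_classes, num_classes)

def metrics(conf, class_idx=None):
    tp = np.diag(conf)
    iou = tp / (conf.sum(0) + conf.sum(1) - tp + 1e-10)          # per-class IoU
    if class_idx is None:
        return iou.mean(), tp.sum() / (conf.sum() + 1e-10)        # mIoU, acc
    sel = np.asarray(class_idx)                                   # e.g., the 17 aquatic labels
    return iou[sel].mean(), tp[sel].sum() / (conf[sel].sum() + 1e-10)   # A-mIoU, A-acc
```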

This study investigated vision-based texture and color analysis techniques on water to find the textural properties of water in different aquatic (eco)systems. The ATeX dataset, a benchmark for texture analysis of water in different waterbodies, made it possible to implement and apply different CV techniques to water. Three different methods, Gabor kernels, LBP, and a DL-based model, were applied in this study. The LBP results showed that water texture analysis is a more difficult problem than the task of facial recognition in CV. The classification results showed that the high-level image analysis method (i.e., the DL-based model) performed much better than early-vision techniques for water feature extraction. The KNN results on raw images emphasized the important role of color and color space for water: using the HSV color space increased the classification accuracy to 50%. Among early-vision techniques, LBP offers better results, depending on parameters which should be adjusted based on the problem. Finally, and beyond our preliminary development of datasets and models, this research aims to call the community to start building and sharing new robust water and hazard detection models using advances in image processing.

The results of this study give DL model developers insight into designing customized CNN-based models for detecting and segmenting water in surveillance imagery. These models can then be deployed during flash or nuisance flooding in urban areas for real-time monitoring of flood events. Surveillance imagery networks can provide the spatial dynamics of the surface water extent in the monitored region. Such a vision-based framework is capable of providing disaster prevention agencies with actual field information, such as water stage and discharge for over-bank flow states.

The ATeX dataset, models, and codes developed and used in this study are publicly available in the GitHub repository (https://github.com/smhassanerfani/atex).

The authors declare there is no conflict.

Addor N., Newman A. J., Mizukami N. & Clark M. P. 2017 Catchment Attributes for Large-Sample Studies. UCAR/NCAR, Boulder, CO.
Ahonen T., Hadid A. & Pietikäinen M. 2004 Face recognition with local binary patterns. In: Eur. Conf. Comput. Vis. Springer, pp. 469–481.
Chollet F. 2017 Xception: deep learning with depthwise separable convolutions. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 1251–1258.
Eltner A., Elias M., Sardemann H. & Spieler D. 2018 Automatic image-based water stage measurement for long-term observations in ungauged catchments. Water Resources Research 54 (12), 10,362.
Eltner A., Bressan P. O., Akiyama T., Gonçalves W. N. & Junior J. M. 2021 Using deep learning for automatic water stage measurements. Water Resources Research 57 (3), e2020WR027608.
Erfani S. M. H. & Goharian E. 2022 ATeX: a benchmark for image classification of water in different waterbodies using deep learning approaches. Journal of Water Resources Planning and Management 148 (11), 04022063.
Erfani S. M. H., Wu Z., Wu X., Wang S. & Goharian E. 2022 ATLANTIS: a benchmark for semantic segmentation of waterbody images. Environmental Modelling & Software 149, 105333.
Gonzalez R. C. 2009 Digital Image Processing. Pearson Education India, Sholinganallur, Chennai.
Hao Z., Jin J., Xia R., Tian S., Yang W., Liu Q., Zhu M., Ma T. & Jing C. 2021 Catchment attributes and meteorology for large sample study in contiguous China. Earth System Science Data Discussions 2021, 1–37.
He K., Zhang X., Ren S. & Sun J. 2016 Deep residual learning for image recognition. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 770–778.
Hinton G. & Roweis S. T. 2002 Stochastic neighbor embedding. In: NIPS (Jordan M. I., LeCun Y. & Solla S. A., eds), Vol. 15. The MIT Press, Cambridge, MA, pp. 833–840.
Howard A. G., Zhu M., Chen B., Kalenichenko D., Wang W., Weyand T., Andreetto M. & Adam H. 2017 MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
Kingma D. P. & Ba J. 2014 Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kratzert F., Klotz D., Brenner C., Schulz K. & Herrnegger M. 2018 Rainfall–runoff modelling using long short-term memory (LSTM) networks. Hydrology and Earth System Sciences 22 (11), 6005–6022.
Kratzert F., Klotz D., Herrnegger M., Sampson A. K., Hochreiter S. & Nearing G. S. 2019 Toward improved predictions in ungauged basins: exploiting the power of machine learning. Water Resources Research 55 (12), 11344–11354.
LeCun Y., Bengio Y. & Hinton G. 2015 Deep learning. Nature 521 (7553), 436–444.
Lin T.-Y., Maire M., Belongie S., Hays J., Perona P., Ramanan D., Dollár P. & Lawrence Zitnick C. 2014 Microsoft COCO: common objects in context. In: Eur. Conf. Comput. Vis. Springer, pp. 740–755.
Liu C. & Wechsler H. 2002 Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition. IEEE Transactions on Image Processing 11 (4), 467–476.
Liu P., Han S., Meng Z. & Tong Y. 2014 Facial expression recognition via a boosted deep belief network. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 1805–1812.
Ma N., Zhang X., Zheng H.-T. & Sun J. 2018 ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Eur. Conf. Comput. Vis., pp. 116–131.
Meng Z., Liu P., Cai J., Han S. & Tong Y. 2017 Identity-aware convolutional neural network for facial expression recognition. In: IEEE Int. Conf. Auto. Face Gest. Recog. IEEE, pp. 558–565.
Mottaghi R., Chen X., Liu X., Cho N.-G., Lee S.-W., Fidler S., Urtasun R. & Yuille A. 2014 The role of context for object detection and semantic segmentation in the wild. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 891–898.
Muñoz D. F., Muñoz P., Moftakhari H. & Moradkhani H. 2021 From local to regional compound flood mapping with deep learning and data fusion techniques. Science of the Total Environment 782, 146927.
Ojala T., Pietikäinen M. & Harwood D. 1996 A comparative study of texture measures with classification based on featured distributions. Pattern Recognition 29 (1), 51–59.
Ojala T., Pietikainen M. & Maenpaa T. 2002 Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24 (7), 971–987.
Palacios S. L. 2020 Fundamentals of Aquatic Remote Sensing. NASA Applied Remote Sensing Training Program (ARSET), Washington, DC.
Park T., Liu M.-Y., Wang T.-C. & Zhu J.-Y. 2019 Semantic image synthesis with spatially-adaptive normalization. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 2337–2346.
Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., Vanderplas J., Passos A., Cournapeau D., Brucher M., Perrot M. & Duchesnay E. 2011 Scikit-learn: machine learning in Python. J. Mach. Learning Research 12, 2825–2830.
Shen L. & Bai L. 2006 A review on Gabor wavelets for face recognition. Pattern Analysis and Applications 9 (2), 273–292.
Tian Y.-L., Kanade T. & Cohn J. F. 2005 Facial expression analysis. In: Handbook of Face Recognition. Springer-Verlag, New York, pp. 247–275. https://doi.org/10.1007/b138828.
Tong Y., Liao W. & Ji Q. 2007 Facial action unit recognition by exploiting their dynamic and semantic relationships. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (10), 1683–1699.
Van Der Maaten L. 2014 Accelerating t-SNE using tree-based algorithms. J. Mach. Learning Research 15 (1), 3221–3245.
Van der Maaten L. & Hinton G. 2008 Visualizing data using t-SNE. J. Mach. Learning Research 9 (11), 2579–2605.
Vinay A., Shekhar V. S., Balasubramanya Murthy K. N. & Natarajan S. 2015 Face recognition using Gabor wavelet features with PCA and KPCA – a comparative study. Procedia Computer Science 57, 650–659.
Wang X., Yu K., Dong C. & Loy C. C. 2018 Recovering realistic texture in image super-resolution by deep spatial feature transform. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 606–615.
Xu Y., Du B., Zhang L., Cerra D., Pato M., Carmona E., Prasad S., Yokoya N., Hänsch R. & Le Saux B. 2019 Advanced multi-sensor optical remote sensing for urban land use and land cover classification: outcome of the 2018 IEEE GRSS data fusion contest. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12 (6), 1709–1724.
Zhang Z., Lyons M., Schuster M. & Akamatsu S. 1998 Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron. In: IEEE Int. Conf. Auto. Face Gest. Recog. IEEE, pp. 454–459.
Zhang X., Zhou X., Lin M. & Sun J. 2018 ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 6848–6856.
Zhao G. & Pietikainen M. 2007 Dynamic texture recognition using local binary patterns with an application to facial expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (6), 915–928.
Zhao H., Shi J., Qi X., Wang X. & Jia J. 2017 Pyramid scene parsing network. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 2881–2890.
Zhou B., Zhao H., Puig X., Xiao T., Fidler S., Barriuso A. & Torralba A. 2019 Semantic understanding of scenes through the ADE20K dataset. International Journal of Computer Vision 127 (3), 302–321.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Licence (CC BY 4.0), which permits copying, adaptation and redistribution, provided the original work is properly cited (http://creativecommons.org/licenses/by/4.0/).