SMOTE-Boost-based sparse Bayesian model for flood prediction

With the rapid development of big data analysis and cloud-fog-edge computing, human-centered computing (HCC) has become a hot research topic worldwide. Essentially, HCC is a cross-disciplinary research domain whose core idea is to build efficient interaction among persons, cyber space, and the real world. Inspired by the progress HCC has brought to big data analysis, we intend to apply its core ideas and technologies to one of the most important real-world problems, i.e., flood prediction. To minimize the negative impacts of floods, researchers pay special attention to improving the accuracy of flood forecasting with a variety of technologies, including HCC. However, historical flood data is essentially imbalanced. Imbalanced data biases machine learning classifiers towards patterns with majority samples, resulting in poor classification of patterns with minority samples. In this paper, we propose a novel Synthetic Minority Over-sampling Technique (SMOTE)-Boost-based sparse Bayesian model to perform flood prediction with both high accuracy and robustness. The proposed model consists of three modules, namely, SMOTE-based data enhancement, AdaBoost training strategy, and sparse Bayesian model construction. In SMOTE-based data enhancement, we adopt a SMOTE algorithm to effectively cover diverse data modes and generate more samples for prediction patterns with minority samples, which greatly alleviates the problem of imbalanced data by involving experts’ analysis and users’ intentions. For the AdaBoost training strategy, we propose a specifically designed AdaBoost training strategy for the sparse Bayesian model, which not only adaptively and incrementally increases the prediction ability of the Bayesian model, but also prevents over-fitting. Essentially, the design of the AdaBoost strategy helps keep a balance between prediction ability and model complexity, which offers different but effective models over diverse rivers and users.
Finally, we construct a sparse Bayesian model based on AdaBoost training strategy, which could offer flood prediction results with high rationality and robustness. We demonstrate the accuracy and effectiveness of the proposed model for flood prediction by conducting experiments on a collected dataset with several comparative methods.


Introduction
Human-centered computing (HCC) is key to interaction and collaboration among persons, cyber space, and the real world, and it supports various human-computer applications that economically and conveniently satisfy the complex non-functional computational requirements of diverse users. Therefore, how to apply HCC in the real world to solve complex problems effectively and efficiently has attracted considerable attention from researchers.
*Correspondence: fengjun@hhu.edu.cn, College of Computer and Information, Hohai University, Nanjing, China
In this paper, we follow the idea of applying HCC to real-world problems to pursue more intelligent and efficient applications that meet the demands of different users. Specifically, we intend to solve the flood prediction problem with high accuracy by applying HCC technologies. Floods, among the most common and widely distributed natural disasters, happen occasionally and bring great damage to life and property. If we could accurately forecast a flood by predicting its time-varying flow rate values in advance, hundreds of lives and a large amount of property could be saved. In the past decades, researchers have proposed many models for accurate and robust flood forecasting. We generally categorize them into two types, namely, mathematical models [1,2] and data-driven models [3,4].
Mathematical models generally describe the formation of floods by systems of functions, representing flood processes from causes to results. They have been successfully applied in flood forecasting systems for large watersheds. However, such models are sensitive to parameters [5] and require large research efforts to tune them, which prevents their widespread use for the many small watersheds.
Data-driven models construct forecast systems based on historical observations, directly exploring relations between river flow and flood factors without considering physical processes. Owing to the development of the Internet of Things and sensor technologies, researchers can gather and store large amounts of hydrological data (such as rainfall, runoff, soil moisture, and evaporation) from different locations. Extracting patterns from large historical hydrological datasets with intelligent methods helps improve the accuracy of flood prediction and could benefit from further development of the latest techniques, such as deep learning human-centric representation [6][7][8] and intelligent human-centered computing [9][10][11][12][13][14][15]. Specifically, we refer to patterns as the inherent non-linear functional relationships between hydrological data and flood generation, which are too complex to explain with a system of functions and are instead described implicitly by machine learning models.
Despite the optimistic outlook for predicting floods with artificial intelligence techniques, there remain two main challenges in applying data-driven models in practice. First, researchers must handle the problem of imbalanced data. Although the total number of flood samples acquired by sensors is large, some patterns with fewer samples can be hard to explore without suitable data augmentation methods. Second, researchers are clear about the dominant factors of floods, which should be the input of the constructed data-driven model. However, it is difficult to collect and use all induced factors, such as soil moisture, vegetation type, and vegetation coverage, in a single prediction model. How to adaptively and incrementally use all these factors in a single model based on experts' knowledge and users' intentions thus becomes a major challenge. Recently, there has been significant progress in intelligent human-centered computing techniques, which offers possible solutions to improve classification by extracting expert knowledge from original data and appropriately modeling human intentions.
To solve these two problems, we propose a novel model to predict river runoff values. Figure 1 shows the workflow of the proposed model, which consists of three modules, namely, the Synthetic Minority Oversampling Technique (SMOTE) method, the AdaBoost Strategy, and the Sparse Bayesian Flood Prediction Model. The SMOTE method in Fig. 1a is used to generate virtual samples for data augmentation, which solves the problem of imbalanced flood data to a certain extent. After preprocessing the original data with the SMOTE method, we adopt a novel AdaBoost strategy (represented in Fig. 1b) to train multiple Bayesian models, achieving an improved and integrated model after boosting. After integration, we obtain the Sparse Bayesian Flood Prediction Model in Fig. 1c, where the proposed sparse Bayesian model improves the original Bayesian model by imposing a probability distribution constraint on the weights of the iteratively trained model. Such a constraint leads to sparseness of the model parameters and thus helps avoid over-fitting. With all these steps, we build a complete workflow of a data-driven model to predict river runoffs with high accuracy and robustness.
Fig. 1 Workflow of the proposed model for river runoff prediction: (a) SMOTE method, (b) AdaBoost Strategy, (c) Sparse Bayesian Flood Prediction Model
The main contribution of the paper is a new SMOTE-Boost-based sparse Bayesian model that supports accurate river runoff value prediction. Facing the problem brought by an imbalanced dataset, we utilize the SMOTE method to efficiently enhance the quality of samples in the training dataset, which boosts the performance of the machine learning model built on it, i.e., the sparse Bayesian model. We believe such a data enhancement method with SMOTE technology is an appropriate way to solve the imbalanced data problem in HCC and big data analysis. Moreover, the proposed model provides users with an efficient approach to forecast floods in advance. By involving experts' analysis and users' intentions in the design of the data augmentation and classifier integration steps, we focus on the implementation of human-centered computing with artificial intelligence technologies, which helps keep a balance between efficiency and model complexity. Our experimental and comparison results demonstrate the high effectiveness and low complexity of the proposed model, which could support practical flood forecasting. We believe this is a successful trial of how to combine the principles of human-centered computing with artificial intelligence technologies, which offers inspiration for researchers designing novel algorithms.
The rest of the paper is organized as follows. Section 1 gives an overview of the related work. The SMOTE method for flood data augmentation is introduced in Section 2. Then, the AdaBoost strategy for the sparse Bayesian model under users' intention is introduced in Section 3. In Section 4, the details of the whole process for flood prediction are discussed. Section 5 shows our experimental results, and finally, Section 6 concludes the paper.

Related work
The existing methods related to our work can be categorized into the following three types: SMOTE-related methods, AdaBoost algorithm, and sparse Bayesian model.

SMOTE method
With the development of IoT and data computing technologies [16][17][18], researchers can acquire more data of various types and in large amounts. However, the imbalanced data problem causes artificial intelligence models built on such data to perform extremely poorly. Essentially, an imbalanced dataset is one whose samples fail to represent all patterns approximately equally. Oversampling is an efficient technique for dealing with the class imbalance problem by duplicating or generating minority class samples, resulting in a balance between the samples of the majority and minority classes. After years of development, the Synthetic Minority Over-sampling Technique (SMOTE) [19] was proposed and has been used to tackle the imbalanced data problem.
For instance, Maldonado et al. [20] developed a SMOTE-based method to deal with the imbalance problem of high-dimensional binary data; for efficiency, they also proposed a novel distance metric to compute the neighborhood of each minority sample. Their work was compared with various oversampling techniques on imbalanced low- and high-dimensional datasets, achieving promising results that guarantee performance in constructing NLP applications. Later, Maria et al. [21] proposed a SMOTE-BD method to tackle the problem of imbalanced classification in big data. Their scalable approach is built on the SMOTE algorithm, which helps create new synthetic instances according to the neighborhood of each minority class sample.
Most recently, Weng et al. [22] utilized SMOTE method and random forests to improve the accuracy of student weariness prediction in education. Mohasseb et al. [23] used a hierarchical SMOTE algorithm for balancing different types of questions. Their proposed framework is grammar-based, which involves grammatical pattern for each question and machine learning algorithms to classify patterns. Experimental results implied their proposed framework demonstrates high accuracy in identifying different question types and handling class imbalance.

AdaBoost algorithm
The Adaptive Boosting (AdaBoost) algorithm [24,25] is an efficient learning strategy for building accurate classifiers. The core idea of AdaBoost is that samples misclassified by the previous classifier should be used to train the next classifier. With such a design, weak classifiers, which only perform well in classifying several specific patterns, can be integrated into a strong classifier that can classify all patterns. Despite its sensitivity to noisy and abnormal data, an AdaBoost-based method can handle the overfitting problem of individual classifiers by integrating different classifiers. In summary, the AdaBoost algorithm takes full advantage of different weak predictors and, at the same time, tends to prevent overfitting.
To classify five groups of vehicle images from daily life images, Chen et al. [26] proposed a novel AdaBoost-based model built on deep convolutional neural networks (CNNs). Experimental results demonstrated that the proposed model achieves the highest classification accuracy of 99.50% on the test dataset with a processing time of only 28 ms. Later, Wu et al. [27] utilized a robust AdaBoost model to detect fire smoke in video. Static features (including texture, wavelet, color, edge orientation histogram, and irregularity) and dynamic features (including motion direction, change of motion direction, and motion speed) are extracted for training with the AdaBoost strategy. They obtained satisfactory performance with the final enhanced model, using users' intention to adjust the weights of strong or weak classifiers iteratively.
Most recently, Sun et al. [28] employed AdaBoost-LSTM-ensembled learning for financial time series forecasting. The AdaBoost algorithm is used to integrate all the long short-term memory (LSTM) predictors trained respectively. The empirical results on public datasets demonstrate that the proposed AdaBoost-LSTM ensemble learning approach outperforms some other single forecasting models and ensemble learning approaches. This suggests that the AdaBoost-LSTM ensemble learning approach is a highly promising approach for time-varying data forecasting, especially for the time series data with nonlinearity and irregularity.

Sparse Bayesian model
Sparse Bayesian learning (SBL) [29] is an important type of Bayesian statistical optimization algorithms, which is developed on the basis of Bayesian theory. Now, sparse Bayesian learning technology has been successfully applied in intelligent information retrieval [30,31], data mining [32,33], and other fields.
For instance, Mishra et al. [34] used a sparse Bayesian model to perform parameter estimation for monostatic MIMO radar systems, where simulation results demonstrate that their proposed methods achieve high estimation accuracy in comparison with existing techniques. Later, Qiao et al. [35] proposed a sparse Bayesian learning (SBL) framework for channel estimation in underwater acoustic orthogonal frequency-division multiplexing (OFDM) communication systems. Compared with compressed sensing-based methods, their proposed method has the desirable property of preventing structural error and reconstructing sparse signals with fewer convergence errors. Dai et al. [36] addressed the problem of DOA estimation under additive outliers on the basis of the sparse Bayesian learning framework, which achieves excellent performance in terms of resolution and accuracy.
Most recently, Zheng et al. [37] proposed an improvement of the Bayesian classifier with sparse regression technology, which first extends sparse regression to categorical variables and is implemented through the design of a weighted naive Bayes classifier. Salucci et al. [32] adopted a customized multi-task Bayesian compressive sensing (MT-BCS) method to yield regularized solutions of the 3D-IS problem with low computational complexity. Selected numerical results on representative benchmarks are presented and discussed to assess the effectiveness and reliability of the proposed MT-BCS strategy in comparison with other competitive state-of-the-art approaches.

Methods
In this section, we describe steps of SMOTE method for flood data augmentation, AdaBoost strategy for classifier integration, and sparse Bayesian model for flood prediction, respectively.

SMOTE method for flood data augmentation
Class imbalance refers to the uneven distribution of the training sets used in the process of training a classifier. More precisely, it means the number of samples belonging to a certain pattern, named the minority pattern, is too small to provide enough information for the construction of the classifier. If we take average loss as the learning criterion on such a class-imbalanced dataset, the generated model could be biased toward patterns with a large number of samples, which are regarded as majority patterns in our paper.
In order to deal with the class imbalance problem in regression cases, a resampling method is commonly used first, which selects more samples of the minority pattern and fewer samples of the majority pattern. In that way, the proportion of samples of the minority and majority patterns in the training dataset tends to be balanced. However, such a method can only be applied in cases with enough but imbalanced samples. In flood prediction, sensors acquire multiple variables at different frequencies according to users' intention. For example, river runoff is generally obtained every 1 h, whereas soil evaporation is measured once a day. With such a constraint imposed by the properties of the sensors, we can conclude that the number of samples for soil evaporation can be too small to have an impact on the trained classifier.
Therefore, we adopt another idea, i.e., the Synthetic Minority Over-sampling Technique (SMOTE) method, to generate a number of virtual samples on the basis of the original training samples, which increases the number of samples of the minority pattern and thus approximates sample equilibrium. Note that the SMOTE method generates synthetic samples in the feature space rather than the data space. For efficiency and accuracy, we adopt the k-nearest neighbor SMOTE method for the generation of virtual samples. The core idea of this method is the neighbor principle, that is, nearest samples or samples in a group tend to have nearly the same properties in the feature space. With this idea, we can select the k nearest neighbor samples to generate virtual samples.
Under the guidance of the k-nearest neighbor SMOTE method, we propose a specially designed SMOTE method for flood data augmentation. The core idea for generating virtual samples is shown in Fig. 2, where, for a sample $s_i$ in the minority flood pattern, named a minority sample, we generate a synthetic sample $t$, which is created based on the feature values of the k nearest neighbor samples forming a set of samples named $S_{knn}$. With this idea, we list all steps of the proposed SMOTE method in Algorithm 1.

Algorithm 1 Proposed SMOTE method
Step3. Generate a random number $\xi$ ($0 \le \xi \le 1$) and create a synthetic sample $t$ based on $d_i$.
Step4. First, define one round of generation as repeating steps 1 to 3 with $i = 1, \ldots, N$. Then, calculate $r = \lfloor v/N \rfloor$ and perform $r$ rounds of generation, where $\lfloor \cdot \rfloor$ denotes the rounding down operation. Finally, calculate $t = v \,\%\, N$ and repeat steps 1 to 3 with $t$ randomly chosen samples from $S$. All the generated samples make up the synthetic flood dataset $T$.
In the Input line of Algorithm 1, we only perform data augmentation on the minority pattern, which is defined as samples with statistical flow data higher than the alert runoff value (defined as 400 m³/s in experiments according to China's law). The reason for adopting this definition is that the minority pattern in flood prediction consists of cases where a flood happens, since the majority pattern for a river consists of cases without floods. Specifically, we define the feature value $x_i \in \mathbb{R}^d$, where $d$ represents the feature dimension.
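The minority-pattern definition above can be illustrated with a minimal sketch. It assumes each sample carries a feature vector and a flow value, and it uses the 400 m³/s alert value quoted in the text; the function name `split_by_alert` and the toy numbers are hypothetical and only serve the illustration.

```python
import numpy as np

ALERT_RUNOFF = 400.0  # m^3/s, the alert runoff value quoted in the text

def split_by_alert(features, runoff, alert=ALERT_RUNOFF):
    """Split samples into minority (flood) and majority (no flood) patterns.

    A sample belongs to the minority pattern when its flow value is
    strictly higher than the alert runoff value.
    """
    features = np.asarray(features, dtype=float)
    runoff = np.asarray(runoff, dtype=float)
    is_flood = runoff > alert
    return features[is_flood], features[~is_flood]

# Toy data (hypothetical values): 6 samples with d = 2 features each.
X = np.arange(12, dtype=float).reshape(6, 2)
q = np.array([35.0, 120.0, 410.0, 90.0, 650.0, 399.0])

minority, majority = split_by_alert(X, q)
imbalance_ratio = len(majority) / len(minority)
```

Only the minority set would then be passed on to the augmentation step; the ratio gives a quick view of how imbalanced the toy data is.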
It is not easy to determine the values of k and v, since too large a value leads to similar synthetic samples and too small a value may introduce too much noise into the synthetic sample set. In this paper, we consulted hydrology researchers and users to determine the initial values of k and v. Afterwards, we manually adjusted both numbers iteratively to achieve the most robust and appropriate generation set.
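As a rough illustration of the k-nearest neighbor SMOTE idea described in this subsection, the following sketch generates v synthetic samples from a minority set S. It is a sketch under stated assumptions, not the authors' Algorithm 1: the neighbor $d_i$ is assumed to be drawn at random from $S_{knn}$, the synthetic sample is assumed to follow the standard SMOTE interpolation $t = s_i + \xi (d_i - s_i)$, and the function name `knn_smote` and its `seed` parameter are hypothetical.

```python
import numpy as np

def knn_smote(S, k, v, seed=0):
    """Generate v synthetic samples from the minority set S (shape N x d)."""
    rng = np.random.default_rng(seed)
    S = np.asarray(S, dtype=float)
    N = len(S)
    if N < 2:
        raise ValueError("need at least two minority samples")
    k = min(k, N - 1)

    # Pairwise Euclidean distances; a sample is never its own neighbour.
    dist = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    knn = np.argsort(dist, axis=1)[:, :k]   # indices of S_knn for each s_i

    def synthesize(i):
        d_i = S[rng.choice(knn[i])]          # assumed: random neighbour from S_knn
        xi = rng.random()                    # xi drawn from [0, 1)
        return S[i] + xi * (d_i - S[i])      # point on the segment s_i -> d_i

    # r full rounds over all N samples, then v % N randomly chosen samples.
    r, rem = divmod(v, N)
    order = list(range(N)) * r
    order += list(rng.choice(N, size=rem, replace=False))
    return np.array([synthesize(i) for i in order]).reshape(-1, S.shape[1])

# Toy minority set: the four corners of the unit square.
S = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
T = knn_smote(S, k=2, v=10)
```

With k = 2 on this toy set, every synthetic sample lies on an edge of the square, which shows why a small k keeps new samples close to existing ones while a large k spreads them more widely.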

AdaBoost strategy for classifier integration
The AdaBoost algorithm [38] is a typical learning strategy based on resampling technology. By dynamically changing sample weights and model weights, the trained weak prediction models are combined into a strong prediction model to improve classification accuracy. The basis of this strategy lies in the fact that the target of interest is most likely a sparse event in the dataset of the task. By involving sequences of weak classifiers, we can iteratively eliminate wrongly labeled samples to improve efficiency. Moreover, different weak classifiers can be fit to handle different input data distributions, and we can assign different weights to the weak classifiers based on the data distribution. With such an adaptive weight strategy, the AdaBoost algorithm can perform consistently across different datasets or application scenarios. Owing to its significant ability to handle imbalanced data by integrating various types of weak classifiers, the AdaBoost algorithm has been widely used in the fields of data mining and machine learning.
The most common usage to deal with imbalanced data by AdaBoost algorithm is to first resample for modification of sample distribution and then train multiple classifiers based on the modified data with multiple sample distributions, which could be achieved by multiple sampling technology. Afterwards, samples with inaccurate prediction after first round of training are taken as input of classifiers, which are built during the second round of construction. Finally, such iteration training strategy would result in a strong classifier, which is able to depict all distribution patterns inherently represented by imbalanced data.
In the case of flood prediction, classifiers constructed by following the common procedure would achieve poor accuracy due to the highly imbalanced property and the shortage of flood data. Therefore, we propose a novel and specially designed AdaBoost training strategy to handle the case of flood prediction. The core idea of this strategy lies in the principle that we should pay more attention to samples near flood peaks, i.e., minority samples in flood prediction, which should be utilized multiple times to effectively improve the accuracy of flood forecasting near flood peaks.
Considering the fact that the size of flood data is growing every day, we should take more data into account for higher prediction accuracy. Therefore, new samples are involved in training for the purpose of updating classifiers. However, imbalanced new data leads to unsatisfactory classifiers, where majority patterns are always updated and minority patterns are never updated. In other words, there exists an imbalanced classifier problem during the iterative evolution of the AdaBoost framework. To solve this problem, we thus propose a new sample selection strategy based on active learning technology, which chooses the most informative samples from the dataset to form the training dataset, especially for majority samples.

Algorithm 2 AdaBoost training strategy with active learning
Step1. Initialize $W_{i,1} = 1/N$. Randomly extract $m$ samples from $S$ to form sample set $S_t$, and train a weak classifier $h_t()$ based on $S_t$. In our algorithm, the weak classifier is a sparse Bayesian classifier.
Step2. Calculate the average error $\varepsilon_t$ of the $t$-th classifier $h_t()$.
Step3. Update the weight $W_{i,t}$ of each sample and the weight $D_t$ of the classifier, where $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$ and $Z_t$ is the normalization factor that makes $\sum_{i=1}^{N} W_{i,t} = 1$.
Step4. Define an uncertainty value $\mu$ for each sample, where $\beta_i$ is the balance factor that ensures balance between different patterns, and $l_1$ and $l_2$ are the largest and second largest confidence output values, respectively. In other words, the smaller $\mu$ is, the larger the uncertainty of classifier $h_t()$ on the sample, and such a sample should be used in the next iteration of training.
Step5. Count the number of samples in each pattern. Define the number of samples in the minority pattern with the fewest samples as $c_1$ and the number of samples in the majority pattern with the most samples as $c_2$. Judge whether $c_2 / c_1 > \mathrm{thresh}$. If so, define …
Step6. Calculate $\mu$ for each sample and collect the samples with the smallest $\mu$ into a set. Finally, we obtain dataset $S_{t+1}$ as the union of $S_t$ and this set for the next round of training.
Step7. Repeat Step1 to Step6 for $T$ iterations, and integrate the $T$ weak classifiers to construct the final classifier.
In summary, the proposed AdaBoost training strategy with active learning technology relieves users of the burden of first building accurate and strong classifiers with imbalanced data and then improving the constructed classifier with more data. Essentially, this method is designed under the guidance of human-centered computing, which appropriately involves more data for the improvement of the constructed model without additional work from users.
Under the guidance of the AdaBoost training strategy with active learning technology, we design an algorithm to improve flood prediction, as shown in Algorithm 2. Note that steps 1 to 3 are the steps of an AdaBoost training strategy with a sparse Bayesian classifier, whereas steps 4 to 6 represent the active learning algorithm for selecting informative samples to form the dataset for the next iteration of training.
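The two ingredients described here, weight updating and uncertainty-based sample selection, can be sketched as follows. This is a hedged illustration rather than the authors' exact procedure: the weight update is assumed to take the standard AdaBoost.M1 form built on $\beta_t = \varepsilon_t / (1 - \varepsilon_t)$, the uncertainty is assumed to be the margin $\mu = \beta_i (l_1 - l_2)$ between the two largest confidence outputs, and all function names are hypothetical.

```python
import numpy as np

def update_sample_weights(w, correct, eps):
    """AdaBoost.M1-style update (assumed form).

    Weights of correctly classified samples are multiplied by beta_t,
    then all weights are normalised to sum to 1. Also returns the
    classifier weight log(1 / beta_t).
    """
    beta = eps / (1.0 - eps)
    w = np.asarray(w, dtype=float) * np.where(correct, beta, 1.0)
    return w / w.sum(), float(np.log(1.0 / beta))

def margin_uncertainty(conf, balance=1.0):
    """Assumed uncertainty mu = balance * (l1 - l2); small mu = uncertain."""
    top2 = np.sort(np.asarray(conf, dtype=float), axis=1)[:, -2:]
    return balance * (top2[:, 1] - top2[:, 0])

def select_informative(mu, n_select):
    """Indices of the n_select samples with the smallest mu."""
    return np.argsort(mu)[:n_select]

# Toy round: 4 equally weighted samples, the last one misclassified.
w0 = np.full(4, 0.25)
correct = np.array([True, True, True, False])
eps = float(w0[~correct].sum())              # weighted error of the classifier
w1, clf_weight = update_sample_weights(w0, correct, eps)

# Toy confidence outputs for 3 samples over 2 patterns.
conf = np.array([[0.90, 0.10], [0.55, 0.45], [0.20, 0.80]])
mu = margin_uncertainty(conf)
picked = select_informative(mu, 2)
```

In the toy round the misclassified sample ends up with half of the total weight, and the sample with the nearly tied confidences is selected first for the next training set.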

Sparse Bayesian model for flood prediction
Sparse Bayesian learning (SBL) [39] assumes that the samples obey a probability distribution and calculates the weights of the approximation function through the maximum likelihood criterion. Afterwards, the posterior probability distribution is calculated with Bayes' rule. Finally, the inference of unknown parameters is made based on prior information and posterior probability.
Given the training sample set, and with the assumption that the training samples are independent and obey the same distribution, we can define the likelihood function in terms of $y = (y_1, y_2, \cdots, y_N)^T$, $\omega = (\omega_1, \omega_2, \cdots, \omega_N)^T$, and a certain kernel function $K(x_i, x_N)$. Note that most regression models are prone to over-fitting as the number of parameters increases. In order to tackle this problem, SBL adds a constraint to the weights: the parameter $\omega$ obeys a Gaussian distribution with zero mean. With this constraint, Eq. 8 can be rewritten in terms of $\alpha = \{\alpha_1, \alpha_2, \cdots, \alpha_N\}$, a hyperparameter that determines the prior distribution of the weight $\omega$, which is the main idea for constructing a sparse model.
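For reference, the description above matches the pattern of the widely used relevance vector machine formulation of sparse Bayesian learning. The block below writes out that standard formulation as a hedged reconstruction, not as the authors' own equations: the noise variance $\sigma^2$, the design matrix $\Phi$, and the posterior quantities $m$, $\Sigma$, and $A$ are notation assumed here for illustration.

```latex
% Assumed standard sparse Bayesian (relevance vector machine) formulation.
% \sigma^2, \Phi, A, m and \Sigma are notation introduced for illustration.
\begin{align}
y_i &= \sum_{j=1}^{N} \omega_j K(x_i, x_j) + \epsilon_i,
      \qquad \epsilon_i \sim \mathcal{N}(0, \sigma^2), \\
p(y \mid \omega, \sigma^2) &= (2\pi\sigma^2)^{-N/2}
      \exp\!\left(-\frac{\lVert y - \Phi\omega \rVert^2}{2\sigma^2}\right),
      \qquad \Phi_{ij} = K(x_i, x_j), \\
p(\omega \mid \alpha) &= \prod_{i=1}^{N}
      \mathcal{N}\!\left(\omega_i \mid 0, \alpha_i^{-1}\right), \\
p(\omega \mid y, \alpha, \sigma^2) &= \mathcal{N}(\omega \mid m, \Sigma),
      \qquad \Sigma = \left(\sigma^{-2}\Phi^{\top}\Phi + A\right)^{-1},
      \quad m = \sigma^{-2}\Sigma\Phi^{\top}y,
      \quad A = \operatorname{diag}(\alpha_1, \ldots, \alpha_N).
\end{align}
```

Under this formulation, sparsity arises because many $\alpha_i$ grow very large during hyperparameter estimation, which pins the corresponding $\omega_i$ at zero and removes those kernel terms from the model.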
After defining the sparse Bayesian model, we can assemble the pipeline of the whole proposed model from the SMOTE, AdaBoost, and sparse Bayesian model parts: SMOTE is designed to generate virtual samples, the sparse Bayesian model serves as the weak classifier, and AdaBoost training with active learning technology integrates all weak classifiers constructed during the training iterations, which finally form a strong and accurate classifier for flood prediction.

Results and discussion
In this section, we show the effectiveness of the proposed method in predicting runoff values. We describe the dataset, quality measures, and experimental results, respectively.

Dataset
In this study, we apply the proposed method to predict the daily flow rate of Changhua Gage Station, based on the 1998-2010 historical flood data of 7 rainfall stations, 1 evaporation station, and 1 gaging station in the Changhua watershed, a watershed in the Xinanjiang River basin in China. We show the map of the Changhua watershed with the various kinds of stations in Fig. 3. Note that we need to predict the flow rate values of the river gaging station Changhua, and that station Shuangshi functions as an evaporation station to offer evaporation values. We collect hourly data of 40 floods that happened from 1998 to 2010 and utilize 8-fold cross-validation to evaluate our proposed method. A total of 6552 samples from 1998 to 2008 are selected as training samples, and 1688 samples from 2009 to 2010 are selected as test samples. Note that the data collected from the Changhua River is an essentially imbalanced dataset, where some flood patterns occur only once in all samples. The imbalanced property of the Changhua dataset is the major difficulty for accurate flood prediction.
The Changhua River is a tributary of the Xinanjiang River, originating in Jixi County, Anhui Province, China. It flows through Jixi County, Lingan County, and Changhua County, and eventually into the Xinanjiang River. The river is 96 km long and the watershed area is 905 km². Changhua gage station is a major gage station on the Changhua River, located at 119.212° E, 30.166° N. The daily flow rates from 1998 to 2010 at Changhua gage station as well as other related data during floods are collected for this study. Some descriptive statistics for the flood data are given in Table 1, where E represents evaluation and SD refers to the standard deviation. The daily flow rate varies from 0.58 m³/s occurring in 2007 to 2100 m³/s appearing in 1999; the mean daily flow rate is 146.651 m³/s with a variance of 202.501 m³/s.

Quality measures
We use standard quality measures, namely root mean square error (RMSE), deterministic coefficient (DC), and flood peak error (FPE), to measure the quality of flood forecasting achieved by the proposed method. Note that the last measure is specially designed for flood forecasting. RMSE is defined as

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_c(i) - y_0(i)\right)^2},$$

where RMSE reflects the degree of deviation between the predicted values $y_c$ and the true values $y_0$ during the flood forecasting process. The smaller the RMSE, the better the performance achieved by the adopted model. DC can be represented as

$$\mathrm{DC} = 1 - \frac{\sum_{i=1}^{n}\left(y_c(i) - y_0(i)\right)^2}{\sum_{i=1}^{n}\left(y_0(i) - \bar{y}_0\right)^2},$$

where $y_c(i)$ is the predicted value, $y_0(i)$ is the measured value, $\bar{y}_0$ is the mean of the measured values, and $n$ is the number of samples. Note that DC reflects the degree of coincidence between the flood forecasting process and the measured process. The closer the result is to 1, the higher the forecast accuracy. The third measure, FPE, is computed over the $n$ test samples from the ground truth of the flood peak $y_p^i$ and its prediction $\hat{y}_p^i$. Note that FPE denotes the mean of all flood peak errors in the test dataset.
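A minimal sketch of the three measures is given below. RMSE and DC follow their standard definitions; for FPE, the sketch assumes the error of each flood peak is the absolute difference between the predicted and observed peak values, which is one plausible reading of "the mean of all flood peak errors" and not necessarily the authors' exact formula. The toy values are hypothetical.

```python
import numpy as np

def rmse(y_c, y_0):
    """Root mean square error between predicted y_c and measured y_0."""
    y_c, y_0 = np.asarray(y_c, dtype=float), np.asarray(y_0, dtype=float)
    return float(np.sqrt(np.mean((y_c - y_0) ** 2)))

def dc(y_c, y_0):
    """Deterministic coefficient: 1 - residual sum / total sum of squares."""
    y_c, y_0 = np.asarray(y_c, dtype=float), np.asarray(y_0, dtype=float)
    residual = np.sum((y_c - y_0) ** 2)
    total = np.sum((y_0 - y_0.mean()) ** 2)
    return float(1.0 - residual / total)

def fpe(peak_pred, peak_true):
    """Assumed form: mean absolute error over flood peaks."""
    peak_pred = np.asarray(peak_pred, dtype=float)
    peak_true = np.asarray(peak_true, dtype=float)
    return float(np.mean(np.abs(peak_pred - peak_true)))

# Toy flow series in m^3/s (hypothetical values).
measured = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]

rmse_value = rmse(predicted, measured)
dc_value = dc(predicted, measured)
fpe_value = fpe([390.0, 670.0], [400.0, 650.0])
```

On the toy series every prediction is off by 10 m³/s, so the RMSE is 10 and the DC stays close to 1, which matches the reading that a smaller RMSE and a DC nearer to 1 indicate a better forecast.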

Results and discussion
We conduct two groups of experiments to show the performance of the proposed model with different parameters and compare with other models for runoff prediction.
In the first group of experiments, we show the prediction results of the proposed model with different parameters, i.e., the number of sampled samples in the SMOTE method and the number of training iterations in the AdaBoost training strategy. Comparison results are shown in Table 2. For the convenience of readers, we further show a comparison figure in Fig. 4, where n refers to the number of sampled samples. From Fig. 4, we can clearly see that the RMSE, DC, and FPE values achieved by the proposed ensemble model are much better than those of the single model, which demonstrates the effectiveness of the proposed AdaBoost training strategy with active learning technology. Furthermore, we find that the number of iterations, i.e., the number of classifiers, is clearly subject to a marginal effect. In other words, adopting more classifiers does not always improve the measured values. Therefore, we try different numbers of iterations and achieve the best performance with 6 iterations.
The number of sampled samples is the most important parameter of the SMOTE algorithm and has a great impact on the final prediction results. From Fig. 4, we find that the best performance is achieved by the model with n = 4000. Setting either n = 3000 or n = 5000 yields worse performance.
In the second group of experiments, we show the detailed statistics of the proposed method and other data-driven methods for the Changhua dataset in Table 3. Among these comparative methods, Han et al. [40] apply SVM to flood forecasting with a special design for optimal selection among a large number of input combinations and parameters. Note that we apply a linear kernel function for [40] in our experiments. Wu et al. [41] construct the entities and connections of a Bayesian network to represent the variables and physical processes of a well-known physical model, which appropriately embeds hydrology expert knowledge for high rationality and robustness. Dawson et al. [42] develop artificial neural networks (ANNs) for flow forecasting with 6-h lead times using real hydrometric data. Chang et al. [43] develop a two-stage rainfall-runoff model for 3-h-ahead flood forecasting based on a radial basis function (RBF) neural network, which first utilizes fuzzy min-max clustering to determine the characteristics of the nonlinear RBFs and then adopts multivariate linear regression to determine the weights between the hidden and output layers. In summary, the cores of Han et al. [40], Dawson et al. [42], Chang et al. [43], Lima et al. [44], and Wu et al. [41] are SVM, neural network, radial basis function network, extreme learning machine, and Bayesian network, respectively. All these machine learning structures are popular for flood prediction in the pattern recognition community. We implement these algorithms according to the instructions given in their papers.
From Table 3, we can see that the proposed method achieves the best performance in RMSE and DC, and the second best performance in FPE. The small RMSE achieved by the proposed method implies that our method is more accurate and robust in predicting runoff values; meanwhile, the large DC implies that our method quantifies uncertainty to a certain extent. Wu et al. [41] is more accurate than the proposed model in predicting the appearance time and runoff values of flood peaks, since it embeds hydrology processes and variables to increase prior knowledge for accurate prediction of flood peaks. To sum up, both generating virtual samples and integrating classifiers help accurately predict floods even with imbalanced data. Since it does not adopt a heavy deep learning architecture, the proposed method processes one input sample in 3.41 s on average on a PC with a 2.4 GHz 2-core i7 CPU and 16 GB RAM, which is fast enough to guarantee timely flood prediction.

Conclusions
This paper proposes a SMOTE-Boost-based sparse Bayesian model for accurate flood prediction. In the first step, the SMOTE method is used to solve the imbalanced flood data problem by generating more virtual samples. Under the framework of an AdaBoost training strategy, which dynamically adjusts the number of samples and the weights of samples and classifiers, multiple sparse Bayesian models with weak predictive ability are integrated into a model with strong predictive ability. We further involve active learning technology to update the model by selecting informative samples for training. Experiments on a collected dataset with several comparative methods have demonstrated the accuracy and effectiveness of the proposed model for flood prediction. In future work, we will study the parameters of the AdaBoost training strategy to further improve model performance.