In this section, we introduce the blocking and coding algorithm for multidimensional meteorological data, the interpolation algorithm for generating the time series of a geographic grid, the adaptive data time-to-live algorithm, and the unified heterogeneous service method for multi-source merged meteorological big data in the cloud environment.
3.1 The blocking and coding algorithm of multidimensional meteorological data
Semi-structured data are mainly grid data or raster data, which can be stored in blocks with compression. NoSQL object storage generally adopts a column-oriented storage mode. This storage structure supports the expansion of data table columns and high I/O throughput, avoids the maintenance pressure caused by changes in table structure, and effectively improves the throughput of data analysis.
Based on the characteristics of NoSQL object storage and the actual characteristics of meteorological grid data, the blocks are scaled according to the rank of the grid data, and the data of each attribute column are divided into rows and columns in a certain proportion. The segmentation rule is as follows: the meteorological grid data is divided into M×N blocks by row and column count. Suppose M is the count of blocks in the vertical direction after segmentation and N is the count of blocks in the horizontal direction after segmentation; M and N are calculated by
$$ \left\{ \begin{array}{ll} M=\frac{m}{x}+1, & m\ge 0,x>0 \\ N=\frac{n}{y}+1, & n\ge 0,y>0 \\ \end{array} \right. $$
(1)
where m is the total count of grid cells in the vertical direction, n is the total count of grid cells in the horizontal direction, x is the count of cells per block in the vertical direction, and y is the count of cells per block in the horizontal direction.
The partitioned attribute column data is divided into M×N data columns and stored sequentially as one-dimensional arrays in the NoSQL database system. As shown in Fig. 6, each column is encoded into bytes and compressed with GZip or LZ4.
With this method, when reading a range of data, the value of an attribute column can be read by calculating the columns in which the data is located, without reading the whole row of grid field data. Moreover, after the data is partitioned and compressed, the amount of data read from the grid is greatly reduced.
The multidimensional meteorological grid data blocking algorithm thus reduces the amount of data queried and improves data access performance. For example, a query can use the grid extraction algorithm to intercept the block data set related to a range, and single-point time series generation can take a single block column across multiple height field data to form a time series.
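To make the blocking rule concrete, the following is a minimal Python sketch, assuming the attribute column arrives as a 2-D NumPy array; the block keying and the choice of gzip are illustrative assumptions rather than the paper's exact storage schema.

```python
import gzip

import numpy as np


def split_into_blocks(field, x, y):
    """Split a 2-D attribute-column field into M x N blocks per formula (1).

    field -- 2-D array with m rows (vertical) and n columns (horizontal)
    x, y  -- cells per block in the vertical and horizontal directions
    """
    m, n = field.shape
    M = m // x + 1  # count of blocks in the vertical direction, formula (1)
    N = n // y + 1  # count of blocks in the horizontal direction, formula (1)
    blocks = {}
    for bi in range(M):
        for bj in range(N):
            tile = field[bi * x:(bi + 1) * x, bj * y:(bj + 1) * y]
            if tile.size == 0:
                continue  # edge indices that fall past the grid boundary
            # Each block is flattened to a one-dimensional array, serialized
            # to bytes, and compressed before being written to the NoSQL store.
            blocks[(bi, bj)] = gzip.compress(tile.tobytes())
    return blocks, (M, N)
```

A range query then decompresses only the blocks whose indexes cover the requested range instead of the whole field.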
3.2 The interpolation algorithm for generating the time series of a geographic grid
In refined meteorological applications, users often view the history or future trend of a geographic location, which requires retrieving the multidimensional time series of that location. Thus, suppose a time series set of data
$$ T=\left\{ {{V}_{t1}},{{V}_{t2}},\cdots ,{{V}_{tn}} \right\} $$
(2)
is given, where V is the coordinate value on a two-dimensional field at a certain time, t is the time vector, and T is the time series set of the multidimensional meteorological data.
Bilinear interpolation is used to obtain the value of this point on the two-dimensional field of a layer. Mathematically, bilinear interpolation extends linear interpolation to an interpolation function of two variables; the core idea is to perform linear interpolation in each of the two directions. First, the data block position is located according to formulas (4) and (5). For the analysis and calculation, the total amount of block data is formulated as follows
$$ {{Z}_{m,n}}(t)=\sum\limits_{i=1}^{m}{\sum\limits_{j=1}^{n}{{{\beta }_{i,j}}(t)}} $$
(3)
where Zm,n(t) is the total amount of block data at time t, βi,j(t) is the data size of a single grid node, m is the number of grid latitudes, and n is the number of grid longitudes.
The data block position is calculated by
$$ \left\{ \begin{array}{l} B{{I}_{x}}={(\lambda -{{\lambda }_{0}})}/{\mu }\; \\ B{{I}_{y}}={(\eta -{{\eta }_{0}})}/{\mu }\; \\ \end{array} \right. $$
(4)
where BIx and BIy are the data block indexes in the longitude and latitude directions, respectively, μ is the geographic size of a single data block, λ is the longitude value, λ0 is the initial longitude value, η is the latitude value, and η0 is the initial latitude value.
The data set of the block is obtained by formula (4), and then the index position within the block is calculated by formula (5).
$$ \left\{ \begin{array}{l} x={(\lambda -\lambda_{0}^{\prime})}/{dx}\; \\ y={(\eta -\eta_{0}^{\prime})}/{dy}\; \\ \end{array}\right. $$
(5)
where x is the index value in longitude, y is the index value in latitude, dx is the longitude resolution, dy is the latitude resolution, \(\lambda _{0}^{\prime }\) is the initial longitude value in the data block, and \(\eta _{0}^{\prime }\) is the initial latitude value in the data block.
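As a worked sketch of formulas (4) and (5), the helper below maps a longitude/latitude pair first to its block index and then to its offset inside that block; the function name and parameter spellings are ours, mirroring the symbols above.

```python
def locate(lon, lat, lon0, lat0, mu, dx, dy):
    """Return the block index (formula (4)) and the in-block index (formula (5)).

    lon0, lat0 -- initial longitude/latitude of the whole grid
    mu         -- geographic size of a single data block
    dx, dy     -- longitude/latitude resolution of the grid
    """
    # Formula (4): which block the point falls into.
    bi_x = int((lon - lon0) / mu)
    bi_y = int((lat - lat0) / mu)
    # The block's own initial coordinates (lambda_0', eta_0').
    lon0_b = lon0 + bi_x * mu
    lat0_b = lat0 + bi_y * mu
    # Formula (5): offset of the point inside the block.
    x = (lon - lon0_b) / dx
    y = (lat - lat0_b) / dy
    return (bi_x, bi_y), (x, y)
```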
Then, the obtained data position index in the data block is used to perform bilinear interpolation to the point p(x, y). It is assumed that the value of f is known at the four points Q11=(x1,y1), Q12=(x1,y2), Q21=(x2,y1), and Q22=(x2,y2). The corresponding interpolation f(x,y) is calculated by formulas (6), (7), and (8). The linear interpolation in the x-direction is estimated by
$$ f(x,{{y}_{1}})\approx \frac{({{x}_{2}}-x)}{({{x}_{2}}-{{x}_{1}})}f({{Q}_{11}})+\frac{(x-{{x}_{1}})}{({{x}_{2}}-{{x}_{1}})}f({{Q}_{21}}) $$
(6)
$$ f(x,{{y}_{2}})\approx \frac{({{x}_{2}}-x)}{({{x}_{2}}-{{x}_{1}})}f({{Q}_{12}})+\frac{(x-{{x}_{1}})}{({{x}_{2}}-{{x}_{1}})}f({{Q}_{22}}) $$
(7)
The desired estimate is obtained by
$$ f(p)\approx \frac{({{y}_{2}}-y)}{({{y}_{2}}-{{y}_{1}})}f(x,{{y}_{1}})+\frac{(y-{{y}_{1}})}{({{y}_{2}}-{{y}_{1}})}f(x,{{y}_{2}}) $$
(8)
Finally, following the same steps, the data at the same position at other times are extracted to form a time series of the multidimensional meteorological data.
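A compact sketch of formulas (6) to (8) plus the per-time-step loop described above follows; `read_corners` is an assumed callback that fetches the four surrounding grid values for the target point at one time step, not part of the paper's interface.

```python
def bilinear(x, y, x1, x2, y1, y2, q11, q21, q12, q22):
    """Bilinear interpolation at p(x, y) per formulas (6)-(8)."""
    # Formulas (6) and (7): linear interpolation in the x-direction.
    f_y1 = (x2 - x) / (x2 - x1) * q11 + (x - x1) / (x2 - x1) * q21
    f_y2 = (x2 - x) / (x2 - x1) * q12 + (x - x1) / (x2 - x1) * q22
    # Formula (8): linear interpolation in the y-direction.
    return (y2 - y) / (y2 - y1) * f_y1 + (y - y1) / (y2 - y1) * f_y2


def point_time_series(times, read_corners):
    """Build T = {V_t1, V_t2, ..., V_tn} of formula (2) for one location."""
    series = []
    for t in times:
        x, y, x1, x2, y1, y2, q11, q21, q12, q22 = read_corners(t)
        series.append(bilinear(x, y, x1, x2, y1, y2, q11, q21, q12, q22))
    return series
```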
3.3 Adaptive time-to-live algorithms
Time-to-live (TTL) is the period that a unit of data in the meteorological NoSQL system can exist before it should be discarded. The TTL of meteorological NoSQL data spans the whole lifetime of the data, from its entry into the database to its deletion. In order to improve the effect of TTL in the meteorological NoSQL database, an adaptive algorithm for cleaning up expired meteorological data is implemented according to the frequency and time of data access. Suppose the calculation period of the time-to-live is (0, ∞). At the current time t, the time-to-live of data can be calculated based on the frequency and time of data access in different time periods.
Some basic concepts are presented for adaptive time-to-live algorithms in the meteorological NoSQL system as follows.
Definition 1. (Data usage of Et) The data usage of Et is defined by the amount of data access at time t, denoted as Pt.
Definition 2. (Adaptive data expiration model)
The adaptive data expiration model can be described as a quaternion {P,T,α,E}, where P is the data usage determined by access frequency and time, T is the data expiration time, α:P×T→E is the mapping determined by the data expiration algorithm, and E is the set of outputs.
In the initial stage of data cleaning, the data is assigned an initial expiration time E0 at time t0. As time increases, the expiration time decreases continuously, and when it reaches 0, the data is automatically cleaned up by the system. As time passes, the data is accessed; with increasing access frequency, the usage of the data increases, and the expiration time of the corresponding data increases accordingly. The relationship is formulated as follows:
$$ {{E}_{t}}=\left\{ \begin{array}{ll} {{E}_{\max }}-t+P(t), & {{E}_{\max }}\ne 0 \\ 1, & {{E}_{\max }}=0 \\ \end{array} \right. $$
(9)
where Et is the remaining live time of the data at time t (when it equals 0, the data is deleted); Emax is the maximum TTL defined by the user (when it equals 0, the data never expires); and P(t) is the data usage at time t, which captures the relationship between data access frequency and time and is calculated by formula (11).
The sigmoid function is used as the adaptive adjustment curve as shown in formula (10).
$$ f(x)=\frac{1}{1+\exp (-ax)} $$
(10)
Using the sigmoid function, the adaptive adjustment formula for the change rate of access frequency is designed as shown in formula (11).
$$ {{P}_{t}}=\left\{ \begin{array}{ll} \frac{{{P}_{\max }}-{{P}_{\min }}}{1+\exp \left( \frac{a\left( {{P}_{t-1}}-{{A}_{\min ,t}} \right)}{{{A}_{\text{avg},t}}-{{A}_{\min ,t}}} \right)}+{{P}_{\min }}, & {{P}_{t-1}}\ge {{P}_{\min }} \\ {{P}_{\min }}, & \text{otherwise} \\ \end{array} \right. $$
(11)
where t=1,2,⋯,T is the current time with T→∞, a is an arbitrary constant, Pmax is the maximum usage of data, Pmin is the minimum usage of data, Amin,t is the minimum access frequency from time 0 to t, Aavg,t is the average access frequency from time 0 to t, Pt−1 is the data usage of the previous period, and Pt is the data usage of the current period t.
As shown in formulas (10) and (11), the change rate of usage is adjusted nonlinearly along the sigmoid curve between the minimum and average data access frequencies of two adjacent periods. The distribution of usage improves when most time periods have similar access frequencies and the minimum access frequencies are close to the average access frequencies.
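Under our reading of formulas (9) and (11), the TTL update can be sketched as follows; the function names, the degenerate-period guard, and the clamping at zero are assumptions, not the paper's exact implementation.

```python
import math


def usage(p_prev, a, p_min, p_max, a_min_t, a_avg_t):
    """Data usage P_t per formula (11), adjusted by the sigmoid of formula (10)."""
    if p_prev < p_min or a_avg_t == a_min_t:
        return p_min  # floor the usage when below the minimum, or the period has no spread
    z = a * (p_prev - a_min_t) / (a_avg_t - a_min_t)
    return (p_max - p_min) / (1.0 + math.exp(z)) + p_min


def time_to_live(t, e_max, p_t):
    """Remaining live time E_t per formula (9); a result of 0 triggers deletion."""
    if e_max == 0:
        return 1  # a user-defined E_max of 0 means the data never expires
    return max(e_max - t + p_t, 0)
```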
The process of adaptive adjustment with data access frequencies is shown in Fig. 7. As the usage varies with the access frequency of the current period, a new usage value is obtained, so that the data expiration time can be adjusted adaptively; when the data expiration time reaches 0, the data is automatically cleared. The adaptive adjustment curve of the change rate should vary slowly everywhere, so that the adjustment takes effect over a large range. Second, when the differences are large, the curve should not tend toward linearity, since high access frequencies are not the norm. Finally, a certain degree of spread in usage is preserved. At the same time, in order to retain a good distribution as far as possible, the smoother part of the adaptive adjustment curve should be used.
Therefore, in order to adjust the expiration time of the data automatically and improve the efficiency of data use, this paper designs the relationship between expiration time and access frequency at each time.
3.4 Unified heterogeneous service architecture and method
To improve the service capability of data storage and shield the differences among underlying NoSQL data systems, a unified heterogeneous service method for meteorological data was studied. It uses a three-layer abstract structure, as shown in Fig. 8.
The NoSQL database service layer is built on a variety of NoSQL databases and provides their read-write interfaces, other general operation interfaces, and implementation classes, automatically adapting operations to the characteristics of each database. By shielding the underlying NoSQL data storage platform from the application, it adapts automatically to all supported NoSQL data storage platforms, providing a high degree of support for heterogeneous services.
The service implementation layer implements the core operations of the unified heterogeneous meteorological data services, covering table spaces, tables, data, and related meteorological algorithms. It organizes data in the form of table spaces, tables, and columns, and expands meteorological data dynamically and seamlessly through data segmentation. It provides common meteorological data operations, supporting read-write and spatial analysis of meteorological grid, file, and station data, and thus offers high consistency. The service implementation layer also provides algorithms such as adaptive data segmentation, compression, grid extraction, interpolation, and single-point time series generation for meteorological data processing.
The service interface layer provides the unified data access service and data operation interfaces, including the unified data access interface, the database connection interface, the table space interface, the table interface, and the meteorological data algorithm interface. It can be combined with commonly used distributed service components to provide distributed, component-style meteorological data interface services. Applications can access meteorological data through the RESTful interface, the Web Service interface, or the distributed client SDK, using the meteorological data algorithms as components or services of the application program.
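To illustrate the three-layer abstraction in code, a minimal Python sketch follows; the class and method names are hypothetical, and a real deployment would back the adapter interface with clients for HBase, Cassandra, MongoDB, and similar platforms.

```python
from abc import ABC, abstractmethod


class NoSQLAdapter(ABC):
    """NoSQL database service layer: one adapter per storage platform."""

    @abstractmethod
    def put(self, table: str, key: str, value: bytes) -> None: ...

    @abstractmethod
    def get(self, table: str, key: str) -> bytes: ...


class InMemoryAdapter(NoSQLAdapter):
    """Stand-in backend for testing; real adapters wrap the platform clients."""

    def __init__(self):
        self.tables = {}

    def put(self, table, key, value):
        self.tables.setdefault(table, {})[key] = value

    def get(self, table, key):
        return self.tables[table][key]


class UnifiedMeteoService:
    """Service implementation layer: exposes blocking, compression, and
    interpolation operations uniformly, whatever the backend adapter is."""

    def __init__(self, adapter: NoSQLAdapter):
        self.adapter = adapter  # backends are interchangeable behind the interface

    def write_block(self, table, block_index, block_bytes):
        self.adapter.put(table, f"{block_index[0]}:{block_index[1]}", block_bytes)
```

The service interface layer would then wrap such a service object behind RESTful or Web Service endpoints.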
Data managed by the unified heterogeneous meteorological data services are stored on the NoSQL data storage platform with multiple backups, providing fast access performance and extremely high data reliability.