Relation extraction for coal mine safety information using recurrent neural networks with bidirectional minimal gated unit

Data in the coal mine safety field are massive, multi-source and heterogeneous. It is of practical importance to extract information from such big data to achieve disaster precaution and emergency response. Existing approaches need to construct many hand-crafted features and rely heavily on the linguistic knowledge of researchers, leading to inefficiency, poor portability, and slow update speed. This paper proposes a new relation extraction approach using recurrent neural networks with a bidirectional minimal gated unit (MGU) model, achieved by adding a back-to-front MGU layer to the original MGU model. The approach does not require constructing complex text features and can capture global context information by combining the forward and backward features. Extensive experiments show that the proposed approach outperforms existing initiatives in terms of training time, accuracy, recall and F value.

representation approach [8] has coarse-grained descriptions, a small representation range, and poor computational efficiency. Ontology technology has been increasingly adopted in the coal mine safety field [9], since it can describe knowledge in a standardized manner and realize the transfer, reuse and sharing of information. With the development of 5G communication and mobile edge computing technologies, low latency and high bandwidth can support more accurate services [10][11][12]. Relation extraction mines the associations between concepts within large-scale data, and is the key step in ontology construction.
With the development of deep learning technology [13], relation extraction has innovated and developed rapidly. In terms of the dependence on labelled data, automatic relation extraction approaches can be divided into four types: supervised learning, semi-supervised learning, unsupervised learning, and open extraction. The supervised learning approaches cannot capture global context information when they are based on the original minimal gated unit (MGU) model, which is unidirectional and processes data in one direction [14]. The semi-supervised learning approaches have poor portability and are not suitable for coal mine safety data, since the accuracy of information extraction depends on the quality of the initial relation seeds [15]. The unsupervised learning approaches need to analyse and post-process the extraction results, and the clustering threshold cannot be determined in advance [16]. The open extraction approaches map relation instances to texts by means of external knowledge bases such as DBPedia, OpenCyc and YAGO [17]. However, these bases rarely contain coal mine safety knowledge, so the related research cannot be applied to coal mine safety data. There is as yet no mature relation extraction approach for coal mine safety information.
For coal mine safety data, this paper applies recurrent neural networks (RNNs) with MGU to learn high-dimensional attribute features and avoid the complex feature selection problem. The key contributions of the paper can be summarized as follows.
1. We design an automatic relation extraction approach using RNNs with a bidirectional minimal gated unit (Bi-MGU) model to capture global context information. This is achieved by adding a back-to-front MGU layer to the original MGU model.

Based on the 2005 Automatic Content Extraction (ACE2005) standard, the experimental results show that the proposed approach has higher accuracy, recall rate and F value, and a shorter training time, as compared to the existing initiatives.
The rest of this paper is organized as follows. In Sect. 2, the related works are reviewed. Section 3 presents the RNNs with the Bi-MGU model to capture global context information. The analytical comparison of the proposed model and existing alternatives is conducted in Sect. 4, followed by conclusions in Sect. 5.

Related work
Deep learning can learn high-dimensional attribute features and capture the semantic features of words well, so it is widely used for relation extraction.
The supervised learning approaches use labelled data for model training. Vo et al. [18] proposed a relation extraction approach based on a semantically expanded syntax tree to generate rules and accomplish relation extraction. Li et al. [19] proposed a classifier approach based on support vector machines to achieve relation extraction. Zheng et al. [20] introduced the word feature and proposed a relation extraction approach based on conditional random fields. Zhou et al. [21] proposed a relation extraction approach based on a kernel function computed from the similarity of dependency trees. To tackle the relation classification task, Wu et al. [22] proposed a model that leverages the pretrained BERT language model and incorporates information from the target entities. Zeng et al. [23] proposed an end-to-end model based on sequence-to-sequence learning with a copy mechanism, which jointly extracts relational facts from sentences. For relation extraction, Zhang et al. [24] proposed an extension of graph convolutional networks, which pools information over arbitrary dependency structures efficiently in parallel. Guo et al. [25] proposed an attention-guided graph neural network for relation extraction tasks, which automatically learns how to selectively attend to the relevant sub-structures. To solve the natural language relational reasoning task, Zhu et al. [26] proposed graph neural networks with generated parameters. However, the mentioned approaches rely heavily on the linguistic knowledge of researchers and cannot make full use of contextual structure information. Additionally, their slow training and testing speed makes it difficult to process large-scale data.
The semi-supervised learning approaches reduce the dependence on manually annotated corpora by manually adding seeds and learning iteratively. Agichtein et al. [27] designed the Snowball method based on vector representations to reduce the impact of manual intervention on relation extraction. Chen et al. [28] proposed a semi-supervised extraction model based on a graph strategy to improve the accuracy of relation extraction. Zhang et al. [29] proposed the BootProject algorithm based on random feature projection to achieve relation extraction. Wang et al. [30] proposed a label-free distant supervision approach, which makes use of the type information and the translation law derived from a typical knowledge graph embedding model to learn embeddings for certain sentence patterns. However, these works suffer from semantic drift and are easily affected by the quality of the initial relation seeds.
The unsupervised learning approaches extract semantic relations by learning entity contexts. Qin et al. [31] proposed a Chinese entity relation extraction model based on unsupervised learning, which achieved relation extraction on large-scale unlabelled data. Shinyama et al. [32] proposed an unsupervised method based on multi-level clustering to extract relations from reported articles. Gonzalez et al. [33] proposed a technique based on a probabilistic clustering model to achieve unsupervised relation extraction. However, these works lack clear boundaries and objective evaluation criteria, and have low accuracy. Also, the clustering threshold cannot be determined in advance.
The open extraction approaches have advantages in cross-domain settings and later expansion, since they impose no constraints on the relation category or target text. Etzioni et al. [34] built the KnowItAll model and realized entity relation extraction by manually writing rule templates. The ever-increasing popularity of web APIs allows app developers to leverage a set of existing APIs to achieve their sophisticated objectives [35]. Banko et al. [36] proposed a TextRunner-based approach to extract specific relations from the Web. Wu et al. [37] constructed an open extractor system to achieve relation extraction based on Wikipedia information. However, the existing works lack a recognized evaluation system and are unable to deeply explore the implicit relations between entities. Therefore, there is still a gap between the needs of the coal mine safety field and the state of relation extraction.

Overall network structure
The overall network structure diagram elaborates the techniques used in each step of the relation extraction from a microscopic perspective, as shown in Fig. 1. The network structure can be divided into 5 layers, and the function of each layer is described as follows:
1. The input layer pre-processes the data, extracts the concepts in each sentence, and deletes the sentences that do not contain any concepts. The resulting data is divided into training data and test data, where the training data is annotated with the assistance of relevant experts. Each piece of data is described as a tuple <concept1, concept2, word spacing, relation type, sentence>.
2. The word embedding layer constructs a text word vector to represent the corresponding words. This is achieved by converting text information into word vectors trained by word2vec. Each sentence is converted into a multidimensional matrix. The text features used in this paper are the word itself and the word spacing.
3. The recurrent neural network layer trains on the corpus (the processed data) that is input into the Bi-MGU units. Training minimizes the negative log-likelihood function to obtain the optimal model, and the relation classification is treated as a multi-classification problem.
4. The pooling layer uses a max pooling operation to obtain the final vector representation of the input corpus. To make full use of the information in each sentence, we introduce the attention mechanism to calculate the attention probability, thereby reflecting the importance of a sentence in the set. The overall features of the text are obtained by the pooling operation.

Text vector representation
In order to process data with a neural network model, the input data first needs to be vectorized. Different from the traditional one-hot representation, word embeddings trained with neural networks contain rich context information, can represent the semantic regularities of the target words in the current text, and avoid the curse of dimensionality [38]. We use the word2vec tool to train word embeddings and choose the Skip-gram model as the training framework. Constructing a text word vector means converting text information into vector form, so that each sentence is converted into a multidimensional matrix. Given a sentence S containing the word set W = {w_1, w_2, ..., w_m}, m is the number of words in S. The text feature set of S is K = {k_1, k_2, ..., k_n}, where n is the number of text features extracted from each sentence; the i-th text feature extracted from the t-th word is denoted k_i^t. The features used in this paper are the current word and the word spacing. Word vectorization of the text information is performed as

r_w = W_word · V_w

where r_w is the word vector representation of the word w, W_word ∈ R^{l×m} is the text word vector matrix, m indicates the number of words in the sentence, l is the dimension of the word vector, and V_w is the one-hot representation of the word w.
In the same way, word vectorization is performed on each text feature:

r_{k_i} = W_{k_i} · V_{k_i}

where r_{k_i} is the vector representation of the i-th feature and W_{k_i} is the eigenvector matrix of the i-th feature. The vectorization of each word is the concatenation of these vectors; for the t-th word,

x_t = [r_{w_t}; r_{k_1^t}; ...; r_{k_n^t}]

and the final text local feature is the sequence X = {x_1, x_2, ..., x_m}.
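As a concrete illustration of the lookup r_w = W_word · V_w and the concatenation of word and feature vectors, the following is a minimal sketch; the vocabulary, the dimensions and the random matrices are illustrative stand-ins for the word2vec-trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab = {"gas": 0, "explosion": 1, "roadway": 2}   # hypothetical word ids
l = 4                                              # word-vector dimension
W_word = rng.standard_normal((l, len(vocab)))      # word-vector matrix, l x |V|

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

def word_vector(word):
    # r_w = W_word @ V_w : column lookup via the one-hot vector
    return W_word @ one_hot(vocab[word], len(vocab))

def sentence_matrix(words, spacings, d_k=2):
    # Concatenate each word vector with a (toy) word-spacing feature
    # vector; rows = words, columns = word-vector dims + feature dims.
    rows = []
    for w, s in zip(words, spacings):
        k = np.full(d_k, float(s))                 # toy spacing feature
        rows.append(np.concatenate([word_vector(w), k]))
    return np.stack(rows)

S = sentence_matrix(["gas", "explosion", "roadway"], [0, 1, 2])
print(S.shape)   # (3, 6): 3 words, l + d_k features each
```

Each sentence thus becomes an m × (l + d_k) matrix, which is the multidimensional input matrix described above.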

Bi-MGU model
Figure 2 shows the proposed Bi-MGU model. It consists of two layers: (1) the front-to-back MGU layer, which captures the preceding feature information; and (2) the back-to-front MGU layer, which captures the following feature information. By combining the forward and backward features, we obtain global context information, which is helpful for sequence modelling tasks. Each training sequence has two MGU units that move forward and backward, respectively, and both layers are connected to an output layer. The proposed model overcomes the drawback of the traditional unidirectional MGU model, which captures data in only one direction.
The state update of the front-to-back MGU layer is given by

→h_t = f(W_{x→h} x_t + W_{→h→h} →h_{t−1} + b_{→h})

where →h_t is the state of the front-to-back hidden layer at time t, →h_{t−1} is the state of the front-to-back hidden layer at time t − 1, x_t is the input at time t, W_{x→h} and W_{→h→h} are weight matrices, and b_{→h} is a bias term.

The state update of the back-to-front MGU layer is

←h_t = f(W_{x←h} x_t + W_{←h←h} ←h_{t+1} + b_{←h})

where ←h_t is the state of the back-to-front hidden layer at time t, ←h_{t+1} is the state of the back-to-front hidden layer at time t + 1, and x_t is the input at time t. The combined results of the two MGU layers are fed into the output layer, which is calculated as follows.
y_t = W_{→h y} →h_t + W_{←h y} ←h_t + b_y

where y_t is the output at time t, W_{→h y} and W_{←h y} are weight matrices, and b_y is a bias term.
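The bidirectional combination above can be sketched numerically as follows; a plain tanh recurrence stands in for the full MGU cell, and all weights are random illustrative values rather than trained parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_h, d_out, T = 3, 4, 2, 5

W_xh = rng.standard_normal((d_h, d_in)) * 0.1
W_hh = rng.standard_normal((d_h, d_h)) * 0.1
W_fy = rng.standard_normal((d_out, d_h)) * 0.1   # forward-state output weights
W_by = rng.standard_normal((d_out, d_h)) * 0.1   # backward-state output weights
b_y = np.zeros(d_out)

def run(xs):
    # One recurrent pass over the sequence (tanh stand-in for MGU).
    h = np.zeros(d_h)
    states = []
    for x in xs:
        h = np.tanh(W_xh @ x + W_hh @ h)
        states.append(h)
    return states

xs = [rng.standard_normal(d_in) for _ in range(T)]
fwd = run(xs)                      # front-to-back pass
bwd = run(xs[::-1])[::-1]          # back-to-front pass, re-aligned to time t
# y_t = W_{->h y} h->_t + W_{<-h y} h<-_t + b_y
ys = [W_fy @ f + W_by @ b + b_y for f, b in zip(fwd, bwd)]
print(len(ys), ys[0].shape)
```

At every time step the output mixes a state that has seen the preceding context with one that has seen the following context, which is the source of the global context information.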
Each node in Fig. 2 is an MGU unit. The MGU has only one gated structure, which combines the input gate (reset gate) with the forget gate (update gate). Compared to the LSTM with three gated structures and the GRU with two gated structures, the structure of the MGU is simpler and contains fewer parameters, as shown in Fig. 3.
As can be seen from Fig. 3, we have

f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
h̃_t = tanh(W_h · [f_t ⊙ h_{t−1}, x_t] + b_h)
h_t = (1 − f_t) ⊙ h_{t−1} + f_t ⊙ h̃_t

where h_{t−1} and h_t are the states of the hidden layer at time t − 1 and t, respectively. x_t is the input at time t. f_t is the activation of the gated structure at time t. h̃_t is the short-term memory item. W_f and W_h are weight matrices. b_f and b_h are bias terms. ⊙ is the component-wise product between two vectors.

Fig. 2 The proposed Bi-MGU model
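A minimal single-step implementation of the gate equations above can be sketched as follows, under the standard MGU formulation (one gate f_t and a candidate state h̃_t); the weight shapes and the random initialisation are illustrative assumptions, not the paper's trained parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MGUCell:
    """Single MGU cell: one gate f_t shared as forget/update gate."""
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        d = input_dim + hidden_dim
        self.W_f = rng.standard_normal((hidden_dim, d)) * 0.1  # gate weights
        self.b_f = np.zeros(hidden_dim)
        self.W_h = rng.standard_normal((hidden_dim, d)) * 0.1  # candidate weights
        self.b_h = np.zeros(hidden_dim)

    def step(self, x_t, h_prev):
        # f_t = sigma(W_f [h_{t-1}, x_t] + b_f)
        f_t = sigmoid(self.W_f @ np.concatenate([h_prev, x_t]) + self.b_f)
        # h~_t = tanh(W_h [f_t * h_{t-1}, x_t] + b_h)
        h_tilde = np.tanh(self.W_h @ np.concatenate([f_t * h_prev, x_t]) + self.b_h)
        # h_t = (1 - f_t) * h_{t-1} + f_t * h~_t
        return (1.0 - f_t) * h_prev + f_t * h_tilde

cell = MGUCell(input_dim=3, hidden_dim=4)
h = np.zeros(4)
for x in [np.ones(3)] * 5:      # unroll five identical time steps
    h = cell.step(x, h)
print(h.shape)                   # (4,)
```

Since h_t is a convex combination of the previous state and a tanh candidate, the state stays bounded, which is one reason the gated update alleviates the vanishing gradient problem mentioned later.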

Attention mechanism and pooling
Based on the observation that the relations to be classified differ in the importance of the words in a sentence, this paper adopts a word-level attention weight matrix to capture the information associated with the target relations. Since the attention mechanism can automatically adjust the weights, the deep learning model can focus on the parts that are more important to the task goal. The weight is calculated as

a_t = exp(f(y_t, n)) / Σ_{k=1}^{l} exp(f(y_k, n))

where a_t is the weight of the vector, normalized by using Softmax; y_t is calculated automatically in the attention mechanism; and l is the number of vectors to be assigned weights. f is a scoring function of the vector y_t and the vector n, and can take different forms. This paper adopts

f(y_t, n) = V_a^T tanh(W_a y_t + U_a n)

where V_a is the weight vector, and W_a and U_a are the weight matrices.
This paper uses formula (12) to link the output of each step of the hidden layer of the bidirectional MGU model with the influencing factors; the output of each step of the hidden layer is then weighted to obtain the representation of the sentence:

y = Σ_{t=1}^{l} a_t y_t

where y_t is the output of the t-th step of the hidden layer, n is the vector corresponding to the factors that affect the weights, l is the sentence length, and y is the final output, which is used as the representation of the sentence.

Fig. 3 An illustration on the structure of MGU
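The scoring, normalisation and weighted-sum steps above can be sketched as follows; the dimensions and random matrices are illustrative, and n stands in for the learned influence vector.

```python
import numpy as np

rng = np.random.default_rng(2)
d, l, d_a = 4, 5, 3                       # hidden dim, sentence length, attention dim

Y = rng.standard_normal((l, d))           # hidden outputs y_1 .. y_l
n = rng.standard_normal(d)                # influence vector (assumed learned)
W_a = rng.standard_normal((d_a, d)) * 0.1
U_a = rng.standard_normal((d_a, d)) * 0.1
V_a = rng.standard_normal(d_a)

# f(y_t, n) = V_a^T tanh(W_a y_t + U_a n)
scores = np.array([V_a @ np.tanh(W_a @ y + U_a @ n) for y in Y])

# Softmax normalisation gives the attention weights a_t
a = np.exp(scores - scores.max())
a /= a.sum()

# y = sum_t a_t y_t : the weighted sentence representation
sentence = (a[:, None] * Y).sum(axis=0)
print(a.sum(), sentence.shape)
```

Subtracting the maximum score before exponentiating is a standard numerical-stability trick and does not change the softmax result.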
In order to consider more contextual semantic associations and obtain features that are more relevant to the relation classification task, this paper uses an attention-based pooling method. First, the sentence vector after the Bi-MGU layer is multiplied by the attention weight matrix to obtain the corresponding output features F = {F_1, F_2, ..., F_m}. Then, a max pooling operation is used to obtain the most significant feature representation:

d = max(F_1, F_2, ..., F_m)

where d is the overall characteristic of the text after pooling. Since the feature dimension after pooling is fixed, the problem of varying text sentence lengths is solved. Finally, the SoftMax classifier is used to predict the relation category labels. Based on the ACE2005 standard and the marked corpus, this experiment extracts 7 types of relations: location, causality, occurrence, responsibility, part-whole, possession and others. Among them, the location relation describes geographical location; the causality relation describes a causal connection or mutual influence between concepts; the occurrence relation denotes a fact that has occurred; the responsibility relation usually exists between concepts such as personnel and institutions; the part-whole relation represents a hierarchical structure of two concepts; and the possession relation generally includes usage, adoption and so on. Except for the above 6 relations, the remaining relations are labelled as the others relation. Next, the dataset is divided into a training corpus of 16,544 items and a test corpus of 3,496 items.
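The pooling-and-classification step can be sketched as follows, with random stand-in features and classifier weights; only the shapes and operations mirror the description above.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d_f, n_rel = 6, 4, 7                    # words, feature dim, 7 relation classes

F = rng.standard_normal((m, d_f))          # features F_1 .. F_m after attention
d_vec = F.max(axis=0)                      # max pooling: fixed size regardless of m

W_c = rng.standard_normal((n_rel, d_f)) * 0.1
logits = W_c @ d_vec
probs = np.exp(logits - logits.max())
probs /= probs.sum()                       # SoftMax over the relation labels
print(d_vec.shape, int(probs.argmax()))
```

Because the max is taken over the word axis, d always has dimension d_f, which is how pooling removes the dependence on sentence length.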

Performance evaluation
The corpus used here consists of coal mine accident cases and coal mine accident analysis reports, which are crawled from the coal mine safety net, the coal mine accident net and the safety management network. First, the corpus entries that do not contain any concepts are deleted. Then, each piece of data is described as a tuple <concept1, concept2, word spacing, relation type, sentence>.
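The preprocessing step can be sketched as follows; the concept list, the English toy sentences, and the "unlabelled" placeholder relation are illustrative assumptions (the actual corpus is annotated with relation types by experts).

```python
# Minimal sketch: drop sentences containing fewer than two known concepts
# and describe the rest as <concept1, concept2, word spacing, relation
# type, sentence> tuples.
CONCEPTS = {"gas", "explosion", "ventilation", "roadway"}  # toy concept list

def to_tuples(sentences):
    out = []
    for s in sentences:
        words = s.split()
        found = [(i, w) for i, w in enumerate(words) if w in CONCEPTS]
        if len(found) < 2:
            continue                      # delete data without a concept pair
        (i1, c1), (i2, c2) = found[0], found[1]
        spacing = i2 - i1                 # word spacing between the two concepts
        out.append((c1, c2, spacing, "unlabelled", s))
    return out

tuples = to_tuples([
    "the gas concentration caused the explosion",
    "the shift ended at noon",            # contains no concepts: removed
])
print(tuples)
```

Only the first sentence survives, yielding one tuple whose word spacing counts the tokens between the two concepts.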

Results and discussion
RNN is a key technology for processing time-series data in deep learning. However, RNNs suffer from the vanishing gradient problem, which makes it difficult for the network to learn long-distance dependences and restricts the practical application of RNNs. The long short-term memory network (LSTM), gated recurrent unit (GRU) and minimal gated unit (MGU) are widely used RNN variants, which can alleviate the vanishing gradient problem of RNNs through gating mechanisms [39,40]. However, LSTM and GRU have complex internal structures, i.e. LSTM and GRU have three and two gated structures, respectively. MGU can achieve the same function as LSTM and GRU using one gated structure. We choose the MGU model due to its simpler structure and fewer parameters. Additionally, a unidirectional MGU model can only capture data in one direction, i.e. the preceding feature information. To obtain global context information, this paper proposes a Bi-MGU model by adding a back-to-front MGU layer to capture the following feature information.
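The parameter saving can be estimated with a back-of-the-envelope count, assuming the standard formulations in which LSTM has four weight blocks (three gates plus a candidate state), GRU has three, and MGU has two; the dimensions chosen are illustrative.

```python
def cell_params(n_blocks, d_in, d_h):
    # Each block: a weight matrix of shape (d_h, d_in + d_h) plus a bias (d_h).
    return n_blocks * (d_h * (d_in + d_h) + d_h)

d_in, d_h = 100, 128
lstm = cell_params(4, d_in, d_h)   # 3 gates + candidate
gru = cell_params(3, d_in, d_h)    # 2 gates + candidate
mgu = cell_params(2, d_in, d_h)    # 1 gate + candidate
print(lstm, gru, mgu)              # MGU needs exactly half the LSTM parameters
```

Under this count the MGU cell has half as many recurrent parameters as the LSTM cell, which is consistent with its shorter training time in Table 1.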
We compare the proposed Bi-MGU approach with the current benchmarks in the literature [38,40] in terms of training time, accuracy, recall and F-Score. Table 1 shows that the proposed Bi-MGU model has the minimum training time. This confirms that the simpler the model structure and the fewer the training parameters, the less training time is required. The performance of the different models is evaluated on different types of relations, as shown in Figs. 4, 5 and 6.
The extraction accuracy of the location, causality, occurrence, and others relations with the Bi-MGU model is higher than that with the LSTM and GRU models. However, the extraction accuracy of the part-whole relation is not as high as with the LSTM and GRU models. Compared with the LSTM and GRU models, the Bi-MGU model has a higher recall and F-Score in extracting the causality, part-whole and possession relations. However, the recall of extracting the others relation is not as high as with the LSTM and GRU models. The three models have similar F-Scores when extracting the causality, responsibility and possession relations. In summary, the proposed Bi-MGU model performs well. This is because the Bi-MGU model has the simplest structure; as the length of the sequence increases, the Bi-MGU is more likely to achieve the desired result in a shorter period. As shown in Fig. 7, the average results over all relation types give a more reliable comparison. The average extraction accuracy of the location, part-whole and possession relations is much higher than the recall. This shows that these three relations are more likely to be misjudged as the remaining relations, while the remaining relation types are rarely misjudged as these relations. This is because the number of instances of these three relation types in the dataset is small, whereas the occurrence, responsibility and causality relations occur frequently. The average extraction accuracy, recall and F-Score of the others relation are relatively low, because its location and sentence structure are not fixed and the concepts within the relation are irregular, leading to unobvious features.

Conclusions
This paper proposed a new relation extraction approach based on the Bi-MGU model to extract information from big data for disaster precaution and emergency response during coal mine production. This is achieved by adding a back-to-front MGU layer to the original MGU model. The proposed approach does not require complex text features and can capture global context information by combining the forward and backward features. Based on the ACE2005 standard and the marked corpus, the experimental results show that our approach outperforms the existing initiatives in terms of training time, accuracy, recall and F-Score.
This paper is mainly based on pre-defined relations and has a limited ability to extract undefined relations in text. In the future, we will focus on open relation extraction. In addition, an open relation is determined by the core verb, which can be polysemous; this leads to semantic uncertainty of the extracted relation in practical applications. Therefore, how to disambiguate relations is another direction of our future work.

Fig. 1 An illustration on the network structure of the relation extraction

4.1 Experimental description
All the neural network models are implemented in the Google open-source deep learning framework TensorFlow v1.2 (Windows 10, 64 bit). The performance of the Bi-MGU model proposed in this paper is analysed by comparing it against the LSTM, GRU and MGU models in terms of training time, relation extraction accuracy, recall rate and F value.

Fig. 4
Fig. 4 Comparison on accuracy of different relation extraction approaches, where each type of relation is used as a data point to tease out the effectiveness of the proposed model

Fig. 5 Comparison on recall of different relation extraction approaches, where each type of relation is used as a data point to tease out the effectiveness of the proposed model
Fig. 6 Comparison on F-Score of different relation extraction approaches, where each type of relation is used as a data point to tease out the effectiveness of the proposed model

Fig. 7
Fig. 7 Comparison on average accuracy, recall and F-Score

Table 1
Comparison on training time of different models