The analysis of financial market risk based on machine learning and particle swarm optimization algorithm

The financial industry is key to promoting the development of the national economy, and the risk it carries is also the largest hidden risk in the financial market. Therefore, the risk existing in the current financial market should be explored in depth under blockchain technology (BT) to safeguard the functions of financial markets. The risk of financial markets is analyzed using machine learning (ML) and random forest (RF). First, the clustering method is introduced, and an example is given to illustrate the RF classification model. The collected data sets are divided into test sets and training sets, the corresponding rules are formulated and generated, and the branches of the decision tree (DT) are constructed according to the optimization principle. Finally, the steps of constructing the branches of the DT are repeated until no further branching is possible. The results show that, from 2016 to 2020, the primary industry accounts for 3.5%, 3.7%, 3.2%, 3.4%, and 3.8% of the regional GDP, respectively, the secondary industry makes up 44.5%, 43%, 45.1%, 44.8%, and 43.6%, respectively, and the tertiary industry occupies 51.8%, 52.3%, 52.9%, 54%, and 54.6%, respectively. This shows that with the development of the industrial structure under BT, the economic center gradually shifts from the primary industry to the tertiary industry; BT can improve the efficiency of the financial industry and reduce operating costs and dependence on intermediaries. Meanwhile, the financial features of BT can provide a good platform for business expansion. The application of BT to the supply chain gives a theoretical reference for promoting the synergy between companies.

users. Because it stores data in a distributed manner and completes point-to-point transactions, BT is widely used in many application scenarios, such as securities, banking, and insurance [9,14,15]. It is a distributed shared ledger and database characterized by decentralization, tamper resistance, and traceability. Therefore, it can significantly curb the hidden dangers and risks brought by the explosive growth and expansion of the financial market, and solve problems in the financial field such as high credit risk, low capital utilization efficiency, and high payment processing cost.
Scholars all over the world have done extensive research on the risk of financial markets. Financial risks [16][17][18] refer to the losses an entity may incur through its activities in the financial field. Relevant scholars have confirmed that a portfolio can reduce the probability of risks. Markowitz proposed the mean-variance model, but its shortcomings are also obvious: its assumptions are too strict to adapt to the changing financial market. Subsequently, British economists, Fishburn, and Stone improved and optimized the mean semi-variance model, but the model is only suitable for ordinary scenarios. Japanese scholars introduced the first-order absolute deviation method into portfolio theory and used it as the standard to measure risks. Some domestic scholars use generalized autoregressive conditional heteroskedasticity (GARCH) models to study financial risks, and their research provides a solution for the development of financial markets. Zhao et al. (2021) [19] studied the relationship between financial risks and global climate change and evaluated the relationship between global financial risks and carbon dioxide emissions. The results show that technological innovation and financial risk significantly affect global carbon emissions across the 10 quantiles. Kim et al. (2020) [20] conducted a case study on predicting financial risk behavior, extracting features from structured data for deep learning (DL). The results show that DL can classify traders' financial risks more accurately. Yang et al. (2019) [21] studied a financial risk management model of the Internet supply chain based on data science. The empirical results show that the model has high accuracy in data evaluation. In short, the research on financial risk assessment is carried out based on hedging combinations. In addition, relevant scholars have constructed a portfolio model with transaction costs and put forward specific solution methods and the conditions they require.
As for the risk in international financial markets, most relevant researchers have also introduced methods to reduce the risk. However, there is little research on preventing financial market risks based on BT, and none conducted under machine learning (ML) and particle swarm optimization (PSO). Therefore, this study investigates the prevention of financial market risks based on BT, ML, and PSO. The innovation is that artificial intelligence (AI) and DL are used to find possible risks, and PSO and BT are used to analyze the risk of financial markets. The research provides a theoretical basis for subsequent work. The paper is structured as follows: the first section introduces the research background and relevant literature; the second presents PSO and the maximum likelihood clustering algorithm; the third analyzes the experimental results using the maximum likelihood algorithm, BT, and PSO; the fourth summarizes the empirical conclusions.

The clustering analysis algorithm based on ML and DL
ML is an interdisciplinary subject that draws on many fields, such as statistics, algorithm complexity theory, and probability theory. An ML algorithm learns by simulating human learning behavior to obtain new knowledge and information, and it can continuously improve its performance in the process of operation. From the academic perspective, DL belongs to ML; in other words, DL is a special case of ML. The power and flexibility of DL lie in learning to represent the real world as a hierarchy of nested concepts. More specifically, in DL any concept can be defined in terms of simpler concepts, and abstract things can be made concrete and computed.
The ML algorithm [22][23][24] used here consists of two parts: the clustering method and the RF [25][26][27] classification model. The clustering method learns from unlabeled data and then analyzes and interprets the internal relationships of the data. Specifically, it adopts the mathematical concept of a set: the samples are partitioned into multiple disjoint subsets such that the samples within each subset are similar and the samples in different subsets are dissimilar. The mathematical model can be stated as follows. Let D be the set of samples, which contains many unlabeled samples, each expressed as a multi-dimensional feature vector. The clustering method then finds K subsets $C_1, \dots, C_K$ of D, each of which is called a class cluster. Figure 1 illustrates the scheme of clustering method analysis.
Based on the above description and the principle analysis diagram of clustering [28][29][30], each collected sample is matched to a class cluster according to its fitness for that cluster, so that every sample is assigned to exactly one class cluster. In this way, the sample set is clustered. Equation (1) expresses the specific calculation of the objective function. In Eq. (1), $y_j$ is the centroid of the cluster $C_j$, and the function d(x, y) is a similarity measurement function that measures the similarity between sample x and centroid y. Clustering is thus a common and powerful method for finding similarity relationships between sample sets and summarizing the characteristics of the data in each set.
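The clustering objective described above can be illustrated with a minimal sketch. It assumes that d(x, y) is the squared Euclidean distance, that the objective is the sum of d(x, y_j) over every sample x in every cluster C_j, and that clusters are refined with the standard k-means (Lloyd) iteration. None of these choices is stated in the text, and all function names and the toy data are hypothetical.

```python
def sq_dist(x, y):
    """Assumed similarity measure d(x, y): squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def objective(clusters, centroids):
    """Sum of d(x, y_j) over every sample x in every cluster C_j."""
    return sum(sq_dist(x, y) for c, y in zip(clusters, centroids) for x in c)


def kmeans(samples, k, iters=100):
    """Partition samples into k disjoint clusters with Lloyd's iteration."""
    # Simplistic deterministic start (an assumption): evenly spaced samples.
    centroids = [samples[i * len(samples) // k] for i in range(k)]
    clusters = []
    for _ in range(iters):
        # Assignment step: each sample joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for x in samples:
            j = min(range(k), key=lambda idx: sq_dist(x, centroids[idx]))
            clusters[j].append(x)
        # Update step: recompute centroids; keep the old one if a cluster is empty.
        new_centroids = [centroid(c) if c else centroids[j]
                         for j, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids


# Toy data: two well-separated groups of unlabeled two-dimensional samples.
samples = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
clusters, centroids = kmeans(samples, k=2)
total = objective(clusters, centroids)
```

Each sample ends up in exactly one cluster, and `total` is the value of the assumed objective for the final partition; a lower value means the samples sit closer to their centroids.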
The RF algorithm is based on the random subspace method. Researchers carried out repeated sampling experiments in the random subspace and finally proposed the RF algorithm. RF completes learning tasks in a distinctive way: as an ensemble learning model of DTs, it constructs a large number of classifiers and combines them according to a certain method, which is why RF can complete the learning task well. In the experiment, a new sample set is generated for each tree by the bootstrap (self-help) sampling method, that is, by randomly drawing samples with replacement. Figure 2 depicts the final structure of the binary decision tree.
The large differences between base learners may stem from the randomness of the collected samples, which has a great impact on the whole learning result. When a single DT performs optimal feature segmentation, the candidate feature set is randomly selected from the current feature set. Randomness can be controlled through the size of the feature subset. Equations (2) to (6) formalize this process, where H(x) represents the classification effect evaluation function of the classifier.
In Eq. (4), L represents the number of base classifiers. According to ensemble learning, the final voting result is obtained by Eqs. (5) and (6), in which x represents the sample and y represents the centroid. The construction of the RF model consists of several steps. Firstly, the collected data sets are divided into two parts, a test set and a training set. Secondly, the corresponding rules are formulated and generated according to certain methods. Thirdly, the branches of the DT are constructed according to the optimal principle. Fourthly, branching is repeated until it cannot be continued. Figure 3 demonstrates the detailed process of RF generation.
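The three ingredients named above (bootstrap sampling, a randomly chosen feature subset per base learner, and majority voting) can be sketched as follows. For brevity the base learner is a one-split decision stump rather than a fully grown DT, which is a simplification and not the model used in the study; the function names, parameter values, and toy data are hypothetical.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def fit_stump(X, y, features):
    """Pick the (feature, threshold) split with the lowest weighted Gini impurity."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    if best is None:  # no valid split: always predict the majority class
        majority = Counter(y).most_common(1)[0][0]
        return (0, float("inf"), majority, majority)
    return best[1:]


def predict_stump(stump, row):
    """Follow the single split of a stump and return the label of that branch."""
    f, t, left_label, right_label = stump
    return left_label if row[f] <= t else right_label


def fit_forest(X, y, n_trees=25, n_features=1, seed=0):
    """Train n_trees stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # draw with replacement
        feats = rng.sample(range(len(X[0])), n_features)      # random subspace
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def predict_forest(forest, row):
    """Majority vote over the predictions of all base learners."""
    votes = Counter(predict_stump(s, row) for s in forest)
    return votes.most_common(1)[0][0]


# Toy training set with two informative features and two classes.
X = [(1.0, 10.0), (2.0, 11.0), (3.0, 12.0), (7.0, 20.0), (8.0, 21.0), (9.0, 22.0)]
y = [0, 0, 0, 1, 1, 1]
forest = fit_forest(X, y)
low = predict_forest(forest, (0.0, 9.0))
high = predict_forest(forest, (10.0, 25.0))
```

The `n_features` argument is the size of the feature subset, which is the knob the text describes for controlling randomness. Because the vote is taken over many learners, an occasional poor bootstrap sample has little effect on the final prediction.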
In an RF, the prediction results of the individual DTs are independent of each other. Because ensemble learning combines these independent predictions, the performance of RF is clearly better than that of a single DT.
The data set contains many characteristic attributes, some useful and others useless. They are therefore divided into two categories: relevant features and irrelevant features. On this basis, the Gini coefficient is used to select the features of the data. The importance score of a feature is denoted VIM (variable importance measure), and the score of each feature is denoted $VIM_i$, which is the mean change in node-splitting purity caused by the i-th feature in the DTs. Equation (7) displays the calculation process.
In Eq. (7), K is the number of sample categories and $p_{mk}$ is the proportion of the k-th category at node m. On this basis, Eq. (8) gives the importance of feature $t_i$ at node m.
In Eq. (8), $GI_l$ and $GI_r$ are the Gini coefficients of the two new nodes produced by branching at node m. On this basis, Eq. (9) gives the importance of feature $t_i$ in the j-th tree.
To sum up, Eq. (10) gives the importance of feature $t_i$ over the whole RF.
In Eq. (10), n is the number of base classifiers in the RF. Feature selection mainly includes four aspects: subset generation, subset evaluation, the stop criterion, and result verification. The generation of each subset is equivalent to a search process, in which the appropriate optimal subset is selected step by step. Figure 4 shows the process flow of feature selection.
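The Gini-based importance described for Eqs. (7) to (10) can be sketched as follows. The sketch assumes the commonly used forms: the Gini index of node m is 1 minus the sum of the squared class proportions, the node-level importance is the parent index minus the indices of the two children, the tree-level importance is the sum over the nodes that split on the feature, and the forest-level importance is the mean over the n trees. The unweighted child terms follow the description of Eq. (8); many implementations instead weight each child by its share of samples. The data layout and all names are hypothetical.

```python
from collections import Counter


def gini_index(labels):
    """GI_m = 1 - sum_k p_mk^2, with p_mk the share of class k at node m."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def node_importance(parent, left, right):
    """Change in Gini index when node m is split into a left and a right child."""
    return gini_index(parent) - gini_index(left) - gini_index(right)


def tree_importance(splits, feature):
    """Sum the node-level importance over the nodes of one tree that split on `feature`."""
    return sum(node_importance(p, l, r) for f, p, l, r in splits if f == feature)


def forest_importance(forest, feature):
    """Average the tree-level importance over the n trees of the forest."""
    return sum(tree_importance(t, feature) for t in forest) / len(forest)


# Hypothetical forest of two trees. Each split is recorded as
# (feature index, labels at the parent, labels in the left child, labels in the right child).
forest = [
    [(0, [0, 0, 1, 1], [0, 0], [1, 1])],
    [(1, [0, 0, 0, 1], [0, 0, 0], [1])],
]
vim_0 = forest_importance(forest, 0)
vim_1 = forest_importance(forest, 1)
```

In this toy forest, feature 0 separates a balanced node perfectly and receives the larger score (0.25 against 0.1875), so it would be ranked first when relevant features are selected.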
The filtering method first selects the features of the data set and then trains the learner. Its advantage is that it can quickly remove useless noise. Figure 5 presents the process of filtering selection of features.
The wrapping method first improves the learner and then carries out multiple rounds of training. Its selection effect is clearly better than that of filtering selection, but its calculation process is complex and time-consuming. Figure 6 depicts the workflow of wrapping selection.
Embedding selection differs from both filtering selection and wrapping selection: feature selection and learner training are carried out together, so both are improved simultaneously. Figure 7 illustrates the working principle of embedding selection.
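The contrast between filtering and wrapping can be made concrete with a small sketch. It assumes that the filter scores each feature by its absolute Pearson correlation with the label and that the wrapper is a greedy forward search driven by the learner's own accuracy; neither choice is specified in the text. The nearest-class-mean learner, the function names, and the toy data are hypothetical stand-ins.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / math.sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def filter_select(X, y, k):
    """Filtering: rank features by |correlation with y| before any learner is trained."""
    n_features = len(X[0])
    scores = [abs(pearson([row[f] for row in X], y)) for f in range(n_features)]
    return sorted(sorted(range(n_features), key=lambda f: -scores[f])[:k])


def centroid_accuracy(X, y, feats):
    """Hypothetical learner score: training accuracy of a nearest-class-mean rule."""
    means = {}
    for label in set(y):
        rows = [row for row, lab in zip(X, y) if lab == label]
        means[label] = [sum(r[f] for r in rows) / len(rows) for f in feats]
    correct = 0
    for row, lab in zip(X, y):
        pred = min(sorted(means), key=lambda c: sum(
            (row[f] - m) ** 2 for f, m in zip(feats, means[c])))
        correct += pred == lab
    return correct / len(y)


def wrapper_select(X, y, k, evaluate):
    """Wrapping: greedy forward search that re-evaluates the learner for every candidate."""
    chosen = []
    while len(chosen) < k:
        remaining = [f for f in range(len(X[0])) if f not in chosen]
        best = max(remaining, key=lambda f: evaluate(X, y, chosen + [f]))
        chosen.append(best)
    return sorted(chosen)


# Toy data: feature 0 is informative, feature 1 is noise, feature 2 is constant.
X = [(1.0, 5.0, 3.0), (2.0, 1.0, 3.0), (1.5, 4.0, 3.0),
     (8.0, 2.0, 3.0), (9.0, 5.0, 3.0), (8.5, 1.0, 3.0)]
y = [0, 0, 0, 1, 1, 1]
by_filter = filter_select(X, y, k=1)
by_wrapper = wrapper_select(X, y, k=1, evaluate=centroid_accuracy)
```

The filter computes one score per feature and never calls the learner, whereas the wrapper calls `evaluate` once per candidate in every round, which is the extra computation the text attributes to wrapping selection.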

Particle swarm optimization algorithm
PSO is a heuristic swarm intelligence algorithm proposed by Dr. Eberhart and Dr. Kennedy in 1995 to simulate the foraging behavior of birds. It is assumed that there are different food sources in a region, and the task of the birds is to find the largest food source (the global optimal solution). Throughout the search, the birds share information about their locations so that other birds know where the food is. Finally, the entire swarm gathers around the largest food source, where the optimal solution is found, and the problem converges. The specific strategies are as follows:
Firstly, each bird sets out randomly in one direction to find a food source.
Secondly, after flying for one minute, each bird shares the best food source and food stock it has found with the group, and the optimal position of the group is then calculated.
Thirdly, each bird looks back on its path and considers its own optimal location and the group's optimal location to determine its next direction.
Fourthly, if every bird is near the same food source, the search stops; otherwise, the second and third steps are repeated.
In a particle swarm [31][32][33], a single particle is normally a multi-dimensional vector, and a denotes the manually set maximum velocity of a particle. The position and velocity of each particle in the population are adjusted according to Eqs. (11) and (12).
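Consistent with the terms described in the following paragraph, Eqs. (11) and (12) presumably take the standard PSO form. This is a reconstruction under that assumption, with $x_{id}^{t}$ the position of particle i in dimension d at iteration t, $p_{id}^{t}$ the best position found so far by particle i, and $p_{gd}^{t}$ the best position found so far by the whole swarm:

```latex
\begin{align}
v_{id}^{t+1} &= \omega v_{id}^{t} + c_1 r_1 \left(p_{id}^{t} - x_{id}^{t}\right) + c_2 r_2 \left(p_{gd}^{t} - x_{id}^{t}\right) \tag{11} \\
x_{id}^{t+1} &= x_{id}^{t} + v_{id}^{t+1} \tag{12}
\end{align}
```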
In Eqs. (11) and (12), $v_{id}^{t+1}$ represents the velocity of the particle; $\omega v_{id}^{t}$ represents the current state of the particle; $c_1 r_1 (p_{id}^{t} - x_{id}^{t})$ is the part often called "cognition," which mainly reflects the particle's thinking and continuous cognition of its own flight; and $c_2 r_2 (p_{gd}^{t} - x_{id}^{t})$ is the social part, which reflects the particle's ability to learn from and follow the swarm. Here d = 1, 2, …, D indexes the dimension, i = 1, 2, …, NP indexes the particle, and t = 1, 2, … is the iteration count of the algorithm; if t is the current generation, t + 1 is the next generation. $\omega \in (0, 1)$ is the inertia weight and is a random number. $c_1$ and $c_2$ are acceleration constants, usually taken from the interval (0, 2): $c_1$ is the self-learning factor and $c_2$ is the social learning factor. $r_1$ and $r_2$ are pseudo-random numbers in [0, 1].
According to these equations, Fig. 8, the vector diagram of the motion and position of a particle, depicts the position of a particle after each motion in a two-dimensional space. The particle starts from the point $Z_k$, and $Z_{k+1}$ indicates its position after the motion; the subscript k marks the current velocity of the particle, and k + 1 the velocity after the motion. As the particles of the swarm move back and forth, their velocities and positions keep changing, and this process is how the PSO algorithm [34][35][36] obtains the optimal solution. The particles keep moving until the optimal solution is obtained.
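The search procedure described above can be sketched as a short program. The sketch is written as a minimization, so the "largest food source" corresponds to the lowest objective value. The fixed inertia weight, the values of c1 and c2, the box bounds, the velocity clamp `v_max`, the fixed iteration budget used in place of a convergence test, and the sphere test function are all illustrative assumptions rather than settings from the study.

```python
import random


def pso(f, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5,
        bounds=(-5.0, 5.0), v_max=1.0, seed=0):
    """Minimize f over a box using the standard velocity and position updates."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Step 1: every particle starts at a random position with zero velocity.
    x = [[rng.uniform(lo, hi) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p_best = [xi[:] for xi in x]          # best position found by each particle
    p_val = [f(xi) for xi in x]
    g = min(range(n_particles), key=lambda i: p_val[i])
    g_best, g_val = p_best[g][:], p_val[g]  # best position found by the swarm
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Inertia term + cognition term + social term.
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p_best[i][d] - x[i][d])
                           + c2 * r2 * (g_best[d] - x[i][d]))
                v[i][d] = max(-v_max, min(v_max, v[i][d]))     # cap the velocity
                x[i][d] = max(lo, min(hi, x[i][d] + v[i][d]))  # stay inside the box
            val = f(x[i])
            # Steps 2 and 3: update the personal best and share it with the swarm.
            if val < p_val[i]:
                p_best[i], p_val[i] = x[i][:], val
                if val < g_val:
                    g_best, g_val = x[i][:], val
    return g_best, g_val


def sphere(x):
    """Simple convex test function with its minimum of 0 at the origin."""
    return sum(xi * xi for xi in x)


best_x, best_val = pso(sphere, dim=2)
```

The swarm's best value can only improve or stay the same from one iteration to the next, because the global best is replaced only when a particle finds a strictly better position.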
The larger the particle swarm, that is, the more particles moving back and forth, the longer the reaction time. The specific flow of the PSO algorithm is as follows: first, subsets are generated from the original feature set by PSO; then, the set of candidate features is selected from these subsets; afterwards, the subsets are evaluated, and the evaluation results are checked against the stop criterion; finally, depending on this judgment, either the evaluation results are verified, or the procedure returns to the subset generation step and repeats until an appropriate subset is selected. Figure 9 shows the flow of the subgroup algorithm. Equation (13) expresses the model of the PSO algorithm. In Eq. (13), f(x) is a continuous function in space, which means that the optimization is unconstrained.
If $f(x) \in C^1$, any x is full rank in f(x). H = H(x) refers to a continuous symmetric matrix function, and it is bounded and uniformly positive definite on R. In Eq. (14), $\|\cdot\|$ stands for the 2-norm of the vector. Equations (15) to (19) give the notation specified at point x.
In Eq. (15), g(x) represents the gradient function, and $-\nabla f(x)$ represents the negative gradient direction. For the particle swarm optimization matrix, x denotes each element in the matrix, and $\nabla$ indicates the matrix diagonalization. Equation (20) illustrates the calculation of the function to solve the gradient.
In Eq. (20), there are two cases for the parameter, $u_j \le 0$ and $u_j > 0$. Equations (21) and (22) In Eqs. (23) and (24), $x_k$ denotes the input value and k represents the coefficient of the differential operator. Equation (25) illustrates the limitations on the objective function.
In Eqs. (26) and (27), f(x) is a second-order continuous and strictly convex function, and H(x) is taken as the Hessian matrix of f(x).

Financial risk analysis based on maximum likelihood algorithm and BT
Figure 10 shows the regional gross domestic product (GDP) and year-on-year growth rate in 2020. The national annual economic growth data are collected from the website of the National Bureau and sorted. According to the statistical results of economic development in a certain region of China, the GDP of the primary industry is 5.5 billion yuan, an increase of 3.5% over the previous year; the GDP of the secondary industry is 6 billion yuan, an increase of 7.2% over the previous year; and the total output value of the tertiary industry is 8.8 billion yuan, an increase of 8.9% over the previous year.
In 2020, GDP rises from the primary to the secondary to the tertiary industry: the tertiary industry has the highest GDP and the primary industry the lowest. The year-on-year growth rates follow the same order, with the tertiary industry growing fastest and the primary industry slowest. This shows that the economic center begins to shift gradually from the primary to the tertiary industry. Figure 11 shows the growth rate of regional financial investment. From 2016 to 2020, the growth rates of fixed asset investment and real estate investment are, respectively, 1.7% and 2.1% in 2016; 5.7% and 6% in 2017; 14% and 13% in 2018; 12% and 13% in 2019; and 10% and 9% in 2020.
This indicates that the growth of investment in fixed assets and real estate accelerates from 2016 to 2018 and slows thereafter. Figure 12 shows the proportion of the three major industries in GDP. The details are as follows: in 2016, the primary industry accounts for 3.5%, the secondary industry for 44.5%, and the tertiary industry for 51.8%; in 2017, the primary industry accounts for 3.7%, the secondary industry for 43%, and the tertiary industry for 52.3%; in 2018, the primary industry accounts for 3.2%, the secondary industry for 45.1%, and the tertiary industry for 52.9%; in 2019, the primary industry accounts for 3.4%, the secondary industry for 44.8%, and the tertiary industry for 54%; in 2020, the primary industry accounts for 3.8%, the secondary industry for 43.6%, and the tertiary industry for 54.6%.
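The shift in these shares can be checked with simple arithmetic. The sketch below only tabulates the percentages quoted above for Figure 12 and computes the change in each share between 2016 and 2020; the variable names are illustrative.

```python
# Shares of regional GDP by industry (percent), as quoted for Figure 12:
# (primary, secondary, tertiary) for each year.
shares = {
    2016: (3.5, 44.5, 51.8),
    2017: (3.7, 43.0, 52.3),
    2018: (3.2, 45.1, 52.9),
    2019: (3.4, 44.8, 54.0),
    2020: (3.8, 43.6, 54.6),
}

# Change in each share from 2016 to 2020, in percentage points.
change = tuple(round(end - start, 1)
               for start, end in zip(shares[2016], shares[2020]))

# Tertiary share by year, to see whether it rises every year.
tertiary = [shares[year][2] for year in sorted(shares)]
rises_every_year = all(a < b for a, b in zip(tertiary, tertiary[1:]))
```

Over the five years the tertiary share gains 2.8 percentage points and rises in every year, while the secondary share loses 0.9 points and the primary share moves by only 0.3 points.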
This shows that the primary industry develops steadily, the development of the secondary industry is not stable, and the position of the tertiary industry improves gradually. Figure 13 depicts the factors affecting the regional economy. According to the regional economic development, various indicators are measured, and the results are analyzed. Figure 13a shows the indicators of regional economic development level. In 2018, the growth rate of GDP is 7.6%, the growth rate of fixed asset investment is 13%, the growth rate of real estate investment is −3.2%, and the growth rate of the consumer price index (CPI) is 2%; in 2019, they are 8.1%, 12.3%, −6.1%, and 2.1%; in 2020, they are 8.3%, 11%, −8.9%, and 1.5%, respectively. Figure 13b shows the regional government regulation indicators. In 2018, the growth rate of fiscal revenue is 8.5%, the growth rate of fiscal expenditure is 9%, and the proportion of fiscal revenue in GDP is 27.2%; in 2019, they are 26.5%, 23.1%, and 30.3%; in 2020, they are 7.1%, 13.2%, and 29.8%, respectively. Figure 13c shows the indicators of enterprise operation. In 2018, the asset/liability ratio of enterprises is 52.9%, the growth rate of enterprise income is 7.3%, the profit rate of enterprises is 7.45%, and the growth rate of total exports is −8.1%; in 2019, they are 53.25%, 9%, 8.56%, and 39%; in 2020, they are 54.3%, 8.2%, 6.4%, and 36.2%, respectively. The above shows that the growth rate of GDP continues to rise from 2018 to 2020 and the proportion of real estate investment gradually increases; the growth rate of fiscal revenue decreases and the proportion of fiscal revenue in GDP gradually increases; the asset/liability ratio gradually increases, while the growth rate of total exports continues to decline.
Figure 14 shows the risk factors of regional financial institutions. Figure 14a shows the performance indicators of the industry. In 2018, the non-performing loan ratio is 2.3%, the provision coverage ratio is 154%, the return on assets is 11%, and the deposit loan ratio is 80%; in 2019, they are 1.7%, 202%, 16%, and 83%; in 2020, they are 1.6%, 235%, 15%, and 86%, respectively. Figure 14b shows the performance indicators of securities. In 2018, the growth rate of total securities transactions is −35.1%, and the proportion of securities industry revenue in financial industry revenue is 4.7%; in 2019, they are 3.6% and 4.5%; in 2020, they are 1% and 5%, respectively. Figure 14c shows the insurance performance indicators. In 2018, the growth rate of premium income is 50%, the loss rate is 29.1%, and the insurance coverage rate is 7.3%; in 2019, they are 8.2%, 4.1%, and 6.9%; in 2020, they are −2.5%, 16%, and 6.1%, respectively.

Conclusion
With the rapid development of the world economy, the risk of the financial industry also increases. On this basis, BT is introduced to address the risk problem in financial markets. It stores data in a distributed manner and completes point-to-point transactions. Therefore, it is combined with DL and AI to prevent and control the risk in the financial market. The cluster analysis algorithm and PSO based on ML and DL are used to analyze and evaluate the financial risk, and the results are summarized as follows: (1) BT has the advantages of decentralization and trustlessness, which enable the financial industry to innovate, improve working efficiency, reduce operating costs, and weaken its dependence on intermediaries; it can also connect information between enterprises, promote cooperation between them, and improve service quality; (2) the financial risk monitoring system plays a good indicator detection role for China's regional economy, and has obvious advantages in controlling economic situations; (3) BT can provide a good platform for business expansion, effectively enhance the synergy between companies, and help companies obtain effective information quickly and conveniently. Based on the above, the study can provide a reference for future research. However, there are also some deficiencies. For example, the sample size is small, which makes the conclusions one-sided. In the future, the sample size will be expanded, and more attention will be paid to interpreting financial risks.