The development of multicore hardware has provided many new development opportunities for many application software algorithms. Especially, the algorithm with large calculation volume has gained a lot of room for improvement. Through the research and analysis, this paper has presented a parallel PO-Dijkstra algorithm for multicore platform which has split and parallelized the classical Dijkstra algorithm by the multi-threaded programming tool OpenMP. Experiments have shown that the speed of PO-Dijkstra algorithm has been significantly improved. According to the number of nodes, the completion time can be increased by 20–40%. Based on the improved heterogeneous dual-core simulator, the Dijkstra algorithm in Mi Bench is divided into tasks. For the G.72 encoding process, the number of running cycles using “by function” is 34% less than using “divided by data,” while the power consumption is only 83% of the latter in the same situation. Using “divide by data” will reduce the cost and management difficulty of real-time temperature. Using “divide by function” is a good choice for streaming media data. For the Dijkstra algorithm, the data is data without correlation, so using a simpler partitioning method according to the data partitioning can achieve good results. Through the simulation results and the analysis of the results of real-time power consumption, we conclude that for data such as strong data correlation of streaming media types, using “divide by function” will have better performance results; for data types where data correlation is not very strong, the effect of using “divide by data” is even better.

Introduction

Hardware developers’ sight moves away from pure performance improvements, diverting to the development of new hardware architecture–multicore processors. From the point of the design concept of this processor, each core is a logically complete individual, and can run the program independently, all of whose control signals and command signals are provided by the bus controller [1]. Therefore, the multicore processor has a natural advantage in performance because it is a multicore work at the same time, which can easily provide higher computational performance [2]. However, in order to let the single-core processor achieve the same effect, you need make hardware to achieve a high clock frequency. At the same time, as multicore processors run at lower clock frequencies, you do not have to worry the hardware about the toughest power dissipation and cooling issues in the single-core processors [3].

From high-performance servers to low-power embedded processors, multicore architectures have become more widely used in today’s everyday life [4,5,6]. At the same time, multicore architecture simulators are also widely concerned by many researchers [7, 8]. The researchers designed a heterogeneous multicore simulator based on Simple Scalar that includes an ARM core and a PISA core [9]. The simulator takes into account shared memory and bus expansion, and can implement Simple Scalar modules for multiple different instruction sets by using System C. In the previous study of single-core processors, the use of instruction-level parallelism in parallel computing is to improve the overall performance of the system, but later, ILP is proved to be an inefficient method in multiprocessor systems. Researchers are beginning to look for new ways to improve system performance [10]. Task parallel-level algorithms or coarse-grained parallel algorithms are more efficient in multiprocessor systems [11]. While semiconductor process technology has made tremendous progress, multicore processor architectures have been implemented by researchers in the research and improvement of multiprocessor architectures [12]. Therefore, in order to improve the performance of the multicore processor and make greater use of the advantages of the multicore processor, the task parallel algorithm originally in the multiprocessor system is transferred to the multicore processor system [13].

At the same time, in the field of multicore processors, many researchers hope to achieve system performance by dividing tasks in multicore processors or multiprocessor systems [14,15,16]. However, if there is no suitable partitioning, single-threaded programs cannot fully exploit the parallel advantages of multicore processor systems or multiprocessor systems. In the field of multicore processor task partitioning, many interesting phenomena are being or have been studied [17]. However, the research work of previous researchers either focused on performance or focused on the multicore processors themselves, they did not consider the power consumption of inter-core communication in multicore systems [18, 19]. In order to illustrate the problem of inter-core power consumption, we have explored and tried the task partitioning methods of various applications in the simulators that have been made in the previous period with the ability to measure real-time performance, power consumption, and inter-core communication. By analyzing the simulation results, we can find ways to optimize the application partitioning while reducing the performance degradation and system power consumption caused by inter-core communication.

The rest of this paper is organized as follows. Section 2 discusses related work, followed by the methodology is discussed in Section 3. Result analysis and discussion is discussed in Section 4. Section 5 shows the simulation experimental results, and Section 6 concludes the paper with summary and future research directions.

Related work

Multicore processors dominate in volume, and while individual cores may not necessarily be fastest, overall, they are far superior in speed and performance to previous single-core processors due to their large amount and parallel operation. Due to the hierarchical structure of the computer, the application layer software invokes the system hardware structure through the operating system software [20]. With the changes of the underlying hardware, the main problem facing multicore software development is how to effectively utilize the advantages of multicore hardware. The development of multicore software needs to consider two situations: the development of multicore operating system software and the development of multicore application software. Application software has stayed in the single-threaded world for years because single-core processors only work in sequence essentially. Only by using parallel programming technology, software can be split into multi-threaded while it is executed, and the advantages of multicore processors can be fully exploited [21]. However, parallel software technology is difficult to grasp. Therefore, multicore CPUs does not improve the performance of the programs at all when it faces serial applications, and cannot play the existing hardware advantages.

Design key of task scheduling for multicore processors

The task scheduling problem in parallel computing is an old and lasting topic. The main purpose of task scheduling is to reasonably schedule tasks to individual processors so that the final program execution time is the shortest. The quality of the task scheduling algorithm directly affects the final execution of the program, so the task scheduling algorithm and related research work are particularly important in the field of parallel computing. The task scheduling of parallel programs is divided into two types, one is the scheduling of independent tasks without dependencies, and the other is the communication and dependencies between tasks. In this case, the task or thread division is generally involved. The original complete program is divided into multiple tasks that can be executed in parallel. For task scheduling with dependencies, it can be divided into static task scheduling and dynamic task scheduling according to the timing of scheduling.

Static task scheduling mostly obtains the program’s calculation amount, communication overhead, and dependencies between tasks through static estimation or profiling technology at compile time. The connection and processing capabilities of each processing unit are already known, and then compiled. This information is used to assign tasks to individual processors. And once a task is assigned to a processing unit, it can only be executed on that processing unit. The dynamic task scheduling is that the scheduler monitors the execution of the program in real time while executing the program, and then assigns the task to the processor unit according to the dynamic running condition. The typical situation is that the task scheduling in the operating system is to achieve processor load balancing through dynamic scheduling, and to transfer tasks from a heavily loaded processor to a lightly loaded processor in real time. However, the overhead of dynamic scheduling is large, and it also involves data consistency and communication synchronization during dynamic task migration.

Judging the quality of a scheduling algorithm can be measured in the following four aspects:

(1)

Scheduling performance. This refers to the execution time of the scheduling result, and whether the algorithm can be applied to a variety of inputs, such as whether the number of processors can be applied;

(2)

Algorithm time complexity. This refers to the time consumption of the algorithm itself. If the time complexity is high, the time spent on scheduling large complex tasks becomes unacceptable;

(3)

Scalability. When the scale of the problem changes, the scheduling results should also be comparable;

(4)

The actual availability level. Because the actual problems are often ever-changing, such as the processor topology will have many changes, the algorithm can be applied is a very important measure.

Some of the above requirements are mutually constrained. For example, if the time complexity of the algorithm is required to be small, its scheduling performance may not be very high. Conversely, the pursuit of high scheduling performance increases the time complexity. The task scheduling of the multicore processor is shown in Fig. 1.

The partitioning of an application depends largely on the computational performance requirements of the program and the ability of the infrastructure to support it. For example, a disk drive control program must first respond to the servo system. Secondly, the response channel transfer request is the last response to the input. All calculations must start reading and writing data from the same address to ensure the will of the calculation result. It is to ensure that the input and output channels always have data padding, and the processing speed is faster than the output speed. For the packet processing program, each time a new data packet is processed, the key is to look at the data packet back and forth and transmit. The corresponding time is the most critical. Corresponding to different applications, the application is divided by balancing the load, response time and traffic, and assigning it to different processors for processing. This division is performed to some extent with the real-time operating system. The criteria for the division are similar.

Parallelization algorithm

In the field of computer system architecture, research on task allocation/scheduling has been carried out for many years, and a large number of models and algorithms have been proposed. Most of these algorithms are modeled as task priority maps, task diagrams, or similar graphs by solving this basis. It has been proved that finding the optimal allocation or optimal scheduling for the task set is an NP-complete problem. Therefore, the research in recent years mainly turns to the near-optimal solution, and the method used can be divided into two major problems. One class uses general optimization algorithm, such as genetic algorithm and simulated annealing algorithm, and the other is to propose a special heuristic algorithm. There are many kinds of related algorithms, and the models are based on different types of systems, such as messaging or shared storage, isomorphism, or heterogeneous. They have different allocation/scheduling goals and strategies, such as the pursuit of load balancing, the pursuit of the shortest execution time, and the pursuit of occupation minimum resources. They are applicable to applications with different parallel granularity, such as process/thread level, iteration level, and instruction level.

Using parallelism is one of the most important ways to improve performance. We will explain the use of parallelization at each level separately.

(1)

System-level use of parallelism: In order to improve system throughput performance when running a typical server service program (such as SPECWeb or TPC), multiple processors and multiple disks can be used. The initial requested load can be allocated across multiple processors and disks to increase throughput. This is why scalability is important for server applications.

(2)

Parallel use at the instruction level: In a single processor, the use of instruction level parallelism is the key to high performance. One of the simplest methods is to overlap the execution of instructions by the basic idea of pipelines to reduce the execution time of an instruction sequence. From the perspective of the CPU performance formula, we can think that the pipeline technology reduces the CPI (cycle per instruction) by multi-step overlap of instructions. The key to the pipeline’s ability to function is that not every instruction’s execution depends on its direct predecessor instructions, so that the instructions can be executed in full or in part.

(3)

Parallel use at the digital design level: For example, a group-associated cache uses multiple banks of memory and can usually find the required items in parallel among them. Modern ALUs use the first travel bit, which is achieved by using parallel parallelism to sum the number of bits in the operand from linear to logarithmic.

The most commonly used method in parallel algorithm design is the PCAM method, namely, division, communication, combination, and mapping. The first division is to divide an issue into several parts equally, and let each processor execute at the same time. In the communication phase, it is to analyze the coordination of data and tasks to be exchanged during the execution process, and the combination is required to be smaller. The problems are grouped together to improve performance and reduce task overhead, and mapping is to assign tasks to each processor. In short, parallel algorithms have a lot to be perfected. The biggest difference between parallel and serial algorithms is that the parallel algorithm not only considers the problem itself, but also considers the parallel model used, network connections, and so on.

The operations performed in parallel may be a single instruction, such as addition or multiplication, or a complex program that takes several days to run. Obviously, for small operations, the overhead cost of parallel infrastructure is very large. In general, the smaller the task is divided, the higher the cost of generating it into a single task and providing communication and synchronization.

The other is the degree of communication and synchronization between operations. Most parallel programs share data between operations. As operations become more diverse, the complexity of ensuring correct and efficient operations increases. The simplest case is to execute the same code for each operation. This type of sharing is an irregular parallel approach.

Methodology

Multicore platform program optimization method

For the method of program optimization, most researchers focus the problem on the optimization of a specific problem, but rarely mention the common optimization methods. There are two main aspects of program optimization: algorithms and data structures. In the single-core operators’ era, in order to improve the performance of a program, people often perform program performance from storage structures and algorithms [22]. However, as multicore processors appear advantage on computing capacity, it has been found that even very excellent serial programs that have been well-optimized do not take advantage of multicore hardware.

Application program algorithm optimization method: the key point of the multicore processor platform application program optimization, on the one hand, is the optimization of the procedure algorithm, which means improve the computational complexity of the algorithm; on the other hand, is that we need to consider rewriting the serial program to the multi-threaded parallel program [23]. Application program performance optimization methods mainly lie on streamlining the process of double counting and redundant operations, and the main optimization methods are as follows: (a ) avoid redundant function calls, (b) avoid unnecessary border checks, (c) avoid double counting of intermediate results, and (d) parallelize serial program [24].

In the above four points, the parallelization of serial programs is the most difficult. Before the advent of multicore processors, the traditional programming ideas are serial way of thinking. Therefore, the conversion from the serial way of thinking to parallel mode of thinking is difficult of improving program performance at present. Besides, there are specific technical issues on multi-threaded programming. Different from single-core multi-threaded programming, multi-threaded parallel programming on multicore platforms faces many problems. Multicore processors in order to maintain fast data inter-core access will use the cache to save recently accessed data, but the cache has high connectivity and high-capacity features in the hardware. Due to the limitation of the distribution of shared resources, there are many new problems in distributed programming really realized by parallel way of thinking. Therefore, multicore parallel programming is still in its infancy [25].

Implementation of application algorithm optimization

Illustrate the problem of optimization through a simple example to traverse a result set V, perform a function Com () on each of the nodes, and call functions fun1 () and fun2 (). Where C represents the counter of the result set, M and N respectively represent two result variables, and P is a pointer type variable. V.size () represents the size of the result set V. In algorithm A, the program can be optimized in several aspects, which will improve the execution efficiency of the algorithm. The following sections describe each of the parts that can be optimized.

(1)

Optimization of border inspection. When traversing the result set, it is a common situation to get the size of the result set by calling the function. However, acquiring the size of the result set is not related to the loop and it is easy to write in the loop. However, in step S5, calling the function when the loop condition is determined results in an unnecessary increase of the function call, resulting in a drastic reduction in program performance. At this point, put this check action that the S = result set number outside the loop, and then traverse this set. As a result, time complexities from 0 (N) to 1, the program performance has realized optimization.

(2)

Avoid redundant function calls. Under the guidance of the thought of software reuse, the modularity of program design is increasing increasingly. The frequency of module calls increases, sometimes resulting in multiple use of a function. If the result is a fixed value, the program will not affect the results of the operation, and you can record the results after the repeated function call is completed again, in order to avoid the function call again in the subsequent use. In steps S31 and S32, the program repeatedly calls the function, which belongs to the redundant function call. In this case, you should put the result of the function call into a variable and avoid redundant function calls during multiple uses or loops. As the number of calls to the two functions is reduced from two to one, the processing time is reduced and the performance of the algorithm can be improved.

(3)

Use local variables to save intermediate results. Variables stored in the program run-time memory, in the loop if you need a large number of calculations, it will result in frequent memory read operations, but reading memory is the main reason affecting program execution speed. As in step S33, the pointer variable P and the data accumulation operation result in one read operation and one write operation to the memory, and are repeated N times in the loop which affect program performance. Let the accumulate operation with a register variable to accumulate, the result will be repeated with * P once to reduce the N-1 times memory read and write operations.

(4)

Put Serial conversion into parallel algorithm. Algorithm B is modified to algorithm C by means of parallelization. After parallelization, step S4 is cyclically split into t threads for traversing the result set at the same time. Each thread defines its own counter Ci and register variable Ri for parallelization.

$$ C=\left\{ C1,\dots \dots, Ct\right\} $$

(1)

$$ R=\left\{ R1,\dots \dots, Rt\right\} $$

(2)

The traversal counters {C1, …, Ct} of all threads of number t are initialized to 0, and register variables {R1, …, Rt} are assigned an initial value of 0 where Ci represents the i-th thread counter, Ri represents the i-th thread’s register variables. Per thread completes the cycle, while carrying on the calculation work of the cycle. Register variable values accumulate after all thread cycles have completed. Because the program flow diagram can only describe the serialization algorithm, the activity diagram in the UML diagram is used to describe the algorithm C. Among the four methods for program optimization, the methods of boundary checking, redundant functions, and intermediate variable result saving can all be performed in a single-core environment. However, in multicore processors, the serialization of serial programs is the most effective and direct method for program optimization.

Result analysis and discussion

Parallelized PO-Dijkstra algorithm for multicore platform

Dijkstra algorithm is the classic representative of graph theory algorithm. The shortest path analysis in the shortest path planning algorithm is the key link in the analysis of the spatial network in the geographic information system (GIS). The emergence of multicore platform brings more powerful computing performance for users. If it can process tasks in parallel, the advantages of multicore parallel work can be fully utilized. However, there are still many problems in the adaptation and optimization of serialized tasks into parallel processing tasks. Here, it is proposed to change the classical Dijkstra algorithm into a parallel algorithm, and experiments show that the speedup of parallelized tasks can even surpass the advantage of the number of CPUs.

Figure 2 is a schematic illustration of a realistic road where factories, airports, shops, bridges, and gas stations make up an intersecting road. When people drive from the gas station, the destination is the airport, looking for a shortest road will bring convenience to people, saving time, and costs. In the process of road planning, Dijkstra algorithm abstracts the complex geographic information into network graph, in which the length of the path as the weighted edge in the network graph, and the vertices in the graph represent different locations. As a result, the problem of road selection from the gas station to the airport translates into a search question for the point-to-point shortest-path in network graphics. Through the extraction of network graphics, mathematical modeling, system initialization, node calculation, and path back five steps, the shortest path planning will be completed.

(1)

Extract network graphics. During the process of extracting the network graph, five different nodes {a, b, c, d, e} in the graph are extracted from different places which are gas stations, shops, bridges, factories, and airports in the map, and the length of the road between each spot is regarded as a weighted edge in the network graph, so that the Fig. 2 can be abstracted into a network topology with 5 nodes in Fig. 3 to form an undirected graph.

(2)

Make mathematical modeling. The shortest path from the gas station to the airport corresponds to the shortest path from point a to point e in Figs.4 and 5. Source vertex a: starting point of the path; L (i, j) is distance between node i and node j; D (v) is distance between source vertex a and vertex v, which is the sum of the lengths of all paths along one path from source vertex a to node v.

(3)

System initialization part

$$ C=\left\{a\right\} $$

(3)

in which, C is Set of all nodes in the network. In initialization, place the source vertex a in the collection. Use the following equation to calculate the distance from network node v to a.

Repeat the above calculation section; find the next node to join C. Until the entire network nodes are in C so far.

(4) Path goes back to the part. If we calculate the minimum distance from the vertex a to the node e, the algorithm ends when e enters into the set C, and the distance D (e) is the shortest distance between the two. The shortest distance between source vertices a and e in Figs. 3 and 4 is 60. The calculation process is shown in Table 1. The final path is {a, d, c, e}.

Therefore, the gas station to the airport path is the gas station → factory → bridge → airport. Due to the large scale of road network in GIS, Dijkstra algorithm abstracts complex geographic information into network graph, in which the path is the weighted edge in the network graph and the vertices in the graph represent the location, but the massive computation often becomes the Dijkstra algorithm bottleneck. In recent years, the improved Dijkstra algorithm has been optimized in terms of calculation methods and storage methods. However, in embedded systems such as vehicle-mounted systems and hand-held devices, the low computational power of hardware becomes a major obstacle to the development of this algorithm.

PO-Dijkstra algorithm for multicore platform modeling

Thread parallelization on multicore platforms. Figure 6 shows the comparison between the parallelized Dijkstra algorithm and the serialized Dijkstra algorithm in a 2-core 4-threaded environment on a windows operating system. In the parallelization of Dijkstra algorithm research, the set of the number of the parallelized threads is more important. Suppose the number of parallel threads is N and the number of nodes in the network is K. During parallelization, because the system supports 2 cores and 4 threads, the number of parallel threads set for the system is to support the maximum number of threads, that is 4 N, and the number of network nodes is increased from 200 to 2600. Therefore, the abscissa in the graph represents the number of network nodes. The vertical represents the time that the algorithm performs the calculation of the shortest path, in milliseconds (ms). It can be seen from Fig. 7 that when the number of nodes is 1300 K, the parallel algorithm takes more time than the serial algorithm. This is because parallel programming involves the work of splitting, synchronizing, and merging threads, which takes up a certain amount of storage and time. Therefore, when the number of nodes K is smaller, because the time consumed for parallelization is greater than that for parallel computing saved time, parallel algorithms are slower than traditional serial algorithms. When it is over 1300 K, the superiority of the parallel algorithm was revealed. At the same time, the intersection of serial algorithm and parallel algorithm time will be affected by different CPU frequencies, experimental platforms, and network model differences.

The relationship between the adaptive parameters AC and the number of threads. In order to reduce the number of nodes, the overhead of eliminating parallelism is the negative impact of the program. Here in the parallelized Dijkstra algorithm to introduce an adaptive parameter AC, AC represents the current number of nodes, split the most conducive to the implementation of the program thread number. AC as a tuning algorithm selected parameters. WhenAC > 1, it shows that the number of nodes K in the program is more, and it is more suitable to use the parallel algorithm. In this case, the program will split in parallel. When AC ≤ 1, it indicates that the number of nodes K in the program is less and is more suitable for serialized Algorithm, and then the program does not parallelize the split, so the algorithm is an adaptive parallel algorithm. At the same time AC parameters on the number of parallel threads also have an impact. WhenAC ≤ 1, the program uses a serialized Dijkstra algorithm, in which case the number of threads is 1 N; When AC > 1, the Dijkstra algorithm is parallelized, and the number of threads N is determined by the number of ACs. If the value of AC is less than the number of system cores M, then split the number of threads NAC; if the value of AC is greater than the number of system cores M, then split the number of threads NM.

Compare PO-Dijkstra algorithm and serial algorithm. In order to better illustrate the effect of parallelization, the algorithm is compared in two aspects: on the one hand, under the same conditions, the serial algorithm and the parallel algorithm of the total completion time of the comparison; the other hand, the number of different CPU case, the serial algorithm, and parallel algorithm to complete the time comparison. First of all, the first experiment ran the serial Dijkstra algorithm and the Po-Dijkstra algorithm on a multicore processor with Intel Core 3-2120 (2-core, 4-threads) support. Initially, the number of network nodes in the algorithm is set to 600, and then the number of network nodes is continuously increased until the number of network nodes reaches 2200. As can be seen from Fig. 8, when the number of network nodes is small, the PO-Dijkstra algorithm adopts a serial algorithm, so it fits well with the serial algorithm. When the number of network nodes is large, the value of AC > 1 results, the program will automatically calculate the number of parallel threads’ split. Therefore, the PO-Dijkstra algorithm and the parallelization algorithm have a little difference in fitting, but they are generally similar. Through experiments, the serial Dijkstra algorithm and Po-Dijkstra algorithm to complete the overall time were compared, the experimental data obtained as shown in Fig. 4. The speed of the algorithm has been significantly improved after parallelization, when the number of nodes reaches 2000, the computing speed is increased by 35%.

PO-Dijkstra algorithm performance comparison in different environments. Figure 5 shows the comparison between the completion time of the serial algorithm and the PO-Dijkstra algorithm under the conditions of 2-core 4-threads and 4-core 8-threads. As can be seen in Fig. 8, the serial algorithm uses the most completion time under 2-core 4-thread conditions, while the parallel algorithm achieves the least completion time under 4-core 8-thread conditions, which is in line with the above law of completion time. However, the completion time used by the serial algorithm in the 4-core 8-thread and the completion time used by the parallel algorithm in the 2-core 4-thread can be found. Because the hardware has more advantages of 2 CPUs, when the number of nodes K is less than 7000. The serial algorithm takes less time on a 4-core 8-thread, but as the number of nodes, K continues to increase, the advantages of the parallel algorithm become significant. Even with only 2 CPUs, serial algorithms can run faster than 4 CPUs.

Experiment and performance analysis of task partitioning on multicore simulator

Experimental process and details

The decoding process and the encoding process of the G.721 protocol are reversed, and the execution order of the inverse function in the execution process is basically the same. We only use the G.721 encoding as an example to analyze the data.

(1)

According to the data

We divide the input file into two parts, which are handled by two cores and are simultaneous. Specifically, the ARM core encodes the first half of the data, and the PISA checks the second half of the data. The limitation of using this partitioning method is to find a suitable partitioning point to balance the workload of the two cores, based on which to improve the efficiency of the entire simulator. Of course, finding this point is also a very time consuming process, adding some extra system power. This is because for the G.721 encoding protocol, the data to be encoded is strongly data-correlated with both the encoded data and the quantized data, so finding the appropriate dividing point becomes very complicated and time consuming. However, if the division point is easier to find, then using this division method will not cause excessive overhead, and in theory, the performance improvement will be the most.

We divide the encoding process of G.721 into several parts in a way similar to “flowing.” Each core handles several parts, and tries to make these parts less relevant and related. Specifically, the ARM core is responsible for reading the first half of the data and data processing process. The PISA core is responsible for the second half of the entire data processing process. The use of this type of task partitioning will result in more overhead for inter-core communication than for the former (“by data partitioning”) method, and the degree of parallel of the entire process is not as high as in the former. However, the advantage of using the “divide by function” approach is that the workload of the two cores can be well balanced. We know that the critical path of the entire program is determined by the part of the program that runs for the longest time, so this way of dividing tasks is more suitable for streaming media data. By analyzing the execution time of each process of the entire G.721 encoding process, we obtained the three most time-consuming modules in the G.721 encoding process.

When simulating the Dijkstra algorithm, we chose to first establish the adjacency matrix and then calculate the shortest path between the two points. Since the adjacency matrix already exists, the shortest path for calculating any two points is a set of data sets without data correlation. All traversal points are preset in the Benchmark, and the simulation results are calculated to find the shortest path between all two points.

Based on the above, taking into account the time and efficiency of the simulation, we sampled 20 sets of point coordinates. When using the “divide by data” method, we made the ARM core calculates the first 10 sets of data, and the PISA core calculates the latter 10 sets of data. Considering that the power consumption of the shared memory area is as small as possible, we arrange both the ARM core and the PISA to establish the adjacency matrix table in real time. Although the overall power consumption is somewhat increased, the pressure on the shared buffer will be much reduced. Such power consumption sacrifice is still worthwhile for the increasingly tight chip area in the embedded field.

For this benchmark, due to its own algorithm, the required “flowing steps” are not many. Using the aforementioned “divided by function” division method will only bring communication burden, and it is difficult to improve the performance of the system. Based on the simulation time considerations, we skip this part here, and simply consider the results of “divide by data” and analyze it.

Experimental results and analysis

From Table 2, it is easy to see that whether it is divided by “by data” or “by function,” the shared memory consumes very little power. In other words, the extra power overhead is also small. For the “divide by data” method, even if the number of inter-core communication is small, the amount of data transmitted between the cores is very large, and the result is that the power consumption of the shared storage area cannot be too small (relative to the function division). Although the number of inter-core communication is large, the amount of data per communication is small, the experimental result of shared memory power consumption is smaller than the former method, and at the same time using the “divide by function” method, the required hardware resources will be much less, just 32 B (8 B + 8 B + 16 B). In general, in order to improve the performance and speed of the system, it will put the shared storage area to the L2 cache level, or even the L1 cache level, then the smaller hardware resource requirements will be more dominant. From this point of view, using “divide by function” for streaming data processing is more advantageous.

If we compare the shared storage resources, the advantage of “dividing by function” is even greater, because if we want to use the latter division, the shared storage area needs at least a few megabytes. The demand is indeed in the tens of bytes. This advantage allows processor designers to place shared memory in faster memory systems such as the L2 cache and L1 cache. However, if the “divide by data” method is used, such a huge shared storage demand will make the designer have to design the shared storage area in a slower memory structure such as main memory, so that the speed of the entire system will be slowed down, and such a design is quite disadvantageous in the design of a dual-core simulator using inter-core communication.

In our experiments, the two tasks described above were run in our heterogeneous dual-core simulator and the results and analysis were obtained.

As shown in Fig. 7, we can clearly see in real-time power consumption that using “divide by function” is always 14% lower than “divided by data.” At the same time, based on Figs. 8 and 9, we compare the comparison between ARM core and PISA in two different partitioning methods and running on a single core. As can be clearly seen from the figure, the real-time power consumption curve using “divide by function” is much more gradual, because both cores are in the working state during the calculation process, and due to the proper division method, the workload of the two cores is similar, so the power consumption output of the whole system is relatively flat during the whole running process. The benefit is that the real-time temperature of the CPU will not be high or low at a certain moment, and the real-time temperature will also be more gradual, and will not bring about an increase in the heat dissipation burden, causing the negative effects of system instability, and the resulting reduction in heat dissipation overhead is also obvious.

In the simulation process, the ARM core PISA core is used to calculate the two parts of the data group, the ARM core and the PISA core are reversed, and the results are similar.

The performance difference between ARM core and PISA is not so big. This is because there is no data-related data. During the process, the two cores can communicate with each other through little communication. This is more suitable for dynamic task partitioning. In our experiment, the power consumption of the shared memory area is 0, because there is no communication between the two cores. The task is equally divided during the application compilation process and assigned to the two cores for processing. We first compare the peak power consumption during the running of the program, as shown in Table 3.

As can be seen from Table 3, after using “divide by data,” the peak value during the running of the program is actually about 0.02% lower than that before the unused task is divided. In the process of processing data without data correlation and using the task partitioning mechanism to improve performance in multicore processors, the method of “dividing according to data” has achieved great benefits, as shown in Figs. 10 and 11.

As shown in Fig. 7, we can see that since the ARM core is the general data before processing, there is no overhead such as inter-core communication during the processing. Consistently, there is no additional power burden. For Fig. 11, the two curves seen in the figure are difficult to fit together completely, because the curve in PISA_Ori is PISA. The core separately processes the real-time power consumption curve of the data, which is calculated from the beginning of the whole data, and the PISA_Data curve reflects the process of processing the heap data by the PISA core and the ARM core. The ARM core processes the first half of the data, while the PISA core processes the latter half of the data. Although the curve will have a little inconsistency, the trend is still the same, and the peaks and valleys of the curve are basically the same, so the whole program will not create an additional power load.

Conclusion

The advent of multicore processors provides the hardware foundation for the rapid development of parallel computing. Therefore, there is a new development space for the Dijkstra shortest path planning algorithm due to the large number of computations. In this paper, the Dijkstra algorithm is optimized and a parallel PO-Dijkstra (Para-llelOptimization-Dijkstra) algorithm model for multicore platform is proposed. Considering that thread splitting consumes some resources, the PO-Dijkstra algorithm adjusts and adjusts adaptively according to three parameters of the total number of nodes K, the number of cores M, and the CPU speed. The comparison between the PO-Dijkstra algorithm and the classical Dijkstra algorithm is completed on the hardware experimental platform. The experimental results show that the PO-Dijkstra algorithm has obvious speed improvement. The emergence of multicore multi-threaded processors has created new opportunities for dynamic task scheduling. Multicore helps to solve the problem of large scheduling overhead in traditional dynamic task scheduling. In the next research work, the scheduling algorithm in the paper can be transplanted into the dynamic scheduling system, and combined with the dynamic scheduling environment to make appropriate improvements.

Availability of data and materials

The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Ghuloum A, Smith T, Wu G, et al. Future-Proof Data Parallel Algorithms and Software on Intel Multi-Core Architecture [J]. Intel Technol J, 2007, 11(4).

D. Gorissen, I. Couckuyt, P. Demeester, et al., A surrogate modeling and adaptive sampling toolbox for computer based design [J]. J. Mach. Learn. Res. 11(Jul), 2051–2055 (2010)

T. Ryu, T.L. Tanaka, R. Perna, Formation, disruption and energy output of population III X-ray binaries [J]. Mon. Not. R. Astron. Soc. 456(1), 223–238 (2018)

J.I. Kamstra, M.V. Leeuwen, J.L.N. Roodenburg, et al., Exercise therapy for trismus secondary to head and neck cancer: A systematic review [J]. Head Neck 39(1), 160–169 (2017)

N. Pezzotti, T. Höllt, A. Vilanova, Interactive visual exploration of 3D mass spectrometry imaging data using hierarchical stochastic neighbor embedding reveals spatiomolecular structures at full data resolution [J]. J. Proteome Res. 17(3), 1054–1064 (2018)

V.P. Koryachko, D.A. Perepelkin, V.S. Byshov, Development and research of improved model of multipath adaptive routing in computer networks with load balancing [J]. Auto Control Comput Sci 51(1), 63–73 (2017)

E. Renault, A. Duret-Lutz, F. Kordon, et al., Variations on parallel explicit emptiness checks for generalized Büchi automata [J]. Int. J. Softw. Tools Technol. Transfer 19(6), 1–21 (2016)

D. Pamučar, S. Ljubojević, D. Kostadinović, et al., Cost and risk aggregation in multi-objective route planning for hazardous materials transportation—A neuro-fuzzy and artificial bee colony approach [J]. Expert Syst. Appl. 65, 1–15 (2016)

P.A. Brameret, A. Rauzy, J.M. Roussel, Automated generation of partial Markov chain from high level descriptions [J]. Reliabil Eng Syst Safety 139, 179–187 (2015)

M. Grujicic, J. Snipes, S. Ramaswami, et al., Densification and devitrification of fused silica induced by ballistic impact: A computational investigation [J]. J. Nanomater. 16(1), 167 (2015)

H. Ortega-Arranz, Y. Torres, A. Gonzalez-Escribano, et al., Comprehensive evaluation of a new GPU-based approach to the shortest path problem [J]. Int. J. Parallel Prog. 43(5), 918–938 (2015)

Aldinucci M, Danelutto M, Kilpatrick P, et al. Fastflow: high-level and efficient streaming on multi-core [J]. Programming multi-core and many-core computing systems, parallel and distributed computing, 2014.

N. Li, W. Yi, M. Sun, et al., Development and application of intelligent system modeling and simulation platform [J]. Simul. Model. Pract. Theory 29, 149–162 (2012)

Z. Feng, Z. Zeng, P. Li, Parallel on-chip power distribution network analysis on multi-core-multi-GPU platforms [J]. IEEE Trans Very Large Scale Integrat (VLSI) Syst 19(10), 1823–1836 (2011)

Aldinucci M, Danelutto M, Kilpatrick P, et al. Fastflow: high-level and efficient streaming on multi-core [J]. in Programming Multi-core and Many-core Computing Systems, ser. Parallel and Distributed Computing, S. Pllana, 2012: 13.

Bo Zhang is responsible for the experimental part of the article, and DeJi Hu is responsible for the theoretical part of the article. All authors read and approved the final manuscript.

Authors’ information

Bo Zhang (1979–), male. Master of Microelectronics and Solid State Electronics. Graduated from the Tianjin University. He is currently a lecturer in the College of Information Engineering, Tianjin University of Commerce. His research interests include electronic commerce and intelligence computing.

DeJi Hu (1971–), male. Doctor of Mechanical Manufacturing and Automation. Graduated from the Tianjin University. He is currently a lecturer in the College of Information Engineering, Tianjin University of Commerce. His research interests include system development and intelligence computing.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zhang, B., Hu, D.J. RETRACTED ARTICLE: Research on the construction and simulation of PO-Dijkstra algorithm model in parallel network of multicore platform.
J Wireless Com Network2020, 85 (2020). https://doi.org/10.1186/s13638-020-01680-x