A performance optimization strategy based on degree of parallelism and allocation fitness

With the arrival of the big data era, most current performance optimization strategies target distributed computing frameworks that use disks as the underlying storage. They may solve the problems of traditional disk-based distributed systems, but they are hard to port and are poorly suited to in-memory computing frameworks, whose underlying storage and computation architecture differ. In this paper, we first define the resource allocation model, the parallelism degree model, and the allocation fitness model on the basis of a theoretical analysis of the Spark architecture. Second, based on these models, we propose an optimization strategy embedded in the evaluation model that is easy to deploy. The strategy assigns later tasks to a worker with a lower load that satisfies the resource requirements, while workers with a higher load may not be assigned tasks. Experiments with four different jobs are conducted to verify the effectiveness of the proposed strategy.


Introduction
In recent years, big data processing frameworks [1,2], especially in-memory computing frameworks, have been continuously enriched and developed [3,4]. In-memory computing has attracted wide attention in the industry since the SAP TechEd global conference in 2010.
With the development of in-memory computing frameworks, some research has been devoted to extending and improving these systems. Napoli et al. [5] proposed a simple and efficient parallel pipelined programming model based on BitTorrent. Chowdhury et al. implemented broadcast communication for the in-memory computing framework. Lamari et al. [6] put forward a standard architecture of relational analysis for big data. Cho et al. [7] proposed a parallel design scheme. Kim et al. [8] designed an algorithm that uses programs to analyze and locate common subexpressions. Seol et al. [9] proposed fine-granularity retention management for deep submicron DRAMs. Another work designed a unified memory manager that separates the memory storage function from the computing framework. Tang et al. [10] designed a standard engine for distributed data stream computing. Jo et al. [11] implemented a high-performance SQL query system. McSherry et al. [12] proposed a parallel computing method for applications with differential data streams and prompt response. Zeng et al. designed a general model for interactive analysis. Corrigan-Gibbs et al. realized a private information communication system for in-memory computing. Sengupta et al. [13] used SIMD-based data parallelism to speed up sieving in integer-factoring algorithms. Ifeanyi et al. [14] presented a comprehensive survey of fault tolerance mechanisms for high-performance frameworks.
Some research focuses on performance optimization for distributed computing frameworks, which may not be suitable for in-memory frameworks. Ananthanarayanan et al. proposed an algorithm that makes full use of data access time and data locality. By analyzing the impact of task parallelism on cache effectiveness, Ananthanarayanan et al. designed a coordinated caching algorithm adapted to in-memory computing. By monitoring computation overhead, Babu et al. found that the parallelism of the reduce task has a great influence on the performance of a MapReduce system, and designed a task scheduling algorithm that adapts to resource status. To predict the response time of worker nodes, Zou et al. divided a task into different blocks, which can improve the efficiency of tightly synchronized applications. Sarma et al. proposed a communication cost frontier model of worker nodes, in which the tradeoff between task parallelism and communication cost is achieved by adjusting the boundary threshold. Pu et al. presented FairRide, a near-optimal, fair cache sharing scheme that improves performance. Chowdhury et al. proposed an algorithm to balance multi-resource fairness for correlated and elastic demands.
However, most current performance optimization strategies target distributed computing frameworks with disks as the underlying storage, where the two aspects that receive the most attention are task scheduling and resource allocation. It is therefore of practical significance to study the optimization mechanism of the in-memory computing (IMC) framework from the perspective of its underlying memory-based storage and computation architecture. Unlike existing strategies, we consider the degree of parallelism and the allocation fitness. First, regarding task scheduling, the rationality of the parallelism degree of the shuffle process in the in-memory framework is easily overlooked, although it may directly affect the efficiency of job execution and the utilization of cluster resources. The degree of parallelism is usually set from user experience, which makes it hard to adapt to the current state of the in-memory framework. Second, regarding resource allocation, we adjust the allocation fitness to achieve a reasonable hardware allocation, especially memory allocation, and to accelerate job execution.

Resource allocation model
Definition 1 Resource allocation type. Denote $Worker = \{w_1, w_2, \ldots, w_m\}$ as the set of workers and $Resource = \{r_1, r_2, \ldots, r_n\}$ as the collection of resource types, including CPU, memory, and disk. $r_w = (r_{w1}, r_{w2}, \ldots, r_{wl})$ represents the vector of the $l$ available resources of worker $w$, where $r_{wi}$ is the $i$th available resource in worker $w$. The $i$th resource in all workers can be normalized as:
$$\ldots\ \mathrm{rtype}(\mathrm{cpu}, \mathrm{memory}, \mathrm{disk}) \qquad (1)$$
$J = \{j_1, j_2, \ldots, j_n\}$ denotes the set of jobs running at the same time, and $V_{rj} = (v_{rj1}, v_{rj2}, \ldots, v_{rjk})$ represents the resource requirement vector of job $j$. Since the resource requirement of each job is different, the resource requirements of all jobs are represented as:
$$RV = (V_{r1}, V_{r2}, \ldots, V_{rj}) = ((v_{r11}, v_{r12}, \ldots, v_{r1k}), (v_{r21}, v_{r22}, \ldots, v_{r2k}), \ldots, (v_{rj1}, v_{rj2}, \ldots, v_{rjk})) \qquad (2)$$
Then, the resource requirement types of all jobs are expressed as:
$$TypeRV = (typeRV_1, typeRV_2, \ldots, typeRV_j) = (\mathrm{rtype}(\max(v_{r11}, v_{r12}, \ldots, v_{r1k})), \mathrm{rtype}(\max(v_{r21}, v_{r22}, \ldots, v_{r2k})), \ldots, \mathrm{rtype}(\max(v_{rj1}, v_{rj2}, \ldots, v_{rjk}))) \qquad (3)$$
The resource requirements are submitted to the system before the job executes, and jobs are assigned to workers with idle resources that can satisfy their requirements. Assume $workers = \{w_1, w_2, \ldots, w_m\}$ are the workers dealing with task $j$, and $v_{aj} = (v_{aj1}, v_{aj2}, \ldots, v_{ajk})$ is the resource allocation vector of task $j$ in worker $w_1$. In principle, workers should strictly allocate resources in accordance with the resource requirements table, which is represented as:
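The requirement-type mapping and the matching of jobs to workers with sufficient idle resources can be illustrated with a minimal Python sketch. It assumes that requirements and availabilities are already normalized to a common scale, so that taking the maximum across resource types is meaningful. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
from typing import Dict, List

RESOURCE_TYPES = ("cpu", "memory", "disk")


def dominant_resource_type(requirement: Dict[str, float]) -> str:
    """rtype(max(...)): the resource type with the largest requirement.

    Ties are broken by the order of RESOURCE_TYPES (an assumption).
    """
    return max(RESOURCE_TYPES, key=lambda r: requirement[r])


def can_host(available: Dict[str, float], requirement: Dict[str, float]) -> bool:
    """True if the worker's idle resources cover every requirement."""
    return all(available[r] >= requirement[r] for r in RESOURCE_TYPES)


def eligible_workers(workers: Dict[str, Dict[str, float]],
                     requirement: Dict[str, float]) -> List[str]:
    """Names of the workers whose idle resources satisfy the job."""
    return [name for name, avail in workers.items()
            if can_host(avail, requirement)]


# Hypothetical normalized values, for illustration only.
workers = {
    "w1": {"cpu": 0.5, "memory": 0.2, "disk": 0.9},
    "w2": {"cpu": 0.8, "memory": 0.7, "disk": 0.6},
}
jobs = {
    "j1": {"cpu": 0.3, "memory": 0.6, "disk": 0.1},
    "j2": {"cpu": 0.4, "memory": 0.1, "disk": 0.2},
}

type_rv = [dominant_resource_type(v) for v in jobs.values()]
print(type_rv)                                # ['memory', 'cpu']
print(eligible_workers(workers, jobs["j1"]))  # ['w2']
```

Here job j1 is memory-dominant and only w2 has enough idle memory for it, whereas j2 fits on either worker.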

Parallelism degree model
In Spark, the task parallelism degree measures the number of concurrent tasks. It can be specified by the user, but it cannot exceed the total number of instances, which equals the product of the number of workers and the number of CPU cores in each worker.
Definition 2 Parallelism degree. Denote the number of workers as $workerNum$ and the number of CPU cores in each worker node as $coreNum$; the hardware environment therefore supports $workerNum \times coreNum$ concurrently executing tasks. If the parallelism parameter specified by the user is $p_{user}$, then the parallelism degree $parallelismDegree$ is the minimum of $workerNum \times coreNum$ and $p_{user}$:
$$parallelismDegree = \min(workerNum \times coreNum,\ p_{user}) \qquad (5)$$
Definition 3 Idle time. It indicates the idle time caused by uneven task allocation. According to formula 5, when the user parallelism is greater than the hardware parallelism, that is, $p_{user} > workerNum \times coreNum$, the number of pipelines within the stage is greater than the task parallelism. The workers then need to allocate tasks in multiple turns, and the number of turns can be expressed as:
$$turns = \left\lceil \frac{p_{user}}{workerNum \times coreNum} \right\rceil \qquad (6)$$
where the ceiling function returns the smallest integer that is greater than or equal to its argument. By formula 6, when $p_{user}$ is an integral multiple of $workerNum \times coreNum$, all workers execute tasks in every round of distribution. If the remainder of $p_{user}$ divided by $workerNum \times coreNum$ is not 0, there is at least one idle node in the final round, and the number of idle workers can be expressed as:
$$idleNum = workerNum \times coreNum - \mathrm{mod}(p_{user},\ workerNum \times coreNum) \qquad (7)$$
where $\mathrm{mod}(p_{user},\ workerNum \times coreNum)$ is the remainder. Because tasks are allocated randomly, the probability that $p_{user}$ is an integer multiple of $workerNum \times coreNum$ is very small, so the task load in the final round is likely to be uneven. Assume the set of $h$ pipeline tasks in the final round is $Task_{pipes}^{last} = \{Task_{pipe_{i1}}, Task_{pipe_{i2}}, \ldots, Task_{pipe_{ih}}\}$, where $h < workerNum \times coreNum$. Then, the idle time of the idle node is:
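As a worked illustration of these quantities, the following Python sketch computes the effective parallelism degree, the number of allocation turns, and the number of idle execution slots in the final round. Taking the idle count as the slot total minus the remainder is one reading of the text, and the figure of 4 cores per worker is an assumed value, not one taken from the paper's configuration table.

```python
def parallelism_degree(worker_num: int, core_num: int, p_user: int) -> int:
    """Effective parallelism: the user setting capped by the hardware slots."""
    return min(worker_num * core_num, p_user)


def allocation_turns(worker_num: int, core_num: int, p_user: int) -> int:
    """Rounds needed to run p_user tasks: ceil(p_user / slots), in integers."""
    slots = worker_num * core_num
    return (p_user + slots - 1) // slots


def idle_slots_in_final_round(worker_num: int, core_num: int, p_user: int) -> int:
    """Slots left without a task in the last round (0 if it divides evenly)."""
    slots = worker_num * core_num
    remainder = p_user % slots
    return 0 if remainder == 0 else slots - remainder


# 8 workers with an assumed 4 cores each give 32 slots.
print(parallelism_degree(8, 4, 100))         # 32
print(allocation_turns(8, 4, 100))           # 4
print(idle_slots_in_final_round(8, 4, 100))  # 28
print(idle_slots_in_final_round(8, 4, 64))   # 0
```

With 100 user-specified partitions on 32 slots, three rounds are fully occupied and the fourth runs only 4 tasks, leaving 28 slots idle; choosing 96 or 128 partitions instead would avoid the idle final round.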

Allocation fitness model
Definition 4 Resource occupancy rate. Assume $T_{fixed}$ is a measurement interval and $T_{job_i}$ is the actual execution time of job $i$. The occupancy rate of the $r$th resource, $OC_{ir}$, is defined as the proportion of the resources used by the workers, which is expressed as:
Definition 5 Allocation fitness degree. Assume $workLoad$ is the total workload, and $CAs = \{ca_1, ca_2, \ldots, ca_n\}$ represents the computing abilities of the workers in $workers = \{w_1, w_2, \ldots, w_n\}$. The mean value of the task execution time over all workers can then be defined as:
Without considering the waiting time, the execution time of tasks in worker $w_i$ with the task allocation amount $allocationLoad_{w_i}$ can be expressed as:
Therefore, the variance of the task execution time is represented as:
The allocation fitness degree of worker $w_i$ can be formulated as:
Lemma 1 For all workers involved in the calculation, the greater the allocation fitness, the shorter the execution time of the job and the higher the computational efficiency.
Proof From the point of view of task allocation, the execution time of the job can be expressed as: According to the formula, the allocation fitness is inversely proportional to the variance. The greater the fitness value, the smaller the variance, which means that the completion times of tasks on the workers are closer to the mean. So, when the allocation fitness takes its maximum value, the job execution time is shortest and the execution efficiency is highest. Therefore, we migrate later tasks from the worker with the higher load to the worker with the lower load to reach a higher degree of parallelism and allocation fitness.
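The relationship stated in Lemma 1 can be checked numerically with a small Python sketch. Several forms below are assumptions made only for illustration: the per-worker execution time is taken as the allocated load divided by the computing ability, the mean as the total workload divided by the summed computing ability, the job time as the slowest worker's time, and the fitness as 1 / (1 + variance), which is just one function that decreases as the variance grows. None of these exact forms is confirmed by the text.

```python
from typing import List, Sequence


def execution_times(loads: Sequence[float], abilities: Sequence[float]) -> List[float]:
    """Assumed per-worker time: allocated load / computing ability."""
    return [load / ca for load, ca in zip(loads, abilities)]


def mean_execution_time(work_load: float, abilities: Sequence[float]) -> float:
    """Assumed mean: total workload / total computing ability."""
    return work_load / sum(abilities)


def time_variance(loads: Sequence[float], abilities: Sequence[float]) -> float:
    """Variance of the per-worker times around the assumed mean."""
    mean = mean_execution_time(sum(loads), abilities)
    times = execution_times(loads, abilities)
    return sum((t - mean) ** 2 for t in times) / len(times)


def allocation_fitness(loads: Sequence[float], abilities: Sequence[float]) -> float:
    """Assumed fitness: decreases with the variance, equals 1 when balanced."""
    return 1.0 / (1.0 + time_variance(loads, abilities))


def job_time(loads: Sequence[float], abilities: Sequence[float]) -> float:
    """Assumed job time: the slowest worker determines completion."""
    return max(execution_times(loads, abilities))


abilities = [2.0, 1.0, 1.0]   # hypothetical computing abilities
balanced = [4.0, 2.0, 2.0]    # load proportional to ability
skewed = [2.0, 4.0, 2.0]      # same total load, unevenly placed

for loads in (balanced, skewed):
    print(allocation_fitness(loads, abilities), job_time(loads, abilities))
```

Under these assumptions the balanced allocation has zero variance, fitness 1, and a job time of 2, while the skewed allocation of the same total workload has a lower fitness and a job time of 4.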

Construct basic data
The improved architecture of Spark with optimization strategy is shown in Fig. 1.
To deploy the performance optimization strategy in Spark, the scheduling method must be implemented in the spark.scheduler.TaskSchedulerImpl interface. The DAG scheduler contains all the topology information of the current cluster operation, including the parameter configuration information and the mapping between threads and component IDs. The cluster object contains all status information of the current cluster, including the mapping between each thread, node, and executor of the topology, as well as the usage of workers and slots and which of them are idle. This information can be obtained through the API object. The CPU occupancy of each thread in the topology can be obtained through the getThreadCpuTime(long id) method of the ThreadMXBean class in the Java API, where id is the thread ID. The network bandwidth occupancy of each thread can be estimated by measuring the size of each RDD in the experiment, monitoring the data transmission rate of each thread in the Spark UI, and then accumulating the results. Because threads share memory, the memory occupancy of each thread can only be roughly estimated from the -Xss parameter in the configuration file. In addition, the hardware parameters and load information of the operating system can be read from the relevant files under the /proc directory. Once the code is written, it is packaged as a jar into the Spark_HOME/lib directory and run after spark.scheduler is configured in spark.yaml on the master node.
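As an illustration of collecting load information from /proc, the following Python sketch parses the text of /proc/meminfo and derives a memory occupancy ratio. It is written in Python for brevity, whereas the deployment described above runs in the JVM. The helper names are hypothetical, and the ratio relies on the MemTotal and MemAvailable fields, which are present on reasonably recent Linux kernels.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse 'Key:   value [kB]' lines into a dict of integer values."""
    info: Dict[str, int] = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a 'Key:' prefix
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_occupancy(info: Dict[str, int]) -> float:
    """Fraction of memory in use, based on MemTotal and MemAvailable."""
    total = info["MemTotal"]
    return (total - info["MemAvailable"]) / total


def read_meminfo(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Read and parse the meminfo file (Linux only)."""
    with open(path, "r", encoding="ascii") as handle:
        return parse_meminfo(handle.read())
```

On a worker node, `memory_occupancy(read_meminfo())` would give a value between 0 and 1 that can be refreshed periodically and compared with a threshold.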

Performance optimization strategy
The key problem of the optimization strategy is the selection of the destination node. To meet the requirements of the worker, the nodes that do not satisfy the resource constraint model must first be excluded.
Denote $m_s$ and $m_d$ as the total amount of memory resource in the source node and the candidate destination node, respectively. When deciding where to assign the later task, other tasks continue to be moved out until the resources occupied on the source node fall below the threshold. Finally, the optimal destination node is selected so that the allocation fitness reaches the larger value. Note that when the memory, disk, or network bandwidth resources overflow, the optimization strategy is the same as in this section, computed for the corresponding type of resource.
The detailed steps of the optimization strategy are shown in Algorithm 1.
Step 1. Initialize the read data path and the number of data partitions. Spark uses the RDD text file operator to read the data from HDFS into the memory of the Spark cluster.
Step 2. Obtain the default parallelism degree and collect statistical information to calculate the data resource occupancy degree in the system.
Step 3. Update the degree of parallelism and the allocation fitness using the formulas of the parallelism degree model and the allocation fitness model, in combination with the information acquired in step 2, and then select the ids of the workers with the higher load.
Step 4. Save the corresponding parameters to the database and update the information when the status of the resource changes. After selecting the source node and destination node, exchange their tasks and refresh the remaining CPU, memory, and network bandwidth resources of the source node and the destination node.
Step 5. The TaskScheduler then selects the set of workers with the lower load and assigns the task to them to obtain a larger degree of parallelism and allocation fitness.
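The destination selection at the core of these steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the paper's Algorithm 1: only the memory constraint is checked, the candidate that minimizes the variance of the per-worker execution times is taken as the one with the largest allocation fitness, and the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Worker:
    name: str
    capacity: float     # computing ability (work units per second)
    load: float         # workload already allocated
    free_memory: float  # memory still available (GB)


def variance_if_assigned(workers: List[Worker], target: Worker,
                         task_load: float) -> float:
    """Variance of per-worker times if the task were placed on target."""
    loads = [w.load + (task_load if w is target else 0.0) for w in workers]
    mean = sum(loads) / sum(w.capacity for w in workers)
    times = [load / w.capacity for load, w in zip(loads, workers)]
    return sum((t - mean) ** 2 for t in times) / len(workers)


def select_destination(workers: List[Worker], task_load: float,
                       task_memory: float) -> Optional[Worker]:
    """Exclude workers violating the memory constraint, then pick the
    candidate whose selection gives the lowest variance."""
    candidates = [w for w in workers if w.free_memory >= task_memory]
    if not candidates:
        return None
    return min(candidates,
               key=lambda w: variance_if_assigned(workers, w, task_load))


def assign_task(workers: List[Worker], task_load: float,
                task_memory: float) -> Optional[Worker]:
    """Assign the task and refresh the destination's remaining resources."""
    dest = select_destination(workers, task_load, task_memory)
    if dest is not None:
        dest.load += task_load
        dest.free_memory -= task_memory
    return dest


cluster = [
    Worker("w1", capacity=2.0, load=6.0, free_memory=4.0),
    Worker("w2", capacity=1.0, load=1.0, free_memory=1.0),
    Worker("w3", capacity=1.0, load=2.0, free_memory=8.0),
]
chosen = assign_task(cluster, task_load=1.0, task_memory=2.0)
print(chosen.name if chosen else None)  # w3
```

In this example w2 has the lightest load but is excluded because it lacks memory, and w3 is preferred over the more heavily loaded w1 because placing the task there leaves the execution times closer to the mean.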

Experimental platform
We established a computing cluster with 1 server and 8 worker nodes; the server acts as the Hadoop NameNode and Spark Master, and the other nodes act as Hadoop DataNodes and Spark Workers. The details of the configuration are shown in Table 1. The task execution time is taken from the Spark console, and nmon monitors the memory usage.

Execution time evaluation
To verify the algorithm on several different types of operations in a concurrent environment, we use the official Spark examples to form a working set that contains four types of algorithms; dataset types 1, 2, 3, and 4 denote the WordCount, TeraSort, K-Means, and PageRank jobs. Figure 2 compares the execution time of the different strategies. It shows that, with performance optimization, the acceleration of K-Means and PageRank under the proposed strategy is better than without the optimization strategy; the comparison covers the wide-dependency operations in K-Means and PageRank as well as WordCount and TeraSort. The corresponding acceleration rates are 17.9%, 17.6%, 15.1%, and 30%, respectively. An improper parallelism degree and task allocation may cause frequent out-of-memory conditions and increased disk I/O, which decrease the execution efficiency and lead to higher overhead in execution time.
Thus, compared with the existing scheduling mechanism, scheduling with the performance optimization strategy reduces the latency more effectively, and its implementation does not have a large impact on the performance of the cluster. Memory utilization is related to the type of job and the distribution of the input data. For the same algorithm, the greater the amount of data processed, the greater the amount of memory occupied. As shown in Figs. 3, 4, 5, and 6, WordCount and TeraSort have a relatively stable memory footprint as execution time increases, while K-Means and PageRank have different memory occupancy rates in different task processing phases.

Disk I/O evaluation
Similarly, the disk I/O has different characteristics as the type of job varies. Figures 7, 8, 9, and 10 were recorded under the optimization strategy proposed in this paper. The memory utilization of the four different jobs changes during execution on worker 3.
As far as the disk I/O rate is concerned, a task that processes data from the local disk generates local data reads on the worker and consumes a certain amount of disk I/O. If network data is processed, additional network I/O is also produced because the worker needs to read data from a remote disk, and a memory shortage may cause more frequent disk I/O. As shown in Figs. 7, 8, 9, and 10, the disk I/O of WordCount is the most pronounced, and that of the other three jobs is lower. At the beginning of execution for K-Means and TeraSort, disk I/O increases significantly because the task is assigned to worker 3, which needs to read some data from the disk at that time.

Conclusions
In this paper, our contributions can be summarized as follows. First, we analyze the theoretical relationship between the degree of parallelism and the allocation fitness. Second, we propose an evaluation model that is pluggable for task assignment. Third, on the basis of the evaluation model, the strategy takes resource characteristics into consideration and assigns tasks to the worker with a lower load to increase execution efficiency. Numerical analysis and experimental results verify the effectiveness of the presented strategy. Our future work will concentrate on analyzing the general principles of the resource requirements of different types of operations in in-memory computing frameworks and on designing an optimization strategy that adapts to the load and type of jobs.