M-BiRank: co-ranking developers and projects using multiple developer-project interactions in open source software community

Social collaborative coding is a popular trend in software development, and platforms such as GitHub provide rich social and technical functionalities for developers to collaborate on open source projects through multiple interactions. Developers often follow popular developers and projects for learning, technical selection, and collaboration. Thus, identifying popular developers and projects is very meaningful. In this paper, we propose a multiplex bipartite network ranking model, M-BiRank, to co-rank developers and projects using multiple developer-project interactions. Firstly, multiple developer-project interactions such as commit, issue, and watch are extracted and a multiplex developer-project bipartite network is constructed. Secondly, a random layer is selected from this multiplex bipartite network and initial ranking scores are calculated for developers and projects using BiRank. Finally, the initial ranking scores diffuse to other layers, and mutual reinforcement is taken into consideration to iteratively calculate the ranking scores of developers and projects in different layers. Experiments on a real-world GitHub dataset show that M-BiRank outperforms degree centrality, traditional single-layer ranking methods, and a multiplex ranking method.


Introduction
The open source software community is now a main driving force of innovation. Large numbers of software developers collaborate on millions of open source software projects, among which are many popular projects that drive innovation in different fields [1,2]. For example, deep learning frameworks such as TensorFlow, PyTorch, and MXNet, contributed by famous companies, simplify the building of deep learning models, which to some extent speeds up innovation in the field of artificial intelligence in both academia and industry [3,4].
In the open source software community, developers from different areas usually adopt the social collaborative coding paradigm and participate in different portions of a common project using the social and technical functionalities provided by the community [5,6]. Taking GitHub as an example, developers from different areas and with different technical backgrounds can collaborate on a project by committing code and commenting on issues, star or fork a project to improve their technical skills or support technical selection, and follow other professional developers to keep pace with new trends. Much like opinion leaders in social networks, influential developers and projects drive the technical trends and the prosperity of the open source community. Thus, identifying influential developers and projects is of great significance. Existing work on influence analysis for the open source software community has mainly applied traditional unipartite single-layer graph ranking methods [7,8], including PageRank [9] and HITS [10], although many new graph ranking methods for more complex network structures, such as bipartite networks [11,12] and multiplex networks [13,14], have been proposed. Moreover, existing graph ranking methods have not merged the bipartite network and the multiplex network into a single network model, which is necessary in our case to model multiple interactions between developers and projects. Potential applications of influence analysis of the open source software community include service recommendation [15][16][17][18][19] and risk assessment [20].
In this paper, we focus on modeling multiple interactions between developers and projects as a multiplex bipartite network and propose a new ranking method based on it in an iterative and mutually enhanced way. The main contributions of this research are threefold:
• We propose a multiplex bipartite network model to represent multiple interactions between developers and projects.
• We propose a new ranking model called M-BiRank on multiplex bipartite network which takes into account the mutual reinforcement between different types of nodes as well as different layers.
• We apply the proposed model to real-world GitHub dataset, showing that our model outperforms baseline ranking models.
The remainder of the paper is organized as follows. Section 2 gives a brief introduction to related works on ranking models from the perspective of network and its applications in software engineering. In Section 3, details about the proposed M-BiRank model are illustrated. Then, the experiment results and discussions are given in Section 4. Finally, we briefly summarize our work and explain future directions in Section 5.

Related work
Identifying influential nodes in social networks has been a hot topic for decades. Existing works mainly focus on either structural properties or diffusion dynamics. Plenty of structure-based metrics and random walk-based methods have been proposed.
Structure-based metrics are usually built on some intuition about centrality from either a local or a global view. Degree is the most common local structure-based centrality metric. Based on degree, Chen et al. [21] proposed a semi-local centrality metric, considering both the nearest and the next nearest neighbors. Chen et al. [22] further considered the negative impact of local clustering on information diffusion in networks and proposed ClusterRank. In addition to extensions of degree, several local structure-based centrality metrics originate from the H-index, which was originally used to measure the citation impact of a scholar or a journal [23]. Zhao et al. [24] first extended the H-index concept to networks and defined the h-Degree metric for weighted networks. Liu et al. [25] combined the H-index of both a node itself and its neighbors and proposed a local H-index centrality. Lü et al. [26] revealed the relation among degree, H-index, and coreness and introduced a family of H-indices. Local structure-based centrality metrics benefit from low computational complexity at the cost of reduced effectiveness, while global structure-based centrality metrics can better identify influential nodes from a global view of the whole network. Earlier research in sociology introduced several global structure-based centrality metrics, including closeness centrality [27], betweenness centrality [28], and eigenvector centrality [29]. More recently, researchers have extended eigenvector centrality to more complex network structures. Wang et al. [30,31] extended eigenvector centrality to multilayer networks under a framework of tensor decomposition.
Random walk-based methods apply a resource diffusion process to the network and measure a node's influence by the amount of resource the node holds at the stationary state of the process. Typical random walk-based ranking methods include PageRank [9] and HITS [10]. To solve the dangling node problem of PageRank, Lü et al. [32] added a ground node to the original network, making the network connected, and proposed the parameter-free LeaderRank method. Halu et al. [13] extended PageRank to multiplex networks and proposed Multiplex PageRank, which included four kinds of intra-layer enhancement mechanism [33]. To address the ranking problem on bipartite networks, He et al. [11] proposed the BiRank method. Instead of modeling pairwise interactions, higher-order network models have recently been proposed and applied to ranking nodes with group interactions. Treating scientific collaboration as a group interaction, Liang et al. [34] modeled it as a hypergraph and proposed HHGBiRank. From another view of the higher-order structure of networks, namely motifs, Zhao et al. [35] proposed motif-based PageRank.
In addition to identifying influential nodes for general purpose social networks, needs have also emerged for open source software community, a special kind of social network. Xuan et al. [8] constructed several social networks based on the communications between developers in Apache and applied degree, PageRank, and HITS for developer ranking. Hu et al. [7] studied the problem of influence identification of developers in GitHub and proposed a Following-Star-Fork-Activity-based approach. Joblin et al. [5] employed several activity counts, centrality metrics, and network structural properties to distinguish core and peripheral developers.

Method
In this section, we present a Multiplex Bipartite Ranking method, called M-BiRank, for co-ranking developers and projects in the open source software community. As shown in Fig. 1, the proposed M-BiRank consists of three parts and incorporates two basic assumptions that address the issues raised in Section 1. We start by giving the definition of a multiplex bipartite network and introducing the notations. Then, thorough explanations and mathematical formulations are given for the two basic assumptions, that is, mutual reinforcement between different types of entities and between different layers of the network. Finally, we introduce the overall algorithm and its time complexity analysis.
[Fig. 1: Step 1. Ranking iteration in Layer A; Step 2. Ranking scores in Layer A diffuse to Layer B; Step 3. Ranking iteration in Layer B]
In this paper, we model multiple interactions between developers and projects as a developer-project multiplex bipartite network. The notations we will use throughout the article are summarized in Table 1.

Mutual reinforcement between developers and projects
Most ranking methods in networks adopt the same intuition as PageRank and HITS, namely that an influential node should be linked by many other influential nodes. This intuition also applies to the developer-project bipartite network. For example, an elite developer usually participates in popular projects. In the open source software community, it is common practice to estimate the influence of a developer by how popular the projects she/he participates in are and how much she/he contributes to these popular projects. Likewise, a project with influential developers or organizations as major contributors usually attracts a large amount of attention. Taking TensorFlow as an example, it got thousands of stars quickly upon its first release on GitHub because it is supported by Google. This intuition forms our first assumption that a developer (project) should be ranked high if it is connected to high-ranked projects (developers) in a certain layer A, which can be formulated as follows:
In order to employ a prior belief on nodes' importance and provide better ranking results, we also adopt the query vector and symmetric normalization used in BiRank. The prior belief on nodes' importance and the rankings from network structure are balanced with two parameters γ and λ. The final formulation of the mutual reinforcement between developers and projects is as follows:
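The equations referred to in this subsection are not reproduced in this text. As a hedged sketch only, assuming the formulation follows BiRank [11] with the parameters γ and λ named above, the two steps could be written as below. The symbols w_{ij} (weight of the edge between developer u_i and project p_j in layer A), d_i and d_j (weighted degrees), and the assignment of γ to the project update and λ to the developer update are assumptions made for illustration, not notation confirmed by the source.

```latex
% Plain mutual reinforcement (assumed form)
p_j = \sum_{i} w_{ij}\, u_i , \qquad
u_i = \sum_{j} w_{ij}\, p_j

% With symmetric normalization and query vectors u^0, p^0 (assumed BiRank-style form)
p_j = \gamma \sum_{i} \frac{w_{ij}}{\sqrt{d_i}\,\sqrt{d_j}}\, u_i + (1-\gamma)\, p_j^{0} , \qquad
u_i = \lambda \sum_{j} \frac{w_{ij}}{\sqrt{d_i}\,\sqrt{d_j}}\, p_j + (1-\lambda)\, u_i^{0}
```

Under this reading, γ and λ close to 1 favor the network structure, and values close to 0 favor the prior belief carried by the query vectors.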

Mutual reinforcement between different layers
Besides the mutual reinforcement between developers and projects in each single layer, we also take into account the mutual reinforcement between different layers. From our experience as open source software developers, we assume that different interactions between developers and projects reflect different aspects of influence and that only a composition of all these aspects reflects the comprehensive influence of developers and projects. For example, committing code to a project indicates a developer's coding skill, while commenting on issues of a project may show a developer's design skill or bug-fixing skill. The influence of a developer should therefore be measured by a combination of both coding and design skills. To implement mutual reinforcement between different layers, we incorporate the ranking scores of developers and projects from the first layer as an enhancement of the query vectors of the second layer. The mathematical formulations are given in Eqs. (5) and (6). For clarity, we transform Eqs. (5) and (6) to their equivalent matrix form in Eqs. (7) and (8):

Overall algorithm
By combining the mutual reinforcement between developers and projects in each single layer with that between different layers, we finally propose the M-BiRank method to co-rank developers and projects in the open source software community. The overall algorithm is shown in Algorithm 1.

Algorithm 1
Input: Weight matrices W_A and W_B, query vectors u_0, p_0, and hyper-parameters γ, λ;
Output: Ranking vectors u, p;
...
Initialize p_B and u_B using p_A and u_A, respectively;
9: while stopping criterion is not met do
10: ...
14: return u and p.
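The two-layer procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, each layer is iterated with a BiRank-style update (symmetric normalization plus query vectors), and the layer-A scores, rescaled to sum to 1, are used directly as the query vectors of layer B. The exact way layer-A scores enhance the layer-B query vectors is given by Eqs. (5) to (8), which are not reproduced here.

```python
import numpy as np

def sym_normalize(W):
    """Symmetric normalization S = D_u^{-1/2} W D_p^{-1/2} (BiRank-style)."""
    W = np.asarray(W, dtype=float)
    du, dp = W.sum(axis=1), W.sum(axis=0)
    du_inv, dp_inv = np.zeros_like(du), np.zeros_like(dp)
    du_inv[du > 0] = du[du > 0] ** -0.5   # isolated nodes keep a zero factor
    dp_inv[dp > 0] = dp[dp > 0] ** -0.5
    return du_inv[:, None] * W * dp_inv[None, :]

def birank_layer(W, u0, p0, gamma=0.85, lam=0.85, tol=1e-10, max_iter=1000):
    """Iterate developer (u) and project (p) scores on one layer until they stabilize."""
    S = sym_normalize(W)
    u, p = np.array(u0, dtype=float), np.array(p0, dtype=float)
    for _ in range(max_iter):
        p_new = gamma * (S.T @ u) + (1 - gamma) * p0      # projects from developers
        u_new = lam * (S @ p_new) + (1 - lam) * u0        # developers from projects
        delta = np.abs(u_new - u).sum() + np.abs(p_new - p).sum()
        u, p = u_new, p_new
        if delta < tol:                                   # stopping criterion (assumed)
            break
    return u, p

def m_birank(W_a, W_b, u0, p0, gamma=0.85, lam=0.85):
    """Rank on layer A, diffuse the scores to layer B, then rank on layer B."""
    u_a, p_a = birank_layer(W_a, u0, p0, gamma, lam)
    # Assumption: normalized layer-A scores serve as the layer-B query vectors.
    u_b0, p_b0 = u_a / u_a.sum(), p_a / p_a.sum()
    return birank_layer(W_b, u_b0, p_b0, gamma, lam)
```

Because the symmetrically normalized matrix has spectral norm at most 1 and γλ < 1, the per-layer iteration in this sketch is a contraction and converges regardless of the starting vectors.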

Time complexity analysis
The overall time complexity of M-BiRank is the sum of the time complexity of each layer. For each layer, according to Eqs.

Experiment
In this section, the performance of M-BiRank model is evaluated against the GHTorrent dataset [36].

Datasets
GHTorrent [36] monitors the GitHub public event timeline and retrieves information about developers, projects, and the interactions between them from these events [37]. We choose the GHTorrent dataset as of November 1, 2018, and extract the relationships between developers and projects, both of which mainly belong to the PHP community. The data preprocessing steps include the following: (1) choose issues and commits which belong to PHP projects, (2) keep developers and projects which exist in both issues and commits, and (3) Table 2.

Evaluation metrics
In order to evaluate and compare the performance of M-BiRank and the baseline methods, both correlation analysis and the SIR model are adopted. Correlation analysis compares predictions against the ground truth, and Pearson's correlation coefficient (PCC) [38] is chosen. In our experiment, the number of watches of projects and the number of followers of developers are set as the ground truth for the rankings of projects and developers, respectively. PCC reflects the degree of correlation between two variables through the linear correlation between vectors, and is defined as follows: $$\mathrm{PCC}(x, y) = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}$$ where n represents the number of elements; x_i and y_i represent the ith element of samples x and y, respectively; \bar{x} and \bar{y} are the sample means; and the value range of PCC is [−1, 1].
However, the ground truth in correlation analysis is a kind of degree in the network, which is only a rough metric for evaluating the influence of developers or projects. To rank more precisely, dynamic models are needed to simulate the influence diffusion process [39]. The SIR model [40] is a classical epidemic model and is often used to evaluate the information spreading ability of a node in social networks. Generally, an influential user with a higher ranking score will spread his/her opinions to more developers. The transmission process of the SIR model is shown in Fig. 2, where S (Susceptible), I (Infected), and R (Removed) denote the susceptible, infected, and removed nodes. At the initial step of the transmission process, several infected nodes are set, and then the transmission is iteratively repeated until no new nodes are infected [41]. At each step, each infected node infects its susceptible neighbors with probability α and recovers to the removed state with probability β. The SIR model is therefore suitable for evaluating the information spreading ability of a node.
By using the nodes with the highest ranking scores under each ranking method as the initial infected nodes and comparing the final number of affected nodes (both infected and removed nodes), the effectiveness of different ranking methods can be compared.
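The SIR-based comparison can be illustrated with a small simulation. The sketch below is an assumption-laden illustration rather than the experimental code: the function name `sir_spread`, the adjacency-dictionary input, and the stopping rule (stop when no infected nodes remain or after `max_steps` steps) are hypothetical choices.

```python
import random

def sir_spread(adj, seeds, alpha, beta, max_steps=300, rng=None):
    """Run one SIR process and return the number of affected nodes (I + R).

    adj   : dict mapping each node to an iterable of its neighbors
    seeds : initially infected nodes
    alpha : per-step infection probability along each edge
    beta  : per-step probability that an infected node becomes removed
    """
    rng = rng or random.Random()
    infected, removed = set(seeds), set()
    for _ in range(max_steps):
        if not infected:                      # nothing left to spread
            break
        newly_infected = set()
        for node in infected:
            for nb in adj.get(node, ()):
                susceptible = nb not in infected and nb not in removed
                if susceptible and rng.random() < alpha:
                    newly_infected.add(nb)
        # Nodes infected before this step may move to the removed state.
        recovered = {n for n in infected if rng.random() < beta}
        infected = (infected | newly_infected) - recovered
        removed |= recovered
    return len(infected) + len(removed)
```

To compare two ranking methods, one would call `sir_spread` with each method's top-ranked nodes as `seeds`, repeat the run several times, and average the returned counts.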

Baseline methods
We compare M-BiRank with several baseline methods: Degree [42]. The degrees of developers and projects in different layers of multiplex developer-project bipartite network are calculated and averaged.
PageRank [9]. PageRank ranks nodes by iteratively propagating scores on the network and is usually suited to single-layer unipartite networks. In this experiment, we apply it to the multiplex developer-project bipartite network with two different setups. PageRank-Avg ignores node types and applies the PageRank algorithm directly to each layer of the multiplex developer-project bipartite network. The final ranking score of a node is the average of its scores over the layers. PageRank-Add merges the layers of the multiplex developer-project bipartite network into a single-layer developer-project bipartite network and uses the average edge weights over the layers as the edge weights of this single-layer network. Then, we apply the PageRank algorithm to this single-layer developer-project bipartite network, again ignoring node types. Finally, both PageRank-Avg and PageRank-Add rank developers and projects separately according to their final ranking scores. The hyperparameter is set to 0.85.
BiRank [11]. BiRank is a propagation-based ranking method on bipartite networks and adopts a normalization strategy in the iterative process. BiRank-Avg applies the BiRank algorithm to each layer of the multiplex developer-project bipartite network separately and averages the ranking scores over the layers as the final ranking scores. BiRank-Add first merges the layers of the multiplex developer-project bipartite network into a single-layer developer-project bipartite network, with the average of the edge weights over the layers as edge weights, and then applies BiRank to it. Both hyperparameters are set to 0.85.
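The difference between the "-Avg" and "-Add" setups is only where the aggregation happens: after ranking or before it. A minimal sketch, assuming every layer is a developer-by-project weight matrix of the same shape and that `rank_fn` is any single-layer ranker returning one score vector (both the function names and the `rank_fn` interface are hypothetical):

```python
import numpy as np

def merge_layers(layers):
    """Merge layers into one by averaging edge weights (used by the '-Add' setups)."""
    return np.mean(np.stack([np.asarray(W, dtype=float) for W in layers]), axis=0)

def rank_avg(layers, rank_fn):
    """'-Avg' setup: rank each layer separately, then average the scores."""
    return np.mean(np.stack([rank_fn(np.asarray(W, dtype=float)) for W in layers]), axis=0)

def rank_add(layers, rank_fn):
    """'-Add' setup: merge the layers first, then rank the merged layer once."""
    return rank_fn(merge_layers(layers))
```

For a linear score such as weighted degree the two setups coincide; for iterative rankers such as PageRank or BiRank they generally differ, which is why both are reported as separate baselines.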
Multiplex PageRank [13]. Multiplex PageRank considers the impact of the centrality of a node in one layer on that in another layer and introduces nodes' centrality of the preceding layer to current layer in four ways. In this experiment, we choose the Additive Multiplex PageRank and have two different setups, that is, MPR-Commit uses the commit layer as the first layer and MPR-Issue uses the issue layer as the first layer. The hyperparameter is set to 0.85.
M-BiRank. M-BiRank is the method we propose for ranking nodes in a multiplex bipartite network. As in the Multiplex PageRank setup, M-BiRank-Commit uses the commit layer as the first layer and M-BiRank-Issue uses the issue layer as the first layer. The hyperparameters γ and λ are set to 0.85. Each element of the query vector u_0 (p_0) for the corresponding node (developer/project) is set to the sum of the weights of all its edges divided by the total weight of all edges in the developer-project bipartite network of the first layer.
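The query vector definition above amounts to a normalized weighted degree. A short sketch, assuming the first layer is stored as a developer-by-project weight matrix `W` (the function name is hypothetical):

```python
import numpy as np

def query_vectors(W):
    """Return (u0, p0): each node's edge-weight sum over the layer's total edge weight."""
    W = np.asarray(W, dtype=float)
    total = W.sum()                 # total weight of all edges in the layer
    u0 = W.sum(axis=1) / total      # one entry per developer (row sums)
    p0 = W.sum(axis=0) / total      # one entry per project (column sums)
    return u0, p0
```

With this definition each query vector sums to 1, so it can be read as a prior distribution over developers or projects.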

Results
We compare the experimental results of M-BiRank with baseline methods by both correlation analysis and SIR modeling. The hyperparameters for M-BiRank γ and λ are both set to 0.85.

Correlation analysis
In correlation analysis, the follower number of developers and the watch number of projects are set as the ground truth for ranking developers and projects, respectively. Pearson's correlation coefficient (PCC) is calculated between the ranking results from M-BiRank and baseline methods and ground truth rankings. The results are shown in Table 3.
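For reference, the PCC between a vector of ranking scores and a ground-truth vector can be computed as in the sketch below. The function name and the toy numbers in the usage note are illustrative only and are not taken from the dataset.

```python
import numpy as np

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()          # center both samples
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))
```

For example, `pearson(scores, followers)` would compare developer ranking scores with follower counts; the result agrees with the off-diagonal entry of `np.corrcoef(scores, followers)`.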
From the results of correlation analysis, we have the following observations: (1) The proposed M-BiRank model outperforms all the baseline methods for both developer ranking and project ranking. This indicates that it is necessary to model multiple interactions between developers and projects as a multiplex bipartite network, which considers not only the mutual enhancement between developers and projects but also the mutual enhancement between different interactions. This agrees well with real-world practice. For example, a project in which elite developers participate is usually a popular project, and a developer participating in popular projects is often an elite developer. Developers can take part in a project in different ways, such as committing code or solving issues, and these ways are tightly coupled.
(2) Comparing the different settings of M-BiRank itself, M-BiRank-Commit performs better than M-BiRank-Issue in most cases, which means it is better to take the commit layer of the multiplex developer-project bipartite network as the initial layer for the M-BiRank model. This also agrees with real-world practice. Issues are an auxiliary feature of social collaborative coding that provides a discussion board for software developers about bugs and designs, whereas commits are the core activity of software development, so the commit layer is more important. M-BiRank-Commit therefore performs better in identifying influential developers and projects. In Section 4.4.2, we only compare M-BiRank-Commit with the benchmark methods.

SIR simulation
In this section, to evaluate the information spreading ability of the top 100 developers ranked by different methods, the SIR model is applied to the commit layer of the developer-project multiplex bipartite network. M-BiRank is compared against each baseline method separately. For each comparison, the initial infected nodes (developers) for the SIR model are the top 100 developers ranked by each method, excluding those ranked in the top 100 by both methods. During the SIR process, an infected node infects each of its neighbors with probability α = 0.005 and, at the same step, recovers to the removed state with probability β = 0.006.
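The seed selection for a pairwise comparison can be sketched as follows, assuming each method supplies a list of developer identifiers sorted from highest to lowest score (the function name and input format are hypothetical):

```python
def seeds_excluding_overlap(ranking_a, ranking_b, k=100):
    """Return the top-k nodes of each ranking, dropping nodes in both top-k lists."""
    top_a, top_b = ranking_a[:k], ranking_b[:k]
    common = set(top_a) & set(top_b)
    seeds_a = [node for node in top_a if node not in common]
    seeds_b = [node for node in top_b if node not in common]
    return seeds_a, seeds_b
```

Dropping the shared nodes keeps both seed sets the same size and isolates the developers on which the two methods actually disagree.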
For each SIR simulation, we run at most 300 iterations and repeat the simulation 10 times to average the value at each step. The results are shown in Figs. 3, 4, 5, 6, and 7, and several significant observations are found: (1) The comparison between different settings of M-BiRank itself in Fig. 3 indicates that M-BiRank-Commit performs better, which is consistent with the result of the correlation analysis in Section 4.4.1. Thus, only M-BiRank-Commit is compared against the baseline methods in the rest of this section.
(2) M-BiRank outperforms all the baseline methods in identifying influential developers, which means that node types and mutual reinforcement among different interactions play important roles and that a multiplex bipartite network can model multiple interactions between two different types of nodes more precisely. In particular, the performance difference between M-BiRank and BiRank is larger than that between M-BiRank and Multiplex PageRank (MPR), from which we conclude that considering mutual reinforcement among different interactions is more important than distinguishing node types.
(3) The final number of infected projects is larger than that of developers for both M-BiRank and all the baseline methods. According to research on epidemics on networks, information spreads faster and more broadly in networks with shorter average path length. From Table 2, we can see that the average degree of projects is larger than that of developers.

Case study
In addition to correlation analysis and SIR simulation, we conduct a detailed case study to show the effectiveness of our model in identifying influential developers and projects. The top 20 developers and projects ranked by our model M-BiRank are listed in Tables 4 and 5, respectively, followed by their ranks under the baseline methods. Table 4 indicates that the baseline methods and M-BiRank rank the first six developers similarly, while some influential PHP developers ranked in the top 20 by M-BiRank are not identified, or are ranked with lower scores, by the baseline methods. For example, Fabien Potencier (GitHub ID: fabpot) and Taylor Otwell (GitHub ID: taylorotwell), the most active contributors of the two most popular PHP frameworks, Symfony and Laravel, are not identified as influential developers by some of the baseline methods. On GitHub as of June 1, 2020, Symfony and Laravel have 23.3k and 59.4k stars, respectively, and Fabien Potencier and Taylor Otwell have 10.4k and 18.6k followers, respectively. Taylor Otwell has more followers than Fabien Potencier, and Laravel is more popular than Symfony, but Fabien Potencier is ranked higher than Taylor Otwell because Laravel is based on some popular components of Symfony. Thus, we can conclude that Fabien Potencier is more influential than Taylor Otwell.
As for projects, Table 5 shows that both M-BiRank and the baseline methods rank popular PHP frameworks with higher scores. However, some important PHP components identified by M-BiRank are not identified as influential projects, or are ranked with lower scores, by the baseline methods. For example, illuminate/database, a popular ORM library, is ranked with a high score by M-BiRank but with a lower score by BiRank-Add and PageRank-Add, and is not identified as an influential project by BiRank-Avg, PageRank-Avg, and MPR-Commit. In modern web development, ORM is critical because it is responsible for database access.

Experimental settings discussion
In the experiment, several key settings affect the performance of M-BiRank, and a brief discussion of these settings follows. First, we study the impact of edge weight on the ranking process. Both unweighted and weighted developer-project multiplex bipartite networks are constructed; in the weighted case, the number of interactions between a developer and a project is used as the edge weight. Then, correlation analysis on the top k developers and projects is applied, and the results are shown in Fig. 8. The weighted developer-project multiplex bipartite network performs better than the unweighted one, so edge weight plays an important role in identifying influential developers and projects.
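The weighted and unweighted constructions differ only in how repeated interactions are counted. A minimal sketch, assuming one layer's raw data is a list of (developer, project) records with one record per interaction (the function name, record format, and names in the usage note are hypothetical):

```python
from collections import Counter

def build_layer(interactions, weighted=True):
    """Build one layer as a dict mapping (developer, project) edges to weights.

    weighted=True  : edge weight is the number of interactions on that edge
    weighted=False : every observed edge gets weight 1
    """
    counts = Counter(interactions)
    if weighted:
        return dict(counts)
    return {edge: 1 for edge in counts}
```

For instance, two commits by `alice` to `repo1` give the edge (`alice`, `repo1`) weight 2 in the weighted layer and weight 1 in the unweighted layer.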
Then, the experimental settings for SIR simulation are discussed. It can be seen from Fig. 9 that the more initial infected nodes are set, the more nodes are finally infected. It is also clear that setting the same number of top k projects as initial infected nodes results in more final infected nodes. Finally, the hyperparameters γ and λ of our proposed M-BiRank model are analyzed. For simplicity, we consider the condition that γ and λ are equal. It can be concluded from Fig. 10 that both the prior belief of developers' (projects') importance and the rankings from network structure play roles in the final rankings of developers (projects), and their contributions to the final rankings are approximately equal.

Conclusions
In this work, we study the problem of identifying influential developers and projects in open source software community. We model multiple interactions between developers and projects as a multiplex bipartite network and propose an iterative refinement ranking method M-BiRank by incorporating the mutual reinforcement between developers and projects as well as between multiple developer-project interactions. The proposed M-BiRank is evaluated against four baseline methods on real-world GitHub dataset. Extensive experimental analysis and case study show M-BiRank significantly outperforms baseline methods in both correlation analysis and SIR simulation.
The general idea behind the proposed M-BiRank is modeling multiple kinds of entities and interactions in open source software community into a single network and incorporating mutual reinforcement between different kinds of entities as well as between different types of interactions when ranking. As we know, there are other entities such as blogs and organizations in addition to developers and projects in open source software community and plenty of interactions between them such as user-user following and project-project dependency. In future work, more entities and interactions could be introduced and modeled as a heterogeneous information network and mutual reinforcement in ranking would be generalized using meta-path.