Logistic regression based in-service assessment of mobile web browsing service quality acceptability

In this paper, we present a logistic regression model for assessing users' quality of experience (QoE) with the web browsing service over a mobile network. For this purpose, we chose the Average-Time-to-Connect-TCP network service quality parameter as the independent predictor, obtained by passive monitoring of live traffic data captured by a passive probe on the Gn interface of the mobile network and related to detailed records of the Transmission Control Protocol. In parallel with the in-service measurement of the selected network parameter, we conducted simultaneous subjective tests of QoE acceptability, specifically for the web browsing service. The model provided correct acceptability classification in 84.5% of cases, while reducing the chosen independent predictor by 100 ms increased the odds of service acceptability by a factor of 1.65. Based on the obtained results, the applied logistic regression model provides a satisfactory estimate of web browsing QoE acceptability.


Introduction
Web browsing is among the dominant cellular network applications and is expected to grow by 39% annually over the coming 6 years [1]. Users' growing demand for reliable data delivery comes along with their expectation of adequate Quality of Experience (QoE), making the latter the most important criterion when users select a specific service provider. Consequently, network operators are keen to ensure the best possible QoE level, for which the conditio sine qua non is the ability to reliably and accurately assess customer satisfaction with their services, such as web browsing.
The majority of existing QoE estimation models are based on mean opinion score (MOS) testing, but MOS-based monitoring of web browsing QoE in particular requires fairly complex metrics [2][3][4].
Therefore, it has become crucial to relate the subjective QoE to measurable technical service quality parameter(s), which can be in-service monitored in the operator environment, so enabling objective and real-time QoE estimation [3].
Moreover, network operators are mostly interested in testing users' acceptability of the provided web services [5][6][7], i.e., the "binary measure to locate the threshold of minimum acceptable quality that fulfills user quality expectations and needs for certain application or system" [7]. Consequently, in recent years, a number of acceptability-based QoE models have been proposed, especially for video signal delivery [8][9][10], as well as for interactive data services [11].
Specifically, ITU-T Recommendation G.1030 provides some experimental results regarding users' perception of web browsing response time, as well as some guidelines for QoE estimation [2]. The corresponding experiments were conducted to evaluate the suitability of the developed network emulator system for QoE estimation [3] and to validate ITU-T Recommendation G.1030. The obtained results show a logarithmic dependency between the QoE and the page load time for a simple web page.
Furthermore, in contrast to studies mostly based on direct user feedback [2][3][4], the QoE is sometimes estimated from passive network tests [12,13]. Specifically, the relationship between QoE and Quality of Service (QoS) for web browsing was analyzed based on HyperText Transfer Protocol (HTTP)/Transmission Control Protocol (TCP) traces collected in the network, where the cancelation rate of HTTP requests was used for QoE estimation, but without any validation of the results by simultaneous real-life subjective QoE testing.
Moreover, although some studies combine subjective user ratings with network-level information, the experimental findings relating recorded TCP and HTTP traces to web browsing QoE are reported only graphically and are not backed by any analytical model [14]. Therefore, in this paper, we address the aforementioned challenges by developing a QoE acceptability prediction model for the web browsing service over a mobile network, based on network parameters that are in-service measurable by passive monitoring of live traffic data and practically implementable by mobile operators.
As linear regression is not appropriate for modeling acceptability-based QoE, where the outcome variable (the acceptability) is binary, logistic regression is the method of our choice. To the best of our knowledge, no attempt has so far been made to assess the acceptability of the mobile web browsing service by means of logistic regression.
In this regard, in previous work [15], the extent of the relationship between the in-service measured live traffic parameters and the web browsing user QoE in the mobile network was analyzed using Spearman's rank-order correlation. Taking into account both the strength and the direction of the relationship, the parameters Average-Time-to-Get-1st-Data and Average-Time-to-Connect-TCP exhibited the strongest relationship with the web browsing QoE evaluated on the ordinary 5-point Likert scale (with ratings: excellent, good, fair, poor, bad). Therefore, in this paper, we follow that indication by applying logistic regression to the selected parameter to assess users' acceptability of the quality level experienced with the web browsing service in particular.
However, in contrast to other investigations, which mostly use ordinal logistic regression (in compliance with test data in which both the independent variables and the dependent one are categorical, with numerical MOS ratings converted to categorical data), here we put the accent on today's main QoE imperative of network operators for web browsing in particular: obtaining an objective, binary-type customer QoE rating, the acceptability, which we model by binary logistic regression applied to the Average-Time-to-Connect-TCP parameter that we found most relevant in this sense.
The rest of this paper is organized as follows: in Section 2, we review the basics of the logistic regression used in the QoE acceptability prediction model, and also describe the test setup and tools used for conducting the experiment; in Section 3, we present the test results and the analysis of the experimental data; conclusions are drawn in Section 4.

Methods
Before analyzing the acquired data by means of logistic regression, we review the concepts of the model and then apply it to QoE prediction.

Logistic regression
Regression is mostly used as a means to predict a random variable from a number of mutually independent random variables and a constant.
Specifically, logistic regression is used for predicting the probability that an observation will be sorted into one of two categories of a dichotomous dependent random variable, based on one or more independent random variables, which can be continuous or categorical. In many aspects, logistic regression is similar to linear regression, except for the type of the dependent variable: logistic regression does not provide an estimated value of the dependent variable, but rather the probability that it will belong to a certain category, given the values of the independent variables.
Among the three types of logistic regression, namely binary, ordinal, and nominal, the first one is used when the dependent variable is binary, i.e., takes one out of two categories. Moreover, if a dependent random variable can take more categories, then the ordinal logistic regression or the nominal one is to be used for ordered and unordered categories, respectively.
However, as already mentioned in Section 1, although ordinal logistic regression has been used most frequently (even after properly converting MOS scores), our focus here is on QoE acceptability, so we consider the binomial logistic regression, commonly referred to simply as logistic regression.
Essentially, it is a supervised machine-learning classification algorithm used to predict the conditional probability that a certain individual observation belongs to one of two categories, i.e., that the corresponding dichotomous dependent random variable Y takes one of two possible values (1 or 0), conditioned on one or more (N) continuous or categorical mutually independent random variables X_i taking their corresponding values x_i [16][17][18].
Let us assume the simple linear form of the logit transform (from now on just logit) of Π(x_i), specifically for a single value x_i = x [16]:

logit(Π(x)) = ln[Π(x) / (1 − Π(x))] (1)

logit(Π(x)) = α + βx (2)

where the odds are defined as the ratio of the probability Π(x) that the event (outcome of interest) will occur for a particular value x of the random variable X, and the probability 1 − Π(x) that the event will not occur, while β is the slope coefficient and the constant α is referred to as the intercept.
From (1) and (2), Π(x) can be expressed as:

Π(x) = e^(α + βx) / (1 + e^(α + βx)) (3)

where the iterative maximum likelihood (ML) method is used for estimating the α and β values, by testing the null hypothesis that these do not make the logistic regression accurate enough. In this case, a small significance value (the p value) indicates strong evidence to reject the null hypothesis.
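As a minimal numerical illustration of the logit relation (1)-(2) and its inverse (not part of the original test tooling), the mapping can be sketched in a few lines of Python; the α and β values below are the estimates reported later in Section 3.2:

```python
import math

def logit(p):
    """Log-odds of probability p: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def pi(x, alpha, beta):
    """Event probability Pi(x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    return 1.0 / (1.0 + math.exp(-z))

# Round trip: logit(Pi(x)) recovers the linear predictor alpha + beta*x
alpha, beta, x = 4.746, -0.005, 600.0
p = pi(x, alpha, beta)
assert abs(logit(p) - (alpha + beta * x)) < 1e-9
```

The inverse form of `pi` written as `1 / (1 + e^(-z))` is algebraically identical to (the standard) `e^z / (1 + e^z)`, but numerically safer for large positive z.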
In order to use binomial logistic regression in practice, the following main assumptions need to be fulfilled [18]:
1. The observed dependent random variable Y must be dichotomous and a function of one or more mutually independent and non-collinear predicting random variables (predictors).
2. The logit transform must be a linear function of the continuous predicting random variables.
3. Each test observation must be independent of the others, and all test categories should be mutually exclusive and exhaustive.
4. The data must not exhibit significant outliers, high-leverage points, or highly influential points; otherwise, the reliability of the estimates may degrade significantly.

Test setup
The test setup is presented in Fig. 1. The experiment was carried out on a live network. The test configuration included the client and the gateway, the latter connected to the live Evolved High Speed Packet Access (HSPA+ Rev. 8) mobile network (providing up to 42 Mb/s with 64QAM in the downlink and 11.5 Mb/s with 16QAM in the uplink), which in turn was connected to the internet. The gateway ran on Ubuntu Linux, where the NetEm [19] enhancement of the Linux traffic control (tc) facilities enabled varying the network conditions by introducing packet delay and packet loss into the experiment. We chose the test point to be at the Gn interface, where the Oracle Performance Intelligence Center (PIC) [20] with a passive probe captured the traffic data (Fig. 1). Each test participant took part in the experiments using the client running on a Windows 8 PC. The client device was connected to the gateway via a 100 Mb/s full-duplex Ethernet link. The Huawei E3272 LTE USB modem was used for testing, managed by the embedded Connection Manager software, which allowed setting the preferred access network.
We set HSPA+ as the preferred access network in the experiment. The client system was connected to the internet through the mobile network via the gateway. On both laptops, automatic software updates were disabled. The participants used the Mozilla Firefox 35.0.1 web browser. The HTTP and TCP extended Detailed Records (xDRs) from the data captured on the Gn interface were made available using the ProTrace application on the Oracle PIC platform. In this way, we defined and activated new statistical sessions, which generated the in-service parameters' values aggregated over 5-min intervals. The parameters were defined from the HTTP and TCP xDRs for the Mobile Station International Subscriber Directory Number (MSISDN) [15] of the test SIM card.
The experiment was conducted with ten users, five female and five male, aged between 12 and 45 years. All participants used the internet at least 1 h a day, usually via WiFi access, switching to mobile internet access only when WiFi was unavailable.
We investigated the relationship between the QoE and the in-service parameters through the following test scenario: each participant tested web browsing six times under different network conditions, set by NetEm (adding delay or packet loss during the experiment).
The duration of a single test was limited to 5 min, during which the participants accessed web pages of their choice and simply answered "yes" or "no" to whether the technical quality of the web browsing service was acceptable.
Following that, we ran statistics sessions on the Oracle PIC platform to process the collected values of the relevant in-service parameter measured on the Gn interface, Average-Time-to-Connect-TCP, which is the average time between the SYN and the ACK in the TCP three-way handshake sequence, needed to establish the TCP connections within a 5-min interval [1].
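For illustration only, the aggregation behind this parameter can be sketched as follows; the per-connection timestamp pairs and the function name are our own hypothetical constructs, not the Oracle PIC interface:

```python
from statistics import mean

def avg_time_to_connect_tcp(handshakes, window_s=300):
    """Average SYN-to-ACK handshake time (ms) per aggregation window.

    `handshakes` is a hypothetical list of (syn_timestamp_s, ack_timestamp_s)
    pairs, one per TCP connection, as a passive probe might export them.
    Returns {window_index: average_ms} over 5-min (300 s) windows.
    """
    windows = {}
    for syn_ts, ack_ts in handshakes:
        key = int(syn_ts // window_s)                      # 5-min bucket index
        windows.setdefault(key, []).append((ack_ts - syn_ts) * 1000.0)
    return {k: mean(v) for k, v in windows.items()}

# Two connections in the first 5-min window: 120 ms and 180 ms handshakes,
# averaging to 150 ms for that window
result = avg_time_to_connect_tcp([(0.0, 0.120), (10.0, 10.180)])
```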

Test tools
We used the Oracle PIC as the monitoring and data-gathering system; it helps service providers manage their assets, encompassing network performance, QoS, and customer analysis [20]. The PIC uses passive probes to capture traffic data and forward protocol data units (PDUs) to the Integrated xDR Platform (IXP). The IXP stores these traffic data and correlates them into detailed records. The PIC provides applications that mine the detailed records to deliver value-added services such as network performance analysis, call tracing, and reporting [21]. For the purpose of this research, we used the HTTP and TCP sessions on the Gn interface of the mobile network, defining parameters and statistics sessions with the ProTraq application [21]. Furthermore, we used NetEm, an enhancement of the Linux traffic control facilities that allows adding delay, loss, duplication, and other impairments to the packets outgoing from the selected network interface. NetEm is built using the existing QoS and differentiated services (DiffServ) facilities in the Linux kernel [19].

Discussion and results
We analyzed the fields of the data records collected from HTTP and TCP sessions on the Gn interface and selected those used to define the in-service parameters in Oracle PIC [15]. In this regard, some in-service parameters from the HTTP and TCP xDRs, based on the data captured by our extensive testing on the Gn interface, are presented in the Appendix, while Fig. 2 shows exemplary time intervals of a relevant TCP record. The task now is to find out which of the in-service measured parameters most strongly influences the QoE acceptability, so that it can be selected as the predicting variable of the logistic regression.
In this respect, we consider the correlation to be the best indication, and we therefore calculated it for the various parameters, as presented in Table 1.
Due to the monotonic relationship and the evident strong (negative) correlation (Spearman correlation coefficient r_s = −0.791, significance p < 0.01) between the in-service measured parameter Average-Time-to-Connect-TCP and users' acceptability of the service quality level [15], we consider the former as the independent predicting random variable X. Its scatter plot with regard to the QoE ratings is presented in Fig. 3, while Table 2 presents its in-service obtained test values, selected from the overall data table in the Appendix.
As can be seen in Fig. 3 and Table 2, only a few sporadic peak values of Average-Time-to-Connect-TCP were measured (e.g., only about 10% of them were larger than 1.5 s). Observing bottom-up through the protocol stack, various reasons for this can be considered, among them possibly excessive retransmissions of Hybrid Automatic Repeat reQuest (HARQ) protocol data units (e.g., due to bit errors at the physical layer). These can produce additional delays that propagate up the stack, causing TCP three-way handshake time-outs (such as the retransmission timeout (RTO)), which imply even further delays of the TCP connection setup time.
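A rank correlation such as the r_s reported in Table 1 can be computed without any statistics package; the following self-contained sketch (with illustrative data, not our measured set) implements Spearman's coefficient as the Pearson correlation of average ranks:

```python
def _ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend the tie group while adjacent sorted values are equal
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0          # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) *
           sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

A perfectly monotone decreasing relation, like the one between connect time and acceptability, yields r_s = −1; for example, `spearman([1, 2, 3, 4, 5], [5, 4, 3, 2, 1])` returns −1.0.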

Verifying the logistic regression assumptions
The first assumption for logistic regression can be considered to hold in this case, as Table 2 shows that the observed dependent random variable Y is dichotomous and a function of just a single continuous predicting random variable (implying that multicollinearity among predictors is not an issue here).
Regarding the second assumption, we used the Box-Tidwell (1962) procedure [22] to test whether the logit transform is a linear function of the predictor, effectively by adding the nonlinear transform X·ln X of the original predictor X as a second, so-called interaction variable, and testing the null hypothesis that adding it does not improve the prediction. As can be seen in the corresponding table presented in Section 3.2, we found the logit linearity condition to hold, with only minor non-linearity possible.
Further, in compliance with the third assumption, each test was done independently of the others, with all test categories being mutually exclusive and exhaustive.
Moreover, the data exhibited quite balanced behavior with no significant outliers and no leverage or influential points.
Consequently, we can justifiably expect that the conducted logistic regression procedure provides a valid conditional probability Π(x) that the dependent random variable Y takes one of its two possible values (1 or 0), conditioned on the single (in our case) predicting random variable X taking the value x.

Table 1. Spearman correlation coefficient between QoE and in-service tested parameters [15] (columns: in-service parameter, QoE)

Test cases and estimated logistic regression parameters
We consider a case (sample) to be a repeatable single test made by a single participant. A number of recommended values exist for the required minimum number of samples (cases), ranging from 15 to 50, but we adopted 60 samples per independent random variable [18], as the ML-based logistic regression estimation degrades significantly for rare test cases. The counts of cases included in/missing from the analysis are given in Table 3 (in accordance with Table 2), while Table 4 presents how the outcome random variable Y is encoded.
The logistic regression coefficients for the model with the independent random variable Average-Time-to-Connect-TCP are estimated to be α = 4.746 and β = −0.005, while their properties (the standard error (S.E.), the Wald Chi-square (χ²) test value [18] for D.F. degrees of freedom, and the significance expressed by the p value) are presented in Tables 5 and 6.
As we can see from the above tables, the Wald test [20] evaluates the independent random variable Average-Time-to-Connect-TCP as statistically significant in the model, as the p value is found to be very low: p < 0.001. Moreover, as mentioned in Section 2, we tested the linearity assumption determining the validity of logistic regression by applying the Box-Tidwell (1962) procedure [22]. Accordingly, we tested the null hypothesis that adding the new variable

Avg_Time_to_Connect_TCP · ln(Avg_Time_to_Connect_TCP)

into the regression would not improve the prediction. As can be seen from the p values in Table 7, we found the logit linearity condition to hold, with only minor non-linearity possible.
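Constructing the Box-Tidwell interaction variable is straightforward; the stdlib-only sketch below (our own illustration) builds the extra column, while refitting the augmented model and testing the significance of its coefficient would be done with a statistics package:

```python
import math

def box_tidwell_term(x):
    """Interaction variable x * ln(x) used by the Box-Tidwell linearity check.

    The augmented model  logit = a + b*x + c*(x*ln x)  is then refitted;
    a non-significant coefficient c supports linearity of the logit in x.
    """
    if x <= 0:
        raise ValueError("Box-Tidwell requires a strictly positive predictor")
    return x * math.log(x)

# Extra column for some example connect times (ms)
times_ms = [400.0, 800.0, 1200.0]
interaction = [box_tidwell_term(x) for x in times_ms]
```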

Intercept-only model and its extension by prediction
Returning to (2), let us first consider the model without the independent random variable Average-Time-to-Connect-TCP, i.e., when:

β = 0 (4)

which effectively modifies (2) into logit(Π) = α, i.e.:

Π = e^α / (1 + e^α) (5)

Accordingly, the next two tables present the outputs related to the model that includes only the intercept α. Such an incomplete model predicts purely according to the category that occurred most frequently in the data set, in accordance with (4)/(5). It simply predicts that the service is acceptable, as the majority of answers in the experiment rated the service acceptable ("yes" in 38 out of 58 cases). Applying this "best guess" strategy, one would be right 65.5% of the time (Table 8).
Accordingly, the estimated statistics for this special case is presented in Table 9.
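The intercept-only figures can be reproduced directly from the answer counts (38 "yes" vs. 20 "no"); a short sketch:

```python
import math

yes, no = 38, 20                   # "acceptable" vs. "not acceptable" answers

odds = yes / no                    # 1.9: the odds of the intercept-only model
alpha0 = math.log(odds)            # its intercept, alpha = ln(odds)
p_yes = odds / (1.0 + odds)        # equals yes/(yes+no) = 0.655: "best guess" rate
```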
As the exponential of the α given in Table 9, the odds are estimated to be 1.9, which conforms to the ratio 38/20 of the counts of users who found the service acceptable and not acceptable, respectively. Let us now turn to the analysis of the logistic regression with the independent random variable X, the Average-Time-to-Connect-TCP, included as a predictor.
The likelihood ratio (LR) test is used to judge the null hypothesis that including the Average-Time-to-Connect-TCP random variable in the model does not significantly increase the ability to predict the decisions made by the subjects. This essentially implies testing the statistic:

G = −2 ln(L0 / L1) (6)

where L0 and L1 are the likelihoods of the test data under the zero-valued and the maximum likelihood estimate (MLE) of the parameter of interest, respectively. Under the hypothesis that β = 0, the statistic G follows the Chi-square distribution with 1 degree of freedom [17]. The corresponding test results are presented in Table 10, from which it can be seen that, for the Chi-square model value of 35.817 with 1 degree of freedom, p < 0.00001, so we justifiably reject the null hypothesis. The results of this test thus indicate that including the Average-Time-to-Connect-TCP random variable in the model statistically significantly increases the ability to predict the acceptability of the service to the users.
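The LR statistic G = −2 ln(L0/L1) and its 1-degree-of-freedom p value can be checked with the standard library alone; the log-likelihoods below are illustrative stand-ins chosen to reproduce G ≈ 35.82, not values from our data:

```python
import math

def lr_test_1df(loglik_null, loglik_full):
    """LR statistic G = -2 ln(L0/L1) = 2(lnL1 - lnL0) and its chi-square(1) p value.

    For 1 d.f., P(chi2 > g) = P(|Z| > sqrt(g)) for standard normal Z,
    which equals erfc(sqrt(g/2)).
    """
    g = 2.0 * (loglik_full - loglik_null)
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p

# Illustrative log-likelihoods giving G = 35.82, close to the 35.817 in Table 10
g, p = lr_test_1df(-37.28, -19.37)
```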

Testing the logistic regression model goodness of fit
Furthermore, the adequacy of the model can be assessed by means of the Hosmer-Lemeshow goodness-of-fit test, which actually evaluates the inadequacy of the model in predicting categorical outcomes, i.e., the hypothesis that the observed data differ significantly from the values predicted by the model.
The test essentially partitions the n observations into g approximately equal-size groups (deciles), so that the first group contains approximately n/10 observations with the smallest estimated probabilities, and the last group approximately n/10 observations with the largest estimated probabilities [17].
The statistic is:

C = Σ_{k=1..g} (O_1k − E_1k)² / [s_k ξ_k (1 − ξ_k)] (7)

where O_1k is the count of observations with Y = 1 (out of s_k observations in total) in the k-th group, E_1k = s_k ξ_k is the expected count of the event in the k-th group, and ξ_k is the average predicted event probability for the k-th group.
The statistic of (7) approximately follows the χ² distribution with g − 2 = 8 degrees of freedom (for g = 10 groups in total). A small enough p value (< 0.05) implies that the model fits the data poorly.
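The Hosmer-Lemeshow statistic can be computed from the per-group counts as in the following sketch; the groups below are illustrative (g = 3 for brevity, whereas the test proper uses g = 10 deciles):

```python
def hosmer_lemeshow(groups):
    """H-L statistic from (observed_events, group_size, mean_predicted_prob) triples.

    C = sum_k (O_1k - E_1k)^2 / (s_k * xi_k * (1 - xi_k)),  E_1k = s_k * xi_k,
    approximately chi-square with g - 2 degrees of freedom.
    """
    c = 0.0
    for o1, s, xi in groups:
        e1 = s * xi                           # expected events E_1k
        c += (o1 - e1) ** 2 / (e1 * (1.0 - xi))
    return c

# Illustrative groups: (observed "yes" count, size, average predicted probability)
stat = hosmer_lemeshow([(1, 6, 0.15), (4, 6, 0.60), (6, 6, 0.95)])
```

A small statistic (relative to the χ² quantile for g − 2 degrees of freedom) corresponds to a high p value, i.e., no evidence of inadequate fit.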
The 2 × g contingency table presents the observed and the expected counts of the event Y = 1; the entries resulting from our test data are given in Table 11.
Finally, as can be seen in Table 12, the obtained high p value indicates low-significance inadequacy of the fitted model, which implies that the model is not to be considered inadequate, i.e., that it is adequate.

Category prediction
Furthermore, as the logistic regression estimates the probability of the event that the service is acceptable to users, we adopt the typical decision threshold of 0.5: if the estimated probability is greater than or equal to 0.5, the event is classified as one that will happen; otherwise, as one that will not [23]. Accordingly, the observed and predicted classifications are presented in Table 13.
As already pointed out in Section 3.3 (Table 8), where the classification includes just the intercept constant, 65.5% of cases overall could be correctly classified by simply classifying all cases as "yes" for acceptability.
However, with the independent random variable included in the model, the so-called Percentage Accuracy in Classification (PAC) [18], seen in Table 13, equals 84.5%, i.e., the model correctly classified that many cases on a relative scale, which is a significant improvement over the classification without the predictor variable.
Another classification feature is the sensitivity: the percentage of cases with the target category correctly predicted by the model, i.e., when the quality of service was evaluated as acceptable ("yes"). As presented in Table 13, 86.8% of the test cases in which participants rated the service as acceptable were also classified by the model as acceptable.
On the contrary, the specificity is the percentage of cases found not to have the target category [21], i.e., correctly classified by the model when the service was not rated as acceptable. In our study, the specificity was found to be 80.0%, meaning that 80% of the participants who did not rate the service as acceptable were correctly classified by the model (Table 13).
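These classification figures can be cross-checked from the 2 × 2 confusion counts; the counts below are reconstructed from the reported percentages and group sizes (an assumption on our part, as Table 13 itself is not reproduced here):

```python
def classification_summary(tp, fn, tn, fp):
    """Accuracy (PAC), sensitivity, and specificity from a 2x2 confusion table."""
    total = tp + fn + tn + fp
    return {
        "accuracy":    (tp + tn) / total,   # overall fraction correctly classified
        "sensitivity": tp / (tp + fn),      # correctly predicted "acceptable"
        "specificity": tn / (tn + fp),      # correctly predicted "not acceptable"
    }

# Reconstructed counts: 33 of 38 "yes" and 16 of 20 "no" correctly classified
summary = classification_summary(tp=33, fn=5, tn=16, fp=4)
```

With these counts, accuracy is 49/58 ≈ 84.5%, sensitivity 33/38 ≈ 86.8%, and specificity 16/20 = 80%, matching the figures quoted above.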
The positive predictive value is the percentage of correctly classified cases exhibiting the target characteristic, relative to the total count of cases predicted to have the target characteristic. In this case, a simple calculation provides:

33 / (33 + 4) = 89.19% (8)

meaning that, out of all cases predicted as acceptable, the prediction is correct for 89.19% of them. The negative predictive value is the percentage of correctly classified cases without the target characteristic, relative to the total count of cases predicted as not having the target characteristic. In this case, it is:

16 / (16 + 5) = 76.19% (9)

meaning that, out of all cases predicted as not acceptable, the prediction is correct for 76.19% of them. Finally, by substituting the values α = 4.746 and β = −0.005 from Section 3.2 into the regression Eq. (2), the latter can be rewritten as:

logit(Π(x)) = 4.746 − 0.005x (10)

i.e.:

Π(x) = e^(4.746 − 0.005x) / (1 + e^(4.746 − 0.005x)) (11)

The conditional probability Π(x) that, for a measured Average-Time-to-Connect-TCP of x milliseconds, the quality of the web browsing service will be acceptable is the target logistic regression test outcome, which is plotted in Fig. 4.
As can be seen in the above graph, the transition between acceptability and non-acceptability is rather steep, as the curve exhibits a threshold effect around the predictor value of 1 s.
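The threshold behavior can be reproduced numerically from the estimated coefficients; a sketch (the 0.5-crossing at x = −α/β = 949.2 ms follows from the fitted model):

```python
import math

def p_acceptable(x_ms, alpha=4.746, beta=-0.005):
    """Estimated probability that web browsing is acceptable, given
    Average-Time-to-Connect-TCP = x_ms (milliseconds)."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x_ms)))

# The curve crosses 0.5 at x = -alpha/beta = 949.2 ms, matching the
# threshold effect around 1 s visible in Fig. 4
p_low, p_mid, p_high = (p_acceptable(x) for x in (500.0, 949.2, 1500.0))
```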
As an example, let us calculate how much the odds are affected by reducing the Average-Time-to-Connect-TCP parameter by 100 ms. In this regard, we simply substitute the decrement into the ratio of the odds:

odds(x − 100) / odds(x) = e^(α + β(x − 100)) / e^(α + βx) = e^(−100β) = e^0.5 ≈ 1.65 (12)

According to (12), reducing the Average-Time-to-Connect-TCP parameter by 100 ms increases the chance of success, i.e., the chance that the service is acceptable, by a factor of 1.65.

Conclusions

We proposed a simple logistic regression model with a single independent predicting variable, the Average-Time-to-Connect-TCP network parameter, derived from live traffic data captured by a passive probe on the Gn interface of the mobile network, to estimate users' QoE acceptability of the web browsing service.
In parallel, we conducted simultaneous subjective user tests of service quality acceptability with a number of participants, to finally correlate the obtained values with the detailed records of the TCP protocol.
The model was found to provide correct estimation of the experienced service quality acceptability, with high statistical significance (Chi-square value above 35, p value below 0.0005) and correct classification in 84.5% of cases.
More specifically, the sensitivity and specificity were found to be 86.8% and 80%, respectively, while the positive and negative predictive values were 89.19% and 76.19%, respectively.
Reducing the network service parameter selected as the predicting variable by 100 ms was found to increase the chance of service acceptability by a factor of 1.65.
We plan to extend the application range of the proposed approach and enhance its ability to predict real-life acceptability of service quality experience by involving more experimental scenarios and wide-area users, as well as other aspects, such as context and an extended set of parameters. Moreover, a full-scale measurement campaign would go beyond the resource-limited preliminary tests reported here, and would include the 4G/5G environment, where this approach and analysis still apply.