YouTube QoE Evaluation Tool for Android Wireless Terminals

In this paper, we present an Android application that evaluates and analyzes the perceived Quality of Experience (QoE) of the YouTube service on wireless terminals. To achieve this goal, the application measures objective Quality of Service (QoS) parameters, which are then mapped onto subjective QoE (in terms of Mean Opinion Score, MOS) by means of a utility function. Our application also informs the user about potential causes of a low MOS and provides some hints to improve it. After each YouTube session, users may optionally rate the session through an online opinion survey. This information has been used in a pilot experience to correlate the theoretical QoE model with real user feedback. Results from this experience show that the theoretical model (taken from the literature) provides slightly more pessimistic results than the user feedback. Users seem to be more indulgent with wireless connections: the MOS from the opinion survey is about 20% higher than that of the theoretical model, which was obtained from wired scenarios.


Introduction
Real-time entertainment services (comprised mostly of streaming video and audio) are becoming one of the dominant web-based services in telecommunications networks. In particular, the YouTube service is currently the largest single source of real-time entertainment traffic and the third most visited Internet site (preceded by Google and Facebook). It has emerged to account for more Internet traffic than any other service. Mobile networks have the highest proportion of real-time entertainment traffic. Nowadays, YouTube leads the way, accounting for 20-25% of total traffic in mobile networks. Additionally, 27.8% of all YouTube traffic (first half of 2012) was consumed on a smartphone or tablet [1].
The combination of increasing device capabilities, high-resolution content and longer video duration (largely due to live content) means that YouTube's growth will continue for the foreseeable future. Driven by higher bitrates and enhanced capabilities of mobile devices, the trend is also moving towards High Definition (HD) video, which considerably raises quality demands. That is the reason why mobile network operators are following this trend closely, as it will strongly influence network requirements and subscriber Quality of Experience (QoE).
QoE has usually been evaluated through subjective tests carried out with users in order to assess their degree of satisfaction by means of the Mean Opinion Score (MOS) indicator [3]. This type of approach is obviously quite expensive, as well as annoying to the user. That is why, in recent years, new methods have been used to estimate the QoE based on certain performance indicators associated with services. The evaluation methodology used by most network operators to obtain statistical QoE is based on field testing. These tests often use mobile handsets as modems, with laptop computers that perform the tests and keep statistics. However, this process is expensive in terms of resources and staff, and it does not use the entire protocol stack implemented in the terminal. These drawbacks are solved by integrating QoE analyzers in the mobile terminal itself, so that the measured statistics are specific to each terminal. Thus, additional measurements can be collected (along the protocol stack) to allow for an enhanced analysis of the performance of each service. Furthermore, if mobile terminals are able to report the measurements to a central server, the QoE assessment process is simplified significantly.
Recently, a number of works have focused on developing subjective QoE evaluation frameworks for mobile users. For instance, an implementation of a QoE measurement framework on the Android platform is presented in [4][5], although results are limited to a laboratory environment. The works in [6][7] present a framework for measuring the QoE for distorted videos in terms of Peak Signal-to-Noise Ratio (PSNR) or a modified metric called cPSNR, respectively. A QoE framework for multimedia services (named QoM) for run-time quality evaluation of video streaming services is presented in [8]; this approach is based on the influence of QoE factors and of various network-level and application-level QoS parameters, although the proposed framework has not been evaluated in a real wireless network. In [11], the problem of YouTube QoE monitoring from an access provider's perspective is investigated, showing that it is possible to detect application-level stalling events by using network-level passive probing only. The work in [12] describes a tool that constantly monitors the YouTube application comfort, making it possible to estimate the time when the YouTube player is stalling.
Other works focus on specific YouTube models to compute the QoE. In [9][10], different YouTube QoE models that take into account the key influence factors in the quality perception (such as stalling events caused by network bottlenecks) are presented. They quantify the impact of initial delays on the user-perceived QoE by means of subjective laboratory and crowdsourcing studies. Other works are devoted to estimating the MOS for video services [14][4][22]; among them, the analysis presented in [22] provides a utility function for HTTP video streaming as a function of three application performance metrics: initial buffering time, mean rebuffering time and rebuffering frequency.
However, none of the previous works has performed a deep validation of existing models through real tests over different radio technologies. In this work we describe an Android application that measures objective Quality of Service (QoS) indicators associated with the YouTube service; these performance indicators are then mapped onto subjective QoE (in terms of MOS). Our application also informs the user about possible causes of a low MOS and provides some hints to improve it. After each YouTube session, users may optionally rate the session through an opinion survey. This information has been used in a pilot experience to correlate the theoretical QoE model with real user feedback.
The remainder of this paper is structured as follows. A description of the YouTube QoE evaluation method is given in section 2, specifying its main performance indicators. In section 3, we describe our Android application for YouTube QoE evaluation. The results from a YouTube evaluation pilot experience are analyzed in section 4. Finally, some concluding remarks are given in section 5.

YouTube QoE evaluation method
The YouTube service employs the progressive download technique, which enables the playback of the video before the content has been completely downloaded [13]. Nowadays, TCP is the preferred transport protocol for YouTube and other video servers, since the majority of video content delivered over the Internet is not live and most users' bandwidth is usually greater than the video coding rate. The HTTP/TCP architecture also solves the problem of many firewalls blocking access to unknown UDP ports. Additionally, the continuous improvements in latency reduction and throughput maximization achieved in new cellular technologies have made it possible to use TCP to minimize the impact of errors without severely reducing the effective throughput.
The video clip download process is started by the end user when a request (with a link to the desired video clip) is sent to the YouTube web server (see Figure 1). When the client web browser receives the YouTube web page, the embedded player initiates the required signaling with the media server indicating the video to be played-out along with some setup parameters [2]. Then, the server starts progressively sending the video data over an HTTP response. The video data is then stored in a play-out buffer at the client side before being displayed. Once the download has been started, there is no further client to server signaling (unless the user interacts with the player).
The video data transfer from the media server to the client consists of two phases: initial burst of data and throttling algorithm [2]. In the initial phase, the media server sends an initial burst of data (whose size is determined by one of the setup parameters) at the maximum available bandwidth. Then, the server starts the throttling algorithm, where the data are sent at a constant rate (normally at the video clip encoding rate multiplied by the throttle factor, also denoted in the setup parameters). In a network congestion episode, the data that are not able to be delivered at this constant rate are buffered in the server and released as soon as the congestion is alleviated. When this occurs, data are sent at the maximum available bandwidth. Whenever the player's buffer runs out of data, the playback will be paused, leading to a rebuffering event.
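The interaction between the download rate and the play-out buffer described above can be illustrated with a minimal sketch. It is not the YouTube player logic: the function name, the fixed one-second step, the single start threshold and the units (kbit/s, kbit) are all assumptions made for illustration, and the initial burst and server-side throttling are simply represented by whatever values appear in the throughput trace.

```python
def simulate_playout(throughput_kbps, playing_rate_kbps, threshold_kbit, step_s=1.0):
    """Step through a throughput trace and track stalls of a play-out buffer.

    Hypothetical model: playback (re)starts once the buffer holds at least
    `threshold_kbit`, and pauses whenever one step of video cannot be served.
    Returns (initial buffering time, number of rebuffering events,
    total rebuffering time), with times in seconds.
    """
    buffer_kbit = 0.0
    playing = False
    started = False          # True once playback has begun for the first time
    t_init = 0.0
    n_rebuf = 0
    t_rebuf_total = 0.0

    for rate in throughput_kbps:
        buffer_kbit += rate * step_s              # data received in this step
        if not playing:
            if buffer_kbit >= threshold_kbit:
                playing = True
                started = True
            elif started:
                t_rebuf_total += step_s           # still waiting after a stall
            else:
                t_init += step_s                  # still in initial buffering
        if playing:
            needed = playing_rate_kbps * step_s   # data consumed in this step
            if buffer_kbit >= needed:
                buffer_kbit -= needed
            else:
                playing = False                   # buffer ran dry: rebuffering event
                n_rebuf += 1
                t_rebuf_total += step_s
    return t_init, n_rebuf, t_rebuf_total


# Illustrative trace (made-up values): a four-second congestion episode.
trace = [1000, 1000, 0, 0, 0, 0, 1000, 1000]
print(simulate_playout(trace, playing_rate_kbps=500, threshold_kbit=1500))
```

In this toy trace the outage lasts longer than the buffered video, so the player stalls once and has to refill the buffer up to the threshold before resuming.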
Like the quality of Internet services in general, Internet video streaming quality depends mainly on throughput. However, quality requirements in terms of throughput are more demanding than those for other popular Internet applications such as file download, web browsing and messaging. The main differences are that throughput has to meet rather precise requirements and that these requirements are stream-specific, i.e. if data are not transmitted according to the playing rate (corrected by the influence of initial buffering), a rebuffering event will likely occur and the user QoE will drop rapidly. It is therefore essential not only to measure the download throughput, but also to check it against the bitrate with which the individual stream is encoded.
There exist many metrics to characterize video quality. Some of them are based on comparing the received (and degraded) video with the original video (usually called the "reference"). Examples of this type of quality metric are: Mean Square Error (MSE) [15], Peak Signal to Noise Ratio (PSNR) [15], Video Structural Similarity (VSSIM) [16], Perceptual Evaluation of Video Quality (PEVQ) [17] and Video Quality Metric (VQM) [18]. These metrics are useful for obtaining objective measurements in controlled experiments, but they are not applicable to online (real-time) procedures, as the full reference is not available. Furthermore, they are suited to measuring image quality degradation, e.g., due to packet losses or compression algorithms. Since packet losses are recovered when TCP is used, this type of metric is less useful for YouTube.
That is why other works aim to provide a model for estimating the video quality without a reference. For instance, the work described in [19] presents a regression model to estimate the visual perceptual quality in terms of MOS for MPEG-4 videos over wireless networks. However, this algorithm requires an image reconstruction process to evaluate the differences between the original and the resulting images (after network transmission), which makes it inadequate for online quality estimation. In [20], the impact of delay and delay variation on the user's perceived quality for video streaming is analyzed. However, it does not consider other objective metrics such as resolution, frame rate, or packet losses, which are also important for obtaining an accurate QoE estimation. In [21], a no-reference subjective metric to evaluate the video quality is presented, which considers the frame rate or the picture resolution, although its computation is too complex to be used in real time.
Our implementation is based on the work presented in [22], which studied how the network QoS affects the QoE of HTTP video streaming. In that work, the authors propose a generic procedure to estimate the end-user's perceived quality following three steps: [24]. Afterwards, application performance metrics (T_init, f_rebuf, T_rebuf) can be estimated at the receiver from performance indicators at lower layers (e.g. TCP throughput) as well as other parameters like the video coding rate, video length, buffer size at the receiver or the minimum buffer threshold that triggers a rebuffering event (see [22] for further details).
The model to estimate application QoS metrics from network QoS is valid under certain assumptions: 1) the network bandwidth, Round Trip Time (RTT) and packet loss rate are assumed to be constant during the video download; 2) the client does not interact with the video during the playback (e.g. by pausing or seeking forward/backward).
The third step is performed by applying a utility function for HTTP video streaming as a function of three application performance metrics:
• Initial buffering time (T_init): time elapsed until a certain buffer occupancy threshold has been reached so that the playback can start.
• Rebuffering frequency (f_rebuf): frequency of interruption events during the playback.
• Mean rebuffering time (T_rebuf): average duration of a rebuffering event.
The final MOS expression, Eq. (1), is taken from [22]; in it, L_ti, L_fr and L_tr take the value 1, 2 or 3 to represent the "low", "medium" and "high" levels of T_init, f_rebuf and T_rebuf, respectively. The concrete values used to quantize these application performance metrics can be found in [22]. From Eq. (1), it can be seen that the quantized rebuffering frequency (f_rebuf) has the highest impact on the end user's QoE, compared to the initial buffering time (T_init) and the rebuffering duration (T_rebuf). In this respect, it is reasonable to think that the perceived quality does not only depend on the pause intensity (percentage of time in the pause state), since a higher number of pauses (with shorter pause durations) seems more annoying to the user.
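As a rough illustration of this third step, the sketch below maps the three quantized levels onto a MOS value with a linear utility function. The default coefficients are the ones usually quoted for the model in [22] as far as we recall them, and should be treated as an assumption to be checked against [22]; the quantization thresholds are deliberately left as caller-supplied arguments, and the values in the usage lines are placeholders rather than the thresholds of [22].

```python
def quantize(value, low_max, medium_max):
    """Map a metric value onto level 1 (low), 2 (medium) or 3 (high).

    `low_max` and `medium_max` are hypothetical threshold arguments; the
    actual thresholds are those given in [22].
    """
    if value <= low_max:
        return 1
    if value <= medium_max:
        return 2
    return 3


def estimate_mos(l_ti, l_fr, l_tr, coeffs=(4.23, 0.0672, 0.742, 0.106)):
    """Linear utility function over the three quantized levels.

    `coeffs` = (constant, weight of L_ti, weight of L_fr, weight of L_tr).
    The defaults are assumed values recalled from [22]; verify before use.
    """
    for level in (l_ti, l_fr, l_tr):
        if level not in (1, 2, 3):
            raise ValueError("levels must be 1, 2 or 3")
    c0, c_ti, c_fr, c_tr = coeffs
    return c0 - c_ti * l_ti - c_fr * l_fr - c_tr * l_tr


# Placeholder thresholds, for illustration only.
l_ti = quantize(3.0, low_max=1.0, medium_max=5.0)      # initial buffering time, s
l_fr = quantize(0.0, low_max=0.02, medium_max=0.15)    # rebuffering frequency
l_tr = quantize(0.0, low_max=5.0, medium_max=10.0)     # mean rebuffering time, s
print(round(estimate_mos(l_ti, l_fr, l_tr), 2))
```

With these weights, raising the rebuffering frequency level by one step lowers the estimate far more than the same step in either of the other two levels, which matches the observation made in the text.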

Android application for YouTube QoE evaluation
The model for estimating YouTube QoE has been implemented as an Android application.
Our QoE tool is able to run in two different modes:
1) Intrusive mode: the application includes an embedded video player (based on Media Player), thus having access to the content being consumed (through the YouTube API).
2) Transparent mode: the application runs in the background, so monitoring functionalities are associated with YouTube sessions established either through the native YouTube application or through the web browser.
Our Android application includes the following modules:
• Monitoring: this module is responsible for monitoring network QoS parameters as well as other configuration parameters required to estimate the application performance metrics (listed in the previous section). It makes use of the Android Networking and YouTube Data Application Programming Interfaces (APIs) to get a number of parameters associated with the session.
• QoE estimation: in charge of (automatically) computing the QoE of a YouTube session (in terms of MOS) from QoS parameters, according to Eq. (1).
• QoE advices: informs the user about possible causes of a low MOS and provides some hints to improve it.
• QoE user feedback: allows users to rate the session through an opinion survey. This information is used to correlate the QoE model with real user feedback.
• QoE reporting: this module is responsible for reporting all the performance indicators to a QoE server for post-processing purposes.
A general overview of our YouTube QoE framework is depicted in Figure 2. In addition to the MOS value automatically estimated by the application, users are requested to rate the session (video, audio and general feedback) manually on the same MOS scale (from 1 to 5). We have used both types of QoE evaluation to validate the theoretical model proposed in [22], as well as to propose a modified function according to the results of our pilot experience. The parameters reported from the terminal to the QoE server are listed in Table 1.
When our QoE tool runs in Intrusive mode (i.e. player embedded in the application), the measurement of the three application performance metrics (T_init, f_rebuf, T_rebuf) is straightforward, as the YouTube API provides access to this type of information.
In Transparent mode, however, the computation of these metrics is not so easy, because they have to be estimated from network-level metrics, as detailed in [22]. In particular, the following basic information is required: average TCP throughput, average playing rate and player buffer size. This type of estimation has a limitation: as throughput and playing rate may vary over time, the player's buffer utilization depends on the instantaneous throughput and play-out rate rather than on their average values. Therefore, this approach might lead to slightly optimistic results.
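The limitation of working with average values can be seen in a small sketch. The two functions below are hypothetical simplifications and do not reproduce the estimation model of [22]: the first derives an initial buffering time and a stall prediction from the average throughput only, while the second walks through the same trace second by second. Names, units (kbit/s, kbit) and the trace values are assumptions.

```python
from statistics import mean


def average_based_estimate(trace_kbps, playing_rate_kbps, threshold_kbit):
    """Estimate from averages only: (initial buffering time, stall expected?)."""
    avg = mean(trace_kbps)
    if avg <= 0:
        raise ValueError("average throughput must be positive")
    t_init = threshold_kbit / avg            # time to fill the start threshold
    return t_init, avg < playing_rate_kbps   # stall only if the average is too low


def trace_has_stall(trace_kbps, playing_rate_kbps, threshold_kbit, step_s=1.0):
    """Check the instantaneous buffer level step by step."""
    buffer_kbit = 0.0
    playing = False
    for rate in trace_kbps:
        buffer_kbit += rate * step_s
        if not playing and buffer_kbit >= threshold_kbit:
            playing = True
        if playing:
            needed = playing_rate_kbps * step_s
            if buffer_kbit < needed:
                return True                  # buffer ran dry at this step
            buffer_kbit -= needed
    return False


# Made-up trace whose average (600 kbit/s) exceeds the playing rate (500 kbit/s).
trace = [1000, 1000, 0, 0, 0, 0, 1000, 1000, 1000, 1000]
print(average_based_estimate(trace, 500, 1500))   # average view: no stall expected
print(trace_has_stall(trace, 500, 1500))          # instantaneous view: a stall occurs
```

The average-based view predicts an uninterrupted session, whereas the step-by-step check finds a stall during the outage, which is one way the averaged estimate can turn out optimistic.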
The role of the QoE advices module is to analyze possible causes of a low QoE and, subsequently, to give specific advice to the user when certain conditions are met. As an example, Table 2 shows potential causes of low QoE, together with their associated evidence and advice.

YouTube QoE pilot experiment
A set of 17 users (engineers from the Telefónica company) were selected to participate in a pilot experiment, which consisted of periodically testing our YouTube QoE tool (installed on different Android smartphones) for one month. Every native YouTube session was transparently monitored and evaluated in two ways: 1) automatically by the application (using the QoE model previously described); 2) by the users through an online opinion survey.
A total number of 1435 YouTube sessions were evaluated during the pilot. The data collected from each user device was sent to a server for post-processing purposes.
The pilot experience was carried out in Madrid (Spain), covering both rural and urban environments (as shown in Figure 4 on the left). Different colors represent the associated subjective quality (from the opinion survey) for a set of YouTube sessions. Users were asked to fill in this survey (related to the video quality, audio quality and overall quality) after each YouTube session. Figure 4 on the right shows the probability density distribution of the feedback associated with video, audio and general quality.
According to the statistics collected at the QoE server, the majority of the videos consumed by the users are short: nearly 90% of the videos are shorter than 5 minutes and the average duration is 160 seconds (see Table 3). Regarding the video characteristics, users had free access to the YouTube repository, so the sessions cover a wide variety of videos with different average bitrates.
Next, statistics related to the application performance metrics (mainly T_init, f_rebuf and T_rebuf, which are required to evaluate the MOS) are analyzed in detail. Later, their effect on the experienced quality is described.
First, the box-and-whisker plot of the initial buffering time (T_init) per technology is given in Figure 5. The estimated Cumulative Distribution Function (CDF) of T_init is shown in Figure 6, and some numerical measures are presented in Table 4.
Results from Figure 5 and Figure 6 show that T_init values for WiFi connections are lower than those for UMTS. For 3G sessions, the estimated Coefficient of Variation (CV) is higher than 2, i.e., the standard deviation of T_init for UMTS connections is more than twice its average value. For WiFi, this dispersion measure is reduced to about 1.2. This comes from the fact that T_init samples are much more concentrated around the median for WiFi sessions, whereas UMTS presents a wider range. The heavy tail results in a higher average, located in the last quartile. In any case, 50% of the videos experienced an initial buffering time shorter than 7 seconds.
In most connections, no rebuffering is necessary, thus the median of the rebuffering frequency (f_rebuf) is 0 (see Table 5). However, in this case, f_rebuf is higher for WiFi than for UMTS; the reason is that, although the number of pauses is smaller for WiFi (Table 6), videos were shorter (see Table 3), thus boosting the frequency of interruption events even if the mean rebuffering time (T_rebuf) is lower (Table 7).
Next, we explore the effect of the performance indicators on the reported MOS. Figure 7 shows the box-and-whisker plot of the initial buffering time (T_init) per MOS. As shown in the results, lower T_init values are associated with higher MOS. Although a higher feedback quality could be expected for WiFi than for UMTS, it can be observed that users do not assign a significantly lower MOS to UMTS than to WiFi (see Figure 8 and Table 8).
Note that this measurement indicates that the MOS is about 20% higher than that given by Eq. (1). The reason could be that users are more indulgent with wireless connections than with the wired scenarios under which the original model was obtained.
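The dispersion measure used above for the initial buffering time, the Coefficient of Variation, is simply the standard deviation divided by the mean. The following sketch computes it for two made-up sample sets (they are not the pilot data) to show how a heavy tail pushes the CV above 2 and the mean well above the median.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (dimensionless)."""
    m = mean(samples)
    if m == 0:
        raise ValueError("CV is undefined for a zero mean")
    return stdev(samples) / m


# Hypothetical initial buffering times in seconds, for illustration only.
heavy_tailed = [1, 1, 2, 2, 3, 60]     # one very long start-up delay
concentrated = [2, 3, 3, 4]            # samples close to the median

print(coefficient_of_variation(heavy_tailed) > 2)    # dispersion exceeds twice the mean
print(median(heavy_tailed) < mean(heavy_tailed))     # the tail pulls the mean upwards
print(coefficient_of_variation(concentrated) < 1)
```

A single outlier is enough to produce this pattern, which is why the median is a more robust summary of the typical initial buffering time than the average in such distributions.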
Due to regression properties, the average value of the difference between the MOS obtained by (2) and that reported by users (that is, the residuals) is 0, although the residuals are not symmetric around 0 (see Figure 9 on the right). Differences between subgroups per technology are not significant (estimated slope of 1.1995 for WiFi connections and 1.2089 for UMTS).
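The regression property mentioned above (residuals averaging to zero) holds for an ordinary least-squares fit that includes an intercept. The sketch below fits such a line between model MOS and user MOS on made-up data; whether the fit used in the pilot included an intercept is not stated here, so this form is an assumption, as are the function name and the sample values.

```python
from statistics import mean


def fit_line(x, y):
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be equal")
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return slope, my - slope * mx


# Hypothetical pairs (MOS from the model, MOS reported by users); not pilot data.
model_mos = [1.5, 2.4, 2.4, 3.3, 3.3, 3.3]
user_mos = [2.0, 3.0, 2.5, 4.0, 4.5, 3.5]

slope, intercept = fit_line(model_mos, user_mos)
residuals = [u - (slope * m + intercept) for m, u in zip(model_mos, user_mos)]
print(abs(mean(residuals)) < 1e-9)    # residuals average to zero by construction
```

The zero-mean property says nothing about the shape of the residual distribution, so an asymmetric spread around 0 remains possible.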
We also explored whether a multivariate regression could improve those results. Only linear regression was analyzed, as a modification of numerical quantities such as those proposed in [25] cannot be easily included in the multivariate procedure. The adjusted R^2 including all available parameters is 90.5%, and it is only slightly lower (90.46%) if the total rebuffering time is removed from the regression. As this value is lower than that obtained with (1), the heuristic measurement quantization proposed in [22], increased by 20%, seems able to predict users' expectations well.

Conclusions
This work has presented a QoE evaluation tool for Android terminals that is able to estimate the QoE (in terms of MOS) of the YouTube service based on theoretical models. In particular, this tool makes it possible to map network QoS onto the QoE of YouTube sessions. Additionally, a QoE advices module analyzes possible causes of low QoE and, subsequently, gives specific advice to the user under certain conditions.
Our application has been tested in a pilot experience on 17 Android terminals for one month. According to the statistics, most of the responses from the users' survey match the theoretical estimations; however, the QoE model provides slightly more pessimistic results than the opinion of the wireless users, probably because the model was initially generated under wired scenarios. In that sense, we propose a modified utility function obtained from a linear regression between the theoretical MOS and the MOS reported by users.
In our opinion, it is critical that application developers provide access to the main Key Performance Indicators (KPIs) associated with their services in order to ease the evaluation and analysis of the QoE.

Tables
Table 1. List of parameters that are reported (from the terminal) to the QoE server.