A web sentiment analysis method on fuzzy clustering for mobile social media users
EURASIP Journal on Wireless Communications and Networking volume 2016, Article number: 128 (2016)
Abstract
Micro-blogging has become a major Internet application in recent years, and affective computing and sentiment analysis for micro-blogs have become an important research topic in computer science, linguistics, psychology, and other social computing fields. In this paper, fuzzy clustering theory is first introduced and a source database of micro-blog messages is constructed. Then, a word similarity computation method based on the basic emotion word set of HowNet is used to calculate the weights of micro-blog emotion words, and a micro-blog emotional lexicon is built. Next, the sentiment value of each whole micro-blog message is computed. Finally, the sentiment values of users in different time periods are taken as the original data matrix, and the users are classified dynamically with the fuzzy clustering algorithm, which also yields a dynamic clustering graph. The best classification is selected with the F-statistic test, and an emotion trend graph is derived from the classification results so that changes in user emotion can be analyzed more intuitively. With affective computing on micro-blog information, governments and businesses can obtain different classification results according to their needs and take appropriate measures.
1 Introduction
Since the beginning of the twenty-first century, people have continued to create high-tech products and to enjoy the experiences they bring. New Internet products keep appearing, each with its own character; a few years ago, people spent much of their time on social networking sites such as “kaixin” and “renren”, sharing their work and lives with classmates, colleagues, and friends. Nowadays, the “micro-blog” is widely used in the daily life of Internet users and plays an important role. In recent years, with the rapid development of Web technologies and the growth of websites, users have become both the readers and the creators of website content. Social network services (SNS) such as Facebook, forums, friends-net, renren, Sina micro-blog, and Tencent zone have become an indispensable stage on which Internet users around the globe show their emotions, communicate with each other, and share information, as well as a window for obtaining information, presenting themselves, and marketing. With the development of mobile networks, users can express their mood at any time, describe their experiences, and give their subjective and objective views of things on the Internet [1].
As a new type of media, the micro-blog is developing rapidly. It is characterized by fast propagation, real-time updating, and strong interaction, and it has a tremendous impact on the social life of Internet users and even of non-users. Micro-blog users can use computers or mobile devices (such as mobile phones and the iPad) to publish and share information with the community in a variety of forms over the Internet. On the one hand, users can post messages on a micro-blog. On the other hand, they can follow current events, hot topics, and other people’s comments, expand their social circle, and make like-minded friends. The micro-blog platform was first designed by Evan Williams, a pioneering blogger and founder of blog technology. In March 2006, he launched Twitter, the social network platform and micro-blog service that is now well known abroad. In China, the new term “micro-blog” has become one of the most popular words, and the two most popular micro-blog platforms are Sina micro-blog and Tencent micro-blog [2].
Nowadays, social networks have become an important part of people’s lives: users can talk about anything in QQ space or on a micro-blog, share their mood as in a diary, and record every kind of emotion in daily life. As a result, Web text data containing the rich emotions of Internet users is expanding at an amazing speed, and building an effective sentiment analysis model has become one of the important topics of social computing.
Micro-blog sentiment computing and analysis covers two aspects: analyzing the emotional fluctuation of an individual over different periods of time, and analyzing the emotions of a group of people over time. The significance is as follows. First, by building personal emotion charts, the emotional states of Internet users in a given period can be analyzed, so that decision makers can follow recent changes in users’ emotions, detect abnormal mood changes in time, and make decisions that help create a healthy and harmonious network society. Second, in group emotion analysis, fuzzy clustering of the group members produces a dynamic fuzzy clustering graph of emotion. Different conditions can be set to obtain different classification results, for example to find friends with similar emotions. The social relations established in this way can be of great help to users in life or in business decisions.
In this paper, the fuzzy clustering algorithm is applied to micro-blog sentiment analysis and a dynamic clustering graph is generated; the best classification is selected with the F-statistic test, and an emotion trend graph is derived from the classification results with SPSS, so that people’s emotions can be visualized.
The remainder of this paper is organized as follows. Section 2 describes the related work on web sentiment analysis methods. Section 3 gives computation method for the sentiment value of sentiment phrase with modifier, as well as sentiment intensity analysis of micro-blog content. Section 4 presents sentiment analysis for QQ zone users. In Section 5, sentiment clustering analysis of micro-blog users on fuzzy clustering algorithm is put forward. Conclusions are summarized in Section 6.
2 Related works
Wang et al. [3] regarded sentiment analysis as the analysis of users’ comments on the Web in order to identify implicit emotional information and to discover the rules by which user emotion evolves; Zhao et al. [4] held that sentiment analysis, also called opinion mining, is the process of analyzing, processing, summarizing, and reasoning about subjective text with emotional color. Sentiment analysis involves many difficult and challenging research problems. Existing research results on micro-blog sentiment analysis are listed as follows:
In 2009, Alec Go, Richa Bhayani, and Lei Huang used machine learning method for the first time to analyze the sentiment on the micro-blog text and joined the emoticon sign in the system so as to greatly improve the accuracy of the system [5]. In 2010, according to statistics on written words of product view and opinion polls of customer between 2008 and 2009, Brendan, Ramnath Balasubramanyan, and Bryan R. Routledge concluded that sentiment words used in product view and opinion polls had a high correlation [6]. Alexander Pak and Patrick Paroubek collected data and trained classifier to make research on sentiment analysis and opinion mining on micro-blog [7]; Albert Bifet and Eibe Frank analyzed the difficulties in micro-blog data mining, solved the problem of unbalanced data classes by setting a kappa test sliding window, and achieved good results [8]; Dmitry Davidov, Oren Tsur, and Ari Rappoport built an emotional classification system by supervising learning based on Twitter with common labels, emotion labels. The method did not need too much manual annotation [9]; Cindy Xide Lin, Bo Zhao, and Qiaozhu Mei discussed social network event topic by using the statistical method [10]; Asli Celikyilmaz, Dilek Hakkani, and Junlan Feng classified Twitter text into polar and nonpolar text by probability and statistics mode and then used the sentiment vocabulary to categorize sentiment polarity of Twitter text [11]; Luciano Barbosa and Junlan Feng detected sentiment information of Twitter text by analyzing text structure and text keywords information of Twitter [12-13]. Long Jiang, Mo Yu, and Ming Zhou improved the accuracy of sentiment classification by merging characteristics and adding relevant text [14]. Lei Zhang, Riddhiman Ghosh, and Mohamed Dekhil adopted sentiment dictionary and machine learning methods to classify emotional polarity text in Twitter based on entity in the text [15].
In online videos, Hoque et al. developed an automated system to distinguish between naturally occurring spontaneous smiles under frustrating and delightful stimuli by exploring their temporal patterns given video of both, evaluated and compared two variants of Support Vector Machine (SVM), Hidden Markov Models (HMM), and Hidden-State Conditional Random Fields (HCRF) for binary classification. McDuff et al. presented results validating a novel framework for collecting and analyzing facial responses to media content over the Internet. The framework, data collected, and analysis demonstrated an ecologically valid method for unobtrusive evaluation of facial responses to media content [16, 17].
In terms of opinion mining, a fuzzy propagation model for opinion mining by sentiment analysis of online social networks was proposed by Trung et al. A practical system, called TweetScope, was implemented to efficiently collect and analyze all possible tweets from customers [18]. S Venkatesh, D Phung, D Bo, and M Berk used machine learning and statistical methods to discriminate online messages between depression and control communities using mood, psycholinguistic processes, and content topics extracted from the posts generated by members of these communities [19]. To understand customers’ opinions and subjectivity, fuzzy propagation modeling for opinion mining and the TweetScope system were also presented in [20, 21]. A crowdsourcing-based method was proposed by Xu et al. [31–35] to process mobile social media data. To improve sentiment analysis results, Xing Wu and Shaojian Zhu investigated the limitations of traditional sentiment analysis approaches and proposed an effective Chinese sentiment analysis approach based on an emotion degree lexicon, which obtained a significant improvement on Chinese text [22]. To address these limitations and issues in non-English languages, in 2014, Wang et al. proposed a new scheme to derive dominant valence as well as prominent positive and negative emotions by using an adaptive fuzzy inference method (FIM) with linguistics processors to minimize semantic ambiguity, together with multi-source lexicon integration and development [23].
To resolve the differences among affect, feelings, emotions, sentiments, and opinions in text, M. D. Munezero et al. clarified the differences between these five subjective terms and revealed concepts that are significant to the computational linguistics community for their effective detection and processing in text [24]. A new affective-behavioral-cognitive (ABC) framework was presented to measure the usual cognitive self-report information and behavioral information, together with affective information, while a customer makes repeated selections in a random-outcome two-option decision task to obtain their preferred product [25]. Santos-Sanchez et al. proposed a novel system to obtain data of interest from a Web search engine by analyzing the emotional and sentimental content of each page, which made it possible to better understand how users perceive the recommended products [26]. Pappurajan, A. and S. P. Victor put forward an algorithm that judges whether the text of a tweet is positive or negative based on the likelihood of each possibility [27]. To facilitate sentiment analysis of user-generated content, Hogenboom et al. proposed to map the sentiment conveyed by unstructured natural language text to universal star ratings, capturing the intended sentiment. Their experiments revealed that language-specific sentiment scores can separate universal classes of intended sentiment from one another to a limited extent [28]. A new sentiment analysis technique called SentiPipe was presented, which took the best of a set of methods, generating an analysis that is less sensitive to context [29]. C. Clavel and Z. Callejas presented a comparative state of the art that analyzes the sentiment-related phenomena and the sentiment detection methods used in both communities and gave an overview of the goals of socio-affective human-agent strategies [30].
3 Method
3.1 The basic principle of fuzzy clustering
Classical clustering analysis assigns each object strictly to one class. In practice, however, most objects are fuzzy. Here, fuzziness mainly refers to the lack of a clear boundary in the transitions between objective things, as found in management systems, medical diagnosis, artificial intelligence, image recognition, and other fields. Fuzzy set theory was first put forward by Professor L. A. Zadeh, and many meaningful results have been achieved since. Judging from its development, the theory has strong vitality and wide applicability. Applying fuzzy set theory to clustering makes it possible to cluster fuzzy objects. The essence of fuzzy clustering is to construct a fuzzy matrix from the properties of the objects themselves and, on this basis, to determine the classification relationship according to a certain degree of membership.
The related definitions and basic principles of fuzzy clustering are as follows:
- Definition 1: Let the set of source information for statistical analysis be \( X=\left({x}_1,{x}_2,\cdots, {x}_n\right) \), where each \( {x}_i \) has m attributes and is represented by \( \left({x}_{i1},{x}_{i2},\cdots, {x}_{im}\right) \). X is then an n × m matrix, called the initial value matrix.
- Definition 2: In \( X=\left({x}_1,{x}_2,\cdots, {x}_n\right) \), let \( {x}_i \) and \( {x}_j \) (i ≠ j) be any two different objects. The symbol \( {r}_{ij} \) stands for the degree of similarity between \( {x}_i \) and \( {x}_j \) and is called the similarity coefficient.
- Definition 3: Let U and V be two domains. A fuzzy relation R from U to V is a fuzzy set on U × V with membership function \( {\mu}_R:U\times V\to \left[0,1\right] \); for every (x, y) ∈ U × V, \( {\mu}_R\left(x,y\right) \) is the membership degree of the pair (x, y) in R.
- Definition 4: Suppose U and V are finite domains. Then all the \( {r}_{ij} \) constitute the fuzzy relation R, which can be written as \( {\left({r}_{ij}\right)}_{n\times m} \), where \( 0\le {r}_{ij}\le 1 \) (i ≤ n, j ≤ m) and n and m are positive integers. The matrix R is called a fuzzy matrix. For a square fuzzy matrix (n = m), the following conditions are considered:
  1) Reflexivity: any object is identical to itself, which is denoted as \( {r}_{ii}=1 \)
  2) Symmetry: if object a is similar to object b, then b is similar to a, which is denoted as \( {r}_{ij}={r}_{ji} \)
  3) Transitivity: if objects a and b are similar and objects b and c are similar, then a and c are similar, which is denoted as R ∘ R ⊆ R
  If R satisfies 1), 2), and 3), the relation represented by R is called a fuzzy equivalence relation, and fuzzy clustering analysis is carried out on this relation. A matrix \( R={\left({r}_{ij}\right)}_{n\times n} \) that satisfies reflexivity, symmetry, and transitivity is a fuzzy equivalence matrix; a matrix that satisfies only reflexivity and symmetry is a fuzzy similar matrix.
- Definition 5: Let \( R={\left({r}_{ij}\right)}_{n\times n} \). For a given λ ∈ [0, 1], \( {R}_{\lambda }={\left({r}_{ij}\left(\lambda \right)\right)}_{n\times n} \) is called the λ-cut matrix of R, where \( {r}_{ij}\left(\lambda \right)=\left\{\begin{array}{ll}1, & {r}_{ij}\ge \lambda \\ 0, & {r}_{ij}<\lambda \end{array}\right. \). A fuzzy equivalence matrix \( R={\left({r}_{ij}\right)}_{n\times n} \) can be used to classify the objects in the domain: once R is determined, each given λ ∈ [0, 1] yields a cut matrix \( {R}_{\lambda } \), and hence a classification at level λ.
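As a small illustration of Definition 5, the following sketch computes a λ-cut matrix in plain Python. The function name `lambda_cut` and the 3 × 3 matrix are hypothetical and chosen only for illustration; they are not data from this paper.

```python
def lambda_cut(R, lam):
    """Return the lambda-cut matrix of R: 1 where r_ij >= lam, otherwise 0."""
    return [[1 if r >= lam else 0 for r in row] for row in R]


# Hypothetical fuzzy equivalence matrix (reflexive, symmetric, max-min transitive).
R = [
    [1.0, 0.8, 0.4],
    [0.8, 1.0, 0.4],
    [0.4, 0.4, 1.0],
]

print(lambda_cut(R, 0.8))  # [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
print(lambda_cut(R, 0.4))  # [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
```

At λ = 0.8 the first two objects fall into one class and the third stays alone; lowering λ to 0.4 merges all three, which is the behaviour the dynamic clustering graph records.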
3.2 Collection of micro-blog sentiment words set
In order to obtain the micro-blogging network sentiment words set before constructing special sentiment lexicon of micro-blog, the following steps are described:
(1) Micro-blog corpus preprocessing
Before the micro-blog network words are extracted, the text of each micro-blog message needs to be preprocessed. Examples of such messages are “//@ veggieg: flags can hang up a bit, dear” and “【# I’m a singer # live 】 zhang jie sings the great love of life and feeling…”. In preprocessing, “@user” mentions, “#Theme Name#” tags, punctuation, and hyperlinks are removed. The n-gram method (n = 1, 2, 3, 4) is then used to obtain the micro-blog word set, from which the basic sentiment words are filtered out to give the candidate network sentiment word set. Finally, the word frequencies of the candidates obtained with n = 2, 3, 4 are compared with a frequency threshold: candidates whose frequency is below the threshold are discarded, and the rest form the effective candidate word set.
(2) Discovery of micro-blog network sentiment words (context-based entropy method)
Step 1: Compute the context entropy of each word in the candidate word set. Let the sets of words appearing before and after a candidate word w be \( L=\left\{{l}_1,{l}_2,{l}_3,\cdots, {l}_p\right\} \) and \( R=\left\{{r}_1,{r}_2,{r}_3,\cdots, {r}_q\right\} \); its context entropies are computed as follows:
$$ {E}_L(w)=-\frac{1}{n}\sum_{i=1}^p C\left(w,{l}_i\right)\log \frac{C\left(w,{l}_i\right)}{n},\quad {l}_i\in L $$ (1)
$$ {E}_R(w)=-\frac{1}{m}\sum_{j=1}^q C\left(w,{r}_j\right)\log \frac{C\left(w,{r}_j\right)}{m},\quad {r}_j\in R $$ (2)
In formulas (1) and (2), \( n=\sum_{l_i\in L}C\left(w,{l}_i\right) \) and \( m=\sum_{r_j\in R}C\left(w,{r}_j\right) \), so n and m are the total numbers of times single words appear before and after the candidate word, respectively; \( C\left(w,{l}_i\right) \) and \( C\left(w,{r}_j\right) \) are the numbers of times the words \( {l}_i \) and \( {r}_j \) appear in the context of the candidate word w.
Step 2: Determine the network sentiment words. Set a context entropy threshold and count word frequencies to extract the network sentiment words that users employ in micro-blogs; these words make up the micro-blog network sentiment word library.
(3) Filtering of the network words (TF-IDF method)
Some words pass the context entropy test and belong to the network vocabulary but carry no emotional color. Such disruptive words need to be filtered out. In this paper, the TF-IDF method is used to filter the network words, as shown in formula (3):
$$ \mathrm{TF}\hbox{-} \mathrm{IDF}\left({w}_i\right)=\mathrm{freq}\left({w}_i\right)\cdot \log \left(\frac{N}{df\left({w}_i\right)}\right) $$ (3)
where \( \mathrm{freq}\left({w}_i\right) \) is the number of occurrences of word \( {w}_i \) in the corpus, N is the total number of documents in the corpus, and \( df\left({w}_i\right) \) is the number of documents in the corpus that contain \( {w}_i \).
To filter the network words in the micro-blog corpus, the TF-IDF value of each candidate word is calculated, the candidates are sorted by this value, a threshold is set, and the candidates whose value is higher than the threshold are merged into the micro-blog network sentiment word set.
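The discovery and filtering steps of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function names are hypothetical, the corpus is assumed to be already segmented into tokens, candidates are treated as single tokens, the logarithm is taken as natural (the formulas do not fix a base), and the toy data and threshold are invented for the example.

```python
import math
from collections import Counter


def context_entropy(counts):
    """Entropy of one side of the context, as in formulas (1) and (2).

    `counts` maps each neighbouring word to C(w, neighbour).
    """
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum(c * math.log(c / total) for c in counts.values()) / total


def left_right_entropy(tokens, w):
    """Return (E_L(w), E_R(w)) for a single-token candidate w."""
    left, right = Counter(), Counter()
    for i, tok in enumerate(tokens):
        if tok != w:
            continue
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + 1 < len(tokens):
            right[tokens[i + 1]] += 1
    return context_entropy(left), context_entropy(right)


def tf_idf(word, documents):
    """Formula (3): freq(w) * log(N / df(w)) over tokenised documents."""
    freq = sum(doc.count(word) for doc in documents)
    df = sum(1 for doc in documents if word in doc)
    return freq * math.log(len(documents) / df) if df else 0.0


def filter_candidates(candidates, documents, threshold):
    """Keep candidates whose TF-IDF exceeds the threshold, highest first."""
    scores = {w: tf_idf(w, documents) for w in candidates}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [w for w in ranked if scores[w] > threshold]


# Toy corpus; the tokens and the threshold are illustrative only.
docs = [["x", "y"], ["x", "z", "z"], ["x"], ["z"]]
print(filter_candidates(["x", "y", "z"], docs, threshold=1.0))  # ['z', 'y']
```

A candidate that always appears next to the same neighbour gets entropy 0 on that side, while a candidate used freely in many contexts gets a high entropy, which is why a threshold on the entropy separates independent words from fragments.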
3.3 A computation method for the sentiment value of sentiment phrase with modifier
In sentiment analysis of Chinese micro-blog messages, sentences are generally made up of sentiment phrases, each containing a sentiment word and its modifiers. In this article, for practical purposes, the tendency of a sentiment phrase is computed according to the seven combination modes listed in Table 1.
3.4 The sentiment intensity analysis of micro-blog content
Micro-blog content may be a single phrase or may consist of two or more short sentences. To calculate the sentiment intensity of the whole content, the first step is to analyze the sentiment value of every phrase. The second step depends on the conjunction: if a complex sentence composed of two or more short sentences is joined by a “coordinating,” “progressive,” “causal,” or “selective” relationship, its sentiment intensity is the sum of the sentiment values of its parts; if it is joined by a “hypothesis,” “condition,” “concession,” or “turning” relationship, the sentiment intensity of the rear part is taken as that of the whole sentence. Within a single short sentence, the sentiment words are selected first, and their sentiment intensity values are then modified according to the weights of degree adverbs and negative words.
Define the sentiment intensity value of a sentiment word w as E(w), and let the sentiment intensity value of a sentiment phrase with modifiers be ME(w). To identify the modifiers, we take w as the center and search before and after it up to the nearest punctuation marks; the modifiers found between these two punctuation marks are extracted, and the sentiment inclination of the sentence is then corrected according to the modification rules. The algorithm is described as follows:
- Algorithm name: sentiment intensity value computation based on sentence semantics with sentiment words
- Input: micro-blog message sentence S
- Output: sentiment intensity value E(S) of message S
- Step 1: Decompose the sentence into single Chinese words, tag the part of speech of each word, and store the sequence of decomposed words
- Step 2: Find the conjunctions and punctuation in the sequence of decomposed words, obtain p short sequences {SS_k, k = 1, 2, ⋯, p}, and store the conjunction types
- Step 3: Compute the sentiment intensity value of each phrase SS_k
- Step 3.1: Search the sentiment library for a sentiment word w and record its sentiment weight E(w)
- Step 3.2: Extract the modifiers between the two punctuation marks adjacent to the sentiment word to get the degree adverb set D = (D_1, D_2, ⋯, D_n) and the negative word set N = (N_1, N_2, ⋯, N_m); then, according to Table 2, obtain the adjusted sentiment value of w with its modifiers, which is the sentiment intensity ME(w) of phrase SS_k
- Step 4: According to the relationship between the phrases in the sequence, use the corresponding rule to obtain the sentiment intensity value E(S) of the whole sentence
- Step 5: Determine the sentence type
- Case 1: question sentence; then E(S) = −E(S), go to Step 6
- Case 2: exclamatory sentence; then E(S) = 2E(S), go to Step 6
- Step 6: Output E(S)
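Steps 4 to 6 can be sketched as follows. The sketch assumes that the per-phrase values ME(w) from Step 3 are already available, because the modifier rules of Tables 1 and 2 are not reproduced here. The function name, the relation labels, the sentence-type labels, and the fallback of summing when no conjunction is detected are all assumptions made for illustration.

```python
# Relation labels are hypothetical names for the conjunction types in the text.
ADDITIVE = {"coordinating", "progressive", "causal", "selective"}
LAST_CLAUSE = {"hypothesis", "condition", "concession", "turning"}


def sentence_intensity(clause_values, relation=None, sentence_type="statement"):
    """Combine per-clause sentiment values into E(S) (Steps 4 to 6).

    clause_values: sentiment intensity of each short sentence, in order.
    relation: conjunction type linking the clauses, or None.
    sentence_type: "statement", "question", or "exclamation".
    """
    if not clause_values:
        return 0.0
    if relation in LAST_CLAUSE:
        # Only the rear part of the sentence counts.
        value = clause_values[-1]
    else:
        # Additive relations; also used when no conjunction is found (assumption).
        value = sum(clause_values)
    if sentence_type == "question":
        value = -value          # Case 1: E(S) = -E(S)
    elif sentence_type == "exclamation":
        value = 2 * value       # Case 2: E(S) = 2E(S)
    return value


print(sentence_intensity([2.0, 1.0], "coordinating"))         # 3.0
print(sentence_intensity([2.0, -1.0], "turning"))             # -1.0
print(sentence_intensity([2.0, -1.0], "turning", "question"))  # 1.0
```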
3.5 Generation of sentiment dynamic clustering graph
The concept of fuzzy clustering builds on general clustering. Assume that the set X is divided into several classes by a membership function; each class is regarded as a fuzzy subset of X, and the classification matrix corresponding to each classification is a fuzzy matrix R.
Cluster analysis uses a similarity scale to measure the closeness between things and classifies them accordingly. Its essence is to construct a fuzzy matrix from the attributes of the research objects and, on this basis, to determine the classification relationship according to a certain membership. The main steps are determining the sample statistical indexes (feature extraction, which establishes the original data matrix), standardizing the data (normalization), calibrating the distance to establish the fuzzy similar matrix, and computing the fuzzy equivalence matrix. The detailed steps are as follows:
(1) Construction of the original data matrix
Let the objects to be classified form the domain \( X=\left({x}_1,{x}_2,\cdots, {x}_n\right) \), where n is the number of objects; each object has m attribute features, \( {x}_i=\left({x}_{i1},{x}_{i2},\cdots, {x}_{im}\right) \), i = 1, 2, ⋯, n. The original data matrix is then given as follows:
$$ A=\left[\begin{array}{cccc}{x}_{11} & {x}_{12} & \cdots & {x}_{1m}\\ {x}_{21} & {x}_{22} & \cdots & {x}_{2m}\\ \vdots & \vdots & \vdots & \vdots \\ {x}_{n1} & {x}_{n2} & \cdots & {x}_{nm}\end{array}\right] $$
where \( {x}_{nm} \) represents the original data of the mth feature of the nth classified object.
(2) Standardization of the data
Because the feature indexes differ in order of magnitude and dimension, the sample data need to be standardized for convenient analysis and comparison; the data range is mapped to [0, 1] (normalization). Assume that there are n sample objects \( {x}_1,{x}_2,\cdots, {x}_n \), each with m feature values \( {y}_1,{y}_2,\cdots, {y}_m \), and that \( {x}_{ij} \) is the jth attribute value of the ith object. The moving range formula (4) is used for standardization:
$$ {x_{ij}}^{\prime }=\frac{x_{ij}-{x}_{\min }}{x_{\max }-{x}_{\min }} $$ (4)
(3)
Construction of a fuzzy similar matrix
Let the domain set be X = (x 1, x 2, ⋯, x n ), x i = (x i1, x i2, ⋯, x im ), in accordance with the commonly clustering method; we can determine fuzzy similarity coefficient r ij , which is similar degree of x i , x j (r ij = R(x i , x j )).
The following 12 methods can be used to determine r ij , which can be chosen for actual needs.
(1) Maximum minimum method
$$ {r}_{ij}=\frac{\sum_{k=1}^m\left({x}_{ik}\wedge {x}_{jk}\right)}{\sum_{k=1}^m\left({x}_{ik}\vee {x}_{jk}\right)} $$ (5)
(2) Arithmetic average minimum method
$$ {r}_{ij}=\frac{\sum_{k=1}^m\left({x}_{ik}\wedge {x}_{jk}\right)}{\frac{1}{2}\sum_{k=1}^m\left({x}_{ik}+{x}_{jk}\right)} $$ (6)
(3) Geometric average minimum method
$$ {r}_{ij}=\frac{\sum_{k=1}^m\left({x}_{ik}\wedge {x}_{jk}\right)}{\sum_{k=1}^m\sqrt{x_{ik}{x}_{jk}}} $$ (7)
(4) Index similarity coefficient method
$$ {r}_{ij}=\frac{1}{m}\sum_{k=1}^m{e}^{-\frac{4{\left({x}_{ik}-{x}_{jk}\right)}^2}{3{s}_k^2}} $$ (8)
where \( {s}_k=\sqrt{\frac{1}{n}\sum_{i=1}^n{\left({x}_{ik}-{\overline{x}}_k\right)}^2},\kern0.5em {\overline{x}}_k=\frac{1}{n}\sum_{i=1}^n{x}_{ik} \)
(5) Correlation coefficient method
$$ {r}_{ij}=\frac{\sum_{k=1}^m\left|{x}_{ik}-{\overline{x}}_i\right|\left|{x}_{jk}-{\overline{x}}_j\right|}{\sqrt{\sum_{k=1}^m{\left({x}_{ik}-{\overline{x}}_i\right)}^2}\cdot \sqrt{\sum_{k=1}^m{\left({x}_{jk}-{\overline{x}}_j\right)}^2}} $$ (9)
where \( {\overline{x}}_i=\frac{1}{m}\sum_{k=1}^m{x}_{ik},\kern0.5em {\overline{x}}_j=\frac{1}{m}\sum_{k=1}^m{x}_{jk} \)
(6) Chebyshev distance method
$$ {r}_{ij}=1-C{\left(d\left({x}_i,{x}_j\right)\right)}^a $$ (10)
where \( d\left({x}_i,{x}_j\right)=\underset{k=1}{\overset{m}{\vee }}\left|{x}_{ik}-{x}_{jk}\right|,\kern0.5em a=1 \)
(7) Absolute value index method
$$ {r}_{ij}={e}^{-\sum_{k=1}^m\left|{x}_{ik}-{x}_{jk}\right|} $$ (11)
(8) Absolute value inverse method
$$ {r}_{ij}=\left\{\begin{array}{ll}1, & i=j\\ \frac{C}{\sum_{k=1}^m\left|{x}_{ik}-{x}_{jk}\right|}, & i\ne j\end{array}\right. $$ (12)
where C is a suitable positive number that satisfies \( C\le \underset{i,j}{\min}\left(\sum_{k=1}^m\left|{x}_{ik}-{x}_{jk}\right|\right) \)
(9) Dot product method
$$ {r}_{ij}=\left\{\begin{array}{ll}1, & i=j\\ \frac{1}{C}\sum_{k=1}^m{x}_{ik}{x}_{jk}, & i\ne j\end{array}\right. $$ (13)
where C is a suitable positive number that satisfies \( C\ge \underset{i,j}{\max}\left(\sum_{k=1}^m{x}_{ik}{x}_{jk}\right) \)
(10) Absolute value reduction method
$$ {r}_{ij}=\left\{\begin{array}{ll}1, & i=j\\ 1-C\sum_{k=1}^m\left|{x}_{ik}-{x}_{jk}\right|, & i\ne j\end{array}\right. $$ (14)
where C is a suitable positive number chosen so that \( 0\le {r}_{ij}\le 1 \).
(11) Angle cosine method
$$ {r}_{ij}=\frac{\sum_{k=1}^m{x}_{ik}{x}_{jk}}{\sqrt{\sum_{k=1}^m{x}_{ik}^2}\cdot \sqrt{\sum_{k=1}^m{x}_{jk}^2}} $$ (15)
(12) Subjective evaluation method
Experts with rich experience directly assess the degree of similarity between \( {x}_i \) and \( {x}_j \) and give the value of \( {r}_{ij} \) (\( 0\le {r}_{ij}\le 1 \)).
(4) Clustering (fuzzy equivalence matrix)
The fuzzy similar matrix constructed in the previous step is usually reflexive and symmetric but not transitive. To classify the objects, a fuzzy equivalence matrix must be obtained from the fuzzy similar matrix. The commonly used method is the transitive closure method: R is squared repeatedly under the composition R ∘ R, that is, \( R\to {R}^2\to {R}^4\to \cdots \to {R}^{2^k}\to \cdots \), until \( {R}^{2^k}={\left({R}^{2^k}\right)}^2 \); this \( {R}^{2^k} \) is called the transitive closure. The transitive closure is a fuzzy equivalence matrix on the domain X. Then, for λ ∈ [0, 1], the cut matrix \( {R}_{\lambda } \) is computed; as λ takes different levels, a dynamic clustering graph over X can be generated with a data processing tool such as Matlab or SPSS.
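The transitive closure step can be sketched in plain Python as follows. The sketch assumes the usual max-min composition for R ∘ R and an input matrix that is reflexive and symmetric; the function names and the 3 × 3 matrix are hypothetical and serve only as an illustration, not as the Matlab or SPSS procedure used in this paper.

```python
def compose(R, S):
    """Max-min composition of two square fuzzy matrices."""
    n = len(R)
    return [
        [max(min(R[i][k], S[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(R):
    """Square R repeatedly (R -> R^2 -> R^4 ...) until it stops changing."""
    while True:
        R2 = compose(R, R)
        if R2 == R:
            return R
        R = R2


def clusters_at(R_eq, lam):
    """Classes of the lambda-cut of a fuzzy equivalence matrix (0-based indices)."""
    n = len(R_eq)
    seen, groups = set(), []
    for i in range(n):
        if i in seen:
            continue
        group = [j for j in range(n) if R_eq[i][j] >= lam]
        seen.update(group)
        groups.append(group)
    return groups


# Hypothetical fuzzy similar matrix: reflexive and symmetric, not transitive.
R = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.5],
    [0.2, 0.5, 1.0],
]
T = transitive_closure(R)
print(T)                    # [[1.0, 0.8, 0.5], [0.8, 1.0, 0.5], [0.5, 0.5, 1.0]]
print(clusters_at(T, 0.8))  # [[0, 1], [2]]
print(clusters_at(T, 0.5))  # [[0, 1, 2]]
```

Because max-min composition only ever selects values already present in the matrix, the exact equality test `R2 == R` is safe here, and for a reflexive matrix the loop terminates after finitely many squarings.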
4 Sentiment analysis for QQ zone users
4.1 Original data matrix
Each row of the original data matrix represents one QQ zone user, and each column represents the sentiment values in a certain period. Using the Chinese text sentiment computation method, we calculated the sentiment values of 15 QQ zone users in 10 different periods. The original data matrix thus becomes the sentiment matrix of QQ zone, whose elements are shown in Table 3.
4.2 Sample data standardization processing
Sample data standardization normalizes the data so that they fall in the range [0, 1]. Using the moving range formula (formula (4)), the standardized matrix was obtained by Matlab programming; its elements are shown in Table 4.
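A minimal Python sketch of this normalization is given below, as an alternative illustration to the Matlab program mentioned above. It assumes that the minimum and maximum in formula (4) are taken per column, that is, per time period across all users; formula (4) does not state this explicitly. The function name and the sample values are hypothetical.

```python
def range_normalize(A):
    """Column-wise range normalization of a data matrix, after formula (4).

    Assumption: x_min and x_max are the minimum and maximum of each column.
    A constant column is mapped to 0.0 to avoid division by zero.
    """
    cols = list(zip(*A))
    mins = [min(c) for c in cols]
    maxs = [max(c) for c in cols]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, mins, maxs)]
        for row in A
    ]


# Hypothetical sentiment values for 3 users in 2 periods.
A = [
    [1.0, -2.0],
    [3.0, 0.0],
    [2.0, 2.0],
]
print(range_normalize(A))  # [[0.0, 0.0], [1.0, 0.5], [0.5, 1.0]]
```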
4.3 Construction of fuzzy similar matrix
Here, the absolute value reduction method is adopted to determine the degree of similarity between two samples, namely the similarity coefficient r ij (see formula (14)). C is set to 0.1, which gives the 15 × 15 sentiment fuzzy similar matrix of the 15 QQ zone users; its elements are shown in Table 5.
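The absolute value reduction method of formula (14) can be sketched as follows. With m = 10 normalized features in [0, 1], the sum of absolute differences is at most 10, so C = 0.1 keeps every coefficient within [0, 1]. The function name and the small example matrix (which uses C = 0.5 for two features) are hypothetical.

```python
def abs_reduction_similarity(X, C=0.1):
    """Fuzzy similar matrix by the absolute value reduction method, formula (14).

    X is the normalized data matrix (one row per object).
    C must be small enough that every coefficient stays within [0, 1].
    """
    n = len(X)
    R = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            dist = sum(abs(a - b) for a, b in zip(X[i], X[j]))
            R[i][j] = R[j][i] = 1.0 - C * dist
    return R


# Hypothetical normalized data: 3 objects, 2 features.
X = [
    [0.0, 1.0],
    [0.5, 0.5],
    [0.0, 1.0],
]
print(abs_reduction_similarity(X, C=0.5))
# [[1.0, 0.5, 1.0], [0.5, 1.0, 0.5], [1.0, 0.5, 1.0]]
```

The resulting matrix is reflexive and symmetric by construction, but it is generally not transitive, which is why the transitive closure is computed before clustering.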
4.4 Sentiment clustering
The fuzzy similar matrix obtained in the previous section is not necessarily transitive; before clustering, its transitive closure t(R) is computed by the squaring method described in Section 3.5, which gives the fuzzy equivalence matrix R′. Its elements are shown in Table 6. Then, as the value of λ is lowered from high to low, the dynamic clustering graph in Fig. 1 is generated.
4.5 Clustering results analysis
It can be seen from Fig. 1 that:
(1) When λ is set to 0.9257, the emotions of the 15 users are classified into 14 classes, with the 11th and 12th users in the same class. Table 3 shows that all entries of the 11th and 12th rows of the original data matrix are positive; that is, these two users had positive emotion throughout the study, while every other user was in a negative emotional state in at least one period. It can also be seen that the objects merged into a class first have a higher degree of similarity and closer emotional tendencies. The analysis shows that the fuzzy clustering algorithm works well for sentiment analysis.
(2) Users can select a suitable threshold to search for friends whose emotions are similar or contrary to their own, according to their circumstances, and so pursue the emotional life they want. This illustrates that the dynamic fuzzy clustering method is easy to operate and flexible in the analysis of emotion.
5 Sentiment clustering analysis of micro-blog users on fuzzy clustering algorithm
5.1 Data preparation
The experimental data came from the micro-blog service of www.sina.com. A large number of users’ micro-blog messages, together with the user IDs, were collected by a crawler system, and the time, website, and forwarding count of each message were stored in a database. The messages were then preprocessed and scored by the sentiment algorithm to obtain their sentiment values in different periods. In the experiment, we selected 50 micro-blog users and computed their sentiment values in 10 periods. Denote the object domain as \( X=\left\{{x}_1,{x}_2,\cdots, {x}_n\right\} \) and let the attributes of \( {x}_i \) be \( \left({x}_{i1},{x}_{i2},\cdots, {x}_{im}\right) \), where n = 50 and m = 10.
5.2 Fuzzy clustering
Based on fuzzy clustering theory, we normalize the original data and construct the fuzzy equivalence matrix; the resulting dynamic fuzzy clustering diagram is shown in Fig. 2.
Fig. 2 illustrates the users' sentiment changes computed from their posted messages. Whatever social circles users belong to, they express their emotions, so the sentiment analysis does not depend directly on social categories. As Fig. 2 shows, the closer the clustering threshold λ is to 1, the larger the number of classes; once λ falls to a certain value, all samples belong to one class (at λ = 0.72, all users are merged into one category). The advantage of the clustering method is that λ can be selected according to actual needs to obtain an appropriate classification.
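The effect of lowering λ can be sketched as follows. The example assumes a fuzzy equivalence matrix is already available; the 4 × 4 matrix `R_EQ` and the function name are hypothetical illustrations, not the paper's data. Two objects fall into the same class whenever their entry is at least λ.

```python
def lambda_cut_classes(r_eq, lam):
    """Partition indices so that i and j share a class when r_eq[i][j] >= lam.

    Assumes r_eq is a fuzzy equivalence matrix (reflexive, symmetric,
    max-min transitive), so every lambda-cut is an equivalence relation.
    """
    classes = []
    assigned = set()
    for i in range(len(r_eq)):
        if i in assigned:
            continue
        members = [j for j in range(len(r_eq)) if r_eq[i][j] >= lam]
        assigned.update(members)
        classes.append(members)
    return classes


# Hypothetical fuzzy equivalence matrix for four users.
R_EQ = [
    [1.0, 0.8, 0.5, 0.5],
    [0.8, 1.0, 0.5, 0.5],
    [0.5, 0.5, 1.0, 0.6],
    [0.5, 0.5, 0.6, 1.0],
]

# Sweep lambda from high to low over the distinct matrix values.
thresholds = sorted({v for row in R_EQ for v in row}, reverse=True)
for lam in thresholds:
    print(lam, lambda_cut_classes(R_EQ, lam))
```

Each distinct value in the matrix is a threshold at which classes merge, which is why the dynamic clustering diagram only needs to be evaluated at those values.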
Denote the number of classes by r. As shown in Fig. 2, r takes the values 1, 2, 6, 7, 14, 16, 19, 22, 25, 31, 35, 37, 38, 40, 43, 44, 45, 46, 47, 48, and 50. r = 1 means that all 50 users are in one class; r = 50 means that each user forms a class of its own, namely 50 classes. Five of these classifications are analyzed as follows:
-
1)
r = 2:{x 21},{x 1–x 20, x 22–x 50}.
-
2)
r = 6:{x 21},{x 34},{x 29, x 49},{x 6},{x 4},{x 1–x 3, x 5, x 7–x 20, x 22–x 28, x 30–x 33, x 35–x 48, x 50}.
-
3)
r = 7:{x 21},{x 34},{x 29, x 49},{x 6},{x 4},{x 47},{x 1–x 3, x 5, x 7–x 20, x 22–x 28, x 30–x 33, x 35–x 46, x 48, x 50}.
-
4)
r = 14:{x 21},{x 34},{x 29, x 49},{x 6},{x 4},{x 47},{x 50},{x 38},{x 23, x 43},{x 14, x 41},{x 8},{x 7},{x 3},{x 1, x 2, x 5, x 9–x 13, x 15–x 20, x 22, x 24–x 28, x 30–x 33, x 35–x 37, x 39–x 40, x 42, x 44–x 46, x 48}.
-
5)
r = 16:{x 21},{x 34},{x 29, x 49},{x 6},{x 4},{x 47},{x 50},{x 38},{x 23, x 43},{x 14, x 41},{x 8},{x 7},{x 3},{x 12, x 27},{x 9},{x 1, x 2, x 5, x 10–x 11, x 13, x 15–x 20, x 22, x 24–x 26, x 28, x 30–x 33, x 35–x 37, x 39, x 40, x 42, x 44–x 46, x 48}.
5.3 Clustering effect test and result analysis
To determine the optimum among the above classification results, the F statistic is used. Let r be the number of classes, let n j be the number of samples in the jth class, and denote the samples of the jth class as \( {x}_1^{(j)},{x}_2^{(j)},\cdots, {x}_{n_j}^{(j)} \). The clustering center of the jth class is the vector \( {\overline{x}}^{(j)}=\left({\overline{x}}_1^{(j)},{\overline{x}}_2^{(j)},\cdots, {\overline{x}}_m^{(j)}\right) \), in which \( {\overline{x}}_k^{(j)} \) is the average of the kth feature, namely
\[ {\overline{x}}_k^{(j)}=\frac{1}{n_j}{\displaystyle \sum_{i=1}^{n_j}{x}_{ik}^{(j)}},\kern1em k=1,2,\cdots, m. \]
Let \( \overline{x}=\left({\overline{x}}_1,{\overline{x}}_2,\cdots, {\overline{x}}_m\right) \) be the center of all n samples, with \( {\overline{x}}_k=\frac{1}{n}{\displaystyle {\sum}_{i=1}^n{x}_{ik}} \). The F statistic is
\[ F=\frac{{\displaystyle \sum_{j=1}^r{n}_j{\left\Vert {\overline{x}}^{(j)}-\overline{x}\right\Vert}^2}/\left(r-1\right)}{{\displaystyle \sum_{j=1}^r{\displaystyle \sum_{i=1}^{n_j}{\left\Vert {x}_i^{(j)}-{\overline{x}}^{(j)}\right\Vert}^2}}/\left(n-r\right)}, \]
where \( \left\Vert {\overline{x}}^{(j)}-\overline{x}\right\Vert =\sqrt{{\displaystyle {\sum}_{k=1}^m{\left({\overline{x}}_k^{(j)}-{\overline{x}}_k\right)}^2}} \) is the distance between \( {\overline{x}}^{(j)} \) and \( \overline{x} \), and \( \left\Vert {x}_i^{(j)}-{\overline{x}}^{(j)}\right\Vert \) is the distance between the ith sample of the jth class, \( {x}_i^{(j)} \), and its center \( {\overline{x}}^{(j)} \). The F value obeys the F distribution with degrees of freedom r − 1 and n − r. The numerator of F measures the distance between classes, and the denominator measures the distance between samples within each class. The greater the F value, the greater the distance between classes relative to the spread within them, that is, the greater the difference between classes and the better the classification.
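The F statistic can be computed directly from a data matrix and a partition. The sketch below assumes the usual form of the statistic, between-class scatter divided by r − 1 over within-class scatter divided by n − r, which matches the degrees of freedom stated above; the one-feature samples and the function name are hypothetical and serve only to show the calculation.

```python
def f_statistic(samples, classes):
    """F statistic of a partition.

    samples: list of n feature vectors, each of length m.
    classes: list of r lists of sample indices forming a partition.
    """
    n, m, r = len(samples), len(samples[0]), len(classes)
    if not 1 < r < n:
        raise ValueError("F is defined only for 1 < r < n")
    overall = [sum(x[k] for x in samples) / n for k in range(m)]
    between = 0.0
    within = 0.0
    for members in classes:
        n_j = len(members)
        center = [sum(samples[i][k] for i in members) / n_j for k in range(m)]
        # Squared distance from the class center to the overall center.
        between += n_j * sum((center[k] - overall[k]) ** 2 for k in range(m))
        # Squared distances from each member to its class center.
        for i in members:
            within += sum((samples[i][k] - center[k]) ** 2 for k in range(m))
    if within == 0:
        raise ValueError("within-class scatter is zero; F is undefined")
    return (between / (r - 1)) / (within / (n - r))


# Hypothetical data: four samples with a single feature.
data = [[0.0], [2.0], [10.0], [12.0]]
f_tight = f_statistic(data, [[0, 1], [2, 3]])   # well-separated classes
f_loose = f_statistic(data, [[0], [1, 2, 3]])   # poorly chosen classes
print(f_tight, f_loose)
```

The partition that groups nearby samples yields the larger F value, which is the criterion used to compare candidate classifications.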
Based on the above principle, we can obtain the following results:
-
1)
If r = 2, then n 1 = 1, n 2 = 49, we can obtain F 2 = 58.457 %
-
2)
If r = 6, then n 1 = 1, n 2 = 1, n 3 = 2, n 4 = 1, n 5 = 1, n 6 = 44, F 6 = 76.111 % can be computed.
-
3)
If r = 14, then n 1 = 1, n 2 = 1, n 3 = 2, n 4 = 1, n 5 = 1, n 6 = 1, n 7 = 1, n 8 = 1, n 9 = 2, n 10 = 2, n 11 = 1, n 12 = 1, n 13 = 1, n 14 = 34, and we can derive F 14 = 58.441 %.
-
4)
If r = 16, then n 1 = 1, n 2 = 1, n 3 = 2, n 4 = 1, n 5 = 1, n 6 = 1, n 7 = 1, n 8 = 1, n 9 = 2, n 10 = 2, n 11 = 1, n 12 = 1, n 13 = 1, n 14 = 2, n 15 = 1, n 16 = 31, and F 16 = 61.236 % can be obtained.
From above data, we can get F6 > F16 > F2 > F14, so dividing into six categories is the best classification method.
When the 50 users are divided into six classes, the sentiment tendency graph of each user is shown in Fig. 3.
Figure 3 shows that the emotional state of the users in the first class is poor as a whole: the emotional value is below zero in most cases, that is, negative emotions dominate. Relevant departments should therefore take corresponding measures to keep such users from taking radical actions and to maintain social harmony and stability. The mood of users in the second class is somewhat low early in the month and then slowly improves; although there are some fluctuations, these users' emotional values are positive overall, so they are in a normal situation. The emotional trends of the two users in the third class are consistent: their sentiment values were close to −1 on 2014/12/04, indicating strongly negative emotion, but their emotions were favorable most of the rest of the time. The emotional states of users in the fourth, fifth, and sixth classes can be analyzed in the same way. Through the emotional charts, we can not only understand the emotional changes of individual users but also find users with similar emotional states under different classifications.
6 Application examples
In recent years, fuzzy clustering analysis has been widely used in data mining, pattern recognition, and comprehensive index evaluation. In this paper, the fuzzy clustering algorithm is applied to sentiment analysis of Sina micro-blog and Tencent QQ space users, with good results.
-
(1)
Sentiment analysis of Tencent QQ space users
-
(a)
Collect the messages sent by 15 QQ users in 10 different time periods
-
(b)
The length of the time period is adjustable: for highly active users, a short time interval can be selected; conversely, for less active users, a longer interval can be selected. If several messages are published within one period, the average of their values is taken as the value for that period; if no message is published, the value is set to 0
-
(c)
Calculate the emotional values of the 15 QQ users in the 10 time periods with the Chinese text emotion calculation method described above, and construct a dynamic fuzzy clustering map
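The per-period rule in step (b) can be sketched as follows: average the values of all messages in a period, and use 0 for a period with no messages. The function name and the sample scores are hypothetical.

```python
def period_values(message_scores, num_periods):
    """Build one user's row of the data matrix.

    message_scores: iterable of (period_index, sentiment_value) pairs.
    Returns one value per period: the mean of that period's messages,
    or 0.0 when the user published nothing in the period.
    """
    buckets = [[] for _ in range(num_periods)]
    for period, score in message_scores:
        buckets[period].append(score)
    return [sum(b) / len(b) if b else 0.0 for b in buckets]


# Hypothetical user: two messages in period 0, none in period 1, one in period 2.
row = period_values([(0, 0.4), (0, -0.2), (2, 0.9)], num_periods=3)
print(row)
```

Stacking one such row per user gives the original data matrix that the clustering operates on.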
The dynamic fuzzy clustering map shows the degree to which samples belong to a given class. Users can use it to make decisions according to their own preferences; it can also be used to forecast users' emotional trends with the curve fitting method and to take appropriate preventive measures for people whose emotions are extreme or abnormal, so as to promote social harmony, stability, and development.
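As an illustration of trend forecasting by curve fitting, the sketch below fits a straight line to a user's per-period values by least squares and extrapolates one period ahead. The choice of a straight line is an assumption made here for simplicity, since the text does not name a specific curve; the function names and sample values are hypothetical, and the forecast is clamped to [−1, 1] as a further assumption about the valid range of emotional values.

```python
def fit_line(values):
    """Least-squares line through (0, v0), (1, v1), ...; returns (slope, intercept)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two periods to fit a line")
    t_mean = (n - 1) / 2
    v_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (v - v_mean) for t, v in enumerate(values))
    slope = sxy / sxx
    return slope, v_mean - slope * t_mean


def forecast(values, steps_ahead=1):
    """Extrapolate the fitted line and clamp the result to [-1, 1]."""
    slope, intercept = fit_line(values)
    predicted = intercept + slope * (len(values) - 1 + steps_ahead)
    return max(-1.0, min(1.0, predicted))


# Hypothetical user whose mood improves steadily over four periods.
history = [-0.4, -0.2, 0.0, 0.2]
slope, intercept = fit_line(history)
next_value = forecast(history)
print(slope, intercept, next_value)
```

A persistently negative slope, or a forecast near −1, is the kind of signal that could prompt the preventive measures mentioned above.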
-
(2)
Sentiment analysis of Sina micro-blog users
There are three steps as follows:
-
(a)
Design of the Sina micro-blog data acquisition system
After investigating and analyzing the API of the Sina micro-blog open platform, we obtain the HTML structure of micro-blog pages and parse the micro-blog messages. Then, based on the features of Sina micro-blog messages, a web crawler data acquisition system for Sina micro-blog is designed and implemented.
-
(b)
Construction of the micro-blog emotion lexicon
We calculate the emotional intensity (weights) of the basic emotional words and construct a basic emotion lexicon with the HowNet-based word similarity computation method; from this, the network emotion lexicon and the micro-blog emotion lexicon are obtained.
-
(c)
Classification analysis of micro-blog user emotions
The emotional value is defined on the range [−1, 1]: a value greater than 0 is taken as positive emotion, and a value less than 0 as negative emotion. The emotional value of a whole micro-blog message is calculated from its emotional words together with modifiers and other factors. Next, the emotional values of 50 users in 10 time periods are selected, and the fuzzy clustering algorithm is used to classify the emotions of the micro-blog users; the F test is used to determine the optimal classification. Finally, we forecast from the classification results with the SPSS tool and generate emotion charts to understand the users' emotional states more intuitively.
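A toy version of the scoring step is sketched below. It is not the paper's calculation method: the scoring rule (each emotion word's weight scaled by the modifiers that precede it, averaged over the message, then clamped to [−1, 1]) is an assumption chosen for illustration, and the English lexicon, the modifier factors, and the function names are all hypothetical. The label for a value of exactly 0 is also an assumption, since the text only names the positive and negative cases.

```python
def message_value(words, lexicon, modifiers):
    """Average the modifier-scaled weights of the emotion words in a message.

    words: tokens in order; modifiers apply to the next emotion word.
    Returns 0.0 when the message contains no emotion word.
    """
    scores = []
    factor = 1.0
    for word in words:
        if word in modifiers:
            factor *= modifiers[word]
        elif word in lexicon:
            scores.append(factor * lexicon[word])
            factor = 1.0
    if not scores:
        return 0.0
    value = sum(scores) / len(scores)
    return max(-1.0, min(1.0, value))  # keep the value inside [-1, 1]


def polarity(value):
    """Map an emotional value to a label."""
    if value > 0:
        return "positive"
    if value < 0:
        return "negative"
    return "neutral"  # assumption: the zero case is not named in the text


# Hypothetical lexicon weights and modifier factors.
LEXICON = {"happy": 0.8, "sad": -0.7}
MODIFIERS = {"very": 1.5, "not": -1.0}

v_pos = message_value(["very", "happy", "today"], LEXICON, MODIFIERS)
v_neg = message_value(["not", "happy"], LEXICON, MODIFIERS)
print(v_pos, polarity(v_pos))
print(v_neg, polarity(v_neg))
```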
7 Conclusions
Sentiment analysis has become an inevitable trend in the development of big data. Quantifying human emotion so that computers can identify and analyze it automatically has become one of the most active research topics. In this paper, the fuzzy clustering algorithm is applied to analyze the emotional changes of Sina micro-blog users. Firstly, the sentiment intensity of the text published by micro-blog users is analyzed. Next, the sentiment value of the whole text is calculated by combining the sentiment words with the influence of modifiers on sentiment intensity. Finally, the fuzzy clustering algorithm is used to classify the sentiment of micro-blog users, the classification results are used for forecasting with tools such as SPSS, and sentiment trend charts are generated to show users' emotional states more intuitively. The research provides an important basis for decision-making by relevant departments.
References
F Liu, W Xian, Micro-blog builds a new platform for mobile learning. China Educ. Technol. Equip. 26(36), 26–27 (2012)
X Song, 5W features analysis in micro-blog. J. Chifeng Univ. 29(1), 96–98 (2013)
H Wang et al., Literature review of sentiment classification on Web text. J. Chin. Soc. Sci. Tech. Inf. 29(5), 931–938 (2010)
Y Zhao, B Qin, T Liu, Sentiment analysis on text. J. Softw. 21(8), 1834–1848 (2010)
A Go, R Bhayani, L Huang, Twitter sentiment classification using distant supervision. Cs224n Project Report, 2009, pp. 1–12
B O'Connor, R Balasubramanyan, BR Routledge, NA Smith, From Tweets to Polls: Linking Text Sentiment to Public Opinion Time Series, Conference: Proceedings of the Fourth International Conference on Weblogs and Social Media, 2010, pp. 122–129
A Pak, P Paroubek, Twitter as a Corpus for Sentiment Analysis and Opinion Mining[C]//Seventh Conference on International Language Resources and Evaluation, 2010
A Bifet, E Frank, Sentiment knowledge discovery in Twitter streaming data. Lect. Notes Comput.Sci. 6332, 1–15 (2010)
D Davidov, O Tsur, A Rappoport, Enhanced Sentiment Learning Using Twitter Hashtags and Smileys. International Conference on Computational Linguistics: Posters, 2010, pp. 241–249
CX Lin et al., PET: A Statistical Model for Popular Events Tracking in Social Communities. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 929–938
A Celikyilmaz, D Hakkani-Tur, J Feng, Probabilistic Model-Based Sentiment Analysis of Twitter Messages. Spoken Language Technology Workshop (SLT), 2010 IEEE IEEE, 2010, pp. 79–84
L Barbosa, J Feng, Robust Sentiment Detection on Twitter from Biased and Noisy Data. 23rd International Conference on Computational Linguistics, vol. 23, 2010, pp. 36–44
J Bollen, A Pepe, H Mao, Modeling public mood and emotion: Twitter sentiment and socio-economic phenomena. Biochem. Pharmacol. 44(12), 2365–2370 (2010)
L Jiang, M Yu, M Zhou et al., Target-Dependent Twitter Sentiment Classification: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, 2011, pp. 151–160
L Zhang et al., Combining Lexicon-Based and Learning-based Methods for Twitter Sentiment Analysis. HP Laboratories Technical Report, 2011
ME Hoque, DJ McDuff, RW Picard, Exploring temporal patterns in classifying frustrated and delighted smiles. IEEE Trans. Affect. Comput. 3(3), 323–334 (2012)
DJ McDuff, R el Kaliouby, RW Picard, Crowdsourcing facial responses to online videos. IEEE Trans. Affect. Comput. 3(4), 456–468 (2012)
DN Trung et al., “Towards Modeling Fuzzy Propagation for Sentiment Analysis in Online Social Networks: A Case Study on TweetScope.” Cognitive Infocommunications (CogInfoCom), 2013 IEEE 4th International Conference on IEEE, 2013, pp. 331–338
S Venkatesh, D Phung, D Bo, M Berk, Affective and content analysis of online depression communities. IEEE Trans. Affect. Comput. 5(3), 217–226 (2014)
TH Nguyen, K Shirai, J Velcin, Sentiment analysis on social media for stock movement prediction. Expert Syst. Appl. 42, 9603–9611 (2015)
DN Trung, JJ Jung, Sentiment analysis based on fuzzy propagation in online social networks: a case study on TweetScope. Comput Sci. Inf. Syst. 11(1), 215–228 (2014)
W Xing, Z Shaojian, Chinese text sentiment analysis utilizing emotion degree lexicon and fuzzy semantic model. Int. J. Softw. Sci. Comput. Intell. 6(4), 20–32 (2014)
Z Wang et al., Issues of Social Data Analytics with a New Method for Sentiment Analysis of Social Media Data, Cloud Computing Technology and Science (CloudCom), 2014 IEEE 6th International Conference on IEEE, 2014, pp. 899–904
MD Munezero, CS Montero, E Sutinen, J Pajunen, Are they different? Affect, feeling, emotion, sentiment, and opinion detection in text. IEEE Trans. Affect. Comput. 5(2), 101–111 (2014)
HI Ahn, RW Picard, Measuring affective-cognitive experience and predicting market success. IEEE Trans. Affect. Comput. 5(2), 173–186 (2014)
F Santos-Sanchez, A Mendez-Vazquez, “Sentiment Analysis for e-Services.” Advanced Applied Informatics (IIAIAAI), 2014 IIAI 3rd International Conference on IEEE, 2014, pp. 42–47
A Pappurajan, SP Victor, Web sentiment analysis for scoring positive or negative words using Tweeter data. Int. J. Comput. Appl. 96(6), 33–37 (2014)
A Hogenboom et al., Lexicon-based sentiment analysis by mapping conveyed sentiment to intended sentiment. Int. J. Web Eng. Technol. 9, 125–147 (2014)
RF Martins, A Pereira, F Benevenuto, “An Approach to Sentiment Analysis of Web Applications in Portuguese.” Proceedings of the 21st Brazilian Symposium on Multimedia and the Web ACM, 2015, pp. 105–112
C Clavel, Z Callejas, Sentiment analysis: from opinion mining to human-agent interaction. IEEE Trans. Affect. Comput. 7(1), 74–93 (2016)
Z Xu et al., Knowle: a semantic link network based system for organizing large scale online news events. Future Generation Comput. Syst. 43–44, 40–50 (2015)
Z Xu et al., Crowdsourcing based social media data analysis of urban emergency events. Multimedia Tools Appl. DOI:10.1007/s11042-015-2731-1 (2015)
Z Xu et al., Crowdsourcing based description of urban emergency events using social Media big data. IEEE Trans. Cloud Comput. DOI:10.1109/TCC.2016.2517638.(2016)
Z Xu et al., Participatory sensing based semantic and spatial analysis of urban emergency events using mobile social media. EURASIP J. Wireless Commun. Netw. 44, 2016 (2016)
Z Xu et al., Building knowledge base of urban emergency events based on crowdsourcing of social media. Concurrency Comput. Pract. Exp. DOI:10.1002/cpe.3780. (2016)
Acknowledgements
This work was supported by the National Natural Science Foundation Project under grant 61175122, as well as by the Applied Basic Research Project of Sichuan province of China (2013JY0134) and by the Key Project of Sichuan Educational Commission (No. 15ZA0049).
Additional information
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Yang, L., Geng, X. & Liao, H. A web sentiment analysis method on fuzzy clustering for mobile social media users. J Wireless Com Network 2016, 128 (2016). https://doi.org/10.1186/s13638-016-0626-0
DOI: https://doi.org/10.1186/s13638-016-0626-0