In this section, we present the details of the proposed AutoTL. We first define the short text mining problem, then introduce the feature representation of the target data based on latent semantic analysis, and finally describe the classifier generation.
Problem statement
The target domain, or target data, refers to a large collection of short texts \(X=\{X_{1},X_{2},\ldots,X_{n}\}\), where \(X_{i}\) is the i-th short text instance. Within the target domain, the known label space is \(L=\{l_{1},l_{2},\ldots,l_{m}\}\), associated with X. In short text analysis, the label space is normally very small and not sufficient to support an accurate classification. Moreover, no specific source data are given for learning, so traditional data mining and machine learning approaches cannot be applied. Furthermore, the prior probability distribution of the data is also unknown. The problem studied here is: given the target domain and its limited labels, how to provide an accurate classification over the target domain.
To address this problem, in this paper we propose the AutoTL, which automatically transfers knowledge obtained from other online long text resources, also called the source domain (e.g., web information or social media). The AutoTL adopts latent semantic analysis to mine the semantics of both the target domain and the source domain. Based on these semantics, it formalizes the important features and links the two different types of data together, aiming to find the best feature representation that preserves the text semantics for a good classification. Thus, the key techniques of the proposed AutoTL include keyword extraction, feature weight calculation, new feature space construction, and target domain classification.
Keyword extraction
As the related source data are not provided, we first have to figure out which online resources are the most related to the target data. To do so, a set of keywords is extracted from the target domain and then supplied to a search engine to retrieve the related source data. Therefore, the first step of the AutoTL is to extract the most representative keywords. It is insufficient to simply use the labels as the keywords, as this would lead to topic distillation. Instead, we adopt mutual information to select the keywords. The mutual information between two objects is given by:
$$ \begin{aligned} I(P;Q) &= \sum_{x \in P} \sum_{y \in Q}p(x,y) \log\frac{p(x|y)}{p(x)}\\ &=\sum_{x \in P} \sum_{y \in Q}p(x,y) \log\frac{p(x,y)}{p(x)p(y)} \end{aligned} $$
(1)
A larger mutual information indicates a higher correlation between two objects. Using mutual information as the measure, the target domain is preprocessed to compute the feature seed set that shares the largest mutual information with the target label space. Specifically, the mutual information is calculated as I(x,c), where x is a candidate feature seed and c is a label. \(I(x_{i},c_{j})>\varepsilon\) (here \(\varepsilon\) is a threshold) indicates that the feature \(x_{i}\) is highly related to \(c_{j}\); in this case, \(x_{i}\) is chosen as a keyword.
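For illustration, a minimal sketch of this selection step is given below, assuming a binary term-occurrence matrix over the few labeled target texts; the helper name select_keywords and the threshold default are our own choices, not part of the paper.

```python
import numpy as np

def select_keywords(X, y, vocab, eps=0.01):
    """Pick keyword candidates whose mutual information with a label exceeds eps.

    X     : (n_docs, n_terms) binary term-occurrence matrix of the labeled target texts
    y     : (n_docs,) label ids
    vocab : list of terms aligned with the columns of X
    eps   : the threshold epsilon of the paper (default value is ours)
    """
    keywords = set()
    for c in np.unique(y):
        p_c = np.mean(y == c)                              # p(y = c)
        for t in range(X.shape[1]):
            p_t = np.mean(X[:, t] > 0)                     # p(term t present)
            p_tc = np.mean((X[:, t] > 0) & (y == c))       # joint probability p(x, y)
            if p_tc > 0 and p_t > 0:
                # contribution p(x,y) * log(p(x,y) / (p(x) p(y))), as in Eq. (1)
                mi = p_tc * np.log(p_tc / (p_t * p_c))
                if mi > eps:
                    keywords.add(vocab[t])
    return keywords
```

The resulting keyword set is then submitted to a search engine to collect the related source documents.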
Feature weight calculation
After selecting the source data, the next step of the AutoTL is to identify the useful labels/features from the source data, which can be used to strengthen the target data classification. A naive approach would be to calculate the similarity between the feature sets of the target domain and the source domain and then select useful features according to the similarity among the words. However, such an approach treats each word individually, ignoring the relations between the texts and the semantics hidden in the context of the keywords. Hence, we utilize the latent semantic analysis approach instead [28, 29]. Semantic analysis is well suited to this task because it organizes the text into a semantic space structure that preserves the relationships between the texts and the words.
A text matrix is used in latent semantic analysis. It not only captures the word frequency in each text but also distinguishes the texts from one another. Typically, in latent semantic analysis, the feature weights are calculated as the product of the local weight LW(i,j) (the weight of word i in text j) and the global weight GW(i) (the weight of word i over all texts). Specifically, the feature weight W(i,j) is given by:
$$ \begin{aligned} W(i,j) &= \text{LW}(i,j)*\text{GW}(i)\\ &= \log(\text{tf}(i,j)+1)*\left(1- \sum_{j}\frac{P_{ij}\log(P_{ij})}{\log N}\right) \end{aligned} $$
(2)
where \(P_{ij}=\frac{\text{lf}(i,j)}{\text{gf}(i)}\), \(\text{lf}(i,j)\) is the frequency of word i in text j, and \(\text{gf}(i)\) is the frequency of word i over all texts.
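For concreteness, a small sketch of this classical log-entropy weighting (Eq. 2) follows, assuming a raw term-frequency matrix with terms as rows and texts as columns; the function name is our own.

```python
import numpy as np

def log_entropy_weights(tf):
    """Classical log-entropy term weighting of Eq. (2).

    tf : (n_terms, n_docs) raw term-frequency matrix (requires more than one document).
    Returns W with W[i, j] = LW(i, j) * GW(i).
    """
    n_docs = tf.shape[1]
    gf = tf.sum(axis=1, keepdims=True)                    # gf(i): global frequency of word i
    p = np.where(gf > 0, tf / np.maximum(gf, 1), 0.0)     # P_ij = lf(i, j) / gf(i)
    with np.errstate(divide="ignore", invalid="ignore"):
        ent = np.where(p > 0, p * np.log(p), 0.0).sum(axis=1, keepdims=True)
    gw = 1.0 - ent / np.log(n_docs)   # GW(i) = 1 - sum_j P_ij log(P_ij) / log N, as written in Eq. (2)
    lw = np.log(tf + 1.0)             # LW(i, j) = log(tf(i, j) + 1)
    return lw * gw
```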
This traditional method works well when the target and the source domains share the same data type and distribution. Unfortunately, it cannot be directly applied to our context, where the target and source data are completely different in terms of both data type and data distribution: traditional methods do not consider the difference between the source and the target domains, which results in poor classification. Therefore, in this paper, we propose a new latent semantic analysis approach that enables an accurate classification by utilizing the word frequency and the entropy.
Word frequency weight
The word frequency weight refers to the frequency with which a feature appears under different labels, and it captures the ability of the feature to distinguish between labels. In other words, if a feature appears frequently in one text, the feature plays an important role in that text. Meanwhile, if the feature also has a high frequency in other texts, its weight should be reduced because of its weaker discriminative capacity. Assume the labels obtained from the source data represent the categories retrieved by the keywords. The word frequency weight is then calculated as follows:
$$ \begin{aligned} \text{FW}(C_{i},j) &= \log \text{cf}(C_{i},j)\times \frac{1}{\log\left(\sum_{k \neq i}\text{cf}(C_{k},j)\right)} \\ &= \log \frac{\sum_{t=1}^{m} {\text{tf}(t,j)}}{m} \times \frac{n(c-1)}{\log\left(\sum_{k \neq i} \sum_{s=1}^{n}\text{tf}(s,j)\right)} \end{aligned} $$
(3)
where \(\text{cf}(C_{i},j)\) is the frequency of feature j appearing in category \(C_{i}\), \(\sum_{k\neq i}\text{cf}(C_{k},j)\) is the frequency of feature j appearing in the other categories, \(\sum_{t=1}^{m}\text{tf}(t,j)\) is the frequency of feature j appearing in all the documents belonging to category \(C_{i}\), m is the number of documents in \(C_{i}\), and c−1 is the number of the other category labels.
Entropy weight
In this paper, we use the entropy to represent the weight of a feature with respect to the classification labels, defined as \(\text{CW}(C_{i}|j)\). The entropy weight represents the degree of importance of a feature to the classification labels. The entropy H(X) measures the degree of uncertainty of a signal X and is calculated as:
$$ H(X)=-\sum p(x_{i})\log p(x_{i}) $$
(4)
The conditional entropy H(X|Y) is the degree of uncertainty of X when Y is known, and is calculated as follows:
$$ \begin{aligned} H(X|Y)&=-\sum p(x_{i}|Y)\log p(x_{i} | Y)\\ &= -\sum p(x_{i}, Y)\log p(x_{i} | Y) \end{aligned} $$
(5)
Hence, the entropy weight can be calculated as the reduction in the uncertainty of X once Y is known, that is:
$$ \text{CW}(C_{i} | j) = H(C_{i})- H(C_{i} | j) $$
(6)
Normally, \(H(C_{i})\) is hard to calculate directly, but it satisfies the condition \(H(C_{i}|j)\leq H(C_{i})\leq \log(c)\). When the source documents are of similar length, \(H(C_{i})\) is close to log(c). Thus, the entropy weight can be rewritten as follows:
$$ \begin{aligned} \text{CW}(C_{i} | j) &= H(C_{i})- H(C_{i} | j)\\ &= \log(c) + \sum{p(t, j)\log p(t,j)}\\ &= \log(c) + \sum \frac{\text{tf}(t,j)}{\text{gf}(j)}\log\left(\frac{\text{tf}(t,j)}{\text{gf}(j)}\right) \end{aligned} $$
(7)
To this end, the weight in our proposed approach is calculated as follows:
$$ W(i) = \text{FW}(C_{i}, j) \times \text{CW}(C_{i} | j) $$
(8)
Different from the traditional latent semantic analysis that builds a feature-document weight matrix, the AutoTL builds a feature-classification label weight matrix. In this matrix, the weight \(w_{ij}\) represents the correlation between feature i and classification label j. Assume the matrix obtained from the documents is M. After SVD decomposition, we obtain the rank-k approximation \(M_{k}\). Via the feature similarity \(M_{k} M_{k}^{T}\), we can then identify the features that are not labeled in the target domain but are highly related to the classification. The best such features are chosen as the feature seed set.
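This seed selection step can be sketched as follows; the rank k and the function name feature_similarity are illustrative choices, not values fixed by the paper.

```python
import numpy as np

def feature_similarity(M, k=50):
    """Rank-k SVD of the feature-by-label weight matrix M and the induced
    feature-feature similarity M_k M_k^T used to pick the feature seed set."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # rank-k approximation of M
    return M_k @ M_k.T                            # similarity between features
```

Features with the highest similarity to the labeled seed features can then be added to the feature seed set.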
New feature space construction
Considering that the features may have many relations in real life, we try to capture the relations among these features to improve the classification quality. Our approach is to organize the source domain labels into an undirected graph, whose nodes denote the labels and whose edges denote the relations between them. To build the relations from the feature seed set, we extract a subgraph that contains the whole feature seed set from the undirected graph. This eventually builds the connections between the labels in the source domain and those in the target domain.
Since the label graph is normally high-dimensional, we adopt the Laplacian eigenmaps algorithm [30] to map all nodes in the subgraph into a low-dimensional space. This effectively alleviates the problems caused by high dimensionality, such as overfitting and low efficiency. Laplacian eigenmaps assumes that if two points are close in the high-dimensional space, the distance between them should also be short when they are embedded into a low-dimensional space. The original algorithm does not consider the category information of the samples when calculating neighbor distances: points at the same distance receive the same weight regardless of whether they belong to the same category. This, however, is not desirable for a target domain containing both labeled and unlabeled data. In this paper, we improve the Laplacian eigenmaps algorithm by using different methods to calculate the weights of labeled and unlabeled data. Intuitively, we make points inside the same category closer to each other than points from different categories.
To construct a relative neighborhood graph, we use an unsupervised measure (the Euclidean distance) to calculate the distance between unlabeled data points. Meanwhile, we use a supervised distance for the labeled data, which is defined as follows:
$$ D(x_{i},x_{j})= \left\{ \begin{aligned} &\sqrt{1-\exp\left(-d^{2}(x_{i},x_{j})/\beta\right)}\qquad & c_{i} = c_{j}\\ &\sqrt{\exp\left(d^{2}(x_{i},x_{j})/\beta\right)} \qquad & c_{i} \neq c_{j} \end{aligned}\right. $$
(9)
where \(c_{i}\) and \(c_{j}\) are the categories of the samples \(x_{i}\) and \(x_{j}\), respectively, and \(d(x_{i},x_{j})\) is the Euclidean distance between \(x_{i}\) and \(x_{j}\). The parameter \(\beta\) prevents \(D(x_{i},x_{j})\) from becoming too large when \(d(x_{i},x_{j})\) grows, which effectively controls the noise. If the distance between sample points \(x_{i}\) and \(x_{j}\) is smaller than the threshold \(\varepsilon\), the two points are neighbors.
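As an illustration, the distance of Eq. (9), together with the Euclidean fallback for unlabeled points, might be implemented as follows; the function name and the default value of β are our own assumptions.

```python
import numpy as np

def pair_distance(xi, xj, ci=None, cj=None, beta=1.0):
    """Distance of Eq. (9): supervised form when both labels are known,
    plain Euclidean distance otherwise (the unlabeled case)."""
    d2 = np.sum((xi - xj) ** 2)                       # squared Euclidean distance
    if ci is None or cj is None:
        return np.sqrt(d2)                            # unlabeled: unsupervised distance
    if ci == cj:
        return np.sqrt(1.0 - np.exp(-d2 / beta))      # same category: shrink the distance
    return np.sqrt(np.exp(d2 / beta))                 # different categories: stretch it

# x_i and x_j are neighbors if pair_distance(xi, xj, ci, cj) < eps for a chosen threshold eps.
```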
Furthermore, the weight matrix W can be calculated, where \(W_{ij}=1\) if \(x_{i}\) and \(x_{j}\) are neighbors and \(W_{ij}=0\) otherwise. The generalized Laplacian eigenvectors can then be obtained by solving the following problem:
$$ \min \sum_{i,j} \| Y_{i} - Y_{j} \|^{2} w_{ij} \qquad\quad\qquad \mathrm{s.t.} \quad Y^{T} D Y = I $$
(10)
where D is the diagonal degree matrix whose entries are the row sums of W. With the improved Laplacian eigenmaps algorithm, we can map each high-dimensional node into a low-dimensional space, so that the data obtain a new feature representation.
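A minimal sketch of this embedding step is shown below, assuming the binary weight matrix W from above (with every node having at least one neighbor) and using SciPy's generalized eigensolver, which is our implementation choice rather than one prescribed by the paper.

```python
import numpy as np
from scipy.linalg import eigh

def laplacian_embedding(W, dim=2):
    """Solve the problem of Eq. (10): minimize sum_ij ||Y_i - Y_j||^2 w_ij
    subject to Y^T D Y = I, via the generalized eigenproblem L y = lambda D y."""
    D = np.diag(W.sum(axis=1))     # degree matrix (assumed nonsingular)
    L = D - W                      # graph Laplacian
    vals, vecs = eigh(L, D)        # generalized eigendecomposition, eigenvalues ascending
    return vecs[:, 1:dim + 1]      # drop the trivial constant eigenvector
```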
The target domain classification
After obtaining the new feature representation of the target data, we can classify the target domain using the mutual information criterion discussed in Section 3.2. This can be done with an existing classifier, such as an SVM. To better illustrate the framework, Fig. 1 shows the main steps of the entire AutoTL framework. The detailed AutoTL algorithm is as follows: