Trending Topic Extraction using Topic Models and Biterm Discrimination

Mining and exploiting data from social networks has been the focus of many efforts, but despite the resources and energy invested, much remains to be done given the complexity of the problem, which requires a multidisciplinary approach. Specifically, regarding this research, the content of the texts published regularly, and at a very rapid pace, on microblogging sites (e.g. Twitter.com) can be used to analyze global and local trends. These trends are marked by emerging topics that are distinguished from others by a sudden and accelerated rate of posts related to the same subject; in other words, by an increase in popularity over relatively short periods, a day or a few hours, for example (Wanner et al.). The problem, then, is twofold: first to extract the topics, and then to identify which of those topics are trending. A recent solution, the Bursty Biterm Topic Model (BBTM), is an algorithm for identifying trending topics with a good level of performance on Twitter, but it requires a large amount of computer processing. Hence, this research aims to determine whether it is possible to reduce the amount of processing required while obtaining equally good results. This reduction is carried out by discriminating the word co-occurrences (biterms) that BBTM uses to model trending topics. In contrast to our previous work, in this research we carry out a more complete and exhaustive set of experiments.


Introduction
With the rise of social networks such as Twitter, many efforts have been made to collect, mine and exploit the information contained in them [1]. The content of posts and comments appearing on these sites can be used to analyze trends in populations overall. The latter are marked by emerging issues that are distinguished from others by a sudden and accelerated rate of posts associated with the same topic; in other words, by increasing popularity over relatively short periods, a day or a few hours, for example (Wanner et al. [2]).
The automation of trend analysis is important for researchers, politicians and companies, since trends manifest the thoughts, beliefs, intentions, opinions and wishes of people [3,4]. For example, it is possible to follow news and observe its evolution over time [2,5]. Another use is to learn which products are popular and the opinions about them [6,7]. Trend analysis includes two subproblems: identifying topics and identifying which of them are trending. The use of short texts, such as those present in social networks like Facebook or Twitter, adds a third problem: the sparsity of words per document makes it difficult to gather the statistics needed for trend identification [8]. Therefore, many algorithms for identifying emerging topics in social networks require complicated post-processing. Bursty Biterm Topic Model (BBTM) [9] is an algorithm for trending topic identification that can identify trending topics in short texts without post-processing. In addition, BBTM obtains results above state-of-the-art methods such as Twevent [10], OLDA [11], and UTM [12].
BBTM is a topic model, a widely used technique for topic extraction in text collections [8]. This technique consists of a probabilistic model that finds related terms which identify topics in a collection of texts. However, to achieve its advantages, instead of using single terms, BBTM associates each word with the others present in the same document (e.g. on Twitter a tweet is considered a document). These associations of words are named biterms. The use of biterms helps the probabilistic model to obtain better results with the short texts on Twitter, but also increases the amount of data to process: a document with n distinct terms yields n(n − 1)/2 biterms, so, for example, a day of Twitter with 73400 different terms and an average of 5.21 different terms per document produces a very large number of biterms [13]. Experiments carried out on a personal computer with a 2.6 GHz Intel Core i5 and 8 GB of memory show that processing that quantity of biterms can take about two hours. Besides, most of the memory needed by BBTM is for storing the biterms and their probability of being part of a trending topic.
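The quadratic growth of the biterm count can be made concrete with a short sketch (illustrative Python; the per-tweet figure is the one quoted above, while the corpus size is a hypothetical assumption):

```python
def biterm_count(n_terms: int) -> int:
    """Biterms in one document with n_terms distinct terms: n choose 2."""
    return n_terms * (n_terms - 1) // 2

# A tweet with 5 distinct terms yields 10 biterms.
print(biterm_count(5))  # -> 10

# Corpus-level estimate with an average of 5.21 distinct terms per tweet
# over a hypothetical one million tweets:
avg_terms = 5.21
n_tweets = 1_000_000
print(round(avg_terms * (avg_terms - 1) / 2 * n_tweets))  # -> 10967050
```

Even at this modest scale the sampler must visit on the order of ten million biterms per pass, which is why reducing the term set pays off.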
Therefore, reducing the number of terms that BBTM has to process is important to decrease the number of biterms and, consequently, the processing time and memory required. These reductions are a significant advantage, since social networks typically generate a high volume of data in a single day.
Xia et al. [14] proved that it is possible to approximate the most relevant terms for identifying topics in a collection of short texts. This implies that there is a set of useless terms that is processed all the same. However, they used BTM [13], an earlier version of BBTM that extracts all topics instead of identifying trends. For this reason, it is necessary to study the effects of term discrimination on BBTM.
In addition to the possible reduction of computational resources, a decrease in biterms could affect the quality of the results. Our hypothesis is that with fewer noisy terms, the choice of biterms for each topic would be better.
This article focuses on evaluating the combination of term discrimination and BBTM as a method to reduce the number of biterms processed. To perform this discrimination, we propose to create a graph from the co-occurrence of terms and apply the method introduced by Shetty and Rey [15] to find influential nodes in a graph. BBTM then runs using mainly the most relevant terms.
The rest of this article is organized as follows: we start with an overview of related work. Then we briefly describe the BBTM algorithm. Next, we introduce the proposed method for term discrimination. Then the experiments and their results are presented. Finally, conclusions and future work are in the last section.
Related work

Topic models for long texts

BBTM is part of a family of methods called topic models. These methods exploit the semantic structure that is implicit in texts to model the topics in them. Topic models were originally created to extract topics from long texts, such as news articles. For example, Latent Dirichlet Allocation (LDA) [16] is a topic model widely used due to its ability to be extended with new features.

Topic models for short texts
The main problem regarding short texts is the sparsity of words. This causes methods designed for long texts to fail, due to insufficient co-occurrence of words when they look for similarities between texts in order to identify topics [17]. Hence, there are different solutions for short text processing. One approach has been the use of external data to enrich their interpretation. For example, in [17] external collections of documents are used to learn topics with LDA, and those topics are then used to help classify short texts.
A similar idea is used in the Dual Latent Dirichlet Allocation Model (DLDA) proposed by Jin et al. [18]. This is a version of LDA that learns topics from collections of long documents and collections of short texts together, allowing it to take advantage of the information in long texts to classify the short texts.
On the other hand, Biterm Topic Model (BTM), presented in Cheng et al. [13], uses a different approach. In order to achieve good results when processing short texts, BTM defines a model based on word correlations capable of addressing the problem without pre- or post-processing. Consequently, other works have extended BTM to cover different tasks. For example, in Zhu et al. [19] the evolution of topics over time in microblogs (such as Twitter) is modeled using BTM.

Trending topics extraction
The task of finding trending topics has been addressed in different ways. In Mathioudakis and Koudas [20] a system called TwitterMonitor is proposed. This system detects trending topics by identifying emerging keywords on Twitter and grouping them together. In Cataldi et al. [21] trending topics are also detected on Twitter, but by modeling the life cycle of terms to determine the most frequent ones in a specific time interval. Finally, Twevent [10] is a system used to detect events on Twitter. It identifies frequent tweet segments within a specific time window.
An important related work is the discriminative Biterm Topic Model (d-BTM) [14]. This method classifies news on Twitter using BTM as the algorithm to identify trending topics, grouping the tweets according to the topics found. However, before running BTM, terms are discriminated to extract those that are representative. Thus, BTM forms the biterms with the terms most indicative of news.
On the other hand, Bursty Biterm Topic Model (BBTM) [9] is a model to identify trending topics that is also based on BTM. To perform this task, BBTM uses the explosive popularity of biterms (burstiness) as information during the topic modeling process. Compared to previous methods for detecting trending topics, BBTM has the following advantages:
1. Because it is based on BTM, it models short texts effectively, overcoming the problem of word sparsity.
2. Because it incorporates information about the sudden popularity of terms, BBTM can identify trending topics efficiently without heuristics or post-processing.
A brief explanation of the general operation of the BBTM algorithm is presented below.

Bursty Biterm Topic Model
BBTM models the entire collection of short texts as a single document formed by a mixture of topics. Each of these topics is a probability distribution over words. The terms related to trends are emphasized and distributed among different topics, while terms about common subjects, such as daily life or chatting, are assigned to a single background topic [9]. Figure 1 shows an example of this model: from a collection of tweets, BBTM extracts the trending topics represented by their keywords, while the rest of the common-use words go into a unique background topic not included in the results. To find the relationship between topics and terms, sufficient samples of word co-occurrence patterns are necessary. Therefore, if we have short texts and we take each one as an independent document, we run into the problem of sparse patterns at the document level. BBTM uses the following two strategies to overcome this problem:
1. Use biterms instead of terms: a biterm is an unordered pair of words that co-occur in the same text. Since a topic is a group of correlated terms, the use of biterms models the co-occurrence of two words explicitly [9,13]. Figure 2 shows an example of how biterms are generated from a tweet.
2. Use the complete collection of texts as a single document: by modeling the co-occurrence of words at the corpus level, BBTM avoids the sparse-patterns problem; in this way the length of the texts does not affect the results [9,13].
The steps of BBTM, as can be seen in Figure 3, start with the creation of biterms. The second step is to calculate the probability of each biterm being relevant to some trending topic. If a biterm is suddenly popular (bursty behavior) in contrast to its standard usage in the past, that biterm is probably part of a trend [9]. Thus, the probability η_b of a biterm being generated from a trending topic is calculated as follows:

η_b = max((n_b − n̄_b) / n_b, ε)    (1)

where n_b is the biterm frequency within a given period of time (one day, for example) and n̄_b is the average of the frequencies of that biterm in prior periods (the previous 10 days, for example). The ε is to avoid zero probability; according to [9], 0.01 is a good value.
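These first two steps can be sketched in a few lines of illustrative Python, assuming the burstiness ratio η_b = max((n_b − n̄_b)/n_b, ε) described above (the function names and the example tweet are ours):

```python
from itertools import combinations

def biterms(tokens):
    """All unordered pairs of distinct terms co-occurring in one tweet."""
    return list(combinations(sorted(set(tokens)), 2))

def bursty_prob(n_b, n_b_avg, eps=0.01):
    """Probability of a biterm being generated by a trending topic:
    the relative growth of the current frequency n_b over its historical
    average n_b_avg, floored at eps to avoid zero probabilities."""
    return max((n_b - n_b_avg) / n_b, eps)

print(biterms(["obama", "visits", "cuba"]))
# -> [('cuba', 'obama'), ('cuba', 'visits'), ('obama', 'visits')]
print(round(bursty_prob(183, 55.5), 2))  # -> 0.7
print(bursty_prob(40, 120))              # -> 0.01 (declining biterm)
```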
For example, suppose that on day 0 the biterms b_1 and b_2 have frequencies of 71 and 15 respectively; on day 1 the frequencies are 40 and 17, and on day 2 they are 183 and 28. The averages of the frequencies in the prior periods are (71 + 40)/2 = 55.5 and (15 + 17)/2 = 16. Finally, the probabilities of these biterms being part of a trending topic on day 2 can be calculated: η_b1 = (183 − 55.5)/183 ≈ 0.70 and η_b2 = (28 − 16)/28 ≈ 0.43.
After forming the biterms and calculating their probability of being part of a trend, the third step is to process these data to extract the trending topics. BBTM tries to model how the texts were generated from topics. This generative process is defined as follows [9]:
1. For the collection:
(a) Draw a trending topic distribution for the collection of texts from a Dirichlet distribution: θ ∼ Dir(α), where α is a Dirichlet hyperparameter.
(b) Draw a background (non-trending) word distribution from a Dirichlet distribution: φ_0 ∼ Dir(β), where β is a Dirichlet hyperparameter.
2. For each trending topic z ∈ [1, K]:
(a) Draw a word distribution from a Dirichlet distribution: φ_z ∼ Dir(β).
3. For each biterm b_i ∈ B:
(a) Draw a binary variable that indicates whether the biterm is observed with normal use or with trending behavior, using a Bernoulli distribution: e_i ∼ Bern(η_{b_i}), where η_{b_i} is the probability defined in Equation 1.
(b) If the biterm has normal use, e_i = 0: draw two words from the background distribution: w_{i,1}, w_{i,2} ∼ Multi(φ_0).
(c) If the biterm has trending behavior, e_i = 1: draw a trending topic z ∼ Multi(θ), then draw two words w_{i,1}, w_{i,2} ∼ Multi(φ_z).
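The generative process above can be simulated directly. The sketch below uses only the standard library (Dirichlet draws via normalized Gamma samples); the toy vocabulary and the η value passed at the end are hypothetical:

```python
import random

random.seed(7)
VOCAB = ["obama", "cuba", "visit", "lol", "lunch"]  # toy vocabulary
K = 2                                               # number of trending topics
alpha, beta = 50 / K, 0.01

def dirichlet(dim, conc):
    """Symmetric Dirichlet sample via normalized Gamma draws."""
    g = [random.gammavariate(conc, 1.0) for _ in range(dim)]
    s = sum(g)
    return [x / s for x in g]

def categorical(probs, items):
    """Draw one item according to the given probabilities."""
    r, acc = random.random(), 0.0
    for p, item in zip(probs, items):
        acc += p
        if r < acc:
            return item
    return items[-1]

theta = dirichlet(K, alpha)                            # step 1(a)
phi_0 = dirichlet(len(VOCAB), beta)                    # step 1(b)
phi = [dirichlet(len(VOCAB), beta) for _ in range(K)]  # step 2

def draw_biterm(eta_b):
    """Step 3: draw e ~ Bern(eta_b), then two words from phi_0 or phi_z."""
    e = 1 if random.random() < eta_b else 0
    dist = phi[categorical(theta, range(K))] if e == 1 else phi_0
    return e, (categorical(dist, VOCAB), categorical(dist, VOCAB))

print(draw_biterm(0.7))
```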
Since the Dirichlet and multinomial distributions are conjugate to each other, the computation of their combination becomes simpler because the result is another Dirichlet distribution. Therefore, the topic z and the words w_{i,1}, w_{i,2} are drawn using a multinomial distribution that takes as parameter the Dirichlet used as the prior probability at the beginning [13].
In the above process the parameters that need to be estimated are Θ = {φ_0, φ_1, ..., φ_K, θ}. If the hyperparameters are given, the likelihood of the complete biterm set B is defined as follows [9]:

P(B | α, β, η) = ∫∫ Π_{i=1}^{|B|} [(1 − η_{b_i}) φ_{0,w_{i,1}} φ_{0,w_{i,2}} + η_{b_i} Σ_{z=1}^{K} θ_z φ_{z,w_{i,1}} φ_{z,w_{i,2}}] p(θ | α) p(Φ | β) dθ dΦ    (2)

The problem with Equation 2 is that it results in an intractable integral [9]. For that reason, to approximate the value of Θ, BBTM uses the collapsed Gibbs sampling algorithm.
The Gibbs algorithm draws samples from the posterior distribution of the latent variables, the topic z and the biterm behavior e, sequentially conditioned on the current values of all other variables [9]. This conditional distribution is calculated jointly for both variables:

P(e_i = 0 | e_¬i, z_¬i, B) ∝ (1 − η_{b_i}) (n_{0,w_{i,1}} + β)(n_{0,w_{i,2}} + β) / ((n_0 + Wβ)(n_0 + 1 + Wβ))    (3)

P(e_i = 1, z_i = z | e_¬i, z_¬i, B) ∝ η_{b_i} (n_{z_b} + α)(n_{z,w_{i,1}} + β)(n_{z,w_{i,2}} + β) / ((n_z + Wβ)(n_z + 1 + Wβ))    (4)

where:
e_i: indicates the behavior (normal or trending) of biterm i.
z_i: trending topic assigned to biterm i.
B: total number of biterms.
W: size of the vocabulary.
α, β: hyperparameters of the Dirichlet distributions.
η: probability of a biterm being part of a trending topic.
n_{0,w}: number of times the word w is assigned to normal use.
n_0: total number of words assigned to normal use.
n_{z_b}: total number of biterms assigned to the trending topic z.
n_b: total number of biterms assigned to trending topics.
n_{z,w}: number of times the word w is assigned to the trending topic z.
n_z: total number of words assigned to the trending topic z.
¬i: indicates that the current biterm b_i is excluded from the counts.

The Gibbs sampling algorithm consists in applying Equations 3 and 4 for a sufficient number of repetitions to obtain the data needed to approximate the parameters φ_0, φ_1...φ_K, θ using Equations 5 and 6:

φ_{z,w} = (n_{z,w} + β) / (n_z + Wβ),   φ_{0,w} = (n_{0,w} + β) / (n_0 + Wβ)    (5)

θ_z = (n_{z_b} + α) / (n_b + Kα)    (6)
Input: K, α, β, B
Output: φ_0, φ_1...φ_K, θ
Randomly initialize e and z
for iter = 1 to N_iter do
    foreach b_i = (w_{i,1}, w_{i,2}) ∈ B do
        Draw e and z from Equations 3 and 4
        if e = 0 then
            Update the variables n_{0,w_{i,1}} and n_{0,w_{i,2}}
        else
            Update the variables n_{z_b}, n_{z,w_{i,1}} and n_{z,w_{i,2}}
        end
    end
end
Estimate the parameters φ_0, φ_1...φ_K, θ using Equations 5 and 6

Algorithm 1: Gibbs sampling algorithm for BBTM [9]

Finally, the fourth step (Figure 3) is to list the biterms with the highest probability of belonging to each topic. The term discrimination should ensure that the words with the highest probability of being in one of these topics are kept. In this investigation, we evaluate whether the results remain equal to the original, get better with words more relevant to each subject, or get worse by mixing words with little relation among them.
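Algorithm 1 can be sketched as follows. This is a simplified, illustrative implementation: the conditional weights follow the general Dirichlet-multinomial form of the sampler in [9], but the exact expressions, and the toy data at the end, are our own assumptions:

```python
import random
from collections import defaultdict

def bbtm_gibbs(biterms, eta, vocab, K, alpha, beta, n_iter=50, seed=1):
    """Simplified collapsed Gibbs sampler for BBTM (sketch of Algorithm 1).
    biterms: list of (w1, w2) pairs; eta[i]: bursty probability of biterm i."""
    rng = random.Random(seed)
    W = len(vocab)
    n0w = defaultdict(int); n0 = 0              # background word counts
    nzw = [defaultdict(int) for _ in range(K)]  # per-topic word counts
    nz = [0] * K                                # words assigned to each topic
    nzb = [0] * K                               # biterms assigned to each topic
    e = [rng.randrange(2) for _ in biterms]     # random initialization
    z = [rng.randrange(K) for _ in biterms]
    for i, (w1, w2) in enumerate(biterms):
        if e[i] == 0:
            n0w[w1] += 1; n0w[w2] += 1; n0 += 2
        else:
            k = z[i]; nzb[k] += 1; nzw[k][w1] += 1; nzw[k][w2] += 1; nz[k] += 2
    for _ in range(n_iter):
        for i, (w1, w2) in enumerate(biterms):
            # remove the current assignment of biterm i from the counts
            if e[i] == 0:
                n0w[w1] -= 1; n0w[w2] -= 1; n0 -= 2
            else:
                k = z[i]; nzb[k] -= 1; nzw[k][w1] -= 1; nzw[k][w2] -= 1; nz[k] -= 2
            # weight for normal use (e=0), then one weight per topic (e=1)
            p = [(1 - eta[i]) * (n0w[w1] + beta) * (n0w[w2] + beta)
                 / ((n0 + W * beta) * (n0 + 1 + W * beta))]
            for k in range(K):
                p.append(eta[i] * (nzb[k] + alpha)
                         * (nzw[k][w1] + beta) * (nzw[k][w2] + beta)
                         / ((nz[k] + W * beta) * (nz[k] + 1 + W * beta)))
            # draw the new assignment proportionally to the weights
            r, acc = rng.random() * sum(p), 0.0
            for c, pc in enumerate(p):
                acc += pc
                if r < acc:
                    break
            if c == 0:
                e[i] = 0; n0w[w1] += 1; n0w[w2] += 1; n0 += 2
            else:
                e[i] = 1; z[i] = k = c - 1
                nzb[k] += 1; nzw[k][w1] += 1; nzw[k][w2] += 1; nz[k] += 2
    # parameter estimates from the final counts (cf. Equations 5 and 6)
    theta = [(nzb[k] + alpha) / (sum(nzb) + K * alpha) for k in range(K)]
    phi = [{w: (nzw[k][w] + beta) / (nz[k] + W * beta) for w in vocab}
           for k in range(K)]
    return theta, phi

vocab = ["obama", "cuba", "visit", "lol"]
bs = [("obama", "cuba"), ("obama", "visit"), ("cuba", "visit"), ("cuba", "lol")]
theta, phi = bbtm_gibbs(bs, eta=[0.9, 0.8, 0.7, 0.05], vocab=vocab,
                        K=2, alpha=2.5, beta=0.01)
print([round(t, 2) for t in theta])
```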

Biterms discrimination method
We propose a method to select the terms most representative for identifying trending topics. The idea is that BBTM uses the most useful biterms and avoids processing noisy and background biterms. Before applying the discrimination we need to form the biterms and obtain their probabilities of sudden popularity; then we form a shorter new list of biterms. Thus, it is an extra step right before the biterm processing in BBTM.
The proposed method has three stages: graph creation, important nodes detection, and selection of terms. During the graph creation stage we link the terms among themselves. In the important nodes detection stage, the importance or influence of each term within the graph is calculated. Finally, the biterm selection stage chooses the most important terms to be considered key terms and selects the biterms related to them.

Graph creation
We create an undirected graph from the biterms. Each term can be taken as a node, with edges or links to the terms that form biterms with it. An example is shown in Figure 4. If a graph is created from all the biterms found in a set consisting of a considerable number of texts, the resulting graph will be of considerable size too. Consequently, performing calculations on that graph would be computationally expensive.
For the above reason, it is preferable to construct the graph from biterms with some probability of being trending, in this case those where η_b > ε. This reduces the graph to a computationally feasible size. Once the graph is built, the next step is to detect its influential nodes.
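A sketch of this construction (the example biterms and their bursty probabilities are illustrative):

```python
from collections import defaultdict

def build_term_graph(biterm_probs, eps=0.01):
    """Undirected term graph from biterms whose bursty probability
    eta_b exceeds eps; nodes are terms, edges connect co-occurring terms."""
    adj = defaultdict(set)
    for (w1, w2), eta in biterm_probs.items():
        if eta > eps:
            adj[w1].add(w2)
            adj[w2].add(w1)
    return dict(adj)

graph = build_term_graph({
    ("obama", "cuba"): 0.70,
    ("obama", "visit"): 0.43,
    ("lol", "lunch"): 0.01,   # background chatter: filtered out
})
print(sorted(graph["obama"]))  # -> ['cuba', 'visit']
print("lol" in graph)          # -> False
```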

Important nodes detection
To select the most influential nodes we chose the algorithm introduced by Shetty and Rey [15]. This technique considers that the most important nodes are those with the greatest effect on the graph entropy when they are eliminated [15]. Therefore, first we must define the calculation of entropy for a graph.
In terms of graphs, a high entropy indicates that many edges are equally important, while a low entropy indicates that only a few edges are relevant [22]. It is then possible to calculate the entropy of the graph as follows [22]:

H(G) = −Σ_{v∈V} p(v) log p(v)    (7)

where V is the set of graph nodes. The probability p(v) can be calculated by:

p(v) = |A_v| / |E|    (8)

where |A_v| is the number of edges of the node v and |E| is the total number of edges of the graph. This measure of entropy follows the classical definition of graph complexity based on the symmetries of the graph [23,24,25]. Here the value of p(v) is the probability of the node v within the graph. Therefore, this entropy measure is close to the "information content" in Shannon entropy [25]. We use this definition because we are interested in the structural information content of the graph: how influential nodes affect other nodes and the weight of that influence in the whole graph. For that reason, other graph complexity measures, such as Körner entropy [26], are not suitable because they do not express the structural information.
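This entropy translates directly into code, using p(v) = |A_v|/|E| as defined above (the two toy graphs are illustrative):

```python
import math

def graph_entropy(adj):
    """H(G) = -sum over nodes of p(v) * log p(v), with p(v) = deg(v)/|E|."""
    n_edges = sum(len(nbrs) for nbrs in adj.values()) / 2
    return -sum((len(nbrs) / n_edges) * math.log(len(nbrs) / n_edges)
                for nbrs in adj.values() if nbrs)

# A star concentrates the edges on its hub; a cycle spreads them evenly,
# so the cycle's edges are "equally important" and its entropy is higher.
star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
cycle = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "a"}}
print(graph_entropy(star) < graph_entropy(cycle))  # -> True
```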
Once the entropy has been defined, the next step is to detect the nodes with the greatest effect on this entropy. To do this, we perform the following algorithm proposed in Shetty and Rey [15]:

Input: Set of nodes N.
Output: Importance of each node in N.
1) foreach node N(i) ∈ N do
    a) Compute the entropy of node N(i), calculating the entropy of the node along with all its edges, as E(i).
    b) Temporarily drop the node N(i) from the main graph and calculate the entropy of the remaining graph as E_N(i).
    c) Calculate the importance of the node on the graph with Equation 9.
end
2) Sort the nodes by the importance(i) obtained.

Algorithm 2: Algorithm to calculate the importance of the nodes of a graph [15]

Equation 9, introduced in [15], is a centrality measure based on the effect of eliminating a node on the entropy of a graph. The basic idea is that most of the information in the graph will be around the key terms of trending topics. This means that it is more probable to reach any of the nodes related to trends, with less probability left for the less relevant nodes, forming in this way a non-uniform graph. If an influential node is deleted, the rest of the graph becomes more uniformly probable, increasing the entropy. But if a less relevant node is removed, the graph remains non-uniform. Thus, the greater the effect on the entropy, the more important the node.
When we eliminate the node in step 1b, we can also remove the nodes directly connected to it. In this way, we widen the scope of influence of nearby nodes [15]. Figure 5 shows an example of this expansion of scope when the node is removed. An entropy model of length 1 means deleting only the node, while with length 2 the nodes connected to it are also removed.
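The node-importance computation can be sketched as follows, with entropy-model length 1 (only the node itself is dropped). Here we take Equation 9 to be the absolute change in graph entropy, which is an assumption on our part:

```python
import math

def graph_entropy(adj):
    """H(G) with p(v) = deg(v)/|E|; an edgeless graph has zero entropy."""
    n_edges = sum(len(nbrs) for nbrs in adj.values()) / 2
    if n_edges == 0:
        return 0.0
    return -sum((len(nbrs) / n_edges) * math.log(len(nbrs) / n_edges)
                for nbrs in adj.values() if nbrs)

def drop_node(adj, v):
    """Copy of the graph without node v and its incident edges."""
    return {u: nbrs - {v} for u, nbrs in adj.items() if u != v}

def node_importance(adj):
    """importance(i) = |H(G) - H(G without node i)| for every node."""
    h = graph_entropy(adj)
    return {v: abs(h - graph_entropy(drop_node(adj, v))) for v in adj}

star = {"hub": {"a", "b", "c"}, "a": {"hub"}, "b": {"hub"}, "c": {"hub"}}
scores = node_importance(star)
print(max(scores, key=scores.get))  # -> hub
```

Deleting the hub disconnects the star entirely, so it changes the entropy far more than deleting a leaf.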

Selection of biterms
The selection of terms is performed based on the degree of importance obtained in the process described above. We define a threshold to identify the most important terms. Any term with importance equal to or greater than the threshold is classified as very important. Those with a lower value are taken as terms with a low probability of appearing as keywords in any topic. The terms not included in the graph are assigned an importance score of 0. With the selection of important terms we can filter the biterms to keep the most relevant ones. This process takes as input the list of terms T with their importance I_T and a threshold µ, and outputs a new list of biterms. In summary, for terms with importance below the threshold but different from zero, we keep only the biterm formed by connecting the term with the word with the highest probability of correlation. For the rest of the terms, with importance greater than or equal to the threshold, we keep all the biterms formed with them, except those involving terms with importance between zero and the threshold.
The probability of correlation can be obtained using NPMI (Normalized Pointwise Mutual Information) [27]. To avoid the problem NPMI has with pairs of words of low frequency, we only take into account biterms with frequencies greater than 5. This value was chosen after some preliminary experiments.
The reason for keeping some biterms formed by terms with low importance is that those biterms help to model the topics correctly. This was shown in [9], where modeling using only biterms with a high probability of being trending produced results inferior to those obtained with the complete list of biterms. The concept behind biterm discrimination is to remove many noisy biterms while keeping enough data to model the topics properly.
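A sketch of this selection step, under our reading of the rule above. Here `best_partner` maps each low-importance term to its most NPMI-correlated co-occurring term (precomputed; both helpers and the example values are hypothetical):

```python
import math

def npmi(p_w1, p_w2, p_joint):
    """Normalized pointwise mutual information, in [-1, 1]."""
    return math.log(p_joint / (p_w1 * p_w2)) / -math.log(p_joint)

def select_biterms(biterms, imp, mu, best_partner):
    """Keep biterms of important terms (importance >= mu), except pairs
    with a low-importance partner; a term with importance in (0, mu)
    keeps only its single best-correlated biterm."""
    kept = []
    for w1, w2 in biterms:
        i1, i2 = imp.get(w1, 0.0), imp.get(w2, 0.0)
        imp1_ok = i1 >= mu and not 0 < i2 < mu
        imp2_ok = i2 >= mu and not 0 < i1 < mu
        best1 = 0 < i1 < mu and best_partner.get(w1) == w2
        best2 = 0 < i2 < mu and best_partner.get(w2) == w1
        if imp1_ok or imp2_ok or best1 or best2:
            kept.append((w1, w2))
    return kept

imp = {"obama": 2.0, "cuba": 1.5, "visit": 0.5}   # "lol" is not in the graph
bs = [("obama", "cuba"), ("obama", "visit"), ("obama", "lol"), ("visit", "lol")]
print(select_biterms(bs, imp, mu=1.0, best_partner={"visit": "obama"}))
# -> [('obama', 'cuba'), ('obama', 'visit'), ('obama', 'lol')]
print(round(npmi(0.1, 0.1, 0.1), 2))  # perfectly correlated pair -> 1.0
```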
Experiments

In this section, we describe our experiments, carried out on a short text collection of real user data to prove the efficacy of our proposed method. We take BBTM without term discrimination as our baseline. To differentiate both systems, we will refer to the original BBTM without term discrimination as single BBTM, and to our method as BBTM with term discrimination.
The experiments were executed on a MacBook Pro with a 2.6 GHz Intel Core i5 CPU and 8 GB of memory. We used the original version of BBTM, which is available for download1. The term discrimination method was developed as an independent preprocessing step for BBTM, implemented in Java 8.

Dataset
Experiments were conducted on one week taken from the Tweets2011 corpus, a TREC collection published in the 2011 microblog track. This portion of the dataset contains 9341618 tweets sampled from Jan. 23 to Jan. 29. To reduce the number of low-quality tweets, we applied the same steps as in [9] for BBTM:
1. Delete tweets with non-Latin characters.
2. Convert all letters to lowercase.
3. Delete stopwords. The corpus includes tweets in several languages; stopwords in English, Spanish, German, Dutch, Indonesian, Portuguese, Brazilian Portuguese, and French were removed.
4. Delete the 100 most frequent terms, which are common words that are meaningless on Twitter.
5. Delete terms with document frequency less than 10.
6. Filter out tweets with length less than 2.
After applying the above steps, we were left with a total of 2421650 tweets and an average of 345950 tweets per day. Next, we took 10 random samples of each day. The size of these subsamples is approximately half of the complete day sample. Both single BBTM and BBTM with term discrimination were run 10 times, using a different sample for each run.

Evaluation metrics
We evaluate the proposed method of term discrimination with the following metrics:
1. Recall: the fraction of terms that single BBTM retrieves as key terms of a topic that are also retrieved by BBTM with term discrimination as key terms of some topic. If A is the result set of key terms retrieved by BBTM with term discrimination, and B is the result set obtained by single BBTM, then the recall is calculated as follows:

Recall = |A ∩ B| / |B|

In this context, recall indicates the degree to which a topic retrieved by BBTM with biterm discrimination resembles its equivalent retrieved by single BBTM. We take the first 20 words from both algorithms to evaluate the recall.
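In code (the key-term sets below are hypothetical):

```python
def topic_recall(discriminated, single):
    """Fraction of single-BBTM key terms also found by the
    discriminated run: |A intersect B| / |B|."""
    a, b = set(discriminated), set(single)
    return len(a & b) / len(b)

single_bbtm = ["obama", "cuba", "visit", "trip"]
with_discrimination = ["obama", "cuba", "visit", "news"]
print(topic_recall(with_discrimination, single_bbtm))  # -> 0.75
```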

2. Processing time: we measure the average processing time of Gibbs sampling for each day. In addition, we take the average time of the biterm discrimination process.

3. Coherence: to evaluate the quality of the topics we use the coherence measure introduced in [28]. This metric measures whether the words selected for a topic appear in the same documents. In this way, if a topic has words unrelated to each other, its coherence will be low. This measure is calculated as follows:

C(t; V(t)) = Σ_{m=2}^{M} Σ_{l=1}^{m−1} log((D(w_m, w_l) + 1) / D(w_l))

where V(t) = (w_1, ..., w_M) is the list of the M words that form the topic t, D(w) is the document frequency of the word w, and D(w_1, w_2) is the co-document frequency of the words w_1 and w_2. The results of our proposed method and our baseline are given independently. Consequently, it is not possible to know with certainty which topics are equivalent between both results. For this reason, we apply the Jaccard index, precision, and recall against all the resulting topics and report the maximum value found. The latter represents the best attempt to match the same topic found by single BBTM.
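The coherence measure can be sketched as follows, representing each document as a set of words (the tiny document collection is hypothetical):

```python
import math

def coherence(topic_words, docs):
    """Topic coherence of [28]: sum over word pairs of
    log((D(w_m, w_l) + 1) / D(w_l)), where D is (co-)document frequency."""
    def d(*words):
        return sum(1 for doc in docs if all(w in doc for w in words))
    return sum(math.log((d(topic_words[m], topic_words[l]) + 1)
                        / d(topic_words[l]))
               for m in range(1, len(topic_words))
               for l in range(m))

docs = [{"obama", "cuba", "visit"}, {"obama", "cuba"}, {"cuba", "beach"}]
print(round(coherence(["cuba", "obama", "visit"], docs), 3))  # -> -0.405
```

Only the "visit"/"cuba" pair is penalized here, since "visit" co-occurs with "cuba" in just one of the three documents containing "cuba".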

Parameters setting
BBTM needs three main parameters: two hyperparameters, α and β, and the number of topics K. We use the same values employed in BBTM [9]: α = 50/K and β = 0.01. In our experiments, the value of K is fixed to 20 (in this investigation we are not evaluating the behavior with different values of K). The time slice is set to one day. For each day, the Gibbs sampling algorithm was run for 100 iterations. To define the threshold for selecting the important key terms during the term discrimination process, we apply standardization. The threshold values used are −1, −0.5 and 0.
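The standardization step can be sketched as z-scoring the importance values, so that thresholds such as −1, −0.5 and 0 are expressed in standard deviations from the mean importance (the scores below are hypothetical):

```python
import math

def standardize(scores):
    """Z-score standardization of term-importance scores."""
    vals = list(scores.values())
    mean = sum(vals) / len(vals)
    std = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
    return {term: (v - mean) / std for term, v in scores.items()}

importance = {"obama": 3.0, "cuba": 2.0, "visit": 1.0, "lol": 0.0}
z = standardize(importance)
print(sorted(t for t, s in z.items() if s >= 0))  # threshold 0 -> ['cuba', 'obama']
```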

Recall
First, we measure how well our method can retrieve results similar to single BBTM. The maximum average recall values obtained for each threshold tested are shown in Figures 6, 7 and 8, and in Tables 2, 3 and 4.

Processing time
We compared the processing time of single BBTM and BBTM with biterm discrimination using different threshold values. Table 5 shows the average processing time per day in minutes.

Conclusions and Future Work
We developed a method that can obtain trending topics using fewer biterms than the original version of BBTM. According to the results presented above, our method, in many cases, is capable of selecting many of the same words to describe each topic. In addition, as can be seen in Table 6, the quality of the topics is similar. This means that it is possible to reduce the amount of processing needed for trending topic analysis carried out by a topic model like BBTM. For future work, we would like to try different threshold values to separate important key terms. It would also be interesting to request different numbers of topics and explore the behavior of our method. Finally, it would be valuable to use another technique for term discrimination, for example TextRank, and compare it with the one used in this work.
Acknowledgments

We thank the Ministerio de Ciencia, Tecnología y Telecomunicaciones (MICITT) de la República de Costa Rica for the support provided to the research projects No. 745-B4048 and B6175.