Automatic parametrization of Support Vector Machines for short texts polarity detection

Information from social media is emerging as a valuable source for decision-making; unfortunately, the tools to turn these data into useful information still need work. Support Vector Machines are popular among researchers for polarity detection in short texts because of their good results, but optimizing the parameters used to train classification models is a complex and costly process. This article compares two algorithms for automated parameter optimization when creating classification models for polarity detection: the recently created Grey Wolf Optimizer and grid search, using the accuracy and F-score metrics.


Introduction
Since the creation of the World Wide Web in the 1990s, humanity has changed the way it collaborates, creates and shares information. The collective point of view often has more impact on others than an expert's opinion [1].
According to Cambria et al.: "Today millions of web-users express their opinions about many topics through blogs, wikis, fora, chats and social networks. For sectors such as e-commerce and e-tourism, it is very useful to automatically analyze the huge amount of social information available on the Web, but the extremely unstructured nature of these contents makes it a difficult task." [1] Automatic analysis of social text requires combining techniques from the fields of Natural Language Processing (NLP) and Machine Learning (ML) to identify, process and classify opinions according to different criteria. In particular, Support Vector Machines (SVM) are popular among researchers for classifying text according to its polarity (negative or positive) [2,3]. Notably, SVM performance depends greatly on parameter optimization [4]. That process is known as model selection [5] and uses one of three mechanisms: default values, expert criteria or automated selection [6]. This research compares two algorithms for automating parameter optimization of an SVM with an RBF kernel to identify the polarity of short texts in Spanish (taken from the Twitter social network). Those algorithms are grid search and the recently created Grey Wolf Optimizer (GWO). The differences in the way both algorithms explore the parameter space are easier to understand when displayed in a chart (see figure 1).

Figure 1: Differences between Grey Wolf Optimizer and grid search when exploring the parameter space. Light grey spots mark the results of the algorithms.

Sentiment analysis
Sentiment analysis, often known as opinion mining, is a research field concerned with analysing the opinions expressed in texts [7]. What others think of a person or product is important to us. We ask our friends for opinions and recommendations about products, politicians, and even home appliances. In this field, as in many others, the internet has dramatically changed the way we express our opinions through blogs, forums, discussion groups and social media. We are no longer limited to asking family and friends for their opinion about a specific item, and companies no longer need big focus groups or external consultants to obtain and understand users' opinions about a marketed product [7]. However, using these publicly available opinions is a complex process [8]: manually extracting, processing and visualizing this immense amount of data is prohibitively expensive, which makes it a challenging problem for the fields of Natural Language Processing (NLP) and Machine Learning (ML) [7].
It is important to note that opinion is a broad concept divided into two types:
Direct opinion: can be defined as a quintuple $(e_j, a_{jk}, so_{ijkl}, h_i, t_l)$ where:
• $e_j$ is the entity upon which the opinion is given.
• $a_{jk}$ is the aspect or specific feature of the entity.
• $so_{ijkl}$ is the orientation or polarity of the opinion on the specified feature. This can be positive, negative or neutral, with different opinion strengths.
• $h_i$ is the opinion holder.
• $t_l$ is the emission date.
Comparative opinion: expresses similarities or differences between two or more objects. This type is not used in this article [8].
Less formally, sentiment analysis can be divided into three subtasks: opinion extraction, polarity detection and opinion-subject relationship. Polarity detection presents particular challenges that need to be addressed [9].

Polarity detection
Polarity detection is an open research problem, considered a subtask of the sentiment analysis field, that focuses on evaluating texts and identifying whether they contain a negative, positive or neutral orientation [10].
This orientation, identified within a large document collection, can be useful for competitive analysis, brand management, market analysis, risk management and public opinion analysis for politicians [11].
Notably, three algorithms from the Machine Learning field that are used for text classification are applied to identify the polarity of a text. These algorithms are:
Naïve Bayes: a probabilistic classifier characterized by the assumption that the presence or absence of a particular feature is unrelated to that of any other feature.
Maximum entropy: predicts the class of an instance as a function of the independent variables.
Support Vector Machines: given a geometric space, the algorithm tries to find the best possible hyperplane to separate the different classes.

Support Vector Machines
The current formulation of Support Vector Machines (SVM), also known as soft margin support vector machines, was created in 1995 [13]. SVMs are supervised learning models [12] used for classification and regression analysis. In polarity detection, SVM is used for its capabilities as a text classifier: it divides the dataset into two categories by assigning a tag $y_i$ (see Equation 1) to each feature vector x.
The class separation is given by an unknown hyperplane, which is approximated from a training dataset (see Equation 2), where w is a weight vector, b a threshold value and x an instance of the training dataset (see Equation 3).
Given the approximate hyperplane function, the tag for an instance can be obtained from the sign of the result (see Equation 4).
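As a minimal sketch of how a tag is obtained from the sign of the hyperplane function, the following Python fragment assumes the usual linear form w · x + b and the tag set {-1, +1}. Both are assumptions, since the equations themselves are not reproduced here, and the function names and example values are purely illustrative.

```python
from typing import Sequence


def decision_value(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Value of the (assumed) hyperplane function w . x + b for one instance."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_tag(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Tag of the instance, taken from the sign of the decision value.

    Assumption: tags are +1 and -1, and a value of exactly 0 maps to +1.
    """
    return 1 if decision_value(w, b, x) >= 0 else -1


# Hypothetical weight vector and threshold, chosen only for illustration.
w_example = [1.0, -2.0]
b_example = 0.5
tag_a = predict_tag(w_example, b_example, [3.0, 1.0])  # 3 - 2 + 0.5 = 1.5
tag_b = predict_tag(w_example, b_example, [0.0, 1.0])  # -2 + 0.5 = -1.5
```

In this sketch the first instance falls on the positive side of the hyperplane and the second on the negative side.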
However, linear classifiers cannot handle datasets that are not linearly separable, which is why kernel functions need to be incorporated to make the classifier really useful. Some of the most popular kernel functions are mentioned in the next section; nonetheless, only the RBF kernel function is used in this research.

Kernel functions
Usually, to apply linear classifiers to non-linear data in an arbitrary space X, the data needs to be transformed to a higher-dimensional space Z where it can be classified using a linear classifier [14,15]. Even so, the transformation is quite expensive in terms of computational resources.
Alternatively, the kernel trick can be applied. The kernel trick, or kernel function [16,17], executes the essential classifier operation (a dot product) in the space Z without transforming the instances from one space to the other.
Every kernel function must meet Mercer's condition [18] to be valid. If a kernel function does not meet Mercer's condition, there is a risk of not finding a solution [19].
Common kernel functions, assuming u and v are feature vectors belonging to a space X, are [20]:
• Polynomial, where d is the degree of the kernel function.
• Radial Basis Function (RBF).
• Sigmoid function.
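The kernels listed above can be sketched in a few lines of Python. The formulas below follow the conventions documented for LIBSVM, which is an assumption about the exact form intended here; the parameter names gamma, coef0 and degree are likewise borrowed from that library.

```python
import math
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of two feature vectors of the same length."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    """(gamma * u.v + coef0) ** degree, where degree plays the role of d."""
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """exp(-gamma * ||u - v||^2); equals 1 when u and v coincide."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    """tanh(gamma * u.v + coef0)."""
    return math.tanh(gamma * dot(u, v) + coef0)
```

Each function returns the value that a dot product would take in the space Z, computed directly from vectors in X.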
Furthermore, even though kernel functions yield SVMs capable of classifying non-linear data, they introduce the difficulty of optimizing the kernel parameters [21], a problem also known as model selection [5] that directly impacts the performance of the classifier [21].
An SVM with an RBF kernel needs two configuration parameters: C, which is part of the SVM formulation and controls the influence of outlier instances on the final classifier, and gamma (γ), the only parameter of the RBF kernel, which changes the way the classes are separated by controlling the width of the kernel and therefore the shape of the separating boundary.

Parameter optimization
Different algorithms are used for parameter optimization. Grid search [22] is the easiest to implement, but it consumes a high amount of resources [21]. For this reason, algorithms with different approaches are constantly tested, for example genetic algorithms [23] and simulated annealing [24]. Other algorithms used for parameter selection can be seen in Table 1.

Grid search
The grid search algorithm is widely used by researchers [38,34,21,22,24,31,35] and is considered the default approach for tuning SVM parameters [38]. It is characterized by being exhaustive and providing high precision, but at a cost in time and computing resources. It consists of generating a matrix $A_{m \times n}$ where $a_{i,j} = (C_i, \gamma_j)$ and determining the accuracy value by 10-fold cross validation for each $a_{i,j}$. The pair with the highest accuracy value is selected as the best parameters. The values $C_i$ and $\gamma_j$ are generated using the step sizes $d_1$ and $d_2$, respectively. It is important to note that the values of $d_1$ and $d_2$ must keep the dimensions of the matrix low, to avoid a rise in the time needed to complete the calculations, while leaving the precision of the result unaffected.
According to [22], grid search presents the following advantages:
• Allows parallel execution.
• Has a complexity of $O(n^2)$.
• High precision.
Its disadvantage is that it needs a significant amount of calculation, which consumes valuable time and resources.
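The procedure can be sketched as follows, assuming the candidate values of C and γ are generated with fixed additive steps. The `score` callback is a hypothetical stand-in for "train an SVM with this pair and return its 10-fold cross-validation accuracy"; the toy scoring function is used only to make the sketch runnable.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def value_range(start: float, stop: float, step: float) -> List[float]:
    """Values start, start + step, ... that do not exceed stop."""
    values = []
    k = 0
    while start + k * step <= stop + 1e-12:  # small tolerance for float steps
        values.append(start + k * step)
        k += 1
    return values


def grid_search(
    c_values: Sequence[float],
    gamma_values: Sequence[float],
    score: Callable[[float, float], float],
) -> Tuple[Optional[Tuple[float, float]], float]:
    """Evaluate every (C, gamma) pair and keep the one with the highest score."""
    best_pair, best_score = None, float("-inf")
    for c in c_values:            # m candidate values of C
        for gamma in gamma_values:  # n candidate values of gamma
            current = score(c, gamma)
            if current > best_score:
                best_pair, best_score = (c, gamma), current
    return best_pair, best_score


def toy_score(c: float, gamma: float) -> float:
    """Hypothetical score with its maximum at C = 3, gamma = 0.5."""
    return -((c - 3.0) ** 2 + (gamma - 0.5) ** 2)


best_pair, best_score = grid_search(
    value_range(1.0, 5.0, 1.0), value_range(0.0, 1.0, 0.25), toy_score
)
```

Every cell of the m × n matrix costs one call to `score`, which is why the step sizes have such a direct effect on running time. The cells are independent of each other, which is what allows parallel execution.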

Swarm intelligence
Swarm intelligence, according to [39], is inspired by the behavior of animals in herds, flocks or colonies. Two fundamental concepts identify the algorithms belonging to this category: self-organization and task separation.
Self-organization refers to the individuals' capacity to evolve within a system without external stimulus. Task separation, on the other hand, corresponds to the simultaneous execution of simple tasks by different individuals.
Usually these algorithms do not follow a leader's commands or a global strategy plan; instead, their global behavior is determined by the agents' tasks.
The literature reviews in [40] and [41] list the algorithms belonging to this category. Even though applying these algorithms to the parameter optimization problem is straightforward, only a few have been applied to Support Vector Machines [42,30,29]. Notably, one of the newest algorithms, the Grey Wolf Optimizer [41], which is explained in some detail in the next section, has an interesting set of characteristics [43]:
• Simplicity.
• Flexibility.
• Good local optima avoidance.
• Simple implementation.
• Only two parameters to adjust behavior.
• Fast convergence.

Grey Wolf Optimizer
GWO has only two parameters: the number of wolves, which work as search agents, and the number of iterations, which tells the algorithm when to stop. The use of these parameters can be seen in the pseudo-code (see figure 4).
The concept of this algorithm is based on the hierarchical behavior of grey wolves (Canis lupus lupus) when hunting [48]. The hierarchy is made up of individuals located at different levels that fulfil specific functions within the pack [49]. The main wolf pack hierarchy levels used by the algorithm can be summarized as follows:
• Alpha wolf (α): current best parameter optimization.
• Beta wolf (β): second best parameter optimization.
• Delta wolf (δ): third best parameter optimization.
• Omega wolves (ω): work as search agents exploring the parameter space.
The wolf pack hunting process is described in [48]; it includes several actions performed by the wolves. In brief, these actions can be separated into three phases, which include tracking and approaching the prey and the attack towards the prey.
These phases are mathematically modeled by the algorithm and briefly explained as follows:

Encircling the prey:
To emulate wolf behavior when encircling the prey, equations were defined in which t indicates the current iteration and A and C are coefficient vectors used to balance exploration and exploitation, a process that can be seen in figure 2.

Figure 2: Hunting process, image taken from [41]

Hunting:
In nature, the alpha wolf is the first to approach the prey, followed by the beta. For the algorithm, however, the prey's position is unknown, so it needs to be approximated using the positions of the alpha, beta and delta wolves as described by Equations 11 and 12, where wolf is one of alpha, beta or delta.
Figure 3 shows the effect of the equations on the position of the search agents, or omega wolves. The search begins with the creation of a randomly placed population of omega wolves (search agents). After every iteration of the algorithm, the alpha, beta and delta wolves are moved to the best positions found by the omegas. Initially the search is driven by the A values, which disperse the wolves, supporting exploration and avoiding local optima.
Notably, C (see Equation 12) does not decrease linearly as A does; instead it always provides random values, encouraging exploration.
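The encircling and hunting steps can be sketched in Python as shown below. Since the equations are not reproduced in this text, the sketch assumes the update rules commonly given for GWO in [41]: A = 2·a·r1 − a and C = 2·r2 with r1, r2 random in [0, 1], a decreasing linearly from 2 to 0, D = |C·X_wolf − X|, and a new position equal to the mean of X_wolf − A·D over the three leaders. The function name, the clipping to bounds and the toy fitness are illustrative choices, not the implementation used in the experiment.

```python
import random
from typing import Callable, List, Sequence, Tuple


def gwo_maximize(
    fitness: Callable[[Sequence[float]], float],
    bounds: Sequence[Tuple[float, float]],
    n_wolves: int = 4,
    n_iterations: int = 14,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Sketch of GWO that maximizes `fitness` (e.g. cross-validation accuracy)."""
    rng = random.Random(seed)
    # Randomly placed population of search agents (omega wolves).
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    leaders: List[Tuple[float, List[float]]] = []  # (score, position), best first
    for t in range(n_iterations):
        # One fitness evaluation per wolf and iteration.
        for pos in wolves:
            leaders.append((fitness(pos), list(pos)))
        # Keep the three best solutions seen so far: alpha, beta and delta.
        leaders = sorted(leaders, key=lambda item: item[0], reverse=True)[:3]
        a = 2.0 - 2.0 * t / n_iterations  # assumed linear decrease from 2
        for pos in wolves:
            for d, (lo, hi) in enumerate(bounds):
                estimates = []
                for _, leader in leaders:
                    A = 2.0 * a * rng.random() - a   # shrinks with a
                    C = 2.0 * rng.random()           # always random in [0, 2]
                    D = abs(C * leader[d] - pos[d])
                    estimates.append(leader[d] - A * D)
                # New coordinate: mean of the leader-based estimates, kept in bounds.
                pos[d] = min(hi, max(lo, sum(estimates) / len(estimates)))
    best_score, best_pos = leaders[0]
    return best_pos, best_score


def toy_fitness(pos: Sequence[float]) -> float:
    """Hypothetical fitness with its maximum at C = 3, gamma = 0.5."""
    return -((pos[0] - 3.0) ** 2 + (pos[1] - 0.5) ** 2)


best_pos, best_score = gwo_maximize(toy_fitness, [(1.0, 5.0), (0.0, 1.0)])
```

With 4 wolves and 14 iterations the sketch performs 4 × 14 = 56 fitness evaluations, each of which would correspond to one SVM training in the parameter optimization setting.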

Attack (Exploitation):
The attack towards the prey is modeled by the vector A (present in Equation 12), which constantly decreases during the iterations of the algorithm. When |A| < 1 the wolf moves closer to the approximated position of the prey; otherwise, if |A| > 1, the search agent explores other areas away from the prey.

Figure 4: GWO algorithm pseudo-code, adapted from figure 6 in [41]

Some examples of successful GWO application follow. Two-stage assembly flow shop scheduling [45] compares GWO to the particle swarm optimizer (PSO) and Cloud theory-based Simulated Annealing (CSA); the results of that research show that GWO obtained slightly lower precision but better performance than the other algorithms. In [43], GWO is used for parameter estimation to calculate dispersion curves in surface waves; good results were obtained thanks to its characteristics, for instance the balance between exploration and exploitation, fast convergence and the low number of configuration parameters. Finally, Mirjalili also uses the algorithm to train multi-layer perceptrons (MLP), comparing its performance with PSO, genetic algorithms (GA), ant colony optimization (ACO), evolutionary strategy (ES) and population-based incremental learning (PBIL) [47].

Methodology

Corpus
To compare both parameter optimization algorithms, we base our dataset on the TASS corpus. This corpus contains 7220 Twitter messages written in Spanish. Each message is tagged with its global polarity, indicating which of 6 possible polarities (categories) the text expresses: strong positive (P+), positive (P), neutral (NEU), negative (N), strong negative (N+) and no sentiment (NONE). In addition, there is an indication of agreement or disagreement between annotators. The corpus was pre-processed by merging both positive classes, P and P+, into a single category P, and likewise merging the negative categories N and N+ into a single category N. After that, instances with classes NONE and NEU were deleted, and finally all remaining tweets (5068) were turned into vectors of numbers using a polarity dictionary.
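The label merging and filtering described above could look like the following sketch. The tuple layout (text, label) and the placeholder messages are assumptions made for illustration; the conversion of tweets into numeric vectors with a polarity dictionary is not shown.

```python
from typing import Iterable, List, Tuple

# Mapping from original TASS categories to the merged binary categories.
# NEU and NONE are deliberately absent, so those instances are dropped.
MERGED_LABEL = {"P+": "P", "P": "P", "N+": "N", "N": "N"}


def preprocess_labels(tweets: Iterable[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Merge P+/P into P and N+/N into N; delete NEU and NONE instances."""
    result = []
    for text, label in tweets:
        merged = MERGED_LABEL.get(label)
        if merged is not None:
            result.append((text, merged))
    return result


# Placeholder messages, not taken from the corpus.
sample = [
    ("tweet a", "P+"),
    ("tweet b", "NEU"),
    ("tweet c", "N"),
    ("tweet d", "NONE"),
    ("tweet e", "N+"),
]
cleaned = preprocess_labels(sample)
```

After this step only the two categories P and N remain, which matches the binary classification setting of the SVM.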
Sixty sets were created from the pre-processed corpus; every set contains two subsets (see figure 5). Train: 3040 records (60 %), used for parameter optimization with grid search and GWO and for model generation. Test: the remaining 2028 records (40 %), used for model validation.
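A minimal sketch of the set creation, assuming each set is produced by an independent random shuffle of the 5068 pre-processed records followed by a 60/40 cut (which leaves 5068 − 3040 = 2028 records for the second subset). The function name and the seed are illustrative.

```python
import random
from typing import List, Sequence, Tuple


def make_sets(
    records: Sequence,
    n_sets: int = 60,
    train_fraction: float = 0.6,
    seed: int = 0,
) -> List[Tuple[list, list]]:
    """Create n_sets random (train, test) splits of the same records."""
    rng = random.Random(seed)
    n_train = int(len(records) * train_fraction)  # 3040 for 5068 records
    sets = []
    for _ in range(n_sets):
        shuffled = list(records)
        rng.shuffle(shuffled)
        sets.append((shuffled[:n_train], shuffled[n_train:]))
    return sets


# Integer identifiers stand in for the vectorized tweets.
all_sets = make_sets(list(range(5068)))
```

Within each set the two subsets are disjoint, so no record used for parameter optimization is reused when the resulting classifier is evaluated.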

Classifiers generation
Table 2: Treatment settings for the algorithms used. The number in each GWO treatment name is the number of SVM trainings performed by the algorithm (wolves × iterations).

Algorithm      Setting description
Default        Models trained using LIBSVM default parameters.
Grid Search    Models trained using grid search.
GWO 56         Models trained using GWO with 4 wolves and 14 iterations.
GWO 112        Models trained using GWO with 4 wolves and 28 iterations.
GWO 168        Models trained using GWO with 4 wolves and 42 iterations.

To generate the classifiers and evaluate them, all 60 sets went through a 3-step process:
1. Step 1: Parameter optimization. The train subsets were used by all the algorithm treatments (see table 2), using the accuracy value and 10-fold cross validation, resulting in 60 (C, γ) pairs, one for each train subset.
2. Step 2: Model generation. Using the parameter pairs selected in step 1, a classification model is generated using accuracy and 10-fold cross validation.
3. Step 3: Model validation. All classifiers are evaluated using the test subsets. The accuracy and $F_1$-measure metrics were calculated in this step.
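The 10-fold cross-validation accuracy used in steps 1 and 2 can be sketched as follows. The `train_and_predict` callback is a hypothetical placeholder for training an SVM with a given (C, γ) pair and predicting the held-out fold; here accuracy is pooled over all folds, which is one of several reasonable conventions and an assumption of this sketch.

```python
from typing import Callable, List, Sequence


def k_fold_indices(n: int, k: int = 10) -> List[List[int]]:
    """Split indices 0..n-1 into k contiguous folds whose sizes differ by at most 1."""
    folds = []
    start = 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_accuracy(
    features: Sequence,
    labels: Sequence,
    train_and_predict: Callable[[list, list, list], list],
    k: int = 10,
) -> float:
    """Fraction of instances predicted correctly while held out."""
    correct = 0
    for held_out in k_fold_indices(len(labels), k):
        held = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held]
        predictions = train_and_predict(
            [features[i] for i in train_idx],
            [labels[i] for i in train_idx],
            [features[i] for i in held_out],
        )
        correct += sum(1 for i, p in zip(held_out, predictions) if p == labels[i])
    return correct / len(labels)


def majority_classifier(train_x: list, train_y: list, test_x: list) -> list:
    """Toy stand-in for an SVM: always predicts the most frequent training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_x)
```

With a 3040-record train subset, each of the 10 folds holds 304 records, so every candidate (C, γ) pair implies 10 model fits on 2736 records each.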

Evaluation metrics
All classifiers were evaluated using accuracy (Equation 14) and the $F_1$ score (Equation 15), where $T_p$ are true positives, $F_p$ false positives, $F_n$ false negatives and $T_n$ true negatives.
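Because the two equations are not reproduced in this text, the sketch below assumes the standard definitions: accuracy = (Tp + Tn) / (Tp + Tn + Fp + Fn) and F1 = 2·Tp / (2·Tp + Fp + Fn). The counts in the example are invented for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all instances that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, written in terms of the counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical confusion-matrix counts for one test subset.
example_accuracy = accuracy(tp=40, tn=30, fp=20, fn=10)
example_f1 = f1_score(tp=40, fp=20, fn=10)
```

Note that the F1 score ignores true negatives, so the two metrics can diverge when the classes are unbalanced.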

Results
Results were summarized using box plots created with GNU Octave. Notably, the GWO 56 treatment used only half the trainings used by grid search. The other two GWO treatments (GWO 112 and GWO 168) seem to indicate that after a certain number of iterations there is little or no difference in the results, but further experimentation is needed; in particular, a parameter sensitivity analysis of the Grey Wolf Optimizer is recommended to explore the effect of the GWO parameters on the results.
Even though our research focuses on global polarity detection for sentiment analysis problems, the methodology developed in this experiment can be applied to any field that uses support vector machines. Finally, automatic parameter optimization is a time-consuming and CPU-intensive process (this experiment, for example, took 144 hours), which makes it a problem worth considering in future research.

$C_{k+1} = C_k + d_1$, where $d_1 \in \mathbb{R}$ is chosen by the researcher and $C_k \in [a, b]$, with a and b chosen by the researcher. $\gamma_{k+1} = \gamma_k + d_2$, where $d_2 \in \mathbb{R}$ is chosen by the researcher and $\gamma_k \in [a, b]$, with a and b chosen by the researcher.

Figure 5: Randomized subset creation for training and evaluation purposes

Figure 6: Summarized accuracy results for all 60 test datasets

The box plots (see figures 6 and 7) allow visual comparison. Each box represents an algorithm treatment, with the top side of the box representing the 75th percentile and the bottom side the 25th percentile; the lines coming out of the boxes correspond to the maximum (top) and minimum (bottom) values, and the line within the box represents the average value. See tables 3 and 4 for exact values.

GNU Octave: https://www.gnu.org/software/octave/

Table 4: Accuracy: minimum, maximum, average and standard deviation values