### Motivation for text classification tasks

Web users generate massive amount of various text information: they share their thoughts and opinions in their personal blogs and social networks, write reviews and leave comments to products, movies and apps, restaurants and hotels, ask and answer questions in knowledge systems. Amount of this information keeps growing everyday and analysis of it can help to:

• Identify most useful and informative reviews and comments.
• Provide additional insights into user preferences: what they like and think about your product.
• Recognize potentially viral content and Identify new social trends.
• Automatically categorize and label new content.

### Text representation

Before classification, text documents need to be represented in a way suitable for machine learning algorithms. One of the often used ways to do it for a given set of documents $d$ is to learn its vocabulary (set of words $v_j$ contained in all documents) and then represent $d$ by a matrix $x_{ij}$, where each row $x_i$ represents a single document $d_i$ and non-zero $x_{ij}$ indicates that $v_j$ is contained in $d_i$.

Let’s consider a simple example of the following set of documents:

d = ['the red apple', 'the orange', 'the green apple', 'the lemon', 'the plum']

then its vocabulary:

v = ['apple', 'green', 'lemon', 'orange', 'plum', 'red', 'the']

and $d$ is be represented by:

[[1 0 0 0 0 1 1]
[0 0 0 1 0 0 1]
[1 1 0 0 0 0 1]
[0 0 1 0 0 0 1]
[0 0 0 0 1 0 1]]

The document set represented this way can already be passed to an arbitrary machine learning classification algorithm as a sample-feature matrix.

### Term weighting schemes: example of tf-idf scheme

In the example above, the fact that $v_j$ is contained in $d_i$ is represented by $x_{ij} = 1$, but in general case it can be set to a non-zero value, which will reflect how important is given word (term) with respect to other words and in the context of the given classification task. The procedure for calculation of $x_{ij}$ is called term weighting scheme.

On of the most popular term weighting schemes is tf-idf (term frequency–inverse document frequency). Its definition takes into account two intuitive considerations about how important the given term is:

• More important terms occur more often than less important ones, thus, $x_{ij}$ is proportional to the count of $v_j$ in $d_i$ (term frequency).
• Very common words (contained in large fraction of documents), for example, articles or conjunctions can have relatively high term frequency, but it doesn’t necessarily mean they are important. This effect can be mitigated by setting $x_{ij}$ inverse proportional to fraction of documents containing $v_j$ (document frequency).

One can calculate tf-idf matrix using text feature extraction utils available in scikit-learn:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

CountVectorizer calculates term frequencies:

d = ['the red apple', 'the orange', 'the green apple', 'the lemon', 'the plum']

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(d)

print x.toarray()
[[1 0 0 0 0 1 1]
[0 0 0 1 0 0 1]
[1 1 0 0 0 0 1]
[0 0 1 0 0 0 1]
[0 0 0 0 1 0 1]]

TfidfTransformer calculates inverse document frequencies from input term frequency matrix (by default, it also normalizes each $x_i$ so that $\sum_j x_{ij}^2 = 1$):

transformer = TfidfTransformer()
x = transformer.fit_transform(x)

print x.toarray()
[[0.59 0.00 0.00 0.00 0.00 0.73 0.35]
[0.00 0.00 0.00 0.90 0.00 0.00 0.43]
[0.59 0.73 0.00 0.00 0.00 0.00 0.35]
[0.00 0.00 0.90 0.00 0.00 0.00 0.43]
[0.00 0.00 0.00 0.00 0.90 0.00 0.43]]

One can see that “the” article used in all documents has the lowest weight, “apple” used in many (two) documents has second lowest weight, terms used in longer documents have lower weight (0.73) than terms used in shorter documents (0.90).

### Supervised term weighting schemes

Now let’s consider that we assign each document to positive class (1) when it contains “apple” or to negative class (0) when it doesn’t, and let $y_i$ hold the information about document classes:

y = [1, 0, 1, 0, 0]

Next possible improvement to the tf-idf scheme can be incorporation of class information into term weighting scheme, thus, making it supervised. These schemes take into account how often the given term occurs in positive and negative documents, naturally, instead of ordinary document frequencies one need to calculate:

• tp: number of positive samples that contain given term.
• fp: number of positive samples that do not contain given term.
• fn: number of negative samples that contain given term.
• tn: number of negative samples that do not contain given term.
• N: total number of documents.

In this post we implement several schemes discussed in Ref. 1:

• tf*chi2: $\chi^2 = \frac{N*(tp * tn - fp * fn)^2}{(tp+fp)(fn+tn)(tp+fn)(fp+tn)}$.
• tf*ig: $ig = -\frac{tp+fp}{N}*log\frac{tp+fp}{N}-\frac{tn+fn}{N}*log\frac{tn+fn}{N}+\frac{tp}{N}*log\frac{tp}{tp+fn}+\frac{fn}{N}*log\frac{fn}{tp+fn}+\frac{fp}{N}*log\frac{fp}{fp+tn}+\frac{tn}{N}*log\frac{tn}{fp+tn}$ (information gain).
• tf*gr: $gr = ig / (-\frac{tp+fp}{N}*log\frac{tp+fp}{N}-\frac{tn+fn}{N}*log\frac{tn+fn}{N})$ (gain ratio).
• tf*OR: $OR = log\frac{tp*tn}{fp*fn}$ (Odds Ratio).
• tf*rf: $rf = log(2+\frac{tp}{fn})$ (relevance frequency).

The implementation of these schemes mimics the implementation of TfidfTransformer in scikit-learn:

import numpy as np
import scipy.sparse as sp

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import normalize

class SupervisedTermWeightingTransformer(BaseEstimator, TransformerMixin):

def __init__(self, scheme='tfchi2'):

self.scheme = scheme

The fit() method calculates tp, fp, fn, tn, N from term frequencies $X$ and class information $y$ and stores them for later use:

def fit(self, X, y):

n_samples, n_features = X.shape

# Masks for positive and negative samples
pos_samples = sp.spdiags(y, 0, n_samples, n_samples)
neg_samples = sp.spdiags(1-y, 0, n_samples, n_samples)

# Extract positive and negative samples
X_pos = pos_samples*X
X_neg = neg_samples*X

# tp: number of positive samples that contain given term
# fp: number of positive samples that do not contain given term
# fn: number of negative samples that contain given term
# tn: number of negative samples that do not contain given term
tp = np.bincount(X_pos.indices, minlength=n_features)
fp = np.sum(y)-tp
fn = np.bincount(X_neg.indices, minlength=n_features)
tn = np.sum(1-y)-fn

# Smooth document frequencies
self._tp = tp + 1.0
self._fp = fp + 1.0
self._fn = fn + 1.0
self._tn = tn + 1.0

self._n_samples = n_samples
self._n_features = n_features

return self

The transform() method term frequencies $X$ according to chosen scheme and applies L2 normalization to each $x_i$:

def transform(self, X):

tp = self._tp
fp = self._fp
fn = self._fn
tn = self._tn

# Smooth document frequencies
n = self._n_samples + 4

f = self._n_features

if self.scheme == 'tfchi2':

k = n * (tp * tn - fp * fn)**2
k /= (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)

elif self.scheme == 'tfig':

k = -((tp + fp) / n) * np.log((tp + fp) / n)
k -= ((fn + tn) / n) * np.log((fn + tn) / n)
k += (tp / n) * np.log(tp / (tp + fn))
k += (fn / n) * np.log(fn / (tp + fn))
k += (fp / n) * np.log(fp / (fp + tn))
k += (tn / n) * np.log(tn / (fp + tn))

elif self.scheme == 'tfgr':

k = -((tp + fp) / n) * np.log((tp + fp) / n)
k -= ((fn + tn) / n) * np.log((fn + tn) / n)
k += (tp / n) * np.log(tp / (tp + fn))
k += (fn / n) * np.log(fn / (tp + fn))
k += (fp / n) * np.log(fp / (fp + tn))
k += (tn / n) * np.log(tn / (fp + tn))

d = -((tp + fp) / n) * np.log((tp + fp) / n)
d -= ((fn + tn) / n) * np.log((fn + tn) / n)

k /= d

elif self.scheme == 'tfor':

k = np.log( (tp * tn ) / (fp * fn) )

elif self.scheme == 'tfrf':

k = np.log(2 + tp / fn)

X = X * sp.spdiags(k, 0, f, f)

return normalize(X, 'l2')

Let’s try this transformer in our simple example:

d = ['the red apple', 'the orange', 'the green apple', 'the lemon', 'the plum']

vectorizer = CountVectorizer()
x = vectorizer.fit_transform(d)

transformer = SupervisedTermWeightingTransformer(scheme='tfor')
x = transformer.fit_transform(x)

print x.toarray()
[[0.87 0.00 0.00 0.00 0.00 0.48 -0.10]
[0.00 0.00 0.00 -0.92 0.00 0.00 -0.38]
[0.87 0.48 0.00 0.00 0.00 0.00 -0.10]
[0.00 0.00 -0.92 0.00 0.00 0.00 -0.38]
[0.00 0.00 0.00 0.00 -0.92 0.00 -0.38]]

As expected, “apple” is a very strong indicator of positive class!

### More realistic example: Large Movie Review Dataset

For more realistic example let’s try sentiment analysis of the Large Movie Review Dataset v1.0 (see Ref. 2), with 25000 positive and 25000 negative IMDb movie reviews. The reviews are highly polar: positive reviews have a score higher than 6 and negative ones have a score less than 5. Equal number of positive and negative reviews eliminates effects from class imbalance. We use a linear support vector machine (SVM) classifier as in Ref. 2 and 3., train it on all train samples and evaluate accuracy, recall, and F-1 score for tf-idf and supervised term weighting schemes using all test samples:

Term weighting scheme Accuracy Recall F1 score
tf-idf 0.88 0.87 0.88
tf*chi2 0.87 0.86 0.87
tf*ig 0.87 0.86 0.87
tf*gr 0.87 0.86 0.87
tf*OR 0.89 0.88 0.89
tf*rf 0.88 0.87 0.88

### Conclusions

All supervised weighting schemes considered in this post demonstrate performance very similar to performance of tf-idf scheme. One of the reasons may be the fact that there was no attempt to optimize default SVM parameters. But there is still a room for further improvements:

• Optimize SVM parameters with a grid search over them.
• Consider various term frequency schemes: in addition to raw term counts, try binary term counts, smoothed term counts, e.g. log(tf+1), normalized term counts.
• Experiment with normalization of document representation matrix $x_{ij}$: one can normalize only sample vectors $x_i$ or feature vectors $x_j$, or apply different normalization metrics (L1 or L2).
• Introduce requirements on maximum and minimum allowed term and/or document frequencies to filter out extremely rare and common terms.
• Review the default list of stop words and check if any words, which are very common to this specific dataset, are not filtered out by the list.

The code for this post is hosted on github:

### References

1. Xiaojun Quan; Liu Wenyin; Qiu, B., “Term Weighting Schemes for Question Categorization,” in Pattern Analysis and Machine Intelligence, IEEE Transactions on , vol.33, no.5, pp.1009-1021, May 2011.

2. Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

3. Zhi-Hong Deng, Kun-Hu Luo, and Hong-Liang Yu. 2014. A study of supervised term weighting scheme for sentiment analysis. Expert Syst. Appl. 41, 7 (June 2014), 3506-3513.