Methods for text data

FSFC contains implementations of several feature selection algorithms that work with text data. The main difference of these algorithms is that they accept SciPy sparse matrices as input. Such matrices can be created with vectorizers from sklearn (see the sketch below).

Every algorithm can be imported either from its own package or from the fsfc.text module.
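The sketch below shows both import styles and how to build the sparse input with a vectorizer from sklearn; the toy corpus is, of course, illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer

    from fsfc.text import CHIR, FTC       # import from the fsfc.text module...
    # from fsfc.text.CHIR import CHIR     # ...or from the algorithm's own package

    texts = [
        "feature selection for text clustering",
        "frequent terms define text clusters",
        "an unrelated sample about something else",
    ]

    # Vectorizers from sklearn produce the SciPy sparse matrices FSFC expects.
    vectorizer = TfidfVectorizer()
    x = vectorizer.fit_transform(texts)   # csr_matrix of shape (n_samples, n_terms)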

Chi-R algorithm

class fsfc.text.CHIR.CHIR(k, clusters, alpha=0.1, max_iter=1000)[source]

Bases: fsfc.base.KBestFeatureSelector

Chi-R feature selection algorithm for text clustering.

Uses Chi-square statistics to evaluate the importance of each feature, together with an R-coefficient that normalises the statistics across the corpus.

Based on the article “Text clustering with feature selection by using statistical data”.

The algorithm selects features in the following way (a schematic sketch follows the list):
  1. Find an initial clustering of the dataset.
  2. Compute the Chi-R score of each feature according to the article.
  3. Select the top k features according to the scores.
  4. Set the weights of the top features to 1 and of all others to alpha.
  5. Recompute the clustering. If it changed, repeat steps 2-5; otherwise return the feature weights.
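The loop in steps 2-5 can be sketched as follows. This is a simplified illustration, not the library's internal implementation: it scores features with plain chi-square statistics from sklearn (omitting the R-coefficient normalisation from the article) and uses k-means for both clustering steps:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.feature_selection import chi2

    def chir_like_weights(x, k, clusters, alpha=0.1, max_iter=1000):
        # x must be non-negative (e.g. TF-IDF), as chi2 requires.
        # Step 1: initial clustering of the dataset.
        labels = KMeans(n_clusters=clusters, n_init=10).fit_predict(x)
        weights = np.ones(x.shape[1])
        for _ in range(max_iter):
            # Step 2 (simplified): chi-square scores instead of full Chi-R scores.
            scores = np.nan_to_num(chi2(x, labels)[0])
            # Steps 3-4: weight 1 for the top-k features, alpha for the rest.
            weights = np.full(x.shape[1], alpha)
            weights[np.argsort(scores)[-k:]] = 1.0
            # Step 5: recompute the clustering on the re-weighted matrix.
            new_labels = KMeans(n_clusters=clusters, n_init=10).fit_predict(
                x.multiply(weights).tocsr())
            # Stop when the clustering no longer changes (this sketch ignores
            # label permutations between k-means runs).
            if np.array_equal(new_labels, labels):
                break
            labels = new_labels
        return weights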
Parameters:
k: int

Number of features to select.

clusters: int

Expected number of clusters.

alpha: float (default 0.1)

Weight assigned to irrelevant features.

max_iter: int (default 1000)

Maximum number of iterations of the algorithm.

Methods

fit(x, *rest) Fit the algorithm to the dataset and select relevant features.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected.
inverse_transform(X) Reverse the transformation operation.
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.
fit(x, *rest)[source]

Fit the algorithm to the dataset and select relevant features.

Parameters:
x: csr_matrix

SciPy sparse matrix representing the terms contained in each sample. It may be created with vectorizers from sklearn; the TF-IDF vectorizer is the preferred choice.

Returns:
self: CHIR

Returns itself to support chaining.
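A minimal usage sketch for CHIR, assuming fsfc is installed; the parameter values and corpus are illustrative:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from fsfc.text import CHIR

    texts = [
        "feature selection for text clustering",
        "clustering text with selected features",
        "frequent terms define text clusters",
        "an unrelated sample about something else",
    ]
    x = TfidfVectorizer().fit_transform(texts)  # TF-IDF is the preferred input

    selector = CHIR(k=5, clusters=2)     # keep 5 features, expect 2 clusters
    reduced = selector.fit_transform(x)  # fit, then reduce x to the selected features
    mask = selector.get_support()        # boolean mask of the selected features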

Frequent Term-based Clustering

class fsfc.text.FTC.FTC(minsup)[source]

Bases: fsfc.base.ClusteringFeatureSelector

Frequent Term-based Clustering algorithm. Uses frequent term sets to find clusters and simultaneously select the features that determine each cluster.

Based on the article “Frequent term-based text clustering”.

An FTS (frequent term set) is a set of terms that appears in at least a given fraction of all samples in the dataset. We say that an FTS covers a sample if every term of the FTS is contained in that sample.

The algorithm clusters the dataset in the following way:
  1. Find all FTSs for the dataset with the specified minsup. The elements of an FTS are terms, i.e. features of the dataset.
  2. Find the FTS with the lowest entropy overlap with the remaining candidates with respect to the dataset (see the sketch after this list). The paper shows that such an FTS explains the data best.
  3. Add this FTS to the clustering, remove the samples it covers from the dataset, and repeat steps 2 and 3.
  4. Assign clusters to samples: a sample belongs to the cluster defined by an FTS if that FTS covers it.
  5. The score of a feature is 1 if it belongs to some FTS and 0 otherwise.
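The entropy overlap used in step 2 can be sketched as follows. Per the paper, the overlap of a candidate FTS is summed over the samples it covers, where f is the number of remaining candidates covering a given sample; the function below is an illustrative reconstruction, not the library's code:

    import math

    def entropy_overlap(candidate, candidates, samples):
        # candidate:  one FTS (a frozenset of terms)
        # candidates: all remaining candidate FTSs (including `candidate`)
        # samples:    each sample represented as a set of terms
        overlap = 0.0
        for sample in samples:
            if candidate <= sample:  # the candidate covers this sample
                f = sum(1 for c in candidates if c <= sample)  # f >= 1 here
                overlap += -(1.0 / f) * math.log(1.0 / f)
        return overlap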
Parameters:
minsup: float

Fraction of the dataset that must be covered by each FTS.

Methods

fit(x, *rest) Fit the algorithm to the dataset, find clusters, and select relevant features.
fit_predict(X[, y]) Performs clustering on X and returns cluster labels.
fit_transform(X[, y]) Fit to data, then transform it.
get_params([deep]) Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected.
inverse_transform(X) Reverse the transformation operation.
set_params(**params) Set the parameters of this estimator.
transform(X) Reduce X to the selected features.
fit(x, *rest)[source]

Fit the algorithm to the dataset, find clusters, and select relevant features.

Parameters:
x: csr_matrix

SciPy sparse matrix representing the terms contained in each sample. It may be created with vectorizers from sklearn.

Returns:
self: FTC

Returns itself to support chaining.
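A minimal usage sketch for FTC, assuming fsfc is installed; the minsup value and corpus are illustrative:

    from sklearn.feature_extraction.text import CountVectorizer
    from fsfc.text import FTC

    texts = [
        "frequent terms build text clusters",
        "frequent terms select text features",
        "a different document about another topic",
        "another document on that other topic",
    ]
    x = CountVectorizer().fit_transform(texts)

    ftc = FTC(minsup=0.5)        # each FTS must cover half of the samples
    labels = ftc.fit_predict(x)  # cluster labels, one per sample
    reduced = ftc.transform(x)   # keep only features that belong to some FTS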