Methods for text data
FSFC contains implementations of several feature selection algorithms for text data.
What distinguishes these algorithms is that they accept SciPy sparse matrices as input. Such matrices can be
computed using vectorizers from sklearn.
Every algorithm can be imported either from its own package or from the fsfc.text
module.
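As a minimal sketch of how such an input might be prepared, the snippet below builds a TF-IDF matrix with sklearn's TfidfVectorizer. The three documents are made up for illustration; the resulting x is the kind of sparse matrix the algorithms in this module take as input.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
documents = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stock markets fell sharply today",
]

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and returns a SciPy sparse matrix
# with one row per document and one column per term
x = vectorizer.fit_transform(documents)

print(sparse.issparse(x), x.shape)
```

Each column of x corresponds to one term of the learned vocabulary, so a feature selector that keeps k columns effectively keeps k terms.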
Chi-R algorithm
class fsfc.text.CHIR.CHIR(k, clusters, alpha=0.1, max_iter=1000)
Bases: fsfc.base.KBestFeatureSelector
Chi-R feature selection algorithm for text clustering.
Uses the Chi-square statistic to evaluate the importance of each feature, and the R-coefficient to normalise these statistics across the corpus.
Based on the article “Text clustering with feature selection by using statistical data”.
The algorithm selects features in the following way:
1. Find an initial clustering of the dataset.
2. Compute Chi-R scores for each feature according to the article.
3. Select the top k features according to their scores.
4. Set the weights of the top features to 1 and the weights of all others to alpha.
5. Recompute the clustering. If it changes, repeat steps 2-5; otherwise return the feature weights.
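The selection and weighting steps can be illustrated in isolation. The following is a minimal sketch, not the library's implementation: the helper name weights_from_scores and the score values are hypothetical, and the Chi-R scoring itself (defined in the article) is not reproduced here.

```python
def weights_from_scores(scores, k, alpha=0.1):
    """Give weight 1.0 to the k highest-scoring features and alpha to the rest."""
    # Rank feature indices by descending score; ties keep their original order
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    return [1.0 if i in top else alpha for i in range(len(scores))]


# Hypothetical Chi-R scores for five features
scores = [0.2, 3.5, 1.1, 0.0, 2.4]
weights = weights_from_scores(scores, k=2, alpha=0.1)
print(weights)  # [0.1, 1.0, 0.1, 0.1, 1.0]
```

In the iterative procedure, weights like these would be applied to the feature columns before the clustering is recomputed, so that irrelevant terms are damped rather than removed outright.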
Parameters:
- k: int
  Number of features to select.
- clusters: int
  Expected number of clusters.
- alpha: float (default 0.1)
  Weight assigned to irrelevant features.
- max_iter: int (default 1000)
  Maximum number of iterations of the algorithm.
Methods
fit(x, *rest)          Fit algorithm to dataset and select relevant features.
fit_transform(X[, y])  Fit to data, then transform it.
get_params([deep])     Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected.
inverse_transform(X)   Reverse the transformation operation.
set_params(**params)   Set the parameters of this estimator.
transform(X)           Reduce X to the selected features.
fit(x, *rest)
Fit algorithm to dataset and select relevant features.
Parameters:
- x: csr_matrix
  SciPy sparse matrix representing the terms contained in every sample. It may be created by vectorizers from sklearn; the TF-IDF vectorizer is the preferred choice.
Returns:
- self: CHIR
  Returns itself to support chaining.
Frequent Term-based Clustering
class fsfc.text.FTC.FTC(minsup)
Bases: fsfc.base.ClusteringFeatureSelector
Frequent Term-based Clustering algorithm. Uses frequent termsets to find clusters and, at the same time, to select the features that determine each cluster.
Based on the article “Frequent term-based text clustering”.
A frequent termset (FTS) is a set of terms that occur together in at least a given fraction (minsup) of the samples in the dataset. An FTS covers a sample if every term of the FTS is contained in the sample.
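To make these definitions concrete, here is a small sketch in plain Python. It is not the library's code: samples are modelled as sets of terms, the helper names covers, support and frequent_termsets are hypothetical, and the enumeration is a brute-force search limited to small termsets by the assumed max_size argument.

```python
from itertools import combinations


def covers(termset, sample):
    """A termset covers a sample if every term of it occurs in the sample."""
    return set(termset) <= set(sample)


def support(termset, samples):
    """Fraction of samples covered by the termset."""
    return sum(covers(termset, s) for s in samples) / len(samples)


def frequent_termsets(samples, minsup, max_size=2):
    """Brute-force enumeration of termsets with support >= minsup."""
    vocabulary = sorted(set().union(*samples))
    result = []
    for size in range(1, max_size + 1):
        for candidate in combinations(vocabulary, size):
            if support(candidate, samples) >= minsup:
                result.append(frozenset(candidate))
    return result


# Hypothetical toy dataset: each sample is the set of terms it contains
samples = [{"cat", "dog"}, {"cat"}, {"dog", "fish"}, {"cat", "dog", "fish"}]
termsets = frequent_termsets(samples, minsup=0.5)
print(termsets)
```

With minsup=0.5, the termset {"cat", "dog"} qualifies because it covers two of the four samples, while {"cat", "fish"} covers only one and is discarded.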
The algorithm performs clustering in the following way:
1. Find all FTS of the dataset for the specified minsup. The elements of an FTS are terms, i.e. features of the dataset.
2. Find the FTS that has the lowest entropy overlap with the remaining clusters with respect to the dataset. The paper shows that such an FTS explains the data best.
3. Add this FTS to the clustering, remove the samples it covers from the dataset, and repeat steps 2 and 3.
4. Assign clusters to samples. A sample belongs to the cluster defined by an FTS if that FTS covers the sample.
The score of a feature is 1 if the feature belongs to any selected FTS and 0 otherwise.
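The final assignment and scoring can be sketched as follows. This is an illustration under assumptions, not the library's implementation: the termsets are given by hand rather than chosen by entropy overlap, the helper names are hypothetical, and a sample is simply mapped to every termset that covers it, which may differ from how the library resolves overlaps.

```python
def assign_clusters(samples, termsets):
    """For each sample, list the indices of the termsets that cover it."""
    return [
        [index for index, termset in enumerate(termsets) if termset <= sample]
        for sample in samples
    ]


def feature_scores(vocabulary, termsets):
    """Score 1 for terms that occur in any selected termset, 0 otherwise."""
    selected = set().union(*termsets) if termsets else set()
    return [1 if term in selected else 0 for term in vocabulary]


# Hypothetical toy data: samples as sets of terms, termsets chosen by hand
samples = [{"cat", "dog"}, {"cat"}, {"dog", "fish"}, {"bird"}]
termsets = [frozenset({"cat"}), frozenset({"dog", "fish"})]
vocabulary = ["bird", "cat", "dog", "fish"]

labels = assign_clusters(samples, termsets)
scores = feature_scores(vocabulary, termsets)
print(labels)  # [[0], [0], [1], []]
print(scores)  # [0, 1, 1, 1]
```

The last sample contains none of the selected terms, so no termset covers it and it stays unassigned in this sketch.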
Parameters:
- minsup: float
  Fraction of the dataset that each FTS must cover.
Methods
fit(x, *rest)          Fit algorithm to dataset, find clusters and select relevant features.
fit_predict(X[, y])    Performs clustering on X and returns cluster labels.
fit_transform(X[, y])  Fit to data, then transform it.
get_params([deep])     Get parameters for this estimator.
get_support([indices]) Get a mask, or integer index, of the features selected.
inverse_transform(X)   Reverse the transformation operation.
set_params(**params)   Set the parameters of this estimator.
transform(X)           Reduce X to the selected features.