Methods for generic data
FSFC contains implementations of several feature selection algorithms that work with generic data.
Every algorithm can be imported either from its own package or from the fsfc.generic module.
SPEC algorithm family
A family of algorithms based on Spectral Graph Theory.

class fsfc.generic.SPEC.FixedSPEC(k, clusters)
Bases: fsfc.generic.SPEC.SPEC
Feature selection algorithm that represents samples as vertices of a graph whose edge weights are the RBF distances between points.
The algorithm uses Spectral Graph Theory to find the features with the best separability, under the assumption that the points are separated into a predefined number of clusters.
To do this, it finds the eigenvectors corresponding to the K smallest eigenvalues of the graph Laplacian, excluding the trivial one, and uses the cosine distance between them and the feature vectors to detect the most explanatory features. The algorithm selects k features according to this score.
Parameters:
k: int
  Number of features to select
clusters: int
  Expected number of clusters
Methods
fit(x, *rest)
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
set_params(**params): Set the parameters of this estimator.
transform(X): Reduce X to the selected features.
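The scoring scheme above can be sketched in a few lines of numpy. This is a simplified illustration of the idea, not the library's implementation: the RBF kernel width is fixed at 1, and each feature is scored by its squared cosine similarity with the non-trivial bottom eigenvectors of the graph Laplacian.

```python
import numpy as np

def fixed_spec_scores(X, n_clusters):
    """Sketch of FixedSPEC-style scoring: build an RBF graph over the
    samples, take the eigenvectors of the n_clusters smallest Laplacian
    eigenvalues (skipping the trivial constant one), and score each
    feature by its squared cosine similarity with them."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq)                       # RBF affinity, width fixed at 1
    L = np.diag(W.sum(axis=1)) - W        # unnormalised graph Laplacian
    _, eigvecs = np.linalg.eigh(L)        # eigenvalues in ascending order
    non_trivial = eigvecs[:, 1:n_clusters + 1]
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        f = X[:, j]
        norm = np.linalg.norm(f)
        if norm == 0:
            continue
        for v in non_trivial.T:
            # squared cosine similarity between feature and eigenvector
            scores[j] += (f @ v / (norm * np.linalg.norm(v))) ** 2
    return scores

# Toy data: feature 0 separates two blobs, feature 1 is constant
X = np.array([[0.0, 0.5], [0.1, 0.5], [5.0, 0.5], [5.1, 0.5]])
scores = fixed_spec_scores(X, n_clusters=2)
```

On this toy dataset the cluster-aligned feature 0 scores well above the uninformative feature 1, so selecting the top-k scores keeps the separating feature.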

class fsfc.generic.SPEC.GenericSPEC(k)
Bases: fsfc.generic.SPEC.NormalizedCut
Feature selection algorithm that represents samples as vertices of a graph whose edge weights are the RBF distances between points.
The algorithm uses Spectral Graph Theory to find the features with the best separability.
To do this, it finds the trivial eigenvector of the graph Laplacian and uses it to normalise the scores computed by NormalizedCut. According to the article the SPEC family is based on, this normalisation improves the accuracy of feature selection. The algorithm selects the top k features according to these scores.
Parameters:
k: int
  Number of features to select
Methods
fit(x, *rest)
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
set_params(**params): Set the parameters of this estimator.
transform(X): Reduce X to the selected features.

class fsfc.generic.SPEC.NormalizedCut(k)
Bases: fsfc.generic.SPEC.SPEC
Feature selection algorithm that represents samples as vertices of a graph whose edge weights are the RBF distances between points. The algorithm attempts to find a minimal cut in this graph.
It selects the k features that are best separated by this cut, as they are considered to explain the dataset best.
Parameters:
k: int
  Number of features to select
Methods
fit(x, *rest)
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
set_params(**params): Set the parameters of this estimator.
transform(X): Reduce X to the selected features.
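A normalised-cut style score for a single feature can be sketched as the Rayleigh quotient f^T L f / f^T D f over the sample graph. This is an illustration under stated simplifications (unnormalised Laplacian, kernel width 1), not the library code; in this sketch smaller values indicate features that change little along heavy edges, i.e. features that agree with the graph's cluster structure.

```python
import numpy as np

def normalized_cut_scores(X):
    """Score each feature by the Rayleigh quotient f^T L f / f^T D f
    of the sample graph; small values mean the feature varies smoothly
    across strongly connected samples."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq)                   # RBF affinity, width fixed at 1
    d = W.sum(axis=1)                 # vertex degrees
    L = np.diag(d) - W                # unnormalised graph Laplacian
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        f = X[:, j] - X[:, j].mean()  # drop the trivial constant component
        denom = f @ (d * f)
        scores[j] = (f @ L @ f) / denom if denom > 0 else np.inf
    return scores

# Feature 0 follows the two blobs, feature 1 oscillates inside them
X = np.array([[0.0, 1.0], [0.1, -1.0], [5.0, 1.0], [5.1, -1.0]])
scores = normalized_cut_scores(X)
```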

class fsfc.generic.SPEC.SPEC(k)
Bases: fsfc.base.KBestFeatureSelector
SPEC family of feature selection algorithms.
Every algorithm of this family builds a graph representation of the distribution of samples in a high-dimensional space: samples become vertices, and the RBF distance between two samples becomes the weight of the edge connecting them. The algorithms use Spectral Graph Theory to compute feature scores.
Based on the article “Spectral feature selection for supervised and unsupervised learning.”.
Methods
fit(x, *rest)
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
set_params(**params): Set the parameters of this estimator.
transform(X): Reduce X to the selected features.
Lasso algorithm

class fsfc.generic.Lasso.Lasso(k, norm_constraint, eps=0.0001, max_iterations=100)
Bases: fsfc.base.ClusteringFeatureSelector
Feature selection and clustering algorithm exploiting the idea of the L1 norm.
Simultaneously clusters the data and computes the “importance” of each feature for that clustering.
Based on the article “A framework for feature selection in clustering.”
The algorithm clusters the data and selects features in the following way:
1. Assign equal weights to every feature
2. Find clusters according to the feature weights
3. Compute the objective of the method
4. Optimise the L1 penalty for the found objective
5. Compute new feature weights using the objective
6. If the new weights equal the old ones, stop; otherwise repeat steps 2-6
7. Select the top k features according to the weights
Parameters:
k: int
  Number of features to select
norm_constraint: float
  Constraint on the L1 norm
eps: float (default 1e-4)
  Precision of the algorithm
max_iterations: int (default 100)
  Maximal number of iterations of the algorithm
Methods
fit(x, *rest)
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
set_params(**params): Set the parameters of this estimator.
transform(X): Reduce X to the selected features.
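The weight-update at the heart of the loop follows the sparse-clustering framework of Witten and Tibshirani that the article describes: maximise the weighted objective subject to an L2 bound and an L1 constraint, which reduces to soft-thresholding with a threshold found by binary search. A minimal sketch of that single step, not the library code; `a` here stands for the per-feature between-cluster dispersion of the current clustering and is a hypothetical input for illustration.

```python
import numpy as np

def soft_threshold(a, delta):
    """Shrink each entry of a towards zero by delta."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def update_weights(a, norm_constraint):
    """Maximise w . a subject to ||w||_2 <= 1 and ||w||_1 <= norm_constraint.
    The solution is a normalised soft-thresholding of a; the threshold is
    found by binary search so the L1 constraint becomes (nearly) tight."""
    w = a / np.linalg.norm(a)
    if np.abs(w).sum() <= norm_constraint:
        return w                      # L1 constraint already satisfied
    lo, hi = 0.0, np.abs(a).max()
    for _ in range(50):               # binary search on the threshold
        mid = (lo + hi) / 2
        w = soft_threshold(a, mid)
        w = w / np.linalg.norm(w)
        if np.abs(w).sum() > norm_constraint:
            lo = mid                  # still too dense: raise the threshold
        else:
            hi = mid
    return w

# Feature 0 explains most of the between-cluster dispersion
a = np.array([10.0, 2.0, 0.5])
w = update_weights(a, norm_constraint=1.2)
```

The resulting weights concentrate on the most informative feature while keeping the L1 norm at the constraint, which is what drives feature selection in this class.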
LFSBSS algorithm

class fsfc.generic.LFSBSS.LFSBSS(clusters, max_iterations=100)
Bases: sklearn.base.BaseEstimator, sklearn.base.ClusterMixin
Localised Feature Selection Based on Scatter Separability.
Selects features and simultaneously builds a clustering in an iterative way. Every cluster has its own local set of selected features, and to predict the cluster of a point the input data is projected onto the subspace defined by that set.
For the sake of performance, this implementation does not take into account the importance of cluster overlap and unassigned points.
Based on the article “Localized feature selection for clustering.”.
The algorithm builds the clustering in the following way:
1. Find an initial clustering using the k-means algorithm
2. For each cluster:
   a. Find a feature that can be dropped to improve the Scatter Separability Score
   b. Recompute the clusters without this feature
   c. Find the cluster that is most similar to the current one
   d. Compute the normalised Scatter Separability Score for the two clusters, current and new
   e. If the score improved, drop the feature and update the clustering
3. Repeat step 2 until no changes are made
4. Return the found clusters
The algorithm predicts clusters for new points in the following way:
1. Project the point onto the feature subspace of each cluster
2. Find the cluster whose centre is closest to the projected point
Parameters:
clusters: int
  Number of clusters to find
max_iterations: int (default 100)
  Maximal number of iterations of the algorithm
Methods
fit(x): Fit the algorithm to the dataset, find clusters and a set of features for every cluster.
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
get_params([deep]): Get parameters for this estimator.
predict(x): Predict clusters for one sample.
set_params(**params): Set the parameters of this estimator.
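The Scatter Separability Score that the feature-dropping loop optimises can be illustrated by a global version of the criterion, trace(S_w^{-1} S_b): between-cluster scatter measured relative to within-cluster scatter. This is a sketch of the criterion only (global rather than localised per cluster, and using a pseudo-inverse for robustness), not the class's implementation.

```python
import numpy as np

def scatter_separability(X, labels):
    """trace(pinv(S_w) @ S_b): higher values mean clusters are further
    apart relative to their internal scatter."""
    overall_mean = X.mean(axis=0)
    d = X.shape[1]
    S_w = np.zeros((d, d))            # within-cluster scatter
    S_b = np.zeros((d, d))            # between-cluster scatter
    for c in np.unique(labels):
        pts = X[labels == c]
        mu = pts.mean(axis=0)
        centred = pts - mu
        S_w += centred.T @ centred
        diff = (mu - overall_mean).reshape(-1, 1)
        S_b += len(pts) * (diff @ diff.T)
    return float(np.trace(np.linalg.pinv(S_w) @ S_b))

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.1]])
good = scatter_separability(X, np.array([0, 0, 1, 1]))  # matches the blobs
bad = scatter_separability(X, np.array([0, 1, 0, 1]))   # mixes the blobs
```

A clustering that matches the true blobs scores far higher than one that mixes them, which is why improving this score guides both the clustering and the choice of features to drop.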
MCFS algorithm

class fsfc.generic.MCFS.MCFS(k, clusters, p=8, sigma=1, mode='default', alpha=0.01)
Bases: fsfc.base.KBestFeatureSelector
Multi-Cluster Feature Selection algorithm.
Uses the kNN graph of the samples in the dataset and Spectral Graph Theory to find the most explanatory features.
Based on the article “Unsupervised feature selection for multi-cluster data.”.
The algorithm selects features in the following way:
1. Compute the kNN graph for the dataset
2. Compute the heat matrix for this graph, along with its degree and Laplacian matrices
3. Solve the eigenproblem L y = lambda D y, and select the K smallest eigenvalues and their corresponding eigenvectors
4. Solve K regression problems, trying to predict every eigenvector from the dataset by regression
5. Compute the score of each feature using the found regression coefficients
6. Select the k features with the top scores
Parameters:
k: int
  Number of features to select
clusters: int
  Expected number of clusters
p: int (default 8)
  Number of nearest neighbours for the construction of the kNN graph
sigma: int (default 1)
  Coefficient for the computation of the heat matrix
mode: ‘default’ or ‘lasso’ (default ‘default’)
  Type of penalty for the method: with ‘default’ the algorithm uses no penalty, with ‘lasso’ it uses an L1 penalty
alpha: float (default 0.01)
  Importance of the penalty when mode=’lasso’
Methods
fit(x, *rest)
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
set_params(**params): Set the parameters of this estimator.
transform(X): Reduce X to the selected features.
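The pipeline above can be sketched end to end with numpy. This is a simplified illustration, not the library implementation: kernel width sigma fixed at 1, plain least squares in place of the ‘lasso’ mode, and a small neighbourhood size suited to the toy data.

```python
import numpy as np

def mcfs_scores(X, n_clusters, p=2):
    """Sketch of the MCFS pipeline: kNN heat graph, generalised
    eigenproblem L y = lambda D y, regression of each eigenvector on the
    (centred) features; a feature's score is its largest |coefficient|."""
    n = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.zeros((n, n))
    for i in range(n):
        neighbours = np.argsort(sq[i])[1:p + 1]        # p nearest neighbours
        W[i, neighbours] = np.exp(-sq[i, neighbours])  # heat kernel weights
    W = np.maximum(W, W.T)            # symmetrise the graph
    d = W.sum(axis=1)
    L = np.diag(d) - W
    # solve L y = lambda D y via the symmetrised form D^-1/2 L D^-1/2
    d_isqrt = 1.0 / np.sqrt(d)
    L_sym = d_isqrt[:, None] * L * d_isqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)
    Y = d_isqrt[:, None] * eigvecs[:, 1:n_clusters + 1]  # skip trivial vector
    Xc = X - X.mean(axis=0)           # centre so constant features drop out
    coefs, *_ = np.linalg.lstsq(Xc, Y, rcond=None)
    return np.abs(coefs).max(axis=1)

X = np.array([[0.0, 0.5], [0.1, 0.5], [5.0, 0.5], [5.1, 0.5]])
scores = mcfs_scores(X, n_clusters=2)
```

The blob-separating feature 0 earns a large regression coefficient for the cluster-indicating eigenvector, while the constant feature 1 gets none.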
Weighted K-means

class fsfc.generic.WKMeans.WKMeans(k, beta, eps=0.001, max_iterations=10)
Bases: fsfc.base.ClusteringFeatureSelector
Weighted K-means algorithm.
Assigns a weight parameter to every feature, runs N iterations of k-means, and adjusts the weights to explain the data better. Selects the k best features according to their weights.
Based on the article “Automated variable weighting in k-means type clustering.”.
The algorithm selects features in the following way:
1. Randomly assign values to the feature weights so that their sum equals 1
2. Find clusters using the samples multiplied by the weights raised to the power beta
3. Compute the score of the clustering
4. Recompute the weights using the approach from the article
5. Find a clustering using the new weights. If the score of the new clustering did not change, stop; otherwise go to step 2
6. Select the top k features according to their weights
Parameters:
k: int
  Number of features to select
beta: float
  Degree parameter for the feature weights
eps: float (default 1e-3)
  Precision of the algorithm
max_iterations: int (default 10)
  Maximal number of iterations of the algorithm
Methods
fit(x, *rest)
fit_predict(X[, y]): Performs clustering on X and returns cluster labels.
fit_transform(X[, y]): Fit to data, then transform it.
get_params([deep]): Get parameters for this estimator.
get_support([indices]): Get a mask, or integer index, of the features selected.
inverse_transform(X): Reverse the transformation operation.
set_params(**params): Set the parameters of this estimator.
transform(X): Reduce X to the selected features.
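The weight recomputation step has a closed form in the article by Huang et al.: features with small within-cluster dispersion D_j receive large weights, w_j = D_j^(-1/(beta-1)) / sum_t D_t^(-1/(beta-1)). A sketch of that single step (assuming beta > 1 and all D_j > 0), not the full iterative algorithm:

```python
import numpy as np

def update_feature_weights(X, labels, centroids, beta):
    """w_j = 1 / sum_t (D_j / D_t)^(1/(beta - 1)), where D_j is the
    within-cluster dispersion of feature j; the weights sum to 1."""
    D = np.zeros(X.shape[1])
    for c in range(centroids.shape[0]):
        pts = X[labels == c]
        D += ((pts - centroids[c]) ** 2).sum(axis=0)  # per-feature dispersion
    exponent = 1.0 / (beta - 1.0)
    inv = D ** -exponent              # small dispersion -> large weight
    return inv / inv.sum()

# Feature 0 is compact inside each cluster, feature 1 is noisy
X = np.array([[0.0, 3.0], [0.1, -3.0], [5.0, 3.0], [5.1, -3.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == 0].mean(axis=0), X[labels == 1].mean(axis=0)])
w = update_feature_weights(X, labels, centroids, beta=2.0)
```

After one update nearly all of the weight lands on the compact feature 0, so selecting the top-k weights discards the noisy feature.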