Feature Selection with PyPunisher

PyPunisher is a Python package that performs feature selection. Feature selection is a key step in the data science pipeline that reduces model complexity by keeping only the most relevant features from the original dataset. This lowers the risk of overfitting and limits the noise contributed by irrelevant features. PyPunisher takes a stepwise approach (as in stepwise regression) and offers two variants:

  • forward_selection(): Start with a null model and iteratively add useful features. Stop when adding a new feature no longer improves the model by a specified threshold, or if you have reached the pre-defined number of features to include in your model.
  • backward_elimination(): Start with a full model and iteratively remove the least useful feature at each step. Stop when removing a feature no longer improves the model by a specified threshold, or if you have reached the pre-defined number of features to include in your model.
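As a rough illustration of the forward variant, here is a minimal sketch written with NumPy only. It is not PyPunisher's implementation: the function names `aic_score` and `forward_select_sketch`, the use of ordinary least squares, and the exact meaning given to `min_change` here are all assumptions made for the example.

```python
import numpy as np

def aic_score(X, y, cols):
    """AIC (up to an additive constant) of a least-squares fit on the given columns."""
    n = len(y)
    # Intercept column plus the selected feature columns.
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + 2 * design.shape[1]

def forward_select_sketch(X, y, min_change=2.0):
    """Greedy forward selection: add the best feature until AIC stops improving enough.

    min_change is a hypothetical threshold: the smallest drop in AIC that
    justifies adding another feature.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    best = aic_score(X, y, selected)  # null model: intercept only
    while remaining:
        # Score every candidate model that adds one more feature.
        scores = {j: aic_score(X, y, selected + [j]) for j in remaining}
        j_best = min(scores, key=scores.get)
        if best - scores[j_best] < min_change:
            break  # no candidate improves the score by enough
        selected.append(j_best)
        remaining.remove(j_best)
        best = scores[j_best]
    return selected

# Synthetic data: only columns 0 and 2 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 5.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(size=200)
selected = forward_select_sketch(X, y, min_change=10.0)
print(selected)
```

Backward elimination mirrors this loop: start with every column selected and, at each step, drop the feature whose removal gives the best score.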

The Feature Selection Process, adapted from Mcdonagh et al. 2014

The process of feature selection is depicted in the figure above. The evaluation step uses a metric such as the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) to assess model quality. Both metrics penalize the number of features in a model, and the penalty is larger in BIC than in AIC for all but very small samples (BIC's per-feature penalty is ln(n), where n is the number of observations, versus a constant 2 for AIC). In general, a lower AIC or BIC score indicates a better fit to the data relative to competing models. At each iteration, a feature is added to the candidate subset (or removed from it, in the case of backward elimination), and the resulting model is evaluated using the chosen metric (AIC or BIC). If the metric improves by at least a specified amount, the change is kept and the candidate subset becomes the current best subset of features.
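To make the two penalties concrete, the sketch below computes both criteria for an ordinary least-squares fit, using the common Gaussian-likelihood form (up to an additive constant): AIC = n ln(RSS/n) + 2k and BIC = n ln(RSS/n) + k ln(n), where k is the number of fitted coefficients. The helper name `gaussian_ic` and the synthetic data are assumptions for illustration, and PyPunisher may compute its scores differently.

```python
import numpy as np

def gaussian_ic(X, y):
    """Return (AIC, BIC) for a least-squares fit of y on X, up to a shared constant."""
    n = len(y)
    design = np.column_stack([np.ones(n), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    k = design.shape[1]                        # coefficients, including the intercept
    fit_term = n * np.log(rss / n)             # identical in both criteria
    return fit_term + 2 * k, fit_term + k * np.log(n)

rng = np.random.default_rng(0)
n = 200
signal = rng.normal(size=(n, 1))   # feature that drives the response
noise = rng.normal(size=(n, 1))    # irrelevant feature
y = 3.0 * signal[:, 0] + rng.normal(size=n)

aic_small, bic_small = gaussian_ic(signal, y)
aic_big, bic_big = gaussian_ic(np.hstack([signal, noise]), y)

# Adding one feature costs 2 under AIC but ln(n) under BIC.
extra_penalty = (bic_big - bic_small) - (aic_big - aic_small)
print(extra_penalty, np.log(n) - 2)
```

Because the fit term is shared, the difference between the two criteria comes entirely from the penalty, which is why BIC tends to favor smaller models as n grows.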

The “stopping criterion” can be either a threshold that you define or a pre-defined number of features to include in your model. For example, if you set a threshold (min_change), the feature selection process stops when the AIC or BIC score no longer improves by at least that amount. Alternatively, if you want a specific number of features in your model, the process stops once it reaches the specified n_features.
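The two stopping rules can be sketched as a single check. The helper `should_stop` below is hypothetical and not part of PyPunisher. It assumes that lower scores are better (as with AIC and BIC) and that exactly one of `min_change` or `n_features` is supplied, which is a choice made for this example.

```python
def should_stop(prev_score, new_score, n_current, min_change=None, n_features=None):
    """Return True if a stepwise search should stop after the latest step.

    prev_score / new_score: criterion values before and after the step (lower is better).
    n_current: number of features currently in the model.
    """
    if (min_change is None) == (n_features is None):
        raise ValueError("specify exactly one of min_change or n_features")
    if n_features is not None:
        # Size-based rule: stop once the model has the requested number of features.
        return n_current == n_features
    # Threshold rule: stop when the improvement falls short of min_change.
    return (prev_score - new_score) < min_change

# Threshold rule: an improvement of 0.5 is below min_change=1.0, so stop.
print(should_stop(100.0, 99.5, n_current=2, min_change=1.0))
# Size rule: keep going until three features are in the model.
print(should_stop(100.0, 90.0, n_current=2, n_features=3))
```

The same check works for forward selection (where `n_current` grows toward `n_features`) and backward elimination (where it shrinks toward it).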

The GitHub repositories for these packages can be found here: PyPunisher and punisheR.