September 24, 2020

3 Ways To Select Features Using Machine Learning Algorithms In Python

Artificial intelligence, which gives machines the ability to think and behave like humans, has been gaining traction over the last decade. Much of what AI can do rests on its ability to make accurate predictions, and those predictions are powered by a technology known as machine learning (ML). Machine learning, as the name suggests, is a computer’s ability to learn new things and improve its functionality over time. Its main focus is the development of computer programs that can access data and use it to learn for themselves.

Machine learning algorithms are most commonly implemented in two programming languages, R and Python. Selecting features for training data in Python can be a complex and technical process, but here we will go over some basic techniques, along with what machine learning is and how it works. So, let us start with what ML is, what feature selection is, and how to select features using Python.

What is Machine Learning?

Machine learning is a computer’s ability to learn new things and improve its functionality based on what it has learned. For a more formal definition, Emerj, an AI research and insight company, defines it as follows:

Machine Learning is the science of getting computers to learn and act like humans do, and improve their learning over time in an autonomous fashion, by feeding them data and information in the form of observations and real-world interactions. 

How do Machine Learning Algorithms Work?

The process of learning begins with observing data in order to detect patterns, so that future decisions can be made more accurately.

The main aim of machine learning is for computers to learn as much as they can without human intervention or assistance. The following are the types of machine learning:

Supervised Machine Learning

Supervised machine learning algorithms apply what has been learned from labeled examples in the past to new data. This approach maps an input to an output based on example input-output pairs. It starts with the analysis of a known dataset, from which the learning algorithm produces an inferred function that maps inputs to predicted output values. Supervised learning algorithms can also compare their outputs with the intended outputs to find errors and adjust the model accordingly. In simpler terms, supervised learning is just the machine saying “train me!”.
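
As a rough illustration of the input-output mapping described above, here is a minimal sketch. It assumes scikit-learn is installed and uses synthetic data from make_classification rather than any dataset from this article; the variable names are only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labeled data (an assumption for this sketch, not the article's dataset)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn a mapping from inputs to labels using the example input-output pairs
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Compare the predictions with the intended outputs to measure the error
predictions = model.predict(X_test)
accuracy = (predictions == y_test).mean()
print("Accuracy on held-out data: %.3f" % accuracy)
```

The held-out split is what lets the model's outputs be checked against labels it has not seen during training.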

Unsupervised Machine Learning

Unsupervised learning algorithms are the opposite of supervised learning. In unsupervised learning, the data used to train the models is neither labeled nor classified. The main purpose of this approach is to draw inferences from unlabeled data in order to find structure and patterns in it. The system does not work out the right output; instead, it explores the dataset to describe its hidden structure. In short, it’s the machine saying “I’m self-sufficient in learning.”
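
To make the idea of finding structure in unlabeled data concrete, the sketch below clusters synthetic points with k-means. K-means is just one example of an unsupervised algorithm chosen for illustration, and both the data (make_blobs) and the choice of 3 clusters are assumptions of this sketch.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Unlabeled synthetic points; the true group labels are deliberately discarded
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# Ask k-means to group the points into 3 clusters without any labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Each point now carries a cluster id inferred purely from the data
print("Cluster ids found:", sorted(set(int(label) for label in labels)))
print("First ten assignments:", labels[:10])
```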

Semi-supervised Machine Learning

This approach combines both of the above: it uses a mix of labeled and unlabeled data. This is done in order to improve the accuracy of the system.

Reinforcement Machine Learning 

Reinforcement learning is learning by trial and error: an agent interacts with its environment through actions, which lead to errors or rewards. The algorithm allows the agent to automatically determine the best behavior within a specific context in order to maximize its performance. For the agent to learn which actions are best, it needs reward feedback, which is known as the reinforcement signal.
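
The trial-and-error loop can be sketched with a toy "multi-armed bandit" problem in plain Python. Everything here is a hypothetical setup for illustration: the three actions, their hidden reward probabilities, and the epsilon-greedy strategy are assumptions of this sketch, not something prescribed by the article.

```python
import random

random.seed(0)

# Hidden reward probability of each action; the agent never reads these directly
true_reward_prob = [0.2, 0.5, 0.8]
estimates = [0.0, 0.0, 0.0]  # the agent's running estimate of each action's reward
pulls = [0, 0, 0]            # how many times each action has been tried
epsilon = 0.1                # fraction of steps spent exploring at random

for step in range(5000):
    if random.random() < epsilon:
        arm = random.randrange(3)              # explore: try a random action
    else:
        arm = estimates.index(max(estimates))  # exploit: use the best-looking action

    # The environment answers with a reward of 1 or 0 (the reinforcement signal)
    reward = 1.0 if random.random() < true_reward_prob[arm] else 0.0

    # Update the running average reward for the chosen action
    pulls[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / pulls[arm]

best_arm = estimates.index(max(estimates))
print("Estimated rewards:", [round(e, 2) for e in estimates])
print("Action the agent prefers:", best_arm)
```

The agent is never told which action is best; it only sees rewards, and its estimates drift toward the hidden probabilities as it keeps trying actions.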

Feature selection for Python Machine Learning

So now that you know the types of machine learning algorithms, let us move on to how one can train their models. To start training your model you would require a dataset to begin with and you will have to extract features from it. Feature selection is the process of selecting features that might contribute the most to your output/prediction. This can be done automatically or manually. Selecting the correct features is important as the wrong ones could result in low accuracy. Feature selection should be done beforehand as it has the following benefits:

  • Overfitting reduction: less redundant data means fewer decisions based on noise.
  • Improved accuracy: less misleading data means better modeling accuracy.
  • Reduction in training time: less data means the algorithm trains faster.

The following are three ways to select features using Python. All of them use the Pima Indians onset of diabetes dataset, a binary classification problem in which all of the attributes are numeric. The Dataset File and the Dataset Details are attached.

Recursive Feature Elimination

RFE, or Recursive Feature Elimination, works by recursively removing attributes and building a model on the attributes that remain. In scikit-learn, RFE ranks the attributes by the importance the fitted model assigns to them (for example, its coefficients) and prunes the least important ones at each step, which shows which attributes contribute the most to predicting the target attribute. The following code snippet selects the top 3 features using logistic regression; the choice of algorithm does not matter much as long as it is skillful and consistent.

# RFE - Feature Extraction
from pandas import read_csv
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
# the first 8 columns are the input attributes, the last column is the class label
X = array[:,0:8]
Y = array[:,8]
# feature extraction
model = LogisticRegression(solver='lbfgs')
# n_features_to_select must be passed by keyword in recent scikit-learn versions
rfe = RFE(model, n_features_to_select=3)
fit = rfe.fit(X, Y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

RFE chose the top 3 features as preg, mass, and pedi:

Num Features: 3
Selected Features: [ True False False False False  True  True False]
Feature Ranking: [1 2 3 5 6 1 1 4]
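
The boolean mask and the ranking are easier to read once they are paired with the column names. The short sketch below does that in plain Python, copying the values from the output printed above; the variable names are only for illustration.

```python
# Attribute names in column order (the 'class' label column is left out)
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age']

# Values copied from the RFE output printed above
support = [True, False, False, False, False, True, True, False]
ranking = [1, 2, 3, 5, 6, 1, 1, 4]

# Keep only the names whose mask entry is True
selected = [name for name, keep in zip(names, support) if keep]
print("Selected:", selected)

# Sort attributes by rank; rank 1 marks a selected attribute
ranked = sorted(zip(ranking, names))
for rank, name in ranked:
    print(rank, name)
```

With a fitted RFE object, the same pairing works with fit.support_ and fit.ranking_ in place of the hard-coded lists.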

Principal Component Analysis

PCA uses linear algebra to transform the dataset into a compressed form, and is generally regarded as a data reduction technique. PCA lets you choose the number of dimensions, or principal components, in the transformed result. The following example uses PCA to select 3 principal components.

# PCA - Feature Extraction
from pandas import read_csv
from sklearn.decomposition import PCA
# load data
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.csv"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(url, names=names)
array = dataframe.values
# the first 8 columns are the input attributes, the last column is the class label
X = array[:,0:8]
Y = array[:,8]
# feature extraction (PCA is unsupervised, so Y is not used in the fit)
pca = PCA(n_components=3)
fit = pca.fit(X)
# summarize components
print("Explained Variance: %s" % fit.explained_variance_ratio_)
print(fit.components_)
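
Fitting PCA only learns the components; to actually obtain the compressed dataset you also need to transform the data. The sketch below shows that step together with the cumulative explained variance. It uses synthetic data from make_classification as a stand-in for the diabetes dataset, which is an assumption made so the example runs without a download.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in with 8 numeric attributes (not the article's dataset)
X, _ = make_classification(n_samples=200, n_features=8, random_state=0)

# Fit PCA and project the data onto the 3 principal components in one step
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

# Running total of the variance captured by the first 1, 2 and 3 components
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("Original shape:", X.shape)
print("Reduced shape:", X_reduced.shape)
print("Cumulative explained variance:", cumulative)
```

Note that the reduced columns are new combinations of the original attributes, so they bear little resemblance to the source data.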

Univariate Feature Selection

Statistical tests can be used to select the features that have the strongest relationship with the output variable. The scikit-learn library provides the SelectKBest class, which can be combined with a suite of different statistical tests to select a specific number of features. The following snippet uses the ANOVA F-value (f_classif) to select the 4 best features.

# Univariate Statistical Tests - Feature Selection
from pandas import read_csv
from numpy import set_printoptions
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif
# load data (expects a local copy of the dataset file)
filename = 'pima-indians-diabetes.data.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
# the first 8 columns are the input attributes, the last column is the class label
X = array[:,0:8]
Y = array[:,8]
# feature extraction
test = SelectKBest(score_func=f_classif, k=4)
fit = test.fit(X, Y)
# summarize scores (one F-value per attribute; higher means a stronger relationship)
set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features (first 5 rows of the 4 retained columns)
print(features[0:5,:])
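
The raw score array does not say which columns were kept. The sketch below pairs the scores with column names and reads the selection from get_support(). It runs on synthetic data from make_classification with placeholder names f0 to f7, both of which are assumptions made so the example is self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Placeholder names for 8 synthetic attributes (not the article's dataset)
names = ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7']
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_redundant=0, random_state=0
)

# Score every attribute with the ANOVA F-test and keep the 4 best
selector = SelectKBest(score_func=f_classif, k=4).fit(X, y)

# get_support() returns a boolean mask over the original columns
mask = selector.get_support()
selected_names = [name for name, keep in zip(names, mask) if keep]

# Print the attributes from the highest score to the lowest
for name, score in sorted(zip(names, selector.scores_), key=lambda p: p[1], reverse=True):
    print("%s: %.3f" % (name, score))
print("Selected:", selected_names)
```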

Conclusion

Machine learning is what helps AI predict outcomes and make intelligent decisions based on them. All of this is done by training a model on a dataset, and that training starts with selecting relevant and important features. The three techniques above can be used to select features from the dataset of your choice.
