PHPFixing

Saturday, October 8, 2022

[FIXED] How can I drop low-correlated features?

correlation, machine-learning, python, python-3.x, statistics

Issue

I am writing preprocessing code for my LSTM training. My CSV contains more than 30 variables. After applying some EDA techniques, I found that half of the features can be dropped because they have no effect on training.

Right now I am dropping such features manually by using pandas.

I want to write code that drops such features automatically. I wrote code to visualize a heat map and the correlations like this:

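# This method is part of my preprocessing class.
# self.data is a DataFrame that contains all of the CSV data.
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

def calculateCorrelationByPearson(self):
    columns = self.data.columns
    # Heat map of the pairwise Pearson correlations
    plt.figure(figsize=(12, 8))
    sns.heatmap(data=self.data.corr(method='pearson'), annot=True, fmt='.2f',
                linewidths=0.5, cmap='Blues')
    plt.show()
    # Spearman correlation of each column with the target column 'total'
    for column in columns:
        # Use the single column here, not the whole `columns` index
        corr = stats.spearmanr(self.data['total'], self.data[column])
        print(f'{column} - corr coefficient:{corr[0]}, p-value:{corr[1]}')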

This gives me a clear view of my features and their relationships with each other.

Now I want to drop the columns that are not important, say those with a correlation below 0.4.

How can I apply this logic in my code?


Solution

Here is an approach to remove variables whose correlation coefficient with the target variable is below some threshold:

import pandas as pd
from scipy.stats import spearmanr

data = pd.DataFrame([
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 3, "C": 1},
    {"A": 3, "B": 4, "C": 0},
    {"A": 4, "B": 4, "C": 1},
    {"A": 5, "B": 6, "C": 2},
])
targetVar = "A"
corr_threshold = 0.4

# With more than two columns, spearmanr returns the full correlation matrix as its first element
corr = spearmanr(data)
targetIdx = data.columns.get_loc(targetVar)  # position of the target column, instead of assuming it is column 0
corrSeries = pd.Series(corr[0][:, targetIdx], index=data.columns)  # Series with column names and their correlation with the target
corrSeries = corrSeries[(corrSeries.index != targetVar) & (corrSeries > corr_threshold)]  # apply the threshold

vars_to_keep = list(corrSeries.index.values)  # list of variables to keep
vars_to_keep.append(targetVar)  # add the target variable back in
data2 = data[vars_to_keep]
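
Note that the comparison above keeps only features whose coefficient is above the threshold, so a strongly negative correlation would be dropped as well. If that is not what you want, one possible variation is to compare the absolute value instead. The sketch below wraps that idea in a helper; the function name drop_low_correlated and its parameters are made up for illustration, and it relies on pandas' own DataFrame.corr rather than scipy. It also assumes that non-numeric columns should simply be left out of the result.

```python
import pandas as pd


def drop_low_correlated(df, target, threshold=0.4, method="spearman"):
    """Keep the target plus the numeric columns whose absolute
    correlation with the target is at least `threshold`.

    Hypothetical helper for illustration, not part of pandas.
    """
    # Only numeric columns can be correlated; others are ignored (assumption).
    numeric = df.select_dtypes(include="number")
    # Column `target` of the correlation matrix, without the target itself.
    corr_with_target = numeric.corr(method=method)[target].drop(target)
    # Compare absolute values so strong negative correlations are kept too.
    keep = corr_with_target[corr_with_target.abs() >= threshold].index.tolist()
    return df[keep + [target]]


data = pd.DataFrame({
    "A": [1, 2, 3, 4, 5],
    "B": [2, 3, 4, 4, 6],
    "C": [3, 1, 0, 1, 2],
})
reduced = drop_low_correlated(data, target="A", threshold=0.4)
print(list(reduced.columns))  # ['B', 'A']
```

Inside a preprocessing class like the one in the question, the same call could be applied to self.data with target="total", for example self.data = drop_low_correlated(self.data, "total").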


Answered By - BioData41
Answer Checked By - Marie Seifert (PHPFixing Admin)