Issue

Let's consider the IBM HR Attrition Dataset from Kaggle (https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset). How do I quickly get the variable with the highest Shapiro p-value?

In other words, I can apply the function shapiro() to a single column as shapiro(df['column']), and I would like to compute it for all the numeric columns.

I tried this:

import pandas as pd
from scipy.stats import shapiro

hr = pd.read_csv('path')

# Here I was expecting the output to be sequential prints of each column name and its p-value from shapiro()
for col in hr:
    print(col, " : ", shapiro(hr[col])[1])  # index 1 is the p-value, index 0 is the W statistic

Could anyone help with this?

Thanks in advance.

Solution

I hope this helps! I'm sure there are a lot better ways, but it was fun trying :)

import pandas as pd
from scipy.stats import shapiro

df = pd.read_csv('path.csv')

# make a new dataframe newdf with only the columns containing numeric data

numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']

newdf = df.select_dtypes(include=numerics)

#check to see that the columns are only numeric
print(newdf.head())

# new dataframe with rows "W" and "P" and one column per numeric variable
shapiro_wilks = newdf.apply(lambda x: pd.Series(shapiro(x), index=['W', 'P']))

print(shapiro_wilks)
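
To get from that table to the single variable with the highest p-value, something like the following might work. This is only a sketch: the small synthetic frame stands in for the HR data, the helper name shapiro_table is made up for illustration, and dropping missing values before the test is my assumption, not something the question asked for.

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

# Hypothetical stand-in for the HR data: two numeric columns and one text column.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "normal_like": rng.normal(size=200),
    "skewed": rng.exponential(size=200),
    "label": ["a", "b"] * 100,
})


def shapiro_table(frame):
    """Return one row per numeric column with the Shapiro-Wilk W and p-value."""
    numeric = frame.select_dtypes(include="number")
    rows = {}
    for col in numeric.columns:
        # Assumption: missing values are dropped before running the test.
        w, p = shapiro(numeric[col].dropna())
        rows[col] = {"W": w, "P": p}
    return pd.DataFrame.from_dict(rows, orient="index")


results = shapiro_table(df)

# Columns ranked from highest to lowest p-value.
print(results.sort_values("P", ascending=False))

# Name of the column with the highest p-value.
print("Highest p-value:", results["P"].idxmax())
```

Putting the columns on the rows (instead of "W" and "P" on the rows) makes sorting and idxmax() a one-liner, which is why the sketch builds the table that way.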

Answered By - LynneKLR

Answer Checked By - Pedro (PHPFixing Volunteer)

Saturday, October 8, 2022

[FIXED] How to operate a function over multiple columns (Pandas/Python)?
