Issue
I have a large dataframe, used not for time series but for a binary classification task. It contains two important feature columns that have more than 60% NaN values. Instead of removing those columns or shrinking the dataframe, are there other ways to resample the data, removing those NaNs or substituting them with synthetic values? I was thinking about the SMOTE package, but I know it's used for unbalanced datasets, not for NaNs. Could I use interpolation through NN, or would I risk generating misleading data?
Solution
There is no clear answer to this: it depends a lot on your data. If the two columns are as "important" as you say, how can they be so empty? What leads you to consider them important? You can easily fill them with fillna or any aggregating function (the mean, for instance), but whether that is sensible depends on the domain. You can also resort to SMOTE, but be sure you have enough data for it to generate sensible outputs, and keep in mind it addresses class imbalance rather than missingness.
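To make the fillna suggestion concrete, here is a minimal sketch of three common imputation options. The dataframe and column names (`feat_a`, `feat_b`, `feat_c`) are made up for illustration, and the nearest-neighbour imputer is only one possible reading of the "interpolation through NN" idea from the question:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Hypothetical dataframe with two heavily-missing feature columns
df = pd.DataFrame({
    "feat_a": [1.0, np.nan, np.nan, 4.0, np.nan, 6.0],
    "feat_b": [np.nan, 2.0, np.nan, np.nan, 5.0, np.nan],
    "feat_c": [0.5, 0.7, 0.2, 0.9, 0.4, 0.8],  # fully observed column
})

# Option 1: fill with a simple aggregate (here the column mean)
df_mean = df.fillna(df.mean(numeric_only=True))

# Option 2: scikit-learn's SimpleImputer, convenient inside a Pipeline
simple = SimpleImputer(strategy="median")
df_simple = pd.DataFrame(simple.fit_transform(df), columns=df.columns)

# Option 3: nearest-neighbour imputation, which estimates each missing
# value from the k most similar rows based on the observed columns
knn = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(knn.fit_transform(df), columns=df.columns)

print(df_knn)
```

With more than 60% of the values missing, any imputer will dominate those columns with synthetic values, so it is worth cross-validating the classifier both with and without the imputed columns to check whether they actually help.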
Answered By - rikyeah