Issue
I have large CSVs (~100k rows x 30 cols). Occasionally the data has runs of NaN values, of various sizes, which span sections of the df. I need to drop the NaNs but also ~3 data points either side, because the non-NaN data either side is borked.
One could drop any row containing a NaN, but this would throw away more data than necessary.
How can I do this with Python? The data has been loaded into a df.
Solution
Use:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col': ['a', 'b', 'c', np.nan, 'd', 'e', np.nan, 's', 'r'],
                   'col1': 4})
print(df)
   col  col1
0    a     4
1    b     4
2    c     4
3  NaN     4
4    d     4
5    e     4
6  NaN     4
7    s     4
8    r     4
# flag rows with at least one missing value
m = df.isna().any(axis=1)
# also flag the row directly above and below each flagged row; chain the masks with | (bitwise OR)
# then invert the combined mask with ~ and filter by boolean indexing
df = df[~(m | m.shift(fill_value=False) | m.shift(-1, fill_value=False))]
print(df)
   col  col1
0    a     4
1    b     4
8    r     4
Alternative solution:
m = df.notna().all(axis=1)
df = df[(m & m.shift(fill_value=True) & m.shift(-1, fill_value=True))]
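Both variants above remove only one row on each side of a NaN row, whereas the question asks for about three. A minimal sketch of one way to widen the margin follows; the helper name `drop_nan_with_margin` and its parameter `n` are hypothetical and not part of the original answer. It assumes the rows are already in the order that matters, since `shift` works by position.

```python
import numpy as np
import pandas as pd


def drop_nan_with_margin(df, n=3):
    """Drop rows containing NaN plus n rows on either side (hypothetical helper)."""
    # True for rows with at least one missing value
    m = df.isna().any(axis=1)
    widened = m.copy()
    # OR in the mask shifted by 1..n rows in both directions
    for k in range(1, n + 1):
        widened |= m.shift(k, fill_value=False) | m.shift(-k, fill_value=False)
    return df[~widened]


# Illustrative data: a single NaN at index 5
demo = pd.DataFrame({'col': [0, 1, 2, 3, 4, np.nan, 6, 7, 8, 9, 10, 11],
                     'col1': 4})
result = drop_nan_with_margin(demo, n=3)
print(result.index.tolist())  # [0, 1, 9, 10, 11]
```

With `n=1` this should reduce to the one-row mask used in the answer. For very large `n`, a rolling window over the mask might be cheaper than a loop of shifts, but for a margin of about three rows the loop is probably simple enough.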
Answered By - jezrael
Answer Checked By - Katrina (PHPFixing Volunteer)