Issue
I am working on a project where I have to handle a lot of diagnoses. Whatever the purpose, I think the code below is correct, but it takes too much time (~1h) and it always shows me warnings. Is there anything that I am not doing right? Thank you in advance.
# Only the first 3 characters of each code matter
diagnoses_sec = df[['Diagnóstico 2', 'Diagnóstico 3', 'Diagnóstico 4', 'Diagnóstico 5', 'Diagnóstico 6',
                    'Diagnóstico 7', 'Diagnóstico 8', 'Diagnóstico 9', 'Diagnóstico 10', 'Diagnóstico 11', 'Diagnóstico 12',
                    'Diagnóstico 13', 'Diagnóstico 14', 'Diagnóstico 15', 'Diagnóstico 16', 'Diagnóstico 17', 'Diagnóstico 18',
                    'Diagnóstico 19', 'Diagnóstico 20']]
for i in range(0, diagnoses_sec.shape[1]):
    diagnoses_sec.iloc[:,i].fillna("ZZZ", inplace = True)
    diagnoses_sec.iloc[:,i] = diagnoses_sec.iloc[:,i].str.slice(start=0, stop=3, step=1)
In this part, there is a warning, but I can't understand why:
C:\Users\Asus\Anaconda3\lib\site-packages\pandas\core\indexing.py:630: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self.obj[item_labels[indexer[info_axis]]] = value
The second part of the code is this:
from bisect import bisect_left
diag_icd10_ranges = ["B99","D49","D89","E89","F99","G99","H59","H95",
"I99","J99","K95", "L99", "M99", "N99","O9A","P96","Q99",
"R99","T88","Y99","Z99","ZZZ"]
diag_icd10_dict = {0: 'infectious_icd10d', 1: 'neoplasms_icd10d', 2: 'blood_icd10d', 3: 'endocrine_icd10d',
4: 'mental_icd10d', 5: 'nervous_icd10d', 6: 'eye_icd10d', 7: 'ear_icd10d',
8: 'circulatory_icd10d', 9: 'respiratory_icd10d', 10: 'digestive_icd10d', 11: 'skin_icd10d',
12: 'musculo_icd10d', 13: 'genitourinary_icd10d', 14: 'pregnancy_icd10d', 15: 'perinatalperiod_icd10d',
16: 'congenital_icd10d',
17: 'abnormalfindings_icd10d', 18:'injury_icd10d', 19:'morbidity', 20:'healthstatus', 21:'Nan_Category'}
# function to categorize every patient
def icdGroup(code): return bisect_left(diag_icd10_ranges,code)
# loop to categorise every patient in every diagnosis column
for i_diag_sec in range(0,diagnoses_sec.shape[1]):
    for i_within_diag_sec in range(0, len(diagnoses_sec)):
        diagnoses_sec.iloc[i_within_diag_sec,i_diag_sec] = icdGroup(diagnoses_sec.iloc[i_within_diag_sec,i_diag_sec])
And once again I have another warning:
C:\Users\Asus\Anaconda3\lib\site-packages\ipykernel_launcher.py:20: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
Solution
You are getting these SettingWithCopyWarning messages because diagnoses_sec is a copy of part of df; setting values on this copy raises a warning to make sure that you are aware that your changes will not propagate back to df. These warnings will disappear if you explicitly make a copy using the copy method, e.g.:
diagnoses_sec = df[['Diagnóstico 2', 'Diagnóstico 3']].copy()
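As a minimal sketch of what the explicit copy means in practice, the snippet below uses a small made-up DataFrame (the values are hypothetical, not your data) and shows that edits to the copy leave df unchanged:

```python
import pandas as pd

# Hypothetical toy data standing in for the real df
df = pd.DataFrame({
    "Diagnóstico 1": ["A010", "C509"],
    "Diagnóstico 2": ["B201", None],
    "Diagnóstico 3": [None, "E119"],
})

# Explicit copy: diagnoses_sec is an independent DataFrame,
# so assigning to it does not trigger SettingWithCopyWarning
diagnoses_sec = df[["Diagnóstico 2", "Diagnóstico 3"]].copy()
diagnoses_sec["Diagnóstico 2"] = diagnoses_sec["Diagnóstico 2"].fillna("ZZZ")

# Count missing values in the original and in the copy
original_missing = int(df["Diagnóstico 2"].isna().sum())
copy_missing = int(diagnoses_sec["Diagnóstico 2"].isna().sum())
print(original_missing, copy_missing)
```

If you do want the cleaned values back in df, you would have to assign them back explicitly, for example to the same column names.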
Regarding the time taken to execute your code, iterating over pandas DataFrames in this way is inefficient; you should strive to use vectorized operations, which apply a function or operation to an entire array at once.
You could modify your first example to do this:
diagnoses_sec = df[['Diagnóstico 2', 'Diagnóstico 3', 'Diagnóstico 4', 'Diagnóstico 5', 'Diagnóstico 6',
'Diagnóstico 7', 'Diagnóstico 8', 'Diagnóstico 9', 'Diagnóstico 10', 'Diagnóstico 11', 'Diagnóstico 12',
'Diagnóstico 13', 'Diagnóstico 14', 'Diagnóstico 15', 'Diagnóstico 16', 'Diagnóstico 17', 'Diagnóstico 18',
'Diagnóstico 19', 'Diagnóstico 20']].copy()
diagnoses_sec.fillna("ZZZ", inplace=True)
diagnoses_sec = diagnoses_sec.apply(lambda x: x.str.slice(start=0, stop=3, step=1))
Here, fillna is applied to the entire DataFrame and will replace every NA value with "ZZZ". In the second operation, apply will, via a lambda function, perform the string slicing operation on each column (Series) of your diagnoses_sec DataFrame.
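To see the two steps on something small, here is a sketch with a hypothetical two-column frame (the codes are invented for illustration). It assigns the result of fillna instead of using inplace=True, which is only a stylistic choice:

```python
import pandas as pd

# Hypothetical toy data: full-length codes with some missing values
df = pd.DataFrame({
    "Diagnóstico 2": ["B201", None, "J4590"],
    "Diagnóstico 3": [None, "E119", "I10"],
})

diagnoses_sec = df[["Diagnóstico 2", "Diagnóstico 3"]].copy()

# Step 1: replace every missing value in the whole frame at once
diagnoses_sec = diagnoses_sec.fillna("ZZZ")

# Step 2: keep the first 3 characters, one column (Series) at a time
diagnoses_sec = diagnoses_sec.apply(lambda col: col.str.slice(0, 3))

result = diagnoses_sec.values.tolist()
print(result)
```

Each column is processed in a single call, so the number of Python-level iterations is the number of columns (19 in your case) and not the number of cells.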
Your second case is similar. Since your icdGroup function is not vectorized (it doesn't operate on a DataFrame or Series) and is being applied to every cell of your DataFrame, you can use applymap to execute it on every value (in pandas 2.1 and later, applymap is deprecated in favor of the equivalent DataFrame.map):
diagnoses_sec = diagnoses_sec.applymap(icdGroup)
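A small end-to-end sketch of this step follows, again on hypothetical three-character codes. It picks DataFrame.map when that method exists (pandas 2.1 and later) and falls back to applymap otherwise; the elementwise name is just a helper introduced for this example:

```python
from bisect import bisect_left

import pandas as pd

# Upper bounds of the ICD-10 chapters, as in the question
diag_icd10_ranges = ["B99", "D49", "D89", "E89", "F99", "G99", "H59", "H95",
                     "I99", "J99", "K95", "L99", "M99", "N99", "O9A", "P96", "Q99",
                     "R99", "T88", "Y99", "Z99", "ZZZ"]


def icdGroup(code):
    # Index of the first range whose upper bound is >= code
    return bisect_left(diag_icd10_ranges, code)


# Hypothetical codes, already filled and truncated to 3 characters
codes = pd.DataFrame({
    "Diagnóstico 2": ["B20", "ZZZ", "J45"],
    "Diagnóstico 3": ["ZZZ", "E11", "I10"],
})

# Use DataFrame.map where available, applymap on older pandas versions
elementwise = codes.map if hasattr(codes, "map") else codes.applymap
groups = elementwise(icdGroup)

print(groups.values.tolist())
```

The resulting integers are the keys of diag_icd10_dict from the question, so for instance 21 corresponds to 'Nan_Category'.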
Answered By - dspencer Answer Checked By - Gilberto Lyons (PHPFixing Admin)