Thursday, April 28, 2022

[FIXED] Why does my pandas code raise assignment warnings and run slowly?

Issue

I am working on a project where I have to handle a lot of diagnoses. Whatever the purpose, I think the code below is correct, but it takes too much time (~1h) and it always shows warnings. Is there anything I am not doing right? Thank you in advance.

# Only the first 3 characters of each value matter
diagnoses_sec = df[['Diagnóstico 2', 'Diagnóstico 3', 'Diagnóstico 4', 'Diagnóstico 5', 'Diagnóstico 6',
          'Diagnóstico 7', 'Diagnóstico 8', 'Diagnóstico 9', 'Diagnóstico 10', 'Diagnóstico 11', 'Diagnóstico 12', 
          'Diagnóstico 13', 'Diagnóstico 14', 'Diagnóstico 15', 'Diagnóstico 16', 'Diagnóstico 17', 'Diagnóstico 18', 
          'Diagnóstico 19', 'Diagnóstico 20']]
for i in range(0, diagnoses_sec.shape[1]):
    diagnoses_sec.iloc[:,i].fillna("ZZZ", inplace = True)
    diagnoses_sec.iloc[:,i] = diagnoses_sec.iloc[:,i].str.slice(start=0, stop=3, step=1)

In this part, there is a warning, but I can't understand why:

C:\Users\Asus\Anaconda3\lib\site-packages\pandas\core\indexing.py:630: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  self.obj[item_labels[indexer[info_axis]]] = value

The second part of the code is this:

from bisect import bisect_left

diag_icd10_ranges = ["B99","D49","D89","E89","F99","G99","H59","H95",
          "I99","J99","K95", "L99", "M99", "N99","O9A","P96","Q99",
          "R99","T88","Y99","Z99","ZZZ"]

diag_icd10_dict = {0: 'infectious_icd10d', 1: 'neoplasms_icd10d', 2: 'blood_icd10d', 3: 'endocrine_icd10d',
                   4: 'mental_icd10d', 5: 'nervous_icd10d', 6: 'eye_icd10d', 7: 'ear_icd10d',
                   8: 'circulatory_icd10d', 9: 'respiratory_icd10d', 10: 'digestive_icd10d', 11: 'skin_icd10d', 
                  12: 'musculo_icd10d', 13: 'genitourinary_icd10d', 14: 'pregnancy_icd10d', 15: 'perinatalperiod_icd10d', 
                  16: 'congenital_icd10d',
                  17: 'abnormalfindings_icd10d', 18:'injury_icd10d', 19:'morbidity', 20:'healthstatus', 21:'Nan_Category'}

# function to categorize every patient
def icdGroup(code): return bisect_left(diag_icd10_ranges,code)

# loop to categorise every patient in every diagnosis column
for i_diag_sec in range(0,diagnoses_sec.shape[1]):
    for i_within_diag_sec in range(0, len(diagnoses_sec)):
        diagnoses_sec.iloc[i_within_diag_sec,i_diag_sec] = icdGroup(diagnoses_sec.iloc[i_within_diag_sec,i_diag_sec])

And once again I have another warning:

C:\Users\Asus\Anaconda3\lib\site-packages\ipykernel_launcher.py:20: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Solution

You get these SettingWithCopyWarning messages because diagnoses_sec was created by selecting part of df, so pandas cannot tell whether you meant to modify df itself or only the selection. Setting values on the selection raises the warning to make sure you are aware that your changes will not propagate back to df. The warnings disappear if you explicitly make a copy using the copy method, e.g.:

diagnoses_sec = df[['Diagnóstico 2', 'Diagnóstico 3']].copy()
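As a minimal sketch of that behaviour, the snippet below uses a small made-up df (the values and the extra "Other" column are assumptions for illustration, not the asker's data). It shows that after .copy() the selection can be modified freely while df stays untouched:

```python
import pandas as pd

# Hypothetical toy data standing in for the real df
df = pd.DataFrame({
    "Diagnóstico 2": ["A001", None],
    "Diagnóstico 3": ["J189", "I10"],
    "Other": [1, 2],
})

# Explicit copy: diagnoses_sec is now an independent DataFrame
diagnoses_sec = df[["Diagnóstico 2", "Diagnóstico 3"]].copy()

# Assign the result back instead of chaining inplace=True on a selection
diagnoses_sec["Diagnóstico 2"] = diagnoses_sec["Diagnóstico 2"].fillna("ZZZ")

# The original df still holds the missing value
original_still_has_nan = bool(df["Diagnóstico 2"].isna().any())
print(diagnoses_sec)
print("df still has NaN:", original_still_has_nan)
```

If the changes are meant to end up in df, they have to be assigned back to df explicitly, since the copy is independent.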

Regarding the time taken to execute your code, iterating over pandas DataFrames in this way is inefficient and you should strive to use vectorized operations, applying a function or operation to an entire array.

You could modify your first example to do this:

diagnoses_sec = df[['Diagnóstico 2', 'Diagnóstico 3', 'Diagnóstico 4', 'Diagnóstico 5', 'Diagnóstico 6',
          'Diagnóstico 7', 'Diagnóstico 8', 'Diagnóstico 9', 'Diagnóstico 10', 'Diagnóstico 11', 'Diagnóstico 12', 
          'Diagnóstico 13', 'Diagnóstico 14', 'Diagnóstico 15', 'Diagnóstico 16', 'Diagnóstico 17', 'Diagnóstico 18', 
          'Diagnóstico 19', 'Diagnóstico 20']].copy()
diagnoses_sec.fillna("ZZZ", inplace=True)
diagnoses_sec = diagnoses_sec.apply(lambda x: x.str.slice(start=0, stop=3, step=1))

Here, fillna is applied to the entire DataFrame and will replace every NA value with "ZZZ". In the second operation, apply will, via a lambda function, perform the string slicing operation on each column (Series) of your diagnoses_sec DataFrame.
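To see the two steps on something small, here is a sketch with a made-up three-row frame (the codes are assumed sample values). It uses the shorthand col.str[:3], which should be equivalent to str.slice(start=0, stop=3, step=1):

```python
import pandas as pd

# Hypothetical sample codes; None marks a missing diagnosis
toy = pd.DataFrame({
    "Diagnóstico 2": ["A0101", None, "J189"],
    "Diagnóstico 3": [None, "I10", "E119"],
})

# Step 1: replace every missing value in the whole frame at once
filled = toy.fillna("ZZZ")

# Step 2: keep only the first three characters, one column (Series) at a time
truncated = filled.apply(lambda col: col.str[:3])

print(truncated)
```

Neither step loops over rows in Python, which is where most of the time saving over the cell-by-cell version comes from.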

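Your second case is similar. Your icdGroup function is not vectorized (it takes a single code, not a DataFrame or Series), so it has to be called once per cell. Instead of looping over positions with iloc, you can use applymap to execute it on every value (note that in pandas 2.1 and later applymap is deprecated and has been renamed to DataFrame.map):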

diagnoses_sec = diagnoses_sec.applymap(icdGroup)
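A version-independent way to get the same element-wise result is to call Series.map on each column. The sketch below assumes a small made-up frame of already truncated codes and reuses the ranges and category names from the question; the helper name icd_group and the list diag_icd10_names are introduced here only for illustration:

```python
from bisect import bisect_left

import pandas as pd

diag_icd10_ranges = ["B99", "D49", "D89", "E89", "F99", "G99", "H59", "H95",
                     "I99", "J99", "K95", "L99", "M99", "N99", "O9A", "P96",
                     "Q99", "R99", "T88", "Y99", "Z99", "ZZZ"]

# Category names in the same order as the question's diag_icd10_dict
diag_icd10_names = [
    "infectious_icd10d", "neoplasms_icd10d", "blood_icd10d", "endocrine_icd10d",
    "mental_icd10d", "nervous_icd10d", "eye_icd10d", "ear_icd10d",
    "circulatory_icd10d", "respiratory_icd10d", "digestive_icd10d", "skin_icd10d",
    "musculo_icd10d", "genitourinary_icd10d", "pregnancy_icd10d",
    "perinatalperiod_icd10d", "congenital_icd10d", "abnormalfindings_icd10d",
    "injury_icd10d", "morbidity", "healthstatus", "Nan_Category",
]
diag_icd10_dict = dict(enumerate(diag_icd10_names))


def icd_group(code):
    """Return the index of the first range upper bound that is >= code."""
    return bisect_left(diag_icd10_ranges, code)


# Hypothetical codes, already filled and cut to three characters
codes = pd.DataFrame({
    "Diagnóstico 2": ["A01", "ZZZ", "J18"],
    "Diagnóstico 3": ["ZZZ", "I10", "E11"],
})

# Apply the scalar function to every cell, column by column
groups = codes.apply(lambda col: col.map(icd_group))

# Optionally translate the numeric group into its category name
labels = groups.apply(lambda col: col.map(diag_icd10_dict))

print(groups)
print(labels)
```

This still calls the Python function once per cell, so it is not truly vectorized, but it avoids the very slow cell-by-cell iloc assignment from the question.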


Answered By - dspencer
Answer Checked By - Gilberto Lyons (PHPFixing Admin)
