Issue
I have a dataset like below. In this dataset, there are different colored thermometers, and given a 'True' or reference temperature, how different they measure according to some measurement methods 'Method 1' and 'Method 2'.
I am having trouble calculating two important parameters that I need which are Mean Absolute Error (MAE) and Mean Signed Error (MSE). I want to use the non-NaN values for each method and print the result.
I was able to get the to a point where I can return a two column series of index and sum, but the problem in this case is that I need to divide by the number of method values summed, which changes depending on how many NaN's there are in a row. And I do NOT want to just skip an entire row just because there is an NaN in it.
number | date | Thermometer | True Temperature | Method 1 | Method 2 |
---|---|---|---|---|---|
0 | 1/1/2021 | red | 0.2 | 0.2 | 0.5 |
1 | 1/1/2021 | red | 0.6 | 0.6 | 0.3 |
2 | 1/1/2021 | red | 0.4 | 0.6 | 0.23 |
3 | 1/1/2021 | green | 0.2 | 0.4 | NaN |
4 | 1/1/2021 | green | 1 | 1 | 0.23 |
5 | 1/1/2021 | yellow | 0.4 | 0.4 | 0.32 |
6 | 1/1/2021 | yellow | 0.1 | NaN | 0.4 |
7 | 1/1/2021 | yellow | 1.3 | 0.5 | 0.54 |
8 | 1/1/2021 | yellow | 1.5 | 0.5 | 0.43 |
9 | 1/1/2021 | yellow | 1.5 | 0.5 | 0.43 |
10 | 1/1/2021 | blue | 0.4 | 0.3 | NaN |
11 | 1/1/2021 | blue | 0.8 | 0.2 | 0.11 |
My Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('default'
data = pd.read_csv('data.txt', index_col=0)
data
data["M1_ABS_Error"]= abs(data["True_Temperature"]-data["Method_1"])
data["M2_ABS_Error"]= abs(data["True_Temperature"]-data["Method_2"])
MAE_Series=data[['Name', 'M1_ABS_Error', 'M2_ABS_Error' ]]
MAE_Series.sum(axis=1, skipna=True)
but output is something like this at the moment, which doesn't specify which color thermometer this belongs to, and I would like this to print out in a way that is easy to associate it with which it belongs. Also, as I mentioned, this does not yet account for how to divide by the number of values/methods in the given row to account for NaN. :
0 4.94
1 3.03
2 11.88
3 3.28
4 8.14
5 7.80
6 2.76
7 2.71
I would appreciate your help on this. Thanks!
Solution
Edit
I think I understand now, let me know if this is what you want
MAE:
df['MAE'] = df[['M1_ABS_Error','M2_ABS_Error']].mean(axis = 1)
df
produces
date Thermometer True_Temperature Method_1 Method_2 M1_ABS_Error M2_ABS_Error MAE
-- -------- ------------- ------------------ ---------- ---------- -------------- -------------- -----
0 1/1/2021 red 0.2 0.2 0.5 0 0.3 0.15
1 1/1/2021 red 0.6 0.6 0.3 0 0.3 0.15
2 1/1/2021 red 0.4 0.6 0.23 0.2 0.17 0.185
3 1/1/2021 green 0.2 0.4 nan 0.2 nan 0.2
4 1/1/2021 green 1 1 0.23 0 0.77 0.385
5 1/1/2021 yellow 0.4 0.4 0.32 0 0.08 0.04
6 1/1/2021 yellow 0.1 nan 0.4 nan 0.3 0.3
7 1/1/2021 yellow 1.3 0.5 0.54 0.8 0.76 0.78
8 1/1/2021 yellow 1.5 0.5 0.43 1 1.07 1.035
9 1/1/2021 yellow 1.5 0.5 0.43 1 1.07 1.035
10 1/1/2021 blue 0.4 0.3 nan 0.1 nan 0.1
11 1/1/2021 blue 0.8 0.2 0.11 0.6 0.69 0.645
and for MSE (Signed error)
df["MSE"]= df[['Method_1','Method_2']].mean(axis = 1)- df['True_Temperature']
produces
date Thermometer True_Temperature Method_1 Method_2 M1_ABS_Error M2_ABS_Error MAE MSE
-- -------- ------------- ------------------ ---------- ---------- -------------- -------------- ----- ------
0 1/1/2021 red 0.2 0.2 0.5 0 0.3 0.15 0.15
1 1/1/2021 red 0.6 0.6 0.3 0 0.3 0.15 -0.15
2 1/1/2021 red 0.4 0.6 0.23 0.2 0.17 0.185 0.015
3 1/1/2021 green 0.2 0.4 nan 0.2 nan 0.2 0.2
4 1/1/2021 green 1 1 0.23 0 0.77 0.385 -0.385
5 1/1/2021 yellow 0.4 0.4 0.32 0 0.08 0.04 -0.04
6 1/1/2021 yellow 0.1 nan 0.4 nan 0.3 0.3 0.3
7 1/1/2021 yellow 1.3 0.5 0.54 0.8 0.76 0.78 -0.78
8 1/1/2021 yellow 1.5 0.5 0.43 1 1.07 1.035 -1.035
9 1/1/2021 yellow 1.5 0.5 0.43 1 1.07 1.035 -1.035
10 1/1/2021 blue 0.4 0.3 nan 0.1 nan 0.1 -0.1
11 1/1/2021 blue 0.8 0.2 0.11 0.6 0.69 0.645 -0.645
Original answer
It is not entirely clear what you want, but somewhat guessing here, is this what you are after? If you groupby
by color and apply mean
to the `ABS columns within each group
data.groupby('Thermometer', sort = False)[['M1_ABS_Error','M2_ABS_Error']].mean()
you get this
M1_ABS_Error M2_ABS_Error
Thermometer
red 0.066667 0.256667
green 0.100000 0.770000
yellow 0.700000 0.656000
blue 0.350000 0.690000
Here, for example, the first top left number '0.066667is the average of the
M1_ABS_Errorcolumn for those Thermometers that are
red`. Similar to others. NaNs are skipped within each color/column
to get MSE (which normally means Mean Squared Error so I assume this is what you are after) you can do
import numpy as np
data["M1_Sqr_Error"]= (data["True_Temperature"]-data["Method_1"])**2
data["M2_Sqr_Error"]= (data["True_Temperature"]-data["Method_2"])**2
data.groupby('Thermometer', sort = False)[['M1_Error','M2_Error']].apply(lambda v: np.sqrt(np.mean(v)))
to get
M1_Error M2_Error
Thermometer
red 0.115470 0.263881
green 0.141421 0.770000
yellow 0.812404 0.769909
blue 0.430116 0.690000
Answered By - piterbarg Answer Checked By - Marilyn (PHPFixing Volunteer)
0 Comments:
Post a Comment
Note: Only a member of this blog may post a comment.