PHPFixing

Wednesday, August 17, 2022

[FIXED] What is the difference between by and summary?

 August 17, 2022     mean, output, r, summary, variables

Issue

Maybe someone can answer my question. What is the difference between the following two calls? I am interested in the mean, but I get different numbers.

by(wcomp$numbf.y, wcomp$partw2, summary, na.rm = TRUE)

Mean 2.473

summary(wcomp$numbf.y, wcomp$partw2, na.rm = TRUE)

Mean 2.573

Thanks for your help


Solution

Without knowing your data: by() applies a function (summary) to a vector (wcomp$numbf.y) separately for each group (defined by wcomp$partw2).

summary(), on the other hand, creates one summary of the whole vector and ignores the second argument.
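To make the "function per group" idea concrete, here is a minimal sketch on made-up toy data (x and g are hypothetical, not from the question). It treats by() as conceptually similar to splitting the vector by group and applying the function to each piece; this is an illustration of the idea, not a description of how by() is implemented internally.

```r
# Toy data (assumption for illustration): values x with a group label g
x <- c(1, 2, 3, 10, 20, NA)
g <- c("a", "a", "a", "b", "b", "b")

# Split x into one vector per group, then take the mean of each piece
per_group <- lapply(split(x, g), mean, na.rm = TRUE)

# by() produces the same per-group means
via_by <- by(x, g, mean, na.rm = TRUE)

# Compare the plain numbers, ignoring names and classes
all.equal(as.numeric(unlist(per_group)), as.numeric(via_by))
```

Extra arguments such as na.rm = TRUE are handed on to the function that by() calls for each group.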

See also this MWE (I've used the mtcars dataset and set some values to NA):


df <- mtcars
df[c(1, 5), c("cyl", "mpg")] <- NA
head(df)
#>                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4           NA  NA  160 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout   NA  NA  360 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

by(df$mpg, df$cyl, summary)
#> df$cyl: 4
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   21.40   22.80   26.00   26.66   30.40   33.90 
#> ------------------------------------------------------------ 
#> df$cyl: 6
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   17.80   18.38   19.45   19.53   20.68   21.40 
#> ------------------------------------------------------------ 
#> df$cyl: 8
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   10.40   14.30   15.20   14.82   15.80   19.20

summary(df$mpg, df$cyl)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>   10.40   15.28   19.20   20.11   22.80   33.90       2
summary(df$mpg)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>   10.40   15.28   19.20   20.11   22.80   33.90       2
summary(df$cyl)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>   4.000   4.000   6.000   6.133   8.000   8.000       2

Created on 2020-10-07 by the reprex package (v0.3.0)

The means all differ because each call computes a different mean: the summary() call uses all observations at once, whereas the by() call computes the summary separately for each group (cyl).

We also see that the second argument to summary() is ignored.
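A quick way to check this yourself, sketched on the same modified mtcars data: for a numeric vector, anything after the first argument (including na.rm = TRUE) is passed into the `...` of summary() and has no effect on the result.

```r
df <- mtcars
df[c(1, 5), c("cyl", "mpg")] <- NA

s_plain <- summary(df$mpg)                 # one argument
s_group <- summary(df$mpg, df$cyl)         # second argument is not used
s_na_rm <- summary(df$mpg, na.rm = TRUE)   # na.rm is not used either

identical(s_plain, s_group)
identical(s_plain, s_na_rm)
```

summary() already leaves out missing values and reports them in the NA's column, so na.rm makes no difference there.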

Does that answer your question?

If you are only interested in the mean, try:

mean(df$mpg, na.rm = TRUE) # na.rm = TRUE is needed here because mpg contains NAs
#> [1] 20.10667

by(df$mpg, df$cyl, mean)
#> df$cyl: 4
#> [1] 26.66364
#> ------------------------------------------------------ 
#> df$cyl: 6
#> [1] 19.53333
#> ------------------------------------------------------ 
#> df$cyl: 8
#> [1] 14.82308
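The overall mean and the group means are linked: weighting each group mean by its group size gives back the overall mean. The sketch below uses tapply() in place of by() only because its result is easier to compute with. Note the assumption: the two numbers agree here only because the rows with a missing cyl also have a missing mpg. Rows whose group is NA are dropped from the grouped result, so in other data the two can differ.

```r
df <- mtcars
df[c(1, 5), c("cyl", "mpg")] <- NA

# Per-group means and the number of non-missing mpg values in each group
group_means <- tapply(df$mpg, df$cyl, mean, na.rm = TRUE)
group_sizes <- tapply(!is.na(df$mpg), df$cyl, sum)

overall    <- mean(df$mpg, na.rm = TRUE)
recombined <- weighted.mean(as.numeric(group_means), as.numeric(group_sizes))

all.equal(overall, recombined)
```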


Answered By - David
Answer Checked By - Clifford M. (PHPFixing Volunteer)