PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0

Saturday, October 29, 2022

[FIXED] Why do results differ for dplyr left_join() and right_join() using these two dataframes

 October 29, 2022     dplyr, left-join, r, right-join     No comments   

Issue

I am learning how to use the R dplyr 'join' functions by doing the exercises from this course: https://github.com/uclouvain-cbio/WSBIM1207 and got stuck on the problem described below.

First, download the example dataframes used for this question:

BiocManager::install("UCLouvain-CBIO/rWSBIM1207")

Load the package:

library(rWSBIM1207)

Then in R/RStudio load the dataframe files, 'clinical2' and 'expression' by typing:

data(clinical2)
data(expression)

The task is, firstly: 'Join the expression and clinical2 tables by the patient reference, using the left_join and the right_join functions.' I did that in this way:

left_join(expression, clinical2, 
          by = c("patient" = "patientID"))
right_join(expression, clinical2,
                     by = c("patient" = "patientID"))

The second task is to explain why the results are different. I found that there are 3 more rows in the right_join output versus the left_join output. This seems odd to me given that 'clinical2' has 516 rows, whereas 'expression' has 570 rows. The 3 extra rows present in the r_join output have in common that they contain multiple NA values, which presumably represent patients found in 'clinical2' and not in 'expression'. I don't really understand what is going on here, and would be grateful for any help.


Solution

Update Maybe not clearly explained. Here an explanation with the original data:

create left join

left <- left_join(expression, clinical2, by = c("patient" = "patientID"))

dim(expression)
[1] 570   8
dim(left)
[1] 570  10

create right join

right <- right_join(expression, clinical2,
           by = c("patient" = "patientID"))

dim(expression)
[1] 570   8
dim(right)
[1] 573  10

You want to know why dim(right) is 573!

Explanation step by step:

  1. Definition of right_join(): includes all rows in y (y is here clinical2).
  2. By doing so: there are 3 rows in clinical2 (y) which are not in expression (x)

See here: which patient occur in clinical2 but not in expression

anti_join(clinical2, expression, by=c("patientID"="patient"))

patientID    gender years_at_diagnosis
<chr>        <chr>               <dbl>
1 TCGA-55-7284 male                 74.2
2 TCGA-55-7913 female               61.2
3 TCGA-67-4679 male                 69.0

Again:

right_join(expression, clinical2, by = c("patient" = "patientID"))

We start with expression(x) (dim = 570 8) and join with clinical2 (y) (dim = 516 3)

So now what happens:

  • All 570 from expression (x) are matched with all in clinial2 (y) (dim = 516 3) EXCEPT these 3 patientID in clinical2 TCGA-55-7284, TCGA-55-7913, TCGA-67-4679
  • right_join now takes all 570 from expression and adds the 3 not matching patientID from clinical2 resulting in a dim of 573 10
  • In contrast left_join: left_join(): includes all rows in x (=expression), So if we do
  anti_join(expression, clinical2, by=c("patient"="patientID"))

We get:
  # ... with 8 variables: sampleID <fct>, patient <chr>, type <chr>, A1BG <dbl>, A1CF <dbl>,
  #   A2BP1 <dbl>, A2LD1 <dbl>, A2ML1 <dbl>````

That means all rows are included in expression. So no additional rows will be added here:

First answer: In joining two things are important:

  1. from which side you start to join, e.g. which table is in first position
  2. Given position of tables e.g. df1, df2 which join method you apply

See this example:

library(dplyr)
library(tibble)

# add ID
iris1<- iris %>% 
  tibble::rowid_to_column("ID")

# add ID
mtcars1 <- mtcars %>% 
  tibble::rowid_to_column("ID")

dim(iris1)
# [1] 150   6

dim(mtcars1)
# [1] 32 12

# 1. iris1 is first and we start from left e.g. iris1

a <- left_join(iris1, mtcars1, by="ID")
dim(a)
# [1] 150  17

# 2. iris1 is still first, but we join from right e.g. mtcars1
b <- right_join(iris1, mtcars1, by="ID")
dim(b)
# [1] 32 17

# 3. mtcars1 is first and we join from left e.g mtcars1
a1 <- left_join(mtcars1, iris1, by="ID")
dim(a1)
# [1] 32 17

-> b = a1 e.g. right_join(iris1, mtcars1, by="ID") = left_join(mtcars1, iris1, by="ID")

https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/join



Answered By - TarJae
Answer Checked By - Gilberto Lyons (PHPFixing Admin)
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Newer Post Older Post Home

0 Comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
Comments
Atom
Comments

Copyright © PHPFixing