Issue

Suppose I have two data frames like the following:

df1 <- data.frame(
    id1  = c(1:4, 4),
    id2  = letters[1:5],
    val1 = c(0, 1, pi, exp(1), 42)
)

df2 <- data.frame(
    id1  = c(1:4, 4),
    id2  = c(NA, letters[2:5]),
    val2 = c("Lorem", "ipsum", "dolor", "sit", "amet")
)

##   df1                   df2
##   id1 id2      val1     id1  id2  val2
## 1   1   a  0.000000       1 <NA> Lorem
## 2   2   b  1.000000       2    b ipsum
## 3   3   c  3.141593       3    c dolor
## 4   4   d  2.718282       4    d   sit
## 5   4   e 42.000000       4    e  amet

This would be my desired result:

desired_result <- data.frame(
    id1  = c(1:4, 4),
    id2  = letters[1:5],
    val1 = c(0, 1, pi, exp(1), 42),
    val2 = c("Lorem", "ipsum", "dolor", "sit", "amet")
)

##   id1 id2      val1  val2
## 1   1   a  0.000000 Lorem
## 2   2   b  1.000000 ipsum
## 3   3   c  3.141593 dolor
## 4   4   d  2.718282   sit
## 5   4   e 42.000000  amet

In my desired result, I'd like to use the information in column id2, when available, to resolve join ambiguities raised by duplicate values in id1. For example, rows 4 and 5 have the same id1, but we can differentiate them by id2. So, if I try to join just on id1, I get too many observations, because I'm not utilizing this information in id2 to match records:

library(dplyr)
left_join(df1, df2, by = "id1")
##   id1 id2.x      val1 id2.y  val2
## 1   1     a  0.000000  <NA> Lorem
## 2   2     b  1.000000     b ipsum
## 3   3     c  3.141593     c dolor
## 4   4     d  2.718282     d   sit
## 5   4     d  2.718282     e  amet
## 6   4     e 42.000000     d   sit
## 7   4     e 42.000000     e  amet

However, if I try to join on both IDs, I lose the information in val2 for row 1, because df2 has NA for id2 on row 1:

left_join(df1, df2, by = c("id1", "id2"))
##   id1 id2      val1  val2
## 1   1   a  0.000000  <NA>
## 2   2   b  1.000000 ipsum
## 3   3   c  3.141593 dolor
## 4   4   d  2.718282   sit
## 5   4   e 42.000000  amet

How can I left_join() (or equivalent) to achieve my desired_result?

Solution

An option using data.table:

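library(data.table)
setDT(df1)  # convert to data.table by reference (no copy)
setDT(df2)

# Pass 1: update join on both keys; i.val2 is the val2 column of df2
df1[df2, on=.(id1, id2), mult="first", val2 := i.val2]

# Pass 2: for rows still missing val2, fall back to matching on id1 alone
df1[is.na(val2), val2 :=
    df2[.SD, on=.(id1), mult="first", val2]]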

I have taken the liberty of using the first value when there are multiple matches (i.e. when the combination of id1 and id2 in df2 is not unique).
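Since the question asks about left_join(), the same two-pass idea can be sketched in dplyr. This is only a rough equivalent of the data.table approach above, not part of the original answer: it assumes that taking the first val2 per id1 in df2 is an acceptable fallback (mirroring mult="first"), and the names exact, fallback, val2_fallback and result are made up for illustration.

```r
library(dplyr)

df1 <- data.frame(
    id1  = c(1:4, 4),
    id2  = letters[1:5],
    val1 = c(0, 1, pi, exp(1), 42)
)

df2 <- data.frame(
    id1  = c(1:4, 4),
    id2  = c(NA, letters[2:5]),
    val2 = c("Lorem", "ipsum", "dolor", "sit", "amet")
)

# Pass 1: exact match on both keys; rows without a full match get NA in val2
exact <- left_join(df1, df2, by = c("id1", "id2"))

# Fallback lookup: one val2 per id1 (assumption: the first one is good enough)
fallback <- df2 %>%
    group_by(id1) %>%
    summarise(val2_fallback = first(val2), .groups = "drop")

# Pass 2: fill the remaining NAs from the id1-only lookup
result <- exact %>%
    left_join(fallback, by = "id1") %>%
    mutate(val2 = coalesce(val2, val2_fallback)) %>%
    select(-val2_fallback)

result
```

On the example data, row 1 has no exact match and so picks up "Lorem" through the id1-only fallback, while rows 4 and 5 keep the values matched on both keys. If several df2 rows share an id1 and none of them matches on id2, the fallback silently picks the first one, so it may be worth checking for such cases before relying on it.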

Answered By - chinsoon12

Answer Checked By - Mildred Charles (PHPFixing Admin)

Saturday, October 29, 2022

[FIXED] How to perform left join using multiple columns where one data frame has missingness in one column?
