Issue
Suppose I have two data frames like the following:
df1 <- data.frame(
id1 = c(1:4, 4),
id2 = letters[1:5],
val1 = c(0, 1, pi, exp(1), 42)
)
df2 <- data.frame(
id1 = c(1:4, 4),
id2 = c(NA, letters[2:5]),
val2 = c("Lorem", "ipsum", "dolor", "sit", "amet")
)
## df1 df2
## id1 id2 val1 id1 id2 val2
## 1 1 a 0.000000 1 <NA> Lorem
## 2 2 b 1.000000 2 b ipsum
## 3 3 c 3.141593 3 c dolor
## 4 4 d 2.718282 4 d sit
## 5 4 e 42.000000 4 e amet
This would be my desired result:
desired_result <- data.frame(
id1 = c(1:4, 4),
id2 = letters[1:5],
val1 = c(0, 1, pi, exp(1), 42),
val2 = c("Lorem", "ipsum", "dolor", "sit", "amet")
)
## id1 id2 val1 val2
## 1 1 a 0.000000 Lorem
## 2 2 b 1.000000 ipsum
## 3 3 c 3.141593 dolor
## 4 4 d 2.718282 sit
## 5 4 e 42.000000 amet
In my desired result, I'd like to use the information in column id2
, when available,
to resolve join ambiguities raised by duplicate values in id1
.
For example, rows 4 and 5 have the same id1
, but we can differentiate them by id2
.
So, if I try to join just on id1
, I get too many observations,
because I'm not utilizing this information in id2
to match records:
library(dplyr)
left_join(df1, df2, by = "id1")
## id1 id2.x val1 id2.y val2
## 1 1 a 0.000000 <NA> Lorem
## 2 2 b 1.000000 b ipsum
## 3 3 c 3.141593 c dolor
## 4 4 d 2.718282 d sit
## 5 4 d 2.718282 e amet
## 6 4 e 42.000000 d sit
## 7 4 e 42.000000 e amet
However, if I try to join on both IDs, I lose the information in val2
for row 1,
because df2
has NA
for id2
on row 1:
left_join(df1, df2, by = c("id1", "id2"))
## id1 id2 val1 val2
## 1 1 a 0.000000 <NA>
## 2 2 b 1.000000 ipsum
## 3 3 c 3.141593 dolor
## 4 4 d 2.718282 sit
## 5 4 e 42.000000 amet
How can I left_join()
(or equivalent) to achieve my desired_result
?
Solution
An option using data.table
:
library(data.table)
setDT(df1)
setDT(df2)
df1[df2, on=.(id1, id2), mult="first", val2 := val2]
df1[is.na(val2), val2 :=
df2[.SD, on=.(id1), mult="first", val2]]
I have taken the liberty of using the first value if there are multiple joins (i.e. the combination of id1 and id2 in df2 are not unique).
Answered By - chinsoon12 Answer Checked By - Mildred Charles (PHPFixing Admin)
0 Comments:
Post a Comment
Note: Only a member of this blog may post a comment.