Issue
I ran a self-join on a data frame to compare every address with every other address in the same data frame:
library(tidyverse)
library(lubridate)
library(stringr)
library(stringdist)
library(fuzzyjoin)
doTheJoin <- function(threshold) {
  joined <- trimData(d_client_squashed) %>%
    stringdist_left_join(
      trimData(d_client_squashed),
      by = c(address_full = "address_full"),
      distance_col = "distance",
      max_dist = threshold,
      method = "jw"
    )
}
d_client_squashed has the following structure; both columns contain string values:
Client_Reference | address_full |
---|---|
C01 | Client1 Name, Street, Zipcode, Town |
C02 | Client2 Name, Street2, Zipcode2, Town2 |
... | ... |
The following operation:
sensible_matches <- doTheJoin(0.2)
View(sensible_matches %>% filter(Client_Reference.x != Client_Reference.y))
Results in the following output:
Client_Reference.x | address_full.x | Client_Reference.y | address_full.y | distance |
---|---|---|---|---|
C01 | Client1 Name, Street, Zipcode, Town | C02 | Client2 Name, Street2, Zipcode2, Town2 | 0.05486 |
C02 | Client2 Name, Street2, Zipcode2, Town2 | C01 | Client1 Name, Street, Zipcode, Town | 0.05486 |
... | ... | ... | ... | ... |
Each match appears twice in the output, once in each direction, with the client information reversed. The distance value is not unique. How can I subset the data frame to drop these mirrored rows?
Solution
To remove the rows that contain the same data, sort the values within each row so that two rows holding the same pair of Client_Reference values become identical, and then delete the duplicates. After that, filter out the rows where both sides have the same Client_Reference, as you did.
sensible_matches <- sensible_matches[!duplicated(t(apply(sensible_matches,1,sort))),]
View(sensible_matches %>% filter(Client_Reference.x != Client_Reference.y))
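To see what the row-sorting step does, here is a minimal sketch on a toy data frame. The values are made up for illustration and are not real join output; only base R is used so that it runs without the packages above. It also shows a possible alternative that is not part of the original answer: keeping only the rows where Client_Reference.x sorts before Client_Reference.y, which drops both the mirrored rows and the self-matches in one step.

```r
# Toy stand-in for the join result (illustrative values, not real output).
sensible_matches <- data.frame(
  Client_Reference.x = c("C01", "C01", "C02", "C02"),
  address_full.x     = c("Name A, Street 1", "Name A, Street 1",
                         "Name B, Street 2", "Name B, Street 2"),
  Client_Reference.y = c("C01", "C02", "C01", "C02"),
  address_full.y     = c("Name A, Street 1", "Name B, Street 2",
                         "Name A, Street 1", "Name B, Street 2"),
  distance           = c(0, 0.05, 0.05, 0),
  stringsAsFactors   = FALSE
)

# Approach from the answer: sort the values inside each row so that mirrored
# rows become identical, then keep only the first row of each group.
row_keys <- t(apply(sensible_matches, 1, sort))
deduplicated <- sensible_matches[!duplicated(row_keys), ]

# Drop the self-matches (same client on both sides).
deduplicated <- deduplicated[
  deduplicated$Client_Reference.x != deduplicated$Client_Reference.y, ]

# Alternative (an assumption, not from the answer): keep only rows whose left
# reference sorts before the right one.
ordered_pairs <- sensible_matches[
  sensible_matches$Client_Reference.x < sensible_matches$Client_Reference.y, ]

print(deduplicated)
print(ordered_pairs)
```

Both results contain the single C01/C02 row. One caveat with the row-sorting approach: sort() drops NA values by default, so rows with missing values may need separate handling.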
Answered By - Giulio Mattolin Answer Checked By - Gilberto Lyons (PHPFixing Admin)