Saturday, October 29, 2022

[FIXED] How to clean double lines from joint statement in R?

Issue

I ran a fuzzy self-join to compare the addresses in a data frame with one another:

library(tidyverse)
library(lubridate)
library(stringr)
library(stringdist)
library(fuzzyjoin)
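# trimData() and d_client_squashed are not shown in the question
doTheJoin <- function(threshold) {
  # Fuzzy self-join on address_full using Jaro-Winkler distance
  trimData(d_client_squashed) %>%
    stringdist_left_join(
      trimData(d_client_squashed),
      by = c(address_full = "address_full"),
      distance_col = "distance",
      max_dist = threshold,
      method = "jw"
    )
}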

The structure of d_client_squashed is the following and contains string values:

Client_Reference  address_full
C01               Client1 Name, Street, Zipcode, Town
C02               Client2 Name, Street2, Zipcode2, Town2
...               ...

The following operation:

sensible_matches <- doTheJoin(0.2)
View(sensible_matches %>% filter(Client_Reference.x != Client_Reference.y))

Results in the following output:

Client_Reference.x address_full.x Client_Reference.y address_full.y distance
C01 Client1 Name, Street, Zipcode, Town C02 Client2 Name, Street2, Zipcode2, Town2 0.05486
C02 Client2 Name, Street2, Zipcode2, Town2 C01 Client1 Name, Street, Zipcode, Town 0.05486
... ... ... ... ...

Every match appears twice in the output: once as (C01, C02) and once with the client information reversed as (C02, C01). The distance value is not unique across pairs, so I cannot deduplicate on it alone. How can I subset the data frame to remove these mirrored rows?


Solution

To remove the mirrored rows, sort the values within each row so that two rows containing the same pair of Client_Reference values become identical regardless of which side each client is on, then drop the duplicates. After that, you can filter out the rows where both sides have the same Client_Reference, as you did before.

sensible_matches <- sensible_matches[!duplicated(t(apply(sensible_matches,1,sort))),]
View(sensible_matches  %>% filter(Client_Reference.x != Client_Reference.y))
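As an alternative sketch, the same deduplication can be done on the two reference columns only, by building an order-independent key for each pair with pmin() and pmax(). This avoids coercing every column to character, which apply() does when it turns each row into a vector. The data frame below is a made-up stand-in for the join result that reuses the column names from the question, and pair_key, cross and unique_pairs are names introduced here for illustration only.

```r
# Toy stand-in for the joined result; the values are illustrative only.
sensible_matches <- data.frame(
  Client_Reference.x = c("C01", "C02", "C01", "C02"),
  address_full.x     = c("Client1 Name, Street, Zipcode, Town",
                         "Client2 Name, Street2, Zipcode2, Town2",
                         "Client1 Name, Street, Zipcode, Town",
                         "Client2 Name, Street2, Zipcode2, Town2"),
  Client_Reference.y = c("C02", "C01", "C01", "C02"),
  address_full.y     = c("Client2 Name, Street2, Zipcode2, Town2",
                         "Client1 Name, Street, Zipcode, Town",
                         "Client1 Name, Street, Zipcode, Town",
                         "Client2 Name, Street2, Zipcode2, Town2"),
  distance           = c(0.05486, 0.05486, 0, 0),
  stringsAsFactors   = FALSE
)

# Drop self-matches first.
cross <- sensible_matches[
  sensible_matches$Client_Reference.x != sensible_matches$Client_Reference.y, ]

# Build an order-independent key: smaller reference first, larger second.
pair_key <- paste(
  pmin(cross$Client_Reference.x, cross$Client_Reference.y),
  pmax(cross$Client_Reference.x, cross$Client_Reference.y),
  sep = "|"
)

# Keep only the first row seen for each unordered pair.
unique_pairs <- cross[!duplicated(pair_key), ]
print(unique_pairs)
```

Which of the two mirrored rows survives depends on row order, since duplicated() keeps the first occurrence. If that matters, sorting the data frame before deduplicating would make the choice deterministic.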


Answered By - Giulio Mattolin
Answer Checked By - Gilberto Lyons (PHPFixing Admin)
