PHPFixing
  • Privacy Policy
  • TOS
  • Ask Question
  • Contact Us
  • Home
  • PHP
  • Programming
  • SQL Injection
  • Web3.0

Sunday, August 28, 2022

[FIXED] How to compare two different CSV files with different number of columns and rows by keyword?

 August 28, 2022     csv, dataframe, pandas, python     No comments   

Issue

Without using the diff option.

I have two different CSV files (I know what is the names of the columns headlines)- old.csv and new.csv. they are with a different number of rows and columns. the CSV files can include numbers, strings, chars, and special characters. For example:

old.csv

     round    date  first  second  third  fourth  fifth  sixth
1     2  2021.04      2    45e69     10    16      4565   37
2     3  2021.04      4      15    456as  df924     35   4N320
4     5  2021.03      4    43!d9    23      26      29     33

new.csv

     round    date  first  second  third  fourth  fifth  sixth
0     1  2021.04      1      14     15      24      40     41
1     2  2021.04      2    45e69    10      16     4565    37
2     3  2021.04      4      15    456as   df924    35    4N320
3     4  2021.03     10      11     20      21     24325   41
5     6  2021.03    4321       9   2#@6     28     34350   41

Line 1 and 2 from both CSV are the same.

Now, I want to print out the difference between new.csv compared to old.csv. I want to print out only the new lines in new.csv file. like that:

result.csv

     round    date  first  second  third  fourth  fifth  sixth
0     1  2021.04      1     14     15      24     40     41
3     4  2021.03     10     11     20      21     24325  41
5     6  2021.03    4321    9     2#@6     28     34350  41

Counting the lines was for easier reading. I want the headline columns will be kept in the result.csv file.

The option I think that might help here is using keywords. For example, comparing the two CSV files by columns "first" and "fifth"- if one of the columns is not the same, print that in the result.csv file.


Solution

You seperately need to check each column you want to compare. The following code is one option how to do this.

import pandas as pd

def getLinesOnlyInA(dfa, dfb, result):
    # iterate over all lines in file a
    for index1, row1 in dfa.iterrows():
        aLineIsEqual = False
        # iterate over all lines in file b
        for index2, row2 in dfb.iterrows():
            thisLineIsDifferent = False
            # for each column, check if they are different
            for column in columns:
                if row1[column] != row2[column]:
                    thisLineIsDifferent = True
                    # ionly continue when the fields are the same
                    break
            if not thisLineIsDifferent:
                aLineIsEqual = True
                # only continue when no equal line was found
                break
        # when no equal line was found, add that line to the result
        if not aLineIsEqual:
            result.append(row1)
            
            
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
columns = ['round', 'first', 'fifth']     # columns to be compared
results = []

getLinesOnlyInA(df1, df2, results)        # find all lines only existing in file 1
getLinesOnlyInA(df2, df1, results)        # find all lines only existing in file 2
dfResult = pd.DataFrame(results)          # cast all lines into a dataframe

print(dfResult.to_string())   
dfResult.to_csv('result.csv', sep=',')

Result:

   round     date  first second third fourth  fifth sixth
2      5  2021.03      4  43!d9    23     26     29    33
0      1  2021.04      1     14    15     24     40    41
3      4  2021.03     10     11    20     21  24325    41
4      6  2021.03   4321      9  2#@6     28  34350    41

In case one file contains less columns (eg. file1 does not contain the date column), this still works and results in:

   round  first second third fourth  fifth sixth     date
2      5      4  43!d9    23     26     29    33      NaN
0      1      1     14    15     24     40    41  2021.04
3      4     10     11    20     21  24325    41  2021.03
4      6   4321      9  2#@6     28  34350    41  2021.03


Answered By - Felix
Answer Checked By - Dawn Plyler (PHPFixing Volunteer)
  • Share This:  
  •  Facebook
  •  Twitter
  •  Stumble
  •  Digg
Newer Post Older Post Home

0 Comments:

Post a Comment

Note: Only a member of this blog may post a comment.

Total Pageviews

Featured Post

Why Learn PHP Programming

Why Learn PHP Programming A widely-used open source scripting language PHP is one of the most popular programming languages in the world. It...

Subscribe To

Posts
Atom
Posts
Comments
Atom
Comments

Copyright © PHPFixing