Issue
Without using the diff option.
I have two different CSV files (I know what is the names of the columns headlines)- old.csv and new.csv. they are with a different number of rows and columns. the CSV files can include numbers, strings, chars, and special characters. For example:
old.csv
round date first second third fourth fifth sixth
1 2 2021.04 2 45e69 10 16 4565 37
2 3 2021.04 4 15 456as df924 35 4N320
4 5 2021.03 4 43!d9 23 26 29 33
new.csv
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
1 2 2021.04 2 45e69 10 16 4565 37
2 3 2021.04 4 15 456as df924 35 4N320
3 4 2021.03 10 11 20 21 24325 41
5 6 2021.03 4321 9 2#@6 28 34350 41
Line 1 and 2 from both CSV are the same.
Now, I want to print out the difference between new.csv compared to old.csv. I want to print out only the new lines in new.csv file. like that:
result.csv
round date first second third fourth fifth sixth
0 1 2021.04 1 14 15 24 40 41
3 4 2021.03 10 11 20 21 24325 41
5 6 2021.03 4321 9 2#@6 28 34350 41
Counting the lines was for easier reading. I want the headline columns will be kept in the result.csv file.
The option I think that might help here is using keywords. For example, comparing the two CSV files by columns "first" and "fifth"- if one of the columns is not the same, print that in the result.csv file.
Solution
You seperately need to check each column you want to compare. The following code is one option how to do this.
import pandas as pd
def getLinesOnlyInA(dfa, dfb, result):
# iterate over all lines in file a
for index1, row1 in dfa.iterrows():
aLineIsEqual = False
# iterate over all lines in file b
for index2, row2 in dfb.iterrows():
thisLineIsDifferent = False
# for each column, check if they are different
for column in columns:
if row1[column] != row2[column]:
thisLineIsDifferent = True
# ionly continue when the fields are the same
break
if not thisLineIsDifferent:
aLineIsEqual = True
# only continue when no equal line was found
break
# when no equal line was found, add that line to the result
if not aLineIsEqual:
result.append(row1)
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
columns = ['round', 'first', 'fifth'] # columns to be compared
results = []
getLinesOnlyInA(df1, df2, results) # find all lines only existing in file 1
getLinesOnlyInA(df2, df1, results) # find all lines only existing in file 2
dfResult = pd.DataFrame(results) # cast all lines into a dataframe
print(dfResult.to_string())
dfResult.to_csv('result.csv', sep=',')
Result:
round date first second third fourth fifth sixth
2 5 2021.03 4 43!d9 23 26 29 33
0 1 2021.04 1 14 15 24 40 41
3 4 2021.03 10 11 20 21 24325 41
4 6 2021.03 4321 9 2#@6 28 34350 41
In case one file contains less columns (eg. file1
does not contain the date
column), this still works and results in:
round first second third fourth fifth sixth date
2 5 4 43!d9 23 26 29 33 NaN
0 1 1 14 15 24 40 41 2021.04
3 4 10 11 20 21 24325 41 2021.03
4 6 4321 9 2#@6 28 34350 41 2021.03
Answered By - Felix Answer Checked By - Dawn Plyler (PHPFixing Volunteer)
0 Comments:
Post a Comment
Note: Only a member of this blog may post a comment.