Issue
I have two CSV files. The first is just a list of IDs (around 300k rows). Let's call it ids.csv:
id
1
3
7
...
The second is a list of objects with an id (around 2 million rows). Let's call it data.csv:
id,date,state
1,2022-01-01,true
4,2022-01-03,false
...
I would like to build a third CSV file containing the rows from data.csv whose IDs appear in ids.csv. Not every ID from ids.csv will necessarily be present in data.csv.
I tried something like this:
while IFS=, read -r line
do
    # Note: "$line" must be double-quoted so the shell expands it;
    # with single quotes awk would compare against the literal string $line.
    # This also rescans all of data.csv once per id, which is why it is slow.
    awk -F',' -v id="$line" '$1==id' data.csv >> out.csv
done < ids.csv
It works, but execution takes close to forever. I tried splitting it into several parallel streams, but the script is slow by itself, so that did not help much.
Can you suggest a better or more optimal way to filter data.csv faster?
Solution
Probably the most common answer on this forum, in one form or another:
awk -F',' 'NR==FNR{ids[$1]; next} $1 in ids' ids.csv data.csv > out.csv
That will be orders of magnitude faster than your existing script: it reads each file exactly once, storing the IDs as keys of a hash, instead of launching a new awk process and rescanning all 2 million rows of data.csv for every one of the 300k IDs.
Regarding why your original script is slow, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
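A minimal runnable demonstration of how the one-liner behaves, using tiny sample files made up to mirror the formats in the question (the data rows are hypothetical):

```shell
# Build small sample inputs in the question's format.
printf 'id\n1\n3\n7\n' > ids.csv
printf 'id,date,state\n1,2022-01-01,true\n4,2022-01-03,false\n7,2022-01-05,true\n' > data.csv

# NR==FNR is true only while reading the first file (ids.csv): each value
# of $1 is stored as a key of the "ids" array and "next" skips the rest.
# For data.csv, rows whose first field is a stored key are printed
# (the default action when the condition is true).
awk -F',' 'NR==FNR{ids[$1]; next} $1 in ids' ids.csv data.csv > out.csv

cat out.csv
# id,date,state
# 1,2022-01-01,true
# 7,2022-01-05,true
```

Note that the header line of data.csv survives only because both headers start with the field `id`, so it matches like any other row; if the headers differed, you would handle them separately (e.g. with `FNR==1`).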
Answered By - Ed Morton Answer Checked By - Mildred Charles (PHPFixing Admin)