Issue
I have two CSV files. The first is just a list of IDs (around 300k rows). Let's call it ids.csv:
id
1
3
7
...
The second is a list of objects with an id (around 2 million rows). Let's call it data.csv:
id,date,state
1,2022-01-01,true
4,2022-01-03,false
...
I would like to build a third CSV file containing the rows from data.csv whose IDs appear in ids.csv. Not every ID from ids.csv will necessarily be present in data.csv.
I tried something like this:
while IFS=, read -r line
do
    # Note: "$line" must be double-quoted so the shell expands it;
    # with single quotes awk would compare against the literal string $line.
    # This also rescans all of data.csv once per id, which is why it is slow.
    awk -F',' -v id="$line" '$1==id' data.csv >> out.csv
done < ids.csv
It works, but execution takes close to forever. I tried splitting it into several parallel streams, but the script is slow by itself, so that did not help much.
Can you suggest a better or more optimal way to filter data.csv faster?
Solution
Probably the most common answer on this forum, in one form or another:
awk -F',' 'NR==FNR{ids[$1]; next} $1 in ids' ids.csv data.csv > out.csv
That will be orders of magnitude faster than your existing script: it reads each file exactly once, storing the IDs as keys of a hash, instead of launching a new awk process and rescanning all 2 million rows of data.csv for every one of the 300k IDs.
Regarding why your original script is slow, see why-is-using-a-shell-loop-to-process-text-considered-bad-practice.
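A minimal runnable demonstration of how the one-liner behaves, using tiny sample files made up to mirror the formats in the question (the data rows are hypothetical):

```shell
# Build small sample inputs in the question's format.
printf 'id\n1\n3\n7\n' > ids.csv
printf 'id,date,state\n1,2022-01-01,true\n4,2022-01-03,false\n7,2022-01-05,true\n' > data.csv

# NR==FNR is true only while reading the first file (ids.csv): each value
# of $1 is stored as a key of the "ids" array and "next" skips the rest.
# For data.csv, rows whose first field is a stored key are printed
# (the default action when the condition is true).
awk -F',' 'NR==FNR{ids[$1]; next} $1 in ids' ids.csv data.csv > out.csv

cat out.csv
# id,date,state
# 1,2022-01-01,true
# 7,2022-01-05,true
```

Note that the header line of data.csv survives only because both headers start with the field `id`, so it matches like any other row; if the headers differed, you would handle them separately (e.g. with `FNR==1`).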
Answered By - Ed Morton Answer Checked By - Mildred Charles (PHPFixing Admin)