PHPFixing

Monday, August 29, 2022

[FIXED] How can I quickly filter a huge CSV file based on a list of IDs from another file using bash?

 August 29, 2022     awk, bash, csv     No comments   

Issue

I have two CSV files. The first one is just a list of IDs (around 300k rows). Let's call it ids.csv:

id
1
3
7
...

The second one is a list of objects with an id (around 2 million rows). Let's call it data.csv:

id,date,state
1,2022-01-01,true
4,2022-01-03,false
...

I would like to build a third CSV file containing the rows from data.csv whose IDs appear in ids.csv. Not every ID from ids.csv will necessarily be present in data.csv.

I tried something like this:

while IFS=, read -r line
do
    # Pass the current ID into awk; double quotes so $line is expanded
    awk -F',' -v id="$line" '$1==id' data.csv >> out.csv
done < ids.csv

It works, but execution takes close to forever. I tried splitting it across several parallel streams, but the script is so slow by itself that this did not help much.

Can you suggest a better or more efficient way to filter data.csv faster?


Solution

Probably the most common answer on this forum in one way or another:

awk -F',' 'NR==FNR{ids[$1]; next} $1 in ids' ids.csv data.csv > out.csv

That will be an order of magnitude faster than your existing script.
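To see the idiom in action, here is a self-contained demo on tiny sample files (the data values are illustrative, not from the original post). The first pass (where NR==FNR, i.e. while reading ids.csv) stores every ID as a key in the associative array ids; the second pass prints a data.csv row only if its first field is a stored key, so each row costs one hash lookup instead of one full scan of data.csv:

```shell
#!/bin/sh
# Tiny sample inputs (illustrative stand-ins for the real files).
cat > ids.csv <<'EOF'
id
1
3
7
EOF

cat > data.csv <<'EOF'
id,date,state
1,2022-01-01,true
4,2022-01-03,false
3,2022-02-10,true
EOF

# First file: load IDs into the array "ids" and skip to the next line.
# Second file: print rows whose first field is a key of "ids".
awk -F',' 'NR==FNR{ids[$1]; next} $1 in ids' ids.csv data.csv > out.csv
cat out.csv
```

Note that the header row of data.csv survives here only because "id" from the ids.csv header happens to be stored as a key; with a headerless ids.csv you would need to pass the header through explicitly (e.g. with FNR==1 handling).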

Regarding why your original script is slow, see "Why is using a shell loop to process text considered bad practice?".
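If you don't mind sorted output, join(1) can express the same filter; this is a sketch of an alternative, not part of the original answer. It assumes each file has a one-line header (skipped with tail), and it relies on join's requirement that both inputs be sorted lexicographically (not numerically) on the join field:

```shell
#!/bin/sh
# Sample inputs (illustrative, not from the post).
cat > ids.csv <<'EOF'
id
1
3
7
EOF

cat > data.csv <<'EOF'
id,date,state
1,2022-01-01,true
4,2022-01-03,false
3,2022-02-10,true
EOF

# join(1) needs both inputs sorted lexicographically on the join field,
# so skip the headers and sort as plain text (no -n).
tail -n +2 ids.csv  | sort           > ids.sorted
tail -n +2 data.csv | sort -t, -k1,1 > data.sorted

# -t, sets comma as the field separator; the join field defaults to field 1.
join -t, ids.sorted data.sorted > joined.csv
cat joined.csv
```

Unlike the awk one-liner, this costs two sorts and loses the original row order, so the awk approach is usually preferable for this job.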



Answered By - Ed Morton
Answer Checked By - Mildred Charles (PHPFixing Admin)

Copyright © PHPFixing