Issue
I am trying to linearize a FASTA file using awk, and I am totally new to it. I have this script:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < $f | tr "\t" "\n" > ${f/.fasta/_lin.fasta}
I don't understand anything in the < $f | tr "\t" "\n" > ${f/.fasta/_lin.fasta} part. What is $f, and what are tr, \t, and \n? Where exactly am I supposed to give the input file? Can someone please elaborate?
Solution
Let's step through that code piece by piece. First, I'll add some white space to make it more legible:
awk '
  /^>/ {
    printf("%s%s\t", (N>0?"\n":""), $0);
    N++;
    next;
  }
  {
    printf("%s",$0);
  }
  END {
    printf("\n");
  }
' < $f \
  | tr "\t" "\n" \
  > ${f/.fasta/_lin.fasta}
Okay. First, $f is a shell variable that holds the name of your input file; you set it yourself (for example, f=myfile.fasta) before running the command. The code's author expects the name to contain .fasta, presumably at the end, like myfile.fasta. The < operator redirects that file to awk's standard input. It is redundant in this particular case, since you could pass the file name to awk as an argument instead (unless the name contains an equals sign, which awk may interpret as a variable assignment).
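To make the "where do I give the input file" part concrete, here is a minimal sketch, assuming bash. The file name myfile.fasta and its contents are made up for illustration, and the quotes around "$f" are an addition to guard against spaces in file names:

```shell
# Create a small example file (hypothetical name and contents).
printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\n' > myfile.fasta

# Store the input file name in the variable f, then run the original command.
f=myfile.fasta
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < "$f" | tr "\t" "\n" > "${f/.fasta/_lin.fasta}"

# The result lands in myfile_lin.fasta: one header line, then one sequence line, per record.
cat myfile_lin.fasta
```

To process many files, the same command could sit inside a loop such as for f in *.fasta; do ...; done, which is probably how the original author used it.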
AWK then matches lines that start with > (the FASTA header lines). For each of those, it prints a newline if N > 0 (that is, for every header except the first), then the line itself, then a tab. It then increments N, and next skips the remaining rules and moves on to the next input line. All other lines (the sequence data) are printed without a trailing newline, which is what joins each record's sequence into a single line. After reading all of the lines of $f, a final newline is printed.
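To see what awk emits before tr runs, the tab can be made visible. This is a small sketch on made-up input; swapping the tab for a | character is only for display and is not part of the original pipeline:

```shell
# Hypothetical two-record input, fed to the original awk program on standard input.
# The final tr shows each tab as "|" instead of turning it into a newline.
intermediate=$(
  printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\n' |
  awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' |
  tr '\t' '|'
)
printf '%s\n' "$intermediate"
# >seq1|ACGTTTGA
# >seq2|GGCC
```

Each record is now a single line: the header, a tab (shown as |), and the joined sequence.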
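This awk code is not very legible. It could be rewritten like this, printing each header on its own line directly so that the tab and the tr step are no longer needed:
awk '
  /^>/ && N++ {
    printf "\n";      # end the previous record before every header except the first
  }
  /^>/ {
    print;            # header line, followed by a newline
    next;
  }
  {
    printf "%s", $0;  # sequence line, with no newline, so the lines join up
  }
  END {
    printf "\n";
  }
'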
The only tricky piece here is that N is initially unset, which awk treats as zero. N++ returns the value before incrementing, so the first time it is evaluated it yields zero (false) and the condition does not trigger. The second time, it yields one (true), so the condition triggers, as it does for every later header. Anything that is not an empty string or a zero evaluates as true.
On one line, and more golfed, that could be awk '/^>/&&N++{printf"\n"}/^>/{print;next}{printf"%s",$0}END{printf"\n"}' (print;next emits each header on its own line, while printf"%s",$0 appends the sequence lines without a newline).
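The post-increment behavior of N++ can be checked in isolation. A minimal sketch on three made-up input lines:

```shell
# N++ used as a pattern: false on the first line (N was 0), true on every later line.
result=$(printf 'first\nsecond\nthird\n' | awk 'N++ { print "matched:", $0 }')
printf '%s\n' "$result"
# matched: second
# matched: third
```

The first line is skipped because N++ yields 0 there, which is exactly why the extra newline is not printed before the first header.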
After awk, the output is piped to tr, which translates every tab (\t) into a newline (\n), putting each header and its sequence on separate lines. The result is then redirected with the > operator (a redirection, not a pipe) to a file named by the bash parameter expansion ${f/.fasta/_lin.fasta}, which replaces the first instance of .fasta in $f with _lin.fasta, so our example input file myfile.fasta produces the output file myfile_lin.fasta.
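Both pieces can be tried on their own. A short sketch, assuming bash (the ${var/pattern/replacement} form is not available in plain POSIX sh), with a made-up file name:

```shell
# tr replaces every tab with a newline.
split=$(printf 'header\tsequence\n' | tr "\t" "\n")
printf '%s\n' "$split"
# header
# sequence

# bash pattern substitution: replace the first ".fasta" in $f.
f=myfile.fasta                 # hypothetical file name
out=${f/.fasta/_lin.fasta}
printf '%s\n' "$out"
# myfile_lin.fasta
```

If $f does not contain .fasta at all, the expansion returns the name unchanged, so the output redirection would point at the input file itself and could clobber it. It is worth checking the name before running the pipeline.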
Answered By - Adam Katz Answer Checked By - David Marino (PHPFixing Volunteer)