Issue
I am trying to linearize a FASTA file using awk, which I am totally new to. I have this script:
awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < $f | tr "\t" "\n" > ${f/.fasta/_lin.fasta}
I don't understand anything in the < $f | tr "\t" "\n" > ${f/.fasta/_lin.fasta} part. What is $f? What are tr, \t, and \n? Where exactly am I supposed to give the input file? Can someone please elaborate?
Solution
Let's step through that code piece by piece. First, I'll add some white space to make it more legible:
awk '
  /^>/ {
    printf("%s%s\t", (N>0?"\n":""), $0);
    N++;
    next;
  }
  {
    printf("%s",$0);
  }
  END {
    printf("\n");
  }
' < $f \
  | tr "\t" "\n" \
  > ${f/.fasta/_lin.fasta}
Okay. First, $f is a shell variable holding the name of your input file. The code's author expects that name to contain .fasta, presumably at the end, like myfile.fasta. The < operator tells the shell to feed the contents of that file to awk on standard input. It is redundant in this particular case, since awk could be given the filename as an argument instead (unless the filename contains an equals sign, which awk may interpret as a variable assignment).
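To answer the "where do I give the input file" part concretely, here is a minimal sketch. The scratch directory and the contents of myfile.fasta are made up for the demonstration; only the variable name f comes from the original script.

```shell
# Hypothetical sample file, created in a scratch directory for this demo.
tmpdir=$(mktemp -d)
printf '>seq1\nACGT\n' > "$tmpdir/myfile.fasta"

# f is an ordinary shell variable; assign it before running the pipeline.
f="$tmpdir/myfile.fasta"

# Both commands feed the same file to awk:
via_stdin=$(awk '{ print }' < "$f")   # the shell opens the file; awk reads standard input
via_arg=$(awk '{ print }' "$f")       # awk opens the file named in its argument
printf '%s\n' "$via_stdin"

rm -rf "$tmpdir"
```

In practice a script like this is often run inside a loop such as for f in *.fasta, but that is a guess about how the original author used it.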
AWK then comes in and matches lines that start with >. On those lines, it prints a newline (if N > 0) or else nothing, followed by the contents of the line and a tab. It then increments N and, because of next, skips the remaining rules for that line. All other lines are printed as they're seen but without a trailing newline, so the sequence lines of each record are joined into one long line. After reading all of the lines of $f, a final newline is printed.
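To see what the awk stage produces on its own, before tr runs, it can be tried on a tiny input. This is only a sketch: the two-record sample file is invented for illustration, and the awk program is the one from the question.

```shell
# Hypothetical two-record FASTA file, created only for this demonstration.
tmpdir=$(mktemp -d)
f="$tmpdir/sample.fasta"
printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\n' > "$f"

# The awk stage alone: each header is followed by a tab, then the joined sequence.
awk_out=$(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < "$f")
printf '%s\n' "$awk_out"

rm -rf "$tmpdir"
```

Each record ends up on one line in the form header, tab, sequence. For this sample, the first line is >seq1, a tab, then ACGTTTGA.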
This awk code is not very legible. It could be rewritten like this:
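awk '
  # Before every header except the first, end the previous sequence line.
  /^>/ && N++ {
    printf "\n";
  }
  # Print each header on a line of its own, so no tab and no tr step are needed.
  /^>/ {
    print;
    next;
  }
  # Append sequence lines without a newline so they join into one line.
  {
    printf "%s", $0;
  }
  END {
    printf "\n";
  }
'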
The only tricky piece here is that N is initially zero, so when you say N++ the first time, it returns the value before incrementing (zero = false) and therefore that condition does not trigger. When you say it the second time, it returns the value before the next incrementing (one = true) and therefore that condition triggers. Anything that is not an empty string or a zero evaluates as true.
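The post-increment behavior can be checked in isolation. The following is a small sketch that runs entirely in a BEGIN block, so it needs no input file; the messages are made up for the demonstration.

```shell
result=$(awk 'BEGIN {
  # N is unset, so it starts out as zero.
  if (N++) { print "first test: true" }  else { print "first test: false" }
  if (N++) { print "second test: true" } else { print "second test: false" }
  print "N is now " N
}')
printf '%s\n' "$result"
```

The first test sees zero and reports false, the second sees one and reports true, and N ends up as 2.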
On one line, and more golfed, that could be awk '/^>/&&N++{printf"\n"}/^>/{print;next}{printf"%s",$0}END{printf"\n"}' (print with no arguments prints the current line followed by a newline, while printf"%s",$0 prints it without one).
After awk, the output is passed to tr to translate all tabs (\t) into newlines (\n). Then the output is redirected using the > operator to a file named by the shell substitution ${f/.fasta/_lin.fasta}, which replaces the first instance of .fasta in $f with _lin.fasta, so our example input file myfile.fasta is transformed to output file myfile_lin.fasta.
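Both of these last steps can be tried separately. This is a minimal sketch that assumes bash, since the ${var/pattern/replacement} form is a bash feature (also found in ksh and zsh) and is not part of POSIX sh. The sample strings are invented for the demonstration.

```shell
# tr maps every tab character to a newline.
tr_out=$(printf '>seq1\tACGTTTGA\n' | tr "\t" "\n")
printf '%s\n' "$tr_out"

# ${f/.fasta/_lin.fasta} replaces the first match of .fasta in the value of f.
f=myfile.fasta
out=${f/.fasta/_lin.fasta}
printf '%s\n' "$out"
```

The substitution only computes a new name. The file itself is created by the > redirection.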
Answered By - Adam Katz Answer Checked By - David Marino (PHPFixing Volunteer)