Issue
My collaborator was processing a large batch of files, but some of the output files seem to have been interrupted before they were completed. It seems that these incomplete files do not have the end-of-file (EOF) character. I would like to write a script that loops through all ~500 files and checks whether the EOF character is present in each one. Can you give me any idea of how to do this? Which command can I use to find out whether a file has an EOF character at the end?
I am not sure whether complete files are supposed to end with a special character, but normal files look like this:
my_user$ tail CHSA0011.fastq
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@HS40_15367:8:1106:6878:29640/2
TGATCCATCGTGATGTCTTATTTAAGGGGAACGTGTGGGCTATTTAGGCTTTATGACCCTGAAGTAGGAACCAGA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@HS40_15367:8:1202:14585:48098/1
TGATCCATCGTGATGTCTTATTTAAGGGGAACGTGTGGGCTATTTAGGCTTTATGACCCTGAAGTAGGAACCAGA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
my_user$
But when I run tail on these interrupted files, they look like this:
my_user$ tail IST-MES1.fastq
@HS19_13305:3:1115:13001:3380/2
GTGGAGACGAGGTTTCACCATGTTGGCCAGGCTGGTCTCGAGCTCCTGACCTCAAGTGATCCGTCTGCCTTGGCC
+
@B@FFFFFHHHHFHHIJJJJJIIJJJJJJJIJJJJGIIJJGIIGIIJJJJFDHHIJFHGIGHIHHHFFFFFFEEE
@HS19_13305:3:1106:5551:75750/2
CGAGGTTTCACCATGTTGGCCAGGCTGGTCTCGAGCTCCTGACCTCAAGTGATCCGTCTGCCTTGGCCCCCCAAA
+
CCCFFADFHHHHHJJIJJJJJJJJJJJJEGGIJGGHIIJIIIIIIJJJJDEGGIJJJGIIIJJIJJJHHHFDDDD
@HS19_13305:3:2110:17731:73616/2
CGAGGTTTCACCATGTTGGCCAGGCTGmy_user$
As you can see, in normal files my_user$ is displayed one line below the end of the file, but in the interrupted ones my_user$ appears right after the last character of the file. Maybe it is just because the file does not end with a line break (\n)?
I am sorry if the question is a bit confusing,
cheers, Guillermo
Solution
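Yes, the difference is that in the first case the file ends with \n (newline), so the prompt starts on its own line. Note that there is no EOF character stored in the file itself; the trailing newline is what differs here.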
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
my_user$
In this case the file doesn't end with a newline, so the next thing printed is your prompt (actually your PS1), on the same line:
CGAGGTTTCACCATGTTGGCCAGGCTGmy_user$
You can try this:
echo "CCCFFADFHHHHH" # <--- implicitly includes newline at the end
echo -n "CCCFFADFHHHHH" # <--- does not include newline at the end
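To see the difference without relying on how the prompt is drawn, you can count the bytes each form writes. This is a minimal sketch; it uses printf instead of echo -n only because printf behaves the same across shells, and the variable names are just for illustration.

```shell
# printf '%s\n' writes the text plus a newline, like plain echo;
# printf '%s' writes the text only, like echo -n.
with_newline=$(printf '%s\n' "CCCFFADFHHHHH" | wc -c | tr -d ' ')
without_newline=$(printf '%s' "CCCFFADFHHHHH" | wc -c | tr -d ' ')

echo "with newline:    $with_newline bytes"     # 14 bytes
echo "without newline: $without_newline bytes"  # 13 bytes
```

The one-byte difference is the trailing newline.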
There are actually two end-of-line characters, \r and \n, and different operating systems use different conventions. I will assume you are working on Linux, where only \n is used. So in this example the newline character is 0x0a (decimal 10) in the ASCII table.
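If you are unsure which convention a file uses, looking at its last two bytes tells you. The sketch below is only an illustration: last_two_bytes_hex is a helper name made up for this example, and it uses od, which does the same job as xxd here.

```shell
# Print the last two bytes of standard input as hex digits, e.g. "0d0a".
last_two_bytes_hex() {
    tail -c 2 | od -An -tx1 | tr -d ' \n'
}

unix_ending=$(printf 'ACGT\n' | last_two_bytes_hex)
windows_ending=$(printf 'ACGT\r\n' | last_two_bytes_hex)

echo "Unix-style line ends with:    0x$unix_ending"     # 0x540a ('T' then \n)
echo "Windows-style line ends with: 0x$windows_ending"  # 0x0d0a (\r then \n)
```

In both cases the very last byte is 0x0a, so a check on the last byte alone works for either convention.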
If you want to know the last char of each file, you can do:
echo -n "CCCFFADFHHHHH" > uglyfile.txt
echo "CCCFFADFHHHHH" > nicefile.txt
for file in *.txt; do
    # Print the file name, then the last byte of the file as hex
    echo -n "$file ends with: 0x"
    tail -c 1 "$file" | xxd -p
done
If you want to know which files end with a char that is not a newline, you can do:
echo -n "CCCFFADFHHHHH" > uglyfile.txt
echo "CCCFFADFHHHHH" > nicefile.txt
for file in *.txt; do
    # Read the last byte of the file as a two-digit hex string
    lastchar_hex=$(tail -c 1 "$file" | xxd -p)
    if [[ $lastchar_hex != '0a' ]]; then
        echo "File $file does not end with newline"
    fi
done
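For the ~500 .fastq files from the question, the same idea could be wrapped up roughly as follows. Treat this as a sketch under assumptions rather than a finished tool: the function names and the FASTQ_DIR variable are made up for this example, od stands in for xxd, and the second check assumes every record spans exactly four lines, as in the sample output. That second check is there because a file cut exactly at a line boundary would still end with a newline.

```shell
#!/usr/bin/env bash
# Succeeds if FILE is non-empty and its last byte is a newline (0x0a).
ends_with_newline() {
    local file="$1"
    local last
    [ -s "$file" ] || return 1   # treat empty or missing files as incomplete
    last=$(tail -c 1 "$file" | od -An -tx1 | tr -d ' \n')
    [ "$last" = "0a" ]
}

# Succeeds if the line count is a multiple of four.
# Assumption: every FASTQ record spans exactly four lines, as in the sample.
has_whole_records() {
    local file="$1"
    local lines
    lines=$(wc -l < "$file")
    [ $(( lines % 4 )) -eq 0 ]
}

# FASTQ_DIR is a hypothetical variable naming the directory to scan;
# it defaults to the current directory.
fastq_dir=${FASTQ_DIR:-.}

for file in "$fastq_dir"/*.fastq; do
    [ -e "$file" ] || continue   # no match: the glob stays literal, so skip it
    if ! ends_with_newline "$file"; then
        echo "$file: does not end with a newline"
    elif ! has_whole_records "$file"; then
        echo "$file: line count is not a multiple of 4"
    fi
done
```

Files that pass both checks print nothing, so the output is just the list of files worth regenerating.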
Answered By - brunorey Answer Checked By - Senaida (PHPFixing Volunteer)