Issue
I have a very large file (~6GB) of fixed-width text records separated by \r\n, so I'm using a BufferedReader to read it line by line. The process can be interrupted or stopped; when that happens, it uses a checkpoint, "lastProcessedLineNbr", to fast-forward to the correct place and resume reading. This is how the reader is initialized:
private void initializeBufferedReader(Integer lastProcessedLineNbr) throws IOException {
    reader = new BufferedReader(new InputStreamReader(getInputStream(), "UTF-8"));
    if (lastProcessedLineNbr == null) {
        lastProcessedLineNbr = 0;
    }
    // Skip the lines that were already processed before the interruption.
    for (int i = 0; i < lastProcessedLineNbr; i++) {
        reader.readLine();
    }
    currentLineNumber = lastProcessedLineNbr;
}
This seems to work fine, and I read and process the data in this method:
public Object readItem() throws Exception {
    if ((currentLine = reader.readLine()) == null) {
        return null;
    }
    currentLineNumber++;
    return parse(currentLine);
}
And again, everything works fine until I reach the last line in the document. readLine() in the latter method throws an error:
17:06:49,980 ERROR [org.jberet] (Batch Thread - 1) JBERET000007: Failed to run job ProdFileRead, parse, org.jberet.job.model.Chunk@3965dcc8: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
at java.lang.StringBuffer.append(StringBuffer.java:369)
at java.io.BufferedReader.readLine(BufferedReader.java:370)
at java.io.BufferedReader.readLine(BufferedReader.java:389)
at com.rational.batch.reader.TextLineReader.readItem(TextLineReader.java:55)
Curiously, it seems to be reading past the end of the file and allocating so much space that it runs out of memory. I tried looking at the contents of the file using Cygwin and "tail file.txt" and in the console it gave me the expected 10 lines. But when I did "tail file.txt > output.txt" output.txt ended up being like 1.8GB, much larger than the 10 lines I expected. So it seems Cygwin is doing the same thing. As far as I can tell there is no special EOF character. It's just the last byte of data and it ends abruptly.
Anyone have any idea on how I can get this working? I'm thinking I could resort to counting the number of bytes read until I get the full size of the file, but I was hoping there was a better way.
Solution
You wrote: 'But when I did "tail file.txt > output.txt" output.txt ended up being like 1.8GB, much larger than the 10 lines I expected'
What this indicates to me is that the file is padded with about 1.8GB of binary zeroes (NUL bytes), which did not show up when Cygwin's tail command wrote to the terminal, but which Java is not ignoring. This would explain your OutOfMemoryError as well: the BufferedReader kept reading data, looking for the next \r\n, and never found it before its line buffer grew past the VM's array size limit.
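If the padding theory holds, one possible workaround is to find where the real data ends and stop reading there. The following is only a minimal sketch under some assumptions: the padding consists solely of NUL (0x00) bytes at the very end, and the input is a local file whose length can be inspected. The names PaddedFileReader, findDataLength, BoundedInputStream and open are hypothetical and do not come from the original code.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original reader class.
final class PaddedFileReader {
    private PaddedFileReader() {}

    /** Returns the number of bytes that precede the trailing run of NUL bytes. */
    static long findDataLength(File file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
            byte[] chunk = new byte[64 * 1024];
            long end = raf.length();
            // Walk backwards from the end of the file, one chunk at a time.
            while (end > 0) {
                int size = (int) Math.min(chunk.length, end);
                long start = end - size;
                raf.seek(start);
                raf.readFully(chunk, 0, size);
                for (int i = size - 1; i >= 0; i--) {
                    if (chunk[i] != 0) {
                        return start + i + 1; // offset just past the last non-NUL byte
                    }
                }
                end = start;
            }
            return 0; // empty file, or nothing but NUL bytes
        }
    }

    /** Reports end-of-stream after a fixed number of bytes has been consumed. */
    static final class BoundedInputStream extends FilterInputStream {
        private long remaining;

        BoundedInputStream(InputStream in, long limit) {
            super(in);
            this.remaining = limit;
        }

        @Override
        public int read() throws IOException {
            if (remaining <= 0) return -1;
            int b = super.read();
            if (b >= 0) remaining--;
            return b;
        }

        @Override
        public int read(byte[] buf, int off, int len) throws IOException {
            if (remaining <= 0) return -1;
            int n = super.read(buf, off, (int) Math.min(len, remaining));
            if (n > 0) remaining -= n;
            return n;
        }

        @Override
        public long skip(long n) throws IOException {
            long skipped = super.skip(Math.min(n, remaining));
            remaining -= skipped;
            return skipped;
        }

        @Override
        public boolean markSupported() {
            return false; // keep the byte accounting simple
        }
    }

    /** Opens a line reader that never sees the NUL padding. */
    static BufferedReader open(File file) throws IOException {
        long dataLength = findDataLength(file);
        InputStream in = new BoundedInputStream(new FileInputStream(file), dataLength);
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }
}
```

With something like this, readLine() would return null right after the last real record instead of trying to buffer the padding. In the question's initializeBufferedReader, the stream from getInputStream() could be wrapped in the same way, provided the data length can be determined for that source. If the padding can be removed when the file is produced, that is probably the cleaner fix.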
Answered By - Jim Garrison Answer Checked By - David Goodson (PHPFixing Volunteer)