Issue

I have a very large file (~6GB) of fixed-width text records separated by \r\n, so I'm using a BufferedReader to read it line by line. The process can be interrupted or stopped; when that happens, it uses a checkpoint, "lastProcessedLineNbr", to fast-forward to the correct place and resume reading. This is how the reader is initialized:

private void initializeBufferedReader(Integer lastProcessedLineNbr) throws IOException {
    reader = new BufferedReader(new InputStreamReader(getInputStream(), "UTF-8"));
    // No checkpoint yet: start from the first line.
    if (lastProcessedLineNbr == null) {
        lastProcessedLineNbr = 0;
    }

    // Fast-forward past the lines that were already processed.
    for (int i = 0; i < lastProcessedLineNbr; i++) {
        reader.readLine();
    }
    currentLineNumber = lastProcessedLineNbr;
}

This seems to work fine, and I read and process the data in this method:

public Object readItem() throws Exception {
    // readLine() returns null at end of stream, which signals that there are no more items.
    if ((currentLine = reader.readLine()) == null) {
        return null;
    }
    currentLineNumber++;
    return parse(currentLine);
}

And again, everything works fine until I reach the last line in the document. readLine() in the latter method throws an error:

17:06:49,980 ERROR [org.jberet] (Batch Thread - 1) JBERET000007: Failed to run job ProdFileRead, parse, org.jberet.job.model.Chunk@3965dcc8: java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:3332)
    at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
    at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
    at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:569)
    at java.lang.StringBuffer.append(StringBuffer.java:369)
    at java.io.BufferedReader.readLine(BufferedReader.java:370)
    at java.io.BufferedReader.readLine(BufferedReader.java:389)
    at com.rational.batch.reader.TextLineReader.readItem(TextLineReader.java:55)

Curiously, it seems to be reading past the end of the file and allocating so much space that it runs out of memory. I tried looking at the contents of the file using Cygwin and "tail file.txt" and in the console it gave me the expected 10 lines. But when I did "tail file.txt > output.txt" output.txt ended up being like 1.8GB, much larger than the 10 lines I expected. So it seems Cygwin is doing the same thing. As far as I can tell there is no special EOF character. It's just the last byte of data and it ends abruptly.

Anyone have any idea on how I can get this working? I'm thinking I could resort to counting the number of bytes read until I get the full size of the file, but I was hoping there was a better way.

Solution

> But when I did "tail file.txt > output.txt" output.txt ended up being like 1.8GB, much larger than the 10 lines I expected

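What this indicates to me is that the file is padded with about 1.8GB of binary zeroes (NUL bytes). tail did write them, as the size of output.txt shows, but the terminal does not display NUL characters, so the console output looked like the expected 10 lines. Java does not ignore them. This would explain your OutOfMemoryError as well: readLine() kept appending characters to its internal buffer while looking for the next line terminator, and never found one before the buffer outgrew the maximum array size, which matches the StringBuffer.append and Arrays.copyOf frames in your stack trace.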
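
To check this diagnosis and work around it, something along the following lines might help. It is only a minimal sketch under two assumptions: that the trailing bytes really are NUL (0x00) padding, and that the real records never contain a NUL character. The class and method names (NulPaddingCheck, countTrailingZeroBytes, readLineStoppingAtNul) are made up for illustration and do not come from the question.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.Reader;
import java.nio.file.Path;

// Hypothetical helper class; none of these names come from the original post.
final class NulPaddingCheck {

    /** Counts the zero bytes at the very end of the file by scanning backward in chunks. */
    static long countTrailingZeroBytes(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            byte[] chunk = new byte[64 * 1024];
            long end = raf.length();   // exclusive end of the region not yet scanned
            long zeros = 0;
            while (end > 0) {
                int len = (int) Math.min(chunk.length, end);
                raf.seek(end - len);
                raf.readFully(chunk, 0, len);
                for (int i = len - 1; i >= 0; i--) {
                    if (chunk[i] != 0) {
                        return zeros;  // first non-zero byte from the end: real data stops here
                    }
                    zeros++;
                }
                end -= len;
            }
            return zeros;              // file is empty or consists only of zero bytes
        }
    }

    /**
     * Reads one line terminated by \n (a preceding \r is dropped), but treats a NUL
     * character as end of data. Returns null when no more data is available.
     * Assumption: real records never contain a NUL character.
     */
    static String readLineStoppingAtNul(Reader reader) throws IOException {
        StringBuilder line = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1 && c != 0) {
            if (c == '\n') {
                int n = line.length();
                if (n > 0 && line.charAt(n - 1) == '\r') {
                    line.setLength(n - 1);
                }
                return line.toString();
            }
            line.append((char) c);
        }
        // End of stream or start of padding: return a final unterminated line, if any.
        return line.length() == 0 ? null : line.toString();
    }
}
```

If countTrailingZeroBytes reports roughly 1.8GB for your file, the padding theory holds. In that case, calling readLineStoppingAtNul(reader) in place of reader.readLine() in readItem() should make the reader return null at the first padding byte instead of trying to build one enormous line. Fixing whatever produces the padded file would be the cleaner solution if that is an option.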

Answered By - Jim Garrison

Answer Checked By - David Goodson (PHPFixing Volunteer)

Monday, October 31, 2022

[FIXED] Why is BufferedReader readLine reading past EOF
