Last week I discussed that the new (@since 1.8) method splitAsStream in the class Pattern works on the character sequence reading from it only as much as needed by the stream and not running ahead with the pattern matching creating all the possible elements and returning it as a stream. This behavior is the true nature of streams and it is the way it has to be to support high performance applications.

In this article, as I promised last week, I will show a practical application of splitAsStream where it really makes sense to process the stream and not just split the whole string into an array and work on that.

The application as you may have guessed from the title of the article is splitting up a file along some tokens. A file can be represented as a CharSequence so long (or so short) as long it is not longer than 2GB. The limit comes from the fact that the length of a CharSequence is an int value and that is 32-bit in Java. File length is long, which is 64-bit. Since reading from a file is much slower than reading from a string that is already in memory it makes sense to use the laziness of stream handling. All we need is a character sequence implementation that is backed up by a file. If we can have that we can write a program like the following:

    public static void main(String[] args) throws FileNotFoundException {
        Pattern p = Pattern.compile("[,\\.\\-;]");
        final CharSequence splitIt =
            new FileAsCharSequence(
                   new File("path_to_source\\SplitFileAsStream.java"));
        p.splitAsStream(splitIt).forEach(System.out::println);
    }

This code does not read any part of the file, that is not needed yet, assumes that the implementation FileAsCharSequence is not reading the file greedy. The class FileAsCharSequence implementation can be:

package com.epam.training.regex;

import java.io.*;

public class FileAsCharSequence implements CharSequence {
    private final int length;
    private final StringBuilder buffer = new StringBuilder();
    private final InputStream input;

    public FileAsCharSequence(File file) throws FileNotFoundException {
        if (file.length() > (long) Integer.MAX_VALUE) {
            throw new IllegalArgumentException("File is too long to handle as character sequence");
        }
        this.length = (int) file.length();
        this.input = new FileInputStream(file);
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        ensureFilled(index + 1);
        return buffer.charAt(index);
    }


    @Override
    public CharSequence subSequence(int start, int end) {
        ensureFilled(end + 1);
        return buffer.subSequence(start, end);
    }

    private void ensureFilled(int index) {
        if (buffer.length() < index) {
            buffer.ensureCapacity(index);
            final byte[] bytes = new byte[index - buffer.length()];
            try {
                int length = input.read(bytes);
                if (length < bytes.length) {
                    throw new IllegalArgumentException("File ended unexpected");
                }
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            try {
                buffer.append(new String(bytes, "utf-8"));
            } catch (UnsupportedEncodingException ignored) {
            }
        }
    }
}

This implementation reads only that many bytes from the file as it is needed for the last, actual method call to charAt or subSequence.

If you are interested you can improve this code to keep only the bytes in memory that are really needed and delete bytes that were already returned to the stream. To know what bytes are not needed a good hint is from the previous article is that the splitAsStream never touches any character that has smaller index than the first (start) argument of the last call to subSequence. However, if you implement the code in a way that it throws the characters away and fail if anyone wants to access a character that was already thrown then it will not truly implement the CharSequence interface, though it still may work well with splitAsStream so long as long the implementation does not change and it starts needed some already passed characters. (Well, I am not sure, but it may also happen in case we use some complex regular expression as a splitting pattern.)

Happy coding!

Comments imported from Wordpress

rici 2017-11-30 11:11:02

I know that the main point of the article is stream oriented splitting, and not the FileAsCharSequence implementation. But I am surprised that an experienced developer like you made a mistake regarding InputStream.read(byte[]). You assume that read(byte[]) method fills all the bytes in the array if the stream can provide enough bytes. However, the only guarantee is that it will read at least one byte, and the actually read number of bytes is the return value. So even a FileInputStream implementation is allowed to read less bytes without any error condition, even if it is not at the end of the file. The correct implementation invokes read(byte[] b, int off, int len) method in an iteration until the byte array is filled completely.

This principle of reading less bytes than requested from a file (or network socket) is true for common operating systems and programming platforms, Beyond Java’s InputStream.read you can check the read(2) for Linux or ReadFile for Windows or C#'s Stream.Read, all of them states that they are allowed to read less bytes than requested.

Other possible problem is that the count of bytes in a file is not equal to the count of characters in it when considering multibyte encodings like utf-8. Even if somehow we could know the count of characters in a file, it would cause a problem if the boundaries of the byte[] buffer splits a multibyte encoded character, then it cannot be decoded correctly. The correct implementation would use a CharsetDecoder.


Comments

Please leave your comments using Disqus, or just press one of the happy faces. If for any reason you do not want to leave a comment here, you can still create a Github ticket.