To follow the examples below, first create a large CSV file.
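A minimal sketch of a generator you could use; the file name `large-file.csv`, the column layout, and the row count are assumptions, not requirements:

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class LargeCsvGenerator {

    // Writes a header plus `rows` data lines to `path`.
    static void generate(Path path, int rows) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(path)) {
            writer.write("id,name,value");
            writer.newLine();
            for (int i = 0; i < rows; i++) {
                writer.write(i + ",name-" + i + "," + (i * 3));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // 10 million rows yields a CSV of a few hundred MB; increase as needed
        generate(Path.of("large-file.csv"), 10_000_000);
    }
}
```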
First, let's look at the memory-related problem: if you try to load the entire file content into memory and the JVM doesn't have sufficient heap, it will throw `java.lang.OutOfMemoryError`.
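A sketch of the naive approach, assuming the generated file is named `large-file.csv`:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class ReadWholeFile {
    public static void main(String[] args) throws IOException {
        // readAllLines holds every line of the file in a List at once,
        // so the whole file must fit on the heap.
        List<String> lines = Files.readAllLines(Path.of("large-file.csv"));
        System.out.println("Lines read: " + lines.size());
    }
}
```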
If you run the above program with a max heap of 128 MB (`-Xmx128m`) against a multi-gigabyte file, it terminates with `java.lang.OutOfMemoryError: Java heap space`.
Before looking at different options for processing large files memory-efficiently, ask yourself the following questions:
- Is it necessary to load the entire file content into memory to process it? For example, you may only need a few lines somewhere in the middle of the file.
- How frequently will the file be read? This has implications for disk I/O.
- Can the processing be multi-threaded?
Method 1 - using BufferedReader and Java Streams (FileInputStream / BufferedInputStream)
Method 2 - using java.util.Scanner
Method 3 - using java.nio.channels.FileChannel
BufferedReader reads text from a character input stream and uses an internal buffer to store characters, so as to provide efficient reading. It first fills a temporary array with data; each read call is then served from that array, without going through the underlying stream, for as long as the array still holds data. Only when the buffer is empty does it read from the underlying stream again, which reduces the number of I/O operations.
- In the example below, we read data from a CSV file using `BufferedReader` in a streaming fashion, printing each line and discarding it without storing it in memory.
- `BufferedReader` takes a character input stream and delegates the actual I/O operations to it, so as a first step we create an `InputStream` and pass it to `BufferedReader`.
- Next, we call `BufferedReader.lines()`, which returns a `Stream` of Strings. Note that this `Stream` is lazily populated: a read only occurs when you iterate through the `Stream`.
Note: the default buffer size in `BufferedReader` is 8,192 characters; you can pass a different size to the two-argument constructor if needed.
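The steps above can be sketched as follows; the file name and the memory-reporting helper are assumptions made for illustration:

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class BufferedReaderExample {
    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();
        printMemory("BEFORE");
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new FileInputStream("large-file.csv"), StandardCharsets.UTF_8))) {
            // lines() is lazy: each line is read, printed, and immediately
            // becomes eligible for garbage collection
            reader.lines().forEach(System.out::println);
        }
        printMemory("AFTER");
        System.out.println("Total processing time (ms) : " + (System.currentTimeMillis() - start));
    }

    private static void printMemory(String label) {
        Runtime rt = Runtime.getRuntime();
        System.out.println(label + " ---- Total memory : " + rt.totalMemory() / (1024 * 1024)
                + "MB , free memory : " + rt.freeMemory() / (1024 * 1024) + "MB");
    }
}
```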
When you run the above program, you might notice that memory usage stays well under control, because we read in a streaming fashion and discard each line once it has been printed.
BEFORE ---- Total memory : 128MB , free memory : 125MB
... (lines printed to the console are omitted here)
AFTER ---- Total memory : 128MB , free memory : 55MB
Total processing time (ms) : 6312
If you would also like to add buffering at the byte-stream level, you can wrap the `FileInputStream` in a `BufferedInputStream`.
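A sketch of this double-buffered setup, again assuming the file name `large-file.csv`:

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class DoubleBufferedExample {
    public static void main(String[] args) throws IOException {
        // BufferedInputStream buffers raw bytes from disk;
        // BufferedReader buffers the decoded characters on top of it.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new BufferedInputStream(new FileInputStream("large-file.csv")),
                StandardCharsets.UTF_8))) {
            reader.lines().forEach(System.out::println);
        }
    }
}
```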
- Just like the `BufferedReader` we saw in Method 1 above, `Scanner` works in a similar way: it uses an `InputStream` and delegates the actual I/O operations to it. `Scanner` also maintains an internal character buffer to store the data, with an initial size of 1,024 characters.
- To read data with `Scanner` we mainly use two methods: `hasNextLine()`, which tells us whether we have reached the end of the file, and `nextLine()`, which reads and returns a line.
- In addition to reading the data, `Scanner` also parses it while reading; this is sometimes useful, but it can also be a burden. As you can see from the output below, memory utilization stays well under control, but overall performance degrades because the parsing takes extra time.
Output
BEFORE ---- Total memory : 128MB , free memory : 125MB
... (lines printed to the console are omitted here)
AFTER ---- Total memory : 128MB , free memory : 51MB
Total processing time (ms) : 10148
FileChannel is a seekable byte channel connected to a file; with it we can both read data from and write data to the file. It has a current position within its file, which can be both queried and modified.
- To use a FileChannel we must first open it; one can be obtained from a `FileInputStream` or a `RandomAccessFile`, and in the example below we use a `FileInputStream` to obtain the FileChannel.
- Once the FileChannel is open, we can start reading the file content. FileChannel's `read` method takes a `ByteBuffer` as an argument: on every call, FileChannel reads up to the buffer's capacity in bytes into the `ByteBuffer` and returns the number of bytes read, or `-1` if it has reached the end of the file.
Note: since FileChannel is a seekable byte channel, you can also specify a position with `FileChannel.position(long)` to start reading from an arbitrary offset.
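A minimal sketch of this approach; the file name and the assumption that the content is single-byte-encoded text (so no multi-byte character is split across buffer boundaries) are simplifications for illustration:

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

public class FileChannelExample {
    public static void main(String[] args) throws IOException {
        try (FileInputStream fileInputStream = new FileInputStream("large-file.csv");
             FileChannel fileChannel = fileInputStream.getChannel()) {
            ByteBuffer byteBuffer = ByteBuffer.allocate(8 * 1024); // 8KB buffer, tune as desired
            while (fileChannel.read(byteBuffer) != -1) { // -1 signals end of file
                byteBuffer.flip();                       // switch the buffer to read mode
                System.out.print(StandardCharsets.UTF_8.decode(byteBuffer));
                byteBuffer.clear();                      // reset for the next read
            }
        }
    }
}
```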
- In the above code, `fileInputStream.getChannel()` opens the FileChannel.
- We initialize the `ByteBuffer` with a size of 8 KB using `ByteBuffer.allocate(8 * 1024)`; you can tweak this as desired.
- `fileChannel.read(byteBuffer)` returns `-1` once it reaches the end of the file.
- Inside the `while` loop we call `byteBuffer.clear()`: once we have consumed the data filling the buffer, we need to clear it so it can be reused for subsequent reads.
Output
BEFORE ---- Total memory : 128MB , free memory : 125MB
... (lines printed to the console are omitted here)
AFTER ---- Total memory : 128MB , free memory : 79MB
Total processing time (ms) : 4920
In the above example we obtained the FileChannel from a `FileInputStream`; it can equally be obtained from a `RandomAccessFile`, which also allows writing.
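A sketch of the `RandomAccessFile` variant, which also demonstrates seeking; the file name and the 1 KB offset are arbitrary choices for illustration:

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class RandomAccessFileChannelExample {
    public static void main(String[] args) throws IOException {
        try (RandomAccessFile file = new RandomAccessFile("large-file.csv", "r");
             FileChannel channel = file.getChannel()) {
            channel.position(1024); // seek: skip the first 1 KB before reading
            System.out.println("Position: " + channel.position()
                    + ", size: " + channel.size());
        }
    }
}
```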
Summary
All of the methods discussed here solve the memory-related issues, and the methods that use a buffer as intermediate in-memory storage provide good performance. Another important factor to keep in mind is that performance bottlenecks may also come from the underlying disk I/O, and if you are reading a file from a Network File System (NFS), network speed may contribute to the overall performance as well.
The source code for the above examples can be found at