Small data, small problem. Big data, ...
Splitting text on white-space in most programming languages is normally a trivial matter. This task becomes a lot more difficult though when you try to complete it over a very large document collection.
The easiest way to split a document on white-space is to use the split method, which is implemented by default in most languages on String objects. Although the split method produces the end result we are looking for, it is not the most efficient way to manipulate large sets of data.
Strings and Streams
In Java, it can be advantageous to use streams over Strings since streams allow the programmer greater flexibility in the specifying the I/O buffer size. By specifying an optimal buffer size for the input stream, the programmer can reduce the the number of disk reads that need to be performed in order to load the contents of a file into memory. Less I/O means more performance.
Use Buffered I/O (for text documents, in this case)
Since the purpose of buffered I/O is to reduce the number of disk reads necessary to load a file, you can imagine that knowing the size of the file prior to loading it is important. After the size of the file we want to load is known, we create a character array store the contents of the file one character at a time. (Note that the character encoding of the document you are reading impacts the number of bits or bytes used to represent a character.)
The Split Function
At this point we have our document stored within a char array and need to start splitting it into individual words or tokens efficiently. Since we are dealing with a char array, we no longer have the String method "split" available. The strategy here is to step through the entire char array (document) one time and pick out all of the words to be stored (say in a dictionary). In order to achieve this task the trick is to keep track of when a word begins and ends in the stream. This can be done by using a simple Boolean flag within the loop which iterates through the character array.
First determine whether or not the first character is white-space or not. This is important since we need to know whether we are looking for the beginning of a word first or whether we're looking for the end. Assuming the document starts with a word instead of white space, we take note of the fact that we are currently iterating over a word. Once we realize that the word has ended (a space has been encountered) we mark the end of the word, add it to the dictionary and set our Boolean flag to record that we are no longer on a word. At this point, as we iterate through the document further, should any additional white-space be encountered, it will be ignored. The next time something interesting happens in the code is when we reach a non white-space character, at which point we reset our Boolean flag to reflect that we are at a word boundary again and take note of where it starts. Then we move on looking for the next instance of white-space to find the ending boundary of the latest term discovered.
I know that explaining an algorithm can often sound convoluted and so I have attached a small code sample that demonstrates the technique of using streams and character arrays to efficiently process large document sets.
import java.io.File; import java.io.FileInputStream; import java.io.FileNotFoundException; import java.io.IOException; import java.io.InputStreamReader; import java.io.Reader; import java.io.BufferedReader; import java.util.ArrayList; public class Driver { public static void main(String[] args) { // Set the location of the corpus File corpus = new File("/path/to/document/collection"); // Iterate through the files and perform the necessary actions File[] files = corpus.listFiles(); for(File f : files) { try { // Get the size of the file, create a byte stream and a character array to store the input int file_size = (int) f.length(); Reader input = new BufferedReader(new InputStreamReader(new FileInputStream(f))); char[] data = new char[file_size]; // Create an array list to store the tokens ArrayListtokens = new ArrayList (); // Read the file while (input.read(data, 0, file_size) != -1) { // Determine whether we are starting with whitespace boolean at_token = Character.isWhitespace(data[0]) ? false: true; // Keep track of the bounds of tokens int left = 0, right = 0; // Iterate through the data stream for(int i = 0; i < data.length; i++) { // Found the beginning of a token if (!Character.isWhitespace(data[i]) && at_token == false) { at_token = true; left = i; } // Found the end of a token if (Character.isWhitespace(data[i]) && at_token == true) { at_token = false; right = i; StringBuilder token = new StringBuilder(right-left); token.append(data, left, right-left); tokens.add(token.toString()); } } } int ctr = 0; for (String s : tokens) { System.out.println("Token #"+ctr+"\t"+s); ctr++; } } catch (FileNotFoundException e) { e.printStackTrace(); } catch (IOException e) { e.printStackTrace(); } } } }