Tuesday, September 27, 2011

Splitting text on white-space efficiently in Java

Small data, small problem. Big data, ...
Splitting text on white-space in most programming languages is normally a trivial matter. This task becomes a lot more difficult though when you try to complete it over a very large document collection.

The easiest way to split a document on white-space is to use the split method, which is implemented by default in most languages on String objects. Although the split method produces the end result we are looking for, it is not the most efficient way to manipulate large sets of data.

Strings and Streams
In Java, it can be advantageous to use streams over Strings since streams allow the programmer greater flexibility in the specifying the I/O buffer size. By specifying an optimal buffer size for the input stream, the programmer can reduce the the number of disk reads that need to be performed in order to load the contents of a file into memory. Less I/O means more performance.

Use Buffered I/O (for text documents, in this case)
Since the purpose of buffered I/O is to reduce the number of disk reads necessary to load a file, you can imagine that knowing the size of the file prior to loading it is important. After the size of the file we want to load is known, we create a character array store the contents of the file one character at a time. (Note that the character encoding of the document you are reading impacts the number of bits or bytes used to represent a character.)

The Split Function
At this point we have our document stored within a char array and need to start splitting it into individual words or tokens efficiently. Since we are dealing with a char array, we no longer have the String method "split" available. The strategy here is to step through the entire char array (document) one time and pick out all of the words to be stored (say in a dictionary). In order to achieve this task the trick is to keep track of when a word begins and ends in the stream. This can be done by using a simple Boolean flag within the loop which iterates through the character array.

First determine whether or not the first character is white-space or not. This is important since we need to know whether we are looking for the beginning of a word first or whether we're looking for the end. Assuming the document starts with a word instead of white space, we take note of the fact that we are currently iterating over a word. Once we realize that the word has ended (a space has been encountered) we mark the end of the word, add it to the dictionary and set our Boolean flag to record that we are no longer on a word. At this point, as we iterate through the document further, should any additional white-space be encountered, it will be ignored. The next time something interesting happens in the code is when we reach a non white-space character, at which point we reset our Boolean flag to reflect that we are at a word boundary again and take note of where it starts. Then we move on looking for the next instance of white-space to find the ending boundary of the latest term discovered.

I know that explaining an algorithm can often sound convoluted and so I have attached a small code sample that demonstrates the technique of using streams and character arrays to efficiently process large document sets.
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.BufferedReader;
import java.util.ArrayList;

public class Driver {

	public static void main(String[] args) {
		// Set the location of the corpus
		File corpus = new File("/path/to/document/collection");
		// Iterate through the files and perform the necessary actions
		File[] files = corpus.listFiles();
		for(File f : files) {
			try {
				
				// Get the size of the file, create a byte stream and a character array to store the input
				int file_size = (int) f.length();
				Reader input = new BufferedReader(new InputStreamReader(new FileInputStream(f)));
				char[] data = new char[file_size];
				
				// Create an array list to store the tokens
				ArrayList tokens = new ArrayList();

				// Read the file
				while (input.read(data, 0, file_size) != -1) {

				  // Determine whether we are starting with whitespace
				  boolean at_token = Character.isWhitespace(data[0]) ? false: true;
				  
				  // Keep track of the bounds of tokens
				  int left = 0, right = 0;
				  
				  // Iterate through the data stream
                  for(int i = 0; i < data.length; i++) {
                	// Found the beginning of a token
                	if (!Character.isWhitespace(data[i]) && at_token == false) {
                      at_token = true;
                      left = i;
                    }
                	// Found the end of a token
                	if (Character.isWhitespace(data[i]) && at_token == true) {	
                      at_token = false;
                      right = i;
                      StringBuilder token = new StringBuilder(right-left);
  					  token.append(data, left, right-left);
  					  tokens.add(token.toString());
                    }
                  }
				}
				int ctr = 0;
				for (String s : tokens) {
					System.out.println("Token #"+ctr+"\t"+s);
					ctr++;
				}
			} catch (FileNotFoundException e) {
				e.printStackTrace();
			}
			catch (IOException e) {
				e.printStackTrace();
			}
		}
	}
}

Monday, June 13, 2011

Starting a new Repository on GitHub


These are the instructions that GitHub provides you with in order to create a new repository.


Note: replace the name John Doe with your actual GitHub user name. Also replace the email and don't forget to name your app.


Global setup:

Download and install Git
  git config --global user.name "John Doe"
  git config --global user.email john.doe@gmail.com
        

Next steps:

mkdir Name-Your-App
  cd Name-Your-App
  git init
  touch README
  git add README
  git commit -m 'first commit'
  git remote add origin git@github.com:JohnDoe/Name-Your-App.git
  git push -u origin master
      

Existing Git Repo?

cd existing_git_repo
  git remote add origin git@github.com:JohnDoe/Twitter-Apps.git
  git push -u origin master

Thursday, March 17, 2011

Git Handbook

Git is version control software that allows coders to tinker with their code without compromising the project's code. Here are the basics you'll need to git going with Git.

Git Configuration
Set your email that will appear in commit messages.
git config user.email youremail@example.com

Set your name that will appear in commit messages.
git config user.name 'FirstName LastName'

Recommended: Tells git-branch and git-checkout to setup new branches so that git-pull will appropriately merge from that remote branch. Without this, you will have to add --track to your branch command or manually merge remote tracking branches with "fetch" and then "merge".
git config branch.autosetupmerge true

You can add "--global" after "git config" to any of the above commands to make it
apply to all git repos (note this will edit the ~/.gitconfig).

Clone the specified repository by <repo> this is similar to "checkout" in some other version control systems such as Subversion and CVS

git clone <repo>

Gathering information from Git
git diff

Show files added to the index, files with changes, and untracked files
git status

Show recent commits, most recent on top
git log

Show the changeset (diff) of a commit specified by <rev>, which can be any SHA1 commit ID, branch name, or tag
git show <rev>

Show who authored each line in <file>
git blame <file> 

Show who authored each line in <file> as of <rev> (allows blame to go back in time)
git blame <file> <rev>

As usual,  you can expect more to come soon.
Online Marketing