
Java Vector Space Search and Latent Semantic Indexing


Ted Leung pointed to Latent Semantic Indexing today, which got me reading some papers. The patent situation is unfortunate, because it is a pretty nice technique. Then, thanks to Technorati, I found this, which led me to Building a Vector Space Search Engine in Perl.

Now this isn’t quite latent semantic indexing, but it uses some of the same techniques. I’m not sure what the patent situation is for it - the approach seems fairly trivial, but who knows? Either way, it looks really good for those times when you want to sort text into one of a number of potential categories, mainly because it isn’t too resource intensive. It appears that the algorithm should be much quicker than even the best Bayesian classifier implementation.

I figured that anything they can do in Perl, I can do in Java, so I’d like to present my very ugly Java version. This isn’t nice code: it doesn’t do stemming (note to self - look at using Lucene’s stemming code) and it uses doubles instead of BigDecimals (and/or BitSets), but it appears to work. I haven’t done vector math for a long time, so I might have screwed that up somewhere, too.
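To give a rough idea of the core technique, here is a minimal sketch. It is an illustration only, separate from the version mentioned above: the class name `VectorSpaceSketch` and its methods are hypothetical, it uses plain word counts with no stemming or stop-word removal, and it assumes each category is described by a single block of example text. Each text becomes a term-count vector, and a text is assigned to the category whose vector has the highest cosine similarity with it.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only; names are hypothetical, not from any library.
class VectorSpaceSketch {

    // Turn text into a sparse term-count vector: lower-cased word -> count.
    static Map<String, Double> termVector(String text) {
        Map<String, Double> vector = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                vector.merge(token, 1.0, Double::sum);
            }
        }
        return vector;
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<String, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Cosine of the angle between two vectors: 1.0 for the same direction,
    // 0.0 when they share no terms (or when either vector is empty).
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> entry : a.entrySet()) {
            Double other = b.get(entry.getKey());
            if (other != null) {
                dot += entry.getValue() * other;
            }
        }
        double lengths = norm(a) * norm(b);
        return lengths == 0.0 ? 0.0 : dot / lengths;
    }

    // Return the name of the category whose example text is most similar
    // to the input, or null if nothing overlaps at all.
    static String classify(String text, Map<String, String> categoryExamples) {
        Map<String, Double> query = termVector(text);
        String best = null;
        double bestScore = 0.0;
        for (Map.Entry<String, String> category : categoryExamples.entrySet()) {
            double score = cosine(query, termVector(category.getValue()));
            if (score > bestScore) {
                bestScore = score;
                best = category.getKey();
            }
        }
        return best;
    }
}
```

Calling `VectorSpaceSketch.classify(text, categories)` with a map of category names to example text returns the closest category name. In a real classifier you would build the category vectors once up front rather than on every call, which is a large part of why this approach can be cheap.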

However, it’s something I’m going to look at more in the future. I’ll probably build a classifier for Classifier4J based on it, and compare it to my Bayesian classifier.