1.5 Million Words in JavaBlogs?

The 1.5 million “words” I claimed JavaBlogs has need to be clarified slightly. They are “words” as defined by running String.split("\\W") on all the archived posts. The \W regular expression is defined as “a non-word character”, i.e. any character that is not in [a-zA-Z_0-9]. For normal English sentences from a book that is probably a reasonable definition. However, when used on blogs, where there are a large number of URLs, it doesn’t quite work. For instance, we suddenly find that “http” is one of the most popular “words” in the English language. That’s because all URLs are split on their non-word characters, so http://www.javablogs.com is split into “http”, “www”, “javablogs” and “com”. Also, dates like 2-May-2003 or 25/12/2002 are split on the “-” and “/” characters, so “2002” and “2003” are very common words.
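To make the effect concrete, here is a minimal sketch of that kind of split. The class and method names are made up for illustration, not taken from the actual JavaBlogs code. One detail worth knowing: String.split("\\W") returns an empty string between two adjacent non-word characters (such as the “://” in a URL), so this sketch drops empty tokens, which is an assumption about how the counting would treat them.

```java
import java.util.ArrayList;
import java.util.List;

public class NonWordSplitDemo {

    // Hypothetical helper: splits on every single non-word character
    // and discards the empty tokens produced by adjacent delimiters.
    static List<String> splitOnNonWord(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.split("\\W")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // The URL falls apart into its components.
        System.out.println(splitOnNonWord("http://www.javablogs.com"));
        // prints [http, www, javablogs, com]

        // Dates are broken up on "-" and "/".
        System.out.println(splitOnNonWord("2-May-2003"));
        // prints [2, May, 2003]
        System.out.println(splitOnNonWord("25/12/2002"));
        // prints [25, 12, 2002]
    }
}
```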

My current thinking is to try splitting on “\s” instead, i.e. on whitespace.
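A rough sketch of what the whitespace version might look like follows. Again the names are hypothetical, and using "\\s+" (one or more whitespace characters) rather than a bare "\\s" is my assumption, made so that runs of spaces or newlines count as a single separator.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceSplitDemo {

    // Hypothetical helper: splits on runs of whitespace only, so URLs
    // and dates survive as single tokens.
    static List<String> splitOnWhitespace(String text) {
        List<String> words = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            // An empty or all-whitespace input yields one empty token; skip it.
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(splitOnWhitespace("Posted 2-May-2003  at http://www.javablogs.com"));
        // prints [Posted, 2-May-2003, at, http://www.javablogs.com]
    }
}
```

The trade-off is that punctuation stays attached to its neighbour, so “words.” and “words” would be counted as different tokens unless they are cleaned up in a later step.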