Dominik Honnef

Are you looking for a Go programmer to implement your latest idea?
Send me an email!

rsemantic – Latent Semantic Analysis for Ruby

Introduction

Some time ago, I started reading Where Good Ideas Come From: The Natural History of Innovation, and one thing it mentions is an application called “DEVONthink”. DEVONthink is an application for storing all kinds of documents (RSS feeds, emails, notes, articles), and by far its most interesting feature is its ability to recommend similar documents without asking the user to provide categories, tags or summaries. Instead, DEVONthink automatically analyses all your documents and calculates their similarity.

The really interesting part is that DEVONthink can tell when different words have similar meanings, e.g. “canine” and “dog”. That means that if you’re reading a new article which talks mostly about dogs, DEVONthink will be able to suggest an article about canines. This isn’t done with a fixed list of words with similar meanings, but again by analysing all documents and checking which words are often used together.

Unfortunately, DEVONthink is only available for OS X, not for Linux. Besides that, it works with a centralised database, while I believe in the beauty of plain text files. That is why I began researching how DEVONthink manages to find similar documents, and I quickly came across Latent Semantic Analysis, LSA for short.

Latent Semantic Analysis

Now, I do not want to explain LSA in detail, but in summary it works as follows: first, a term-document matrix is constructed, storing which terms occur how often in which documents. Such a matrix is usually quite sparse, meaning it is mostly populated with zeros. LSA then decomposes the matrix into three parts (this is known as Singular Value Decomposition, SVD), keeps only the k largest singular values and then recomposes the matrix. This effectively reduces the dimensionality of the vector space, moving similar words closer together.
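The first step, building the term-document matrix, can be sketched in a few lines of plain Ruby. The documents and terms below are made up for illustration, and the SVD itself is left to a numerical library (rsemantic delegates it to GSL):

```ruby
# Build a term-document matrix: rows are terms, columns are documents,
# and cell (i, j) counts how often term i occurs in document j.

docs = {
  "dog.txt"    => %w[dog canine bark],
  "cat.txt"    => %w[cat feline purr],
  "beagle.txt" => %w[beagle dog bark]
}

terms = docs.values.flatten.uniq.sort

matrix = terms.map do |term|
  docs.values.map { |doc| doc.count(term) }
end

# Most cells are zero -- the matrix is sparse. For example, "bark"
# appears once in dog.txt, never in cat.txt, once in beagle.txt:
row = matrix[terms.index("bark")]
# => [1, 0, 1]
```

The SVD step would then factor `matrix` into three matrices and truncate the singular values, which is exactly the part that needs a linear algebra library.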

For a detailed (and mathematically accurate) description, check out the Wikipedia article about LSA.

LSA in Ruby

Trying to avoid doing the math myself, I looked for a ready-made Ruby library offering LSA. I found rsemantic, which looked promising, offering not only LSA but also tf-idf (a way to give common words lower weights).
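The idea behind tf-idf is easy to sketch in plain Ruby (this is an illustrative toy, not rsemantic’s actual implementation): a term’s frequency in a document is multiplied by the logarithm of how rare the term is across all documents, so words that appear everywhere contribute little.

```ruby
# Toy tf-idf: weight = term frequency * log(N / document frequency)

docs = [
  %w[dog dog canine bone],
  %w[cat feline dog],
  %w[gecko reptile]
]

# Document frequency: in how many documents does each term occur?
df = Hash.new(0)
docs.each { |doc| doc.uniq.each { |term| df[term] += 1 } }

n = docs.size.to_f

# tf-idf weights for each term in each document
tfidf = docs.map do |doc|
  tf = Hash.new(0)
  doc.each { |term| tf[term] += 1 }
  tf.each_with_object({}) do |(term, count), weights|
    weights[term] = count * Math.log(n / df[term])
  end
end

# "dog" occurs in two of the three documents, so its idf is low;
# "bone" occurs in only one, so it ends up weighted higher.
```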

The only problem with rsemantic was that it depends on linalg, another Ruby library, which provides bindings to LAPACK, a Fortran library for linear algebra. The problem with linalg is that it doesn’t compile on modern systems at all. Apparently it requires GCC 3.x, and the maintainer seems to have abandoned the project quite some time ago.

Looking for an alternative to either rsemantic (which wasn’t actively developed anymore, either) or linalg, I found the GNU Scientific Library (GSL), for which someone had written working Ruby bindings.

The next step was obvious: port rsemantic to GSL. At the same time, I was able to improve other parts of rsemantic, cleaning up the whole code base and drastically improving the performance, to the point where the only time-consuming step is the SVD itself.

The future of rsemantic

Luckily, the developer of rsemantic is still interested in the project (and even actively using it, according to him). He therefore merged my changes and added me as a contributor to the Git repository on GitHub, allowing me to keep working on the library.

A new version of the gem will be released as soon as both of us are happy with all the recent changes.

Furthermore, I am working on an improved API for rsemantic. The current one is pretty low-level (and not very object-oriented), so I am designing a far more high-level API. Below are two examples, first for the new API, then for the old one.

New API

require "semantic"
require "pp"

corpus = Semantic::Corpus.new([], :transforms => [:LSA])

# The directory /tmp/some_texts/ contains Wikipedia articles about
# various animals (beagle.txt canidae.txt cat.txt dog.txt gecko.txt
# german_shepherd.txt reptile.txt salamander.txt)
Dir.glob("/tmp/some_texts/*.txt").each do |file|
  # Here we add each article to the corpus (collection of documents).
  # `:name` is an attribute we're assigning to the document. There is
  # no limit on the names or the number of attributes you can attach
  # to each document.
  corpus << Semantic::Document.new(open(file), :name => File.basename(file))
end
# Build the index. Depending on the CPU, available memory and, most
# importantly, the size of the corpus, this can take some time.
corpus.build_index

# Compares the article about dogs to all other articles and returns a
# score for each article, representing the similarity.

# In the example output we can see that rsemantic correctly reports
# that "dog.txt" is by far most related to
# {german_shepherd,beagle,canidae}.txt, and not to e.g. reptiles.
pp corpus.documents.find { |d| d[:name] == "dog.txt" }.related
# {#<Semantic::Corpus 8 documents, @options={:transforms=>[:LSA]}>=>
#   [#<Semantic::SearchResult:0x0000000283db30
#     @document=#<Semantic::Document @attributes={:name=>"gecko.txt"}>,
#     @score=0.05935175420757509>,
#    #<Semantic::SearchResult:0x0000000283dbd0
#     @document=#<Semantic::Document @attributes={:name=>"reptile.txt"}>,
#     @score=0.09322815190241214>,
#    #<Semantic::SearchResult:0x0000000283db08
#     @document=#<Semantic::Document @attributes={:name=>"cat.txt"}>,
#     @score=0.14170011833185264>,
#    #<Semantic::SearchResult:0x0000000283db80
#     @document=#<Semantic::Document @attributes={:name=>"salamander.txt"}>,
#     @score=0.17269285169838045>,
#    #<Semantic::SearchResult:0x0000000283db58
#     @document=#<Semantic::Document @attributes={:name=>"canidae.txt"}>,
#     @score=0.2107975969277189>,
#    #<Semantic::SearchResult:0x0000000283dba8
#     @document=#<Semantic::Document @attributes={:name=>"beagle.txt"}>,
#     @score=0.4041096720670102>,
#    #<Semantic::SearchResult:0x0000000283dbf8
#     @document=#<Semantic::Document @attributes={:name=>"german_shepherd.txt"}>,
#     @score=0.508139138880475>,
#    #<Semantic::SearchResult:0x0000000283dc20
#     @document=#<Semantic::Document @attributes={:name=>"dog.txt"}>,
#     @score=1.0000000000000688>]}
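Scores like these are commonly computed as the cosine similarity between document vectors in the reduced space, which is also why a document compared to itself scores (numerically almost exactly) 1.0. A minimal sketch in plain Ruby, using made-up toy vectors rather than rsemantic’s internals:

```ruby
# Cosine similarity: the cosine of the angle between two vectors.
# 1.0 means the vectors point in the same direction, 0.0 means they
# are orthogonal (completely unrelated documents).
def cosine_similarity(a, b)
  dot    = a.zip(b).sum { |x, y| x * y }
  norm_a = Math.sqrt(a.sum { |x| x * x })
  norm_b = Math.sqrt(b.sum { |x| x * x })
  dot / (norm_a * norm_b)
end

dog    = [2.0, 1.0, 0.0]
beagle = [1.0, 1.0, 0.0]
gecko  = [0.0, 0.0, 3.0]

cosine_similarity(dog, dog)    # a document is maximally similar to itself
cosine_similarity(dog, beagle) # related documents score high
cosine_similarity(dog, gecko)  # unrelated documents score near zero
```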

Old API

require "semantic"
require "pp"

corpus = Semantic::Corpus.new([], :transforms => [:LSA])

documents      = []
document_names = []

Dir.glob("/tmp/some_texts/*.txt").each do |file|
  documents      << File.read(file)
  document_names << File.basename(file)
end

# This automatically builds the index for all documents
search = Semantic::Search.new(documents, :transforms => [:LSA])

document = document_names.index("dog.txt")

h = {}
search.related(document).each_with_index do |score, index|
  h[document_names[index]] = score
end


# The resulting scores are the same, but we have to keep track of
# file names and indices manually, and the scores aren't sorted, either.
pp h
# {"dog.txt"=>1.0000000000000688,
#  "german_shepherd.txt"=>0.508139138880475,
#  "reptile.txt"=>0.09322815190241214,
#  "beagle.txt"=>0.4041096720670102,
#  "salamander.txt"=>0.17269285169838045,
#  "canidae.txt"=>0.2107975969277189,
#  "gecko.txt"=>0.05935175420757509,
#  "cat.txt"=>0.14170011833185264}