Structure of VMs analyzed with LSI
Dear All,
A short while ago I started reading about a technique called LSI or LSA
(Latent Semantic Indexing or Analysis). Basically, the technique is used to
retrieve 'large' texts given 'short' texts. The 'large' texts can be long
documents or summaries; the 'short' texts, in the form of queries, can be a
few words or even summaries. The technique matches the query (the short
text) against the documents in a database (the large texts). I think there
is potential to use this technique to test words or combinations of words
against parts of the VMs and try to find its 'latent' structure.
The technique is a successful alternative to keyword matching in database
searches, and it has only been developed in the last 10 years because of
the large computing requirements for large databases. It goes one step
beyond simple keyword matching, beyond traditional statistics such as those
found in TACT (frequency counts or node collocations), and even beyond
graphical semantic networks (which are restricted to short documents).
Historically, in LSI the mathematics was developed first and the
applications later, so there is plenty of scope for applying LSI to new
problems.
SUMMARY OF SUMMARIES
To start working with LSA, the usual procedure is first to create a
database of documents and to list all the words used in those documents
(usually suppressing common words such as 'a', 'the', 'we', ...). This
creates a large matrix, which might be as big as 60,000 words x 1,000,000
documents or as small as 5,000 words x 10,000 documents. The words/docs
matrix is traditionally laid out with the words as rows and the docs as
columns. For example, a database on computers might be displayed as
follows:
              doc1   doc2   doc3   ...   docn
screen          2
laptop          1             2
memory          3      2
CPU             1      1
information            4
retrieval              2
cable           1                          1
16 bits                2
where each entry in the matrix is the number of times that a word is found
in the specific document. In this way, each term and each document can be
represented by a vector. (There are other frequency weighting methods that
can also be applied, for example to scale each column vector to unity.) The
assumption is that there is some underlying or 'latent' structure in the
pattern of word usage across documents. The analysis is done with SVD
(singular value decomposition), which good mathematical programs (such as
Matlab) can calculate.
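To make this concrete, here is a minimal sketch in Python/NumPy (Matlab
would work just as well) that builds the toy matrix above, with a fourth
column standing in for 'docn', and computes its truncated SVD. The zero
entries and the choice of k are purely illustrative:

import numpy as np

# Toy term/document count matrix from the example above (rows = terms,
# columns = documents; zeros simply mean "word not present in that doc").
terms = ["screen", "laptop", "memory", "CPU",
         "information", "retrieval", "cable", "16 bits"]
A = np.array([
    [2, 0, 0, 0],   # screen
    [1, 0, 2, 0],   # laptop
    [3, 2, 0, 0],   # memory
    [1, 1, 0, 0],   # CPU
    [0, 4, 0, 0],   # information
    [0, 2, 0, 0],   # retrieval
    [1, 0, 0, 1],   # cable
    [0, 2, 0, 0],   # 16 bits
], dtype=float)

# One possible frequency weighting: scale each document (column) to unit length.
A = A / np.linalg.norm(A, axis=0, keepdims=True)

# Singular value decomposition, keeping only the k largest dimensions.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vtk = U[:, :k], s[:k], Vt[:k, :]

# Rows of Uk*diag(sk) are the term vectors and rows of Vtk.T*diag(sk) are
# the document vectors in the reduced k-dimensional space.
term_vecs = Uk * sk
doc_vecs = Vtk.T * sk
print(term_vecs.shape, doc_vecs.shape)    # (8, 2) (4, 2)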
From [1]:
"One can interpret the analysis performed by SVD
geometrically. The result of the SVD is a k-dimensional
vector space containing a vector for each term and each
document. The location of term vectors reflects the
correlations in their usage across documents. Similarly,
the location of document vectors reflects correlations
in term usage. In this space the cosine or dot product
between vectors correspond to their estimated similarity.
Retrieval [of queries] proceeds by using the terms in a
query to identify a vector in the space, and all documents
are then ranked by their similarity to the query vector."
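Continuing the sketch above (it reuses the names terms, Uk, sk and
doc_vecs, which are my own and not from [1]), a query can be 'folded in'
to the same reduced space and the documents ranked by cosine similarity:

# Fold a query into the reduced space and rank the documents against it.
def fold_in_query(query_terms, terms, Uk, sk):
    """Project a bag-of-words query into the reduced LSA space."""
    q = np.array([1.0 if t in query_terms else 0.0 for t in terms])
    return q @ Uk / sk            # standard fold-in: q * Uk * inv(diag(sk))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

q_vec = fold_in_query({"laptop", "memory"}, terms, Uk, sk)
ranking = sorted(range(doc_vecs.shape[0]),
                 key=lambda j: cosine(q_vec, doc_vecs[j]),
                 reverse=True)
print("documents ranked by similarity to the query:", ranking)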
Papers describing the applications of LSA usually indicate that the method
has been successful in applications where 'structure' is important, and
that even when words are missing from queries the correct documents can
still be retrieved because of their proximity in the reduced space. The
method is independent of the language of the texts.
APPLICATION TO VMs
LSA works by creating a matrix of words/texts, but, as far as we know, we
only have ONE text: there is only one manuscript in the VMs script (with a
total of 26536 words in 234 pages). Therefore, the best we can do is test
whether different parts of the same Ms are congruent with each other. We
could use each section (herbal, astronomical, cosmological, astrological,
biological, pharmaceutical, text with marginal stars, and text only) as a
separate document, or perhaps each page as a separate document, to simulate
different texts and build the LSA matrix. Then each section can be used as
a query and tested against every other page (or paragraph or section or
whatever unit the Ms is divided into). It might then be possible to check
whether any structure can be identified, for example to verify whether the
order of the pages is correct or whether there is another split equivalent
to hands A and B; a rough sketch of this idea follows below.
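Here is how that could be set up, again in Python/NumPy. The input file
'vms_pages.txt' is a pure assumption (one transcribed page per line,
'words' separated by spaces); any transcription could be reshaped into
this form:

import numpy as np
from collections import Counter

# Hypothetical input: one transcribed page per line, words separated by spaces.
pages = [line.split() for line in open("vms_pages.txt", encoding="utf-8")]

vocab = sorted({w for page in pages for w in page})
index = {w: i for i, w in enumerate(vocab)}

# Term/page count matrix, one column per page.
A = np.zeros((len(vocab), len(pages)))
for j, page in enumerate(pages):
    for w, c in Counter(page).items():
        A[index[w], j] = c

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = min(50, len(s))
page_vecs = Vt[:k, :].T * s[:k]          # one reduced vector per page

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Compare any two pages (or the averaged vectors of two sections) by cosine.
print(cosine(page_vecs[0], page_vecs[1]))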
Another possible study is to use the short labels and phrases found in some
parts of the text (such as the labels for plants and stars) as queries, and
see how close (in terms of cosine distance) the labels are to the rest of
the pages and sections.
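Building on the previous sketch (it reuses vocab, index, U, s, k, pages,
page_vecs and cosine from there), a label could be folded in as a tiny
query and the pages ranked by their similarity to it; the label words
below are placeholders:

# Fold a short label into the reduced page space and rank the pages.
label_words = ["word1", "word2"]          # placeholders for a real label
q = np.zeros(len(vocab))
for w in label_words:
    if w in index:
        q[index[w]] += 1.0
q_vec = q @ U[:, :k] / s[:k]

ranking = sorted(range(len(pages)),
                 key=lambda j: cosine(q_vec, page_vecs[j]),
                 reverse=True)
print("pages closest to the label:", ranking[:10])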
As recently suggested (by Corey Snow), alternative alphabets can be created
and tested automatically. LSI would make it possible to check for those
combinations that create, say, the smallest number of clusters or that
score best on some other measure of consistency among the texts.
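As one possible (and very crude) measure of consistency, assuming the names
from the sketches above, each candidate re-spelling of the text could be
scored by the mean pairwise cosine of its page vectors; this is only an
illustration of the kind of automatic test that could be run, not a
recommendation of this particular score:

def consistency_score(pages):
    """Crude consistency measure: mean pairwise cosine of the reduced page
    vectors.  Higher means the pages look more alike in the LSA space."""
    vocab = sorted({w for p in pages for w in p})
    idx = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(pages)))
    for j, p in enumerate(pages):
        for w, c in Counter(p).items():
            A[idx[w], j] = c
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = min(20, len(s))
    V = Vt[:k, :].T * s[:k]
    sims = [cosine(V[a], V[b])
            for a in range(len(pages)) for b in range(a + 1, len(pages))]
    return sum(sims) / len(sims)

# 'candidate_mappings' would be a list of hypothetical re-spelling functions,
# each turning a page (a list of words) into its re-spelled version:
# best = max(candidate_mappings,
#            key=lambda m: consistency_score([m(p) for p in pages]))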
PRACTICAL APPLICATION IN UNIX
The work is divided in two parts: creation of the words/docs matrix and SVD
analysis. There are programs that allow to do the work, but only in Unix.
The programs have to be requested to Michael W. Berry at the University of
Tennessee (I have the complete details).
Yours,
Claudio
[1] Foltz, Peter, et al., "Personalized Information Delivery: An Analysis
of Information Filtering Methods", Comm. ACM, 35 (12), 51-60, 1992; also at
http://www-psych.nmsu.edu/~pfoltz/cacm/cacm.html (various other papers are
at the same site).
[The singular values of a matrix A are the square roots of the eigenvalues
of A'*A, where A' is the Hermitian (conjugate) transpose of A. The
technique is closely related to eigenvector decomposition, factor analysis
and modal analysis.]
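For what it is worth, that relation is easy to check numerically; a small
sketch, assuming NumPy:

import numpy as np

A = np.random.rand(5, 3)
s = np.linalg.svd(A, compute_uv=False)          # singular values of A
eig = np.linalg.eigvalsh(A.conj().T @ A)        # eigenvalues of A'*A
print(np.allclose(sorted(s**2), sorted(eig)))   # True: s_i^2 are those eigenvalues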