Abstract
This work presents DENAC, a model and its accompanying software that discover
the natural number of clusters “as a human being would do,” using semantic relations
in an unsupervised classification. DENAC makes LDA (Latent
Dirichlet Allocation), an unsupervised classifier, behave like a supervised classifier: DENAC classifies a set of
documents into a certain number of groups and labels them, and these groups agree very well with the
classification that an ordinary person would give.
The documents (unstructured information) are
gathered from online news sites (the Mexican digital press). The news items are preprocessed
with natural language processing so that they are consistent with the clustering
algorithm, which employs WordNet to measure word similarity. The
linguistic treatment consists of removing stop words, lemmatizing, and handling synonyms.
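The linguistic treatment described above can be illustrated with a minimal sketch. It assumes tiny hand-written lookup tables (`STOP_WORDS`, `LEMMAS`, `SYNONYMS`) and a hypothetical `preprocess` function; none of these names come from DENAC, and a real pipeline would rely on full stop-word lists and a proper lemmatizer.

```python
import re

# Hypothetical, tiny resources for illustration only (not DENAC's data).
STOP_WORDS = {"the", "a", "of", "in", "and", "were", "was", "by"}
LEMMAS = {"elections": "election", "votes": "vote", "voted": "vote", "cars": "car"}
SYNONYMS = {"automobile": "car", "ballot": "vote"}


def preprocess(text):
    """Lowercase, tokenize, drop stop words, lemmatize, and merge synonyms."""
    tokens = re.findall(r"[a-záéíóúñü]+", text.lower())
    result = []
    for token in tokens:
        if token in STOP_WORDS:
            continue
        lemma = LEMMAS.get(token, token)           # map inflected form to lemma
        result.append(SYNONYMS.get(lemma, lemma))  # map synonym to one canonical word
    return result


print(preprocess("The votes were counted in the elections"))
# → ['vote', 'counted', 'election']
```

Mapping synonyms to a single canonical word is one possible reading of the synonym step; the abstract does not specify how DENAC treats synonyms.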
LDA finds the main topics in the documents, yielding a few words that represent or describe each cluster.
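To make the role of LDA concrete, the following is a minimal sketch of LDA fitted by collapsed Gibbs sampling, written in plain Python. It is not DENAC's implementation: the function names (`lda_gibbs`, `top_words`), the hyperparameters `alpha` and `beta`, and the toy documents are all assumptions chosen for illustration, and a production system would normally use an established LDA library.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA.

    Returns (topic_word counts, vocabulary, doc_topic counts).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * V for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][word_id[w]] += 1
            topic_total[z] += 1
        assignments.append(z_doc)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                z = assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][wid] -= 1
                topic_total[z] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][wid] + beta)
                    / (topic_total[k] + V * beta)
                    for k in range(num_topics)
                ]
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][wid] += 1
                topic_total[z] += 1
    return topic_word, vocab, doc_topic


def top_words(topic_word, vocab, n=3):
    """Return the n most frequent words of each topic."""
    return [
        [vocab[i] for i in sorted(range(len(vocab)), key=lambda i: -row[i])[:n]]
        for row in topic_word
    ]


# Toy, already preprocessed documents (hypothetical data).
docs = [
    ["election", "vote", "party", "vote"],
    ["party", "election", "candidate"],
    ["goal", "match", "team", "goal"],
    ["team", "match", "player"],
]
topic_word, vocab, doc_topic = lda_gibbs(docs, num_topics=2)
print(top_words(topic_word, vocab))
```

The top words of each topic play the role of the "few words that represent or describe each cluster"; the number of topics is fixed by hand here.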
The software computes the distances between words in the same cluster or
group (intra-distances) and the distances between clusters (inter-distances) to
measure how compact the clusters are and how far they are from each other. The distances are calculated on the WordNet taxonomy, which describes the semantic relations between words. The
similarity function used on the taxonomy is Path-Similarity.
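The following sketch illustrates path similarity and the intra- and inter-cluster distances on a toy taxonomy. Path similarity is taken here as 1 / (1 + shortest path length), the definition used by NLTK's WordNet interface. Everything else is an assumption made for illustration: the small `PARENT` tree stands in for WordNet, the distance is taken as 1 minus the similarity, and distances are averaged over word pairs. The abstract does not state how DENAC converts similarities into distances or aggregates them.

```python
from itertools import combinations, product

# Hypothetical is-a taxonomy standing in for WordNet (child -> parent).
PARENT = {
    "animal": "entity", "vehicle": "entity",
    "dog": "animal", "cat": "animal",
    "car": "vehicle", "bus": "vehicle",
}


def ancestors(word):
    """Return the chain from the word up to the root, word first."""
    chain = [word]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_length(a, b):
    """Number of edges on the shortest path between a and b in the tree."""
    chain_a, chain_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(chain_a):
        if node in chain_b:  # first shared node is the lowest common ancestor
            return steps_a + chain_b.index(node)
    raise ValueError("no common ancestor")


def path_similarity(a, b):
    return 1.0 / (1 + path_length(a, b))


def distance(a, b):
    # Assumption: distance is the complement of the similarity.
    return 1.0 - path_similarity(a, b)


def intra_distance(cluster):
    """Mean pairwise distance between the words of one cluster."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


def inter_distance(cluster_a, cluster_b):
    """Mean distance between words of two different clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


animals, vehicles = ["dog", "cat"], ["car", "bus"]
print(round(intra_distance(animals), 3))            # → 0.667
print(round(inter_distance(animals, vehicles), 3))  # → 0.8
```

In this toy case the intra-distance is smaller than the inter-distance, which is the pattern expected of compact, well-separated clusters.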
Additionally, every cluster is labeled with a few words, again using
semantic relations. The groups are presented in a
visualization that shows the results: labels, clusters, the number of documents assigned to each cluster, and the
words that are common to two clusters. The full thesis is available in Spanish.