Download | - View the accepted manuscript: Improving document clustering in a learned concept space (PDF, 711 KiB)
|
---|
DOI | https://doi.org/10.1016/j.ipm.2009.09.007 |
---|
Author | Pessiot, Jean-François; Kim, Young-Min; Amini, Massih-R.¹; Gallinari, Patrick |
---|
Affiliation | - National Research Council Canada. NRC Institute for Information Technology
|
---|
Format | Text, Article |
---|
Subject | Document clustering; Aspect models; Concept learning |
---|
Abstract | Most document clustering algorithms operate in a high-dimensional bag-of-words space. The inherent presence of noise in such a representation degrades the performance of most of these approaches. In this paper we investigate an unsupervised dimensionality reduction technique for document clustering. This technique is based upon the assumption that terms co-occurring in the same context with the same frequencies are semantically related. On the basis of this assumption we first find term clusters using a classification version of the EM algorithm. Documents are then represented in the space of these term clusters and a multinomial mixture model (MM) is used to build document clusters. We empirically show on four document collections, Reuters-21578, Reuters RCV2-French, 20Newsgroups and WebKB, that this new text representation noticeably increases the performance of the MM model. By relating the proposed approach to the Probabilistic Latent Semantic Analysis (PLSA) model we further propose an extension of the latter in which an extra latent variable allows the model to co-cluster documents and terms simultaneously. We show on these four datasets that the proposed extended version of the PLSA model produces statistically significant improvements with respect to two clustering measures over all variants of the original PLSA and the MM models. |
---|
Publication date | 2010-03-01 |
---|
Language | English |
---|
Peer reviewed | Yes |
---|
NPARC number | 16885324 |
---|
Record identifier | 11e1935d-5d95-4bb7-82fd-b4dcc0ab00fe |
---|
Record created | 2011-02-22 |
---|
Record modified | 2020-04-17 |
---|