Extracting the intrinsic structure of a document collection without pre-classified training data is a challenging text-mining task because of the high dimensionality of the input data, which usually takes the form of word-frequency vectors derived from the bag-of-words (BOW) model of document representation. Self-Organizing Maps (SOMs) represent high-dimensional data as topology-preserving two-dimensional projections, so they can accomplish the dual tasks of dimensionality reduction and data visualization in a single operation. Emergence, the generation of complex systems and patterns through the cooperation of multiple elementary interactions, provides a way of detecting higher-level structures, or clusters of clusters, within a document corpus. Though intuitively appealing, extending the BOW representation with bigram vectors has not been very successful in improving the accuracy of text categorization. This study investigates the effectiveness of bigram vectors in cluster visualization (rather than categorization or classification) using emergent SOMs. Experiments were intentionally conducted with a limited vocabulary (approximately 40,500 words and 121,000 bigrams) so that both higher-level (emergent) structures and lower-level document relatedness could be visualized in a single map. The results indicate that incorporating bigrams enhances the emergence effect.
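As a rough illustration of the pipeline summarized above, the following sketch builds frequency vectors from unigrams plus bigrams, trains a small SOM on them, and computes U-matrix-style heights of the kind commonly used to visualize cluster boundaries. It is a minimal sketch under stated assumptions, not the study's implementation: the toy documents, the 6x6 planar grid, the learning-rate and neighbourhood schedules, and all function names (`tokens_with_bigrams`, `vectorize`, `train_som`, `u_matrix`) are hypothetical choices made for illustration. An emergent SOM would typically use far more neurons than the handful shown here.

```python
import math
import random
from collections import Counter

def tokens_with_bigrams(text):
    """Return unigrams plus adjacent-word bigrams for one document."""
    words = text.lower().split()
    bigrams = [f"{a}_{b}" for a, b in zip(words, words[1:])]
    return words + bigrams

def vectorize(docs):
    """Build L2-normalised frequency vectors over a shared vocabulary."""
    counts = [Counter(tokens_with_bigrams(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    vectors = []
    for c in counts:
        v = [float(c[t]) for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vocab, vectors

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_matching_unit(grid, v):
    """Grid coordinates of the neuron whose weights are closest to v."""
    return min(grid, key=lambda pos: sq_dist(grid[pos], v))

def train_som(vectors, rows=6, cols=6, epochs=50, seed=0):
    """Train a planar SOM; all hyperparameters are illustrative assumptions."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac)                             # decaying learning rate
        radius = max(rows, cols) / 2 * (1.0 - frac) + 0.5   # shrinking neighbourhood
        for v in rng.sample(vectors, len(vectors)):         # shuffled pass over the data
            br, bc = best_matching_unit(grid, v)
            for (r, c), w in grid.items():
                d2 = (r - br) ** 2 + (c - bc) ** 2
                h = math.exp(-d2 / (2 * radius ** 2))       # Gaussian neighbourhood weight
                for i in range(dim):
                    w[i] += lr * h * (v[i] - w[i])
    return grid

def u_matrix(grid):
    """Mean weight-space distance from each neuron to its 4-connected neighbours."""
    heights = {}
    for (r, c), w in grid.items():
        nbrs = [(r + dr, c + dc)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
                if (r + dr, c + dc) in grid]
        heights[(r, c)] = sum(math.sqrt(sq_dist(w, grid[n])) for n in nbrs) / len(nbrs)
    return heights

# Toy corpus (made up for this sketch).
docs = [
    "self organizing maps project documents onto a grid",
    "self organizing maps preserve topology of documents",
    "bigram vectors extend the bag of words model",
    "the bag of words model counts word frequency",
]
vocab, vectors = vectorize(docs)
som = train_som(vectors)
positions = [best_matching_unit(som, v) for v in vectors]
heights = u_matrix(som)
```

Here `positions` gives each document's best-matching neuron on the map, and `heights` can be rendered as a heat map: ridges of high values suggest boundaries between clusters, while low-lying regions suggest groups of related documents. Dropping the bigram terms from `tokens_with_bigrams` gives a unigram-only baseline for comparison.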
|Journal||Proceedings of the Third International Conference on Modeling, Simulation and Optimization|