Content Words (CWs) are important segments of a text. In text mining, they are used for purposes such as topic identification, document summarization, and question answering. Identifying CWs usually requires various language-dependent tools. However, such tools are not available for many languages, and developing them for every language is costly. At the same time, because of the recent growth of text content in many languages, language-independent text mining has great potential. To mine text automatically, a way of finding CWs that does not depend on language-specific tools is required. In this research, we devise a framework that classifies text segments as CWs in a language-independent way. We identify structural features that relate text segments to CWs, derive these features from a large text corpus, and apply machine learning-based classification to decide which segments are CWs. The proposed framework uses only a large text corpus and some training examples; apart from these, it does not require any language-specific tool. We conducted experiments with our framework on three different languages (English, Vietnamese, and Indonesian) and found that it achieves more than 83% accuracy.
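The general shape of such a pipeline (corpus-derived features plus a classifier trained on a few labeled examples) can be illustrated with a minimal sketch. The abstract does not list the structural features or name the classifier, so everything specific here is an assumption made for illustration: the three features (relative frequency, document spread, capped token length), the nearest-centroid classifier, the toy corpus, and the function names `corpus_features`, `train_centroids`, and `predict` are all hypothetical and are not the authors' method.

```python
import math
from collections import Counter

# Toy tokenized corpus (hypothetical data, whitespace-split only,
# so no language-specific tool is involved).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "a bird sang on the tree".split(),
    "the tree fell on a house".split(),
]

def corpus_features(token, docs):
    """Hypothetical structural features computed from the raw corpus alone."""
    total_tokens = sum(len(d) for d in docs)
    term_freq = sum(d.count(token) for d in docs)
    doc_freq = sum(1 for d in docs if token in d)
    rel_freq = term_freq / total_tokens if total_tokens else 0.0
    doc_spread = doc_freq / len(docs) if docs else 0.0
    length = min(len(token), 10) / 10  # capped and scaled to [0, 1]
    return [rel_freq, doc_spread, length]

def train_centroids(examples, docs):
    """Average the feature vectors of the labeled examples per class."""
    sums, counts = {}, Counter()
    for token, label in examples:
        feats = corpus_features(token, docs)
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] += 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def predict(token, centroids, docs):
    """Assign the label of the nearest class centroid (Euclidean distance)."""
    feats = corpus_features(token, docs)
    return min(centroids, key=lambda label: math.dist(feats, centroids[label]))

# A handful of labeled training examples (hypothetical).
examples = [
    ("the", "non-CW"), ("on", "non-CW"), ("a", "non-CW"),
    ("cat", "CW"), ("tree", "CW"), ("bird", "CW"),
]
centroids = train_centroids(examples, docs)

for word in ["house", "dog", "the"]:
    print(word, predict(word, centroids, docs))
```

The only inputs are a tokenized corpus and a few labeled examples, which mirrors the resource constraint stated in the abstract. A real system would use a much larger corpus, richer features, and a stronger classifier.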