In recent years there has been a rapid increase in the amount of electronically available data, which has fostered the emergence of novel data mining and machine learning applications able to extract information and knowledge from it. A significant proportion of these data sources take the form of natural text, which involves difficulties not present in other domains, such as its unstructured nature and the high dimensionality of the resulting datasets. Natural text has to be preprocessed before it can be analyzed by computers, and learning algorithms have to cope with such high-dimensional feature spaces. Text mining techniques are invaluable for extracting knowledge from natural text, as well as from other types of unstructured, alphabet-based data such as DNA strings.
Many of these data sources are not available as closed-ended datasets, but rather as data streams of examples that arrive in a sequence over time. This includes many text sources, such as web pages, emails and blog posts. Given the unbounded nature of these streams, it is important to work with scalable algorithms that use reduced time and memory. The algorithms must also be able to adapt to changes in the underlying statistical distributions governing the data, which is especially difficult in data streams because of their high dimensionality. For text streams to be computationally tractable, the dimensionality of the data must first be reduced, so that only the most relevant terms are employed by the learning algorithms. However, the importance of the terms changes over time, which in practice means that it is necessary to work under the assumption of a dynamic feature space. Keeping track of this evolving high-dimensional feature space is an intrinsically complex problem, since the importance of each feature depends on the others.
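To make the idea of a dynamic feature space concrete, the following is a minimal, illustrative sketch (not any of the algorithms presented in this thesis) of how term importance can be tracked incrementally over a document stream with TF-IDF-style weights, so that the current most relevant terms can be re-queried as new documents arrive. All class and variable names here are invented for the example.

```python
import math
from collections import Counter


class StreamingTfIdf:
    """Toy incremental TF-IDF tracker: document frequencies are updated
    as each document arrives, and the current top-k terms for a document
    can be queried at any time. Illustrative only."""

    def __init__(self):
        self.df = Counter()  # document frequency of each term so far
        self.n_docs = 0      # number of documents seen so far

    def update(self, doc_tokens):
        """Incorporate one new document from the stream."""
        self.n_docs += 1
        for term in set(doc_tokens):
            self.df[term] += 1

    def tfidf(self, doc_tokens):
        """Weight each term of a document against the stream seen so far
        (smoothed IDF, so unseen terms do not divide by zero)."""
        tf = Counter(doc_tokens)
        return {t: tf[t] * math.log((1 + self.n_docs) / (1 + self.df[t]))
                for t in tf}

    def top_k_features(self, doc_tokens, k=3):
        """Return the k currently most relevant terms of a document."""
        weights = self.tfidf(doc_tokens)
        return sorted(weights, key=weights.get, reverse=True)[:k]


# Feed a small stream of tokenized documents, then query a new one.
stream = [
    "spam offer free money".split(),
    "meeting agenda project money".split(),
    "free offer limited time".split(),
]
model = StreamingTfIdf()
for doc in stream:
    model.update(doc)

# "now" has never been seen, so it gets the highest IDF and ranks first.
print(model.top_k_features("free money offer now".split(), k=2))
```

Note how the ranking of a given term can change as further documents arrive and its document frequency grows, which is exactly why a fixed feature space is inadequate for streams.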
Such challenges are tackled in this thesis. We present GNUsmail, a framework for text stream classification in the domain of electronic mail, and use it to study the nature of concept drift in text streams. We introduce ABC-DynF, a framework for adaptive classification that is able to handle dynamic feature spaces, incorporating new features and labels into previously existing models. We also study the problem of summarization in text streams, and propose TF-SIDF / BM25, an approach for the approximate computation of weighting functions that makes it possible to extract keywords and construct word clouds from text streams in an efficient way. Finally, we present STFSIDF, an incremental approach for online feature selection that minimizes the number of weight recalculations while keeping continuously updated lists of the most relevant features. STFSIDF uses approximate algorithms to reduce the space complexity derived from the high dimensionality of the data sources.
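As an illustration of the kind of approximate, sublinear-space counting that can bound memory usage in this setting, the sketch below implements a basic count-min sketch, a common data structure for frequency estimation over streams. The abstract above does not specify which approximate algorithms are used, so this is an assumed, generic example rather than the thesis's actual implementation.

```python
import hashlib


class CountMinSketch:
    """Minimal count-min sketch: a fixed-size 2-D array of counters with
    one hash function per row. Estimates are upper bounds on the true
    count (collisions can only inflate them, never deflate them)."""

    def __init__(self, width=512, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # Derive an independent hash per row by salting blake2b.
        digest = hashlib.blake2b(item.encode(), digest_size=8,
                                 salt=str(row).encode()).hexdigest()
        return int(digest, 16) % self.width

    def add(self, item, count=1):
        """Record `count` occurrences of `item`."""
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        """Return the smallest counter across rows: the least-inflated
        (and therefore best) estimate of the item's frequency."""
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))


# Usage: count term occurrences in constant space per sketch.
cms = CountMinSketch()
for _ in range(5):
    cms.add("spam")
cms.add("ham", 2)
print(cms.estimate("spam"), cms.estimate("ham"))
```

The appeal for high-dimensional text streams is that the sketch's size is fixed in advance, independently of how many distinct terms the stream eventually produces, at the cost of a controlled overestimation error.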