This document describes setting up Elasticsearch to make the Nicovideo video dataset searchable and analyzable. It covers importing over 25 billion comments from the 60GB JSON dataset into an Elasticsearch cluster on AWS in under 4 hours. Key steps included installing plugins, configuring the cluster, bulk-importing the data, and tuning mappings and index settings for efficiency. The dataset can now be searched flexibly, and facets such as posting date can be applied to analyze the comments.
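The bulk-import step above can be sketched as follows. Elasticsearch's `_bulk` endpoint expects newline-delimited action/document pairs; everything else here (the index name `nico-comments`, the field names, and the shard/refresh settings) is an illustrative assumption, not the article's actual configuration:

```python
import json

# Hypothetical index config (assumption): replicas and refresh are disabled
# during the bulk load, which is a common trick for faster imports.
INDEX_SETTINGS = {
    "settings": {
        "number_of_shards": 5,      # spread data across the AWS nodes
        "number_of_replicas": 0,    # re-enable after the import finishes
        "refresh_interval": "-1",   # skip refreshes while loading
    },
    "mappings": {
        "properties": {
            "video_id":  {"type": "keyword"},
            "comment":   {"type": "text"},
            "posted_at": {"type": "date"},  # enables date facets/aggregations
        }
    },
}

def bulk_chunks(comments, index="nico-comments", chunk_size=1000):
    """Yield NDJSON payloads for the Elasticsearch _bulk endpoint.

    Each document becomes two lines: an action line and the source line.
    """
    lines = []
    for doc in comments:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
        if len(lines) >= 2 * chunk_size:
            yield "\n".join(lines) + "\n"
            lines = []
    if lines:
        yield "\n".join(lines) + "\n"

docs = [{"video_id": "sm9", "comment": "test", "posted_at": "2007-03-06"}]
payload = next(bulk_chunks(docs))
```

Each payload would then be POSTed to `/_bulk`; chunking keeps individual requests to a manageable size, which matters when streaming tens of gigabytes into the cluster.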
The document discusses topic modeling and classification of short texts. It describes using Latent Dirichlet Allocation (LDA) to extract hidden topics from a large universal text corpus built from Wikipedia and MEDLINE articles. These topics then serve as features for a maximum entropy classifier that categorizes short texts such as tweets and web snippets. LDA training is parallelized with the MPI library for computational efficiency.
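A minimal, single-process sketch of the idea: a collapsed Gibbs sampler for LDA produces per-document topic proportions, which can then be fed to a classifier as dense features. The tiny corpus, topic count, and hyperparameters are illustrative assumptions, and the MPI parallelization (e.g. partitioning documents across ranks and periodically synchronizing topic-word counts) is omitted here:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA; returns doc-topic proportions."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    z = [[rng.randrange(K) for _ in d] for d in docs]  # token-topic assignments
    ndk = [[0] * K for _ in docs]                  # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]     # topic-word counts
    nk = [0] * K                                   # topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Collapsed conditional: p(z=t) ∝ (n_dt + α)(n_tw + β)/(n_t + Vβ)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # Smoothed, normalized doc-topic proportions: usable as classifier features.
    return [[(c + alpha) / (sum(row) + K * alpha) for c in row] for row in ndk]

corpus = [
    ["gene", "dna", "protein"],     # biomedical-flavored docs
    ["dna", "protein", "cell"],
    ["soccer", "goal", "match"],    # sports-flavored docs
    ["goal", "match", "team"],
]
theta = lda_gibbs(corpus)
```

For a short text, the inferred topic vector `theta[d]` supplements the sparse word features, which is what makes the approach useful for tweets and snippets that share few surface words with the training data.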