The goal of this work was to extract poll and opinion trends from a very large corpus. We developed a system that detects trends in single words, identifies collocation candidates, and detects collocation trends. We conducted experiments on a real-life repository of around 23 million documents, made available to us by Toluna.com. Our contributions were twofold: (1) we characterized the problem and chose appropriate measures (e.g., the z-score) to identify trends; (2) we designed a fully scalable system leveraging Hadoop, ZooKeeper, and HBase. The system includes a visual interface that displays a timeline highlighting trends in a 2-year window, as well as statistical reports for single words and n-grams.
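To make the z-score measure concrete, here is a minimal single-machine sketch of how a term's count in the latest period could be compared against its own history. The abstract only names the z-score as an example measure; the function names `z_score` and `trending_terms`, the per-period count layout, the use of the population standard deviation, and the threshold of 2.0 are all assumptions made for illustration, not details of the system described above.

```python
from statistics import mean, pstdev


def z_score(history, current):
    """Return how many standard deviations `current` lies from the mean of `history`."""
    mu = mean(history)
    sigma = pstdev(history)  # population standard deviation (an assumption)
    if sigma == 0:
        # A perfectly flat history gives no scale to measure against.
        return 0.0
    return (current - mu) / sigma


def trending_terms(counts, threshold=2.0):
    """Flag terms whose latest count is unusually high relative to their past.

    `counts` maps each term (a single word or an n-gram) to a list of
    per-period counts, oldest first; the last entry is the current period.
    The threshold is a hypothetical default, not a value from the source.
    """
    trends = {}
    for term, series in counts.items():
        if len(series) < 3:
            continue  # too little history to estimate a spread
        z = z_score(series[:-1], series[-1])
        if z >= threshold:
            trends[term] = z
    return trends


if __name__ == "__main__":
    # Toy counts invented for this example.
    sample = {
        "election": [10, 12, 11, 9, 10, 40],
        "weather": [20, 21, 19, 20, 22, 21],
    }
    print(trending_terms(sample))
```

In this toy input only "election" is flagged, because its latest count sits far above its historical mean while "weather" stays within its usual range. In a distributed setting the per-term counts would presumably come from an aggregation job rather than an in-memory dictionary.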