The talk presents selected results of our research on text and data mining in social media and scientific literature. (1) First, we consider the classification of microblogging postings such as tweets on Twitter. Typically, classification results are evaluated against a gold standard, which is either the hashtags assigned by the tweets’ authors or manual annotations. We claim that these two kinds of gold standard differ fundamentally. To test this, we conducted an experiment in which 163 participants manually classified tweets from ten topics. Our results show that human annotators are more likely to classify tweets like other human annotators than like the tweets’ authors (i.e., the hashtags). This may influence the evaluation of classification methods like LDA, and we argue that researchers should reflect on the kind of gold standard used when interpreting their results. (2) Second, we present a framework for semantic document annotation that aims to compare existing as well as new annotation strategies. For entity detection, we compare semantic taxonomies, trigrams, RAKE, and LDA. For concept activation, we cover a set of statistical, hierarchy-based, and graph-based methods. The strategies are evaluated on 100,000 manually labeled scientific documents from economics, politics, and computer science. (3) Finally, we present a processing pipeline for extracting text of varying size, rotation, color, and emphasis from scholarly figures. The pipeline requires no training and makes no assumptions about the characteristics of the scholarly figures. We conducted a preliminary evaluation with 121 figures from a broad range of illustration types.
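
As a rough illustration of the comparison in (1), the sketch below contrasts the mean agreement among human annotators with their mean agreement with the authors' hashtags. It is not the measure used in the experiment: raw agreement (the fraction of identical labels) is an assumption chosen for simplicity, and the function names, topic labels, and toy data are all hypothetical.

```python
from itertools import combinations


def agreement(labels_a, labels_b):
    """Fraction of items on which two label sequences agree (raw agreement)."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


def mean_pairwise_agreement(annotations):
    """Mean raw agreement over all pairs of annotators."""
    pairs = list(combinations(annotations, 2))
    return sum(agreement(a, b) for a, b in pairs) / len(pairs)


def mean_agreement_with_reference(annotations, reference):
    """Mean raw agreement of each annotator with one reference labeling."""
    return sum(agreement(a, reference) for a in annotations) / len(annotations)


# Hypothetical toy data: four tweets, one hashtag-derived topic per tweet,
# and three annotators who each assign one topic per tweet.
hashtags = ["sports", "politics", "music", "sports"]
annotators = [
    ["sports", "politics", "music", "music"],
    ["sports", "politics", "sports", "music"],
    ["sports", "music", "music", "music"],
]

among_annotators = mean_pairwise_agreement(annotators)
with_hashtags = mean_agreement_with_reference(annotators, hashtags)
print(f"among annotators: {among_annotators:.3f}")
print(f"with hashtags:    {with_hashtags:.3f}")
```

In this toy data the annotators agree with each other more often than with the hashtags, which mirrors the direction of the reported finding but says nothing about its size. A chance-corrected measure would be a natural replacement for raw agreement when topics are unevenly distributed.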