
Knowledge Discovery in Social Media and Scientific Digital Libraries

The talk presents selected results of our research in the area of text and data mining in social media and scientific literature. (1) First, we consider the classification of microblogging postings such as tweets on Twitter. Typically, the classification results are evaluated against a gold standard, which is either the hashtags of the tweets’ authors or manual annotations. We claim that there are fundamental differences between these two kinds of gold standard classifications and conducted an experiment with 163 participants who manually classified tweets from ten topics. Our results show that human annotators are more likely to classify tweets like other human annotators than like the tweets’ authors (i.e., the hashtags). This may influence the evaluation of classification methods like LDA, and we argue that researchers should reflect on the kind of gold standard used when interpreting their results. (2) Second, we present a framework for semantic document annotation that aims to compare different existing as well as new annotation strategies. For entity detection, we compare semantic taxonomies, trigrams, RAKE, and LDA. For concept activation, we cover a set of statistical, hierarchy-based, and graph-based methods. The strategies are evaluated over 100,000 manually labeled scientific documents from economics, politics, and computer science. (3) Finally, we present a processing pipeline for extracting text of varying size, rotation, color, and emphasis from scholarly figures. The pipeline requires neither training nor any assumptions about the characteristics of the scholarly figures. We conducted a preliminary evaluation with 121 figures from a broad range of illustration types.

URL: https://www.ukp.tu-darmstadt.de/ukp-home/news-singleview/artikel/guest-speaker-ansgar-scherp/



  1. Slide 1: Knowledge Discovery in Social Media and Scientific Digital Libraries. Prof. Ansgar Scherp (asc@informatik.uni-kiel.de), Darmstadt, Feb 9, 2016. Thanks to: Chifumi Nishioka, Falk Böschen.
  2. Slide 2: KDD in Social Media & Digital Libraries. How to deal with the vast amount of content related to research and innovation? “The ability to deal with digital information will be as important a cultural technique as reading and writing.”
  3. Slide 3: KDD in Social Media & Digital Libraries. Examples of current research: (1) classifying tweets, (2) automated subject indexing, (3) extracting text from scholarly figures. Not covered today: schema extraction from Linked Open Data; analysis of the evolution of Linked Open Data.
  4. Slide 4: Classifying Tweets: Example. To what extent are there fundamental differences between different approaches to tweet classification? Author’s hashtag (here: none); human: #research #talk #darmstadt (e.g., [Ren et al. 14], [Yang et al. 14]); machine: #talk #socialmedia (e.g., [Nishida et al. 12]). [NSD15] C. Nishioka, A. Scherp, and K. Dellschaft: Comparing Tweet Classifications by Authors’ Hashtags, Machine Learning, and Human Annotators, WI, Singapore, 2015.
  5. Slide 5: Twitter Dataset: TREC Tweets2011. Contains about 16 million tweets. Ten main topics with two sub-topics each were created randomly; a main topic’s hashtag occurs at least 200 times. Topics: (1) #health (#nutrition, #news); (2) #apple (#iphone, #mac); (3) #photography (#nature, #art); (4) #green (#solar, #eco); (5) #celebrity (#news, #gossip); (6) #fashion (#news, #shoes); (7) #fitness (#health, #exercise); (8) #humor (#quotes, #funny); (9) #quote (#love, #life); (10) #travel (#lp, #tips). Five classes per topic; three tweets retrieved per class, i.e., 15 tweets per topic. Task: classify the tweets into groups.
  6. Slide 6: Method 1: Hashtag Classifier. Assign classes to tweets via the author’s hashtags, e.g., the classes ‘#SpendingReview’, ‘#TurkeyDayTravel #travel’, ‘#TurkeyDayTravel’, and ‘#travel’. Multiple hashtags on one tweet are considered a single class.
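The hashtag-classifier rule above can be sketched in a few lines of Python; the helper name and the string encoding of a combined class are illustrative choices, not from the talk.

```python
# Sketch of Method 1: a tweet's class is determined by its author's
# hashtags, and a tweet with several hashtags forms one combined class.
def hashtag_class(tweet_hashtags):
    """Return a class identifier for a tweet, or None if it has no hashtags."""
    return " ".join(sorted(tweet_hashtags)) or None

# '#TurkeyDayTravel #travel' is a class of its own, distinct from
# '#TurkeyDayTravel' alone and '#travel' alone.
combined = hashtag_class(["#travel", "#TurkeyDayTravel"])
```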
  7. Slide 7: Method 2: Machine Classifier. Latent Dirichlet Allocation (LDA) represents tweets as probability distributions over latent topics [Blei et al. 03]. Construction of the model from TREC Tweets2011: train the topic model over tweets aggregated by their Twitter users [Hong et al. 10], then infer a probability distribution over topics for each of the 15 tweets. Cluster the tweets using k-means with cosine similarity as the distance measure; the number of clusters is optimized by Hartigan’s index and Average Silhouette [Kaufman et al. 05].
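The clustering step of the machine classifier can be sketched as k-means under cosine distance over the inferred topic vectors. The sketch below is a minimal stdlib-only version; the two-dimensional topic distributions are made-up stand-ins for the per-tweet distributions the slide describes, and the cluster-number optimization (Hartigan’s index, Average Silhouette) is omitted.

```python
# Sketch: cluster tweets by their LDA topic distributions with k-means
# under cosine distance. The topic vectors are illustrative only.
import random

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return 1.0 - dot / (na * nb)

def kmeans(vectors, k, iters=20, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    for _ in range(iters):
        # assign each vector to the nearest centroid
        clusters = [[] for _ in range(k)]
        for v in vectors:
            best = min(range(k), key=lambda i: cosine_distance(v, centroids[i]))
            clusters[best].append(v)
        # recompute centroids as component-wise means
        for i, members in enumerate(clusters):
            if members:
                centroids[i] = [sum(xs) / len(members) for xs in zip(*members)]
    return [min(range(k), key=lambda i: cosine_distance(v, centroids[i]))
            for v in vectors]

# two clearly separated groups of topic distributions
tweets = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
labels = kmeans(tweets, k=2)
```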
  8. Slide 8: Method 3: Human Classifier. Online experiment: 163 human annotators were asked to manually classify the 15 tweets of each topic. Annotators per topic: (1) #health: 20; (2) #apple: 18; (3) #photography: 15; (4) #green: 14; (5) #celebrity: 15; (6) #fashion: 15; (7) #fitness: 18; (8) #humor: 15; (9) #quote: 16; (10) #travel: 17; total: 163.
  9. Slide 9: Method 3: Human Classifier (cont.). Annotators can create an arbitrary number of classes and label them. They have access to the tweet’s textual content as well as screenshots of the linked pages, but the hashtag symbol ‘#’ is removed.
  10. Slide 10: Degree of Classifier Agreement. Methods 1–3 produce groups of tweets, which are compared with Cohen’s kappa [Fu et al. 2012]. Classifications are converted into match tables over all tweet pairs: elements in the same group are marked 1, otherwise 0. Example: for tweets a, b, c, d, e, one classifier yields the groups {a, b}, {c, d}, {e} and another yields {a, b, c}, {d, e}; the two resulting match tables are then compared with Cohen’s kappa.
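The match-table comparison can be sketched as follows: each grouping is turned into per-pair "same group?" indicators, and Cohen’s kappa is computed over those indicator pairs. The groupings mirror the slide’s five-tweet example; the function names are my own.

```python
# Sketch of the match-table comparison with Cohen's kappa.
from itertools import combinations

def match_table(grouping):
    """Map each unordered tweet pair to 1 if grouped together, else 0."""
    return {pair: int(grouping[pair[0]] == grouping[pair[1]])
            for pair in combinations(sorted(grouping), 2)}

def cohens_kappa(grouping1, grouping2):
    m1, m2 = match_table(grouping1), match_table(grouping2)
    pairs = list(m1)
    n = len(pairs)
    # observed agreement over the pairwise indicators
    po = sum(m1[p] == m2[p] for p in pairs) / n
    # expected agreement from each table's marginal 0/1 rates
    p1 = sum(m1.values()) / n
    p2 = sum(m2.values()) / n
    pe = p1 * p2 + (1 - p1) * (1 - p2)
    return (po - pe) / (1 - pe) if pe < 1 else 1.0

# classifier 1 groups {a,b}, {c,d}, {e}; classifier 2 groups {a,b,c}, {d,e}
x = {"a": 1, "b": 1, "c": 2, "d": 2, "e": 3}
y = {"a": 1, "b": 1, "c": 1, "d": 2, "e": 2}
kappa = cohens_kappa(x, y)
```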
  11. Slide 11: Agreements Between Classifiers. Hashtag/Machine (HaM): almost no agreement, except topic 3 “photography”, where 11 of 15 tweets also use the hashtags as words in their texts. Hashtag/Human (HaHu): slight agreement. Machine/Human (MHu): almost no agreement, except topic 10 “travel”: agreement on the disagreement for tweets with the hashtag “#tips”. Kappa per topic (HaM / HaHu / MHu): 1: −0.05 / 0.12 / 0.00; 2: 0.02 / 0.05 / 0.05; 3: 0.24 / 0.06 / 0.11; 4: 0.01 / 0.11 / 0.00; 5: 0.00 / 0.07 / −0.04; 6: 0.00 / 0.15 / 0.04; 7: 0.04 / 0.09 / 0.05; 8: −0.04 / 0.17 / 0.03; 9: −0.02 / 0.13 / 0.00; 10: 0.01 / 0.10 / 0.45; mean: 0.02 / 0.10 / 0.07; SD: 0.08 / 0.10 / 0.12.
  12. Slide 12: Inter-Human-Annotator Agreement. Fleiss’ kappa measures agreement among more than two raters. We consistently observe larger agreement among human classifiers than for HaHu and MHu, and the difference is significant. Fleiss’ kappa per topic: 1: 0.17; 2: 0.10; 3: 0.13; 4: 0.16; 5: 0.53; 6: 0.20; 7: 0.14; 8: 0.31; 9: 0.33; 10: 0.38; mean: 0.25; SD: 0.14. Conclusion: researchers should use a ground truth made by human annotators rather than hashtags for tweet classification.
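Fleiss’ kappa generalizes agreement measurement to many raters. A minimal sketch, taking the standard item-by-category count matrix as input (the input format is my choice, not from the slide):

```python
# Sketch of Fleiss' kappa over an item x category count matrix.
def fleiss_kappa(ratings):
    """ratings[i][j] = number of raters assigning item i to category j."""
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # per-item agreement P_i
    p_items = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
               for row in ratings]
    p_bar = sum(p_items) / n_items
    # expected agreement from overall category proportions p_j
    totals = [sum(row[j] for row in ratings) for j in range(n_cats)]
    p_j = [t / (n_items * n_raters) for t in totals]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# three raters, two items, two categories, perfect agreement
kappa = fleiss_kappa([[3, 0], [0, 3]])
```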
  13. Slide 13: Automatic Subject Indexing. [GNS15] G. Große-Bölting, C. Nishioka, A. Scherp: A Comparison of Different Strategies for Automated Semantic Document Annotation. K-CAP 2015. Example annotations from the STW (Standard Thesaurus Wirtschaft): Cancer (18899-3), Research (10436-6), USA (17829-1), … Nominated for the Best Paper Award at K-CAP 2015; awarded the “Prof. Dr. Werner Petersen-Preis der Technik 2015”. Published as Linked Open Data!
  14. Slide 14: Automated Subject Indexing. Scientific search engine GERHARD (’97–’99), using an ontology with ~10,000 classes in three languages.
  15. Slide 15: Experiment Framework. Each strategy is a composition of methods from steps 1–3: (1) concept extraction: detect concepts (candidate annotations) in each document; (2) concept activation: compute a score for each concept of a document; (3) annotation selection: select annotations from the concepts of each document; (4) evaluation: measure the performance of the strategies against a ground truth.
  16. Slide 16: Configurations. Concept extraction: Entity, Tri-gram, RAKE, LDA. Concept activation: statistical methods (2), hierarchy-based methods (3), graph-based methods (3). Annotation selection: top-k (2 methods), kNN (1 method).
  17. Slide 17: Configurations: Entity-based (24 strategies). Entity extraction uses a domain-specific taxonomy like STW, combined with the statistical, hierarchy-based, and graph-based activation methods and the top-k and kNN selection methods.
  18. Slide 18: Concept Activation Methods. Statistical methods: concept frequency, and CF-IDF as an extension of the popular TF-IDF model that replaces terms with concepts [Goossen et al. 11]; the IDF lowers the weight of concepts appearing in many documents. These methods do not actually “activate” anything …
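CF-IDF can be sketched by swapping concept counts into the familiar TF-IDF formula; the corpus numbers below are made up for illustration.

```python
import math

# Sketch of CF-IDF: term frequencies in TF-IDF are replaced by concept
# frequencies, so concepts appearing in many documents get a lower weight.
def cf_idf(concept_counts, doc_freq, n_docs):
    """concept_counts: concept -> frequency in this document;
    doc_freq: concept -> number of corpus documents containing it."""
    return {c: cf * math.log(n_docs / doc_freq[c])
            for c, cf in concept_counts.items()}

# a rare concept outweighs an equally frequent but ubiquitous one
scores = cf_idf({"Cancer": 3, "Research": 3},
                doc_freq={"Cancer": 10, "Research": 1000},
                n_docs=1000)
```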
  19. Slide 19: Hierarchy-based Methods. Reveal concepts that are not explicitly mentioned by using a hierarchical knowledge base (KB); KBs are of high quality and freely available. Example concepts: World Wide Web, Social Recommendation, Social Tagging, Web Searching, Web Mining, Site Wrapping, Web Log Analysis. Base activation uses the set of child concepts of a concept and a decay parameter.
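A hedged sketch of such a base activation: a concept’s score is its own frequency plus the decayed sum of its children’s activations, so unmentioned parent concepts receive activation from their mentioned descendants. The exact formula and the slide’s worked example were lost in extraction, so this recursion, the decay value, and the parent–child layout below are assumptions consistent with the description.

```python
# Sketch of a hierarchy-based base activation with a decay parameter.
# Formula, decay value, and hierarchy layout are illustrative assumptions.
def base_activation(concept, freq, children, decay=0.5):
    own = freq.get(concept, 0)
    return own + decay * sum(base_activation(c, freq, children, decay)
                             for c in children.get(concept, []))

# assumed hierarchy over the slide's concepts
hierarchy = {"World Wide Web": ["Web Searching", "Web Mining"],
             "Web Mining": ["Site Wrapping", "Web Log Analysis"]}
# concept frequencies detected in a document
freq = {"Web Searching": 2, "Web Mining": 1, "Web Log Analysis": 4}
score = base_activation("World Wide Web", freq, hierarchy)
```

Note how "World Wide Web" is activated even though it never occurs in the document itself.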
  20. Slide 20: Hierarchy-based Methods (cont.). One-hop activation, developed with domain experts at ZBW, operates on the set of concepts detected in a document with a maximum activation distance of one hop. It works very well … why?
  21. Slide 21: Graph-based Methods. Concepts are represented as a co-occurrence graph, e.g., over Tax, Bank, Interest Rate, Financial Crisis, Central Bank. Activation via HITS, the link-analysis algorithm for web sites [Kleinberg 99], or via the degree of a concept, i.e., the number of edges linked to it [Zouaq et al. 12].
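Both graph-based activations can be sketched on a small co-occurrence graph. The edge list below is made up for illustration (echoing some of the slide’s concepts); on an undirected graph, HITS hub and authority scores coincide.

```python
# Sketch of graph-based concept activation: degree centrality and a few
# HITS iterations on an undirected concept co-occurrence graph.
def degree(graph):
    return {node: len(neigh) for node, neigh in graph.items()}

def hits_authority(graph, iters=50):
    auth = {n: 1.0 for n in graph}
    for _ in range(iters):
        # authority of a node = sum of its neighbours' scores, then normalize
        new = {n: sum(auth[m] for m in graph[n]) for n in graph}
        norm = sum(v * v for v in new.values()) ** 0.5
        auth = {n: v / norm for n, v in new.items()}
    return auth

# illustrative co-occurrence edges between concepts
graph = {
    "Tax": ["Financial Crisis"],
    "Interest Rate": ["Financial Crisis", "Central Bank"],
    "Financial Crisis": ["Tax", "Interest Rate", "Central Bank"],
    "Central Bank": ["Interest Rate", "Financial Crisis"],
}
deg = degree(graph)
auth = hits_authority(graph)
```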
  22. Slide 22: Configurations: n-grams (15 strategies). Tri-gram extraction combined with the statistical, hierarchy-based, and graph-based activation methods and the top-k and kNN selection methods.
  23. Slide 23: Configurations: RAKE [Rose et al. 10] (3 strategies). RAKE extraction combined with frequency-based statistical activation and the top-k and kNN selection methods.
  24. Slide 24: Configuration: LDA [Blei et al. 03]. LDA extraction combined with frequency-based statistical activation and the top-k and kNN selection methods; 43 strategies in total.
  25. Slide 25: Datasets: 3 Scientific Domains. Economics: source ZBW; 62,924 documents; 5.26 (±1.84) annotations per document; knowledge base STW with 6,335 entities and 11,679 labels. Politics: source FIV; 28,324 documents; 12 (±4.02) annotations; knowledge base European Thesaurus with 7,912 entities and 8,421 labels. Computer science: source SemEval 2010 [Kim et al. 10]; 244 documents; 5.05 (±2.41) annotations; knowledge base ACM CCS with 2,299 entities and 9,086 labels; pre-processing of the author keywords was needed [Wang et al. 14]. In total ~100,000 scientific documents: the largest such corpus so far!
  26. Slide 26: Best Performing Configurations. Best strategy: Entity × HITS × kNN across all three domains (economics, politics, computer science). Close behind: OneHop as well as any other graph-based method.
  27. Slide 27: Text Extraction from Scholarly Figures. [Example figure: bar charts of users’ preferred operating systems (Windows, Macintosh, Linux) with the recognized text overlaid.] Pipeline: Binarization → Clustering → Extraction → OCR → Text. [BS15] F. Böschen, A. Scherp: Multi-oriented Text Extraction from Information Graphics. DocEng 2015: 35–38. Fully automated text-extraction (TX) pipeline; no assumptions, no training; novel combination of data mining and computer vision.
  28. Slide 28: Challenges for Research. Different font sizes, font colors, background colors, and emphases; different angles; overlapping elements.
  29. Slide 29: 121 Scholarly Figures in Economics (from the ZBW Open Access Corpus). Current results: improvement of text recognition over the baseline by up to 30%.
  30. Slide 30: Evaluation Setup. How to match the output with the gold standard? Example: extracted text “item 1” vs. gold “Item 1”; unigrams {e, i, m, t, 1} vs. {e, m, t, I, 1}; bigrams {em, it, te} vs. {em, te, It}; trigrams {ite, tem} vs. {tem, Ite}.
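The matching above can be sketched with character n-grams computed within words (which reproduces the slide’s example sets); scoring the overlap with the Jaccard coefficient is my assumption, not necessarily the measure used in the paper.

```python
# Sketch of n-gram matching between extracted text and the gold standard.
def ngrams(text, n):
    """Character n-grams computed within each whitespace-separated word."""
    grams = set()
    for word in text.split():
        grams |= {word[i:i + n] for i in range(len(word) - n + 1)}
    return grams

def ngram_jaccard(extracted, gold, n):
    a, b = ngrams(extracted, n), ngrams(gold, n)
    return len(a & b) / len(a | b) if a | b else 1.0

# the slide's example: extracted "item 1" vs. gold "Item 1"
score_uni = ngram_jaccard("item 1", "Item 1", 1)  # unigram overlap
score_tri = ngram_jaccard("item 1", "Item 1", 3)  # trigram overlap
```

Longer n-grams punish case and character errors more strongly, as the dropping trigram score for the single `i`/`I` mismatch shows.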
  31. Slide 31: Limits of Current Evaluation. Baseline #1: the OCR engine Tesseract (Google) with layout analysis, one pass per figure. Baseline #2: Tesseract with layout analysis and multiple, angle-rotated passes. Comparison with related work: very difficult!
  32. Slide 32: Evaluation: Orientation Distributions. Note: “horizontal” equals ±15° (Tesseract’s rotation tolerance).
  33. Slide 33: Mockup: Use of TX in ZBW’s EconBiz.
  34. Slide 34: Summary: KDD in Social Media & Digital Libraries. How to deal with the vast amount of content related to research and innovation? New: H2020 INSO-4 project, duration 04/2016–03/2019; a platform with data mining and visualization tools enabling information professionals to deal with large corpora of scientific content, data, and social media.
