Knowledge Discovery in Social Media and Scientific Digital Libraries

Slide 1
Prof. Ansgar Scherp – asc@informatik.uni-kiel.de
Knowledge Discovery in Social Media and Scientific Digital Libraries
Ansgar Scherp
Darmstadt, Feb 9, 2016
Thanks to: Chifumi Nishioka, Falk Böschen
Slide 2
KDD Social Media & Digital Libraries
How to deal with the vast amount of content related to
research and innovation?
“Ability to deal with digital information will be an
important cultural technique as reading and writing.”
Slide 3
KDD Social Media & Digital Libraries
• Examples of current research
1. Classifying tweets
2. Automated subject indexing
3. Extracting text from scholarly figures
• Not covered today
–Schema-extraction from Linked Open Data
–Analysis of evolution of Linked Open Data
Slide 4
Classifying Tweets: Example
To what extent are there fundamental differences between
different approaches to tweet classification?
Author’s hashtag: (here: none)
Human: #research #talk #darmstadt
Machine: #talk #socialmedia
(e.g., [Nishida et al. 12]) (e.g., [Ren et al. 14], [Yang et al. 14])
[NSD15] C. Nishioka, A. Scherp, and K. Dellschaft: Comparing Tweet Classifications by
Authors' Hashtags, Machine Learning, and Human Annotators, WI, Singapore, 2015.
Slide 5
Twitter Dataset: TREC Tweets2011
• Contains about 16 million tweets
• 10 randomly created main topics, each with two subtopics
• Main topic: hashtag occurs at least 200 times
ID  Main topic     Subtopics
1   #health        #nutrition, #news
2   #apple         #iphone, #mac
3   #photography   #nature, #art
4   #green         #solar, #eco
5   #celebrity     #news, #gossip
6   #fashion       #news, #shoes
7   #fitness       #health, #exercise
8   #humor         #quotes, #funny
9   #quote         #love, #life
10  #travel        #lp, #tips
• 5 classes per topic
• Retrieved 3 tweets per class, i.e., 15 tweets per topic
• Task: classify tweets into groups
Slide 6
Method 1: Hashtag Classifier
• Assign classes to tweets by author’s hashtags
Class ‘#SpendingReview’
Class ‘#TurkeyDayTravel #travel’
Class ‘#TurkeyDayTravel’
Class ‘#travel’
• Multiple hashtags: considered as a single class
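The hashtag classifier can be sketched in a few lines of Python. This is only an illustration under assumptions: the tweets are invented, and lowercasing hashtags before grouping is a choice made here, not something the slides state. A tweet's class is the full set of its hashtags, so a tweet with two hashtags lands in its own combined class.

```python
import re
from collections import defaultdict

HASHTAG = re.compile(r"#\w+")

def hashtag_classes(tweets):
    """Group tweet ids by the set of hashtags each tweet contains."""
    classes = defaultdict(list)
    for tweet_id, text in tweets.items():
        # Assumption: lowercase so '#Travel' and '#travel' count as one hashtag.
        tags = frozenset(tag.lower() for tag in HASHTAG.findall(text))
        classes[tags].append(tweet_id)
    return dict(classes)

# Hypothetical tweets, invented for illustration.
tweets = {
    "t1": "Cuts announced today #SpendingReview",
    "t2": "Flying home early #TurkeyDayTravel #travel",
    "t3": "Airport is packed #TurkeyDayTravel",
    "t4": "Top beaches this winter #travel",
    "t5": "Train delays again #travel #TurkeyDayTravel",
}
groups = hashtag_classes(tweets)
```

Using a `frozenset` as the class key makes the order of hashtags within a tweet irrelevant, so "t2" and "t5" fall into the same class.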
Slide 7
Method 2: Machine Classifier
• Latent Dirichlet Allocation (LDA) to represent tweets
as probabilities over latent topics [Blei et al. 03]
• Construction of the model from TREC Tweets2011
– Train topic model over tweets aggregated by
their Twitter users [Hong et al. 10]
– Infer probability distribution over topics for each of
the 15 tweets
• Cluster tweets using k-means
– # of clusters optimized by Hartigan’s index
and Average Silhouette [Kaufman et al. 05]
– Using cosine similarity as a distance measure
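The clustering step can be illustrated with a small pure-Python k-means that uses cosine distance. This is a sketch, not the setup used in the study: the topic distributions are invented, the centroids are seeded with the first k points to keep the run deterministic, and neither the LDA inference nor the choice of k via Hartigan's index or the average silhouette is covered.

```python
import math

def cosine_distance(u, v):
    """1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def kmeans_cosine(points, k, iterations=20):
    """Lloyd-style k-means with cosine distance; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by cosine distance.
        labels = [min(range(k), key=lambda j: cosine_distance(p, centroids[j]))
                  for p in points]
        # Update step: centroid = mean of the member vectors.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster runs empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical topic distributions of six tweets over three latent topics.
topic_dists = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.05, 0.05, 0.90],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
    [0.10, 0.10, 0.80],
]
labels = kmeans_cosine(topic_dists, k=3)
```

Because cosine distance ignores vector length, the unnormalized mean works as a centroid here.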
Slide 8
Method 3: Human Classifier
• Online experiment: asked 163 human annotators
to manually classify the 15 tweets of each topic
ID  Main topic     Subtopics            # annotators
1   #health        #nutrition, #news    20
2   #apple         #iphone, #mac        18
3   #photography   #nature, #art        15
4   #green         #solar, #eco         14
5   #celebrity     #news, #gossip       15
6   #fashion       #news, #shoes        15
7   #fitness       #health, #exercise   18
8   #humor         #quotes, #funny      15
9   #quote         #love, #life         16
10  #travel        #lp, #tips           17
∑                                       163
Slide 9
Method 3: Human Classifier
• Annotators can create an arbitrary number of classes
and label them
• They have access to the tweet’s textual content as well as
screenshots of the linked pages, but with the hashtag ‘#’ removed
[Figure callout: Class label]
Slide 10
Degree of Classifier Agreement
• Methods 1-3 produce groups of tweets
• Compare groups with Cohen’s kappa [Fu et al. 2012]
• Convert classifications into match tables
– Elements in same group: 1
– Otherwise: 0
• Example: tweets $a$, $b$, $c$, $d$, $e$ are classified by two classifiers into
$\{a, b\}, \{c, d\}, \{e\}$ and $\{a, b, c\}, \{d, e\}$
• Compare match tables using Cohen’s $\kappa$
Match table of the first classifier:
   a  b  c  d
b  1
c  0  0
d  0  0  1
e  0  0  0  0
Match table of the second classifier:
   a  b  c  d
b  1
c  1  1
d  0  0  0
e  0  0  0  1
=> Cohen’s $\kappa$ between the two match tables
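The comparison of two groupings via match tables can be worked through in Python. The sketch below assumes that each of the 10 tweet pairs is treated as one binary rating (1 = same group) and that Cohen's kappa is computed over those ratings. The groupings are read off the two match tables of this slide; the function names are chosen here for illustration.

```python
from itertools import combinations

def match_vector(groups, items):
    """Flatten a grouping into pairwise 0/1 matches (1 = same group)."""
    group_of = {item: idx for idx, group in enumerate(groups) for item in group}
    return [int(group_of[a] == group_of[b]) for a, b in combinations(items, 2)]

def cohens_kappa(x, y):
    """Cohen's kappa for two equally long lists of binary ratings."""
    n = len(x)
    observed = sum(a == b for a, b in zip(x, y)) / n
    p_x, p_y = sum(x) / n, sum(y) / n
    # Chance agreement: both say 1, or both say 0.
    expected = p_x * p_y + (1 - p_x) * (1 - p_y)
    return (observed - expected) / (1 - expected)

items = ["a", "b", "c", "d", "e"]
first = match_vector([{"a", "b"}, {"c", "d"}, {"e"}], items)
second = match_vector([{"a", "b", "c"}, {"d", "e"}], items)
kappa = cohens_kappa(first, second)
```

For these two tables the classifiers agree on 6 of 10 pairs, while chance agreement is 0.56, so kappa comes out at roughly 0.09.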
Slide 11
Agreements Between Classifiers
• Hashtag/Machine (HaM)
– Almost no agreement
– Except topic 3 “photography”:
11 of 15 tweets also use the
hashtag as a word in the text
• Hashtag/Human (HaHu)
– Slight agreement
• Machine/Human (MHu)
– Almost no agreement
– Except topic 10 “travel”:
agreement on the disagreement
for tweets with the hashtag “#tips”
ID   HaM     HaHu   MHu
1    -0.05   0.12   0.00
2    0.02    0.05   0.05
3    0.24    0.06   0.11
4    0.01    0.11   0.00
5    0.00    0.07   -0.04
6    0.00    0.15   0.04
7    0.04    0.09   0.05
8    -0.04   0.17   0.03
9    -0.02   0.13   0.00
10   0.01    0.10   0.45
M    0.02    0.10   0.07
SD   0.08    0.10   0.12
Slide 12
Inter-Human-Annotator Agreement
• Fleiss’ $\kappa$: measures agreement
among more than two raters
• Consistently observe larger agreement
among human classifiers than for
HaHu and MHu
• Difference is significant
ID   Fleiss’ $\kappa$
1    0.17
2    0.10
3    0.13
4    0.16
5    0.53
6    0.20
7    0.14
8    0.31
9    0.33
10   0.38
M    0.25
SD   0.14
Researchers should use ground truth made by human 
annotators rather than hashtags for tweet classification
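For readers unfamiliar with Fleiss' kappa, here is a minimal Python sketch of the standard formula. The framing is an assumption carried over from the match tables: each tweet pair is one subject, and each annotator rates it as "same group" or "different group". The ratings are invented, and the slides do not say how the statistic was actually set up.

```python
def fleiss_kappa(counts):
    """counts[i][j] = number of raters who put subject i into category j."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])  # assumes the same number of raters per subject
    total = n_subjects * n_raters
    # Share of all ratings that fall into each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]
    # Per-subject agreement: fraction of rater pairs that agree.
    p_subj = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_subj) / n_subjects
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)

# Hypothetical ratings: 4 tweet pairs, 3 annotators,
# categories [same group, different group].
ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(ratings)
```

In this toy case the mean per-subject agreement is 2/3 and the chance agreement is 1/2, which gives a kappa of 1/3.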
Slide 13
Automated Subject Indexing
[GNS15] G. Große-Bölting, C. Nishioka, A. Scherp: A Comparison of Different
Strategies for Automated Semantic Document Annotation. K-CAP 2015
STW (Standard Thesaurus Wirtschaft)
STW concepts: Cancer (18899-3), Research (10436-6), USA (17829-1), …
Nomination for Best Paper Award at K-CAP 2015
Award „Prof. Dr. Werner Petersen-Preis der Technik 2015”
Published as Linked Open Data!
Slide 14
Automated Subject Indexing
• Scientific search engine GERHARD (‘97-‘99)
• Ontology with ~10,000 classes in three languages
Slide 15
Experiment Framework
Each strategy is a composition of methods from 1. + 2. + 3.
1. Concept Extraction
detect concepts (candidate annotations) from each document
2. Concept Activation
compute a score for each concept of a document
3. Annotation Selection
select annotations from concepts for each document
4. Evaluation
measure performance of strategies with ground truth
Slide 16
Configurations
Concept Extraction: Entity, Tri-gram, RAKE, LDA
Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
Annotation Selection: Top-k (2 methods), kNN (1 method)
Slide 17
Configurations: Entity-based
24 strategies
Concept Extraction: Entity, Tri-gram, RAKE, LDA
Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
Annotation Selection: Top-k (2 methods), kNN (1 method)
… using a domain-specific taxonomy like STW
Slide 18
Concept Activation Methods
• Concept frequency: how often a concept occurs in a document
• CF-IDF: extension of the popular TF-IDF model that
replaces terms with concepts [Goossen et al. 11]
– IDF lowers the weight of concepts appearing in many
documents
• These methods do not actually “activate” anything …
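A small Python sketch may help make CF-IDF concrete. It assumes one common variant: the concept frequency is normalized by the number of concept mentions in the document, and the IDF is the natural log of the number of documents over the number of documents containing the concept. The exact weighting used in the talk is not given on the slide, and the concept lists are invented.

```python
import math
from collections import Counter

def cf_idf(docs):
    """docs: one non-empty list of concept mentions per document.

    Returns one {concept: score} dict per document.
    """
    n_docs = len(docs)
    # Number of documents in which each concept occurs at least once.
    doc_freq = Counter(c for doc in docs for c in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({c: (n / len(doc)) * math.log(n_docs / doc_freq[c])
                       for c, n in counts.items()})
    return scores

# Hypothetical concept mentions detected in three documents.
docs = [
    ["bank", "bank", "tax"],
    ["bank", "interest rate"],
    ["bank", "financial crisis", "financial crisis"],
]
scores = cf_idf(docs)
```

The concept "bank" occurs in every document, so its IDF is log(1) = 0 and it scores zero everywhere, which is the down-weighting the IDF bullet describes.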
Slide 19
Hierarchy-based Methods
• Reveal concepts that are
not explicitly mentioned
by using a hierarchical
knowledge base (KB)
• KBs are of high quality and freely available!
[Figure: example concept hierarchy with the nodes World Wide Web, Web Searching, Web Mining, Social Recommendation, Social Tagging, Site Wrapping, Web Log Analysis]
• Base Activation: uses the set of child concepts of a
concept and a decay parameter
[Formula and worked example lost in extraction]
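Since the formula on this slide did not survive extraction, the following Python sketch shows only one plausible reading of base activation. It assumes that a concept's activation is its own frequency plus the decayed sum of its children's activations, applied recursively. The parent/child links, the counts, and the decay value 0.5 are all hypothetical; only the concept names are taken from the slide.

```python
def base_activation(concept, freq, children, decay):
    """Own frequency plus the decayed activation of all child concepts (recursive)."""
    return freq.get(concept, 0) + decay * sum(
        base_activation(child, freq, children, decay)
        for child in children.get(concept, ())
    )

# Hypothetical hierarchy: the parent/child links are assumed, not taken from the slide.
children = {
    "World Wide Web": ["Web Searching", "Web Mining"],
    "Web Mining": ["Site Wrapping", "Web Log Analysis"],
}
# Hypothetical concept counts in one document.
freq = {"Web Searching": 1, "Site Wrapping": 2, "Web Log Analysis": 1}

web_mining = base_activation("Web Mining", freq, children, decay=0.5)
www = base_activation("World Wide Web", freq, children, decay=0.5)
```

Under this reading, "Web Mining" and "World Wide Web" receive non-zero scores although neither is mentioned in the document, which matches the stated goal of revealing concepts that are not explicitly mentioned.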
Slide 20
Hierarchy-based Methods
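• One-hop activation
– Developed with domain experts at ZBW
– Uses the set of concepts detected in a document
– Maximum activation distance: one hop
[Formula lost in extraction: piecewise definition with a case condition on the size of a set intersection (threshold 2) and an “otherwise” case]
Works very well … why?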
Slide 21
Graph-based Methods
• Represent concepts as
co-occurrence graph
[Figure: co-occurrence graph with the concepts Tax, Bank, Interest Rate, Financial Crisis, Central Bank]
• HITS, originally for link analysis of web sites [Kleinberg 99]
[Formulas lost in extraction]
• Degree as number of edges linked with a concept
[Zouaq et al. 12]
[Example lost in extraction]
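Both graph-based scores can be sketched in Python on a toy co-occurrence graph. The concept names come from the slide, but the edges are hypothetical because the figure is not recoverable. HITS is implemented as the textbook power iteration with each undirected edge counted in both directions; how the talk turns hub and authority values into a single concept score is not stated, so the sketch simply returns both.

```python
def degree(edges):
    """Number of edges attached to each concept in an undirected graph."""
    deg = {}
    for a, b in edges:
        deg[a] = deg.get(a, 0) + 1
        deg[b] = deg.get(b, 0) + 1
    return deg

def hits(edges, iterations=50):
    """HITS by power iteration; each undirected edge is a link in both directions."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    hub = {node: 1.0 for node in neighbours}
    auth = dict(hub)
    for _ in range(iterations):
        # Authority: sum of the hub scores of linking nodes, then L2-normalize.
        auth = {n: sum(hub[m] for m in neighbours[n]) for n in neighbours}
        norm = sum(v * v for v in auth.values()) ** 0.5
        auth = {n: v / norm for n, v in auth.items()}
        # Hub: sum of the authority scores of linked nodes, then L2-normalize.
        hub = {n: sum(auth[m] for m in neighbours[n]) for n in neighbours}
        norm = sum(v * v for v in hub.values()) ** 0.5
        hub = {n: v / norm for n, v in hub.items()}
    return hub, auth

# Hypothetical co-occurrence edges between the concepts named on the slide.
edges = [
    ("Bank", "Central Bank"),
    ("Bank", "Interest Rate"),
    ("Bank", "Financial Crisis"),
    ("Central Bank", "Interest Rate"),
    ("Financial Crisis", "Tax"),
]
deg = degree(edges)
hub, auth = hits(edges)
```

On an undirected graph hub and authority scores converge to the same ranking, so "Bank" ranks highest under both degree and HITS in this toy graph.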
Slide 22
Configurations: n-grams
15 strategies
Concept Extraction: Entity, Tri-gram, RAKE, LDA
Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
Annotation Selection: Top-k (2 methods), kNN (1 method)
Slide 23
Configurations: RAKE [Rose et al. 10]
3 strategies
Concept Extraction: Entity, Tri-gram, RAKE, LDA
Concept Activation: Statistical Method (Frequency), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
Annotation Selection: Top-k (2 methods), kNN (1 method)
Slide 24
Configuration: LDA [Blei et al. 03]
43 strategies in total*
Concept Extraction: Entity, Tri-gram, RAKE, LDA
Concept Activation: Statistical Methods (Frequency), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
Annotation Selection: Top-k (2 methods), kNN (1 method)
Slide 25
Datasets: 3 Scientific Domains
                 Economics       Politics             Computer Science
Source           ZBW             FIV                  SemEval 2010
# documents      62,924          28,324               244
# annotations    5.26 (± 1.84)   12 (± 4.02)          5.05 (± 2.41)
Knowledge base   STW             European Thesaurus   ACM CCS
# entities       6,335           7,912                2,299
# labels         11,679          8,421                9,086
• Computer science dataset: SemEval 2010 [Kim et al. 10]
• Pre-processing of author keywords needed [Wang et al. 14]
• Total of ~100,000 scientific documents: largest so far!
Slide 26
Best Performing Configurations
Best strategy: Entity × HITS × kNN
[Scores per dataset lost in extraction: (economy), (politics), (computer)]
Concept Extraction: Entity, Tri-gram, RAKE, LDA
Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
Annotation Selection: Top-k (2 methods), kNN (1 method)
Close ones: OneHop as well as any other graph-based method
Slide 27
Text Extraction from Scholarly Figures
[Figure: example bar chart “Preferred Operating System” (y-axis: Number of Users; Windows 16, Linux 3, Macintosh 1; N = 20 total users)]
Pipeline: Binarization → Clustering → Extraction → OCR → Text
[BS15] F. Böschen, A. Scherp: Multi-oriented Text Extraction from Information
Graphics. DocEng 2015: 35-38
Fully automated TX pipeline
No assumptions, no training
Novel combination of DM & CV
Slide 28
Challenges for Research
• Different font sizes
• … font colors
• … background colors
• … emphases
• Different angles
• Overlapping elements
Slide 29
121 Scholarly Figures in Economics
(from ZBW Open Access Corpus)
Current results: improvement in text recognition
over the baseline (BL): up to 30%
Slide 30Prof. Ansgar Scherp – asc@informatik.uni-kiel.de
Evaluation Setup
           Output "item 1"     Gold standard "Item 1"
Unigrams   {e, i, m, t, 1}     {e, m, t, I, 1}
Bigrams    {em, it, te}        {em, te, It}
Trigrams   {ite, tem}          {tem, Ite}
• How to match output (left) with gold standard (right)?
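One way to approach the matching question is to compare character n-gram sets, as in the example on this slide. The Python sketch below builds n-grams within each whitespace-separated token, which reproduces the sets shown, and scores the OCR output with set-based precision and recall. That scoring is an assumption for illustration, not necessarily the matching procedure used in the evaluation.

```python
def char_ngrams(text, n):
    """Set of character n-grams taken within each whitespace-separated token."""
    return {token[i:i + n]
            for token in text.split()
            for i in range(len(token) - n + 1)}

def precision_recall(output, gold, n):
    """Set-based precision and recall of the output's n-grams against the gold standard."""
    out, ref = char_ngrams(output, n), char_ngrams(gold, n)
    hits = len(out & ref)
    precision = hits / len(out) if out else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall

unigrams = char_ngrams("item 1", 1)
bigram_scores = precision_recall("item 1", "Item 1", 2)
```

The comparison is case-sensitive, so the single recognition error "i" instead of "I" costs one unigram, one bigram, and one trigram.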
Slide 31Prof. Ansgar Scherp – asc@informatik.uni-kiel.de
Limits of Current Evaluation
• Baseline #1: OCR engine Tesseract (Google)
with layout analysis
– 1 pass per figure
• Baseline #2: OCR engine Tesseract (Google)
with layout analysis
– Multiple, angle-rotated passes
Comparison with related work: very difficult!
Slide 32Prof. Ansgar Scherp – asc@informatik.uni-kiel.de
Evaluation: Orientation Distributions
Note: “horizontal” means within ±15° (Tesseract’s rotation tolerance)
Slide 33Prof. Ansgar Scherp – asc@informatik.uni-kiel.de
Mockup: Use of TX in ZBW’s EconBiz
Slide 34Prof. Ansgar Scherp – asc@informatik.uni-kiel.de
Summary: KDD in Social Media & DL
How to deal with the vast amount of content related to
research and innovation?
• H2020 INSO-4 project, duration: 04/2016-03/2019
• Platform with data mining and visualization tools for
enabling information professionals to deal with large
corpora of scientific content, data, social media
The Research Portal of Catalonia: Growing more (information) & more (services)CSUC - Consorci de Serveis Universitaris de Catalunya
80 views25 slides
Special_edition_innovator_2023.pdf by
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdfWillDavies22
18 views6 slides
STPI OctaNE CoE Brochure.pdf by
STPI OctaNE CoE Brochure.pdfSTPI OctaNE CoE Brochure.pdf
STPI OctaNE CoE Brochure.pdfmadhurjyapb
14 views1 slide

Recently uploaded(20)

"Running students' code in isolation. The hard way", Yurii Holiuk by Fwdays
"Running students' code in isolation. The hard way", Yurii Holiuk "Running students' code in isolation. The hard way", Yurii Holiuk
"Running students' code in isolation. The hard way", Yurii Holiuk
Fwdays17 views
PharoJS - Zürich Smalltalk Group Meetup November 2023 by Noury Bouraqadi
PharoJS - Zürich Smalltalk Group Meetup November 2023PharoJS - Zürich Smalltalk Group Meetup November 2023
PharoJS - Zürich Smalltalk Group Meetup November 2023
Noury Bouraqadi132 views
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ... by Jasper Oosterveld
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
ESPC 2023 - Protect and Govern your Sensitive Data with Microsoft Purview in ...
Special_edition_innovator_2023.pdf by WillDavies22
Special_edition_innovator_2023.pdfSpecial_edition_innovator_2023.pdf
Special_edition_innovator_2023.pdf
WillDavies2218 views
STPI OctaNE CoE Brochure.pdf by madhurjyapb
STPI OctaNE CoE Brochure.pdfSTPI OctaNE CoE Brochure.pdf
STPI OctaNE CoE Brochure.pdf
madhurjyapb14 views
Piloting & Scaling Successfully With Microsoft Viva by Richard Harbridge
Piloting & Scaling Successfully With Microsoft VivaPiloting & Scaling Successfully With Microsoft Viva
Piloting & Scaling Successfully With Microsoft Viva
Business Analyst Series 2023 - Week 3 Session 5 by DianaGray10
Business Analyst Series 2023 -  Week 3 Session 5Business Analyst Series 2023 -  Week 3 Session 5
Business Analyst Series 2023 - Week 3 Session 5
DianaGray10300 views
SAP Automation Using Bar Code and FIORI.pdf by Virendra Rai, PMP
SAP Automation Using Bar Code and FIORI.pdfSAP Automation Using Bar Code and FIORI.pdf
SAP Automation Using Bar Code and FIORI.pdf
Case Study Copenhagen Energy and Business Central.pdf by Aitana
Case Study Copenhagen Energy and Business Central.pdfCase Study Copenhagen Energy and Business Central.pdf
Case Study Copenhagen Energy and Business Central.pdf
Aitana16 views
HTTP headers that make your website go faster - devs.gent November 2023 by Thijs Feryn
HTTP headers that make your website go faster - devs.gent November 2023HTTP headers that make your website go faster - devs.gent November 2023
HTTP headers that make your website go faster - devs.gent November 2023
Thijs Feryn22 views
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas... by Bernd Ruecker
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
iSAQB Software Architecture Gathering 2023: How Process Orchestration Increas...
Bernd Ruecker40 views
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf by Dr. Jimmy Schwarzkopf
STKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdfSTKI Israeli Market Study 2023   corrected forecast 2023_24 v3.pdf
STKI Israeli Market Study 2023 corrected forecast 2023_24 v3.pdf

Knowledge Discovery in Social Media and Scientific Digital Libraries

  • 1. Knowledge Discovery in Social Media and Scientific Digital Libraries. Prof. Ansgar Scherp – asc@informatik.uni-kiel.de. Darmstadt, Feb 9, 2016. Thanks to: Chifumi Nishioka, Falk Böschen
  • 2. KDD Social Media & Digital Libraries: How to deal with the vast amount of content related to research and innovation? “Ability to deal with digital information will be an important cultural technique as reading and writing.”
  • 3. KDD Social Media & Digital Libraries
      • Examples of current research
        1. Classifying tweets
        2. Automated subject indexing
        3. Extracting text from scholarly figures
      • Not covered today
        – Schema extraction from Linked Open Data
        – Analysis of the evolution of Linked Open Data
  • 4. Classifying Tweets: Example
      • To what extent do different approaches to tweet classification differ fundamentally?
      • Author’s hashtag: (here: none)
      • Human: #research #talk #darmstadt
      • Machine: #talk #socialmedia
      • (e.g., [Nishida et al. 12]) (e.g., [Ren et al. 14] [Yang et al. 14])
      • [NSD15] C. Nishioka, A. Scherp, and K. Dellschaft: Comparing Tweet Classifications by Authors' Hashtags, Machine Learning, and Human Annotators, WI, Singapore, 2015.
  • 5. Twitter Dataset: TREC Tweets2011
      • Contains about 16 million tweets
      • 10 randomly created main topics, each with two subtopics
      • Main topic: hashtag occurs at least 200 times

        ID   Main topic      Subtopics
        1    #health         #nutrition, #news
        2    #apple          #iphone, #mac
        3    #photography    #nature, #art
        4    #green          #solar, #eco
        5    #celebrity      #news, #gossip
        6    #fashion        #news, #shoes
        7    #fitness        #health, #exercise
        8    #humor          #quotes, #funny
        9    #quote          #love, #life
        10   #travel         #lp, #tips

      • 5 classes per topic [class definitions lost in extraction]
      • Retrieved 3 tweets per class, i.e., 15 tweets per topic
      • Task: classify tweets into groups
  • 6. Method 1: Hashtag Classifier
      • Assign classes to tweets by author’s hashtags
        – Class ‘#SpendingReview’
        – Class ‘#TurkeyDayTravel #travel’
        – Class ‘#TurkeyDayTravel’
        – Class ‘#travel’
      • Multiple hashtags → considered as a single class
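The hashtag classifier of Method 1 can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `hashtag_classes`, the sample tweets, and the choice to lowercase hashtags are all assumptions made for the example.

```python
import re
from collections import defaultdict

def hashtag_classes(tweets):
    """Group tweet ids by their full set of hashtags.

    A tweet with several hashtags forms its own class, i.e. the
    combination is treated as a single class label.
    Lowercasing the hashtags is an assumption of this sketch.
    """
    groups = defaultdict(list)
    for tweet_id, text in tweets.items():
        tags = frozenset(tag.lower() for tag in re.findall(r"#\w+", text))
        groups[tags].append(tweet_id)
    return dict(groups)

# Hypothetical tweets, loosely modelled on the class labels of the slide.
tweets = {
    "t1": "Delays expected this week #TurkeyDayTravel #travel",
    "t2": "Pack early #travel",
    "t3": "Check your gate #TurkeyDayTravel",
    "t4": "More delays #travel #TurkeyDayTravel",
}
classes = hashtag_classes(tweets)
```

Because the class key is a set, the order in which an author writes the hashtags does not matter: `t1` and `t4` end up in the same class.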
  • 7. Method 2: Machine Classifier
      • Latent Dirichlet Allocation (LDA) to represent tweets as probabilities over latent topics [Blei et al. 03]
      • Construction of the model from TREC Tweets2011
        – Train topic model over tweets aggregated by their Twitter users [Hong et al. 10]
        – Infer probability distribution over topics for each of the 15 tweets
      • Cluster tweets using k-means
        – Number of clusters optimized by Hartigan’s index and Average Silhouette [Kaufman et al. 05]
        – Using cosine similarity as a distance measure
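The clustering step of Method 2 can be illustrated with a small k-means variant that assigns each tweet to the centroid with the highest cosine similarity. This is a sketch under stated assumptions: the topic distributions are made-up numbers standing in for LDA output, the centroids are initialised with the first k vectors for determinism, and the number of clusters is fixed by hand rather than selected with Hartigan's index or the average silhouette.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def kmeans_cosine(vectors, k, max_iter=50):
    """k-means that assigns points to the most cosine-similar centroid.

    Initialising with the first k vectors is a simplification of this sketch.
    """
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical topic distributions of four tweets over three latent topics.
topic_dists = [
    [0.90, 0.05, 0.05],
    [0.05, 0.90, 0.05],
    [0.80, 0.10, 0.10],
    [0.10, 0.80, 0.10],
]
labels = kmeans_cosine(topic_dists, k=2)
```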
  • 8. Method 3: Human Classifier
      • Online experiment: asked 163 human annotators to manually classify the 15 tweets of each topic

        ID   Main topic      Subtopics             # annotators
        1    #health         #nutrition, #news     20
        2    #apple          #iphone, #mac         18
        3    #photography    #nature, #art         15
        4    #green          #solar, #eco          14
        5    #celebrity      #news, #gossip        15
        6    #fashion        #news, #shoes         15
        7    #fitness        #health, #exercise    18
        8    #humor          #quotes, #funny       15
        9    #quote          #love, #life          16
        10   #travel         #lp, #tips            17
        ∑                                          163
  • 9. Method 3: Human Classifier
      • Annotators can create an arbitrary number of classes and label them
      • Annotators have access to the tweet’s textual content as well as screenshots of the links, but the hashtag sign ‘#’ is removed
      • [Screenshot of the annotation interface with the callout “Class label”]
  • 10. Degree of Classifier Agreement
      • Methods 1-3 produce groups of tweets
      • Compare groups with Cohen’s kappa [Fu et al. 2012]
      • Convert classifications into match tables
        – Elements in same group: 1
        – Otherwise: 0
      • Example: tweets a, b, c, d, e are grouped by two classifiers, giving one match table each

        First classifier        Second classifier
           a  b  c  d              a  b  c  d
        b  1                    b  1
        c  0  0                 c  1  1
        d  0  0  1              d  0  0  0
        e  0  0  0  0           e  0  0  0  1

      • Compare the match tables using Cohen’s kappa [classifier names and resulting value lost in extraction]
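The match-table comparison can be reproduced with a short sketch. The groupings below are read off the two match tables of the slide; the function names and the binary form of Cohen's kappa over the tweet pairs are assumptions of this example, and the value shown on the original slide is not preserved in the transcript.

```python
from itertools import combinations

def match_table(groups, items):
    """Flatten a grouping into a 0/1 vector over all item pairs.

    1 means both items of the pair are in the same group, 0 otherwise.
    """
    group_of = {x: i for i, group in enumerate(groups) for x in group}
    return [int(group_of[a] == group_of[b]) for a, b in combinations(items, 2)]

def cohens_kappa(r1, r2):
    """Cohen's kappa for two binary rating vectors of equal length."""
    n = len(r1)
    p_observed = sum(a == b for a, b in zip(r1, r2)) / n
    p1, p2 = sum(r1) / n, sum(r2) / n
    p_expected = p1 * p2 + (1 - p1) * (1 - p2)
    if p_expected == 1:  # both raters constant and identical
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)

items = ["a", "b", "c", "d", "e"]
first = match_table([{"a", "b"}, {"c", "d"}, {"e"}], items)
second = match_table([{"a", "b", "c"}, {"d", "e"}], items)
kappa = cohens_kappa(first, second)
```

For these two tables the classifiers agree on 6 of the 10 pairs, chance agreement is 0.56, and the sketch yields a kappa of 1/11, roughly 0.09.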
  • 11. Agreements Between Classifiers
      • Hashtag/Machine (HaM)
        – Almost no agreement
        – Except topic 3 “photography”: 11 of 15 tweets use the hashtags also as words in the text
      • Hashtag/Human (HaHu)
        – Slight agreement
      • Machine/Human (MHu)
        – Almost no agreement
        – Except topic 10 “travel”: agreement on the disagreement at tweets having the hashtag “#tips”

        ID    HaM     HaHu    MHu
        1     -0.05   0.12    0.00
        2      0.02   0.05    0.05
        3      0.24   0.06    0.11
        4      0.01   0.11    0.00
        5      0.00   0.07   -0.04
        6      0.00   0.15    0.04
        7      0.04   0.09    0.05
        8     -0.04   0.17    0.03
        9     -0.02   0.13    0.00
        10     0.01   0.10    0.45
        M      0.02   0.10    0.07
        SD     0.08   0.10    0.12
  • 12. Inter-Human-Annotator Agreement
      • Fleiss’ kappa: measures agreement among more than two raters
      • Agreement among human classifiers is consistently larger than for HaHu and MHu
      • Difference is significant [significance level lost in extraction]

        Topic ID   Fleiss’ kappa
        1          0.17
        2          0.10
        3          0.13
        4          0.16
        5          0.53
        6          0.20
        7          0.14
        8          0.31
        9          0.33
        10         0.38
        M          0.25
        SD         0.14

      • Researchers should use ground truth made by human annotators rather than hashtags for tweet classification
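Fleiss' kappa follows the standard textbook definition and can be sketched as below. The input layout is an assumption of this example: `counts[i][j]` is the number of raters who put subject i into category j, with the same number of raters for every subject. How exactly the study mapped the annotators' groupings onto subjects and categories is not stated on the slide.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects x categories table of rating counts.

    counts[i][j] = number of raters assigning subject i to category j.
    Assumes every subject was rated by the same number of raters (>= 2).
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Share of all ratings that fall into each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    # Per-subject agreement: fraction of rater pairs that agree.
    p_subj = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_subj) / n_subjects
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)

# Hypothetical ratings: 3 raters, 2 categories ("same group", "different group").
perfect = fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]])
mixed = fleiss_kappa([[2, 1], [1, 2]])
```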
  • 13. Automatic Subject Indexing
      • STW (Standard Thesaurus Wirtschaft), example concepts: Cancer (18899-3), Research (10436-6), USA (17829-1), …
      • Published as Linked Open Data!
      • [GNS15] G. Große-Bölting, C. Nishioka, A. Scherp: A Comparison of Different Strategies for Automated Semantic Document Annotation. K-CAP 2015
      • Nomination for Best Paper Award at K-CAP 2015
      • Award „Prof. Dr. Werner Petersen-Preis der Technik 2015”
  • 14. Automated Subject Indexing
      • Scientific search engine GERHARD (‘97-‘99)
      • Ontology with ~10,000 classes in three languages
  • 15. Experiment Framework
      • Each strategy is a composition of methods from steps 1 + 2 + 3
        1. Concept Extraction: detect concepts (candidate annotations) from each document
        2. Concept Activation: compute a score for each concept of a document
        3. Annotation Selection: select annotations from concepts for each document
        4. Evaluation: measure performance of strategies against ground truth
  • 16. Configurations
      • Concept Extraction: Entity, Tri-gram, LDA, RAKE
      • Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
      • Annotation Selection: Top-k (2 methods), kNN (1 method)
  • 17. Configurations: Entity-based (24 strategies)
      • … using a domain-specific taxonomy like STW
      • Concept Extraction: Entity, Tri-gram, LDA, RAKE
      • Concept Activation: Statistical Methods (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
      • Annotation Selection: Top-k (2 methods), kNN (1 method)
  • 18. Concept Activation Methods
      • Concept frequency [formula lost in extraction]
      • CF-IDF as an extension of the popular TF-IDF model, replacing terms with concepts [Goossen et al. 11]
        – IDF lowers the weight of concepts appearing in many documents
      • These methods do not actually “activate” anything …
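CF-IDF can be sketched by analogy with plain TF-IDF, with concepts taking the place of terms. This is a minimal sketch: the raw count as concept frequency and the unsmoothed `log(N / df)` as inverse document frequency are assumptions here, and the exact normalisation used in [Goossen et al. 11] may differ. The documents are hypothetical lists of already extracted concepts.

```python
import math
from collections import Counter

def cf_idf(docs):
    """Score every concept of every document with CF * IDF.

    docs: list of documents, each a list of extracted concept labels.
    CF is the raw count of the concept in the document (assumption);
    IDF is log(N / number of documents containing the concept).
    """
    n_docs = len(docs)
    doc_freq = Counter(concept for doc in docs for concept in set(doc))
    scores = []
    for doc in docs:
        concept_freq = Counter(doc)
        scores.append({
            concept: count * math.log(n_docs / doc_freq[concept])
            for concept, count in concept_freq.items()
        })
    return scores

# Hypothetical concept lists for three documents.
docs = [
    ["bank", "bank", "tax"],
    ["bank", "financial crisis"],
    ["bank", "tax", "tax"],
]
scores = cf_idf(docs)
```

The concept "bank" occurs in every document, so its IDF is log(1) = 0 and it receives no weight, which is the down-weighting effect the slide describes.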
  • 19. Hierarchy-based Methods
      • Reveal concepts that are not explicitly mentioned by using a hierarchical knowledge base (KB)
      • KBs are of high quality and freely available!
      • [Figure: example concept hierarchy with the nodes World Wide Web, Web Searching, Web Mining, Social Recommendation, Social Tagging, Site Wrapping, Web Log Analysis]
      • Base Activation: defined over the set of child concepts of a concept and a decay parameter [formula and worked example lost in extraction]
  • 20. Hierarchy-based Methods
      • One-hop activation
        – Developed with domain experts at ZBW
        – Operates on the set of concepts detected in a document
        – Maximum activation distance: one hop
        – [Activation formula garbled in extraction; the surviving fragments show a case distinction on a condition comparing |… ∩ …| with 2]
      • Works very well … why?
  • 21. Graph-based Methods
      • Represent concepts as a co-occurrence graph [Figure: example graph with the nodes Tax, Bank, Interest Rate, Financial Crisis, Central Bank]
      • HITS for link analysis of web sites [Kleinberg 99] [formula lost in extraction]
      • Degree as the number of edges linked with a concept [Zouaq et al. 12] [example lost in extraction]
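Both graph-based scores can be sketched on a small co-occurrence graph. The node labels come from the slide's figure, but the edges below are hypothetical because the figure's edges are not preserved. The HITS routine is the generic power iteration of [Kleinberg 99] applied to an undirected graph, which is an assumption about how it is used here, not the authors' exact formulation.

```python
import math

def degree(edges):
    """Number of edges attached to each concept (each edge listed once)."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def hits(edges, iterations=50):
    """HITS by power iteration on an undirected graph.

    Returns (hub, authority) score dicts, each L2-normalised.
    """
    nodes = sorted({n for edge in edges for n in edge})
    neighbours = {n: set() for n in nodes}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    hub = {n: 1.0 for n in nodes}
    auth = dict(hub)
    for _ in range(iterations):
        # Authority of a node: sum of the hub scores of its neighbours.
        auth = {n: sum(hub[m] for m in neighbours[n]) for n in nodes}
        norm = math.sqrt(sum(a * a for a in auth.values()))
        auth = {n: a / norm for n, a in auth.items()}
        # Hub of a node: sum of the authority scores of its neighbours.
        hub = {n: sum(auth[m] for m in neighbours[n]) for n in nodes}
        norm = math.sqrt(sum(h * h for h in hub.values()))
        hub = {n: h / norm for n, h in hub.items()}
    return hub, auth

# Hypothetical co-occurrence edges between the concepts named on the slide.
edges = [
    ("Bank", "Tax"),
    ("Bank", "Interest Rate"),
    ("Bank", "Central Bank"),
    ("Central Bank", "Financial Crisis"),
    ("Central Bank", "Interest Rate"),
]
deg = degree(edges)
hub, auth = hits(edges)
```

On an undirected graph the hub and authority scores converge to the same ranking, so either can serve as the activation score of a concept.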
  • 22. Configurations: n-grams (15 strategies)
      • Concept Extraction: Entity, Tri-gram, LDA, RAKE
      • Concept Activation: Statistical Method (2 methods), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
      • Annotation Selection: Top-k (2 methods), kNN (1 method)
  • 23. Configurations: RAKE [Rose et al. 10] (3 strategies)
      • Concept Extraction: Entity, Tri-gram, LDA, RAKE
      • Concept Activation: Statistical Method (Frequency), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
      • Annotation Selection: Top-k (2 methods), kNN (1 method)
  • 24. Configuration: LDA [Blei et al. 03]
      • Concept Extraction: Entity, Tri-gram, LDA, RAKE
      • Concept Activation: Statistical Methods (Frequency), Hierarchy-based Methods (3 methods), Graph-based Methods (3 methods)
      • Annotation Selection: Top-k (2 methods), kNN (1 method)
      • 43 strategies in total*
  • 25. Datasets: 3 Scientific Domains

                         Economics       Politics             Computer science
        Source           ZBW             FIV                  SemEval 2010
        # documents      62,924          28,324               244
        # annotations    5.26 (± 1.84)   12 (± 4.02)          5.05 (± 2.41)
        Knowledge base   STW             European Thesaurus   ACM CCS
        # entities       6,335           7,912                2,299
        # labels         11,679          8,421                9,086

      • Computer science dataset: SemEval 2010 [Kim et al. 10]
      • Pre-processing of author keywords needed [Wang et al. 14]
      • Total of ~100,000 scientific documents: largest so far!
  • 26. Best Performing Configurations
      • Best strategy: Entity × HITS × kNN [scores for economy, politics, and computer lost in extraction]
      • Close ones: OneHop as well as any other graph-based method
  • 27. Text Extraction from Scholarly Figures
      • [Figure: example bar chart “Preferred Operating System” (Windows, Macintosh, Linux) over “Number of Users”, shown with the text extracted from it]
      • Pipeline: Binarization → Clustering → Extraction → OCR → Text
      • Fully-automated TX pipeline
      • No assumptions, no training
      • Novel combination of DM & CV
      • [BS15] F. Böschen, A. Scherp: Multi-oriented Text Extraction from Information Graphics. DocEng 2015: 35-38
  • 28. Challenges for Research
      • Different font sizes
      • … font colors
      • … background colors
      • … emphases
      • Different angles
      • Overlapping elements
  • 29. 121 Scholarly Figures in Economics (from ZBW Open Access Corpus)
      • Current results: improvement of text recognition over the baseline (BL) of up to 30%
  • 30. Evaluation Setup
      • How to match output (left) with gold standard (right)?

                    Output “item 1”     Gold standard “Item 1”
        Unigrams    {e, i, m, t, 1}     {e, m, t, I, 1}
        Bigrams     {em, it, te}        {em, te, It}
        Trigrams    {ite, tem}          {tem, Ite}
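The n-gram sets of the evaluation setup can be reproduced with a short sketch. Building character n-grams per whitespace-separated token and treating them as sets matches the sets shown on the slide; scoring them with precision and recall is an assumption of this example, since the slide does not name the metric.

```python
def char_ngrams(text, n):
    """Set of character n-grams, built separately for each token."""
    grams = set()
    for token in text.split():
        grams.update(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

def precision_recall(output, gold, n):
    """Case-sensitive precision and recall of the output's n-gram set."""
    out_grams, gold_grams = char_ngrams(output, n), char_ngrams(gold, n)
    if not out_grams or not gold_grams:
        return 0.0, 0.0
    hits = len(out_grams & gold_grams)
    return hits / len(out_grams), hits / len(gold_grams)

# OCR output vs. gold standard, as in the slide's example.
unigram_pr = precision_recall("item 1", "Item 1", 1)
bigram_pr = precision_recall("item 1", "Item 1", 2)
trigram_pr = precision_recall("item 1", "Item 1", 3)
```

The single case error ("i" instead of "I") costs 1 of 5 unigrams but 1 of 2 trigrams, so longer n-grams penalise the same mistake more heavily.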
  • 31. Limits of Current Evaluation
      • Baseline #1: OCR engine Tesseract (Google) with layout analysis
        – 1 pass per figure
      • Baseline #2: OCR engine Tesseract (Google) with layout analysis
        – Multiple, angle-rotated passes
      • Comparison with related work: very difficult!
  • 32. Evaluation: Orientation Distributions
      • Note: “horizontal” means within ±15° (Tesseract’s rotation tolerance)
  • 33. Mockup: Use of TX in ZBW’s EconBiz
  • 34. Summary: KDD in Social Media & DL
      • How to deal with the vast amount of content related to research and innovation?
      • H2020 INSO-4 project, duration: 04/2016-03/2019
      • Platform with data mining and visualization tools for enabling information professionals to deal with large corpora of scientific content, data, and social media