The slides here present the results of the second-semester Research Project, part of the Master in Artificial Intelligence at the Department of Knowledge Engineering of Maastricht University. The project took place between February and June 2015 and consisted of the analysis of a large dataset of 200K publications on nanotechnology. The project team was composed of S. Deckers, J. Hermans, A. Ludermann, D. Di Mitri, J. Rutten and D. Soemers.
5. Research Project MAI 2 - Group n.4
Visualisations
Task 5: “Visualising the articles in a relevant context of time and
geographical location in 2D or 3D”
Analysing Keywords
Task 1: “Determining combinations of keywords that are specific
for each year, country, journal, and subject category”
Task 6: “Extracting (combinations of) keywords from abstracts
and titles”
Task 1
• TF-IDF as feature extraction method.
• Treat objects of interest and their keywords as documents.
• Extract relevant keywords by making use of a threshold.
• Fast
[Diagram components: fetching model, generic document processor, combination model]
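The Task 1 bullets can be sketched as follows. This is a minimal illustration, assuming each object of interest (a year, country, journal or subject category) is represented by the list of keywords of its articles. The function name `tfidf_keywords`, the TF-IDF variant (relative term frequency times log inverse document frequency) and the threshold value are assumptions and are not taken from the slides.

```python
import math
from collections import Counter

def tfidf_keywords(documents, threshold):
    """Return, per object of interest, the keywords whose TF-IDF exceeds threshold.

    documents maps an object name (e.g. a year) to its list of keywords.
    """
    n_documents = len(documents)
    # Document frequency: in how many objects does each keyword occur?
    document_frequency = Counter()
    for keywords in documents.values():
        document_frequency.update(set(keywords))

    selected = {}
    for name, keywords in documents.items():
        term_counts = Counter(keywords)
        scores = {
            word: (count / len(keywords)) * math.log(n_documents / document_frequency[word])
            for word, count in term_counts.items()
        }
        selected[name] = sorted(w for w, s in scores.items() if s > threshold)
    return selected
```

A keyword that occurs in every object gets an inverse document frequency of zero, so it is never selected as specific for any single year or country.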
Task 6
1. Preprocessing of abstracts (tokenization, stemming, removal of stopwords to reduce dimensionality).
2. Construct vector space and word mapping for every article abstract.
   • LDA (treat sentences as documents).
   • TF-IDF seemed too naïve.
3. Apply LDA (k = 1) on vector space to fetch distribution over words.
4. Use word mapping (index -> word) to extract relevant words.
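The four steps above can be sketched roughly as below. This is not the project's code: stemming is omitted, the stopword list is a tiny hypothetical one, and instead of running a real LDA implementation the sketch uses the fact that, with a single topic and a symmetric Dirichlet prior, the posterior mean of the topic's word distribution is just the smoothed relative word frequency. All function names and the prior value `eta` are assumptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stopword list.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "on", "to", "is", "are",
             "was", "were", "by", "with", "for"}

def preprocess(text):
    """Step 1: lowercase, tokenize and drop stopwords (no stemming here)."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def build_vector_space(sentences):
    """Step 2: one sparse count vector per sentence plus an index -> word mapping."""
    word_to_index = {}
    rows = []
    for sentence in sentences:
        counts = Counter()
        for token in preprocess(sentence):
            counts[word_to_index.setdefault(token, len(word_to_index))] += 1
        rows.append(counts)
    index_to_word = {i: w for w, i in word_to_index.items()}
    return rows, index_to_word

def single_topic_distribution(rows, vocabulary_size, eta=0.01):
    """Step 3 stand-in: word distribution of a one-topic model."""
    totals = Counter()
    for row in rows:
        totals.update(row)
    n_tokens = sum(totals.values())
    return [(totals[i] + eta) / (n_tokens + eta * vocabulary_size)
            for i in range(vocabulary_size)]

def top_words(abstract, k=3):
    """Step 4: map the most probable indices back to words."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", abstract) if s.strip()]
    rows, index_to_word = build_vector_space(sentences)
    distribution = single_topic_distribution(rows, len(index_to_word))
    ranked = sorted(range(len(distribution)),
                    key=lambda i: (-distribution[i], index_to_word[i]))
    return [index_to_word[i] for i in ranked[:k]]
```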
Ontology
Task 2: “Specifying an application independent ontology of
publications.”
Task 7: “Defining ontology of the domain of nanotechnology
which should be linked to the ontology of publications made in
the first block.”
Task 8: “Automatically generating ontology for the publication
data. Compare this ontology with the one you defined
yourselves. Fill the ontology with data from the articles.”
Task 8: Approach
• Ontology Learning
  • Automatic or semi-automatic creation of ontologies
  • Requires text (or other data)
  • Often requires human supervision / corrections
• Approach
  • Accept single words from user input
  • Allow choice of different senses of word
  • Automatically generate related words
Cluster Articles
Task 4: “Learning article dendrograms and interpreting the
dendrogram clusters”
• Approach
• Analysing splitting at the root
Task 4: Approach
• Sample 8,000 articles from database
• Top-down hierarchical clustering
• K-Means on each level with K = 2
• Stop splitting when cluster small enough or dense enough
• Repeat N times and compare results
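A minimal sketch of such a top-down (bisecting) clustering is given below, assuming articles are already encoded as numeric tuples. The stopping thresholds `min_size` and `max_spread`, the spread measure (mean squared distance to the centroid) and all function names are illustrative assumptions; the slides do not state the actual criteria.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    return tuple(sum(column) / len(points) for column in zip(*points))

def two_means(points, rng, iterations=20):
    """Split points into two groups with K-Means (K = 2); None if no valid split."""
    centers = rng.sample(points, 2)
    groups = None
    for _ in range(iterations):
        candidate = ([], [])
        for p in points:
            nearer_first = squared_distance(p, centers[0]) <= squared_distance(p, centers[1])
            candidate[0 if nearer_first else 1].append(p)
        if not candidate[0] or not candidate[1]:
            break  # keep the last split that had two non-empty groups
        groups = candidate
        centers = [centroid(g) for g in groups]
    return groups

def build_dendrogram(points, rng, min_size=3, max_spread=0.0):
    """Recursively split until a cluster is small enough or dense enough."""
    node = {"size": len(points), "children": []}
    center = centroid(points)
    spread = sum(squared_distance(p, center) for p in points) / len(points)
    if len(points) <= min_size or spread <= max_spread:
        return node
    groups = two_means(points, rng)
    if groups is not None:
        node["children"] = [build_dendrogram(g, rng, min_size, max_spread) for g in groups]
    return node

def leaf_sizes(node):
    if not node["children"]:
        return [node["size"]]
    return [size for child in node["children"] for size in leaf_sizes(child)]
```

Because K-Means depends on its random initial centers, running `build_dendrogram` several times with differently seeded `random.Random` instances and comparing the trees corresponds to the "repeat N times" bullet.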
Task 4: Analysing split at root
• Year Features
  • 1998, 1999, 2000, 2001, 2002 (all 4x)
• Country Features
  • USA (4x), Japan (2x), Germany (1x), Peoples R. China (1x)
• Subject Features
  • Physics, Condensed Matter (4x)
  • Physics, Applied (3x)
  • Chemistry, Physical (3x)
  • Materials Science, Multidisciplinary (2x)
Predicting citations
Task 3: “Learning models that predict the citations of articles.”
Task 9: “Predicting the most cited authors.”
k-Nearest Neighbor classification
Initial approach (1)
• k-Nearest Neighbor, with k = 1 (results are sufficient)
• Considered attributes:
  • Cited patents
  • Publication year
  • Countries
  • Subject category
  • Author affiliation origin
• Instance representation using a boolean array
• Cosine similarity
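The initial approach can be illustrated with the sketch below: each article becomes a boolean array with one entry per known attribute value, and a query is labelled with the class of its most similar training instance under cosine similarity. The attribute strings, the `encode` helper and the function names are hypothetical and only serve the example.

```python
import math

def encode(attributes, vocabulary):
    """One boolean per known attribute value (True if the article has it)."""
    return [value in attributes for value in vocabulary]

def cosine_similarity(a, b):
    """Cosine similarity of two boolean arrays of equal length."""
    shared = sum(1 for x, y in zip(a, b) if x and y)
    size_a, size_b = sum(a), sum(b)
    if size_a == 0 or size_b == 0:
        return 0.0
    return shared / math.sqrt(size_a * size_b)

def predict_1nn(training, query):
    """Return the label of the most similar training instance (linear scan)."""
    best_label, best_similarity = None, -1.0
    for vector, label in training:
        similarity = cosine_similarity(vector, query)
        if similarity > best_similarity:
            best_label, best_similarity = label, similarity
    return best_label
```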
Initial approach (2)
Classification using four classes:
• 0: no citations
• 1-20: low number of citations
• 21-100: medium number of citations
• 101 and more: high number of citations
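The binning into four classes can be written down directly. The class indices 0 to 3 follow the numbering used on the results slides; the function name `citation_class` is an assumption.

```python
def citation_class(n_citations):
    """Map a citation count to one of the four classes from the slide."""
    if n_citations < 0:
        raise ValueError("citation count cannot be negative")
    if n_citations == 0:
        return 0  # no citations
    if n_citations <= 20:
        return 1  # low
    if n_citations <= 100:
        return 2  # medium
    return 3      # high
```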
Problem!
• 189,508 data instances (valid data entries)
• ~14,000-dimensional space
• A boolean takes up one byte (the smallest addressable memory element)
~14 kB for every instance!
~2.7 GB to contain complete dataset!
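The figures follow from a short calculation, reproduced here as a sketch. It assumes exactly 14,000 dimensions (the slide only says "~14,000") and decimal units (1 kB = 1,000 bytes).

```python
# Figures from the slide; the dimensionality is approximate ("~14,000").
N_INSTANCES = 189_508
N_DIMENSIONS = 14_000
BYTES_PER_BOOLEAN = 1  # as stated on the slide: one byte per boolean

bytes_per_instance = N_DIMENSIONS * BYTES_PER_BOOLEAN
total_bytes = N_INSTANCES * bytes_per_instance

print(f"{bytes_per_instance / 1e3:.0f} kB per instance")
print(f"{total_bytes / 1e9:.2f} GB in total")
```

With these inputs the total is about 2.65 GB, which the slide rounds to ~2.7 GB.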
Solution
• Use the boolean nature of the instance representation!
• Address and modify bits using bitmasks.
~14 kB reduced to ~1.7 kB
~2.7 GB reduced to ~332 MB
Memory consumption reduced by a factor of 8.
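Bit packing with bitmasks can be sketched as below. The class name `BitVector` and its methods are hypothetical; the project's own implementation is not shown on the slides.

```python
class BitVector:
    """Fixed-size boolean vector that stores eight entries per byte."""

    def __init__(self, n_bits):
        self.n_bits = n_bits
        self.data = bytearray((n_bits + 7) // 8)  # round up to whole bytes

    def set(self, index, value=True):
        byte_index, mask = index >> 3, 1 << (index & 7)
        if value:
            self.data[byte_index] |= mask          # switch the bit on
        else:
            self.data[byte_index] &= ~mask & 0xFF  # switch the bit off

    def get(self, index):
        return bool(self.data[index >> 3] & (1 << (index & 7)))
```

For 14,000 dimensions this gives 1,750 bytes per instance, and 189,508 instances then need 331,639,000 bytes, which matches the ~1.7 kB and ~332 MB on the slide.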
Additional optimizations
Bit representation allows us to make more efficient use of
the CPU’s ALU.
Optimization of Cosine Similarity.
Increase in classification performance using linear search.
Original BitSet implementation
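One way the cosine similarity benefits from the bit representation: for boolean vectors, the dot product is the number of bits set in both vectors, and each norm is the square root of the number of set bits. The sketch below uses Python integers as stand-ins for bit sets; it illustrates the idea and is not the project's implementation.

```python
import math

def popcount(bits):
    """Number of set bits in a non-negative integer."""
    return bin(bits).count("1")

def cosine_similarity_bits(a, b):
    """Cosine similarity of two boolean vectors packed into integers."""
    size_a, size_b = popcount(a), popcount(b)
    if size_a == 0 or size_b == 0:
        return 0.0
    # One bitwise AND handles many dimensions at once.
    return popcount(a & b) / math.sqrt(size_a * size_b)
```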
Task 3 - Results
• 10-fold cross-validation
• Avg. accuracy Class 0: 0.7908
• Avg. accuracy Class 1: 0.9943
• Avg. accuracy Class 2: 0.9823
• Avg. accuracy Class 3: 0.8175
• Total avg. accuracy: 0.8963
Task 9 - Results
Represent each author by his or her articles (instances), since authors cannot be uniquely identified.
Search for Class 3 instances.
Avg. accuracy for Class 3 classification: 0.7377
Analysing raw materials
Task 10: “Determining new substitutes of expensive raw
materials”
Task 10: Determining new substitutes of rare
raw materials
● Rare earth elements
○ group of 17 chemical elements
● 1. Find relevant documents
○ Abstracts that mention Rare Earth elements in some form
● 2. Analyse these documents for trends/patterns
Task 10: Finding Relevant Documents
● Regular Expressions
○ Can detect different ways of writing Rare Earths
● Full names
○ Yttrium / yttrium
● Chemical Formulae
○ Zr-Ce / YBa2Cu3O7+Ni / YSi1.7
● Some false positives
○ ZYMV-S (Zucchini Yellow Mosaic Virus)
○ especially for Yttrium
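A rough sketch of such a detector is shown below. The patterns are illustrative assumptions, not the expressions used in the project: element names are matched case-insensitively as whole words, and element symbols are matched wherever they are not followed by a lowercase letter. That loose symbol rule also reproduces the kind of false positive listed above, since the "Y" in "ZYMV-S" matches.

```python
import re

# The 17 rare earth elements: symbol -> full name.
RARE_EARTHS = {
    "Sc": "scandium", "Y": "yttrium", "La": "lanthanum", "Ce": "cerium",
    "Pr": "praseodymium", "Nd": "neodymium", "Pm": "promethium",
    "Sm": "samarium", "Eu": "europium", "Gd": "gadolinium", "Tb": "terbium",
    "Dy": "dysprosium", "Ho": "holmium", "Er": "erbium", "Tm": "thulium",
    "Yb": "ytterbium", "Lu": "lutetium",
}
NAME_TO_SYMBOL = {name: symbol for symbol, name in RARE_EARTHS.items()}

NAME_PATTERN = re.compile(r"\b(" + "|".join(NAME_TO_SYMBOL) + r")\b", re.IGNORECASE)
# Two-letter symbols first, so "Yb" is tried before "Y".
SYMBOL_PATTERN = re.compile(
    "(" + "|".join(sorted(RARE_EARTHS, key=len, reverse=True)) + ")(?![a-z])"
)

def find_rare_earths(text):
    """Return the symbols of all rare earth elements detected in text."""
    found = {NAME_TO_SYMBOL[m.group(1).lower()] for m in NAME_PATTERN.finditer(text)}
    found.update(m.group(1) for m in SYMBOL_PATTERN.finditer(text))
    return found
```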
Task 10 - TF-IDF approach 1/3
Task description: use TF-IDF to order the 190,692 publications according to the similarity of their abstracts with the Wikipedia article “Rare earth element”.
Background knowledge on Rare earth elements
Task 10: TF-IDF approach 2/3
[Diagram: the query document (QueryDoc) and the abstracts (Doc 0001.txt … Doc 192K.txt) each pass through text preprocessing. The abstracts yield the TF-IDF index A (n. docs x n. terms) and the query yields the query vector b (n. query terms). Scores are computed with a linear kernel: s = A x b^T]
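The scoring step s = A x b^T can be sketched in plain Python as below. The tokenizer, the smoothed IDF formula and the L2 normalisation are assumptions chosen for the example; the slides do not say which TF-IDF variant or library was used.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def weigh(counts, idf):
    """TF-IDF weights for one document, L2-normalised; unknown terms are dropped."""
    vector = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vector.values()))
    return {t: w / norm for t, w in vector.items()} if norm else {}

def build_index(abstracts):
    """Build the sparse TF-IDF index A (one row per abstract) and the IDF table."""
    tokenized = [tokenize(a) for a in abstracts]
    document_frequency = Counter(t for tokens in tokenized for t in set(tokens))
    n = len(abstracts)
    idf = {t: math.log((1 + n) / (1 + df)) + 1 for t, df in document_frequency.items()}
    rows = [weigh(Counter(tokens), idf) for tokens in tokenized]
    return rows, idf

def linear_kernel(rows, query_vector):
    """s = A x b^T: one dot product per row of the index."""
    return [sum(w * query_vector.get(t, 0.0) for t, w in row.items()) for row in rows]

def rank(abstracts, query_text):
    """Return abstract indices ordered by decreasing similarity, plus the scores."""
    rows, idf = build_index(abstracts)
    query_vector = weigh(Counter(tokenize(query_text)), idf)
    scores = linear_kernel(rows, query_vector)
    return sorted(range(len(abstracts)), key=lambda i: -scores[i]), scores
```

In the setting of the slides, `query_text` would be the text of the Wikipedia article and `abstracts` the publication abstracts.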
Task 10: TF-IDF approach 3/3
Example: first result, doc id 20350
The nano-grained Ni/ZrO2 catalysts containing rare earth element oxides were prepared by
oxidation-reduction pretreatment of amorphous Ni-(40-x) at% Zr-x at% rare earth element (Y,
Ce and Sm; x=1 - 10) alloy precursors. The conversion of carbon dioxide on the catalysts
containing 1 at% rare earth elements was almost the same as that on the rare earth element-
free catalyst, but the addition of 5 at% or more rare earth elements increased remarkably the
conversion at 473 K. In contrast to the formation of monoclinic and tetragonal ZrO2 during
pretreatment of amorphous Ni-Zr alloys containing 1 at% rare earth elements, tetragonal
ZrO2, which is generally stable only at high temperatures, was predominantly formed during
the pretreatment of the catalysts containing 5 at% or more rare earth elements. The surface
area of the catalysts increased with the content of rare earth element. Thus, the increase in
the surface area and stabilization of tetragonal ZrO2 seem to be responsible for the
improvement of catalytic activity of the Ni-Zr alloy-derived catalysts by the addition of rare
earth elements.
Task 10: Removing False Positives
● Compute similarity to Wikipedia page on Rare Earth elements
○ TF-IDF vectors
● Reject documents with similarity score below threshold
● Conservative threshold (0.005)
○ filters some false positives
○ excludes few (if any) true positives
○ manually determined
Task 10: Analysis
● Top 3 countries
○ Saudi Arabia (15.74%), Slovenia (12.59%), Romania (9.13%)
● Top subject categories per Rare Earth element
● Rare Earth element trends over the years
● See report for detailed results
Task 10: Rare Earth substitution
● Search articles that address substitution
● Lucene to search within RE abstracts (11,430)
● Search for “substitut*”, “replace*” or “alternative” (955)
● Filtered by sentences containing chemical formula (841)
● Found no articles that directly address substitution; some instead refer to alternative methods or to substitution as a chemical reaction
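The two filtering steps can be approximated with regular expressions, as in the sketch below. This is a stand-in for the Lucene search used in the project, and the chemical formula pattern is a crude assumed heuristic (two or more element-like tokens in a row) that will also accept some acronyms.

```python
import re

# Mirrors the wildcard queries "substitut*", "replace*" and the term "alternative".
KEYWORD_PATTERN = re.compile(r"\b(?:substitut\w*|replace\w*|alternative)\b", re.IGNORECASE)
# Heuristic: at least two element-like tokens, e.g. ZrO2 or YSi1.7.
FORMULA_PATTERN = re.compile(r"\b(?:[A-Z][a-z]?\d*(?:\.\d+)?){2,}\b")

def split_sentences(text):
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def substitution_sentences(abstract):
    """Sentences that mention a substitution keyword and contain a formula."""
    return [
        sentence
        for sentence in split_sentences(abstract)
        if KEYWORD_PATTERN.search(sentence) and FORMULA_PATTERN.search(sentence)
    ]
```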
Wrapping up
• Task 9:
  • Represent authors by the collection of their articles
  • k-Nearest Neighbor classification
  • High accuracy
• Task 10:
  • Two approaches to find Rare Earth articles:
    • similarity to Wikipedia article with TF-IDF
    • regular expressions
  • Substitution: search for abstracts that address RE substitution directly
Conclusions
● variety of techniques for more insight
● ontologies and visualization
● most popular topics for years or countries
● predicting number of citations
Assistance for decision making,
e.g. which research areas to invest in
Improvements
• Improve RE substitution results by Machine Learning
techniques
• Need annotated data
• More advanced Machine Learning techniques for ontology
learning, e.g. clustering