1. A delineating procedure to retrieve
publication data in research areas – the case
of nanocellulose
Dr. Douglas Milanez
Centre for Technological Information in Materials/Federal University of São Carlos/Brazil
Dr. Ed Noyons
Centre for Science and Technology Studies /Leiden University/The Netherlands
15th International Conference on Scientometrics and
Informetrics, Istanbul, 2015
2. The challenge to delineate
scientific research areas
• What is a research area, a field, discipline, topic?
• Global or local delineation?
• Classification systems: an indispensable tool
– Structure and dynamics of scientific fields
– “As old as science”
• Database classification
– Journal assignment
– Multidisciplinary/ general journals?
3. The challenge to delineate
scientific research areas
• Classification systems: publication-level
initiatives
– Text Similarity (Boyack et al., 2011)
• Medline database (2.15 million publications)
• More precise than the Medical Subject Headings
– Citation approach (Waltman & van Eck, 2012)
• Web of Science database (SCI and SSCI)
• Clustering algorithm and further criteria
• Labels extracted by text mining titles and abstracts
This study proposes a delineation
procedure to retrieve publication data that
represent research areas
5. Nanocellulose
• Sustainable nanomaterial
– Any natural sources
– Bacterial fermentation
• Two types
– cellulose nanofibrils
– cellulose nanocrystals
• Great potential for innovation
– Composite materials
– Packaging material
– Electronic devices
– Texturizing agent
– Medical devices
6. From Wikipedia, the free encyclopedia
Nanocellulose is a term referring to nano-structured
cellulose.
This may be either cellulose nanofibers (CNF) also called
microfibrillated cellulose (MFC), nanocrystalline
cellulose (NCC), or bacterial nanocellulose, which refers
to nano-structured cellulose produced by bacteria.
7. But …
• There is no subject category to define nanocellulose
• The area is difficult to capture by keywords:
– Too narrow
– Too broad
8. Proposed approach
(a bit local and global)
• Use keyword search;
• Define core set: seed publications;
• Locate these seeds in classification;
• Check results;
• Define the area by selecting clusters from the classification.
9. Methodology
• Overall delineation procedure
Initial set of publications
– Establishing a search expression
– Web of Science online database
Prior retrieval of research areas
– CWTS publication-level classification system
– Lowest level
Cleaning the initial set of publications
– Cluster content analysis
– Establishing cleaning terms
Final retrieval and selection
– Final set of publications
– Selecting relevant research areas
12. Initial set of nanocellulose publications
(seeds)
• Terms found in the literature and suggested by experts
• Web of Science
• Search expression:
("bacterial cellulos*") OR ("cellulos* crystal*") OR ("cellulos* nanocrystal*") OR
("cellulos* whisker*") OR ("cellulos* microcrystal*") OR ("cellulos* nanowhisker*") OR
("nanocrystal* cellulos*") OR ("cellulos* nano-whisker*") OR ("cellulos* nano-crystal*") OR
("nano-crystal cellulos*") OR ("cellulos* micro-crystal*") OR ("cellulos* microfibril*") OR
("microfibril* cellulos*") OR ("cellulos* nanofibril*") OR ("nanofibril* cellulos*") OR
("micro-fibril* cellulos*") OR ("nano-fibril* cellulos*") OR ("cellulos* micro-fibril*") OR
("cellulos* nano-fibril*") OR ("cellulos* nanofiber*") OR ("nanocellulos*") OR
("cellulos* nanoparticle*") OR ("nano-cellulos*") OR ("nanoparticl* cellulos*") OR
("nanosiz* cellulos*") OR ("cellulos* nanofill*") OR ("nano-siz* cellulos*") OR
("cellulos* nano-fiber*") OR ("cellulos* nano-particle*") OR ("cellulos* nano-fill*") OR
("nano-particl* cellulos*")
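A long Boolean expression like the one above is easier to maintain when it is generated from a plain list of terms. The following is only a minimal sketch of that idea: the helper name `build_or_query` and the three sample terms are illustrative assumptions, not part of the original study, and the resulting string would still have to be pasted into the database's topic search by hand.

```python
def build_or_query(phrases):
    """Join phrases into one Boolean OR expression, quoting each phrase.

    Hypothetical helper: it only assembles text and does not query any database.
    """
    cleaned = []
    for phrase in phrases:
        phrase = " ".join(phrase.split())      # collapse stray whitespace
        if phrase and phrase not in cleaned:   # skip blanks and duplicates
            cleaned.append(phrase)
    return " OR ".join(f'("{p}")' for p in cleaned)


# A few of the terms from the slide; the full list works the same way.
sample_terms = ["bacterial cellulos*", "cellulos* nanocrystal*", "nanocellulos*"]
query = build_or_query(sample_terms)
print(query)
```

Keeping the terms in a list makes it simple to add or drop a synonym suggested by an expert without re-balancing parentheses.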
13. Prior retrieval of nanocellulose research
areas
[Figure: number of publications (0 to 3,000) per cluster (clusters 1 to 533). Annotation: 533 research areas!]
• 80% of clusters contained fewer than
three publications from the initial core
• Which clusters are relevant?
14. What information do we have per
cluster?
A. Number of publications in a cluster;
B. Number of publications retrieved by search strategy
(seed set);
C. Number of publications from seed per cluster.
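The three quantities A, B and C can be derived from a simple mapping of publications to clusters. The sketch below assumes a hypothetical dictionary `cluster_of` (publication id to cluster id) and a list `seed_ids`; both names and the toy data are made up for illustration, and B is counted here as the seeds that could actually be located in the classification.

```python
from collections import Counter


def cluster_summary(cluster_of, seed_ids):
    """Return {cluster: (cluster_size, seeds_in_cluster, share_of_seed_set)}."""
    sizes = Counter(cluster_of.values())                        # A: publications per cluster
    located = [pid for pid in seed_ids if pid in cluster_of]
    seed_total = len(located)                                   # B: seeds found in the classification
    seeds_per_cluster = Counter(cluster_of[pid] for pid in located)  # C: seeds per cluster
    return {
        cluster: (sizes[cluster], n, n / seed_total)
        for cluster, n in seeds_per_cluster.most_common()
    }


# Toy data (not from the study): six publications in three clusters, four of them seeds.
cluster_of = {
    "p1": "13.6.3", "p2": "13.6.3", "p3": "13.6.11",
    "p4": "5.1.21", "p5": "5.1.21", "p6": "5.1.21",
}
seed_ids = ["p1", "p2", "p3", "p4"]
summary = cluster_summary(cluster_of, seed_ids)
for cluster, (size, seeds, share) in summary.items():
    print(cluster, size, seeds, f"{share:.0%}")
```

Sorting by the number of seeds per cluster puts candidate nuclei clusters at the top of the summary.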
15. First results
• Two clusters: nuclei research topics
– Contained 56.3% of the seed nanocellulose set;
– Their (automated) labels relate to nanocellulose terms
• Cellulose Nanocrystals and Cellulose Nanofibrils
• Bacterial Cellulose (type of cellulose nanocrystal)
Analysis of the nanocellulose publications in peripheral clusters
[Figure labels: Nuclei, Peripheral]
16. Analyzing the relevance of the peripheral
research areas
• Approach: check whether each article focused on
nanocellulose or not:
– Reading each title (and abstract if needed)
– Nanocellulose should be the main subject
– Only articles from the seed set of nanocellulose publications
included in the cluster were checked
– A total of 20 peripheral clusters were checked
17. Percentage of noise of core-nanocellulose
publications on peripheral research areas
[Figure: percentage proportion (axis 0 to 50) and percentage of noise publications (axis 0 to 100) per research area; legend: Left Axis, Right Axis. Research areas shown: 5.1.21, 5.1.4, 13.19.17, 13.6.2, 13.19.3, 13.6.1, 9.13.4, 13.6.33, 13.6.43, 13.6.7, 18.5.4, 13.6.17, 13.6.22, 13.6.12, 13.6.9, 13.6.31, 13.6.29, 1.33.9, 13.18.1, 13.11.41. Annotation: extremely noisy research areas.]
18. Improving the initial seed set of
nanocellulose publications
• Selecting terms from title and abstract:
text mining approach
• VOSviewer
• Terms to ‘clean’ the first set of
nanocellulose publications:
"gene" OR "xyloglucan" OR "microtubule" OR "*cyto*" OR "kinesi" OR "tubulin" OR
"*cell wall*" OR "spindle" OR "phragmoplast" OR "mitosis" OR "preprophase" OR
"phenotype" OR "*plant growth*" OR "meiosi" OR "*lignin distribution*" OR
"delignification" OR "hemicellulose" OR "saccharification" OR "ethanol yield" OR
"lignocellulos*" OR "glucosidase" OR "xylanase"
19. Effect of cleaning the initial set of
nanocellulose publications
• Check performed on peripheral research topics
[Figure: number of publications (0 to 160) per research area, before and after cleaning. Research areas shown: 5.1.21, 5.1.4, 13.19.17, 13.6.2, 13.19.3, 13.6.1, 9.13.4, 13.6.33, 13.6.43, 13.6.7, 18.5.4, 13.6.17, 13.6.22, 13.6.12, 13.6.9, 13.6.31, 13.6.29, 1.33.9, 13.18.1, 13.11.41. Annotation: research areas related to biology.]
20. Effect of cleaning the initial set of
nanocellulose publications
• Check performed on nuclei research topics
[Figure: percentage decrease (0 to 6) in clusters 13.6.3 and 13.6.11 for each of the 22 cleaning terms.]
The cleaning step did not significantly affect the nuclei research
areas
21. Independence test
• Effect of the cleaning procedure on the top five authors
• They are important and relevant researchers in nanocellulose
• Their positions as the top authors did not change
– Except for author E, who went down to the seventh position
Author   Number of publications   Decrease (%)
         Before     After         Overall   Nuclei
A        87         78            -10.3     -6.33
B        51         40            -21.6     -14.9
C        50         43            -14       0
D        50         39            -22       -18.2
E        48         29            -39.6     -26.5
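The overall decrease in the table follows directly from the before and after counts. As a quick check, the sketch below recomputes it; the function name `percent_change` is an illustrative choice, and the nuclei column is not recomputed because the per-cluster counts are not given on the slide.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent, rounded to one decimal."""
    return round((after - before) / before * 100, 1)


# Before/after publication counts of the top five authors, as listed in the table.
counts = {"A": (87, 78), "B": (51, 40), "C": (50, 43), "D": (50, 39), "E": (48, 29)}
overall = {author: percent_change(b, a) for author, (b, a) in counts.items()}
print(overall)
```

The recomputed values agree with the Overall column (-10.3, -21.6, -14, -22 and -39.6).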
22. Final retrieval
• 2,600 nanocellulose publications (core-nanocellulose)
– 428 research areas
– 81.0% of clusters with one or two publications from the core-nanocellulose
• Are they relevant?
• Selecting relevant research areas
– Pareto Principle (or 80/20 rule)
– Research areas with one or two publications not considered
• 85 clusters
23. Selecting relevant research areas
• Pareto Principle:
[Figure: cumulative percentage of publications (0 to 100) against number of research areas (0 to 90). Annotation: 12 research topics.]
24. Map of the nanocellulose
main research topics
(local map of globally identified topics, cf. Boyack)
[Figure: map of the main research topics; label: nuclei areas]
25. Conclusion and next steps
• Delineating scientific fields is a complex task:
– Boundaries are often not well established, since scientific studies
involve more and more exchange of knowledge between scientists from
different disciplines.
• Our approach retrieves and delineates the nuclei and the
peripheral research areas concerning nanocellulose
research
26. Conclusion and next steps
• Study the knowledge flow from peripheral research
topics to the nuclei areas.
• Map how they provide the knowledge needed to face
current nanocellulose challenges and how countries and
scientific institutions are contributing to this evolution.
Ed
The idea of this slide is to introduce the challenge of delineating scientific research areas.
I did not include any animation.
Suggestion:
You could start by saying that classification systems have always been of interest to scientists, as they like to assign their research to a main area, such as chemistry, materials, engineering, etc. It is as old as science.
Then, the link is that databases classify science through journal assignment, but this approach has the drawback of not dealing properly with multidisciplinary journals.
Moreover, science has become more and more complex and interdisciplinary. For instance, how can we classify nanotechnology or biotechnology?
Ed
The idea here is to state that there are initiatives to classify research areas at the publication level.
I included the CWTS approach and another one that we cited in our article.
Both approaches were applied to a huge number of papers indexed in databases.
The link to the animation is that few bibliometric studies have aimed at analyzing a specific topic or subject.
This slide aims at presenting the subject: nanocellulose.
What is important to know in order to understand the analysis presented on the following slides:
We can obtain nanocellulose from natural sources such as trees, natural fibers and agricultural waste.
Basically, we have two types of nanocellulose, and they differ in their degree of crystallinity. Cellulose nanocrystals are more crystalline than nanofibrils.
The nanomaterial is interesting for innovation in new materials and devices.
You can also say that nanocellulose was the main topic of my thesis, and that it is a strategic nanomaterial for Brazil, as we can find a wide variety of sources for nanocellulose.
If you have any questions, please contact me.
Ed
The idea of this slide is to explain how the overall delineation procedure was conducted. It is important to highlight that this is an interactive procedure.
You can focus on the white box, because the following slides will detail the methodology:
1) Determine an initial set of publications concerning the theme of interest
2) Prior retrieval of nanocellulose research areas
3) Analysis of retrieved research area and cleaning of the initial set
4) Final retrieval and selection of relevant nanocellulose research areas.
This slide and the next one explain the idea of our methodology.
Each sphere corresponds to an article clustered using the CWTS approach published by Waltman & van Eck (2012).
We want to find which clusters contain a nanocellulose-related publication.
This slide and the previous one explain the idea of our methodology.
Each sphere corresponds to an article clustered using the CWTS approach published by Waltman & van Eck (2012).
We want to find which clusters contain a nanocellulose-related publication and then evaluate which of the selected clusters are relevant to nanocellulose.
A search expression was developed considering several terms and synonyms recommended by experts and found in the nanocellulose literature (Klemm et al., 2011; Milanez et al., 2013; Siqueira et al., 2010; Siró & Plackett, 2010), as can be seen from Table 1. The search expression encompassed different words that refer to cellulose nanocrystals, cellulose nanofibrils, and bacterial cellulose, as well as other generic forms, such as nanocellulose, cellulose nanoparticles, and cellulose nanofiller. The search was conducted on 31 March 2014 in the online Web of Science database (topic search).
The idea of this slide is to present the prior retrieval process.
I think it is important to state that the number of publications on the X axis refers to the number of publications contained in each cluster, which includes nanocellulose publications.
533 clusters were retrieved. The largest one contained 2,751 publications, while the smallest one contained 50 publications, in line with a criterion of the CWTS classification system.
Which are relevant?
The idea of this slide is to present the prior retrieval process.
Two clusters stood out, since they contained more nanocellulose publications than the others, and it was interesting that their labels revealed that they deal with types of nanocellulose.
It also suggests that nanocellulose has two important fields that describe it as a research topic.
However, 80% of the retrieved research areas contained fewer than 3 publications.
And the important thing is: evaluating all retrieved research areas would be too labour-intensive, so we made a selection of relevant clusters.
You can say that I (as a Dr. in Materials Science and Engineering) could judge whether an article focused on nanocellulose or not.
As cellulose is part of a complex biological system, nanocellulose appeared in many studies. However, in many of them, the nanomaterial was not the focus.
As to the 20 peripheral research topics whose sets of nanocellulose publications were evaluated, no direct correlation was observed between the proportional relevance of each cluster and its percentage of noise, according to Figure 5. Four research topics had a high percentage (>70%) of ‘noisy’ publications, mainly focusing on biological aspects of plants, ethanol production, and enzymes, without having the nanomaterial as the final object of research.
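The noise percentage per cluster can be computed from the manual relevance judgements. The sketch below is only an illustration under stated assumptions: the cluster names and judgements are made-up toy data, the function names are hypothetical, and the 70% threshold is taken from the note above.

```python
def noise_percentage(labels):
    """labels: list of booleans, True when nanocellulose is the main subject."""
    if not labels:
        return 0.0
    off_topic = sum(1 for focused in labels if not focused)
    return 100.0 * off_topic / len(labels)


def extremely_noisy(checked, threshold=70.0):
    """Return the clusters whose share of off-topic seed publications exceeds the threshold."""
    return [c for c, labels in checked.items() if noise_percentage(labels) > threshold]


# Toy judgements (not from the study) for two hypothetical peripheral clusters.
checked = {
    "cluster_x": [False, False, False, True],
    "cluster_y": [True, True, False, True],
}
print(extremely_noisy(checked))
```

Only the seed publications inside each cluster are judged, so the percentage describes the noise of the search strategy within that cluster, not of the cluster as a whole.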
The cleaning step focused on the first four clusters, which were classified as extremely noisy.
Since the first four clusters were used to select the cleaning terms, the cleaning affected them strongly. Two of them were even eliminated. Furthermore, other peripheral clusters had their nanocellulose publication coverage reduced.
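To make the cleaning step concrete, the sketch below shows one way exclusion terms with `*` wildcards could be applied to titles and abstracts offline. It is an assumption-laden illustration, not the procedure used in the study: the wildcard is interpreted as "any run of word characters", unmarked terms are matched as whole words, only five of the cleaning terms are used, and the three sample records are invented.

```python
import re


def term_to_regex(term):
    """Translate a term with * wildcards into a case-insensitive whole-word regex."""
    word_patterns = [
        r"\w*".join(re.escape(piece) for piece in word.split("*"))
        for word in term.split()
    ]
    return re.compile(r"\b" + r"\s+".join(word_patterns) + r"\b", re.IGNORECASE)


def clean_seed_set(records, terms):
    """Keep only records whose title and abstract match none of the cleaning terms."""
    patterns = [term_to_regex(t) for t in terms]
    kept = []
    for rec in records:
        text = rec["title"] + " " + rec.get("abstract", "")
        if not any(p.search(text) for p in patterns):
            kept.append(rec)
    return kept


# A subset of the cleaning terms from the slide; the records are invented examples.
cleaning_terms = ["gene", "microtubule", "*cyto*", "*cell wall*", "lignocellulos*"]
records = [
    {"title": "Bacterial cellulose nanofibers for wound dressing"},
    {"title": "Microtubule organization and cell wall synthesis in Arabidopsis"},
    {"title": "General properties of cellulose nanocrystals"},
]
kept_titles = [rec["title"] for rec in clean_seed_set(records, cleaning_terms)]
print(kept_titles)
```

Matching "gene" as a whole word keeps a title that merely starts with "General", which illustrates why the treatment of truncation matters when cleaning terms are short.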
Half of the 22 terms we used to clean the nanocellulose search strategy did not affect the coverage of core-nanocellulose publications in the nuclei research areas, as depicted in Figure 4. Of the other half, no term reduced the coverage by more than 5%. The terms that influenced research area 13.6.3 the most were “*cell wall*” and “hemicellulose”, while “*cyto*”, “gene” and “*cell wall*” were the ones that most decreased core-nanocellulose coverage in cluster 13.6.11. Overall, research topic 13.6.11 had its core-nanocellulose publications reduced by 17.5%, while the decrease for cluster 13.6.3 was 10.2%. Nonetheless, both clusters still concentrated publications from the core-nanocellulose set after the cleaning tasks (the proportion was 74.0% for research area 13.6.3 and 72.1% for 13.6.11). Therefore, they kept the status of nuclei research areas.
An independence test was conducted to evaluate the effectiveness of the proposed procedure. The test involved retrieving the number of publications of the top five authors before and after cleaning and selecting the relevant research areas. The percentage decreases in their overall number of publications and in their main cluster were verified.
We introduce here the Pareto Principle (or 80/20 rule). This principle states that “roughly 80% of the effects come from 20% of the causes” (Juran & Godfrey, 1998) and is found in bibliometric and library studies (Gupta, 1989; Kao, 2009; Stephens, Hubbard, Pickett, & Kimball, 2013). We hypothesize that 80% of the core set will be assigned to 20% of the areas. To reach these relevant research areas, the steps below were carried out:
1) The research areas were listed in descending order of the total number of publications from the core-nanocellulose;
2) Research topics with one or two publications from the core-nanocellulose were excluded. This left 85 research areas;
3) The representativeness of each research area was calculated as the number of core-nanocellulose publications in that cluster divided by 2,200 (which is the total number of publications found in the 85 remaining research areas);
4) The cumulative percentage of publications from the core-nanocellulose was obtained by summing the values from the previous step, as can be seen from Figure 3. The research areas to be assessed were those needed for the cumulative percentage of publications to reach approximately 80%.
We found that twelve research areas covered the required 80%, which means 14.1% of the total of 85 research topics. We do not claim that our selection procedure was perfect, but a quick analysis of the chosen research topics showed themes currently found in the nanocellulose literature.
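The four selection steps can be sketched in a few lines. This is a minimal illustration with invented counts and a hypothetical function name; it also assumes that the cut-off is the first point at which the cumulative share reaches the target, whereas the notes only say "approximately 80%".

```python
def select_by_pareto(seed_counts, min_seeds=3, target=80.0):
    """Return (area, cumulative %) pairs for the top areas that reach the target share."""
    # Step 2: drop areas with one or two core publications.
    remaining = {area: n for area, n in seed_counts.items() if n >= min_seeds}
    total = sum(remaining.values())
    # Step 1: list the remaining areas in descending order of core publications.
    ranked = sorted(remaining.items(), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for area, n in ranked:
        if cumulative >= target:
            break
        cumulative += 100.0 * n / total  # Steps 3 and 4: share and running sum.
        selected.append((area, round(cumulative, 1)))
    return selected


# Invented counts of core publications per research area (not from the study).
toy_counts = {"a": 50, "b": 30, "c": 10, "d": 6, "e": 4, "f": 2, "g": 1}
selection = select_by_pareto(toy_counts)
print(selection)
```

In this toy case two of the five remaining areas already cover 80% of the core publications, which mirrors the concentration reported for the twelve selected research areas.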
According to Waltman and van Eck (2012), the smallest research areas contain 50 publications; consequently, clusters with a proportion of less than 1% were not taken into account.
Figure 6 presents a map with these research topics (nodes). These themes appear frequently in nanocellulose-focused studies. The map positions the topics on the basis of their citation relations. The closer two topics are, the more frequent the citation traffic between them. The node labels match the main content of the clusters. Moreover, all selected clusters had their sets of nanocellulose publications evaluated in the cleaning task.