1. Developing a web application for research: programming to find related PubMed articles
  • Philip Wolstenholme
2. Introduction to talk
  • Aims of my project
  • What work was done?
  • What was found?
  • How could the project be developed in the future?
  • Summary
  • Questions from the audience
3. The Project
  • A research tool to find similar articles
  • Help explore scientific literature
  • Simple
  • Accessed online
  • Only requires article title or DOI
  • DOI: Digital Object Identifier
4. Python
  • The programming language used for this project
  • Used to retrieve data, display results and analyse findings
  • Why Python: BioPython library of code; compatible with GAE
  • GAE: Google App Engine
  • Run programs online as ‘web applications’
5. Advantages over alternatives
  • Presenting related content not unique
  • Used on PubMed, ScienceDirect, Web of Science etc.
  • My application standalone
  • Works with content from any site, or from a PDF
  • Bookmarklet automatically detects DOIs from webpages
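
The DOI detection mentioned on this slide can be sketched in a few lines. A real bookmarklet would run as JavaScript in the browser, and the talk does not show its pattern, so the Python below is only an illustration of the idea. The regular expression is an assumed heuristic (the prefix "10.", a 4 to 9 digit registrant code, a slash, then a run of non-space characters) and `find_dois` is a hypothetical helper name.

```python
import re

# Assumed heuristic, not the talk's own pattern: "10." + registrant digits
# + "/" + any run of non-space characters. Intended for plain page text.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/\S+")


def find_dois(text):
    """Return the unique DOI-like strings in text, in order of appearance."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        # Sentence punctuation often trails a DOI in running text; drop it.
        doi = match.group(0).rstrip(".,;)\"'")
        if doi not in found:
            found.append(doi)
    return found
```

A pattern this loose will occasionally pick up strings that are not registered DOIs, so a production tool would probably confirm each candidate against a resolver before using it.
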
6. Choice of corpus journal
  • Searching every journal for related items ideal, but slow
  • Selected Marine Pollution Bulletin
  • Based on high impact factor
  • Availability of articles on PubMed
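
Once a journal is chosen, its records can be pulled from PubMed through Entrez. The sketch below uses Biopython's `Bio.Entrez` and `Bio.Medline` modules and keeps only title, abstract and DOI. It is not the project's code: the function names, the year range and the `max_records` limit are assumptions made for illustration, and the talk does not say which three years were downloaded.

```python
def build_query(journal, first_year, last_year):
    """PubMed search term limiting results to one journal and a year range."""
    return '"{0}"[Journal] AND {1}:{2}[dp]'.format(journal, first_year, last_year)


def fetch_records(query, email, max_records=1000):
    """Download matching records, keeping only title, abstract and DOI.

    Needs Biopython and network access; the import lives inside the
    function so that build_query can be used without either.
    """
    from Bio import Entrez, Medline

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.esearch(db="pubmed", term=query, retmax=max_records)
    ids = Entrez.read(handle)["IdList"]
    handle.close()
    if not ids:
        return []

    handle = Entrez.efetch(db="pubmed", id=",".join(ids),
                           rettype="medline", retmode="text")
    kept = []
    for record in Medline.parse(handle):
        # MEDLINE "AID" entries look like "10.1016/... [doi]" or "... [pii]".
        doi = next((aid.split(" ")[0] for aid in record.get("AID", [])
                    if aid.endswith("[doi]")), None)
        kept.append({"title": record.get("TI", ""),
                     "abstract": record.get("AB", ""),
                     "doi": doi})
    handle.close()
    return kept
```

A call such as `fetch_records(build_query("Mar Pollut Bull", 2008, 2010), "you@example.org")` would then return a list of small dictionaries; the years and email address here are placeholders.
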
7. Working with PubMed data
  • Download: three years’ worth of Mar Pol Bul downloaded
  • Process: downloaded data opened, only useful data kept
  • Matrix: large table made of words (tokens) and their frequencies
  • Shrink: matrix turned into an easy and quick format for Python to read
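
The Process, Matrix and Shrink steps can be illustrated with a small sketch. It assumes each article is a dictionary with `title` and `abstract` keys, uses a deliberately simple tokeniser (lower-cased runs of letters), and shrinks a row by keeping only its non-zero counts. The talk does not describe the actual tokeniser or file format, so all of these choices are assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_matrix(articles):
    """Return (vocabulary, rows); each row is a dense list of token counts."""
    counts = [Counter(tokenize(a["title"] + " " + a["abstract"]))
              for a in articles]
    # One column per distinct token found anywhere in the corpus.
    vocabulary = sorted(set().union(*counts))
    rows = [[c[token] for token in vocabulary] for c in counts]
    return vocabulary, rows


def shrink(row):
    """Keep only the non-zero counts, as {column_index: count}."""
    return {i: n for i, n in enumerate(row) if n}
```

Because most articles use only a small share of the corpus vocabulary, the shrunk rows are far smaller than the dense ones, which is the effect the Shrink step relies on.
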
8. Finding similarity
  • 10,821 columns
  • 859 rows of articles
  • Title and DOI
  • Token frequencies
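
Similarity between rows of token frequencies was measured with cosine similarity, using existing code that the talk does not name. The sketch below is a from-scratch version written only to show the calculation. It assumes each article row is stored as a `{column: count}` dictionary of its non-zero counts; `most_similar` and its `top_n` default are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse count vectors ({column: count})."""
    dot = sum(n * b[i] for i, n in a.items() if i in b)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty article shares nothing with any other
    return dot / (norm_a * norm_b)


def most_similar(index, rows, top_n=5):
    """Rank every other row by similarity to rows[index], best first."""
    scores = [(cosine_similarity(rows[index], row), j)
              for j, row in enumerate(rows) if j != index]
    scores.sort(reverse=True)
    return scores[:top_n]
```

For count vectors the score runs from 0 (no tokens in common) to 1 (identical proportions of every token), and only columns present in both rows contribute to the dot product, so sparse rows keep the work small.
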
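9. [Screenshot: the application’s main page]
10. [Screenshot: pasting in an article title or DOI and clicking submit]
11. [Screenshot: the list of related articles returned, with similarity scores]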
12. Mean ‘best match’ was 0.33
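
A figure like this can be computed from a table of pairwise scores. The sketch below assumes `similarity[i][j]` holds the score between articles i and j and that there are at least two articles; the function name is hypothetical and the code is not the analysis script used in the project.

```python
def mean_best_match(similarity):
    """Average, over all articles, of each article's highest score
    against any other article (its own diagonal entry is ignored)."""
    best = [max(score for j, score in enumerate(row) if j != i)
            for i, row in enumerate(similarity)]
    return sum(best) / len(best)
```
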
13. Were my results of good quality?
  • Benchmarked against PubMed
  • 46% similarity between results
  • For 19% of articles similarity ≥ 70%
  • For some articles PubMed returned zero related results
  • Our app returned results scored at 0.25
  • Results of comparable quality
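
The slide does not define how similarity between the two sets of recommendations was measured. One plausible reading, used in the sketch below, is the fraction of PubMed's related articles that the application also returned. The function names, the 0.7 threshold parameter and the handling of empty PubMed lists are all assumptions for illustration.

```python
def overlap(ours, pubmed):
    """Fraction of PubMed's recommendations that also appear in ours.

    Returns None when PubMed recommended nothing, so those articles
    can be counted separately instead of being scored as zero.
    """
    if not pubmed:
        return None
    return len(set(ours) & set(pubmed)) / len(set(pubmed))


def summarise(pairs, threshold=0.7):
    """Summarise overlap across (ours, pubmed) recommendation pairs."""
    scores = [overlap(ours, pubmed) for ours, pubmed in pairs]
    scored = [s for s in scores if s is not None]
    if not scored:
        return {"mean_overlap": None, "share_at_threshold": None,
                "pubmed_empty": len(scores)}
    return {
        "mean_overlap": sum(scored) / len(scored),
        "share_at_threshold": sum(s >= threshold for s in scored) / len(scored),
        "pubmed_empty": scores.count(None),
    }
```

Keeping the articles with no PubMed recommendations out of the average avoids penalising the application for cases where there is nothing to compare against.
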
14. Future work
  • Application good proof of concept
  • Limited dataset: one journal, three years
  • Opportunities to adapt the application
  • E.g. subscription service, mobile version
15. Summary
  • Aimed to create simple, easy to use, functional application
  • Completed application and carried out analysis of results
  • Results of a good quality
  • Aims of project achieved
16. Any questions?


Editor's Notes

  1. Prompts (not script!): Introduction, marine biologist but chose a project supervised by Dr Alan Boyd, which is why I’m presenting to this audience.
  2. Quick introduction to what I’ll be covering
  3. Project aimed to provide a useful way for someone with one journal article of interest to find related/similar articles. Usage scenario envisioned: literature review or researching for an essay. Could take a ‘classic paper’ or a piece of recommended reading from VITAL and use my app to quickly find a list of related papers. Wanted to make it as easy as possible to use, so accessed through the browser, no downloads, and all the user needs to enter is a title or a DOI. DOI, in case you’re not familiar: acronym, a unique ID code given to journal articles. Example on the right, the DOI is the highlighted code, usually found at the top or bottom of the first page of a journal article... and for people who aren’t comfortable with these initially odd-looking strings of text, the user can also just paste in a title and the app will convert it to a DOI.
  4. Did all work through a language called Python. It was used to download and process the data from PubMed that made up the application’s corpus, to display the results of a search to the user, and to analyse the results returned by my project for my report. Python’s got numerous strengths. It’s easy to read and has a gentle learning curve, which means it’s easy to get started with. I’d never used it before my project but now I feel comfortable using it. But the main advantages are BioPython, a collection of open source code that anyone can benefit from, and compatibility with GAE. GAE: Google App Engine, a platform for developing web applications. Web applications are a bit of a buzzword at the moment, but described in more useful terms, they are a way of running programs online rather than through a downloaded file. So when you check your Hotmail, Gmail or uni webmail using your browser rather than a downloaded email application like Outlook, you’re using a web application.
  5. By now, probably thinking that presenting related content is not unique. Most if not all scientific literature aggregators do this. The advantage of my application, I’d like to think, is that it’s standalone: it can be used with any content from any website or a PDF, so it’s not tied to one publisher’s website or to material available online. Also, bookmarklet: a bit of code that lives inside a user’s browser. It automatically recognises DOIs within a page, so with one click the user can forward these to my application.
  6. So our application aimed to look for relationships between the text content of articles, and to find those we needed a corpus, or collection of scientific literature. Ideally the corpus would have encompassed a very wide range of journals, but that would be slow and beyond the knowledge and processing power available for an honours project. Selected a marine biology journal by ranking all the marine journals available on PubMed by their impact factor and then removing journals that were too specific or too infrequently published. Impact factor: a criterion for establishing the rank or importance of a journal by observing how often other papers cite the papers within it. Bit of controversy, but generally accepted as a measure of a journal’s significance/standing. Result was a choice of the journal Marine Pollution Bulletin.
  7. Corpus was created by downloading through PubMed’s API, Entrez. Working with the PubMed data to assemble our corpus was probably the hardest part of this project, as it was the first big chunk that involved areas of programming that I’d never been involved in before. First of all, three years’ worth of data from Marine Pollution Bulletin was downloaded. This gave us a HUGE dump of data from PubMed. It had everything in it: three years’ worth of authors’ names, places of publishing, dates of publishing, dates of being added to PubMed, PubMed IDs, DOIs, all sorts of other information. It was too much information, and a lot of it would have served no use for our project. So, the next step was to process it. All this extra information was discarded, and we only kept the titles, abstracts and DOIs. Then, the titles and abstracts were separated word by word, and a big list of each word in the corpus was compiled. Then, for each word, the number of times that it appears in each article was calculated, giving us an idea of the content and main themes of each article. This information was stored in the matrix, a big table that we’ll have a look at in the next slide. Finally, the matrix had to be shrunk somehow; it was just too big a file for Google App Engine to be able to read.
  8. ...this is a screenshot of just a really small portion of the application’s dataset, opened in Excel. It might not be too easy to read but hopefully you can get some idea of the structure and scale. Down the rows we have one article per row, with the first two columns of each row holding an article’s title and DOI. The rest of the 10,821 columns contain a word count for each token/word. The matrix contained details of every single word in 859 articles, so ended up with over 9 million of these counts. You might also be able to see in this screenshot that most of these counts are 0; in fact, 99% of the values in the matrix were zero. This meant that these values could be stripped out, leaving a file that was 95% smaller and much easier to deal with. The second step was to work out the relationships between all these word counts, and for that we used something called cosine similarity. This is just a method that worked through each row and used existing code to determine which rows share the most in common and can be deemed related.
  9. So, to put it all in context, here’s a quick example. This is the main page, which can be accessed by anyone right now at honourspw.appspot.com.
  10. All the user has to do is paste in the title or DOI of a Marine Pollution Bulletin article from the last three years... Click submit...
  11. And they get a list of sixteen related articles dealing with metal pollution at sea. For each one of these results that the application returns, there’s also a similarity score available. This just gives a score ranging from 0 to 1, where 0 represents nothing in common and 1 represents an exact copy. We wondered what the level of match (or quality) of our results was.
  12. So, I wrote some code that records the score of the best match for all 859 articles. We found our mean best match was 0.33. This, to me, sounded quite low. We obviously weren’t expecting all our values to be in the 0.8/0.9 or above range, because that would require a very homogeneous set of data. But this score got us thinking about the quality of our results.
  13. So, to get an idea of quality, I benchmarked against PubMed: I compared the results our application was recommending for an article to the results that PubMed was recommending. We found a 46% similarity between the results, and for almost 20% of the titles in the corpus, at least 70% of what PubMed recommended we did too. Also, PubMed failed to return recommendations for some papers, whereas our application returned results of a quality not too far from our overall mean. I deem that to be a comparable level of quality.