Developing a web application for research: programming to find related PubMed articles.

My final year undergraduate presentation to explain my honours project. My project can be found online at http://www.honourspw.appspot.com/

1. Developing a web application for research: programming to find related PubMed articles. Philip Wolstenholme

2. Introduction to talk: aims of my project; what work was done; what was found; how the project could be developed in the future; summary; questions from the audience.

3. The Project: a research tool to find similar articles; helps explore scientific literature; simple; accessed online; only requires an article title or DOI (DOI: Digital Object Identifier).

4. Python: the programming language used for this project; used to retrieve data, display results and analyse findings. Why Python: BioPython library of code; compatible with GAE (Google App Engine), which runs programs online as 'web applications'.

5. Advantages over alternatives: presenting related content is not unique (used on PubMed, ScienceDirect, Web of Science etc.); my application is standalone; works with content from any site, or from a PDF; bookmarklet automatically detects DOIs from webpages.

6. Choice of corpus journal: searching every journal for related items would be ideal, but slow; selected Marine Pollution Bulletin, based on high impact factor and availability of articles on PubMed.

7. Working with PubMed data. Download: three years' worth of Marine Pollution Bulletin downloaded. Process: downloaded data opened, only useful data kept; large table made of words (tokens) and their frequencies. Shrink matrix: matrix turned into an easy and quick format for Python to read.

8. Finding similarity: 859 rows of articles; 10,821 columns; title and DOI; token frequencies.

9. [Screenshot; no slide text]

10. [Screenshot; no slide text]

11. [Screenshot; no slide text]

12. Mean 'best match' was 0.33

13. Were my results of good quality? Benchmarked against PubMed: 46% similarity between results; for 19% of articles, similarity ≥ 70%; for some articles PubMed returned zero related results, while our app returned results scored at 0.25; results of comparable quality.

14. Future work: application is a good proof of concept; limited dataset (one journal, three years); opportunities to adapt the application, e.g. subscription service, mobile version.

15. Summary: aimed to create a simple, easy to use, functional application; completed the application and carried out analysis of results; results of a good quality; aims of project achieved.

16. Any questions?

Editor's Notes

Prompts (not script!): Introduction. Marine biologist, but chose a project supervised by Dr Alan Boyd, which is why I'm presenting to this audience.
Quick introduction to what I’ll be covering
Project aimed to provide a useful way for someone with one journal article of interest to find related/similar articles. Usage scenario envisioned: literature review or researching for an essay. Could take a 'classic paper' or a piece of recommended reading from VITAL and use my app to quickly find a list of related papers. Wanted to make it as easy as possible to use, so it's accessed through the browser, with no downloads, and all the user needs to enter is a title or a DOI. DOI, in case you're not familiar, is an acronym for Digital Object Identifier, a unique ID code given to journal articles. Example on the right: the DOI is the highlighted code, usually found at the top or bottom of the first page of a journal article... and for people who aren't comfortable with these initially odd-looking strings of text, the user can also just paste in a title and the app will convert it to a DOI.
Did all work through a language called Python. It was used to download and process the data from PubMed that made up the application's corpus, to display the results of a search to the user, and to analyse the results returned by my project for my report. Python's got numerous strengths. It's easy to read and has a gentle learning curve, which means it's easy to get started with; I'd never used it before my project but now I feel comfortable using it. But the main advantages are BioPython, a collection of open source code that anyone can benefit from, and compatibility with GAE. GAE: Google App Engine, a platform for developing web applications. Web applications are a bit of a buzzword at the moment, but described in more useful terms, they're a way of running programs online rather than through a downloaded file. So when you check your Hotmail, Gmail or uni webmail using your browser rather than a downloaded email application like Outlook, you're using a web application.
By now, you're probably thinking that presenting related content is not unique. Most if not all scientific literature aggregators do this. The advantage of my application, I'd like to think, is that it's standalone: it can be used with any content from any website or a PDF, and it's not tied to one publisher's website or to material available online. Also, the bookmarklet, a bit of code that lives inside a user's browser, automatically recognises DOIs within a page, so with one click the user can forward these to my application.
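To illustrate the DOI-recognition idea, here is a minimal Python sketch. The bookmarklet itself would run in the browser and its code is not shown in the talk, so this is not the project's implementation. The regular expression is an assumed, commonly used heuristic (the `10.` prefix, a 4 to 9 digit registrant code, a slash, then a suffix), and `find_dois`, `DOI_PATTERN` and the sample DOI are made up for illustration.

```python
import re

# Assumed heuristic: "10." + 4-9 digit registrant code + "/" + suffix.
# Not taken from the project; real DOIs can be more varied than this.
DOI_PATTERN = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

def find_dois(text):
    """Return the unique DOI-like strings in text, in order of appearance."""
    found = []
    for match in DOI_PATTERN.finditer(text):
        doi = match.group(0).rstrip(".,;)")  # drop trailing punctuation
        if doi not in found:
            found.append(doi)
    return found

# Made-up page text containing the same (hypothetical) DOI twice.
page_text = "See doi:10.1000/example.123, and again at https://doi.org/10.1000/example.123."
print(find_dois(page_text))
```

A pattern like this will occasionally miss unusual DOIs or pick up stray characters, so a real tool would probably confirm each candidate against a lookup service before using it.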
So our application aimed to look for relationships between the text content of articles, and to find those we needed a corpus, or collection of scientific literature. Ideally the corpus would have encompassed a very wide range of journals, but that would be slow and beyond the realms of knowledge and processing power available for an honours project. Selected a marine biology journal by ranking all the marine journals available on PubMed by their impact factor and then removing too specific or too infrequently published journals. Impact factor: a criterion for establishing the rank or importance of a journal by observing how often other papers cite the papers within it. Bit of controversy, but generally accepted as a measure of a journal's significance/standing. Result was a choice of the journal Marine Pollution Bulletin.
Corpus was created by downloading through PubMed's API, Entrez. Working with the PubMed data to assemble our corpus was probably the hardest part of this project, as it was the first big chunk that involved areas of programming that I'd never been involved in before. First of all, three years' worth of data from Marine Pollution Bulletin was downloaded. This gave us a HUGE dump of data from PubMed. It had everything in it: three years' worth of authors' names, places of publishing, dates of publishing, dates of being added to PubMed, PubMed IDs, DOIs, all sorts of other information. It was too much information, and a lot of it would have served no use for our project. So, the next step was to process it. All this extra information was discarded, and we only kept the titles, abstracts and DOIs. Then, the titles and abstracts were separated word by word, and a big list of each word in the corpus was compiled. Then, for each word, the number of times that it appears in each article was calculated, giving us an idea of the content and main themes of each article. This information was stored in the matrix, a big table that we'll have a look at in the next slide. Finally, the matrix had to be shrunk somehow; it was just too big a file for Google App Engine to be able to read.
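The processing step described here (split titles and abstracts into words, then count each word per article) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tokenisation rule (lower-cased runs of letters), the record layout, and the names `tokenize` and `build_matrix` are hypothetical, and the two sample records are made up rather than taken from PubMed.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_matrix(articles):
    """Return (vocabulary, rows); rows[i][j] counts vocabulary[j] in article i."""
    counts = [Counter(tokenize(a["title"] + " " + a["abstract"])) for a in articles]
    vocabulary = sorted(set().union(*counts))  # every distinct token in the corpus
    rows = [[c[token] for token in vocabulary] for c in counts]
    return vocabulary, rows

# Made-up records standing in for the titles, abstracts and DOIs that were kept.
articles = [
    {"doi": "10.1000/a", "title": "Copper in sediments",
     "abstract": "Copper levels in harbour sediments."},
    {"doi": "10.1000/b", "title": "Oil spill response",
     "abstract": "Response to an oil spill."},
]
vocabulary, rows = build_matrix(articles)
print(vocabulary)
print(rows)
```

Each row corresponds to one article and each column to one token, which is the same layout as the matrix in the talk, only far smaller.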
...this is a screenshot of just a really small portion of the application's dataset, opened in Excel. It might not be too easy to read, but hopefully you can get some idea of the structure and scale. Down the rows we have one article per row, with the first two columns of each row holding an article's title and DOI. The rest of the 10,821 columns contain a word count for each token/word. The matrix contained details of every single word in 859 articles, so ended up with over 9 million of these counts. You might also be able to see in this screenshot that most of these counts are 0; in fact, 99% of the values in the matrix were zero. This meant that these values could be stripped out, leaving a file that was 95% smaller and much easier to deal with. The second step was to work out the relationships between all these word counts, and for that we used something called cosine similarity. This is just a method that worked through each row and used existing code to determine which rows share the most in common, and can be deemed related.
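Both ideas in this note, stripping out the zeros and scoring rows by cosine similarity, can be shown together in a short Python sketch. The talk says existing code was used for the similarity step, so the hand-written functions below are stand-ins rather than the project's code. The formula is the standard one, similarity(a, b) = (a · b) / (|a| |b|), and the names `to_sparse`, `cosine_similarity` and `most_similar`, along with the tiny matrix, are hypothetical.

```python
import math

def to_sparse(row):
    """Keep only the non-zero counts, keyed by column index."""
    return {j: count for j, count in enumerate(row) if count}

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (0 to 1 for counts)."""
    dot = sum(count * b[j] for j, count in a.items() if j in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty article has nothing in common with anything
    return dot / (norm_a * norm_b)

def most_similar(index, sparse_rows, top_n=3):
    """Rank every other row by similarity to sparse_rows[index], best first."""
    scores = [(cosine_similarity(sparse_rows[index], row), i)
              for i, row in enumerate(sparse_rows) if i != index]
    return sorted(scores, reverse=True)[:top_n]

# Made-up token counts: three articles, four tokens.
dense_rows = [
    [2, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 3, 0, 0],
]
sparse_rows = [to_sparse(r) for r in dense_rows]
print(most_similar(0, sparse_rows))
```

Because only non-zero counts are stored, the dot product touches just the tokens an article actually contains, which is why a matrix that is almost entirely zeros shrinks so much in this form.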
So, to put it all in context, here’s a quick example. This is the main page, which can be accessed by anyone right now at honourspw.appspot.com.
All the user has to do is paste in the title or DOI of a Marine Pollution Bulletin article from the last three years... Click submit...
And they get a list of sixteen related articles dealing with metal pollution at sea. For each one of these results that the application returns, there's also a similarity score available. This just gives a score ranging from 0 to 1, where 0 represents nothing in common and 1 represents an exact copy. We wondered what the level of match (or quality) of our results was.
So, I wrote some code that records the score of the best match for all 859 articles. We found our mean best match was 0.33. This, to me, sounded quite low. We obviously weren't expecting all our values to be in the 0.8/0.9 or above range, because that would require a very homogeneous set of data. But this score got us thinking about the quality of our results.
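The 'mean best match' figure can be illustrated with a small Python sketch: for each article take its highest similarity to any other article, then average those values. The function name `mean_best_match` and the 3 × 3 score table are made up for illustration and are not the project's code or data.

```python
def mean_best_match(similarity):
    """similarity[i][j] is the score between articles i and j (square table)."""
    best = []
    for i, row in enumerate(similarity):
        others = [score for j, score in enumerate(row) if j != i]  # skip self-match
        best.append(max(others))
    return sum(best) / len(best)

# Made-up scores for three articles; the diagonal is each article against itself.
similarity = [
    [1.0, 0.5, 0.1],
    [0.5, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
print(mean_best_match(similarity))
```

The self-match on the diagonal has to be excluded, otherwise every article's best score would be 1 and the mean would say nothing about the corpus.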
So, to benchmark against PubMed, I compared the results our application was recommending for an article with the results that PubMed was recommending. We found a 46% similarity between the results, and for almost 20% of the titles in the corpus, we recommended at least 70% of what PubMed recommended. Also, PubMed failed to return recommendations for some papers, whereas our application returned results of a quality not too far from our overall mean. I deem that to be a comparable level of quality.
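One plausible way to compute this kind of comparison is sketched below in Python. The exact measure used in the project is not stated, so this assumes 'similarity between results' means the share of PubMed's recommendations that the application also returned; `overlap_percent` and the sample identifiers are hypothetical.

```python
def overlap_percent(pubmed_results, app_results):
    """Percentage of PubMed's recommendations that the app also returned."""
    pubmed = set(pubmed_results)
    if not pubmed:
        return None  # PubMed returned nothing, so there is nothing to compare
    return 100 * len(pubmed & set(app_results)) / len(pubmed)

# Made-up article identifiers for a single query article.
pubmed_results = ["a", "b", "c", "d"]
app_results = ["b", "d", "e"]
print(overlap_percent(pubmed_results, app_results))
```

Returning `None` when PubMed has no recommendations keeps those articles out of any average, which matters because the talk notes that PubMed returned nothing for some papers.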
