Building A Web Application To Monitor PubMed Retraction Notices


Monitoring PubMed retraction notices using Ruby, MongoDB, Sinatra and Heroku. Talk given to internal CSIRO Bioinformatics User Group, December 1 2011.


  1. 1. Building a Web Application to Monitor PubMed Retraction Notices Neil Saunders CSIRO Mathematics, Informatics and Statistics Building E6B, Macquarie University Campus North Ryde December 1, 2011
  2. 2. Retraction Watch
  3. 3. Project Aims
       • Monitor PubMed for retractions
       • Retrieve retraction data and store locally for analysis
       • Develop web application to display retraction data
  4. 4. PubMed - advanced search, RSS and send-to-file
  5. 5. Updates in Google Reader
  6. 6. PubMed - MeSH
  7. 7. PubMed - EUtils http://www.ncbi.nlm.nih.gov/books/NBK25501/
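EUtils requests are ordinary HTTP GET URLs, so a query can be assembled with the Ruby standard library alone. The following is a minimal sketch, assuming the ESearch endpoint (esearch.fcgi) on the same host used elsewhere in this talk; the helper name esearch_url and the retmax default are illustrative and not part of the original slides.

```ruby
require "uri"

# ESearch endpoint on the EUtils host (assumed; the slides only show einfo.fcgi).
ESEARCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Build an ESearch URL for a PubMed query term.
# esearch_url is a hypothetical helper; retmax is an illustrative default.
def esearch_url(term, retmax = 20)
  query = URI.encode_www_form("db" => "pubmed", "term" => term, "retmax" => retmax)
  "#{ESEARCH_URL}?#{query}"
end

# Retraction notices with a 2011 publication date
puts esearch_url("Retraction of Publication[ptyp] 2011[dp]")
```

The search term reuses the publication-type and date tags that appear later in the ecount script; no request is sent here, the sketch only builds the URL.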
  8. 8. EInfo example script
       #!/usr/bin/ruby
       require 'rubygems'
       require 'bio'
       require 'hpricot'
       require 'open-uri'

       Bio::NCBI.default_email = "me@me.com"
       ncbi = Bio::NCBI::REST.new
       url  = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db="

       # Write one comma-separated file of searchable fields per Entrez database
       ncbi.einfo.each do |db|
         puts "Processing #{db}..."
         File.open("#{db}.txt", "w") do |f|
           doc = Hpricot(open("#{url + db}"))
           (doc/'//fieldlist/field').each do |field|
             name        = (field/'/name').inner_html
             fullname    = (field/'/fullname').inner_html
             description = (field/'description').inner_html
             f.write("#{name},#{fullname},#{description}\n")
           end
         end
       end
  9. 9. EInfo script - output
       ALL,All Fields,All terms from all searchable fields
       UID,UID,Unique number assigned to publication
       FILT,Filter,Limits the records
       TITL,Title,Words in title of publication
       WORD,Text Word,Free text associated with publication
       MESH,MeSH Terms,Medical Subject Headings assigned to publication
       MAJR,MeSH Major Topic,MeSH terms of major importance to publication
       AUTH,Author,Author(s) of publication
       JOUR,Journal,Journal abbreviation of publication
       AFFL,Affiliation,Author's institutional affiliation and address
       ...
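Each output line has three comma-separated parts: field name, full name and description. A minimal sketch of reading such lines back into Ruby hashes follows; parse_field is a hypothetical helper, and it assumes that only the description (the last part) may itself contain commas.

```ruby
# Parse one line of the EInfo script output into a hash.
# The split limit of 3 keeps any commas inside the description intact.
# Assumption: name and full name never contain commas.
def parse_field(line)
  name, fullname, description = line.chomp.split(",", 3)
  { "name" => name, "fullname" => fullname, "description" => description }
end

sample = [
  "TITL,Title,Words in title of publication",
  "AFFL,Affiliation,Author's institutional affiliation and address"
]
sample.each { |line| p parse_field(line) }
```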
  10. 10. MongoDB Overview
       • MongoDB is a so-called "NoSQL" database
       • Key features:
           • Document-oriented
           • Schema-free
           • Documents stored in collections
       • http://www.mongodb.org/
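To make "document-oriented" and "schema-free" concrete, the sketch below uses plain Ruby hashes as stand-ins for documents and an array as a stand-in for a collection. No database is involved; find_by is a hypothetical helper, not a MongoDB driver call.

```ruby
# In-memory illustration only: hashes play the role of documents,
# an array plays the role of a collection.
def find_by(collection, key, value)
  collection.select { |doc| doc[key] == value }
end

# Documents in one collection need not share the same fields (schema-free).
collection = [
  { "_id" => 1977, "total" => 260517, "retracted" => 3 },
  { "_id" => "22106469", "PubmedData" => { "PublicationStatus" => "ppublish" } }
]

p find_by(collection, "_id", 1977)
```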
  11. 11. Saving to a database collection: ecount
       #!/usr/bin/ruby
       require "rubygems"
       require "bio"
       require "mongo"

       db  = Mongo::Connection.new.db('pubmed')
       col = db.collection('ecount')

       Bio::NCBI.default_email = "me@me.com"
       ncbi = Bio::NCBI::REST.new

       # For each year: count all PubMed records and all retraction notices
       1977.upto(Time.now.year) do |year|
         all  = ncbi.esearch_count("#{year}[dp]", {"db" => "pubmed"})
         term = ncbi.esearch_count("Retraction of Publication[ptyp] #{year}[dp]",
                                   {"db" => "pubmed"})
         record = {"_id" => year, "year" => year, "total" => all,
                   "retracted" => term, "updated_at" => Time.now}
         col.save(record)
         puts "#{year}..."
       end

       puts "Saved #{col.count} records."
  12. 12. ecount collection
       > db.ecount.findOne()
       {
         "_id" : 1977,
         "retracted" : 3,
         "updated_at" : ISODate("2011-11-15T03:58:10.729Z"),
         "total" : 260517,
         "year" : 1977
       }
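Storing both the total and the retracted count per year makes it possible to normalise retractions by publication volume. A minimal sketch, using the 1977 record shown above as a plain hash; the helper name and the per-100,000 scaling are illustrative choices, not taken from the application code.

```ruby
# Retraction notices per 100,000 PubMed records for one ecount document.
# retractions_per_100k is a hypothetical helper.
def retractions_per_100k(record)
  return 0.0 if record["total"].to_i.zero?
  record["retracted"].to_f / record["total"] * 100_000
end

record = { "_id" => 1977, "year" => 1977, "total" => 260517, "retracted" => 3 }
puts format("%d: %.2f retraction notices per 100,000 records",
            record["year"], retractions_per_100k(record))
```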
  13. 13. Saving to a database collection: entries
       #!/usr/bin/ruby
       require "rubygems"
       require "mongo"
       require "crack"

       db  = Mongo::Connection.new.db("pubmed")
       col = db.collection('entries')
       col.drop  # rebuild the collection from scratch on every run

       xmlfile = "#{ENV['HOME']}/Dropbox/projects/pubmed/retractions/data/retract.xml"
       xml     = Crack::XML.parse(File.read(xmlfile))

       # Store each article as one document, keyed by its PMID
       xml['PubmedArticleSet']['PubmedArticle'].each do |article|
         article['_id'] = article['MedlineCitation']['PMID']
         col.save(article)
       end

       puts "Saved #{col.count} articles."
  14. 14. entries collection
       {
         "_id" : "22106469",
         "PubmedData" : {
           "PublicationStatus" : "ppublish",
           "ArticleIdList" : { "ArticleId" : "22106469" },
           "History" : {
             "PubMedPubDate" : [
               { "Minute" : "0", "Month" : "11", "PubStatus" : "entrez",
                 "Day" : "23", "Hour" : "6", "Year" : "2011" },
               { "Minute" : "0", "Month" : "11", "PubStatus" : "pubmed",
                 "Day" : "23", "Hour" : "6", "Year" : "2011" },
               ...
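Because each article is stored as a nested document, fields are reached by walking the hash. The sketch below pulls the "pubmed" history date out of a document shaped like the one above. pubmed_date is a hypothetical helper; it assumes Ruby 2.3 or later (for Hash#dig) and allows for an XML-to-hash parser returning a single hash rather than an array when there is only one PubMedPubDate element.

```ruby
require "date"

# Return the date on which the article entered PubMed, or nil if absent.
def pubmed_date(article)
  history = article.dig("PubmedData", "History", "PubMedPubDate")
  # A single child element may be parsed as a Hash instead of an Array.
  entries = history.is_a?(Array) ? history : [history].compact
  entry = entries.find { |d| d["PubStatus"] == "pubmed" }
  entry && Date.new(entry["Year"].to_i, entry["Month"].to_i, entry["Day"].to_i)
end

article = {
  "_id" => "22106469",
  "PubmedData" => {
    "History" => {
      "PubMedPubDate" => [
        { "Month" => "11", "PubStatus" => "entrez", "Day" => "23", "Year" => "2011" },
        { "Month" => "11", "PubStatus" => "pubmed", "Day" => "23", "Year" => "2011" }
      ]
    }
  }
}
puts pubmed_date(article)
```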
  15. 15. Saving to a database collection: timeline
       #!/usr/bin/ruby
       require "rubygems"
       require "mongo"
       require "date"

       db       = Mongo::Connection.new.db('pubmed')
       entries  = db.collection('entries')
       timeline = db.collection('timeline')

       # Creation date of every retraction notice, sorted
       dates = entries.find.map { |entry| entry['MedlineCitation']['DateCreated'] }
       dates.map! { |d| Date.parse("#{d['Year']}-#{d['Month']}-#{d['Day']}") }
       dates.sort!

       # One entry per day from first to last date, initialised to zero
       data = (dates.first..dates.last).inject(Hash.new(0)) { |h, date| h[date] = 0; h }
       dates.each { |date| data[date] += 1 }
       data = data.sort

       # JavaScript months are zero-based, hence month - 1
       data.map! { |e| ["Date.UTC(#{e[0].year},#{e[0].month - 1},#{e[0].day})", e[1]] }

       data.each do |date|
         timeline.save({"_id" => date[0].gsub(".", "_"), "date" => date[0], "count" => date[1]})
       end

       puts "Saved #{timeline.count} dates in timeline."
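The core of the timeline script is a per-day count that also includes days with no retractions, so that the chart has a continuous x-axis. That step can be sketched without MongoDB as follows; daily_counts is a hypothetical helper operating on an in-memory array of dates.

```ruby
require "date"

# Count occurrences per day, with explicit zeros for days that have none.
def daily_counts(dates)
  return {} if dates.empty?
  first, last = dates.minmax
  counts = {}
  (first..last).each { |day| counts[day] = 0 }  # zero-fill every day in range
  dates.each { |day| counts[day] += 1 }
  counts
end

dates = [Date.new(2011, 11, 21), Date.new(2011, 11, 23), Date.new(2011, 11, 23)]
daily_counts(dates).each { |day, n| puts "#{day}: #{n}" }
```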
  16. 16. timeline collection
       > db.timeline.findOne()
       {
         "_id" : "Date_UTC(1977,7,12)",
         "date" : "Date.UTC(1977,7,12)",
         "count" : 1
       }
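The stored date is a JavaScript expression, so its month is zero-based: Date.UTC(1977,7,12) is 12 August 1977, not 12 July. A minimal sketch of converting such a string back to a Ruby Date, for example when analysing the collection outside the browser; parse_date_utc is a hypothetical helper.

```ruby
require "date"

# Convert a stored "Date.UTC(y,m,d)" string to a Ruby Date.
# JavaScript months run 0-11, so 1 is added to the month.
def parse_date_utc(str)
  match = /\ADate\.UTC\((\d+),(\d+),(\d+)\)\z/.match(str)
  raise ArgumentError, "unexpected format: #{str}" unless match
  year, month, day = match.captures.map(&:to_i)
  Date.new(year, month + 1, day)
end

puts parse_date_utc("Date.UTC(1977,7,12)")  # prints 1977-08-12
```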
  17. 17. Sinatra: minimal example
       require "rubygems"
       require "sinatra"

       get "/" do
         "Hello World"
       end

       # ruby myapp.rb
       # http://localhost:4567
  18. 18. Highcharts: minimal example code
       var chart = new Highcharts.Chart({
         chart: {
           renderTo: 'container',
           defaultSeriesType: 'line'
         },
         xAxis: {
           categories: ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                        'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
         },
         series: [{
           data: [29.9, 71.5, 106.4, 129.2, 144.0, 176.0,
                  135.6, 148.5, 216.4, 194.1, 95.6, 54.4]
         }]
       });

       // <div id="container" style="height: 400px"></div>
  19. 19. Highcharts: minimal example result
  20. 20. Web Application Overview
       |---config.ru
       |---Gemfile
       |---main.rb
       |---public
       |   |---javascripts
       |   |   |---dark-blue.js
       |   |   |---dark-green.js
       |   |   |---exporting.js
       |   |   |---gray.js
       |   |   |---grid.js
       |   |   |---highcharts.js
       |   |   |---jquery-1.4.2.min.js
       |   |---stylesheets
       |       |---main.css
       |---Rakefile
       |---statistics.rb
       |---views
           |---about.haml
           |---byyear.haml
           |---date.haml
           |---error.haml
           |---index.haml
           |---journal.haml
           |---journals.haml
           |---layout.haml
           |---test.haml
           |---total.haml
  21. 21. Sinatra Application Code - main.rb
       # main.rb
       configure do
         # a bunch of config stuff goes here
         # DB = connection to MongoDB database

         # timeline
         timeline = DB.collection('timeline')
         set :data, timeline.find.to_a.map { |e| [e['date'], e['count']] }
       end

       # views
       get "/" do
         haml :index
       end
  22. 22. Sinatra Views - index.haml
       %h3 PubMed Retraction Notices - Timeline
       %p Last update: #{options.updated_at}
       %div#container(style="margin-left: auto; margin-right: auto; width: 800px;")
       :javascript
         $(function () {
           new Highcharts.Chart({
             chart: {
               renderTo: 'container',
               defaultSeriesType: 'area',
               width: 800,
               height: 600,
               zoomType: 'x',
               marginTop: 80
             },
             legend: { enabled: false },
             title: { text: 'Retractions by date' },
             xAxis: { type: 'datetime' },
             yAxis: { title: { text: 'Retractions' } },
             series: [{
               data: #{options.data.inspect.gsub(/"/,"")}
             }],
             // more stuff goes here...
           });
         });
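The series data line relies on a small trick: Array#inspect produces text that is almost a JavaScript array literal, and removing the double quotes turns each stored "Date.UTC(...)" string into a live JavaScript expression. A minimal sketch of that step in isolation; to_highcharts_series is a hypothetical wrapper around the same inspect and gsub calls used in the view.

```ruby
# Turn [["Date.UTC(y,m,d)", count], ...] into text usable as a JavaScript array.
# Stripping the quotes makes Date.UTC(...) a function call rather than a string.
def to_highcharts_series(data)
  data.inspect.gsub(/"/, "")
end

data = [["Date.UTC(1977,7,12)", 1], ["Date.UTC(1977,7,13)", 0]]
puts to_highcharts_series(data)
# prints [[Date.UTC(1977,7,12), 1], [Date.UTC(1977,7,13), 0]]
```

This only works because the data contain no other strings; any quoted text value in the array would lose its quotes too.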
  23. 23. Deployment: Heroku + MongoHQ
       • Heroku.com - free application hosting (for small apps)
       • Almost as simple as:
           $ git remote add heroku git@heroku.com:appname.git
           $ git push heroku master
       • MongoHQ.com - free MongoDB database hosting (up to 16 MB)
  24. 24. "Final" product
       • Application - http://pmretract.heroku.com
       • Code - http://github.com/neilfws/PubMed
