NCBI API – Integration into
analysis code
QBRC Tech Talk
Jiwoong Kim
QBRC Tech Talk on April 1st, 2014

Published in: Science
Transcript of "NCBI API - Integration into analysis code"

  1. NCBI API – Integration into analysis code
     QBRC Tech Talk
     Jiwoong Kim
  2. Outline
     • Introduction
     • Usage Guidelines of the E-utilities
     • Sample Applications of the E-utilities
  3. NCBI & Entrez
     • The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information.
     • Entrez is NCBI’s primary text search and retrieval system that integrates the PubMed database of biomedical literature with 39 other literature and molecular databases including DNA and protein sequence, structure, gene, genome, genetic variation and gene expression.
  4. E-utilities
     • Entrez Programming Utilities
       – The Entrez Programming Utilities (E-utilities) are a set of eight server-side programs that provide a stable interface into the Entrez query and database system at the NCBI.
       – The E-utilities use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data.
       – Diagram: URL (input) → E-utilities → XML, FASTA, Text … (output)
  5. Usage Guidelines and Requirements
     • Use the E-utility URL
       – baseURL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ …
       – Python urllib/urlopen, Perl LWP::Simple, Linux wget, …
     • Frequency, Timing and Registration of E-utility URL Requests
       – Make no more than 3 requests per second → e.g. sleep(0.5) between requests
       – Run large jobs on weekends or between 9 PM and 5 AM Eastern time on weekdays
       – Include &tool and &email in all requests
     • Minimizing the Number of Requests
       – Retrieve many records per request, e.g. &retmax=500
     • Handling Special Characters Within URLs
       – Space → +, " → %22, # → %23
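The guidelines on slide 5 could be wrapped in a small helper. The sketch below is only an illustration, not an official client: `build_url` and `RateLimiter` are hypothetical names, the `tool` and `email` defaults are placeholders, and it assumes the `https` form of the base URL.

```python
import time
from urllib.parse import urlencode

# Assumption: the https form of the base URL shown on the slide.
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_url(program, tool="my_tool", email="me@example.org", **params):
    """Build an E-utility URL that always carries &tool and &email.

    urlencode() escapes special characters (space -> +, " -> %22, # -> %23).
    """
    query = dict(params, tool=tool, email=email)
    return BASE_URL + program + ".fcgi?" + urlencode(query)


class RateLimiter:
    """Keep successive calls at least `min_interval` seconds apart."""

    def __init__(self, min_interval=0.5):
        self.min_interval = min_interval
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `limiter.wait()` before each request keeps a script at or below two requests per second with the default interval, which stays under the limit of three.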
  6. ESearch
  7. ESearch (text searches)
     • Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query.
     • Syntax: esearch.fcgi?db=<database>&term=<query>
       – Input: Entrez database (&db); Any Entrez text query (&term)
       – Output: List of UIDs matching the Entrez query
     • Example: Get the PubMed IDs (PMIDs) for articles about osteosarcoma
       – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22osteosarcoma%22[majr:noexp]
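The XML returned by ESearch can be read with a standard XML parser. The following is a minimal sketch: `parse_esearch` is a hypothetical helper, and `SAMPLE` is a hand-written, abbreviated response that reuses the PMIDs from the ESummary example rather than real server output.

```python
import xml.etree.ElementTree as ET


def parse_esearch(xml_text):
    """Extract the hit count, UID list and history values from ESearch XML."""
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("Count", default="0")),
        "ids": [node.text for node in root.findall("./IdList/Id")],
        # Present only when the request included &usehistory=y.
        "query_key": root.findtext("QueryKey"),
        "webenv": root.findtext("WebEnv"),
    }


# Hand-written sample for illustration only.
SAMPLE = """<eSearchResult>
  <Count>3</Count><RetMax>3</RetMax><RetStart>0</RetStart>
  <IdList><Id>24450072</Id><Id>24333720</Id><Id>24333432</Id></IdList>
</eSearchResult>"""

result = parse_esearch(SAMPLE)
```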
  8. ESummary [diagram labels: ESearch, UIDs, EFetch, UID]
  9. ESummary (document summary downloads)
     • Responds to a list of UIDs from a given database with the corresponding document summaries.
     • Syntax: esummary.fcgi?db=<database>&id=<uid_list>
       – Input: List of UIDs (&id); Entrez database (&db)
       – Output: XML DocSums
     • Example: Download DocSums for these PubMed IDs: 24450072, 24333720, 24333432
       – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=24450072,24333720,24333432
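A DocSum response can be flattened into one dictionary per record. This sketch assumes the version 1.0 layout (`<DocSum>` elements holding an `<Id>` and named `<Item>` children); `parse_docsums` is a hypothetical helper and the sample titles are placeholders, not real article titles.

```python
import xml.etree.ElementTree as ET


def parse_docsums(xml_text):
    """Return one dict per <DocSum>, mapping top-level Item names to text."""
    summaries = []
    for docsum in ET.fromstring(xml_text).findall("DocSum"):
        record = {"Id": docsum.findtext("Id")}
        for item in docsum.findall("Item"):  # direct children only
            record[item.get("Name")] = item.text
        summaries.append(record)
    return summaries


# Hand-written sample with placeholder titles.
SAMPLE = """<eSummaryResult>
  <DocSum><Id>24450072</Id><Item Name="Title" Type="String">Placeholder title A</Item></DocSum>
  <DocSum><Id>24333720</Id><Item Name="Title" Type="String">Placeholder title B</Item></DocSum>
</eSummaryResult>"""

docsums = parse_docsums(SAMPLE)
```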
  10. EFetch, ELink
  11. EFetch (data record downloads)
     • Responds to a list of UIDs in a given database with the corresponding data records in a specified format.
     • Syntax: efetch.fcgi?db=<database>&id=<uid_list>&rettype=<retrieval_type>&retmode=<retrieval_mode>
       – Input: List of UIDs (&id); Entrez database (&db); Retrieval type (&rettype); Retrieval mode (&retmode)
       – Output: Formatted data records as specified
     • Example: Download the abstract of PubMed ID 24333432
       – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=24333432&rettype=abstract&retmode=text
  12. ELink (Entrez links)
     • Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database.
     • Checks for the existence of a specified link from a list of one or more UIDs.
     • Creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs.
  13. ELink (Entrez links)
     • Syntax: elink.fcgi?dbfrom=<source_db>&db=<destination_db>&id=<uid_list>
       – Input: List of UIDs (&id); Source Entrez database (&dbfrom); Destination Entrez database (&db)
       – Output: XML containing linked UIDs from source and destination databases
     • Example: Find Gene IDs linked to PubMed IDs 24333432 and 24314238, either as one combined set or as separate sets per PubMed ID
       – One set (comma-separated UIDs): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432,24314238
       – Separate sets (repeated &id): http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&id=24333432&id=24314238
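The two ELink URL forms on slide 13 differ only in how the UIDs are passed. A minimal sketch of building both, where `elink_url` and its `separate` flag are hypothetical and the `https` base URL is an assumption:

```python
from urllib.parse import urlencode

# Assumption: the https form of the base URL shown on the slides.
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def elink_url(dbfrom, db, ids, separate=False):
    """Build an ELink URL.

    separate=False -> one comma-separated &id (one combined link set)
    separate=True  -> one &id per UID (a separate link set per UID)
    """
    params = [("dbfrom", dbfrom), ("db", db)]
    if separate:
        params += [("id", uid) for uid in ids]
    else:
        params.append(("id", ",".join(ids)))
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return BASE_URL + "elink.fcgi?" + urlencode(params, safe=",")


pmids = ["24333432", "24314238"]
combined = elink_url("pubmed", "gene", pmids)
per_uid = elink_url("pubmed", "gene", pmids, separate=True)
```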
  14. EGQuery
  15. EGQuery (global query)
     • Responds to a text query with the number of records matching the query in each Entrez database.
     • Syntax: egquery.fcgi?term=<query>
       – Input: Entrez text query (&term)
       – Output: XML containing the number of hits in each database.
     • Example: Determine the number of records for mouse in Entrez.
       – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=mouse[orgn]&retmode=xml
  16. ESpell
  17. ESpell (spelling suggestions)
     • Retrieves spelling suggestions for a text query in a given database.
     • Syntax: espell.fcgi?term=<query>&db=<database>
       – Input: Entrez text query (&term); Entrez database (&db)
       – Output: XML containing the original query and spelling suggestions.
     • Example: Find spelling suggestions for the PubMed Central query "osteosacoma".
       – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?term=osteosacoma&db=pmc
  18. EInfo (database statistics)
     • Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases.
     • Syntax: einfo.fcgi?db=<database>
       – Input: Entrez database (&db)
       – Output: XML containing database statistics
     • Example: Find database statistics for Entrez Protein.
       – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=protein
  19. EPost (UID uploads)
     • Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset.
     • Syntax: epost.fcgi?db=<database>&id=<uid_list>
       – Input: List of UIDs (&id); Entrez database (&db)
       – Output: Web environment (&WebEnv) and query key (&query_key) parameters specifying the location on the Entrez history server of the list of uploaded UIDs
     • Example: Upload five Gene IDs (7173, 22018, 54314, 403521, 525013) for later processing.
       – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=gene&id=7173,22018,54314,403521,525013
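For long UID lists, the upload can be sent as an HTTP POST and the returned history values reused in later calls. The sketch below only prepares the request and parses a response; `epost_request` and `parse_history` are hypothetical helpers, and the sample XML is hand-written with a placeholder WebEnv value.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the https form of the base URL shown on the slides.
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def epost_request(db, ids):
    """Prepare an EPost request; supplying `data` makes urllib send a POST."""
    body = urlencode({"db": db, "id": ",".join(ids)}).encode("ascii")
    return Request(BASE_URL + "epost.fcgi", data=body)


def parse_history(xml_text):
    """Return the parameters needed to refer to the uploaded set later."""
    root = ET.fromstring(xml_text)
    return {
        "query_key": root.findtext("QueryKey"),
        "WebEnv": root.findtext("WebEnv"),
    }


request = epost_request("gene", ["7173", "22018", "54314", "403521", "525013"])
# Hand-written sample response; the WebEnv value is a placeholder.
history = parse_history(
    "<ePostResult><QueryKey>1</QueryKey><WebEnv>WEBENV_PLACEHOLDER</WebEnv></ePostResult>"
)
```

The `history` dictionary can be passed on as `&query_key` and `&WebEnv` to ESummary, EFetch or ELink; sending the prepared request would be done with `urllib.request.urlopen(request)`.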
  20. Application 1
     • Find human genes related to articles retrieved with the non-exploded MeSH major topic "Osteosarcoma" (PubMed → Gene)
       1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=%22osteosarcoma%22[majr:noexp]&usehistory=y
       2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=gene&query_key=1&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266&term=%22homo+sapiens%22[organism]&cmd=neighbor_history
       3. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&query_key=3&WebEnv=NCID_1_220057266_130.14.18.34_9001_1396281951_1196950266
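The three requests of Application 1 can be chained in a script so that the query key and WebEnv are read from each response instead of being pasted by hand. This is a sketch under assumptions: `human_genes_for_term` is a hypothetical function, `fetch` is a caller-supplied function that performs the HTTP request and returns the XML text, and the `LinkSet/LinkSetDbHistory/QueryKey` path reflects my understanding of the `neighbor_history` response and should be checked against a real one.

```python
import xml.etree.ElementTree as ET


def human_genes_for_term(term, fetch):
    """Run ESearch -> ELink -> ESummary via the History Server.

    `fetch(program, **params)` must return the response body as text.
    """
    # Step 1: search PubMed and keep the result set on the History Server.
    search = ET.fromstring(
        fetch("esearch", db="pubmed", term=term, usehistory="y")
    )
    webenv = search.findtext("WebEnv")

    # Step 2: link the stored PubMed set to human Gene records.
    link = ET.fromstring(
        fetch(
            "elink",
            dbfrom="pubmed",
            db="gene",
            query_key=search.findtext("QueryKey"),
            WebEnv=webenv,
            term='"homo sapiens"[organism]',
            cmd="neighbor_history",
        )
    )
    link_key = link.findtext("./LinkSet/LinkSetDbHistory/QueryKey")
    webenv = link.findtext("./LinkSet/WebEnv") or webenv

    # Step 3: download summaries for the linked Gene set.
    return fetch("esummary", db="gene", query_key=link_key, WebEnv=webenv)
```

Passing `fetch` in as a parameter keeps the pipeline logic testable without network access; a real `fetch` would build the URL, respect the request-rate limit, and return the decoded response.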
  21. Application 1
     • Find human genes related to articles retrieved with the non-exploded MeSH major topic "Osteosarcoma" (PubMed → Gene)
       – ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz
         • It can be used instead of "ELink".
       – ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
         • It can be used instead of "ESummary".
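When many PMIDs need to be mapped, scanning a local copy of gene2pubmed avoids one ELink request per article. A minimal sketch, assuming the file's tab-delimited columns are tax_id, GeneID and PubMed_ID with a `#` header line; `genes_by_pmid` is a hypothetical helper and the rows in `sample_rows` are synthetic.

```python
from collections import defaultdict


def genes_by_pmid(lines, pmids, tax_id="9606"):
    """Map each requested PMID to the set of Gene IDs for one taxon.

    `lines` is any iterable of text lines, for example the handle
    returned by gzip.open("gene2pubmed.gz", "rt").
    """
    wanted = set(pmids)
    result = defaultdict(set)
    for line in lines:
        if line.startswith("#"):  # header line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        tax, gene_id, pmid = fields[:3]
        if tax == tax_id and pmid in wanted:
            result[pmid].add(gene_id)
    return dict(result)


# Synthetic rows for illustration only (not real gene-article links).
sample_rows = [
    "#tax_id\tGeneID\tPubMed_ID\n",
    "9606\t11\t100\n",
    "9606\t12\t100\n",
    "10090\t13\t100\n",
    "9606\t14\t200\n",
]
mapping = genes_by_pmid(sample_rows, ["100"])
```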
  22. Application 2
     • Find nucleotide sequences of "Burkholderia cepacia complex" and download in GenBank format
       1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccore&term=%22burkholderia+cepacia+complex%22[organism]&usehistory=y
       2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&query_key=1&WebEnv=NCID_1_264773253_130.14.22.215_9001_1396244608_457974498&rettype=gb&retmode=text
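A large result set is usually downloaded from the History Server in batches rather than in one EFetch call. The sketch below only generates the parameter sets, using the standard `&retstart` and `&retmax` parameters; `efetch_batches` is a hypothetical helper, and the count, WebEnv and query key would come from the ESearch response in step 1.

```python
def efetch_batches(count, webenv, query_key, batch_size=500):
    """Yield EFetch parameters that page through a stored nuccore result set."""
    for retstart in range(0, count, batch_size):
        yield {
            "db": "nuccore",
            "query_key": query_key,
            "WebEnv": webenv,
            "rettype": "gb",      # GenBank flat file
            "retmode": "text",
            "retstart": retstart,  # index of the first record in this batch
            "retmax": batch_size,  # maximum records returned per request
        }


# Placeholder values for illustration.
batches = list(efetch_batches(1200, "WEBENV_PLACEHOLDER", "1"))
```

Each dictionary corresponds to one request, so the request-rate limit from the usage guidelines applies between batches.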
  23. Application 3
     • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
     • Workflow diagram:
       – PubMed branch: cancer "copy number" → esearch.fcgi?db=pubmed → WebEnv, query_key → esummary.fcgi?db=pubmed → elink.fcgi?dbfrom=pubmed&db=gds
       – GEO DataSets branch: Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter] → esearch.fcgi?db=gds → WebEnv, query_key → esummary.fcgi?db=gds → GPL9704, GPL8226, GPL6804, GPL6801 → esearch.fcgi?db=gds
       – Parsing → Common → Result table (PubMed title)
  24. [Diagram labels: "cancer copy number" articles; "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets]
  25. Application 3
     • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
     • Workflow diagram (same as slide 23)
  26. Application 3
     • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
  27. Application 3
     • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
     • Workflow diagram (same as slide 23)
  28. Application 3
     • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
  29. Make custom scripts with XML-parser
  30. EBot
     • EBot is an interactive web tool that first allows users to construct an arbitrary E-utility analysis pipeline and then generates a Perl script to execute the pipeline. The Perl script can be downloaded and executed on any computer with a Perl installation. For more details, see the EBot page linked below.
       – http://www.ncbi.nlm.nih.gov/Class/PowerTools/eutils/ebot/ebot.cgi
  31. Entrez Direct
     • E-utilities on the UNIX Command Line
     • Download from ftp://ftp.ncbi.nih.gov/entrez/entrezdirect/
     • Entrez Direct Functions
       – esearch performs a new Entrez search using terms in indexed fields.
       – elink looks up neighbors (within a database) or links (between databases).
       – efilter filters or restricts the results of a previous query.
       – efetch downloads records or reports in a designated format.
       – xtract converts XML into a table of data values.
       – einfo obtains information on indexed fields in an Entrez database.
       – epost uploads unique identifiers (UIDs) or sequence accession numbers.
       – nquire sends a URL request to a web page or CGI service.
     • Entering Query Commands
       – esearch -db pubmed -query "opsin gene conversion" | elink -related
  32. Links
     • References
       – Entrez Programming Utilities Help
         • http://www.ncbi.nlm.nih.gov/books/NBK25501/
       – Entrez Help
         • http://www.ncbi.nlm.nih.gov/books/NBK3836/
     • Useful Links
       – Entrez Unique Identifiers (UIDs) for selected databases
         • http://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.chapter2_table1/?report=objectonly
       – Valid values of &retmode and &rettype for EFetch (null = empty string)
         • http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1/?report=objectonly
       – The full list of Entrez links
         • http://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
  33. NCBI databases
     • Literature: PubMed, PubMed Central, NLM Catalog, MeSH, Books, Site Search
     • Health: PubMed Health, MedGen, GTR, dbGaP, ClinVar, OMIM, OMIA
     • Organisms: Taxonomy
     • Nucleotide Sequences: Nucleotide, GSS, EST, SRA, PopSet, Probe
     • Genomes: Genome, Assembly, Epigenomics, UniSTS, SNP, dbVar, BioProject, BioSample, Clone
     • Genes: Gene, HomoloGene, UniGene, GEO Profiles, GEO DataSets
     • Proteins: Protein, Conserved Domains, Protein Clusters, Structure
     • Chemicals: PubChem Compound, PubChem Substance, PubChem BioAssay
     • Pathways: BioSystems
  34. E-utilities
     • Eight server-side programs
       – ESearch : Searching a Database
       – EPost : Uploading UIDs to Entrez
       – ESummary : Downloading Document Summaries
       – EFetch : Downloading Full Records
       – ELink : Finding Related Data Through Entrez Links
       – EInfo : Getting Database Statistics and Search Fields
       – EGQuery : Performing a Global Entrez Search
       – ESpell : Retrieving Spelling Suggestions
  35. Sample Applications of the E-utilities
     • Basic pipelines
       – ESearch - ESummary/EFetch
       – EPost - ESummary/EFetch
       – ELink - ESummary/EFetch
       – ESearch - ELink - ESummary/EFetch
       – EPost - ELink - ESummary/EFetch
       – EPost - ESearch
       – ELink - ESearch
  36. Application 3
     • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
       1. tr '\n' '\t' < cancer_copy_number.pubmed_result.txt | sed 's/\t\t/\n/g' | sed 's/^\t[0-9]*: //' | sed 's/\t/ /g' > cancer_copy_number.pubmed_result.oneLine.txt
       2. sed 's/^.* PubMed *PMID: *//' cancer_copy_number.pubmed_result.oneLine.txt | sed 's/; .*//' | sed 's/\.$//' > cancer_copy_number.pubmed_ids.txt
       3. for id in $(cat cancer_copy_number.pubmed_ids.txt); do perl ~/scripts/elink.pl pubmed gds $id pubmed_gds | sed "s/^/$id\t/"; done > cancer_copy_number.pubmed_gds_ids.txt
       4. awk -F'\t' '($1 == "Platform")' Affymetrix_Genome-Wide_Human_SNP_Array.gds_result.txt | cut -f2 | sed 's/^Accession: //' > Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt
       5. for platform in $(cat Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt); do perl ~/scripts/esearch.pl gds $platform; done | sort -nu > Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt
       6. paste cancer_copy_number.pubmed_ids.txt cancer_copy_number.pubmed_result.oneLine.txt | perl ~/scripts/table.addColumns.pl cancer_copy_number.pubmed_gds_ids.txt 0 - 0 1 | perl ~/scripts/table.search.pl Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt 0 - 1 | perl ~/scripts/table.mergeLines.pl -d ', ' - 0,2 > cancer_copy_number.Affymetrix_Genome-Wide_Human_SNP_Array.pubmed_gds.txt
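The final join in Application 3 keeps only the articles whose linked GEO DataSets UIDs also appear in the platform search results. The Perl helper scripts are not shown in the slides, so the sketch below is not a reimplementation of them; it illustrates the same intersection idea with a hypothetical function and synthetic IDs.

```python
from collections import defaultdict


def articles_with_platform_datasets(pubmed_gds_pairs, platform_gds_ids):
    """Keep PubMed-to-GDS links whose GDS UID belongs to the platform set.

    pubmed_gds_pairs: iterable of (pmid, gds_uid) pairs, as from ELink.
    platform_gds_ids: GDS UIDs found by searching for the platforms.
    Returns {pmid: "uid, uid, ..."} for articles with at least one match.
    """
    platform = set(platform_gds_ids)
    hits = defaultdict(list)
    for pmid, gds_uid in pubmed_gds_pairs:
        if gds_uid in platform:
            hits[pmid].append(gds_uid)
    return {pmid: ", ".join(sorted(uids)) for pmid, uids in hits.items()}


# Synthetic IDs for illustration only.
pairs = [("1", "10"), ("1", "11"), ("2", "12"), ("3", "10")]
table = articles_with_platform_datasets(pairs, {"10", "11"})
```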