SlideShare a Scribd company logo
1 of 36
Download to read offline
NCBI API – Integration into
analysis code
QBRC Tech Talk
Jiwoong Kim
Outlines
• Introduction
• Usage Guidelines of the E-utilities
• Sample Applications of the E-utilities
NCBI & Entrez
• The National Center for
Biotechnology Information
advances science and health by
providing access to biomedical
and genomic information.
• Entrez is NCBI’s primary text
search and retrieval system
that integrates the PubMed
database of biomedical
literature with 39 other
literature and molecular
databases including DNA and
protein sequence, structure,
gene, genome, genetic
variation and gene expression.
E-utilities
• Entrez Programming Utilities
– The Entrez Programming Utilities (E-utilities) are a set of
eight server-side programs that provide a stable interface
into the Entrez query and database system at the NCBI.
– The E-utilities use a fixed URL syntax that translates a
standard set of input parameters into the values necessary
for various NCBI software components to search for and
retrieve the requested data.
E-utilitiesURL XML, FASTA, Text …
Input Output
Usage Guidelines and Requirements
• Use the E-utility URL
– baseURL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ …
– Python urllib/urlopen, Perl LWP::Simple, Linux wget, …
• Frequency, Timing and Registration of E-utility URL Requests
– Make no more than 3 requests per second → sleep(0.5)
– Run large jobs on weekends or between 5 PM and 9 AM EST
– Include &tool and &email in all requests
• Minimizing the Number of Requests
– &retmax=500
• Handling Special Characters Within URLs
– Space → +, " → %22, # → %23
ESearch
ESearch (text searches)
• Responds to a text query with the list of matching UIDs in a
given database (for later use in ESummary, EFetch or ELink),
along with the term translations of the query.
• Syntax: esearch.fcgi?db=<database>&term=<query>
– Input: Entrez database (&db); Any Entrez text query (&term)
– Output: List of UIDs matching the Entrez query
• Example: Get the PubMed IDs (PMIDs) for articles about
osteosarcoma
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&
term=%22osteosarcoma%22[majr:noexp]
ESummary
ESearch
UIDs
EFetch
UID
ESummary
(document summary downloads)
• Responds to a list of UIDs from a given database with the
corresponding document summaries.
• Syntax: esummary.fcgi?db=<database>&id=<uid_list>
– Input: List of UIDs (&id); Entrez database (&db)
– Output: XML DocSums
• Example: Download DocSums for these PubMed IDs:
24450072, 24333720, 24333432
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubme
d&id=24450072,24333720,24333432
EFetch
ELink
EFetch (data record downloads)
• Responds to a list of UIDs in a given database with the
corresponding data records in a specified format.
• Syntax:
efetch.fcgi?db=<database>&id=<uid_list>&rettype=<retrieval
_type>&retmode=<retrieval_mode>
– Input: List of UIDs (&id); Entrez database (&db); Retrieval type
(&rettype); Retrieval mode (&retmode)
– Output: Formatted data records as specified
• Example: Download the abstract of PubMed ID 24333432
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&i
d=24333432&rettype=abstract&retmode=text
ELink (Entrez links)
• Responds to a list of UIDs in a given database with either a list
of related UIDs (and relevancy scores) in the same database
or a list of linked UIDs in another Entrez database
• Checks for the existence of a specified link from a list of one
or more UIDs
• Creates a hyperlink to the primary LinkOut provider for a
specific UID and database, or lists LinkOut URLs and attributes
for multiple UIDs.
ELink (Entrez links)
• Syntax:
elink.fcgi?dbfrom=<source_db>&db=<destination_db>&id=<u
id_list>
– Input: List of UIDs (&id); Source Entrez database (&dbfrom);
Destination Entrez database (&db)
– Output: XML containing linked UIDs from source and destination
databases
• Example: Find one set/separate sets of Gene IDs linked to
PubMed IDs 24333432 and 24314238
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubme
d&db=gene&id=24333432,24314238
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubme
d&db=gene&id=24333432&id=24314238
EGQuery
EGQuery (global query)
• Responds to a text query with the number of records
matching the query in each Entrez database.
• Syntax: egquery.fcgi?term=<query>
– Input: Entrez text query (&term)
– Output: XML containing the number of hits in each database.
• Example: Determine the number of records for mouse in
Entrez.
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=mouse[
orgn]&retmode=xml
ESpell
ESpell (spelling suggestions)
• Retrieves spelling suggestions for a text query in a given
database.
• Syntax: espell.fcgi?term=<query>&db=<database>
– Input: Entrez text query (&term); Entrez database (&db)
– Output: XML containing the original query and spelling suggestions.
• Example: Find spelling suggestions for the PubMed query
"osteosacoma".
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?term=osteosac
oma&db=pmc
EInfo (database statistics)
• Provides the number of records indexed in each field of a
given database, the date of the last update of the database,
and the available links from the database to other Entrez
databases.
• Syntax: einfo.fcgi?db=<database>
– Input: Entrez database (&db)
– Output: XML containing database statistics
• Example: Find database statistics for Entrez Protein.
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=protein
EPost (UID uploads)
• Accepts a list of UIDs from a given database, stores the set on
the History Server, and responds with a query key and web
environment for the uploaded dataset.
• Syntax: epost.fcgi?db=<database>&id=<uid_list>
– Input: List of UIDs (&id); Entrez database (&db)
– Output: Web environment (&WebEnv) and query key (&query_key)
parameters specifying the location on the Entrez history server of the
list of uploaded UIDs
• Example: Upload five Gene IDs (7173, 22018, 54314, 403521,
525013) for later processing.
– http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=gene&id=71
73,22018,54314,403521,525013
Application 1
• Find related human genes to articles searched for non-
extended MeSH term "Osteosarcoma" (PubMed → Gene)
1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubme
d&term=%22osteosarcoma%22[majr:noexp]&usehistory=y
2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubm
ed&db=gene&query_key=1&WebEnv=NCID_1_220057266_130.14.
18.34_9001_1396281951_1196950266&term=%22homo+sapiens%
22[organism]&cmd=neighbor_history
3. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene
&query_key=3&WebEnv=NCID_1_220057266_130.14.18.34_9001_
1396281951_1196950266
Application 1
• Find related human genes to articles searched for non-
extended MeSH term "Osteosarcoma" (PubMed → Gene)
– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz
• It can be used instead of "ELink".
– ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
• It can be used instead of "ESummary".
Application 2
• Find nucleotide sequences of "Burkholderia cepacia complex"
and download in GenBank format
1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccor
e&term=%22burkholderia+cepacia+complex%22[organism]&usehist
ory=y
2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore
&query_key=1&WebEnv=NCID_1_264773253_130.14.22.215_9001
_1396244608_457974498&rettype=gb&retmode=text
Application 3
• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array"
platform GEO Datasets
cancer "copy number"
esearch.fcgi?db=pubmed
Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter]
esearch.fcgi?db=gds
esummary.fcgi?db=pubmed
WebEnv, query_key
esummary.fcgi?db=gds
WebEnv, query_key
GPL9704
GPL8226
GPL6804
GPL6801
elink.fcgi?dbfrom=pubmed&db=gds esearch.fcgi?db=gds
Parsing
Result table
Common
PubMed title
"cancer copy number" articles
"Affymetrix Genome-Wide Human SNP Array"
platform GEO Datasets
Application 3
• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array"
platform GEO Datasets
cancer "copy number"
esearch.fcgi?db=pubmed
Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter]
esearch.fcgi?db=gds
esummary.fcgi?db=pubmed
WebEnv, query_key
esummary.fcgi?db=gds
WebEnv, query_key
GPL9704
GPL8226
GPL6804
GPL6801
elink.fcgi?dbfrom=pubmed&db=gds esearch.fcgi?db=gds
Parsing
Result table
Common
PubMed title
Application 3
• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array"
platform GEO Datasets
Application 3
• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array"
platform GEO Datasets
cancer "copy number"
esearch.fcgi?db=pubmed
Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter]
esearch.fcgi?db=gds
esummary.fcgi?db=pubmed
WebEnv, query_key
esummary.fcgi?db=gds
WebEnv, query_key
GPL9704
GPL8226
GPL6804
GPL6801
elink.fcgi?dbfrom=pubmed&db=gds esearch.fcgi?db=gds
Parsing
Result table
Common
PubMed title
Application 3
• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array"
platform GEO Datasets
Make custom scripts with XML-parser
EBot
• EBot is an interactive web tool that first allows
users to construct an arbitrary E-utility
analysis pipeline and then generates a Perl
script to execute the pipeline. The Perl script
can be downloaded and executed on any
computer with a Perl installation. For more
details, see the EBot page linked above.
– http://www.ncbi.nlm.nih.gov/Class/PowerTools/e
utils/ebot/ebot.cgi
Entrez Direct
• E-utilities on the UNIX Command Line
• Download from ftp://ftp.ncbi.nih.gov/entrez/entrezdirect/
• Entrez Direct Functions
– esearch performs a new Entrez search using terms in indexed fields.
– elink looks up neighbors (within a database) or links (between databases).
– efilter filters or restricts the results of a previous query.
– efetch downloads records or reports in a designated format.
– xtract converts XML into a table of data values.
– einfo obtains information on indexed fields in an Entrez database.
– epost uploads unique identifiers (UIDs) or sequence accession numbers.
– nquire sends a URL request to a web page or CGI service.
• Entering Query Commands
– esearch -db pubmed -query "opsin gene conversion" | elink -related
Links
• References
– Entrez Programming Utilities Help
• http://www.ncbi.nlm.nih.gov/books/NBK25501/
– Entrez Help
• http://www.ncbi.nlm.nih.gov/books/NBK3836/
• Useful Links
– Entrez Unique Identifiers (UIDs) for selected databases
• http://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.chapter2_table1/?r
eport=objectonly
– Valid values of &retmode and &rettype for EFetch (null = empty string)
• http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1/?r
eport=objectonly
– The full list of Entrez links
• http://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
NCBI databases
• Literature: PubMed, PubMed Central, NLM Catalog, MeSH, Books, Site
Search
• Health: PubMed Health, MedGen, GTR, dbGaP, ClinVar, OMIM, OMIA
• Organisms: Taxonomy
• Nucleotide Sequences: Nucleotide, GSS, EST, SRA, PopSet, Probe
• Genomes: Genome, Assembly, Epigenomics, UniSTS, SNP, dbVar,
BioProject, BioSample, Clone
• Genes: Gene, HomoloGene, UniGene, GEO Profiles, GEO DataSets
• Proteins: Protein, Conserved Domains, Protein Clusters, Structure
• Chemicals: PubChem Compound, PubChem Substance, PubChem BioAssay
• Pathways: BioSystems
E-utilities
• Eight server-side programs
– ESearch : Searching a Database
– EPost : Uploading UIDs to Entrez
– ESummary : Downloading Document Summaries
– EFetch : Downloading Full Records
– ELink : Finding Related Data Through Entrez Links
– EInfo : Getting Database Statistics and Search Fields
– EGQuery : Performing a Global Entrez Search
– ESpell : Retrieving Spelling Suggestions
Sample Applications of the E-utilities
• Basic pipelines
– ESearch - ESummary/EFetch
– EPost - ESummary/EFetch
– ELink - ESummary/Efetch
– ESearch - ELink - ESummary/EFetch
– EPost - ELink - ESummary/EFetch
– EPost - ESearch
– ELink - ESearch
Application 3
• Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array"
platform GEO Datasets
1. tr 'n' 't' < cancer_copy_number.pubmed_result.txt | sed 's/tt/n/g' | sed 's/^t[0-9]*: //' | sed 's/t/ /g' >
cancer_copy_number.pubmed_result.oneLine.txt
2. sed 's/^.* PubMed *PMID: *//' cancer_copy_number.pubmed_result.oneLine.txt | sed 's/; .*//' | sed 's/.$//' >
cancer_copy_number.pubmed_ids.txt
3. for id in $(cat cancer_copy_number.pubmed_ids.txt); do perl ~/scripts/elink.pl pubmed gds $id pubmed_gds | sed
"s/^/$idt/"; done > cancer_copy_number.pubmed_gds_ids.txt
4. awk -F't' '($1 == "Platform")' Affymetrix_Genome-Wide_Human_SNP_Array.gds_result.txt | cut -f2 | sed
's/^Accession: //' > Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt
5. for platform in $(cat Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt); do perl
~/scripts/esearch.pl gds $platform; done | sort -nu > Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt
6. paste cancer_copy_number.pubmed_ids.txt cancer_copy_number.pubmed_result.oneLine.txt | perl
~/scripts/table.addColumns.pl cancer_copy_number.pubmed_gds_ids.txt 0 - 0 1 | perl ~/scripts/table.search.pl
Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt 0 - 1 | perl ~/scripts/table.mergeLines.pl -d ', ' - 0,2 >
cancer_copy_number.Affymetrix_Genome-Wide_Human_SNP_Array.pubmed_gds.txt

More Related Content

What's hot

The Needleman Wunsch algorithm
The Needleman Wunsch algorithmThe Needleman Wunsch algorithm
The Needleman Wunsch algorithm
avrilcoghlan
 
遺伝子のアノテーション付加
遺伝子のアノテーション付加遺伝子のアノテーション付加
遺伝子のアノテーション付加
弘毅 露崎
 
BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES
nadeem akhter
 

What's hot (20)

Introduction to Bayesian phylogenetics and BEAST
Introduction to Bayesian phylogenetics and BEASTIntroduction to Bayesian phylogenetics and BEAST
Introduction to Bayesian phylogenetics and BEAST
 
Genome Database Systems
Genome Database Systems Genome Database Systems
Genome Database Systems
 
Data retriveal ,srg and dbget
Data retriveal ,srg and dbgetData retriveal ,srg and dbget
Data retriveal ,srg and dbget
 
Biological sequences analysis
Biological sequences analysisBiological sequences analysis
Biological sequences analysis
 
Composite protein databases
Composite protein databasesComposite protein databases
Composite protein databases
 
The Needleman Wunsch algorithm
The Needleman Wunsch algorithmThe Needleman Wunsch algorithm
The Needleman Wunsch algorithm
 
遺伝子のアノテーション付加
遺伝子のアノテーション付加遺伝子のアノテーション付加
遺伝子のアノテーション付加
 
Role of ensembl in genome browsing
Role of ensembl in genome browsingRole of ensembl in genome browsing
Role of ensembl in genome browsing
 
Biological networks - building and visualizing
Biological networks - building and visualizingBiological networks - building and visualizing
Biological networks - building and visualizing
 
RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3RNA-seq: Mapping and quality control - part 3
RNA-seq: Mapping and quality control - part 3
 
NGS File formats
NGS File formatsNGS File formats
NGS File formats
 
Gen bank databases
Gen bank databasesGen bank databases
Gen bank databases
 
SWISS-PROT
SWISS-PROTSWISS-PROT
SWISS-PROT
 
PPT ON ALGORITHM
PPT ON ALGORITHMPPT ON ALGORITHM
PPT ON ALGORITHM
 
Protein Structure Prediction
Protein Structure PredictionProtein Structure Prediction
Protein Structure Prediction
 
Database Searching
Database SearchingDatabase Searching
Database Searching
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5
 
Introduction of bioinformatics
Introduction of bioinformaticsIntroduction of bioinformatics
Introduction of bioinformatics
 
BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES BIOLOGICAL SEQUENCE DATABASES
BIOLOGICAL SEQUENCE DATABASES
 
Engineered scaffold protein
Engineered scaffold proteinEngineered scaffold protein
Engineered scaffold protein
 

Similar to E-Utilities

NCBO Tools and Web services
NCBO Tools and Web servicesNCBO Tools and Web services
NCBO Tools and Web services
Trish Whetzel
 
Biological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdfBiological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdf
BioinformaticsCentre
 

Similar to E-Utilities (20)

A Short Guide to the E-utilities
A Short Guide to the E-utilitiesA Short Guide to the E-utilities
A Short Guide to the E-utilities
 
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical KnowledgeBioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
BioThings API: Building a FAIR API Ecosystem for Biomedical Knowledge
 
The Role of Metadata in Reproducible Computational Research
The Role of Metadata in Reproducible Computational ResearchThe Role of Metadata in Reproducible Computational Research
The Role of Metadata in Reproducible Computational Research
 
Biothings presentation
Biothings presentationBiothings presentation
Biothings presentation
 
Metadata-based tools at the ENCODE Portal
Metadata-based tools at the ENCODE PortalMetadata-based tools at the ENCODE Portal
Metadata-based tools at the ENCODE Portal
 
Major databases in bioinformatics
Major databases in bioinformaticsMajor databases in bioinformatics
Major databases in bioinformatics
 
Harvester I
Harvester IHarvester I
Harvester I
 
Role of bioinformatics in life sciences research
Role of bioinformatics in life sciences researchRole of bioinformatics in life sciences research
Role of bioinformatics in life sciences research
 
2012 03 01_bioinformatics_ii_les1
2012 03 01_bioinformatics_ii_les12012 03 01_bioinformatics_ii_les1
2012 03 01_bioinformatics_ii_les1
 
NCBO Tools and Web services
NCBO Tools and Web servicesNCBO Tools and Web services
NCBO Tools and Web services
 
Variant analysis and whole exome sequencing
Variant analysis and whole exome sequencingVariant analysis and whole exome sequencing
Variant analysis and whole exome sequencing
 
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
Metagenomic Data Provenance and Management using the ISA infrastructure --- o...
 
Databases_CSS2.pptx
Databases_CSS2.pptxDatabases_CSS2.pptx
Databases_CSS2.pptx
 
Data retreival system
Data retreival systemData retreival system
Data retreival system
 
Biological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdfBiological Database (1)pptxpdfpdfpdf.pdf
Biological Database (1)pptxpdfpdfpdf.pdf
 
Bioinformatics مي.pdf
Bioinformatics  مي.pdfBioinformatics  مي.pdf
Bioinformatics مي.pdf
 
openEHR Developers Workshop at #MedInfo2015
openEHR Developers Workshop at #MedInfo2015openEHR Developers Workshop at #MedInfo2015
openEHR Developers Workshop at #MedInfo2015
 
Enhancing the Quality of ImmPort Data
Enhancing the Quality of ImmPort DataEnhancing the Quality of ImmPort Data
Enhancing the Quality of ImmPort Data
 
Data Retrieval Systems
Data Retrieval SystemsData Retrieval Systems
Data Retrieval Systems
 
The ENCODE Portal REST API
The ENCODE Portal REST API The ENCODE Portal REST API
The ENCODE Portal REST API
 

Recently uploaded

Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
levieagacer
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
Scintica Instrumentation
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Sérgio Sacani
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
1301aanya
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
seri bangash
 

Recently uploaded (20)

GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort ServiceCall Girls Ahmedabad +917728919243 call me Independent Escort Service
Call Girls Ahmedabad +917728919243 call me Independent Escort Service
 
Factory Acceptance Test( FAT).pptx .
Factory Acceptance Test( FAT).pptx       .Factory Acceptance Test( FAT).pptx       .
Factory Acceptance Test( FAT).pptx .
 
Module for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learningModule for Grade 9 for Asynchronous/Distance learning
Module for Grade 9 for Asynchronous/Distance learning
 
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)COMPUTING ANTI-DERIVATIVES(Integration by SUBSTITUTION)
COMPUTING ANTI-DERIVATIVES (Integration by SUBSTITUTION)
 
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptxClimate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
Climate Change Impacts on Terrestrial and Aquatic Ecosystems.pptx
 
Stages in the normal growth curve
Stages in the normal growth curveStages in the normal growth curve
Stages in the normal growth curve
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS  ESCORT SERVICE In Bhiwan...
Bhiwandi Bhiwandi ❤CALL GIRL 7870993772 ❤CALL GIRLS ESCORT SERVICE In Bhiwan...
 
Exploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdfExploring Criminology and Criminal Behaviour.pdf
Exploring Criminology and Criminal Behaviour.pdf
 
Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.Proteomics: types, protein profiling steps etc.
Proteomics: types, protein profiling steps etc.
 
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
(May 9, 2024) Enhanced Ultrafast Vector Flow Imaging (VFI) Using Multi-Angle ...
 
Zoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdfZoology 5th semester notes( Sumit_yadav).pdf
Zoology 5th semester notes( Sumit_yadav).pdf
 
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune WaterworldsBiogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
Biogenic Sulfur Gases as Biosignatures on Temperate Sub-Neptune Waterworlds
 
biology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGYbiology HL practice questions IB BIOLOGY
biology HL practice questions IB BIOLOGY
 
Chemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdfChemistry 5th semester paper 1st Notes.pdf
Chemistry 5th semester paper 1st Notes.pdf
 
Dr. E. Muralinath_ Blood indices_clinical aspects
Dr. E. Muralinath_ Blood indices_clinical  aspectsDr. E. Muralinath_ Blood indices_clinical  aspects
Dr. E. Muralinath_ Blood indices_clinical aspects
 
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and SpectrometryFAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
FAIRSpectra - Enabling the FAIRification of Spectroscopy and Spectrometry
 
Grade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its FunctionsGrade 7 - Lesson 1 - Microscope and Its Functions
Grade 7 - Lesson 1 - Microscope and Its Functions
 
The Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptxThe Mariana Trench remarkable geological features on Earth.pptx
The Mariana Trench remarkable geological features on Earth.pptx
 

E-Utilities

  • 1. NCBI API – Integration into analysis code QBRC Tech Talk Jiwoong Kim
  • 2. Outlines • Introduction • Usage Guidelines of the E-utilities • Sample Applications of the E-utilities
  • 3. NCBI & Entrez • The National Center for Biotechnology Information advances science and health by providing access to biomedical and genomic information. • Entrez is NCBI’s primary text search and retrieval system that integrates the PubMed database of biomedical literature with 39 other literature and molecular databases including DNA and protein sequence, structure, gene, genome, genetic variation and gene expression.
  • 4. E-utilities • Entrez Programming Utilities – The Entrez Programming Utilities (E-utilities) are a set of eight server-side programs that provide a stable interface into the Entrez query and database system at the NCBI. – The E-utilities use a fixed URL syntax that translates a standard set of input parameters into the values necessary for various NCBI software components to search for and retrieve the requested data. E-utilitiesURL XML, FASTA, Text … Input Output
  • 5. Usage Guidelines and Requirements • Use the E-utility URL – baseURL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/ … – Python urllib/urlopen, Perl LWP::Simple, Linux wget, … • Frequency, Timing and Registration of E-utility URL Requests – Make no more than 3 requests per second → sleep(0.5) – Run large jobs on weekends or between 5 PM and 9 AM EST – Include &tool and &email in all requests • Minimizing the Number of Requests – &retmax=500 • Handling Special Characters Within URLs – Space → +, " → %22, # → %23
  • 7. ESearch (text searches) • Responds to a text query with the list of matching UIDs in a given database (for later use in ESummary, EFetch or ELink), along with the term translations of the query. • Syntax: esearch.fcgi?db=<database>&term=<query> – Input: Entrez database (&db); Any Entrez text query (&term) – Output: List of UIDs matching the Entrez query • Example: Get the PubMed IDs (PMIDs) for articles about osteosarcoma – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed& term=%22osteosarcoma%22[majr:noexp]
  • 9. ESummary (document summary downloads) • Responds to a list of UIDs from a given database with the corresponding document summaries. • Syntax: esummary.fcgi?db=<database>&id=<uid_list> – Input: List of UIDs (&id); Entrez database (&db) – Output: XML DocSums • Example: Download DocSums for these PubMed IDs: 24450072, 24333720, 24333432 – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubme d&id=24450072,24333720,24333432
  • 11. EFetch (data record downloads) • Responds to a list of UIDs in a given database with the corresponding data records in a specified format. • Syntax: efetch.fcgi?db=<database>&id=<uid_list>&rettype=<retrieval _type>&retmode=<retrieval_mode> – Input: List of UIDs (&id); Entrez database (&db); Retrieval type (&rettype); Retrieval mode (&retmode) – Output: Formatted data records as specified • Example: Download the abstract of PubMed ID 24333432 – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&i d=24333432&rettype=abstract&retmode=text
  • 12. ELink (Entrez links) • Responds to a list of UIDs in a given database with either a list of related UIDs (and relevancy scores) in the same database or a list of linked UIDs in another Entrez database • Checks for the existence of a specified link from a list of one or more UIDs • Creates a hyperlink to the primary LinkOut provider for a specific UID and database, or lists LinkOut URLs and attributes for multiple UIDs.
  • 13. ELink (Entrez links) • Syntax: elink.fcgi?dbfrom=<source_db>&db=<destination_db>&id=<u id_list> – Input: List of UIDs (&id); Source Entrez database (&dbfrom); Destination Entrez database (&db) – Output: XML containing linked UIDs from source and destination databases • Example: Find one set/separate sets of Gene IDs linked to PubMed IDs 24333432 and 24314238 – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubme d&db=gene&id=24333432,24314238 – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubme d&db=gene&id=24333432&id=24314238
  • 15. EGQuery (global query) • Responds to a text query with the number of records matching the query in each Entrez database. • Syntax: egquery.fcgi?term=<query> – Input: Entrez text query (&term) – Output: XML containing the number of hits in each database. • Example: Determine the number of records for mouse in Entrez. – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery.fcgi?term=mouse[ orgn]&retmode=xml
  • 17. ESpell (spelling suggestions) • Retrieves spelling suggestions for a text query in a given database. • Syntax: espell.fcgi?term=<query>&db=<database> – Input: Entrez text query (&term); Entrez database (&db) – Output: XML containing the original query and spelling suggestions. • Example: Find spelling suggestions for the PubMed query "osteosacoma". – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/espell.fcgi?term=osteosac oma&db=pmc
  • 18. EInfo (database statistics) • Provides the number of records indexed in each field of a given database, the date of the last update of the database, and the available links from the database to other Entrez databases. • Syntax: einfo.fcgi?db=<database> – Input: Entrez database (&db) – Output: XML containing database statistics • Example: Find database statistics for Entrez Protein. – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=protein
  • 19. EPost (UID uploads) • Accepts a list of UIDs from a given database, stores the set on the History Server, and responds with a query key and web environment for the uploaded dataset. • Syntax: epost.fcgi?db=<database>&id=<uid_list> – Input: List of UIDs (&id); Entrez database (&db) – Output: Web environment (&WebEnv) and query key (&query_key) parameters specifying the location on the Entrez history server of the list of uploaded UIDs • Example: Upload five Gene IDs (7173, 22018, 54314, 403521, 525013) for later processing. – http://eutils.ncbi.nlm.nih.gov/entrez/eutils/epost.fcgi?db=gene&id=71 73,22018,54314,403521,525013
  • 20. Application 1 • Find related human genes to articles searched for non- extended MeSH term "Osteosarcoma" (PubMed → Gene) 1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubme d&term=%22osteosarcoma%22[majr:noexp]&usehistory=y 2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubm ed&db=gene&query_key=1&WebEnv=NCID_1_220057266_130.14. 18.34_9001_1396281951_1196950266&term=%22homo+sapiens% 22[organism]&cmd=neighbor_history 3. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene &query_key=3&WebEnv=NCID_1_220057266_130.14.18.34_9001_ 1396281951_1196950266
  • 21. Application 1 • Find related human genes to articles searched for non- extended MeSH term "Osteosarcoma" (PubMed → Gene) – ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz • It can be used instead of "ELink". – ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz • It can be used instead of "ESummary".
  • 22. Application 2 • Find nucleotide sequences of "Burkholderia cepacia complex" and download in GenBank format 1. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nuccor e&term=%22burkholderia+cepacia+complex%22[organism]&usehist ory=y 2. http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore &query_key=1&WebEnv=NCID_1_264773253_130.14.22.215_9001 _1396244608_457974498&rettype=gb&retmode=text
  • 23. Application 3 • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets cancer "copy number" esearch.fcgi?db=pubmed Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter] esearch.fcgi?db=gds esummary.fcgi?db=pubmed WebEnv, query_key esummary.fcgi?db=gds WebEnv, query_key GPL9704 GPL8226 GPL6804 GPL6801 elink.fcgi?dbfrom=pubmed&db=gds esearch.fcgi?db=gds Parsing Result table Common PubMed title
  • 24. "cancer copy number" articles "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
  • 25. Application 3 • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets cancer "copy number" esearch.fcgi?db=pubmed Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter] esearch.fcgi?db=gds esummary.fcgi?db=pubmed WebEnv, query_key esummary.fcgi?db=gds WebEnv, query_key GPL9704 GPL8226 GPL6804 GPL6801 elink.fcgi?dbfrom=pubmed&db=gds esearch.fcgi?db=gds Parsing Result table Common PubMed title
  • 26. Application 3 • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
  • 27. Application 3 • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets cancer "copy number" esearch.fcgi?db=pubmed Affymetrix "Genome-Wide" Human "SNP Array" AND gpl[Filter] esearch.fcgi?db=gds esummary.fcgi?db=pubmed WebEnv, query_key esummary.fcgi?db=gds WebEnv, query_key GPL9704 GPL8226 GPL6804 GPL6801 elink.fcgi?dbfrom=pubmed&db=gds esearch.fcgi?db=gds Parsing Result table Common PubMed title
  • 28. Application 3 • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets
  • 29. Make custom scripts with XML-parser
  • 30. EBot • EBot is an interactive web tool that first allows users to construct an arbitrary E-utility analysis pipeline and then generates a Perl script to execute the pipeline. The Perl script can be downloaded and executed on any computer with a Perl installation. For more details, see the EBot page linked above. – http://www.ncbi.nlm.nih.gov/Class/PowerTools/e utils/ebot/ebot.cgi
  • 31. Entrez Direct • E-utilities on the UNIX Command Line • Download from ftp://ftp.ncbi.nih.gov/entrez/entrezdirect/ • Entrez Direct Functions – esearch performs a new Entrez search using terms in indexed fields. – elink looks up neighbors (within a database) or links (between databases). – efilter filters or restricts the results of a previous query. – efetch downloads records or reports in a designated format. – xtract converts XML into a table of data values. – einfo obtains information on indexed fields in an Entrez database. – epost uploads unique identifiers (UIDs) or sequence accession numbers. – nquire sends a URL request to a web page or CGI service. • Entering Query Commands – esearch -db pubmed -query "opsin gene conversion" | elink -related
  • 32. Links • References – Entrez Programming Utilities Help • http://www.ncbi.nlm.nih.gov/books/NBK25501/ – Entrez Help • http://www.ncbi.nlm.nih.gov/books/NBK3836/ • Useful Links – Entrez Unique Identifiers (UIDs) for selected databases • http://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.chapter2_table1/?r eport=objectonly – Valid values of &retmode and &rettype for EFetch (null = empty string) • http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1/?r eport=objectonly – The full list of Entrez links • http://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
  • 33. NCBI databases • Literature: PubMed, PubMed Central, NLM Catalog, MeSH, Books, Site Search • Health: PubMed Health, MedGen, GTR, dbGaP, ClinVar, OMIM, OMIA • Organisms: Taxonomy • Nucleotide Sequences: Nucleotide, GSS, EST, SRA, PopSet, Probe • Genomes: Genome, Assembly, Epigenomics, UniSTS, SNP, dbVar, BioProject, BioSample, Clone • Genes: Gene, HomoloGene, UniGene, GEO Profiles, GEO DataSets • Proteins: Protein, Conserved Domains, Protein Clusters, Structure • Chemicals: PubChem Compound, PubChem Substance, PubChem BioAssay • Pathways: BioSystems
  • 34. E-utilities • Eight server-side programs – ESearch : Searching a Database – EPost : Uploading UIDs to Entrez – ESummary : Downloading Document Summaries – EFetch : Downloading Full Records – ELink : Finding Related Data Through Entrez Links – EInfo : Getting Database Statistics and Search Fields – EGQuery : Performing a Global Entrez Search – ESpell : Retrieving Spelling Suggestions
  • 35. Sample Applications of the E-utilities • Basic pipelines – ESearch - ESummary/EFetch – EPost - ESummary/EFetch – ELink - ESummary/Efetch – ESearch - ELink - ESummary/EFetch – EPost - ELink - ESummary/EFetch – EPost - ESearch – ELink - ESearch
  • 36. Application 3 • Find "cancer copy number" articles with "Affymetrix Genome-Wide Human SNP Array" platform GEO Datasets 1. tr 'n' 't' < cancer_copy_number.pubmed_result.txt | sed 's/tt/n/g' | sed 's/^t[0-9]*: //' | sed 's/t/ /g' > cancer_copy_number.pubmed_result.oneLine.txt 2. sed 's/^.* PubMed *PMID: *//' cancer_copy_number.pubmed_result.oneLine.txt | sed 's/; .*//' | sed 's/.$//' > cancer_copy_number.pubmed_ids.txt 3. for id in $(cat cancer_copy_number.pubmed_ids.txt); do perl ~/scripts/elink.pl pubmed gds $id pubmed_gds | sed "s/^/$idt/"; done > cancer_copy_number.pubmed_gds_ids.txt 4. awk -F't' '($1 == "Platform")' Affymetrix_Genome-Wide_Human_SNP_Array.gds_result.txt | cut -f2 | sed 's/^Accession: //' > Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt 5. for platform in $(cat Affymetrix_Genome-Wide_Human_SNP_Array.platform_accessions.txt); do perl ~/scripts/esearch.pl gds $platform; done | sort -nu > Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt 6. paste cancer_copy_number.pubmed_ids.txt cancer_copy_number.pubmed_result.oneLine.txt | perl ~/scripts/table.addColumns.pl cancer_copy_number.pubmed_gds_ids.txt 0 - 0 1 | perl ~/scripts/table.search.pl Affymetrix_Genome-Wide_Human_SNP_Array.gds_ids.txt 0 - 1 | perl ~/scripts/table.mergeLines.pl -d ', ' - 0,2 > cancer_copy_number.Affymetrix_Genome-Wide_Human_SNP_Array.pubmed_gds.txt