Ben Busby, Ph.D.
NCBI Genomics Outreach Coordinator
Chair, Department of Bioinformatics and Data Science, FAES
ben.busby@nih.gov
• Review of basic NCBI search systems, tools, databases for metagenomics
• NCBI Hackathons
• Hands-on Session
• Finding literature and molecular data for metagenomics (EDirect)
• Basic computational resources at NCBI
• Taxonomy
• MoleBLAST
• Many other flavors of BLAST
• SRA-BLAST
• MagicBLAST
• Contig generation (with SPAdes)
EUtils (Search API) Command Line
EDirect
PubMed and PMC (Open) FTP
Instead of PubMed FTP…
An automated tool (alpha)
Combined score is the average of SVs, mappability, GC...
NCBI region list
Encode blacklist
phenvar.colorado.edu
EDirect (Search API) Cookbook
My View of Data Transfer Principles
• Metadata Search
  • Rapid NoSQL (for now)
• Integration
  • Non-ambiguous identifiers
• Transferring Small Amounts of Data
  • Data still gets transferred in the cloud
• Underlying structure
  • Finding specific data from validated formats
• Democratization of Data
  • Rapid comparison by domain experts
• Reporting
  • Metrics to report data upload and [unique IP] download of datasets
  • Post-publication User Review
• August 14-16, NIH Campus
• September 25-27, Pittsburgh, PA
• January 8-10, NIH Campus
• February 19-21, Baylor College of Medicine
• March 11-14, NIH Campus
• April 2-4, 2018, UCSC
• May 15-17, 2018, Ann Arbor, MI
• June 20-22, Boulder CO
ncbi-hackathons.github.io
Come work at NCBI for 4-6 weeks!
Email bioinformatics-training@ncbi.nlm.nih.gov
for more information!
Thanks again to Peter Cooper!
Materials: http://bit.ly/2chunww
Learn the basics of using the EDirect programs to extract
custom reports from the NLM/NCBI Literature and
molecular databases
YouTube:
https://youtu.be/BLnYW33Mtb0
YouTube:
https://youtu.be/aK7-KX5q30k
Past webinar recordings on
YouTube (youtube.com/ncbinlm)
• Command line version of E-Utilities (API for Entrez system)
• Each program outputs XML that can be piped directly into another EDirect program or Linux utility
• Requirements
  • UNIX, Linux, Mac OS X, Cygwin (Windows)
  • Perl with LWP::Simple
• Package contents
  • einfo
  • esearch
  • esummary
  • efetch
  • elink
  • epost
  • efilter (performs an esearch after an elink or esearch)
  • xtract (powerful XML parser)
Documentation
<ncbi>/books/NBK179288/#chapter6
• Allows construction of custom pipelines for processing data
• Generates highly flexible custom output reports
• Built-in batch access
09/12/2016 50
[Diagram: EDirect workflow. esearch: search; efetch: retrieval; xtract: parsing and reporting; elink: linking; epost: storing and combining (History).]
Provides information about the available databases
• Available indexed fields
• Available links
• Produces XML (or text output with -fields, -links)
einfo -dbs
einfo -db dbname
einfo -db dbname -fields
einfo -db dbname -links
• Uses standard web Entrez queries
  • try searches on web interface first
• Results stored in web environment
• Pipe output to efetch, elink
esearch -db pubmed -query 'antibiotic resistance[MeSH] AND metagenomic'
<ENTREZ_DIRECT>
<Db>pubmed</Db>
<WebEnv>NCID_1_55733393_130.14.22.215_9001_1473536392_1519353658_0MetA0_S_MegaStore_F_1</WebEnv>
<QueryKey>1</QueryKey>
<Count>165</Count>
<Step>1</Step>
</ENTREZ_DIRECT>
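The short XML message above is all that esearch prints; the matching records themselves stay in the web environment. As a minimal offline sketch, the hit count and database name can be pulled out of a saved copy of that message with sed, which runs without EDirect or network access. The variable names are illustrative. With EDirect installed, piping the esearch output to `xtract -pattern ENTREZ_DIRECT -element Count` should give the same number.

```shell
# Sample esearch message copied from the slide.
result='<ENTREZ_DIRECT>
<Db>pubmed</Db>
<WebEnv>NCID_1_55733393_130.14.22.215_9001_1473536392_1519353658_0MetA0_S_MegaStore_F_1</WebEnv>
<QueryKey>1</QueryKey>
<Count>165</Count>
<Step>1</Step>
</ENTREZ_DIRECT>'

# Print only the text between the opening and closing tag.
# -n plus the trailing p flag suppresses every line that does not match.
count=$(printf '%s\n' "$result" | sed -n 's:.*<Count>\(.*\)</Count>.*:\1:p')
db=$(printf '%s\n' "$result" | sed -n 's:.*<Db>\(.*\)</Db>.*:\1:p')

echo "$count records matched in $db"
```

Checking the count before piping into efetch is a cheap way to confirm that a query behaves the same on the command line as it did in the web interface.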
• Returns related records in the same (-related) or different (-target) database
• Use link name from einfo to get the most precise results (-name linkname)
• Pipe into efetch
• Use with -cmd neighbor to get a table of linked identifiers (elink XML)
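The elink options above can be sketched as two small pipelines. This is an illustration under stated assumptions, not the commands from the original slides: the function names are hypothetical, and both functions need EDirect on the PATH and network access. They are wrapped in functions so that nothing contacts NCBI until one is called.

```shell
# Hypothetical helper: nucleotide accessions linked to the PubMed hits for a query.
# -target switches to a different database (here nuccore).
linked_accessions() {
  esearch -db pubmed -query "$1" |
    elink -target nuccore |
    efetch -format acc
}

# Hypothetical helper: PubMed records related to one PMID.
# -related stays in the same database as the input.
related_pubmed() {
  epost -db pubmed -id "$1" |
    elink -related |
    efetch -format uid
}
```

A call such as `linked_accessions 'antibiotic resistance[MeSH] AND metagenomic' | head` would then list the first few linked accessions, assuming any of the matching articles have nucleotide links.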
• Produces full XML records and Summaries (Docsums) for many databases
• Also specialized output for PubMed, sequence databases, Gene and others
• In many cases Docsums contain enough information (efetch -format docsum == esummary)
• Parsing values from full XML can be more challenging
-format        -mode   Report Type
acc                    Accession Number
est                    EST Report
fasta                  FASTA
fasta          xml     TinySeq XML
fasta_cds_aa           FASTA of CDS Products
fasta_cds_na           FASTA of Coding Regions
ft                     Feature Table
gb                     GenBank Flatfile
gb             xml     GBSet XML
gbc            xml     INSDSet XML
gbwithparts            GenBank with Contig Sequences
gene_fasta             FASTA of Gene
gp                     GenPept Flatfile
gp             xml     GBSet XML
gpc            xml     INSDSet XML
gss                    GSS Report
ipg                    Identical Protein Report
ipg            xml     IPGReportSet XML
native         text    Seq-entry ASN.1
native         xml     Bioseq-set XML
seqid                  Seq-id ASN.1
• General full-featured XML parser
• Produces tab-delimited output
• Loops over XML structure using exploration options
• Prints out selected items (elements) from XML
• Conditional execution
• Flexible output formats
• -pattern places the data from individual records into separate rows.
• -element extracts values from specified fields into separate columns.
• -group, -block, and -subset limit element exploration to selected XML subregions.
<Result Set>
  <DocumentSummary>            Pattern (Record)
    <Tag>value</Tag>
    <Tag>                      Group
      <Tag>                    Block
        <Tag>value</Tag>
        <Tag>value</Tag>
        <Tag>value</Tag>
      </Tag>
    </Tag>
    <Tag>                      Group
      <Tag>                    Block
        <Tag>value</Tag>
        <Tag>value</Tag>
        <Tag>value</Tag>
      </Tag>
    </Tag>
  </DocumentSummary>
  <DocumentSummary>
    etc.
  </DocumentSummary>
</Result Set>
Exploration arguments, from outermost to innermost: -pattern, -group, -block, -subset
EDirect Example 1:
Explore articles and authors for antibiotic resistance
genes in metagenomic samples
• Loop over DocumentSummaries
Table of Articles
• Print out PMID, Source, Title
• Only match <Source>PLoS One</Source>: -match "Source:PLoS One"
Author counts
• Report and count Author names: -element Name
• Use built-in sort-uniq-count-rank to process final output
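The command line for this example did not survive on the slide, so the following is a hedged reconstruction of the two reports described above rather than the presenter's exact command. The query is reused from the esearch slide, the function names are hypothetical, and the `-if Source -equals` condition is the form that newer EDirect releases use, to my knowledge, in place of the `-match "Source:PLoS One"` shown here. Both functions need EDirect and network access and do nothing until called.

```shell
# Query reused from the esearch slide; change it to suit.
QUERY='antibiotic resistance[MeSH] AND metagenomic'

# Table of articles: PMID, Source, Title, restricted to PLoS One records.
article_table() {
  esearch -db pubmed -query "$QUERY" |
    esummary |
    xtract -pattern DocumentSummary -if Source -equals "PLoS One" \
      -element Id Source Title
}

# Author counts: one author name per row, then count and rank the names.
author_counts() {
  esearch -db pubmed -query "$QUERY" |
    esummary |
    xtract -pattern Author -element Name |
    sort-uniq-count-rank
}
```

Using Author as the pattern for the second report puts each name on its own row, which is the shape sort-uniq-count-rank expects.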
• -sep
  • specifies the character separating multiple fields in an ‘-element’ argument
• -tab
  • specifies the character separating multiple values of an ‘-element’ argument
• Place -sep and -tab before -element
[Flow diagram: text query → ESearch (pubmed) → set of PMIDs → ESummary (PubMed) → PubMed Docsums → xtract. One branch matches Source:PLoS One and prints ID, Source, Title; the other extracts Name and passes it to sort-uniq-count-rank.]
EDirect Example 2:
Retrieve the correct sequence in FASTA format from
the nuccore database
• efetch in nuccore can be used with -chr_start and -chr_stop to adjust from 0-based to 1-based coordinates and strand.
• You can use xargs to pass the last three columns of the test file to efetch to get all genomic regions.
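Neither the test file nor the original command is preserved here, so this is a minimal sketch under assumptions: the input is taken to be a whitespace-delimited file whose last three columns are a nuccore accession, a 0-based start, and a 0-based stop. `fetch_regions` and the `EFETCH` override are hypothetical names introduced for illustration; the override exists only so the command can be dry-run without contacting NCBI.

```shell
# Hypothetical helper. Assumes the last three columns of the input file are:
# nuccore accession, 0-based start, 0-based stop.
fetch_regions() {
  # Keep only the last three columns; skip lines that are too short.
  awk 'NF >= 3 { print $(NF-2), $(NF-1), $NF }' "$1" |
    # xargs hands each group of three values to one efetch call.
    xargs -n 3 sh -c \
      '"${EFETCH:-efetch}" -db nuccore -id "$1" -chr_start "$2" -chr_stop "$3" -format fasta' sh
}
```

Running `export EFETCH=echo` before calling `fetch_regions regions.txt` prints the efetch arguments instead of executing them, which is a convenient way to check the column order before fetching any sequence.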
• Originally published in 2009 (doi:10.1128/AEM.01541-09)
• Most cited tool for analyzing 16S rRNA gene sequences
• 3,410 citations (WoS: 1/8/2016)
• Working on 37th release
• Overview
• 100% open source, GPL v3
• OS independent
• Command line interface
• Written in C/C++
http://www.mothur.org
• Deposition of 16S rRNA gene sequences to the SRA has been a major problem
• Worked with SRA staff to make a customized portal to simplify submission of PCR-generated 16S rRNA gene sequences
• Command enforces co-submission of sample and processing metadata
• Originally released in March 2015. So far 86 submissions, 61 studies, 6,367 runs, 116 Gbp total
http://www.mothur.org/wiki/make.sra
1. Provide the necessary MIMARKS* metadata about samples with get.mimarkspackage
2. Create a project file describing the user and their project using the supplied template file
3. Parse the MIMARKS file, project file, and sff or fastq files to generate an XML file for submission using make.sra
4. Email the SRA to let them know about the submission using the mothur-created files, and await further instructions
* http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3367316/
http://www.mothur.org/wiki/Creating_a_new_submission
• MIMARKS: minimum information about a marker gene sequence (doi:10.1038/nbt.1823)
• Command supplies user with a blank text file with sample names and necessary metadata for their environment. User fills in details.
• For each environmental package and environment there is a wiki page with required and optional parameters and allowed values. Can also be extended to include additional metadata.
http://www.mothur.org/wiki/Get.mimarkspackage
http://www.mothur.org/wiki/Human_gut
USERNAME [UserName]
Last [LastName]
First [FirstName]
EMAIL [Email@mail.com]
CENTER [University or Center Name]
TYPE institute
WEBSITE [www.Website.org]
ProjectName [ProjectName]
ProjectTitle [Project Title]
Description [Project Description]
Grant id=[GrantID], agency=[GrantAgency], title=[GrantTitle]
User completes information in brackets
http://www.mothur.org/wiki/Project_File
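As an illustration of filling in the template, the sketch below writes a completed project file and checks that no bracketed placeholder is left. Every value is a made-up placeholder, the file name `my.project` is hypothetical, and the single-space separator between field name and value is an assumption copied from the slide layout; the wiki page above is the authority on the exact format.

```shell
# All values are placeholders; replace them with your own details.
# Field names follow the template above; the space separator is an assumption.
cat > my.project <<'EOF'
USERNAME jdoe
Last Doe
First Jane
EMAIL jane.doe@example.org
CENTER Example University
TYPE institute
WEBSITE www.example.org
ProjectName gut_16S_pilot
ProjectTitle Pilot 16S rRNA survey of the human gut
Description Hypothetical project used only to illustrate the file layout
Grant id=EX-0000, agency=Example Agency, title=Example Grant
EOF

# Sanity check: a leftover "[" means a template field was not filled in.
if grep -q '\[' my.project; then
  echo "unfilled placeholders remain" >&2
else
  echo "my.project looks complete"
fi
```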
Basics of crystallography, crystal systems, classes and different formsBasics of crystallography, crystal systems, classes and different forms
Basics of crystallography, crystal systems, classes and different forms
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
 
Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)Sciences of Europe journal No 142 (2024)
Sciences of Europe journal No 142 (2024)
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
Unlocking the mysteries of reproduction: Exploring fecundity and gonadosomati...
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...Authoring a personal GPT for your research and practice: How we created the Q...
Authoring a personal GPT for your research and practice: How we created the Q...
 
Eukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptxEukaryotic Transcription Presentation.pptx
Eukaryotic Transcription Presentation.pptx
 
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
Describing and Interpreting an Immersive Learning Case with the Immersion Cub...
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 

Datasets and tools from NCBI and elsewhere for microbiome research (v 62817)

  • 1. Ben Busby, Ph.D. NCBI Genomics Outreach Coordinator Chair, Department of Bioinformatics and Data Science, FAES ben.busby@nih.gov
  • 2.
      - Review of basic NCBI search systems, tools, databases for metagenomics
      - NCBI Hackathons
      - Hands-on Session
      - Finding literature and molecular data for metagenomics (EDirect)
      - Basic Computational resources at NCBI
      - Taxonomy
      - MoleBLAST
      - Many other flavors of BLAST
      - SRA-BLAST
      - MagicBLAST
      - Contig generation (with Spades)
  • 5. EUtils (Search API) Command Line EDirect
  • 6. PubMed and PMC (Open) FTP
  • 7. Instead of PubMed FTP… An automated tool (alpha)
  • 35. Combined score is the average of SVs, mappability, GC, etc.
      - NCBI region list
      - ENCODE blacklist
  • 42. My View of Data Transfer Principles
      - Metadata Search
      - Rapid NoSQL (for now)
      - Integration
      - Non-ambiguous identifiers
      - Transferring Small amounts of Data
      - Data still gets transferred in the cloud
      - Underlying structure
      - Finding specific data from validated formats
      - Democratization of Data
      - Rapid comparison by domain experts
      - Reporting
      - Metrics to report data upload and [unique IP] download of datasets
      - Post-publication User Review
  • 44.
      - August 14-16, NIH Campus
      - September 25-27, Pittsburgh, PA
      - January 8-10, NIH Campus
      - February 19-21, Baylor College of Medicine
      - March 11-14, NIH Campus
      - April 2-4, 2018, UCSC
      - May 15-17, 2018, Ann Arbor, MI
      - June 20-22, Boulder, CO
      ncbi-hackathons.github.io
      Come work at NCBI for 4-6 weeks! Email bioinformatics-training@ncbi.nlm.nih.gov for more information!
  • 46. Thanks again to Peter Cooper! Materials: http://bit.ly/2chunww
  • 47. Learn the basics of using the EDirect programs to extract custom reports from the NLM/NCBI Literature and molecular databases
  • 49.
      - Command line version of E-Utilities (API for Entrez system)
      - Each program outputs XML that can be piped directly into another EDirect program or Linux utility
      - Requirements
          - UNIX, Linux, Mac OS X, Cygwin (Windows)
          - Perl with LWP::Simple
      - Package contents
          - einfo
          - esearch
          - esummary
          - efetch
          - elink
          - epost
          - efilter (performs an esearch after an elink or esearch)
          - xtract (powerful XML parser)
      - Documentation: <ncbi>/books/NBK179288/#chapter6
  • 50. (09/12/2016)
      - Allows construction of custom pipelines for processing data
      - Generates highly flexible custom output reports
      - Built-in batch access
  • 52. Provides information about the available databases
      - Available indexed fields
      - Available links
      - Produces XML (or text output with -fields, -links)
      Commands:
          einfo -dbs
          einfo -db dbname
          einfo -db dbname -fields
          einfo -db dbname -links
  • 53.
      - Uses standard web Entrez queries
          - Try searches on the web interface first
      - Results stored in web environment
      - Pipe output to efetch, elink
      Command:
          esearch -db pubmed -query 'antibiotic resistance[MeSH] AND metagenomic'
      Output:
          <ENTREZ_DIRECT>
            <Db>pubmed</Db>
            <WebEnv>NCID_1_55733393_130.14.22.215_9001_1473536392_1519353658_0MetA0_S_MegaStore_F_1</WebEnv>
            <QueryKey>1</QueryKey>
            <Count>165</Count>
            <Step>1</Step>
          </ENTREZ_DIRECT>
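The esearch message on this slide can be inspected before it is handed to the next program. The following is a minimal sketch and not part of the original slides: `entrez_field` is a hypothetical helper built on plain sed so that it runs without EDirect, the WebEnv value is shortened to a placeholder, and `fetch_abstracts` assumes EDirect is installed and NCBI is reachable.

```shell
# Sample esearch message from the slide; WebEnv shortened to a placeholder.
msg='<ENTREZ_DIRECT>
  <Db>pubmed</Db>
  <WebEnv>WEBENV_PLACEHOLDER</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>165</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>'

# Hypothetical helper: print the text of one element of the message.
entrez_field() {
  printf '%s\n' "$msg" | sed -n "s:.*<$1>\(.*\)</$1>.*:\1:p"
}

entrez_field Count   # prints 165

# With EDirect installed, the same query can be piped straight into efetch.
# Wrapped in a function so nothing touches the network until it is called.
fetch_abstracts() {
  esearch -db pubmed -query 'antibiotic resistance[MeSH] AND metagenomic' |
    efetch -format abstract
}
```

Checking the Count first is a cheap way to confirm that a query behaves as it did on the web interface before fetching any records.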
  • 54.
      - Returns related records in the same (-related) or different (-target) database
      - Use link name from einfo to get the most precise results (-name linkname)
      - Pipe into efetch
      - Use with -cmd neighbor to get a table of linked identifiers (elink XML)
  • 55.
      - Produces full XML records and Summaries (Docsums) for many databases
      - Also specialized output for PubMed, sequence databases, Gene and others
      - In many cases Docsums contain enough information (efetch -format docsum == esummary)
      - Parsing values from full XML can be more challenging
  • 56.
      -format        -mode   Report Type
      acc                    Accession Number
      est                    EST Report
      fasta                  FASTA
      fasta          xml     TinySeq XML
      fasta_cds_aa           FASTA of CDS Products
      fasta_cds_na           FASTA of Coding Regions
      ft                     Feature Table
      gb                     GenBank Flatfile
      gb             xml     GBSet XML
      gbc            xml     INSDSet XML
      gbwithparts            GenBank with Contig Sequences
      gene_fasta             FASTA of Gene
      gp                     GenPept Flatfile
      gp             xml     GBSet XML
      gpc            xml     INSDSet XML
      gss                    GSS Report
      ipg                    Identical Protein Report
      ipg            xml     IPGReportSet XML
      native         text    Seq-entry ASN.1
      native         xml     Bioseq-set XML
      seqid                  Seq-id ASN.1
  • 57.
      - General full-featured XML parser
      - Produces tab-delimited output
      - Loops over XML structure using exploration options
      - Prints out selected items (elements) from XML
      - Conditional execution
      - Flexible output formats
  • 58.
      - -pattern places the data from individual records into separate rows.
      - -element extracts values from specified fields into separate columns.
      - -group, -block, and -subset limit element exploration to selected XML subregions.
  • 60. EDirect Example 1: Explore articles and authors for antibiotic resistance genes in metagenomic samples
  • 61. Loop over DocumentSummaries
      Table of Articles
          - Print out PMID, Source, Title
          - Only match <Source>PLoS One</Source>: -match "Source:PLoS One"
      Author counts
          - Report and count Author names: -element Name
          - Use built-in sort-uniq-count-rank to process final output
  • 62.
      - -sep specifies the character placed between values that share one column (for example the comma-joined items of a single '-element' argument)
      - -tab specifies the character placed between columns (the separate values printed by an '-element' argument)
      - Place -sep and -tab before -element
  • 63. [Flow diagram] text query → ESearch (pubmed) → set of PMIDs → ESummary (PubMed) → PubMed Docsums → xtract (match Source:PLoS One) → ID, Source, Title; Name → sort-uniq-count-rank
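The Example 1 flow can be sketched as two short pipelines. This is an illustration under stated assumptions rather than the presenter's exact commands: the slide filters with `-match "Source:PLoS One"`, while the sketch uses the `-if ... -equals` form, which is assumed to be what recent EDirect releases expect; the author ranking is simplified to cover all records, not only PLoS One. `count_rank` is a hypothetical stand-in for sort-uniq-count-rank built from standard tools so the counting step can be tried without EDirect.

```shell
QUERY='antibiotic resistance[MeSH] AND metagenomic'

# Table of articles: PMID, journal, title, restricted to PLoS One.
# Requires EDirect and network access; only runs when called.
plos_one_table() {
  esearch -db pubmed -query "$QUERY" |
    esummary |
    xtract -pattern DocumentSummary -if Source -equals "PLoS One" \
      -element Id Source Title
}

# Author counts: one author name per row, then count and rank.
author_counts() {
  esearch -db pubmed -query "$QUERY" |
    esummary |
    xtract -pattern Author -element Name |
    sort-uniq-count-rank
}

# Hypothetical stand-in for sort-uniq-count-rank (approximation:
# case-sensitive, prints count<TAB>value, highest count first).
count_rank() {
  sort | uniq -c | sort -k1,1nr |
    awk '{ c = $1; sub(/^ *[0-9]+ /, ""); print c "\t" $0 }'
}

# Offline demonstration of the counting step with made-up names.
printf 'Smith J\nLee K\nSmith J\n' | count_rank
```

Keeping each pipeline in its own function makes it easy to inspect the intermediate XML by dropping the final stage.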
  • 64. EDirect Example 2: Retrieve the correct sequence in FASTA format from the nuccore database
  • 65.
      - efetch in nuccore can be used with -chr_start and -chr_stop to adjust from 0-based to 1-based coordinates and strand.
      - You can use xargs to pass the last three columns of the test file to efetch to get all genomic regions
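The coordinate adjustment and the xargs hand-off described above can be sketched as follows. Everything here is illustrative: `to_one_based` is a hypothetical helper that shows the assumed arithmetic (add 1 to each 0-based position, and treat start greater than stop as the minus strand), and `fetch_regions` assumes a whitespace-delimited file whose last three columns are accession, 0-based start and 0-based stop. The layout of the test file used in the webinar is not shown on the slide.

```shell
# Hypothetical helper: convert 0-based start/stop to
# 1-based start, stop and strand (1 = plus, 2 = minus).
to_one_based() {
  awk -v s="$1" -v e="$2" 'BEGIN {
    if (s + 0 > e + 0) { print e + 1, s + 1, 2 }
    else               { print s + 1, e + 1, 1 }
  }'
}

to_one_based 99 199   # prints: 100 200 1
to_one_based 199 99   # prints: 100 200 2

# Pass the last three columns of a file to efetch, three values per call.
# Requires EDirect and network access; only runs when called.
fetch_regions() {
  awk '{ print $(NF-2), $(NF-1), $NF }' "$1" |
    xargs -n 3 sh -c \
      'efetch -db nuccore -format fasta -id "$0" -chr_start "$1" -chr_stop "$2"'
}
```

Letting efetch do the conversion through -chr_start and -chr_stop avoids off-by-one mistakes; the helper is only there to make the arithmetic visible.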
  • 77.
      - Originally published in 2009 (doi:10.1128/AEM.01541-09)
      - Most cited tool for analyzing 16S rRNA gene sequences
          - 3,410 citations (WoS: 1/8/2016)
      - Working on 37th release
      - Overview
          - 100% open source, GPL v3
          - OS independent
          - Command line interface
          - Written in C/C++
      http://www.mothur.org
  • 78.
      - Deposition of 16S rRNA gene sequences to the SRA has been a major problem
      - Worked with SRA staff to make a customized portal to simplify submission of PCR-generated 16S rRNA gene sequences
      - Command enforces co-submission of sample and processing metadata
      - Originally released in March 2015. So far 86 submissions, 61 studies, 6367 runs, 116 Gbp total
      http://www.mothur.org/wiki/make.sra
  • 79.
      1. Provide the necessary MIMARKS* metadata about samples with get.mimarkspackage
      2. Create a project file describing the user and their project using the supplied template file
      3. Parse the MIMARKS file, project file, and sff or fastq files with make.sra to generate an XML file for submission
      4. Email the SRA to let them know about the submission using the mothur-created files and await further instructions
      * http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3367316/
      http://www.mothur.org/wiki/Creating_a_new_submission
  • 80.
      - MIMARKS: minimum information about a marker gene sequence (doi:10.1038/nbt.1823)
      - Command supplies user with a blank text file with sample names and necessary metadata for their environment. User fills in details.
      - For each environmental package and environment there is a wiki page with required and optional parameters and allowed values. Can also be extended to include additional metadata.
      http://www.mothur.org/wiki/Get.mimarkspackage
      http://www.mothur.org/wiki/Human_gut
  • 81.
      USERNAME [UserName]
      Last [LastName]
      First [FirstName]
      EMAIL [Email@mail.com]
      CENTER [University or Center Name]
      TYPE institute
      WEBSITE [www.Website.org]
      ProjectName [ProjectName]
      ProjectTitle [Project Title]
      Description [Project Description]
      Grant id=[GrantID], agency=[GrantAgency], title=[GrantTitle]
      User completes information in brackets
      http://www.mothur.org/wiki/Project_File
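Because every value in the template is supplied inside brackets, a quick check for leftover brackets can catch fields that were never filled in. The sketch below is a hypothetical convenience and not a mothur feature: `unfilled_fields` is an invented name, the sample values are made up, and it assumes a completed project file contains no literal square brackets.

```shell
# Hypothetical check: print every line that still holds a [placeholder].
unfilled_fields() {
  grep -n '\[[^]]*]' "$1"
}

# Example with a partly completed project file (values are made up).
project_file="$(mktemp)"
printf 'USERNAME jdoe\nEMAIL [Email@mail.com]\nTYPE institute\n' > "$project_file"

unfilled_fields "$project_file"   # prints 2:EMAIL [Email@mail.com]
```

An empty result suggests the file is ready to be passed to make.sra together with the MIMARKS file.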

Editor's Notes

  1. Now… with AMR data!
  5. 163 studies with >50,000 cancer patients, generally with matched controls
  6. Make sure you make metadata points here!
  20. All cancer cells arise from a normal somatic cell; therefore most primary cancers express adequate amounts of HLA. Identify the specific peptides that mark the tumor as 'dangerous'. T cells recognize peptides that are presented by human leukocyte antigen (HLA). Tumors harbor hundreds of putative neoepitopes. Without the benefit of information from T cell responses, it's virtually impossible to develop a vaccine, but we can aim at narrowing down the candidate peptides.