SlideShare a Scribd company logo
1 of 21
Download to read offline
Pau Corral Montañés

         RUGBCN
        03/05/2012
VIDEO:
“what we used to do in Bioinformatics“


note: due to memory space limitations, the video has not been embedded into
this presentation.

Content of the video:
A search in NCBI (www.ncbi.nlm.nih.gov) for the entry NC_001477 is done
interactively, leading to a single entry in the database related to Dengue Virus,
complete genome. Thanks to the facilities that the site provides, a FASTA file for
this complete genome is downloaded and thereafter opened with a text editor.

The purpose of the video is to show how tedious is accessing the site and
downleading an entry, with no less than 5 mouse clicks.
Unified records with in-house coding:
        The 3 databases share all sequences, but they use different                   A "mirror" of the content of other databases
        accession numbers (IDs) to refer to each entry.
        They update every night.

                                                                              ACNUC
                                                                                           A series of commands and their arguments were
        NCBI – ex.: #000134                                                                defined that allow
                                           EMBL – ex.: #000012
                                                                                           (1) database opening,
                                                                                           (2) query execution,
                                                                                           (3) annotation and sequence display,
                                                                                           (4) annotation, species and keywords browsing, and
                                                                                           (5) sequence extraction




                                              DDBJ – ex.:
                                              #002221




                                                                                  Other Databases
– NCBI - National Centre for Biotechnology Information - (www.ncbi.nlm.nih.gov)
– EMBL - European Molecular Biology Laboratory - (www.ebi.ac.uk/embl)
– DDBJ - DNA Data Bank of Japan - (www.ddbj.nig.ac.jp)
– ACNUC - http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html
ACNUC retrieval programs:

     i) SeqinR - (http://pbil.univ-lyon1.fr/software/seqinr/)

     ii) C language API - (http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html)

     iii) Python language API - (http://pbil.univ-lyon1.fr/cgi-bin/raapythonhelp.csh)


     a.I) Query_win – GUI client for remote ACNUC retrieval operations
                   (http://pbil.univ-lyon1.fr/software/query_win.html)
     a.II) raa_query – same functionality as Query_win in command line interface
                  (http://pbil.univ-lyon1.fr/software/query.html)
Why use seqinR:
the Wheel          - available packages
Hotline            - discussion forum
Automation         - same source code for different purposes
Reproducibility    - tutorials in vignette format
Fine tuning        - function arguments

Usage of the R environment!!



#Install the package seqinr_3.0-6 (Apirl 2012)

>install.packages(“seqinr“)
package ‘seqinr’ was built under R version 2.14.2




                                                               A vingette:
                                                               http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf
#Choose a mirror
>chooseCRANmirror(mirror_name)
>install.packages("seqinr")

>library(seqinr)

#The command lseqinr() lists all what is defined in the package:
>lseqinr()[1:9]

  [1]   "a"              "aaa"
  [3]   "AAstat"         "acnucclose"
  [5]   "acnucopen"      "al2bp"
  [7]   "alllistranks"   "alr"
  [9]   "amb"

>length(lseqinr())
[1] 209
How many different ways are there to work with biological sequences using SeqinR?

1)Sequences you have locally:

     i) read.fasta() and s2c() and c2s() and GC() and count() and translate()
     ii) write.fasta()
     iii) read.alignment() and consensus()

2) Sequences you download from a Database:

     i) browse Databases
     ii) query() and getSequence()
FASTA files example:
 Example with DNA data: (4 different characters, normally)

                                                                   The FASTA format is very
                                                                   simple and widely used for
                                                                   simple import of biological
                                                                   sequences.
                                                                   It begins with a single-line
                                                                   description starting with a
                                                                   character '>', followed by
                                                                   lines of sequence data of
                                                                   maximum 80 character each.
                                                                   Lines starting with a semi-
                                                                   colon character ';' are
                                                                   comment lines.

  Example with Protein data: (20 different characters, normally)
                                                                   Check Wikipedia for:
                                                                   i)Sequence representation
                                                                   ii)Sequence identifiers
                                                                   iii)File extensions
Read a file with read.fasta()

#Read the sequence from a local directory
> setwd("H:/Documents and Settings/Pau/Mis documentos/R_test")
> dir()
[1] "dengue_whole_sequence.fasta"

#Use the read.fasta (see next slide) function to load the sequence
> read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA")
$`gi|9626685|ref|NC_001477.1|`
[1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t" "c" "t" "a" "c" "g" "t" "g" "g"
[19] "a" "c" "c" "g" "a" "c" "a" "a" "g" "a" "a" "c" "a" "g" "t" "t" "t" "c"
[37] "g" "a" "a" "t" "c" "g" "g" "a" "a" "g" "c" "t" "t" "g" "c" "t" "t" "a"
[55] "a" "c" "g" "t" "a" "g" "t" "t" "c" "t" "a" "a" "c" "a" "g" "t" "t" "t"
  [............................................................]
[10711] "t" "g" "g" "t" "g" "c" "t" "g" "t" "t" "g" "a" "a" "t" "c" "a" "a" "c"
[10729] "a" "g" "g" "t" "t" "c" "t"
attr(,"name")
[1] "gi|9626685|ref|NC_001477.1|"
attr(,"Annot")
[1] ">gi|9626685|ref|NC_001477.1| Dengue virus 1, complete genome"
attr(,"class")
[1] "SeqFastadna"



> read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
$`gi|9626685|ref|NC_001477.1|`
[1] "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"

> seq <- read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F)
> str(seq)
List of 1
 $ gi|9626685|ref|NC_001477.1|: chr
"agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
Usage:
rea d.fa s ta (file, seqtype = c("DNA", "AA"), as.string = FALSE, forceDNAtolower =
TRUE, set.attributes = TRUE, legacy.mode = TRUE, seqonly = FALSE,                   strip.desc
= FALSE, bfa = FALSE, sizeof.longlong = .Machine$sizeof.longlong,
    endian = .Platform$endian, apply.mask = TRUE)

Arguments:
file - path (relative [getwd] is used if absoulte is not given) to FASTA file
seqtype - the nature of the sequence: DNA or AA
as.string - if TRUE sequences are returned as a string instead of a vector characters
forceDNAtolower - lower- or upper-case
set.attributes - whether sequence attributes should be set
legacy.mode - if TRUE lines starting with a semicolon ’;’ are ignored
seqonly - if TRUE, only sequences as returned (execution time is divided approximately by a factor 3)
strip.desc - if TRUE, removes the '>' at the beginning
bfa - if TRUE the fasta file is in MAQ binary format sizeof.longlong
endian - relative to MAQ files
apply.mask - relative to MAQ files

Value:
a list of vector of chars
Basic manipulations:
#Turn seqeunce into characters and count how many are there
> length(s2c(seq[[1]]))
[1] 10735

#Count how many different accurrences are there
> table(s2c(seq[[1]]))
   a    c    g    t
3426 2240 2770 2299

#Count the fraction of G and C bases in the sequence
> GC(s2c(seq[[1]]))
[1] 0.4666977

#Count all possible words in a sequence with a sliding window of size = wordsize
> seq_2 <- "actg"
> count(s2c(seq_2), wordsize=2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0

> count(s2c(seq_2), wordsize=2, by=2)
aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt
 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0

#Translate into aminoacids the genetic sequence.
> c2s(translate(s2c(seq[[1]]), frame=0, sens='F', numcode=1))
[1] "SC*STWTDKNSFESEACLT*F*QFFIREQISDEQPTEKDGSTVFQYAETREKPRVNCFTVGEEILK...-Truncated-...
RIAFRPRTHEIGDGFYSIPKISSHTSNSRNFG*MGLIQEEWSDQGDPSQDTTQQRGPTPGEAVPWW*GLEVRGDPPHNNKQHIDAGRD
QRSCCLYSIIPGTERQKMEWCC*INRF"
#Note:
#    1)frame=0 means that first nuc. is taken as the first position in the codon
#    2)sense='F' means that the sequence is tranlated in the forward sense
#    3)numcode=1 means standard genetic code is used (in SeqinR, up to 23 variations)
Write a FASTA file (and how to save time)
> dir()
[1] "dengue_whole_sequence.fasta" "ER.fasta"
> system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   3.28    0.00    3.33

> length(seq)
[1] 9984
> seq[[1]]
[1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagcctt..._truncated_...ctcata"
> names(seq)[1:5]
[1] "ER0101A01F" "ER0101A03F" "ER0101A05F" "ER0101A07F" "ER0101A09F"

#Clean the sequences keeping only the ones > 25 nucleotides
> char_seq = rapply(seq, s2c, how="list")       # create a list turning strings to characters
> len_seq = rapply(char_seq, length, how="list") # count how many characters are in each list
> bigger_25_list = c()
> for (val in 1:length(len_seq)){
+     if (len_seq[[val]] >= 25){
+         bigger_25_list = c(bigger_25_list, val)
+     }
+ }
> seq_25 = seq[bigger_25_list]            #indexing to get the desired list
> length(seq_25)
[1] 8928
> seq_25[1]
$ER0101A01F
[1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagccttctcatacct...truncated..."
#Write a FASTA file (see next slide)
> write.fasta(seq_25, names = names(seq_25), file.out="clean_seq.fasta", open="w")
> dir()
[1] "clean_seq.fasta" "dengue_whole_sequence.fasta" "ER.fasta"
Usage:
w rite.fa s ta (sequences, names, file.out, open = "w", nbchar = 80)

Arguments:
sequences - A DNA or protein sequence (in the form of a vector of single characters) or a list of such sequences.
names - The name(s) of the sequences.
file.out - The name of the output file.
open - Open the output file, use "w" to write into a new file, use "a" to append at the end of an already existing file.
nbchar - The number of characters per line (default: 60)


Value:
none in the R space. A FASTA formatted file is created
Write a FASTA file (and how to save time)
# Remember what we did before
> system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   3.28    0.00    3.33


# After cleanig seq, we produced a file "clean_seq.fasta" that we read now
> system.time(seq_1 <- read.fasta(file="clean_seq.fasta", seqtype="DNA", as.string=T, set.attributes=F))
   user system elapsed
   1.30    0.01    1.31

#Time can be saved with the save function:
> save(seq_25, file = "ER_CLEAN_seqs.RData")

> system.time(load("ER_CLEAN_seqs.RData"))
   user system elapsed
   0.11    0.02    0.12
Read an alignment (or create an alignment object to be aligned):
#Create an alignment object through reading a FASTA file with two sequences (see next slide)
> fasta <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"), format ="fasta")
> fasta
$nb
[1] 2
$nam
[1] "LmjF01.0030" "LinJ01.0030"
$seq
$seq[[1]]
[1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
$seq[[2]]
[1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..."
$com
[1] NA
attr(,"class")
[1] "alignment"

# The consensus() function aligns the two sequences, producing a consensus sequences. IUPAC symbology is used.
> fixed_align = consensus(fasta, method="IUPAC")

> table(fixed_align)
fixed_align
  a   c   g   k   m  r    s   t   w    y
411 636 595   3   5 20   13 293   2   20
Usage:
rea d.a lig nm ent(file, format, forceToLower = TRUE)

Arguments:
file - The name of the file which the aligned sequences are to be read from. If it does not contain an absolute or relative
path, the file name is relative to the current working directory, getwd.
format - A character string specifying the format of the file: mas e, clus tal, phylip, fas ta or ms f
forceToLower - A logical defaulting to TRUE stating whether the returned characters in the sequence should be in
lower case



Value:
An object is created of class alignment which is a list with the following components:
nb ->the number of aligned sequences
nam ->a vector of strings containing the names of the aligned sequences
seq ->a vector of strings containing the aligned sequences
com ->a vector of strings containing the commentaries for each sequence or NA if there are no comments
Access a remote server:
> choosebank()
 [1] "genbank"       "embl"          "emblwgs"         "swissprot"   "ensembl"       "hogenom"
 [7] "hogenomdna"    "hovergendna"   "hovergen"        "hogenom5"    "hogenom5dna"   "hogenom4"
[13] "hogenom4dna"   "homolens"      "homolensdna"     "hobacnucl"   "hobacprot"     "phever2"
[19] "phever2dna"    "refseq"        "greviews"        "bacterial"   "protozoan"     "ensbacteria"
[25] "ensprotists"   "ensfungi"      "ensmetazoa"      "ensplants"   "mito"          "polymorphix"
[31] "emglib"        "taxobacgen"    "refseqViruses"

#Access a bank and see complementary information:
> choosebank(bank="genbank", infobank=T)
> ls()
[1] "banknameSocket"
> str(banknameSocket)
List of 9
 $ socket :Classes 'sockconn', 'connection' atomic [1:1] 3
  .. ..- attr(*, "conn_id")=<externalptr>
 $ bankname: chr "genbank"
 $ banktype: chr "GENBANK"
 $ totseqs : num 1.65e+08
 $ totspecs: num 968157
 $ totkeys : num 3.1e+07
 $ release : chr "           GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
 $ status :Class 'AsIs' chr "on"
 $ details : chr [1:4] "              ****     ACNUC Data Base Content      ****                        " "
      GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" "139,677,722,280 bases; 152,280,170
sequences; 12,313,982 subseqs; 684,079 refers." "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive,
Universite Lyon I "
> banknameSocket$details
[1] "              ****     ACNUC Data Base Content      ****                        "
[2] "           GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012"
[3] "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers."
[4] "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
Make a query:
# query (see next slide) all sequences that contain the words "virus" and "dengue" in the taxonomy field and
# that are not partial sequences
> system.time(query("All_Dengue_viruses_NOTpartial", ""sp=@virus@" AND "sp=@dengue@" AND
NOT "k=partial""))
   user system elapsed
   0.78    0.00    6.72

> All_Dengue_viruses_NOTpartial[1:4]
$call
query(listname = "All_Dengue_viruses_NOTpartial", query = ""sp=@virus@" AND "sp=@dengue@" AND
NOT "k=partial"")
$name
[1] "All_Dengue_viruses_NOTpartial"
$nelem
[1] 7741
$typelist
[1] "SQ"

> All_Dengue_viruses_NOTpartial[[5]][[1]]
    name   length    frame   ncbicg
"A13666"    "456"      "0"      "1"

> myseq = getSequence(All_Dengue_viruses_NOTpartial[[5]][[1]])
> myseq[1:20]
 [1] "a" "t" "g" "g" "c" "c" "a" "t" "g" "g" "a" "c" "c" "t" "t" "g" "g" "t" "g" "a"

> closebank()
Usage:
query(listname, query, socket = autosocket(), invisible = T, verbose = F, virtual = F)

Arguments:
listname - The name of the list as a quoted string of chars
query - A quoted string of chars containing the request with the syntax given in the details section
socket - An object of class sockconn connecting to a remote ACNUC database (default is a socket to the last opened
database).
invisible - if FALSE, the result is returned visibly.
verbose - if TRUE, verbose mode is on
virtual - if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req
component of the list is set to NA.

Value:
The result is a list with the following 6 components:

call - the original call
name - the ACNUC list name
nelem - the number of elements (for instance sequences) in the ACNUC list
typelist - the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for
a list of species names.
req - a list of sequence names that fit the required criteria or NA when called with parameter virtual is TRUE
socket - the socket connection that was used
Pau Corral Montañés

         RUGBCN
        03/05/2012

More Related Content

What's hot

Next generation sequencing for Identification and Characterization of plant v...
Next generation sequencing for Identification and Characterization of plant v...Next generation sequencing for Identification and Characterization of plant v...
Next generation sequencing for Identification and Characterization of plant v...Malyaj R Prajapati
 
Seguridad y amenazas en la red.
Seguridad y amenazas en la red.Seguridad y amenazas en la red.
Seguridad y amenazas en la red.guestf3ba8a
 
System and web security
System and web securitySystem and web security
System and web securitychirag patil
 
Six Degrees of Domain Admin - BloodHound at DEF CON 24
Six Degrees of Domain Admin - BloodHound at DEF CON 24Six Degrees of Domain Admin - BloodHound at DEF CON 24
Six Degrees of Domain Admin - BloodHound at DEF CON 24Andy Robbins
 
MongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB
 
Single Nucleotide Polymorphism Analysis (SNPs)
Single Nucleotide Polymorphism Analysis (SNPs)Single Nucleotide Polymorphism Analysis (SNPs)
Single Nucleotide Polymorphism Analysis (SNPs)Data Science Thailand
 
FastQC and Prinseqlite
FastQC and PrinseqliteFastQC and Prinseqlite
FastQC and PrinseqliteRavi Gandham
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5BITS
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...QIAGEN
 
Big picture of category theory in scala with deep dive into contravariant and...
Big picture of category theory in scala with deep dive into contravariant and...Big picture of category theory in scala with deep dive into contravariant and...
Big picture of category theory in scala with deep dive into contravariant and...Piotr Paradziński
 
Digital Anti-Forensics: Emerging trends in data transformation techniques
Digital Anti-Forensics: Emerging trends in data transformation techniquesDigital Anti-Forensics: Emerging trends in data transformation techniques
Digital Anti-Forensics: Emerging trends in data transformation techniquesSeccuris Inc.
 
Introduction to Next-Generation Sequencing (NGS) Technology
Introduction to Next-Generation Sequencing (NGS) TechnologyIntroduction to Next-Generation Sequencing (NGS) Technology
Introduction to Next-Generation Sequencing (NGS) TechnologyQIAGEN
 
Network Security Nmap N Nessus
Network Security Nmap N NessusNetwork Security Nmap N Nessus
Network Security Nmap N NessusUtkarsh Verma
 
INTRODUCTION TO CYBER FORENSICS
INTRODUCTION TO CYBER FORENSICSINTRODUCTION TO CYBER FORENSICS
INTRODUCTION TO CYBER FORENSICSSylvain Martinez
 

What's hot (20)

Next generation sequencing for Identification and Characterization of plant v...
Next generation sequencing for Identification and Characterization of plant v...Next generation sequencing for Identification and Characterization of plant v...
Next generation sequencing for Identification and Characterization of plant v...
 
Seguridad y amenazas en la red.
Seguridad y amenazas en la red.Seguridad y amenazas en la red.
Seguridad y amenazas en la red.
 
Illumina sequencing introduction
Illumina sequencing introductionIllumina sequencing introduction
Illumina sequencing introduction
 
High throughput genotyping
High throughput genotypingHigh throughput genotyping
High throughput genotyping
 
System and web security
System and web securitySystem and web security
System and web security
 
Six Degrees of Domain Admin - BloodHound at DEF CON 24
Six Degrees of Domain Admin - BloodHound at DEF CON 24Six Degrees of Domain Admin - BloodHound at DEF CON 24
Six Degrees of Domain Admin - BloodHound at DEF CON 24
 
NGS: Mapping and de novo assembly
NGS: Mapping and de novo assemblyNGS: Mapping and de novo assembly
NGS: Mapping and de novo assembly
 
MongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad QueryMongoDB World 2019: The Sights (and Smells) of a Bad Query
MongoDB World 2019: The Sights (and Smells) of a Bad Query
 
Single Nucleotide Polymorphism Analysis (SNPs)
Single Nucleotide Polymorphism Analysis (SNPs)Single Nucleotide Polymorphism Analysis (SNPs)
Single Nucleotide Polymorphism Analysis (SNPs)
 
FastQC and Prinseqlite
FastQC and PrinseqliteFastQC and Prinseqlite
FastQC and Prinseqlite
 
Jan2016 pac bio giab
Jan2016 pac bio giabJan2016 pac bio giab
Jan2016 pac bio giab
 
RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5RNA-seq for DE analysis: detecting differential expression - part 5
RNA-seq for DE analysis: detecting differential expression - part 5
 
GWAS Study.pdf
GWAS Study.pdfGWAS Study.pdf
GWAS Study.pdf
 
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
NGS Targeted Enrichment Technology in Cancer Research: NGS Tech Overview Webi...
 
Big picture of category theory in scala with deep dive into contravariant and...
Big picture of category theory in scala with deep dive into contravariant and...Big picture of category theory in scala with deep dive into contravariant and...
Big picture of category theory in scala with deep dive into contravariant and...
 
Digital Anti-Forensics: Emerging trends in data transformation techniques
Digital Anti-Forensics: Emerging trends in data transformation techniquesDigital Anti-Forensics: Emerging trends in data transformation techniques
Digital Anti-Forensics: Emerging trends in data transformation techniques
 
Introduction to Next-Generation Sequencing (NGS) Technology
Introduction to Next-Generation Sequencing (NGS) TechnologyIntroduction to Next-Generation Sequencing (NGS) Technology
Introduction to Next-Generation Sequencing (NGS) Technology
 
Network Security Nmap N Nessus
Network Security Nmap N NessusNetwork Security Nmap N Nessus
Network Security Nmap N Nessus
 
Overview of Single-Cell RNA-seq
Overview of Single-Cell RNA-seqOverview of Single-Cell RNA-seq
Overview of Single-Cell RNA-seq
 
INTRODUCTION TO CYBER FORENSICS
INTRODUCTION TO CYBER FORENSICSINTRODUCTION TO CYBER FORENSICS
INTRODUCTION TO CYBER FORENSICS
 

Viewers also liked

Phylogenetics in R
Phylogenetics in RPhylogenetics in R
Phylogenetics in Rschamber
 
Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015Paul Richards
 
Phylogenetics Analysis in R
Phylogenetics Analysis in RPhylogenetics Analysis in R
Phylogenetics Analysis in RKlaus Schliep
 
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)Kate Hertweck
 
Where can tell me who I am?
Where can tell me who I am?Where can tell me who I am?
Where can tell me who I am?seltzoid
 
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.Rakesh Kumar
 
Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)Jesse Vincent
 
Exotic Orient
Exotic OrientExotic Orient
Exotic OrientRenny
 
IBM Big Data References
IBM Big Data ReferencesIBM Big Data References
IBM Big Data ReferencesRob Thomas
 
Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2Paul Pivec
 
Blackwell Esteem AFSL
Blackwell Esteem AFSLBlackwell Esteem AFSL
Blackwell Esteem AFSLsamueltay77
 
Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Sajib Hossain Akash
 
Exposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFDExposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFDPhillip CFD
 
360Gate Business Objects portal
360Gate Business Objects portal360Gate Business Objects portal
360Gate Business Objects portalSebastien Goiffon
 
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)Sayogyo Rahman Doko
 
Tonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentationTonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentationStanford University
 
ShareThis Auto Study
ShareThis Auto Study ShareThis Auto Study
ShareThis Auto Study ShareThis
 

Viewers also liked (20)

Phylogenetics in R
Phylogenetics in RPhylogenetics in R
Phylogenetics in R
 
Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015Phylogeny in R - Bianca Santini Sheffield R Users March 2015
Phylogeny in R - Bianca Santini Sheffield R Users March 2015
 
Phylogenetics Analysis in R
Phylogenetics Analysis in RPhylogenetics Analysis in R
Phylogenetics Analysis in R
 
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
Evolution of transposons, genomes, and organisms (Hertweck Fall 2014)
 
Where can tell me who I am?
Where can tell me who I am?Where can tell me who I am?
Where can tell me who I am?
 
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
Zed-Sales™ - a flagship product of Zed-Axis Technologies Pvt. Ltd.
 
Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)Hacking your Kindle (OSCON Lightning Talk)
Hacking your Kindle (OSCON Lightning Talk)
 
Exotic Orient
Exotic OrientExotic Orient
Exotic Orient
 
IBM Big Data References
IBM Big Data ReferencesIBM Big Data References
IBM Big Data References
 
Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2Engage Workshop Berlin09 Part2
Engage Workshop Berlin09 Part2
 
Blackwell Esteem AFSL
Blackwell Esteem AFSLBlackwell Esteem AFSL
Blackwell Esteem AFSL
 
Bailey capítulo-6
Bailey capítulo-6Bailey capítulo-6
Bailey capítulo-6
 
Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.Iman bysajib hossain akash-01725-340978.
Iman bysajib hossain akash-01725-340978.
 
Exposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFDExposing Opportunities in China A50 using CFD
Exposing Opportunities in China A50 using CFD
 
Gscm1
Gscm1Gscm1
Gscm1
 
360Gate Business Objects portal
360Gate Business Objects portal360Gate Business Objects portal
360Gate Business Objects portal
 
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
Ushul Firaq wal Adyaan wal Madzaahib Al Fikriyah (Ar)
 
Geohash
GeohashGeohash
Geohash
 
Tonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentationTonometer Final NSF I-Corps presentation
Tonometer Final NSF I-Corps presentation
 
ShareThis Auto Study
ShareThis Auto Study ShareThis Auto Study
ShareThis Auto Study
 

Similar to SeqinR - biological data handling

Parsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernelParsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernelchk49
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Li Shen
 
Learning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesLearning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesDongsun Kim
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Li Shen
 
Profiling distributed Java applications
Profiling distributed Java applicationsProfiling distributed Java applications
Profiling distributed Java applicationsConstantine Slisenka
 
Как разработать DBFW с нуля
Как разработать DBFW с нуляКак разработать DBFW с нуля
Как разработать DBFW с нуляPositive Hack Days
 
Database Firewall from Scratch
Database Firewall from ScratchDatabase Firewall from Scratch
Database Firewall from ScratchDenis Kolegov
 
Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011Agora Group
 
2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekinge2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekingeProf. Wim Van Criekinge
 
Fosdem10
Fosdem10Fosdem10
Fosdem10wremes
 
Language Integrated Query - LINQ
Language Integrated Query - LINQLanguage Integrated Query - LINQ
Language Integrated Query - LINQDoncho Minkov
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Hajime Tazaki
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8marctritschler
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Marc Tritschler
 
DotNet Introduction
DotNet IntroductionDotNet Introduction
DotNet IntroductionWei Sun
 

Similar to SeqinR - biological data handling (20)

Parsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernelParsing and Type checking all 2^10000 configurations of the Linux kernel
Parsing and Type checking all 2^10000 configurations of the Linux kernel
 
2015 bioinformatics bio_python
2015 bioinformatics bio_python2015 bioinformatics bio_python
2015 bioinformatics bio_python
 
Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015Next-generation sequencing data format and visualization with ngs.plot 2015
Next-generation sequencing data format and visualization with ngs.plot 2015
 
TiReX: Tiled Regular eXpression matching architecture
TiReX: Tiled Regular eXpression matching architectureTiReX: Tiled Regular eXpression matching architecture
TiReX: Tiled Regular eXpression matching architecture
 
Biopython
BiopythonBiopython
Biopython
 
Learning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method NamesLearning to Spot and Refactor Inconsistent Method Names
Learning to Spot and Refactor Inconsistent Method Names
 
Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2Bioinfo ngs data format visualization v2
Bioinfo ngs data format visualization v2
 
Profiling distributed Java applications
Profiling distributed Java applicationsProfiling distributed Java applications
Profiling distributed Java applications
 
Как разработать DBFW с нуля
Как разработать DBFW с нуляКак разработать DBFW с нуля
Как разработать DBFW с нуля
 
Database Firewall from Scratch
Database Firewall from ScratchDatabase Firewall from Scratch
Database Firewall from Scratch
 
Biopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and OutlookBiopython: Overview, State of the Art and Outlook
Biopython: Overview, State of the Art and Outlook
 
Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011Terence Barr - jdk7+8 - 24mai2011
Terence Barr - jdk7+8 - 24mai2011
 
Java
JavaJava
Java
 
2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekinge2016 bioinformatics i_bio_python_wimvancriekinge
2016 bioinformatics i_bio_python_wimvancriekinge
 
Fosdem10
Fosdem10Fosdem10
Fosdem10
 
Language Integrated Query - LINQ
Language Integrated Query - LINQLanguage Integrated Query - LINQ
Language Integrated Query - LINQ
 
Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014Direct Code Execution - LinuxCon Japan 2014
Direct Code Execution - LinuxCon Japan 2014
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8
 
Java user group 2015 02-09-java8
Java user group 2015 02-09-java8Java user group 2015 02-09-java8
Java user group 2015 02-09-java8
 
DotNet Introduction
DotNet IntroductionDotNet Introduction
DotNet Introduction
 

Recently uploaded

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelDeepika Singh
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 

Recently uploaded (20)

Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 

SeqinR - biological data handling

  • 1. Pau Corral Montañés RUGBCN 03/05/2012
  • 2. VIDEO: “what we used to do in Bioinformatics“ note: due to memory space limitations, the video has not been embedded into this presentation. Content of the video: A search in NCBI (www.ncbi.nlm.nih.gov) for the entry NC_001477 is done interactively, leading to a single entry in the database related to Dengue Virus, complete genome. Thanks to the facilities that the site provides, a FASTA file for this complete genome is downloaded and thereafter opened with a text editor. The purpose of the video is to show how tedious is accessing the site and downleading an entry, with no less than 5 mouse clicks.
  • 3. Unified records with in-house coding: The 3 databases share all sequences, but they use different A "mirror" of the content of other databases accession numbers (IDs) to refer to each entry. They update every night. ACNUC A series of commands and their arguments were NCBI – ex.: #000134 defined that allow EMBL – ex.: #000012 (1) database opening, (2) query execution, (3) annotation and sequence display, (4) annotation, species and keywords browsing, and (5) sequence extraction DDBJ – ex.: #002221 Other Databases – NCBI - National Centre for Biotechnology Information - (www.ncbi.nlm.nih.gov) – EMBL - European Molecular Biology Laboratory - (www.ebi.ac.uk/embl) – DDBJ - DNA Data Bank of Japan - (www.ddbj.nig.ac.jp) – ACNUC - http://pbil.univ-lyon1.fr/databases/acnuc/acnuc.html
  • 4.
  • 5. ACNUC retrieval programs: i) SeqinR - (http://pbil.univ-lyon1.fr/software/seqinr/) ii) C language API - (http://pbil.univ-lyon1.fr/databases/acnuc/raa_acnuc.html) iii) Python language API - (http://pbil.univ-lyon1.fr/cgi-bin/raapythonhelp.csh) a.I) Query_win – GUI client for remote ACNUC retrieval operations (http://pbil.univ-lyon1.fr/software/query_win.html) a.II) raa_query – same functionality as Query_win in command line interface (http://pbil.univ-lyon1.fr/software/query.html)
  • 6. Why use seqinR: the Wheel - available packages Hotline - discussion forum Automation - same source code for different purposes Reproducibility - tutorials in vignette format Fine tuning - function arguments Usage of the R environment!! #Install the package seqinr_3.0-6 (Apirl 2012) >install.packages(“seqinr“) package ‘seqinr’ was built under R version 2.14.2 A vingette: http://seqinr.r-forge.r-project.org/seqinr_2_0-7.pdf
  • 7. #Choose a mirror >chooseCRANmirror(mirror_name) >install.packages("seqinr") >library(seqinr) #The command lseqinr() lists all what is defined in the package: >lseqinr()[1:9] [1] "a" "aaa" [3] "AAstat" "acnucclose" [5] "acnucopen" "al2bp" [7] "alllistranks" "alr" [9] "amb" >length(lseqinr()) [1] 209
  • 8. How many different ways are there to work with biological sequences using SeqinR? 1)Sequences you have locally: i) read.fasta() and s2c() and c2s() and GC() and count() and translate() ii) write.fasta() iii) read.alignment() and consensus() 2) Sequences you download from a Database: i) browse Databases ii) query() and getSequence()
  • 9. FASTA files example: Example with DNA data: (4 different characters, normally) The FASTA format is very simple and widely used for simple import of biological sequences. It begins with a single-line description starting with a character '>', followed by lines of sequence data of maximum 80 character each. Lines starting with a semi- colon character ';' are comment lines. Example with Protein data: (20 different characters, normally) Check Wikipedia for: i)Sequence representation ii)Sequence identifiers iii)File extensions
  • 10. Read a file with read.fasta() #Read the sequence from a local directory > setwd("H:/Documents and Settings/Pau/Mis documentos/R_test") > dir() [1] "dengue_whole_sequence.fasta" #Use the read.fasta (see next slide) function to load the sequence > read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA") $`gi|9626685|ref|NC_001477.1|` [1] "a" "g" "t" "t" "g" "t" "t" "a" "g" "t" "c" "t" "a" "c" "g" "t" "g" "g" [19] "a" "c" "c" "g" "a" "c" "a" "a" "g" "a" "a" "c" "a" "g" "t" "t" "t" "c" [37] "g" "a" "a" "t" "c" "g" "g" "a" "a" "g" "c" "t" "t" "g" "c" "t" "t" "a" [55] "a" "c" "g" "t" "a" "g" "t" "t" "c" "t" "a" "a" "c" "a" "g" "t" "t" "t" [............................................................] [10711] "t" "g" "g" "t" "g" "c" "t" "g" "t" "t" "g" "a" "a" "t" "c" "a" "a" "c" [10729] "a" "g" "g" "t" "t" "c" "t" attr(,"name") [1] "gi|9626685|ref|NC_001477.1|" attr(,"Annot") [1] ">gi|9626685|ref|NC_001477.1| Dengue virus 1, complete genome" attr(,"class") [1] "SeqFastadna" > read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F) $`gi|9626685|ref|NC_001477.1|` [1] "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......" > seq <- read.fasta(file="dengue_whole_sequence.fasta", seqtype="DNA", as.string=T, set.attributes=F) > str(seq) List of 1 $ gi|9626685|ref|NC_001477.1|: chr "agttgttagtctacgtggaccgacaagaacagtttcgaatcggaagcttgcttaacgtag...__truncated__......"
  • 11. Usage: rea d.fa s ta (file, seqtype = c("DNA", "AA"), as.string = FALSE, forceDNAtolower = TRUE, set.attributes = TRUE, legacy.mode = TRUE, seqonly = FALSE, strip.desc = FALSE, bfa = FALSE, sizeof.longlong = .Machine$sizeof.longlong, endian = .Platform$endian, apply.mask = TRUE) Arguments: file - path (relative [getwd] is used if absoulte is not given) to FASTA file seqtype - the nature of the sequence: DNA or AA as.string - if TRUE sequences are returned as a string instead of a vector characters forceDNAtolower - lower- or upper-case set.attributes - whether sequence attributes should be set legacy.mode - if TRUE lines starting with a semicolon ’;’ are ignored seqonly - if TRUE, only sequences as returned (execution time is divided approximately by a factor 3) strip.desc - if TRUE, removes the '>' at the beginning bfa - if TRUE the fasta file is in MAQ binary format sizeof.longlong endian - relative to MAQ files apply.mask - relative to MAQ files Value: a list of vector of chars
  • 12. Basic manipulations: #Turn seqeunce into characters and count how many are there > length(s2c(seq[[1]])) [1] 10735 #Count how many different accurrences are there > table(s2c(seq[[1]])) a c g t 3426 2240 2770 2299 #Count the fraction of G and C bases in the sequence > GC(s2c(seq[[1]])) [1] 0.4666977 #Count all possible words in a sequence with a sliding window of size = wordsize > seq_2 <- "actg" > count(s2c(seq_2), wordsize=2) aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 > count(s2c(seq_2), wordsize=2, by=2) aa ac ag at ca cc cg ct ga gc gg gt ta tc tg tt 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 #Translate into aminoacids the genetic sequence. > c2s(translate(s2c(seq[[1]]), frame=0, sens='F', numcode=1)) [1] "SC*STWTDKNSFESEACLT*F*QFFIREQISDEQPTEKDGSTVFQYAETREKPRVNCFTVGEEILK...-Truncated-... RIAFRPRTHEIGDGFYSIPKISSHTSNSRNFG*MGLIQEEWSDQGDPSQDTTQQRGPTPGEAVPWW*GLEVRGDPPHNNKQHIDAGRD QRSCCLYSIIPGTERQKMEWCC*INRF" #Note: # 1)frame=0 means that first nuc. is taken as the first position in the codon # 2)sense='F' means that the sequence is tranlated in the forward sense # 3)numcode=1 means standard genetic code is used (in SeqinR, up to 23 variations)
  • 13. Write a FASTA file (and how to save time) > dir() [1] "dengue_whole_sequence.fasta" "ER.fasta" > system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F)) user system elapsed 3.28 0.00 3.33 > length(seq) [1] 9984 > seq[[1]] [1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagcctt..._truncated_...ctcata" > names(seq)[1:5] [1] "ER0101A01F" "ER0101A03F" "ER0101A05F" "ER0101A07F" "ER0101A09F" #Clean the sequences keeping only the ones > 25 nucleotides > char_seq = rapply(seq, s2c, how="list") # create a list turning strings to characters > len_seq = rapply(char_seq, length, how="list") # count how many characters are in each list > bigger_25_list = c() > for (val in 1:length(len_seq)){ + if (len_seq[[val]] >= 25){ + bigger_25_list = c(bigger_25_list, val) + } + } > seq_25 = seq[bigger_25_list] #indexing to get the desired list > length(seq_25) [1] 8928 > seq_25[1] $ER0101A01F [1] "gtggtatcaacgcagagtacgcggggacgtttatatggacgctcctacaaggaaaccctagccttctcatacct...truncated..." #Write a FASTA file (see next slide) > write.fasta(seq_25, names = names(seq_25), file.out="clean_seq.fasta", open="w") > dir() [1] "clean_seq.fasta" "dengue_whole_sequence.fasta" "ER.fasta"
  • 14. Usage: w rite.fa s ta (sequences, names, file.out, open = "w", nbchar = 80) Arguments: sequences - A DNA or protein sequence (in the form of a vector of single characters) or a list of such sequences. names - The name(s) of the sequences. file.out - The name of the output file. open - Open the output file, use "w" to write into a new file, use "a" to append at the end of an already existing file. nbchar - The number of characters per line (default: 60) Value: none in the R space. A FASTA formatted file is created
  • 15. Write a FASTA file (and how to save time) # Remember what we did before > system.time(seq <- read.fasta(file="ER.fasta", seqtype="DNA", as.string=T, set.attributes=F)) user system elapsed 3.28 0.00 3.33 # After cleanig seq, we produced a file "clean_seq.fasta" that we read now > system.time(seq_1 <- read.fasta(file="clean_seq.fasta", seqtype="DNA", as.string=T, set.attributes=F)) user system elapsed 1.30 0.01 1.31 #Time can be saved with the save function: > save(seq_25, file = "ER_CLEAN_seqs.RData") > system.time(load("ER_CLEAN_seqs.RData")) user system elapsed 0.11 0.02 0.12
  • 16. Read an alignment (or create an alignment object to be aligned): #Create an alignment object through reading a FASTA file with two sequences (see next slide) > fasta <- read.alignment(file = system.file("sequences/Anouk.fasta", package = "seqinr"), format ="fasta") > fasta $nb [1] 2 $nam [1] "LmjF01.0030" "LinJ01.0030" $seq $seq[[1]] [1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..." $seq[[2]] [1] "atgatgtcggccgagccgccgtcgtcgcagccgtacatcagcgacgtgctgcggcggtaccagc...truncated..." $com [1] NA attr(,"class") [1] "alignment" # The consensus() function aligns the two sequences, producing a consensus sequences. IUPAC symbology is used. > fixed_align = consensus(fasta, method="IUPAC") > table(fixed_align) fixed_align a c g k m r s t w y 411 636 595 3 5 20 13 293 2 20
  • 17. Usage: rea d.a lig nm ent(file, format, forceToLower = TRUE) Arguments: file - The name of the file which the aligned sequences are to be read from. If it does not contain an absolute or relative path, the file name is relative to the current working directory, getwd. format - A character string specifying the format of the file: mas e, clus tal, phylip, fas ta or ms f forceToLower - A logical defaulting to TRUE stating whether the returned characters in the sequence should be in lower case Value: An object is created of class alignment which is a list with the following components: nb ->the number of aligned sequences nam ->a vector of strings containing the names of the aligned sequences seq ->a vector of strings containing the aligned sequences com ->a vector of strings containing the commentaries for each sequence or NA if there are no comments
  • 18. Access a remote server: > choosebank() [1] "genbank" "embl" "emblwgs" "swissprot" "ensembl" "hogenom" [7] "hogenomdna" "hovergendna" "hovergen" "hogenom5" "hogenom5dna" "hogenom4" [13] "hogenom4dna" "homolens" "homolensdna" "hobacnucl" "hobacprot" "phever2" [19] "phever2dna" "refseq" "greviews" "bacterial" "protozoan" "ensbacteria" [25] "ensprotists" "ensfungi" "ensmetazoa" "ensplants" "mito" "polymorphix" [31] "emglib" "taxobacgen" "refseqViruses" #Access a bank and see complementary information: > choosebank(bank="genbank", infobank=T) > ls() [1] "banknameSocket" > str(banknameSocket) List of 9 $ socket :Classes 'sockconn', 'connection' atomic [1:1] 3 .. ..- attr(*, "conn_id")=<externalptr> $ bankname: chr "genbank" $ banktype: chr "GENBANK" $ totseqs : num 1.65e+08 $ totspecs: num 968157 $ totkeys : num 3.1e+07 $ release : chr " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" $ status :Class 'AsIs' chr "on" $ details : chr [1:4] " **** ACNUC Data Base Content **** " " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers." "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I " > banknameSocket$details [1] " **** ACNUC Data Base Content **** " [2] " GenBank Rel. 189 (15 April 2012) Last Updated: May 1, 2012" [3] "139,677,722,280 bases; 152,280,170 sequences; 12,313,982 subseqs; 684,079 refers." [4] "Software by M. Gouy, Lab. Biometrie et Biologie Evolutive, Universite Lyon I "
  • 19. Make a query: # query (see next slide) all sequences that contain the words "virus" and "dengue" in the taxonomy field and # that are not partial sequences > system.time(query("All_Dengue_viruses_NOTpartial", ""sp=@virus@" AND "sp=@dengue@" AND NOT "k=partial"")) user system elapsed 0.78 0.00 6.72 > All_Dengue_viruses_NOTpartial[1:4] $call query(listname = "All_Dengue_viruses_NOTpartial", query = ""sp=@virus@" AND "sp=@dengue@" AND NOT "k=partial"") $name [1] "All_Dengue_viruses_NOTpartial" $nelem [1] 7741 $typelist [1] "SQ" > All_Dengue_viruses_NOTpartial[[5]][[1]] name length frame ncbicg "A13666" "456" "0" "1" > myseq = getSequence(All_Dengue_viruses_NOTpartial[[5]][[1]]) > myseq[1:20] [1] "a" "t" "g" "g" "c" "c" "a" "t" "g" "g" "a" "c" "c" "t" "t" "g" "g" "t" "g" "a" > closebank()
  • 20. Usage: query(listname, query, socket = autosocket(), invisible = T, verbose = F, virtual = F) Arguments: listname - The name of the list as a quoted string of chars query - A quoted string of chars containing the request with the syntax given in the details section socket - An object of class sockconn connecting to a remote ACNUC database (default is a socket to the last opened database). invisible - if FALSE, the result is returned visibly. verbose - if TRUE, verbose mode is on virtual - if TRUE, no attempt is made to retrieve the information about all the elements of the list. In this case, the req component of the list is set to NA. Value: The result is a list with the following 6 components: call - the original call name - the ACNUC list name nelem - the number of elements (for instance sequences) in the ACNUC list typelist - the type of the elements of the list. Could be SQ for a list of sequence names, KW for a list of keywords, SP for a list of species names. req - a list of sequence names that fit the required criteria or NA when called with parameter virtual is TRUE socket - the socket connection that was used
  • 21. Pau Corral Montañés RUGBCN 03/05/2012