Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Biopython     Karin Lagesenkarin.lagesen@bio.uio.no
ConcatFasta.pyCreate a script that has the following:  function get_fastafiles(dirname)     gets all the files in the dire...
Object oriented programmingBiopython is object-orientedSome knowledge helps understand how biopython worksOOP is a way of ...
Classes and objectsA class:  is a user defined type  is a mold for creating objects  specifies how an object can contain a...
Attributes and methodsClasses specify two things:  attributes – data holders  methods – functions for this classAttributes...
Class and object exampleClass: MySeqMySeq has:   attribute length   method translateAn object of the class MySeq is create...
SummaryAn object has to be instantiated, i.e. created, to existEvery object has a certain type, i.e. is of a certain class...
BiopythonPackage that assists with processing biological dataConsists of several modules – some with common operations, so...
Working with sequencesBiopython has many ways of working with  sequence dataComponents for today:  Alphabet  Seq  SeqRecor...
Class AlphabetEvery sequence needs an alphabetCCTTGGCC – DNA or protein?Biopython contains several alphabets  DNA  RNA  Pr...
Alphabet exampleGo to freebeeDo module load python (necessary to find biopython  modules) – start python  >>> import Bio.A...
Packages, modules and              classesWhat happens here?>>> from Bio.Alphabet import IUPAC   >>> IUPAC.IUPACProtein.le...
SeqRepresents one sequence with its alphabetMethods:  translate()  transcribe()  complement()  reverse_complement()  ...
Using Seq>>> from Bio.Seq import Seq>>> import Bio.Alphabet       Create object>>> seq = Seq("CCGGGTT", Bio.Alphabet.IUPAC...
Seq as a stringMost string methods work on SeqsIf string is needed, do str(seq)>>> seq = Seq(CCGGGTTAACGTA,Bio.Alphabet.IU...
MutableSeqSeqs are immutable as strings areIf mutable string is needed, convert to MutableSeqAllows in-place changes  >>> ...
SeqRecordSeq contains the sequence and alphabetBut sequences often come with a lot moreSeqRecord = Seq + metadataMain attr...
SeqRecord attributesFrom the biopython webpages:Main attributes:id - Identifier such as a locus tag (string)seq - The sequ...
SeqRecords in practice...>>> from Bio.SeqRecord import SeqRecord>>> from Bio.Seq import Seq>>> from Bio.Alphabet import DN...
SeqIOHow to get sequences in and out of filesRetrieves sequences as SeqRecords, can write SeqRecords to filesReading:  par...
SeqIO formatsList: http://biopython.org/wiki/SeqIOSome examples:  fasta  genbank  several fastq-formats  aceNote: a format...
Reading a file        from Bio import SeqIO        handle = open("example.fasta", "r")        for record in SeqIO.parse(ha...
ExerciseUse mb.gbk, found in Karins folderUse the SeqIO methods to   read in the file   print the id of each of the record...
SeqRecords lists and           dictionariesTo get everything as a list:  handle = open("example.fasta", "r")     records =...
Writing files                                      sequences are here a           from Bio import SeqIO      list of SeqRe...
seq_length.pyWrite script that reads a file containing genbank sequences and writes out name and sequence lengthShould hav...
ModificationsFigure out how to:  print the description of each genbank entry  which annotations each entry has  print the ...
tag_fasta.pyCreate script that takes a file containing fasta  sequences, adds a tag at the front of the name and  writes i...
Optional homework           convert.pyCreate a script that:  takes input filename, input file type, output    filename and...
Upcoming SlideShare
Loading in …5
×

Biopython

1,445 views

Published on

Day 5 of an introductory python course for bio

Published in: Technology
  • Be the first to comment

  • Be the first to like this

Biopython

  1. 1. Biopython Karin Lagesenkarin.lagesen@bio.uio.no
  2. 2. ConcatFasta.pyCreate a script that has the following: function get_fastafiles(dirname) gets all the files in the directory, checks if they are fasta files (end in .fsa), returns list of fasta files hint: you need os.path to create full relative file names function concat_fastafiles(filelist, outfile) takes a list of fasta files, opens and reads each of them, writes them to outfile if __name__ == “__main__”: do what needs to be done to run scriptRemember imports!
  3. 3. Object oriented programmingBiopython is object-orientedSome knowledge helps understand how biopython worksOOP is a way of organizing data and methods that work on them in a coherent packageOOP helps structure and organize the code
  4. 4. Classes and objectsA class: is a user defined type is a mold for creating objects specifies how an object can contain and process data represents an abstraction or a template for how an object of that class will behaveAn object is an instance of a classAll objects have a type – shows which class they were made from
  5. 5. Attributes and methodsClasses specify two things: attributes – data holders methods – functions for this classAttributes are variables that will contain the data that each object will haveMethods are functions that an object of that class will be able to perform
  6. 6. Class and object exampleClass: MySeqMySeq has: attribute length method translateAn object of the class MySeq is created like this: myseq = MySeq(“ATGGCCG”)Get sequence length: myseq.lengthGet translation: myseq.translate()
  7. 7. SummaryAn object has to be instantiated, i.e. created, to existEvery object has a certain type, i.e. is of a certain classThe class decides which attributes and methods an object hasAttributes and methods are accessed using . after the object variable name
  8. 8. BiopythonPackage that assists with processing biological dataConsists of several modules – some with common operations, some more specializedWebsite: biopython.org
  9. 9. Working with sequencesBiopython has many ways of working with sequence dataComponents for today: Alphabet Seq SeqRecord SeqIOOther useful classes for working with alignments, blast searches and results etc are also available, not covered today
  10. 10. Class AlphabetEvery sequence needs an alphabetCCTTGGCC – DNA or protein?Biopython contains several alphabets DNA RNA Protein the three above with IUPAC codes ...and othersCan all be found in Bio.Alphabet package
  11. 11. Alphabet exampleGo to freebeeDo module load python (necessary to find biopython modules) – start python >>> import Bio.Alphabet NOTE: have to import >>> Bio.Alphabet.ThreeLetterProtein.letters Alphabets to use them [Ala, Asx, Cys, Asp, Glu, Phe, Gly, His, Ile,  Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr,  Sec, Val, Trp, Xaa, Tyr, Glx] >>> from Bio.Alphabet import IUPAC >>> IUPAC.IUPACProtein.letters ACDEFGHIKLMNPQRSTVWY >>> IUPAC.unambiguous_dna.letters GATC >>> 
  12. 12. Packages, modules and classesWhat happens here?>>> from Bio.Alphabet import IUPAC >>> IUPAC.IUPACProtein.lettersBio and Alphabet are packages packages contain modulesIUPAC is a module a module is a file with python codeIUPAC module contains class IUPACProtein and other classes specifying alphabetsIUPACProtein has attribute letters
  13. 13. SeqRepresents one sequence with its alphabetMethods: translate() transcribe() complement() reverse_complement() ...
  14. 14. Using Seq>>> from Bio.Seq import Seq>>> import Bio.Alphabet Create object>>> seq = Seq("CCGGGTT", Bio.Alphabet.IUPAC.unambiguous_dna)>>> seqSeq(CCGGGTT, IUPACUnambiguousDNA())>>> seq.transcribe()Seq(CCGGGUU, IUPACUnambiguousRNA()) Use methods>>> seq.translate()Seq(PG, IUPACProtein())>>> seq = Seq("CCGGGUU", Bio.Alphabet.IUPAC.unambiguous_rna)>>> seq.transcribe()Traceback (most recent call last):  File "<stdin>", line 1, in <module> New object, different alphabet  File "/site/VERSIONS/python­2.6.2/lib/python2.6/site­packages/Bio/Seq.py", line 830, in transcribe    raise ValueError("RNA cannot be transcribed!")ValueError: RNA cannot be transcribed!>>> seq.translate()Seq(PG, IUPACProtein())>>>  Alphabet dictates which methods make sense
  15. 15. Seq as a stringMost string methods work on SeqsIf string is needed, do str(seq)>>> seq = Seq(CCGGGTTAACGTA,Bio.Alphabet.IUPAC.unambiguous_dna)>>> seq[:5]Seq(CCGGG, IUPACUnambiguousDNA())>>> len(seq)13>>> seq.lower()Seq(ccgggttaacgta, DNAAlphabet())>>> print seqCCGGGTTAACGTA>>> list(seq)[C, C, G, G, G, T, T, A, A, C, G, T, A]>>> mystring = str(seq)>>> print mystringCCGGGTTAACGTA>>> type(seq)<class Bio.Seq.Seq> How to check what class>>> type(mystring) or type an object is from<type str>>>> 
  16. 16. MutableSeqSeqs are immutable as strings areIf mutable string is needed, convert to MutableSeqAllows in-place changes >>> seq[0] = T Traceback (most recent call last):   File "<stdin>", line 1, in <module> TypeError: Seq object does not support item assignment >>> mut_seq = seq.tomutable() >>> seq Seq(CCGGGTTAACGTA, IUPACUnambiguousDNA()) >>> seq[0] = T Traceback (most recent call last):   File "<stdin>", line 1, in <module> TypeError: Seq object does not support item assignment >>> mut_seq = seq.tomutable() >>> mut_seq[0] = T >>> mut_seq MutableSeq(TCGGGTTAACGTA, IUPACUnambiguousDNA()) >>> mut_seq.complement() >>> mut_seq MutableSeq(AGCCCAATTGCAT, IUPACUnambiguousDNA()) >>>  Notice: object is changed!
  17. 17. SeqRecordSeq contains the sequence and alphabetBut sequences often come with a lot moreSeqRecord = Seq + metadataMain attributes: id – name or identifier seq – seq object containing the sequence>>> seq Existing sequenceSeq(CCGGGTTAACGTA, IUPACUnambiguousDNA()) SeqRecord is a class>>> from Bio.SeqRecord import SeqRecord found inside the>>> seqRecord = SeqRecord(seq, id=001)>>> seqRecord Bio.SeqRecord moduleSeqRecord(seq=Seq(CCGGGTTAACGTA, IUPACUnambiguousDNA()), id=001, name=<unknown name>, description=<unknown description>, dbxrefs=[])>>> 
  18. 18. SeqRecord attributesFrom the biopython webpages:Main attributes:id - Identifier such as a locus tag (string)seq - The sequence itself (Seq object or similar)Additional attributes:name - Sequence name, e.g. gene name (string)description - Additional text (string)dbxrefs - List of database cross references (list of strings)features - Any (sub)features defined (list of SeqFeature objects)annotations - Further information about the whole sequence (dictionary) Most entries are strings, or lists of strings.letter_annotations - Per letter/symbol annotation (restricted dictionary). This holds Python sequences (lists, strings or tuples) whose length matches that of the sequence. A typical use would be to hold a list of integers representing sequencing quality scores, or a string representing the secondary structure.
  19. 19. SeqRecords in practice...>>> from Bio.SeqRecord import SeqRecord>>> from Bio.Seq import Seq>>> from Bio.Alphabet import DNAAlphabet>>> seqRecord = SeqRecord(Seq(GCAGCCTCAAACCCCAGCTG, … DNAAlphabet), id = NM_005368.2, name = NM_005368, … description = Myoglobin var 1,… dbxrefs = [GeneID:4151, HGNC:6915])>>> seqRecord.annotations[note] = Information goes here>>> seqRecordSeqRecord(seq=Seq(GCAGCCTCAAACCCCAGCTG, <class Bio.Alphabet.DNAAlphabet>), id=NM_005368.2, name=NM_005368, description=Myoglobin var 1, dbxrefs=[GeneID:4151, HGNC:6915])>>> seqRecord.annotations{note: Information goes here}>>> 
  20. 20. SeqIOHow to get sequences in and out of filesRetrieves sequences as SeqRecords, can write SeqRecords to filesReading: parse(filehandle, format) returns a generator that gives SeqRecordsWriting: write(SeqRecord(s), filehandle, format) NOTE: examples in this section from http://biopython.org/wiki/SeqIO
  21. 21. SeqIO formatsList: http://biopython.org/wiki/SeqIOSome examples: fasta genbank several fastq-formats aceNote: a format might be readable but not writable depending on biopython version
  22. 22. Reading a file from Bio import SeqIO handle = open("example.fasta", "r") for record in SeqIO.parse(handle,"fasta") :     print record.id handle.close()SeqIO.parse returns a SeqRecord iteratorAn iterator will give you the next element the next time it is called – compare to readline()Useful because if a file contains many records, we avoid putting all into memory all at once
  23. 23. ExerciseUse mb.gbk, found in Karins folderUse the SeqIO methods to read in the file print the id of each of the records print the first 10 nucleotide of each record >>> from Bio import SeqIO >>> fh = open("mb.gbk", "r") >>> for record in SeqIO.parse(fh, "genbank"): ...     print record.id ...     print record.seq[:10] ...  NM_005368.2 GCAGCCTCAA XM_001081975.2 CCTCTCCCCA NM_001164047.1 TAGCTGCCCA >>> 
  24. 24. SeqRecords lists and dictionariesTo get everything as a list: handle = open("example.fasta", "r") records = list(SeqIO.parse(handle, "fasta")) handle.close()To get everything as a dictionary: handle = open("example.fasta", "r") record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta")) handle.close()But: avoid if at all possible
  25. 25. Writing files sequences are here a from Bio import SeqIO list of SeqRecords sequences = ... # add code here output_handle = open("example.fasta", "w") SeqIO.write(sequences, output_handle, "fasta") output_handle.close()Note: sequences is here a listCan write any iterable containing SeqRecords to a fileCan also write a single sequence
  26. 26. seq_length.pyWrite script that reads a file containing genbank sequences and writes out name and sequence lengthShould have Function sequence_length(inputfile) Open file Per seqRecord in input: Print name, length of sequence Close file If __name__ == “__main__”: Get input from command line: inputfile
  27. 27. ModificationsFigure out how to: print the description of each genbank entry which annotations each entry has print the taxonomy for each entryDescription: seqRecord.descriptionAnnotations: seqRecord.annotations.keys()Taxonomy: seqRecord.annotations[taxonomy]
  28. 28. tag_fasta.pyCreate script that takes a file containing fasta sequences, adds a tag at the front of the name and writes it out to a new fileShould have Function change_name(seqRecord, tag) Change name, return seqRecord Function read_write_fasta(tag, input, output) Per seqRecord in input: Change name Write to output If __name__ == “__main__”: Get input from command line: Tag, input file, output file
  29. 29. Optional homework convert.pyCreate a script that: takes input filename, input file type, output filename and output file type converts input file to output file type and writes it to output file

×