Published on

Day 5 of an introductory python course for bio

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Show webpage for alphabets. Each alphabet is a separate class
  • Biopython

    1. 1. Biopython Karin
    2. 2. ConcatFasta.pyCreate a script that has the following: function get_fastafiles(dirname) gets all the files in the directory, checks if they are fasta files (end in .fsa), returns list of fasta files hint: you need os.path to create full relative file names function concat_fastafiles(filelist, outfile) takes a list of fasta files, opens and reads each of them, writes them to outfile if __name__ == “__main__”: do what needs to be done to run scriptRemember imports!
    3. 3. Object oriented programmingBiopython is object-orientedSome knowledge helps understand how biopython worksOOP is a way of organizing data and methods that work on them in a coherent packageOOP helps structure and organize the code
    4. 4. Classes and objectsA class: is a user defined type is a mold for creating objects specifies how an object can contain and process data represents an abstraction or a template for how an object of that class will behaveAn object is an instance of a classAll objects have a type – shows which class they were made from
    5. 5. Attributes and methodsClasses specify two things: attributes – data holders methods – functions for this classAttributes are variables that will contain the data that each object will haveMethods are functions that an object of that class will be able to perform
    6. 6. Class and object exampleClass: MySeqMySeq has: attribute length method translateAn object of the class MySeq is created like this: myseq = MySeq(“ATGGCCG”)Get sequence length: myseq.lengthGet translation: myseq.translate()
    7. 7. SummaryAn object has to be instantiated, i.e. created, to existEvery object has a certain type, i.e. is of a certain classThe class decides which attributes and methods an object hasAttributes and methods are accessed using . after the object variable name
    8. 8. BiopythonPackage that assists with processing biological dataConsists of several modules – some with common operations, some more specializedWebsite:
    9. 9. Working with sequencesBiopython has many ways of working with sequence dataComponents for today: Alphabet Seq SeqRecord SeqIOOther useful classes for working with alignments, blast searches and results etc are also available, not covered today
    10. 10. Class AlphabetEvery sequence needs an alphabetCCTTGGCC – DNA or protein?Biopython contains several alphabets DNA RNA Protein the three above with IUPAC codes ...and othersCan all be found in Bio.Alphabet package
    11. 11. Alphabet exampleGo to freebeeDo module load python (necessary to find biopython modules) – start python >>> import Bio.Alphabet NOTE: have to import >>> Bio.Alphabet.ThreeLetterProtein.letters Alphabets to use them [Ala, Asx, Cys, Asp, Glu, Phe, Gly, His, Ile,  Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr,  Sec, Val, Trp, Xaa, Tyr, Glx] >>> from Bio.Alphabet import IUPAC >>> IUPAC.IUPACProtein.letters ACDEFGHIKLMNPQRSTVWY >>> IUPAC.unambiguous_dna.letters GATC >>> 
    12. 12. Packages, modules and classesWhat happens here?>>> from Bio.Alphabet import IUPAC >>> IUPAC.IUPACProtein.lettersBio and Alphabet are packages packages contain modulesIUPAC is a module a module is a file with python codeIUPAC module contains class IUPACProtein and other classes specifying alphabetsIUPACProtein has attribute letters
    13. 13. SeqRepresents one sequence with its alphabetMethods: translate() transcribe() complement() reverse_complement() ...
    14. 14. Using Seq>>> from Bio.Seq import Seq>>> import Bio.Alphabet Create object>>> seq = Seq("CCGGGTT", Bio.Alphabet.IUPAC.unambiguous_dna)>>> seqSeq(CCGGGTT, IUPACUnambiguousDNA())>>> seq.transcribe()Seq(CCGGGUU, IUPACUnambiguousRNA()) Use methods>>> seq.translate()Seq(PG, IUPACProtein())>>> seq = Seq("CCGGGUU", Bio.Alphabet.IUPAC.unambiguous_rna)>>> seq.transcribe()Traceback (most recent call last):  File "<stdin>", line 1, in <module> New object, different alphabet  File "/site/VERSIONS/python­2.6.2/lib/python2.6/site­packages/Bio/", line 830, in transcribe    raise ValueError("RNA cannot be transcribed!")ValueError: RNA cannot be transcribed!>>> seq.translate()Seq(PG, IUPACProtein())>>>  Alphabet dictates which methods make sense
    15. 15. Seq as a stringMost string methods work on SeqsIf string is needed, do str(seq)>>> seq = Seq(CCGGGTTAACGTA,Bio.Alphabet.IUPAC.unambiguous_dna)>>> seq[:5]Seq(CCGGG, IUPACUnambiguousDNA())>>> len(seq)13>>> seq.lower()Seq(ccgggttaacgta, DNAAlphabet())>>> print seqCCGGGTTAACGTA>>> list(seq)[C, C, G, G, G, T, T, A, A, C, G, T, A]>>> mystring = str(seq)>>> print mystringCCGGGTTAACGTA>>> type(seq)<class Bio.Seq.Seq> How to check what class>>> type(mystring) or type an object is from<type str>>>> 
    16. 16. MutableSeqSeqs are immutable as strings areIf mutable string is needed, convert to MutableSeqAllows in-place changes >>> seq[0] = T Traceback (most recent call last):   File "<stdin>", line 1, in <module> TypeError: Seq object does not support item assignment >>> mut_seq = seq.tomutable() >>> seq Seq(CCGGGTTAACGTA, IUPACUnambiguousDNA()) >>> seq[0] = T Traceback (most recent call last):   File "<stdin>", line 1, in <module> TypeError: Seq object does not support item assignment >>> mut_seq = seq.tomutable() >>> mut_seq[0] = T >>> mut_seq MutableSeq(TCGGGTTAACGTA, IUPACUnambiguousDNA()) >>> mut_seq.complement() >>> mut_seq MutableSeq(AGCCCAATTGCAT, IUPACUnambiguousDNA()) >>>  Notice: object is changed!
    17. 17. SeqRecordSeq contains the sequence and alphabetBut sequences often come with a lot moreSeqRecord = Seq + metadataMain attributes: id – name or identifier seq – seq object containing the sequence>>> seq Existing sequenceSeq(CCGGGTTAACGTA, IUPACUnambiguousDNA()) SeqRecord is a class>>> from Bio.SeqRecord import SeqRecord found inside the>>> seqRecord = SeqRecord(seq, id=001)>>> seqRecord Bio.SeqRecord moduleSeqRecord(seq=Seq(CCGGGTTAACGTA, IUPACUnambiguousDNA()), id=001, name=<unknown name>, description=<unknown description>, dbxrefs=[])>>> 
    18. 18. SeqRecord attributesFrom the biopython webpages:Main attributes:id - Identifier such as a locus tag (string)seq - The sequence itself (Seq object or similar)Additional attributes:name - Sequence name, e.g. gene name (string)description - Additional text (string)dbxrefs - List of database cross references (list of strings)features - Any (sub)features defined (list of SeqFeature objects)annotations - Further information about the whole sequence (dictionary) Most entries are strings, or lists of strings.letter_annotations - Per letter/symbol annotation (restricted dictionary). This holds Python sequences (lists, strings or tuples) whose length matches that of the sequence. A typical use would be to hold a list of integers representing sequencing quality scores, or a string representing the secondary structure.
    19. 19. SeqRecords in practice...>>> from Bio.SeqRecord import SeqRecord>>> from Bio.Seq import Seq>>> from Bio.Alphabet import DNAAlphabet>>> seqRecord = SeqRecord(Seq(GCAGCCTCAAACCCCAGCTG, … DNAAlphabet), id = NM_005368.2, name = NM_005368, … description = Myoglobin var 1,… dbxrefs = [GeneID:4151, HGNC:6915])>>> seqRecord.annotations[note] = Information goes here>>> seqRecordSeqRecord(seq=Seq(GCAGCCTCAAACCCCAGCTG, <class Bio.Alphabet.DNAAlphabet>), id=NM_005368.2, name=NM_005368, description=Myoglobin var 1, dbxrefs=[GeneID:4151, HGNC:6915])>>> seqRecord.annotations{note: Information goes here}>>> 
    20. 20. SeqIOHow to get sequences in and out of filesRetrieves sequences as SeqRecords, can write SeqRecords to filesReading: parse(filehandle, format) returns a generator that gives SeqRecordsWriting: write(SeqRecord(s), filehandle, format) NOTE: examples in this section from
    21. 21. SeqIO formatsList: examples: fasta genbank several fastq-formats aceNote: a format might be readable but not writable depending on biopython version
    22. 22. Reading a file from Bio import SeqIO handle = open("example.fasta", "r") for record in SeqIO.parse(handle,"fasta") :     print handle.close()SeqIO.parse returns a SeqRecord iteratorAn iterator will give you the next element the next time it is called – compare to readline()Useful because if a file contains many records, we avoid putting all into memory all at once
    23. 23. ExerciseUse mb.gbk, found in Karins folderUse the SeqIO methods to read in the file print the id of each of the records print the first 10 nucleotide of each record >>> from Bio import SeqIO >>> fh = open("mb.gbk", "r") >>> for record in SeqIO.parse(fh, "genbank"): ...     print ...     print record.seq[:10] ...  NM_005368.2 GCAGCCTCAA XM_001081975.2 CCTCTCCCCA NM_001164047.1 TAGCTGCCCA >>> 
    24. 24. SeqRecords lists and dictionariesTo get everything as a list: handle = open("example.fasta", "r") records = list(SeqIO.parse(handle, "fasta")) handle.close()To get everything as a dictionary: handle = open("example.fasta", "r") record_dict = SeqIO.to_dict(SeqIO.parse(handle, "fasta")) handle.close()But: avoid if at all possible
    25. 25. Writing files sequences are here a from Bio import SeqIO list of SeqRecords sequences = ... # add code here output_handle = open("example.fasta", "w") SeqIO.write(sequences, output_handle, "fasta") output_handle.close()Note: sequences is here a listCan write any iterable containing SeqRecords to a fileCan also write a single sequence
    26. 26. seq_length.pyWrite script that reads a file containing genbank sequences and writes out name and sequence lengthShould have Function sequence_length(inputfile) Open file Per seqRecord in input: Print name, length of sequence Close file If __name__ == “__main__”: Get input from command line: inputfile
    27. 27. ModificationsFigure out how to: print the description of each genbank entry which annotations each entry has print the taxonomy for each entryDescription: seqRecord.descriptionAnnotations: seqRecord.annotations.keys()Taxonomy: seqRecord.annotations[taxonomy]
    28. 28. tag_fasta.pyCreate script that takes a file containing fasta sequences, adds a tag at the front of the name and writes it out to a new fileShould have Function change_name(seqRecord, tag) Change name, return seqRecord Function read_write_fasta(tag, input, output) Per seqRecord in input: Change name Write to output If __name__ == “__main__”: Get input from command line: Tag, input file, output file
    29. 29. Optional homework convert.pyCreate a script that: takes input filename, input file type, output filename and output file type converts input file to output file type and writes it to output file