Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

P1 3 2017_python_exercises

Using Regular Expressions

  • Login to see the comments

  • Be the first to like this

P1 3 2017_python_exercises

  1. 1. FBW 24-10-2017 Wim Van Criekinge
  2. 2. Google Calendar Louis Coussement
  3. 3. Recap if condition: statements [elif condition: statements] ... else: statements while condition: statements for var in sequence: statements break continue Strings
  4. 4. Lists • Flexible arrays, not Lisp-like linked lists • a = [99, "bottles of beer", ["on", "the", "wall"]] • Same operators as for strings • a+b, a*3, a[0], a[-1], a[1:], len(a) • Item and slice assignment • a[0] = 98 • a[1:2] = ["bottles", "of", "beer"] -> [98, "bottles", "of", "beer", ["on", "the", "wall"]] • del a[-1] # -> [98, "bottles", "of", "beer"]
  5. 5. Dictionaries • Hash tables, "associative arrays" • d = {"duck": "eend", "water": "water"} • Lookup: • d["duck"] -> "eend" • d["back"] # raises KeyError exception • Delete, insert, overwrite: • del d["water"] # {"duck": "eend", "back": "rug"} • d["back"] = "rug" # {"duck": "eend", "back": "rug"} • d["duck"] = "duik" # {"duck": "duik", "back": "rug"}
  6. 6. if condition: statements [elif condition: statements] ... else: statements while condition: statements for var in sequence: statements break continue Strings REGULAR EXPRESSIONS
  7. 7. Regular Expression Quick Guide ^ Matches the beginning of a line $ Matches the end of the line . Matches any character s Matches whitespace S Matches any non-whitespace character * Repeats a character zero or more times *? Repeats a character zero or more times (non-greedy) + Repeats a chracter one or more times +? Repeats a character one or more times (non-greedy) [aeiou] Matches a single character in the listed set [^XYZ] Matches a single character not in the listed set [a-z0-9] The set of characters can include a range ( Indicates where string extraction is to start ) Indicates where string extraction is to end
  8. 8. Example import re dna = "ATCGCGAATTCAC" if re.search(r"GAATTC", dna): print("restriction site found!")
  9. 9. • Predict the fragment lengths that we will get if we digest the sequence with two restriction enzymes – EcoRI, whose recognition site is G/AATTC alone. • With BamHI, whose recognition site is G/GATCG alone • And when we do a double digest • The forward slashes (/) in the recognition sites represent the place where the enzyme cuts the DNA. • http://rebase.neb.com/rebase/rebase.html Double digest
  10. 10. Double digest: DNA • ATGGCAGGATCGATAACCCCCCGTTTCTACTTCTAGAGGAGAAAAGTATTGACATGAGCGCTCCC GGCACAAGGGCCAAAGAAGTCTCCAATTTCTTATTTCCGAATGACATGCGTCTCCTTGCGGGTAAA TCACCGACCGCAATTCATAGAAGCCTGGGGGAACAGATAGGAATTCGTCTAATTAGCTTAAGAGA GTAAATCCTGGGATCATTCAGTAGTAACCATAAACTTACGCTGGGGCTTCTTCGGCGGATTTTTAC AGTTACCAACCAGGAGATTTGAAGTAAATCAGTTGAGGATTTAGCCGCGCTATCCGGTAATCTCCA AATTAAAACATACCGTTCCATGAAGGGATCGGCTAGAATTACTTACCGGCCTTTTCCATGCCTGCG CTATACCCCCCCACTCTCCCGCTTATCCGTCCGAGCGGAGGCAGTGCGATCCTCCGTTAAGATATT CTTACGTGTGACGTAGCTATGTATTTTGCAGAGCTGGCGAACGCGTTGAACACTTCACAGATGGTA GGGATTCGGGTAAAGGGCGTATAATTGAATTCGGGGACTAACATAGGCGTAGACTACGATGGCGC CAACTCAATCGCAGCTCGAGCGCCCTGAATAACGTACTCATCTCAACTCATTCTCGGCAATCTACC GAGCGACTCGATTATCAACGGCTGTCTAGCAGTTCTAATCTTTTGCCAGCATCGTAATAGCCTCCA AGAGATTGATGATAGCTATCGGCACAGAACTGAGACGGCGCCGATGGATAGCGGACTTTCGGTCA ACCACAATTCCCCACGGGACAGGTCCTGCGGTGCGCATCACTCTGAATGTACAAGCAACCCAAGT GGGCCGAGCCTGGACTCAGCTGGTTCGAATTCCTGCGTGAGCTCGAGACTCGGGATGACAGCTCTT TAAACATAGAGCGGGGGCGTCGAACGGTCGAGAAAGTCATAGTACCTCGGGTACCAACTTACTCA GGTTATTGCTTGAAGCTGTACTATTTTAGGGGGGGAGCGCTGAAGGTCTCTTCTTCTCATGACTGA ACTCGCGAGGGTCGTGAAGTCGGTTCCTTCAATGGTTAAAAAACAAAGGATCGGGCTTACTGTGC GCAGAGGAACGCCCATCTAGCGGCTGGCGTCTTGAATGCTCGGTCCCCTTTGTCATTCCGGATTAA TCCATTTCCCTCATTCACGAGCTTGCGAAGTCTACATTGGTATATGAATGCGACCTAGAAGAGGGC GCTTAAAATTGGCAGTGGTTGATGCTCTAAACTCCATTTGGTTTACTCGTGCATCACCGCGATAGG CTGACAAAGGTTTAACATTGAATAGCAAGGCACTTCCGGTCTCAATGAACGGCCGGGAAAGGTAC GCGCGCGGTATGGGAGGATCAAGGGGCCAATAGAGAGGCTCCTCTCTCACTCGCTAGGAGGCAAA TGTGAATTCAAAACAATGGTTACTGCATCGATACATAAAACATGTCCATCGGTTGCCCAAAGTGTT AAGTGTCTATCACCCCTAGGGCCGTTTCCCGCATATAAACGCCAGGTTGTATCCGCATTTGATGCT ACCGTGGATGAGTCTGCGTCGAGCGCGCCGCACGAATGTTGCAATGTATTGCATGAGTAGGGTTG ACTAAGAGCCGTTAGATGCGTCGCTGTACTAATAGTTGTCGACAGACCGTCGAGATTAGAAAATG GTACCAGCATTTTCGGAGGTTCTCTAACTAGTATGGATTGCGGTGTCTTCACTGTGCTGCGGCTACC CATCGCCTGAAATCCAGCTGGTGTCAAGCCATCCCCTCTCCGGGACGCCGCATGTAGTGAAACATA TACGTTGCACGGGTTCACCGCGGTCCGTTCTGAGTCGACCAAGGACACAATCGAGCTCCGATCCGT ACCCTCGACAAACTTGTACCCGACCCCCGGAGCTTGCCAGCTCCTCGGGTATCATGGAGCCTGTGG TTCATCGCGTCCGATATCAAACTTCGTCATGATAAAGTCCCCCCCTCGGGAGTACCAGAGAAGATG ACTACTGAGTTGTGCGAT GGATCG
  11. 11. Towards a protein prosite scanner N-{P}-[ST]-{P}. [RK](2)-x-[ST]. [ST]-x-[RK]. [ST]-x(2)-[DE]. [RK]-x(2,3)-[DE]-x(2,3)-Y. G-{EDRKHPFYW}-x(2)-[STAGCN]-{P}. x-G-[RK]-[RK]. C-x-[DN]-x(4)-[FY]-x-C-x-C. E-x(2)-[ERK]-E-x-C-x(6)-[EDR]-x(10,11)-[FYA]-[YW]. [DEQGSTALMKRH]-[LIVMFYSTAC]-[GNQ]-[LIVMFYAG]-[DNEKHS]-S-[LIVMST]-{PCFY}-[STAGCPQLIVMF]-[LIVMATN]-[DENQGTAKRHLM]-[LIVMWSTA]- [LIVGSTACR]-{LPIY}-{VY}-[LIVMFA]. [KRHQSA]-[DENQ]-E-L>. R-G-D. [AG]-x(4)-G-K-[ST]. D-{W}-[DNS]-{ILVFYW}-[DENSTG]-[DNQGHRK]-{GP}-[LIVMC]-[DENQSTAGC]-x(2)-[DE]-[LIVMFYW]. [EQ]-{LNYH}-x-[ATV]-[FY]-{LDAM}-{T}-W-{PG}-N. [LIVM]-x-[SGNL]-[LIVMN]-[DAGHENRS]-[SAGPNVT]-x-[DNEAG]-[LIVM]-x-[DEAGQ]-x(4)-[LIVM]-x-[LM]-[SAG]-[LIVM]-[LIVMT]-[WS]-x(0,1)-[LIVM](2). [FY]-C-[RH]-[NS]-x(7,8)-[WY]-C. C-x-C-x(2)-{V}-x(2)-G-{C}-x-C. C-x(2)-P-F-x-[FYWIV]-x(7)-C-x(8,10)-W-C-x(4)-[DNSR]-[FYW]-x(3,5)-[FYW]-x-[FYWI]-C. [LIFAT]-{IL}-x(2)-W-x(2,3)-[PE]-x-{VF}-[LIVMFY]-[DENQS]-[STA]-[AV]-[LIVMFY]. [KRH]-x(2)-C-x-[FYPSTV]-x(3,4)-[ST]-x(3)-C-x(4)-C-C-[FYWH]. C-x(4,5)-C-C-S-x(2)-G-x-C-G-x(3,4)-[FYW]-C. [LIVMFYG]-[ASLVR]-x(2)-[LIVMSTACN]-x-[LIVM]-{Y}-x(2)-{L}-[LIV]-[RKNQESTAIY]-[LIVFSTNKH]-W-[FYVC]-x-[NDQTAH]-x(5)-[RKNAIMW]. C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H. L-x(6)-L-x(6)-L-x(6)-L. C-x(2)-C-x(1,2)-[DENAVSPHKQT]-x(5,6)-[HNY]-[FY]-x(4)-C-x(2)-C-x(2)-F(2)-x-R. [LIVMFE]-[FY]-P-W-M-[KRQTA]. L-M-A-[EQ]-G-L-Y-N. IRED_1R-P-C-x(11)-C-V-S. [RKQ]-R-[LIM]-x-[LF]-G-[LIVMFY]-x-Q-x-[DNQ]-V-G. [KR]-x(1,3)-[RKSAQ]-N-{VL}-x-[SAQ](2)-{L}-[RKTAENQ]-x-R-{S}-[RK]. [LIVMF](2)-D-E-A-D-[RKEN]-x-[LIVMFYGSTN]. [KRQ]-[LIVMA]-x(2)-[GSTALIV]-{FYWPGDN}-x(2)-[LIVMSA]-x(4,9)-[LIVMF]-x-{PLH}-[LIVMSTA]-[GSTACIL]-{GPK}-{F}-x-[GANQRF]-[LIVMFY]-x(4,5)-[LFY]-x(3)- [FYIVA]-{FYWHCM}-{PGVI}-x(2)-[GSADENQKR]-x-[NSTAPKL]-[PARL]. Scan for the following prosite patterns in your 4 sequences Hint: translate the patters to regexes and then scan
  12. 12. reuse galacto.py in github Consensus_pattern="G-R-x-N-[LIV]-I-G-[DE]-H-x-D-Y" pattern=Consensus_pattern.replace("-","") pattern=pattern.replace("x","[A-Z]") #print(pattern) count=0 for s in sequences: count=count+1 print ("searching seq",count) s=s.replace(" ","") matches = re.finditer(pattern,s) for match in matches: print (match.group(0),"from: ",match.start(),"to: ",match.end())
  13. 13. >SEQ1 MGNLFENCTHRYSFEYIYENCTNTTNQCGLIRNVASSIDVFHWLDVYISTTIFVISGILNFYCLFIALYT YYFLDNETRKHYVFVLSRFLSSILVIISLLVLESTLFSESLSPTFAYYAVAFSIYDFSMDTLFFSYIMIS LITYFGVVHYNFYRRHVSLRSLYIILISMWTFSLAIAIPLGLYEAASNSQGPIKCDLSYCGKVVEWITCS LQGCDSFYNANELLVQSIISSVETLVGSLVFLTDPLINIFFDKNISKMVKLQLTLGKWFIALYRFLFQMT NIFENCSTHYSFEKNLQKCVNASNPCQLLQKMNTAHSLMIWMGFYIPSAMCFLAVLVDTYCLLVTISILK SLKKQSRKQYIFGRANIIGEHNDYVVVRLSAAILIALCIIIIQSTYFIDIPFRDTFAFFAVLFIIYDFSILSLLGSFTGVA M MTYFGVMRPLVYRDKFTLKTIYIIAFAIVLFSVCVAIPFGLFQAADEIDGPIKCDSESCELIVKWLLFCI ACLILMGCTGTLLFVTVSLHWHSYKSKKMGNVSSSAFNHGKSRLTWTTTILVILCCVELIPTGLLAAFGK SESISDDCYDFYNANSLIFPAIVSSLETFLGSITFLLDPIINFSFDKRISKVFSSQVSMFSIFFCGKR >SEQ2 MLDDRARMEA AKKEKVEQIL AEFQLQEEDL KKVMRRMQKE MDRGLRLETH EEASVKMLPT YVRSTPEGSE VGDFLSLDLG GTNFRVMLVK VGEGEEGQWS VKTKHQMYSI PEDAMTGTAE MLFDYISECI SDFLDKHQMK HKKLPLGFTF SFPVRHEDID KGILLNWTKG FKASGAEGNN VVGLLRDAIK RRGDFEMDVV AMVNDTVATM ISCYYEDHQC EVGMIVGTGC NACYMEEMQN VELVEGDEGR MCVNTEWGAF GDSGELDEFL LEYDRLVDES SANPGQQLYE KLIGGKYMGE LVRLVLLRLV DENLLFHGEA SEQLRTRGAF ETRFVSQVES DTGDRKQIYN ILSTLGLRPS TTDCDIVRRA CESVSTRAAH MCSAGLAGVI NRMRESRSED VMRITVGVDG SVYKLHPSFK ERFHASVRRL TPSCEITFIE SEEGSGRGAA LVSAVACKKA CMLGQ >SEQ3 MESDSFEDFLKGEDFSNYSYSSDLPPFLLDAAPCEPESLEINKYFVVIIYVLVFLLSLLGNSLVMLVILY SRVGRSGRDNVIGDHVDYVTDVYLLNLALADLLFALTLPIWAASKVTGWIFGTFLCKVVSLLKEVNFYSGILLLA CISVDRY LAIVHATRTLTQKRYLVKFICLSIWGLSLLLALPVLIFRKTIYPPYVSPVCYEDMGNNTANWRMLLRILP QSFGFIVPLLIMLFCYGFTLRTLFKAHMGQKHRAMRVIFAVVLIFLLCWLPYNLVLLADTLMRTWVIQET CERRNDIDRALEATEILGILGRVNLIGEHWDYHSCLNPLIYAFIGQKFRHGLLKILAIHGLISKDSLPKDSRPSFVGS SSGH TSTTL >SEQ4 MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA MEANFQQAVK KLVNDFEYPT ESLREAVKEF DELRQKGLQK NGEVLAMAPA FISTLPTGAE TGDFLALDFG GTNLRVCWIQ LLGDGKYEMK HSKSVLPREC VRNESVKPII DFMSDHVELF IKEHFPSKFG CPEEEYLPMG FTFSYPANQV SITESYLLRW TKGLNIPEAI NKDFAQFLTE GFKARNLPIR IEAVINDTVG TLVTRAYTSK ESDTFMGIIF GTGTNGAYVE QMNQIPKLAG KCTGDHMLIN MEWGATDFSC LHSTRYDLLL DHDTPNAGRQ IFEKRVGGMY LGELFRRALF HLIKVYNFNE GIFPPSITDA WSLETSVLSR MMVERSAENV RNVLSTFKFR FRSDEEALYL WDAAHAIGRR AARMSAVPIA SLYLSTGRAG KKSDVGVDGS LVEHYPHFVD MLREALRELI GDNEKLISIG IAKDGSGIGA ALCALQAVKE KKGLA sequences

×